Jonathan Lewis

Just another Oracle weblog

Temp space

Fri, 2019-12-06 06:18

A question about hunting down the source of the error “ORA-01652 unable to extend temp segment by NNN in tablespace XXX” shows up on the Oracle-L mailing list or the Oracle developer community forum from time to time. In most cases the tablespace referenced is the temporary tablespace, which means the session reporting the error was probably trying to allocate some space for sorting, or doing a hash join, or instantiating a GTT (global temporary table) or a CTE (common table expression / “with” subquery). The difficulty in cases like this is that the session reporting the error might be the victim of some other session’s greed – so looking at what the session was doing won’t necessarily point you to the real problem.

Of course you then run into a further difficulty when tracking down the source of the problem. By the time you hear the complaint (even if it’s only seconds after the error appeared) the session that had been hogging the temporary tablespace may have finished what it was doing, leaving a huge amount of free space in the temporary tablespace and suggesting (to the less experienced and cynical user) that there’s something fundamentally wrong with the way Oracle has been accounting for space usage.

If you find yourself in this situation remember that (if you’re licensed to take advantage of it) the active session history may be able to help.  One of the columns in v$active_session_history is called temp_space_allocated with the slightly misleading description: “Amount of TEMP memory (in bytes) consumed by this session at the time this sample was taken”. A simple query against v$active_session_history may be enough to identify the session and SQL  statement that had been holding the temporary space when the error was raised, for example:


column pga_allocated        format 999,999,999,999
column temp_space_allocated format 999,999,999,999

break on session_id skip 1 on session_serial#

select
        session_id, session_serial#, 
        sample_id, 
        sql_id, 
        pga_allocated,
        temp_space_allocated
from
        v$active_session_history
where
        sample_time between sysdate - 5/1440 and sysdate
and     nvl(temp_space_allocated,0) != 0
order by
        session_id, sample_id
/

All I’ve done for this example is query v$active_session_history for the last 5 minutes reporting a minimum of information from any rows that show temp space usage. As a minor variation on the theme you can obviously change the time range, and you might want to limit the output to rows reporting more than 1MB (say) of temp space usage.
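For example, a version that looks back 15 minutes and reports only sessions holding more than 1MB of temp space (the only changes are the time range and the predicate on temp_space_allocated) might look like this:

select
        session_id, session_serial#, 
        sample_id, 
        sql_id, 
        pga_allocated,
        temp_space_allocated
from
        v$active_session_history
where
        sample_time between sysdate - 15/1440 and sysdate
and     temp_space_allocated > 1048576          -- more than 1MB
order by
        session_id, sample_id
/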

You’ll notice that I’ve also reported the pga_allocated (Description: Amount of PGA memory (in bytes) consumed by this session at the time this sample was taken) in this query; this is just a little convenience – a query that’s taking a lot of temp space will probably start by acquiring a lot of memory so it’s nice to be able to see the two figures together.

There are plenty of limitations and flaws in the usefulness of this report and I’ll say something about that after showing an example of usage. Let’s start with a script to build some data before running a space-consuming query:


rem
rem     Script:         allocate_tempspace.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Nov 2019
rem
rem     Last tested 
rem             19.3.0.0
rem             12.2.0.1
rem

create table t1 as 
select * from all_objects
;

insert into t1 select * from t1;
commit;

insert into t1 select * from t1;
commit;

insert into t1 select * from t1;
commit;

insert into t1 select * from t1;
commit;

execute dbms_stats.gather_table_stats(null,'t1')

execute dbms_lock.sleep(20)

set pagesize  60
set linesize 255
set trimspool on
set serveroutput off
alter session set statistics_level = all;

with ttemp as (
        select /*+ materialize */ * from t1
)
select 
        /*+ no_partial_join(@sel$2 t1b) no_place_group_by(@sel$2) */ 
        t1a.object_type,
        max(t1a.object_name)
from
        ttemp t1a, ttemp t1b
where
        t1a.object_id = t1b.object_id
group by
        t1a.object_type
order by
        t1a.object_type
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

My working table t1 consists of 16 copies of the view all_objects – so close to 1 million rows in my case – and the query is hinted to avoid any of the clever transformations that the optimizer could use to reduce the workload so it’s going to do a massive hash join and aggregation to report a summary of a couple of dozen rows. Here’s the execution plan (in this case from 12.2.0.1, though the plan is the same for 19.3 with some variations in the numbers).


SQL_ID  1cwabt12zq6zb, child number 0
-------------------------------------

Plan hash value: 1682228242

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name                       | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem | Used-Tmp|
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |                            |      1 |        |     29 |00:00:10.03 |   47413 |  21345 |  12127 |       |       |          |         |
|   1 |  TEMP TABLE TRANSFORMATION               |                            |      1 |        |     29 |00:00:10.03 |   47413 |  21345 |  12127 |       |       |          |         |
|   2 |   LOAD AS SELECT (CURSOR DURATION MEMORY)| SYS_TEMP_0FD9D665B_E2772D3 |      1 |        |      0 |00:00:01.51 |   28915 |      0 |   9217 |  2068K|  2068K|          |         |
|   3 |    TABLE ACCESS FULL                     | T1                         |      1 |    989K|    989K|00:00:00.24 |   19551 |      0 |      0 |       |       |          |         |
|   4 |   SORT GROUP BY                          |                            |      1 |     29 |     29 |00:00:08.51 |   18493 |  21345 |   2910 |  6144 |  6144 | 6144  (0)|         |
|*  5 |    HASH JOIN                             |                            |      1 |     15M|     15M|00:00:03.93 |   18493 |  21345 |   2910 |    48M|  6400K|   65M (1)|   25600 |
|   6 |     VIEW                                 |                            |      1 |    989K|    989K|00:00:00.36 |    9233 |   9218 |      0 |       |       |          |         |
|   7 |      TABLE ACCESS FULL                   | SYS_TEMP_0FD9D665B_E2772D3 |      1 |    989K|    989K|00:00:00.35 |    9233 |   9218 |      0 |       |       |          |         |
|   8 |     VIEW                                 |                            |      1 |    989K|    989K|00:00:00.40 |    9257 |   9217 |      0 |       |       |          |         |
|   9 |      TABLE ACCESS FULL                   | SYS_TEMP_0FD9D665B_E2772D3 |      1 |    989K|    989K|00:00:00.39 |    9257 |   9217 |      0 |       |       |          |         |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   5 - access("T1A"."OBJECT_ID"="T1B"."OBJECT_ID")

Critically this plan shows us two uses of the temp space but only reports one of them as Used-Tmp. The “hash join” at operation 5 tells us that it reached 65MB of (tunable PGA) memory before going “1-pass”, eventually dumping 25,600 KB to disc. This space usage is corroborated by the 2,910 writes (which, at an 8KB block size, would be 23,280 KB). The missing Used-Tmp, however, is the space taken up by the materialized CTE. We can see that operation 2 is a “load as select” that writes 9,217 blocks to disc (subsequently read back twice – the tablescans shown in operations 7 and 9). That’s roughly 74,000 KB of temp space that doesn’t get reported as Used-Tmp.
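Just to make the block-to-space arithmetic explicit (simple multiplication by the 8KB block size):

        2,910 blocks * 8 KB = 23,280 KB    (the hash join spill, close to the 25,600 KB reported as Used-Tmp)
        9,217 blocks * 8 KB = 73,736 KB    (the materialized CTE, the roughly 74,000 KB that goes unreported)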

If we take a look at the plan from 19.3 we see different numbers, but the same “error of omission”:

SQL_ID  1cwabt12zq6zb, child number 0
-------------------------------------

Plan hash value: 1682228242

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name                       | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem | Used-Tmp|
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |                            |      1 |        |     25 |00:00:08.15 |   34905 |  13843 |   8248 |       |       |          |         |
|   1 |  TEMP TABLE TRANSFORMATION               |                            |      1 |        |     25 |00:00:08.15 |   34905 |  13843 |   8248 |       |       |          |         |
|   2 |   LOAD AS SELECT (CURSOR DURATION MEMORY)| SYS_TEMP_0FD9D6624_E259E68 |      1 |        |      0 |00:00:01.26 |   23706 |      0 |   5593 |  2070K|  2070K|          |         |
|   3 |    TABLE ACCESS FULL                     | T1                         |      1 |    907K|    907K|00:00:00.21 |   18024 |      0 |      0 |       |       |          |         |
|   4 |   SORT GROUP BY                          |                            |      1 |     25 |     25 |00:00:06.89 |   11193 |  13843 |   2655 |  6144 |  6144 | 6144  (0)|         |
|*  5 |    HASH JOIN                             |                            |      1 |     14M|     14M|00:00:03.55 |   11193 |  13843 |   2655 |    44M|  6400K|   64M (1)|      23M|
|   6 |     VIEW                                 |                            |      1 |    907K|    907K|00:00:00.26 |    5598 |   5594 |      0 |       |       |          |         |
|   7 |      TABLE ACCESS FULL                   | SYS_TEMP_0FD9D6624_E259E68 |      1 |    907K|    907K|00:00:00.25 |    5598 |   5594 |      0 |       |       |          |         |
|   8 |     VIEW                                 |                            |      1 |    907K|    907K|00:00:00.34 |    5595 |   5594 |      0 |       |       |          |         |
|   9 |      TABLE ACCESS FULL                   | SYS_TEMP_0FD9D6624_E259E68 |      1 |    907K|    907K|00:00:00.33 |    5595 |   5594 |      0 |       |       |          |         |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   5 - access("T1A"."OBJECT_ID"="T1B"."OBJECT_ID")


With slightly fewer rows in t1 (907K vs. 989K) we write 5,593 blocks for the materialized CTE  (instead of 9,217) and spill 2,655 blocks during the hash join (instead of 2,910). But again it’s only the hash join spill that is reported under Used-Tmp. Note, by the way, that the Used-Tmp in 12.2 was reported in KB when it’s reported in MB in 19.3.0.0.

Side note: comparing the number of rows created and blocks written for the CTE, it looks as if 19.3 is using the data blocks much more efficiently than 12.2. There’s no obvious reason for this (though a first guess would be that the older mechanism writes a GTT with pctfree=10 while the newer mechanism avoids any free space and transactional details) so, as ever, I now have another draft for a blog note reminding me to investigate (eventually) what differences there are in CTE storage on the upgrade. It’s something that might make a difference in a few special cases.
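As a rough check of that impression (simple division, using the rounded row counts and write counts from the two plans):

        12.2.0.1:  989K rows / 9,217 blocks  =  roughly 107 rows per block
        19.3.0.0:  907K rows / 5,593 blocks  =  roughly 162 rows per block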

With the figures from the execution plans in mind we can now look at the results of the query against v$active_session_history. Conveniently the queries took a few seconds to complete, so we’re going to see several rows for each execution.

First the results from 12.2.0.1

SESSION_ID SESSION_SERIAL#  SAMPLE_ID SQL_ID           PGA_ALLOCATED TEMP_SPACE_ALLOCATED
---------- --------------- ---------- ------------- ---------------- --------------------
        14           22234   15306218 1cwabt12zq6zb       95,962,112            1,048,576
                             15306219 1cwabt12zq6zb       97,731,584           37,748,736
                             15306220 1cwabt12zq6zb      148,194,304           77,594,624
                             15306221 1cwabt12zq6zb      168,117,248           85,983,232
                             15306222 1cwabt12zq6zb      168,117,248           90,177,536
                             15306223 1cwabt12zq6zb      168,117,248           95,420,416
                             15306224 1cwabt12zq6zb      168,117,248           98,566,144
                             15306225 1cwabt12zq6zb      168,117,248          102,760,448
                             15306226 1cwabt12zq6zb      116,933,632          103,809,024
                             15306227 1cwabt12zq6zb      116,933,632          103,809,024
                             15306228 b66ycurnwpgud        8,602,624            1,048,576

I pointed out that we had 25,600 KB reported as Used-Tmp and roughly 74,000 KB unreported – a total of nearly 100,000 KB, which is reasonably close to the 103,800,000 bytes reported by ASH. Moreover the timing of the plan (loading the CTE in the first 2 seconds) seems to agree with the growth of temp_space_allocated to 77,590,000 bytes by the time we get to sample_id 15306220 in ASH. Then we have several seconds of slow growth as the hash join takes place and feeds its results up to the sort group by. At the end of the query we happen to have been lucky enough to catch one last sample just before the session had released all its temp space and ceased to be active. (Note, however, that the sql_id at that sample point was not the sql_id of our big query – and that’s a clue about one of the limitations of using ASH to find the greedy SQL.)

We see the same pattern of behaviour in 19.3.0.0:


SESSION_ID SESSION_SERIAL#  SAMPLE_ID SQL_ID           PGA_ALLOCATED TEMP_SPACE_ALLOCATED
---------- --------------- ---------- ------------- ---------------- --------------------
       136           42767    2217500 1cwabt12zq6zb      143,982,592           46,137,344
                              2217501 1cwabt12zq6zb      193,527,808           54,525,952
                              2217502 1cwabt12zq6zb      193,527,808           57,671,680
                              2217503 1cwabt12zq6zb      193,527,808           61,865,984
                              2217504 1cwabt12zq6zb      197,722,112           67,108,864
                              2217505 1cwabt12zq6zb      150,601,728           70,254,592
                              2217506 1cwabt12zq6zb      150,601,728           70,254,592

We start with an almost instantaneous jump to 46MB of temp_space_allocated in the first second of the query – that’s the 5,593 blocks of the CTE being materialized, then the slow growth of temp space as the hash join runs, spills to disc, and passes its data up to the sort group by. Again we can see that the peak usage was the CTE (46MB) plus the reported spill of 23MB (plus rounding errors and odd bits).

Preliminary Observations

Queries against ASH (v$active_session_history) can show us sessions that were holding space in the temporary tablespace at the moment a sample of active sessions was taken. This may allow us to identify greedy sessions that were causing other sessions to fail with ORA-01652 (unable to extend temp segment).

We have seen that there is at least one case where we get better information about temp space allocation from ASH than we do from the variants on v$sql_plan that include the SQL Workarea information (v$sql_workarea, v$sql_workarea_active) because the space acquired during materialization of CTEs is not reported as a “tunable SQL workarea” but does appear in the ASH temp_space_allocated.

At first sight it looks as if we may be able to use the query against ASH to identify the statement (by sql_id) that was the one being run by the greedy session when it consumed all the space. As we shall see in a further article, there are various reasons why this may over-optimistic, however in many cases there’s a fair chance that when you see the same sql_id appearing in a number of consecutive rows of the report then that statement may be the thing that is responsible for the recent growth in temp space usage – and you can query v$sql to find the text and call dbms_xplan.display_cursor() to get as much execution plan information as possible.
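As a sketch of that follow-up step (using the sql_id from the example above; in a real incident you would substitute whichever sql_id the ASH report highlighted):

select sql_fulltext from v$sql where sql_id = '1cwabt12zq6zb' and rownum = 1;

select * from table(dbms_xplan.display_cursor('1cwabt12zq6zb', null, 'allstats last'));

Remember that the 'allstats last' option will only report rowsource execution statistics if the statement ran with statistics_level set to all (or with the gather_plan_statistics hint); otherwise you will just get the basic plan.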

Further questions
  • When does a session release the temp_space_allocated? Will the space be held (locked) as long as the cursor is open, or can it be released when it is no longer needed? Will it be held, but releasable, even after the cursor has (from the client program’s perspective) been closed?
  • Could we be fooled by a report that said a session was holding a lot of space when it didn’t need it and would have released it if the demand had appeared?
  • Under what conditions might the temp_space_allocated in an ASH sample have nothing to do with the sql_id reported in the same sample?
  • Are there any other reasons why ASH might report temp_space_allocated when an execution plan doesn’t?
  • Is temp_space_allocated only about the temporary tablespace, or could it include information about other (“permanent”) tablespaces?

Stay tuned for further analysis of the limitations of using v$active_session_history.temp_space_allocated to help identify the source of a space management ORA-01652 issue.

 

 

E-rows / A-rows

Wed, 2019-12-04 07:17

This note was prompted by an error I made at the UKOUG TechFest19 yesterday. It’s fairly well-known that when you read an execution plan that includes the rowsource execution stats – so you get the E-rows (estimated) and A-rows (Actual) reported – then a sensible check of the quality of the optimizer’s calculations is to compare the estimates and actuals allowing for the fact that the E-rows is “per start” and the A-rows is “cumulative”, so A-rows = E-rows * Starts.

The error I made yesterday was to forget that this relationship isn’t always true. In particular, partitioning and parallel query introduced the need for a little flexibility in reading the numbers – which I’ll demonstrate with a couple of simple examples running under 12.2.0.1.


rem
rem     Script:         estimate_actual.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Nov 2019
rem

create table pt_composite_1 (
        id,
        grp,
        small_vc,
        padding
)
nologging
partition by range(id) 
subpartition by hash (grp)
subpartitions 4
(
        partition p1 values less than (  4000),
        partition p2 values less than (  8000),
        partition p3 values less than ( 16000),
        partition p4 values less than ( 32000),
        partition p5 values less than ( 64000),
        partition p6 values less than (128000)
)
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                                            id,
        trunc(rownum/100)                                 grp,
        cast(to_char(trunc(rownum/20)) as varchar2(10))   small_vc,
        cast(rpad('x',100) as varchar2(100))              padding
from
        generator       g1,
        generator       g2
where 
        rownum <= 1e5 -- > comment to avoid WordPress format issue
/

create table t3 
nologging pctfree 80
storage (initial 1M next 1M)
as
select * from pt_composite_1
/

All I’ve done is create a couple of tables with 100,000 rows each – and now I’m going to count the rows and see what I get from the execution plans with rowsource execution stats enabled:


set serveroutput off
alter session set statistics_level = all;

prompt  =================
prompt  Partition effects
prompt  =================

select count(id) from pt_composite_1;
select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

prompt  ================
prompt  Parallel effects
prompt  ================

select /*+ parallel (t3 4) */ count(id) from t3;
select * from table(dbms_xplan.display_cursor(null,null,'allstats'));

With a little cosmetic tidying, here are the two execution plans (note that I haven’t used the “last” format option when reporting the parallel plan):


=================
Partition effects
=================

select count(id) from pt_composite_1

-------------------------------------------------------------------------------------------------
| Id  | Operation            | Name           | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |                |      1 |        |      1 |00:00:00.04 |    1866 |
|   1 |  SORT AGGREGATE      |                |      1 |      1 |      1 |00:00:00.04 |    1866 |
|   2 |   PARTITION RANGE ALL|                |      1 |    100K|    100K|00:00:00.04 |    1866 |
|   3 |    PARTITION HASH ALL|                |      6 |    100K|    100K|00:00:00.04 |    1866 |
|   4 |     TABLE ACCESS FULL| PT_COMPOSITE_1 |     24 |    100K|    100K|00:00:00.04 |    1866 |
-------------------------------------------------------------------------------------------------


================
Parallel effects
================

select /*+ parallel (t3 4) */ count(id) from t3

------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |          |      1 |        |      1 |00:00:00.04 |      20 |      0 |
|   1 |  SORT AGGREGATE        |          |      1 |      1 |      1 |00:00:00.04 |      20 |      0 |
|   2 |   PX COORDINATOR       |          |      1 |        |      4 |00:00:00.04 |      20 |      0 |
|   3 |    PX SEND QC (RANDOM) | :TQ10000 |      0 |      1 |      0 |00:00:00.01 |       0 |      0 |
|   4 |     SORT AGGREGATE     |          |      4 |      1 |      4 |00:00:00.11 |    8424 |   7692 |
|   5 |      PX BLOCK ITERATOR |          |      4 |    100K|    100K|00:00:00.11 |    8424 |   7692 |
|*  6 |       TABLE ACCESS FULL| T3       |     61 |    100K|    100K|00:00:00.06 |    8424 |   7692 |
------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   6 - access(:Z>=:Z AND :Z<=:Z)

As you can see, the lines specifying partition selection report E-Rows for the whole table, not for any partition-level approximation, so for operations 3 and 4 we shouldn’t multiply Starts by E-rows to compare with A-rows. (Starts = 6 for operation 3 because we have 6 partitions, and Starts = 24 for operation 4 because at the lowest level we have a total of 24 data segments).
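As a quick cross-check of that segment count (a trivial query, assuming the table was created in your own schema):

select count(*) from user_tab_subpartitions where table_name = 'PT_COMPOSITE_1';

-- 6 range partitions x 4 hash subpartitions = 24 (sub)partition segments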

For the parallel query we see the same pattern – every parallel slave reports the expected total number of rows, and every “block iterator” (rowid range) reports the expected total number of rows. Again we see that using multiplication to compare E-rows and A-rows would not be valid.

In fact it’s not just partitioning and parallelism that can cause confusion. Even something as simple as a serial nested loop join has a couple of surprises (largely thanks to the evolution of the mechanics – without a matching adjustment to the execution plans – over time). Here’s a script to generate a couple of tables, which we will then join – hinting various mechanisms for the nested loop.


create table t1
as
select
        rownum           id,
        mod(rownum,100)  n1,
        cast(lpad(rownum,20) as varchar2(20)) v1 
from
        dual
connect by
        level <= 1000 -- > comment to avoid WordPress format issue
;

create table t2
as
select  * from t1
union all
select  * from t1
union all
select  * from t1
;

create index t2_i1 on t2(id);

It’s not a subtle test – as you can see we have 3 rows in table t2 for every row in t1. So let’s pick some t1 rows and join to t2 on id. Again it’s 12.2.0.1:

set serveroutput off
alter session set statistics_level = all;

prompt  ==============================
prompt  Nested loop join (traditional)
prompt  ==============================

select
        /*+ 
                leading(t1 t2) use_nl_with_index(t2 t2_i1) 
                opt_param('_nlj_batching_enabled', 0)
                no_nlj_prefetch(t2)
        */
        t1.v1, t2.v1
from
        t1, t2
where
        t1.n1 = 4
and
        t2.id = t1.id
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

prompt  ==============================
prompt  Nested loop join with prefetch
prompt  ==============================

select
        /*+ 
                leading(t1 t2) use_nl_with_index(t2 t2_i1) 
                nlj_prefetch(t2) 
        */
        t1.v1, t2.v1
from
        t1, t2
where
        t1.n1 = 4
and
        t2.id = t1.id
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

prompt  ==============================
prompt  Nested loop join with batching
prompt  ==============================

select
        /*+ 
                leading(t1 t2) use_nl_with_index(t2 t2_i1) 
                nlj_batching(t2) 
        */
        t1.v1, t2.v1
from
        t1, t2
where
        t1.n1 = 4
and
        t2.id = t1.id
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

Here are the three plans, with a couple of comments after each – all three queries returned the same 30 (= 10 * 3) rows.


==============================
Nested loop join (traditional)
==============================

--------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |       |      1 |        |     30 |00:00:00.01 |      60 |
|   1 |  NESTED LOOPS                        |       |      1 |     30 |     30 |00:00:00.01 |      60 |
|*  2 |   TABLE ACCESS FULL                  | T1    |      1 |     10 |     10 |00:00:00.01 |      15 |
|   3 |   TABLE ACCESS BY INDEX ROWID BATCHED| T2    |     10 |      3 |     30 |00:00:00.01 |      45 |
|*  4 |    INDEX RANGE SCAN                  | T2_I1 |     10 |      3 |     30 |00:00:00.01 |      15 |
--------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("T1"."N1"=4)
   4 - access("T2"."ID"="T1"."ID")

This is the original nested loop structure (apart from the “batched” option that appeared in 12c) and follows the rule/guideline:

  • Operation 2 operates once and returns the 10 rows predicted.
  • Operation 3 is started 10 times by operation 1, with a prediction of 3 rows per start – and the actual comes out at 30 rows.
  • Operation 4 is started 10 times (once for each start of operation 3), with a prediction of 3 rowids per start – and the actual comes out at 30 rows.
==============================
Nested loop join with prefetch
==============================

-------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |       |      0 |        |      0 |00:00:00.01 |       0 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T2    |      1 |      3 |     30 |00:00:00.01 |      60 |
|   2 |   NESTED LOOPS                      |       |      1 |     30 |     30 |00:00:00.01 |      30 |
|*  3 |    TABLE ACCESS FULL                | T1    |      1 |     10 |     10 |00:00:00.01 |      15 |
|*  4 |    INDEX RANGE SCAN                 | T2_I1 |     10 |      3 |     30 |00:00:00.01 |      15 |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter("T1"."N1"=4)
   4 - access("T2"."ID"="T1"."ID")
 

Again in the order of rowsource generation

  • Operation 3 starts once with a prediction of 10 rows and the rule works.
  • Operation 4 is started 10 times by operation 2, with a prediction of 3 rows (rowids) per start, and the rule works.
  • Operation 2 was started once by operation 1, with a prediction of 30 rows (rowids), and the rule works.
  • Operation 1 starts once, but the prediction is reported as the value you would have got from the original NLJ shape – and breaks the rule.
==============================
Nested loop join with batching
==============================

------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |      1 |        |     30 |00:00:00.01 |      60 |
|   1 |  NESTED LOOPS                |       |      1 |     30 |     30 |00:00:00.01 |      60 |
|   2 |   NESTED LOOPS               |       |      1 |     30 |     30 |00:00:00.01 |      30 |
|*  3 |    TABLE ACCESS FULL         | T1    |      1 |     10 |     10 |00:00:00.01 |      15 |
|*  4 |    INDEX RANGE SCAN          | T2_I1 |     10 |      3 |     30 |00:00:00.01 |      15 |
|   5 |   TABLE ACCESS BY INDEX ROWID| T2    |     30 |      3 |     30 |00:00:00.01 |      30 |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter("T1"."N1"=4)
   4 - access("T2"."ID"="T1"."ID")


In the order in which rowsources are created

  • Operation 3 starts once with a prediction of 10 rows – and the A-rows matches the rule
  • Operation 4 is started 10 times by operation 2, with a prediction of 3 rows per start – and the A-rows matches the rule.
  • Operation 5 is started 30 times by operation 1, with a prediction of 3 rows per start – again this is the value that you would have seen from the original representation of the NLJ; the prediction obviously should be 1, so the rule is broken again.

tl;dr

It is important to remember that the basic rule of “A-rows = starts * E-rows” does not hold for the partition-related lines or the PX related lines of partitioned and parallel execution plans.

You may also find a few other cases where you need to be a little cautious about trusting the rule without first thinking carefully about the mechanics of what the shape of the plan is telling you.

Parse Time

Sun, 2019-11-17 13:37

This is a note I started drafting in October 2012. It’s a case study from an optimizer (10053) trace file someone emailed to me, and it describes some of the high-level steps I went through to see if I could pinpoint what the optimizer was doing that fooled it into spending a huge amount of time optimising a statement that ultimately executed very quickly.

Unfortunately I never finished my notes and I can no longer find the trace file that the article was based on, so I don’t really know what I was planning to say to complete the last observation I had recorded.

I was prompted a couple of days ago to publish the notes so far because I was reminded in a conversation with members of the Oak Table Network about an article that Franck Pachot wrote a couple of years ago. In 12c Oracle Corp. introduced a time-reporting mechanism for the optimizer trace. If some optimisation step takes “too long” (1 second, by default) then the optimizer will write a “TIMER:” line into the trace file telling you what the operation was, how long it took to complete, and how much CPU time it used. The default for “too long” can be adjusted by setting a “fix control”. This makes it a lot easier to find out where the time went if you see a very long parse time.
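For example (a sketch only, but it uses the same fix control that appears in the 19c test at the end of this note), you could drop the reporting threshold to 1/100th of a second before enabling the 10053 trace:

alter session set "_fix_control"='16923858:4';        -- the value is a power of 10: 4 means 1e4 microseconds
alter session set events '10053 trace name context forever, level 1';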

But let’s get back to the original trace file and drafted blog note. It started with a question on OTN and an extract from a tkprof output to back up a nasty  performance issue.

=============================================================================================

 

What do you do about a parse time of 46 seconds ? That was the question that came up on OTN a few days ago – and here’s the tkprof output to demonstrate it.

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1     46.27      46.53          0          5          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        2      0.33       0.63        129      30331          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        4     46.60      47.17        129      30336          0           1

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 144  
Number of plan statistics captured: 1
 
Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         1          1          1  SORT AGGREGATE (cr=30331 pr=129 pw=0 time=637272 us)
       863        863        863   VIEW  VM_NWVW_1 (cr=30331 pr=129 pw=0 time=637378 us cost=1331 size=10 card=1)
       ... and lots more lines of plan

According to tkprof, it takes 46 seconds – virtually all CPU time – to optimise this statement, then 0.63 seconds to run it. You might spot that this is 11gR2 (in fact it’s 11.2.0.3) from the fact that the second line of the “Row Source Operation” includes a report of the estimated cost of the query, which is only 1,331.

Things were actually worse than they seem at first sight; when we saw more of the tkprof output the following also showed up:

SELECT /* OPT_DYN_SAMP */ /*+ ALL_ROWS IGNORE_WHERE_CLAUSE 
  NO_PARALLEL(SAMPLESUB) opt_param('parallel_execution_enabled', 'false') 
  NO_PARALLEL_INDEX(SAMPLESUB) NO_SQL_TUNE */ NVL(SUM(C1),:"SYS_B_00"), 
  NVL(SUM(C2),:"SYS_B_01") 
FROM
 (SELECT /*+ IGNORE_WHERE_CLAUSE NO_PARALLEL("VAL_000002") FULL("VAL_000002") 
  NO_PARALLEL_INDEX("VAL_000002") */ :"SYS_B_02" AS C1, 
  CASE WHEN
    ...
  END AS C2 FROM "BISWEBB"."RECORDTEXTVALUE" 
  SAMPLE BLOCK (:"SYS_B_21" , :"SYS_B_22") SEED (:"SYS_B_23") "VAL_000002" 
  WHERE ... 
 ) SAMPLESUB
 
 
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        5      0.00       0.00          0          0          0           0
Execute      5      0.00       0.00          0          0          0           0
Fetch        5     21.41      24.14      11108      37331          0           5
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total       15     21.41      24.15      11108      37331          0           5
 
Misses in library cache during parse: 1
Misses in library cache during execute: 1
Optimizer mode: ALL_ROWS
Parsing user id: 144     (recursive depth: 1)
Number of plan statistics captured: 3
 
Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         1          1          1  SORT AGGREGATE (cr=7466 pr=3703 pw=0 time=5230126 us)
   3137126    3137126    3137126   PARTITION HASH ALL PARTITION: 1 128 (cr=7466 pr=3703 pw=0 time=2547843 us cost=18758 size=131597088 card=3133264)
   3137126    3137126    3137126    TABLE ACCESS SAMPLE RECORDTEXTVALUE PARTITION: 1 128 (cr=7466 pr=3703 pw=0 time=2372509 us cost=18758 size=131597088 card=3133264)

This piece of SQL executed five times as the query was optimised, adding a further 24 seconds elapsed time and 21 CPU seconds which, surprisingly, weren’t included in the headline 46 seconds. The total time spent in optimising the statement was around 70 seconds, of which about 68 seconds were spent on (or waiting for) the CPU.

This is unusual – I don’t often see SQL statements taking more than a few seconds to parse – not since 8i, and not without complex partition views – and I certainly don’t expect to see a low cost query in 11.2.0.3 taking anything like 70 (or even 46) seconds to optimise.

The OP had enabled the 10046 and the 10053 traces at the same time – and since the parse time was sufficiently unusual I asked him to email me the raw trace file – all 200MB of it.

Since it’s not easy to process 200MB of trace the first thing to do is extract a few headline details, and I thought you might be interested to hear about some of the methods I use on the rare occasions when I decide to look at a 10053.

My aim is to investigate a very long parse time and the tkprof output had already shown me that there were a lot of tables in the query, so I had the feeling that the problem would relate to the amount of work done testing possible join orders; I’ve also noticed that the dynamic sampling code ran five times – so I’m expecting to see some critical stage of the optimisation run 5 times (although I don’t know why it should).

Step 1: Use grep (or find if you’re on Windows) to do a quick check for the number of join orders considered. I’m just searching for the text “Join order[” appearing at the start of line and then counting how many times I find it:

[jonathan@linux01 big_trace]$ grep "^Join order\[" orcl_ora_25306.trc  | wc -l
6266

That’s 6,266 join orders considered – let’s take a slightly closer look:

[jonathan@linux01 big_trace]$ grep -n "^Join order\[" orcl_ora_25306.trc >temp.txt
[jonathan@linux01 big_trace]$ tail -2 temp.txt
4458394:Join order[581]:  RECORDTYPEMEMBER[RTM]#9  RECORD_[VAL_000049]#13  ...... from$_subquery$_008[TBL_000020]#2
4458825:Join order[1]:  VM_NWVW_1[VM_NWVW_1]#0

The line of dots represents another 11 tables (or similar objects) in the join order. But there are only 581 join orders (apparently) before the last one in the file (which is a single view transformation). I’ve used the “-n” option with grep, so if I wanted to look at the right bit of the file I could tail the last few thousand lines, but my machine is happy to use vi on a 200MB file, and a quick search (backwards) through the file finds the number 581 in the following text (which does not appear in all versions of the trace file):

Number of join permutations tried: 581

So a quick grep for "join permutations" might be a good idea. (In the absence of this line I'd have got to the same result by directing the earlier grep for "^Join order\[" to a file and playing around with the contents of the file.)

[jonathan@linux01 big_trace]$ grep -n "join permutations" orcl_ora_25306.trc
11495:Number of join permutations tried: 2
11849:Number of join permutations tried: 1
12439:Number of join permutations tried: 2
13826:Number of join permutations tried: 2
14180:Number of join permutations tried: 1
14552:Number of join permutations tried: 2
15938:Number of join permutations tried: 2
16292:Number of join permutations tried: 1
16665:Number of join permutations tried: 2
18141:Number of join permutations tried: 2
18550:Number of join permutations tried: 2
18959:Number of join permutations tried: 2
622799:Number of join permutations tried: 374
624183:Number of join permutations tried: 2
624592:Number of join permutations tried: 2
624919:Number of join permutations tried: 1
625211:Number of join permutations tried: 2
1759817:Number of join permutations tried: 673
1760302:Number of join permutations tried: 1
1760593:Number of join permutations tried: 2
1760910:Number of join permutations tried: 1
1761202:Number of join permutations tried: 2
2750475:Number of join permutations tried: 674
2751325:Number of join permutations tried: 2
2751642:Number of join permutations tried: 1
2751933:Number of join permutations tried: 2
2752250:Number of join permutations tried: 1
2752542:Number of join permutations tried: 2
3586276:Number of join permutations tried: 571
3587133:Number of join permutations tried: 2
3587461:Number of join permutations tried: 1
3587755:Number of join permutations tried: 2
3588079:Number of join permutations tried: 1
3588374:Number of join permutations tried: 2
4458608:Number of join permutations tried: 581
4458832:Number of join permutations tried: 1

The key thing we see here is that there are five sections of long searches, and a few very small searches. Examination of the small search lists shows that they relate to some inline views which simply join a couple of tables. For each of the long searches we can see that the first join order in each set is for 14 “tables”. This is where the work is going. But if you add up the number of permutations in the long searches you get a total of 2,873, which is a long way off the 6,266 that we found with our grep for “^Join order[“ – so where do the extra join orders come from ? Let’s take a closer look at the file where we dumped all the Join order lines – the last 10 lines look like this:

4452004:Join order[577]:  RECORD_[VAL_000033]#10  from$_subquery$_017[TBL_000029]#1 ...
4452086:Join order[577]:  RECORD_[VAL_000033]#10  from$_subquery$_017[TBL_000029]#1 ...
4453254:Join order[578]:  RECORD_[VAL_000033]#10  from$_subquery$_017[TBL_000029]#1 ...
4453382:Join order[578]:  RECORD_[VAL_000033]#10  from$_subquery$_017[TBL_000029]#1 ...
4454573:Join order[579]:  RECORD_[VAL_000033]#10  from$_subquery$_017[TBL_000029]#1 ...
4454655:Join order[579]:  RECORD_[VAL_000033]#10  from$_subquery$_017[TBL_000029]#1 ...
4455823:Join order[580]:  RECORD_[VAL_000033]#10  from$_subquery$_017[TBL_000029]#1 ...
4455905:Join order[580]:  RECORD_[VAL_000033]#10  from$_subquery$_017[TBL_000029]#1 ...
4457051:Join order[581]:  RECORDTYPEMEMBER[RTM]#9  RECORD_[VAL_000049]#13  ...
4458394:Join order[581]:  RECORDTYPEMEMBER[RTM]#9  RECORD_[VAL_000049]#13  ...
4458825:Join order[1]:  VM_NWVW_1[VM_NWVW_1]#0

Every single join order seems to have appeared twice, and doubling the counts we got for the sum of the permutations gets us close to the total we got for the join order search. Again, we could zoom in a little closer, does the text near the start of the two occurrences of join order 581 give us any clues ? We see the following just before the second one:

****** Recost for ORDER BY (using join row order) *******

The optimizer has tried to find a way of eliminating some of the cost by letting the table join order affect the order of the final output. Let’s do another grep to see how many join orders have been recosted:

[jonathan@linux01 big_trace]$ grep "Recost for ORDER BY" orcl_ora_25306.trc | sort | uniq -c
    452 ****** Recost for ORDER BY (using index) ************
   2896 ****** Recost for ORDER BY (using join row order) *******

So we’ve done a huge amount of recosting. Let’s check the arithmetic: 452 + 2,896 + 2,873 = 6,221, which is remarkably close to the 6,266 we needed (and we have ignored a few dozen join orders that were needed for the inline views, and the final error is too small for me to worry about).

We can conclude, therefore, that we did a huge amount of work costing a 14 table join a little over 6,000 times. It’s possible, of course, that we discarded lots of join orders very early on in the cost stage, so we could count the number of times we see a “Now joining” message – to complete a single pass on a 14 table join the optimizer will have to report “Now joining” 13 times.

[jonathan@linux01 big_trace]$ grep -n "Now joining" orcl_ora_25306.trc | wc -l
43989

Since the message appeared 44,000 times from 6,200 join orders we have an average of 7 steps evaluated per join order. Because of the way that the optimizer takes short-cuts I think this is a fairly strong clue that most of the join order calculations actually completed, or got very close to completing, over the whole 14 tables. (The optimizer remembers “partial results” from previous join order calculations, so doesn’t have to do 13 “Now joining” steps on every single join order.)

We still need to know why the optimizer tried so hard before supplying a plan – so let’s look for the “Best so far” lines, which the trace file reports each time the optimizer finds a better plan than the previous best. Here’s an example of what we’re looking for:

       Cost: 206984.61  Degree: 1  Resp: 206984.61  Card: 0.00 Bytes: 632
***********************
Best so far:  Table#: 0  cost: 56.9744  card: 1.0000  bytes: 30
              Table#: 3  cost: 59.9853  card: 0.0000  bytes: 83
              Table#: 6  cost: 60.9869  card: 0.0000  bytes: 151
              Table#:10  cost: 61.9909  card: 0.0000  bytes: 185
              Table#: 5  cost: 62.9928  card: 0.0000  bytes: 253
              Table#: 2  cost: 65.0004  card: 0.0000  bytes: 306
              Table#: 1  cost: 122.4741  card: 0.0000  bytes: 336
              Table#: 8  cost: 123.4760  card: 0.0000  bytes: 387
              Table#: 4  cost: 125.4836  card: 0.0000  bytes: 440
              Table#: 7  cost: 343.2625  card: 0.0000  bytes: 470
              Table#: 9  cost: 345.2659  card: 0.0000  bytes: 530
              Table#:11  cost: 206981.5979  card: 0.0000  bytes: 564
              Table#:12  cost: 206982.6017  card: 0.0000  bytes: 598
              Table#:13  cost: 206984.6055  card: 0.0000  bytes: 632
***********************

As you can see, we get a list of the tables (identified by their position in the first join order examined) with details of accumulated cost. But just above this tabular display there’s a repeat of the cost that we end up with. So let’s write, and apply, a little awk script to find all the “Best so far” lines and then print the line two above. Here’s a suitable script, followed by a call to use it:

{
        # When the current line contains "Best so far", print the current line
        # number followed by the line that appeared two lines earlier (held in m2)
        if (index($0,"Best so far") != 0) {print NR m2}
        # shuffle the buffers: m2 remembers two lines back, m1 one line back
        m2 = m1; m1 = $0;
}

awk -f cost.awk orcl_ora_25306.trc >temp.txt

There was a bit of a mess in the output – there are a couple of special cases (relating, in our trace file, to the inline views and the appearance of a “group by placement”) that cause irregular patterns to appear, but the script was effective for the critical 14 table join. And looking through the list of costs for the various permutations we find that almost all the options show a cost of about 206,000 – except for the last few in two of the five “permutation sets” that suddenly drop to costs of around 1,500 and 1,300. The very high starting cost explains why the optimizer was prepared to spend so much time trying to find a good path and why it kept working so hard until the cost dropped very sharply.

Side bar: I have an old note from OCIS (the precursor, or the precursor of the precursor, of MOS) that the optimizer will stop searching when the number of join order tests * the number of “non-single-row” tables (according to the single table access path) * 0.3 is greater than the best cost so far.  I even have a test script (run against 8.1.7.4, dated September 2002) that seems to demonstrate the formula.  The formula may be terribly out of date by now and the rules of exactly how and when it applies may have changed – the model didn’t seem to work when I ran it against 19.3 – but the principle probably still holds true.
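Purely as an illustration of how that old formula would play out here (and guessing that 13 of the 14 tables would count as "non-single-row"), the stopping condition would look like this:

        0.3 * 13 * (number of join orders tested) > best cost so far

        best cost ~ 206,000   =>   the search could run to roughly 52,800 join orders
        best cost ~   1,300   =>   the search would stop after roughly 330 join orders

That would at least be consistent with each of the long searches running to several hundred permutations and stopping fairly soon after the cost dropped sharply.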

At this point we might decide that we ought to look at the initial join order and at the join order where the cost dropped dramatically, and try to work out why Oracle picked such a bad starting join order, and what it was about the better join order that the optimizer had missed. This might allow us to recognise some error in the statistics for either the “bad” starting order or the “good” starting order and allow us to solve the problem by (e.g.) creating a column group or gathering some specific statistics. We might simply decide that we’ll take a good join order and pass it to the optimizer through a /*+ leading() */ hint, or simply take the entire outline and attach it to the query through a faked SQL Profile (or embedded set of hints).

However, for the purposes of this exercise (and because sometimes you have to find a strategic solution rather than a “single statement” solution) I’m going to carry on working through mechanisms for dissecting the trace file without looking too closely at any of the fine detail.

The final “high-level” target I picked was to pin down why there were 5 sets of join orders. I had noticed something particular about the execution plan supplied – it showed several occurrences of the operation “VIEW PUSHED PREDICATE” so I wondered if this might be relevant. So I did a quick check near the start of the main body of the trace file for anything that might be a clue, and found the following just after the “QUERY BLOCK SIGNATURE”.

QUERY BLOCK SIGNATURE
---------------------
  signature(): NULL
***********************************
Cost-Based Join Predicate Push-down
***********************************
JPPD: Checking validity of push-down in query block SEL$6E5D879B (#4)
JPPD:   Checking validity of push-down from query block SEL$6E5D879B (#4) to query block SEL$C20BB4FE (#6)
Check Basic Validity for Non-Union View for query block SEL$C20BB4FE (#6)
JPPD:     JPPD bypassed: View has non-standard group by.
JPPD:   No valid views found to push predicate into.
JPPD: Checking validity of push-down in query block SEL$799AD133 (#3)
JPPD:   Checking validity of push-down from query block SEL$799AD133 (#3) to query block SEL$EFE55ECA (#7)
Check Basic Validity for Non-Union View for query block SEL$EFE55ECA (#7)
JPPD:     JPPD bypassed: View has non-standard group by.
JPPD:   No valid views found to push predicate into.
JPPD: Checking validity of push-down in query block SEL$C2AA4F6A (#2)
JPPD:   Checking validity of push-down from query block SEL$C2AA4F6A (#2) to query block SEL$799AD133 (#3)
Check Basic Validity for Non-Union View for query block SEL$799AD133 (#3)
JPPD:     Passed validity checks
JPPD:   Checking validity of push-down from query block SEL$C2AA4F6A (#2) to query block SEL$6E5D879B (#4)
Check Basic Validity for Non-Union View for query block SEL$6E5D879B (#4)
JPPD:     Passed validity checks
JPPD:   Checking validity of push-down from query block SEL$C2AA4F6A (#2) to query block SEL$FC56C448 (#5)
Check Basic Validity for Non-Union View for query block SEL$FC56C448 (#5)
JPPD:     Passed validity checks
JPPD: JPPD:   Pushdown from query block SEL$C2AA4F6A (#2) passed validity checks.
Join-Predicate push-down on query block SEL$C2AA4F6A (#2)
JPPD: Using search type: linear
JPPD: Considering join predicate push-down
JPPD: Starting iteration 1, state space = (3,4,5) : (0,0,0)

As you can see we are doing cost-based join-predicate pushdown, and there are three targets which are valid for the operation. Notice the line that says “using search type: linear”, and the suggestive “starting iteration 1” – let’s look for more lines with “Starting iteration”

[jonathan@linux01 big_trace]$ grep -n "Starting iteration" orcl_ora_25306.trc
9934:GBP: Starting iteration 1, state space = (20,21) : (0,0)
11529:GBP: Starting iteration 2, state space = (20,21) : (0,C)
11562:GBP: Starting iteration 3, state space = (20,21) : (F,0)
12479:GBP: Starting iteration 4, state space = (20,21) : (F,C)
12517:GBP: Starting iteration 1, state space = (18,19) : (0,0)
13860:GBP: Starting iteration 2, state space = (18,19) : (0,C)
13893:GBP: Starting iteration 3, state space = (18,19) : (F,0)
14587:GBP: Starting iteration 4, state space = (18,19) : (F,C)
14628:GBP: Starting iteration 1, state space = (16,17) : (0,0)
15972:GBP: Starting iteration 2, state space = (16,17) : (0,C)
16005:GBP: Starting iteration 3, state space = (16,17) : (F,0)
16700:GBP: Starting iteration 4, state space = (16,17) : (F,C)
16877:JPPD: Starting iteration 1, state space = (3,4,5) : (0,0,0)
622904:JPPD: Starting iteration 2, state space = (3,4,5) : (1,0,0)
1759914:JPPD: Starting iteration 3, state space = (3,4,5) : (1,1,0)
2750592:JPPD: Starting iteration 4, state space = (3,4,5) : (1,1,1)

There are four iterations for state space (3,4,5) – and look at the huge gaps between their “Starting iteration” lines. In fact, let’s go a little closer and combine their starting lines with the lines above where I searched for “Number of join permutations tried:”


16877:JPPD: Starting iteration 1, state space = (3,4,5) : (0,0,0)
622799:Number of join permutations tried: 374

622904:JPPD: Starting iteration 2, state space = (3,4,5) : (1,0,0)
1759817:Number of join permutations tried: 673

1759914:JPPD: Starting iteration 3, state space = (3,4,5) : (1,1,0)
2750475:Number of join permutations tried: 674

2750592:JPPD: Starting iteration 4, state space = (3,4,5) : (1,1,1)
3586276:Number of join permutations tried: 571

4458608:Number of join permutations tried: 581

At this point my notes end and I don’t know where I was going with the investigation. I know that I suggested to the OP that the cost-based join predicate pushdown was having a huge impact on the optimization time and suggested he experiment with disabling the feature. (Parse time dropped dramatically, but query run-time went through the roof – so that proved a point, but wasn’t a useful strategy). I don’t know, however, what the fifth long series of permutations was for, so if I could find the trace file one of the things I’d do next would be to look at the detail a few lines before line 4,458,608 to see what triggered that part of the re-optimization. I’d also want to know whether the final execution plan came from the fifth series and could be reached without involving all the join predicate pushdown work, or whether it was a plan that was only going to appear after the optimizer had worked through all 4 iterations.
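If the trace file ever turned up again, pulling out the lines just before that point would be a simple one-liner (standard sed; the exact range is just a guess at how much context would be needed):

sed -n '4458300,4458610p' orcl_ora_25306.trc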

The final plan did involve all 3 pushed predicates (which looks like it might have been from iteration 4), so it might have been possible to find a generic strategy for forcing unconditional predicate pushing without doing all the expensive intermediate work.

Version 12c and beyond

That was then, and this is now. And something completely different might have appeared in 12c (or 19c) – but the one thing that is particularly helpful is that you can bet that every iteration of the JPPD state spaces would have produced a "TIMER:" line in the trace file, making it very easy to run grep -n "TIMER:" (or -nT as I recently discovered) against the trace file to pinpoint the issue very quickly.

Here’s an example from my "killer_parse.sql" query after setting "_fix_control"='16923858:4' (1e4 microseconds = 1/100th second) in an instance of 19c:


$ grep -nT TIMER or19_ora_21051.trc

16426  :TIMER:      bitmap access paths cpu: 0.104006 sec elapsed: 0.105076 sec
252758 :TIMER:     costing general plans cpu: 0.040666 sec elapsed: 0.040471 sec
309460 :TIMER:      bitmap access paths cpu: 0.079509 sec elapsed: 0.079074 sec
312584 :TIMER: CBQT OR expansion SEL$765CDFAA cpu: 10.474142 sec elapsed: 10.508788 sec
313974 :TIMER: Complex View Merging SEL$765CDFAA cpu: 1.475173 sec elapsed: 1.475418 sec
315716 :TIMER: Table Expansion SEL$765CDFAA cpu: 0.046262 sec elapsed: 0.046647 sec
316036 :TIMER: Star Transformation SEL$765CDFAA cpu: 0.029077 sec elapsed: 0.026912 sec
318207 :TIMER: Type Checking after CBQT SEL$765CDFAA cpu: 0.220506 sec elapsed: 0.219273 sec
318208 :TIMER: Cost-Based Transformations (Overall) SEL$765CDFAA cpu: 13.632516 sec elapsed: 13.666360 sec
328948 :TIMER:      bitmap access paths cpu: 0.093973 sec elapsed: 0.095008 sec
632935 :TIMER: Access Path Analysis (Final) SEL$765CDFAA cpu: 7.703016 sec elapsed: 7.755957 sec
633092 :TIMER: SQL Optimization (Overall) SEL$765CDFAA cpu: 21.539010 sec elapsed: 21.632012 sec

The closing 21.63 seconds (line 633,092) is largely the 7.756 seconds of final access path analysis (line 632,935) plus the 13.666 seconds of Cost-Based Transformation time (line 318,208); and that 13.666 seconds is mostly the 1.475 seconds of Complex View Merging (line 313,974) plus the 10.508 seconds of CBQT OR expansion (line 312,584) – so let's try disabling OR expansion (alter session set "_no_or_expansion"=true;) and try again:


$ grep -nT TIMER or19_ora_22205.trc
14884  :TIMER:      bitmap access paths cpu: 0.062453 sec elapsed: 0.064501 sec
15228  :TIMER: Access Path Analysis (Final) SEL$1 cpu: 0.256751 sec elapsed: 0.262467 sec
15234  :TIMER: SQL Optimization (Overall) SEL$1 cpu: 0.264099 sec elapsed: 0.268183 sec

Not only was optimisation faster, the runtime was quicker too.

Warning – it’s not always that easy.

 

Library Cache Stats

Sun, 2019-11-17 03:36

In response to a comment that one of my notes references a call to a package "snap_libcache", I've posted this version of the SQL that can be run by SYS to create the package, with a public synonym, and privileges granted to public to execute it. The package doesn't report the DLM (RAC) related activity, and is suitable only for 11g onwards (older versions require a massive decode of an index value to convert indx numbers into names).

rem
rem     Script:         snap_11_libcache.sql
rem     Author:         Jonathan Lewis
rem     Dated:          March 2001 (updated for 11g)
rem     Purpose:        Package to get snapshot start and delta of library cache stats
rem
rem     Notes
rem             Lots of changes needed by 11.2.x.x where x$kglst holds
rem             two types - TYPE (107) and NAMESPACE (84) - but no
rem             longer needs a complex decode.
rem
rem             Has to be run by SYS to create the package
rem
rem     Usage:
rem             set serveroutput on size 1000000 format wrapped
rem             set linesize 144
rem             set trimspool on
rem             execute snap_libcache.start_snap
rem             -- do something
rem             execute snap_libcache.end_snap
rem

create or replace package snap_libcache as
        procedure start_snap;
        procedure end_snap;
end;
/

create or replace package body snap_libcache as

cursor c1 is
        select
                indx,
                kglsttyp        lib_type,
                kglstdsc        name,
                kglstget        gets,
                kglstght        get_hits,
                kglstpin        pins,
                kglstpht        pin_hits,
                kglstrld        reloads,
                kglstinv        invalidations,
                kglstlrq        dlm_lock_requests,
                kglstprq        dlm_pin_requests,
--              kglstprl        dlm_pin_releases,
--              kglstirq        dlm_invalidation_requests,
                kglstmiv        dlm_invalidations
        from x$kglst
        ;

type w_type1 is table of c1%rowtype index by binary_integer;
w_list1         w_type1;
w_empty_list    w_type1;

m_start_time    date;
m_start_flag    char(1);
m_end_time      date;

procedure start_snap is
begin

        m_start_time := sysdate;
        m_start_flag := 'U';
        w_list1      := w_empty_list;

        for r in c1 loop
                w_list1(r.indx).gets          := r.gets;
                w_list1(r.indx).get_hits      := r.get_hits;
                w_list1(r.indx).pins          := r.pins;
                w_list1(r.indx).pin_hits      := r.pin_hits;
                w_list1(r.indx).reloads       := r.reloads;
                w_list1(r.indx).invalidations := r.invalidations;
        end loop;

end start_snap;

procedure end_snap is
begin

        m_end_time := sysdate;

        dbms_output.put_line('---------------------------------');
        dbms_output.put_line('Library Cache - ' ||
                to_char(m_end_time,'dd-Mon hh24:mi:ss')
        );

        if m_start_flag = 'U' then
                dbms_output.put_line('Interval:-      ' ||
                        trunc(86400 * (m_end_time - m_start_time)) ||
                        ' seconds'
                );
        else
                dbms_output.put_line('Since Startup:- ' ||
                        to_char(m_start_time,'dd-Mon hh24:mi:ss')
                );
        end if;

        dbms_output.put_line('---------------------------------');

        dbms_output.put_line(
                rpad('Type',10) ||
                rpad('Description',41) ||
                lpad('Gets',12) ||
                lpad('Hits',12) ||
                lpad('Ratio',6) ||
                lpad('Pins',12) ||
                lpad('Hits',12) ||
                lpad('Ratio',6) ||
                lpad('Invalidations',14) ||
                lpad('Reloads',10)
        );

        dbms_output.put_line(
                rpad('-----',10) ||
                rpad('-----',41) ||
                lpad('----',12) ||
                lpad('----',12) ||
                lpad('-----',6) ||
                lpad('----',12) ||
                lpad('----',12) ||
                lpad('-----',6) ||
                lpad('-------------',14) ||
                lpad('-------',10)
        );

        for r in c1 loop
                if (not w_list1.exists(r.indx)) then
                        w_list1(r.indx).gets          := 0;
                        w_list1(r.indx).get_hits      := 0;
                        w_list1(r.indx).pins          := 0;
                        w_list1(r.indx).pin_hits      := 0;
                        w_list1(r.indx).invalidations := 0;
                        w_list1(r.indx).reloads       := 0;
                end if;

                if (
                           (w_list1(r.indx).gets          != r.gets)
                        or (w_list1(r.indx).get_hits      != r.get_hits)
                        or (w_list1(r.indx).pins          != r.pins)
                        or (w_list1(r.indx).pin_hits      != r.pin_hits)
                        or (w_list1(r.indx).invalidations != r.invalidations)
                        or (w_list1(r.indx).reloads       != r.reloads)
                ) then

                        dbms_output.put(rpad(substr(r.lib_type,1,10),10));
                        dbms_output.put(rpad(substr(r.name,1,41),41));
                        dbms_output.put(to_char(
                                r.gets - w_list1(r.indx).gets,
                                '999,999,990')
                        );
                        dbms_output.put(to_char(
                                r.get_hits - w_list1(r.indx).get_hits,
                                '999,999,990'));
                        dbms_output.put(to_char(
                                (r.get_hits - w_list1(r.indx).get_hits)/
                                greatest(
                                        r.gets - w_list1(r.indx).gets,
                                        1
                                ),
                                '999.0'));
                        dbms_output.put(to_char(
                                r.pins - w_list1(r.indx).pins,
                                '999,999,990')
                        );
                        dbms_output.put(to_char(
                                r.pin_hits - w_list1(r.indx).pin_hits,
                                '999,999,990'));
                        dbms_output.put(to_char(
                                (r.pin_hits - w_list1(r.indx).pin_hits)/
                                greatest(
                                        r.pins - w_list1(r.indx).pins,
                                        1
                                ),
                                '999.0'));
                        dbms_output.put(to_char(
                                r.invalidations - w_list1(r.indx).invalidations,
                                '9,999,999,990')
                        );
                        dbms_output.put(to_char(
                                r.reloads - w_list1(r.indx).reloads,
                                '9,999,990')
                        );
                        dbms_output.new_line;
                end if;

        end loop;

end end_snap;

begin
        select
                startup_time, 'S'
        into
                m_start_time, m_start_flag
        from
                v$instance;

end snap_libcache;
/

drop public synonym snap_libcache;
create public synonym snap_libcache for snap_libcache;
grant execute on snap_libcache to public;

You’ll note that there are two classes of data, “namespace” and “type”. The dynamic view v$librarycache reports only the namespace rows.
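If you want to see the two classes for yourself, a quick check (run as SYS since it references an x$ structure – just a sketch) is:


select  kglsttyp, count(*)
from    x$kglst
group by
        kglsttyp
order by
        kglsttyp
;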

Nested Tables

Sat, 2019-11-09 05:28

This note is a little side effect of answering a question about the total space associated with a table, including its indexes, LOB columns and nested tables. The first suggestion I published failed to take account of the fact that nested tables can contain their own nested tables so I had to modify the code.

The basic change was easy – in the original code I had joined to the view dba_nested_tables to translate between a table_name and its parent_table_name. The only change I needed to make was to replace the reference to the view with a different view that joined a table_name to its “oldest ancestor”. To achieve this I had to do two things: first, create a table with multiple levels of nesting, then create a suitable view definition.  For reference – because it may help somebody – I’ve published the two stages here.

A revolting nested table:

The following code creates three table types, but the second table type is defined as a table of an object type that includes a column of the third table type, and the first is defined as a table of an object type that includes a column of the second – giving us three levels of nesting:


rem
rem     Script:         nested_tables.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Nov 2019
rem
rem     Last tested 
rem             19.3.0.0
rem             12.2.0.1
rem

create or replace type jpl_item3 as object (v1 varchar2(10), n3 number);
/

create or replace type jpl_tab3_type as table of jpl_item3;
/

create or replace type jpl_item2 as object(v2 varchar2(10), jpl3 jpl_tab3_type);
/

create or replace type jpl_tab2_type as table of jpl_item2;
/

create or replace type jpl_item1 as object(v3 varchar2(10), jpl2 jpl_tab2_type);
/

create or replace type jpl_tab1_type as table of jpl_item1;
/

create table demo_nest_2 (
        id      number  constraint d2_pk primary key,
        jpl1    jpl_tab1_type
)
segment creation immediate
nested table jpl1 store as t_jpl1
        (
        nested table jpl2  store as t_jpl2
                (
                nested table jpl3 store as t_jpl3 
                return as locator
                )
        return as locator
        )
return as locator
/

I’ve never seen nested tables manually created in a production system though I believe they are likely to appear (along with varrays and LOBs) as a side effect of some XML or JSON mechanisms, but many years ago (in Practical Oracle 8i, possibly) I discussed the pros and cons of returning them “by value” or (as I have here) “by reference”. As you can see, you need to exercise some care with brackets and locating the text as you try to refine multiple levels of nesting.

The Ancestor View

We'll take this in three steps – first, report from user_nested_tables. (The final script for reporting space used dba_nested_tables with a predicate available on the owner column, but I don't give myself DBA privileges while writing quick and dirty bits of SQL.)


select 
        parent_table_name, parent_table_column, table_name, 
        ltrim(storage_spec) storage_spec,       -- the DBA view definition includes lpad( , 30) !!
        ltrim(return_type ) return_type         -- the DBA view definition includes lpad( , 20) !!
from 
        user_nested_tables
order by
        parent_table_name, parent_table_column
/

PARENT_TABLE_NAME    PARENT_TABLE_COLUMN  TABLE_NAME           STORAGE_SPEC         RETURN_TYPE
-------------------- -------------------- -------------------- -------------------- --------------------
DEMO_NEST_2          JPL1                 T_JPL1               USER_SPECIFIED       LOCATOR
T_JPL1               JPL2                 T_JPL2               USER_SPECIFIED       LOCATOR
T_JPL2               JPL3                 T_JPL3               DEFAULT              LOCATOR

You'll notice the odd ltrim() – I have no idea why the view defines these columns to be left-padded the way they are; possibly it dates back to the time when something like cast(… as varchar2(30)) wasn't a possible option.

Next, a simple "connect by" query that uses the above list in a materialized "with subquery" (CTE):


with my_nested_tables as (
select
        /*+ materialize */
        parent_table_name, table_name
from
        user_nested_tables
)
select
        parent_table, child_table, pathname
from    (
        select
                level,
                sys_connect_by_path(table_name, '/')    pathname,
                connect_by_root parent_table_name parent_table,
                table_name child_table
        from
                my_nested_tables
        connect by
                parent_table_name = prior table_name
        )
order by
        parent_table, child_table, pathname
/

PARENT_TABLE         CHILD_TABLE          PATHNAME
-------------------- -------------------- ----------------------------------------
DEMO_NEST_2          T_JPL1               /T_JPL1
DEMO_NEST_2          T_JPL2               /T_JPL1/T_JPL2
DEMO_NEST_2          T_JPL3               /T_JPL1/T_JPL2/T_JPL3
T_JPL1               T_JPL2               /T_JPL2
T_JPL1               T_JPL3               /T_JPL2/T_JPL3
T_JPL2               T_JPL3               /T_JPL3


As required this shows me demo_nest_2 as the owning ancestor of t_jpl1, t_jpl2 and t_jpl3. Unfortunately it has also produced three rows that we don’t want to see in our final space-summing code. But it’s easy enough to get rid of those – the only rows we want are the rows with a parent_table that doesn’t appear as a child_table:


with my_nested_tables as (
select
        /*+ materialize */
        parent_table_name, table_name
from
        user_nested_tables
)
select  parent_table, child_table, pathname
from    (
        select
                level,
                sys_connect_by_path(table_name, '/')    pathname,
                connect_by_root parent_table_name parent_table,
                table_name child_table
        from
                my_nested_tables
        connect by
                parent_table_name = prior table_name
        )
where   (parent_table) not in (
                select table_name
                from   my_nested_tables
        )
order by
        parent_table, child_table, pathname
/

PARENT_TABLE         CHILD_TABLE          PATHNAME
-------------------- -------------------- ----------------------------------------
DEMO_NEST_2          T_JPL1               /T_JPL1
DEMO_NEST_2          T_JPL2               /T_JPL1/T_JPL2
DEMO_NEST_2          T_JPL3               /T_JPL1/T_JPL2/T_JPL3

3 rows selected.


And there’s the result we wanted to see. A first simple corroboration of the fact is that the output corresponds with the “NESTED TABLE” segments reported by user_segments.
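The corroboration itself is a one-liner (a sketch):


select  segment_name, segment_type, bytes
from    user_segments
where   segment_type = 'NESTED TABLE'
order by
        segment_name
;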

Of course, having written a query that gets the right result from a table definition we used to help us define the query in the first place, we now ought to create a few more tables with different structures to see if the query continues to give the right results in more complex cases.

What happens, for example, if I create a table with two columns of nested tables, and one of the nested table types also contains two nested tables? What happens if the base table is an index organized table?

It’s easy to do the second test – just add “organization index” immediately after “segment creation immediate” in the table creation statement. The correct results drop out.

As for the first test – here’s the SQL to create one example (and the query still gets the right answers – even if you change the table to be index organized).


drop type jpl_tab1_type;
drop type jpl_item1;

drop type jpl_tab2_type;
drop type jpl_item2;

drop type jpl_tab3_type;
drop type jpl_item3;

purge recyclebin;

create or replace type jpl_item3 as object (v1 varchar2(10), n3 number);
/

create or replace type jpl_tab3_type as table of jpl_item3;
/

create or replace type jpl_item2 as object(v2 varchar2(10), jpl3 jpl_tab3_type, jpl3x jpl_tab3_type);
/

create or replace type jpl_tab2_type as table of jpl_item2;
/

create or replace type jpl_item1 as object(v3 varchar2(10), jpl2 jpl_tab2_type)
/

create or replace type jpl_tab1_type as table of jpl_item1;
/

create table demo_nest_3 (
        id      number  constraint d2_pk primary key,
        jpl1    jpl_tab1_type,
        jpl1a   jpl_tab1_type
)
segment creation immediate
-- organization index
nested table jpl1 store as t_jpl1
        (
        nested table jpl2  store as t_jpl2
                (
                nested table jpl3 store as t_jpl3 
                return as locator
                nested table jpl3x store as t_jpl3x 
                return as locator
                )
        return as locator
        )
return as locator
nested table jpl1a store as t_jpl1a
        (
        nested table jpl2  store as t_jpl2a
                (
                nested table jpl3 store as t_jpl3a
                return as locator
                nested table jpl3x store as t_jpl3xa
                return as locator
                )
        return as locator
        )
return as locator
/

All that remains now is to modify the code to use the equivalent DBA views, with the addition of the owner column, then slot the resulting code into the original query in place of the simple references to dba_nested_tables. If you go to the original posting you’ll see that I’ve done this by wrapping the code into a CTE so that the only changes to the rest of the code appear as two (flagged) changes where the CTE is then used.
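The original posting has the definitive version of that CTE, but as a rough sketch of its shape (switching to dba_nested_tables and carrying the owner column through the connect by) it looks something like this:


with my_nested_tables as (
        select
                /*+ materialize */
                owner, parent_table_name, table_name
        from
                dba_nested_tables
)
select
        owner, parent_table, child_table
from    (
        select
                owner,
                connect_by_root parent_table_name parent_table,
                table_name child_table
        from
                my_nested_tables
        connect by
                owner             = prior owner
        and     parent_table_name = prior table_name
        )
where   (owner, parent_table) not in (
                select owner, table_name
                from   my_nested_tables
        )
order by
        owner, parent_table, child_table
/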

Table Space

Wed, 2019-11-06 11:59

There’s a request on the Oracle Developer Forum for assistance to write a report that shows the total space allocation for a table – including its indexes and LOB columns. There is no requirement to allow for partitioned tables, but it’s not entirely clear whether the code should allow for index-organized tables and clustered tables, and there’s no comment about nested tables.

The OP has made an attempt to write a suitable query, but the query joins dba_tables to dba_indexes then to dba_lobs then three times to dba_segments (all the joins being outer joins) before aggregating on table name. Unsurprisingly this produces the wrong results because (for example) if a table has two indexes the join from dba_tables to dba_indexes will double the bytes reported for the table. As Andrew Sayer points out in the thread, it would be better to aggregate on each of the separate component types before joining.

This introduces an important, generic, principle to writing code.

  • If it's difficult to write a single statement, can you break the task down into separate components that are easier to handle?
  • If you can express the problem as a small set of simpler components, can you then combine the components?
  • If this approach works, is any loss of efficiency balanced by a gain in clarity and maintainability?

In the case of the OP, this isn't a query that's going to be running every few seconds – maybe it will be once per day, maybe three or four times per day – so there's no need to be worried about making it as efficient as possible; let's go for simplicity.

Here's a query that reports the space allocated for a simple heap table. It references dba_tables and dba_segments, so has to be run by a user with DBA privileges, and at the moment it's restricted to reporting a single user.

Wait – before I write the query I ought to create a testbed to see if the query is working. So let’s take an empty schema and create a few objects. Let’s create

  • a simple heap table with two indexes and two LOB columns.
  • an Index Organized Table (IOT) with an overflow and two LOB columns.
  • a table with two columns that are a nested table type

Here’s some suitable code:


rem
rem     Script:         sum_table_space.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Nov 2019
rem
rem     Last tested 
rem             19.3.0.0
rem             12.2.0.1
rem

prompt  ============================================================
prompt  Simple heap table - copy of all_objects - then add two CLOBs
prompt  ============================================================

create table test_user.t1 as select * from all_objects;

alter table test_user.t1 add (c1 clob) lob (c1) store as basicfile t1_lob1 (disable storage in row);
alter table test_user.t1 add (c2 clob) lob (c2) store as basicfile t1_lob2 (disable storage in row);

update test_user.t1 set c1 = 'x', c2 = 'x' where rownum <= 125;

create index test_user.t1_i1 on test_user.t1(object_name, object_type, object_id);
create index test_user.t1_i2 on test_user.t1(object_type, object_name, object_id);

prompt  =========================================================
prompt  Simple IOT table - with a couple of CLOBs in the overflow
prompt  =========================================================

create table test_user.iot1 (
        id              number          not null,
        ind_pad         varchar2(20)    not null,
        n2              number          not null,
        n3              number          not null,
        padding         varchar2(50),
        ct1             clob,
        ct2             clob,
        constraint iot1_pk primary key (id, ind_pad)
)
organization index
        overflow
        including n3
lob(ct1) store as basicfile iot_lob1(
        disable storage in row
)
lob(ct2) store as basicfile iot_lob2(
        disable storage in row
)
;

insert into test_user.iot1
select
        rownum,
        rpad(rownum,20),
        rownum,
        rownum,
        rpad('x',50,'x'),
        case when mod(rownum,100) = 0 then rpad('x',100,'x') end,
        case when mod(rownum,100) = 0 then rpad('x',100,'x') end
from
        all_objects
where
        rownum <= 10000
;

commit;

prompt  ====================================================
prompt  Now create types to allow creation of a nested table
prompt  ====================================================

create type test_user.jpl_item as object (n1 number, v1 varchar2(32));
/

create type test_user.jpl_tab_type as table of jpl_item;
/

create table test_user.demo_nest (
        id      number,
        nest1   test_user.jpl_tab_type,
        nest2   test_user.jpl_tab_type
)
nested table nest1 store as t_nested1
return as locator
nested table nest2 store as t_nested2
return as locator
;

create unique index test_user.tn1_pk on test_user.t_nested1(nested_table_id, n1) compress 1;
create unique index test_user.tn2_pk on test_user.t_nested2(nested_table_id, n1) compress 1;

create index  test_user.tn1_v1 on test_user.t_nested1(v1);
create index  test_user.tn2_v1 on test_user.t_nested2(v1);

insert into test_user.demo_nest values (
        1,
        test_user.jpl_tab_type(
                test_user.jpl_item(1,'Item One one'),
                test_user.jpl_item(2,'Item One two')
        ),
        test_user.jpl_tab_type(
                test_user.jpl_item(1,'Item One one'),
                test_user.jpl_item(2,'Item One two')
        )
);

insert into test_user.demo_nest values (
        2,
        test_user.jpl_tab_type(
                test_user.jpl_item(1,'Item Two one'),
                test_user.jpl_item(2,'Item Two two')
        ),
        test_user.jpl_tab_type(
                test_user.jpl_item(1,'Item Two one'),
                test_user.jpl_item(2,'Item Two two')
        )
);

commit;

You’ll notice that I’ve prefixed every table, index and type name with a schema name. This is because I set up this test to run as a DBA so I’m creating the objects while connected with a different id.

Of course, before trying to report allocations summed by base table, it would be sensible to produce a simple list of the segments we should see so that we know when we’ve reported all of them. So let’s start with that very simple report:


column bytes format 999,999,999,999
column segment_name format a25

break on report on owner on object_name skip 1
compute sum of bytes on report

select
        segment_name, segment_type, bytes
from
        dba_segments
where
        owner = 'TEST_USER'
order by
        segment_type, segment_name
;

SEGMENT_NAME              SEGMENT_TYPE                  BYTES
------------------------- ------------------ ----------------
IOT1_PK                   INDEX                     1,048,576
SYS_C008380               INDEX                     1,048,576
SYS_C008381               INDEX                     1,048,576
SYS_FK0000075579N00002$   INDEX                     1,048,576
SYS_FK0000075579N00004$   INDEX                     1,048,576
T1_I1                     INDEX                     5,242,880
T1_I2                     INDEX                     5,242,880
TN1_PK                    INDEX                     1,048,576
TN1_V1                    INDEX                     1,048,576
TN2_PK                    INDEX                     1,048,576
TN2_V1                    INDEX                     1,048,576
SYS_IL0000075565C00027$$  LOBINDEX                  1,048,576
SYS_IL0000075565C00028$$  LOBINDEX                  1,048,576
SYS_IL0000075572C00006$$  LOBINDEX                  1,048,576
SYS_IL0000075572C00007$$  LOBINDEX                  1,048,576
IOT_LOB1                  LOBSEGMENT                1,048,576
IOT_LOB2                  LOBSEGMENT                1,048,576
T1_LOB1                   LOBSEGMENT                2,097,152
T1_LOB2                   LOBSEGMENT                2,097,152
T_NESTED1                 NESTED TABLE              1,048,576
T_NESTED2                 NESTED TABLE              1,048,576
DEMO_NEST                 TABLE                     1,048,576
SYS_IOT_OVER_75572        TABLE                     1,048,576
T1                        TABLE                    12,582,912
                                             ----------------
sum                                                47,185,920

So we have a list of segments, and we have a sum of bytes to aim for. One thing you might notice, though, is that the name "IOT1" has "disappeared"; instead the thing we see as a "TABLE" is its overflow segment, called SYS_IOT_OVER_75572 (the number being the object_id of the table we originally defined). We will have to do something in our code to translate that sys_iot_over_75572 to iot1 if we want to make our final report easy to comprehend.

There are a few other oddities in the list, but some of them we may be familiar with already – the indexes with names like SYS_IL0000075565C00027$$ are the LOB indexes for the four LOBs we created. (I deliberately created basicfile LOBs in case the OP wasn't using the securefile LOBs that most systems are now using.) The 75565 in the sample here is the object_id of the base table; the C00027 tells us that the LOB is column 27 of the table definition.
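If you don't want to decode the names by hand, here's a sketch of the cross-check against dba_lobs, which reports both the LOB segment and its LOB index against the owning table and column:


select  table_name, column_name, segment_name, index_name
from    dba_lobs
where   owner = 'TEST_USER'
order by
        table_name, column_name
;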

You may be wondering about the indexes like SYS_C008380 and SYS_FK0000075579N00002$. We may need to work on those – but I’ll tell you the answer straight away – the former is a unique index on the “nest1” column  in the demo_nest table, the latter is a “foreign key” index on the (invisible) nested_table_id column in the nested tables.

Side note: although the sys_fk0000075579N00002$ index looks as if it's supposed to be a "foreign key" index, as suggested by the name, the dependency, and the function, there's no foreign key constraint declared between the parent table and the nested table. Also, if you check dba_indexes, the generated column has the value 'N', even though the name of the index was clearly generated by Oracle internal code. (The index at the other end of the join – sys_c008380 – is, however, flagged with generated = 'Y'.)
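A quick way to check those flags (a sketch – the system-generated names will, of course, be different on any other system):


select  table_name, index_name, uniqueness, generated
from    dba_indexes
where   table_owner = 'TEST_USER'
and     table_name in ('DEMO_NEST', 'T_NESTED1', 'T_NESTED2')
order by
        table_name, index_name
;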

Let’s start building our solution with the easiest bit.


select 
        tab.owner, 
        cast('TABLE' as varchar2(11))   object_type,
        tab.table_name,
        nvl(seg_tab.bytes,0)            bytes  
from
        dba_tables        tab  
left outer join
        dba_segments      seg_tab  
on
        seg_tab.owner      = tab.owner
and     seg_tab.segment_name = tab.table_name
where
        tab.owner          = 'TEST_USER'
/


OWNER           OBJECT_TYPE TABLE_NAME                      BYTES
--------------- ----------- -------------------- ----------------
TEST_USER       TABLE       T1                         12,582,912
                TABLE       SYS_IOT_OVER_75572          1,048,576
                TABLE       DEMO_NEST                   1,048,576
                TABLE       IOT1                                0
***************                                  ----------------
sum                                                    14,680,064

Immediately we have to correct two problems – we can see the original iot1 table, but it doesn't have an associated data segment; it has a "top", which is the primary key index segment, and an "overflow", which is designated by Oracle as a table segment. So we need to check the columns in dba_tables to work out how to get the more appropriate name reported for the table segment, and add a predicate to hide the original name. Here's "tables only mark 2":

select 
        tab.owner, 
        cast('TABLE' as varchar2(11))   object_type,
        case
                 when tab.iot_type = 'IOT_OVERFLOW' 
                        then tab.iot_name
                else tab.table_name
        end                             object_name,
        case
                 when tab.iot_type = 'IOT_OVERFLOW' 
                        then tab.table_name
        end                             auxiliary_name,
        nvl(seg_tab.bytes,0)            bytes  
from
        dba_tables        tab  
left outer join
        dba_segments      seg_tab  
on
        seg_tab.owner        = tab.owner
and     seg_tab.segment_name = tab.table_name
where
        tab.owner            = 'TEST_USER'
and     nvl(tab.iot_type,'NORMAL') != 'IOT'


OWNER           OBJECT_TYPE OBJECT_NAME          AUXILIARY_NAME                       BYTES
--------------- ----------- -------------------- ------------------------- ----------------
TEST_USER       TABLE       T1                                                   12,582,912
TEST_USER       TABLE       IOT1                 SYS_IOT_OVER_75572               1,048,576
TEST_USER       TABLE       DEMO_NEST                                             1,048,576

I’ve identified the table type using the iot_type column in dba_tables, hidden rows where the iot_type is ‘IOT’, and reported the iot_name (with table_name as an auxiliary name) when the iot_type is ‘IOT_OVERFLOW’. And we can now check that the result is consistent with the content of dba_segments that lists segment_type = ‘TABLE’.

So let’s move on to indexes.

select 
        ind.table_owner                 owner, 
        cast('INDEX' as varchar2(11))   object_type,
        ind.table_name,
        index_name                      auxiliary_name,
        nvl(seg_ind.bytes,0)            bytes  
from
        dba_indexes       ind  
left outer join 
        dba_segments      seg_ind  
on
        ind.owner      = seg_ind.owner
and     ind.index_name = seg_ind.segment_name  
where
        ind.table_owner = 'TEST_USER'


OWNER           OBJECT_TYPE TABLE_NAME           AUXILIARY_NAME                       BYTES
--------------- ----------- -------------------- ------------------------- ----------------
TEST_USER       INDEX       T1                   SYS_IL0000075565C00027$$         1,048,576
TEST_USER       INDEX       T1                   SYS_IL0000075565C00028$$         1,048,576
TEST_USER       INDEX       T1                   T1_I1                            5,242,880
TEST_USER       INDEX       T1                   T1_I2                            5,242,880
TEST_USER       INDEX       IOT1                 IOT1_PK                          1,048,576
TEST_USER       INDEX       IOT1                 SYS_IL0000075572C00007$$         1,048,576
TEST_USER       INDEX       IOT1                 SYS_IL0000075572C00006$$         1,048,576
TEST_USER       INDEX       T_NESTED1            SYS_FK0000075579N00002$          1,048,576
TEST_USER       INDEX       T_NESTED2            SYS_FK0000075579N00004$          1,048,576
TEST_USER       INDEX       DEMO_NEST            SYS_C008380                      1,048,576
TEST_USER       INDEX       DEMO_NEST            SYS_C008381                      1,048,576
TEST_USER       INDEX       T_NESTED1            TN1_PK                           1,048,576
TEST_USER       INDEX       T_NESTED2            TN2_PK                           1,048,576
TEST_USER       INDEX       T_NESTED1            TN1_V1                           1,048,576
TEST_USER       INDEX       T_NESTED2            TN2_V1                           1,048,576

A quick check shows that we’ve picked up the 15 indexes reported by dba_segments – although I’ve labelled them all just as “INDEX” while dba_segments does label LOB indexes differently. As you can see I’ve reported the table name in each case with the index name as the “auxiliary” name. This will be of use when I’m summing up the space associated with each table.
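The quick check is nothing more sophisticated than a couple of counts (a sketch):


select  segment_type, count(*)
from    dba_segments
where   owner = 'TEST_USER'
and     segment_type in ('INDEX', 'LOBINDEX')
group by
        segment_type
order by
        segment_type
;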

On to the LOBs – also easy:

select 
        lob.owner, 
        cast('LOB' as varchar2(11))   object_type,
        lob.table_name,
        lob.column_name               auxiliary_name,
        nvl(seg_lob.bytes,0)          bytes  
from
        dba_lobs    lob
left outer join 
        dba_segments      seg_lob
on
        seg_lob.owner        = lob.owner
and     seg_lob.segment_name = lob.segment_name
where 
        lob.owner            = 'TEST_USER'


OWNER           OBJECT_TYPE TABLE_NAME           AUXILIARY_NAME                       BYTES
--------------- ----------- -------------------- ------------------------- ----------------
TEST_USER       LOB         T1                   C1                               2,097,152
TEST_USER       LOB         T1                   C2                               2,097,152
TEST_USER       LOB         IOT1                 CT1                              1,048,576
TEST_USER       LOB         IOT1                 CT2                              1,048,576

And finally the nested tables:


select 
        nst.owner, 
        cast('NESTED' as varchar2(11))   object_type,
        nst.parent_table_name            object_name,
        table_name                       auxiliary_name,
        nvl(seg_nst.bytes,0)             bytes  
from
        dba_nested_tables    nst  
left outer join 
       dba_segments          seg_nst  
on
        seg_nst.owner        = nst.owner
and     seg_nst.segment_name = nst.table_name  
where
        nst.owner            = 'TEST_USER'

OWNER           OBJECT_TYPE OBJECT_NAME          AUXILIARY_NAME                       BYTES
--------------- ----------- -------------------- ------------------------- ----------------
TEST_USER       NESTED      DEMO_NEST            T_NESTED1                        1,048,576
TEST_USER       NESTED      DEMO_NEST            T_NESTED2                        1,048,576

A quick check tells us that the four pieces of code produce an exact match for all the segments reported in dba_segments – so all we have to do now is stitch the four queries together with UNION ALL, then sum(bytes) by owner and table_name.

There are 3 little problems to deal with though.

  • We have the predicate "where {owner} = 'TEST_USER'" appearing 4 times in total, which is a bit messy. If we put this outside the UNION ALL will Oracle be smart enough to push it back inside the UNION ALL as part of its query transformation? It seems to; the plan (which is several hundred lines long – so I'm not going to look too closely) does change a little, but on my little test system it didn't make much difference to the performance. Initially it looks as if it might be okay to create a view from this query with no restriction on table owner, and allow users to query the view with a schema name.
  • The code to report indexes reports the indexes on the nested tables under t_nested1 and t_nested2 – we want the indexes to be associated with table demo_nest, so we need to refine that bit of the code. It needs an outer join to dba_nested_tables to supply the parent_table_name if it exists.
  • On my little system the query takes several seconds to run. Should I worry about that? Not until I've tested it on a much bigger system, and not until I know how frequently it needs to run. It may be good enough as it stands, and the ease with which I modified the first version of my code to handle the nested tables indexing issue is an indicator of the benefits of keeping a complex job as the sum of its parts if it's reasonable to do so.

Let’s go straight to the final (so far) SQL:


select
        owner, object_name, sum(bytes) 
from    (
        select 
                tab.owner, 
                cast('TABLE' as varchar2(11))   object_type,
                case
                         when tab.iot_type = 'IOT_OVERFLOW' 
                                then tab.iot_name
                        else tab.table_name
                end                             object_name,
                case
                         when tab.iot_type = 'IOT_OVERFLOW' 
                                then tab.table_name
                end                             auxiliary_name,
                nvl(seg_tab.bytes,0)            bytes  
        from
                dba_tables        tab  
        left outer join
                dba_segments      seg_tab  
        on
                seg_tab.owner        = tab.owner
        and     seg_tab.segment_name = tab.table_name
        where
                tab.owner            = 'TEST_USER'
        and     nvl(tab.iot_type,'NORMAL') != 'IOT'
        union all
        select 
                ind.table_owner                  owner, 
                cast('INDEX' as varchar2(11))    object_type,
                nvl(
                        nst.parent_table_name,
                        ind.table_name
                )                                table_name,
                index_name                       auxiliary_name,
                nvl(seg_ind.bytes,0)             bytes  
        from
                dba_indexes       ind  
        left outer join
                dba_nested_tables nst
        on
                nst.owner       = ind.table_owner
        and     nst.table_name  = ind.table_name
        left outer join 
                dba_segments      seg_ind  
        on
                seg_ind.owner         = ind.owner
        and     seg_ind.segment_name  = ind.index_name 
        where
                ind.table_owner = 'TEST_USER'
        union all
        select 
                lob.owner, 
                cast('LOB' as varchar2(11))   object_type,
                lob.table_name,
                lob.column_name               auxiliary_name,
                nvl(seg_lob.bytes,0)          bytes  
        from
                dba_lobs    lob
        left outer join 
                dba_segments      seg_lob
        on
                seg_lob.owner        = lob.owner
        and     seg_lob.segment_name = lob.segment_name
        where 
                lob.owner            = 'TEST_USER'
        union all
        select 
                nst.owner, 
                cast('NESTED' as varchar2(11))   object_type,
                nst.parent_table_name            object_name,
                table_name                       auxiliary_name,
                nvl(seg_nst.bytes,0)             bytes  
        from
                dba_nested_tables    nst  
        left outer join 
               dba_segments          seg_nst  
        on
                seg_nst.owner        = nst.owner
        and     seg_nst.segment_name = nst.table_name
        where
                nst.owner         = 'TEST_USER'
        )
where
        owner = 'TEST_USER'
group by
        owner, object_name
order by
        sum(bytes)
/

OWNER           OBJECT_NAME                SUM(BYTES)
--------------- -------------------- ----------------
TEST_USER       IOT1                        6,291,456
TEST_USER       DEMO_NEST                  11,534,336
TEST_USER       T1                         29,360,128
                                     ----------------
sum                                        47,185,920


I wrote this note because my response to the original query was an informal suggestion to take this approach; then I thought it might be worth writing a general comment about solving big problems by starting with little problems and stitching the pieces together, using this case as a concrete example – then I just went ahead and finished the thing off within the boundaries of the original requirements.

There's a reason, though, why I tend to avoid publishing SQL that might be useful – it's always possible to overlook bits and end up with people trying to use code that's not totally appropriate to their system. In this case, of course, there's a total (but deliberate) failure to handle clusters, partitions and subpartitions, but I don't know how many other of the "less commonly used" bits of Oracle I haven't considered. I do know that I haven't touched on domain indexes (such as text indexes with all their convoluted collections of tables and indexes) but there may be some oddities of (e.g.) advanced queueing, replication, and audit vault that fall outside the range of the possibilities I've covered above.

Update 7th Nov 2019

The OP from OTN reported seeing some nested tables in the output – which I thought I'd catered for – so I modified the code I have on my sandbox to add a nested table to the IOT, added a couple of extra indexes on the parent of the nested table, changed a couple of columns and object names to quoted mixed case, and everything still seemed to work okay.

It turned out that the OP had nested tables that were nested 5 deep – which means I need a hierarchical query to connect the 2nd to 5th (or nth) layer of nesting back to the parent to report the total volume against the top level table.

Watch this space – but not for a day or two.

 

IOT Hash

Thu, 2019-10-31 09:59

It’s another of my double-entendre titles. The optimizer can turn a hash join involving an index-organized table into a real performance disaster (though you may have to help it along the way by using a silly definition for your primary key columns). This post was inspired by a question posted on the Oracle Developer Community forum recently so the table and column names I’ve used in my model reflect (almost, with a few corrections) the names used in the post.

We start with a simple requirement expressed through the following SQL:


rem
rem     Script:         iot_hash.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Nov 2019
rem
rem     Last tested 
rem             19.3.0.0
rem             12.2.0.1
rem

insert
        /*+
                qb_name(insert)
        */
into t_iot(
        id, inst_id, nr_time,
        o_time, st, null_col, svname
)
select
        /*+
                qb_name(main)
                unnest(@subq)
                leading(@sel$a93afaed apar@main ob@subq)
                use_hash(@sel$a93afaed ob@subq)
                swap_join_inputs(@sel$a93afaed ob@subq)
                index_ss_asc(@sel$a93afaed ob@subq (t_iot.st t_iot.inst_id t_iot.svname))
        */
        apar.id,
        'UP',
        to_date('2019-10-24','yyyy-mm-dd'),
        to_date('1969-12-31','yyyy-mm-dd'),
        'IDLE',
        null,
        'tkt007.jj.bb.com'
from
        t_base apar
where
        apar.id not in (
                select
                        /*+
                                qb_name(subq)
                        */
                        id
                from
                        t_iot ob
                where
                        inst_id = 'UP'
        )
and     nvl(apar.gp_nm,'UA') = 'UA'
and     rownum <= 5000
/

The requirement is simple – insert into table t_iot a set of values dictated by a subset of the rows in table t_base if they do not already exist in t_iot. To model the issue that appeared I’ve had to hint the SQL to get the following plan (which I pulled from memory after enabling rowsource execution stats):


---------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name        | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT         |             |      1 |        |   296 (100)|      0 |00:00:00.03 |     788 |    148 |       |       |          |
|   1 |  LOAD TABLE CONVENTIONAL | T_IOT       |      1 |        |            |      0 |00:00:00.03 |     788 |    148 |       |       |          |
|*  2 |   COUNT STOPKEY          |             |      1 |        |            |    100 |00:00:00.03 |      99 |     91 |       |       |          |
|*  3 |    HASH JOIN RIGHT ANTI  |             |      1 |    100 |   296   (2)|    100 |00:00:00.03 |      99 |     91 |    14M|  1843K|   15M (0)|
|*  4 |     INDEX SKIP SCAN      | T_IOT_STATE |      1 |  12614 |   102   (0)|  10000 |00:00:00.01 |      92 |     91 |       |       |          |
|*  5 |     TABLE ACCESS FULL    | T_BASE      |      1 |    100 |     2   (0)|    100 |00:00:00.01 |       7 |      0 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(ROWNUM<=5000)
   3 - access("APAR"."ID"="ID")
   4 - access("INST_ID"='UP')
       filter("INST_ID"='UP')
   5 - filter(NVL("APAR"."GP_NM",'UA')='UA')

The optimizer has unnested (as hinted) the subquery and converted it to an anti-join using a right hash anti-join. Take a look at the Used-Mem for the hash join – would it surprise you to learn that the total size of the (not compressed in any way) IOT, all its indexes, and the t_base table together comes to less than 4 MB? Something dramatically awful has happened in the hash join to generate a requirement of 14MB. (In the case of the OP this appeared as an unexpected 5GB written to the temporary tablespace.)

Before I address the source of the high memory usage, take a close look at the Predicate Information, particularly operation 3, and ask yourself what the definition of index t_iot_state might be. The predicate joins t_base.id to t_iot.id, and here’s the code to create both tables and all the indexes.

create table t_iot (
        nr_time         timestamp,
        id              varchar2(1024),
        inst_id         varchar2(200),
        o_time          timestamp,
        st              varchar2(200),
        null_col        varchar2(100),
        svname          varchar2(200),
        constraint t_iot_pk primary key(nr_time, id, inst_id)
)
organization index
/

insert into t_iot
select
        sysdate,
        dbms_random.string('l',10),
        'UP',
        sysdate,
        'IDLE',
        null,
        rpad('x',25,'x')
from
        all_objects
where
        rownum <= 1e4 -- > hint to avoid wordpress format issue
/

create index t_iot_state on t_iot(st, inst_id, svname); 
create index idx2        on t_iot(id, inst_id, svname);

create table t_base(
        id              varchar2(400) not null,
        gp_nm           varchar2(200)
)
/

insert into t_base
select
        dbms_random.string('l',10),
        'UA'
from
        all_objects
where
        rownum <= 100 -- > hint to avoid wordpress format issue
;


begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 't_iot',
                cascade     => true,
                method_opt  => 'for all columns size 1'
        );

        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 't_base',
                cascade     => true,
                method_opt  => 'for all columns size 1'
        );
end;
/


The index t_iot_state that Oracle has used in the hash join is defined on the columns (st, inst_id, svname) – so the predicate is doing a comparison with a column that's not in the index! At least, it's not visibly declared in the index; but this is a secondary index on an IOT, and IOTs don't have "normal" rowids: the rowid in a secondary index is the value of the primary key (plus a "block guess"). So the columns in the index (even though not declared in the index) are: (st, inst_id, svname, {nr_time, id, inst_id, blockguess}). So this index does supply the required id column.

Side note: you'll see in the list of columns above that inst_id appears twice. In fact (since Oracle 9, I think) the code to handle secondary indexes has been smart enough to avoid this duplication: if the secondary index contains columns from the primary key then the "rowid" doesn't store those columns, and the code knows how to construct the primary key value from the stored PK columns combined with the needed columns from the index entry. This can make IOTs a very nice choice of implementation for "intersection" tables that are used to represent many-to-many joins between two other tables.

Unfortunately this “rowid” is the explanation for the massive memory demand. Take a look at the “Column Projection Information” for my execution plan:


Column Projection Information (identified by operation id):
-----------------------------------------------------------
   2 - "APAR"."ID"[VARCHAR2,400], "APAR"."GP_NM"[VARCHAR2,200], ROWNUM[8]
   3 - (#keys=1) "APAR"."ID"[VARCHAR2,400], "APAR"."GP_NM"[VARCHAR2,200]
   4 - "OB".ROWID[ROWID,1249], "NR_TIME"[TIMESTAMP,11], "ID"[VARCHAR2,1024], "INST_ID"[VARCHAR2,200], "OB".ROWID[ROWID,1249]
   5 - "APAR"."ID"[VARCHAR2,400], "APAR"."GP_NM"[VARCHAR2,200]

The interesting line is operation 4. A hash join takes the rowsource from its first child (the build table) and creates an in-memory hash table (which may spill to disc, of course), so if I see an unreasonable memory allocation (or unexpected spill to disc) a good starting point is to look at what the first child is supplying. In this case the first child seems to be saying that it’s supplying (or allowing for) nearly 3,700 bytes to be passed up to the hash join.

On closer inspection we can see it’s reporting the “rowid” twice, and also reporting the three columns that make up that rowid. I think it’s reasonable to assume that it’s only supplying the rowid once, and maybe it’s not even supplying the other three columns because they are embedded in the rowid. Doing a quick arithmetic check, let’s multiply the size of the rowid by the value of A-rows: 1,249 * 10,000 = 12,490,000. That’s pretty close to the 14MB reported by the hash join in operation 3.

Hypothesis – to get at the id column, Oracle has used this index (actually a very bad choice of those available) to extract the rowid and then passed the rowid up to the parent in a (length padded) fixed format. Oracle has then created a hash table by extracting the id column from the rowid and building the hash table on it but also carrying the length-padded rowid into the hash table.  Possible variants on this theme are that some or all of the other columns in the Column Projection Information are also passed upwards so that the id doesn’t have to be extracted, but if they are they are not padded to their maximum length.

A simple test that this is broadly the right assumption is to re-run the model making the declared lengths of the primary key columns (and hence the logical rowid) much larger to see what happens to the memory allocation. Changing the inst_id declaration from 200 bytes to 1000 bytes (note that the stored value is only the 2 bytes needed for the value 'UP') the Used-Mem jumps to 23 MB (which is an increment very similar to 800 * 10,000). You'll note that I chose to experiment with a column that wasn't the column used in the join; it was a column in the secondary index definition, though, so another test would be to change the nr_time column from a timestamp (11 bytes) to a large varchar2. So I re-ran the test declaring nr_time as a varchar2(1000) – reverting inst_id to varchar2(200) – and the Used-Mem increased to 25MB.
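For reference, the first of those re-tests is just the original create table statement with one declared length changed (a sketch – you then need to repeat the inserts, index creation and stats collection):


drop table t_iot purge;

create table t_iot (
        nr_time         timestamp,
        id              varchar2(1024),
        inst_id         varchar2(1000),         -- was varchar2(200); the stored value is still just 'UP'
        o_time          timestamp,
        st              varchar2(200),
        null_col        varchar2(100),
        svname          varchar2(200),
        constraint t_iot_pk primary key(nr_time, id, inst_id)
)
organization index
/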

Preliminary Conclusion

If Oracle uses the contents of the rowid of a secondary index on an IOT in a join then it constructs a fixed format version for the rowid by padding every column in the primary key to its maximum length and concatenating the results. This can have catastrophic side effects on performance if you’ve declared some very long columns “just in case”. Any time you use index organized tables you should remember to check the Column Projection Information in any execution plans that use secondary indexes in case they are passing a large, padded, primary key through the plan to a point where a blocking operation (such as a hash join or merge join) has to accumulate a large number of rows.
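One way to see that projection information for the most recent statement in your session (a sketch – the statement needs to have been run with rowsource execution statistics enabled if you want the ALLSTATS figures as well):


select  *
from    table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST +PROJECTION'));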

Footnote

In my test case I had to hint the query heavily to force Oracle into the path I wanted to demonstrate.

It’s surprising that the optimizer should have chosen this path in the OP’s system, given that there’s another secondary index that contains the necessary columns in its definition. (So one thought is that there’s a statistics problem to address, or possibly the “good” index is subject to updates that make it become very inefficient (large) very quickly.)

Another oddity of the OP's system was that Oracle should have chosen to do a right hash anti-join when it looked as if joining the tables in the opposite order would produce a much smaller memory demand and lower cost – there was an explicit swap_join_inputs() hint in the Outline Information (so copying the outline into the query and changing that to no_swap_join_inputs() might have been another viable workaround). In the end the OP hinted the query to use a nested loop (anti-)join from t_base to t_iot – which is another way to avoid the hash table threat with padded rowids.

 

Strange Estimates.

Wed, 2019-10-30 08:10

A question came up on the Oracle-L list server a couple of days ago expressing some surprise at the following execution plan:


--------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                 |       |       |   845 (100)|          |
|   1 |  SORT AGGREGATE                      |                 |     1 |     7 |            |          |
|*  2 |   TABLE ACCESS BY INDEX ROWID BATCHED| ANY_TABLE       | 84827 |   579K|   845   (1)| 00:00:01 |
|   3 |    SORT CLUSTER BY ROWID             |                 | 68418 |       |    76   (0)| 00:00:01 |
|*  4 |     INDEX RANGE SCAN                 | ANY_INDEX       | 68418 |       |    76   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("X"."ANY_COLUMN1"='J')
   4 - access("X"."ANY_COLUMN2"=89155)

You'll notice that this is a very simple query accessing a table by index, yet the estimated number of table rows found exceeds the estimated number of index entries used to probe the table. How can this happen? The answer (most frequently) is that there's a mismatch between the table (or, more commonly, column) statistics and the index statistics. This seems to happen very frequently when you start mixing partitioned tables with global (or globally partitioned) indexes, but it can happen in very simple cases, especially since a call to gather_table_stats() with cascade set to true and using the auto_sample_size will take a small sample from the index while using a 100% "sample" from the table.
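As a sketch of that default behaviour (using the t1 name from the model below – adjust to taste), the pattern to look for is a big difference between the table and index sample sizes after a standard call:


begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'T1',
                cascade          => true,
                estimate_percent => dbms_stats.auto_sample_size
        );
end;
/

select  table_name, sample_size, num_rows
from    user_tables
where   table_name = 'T1';

select  index_name, sample_size, num_rows
from    user_indexes
where   table_name = 'T1';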

Here’s an example I engineered very quickly to demonstrate the point. There’s no particular reason for the choice of DML I’ve used on the data beyond a rough idea of setting up a percentage of nulls and deleting a non-uniform pattern of rows.


rem
rem     Script:         table_index_mismatch.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Nov 2019
rem
rem     Last tested 
rem             19.3.0.0
rem             12.2.0.1
rem
create table t1
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        mod(rownum,1000)                n1,
        mod(rownum,1000)                n2,
        lpad('x',100,'x')               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6 -- > comment to avoid WordPress format issue
;

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1, for columns (n1,n2) size 1'
        );
end;
/

create index t1_i1 on t1(n1);

delete from t1 where mod(trunc(sqrt(n1)),7) = 0;
update t1 set n1 = null where mod(n1,10) = 0;
delete from t1 where mod(n1,10) = trunc(dbms_random.value(0,10));

execute dbms_stats.gather_table_stats(user,'t1',estimate_percent=>1)
execute dbms_stats.gather_index_stats(null,'t1_i1',estimate_percent=> 0.01)

Of course you’re not supposed to collect stats with arbitrary samples in any recent version of Oracle, so going for a 1% and 0.01% sample seems a little daft but I’m just doing that to demonstrate the problem with a very small data set.

After generating the data and gathering the stats I ran a few queries to pick out some critical numbers.


select
        table_name, sample_size, num_rows
from
        user_tables
where
        table_name = 'T1'
/

select 
        index_name, sample_size, num_rows, distinct_keys
from
        user_indexes
where
        table_name = 'T1'
and     index_name = 'T1_I1'
/

select
        column_name, sample_size, num_nulls, num_distinct
from
        user_tab_cols
where
        table_name = 'T1'
and     (
            column_name = 'N1'
         or virtual_column = 'YES'
        )
order by
        column_name
/

You’ll notice that I’ve only picked one of my original columns and any virtual columns. My gather_table_stats() call had a method_opt that included the creation of extended stats for the column group (n1, n2) and I want to report the stats on the resulting virtual column.


TABLE_NAME           SAMPLE_SIZE   NUM_ROWS
-------------------- ----------- ----------
T1                          7865     786500


INDEX_NAME           SAMPLE_SIZE   NUM_ROWS DISTINCT_KEYS
-------------------- ----------- ---------- -------------
T1_I1                     385779     713292           714


COLUMN_NAME                      SAMPLE_SIZE  NUM_NULLS NUM_DISTINCT
-------------------------------- ----------- ---------- ------------
N1                                      7012      85300          771
SYS_STUBZH0IHA7K$KEBJVXO5LOHAS          7865          0          855

A couple of observations on the stats

  • the table sample size is, as expected, 1% of the reported num_rows (the actual count is 778,154).
  • The index sample size is much bigger than expected – but that’s probably related to the normal “select 1,100 leaf blocks” strategy. Because of the skew in the pattern of deleted values it’s possible for the sample size in this model to vary between 694,154 and something in the region of 380,000.
  • The n1 sample size is about 10% smaller than the table sample size – but that’s because I set 10% of the column to null.
  • The column group sample size matches the table sample size because column group hash values are never null, even if an underlying column is null.

So let’s check the execution plan for a very simple query:


set autotrace on explain
select id from t1 where n1 = 140 and n2 = 140;
set autotrace off


---------------------------------------------------------------------------------------------
| Id  | Operation                           | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |       |   920 | 11960 |   918   (1)| 00:00:01 |
|*  1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T1    |   920 | 11960 |   918   (1)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | T1_I1 |   909 |       |     5   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("N2"=140)
   2 - access("N1"=140)

The estimate for relevant index rowids is smaller than the estimate for the number of table rows! The difference is not as extreme as the case reported on Oracle-l, but I’m only trying to demonstrate a principle, not reproduce the exact results.

There are several ways in which contradictory results like this can appear – but in this case we can see the following:

  • For the table access: table.num_rows/column.num_distinct = 786,500 / 855 = 919.88 (using the column group num_distinct)
  • For the index range scan: (table.num_rows – column.num_nulls) / column.num_distinct = (786500 – 85300) / 771 = 909.47 (using the n1 statistics)

So the change in strategy as it becomes possible for the optimizer to take advantage of the column group means the index and table have been using incompatible sets of stats (in particular there’s that loss of information about NULLs) as their cardinalities are calculated. The question, then, is “how much is that likely to matter” and, if it can matter, “in what circumstances could the effect be large enough to cause problems”. But that’s a topic for another day.
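If you want to check that arithmetic for yourself, here’s a little sketch of a query that reproduces the two estimates straight from the dictionary views I reported earlier. Remember that the column group’s name is system-generated, so the literal below (copied from my output) will be different on your system:

select
        round(tab.num_rows / cg.num_distinct, 2)                        table_card_est,
        round((tab.num_rows - n1.num_nulls) / n1.num_distinct, 2)       index_card_est
from
        user_tables     tab,
        user_tab_cols   n1,
        user_tab_cols   cg
where
        tab.table_name  = 'T1'
and     n1.table_name   = 'T1'
and     n1.column_name  = 'N1'
and     cg.table_name   = 'T1'
and     cg.column_name  = 'SYS_STUBZH0IHA7K$KEBJVXO5LOHAS'
/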

Update / Footnote

In the case of the Oracle-l example there was no column group, and in some cases the optimizer would produce a plan where the table estimate was much smaller than the index estimate, while in other cases (like the opening plan above) the table estimate was significantly greater than the index estimate. This was a side effect of adaptive statistics: the low table estimate was due to the basic “multiply the separate selectivities” arithmetic; but with adaptive statistics enabled Oracle started sampling the table to check the correlation between the two columns, produced a SQL Plan Directive along the way, and got to the higher (and correct) result.

match_recognize()

Fri, 2019-10-25 12:47

A couple of days ago I posted a note with some code (from Stew Ashton) that derived the clustering_factor you would get for an index when you had set a value for the table_cached_blocks preference but had not yet created the index. In the note I said that I had produced a simple and elegant (though massively CPU-intensive) query using match_recognize() that seemed to produce the right answer, only to be informed by Stew that my interpretation of how Oracle calculated the clustering_factor was wrong and that the query was irrelevant.  (The fact that it happened to produce the right answer in all my tests was an accidental side effect of the way I had been generating test data. With Stew’s explanation of what Oracle was doing it was easy to construct a data set that proved my query was doing the wrong thing.)

The first comment I got on the posting was a request to publish my match_recognize() query – even though it was irrelevant to the problem in hand – simply because it might be a useful lesson in what could be done with the feature; so here’s some code that doesn’t do anything useful but does demonstrate a couple of points about match_recognize(). I have to stress, though, that I’m not an expert on match_recognize() so there may be a more efficient way of using it to achieve the same result. There is certainly a simple and more efficient way to get the same result using some fairly straightforward PL/SQL code.

Requirement

I had assumed that if you set the table_cached_blocks preference to N then Oracle would keep track of the previous N rowids as it walked an index to gather stats and only increment the “clustering factor counter” if it failed to find a match for the block address of the current rowid in the block addresses extracted from the previous N rowids. My target was to emulate a way of doing this counting.

Strategy

Rather than writing code that “looked backwards” as it walked the index, I decided to take advantage of the symmetry of the situation and write code that looked forwards (you could think of this as viewing the index in descending order and looking backwards along the descending index). I also decided that I could count the number of times I did find a match in the trail of rowids, and subtract that from the total number of index entries.

So my target was to look for patterns where I start with the block address from the current rowid, then find the same block address after zero to N-1 failures to find the block address. Since the index doesn’t exist I’ll need to emulate its existence by selecting the columns that I want in the index along with the rowid, ordering the data in index order. Then I can walk this “in memory” index looking for the desired pattern.

Here’s some code to create a table to test against:


rem
rem     Script:         clustering_factor_est_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Oct 2019
rem     Purpose:        
rem
rem     Last tested 
rem             19.3.0.0
rem             12.2.0.1
rem 

create table t1
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                                  id,
        cast(rownum as varchar2(10))            v1,
        trunc(dbms_random.value(0,10000))       rand,
        rpad('x',100,'x')                       padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6 -- > comment to avoid WordPress format issue
/

My table has 1M rows, and there’s a column called rand which has 10,000 distinct values. This is generated through Oracle’s dbms_random package and the procedure I’ve used will give me roughly 100 occurrences for each value scattered uniformly across the table. An index on this column might be quite useful but it’s probably going to have a very high clustering_factor because, on average, any two rows of a particular value are likely to be separated by 10,000 rows, and even if I look for rows with a pair of consecutive values any two rows are likely to be separated by a few hundred other rows.

Here’s the code that I was using to get an estimate of (my erroneous concept of) the clustering_factor with table_cached_blocks set to 32. For debugging purposes it reports the first row of the pattern for each time my pattern is matched, so in this version of the code I’d  have to check the number of rows returned and subtract that from the number of rows in the table (where rand is not null).


select
        first_rand, first_file_id, first_block_id, first_row_id, step_size
from    (
                select
                        rand, 
                        dbms_rowid.rowid_relative_fno(rowid) file_id,
                        dbms_rowid.rowid_block_number(rowid) block_id,
                        dbms_rowid.rowid_row_number(rowid)   row_id
                from
                        t1
                where
                        rand is not null
        )
match_recognize(
        order by rand, file_id, block_id, row_id
        measures
                count(*) as step_size,
                first(strt.rand)        as first_rand,
                first(strt.file_id)     as first_file_id,
                first(strt.block_id)    as first_block_id,
                first(strt.row_id)      as first_row_id
        one row per match
        after match skip to next row
        pattern (strt miss{0, 31} hit)
        define 
                miss as (block_id != strt.block_id or  file_id != strt.file_id),
                hit  as (block_id  = strt.block_id and file_id  = strt.file_id)
)
order by 
        first_rand, first_file_id, first_block_id, first_row_id
;

The first point to note is that the inline view is the thing I use to model an index on column rand. It simply selects the value and rowid from each row. However I’ve used the dbms_rowid package to break the rowid down into the file_id, block_id and row_id within block. Technically (to deal with partitioned tables and indexes) I also ought to be thinking about the object_id which is the first component of the full rowid. I haven’t actually ordered the rows in the in-line view as I’m going to let the match_recognize() operation handle that.

A warning as we start looking at the match_recognize() section of the query: I’ll be jumping around the clause a little bit, rather than working through it in order from start to finish.

First I need the raw data sorted in order so that any results I get will be deterministic. I have an order by clause that sorts my data by rand and the three components I’ve extracted from the rowid. (It is also possible to have a partition clause – similar to the partition clause of an analytic function – but I don’t need one for this model.)

Then I need to define the “columns” that I’m planning to output in the final query. This is the set of measures and in the list of measures you can see a count(*) (which doesn’t quite mean what it usually does) and a number of calls to a function first() which takes an aliased column name as its input and, although the column names look familiar, the alias (strt) seems to have sprung from nowhere.

At this point I want to jump down to the define clause because that’s where the meaning of strt could have been defined. The define clause is a list of “rules”, which are also treated as “variables”, that tell us how to classify a row. We are looking for patterns, and a pattern is a set of rows that follows a set of rules in a particular order, so one of the things I need to do is create a rule that tells Oracle how to identify a row that qualifies as the starting point for a pattern – so I’ve defined a rule called strt to do this – except I haven’t defined it explicitly, it’s not visible in the define list, so Oracle has assumed that the rule is “1 = 1”, in other words every row I’m going to look at could be classified as a strt row.

Now that I have a definition for what strt means I could go back to the measures – but I’ll postpone doing that for a moment and look at the other rules in my define list. I have a rule called miss which says “if either of these comparisons evaluates to true” then the row is a miss; but the predicate includes a reference to strt, which means we are doing comparisons with the most recent row that was classified as a strt row. So a miss means we’ve found a starting row and we’re now looking at other rows, comparing the block_id and file_id for each row to check that they don’t match the block_id and file_id of the starting row.

Similarly we have a hit rule which says a hit means we’ve previously found a starting row and we’re now looking at other rows checking for rows where the current block_id and file_id match the starting block_id and file_id.

Once we’ve got a set of rules explaining how to classify rows we can specify a pattern (which means going back up the match_recognize() clause one section). Our pattern reads: “find a strt row, followed by between zero and 31 miss rows, followed by a hit row”. And that’s just a description of my back-to-front way of saying “remember the last 32 rowids and check if the current block address matches the block address in one of those rowids”.

The last two clauses that need to be explained before I revisit the measures clause are the “one row per match” and “after match skip to next row”.

If I find a sequence of rows that matches my pattern there are several things I could do with that set of rows – for example I could report every row in that pattern along with the classification of strt/miss/hit (which would be useful if I’m looking for a few matches to a small pattern in a large data set), or (as I’ve done here) I could report just one row from the pattern and give Oracle some instruction about which one row I want reported.

Then, after I’ve found (and possibly reported) a pattern, what should I do next? Again there are several possibilities – the two most obvious ones, perhaps, are: “just keep going” i.e. look at the row after the end of the pattern to see if it’s another strt row, and “go to the row after the start of the pattern you’ve just reported”. These translate into: “after match skip past last row” (which is the default if you don’t specify an “after match” clause) and “after match skip to next row” (which is what I’ve specified).

Finally we get back to the measures clause – I’ve defined four “columns” with names like ‘first_xxx’ and a step_size. The step_size is defined as count(*) which – in this context – means “count the number of rows in the current matched pattern”. The other measures are defined using the first() function, referencing the strt alias, which tells Oracle I want to retain the value from the first row that met the strt rule in the current matched pattern.

In summary, then, my match_recognize() clause tells Oracle to:

  • Sort the data by rand, file_id, block_id, row_id
  • For each row in turn
    • extract the file_id and block_id
    • take up to 32 steps down the list looking for a matching file_id and block_id
    • If you find a match pass a row to the parent operation that consists of: the number of rows from strt to hit inclusive, and the values of rand, file_id, block_id, and row_id of the strt row.

Before you try testing the code, you might like to know about some results.

As it stands my laptop with a virtual machine running 12.2.0.1 took 1 minute and 5 seconds to complete with “miss{0, 31}” in the pattern. In fact the first version of the code had the file_id and block_id tests in the define clause in the opposite order viz:

        define 
                miss as (file_id != strt.file_id or  block_id != strt.block_id),
                hit  as (file_id  = strt.file_id and block_id  = strt.block_id)

Since the file_id for my test is the same for every row in my table (it’s a single file tablespace), this was wasting a surprising amount of CPU, leading to a run time of 1 minute 17 seconds! Neither time looks particularly good when compared to the time required to create the index, set the table_cached_blocks to 32, and gather stats on the index – in a total time of less than 5 seconds. The larger the value I chose for the pattern match, the worse the workload became, until at “miss{0, 199}” – emulating a table_cached_blocks of 200 – the time to run was about 434 seconds of CPU!

A major portion of the problem is the way that Oracle is over-optimistic (or “greedy”, to use the technical term) with its pattern matching, combined with the nature of the data which (in this example) isn’t going to offer much opportunity for matching, combined with the fact that Oracle cannot determine that a row that is not a “miss” has to be a “hit”.  In this context “greedy” means Oracle will try to find as many consecutive occurrences of the first rule in the pattern as it can before it tries to find an occurrence of the second rule – and when it fails to match a pattern it will “backtrack” one step and have another go, being slightly less greedy. So, for our example, the greedy algorithm will operate as follows:

  • find 31 rows that match miss, then discover the 32nd row does not match hit
  • go back to the strt and find 30 rows that match miss, then discover the 31st row does not match hit
  • go back to the strt and find 29 rows that match miss, then discover the 30th row does not match hit
  • … repeat until
  • go back to the strt and find 0 rows that match miss, then discover the 1st does not match hit
  • go to the next row, call it strt, and repeat the above … 1M times.

From a human perspective this is a pretty stupid strategy for this specific problem – but that’s because we happen to know that “hit” = “not(miss)” (ignoring nulls, of which there are none) while Oracle has to assume that there is no relationship between “hit” and “miss”.

There is hope, though, because you can tell Oracle that it should be “reluctant” rather than “greedy” which means Oracle will consume the smallest possible number of occurrences of the first rule before testing the second rule, and so on. All you have to do is append a “?” to the count qualifier:

        pattern (strt miss{0, 31 }? hit)

Unfortunately this seemed to have very little effect on execution time (and CPU usage) in our case. Again this may be due to the nature of the data etc., but it may also be a limitation in the way that the backtracking works. I had expected a response time that would more closely mirror the human expectation, but a few timed tests suggest the code uses exactly the same type of strategy for the reluctant option as it does for the greedy one, viz:

  • find 0 rows that match miss, then discover the next row does not match hit
  • go back to the strt and find 1 row that matches miss, then discover the 2nd row does not match hit
  • go back to the strt and find 2 rows that match miss, then discover the 3rd row does not match hit
  • … repeat until
  • go back to the strt and find 31 rows that match miss, then discover the 32nd does not match hit
  • go to the next row, call it strt, and repeat the above … 1M times.

Since there are relatively few cases in our data where a pattern match will occur both the reluctant and the greedy strategies will usually end up doing all 32 steps. I was hoping for a more human-like algorithm in which Oracle would recognise that if it has just missed on the first X rows then it need only check the (X+1)th and not go back to the beginning (strt) – but it’s only my specific requirement that makes it easy to see (from a human perspective) that that shortcut makes sense; in a generic case (with more complex patterns and without the benefit of having two mutually exclusive rules) the strategy of “start all over again” is probably a much safer option to code.

Plan B

Plan B was to send the code to Stew Ashton and ask if he had any thoughts on making it more efficient – I wasn’t expecting him to send a PL/SQL solution by return of post, but that’s what I got and published in the previous post.

Plan C

It occurred to me that I don’t really mind if the predicted clustering_factor is a little inaccurate, and that the “backtracking” introduced by the variable qualifier {0,31} was the source of a huge amount of the work done, so I took a different approach which simply said: “check that the next 32 (or preferred value of N) rows are all misses”.  This required two changes – eliminate one of the defines (the “hit”) and modify the pattern definition as follows:

         pattern (strt  miss{32} )
         define
                 miss as (block_id != strt.block_id or  file_id != strt.file_id)

The change in strategy means that the result is going to be (my version of) the clustering_factor rather than the number to subtract from num_rows to derive the clustering_factor. And I’ve introduced a small error which shows up towards the end of the data set – I’ve demanded that a pattern should include exactly 32 misses, but when you’re within 32 rows of the end of the data set there aren’t enough rows left to match the pattern. So the result produced by the modified code could be as much as 32 short of the expected result.  However, when I showed the code to Stew Ashton he pointed out that I could include “alternatives” in the pattern, so all I had to do was add to the pattern something which said “and if there aren’t 32 rows left, getting to the end of the data set is good enough” (technically that should be the end of the current partition, but we have only one partition).

         pattern (strt  ( miss{32} | ( miss* $) ) )

The “miss” part of the pattern now reads:  “32 misses” or “zero or more misses and then the end of file/partition/dataset”.

It’s still not great – but the time to process the 1M rows with a table_cached_blocks of 32 came down to 31 seconds.

Finally

I’ll close with one important thought. There’s a significant difference in the execution plans for the two strategies – which I’m showing as outputs from the SQL Monitor report using a version of the code that does a simple count(*) rather than listing any rows:


pattern (strt miss{0, 31} hit)
==========================================================================================================================================
| Id |         Operation          | Name  |  Rows   | Cost  |   Time    | Start  | Execs |   Rows   |  Mem  | Activity | Activity Detail |
|    |                            |       | (Estim) |       | Active(s) | Active |       | (Actual) | (Max) |   (%)    |   (# samples)   |
==========================================================================================================================================
|  0 | SELECT STATEMENT           |       |         |       |        54 |    +14 |     1 |        1 |     . |          |                 |
|  1 |   SORT AGGREGATE           |       |       1 |       |        54 |    +14 |     1 |        1 |     . |          |                 |
|  2 |    VIEW                    |       |      1M | 12147 |        54 |    +14 |     1 |     2782 |     . |          |                 |
|  3 |     MATCH RECOGNIZE SORT   |       |      1M | 12147 |        60 |     +8 |     1 |     2782 |  34MB |          |                 |
|  4 |      VIEW                  |       |      1M |   325 |        14 |     +1 |     1 |       1M |     . |          |                 |
|  5 |       INDEX FAST FULL SCAN | T1_I1 |      1M |   325 |         7 |     +8 |     1 |       1M |     . |          |                 |
==========================================================================================================================================

pattern (strt  miss{32} )
==================================================================================================================================================================
| Id |                     Operation                      | Name  |  Rows   | Cost  |   Time    | Start  | Execs |   Rows   |  Mem  | Activity | Activity Detail |
|    |                                                    |       | (Estim) |       | Active(s) | Active |       | (Actual) | (Max) |   (%)    |   (# samples)   |
==================================================================================================================================================================
|  0 | SELECT STATEMENT                                   |       |         |       |        15 |    +16 |     1 |        1 |     . |          |                 |
|  1 |   SORT AGGREGATE                                   |       |       1 |       |        15 |    +16 |     1 |        1 |     . |          |                 |
|  2 |    VIEW                                            |       |      1M | 12147 |        15 |    +16 |     1 |     997K |     . |          |                 |
|  3 |     MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO |       |      1M | 12147 |        29 |     +2 |     1 |     997K |  34MB |          |                 |
|  4 |      VIEW                                          |       |      1M |   325 |        16 |     +1 |     1 |       1M |     . |          |                 |
|  5 |       INDEX FAST FULL SCAN                         | T1_I1 |      1M |   325 |         9 |     +8 |     1 |       1M |     . |          |                 |
==================================================================================================================================================================

The significant line to note is operation 3 in both cases. The query with the pattern that’s going to induce back-tracking reports only “MATCH RECOGNIZE SORT”. The query with the “fixed” pattern reports “MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO”, which means Oracle can implement a finite state machine with a fixed worst-case run-time. When you write some code that uses match_recognize() the three magic words you want to see in the plan are “deterministic finite auto” – if you don’t then, in principle, your query might be one of those that could (theoretically) run forever.
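If you want to check which variant you’ve got for a query of your own, a call to explain plan followed by dbms_xplan.display is all it takes. Here’s a sketch against the t1 table above – with a fixed-repetition pattern like this one I’d expect the “deterministic finite auto” operation to show up, but that’s an observation, not a guarantee for every pattern you might write:

explain plan for
select
        count(*)
from
        t1
match_recognize(
        order by rand
        measures first(strt.rand) as first_rand
        one row per match
        after match skip to next row
        pattern (strt miss{32})
        define
                miss as (rand != strt.rand)
);

select * from table(dbms_xplan.display);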

Addendum

Congratulations if you’ve got this far – but please remember that I’ve had very little practice using match_recognize(); this was a fun little learning experience for me and there may be many more things you could usefully know about the technology before you use it in production; there may also be details in what I’ve written which are best forgotten about. That being the case I’d be more than happy for anyone who wants to enhance or correct my code, descriptions and observations to comment below.

Clustering_Factor

Wed, 2019-10-23 15:56

A few days ago I published a little note about a script I wrote some time ago to estimate the clustering_factor of an index before it had been built. At the time I pointed out that one of its limitations was that it would not handle cases where you were planning to set the table_cached_blocks preference, but a couple of days later I decided that I’d write another version of the code that would cater for the new feature – and that’s how I made an embarrassing discovery.

Having reviewed a number of notes I’ve published about the table_cached_blocks preference and its impact on the clustering_factor I’ve realised that what I’ve written has always been open to two interpretations – the one that I had in mind as I was writing, and the correct one.  I made this discovery because I had written a simple SQL statement – using the match_recognize() mechanism – to do what I considered to be the appropriate calculation. After testing the query with a few sets of sample data that produced the correct results I emailed Stew Ashton (my “go-to” person for match_recognize() questions) asking if he would do a sanity check on the code because it was rather slow and I wondered if there was a better way of writing it.

His reply was roughly:

“I’ve read the notes you and Richard Foote have written about the clustering_factor and table_cached_blocks, and this isn’t doing what your description says it should.”

Then he explained what he had inferred from what I had written … and it made more sense than what I had been thinking when I wrote it. He also supplied some code to implement his interpretation – so I designed a couple of data models that would produce the wrong prediction for whichever piece of code implemented the wrong interpretation. His code gave the right answers, mine didn’t.

So here’s the difference in interpretation – the wrong one first – using 16 as a discussion value for the table_cached_blocks:

  • WRONG interpretation: As you walk through index entries in order remember the last 16 rowids (that’s the rowid for the rows in the table that the index is pointing to) you’ve seen. If the current rowid has a block id component that doesn’t match the block id from one of the remembered 16 rowids then increment the counter for the clustering_factor.
    • The simplicity of this algorithm means you can fix a “circular” array of 16 entries and keep walking around the circle overwriting the oldest entry each time you read a new one. It’s a pity that it’s the wrong idea because there’s a simple (though massively CPU-intensive) match_recognize() strategy for implementing it – and if you were using an internal library mechanism during a proper gather_index_stats() it could be incredibly efficient. (There’s a little sketch of this circular-buffer approach just after this list.)
  • RIGHT interpretation: set up an array of 16 block ids, each with an associated “row-number”. Walk through the index in order – giving each entry a row-number as you go. Extract the block id from the current entry and search through the array for a matching block id.  If you find a match then update its entry with the current row-number (so you can remember how recently you saw the block id); if you don’t find a match then replace the entry that has the smallest (i.e. greatest distance into the past) row-number with the current block id and row-number and increment the counter for the clustering_factor.
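Just to make the contrast concrete, here’s a minimal PL/SQL sketch of the mechanics of the WRONG interpretation – the circular buffer of the last 16 block addresses seen – walking the data in the order of an intended index on t1(rand, id) (the table and index that appear later in this note). It’s purely illustrative; the function below from Stew Ashton is the one that implements the correct algorithm:

declare
        c_cache_size    constant pls_integer := 16;

        type t_block_tab is table of varchar2(64) index by pls_integer;
        l_recent        t_block_tab;

        l_slot          pls_integer := 0;
        l_found         boolean;
        l_cf            number := 0;
begin
        for r in (
                select
                        -- file number + block number as a simple block key
                        -- (ignoring the data object id, which would matter
                        --  for partitioned tables)
                        to_char(dbms_rowid.rowid_relative_fno(rowid)) || '.' ||
                        to_char(dbms_rowid.rowid_block_number(rowid))   block_key
                from
                        t1
                order by
                        rand, id, rowid
        ) loop

                -- check the most recent c_cache_size rows for a matching block address
                l_found := false;
                for i in 1..l_recent.count loop
                        if l_recent(i) = r.block_key then
                                l_found := true;
                                exit;
                        end if;
                end loop;

                if not l_found then
                        l_cf := l_cf + 1;
                end if;

                -- always remember the current block, overwriting the oldest slot
                l_slot := mod(l_slot, c_cache_size) + 1;
                l_recent(l_slot) := r.block_key;

        end loop;

        dbms_output.put_line('Wrong-interpretation clustering_factor estimate: ' || l_cf);
end;
/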

The first piece of code that Stew Ashton sent me was an anonymous PL/SQL block that included some hard-coded fragments and embedded SQL to use a test table and index that I had defined, but he then sent a second piece of code that creates a generic function that uses dynamic SQL to construct a query against a table and an index definition that you want to test. The latter is the code I’ve published (with permission) below:


create or replace function predict_clustering_factor(
/*
Function to predict the clustering factor of an index,
taking into account the intended value of
the TABLE_CACHED_BLOCKS parameter of DBMS_STATS.SET_TABLE_PREFS.

Input is the table name, the list of column names
and the intended value of TABLE_CACHED_BLOCKS.

The function collects the last N block ids (not the last N entries).
When there is no more room, it increments the clustering factor
and replaces the least recently used block id with the current one.

Note: here a "block id" is a rowid with the row_number portion set to 0.
It is effectively a "truncated" rowid.
*/
  p_table_name in varchar2,
  p_column_list in varchar2,
  p_table_cached_blocks in number
) return number authid current_user is

  rc sys_refcursor;
  type tt_rids is table of rowid;
  lt_rids tt_rids;
  
  type t_block_list is record(
    rid rowid,
    last_hit number
  );

  type tt_block_list is table of t_block_list;
  lt_block_list tt_block_list := new tt_block_list();

  l_rid rowid;
  l_clustering_factor number := 0;
  b_block_found boolean;
  l_rn number := 0;
  l_oldest_hit number;
  i_oldest_hit binary_integer := 0;
  
  function truncated_rid(p_rid in rowid) return rowid is
    rowid_type number;
    object_number NUMBER;
    relative_fno NUMBER;
    block_number NUMBER;
    row_number NUMBER;
    rid rowid;

  begin

    DBMS_ROWID.ROWID_INFO (
      p_rid,
      rowid_type,
      object_number,
      relative_fno,
      block_number,
      row_number
    );

    rid := DBMS_ROWID.ROWID_CREATE (
      rowid_type,
      object_number,
      relative_fno,
      block_number,
      0
    );

    return rid;

  end truncated_rid;
  
begin
  if p_table_cached_blocks != trunc(p_table_cached_blocks)
  or p_table_cached_blocks not between 1 and 255 then
    raise_application_error(
      -20001, 
      'input parameter p_table_cached_blocks must be an integer between 1 and 255'
    );
  end if;

  open rc for 'select rowid from '||p_table_name||' order by '||p_column_list||', rowid';
  loop
    fetch rc bulk collect into lt_rids limit 1000;

    for irid in 1..lt_rids.count loop
      l_rn := l_rn + 1;
      l_rid := truncated_rid(lt_rids(irid));
      b_block_found := false;
      l_oldest_hit := l_rn;

      if l_rn = 1 then
        l_clustering_factor := l_clustering_factor + 1;
        lt_block_list.extend;
        lt_block_list(1).rid := l_rid;
        lt_block_list(1).last_hit := l_rn;

      else

        for i in 1..lt_block_list.count loop
          if l_oldest_hit > lt_block_list(i).last_hit then
            l_oldest_hit := lt_block_list(i).last_hit;
            i_oldest_hit := i;
          end if;
          if lt_block_list(i).rid = l_rid then
            b_block_found := true;
            lt_block_list(i).last_hit := l_rn;
            exit;
          end if;
        end loop;

        if not b_block_found then
          l_clustering_factor := l_clustering_factor + 1;
          if lt_block_list.count < p_table_cached_blocks then
            lt_block_list.extend;
            lt_block_list(lt_block_list.count).rid := l_rid;
            lt_block_list(lt_block_list.count).last_hit := l_rn; 
          else         
            lt_block_list(i_oldest_hit).rid := l_rid;
            lt_block_list(i_oldest_hit).last_hit := l_rn;
          end if;
        end if;

      end if;

    end loop;
    exit when rc%notfound;
  end loop;

  close rc;
  return l_clustering_factor;

exception when others then
  if rc%isopen then
    close rc;
  end if;
  raise;

end predict_clustering_factor;
/

After executing the above to create the function, here’s an example of usage:

rem
rem     Script:         clustering_factor_est_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Oct 2019
rem
rem     Last tested
rem             19.3.0.0
rem             12.2.0.1
rem

create table t1
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                                  id,
        cast(rownum as varchar2(10))            v1,
        trunc(dbms_random.value(0,10000))       rand,
        rpad('x',100,'x')                       padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6 -- > comment to avoid WordPress format issue
/

-- -------------------------------------------------------------------

SQL> execute dbms_output.put_line('Predicted cf for t1(rand, id): ' || predict_clustering_factor('t1','rand, id',16))
Predicted cf for t1(rand, id): 997218
Elapsed: 00:00:07.54

SQL> execute dbms_output.put_line('Predicted cf for t1(rand, id): ' || predict_clustering_factor('t1','rand, id',255))
Predicted cf for t1(rand, id): 985607
Elapsed: 00:00:50.61

You’ll notice that the larger the setting for the “table_cached_blocks” parameter the more time it takes to predict the clustering_factor – and it was all CPU time in my example. This isn’t surprising given the need to search through an array holding the previous history. In this example the table t1 holds 1,000,000 rows, and the number and scatter of distinct values is so arranged that the code will hardly ever find a cached block id – essentially it’s the sort of index that isn’t going to cause much confusion to the optimizer and isn’t likely to need special attention to make the optimizer use it when it should and ignore it when it’s inappropriate.

Finally a cut-n-paste to show the accuracy of the two predictions:

SQL> create index t1_i on t1(rand, id);
Elapsed: 00:00:02.96

SQL> execute dbms_stats.set_table_prefs(null,'t1','table_cached_blocks',16)
Elapsed: 00:00:00.01

SQL> execute dbms_stats.gather_index_stats(null,'t1_i')
Elapsed: 00:00:09.55

SQL> select clustering_factor from user_indexes where index_name = 'T1_I';

CLUSTERING_FACTOR
-----------------
           997218

Elapsed: 00:00:00.11

SQL> execute dbms_stats.set_table_prefs(null,'t1','table_cached_blocks',255)
Elapsed: 00:00:00.01

SQL> execute dbms_stats.gather_index_stats(null,'t1_i')
Elapsed: 00:00:07.80

SQL> select clustering_factor from user_indexes where index_name = 'T1_I';

CLUSTERING_FACTOR
-----------------
           985607

Elapsed: 00:00:00.00

Both match perfectly – but you might notice that creating the index and gathering the stats was much faster than predicting the clustering factor for the case where we set table_cached_blocks = 255.

(If you’re wondering, my “simple but irrelevant” match_recognize() query took 370 CPU seconds to complete for table_cached_blocks = 200 – and a limit on match_recognize() meant that 200 was the maximum value I was allowed to use – so now you know why I emailed Stew Ashton (and, just for lagniappe, he also told me about a simple workaround for the 200 limit)).

ANSI Plans

Tue, 2019-10-22 12:59

Here’s a thought that falls somewhere between philosophical and pragmatic. It came up while I was playing around with a problem from the Oracle database forum that was asking about options for rewriting a query with a certain type of predicate. This note isn’t really about that question but the OP supplied a convenient script to demonstrate their requirement and I’ve hi-jacked most of the code for my own purposes so that I can ask the question:

Should the presence of an intermediate view name generated by the optimizer in the course of cost-based query transformation cause two plans, which are otherwise identical and do exactly the same thing, to have different plan hash values?

To demonstrate the issue let’s start with a simple script to create some data and generate an execution plan.


rem
rem     Script:         or_expand_plans.sql
rem     Author:         Jonathan Lewis
rem     Dated           Oct 2019
rem
rem     Last tested
rem             19.3.0.0
rem             12.2.0.1
rem
rem     Notes:
rem     Data creation copied from ODC
rem     https://community.oracle.com/thread/4297365
rem

create table t as   
select 1 as id, to_date('2019-10-11', 'YYYY-MM-DD') as lastupdated, 'Toronto' as place from dual  
union all  
select 2 as id, to_date('2019-10-12', 'YYYY-MM-DD') as lastupdated, 'Toronto' as place from dual  
union all  
select 3 as id, to_date('2019-10-15', 'YYYY-MM-DD') as lastupdated, 'Toronto' as place from dual  
union all  
select 4 as id, to_date('2019-10-21', 'YYYY-MM-DD') as lastupdated, 'Toronto' as place from dual  
;  

create table t_others as  
select 1 as id, to_date('2019-10-12', 'YYYY-MM-DD') as lastupdated, 'Blue'  as color, 'Zone 7' as zoneid from dual  
union all  
select 2 as id, to_date('2019-10-21', 'YYYY-MM-DD') as lastupdated, 'Red'   as color, 'Zone 7' as zoneid from dual  
union all  
select 3 as id, to_date('2019-10-16', 'YYYY-MM-DD') as lastupdated, 'White' as color, 'Zone 7' as zoneid from dual  
union all  
select 4 as id, to_date('2019-10-17', 'YYYY-MM-DD') as lastupdated, 'Green' as color, 'Zone 7' as zoneid from dual  
;  

create table t_further_info as  
select 1 as id, to_date('2019-10-12', 'YYYY-MM-DD') as lastupdated, 'A' as typeinfo from dual  
union all   
select 2 as id, to_date('2019-10-14', 'YYYY-MM-DD') as lastupdated, 'C' as typeinfo from dual  
union all  
select 3 as id, to_date('2019-10-21', 'YYYY-MM-DD') as lastupdated, 'D' as typeinfo from dual  
union all  
select 4 as id, to_date('2019-10-21', 'YYYY-MM-DD') as lastupdated, 'E' as typeinfo from dual  
;  

prompt  ====================
prompt  "Traditional" syntax
prompt  ====================

explain plan for
select 
        /*+ or_expand */ 
        * 
from
        t,
        t_others        pt,
        t_further_info  fi  
/*
where   (
             t.lastupdated >= to_date('2019-10-21', 'YYYY-MM-DD')   
         or pt.lastupdated >= to_date('2019-10-21', 'YYYY-MM-DD')  
         or fi.lastupdated >= to_date('2019-10-21', 'YYYY-MM-DD')  
        )  
*/
where   to_date('2019-10-21', 'YYYY-MM-DD') <= any(t.lastupdated, pt.lastupdated, fi.lastupdated)   
and     pt.id = t.id  
and     fi.id = t.id  
;

select * from table(dbms_xplan.display(null,null,'outline'));


You’ll see that I have a simple three-table join with the nasty little detail that I have a “non-join” predicate that may require Oracle to check across all three tables before it can decide whether or not a row should be discarded. I’ve shown two variants on a theme – they both have exactly the same effect but the ANY() presentation is just a little bit neater and more compact.

Essentially Oracle can use one of two strategies for this type of query/predicate: the first is to join all three tables and wait until the final join rowsource appears before applying the check; the second is to split the query into a union all of three separate queries where each query drives off a different table, selecting only the rows from that table that match “its” part of the predicate.

In the latter case the second and third branches of the union all have to be modified to ensure that they discard any rows already returned by preceding parts of the union all; this can mean lots of new predicates appearing that use the lnnvl() function. (Of course there are variants between these two extremes, but the general principle is unchanged.)
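To make that description a little more concrete, here’s a sketch of what a hand-written expansion of my query might look like, using the three tables from the script above. The branch order and the exact placement of the lnnvl() calls are just one possible arrangement – the optimizer’s internal version won’t necessarily look like this:

select  t.*, pt.*, fi.*
from    t, t_others pt, t_further_info fi
where   pt.id = t.id
and     fi.id = t.id
and     t.lastupdated  >= to_date('2019-10-21','YYYY-MM-DD')
union all
select  t.*, pt.*, fi.*
from    t, t_others pt, t_further_info fi
where   pt.id = t.id
and     fi.id = t.id
and     pt.lastupdated >= to_date('2019-10-21','YYYY-MM-DD')
and     lnnvl(t.lastupdated  >= to_date('2019-10-21','YYYY-MM-DD'))
union all
select  t.*, pt.*, fi.*
from    t, t_others pt, t_further_info fi
where   pt.id = t.id
and     fi.id = t.id
and     fi.lastupdated >= to_date('2019-10-21','YYYY-MM-DD')
and     lnnvl(t.lastupdated  >= to_date('2019-10-21','YYYY-MM-DD'))
and     lnnvl(pt.lastupdated >= to_date('2019-10-21','YYYY-MM-DD'))
;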

The presence of the (incomplete) /*+ or_expand */ hint in my query is there to tell the optimizer that it should attempt to transform the query into the three-part union all. This, by the way, is a 12c feature, though older versions of Oracle could get similar effects in some cases from the /*+ use_concat */ hint. Here’s the plan, with outline, that I got from 12.2.0.1:


Plan hash value: 3181357500

----------------------------------------------------------------------------------------------
| Id  | Operation               | Name               | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |                    |     3 |   270 |    17   (6)| 00:00:01 |
|   1 |  VIEW                   | VW_ORE_67EF6547    |     3 |   270 |    17   (6)| 00:00:01 |
|   2 |   UNION-ALL             |                    |       |       |            |          |
|*  3 |    HASH JOIN            |                    |     1 |    55 |     6   (0)| 00:00:01 |
|   4 |     MERGE JOIN CARTESIAN|                    |     4 |   144 |     4   (0)| 00:00:01 |
|*  5 |      TABLE ACCESS FULL  | T_FURTHER_INFO     |     1 |    13 |     2   (0)| 00:00:01 |
|   6 |      BUFFER SORT        |                    |     3 |    69 |     2   (0)| 00:00:01 |
|*  7 |       TABLE ACCESS FULL | T_OTHERS           |     3 |    69 |     2   (0)| 00:00:01 |
|*  8 |     TABLE ACCESS FULL   | T                  |     3 |    57 |     2   (0)| 00:00:01 |
|*  9 |    HASH JOIN            |                    |     2 |   182 |    11  (10)| 00:00:01 |
|  10 |     VIEW                | VW_JF_SET$FB5125FC |     2 |   156 |     9  (12)| 00:00:01 |
|  11 |      UNION-ALL          |                    |       |       |            |          |
|* 12 |       HASH JOIN         |                    |     1 |    42 |     4   (0)| 00:00:01 |
|* 13 |        TABLE ACCESS FULL| T_OTHERS           |     1 |    23 |     2   (0)| 00:00:01 |
|* 14 |        TABLE ACCESS FULL| T                  |     3 |    57 |     2   (0)| 00:00:01 |
|* 15 |       HASH JOIN         |                    |     1 |    42 |     4   (0)| 00:00:01 |
|* 16 |        TABLE ACCESS FULL| T                  |     1 |    19 |     2   (0)| 00:00:01 |
|  17 |        TABLE ACCESS FULL| T_OTHERS           |     4 |    92 |     2   (0)| 00:00:01 |
|  18 |     TABLE ACCESS FULL   | T_FURTHER_INFO     |     4 |    52 |     2   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------

Outline Data
-------------

  /*+
      BEGIN_OUTLINE_DATA
      USE_HASH(@"SEL$7C4216F7" "PT"@"SEL$1")
      LEADING(@"SEL$7C4216F7" "T"@"SEL$1" "PT"@"SEL$1")
      FULL(@"SEL$7C4216F7" "PT"@"SEL$1")
      FULL(@"SEL$7C4216F7" "T"@"SEL$1")
      USE_HASH(@"SEL$A4A33BE0" "T"@"SEL$1")
      LEADING(@"SEL$A4A33BE0" "PT"@"SEL$1" "T"@"SEL$1")
      FULL(@"SEL$A4A33BE0" "T"@"SEL$1")
      FULL(@"SEL$A4A33BE0" "PT"@"SEL$1")
      USE_HASH(@"SET$49E1C21B_3" "T"@"SEL$1")
      USE_MERGE_CARTESIAN(@"SET$49E1C21B_3" "PT"@"SEL$1")
      LEADING(@"SET$49E1C21B_3" "FI"@"SEL$1" "PT"@"SEL$1" "T"@"SEL$1")
      FULL(@"SET$49E1C21B_3" "T"@"SEL$1")
      FULL(@"SET$49E1C21B_3" "PT"@"SEL$1")
      FULL(@"SET$49E1C21B_3" "FI"@"SEL$1")
      USE_HASH(@"SEL$5FCD2D3C" "FI"@"SEL$1")
      LEADING(@"SEL$5FCD2D3C" "VW_JF_SET$FB5125FC"@"SEL$81DF0931" "FI"@"SEL$1")
      FULL(@"SEL$5FCD2D3C" "FI"@"SEL$1")
      NO_ACCESS(@"SEL$5FCD2D3C" "VW_JF_SET$FB5125FC"@"SEL$81DF0931")
      NO_ACCESS(@"SEL$67EF6547" "VW_ORE_67EF6547"@"SEL$67EF6547")
      OUTLINE(@"SET$49E1C21B_2")
      OUTLINE(@"SET$49E1C21B_1")
      OUTLINE(@"SEL$1")
      FACTORIZE_JOIN(@"SET$49E1C21B"("FI"@"SET$49E1C21B_2" "FI"@"SET$49E1C21B_1"))
      OUTLINE(@"SET$0E101D56")
      OUTLINE(@"SEL$81DF0931")
      OUTLINE(@"SEL$5AB42CD1")
      OR_EXPAND(@"SEL$1" (1) (2) (3))
      OUTLINE(@"SET$49E1C21B")
      OUTLINE_LEAF(@"SEL$67EF6547")
      FACTORIZE_JOIN(@"SET$49E1C21B"("FI"@"SET$49E1C21B_2" "FI"@"SET$49E1C21B_1"))
      OUTLINE_LEAF(@"SET$0E101D56")
      OUTLINE_LEAF(@"SEL$5FCD2D3C")
      OUTLINE_LEAF(@"SET$FB5125FC")
      OUTLINE_LEAF(@"SEL$A4A33BE0")
      OUTLINE_LEAF(@"SEL$7C4216F7")
      OUTLINE_LEAF(@"SET$49E1C21B_3")
      ALL_ROWS
      DB_VERSION('12.2.0.1')
      OPTIMIZER_FEATURES_ENABLE('12.2.0.1')
      IGNORE_OPTIM_EMBEDDED_HINTS
      END_OUTLINE_DATA
  */

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("PT"."ID"="T"."ID" AND "FI"."ID"="T"."ID")
   5 - filter("FI"."LASTUPDATED">=TO_DATE(' 2019-10-21 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss'))
   7 - filter(LNNVL("PT"."LASTUPDATED">=TO_DATE(' 2019-10-21 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss')))
   8 - filter(LNNVL("T"."LASTUPDATED">=TO_DATE(' 2019-10-21 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss')))
   9 - access("FI"."ID"="ITEM_1")
  12 - access("PT"."ID"="T"."ID")
  13 - filter("PT"."LASTUPDATED">=TO_DATE(' 2019-10-21 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss'))
  14 - filter(LNNVL("T"."LASTUPDATED">=TO_DATE(' 2019-10-21 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss')))
  15 - access("PT"."ID"="T"."ID")
  16 - filter("T"."LASTUPDATED">=TO_DATE(' 2019-10-21 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss'))


This is a wonderful demonstration of how brilliant the optimizer can be.  The query has gone through several transformations and two of them have very high visibility. First, you can see the name VW_ORE_67EF6547 at operation 1. This is a view name that Oracle generates to express (cost-based) “OR-expansion”, so the optimizer has clearly obeyed my hint. As a consequence of OR-expansion we can also see several examples of the lnnvl() function appearing in the Predicate Information section of the output; we can also see the hint re-appearing in the completed form of OR_EXPAND(@"SEL$1" (1) (2) (3)) in the Outline Data.

However, we don’t have the union all of three pieces that we might have expected; we have a union all of two pieces and the second piece is a hash join between the table t_further_info and a view called VW_JF_SET$FB5125FC. This view is the result of “join factorization”. The optimizer has taken the 2nd and 3rd sections of our union all view and decided that it would be cost-effective to “factor out” a common table, so this:

select from t_others, t, t_further_info ... where t_others.date_predicate ...
union all
select from t, t_others, t_further_info ... where t.date_predicate ...

changes to this:

select from 
        (
        select from t_others, t   ... where t_others.date_predicate ...
        union all
        select from t, t_others   ... where t.date_predicate ...
        ),
        t_further_info 
where   ...

Having said all that, I now have to change the code because the original query was written using “ANSI” style joins – like this:


explain plan for
select
        /*+ or_expand */
        * 
from
        t   
inner join 
        t_others       pt  
on      pt.id = t.id  
inner join 
        t_further_info fi  
on      fi.id = t.id  
where
        (
             t.lastupdated >= to_date('2019-10-21', 'YYYY-MM-DD')   
         or pt.lastupdated >= to_date('2019-10-21', 'YYYY-MM-DD')  
         or fi.lastupdated >= to_date('2019-10-21', 'YYYY-MM-DD')  
        )
;

select  * from table(dbms_xplan.display(null,null,'outline'));


In the ANSI example I happen to have used the explicit “OR” list for the date predicates but that’s not really significant. Here’s the plan produced by this query – and the first thing I’d like you to note is the Plan hash value:

Plan hash value: 3309788271

----------------------------------------------------------------------------------------------
| Id  | Operation               | Name               | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |                    |     3 |   270 |    17   (6)| 00:00:01 |
|   1 |  VIEW                   | VW_ORE_31069B60    |     3 |   270 |    17   (6)| 00:00:01 |
|   2 |   UNION-ALL             |                    |       |       |            |          |
|*  3 |    HASH JOIN            |                    |     1 |    55 |     6   (0)| 00:00:01 |
|   4 |     MERGE JOIN CARTESIAN|                    |     4 |   144 |     4   (0)| 00:00:01 |
|*  5 |      TABLE ACCESS FULL  | T_FURTHER_INFO     |     1 |    13 |     2   (0)| 00:00:01 |
|   6 |      BUFFER SORT        |                    |     3 |    69 |     2   (0)| 00:00:01 |
|*  7 |       TABLE ACCESS FULL | T_OTHERS           |     3 |    69 |     2   (0)| 00:00:01 |
|*  8 |     TABLE ACCESS FULL   | T                  |     3 |    57 |     2   (0)| 00:00:01 |
|*  9 |    HASH JOIN            |                    |     2 |   182 |    11  (10)| 00:00:01 |
|  10 |     VIEW                | VW_JF_SET$997549B1 |     2 |   156 |     9  (12)| 00:00:01 |
|  11 |      UNION-ALL          |                    |       |       |            |          |
|* 12 |       HASH JOIN         |                    |     1 |    42 |     4   (0)| 00:00:01 |
|* 13 |        TABLE ACCESS FULL| T_OTHERS           |     1 |    23 |     2   (0)| 00:00:01 |
|* 14 |        TABLE ACCESS FULL| T                  |     3 |    57 |     2   (0)| 00:00:01 |
|* 15 |       HASH JOIN         |                    |     1 |    42 |     4   (0)| 00:00:01 |
|* 16 |        TABLE ACCESS FULL| T                  |     1 |    19 |     2   (0)| 00:00:01 |
|  17 |        TABLE ACCESS FULL| T_OTHERS           |     4 |    92 |     2   (0)| 00:00:01 |
|  18 |     TABLE ACCESS FULL   | T_FURTHER_INFO     |     4 |    52 |     2   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------

Outline Data
-------------

  /*+
      BEGIN_OUTLINE_DATA
      USE_HASH(@"SEL$D12FC97A" "PT"@"SEL$1")
      LEADING(@"SEL$D12FC97A" "T"@"SEL$1" "PT"@"SEL$1")
      FULL(@"SEL$D12FC97A" "PT"@"SEL$1")
      FULL(@"SEL$D12FC97A" "T"@"SEL$1")
      USE_HASH(@"SEL$09C9729D" "T"@"SEL$1")
      LEADING(@"SEL$09C9729D" "PT"@"SEL$1" "T"@"SEL$1")
      FULL(@"SEL$09C9729D" "T"@"SEL$1")
      FULL(@"SEL$09C9729D" "PT"@"SEL$1")
      USE_HASH(@"SET$E8D85587_3" "T"@"SEL$1")
      USE_MERGE_CARTESIAN(@"SET$E8D85587_3" "PT"@"SEL$1")
      LEADING(@"SET$E8D85587_3" "FI"@"SEL$2" "PT"@"SEL$1" "T"@"SEL$1")
      FULL(@"SET$E8D85587_3" "T"@"SEL$1")
      FULL(@"SET$E8D85587_3" "PT"@"SEL$1")
      FULL(@"SET$E8D85587_3" "FI"@"SEL$2")
      USE_HASH(@"SEL$95B99BAF" "FI"@"SEL$2")
      LEADING(@"SEL$95B99BAF" "VW_JF_SET$997549B1"@"SEL$BB7F1ECF" "FI"@"SEL$2")
      FULL(@"SEL$95B99BAF" "FI"@"SEL$2")
      NO_ACCESS(@"SEL$95B99BAF" "VW_JF_SET$997549B1"@"SEL$BB7F1ECF")
      NO_ACCESS(@"SEL$31069B60" "VW_ORE_31069B60"@"SEL$31069B60")
      OUTLINE(@"SEL$1")
      OUTLINE(@"SEL$2")
      MERGE(@"SEL$1" >"SEL$2")
      OUTLINE(@"SEL$58A6D7F6")
      OUTLINE(@"SEL$3")
      OUTLINE(@"SET$E8D85587_2")
      OUTLINE(@"SET$E8D85587_1")
      MERGE(@"SEL$58A6D7F6" >"SEL$3")
      OUTLINE(@"SEL$9E43CB6E")
      FACTORIZE_JOIN(@"SET$E8D85587"("FI"@"SET$E8D85587_2" "FI"@"SET$E8D85587_1"))
      OUTLINE(@"SET$6117B24C")
      OUTLINE(@"SEL$BB7F1ECF")
      OUTLINE(@"SEL$344003E3")
      OR_EXPAND(@"SEL$9E43CB6E" (1) (2) (3))
      OUTLINE(@"SET$E8D85587")
      OUTLINE_LEAF(@"SEL$31069B60")
      FACTORIZE_JOIN(@"SET$E8D85587"("FI"@"SET$E8D85587_2" "FI"@"SET$E8D85587_1"))
      OUTLINE_LEAF(@"SET$6117B24C")
      OUTLINE_LEAF(@"SEL$95B99BAF")
      OUTLINE_LEAF(@"SET$997549B1")
      OUTLINE_LEAF(@"SEL$09C9729D")
      OUTLINE_LEAF(@"SEL$D12FC97A")
      OUTLINE_LEAF(@"SET$E8D85587_3")
      ALL_ROWS
      DB_VERSION('12.2.0.1')
      OPTIMIZER_FEATURES_ENABLE('12.2.0.1')
      IGNORE_OPTIM_EMBEDDED_HINTS
      END_OUTLINE_DATA
  */

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("FI"."ID"="T"."ID" AND "PT"."ID"="T"."ID")
   5 - filter("FI"."LASTUPDATED">=TO_DATE(' 2019-10-21 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss'))
   7 - filter(LNNVL("PT"."LASTUPDATED">=TO_DATE(' 2019-10-21 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss')))
   8 - filter(LNNVL("T"."LASTUPDATED">=TO_DATE(' 2019-10-21 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss')))
   9 - access("FI"."ID"="ITEM_1")
  12 - access("PT"."ID"="T"."ID")
  13 - filter("PT"."LASTUPDATED">=TO_DATE(' 2019-10-21 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss'))
  14 - filter(LNNVL("T"."LASTUPDATED">=TO_DATE(' 2019-10-21 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss')))
  15 - access("PT"."ID"="T"."ID")
  16 - filter("T"."LASTUPDATED">=TO_DATE(' 2019-10-21 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss'))


Is the plan for the “ANSI” version of the query the same as the plan for the “traditional” version? How carefully have you checked, after simply noting that the two Plan hash values were different?

The plans are the same – in that the mechanics are exactly the same and that ought to be the criterion on which we should judge them. But the hash values are different because of the change from traditional to ANSI syntax. The traditional form of the query starts life with a single query block while the ANSI form is considered to be two query blocks, so the initial ANSI query goes through three stages:


1) select from t1 join t2 join t3

2) select /*+ qb_name(sel$2) */ from (select /* qb_name(sel$1) */ from t1, t2), t3

3) select /*+ qb_name(sel$9E43CB6E) */ from t1, t2, t3

So the query is rapidly transformed to the equivalent traditional syntax but we now have a query block name of SEL$9E43CB6E instead of the SEL$1 that the traditional query (in the absence of a /*+ qb_name() */ hint) would have had. This is why you see the difference in the two or_expand() hints in the Outline Data section. One reads: OR_EXPAND(@"SEL$1" (1) (2) (3)), the other reads OR_EXPAND(@"SEL$9E43CB6E" (1) (2) (3)), and all the subsequent query block name differences follow on from this initial transformation. (Note: the value “sel$9e43cb6e” is derived from the input query block names of sel$1 and sel$2 that the new query block is derived from.)

You may decide that this mismatch isn’t really terribly important. If you’ve modified the code to switch to ANSI style joins then you may be prepared to put in a little extra effort to check the plan in detail to see that it hasn’t changed; but it would be a lot nicer if the hash value wasn’t dependent on generated view names. You may recall that at one time Oracle had problems with plans that used materialized CTEs (“with” subqueries) because the plan hash value was affected by object names like sys_temp_0fd9d6791_dfc12da. The same principle ought, I think, to apply here.

If you don’t mind the ANSI/traditional switch, though, you might find that you’re less happy when you upgrade to 19c, because the same effect appears there too, only it’s worse. Not only do “identical” traditional and ANSI plans have different hash values, they don’t match the values from 12c because the generated name for the join factorization views (VW_JF) changes in the upgrade. So if you’re depending on SQL Plan Baselines to reproduce 12c plans on 19c when you upgrade you may find cases where you know the stored baseline is giving you the same plan but Oracle thinks it isn’t and refuses to use it.

tl;dr

Plans which are functionally identical can have different plan hash values because the plans were reached through a different series of transformations. In particular if you rewrite queries from “traditional” Oracle syntax to “ANSI” syntax you will find cases where the plan doesn’t change but the plan hash value does, thanks to a change in the names of views generated by some transformations.

More significantly, if you upgrade from 12c to 19c there are cases where the names of views generated by transformations may change, which could cause the optimizer to discard some of your carefully constructed SQL Plan Baselines as “not reproducible”.

Footnote

For reference, if I add the hint /*+ opt_param(‘_optimizer_join_factorization’ ‘false’) */ to the “traditional” query then I get the following plan, which shows more clearly the three branches that the original query has been split into – each section starting with a different choice for the driving table:


-------------------------------------------------------------------------------------------
| Id  | Operation               | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |                 |     3 |   270 |    19   (6)| 00:00:01 |
|   1 |  VIEW                   | VW_ORE_67EF6547 |     3 |   270 |    19   (6)| 00:00:01 |
|   2 |   UNION-ALL             |                 |       |       |            |          |
|*  3 |    HASH JOIN            |                 |     1 |    55 |     7  (15)| 00:00:01 |
|*  4 |     HASH JOIN           |                 |     1 |    32 |     4   (0)| 00:00:01 |
|*  5 |      TABLE ACCESS FULL  | T               |     1 |    19 |     2   (0)| 00:00:01 |
|   6 |      TABLE ACCESS FULL  | T_FURTHER_INFO  |     4 |    52 |     2   (0)| 00:00:01 |
|   7 |     TABLE ACCESS FULL   | T_OTHERS        |     4 |    92 |     2   (0)| 00:00:01 |
|*  8 |    HASH JOIN            |                 |     1 |    55 |     6   (0)| 00:00:01 |
|   9 |     MERGE JOIN CARTESIAN|                 |     4 |   144 |     4   (0)| 00:00:01 |
|* 10 |      TABLE ACCESS FULL  | T_OTHERS        |     1 |    23 |     2   (0)| 00:00:01 |
|  11 |      BUFFER SORT        |                 |     4 |    52 |     2   (0)| 00:00:01 |
|  12 |       TABLE ACCESS FULL | T_FURTHER_INFO  |     4 |    52 |     2   (0)| 00:00:01 |
|* 13 |     TABLE ACCESS FULL   | T               |     3 |    57 |     2   (0)| 00:00:01 |
|* 14 |    HASH JOIN            |                 |     1 |    55 |     6   (0)| 00:00:01 |
|  15 |     MERGE JOIN CARTESIAN|                 |     4 |   144 |     4   (0)| 00:00:01 |
|* 16 |      TABLE ACCESS FULL  | T_FURTHER_INFO  |     1 |    13 |     2   (0)| 00:00:01 |
|  17 |      BUFFER SORT        |                 |     3 |    69 |     2   (0)| 00:00:01 |
|* 18 |       TABLE ACCESS FULL | T_OTHERS        |     3 |    69 |     2   (0)| 00:00:01 |
|* 19 |     TABLE ACCESS FULL   | T               |     3 |    57 |     2   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------

Outline Data
-------------

  /*+
      BEGIN_OUTLINE_DATA
      USE_HASH(@"SET$BB614FD2_3" "T"@"SET$BB614FD2_3")
      USE_MERGE_CARTESIAN(@"SET$BB614FD2_3" "PT"@"SET$BB614FD2_3")
      LEADING(@"SET$BB614FD2_3" "FI"@"SET$BB614FD2_3" "PT"@"SET$BB614FD2_3"
              "T"@"SET$BB614FD2_3")
      FULL(@"SET$BB614FD2_3" "T"@"SET$BB614FD2_3")
      FULL(@"SET$BB614FD2_3" "PT"@"SET$BB614FD2_3")
      FULL(@"SET$BB614FD2_3" "FI"@"SET$BB614FD2_3")
      USE_HASH(@"SET$BB614FD2_2" "T"@"SET$BB614FD2_2")
      USE_MERGE_CARTESIAN(@"SET$BB614FD2_2" "FI"@"SET$BB614FD2_2")
      LEADING(@"SET$BB614FD2_2" "PT"@"SET$BB614FD2_2" "FI"@"SET$BB614FD2_2"
              "T"@"SET$BB614FD2_2")
      FULL(@"SET$BB614FD2_2" "T"@"SET$BB614FD2_2")
      FULL(@"SET$BB614FD2_2" "FI"@"SET$BB614FD2_2")
      FULL(@"SET$BB614FD2_2" "PT"@"SET$BB614FD2_2")
      USE_HASH(@"SET$BB614FD2_1" "PT"@"SET$BB614FD2_1")
      USE_HASH(@"SET$BB614FD2_1" "FI"@"SET$BB614FD2_1")
      LEADING(@"SET$BB614FD2_1" "T"@"SET$BB614FD2_1" "FI"@"SET$BB614FD2_1"
              "PT"@"SET$BB614FD2_1")
      FULL(@"SET$BB614FD2_1" "PT"@"SET$BB614FD2_1")
      FULL(@"SET$BB614FD2_1" "FI"@"SET$BB614FD2_1")
      FULL(@"SET$BB614FD2_1" "T"@"SET$BB614FD2_1")
      NO_ACCESS(@"SEL$49E1C21B" "VW_ORE_67EF6547"@"SEL$67EF6547")
      OUTLINE(@"SEL$1")
      OR_EXPAND(@"SEL$1" (1) (2) (3))
      OUTLINE_LEAF(@"SEL$49E1C21B")
      OUTLINE_LEAF(@"SET$BB614FD2")
      OUTLINE_LEAF(@"SET$BB614FD2_1")
      OUTLINE_LEAF(@"SET$BB614FD2_2")
      OUTLINE_LEAF(@"SET$BB614FD2_3")
      ALL_ROWS
      OPT_PARAM('_optimizer_join_factorization' 'false')
      DB_VERSION('19.1.0')
      OPTIMIZER_FEATURES_ENABLE('19.1.0')
      IGNORE_OPTIM_EMBEDDED_HINTS
      END_OUTLINE_DATA
  */

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("PT"."ID"="T"."ID")
   4 - access("FI"."ID"="T"."ID")
   5 - filter("T"."LASTUPDATED">=TO_DATE(' 2019-10-21 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss'))
   8 - access("PT"."ID"="T"."ID" AND "FI"."ID"="T"."ID")
  10 - filter("PT"."LASTUPDATED">=TO_DATE(' 2019-10-21 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss'))
  13 - filter(LNNVL("T"."LASTUPDATED">=TO_DATE(' 2019-10-21 00:00:00',
              'syyyy-mm-dd hh24:mi:ss')))
  14 - access("PT"."ID"="T"."ID" AND "FI"."ID"="T"."ID")
  16 - filter("FI"."LASTUPDATED">=TO_DATE(' 2019-10-21 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss'))
  18 - filter(LNNVL("PT"."LASTUPDATED">=TO_DATE(' 2019-10-21 00:00:00',
              'syyyy-mm-dd hh24:mi:ss')))
  19 - filter(LNNVL("T"."LASTUPDATED">=TO_DATE(' 2019-10-21 00:00:00',
              'syyyy-mm-dd hh24:mi:ss')))

88 rows selected.

Although the “traditional” and “ANSI” plans still show a difference in their Plan hash values when join factorization is blocked, the absence of the join factorization view means that the plan hash values are now consistent between 12c and 19c (the output above came from 19.3.0.0 as you can see in the Outline information).

CBO Oddities – 1

Fri, 2019-10-18 12:10

I’ve decided to do a little rewriting and collating so that I can catalogue related ideas in an order that makes for a better narrative. So this is the first in a series of notes designed to help you understand why the optimizer has made a particular choice and why that choice is (from your perspective) a bad one, and what you can do either to help the optimizer find a better plan, or subvert the optimizer and force a better plan.

If you’re wondering why I choose to differentiate between “help the optimizer” and “subvert the optimizer” consider the following examples.

  • A query is joining two tables in the wrong order with a hash join when you know that a nested loop join in the opposite order would be far better because you know that the data you want is very nicely clustered and there’s a really good index that would make access to that data very efficient. You check the table preferences and discover that the table_cached_blocks preference (see end notes) is at its default value of 1, so you set it to 16 and gather fresh stats on the indexes on the table. Oracle now recognises the effectiveness of this index and changes plan accordingly.
  • The optimizer has done a surprising transformation of a query, aggregating a table before joining to a couple of other tables when you were expecting it to use the joins to eliminate a huge fraction of the data before aggregating it. After a little investigation you find that setting the hidden parameter _optimizer_distinct_placement to false stops this happening.

You may find the distinction unnecessarily fussy, but I’d call the first example “helping the optimizer” – it gives the optimizer some truthful information about your data that is potentially going to result in better decisions in many different statements – and the second example “subverting the optimizer” – you’ve brute-forced it into not taking a path you didn’t like but at the same time you may have stopped that feature from appearing in other ways or in other queries. Of course, you might have minimised the impact of setting the parameter by using the opt_param() hint to apply the restriction to just this one query; nevertheless it’s possible that there is a better plan for the query that would have used the feature at some other point in the query if you’d managed to do something to help the optimizer rather than constraining it.
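
To put some flesh on those two examples, the statements involved would look something like the following sketch – the table name, index name, column names and preference value here are purely illustrative, not taken from a real system:

rem
rem     Sketch only - object names and values are illustrative
rem

begin
        -- "helping": tell the optimizer the truth about clustering/caching
        dbms_stats.set_table_prefs(
                ownname => user,
                tabname => 'ORDERS',
                pname   => 'table_cached_blocks',
                pvalue  => '16'
        );

        dbms_stats.gather_index_stats(
                ownname => user,
                indname => 'ORD_DATE_IDX'
        );
end;
/

rem     "subverting": disable a feature for just this one statement

select
        /*+ opt_param('_optimizer_distinct_placement' 'false') */
        o.customer_id, count(distinct o.product_id)
from
        orders o
group by
        o.customer_id
;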

What’s up with the Optimizer

It’s likely that most of the articles will be based around interpreting execution plans since those are the things that tell us what the optimizer thinks will happen when it executes a statement, and within execution plans there are three critical aspects to consider –

  1. the numbers (most particularly Cost and Rows),
  2. the shape of the plan,
  3. the Predicate Information.

I want to use this note to make a couple of points about just the first of the three.

  • First – the estimates on any one line of an execution plan are “per start” of the line; some lines of an execution plan will be called many times in the course of a statement. In many cases the Rows estimate from one line of a plan will dictate the number of times that some other line of the plan will be executed – so a bad estimate of “how much data” can double up as a bad estimate of “how many times”, leading to a plan that looks efficient on paper but does far too much work at run-time. A line in a plan that looks a little inefficient may be fine if it executes only once, while a line that looks very efficient may be a disaster if it executes a million times. Being able to read a plan and spot the places where the optimizer has produced a poor estimate of Rows is a critical skill – and there are many reasons why the optimizer produces poor estimates. Being able to spot poor estimates depends fairly heavily on knowing the data, but if you know the generic reasons for the optimizer producing poor estimates you’ve got a head start for recognising and addressing the errors when they appear.
  • Second – Cost is synonymous with Time. For a given instance at a given moment there is a simple, linear, relationship between the figure that the optimizer reports for the Cost of a statement (or subsection of a statement) and the Time that the optimizer reports. For many systems (those that have not run the calibrate_io procedure) the Time is simply the Cost multiplied by the time the optimizer thinks it will take to satisfy a single block read request, and the Cost is the optimizer’s estimate of the I/O requirement to satisfy the statement – with a fudge factor introduced to recognise the fact that a “single block” read request ought to complete in less time than a “multiblock” read request. Generally speaking the optimizer will consider many possible plans for a statement and pick the plan with the lowest estimated cost – but there is at least one exception to this rule, and it is an unfortunate weakness in the optimizer that there are many valid reasons why its estimates of Cost/Time are poor. Of course, you will note that the values that Oracle reports for the Time column are only accurate to the second – which isn’t particularly helpful when a single block read typically operates in the range of a few milliseconds.
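
If you want to see the figures the optimizer is working with, the relevant system statistics are visible in the SYS-owned table sys.aux_stats$ – a quick sketch (you’ll need to be able to query SYS objects to run it):

select  pname, pval1
from    sys.aux_stats$
where   sname = 'SYSSTATS_MAIN'
and     pname in ('SREADTIM', 'MREADTIM', 'CPUSPEEDNW', 'IOSEEKTIM', 'IOTFRSPEED')
order by
        pname
;

With noworkload statistics SREADTIM will usually be null and the optimizer derives a single block read time from the I/O seek time and transfer speed figures; either way, as noted above, the reported Time is (roughly) the Cost multiplied by that single block read time, reported only to whole seconds.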

To a large degree the optimizer’s task boils down to:

  • What’s the volume and scatter of the data I need
  • What access paths, with what wastage, are available to get to that data
  • How much time will I spend on I/O reading (and possibly discarding) data to extract the bit I want

Of course there are other considerations like the amount of CPU needed for a sort, the potential for I/O as sorts or hash joins spill to disc, the time to handle a round-trip to a remote system, and RAC variations on the basic theme. But for many statements the driving issue is that any bad estimates of “how much data” and “how much (real) I/O” will lead to bad, potentially catastrophic, choices of execution plan. In the next article I’ll list all the different reasons (that I can think of at the time) why the optimizer can produce bad estimates of volume and time.

References for Cost vs. Time

References for table_cached_blocks:

 

Clustering_Factor

Wed, 2019-10-16 08:07

Originally drafted July 2018

“How do you find out what the clustering_factor of an index would be without first creating the index ?”

I’m not sure this is really a question worth asking or answering[1], but since someone asked it (and given the draft date I have no idea who, where, when or why), here’s an answer for simple heap tables in the good old days before Oracle made public the table_cached_blocks preference. It works by sorting the columns you want in the index together with the table rowid, and then comparing the file/block component of the rowid (cutting the relevant characters from the string representation of the rowid) with the previous one to see if the current row is in the same block as the previous row.  If the row is in a different block we count one, otherwise zero.  Finally we sum the ones.

In the demo I’ve copied a little data from view all_objects, with the intention of creating an index on object_name. After running my counting query I’ve then created the index and checked its clustering_factor to show that I’ve got a match.


rem
rem     Script: clustering_factor_est.sql
rem     Author: J.P.Lewis
rem     Dated:  July 2018
rem
create table t1
as
select  *
from    all_objects
where   rownum <= 10000
;

prompt  ====================
prompt  Tablescan with lag()
prompt  ====================

select
        sum(count_chg)
from    (
select
        case
                when substr(rowid,7,9) <> lag(substr(rowid,7,9),1,'000000000') over(order by object_name, rowid)
                        then 1
                        else 0
        end     count_chg
from t1
)
;

prompt  ======================
prompt  create index and check
prompt  ======================

create index t1_i1 on t1(object_name);

select  clustering_factor
from    user_indexes 
where   table_name = 'T1'
and     index_name = 'T1_I1'
;

Pasting the output from running the above:


Table created.

====================
Tablescan with lag()
====================

SUM(COUNT_CHG)
--------------
          3901

1 row selected.

======================
create index and check
======================

Index created.


CLUSTERING_FACTOR
-----------------
             3901

1 row selected.


This will work for a global index on a partitioned table, but will give meaningless answers for globally partitioned indexes and local indexes. Furthermore it’s not a mechanism that lends itself to calculating the clustering_factor if you’ve set the table_cached_blocks preference to a value other than 1.

[1] Given the amount of work it would take to run the query to check the clustering_factor you might as well create the index (invisible, perhaps, and nologging if that doesn’t cause problems) – which also gives you the option of setting the table_cached_blocks preference and gathering stats to see how the clustering_factor varies.

Update (shortly after posting)

It has occurred to me that if you have a “nosegment” index that you’ve been using to test whether or not the optimizer would use it IF you created it, then the one piece of information that is missing from the data dictionary for that index is its clustering_factor. (For a multicolumn index you can get a valid value for distinct_keys by creating a column group on the set of columns – which would otherwise be faked when you create the nosegment index.) This code might allow you to write a clustering_factor to the index definition that would give you a better idea of whether the optimizer would use it if you created it.  (To be tested.)
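
For what it’s worth, a first (and, as noted, untested) sketch of that idea might look like the following – the index name is mine, the clustering_factor is the value from the lag() query above, and the hidden parameter _use_nosegment_indexes has to be set before the optimizer will consider nosegment indexes:

alter session set "_use_nosegment_indexes" = true;

create index t1_i2 on t1(object_name) nosegment;

begin
        dbms_stats.set_index_stats(
                ownname => user,
                indname => 'T1_I2',
                clstfct => 3901         -- estimate from the lag() query
        );
end;
/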

 

 

v$session

Fri, 2019-10-11 06:29

Here’s an odd, and unpleasant, detail about querying v$session in the “most obvious” way. (And if you were wondering what made me resurrect and complete a draft on “my session id” a couple of days ago, this posting is the reason). Specifically if you want to select some information for your own session from v$session the query you’re likely to use in any recent version of Oracle will probably be of the form:


select {list for columns} from v$session where sid = to_number(sys_context('userenv','sid'));

Unfortunately that one little statement hides two anomalies – which you can see in the execution plan. Here’s a demonstration cut from an SQL*Plus session running under 19.3.0.0:


SQL> select * from table(dbms_xplan.display_cursor);

SQL_ID  gcfrzq9knynj3, child number 0
-------------------------------------
select program from V$session where sid = sys_context('userenv','sid')

Plan hash value: 2422122865

----------------------------------------------------------------------------------
| Id  | Operation                 | Name            | Rows  | Bytes | Cost (%CPU)|
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                 |       |       |     1 (100)|
|   1 |  MERGE JOIN CARTESIAN     |                 |     1 |    33 |     0   (0)|
|   2 |   NESTED LOOPS            |                 |     1 |    12 |     0   (0)|
|*  3 |    FIXED TABLE FULL       | X$KSLWT         |     1 |     8 |     0   (0)|
|*  4 |    FIXED TABLE FIXED INDEX| X$KSLED (ind:2) |     1 |     4 |     0   (0)|
|   5 |   BUFFER SORT             |                 |     1 |    21 |     0   (0)|
|*  6 |    FIXED TABLE FULL       | X$KSUSE         |     1 |    21 |     0   (0)|
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter("W"."KSLWTSID"=TO_NUMBER(SYS_CONTEXT('userenv','sid')))
   4 - filter("W"."KSLWTEVT"="E"."INDX")
   6 - filter((BITAND("S"."KSUSEFLG",1)<>0 AND BITAND("S"."KSSPAFLG",1)<>0 AND 
              "S"."INDX"=TO_NUMBER(SYS_CONTEXT('userenv','sid'))
              AND INTERNAL_FUNCTION("S"."CON_ID") AND "S"."INST_ID"=USERENV('INSTANCE')))

As you can see, v$session is a join of 3 separate structures – x$kslwt (v$session_wait), x$ksled (v$event_name), and x$ksuse (the original v$session as it was some time around 8i), and the plan shows two “full tablescans” and a Cartesian merge join. Tablescans and Cartesian merge joins are not necessarily bad – especially where small tables and tiny numbers of rows are concerned – but they do merit at least a brief glance.

x$ksuse is a C structure in the fixed SGA and that structure is a segmented array (which seems to be chunks of 126 entries in 19.3, and chunks of 209 entries in 12.2 – but that’s fairly irrelevant). The SID is simply the index into the array counting from 1, so if you have a query with a predicate like ‘SID = 99’ Oracle can work out the address of the 99th entry in the array and access it very quickly – which is why the SID column is reported as a “fixed index” column in the view v$indexed_fixed_column.
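
You can check which columns of the underlying x$ structures are covered by these “fixed indexes” with a simple query, e.g.:

select  table_name, index_number, column_name, column_position
from    v$indexed_fixed_column
where   table_name in ('X$KSUSE', 'X$KSLWT', 'X$KSLED')
order by
        table_name, index_number, column_position
;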

But we have two problems immediately visible:

  1. the optimizer is not using the “index” to access x$ksuse despite the fact that we’re giving it exactly the value we want to use (and we can see a suitable predicate at operation 6 in the plan)
  2. the optimizer has decided to start executing the query at the x$kslwt table

Before looking at why things have gone wrong, let’s check the execution plan to see what would have happened if we’d copied the value from the sys_context() call into a bind variable and queried using the bind variable – which we’ll keep as a character type to make it a fair comparison:

SQL_ID  cm3ub1tctpdyt, child number 0
-------------------------------------
select program from v$session where sid = to_number(:v1)

Plan hash value: 1627146547

----------------------------------------------------------------------------------
| Id  | Operation                 | Name            | Rows  | Bytes | Cost (%CPU)|
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                 |       |       |     1 (100)|
|   1 |  MERGE JOIN CARTESIAN     |                 |     1 |    32 |     0   (0)|
|   2 |   NESTED LOOPS            |                 |     1 |    12 |     0   (0)|
|*  3 |    FIXED TABLE FIXED INDEX| X$KSLWT (ind:1) |     1 |     8 |     0   (0)|
|*  4 |    FIXED TABLE FIXED INDEX| X$KSLED (ind:2) |     1 |     4 |     0   (0)|
|   5 |   BUFFER SORT             |                 |     1 |    20 |     0   (0)|
|*  6 |    FIXED TABLE FIXED INDEX| X$KSUSE (ind:1) |     1 |    20 |     0   (0)|
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter("W"."KSLWTSID"=TO_NUMBER(:V1))
   4 - filter("W"."KSLWTEVT"="E"."INDX")
   6 - filter(("S"."INDX"=TO_NUMBER(:V1) AND BITAND("S"."KSSPAFLG",1)<>0
              AND BITAND("S"."KSUSEFLG",1)<>0 AND INTERNAL_FUNCTION("S"."CON_ID") AND
              "S"."INST_ID"=USERENV('INSTANCE')))


When we have a (character) bind variable instead of a sys_context() value the optimizer manages to use the “fixed indexes”, but it still starts executing at x$kslwt, and still does a Cartesian merge join. The plan would be the same if the bind variable were a numeric type, and we’d still get the same plan if we replaced the bind variable with a literal number.

So problem number 1 is that Oracle only seems able to use the fixed index path for literal values and simple bind variables (plus a few “simple” functions). It doesn’t seem to use the fixed indexes for most functions (even deterministic ones) returning a value and the sys_context() function is a particular example of this.

Transitivity

Problem number 2 comes from a side-effect of something that I first described about 13 years ago – transitive closure. Take a look at the predicates in both the execution plans above. Where’s the join condition between x$ksuse and x$kslwt? There should be one, because the underlying SQL defining [g]v$session has the following joins:

from
      x$ksuse s,
      x$ksled e,
      x$kslwt w
where
      bitand(s.ksspaflg,1)!=0
and   bitand(s.ksuseflg,1)!=0
and   s.indx=w.kslwtsid       -- this is the SID column for v$session and v$session_wait
and   w.kslwtevt=e.indx
 

What’s happened here is that the optimizer has used transitive closure: “if a = b and b = c then a = c” to clone the predicate “s.indx = to_number(sys_context(…))” to “w.kslwtsid = to_number(sys_context(…))”. But at the same time the optimizer has eliminated the predicate “s.indx = w.kslwtsid”, which it shouldn’t do because this is 12.2.0.1 and ever since 10g we’ve had the parameter _optimizer_transitivity_retain = true — but SYS is ignoring the parameter setting.

So we no longer have a join condition between x$ksuse and x$kslwt – which means there has to be a cartesian merge join between them and the only question is whether this should take place before or after the join between x$kslwt and x$ksled. In fact, the order doesn’t really matter because there will be only one row identified in x$kslwt and one row in x$ksuse, and the join to x$ksled is simply a lookup (by undeclarable unique key) to translate an id into a name and it will take place only once whatever we do about the other two structures.

But there is a catch – especially if your sessions parameter is 25,000 (which it shouldn’t be) and the number of connected sessions is currently 20,000 (which it shouldn’t be) – the predicate against x$ksuse does a huge amount of work as it walks the entire array testing every row (and it doesn’t even do the indx test first – it does a couple of bitand() operations). Even then this wouldn’t be a disaster – we’re only talking a couple of hundredths of a second of CPU – until you find the applications that run this query a huge number of times.

We would prefer to avoid two full tablescans since the arrays could be quite large, and of the two it’s the tablescan of x$ksuse that is going to be the greater threat; so is there a way to bypass the threat?  Once we’ve identified the optimizer anomaly we’ve got a pointer to a solution. Transitivity is going wrong, so let’s attack the transitivity. Checking the hidden parameters we can find a parameter: _optimizer_generate_transitive_pred which defaults to true, so let’s set it to false for the query and check the plan:

select  /*+   opt_param('_optimizer_generate_transitive_pred','FALSE')
*/  program from  v$session where  sid = sys_context('userenv','sid')

Plan hash value: 3425234845

----------------------------------------------------------------------------------
| Id  | Operation                 | Name            | Rows  | Bytes | Cost (%CPU)|
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                 |       |       |     1 (100)|
|   1 |  NESTED LOOPS             |                 |     1 |    32 |     0   (0)|
|   2 |   NESTED LOOPS            |                 |     1 |    28 |     0   (0)|
|   3 |    FIXED TABLE FULL       | X$KSLWT         |    47 |   376 |     0   (0)|
|*  4 |    FIXED TABLE FIXED INDEX| X$KSUSE (ind:1) |     1 |    20 |     0   (0)|
|*  5 |   FIXED TABLE FIXED INDEX | X$KSLED (ind:2) |     1 |     4 |     0   (0)|
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter(("S"."INDX"="W"."KSLWTSID" AND BITAND("S"."KSSPAFLG",1)<>0
              AND BITAND("S"."KSUSEFLG",1)<>0 AND "S"."INDX"=TO_NUMBER(SYS_CONTEXT('user
              env','sid')) AND INTERNAL_FUNCTION("S"."CON_ID") AND
              "S"."INST_ID"=USERENV('INSTANCE')))
   5 - filter("W"."KSLWTEVT"="E"."INDX")


Although it’s not nice to insert hidden parameters into the optimizer activity we do have a result. We don’t have any filtering on x$kslwt – fortunately this seems to be limited in size (but see footnote) to the number of current sessions (unlike x$ksuse which has an array size defined by the sessions parameter or derived from the processes parameter). For each row in x$kslwt we do an access into x$ksuse using the “index” (note that we don’t see access predicates for the fixed tables, we just have to note the operation says FIXED INDEX and spot the “index-related” predicate in the filter predicate list), so this strategy has reduced the number of times we check the complex predicate on x$ksuse rows.

It’s still far from ideal, though. What we’d really like to do is access x$kslwt by index using the known value from sys_context(‘userenv’,’sid’). As it stands the path we get from using a hidden parameter which isn’t listed as legal for the opt_param() hint is a plan that we would get if we used an unhinted query that searched for audsid = sys_context(‘userenv’,’sessionid’).

SQL_ID  7f3f9b9f32u7z, child number 0
-------------------------------------
select  program from  v$session where  audsid =
sys_context('userenv','sessionid')

Plan hash value: 3425234845

----------------------------------------------------------------------------------
| Id  | Operation                 | Name            | Rows  | Bytes | Cost (%CPU)|
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                 |       |       |     1 (100)|
|   1 |  NESTED LOOPS             |                 |     2 |    70 |     0   (0)|
|   2 |   NESTED LOOPS            |                 |     2 |    62 |     0   (0)|
|   3 |    FIXED TABLE FULL       | X$KSLWT         |    47 |   376 |     0   (0)|
|*  4 |    FIXED TABLE FIXED INDEX| X$KSUSE (ind:1) |     1 |    23 |     0   (0)|
|*  5 |   FIXED TABLE FIXED INDEX | X$KSLED (ind:2) |     1 |     4 |     0   (0)|
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - filter(("S"."INDX"="W"."KSLWTSID" AND BITAND("S"."KSSPAFLG",1)<>0
              AND BITAND("S"."KSUSEFLG",1)<>0 AND "S"."KSUUDSES"=TO_NUMBER(SYS_CONTEXT('
              userenv','sessionid')) AND INTERNAL_FUNCTION("S"."CON_ID") AND
              "S"."INST_ID"=USERENV('INSTANCE')))
   5 - filter("W"."KSLWTEVT"="E"."INDX")


The bottom line, then, seems to be that if you need a query by SID against v$session to be as efficient as possible then your best bet is to load a numeric variable with the sys_context(‘userenv’,’sid’) and then select where “sid = :bindvariable”. Otherwise query by audsid, or use a hidden parameter to affect the optimizer.
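
In SQL*Plus terms that approach is simply something like:

variable m_sid number

execute :m_sid := to_number(sys_context('userenv','sid'))

select program from v$session where sid = :m_sid;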

Until the SYS schema follows the _optimizer_transitivity_retain parameter or treats sys_context() the same way it treats a bind variable there is always going to be some unnecessary work when querying v$session, and that excess will grow with either the number of connected sessions (if you optimize the query) or with the value of the sessions parameter.

Footnote

In (much) older versions of Oracle v$session_wait sat on top of x$ksusecst, which was part of the same C structure as x$ksuse. In newer versions of Oracle x$kslwt is a structure that is created on demand in the PGA/UGA – I hope that there’s a short cut that allows Oracle to find the waiting elements in x$ksuse[cst] efficiently, rather than requiring a walk through the whole thing, otherwise a tablescan of the (nominally smaller) x$kslwt structure will be at least as expensive as a tablescan of the x$ksuse structure.

Update (just a few minutes after posting)

Bob Bryla has pointed out in a tweet that there are many “bugs” not fixed until 19.1 for which the workaround is to set “_optimizer_transitivity_retain” to false. So maybe this isn’t an example of SYS doing something particularly strange – it may be part of a general reworking of the mechanism that still has a couple of undesirable side effects.

Bob’s comment prompted me to clone the x$ tables into real tables in a non-SYS schema and model the fixed indexes with primary keys, and I found that the resulting plan (though very efficient) still discarded the join predicate. So we may be seeing the side effects of a code enhancement relating to generating predicates that produce unique key access paths. (A “contrary” test, the one in the 2013 article I linked to, still retains the join predicate for the query that has non-unique indexes.)

 

Cursor_sharing

Wed, 2019-10-09 10:58

Here’s a funny little detail that I don’t think I’ve noticed before – needing only a simple demo script:


rem
rem     Script:         cursor_sharing_oddity.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Oct 2019
rem
rem     Last tested 
rem             12.2.0.1
rem

create table t1 as
select  * 
from    all_objects 
;

set serveroutput off
alter session set cursor_sharing = force;

select  *
from    t1
where
        created between date'2019-06-01' and date'2019-06-30'
;

select * from table(dbms_xplan.display_cursor);

Given that I’ve set cursor_sharing to FORCE (and flushed the shared pool just in case), what SQL do you expect to see if I pull the plan from memory, and what sort of thing do you expect to see in the Predicate Information? Probably some references to system-constructed bind variables like :”SYS_B_0″. This is what I got on 12.2.0.1:


SQL_ID  9qwa9gpg9rmjv, child number 0
-------------------------------------
select * from t1 where  created between date:"SYS_B_0" and
date:"SYS_B_1"

Plan hash value: 3617692013

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |       |       |   170 (100)|          |
|*  1 |  TABLE ACCESS FULL| T1   |  1906 |   251K|   170   (8)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(("CREATED">=TO_DATE(' 2019-06-01 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss') AND "CREATED"<=TO_DATE(' 2019-06-30 00:00:00',
              'syyyy-mm-dd hh24:mi:ss')))


Somehow I’ve got system-generated bind variables in the SQL (and v$sql – when I checked), but the original literal values are still present (in a different form) in the predicate information. Then, when I re-ran the query changing 1st June to 15th June I got the same SQL_ID (and generated bind variables) but with child number 1 and suitably modified filter predicates.

Of course, just for completion, if I write the query using the “old-fashioned” to_date() approach I end up with a single statement with lots of system-generated bind variables that are consistent between the SQL and the Predicate Information.
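
The “old-fashioned” version of the query would be of this general form (a sketch – not necessarily the exact literals or format mask that produced the output below):

select  *
from    t1
where
        created between to_date('01-Jun-2019','dd-mon-yyyy')
                and     to_date('30-Jun-2019','dd-mon-yyyy')
;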

SQL_ID  10sfymvwv00qx, child number 0
-------------------------------------
select * from t1 where  created between to_date(:"SYS_B_0",:"SYS_B_1")
                   and to_date(:"SYS_B_2",:"SYS_B_3")

Plan hash value: 3332582666

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |       |       |   189 (100)|          |
|*  1 |  FILTER            |      |       |       |            |          |
|*  2 |   TABLE ACCESS FULL| T1   |  1029 |   135K|   189  (17)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(TO_DATE(:SYS_B_2,:SYS_B_3)>=TO_DATE(:SYS_B_0,:SYS_B_1))
   2 - filter(("CREATED">=TO_DATE(:SYS_B_0,:SYS_B_1) AND
              "CREATED"<=TO_DATE(:SYS_B_2,:SYS_B_3)))

If you are planning to do anything with cursor_sharing, watch out for the side effects of the “ANSI” date and time operators.

Update (3 hours later)

It turns out that I have come across this before – and written about it

The same behaviour is shown by timestamp literals and interval literals. For details on the two types of literal here’s a link to the 12.2 SQL Language Reference Manual section on literals.

My SID

Wed, 2019-10-09 06:03

Here’s a little note that’s been hanging around as a draft for more than eight years according to the OTN (as it was) posting that prompted me to start writing it. At the time there were still plenty of people using Oracle 10g, so the question didn’t seem entirely inappropriate:

On 10g R2 when I open a sqlplus session how can I know my session SID ? I’m not DBA then can not open as sysdba and query v$session.

In all fairly recent versions of Oracle, of course, we have the option to use the sys_context() function to get the SID, but this specific option didn’t appear until some time in the 10g timeline – so you might have spent years “knowing” that you could get the audsid through sys_context(‘userenv’,’sessionid’) while there was no equivalent way to get the sid. Now, of course, and even in the timeline of the original posting, the simplest solution to the requirement is to execute:


select sys_context('userenv','sid') from dual;

But there are a number of alternatives – which may occasionally do a better job (and sometimes are just plain silly). It’s also worth noting that even in 19c Oracle still doesn’t have access to v$session.serial# through sys_context() and, anyway, sys_context() behaves like an unpeekable bind variable – which can be a problem.

So here’s the first of several options:

select sid from V$mystat where rownum = 1;

You’ll need SYS to grant you select on v_$mystat to use this one, of course, but v$mystat is a very convenient view giving you the session activity stats since logon for your own session – so there ought to be some mechanism that allows you to see some form of it in place anyway (ideally including the join to v$statname).
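
A sketch of the kind of thing that could be put in place – a couple of grants from SYS, then a simple join run from the user’s own session (the schema name is just a placeholder):

grant select on v_$mystat   to test_user;
grant select on v_$statname to test_user;

select
        sn.name, ms.value
from
        v$statname      sn,
        v$mystat        ms
where
        ms.statistic# = sn.statistic#
and     ms.value != 0
order by
        sn.name
;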

One of the oldest ways of getting access to your session ID without having access to any of the dynamic performance views was through the dbms_support package:

variable v1 varchar2(32)
execute :v1 := dbms_support.mysid
execute dbms_output.put_line(:v1)

Again you’ll need SYS to grant you extra privileges, in this case execute on the dbms_support package – worse still, the package is not installed by default. In fact (after installing it) if you call dbms_support.package_version it returns the value: “DBMS_SUPPORT Version 1.0 (17-Aug-1998) – Requires Oracle 7.2 – 8.0.5” – which gives you some idea of how old it is. It used to be useful for the start_trace_in_session() procedure it contains but that procedure has been superseded by many newer mechanisms. If you enable SQL tracing to see what’s happening under the covers when you call dbms_support.mysid you’ll see that the function actually runs the query I showed above against v$mystat .

Unlike dbms_support the dbms_session package is installed automatically with the privilege to execute granted to public, and this gives you a function to generate a “unique session id”. The notes in the script $ORACLE_HOME/rdbms/admin/dbmssess.sql that creates the package say that the return value can be up to 24 bytes long, but so far the maximum I’ve seen is 12.


select dbms_session.unique_session_id from dual;
UNIQUE_SESSION_ID
--------------------------
00FF5F980001


select
        to_number(substr(dbms_session.unique_session_id,1,4),'XXXX') sid,
        to_number(substr(dbms_session.unique_session_id,5,4),'XXXX') serial#,
        to_number(substr(dbms_session.unique_session_id,9,4),'XXXX') instance
from
        dual
;

       SID    SERIAL# INSTANCE
---------- ---------- --------
       255      24472        1

As you can see, the session_unique_id can be decoded to produce three useful bits of information, and the nice thing about this call is that it gives you session serial# at the same time as the SID. It’s possible, of course, that this query is as efficient as it could be, but there’s some scope for writing a query that uses a non-mergeable in-line view to call the function once, then splits the result into three pieces.
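
Something like the following sketch would do – the no_merge hint stops the inline view from being merged, so the function should be called only once:

select
        to_number(substr(usid,1,4),'XXXX')      sid,
        to_number(substr(usid,5,4),'XXXX')      serial#,
        to_number(substr(usid,9,4),'XXXX')      instance
from    (
        select  /*+ no_merge */
                dbms_session.unique_session_id  usid
        from    dual
        )
;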

While we’re on the session_unique_id, the dbms_pipe package also has a “unique identifier” function unique_session_name():

SQL> select dbms_pipe.unique_session_name from dual;

UNIQUE_SESSION_NAME
------------------------
ORA$PIPE$00FF5F980001

It doesn’t take a lot of effort to spot that the “unique session name” is the “unique session id” of dbms_session prefixed with the text “ORA$PIPE$”. It’s convenient for the dbms_pipe package to be able to generate a unique name so that one session can create a safely named pipe and tell another session about it. Anyone using pipes should take advantage of this function for its original purpose. Unlike dbms_session you’ll need to be granted the privilege to execute this package, it’s not available to public. Interestingly the script that creates dbms_pipe says that this function could return 30 bytes – since it appears to be 9 bytes prepended to the (“could be 24 bytes”) dbms_session.unique_session_id you have to wonder whether there’s something more subtle that could happen.

There may be many more mechanisms available as built-ins, but the last one I know of is in the dbms_debug_jdwp package (another package with execute privilege already granted to public and the ability to supply both the sid and serial#):

SQL> select
  2          dbms_debug_jdwp.current_session_id     sid,
  3          dbms_debug_jdwp.current_session_serial serial#
  4  from dual
  5  /

       SID    SERIAL#
---------- ----------
       255      24472

There is a reason why I’ve decided to resurrect this list of ways of getting at a session’s SID, but that’s the topic of another blog note.

 

 

Resumable

Mon, 2019-10-07 04:31

There are two questions about temporary space that appear fairly regularly on the various Oracle forums. One is of the form:

From time to time my temporary tablespace grows enormously (and has to be shrunk), how do I find what’s making this happen?

The other follows the more basic pattern:

My process sometimes crashes with Oracle error: “ORA-01652: unable to extend temp segment by %n in tablespace %s” how do I stop this happening?

Before moving on to the topic of the blog, it’s worth pointing out two things about the second question:

  • First, it’s too easy to get stuck at the word temp and leap to the conclusion that the problem is about the temporary tablespace without noticing that the error message includes the specific tablespace that’s raised the problem. If, for example, you rebuild an index in a nominated tablespace Oracle first creates the index as a temporary segment (with a name like {starting_file_number}.{starting_block_number}) in that tablespace then renames it to match the original index name once the rebuild is complete and drops the old index.
  • Secondly a process that raises ORA-01652 isn’t necessarily the guilty party – it may be the victim of some other process hogging all the available space when it shouldn’t. Moreover that other process may have completed and released its space by the time you start looking for the problem – causing extra confusion because your process seems to have crashed without a cause. Taking my example of an index rebuild – your index rebuild may fail because someone else was rebuilding a different index at the same time in the same tablespace; but when you check the tablespace all the space from their original index is now free as their rebuild completed in the interim.

So, before you start chasing something that you think is a problem with your code, pause a moment to double-check the error message and think about whether you could have been the victim of some concurrent, but now complete, activity.

I’ve listed the two questions as variants on the same theme because the workaround to one of them introduces the risk of the other – if you want to avoid ORA-01652 you could make all your data files and temp files “autoextensible”, but then there may be occasions when they extend far too much and you need to shrink them down again (and that’s not necessarily easy if it’s not the temporary tablespace). Conversely, if you think your data or temp files randomly explode to ludicrous sizes you could decide on a maximum size for your files and disable autoextension – then handle the complaints when a user reports an ORA-01652.

There are various ways you could monitor your system in near real time to spot the threat as it builds, of course; and there are various ways to identify potentially guilty SQL after the event. You could keep an eye on various v$ dynamic performance views or dba_ administrative views to try and intercept a problem; you could set event 1652 to dump an errorstack (or even systemstate) for post-crash analysis to see what that reported. Neither is an ideal solution – one requires you to pay excessive attention to the system, the other is designed to let the problem happen then leave you to clean up afterwards.  There is, however, a strategy that may stop the problem from appearing without requiring constant monitoring. The strategy is to enable (selectively) resumable operations.

If a resumable operation needs to allocate space but is unable to do so – i.e. it would normally be about to raise ORA-01652 – it will suspend itself for a while, going into the wait state “statement suspended, wait error to be cleared” which will show up as the event in v$session_wait, timing out every 2 seconds. The session will also be reporting its current action in the view v$resumable or, for slightly more information, dba_resumable. As it suspends the session will also write a message to the alert log, but you can also create an “after suspend” database trigger to alert you that a problem has occurred.

If you set the resumable timeout to a suitable value then you may find:

  • the problem goes away of its own accord and the session resumes before the timeout is reached

or

  • you receive a warning and have some time to identify the source of the problem and take the minimum action needed to allow the session to resume

Implementation

The parameter resumable_timeout is a general control for resumable sessions if you don’t handle the feature at a more granular level than the system.

By default this parameter is set to zero which translates into a default value of 7,200 seconds but that default doesn’t come into effect unless a session declares itself resumable. If you set the parameter to a non-zero value all sessions will automatically be operating as resumable sessions – and you’ll soon hear why you don’t want to do that.

The second enabling feature for resumable sessions is the resumable privilege – a session can’t control its own resumability unless the schema has been granted the resumable privilege – which may be granted through a role. If a session has the privilege it may set its own resumable_timeout, even if the system value is zero.
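
In outline that means something like this sketch (the schema name is a placeholder) – a one-off grant from a suitably privileged account, after which the session can set its own timeout in seconds:

grant resumable to test_user;

alter session set resumable_timeout = 60;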

Assume we have set resumable_timeout to 10 (seconds) through the instance parameter file and restarted the instance. If we now issue (for example) the following ‘create table’ statement:


create table t1 (n1, v1 ) 
pctfree 90 pctused 10
tablespace tiny
as
select 
        rownum, cast(lpad('x',800) as varchar2(1000))
from    all_objects
where   rownum <= 20000
/

This will attempt to allocate 1 row per block for 20,000 blocks (plus about 1.5% for bitmap space management blocks) – and tablespace tiny lives up (or down) to its name, consisting of a single file of only 10,000 Oracle blocks. Shortly after starting, the session will hit Oracle error “ORA-01652: unable to extend temp segment by 128 in tablespace TINY”, but it won’t report it; instead it will suspend itself for 10 seconds before failing and reporting the error. This will happen whether or not the session has the resumable privilege – in this case the behaviour is dictated by our setting the system parameter. If you look in the alert log after the session finally errors out you will find text like the following:

2019-10-04T14:01:11.847943+01:00
ORCL(3):ORA-1652: unable to extend temp segment by 128 in tablespace TINY [ORCL] 
ORCL(3):statement in resumable session 'User TEST_USER(138), Session 373, Instance 1' was suspended due to
ORCL(3):    ORA-01652: unable to extend temp segment by 128 in tablespace TINY
2019-10-04T14:01:23.957586+01:00
ORCL(3):statement in resumable session 'User TEST_USER(138), Session 373, Instance 1' was timed out

Note that there’s a 10 (plus a couple) second gap between the point where the session reports that it is suspending itself and the point where it fails with a timeout. The two extra seconds appear because the session polls every 2 seconds to see whether the problem is still present or whether it has spontaneously disappeared so allowing the session to resume.

Let’s change the game slightly; let’s try to create the table again, but this time execute the following statement first:

alter session enable resumable timeout 60 name 'Help I''m stuck';

The initial response to this will be Oracle error “ORA-01031: insufficient privileges” because the session doesn’t have the resumable privilege, but after granting resumable to the user (or a relevant role) we try again and find we will be allowed a little extra time before the CTAS times out. Our session now overrides the system timeout and will wait 60 seconds (plus a bit) before failing. The “timeout” clause is optional and if we omit it the session will use the system value; similarly the “name” clause is optional though there’s no default for it, it’s just a message that will get into various views and reports.

There are several things you might check in this 60 second grace period. The session wait history will confirm that your session has been timing out every two seconds (as will the active session history if you’re licensed to use it):


select seq#, event, wait_time from v$session_wait_history where sid = 373

      SEQ# EVENT							     WAIT_TIME
---------- ---------------------------------------------------------------- ----------
	 1 statement suspended, wait error to be cleared			   204
	 2 statement suspended, wait error to be cleared			   201
	 3 statement suspended, wait error to be cleared			   201
	 4 statement suspended, wait error to be cleared			   201
	 5 statement suspended, wait error to be cleared			   200
	 6 statement suspended, wait error to be cleared			   200
	 7 statement suspended, wait error to be cleared			   202
	 8 statement suspended, wait error to be cleared			   200
	 9 statement suspended, wait error to be cleared			   200
	10 statement suspended, wait error to be cleared			   200

Then there’s a special dynamic performance view, v$resumable which I’ve reported below using a print_table() procedure that Tom Kyte wrote many, many years ago to report rows in a column format:

SQL> set serveroutput on
SQL> execute print_table('select * from v$resumable where sid = 373')

ADDR                          : 0000000074515B10
SID                           : 373
ENABLED                       : YES
STATUS                        : SUSPENDED
TIMEOUT                       : 60
SUSPEND_TIME                  : 10/04/19 14:26:20
RESUME_TIME                   :
NAME                          : Help I'm stuck
ERROR_NUMBER                  : 1652
ERROR_PARAMETER1              : 128
ERROR_PARAMETER2              : TINY
ERROR_PARAMETER3              :
ERROR_PARAMETER4              :
ERROR_PARAMETER5              :
ERROR_MSG                     : ORA-01652: unable to extend temp segment by 128 in tablespace TINY
CON_ID                        : 0
-----------------
1 rows selected

Notice how the name column reports the name I supplied when I enabled the resumable session. The view also tells us when the critical statement was suspended and how long it is prepared to wait (in total) – leaving us to work out from the current time how much time we have left to work around the problem.
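
As a sketch (suspend_time is a varchar2 and the format mask below is inferred from the output above), you could calculate the remaining time with something like:

select
        name,
        timeout - round((sysdate - to_date(suspend_time,'mm/dd/yy hh24:mi:ss')) * 86400)  seconds_left
from
        v$resumable
where
        status = 'SUSPENDED'
;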

There’s also a dba_resumable variant of the view which is slightly more informative (though the sample below is not consistent with the one above because I ran the CTAS several times, editing the blog as I did so):

SQL> execute print_table('select * from dba_resumable where session_id = 373')

USER_ID                       : 138
SESSION_ID                    : 373
INSTANCE_ID                   : 1
COORD_INSTANCE_ID             :
COORD_SESSION_ID              :
STATUS                        : SUSPENDED
TIMEOUT                       : 60
START_TIME                    : 10/04/19 14:21:14
SUSPEND_TIME                  : 10/04/19 14:21:16
RESUME_TIME                   :
NAME                          : Help I'm stuck
SQL_TEXT                      : create table t1 (n1, v1 ) pctfree 90 pctused 10 tablespace tiny as  select rownum, 
                                cast(lpad('x',800) as varchar2(1000)) from all_objects where rownum <= 20000
ERROR_NUMBER                  : 1652
ERROR_PARAMETER1              : 128
ERROR_PARAMETER2              : TINY
ERROR_PARAMETER3              :
ERROR_PARAMETER4              :
ERROR_PARAMETER5              :
ERROR_MSG                     : ORA-01652: unable to extend temp segment by 128 in tablespace TINY
-----------------
1 rows selected

This view includes the text of the statement that has been suspended and shows us when it started running (so that we can decide whether we really want to rescue it, or might be happy to kill it to allow some other suspended session to resume).

If you look at the alert log in this case you’ll see that the name has been reported there instead of the user, session and instance – which means you might want to think carefully about how you use the name option:


2019-10-04T14:21:16.151839+01:00
ORCL(3):statement in resumable session 'Help I'm stuck' was suspended due to
ORCL(3):    ORA-01652: unable to extend temp segment by 128 in tablespace TINY
2019-10-04T14:22:18.655808+01:00
ORCL(3):statement in resumable session 'Help I'm stuck' was timed out

Once your resumable task has completed (or timed out and failed) you can stop the session from being resumable with the command:

alter session disable resumable;

And it’s important that every time you enable resumability you disable it as soon as the capability is no longer needed. Also, be careful about when you enable it; don’t be tempted to make every session resumable. Use it only for really important cases. Once a session is resumable virtually everything that goes on in that session is deemed to be resumable, and this has side effects.

The first side effect that may spring to mind is the impact of the view v$resumable – it’s a memory structure in the SGA so that everyone can see it and all the resumable sessions can populate and update it. That means there’s got to be some latch (or mutex) protection going on – and if you look at v$latch you’ll discover that there’s just a single (child) latch doing the job, so resumability can introduce a point of contention. Here’s a simple script (using my “start_XXX” strategy) to “select 1 from dual;” one thousand times, with calls to check the latch activity:

set termout off
set serveroutput off
execute snap_latch.start_snap

@start_1000

set termout on
set serveroutput on
execute snap_latch.end_snap(750)

And here are the results of running the script – reporting only the latches with more than 750 gets in the interval – first without and then with a resumable session:

---------------------------------
Latch waits:-   04-Oct 15:04:31
Lower limit:-  750
---------------------------------
Latch                              Gets      Misses     Sp_Get     Sleeps     Im_Gets   Im_Miss Holding Woken Time ms
-----                              ----      ------     ------     ------     -------   ------- ------- ----- -------
session idle bit                  6,011           0          0          0           0         0       0     0      .0
enqueue hash chains               2,453           0          0          0           0         0       0     0      .0
enqueue freelist latch                1           0          0          0       2,420         0       0     0      .0
JS queue state obj latch          1,176           0          0          0           0         0       0     0      .0

SQL> alter session enable resumable;

SQL> @test
---------------------------------
Latch waits:-   04-Oct 15:04:46
Lower limit:-  750
---------------------------------
Latch                              Gets      Misses     Sp_Get     Sleeps     Im_Gets   Im_Miss Holding Woken Time ms
-----                              ----      ------     ------     ------     -------   ------- ------- ----- -------
session idle bit                  6,011           0          0          0           0         0       0     0      .0
enqueue hash chains               2,623           0          0          0           0         0       0     0      .0
enqueue freelist latch                1           0          0          0       2,588         0       0     0      .0
resumable state object            3,005           0          0          0           0         0       0     0      .0
JS queue state obj latch          1,260           0          0          0           0         0       0     0      .0

PL/SQL procedure successfully completed.

SQL> alter session disable resumable;

That’s 1,000 selects from dual – 3,000 latch gets on a single child latch. It looks like every call to the database results in a latch get and an update to the memory structure. (Note: You wouldn’t see the same effect if you ran a loop inside an anonymous PL/SQL block since the block would be the single database call).

For other side effects with resumability think about what else is going on around your session. If you allow a session to suspend for (say) 3600 seconds and it manages to resume just in time to avoid a timeout it now has 3,600 seconds of database changes to unwind if it’s trying to produce a read-consistent result; so not only do you have to allow for increasing the size of the undo tablespace and increasing the undo retention time, you have to allow for the fact that when the process resumes it may run much more slowly than usual because it spends more of its time trying to see the data as it was before it suspended, which may require far more single block reads of the undo tablespace – and the session may then crash anyway with an Oracle error ORA-01555 (which is so well-known that I won’t quote the text).

In the same vein – if a process acquires a huge amount of space in the temporary tablespace (in particular) and fails instantly because it can’t get any more space it normally crashes and releases the space. If you allow that process to suspend for an hour it’s going to hold onto that space – which means other processes that used to run safely may now crash because they find there’s no free space left for them in the temporary tablespace.

Be very cautious when you introduce resumable sessions – you need to understand the global impact, not just the potential benefit to your session.

Getting Alerts

Apart from the (passive) views telling you that a session has suspended it’s also possible to get some form of (active) alert when the event happens. There’s an “after suspend” event that you can use to create a database trigger to take some defensive action, e.g.:

create or replace trigger call_for_help
after suspend
on test_user.schema
begin
        if sysdate between trunc(sysdate) and trunc(sysdate) + 3/24 then
                null;
                -- use utl_mail, utl_smtp et. al. to page the DBA
        end if;
end;
/

This trigger is restricted to the test_user schema, and (code not included) sends a message to the DBA’s pager only between the hours of midnight and 3:00 a.m. Apart from the usual functions in dbms_standard that return error codes, names of objects and so on, you might want to take a look at the dbms_resumable package for the “helper” functions and procedures it supplies.
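
As a sketch of what such a trigger might do with those helpers (the calls to space_error_info() and set_timeout() are in the documented dbms_resumable package, but treat the details here as something to verify rather than production code) – find out what ran out of space and shorten the timeout for anything that isn’t a temporary tablespace problem:

create or replace trigger call_for_help
after suspend
on test_user.schema
declare
        v_error_type     varchar2(64);
        v_object_type    varchar2(64);
        v_object_owner   varchar2(128);
        v_tablespace     varchar2(128);
        v_object_name    varchar2(128);
        v_subobject_name varchar2(128);
begin
        if dbms_resumable.space_error_info(
                v_error_type, v_object_type, v_object_owner,
                v_tablespace, v_object_name, v_subobject_name
        ) then
                if v_tablespace = 'TEMP' then           -- assuming the temp tablespace is called TEMP
                        null;
                        -- use utl_mail, utl_smtp et al. to page the DBA
                else
                        dbms_resumable.set_timeout(600);  -- allow only another 10 minutes
                end if;
        end if;
end;
/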

For further information on resumable sessions here’s a link to the 12.2 manual to get you started.

opt_estimate catalogue

Fri, 2019-10-04 04:10

This is just a list of the notes I’ve written about the opt_estimate() hint.

  • opt_estimate – using the hint to affect index calculations: index_scan and index_filter
  • opt_estimate 2 – applying the hint to nested loop joins, options: nlj_index_scan and nlj_index_filter
  • opt_estimate 3 – a couple of little-known options for the hint, “group_by” and “having”.
  • opt_estimate 4 – applying the hint at the query block level: particularly useful for CTEs (“with subquery”) and non-mergeable views.
  • opt_estimate 5 – a story of failure: trying to use opt_estimate to push predicates into a union all view.

I have a couple more drafts on the topic awaiting completion, but if you know of any other articles that would be a good addition to the list feel free to reference them in the comments.

 

Trace Files

Thu, 2019-10-03 07:38

A recent blog note by Martin Berger about reading trace files in 12.2 popped up in my twitter timeline yesterday and reminded me of a script I wrote a while ago to create a simple view I could query to read the tracefile generated by the current session while the session was still connected. You either have to create the view and a public synonym through the SYS schema, or you have to use the SYS schema to grant select privileges on several dynamic performance views to the user so that the user can create the view in their own schema. For my scratch database I tend to create the view in the SYS schema.

Script to be run by SYS:

rem
rem     Script: read_trace_122.sql
rem     Author: Jonathan Lewis
rem     Dated:  Sept 2018
rem
rem     Last tested
rem             12.2.0.1

create or replace view my_trace_file as
select 
        *
from 
        v$diag_trace_file_contents
where
        (adr_home, trace_filename) = (
                select
                --      substr(tracefile, 1, instr(tracefile,'/',-1)-1),
                        substr(
                                substr(tracefile, 1, instr(tracefile,'/',-1)-1),
                                1,
                                instr(
                                        substr(tracefile, 1, instr(tracefile,'/',-1)),
                                        'trace'
                                ) - 2
                        ),
                        substr(tracefile, instr(tracefile,'/',-1)+1) trace_filename
                from 
                        v$process
                where   addr = (
                                select  paddr
                                from    v$session
                                where   sid = (
                                        sys_context('userenv','sid')
                                        -- select sid from v$mystat where rownum = 1
                                        -- select dbms_support.mysid from dual
                                )
                        )
        )
;


create public synonym my_trace_file for sys.my_trace_file;
grant select on my_trace_file to {some role};

Alternatively, here are the privileges you could grant to a user from SYS so that they could create their own view:


grant select on v_$process to some_user;
grant select on v_$session to some_user;
grant select on v_$diag_trace_file_contents to some_user;
and optionally one of:
        grant select on v_$mystat to some_user;
        grant execute on dbms_support to some_user;
                but dbms_support is no longer installed by default.

The references to package dbms_support and view v$mystat are historic ones I have lurking in various scripts from the days when the session id (SID) wasn’t available in any simpler way.

Once the view exists and is available, you can enable some sort of tracing from your session then query the view to read back the trace file. For example, here’s a simple “self-reporting” (it’s going to report the trace file that it causes) script that I’ve run from 12.2.0.1 as a demo:


alter system flush shared_pool;
alter session set sql_trace = true;

set linesize 180
set trimspool on
set pagesize 60

column line_number      format  999,999
column piece            format  a150    
column plan             noprint
column cursor#          noprint

break on plan skip 1 on cursor# skip 1

select
        line_number,
        line_number - row_number() over (order by line_number) plan,
        substr(payload,1,instr(payload,' id=')) cursor#,
        substr(payload, 1,150) piece
from
        my_trace_file
where
        file_name = 'xpl.c'
order by
        line_number
/

alter session set sql_trace = false;

The script flushes the shared pool to make sure that it’s going to trigger some recursive SQL, then enables a simple SQL trace. The query then picks out all the lines in the trace file generated by code in the Oracle source file xpl.c (“execution plans” seems like a likely guess at the name), which in practice means all the STAT lines in the trace (i.e. the ones showing the execution plans).

I’ve used the “tabibitosan” method to identify all the lines that belong to a single execution plan by assuming that they will be consecutive lines in the output starting from a line which includes the text ” id=1 “ (the surrounding spaces are important), but I’ve also extracted the bit of the line which includes the cursor number (STAT #nnnnnnnnnnnnnnn) because two plans may be dumped one after the other if multiple cursors close at the same time. There is still a little flaw in the script because sometimes Oracle will run a sys-recursive statement in the middle of dumping a plan to turn an object_id into an object_name, and this will cause a break in the output.

The result of the query is to extract all the execution plans in the trace file and print them in the order they appear – here’s a sample of the output:


LINE_NUMBER PIECE
----------- ------------------------------------------------------------------------------------------------------------------------------------------------------
         38 STAT #140392790549064 id=1 cnt=0 pid=0 pos=1 obj=18 op='TABLE ACCESS BY INDEX ROWID BATCHED OBJ$ (cr=3 pr=0 pw=0 str=1 time=53 us cost=4 size=113 card
         39 STAT #140392790549064 id=2 cnt=0 pid=1 pos=1 obj=37 op='INDEX RANGE SCAN I_OBJ2 (cr=3 pr=0 pw=0 str=1 time=47 us cost=3 size=0 card=1)'


         53 STAT #140392790535800 id=1 cnt=1 pid=0 pos=1 obj=0 op='MERGE JOIN OUTER (cr=5 pr=0 pw=0 str=1 time=95 us cost=2 size=178 card=1)'
         54 STAT #140392790535800 id=2 cnt=1 pid=1 pos=1 obj=4 op='TABLE ACCESS CLUSTER TAB$ (cr=3 pr=0 pw=0 str=1 time=57 us cost=2 size=138 card=1)'
         55 STAT #140392790535800 id=3 cnt=1 pid=2 pos=1 obj=3 op='INDEX UNIQUE SCAN I_OBJ# (cr=2 pr=0 pw=0 str=1 time=11 us cost=1 size=0 card=1)'
         56 STAT #140392790535800 id=4 cnt=0 pid=1 pos=2 obj=0 op='BUFFER SORT (cr=2 pr=0 pw=0 str=1 time=29 us cost=0 size=40 card=1)'
         57 STAT #140392790535800 id=5 cnt=0 pid=4 pos=1 obj=73 op='TABLE ACCESS BY INDEX ROWID TAB_STATS$ (cr=2 pr=0 pw=0 str=1 time=10 us cost=0 size=40 card=1)
         58 STAT #140392790535800 id=6 cnt=0 pid=5 pos=1 obj=74 op='INDEX UNIQUE SCAN I_TAB_STATS$_OBJ# (cr=2 pr=0 pw=0 str=1 time=8 us cost=0 size=0 card=1)'


         84 STAT #140392791412824 id=1 cnt=1 pid=0 pos=1 obj=20 op='TABLE ACCESS BY INDEX ROWID BATCHED ICOL$ (cr=4 pr=0 pw=0 str=1 time=25 us cost=2 size=54 card
         85 STAT #140392791412824 id=2 cnt=1 pid=1 pos=1 obj=42 op='INDEX RANGE SCAN I_ICOL1 (cr=3 pr=0 pw=0 str=1 time=23 us cost=1 size=0 card=2)'


         94 STAT #140392790504512 id=1 cnt=2 pid=0 pos=1 obj=0 op='SORT ORDER BY (cr=7 pr=0 pw=0 str=1 time=432 us cost=6 size=374 card=2)'
         95 STAT #140392790504512 id=2 cnt=2 pid=1 pos=1 obj=0 op='HASH JOIN OUTER (cr=7 pr=0 pw=0 str=1 time=375 us cost=5 size=374 card=2)'
         96 STAT #140392790504512 id=3 cnt=2 pid=2 pos=1 obj=0 op='NESTED LOOPS OUTER (cr=4 pr=0 pw=0 str=1 time=115 us cost=2 size=288 card=2)'
         97 STAT #140392790504512 id=4 cnt=2 pid=3 pos=1 obj=19 op='TABLE ACCESS CLUSTER IND$ (cr=3 pr=0 pw=0 str=1 time=100 us cost=2 size=184 card=2)'
         98 STAT #140392790504512 id=5 cnt=1 pid=4 pos=1 obj=3 op='INDEX UNIQUE SCAN I_OBJ# (cr=2 pr=0 pw=0 str=1 time=85 us cost=1 size=0 card=1)'
         99 STAT #140392790504512 id=6 cnt=0 pid=3 pos=2 obj=75 op='TABLE ACCESS BY INDEX ROWID IND_STATS$ (cr=1 pr=0 pw=0 str=2 time=8 us cost=0 size=52 card=1)'
        100 STAT #140392790504512 id=7 cnt=0 pid=6 pos=1 obj=76 op='INDEX UNIQUE SCAN I_IND_STATS$_OBJ# (cr=1 pr=0 pw=0 str=2 time=7 us cost=0 size=0 card=1)'
        101 STAT #140392790504512 id=8 cnt=0 pid=2 pos=2 obj=0 op='VIEW  (cr=3 pr=0 pw=0 str=1 time=47 us cost=3 size=43 card=1)'
        102 STAT #140392790504512 id=9 cnt=0 pid=8 pos=1 obj=0 op='SORT GROUP BY (cr=3 pr=0 pw=0 str=1 time=44 us cost=3 size=15 card=1)'
        103 STAT #140392790504512 id=10 cnt=0 pid=9 pos=1 obj=31 op='TABLE ACCESS CLUSTER CDEF$ (cr=3 pr=0 pw=0 str=1 time=21 us cost=2 size=15 card=1)'
        104 STAT #140392790504512 id=11 cnt=1 pid=10 pos=1 obj=30 op='INDEX UNIQUE SCAN I_COBJ# (cr=2 pr=0 pw=0 str=1 time=11 us cost=1 size=0 card=1)'


        116 STAT #140392791480168 id=1 cnt=4 pid=0 pos=1 obj=0 op='SORT ORDER BY (cr=3 pr=0 pw=0 str=1 time=62 us cost=3 size=858 card=13)'
        117 STAT #140392791480168 id=2 cnt=4 pid=1 pos=1 obj=21 op='TABLE ACCESS CLUSTER COL$ (cr=3 pr=0 pw=0 str=1 time=24 us cost=2 size=858 card=13)'
        118 STAT #140392791480168 id=3 cnt=1 pid=2 pos=1 obj=3 op='INDEX UNIQUE SCAN I_OBJ# (cr=2 pr=0 pw=0 str=1 time=11 us cost=1 size=0 card=1)'


        126 STAT #140392789565328 id=1 cnt=1 pid=0 pos=1 obj=14 op='TABLE ACCESS CLUSTER SEG$ (cr=3 pr=0 pw=0 str=1 time=21 us cost=2 size=68 card=1)'
        127 STAT #140392789565328 id=2 cnt=1 pid=1 pos=1 obj=9 op='INDEX UNIQUE SCAN I_FILE#_BLOCK# (cr=2 pr=0 pw=0 str=1 time=12 us cost=1 size=0 card=1)'


        135 STAT #140392789722208 id=1 cnt=1 pid=0 pos=1 obj=18 op='TABLE ACCESS BY INDEX ROWID BATCHED OBJ$ (cr=3 pr=0 pw=0 str=1 time=22 us cost=3 size=51 card=
        136 STAT #140392789722208 id=2 cnt=1 pid=1 pos=1 obj=36 op='INDEX RANGE SCAN I_OBJ1 (cr=2 pr=0 pw=0 str=1 time=16 us cost=2 size=0 card=1)'


        153 STAT #140392792055264 id=1 cnt=1 pid=0 pos=1 obj=68 op='TABLE ACCESS BY INDEX ROWID HIST_HEAD$ (cr=3 pr=0 pw=0 str=1 time=25 us)'
        154 STAT #140392792055264 id=2 cnt=1 pid=1 pos=1 obj=70 op='INDEX RANGE SCAN I_HH_OBJ#_INTCOL# (cr=2 pr=0 pw=0 str=1 time=19 us)'

If you want to investigate further, the “interesting” columns in the underlying view are probably: section_name, component_name, operation_name, file_name, and function_name. The possible names of functions, files, etc. vary with the trace event you’ve enabled.
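
As a starting point for that investigation, a quick summary (just a sketch) against the view will tell you which sections, components and functions account for most of the lines in the current trace file:

select
        section_name, component_name, function_name, count(*)
from
        my_trace_file
group by
        section_name, component_name, function_name
order by
        count(*) desc
/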

 

_cursor_obsolete_threshold

Wed, 2019-10-02 08:39

At the recent Trivadis Performance Days in Zurich, Chris Antognini answered a question that had been bugging me for some time. Why would Oracle want to set the default value of _cursor_obsolete_threshold to a value like 8192 in 12.2 ?

In 11.2.0.3 the parameter was introduced with the default value 100; then in 11.2.0.4, continuing into 12.1, the default value increased to 1,024 – what possible reason could anyone have for thinking that 8192 was a good idea ?

The answer is PDBs – specifically the much larger number of PDBs a single CDB can (theoretically) support in 12.2.

In fact a few comments, and the following specific explanation, are available on MoS in Doc ID 2431353.1 “High Version Counts For SQL Statements (>1024) Post Upgrade To 12.2 and Above Causing Database Slow Performance”:

The default value of _cursor_obsolete_threshold is increased heavily (8192 from 1024) from 12.2 onwards in order to support 4096 PDBs which was only 252 PDBs till 12.1. This parameter value is the maximum limit for obsoleting the parent cursors in an multitenant environment and cannot be increased beyond 8192.

Having said, this is NOT applicable for non-CDB environment and hence for those databases, this parameter should be set to 12.1 default value manually i.e. 1024. The default value of 1024 holds good for non-CDB environment and the same parameter can be adjusted case-to-case basis should there be a problem.

It’s all about PDBs – more precisely, it’s all about CDBs running a huge number of PDBs, which is not necessarily the way that many companies are likely to use PDBs. So if you’re a fairly typical company running a handful of PDBs in a single CDB then it’s probably a good idea to set the parameter down to the 12.1 value of 1024 (and for badly behaved applications I’d consider going even lower) – and this MOS note actually makes that an official recommendation.
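
If you want to check the current setting and adjust it, something like the following sketch will do it. Remember this is a hidden parameter – the x$ query has to be run from SYS, the change (as far as I recall) isn’t dynamic so it needs an instance restart, and you should really agree the change with Oracle Support first:

select
        pi.ksppinm, cv.ksppstvl
from
        x$ksppi pi, x$ksppcv cv
where
        cv.indx    = pi.indx
and     pi.ksppinm = '_cursor_obsolete_threshold'
/

alter system set "_cursor_obsolete_threshold" = 1024 scope = spfile;
-- then restart the instance for the new value to take effect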

Impact analysis

What’s the worst that could happen if you actually have many PDBs all executing the same application and that application has a few very popular and frequently executed statements? Chris Antognini described a model he’d constructed and some tests he’d done to show the effects. The following code is a variation on his work. It addresses the following question:

If you have an application that repeatedly issues parse calls (explicitly or implicitly) but doesn’t take advantage of the session cursor cache, it has to search the library cache by hash_value / sql_id for the parent cursor, then walk the chain of child cursors looking for the right child. What’s the difference in the work done if this “soft parse” has to walk the list all the way to child number 8,191 instead of finding the right cursor at child number 0?

Here’s the complete code for the test:


create table t1 as
select 1 id from dual
/

alter table t1 add constraint t1_pk primary key (id)
/

spool cursor_obsolete.lst

alter system flush shared_pool;
alter system flush shared_pool;

set serveroutput off
select /*+ index(t1) */ id from t1 where id > 0;
select * from table(dbms_xplan.display_cursor);

execute snap_my_stats.start_snap
execute snap_my_stats.start_snap

declare
        m_id number;
begin
        for i in 100+1..100+8192 loop
                execute immediate 'alter session set optimizer_index_cost_adj = ' || i ;
                select /*+ index(t1) */ id into m_id from t1 where id > 0;
        end loop;
end;
/

set serveroutput on
execute snap_my_stats.end_snap

column sql_text format a60
select sql_id, child_number, loaded_versions, executions, sql_text from v$sql where sql_text like 'SELECT%T1%' order by child_number;

prompt  ===============
prompt  Low child reuse
prompt  ===============

set serveroutput off
execute snap_my_stats.start_snap

declare
        m_id number;
begin
        for i in 100+1..100+1024 loop
                execute immediate 'alter session set optimizer_index_cost_adj = ' || i ;
                select /*+ index(t1) */ id into m_id from t1 where id > 0;
        end loop;
end;
/

set serveroutput on
execute snap_my_stats.end_snap

prompt  ================
prompt  High child reuse
prompt  ================

set serveroutput off
execute snap_my_stats.start_snap

declare
        m_id number;
begin
        for i in 7168+1..7168+1024 loop
                execute immediate 'alter session set optimizer_index_cost_adj = ' || i ;
                select /*+ index(t1) */ id into m_id from t1 where id > 0;
        end loop;
end;
/

set serveroutput on
execute snap_my_stats.end_snap

spool off

I’ve created a table with just one row and given it a primary key. My testing query is going to be very short and simple: a query hinted to return that one row through a primary key index range scan.

I’ve flushed the shared pool (twice) to minimise fringe contention from pre-existing information, then executed the statement to populate the dictionary cache and some library cache information and to check the execution plan.

The call to the package snap_my_stats is my standard method for reporting changes in v$mystat across the test. I’ve called the start_snap procedure twice in a row to make sure that its first load doesn’t add some noise to the statistics that we’re trying to capture.

The test runs in three parts.

  • First I loop 8192 times executing the same statement, but with a different value for the optimizer_index_cost_adj for each execution – this gives me the limit of 8192 child cursors, each reporting “Optimizer Mismatch” as the reason for not sharing. I’ve run a query against v$sql after this to check that I have 8192 child cursors – you’ll need to make sure your shared pool is a few hundred megabytes if you want to be sure of keeping them all in memory.
  • The second part of the test simply repeats the loop, but only for the first 1,024 child cursors. At this point the child cursors exist, so the optimizer should be doing “soft” parses rather than hard parses.
  • The final part of the test repeats the loop again, but only for the last 1,024 child cursors. Again they should exist and be usable, so the optimizer should again be doing “soft” parses rather than hard parses.

What I’m looking for is the extra work it takes for Oracle to find the right child cursor when there’s a very long chain of child cursors. From my memory of dumping the library cache in older versions of Oracle, the parent will point to a “segmented array” of pointers to child cursors, and each segment of the array will consist of 16 pointers, plus a pointer to the next segment. So if you have to find child cursor 8191 you will have to follow 512 segment pointers, and check 16 pointers per segment (totalling 8708 pointers), before you find the child you want – and you’re probably holding a mutex (or latch) while doing so.

One peripheral question to ask, of course, is whether Oracle keeps appending to the segmented array, or whether it uses a “pushdown” approach when allocating a new segment so that newer child cursors are near the start of the array (i.e. will searching for child cursor 0 be the cheapest search or the most expensive one?).

And the results, limited to just the second and third parts, with just a couple of small edits are as follows:


host sdiff -w 120 -s temp1.txt temp2.txt >temp.txt

===============                                            |    ================
Low child reuse                                            |    High child reuse
===============                                            |    ================

Interval:-  0 seconds                                      |    Interval:-  6 seconds

opened cursors cumulative                      2,084       |    opened cursors cumulative                      2,054
recursive calls                                6,263       |    recursive calls                                6,151
recursive cpu usage                               33       |    recursive cpu usage                              570
session logical reads                          1,069       |    session logical reads                          1,027
CPU used when call started                        33       |    CPU used when call started                       579
CPU used by this session                          37       |    CPU used by this session                         579
DB time                                           34       |    DB time                                          580
non-idle wait count                               16       |    non-idle wait count                                5
process last non-idle time                         1       |    process last non-idle time                         6
session pga memory                           524,288       |    session pga memory                            65,536
enqueue requests                                  10       |    enqueue requests                                   3
enqueue releases                                  10       |    enqueue releases                                   3
consistent gets                                1,069       |    consistent gets                                1,027
consistent gets from cache                     1,069       |    consistent gets from cache                     1,027
consistent gets pin                            1,039       |    consistent gets pin                            1,024
consistent gets pin (fastpath)                 1,039       |    consistent gets pin (fastpath)                 1,024
consistent gets examination                       30       |    consistent gets examination                        3
consistent gets examination (fastpath)            30       |    consistent gets examination (fastpath)             3
logical read bytes from cache              8,757,248       |    logical read bytes from cache              8,413,184
calls to kcmgcs                                    5       |    calls to kcmgcs                                    3
calls to get snapshot scn: kcmgss              1,056       |    calls to get snapshot scn: kcmgss              1,026
table fetch by rowid                              13       |    table fetch by rowid                               1
rows fetched via callback                          6       |    rows fetched via callback                          1
index fetch by key                                 9       |    index fetch by key                                 1
index scans kdiixs1                            1,032       |    index scans kdiixs1                            1,024
session cursor cache hits                         14       |    session cursor cache hits                          0
cursor authentications                         1,030       |    cursor authentications                         1,025
buffer is not pinned count                     1,066       |    buffer is not pinned count                     1,026
parse time cpu                                    23       |    parse time cpu                                   558
parse time elapsed                                29       |    parse time elapsed                               556
parse count (total)                            2,076       |    parse count (total)                            2,052
parse count (hard)                                11       |    parse count (hard)                                 3
execute count                                  1,050       |    execute count                                  1,028
bytes received via SQL*Net from client         1,484       |    bytes received via SQL*Net from client         1,486

Two important points to note:

  • the CPU utilisation goes up from 0.33 seconds to 5.7 seconds.
  • the number of hard parses is effectively zero – this is all about the work needed to find the right child cursor.

You might query the 2,048-ish figures for parse count (total) – don’t forget that we do an “execute immediate” to change the optimizer_index_cost_adj on each pass through the loop. That’s probably why the parse count is roughly double the number of loop iterations, although the “alter session” doesn’t then show up in the “execute count”.

The third call to a statement is often an important one – it’s often the first one that doesn’t need “cursor authentication” – so I ran a similar test executing the last two loops a second time; there was no significant change in the CPU or parse activity between the 2nd and 3rd executions of each cursor. For completeness I also ran a test where the loop for the last 1,024 child cursors ran before the loop for the first 1,024. Again this made no significant difference to the results – the low-numbered child cursors take less CPU to find than the high-numbered child cursors.

Bottom line

The longer the chain of child cursors the more time (elapsed and CPU) you spend searching for the correct child; and when a parent is allowed 8,192 child cursors the extra time can become significant. I would claim that the roughly 5 seconds difference in CPU time appearing in this test corresponds purely to an extra 5 milliseconds per execution spent walking an extra 7,000 or so steps down the chain.

If you have a well-behaved application that uses the session cursor cache effectively, or uses “held cursors”, then you may not be worried by very long chains of child cursors. But I have seen many applications where cursor caching is not used and every statement execution from the client turns into a parse call (usually implicit) followed by a hunt through the library cache and walk along the child chain. These applications will not scale well if they are cloned to multiple PDBs sharing the same CDB.
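
If you’re not sure whether your application is making use of the session cursor cache, running a query like the following sketch before and after a representative piece of workload is a quick check – the statistic names are the ones that appeared in the figures above:

select
        sn.name, ms.value
from
        v$mystat ms, v$statname sn
where
        sn.statistic# = ms.statistic#
and     sn.name in (
                'session cursor cache hits',
                'parse count (total)',
                'parse count (hard)'
        )
order by
        sn.name
/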

Footnote 1

The odd thing about this “cursor obsolete” feature is that I have a distinct memory that when PDBs were introduced at an ACE Director’s meeting a few years ago the first thought that crossed my mind was about the potential for someone running multiple copies of the same application as separate PDBs to see a lot of library cache latch contention or cursor mutex contention, because any popular statement would now be hitting the same parent cursor from multiple PDBs. I think the casual (i.e. neither formal, nor official) response I got when I raised the point was that the calculation of the sql_id in future releases would take the con_id into consideration. It seems that that idea fell by the wayside.

Footnote 2

If you do see a large number of child cursors for a single parent then you will probably end up looking at v$sql_shared_cursor for the sql_id to see if that gives you some good ideas about why a particular statement has generated so many child cursors. For a list of explanations of the different reasons captured in this view, MOS Doc ID 296377.1 “Troubleshooting: High Version Count Issues” is a useful reference.
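
As a sketch of the sort of query I mean – pick out the Y/N columns that are relevant to your version (the exact set varies), plus the reason column if it’s available:

select
        child_number, optimizer_mismatch, bind_mismatch, reason
from
        v$sql_shared_cursor
where
        sql_id = '&m_sql_id'
order by
        child_number
/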
