Blog Archive

Database / SQL Experiences

I have consolidated, and will do so in future, all my Experiences for Oracle Community. All these are my own opinions and does not necessarily reflect the views of Oracle. There are many databases with different permutation and combination which I have not seen. Hence, with these limitations, my views might be wrong, before implementing do test these. Please feel free to comment on my views, be it a Criticism, as this will only improve my writing abilities.

Tuesday, 23 June 2009

Cost Based Optimizer : Inefficient Input yields Inefficient Output


Cost Based Optimizer has always been a mystery for most of the Database Administrators and Developers. The calculation made by the Optimizer to come-up with an Execution Plan has been doubted by many of the Oracle users, especially, when (to name a few) :


  1. The Application Performance Degrades after Version Upgrade
  2. Plan of a Query change after gathering of fresh Statistics, and
  3. The Performance of a Query with higher cost is much better that the plan with a least cost


Cost Based Optimizer has improved a lot in previous few versions. Therefore, it can be said that CBO is still undergoing some enhancements. With these enhancements, Oracle Users have accepted the fact that the Application Performance might degrade after version upgrade and leave no room with rigorous testing of the application before production upgrade. It is fact no.2 and 3 that raises a doubt over Cost Based Optimization.


Cost Based Optimizer is a piece of code, wherein, you provide Inputs, by way of Optimizer Parameters, Statistics and it computes and generates an Output which is seen as an Execution Plan. It is not only the Optimizer Parameters and Statistics, which is passed as an Input. There is one more component, which is mostly overlooked, a QUERY. The initial input passed is the Query and based on the way it is written, optimizer does its cost calculation by taking into account the other two : Parameters and Statistics. In this blog, I will address the last two facts stated above, plus, a common misconception that Index Scan is better that Full Table Scan and finally, the issue with the critical first component passed as an input to CBO, A Query.



Recently, was working on a performance issue, related to a critical process. The overall performance of the system was OK, but one critical process started taking 18-20 hours, failing several times with ora-1555. This process, usually takes 1 hour 30 minutes. Therefore, all of a sudden, why this degradation ?, was the question raised by the customer.



The Database version is 8174. This is one of a critical process and a part of seeded application. It connects to a user, dedicated for this process, therefore, when this process is running, it is very easy to get the sid of the session executing this process. While monitoring the performance of this process, by querying v$session_wait for the sid, I could see one particular query doing exorbitant amount of Logical Reads. The text of the query and execution plan is as under :


explain plan for
select CS.CO_ID ,CS.SPCODE ,CS.SNCODE ,CS.TMCODE ,CS.CS_SEQNO ,CS.CS_STAT_CHNG ,
to_char(CS.TRIAL_END_DATE,'YYYYMMDDHH24MISS')
from CR_SERVICES CS
where (CS.CO_ID not in (select CO_ID
from CO_ALL
where (CO_ID=CS.CO_ID and CUSTOMER_ID=DEALER_ID))
and CS.SNCODE in (select distinct SNCODE from MPV
where (SNCODE=CS.SNCODE and RATING_IND='Y')));

Explained.

SQL> @?/rdbms/admin/utlxpls

Plan Table
------------------------------------------------------------------------------------------
| Operation | Name | Rows | Bytes| Cost | Pstart| Pstop |
--------------------------------------------------------------------------------
| SELECT STATEMENT | | 1K| 77K| 141575 | | |
| FILTER | | | | | | |
| NESTED LOOPS | | 1K| 77K| 141575 | | |
| TABLE ACCESS FULL |MPV | 74 | 444 | 13 | | |
| TABLE ACCESS BY INDEX R|CR_SERVICES | 3M| 120M| 1913 | | |
| INDEX RANGE SCAN |I1_CR_SERVICES | 3M| | 1616 | | |
| FILTER | | | | | | |
| INDEX RANGE SCAN |IDX_CUST_CO_DEALER | 1 | 14 | 2 | | |
--------------------------------------------------------------------------------


There are three tables in the query. The Driving Table, from the explain plan, MPV is a small table with 220 rows and 80 Blocks, while the other two are huge tables with 69 Million and 30 Million Rows. Usually, such situation can crop-up due to following reasons :


  1. Change in Optimizer Parameters
  2. Change in Statistics
  3. Creation of New Index



Initial Investigation revealed that the Statistics of CR_SERVICES table and indexes were gathered the previous night. Once this was confirmed, Cost Based Optimization was at the receiving end. This is quite natural as without any change in the application query, the plan changed for bad, due to CBO. Therefore, thorough investigation was required about the cause of this plan change. Seeing the query and other optimizer statistics, Full Table Scan on CR_SERVICES would have been a better choice and therefore, my investigation was around this. Let us revisit the relevant portion of the Execution Plan, the statistics and the Query.


## Table Statistics of CR_SERVICES

OWNER PAR NUM_ROWS BLOCKS
------------------------------ --- ---------- ----------
SM NO 69875510 1107733

## Column Stats of the Columns used on the Query

COLUMN_NAME NUM_DISTINCT NUM_NULLS DENSITY
------------------------------ ------------ ---------- ----------
CO_ID 6863377 0 1.4570E-07
SNCODE 370 0 .002702703 <-- Index on this Column

## Columns in the Index used the Query

INDEX_NAME COLUMN_NAME COLUMN_POSITION COLUMN_EXPRESSION
------------------------------ ------------------------------ --------------- --------------------
I1_CR_SERVICES SNCODE 1
CS_STAT_CHNG 2

## Index Statistics

INDEX_NAME NUM_ROWS BLEVEL LEAF_BLOCKS DISTINCT_KEYS CLUSTERING_FACTOR
------------------------------ ---------- ---------- ----------- ------------- -----------------
I1_CR_SERVICES 70724121 3 596745 11207514 40527179

## Relevant portion of the Query and Plan

and CS.SNCODE in (select distinct SNCODE from MPV
where (SNCODE=CS.SNCODE and RATING_IND='Y')));

SQL> @?/rdbms/admin/utlxpls

Plan Table
------------------------------------------------------------------------------------------
| Operation | Name | Rows | Bytes| Cost | Pstart| Pstop |
--------------------------------------------------------------------------------
| NESTED LOOPS | | 1K| 77K| 141575 | | |
| TABLE ACCESS FULL |MPV | 74 | 444 | 13 | | |
| TABLE ACCESS BY INDEX R|CR_SERVICES | 3M| 120M| 1913 | | |
| INDEX RANGE SCAN |I1_CR_SERVICES | 3M| | 1616 | | |


The Query is using an Index on SNCODE column, which is a very low cardinality column. Therefore, this was my primary target and I worked on the Optimizer Calculation to get the possible cause of this Index Scan.



Optimizer Calculation



The Cost Calculation of an Index Scan and Table Scan via Index is

Index Cost = Blevel+ceil(leaf_blocks*ix_selectivity)
Table Scan = ceil(clustering_factor*table_selectivity)

select 3+ceil(596745*1/370) "Index Scan Cost", ceil(40527179*1/370) from dual;
Index Cost = 1616 ## This matches the execution plan
Table Scan = 109533
Final Cost = 1616+109533=111149

select 3+ceil(596745*1/370) "Index Scan Cost", ceil(40527179*1/370*1/370) from dual;
Index Cost = 1616
Table Scan = 297
Final Cost = 1616+297=1913 ## This Matches.

This additional 1/370, was due to the fact that SNCODE column is used twice in the query. One for Index Filter, therefore considered only once for an index cost, and twice for Table cost due to SNCODE=CS.SNCODE join. The final Nested Loop Cost is 141575 and this is calculated as



NL Cost = Cost of an Outer Table + (Cost of Inner Table * Cardinality of Outer Table)
NL Cost = 13 + (1913*74)
NL Cost = 141575

Cost of Full Table Scan = Num Blocks / 6.59 (8k block size)
Cost of Full Table Scan = 1107733 / 6.59
Cost of Full Table Scan = 168093.02

Clearly, the cost of a Full Table Scan is higher than the Nested Loop Join and optimizer opted for the cheaper costed plan, which is, a Nested Loop Join and this requires an Index Scan of an Inner Table. When I checked the execution plan of this query with a FULL hint on CR_SERVICES, the FTS cost was 168200. I am yet to figure out the difference between my calculation and optimizer calculation, but for investigation purpose, I considered the calculation done by the optimizer. Based on this calculation, I came out with three possible solutions, these are :



  1. Modify the Query (Immediately Not possible, as it is a seeded query)
  2. Change in Optimizer Statistics
  3. Change in Optimizer Parameters



The first optimization was a better and safe approach, but was not possible immediately. The other two required some testing, as change in Optimizer Statistis can have negative impact on other queries using these objects and, change in optimizer parameters at db level can have an impact on the overall performance. As mentioned earlier, this process connects to a user dedicated for this process, therefore, ON-LOGON trigger with the Optimizer Parameter change at session level was thought of as a better option without impacting overall application performance. The parameter, I could immediately think of was, OPTIMIZER_INDEX_COST_ADJ. This parameter was default (100). For this query, to force Index Cost be Costlier than FTS, the calculation is OICA = (FTS Cost/NL Cost)*100 = (168200/141575)*100 = 118.80. Any value above this, would compute the NL cost be costlier than Full Table Cost and would opt for FTS and Hash Join. Initially, the Change was done at session level and the plan changed.



Plan Table
------------------------------------------------------------------------------------------
| Operation | Name | Rows | Bytes| Cost | Pstart| Pstop |
--------------------------------------------------------------------------------
| SELECT STATEMENT | | 1K| 77K| 168219 | | |
| FILTER | | | | | | |
| HASH JOIN | | 1K| 77K| 168219 | | |
| TABLE ACCESS FULL |MPV | 74 | 444 | 13 | | |
| TABLE ACCESS FULL |CR_SERVICES | 3M| 120M| 168200 | | |
| FILTER | | | | | | |
| INDEX RANGE SCAN |IDX_CUST_CO_DEALER | 1 | 14 | 2 | | |
--------------------------------------------------------------------------------

Finally, ON-LOGON trigger was created to set this parameter to 120, whenever, this user logged on to the database. The process, when re-executed, completed in 1 Hour 15 minutes and the desired performance was achieved.



Optimizer Calculation with OICA as 120


NL Table Scan Cost = (Current Cost of Inner Table Scan * Outer Table Card)*OICA/100
NL Table Scan Cost = (1913*74)*120/100
NL Table Scan Cost = (1913*74)*120/100
NL Table Scan Cost = 169874.4 > 168200 (FTS)

With this setting, the plan changed. Now for the reverse engineering. I wanted to check for any of the Index Statistics, say Clustering Factor, that would have calculated NL Join costlier than Full Table Scan. The Cost of NL Join with OICA is 169874.4. The cost of Table Scan via Index should be round(169874.4/74), which is 2296. Index Scan Cost is 1616, so the Table Scan cost component should be 2296-1616 is 680. Based on this, I calculated that if the Clustering Factor is changed to 93092000, then without any parameter setting, the query would go for a Full Table Scan.



SQL> exec dbms_stats.SET_INDEX_STATS('SM','I1_CR_SERVICES',CLSTFCT=>93092000);

PL/SQL procedure successfully completed.

SQL> @index_stats
Enter value for o: SM
old 3: where table_owner='&O'
new 3: where table_owner='SM'
Enter value for t: CR_SERVICES
old 4: and table_name='&T'
new 4: and table_name='CR_SERVICES'

INDEX_NAME NUM_ROWS BLEVEL LEAF_BLOCKS DISTINCT_KEYS CLUSTERING_FACTOR
------------------------------ ---------- ---------- ----------- ------------- -----------------
I1_CR_SERVICES 70724121 3 596745 11207514 93092000

12 rows selected.

explain plan for
select CS.CO_ID ,CS.SPCODE ,CS.SNCODE ,CS.TMCODE ,CS.CS_SEQNO ,CS.CS_STAT_CHNG ,
to_char(CS.TRIAL_END_DATE,'YYYYMMDDHH24MISS')
from CR_SERVICES CS
where (CS.CO_ID not in (select CO_ID
from CO_ALL
where (CO_ID=CS.CO_ID and CUSTOMER_ID=DEALER_ID))
and CS.SNCODE in (select distinct SNCODE from MPV
where (SNCODE=CS.SNCODE and RATING_IND='Y')));

Explained.

SQL> @?/rdbms/admin/utlxpls

Plan Table
------------------------------------------------------------------------------------------
| Operation | Name | Rows | Bytes| Cost | Pstart| Pstop |
--------------------------------------------------------------------------------
| SELECT STATEMENT | | 1K| 77K| 168219 | | |
| FILTER | | | | | | |
| HASH JOIN | | 1K| 77K| 168219 | | |
| TABLE ACCESS FULL |MPV | 74 | 444 | 13 | | |
| TABLE ACCESS FULL |CR_SERVICES | 3M| 120M| 168200 | | |
| FILTER | | | | | | |
| INDEX RANGE SCAN |IDX_CUST_CO_DEALER | 1 | 14 | 2 | | |
--------------------------------------------------------------------------------

11 rows selected.

Back to the problem, does it look like to be an Optimizer Issue ? This table has a very high growth rate. While the num_blocks value of a table grows faster, the statistics of Leaf Blocks, BLevel and Clustering Factor does not change significantly. This is enough to blow up the cost of a Full Table Scan. If you see the query and the relevant portion of the query, pasted above, the join on SNCODE is really not required as it uses an IN clause. This unwanted join, forced the optimizer to consider the Density of SNCODE twice in Table Scan cost calculation.




## Relevant portion of the Query and Plan

and CS.SNCODE in (select distinct SNCODE from MPV
where (SNCODE=CS.SNCODE and RATING_IND='Y')));

## Optimizer Calculation (1/370 is considered twice to get Table Scan cost of 297, thus making it cheaper).
Index Cost = Blevel+ceil(leaf_blocks*ix_selectivity)
Table Scan = ceil(clustering_factor*table_selectivity)

select 3+ceil(596745*1/370) "Index Scan Cost", ceil(40527179*1/370*1/370) from dual;
Index Cost = 1616
Table Scan = 297
Final Cost = 1616+297=1913

If this portion of the query is modified as :



and CS.SNCODE in (select distinct SNCODE from MPV
where (RATING_IND='Y')));

## Optimizer Calculation, with this change
Index Cost = Blevel+ceil(leaf_blocks*ix_selectivity)
Table Scan = ceil(clustering_factor*table_selectivity)

select 3+ceil(596745*1/370) "Index Scan Cost", ceil(40527179*1/370) from dual;
Index Cost = 1616
Table Scan = 109533
Final Cost = 1616+109533= 111149

## With this Cost, the cost of NL Join would have been

NL Cost = Cost of an Outer Table + (Cost of Inner Table * Cardinality of Outer Table)
NL Cost = 13 + (111149*74)
NL Cost = 8225039 > 168200 (FTS)

One unwanted Join Condition made Cost based Optimizer to come up with an in-efficient plan. Therefore, as mentioned earlier, my Initial and Crucial input to the optimizer is a Query and if these are not optimally written, Cost Based Optimizer is bound to compute an sub-optimal plan. The Query, modified by removing unwanted join and no parameter change or statistics change, gave the desired and better plan (see below ).



explain plan for
select CS.CO_ID ,CS.SPCODE ,CS.SNCODE ,CS.TMCODE ,CS.CS_SEQNO ,CS.CS_STAT_CHNG ,
to_char(CS.TRIAL_END_DATE,'YYYYMMDDHH24MISS')
from CR_SERVICES CS
where (CS.CO_ID not in (select CO_ID
from CO_ALL
where (CO_ID=CS.CO_ID and CUSTOMER_ID=DEALER_ID))
and CS.SNCODE in (select distinct SNCODE from MPV
where (RATING_IND='Y')));

Explained.

SQL> @?/rdbms/admin/utlxpls

Plan Table
------------------------------------------------------------------------------------------
| Operation | Name | Rows | Bytes| Cost | Pstart| Pstop |
--------------------------------------------------------------------------------
| SELECT STATEMENT | | 1K| 77K| 168219 | | |
| FILTER | | | | | | |
| HASH JOIN | | 1K| 77K| 168219 | | |
| TABLE ACCESS FULL |MPV | 74 | 444 | 13 | | |
| TABLE ACCESS FULL |CR_SERVICES | 3M| 120M| 168200 | | |
| FILTER | | | | | | |
| INDEX RANGE SCAN |IDX_CUST_CO_DEALER | 1 | 14 | 2 | | |
--------------------------------------------------------------------------------

Many developers are of the Opinion that an Execution Plan with cheaper cost is better than a Plan with higher cost, and Index Scans are better than Full Table Scans. This blog demonstrates that these are just misconceptions and it is the application design and knowledge that should be taken into account while writing an effecient code.

To sum up, it is just not the Optimizer Statistics and Parameter that control the Cost Based Optimizer, Queries play an important role too. A well written code should never bother or doubt about Cost Based Optimization.

Tuesday, 26 May 2009

Additional Info my Previos Blog on "Optimizing a rare DWH Query in a OLTP"

This is in continuation to my previous blog. Based on the Comment / Feedback from one of my reader, thought of posting the response. The comment from Narendra is as under :

Thanks for sharing this detailed case study. I, however, did not understand the following paragraph "Once the Query was modified, to maintain data consistency, the original query was modified with additional Filter predicate as PN=:PN. This :PN is the Permanent Number fetched from the temporary table. All other filter predicates will still be applied, so that, in case, if any of the queried column is modified in the production table after the temp table was populated, this data will be fetched from the temporary table, but the condition will evaulate to FALSE when again queried from the production table. Querying the production table with PN, it is a Primary Key, is very fastdoes it mean predicate PN = :PN was ANDed to the original query? How did the revised query work in the scenario where data is modified after population of temp table? What does the revised query look like? "

May be I have not explained the changes correctly, therefore, for the benefit of the other readers, thought of writing on the changes again. The changes suggested was to populate a Temporary table every night. This will be a TRUNCATE AND INSERT INTO command. Therefore, every night at scheduled time, a temporary table A_F49A_XFER is truncated and rows are inserted from the Production Table.



## Step 1 of the Changes


truncate table a_f49a_xfer;
insert into a_f49a_xfer
Select /*+ PARALLEL(a_f49a,24) */ pn, a_cd, a_type, r_cd, a_no, title, surname, first_name, middle_name,
status, dob, src_sal, src_house, -- WHERE PREDICATE COLUMNS
src_bus, src_cap, src_others, surname, first_name, ho_nm, -- WHERE PREDICATE COLUMNS
off_nm, off_line1, off_line2, off_line3, off_line4, off_line5, -- WHERE PREDICATE COLUMNS
off_state, off_pin, area_cd, ao_type, range_cd, ao_no -- WHERE PREDICATE COLUMNS
from a_f49a;

This, so-called temporary, table is created with only those columns that are used in the Query. Therefore, the size of this table is less than 1/4th the size of the Production Table. Since, the query has 22 columns in the WHERE predicate, equal number of single column bitmap indexes are created on this XFER table. Next step is to create another query with some modifications so that BITMAP indexes can be effeciently used. This was, to dynamically write a query on A_F49A_XFER table with only those columns in the WHERE predicate for which the end-user has entered non-null values.



## Step 2 of the Changes


if :f_parms_manual.n_status is not null then
l_where := l_where||' '||'(status = :f_parms_manual.n_status) and';
end if;

if :f_parms_manual.n_dob_start is not null then
l_where := l_where||' '||'(to_char(dob,''''J'''') >= to_char(:f_parms_manual.n_dob_start,''''J'''')) and';
end if;

if :f_parms_manual.n_dob_end is not null then
l_where := l_where||' '||'(to_char(dob,''''J'''') <= to_char(:f_parms_manual.n_dob_end,''''J'''')) and';
end if

The IF condition above is for all the 22 columns, therefore, if the end-user enters selection criteria for only 5 columns, the query will be dynamically written with 5 Columns, as shown below :




Select pn, a_cd, a_type, r_cd, a_no, title, surname,
first_name, middle_name
from as_f49a_xfer
where ((status = :f_parms_manual.n_status) and
(src_sal = :f_parms_manual.n_src_sal) and
(src_house = :f_parms_manual.n_src_house) and
(src_cap = :f_parms_manual.n_src_cap) and
(src_others = :f_parms_manual.n_src_oth));

Since, the data into this XFER table is populated once in day and contains a day old data (refreshed at 00:00 hrs), it is quite possible that many of the columns, out of 22 columns, have undergone some changes. This being an OLTP table, users are allowed to update some columns or delete rows from the table. These changes are not replicated into the XFER table, hence, the WHERE condition applied on XFER table does not guarantee 100% consistency from the production table. For eg, assuming status column for PN 100 has was 'T' when the table was populated last night and has changed to 'P' today morning, and the end-user executes the search criteria for status='T', then 100 will be fetched from XFER table (should not have been fetched from the production table). Therefore, to maintain data consistency, once the rows (PN's) from XFER table were fetched, the ORIGINAL query, with additional predicate appended as PN=:PN is executed on A_F49A table. The Query executed is :




Select pn, a_cd, a_type, r_cd, a_no, title, surname,
first_name, middle_name
from as_f49a
where ((:f_parms_manual.n_status is null or status = :f_parms_manual.n_status) and
(:f_parms_manual.n_dob_start is null or to_char(dob,'J') >= to_char(:f_parms_manual.n_dob_start,'J')) and
(:f_parms_manual.n_dob_end is null or to_char(dob,'J') <= to_char(:f_parms_manual.n_dob_end,'J')) and
(:f_parms_manual.n_src_sal is null or src_sal = :f_parms_manual.n_src_sal) and
(:f_parms_manual.n_src_house is null or src_house = :f_parms_manual.n_src_house) and
(:f_parms_manual.n_src_bus is null or src_bus = :f_parms_manual.n_src_bus) and
(:f_parms_manual.n_src_cap is null or src_cap = :f_parms_manual.n_src_cap) and
(:f_parms_manual.n_src_oth is null or src_others = :f_parms_manual.n_src_oth) and
(:f_parms_manual.n_surnm is null or surname like :f_parms_manual. ||'%') and
(:f_parms_manual.n_first_name is null or first_name like :f_parms_manual.n_first_name ||'%') and
(:f_parms_manual.n_ho_nm is null or ho_nm like '%' || :f_parms_manual.n_ho_nm ||'%') and
(:f_parms_manual.n_off_nm is null or off_nm like '%' || :f_parms_manual.n_off_nm ||'%') and
(:f_parms_manual.n_off_line1 is null or off_line1 like '%' || :f_parms_manual.n_off_line1 ||'%') and
(:f_parms_manual.n_off_line2 is null or off_line2 like '%' || :f_parms_manual.n_off_line2 ||'%') and
(:f_parms_manual.n_off_line3 is null or off_line3 like '%' || :f_parms_manual.n_off_line3 ||'%') and
(:f_parms_manual.n_off_line4 is null or off_line4 like '%' || :f_parms_manual.n_off_line4 ||'%') and
(:f_parms_manual.n_off_line5 is null or off_line5 like '%' || :f_parms_manual.n_off_line5 ||'%') and
(:f_parms_manual.n_off_state is null or off_state = :f_parms_manual.n_off_state ) and
(:f_parms_manual.n_off_pin is null or off_pin = :f_parms_manual.n_off_pin ) and
(:f_parms_manual.n_area_cd is null or area_cd = :f_parms_manual.n_area_cd) and
(:f_parms_manual.n_ao_type is null or ao_type = :f_parms_manual.n_ao_type) and
(:f_parms_manual.n_range_cd is null or range_cd = :f_parms_manual.n_range_cd) and
(:f_parms_manual.n_ao_no is null or ao_no = :f_parms_manual.n_ao_no) and
((:f_parms_manual.n_ret_inc_start is null and :f_parms_manual.n_ret_inc_end is null )))
and pn=:pn; -- THIS IS THE ADDITIONAL PREDICATE



Continuing from our previous example, wherein, the end-user provided input to only 5 columns, this query, on production table and appended PN predicate, will be executed for all the columns and PN passed from the resultset of the first query. Therefore, when PN is passed as 100, the query will not return any rows as the STATUS column for this PN has changed to 'P', while the query checks for 'T'. Data consistency is guaranteed by applying all the predicates again on the production table and performance is guaranteed by passing PN (primary key) to the query.

To Summarize, the Sequence of this optimization is :


Truncate and Populate the Table every 00:00 Hrs (Not a part of the Application, but mandatory prerequisite to maintain day old data).
Dynamic Query on Temporary Table and fetch the result set. PN is the Key column of importance. (Part of FORM).
Once the PN is fetched from Step 2, these are passed to a Query similar to Original with appended PN condition. (Part of FORM).
Rest of the application logic remains the same. i.e.process the records, finally fetched by Step 3. (Untouched Part of FORM).

Tuesday, 19 May 2009

Optimizing a rare DWH Query in an OLTP System ! Compromising Performance and Data Consistency...



During my 1 Hour session at "All India Oracle User Group" event in Hyderabad, I mentioned that Tuning successes often come randomly, depending more on intuition and trial-and-error that any sort of reliable or repeatable method. Optimization sometime requires a different approach and thinking. I encountered one such situation at one of my customer site.



The setup is a 2 Node 10.2.0.3 RAC and the customer reported slowness during peak hour. Investigation revealed that the issue normally comes during the execution of one particular query that consumes most of the resources with huge GC related waits. This was a Full Table Scan on a 30 GB Partitioned Table. The Query was written in such a way, in fact, the customer requirement was such that, it required Full Scan of the table. There were around 20-25 filter predicates in the query. The end-users, at runtime, has choice of specifying one restriction, multiple restrictions or values for all the filter predicates. This requirement is similar to a typical Datawarehouse environment and the optimization technique for such queries could have been applied on this too. In Datawarehouse, such requirements are normally satisfied by way of Bitmap Indexes, wherein, multiple Bitmap Indexes can be combined together to fetch the desired resultset in a faster way. Refer to my Article "Bitmap Index v/s BTree Index ! Which and When ?" on this subject.



This being an OLTP system, creating Bitmap Indexes is definetely a right choice here, as it can introduce severe locking issues. Therefore, optimizing this query was a biggest challenge. The Original Query is pasted below :




Select pn, a_cd, a_type, r_cd, a_no, title, surname,
first_name, middle_name
from as_f49a
where ((:f_parms_manual.n_status is null or status = :f_parms_manual.n_status) and
(:f_parms_manual.n_dob_start is null or to_char(dob,'J') >= to_char(:f_parms_manual.n_dob_start,'J')) and
(:f_parms_manual.n_dob_end is null or to_char(dob,'J') <= to_char(:f_parms_manual.n_dob_end,'J')) and
(:f_parms_manual.n_src_sal is null or src_sal = :f_parms_manual.n_src_sal) and
(:f_parms_manual.n_src_house is null or src_house = :f_parms_manual.n_src_house) and
(:f_parms_manual.n_src_bus is null or src_bus = :f_parms_manual.n_src_bus) and
(:f_parms_manual.n_src_cap is null or src_cap = :f_parms_manual.n_src_cap) and
(:f_parms_manual.n_src_oth is null or src_others = :f_parms_manual.n_src_oth) and
(:f_parms_manual.n_surnm is null or surname like :f_parms_manual. ||'%') and
(:f_parms_manual.n_first_name is null or first_name like :f_parms_manual.n_first_name ||'%') and
(:f_parms_manual.n_ho_nm is null or ho_nm like '%' || :f_parms_manual.n_ho_nm ||'%') and
(:f_parms_manual.n_off_nm is null or off_nm like '%' || :f_parms_manual.n_off_nm ||'%') and
(:f_parms_manual.n_off_line1 is null or off_line1 like '%' || :f_parms_manual.n_off_line1 ||'%') and
(:f_parms_manual.n_off_line2 is null or off_line2 like '%' || :f_parms_manual.n_off_line2 ||'%') and
(:f_parms_manual.n_off_line3 is null or off_line3 like '%' || :f_parms_manual.n_off_line3 ||'%') and
(:f_parms_manual.n_off_line4 is null or off_line4 like '%' || :f_parms_manual.n_off_line4 ||'%') and
(:f_parms_manual.n_off_line5 is null or off_line5 like '%' || :f_parms_manual.n_off_line5 ||'%') and
(:f_parms_manual.n_off_state is null or off_state = :f_parms_manual.n_off_state ) and
(:f_parms_manual.n_off_pin is null or off_pin = :f_parms_manual.n_off_pin ) and
(:f_parms_manual.n_area_cd is null or area_cd = :f_parms_manual.n_area_cd) and
(:f_parms_manual.n_ao_type is null or ao_type = :f_parms_manual.n_ao_type) and
(:f_parms_manual.n_range_cd is null or range_cd = :f_parms_manual.n_range_cd) and
(:f_parms_manual.n_ao_no is null or ao_no = :f_parms_manual.n_ao_no) and
((:f_parms_manual.n_ret_inc_start is null and :f_parms_manual.n_ret_inc_end is null )));



As mentioned earlier, this is a huge partitioned table, with a Full Table Scan. Since, there are no mandatory restrictions, it was difficult to create an Index to optimize the performance. This query runs from both the nodes, there are 3 to 4 users concurrently running this query thus causing most of the issues.



About the Query:



PN, stands for Permanent Number, column of Table AS_F49A is a Primary Key. At the report level, user has an option to limit the number of rows fetched by this query, to either : 500 (DEFAULT), or any specified number, limiting upto 10000. Once, the rows are fetched, series of small batch processes, like transfer of a PN from one location to another, closure of one PN and opening of new PN etc, are executed. In this entire process, this query is the most time consuming and resource intensive as it involves FTS of a huge table. Once the rows are fetched, executing other DML's does not take much resource.



Optimization Possibilities


There were various options discussed and suggested, and the few of them, that really worked were :



  1. Modify the FORM to generate Dynamic Query with the columns with non null inputs.
  2. Populate a Temporary Table and Create Bitmap Indexes.



Option 1 above will have a negative implication that every time the user input changes, the query will be hard parsed. The Query will be built dynamically based on the evaluation of each FORM Level variable :



if :f_parms_manual.n_status is not null then
l_where := l_where||' '||'(status = :f_parms_manual.n_status) and';
end if;

if :f_parms_manual.n_dob_start is not null then
l_where := l_where||' '||'(to_char(dob,''''J'''') >= to_char(:f_parms_manual.n_dob_start,''''J'''')) and';
end if;

if :f_parms_manual.n_dob_end is not null then
l_where := l_where||' '||'(to_char(dob,''''J'''') <= to_char(:f_parms_manual.n_dob_end,''''J'''')) and';
end if

In this case, I don't have to bother on hard parses as this query does not run multiple times in a second i.e. the execution per second rate is very less. The Query runs, once in 5-10 minutes, but since it is doing FTS of a huge table, each execution takes more than 30 minutes and therefore, there is always an overlap. Therefore, Issues due to Hard Parses was considered as negligible as compared to the issues due to the execution.



On the Second option, Populating a Temporary Table, during the discussion with the customer, he mentioned that it is very rare that the query processes newly inserted PN's. Also, seeing the negative implication of this query, when it runs on the production table, customer agreed to implement a workaround to process a day old entries into the table AS_F49A. In that case, maintaining data consistency should be a top most priority, as any error or incorrect data coould lead to huge loss. The solution that was built on this was :



  1. Truncate and Populate a Temporary Table every night (SAY A_F49A_XFER).
  2. Create BITMAP Indexes on this temporary table (One time Activity)
  3. Modify the FORM, to get the desired result from the Temporary Table.
  4. To maintain Data Consistency, run the original Query against the Production Table, with additional mandatory filtering as PN=:PN_FROM_TEMP_TABLE.



During Night Time, the system is almost IDLE and the available CPU Power can be utilized to populate the temporary table in parallel. The temporary table will hold only those columns used in the query. Therefore, the size of this temporary table was less than 1/4th of the the actual size of the table. Since, the data will be only fetched from this table, Bitmap Indexes would be an ideal choice. Therefore, single column Bitmap Index will be created on all the columns used in the WHERE predicate. The FORM will be modified to build the query dynamically based on the user input and point to the Temporary Table. Now, since we have Bitmap Indexes in place, the query started combining multiple Bitmap Indexes and fetch the records in less than a second, as against 20 to 30 Minutes on Production Table, with no Indexes.



Once the Query was modified, to maintain data consistency, the original query was modified with additional Filter predicate as PN=:PN. This :PN is the Permanent Number fetched from the temporary table. All other filter predicates will still be applied, so that, in case, if any of the queried column is modified in the production table after the temp table was populated, this data will be fetched from the temporary table, but the condition will evaulate to FALSE when again queried from the production table. Querying the production table with PN, it is a Primary Key, is very fast.



While the Original Query takes 20 to 30 minutes, this modified version takes less than 2-3 minutes. Also, since we have eliminated FTS on a huge table, the GC related waits also vanished giving much of the relief to the end-users.



The point is that when developing a system, you need to give a lot of thought to how people will actually use that system. End-User does not know the importance of FILTERING. You give them the choice of restrictions, they will open-heartedly accept the choice. It is upto the developers to restrict these choices. In cases, where the restrictions are just not applicable or feasible, there are some compromises and sacrifices to be made, and this can be either Performance or Data Fetching. In this case, performance was badly impacted as there were no mandatory restrictions, but after compromising on day old data, performance and scalability was achieved by applying a solution that was faster and guaranteed consistency.

Monday, 20 April 2009

DBA - Developer Relationship ! A fine example of Optimization.



Thomas Kyte wrote a very good paragraph on DBA - Developer Relationship in his book "Expert One-on-One". He truly says that this relationship should be healthy. During my AIOUG presentation, I had a slide on this topic and I went on explaining about the importance of this relationship. I have come across many performance issues, where, the issues come up because Developers work in isolation, without bothering much about the performance of their application queries and this indirectly means, leaving the job of optimization for the DBA's. This impacts the scalability of an entire production system. This also means that the application queries are revisited for optimization, based on the recommendations by the DBA's.



Let me present a real life example, wherein, a discussion with the Developer helped me optimizing performance of a critical process. Recently, I was working on one Severity 1 performance issue. The problem summary is overall slowness at the database level and the main contributor to this performance issue being, Hard Parses and Huge Resource Intensive Queries. Surprisingly, The database version is 8.1.7.4.



While optimizing the performance of a critical process, we came across a top query. The entire process use to take almost 11 hours, wherein, this problematic step takes 8 hours or sometimes fails with ora-1555. Therefore, it was evident that optimizing this step would bring down the completion time of the entire process. This particular step had three different queries and based on the inputs from the developers, and 10046 trace files, the contribution of each of these were 7 hours, 25 minutes and 25 minutes. Therefore, it is clearly the first query that takes significant amount of time.



Query Text


SELECT opnamt_gl, entdate, argdate
FROM ORDER_ALL
WHERE customer_id = :b1
AND ((invtype = 5 AND status IN('IN','CM'))
OR (status IN('FC','CO')))
AND opnamt_gl<>0;




This query executes in a loop and processes almost 1 Million customers. There is a Composite Index on (customer_id, status, invtype) and this query takes almost 2 seconds to process a customer. Based on this data, the next two queries are executed. The table ORDER_ALL is 16 GB in size. Since the last 2 columns of the Index is used as OR or IN predicate, the index scan is only done based on the customer_id. The Query fetches only 1 row for each customer.



A Casual Discussion that helped Optimizing this code


While we were discussing this with the developer of this process, he mentioned that this query takes 2 seconds, whereas, on another database it takes 31 milliseconds. I casually asked him to select all the columns used in the query, commenting all other columns used in the WHERE clause and execute it on both the databases. The query executed was :



SELECT opnamt_gl, entdate, argdate, invtype, status, opnamt_gl
FROM ORDER_ALL
WHERE customer_id = :b1;




Again to the comparison of 31 msec and 2 sec. Though, the query fetches only 1 row on both the databases, when this modified query was executed on each of these, it revealed that on the problematic database, it is fetching approx 115 rows for each customer against 25 rows on another. It is this difference that is taking resource and the additional time.


The issue was 2 seconds for each customer and this was not acceptable. The challenge was reduce this processing time for each customer. As an optimizing technique, we executed the query, only on an indexed column, i.e.customer_id, in the WHERE clause and we got 115 rows. I concentrated on the values in the other columns of the query that were used in the WHERE clause of the Original Query and was surprised to see that only 2 rows had a non zero opnamt_gl. The original query has a predicate opnamt_gl<>0. When asked, the developer replied that for all the customers, there will be only 1 or 2 rows with non zero value and this helped me recommending an interesting optimizing technique.


As mentioned earlier, the query fetches only 1 row, which means, out of the (approx) 115 rows for each rows, 114 rows are discarded. TKPROF also showed this behaviour. Therefore, instead of discarding 114 rows, why not get 2 rows from an index and discard 1 row. By now, the developer understood that we are planning to recommend a new composite index on (customer_id, opnamt_gl) and therefore, before we recommend anything, he himself mentioned that they tried creating this index but the response time did not improve. The reason they did not see any performance gain was the condition opnamt_gl<>0. The recommendation provided here gave a major performance relief. The steps involved for optimizing this piece of the code were :



  1. A Function based Index.
  2. Create a View. This is an additional step since the db version is 8174.
  3. Modify the query so that it queries the view.




We primarily targetted



WHERE customer_id = :b1
and opnamt_gl<>0;



We created a function based Index on (customer_id, case when opnamt_gl=0 then null else 'Y' end). Since the db version is 8174, CASE statements cannot be used in a pl/sql block, therefore we created a view as :


create view ORDER_ALL_VW as
SELECT opnamt_gl, entdate, argdate, customer_id
FROM ORDER_ALL
WHERE ((invtype = 5 AND status IN('IN','CM'))
OR (status IN('FC','CO')))
AND (case when opnamt_gl=0 then null else 'Y' end) = 'Y';




Then the Query was modified as

SELECT opnamt_gl, entdate, argdate
FROM ORDER_ALL_VW
WHERE customer_id = :b1;


With this optimization, the query response time was 3 msec for a customer and with this the completion time of this query for 1 million customer came down from 7 hours to 7 minutes. This optimization also guaranteed that the process would not fail with ora-1555.



Definetely, a discussion on the optimizing strategy with the developer helped us optimizing the performance of a critical query. Developers know their data well, and in my view, this recommendation should have come from the developers. For this, developers needs to be well conversant with database features and explore each of these while writing an application query.




Monday, 16 March 2009

Latch: Row Cache Objects causing huge performance issue





Since last few weeks, I was a bit busy with one critical issue pertaining to "Latch: row cache objects". I wrote about this on 19th February 2009.



The setup here was a 2 Node RAC running Oracle 10g Release 2 (10203). Seeing the AWR reports, we could clearly see that the GETS on DC_Users were pretty high. The additional statistics was the version_count of the queries. Row Cache Latches are normally seen when the parse ratio is very high and in this case too, the parse ratio was very high but one thing worth mentioning is that the application queries heavily make use of Bind Variables. Since, the GETS on DC_Users seemed to be alarming, we directly targetted on the reasons that can lead to these huge Gets. One of the reason for these huge gets is Virtual Private Database (VPD). The customer here has implemented VPD and this implementation caused the contention on these latches. As far as the feature is concerned, VPD is a very useful feature as it provides a easy and manageable method for implementing row level security, but
there are certain implementation aspects that needs to be considered based on the requirement and it is this aspect that caused a major performance issue at this customer site.




A brief explanation on VPD




VPD also called as Row Level Security or Fine Grained Access Control is a feature that was introduced in Oracle 8i Database that allows you to define security policies on tables by way of a function that implies a restriction on the rows that a user can Select or update or Insert or Delete based on the privileges defined for that user. For example, assuming a following herierchy :


A, B and C report to D, while D and E reports to Z.



With this herierchy, a requirement like :



1. Z can see and manipulate all the data, and

2. E can see and manipulate only his data, and

3. D can see and manipulate his data and data pertaining to his reportees i.e. A, B and C

4. A, B and C can see and manipulate only their relevant Data.



To accomplish this, VPD can be implemented without much of an effort.



There are three basic components of VPD. These are :



Policy : A declarative command that determines when and how to apply the policy i.e. during Selects, Insertions, Deletion, Updation or combination of any or all of these.


Policy Function : A Pl/SQL function that is called whenever the condition specified in the policy are met.


Predicate : A string that is generated by the policy function, and then applied to the users' sql statements. This is an additional Filter predicate
generated by the policy function and appended to the sql statement executed by the user.



Once these three are defined in a policy, then this policy can be attached to the tables that require Security to be implemented.


While Policy, Policy Function and Predicate are three main components that require to be defined for VPD, there is another component, that is mostly ignored, called as Policy Type. This is very important to be defined. If not defined, it is defaulted to a value depending upon the database version and therefore, should not be left at default.


Starting Oracle 10g, the policy type defaults to DYNAMIC. While there are other policy types : SHARED Static, CONTEXT-SENSITIVE and SHARED Context Sensitive.
The implementation of each of these depend upon the requirement.



Back to the Issue with the Customer



As mentioned earlier, DYNAMIC policy is a default in Oracle 10g and this is where the problem occured. In most of the cases, policy predicate needs to be build based on the user's issuing the query and this require Dynamic built up of predicate as it evaluates the predicate at runtime.



With this implementation, when each record is returned, the policy executes the policy function, checks for the predicate and decides, at run time, whether
the row is to be passed or not. This means, the query goes through parse-execute-fetch phase each time for the number of records processed by a query and this is certainly an expensive operation.


A better option, in this case would be either Context-Sensitive or Shared Context-Sensitive that implements dynamic predicate generation at runtime but goes thru the parse phase only when the context information of a user changes. For example, by way of SET ROLE. Though this also involves parsing of a statement each time the context is changed but is less expensive as compared to Dynamic policy type that involves parsing for each and every row processed. Since parsing involves latching, Latch Contention is inevitable with Dynamic Policy type and this is what the current issue at the customer site was.


In terms of CPU cycles, Dynamic policy type consumes huge CPU as compared to Context-Sensitive that consumes slightly higher than Static.



In order to use Context-Sensitive Policy type, following steps were implemented.



create context ctx_policy using ITUSR.set_ctx_policy;



create procedure set_ctx_policyl(p_attr in varchar2, p_val in varchar2)

is

begin dbms_session.set_context('CTX_POLICY',p_attr,p_val);

end;



and in the policy function, the RETURN value was changed from



## Commented this

----l_predicate:='C_NUM in 'sys_context('CTX_POLICY','C_NUM');

l_predicate := 'C_NUM = sys_context (''CTX_POLICY'', ''C_NUM'')';

return l_predicate;




Once this was implemented, the CPU Utilization which was always 95% dropped drastically to 50% with no latch contention. This drop was basically due to less parsing and latching making the system more scalable.



Saturday, 7 March 2009

AIOUG TechNight @ Mumbai, INDIA

I will be speaking at AIOUG Technight to be held in Mumbai on 9th March 2009. This time too, I would be covering "Oracle Query Optimizer" with some new findings and information.

For further details, visit http://www.aioug.org/
.

Thursday, 19 February 2009

Huge waits on latch: row cache objects

It has been more than a month and a new technical blog is long overdue. These 2 months, I was very much involved in family matters and was almost out of action.

Currently, I am working on a Performance Issue of one of my customer. Though there are many issues, some related to database and some related to application, the current top most event that needs a serious investigation is "Latch:row cache objects". Whenever this event comes on top, the CPU Utilization is almost 80% to 90% and major slowness is reported. I am currently working on this issue with a possible solution in hand. This solution will be implemented on production in a day or so and once, I feel that the issue is resolved would share this in my next blog.
Just to get the opinion from other Oracle Experts across the globe, am pasting some of the statistics relevant for this wait worth investigating.
60% Waits on "Latch:row cache latch"
The Dictionary Cache Statistics in the AWR report shows huge Get Requests against dc_users and dc_username.

About Me

My Photo
Vivek Sharma
Mumbai, Maharashtra, India
I am an Oracle Professional, mainly into Application and Database Tuning. I have been associated with Oracle technologies for more than 8 years and, in these 8 years, have seen Oracle Database growing at a fast pace. This is one of the primary reason to be associated with this technology. I initiated my association with Oracle as a FORMS Developer and then transitioned to a DBA. Development experience helped me a lot in performing my DBA responsibilities, especially, when it comes to Database and Application tuning. I am a technical speaker with a passion to share and learn from Oracle Community across the world.
View my complete profile

Number of Hits on this Page (Counter Initiated since 4th Nov 2007)

Visit online slots or online blackjack at casino games site.

Visitor Locations - Across Globe

Some of my Favourite Books

  • Oracle 10g Real Application Clusters Handbook by K. Gopalakrishnan
  • Oracle Wait Interface by K. Gopalakrishnan
  • Effective Oracle by Design, Author : Thomas Kyte