
OCI Client-Side Deployment in 12C


The 12c release has introduced a new feature called “OCI Client-Side Deployment Parameters”. This feature lets you override some OCI parameters without changing your application code.

Let’s build a simple test case using Python and the cx_Oracle module (which uses the OCI interface) to connect to the Oracle database.

You can find this module for your Python version and the 12c database version here.

This is the code of our test case:

#!/usr/bin/python2.7
import cx_Oracle

connection = cx_Oracle.connect('test', 'test', 'testdb')
for i in range(0, 1000000):
    cursor = connection.cursor()
    cursor.execute('select 1 FROM dual')
    cursor.close()

sql = ("select a.name name, b.value value from v$statname a, v$mystat b "
       "where a.statistic# = b.statistic# "
       "and (a.name like '%parse%' or a.name like '%session cursor%')")
cursor = connection.cursor()
cursor.execute(sql)
rows = cursor.fetchall()
for name, value in rows:
    print "%s : %d" % (name, value)
connection.close()

So basically, we execute the simple “select 1 from dual” one million times, opening and closing the cursor at each iteration.

Then we display our session statistics regarding parsing.

After executing the script, let’s check the result:

session cursor cache hits : 16961
session cursor cache count : 7
parse time cpu : 176
parse time elapsed : 641
parse count (total) : 1000010
parse count (hard) : 2

The time consumed by parsing was around 6 seconds and, as you can see, it was mostly soft parsing (only 2 hard parses were performed, for the two cursors opened in our code).

This is even better given that the cursor was already cached in our private SQL area (as shown by the session cursor cache hits statistic), so the soft parse was even softer :)

But wait!! Why is the session cursor cache hits statistic so small after one million executions?

If we add these lines inside the loop:

    cursor.execute('select 1 FROM dual')
    sql = ("select a.name name, b.value value from v$statname a, v$mystat b "
           "where a.statistic# = b.statistic# "
           "and a.name = 'session cursor cache hits'")
    cursor.execute(sql)
    rows = cursor.fetchall()
    for name, value in rows:
        print "%s : %d" % (name, value)

You will see the following result :

session cursor cache hits : 65525
session cursor cache hits : 65527
session cursor cache hits : 65529
session cursor cache hits : 65531
session cursor cache hits : 65533
session cursor cache hits : 65535
session cursor cache hits : 1
session cursor cache hits : 3
session cursor cache hits : 5
session cursor cache hits : 7

So I guess that this statistic is stored in a 16-bit integer and cannot hold more than 65535.
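A quick back-of-the-envelope check makes the wrap-around plausible (this is only an illustration of the 16-bit hypothesis, not of how Oracle stores the statistic internally): each loop iteration adds 2 cache hits, one for the “select 1 from dual” and one for the statistics query, and a 16-bit counter wraps modulo 65536.

hits = 65533
for _ in range(4):
    # a 16-bit counter silently wraps modulo 65536
    hits = (hits + 2) % 65536
    print hits
# prints 65535, 1, 3, 5 -- the same jump seen in the session statistics above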

I found confirmation in this post by Christian, so this is not fixed yet in 12.1.0.2.

As I said earlier, with 12c you can now override your application’s OCI parameters by using a file named oraaccess.xml.

This XML file needs to be in your TNS_ADMIN directory to be taken into account.

You can find a sample file at $ORACLE_HOME/rdbms/demo/oraaccess.xml.

So now let’s check if it works with the cx_Oracle module.

You can set a number of client parameters such as the array size, the client result cache and, in our case, the statement cache.

We didn’t implement statement caching in our code, but we will set the client statement cache size to 1.

<oraaccess xmlns="http://xmlns.oracle.com/oci/oraaccess"
           xmlns:oci="http://xmlns.oracle.com/oci/oraaccess"
           schemaLocation="http://xmlns.oracle.com/oci/oraaccess
           http://xmlns.oracle.com/oci/oraaccess.xsd">
  <default_parameters>
    <prefetch>
      <rows>50</rows>
    </prefetch>
    <statement_cache>
      <size>1</size>
    </statement_cache>
    <result_cache>
      <max_rset_rows>100</max_rset_rows>
      <!-- 'K' or 'k' for kilobytes and 'M' or 'm' for megabytes -->
      <max_rset_size>10K</max_rset_size>
      <max_size>64M</max_size>
    </result_cache>
  </default_parameters>
</oraaccess>

So now we run the Python script again and check the result.

session cursor cache hits : 2
session cursor cache count : 6
parse time cpu : 0
parse time elapsed : 0
parse count (total) : 11
parse count (hard) : 0
parse count (failures) : 0
parse count (describe) : 0

As you can see, there is no parse time anymore, as the client now uses its statement cache automatically.

In conclusion, soft parsing is good, but no parsing at all is better, and it can be achieved without touching the code.
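As a side note, if you do control the code, recent cx_Oracle versions also expose the OCI statement cache directly through the stmtcachesize connection attribute, so a similar effect can be obtained programmatically. Here is a minimal sketch of that variant (same placeholder credentials as above):

#!/usr/bin/python2.7
import cx_Oracle

connection = cx_Oracle.connect('test', 'test', 'testdb')
# size the OCI statement cache from the application instead of oraaccess.xml
connection.stmtcachesize = 1
for i in range(0, 1000000):
    cursor = connection.cursor()
    cursor.execute('select 1 FROM dual')   # served from the statement cache
    cursor.close()
connection.close()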

 

Lock by reference


As you may know, not indexing foreign keys can have a huge impact on locking behavior and hurt database scalability (and not only with Oracle).

But there are also some rules of thumb (coming from AskTom) about when not indexing a foreign key is considered reasonable.

So, when do you NOT necessarily need to index a foreign key?

In general when the following conditions are met:

o you do NOT delete from the parent table. (especially with delete cascade — it is a
double whammy)

o you do NOT update the parent table's unique/primary key value.

o you do NOT join from the PARENT to the CHILD (like DEPT->EMP).

This was verified on one of the applications I am working on, until reference partitioning came into play with extreme side effects.

Here is a demo of what can happen when foreign keys are not indexed and you start to use reference partitioning :

First create test1 table:

CREATE TABLE test1
 (
 id number,
 PERIOD_STOP DATE
 )
 PARTITION BY RANGE (PERIOD_STOP)
 (
 PARTITION P2015 VALUES LESS THAN (TO_DATE('12/31/2015 23:01', 'MM/DD/YYYY HH24:MI')),
 PARTITION P2016 VALUES LESS THAN (TO_DATE('12/31/2016 23:01', 'MM/DD/YYYY HH24:MI')),
 PARTITION AA_PMAX VALUES LESS THAN (MAXVALUE)
 );

enable row movement on test1 and create the primary key :

alter table test1 enable row movement;

ALTER TABLE test1 ADD (
 CONSTRAINT PK_test1
 PRIMARY KEY
 (ID))
 ;

now create table test2 partitioned by reference on test1 :

CREATE TABLE test2
 (
 id number not null,
 CONSTRAINT fk_test1
 FOREIGN KEY (ID)
 REFERENCES test1 (ID)
 ENABLE VALIDATE
 )
 PARTITION BY REFERENCE (fk_test1) enable row movement;

Important remark: we don’t create an index on fk_test1 and we don’t populate any child rows in the test2 table.

now let’s insert some data in test1 :

insert into test1 values(1,sysdate);
insert into test1 values(2,sysdate+200);
commit;

now let’s enable 10046 and 10704 traces in order to trace oracle locks:

alter session set events '10046 trace name context forever, level 4 : 10704 trace name context forever, level 3';

Update test1 to force the row to move to another partition:

update test1 set period_stop=sysdate + 200 where id =1;

Check what you have in terms of locks (for more info on that, please check Franck Pachot’s blog post):

First list objects involved in this transaction:

select object_id , to_char(object_id,'0XXXXXXX') , object_name,subobject_name,object_type from all_objects where object_name like 'TEST%';
 OBJECT_ID TO_CHAR(OBJECT_ID,'0XXXXXXX') OBJECT_NAME  SUBOBJECT_NAME       OBJECT_TYPE
---------- ----------------------------- ------------ -------------------- -------------------
     83166 000144DE                      TEST1        AA_PMAX              TABLE PARTITION
     83164 000144DC                      TEST1        P2015                TABLE PARTITION
     83165 000144DD                      TEST1        P2016                TABLE PARTITION
     83163 000144DB                      TEST1                             TABLE
     83171 000144E3                      TEST2        AA_PMAX              TABLE PARTITION
     83169 000144E1                      TEST2        P2015                TABLE PARTITION
     83170 000144E2                      TEST2        P2016                TABLE PARTITION
     83168 000144E0                      TEST2                             TABLE

We have mostly what is described in Franck’s post, except that a reference-partitioned table is created and involved in the row movement.

The lock sequence is the following:

First, Oracle locks table test1 in TM Row Exclusive mode (mode 3):

ksqgtl *** TM-000144DB-00000000-00000000-00000000 mode=3 flags=0x400 timeout=21474836 ***

Then the test2 table is locked in share mode (mode 4), once again due to the non-indexed foreign key:

ksqgtl *** TM-000144E0-00000000-00000000-00000000 mode=4 flags=0x400 timeout=21474836 ***

This lock on test2 is released just afterwards:

ksqrcl: TM-000144E0-00000000-00000000-00000000

Then partition P2015 of test1 is locked in RX mode:

ksqgtl *** TM-000144DC-00000000-00000000-00000000 mode=3 flags=0x400 timeout=21474836 ***

Then a TX lock is acquired to lock the updated row:

ksqgtl *** TX-00010002-00000449-00000000-00000000 mode=6 flags=0x401 timeout=0 ***

Then partition P2016 of test1 is locked (RX) in order to move the row:

ksqgtl *** TM-000144DD-00000000-00000000-00000000 mode=3 flags=0x400 timeout=21474836 ***

And once again a share lock is acquired and released immediately on test2.

Five locks for one operation, but wait, that’s not all:

Checking the 10046 trace, we can see that Oracle executes the following recursive statement:

update /*+ opt_param('_and_pruning_enabled', 'false') */ "SYS"."TEST2" partition (dataobj_to_partition( "SYS"."TEST1" , 83164))
 move to partition (dataobj_to_partition( "SYS"."TEST1" , 83165)) set "ID" = "ID" where "ID" = 1;

That makes sense: Oracle moves the rows of table test2 that depend on the test1 row (id=1) from partition P2015 (object_id=83164) to partition P2016 (object_id=83165).

So if we check the 10704 trace again, we can see that table test2 is locked in TM mode 3:

ksqgtl *** TM-000144E0-00000000-00000000-00000000 mode=3 flags=0x400 timeout=21474836 ***

So now, if we update test1 in another session, even without any row movement (remember that updating test1 with a non-indexed foreign key leads to a share lock request on test2):

update test1 set period_stop=period_stop where id=2;

the update hangs, as it’s not possible to acquire a share lock while a TM Row-X (mode 3) lock is already held.

This update is blocked until the first update is committed or rolled back.
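To reproduce this blocking scenario from a client, a minimal sketch with two cx_Oracle sessions could look like this (the demo tables above and placeholder credentials are assumed; the second update hangs until the first session commits):

import threading
import time
import cx_Oracle

def second_session():
    # because the foreign key is not indexed, this update needs a share lock
    # on TEST2 and blocks until the other session commits or rolls back
    s2 = cx_Oracle.connect('test', 'test', 'testdb')
    c2 = s2.cursor()
    start = time.time()
    c2.execute("update test1 set period_stop = period_stop where id = 2")
    print "second update waited %.1f seconds" % (time.time() - start)
    s2.rollback()

# session 1: update the parent row so that it moves to another partition
s1 = cx_Oracle.connect('test', 'test', 'testdb')
c1 = s1.cursor()
c1.execute("update test1 set period_stop = sysdate + 200 where id = 1")

t = threading.Thread(target=second_session)
t.start()
time.sleep(10)     # keep the first transaction open for a while
s1.commit()        # releasing the locks lets the second session finish
t.join()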

Conversely, if you create an index on test2, you will see the following lock sequence:

First, Oracle locks table TEST1 in TM Row Exclusive mode:

ksqgtl *** TM-000144DB-00000000-00000000-00000000 mode=3 flags=0x400 timeout=21474836 ***

Then table TEST2 is locked in TM Row Exclusive mode:

ksqgtl *** TM-000144E0-00000000-00000000-00000000 mode=3 flags=0x400 timeout=21474836 ***

Then test1 partition P2015 is locked in Row Exclusive mode:

ksqgtl *** TM-000144DC-00000000-00000000-00000000 mode=3 flags=0x400 timeout=21474836 ***

Then a TX lock on the updated row:

ksqgtl *** TX-00010002-00000449-00000000-00000000 mode=6 flags=0x401 timeout=0 ***

And finally test1 partition P2016 in Row Exclusive mode:

ksqgtl *** TM-000144DD-00000000-00000000-00000000 mode=3 flags=0x400 timeout=21474836 ***

No more share lock and no more recursive statement, as we don’t have any child rows in the test2 table.

Inserting millions of time series data (rows) per second into an Oracle database, part 2


In the previous post we demonstrated how the index column order (matching the data arrival pattern or not) can impact insertion performance.

Going a little bit further, we will try to increase the insertion rate again.

One of the available options is to use direct path insert; since release 11.2, the APPEND_VALUES hint is available for standard insert statements.

Again using the two different column orders, we will check the impact of direct path insert.

Let’s create the kpi table again:

SQL> create table kpi (timeid number,insid number,indid number,value number);

And first create the index like in the first example of the previous blog post (not sorted on our data arrival pattern):

SQL> create unique index ix_kpi on kpi (insid,indid,timeid);

Using the same dataset, let’s insert 7 million rows into our table again, this time via direct path insert:

declare
 TYPE kpi_array is TABLE OF kpi%ROWTYPE;
 l_kpi_array kpi_array:=kpi_array ();
 start_time number;
 elapsed number;
 begin
 for time in 1 .. 100
 loop
 for indid in 1 .. 7
 loop
 for insid in 1 .. 10000
 loop
 l_kpi_array.extend();
 l_kpi_array(l_kpi_array.last).timeid:=time;
 l_kpi_array(l_kpi_array.last).indid:=indid;
 l_kpi_array(l_kpi_array.last).insid:=insid;
 l_kpi_array(l_kpi_array.last).value:=dbms_random.value(1,1000);
 end loop;
 end loop;
 end loop;
 start_time:=DBMS_UTILITY.GET_TIME ;
 forall p in l_kpi_array.first .. l_kpi_array.last
 insert /*+ APPEND_VALUES */ into kpi
 values l_kpi_array(p);
 elapsed := DBMS_UTILITY.GET_TIME - start_time;
 dbms_output.put_line('elapsed time (seconds)=' || elapsed/100 );
end;
 /

elapsed time (seconds)=31.41

After executing this script, we can see that it takes 31 seconds to insert 7 million rows; for comparison, the exact same test without direct path insert completed in 56 seconds. Let’s have a look at the session statistics:

select
       name, value
from    v$mystat,v$statname
where
       v$mystat.statistic#=v$statname.statistic#
and name in ('leaf node splits','leaf node 90-10 splits','redo size','redo entries','session logical reads','physical reads',
'db block changes','db block gets','physical reads direct temporary tablespace','physical writes direct temporary tablespace')
;

NAME       VALUE
---------------------------------------------------------------- ----------
session logical reads     506672
db block gets     470582
physical reads       87041
db block changes     466778
physical reads direct temporary tablespace       87038
physical writes direct temporary tablespace       87038
redo entries     235452
redo size   550084680
leaf node splits       17526
leaf node 90-10 splits       17526

You can notice that the index leaf node splits are now all 90-10 splits, as opposed to the same example without direct path insert, where we had mainly 50-50 splits:

leaf node splits      17050
leaf node 90-10 splits 32

For comparison purposes, here are the statistics for the former test (without direct path insert):

NAME VALUE
---------------------------------------------------------------- ----------
session logical reads 15684333
db block gets         15544039
physical reads        413
db block changes      10610256
redo entries          5363399
redo size             1922236380
leaf node splits      17050
leaf node 90-10 splits 32

Why this difference between direct path and conventional insert?

Because when direct path insert is used the index maintenance is deferred and merged at the end of the process.

That means that Oracle has to sort the entire array before updating the index, and it can then also take advantage of array processing to maintain the index structure.

This explains why the redo entries and the session logical reads have decreased so much compared to the former test (without direct path insert).

Another difference from the former test is that we now have some direct physical reads/writes to the temporary tablespace.

The work area is too small to process the entire sort in memory, so Oracle has to spill sort operations to disk.

This can be verified by checking the insert statement statistics values in v$sql_workarea :

select sql_id,child_number,operation_type,policy,last_execution,optimal_executions,multipasses_executions from v$sql_workarea where sql_id='10qx37pz299jw' and child_number=0;

SQL_ID        CHILD_NUMBER OPERATION_TYPE        POLICY LAST_EXECUTION OPTIMAL_EXECUTIONS MULTIPASSES_EXECUTIONS
------------- ------------ --------------------- ------ -------------- ------------------ ----------------------
10qx37pz299jw            0 IDX MAINTENANCE(SORT) AUTO   13 PASSES                       0                      1
10qx37pz299jw            0 LOAD WRITE BUFFERS    AUTO   OPTIMAL                         1                      0

The sort was done using multipass execution (13 passes).

In order to perform the entire sort in memory, we can set the work area size policy to manual and set sort_area_size large enough.

alter session set workarea_size_policy=manual;
alter session set sort_area_size=800000000;

Let’s restart the same test with the modified settings.

It now takes 23.38 seconds to insert the 7 million rows.

Let’s check sorting statistics again:

select sql_id,child_number,operation_type,policy,last_execution,optimal_executions,multipasses_executions from v$sql_workarea where sql_id='10qx37pz299jw' and child_number=1;

SQL_ID        CHILD_NUMBER OPERATION_TYPE        POLICY LAST_EXECUTION OPTIMAL_EXECUTIONS MULTIPASSES_EXECUTIONS
------------- ------------ --------------------- ------ -------------- ------------------ ----------------------
10qx37pz299jw            1 IDX MAINTENANCE(SORT) MANUAL OPTIMAL                         1                      0
10qx37pz299jw            1 LOAD WRITE BUFFERS    MANUAL OPTIMAL                         1                      0

The execution time has decreased and all the sort operations are now performed in memory.

Now let’s run the same test again with the index column order matching the data arrival pattern (see the previous post, where without direct path insert we took advantage of array processing) and check if there is still a difference:

SQL> create unique index ix_kpi on kpi (timeid,indid,insid);
alter session set workarea_size_policy=manual;
alter session set sort_area_size=800000000;
declare
TYPE kpi_array is TABLE OF kpi%ROWTYPE;
l_kpi_array kpi_array:=kpi_array ();
start_time number;
elapsed number;
begin
for time in 1 .. 100
loop
for indid in 1 .. 7
loop
for insid in 1 .. 10000
loop
l_kpi_array.extend();
l_kpi_array(l_kpi_array.last).timeid:=time;
l_kpi_array(l_kpi_array.last).indid:=indid;
l_kpi_array(l_kpi_array.last).insid:=insid;
l_kpi_array(l_kpi_array.last).value:=dbms_random.value(1,1000);
end loop;
end loop;
end loop;
start_time:=DBMS_UTILITY.GET_TIME ;
forall p in l_kpi_array.first .. l_kpi_array.last
insert /*+ APPEND_VALUES */ into kpi
values l_kpi_array(p);
elapsed := DBMS_UTILITY.GET_TIME - start_time;
dbms_output.put_line('elapsed time (seconds)=' || elapsed/100 );
commit;
end;
/

elapsed time (seconds)=13.9

As you can see, the index column order still has an impact on insertion performance.

Let’s compare the event 10032 traces (sort statistics) for the two different column orders:

first index (insid,indid,timeid) 

---- Sort Parameters ------------------------------
sort_area_size                            800000000
sort_area_retained_size                   800000000
sort_multiblock_read_count                2
max intermediate merge width              21982

*** 2015-07-07 14:53:01.341
---- Sort Statistics ------------------------------
Input records                             7000000
Output records                            7000000
Total number of comparisons performed     150334451
  Comparisons performed by in-memory sort 150334451
Total amount of memory used               524142592
Uses version 1 sort
---- End of Sort Statistics -----------------------
 

second index (timeid,indid,insid)

---- Sort Parameters ------------------------------
sort_area_size                            800000000
sort_area_retained_size                   800000000
sort_multiblock_read_count                2
max intermediate merge width              21982

*** 2015-07-07 15:05:19.599
---- Sort Statistics ------------------------------
Input records                             7000000
Output records                            7000000
Total number of comparisons performed     6999999
  Comparisons performed by in-memory sort 6999999
Total amount of memory used               459501568
Uses version 1 sort
---- End of Sort Statistics -----------------------
 

You can see that the algorithm used for sorting is “version 1 sort”; as explained in this Jonathan Lewis blog post, this algorithm is a “binary insertion tree”, so Oracle builds a binary tree as it reads the data from memory.

Oracle has to compare each row against the current binary tree structure. As the data in the second example is already sorted in the same order as the index, only 6999999 comparisons (7 million - 1) were performed, compared to the 150334451 comparisons of the previous example.
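To get a feel for this effect, here is a small illustration using Python’s built-in sort with a comparison counter; Timsort is of course not Oracle’s “version 1” sort, but it shows the same behaviour: already-ordered input costs about n-1 comparisons, while unordered input costs on the order of n*log2(n).

import random

class CountedKey(object):
    """wraps a key and counts how many comparisons the sort performs"""
    comparisons = 0
    def __init__(self, key):
        self.key = key
    def __lt__(self, other):
        CountedKey.comparisons += 1
        return self.key < other.key

def count_sort_comparisons(keys):
    CountedKey.comparisons = 0
    sorted(CountedKey(k) for k in keys)
    return CountedKey.comparisons

n = 1000000
print count_sort_comparisons(range(n))      # data already in index order: ~ n - 1
shuffled = list(range(n))
random.shuffle(shuffled)
print count_sort_comparisons(shuffled)      # unordered data: ~ n * log2(n)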

This sorting cost can also be verified by checking the 'CPU used by this session' statistic; please take into account that this statistic includes the creation of the nested table and the insert statement (all other statistics were almost the same):

first index column order :       5432 cs
second index column order :      1891 cs

Conclusion :

When using direct path insert, the index maintenance is deferred, so Oracle is able to process the index operations by array even if the data pattern is not sorted in the same way as our index column order.

We have also demonstrated that the sorting algorithm still takes advantage of a sorted data pattern and a matching index column order, even in direct path insert mode.

Inserting millions of time series data (rows) per second into an Oracle database


This blog post is the first of a series of 4 related to time series and indexing.

According to wikipedia this is the definition of a time series:

A time series is a sequence of data points, typically consisting of successive measurements made over a time interval.

This blog post shows how we can efficiently store time series in an Oracle database and how the indexing strategy can have a big impact on the insertion rate.

The time series data used in this example are performance metrics coming from a network performance monitoring tool.

We will use three components to identify our time series data point:

• a timestamp (timeid) which will be based on epoch time (and start from 70s to simplify the demonstration)
• a performance indicator identifier (indid) that represents a performance metric (load average, cpu usage,…)
• a network object identifier (insid) (router1,…)

and finally, we will store the collected metric value.

The key point to have in mind is that our data arrival pattern is sorted based on the time (timeid).
That means that for the current timestamp, we will have to store all the metrics of the entire network.

In terms of Oracle modeling, we will create a table called “KPI” to store that kind of information.

This is a pretty simple structure that is in fact a key/value model.

SQL> create table kpi (timeid number,insid number,indid number,value number);

Now comes the question: which indexing strategy should we use? As we will store billions of rows in this table, we will create a composite index on the timeid, insid and indid columns in order to retrieve and graph our performance metric values over time.

But which order is optimal and what is the impact on the insertion rate and reporting queries?

As the goal is to store and graph the time series data of a specific network device and a specific performance metric over time, this is what the SQL reporting queries look like:

select value from KPI where indid='CPU %' and insid='router 1' and timeid between 1 and 51;

When creating a composite index, the columns are sorted by the order mentioned in the index definition.

So considering our reporting SQL query, the best access path is going to be an index range scan, and the most obvious implementation for our index is the following:

SQL> create unique index ix_kpi on kpi (insid,indid,timeid);

This will keep the requested index entries “clustered” together and reduce the I/O required for the index range scan operation.

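To visualise why this column order groups a whole series together, here is a small stand-alone illustration using plain Python tuples as a stand-in for composite index keys (just a model of the ordering, not of the B-tree itself):

# our arrival pattern: all metrics of the whole network for each timestamp
samples = [(timeid, indid, insid)
           for timeid in range(1, 4)      # 3 collection timestamps
           for indid in range(1, 3)       # 2 performance metrics
           for insid in range(1, 3)]      # 2 network devices

# a composite index on (insid, indid, timeid) orders entries like sorted tuples
index_order = sorted((insid, indid, timeid) for timeid, indid, insid in samples)
for key in index_order:
    print key
# all (insid=1, indid=1, *) entries are adjacent: one tight index range scan
# answers "metric 1 for device 1 between two timestamps"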

While this index seems optimal for our reporting query, what is the index impact on our insertion performance rate ?

Let’s perform a quick insertion of 7 million rows into this table using bulk loading.

declare
TYPE kpi_array is TABLE OF kpi%ROWTYPE;
l_kpi_array kpi_array:=kpi_array ();
start_time number;
elapsed number;
begin
for time in 1 .. 100
loop
for indid in 1 .. 7
loop
for insid in 1 .. 10000
loop
l_kpi_array.extend();
l_kpi_array(l_kpi_array.last).timeid:=time;
l_kpi_array(l_kpi_array.last).indid:=indid;
l_kpi_array(l_kpi_array.last).insid:=insid;
l_kpi_array(l_kpi_array.last).value:=dbms_random.value(1,1000);
end loop;
end loop;
end loop;
start_time:=DBMS_UTILITY.GET_TIME ;
forall p in l_kpi_array.first .. l_kpi_array.last
insert into kpi
values l_kpi_array(p);
elapsed := DBMS_UTILITY.GET_TIME - start_time;
dbms_output.put_line('elapsed time (seconds)=' || elapsed/100 );

end;
/

Notice that the data is sorted by time, because this is how it is gathered from our network.

elapsed time (seconds)=55.6

After executing this script, we can see that it takes 55 seconds to insert 7 million rows, so around 127,000 rows per second.

Now let’s check some of our session statistics values :

select
name, value
from v$mystat,v$statname
where
v$mystat.statistic#=v$statname.statistic#
and name in ('leaf node splits','leaf node 90-10 splits','redo size','redo entries','session logical reads','physical reads','db block changes','db block gets')
;

NAME VALUE
---------------------------------------------------------------- ----------
session logical reads 15684333
db block gets         15544039
physical reads        413
db block changes      10610256
redo entries          5363399
redo size             1922236380
leaf node splits      17050
leaf node 90-10 splits 32

We can observe the big amount of “session logical reads” and redo entries generated. The index maintenance operations generated a lot of “50-50” leaf node splits as well.

As our KPI data arrives sorted in a different order than the index structure, Oracle has to maintain the B-tree in an inefficient way.

For each row inserted, Oracle has to traverse the whole index structure down to the right leaf block.


Dumping the redo log, we can see that the change vectors generated for the leaf block operations (“insert leaf row”) are not batched; this is clearly visible in change vector 2 (REDO: SINGLE operation):

REDO RECORD - Thread:1 RBA: 0x0007c8.00000002.0010 LEN: 0x011c VLD: 0x05
SCN: 0x0000.030f4fcb SUBSCN: 1 12/06/2014 15:51:59
(LWN RBA: 0x0007c8.00000002.0010 LEN: 11398 NST: 0001 SCN: 0x0000.030f4fcb)
CHANGE #1 TYP:0 CLS:34 AFN:3 DBA:0x00c96222 OBJ:4294967295 SCN:0x0000.030f4fc8 SEQ:58 OP:5.1 ENC:0 RBL:0
ktudb redo: siz: 84 spc: 3248 flg: 0x0022 seq: 0x0fb9 rec: 0x3a
xid: 0x0009.009.00004bef
ktubu redo: slt: 9 rci: 57 opc: 10.22 objn: 307081 objd: 307081 tsn: 12
Undo type: Regular undo Undo type: Last buffer split: No
Tablespace Undo: No
0x00000000
index undo for leaf key operations
KTB Redo
op: 0x02 ver: 0x01
compat bit: 4 (post-11) padding: 1
op: C uba: 0x00c961af.0fb9.03
Dump kdilk : itl=2, kdxlkflg=0x41 sdc=0 indexid=0x300008a block=0x0300416c
(kdxlpu): purge leaf row
key :(10): 03 c2 38 02 02 c1 06 02 c1 52
CHANGE #2 TYP:0 CLS:1 AFN:12 DBA:0x0300416c OBJ:307081 SCN:0x0000.030f4fc2 SEQ:1 OP:10.2 ENC:0 RBL:0
index redo (kdxlin): insert leaf row
KTB Redo
op: 0x02 ver: 0x01
compat bit: 4 (post-11) padding: 1
op: C uba: 0x00c96222.0fb9.3a
REDO: SINGLE / -- / --
itl: 2, sno: 230, row size 13
insert key: (10): 03 c2 38 02 02 c1 06 02 c1 52
keydata: (6): 03 00 a4 b7 00 43
prefix exists on block, sno: 2 len: 7

Now let’s perform the same insertion test but with the following index structure on the KPI table:

SQL> create unique index ix_kpi on kpi (timeid,indid,insid);

and insert the same 7-million-row dataset again:

elapsed time (seconds)=17.8

Now it takes only 18 seconds for 7 million rows, so around 390,000 rows per second.

Let’s check again our session statistics :

NAME VALUE
---------------------------------------------------------------- ----------
session logical reads 809367
db block gets         711929
physical reads        40
db block changes      601218
redo entries          363907
redo size             823523804
leaf node splits      12273
leaf node 90-10 splits 12273

You can see that the number of logical reads and redo entries decreased dramatically.

The index is now what is called a “right-hand” index, as the timeid value is always increasing.

All index entries are “clustered” together in the same block, on the right-hand side of the B-tree.


This is also the reason why all leaf node splits are now only 90-10 (no more 50-50).
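As a side illustration of the 90-10 versus 50-50 behaviour, here is a toy model of leaf blocks with a fixed capacity: when a full right-most leaf receives the new highest key, a fresh leaf is started (90-10), otherwise the leaf is split in half (50-50). This is only a simplified model of the idea, not Oracle’s actual split algorithm.

import bisect
import random

CAPACITY = 100     # toy leaf block capacity (entries per leaf)

def leaf_splits(keys):
    """returns (number of 90-10 splits, number of 50-50 splits)"""
    leaves = [[]]                  # ordered list of leaves, each a sorted list
    n9010 = n5050 = 0
    for key in keys:
        # route the key to the first leaf whose highest entry is >= key
        pos = len(leaves) - 1
        for i, leaf in enumerate(leaves):
            if leaf and key <= leaf[-1]:
                pos = i
                break
        leaf = leaves[pos]
        bisect.insort(leaf, key)
        if len(leaf) > CAPACITY:
            if pos == len(leaves) - 1 and key == leaf[-1]:
                # new right-most key into the right-most leaf: 90-10 split
                leaves.append([leaf.pop()])
                n9010 += 1
            else:
                # anywhere else: split the leaf in half, 50-50
                leaves.insert(pos + 1, leaf[CAPACITY // 2:])
                del leaf[CAPACITY // 2:]
                n5050 += 1
    return n9010, n5050

n = 20000
print leaf_splits(list(range(n)))     # always-increasing keys: only 90-10 splits
shuffled = list(range(n))
random.shuffle(shuffled)
print leaf_splits(shuffled)           # random keys: mostly 50-50 splits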

If we check the redo log records again, we can observe that the “insert leaf row” operations are now batched and performed by array (REDO: ARRAY):

REDO RECORD - Thread:1 RBA: 0x0007ca.00000102.015c LEN: 0x2530 VLD: 0x01
SCN: 0x0000.030f6b8d SUBSCN: 1 12/06/2014 15:58:25
CHANGE #1 TYP:0 CLS:25 AFN:3 DBA:0x00c000c0 OBJ:4294967295 SCN:0x0000.030f6b8d SEQ:1 OP:5.2 ENC:0 RBL:0
ktudh redo: slt: 0x0019 sqn: 0x00000000 flg: 0x000a siz: 2832 fbi: 255
uba: 0x00c8519f.0e15.01 pxid: 0x0000.000.00000000
CHANGE #2 TYP:1 CLS:26 AFN:3 DBA:0x00c8519f OBJ:4294967295 SCN:0x0000.030f6b8d SEQ:1 OP:5.1 ENC:0 RBL:0
ktudb redo: siz: 2832 spc: 832 flg: 0x000a seq: 0x0e15 rec: 0x01
xid: 0x0005.019.00004bbe
ktubu redo: slt: 25 rci: 0 opc: 10.22 objn: 307083 objd: 307083 tsn: 12
Undo type: Regular undo Undo type: Last buffer split: No
Tablespace Undo: No
0x00c8519e
index undo for leaf key operations
KTB Redo
op: 0x02 ver: 0x01
compat bit: 4 (post-11) padding: 1
op: C uba: 0x00c8519e.0e15.33
Dump kdilk : itl=2, kdxlkflg=0x61 sdc=0 indexid=0x300008a block=0x0300008b
(kdxlpu): purge leaf row
number of keys: 235
key sizes:
9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10

…….
CHANGE #3 TYP:0 CLS:1 AFN:12 DBA:0x0300008b OBJ:307083 SCN:0x0000.030f6b8d SEQ:1 OP:10.2 ENC:0 RBL:0
index redo (kdxlin): insert leaf row
KTB Redo
op: 0x02 ver: 0x01
compat bit: 4 (post-11) padding: 1
op: C uba: 0x00c8519f.0e15.01
REDO: ARRAY / -- / --
itl: 2, sno: 1, row size 3190
number of keys: 235
slots:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156
157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220
221 222 223 224 225 226 227 228 229 230 231 232 233 234 235
insert key: (2250):
02 c1 02 02 c1 02 02 c1 03 02 c1 02 02 c1 02 02 c1 04 02 c1 02 02 c1 02 02
c1 05 02 c1 02 02 c1 02 02 c1 06 02 c1 02 02 c1 02 02 c1 07 02 c1 02 02 c1
02 02 c1 08 02 c1 02 02 c1 02 02 c1 09 02 c1 02 02 c1 02 02 c1 0a 02 c1 02
02 c1 02 02 c1 0b 02 c1 02 02 c1 02 02 c1 0c 02 c1 02 02 c1 02 02 c1 0d 02
c1 02 02 c1 02 02 c1 0e 02 c1 02 02 c1 02 02 c1 0f 02 c1 02 02 c1 02 02 c1
10 02 c1 02 02 c1 02 02 c1 11

Conclusion :

For a composite index, the column order has an impact on the insertion rate, especially when your data arrives in a specific pattern.

In the next part we will continue to explore the insertion rate and demonstrate that it’s still possible to increase its performance.

I would like to thank Bertrand Drouvot for kindly accepting to review this post.

Building a NUMA beast (or how to test Oracle behaviour on a NUMA system without having NUMA)


I just wanted to share a little trick to emulate a NUMA architecture on non-NUMA hardware.

The Linux kernel is able to simulate a NUMA architecture if it is compiled with the following flag

CONFIG_NUMA_EMU=y

and of course if

CONFIG_NUMA=y

By default, at least with the UEK kernel, this flag is not set.

So let’s compile the kernel with this flag set to y.

First, we need these packages installed on the system:

yum groupinstall "Development Tools"
yum install ncurses-devel

Download the Linux kernel from https://www.kernel.org/pub/linux/kernel

In the following example I used version 3.8.13.

wget https://www.kernel.org/pub/linux/kernel/v3.x/linux-3.8.13.tar.xz
tar -Jxvf linux-3.8.13.tar.xz -C /usr/src/
cd /usr/src/linux-3.8.13/

Copy your kernel config file into the directory containing the kernel sources, then edit this file and set CONFIG_NUMA_EMU=y:

cp /boot/config-3.8.13-55.1.5.el6uek.x86_64 /usr/src/linux-3.8.13/

Compile the kernel and its modules and install them (and wait, wait, wait…):

make
make modules
make modules_install
make install

Now edit your grub.conf file and fake 40 NUMA nodes:

kernel /vmlinuz-3.8.13 ro root=/dev/mapper/vg_oracle12-lv_root rd_NO_LUKS rd_NO_MD rd_LVM_LV=vg_oracle12/lv_swap LANG=fr_FR.UTF-8 
SYSFONT=latarcyrheb-sun16 crashkernel=auto rd_LVM_LV=vg_oracle12/lv_root KEYBOARDTYPE=pc KEYTABLE=fr-latin9 rd_NO_DM rhgb quiet numa=on 
numa=fake=40 transparent_hugepage=never
initrd /initramfs-3.8.13.img

Reboot the system and enjoy your NUMA beast:

numactl --hardware
available: 39 nodes (0-38)
node 0 cpus: 0
node 0 size: 95 MB
node 0 free: 0 MB
node 1 cpus: 1
node 1 size: 96 MB
node 1 free: 0 MB
node 2 cpus:
node 2 size: 96 MB
node 2 free: 76 MB
node 3 cpus:
node 3 size: 96 MB
node 3 free: 93 MB
node 4 cpus:
node 4 size: 96 MB
node 4 free: 90 MB
node 5 cpus:
node 5 size: 96 MB
node 5 free: 90 MB
node 6 cpus:
node 6 size: 96 MB
node 6 free: 90 MB
node 7 cpus:
node 7 size: 96 MB
node 7 free: 93 MB
node 8 cpus:
node 8 size: 96 MB
node 8 free: 90 MB
node 9 cpus:
node 9 size: 96 MB
node 9 free: 90 MB
node 10 cpus:
node 10 size: 96 MB
node 10 free: 90 MB
node 11 cpus:
node 11 size: 96 MB
node 11 free: 93 MB
node 12 cpus:
node 12 size: 96 MB
node 12 free: 90 MB
node 13 cpus:
node 13 size: 96 MB
node 13 free: 26 MB
node 14 cpus:
node 14 size: 64 MB
node 14 free: 61 MB
node 15 cpus:
node 15 size: 64 MB
node 15 free: 58 MB
node 16 cpus:
node 16 size: 64 MB
node 16 free: 61 MB
node 17 cpus:
node 17 size: 64 MB
node 17 free: 58 MB
node 18 cpus:
node 18 size: 64 MB
node 18 free: 61 MB
node 19 cpus:
node 19 size: 64 MB
node 19 free: 58 MB
node 20 cpus:
node 20 size: 64 MB
node 20 free: 61 MB
node 21 cpus:
node 21 size: 64 MB
node 21 free: 58 MB
node 22 cpus:
node 22 size: 64 MB
node 22 free: 61 MB
node 23 cpus:
node 23 size: 64 MB
node 23 free: 58 MB
node 24 cpus:
node 24 size: 64 MB
node 24 free: 61 MB
node 25 cpus:
node 25 size: 64 MB
node 25 free: 58 MB
node 26 cpus:
node 26 size: 64 MB
node 26 free: 61 MB
node 27 cpus:
node 27 size: 64 MB
node 27 free: 58 MB
node 28 cpus:
node 28 size: 64 MB
node 28 free: 61 MB
node 29 cpus:
node 29 size: 64 MB
node 29 free: 58 MB
node 30 cpus:
node 30 size: 64 MB
node 30 free: 61 MB
node 31 cpus:
node 31 size: 64 MB
node 31 free: 58 MB
node 32 cpus:
node 32 size: 64 MB
node 32 free: 61 MB
node 33 cpus:
node 33 size: 64 MB
node 33 free: 58 MB
node 34 cpus:
node 34 size: 64 MB
node 34 free: 61 MB
node 35 cpus:
node 35 size: 64 MB
node 35 free: 58 MB
node 36 cpus:
node 36 size: 95 MB
node 36 free: 90 MB
node 37 cpus:
node 37 size: 64 MB
node 37 free: 61 MB
node 38 cpus:
node 38 size: 120 MB
node 38 free: 114 MB
node distances:
node 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
0: 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
1: 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
2: 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
3: 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
4: 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
5: 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
6: 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
7: 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
8: 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
9: 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
10: 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
11: 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
12: 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
13: 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
14: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
15: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
16: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
17: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
18: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
19: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
20: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
21: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
22: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
23: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
24: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20
25: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20
26: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20
27: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20
28: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20
29: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20
30: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20
31: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20
32: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20
33: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20
34: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20
35: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20
36: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20
37: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20
38: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10

2-hop NUMA: where is our last “hop”?


Following Kevin Closson’s comments on my first blog post, I dug a little deeper into the NUMA architecture of the specific box I have access to.

My box comes with Intel E5-4627 processors and, as mentioned by Kevin, this is a 2-hop NUMA architecture.

As a reminder, this is the NUMA configuration of this box:

numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32733 MB
node 0 free: 29452 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32768 MB
node 1 free: 29803 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 32768 MB
node 2 free: 29796 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 32767 MB
node 3 free: 29771 MB
node distances:
node   0   1   2   3
0:  10  21  21  21
1:  21  10  21  21
2:  21  21  10  21
3:  21  21  21  10

We have 4 nodes (0-3), each node having 8 cores (cpus 0 to 31 in total) and 32 GB of local memory.

The purpose of this first test is to measure the memory access penalty of a 1-hop remote access, plus the extra penalty of a 2-hop access (even if it’s not visible in the NUMA node distances output).

Then we will see how to find which NUMA node is introducing the 2-hop penalty.

SLOB will be used in my next post in order to test the NUMA impact with an Oracle workload.

For this very basic test, I used lmbench (http://pkgs.repoforge.org/lmbench/).

lmbench is a suite of micro-benchmarking tools, and one of them (lat_mem_rd) will be used to measure memory latency.

So let’s measure the latency across the NUMA nodes.

But first, here is a description of lat_mem_rd usage from the Intel document “Measuring Cache and Memory Latency and CPU to Memory Bandwidth”:

lat_mem_rd [depth] [stride] 

The [depth] specification indicates how far into memory the utility will measure. 
In order to ensure an accurate measurement, specify an amount that will go 
far enough beyond the cache so that it does not factor in latency 
measurements. 
Finally, [stride] is the skipped amount of memory before the 
next access. If [stride] is not large enough, modern processors have the 
ability to prefetch the data, thus providing artificial latencies for system 
memory region. If a stride it too large, the utility will not report correct 
latencies as it will be skipping past measured intervals. The utility will default 
to 128 bytes when [stride] is not specified. Refer to Figure 4 for results 
graph. Binding to the first core and accessing only up to 256M of RAM, the 
command line would looks as follows: 
./lat_mem_rd –N 1 –P 1 256M 512

For the memory latency measurement on the local NUMA node 0, we will use a 20 MB array with a stride size of 256 bytes.

The first column is the memory depth accessed (in MB).

The second column is the measured memory latency (in nanoseconds).

numactl --membind=0 --cpunodebind=0 ./lat_mem_rd 20 256
"stride=256
depth(MB)  Latency(ns)
0.00049 1.117
0.00098 1.117
.....
0.02344 1.117
0.02539 1.117
0.02734 1.117
0.02930 1.117
0.03125 1.117
0.03516 3.350
0.03906 3.351
0.04297 3.351
0.04688 3.350
0.05078 2.980
0.05469 2.942
0.05859 3.114
0.06250 3.351
0.07031 3.350
.....
0.10938 3.352
0.11719 3.234
0.12500 3.352
0.14062 3.350
0.15625 3.352
0.17188 3.352
0.18750 3.354
0.20312 5.017
0.21875 5.868
0.23438 5.704
0.25000 7.155
0.28125 7.867
0.31250 10.289
0.34375 10.267
......
6.50000 10.843
7.00000 10.843
7.50000 10.841
8.00000 10.841
9.00000 10.838
10.00000 10.839
11.00000 10.949
12.00000 11.139
13.00000 11.409
14.00000 12.978
15.00000 18.131
16.00000 29.493
18.00000 52.400
20.00000 53.660

One of the interesting parts is that as the memory read accesses move from the L1 cache (32 KB) to L2 (256 KB) and L3 (16 MB), we can see the impact on read latency, and finally the main memory latency (53 ns in this case).

Now let’s check the latency when accessing memory on the distant nodes (1, 2, 3), still binding the CPU to node 0, and note the maximum latency for this 20 MB array:

numactl --membind=1 --cpunodebind=0 ./lat_mem_rd 20 256
depth(MB)  Latency(ns)
20.00000 211.483
numactl --membind=2 --cpunodebind=0 ./lat_mem_rd 20 256
depth(MB)  Latency(ns)
20.00000 217.169
numactl --membind=3 --cpunodebind=0 ./lat_mem_rd 20 256
depth(MB)  Latency(ns)
20.00000 248.341

As you can see, we observe extra latency for NUMA node 3 compared to the other distant nodes (around 14%), so it must be the NUMA node causing the extra hop.
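The 14% figure comes straight from the maximum latencies measured above; a quick check (values in ns, taken from the runs above):

local, node1, node2, node3 = 53.660, 211.483, 217.169, 248.341
print "node 3 vs the next-worst distant node: +%.0f%%" % ((node3 / node2 - 1) * 100)
print "remote vs local penalty: x%.1f to x%.1f" % (node1 / local, node3 / local)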

We can conclude that in our NUMA configuration, 2 hops are needed from node 0 to access memory on NUMA node 3.


In the next post we will “SLOB” all these NUMA nodes and check the impact of the extra hop on Oracle.

NUMA, interleaved memory and Oracle


All the following tests have been executed on a 11.2.0.4 version.

Following some posts about NUMA on Bertrand Drouvot’s website (cpu binding (processor_group_name) vs Instance caging comparison during LIO pressure, and Measure the impact of remote versus local NUMA node access thanks to processor_group_name),

some input from Martin Bach’s book (http://goo.gl/lfdkli) and his Linux large pages and non-uniform memory distribution post,

and Kevin Closson’s posts (Oracle on Opteron with Linux-The NUMA Angle (Part VI). Introducing Cyclops. and Oracle11g Automatic Memory Management – Part III. A NUMA Issue.),

I was a little bit confused about how Oracle manages memory on a NUMA system when NUMA support is disabled at the Oracle level (i.e. _enable_NUMA_support is set to FALSE) but enabled at the system level.

By default on a Linux system, each process you execute follows the NUMA policy defined at the OS level, which by default allocates memory on the local (current) node:

numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
cpubind: 0 1 2 3
nodebind: 0 1 2 3
membind: 0 1 2 3

So I decided to test the behavior of Oracle in different test cases (using an SGA of 20 GB).

My platform has the following NUMA configuration: 4 NUMA nodes with 32 GB of memory each.

First test :

Start the instance with huge pages and ASMM :

Let’s have a look at the memory allocation after startup:

numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32733 MB
node 0 free: 29452 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32768 MB
node 1 free: 29803 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 32768 MB
node 2 free: 29796 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 32767 MB
node 3 free: 29771 MB
node distances:
node   0   1   2   3
0:  10  21  21  21
1:  21  10  21  21
2:  21  21  10  21
3:  21  21  21  10

As you can see, the memory is distributed across the NUMA nodes even though this is not the default policy at the OS level.

So Oracle forces the memory to be interleaved across the NUMA nodes, and this can be seen via strace when starting the instance.

The following system call is executed :

mbind(1744830464, 21340618752, MPOL_INTERLEAVE, 0xd863da0, 513, 0) = 0

The description of this system call is the following :

mbind() sets the NUMA memory policy, which consists of a policy mode
       and zero or more nodes, for the memory range starting with addr and
       continuing for len bytes.  The memory policy defines from which node
       memory is allocated.

So it looks like Oracle sets the memory policy to MPOL_INTERLEAVE for the SGA:

The MPOL_INTERLEAVE mode specifies that page allocations has to be interleaved across the set of nodes specified in nodemask. This optimizes for bandwidth instead of latency by spreading out pages and memory accesses to those pages across multiple nodes. To be effective the memory area should be fairly large, at least 1MB or bigger with a fairly uniform access pattern. Accesses to a single page of the area will still be limited to the memory bandwidth of a single node.
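One way to double-check where a process’s pages actually live, instead of inferring it from the free-memory deltas in numactl --hardware, is to parse /proc/&lt;pid&gt;/numa_maps. Here is a rough sketch (the pid is a placeholder, for instance an Oracle background process, and the 4 kB fallback page size is an assumption for lines that don’t carry a kernelpagesize_kB token):

import re
import sys
from collections import defaultdict

def mb_per_node(pid):
    """sums the N<node>=<pages> counters found in /proc/<pid>/numa_maps"""
    per_node = defaultdict(float)
    with open('/proc/%s/numa_maps' % pid) as f:
        for line in f:
            m = re.search(r'kernelpagesize_kB=(\d+)', line)
            page_kb = int(m.group(1)) if m else 4   # assume 4 kB pages otherwise
            for node, pages in re.findall(r'\bN(\d+)=(\d+)', line):
                per_node[int(node)] += int(pages) * page_kb / 1024.0
    return per_node

if __name__ == '__main__':
    # usage: pass the pid of a background process such as smon
    for node, mb in sorted(mb_per_node(sys.argv[1]).items()):
        print 'node %d: %.0f MB' % (node, mb)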

Second test :

Start the instance without huge pages and with ASMM :

And again, let’s have a look at the memory allocation after startup:

numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32733 MB
node 0 free: 26394 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32768 MB
node 1 free: 26272 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 32768 MB
node 2 free: 26619 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 32767 MB
node 3 free: 26347 MB
node distances:
node   0   1   2   3
0:  10  21  21  21
1:  21  10  21  21
2:  21  21  10  21
3:  21  21  21  10

Again the memory is interleaved across the nodes,

and again we can see the same mbind system calls.

Third test :

Start the instance with AMM :

Once again, the memory is interleaved across the nodes,

and again, the same mbind system calls.

So the conclusion is that in each configuration Oracle allocates the SGA using the interleave memory policy.

So is it possible to switch Oracle back to the default OS NUMA policy? It looks like it is, thanks to the “_enable_NUMA_interleave” hidden parameter, which is set to TRUE by default (and thanks to Bertrand for pointing me to that parameter).

Let’s test this (I tested it only on the first configuration, with huge pages in place):

alter system set "_enable_NUMA_interleave"=false scope=spfile;

Now restart the instance and look at the memory allocation again:

numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32733 MB
node 0 free: 28842 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32768 MB
node 1 free: 29091 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 32768 MB
node 2 free: 29580 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 32767 MB
node 3 free: 18022 MB
node distances:
node   0   1   2   3
0:  10  21  21  21
1:  21  10  21  21
2:  21  21  10  21
3:  21  21  21  10

So Oracle switched back to the default OS NUMA policy (look at node 3) and there is no trace of mbind system calls anymore.

To conclude, it seems that, at least in this version, the SGA memory is interleaved across the nodes by default in every Oracle memory management mode.

But it’s still possible, even if admittedly not very useful, to switch to the non-interleaved mode.