OCI Client-Side Deployment in 12c
The 12c release introduces a new feature called “OCI Client-Side Deployment Parameters”, which lets you override some OCI parameters without changing your application code.
Let's build a simple test case using Python and the cx_Oracle module (which uses the OCI interface) to connect to the Oracle database.
You can find this module for your Python version and the 12c database version here.
This is the code of our test case:
#!/usr/bin/python2.7
import cx_Oracle

connection = cx_Oracle.connect('test', 'test', 'testdb')

# Execute the same statement one million times, opening and closing
# the cursor on each iteration.
for i in range(0, 1000000):
    cursor = connection.cursor()
    cursor.execute('select 1 FROM dual')
    cursor.close()

# Display the parsing-related statistics of our session.
sql = """select a.name name, b.value value
         from v$statname a, v$mystat b
         where a.statistic# = b.statistic#
         and (a.name like '%parse%' or a.name like '%session cursor%')"""
cursor = connection.cursor()
cursor.execute(sql)
rows = cursor.fetchall()
for name, value in rows:
    print "%s : %d" % (name, value)
connection.close()
So basically, we execute this simple “select 1 from dual” 1 million times by opening and closing the cursor during each iteration.
Then we display our session statistics regarding parsing.
After executing the script, check the result:
session cursor cache hits : 16961
session cursor cache count : 7
parse time cpu : 176
parse time elapsed : 641
parse count (total) : 1000010
parse count (hard) : 2
The time spent parsing was around 6 seconds (the parse time statistics are in centiseconds) and, as you can see, it was almost entirely soft parsing: only 2 hard parses were performed, for the two distinct cursors opened in our code.
It is even better than that: the cursor was already cached in our session cursor cache (as shown by the session cursor cache hits statistic), so the soft parse was even softer :)
But wait! Why is the session cursor cache hits value so small after 1 million executions?
If we add the following lines inside the loop:
cursor.execute('select 1 FROM dual')
sql = """select a.name name, b.value value
         from v$statname a, v$mystat b
         where a.statistic# = b.statistic#
         and (a.name = 'session cursor cache hits')"""
cursor.execute(sql)
rows = cursor.fetchall()
for name, value in rows:
    print "%s : %d" % (name, value)
you will see the following result:
session cursor cache hits : 65525
session cursor cache hits : 65527
session cursor cache hits : 65529
session cursor cache hits : 65531
session cursor cache hits : 65533
session cursor cache hits : 65535
session cursor cache hits : 1
session cursor cache hits : 3
session cursor cache hits : 5
session cursor cache hits : 7
So I guess this statistic is stored in a 16-bit integer and cannot go above 65535.
I found confirmation in this post from Christian, so this is not fixed yet in 12.1.0.2.
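As a quick sanity check (assuming the real number of cache hits was around one million, which is my assumption, not something shown above), the wrapped value is exactly what a 16-bit counter would display:

select mod(1000001, 65536) wrapped_value from dual;

WRAPPED_VALUE
-------------
        16961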
As I said earlier, with 12c you can now override your application's OCI parameters by using a file named oraaccess.xml.
This XML file needs to be in your TNS_ADMIN directory to be taken into account.
You can find an example at $ORACLE_HOME/rdbms/demo/oraaccess.xml.
So now let's check if that works with the cx_Oracle module.
You can set a number of client parameters, such as the array size, the client result cache and, in our case, the statement cache.
We did not implement statement caching in our code, but we will set the client statement cache size to 1 in oraaccess.xml.
<oraaccess xmlns="http://xmlns.oracle.com/oci/oraaccess"
           xmlns:oci="http://xmlns.oracle.com/oci/oraaccess"
           schemaLocation="http://xmlns.oracle.com/oci/oraaccess http://xmlns.oracle.com/oci/oraaccess.xsd">
  <default_parameters>
    <prefetch>
      <rows>50</rows>
    </prefetch>
    <statement_cache>
      <size>1</size>
    </statement_cache>
    <result_cache>
      <max_rset_rows>100</max_rset_rows>
      <!-- 'K' or 'k' for kilobytes and 'M' or 'm' for megabytes -->
      <max_rset_size>10K</max_rset_size>
      <max_size>64M</max_size>
    </result_cache>
  </default_parameters>
</oraaccess>
So now we run the python script again and check our result.
session cursor cache hits : 2
session cursor cache count : 6
parse time cpu : 0
parse time elapsed : 0
parse count (total) : 11
parse count (hard) : 0
parse count (failures) : 0
parse count (describe) : 0
As you can see there is no parse time anymore as the client uses its statement cache automatically.
In conclusion: soft parsing is better than hard parsing, but no parsing at all is even better, and here it was achieved without touching the code.
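If you want a server-side cross-check (not part of the original test, just a sketch), v$sql should show many executions for very few parse calls once the client statement cache kicks in; note that the statement text has to match exactly:

select parse_calls, executions
from   v$sql
where  sql_text = 'select 1 FROM dual';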
Lock by reference
As you may know, not indexing foreign keys can have a huge impact on locking behavior and hurt database scalability (and not only with Oracle).
But there are also some rules of thumb (coming from AskTom) for when leaving a foreign key unindexed is considered reasonable.
So, when do you NOT necessarily need to index a foreign key?
In general, when all of the following conditions are met (a query to spot unindexed foreign keys is sketched after this list):
o you do NOT delete from the parent table (especially with delete cascade, it is a double whammy),
o you do NOT update the parent table's unique/primary key value,
o you do NOT join from the PARENT to the CHILD (like DEPT->EMP).
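If you want to spot the foreign keys that are left unindexed, a simplified sketch (it only handles single-column foreign keys and only looks at the current schema) could look like this:

select c.table_name, c.constraint_name, cc.column_name
from   user_constraints  c,
       user_cons_columns cc
where  cc.constraint_name = c.constraint_name
and    c.constraint_type  = 'R'
and    not exists (select 1
                   from   user_ind_columns ic
                   where  ic.table_name      = cc.table_name
                   and    ic.column_name     = cc.column_name
                   and    ic.column_position = 1);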
This held true on one of the applications I am working on, until reference partitioning came into play with extreme side effects.
Here is a demo of what can happen when foreign keys are not indexed and you start to use reference partitioning.
First create test1 table:
CREATE TABLE test1
(
  id          number,
  PERIOD_STOP DATE
)
PARTITION BY RANGE (PERIOD_STOP)
(
  PARTITION P2015   VALUES LESS THAN (TO_DATE('12/31/2015 23:01', 'MM/DD/YYYY HH24:MI')),
  PARTITION P2016   VALUES LESS THAN (TO_DATE('12/31/2016 23:01', 'MM/DD/YYYY HH24:MI')),
  PARTITION AA_PMAX VALUES LESS THAN (MAXVALUE)
);
Enable row movement on test1 and create the primary key:
alter table test1 enable row movement;
ALTER TABLE test1 ADD ( CONSTRAINT PK_test1 PRIMARY KEY (ID)) ;
Now create table test2, partitioned by reference on test1:
CREATE TABLE test2
(
  id number not null,
  CONSTRAINT fk_test1 FOREIGN KEY (ID) REFERENCES test1 (ID) ENABLE VALIDATE
)
PARTITION BY REFERENCE (fk_test1)
enable row movement;
Important remark: we do not create an index on the foreign key column and we do not insert any child rows into test2.
Now let's insert some data into test1:
insert into test1 values (1, sysdate);
insert into test1 values (2, sysdate + 200);
commit;
Now let's enable the 10046 and 10704 events in order to trace Oracle locks:
alter session set events '10046 trace name context forever, level 4 : 10704 trace name context forever, level 3';
Update test1 so that the row has to move to another partition:
update test1 set period_stop=sysdate + 200 where id =1;
Check what you get in terms of locks (for more information on that, please check Franck Pachot's blog post).
First, list the objects involved in this transaction:
select object_id , to_char(object_id,'0XXXXXXX') , object_name,subobject_name,object_type from all_objects where object_name like 'TEST%';
 OBJECT_ID TO_CHAR(OBJECT_ID,'0XXXXXXX') OBJECT_NAME  SUBOBJECT_NAME  OBJECT_TYPE
---------- ----------------------------- ------------ --------------- -------------------
     83166 000144DE                      TEST1        AA_PMAX         TABLE PARTITION
     83164 000144DC                      TEST1        P2015           TABLE PARTITION
     83165 000144DD                      TEST1        P2016           TABLE PARTITION
     83163 000144DB                      TEST1                        TABLE
     83171 000144E3                      TEST2        AA_PMAX         TABLE PARTITION
     83169 000144E1                      TEST2        P2015           TABLE PARTITION
     83170 000144E2                      TEST2        P2016           TABLE PARTITION
     83168 000144E0                      TEST2                        TABLE
We have mostly what is described in Franck's post, except that reference partitioning is in place and the update triggers row movement.
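As a side note, the object id shows up in hexadecimal in the TM lock lines of the trace; you can translate it back to an object_id like this (a quick sketch using the TEST1 id above):

select to_number('144DB', 'XXXXXXXX') object_id from dual;

 OBJECT_ID
----------
     83163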
The lock sequence is the following.
First, Oracle locks table test1 in TM Row-Exclusive mode (mode 3):
ksqgtl *** TM-000144DB-00000000-00000000-00000000 mode=3 flags=0x400 timeout=21474836 ***
Then table test2 is locked in Share mode (mode 4), because of the non-indexed foreign key:
ksqgtl *** TM-000144E0-00000000-00000000-00000000 mode=4 flags=0x400 timeout=21474836 ***
This lock on test2 is released just afterwards:
ksqrcl: TM-000144E0-00000000-00000000-00000000
Then partition P2015 of test1 is locked in Row-Exclusive mode:
ksqgtl *** TM-000144DC-00000000-00000000-00000000 mode=3 flags=0x400 timeout=21474836 ***
Then a TX lock is acquired to lock the updated row:
ksqgtl *** TX-00010002-00000449-00000000-00000000 mode=6 flags=0x401 timeout=0 ***
Then partition P2016 of test1 is locked (Row-Exclusive) in order to move the row:
ksqgtl *** TM-000144DD-00000000-00000000-00000000 mode=3 flags=0x400 timeout=21474836 ***
And once again a Share lock is acquired on test2 and released immediately.
Five locks for a single operation, but wait, that's not all.
Checking the 10046 trace, we can see that Oracle executes the following recursive statement:
update /*+ opt_param('_and_pruning_enabled', 'false') */ "SYS"."TEST2"
  partition (dataobj_to_partition("SYS"."TEST1", 83164))
  move to partition (dataobj_to_partition("SYS"."TEST1", 83165))
  set "ID" = "ID"
  where "ID" = 1;
That makes sense: Oracle moves the test2 rows that depend on the updated test1 row (id=1) from partition P2015 (object_id=83164) to partition P2016 (object_id=83165).
So if we check the 10704 trace again, we can see that table test2 is locked in TM mode 3:
ksqgtl *** TM-000144E0-00000000-00000000-00000000 mode=3 flags=0x400 timeout=21474836 ***
So now, if we update test1 from another session, even without any row movement (remember that updating test1 with a non-indexed foreign key leads to a Share lock request on test2):
update test1 set period_stop=period_stop where id=2;
this second session blocks, as it is not possible to acquire a Share lock while a TM Row-X (mode 3) lock is already held on test2.
The update stays blocked until the first transaction commits or rolls back.
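From a third session, one way (just a sketch) to see who is blocking whom is to look at the TM locks on our two tables in v$lock:

select sid, type, id1, lmode, request, block
from   v$lock
where  type = 'TM'
and    id1 in (select object_id from all_objects where object_name in ('TEST1','TEST2'));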
Conversely, if you create an index on the test2 foreign key column (for example with the index sketched just below) and rerun the test, you will see the following lock sequence:
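A minimal sketch of such an index (the index name is arbitrary):

create index ix_fk_test2_id on test2 (id);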
First, Oracle locks table TEST1 in TM Row-Exclusive mode:
ksqgtl *** TM-000144DB-00000000-00000000-00000000 mode=3 flags=0x400 timeout=21474836 ***
Then table TEST2 in TM Row-Exclusive mode:
ksqgtl *** TM-000144E0-00000000-00000000-00000000 mode=3 flags=0x400 timeout=21474836 ***
Then test1 partition P2015 in Row-Exclusive mode:
ksqgtl *** TM-000144DC-00000000-00000000-00000000 mode=3 flags=0x400 timeout=21474836 ***
Then a TX lock on the updated row:
ksqgtl *** TX-00010002-00000449-00000000-00000000 mode=6 flags=0x401 timeout=0 ***
And last, test1 partition P2016 in Row-Exclusive mode:
ksqgtl *** TM-000144DD-00000000-00000000-00000000 mode=3 flags=0x400 timeout=21474836 ***
No more Share lock, and no recursive statement either, as we do not have any child rows in test2.
Inserting millions of time series rows per second into an Oracle database, part 2
In the previous post we demonstrated how the index column order (matching the data arrival pattern or not) can impact insertion performance.
Going a little bit further, we will now try to increase the insertion rate again.
One of the available options is to use direct-path insert: since release 11.2, the APPEND_VALUES hint is available for standard INSERT ... VALUES statements.
Using the two different column orders again, we will check the impact of direct-path insert.
Let's create the kpi table again:
SQL> create table kpi (timeid number,insid number,indid number,value number);
And first create the index as in the first example of the previous blog post (not sorted like our data arrival pattern):
SQL> create unique index ix_kpi on kpi (insid,indid,timeid);
Using the same dataset, let's insert 7 million rows into our table again, this time via direct-path insert:
declare
  TYPE kpi_array is TABLE OF kpi%ROWTYPE;
  l_kpi_array kpi_array := kpi_array();
  start_time  number;
  elapsed     number;
begin
  for time in 1 .. 100 loop
    for indid in 1 .. 7 loop
      for insid in 1 .. 10000 loop
        l_kpi_array.extend();
        l_kpi_array(l_kpi_array.last).timeid := time;
        l_kpi_array(l_kpi_array.last).indid  := indid;
        l_kpi_array(l_kpi_array.last).insid  := insid;
        l_kpi_array(l_kpi_array.last).value  := dbms_random.value(1,1000);
      end loop;
    end loop;
  end loop;
  start_time := DBMS_UTILITY.GET_TIME;
  forall p in l_kpi_array.first .. l_kpi_array.last
    insert /*+ APPEND_VALUES */ into kpi values l_kpi_array(p);
  elapsed := DBMS_UTILITY.GET_TIME - start_time;
  dbms_output.put_line('elapsed time (seconds)=' || elapsed/100);
end;
/
elapsed time (seconds)=31.41
After executing this script, we can see that it takes about 31 seconds to insert 7 million rows; for comparison, the exact same test without direct-path insert completed in about 56 seconds. Let's have a look at the session statistics:
select name, value
from   v$mystat, v$statname
where  v$mystat.statistic# = v$statname.statistic#
and    name in ('leaf node splits','leaf node 90-10 splits','redo size','redo entries',
                'session logical reads','physical reads','db block changes','db block gets',
                'physical reads direct temporary tablespace',
                'physical writes direct temporary tablespace');

NAME                                                VALUE
--------------------------------------------- ----------
session logical reads                              506672
db block gets                                      470582
physical reads                                      87041
db block changes                                   466778
physical reads direct temporary tablespace          87038
physical writes direct temporary tablespace         87038
redo entries                                       235452
redo size                                       550084680
leaf node splits                                    17526
leaf node 90-10 splits                              17526
You can notice that the index leaf node splits are now all 90-10 splits, as opposed to the same example without direct-path insert, where we had mainly 50-50 splits:
leaf node splits                                    17050
leaf node 90-10 splits                                 32
For comparison purposes, here are the statistics for the former test (without direct-path insert):
NAME                                                VALUE
--------------------------------------------- ----------
session logical reads                            15684333
db block gets                                    15544039
physical reads                                        413
db block changes                                 10610256
redo entries                                      5363399
redo size                                      1922236380
leaf node splits                                    17050
leaf node 90-10 splits                                 32
Why this difference between direct-path and conventional insert?
Because when direct-path insert is used, index maintenance is deferred and merged at the end of the load.
That means Oracle sorts the whole set of index keys before updating the index, and can then take advantage of array processing to maintain the index structure.
This explains why the redo entries and the session logical reads have decreased so much compared to the former test (without direct path insert).
Another difference from the former test is that we now have some physical reads/writes direct temporary tablespace.
The work area is too small to process the entire sort in memory, so Oracle had to spill the sort operation to disk.
This can be verified by checking the insert statement statistics values in v$sql_workarea :
select sql_id, child_number, operation_type, policy, last_execution,
       optimal_executions, multipasses_executions
from   v$sql_workarea
where  sql_id = '10qx37pz299jw'
and    child_number = 0;

SQL_ID        CHILD_NUMBER OPERATION_TYPE        POLICY LAST_EXECUTION OPTIMAL_EXECUTIONS MULTIPASSES_EXECUTIONS
------------- ------------ --------------------- ------ -------------- ------------------ ----------------------
10qx37pz299jw            0 IDX MAINTENANCE(SORT) AUTO   13 PASSES                       0                      1
10qx37pz299jw            0 LOAD WRITE BUFFERS    AUTO   OPTIMAL                         1                      0
The sort was done using a multi-pass execution (13 passes).
In order to perform the entire sort in memory, we can set the work area size policy to manual and set sort_area_size high enough:
alter session set workarea_size_policy=manual;
alter session set sort_area_size=800000000;
Let’s restart the same test with modified settings :
Now it takes 23.38 seconds to insert the 7 million rows.
Let’s check sorting statistics again:
select sql_id, child_number, operation_type, policy, last_execution,
       optimal_executions, multipasses_executions
from   v$sql_workarea
where  sql_id = '10qx37pz299jw'
and    child_number = 1;

SQL_ID        CHILD_NUMBER OPERATION_TYPE        POLICY LAST_EXECUTION OPTIMAL_EXECUTIONS MULTIPASSES_EXECUTIONS
------------- ------------ --------------------- ------ -------------- ------------------ ----------------------
10qx37pz299jw            1 IDX MAINTENANCE(SORT) MANUAL OPTIMAL                         1                      0
10qx37pz299jw            1 LOAD WRITE BUFFERS    MANUAL OPTIMAL                         1                      0
The execution time has decreased and the whole sort operation is now performed in memory.
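Another way to cross-check this, run from the same session right after the load (a sketch, not part of the original test), is to look at the sort statistics:

select sn.name, ms.value
from   v$mystat ms, v$statname sn
where  sn.statistic# = ms.statistic#
and    sn.name in ('sorts (memory)', 'sorts (disk)', 'sorts (rows)');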
Now let's run the same test again, this time with the index column order matching the data arrival pattern (see the previous post, where this is what let us take advantage of array processing without direct-path insert), and check whether there is still a difference:
SQL> create unique index ix_kpi on kpi (timeid,indid,insid);
alter session set workarea_size_policy=manual;
alter session set sort_area_size=800000000;

declare
  TYPE kpi_array is TABLE OF kpi%ROWTYPE;
  l_kpi_array kpi_array := kpi_array();
  start_time  number;
  elapsed     number;
begin
  for time in 1 .. 100 loop
    for indid in 1 .. 7 loop
      for insid in 1 .. 10000 loop
        l_kpi_array.extend();
        l_kpi_array(l_kpi_array.last).timeid := time;
        l_kpi_array(l_kpi_array.last).indid  := indid;
        l_kpi_array(l_kpi_array.last).insid  := insid;
        l_kpi_array(l_kpi_array.last).value  := dbms_random.value(1,1000);
      end loop;
    end loop;
  end loop;
  start_time := DBMS_UTILITY.GET_TIME;
  forall p in l_kpi_array.first .. l_kpi_array.last
    insert /*+ APPEND_VALUES */ into kpi values l_kpi_array(p);
  elapsed := DBMS_UTILITY.GET_TIME - start_time;
  dbms_output.put_line('elapsed time (seconds)=' || elapsed/100);
  commit;
end;
/
Let's compare the event 10032 traces (sort statistics) for the two different column orders:
first index (insid,indid,timeid)
second index (timeid,indid,insid)
You can see that the algorithm used for sorting is “version 1 sort”; as explained in this Jonathan Lewis blog post, this algorithm is a “binary insertion tree”, so Oracle builds a binary tree while reading the data from memory.
This sorting cost can also be seen in the 'CPU used by this session' statistic; please take into account that this statistic includes the creation of the nested table as well as the insert statement (all other statistics were almost the same):
first index column order  : 5432 cs
second index column order : 1891 cs
Conclusion :
When using direct-path insert, index maintenance is deferred, so Oracle is able to process the index operations by array even when the data arrival pattern is not sorted in the same way as our index column order.
We have also demonstrated that the sort algorithm still benefits from a sorted data pattern and a matching index column order, even in direct-path insert mode.
Inserting millions of time series rows per second into an Oracle database
This blog post is the first of a series of 4 related to time series and indexing.
According to Wikipedia, this is the definition of a time series:
A time series is a sequence of data points, typically consisting of successive measurements made over a time interval.
This blog post shows how we can efficiently store time series in an Oracle database and how the indexing strategy can have a big impact on the insertion rate.
The time series data used in this example are performance metrics coming from a network performance monitoring tool.
We will use three components to identify our time series data point:
• a timestamp (timeid), based on epoch time (starting in the 70s to keep the numbers small for the demonstration)
• a performance indicator identifier (indid) that represents a performance metric (load average, CPU usage, …)
• a network object identifier (insid) that represents a network object (router1, …)
and finally, we will store the collected metric value.
The key point to have in mind is that our data arrival pattern is sorted based on the time (timeid).
That means that for the current timestamp, we will have to store all the metrics of the entire network.
In terms of modeling, we will create an Oracle table called “KPI” to store this kind of information.
This is a pretty simple structure that is in fact a key/value model.
SQL> create table kpi (timeid number,insid number,indid number,value number);
Now comes the question: which indexing strategy should we use? As we will store billions of rows in this table, we will create a composite index on the timeid, insid and indid columns in order to retrieve and graph our performance metric values over time.
But which order is optimal and what is the impact on the insertion rate and reporting queries?
As the goal is to store and graph the time series of a specific piece of network equipment and a specific performance metric over time, this is what the SQL reporting queries look like (conceptually):

select value from KPI where indid='CPU %' and insid='router 1' and timeid between 1 and 51;
When creating a composite index, the columns are sorted by the order mentioned in the index definition.
So, considering our reporting SQL query, the best access path is going to be an index range scan, and the most obvious implementation for our index is the following:
SQL> create unique index ix_kpi on kpi (insid,indid,timeid);
This will keep the requested index entries “clustered” together and reduce the I/O required for the index range scan operation.
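If you want to double-check the access path, a quick sketch (the literal values are arbitrary) is to display the execution plan:

explain plan for
select value from kpi where insid = 42 and indid = 7 and timeid between 1 and 51;

select * from table(dbms_xplan.display);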
While this index seems optimal for our reporting query, what is its impact on our insertion rate?
Let's perform a quick insertion of 7 million rows using a FORALL bulk insert:
declare
  TYPE kpi_array is TABLE OF kpi%ROWTYPE;
  l_kpi_array kpi_array := kpi_array();
  start_time  number;
  elapsed     number;
begin
  for time in 1 .. 100 loop
    for indid in 1 .. 7 loop
      for insid in 1 .. 10000 loop
        l_kpi_array.extend();
        l_kpi_array(l_kpi_array.last).timeid := time;
        l_kpi_array(l_kpi_array.last).indid  := indid;
        l_kpi_array(l_kpi_array.last).insid  := insid;
        l_kpi_array(l_kpi_array.last).value  := dbms_random.value(1,1000);
      end loop;
    end loop;
  end loop;
  start_time := DBMS_UTILITY.GET_TIME;
  forall p in l_kpi_array.first .. l_kpi_array.last
    insert into kpi values l_kpi_array(p);
  elapsed := DBMS_UTILITY.GET_TIME - start_time;
  dbms_output.put_line('elapsed time (seconds)=' || elapsed/100);
end;
/
Notice that the data is sorted by time, because this is how it is gathered from our network.
elapsed time (seconds)=55.6
After executing this script, we can see that it takes 55 seconds to insert 7 million rows, which is around 127,000 rows per second.
Now let’s check some of our session statistics values :
select name, value
from   v$mystat, v$statname
where  v$mystat.statistic# = v$statname.statistic#
and    name in ('leaf node splits','leaf node 90-10 splits','redo size','redo entries',
                'session logical reads','physical reads','db block changes','db block gets');

NAME                                                VALUE
--------------------------------------------- ----------
session logical reads                            15684333
db block gets                                    15544039
physical reads                                        413
db block changes                                 10610256
redo entries                                      5363399
redo size                                      1922236380
leaf node splits                                    17050
leaf node 90-10 splits                                 32
We can observe the large amount of “session logical reads” and redo entries generated. The index maintenance has also generated a lot of “50-50” leaf node splits.
As our KPI data arrives sorted in a different order than the index structure, Oracle has to maintain the B-tree in an inefficient way.
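If you want to see the effect on the index structure itself, one option (a sketch; note that ANALYZE ... VALIDATE STRUCTURE locks the table while it runs) is to look at INDEX_STATS right after the load:

analyze index ix_kpi validate structure;

select lf_rows, lf_blks, pct_used
from   index_stats;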
For each row inserted, Oracle has to walk down the index structure to locate the right leaf block.
Dumping the redo log, we can see that the change vectors generated for the leaf block operations (“insert leaf row”) are not batched; this is clearly visible in change vector #2 (REDO: SINGLE):
REDO RECORD - Thread:1 RBA: 0x0007c8.00000002.0010 LEN: 0x011c VLD: 0x05
SCN: 0x0000.030f4fcb SUBSCN: 1 12/06/2014 15:51:59
(LWN RBA: 0x0007c8.00000002.0010 LEN: 11398 NST: 0001 SCN: 0x0000.030f4fcb)
CHANGE #1 TYP:0 CLS:34 AFN:3 DBA:0x00c96222 OBJ:4294967295 SCN:0x0000.030f4fc8 SEQ:58 OP:5.1 ENC:0 RBL:0
ktudb redo: siz: 84 spc: 3248 flg: 0x0022 seq: 0x0fb9 rec: 0x3a
            xid:  0x0009.009.00004bef
ktubu redo: slt: 9 rci: 57 opc: 10.22 objn: 307081 objd: 307081 tsn: 12
Undo type:  Regular undo       Undo type:  Last buffer split:  No
Tablespace Undo:  No
0x00000000
index undo for leaf key operations
KTB Redo
op: 0x02  ver: 0x01
compat bit: 4 (post-11) padding: 1
op: C  uba: 0x00c961af.0fb9.03
Dump kdilk : itl=2, kdxlkflg=0x41 sdc=0 indexid=0x300008a block=0x0300416c
(kdxlpu): purge leaf row
key :(10): 03 c2 38 02 02 c1 06 02 c1 52
CHANGE #2 TYP:0 CLS:1 AFN:12 DBA:0x0300416c OBJ:307081 SCN:0x0000.030f4fc2 SEQ:1 OP:10.2 ENC:0 RBL:0
index redo (kdxlin): insert leaf row
KTB Redo
op: 0x02  ver: 0x01
compat bit: 4 (post-11) padding: 1
op: C  uba: 0x00c96222.0fb9.3a
REDO: SINGLE / -- / --
itl: 2, sno: 230, row size 13
insert key: (10): 03 c2 38 02 02 c1 06 02 c1 52
keydata: (6): 03 00 a4 b7 00 43
prefix exists on block, sno: 2 len: 7
Now let’s perform the same insertion test but with the following index structure on the KPI table:
SQL> create unique index ix_kpi on kpi (timeid,indid,insid);
and insert the same 7 million row dataset again:
elapsed time (seconds)=17.8
Now it takes only 18 seconds for 7 million rows, which is around 390,000 rows per second.
Let’s check again our session statistics :
NAME                                                VALUE
--------------------------------------------- ----------
session logical reads                              809367
db block gets                                      711929
physical reads                                         40
db block changes                                   601218
redo entries                                       363907
redo size                                       823523804
leaf node splits                                    12273
leaf node 90-10 splits                              12273
You can see that the number of logical reads and redo entries decreased dramatically.
The index is now what is called a “right-hand” index, as the timeid value is always increasing.
All new index entries are “clustered” together in the rightmost blocks of the B-tree.
This is also the reason why all leaf node splits are now only 90-10 (no more 50-50).
If we check the redo log records again, we can observe that the “insert leaf row” operations are now batched and performed by array (REDO: ARRAY):
REDO RECORD - Thread:1 RBA: 0x0007ca.00000102.015c LEN: 0x2530 VLD: 0x01
SCN: 0x0000.030f6b8d SUBSCN: 1 12/06/2014 15:58:25
CHANGE #1 TYP:0 CLS:25 AFN:3 DBA:0x00c000c0 OBJ:4294967295 SCN:0x0000.030f6b8d SEQ:1 OP:5.2 ENC:0 RBL:0
ktudh redo: slt: 0x0019 sqn: 0x00000000 flg: 0x000a siz: 2832 fbi: 255
            uba: 0x00c8519f.0e15.01    pxid:  0x0000.000.00000000
CHANGE #2 TYP:1 CLS:26 AFN:3 DBA:0x00c8519f OBJ:4294967295 SCN:0x0000.030f6b8d SEQ:1 OP:5.1 ENC:0 RBL:0
ktudb redo: siz: 2832 spc: 832 flg: 0x000a seq: 0x0e15 rec: 0x01
            xid:  0x0005.019.00004bbe
ktubu redo: slt: 25 rci: 0 opc: 10.22 objn: 307083 objd: 307083 tsn: 12
Undo type:  Regular undo       Undo type:  Last buffer split:  No
Tablespace Undo:  No
0x00c8519e
index undo for leaf key operations
KTB Redo
op: 0x02  ver: 0x01
compat bit: 4 (post-11) padding: 1
op: C  uba: 0x00c8519e.0e15.33
Dump kdilk : itl=2, kdxlkflg=0x61 sdc=0 indexid=0x300008a block=0x0300008b
(kdxlpu): purge leaf row
number of keys: 235
key sizes: 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 .......
CHANGE #3 TYP:0 CLS:1 AFN:12 DBA:0x0300008b OBJ:307083 SCN:0x0000.030f6b8d SEQ:1 OP:10.2 ENC:0 RBL:0
index redo (kdxlin): insert leaf row
KTB Redo
op: 0x02  ver: 0x01
compat bit: 4 (post-11) padding: 1
op: C  uba: 0x00c8519f.0e15.01
REDO: ARRAY / -- / --
itl: 2, sno: 1, row size 3190
number of keys: 235
slots: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235
insert key: (2250): 02 c1 02 02 c1 02 02 c1 03 02 c1 02 02 c1 02 02 c1 04 02 c1 02 02 c1 02 02 c1 05 02 c1 02 02 c1 02 02 c1 06 02 c1 02 02 c1 02 02 c1 07 02 c1 02 02 c1 02 02 c1 08 02 c1 02 02 c1 02 02 c1 09 02 c1 02 02 c1 02 02 c1 0a 02 c1 02 02 c1 02 02 c1 0b 02 c1 02 02 c1 02 02 c1 0c 02 c1 02 02 c1 02 02 c1 0d 02 c1 02 02 c1 02 02 c1 0e 02 c1 02 02 c1 02 02 c1 0f 02 c1 02 02 c1 02 02 c1 10 02 c1 02 02 c1 02 02 c1 11
Conclusion :
For a composite index, the column order has an impact on the insertion rate, especially when your data arrives in a specific pattern.
In the next part we will continue to explore the insertion rate and demonstrate that it is still possible to increase it.
I would like to thank Bertrand Drouvot for kindly accepting to review this post.
Building a NUMA beast (or how to test Oracle behaviour on a NUMA system without having NUMA)
I just wanted to share a little trick to emulate a NUMA architecture on non-NUMA hardware.
The Linux kernel is able to simulate a NUMA architecture if it is compiled with the following flag:
CONFIG_NUMA_EMU=y
and of course if
CONFIG_NUMA=y
By default, at least with the UEK kernel, this flag is not set.
So let's compile a kernel with this flag set to y.
First, we need these packages installed on the system:
yum groupinstall "Development Tools"
yum install ncurses-devel
Download the Linux kernel from https://www.kernel.org/pub/linux/kernel; in the following example I used version 3.8.13.
wget https://www.kernel.org/pub/linux/kernel/v3.x/linux-3.8.13.tar.xz
tar -Jxvf linux-3.8.13.tar.xz -C /usr/src/
cd /usr/src/linux-3.8.13/
Copy your current kernel config file into the directory containing the kernel sources, then edit it and set CONFIG_NUMA_EMU=y:
cp /boot/config-3.8.13-55.1.5.el6uek.x86_64 /usr/src/linux-3.8.13/
Compile the kernel and its modules, and install them (and wait, wait, wait…):
make
make modules
make modules_install
make install
Now edit your grub.conf file and fake 40 NUMA nodes:
kernel /vmlinuz-3.8.13 ro root=/dev/mapper/vg_oracle12-lv_root rd_NO_LUKS rd_NO_MD rd_LVM_LV=vg_oracle12/lv_swap LANG=fr_FR.UTF-8 SYSFONT=latarcyrheb-sun16 crashkernel=auto rd_LVM_LV=vg_oracle12/lv_root KEYBOARDTYPE=pc KEYTABLE=fr-latin9 rd_NO_DM rhgb quiet numa=on numa=fake=40 transparent_hugepage=never initrd /initramfs-3.8.13.img
Reboot the system and enjoy your NUMA beast:

numactl --hardware
available: 39 nodes (0-38)
node 0 cpus: 0
node 0 size: 95 MB
node 0 free: 0 MB
node 1 cpus: 1
node 1 size: 96 MB
node 1 free: 0 MB
node 2 cpus:
node 2 size: 96 MB
node 2 free: 76 MB
node 3 cpus:
node 3 size: 96 MB
node 3 free: 93 MB
node 4 cpus:
node 4 size: 96 MB
node 4 free: 90 MB
node 5 cpus:
node 5 size: 96 MB
node 5 free: 90 MB
node 6 cpus:
node 6 size: 96 MB
node 6 free: 90 MB
node 7 cpus:
node 7 size: 96 MB
node 7 free: 93 MB
node 8 cpus:
node 8 size: 96 MB
node 8 free: 90 MB
node 9 cpus:
node 9 size: 96 MB
node 9 free: 90 MB
node 10 cpus:
node 10 size: 96 MB
node 10 free: 90 MB
node 11 cpus:
node 11 size: 96 MB
node 11 free: 93 MB
node 12 cpus:
node 12 size: 96 MB
node 12 free: 90 MB
node 13 cpus:
node 13 size: 96 MB
node 13 free: 26 MB
node 14 cpus:
node 14 size: 64 MB
node 14 free: 61 MB
node 15 cpus:
node 15 size: 64 MB
node 15 free: 58 MB
node 16 cpus:
node 16 size: 64 MB
node 16 free: 61 MB
node 17 cpus:
node 17 size: 64 MB
node 17 free: 58 MB
node 18 cpus:
node 18 size: 64 MB
node 18 free: 61 MB
node 19 cpus:
node 19 size: 64 MB
node 19 free: 58 MB
node 20 cpus:
node 20 size: 64 MB
node 20 free: 61 MB
node 21 cpus:
node 21 size: 64 MB
node 21 free: 58 MB
node 22 cpus:
node 22 size: 64 MB
node 22 free: 61 MB
node 23 cpus:
node 23 size: 64 MB
node 23 free: 58 MB
node 24 cpus:
node 24 size: 64 MB
node 24 free: 61 MB
node 25 cpus:
node 25 size: 64 MB
node 25 free: 58 MB
node 26 cpus:
node 26 size: 64 MB
node 26 free: 61 MB
node 27 cpus:
node 27 size: 64 MB
node 27 free: 58 MB
node 28 cpus:
node 28 size: 64 MB
node 28 free: 61 MB
node 29 cpus:
node 29 size: 64 MB
node 29 free: 58 MB
node 30 cpus:
node 30 size: 64 MB
node 30 free: 61 MB
node 31 cpus:
node 31 size: 64 MB
node 31 free: 58 MB
node 32 cpus:
node 32 size: 64 MB
node 32 free: 61 MB
node 33 cpus:
node 33 size: 64 MB
node 33 free: 58 MB
node 34 cpus:
node 34 size: 64 MB
node 34 free: 61 MB
node 35 cpus:
node 35 size: 64 MB
node 35 free: 58 MB
node 36 cpus:
node 36 size: 95 MB
node 36 free: 90 MB
node 37 cpus:
node 37 size: 64 MB
node 37 free: 61 MB
node 38 cpus:
node 38 size: 120 MB
node 38 free: 114 MB
node distances:
node 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
0: 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
1: 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
2: 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
3: 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
4: 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
5: 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
6: 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
7: 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
8: 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
9: 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
10: 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
11: 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
12: 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
13: 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
14: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
15: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
16: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
17: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
18: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
19: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
20: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
21: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
22: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
23: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
24: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20 20
25: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20 20
26: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20 20
27: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20 20
28: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20 20
29: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20 20
30: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20 20
31: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20 20
32: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20 20
33: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20 20
34: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20 20
35: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20 20
36: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20 20
37: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 20
38: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10
2-hop NUMA: where is our last “hop”?
Following Kevin Closson's comments about my first blog post, I dug a little deeper into the NUMA architecture of the specific box I have access to.
My box comes with Intel E5-4627 processors and, as mentioned by Kevin, this is a 2-hop NUMA architecture.
As a reminder, this is the NUMA configuration of this box:
numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32733 MB
node 0 free: 29452 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32768 MB
node 1 free: 29803 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 32768 MB
node 2 free: 29796 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 32767 MB
node 3 free: 29771 MB
node distances:
node   0   1   2   3
  0:  10  21  21  21
  1:  21  10  21  21
  2:  21  21  10  21
  3:  21  21  21  10
We have 4 nodes (0 to 3), each node having 8 cores and 32 GB of local memory.
The purpose of this first test is to measure the penalty of remote (1-hop) memory access, plus the extra penalty of 2-hop access (even though it is not reflected in the node distances output).
Then we will see how to find which NUMA node introduces the 2-hop penalty.
SLOB will be used in my next post to test the NUMA impact with an Oracle workload.
For this very basic test, I used lmbench (http://pkgs.repoforge.org/lmbench/).
lmbench is a suite of micro-benchmarking tools, and one of them (lat_mem_rd) will be used to measure memory latency.
So let's measure latency across the NUMA nodes.
But first, here is a description of lat_mem_rd usage from the Intel document “Measuring Cache and Memory Latency and CPU to Memory Bandwidth”:
lat_mem_rd [depth] [stride]

The [depth] specification indicates how far into memory the utility will measure. In order to ensure an accurate measurement, specify an amount that will go far enough beyond the cache so that it does not factor in latency measurements. Finally, [stride] is the skipped amount of memory before the next access. If [stride] is not large enough, modern processors have the ability to prefetch the data, thus providing artificial latencies for the system memory region. If a stride is too large, the utility will not report correct latencies as it will be skipping past measured intervals. The utility will default to 128 bytes when [stride] is not specified. Refer to Figure 4 for the results graph.

Binding to the first core and accessing only up to 256M of RAM, the command line would look as follows:

./lat_mem_rd -N 1 -P 1 256M 512
For the memory latency measurement on local NUMA node 0, we will use a 20 MB array with a stride size of 256 bytes.
The first column of the output is the memory depth accessed (in MB).
The second column is the measured memory latency (in nanoseconds).
numactl --membind=0 --cpunodebind=0 ./lat_mem_rd 20 256
"stride=256
depth(MB)   Latency(ns)
0.00049     1.117
0.00098     1.117
.....
0.02344     1.117
0.02539     1.117
0.02734     1.117
0.02930     1.117
0.03125     1.117
0.03516     3.350
0.03906     3.351
0.04297     3.351
0.04688     3.350
0.05078     2.980
0.05469     2.942
0.05859     3.114
0.06250     3.351
0.07031     3.350
.....
0.10938     3.352
0.11719     3.234
0.12500     3.352
0.14062     3.350
0.15625     3.352
0.17188     3.352
0.18750     3.354
0.20312     5.017
0.21875     5.868
0.23438     5.704
0.25000     7.155
0.28125     7.867
0.31250     10.289
0.34375     10.267
......
6.50000     10.843
7.00000     10.843
7.50000     10.841
8.00000     10.841
9.00000     10.838
10.00000    10.839
11.00000    10.949
12.00000    11.139
13.00000    11.409
14.00000    12.978
15.00000    18.131
16.00000    29.493
18.00000    52.400
20.00000    53.660
One of the interesting parts is that, as the memory reads move from the L1 cache (32 KB) to L2 (256 KB) and L3 (16 MB), we can see the impact on read latency, until we finally reach main memory latency (53 ns in this case).
Now let's check the latency when accessing memory on the distant nodes (1, 2, 3), still binding the process to CPU node 0, and note the maximum latency for this 20 MB array.
numactl --membind=1 --cpunodebind=0 ./lat_mem_rd 20 256
depth(MB)   Latency(ns)
20.00000    211.483

numactl --membind=2 --cpunodebind=0 ./lat_mem_rd 20 256
depth(MB)   Latency(ns)
20.00000    217.169

numactl --membind=3 --cpunodebind=0 ./lat_mem_rd 20 256
depth(MB)   Latency(ns)
20.00000    248.341
As you can see, there is extra latency for NUMA node 3 compared to the other distant nodes (around 14% more), so it must be the NUMA node responsible for the extra hop.
We can conclude that our NUMA configuration can be illustrated like this: 2 hops are needed from node 0 to reach memory on NUMA node 3.
In the next post we will “SLOB” these NUMA nodes and check the impact of the extra hop on Oracle.
NUMA / interleaved memory / Oracle
All the following tests have been executed on version 11.2.0.4.
Following some posts about NUMA on Bertrand Drouvot's website (“cpu binding (processor_group_name) vs Instance caging comparison during LIO pressure” and “Measure the impact of remote versus local NUMA node access thanks to processor_group_name”),
some input from Martin Bach's book (http://goo.gl/lfdkli) and his “Linux large pages and non-uniform memory distribution” post,
and Kevin Closson's posts “Oracle on Opteron with Linux - The NUMA Angle (Part VI). Introducing Cyclops.” and “Oracle11g Automatic Memory Management - Part III. A NUMA Issue.”,
I was a little bit confused about how Oracle manages memory on a NUMA system when NUMA support is disabled at the Oracle level (i.e. _enable_NUMA_support is set to FALSE) but enabled at the system level.
By default on a Linux system, each process you execute follows the NUMA policy defined at the OS level, which by default allocates memory on the local (current) node:
numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
cpubind: 0 1 2 3
nodebind: 0 1 2 3
membind: 0 1 2 3
So I decided to test the behavior of Oracle in different test cases (using an SGA of 20 GB).
My platform has 4 NUMA nodes with 32 GB of memory each (see the numactl outputs below).
First test :
Start the instance with huge pages and ASMM :
Let's have a look at the memory allocation after startup:
numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32733 MB
node 0 free: 29452 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32768 MB
node 1 free: 29803 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 32768 MB
node 2 free: 29796 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 32767 MB
node 3 free: 29771 MB
node distances:
node   0   1   2   3
  0:  10  21  21  21
  1:  21  10  21  21
  2:  21  21  10  21
  3:  21  21  21  10
As you can see, the memory is distributed across the NUMA nodes even though this is not the default policy at OS level.
So Oracle is forcing an interleaved memory spread across the NUMA nodes, and this can be seen with strace when starting the instance.
The following system call is executed :
mbind(1744830464, 21340618752, MPOL_INTERLEAVE, 0xd863da0, 513, 0) = 0
The description of this system call is the following :
mbind() sets the NUMA memory policy, which consists of a policy mode and zero or more nodes, for the memory range starting with addr and continuing for len bytes. The memory policy defines from which node memory is allocated.
So it looks like Oracle sets the memory policy to MPOL_INTERLEAVE for the SGA:
The MPOL_INTERLEAVE mode specifies that page allocations has to be interleaved across the set of nodes specified in nodemask. This optimizes for bandwidth instead of latency by spreading out pages and memory accesses to those pages across multiple nodes. To be effective the memory area should be fairly large, at least 1MB or bigger with a fairly uniform access pattern. Accesses to a single page of the area will still be limited to the memory bandwidth of a single node.
Second test :
Start the instance without huge pages and with ASMM :
And again, let's have a look at the memory allocation after startup:
numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32733 MB
node 0 free: 26394 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32768 MB
node 1 free: 26272 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 32768 MB
node 2 free: 26619 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 32767 MB
node 3 free: 26347 MB
node distances:
node   0   1   2   3
  0:  10  21  21  21
  1:  21  10  21  21
  2:  21  21  10  21
  3:  21  21  21  10
Again the memory is interleaved across the nodes, and again we can see the same mbind system calls.
Third test :
Start the instance with AMM :
Once again the memory is interleaved across the nodes, and again the same mbind system calls show up.
So the conclusion is that, in each configuration, Oracle allocates the SGA using the interleave memory policy.
So, is it possible to switch Oracle back to the default OS NUMA policy? It looks like it is, thanks to the “_enable_NUMA_interleave” hidden parameter, which is set to TRUE by default (thanks to Bertrand for pointing me to that parameter).
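As a side note, a common way to check the current value of such a hidden parameter (a sketch; it queries the x$ fixed tables, so it has to be run as SYS) is:

select i.ksppinm name, v.ksppstvl value
from   x$ksppi i, x$ksppcv v
where  v.indx = i.indx
and    i.ksppinm = '_enable_NUMA_interleave';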
Let's test this (I tested it only on the first configuration, with huge pages in place):
alter system set "_enable_NUMA_interleave"=false scope=spfile;
So now restart the instance and look at the memory allocation again:
numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32733 MB
node 0 free: 28842 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32768 MB
node 1 free: 29091 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 32768 MB
node 2 free: 29580 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 32767 MB
node 3 free: 18022 MB
node distances:
node   0   1   2   3
  0:  10  21  21  21
  1:  21  10  21  21
  2:  21  21  10  21
  3:  21  21  21  10
So Oracle switched back to the default OS NUMA policy (look at node 3, where most of the SGA ended up), and there is no trace of mbind system calls anymore.
To conclude, it seems that, at least in this version, SGA memory is interleaved across the nodes by default with every Oracle memory management mode.
But it is still possible (even if, I agree, not very useful) to switch to non-interleaved mode.