Numa/interleave memory/oracle


All the following tests were executed on Oracle 11.2.0.4.

This post follows some posts about NUMA on Bertrand Drouvot's website (cpu binding (processor_group_name) vs Instance caging comparison during LIO pressure and Measure the impact of remote versus local NUMA node access thanks to processor_group_name), some input from Martin Bach's book (http://goo.gl/lfdkli) and his Linux large pages and non-uniform memory distribution post, and Kevin Closson's posts Oracle on Opteron with Linux-The NUMA Angle (Part VI). Introducing Cyclops. and Oracle11g Automatic Memory Management – Part III. A NUMA Issue.

I was a little confused about how Oracle manages memory on a NUMA system when NUMA support is disabled at the Oracle level (i.e. _enable_NUMA_support is set to FALSE) but enabled at the system level.

By default on a Linux system, each process you execute follows the NUMA policy defined at the OS level, which by default allocates memory on the node the process is running on (local allocation):

numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
cpubind: 0 1 2 3
nodebind: 0 1 2 3
membind: 0 1 2 3
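As a side note, here is a minimal sketch of how to check where an already running process has actually placed its memory (ora_pmon_ORCL is just a hypothetical process name, and numastat -p requires a reasonably recent numactl package):

pid=$(pgrep -f ora_pmon_ORCL)
# per-node memory usage of that process
numastat -p $pid
# per-mapping detail: interleaved ranges are prefixed with "interleave:0-3"
grep interleave /proc/$pid/numa_maps | head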

So I decided to test the behavior of Oracle in different test cases (using an SGA of 20G).

My platform has the following NUMA configuration: 4 NUMA nodes with 32 GB of memory each (see the numactl --hardware output below).

First test:

Start the instance with huge pages and ASMM.

Let's have a look at the memory allocation after startup:

numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32733 MB
node 0 free: 29452 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32768 MB
node 1 free: 29803 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 32768 MB
node 2 free: 29796 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 32767 MB
node 3 free: 29771 MB
node distances:
node   0   1   2   3
0:  10  21  21  21
1:  21  10  21  21
2:  21  21  10  21
3:  21  21  21  10

As you can see, the memory is distributed across the NUMA nodes even though this is not the default policy at the OS level.

So Oracle is forcing the memory to be interleaved across the NUMA nodes, and this can be seen by tracing the instance startup with strace.
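A possible way to capture it (just a sketch, assuming the instance is started from sqlplus; -f follows the forked background processes and -e restricts the output to mbind calls):

strace -f -e trace=mbind -o /tmp/startup_mbind.trc sqlplus / as sysdba <<'EOF'
startup
EOF
grep mbind /tmp/startup_mbind.trc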

The following system call is executed:

mbind(1744830464, 21340618752, MPOL_INTERLEAVE, 0xd863da0, 513, 0) = 0

The description of this system call (from the mbind(2) man page) is the following:

mbind() sets the NUMA memory policy, which consists of a policy mode
       and zero or more nodes, for the memory range starting with addr and
       continuing for len bytes.  The memory policy defines from which node
       memory is allocated.

So it looks like Oracle sets the memory policy to interleave for the SGA:

The MPOL_INTERLEAVE mode specifies that page allocations are to be interleaved across the set of nodes specified in nodemask. This optimizes for bandwidth instead of latency by spreading out pages and memory accesses to those pages across multiple nodes. To be effective the memory area should be fairly large, at least 1 MB or bigger, with a fairly uniform access pattern. Accesses to a single page of the area will still be limited to the memory bandwidth of a single node.
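For comparison, the same interleave policy can be requested at the OS level for any program via numactl (this applies to all of the program's memory, whereas Oracle asks for it only on the SGA range through mbind; /bin/sh is just a placeholder):

# interleave across all nodes
numactl --interleave=all /bin/sh
# or across a subset of nodes only
numactl --interleave=0,1 /bin/sh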

Second test:

Start the instance without huge pages and with ASMM.

And again, let's have a look at the memory allocation after startup:

numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32733 MB
node 0 free: 26394 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32768 MB
node 1 free: 26272 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 32768 MB
node 2 free: 26619 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 32767 MB
node 3 free: 26347 MB
node distances:
node   0   1   2   3
0:  10  21  21  21
1:  21  10  21  21
2:  21  21  10  21
3:  21  21  21  10

Again the memory is interleaved across the nodes.

And again we can see the same mbind system calls.
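As a side note, a quick way to tell whether a given test really runs with or without huge pages is to look at the huge page counters in /proc/meminfo (HugePages_Free drops when the SGA is allocated from the huge page pool):

grep Huge /proc/meminfo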

Third test:

Start the instance with AMM.

Once again, the memory is interleaved across the nodes!

And again, the same mbind system calls.

So the conclusion is that in each configuration Oracle allocates the SGA using the memory interleave policy.

So is it possible to switch Oracle back to the default OS NUMA policy? It looks like it is, thanks to the "_enable_NUMA_interleave" hidden parameter, which is set to TRUE by default (and thanks to Bertrand for pointing me to that parameter).

Let's test this (I tested it only on the first configuration, with huge pages in place):

alter system set "_enable_NUMA_interleave"=false scope=spfile;
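As a side check, the current value of the hidden parameter can be queried with the usual x$ksppi/x$ksppcv query (run as SYSDBA, here wrapped in a small shell heredoc):

sqlplus -s / as sysdba <<'EOF'
select ksppinm name, ksppstvl value
from   x$ksppi i, x$ksppcv v
where  i.indx = v.indx
and    ksppinm = '_enable_NUMA_interleave';
EOF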

So now let's restart the instance and look again at the memory allocation:

numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32733 MB
node 0 free: 28842 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32768 MB
node 1 free: 29091 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 32768 MB
node 2 free: 29580 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 32767 MB
node 3 free: 18022 MB
node distances:
node   0   1   2   3
0:  10  21  21  21
1:  21  10  21  21
2:  21  21  10  21
3:  21  21  21  10

So Oracle switched back to the default OS NUMA policy (look at node 3) and there is no trace of mbind system calls anymore.
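One more way to double-check (same hypothetical process name as before): with "_enable_NUMA_interleave" set to FALSE, the SGA mappings in numa_maps should no longer report the interleave policy:

pid=$(pgrep -f ora_pmon_ORCL)
# expected to return 0 now (no interleaved mappings left)
grep -c interleave /proc/$pid/numa_maps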

To conclude, it seems that, at least in this version, the SGA is interleaved across the NUMA nodes by default with every Oracle memory management mode.

But it is still possible (though, I agree, not very useful) to switch to the non-interleaved mode.


8 thoughts on “Numa/interleave memory/oracle”

    […] the cpus have been allocated from the same NUMA node (See the beginning of this post), while the SGA is spread across NUMA nodes in the Instance caging test. And it matters: See this post for more […]


    kevinclosson said:
    January 18, 2015 at 6:30 pm

Have you experimented with _enable_NUMA_support=TRUE? When you do so you should find that the variable component of the SGA is interleaved and there are N shared memory segments, one for each NUMA node, bound to the local memory of the node.

The variable component of the SGA will always need to be interleaved for performance. It has all the latch structures. Homing latches on specific nodes isn't going to pay dividends. There are exceptions such as cache_buffers_lru as per my writings that date back to the first NUMA-optimized port of Oracle, which was Oracle8i on Sequent DYNIX/ptx … I'll provide a reference to some historical prior art in case you find that sort of thing interesting:

    http://kevinclosson.files.wordpress.com/2007/04/oracle8i.pdf

    Have you considered testing with cached SLOB on that 4 socket box? I think I can determine from the distances map that this is Xeon E7 (perhaps WSM-EX or IVB-EX)? You might care to explore init.ora NUMA off (the default) and create your own NUMA affinity. There might be interesting things to find in the following:

1. numactl --cpunodebind=0,1 --interleave=0,1 /bin/sh
1.1 Now your shell has hard affinity to nodes 0,1 for execution and memory placement (interleaved). Start the listener and the instance and all SLOB processes from this shell and you are testing 50% locality on a 4 socket box.

    From the above consider 0,1,2 to test 33% locality.

    Finally, if this is an EX box consider booting with numa=off in grub. Since EX is a single hop box you might find that offers the best performance.

Oops, one more thing. 12c is entirely different and actually quite NUMA aware with its integration with CGROUPS. I'm reluctant to recommend production use of a “dot-1” Oracle release (for historical and plainly obvious reasons) but it is quite suitable for study.


      ycolin responded:
      January 18, 2015 at 7:23 pm

Thanks for your comment Kevin. Regarding the CPU, this is an E5-4627 v2, so if I'm not wrong, IVB-EP. I didn't test with _enable_NUMA_support=true but I will do this soon, and it's definitely in my plan to test that with SLOB. By the way, I will follow your guidelines. Thanks again!!


      Wellington Prado said:
      March 9, 2016 at 9:04 pm

      Hi Kevin,

Do you know if in 12c (with cgroups) the variable component of the SGA will still be interleaved across all NUMA nodes on the system? I suppose not, but I haven't tested it yet.

      Tks


        ycolin responded:
        March 9, 2016 at 9:55 pm

        Hi Wellington,
You can look at Bertrand's post here to find the answer.
        Yves.


    kevinclosson said:
    January 19, 2015 at 7:31 pm

E5-4600 is a 2-hop NUMA. You might consider setting up numactl --cpunodebind / --interleave tests that stress sockets 0,1 then 0,2 then 0,3 then 0,1,2 then 0,1,3 and then 0,2,3.

    If you test as I explained above you should find that cached Oracle will suffer when using the socket that introduces 2 hops.


      ycolin responded:
      January 19, 2015 at 8:43 pm

Thanks! Interesting test. I'll do that soon :)


    […] On a NUMA enabled system, a 12c database Instance allocates the SGA evenly across all the NUMA nodes by default (unless you change the “_enable_NUMA_interleave” hidden parameter to FALSE). You can find more details about this behaviour in Yves’s post. […]

