
2-hop NUMA: where is our last “hop”?


Following Kevin Closson’s comments about my first blog post, I dug a little deeper into the NUMA architecture of the specific box I have access to.

My box comes with Intel Xeon E5-4627 processors and, as mentioned by Kevin, this is a 2-hop NUMA architecture.

As a reminder, this is the NUMA configuration on this box:

numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32733 MB
node 0 free: 29452 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32768 MB
node 1 free: 29803 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 32768 MB
node 2 free: 29796 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 32767 MB
node 3 free: 29771 MB
node distances:
node   0   1   2   3
0:  10  21  21  21
1:  21  10  21  21
2:  21  21  10  21
3:  21  21  21  10

We have 4 nodes (0-3), each node having 8 cores (cores 0-31 in total) and 32 GB of local memory.

The purpose of this first test is to measure the memory access penalty of remote (1-hop) accesses, plus the extra penalty of 2-hop accesses (even though the latter does not show up in the node distances output).

Then we will see how to find which NUMA node introduces the 2-hop penalty.

SLOB will be used in my next post in order to test the NUMA impact with an Oracle workload.

For this very basic test, I have used lmbench (http://pkgs.repoforge.org/lmbench/).

lmbench is a suite of micro-benchmarking tools; one of them, lat_mem_rd, will be used to measure memory latency.
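If lmbench is not already on the box, it can be installed from that RepoForge package (a quick sketch, assuming the RepoForge/RPMforge repository is enabled; the binary location may differ depending on how it was built):

# Install the packaged lmbench and locate lat_mem_rd
yum install lmbench
rpm -ql lmbench | grep lat_mem_rd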

So let’s measure latency across the NUMA nodes.

But first, here is a description of lat_mem_rd usage from the Intel document “Measuring Cache and Memory Latency and CPU to Memory Bandwidth”:

lat_mem_rd [depth] [stride]

The [depth] specification indicates how far into memory the utility will measure. In order to ensure an accurate measurement, specify an amount that goes far enough beyond the cache so that it does not factor into the latency measurements. Finally, [stride] is the amount of memory skipped before the next access. If [stride] is not large enough, modern processors are able to prefetch the data, thus producing artificially low latencies for the system memory region. If the stride is too large, the utility will not report correct latencies as it will be skipping past measured intervals. The utility defaults to 128 bytes when [stride] is not specified. Binding to the first core and accessing only up to 256M of RAM, the command line would look as follows:

./lat_mem_rd -N 1 -P 1 256M 512

For the memory latency measurement on the local NUMA node 0, we will use a 20 MB array with a stride size of 256 bytes.

The first column of the output is the memory depth accessed (in MB), and the second column is the measured latency (in nanoseconds).

numactl --membind=0 --cpunodebind=0 ./lat_mem_rd 20 256
"stride=256
depth(MB)  Latency(ns)
0.00049 1.117
0.00098 1.117
.....
0.02344 1.117
0.02539 1.117
0.02734 1.117
0.02930 1.117
0.03125 1.117
0.03516 3.350
0.03906 3.351
0.04297 3.351
0.04688 3.350
0.05078 2.980
0.05469 2.942
0.05859 3.114
0.06250 3.351
0.07031 3.350
.....
0.10938 3.352
0.11719 3.234
0.12500 3.352
0.14062 3.350
0.15625 3.352
0.17188 3.352
0.18750 3.354
0.20312 5.017
0.21875 5.868
0.23438 5.704
0.25000 7.155
0.28125 7.867
0.31250 10.289
0.34375 10.267
......
6.50000 10.843
7.00000 10.843
7.50000 10.841
8.00000 10.841
9.00000 10.838
10.00000 10.839
11.00000 10.949
12.00000 11.139
13.00000 11.409
14.00000 12.978
15.00000 18.131
16.00000 29.493
18.00000 52.400
20.00000 53.660

One of the interesting parts is that, as the memory reads move from the L1 cache (32 KB) to L2 (256 KB) and then L3 (16 MB), we can see the impact on read latency, until we finally hit main memory latency (about 53 ns in this case).
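These transition points can be cross-checked against the cache sizes reported by the hardware (a quick sketch):

# Cache hierarchy as seen by the OS (L1/L2 are per core, L3 is shared per socket)
lscpu | grep -i cache

# Per-level sizes are also exposed through sysfs
grep . /sys/devices/system/cpu/cpu0/cache/index*/size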

Now let’s check the latency when accessing memory on the distant nodes (1, 2, 3) while still binding to CPU node 0, and note the latency reported for the 20 MB depth.

numactl --membind=1 --cpunodebind=0 ./lat_mem_rd 20 256
depth(MB)  Latency(ns)
20.00000 211.483
numactl --membind=2 --cpunodebind=0 ./lat_mem_rd 20 256
depth(MB)  Latency(ns)
20.00000 217.169
numactl --membind=3 --cpunodebind=0 ./lat_mem_rd 20 256
depth(MB)  Latency(ns)
20.00000 248.341
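For convenience, the three measurements above can be wrapped in a small loop (a minimal sketch, keeping the CPU binding on node 0 and showing only the 20 MB data point):

# Latency of a 20 MB array (stride 256) read from CPU node 0
# while the memory is bound to each remote node in turn
for node in 1 2 3; do
  echo "== memory bound to node $node =="
  numactl --membind=$node --cpunodebind=0 ./lat_mem_rd 20 256 2>&1 | tail -1
done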

As you can see, there is extra latency when accessing NUMA node 3 compared to the other distant nodes (around 14% more), so node 3 must be the one reached through the extra hop.

We can conclude that our NUMA configuration can be illustrated like this (2 hops are needed from node 0 to reach memory on NUMA node 3):

[Figure: NUMA topology diagram showing node 0 reaching nodes 1 and 2 in one hop and node 3 in two hops]

In the next post we will “SLOB” all these NUMA nodes and check the impact of the extra hop on Oracle.


NUMA / interleave memory / Oracle


All of the following tests were executed on Oracle 11.2.0.4.

Following some posts about NUMA on Bertrand Drouvot’s website (“cpu binding (processor_group_name) vs Instance caging comparison during LIO pressure” and “Measure the impact of remote versus local NUMA node access thanks to processor_group_name”), some input from Martin Bach’s book (http://goo.gl/lfdkli) and his “Linux large pages and non-uniform memory distribution” post, and Kevin Closson’s posts “Oracle on Opteron with Linux-The NUMA Angle (Part VI). Introducing Cyclops.” and “Oracle11g Automatic Memory Management – Part III. A NUMA Issue.”, I was a little bit confused about how Oracle manages memory on a NUMA system when NUMA support is disabled at the Oracle level (i.e. _enable_NUMA_support set to FALSE) but enabled at the system level.

By default on a Linux system, every process you execute follows the NUMA policy defined at the OS level, which by default allocates memory from the local (current) node:

numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
cpubind: 0 1 2 3
nodebind: 0 1 2 3
membind: 0 1 2 3
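For comparison, any process can be started with an explicitly interleaved policy, and the per-mapping policy is then visible in /proc/<pid>/numa_maps (a quick sketch, nothing Oracle-specific):

# Start a dummy process with its memory interleaved across all nodes
numactl --interleave=all sleep 600 &

# Every mapping of that process now shows an "interleave:0-3" policy
grep interleave /proc/$!/numa_maps | head -3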

So I decided to test the behavior of Oracle in different test cases (using an SGA of 20 GB).

My platform has 4 NUMA nodes with 32 GB of memory each (see the numactl --hardware output in the previous post).

First test :

Start the instance with huge pages and ASMM:
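Huge page usage by the instance can be confirmed from /proc/meminfo (a quick sketch; the HugePages_Free counter should drop by roughly the SGA size once the instance is up):

# System-wide huge page counters, to compare before and after startup
grep -i hugepages /proc/meminfo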

Let’s have a look at memory allocation after startup:

numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32733 MB
node 0 free: 29452 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32768 MB
node 1 free: 29803 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 32768 MB
node 2 free: 29796 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 32767 MB
node 3 free: 29771 MB
node distances:
node   0   1   2   3
0:  10  21  21  21
1:  21  10  21  21
2:  21  21  10  21
3:  21  21  21  10

As you can see, the memory is distributed across the NUMA nodes even though this is not the default policy at the OS level.
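Since this first test uses huge pages, the per-node placement can also be cross-checked directly in sysfs (a quick sketch; with an interleaved SGA the free count should drop by roughly the same amount on every node):

# Per-node huge page counters
grep -i huge /sys/devices/system/node/node*/meminfo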

So Oracle is forcing the memory to be interleaved across the NUMA nodes, and this can be seen via strace when starting the instance.

The following system call is executed:

mbind(1744830464, 21340618752, MPOL_INTERLEAVE, 0xd863da0, 513, 0) = 0
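For reference, one way to capture these calls is to trace the startup from the shell (a minimal sketch; starting the instance from sqlplus with OS authentication and the trace file location are assumptions):

# Trace only the mbind() system calls made during instance startup,
# following the forked server and background processes
strace -f -e trace=mbind -o /tmp/startup_mbind.trc sqlplus / as sysdba <<'EOF'
startup
EOF

grep mbind /tmp/startup_mbind.trc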

The description of this system call is the following:

mbind() sets the NUMA memory policy, which consists of a policy mode
       and zero or more nodes, for the memory range starting with addr and
       continuing for len bytes.  The memory policy defines from which node
       memory is allocated.

So it looks like Oracle sets the memory policy to interleave for the SGA:

The MPOL_INTERLEAVE mode specifies that page allocations are interleaved across the set of nodes specified in nodemask. This optimizes for bandwidth instead of latency by spreading out pages and memory accesses to those pages across multiple nodes. To be effective the memory area should be fairly large, at least 1 MB or bigger, with a fairly uniform access pattern. Accesses to a single page of the area will still be limited to the memory bandwidth of a single node.
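The resulting policy can also be checked on the running instance’s shared memory segments (a sketch; the ora_pmon_ process name pattern and SYSV shared memory segments are assumptions about a typical setup):

# The SGA mappings of a background process should carry an interleave policy
pid=$(pgrep -f ora_pmon_)
grep -E 'interleave|SYSV' /proc/$pid/numa_maps | head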

Second test :

Start the instance without huge pages and with ASMM :

And again, let’s have a look at memory allocation after startup:

numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32733 MB
node 0 free: 26394 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32768 MB
node 1 free: 26272 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 32768 MB
node 2 free: 26619 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 32767 MB
node 3 free: 26347 MB
node distances:
node   0   1   2   3
0:  10  21  21  21
1:  21  10  21  21
2:  21  21  10  21
3:  21  21  21  10

Again, the memory is interleaved across the nodes.

And again, we can see the same mbind system calls.

Third test :

Start the instance with AMM :

Once again the memory is interleaved across the nodes.

And again, the same mbind system calls show up.

So the conclusion is that, in each configuration, Oracle allocates the SGA using an interleaved memory policy.

So, is it possible to switch Oracle back to the default OS NUMA policy? It looks like it is, thanks to the “_enable_NUMA_interleave” hidden parameter, which is set to TRUE by default (thanks to Bertrand for pointing me to that parameter).
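The current value of this hidden parameter can be checked from the x$ tables (a sketch, to be run as SYSDBA):

# List the NUMA-related hidden parameters and their current values
sqlplus -s / as sysdba <<'EOF'
col name  format a35
col value format a10
select i.ksppinm name, v.ksppstvl value
  from x$ksppi i, x$ksppcv v
 where i.indx = v.indx
   and i.ksppinm like '%NUMA%';
EOF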

Let’s test this (I tested it only with the first configuration, i.e. with huge pages in place):

alter system set "_enable_NUMA_interleave"=false scope=spfile;

So now let’s restart the instance and look again at the memory allocation:

numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32733 MB
node 0 free: 28842 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32768 MB
node 1 free: 29091 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 32768 MB
node 2 free: 29580 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 32767 MB
node 3 free: 18022 MB
node distances:
node   0   1   2   3
0:  10  21  21  21
1:  21  10  21  21
2:  21  21  10  21
3:  21  21  21  10

So Oracle switched back to the default OS NUMA policy (look at node 3: a large part of the SGA was allocated from it), and there is no trace of the mbind system calls any more.

To conclude, it seems that, at least in this version, the SGA memory is interleaved across the NUMA nodes by default with every Oracle memory management mode.

But it is still possible (although, I agree, not very useful) to switch to the non-interleaved mode.