2-hop NUMA: where is our last “hop”?


Following Kevin Closson's comments on my first blog post, I dug a little deeper into the NUMA architecture of the specific box I have access to.

My box comes with Intel Xeon E5-4627 processors and, as mentioned by Kevin, this is a 2-hop NUMA architecture: each socket in this family has only two QPI links, so the four sockets form a ring and the farthest socket can only be reached through an intermediate one.

As a reminder, this is the NUMA configuration on this box:

numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32733 MB
node 0 free: 29452 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32768 MB
node 1 free: 29803 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 32768 MB
node 2 free: 29796 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 32767 MB
node 3 free: 29771 MB
node distances:
node   0   1   2   3
0:  10  21  21  21
1:  21  10  21  21
2:  21  21  10  21
3:  21  21  21  10

We have 4 nodes (0-3), each node having 8 cores (CPUs 0-31 in total) and roughly 32 GB of local memory.
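As a cross-check, the same distance matrix (the ACPI SLIT table) can be read straight from sysfs; here is a minimal sketch, assuming the standard Linux /sys/devices/system/node layout:

# Print the kernel's SLIT distance row for each NUMA node
# (the same matrix that numactl --hardware displays).
for node in /sys/devices/system/node/node[0-9]*; do
    echo "$(basename "$node"): $(cat "$node"/distance)"
done

These values come from the firmware, which is exactly why they can hide the 1-hop versus 2-hop difference we are about to measure.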

The purpose of this first test is to measure the memory access penalty for distant 1-hop access, plus the extra penalty for 2-hop access (even though it is not reflected in the node distances output above, where every remote node is reported as 21).

Then we will see how to find which NUMA node introduces the 2-hop penalty.

SLOB will be used in my next post to test the NUMA impact with an Oracle workload.

For this very basic test, I have used lmbench (http://pkgs.repoforge.org/lmbench/).

lmbench is a suite of micro-benchmarking tools, and one of them (called lat_mem_rd) will be used to measure memory latency.

So let’s measure latency across the NUMA nodes.

But first, here is a description of lat_mem_rd usage from the Intel document “Measuring Cache and Memory Latency and CPU to Memory Bandwidth”:

lat_mem_rd [depth] [stride]

The [depth] specification indicates how far into memory the utility will measure.
To ensure an accurate measurement, specify an amount that goes far enough
beyond the cache that caching does not factor into the latency measurements.
[stride] is the amount of memory skipped before the next access. If [stride]
is not large enough, modern processors can prefetch the data, producing
artificially low latencies for the system memory region. If the stride is too
large, the utility will not report correct latencies, as it will be skipping
past measured intervals. The utility defaults to a 128-byte stride when
[stride] is not specified. Binding to the first core and accessing only up to
256M of RAM, the command line looks as follows:
./lat_mem_rd -N 1 -P 1 256M 512

For the memory latency measurement on the local NUMA node 0, we will use a 20 MB array with a stride of 256 bytes.

The first column is the memory depth accessed (in MB), and the second column is the measured latency (in nanoseconds):

numactl --membind=0 --cpunodebind=0 ./lat_mem_rd 20 256
"stride=256
depth(MB)  Latency(ns)
0.00049 1.117
0.00098 1.117
.....
0.02344 1.117
0.02539 1.117
0.02734 1.117
0.02930 1.117
0.03125 1.117
0.03516 3.350
0.03906 3.351
0.04297 3.351
0.04688 3.350
0.05078 2.980
0.05469 2.942
0.05859 3.114
0.06250 3.351
0.07031 3.350
.....
0.10938 3.352
0.11719 3.234
0.12500 3.352
0.14062 3.350
0.15625 3.352
0.17188 3.352
0.18750 3.354
0.20312 5.017
0.21875 5.868
0.23438 5.704
0.25000 7.155
0.28125 7.867
0.31250 10.289
0.34375 10.267
......
6.50000 10.843
7.00000 10.843
7.50000 10.841
8.00000 10.841
9.00000 10.838
10.00000 10.839
11.00000 10.949
12.00000 11.139
13.00000 11.409
14.00000 12.978
15.00000 18.131
16.00000 29.493
18.00000 52.400
20.00000 53.660

One of the interesting parts is that, as the reads move from the L1 cache (32 KB) to L2 (256 KB) to L3 (16 MB), we can see the impact on read latency at each step, until we finally hit main memory latency (about 53 ns in this case).
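To check where those latency steps should land on your own box, the cache sizes the system reports can be listed with getconf; a minimal sketch (the 32 KB / 256 KB / 16 MB figures are specific to this E5-4627):

# List the cache sizes glibc reports, to correlate
# with the latency plateaus in the output above.
getconf -a | grep -i CACHE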

Now let's check the latency when accessing memory on the distant nodes (1, 2, 3) while still binding to CPU node 0, noting the latency at the full 20 MB depth for each:

numactl --membind=1 --cpunodebind=0 ./lat_mem_rd 20 256
depth(MB)  Latency(ns)
20.00000 211.483
numactl --membind=2 --cpunodebind=0 ./lat_mem_rd 20 256
depth(MB)  Latency(ns)
20.00000 217.169
numactl --membind=3 --cpunodebind=0 ./lat_mem_rd 20 256
depth(MB)  Latency(ns)
20.00000 248.341
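Rather than running each combination by hand, the full CPU-node by memory-node latency matrix can be collected with a small loop; a sketch assuming this 4-node box and the same binary (lmbench writes its results to stderr, hence the redirect; tail -1 keeps only the deepest 20 MB sample, i.e. the main memory latency):

# Measure main-memory latency for every CPU-node / memory-node pair.
for cpu in 0 1 2 3; do
  for mem in 0 1 2 3; do
    lat=$(numactl --cpunodebind=$cpu --membind=$mem ./lat_mem_rd 20 256 2>&1 | tail -1)
    echo "cpu node $cpu -> mem node $mem : $lat"
  done
done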

As you can see, NUMA node 3 shows extra latency compared to the other distant nodes (around 14% more), so it must be the node responsible for the extra hop.

We can conclude that our NUMA configuration can be illustrated like this (2 hops are needed from node 0 to reach memory on NUMA node 3):

[Figure: NUMA topology diagram, node 0 reaching node 3 through an intermediate node (2 hops)]

In the next post we will “SLOB” all these NUMA nodes and check the impact of the extra hop on Oracle.


2 thoughts on “2-hop NUMA: where is our last “hop”?”

    kevinclosson said:
    January 20, 2015 at 8:48 pm

    If sockets 1 and 2 were also running lmbench locally (e.g., cpunodebind=1 membind=1, etc.) while doing your 2-hop test, I think the pain would be worse than 14% for the hop. Idle sockets pass only remote lines easily… and in these systems everything shuttles around in cache-line size (64B).


      John Appleby said:
      January 21, 2015 at 12:28 am

      Yes that’s the issue with these 4-socket systems with 2 NUMA lines. Once local links are saturated, you get a network storm and overall network performance deteriorates.

      In the 32S systems with 3 NUMA lines that we have been testing, this is really exacerbated if the software isn’t memory location-aware.

