Performance tuning: Intel 10-gigabit NIC

By default, Linux networking is configured for general-purpose reliability rather than raw performance. With a 10GbE adapter this is especially apparent: the kernel’s send/receive buffers, TCP memory allocations, and packet backlog are all far too small for optimal throughput. This is where a little testing and tuning can give your NIC a big boost.

There are three performance-tuning changes you can make, as listed in the Intel ixgb driver documentation. Here they are in order of greatest impact:

  1. Enabling jumbo frames on your local host(s) and switch.
  2. Using sysctl to tune kernel settings.
  3. Using setpci to tune PCI settings for the adapter.

Keep in mind that any tuning listed here is only a suggestion. Much of performance tuning consists of changing one setting, then benchmarking to see whether it helped. Your results may vary.

Before starting any benchmarks, you may also want to disable irqbalance and cpuspeed. Pinning interrupt affinity and CPU frequency in place removes two sources of variability and helps you get consistent, repeatable benchmark results.

service irqbalance stop
service cpuspeed stop
chkconfig irqbalance off
chkconfig cpuspeed off
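The service/chkconfig commands above are for SysV-init distributions (RHEL/CentOS 6 and earlier). On a systemd-based system, the irqbalance equivalent would be roughly the following sketch; cpuspeed no longer exists there as such, with frequency scaling typically handled by cpupower or tuned instead.

```shell
# Stop irqbalance for the current session
systemctl stop irqbalance
# Prevent it from starting again at boot
systemctl disable irqbalance
```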

Method #1: jumbo frames

In Linux, setting up jumbo frames is as simple as running a single command, or adding a single line to your interface config.

ifconfig eth2 mtu 9000 txqueuelen 1000 up

For a more permanent change, add this new MTU value to your interface config, replacing “eth2” with your interface name.

vim /etc/sysconfig/network-scripts/ifcfg-eth2
MTU="9000"
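To confirm that jumbo frames actually work end to end (the switch and remote host must support them too), one sketch is a ping with the Don't Fragment bit set and a payload sized just under the MTU. The target address here is a placeholder for another jumbo-enabled host on your network.

```shell
# Payload = MTU minus 20-byte IPv4 header minus 8-byte ICMP header
mtu=9000
payload=$((mtu - 20 - 8))   # 8972 bytes

# -M do sets Don't Fragment, so an MTU mismatch anywhere on the
# path fails loudly instead of silently fragmenting.
# 192.168.1.10 is a placeholder address.
ping -M do -s "$payload" -c 3 192.168.1.10
```

If any hop on the path has a smaller MTU, the ping reports an error instead of succeeding via fragmentation.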

Method #2: sysctl settings

There are several important settings that impact network performance in Linux. These were taken from Mark Wagner’s excellent presentation at the Red Hat Summit in 2008.

Core memory settings:

  • net.core.rmem_max –  max size of rx socket buffer
  • net.core.wmem_max – max size of tx socket buffer
  • net.core.rmem_default – default rx size of socket buffer
  • net.core.wmem_default – default tx size of socket buffer
  • net.core.optmem_max – maximum amount of option memory
  • net.core.netdev_max_backlog – how many unprocessed rx packets before kernel starts to drop them
Here is my modified /etc/sysctl.conf. It can be appended to the default config.

# -- tuning -- #
# Increase system file descriptor limit
fs.file-max = 65535

# Increase system IP port range to allow for more concurrent connections
net.ipv4.ip_local_port_range = 1024 65000

# -- 10gbe tuning from Intel ixgb driver README -- #

# turn off selective ACK and timestamps
net.ipv4.tcp_sack = 0
net.ipv4.tcp_timestamps = 0

# memory allocation min/pressure/max.
# read buffer, write buffer, and buffer space
net.ipv4.tcp_rmem = 10000000 10000000 10000000
net.ipv4.tcp_wmem = 10000000 10000000 10000000
net.ipv4.tcp_mem = 10000000 10000000 10000000

net.core.rmem_max = 524287
net.core.wmem_max = 524287
net.core.rmem_default = 524287
net.core.wmem_default = 524287
net.core.optmem_max = 524287
net.core.netdev_max_backlog = 300000
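After editing the file, the new values can be loaded without a reboot; a quick sketch (run as root):

```shell
# Load the values from /etc/sysctl.conf into the running kernel
sysctl -p

# Spot-check that a setting took effect
sysctl net.core.rmem_max
```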

Method #3: PCI bus tuning

If you want to take your tuning even further, here’s an option to adjust the PCI bus that the NIC is plugged into. The first thing you’ll need is the device’s PCI address, as shown by lspci:

$ lspci
07:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)

Here 07:00.0 is the PCI bus address. Now we can grep for that in /proc/bus/pci/devices to gather even more information.

$ grep 0700 /proc/bus/pci/devices
0700  808610fb  28  d590000c  0  ecc1  0  d58f800c  0  0  80000  0  20  0  4000  0  0  ixgbe

Various information about the PCI device is displayed, as you can see above. The number we’re interested in is the second field, 808610fb: the vendor ID and device ID concatenated (vendor 8086, device 10fb). You can use these values with setpci to tune the PCI bus MMRBC, or Maximum Memory Read Byte Count.
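If the grep output is hard to read, a simpler alternative is to ask lspci for the numeric IDs directly, using the slot address found earlier:

```shell
# -n prints numeric vendor:device IDs instead of names;
# -s restricts the output to the one slot we care about
lspci -n -s 07:00.0
```

The vendor:device pair (here 8086:10fb) appears in the output.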

This will increase the MMRBC to 4k reads, increasing the transmit burst lengths on the bus.

setpci -v -d 8086:10fb e6.b=2e

About this command:
the -d option selects the device by vendor and device ID (rather than by bus address);
e6.b is the single-byte register at configuration-space offset 0xE6 (the PCI-X Command Register);
and 2e is the value to be written.

These are the other possible values for this register (although the one listed above, 2e, is recommended by the Intel ixgbe documentation).

Value  MMRBC (bytes)
22     512 (default)
26     1024
2a     2048
2e     4096
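Since setpci writes take effect immediately and are easy to get wrong, it is worth reading the register before and after the change so you can confirm it and roll it back if needed:

```shell
# Read the current byte at config-space offset 0xE6
# (omitting the =value part performs a read instead of a write)
setpci -v -d 8086:10fb e6.b
```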

And finally, testing

Testing should really be done after each configuration change, but for the sake of brevity I’ll just show the before and after results. The benchmarking tools used were iperf and netperf.
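For reference, a typical iperf TCP run like the ones below might be invoked as follows (the address is a placeholder; the exact commands used aren't shown in this post):

```shell
# On the receiving host: start an iperf server
iperf -s

# On the sending host: 100-second TCP test against the receiver,
# reporting every 10 seconds (placeholder IP)
iperf -c 192.168.1.10 -t 100 -i 10
```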

Here’s how your 10GbE NIC might perform before tuning. First iperf:

 [  3]  0.0-100.0 sec   54.7 GBytes  4.70 Gbits/sec

Then netperf:

bytes  bytes   bytes    secs.    10^6bits/sec
87380 16384 16384    60.00    5012.24

 

And after tuning. iperf:

 [  3]  0.0-100.0 sec   115 GBytes  9.90 Gbits/sec

netperf:

bytes  bytes   bytes    secs.    10^6bits/sec
10000000 10000000 10000000    30.01    9908.08

Wow! What a difference a little tuning makes. I’ve seen great results on my Hadoop HDFS cluster after spending just a couple of hours getting to know my server’s network hardware. Whatever your application for 10GbE might be, this should benefit you as well.

Comments (4)

  1. i
     I use these settings, but only get 4 Gbps+ with iperf.

  2. cb40
     Great read. I went from 4.3 to 9.3 using this article on my CentOS 7 box, and the one from Intel for tuning Windows on my Windows 10 PC. I have not done the bus tuning you mentioned yet.

     # iperf3 -c 192.168.80.5 -p 5201
     Connecting to host 192.168.80.5, port 5201
     [  4] local 192.168.80.100 port 30801 connected to 192.168.80.5 port 5201
     [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
     [  4]   0.00-1.00   sec  1.09 GBytes  9.33 Gbits/sec    0    402 KBytes
     [  4]   1.00-2.00   sec  1.08 GBytes  9.28 Gbits/sec    0    402 KBytes
     [  4]   2.00-3.00   sec  1.05 GBytes  9.01 Gbits/sec    0    411 KBytes
     [  4]   3.00-4.00   sec  1.09 GBytes  9.33 Gbits/sec    0    411 KBytes
     [  4]   4.00-5.00   sec  1.08 GBytes  9.31 Gbits/sec    0    411 KBytes
     [  4]   5.00-6.00   sec  1.08 GBytes  9.30 Gbits/sec    0    411 KBytes
     [  4]   6.00-7.00   sec  1.09 GBytes  9.35 Gbits/sec    0    411 KBytes
     [  4]   7.00-8.00   sec  1.09 GBytes  9.34 Gbits/sec    0    411 KBytes
     [  4]   8.00-9.00   sec  1.09 GBytes  9.35 Gbits/sec    0    411 KBytes
     [  4]   9.00-10.00  sec  1.09 GBytes  9.35 Gbits/sec    0    411 KBytes
     - - - - - - - - - - - - - - - - - - - - - - - - -
     [ ID] Interval           Transfer     Bandwidth       Retr
     [  4]   0.00-10.00  sec  10.8 GBytes  9.30 Gbits/sec    0          sender
     [  4]   0.00-10.00  sec  10.8 GBytes  9.29 Gbits/sec               receiver

  3. ranchu
     Hi, which tool do you recommend for testing a 10GbE interface? iperf / netperf / pktgen / ostinato? Thanks, Ran

  4. mateus
     Great, I got a very good improvement on TX. Is there a way to do the same with RX?
