Tuning 10Gb NICs: highway to hell

If you are trying to achieve maximum performance with 10Gb or 40Gb NICs in RHEL or similar distributions, prepare yourself for a battle.

This article is for experienced users: don't mess with default kernel parameters if you don't know what they are for, and always remember KISS.

# Bandwidth with 10Gb NIC
= 10 Gbps    = 1.25 GB/s = 1.16 GiB/s
= 75 GB/min  = 69.8 GiB/min
= 4.5 TB/hr  = 4.09 TiB/hr
= 108 TB/day = 98.2 TiB/day

TCP Parameter Settings

Default TCP parameters in most Linux distributions are too conservative: they are tuned for 100Mb or 1Gb port speeds and result in buffer sizes that are too small for 10Gb networks. Modifying these values can lead to significant performance gains on 10Gb and 40Gb networks.

RTT and BDP are essential values for TCP tuning. Round Trip Time (RTT) is the total amount of time a packet takes to reach a destination and get back to the source; RTT can be measured using ping.

Bandwidth Delay Product (BDP) is the amount of data that can be in transit at any given time. It is the product of the link bandwidth and the RTT. Assuming a 100ms RTT: BDP = 0.1 s × 10 × 10^9 bit/s = 10^9 bits = 125,000,000 bytes, rounded up to 134217728 bytes (128 MiB) in the settings below. Buffer sizes should be adjusted to permit the maximum number of bytes in transit and prevent traffic throttling.
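
As a quick sanity check, you can measure the RTT and compute the BDP straight from the shell (the peer address 10.168.22.10 below is just a placeholder for one of your hosts):

# ping -c 10 10.168.22.10
# echo "0.1 * 10 * 10^9 / 8" | bc
125000000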

  • kernel parameters

The following kernel values can be set on Linux for 10Gb NICs.

# Maximum receive socket buffer size
net.core.rmem_max = 134217728 

# Maximum send socket buffer size
net.core.wmem_max = 134217728 

# Minimum, initial and max TCP receive buffer size in bytes
net.ipv4.tcp_rmem = 4096 87380 134217728

# Minimum, initial and max TCP send buffer size in bytes
net.ipv4.tcp_wmem = 4096 65536 134217728

# Maximum number of packets queued on the input side
net.core.netdev_max_backlog = 300000 

# Enable receive buffer auto-tuning
net.ipv4.tcp_moderate_rcvbuf = 1

# Don't cache ssthresh from previous connection
net.ipv4.tcp_no_metrics_save = 1

# Hamilton TCP (H-TCP) is a loss-based congestion control algorithm
# that is more aggressive about ramping up to the maximum bandwidth
# (total BDP) and favors hosts with lower RTT / RTT variance
net.ipv4.tcp_congestion_control = htcp

# If you are using jumbo frames, set this to avoid MTU black holes.
net.ipv4.tcp_mtu_probing = 1
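
To keep these values across reboots, the usual approach on RHEL is to drop them into a file under /etc/sysctl.d/ (the file name below is just a suggestion) and load it with sysctl:

# sysctl -p /etc/sysctl.d/10gbe-tuning.conf

A single value can also be tested at runtime before persisting it:

# sysctl -w net.ipv4.tcp_congestion_control=htcp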

If your system is running under netfilter, make sure that your iptables rules don’t interfere with the traffic between servers:

  • netfilter and kernel parameters
# iptables -I INPUT 1 --src 10.168.22.0/24 -j ACCEPT  
# iptables -t raw -I PREROUTING 1 --src 10.168.22.0/24 -j NOTRACK  
# Netfilter should be turned off on bridge devices
net.bridge.bridge-nf-call-iptables=0
net.bridge.bridge-nf-call-arptables=0
net.bridge.bridge-nf-call-ip6tables=0
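
To confirm that traffic from the tuned subnet is really bypassing connection tracking, you can list the raw rules and check the conntrack table (the conntrack utility comes from conntrack-tools and may need to be installed):

# iptables -t raw -L PREROUTING -n
# conntrack -L | grep 10.168.22.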

Jumbo Frames

The default MTU value on Linux is 1500 bytes, while 10Gb and 40Gb ports support much larger frames. In practice an MTU value of 9000 (jumbo frames) is adequate to improve performance and make it more consistent. Remember that if you change the MTU, the new value must be set on every device in the path between the communicating hosts.
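
The MTU can be changed at runtime and verified end to end with a do-not-fragment ping (8972 bytes of payload plus 28 bytes of ICMP/IP headers equals 9000; the target address is a placeholder):

# ip link set dev eth0 mtu 9000
# ping -M do -s 8972 10.168.22.10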

Offload Engine

If the application running on your server requires very low-latency TCP communication, such as the Diameter protocol, you may want to disable the TCP Offload Engine (TOE). Disabling TOE will result in more CPU and PCI usage, but the rate of TCP retransmissions and TCP session timeouts will be greatly reduced.

If you are using the bnx2x kernel module, you can use the following modprobe option to disable the offloading features, but this will require a reboot.

options bnx2x disable_tpa=1
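
A minimal way to make that option persistent, assuming a stock modprobe setup (the file name is just a convention):

# cat <<'EOF'> /etc/modprobe.d/bnx2x.conf
options bnx2x disable_tpa=1
EOF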

Persistent rules

  • offloading, ring buffer and txqueuelen

This shell script will do the magic of adjusting offloading, ring buffer and txqueuelen for all declared interfaces; fit the script to your needs.

# cat <<'EOF'> /sbin/ifup-local
#!/bin/bash
# Tune only the interfaces listed in the first case branch;
# everything else is left untouched.
case "$1" in
eth0|eth1|eth2|eth3)
    echo "Turning off offloading on $1"
    /sbin/ethtool -K $1 tx off rx off tso off gso off lro off
    echo "Adjusting ring buffer size on $1"
    /sbin/ethtool -G $1 rx 4096 tx 4096
    echo "Adjusting txqueuelen on $1"
    /sbin/ip link set dev $1 txqueuelen 10000
    ;;
*)
    ;;
esac
exit 0
EOF
# chmod +x /sbin/ifup-local
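
After the next ifup you can check that the settings took effect (eth0 here stands for one of your tuned interfaces):

# ethtool -k eth0 | grep offload
# ip link show eth0 | grep qlen
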
  • MTU

Configure the MTU size in the ifcfg file of each NIC (e.g. /etc/sysconfig/network-scripts/ifcfg-eth0).

DEVICE=eth0
BOOTPROTO=none
ONBOOT=yes
NETMASK=255.255.255.0
IPADDR=10.168.22.10
USERCTL=no
MTU=9000
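
Restart the interface and confirm the new MTU:

# ifdown eth0 && ifup eth0
# ip link show eth0 | grep mtu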

If you like udev rules, here are some examples.

  • udev rule for txqueuelen
# cat <<EOF>> /etc/udev/rules.d/71-net-txqueuelen.rules
SUBSYSTEM=="net", ACTION=="add", KERNEL=="eth*", ATTR{tx_queue_len}="10000"
EOF

To get the DEVPATH of your NIC, use the following command.

# udevadm info --query=path --path=/sys/class/net/eth0

After getting the DEVPATH, create the udev rule for every NIC.

# cat <<EOF>> /etc/udev/rules.d/71-net-offloading.rules
ACTION=="add", SUBSYSTEM=="net", DEVPATH=="/devices/pci0000:00/0000:00:03.0/0000:02:00.1/net/eth0", RUN+="/sbin/ethtool -K eth0 tx off rx off tso off gso off lro off"
EOF
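
You can reload and re-trigger the rules to test them without a reboot:

# udevadm control --reload-rules
# udevadm trigger --action=add --subsystem-match=net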

Now run your tests and have fun. I hope to see you in the next battle to rule the system.

A big hug to the Jedi Ramon, who guided me along the tuning path.
