123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445 |
- Scaling in the Linux Networking Stack
- Introduction
- ============
- This document describes a set of complementary techniques in the Linux
- networking stack to increase parallelism and improve performance for
- multi-processor systems.
- The following technologies are described:
- RSS: Receive Side Scaling
- RPS: Receive Packet Steering
- RFS: Receive Flow Steering
- Accelerated Receive Flow Steering
- XPS: Transmit Packet Steering
- RSS: Receive Side Scaling
- =========================
- Contemporary NICs support multiple receive and transmit descriptor queues
- (multi-queue). On reception, a NIC can send different packets to different
- queues to distribute processing among CPUs. The NIC distributes packets by
- applying a filter to each packet that assigns it to one of a small number
- of logical flows. Packets for each flow are steered to a separate receive
- queue, which in turn can be processed by separate CPUs. This mechanism is
- generally known as “Receive-side Scaling” (RSS). The goal of RSS and
- the other scaling techniques is to increase performance uniformly.
- Multi-queue distribution can also be used for traffic prioritization, but
- that is not the focus of these techniques.
- The filter used in RSS is typically a hash function over the network
- and/or transport layer headers-- for example, a 4-tuple hash over
- IP addresses and TCP ports of a packet. The most common hardware
- implementation of RSS uses a 128-entry indirection table where each entry
- stores a queue number. The receive queue for a packet is determined
- by masking out the low order seven bits of the computed hash for the
- packet (usually a Toeplitz hash), taking this number as a key into the
- indirection table and reading the corresponding value.
- Some advanced NICs allow steering packets to queues based on
- programmable filters. For example, webserver bound TCP port 80 packets
- can be directed to their own receive queue. Such “n-tuple” filters can
- be configured from ethtool (--config-ntuple).
- ==== RSS Configuration
- The driver for a multi-queue capable NIC typically provides a kernel
- module parameter for specifying the number of hardware queues to
- configure. In the bnx2x driver, for instance, this parameter is called
- num_queues. A typical RSS configuration would be to have one receive queue
- for each CPU if the device supports enough queues, or otherwise at least
- one for each memory domain, where a memory domain is a set of CPUs that
- share a particular memory level (L1, L2, NUMA node, etc.).
- The indirection table of an RSS device, which resolves a queue by masked
- hash, is usually programmed by the driver at initialization. The
- default mapping is to distribute the queues evenly in the table, but the
- indirection table can be retrieved and modified at runtime using ethtool
- commands (--show-rxfh-indir and --set-rxfh-indir). Modifying the
- indirection table could be done to give different queues different
- relative weights.
- == RSS IRQ Configuration
- Each receive queue has a separate IRQ associated with it. The NIC triggers
- this to notify a CPU when new packets arrive on the given queue. The
- signaling path for PCIe devices uses message signaled interrupts (MSI-X),
- that can route each interrupt to a particular CPU. The active mapping
- of queues to IRQs can be determined from /proc/interrupts. By default,
- an IRQ may be handled on any CPU. Because a non-negligible part of packet
- processing takes place in receive interrupt handling, it is advantageous
- to spread receive interrupts between CPUs. To manually adjust the IRQ
- affinity of each interrupt see Documentation/IRQ-affinity.txt. Some systems
- will be running irqbalance, a daemon that dynamically optimizes IRQ
- assignments and as a result may override any manual settings.
- == Suggested Configuration
- RSS should be enabled when latency is a concern or whenever receive
- interrupt processing forms a bottleneck. Spreading load between CPUs
- decreases queue length. For low latency networking, the optimal setting
- is to allocate as many queues as there are CPUs in the system (or the
- NIC maximum, if lower). The most efficient high-rate configuration
- is likely the one with the smallest number of receive queues where no
- receive queue overflows due to a saturated CPU, because in default
- mode with interrupt coalescing enabled, the aggregate number of
- interrupts (and thus work) grows with each additional queue.
- Per-cpu load can be observed using the mpstat utility, but note that on
- processors with hyperthreading (HT), each hyperthread is represented as
- a separate CPU. For interrupt handling, HT has shown no benefit in
- initial tests, so limit the number of queues to the number of CPU cores
- in the system.
- RPS: Receive Packet Steering
- ============================
- Receive Packet Steering (RPS) is logically a software implementation of
- RSS. Being in software, it is necessarily called later in the datapath.
- Whereas RSS selects the queue and hence CPU that will run the hardware
- interrupt handler, RPS selects the CPU to perform protocol processing
- above the interrupt handler. This is accomplished by placing the packet
- on the desired CPU’s backlog queue and waking up the CPU for processing.
- RPS has some advantages over RSS: 1) it can be used with any NIC,
- 2) software filters can easily be added to hash over new protocols,
- 3) it does not increase hardware device interrupt rate (although it does
- introduce inter-processor interrupts (IPIs)).
- RPS is called during bottom half of the receive interrupt handler, when
- a driver sends a packet up the network stack with netif_rx() or
- netif_receive_skb(). These call the get_rps_cpu() function, which
- selects the queue that should process a packet.
- The first step in determining the target CPU for RPS is to calculate a
- flow hash over the packet’s addresses or ports (2-tuple or 4-tuple hash
- depending on the protocol). This serves as a consistent hash of the
- associated flow of the packet. The hash is either provided by hardware
- or will be computed in the stack. Capable hardware can pass the hash in
- the receive descriptor for the packet; this would usually be the same
- hash used for RSS (e.g. computed Toeplitz hash). The hash is saved in
- skb->rx_hash and can be used elsewhere in the stack as a hash of the
- packet’s flow.
- Each receive hardware queue has an associated list of CPUs to which
- RPS may enqueue packets for processing. For each received packet,
- an index into the list is computed from the flow hash modulo the size
- of the list. The indexed CPU is the target for processing the packet,
- and the packet is queued to the tail of that CPU’s backlog queue. At
- the end of the bottom half routine, IPIs are sent to any CPUs for which
- packets have been queued to their backlog queue. The IPI wakes backlog
- processing on the remote CPU, and any queued packets are then processed
- up the networking stack.
- ==== RPS Configuration
- RPS requires a kernel compiled with the CONFIG_RPS kconfig symbol (on
- by default for SMP). Even when compiled in, RPS remains disabled until
- explicitly configured. The list of CPUs to which RPS may forward traffic
- can be configured for each receive queue using a sysfs file entry:
- /sys/class/net/<dev>/queues/rx-<n>/rps_cpus
- This file implements a bitmap of CPUs. RPS is disabled when it is zero
- (the default), in which case packets are processed on the interrupting
- CPU. Documentation/IRQ-affinity.txt explains how CPUs are assigned to
- the bitmap.
- == Suggested Configuration
- For a single queue device, a typical RPS configuration would be to set
- the rps_cpus to the CPUs in the same memory domain of the interrupting
- CPU. If NUMA locality is not an issue, this could also be all CPUs in
- the system. At high interrupt rate, it might be wise to exclude the
- interrupting CPU from the map since that already performs much work.
- For a multi-queue system, if RSS is configured so that a hardware
- receive queue is mapped to each CPU, then RPS is probably redundant
- and unnecessary. If there are fewer hardware queues than CPUs, then
- RPS might be beneficial if the rps_cpus for each queue are the ones that
- share the same memory domain as the interrupting CPU for that queue.
- ==== RPS Flow Limit
- RPS scales kernel receive processing across CPUs without introducing
- reordering. The trade-off to sending all packets from the same flow
- to the same CPU is CPU load imbalance if flows vary in packet rate.
- In the extreme case a single flow dominates traffic. Especially on
- common server workloads with many concurrent connections, such
- behavior indicates a problem such as a misconfiguration or spoofed
- source Denial of Service attack.
- Flow Limit is an optional RPS feature that prioritizes small flows
- during CPU contention by dropping packets from large flows slightly
- ahead of those from small flows. It is active only when an RPS or RFS
- destination CPU approaches saturation. Once a CPU's input packet
- queue exceeds half the maximum queue length (as set by sysctl
- net.core.netdev_max_backlog), the kernel starts a per-flow packet
- count over the last 256 packets. If a flow exceeds a set ratio (by
- default, half) of these packets when a new packet arrives, then the
- new packet is dropped. Packets from other flows are still only
- dropped once the input packet queue reaches netdev_max_backlog.
- No packets are dropped when the input packet queue length is below
- the threshold, so flow limit does not sever connections outright:
- even large flows maintain connectivity.
- == Interface
- Flow limit is compiled in by default (CONFIG_NET_FLOW_LIMIT), but not
- turned on. It is implemented for each CPU independently (to avoid lock
- and cache contention) and toggled per CPU by setting the relevant bit
- in sysctl net.core.flow_limit_cpu_bitmap. It exposes the same CPU
- bitmap interface as rps_cpus (see above) when called from procfs:
- /proc/sys/net/core/flow_limit_cpu_bitmap
- Per-flow rate is calculated by hashing each packet into a hashtable
- bucket and incrementing a per-bucket counter. The hash function is
- the same that selects a CPU in RPS, but as the number of buckets can
- be much larger than the number of CPUs, flow limit has finer-grained
- identification of large flows and fewer false positives. The default
- table has 4096 buckets. This value can be modified through sysctl
- net.core.flow_limit_table_len
- The value is only consulted when a new table is allocated. Modifying
- it does not update active tables.
- == Suggested Configuration
- Flow limit is useful on systems with many concurrent connections,
- where a single connection taking up 50% of a CPU indicates a problem.
- In such environments, enable the feature on all CPUs that handle
- network rx interrupts (as set in /proc/irq/N/smp_affinity).
- The feature depends on the input packet queue length to exceed
- the flow limit threshold (50%) + the flow history length (256).
- Setting net.core.netdev_max_backlog to either 1000 or 10000
- performed well in experiments.
- RFS: Receive Flow Steering
- ==========================
- While RPS steers packets solely based on hash, and thus generally
- provides good load distribution, it does not take into account
- application locality. This is accomplished by Receive Flow Steering
- (RFS). The goal of RFS is to increase datacache hitrate by steering
- kernel processing of packets to the CPU where the application thread
- consuming the packet is running. RFS relies on the same RPS mechanisms
- to enqueue packets onto the backlog of another CPU and to wake up that
- CPU.
- In RFS, packets are not forwarded directly by the value of their hash,
- but the hash is used as index into a flow lookup table. This table maps
- flows to the CPUs where those flows are being processed. The flow hash
- (see RPS section above) is used to calculate the index into this table.
- The CPU recorded in each entry is the one which last processed the flow.
- If an entry does not hold a valid CPU, then packets mapped to that entry
- are steered using plain RPS. Multiple table entries may point to the
- same CPU. Indeed, with many flows and few CPUs, it is very likely that
- a single application thread handles flows with many different flow hashes.
- rps_sock_flow_table is a global flow table that contains the *desired* CPU
- for flows: the CPU that is currently processing the flow in userspace.
- Each table value is a CPU index that is updated during calls to recvmsg
- and sendmsg (specifically, inet_recvmsg(), inet_sendmsg(), inet_sendpage()
- and tcp_splice_read()).
- When the scheduler moves a thread to a new CPU while it has outstanding
- receive packets on the old CPU, packets may arrive out of order. To
- avoid this, RFS uses a second flow table to track outstanding packets
- for each flow: rps_dev_flow_table is a table specific to each hardware
- receive queue of each device. Each table value stores a CPU index and a
- counter. The CPU index represents the *current* CPU onto which packets
- for this flow are enqueued for further kernel processing. Ideally, kernel
- and userspace processing occur on the same CPU, and hence the CPU index
- in both tables is identical. This is likely false if the scheduler has
- recently migrated a userspace thread while the kernel still has packets
- enqueued for kernel processing on the old CPU.
- The counter in rps_dev_flow_table values records the length of the current
- CPU's backlog when a packet in this flow was last enqueued. Each backlog
- queue has a head counter that is incremented on dequeue. A tail counter
- is computed as head counter + queue length. In other words, the counter
- in rps_dev_flow[i] records the last element in flow i that has
- been enqueued onto the currently designated CPU for flow i (of course,
- entry i is actually selected by hash and multiple flows may hash to the
- same entry i).
- And now the trick for avoiding out of order packets: when selecting the
- CPU for packet processing (from get_rps_cpu()) the rps_sock_flow table
- and the rps_dev_flow table of the queue that the packet was received on
- are compared. If the desired CPU for the flow (found in the
- rps_sock_flow table) matches the current CPU (found in the rps_dev_flow
- table), the packet is enqueued onto that CPU’s backlog. If they differ,
- the current CPU is updated to match the desired CPU if one of the
- following is true:
- - The current CPU's queue head counter >= the recorded tail counter
- value in rps_dev_flow[i]
- - The current CPU is unset (>= nr_cpu_ids)
- - The current CPU is offline
- After this check, the packet is sent to the (possibly updated) current
- CPU. These rules aim to ensure that a flow only moves to a new CPU when
- there are no packets outstanding on the old CPU, as the outstanding
- packets could arrive later than those about to be processed on the new
- CPU.
- ==== RFS Configuration
- RFS is only available if the kconfig symbol CONFIG_RPS is enabled (on
- by default for SMP). The functionality remains disabled until explicitly
- configured. The number of entries in the global flow table is set through:
- /proc/sys/net/core/rps_sock_flow_entries
- The number of entries in the per-queue flow table are set through:
- /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt
- == Suggested Configuration
- Both of these need to be set before RFS is enabled for a receive queue.
- Values for both are rounded up to the nearest power of two. The
- suggested flow count depends on the expected number of active connections
- at any given time, which may be significantly less than the number of open
- connections. We have found that a value of 32768 for rps_sock_flow_entries
- works fairly well on a moderately loaded server.
- For a single queue device, the rps_flow_cnt value for the single queue
- would normally be configured to the same value as rps_sock_flow_entries.
- For a multi-queue device, the rps_flow_cnt for each queue might be
- configured as rps_sock_flow_entries / N, where N is the number of
- queues. So for instance, if rps_sock_flow_entries is set to 32768 and there
- are 16 configured receive queues, rps_flow_cnt for each queue might be
- configured as 2048.
- Accelerated RFS
- ===============
- Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated load
- balancing mechanism that uses soft state to steer flows based on where
- the application thread consuming the packets of each flow is running.
- Accelerated RFS should perform better than RFS since packets are sent
- directly to a CPU local to the thread consuming the data. The target CPU
- will either be the same CPU where the application runs, or at least a CPU
- which is local to the application thread’s CPU in the cache hierarchy.
- To enable accelerated RFS, the networking stack calls the
- ndo_rx_flow_steer driver function to communicate the desired hardware
- queue for packets matching a particular flow. The network stack
- automatically calls this function every time a flow entry in
- rps_dev_flow_table is updated. The driver in turn uses a device specific
- method to program the NIC to steer the packets.
- The hardware queue for a flow is derived from the CPU recorded in
- rps_dev_flow_table. The stack consults a CPU to hardware queue map which
- is maintained by the NIC driver. This is an auto-generated reverse map of
- the IRQ affinity table shown by /proc/interrupts. Drivers can use
- functions in the cpu_rmap (“CPU affinity reverse map”) kernel library
- to populate the map. For each CPU, the corresponding queue in the map is
- set to be one whose processing CPU is closest in cache locality.
- ==== Accelerated RFS Configuration
- Accelerated RFS is only available if the kernel is compiled with
- CONFIG_RFS_ACCEL and support is provided by the NIC device and driver.
- It also requires that ntuple filtering is enabled via ethtool. The map
- of CPU to queues is automatically deduced from the IRQ affinities
- configured for each receive queue by the driver, so no additional
- configuration should be necessary.
- == Suggested Configuration
- This technique should be enabled whenever one wants to use RFS and the
- NIC supports hardware acceleration.
- XPS: Transmit Packet Steering
- =============================
- Transmit Packet Steering is a mechanism for intelligently selecting
- which transmit queue to use when transmitting a packet on a multi-queue
- device. To accomplish this, a mapping from CPU to hardware queue(s) is
- recorded. The goal of this mapping is usually to assign queues
- exclusively to a subset of CPUs, where the transmit completions for
- these queues are processed on a CPU within this set. This choice
- provides two benefits. First, contention on the device queue lock is
- significantly reduced since fewer CPUs contend for the same queue
- (contention can be eliminated completely if each CPU has its own
- transmit queue). Secondly, cache miss rate on transmit completion is
- reduced, in particular for data cache lines that hold the sk_buff
- structures.
- XPS is configured per transmit queue by setting a bitmap of CPUs that
- may use that queue to transmit. The reverse mapping, from CPUs to
- transmit queues, is computed and maintained for each network device.
- When transmitting the first packet in a flow, the function
- get_xps_queue() is called to select a queue. This function uses the ID
- of the running CPU as a key into the CPU-to-queue lookup table. If the
- ID matches a single queue, that is used for transmission. If multiple
- queues match, one is selected by using the flow hash to compute an index
- into the set.
- The queue chosen for transmitting a particular flow is saved in the
- corresponding socket structure for the flow (e.g. a TCP connection).
- This transmit queue is used for subsequent packets sent on the flow to
- prevent out of order (ooo) packets. The choice also amortizes the cost
- of calling get_xps_queues() over all packets in the flow. To avoid
- ooo packets, the queue for a flow can subsequently only be changed if
- skb->ooo_okay is set for a packet in the flow. This flag indicates that
- there are no outstanding packets in the flow, so the transmit queue can
- change without the risk of generating out of order packets. The
- transport layer is responsible for setting ooo_okay appropriately. TCP,
- for instance, sets the flag when all data for a connection has been
- acknowledged.
- ==== XPS Configuration
- XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
- default for SMP). The functionality remains disabled until explicitly
- configured. To enable XPS, the bitmap of CPUs that may use a transmit
- queue is configured using the sysfs file entry:
- /sys/class/net/<dev>/queues/tx-<n>/xps_cpus
- == Suggested Configuration
- For a network device with a single transmission queue, XPS configuration
- has no effect, since there is no choice in this case. In a multi-queue
- system, XPS is preferably configured so that each CPU maps onto one queue.
- If there are as many queues as there are CPUs in the system, then each
- queue can also map onto one CPU, resulting in exclusive pairings that
- experience no contention. If there are fewer queues than CPUs, then the
- best CPUs to share a given queue are probably those that share the cache
- with the CPU that processes transmit completions for that queue
- (transmit interrupts).
- Per TX Queue rate limitation:
- =============================
- These are rate-limitation mechanisms implemented by HW, where currently
- a max-rate attribute is supported, by setting a Mbps value to
- /sys/class/net/<dev>/queues/tx-<n>/tx_maxrate
- A value of zero means disabled, and this is the default.
- Further Information
- ===================
- RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into
- 2.6.38. Original patches were submitted by Tom Herbert
- (therbert@google.com)
- Accelerated RFS was introduced in 2.6.35. Original patches were
- submitted by Ben Hutchings (bwh@kernel.org)
- Authors:
- Tom Herbert (therbert@google.com)
- Willem de Bruijn (willemb@google.com)
|