- --------------------------------------------------------------------------------
- + ABSTRACT
- --------------------------------------------------------------------------------
- This file documents the mmap() facility available with the PACKET
- socket interface on 2.4/2.6/3.x kernels. This type of socket is used to
- i) capture network traffic with utilities like tcpdump, ii) transmit network
- traffic, or for any other purpose that needs raw access to a network interface.
- You can find the latest version of this document at:
- http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap
- Howto can be found at:
- http://wiki.gnu-log.net (packet_mmap)
- Please send your comments to
- Ulisses Alonso Camaró <[email protected]>
- Johann Baudy <[email protected]>
- -------------------------------------------------------------------------------
- + Why use PACKET_MMAP
- --------------------------------------------------------------------------------
- In Linux 2.4/2.6/3.x, if PACKET_MMAP is not enabled, the capture process is very
- inefficient. It uses very limited buffers and requires one system call to
- capture each packet, or two if you also want the packet's timestamp
- (as libpcap always does).
- On the other hand, PACKET_MMAP is very efficient. PACKET_MMAP provides a
- size-configurable circular buffer mapped in user space that can be used to either
- send or receive packets. This way, reading packets just requires waiting for them;
- most of the time there is no need to issue a single system call. Concerning
- transmission, multiple packets can be sent through one system call to get the
- highest bandwidth. Using a shared buffer between the kernel and the user
- also has the benefit of minimizing packet copies.
- It's fine to use PACKET_MMAP to improve the performance of the capture and
- transmission process, but it isn't everything. At least, if you are capturing
- at high speeds (this is relative to the cpu speed), you should check if the
- device driver of your network interface card supports some sort of interrupt
- load mitigation or (even better) if it supports NAPI; also make sure it is
- enabled. For transmission, check the MTU (Maximum Transmission Unit) used and
- supported by devices of your network. CPU IRQ pinning of your network interface
- card can also be an advantage.
- --------------------------------------------------------------------------------
- + How to use mmap() to improve capture process
- --------------------------------------------------------------------------------
- From the user standpoint, you should use the higher level libpcap library, which
- is a de facto standard, portable across nearly all operating systems
- including Win32.
- That said, at the time of this writing, the official libpcap 0.8.1 is out and does not
- include support for PACKET_MMAP, and the libpcap included in your distribution
- probably does not either.
- I'm aware of two implementations of PACKET_MMAP in libpcap:
- http://wiki.ipxwarzone.com/ (by Simon Patarin, based on libpcap 0.6.2)
- http://public.lanl.gov/cpw/ (by Phil Wood, based on latest libpcap)
- The rest of this document is intended for people who want to understand
- the low level details or want to improve libpcap by including PACKET_MMAP
- support.
- --------------------------------------------------------------------------------
- + How to use mmap() directly to improve capture process
- --------------------------------------------------------------------------------
- From the system calls standpoint, the use of PACKET_MMAP involves
- the following process:
- [setup]     socket() -------> creation of the capture socket
-             setsockopt() ---> allocation of the circular buffer (ring)
-                               option: PACKET_RX_RING
-             mmap() ---------> mapping of the allocated buffer to the
-                               user process
- [capture]   poll() ---------> to wait for incoming packets
- [shutdown]  close() --------> destruction of the capture socket and
-                               deallocation of all associated
-                               resources.
- Socket creation and destruction are straightforward, and are done
- the same way with or without PACKET_MMAP:
- int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL));
- where mode is SOCK_RAW for the raw interface where link level
- information can be captured or SOCK_DGRAM for the cooked
- interface where link level information capture is not
- supported and a link level pseudo-header is provided
- by the kernel.
- The destruction of the socket and all associated resources
- is done by a simple call to close(fd).
- As without PACKET_MMAP, it is possible to use one socket
- for capture and transmission. This can be done by mapping the
- allocated RX and TX buffer ring with a single mmap() call.
- See "Mapping and use of the circular buffer (ring)".
- Next I will describe the PACKET_MMAP settings and their constraints,
- as well as the mapping of the circular buffer in the user process and
- the use of this buffer.
- --------------------------------------------------------------------------------
- + How to use mmap() directly to improve transmission process
- --------------------------------------------------------------------------------
- The transmission process is similar to capture, as shown below.
- [setup]          socket() -------> creation of the transmission socket
-                  setsockopt() ---> allocation of the circular buffer (ring)
-                                    option: PACKET_TX_RING
-                  bind() ---------> bind transmission socket with a network interface
-                  mmap() ---------> mapping of the allocated buffer to the
-                                    user process
- [transmission]   poll() ---------> wait for free packets (optional)
-                  send() ---------> send all packets that are set as ready in
-                                    the ring
-                                    The flag MSG_DONTWAIT can be used to return
-                                    before end of transfer.
- [shutdown]       close() --------> destruction of the transmission socket and
-                                    deallocation of all associated resources.
- Socket creation and destruction are also straightforward, and are done
- the same way as for capture, described in the previous section:
- int fd = socket(PF_PACKET, mode, 0);
- The protocol can optionally be 0 in case we only want to transmit
- via this socket, which avoids an expensive call to packet_rcv().
- In this case, you also need to bind(2) the TX_RING with sll_protocol = 0
- set. Otherwise, use htons(ETH_P_ALL) or any other protocol.
- Binding the socket to your network interface is mandatory (with zero copy) to
- know the header size of frames used in the circular buffer.
- As with capture, each frame contains two parts:
-  --------------------
- | struct tpacket_hdr | Header. It contains the status of
- |                    | this frame
- |--------------------|
- | data buffer        |
- .                    . Data that will be sent over the network interface.
- .                    .
-  --------------------
- bind() associates the socket with your network interface through the
- sll_ifindex field of struct sockaddr_ll.
- Initialization example:
- struct sockaddr_ll my_addr;
- struct ifreq s_ifr;
- ...
- strncpy (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name));
- /* get interface index of eth0 */
- ioctl(this->socket, SIOCGIFINDEX, &s_ifr);
- /* fill sockaddr_ll struct to prepare binding */
- my_addr.sll_family = AF_PACKET;
- my_addr.sll_protocol = htons(ETH_P_ALL);
- my_addr.sll_ifindex = s_ifr.ifr_ifindex;
- /* bind socket to eth0 */
- bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll));
- A complete tutorial is available at: http://wiki.gnu-log.net/
- By default, the user should put data at:
- frame base + TPACKET_HDRLEN - sizeof(struct sockaddr_ll)
- So, whatever you choose for the socket mode (SOCK_DGRAM or SOCK_RAW),
- the beginning of the user data will be at:
- frame base + TPACKET_ALIGN(sizeof(struct tpacket_hdr))
- If you wish to put user data at a custom offset from the beginning of
- the frame (for payload alignment with SOCK_RAW mode for instance) you
- can set tp_net (with SOCK_DGRAM) or tp_mac (with SOCK_RAW). In order
- to make this work it must be enabled previously with setsockopt()
- and the PACKET_TX_HAS_OFF option.
- --------------------------------------------------------------------------------
- + PACKET_MMAP settings
- --------------------------------------------------------------------------------
- Setting up PACKET_MMAP from user level code is done with a call like
- - Capture process
- setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req))
- - Transmission process
- setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req))
- The most significant argument in the previous call is the req parameter.
- This parameter must have the following structure:
- struct tpacket_req
- {
- unsigned int tp_block_size; /* Minimal size of contiguous block */
- unsigned int tp_block_nr; /* Number of blocks */
- unsigned int tp_frame_size; /* Size of frame */
- unsigned int tp_frame_nr; /* Total number of frames */
- };
- This structure is defined in /usr/include/linux/if_packet.h and establishes a
- circular buffer (ring) of unswappable memory.
- Mapping it into the capture process allows reading the captured frames and
- related meta-information like timestamps without requiring a system call.
- Frames are grouped in blocks. Each block is a physically contiguous
- region of memory and holds tp_block_size/tp_frame_size frames. The total number
- of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because
- frames_per_block = tp_block_size/tp_frame_size
- Indeed, packet_set_ring checks that the following condition is true:
- frames_per_block * tp_block_nr == tp_frame_nr
- Let's see an example, with the following values:
- tp_block_size= 4096
- tp_frame_size= 2048
- tp_block_nr = 4
- tp_frame_nr = 8
- we will get the following buffer structure:
-         block #1                 block #2
- +---------+---------+    +---------+---------+
- | frame 1 | frame 2 |    | frame 3 | frame 4 |
- +---------+---------+    +---------+---------+
-         block #3                 block #4
- +---------+---------+    +---------+---------+
- | frame 5 | frame 6 |    | frame 7 | frame 8 |
- +---------+---------+    +---------+---------+
- A frame can be of any size, with the only condition that it fits in a block. A block
- can only hold an integer number of frames, or in other words, a frame cannot
- span across two blocks, so there are some details you have to take into
- account when choosing the frame_size. See "Mapping and use of the circular
- buffer (ring)".
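The relation between the four fields can be sketched in C. This is only an illustration under the rules stated above; fill_ring_req is a hypothetical helper name, not something provided by the kernel or by libpcap.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/if_packet.h>

/*
 * Hypothetical helper: fill a struct tpacket_req so that tp_frame_nr
 * satisfies frames_per_block * tp_block_nr == tp_frame_nr.
 * Returns 0 on success, -1 if a frame cannot fit in a block.
 */
static int fill_ring_req(struct tpacket_req *req, unsigned int block_size,
                         unsigned int block_nr, unsigned int frame_size)
{
    unsigned int frames_per_block;

    assert(req != NULL);

    if (frame_size == 0 || frame_size > block_size)
        return -1;

    /* Integer division: a frame never spans two blocks. */
    frames_per_block = block_size / frame_size;

    req->tp_block_size = block_size;
    req->tp_block_nr   = block_nr;
    req->tp_frame_size = frame_size;
    req->tp_frame_nr   = frames_per_block * block_nr;
    return 0;
}
```

With the example values (block size 4096, 4 blocks, frame size 2048) the helper sets tp_frame_nr to 8, matching the layout drawn above.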
- --------------------------------------------------------------------------------
- + PACKET_MMAP setting constraints
- --------------------------------------------------------------------------------
- In kernel versions prior to 2.4.26 (for the 2.4 branch) and 2.6.5 (2.6 branch),
- the PACKET_MMAP buffer could hold only 32768 frames in a 32 bit architecture or
- 16384 in a 64 bit architecture. For information on these kernel versions
- see http://pusa.uv.es/~ulisses/packet_mmap/packet_mmap.pre-2.4.26_2.6.5.txt
- Block size limit
- ------------------
- As stated earlier, each block is a contiguous physical region of memory. These
- memory regions are allocated with calls to the __get_free_pages() function. As
- the name indicates, this function allocates pages of memory, and the second
- argument is "order" or a power of two number of pages, that is
- (for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes,
- order=2 ==> 16384 bytes, etc. The maximum size of a
- region allocated by __get_free_pages is determined by the MAX_ORDER macro. More
- precisely the limit can be calculated as:
- PAGE_SIZE << MAX_ORDER
- On an i386 architecture PAGE_SIZE is 4096 bytes
- In a 2.4/i386 kernel MAX_ORDER is 10
- In a 2.6/i386 kernel MAX_ORDER is 11
- So __get_free_pages can allocate as much as 4MB or 8MB in a 2.4 or 2.6 kernel,
- respectively, on an i386 architecture.
- User space programs can include /usr/include/sys/user.h and
- /usr/include/linux/mmzone.h to get the PAGE_SIZE and MAX_ORDER declarations.
- The page size can also be determined dynamically with the getpagesize (2)
- system call.
- Block number limit
- --------------------
- To understand the constraints of PACKET_MMAP, we have to see the structure
- used to hold the pointers to each block.
- Currently, this structure is a vector called pg_vec, dynamically allocated
- with kmalloc. Its size limits the number of blocks that can be allocated.
- +---+---+---+---+
- | x | x | x | x |
- +---+---+---+---+
-   |   |   |   |
-   |   |   |   v
-   |   |   v  block #4
-   |   v  block #3
-   v  block #2
-  block #1
- kmalloc allocates any number of bytes of physically contiguous memory from
- a pool of pre-determined sizes. This pool of memory is maintained by the slab
- allocator, which is ultimately responsible for doing the allocation and
- hence imposes the maximum memory that kmalloc can allocate.
- In a 2.4/2.6 kernel and the i386 architecture, the limit is 131072 bytes. The
- predetermined sizes that kmalloc uses can be checked in the "size-<bytes>"
- entries of /proc/slabinfo
- In a 32 bit architecture, pointers are 4 bytes long, so the total number of
- pointers to blocks is
- 131072/4 = 32768 blocks
- PACKET_MMAP buffer size calculator
- ------------------------------------
- Definitions:
- <size-max> : is the maximum size allocable with kmalloc (see /proc/slabinfo)
- <pointer size>: depends on the architecture -- sizeof(void *)
- <page size> : depends on the architecture -- PAGE_SIZE or getpagesize (2)
- <max-order> : is the value defined with MAX_ORDER
- <frame size> : it's an upper bound of frame's capture size (more on this later)
- from these definitions we will derive
- <block number> = <size-max>/<pointer size>
- <block size> = <page size> << <max-order>
- so, the max buffer size is
- <block number> * <block size>
- and the number of frames is
- <block number> * <block size> / <frame size>
- Suppose the following parameters, which apply to a 2.6 kernel and an
- i386 architecture:
- <size-max> = 131072 bytes
- <pointer size> = 4 bytes
- <page size> = 4096 bytes
- <max-order> = 11
- and a value for <frame size> of 2048 bytes. These parameters will yield
- <block number> = 131072/4 = 32768 blocks
- <block size> = 4096 << 11 = 8 MiB.
- and hence the buffer will have a 262144 MiB size. So it can hold
- 262144 MiB / 2048 bytes = 134217728 frames
- Actually, this buffer size is not possible with an i386 architecture.
- Remember that the memory is allocated in kernel space, and on an i386
- the kernel's memory size is limited to 1GiB.
- No memory allocation is freed until the socket is closed. The memory
- allocations are done with GFP_KERNEL priority, which basically means that
- the allocation can wait and swap other processes' memory in order to allocate
- the necessary memory, so normally the limits can be reached.
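The calculator above can be written as a few lines of C. This is a sketch of the arithmetic only; struct ring_limits and ring_limits_calc are hypothetical names, and the inputs are the values you would read from /proc/slabinfo and the kernel headers, as described in the definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the upper bounds derived above. */
struct ring_limits {
    uint64_t block_number;  /* <size-max> / <pointer size> */
    uint64_t block_size;    /* <page size> << <max-order>, in bytes */
    uint64_t buffer_size;   /* <block number> * <block size>, in bytes */
    uint64_t frame_number;  /* buffer size / <frame size> */
};

static struct ring_limits ring_limits_calc(uint64_t size_max,
                                           uint64_t pointer_size,
                                           uint64_t page_size,
                                           unsigned int max_order,
                                           uint64_t frame_size)
{
    struct ring_limits l;

    assert(pointer_size != 0 && frame_size != 0);

    l.block_number = size_max / pointer_size;
    l.block_size   = page_size << max_order;
    l.buffer_size  = l.block_number * l.block_size;
    l.frame_number = l.buffer_size / frame_size;
    return l;
}
```

Calling ring_limits_calc(131072, 4, 4096, 11, 2048) reproduces the example: 32768 blocks of 8 MiB each, a 262144 MiB buffer, and 134217728 frames. The results are theoretical upper bounds, not sizes the kernel is guaranteed to grant.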
- Other constraints
- -------------------
- If you check the source code you will see that what I draw here as a frame
- is not only the link level frame. At the beginning of each frame there is a
- header called struct tpacket_hdr, used in PACKET_MMAP to hold the link level frame's
- meta information, such as the timestamp. So what is drawn here as a frame is really
- the following (from include/linux/if_packet.h):
- /*
- Frame structure:
- - Start. Frame must be aligned to TPACKET_ALIGNMENT=16
- - struct tpacket_hdr
- - pad to TPACKET_ALIGNMENT=16
- - struct sockaddr_ll
- - Gap, chosen so that packet data (Start+tp_net) aligns to
- TPACKET_ALIGNMENT=16
- - Start+tp_mac: [ Optional MAC header ]
- - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16.
- - Pad to align to TPACKET_ALIGNMENT=16
- */
-
- The following are conditions that are checked in packet_set_ring
- tp_block_size must be a multiple of PAGE_SIZE (1)
- tp_frame_size must be greater than TPACKET_HDRLEN (obvious)
- tp_frame_size must be a multiple of TPACKET_ALIGNMENT
- tp_frame_nr must be exactly frames_per_block*tp_block_nr
- Note that tp_block_size should be chosen to be a power of two or there will
- be a waste of memory.
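A user program can run the same checks before calling setsockopt(), which gives a clearer error than a failed system call. The sketch below mirrors the four conditions as listed in this document, not the kernel source itself; ring_req_is_valid is a hypothetical name, and the page size is taken from sysconf(_SC_PAGESIZE) on the assumption that it matches the kernel's PAGE_SIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

/* Hypothetical helper: returns 1 if req passes the listed checks, else 0. */
static int ring_req_is_valid(const struct tpacket_req *req)
{
    unsigned int page_size = (unsigned int)sysconf(_SC_PAGESIZE);
    unsigned int frames_per_block;

    assert(req != NULL);

    /* tp_block_size must be a multiple of PAGE_SIZE */
    if (req->tp_block_size == 0 || req->tp_block_size % page_size != 0)
        return 0;
    /* tp_frame_size must be greater than TPACKET_HDRLEN */
    if (req->tp_frame_size <= TPACKET_HDRLEN)
        return 0;
    /* tp_frame_size must be a multiple of TPACKET_ALIGNMENT */
    if (req->tp_frame_size % TPACKET_ALIGNMENT != 0)
        return 0;
    /* tp_frame_nr must be exactly frames_per_block * tp_block_nr */
    frames_per_block = req->tp_block_size / req->tp_frame_size;
    if (frames_per_block == 0)
        return 0;
    if (frames_per_block * req->tp_block_nr != req->tp_frame_nr)
        return 0;
    return 1;
}
```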
- --------------------------------------------------------------------------------
- + Mapping and use of the circular buffer (ring)
- --------------------------------------------------------------------------------
- The mapping of the buffer in the user process is done with the conventional
- mmap function. Even though the circular buffer is composed of several physically
- discontiguous blocks of memory, they are contiguous in the user space mapping, hence
- just one call to mmap is needed:
- mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
- If tp_frame_size is a divisor of tp_block_size, frames will be
- contiguously spaced by tp_frame_size bytes. If not, after every
- tp_block_size/tp_frame_size frames there will be a gap between
- the frames. This is because a frame cannot span across two
- blocks.
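Because of that possible gap, the address of frame number idx is best computed per block rather than as idx * tp_frame_size. A minimal sketch, assuming ring is the pointer returned by mmap() and ring_frame_addr is a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: address of frame idx (0-based) in a mapped ring.
 * Works whether or not frame_size divides block_size evenly.
 */
static uint8_t *ring_frame_addr(uint8_t *ring, unsigned int block_size,
                                unsigned int frame_size, unsigned int idx)
{
    unsigned int frames_per_block;

    assert(ring != NULL && frame_size != 0 && frame_size <= block_size);

    frames_per_block = block_size / frame_size;

    /* Skip whole blocks first, then whole frames inside the block. */
    return ring + (size_t)(idx / frames_per_block) * block_size
                + (size_t)(idx % frames_per_block) * frame_size;
}
```

For example, with a block size of 4096 and a frame size of 1536, each block holds two frames, so frame 2 starts at offset 4096 and the last 1024 bytes of every block are unused.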
- To use one socket for capture and transmission, the mapping of both the
- RX and TX buffer ring has to be done with one call to mmap:
- ...
- setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &foo, sizeof(foo));
- setsockopt(fd, SOL_PACKET, PACKET_TX_RING, &bar, sizeof(bar));
- ...
- rx_ring = mmap(0, size * 2, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
- tx_ring = rx_ring + size;
- RX must be the first as the kernel maps the TX ring memory right
- after the RX one.
- At the beginning of each frame there is a status field (see
- struct tpacket_hdr). If this field is 0, the frame is ready
- to be used by the kernel. If not, there is a frame the user can read
- and the following flags apply:
- +++ Capture process:
- from include/linux/if_packet.h
- #define TP_STATUS_COPY (1 << 1)
- #define TP_STATUS_LOSING (1 << 2)
- #define TP_STATUS_CSUMNOTREADY (1 << 3)
- #define TP_STATUS_CSUM_VALID (1 << 7)
- TP_STATUS_COPY : This flag indicates that the frame (and associated
- meta information) has been truncated because it's
- larger than tp_frame_size. This packet can be
- read entirely with recvfrom().
-
- In order to make this work it must be
- enabled previously with setsockopt() and
- the PACKET_COPY_THRESH option.
- The number of frames that can be buffered to
- be read with recvfrom is limited like a normal socket.
- See the SO_RCVBUF option in the socket (7) man page.
- TP_STATUS_LOSING : indicates there were packet drops since the last time
- statistics were checked with getsockopt() and
- the PACKET_STATISTICS option.
- TP_STATUS_CSUMNOTREADY: currently it's used for outgoing IP packets whose
- checksum will be computed in hardware. So while
- reading the packet we should not try to check the
- checksum.
- TP_STATUS_CSUM_VALID : This flag indicates that at least the transport
- header checksum of the packet has been already
- validated on the kernel side. If the flag is not set
- then we are free to check the checksum by ourselves
- provided that TP_STATUS_CSUMNOTREADY is also not set.
- For convenience there are also the following defines:
- #define TP_STATUS_KERNEL 0
- #define TP_STATUS_USER 1
- The kernel initializes all frames to TP_STATUS_KERNEL. When the kernel
- receives a packet, it puts it in the buffer and updates the status with
- at least the TP_STATUS_USER flag. Then the user can read the packet.
- Once the packet is read, the user must zero the status field so the kernel
- can use that frame buffer again.
- The user can use poll (any other variant should apply too) to check if new
- packets are in the ring:
- struct pollfd pfd;
- pfd.fd = fd;
- pfd.revents = 0;
- pfd.events = POLLIN|POLLRDNORM|POLLERR;
- if (status == TP_STATUS_KERNEL)
- retval = poll(&pfd, 1, timeout);
- Checking the status value first and then polling for frames does not
- cause a race condition.
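The per-frame handshake on the capture side can be sketched as follows. This is an illustration for TPACKET_V1 only; rx_frame_consume is a hypothetical name, and the GCC builtin __sync_synchronize() is my own assumption for ordering the reads before the frame is handed back, not something this document prescribes.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/if_packet.h>

/*
 * Hypothetical helper: consume one frame if the kernel has handed it
 * to user space. Returns 1 if a frame was consumed, 0 if the frame is
 * still owned by the kernel (the caller would then poll()).
 */
static int rx_frame_consume(volatile struct tpacket_hdr *hdr,
                            unsigned long *bytes)
{
    assert(hdr != NULL && bytes != NULL);

    if (!(hdr->tp_status & TP_STATUS_USER))
        return 0;

    /* Packet data would be read here, starting at hdr + hdr->tp_mac. */
    *bytes += hdr->tp_snaplen;

    /* Finish all reads before giving the frame back to the kernel. */
    __sync_synchronize();
    hdr->tp_status = TP_STATUS_KERNEL;
    return 1;
}
```

A capture loop would call this on the current frame, advance to the next frame on a return value of 1, and call poll() on a return value of 0.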
- ++ Transmission process
- These defines are also used for transmission:
- #define TP_STATUS_AVAILABLE 0 // Frame is available
- #define TP_STATUS_SEND_REQUEST 1 // Frame will be sent on next send()
- #define TP_STATUS_SENDING 2 // Frame is currently in transmission
- #define TP_STATUS_WRONG_FORMAT 4 // Frame format is not correct
- First, the kernel initializes all frames to TP_STATUS_AVAILABLE. To send a
- packet, the user fills a data buffer of an available frame, sets tp_len to
- current data buffer size and sets its status field to TP_STATUS_SEND_REQUEST.
- This can be done on multiple frames. Once the user is ready to transmit, it
- calls send(). Then all buffers with status equal to TP_STATUS_SEND_REQUEST are
- forwarded to the network device. The kernel updates each status of sent
- frames with TP_STATUS_SENDING until the end of transfer.
- At the end of each transfer, buffer status returns to TP_STATUS_AVAILABLE.
- header->tp_len = in_i_size;
- header->tp_status = TP_STATUS_SEND_REQUEST;
- retval = send(this->socket, NULL, 0, 0);
- The user can also use poll() to check if a buffer is available:
- (status == TP_STATUS_SENDING)
- struct pollfd pfd;
- pfd.fd = fd;
- pfd.revents = 0;
- pfd.events = POLLOUT;
- retval = poll(&pfd, 1, timeout);
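Filling one TX frame can be sketched as below, for TPACKET_V1 with the default data offset described earlier (frame base + TPACKET_HDRLEN - sizeof(struct sockaddr_ll)). tx_frame_queue is a hypothetical name, and the memory barrier is an assumption of mine to make sure the data is written before the status changes. The caller still has to call send() afterwards.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <linux/if_packet.h>

/*
 * Hypothetical helper: copy a packet into an available TX frame and
 * mark it for transmission. Returns 0 on success, -1 if the frame is
 * not available or the packet does not fit.
 */
static int tx_frame_queue(void *frame, unsigned int frame_size,
                          const void *pkt, unsigned int len)
{
    struct tpacket_hdr *hdr = frame;
    unsigned int off = TPACKET_HDRLEN - sizeof(struct sockaddr_ll);

    assert(frame != NULL && pkt != NULL);

    if (hdr->tp_status != TP_STATUS_AVAILABLE)
        return -1;
    if (frame_size < off || len > frame_size - off)
        return -1;

    memcpy((uint8_t *)frame + off, pkt, len);
    hdr->tp_len = len;

    /* Make the data visible before the kernel sees the request. */
    __sync_synchronize();
    hdr->tp_status = TP_STATUS_SEND_REQUEST;
    return 0;
}
```

Several frames can be queued this way before a single send(fd, NULL, 0, 0) call flushes them all.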
- -------------------------------------------------------------------------------
- + What TPACKET versions are available and when to use them?
- -------------------------------------------------------------------------------
- int val = tpacket_version;
- socklen_t len = sizeof(val);
- setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
- getsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, &len);
- where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3.
- TPACKET_V1:
- - Default if not otherwise specified by setsockopt(2)
- - RX_RING, TX_RING available
- TPACKET_V1 --> TPACKET_V2:
- - Made 64 bit clean due to unsigned long usage in TPACKET_V1
- structures, thus this also works on 64 bit kernel with 32 bit
- userspace and the like
- - Timestamp resolution in nanoseconds instead of microseconds
- - RX_RING, TX_RING available
- - VLAN metadata information available for packets
- (TP_STATUS_VLAN_VALID, TP_STATUS_VLAN_TPID_VALID),
- in the tpacket2_hdr structure:
- - TP_STATUS_VLAN_VALID bit being set into the tp_status field indicates
- that the tp_vlan_tci field has valid VLAN TCI value
- - TP_STATUS_VLAN_TPID_VALID bit being set into the tp_status field
- indicates that the tp_vlan_tpid field has valid VLAN TPID value
- - How to switch to TPACKET_V2:
- 1. Replace struct tpacket_hdr by struct tpacket2_hdr
- 2. Query header len and save
- 3. Set protocol version to 2, set up ring as usual
- 4. For getting the sockaddr_ll,
- use (void *)hdr + TPACKET_ALIGN(hdrlen) instead of
- (void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))
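Steps 2 and 3 of the switch to TPACKET_V2 can be sketched as follows. use_tpacket_v2 is a hypothetical name. The sketch assumes that getsockopt() with PACKET_HDRLEN takes the requested version as its input value and returns the matching header length in the same variable.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

/*
 * Hypothetical helper: query the TPACKET_V2 header length, then select
 * TPACKET_V2 on the socket. Must run before the ring is set up.
 * Returns 0 on success, -1 on error (errno is set by the failing call).
 */
static int use_tpacket_v2(int fd, int *hdrlen)
{
    int val = TPACKET_V2;
    socklen_t len = sizeof(val);

    assert(hdrlen != NULL);

    /* Step 2: pass the version in, get the header length back. */
    if (getsockopt(fd, SOL_PACKET, PACKET_HDRLEN, &val, &len) < 0)
        return -1;
    *hdrlen = val;

    /* Step 3: set the protocol version to 2. */
    val = TPACKET_V2;
    if (setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val)) < 0)
        return -1;
    return 0;
}
```

The saved hdrlen is then used in step 4 to locate the sockaddr_ll at (void *)hdr + TPACKET_ALIGN(hdrlen).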
- TPACKET_V2 --> TPACKET_V3:
- - Flexible buffer implementation:
- 1. Blocks can be configured with non-static frame-size
- 2. Read/poll is at a block-level (as opposed to packet-level)
- 3. Added poll timeout to avoid indefinite user-space wait
- on idle links
- 4. Added user-configurable knobs:
- 4.1 block::timeout
- 4.2 tpkt_hdr::sk_rxhash
- - RX Hash data available in user space
- - Currently only RX_RING available
- -------------------------------------------------------------------------------
- + AF_PACKET fanout mode
- -------------------------------------------------------------------------------
- In the AF_PACKET fanout mode, packet reception can be load balanced among
- processes. This also works in combination with mmap(2) on packet sockets.
- Currently implemented fanout policies are:
- - PACKET_FANOUT_HASH: schedule to socket by skb's packet hash
- - PACKET_FANOUT_LB: schedule to socket by round-robin
- - PACKET_FANOUT_CPU: schedule to socket by CPU packet arrives on
- - PACKET_FANOUT_RND: schedule to socket by random selection
- - PACKET_FANOUT_ROLLOVER: if one socket is full, rollover to another
- - PACKET_FANOUT_QM: schedule to socket by skb's recorded queue_mapping
- Minimal example code by David S. Miller (try things like "./test eth0 hash",
- "./test eth0 lb", etc.):
- #include <stddef.h>
- #include <stdlib.h>
- #include <stdio.h>
- #include <string.h>
- #include <sys/types.h>
- #include <sys/wait.h>
- #include <sys/socket.h>
- #include <sys/ioctl.h>
- #include <unistd.h>
- #include <linux/if_ether.h>
- #include <linux/if_packet.h>
- #include <net/if.h>
- #include <arpa/inet.h> /* htons() */
- static const char *device_name;
- static int fanout_type;
- static int fanout_id;
- #ifndef PACKET_FANOUT
- # define PACKET_FANOUT 18
- # define PACKET_FANOUT_HASH 0
- # define PACKET_FANOUT_LB 1
- #endif
- static int setup_socket(void)
- {
- int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));
- struct sockaddr_ll ll;
- struct ifreq ifr;
- int fanout_arg;
- if (fd < 0) {
- perror("socket");
- return -1;
- }
- memset(&ifr, 0, sizeof(ifr));
- strncpy(ifr.ifr_name, device_name, sizeof(ifr.ifr_name) - 1);
- err = ioctl(fd, SIOCGIFINDEX, &ifr);
- if (err < 0) {
- perror("SIOCGIFINDEX");
- return -1;
- }
- memset(&ll, 0, sizeof(ll));
- ll.sll_family = AF_PACKET;
- ll.sll_ifindex = ifr.ifr_ifindex;
- err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
- if (err < 0) {
- perror("bind");
- return -1;
- }
- fanout_arg = (fanout_id | (fanout_type << 16));
- err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
- &fanout_arg, sizeof(fanout_arg));
- if (err) {
- perror("setsockopt");
- return -1;
- }
- return fd;
- }
- static void fanout_thread(void)
- {
- int fd = setup_socket();
- int limit = 10000;
- /* setup_socket() returns -1 on failure */
- if (fd < 0)
- exit(EXIT_FAILURE);
- while (limit-- > 0) {
- char buf[1600];
- int err;
- err = read(fd, buf, sizeof(buf));
- if (err < 0) {
- perror("read");
- exit(EXIT_FAILURE);
- }
- if ((limit % 10) == 0)
- fprintf(stdout, "(%d) \n", getpid());
- }
- fprintf(stdout, "%d: Received 10000 packets\n", getpid());
- close(fd);
- exit(0);
- }
- int main(int argc, char **argp)
- {
- int fd, err;
- int i;
- if (argc != 3) {
- fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]);
- return EXIT_FAILURE;
- }
- if (!strcmp(argp[2], "hash"))
- fanout_type = PACKET_FANOUT_HASH;
- else if (!strcmp(argp[2], "lb"))
- fanout_type = PACKET_FANOUT_LB;
- else {
- fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]);
- exit(EXIT_FAILURE);
- }
- device_name = argp[1];
- fanout_id = getpid() & 0xffff;
- for (i = 0; i < 4; i++) {
- pid_t pid = fork();
- switch (pid) {
- case 0:
- fanout_thread();
- case -1:
- perror("fork");
- exit(EXIT_FAILURE);
- }
- }
- for (i = 0; i < 4; i++) {
- int status;
- wait(&status);
- }
- return 0;
- }
- -------------------------------------------------------------------------------
- + AF_PACKET TPACKET_V3 example
- -------------------------------------------------------------------------------
- AF_PACKET's TPACKET_V3 ring buffer can be configured to use non-static frame
- sizes by doing its own memory management. It is based on blocks where polling
- works on a per block basis instead of per ring as in TPACKET_V2 and its predecessor.
- It is said that TPACKET_V3 brings the following benefits:
- *) ~15 - 20% reduction in CPU-usage
- *) ~20% increase in packet capture rate
- *) ~2x increase in packet density
- *) Port aggregation analysis
- *) Non static frame size to capture entire packet payload
- So it seems to be a good candidate to be used with packet fanout.
- Minimal example code by Daniel Borkmann based on Chetan Loke's lolpcap (compile
- it with gcc -Wall -O2 blob.c, and try things like "./a.out eth0", etc.):
- /* Written from scratch, but kernel-to-user space API usage
-  * dissected from lolpcap:
-  *  Copyright 2011, Chetan Loke
-  *  License: GPL, version 2.0
-  */
- 
- #include <stdio.h>
- #include <stdlib.h>
- #include <stdint.h>
- #include <string.h>
- #include <assert.h>
- #include <net/if.h>
- #include <arpa/inet.h>
- #include <netdb.h>
- #include <poll.h>
- #include <unistd.h>
- #include <signal.h>
- #include <inttypes.h>
- #include <sys/socket.h>
- #include <sys/mman.h>
- #include <linux/if_packet.h>
- #include <linux/if_ether.h>
- #include <linux/ip.h>
- 
- #ifndef likely
- # define likely(x)		__builtin_expect(!!(x), 1)
- #endif
- #ifndef unlikely
- # define unlikely(x)		__builtin_expect(!!(x), 0)
- #endif
- 
- /* Header found at the start of every block in the mmap(2)ed ring */
- struct block_desc {
- 	uint32_t version;
- 	uint32_t offset_to_priv;
- 	struct tpacket_hdr_v1 h1;
- };
- 
- struct ring {
- 	struct iovec *rd;
- 	uint8_t *map;
- 	struct tpacket_req3 req;
- };
- 
- static unsigned long packets_total = 0, bytes_total = 0;
- /* Written from the signal handler, hence volatile */
- static volatile sig_atomic_t sigint = 0;
- 
- static void sighandler(int num)
- {
- 	sigint = 1;
- }
- 
- static int setup_socket(struct ring *ring, char *netdev)
- {
- 	int err, fd, v = TPACKET_V3;
- 	unsigned int i;
- 	struct sockaddr_ll ll;
- 	unsigned int blocksiz = 1 << 22, framesiz = 1 << 11;
- 	unsigned int blocknum = 64;
- 
- 	fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
- 	if (fd < 0) {
- 		perror("socket");
- 		exit(1);
- 	}
- 
- 	/* The ring version must be selected before the ring is created */
- 	err = setsockopt(fd, SOL_PACKET, PACKET_VERSION, &v, sizeof(v));
- 	if (err < 0) {
- 		perror("setsockopt");
- 		exit(1);
- 	}
- 
- 	memset(&ring->req, 0, sizeof(ring->req));
- 	ring->req.tp_block_size = blocksiz;
- 	ring->req.tp_frame_size = framesiz;
- 	ring->req.tp_block_nr = blocknum;
- 	ring->req.tp_frame_nr = (blocksiz * blocknum) / framesiz;
- 	ring->req.tp_retire_blk_tov = 60;
- 	ring->req.tp_feature_req_word = TP_FT_REQ_FILL_RXHASH;
- 
- 	err = setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &ring->req,
- 			 sizeof(ring->req));
- 	if (err < 0) {
- 		perror("setsockopt");
- 		exit(1);
- 	}
- 
- 	ring->map = mmap(NULL, ring->req.tp_block_size * ring->req.tp_block_nr,
- 			 PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, fd, 0);
- 	if (ring->map == MAP_FAILED) {
- 		perror("mmap");
- 		exit(1);
- 	}
- 
- 	/* One iovec per block, pointing into the mapping */
- 	ring->rd = malloc(ring->req.tp_block_nr * sizeof(*ring->rd));
- 	assert(ring->rd);
- 	for (i = 0; i < ring->req.tp_block_nr; ++i) {
- 		ring->rd[i].iov_base = ring->map + (i * ring->req.tp_block_size);
- 		ring->rd[i].iov_len = ring->req.tp_block_size;
- 	}
- 
- 	memset(&ll, 0, sizeof(ll));
- 	ll.sll_family = PF_PACKET;
- 	ll.sll_protocol = htons(ETH_P_ALL);
- 	ll.sll_ifindex = if_nametoindex(netdev);
- 	ll.sll_hatype = 0;
- 	ll.sll_pkttype = 0;
- 	ll.sll_halen = 0;
- 
- 	err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
- 	if (err < 0) {
- 		perror("bind");
- 		exit(1);
- 	}
- 
- 	return fd;
- }
- 
- static void display(struct tpacket3_hdr *ppd)
- {
- 	struct ethhdr *eth = (struct ethhdr *) ((uint8_t *) ppd + ppd->tp_mac);
- 	struct iphdr *ip = (struct iphdr *) ((uint8_t *) eth + ETH_HLEN);
- 
- 	if (eth->h_proto == htons(ETH_P_IP)) {
- 		struct sockaddr_in ss, sd;
- 		char sbuff[NI_MAXHOST], dbuff[NI_MAXHOST];
- 
- 		memset(&ss, 0, sizeof(ss));
- 		ss.sin_family = PF_INET;
- 		ss.sin_addr.s_addr = ip->saddr;
- 		getnameinfo((struct sockaddr *) &ss, sizeof(ss),
- 			    sbuff, sizeof(sbuff), NULL, 0, NI_NUMERICHOST);
- 
- 		memset(&sd, 0, sizeof(sd));
- 		sd.sin_family = PF_INET;
- 		sd.sin_addr.s_addr = ip->daddr;
- 		getnameinfo((struct sockaddr *) &sd, sizeof(sd),
- 			    dbuff, sizeof(dbuff), NULL, 0, NI_NUMERICHOST);
- 
- 		printf("%s -> %s, ", sbuff, dbuff);
- 	}
- 
- 	printf("rxhash: 0x%x\n", ppd->hv1.tp_rxhash);
- }
- 
- static void walk_block(struct block_desc *pbd, const int block_num)
- {
- 	int num_pkts = pbd->h1.num_pkts, i;
- 	unsigned long bytes = 0;
- 	struct tpacket3_hdr *ppd;
- 
- 	ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd +
- 				       pbd->h1.offset_to_first_pkt);
- 	for (i = 0; i < num_pkts; ++i) {
- 		bytes += ppd->tp_snaplen;
- 		display(ppd);
- 
- 		/* Packets in a block are chained via tp_next_offset */
- 		ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd +
- 					       ppd->tp_next_offset);
- 	}
- 
- 	packets_total += num_pkts;
- 	bytes_total += bytes;
- }
- 
- /* Hand the block back to the kernel */
- static void flush_block(struct block_desc *pbd)
- {
- 	pbd->h1.block_status = TP_STATUS_KERNEL;
- }
- 
- static void teardown_socket(struct ring *ring, int fd)
- {
- 	munmap(ring->map, ring->req.tp_block_size * ring->req.tp_block_nr);
- 	free(ring->rd);
- 	close(fd);
- }
- 
- int main(int argc, char **argp)
- {
- 	int fd, err;
- 	socklen_t len;
- 	struct ring ring;
- 	struct pollfd pfd;
- 	unsigned int block_num = 0, blocks = 64;
- 	struct block_desc *pbd;
- 	struct tpacket_stats_v3 stats;
- 
- 	if (argc != 2) {
- 		fprintf(stderr, "Usage: %s INTERFACE\n", argp[0]);
- 		return EXIT_FAILURE;
- 	}
- 
- 	signal(SIGINT, sighandler);
- 
- 	memset(&ring, 0, sizeof(ring));
- 	fd = setup_socket(&ring, argp[argc - 1]);
- 	assert(fd > 0);
- 
- 	memset(&pfd, 0, sizeof(pfd));
- 	pfd.fd = fd;
- 	pfd.events = POLLIN | POLLERR;
- 	pfd.revents = 0;
- 
- 	while (likely(!sigint)) {
- 		pbd = (struct block_desc *) ring.rd[block_num].iov_base;
- 
- 		/* Block still owned by the kernel: sleep until it is retired */
- 		if ((pbd->h1.block_status & TP_STATUS_USER) == 0) {
- 			poll(&pfd, 1, -1);
- 			continue;
- 		}
- 
- 		walk_block(pbd, block_num);
- 		flush_block(pbd);
- 		block_num = (block_num + 1) % blocks;
- 	}
- 
- 	len = sizeof(stats);
- 	err = getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &stats, &len);
- 	if (err < 0) {
- 		perror("getsockopt");
- 		exit(1);
- 	}
- 
- 	fflush(stdout);
- 	printf("\nReceived %u packets, %lu bytes, %u dropped, freeze_q_cnt: %u\n",
- 	       stats.tp_packets, bytes_total, stats.tp_drops,
- 	       stats.tp_freeze_q_cnt);
- 
- 	teardown_socket(&ring, fd);
- 	return 0;
- }
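- The ring geometry requested in setup_socket() can be double-checked with a
- little arithmetic. The following is only a sketch: struct ring_geometry and
- ring_geometry() are hypothetical helpers, not part of the example above or of
- any kernel header. It assumes that the kernel derives the frames per block as
- tp_block_size / tp_frame_size and expects tp_frame_nr to be exactly that value
- multiplied by tp_block_nr.

```c
#include <stddef.h>

/* Hypothetical helper type, used only for this illustration. */
struct ring_geometry {
	size_t frames_per_block;	/* tp_block_size / tp_frame_size */
	size_t frame_nr;		/* value expected in tp_frame_nr */
	size_t map_bytes;		/* length to pass to mmap(2) */
};

/* Returns 0 on success, -1 if not even one frame fits into a block. */
static int ring_geometry(size_t block_size, size_t frame_size,
			 size_t block_nr, struct ring_geometry *g)
{
	if (frame_size == 0 || block_size / frame_size == 0)
		return -1;

	g->frames_per_block = block_size / frame_size;
	g->frame_nr = g->frames_per_block * block_nr;
	g->map_bytes = block_size * block_nr;
	return 0;
}
```

- With the values used in the example (4 MiB blocks, 2 KiB frames, 64 blocks)
- this gives 2048 frames per block, 131072 frames in total and a 256 MiB
- mapping.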
- -------------------------------------------------------------------------------
- + PACKET_QDISC_BYPASS
- -------------------------------------------------------------------------------
- If there is a requirement to load the network with many packets in a similar
- fashion as pktgen does, you might set the following option after socket
- creation:
-     int one = 1;
-     setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS, &one, sizeof(one));
- This has the side effect that packets sent through PF_PACKET bypass the
- kernel's qdisc layer and are pushed to the driver directly. This means that
- packets are not buffered, tc disciplines are ignored, increased loss can
- occur, and such packets are no longer visible to other PF_PACKET sockets. So,
- you have been warned; generally, this can be useful for stress testing various
- components of a system.
- By default, PACKET_QDISC_BYPASS is disabled and needs to be explicitly enabled
- on PF_PACKET sockets.
- -------------------------------------------------------------------------------
- + PACKET_TIMESTAMP
- -------------------------------------------------------------------------------
- The PACKET_TIMESTAMP setting determines the source of the timestamp in
- the packet meta information for mmap(2)ed RX_RING and TX_RINGs. If your
- NIC is capable of timestamping packets in hardware, you can request those
- hardware timestamps to be used. Note: you may need to enable the generation
- of hardware timestamps with SIOCSHWTSTAMP (see related information from
- Documentation/networking/timestamping.txt).
- PACKET_TIMESTAMP accepts the same integer bit field as SO_TIMESTAMPING:
-     int req = SOF_TIMESTAMPING_RAW_HARDWARE;
-     setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req));
- For the mmap(2)ed ring buffers, such timestamps are stored in the
- tpacket{,2,3}_hdr structure's tp_sec and tp_{n,u}sec members. To determine
- what kind of timestamp has been reported, the tp_status field is binary OR'ed
- with the following possible bits ...
-     TP_STATUS_TS_RAW_HARDWARE
-     TP_STATUS_TS_SOFTWARE
- ... that are equivalent to their SOF_TIMESTAMPING_* counterparts. For the
- RX_RING, if neither is set (i.e. PACKET_TIMESTAMP is not set), then a
- software fallback was invoked *within* PF_PACKET's processing code (less
- precise).
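- For RX_RING frames, the check described above could look roughly like this.
- The enum ts_source and the function rx_ts_source() are hypothetical names
- chosen for this sketch; only the TP_STATUS_TS_* bits come from
- <linux/if_packet.h>.

```c
#include <stdint.h>
#include <linux/if_packet.h>

/* Hypothetical names, used only for this illustration. */
enum ts_source {
	TS_SRC_FALLBACK,	/* neither bit set: PF_PACKET's own fallback */
	TS_SRC_SOFTWARE,	/* TP_STATUS_TS_SOFTWARE is set */
	TS_SRC_RAW_HARDWARE,	/* TP_STATUS_TS_RAW_HARDWARE is set */
};

/* Classify where the timestamp of a received frame came from. */
static enum ts_source rx_ts_source(uint32_t tp_status)
{
	if (tp_status & TP_STATUS_TS_RAW_HARDWARE)
		return TS_SRC_RAW_HARDWARE;
	if (tp_status & TP_STATUS_TS_SOFTWARE)
		return TS_SRC_SOFTWARE;
	return TS_SRC_FALLBACK;
}
```

- In a TPACKET_V3 receive loop, this could be called as
- rx_ts_source(ppd->tp_status) for each struct tpacket3_hdr.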
- Getting timestamps for the TX_RING works as follows: i) fill the ring frames,
- ii) call sendto() e.g. in blocking mode, iii) wait for the status of the
- relevant frames to be updated, i.e. for the frames to be handed back to the
- application, iv) walk through the frames to pick up the individual hw/sw
- timestamps.
- Only (!) if transmit timestamping is enabled are these bits binary OR'ed
- with TP_STATUS_AVAILABLE, so you must account for that in your application:
- first check !(tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING)) to
- see whether the frame belongs to the application, and then extract the type
- of timestamp from tp_status in a second step!
- If you don't care about timestamps and thus leave them disabled, checking for
- TP_STATUS_AVAILABLE or TP_STATUS_WRONG_FORMAT is sufficient. If in the
- TX_RING part only TP_STATUS_AVAILABLE is set, then the tp_sec and tp_{n,u}sec
- members do not contain a valid value. For TX_RINGs, by default no timestamp
- is generated!
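- The two-step TX_RING check can be sketched as below. The function names
- tx_frame_owned_by_user() and tx_frame_has_timestamp() are hypothetical; the
- point is that a plain comparison against TP_STATUS_AVAILABLE no longer works
- once timestamp bits may be OR'ed into tp_status.

```c
#include <stdint.h>
#include <linux/if_packet.h>

/* Step 1: the frame is back with the application once the kernel is
 * neither waiting to send it nor in the middle of sending it.
 */
static int tx_frame_owned_by_user(uint32_t tp_status)
{
	return !(tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING));
}

/* Step 2: tp_sec and tp_{n,u}sec are only meaningful if one of the
 * timestamp bits has been set by the kernel.
 */
static int tx_frame_has_timestamp(uint32_t tp_status)
{
	return !!(tp_status & (TP_STATUS_TS_RAW_HARDWARE |
			       TP_STATUS_TS_SOFTWARE));
}
```

- A frame carrying only TP_STATUS_AVAILABLE passes the first check but fails
- the second, which matches the rule that its timestamp members are not valid.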
- See include/linux/net_tstamp.h and Documentation/networking/timestamping
- for more information on hardware timestamps.
- -------------------------------------------------------------------------------
- + Miscellaneous bits
- -------------------------------------------------------------------------------
- - Packet sockets work well together with Linux socket filters, so you might
-   also want to have a look at Documentation/networking/filter.txt
- --------------------------------------------------------------------------------
- + THANKS
- --------------------------------------------------------------------------------
-
- Jesse Brandeburg, for fixing my grammatical/spelling errors