From nobody Sat Apr 3 19:24:12 2004 Path: main.gmane.org!not-for-mail From: uaca@alumni.uv.XXX Newsgroups: gmane.linux.network.general Subject: PACKET_MMAP's documentation Date: Sun, 28 Mar 2004 16:40:53 +0200 Lines: 506 Sender: linux-net-owner@vger.kernel.XXX X-Mailing-List: linux-net@vger.kernel.org Xref: main.gmane.org gmane.linux.network.general:5647 DaveM: If you agree with it I will send two small patches to modify kernel's configure help.=20 Ulisses -----------------------------------------------------------------------= --------- + ABSTRACT -----------------------------------------------------------------------= --------- This file documents the CONFIG_PACKET_MMAP option available with the PA= CKET socket interface on 2.4 and 2.6 kernels. This type of sockets is used f= or=20 capture network traffic with utilities like tcpdump or any other that u= ses=20 the libpcap library.=20 You can find the latest version of this document at http://pusa.uv.es/~ulisses/packet_mmap/ Please send me your comments to Ulisses Alonso Camar=F3 -----------------------------------------------------------------------= -------- + Why use PACKET_MMAP -----------------------------------------------------------------------= --------- In Linux 2.4/2.6 if PACKET_MMAP is not enabled, the capture process is = very inefficient. It uses very limited buffers and requires one system call to capture each packet, it requires two if you want to get packet's=20 timestamp (like libpcap always does). In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a= size=20 configurable circular buffer mapped in user space. This way reading pac= kets just=20 needs to wait for them, most of the time there is no need to issue a si= ngle=20 system call. By using a shared buffer between the kernel and the user=20 also has the benefit of minimizing packet copies. It's fine to use PACKET_MMAP to improve the performance of the capture = process,=20 but it isn't everything. At least, if you are capturing at high speeds = (this=20 is relative to the cpu speed), you should check if the device driver of= your=20 network interface card supports some sort of interrupt load mitigation = or=20 (even better) if it supports NAPI, also make sure it is enabled. -----------------------------------------------------------------------= --------- + How to use CONFIG_PACKET_MMAP -----------------------------------------------------------------------= --------- =46rom the user standpoint, you should use the higher level libpcap lib= rary, wich is a de facto standard, portable across nearly all operating systems including Win32.=20 Said that, at time of this writing, official libpcap 0.8.1 is out and d= oesn't include support for PACKET_MMAP, and also probably the libpcap included in your= distribution.=20 I'm aware of two implementations of PACKET_MMAP in libpcap: http://pusa.uv.es/~ulisses/packet_mmap/ (by Simon Patarin, based o= n libpcap 0.6.2) http://public.lanl.gov/cpw/ (by Phil Wood, based on la= stest libpcap) The rest of this document is intended for people who want to understand the low level details or want to improve libpcap by including PACKET_MM= AP support. -----------------------------------------------------------------------= --------- + How to use CONFIG_PACKET_MMAP directly -----------------------------------------------------------------------= --------- =46rom the system calls stand point, the use of PACKET_MMAP involves the following process: [setup] socket() -------> creation of the capture socket setsockopt() ---> allocation of the circular buffer (ring) mmap() ---------> maping of the allocated buffer to the user process [capture] poll() ---------> to wait for incoming packets [shutdown] close() --------> destruction of the capture socket and deallocation of all associated=20 resources. socket creation and destruction is straight forward, and is done=20 the same way with or without PACKET_MMAP: int fd; fd=3D socket(PF_PACKET, mode, htons(ETH_P_ALL)) where mode is SOCK_RAW for the raw interface were link level information can be captured or SOCK_DGRAM for the cooked interface where link level information capture is not=20 supported and a link level pseudo-header is provided=20 by the kernel. The destruction of the socket and all associated resources is done by a simple call to close(fd). Next I will describe PACKET_MMAP settings and it's constraints, also the maping of the circular buffer in the user process and=20 the use of this buffer. -----------------------------------------------------------------------= --------- + PACKET_MMAP settings -----------------------------------------------------------------------= --------- To setup PACKET_MMAP from user level code is done with a call like setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(r= eq)) The most significant argument in the previous call is the req parameter= ,=20 this parameter must to have the following structure: struct tpacket_req { unsigned int tp_block_size; /* Minimal size of contiguous b= lock */ unsigned int tp_block_nr; /* Number of blocks */ unsigned int tp_frame_size; /* Size of frame */ unsigned int tp_frame_nr; /* Total number of frames */ }; This structure is defined in /usr/include/linux/if_packet.h and establi= shes a=20 circular buffer (ring) of unswappable memory mapped in the capture proc= ess.=20 Being mapped in the capture process allows reading the captured frames = and=20 related meta-information like timestamps without requiring a system cal= l. Captured frames are grouped in blocks. Each block is a physically conti= guous=20 region of memory and holds tp_block_size/tp_frame_size frames. The tota= l number=20 of blocks is tp_block_nr. Note that tp_frame_nr is a redundant paramete= r because frames_per_block =3D tp_block_size/tp_frame_size indeed, packet_set_ring checks that the following condition is true frames_per_block * tp_block_nr =3D=3D tp_frame_nr Lets see an example, with the following values: tp_block_size=3D 4096 tp_frame_size=3D 2048 tp_block_nr =3D 4 tp_frame_nr =3D 8 we will get the following buffer structure: block #1 block #2 =20 +---------+---------+ +---------+---------+ =20 | frame 1 | frame 2 | | frame 3 | frame 4 | =20 +---------+---------+ +---------+---------+ =20 block #3 block #4 +---------+---------+ +---------+---------+ | frame 5 | frame 6 | | frame 7 | frame 8 | +---------+---------+ +---------+---------+ A frame can be of any size with the only condition it can fit in a bloc= k. A block can only hold an integer number of frames, or in other words, a frame c= annot=20 be spawn accross two blocks so there are some datails you have to take = into=20 account when choosing the frame_size. See "Maping and use of the circul= ar=20 buffer (ring)". -----------------------------------------------------------------------= --------- + PACKET_MMAP setting constraints -----------------------------------------------------------------------= --------- In kernel versions prior to 2.4.26 (for the 2.4 branch) and 2.6.5 (2.6 = branch), the PACKET_MMAP buffer could hold only 32768 frames in a 32 bit archite= cture or 16384 in a 64 bit architecture. For information on these kernel version= s see http://pusa.uv.es/~ulisses/packet_mmap/packet_mmap.pre-2.4.26_2.6.5= =2Etxt Block size limit ------------------ As stated earlier, each block is a contiguous physical region of memory= =2E These=20 memory regions are allocated with calls to the __get_free_pages() funct= ion. As=20 the name indicates, this function allocates pages of memory, and the se= cond argument is "order" or a power of two number of pages, that is=20 (for PAGE_SIZE =3D=3D 4096) order=3D0 =3D=3D> 4096 bytes, order=3D1 =3D= =3D> 8192 bytes,=20 order=3D2 =3D=3D> 16384 bytes, etc. The maximum size of a=20 region allocated by __get_free_pages is determined by the MAX_ORDER mac= ro. More=20 precisely the limit can be calculated as: PAGE_SIZE << MAX_ORDER In a i386 architecture PAGE_SIZE is 4096 bytes=20 In a 2.4/i386 kernel MAX_ORDER is 10 In a 2.6/i386 kernel MAX_ORDER is 11 So get_free_pages can allocate as much as 4MB or 8MB in a 2.4/2.6 kerne= l=20 respectively, with an i386 architecture. User space programs can include /usr/include/sys/user.h and=20 /usr/include/linux/mmzone.h to get PAGE_SIZE MAX_ORDER declarations. The pagesize can also be determined dynamically with the getpagesize (2= )=20 system call.=20 Block number limit -------------------- To understand the constraints of PACKET_MMAP, we have to see the struct= ure=20 used to hold the pointers to each block. Currently, this structure is a dynamically allocated vector with kmallo= c=20 called pg_vec, its size limits the number of blocks that can be allocat= ed. +---+---+---+---+ | x | x | x | x | +---+---+---+---+ | | | | | | | v | | v block #4 | v block #3 v block #2 block #1 kmalloc allocates any number of bytes of phisically contiguous memory f= rom=20 a pool of pre-determined sizes. This pool of memory is mantained by the= slab=20 allocator wich is at the end the responsible for doing the allocation a= nd=20 hence wich imposes the maximum memory that kmalloc can allocate.=20 In a 2.4/2.6 kernel and the i386 architecture, the limit is 131072 byte= s. The=20 predetermined sizes that kmalloc uses can be checked in the "size-"=20 entries of /proc/slabinfo In a 32 bit architecture, pointers are 4 bytes long, so the total numbe= r of=20 pointers to blocks is 131072/4 =3D 32768 blocks PACKET_MMAP buffer size calculator ------------------------------------ Definitions: : is the maximum size of allocable with kmalloc (see /pro= c/slabinfo) : depends on the architecture -- sizeof(void *) : depends on the architecture -- PAGE_SIZE or getpagesize= (2) : is the value defined with MAX_ORDER : it's an upper bound of frame's capture size (more on th= is later) from these definitions we will derive=20 =3D / =3D << so, the max buffer size is * and, the number of frames be * / Suposse the following parameters, wich apply for 2.6 kernel and an i386 architecture: =3D 131072 bytes =3D 4 bytes =3D 4096 bytes =3D 11 and a value for of 2048 byteas. These parameters will yiel= d =3D 131072/4 =3D 32768 blocks =3D 4096 << 11 =3D 8 MiB. and hence the buffer will have a 262144 MiB size. So it can hold=20 262144 MiB / 2048 bytes =3D 134217728 frames Actually, this buffer size is not possible with an i386 architecture.=20 Remember that the memory is allocated in kernel space, in the case of=20 an i386 kernel's memory size is limited to 1GiB. All memory allocations are not freed until the socket is closed. The me= mory=20 allocations are done with GFP_KERNEL priority, this basically means tha= t=20 the allocation can wait and swap other process' memory in order to allo= cate=20 the nececessary memory, so normally limits can be reached. Other constraints ------------------- If you check the source code you will see that what I draw here as a fr= ame is not only the link level frame. At the begining of each frame there i= s a=20 header called struct tpacket_hdr used in PACKET_MMAP to hold link level= 's frame meta information like timestamp. So what we draw here a frame it's real= ly=20 the following (from include/linux/if_packet.h): /* Frame structure: - Start. Frame must be aligned to TPACKET_ALIGNMENT=3D16 - struct tpacket_hdr - pad to TPACKET_ALIGNMENT=3D16 - struct sockaddr_ll - Gap, chosen so that packet data (Start+tp_net) alignes to=20 TPACKET_ALIGNMENT=3D16 - Start+tp_mac: [ Optional MAC header ] - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=3D16. - Pad to align to TPACKET_ALIGNMENT=3D16 */ =20 =20 The following are conditions that are checked in packet_set_ring tp_block_size must be a multiple of PAGE_SIZE (1) tp_frame_size must be greater than TPACKET_HDRLEN (obvious) tp_frame_size must be a multiple of TPACKET_ALIGNMENT tp_frame_nr must be exactly frames_per_block*tp_block_nr Note that tp_block_size should be choosed to be a power of two or there= will be a waste of memory. -----------------------------------------------------------------------= --------- + Maping and use of the circular buffer (ring) -----------------------------------------------------------------------= --------- The maping of the buffer in the user process is done with the conventio= nal=20 mmap function. Even the circular buffer is compound of several physical= ly discontiguous blocks of memory, they are contiguous to the user space, = hence just one call to mmap is needed: mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); If tp_frame_size is a divisor of tp_block_size frames will be=20 contiguosly spaced by tp_frame_size bytes. If not, each=20 tp_block_size/tp_frame_size frames there will be a gap between=20 the frames. This is because a frame cannot be spawn across two blocks.=20 At the beginning of each frame there is an status field (see=20 struct tpacket_hdr). If this field is 0 means that the frame is ready to be used for the kernel, If not, there is a frame the user can read=20 and the following flags apply: from include/linux/if_packet.h #define TP_STATUS_COPY 2=20 #define TP_STATUS_LOSING 4=20 #define TP_STATUS_CSUMNOTREADY 8=20 TP_STATUS_COPY : This flag indicates that the frame (and associa= ted meta information) has been truncated because it= 's=20 larger than tp_frame_size. This packet can be=20 read entirely with recvfrom(). =20 In order to make this work it must to be enabled previously with setsockopt() and=20 the PACKET_COPY_THRESH option.=20 The number of frames than can be buffered to=20 be read with recvfrom is limited like a normal = socket. See the SO_RCVBUF option in the socket (7) man = page. TP_STATUS_LOSING : indicates there were packet drops from last tim= e=20 statistics where checked with getsockopt() and the PACKET_STATISTICS option. TP_STATUS_CSUMNOTREADY: currently it's used for outgoing IP packets wic= h=20 it's checksum will be done in hardware. So whil= e=20 reading the packet we should not try to check t= he=20 checksum.=20 for convenience there are also the following defines: #define TP_STATUS_KERNEL 0 #define TP_STATUS_USER 1 The kernel initializes all frames to TP_STATUS_KERNEL, when the kernel receives a packet it puts in the buffer and updates the status with at least the TP_STATUS_USER flag. Then the user can read the packet, once the packet is read the user must zero the status field, so the ker= nel=20 can use again that frame buffer. The user can use poll (any other variant should apply too) to check if = new packets are in the ring: struct pollfd pfd; pfd.fd =3D fd; pfd.revents =3D 0; pfd.events =3D POLLIN|POLLRDNORM|POLLERR; if (status =3D=3D TP_STATUS_KERNEL) retval =3D poll(&pfd, 1, timeout); It doesn't incur in a race condition to first check the status value an= d=20 then poll for frames. -----------------------------------------------------------------------= --------- + THANKS -----------------------------------------------------------------------= --------- =20 Jesse Brandeburg, for fixing my grammathical/spelling errors >>> EOF - To unsubscribe from this list: send the line "unsubscribe linux-net" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html