From Earlham Cluster Department
Linux 2.6 TCP/IP Receive Stack
Todo for this doc:
- finish socket functions
- not all of our current timer points
The network device driver's ISR (interrupt service routine) calls netif_rx, which passes the incoming packet to the input packet queuing layer. Packets are stored in softnet_data data structures. The queuing layer manages the softIRQs that process input packets taken directly from the device driver. Note that the term softIRQs refers to the two queue processing threads. The receive thread is named NET_RX_SOFTIRQ. This thread's action function, net_rx_action, removes the packet from the queue and passes it to the packet handler (which provides flow control for incoming packets). The packet handler is the backlog device (blog_dev). The function that does the work in the backlog device is process_backlog.
[See Herbert pp. 184-192, and diagram on pg. 192]
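The hand-off between the ISR and the softIRQ described above can be modeled in user space as a small bounded queue. This is only a sketch: the names toy_netif_rx, toy_dequeue, struct input_queue, and the BACKLOG_MAX limit are hypothetical, and dropping a packet when the queue is full is an assumption about how flow control behaves, not a copy of the kernel code.

```c
#include <stdbool.h>
#include <stddef.h>

#define BACKLOG_MAX 4   /* hypothetical queue limit, for illustration only */

struct pkt { int id; };

/* Toy stand-in for the input packet queue (not the kernel's softnet_data). */
struct input_queue {
    struct pkt slots[BACKLOG_MAX];
    size_t head;            /* index of the oldest queued packet */
    size_t len;             /* number of packets currently queued */
    bool softirq_pending;   /* models "the receive softIRQ was requested" */
};

/* ISR side: queue the packet, or report a drop when the queue is full. */
static bool toy_netif_rx(struct input_queue *q, struct pkt p)
{
    if (q->len == BACKLOG_MAX)
        return false;                         /* assumed: full queue drops */
    q->slots[(q->head + q->len) % BACKLOG_MAX] = p;
    q->len++;
    q->softirq_pending = true;                /* ask for deferred processing */
    return true;
}

/* SoftIRQ side: remove the oldest packet from the queue. */
static bool toy_dequeue(struct input_queue *q, struct pkt *out)
{
    if (q->len == 0)
        return false;
    *out = q->slots[q->head];
    q->head = (q->head + 1) % BACKLOG_MAX;
    q->len--;
    return true;
}
```

The point of the split is that the ISR only enqueues and returns quickly, while the dequeue side runs later outside the interrupt handler.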
This function loops through all the packets on the backlog device's input packet queue. A time budget prevents it from monopolizing the processor. It also disables hardware interrupts to protect the input packet queue, because the ISR places packets from the card directly on the queue.
[Herbert pp. 194-196]
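A budgeted loop of the kind described above might look like the following sketch. The struct toy_queue type and the toy_process_backlog function are hypothetical, and the budget here is a simple packet count chosen for illustration; the kernel's actual accounting is not reproduced.

```c
#include <stddef.h>

/* Toy queue: only counts packets (hypothetical, not kernel code). */
struct toy_queue {
    size_t pending;   /* packets waiting on the input queue */
    size_t handled;   /* packets already passed upward */
};

/*
 * Handle at most `budget` packets, then return so the processor is released.
 * Returns 1 if packets remain (the caller should run again later),
 * or 0 if the queue was drained.
 */
static int toy_process_backlog(struct toy_queue *q, size_t budget)
{
    size_t done = 0;

    while (q->pending > 0 && done < budget) {
        /* The real function protects this dequeue by disabling
         * hardware interrupts, as the text notes. */
        q->pending--;
        q->handled++;   /* stand-in for handing the packet to a protocol */
        done++;
    }
    return q->pending > 0;
}
```

With 10 packets queued and a budget of 4, the first call handles 4 packets and reports that work remains.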
The protocol field of the skb (taken from the link layer header) is compared to the registered protocol handler values. If there is a match, the packet is passed to the matching protocol handler. If there is no match, the packet is dropped. The TPR_NET_START timer point is located before the comparison and TPR_NET_END is located after it.
[Herbert pp. 197-199]
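The comparison against registered handlers can be sketched as a lookup over a small table. All toy_ names are hypothetical; the two EtherType values (0x0800 for IPv4, 0x0806 for ARP) are the standard ones, but the table layout is an assumption made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_ETH_P_IP  0x0800  /* IPv4 EtherType */
#define TOY_ETH_P_ARP 0x0806  /* ARP EtherType */

struct toy_skb { uint16_t protocol; };   /* only the field we need */

typedef int (*toy_handler_fn)(struct toy_skb *skb);

struct toy_packet_type {
    uint16_t type;         /* protocol value this handler accepts */
    toy_handler_fn func;   /* handler to call on a match */
};

static int toy_ip_rcv(struct toy_skb *skb)  { (void)skb; return 1; }
static int toy_arp_rcv(struct toy_skb *skb) { (void)skb; return 2; }

static const struct toy_packet_type toy_handlers[] = {
    { TOY_ETH_P_IP,  toy_ip_rcv  },
    { TOY_ETH_P_ARP, toy_arp_rcv },
};

/* Returns the handler's result, or -1 when nothing matches (packet dropped). */
static int toy_deliver(struct toy_skb *skb)
{
    size_t n = sizeof(toy_handlers) / sizeof(toy_handlers[0]);

    for (size_t i = 0; i < n; i++) {
        if (toy_handlers[i].type == skb->protocol)
            return toy_handlers[i].func(skb);
    }
    return -1;
}
```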
As the main input function for the IP protocol, ip_rcv takes an incoming packet (in the form of an sk_buff) from netif_receive_skb. This function checks that the packet is IPv4 and that the header checksum is valid. If either check fails, the packet is discarded. The timer point TPR_IP_START is located at the beginning of this function.
[Herbert pp. 447-450, diagram pg. 448]
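The two checks named above (version and header checksum) can be sketched in user space as follows. The function names are hypothetical and the code works on a raw byte array rather than an sk_buff; the checksum is the standard one's-complement sum over 16-bit words, not the kernel's optimized routine.

```c
#include <stddef.h>
#include <stdint.h>

/* One's-complement checksum over 16-bit big-endian words. */
static uint16_t toy_ip_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((hdr[i] << 8) | hdr[i + 1]);
    while (sum >> 16)                       /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Returns 1 if the header passes, 0 if the packet should be discarded. */
static int toy_ip_header_ok(const uint8_t *hdr, size_t len)
{
    if (len < 20)
        return 0;                           /* shorter than a minimal header */

    unsigned version = hdr[0] >> 4;
    size_t ihl_bytes = (size_t)(hdr[0] & 0x0f) * 4;

    if (version != 4 || ihl_bytes < 20 || ihl_bytes > len)
        return 0;

    /* Summing a valid header, checksum field included, gives 0xffff,
     * so the complemented result is 0. */
    return toy_ip_checksum(hdr, ihl_bytes) == 0;
}
```

Verifying over the whole header, with the checksum field left in place, avoids having to zero the field and compare.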
Internal and external routing is performed in the ip_rcv_finish function. After the destination of the packet is determined, the packet is passed to ip_local_deliver.
[Herbert pp. 450-453]
This function reassembles the packet (if it was fragmented) and sends the packet to ip_local_deliver_finish. The TPR_IP_END timer point is hit after the packet is reassembled.
[Herbert pg. 453]
The ip_local_deliver_finish function determines which higher level protocol will receive the packet, how many protocols will receive the packet (in special cases), and which protocols will receive a clone of the packet. It also sends the packet to any open raw sockets. The packet is sent to a higher level protocol by passing the packet's data to the registered packet handling function (each protocol must register a handling function when it is being registered).
Instead of the TPR_IP_END timer point being in the ip_local_deliver function, it might be more appropriately placed in the ip_local_deliver_finish function in front of the call to the protocol packet handler.
[Herbert pp. 454-456]
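The registration step mentioned above can be sketched as a table indexed by the IP protocol number. The toy_ names are hypothetical, and raw sockets and packet clones are left out; only the protocol numbers (6 for TCP, 17 for UDP) are standard values.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_PROTOS  256
#define TOY_IPPROTO_TCP 6
#define TOY_IPPROTO_UDP 17

typedef int (*toy_proto_handler)(const uint8_t *payload, size_t len);

/* One slot per possible value of the IP header's protocol field. */
static toy_proto_handler toy_protos[TOY_MAX_PROTOS];

/* Register a handler; returns -1 if the slot is already taken. */
static int toy_add_protocol(uint8_t proto, toy_proto_handler h)
{
    if (toy_protos[proto] != NULL)
        return -1;
    toy_protos[proto] = h;
    return 0;
}

/* Pass the payload to the registered handler; -1 means no handler exists. */
static int toy_local_deliver(uint8_t proto, const uint8_t *payload, size_t len)
{
    toy_proto_handler h = toy_protos[proto];

    return h ? h(payload, len) : -1;
}

/* Example handler: reports how many payload bytes it was given. */
static int toy_tcp_rcv(const uint8_t *payload, size_t len)
{
    (void)payload;
    return (int)len;
}
```

A timer point placed immediately before the call through the table entry would correspond to the alternative TPR_IP_END location suggested above.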
TCP has three queues: the receive queue, the backlog queue, and the prequeue. Normally (defined as the receive queue not being full and the user's receive socket not being in use) the receive queue takes the packet and passes it to the socket. If the receive queue is full or the user task has the socket locked, the packets are placed in the backlog queue. The packet is sent to the prequeue via the tcp_prequeue function when header prediction determines that the packet is an in-order segment containing data and the socket is in the established state. The receive/backlog queue route is called the "slow" path and the prequeue route is called the "fast" path.
[Herbert pp. 478-483, diagram on pg. 473]
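The queue choice described above can be written down as a small decision function. This is a sketch of the description in these notes, not of the kernel source: the type and function names are hypothetical, and the order in which the conditions are tested (fast-path test first) is an assumption, since the notes do not say which condition takes precedence.

```c
#include <stdbool.h>

enum toy_tcp_queue {
    TOY_RECEIVE_QUEUE,
    TOY_BACKLOG_QUEUE,
    TOY_PREQUEUE
};

struct toy_tcp_state {
    bool established;          /* socket is in the established state */
    bool predicted_in_order;   /* header prediction: in-order segment with data */
    bool receive_queue_full;
    bool locked_by_user;       /* user task currently has the socket locked */
};

static enum toy_tcp_queue toy_pick_queue(const struct toy_tcp_state *s)
{
    /* Assumed precedence: the fast path is considered first. */
    if (s->predicted_in_order && s->established)
        return TOY_PREQUEUE;

    /* Slow path, socket busy or queue full: defer to the backlog queue. */
    if (s->receive_queue_full || s->locked_by_user)
        return TOY_BACKLOG_QUEUE;

    /* Slow path, normal case. */
    return TOY_RECEIVE_QUEUE;
}
```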
Among many other things, the tcp_v4_rcv function determines if the packet should take the fast or slow route through TCP. Other actions taken by this function are error checking, determining if the socket is able to accept the packet, and calling functions to handle TCP state. The TPR_IP_TCP timer point is located at the beginning of this function.
[Herbert pp. 472-478]
The tcp_prequeue function gives the packet to the socket and sets the ACK to be sent back (it is piggybacked on the next data segment). If the prequeue is full, the packet is sent to the TCP backlog device. If the prequeue is not full, the packet is placed in the prequeue.
[Herbert pp. 478-482]
As the backlog receive function for TCP, tcp_v4_do_rcv is called when the socket is unable to receive incoming packets. If the TCP state is ESTABLISHED at this point (as the header prediction suggested), the packet is sent to be fully processed by TCP via tcp_rcv_established. Other TCP state packets (ACK, SYN, etc) are determined and passed off to state handling functions.
[Herbert pp. 482-483]
The tcp_rcv_established function does the bulk of the work with packets that contain data (that is, more than TCP state packets). The actual header prediction happens in this function. The final check for taking the fast path is also applied to the packet in this function. If the fast path is taken, the packet data is copied to user space. If the slow path is taken, the packet data is copied to the socket queue (which will then have to be copied again to user space; this is my guess as to what exactly they mean by the fast path). Other functions are called to send out ACKs if necessary. Timer point TPR_TCP_SOCK1 is located in the fast path after the copy to user memory but before the potential sending of ACKs. The TPR_TCP_SOCK2 timer point is similarly located in the slow path.
On line 4325, there is a tcp_rcv_rtt_measure_ts function. This might be interesting to poke at as it might have an alternate take on our timing information.
[Herbert pp. 493-501, diagram pg. 494]
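A heavily simplified version of the fast-path test might look like the sketch below. It is an assumption-laden stand-in for header prediction: the real check compares the header against precomputed prediction flags, while this version only tests for an in-order, data-carrying segment with no unusual flags. The struct and function names are hypothetical; the flag bit values are the standard TCP ones.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_TH_FIN 0x01
#define TOY_TH_SYN 0x02
#define TOY_TH_RST 0x04
#define TOY_TH_PSH 0x08
#define TOY_TH_ACK 0x10
#define TOY_TH_URG 0x20

struct toy_segment {
    uint32_t seq;           /* sequence number of the first data byte */
    uint8_t  flags;         /* TCP flag bits */
    uint16_t payload_len;   /* bytes of data carried */
};

/* Returns true when the segment looks like plain in-order data. */
static bool toy_fast_path_ok(const struct toy_segment *seg, uint32_t rcv_nxt)
{
    const uint8_t unusual = TOY_TH_FIN | TOY_TH_SYN | TOY_TH_RST | TOY_TH_URG;

    if (seg->flags & unusual)
        return false;               /* state-changing segment: slow path */
    if (!(seg->flags & TOY_TH_ACK))
        return false;               /* established traffic carries ACK */
    if (seg->seq != rcv_nxt)
        return false;               /* out of order: slow path */
    return seg->payload_len > 0;    /* must actually contain data */
}
```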
"queues up data in the socket's normal receive queue." "puts segments that are out of order on the
[Herbert pg. 501]
When data is put on the socket queue, the user task receives a signal meaning that there is data to read. The user task uses the system receive or read calls to read from the socket. These system calls are translated to tcp_recvmsg, which copies the information stored in the socket to a user buffer. This function also takes care of the user closing the socket and sending the proper state packets.
[Herbert pp. 508-516]
"This routine provides an alternative to tcp_recvmsg() for routines that would like to handle copying from skbuffs directly in 'sendfile' fashion."
This is the registered function that receives the packet from IP. The packet header, checksum, and length are checked for errors. If no errors are detected, the packet is placed on the UDP receive queue by a call to udp_queue_rcv_skb. The TPR_UDP_START timer point is placed before the error checking. The TPR_UDP_END timer point is located after the call to udp_queue_rcv_skb.
[Herbert pp. 459-462]
This function deals with encapsulated packets, completes the packet checksum process, and determines if there is enough space in the socket for the packet. If there is not enough space, the packet is dropped. The sock_queue_rcv_skb function is called to place the packet information in the socket.
[Herbert pp. 466-467]
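The space check described above can be sketched as simple byte accounting against a limit. The struct toy_sock fields and the toy_sock_queue_rcv function are hypothetical names, and the exact comparison used by the kernel is not reproduced here.

```c
#include <stddef.h>

struct toy_sock {
    size_t rmem_alloc;   /* bytes currently charged to the receive queue */
    size_t rcvbuf;       /* receive buffer limit for this socket */
    unsigned drops;      /* packets discarded for lack of space */
};

/* Returns 0 when the packet is queued, -1 when it is dropped. */
static int toy_sock_queue_rcv(struct toy_sock *sk, size_t pkt_size)
{
    if (sk->rmem_alloc + pkt_size > sk->rcvbuf) {
        sk->drops++;                 /* not enough space: drop the packet */
        return -1;
    }
    sk->rmem_alloc += pkt_size;      /* charge the packet to the socket */
    return 0;
}
```

Because the charge is released only when the user task reads the data, a slow reader causes later datagrams to be dropped rather than queued without bound.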
When receiving UDP data, the socket calls the UDP socket receive function, udp_recvmsg. This function checks for socket errors on the error message queue, dequeues packets from the socket's receive queue, and calls skb_copy_datagram_iovec to copy the datagram into the user's buffer. This may be the best place for TPR_UDP_END and TPR_SOCKET_START.
[Herbert pp. 467-470]
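The copy into the user's buffer can be sketched as below. The names are hypothetical and plain memcpy stands in for skb_copy_datagram_iovec; the truncation behavior (excess bytes of a datagram are discarded when the buffer is too small) is standard UDP socket behavior rather than something stated in these notes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct toy_datagram {
    const unsigned char *data;
    size_t len;
};

/*
 * Copy one datagram into the caller's buffer.
 * Copies at most buflen bytes, sets *truncated when the datagram was
 * larger than the buffer, and returns the number of bytes copied.
 */
static size_t toy_udp_recv(const struct toy_datagram *d,
                           unsigned char *buf, size_t buflen,
                           bool *truncated)
{
    size_t n = d->len < buflen ? d->len : buflen;

    memcpy(buf, d->data, n);
    *truncated = d->len > buflen;
    return n;
}
```

A timer point placed around the copy would measure the kernel-to-user transfer, which is the boundary the notes propose for TPR_UDP_END and TPR_SOCKET_START.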
Linux 2.6 TCP/IP Send Stack
some routing is done here udp.c
not sure how this path completes, but eventually we end up at ip_output (as with TCP packets)
calls dst->output or hh->hh_output, which are both function pointers to dev_queue_xmit
cf. Herbert pp. 107-110
member of net_device struct (implemented by each network driver); see how we hook into it in ip_output.c
cf. Herbert p. 100