A Run-Time Sequence Analysis of the TCP/IP Stack in the Linux Kernel
I. Execution Entities in the Linux Kernel
1. Overview
The kernel's execution entities include kernel threads, system call handling, interrupt handling, softirqs, tasklets, work queues, and so on.
Kernel threads take part in scheduling like any other thread. On a context switch the current context is saved, and the scheduler picks the next thread to run on the CPU according to its scheduling algorithm. The kernel threads used for softirq processing are named ksoftirqd.
A system call is the only way for the kernel to provide services to user processes: an application invokes functionality exported by the operating system. The call switches the caller from user mode to kernel mode, runs the corresponding kernel code, returns the result to the user program, and switches back to user mode. On x86, system calls can be entered via the int $0x80 trap or the syscall instruction.
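As a concrete illustration (a minimal user-space sketch, not part of the code studied later), the same kernel service can be requested either through the glibc wrapper or through the raw syscall(2) interface:

    /* Two ways to request the same kernel service (getpid here),
     * illustrating the user -> kernel mode switch. */
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t a = getpid();             /* via the glibc wrapper        */
        pid_t b = syscall(SYS_getpid);  /* raw trap into the kernel     */

        printf("getpid() = %d, syscall(SYS_getpid) = %d\n", (int)a, (int)b);
        return 0;
    }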
Interrupt handling means that, given an interrupt number, the kernel locates the corresponding irq_desc structure (irq_desc is the kernel's descriptor for an interrupt; there is an array of them, irq_desc_ptrs[NR_IRQS]), calls the handle_irq function stored in it (the interrupt entry function), runs that entry function, and returns from the interrupt. This belongs to the Top-Half of interrupt processing.
Interrupt context is the context in which the CPU, after an interrupt fires, jumps into the handler code the kernel has installed and processes the interrupt. Interrupt context is not scheduled, so it must never perform anything that might sleep: sleeping gives up the CPU, but since interrupt context does not participate in scheduling, no context would be saved and control would never return. This also means an interrupt handler cannot exchange data with user space, because such an exchange may sleep. Because of these constraints, interrupt processing is split into a Top-Half and a Bottom-Half, and the interrupt handler proper belongs to the Top-Half.
2. The Two Halves of Interrupt Handling
Top-Half: the Top-Half is the interrupts-disabled processing mentioned above. It responds to the interrupt as quickly as possible and does only the work that must happen immediately after the interrupt; anything that can run after the handler returns should, by kernel convention, be deferred to the Bottom-Half. The Bottom-Half covers work that is related to the interrupt but can be delayed and runs with interrupts enabled, so it is suitable for non-urgent tasks. Because the Bottom-Half can itself be interrupted, urgent interrupts can still be serviced promptly while it runs.
Bottom-Half: if one phrase sums up the Bottom-Half, it is deferred execution. While the Top-Half (the interrupt handler) is running, the current interrupt line is masked on all processors, and local interrupts are disabled on the handling CPU. Keeping the time during which interrupts are masked as short as possible is critical for both responsiveness and performance, so long-running work is pushed into the Bottom-Half.
3. Bottom-Half Mechanisms
1) Softirqs
Softirqs let the kernel defer work. They behave much like the hardware interrupts described above but are implemented entirely in software, hence the name software interrupt (softirq).
1.1 Before a softirq can be used, its handler must be registered with open_softirq, whose first argument is the softirq number. Registration simply fills in the softirq vector table: softirq_vec[nr] = the softirq_action descriptor wrapping the handler.
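As a rough illustration of both registration and raising (modeled on what the networking core does in net_dev_init() in net/core/dev.c; treat the call sites as representative rather than a verbatim excerpt):

    /* registration: fill two slots of the softirq vector table */
    open_softirq(NET_TX_SOFTIRQ, net_tx_action);  /* softirq_vec[NET_TX_SOFTIRQ].action = net_tx_action */
    open_softirq(NET_RX_SOFTIRQ, net_rx_action);

    /* later, typically from hard-IRQ context, the softirq is marked pending;
     * __do_softirq() picks it up at irq_exit() or in ksoftirqd */
    raise_softirq(NET_RX_SOFTIRQ);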
1.2 A hardware interrupt is first dispatched by interrupt number, i.e. through do_IRQ.
1.3 do_IRQ runs every interrupt handler registered on that interrupt line; it does so by calling handle_IRQ_event, which walks all handlers attached to the line.
1.4 do_IRQ then calls irq_exit, which is responsible for invoking and processing softirqs. This is where the bottom half of interrupt handling begins.
void irq_exit(void)
{
    ...
    if (!in_interrupt() && local_softirq_pending())  /* any softirq pending, i.e. has raise_softirq been called? */
        invoke_softirq();  /* kick off softirq processing, which ends up in __do_softirq() */
    ...
}
invoke_softirq()
static inline void invoke_softirq(void)
{
    if (ksoftirqd_running(local_softirq_pending()))
        return;

    if (!force_irqthreads) {
#ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
        /*
         * We can safely execute softirq on the current stack if
         * it is the irq stack, because it should be near empty
         * at this stage.
         */
        __do_softirq();         /* not forced into the softirq thread: call __do_softirq() directly */
#else
        /*
         * Otherwise, irq_exit() is called on the task stack that can
         * be potentially deep already. So call softirq in its own stack
         * to prevent from any overrun.
         */
        do_softirq_own_stack(); /* internally this still ends up in __do_softirq() */
#endif
    } else {
        wakeup_softirqd();      /* otherwise wake the softirq thread and let it do the work */
    }
}
do_softirq_own_stack()
#ifdef __ARCH_HAS_DO_SOFTIRQ
void do_softirq_own_stack(void);
#else
static inline void do_softirq_own_stack(void)
{
    __do_softirq();
}
#endif
__do_softirq()
asmlinkage __visible void __softirq_entry __do_softirq(void)
{
    unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
    unsigned long old_flags = current->flags;
    int max_restart = MAX_SOFTIRQ_RESTART;
    struct softirq_action *h;
    bool in_hardirq;
    __u32 pending;
    int softirq_bit;

    /*
     * Mask out PF_MEMALLOC as the current task context is borrowed for the
     * softirq. A softirq handled, such as network RX, might set PF_MEMALLOC
     * again if the socket is related to swapping.
     */
    current->flags &= ~PF_MEMALLOC;

    pending = local_softirq_pending();
    account_irq_enter_time(current);

    __local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
    in_hardirq = lockdep_softirq_start();

restart:
    /* Reset the pending bitmask before enabling irqs */
    set_softirq_pending(0);

    local_irq_enable();             /* enable local interrupts */

    h = softirq_vec;

    while ((softirq_bit = ffs(pending))) {
        unsigned int vec_nr;
        int prev_count;

        h += softirq_bit - 1;       /* locate the handler selected by softirq_bit */

        vec_nr = h - softirq_vec;   /* the softirq number */
        prev_count = preempt_count();

        kstat_incr_softirqs_this_cpu(vec_nr);

        trace_softirq_entry(vec_nr);
        h->action(h);               /* invoke the softirq handler */
        trace_softirq_exit(vec_nr);
        if (unlikely(prev_count != preempt_count())) {
            pr_err("huh, entered softirq %u %s %p with preempt_count %08x, exited with %08x?\n",
                   vec_nr, softirq_to_name[vec_nr], h->action,
                   prev_count, preempt_count());
            preempt_count_set(prev_count);
        }
        h++;
        pending >>= softirq_bit;
    }

    if (__this_cpu_read(ksoftirqd) == current)
        rcu_softirq_qs();
    local_irq_disable();            /* disable local interrupts */

    pending = local_softirq_pending();
    if (pending) {
        if (time_before(jiffies, end) && !need_resched() && --max_restart)
            goto restart;

        wakeup_softirqd();          /* if needed, wake ksoftirqd to handle newly raised softirqs */
    }

    lockdep_softirq_end(in_hardirq);
    account_irq_exit_time(current);
    __local_bh_enable(SOFTIRQ_OFFSET);
    WARN_ON_ONCE(in_interrupt());
    current_restore_flags(old_flags, PF_MEMALLOC);
}
__do_softirq() essentially performs the following steps:
1. It calls local_softirq_pending to find out which softirq bits are currently pending.
2. It calls __local_bh_disable to disable softirqs; in effect this sets the "softirq in progress" mark so that __do_softirq cannot be re-entered on the same CPU.
3. It clears the pending mask with set_softirq_pending(0), so that once interrupts are re-enabled, hardware interrupt handlers can set softirq bits again.
4. It re-enables hardware interrupts, so softirq processing can itself be interrupted by hard IRQs.
5. It then loops over every set bit in pending and calls the corresponding softirq handler. Hardware interrupts stay enabled during this loop and can preempt softirq processing at any time, so no hardware interrupt is lost.
6. Finally it disables hardware interrupts again and checks whether new softirqs have become pending. If so, and if certain conditions are met, it wakes the ksoftirqd thread to handle them.
2) Tasklets
Softirqs are the most efficient way to defer work to a later point, but the mechanism is complex to use: __do_softirq may run concurrently and independently on multiple CPUs, and the same softirq handler can run on several CPUs at once, so handlers must be fully reentrant and thread-safe, and critical sections must be protected by spinlocks. Because softirqs are cumbersome for these reasons, the tasklet mechanism was introduced on top of them.
Tasklets have the following properties:
1. Each tasklet type is bound to exactly one action function.
2. A given tasklet runs on only one CPU at a time: it cannot run in parallel with itself and must not block.
3. The number of tasklets can change dynamically, unlike softirqs, which are a static table of at most 32 entries.

Tasklets are kept on linked lists. Each CPU has two softirq numbers dedicated to tasklet processing, and each of those softirqs maintains its own per-CPU tasklet list: tasklet_vec and tasklet_hi_vec.
struct tasklet_head {
    struct tasklet_struct *head;    /* first element of this CPU's tasklet queue */
    struct tasklet_struct **tail;   /* points at the last element's next pointer; points at head when the queue is empty */
};

static DEFINE_PER_CPU(struct tasklet_head, tasklet_vec);
static DEFINE_PER_CPU(struct tasklet_head, tasklet_hi_vec);
Taking tasklet_vec as an example:


The tasklet structure:
struct tasklet_struct {
    struct tasklet_struct *next;    /* next tasklet_struct, linking tasklets into a queue */
    unsigned long state;            /* run-state flags of this tasklet */
    atomic_t count;                 /* reference/disable count; the tasklet is enabled only when count is 0 */
    void (*func)(unsigned long);    /* the tasklet handler: the heart of the tasklet */
    unsigned long data;             /* argument passed to the handler */
};
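As a usage sketch (a hypothetical driver fragment using the classic callback signature that matches the structure above; newer kernels changed this signature):

    #include <linux/interrupt.h>

    static void my_tasklet_fn(unsigned long data)
    {
        pr_info("tasklet ran, data=%lu\n", data);   /* runs in softirq context: no sleeping */
    }

    static DECLARE_TASKLET(my_tasklet, my_tasklet_fn, 0);

    static irqreturn_t my_irq_handler(int irq, void *dev_id)
    {
        /* Top half: acknowledge the hardware, then defer the rest. */
        tasklet_schedule(&my_tasklet);              /* queue onto this CPU's tasklet_vec */
        return IRQ_HANDLED;
    }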
3) Work queues
As the discussion above shows, softirqs run in interrupt context and therefore may not block or sleep; tasklets are built on softirqs and inherit the same restriction. What if a deferred handler needs to sleep or block? That is exactly what work queues are for.
A deferred unit of work is called a work item, described by struct work_struct. Work items are organized into work queues, described by struct workqueue_struct, and worker threads execute the items queued on them. The default worker threads are the events threads.
A work queue is yet another way to defer work: the work is handed to a kernel thread, so this bottom half always runs in process context. Because the worker is a kernel thread it cannot access user space, but the key property is that work-queue handlers may be rescheduled and may even sleep.
Choosing between work queues and softirqs/tasklets is usually straightforward. The following rules apply:
- If the deferred task needs to sleep, the only choice is a work queue.
- If the deferred task should fire after a specified delay, use a work queue, which can be delayed via a timer (implemented with kernel timers).
- If the deferred task must be handled within one tick, use a softirq or tasklet, since they can preempt ordinary processes and kernel threads and never sleep.
- If the deferred task has no latency requirement at all, use a work queue; this is usually the case for non-critical work.
In essence, a work queue simply hands work to a kernel thread, so it could be replaced by a hand-rolled kernel thread. But creating and destroying kernel threads correctly is demanding, whereas work queues wrap kernel threads in a less error-prone interface, which is why they are the recommended choice. A minimal usage sketch follows.
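A minimal sketch of the API described above (hypothetical handler and names; the handler runs in process context in a kworker/events thread, so it may sleep):

    #include <linux/delay.h>
    #include <linux/workqueue.h>

    static void my_work_fn(struct work_struct *work)
    {
        msleep(10);                 /* sleeping is allowed here, unlike in softirq/tasklet context */
        pr_info("deferred work done\n");
    }

    static DECLARE_WORK(my_work, my_work_fn);

    /* somewhere in interrupt or process context: */
    schedule_work(&my_work);        /* hand the work item to the system workqueue */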

II. The TCP/IP Protocol Stack
2.1 A brief introduction to the TCP/IP stack
1. Definition. The TCP/IP stack is the collection of network protocols that forms the backbone of network communication: it defines how devices connect to the Internet and how data is transferred between them. TCP/IP uses a four-layer model: application, transport, network, and link, where each layer relies on the services provided by the layer below it.
2. Responsibilities of each layer:
- Link layer: groups bits into frames, identifies hosts by their physical (MAC) addresses, and transmits the data;
- Network layer: defines IP addresses, locates hosts on the network, maps IP addresses to MAC addresses (via ARP), and routes packets destined for other networks;
- Transport layer: defines ports, identifies the application on a host, and delivers packets to the right application;
- Application layer: defines data formats and interprets the data accordingly.

2.2 The Linux network protocol stack
The entire Linux network stack is built inside the kernel and is strictly layered. It consists of five layers:
1. The system call interface layer: essentially a library of calls exposed to user-space applications, giving them access to the network services.
2. The protocol-independent interface layer, i.e. the socket layer. Its purpose is to hide the differences between the underlying protocols (mainly TCP and UDP, but also raw IP, SCTP, and others) so that the interface to the system call layer stays simple and uniform. Put simply, whatever protocol the application uses, it creates a socket through the system call interface; the socket is in essence a large sock structure tied to the protocol layer below, which hides protocol differences and presents only the data to the application (through the system call interface).
3. The protocol implementation layer, which is without doubt the core of the stack. It implements the network protocols themselves, above all IP, ICMP, ARP, RARP, TCP, and UDP.
4. The device-independent driver interface layer. Its purpose is to unify the interface between the protocol layer and the drivers for different network cards: it abstracts the various drivers into a few common operations such as open, close, and init, shielding the layers above from driver differences.
5. The driver layer, which interfaces with the hardware.

2.3 Sockets
Network applications talk to the in-kernel protocol stack almost exclusively through the Linux socket programming interface. Linux sockets evolved from BSD sockets; they are an essential part of the operating system and the foundation of network applications. In layering terms they sit at the application layer: an API the operating system provides to application programmers for accessing the transport-layer protocols.
- Sockets sit on top of the transport protocols and hide the differences between them.
- Sockets are the entry point of network programming; the system calls they provide make up the bulk of a network program.
- On Linux a socket is part of the file abstraction, so network communication can be treated like reading and writing a file, giving the same convenient control over the network as over files (see the sketch below).
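A small user-space sketch of that last point (a hypothetical helper, error handling omitted): once a TCP socket is connected, plain read() and write() on the socket descriptor behave like recv()/send() with flags set to 0.

    #include <sys/socket.h>
    #include <unistd.h>

    static void echo_once(int connected_fd)
    {
        char buf[128];
        ssize_t n = read(connected_fd, buf, sizeof(buf));   /* same as recv(fd, buf, len, 0) */
        if (n > 0)
            write(connected_fd, buf, (size_t)n);            /* same as send(fd, buf, n, 0)  */
    }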

III. The Code Under Study
server.h:
#ifndef __SERVER_H #define __SERVER_H #ifdef WIN32 #include <winsock.h> #elif linux #include <sys/types.h> #include <sys/wait.h> #include <sys/time.h> #include <sys/socket.h> #include <netdb.h> #include <arpa/inet.h> #include <errno.h> #include <unistd.h> #endif using namespace std; class Server { private: #ifdef WIN32 #define SOCKET SOCKET #define SIN_ADDRESS sin_addr.S_un #define SADDR S_addr #elif linux #define SOCKET int #define SIN_ADDRESS sin_addr #define SADDR s_addr #define MAXBYTE (0xff) #define SOCKET_ERROR (-1) #endif public: #ifdef WIN32 #define init() win32_init() #define destory() win32_destory() #define Accept(x,y,z) accept(x,y,z) #define Sleep(x) Sleep(x) #elif linux #define init() linux_init() #define destory() linux_destory() #define Accept(x,y,z) accept(x,y,reinterpret_cast<socklen_t*>(z)) #define Sleep(x) sleep(x) #endif private: SOCKET server_socket, client_socket; sockaddr_in server_information, client_information; const char* server_address; unsigned int server_port; public: static bool win32_init(); static bool linux_init(); void linux_destory(); void win32_destory(); public: Server(unsigned int server_pt); ~Server() { destory(); } bool get_host_ip(); void draw_window(); bool bind_socket(); bool listen_server(int n); void accept_request(); }; #endif
server.cpp:
#include <iostream> #include <cstring> #include "server.h" using namespace std; Server::Server(unsigned int server_pt) { server_socket = socket(AF_INET, SOCK_STREAM, 0); server_address = nullptr; server_port = server_pt; server_information.sin_family = AF_INET; server_information.sin_port = htons(server_port); server_information.SIN_ADDRESS.SADDR = 0; } #ifdef WIN32 bool Server::win32_init() { WORD mversion_request; WSADATA wsadata; mversion_request = MAKEWORD(2, 2); if (WSAStartup(mversion_request, &wsadata) != 0) { cout << "WSAStartup() failed! " << endl; return false; } if (wsadata.wVersion != mversion_request) { cout << "The version of Winsocket is not suited! " << endl; WSACleanup(); return false; } return true; } void Server::win32_destory() { shutdown(server_socket, 2); closesocket(server_socket); WSACleanup(); } #endif #ifdef linux bool Server::linux_init() { return true; } void Server::linux_destory() { close(server_socket); } #endif bool Server::get_host_ip() { char host_name[MAXBYTE]; gethostname(host_name, MAXBYTE); hostent* p_hostent = gethostbyname(host_name); if (p_hostent == nullptr) { cout << "Get Self Ip Error" << endl; return false; } in_addr in_add; memcpy(&in_add, p_hostent->h_addr_list[0], 4); server_address = inet_ntoa(in_add); server_information.sin_addr = in_add; cout << "服务器的主机名为 : " << host_name << " , 服务器的IP地址为 : " << server_address << endl; return true; } void Server::draw_window() { cout << "服务端启动成功" << endl; cout << "------------------" << endl; cout << "----- 服务端 -----" << endl; cout << "------------------" << endl; } bool Server::bind_socket() { cout << "Begin Bind Socket " << endl; if (bind(server_socket, (sockaddr*)&server_information, sizeof(sockaddr)) == SOCKET_ERROR) { cout << "Bind Socket Error " << endl; return false; } cout << "Bind Socket Successfully !!" << endl; return true; } bool Server::listen_server(int n) { cout << "Begin Listen " << endl; if (listen(server_socket, n)) { cout << "Listen Error " << endl; return false; } cout << "Listen Successfully !!" << endl; return true; } void Server::accept_request() { int len = sizeof(sockaddr); string message = "Welcome to my server !!, you can send anything !!!"; client_socket = Accept(server_socket, (sockaddr*)&client_information, &len); cout << "Accept Client, Client IP : " << inet_ntoa(client_information.sin_addr) << " , Client Port : " << ntohs(client_information.sin_port) << endl; cout << "Server Send A Message To Client : " << message << endl; send(client_socket, message.c_str(), message.size(), 0); cout << "------------- Waiting For Clients -------------" << endl; char buffer[1024] = "\0"; while (true) { if ((len = recv(client_socket, buffer, sizeof(buffer), 0)) > 0) { if (strncmp(buffer, "quit", len) == 0 || strncmp(buffer, "\\q", len) == 0) { cout << "客户端已终止连接" << endl; break; } cout << "Recevied Datagrame From Tcp Client, IP : " << inet_ntoa(client_information.sin_addr) << " , Port : " << ntohs(client_information.sin_port) << endl; /*for (int a = 0; a < len; ++a) { if (buffer[a] >= 'a' && buffer[a] <= 'z') buffer[a] &= 0xDF; } buffer[len] = '\0';*/ cout << "Transform Successfully : " << buffer << endl; send(client_socket, "hi", 2, 0); } //Sleep(500); } } int main() { unsigned int server_port = 10086; const int client_size = 5; if (Server::init()) { Server server(server_port); if (server.get_host_ip() && server.bind_socket() && server.listen_server(client_size)) { server.draw_window(); server.accept_request(); } } return 0; }
client.h:
#ifndef __CLIENT_H #define __CLIENT_H #ifdef WIN32 #include <winsock.h> #elif linux #include <sys/types.h> #include <sys/wait.h> #include <sys/time.h> #include <sys/socket.h> #include <netdb.h> #include <arpa/inet.h> #include <errno.h> #include <unistd.h> #endif class Client { private: #ifdef WIN32 #define SOCKET SOCKET #define SIN_ADDRESS sin_addr.S_un #define SADDR S_addr #elif linux #define SOCKET int #define SIN_ADDRESS sin_addr #define SADDR s_addr #define MAXBYTE (0xff) #define SOCKET_ERROR (-1) #endif public: #ifdef WIN32 #define init() win32_init() #define destory() win32_destory() #define Accept(x,y,z) accept(x,y,z) #define Sleep(x) Sleep(x) #elif linux #define init() linux_init() #define destory() linux_destory() #define Accept(x,y,z) accept(x,y,reinterpret_cast<socklen_t*>(z)) #define Sleep(x) sleep(x) #endif private: SOCKET server_socket; sockaddr_in server_information; const char* server_address; unsigned int server_port; public: static bool win32_init(); static bool linux_init(); void win32_destory(); void linux_destory(); public: Client(const char* server_add, unsigned int server_pt); Client(in_addr& add, unsigned int server_pt); ~Client() { destory(); } void draw_window(); bool connet_server(); void send_something(); }; #endif
client.cpp:
// Client.cpp : 此文件包含 "main" 函数。程序执行将在此处开始并结束。 // #include <iostream> #include <sstream> #include "client.h" using namespace std; Client::Client(const char* server_add, unsigned int server_pt) { server_socket = socket(AF_INET, SOCK_STREAM, 0); server_address = server_add; server_port = server_pt; server_information.sin_family = AF_INET; server_information.sin_port = htons(server_port); server_information.SIN_ADDRESS.SADDR = inet_addr(server_address); } Client::Client(in_addr& server_add, unsigned int server_pt) { server_socket = socket(AF_INET, SOCK_STREAM, 0); server_address = inet_ntoa(server_add); server_port = server_pt; server_information.sin_family = AF_INET; server_information.sin_port = htons(server_port); server_information.sin_addr = server_add; } #ifdef WIN32 bool Client::win32_init() { WORD mversion_request; WSADATA wsadata; mversion_request = MAKEWORD(2, 2); if (WSAStartup(mversion_request, &wsadata) != 0) { cout << "WSAStartup() failed! " << endl; return false; } if (wsadata.wVersion != mversion_request) { cout << "The version of Winsocket is not suited! " << endl; WSACleanup(); return false; } return true; } void Client::win32_destory() { shutdown(server_socket, 2); closesocket(server_socket); WSACleanup(); } #endif #ifdef linux bool Client::linux_init() { return true; } void Client::linux_destory() { close(server_socket); } #endif void Client::draw_window() { cout << "客户端启动成功" << endl; cout << "------------------" << endl; cout << "----- 客户端 -----" << endl; cout << "------------------" << endl; } bool Client::connet_server() { cout << "Begin Connet Server, IP : " << server_address << ", Port : " << server_port << endl; if (connect(server_socket, (sockaddr*)&server_information, sizeof(sockaddr)) == SOCKET_ERROR) { cout << "Connect Failed !! " << SOCKET_ERROR << endl; return false; } cout << "Connect Successfully " << endl; return true; } void Client::send_something() { int len = -1; char buffer[1024] = "\0"; string middle = ""; while (middle != "quit" && middle != "\\q") { if ((len = recv(server_socket, buffer, sizeof(buffer), 0)) == -1) { cout << "Recv Error " << endl; exit(-1); } buffer[len] = '\0'; cout << "Receve For Server : " << buffer << endl; cout << "Send Something To Server : (quit or \\q to quit) " << endl; getline(cin, middle); cout << middle << " , size : " << middle.size() << endl; send(server_socket, middle.c_str(), middle.size(), 0); cout << "Send Successfully : " << middle << endl; } } int main(int argc, char *argv[]) { hostent* server_ip; unsigned int server_port = 10086; char host_name[MAXBYTE] = { 0 }; if (Client::init()) { if (argc != 2) { cerr << "usage : client hostname" << endl; exit(-1); } if((server_ip = gethostbyname(argv[1])) == nullptr) { cerr << "server host name is error " << endl; exit(-1); } Client client(*((in_addr*)server_ip->h_addr_list[0]), server_port); if (client.connet_server()) { client.draw_window(); client.send_something(); } } return 0; }
IV. Tracing send
1. Application layer
The client starts by calling send. Its declaration is in the sys/socket.h header, and following that declaration leads to its definition in glibc's send.S:
#define socket      send
#define __socket    __libc_send
#define NARGS       4
#define NEED_CANCELLATION
#include <socket.S>
So send.S is generated from socket.S used as a template; socket.S reads as follows:
#define P(a, b) P2(a, b) #define P2(a, b) a##b #ifndef __socket # ifndef NO_WEAK_ALIAS # define __socket P(__,socket) # else # define __socket socket # endif #endif .globl __socket cfi_startproc ENTRY (__socket) /* Save registers. */ movl %ebx, %edx cfi_register (3, 2) movl $SYS_ify(socketcall), %eax /* System call number in %eax. */ /* Use ## so `socket' is a separate token that might be #define'd. */ movl $P(SOCKOP_,socket), %ebx /* Subcode is first arg to syscall. */ lea 4(%esp), %ecx /* Address of args is 2nd arg. */ /* Do the system call trap. */ ENTER_KERNEL /* Restore registers. */ movl %edx, %ebx cfi_restore (3) ret
After the macros are expanded, the code becomes:
ENTRY (__libc_send)
    movl %ebx, %edx
    cfi_register (3, 2)
    movl $__NR_socketcall, %eax
    movl $SOCKOP_send, %ebx
    lea 4(%esp), %ecx
    int $0x80
    movl %edx, %ebx
    cfi_restore (3)
    ret
Here __NR_socketcall is 102 (defined in unistd_32.h) and SOCKOP_send is 9 (defined in socketcall.h). The int $0x80 entry point is defined in entry_32.S:
ENTRY(entry_INT80_32) ASM_CLAC pushl %eax /* pt_regs->orig_ax */ SAVE_ALL pt_regs_ax=$-ENOSYS switch_stacks=1 /* save rest */ /* * User mode is traced as though IRQs are on, and the interrupt gate * turned them off. */ TRACE_IRQS_OFF movl %esp, %eax call do_int80_syscall_32
...
First the eax register, i.e. the system call number 102 mentioned above, is pushed onto the stack. SAVE_ALL then pushes the remaining registers to save them:
.macro SAVE_ALL pt_regs_ax=%eax switch_stacks=0 skip_gs=0 unwind_espfix=0 cld .if \skip_gs == 0 PUSH_GS .endif pushl %fs pushl %eax movl $(__KERNEL_PERCPU), %eax movl %eax, %fs .if \unwind_espfix > 0 UNWIND_ESPFIX_STACK .endif popl %eax FIXUP_FRAME pushl %es pushl %ds pushl \pt_regs_ax pushl %ebp pushl %edi pushl %esi pushl %edx pushl %ecx pushl %ebx movl $(__USER_DS), %edx movl %edx, %ds movl %edx, %es .if \skip_gs == 0 SET_KERNEL_GS %edx .endif /* Switch to kernel stack if necessary */ .if \switch_stacks > 0 SWITCH_TO_KERNEL_STACK .endif .endm
The current stack pointer is then saved into eax and do_int80_syscall_32 is called. That function is defined in common.c:
__visible void do_int80_syscall_32(struct pt_regs *regs)
{
    enter_from_user_mode();
    local_irq_enable();
    do_syscall_32_irqs_on(regs);
}
which in turn calls do_syscall_32_irqs_on():
static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs) { struct thread_info *ti = current_thread_info(); unsigned int nr = (unsigned int)regs->orig_ax; #ifdef CONFIG_IA32_EMULATION ti->status |= TS_COMPAT; #endif if (READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY) { /* * Subtlety here: if ptrace pokes something larger than * 2^32-1 into orig_ax, this truncates it. This may or * may not be necessary, but it matches the old asm * behavior. */ nr = syscall_trace_enter(regs); } if (likely(nr < IA32_NR_syscalls)) { nr = array_index_nospec(nr, IA32_NR_syscalls); #ifdef CONFIG_IA32_EMULATION regs->ax = ia32_sys_call_table[nr](regs); #else /* * It's possible that a 32-bit syscall implementation * takes a 64-bit parameter but nonetheless assumes that * the high bits are zero. Make sure we zero-extend all * of the args. */ regs->ax = ia32_sys_call_table[nr]( (unsigned int)regs->bx, (unsigned int)regs->cx, (unsigned int)regs->dx, (unsigned int)regs->si, (unsigned int)regs->di, (unsigned int)regs->bp); #endif /* CONFIG_IA32_EMULATION */ } syscall_return_slowpath(regs); }
Here nr picks up the system call number 102 that was pushed earlier, and the corresponding entry of the system call table is invoked with the arguments that were pushed onto the stack.
The table ia32_sys_call_table is defined in syscall_32.c, which includes syscall_32.h; that header is generated from syscall_32.tbl, so we can look at syscall_32.tbl directly:
102 i386 socketcall sys_socketcall __ia32_compat_sys_socketcall
So system call number 102 ultimately maps to sys_socketcall.
At this point send has left glibc, the system has switched from user mode to kernel mode, and we have identified the corresponding system call, so kernel debugging can begin.
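As a user-space illustration of the mapping just described (32-bit x86 only, since x86-64 exposes a separate sendto system call instead; a hypothetical wrapper with error handling omitted), send(fd, buf, len, flags) boils down to one socketcall trap with subcode SYS_SEND (9) and a pointer to the packed arguments:

    #include <linux/net.h>      /* SYS_SEND */
    #include <sys/syscall.h>    /* SYS_socketcall (i386) */
    #include <unistd.h>

    ssize_t my_send(int fd, const void *buf, size_t len, int flags)
    {
        unsigned long args[4] = { (unsigned long)fd, (unsigned long)buf,
                                  (unsigned long)len, (unsigned long)flags };
        return syscall(SYS_socketcall, SYS_SEND, args);
    }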
Debugging with gdb, we first land in sys_socketcall, in socket.c; its definition goes through macro expansion first:
SYSCALL_DEFINE2(socketcall, int, call, unsigned long __user *, args) { ... switch (call) { case SYS_SOCKET: err = __sys_socket(a0, a1, a[2]); break; case SYS_BIND: err = __sys_bind(a0, (struct sockaddr __user *)a1, a[2]); break; case SYS_CONNECT: err = __sys_connect(a0, (struct sockaddr __user *)a1, a[2]); break; case SYS_LISTEN: err = __sys_listen(a0, a1); break; case SYS_ACCEPT: err = __sys_accept4(a0, (struct sockaddr __user *)a1, (int __user *)a[2], 0); break; case SYS_GETSOCKNAME: err = __sys_getsockname(a0, (struct sockaddr __user *)a1, (int __user *)a[2]); break; case SYS_GETPEERNAME: err = __sys_getpeername(a0, (struct sockaddr __user *)a1, (int __user *)a[2]); break; case SYS_SOCKETPAIR: err = __sys_socketpair(a0, a1, a[2], (int __user *)a[3]); break; case SYS_SEND: err = __sys_sendto(a0, (void __user *)a1, a[2], a[3], NULL, 0); break; case SYS_SENDTO: err = __sys_sendto(a0, (void __user *)a1, a[2], a[3], (struct sockaddr __user *)a[4], a[5]); break; case SYS_RECV: err = __sys_recvfrom(a0, (void __user *)a1, a[2], a[3], NULL, NULL); break; case SYS_RECVFROM: err = __sys_recvfrom(a0, (void __user *)a1, a[2], a[3], (struct sockaddr __user *)a[4], (int __user *)a[5]); break; case SYS_SHUTDOWN: err = __sys_shutdown(a0, a1); break; case SYS_SETSOCKOPT: err = __sys_setsockopt(a0, a1, a[2], (char __user *)a[3], a[4]); break; case SYS_GETSOCKOPT: err = __sys_getsockopt(a0, a1, a[2], (char __user *)a[3], (int __user *)a[4]); break; case SYS_SENDMSG: err = __sys_sendmsg(a0, (struct user_msghdr __user *)a1, a[2], true); break; case SYS_SENDMMSG: err = __sys_sendmmsg(a0, (struct mmsghdr __user *)a1, a[2], a[3], true); break; case SYS_RECVMSG: err = __sys_recvmsg(a0, (struct user_msghdr __user *)a1, a[2], true); break; case SYS_RECVMMSG: if (IS_ENABLED(CONFIG_64BIT) || !IS_ENABLED(CONFIG_64BIT_TIME)) err = __sys_recvmmsg(a0, (struct mmsghdr __user *)a1, a[2], a[3], (struct __kernel_timespec __user *)a[4], NULL); else err = __sys_recvmmsg(a0, (struct mmsghdr __user *)a1, a[2], a[3], NULL, (struct old_timespec32 __user *)a[4]); break; case SYS_ACCEPT4: err = __sys_accept4(a0, (struct sockaddr __user *)a1, (int __user *)a[2], a[3]); break; default: err = -EINVAL; break; } return err; }
As established above, call is 9, so the SYS_SEND case is taken and __sys_sendto is called:
int __sys_sendto(int fd, void __user *buff, size_t len, unsigned int flags, struct sockaddr __user *addr, int addr_len) { struct socket *sock; struct sockaddr_storage address; int err; struct msghdr msg; ... msg.msg_flags = flags; err = sock_sendmsg(sock, &msg); out_put: fput_light(sock->file, fput_needed); out: return err; }
Inside, it calls sock_sendmsg. Following that call:
int sock_sendmsg(struct socket *sock, struct msghdr *msg)
{
    int err = security_socket_sendmsg(sock, msg,
                                      msg_data_left(msg));

    return err ?: sock_sendmsg_nosec(sock, msg);
}
which calls sock_sendmsg_nosec; continuing:
static inline int sock_sendmsg_nosec(struct socket *sock, struct msghdr *msg)
{
    int ret = INDIRECT_CALL_INET(sock->ops->sendmsg, inet6_sendmsg,
                                 inet_sendmsg, sock, msg,
                                 msg_data_left(msg));
    BUG_ON(ret == -EIOCBQUEUED);
    return ret;
}
Expanding the INDIRECT_CALL_INET macro gives, in effect: int ret = sock->ops->sendmsg(sock, msg, msg_data_left(msg));
So what is actually called is the sendmsg function behind the socket's ops pointer.
struct socket { socket_state state; short type; unsigned long flags; struct file *file; struct sock *sk; const struct proto_ops *ops; struct socket_wq wq; };
struct proto_ops { int family; struct module *owner; int (*release) (struct socket *sock); int (*bind) (struct socket *sock, struct sockaddr *myaddr, int sockaddr_len); int (*connect) (struct socket *sock, struct sockaddr *vaddr, int sockaddr_len, int flags); ... int (*sendmsg) (struct socket *sock, struct msghdr *m, size_t total_len); /* Notes for implementing recvmsg: * =============================== * msg->msg_namelen should get updated by the recvmsg handlers * iff msg_name != NULL. It is by default 0 to prevent * returning uninitialized memory to user space. The recvfrom * handlers can assume that msg.msg_name is either NULL or has * a minimum size of sizeof(struct sockaddr_storage). */ int (*recvmsg) (struct socket *sock, struct msghdr *m, size_t total_len, int flags); ... };
Looking this up, the proto_ops table for TCP is initialized in af_inet.c:
const struct proto_ops inet_stream_ops = { .family = PF_INET, .owner = THIS_MODULE, .release = inet_release, .bind = inet_bind, .connect = inet_stream_connect, .socketpair = sock_no_socketpair, .accept = inet_accept, .getname = inet_getname, .poll = tcp_poll, .ioctl = inet_ioctl, .gettstamp = sock_gettstamp, .listen = inet_listen, .shutdown = inet_shutdown, .setsockopt = sock_common_setsockopt, .getsockopt = sock_common_getsockopt, .sendmsg = inet_sendmsg, .recvmsg = inet_recvmsg, #ifdef CONFIG_MMU .mmap = tcp_mmap, #endif .sendpage = inet_sendpage, .splice_read = tcp_splice_read, .read_sock = tcp_read_sock, .sendmsg_locked = tcp_sendmsg_locked, .sendpage_locked = tcp_sendpage_locked, .peek_len = tcp_peek_len, #ifdef CONFIG_COMPAT .compat_setsockopt = compat_sock_common_setsockopt, .compat_getsockopt = compat_sock_common_getsockopt, .compat_ioctl = inet_compat_ioctl, #endif .set_rcvlowat = tcp_set_rcvlowat, };
So sock->ops->sendmsg resolves to inet_sendmsg:
int inet_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
{
    struct sock *sk = sock->sk;

    if (unlikely(inet_send_prepare(sk)))
        return -EAGAIN;

    return INDIRECT_CALL_2(sk->sk_prot->sendmsg, tcp_sendmsg, udp_sendmsg,
                           sk, msg, size);
}
After macro expansion this becomes sk->__sk_common.skc_prot->sendmsg(sk, msg, size), which for a TCP socket ends up in tcp_sendmsg.
2. TCP layer
Looking at tcp_sendmsg, it simply takes the socket lock and calls tcp_sendmsg_locked:
int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
{
    int ret;

    lock_sock(sk);
    ret = tcp_sendmsg_locked(sk, msg, size);
    release_sock(sk);

    return ret;
}
Following the call:
int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) { struct tcp_sock *tp = tcp_sk(sk); struct ubuf_info *uarg = NULL; struct sk_buff *skb; struct sockcm_cookie sockc; int flags, err, copied = 0; int mss_now = 0, size_goal, copied_syn = 0; int process_backlog = 0; bool zc = false; long timeo; flags = msg->msg_flags; if (flags & MSG_ZEROCOPY && size && sock_flag(sk, SOCK_ZEROCOPY)) { skb = tcp_write_queue_tail(sk); uarg = sock_zerocopy_realloc(sk, size, skb_zcopy(skb)); if (!uarg) { err = -ENOBUFS; goto out_err; } zc = sk->sk_route_caps & NETIF_F_SG; if (!zc) uarg->zerocopy = 0; } if (unlikely(flags & MSG_FASTOPEN || inet_sk(sk)->defer_connect) && !tp->repair) { err = tcp_sendmsg_fastopen(sk, msg, &copied_syn, size, uarg); if (err == -EINPROGRESS && copied_syn > 0) goto out; else if (err) goto out_err; } timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT); tcp_rate_check_app_limited(sk); /* is sending application-limited? */ /* Wait for a connection to finish. One exception is TCP Fast Open * (passive side) where data is allowed to be sent before a connection * is fully established. */ if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) && !tcp_passive_fastopen(sk)) { err = sk_stream_wait_connect(sk, &timeo); if (err != 0) goto do_error; } if (unlikely(tp->repair)) { if (tp->repair_queue == TCP_RECV_QUEUE) { copied = tcp_send_rcvq(sk, msg, size); goto out_nopush; } err = -EINVAL; if (tp->repair_queue == TCP_NO_QUEUE) goto out_err; /* 'common' sending to sendq */ } sockcm_init(&sockc, sk); if (msg->msg_controllen) { err = sock_cmsg_send(sk, msg, &sockc); if (unlikely(err)) { err = -EINVAL; goto out_err; } } /* This should be in poll */ sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk); /* Ok commence sending. */ copied = 0; restart: mss_now = tcp_send_mss(sk, &size_goal, flags); err = -EPIPE; if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN)) goto do_error; while (msg_data_left(msg)) { int copy = 0; skb = tcp_write_queue_tail(sk); if (skb) copy = size_goal - skb->len; if (copy <= 0 || !tcp_skb_can_collapse_to(skb)) { bool first_skb; new_segment: if (!sk_stream_memory_free(sk)) goto wait_for_sndbuf; if (unlikely(process_backlog >= 16)) { process_backlog = 0; if (sk_flush_backlog(sk)) goto restart; } first_skb = tcp_rtx_and_write_queues_empty(sk); skb = sk_stream_alloc_skb(sk, 0, sk->sk_allocation, first_skb); if (!skb) goto wait_for_memory; process_backlog++; skb->ip_summed = CHECKSUM_PARTIAL; skb_entail(sk, skb); copy = size_goal; /* All packets are restored as if they have * already been sent. skb_mstamp_ns isn't set to * avoid wrong rtt estimation. */ if (tp->repair) TCP_SKB_CB(skb)->sacked |= TCPCB_REPAIRED; } /* Try to append data to the end of skb. */ if (copy > msg_data_left(msg)) copy = msg_data_left(msg); /* Where to copy to? */ if (skb_availroom(skb) > 0 && !zc) { /* We have some space in skb head. Superb! 
*/ copy = min_t(int, copy, skb_availroom(skb)); err = skb_add_data_nocache(sk, skb, &msg->msg_iter, copy); if (err) goto do_fault; } else if (!zc) { bool merge = true; int i = skb_shinfo(skb)->nr_frags; struct page_frag *pfrag = sk_page_frag(sk); if (!sk_page_frag_refill(sk, pfrag)) goto wait_for_memory; if (!skb_can_coalesce(skb, i, pfrag->page, pfrag->offset)) { if (i >= sysctl_max_skb_frags) { tcp_mark_push(tp, skb); goto new_segment; } merge = false; } copy = min_t(int, copy, pfrag->size - pfrag->offset); if (!sk_wmem_schedule(sk, copy)) goto wait_for_memory; err = skb_copy_to_page_nocache(sk, &msg->msg_iter, skb, pfrag->page, pfrag->offset, copy); if (err) goto do_error; /* Update the skb. */ if (merge) { skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy); } else { skb_fill_page_desc(skb, i, pfrag->page, pfrag->offset, copy); page_ref_inc(pfrag->page); } pfrag->offset += copy; } else { err = skb_zerocopy_iter_stream(sk, skb, msg, copy, uarg); if (err == -EMSGSIZE || err == -EEXIST) { tcp_mark_push(tp, skb); goto new_segment; } if (err < 0) goto do_error; copy = err; } if (!copied) TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_PSH; WRITE_ONCE(tp->write_seq, tp->write_seq + copy); TCP_SKB_CB(skb)->end_seq += copy; tcp_skb_pcount_set(skb, 0); copied += copy; if (!msg_data_left(msg)) { if (unlikely(flags & MSG_EOR)) TCP_SKB_CB(skb)->eor = 1; goto out; } if (skb->len < size_goal || (flags & MSG_OOB) || unlikely(tp->repair)) continue; if (forced_push(tp)) { tcp_mark_push(tp, skb); __tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_PUSH); } else if (skb == tcp_send_head(sk)) tcp_push_one(sk, mss_now); continue; wait_for_sndbuf: set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); wait_for_memory: if (copied) tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH, size_goal); err = sk_stream_wait_memory(sk, &timeo); if (err != 0) goto do_error; mss_now = tcp_send_mss(sk, &size_goal, flags); } out: if (copied) { tcp_tx_timestamp(sk, sockc.tsflags); tcp_push(sk, flags, mss_now, tp->nonagle, size_goal); } out_nopush: sock_zerocopy_put(uarg); return copied + copied_syn; do_error: skb = tcp_write_queue_tail(sk); do_fault: tcp_remove_empty_skb(sk, skb); if (copied + copied_syn) goto out; out_err: sock_zerocopy_put_abort(uarg, true); err = sk_stream_error(sk, flags, err); /* make sure we wake any epoll edge trigger waiter */ if (unlikely(tcp_rtx_and_write_queues_empty(sk) && err == -EAGAIN)) { sk->sk_write_space(sk); tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED); } return err; }
This function is long; it mainly does the following:
1. It checks the TCP state: data can be sent only when the connection is in ESTABLISHED or CLOSE_WAIT.
2. It computes segmentation parameters, obtaining the maximum segment size (MSS), and performs a few other checks.
3. It copies the data to be sent into skb buffers and then attempts to call tcp_push:
static void tcp_push(struct sock *sk, int flags, int mss_now,
                     int nonagle, int size_goal)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct sk_buff *skb;

    /* grab the last skb on the write queue */
    skb = tcp_write_queue_tail(sk);
    if (!skb)
        return;
    /* decide whether the PSH flag needs to be set */
    if (!(flags & MSG_MORE) || forced_push(tp))
        tcp_mark_push(tp, skb);

    /* if MSG_OOB was requested, record the urgent pointer */
    tcp_mark_urg(tp, flags);

    /* decide whether to hold back small packets so several small
     * segments can be merged into one skb and sent together */
    if (tcp_should_autocork(sk, skb, size_goal)) {

        /* avoid atomic op if TSQ_THROTTLED bit is already set */
        if (!test_bit(TSQ_THROTTLED, &sk->sk_tsq_flags)) {
            NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPAUTOCORKING);
            set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags);
        }
        /* It is possible TX completion already happened
         * before we set TSQ_THROTTLED, so re-check the condition
         * to avoid wrongly corking this packet.
         */
        if (refcount_read(&sk->sk_wmem_alloc) > skb->truesize)
            return;
    }

    /* with MSG_MORE the application tells layer 4 that more small packets
     * will follow; the flag is passed down to layer 3, which will set up
     * an MTU-sized packet in advance to combine these frames */
    if (flags & MSG_MORE)
        nonagle = TCP_NAGLE_CORK;

    /* the TCP layer is not done yet: keep going */
    __tcp_push_pending_frames(sk, mss_now, nonagle);
}
tcp_push finally calls __tcp_push_pending_frames; following that:
void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
                               int nonagle)
{
    /* If we are closed, the bytes will have to remain here.
     * In time closedown will finish, we empty the write queue and
     * all will be happy.
     */
    if (unlikely(sk->sk_state == TCP_CLOSE))
        return;

    if (tcp_write_xmit(sk, cur_mss, nonagle, 0,
                       sk_gfp_mask(sk, GFP_ATOMIC)))
        tcp_check_probe_timer(sk);  /* sending failed: see whether the probe (persist) timer must be armed */
}
Continuing into tcp_write_xmit:
static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle, int push_one, gfp_t gfp) { struct tcp_sock *tp = tcp_sk(sk); struct sk_buff *skb; unsigned int tso_segs, sent_pkts; int cwnd_quota; int result; bool is_cwnd_limited = false, is_rwnd_limited = false; u32 max_segs; sent_pkts = 0; tcp_mstamp_refresh(tp); if (!push_one) { /* Do MTU probing. */ result = tcp_mtu_probe(sk); if (!result) { return false; } else if (result > 0) { sent_pkts = 1; } } max_segs = tcp_tso_segs(sk, mss_now); while ((skb = tcp_send_head(sk))) { unsigned int limit; ... /* Argh, we hit an empty skb(), presumably a thread * is sleeping in sendmsg()/sk_stream_wait_memory(). * We do not want to send a pure-ack packet and have * a strange looking rtx queue with empty packet(s). */ if (TCP_SKB_CB(skb)->end_seq == TCP_SKB_CB(skb)->seq) break; if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp))) break; ...
}
This function checks the connection state and the size of the congestion window, then sends the skb data by calling tcp_transmit_skb, which internally calls __tcp_transmit_skb:
static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it, gfp_t gfp_mask, u32 rcv_nxt) { const struct inet_connection_sock *icsk = inet_csk(sk); struct inet_sock *inet; struct tcp_sock *tp; struct tcp_skb_cb *tcb; struct tcp_out_options opts; unsigned int tcp_options_size, tcp_header_size; struct sk_buff *oskb = NULL; struct tcp_md5sig_key *md5; struct tcphdr *th; u64 prior_wstamp; int err; BUG_ON(!skb || !tcp_skb_pcount(skb)); tp = tcp_sk(sk); prior_wstamp = tp->tcp_wstamp_ns; tp->tcp_wstamp_ns = max(tp->tcp_wstamp_ns, tp->tcp_clock_cache); skb->skb_mstamp_ns = tp->tcp_wstamp_ns; if (clone_it) { TCP_SKB_CB(skb)->tx.in_flight = TCP_SKB_CB(skb)->end_seq - tp->snd_una; oskb = skb; tcp_skb_tsorted_save(oskb) { if (unlikely(skb_cloned(oskb))) skb = pskb_copy(oskb, gfp_mask); else skb = skb_clone(oskb, gfp_mask); } tcp_skb_tsorted_restore(oskb); if (unlikely(!skb)) return -ENOBUFS; /* retransmit skbs might have a non zero value in skb->dev * because skb->dev is aliased with skb->rbnode.rb_left */ skb->dev = NULL; } inet = inet_sk(sk); tcb = TCP_SKB_CB(skb); memset(&opts, 0, sizeof(opts)); if (unlikely(tcb->tcp_flags & TCPHDR_SYN)) { tcp_options_size = tcp_syn_options(sk, skb, &opts, &md5); } else { tcp_options_size = tcp_established_options(sk, skb, &opts, &md5); /* Force a PSH flag on all (GSO) packets to expedite GRO flush * at receiver : This slightly improve GRO performance. * Note that we do not force the PSH flag for non GSO packets, * because they might be sent under high congestion events, * and in this case it is better to delay the delivery of 1-MSS * packets and thus the corresponding ACK packet that would * release the following packet. */ if (tcp_skb_pcount(skb) > 1) tcb->tcp_flags |= TCPHDR_PSH; } tcp_header_size = tcp_options_size + sizeof(struct tcphdr); /* if no packet is in qdisc/device queue, then allow XPS to select * another queue. We can be called from tcp_tsq_handler() * which holds one reference to sk. * * TODO: Ideally, in-flight pure ACK packets should not matter here. * One way to get this would be to set skb->truesize = 2 on them. */ skb->ooo_okay = sk_wmem_alloc_get(sk) < SKB_TRUESIZE(1); /* If we had to use memory reserve to allocate this skb, * this might cause drops if packet is looped back : * Other socket might not have SOCK_MEMALLOC. * Packets not looped back do not care about pfmemalloc. */ skb->pfmemalloc = 0; skb_push(skb, tcp_header_size); skb_reset_transport_header(skb); skb_orphan(skb); skb->sk = sk; skb->destructor = skb_is_tcp_pure_ack(skb) ? __sock_wfree : tcp_wfree; skb_set_hash_from_sk(skb, sk); refcount_add(skb->truesize, &sk->sk_wmem_alloc); skb_set_dst_pending_confirm(skb, sk->sk_dst_pending_confirm); /* Build TCP header and checksum it. 
*/ th = (struct tcphdr *)skb->data; th->source = inet->inet_sport; th->dest = inet->inet_dport; th->seq = htonl(tcb->seq); th->ack_seq = htonl(rcv_nxt); *(((__be16 *)th) + 6) = htons(((tcp_header_size >> 2) << 12) | tcb->tcp_flags); th->check = 0; th->urg_ptr = 0; /* The urg_mode check is necessary during a below snd_una win probe */ if (unlikely(tcp_urg_mode(tp) && before(tcb->seq, tp->snd_up))) { if (before(tp->snd_up, tcb->seq + 0x10000)) { th->urg_ptr = htons(tp->snd_up - tcb->seq); th->urg = 1; } else if (after(tcb->seq + 0xFFFF, tp->snd_nxt)) { th->urg_ptr = htons(0xFFFF); th->urg = 1; } } tcp_options_write((__be32 *)(th + 1), tp, &opts); skb_shinfo(skb)->gso_type = sk->sk_gso_type; if (likely(!(tcb->tcp_flags & TCPHDR_SYN))) { th->window = htons(tcp_select_window(sk)); tcp_ecn_send(sk, skb, th, tcp_header_size); } else { /* RFC1323: The window in SYN & SYN/ACK segments * is never scaled. */ th->window = htons(min(tp->rcv_wnd, 65535U)); } #ifdef CONFIG_TCP_MD5SIG /* Calculate the MD5 hash, as we have all we need now */ if (md5) { sk_nocaps_add(sk, NETIF_F_GSO_MASK); tp->af_specific->calc_md5_hash(opts.hash_location, md5, sk, skb); } #endif icsk->icsk_af_ops->send_check(sk, skb); if (likely(tcb->tcp_flags & TCPHDR_ACK)) tcp_event_ack_sent(sk, tcp_skb_pcount(skb), rcv_nxt); if (skb->len != tcp_header_size) { tcp_event_data_sent(tp, sk); tp->data_segs_out += tcp_skb_pcount(skb); tp->bytes_sent += skb->len - tcp_header_size; } if (after(tcb->end_seq, tp->snd_nxt) || tcb->seq == tcb->end_seq) TCP_ADD_STATS(sock_net(sk), TCP_MIB_OUTSEGS, tcp_skb_pcount(skb)); tp->segs_out += tcp_skb_pcount(skb); /* OK, its time to fill skb_shinfo(skb)->gso_{segs|size} */ skb_shinfo(skb)->gso_segs = tcp_skb_pcount(skb); skb_shinfo(skb)->gso_size = tcp_skb_mss(skb); /* Leave earliest departure time in skb->tstamp (skb->skb_mstamp_ns) */ /* Cleanup our debris for IP stacks */ memset(skb->cb, 0, max(sizeof(struct inet_skb_parm), sizeof(struct inet6_skb_parm))); tcp_add_tx_delay(skb, tp); err = icsk->icsk_af_ops->queue_xmit(sk, skb, &inet->cork.fl); if (unlikely(err > 0)) { tcp_enter_cwr(sk); err = net_xmit_eval(err); } if (!err && oskb) { tcp_update_skb_after_send(sk, oskb, prior_wstamp); tcp_rate_skb_sent(sk, oskb); } return err; }
As the comments indicate, the function first clones the skb to be sent (keeping a copy for possible retransmission), builds the TCP header, and then executes err = icsk->icsk_af_ops->queue_xmit(sk, skb, &inet->cork.fl). For IPv4 TCP, queue_xmit is set to ip_queue_xmit, so once the transport layer has wrapped and checked the data, it hands the segment, TCP header included, down to the IP layer. That concludes the transport-layer part of the trace.
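For reference, that assignment lives in the ipv4_specific ops table in net/ipv4/tcp_ipv4.c; an abridged sketch (most fields omitted) looks roughly like this:

    /* for an IPv4 TCP socket, icsk->icsk_af_ops points at this table,
     * so queue_xmit above resolves to ip_queue_xmit */
    const struct inet_connection_sock_af_ops ipv4_specific = {
        .queue_xmit     = ip_queue_xmit,
        .send_check     = tcp_v4_send_check,
        .rebuild_header = inet_sk_rebuild_header,
        .conn_request   = tcp_v4_conn_request,
        .syn_recv_sock  = tcp_v4_syn_recv_sock,
        /* ... */
    };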
3. Network (IP) layer
ip_queue_xmit simply calls __ip_queue_xmit:
int __ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl,
                    __u8 tos)
{
    struct inet_sock *inet = inet_sk(sk);
    struct net *net = sock_net(sk);
    struct ip_options_rcu *inet_opt;
    struct flowi4 *fl4;
    struct rtable *rt;
    struct iphdr *iph;
    int res;

    /* Skip all of this if the packet is already routed,
     * f.e. by something like SCTP.
     */
    rcu_read_lock();
    inet_opt = rcu_dereference(inet->inet_opt);
    fl4 = &fl->u.ip4;
    /* fetch the route cached in the skb */
    rt = skb_rtable(skb);
    /* if there is a cached route, jump straight to packet_routed;
     * otherwise look one up via ip_route_output_ports */
    if (rt)
        goto packet_routed;

    /* Make sure we can route this packet. */
    /* check the route cached in the socket */
    rt = (struct rtable *)__sk_dst_check(sk, 0);
    /* the cached route has expired */
    if (!rt) {
        __be32 daddr;

        /* Use correct destination address if we have options. */
        daddr = inet->inet_daddr;
        if (inet_opt && inet_opt->opt.srr)
            daddr = inet_opt->opt.faddr;

        /* If this fails, retransmit mechanism of transport layer will
         * keep trying until route appears or the connection times
         * itself out.
         */
        /* look up the route */
        rt = ip_route_output_ports(net, fl4, sk,
                                   daddr, inet->inet_saddr,
                                   inet->inet_dport,
                                   inet->inet_sport,
                                   sk->sk_protocol,
                                   RT_CONN_FLAGS_TOS(sk, tos),
                                   sk->sk_bound_dev_if);
        if (IS_ERR(rt))
            goto no_route;      /* lookup failed */
        /* cache the route in the socket */
        sk_setup_caps(sk, &rt->dst);
    }
    /* attach the route to the skb */
    skb_dst_set_noref(skb, &rt->dst);

packet_routed:
    if (inet_opt && inet_opt->opt.is_strictroute && rt->rt_uses_gateway)
        goto no_route;

    /* OK, we know where to send it, allocate and build IP header. */
    skb_push(skb, sizeof(struct iphdr) + (inet_opt ? inet_opt->opt.optlen : 0));
    skb_reset_network_header(skb);
    iph = ip_hdr(skb);
    *((__be16 *)iph) = htons((4 << 12) | (5 << 8) | (tos & 0xff));
    if (ip_dont_fragment(sk, &rt->dst) && !skb->ignore_df)
        iph->frag_off = htons(IP_DF);
    else
        iph->frag_off = 0;
    iph->ttl      = ip_select_ttl(inet, &rt->dst);
    iph->protocol = sk->sk_protocol;
    ip_copy_addrs(iph, fl4);

    /* Transport layer set skb->h.foo itself. */

    if (inet_opt && inet_opt->opt.optlen) {
        iph->ihl += inet_opt->opt.optlen >> 2;
        ip_options_build(skb, &inet_opt->opt, inet->inet_daddr, rt, 0);
    }

    ip_select_ident_segs(net, skb, sk,
                         skb_shinfo(skb)->gso_segs ?: 1);

    /* TODO : should we use skb->sk here instead of sk ? */
    skb->priority = sk->sk_priority;
    skb->mark = sk->sk_mark;

    res = ip_local_out(net, sk, skb);
    rcu_read_unlock();
    return res;

no_route:
    /* handling when no route is available */
    rcu_read_unlock();
    IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
    kfree_skb(skb);
    return -EHOSTUNREACH;
}
skb_rtable(skb) fetches the route cached in the skb; if one exists, execution jumps straight to the packet_routed label, otherwise ip_route_output_ports looks up a route. Finally the packet is sent with ip_local_out. Like ip_queue_xmit, ip_local_out is itself only a thin wrapper around __ip_local_out, roughly as sketched below.
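For reference, the wrapper looks roughly like this in net/ipv4/ip_output.c: it runs __ip_local_out and, when the netfilter LOCAL_OUT hook accepts the packet (return value 1), hands it on to dst_output:

    int ip_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
    {
        int err;

        err = __ip_local_out(net, sk, skb);
        if (likely(err == 1))               /* hook accepted the packet */
            err = dst_output(net, sk, skb);

        return err;
    }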
int __ip_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
{
    struct iphdr *iph = ip_hdr(skb);

    iph->tot_len = htons(skb->len);
    ip_send_check(iph);

    /* if egress device is enslaved to an L3 master device pass the
     * skb to its handler for processing
     */
    skb = l3mdev_ip_out(sk, skb);
    if (unlikely(!skb))
        return 0;

    skb->protocol = htons(ETH_P_IP);

    return nf_hook(NFPROTO_IPV4, NF_INET_LOCAL_OUT,
                   net, sk, skb, NULL, skb_dst(skb)->dev,
                   dst_output);
}
__ip_local_out fills in the total length and checksum of the IP header and arranges for dst_output to send the IP datagram:
/* Output packet to network from transport.  */
static inline int dst_output(struct net *net, struct sock *sk, struct sk_buff *skb)
{
    return skb_dst(skb)->output(net, sk, skb);
}
dst_output retrieves the dst_entry attached to the skb via skb_dst and invokes its output callback, which for locally generated IPv4 packets is ip_output:
int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb)
{
    struct net_device *dev = skb_dst(skb)->dev;

    IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);

    skb->dev = dev;
    skb->protocol = htons(ETH_P_IP);

    return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING,
                        net, sk, skb, NULL, dev,
                        ip_finish_output,
                        !(IPCB(skb)->flags & IPSKB_REROUTED));
}
ip_output invokes ip_finish_output, which in turn calls __ip_finish_output, so we look at __ip_finish_output directly:
static int __ip_finish_output(struct net *net, struct sock *sk, struct sk_buff *skb) { unsigned int mtu; #if defined(CONFIG_NETFILTER) && defined(CONFIG_XFRM) /* Policy lookup after SNAT yielded a new policy */ if (skb_dst(skb)->xfrm) { IPCB(skb)->flags |= IPSKB_REROUTED; return dst_output(net, sk, skb); } #endif mtu = ip_skb_dst_mtu(sk, skb); if (skb_is_gso(skb)) return ip_finish_output_gso(net, sk, skb, mtu); if (skb->len > mtu || (IPCB(skb)->flags & IPSKB_FRAG_PMTU)) return ip_fragment(net, sk, skb, mtu, ip_finish_output2); return ip_finish_output2(net, sk, skb); }
This function obtains the MTU; if the packet needs fragmentation it is fragmented first and ip_finish_output2 is then called for each fragment, otherwise ip_finish_output2 is called directly.
So we look at the source of ip_finish_output2:
static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *skb) { struct dst_entry *dst = skb_dst(skb); struct rtable *rt = (struct rtable *)dst; struct net_device *dev = dst->dev; unsigned int hh_len = LL_RESERVED_SPACE(dev); struct neighbour *neigh; bool is_v6gw = false; if (rt->rt_type == RTN_MULTICAST) { IP_UPD_PO_STATS(net, IPSTATS_MIB_OUTMCAST, skb->len); } else if (rt->rt_type == RTN_BROADCAST) IP_UPD_PO_STATS(net, IPSTATS_MIB_OUTBCAST, skb->len); /* Be paranoid, rather than too clever. */ if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) { struct sk_buff *skb2; skb2 = skb_realloc_headroom(skb, LL_RESERVED_SPACE(dev)); if (!skb2) { kfree_skb(skb); return -ENOMEM; } if (skb->sk) skb_set_owner_w(skb2, skb->sk); consume_skb(skb); skb = skb2; } if (lwtunnel_xmit_redirect(dst->lwtstate)) { int res = lwtunnel_xmit(skb); if (res < 0 || res == LWTUNNEL_XMIT_DONE) return res; } rcu_read_lock_bh(); neigh = ip_neigh_for_gw(rt, skb, &is_v6gw); if (!IS_ERR(neigh)) { int res; sock_confirm_neigh(skb, neigh); /* if crossing protocols, can not use the cached header */ res = neigh_output(neigh, skb, is_v6gw); rcu_read_unlock_bh(); return res; } rcu_read_unlock_bh(); net_dbg_ratelimited("%s: No header cache and no neighbour!\n", __func__); kfree_skb(skb); return -EINVAL; }
This function first checks the headroom of the IP datagram to make sure there is enough space for the link-layer header, then resolves the next hop through the neighbour (ARP) subsystem and calls neigh_output:
static inline int neigh_output(struct neighbour *n, struct sk_buff *skb,
                               bool skip_cache)
{
    const struct hh_cache *hh = &n->hh;

    if ((n->nud_state & NUD_CONNECTED) && hh->hh_len && !skip_cache)
        return neigh_hh_output(hh, skb);
    else
        return n->output(n, skb);
}
Output splits into two cases depending on whether a cached layer-2 header is available: with a cache, neigh_hh_output takes the fast path; without one, the neighbour subsystem's output callback takes the slow path. The source of neigh_hh_output:
static inline int neigh_hh_output(const struct hh_cache *hh, struct sk_buff *skb) { unsigned int hh_alen = 0; unsigned int seq; unsigned int hh_len; do { seq = read_seqbegin(&hh->hh_lock); hh_len = READ_ONCE(hh->hh_len); if (likely(hh_len <= HH_DATA_MOD)) { hh_alen = HH_DATA_MOD; /* skb_push() would proceed silently if we have room for * the unaligned size but not for the aligned size: * check headroom explicitly. */ if (likely(skb_headroom(skb) >= HH_DATA_MOD)) { /* this is inlined by gcc */ memcpy(skb->data - HH_DATA_MOD, hh->hh_data, HH_DATA_MOD); } } else { hh_alen = HH_DATA_ALIGN(hh_len); if (likely(skb_headroom(skb) >= hh_alen)) { memcpy(skb->data - hh_alen, hh->hh_data, hh_alen); } } } while (read_seqretry(&hh->hh_lock, seq)); if (WARN_ON_ONCE(skb_headroom(skb) < hh_alen)) { kfree_skb(skb); return NET_XMIT_DROP; } __skb_push(skb, hh_len); return dev_queue_xmit(skb); }
This function copies the cached link-layer header into the headroom of the IP packet and pushes it onto the front of the skb, then finally calls the link-layer service dev_queue_xmit. This concludes the network-layer part of the trace.
4. Link layer
We start by tracing dev_queue_xmit which, like the other entry functions, internally calls __dev_queue_xmit:
static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
{
    struct net_device *dev = skb->dev;
    struct netdev_queue *txq;
    struct Qdisc *q;
    int rc = -ENOMEM;
    bool again = false;

    skb_reset_mac_header(skb);
    ...
    skb_dst_force(skb);

    txq = netdev_core_pick_tx(dev, skb, sb_dev);
    q = rcu_dereference_bh(txq->qdisc);

    trace_net_dev_queue(skb);
    if (q->enqueue) {
        rc = __dev_xmit_skb(skb, q, dev, txq);
        goto out;
    }

    /* The device has no queue. Common case for software devices:
     * loopback, all the sorts of tunnels...
     * Really, it is unlikely that netif_tx_lock protection is necessary
     * here. (f.e. loopback and IP tunnels are clean ignoring statistics
     * counters.)
     * However, it is possible, that they rely on protection
     * made by us here.
     * Check this and shot the lock. It is not prone from deadlocks.
     * Either shot noqueue qdisc, it is even simpler 8)
     */
    if (dev->flags & IFF_UP) {
        int cpu = smp_processor_id(); /* ok because BHs are off */

        if (txq->xmit_lock_owner != cpu) {
            if (dev_xmit_recursion())
                goto recursion_alert;

            skb = validate_xmit_skb(skb, dev, &again);
            if (!skb)
                goto out;

            HARD_TX_LOCK(dev, txq, cpu);

            if (!netif_xmit_stopped(txq)) {
                dev_xmit_recursion_inc();
                skb = dev_hard_start_xmit(skb, dev, txq, &rc);
                dev_xmit_recursion_dec();
                if (dev_xmit_complete(rc)) {
                    HARD_TX_UNLOCK(dev, txq);
                    goto out;
                }
            }
            HARD_TX_UNLOCK(dev, txq);
            net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
                                 dev->name);
        } else {
            /* Recursion is detected! It is possible,
             * unfortunately
             */
recursion_alert:
            net_crit_ratelimited("Dead loop on virtual device %s, fix it urgently!\n",
                                 dev->name);
        }
    }
    ...
}
Leaving the rest aside, the part that matters here is the call to dev_hard_start_xmit to send the data:
struct sk_buff *dev_hard_start_xmit(struct sk_buff *first, struct net_device *dev,
                                    struct netdev_queue *txq, int *ret)
{
    struct sk_buff *skb = first;
    int rc = NETDEV_TX_OK;

    while (skb) {
        struct sk_buff *next = skb->next;   /* remember the next skb in the chain */

        skb_mark_not_on_list(skb);
        /* hand this packet to the driver's Tx routine; dequeued packets are
         * also sent from here, which is why there can be a next */
        rc = xmit_one(skb, dev, txq, next != NULL);
        /* if the send failed, restore skb->next and bail out */
        if (unlikely(!dev_xmit_complete(rc))) {
            skb->next = next;
            goto out;
        }

        /* on success move on to next; usually next is NULL and we return,
         * otherwise keep sending */
        skb = next;
        /* if the tx queue has been stopped while there is still data to send,
         * report TX busy */
        if (netif_tx_queue_stopped(txq) && skb) {
            rc = NETDEV_TX_BUSY;
            break;
        }
    }

out:
    *ret = rc;
    return skb;
}
The data is finally handed towards the physical device through xmit_one; this is where the protocol stack's generic transmit path ends and the real device driver takes over:
static int xmit_one(struct sk_buff *skb, struct net_device *dev, struct netdev_queue *txq, bool more) { unsigned int len; int rc; if (dev_nit_active(dev)) dev_queue_xmit_nit(skb, dev); len = skb->len; trace_net_dev_start_xmit(skb, dev); rc = netdev_start_xmit(skb, dev, txq, more); trace_net_dev_xmit(skb, rc, dev, len); return rc; }
xmit_one uses netdev_start_xmit to kick off the device, which in turn calls __netdev_start_xmit and ends up in the driver's ndo_start_xmit (see the sketch below). Once the driver is asked to transmit, the hardware uses DMA to copy the data from main memory into its internal RAM (buffer), adding the relevant framing as it goes. On Ethernet the physical layer transmits with CSMA/CD, listening for collisions on the link while sending. When the NIC has finished sending the frame it raises an interrupt to notify the CPU, and the driver's interrupt handler can then free the saved skb. At this point the data has been pushed out to the physical device as a stream of bits.
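For reference, the hand-off that xmit_one performs looks roughly like this in include/linux/netdevice.h (details vary between kernel versions, so treat it as a sketch):

    static inline netdev_tx_t netdev_start_xmit(struct sk_buff *skb,
                                                struct net_device *dev,
                                                struct netdev_queue *txq, bool more)
    {
        const struct net_device_ops *ops = dev->netdev_ops;
        netdev_tx_t rc;

        rc = __netdev_start_xmit(ops, skb, dev, more);  /* ends in ops->ndo_start_xmit(skb, dev) */
        if (rc == NETDEV_TX_OK)
            txq_trans_update(txq);                      /* record the last-transmit time */

        return rc;
    }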
V. GDB Debugging



As the traces show, apart from the trap entry going through entry_64, the call chain matches the analysis above.
VI. Sequence Diagram

