bcc Python开发者教程(译:bcc Python Developer Tutorial)

翻译自：https://github.com/iovisor/bcc/blob/master/docs/tutorial_bcc_python_developer.md

bcc Python Developer Tutorial

这个教程主要目的是展示如何使用python来进行bcc工具开发和编程。教程主要分为两个部分：可观察性和网络。

文中的代码片段均都来自于bcc：代码片段的licenses见bcc中具体文件。

也可参考bcc开发者手册reference_guide.md以及end-users工具教程: tutorial.md。此外bcc还开放有lua接口。

Observability

"可观察性"教程包含17个课程和46个要学习列举的事项。

Lesson 1. Hello World

我们通过运行examples/hello_world.py这个例子来开启我们的学习之旅。在一个终端运行这个脚本，同时在另外一个终端运行一些命令(例如"ls")。正常的预期是在新任务运行时打印"Hello, World!"，如果结果不符合预期，说明还有一些东西没有准备好：参考 INSTALL.md。

# ./examples/hello_world.py
            bash-13364 [002] d... 24573433.052937: : Hello, World!
            bash-13364 [003] d... 24573436.642808: : Hello, World!
[...]

下面是hello_world.py：

from bcc import BPF
BPF(text='int kprobe__sys_clone(void *ctx) { bpf_trace_printk("Hello, World!\\n"); return 0; }').trace_print()

在这里我们有6样需要学习的东西：

text='...'：定义一个inline BPF 程序. 这个程序使用C语言编码风格。
kprobe__sys_clone()：这是通过kprobe进行内核函数动态跟踪的快捷方法。如果C语言函数名称以"kprobe__"作为前缀，则函数名其余部分则表示将要被跟踪的内核函数接口(名)，在我们这里的场景中就是跟踪内核函数sys_clone().
void *ctx： ctx本来是具体类型的参数，但是由于我们这里没有使用这个参数，因此就将其写成void *类型。
bpf_trace_printk()：一种将信息输出到trace_pipe(/sys/kernel/debug/tracing/trace_pipe)简单机制。在一些简单用例中这样使用没有问题， but它也有一些限制：最多3 参数；第一个参数必须是%s(即字符串)；同时trace_pipe在内核中全局共享，so 其他并行使用trace_pipe的程序有可能会将trace_pipe的输出扰乱。一个更好的方式是通过BPF_PERF_OUTPUT(), 稍后将会讲到。
return 0;：必须这样，返回0 (如果要知道why, 参考 #139 https://github.com/iovisor/bcc/issues/139)。
.trace_print(): bcc提供的一个功能用以读取trace_pipe的内容输出到终端。

Lesson 2. sys_sync()

这一课我们要写一个跟踪sys_sync()内核函数的程序。这个程序会在sys_sync()函数被调用时在终端打印"sys_sync() called" 。程序写好运行起来，并在另外一个终端运行sync命令来进行测试。Lesson 1中的hello_world.py 程序基本上不用怎么修改就够用。

不过，在Lesson 1的基础上再增加一条：在跟踪程序运行时会在第一条打印输出"Tracing sys_sync()... Ctrl-C to end." 。提示：it's just Python。

Lesson 3. hello_fields.py

这个程序在examples/tracing/hello_fields.py. 示例输出如下 (命令运行在另外一个终端)：

# ./examples/tracing/hello_fields.py
TIME(s)            COMM             PID    MESSAGE
24585001.174885999 sshd             1432   Hello, World!
24585001.195710000 sshd             15780  Hello, World!
24585001.991976000 systemd-udevd    484    Hello, World!
24585002.276147000 bash             15787  Hello, World!

代码如下：

from bcc import BPF

# define BPF program
prog = """
int hello(void *ctx) {
    bpf_trace_printk("Hello, World!\\n");
    return 0;
}
"""

# load BPF program
b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="hello")

# header
print("%-18s %-16s %-6s %s" % ("TIME(s)", "COMM", "PID", "MESSAGE"))

# format output
while 1:
    try:
        (task, pid, cpu, flags, ts, msg) = b.trace_fields()
    except ValueError:
        continue
    print("%-18.9f %-16s %-6d %s" % (ts, task, pid, msg))

这个程序与hello_world.py相似也是通过sys_clone()来跟踪新任务，但是添加了一些新的学习事项：

prog =: 这一次我们将C程序定义为了变量，后续通过引用这个变量的方式来使用。如果你想根据命令行参数来进行一些字符串替换，这种方式就很有用。
hello(): 我们定义了一个C语言函数而非kprobe__ 快捷方式，稍后我们将会引用这个函数。所有声明在BPF程序中的C函数在跟踪函数的kprobe会被执行，因而这里的C函数需要一个pt_reg* ctx类型的首参。如果你想定义一些helper函数，但是又不希望这些函数在probe时就执行，那么需要将这些helper函数定义为static inline 这样编译器可以将其编译为inlined属性；有时候也许你需要使用_always_inline 函数属性来实现这一效果。
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="hello")：为内核的clone系统调用函数添加一个kprobe点，这样实施后在clone()函数的kprobe会执行我们定义的hello() 函数。也可以多次调用attach_kprobe() 函数在需要跟踪的内核函数的kprobe点插入你定义的kprobe跟踪函数。
b.trace_fields()：从trace_pipe返回一组固定字段。类似于trace_print()这个函数一般只在调试时使用，如果在正式发布的工具中应该使用BPF_PERF_OUTPUT()来代替。

Lesson 4. sync_timing.py

还记得吗，系统管理员层在reboot机器前在终端上连敲了三次sync命令来让第一次sync同步执行完成? 后来有人觉得sync;sync;sync这种把它们放在一行运行的操作简直是666，甚至最终都成为了行业惯例，尽管违背了初衷！

接下来的这个列子用以记录do_sync被频繁调用的有都快，如果调用间隔小于一秒，则将两次被调用的时间间隔打印出来。这样sync;sync;sync一串命令将会输出第2次和第3次的调用间隔。

# ./examples/tracing/sync_timing.py
Tracing for quick sync's... Ctrl-C to end
At time 0.00 s: multiple syncs detected, last 95 ms ago
At time 0.10 s: multiple syncs detected, last 96 ms ago

这个程序在 examples/tracing/sync_timing.py ：

from __future__ import print_function
from bcc import BPF

# load BPF program
b = BPF(text="""
#include <uapi/linux/ptrace.h>

BPF_HASH(last);

int do_trace(struct pt_regs *ctx) {
    u64 ts, *tsp, delta, key = 0;

    // attempt to read stored timestamp
    tsp = last.lookup(&key);
    if (tsp != NULL) {
        delta = bpf_ktime_get_ns() - *tsp;
        if (delta < 1000000000) {
            // output if time is less than 1 second
            bpf_trace_printk("%d\\n", delta / 1000000);
        }
        last.delete(&key);
    }

    // update stored timestamp
    ts = bpf_ktime_get_ns();
    last.update(&key, &ts);
    return 0;
}
""")

b.attach_kprobe(event=b.get_syscall_fnname("sync"), fn_name="do_trace")
print("Tracing for quick sync's... Ctrl-C to end")

# format output
start = 0
while 1:
    (task, pid, cpu, flags, ts, ms) = b.trace_fields()
    if start == 0:
        start = ts
    ts = ts - start
    print("At time %.2f s: multiple syncs detected, last %s ms ago" % (ts, ms))

这一课我们要学习如下知识:

bpf_ktime_get_ns(): 返回当前时间戳，以纳秒为单位。
BPF_HASH(last): 创建一个名字为"last"的BPF map对象，其本质上是一个hash表。我们没有指定任何参数，因而这里对map中的key和value都默认为u64类型。
key = 0: 在这个hash map中我们仅存放一对key/value，且key硬编码为0。
last.lookup(&key): 在hash中通过key查找元素，如果查找到则返回key对应的value指针，否则返回NULL。这里入参传递的是key地址。
if (tsp != NULL) {: 内核中的verifier 要求在引用一个返回自map lookup的value指针前必须进行NULL指针检查。
last.delete(&key): 从hash中删除key。由于老版本kenrel存在bug因而要求在.update()后需要这样做，不过这个bug已经在4.8.10后已经fixed。
last.update(&key, &ts): 在hash map中将ts与key进行关联，这会覆盖之前的键值对中key对应的value，这里是记录时间戳。

Lesson 5. sync_count.py

对上一节的sync_timing.py 程序进行修改，把内核中sync系统调用的次数(both fast and slow)记录下来，并在打印中输出。这个调用次数count可在已有的hash map中新增一个key来记录。

Lesson 6. disksnoop.py

看看 examples/tracing/disksnoop.py程序找一些新鲜的玩法吧，下面是这个程序的输出：

# ./disksnoop.py
TIME(s)            T  BYTES    LAT(ms)
16458043.436012    W  4096        3.13
16458043.437326    W  4096        4.44
16458044.126545    R  4096       42.82
16458044.129872    R  4096        3.24
[...]

部分代码片段如下：

[...]
REQ_WRITE = 1        # from include/linux/blk_types.h

# load BPF program
b = BPF(text="""
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>

BPF_HASH(start, struct request *);

void trace_start(struct pt_regs *ctx, struct request *req) {
    // stash start timestamp by request ptr
    u64 ts = bpf_ktime_get_ns();

    start.update(&req, &ts);
}

void trace_completion(struct pt_regs *ctx, struct request *req) {
    u64 *tsp, delta;

    tsp = start.lookup(&req);
    if (tsp != 0) {
        delta = bpf_ktime_get_ns() - *tsp;
        bpf_trace_printk("%d %x %d\\n", req->__data_len,
            req->cmd_flags, delta / 1000);
        start.delete(&req);
    }
}
""")

b.attach_kprobe(event="blk_start_request", fn_name="trace_start")
b.attach_kprobe(event="blk_mq_start_request", fn_name="trace_start")
b.attach_kprobe(event="blk_account_io_done", fn_name="trace_completion")
[...]

这节我们要学习如下知识：

REQ_WRITE: 我们在这个python程序中定义了一个内核中已有的常量，因为稍后我们将会使用到它。如果我们在BPF中使用直接REQ_WRITE即使不定义也应该不会有问题，但前提是要使用正确的头文件#includes。
trace_start(struct pt_regs *ctx, struct request *req): 这个函数稍后会被attached到kprobes中。Kprobe函数中的首参是struct pt_regs *ctx，这个参数用以提供BPF现场和上下文寄存器；第二个是被kprobe跟踪的内核函数的实际参数。我们将trace_start()函数attach到blk_start_request()内核函数，而这个内核函数的第一个参数就是struct request *类型。
start.update(&req, &ts): 这里我们使用request 结构指针作为我们hash map的键值key。 What? 用指针做key？哈哈，这个在traceing不足为奇。结构体指针在hash map中被证明是很理想的键值，因为他们是独一无二的：两个结构体(对象)不可能有相同的指针(地址)。(但是要注意内存被释放后指针被重复使用的情况)。因此这里我们将时间戳timestamp与描述磁盘IO的结构体request struct(指针)进行key/value配对使用，这样我们就可以对其进行计时。通常有两种键值可用来与时间戳配对存放：结构体指针和线程IDs (for timing function entry to return).
req->__data_len：这里引用struct request的成员。详情请翻阅内核源码中这个结构的定义以及它有哪些成员。bcc工具实际上将这些表达式重写为了一系列bpf_probe_read_kernel() 调用。有时候bcc无法处理一些复杂的引用，此时需要直接调用bpf_probe_read_kernel()。

这个程序非常有意思，如果你能够理解这里所有的代码，你就会解许多重要的基础知识。目前我们仍然使用的是bpf_trace_printk()函数，让我们接下来继续改进它吧 !

Lesson 7. hello_perf_output.py

好了，接下来我们不再用前面的bpf_trace_printk()，而是使用BPF_PERF_OUTPUT() 接口这才是正确的打开方式。这也意味着我们无法再欢快而自由的通过 trace_field()获取到PID和timestamp这些成员字段了，我们不得不自食其力直接取到他们。用例的输出如下：

# ./hello_perf_output.py
TIME(s)            COMM             PID    MESSAGE
0.000000000        bash             22986  Hello, perf_output!
0.021080275        systemd-udevd    484    Hello, perf_output!
0.021359520        systemd-udevd    484    Hello, perf_output!
0.021590610        systemd-udevd    484    Hello, perf_output!
[...]

代码取自 examples/tracing/hello_perf_output.py ：

from bcc import BPF

# define BPF program
prog = """
#include <linux/sched.h>

// define output data structure in C
struct data_t {
    u32 pid;
    u64 ts;
    char comm[TASK_COMM_LEN];
};
BPF_PERF_OUTPUT(events);

int hello(struct pt_regs *ctx) {
    struct data_t data = {};

    data.pid = bpf_get_current_pid_tgid();
    data.ts = bpf_ktime_get_ns();
    bpf_get_current_comm(&data.comm, sizeof(data.comm));

    events.perf_submit(ctx, &data, sizeof(data));

    return 0;
}
"""

# load BPF program
b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="hello")

# header
print("%-18s %-16s %-6s %s" % ("TIME(s)", "COMM", "PID", "MESSAGE"))

# process event
start = 0
def print_event(cpu, data, size):
    global start
    event = b["events"].event(data)
    if start == 0:
            start = event.ts
    time_s = (float(event.ts - start)) / 1000000000
    print("%-18.9f %-16s %-6d %s" % (time_s, event.comm, event.pid,
        "Hello, perf_output!"))

# loop with callback to print_event
b["events"].open_perf_buffer(print_event)
while 1:
    b.perf_buffer_poll()

本节需要学习：

struct data_t：这是一个我们自己定义的C语言结构体，用于内核向用户态传递数据。
BPF_PERF_OUTPUT(events)：将我们的输出通道命名为"events"。
struct data_t data = {};：创建一个空的data_t struct结构体对象，其成员在后续填充。
bpf_get_current_pid_tgid()：返回值的低32位保存当前任务的线程(在Linux中内核视角的PID实际上是用户态的线程ID)，高32位保存线程组ID(也就是用户态视角的进程PID)。如果用u32类型对这个值进行强转，高32位将会被截断discard。我们应该用 PID 还是 TGID呢? 对于多线程应用来说线程组中的TGID都是相同的，因此如果你想要区分的是不同的线程，那么就使用PID。究竟使用PID还是TGID实际上，这是一个与用户期望有关的问题。
bpf_get_current_comm()：将当前任务的名字(字符串)放到第一个入参中指针所指向的内存中。
events.perf_submit()：提交event以便用户态通过perf ring buffer读取perf数据。
def print_event()：自定义的Python函数用以处理event stream读取的events信息。
b["events"].event(data)：以一个python对象的方式返回events信息，这个python对象是从前面C语言声明中自动生成的。
b["events"].open_perf_buffer(print_event)：将Python函数print_event 与events stream关联起来。
while 1: b.perf_buffer_poll()：polling等待perf 事件。

Lesson 8. sync_perf_output.py

使用BPF_PERF_OUTPUT对上一节的sync_timing.py进行重构。

Lesson 9. bitehist.py

下面这个工具以直方图方式记录disk I/O大小，示例输出如下：

# ./bitehist.py
Tracing... Hit Ctrl-C to end.
^C
     kbytes          : count     distribution
       0 -> 1        : 3        |                                      |
       2 -> 3        : 0        |                                      |
       4 -> 7        : 211      |**********                            |
       8 -> 15       : 0        |                                      |
      16 -> 31       : 0        |                                      |
      32 -> 63       : 0        |                                      |
      64 -> 127      : 1        |                                      |
     128 -> 255      : 800      |**************************************|

下面在来自 examples/tracing/bitehist.py ：

from __future__ import print_function
from bcc import BPF
from time import sleep

# load BPF program
b = BPF(text="""
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>

BPF_HISTOGRAM(dist);

int kprobe__blk_account_io_done(struct pt_regs *ctx, struct request *req)
{
    dist.increment(bpf_log2l(req->__data_len / 1024));
    return 0;
}
""")

# header
print("Tracing... Hit Ctrl-C to end.")

# trace until Ctrl-C
try:
    sleep(99999999)
except KeyboardInterrupt:
    print()

# output
b["dist"].print_log2_hist("kbytes")

让我们回顾一下前面的课程：

kprobe__：以这个为前缀开表达式中后面的字符串表示要安装kprobe钩子的内核函数。
struct pt_regs *ctx, struct request *req：kprobe钩子函数的参数。参数ctx存着寄存器和BPF的上下文；参数req是被跟踪内核函数(这里是blk_account_io_done())第一个参数。
req->__data_len：对参数成员进行引用。

需要学习的新东西：

BPF_HISTOGRAM(dist)：定义一个histogram类型BPF map 对象，其名字为"dist"。
dist.increment()：增加dist标记索引，第一个参数表示增加的步长，如果没有指定参数，默认步长为1。或者也可以自己将第二个参数作为步长。
bpf_log2l()：对参数进行log-2计算，其结果作为直方图索引，这样我们可以构建一个2阶直方图。
b["dist"].print_log2_hist("kbytes")：打印"dist"的2阶直方图，以kbytes作为输出列的头信息。由于 bucket计数是内核到用户态传输的唯一数据，因此这种方式效率很高。

Lesson 10. disklatency.py

编写一个程序记录disk I/O的时间，并打印出这些延迟时间的直方图。Disk I/O 的监测和计时可参考上一节的disksnoop.py程序，直方图代码可以参考上一节的bitehist.py程序。

Lesson 11. vfsreadlat.py

这个例子程序拆分为单独的Python 和 C 文件。输出如下：

# ./vfsreadlat.py 1
Tracing... Hit Ctrl-C to end.
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 2        |***********                             |
         4 -> 7          : 7        |****************************************|
         8 -> 15         : 4        |**********************                  |

     usecs               : count     distribution
         0 -> 1          : 29       |****************************************|
         2 -> 3          : 28       |**************************************  |
         4 -> 7          : 4        |*****                                   |
         8 -> 15         : 8        |***********                             |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 2        |**                                      |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 4        |*****                                   |
      8192 -> 16383      : 6        |********                                |
     16384 -> 32767      : 9        |************                            |
     32768 -> 65535      : 6        |********                                |
     65536 -> 131071     : 2        |**                                      |

     usecs               : count     distribution
         0 -> 1          : 11       |****************************************|
         2 -> 3          : 2        |*******                                 |
         4 -> 7          : 10       |************************************    |
         8 -> 15         : 8        |*****************************           |
        16 -> 31         : 1        |***                                     |
        32 -> 63         : 2        |*******                                 |
[...]

代码可参考examples/tracing/vfsreadlat.py 和 examples/tracing/vfsreadlat.c。

需要学习的东西：

b = BPF(src_file = "vfsreadlat.c")：从一个单独的C语言文件读取程序。
b.attach_kretprobe(event="vfs_read", fn_name="do_return")：将BPF C函数do_return() 添加到内核函数 vfs_read()的返回点kprobe钩子中，也就是kretprobe：跟踪的是一个内核函数的返回点，而非进入点。
b["dist"].clear()：清除histogram.

Lesson 12. urandomread.py

跟踪dd if=/dev/urandom of=/dev/null bs=8k count=5命令：

# ./urandomread.py
TIME(s)            COMM             PID    GOTBITS
24652832.956994001 smtp             24690  384
24652837.726500999 dd               24692  65536
24652837.727111001 dd               24692  65536
24652837.727703001 dd               24692  65536
24652837.728294998 dd               24692  65536
24652837.728888001 dd               24692  65536

哈！我偶然抓到了smtp。这个例子的代码在 examples/tracing/urandomread.py：

from __future__ import print_function
from bcc import BPF

# load BPF program
b = BPF(text="""
TRACEPOINT_PROBE(random, urandom_read) {
    // args is from /sys/kernel/debug/tracing/events/random/urandom_read/format
    bpf_trace_printk("%d\\n", args->got_bits);
    return 0;
}
""")

# header
print("%-18s %-16s %-6s %s" % ("TIME(s)", "COMM", "PID", "GOTBITS"))

# format output
while 1:
    try:
        (task, pid, cpu, flags, ts, msg) = b.trace_fields()
    except ValueError:
        continue
    print("%-18.9f %-16s %-6d %s" % (ts, task, pid, msg))

本节需要学习：

TRACEPOINT_PROBE(random, urandom_read)：跟踪内核tracepoint函数random:urandom_read。由于tracepoint的API比较稳定，一般不会轻易修改；因而相较于kprobe，只要能够使用tracepoint就尽量选择tracepoint跟踪。可以通过perf list命令来列出有哪些可用的tracepoints。在Linux >= 4.7 的版本中要求将BPF 程序 attach 到tracepoints。
args->got_bits： args 是自动生成的，其类型为tracepoint参数类型数据结构。上面代码中的注释提示可以在哪里查找到这个结构体，Eg:

# cat /sys/kernel/debug/tracing/events/random/urandom_read/format
name: urandom_read
ID: 972
format:
	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
	field:int common_pid;	offset:4;	size:4;	signed:1;

	field:int got_bits;	offset:8;	size:4;	signed:1;
	field:int pool_left;	offset:12;	size:4;	signed:1;
	field:int input_left;	offset:16;	size:4;	signed:1;

print fmt: "got_bits %d nonblocking_pool_entropy_left %d input_entropy_left %d", REC->got_bits, REC->pool_left, REC->input_left

在这个案例中我们打印的是got_bits成员。

Lesson 13. disksnoop.py fixed版

对前一节的disksnoop.py程序进行修改，这次跟踪block:block_rq_issue 和 block:block_rq_complete 两个tracepoints点。

Lesson 14. strlen_count.py

这一课中我们的程序要跟踪的是strlen()这个用户态库函数，并统计这个函数中不同参数出现的频次。用例输出如下：

# ./strlen_count.py
Tracing strlen()... Hit Ctrl-C to end.
^C     COUNT STRING
         1 " "
         1 "/bin/ls"
         1 "."
         1 "cpudist.py.1"
         1 ".bashrc"
         1 "ls --color=auto"
         1 "key_t"
[...]
        10 "a7:~# "
        10 "/root"
        12 "LC_ALL"
        12 "en_US.UTF-8"
        13 "en_US.UTF-8"
        20 "~"
        70 "#%^,~:-=?+/}"
       340 "\x01\x1b]0;root@bgregg-test: ~\x07\x02root@bgregg-test:~# "

这些是这次跟踪到的该库函数处理的各种字符串参数，同时还打印了他们出现频次情况。例如strlen()使用"LC_ALL" 字符串参数调用了12次。

代码在 examples/tracing/strlen_count.py：

from __future__ import print_function
from bcc import BPF
from time import sleep

# load BPF program
b = BPF(text="""
#include <uapi/linux/ptrace.h>

struct key_t {
    char c[80];
};
BPF_HASH(counts, struct key_t);

int count(struct pt_regs *ctx) {
    if (!PT_REGS_PARM1(ctx))
        return 0;

    struct key_t key = {};
    u64 zero = 0, *val;

    bpf_probe_read_user(&key.c, sizeof(key.c), (void *)PT_REGS_PARM1(ctx));
    // could also use `counts.increment(key)`
    val = counts.lookup_or_try_init(&key, &zero);
    if (val) {
      (*val)++;
    }
    return 0;
};
""")
b.attach_uprobe(name="c", sym="strlen", fn_name="count")

# header
print("Tracing strlen()... Hit Ctrl-C to end.")

# sleep until Ctrl-C
try:
    sleep(99999999)
except KeyboardInterrupt:
    pass

# print output
print("%10s %s" % ("COUNT", "STRING"))
counts = b.get_table("counts")
for k, v in sorted(counts.items(), key=lambda counts: counts[1].value):
    print("%10d \"%s\"" % (v.value, k.c.encode('string-escape')))

本节需要学习：

PT_REGS_PARM1(ctx)：这个宏用于获取被跟踪函数strlen()的第一个参数，也就是要处理的字符串。
b.attach_uprobe(name="c", sym="strlen", fn_name="count")：Attach "c"库(if this is the main program, use its pathname)，跟踪其用户态函数strlen()，并在strlen()函数执行时调用我们的挂接的uprobe函数count()。

Lesson 15. nodejs_http_server.py

这一次我们将跟踪用户态静态定义tracing (USDT) 探针，这就是用户态版本的tracepoint。示例输出如下：

# ./nodejs_http_server.py 24728
TIME(s)            COMM             PID    ARGS
24653324.561322998 node             24728  path:/index.html
24653335.343401998 node             24728  path:/images/welcome.png
24653340.510164998 node             24728  path:/images/favicon.png

相关代码来自于 examples/tracing/nodejs_http_server.py：

from __future__ import print_function
from bcc import BPF, USDT
import sys

if len(sys.argv) < 2:
    print("USAGE: nodejs_http_server PID")
    exit()
pid = sys.argv[1]
debug = 0

# load BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
int do_trace(struct pt_regs *ctx) {
    uint64_t addr;
    char path[128]={0};
    bpf_usdt_readarg(6, ctx, &addr);
    bpf_probe_read_user(&path, sizeof(path), (void *)addr);
    bpf_trace_printk("path:%s\\n", path);
    return 0;
};
"""

# enable USDT probe from given PID
u = USDT(pid=int(pid))
u.enable_probe(probe="http__server__request", fn_name="do_trace")
if debug:
    print(u.get_text())
    print(bpf_text)

# initialize BPF
b = BPF(text=bpf_text, usdt_contexts=[u])

本节需要学习：

bpf_usdt_readarg(6, ctx, &addr)：从USDT probe中读取第6个参数地址到addr中。
bpf_probe_read_user(&path, sizeof(path), (void *)addr)：现在将addr 指向path。
u = USDT(pid=int(pid))：使用指定的PID初始化USDT tracing。
u.enable_probe(probe="http__server__request", fn_name="do_trace")：将我们的BPF C函数do_trace() 添加到USDT探测点Node.js的http__server__request 处。
b = BPF(text=bpf_text, usdt_contexts=[u])：需要将我们的USDT对象"u"传入到BPF对象的创建函数中。

Lesson 16. task_switch.c

这是前期课程中已经包含的一部分内容，这里主要是回顾和巩固一下我们前面已经学过的内容。

这是一个比Hello World稍复杂的跟踪实例。这个程序会在每次任务切换时被调用，它会把新/老任务的pids记录到BPF map中。

下面这段C程序引入了一个新概念：参数 prev。这个参数由BCC前端进行特殊处理，因此对该变量的访问是从kprobe基础结构传递的已保存上下文中获取。从位置1开始的参数的原型应该与正在被探测的内核函数的原型匹配。这样一来程序就对函数参数的访问权无缝对接。

#include <uapi/linux/ptrace.h>
#include <linux/sched.h>

struct key_t {
    u32 prev_pid;
    u32 curr_pid;
};

BPF_HASH(stats, struct key_t, u64, 1024);
int count_sched(struct pt_regs *ctx, struct task_struct *prev) {
    struct key_t key = {};
    u64 zero = 0, *val;

    key.curr_pid = bpf_get_current_pid_tgid();
    key.prev_pid = prev->pid;

    // could also use `stats.increment(key);`
    val = stats.lookup_or_try_init(&key, &zero);
    if (val) {
      (*val)++;
    }
    return 0;
}

用户态组件加载上面文件的内容并attache到finish_task_switch()内核函数的kprobe探测钩子中。通过[]这个操作将可以访问到程序中BPF对象中BPF_HASH元素，这样就可以直接访问到内核中的变量。使用这个对象就像python中的其他对象一样：read, update, and deletes 等等内置的函数都是标准配置。

from bcc import BPF
from time import sleep

b = BPF(src_file="task_switch.c")
b.attach_kprobe(event="finish_task_switch", fn_name="count_sched")

# generate many schedule events
for i in range(0, 100): sleep(0.01)

for k, v in b["stats"].items():
    print("task_switch[%5d->%5d]=%u" % (k.prev_pid, k.curr_pid, v.value))

这些程序可以在这些文件中找到 examples/tracing/task_switch.c 以及 examples/tracing/task_switch.py 。

Lesson 17. Further Study

要想进一步学习可参考Sasha Goldshtein的linux-tracing-workshop , 这里还提供了额外的实验；此外bcc项目的 /tools牧中也有许多工具值得学习。

如果您想为bcc贡献工具，请阅读CONTRIBUTING-SCRIPTS.md 文档。在README.md 文档的尾部你还可以找到我们的联系方式。Good luck, and happy tracing!

Networking

To do.

posted on 2021-10-14 20:56 温暖的电波阅读(1785) 评论(0) 收藏举报

刷新页面返回顶部


博客园 © 2004-2025 浙公网安备 33010602011771号浙ICP备2021040463号-3

导航

bcc Python Developer Tutorial

Observability

Lesson 1. Hello World

Lesson 2. sys_sync()

Lesson 3. hello_fields.py

Lesson 4. sync_timing.py

Lesson 5. sync_count.py

Lesson 6. disksnoop.py

Lesson 7. hello_perf_output.py

Lesson 8. sync_perf_output.py

Lesson 9. bitehist.py

Lesson 10. disklatency.py

Lesson 11. vfsreadlat.py

Lesson 12. urandomread.py

Lesson 13. disksnoop.py fixed版

Lesson 14. strlen_count.py

Lesson 15. nodejs_http_server.py

Lesson 16. task_switch.c

Lesson 17. Further Study

Networking