    tools: add filtering by mount namespace · 32ab8583
    Alban Crequy authored
    
    In previous patches, I added the option --cgroupmap to filter events
    belonging to a set of cgroup-v2 cgroups. Although this approach works fine
    with systemd services and containers when cgroup-v2 is enabled, it does not
    work with containers when only cgroup-v1 is enabled, because
    bpf_get_current_cgroup_id() only works with cgroup-v2. That bpf helper
    function also requires Linux 4.18.
    
    This patch adds an additional way to filter by containers, using mount
    namespaces.
    
    Note that this does not help with systemd services since they normally
    don't create a new mount namespace (unless you set some options like
    'ReadOnlyPaths=', see "man 5 systemd.exec").
    
    My goal with this patch is to filter Kubernetes pods, even on
    distributions with an older kernel (<4.18) or without cgroup-v2 enabled.
    
    - This is only implemented for tools that already support filtering by
      cgroup id (bindsnoop, capable, execsnoop, profile, tcpaccept, tcpconnect,
      tcptop and tcptracer).
    
    - I picked the mount namespace because the other namespaces could be
      disabled in Kubernetes (e.g. HostNetwork, HostPID, HostIPC).
    
    It can be tested by following the example in docs/special_filtering added
    in this commit. To avoid compiling bcc locally, the following commands can
    be used:
    
    ```
    sudo bpftool map create /sys/fs/bpf/mnt_ns_set type hash key 8 value 4 \
      entries 128 name mnt_ns_set flags 0
    docker run -ti --rm --privileged \
      -v /usr/src:/usr/src -v /lib/modules:/lib/modules \
      -v /sys/fs/bpf:/sys/fs/bpf --pid=host kinvolk/bcc:alban-containers-filters \
      /usr/share/bcc/tools/execsnoop --mntnsmap /sys/fs/bpf/mnt_ns_set
    
    ```
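
    The map created this way starts out empty, and the filter only reports
    events whose mount namespace is present in the map, so an entry has to be
    added before anything shows up. The following is a rough sketch (the key
    encoding is an assumption: the mount namespace inode number as 8 bytes in
    host byte order, little-endian as on x86_64, with the 4-byte value assumed
    to be unused) for adding the mount namespace of a running process:

    ```
    # PID of a process inside the container you want to trace (placeholder).
    PID=1234
    # Mount namespace inode number of that process.
    MNTNSID=$(stat -Lc '%i' /proc/$PID/ns/mnt)
    # Encode it as 8 little-endian key bytes for bpftool.
    KEY=$(printf '%016x\n' $MNTNSID | fold -w2 | tac | sed 's/^/0x/' | tr '\n' ' ')
    sudo bpftool map update pinned /sys/fs/bpf/mnt_ns_set \
      key $KEY value 0x00 0x00 0x00 0x00 any
    ```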
    
    Co-authored-by: Alban Crequy <alban@kinvolk.io>
    Co-authored-by: Mauricio Vásquez <mauricio@kinvolk.io>
tcptop_example.txt
Demonstrations of tcptop, the Linux eBPF/bcc version.


tcptop summarizes throughput by host and port. Eg:

# tcptop
Tracing... Output every 1 secs. Hit Ctrl-C to end
<screen clears>
19:46:24 loadavg: 1.86 2.67 2.91 3/362 16681

PID    COMM         LADDR                 RADDR                  RX_KB  TX_KB
16648  16648        100.66.3.172:22       100.127.69.165:6684        1      0
16647  sshd         100.66.3.172:22       100.127.69.165:6684        0   2149
14374  sshd         100.66.3.172:22       100.127.69.165:25219       0      0
14458  sshd         100.66.3.172:22       100.127.69.165:7165        0      0

PID    COMM         LADDR6                           RADDR6                            RX_KB  TX_KB
16681  sshd         fe80::8a3:9dff:fed5:6b19:22      fe80::8a3:9dff:fed5:6b19:16606        1      1
16679  ssh          fe80::8a3:9dff:fed5:6b19:16606   fe80::8a3:9dff:fed5:6b19:22           1      1
16680  sshd         fe80::8a3:9dff:fed5:6b19:22      fe80::8a3:9dff:fed5:6b19:16606        0      0

This example output shows two listings of TCP connections, for IPv4 and IPv6.
If there is only traffic for one of these, then only one group is shown.

The output in each listing is sorted by total throughput (send then receive),
and when printed it is rounded down (floor) to the nearest Kbyte. The example
output shows that PID 16647, sshd, transmitted 2149 Kbytes during the tracing
interval. The other IPv4 sessions had such low throughput that they rounded to
zero.

All TCP sessions, including over loopback, are included.

The session with a process name (COMM) of 16648 is really a short-lived
process with PID 16648 whose name we didn't catch when printing the output.
If this behavior is a serious issue for you, you can modify the tool's code
to include bpf_get_current_comm() in the key structs, so that the name is
fetched during the event and is always seen. I did it this way to start with,
but it measurably increased the overhead of this tool, so I switched to the
asynchronous model.

The overhead is relative to TCP event rate (the rate of tcp_sendmsg() and
tcp_recvmsg() or tcp_cleanup_rbuf()). Due to buffering, this should be lower
than the packet rate. You can measure the rate of these using funccount.
Some sample production servers tested found total rates of 4k to 15k per
second. The CPU overhead at these rates ranged from 0.5% to 2.0% of one CPU.
Your workloads may have higher rates, and therefore higher overhead, or
lower rates.
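
For example, to get a rough sense of the rate on a given system, funccount
(also from bcc) can count calls to these kernel functions over a short window.
A sketch (the duration flag and exact function names may need adjusting for
your kernel and bcc version):

# funccount -d 10 'tcp_sendmsg'
# funccount -d 10 'tcp_cleanup_rbuf'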


I much prefer not clearing the screen, so that historic output is in the
scroll-back buffer, and patterns or intermittent issues can be better seen.
You can do this with -C:

# tcptop -C
Tracing... Output every 1 secs. Hit Ctrl-C to end

20:27:12 loadavg: 0.08 0.02 0.17 2/367 17342

PID    COMM         LADDR                 RADDR                  RX_KB  TX_KB
17287  17287        100.66.3.172:22       100.127.69.165:57585       3      1
17286  sshd         100.66.3.172:22       100.127.69.165:57585       0      1
14374  sshd         100.66.3.172:22       100.127.69.165:25219       0      0

20:27:13 loadavg: 0.08 0.02 0.17 1/367 17342

PID    COMM         LADDR                 RADDR                  RX_KB  TX_KB
17286  sshd         100.66.3.172:22       100.127.69.165:57585       1   7761
14374  sshd         100.66.3.172:22       100.127.69.165:25219       0      0

20:27:14 loadavg: 0.08 0.02 0.17 2/365 17347
PID    COMM         LADDR                 RADDR                  RX_KB  TX_KB
17286  17286        100.66.3.172:22       100.127.69.165:57585       1   2501
14374  sshd         100.66.3.172:22       100.127.69.165:25219       0      0

20:27:15 loadavg: 0.07 0.02 0.17 2/367 17403

PID    COMM         LADDR                 RADDR                  RX_KB  TX_KB
17349  17349        100.66.3.172:22       100.127.69.165:10161       3      1
17348  sshd         100.66.3.172:22       100.127.69.165:10161       0      1
14374  sshd         100.66.3.172:22       100.127.69.165:25219       0      0

20:27:16 loadavg: 0.07 0.02 0.17 1/367 17403

PID    COMM         LADDR                 RADDR                  RX_KB  TX_KB
17348  sshd         100.66.3.172:22       100.127.69.165:10161    3333      0
14374  sshd         100.66.3.172:22       100.127.69.165:25219       0      0

20:27:17 loadavg: 0.07 0.02 0.17 2/366 17409

PID    COMM         LADDR                 RADDR                  RX_KB  TX_KB
17348  17348        100.66.3.172:22       100.127.69.165:10161    6909      2

You can disable the loadavg summary line with -S if needed.

The --cgroupmap option filters based on a cgroup set. It is meant to be used
with an externally created map.

# tcptop --cgroupmap /sys/fs/bpf/test01
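
Similarly, the --mntnsmap option filters based on a set of mount namespace
IDs, also read from an externally created and pinned BPF map (for example, a
map pinned at /sys/fs/bpf/mnt_ns_set, as in the commit message above):

# tcptop --mntnsmap /sys/fs/bpf/mnt_ns_set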

For more details, see docs/special_filtering.md


USAGE:

# tcptop -h
usage: tcptop.py [-h] [-C] [-S] [-p PID] [--cgroupmap CGROUPMAP]
                 [--mntnsmap MNTNSMAP]
                 [interval] [count]

Summarize TCP send/recv throughput by host

positional arguments:
  interval              output interval, in seconds (default 1)
  count                 number of outputs

optional arguments:
  -h, --help            show this help message and exit
  -C, --noclear         don't clear the screen
  -S, --nosummary       skip system summary line
  -p PID, --pid PID     trace this PID only
  --cgroupmap CGROUPMAP
                        trace cgroups in this BPF map only
  --mntnsmap MNTNSMAP   trace mount namespaces in this BPF map only

examples:
    ./tcptop           # trace TCP send/recv by host
    ./tcptop -C        # don't clear the screen
    ./tcptop -p 181    # only trace PID 181
    ./tcptop --cgroupmap ./mappath  # only trace cgroups in this BPF map
    ./tcptop --mntnsmap mappath   # only trace mount namespaces in the map