Published 2015-09-10 16:22:13 | Source: compiled from the web
Linux Containers rely on control groups, which not only track groups of processes but also expose metrics about CPU, memory, and block I/O usage. You can access those metrics and obtain network usage metrics as well. This is relevant for “pure” LXC containers, as well as for Docker containers.
Control groups are exposed through a pseudo-filesystem. In recent distros, you should find this filesystem under /sys/fs/cgroup. Under that directory, you will see multiple sub-directories, called devices, freezer, blkio, etc.; each sub-directory actually corresponds to a different cgroup hierarchy.
On older systems, the control groups might be mounted on /cgroup, without distinct hierarchies. In that case, instead of seeing the sub-directories, you will see a bunch of files in that directory, and possibly some directories corresponding to existing containers.
To figure out where your control groups are mounted, you can run:
grep cgroup /proc/mounts
You can look into /proc/cgroups to see the different control group subsystems known to the system, the hierarchy they belong to, and how many groups they contain.
You can also look at /proc/<pid>/cgroup to see which control groups a process belongs to. The control group will be shown as a path relative to the root of the hierarchy mountpoint; e.g. / means “this process has not been assigned into a particular group”, while /lxc/pumpkin means that the process is likely to be a member of a container named pumpkin.
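As a minimal sketch of reading that file, the function below prints the cgroup path of a process for a single subsystem. The function name cgroup_of is hypothetical, and the sketch assumes the usual three-field line format (hierarchy ID, comma-separated subsystem list, path).

```shell
# Hypothetical helper: print the cgroup path of a process for one subsystem.
# Usage: cgroup_of <subsystem> [file]   (file defaults to /proc/self/cgroup)
cgroup_of() {
    # Each line looks like "<hierarchy-id>:<subsystem[,subsystem...]>:<path>".
    awk -F: -v want="$1" '
        {
            n = split($2, subsys, ",")
            for (i = 1; i <= n; i++)
                if (subsys[i] == want) {
                    # Print everything after the second colon, so a path
                    # that itself contains a colon is not cut short.
                    print substr($0, length($1) + length($2) + 3)
                    exit
                }
        }
    ' "${2:-/proc/self/cgroup}"
}
```

For example, cgroup_of memory /proc/42/cgroup would print the memory cgroup of PID 42 (the PID is only an illustration).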
For each container, one cgroup will be created in each hierarchy. On older systems with older versions of the LXC userland tools, the name of the cgroup will be the name of the container. With more recent versions of the LXC tools, the cgroup will be lxc/<container_name>.
For Docker containers using cgroups, the container name will be the full ID or long ID of the container. If a container shows up as ae836c95b4c3 in docker ps, its long ID might be something like ae836c95b4c3c9e9179e0e91015512da89fdec91612f63cebae57df9a5444c79. You can look it up with docker inspect or docker ps --no-trunc.
Putting everything together to look at the memory metrics for a Docker container, take a look at /sys/fs/cgroup/memory/lxc/<longid>/.
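If you only have the short ID at hand, a shell glob can locate the matching directory. This is a sketch under assumptions: the function name container_cgroup_dir is hypothetical, and the default base directory simply mirrors the path given above; pass a different one if your cgroups are mounted elsewhere.

```shell
# Hypothetical helper: print the cgroup directory whose name starts with
# a short container ID.
# Usage: container_cgroup_dir <short_id> [base_dir]
container_cgroup_dir() {
    base=${2:-/sys/fs/cgroup/memory/lxc}
    for dir in "$base/$1"*/; do
        # If nothing matches, the glob stays unexpanded; skip it.
        [ -d "$dir" ] || continue
        printf '%s\n' "${dir%/}"
        return 0
    done
    return 1
}
```

The function prints the first match and returns a non-zero status when no directory matches.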
For each subsystem (memory, CPU, and block I/O), you will find one or more pseudo-files containing statistics.
Memory metrics are found in the “memory” cgroup. Note that the memory control group adds a little overhead, because it does very fine-grained accounting of the memory usage on your host. Therefore, many distros chose not to enable it by default. Generally, to enable it, all you have to do is add some kernel command-line parameters: cgroup_enable=memory swapaccount=1.
The metrics are in the pseudo-file memory.stat. Here is what it will look like:
cache 11492564992
rss 1930993664
mapped_file 306728960
pgpgin 406632648
pgpgout 403355412
swap 0
pgfault 728281223
pgmajfault 1724
inactive_anon 46608384
active_anon 1884520448
inactive_file 7003344896
active_file 4489052160
unevictable 32768
hierarchical_memory_limit 9223372036854775807
hierarchical_memsw_limit 9223372036854775807
total_cache 11492564992
total_rss 1930993664
total_mapped_file 306728960
total_pgpgin 406632648
total_pgpgout 403355412
total_swap 0
total_pgfault 728281223
total_pgmajfault 1724
total_inactive_anon 46608384
total_active_anon 1884520448
total_inactive_file 7003344896
total_active_file 4489052160
total_unevictable 32768
The first half (without the total_ prefix) contains statistics relevant to the processes within the cgroup, excluding sub-cgroups. The second half (with the total_ prefix) includes sub-cgroups as well.
Some metrics are “gauges”, i.e. values that can increase or decrease (e.g. swap, the amount of swap space used by the members of the cgroup). Some others are “counters”, i.e. values that can only go up, because they represent occurrences of a specific event (e.g. pgfault, which indicates the number of page faults which happened since the creation of the cgroup; this number can never decrease).
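Since memory.stat is a list of key/value pairs, one per line, extracting a single metric takes very little code. The sketch below assumes that format; the function name memstat is hypothetical.

```shell
# Hypothetical helper: print the value of one key from a memory.stat file.
# Usage: memstat <key> <file>
memstat() {
    # Print the second field of the first line whose first field is the key;
    # exit with a non-zero status if the key is absent.
    awk -v key="$1" '
        $1 == key { print $2; found = 1; exit }
        END { exit !found }
    ' "$2"
}
```

For a gauge such as rss or swap, the value can be used as is. For a counter such as pgfault, a collector would typically sample it twice and report the difference.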
Accounting for memory in the page cache is very complex. If two processes in different control groups both read the same file (ultimately relying on the same blocks on disk), the corresponding memory charge will be split between the control groups. It’s nice, but it also means that when a cgroup is terminated, it could increase the memory usage of another cgroup, because they are not splitting the cost anymore for those memory pages.
Now that we’ve covered memory metrics, everything else will look very simple in comparison. CPU metrics will be found in the cpuacct controller.
For each container, you will find a pseudo-file cpuacct.stat, containing the CPU usage accumulated by the processes of the container, broken down between user and system time. If you’re not familiar with the distinction, user is the time during which the processes were in direct control of the CPU (i.e. executing process code), and system is the time during which the CPU was executing system calls on behalf of those processes.
Those times are expressed in ticks of 1/100th of a second. Actually, they are expressed in “user jiffies”. There are USER_HZ “jiffies” per second, and on x86 systems, USER_HZ is 100. This used to map exactly to the number of scheduler “ticks” per second; but with the advent of higher frequency scheduling, as well as tickless kernels, the number of kernel ticks wasn’t relevant anymore. It stuck around anyway, mainly for legacy and compatibility reasons.
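To turn those jiffies into seconds, divide by USER_HZ. The sketch below assumes cpuacct.stat holds one "name value" pair per line (user and system), and that getconf CLK_TCK reports USER_HZ on your system; the function name cpu_seconds is hypothetical.

```shell
# Hypothetical helper: convert cpuacct.stat values (USER_HZ jiffies) to seconds.
# Usage: cpu_seconds <file> [user_hz]
cpu_seconds() {
    # Use the caller-supplied USER_HZ if given, else ask getconf,
    # else fall back to 100 (the x86 value mentioned above).
    hz=${2:-$(getconf CLK_TCK 2>/dev/null || echo 100)}
    awk -v hz="$hz" '{ printf "%s %.2f\n", $1, $2 / hz }' "$1"
}
```

With a file containing "user 4650" and "system 1234" and a USER_HZ of 100, this prints "user 46.50" and "system 12.34".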
Block I/O is accounted in the blkio controller. Different metrics are scattered across different files; the blkio-controller file in the kernel documentation describes them in depth.
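As one illustration, the sketch below totals the bytes read and written that are recorded in blkio.io_service_bytes, one of the per-cgroup files covered by the kernel's blkio-controller documentation. It assumes per-device lines of the form "major:minor Operation bytes"; the function name blkio_bytes is hypothetical.

```shell
# Hypothetical helper: sum bytes read and written across all block devices.
# Usage: blkio_bytes <blkio.io_service_bytes file>
blkio_bytes() {
    # Only the Read and Write lines are summed; Sync, Async and Total
    # lines describe the same traffic from another angle and are ignored.
    awk '
        $2 == "Read"  { r += $3 }
        $2 == "Write" { w += $3 }
        END { printf "read %.0f write %.0f\n", r, w }
    ' "$1"
}
```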
Network metrics are not exposed directly by control groups. There is a good explanation for that: network interfaces exist within the context of network namespaces. The kernel could probably accumulate metrics about packets and bytes sent and received by a group of processes, but those metrics wouldn’t be very useful. You want per-interface metrics (because traffic happening on the local lo interface doesn’t really count). But since processes in a single cgroup can belong to multiple network namespaces, those metrics would be harder to interpret: multiple network namespaces means multiple lo interfaces, potentially multiple eth0 interfaces, etc.; so this is why there is no easy way to gather network metrics with control groups.
Instead, we can gather network metrics from other sources:
IPtables (or rather, the netfilter framework for which iptables is just an interface) can do some serious accounting.
For instance, you can set up a rule to account for the outbound HTTP traffic on a web server:
iptables -I OUTPUT -p tcp --sport 80
There is no -j or -g flag, so the rule will just count matched packets and go to the following rule.
Later, you can check the values of the counters, with:
iptables -nxvL OUTPUT
Technically, -n is not required, but it will prevent iptables from doing DNS reverse lookups, which are probably useless in this scenario.
Counters include packets and bytes. If you want to set up metrics for container traffic like this, you could execute a for loop to add two iptables rules per container IP address (one in each direction), in the FORWARD chain. This will only meter traffic going through the NAT layer; you will also have to add traffic going through the userland proxy.
Then, you will need to check those counters on a regular basis. If you happen to use collectd, there is a nice plugin to automate iptables counters collection.
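The for loop mentioned above could look like the following sketch. To keep it safe to try, it only prints the iptables commands instead of running them. The function name accounting_rules is hypothetical, and the container IP addresses are whatever your setup assigns.

```shell
# Hypothetical helper: print (do not run) two counting rules per container IP.
# Usage: accounting_rules <ip>...
accounting_rules() {
    for ip in "$@"; do
        # No -j target: each rule only counts the packets it matches.
        printf 'iptables -I FORWARD -s %s\n' "$ip"   # traffic from the container
        printf 'iptables -I FORWARD -d %s\n' "$ip"   # traffic to the container
    done
}
```

Once the printed rules look right, the output can be piped to sh as root to apply them.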
Since each container has a virtual Ethernet interface, you might want to directly check the TX and RX counters of this interface. You will notice that each container is associated with a virtual Ethernet interface in your host, with a name like vethKk8Zqi. Figuring out which interface corresponds to which container is, unfortunately, difficult.
But for now, the best way is to check the metrics from within the containers. To accomplish this, you can run an executable from the host environment within the network namespace of a container using ip-netns magic.
The ip netns exec command will let you execute any program (present in the host system) within any network namespace visible to the current process. This means that your host will be able to enter the network namespace of your containers, but your containers won’t be able to access the host, nor their sibling containers. Containers will be able to “see” and affect their sub-containers, though.
The exact format of the command is:
ip netns exec <nsname> <command...>
For example:
ip netns exec mycontainer netstat -i
ip netns finds the “mycontainer” container by using namespaces pseudo-files. Each process belongs to one network namespace, one PID namespace, one mnt namespace, etc., and those namespaces are materialized under /proc/<pid>/ns/. For example, the network namespace of PID 42 is materialized by the pseudo-file /proc/42/ns/net.
When you run ip netns exec mycontainer ..., it expects /var/run/netns/mycontainer to be one of those pseudo-files. (Symlinks are accepted.)
In other words, to execute a command within the network namespace of a container, we need to find the PID of any process within that container, create a symlink from /var/run/netns/<somename> to /proc/<thepid>/ns/net, and then run ip netns exec <somename> ....
Please review Enumerating Cgroups to learn how to find the cgroup of a process running in the container of which you want to measure network usage. From there, you can examine the pseudo-file named tasks, which contains the PIDs that are in the control group (i.e. in the container). Pick any one of them.
Putting everything together, if the “short ID” of a container is held in the environment variable $CID, then you can do this:
TASKS=/sys/fs/cgroup/devices/$CID*/tasks
PID=$(head -n 1 $TASKS)
mkdir -p /var/run/netns
ln -sf /proc/$PID/ns/net /var/run/netns/$CID
ip netns exec $CID netstat -i
Note that running a new process each time you want to update metrics is (relatively) expensive. If you want to collect metrics at high resolutions, and/or over a large number of containers (think 1000 containers on a single host), you do not want to fork a new process each time.
Here is how to collect metrics from a single process. You will have to write your metric collector in C (or any language that lets you do low-level system calls). You need to use a special system call, setns(), which lets the current process enter any arbitrary namespace. It requires, however, an open file descriptor to the namespace pseudo-file (remember: that’s the pseudo-file in /proc/<pid>/ns/net).
However, there is a catch: you must not keep this file descriptor open. If you do, when the last process of the control group exits, the namespace will not be destroyed, and its network resources (like the virtual interface of the container) will stay around forever (or until you close that file descriptor).
The right approach would be to keep track of the first PID of each container, and re-open the namespace pseudo-file each time.
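A setns() collector needs C or a similar language, so it is not sketched here. As a shell-only alternative, and a different technique from the one just described, the interface counters of a process's network namespace can also be read from /proc/<pid>/net/dev without entering the namespace. The sketch below parses that file, assuming the standard two header lines followed by one line per interface; the function name netdev_bytes is hypothetical.

```shell
# Hypothetical helper: print RX and TX byte counters for one interface,
# read from a /proc/<pid>/net/dev-style file.
# Usage: netdev_bytes <interface> [file]   (file defaults to /proc/self/net/dev)
netdev_bytes() {
    awk -v ifc="$1" '
        NR > 2 {
            # The interface name may be glued to the first counter
            # ("eth0:1234"), so turn the colon into a space first.
            sub(":", " ")
            # After the name: 8 receive columns, then the transmit columns.
            if ($1 == ifc) { printf "rx %s tx %s\n", $2, $10; found = 1; exit }
        }
        END { exit !found }
    ' "${2:-/proc/self/net/dev}"
}
```

For a container whose first PID is held in $PID, netdev_bytes eth0 /proc/$PID/net/dev would report the counters seen inside that container, assuming its interface is named eth0.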
Sometimes, you do not care about real time metric collection, but when a container exits, you want to know how much CPU, memory, etc. it has used.
Docker makes this difficult because it relies on lxc-start, which carefully cleans up after itself, but it is still possible. It is usually easier to collect metrics at regular intervals (e.g. every minute, with the collectd LXC plugin) and rely on that instead.
But, if you’d still like to gather the stats when a container stops, here is how:
For each container, start a collection process, and move it to the control groups that you want to monitor by writing its PID to the tasks file of the cgroup. The collection process should periodically re-read the tasks file to check if it’s the last process of the control group. (If you also want to collect network statistics as explained in the previous section, you should also move the process to the appropriate network namespace.)
When the container exits, lxc-start will try to delete the control groups. It will fail, since the control group is still in use; but that’s fine. Your process should now detect that it is the only one remaining in the group. Now is the right time to collect all the metrics you need!
Finally, your process should move itself back to the root control group, and remove the container control group. To remove a control group, just rmdir its directory. It’s counter-intuitive to rmdir a directory while it still contains files; but remember that this is a pseudo-filesystem, so usual rules don’t apply. After the cleanup is done, the collection process can exit safely.
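The "am I the last process in the group?" check can be sketched in shell as follows. The function name is_last_member is hypothetical. The sketch deliberately uses only shell builtins to read the tasks file, because any helper program the collector forks (awk, grep, cat) would itself join the control group and show up in the file while it runs.

```shell
# Hypothetical helper: succeed if <pid> is the only PID listed in a tasks file.
# Usage: is_last_member <tasks_file> <pid>
is_last_member() {
    seen_self=1   # 1 means "not seen yet" (a failing exit status)
    # read and [ are shell builtins, so no child process is created here.
    while read -r pid; do
        [ -n "$pid" ] || continue
        if [ "$pid" = "$2" ]; then
            seen_self=0
        else
            return 1   # some other process is still in the group
        fi
    done < "$1"
    return "$seen_self"
}
```

A collector that has written its own PID ($$) to the tasks file could then poll with something like: while ! is_last_member "$CG/tasks" "$$"; do sleep 1; done, where $CG is an assumed variable holding the cgroup directory. After the loop ends, it reads the metric files, moves itself back to the root control group, and removes the directory with rmdir.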