Docker,  Linux

一则进程写日志导致磁盘IO高的案例分析

一 背景介绍

源于Linux性能优化的 [26 | 案例篇:如何找出狂打日志的“内鬼”?],一个跑在docker容器中的Python应用,持续不断的写日志,导致磁盘IO一直很高,进而影响系统性能。我这里根据自己的思路重新复盘模拟一次该案例的场景。在一台24C 64GB内存,配备SSD盘的物理机,CentOS 7.4的操作系统上操作。

二 分析步骤和思路

1 启动container

[root@centos-master ~]# docker run -itd --name=app  -v /tmp/:/tmp feisky/logapp
Unable to find image 'feisky/logapp:latest' locally
Trying to pull repository docker.io/feisky/logapp ... 
latest: Pulling from docker.io/feisky/logapp
4fe2ade4980c: Pull complete 
7cf6a1d62200: Pull complete 
d16bb326b4ba: Downloading [=================>                                 ] 8.379 MB/24.55 MB
ed7df901e9a8: Download complete 
d7e55b1728aa: Download complete 
e355358dada7: Pull complete 
Digest: sha256:8dfbc3116680c581d20f58d63a2353b9a7e67a0442dd43803d7dad43c6d4ac0a
Status: Downloaded newer image for docker.io/feisky/logapp:latest
7f058667d924e337e8b7db93ecf28048b8226c76050d4c6f936fab77889ddb2f
[root@centos-master ~]# docker ps
CONTAINER ID        IMAGE                   COMMAND                  CREATED             STATUS              PORTS                    NAMES
7f058667d924        feisky/logapp           "python /app.py"         24 seconds ago      Up 23 seconds                                app
15137d822209        luksa/kubia-unhealthy   "node app.js"            12 days ago         Up 12 days                                   kubia-unhealthy
f1df8f14c584        registry                "/entrypoint.sh /e..."   2 weeks ago         Up 2 weeks          0.0.0.0:5000->5000/tcp   naughty_newton
[root@centos-master ~]# 

2 top查看系统性能指标

[root@centos-master ~]# top
top - 10:36:15 up 1139 days, 21:07,  2 users,  load average: 1.58, 0.60, 0.79
Tasks: 320 total,   2 running, 318 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.1 us,  3.0 sy,  0.0 ni, 90.8 id,  5.1 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 65928364 total,  7923768 free, 40700180 used, 17304416 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 12227536 avail Mem 
​
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+  PPID   UID COMMAND                                                                                                                                      
14327 root      20   0  963372 932328   3304 R  89.8  1.4   1:10.86 14308     0 python /app.py                                                                                                                               
11426 root      20   0       0      0      0 S   2.0  0.0   1:45.41     2     0 [kworker/u66:2]                                                                                                                              
23586 root      20   0   25.8g  25.3g   6832 S   1.0 40.3   9866:02     1     0 mongod -f /etc/mongodb.conf                                                                                                                  
14418 root      20   0  157988   2488   1568 R   0.7  0.0   0:00.18 14392     0 top                         

24CPU的机器,负载并不高。内存占的比较大。进程号为14327的进程,占了比较高的CPU资源。top中通过数字键1调出CPU的资源使用情况:

[root@centos-master ~]# top
top - 10:50:25 up 1139 days, 21:21,  2 users,  load average: 2.67, 2.39, 1.80
Tasks: 317 total,   2 running, 315 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.0 us,  0.3 sy,  0.0 ni, 98.7 id,  0.7 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu1  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  0.3 us,  0.0 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  0.3 us,  0.0 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  :  0.3 us,  0.3 sy,  0.0 ni, 99.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu6  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  :  0.0 us,  0.3 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu8  :  0.0 us,  0.3 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu9  : 20.6 us, 54.2 sy,  0.0 ni, 25.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu10 :  0.7 us,  0.0 sy,  0.0 ni, 25.2 id, 74.2 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu11 :  0.0 us,  1.7 sy,  0.0 ni, 28.4 id, 69.6 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu12 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu13 :  0.3 us,  0.3 sy,  0.0 ni, 99.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu14 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu15 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu16 :  0.3 us,  0.0 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu17 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu18 :  0.3 us,  0.3 sy,  0.0 ni, 99.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu19 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu20 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu21 :  0.3 us,  0.0 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu22 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu23 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 65928364 total,  8146796 free, 40467680 used, 17313888 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 12452092 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+  PPID   UID COMMAND                                                                                                                                      
14327 root      20   0  963372 735988   3304 R  74.8  1.1  12:27.35 14308     0 python                                                                                                                                       
11426 root      20   0       0      0      0 S   1.7  0.0   2:03.16     2     0 kworker/u66:2                                                                                                                                
26598 etcd      20   0   12.4g  60620  11428 S   1.3  0.1   2295:18     1   492 etcd                                                                                                                                         
26773 kube      20   0 2494188  62844  21092 S   1.3  0.1   1864:10     1   493 kube-scheduler                                                                                                                               
14683 root      20   0  157956   2488   1568 R   0.7  0.0   0:00.11 14392     0 top  

有部分CPU的IOwait高一点儿,其它看起来都正常。

其实,到这里我们就基本上知道有异常进程了,14327。可以去看该进程具体在干嘛了???直接到第5步骤了,但是,我们可以多角度完整的跟踪分析。

3 pidstat查看进程的资源消耗

看到占用CPU资源高的进程是14327,我们来观察它的资源消耗:pidstat -p 14327

[root@centos-master ~]# pidstat -p 14327 1
Linux 3.10.0-514.el7.x86_64 (centos-master)     2022年03月18日  _x86_64_        (24 CPU)
​
10时54分06秒   UID       PID    %usr %system  %guest    %CPU   CPU  Command
10时54分07秒     0     14327   23.00   67.00    0.00   90.00     7  python
10时54分08秒     0     14327   19.00   71.00    0.00   90.00     7  python
10时54分09秒     0     14327   25.00   64.00    0.00   89.00     7  python
10时54分10秒     0     14327   19.00   53.00    0.00   72.00     7  python
10时54分11秒     0     14327   10.00   65.00    0.00   75.00     7  python
10时54分12秒     0     14327   29.00   61.00    0.00   90.00     7  python
10时54分13秒     0     14327   27.72   62.38    0.00   90.10     7  python
10时54分14秒     0     14327   16.00   73.00    0.00   89.00     7  python
10时54分15秒     0     14327   25.00   65.00    0.00   90.00     7  python
10时54分16秒     0     14327   21.00   59.00    0.00   80.00     7  python
10时54分17秒     0     14327   10.00   38.00    0.00   48.00     3  python
^C
平均时间:     0     14327   20.44   61.67    0.00   82.11     -  python
[root@centos-master ~]# 

的确,看到该进程占了较高的CPU资源。我们通过pidstat -d -l -p 14327 1来查看它的磁盘IO信息,同时-l表示打印它执行的命令和参数:

[root@centos-master ~]# pidstat -d  -l -p 14327 1
Linux 3.10.0-514.el7.x86_64 (centos-master)     2022年03月18日  _x86_64_        (24 CPU)
​
11时00分59秒   UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
11时01分00秒     0     14327      0.00 318540.00      0.00  python /app.py 
11时01分01秒     0     14327      0.00      0.00 479236.00  python /app.py 
11时01分02秒     0     14327      0.00 344500.00      0.00  python /app.py 
11时01分03秒     0     14327      0.00 307472.00      0.00  python /app.py 
11时01分04秒     0     14327      0.00 269632.00 700420.00  python /app.py 
11时01分05秒     0     14327      0.00 384912.00      0.00  python /app.py 
11时01分06秒     0     14327      0.00 445024.00      0.00  python /app.py 
11时01分07秒     0     14327      0.00  91668.00 312872.00  python /app.py 
^C
平均时间:     0     14327      0.00 270218.50 186566.00  python /app.py 
[root@centos-master ~]# 

看到该进程正在执行大量的写操作,平均每秒写差不多270MB。

这里,我们再进一步分析一下系统的整体IO指标。

4 iostat查看系统IO指标

[root@centos-master ~]# iostat -dx 1
Linux 3.10.0-514.el7.x86_64 (centos-master)     2022年03月18日  _x86_64_        (24 CPU)
​
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     4.20    0.06    4.19     4.78    57.96    29.56     0.04    9.48    3.74    9.56   3.80   1.61
sdc               0.00     0.48    0.09    2.32     4.60    47.78    43.33     0.01    5.24   54.24    3.24   1.25   0.30
sdd               0.00     0.01    0.00    0.00     0.00     0.02   249.21     0.00   60.17    0.52   61.92   0.45   0.00
sda               0.00     0.00    0.13    0.06    15.98     7.16   243.97     0.00   12.52    0.97   36.74   0.32   0.01
dm-0              0.00     0.00    0.06    8.39     4.78    57.96    14.86     0.04    4.36    3.95    4.36   1.91   1.61
dm-1              0.00     0.00    0.00    0.00     0.00     0.00    26.32     0.00    6.82    6.82    0.00   6.07   0.00
​
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     2.00    0.00  441.00     0.00 55724.00   252.72    54.40  151.57    0.00  151.57   1.05  46.20
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00  295.00     0.00 36904.00   250.20    54.80  228.38    0.00  228.38   1.57  46.20
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
​
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  567.00     0.00 71212.00   251.19    89.14  139.16    0.00  139.16   1.10  62.30
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00  706.00     0.00 89004.00   252.14    89.88  112.40    0.00  112.40   0.88  62.40
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
​
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     1.00    0.00 1006.00     0.00 128768.00   256.00   145.22  142.77    0.00  142.77   0.99 100.00
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00 1028.00     0.00 131088.00   255.04   146.82  140.68    0.00  140.68   0.97 100.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
​
^C
[root@centos-master ~]# 

系统有大量的IO操作集中在sdb和dm-0设备上,它的%util持续偏高。印证了前面的进程在进行大量IO操作。

5 跟踪进程调用链

当我们知道14327的进程异常时,我们可以通过strace工具来跟踪它的系统调用链。

[root@centos-master ~]# strace -p 14327
strace: Process 14327 attached
mmap(NULL, 314576896, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4b7e9c3000
write(4, "2022-03-18 03:08:40,212 - __main"..., 314572844) = 314572844
munmap(0x7f4b7e9c3000, 314576896)       = 0
write(4, "\n", 1)                       = 1
munmap(0x7f4b915c4000, 314576896)       = 0
select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout)
getpid()                                = 1
mmap(NULL, 314576896, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4b915c4000
mmap(NULL, 393220096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4b79ec3000
mremap(0x7f4b79ec3000, 393220096, 314576896, MREMAP_MAYMOVE) = 0x7f4b79ec3000
munmap(0x7f4b915c4000, 314576896)       = 0
lseek(4, 0, SEEK_END)                   = 943718535
lseek(4, 0, SEEK_CUR)                   = 943718535
munmap(0x7f4b79ec3000, 314576896)       = 0
close(4)                                = 0
stat("/tmp/logtest.txt.1", {st_mode=S_IFREG|0644, st_size=943718535, ...}) = 0
unlink("/tmp/logtest.txt.1")            = 0
stat("/tmp/logtest.txt", {st_mode=S_IFREG|0644, st_size=943718535, ...}) = 0
rename("/tmp/logtest.txt", "/tmp/logtest.txt.1") = 0
open("/tmp/logtest.txt", O_WRONLY|O_CREAT|O_APPEND|O_CLOEXEC, 0666) = 4
fcntl(4, F_SETFD, FD_CLOEXEC)           = 0
fstat(4, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
lseek(4, 0, SEEK_END)                   = 0
ioctl(4, TIOCGWINSZ, 0x7ffecd67a520)    = -1 ENOTTY (Inappropriate ioctl for device)
lseek(4, 0, SEEK_CUR)                   = 0
ioctl(4, TIOCGWINSZ, 0x7ffecd67a440)    = -1 ENOTTY (Inappropriate ioctl for device)
lseek(4, 0, SEEK_CUR)                   = 0
mmap(NULL, 314576896, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4b915c4000
mmap(NULL, 314576896, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4b7e9c3000
write(4, "2022-03-18 03:08:41,104 - __main"..., 314572844) = 314572844
munmap(0x7f4b7e9c3000, 314576896)       = 0
write(4, "\n", 1)                       = 1
munmap(0x7f4b915c4000, 314576896)       = 0
select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout)
getpid()                                = 1
mmap(NULL, 314576896, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4b915c4000
mmap(NULL, 393220096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4b79ec3000

看到,它调用系统函数write向文件描述符为4的文件写数据,大小为314572844字节。然后,调用了系统函数stat(),unlink(),rename(),open(),参数指向了/tmp/logtest.txt文件。

[root@centos-master ~]# ll /tmp/logtest.txt*
-rw-r--r-- 1 root root 629145690 3月  18 11:11 /tmp/logtest.txt
-rw-r--r-- 1 root root 943718535 3月  18 11:11 /tmp/logtest.txt.1
[root@centos-master ~]# 

观察一下,该文件不断的在变化,这里其实是在轮询写日志。

6 跟踪进程打开的文件信息

我们通过lsof跟踪进程打开的文件信息:

[root@centos-master ~]# lsof -p 14327
COMMAND   PID USER   FD   TYPE DEVICE  SIZE/OFF       NODE NAME
python  14327 root  cwd    DIR   0,60      4096 2301835634 /
python  14327 root  rtd    DIR   0,60      4096 2301835634 /
python  14327 root  txt    REG  253,0     28016    5506213 /usr/local/bin/python3.7
python  14327 root  mem    REG  253,0              5506942 /usr/local/lib/python3.7/lib-dynload/_queue.cpython-37m-x86_64-linux-gnu.so (stat: No such file or directory)
python  14327 root  mem    REG  253,0              5506940 /usr/local/lib/python3.7/lib-dynload/_pickle.cpython-37m-x86_64-linux-gnu.so (stat: No such file or directory)
python  14327 root  mem    REG  253,0              5506951 /usr/local/lib/python3.7/lib-dynload/_struct.cpython-37m-x86_64-linux-gnu.so (stat: No such file or directory)
python  14327 root  mem    REG  253,0              5506973 /usr/local/lib/python3.7/lib-dynload/select.cpython-37m-x86_64-linux-gnu.so (stat: No such file or directory)
python  14327 root  mem    REG  253,0              5506948 /usr/local/lib/python3.7/lib-dynload/_socket.cpython-37m-x86_64-linux-gnu.so (stat: No such file or directory)
python  14327 root  mem    REG  253,0              5506943 /usr/local/lib/python3.7/lib-dynload/_random.cpython-37m-x86_64-linux-gnu.so (stat: No such file or directory)
python  14327 root  mem    REG  253,0              5506910 /usr/local/lib/python3.7/lib-dynload/_bisect.cpython-37m-x86_64-linux-gnu.so (stat: No such file or directory)
python  14327 root  mem    REG  253,0              5506946 /usr/local/lib/python3.7/lib-dynload/_sha3.cpython-37m-x86_64-linux-gnu.so (stat: No such file or directory)
python  14327 root  mem    REG  253,0              5506911 /usr/local/lib/python3.7/lib-dynload/_blake2.cpython-37m-x86_64-linux-gnu.so (stat: No such file or directory)
python  14327 root  mem    REG  253,0              5505192 /lib/libcrypto.so.43.0.1 (stat: No such file or directory)
python  14327 root  mem    REG  253,0              5506931 /usr/local/lib/python3.7/lib-dynload/_hashlib.cpython-37m-x86_64-linux-gnu.so (stat: No such file or directory)
python  14327 root  mem    REG  253,0              5506965 /usr/local/lib/python3.7/lib-dynload/math.cpython-37m-x86_64-linux-gnu.so (stat: No such file or directory)
python  14327 root  mem    REG  253,0              5506932 /usr/local/lib/python3.7/lib-dynload/_heapq.cpython-37m-x86_64-linux-gnu.so (stat: No such file or directory)
python  14327 root  mem    REG  253,0              5505134 /etc/localtime (path inode=14157303)
python  14327 root  mem    REG  253,0              5506322 /usr/local/lib/libpython3.7m.so.1.0 (stat: No such file or directory)
python  14327 root  mem    REG  253,0              5505189 /lib/ld-musl-x86_64.so.1 (stat: No such file or directory)
python  14327 root    0u   CHR  136,2       0t0          5 /2
python  14327 root    1u   CHR  136,2       0t0          5 /2
python  14327 root    2u   CHR  136,2       0t0          5 /2
python  14327 root    3r   CHR    1,9       0t0 2301848644 /dev/urandom
python  14327 root    4w   REG  253,0 314572845    5123176 /tmp/logtest.txt
[root@centos-master ~]# 

果然,看到它的FD列,即文件描述符为4w,表示是文件描述符为4的文件,w表示以写的方式操作文件。大小为314572845。

到这里,基本确定该进程在干嘛了。

7 查看docker容器应用源码

查看运行的docker container:

[root@centos-master ~]# docker ps
CONTAINER ID        IMAGE                   COMMAND                  CREATED             STATUS              PORTS                    NAMES
7f058667d924        feisky/logapp           "python /app.py"         41 minutes ago      Up 41 minutes                                app
15137d822209        luksa/kubia-unhealthy   "node app.js"            12 days ago         Up 12 days                                   kubia-unhealthy
f1df8f14c584        registry                "/entrypoint.sh /e..."   2 weeks ago         Up 2 weeks          0.0.0.0:5000->5000/tcp   naughty_newton
[root@centos-master ~]#

进入container,查看应用代码:

[root@centos-master ~]# docker exec -it app /bin/sh
/ # pwd
/ # ls
app.py  bin     dev     etc     home    lib     media   mnt     proc    root    run     sbin    srv     sys     tmp     usr     var
/ # cat app.py 
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
​
import logging
import random
import string
import signal
import time
​
from logging.handlers import RotatingFileHandler
​
logger = logging.getLogger(__name__)
logger.setLevel(level=logging.INFO)
rHandler = RotatingFileHandler(
    "/tmp/logtest.txt", maxBytes=1024 * 1024 * 1024, backupCount=1)
rHandler.setLevel(logging.INFO)
formatter = logging.Formatter(
    '%(asctime)s - %(name)s - %(levelname)s - %(message)s')
rHandler.setFormatter(formatter)
logger.addHandler(rHandler)
​
​
def set_logging_info(signal_num, frame):
    '''Set loging level to INFO when receives SIGUSR1'''
    logger.setLevel(logging.INFO)
​
​
def set_logging_warning(signal_num, frame):
    '''Set loging level to WARNING when receives SIGUSR2'''
    logger.setLevel(logging.WARNING)
​
​
def get_message(N):
    '''Get message for logging'''
    return N * ''.join(
        random.choices(string.ascii_uppercase + string.digits, k=1))
​
​
def write_log(size):
    '''Write logs to file'''
    message = get_message(size)
    while True:
        logger.info(message)
        time.sleep(0.1)
​
​
signal.signal(signal.SIGUSR1, set_logging_info)
signal.signal(signal.SIGUSR2, set_logging_warning)
​
if __name__ == '__main__':
    msg_size = 300 * 1024 * 1024
    write_log(msg_size)
/ # 

这里,看了app.py代码,在写日志,日志级别有不同的调整方式。

def set_logging_info(signal_num, frame):
    '''Set loging level to INFO when receives SIGUSR1'''
    logger.setLevel(logging.INFO)
​
​
def set_logging_warning(signal_num, frame):
    '''Set loging level to WARNING when receives SIGUSR2'''
    logger.setLevel(logging.WARNING)

8 修改应用的日志级别

[root@centos-master ~]# kill -l
 1) SIGHUP       2) SIGINT       3) SIGQUIT      4) SIGILL       5) SIGTRAP
 6) SIGABRT      7) SIGBUS       8) SIGFPE       9) SIGKILL     10) SIGUSR1
11) SIGSEGV     12) SIGUSR2     13) SIGPIPE     14) SIGALRM     15) SIGTERM
16) SIGSTKFLT   17) SIGCHLD     18) SIGCONT     19) SIGSTOP     20) SIGTSTP
21) SIGTTIN     22) SIGTTOU     23) SIGURG      24) SIGXCPU     25) SIGXFSZ
26) SIGVTALRM   27) SIGPROF     28) SIGWINCH    29) SIGIO       30) SIGPWR
31) SIGSYS      34) SIGRTMIN    35) SIGRTMIN+1  36) SIGRTMIN+2  37) SIGRTMIN+3
38) SIGRTMIN+4  39) SIGRTMIN+5  40) SIGRTMIN+6  41) SIGRTMIN+7  42) SIGRTMIN+8
43) SIGRTMIN+9  44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12 47) SIGRTMIN+13
48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14 51) SIGRTMAX-13 52) SIGRTMAX-12
53) SIGRTMAX-11 54) SIGRTMAX-10 55) SIGRTMAX-9  56) SIGRTMAX-8  57) SIGRTMAX-7
58) SIGRTMAX-6  59) SIGRTMAX-5  60) SIGRTMAX-4  61) SIGRTMAX-3  62) SIGRTMAX-2
63) SIGRTMAX-1  64) SIGRTMAX
[root@centos-master ~]# kill -SIGUSR2 14327
[root@centos-master ~]# 

然后,top和iostat监控系统的性能指标,都正常了。

三小结

这个案例较为简单,从top上看到异常进程,我们可以立即跟踪该进程在执行什么,进而找到线索和思路。当然,这里,通过iostat -dx 1和pidstat -d -l -p the_pid 1;来看了一下系统的IO指标,发现有异常的IO请求操作。最后,通过strace -p the_pid;和lsof -p the_pid找到进程的系统调用栈和打开文件句柄。最终,确定进程是在持续记日志,并从app.py源码中,印证定位和猜测。

个人收获,strace -p the_pid;和lsof -p the_pid;以前没有深入的使用,这里值得再研究。

留言