I've said many times that dtrace is not just a wonderful tool for
developers and performance gurus. The Kings of Computing, who are
of course System Admins, also find it really useful.
There is an ancient version of make called Parallel make that
occasionally suffers from a bug (1223984) where it gets into a loop
like this:
waitid(P_ALL, 0, 0x08047270, WEXITED|WTRAPPED) Err#10 ECHILD
alarm(0) = 30
alarm(30) = 0
waitid(P_ALL, 0, 0x08047270, WEXITED|WTRAPPED) Err#10 ECHILD
alarm(0) = 30
alarm(30) = 0
waitid(P_ALL, 0, 0x08047270, WEXITED|WTRAPPED) Err#10 ECHILD
This will then consume a CPU and the user's CPU shares. The
application is never going to be fixed, so the normal advice is not to
use it. However, since it can be NFS mounted from anywhere, I can't
reliably delete all copies of it, so occasionally we see runaway
processes on our build server.
It turns out this is a snip to fix with dtrace. Simply look for cases
where the wait system call returns an error with errno set to ECHILD
(10). If that happens more than 20 times in a row for the same process,
with no successful fork in between, then stop the process.
The script is simple enough for me to just do it on the command line:
# dtrace -wn 'syscall::waitsys:return / arg1 <= 0 &&
execname == "make.bin" && errno == 10 && waitcount[pid]++ > 20 / {
stop();
printf("uid %d pid %d", uid, pid) }
syscall::forksys:return / arg1 > 0 / { waitcount[pid] = 0 }'
dtrace: description 'syscall::waitsys:return ' matched 2 probes
dtrace: allowing destructive actions
CPU ID FUNCTION:NAME
2 20588 waitsys:return uid 36580 pid 29252
3 20588 waitsys:return uid 36580 pid 2522
5 20588 waitsys:return uid 36580 pid 28663
7 20588 waitsys:return uid 36580 pid 29884
10 20588 waitsys:return uid 36580 pid 941
15 20588 waitsys:return uid 36580 pid 1098
This was way easier than messing around with prstat, truss and
pstop!
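For anyone who would rather keep this around than retype it, here is a sketch of the same one-liner laid out as a D script file. The probes, predicates and actions are copied from the command above. The only assumptions are the file layout itself and the use of the destructive pragma in place of the -w flag; the file name stopmake.d in the usage note is made up.

```d
#!/usr/sbin/dtrace -s

/* stop() is a destructive action, so enable it here instead of using -w. */
#pragma D option destructive

/*
 * A wait call from make.bin came back with errno 10 (ECHILD).
 * The counter is only bumped when the earlier tests in the predicate
 * pass, and the clause fires once the count for this pid was already
 * above 20 before the increment.
 */
syscall::waitsys:return
/arg1 <= 0 && execname == "make.bin" && errno == 10 && waitcount[pid]++ > 20/
{
	stop();
	printf("uid %d pid %d", uid, pid);
}

/* fork returned a positive value, so the process is doing real work: reset. */
syscall::forksys:return
/arg1 > 0/
{
	waitcount[pid] = 0;
}
```

Saved as, say, stopmake.d and made executable, it would be run as root in the same way as the one-liner. A process stopped by it stays stopped until someone looks at it and either kills it or resumes it with prun.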