Cycling Blogs and Bloggers

One of my (many) pet peeves are shell scripts that fail to delete any temporary files they use. Included in this pet peeve are shell scripts that create more temporary files than they absolutely need, in most cases the number is 0 but there are a few cases where you really do need a temporary file but if it is temproary make sure you always delete the file.

The trick here is to use the EXIT trap handler to delete the file. That way if your script is killed (unless it is kill with SIGKILL) it will still clean up. Since you will be using mktemp(1) to create your temporary file and you want to minimize any race condition where the file could be left around you need to do (korn shell):

trap '${TMPFILE:+rm ${TMPFILE}}' EXIT

TMPFILE=$(mktemp /tmp/$0.temp.XXXXXX)

if further down the script you delete or rename the file all you have to do is unset TMPFILE eg:

mv $TMPFILE /etc/foo && unset TMPFILE

If you share a file system using the CIFS server (not SAMBA) and create a file in that file system using Windows XP the file ends up with these strange permissions and an ACL like this:

: pearson FSS 12 $; ls -vd Bad
d---------+  2 cjg      staff          2 Nov 13 17:11 Bad
     0:user:cjg:list_directory/read_data/add_file/write_data/add_subdirectory
         /append_data/read_xattr/write_xattr/execute/delete_child
         /read_attributes/write_attributes/delete/read_acl/write_acl
         /write_owner/synchronize:allow
     1:group:2147483648:list_directory/read_data/add_file/write_data

         /add_subdirectory/append_data/read_xattr/write_xattr/execute

         /delete_child/read_attributes/write_attributes/delete/read_acl

         /write_acl/write_owner/synchronize:allow

: pearson FSS 13 $; 


The first thing that riles UNIX some users is the lack of any file permissions, although things seem to work fine. The strange group ACL is for the local WINDOWS SYSTEM group. However the odd thing is for me it renders iTunes on the Windows system unable to see the files that it has created.

The solution is to add a default ACL to the root of the file system (well to every object in the file system if the file system is not new) that looks like this:

A+owner@:full_set:fd:allow,everyone@:read_set/execute:fd:allow

So this has the rather pleasant side effect of setting the UNIX permissions to something more recognisable:

: pearson FSS 20 $; ls -vd Good
drwxr-xr-x+  2 cjg      staff          2 Nov 13 18:16 Good
     0:owner@:list_directory/read_data/add_file/write_data/add_subdirectory
         /append_data/read_xattr/write_xattr/execute/delete_child
         /read_attributes/write_attributes/delete/read_acl/write_acl
         /write_owner/synchronize:file_inherit/dir_inherit/inherited:allow
     1:everyone@:list_directory/read_data/read_xattr/execute/read_attributes
         /read_acl:file_inherit/dir_inherit/inherited:allow
: pearson FSS 21 $; 

and the even more pleasant side effect of making iTunes works again!

I've said many times that dtrace is not just a wonderful tool for developers and performance gurus. The Kings of Computing, which are of course System Admins, also find it really useful.

There is an ancient version of make called Parallel make that occasionally suffers from a bug (1223984) where it gets into a loop like this:

waitid(P_ALL, 0, 0x08047270, WEXITED|WTRAPPED)	Err#10 ECHILD
alarm(0)					= 30
alarm(30)					= 0
waitid(P_ALL, 0, 0x08047270, WEXITED|WTRAPPED)	Err#10 ECHILD
alarm(0)					= 30
alarm(30)					= 0
waitid(P_ALL, 0, 0x08047270, WEXITED|WTRAPPED)	Err#10 ECHILD

This will then consume a CPU and the users CPU shares. The application is never going to be fixed so the normal advice is not to use it. However since it can be NFS mounted from anywhere I can't reliably delete all copies of it so occasionally we will see run away processes on our build server.

It turns out this is a snip to fix with dtrace. Simply look for cases where the wait system call returns an error and errno is set to ECHILD (10) and if that happens 10 times in a row for the same process and that process does not call fork then stop the process.

The script is simple enough for me to just do it on the command line:


# dtrace -wn 'syscall::waitsys:return / arg1 <= 0 && 
execname == "make.bin" && errno == 10  && waitcount[pid]++ > 20 / {

	stop();

	printf("uid %d pid %d", uid, pid) }

syscall::forksys:return / arg1 > 0 / { waitcount[pid] = 0 }'
dtrace: description 'syscall::waitsys:return ' matched 2 probes
dtrace: allowing destructive actions
CPU     ID                    FUNCTION:NAME
  2  20588                   waitsys:return uid 36580 pid 29252
  3  20588                   waitsys:return uid 36580 pid 2522
  5  20588                   waitsys:return uid 36580 pid 28663
  7  20588                   waitsys:return uid 36580 pid 29884
 10  20588                   waitsys:return uid 36580 pid 941
 15  20588                   waitsys:return uid 36580 pid 1098

This was way easier then messing around with prstat, truss and pstop!

At the request of the users the access hours for Sun Ray users in the house have been relaxed so that on Friday and Saturday nights the Sun Ray's in bedrooms can be used later.

This required that the access hour script be updated to understand the day of the week and hence the access_hour file also is updated in an incompatible way. There is now an extra column representing the days of the week when the rule is applied as the first column after the name of the user. The day of the week field will take a wild card '*' or ranges (1-5) for Monday to Friday, or lists (1,3,5). Sunday is day 0 as any self respecting geek would have it.

The new access_file I have looks something like this:

    user0:0-4:0001:2300:P8.00144f7dc383
    
    user2:0-4:0630:2300
    
    user3:0-4:0630:2230
    
    user4:0-4:0630:2100
    
    user4:5-6:0630:2200
    

The script is still here: http://blogs.sun.com/chrisg/resource/check_access_hours

ZFSでのdeduplication実装その2

直前に書いたzfs dedup ですが、 早速社内の人が試した結果がきました。 とっても単純な試験ですが、とてもわかりやすい検証結果となっています。 ビルド#128はまだありませんので、開発途中のビルドからbfuをしたそうです。 最初dfとかzfs listみてもdedupの効果がわからずとまどったようですが、zpool listでちゃんとわかるということです。

  • lacieusb という pool を作成し、dedup を on にした初期状態(ファイル等なし)
    $ zfs list NAME USED AVAIL REFER MOUNTPOINT lacieusb 166K 293G 21K /lacieusb ... $ zpool list NAME SIZE USED AVAIL CAP DEDUP HEALTH ALTROOT lacieusb 298G 310K 298G 0% 1.00x ONLINE - ... $ zdb -S lacieusb Simulated DDT histogram: bucket allocated referenced ______ ______________________________ ______________________________ refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE ------ ------ ----- ----- ----- ------ ----- ----- ----- ...
  • dataset の作成
    $ zfs create lacieusb/test1 $ zfs create lacieusb/test2 $ zfs create lacieusb/test3
  • dataset lacieusb/test1 に 1GB のファイル testfile を作成。 ファイル 1 つの初期状態。
    $ cd /lacieusb/test1 $ dd if=/dev/urandom of=testfile bs=1024 count=1000000 $ sum testfile 36236 2000000 testfile $ zpool list NAME SIZE USED AVAIL CAP DEDUP HEALTH ALTROOT lacieusb 298G 981M 297G 0% 1.00x ONLINE - ... $ zfs list NAME USED AVAIL REFER MOUNTPOINT lacieusb 980M 292G 25K /lacieusb lacieusb/test1 977M 292G 977M /lacieusb/test1 lacieusb/test2 21K 292G 21K /lacieusb/test2 lacieusb/test3 21K 292G 21K /lacieusb/test3 ... $ zdb -S lacieusb Simulated DDT histogram: bucket allocated referenced ______ ______________________________ ______________________________ refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE ------ ------ ----- ----- ----- ------ ----- ----- ----- 1 7.63K 977M 977M 977M 7.63K 977M 977M 977M Total 7.63K 977M 977M 977M 7.63K 977M 977M 977M dedup = 1.00, compress = 1.00, copies = 1.00, dedup * compress / copies = 1.00 ...
  • dedup の確認開始。 念のため、testfile を lacieusb ではない外の pool にコピーする。 コピーしたファイルを /lacieusb/test2 と/lacieusb/test3 にコピーする。
    $ sum /usr/tmp/testfile 36236 2000000 /usr/tmp/testfile $ cp -fpr testfile /lacieusb/test2/ $ sum /lacieusb/test2/testfile 36236 2000000 /lacieusb/test2/testfile $ zpool list NAME SIZE USED AVAIL CAP DEDUP HEALTH ALTROOT lacieusb 298G 983M 297G 0% 2.00x ONLINE - rpool 228G 168G 59.9G 73% 1.00x ONLINE - ... $ zfs list NAME USED AVAIL REFER MOUNTPOINT lacieusb 1.91G 291G 25K /lacieusb lacieusb/test1 977M 291G 977M /lacieusb/test1 lacieusb/test2 977M 291G 977M /lacieusb/test2 lacieusb/test3 21K 291G 21K /lacieusb/test3 ... $ zdb -S lacieusb Simulated DDT histogram: bucket allocated referenced ______ ______________________________ ______________________________ refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE ------ ------ ----- ----- ----- ------ ----- ----- ----- 2 7.63K 977M 977M 977M 15.3K 1.91G 1.91G 1.91G Total 7.63K 977M 977M 977M 15.3K 1.91G 1.91G 1.91G dedup = 2.00, compress = 1.00, copies = 1.00, dedup * compress / copies = 2.00 ...
    zfs listでは普通に領域を消費しているように見えますが、 zpool listではほぼ1GBの消費にとどまっているのが確認できます。
  • さらに /lacieusb/test3 へ cp する。
    $ cp /lacieusb/test2/testfile /lacieusb/test3 $ ls -la 合計 2001617 drwxr-xr-x 2 root root 3 11月 4日 16:50 ./ drwxr-xr-x 5 root root 5 11月 4日 16:21 ../ -rw-r--r-- 1 root root 1024000000 11月 4日 16:50 testfile $ zfs list NAME USED AVAIL REFER MOUNTPOINT lacieusb 2.87G 290G 25K /lacieusb lacieusb/test1 977M 290G 977M /lacieusb/test1 lacieusb/test2 977M 290G 977M /lacieusb/test2 lacieusb/test3 977M 290G 977M /lacieusb/test3 ... $ zpool list NAME SIZE USED AVAIL CAP DEDUP HEALTH ALTROOT lacieusb 298G 983M 297G 0% 3.00x ONLINE - rpool 228G 168G 59.9G 73% 1.00x ONLINE - ... $ zdb -S lacieusb Simulated DDT histogram: bucket allocated referenced ______ ______________________________ ______________________________ refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE ------ ------ ----- ----- ----- ------ ----- ----- ----- 2 7.63K 977M 977M 977M 22.9K 2.86G 2.86G 2.86G Total 7.63K 977M 977M 977M 22.9K 2.86G 2.86G 2.86G dedup = 3.00, compress = 1.00, copies = 1.00, dedup * compress / copies = 3.00
  • ファイルの変更。 block単位の動きを確認するために、ghex2でlacieusb4/testfile を開き適当に値を書き換えた。
    $ ls -al -rw-r--r-- 1 root root 1024000000 11月 4日 17:02 testfile $ sum testfile 36791 2000000 testfile $ zpool list NAME SIZE USED AVAIL CAP DEDUP HEALTH ALTROOT lacieusb 298G 984M 297G 0% 3.99x ONLINE - rpool 228G 168G 59.9G 73% 1.00x ONLINE - ... $ zdb -S lacieusb Simulated DDT histogram: bucket allocated referenced ______ ______________________________ ______________________________ refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE ------ ------ ----- ----- ----- ------ ----- ----- ----- 1 1 128K 128K 128K 1 128K 128K 128K 2 1 128K 128K 128K 3 384K 384K 384K 4 7.63K 977M 977M 977M 30.5K 3.81G 3.81G 3.81G Total 7.63K 977M 977M 977M 30.5K 3.81G 3.81G 3.81G ... dedup = 4.00, compress = 1.00, copies = 1.00, dedup * compress / copies = 4.00
    1MBしか増えていませんので、ファイル単位ではなくblock単位であることが確認できます。
時間があればもう少し実用的で、どの程度CPUやメモリに対する負荷がかかるのかみてみたいところです。

ZFSでのdeduplication実装

Jeff Bonwick氏 新しい書き込み でZFSのdedupが話題なっていますね。 deduplication、重複をなくすということですが、これ自身が最近結構話題になっていると思います。 また、これほど早くdedupがzfsで実装されるとは思いませんでした。 Jeffのblogには書いていませんが、OpenSolarisのビルド#128に入ってくるそうです。 ZFSはもとともsnapshot/cloneを駆使することで、 使い方によってはかなり重複がさけられる運用ができますが、 これでますます無駄のないストレージの運用プラス速度の向上が見込めますね。 もちろん様々なトレードオフがあるので、 実際の運用にどの方法が適しているのかみてみる必要もあると思いますが。

以下、Jeffのブログの要約です。 英語の参照文も入れおきますので、興味あれば原文を読む時の参考にしてください。 だれか全部訳してくれるまでということで。

  • block 単位。
    ZFS provides block-level deduplication because this is the finest granularity that makes sense for a general-purpose storage system.
  • dedupはバッチで後から処理する実装と、その場その場で行うのがありますが、 zfsの実装はその場(real time?)です。
    ZFS deduplication is synchronous.
  • zpoolではなくzfsに対するプロパティとなり、fs 単位で設定になります。 他のプロパティ同様継承もされます。
    zfs set dedup=on tank
  • dedupの検出に省略時はSHA256のハッシュを使いますが、 もしこれでは信用できないと思ったら、実際の中身を比べる方法もあります。
    zfs set dedup=verify tank
  • さらに組み合わせも(計算の速いハッシュだけど、確認もする)。
    zfs set dedup=fletcher4,verify tank
  • 実装のヒント?
    to use the 256-bit block checksums in ZFS as hash signatures for dedup.

自転車29 - 富士チャレンジ200

とっても久しぶりの更新ですが、自転車ねたです。 ブログの更新は滞っていましたが、相変わらずほぼ毎週どこかに出かけていました。 よく出没するのは南多摩尾根幹線道路とか、甲州街道の大垂水峠と近場が多いですが、 今年は初めて箱根を1号線で超えて静岡の興津までいったり (行った先でOpenSolarisの勉強会に参加 - 2泊)、 中山道で軽井沢(1泊)にいったりしています。 また、最近はサンでほかに自転車に乗っている人たちや、 その友達の方たちに誘われて行く機会が何度かありました。 誘われた最初は正直「ゲーッ」て思うような話ばかりだったのですが(私にはまだ無理かなー?)、 誘ってもらったおかげで、富士のスバルラインと大弛峠にそれぞれ登ることができました (自分としては快挙!、ひとりではまずいかなかった)。 その中でもっとも「エーッ!!」と思ったのが今回の 富士チャレンジ200です。 なんとなく勢いでチームエンデューロ7時間に登録してしまったのですが、 考えれば考えるほど大それたことをしてしまったものだと、及び腰になっていました。 しかも、サンの社員ででる3人はすべてレース未経験、集団で走ることになれていません。 他の2人は早いからまだいいですけど....

でも、「サン」の名前で登録してしまったし、いろいろプランも立てたしということで、 元気に事故なしで完走することを目指し3人とも サンのジャージ を着てがんばることに。

富士チャレンジ200
  • 前日の金曜日から、一緒に行くメンバの知り合いの山中湖の別荘に前泊。 近くのレストランで夕食をとりますが、みんなまじめなため、ジョッキでビールを飲むのはわたしだけ。 チームの他のメンバはコップビール。 ソロ200kmを走る2人はアルコール完全抜き。 いきなり気合いが違う。
  • 当日は4:30位に起きて、T氏が作ってくれたおいしいパスタを頂いて、5:30には出発。 天気は予報で晴れのはずが、雲が厚くちょっと不安。 三国峠を越えて富士スピードウエイにいってみると案の定すでにいっぱい車がならんでいます。 ピットから少し遠いところの駐車場になってしまいましたが、ピットはまだ混んでおらず、場所取りはまあまあ。
    送信者 富士チャレンジ200

  • 7:00から試走の開始で、あたふたとしながら、試走に行きます。 初めてのF1コースでの自転車、「うーん」めっちゃ気持ちよい!!! 試走ですから軽く流すだけなのですし、上り坂も結構あるのですが、 とにかく気持ちよい! 路面がスムース、広い道路を自転車だけで占有!。 曇りだけど富士山がよく見える(って見とれていると事故りますな)。 とりあえず2周で終わり、軽く汗をかく程度でした。
  • 8:00からは子供の部の開始です。 かわいいですね。 でも中にはかなり本格的な自転車に乗っている子もいたりします。 いくらするだあれーとかいろいろ勝手なことを思っていると、ソロにでる2人がそろそろ準備にいくとのこと。 先日のじゃんけんで負けてトップバッターになってしまった私(T T)もそろそろ並ぼうかなという時間に。
  • 9:00スタートですが、約2300人が同時にスタートする雰囲気は異常です。 後からやってきて無理矢理前に割り込んでくる人がいたりもする中、開始前の挨拶やらなんやらやっています。 そんな中、初めてで、トップバッターで、かなり前の方にならんで(鶴見伸吾が手を振っているのはわかる)、 何となく周りの自転車とはクラスがかなり違いそうで(みんなDura-AceとかUltegraじゃん)場違いな雰囲気で、 だんだん緊張していきます。
  • いよいよスタートですが、一週目がローリングスタートなのが救いです。前の人を抜いてはいけないので、 ついて行くだけです。 あれ、でも、半分くらい回ったあたりから、抜いていく人が、いやでも、前の人を抜かないように.... うーん...
  • 2周目からはふつうに走ってよいのですが、私の最初の担当は8周です。 どのくらいの力で走ればよいのか?考えつつも、ほとんど力一杯走ります。 ラップは7分20秒くらい、私としてはまあまあでしょう。 回を重ねるごとに遅くなっていきます。 どうも、他の人とは力を入れるポイントが違うことがわかってきます。 私はビビリなので下りながらつっこむ第1コーナーはかなりスピードを落としてしまいますが、 コーナの立ち上がりからしっかりこいで、60km/h MAX位でコカコーラコーナあたりまでがんばります (ふつうはこの気持ちのよい下りで力をためて回復するようです)。 その後も結構速度が出しやすいです。 ダンロップコーナから本格的な登りになりますが、 いわゆるトップ集団とか目立って早い人たちをのぞけば、なぜか私は早いほうになります 52tのアウターでなんとか最後まで登れますが、 本当にそれがよい判断かはよくわかりません。 基本左側通行で、抜くときは右から抜くのですが、怖いのはこういう登りにいる集団が自分より遅い人が 多いと思って右に行くと、とんでもなく早い人たちが、さらに右から、時には左から抜きにくるのでビビリます。 そして最後のパナソニックコーナをダンシングで抜けると直線なのですが、私は上り坂で力を使いすぎるようで、 登りで抜いた人も含めて直線の最初の頃でかなり抜かれてしまいました。 30km/h位まで速度を上げるのが遅い のですね (T T)。 2周目だったかな?、ソロにでているY氏が見えました、 あれー、彼はソロなのにやっぱり私より早いよー...じわじわと見えなくなってしまいました。
  • 直線ではよくトレインに乗れというが常識のようなのですが、 私はなかなか乗れませんでした。 だいたいまともなトレインは速すぎでついて行けませんでしたし、 これだと思って、後ろに乗ったつもりが、途中から乗ってしまったのか、 すごい形相で後ろからきた人に場所をとられてしまったり、 早いと思って乗ったのに、直線の後半で速度が上がらず、降りてしまったり。 どうも、私のペース配分は一般的ではないようです。
  • そうこうしているうちに私の最初の任務完了でピットに入ります。 ここで、2時間くらいまっていることになるのですが、早速よい知らせが。 携帯で検索すると自分のチーム成績がリアルタイムでわかるのですが、 なんと50位!。
    送信者 富士チャレンジ200

    220チームくらい出走しているはずなので、私としては快挙。 と、喜んでいると、なんと、どんどん順位が上がっていきます。 すげー、早いと思ったけどそこまでとは.... 3人目が出走するころにはなんと30位。 さて、休憩時間の楽しみにはソロと違って時間がとれるので...なんと無料のおしるこが餅入りで食べられます。 チームの応援をしたりしていると、2時間は意外と早く過ぎていきます。 そして、私に再び番が回ってくることには25位。 すごい!、問題は私がどこまで順位下げてしまうのだろうか....
  • 2度目は7周です。 2時間休んだから、元のペースでがんばろう!..あれ?ペースが、ラップではかると前回の最後の時間とかわらず、 がっかり。 少しでもペースを落とさないようにとがんばりますが、やっぱり落ちる。順位が気になりだしたところで、交代。 チームメンバによると順位は変わっていないとのこと、ほんとう?! チームメンバが優しくて、落ちた瞬間は見てないことにしてくれたのかもしれません。 2回目の3人目が走る頃にはなんと雨が降り出したではないですか。 事故もコース上では起こっているようで、規制されています。 うーん、なんか不安、と思っているうちに再び私の番。 この時点で20位くらいだったと思います。
  • 残り時間約1時間で3人が走れるように今度は3周です。 疲れもそこそこたまっていますが、雨が怖いです。 コーナはさっきよりもゆっくり回る代わりに、上り坂をよりがんばることにしました。 お、T氏がいたー、「がんばれー」と声をかけるのが私には精一杯。 しかし、さすがに3周はあっという間に終わり、私の出番は終了。 2番走者が4周走ることにして、最終走者に最後を託します (時間的には4周はむりかなー)。 さすがに4周は無理でしたが、なんと結果は18位!! いや、私以外の2人は本当に早いです。
    送信者 富士チャレンジ200

  • ソロにでていた2人も無事に完走。 とくにそのうちの1人T氏の方は半年前に本格的に自転車を始められたばかりなのにすごいです。 もう一人のソロのY氏のほうもすごいのは自分が終わった後、T氏のアシストに2周も余分に回ったそうです。 あー、チームで走った我々は「がんばれー」の声くらいかかけなかったな...ごめんなさい _o_。
  • レースが終わったら、雨の中撤収です。 いったん別荘に戻って、温泉にでかけ、近くのレストランからのケータリングと、 自分たちで持ち込んだ飲み物で宴会です! なんとT氏はここでもパスタを作ってくれました、感謝(実は次の日の朝も!)
初めての本格的なレース、めちゃめちゃビビリましたが、とっても楽しかったです。 また、この機会を与えてくれて、私にもちゃんと走らせてくれた仲間たちに感謝感激です、 改めて「ありがとうございました」。

Since the "nevada" builds of Solaris next are due to end soon and for some time the upgrade of my home server has involved more than a little bit of TLC to get it to work I will be moving to an OpenSolaris build just as soon as I can.

However before I can do this I need to make sure I have all thesoftware to provide home service. This is really a note to myself to I don't forget anything.

  • Exim Mail Transfer Agent (MTA). Since I am using certain encryption routines, virus detection and spamassassin I was unable to use the standard MTA, sendmail, when the system was originaly built and have been using exim, from blastwave. I hope to build and use exim without getting all the cruft that comes from the Blastwave packaged. So far this looks like it will be simple as OpenSolaris now has OpenSSL.

  • An imapd. Currently I have a blastwave version but again I intend to build this from scratch again the addition of OpenSSL and libcrypto should make this easy.

  • Clamav. To protect any Windows systems and to generally not pass on viri to others clamav has been scanning all incoming email. Again I will build this from scratch as I already do.

  • Spamassassin. Again I already build this for nevada so building it for OpenSolaris will be easy.

  • Ddclient. Having dynamic DNS allows me to login remotely and read email.

  • Squeezecenter. This is a big issue and in the past has proved hard to get built thanks to all the perl dependacies. It is for that reason I will continue to run it in a zone so that I don't have to trash the main system. Clearly with all my digital music loaded into the squeezecentre software this has to work.

I'm going to see if I can jump through the legal hoops that will allow me to contribute the builds to the contrib repository via Source Juicer. However as this is my spare time I don't know whether the legal reviews will be funded.

Due to the way OpenSolaris is delivered I also need to be more careful about what I install. rather than being able to choose everything. First I need my list from my laptop. Then in addtion to that I'll need

  • Samba - pkg:/SUNWsmba

  • cups - pkg:/SUNWcups

  • OpenSSL - pkg:/SUNWopenssl

Oh and I'll need the Sun Ray server software.

I've finally worked out how to drive purple-url-handler. Strictly John worked it out, so I will stand on his shoulders, but for some reason it would not work for me and I now know why and have a workaround.

First you need an XMPP URI on a web page. Some thing like:

xmpp:chrisg_fans@muc.im.sun.com?join

will when clicked in a browser that has the right helper, something OpenSolaris has had for some time, will take your IM client to that room. However with pidgin that is only the case if that room is available in the first XMPP server listed in your list of accounts. So given that this room is on Sun's IM server with the list of accounts looking like this:


It will try and connect to the first XMPP server listed, which is google and hence fail. Changing the order to be:


and then logging in and out and now the link will work. You can drag and drip the entries in pidgin.

Today I took the plunge and moved from working on our Nevada based Sun Ray Servers to one running OpenSolaris. So that I could get the full OpenSolaris look and feel I first purged my home directory of a number of configuration files and directories using a script like1 this:

#!/bin/ksh -p
TARGET=b4OpenSolaris
test -d $HOME/$TARGET || mkdir $HOME/$TARGET
mv $HOME/.ICEauthority $HOME/$TARGET
mv $HOME/.cache $HOME/$TARGET
mv $HOME/.chewing $HOME/$TARGET
mv $HOME/.config $HOME/$TARGET
mv $HOME/.dbus $HOME/$TARGET
mv $HOME/.dmrc $HOME/$TARGET
mv $HOME/.gconf $HOME/$TARGET
mv $HOME/.gconfd $HOME/$TARGET
mv $HOME/.gksu.lock $HOME/$TARGET
mv $HOME/.gnome2 $HOME/$TARGET
mv $HOME/.gnome2_private $HOME/$TARGET
mv $HOME/.gstreamer-0.10 $HOME/$TARGET
mv $HOME/.gtk-bookmarks $HOME/$TARGET
mv $HOME/.iiim $HOME/$TARGET
mv $HOME/.local $HOME/$TARGET
mv $HOME/.nautilus $HOME/$TARGET
mv $HOME/.printer-groups.xml $HOME/$TARGET
mv $HOME/.rnd $HOME/$TARGET
mv $HOME/.sunstudio $HOME/$TARGET
mv $HOME/.sunw $HOME/$TARGET
mv $HOME/.updatemanager $HOME/$TARGET
mv $HOME/.xesam $HOME/$TARGET
mv $HOME/.xsession-errors $HOME/$TARGET

I generated the list by installing OpenSolaris in a VirtualBox and then logging in and doing a bit of browsing and general usage and then seeing was was created. Additionally “.mozilla” was created but I chose to retain that so that I can keep all the history that is in my browser.

Once logged in I have removed the update-manager icon as I am not the administrator. I have also removed the power notification and network monitor as they provide no useful data on a Sun Ray server.

Using “System->Preferences->Startup Applications” I unchecked the codeina update notifier and added my script for updating my IM status.

So far so good but it is taking a while to get used to the menu being a the top and the window list at the bottom of the screen.

1Like as in similar to and not this exact script as mine had my home directory hard coded into it.

Some of the most common failures that result in customer calls are misuses of the memory allocation routines, malloc, calloc, realloc, valloc, memalign and free. There are many ways in which you can misuse these routines and the data that they return and the resulting failures often occur within the routines even though the problem is with the calling program.

I'm not going to discuss here all the ways you can abuse these routines but look at a particular type abuse. The double free. When you allocate memory using these routines it is your responsibility to free it again so that the memory does not “leak”. However you must only free the memory once. Freeing it more than once is a bug and the results of that are undefined.

This very simple code has a double free:

#include <stdlib.h>

void
doit(int n, char *x)
{
        if (n-- == 0)
                free(x);
        else
                doit(n,x);
}
int
main(int argc, char **argv)
{
        char *x;
        char *y;

        x = malloc(100000);
        
        doit(3, x);
        doit(10, x);
}

and if you compile and run that program all appears well;


However a more realistic program could go on to fail in interesting ways leaving you with the difficult task of finding the culprit. It is for that reason the libumem has good checking for double frees:


: exdev.eu FSS 26 $;  LD_PRELOAD=libumem.so.1 /home/cg13442/lang/c/double_free
Abort(coredump)
: exdev.eu FSS 27 $; mdb core
Loading modules: [ libumem.so.1 libc.so.1 ld.so.1 ]
> ::status
debugging core file of double_free (64-bit) from exdev
file: /home/cg13442/lang/c/double_free
initial argv: /home/cg13442/lang/c/double_free
threading model: native threads
status: process terminated by SIGABRT (Abort), pid=18108 uid=14442 code=-1
> ::umem_status
Status:         ready and active
Concurrency:    16
Logs:           (inactive)
Message buffer:
free(e53650): double-free or invalid buffer
stack trace:
libumem.so.1'umem_err_recoverable+0xa6
libumem.so.1'process_free+0x17e
libumem.so.1'free+0x16
double_free'doit+0x3a
double_free'doit+0x4d
double_free'doit+0x4d
double_free'doit+0x4d
double_free'doit+0x4d
double_free'doit+0x4d
double_free'doit+0x4d
double_free'doit+0x4d
double_free'doit+0x4d
double_free'doit+0x4d
double_free'doit+0x4d
double_free'main+0x100
double_free'_start+0x6c

> 

Good though this is there are situations when libumem is not used and others where it can't be used1. In those cases it is useful to be able to use dtrace to do this and any way it is always nice to have more than one arrow in your quiver:


: exdev.eu FSS 54 $; me/cg13442/lang/c/double_free 2> /dev/null              <
/usr/sbin/dtrace -qs doublefree.d -c /home/cg13442/lang/c/double_free 2> /dev/null
Hit Control-C to stop tracing
double free?
	Address: 0xe53650
	Previous free at: 2009 Jun 23 12:23:22, LWP -1
	This     free at: 2009 Jun 23 12:23:22, LWP -1
	Frees 42663 nsec apart
	Allocated 64474 nsec ago by LWP -1

              libumem.so.1`free
              double_free`doit+0x3a
              double_free`doit+0x4d
              double_free`doit+0x4d
              double_free`doit+0x4d
              double_free`doit+0x4d
              double_free`doit+0x4d
              double_free`doit+0x4d
              double_free`doit+0x4d
              double_free`doit+0x4d

: exdev.eu FSS 56 $; 

If run as root you can get the the real LWP values that did the allocation and the frees:

: exdev.eu FSS 63 $; pfexec /usr/sbin/dtrace -qs doublefree.d -c /home/cg1344>
Hit Control-C to stop tracing
double free?
	Address: 0xe53650
	Previous free at: 2009 Jun 23 14:21:29, LWP 1
	This     free at: 2009 Jun 23 14:21:29, LWP 1
	Frees 27543 nsec apart
	Allocated 39366 nsec ago by LWP 1

              libumem.so.1`free
              double_free`doit+0x3a
              double_free`doit+0x4d
              double_free`doit+0x4d
              double_free`doit+0x4d
              double_free`doit+0x4d
              double_free`doit+0x4d
              double_free`doit+0x4d
              double_free`doit+0x4d
              double_free`doit+0x4d

: exdev.eu FSS 64 $;

Here is the script in all it's glory.

#!/usr/sbin/dtrace -qs

BEGIN
{
	printf("Hit Control-C to stop tracing\n");
}
ERROR 
/ arg4 == DTRACEFLT_KPRIV || arg4 == DTRACEFLT_UPRIV /
{
	lwp = -1;
}

pid$target::realloc:entry,
pid$target::free:entry
{
	self->addr = arg0;
	self->recurse++;
}

pid$target::realloc:return,
pid$target::free:return
/ self->recurse /
{
	self->recurse--;
	self->addr = 0;
}

pid$target::malloc:entry,
pid$target::memalign:entry,
pid$target::valloc:entry,
pid$target::calloc:entry,
pid$target::realloc:entry,
pid$target::realloc:entry,
pid$target::free:entry
/ lwp != -1 && self->lwp == 0 /
{
	self->lwp = curlwpsinfo->pr_lwpid;
}

pid$target::malloc:entry,
pid$target::calloc:entry,
pid$target::realloc:entry,
pid$target::memalign:entry,
pid$target::valloc:entry,
pid$target::free:entry
/ self->lwp == 0 /
{
	self->lwp = lwp;
}

pid$target::malloc:return,
pid$target::calloc:return,
pid$target::realloc:return,
pid$target::memalign:return,
pid$target::valloc:return
{
	alloc_time[arg1] = timestamp;
	allocated[arg1] = 1;
	free_walltime[arg1] = 0LL;
	free_time[arg1] = 0LL;
	free_lwpid[arg1] = 0;
	alloc_lwpid[arg1] = self->lwp;
	self->lwp = 0;
}

pid$target::realloc:entry,
pid$target::free:entry
/ self->recurse == 1 && alloc_time[arg0] && allocated[arg0] == 0 /
{
	printf("double free?\n");
	printf("\tAddress: 0x%p\n", arg0);
	printf("\tPrevious free at: %Y, LWP %d\n", free_walltime[arg0],
		free_lwpid[arg0]);
	printf("\tThis     free at: %Y, LWP %d\n", walltimestamp,
		self->lwp);
	printf("\tFrees %d nsec apart\n", timestamp - free_time[arg0]);
	printf("\tAllocated %d nsec ago by LWP %d\n",
		timestamp - alloc_time[arg0], alloc_lwpid[arg0]);

	ustack(10);
}

pid$target::realloc:entry,
pid$target::free:entry
/ self->recurse == 1 && alloc_time[arg0] && allocated[arg0] == 1 /
{
	free_walltime[arg0] = walltimestamp;
	free_time[arg0] = timestamp;
	free_lwpid[arg0] = self->lwp;

	allocated[arg0] = 0;
}

pid$target::free:entry
/self->lwp && self->recurse == 0/
{
	self->lwp = 0;
}

1Most of the cases it “can't” be used is because it finds fatal problems early on in the start up of applications. Then the application writers make bizarre claims that this is a problem with libumem and will tell you it is not supported with their app. In fact the problem is with the application.

1Iostat has been around for years and until Dtrace came along and allowed us to look more deeply into the kernel was the tool for analysing how the io subsystem was working in Solaris. However interpreting the output has proved in the past to cause problems.

First if you are looking at latency issues it is vital that you use the smallest time quantum to iostat you can, which as of Solaris 10 is 1 second. Here is a sample of some output produced from “iostat -x 1”:

                  extended device statistics                 
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
sd3       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
                 extended device statistics                 
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
sd3       5.0 1026.5    1.6 1024.5  0.0 25.6   24.8   0  23 
                 extended device statistics                 
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
sd3       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 


The first thing to draw your attention to is the Column “%b” which the manual tells you is:

%b percent of time the disk is busy (transactions in progress)

So in this example the disk was “busy”, ie had at least one transaction (command) in progress for 23% of the time period. Ie 0.23 seconds as the time period was 1 second.

Now look at the “actv” column. Again the manual says:

actv average number of transactions actively being serviced (removed from the queue but not yet completed)
This is the number of I/O operations accepted, but not yet serviced, by the device.
In this example the average number of transactions outstanding for this time quantum was 25.6. Now here is the bit that is so often missed. Since we know that all the transactions actually took place within 0.23 seconds and were not evenly spread across the full second the average queue depth when busy was 100/23 * 25.6 or 111.3. Thanks to dtrace and this D script you can see the actual IO pattern2:

Even having done the maths iostat smooths out peaks in the IO pattern and thus under reports the peak number of transactions as 103.0 when the true value is 200.
The same is true for the bandwidth. The iostat above comes reports 1031.5 transactions a second (r/s + w/s) again though this does not take into account that all those IO requests happened in 0.23 seconds. So the true figure for the device would be 1031.5 * 100/23 which is 4485 transations/sec.
If we up the load on the disk a bit then you can conclude more from the iostat:
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
sd3       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
                 extended device statistics                 
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
sd3       5.0 2155.7    1.6 2153.7 30.2 93.3   57.1  29  45 
                 extended device statistics                 
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
sd3       0.0 3989.1    0.0 3989.1 44.6 157.2   50.6  41  83 
                 extended device statistics                 
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
sd3       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 

Since the %w column is non zero, and from the manual %w is:

%w percent of time there are transactions waiting for service (queue non-empty)

This is telling us that the device's active queue was full. So on the third line of the above output the devices queue was full for 0.41 seconds. Since the queue depth is quite easy to find out3 and in this case was 256, you can deduce that the queue depth for that 0.41 seconds was 256. Thus the average for the 0.59 seconds left was (157.2-(0.41*256))/0.59 which is 88.5. The graph of the dtrace results tells a different story:


These examples demonstrate what can happen if your application dumps a large number of transactions onto a storage device while the through put will be fine and if you look at iostat data things can appear ok if the granularity of the samples is not close to your requirement for latency any problem can be hidden by the statistical nature of iostat.

1Apologies to those who saw a draft of this on my blog briefly.

2The application creating the IO attempts to keep 200 transations in the disk at all the time. It is interesting to see that it fails as it does not get notification of the completion of the IO until all or nearly all the outstanding transactions have completed.

3This command will do it for all the devices on your system:

echo '*sd_state::walk softstate | ::print -d -at "struct sd_lun" un_throttle' | pfexec mdb -k

however be warned the throttle is dynamic so dtrace gives the real answer.

What do you do if you manage to delete or corrupt /etc/name_to_major? Assuming you don't have a backup a ZFS snapshot or an alternative boot environment, in which case you probably are in the wrong job, you would appear to be in trouble.

First thing is not to panic. Do not reboot the system. If you do that it won't boot and your day has just got a whole lot worse. The data needed to rebuild /etc/name_to_major is in the running kernel so it can be rebuilt from that. If your system an x86 system it is also in the boot archive.

However if you have no boot archive or have over written it with the bad name_to_system this script will extract it from the kernel, all be it slowly:

#!/bin/ksh
i=0
while ((i < 1000 ))
do
print "0t$i::major2name" | mdb -k | read x && echo $x $i
let i=i+1 
done

1Redirect that into a file then move the remains of your /etc/name_to_major out of the way and copy the file in place.

Next time make sure you have a back up or snapshot or alternative boot environment!

1You will see lots of errors of the form “mdb: failed to convert major number to name” these are to be expected. They can be limited to just one by adding “|| break” to the mdb line but that assumes that you have no holes in the major number listings which you may have if you have removed a device, so best to not risk that.

My winter bike is ready for another dose of rain and darkness. This year I have a new headlight the 2.4W Schmidt Edelux. A single LED that throws out more light that my old 12V set up. The old Lumitec Oval Plus sensor failed at the end of last winter such that the only part that still worked was the lamp. Neither the sensor or the swith would turn it off and the standlight also failed. While I don't hold much store in a forward standlights having so many it was only time before the light really failed. Something I can't really risk.

So I have joined the 21st century and have an LED. I've only test ridden it up and down the road, which has street lights, so does not really do it justice but it was very impressive. Like the Oval Plus sensor it comes on automatically and has a manual override.

Unlike the Oval Plus the switch is well protected being a reed switch operated by a magnet so there is no way for water to get inside and like all Schmidt lights it has a 5 year warranty and looks fantasitc.

It is powered by the SON hub generator and also swithes the rear light.

Photo by Robyn Gerhard

Editing sd.conf has always been somewhat difficult thanks to it not being a documented interface and that the interface was never inteded to be exposed and it was even architecture specific. Fortunately Micheal documented it, which meant that it was known even if syntax remained obscure.

However after ARC case 2008/465 was approved and the changes pushed as part of bug 6518995 you can now use more a human readable syntax1:

sd-config-list=
        "ATA     VBOX HARDDISK", "disksort:false";

As it turns out the “disksort”2 option along with the thottle-max and throttle-min are the ones I most often want to tune.

Here is the current list of tunables lifted straight from the ARC case.


Tunable_Name

Commitment

Data_Type

cache-nonvolatile

Private

BOOLEAN

controller-type

Private

UINT32

delay-busy

Committed

UINT32

disksort

Private

BOOLEAN

timeout-releasereservation

Private

UINT32

reset-lun

Private

BOOLEAN

retries-busy

Private

UINT32

retries-timeout

Committed

UINT32

retries-notready

Private

UINT32

retries-reset

Private

UINT32

throttle-max

Private

UINT32

throttle-min

Private

UINT32


1This reminds me of the change to /etc/printcap that allowed you to specify the terminal flags as strings rather than as a bitmap. All the mystery seemed to be removed!

2While I used disksort as an example for this case I can't think of any reason why you would have it enabled for a virtual disk in VirtualBox.

I've only ever made 6 postings to this blog, largely because I was always concerned about the risk of losing access to it... and now so I shall.


 So, as a reminder, you are invited to read my more regular stream of thinkage at http://www.crypticide.com/dropsafe/.


 Best wishes all.


- alec


Someone has posted a script to start a remote xterm on BigAdmin which exposes a number of issues I thought it would be better if google stood some chance of finding a better answer or at least an answer that does not rely on inherently insecure settings.

Remote X applications should be started using ssh -X so that the X traffic is encrypted and if you add -C compressed which can be a significant performance boost. So a script to do this could be handy although to be honest knowing the ssh options or having them set as the default in your .ssh/config is just as easy:

: exdev.eu FSS 31 $; egrep '^(Compress|ForwardX)' ~/.ssh/config
ForwardX11 yes
Compression yes
: exdev.eu FSS 32 $; ssh -f pearson /usr/X11/bin/xterm         
: exdev.eu FSS 33 $; 

or more usefully to start graphical tools:

: exdev.eu FSS 33 $; ssh -f pearson pfexec /usr/sadm/admin/bin/dhcpmgr
: exdev.eu FSS 34 $; 

However if you really want a script to do it here is one that will and no need to mess with your .ssh/config

#!/bin/ksh
REMOTE_PATH=${REMOTE_PATH:-${PATH}}
APP=${0##*/}
if (( $# < 1 )) 
then
        print "USAGE: ${APP} host [args]" >&2
        exit 1
fi
host=$1
shift
exec /usr/bin/ssh -o ClearAllForwardings=yes -C -Xfn $host \
        PATH=${REMOTE_PATH} pfexec ${APP#r} $@

If you save this into a file called “rxterm” then running “rxterm remotehost” will start an xterm on the system remotehost assuming you can ssh to that system.

More entertainingly you can save it as “rdhcpmgr” and it will start the dhcpmgr program on a remote system and securely display it on your current display (assuming your PATH includes /usr/sadm/admin/bin and your profile allows you access to that application). You can use it to start any application by simple naming it after the application in question with a preceding “r”.

As I have lived with OpenSolaris I've got used to the updates occuring automatically as you would with most modern Operating Systems. What is a real joy is that it creates a new boot environment for the updates so in the event that one was toxic you can always just roll back. It also gives you a handy reference as to how many updates you have done. Number 13 has just happened for me:

cjg@brompton:~$ beadm list
BE               Active Mountpoint Space  Policy Created          
--               ------ ---------- -----  ------ -------          
b4nvidia-bin-fix -      -          86.0K  static 2009-06-06 17:27 
opensolaris-10   -      -          15.68M static 2009-07-18 08:04 
opensolaris-11   -      -          33.73M static 2009-07-26 09:42 
opensolaris-12   N      /          266.5K static 2009-08-24 13:26 
opensolaris-13   R      -          16.42G static 2009-08-26 18:58 
opensolaris-4    -      -          22.19M static 2009-01-30 21:42 
opensolaris-5    -      -          21.30M static 2009-02-25 20:14 
opensolaris-6    -      -          45.87M static 2009-04-10 18:17 
opensolaris-7    -      -          37.83M static 2009-06-01 20:51 
opensolaris-8    -      -          19.03M static 2009-07-02 18:55 
opensolaris-9    -      -          11.56M static 2009-07-10 07:30 
cjg@brompton:~$ 

I'm going to have to bite the bullet on my home server and swith to OpenSolaris soon as the nevada builds stop. Alas with term time approaching I don't think I will be allowed significant down time for a while.

Sometimes in the course of being a system administrator it is useful to know what file systems are being mounted and when and what mounts fail and why. While you can turn on automounter verbose mode that only answers the question for the automounter.

Dtrace makes answering the general question a snip:

: exdev.eu FSS 24 $; cat mount_monitor.d                         
#!/usr/sbin/dtrace -qs

fbt::domount:entry
/ args[1]->dir /
{
        self->dir = args[1]->flags & 0x8 ? args[1]->dir : 
              copyinstr((intptr_t)args[1]->dir);
}
fbt::domount:return
/ self->dir != 0 /
{
        
        printf("%Y domount ppid %d, %s %s pid %d -> %s", walltimestamp, 
              ppid, execname, self->dir, pid, arg1 == 0 ? "OK" : "failed");
}
fbt::domount:return
/ self->dir != 0 && arg1 == 0/
{
        printf("\n");
        self->dir = 0;
}
fbt::domount:return
/ self->dir != 0 && arg1 != 0/
{
        printf("errno %d\n", arg1);
        self->dir = 0;
}
: exdev.eu FSS 25 $; pfexec /usr/sbin/dtrace -qs  mount_monitor.d
2009 Aug  6 12:57:57 domount ppid 0, sched /share/consoles pid 0 -> OK
2009 Aug  6 12:57:59 domount ppid 0, sched /share/chroot pid 0 -> OK
2009 Aug  6 12:58:00 domount ppid 0, sched /share/newsrc pid 0 -> OK
2009 Aug  6 12:58:00 domount ppid 0, sched /share/build2 pid 0 -> OK
2009 Aug  6 12:58:00 domount ppid 0, sched /share/chris_at_play pid 0 -> OK
2009 Aug  6 12:58:00 domount ppid 0, sched /share/ws_eng pid 0 -> OK
2009 Aug  6 12:58:00 domount ppid 0, sched /share/ws pid 0 -> OK
2009 Aug  6 12:58:03 domount ppid 0, sched /home/tx pid 0 -> OK
2009 Aug  6 12:58:04 domount ppid 0, sched /home/fl pid 0 -> OK
2009 Aug  6 12:58:05 domount ppid 0, sched /home/socal pid 0 -> OK
2009 Aug  6 12:58:07 domount ppid 0, sched /home/bur pid 0 -> OK
2009 Aug  6 12:58:23 domount ppid 0, sched /net/e2big.uk/export/install/docs pid 0 -> OK
2009 Aug  6 12:58:23 domount ppid 0, sched /net/e2big.uk/export/install/browser pid 0 -> OK
2009 Aug  6 12:58:23 domount ppid 0, sched /net/e2big.uk/export/install/cdroms pid 0 -> OK
2009 Aug  6 12:59:45 domount ppid 8929, Xnewt /tmp/.X11-pipe/X6 pid 8935 -> OK

In particular that last line if repeated often can give you a clue to things not being right.

Many databases get backed up by simply stopping the database copying all the data files and then restarting the database. This is fine for things that don't require 24 hour access. However if you are concerned about the time it takes to take the back up then don't do this:

stop_database
cp /data/file1.db .
gzip file1.db
cp /data/file2.db .
gzip file2.db
start_database

Now there are many ways to improve this using ZFS and snapshots being one of the best but if you don't want to go there then at the very least stop doing the “cp”. It is completely pointless. The above should just be:

stop_database
gzip < /data/file1.db > file1.db
gzip < /data/file2.db > file2.db
start_database

You can continue to make it faster by backgrounding those gzips if the system has spare capacity while the back up is running but that is another point. Just stopping those extra copies will make life faster as they are completely unnecessary.

Since writing scsi.d I have been pondering if there should really be a scsi dtrace provider that allows you to do all that scsi.d does and more. Since the push of 6797025 that both removed the main reason for not doing this and also gave impetus to do it as scsi.d needed incompatible changes to use the new return function as the return “probe”.

This work is very much work in progress and may or may not see the light of day due to some other issues around scsi addressing, however I thought I would document how I added a kernel dtrace provider so if you want to you don't have to do so much searching1.

Adding the probes themselves is simplicity itself using the DTRACE_PROBEN() macros. Following the convention I added this macro:

#define	DTRACE_SCSI_2(name, type1, arg1, type2, arg2)			\
	DTRACE_PROBE2(__scsi_##name, type1, arg1, type2, arg2);

to usr/src/uts/common/sys/sdt.h. Then after including <sys/sdt.h> in each file I put this macro in each of the places I wanted my probes:

 	DTRACE_SCSI_2(transport, struct scsi_pkt *, pkt,
 	    struct scsi_address *, P_TO_ADDR(pkt))

The bit that took a while to find was how to turn these into a provider. To do that edit the file “usr/src/uts/common/dtrace/sdt_subr.c” and create the attribute structure2:

 static dtrace_pattr_t scsi_attr = {
 { DTRACE_STABILITY_EVOLVING, DTRACE_STABILITY_EVOLVING, DTRACE_CLASS_ISA },
 { DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_UNKNOWN },
 { DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_UNKNOWN },
 { DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_ISA },
 { DTRACE_STABILITY_EVOLVING, DTRACE_STABILITY_EVOLVING, DTRACE_CLASS_ISA },
 };

and add it to the sdt_providers array:


	{ "scsi", "__scsi_", &scsi_attr, 0 },

than add the probes to the sdt_args array:

	{ "scsi", "transport", 0, 0, "struct scsi_pkt *", "scsi_pktinfo_t *"},
	{ "scsi", "transport", 1, 1, "struct scsi_address *", "scsi_addrinfo_t *"},
	{ "scsi", "complete", 0, 0, "struct scsi_pkt *", "scsi_pktinfo_t *"},
	{ "scsi", "complete", 1, 1, "struct scsi_address *", "scsi_addrinfo_t *"},

Finally you need to create a file containing the definitions of the output structures, scsi_pktinfo_t and scsi_addrinfo_t and define translators for them. That goes into /usr/lib/dtrace and I called mine scsa.d (there is already one called scsi.d).

/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or http://www.opensolaris.org/os/licensing.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 *
 * CDDL HEADER END
 */
/*
 * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
 * Use is subject to license terms.
 */

#pragma D depends_on module scsi
#pragma D depends_on provider scsi

inline char TEST_UNIT_READY = 0x0;
#pragma D binding "1.0" TEST_UNIT_READY
inline char REZERO_UNIT_or_REWIND = 0x0001;
#pragma D binding "1.0" REZERO_UNIT_or_REWIND

inline char SCSI_HBA_ADDR_COMPLEX = 0x0040;
#pragma D binding "1.0" SCSI_HBA_ADDR_COMPLEX

typedef struct scsi_pktinfo {
	caddr_t pkt_ha_private;
	uint_t	pkt_flags;
	int	pkt_time;
	uchar_t *pkt_scbp;
	uchar_t *pkt_cdbp;
	ssize_t pkt_resid;
	uint_t	pkt_state;
	uint_t 	pkt_statistics;
	uchar_t pkt_reason;
	uint_t	pkt_cdblen;
	uint_t	pkt_tgtlen;
	uint_t	pkt_scblen;
} scsi_pktinfo_t;

#pragma D binding "1.0" translator
translator scsi_pktinfo_t  < struct scsi_pkt *P > {
	pkt_ha_private = P->pkt_ha_private;
	pkt_flags = P->pkt_flags;
	pkt_time = P->pkt_time;
	pkt_scbp = P->pkt_scbp;
	pkt_cdbp = P->pkt_cdbp;
	pkt_resid = P->pkt_resid;
	pkt_state = P->pkt_state;
	pkt_statistics = P->pkt_statistics;
	pkt_reason = P->pkt_reason;
	pkt_cdblen = P->pkt_cdblen;
	pkt_tgtlen = P->pkt_tgtlen;
	pkt_scblen = P->pkt_scblen;
};

typedef struct scsi_addrinfo {
	struct scsi_hba_tran	*a_hba_tran;
	ushort_t a_target;	/* ua target */
	uchar_t	 a_lun;		/* ua lun on target */
	struct scsi_device *a_sd;
} scsi_addrinfo_t;

#pragma D binding "1.0" translator
translator scsi_addrinfo_t  < struct scsi_address *A > {
	a_hba_tran = A->a_hba_tran;
	a_target = !(A->a_hba_tran->tran_hba_flags & SCSI_HBA_ADDR_COMPLEX) ?
		0 : A->a.spi.a_target;
	a_lun = !(A->a_hba_tran->tran_hba_flags & SCSI_HBA_ADDR_COMPLEX) ?
		0 : A->a.spi.a_lun;
	a_sd = (A->a_hba_tran->tran_hba_flags & SCSI_HBA_ADDR_COMPLEX) ?
		A->a.a_sd : 0;
};

again this is just enough to get going so I can see and use the probes:

jack@v4u-2500b-gmp03:~$ pfexec dtrace -P scsi -l
   ID   PROVIDER            MODULE                          FUNCTION NAME
 1303       scsi              scsi                    scsi_transport transport
 1313       scsi              scsi                 scsi_hba_pkt_comp complete
jack@v4u-2500b-gmp03:~$ 

While this all works well for parallel scsi getting the address of devices on fibre is not clear to me. If you have any suggestions I'm all ears.

1If there is such a document already in existence then please add a comment. I will just wish I could have found it.

2These may not be the right attributes but gets me to the point it compiles and can be used in a PoC.

Seeing Katsumi Inoue blogging about Oracle 10g reporting timestamps using the output from gethrtime() reminded me that I have had on occasion wished I had a log to map hrtime to the current time. As Katsumi points out the output of gethrtime() is not absolutely tied to the current time. So there is no way to take the output from it and tell when in real time the output was generated unless you have some reference point. To make things more complex the output is reset each time the system reboots.

For this reason it is useful to keep a file that contains a history of the hrtime and the real time so that any logs can be retrospectively coerced back into a readable format.

There are lots of ways to do this but since on this blog we seem to be in Dtrace mode here is how using dtrace

pfexec /usr/sbin/dtrace -o /var/log/hrtime.log -qn 'BEGIN,tick-1hour,END {
printf("%d:%d.%9.9d:%Y\n",
        timestamp, walltimestamp/1000000000,
        walltimestamp%1000000000, walltimestamp);
}'


Then you get a nice file that contains three columns. The hrtime, the time in seconds since January 1st 1970 and a human readable representation of the time in the current timezone:

: s4u-10-gmp03.eu TS 39 $; cat /var/log/hrtime.log    
5638545510919736:1248443226.350000625:2009 Jul 24 14:47:06
5642145449325180:1248446826.279995332:2009 Jul 24 15:47:06

I have to confess however that using Dtrace for this does not feel right, not least as you need to be root for this to be reliable and also the C code is trivial to write, compile and run from cron and send the output to syslog:

: exdev.eu FSS 39 $; cat  ./gethrtime_base.c
#include <sys/time.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
	hrtime_t hrt = gethrtime();
	struct timeval tv;
	gettimeofday(&tv, NULL);

	printf("%lld:%d.%6.6d:%s", hrt, tv.tv_sec, tv.tv_usec,
			ctime(&tv.tv_sec));
}
: exdev.eu FSS 40 $; make ./gethrtime_base
cc    -o gethrtime_base gethrtime_base.c 
: exdev.eu FSS 41 $;  ./gethrtime_base
11013365852133078:1248444379.163215:Fri Jul 24 15:06:19 2009
: exdev.eu FSS 42 $; 
./gethrtime_base | logger -p daemon.notice -t hrtime
: exdev.eu FSS 43 $;  tail -10 /var/adm/messages | grep hrtime
Jul 24 15:32:33 exdev hrtime: [ID 702911 daemon.notice] 11014939896174861:1248445953.109855:Fri Jul 24 15:32:33 2009
Jul 24 16:09:21 exdev hrtime: [ID 702911 daemon.notice] 11017148054584749:1248448161.131675:Fri Jul 24 16:09:21 2009
: exdev.eu FSS 50 $; 

As I cycled home I realised there was one more thing I could do on the exploring the limits of threads and processes on Solaris. That would be the highest load average ever. Modifying the thread creator program to not have each thread sleep once started but instead wait until all the threads were set up and then go into an infinite compute loop that should get me the highest load average possible on a system or so you would think.

With 784001 threads the load stabilised at:

10:16am  up 18:07,  2 users,  load average: 22114.50, 22022.68, 21245.781

Which was somewhat disappointing. However an earlier run with just 780,000 threads managed to peak the load at 1,784,593 while it was exiting:

 7:44am  up 15:35,  2 users,  load average: 1724593.79, 477392.80, 188985.10

I' still pondering how 780000 thread can result in a load average of more than 1 million.

After the surprise interest in the maximum number of processes on a system it seems rude not to try and see how many threads I can squeeze into a single process while I have access to a system where physical memory will not be the limiting factor. The expectation is that this will closely match the number of processes as each thread will have an LWP in the kernel which will in turn consume the segkp.

A slight modification to the forker program:

: exdev.eu FSS 62 $; cat thr_creater.c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include <thread.h>

int
main(int argc, char **argv)
{
        pid_t pid;
        int count=0;
        while(count < (argc != 2 ? 100 : atoi(argv[1])) &&
            (pid = thr_create(NULL, 0, (void * (*)(void *))pause,
            NULL, THR_DETACHED, NULL)) != -1) {
                if (pid == 0 ) {
                        /* Success, ) */
                        if (count % 1000 == 0)
                                printf("%d\n", count);
                        count++;
                }
        }
        if (pid < 0)
                perror("fork");
        printf("%d\n", count);
        pause();
}

and this time it has to be built as a 64 bit program:

# make "CFLAGS=-m64 -mt" thr_creater
#

Here is how it went:

$; ./thr_creater 1000000        
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000
12000
13000
14000
15000
.....
782000
783000
784000

Here things have stopped and for some bizarre reason attaching a debugger to see what is going on does not seem to be a good idea. I had prstat running in another window and it reported:


  2336 cg13442  7158M 7157M cpu73    0    0   1:42:59 1.6% thr_creater/784970

Which is just a few more threads than I got processes (784956) when running in multi user. However at this point the system is pretty much a warm brick as if I exit any process thr_creater hoovers up the process so I can create no more. Fortunately I had realized this would happen and had some sleep(1) processes running so I could pause the thr_creater and then kill one of the sleeps to allow me to run a command:

$; ps -o pid,vsz,rss,nlwp,comm -p 2336
   PID  VSZ  RSS NLWP COMMAND
  2336 7329704 7329248 784972 ./thr_creater

as you can see it managed to get another two threads created since the prstat exited.

I've just straightened the rear wheel on the bike of a friend of my daughter and while I was at it checked over the rest of the bike. This leads to my top tip for parents of children who cycle (which IMO should be every parent of an able bodied child).

Before you start worrying about whether your child is or is not wearing a helmet1 make sure that the brakes on their bike work. Not just when the bike is new but regularly.

1See http://cyclehelmets.org for whether you should worry about that at all.

If you have recently come into possession of a Laptop onto which to load Solaris then here are my top tips:

  1. Install OpenSolaris. At the time of writing the release is 2009.06, install that, parts of this advice may become obsolete with later releases. Do not install Solaris 10 or even worse Nevada. You should download the live CD and burn it onto a disk boot that and let it install but before you start the install read the next tip.

  2. Before you start the install open a terminal so that you can turn on compression on the root pool once it it created. You have to keep running “zpool list” until you see the pool is created and then run (pfexec zfs set compression=on rpool). You may think that disk is big but after a few months you will be needing every block you can get. Also laptop drives are so slow that compression will probably make things faster.

  3. Before you do anything after installation take a snapshot of the system so you can always go back (pfexec beadm create opensolaris@initialinstall). I really mean this.

  4. Add the extras repository. It contains virtualbox, the flash plugin for firefox, true type fonts and more. All you need is a sun online account. See https://pkg.sun.com/register/ and http://blogs.sun.com/chrisg/entry/installing_support_certificates_in_opensolaris

  5. Decide whether you want to use the development or support repository. If in doubt choose the supported one. Sun employees get access to the support repository. Customers need to get a support contract. (http://www.opensolaris.com/learn/subscriptions/). Then update to the latest bigs (pfexec pkg image-update).

  6. Add any extra packages you need. Since I am now writing this retrospectively there may be things missing. My starting list is:

    • OpenOffice (pfexec pkg install openoffice)

    • SunStudio (pfexec pkg install sunstudioexpress)

    • Netbeans (pfexec pkg install netbeans)

    • Flash (pkfexec pkg install flash)

    • Virtualbox (pfexec pkg install virtualbox)

    • TrueType fonts (pfxec pkg install ttf-fonts-core)

  7. If you are a Sun Employee install the punchin packages so you can access SWAN. I actually rarely use this as I have a Solaris 10 virtualbox image that I use for punchin so I can be both on and off SWAN at the same time but it is good to have the option.

  8. Add you keys to firefox so that you can browse the extras and support repositories from firefox. See http://wikis.sun.com/display/OpenSolarisInfo200906/How+to+Browse+the+Support+and+Extra+Repositories.

  9. Go to Fluendo and get and install the free mp3 decoder. They also sell a complete and legal set of decoders for the major video formats, I have them and have been very happy with them. They allow me to view the videos I have cycling events.

  10. Go to Adobe and get acroread. I live in hope that at some point this will be in a repository either at Sun or one Adobe runs so that it can be installed using the standard pkg commands but until then do it by hand.

Enjoy.

This week we had a customer claiming that they were unable to create more then 60,000 processes. This turned out to be due to them tuning max_nproc, maxuprc and maxpid but not setting segkpsize so the system would run out of “memory” before it ran into the resource limits for process.

Tuning segkpsize to 8G resolved it but I just had to see how many processes I could get running on an M8000.

Using these settings in /etc/system:

set segkpsize=0x300000
set pidmax=999999
set maxuprc=999990
set max_nprocs=999999

and a simple forker program:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
        pid_t pid;
        int count=0;
        while(count < argc == 2 ? 100 : atoi(argv[1]) &&
            (pid = fork()) != -1) {
                if (pid != 0 ) {
                        /* Parent */
                        if (count % 1000 == 0)
                                printf("%d\n", count);
                        count++;
                } else {
                        pause();
                        exit(0);
                }
        }
        if (pid < 0)
                perror("fork");
        printf("%d\n", count);
}

I was slightly disappointed at the result:


$ ./forker 100000
1000
2000
3000
.....
782000
783000
784000
fork: Resource temporarily unavailable
784956
$

Only 784956 processes, plus the ones already running when the system booted. Trying to count them with ps obviously fails but mdb gives the real count.

# ps -e| wc
ksh: cannot fork: too many processes
# 
# echo nproc::print -d | mdb -k  
0t785025
# 

Someone must have managed to get more.

In case anyone wanted to cycle from Farnborough Main railway station to the Sun Microsystems offices at Guillemont Park here is the route I take.


View Larger Map

I've been working with a customer to try and find a memory “leak” in their application. Many things have been tried, libumem, and the mdb ::findleaks command all with no success.

So I was, as I am sure others before me have, pondering if you could use dtrace to do this. Well I think you can. I have a script that puts probes into malloc et al and counts how often they are called by this thread and when they are freed often free is called.

Then in the entry probe of the target application note away how many calls there have been to the allocators and how many to free and with a bit of care realloc. Then in the return probe compare the number of calls to allocate and free with the saved values and aggregate the results. The principle is that you find the routines that are resulting in allocations that they don't clear up. This should give you a list of functions that are possible leakers which you can then investigate1.

Using the same technique I for getting dtrace to “follow fork” that I described here I ran this up on diskomizer, a program that I understand well and I'm reasonably sure does not have systemic memory leaks. The dtrace script reports three sets of results.

  1. A count of how many times each routine and it's descendents have called a memory allocator.

  2. A count of how many times each routine and it's descendents have called free or realloc with a non NULL pointer as the first argument.

  3. The difference between the two numbers above.

Then with a little bit of nawk to remove all the functions for which the counts are zero gives:

# /usr/sbin/dtrace -Z -wD TARGET_OBJ=diskomizer2 -o /tmp/out-us \
	-s /tmp/followfork.d \
	-Cs /tmp/allocated.d -c \
         "/opt/SUNWstc-diskomizer/bin/sparcv9/diskomizer -f /devs -f background \
          -o background=0 -o SECONDS_TO_RUN=1800"
dtrace: failed to compile script /tmp/allocated.d: line 20: failed to create entry probe for 'realloc': No such process
dtrace: buffer size lowered to 25m
dtrace: buffer size lowered to 25m
dtrace: buffer size lowered to 25m
dtrace: buffer size lowered to 25m
 
# nawk '$1 != 0 { print  $0 }' < /tmp/out.3081
allocations
           1 diskomizer`do_dev_control
           1 diskomizer`set_dev_state
           1 diskomizer`set_state
           3 diskomizer`report_exit_reason
           6 diskomizer`alloc_time_str
           6 diskomizer`alloc_time_str_fmt
           6 diskomizer`update_aio_read_stats
           7 diskomizer`cancel_all_io
           9 diskomizer`update_aio_write_stats
          13 diskomizer`cleanup
          15 diskomizer`update_aio_time_stats
          15 diskomizer`update_time_stats
          80 diskomizer`my_calloc
         240 diskomizer`init_read
         318 diskomizer`do_restart_stopped_devices
         318 diskomizer`start_io
         449 diskomizer`handle_write
         606 diskomizer`do_new_write
        2125 diskomizer`handle_read_then_write
        2561 diskomizer`init_buf
        2561 diskomizer`set_io_len
       58491 diskomizer`handle_read
       66255 diskomizer`handle_write_then_read
      124888 diskomizer`init_read_buf
      124897 diskomizer`do_new_read
      127460 diskomizer`expect_signal
freecount
           1 diskomizer`expect_signal
           3 diskomizer`report_exit_reason
           4 diskomizer`close_and_free_paths
           6 diskomizer`update_aio_read_stats
           9 diskomizer`update_aio_write_stats
          11 diskomizer`cancel_all_io
          15 diskomizer`update_aio_time_stats
          15 diskomizer`update_time_stats
          17 diskomizer`cleanup
         160 diskomizer`init_read
         318 diskomizer`do_restart_stopped_devices
         318 diskomizer`start_io
         442 diskomizer`handle_write
         599 diskomizer`do_new_write
        2125 diskomizer`handle_read_then_write
        2560 diskomizer`init_buf
        2560 diskomizer`set_io_len
       58491 diskomizer`handle_read
       66246 diskomizer`handle_write_then_read
      124888 diskomizer`do_new_read
      124888 diskomizer`init_read_buf
      127448 diskomizer`cancel_expected_signal
mismatch_count
     -127448 diskomizer`cancel_expected_signal
          -4 diskomizer`cancel_all_io
          -4 diskomizer`cleanup
          -4 diskomizer`close_and_free_paths
           1 diskomizer`do_dev_control
           1 diskomizer`init_buf
           1 diskomizer`set_dev_state
           1 diskomizer`set_io_len
           1 diskomizer`set_state
           6 diskomizer`alloc_time_str
           6 diskomizer`alloc_time_str_fmt
           7 diskomizer`do_new_write
           7 diskomizer`handle_write
           9 diskomizer`do_new_read
           9 diskomizer`handle_write_then_read
          80 diskomizer`init_read
          80 diskomizer`my_calloc
      127459 diskomizer`expect_signal

#

From the above you can see that there are two functions that create and free the majority of the allocations and the allocations almost match each other, which is expected as they are effectively constructor and destructor for each other. The small mismatch is not unexpected in this context.

However it is the vast number of functions that are not listed at all as they and their children make no calls to the memory allocator or have exactly matching allocation and free that are important here. Those are the functions that we have just ruled out.

From here it is easy now to drill down on the functions that are interesting you, ie the ones where there are unbalanced allocations.


I've uploaded the files allocated.d and followfork.d so you can see the details. If you find it useful then let me know.

1Unfortunately the list is longer than you want as on SPARC it includes any functions that don't have their own stack frame due to the way dtrace calculates ustackdepth, which the script makes use of.

2The script only probes particular objects, in this case the main diskomizer binary, but you can limit it to a particular library or even a particular set of entry points based on name if you edit the script.

There is a ongoing request to have follow fork functionality for the dtrace pid provider but so far no one has stood upto the plate for that RFE. In the mean time my best workaround for this is this:

cjg@brompton:~/lang/d$ cat followfork.d
proc:::start
/ppid == $target/
{
	stop();
	printf("fork %d\n", pid);
	system("dtrace -qs child.d -p %d", pid);
}
cjg@brompton:~/lang/d$ cat child.d
pid$target::malloc:entry
{
	printf("%d %s:%s %d\n", pid, probefunc, probename, ustackdepth)
}
cjg@brompton:~/lang/d$ pfexec /usr/sbin/dtrace -qws followfork.d -s child.d -p 26758
26758 malloc:entry 22
26758 malloc:entry 15
26758 malloc:entry 18
26758 malloc:entry 18
26758 malloc:entry 18
fork 27548
27548 malloc:entry 7
27548 malloc:entry 7
27548 malloc:entry 18
27548 malloc:entry 16
27548 malloc:entry 18

Clearly you can have the child script do what ever you wish.

Better solutions are welcome!