All ISV Engineers' Blogs
I'm a sucker for a good multimedia package.  I love playing with software that does movie playback or encoding, audio playback, transcoding...you name it, I love playing with it.  So naturally I was pretty psyched to see that somebody has contributed the Fluidsynth software synthesizer package to OpenSolaris.  Here is what I did to install it and run it in order to prove that the package actually works.  (Here's a spoiler for you: it does work.)

Fluidsynth was put into the OpenSolaris "pending" repository, the idea being that people will check it out and, if it's deemed to be of reasonable quality, it'll get voted into the contributed software repository.  The "contrib" repo is where good packages go after they've been tested in the "pending" repo staging area.  We haven't voted on Fluidsynth yet as of the time I write this; I'm hoping that will change after people read what I've done here.

(note: for my testing, I am running OpenSolaris build snv_111b)

First things first: set up OpenSolaris to find packages from the Source Juicer "pending" repository with these two steps:
  1. type "pfexec pkg set-publisher -O http://jucr.opensolaris.org/pending jucr-pending"
  2. type "pfexec pkg refresh"
Next, I launched the Package Manager application and chose the "jucr-pending" repository from the pop-up menu on the right side of that application's user interface.  After Package Manager thought about its new catalog of apps for a moment, I saw a list of hundreds of packages available to me.  I typed "fluid" in the search field and found the "fluidsynth" package.  I selected it, saw that it lists several other packages as dependencies (i.e., those other packages had better be installed for Fluidsynth to work correctly), and installed it.  Nicely, the Package Manager installed Fluidsynth and its dependencies for me.

At this point I ran into a little snag, which is not the fault of OpenSolaris: I'm testing this within OpenSolaris, but OpenSolaris is running as a VirtualBox guest on my MacBook Pro.  It turns out that audio support in OpenSolaris under VirtualBox needs a little bit of work to get going.  It's easy enough, though, and took less than five minutes.  The instructions on how to make it work are in this blog post, which is clearly written.  I also had fewer problems than were mentioned there; I didn't have to reboot or uninstall the "SUNWaudiohd" package.  Good times!

At this point, my big challenge was where to find two files to test Fluidsynth: a sound font (basically, a description of instruments that Fluidsynth uses to play music), and some music (a MIDI file).  I did a Google search to find a nice Yamaha DX-7 electric piano sound font (I happened to find it here), and it was easy to find any number of .mid files to play.

To test, I typed "fluidsynth <name-of-sound-font.SF> <name-of-MIDI-file.mid>".  That worked just fine: I heard the music loud and clear, although Fluidsynth complained that there is no /dev/midi.  I believe it is expecting me to connect a MIDI keyboard to the computer and start playing, which is not necessary for this test.  Also, Fluidsynth had to re-map some of the MIDI file's preferred instruments to what was available in the sound font's instrument library.  Not a problem, though.

Just for fun, I tried turning off the built-in chorus and reverb effects, and I boosted the amplitude to see if these features worked:

"fluidsynth --chorus no --reverb no --gain 0.8 <name-of-sound-font.SF> <name-of-MIDI-file.mid>".  I also tried changing these parameters individually to isolate the effects.  Again, this worked fine.

As far as I can tell, Fluidsynth works perfectly well on OpenSolaris.  It should make a fine addition to the contrib repo.



I needed to find a way to get the physical (MAC) address using C. From what I could gather from searching opensolaris.org, there are two methods for retrieving it: libdlpi and arp. libdlpi is the more elegant solution as it requires a simple call to dlpi_get_physaddr(). This is how ifconfig prints your network interface's MAC address. Unfortunately, libdlpi calls are only permitted as root.

As explained by James Carlson:

The reason it was like this was historical: getting the MAC address in
ifconfig meant opening up the DLPI node and talking to the driver. As
the drivers didn't have discrete privileges for each operation, and
you had to be almighty root to touch them, 'ifconfig' didn't show the
MAC address when not privileged.
*whatever*

The second solution is to use arp. In Solaris you can determine the physical address by looking at the arp tables directly (`arp -a | grep <INTERFACE>` or `netstat -p | grep <INTERFACE>`). With C, this can be done by using the if sockets and arp libraries.

I wrote up a solution called "getmac" using both methods. You can gather it here.

  • Directions
    $ wget http://www.pauliesworld.org/project/getmac.c
    $ gcc getmac.c -o getmac -lsocket -ldlpi
    $ ./getmac <interface_name>
    arp:	ffffffffffff
    dlpi:	dlpi failure, are you root?
    $ pfexec ./getmac <interface_name>
    arp:	ffffffffffff
    dlpi:	ffffffffffff
    
Remember to use pfexec for the libdlpi method.
Thanks to those who attended my talk on Faban at FOSDEM on Saturday. If you want to know more, post a comment here and we can work out how best to talk. In the meantime, here are my slides from the talk.
As Oracle Technology Network (OTN) welcomes Sun's Partners, developers can find answers to some of their questions on this Developer Community FAQ page.

When I build source code and want the best performance, I use the Sun Studio compilers, especially on Solaris and SPARC.

Sun Studio offers a unique set of optimization features tailored to the processor instruction set that help me squeeze the best performance out of C, C++ or Fortran code. Yet these options are so numerous that it can be a bit daunting to look into them.

If you are in a rush, you can use the -fast option. What it really does is trigger a set of other options for maximum runtime performance. These options can be listed with:

$ CC -fast -dryrun
### command line files and options (expanded):
### -xO5 -xarch=sparc -xcache=8/16/4:3072/64/12 -xchip=ultraT1 ... -dryrun

Yet -fast has its own drawbacks. First, the options it triggers might change from one compiler release to another. Also, the values for -xarch, -xcache and -xchip specify the processor for which to optimize, and -fast picks these values based on the processor on which the compiler runs, which can differ from the processor on which the code will eventually be executed. This is why I usually stay away from -fast.

Instead, here is a basic set of rules to easily decide on optimization options.

First, if some binary code already exists, I run a quick sanity check to see which options were used to build it. On a non-stripped executable, library, or object file, I run the following commands:

$ dump -C Bar.o      // for C++ code
...
<122> .../tmanfe/; /opt/SUNWspro/bin/CC -G -xtarget=native -compat=4 -xO4 Bar.cpp

$ dwarfdump getpagesize      // for C code
...
DW_AT_SUN_command_line /opt/SUNWspro/bin/cc -c -xarch=sse2 -m32 -xO3 +w getpagesize.c


Make sure the -g option is not present. This option tells the compiler to compile for debugging, which in turn disables some optimizations. Also verify that -xOn (with n=[1|2|3|4|5]) is present: this turns on generic optimization. -xO1 and -xO2 are conservative. -xO5 is aggressive and may lead to performance degradation, so I don't recommend using it for a complete application. Limit its usage to specific portions of code that are known to be heavily used and to benefit from optimization. I usually pick -xO3 as a basic level of optimization.
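The sanity check described above can be automated once the recorded command line has been pulled out of the dump or dwarfdump output. The following is only a minimal sketch: the function name check_compile_flags is my own, and it looks only for an exact -g token and an -xO1 to -xO5 token, nothing more.

```python
import re


def check_compile_flags(command_line):
    """Report whether -g is present and which -xOn level was used."""
    tokens = command_line.split()
    has_debug = "-g" in tokens          # exact -g token only
    level = None
    for token in tokens:
        match = re.fullmatch(r"-xO([1-5])", token)
        if match:
            level = int(match.group(1))  # the last -xOn on the line wins
    return {"debug": has_debug, "opt_level": level}


# Command line taken from the dwarfdump example in this post.
recorded = "/opt/SUNWspro/bin/cc -c -xarch=sse2 -m32 -xO3 +w getpagesize.c"
print(check_compile_flags(recorded))
```

A result with debug set to True or opt_level set to None would be the signal to revisit the build flags.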

Use the -xarch and -xchip options specific to the targeted runtime processor. -xarch specifies the instruction set to be used while -xchip specifies the scheduling - or the ordering of the instructions.

The best value for -xarch can be found by running CC -xtarget=native -dryrun on the runtime platform. Here is a code snippet that does the work for you:

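#!/bin/bash
# Print the -xarch value that the compiler selects for the native platform.
for flag in `CC -xtarget=native -dryrun 2>&1 | grep xarch`
do
  if echo $flag | grep xarch >/dev/null ; then
    target=`echo $flag | grep xarch`
  fi
done
length=${#target}
# Skip the 7 characters of "-xarch=" and print the rest.
echo ${target:7:$length}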

The right value for -xchip is found the same way:

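#!/bin/bash
# Print the -xchip value that the compiler selects for the native platform.
for flag in `CC -xtarget=native -dryrun 2>&1 | grep xchip`
do
  if echo $flag | grep xchip >/dev/null ; then
    target=`echo $flag | grep xchip`
  fi
done
length=${#target}
# Skip the 7 characters of "-xchip=" and print the rest.
echo ${target:7:$length}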

The two code snippets above can be used to dynamically set up your compiler flags when generating Makefiles, but again, make sure to run them on the processor targeted for runtime.
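If the Makefile generator is not a shell script, the same extraction can be done on the captured -dryrun text. This is a rough sketch rather than a tested build tool: extract_flag is a name I made up, and the sample string is the -fast expansion shown earlier in this post with the elided options left out.

```python
def extract_flag(dryrun_output, name):
    """Return the value of -<name>=<value> from compiler -dryrun output."""
    prefix = "-" + name + "="
    value = None
    for token in dryrun_output.split():
        if token.startswith(prefix):
            value = token[len(prefix):]  # keep the last occurrence
    return value


sample = "### -xO5 -xarch=sparc -xcache=8/16/4:3072/64/12 -xchip=ultraT1 -dryrun"
print(extract_flag(sample, "xarch"))
print(extract_flag(sample, "xchip"))
```

On a real system the input would be the combined stdout and stderr of CC -xtarget=native -dryrun, captured on the machine targeted for runtime.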

This type of generic optimization usually brings a performance gain of 10 to 20%, and it also sets the baseline for more sophisticated optimizations that focus on the portions of code that are used the most during execution. These portions of code can be identified by running the Sun Studio Collector and Performance Analyzer on your code: no need to instrument your binary, no need to recompile for profiling. Just run the collect utility on the optimized binary you generated. Simple and easy!

libupnp 1.6.6 is a little tricky to compile on Solaris. After downloading the source from SourceForge, extract the bzip2 archive and cd to the libupnp-1.6.6 directory, then do the following.
vi upnp/src/api/upnpapi.c
On line 59, there is a bug. Change
#if defined(_sun)
to
#if defined(__sun)
The change is adding an extra underscore. Otherwise sockio.h will not be recognized properly and you will get some missing networking variables when you try to build. After that is taken care of...
$ ./configure CFLAGS="-DSPARC_SOLARIS" --disable-samples
$ gmake
# gmake install

Case in point: given a PeopleSoft Data Mover exported data file (db or dat file), how do you extract the DDL statements [from that data file] that get executed as part of Data Mover's data import process?

Here is a quick way to do it:

  1. Insert the SET EXTRACT statements in the Data Mover script (DMS) before the IMPORT .. statement.

    eg.,
    
    % cat /tmp/retrieveddl.dms
    
    ..
    SET EXTRACT OUTPUT /tmp/ddl_stmts.log;
    SET EXTRACT DDL;
    ..
    
    IMPORT *;
    
    

    The SET EXTRACT OUTPUT statement must appear before any other SET EXTRACT statements.

  2. Run the Data Mover utility with the modified DMS script as an argument.

    eg., OS: Solaris

    
    % psdmtx -CT ORACLE -CD NAP11 -CO NAP11 -CP NAP11 -CI people -CW peop1e -FP /tmp/retrieveddl.dms
    
    

    On successful completion, you will find the DDL statements in the /tmp/ddl_stmts.log file.

Check chapter #2 "Using PeopleSoft Data Mover" in Enterprise PeopleTools x.xx PeopleBook: Data Management document for more ideas.

The novelties of Java EE 6 are improved ease of development, remodeled web services, and platform simplicity with the introduction of profiles. GlassFish V3 is the first fully compliant Java EE 6 Enterprise Server.

Evolution of Java Platform Enterprise Edition since the year 1998 resulted in a more robust, scalable, highly performing, and extensible platform. GlassFish V3 Enterprise Server is built on modular OSGi Standard, with its MicroKernel HK2 reducing the startup time while providing the flexibility to extend the services at Runtime. GlassFish is now also embeddable.

 

Sun GlassFish V3 Enterprise Server,

My Presentation : A Technical Overview.


NFS Tuning for HPC Streaming Applications


Overview:

I was recently working in a lab environment with the goal of setting up a Solaris 10 Update 8 (s10u8) NFS server application that would be able to stream data to a small number of s10u8 NFS clients with the highest possible throughput for a High Performance Computing (HPC) application.  The workload varied over time: at some points the workload was read-intensive while at other times it was write-intensive.  Regardless of read or write, the application's I/O pattern was always "large block sequential I/O", which was easily modeled with a "dd" stream from one or several clients.

Due to business considerations, 10 gigabit Ethernet (10GbE) was chosen for the network infrastructure.  It was necessary not only to install appropriate server, network and I/O hardware, but also to tune each subsystem.  I wish it were more obvious whether a gigabit implies 1024^3 or 1000^3 bits.  In either case, one might naively assume that the connection should be able to reach NFS speeds of 1.25 gigabytes per second; my goal, however, was to achieve NFS end-to-end throughput close to 1.0 gigabytes per second.
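To make the naive 1.25 figure concrete, here is a small back-of-the-envelope calculation. It assumes the decimal convention normally used for network line rates (10 x 10^9 bits per second) and ignores all Ethernet, IP, TCP and NFS protocol overhead, so treat it as an upper bound rather than an expected result.

```python
# Assumption: 10GbE line rate is 10 * 10^9 bits per second (decimal).
LINE_RATE_BITS_PER_SEC = 10 * 10**9

bytes_per_sec = LINE_RATE_BITS_PER_SEC / 8

# The same byte rate expressed with decimal and binary "giga" prefixes.
gb_decimal = bytes_per_sec / 1000**3
gib_binary = bytes_per_sec / 1024**3

print("decimal GB/sec: %.3f" % gb_decimal)
print("binary GiB/sec: %.3f" % gib_binary)
```

Whichever unit is used, a target of roughly 1.0 gigabytes per second of NFS payload leaves some headroom below the raw line rate.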

Hardware:

A network with the following components worked well:

  • Sun Fire X4270 servers
    • Intel Xeon "Nehalem" X5570 CPU's @ 2.93 GHz
    • Solaris 10 Update 8
  • a network, consisting of:
    • Force10 S2410 Data Center 10 GbE Switch
    • 10-Gigabit Ethernet PCI Express Ethernet Controller, either:
      • 375-3586 (aka Option X1106A-Z) with the Intel 82598EB chip (ixgbe driver), or
      • 501-7283 (aka Option X1027A-Z) with the "Neptune" chip (nxge driver)
  • an I/O system on the NFS server:
    • 375-3487 Option SG-XPCIE8SAS-E-Z SAS/SATA HBA's
    • Storage, either
      • Sun Storage J4400 Arrays, or
      • Sun Storage F5100 Flash Array

This configuration, based on recent hardware, was able to reach close to full line speed performance.  In contrast, a slightly older server with Intel Xeon "Harpertown" E5440 CPU's @ 2.83 GHz was not able to reach full line speed.

The application's I/O pattern is large block sequential and known to be 4K aligned, so the Sun Storage F5100 Flash Array is a good fit.  I would not recommend this device for general purpose NFS storage.

Network

When the hardware was initially installed, rather than immediately measuring NFS performance, the individual network and IO subsystems were tested.  To measure the network performance, I used netperf. I found that the "out of the box" s10u8 performance was not acceptable; it seems that the Solaris "out of the box" settings are better fitted to a web server with a large number of potentially slow (WAN) connections.  To get the network humming for large block LAN workload I made several changes:

a) The TCP Sliding Window settings in /etc/default/inetinit

ndd -set /dev/tcp tcp_xmit_hiwat  1048576
ndd -set /dev/tcp tcp_recv_hiwat  1048576
ndd -set /dev/tcp tcp_max_buf    16777216
ndd -set /dev/tcp tcp_cwnd_max    1048576


b) The network interface card "NIC" settings, depending on the card:
/kernel/drv/ixgbe.conf
default_mtu=8150;
tx_copy_threshold=1024;


/platform/i86pc/kernel/drv/nxge.conf
accept_jumbo = 1;
soft-lso-enable = 1;
rxdma-intr-time=1;
rxdma-intr-pkts=8;

/etc/system
* From http://www.solarisinternals.com/wiki/index.php/Networks
* For ixgbe or nxge
set ddi_msix_alloc_limit=8
* For nxge
set nxge:nxge_bcopy_thresh=1024
set pcplusmp:apic_multi_msi_max=8
set pcplusmp:apic_msix_max=8
set pcplusmp:apic_intr_policy=1
set nxge:nxge_msi_enable=2

c) Some seasoning :-)

* Added to /etc/system on S10U8 x64 systems based on
*   http://www.solarisinternals.com/wiki/index.php/Networks (Nov 18, 2009)
* For few TCP connections
set ip:tcp_squeue_wput=1
* Bursty
set hires_tick=1

d) Make sure that you are using jumbo frames. I used mtu 8150, which I know made both the NICs and the switch happy.  Maybe I should have tried a slightly more aggressive setting of 9000.  

/etc/hostname.nxge0
192.168.2.42 mtu 8150

/etc/hostname.ixgbe0
192.168.1.44 mtu 8150
e) Verifying the MTU with ping and snoop.  Some ping implementations include a flag to allow the user to set the "do not fragment" (DNF) flag, which is very useful for verifying that the MTU is properly set.  With the ping implementation that ships with s10u8, you can't set the DNF flag.  To verify the MTU, use snoop to see if large pings are fragmented:

server# snoop -r -d nxge0 192.168.1.43
Using device nxge0 (promiscuous mode)

// Example 1: An 8000 byte packet is not fragmented
client% ping -s 192.168.1.43 8000 1
PING 192.168.1.43: 8000 data bytes
8008 bytes from server10G-43 (192.168.1.43): icmp_seq=0. time=0.370 ms


192.168.1.42 -> 192.168.1.43 ICMP Echo request (ID: 14797 Sequence number: 0)
192.168.1.43 -> 192.168.1.42 ICMP Echo reply (ID: 14797 Sequence number: 0)

// Example 2: A 9000 byte ping is broken into 2 packets in both directions
client% ping -s 192.168.1.43 9000 1
PING 192.168.1.43: 9000 data bytes
9008 bytes from server10G-43 (192.168.1.43): icmp_seq=0. time=0.383 ms


192.168.1.42 -> 192.168.1.43 ICMP IP fragment ID=32355 Offset=0    MF=1 TOS=0x0 TTL=255
192.168.1.42 -> 192.168.1.43 ICMP IP fragment ID=32355 Offset=8128 MF=0 TOS=0x0 TTL=255
192.168.1.43 -> 192.168.1.42 ICMP IP fragment ID=49788 Offset=0    MF=1 TOS=0x0 TTL=255
192.168.1.43 -> 192.168.1.42 ICMP IP fragment ID=49788 Offset=8128 MF=0 TOS=0x0 TTL=255

// Example 3: A 32000 byte ping is broken into 4 packets in both directions
client% ping -s 192.168.1.43 32000 1
PING 192.168.1.43: 32000 data bytes
32008 bytes from server10G-43 (192.168.1.43): icmp_seq=0. time=0.556 ms


192.168.1.42 -> 192.168.1.43 ICMP IP fragment ID=32356 Offset=0    MF=1 TOS=0x0 TTL=255
192.168.1.42 -> 192.168.1.43 ICMP IP fragment ID=32356 Offset=8128 MF=1 TOS=0x0 TTL=255
192.168.1.42 -> 192.168.1.43 ICMP IP fragment ID=32356 Offset=16256 MF=1 TOS=0x0 TTL=255
192.168.1.42 -> 192.168.1.43 ICMP IP fragment ID=32356 Offset=24384 MF=0 TOS=0x0 TTL=255
192.168.1.43 -> 192.168.1.42 ICMP IP fragment ID=49789 Offset=0    MF=1 TOS=0x0 TTL=255
192.168.1.43 -> 192.168.1.42 ICMP IP fragment ID=49789 Offset=8128 MF=1 TOS=0x0 TTL=255
192.168.1.43 -> 192.168.1.42 ICMP IP fragment ID=49789 Offset=16256 MF=1 TOS=0x0 TTL=255
192.168.1.43 -> 192.168.1.42 ICMP IP fragment ID=49789 Offset=24384 MF=0 TOS=0x0 TTL=255
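
The fragment counts and offsets in the three examples can be predicted from the MTU. The sketch below assumes a 20-byte IPv4 header without options and an 8-byte ICMP header, and relies on the rule that every fragment except the last carries a multiple of 8 bytes; the function name icmp_fragments is mine.

```python
import math


def icmp_fragments(ping_data_bytes, mtu, ip_header=20, icmp_header=8):
    """Return (fragment_count, offsets) for one ICMP echo of the given size."""
    ip_payload = ping_data_bytes + icmp_header
    if ip_payload + ip_header <= mtu:
        return 1, [0]                      # fits in a single packet
    # Each non-final fragment carries a multiple of 8 bytes of IP payload.
    per_fragment = (mtu - ip_header) // 8 * 8
    count = math.ceil(ip_payload / per_fragment)
    return count, [i * per_fragment for i in range(count)]


for size in (8000, 9000, 32000):
    print(size, icmp_fragments(size, mtu=8150))
```

With an MTU of 8150 this gives 8128 bytes per fragment, which matches the Offset values that snoop reports above.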



f) Verification: after network tuning was complete, the network delivered very impressive performance with either the nxge or the ixgbe driver.  The end-to-end measurement reported by netperf, 9.78 Gb/sec, is very close to full line speed and indicates that the switch, network interface cards, drivers and Solaris system call overhead are minimally intrusive.

$ /usr/local/bin/netperf -fg -H 192.168.1.43 -tTCP_STREAM -l60
TCP STREAM TEST from ::ffff:0.0.0.0 (0.0.0.0) port 0 AF_INET to ::ffff:192.168.1.43 (192.168.1.43) port 0 AF_INET

Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^9bits/sec

1048576 1048576 1048576    60.56       9.78

$ /usr/local/bin/netperf -fG -H 192.168.1.43 -tTCP_STREAM -l60
TCP STREAM TEST from ::ffff:0.0.0.0 (0.0.0.0) port 0 AF_INET to ::ffff:192.168.1.43 (192.168.1.43) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    GBytes/sec

1048576 1048576 1048576    60.00       1.15


g) Observability: I found "nicstat" (download link at http://www.solarisinternals.com/wiki/index.php/Nicstat) to be a very valuable tool for observing network performance.  To compare the network performance with the running application against synthetic tests, I found that it was useful to graph the "nicstat" output. (see http://blogs.sun.com/taylor22/entry/graphing_solaris_performance_stats)  You can verify that the Jumbo Frames MTU is working as expected by checking that the average packet payload is big by dividing Bytes-per-sec / Packets-per-sec to get average bytes-per-packet.
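
The bytes-per-packet check is a single division, sketched here for completeness. The sample rates are hypothetical numbers chosen for illustration, not measurements from my lab.

```python
def avg_bytes_per_packet(bytes_per_sec, packets_per_sec):
    """Average payload size per packet; 0.0 when the link is idle."""
    if packets_per_sec == 0:
        return 0.0
    return bytes_per_sec / packets_per_sec


# Hypothetical sample: about 1.1 GB/sec carried by 140,000 packets/sec.
average = avg_bytes_per_packet(1_100_000_000, 140_000)
print("average bytes per packet: %.0f" % average)
```

An average close to the configured MTU suggests that jumbo frames are in use; an average near 1500 or below suggests they are not.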


Network Link Aggregation for load spreading


I initially (naively) hoped to use link aggregation to get a 20GbE NFS stream using 2 X 10GbE ports on the NFS server and 2 X 10GbE ports on the NFS client.  I hoped that packets would be distributed to the link aggregation group in a "per packet round robin" fashion.  What I found is that, regardless of whether the negotiation is based on L2, L3 or L4, LACP will negotiate port pairs based on a source/destination mapping, so that each stream of packets will only use one specific port from a link aggregation group. 

Link aggregation can be useful in spreading multiple streams over ports, but the streams will not necessarily be evenly divided across ports.  The distribution of data over ports in a link aggregation group can be viewed with "nicstat" 
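
The behavior I observed can be illustrated with a toy model in which the port is chosen by hashing the source/destination pair. This is only an illustration of the principle: the CRC32 hash and the pick_port name are my assumptions, not the algorithm used by Solaris or by the Force10 switch, and all client addresses other than 192.168.1.42 are made up.

```python
import zlib


def pick_port(src_ip, dst_ip, n_ports):
    """Map a source/destination pair to one port of an aggregation group."""
    key = ("%s->%s" % (src_ip, dst_ip)).encode("ascii")
    return zlib.crc32(key) % n_ports


server = "192.168.1.43"
clients = ["192.168.1.42", "192.168.1.44", "192.168.1.45", "192.168.1.46"]

load = {}
for client in clients:
    port = pick_port(client, server, n_ports=2)
    load.setdefault(port, []).append(client)

# Every packet of a given client/server pair lands on the same port,
# and nothing forces the clients to split evenly across the ports.
print(load)
```

Because the mapping depends only on the address pair, a single stream can never use more than one port, no matter how many ports are in the group.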

After reviewing the literature, I concluded that it is best to use IPMP for failover, but link aggregation for load spreading.  Link aggregation has finer control for load spreading than IPMP.  Comparing IPMP and Link Aggregation:
  • IPMP can be used to protect against switch/router failure because each NIC can be connected to a different switch and therefore can protect against either NIC or switch failure.
  • With Link Aggregation, all of the ports in the group must be connected to a single switch/router, and that switch/router must support Link Aggregation Control Protocol (LACP), so there is no protection against switch failure.
With the Force10 switch that I tested, I was disappointed that the LACP algorithm was not doing a good job of spreading inbound packets to my hot server.  Again, once the switch mapped a client to one of the ports in the "hot server's link group", it stuck so it was not unusual for several clients to be banging hard on one port while another port was idle.

Multiple subnets for network load spreading

After trying several approaches for load spreading with S10u8, I chose to use multiple subnets.  (If I had been using Nevada 107 or newer, which includes Project Clearview, I might have come to a different conclusion.)  In the end, I decided that the best solution was an old-fashioned approach using a single common management subnet combined with multiple data subnets:

  • All of the machines were able to communicate with each other on a slower "management network"; specifically, Sun Grid Engine jobs were managed on the slower network.
  • The clients were partitioned into a small number of "data subnets".
  • The "hot" NFS data server had multiple NIC's, with each NIC on a separate "data subnet".
  • A limitation of this approach is that clients in one subnet partition only have a low bandwidth connection to clients in different subnet partitions. This was OK for my project.
  • The advantage of manually preallocating the port distribution was that my benchmark was more deterministic.  I did not get overloaded ports in a seemingly random pattern.


Disk IO: Storage tested

Configurations that had sufficient bandwidth for this environment included:

  • A Sun Storage F5100 Flash Array using 80 FMod SSD devices in a ZFS RAID 0 stripe to create a 2TB volume. 
  • A "2 X Sun Storage J4400 Arrays" JBOD configuration with a ZFS RAID 0 stripe
  • A "4 X Sun Storage J4400 Arrays" configuration with a ZFS RAID 1+0 mirrored and striped


Disk I/O: SAS HBA's

The Sun Storage F5100 Flash Array was connected to the Sun Fire X4270 server using 4 PCIe 375-3487 SAS HBAs so that each F5100 domain with 20 FMod SSDs used an independent HBA.  Using 4 SAS HBAs has a 20% better theoretical throughput than using 2 SAS HBAs:

  • Each F5100 with 80 FMod SSD devices has 4 domains with up to 20 FMods per domain
  • Each  375-3487 SAS HBA has
    • two 4x wide SAS ports
    • 8x PCIe
  • Each F5100 domain to SAS HBA, connected with a single 4x wide SAS port, will have a maximum half duplex speed of (3Gb/sec * 4) = 12Gb/Sec =~ 1.2 GB/Sec per F5100 domain
  • PCI Express x8 (half duplex) = 2 GB/sec
  • A full F5100 (4 domains) connected using 2 SAS HBA's would be limited by PCIe to 4.0 GB/Sec
  • A full F5100 (4 domains) connected using 4 SAS HBA's would be limited by SAS to 4.8 GB/Sec.
  • Therefore a full Sun Storage F5100 Flash Array has 20% theoretically better throughput when connected using 4 SAS HBA's rather than 2 SAS HBA's.
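
The arithmetic behind the 20% figure can be written out as a short calculation. It simply restates the numbers from the list above: I take 1.2 GB/sec per 4x wide SAS port and 2 GB/sec per x8 PCIe HBA as given, and assume the array is limited by whichever of the two totals is smaller.

```python
DOMAINS = 4                  # F5100 domains, one 4x wide SAS port each
SAS_GBYTES_PER_DOMAIN = 1.2  # (3 Gb/sec * 4 lanes) =~ 1.2 GB/sec, as above
PCIE_GBYTES_PER_HBA = 2.0    # PCI Express x8, half duplex


def array_limit(hba_count):
    """Theoretical GB/sec for a full F5100 with the given number of HBAs."""
    sas_limit = DOMAINS * SAS_GBYTES_PER_DOMAIN
    pcie_limit = hba_count * PCIE_GBYTES_PER_HBA
    return min(sas_limit, pcie_limit)


two_hbas = array_limit(2)    # limited by PCIe
four_hbas = array_limit(4)   # limited by SAS
gain_percent = (four_hbas / two_hbas - 1) * 100

print("2 HBAs: %.1f GB/sec" % two_hbas)
print("4 HBAs: %.1f GB/sec" % four_hbas)
print("gain:   %.0f%%" % gain_percent)
```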

The "mirrored and striped" configuration using 4 X Sun Storage J4400 Arrays was connected using 3 PCIe 375-3487 SAS HBAs:

SAS Wiring for 4 JBOD trays

Multipathing (MPXIO)

MPXIO was used in the "mirrored and striped" 4 X Sun Storage J4400 Array so that for every disk in the JBOD configuration, I/O could be requested by either of the 2 SAS cards connected to the array.  To eliminate any "single point of failure" I chose to mirror all of the drives in one tray with drives in another tray, so that any tray could be removed without losing data.

The command for creating a ZFS RAID 1+0 "mirrored and striped" volume out of MPXIO devices looks like this:

zpool create -f jbod \
mirror c3t5000C5001586BE93d0 c3t5000C50015876A6Ed0 \
mirror c3t5000C5001586E279d0 c3t5000C5001586E35Ed0 \
mirror c3t5000C5001586DCF2d0 c3t5000C50015825863d0 \
mirror c3t5000C5001584C8F1d0 c3t5000C5001589CEB8d0 \
...
It was a bit tricky to figure out which disks (i.e. "c3t5000C5001586BE93d0") were in which trays.  I ended up writing a surprisingly complicated Ruby script to choose devices to mirror.  This script worked for me.  Your mileage may vary.  Use at your own risk.

#!/usr/bin/env ruby -W0

all=`stmsboot -L`
Device = Struct.new( :path_count, :non_stms_path_array, :target_ports, :expander, :location)
Location = Struct.new( :device_count, :devices )

my_map=Hash.new
all.each_line {
  |s|
  if s =~ /\/dev/ then
    s2=s.split
    if !my_map.has_key?(s2[1]) then
      ## puts "creating device for #{s2[1]}"
      my_map[s2[1]] = Device.new(0,[],[])
    end
    ## puts my_map[s2[1]]
    my_map[s2[1]].path_count += 1
    my_map[s2[1]].non_stms_path_array.push(s2[0])
  else
    puts "no match on #{s}"
  end
}

my_map.each {
  |k,v|
  ##puts "key is #{k}"
  mpath_data=`mpathadm show lu #{k}`
  in_target_section=false
  mpath_data.each_line {
    |line|
    if !in_target_section then
      if line =~ /Target Ports:/ then
        in_target_section=true
      end
      next
    end
    if line =~ /Name:/ then
      my_map[k].target_ports.push(line.split[1])
      ##break
    end
  }
  ##puts "key is #{k} value is #{v}"
  ##puts k v.non_stms_path_array[0],  v.non_stms_path_array[1]
}

location_array=[]
location_map=Hash.new
my_map.each {
  |k,v|
  my_map[k].expander = my_map[k].target_ports[0][0,14]
  my_map[k].location = my_map[k].target_ports[0][14,2].hex % 64
  if !location_map.has_key?(my_map[k].location) then
    puts "creating entry for #{my_map[k].location}"
    location_map[my_map[k].location] = Location.new(0,[])
    location_array.push(my_map[k].location)
  end
  location_map[my_map[k].location].device_count += 1
  location_map[my_map[k].location].devices.push(k)

}

location_array.sort.each {
  |location|
  puts "mirror #{location_map[location].devices[0].gsub('/dev/rdsk/','')} #{location_map[location].devices[1].gsub('/dev/rdsk/','')} \\"
}
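
The core idea of the script, stripped of the stmsboot and mpathadm parsing, is to group devices by a slot number derived from the first target port name and then pair the two devices that share a slot. Here is a rough sketch of just that step. The device names and port names below are invented placeholders, and unlike the Ruby script this version silently skips any slot that does not have two devices.

```python
from collections import defaultdict


def mirror_pairs(first_target_port_by_device):
    """Pair devices whose first target port decodes to the same slot."""
    by_location = defaultdict(list)
    for device, port in first_target_port_by_device.items():
        # Same rule as the Ruby script: hex digits 14-15, modulo 64.
        location = int(port[14:16], 16) % 64
        by_location[location].append(device)
    lines = []
    for location in sorted(by_location):
        devices = by_location[location]
        if len(devices) >= 2:
            lines.append("mirror %s %s \\" % (devices[0], devices[1]))
    return lines


# Hypothetical devices: two trays, two slots each.
example = {
    "c3tAd0": "500163600004aa05",
    "c3tBd0": "500163600004aa06",
    "c3tCd0": "500163600004bb45",
    "c3tDd0": "500163600004bb46",
}
for line in mirror_pairs(example):
    print(line)
```

The printed lines are meant to be pasted after "zpool create -f jbod \", exactly like the output of the Ruby script.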


Separate ZFS Intent Logs?

Based on http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on , I ran some tests comparing separate intent logs on either disk or SSD (slogs) vs the default chained logs (clogs).  For the large block sequential workload, "tuning" the configuration by adding separate ZFS Intent Logs actually slowed the system down slightly. 

ZFS Tuning


/etc/system parameters for ZFS
* For ZFS
set zfs:zfetch_max_streams=64
set zfs:zfetch_block_cap=2048
set zfs:zfs_txg_synctime=1
set zfs:zfs_vdev_max_pending = 8


NFS Tuning

a) Kernel settings

Solaris /etc/system

* For NFS
set nfs:nfs3_nra=16
set nfs:nfs3_bsize=1048576
set nfs:nfs3_max_transfer_size=1048576

* Added to /etc/system on S10U8 x64 systems based on
*   http://www.solarisinternals.com/wiki/index.php/Networks (Nov 18, 2009)
* For NFS throughput
set rpcmod:clnt_max_conns = 8


b) Mounting the NFS filesystem

/etc/vfstab

192.168.1.5:/nfs - /mnt/nfs nfs - no vers=3,rsize=1048576,wsize=1048576

c) Verifying the NFS mount parameters

# nfsstat -m
/mnt/ar from 192.168.1.7:/export/ar
 Flags:         vers=3,proto=tcp,sec=sys,hard,intr,link,symlink,acl,rsize=1048576,wsize=1048576,retrans=5,timeo=600
 Attr cache:    acregmin=3,acregmax=60,acdirmin=30,acdirmax=60
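
If you have many clients to check, the Flags line can be parsed instead of read by eye. This is a minimal sketch that assumes the "Flags:" line format shown above; parse_mount_flags is a helper name of my own.

```python
def parse_mount_flags(flags_line):
    """Turn an 'nfsstat -m' Flags line into a dict of mount options."""
    _, _, value = flags_line.partition(":")
    flags = {}
    for item in value.strip().split(","):
        key, sep, val = item.partition("=")
        flags[key] = val if sep else True  # bare options such as 'hard'
    return flags


line = (" Flags:         vers=3,proto=tcp,sec=sys,hard,intr,link,symlink,acl,"
        "rsize=1048576,wsize=1048576,retrans=5,timeo=600")
flags = parse_mount_flags(line)
print(flags["vers"], flags["rsize"], flags["wsize"])
```

A client whose rsize or wsize comes back smaller than 1048576 did not pick up the settings from /etc/vfstab or /etc/system.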


Results:

With tuning, the Sun Storage J4400 Arrays via NFS achieved write throughput of 532 MB/Sec and read throughput of 780 MB/sec for a single 'dd' stream.

$ /bin/time dd if=/dev/zero of=/mnt/jbod/test-80g bs=2048k count=40960; umount /mnt/jbod; mount /mnt/jbod; /bin/time dd if=/mnt/jbod/test-80g of=/dev/null bs=2048k
40960+0 records in
40960+0 records out

real     2:33.7
user        0.1
sys      1:30.9
40960+0 records in
40960+0 records out

real     1:44.9
user        0.1
sys      1:04.0


With tuning, the Sun Storage F5100 Flash Array via NFS achieved write throughput of 496 MB/Sec and read throughput of 832 MB/sec for a single 'dd' stream.

$ /bin/time dd if=/dev/zero of=/mnt/lf/test-80g bs=2048k count=40960; umount /mnt/lf; mount /mnt/lf; /bin/time dd if=/mnt/lf/test-80g of=/dev/null bs=2048k
40960+0 records in
40960+0 records out

real     2:45.0
user        0.2
sys      2:17.3
40960+0 records in
40960+0 records out

real     1:38.4
user        0.1
sys      1:19.6
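
The throughput figures quoted in this section follow from the dd block counts and the "real" times. The sketch below redoes that arithmetic. It assumes that bs=2048k means blocks of 2 x 2^20 bytes, so the results are in units of 2^20 bytes per second, and it truncates rather than rounds, which appears to be how the numbers above were derived.

```python
def to_seconds(real):
    """Convert a /bin/time value such as '2:33.7' to seconds."""
    seconds = 0.0
    for part in real.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def mib_per_sec(block_count, block_mib, real):
    """Throughput in MiB/sec for block_count blocks of block_mib MiB."""
    return block_count * block_mib / to_seconds(real)


runs = {
    "J4400 write": "2:33.7",
    "J4400 read": "1:44.9",
    "F5100 write": "2:45.0",
    "F5100 read": "1:38.4",
}
for name, real in runs.items():
    print(name, int(mib_per_sec(40960, 2, real)))
```
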
To reiterate, this testing was done in preparation for work with an HPC application that is known to have large block sequential I/O that is aligned on 4K boundaries.  The Sun Storage F5100 Flash Array would not be recommended for general purpose NFS storage that is not known to be 4K aligned.

References:


  - http://blogs.sun.com/brendan/entry/1_gbyte_sec_nfs_streaming
  - http://blogs.sun.com/dlutz/entry/maximizing_nfs_client_performance_on

This blog is closed.

If you still like Sun, please take a moment to visit: http://thenetworkisthecomputer.com

Goodbye and have a nice day. 

Thank you for attending the Sun Startup Essentials webinar on Security for Web Applications. Here is a list of useful links to learn more about, and start implementing, the different technologies that were covered during the presentation:

By now you certainly heard about the acquisition and about the 5-hour webcast given on the 27th about the Sun-Oracle joint strategy. So I am not going to discuss that. In fact, as an exception to this blog, I am not going to write about IT technologies. As something really unique, I am going to discuss mountain-biking.

This bike is certainly the lightest mountain bike ever shipped fully equipped and ready to use. Weight being one of the top 3 - if not top 2 - key performance indicators in MTB racing (on a 3% slope, 60% of the rider's energy is spent fighting gravity), this makes the bike extremely competitive.

But what is really interesting is to understand how the engineers who designed the product achieved such a result, and the answer is: integration. Nowadays the majority of bike manufacturers design their frames, get them built by a third party, and assemble them with off-the-shelf components. As a result, the way the frame and the components are put together is pretty much standard across the industry.

Cannondale is one of the few players in the game that has its own set of components, such as the front suspension (yes, again, one of the lightest in the industry), the stem, or the bottom bracket, and they design them for a straightforward integration that reduces weight. By the way, it is still possible to set up their very special front suspension on a non-Cannondale frame.

So, it's because they design all the components that the complete system performs so well. Also, since Cannondale manufactures the frame themselves, they can offer a lifetime guarantee on it - whatever the weight of the rider. Yes, controlling the technologies your product is made of comes with its own set of advantages.

That said, next time, I'll get back to IT technologies ;)



By its nature, open source software (OSS) lets anyone obtain a copy of the source code as long as he or she agrees to the OSS license. For countries that want to speed up the development of their own information technologies, OSS provides a precious learning opportunity and a wonderful starting point. Governments of these countries also tend to believe that, compared to commercial software, OSS carries less risk of being controlled by vendors. In other words, some governments view the use of OSS from a strategic point of view, linking it to the security of the national information system infrastructure. Therefore, in some countries, governments encourage the adoption of OSS. For example, the Chinese government has publicly shown this intention for years, and a preference for OSS is commonly seen in government procurement.


The good news is that cloud service providers, who use open source software extensively, look to me like beneficiaries of this preferential government policy.


OSS is already used pervasively in the cloud computing world. Vendors build their cloud computing data centers on top of mainstream OSS such as Linux, Xen, Hadoop, MySQL and so on. Apart from government support, OSS is likely to hold an economic advantage over commercial software. Typically, commercial software licenses are priced by number of users or by number of processors or cores. However, cloud computing systems are designed to serve a high volume of users, so charging by user count is not a good deal in this case. Cloud systems also run software on virtual machines. One physical machine usually runs multiple virtual machines, which means all the virtual processors of each virtual machine may be counted under a commercial license model. Such license models are financially unfavorable in the cloud computing realm. In contrast, most open source licenses are cloud computing friendly and place far fewer limitations on cloud-based deployment.



Thank you for attending the Sun Startup Essentials webinar on Security for Web Applications. Here is a list of useful links to learn more about, and start implementing, the different technologies that were covered during the presentation:

Last week I had the pleasure of talking with two Web startups that are members of the Sun Startup Essentials program.

Among other things, this program gives them free technical advice about their infrastructure, so together we worked through scalability and availability questions.

In this kind of discussion, I always start by asking whether a diagram of the IT infrastructure exists. Ideally this document covers the following aspects (non-exhaustive list):

  • a list of the components used by the Web site (MySQL, Apache, etc.). This list inventories the application's external dependencies, the ones the startup does not control, as opposed to the application code developed in-house. To make the application more independent of the platform, this list should be kept as short as possible. An exotic dependency, one rarely used in the industry, must be justified by specific added value. The list of external dependencies is normally maintained by the development teams alongside a functional diagram of the application,
  • an overview of the infrastructure. It shows the software components (the Web server or application server, the database, the application itself) as well as the hardware components (servers, storage array, network). This overview highlights the redundant elements (such as front-end Web servers or a replicated database) and includes elements such as the load balancer. It must faithfully reflect the application's execution configuration, also called the "runtime view". By contrast, the functional diagram reflects the architecture of the source code, without necessarily taking into account the environment in which the application is deployed and run. The runtime view is generally the starting point for thinking about how the site will scale: where to add compute capacity, where to add storage and, above all, how to make these additions without rewriting the application from top to bottom,
  • a flow diagram can be welcome: it clarifies which components are involved in handling a client request, and in what order. This is useful for anticipating what will happen as the number of clients grows. The diagram can specify the protocols used by the application, unless they already appear in the overview,
  • a security diagram for the site, which specifies which data is critical to the business, which services are exposed on the Internet, and how both are protected against hardware failures, malicious attacks, or risks of another kind. In addition to a risk analysis, this document also lists the security measures in place (encryption, identification, virtualization, subnets, virtual networks and NAT, etc.). Once again, this diagram can be merged with the overview,
  • finally, the traditional database schema, which describes how the databases are organized and related to one another and helps determine whether the database can scale (limit your joins!)

Why go to all this trouble?

  • because for the great majority of Web companies, the site's quality of service (response time, availability) is part of the product's value proposition, and this quality of service depends on the site's infrastructure: the infrastructure diagram identifies the key elements that determine it,
  • because documenting your infrastructure lets you control it better, and this control is a key success factor in the long term, particularly if you, the entrepreneur, do not have a technical background,
  • finally, these documents demonstrate your technological maturity and professionalism to investors. Even if they cannot go into the details, they will appreciate that you have thought ahead about the longevity of the site and therefore of their investment.


Indeed, for companies that have strong IT expertise and sufficient resources to build their own data center, security may be a disadvantage of public clouds. However, the vast majority of organizations are not capable of setting up a sophisticated IT infrastructure on their own, because they lack one or both of these conditions. In most cases, organizations focus on functional requirements and are not able to pay adequate attention to security issues when building IT infrastructure. The consequence is that many IT systems run without the necessary security control procedures and are therefore exposed. This is especially true for small and medium-sized enterprises (SMEs). For such organizations, cloud computing, even in a public cloud, actually becomes a more secure option. Cloud computing providers routinely build network security and system security into the cloud infrastructure. They have well-equipped, professional staff to protect the cloud system from network threats and viruses. Security is one of the basic offerings of any mainstream cloud service and normally does not incur an extra service fee. Thus, when facing government requirements on system security, enterprises that do not want to spend their own resources on this task can effectively increase the security of their information systems by leveraging cloud computing. Here is an example of such a security requirement from a government:



On November 24, 2009, the State Council of China required companies in the network media industry to take responsibility for maintaining the network security of their own information systems. One piece of background to this requirement is that, at present, most Chinese media companies are not specialized in network security. Most information systems have potential security problems and are vulnerable to network attacks. Media companies are obliged to take action in response to this government requirement. In traditional on-premise computing, the most common reaction is to allocate dedicated resources to take charge of network security. This approach normally implies more investment in computer hardware, software and human resources. Since this is not a one-off investment, it has to be integrated into the cost structure of the company as a recurring operating cost. In addition, network security is not a core competence of a media organization. Hence, such investment may harm the organization's profitability.

If you have an iPhone or iPod Touch, here is a very useful application that provides indoor maps on the iPhone. It has been dubbed "Google Maps Inside a Building".


The application, "Micello Indoor Maps", is available for free download from the U.S. iPhone AppStore, and has received some good reviews.


Micello's backend is completely a Sun stack—running on Sun Fire X2270 servers, MySQL, and GlassFish.

Congratulations to the Micello team on this big milestone!

 

 

Wednesday, February 10, 11:00 (Paris): Security for Web Applications. For Web startups, protecting and securing their applications, their data, and their customers' data is a real key success factor.  This webinar covers the various security challenges and the associated solutions, such as encryption, authentication, certificates, secure and fault-tolerant storage, and isolated environments. The Sun Startup Essentials architects will present cost-effective implementations based on open, standard components such as Apache, MySQL and ZFS. This webinar is part of the support offered by the Sun Startup Essentials program and is reserved for its members.

Is your company less than 6 years old, with fewer than 150 employees? Join Sun Startup Essentials >>


Some HPC ISVs may be interested in Graphing Solaris Performance Stats with gnuplot

Looking for a project management tool?

Check out MyDashboardPro.com, a simple, yet powerful and secure project management tool that allows you to collaborate with as many people as you want, inside OR outside of your company intranet!

Best part: the tool is free to use and comes with 100MB accounts - try it out at: www.MyDashboardPro.com.

A little background: 25,000+ Sun employees use a similar tool worldwide, so this is industrial-strength! The basic code of this tool was open-sourced in 2007 and became part of OpenEco.org where it underwent severe security audits. Enjoy!

Recently I had the good fortune to do some testing on an Amber Road, more officially known as the Sun Storage 7410 Unified Storage System.

The machine had 64 GB of internal memory and two quad-core CPUs. A single JBOD held 24 disks, one of which was an SSD optimized for write access (a writezilla). The other 23 disks were "normal" 7200 RPM disks of one TB each. For connectivity, some of the four gigabit NICs were used.
Besides the SSD mentioned above, this system also had two other SSDs optimized for reading (readzillas).
This appliance used ZFS as its file system. The writezilla held the ZFS intent log. The readzillas were used as a level 2 ZFS file system cache (L2ARC).

For more technical information about this product please check the website at:

http://www.sun.com/storage/disk_systems/unified_storage/7410


Introduction

The original trigger to execute these tests was a remark from an ISV that they had problems with applications using their Amber Road while an rsync was creating a backup of their email archive(s) on a Linux box to a volume on their appliance.

And indeed, the throughput numbers we received were not what we expected.

To get a better understanding, a 7410 was set up together with some load-generating equipment in one of our labs. The ISV was able to send us a copy of part of their email archive that we could use for testing. The Amber Road was configured in a way comparable to the one at the ISV location. The initial test results were indeed below expectations. The quest to find the bottleneck began.


Approach

There were quite a few elements that were unknown to me, among them the internals of the Amber Road, both hardware and software (operating system, ZFS pool and file system, network interfaces, etc.).
Were the NICs somehow dependent on each other? How much traffic can a single NIC handle, in throughput but also in number of packets?
The volumes on the Amber Road were used through NFS over gigabit Ethernet. The Linux box was used to simulate the mail machine. Could it handle the required load? It used the Linux volume manager with ext3 as the file system. Could that combination deliver the required load? And was rsync, the tool used to make the backup, able to drive all the resources?

I started with the front end, the Linux box, and especially its volume manager, using

find <mail archive on lvm volume> -print | cpio -oc >/dev/null

as a way to simulate the reading part. I had a second machine with Solaris 10 installed, on exactly the same hardware as the Linux box, and of course I couldn't resist running the same read tests on it. I was not impressed with the load that could be generated from a couple of internal disks. Fortunately I had a couple of disk arrays "lying around".
I used a Sun StorageTek 6140 array holding 12 spindles spinning at 15K rpm. The array has a gigabyte of cache and two controllers.

The result of the above command:

1: 6140 Linux lvm (2 6140 volumes): 33 min
2: 6140 Solaris UFS on 2 6140 Disk Suite volume: 16 min
3: 6140 Linux nolvm (1 6140 volume): 34 min
4: 6140 Linux nolvm (4 6140 vols in parallel): 24 min

Tests 1 and 2 were performed with the same 6140 configuration: same hardware, different operating systems and volume managers. To check the Linux result I used half the disk capacity in the array without lvm (test 3) and four volumes without lvm (test 4). This last one had one volume mounted under /data1, the second volume under /data2, etc. To parallelize the work, multiple find /dataX -print | cpio -o >/dev/null sessions were run in parallel, one per file system.

After this result we decided to look into the other elements of the black box using Solaris only. Just to ensure we could generate load once we started testing the Amber Road.


Network

The next part to check was the Amber Road network. Or better, how much load can the Amber Road NICs sustain. A single volume on the Amber Road was NFS mounted with the following options:

rw,vers=3,rsize=8192,wsize=8192,soft,intr,proto=tcp,timeo=600,retrans=2,sec=sys

Since the rsync backup includes many file system operations (read, write, stat, etc.) and is optimized not to write a file if it already resides on the backup volume, I replaced rsync with a set of 'find ... cpio' commands run in parallel. Each find command (see above) used its own private mail archive. After some analysis it was clear that NFS was not a bottleneck. The following screen dump from the Amber Road analytics shows that the three NICs in use were all running close to the theoretical speed of gigabit Ethernet. "Pumping" 300MB/sec in total over three interfaces looks perfect to me. Each interface was handling close to 15,000 NFSOPS/sec.
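As a rough sanity check on those numbers, the per-interface share can be worked out with a few lines of shell. This is only a sketch: the function name is made up for illustration, and the 125 MB/s figure is the raw gigabit line rate (1000 Mbit/s divided by 8), before any Ethernet, IP, TCP or NFS overhead is subtracted.

```shell
# per_nic_utilization TOTAL_MB_PER_SEC NIC_COUNT   (hypothetical helper)
# Prints the average throughput per NIC and its share of the raw
# gigabit line rate (1000 Mbit/s / 8 = 125 MB/s, no protocol overhead).
per_nic_utilization() {
  awk -v total="$1" -v nics="$2" 'BEGIN {
    line_rate = 1000 / 8            # MB/s, raw gigabit Ethernet
    per_nic = total / nics
    printf("%.1f MB/s per NIC, %.0f%% of line rate\n",
           per_nic, 100 * per_nic / line_rate)
  }'
}

# The figures quoted above: 300 MB/sec spread over three interfaces.
per_nic_utilization 300 3
```

Once protocol overhead is taken into account, the usable payload rate of a gigabit link is somewhat below 125 MB/s, so the real utilization is higher than this raw percentage suggests.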


AR.png

With these results I concluded that both the client and the network were running fine. I needed to focus on the 7410 now. I had 24 disks in this appliance. The GUI was not my favorite tool for looking at disk utilization and other statistics. I must admit that I have been in this business for quite some years and still prefer the command line, although it might be good to understand that the punch card period was before even my time.....


Method

The ISV remark included a statement about other applications suffering when an rsync was running. The rsync runs themselves also took a remarkably long time, which translates to an average transfer rate of 50MB/sec.
I changed the approach a little bit. I first checked what could be seen as reasonable for the Amber Road. Fortunately there is a load generating tool called vdbench available. This open source tool can be downloaded from http://sourceforge.net/projects/vdbench. It has many possibilities. One of these is a simulation of file system tests. For this it creates a directory structure with as many files as specified. This is all done with a parameter file. Here is one example, for a pure read only test:

fsd=fsd1,anchor=/data_remote/test,depth=2,width=8,files=10000,sizes=(8k,60,32k,25,512k,10,4m,5)
fwd=fwd1,fsd=fsd1,operation=read,skew=50,xfersize=8k,fileio=sequential,fileselect=random,threads=48
fwd=fwd2,fsd=fsd1,operation=read,skew=50,xfersize=8k,fileio=random,fileselect=random,threads=48
rd=rd1,fwd=(fwd1,fwd2),fwdrate=10000,format=no,elapsed=72000,interval=1

The generator runs on a client machine. The Amber Road volume is NFS mounted under /data_remote/test. The directory structure is 2 levels deep. Each directory at the end of this tree holds 10,000 files of different sizes: 60% are 8KB, 25% are 32KB, 10% are 512KB and 5% are 4MB. A total of 96 threads are used to generate the load. The test runs for 72,000 seconds.
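To get a feel for how large this data set is compared to the cache in the appliance, the parameter file can be turned into a quick estimate. The sketch below assumes that files are created only in the lowest-level directories, so that depth=2 and width=8 give 8 x 8 = 64 leaf directories; the function name and its argument format are invented for this example.

```shell
# fsd_footprint DEPTH WIDTH FILES_PER_DIR "SIZE_KB:PERCENT ..."   (hypothetical helper)
# Estimates the number of files and total size described by an fsd line,
# assuming files live only in the width^depth leaf directories.
fsd_footprint() {
  awk -v depth="$1" -v width="$2" -v files="$3" -v dist="$4" 'BEGIN {
    leaf_dirs = width ^ depth
    total_files = leaf_dirs * files
    n = split(dist, pair, " ")
    for (i = 1; i <= n; i++) {
      split(pair[i], f, ":")          # f[1] = size in KB, f[2] = percent
      avg_kb += f[1] * f[2] / 100
    }
    # GB here means 1024 * 1024 KB
    printf("%d leaf dirs, %d files, avg %.1f KB, total %.1f GB\n",
           leaf_dirs, total_files, avg_kb, total_files * avg_kb / 1024 / 1024)
  }'
}

# sizes=(8k,60,32k,25,512k,10,4m,5) from the parameter file above
fsd_footprint 2 8 10000 "8:60 32:25 512:10 4096:5"
```

Under these assumptions the data set comes to roughly 164 GB, which is well beyond what fits in the memory of the appliance.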

The example above tries to create a continuous load of 10,000 IOPS (fwdrate).

Tests had to run for a long time. The file system cache (ZFS ARC) was approximately 60 GB. It had already been shown that the Amber Road behaved perfectly as long as capacity was available in the ARC. Once the ARC was fully utilized, the degraded behavior started to show. For my tests this steady-state behavior was needed in order to be comparable to the ISV situation.

The load test was done with a RAID-Z2 default setup, a RAID-Z2 narrow and a mirrored setup. Here's a picture with the output. Time versus number of IOPS. After the volume was created the directory structure and the files were created. After this a load run was started with 50% read.


tst.png

Besides some spikes there is not much difference between these three. They can all cope with a 10,000 IOPS load very well. However, this was during the first 10 minutes after all the files were created. The next graph shows the behavior over a ten times longer interval.


tst2.png

These data are from the same three tests. However, here we see that at 700 seconds the default setup drops down to approximately 6,000 IOPS. The vdbench log output (not shown here) also shows that reads no longer make up 50% of the operations on average.
The RAID-Z2 narrow setup can handle the 10,000 IOPS load longer. At 2700 seconds the behavior starts to change: it still averages around 10,000 IOPS, but far less steadily.
The mirrored setup sustains the 10,000 IOPS nicely over the whole test period.

The tests described were executed to create a background load for running rsync. The bursts in the blue graph are points where the rsync is started. These were scheduled at 1800, 3600, and 5400 seconds. No ill effects during the test on a mirrored setup.

Here is a close up around the 900 second period.
tst3.png

One of the things that vdbench tries to do is make up for missed I/Os. The first big dip after T=700 is followed by a much bigger load request. The end result is that the system seems to throttle down to a 6,000 IOPS level. During the first 600 seconds the average response time was around 1 millisecond; 200 seconds later it was 20 milliseconds. During the first 600 seconds the system was able to push 80 MB/sec. After the 800 second mark only 40-50 MB/sec was left. Finally, the share of read ops settled at around 85% of the total.

Finally, to make this comparison complete, here is a close-up of the graph around T=2900.


tst4.png


Results

The above mentioned test was redone for the three setups but now with a read write ratio of 75:25 and again with a 100% read test.

In order to see the effect on the run time of an rsync command that would create a backup of a 2 GB "mail-archive" this command was executed at 1800, 3600 and 5400 (rsync --delete -a <mail archive> <destination directory>.) The completion times are in the following tables. The times for the rsync during the default setup test were so long, that this test needed to be redone for a much longer time (14400 seconds.)


Table 1: 50% read 50% write load
Type\Time          1800   3600   5400   7200   14400
RAID-Z2 default    5161      -      -   4542    8987
RAID-Z2 narrow      124   1074   1059      -       -
RAID mirror          96    174    141      -       -



Table 2: 75% read 25% write load
Type\Time          1800   3600   5400   7200   14400
RAID-Z2 default       -   1842      -      -   11548
RAID-Z2 narrow       91     91    210    134       -
RAID mirror          99    100    101    116       -



Table 3: 100% read load
Type\Time          1800   3600   5400   7200   14400
RAID-Z2 default     114    111    111    111       -
RAID-Z2 narrow       76     76     76     76       -
RAID mirror          88     84     85     88       -



Conclusion

The first thing to mention is that although the default setup seems to perform unexpectedly poorly, you should take a second look at what is happening. The appliance is being stressed by a 10,000 IOPS load. Internally there are two times eleven disks handling this load. These standard SATA disks can each handle a load of, say, 70 IOPS (Google returns many articles in which this number ranges from 50 to 100). This translates to approximately 1,540 IOPS for the available disks.
In a mirrored setup this appliance was able to handle 10,000 IOPS: over 6 times the capacity of those disks alone.... If there are barely any writes, this is even possible with the default setup and with the narrow RAID configuration too. After realizing this, I must admit, I was impressed!
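The arithmetic behind that comparison is easy to replay for other per-disk estimates. The helper below is only an illustration (the name is made up); the 22 disks and the 50 to 100 IOPS range are the figures quoted above.

```shell
# spindle_budget DISKS IOPS_PER_DISK OFFERED_IOPS   (hypothetical helper)
# Compares the offered load with what the bare disks could deliver.
spindle_budget() {
  awk -v disks="$1" -v per_disk="$2" -v offered="$3" 'BEGIN {
    raw = disks * per_disk
    printf("%d raw disk IOPS, offered load is %.1fx that\n", raw, offered / raw)
  }'
}

spindle_budget 22 70 10000    # the estimate used in the text
spindle_budget 22 50 10000    # pessimistic per-disk figure
spindle_budget 22 100 10000   # optimistic per-disk figure
```

Even with the optimistic figure of 100 IOPS per disk, the offered load is several times what the spindles alone could serve, so the caches must be absorbing most of it.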

Graphing Solaris Performance Stats with gnuplot

It is not unusual to see an engineer import text from "vmstat" or "iostat" into a spreadsheet application such as Microsoft Office Excel or OpenOffice Calc to visualize the data.  This is a fine approach when used occasionally but impractical when used frequently.  The process of transferring the data to a laptop, manually massaging the data, launching the office application, importing the data and selecting the columns to chart is too cumbersome as a daily routine or when a large number of machines are being monitored.  In my case, I needed to visualize the performance of a few servers that were under test, and needed a few graphs from the servers, a few times a day.  I used some traditional Unix scripts and gnuplot (http://www.gnuplot.info) from the Companion CD (http://www.sun.com/software/solaris/freeware) to quickly graph the data.

The right tool for graphing Solaris data depends on your use case scenario:

  • One or two graphs, now and then: Import the data into your favorite spreadsheet application.
  • Historic data, more graphs, more frequently: use gnuplot
  • Many graphs, real-time or historic data, for more machines, such as a grid of servers being managed by Sun Grid Engine:  a formal tool such as Ganglia (http://ganglia.info, http://www.sunfreeware.com/) is recommended. An advantage of Ganglia is that performance data is exposed via a web interface to a potentially large number of viewers in real time.

That being said, here are some scripts that I used to view Solaris Performance data with gnuplot.

1. Gathering data.  For each benchmark run, a script was used to start gathering performance data:


#!/usr/bin/ksh

dir=$1
mkdir $dir
vmstat 1        > $dir/vmstat.out        2>&1 &
zpool iostat 1  > $dir/zpool_iostat.out  2>&1 &
nicstat 1       > $dir/nicstat.out       2>&1 &
iostat -nmzxc 1 > $dir/iostat.out        2>&1 &
/opt/DTraceToolkit-0.99/Bin/iopattern 1 > $dir/iopattern.out   2>&1 &

The statistics gathering processes were all killed at the end of the benchmark run. Hence, each test had a directory with a comprehensive set of statistics files.
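The post does not show how the collectors were stopped. One possible way to do it, sketched here with made-up names (start_collector, stop_collectors and the collectors.pid file are not part of the original scripts), is to record the PID of every background process when it is started and to kill those PIDs at the end of the run.

```shell
# start_collector DIR NAME COMMAND [ARGS...]   (hypothetical helper)
# Runs COMMAND in the background, logs to DIR/NAME.out and records its PID.
start_collector() {
  _dir=$1
  _name=$2
  shift 2
  mkdir -p "$_dir"
  "$@" > "$_dir/$_name.out" 2>&1 &
  echo $! >> "$_dir/collectors.pid"
}

# stop_collectors DIR   (hypothetical helper)
# Sends SIGTERM to every PID recorded by start_collector for DIR.
stop_collectors() {
  _dir=$1
  [ -r "$_dir/collectors.pid" ] || return 0
  while read -r _pid; do
    kill "$_pid" 2>/dev/null || true   # a collector that already exited is fine
  done < "$_dir/collectors.pid"
  rm -f "$_dir/collectors.pid"
}

# Example (commented out because the tools may not be installed everywhere):
#   start_collector run1 vmstat vmstat 1
#   start_collector run1 iostat iostat -nmzxc 1
#   ... run the benchmark ...
#   stop_collectors run1
```

Stopping the collectors right when the benchmark ends also avoids having to truncate the graphs afterwards.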

Next it was necessary to write a set of scripts to operate on the directories.

2. Graphing CPU utilization from "vmstat".

This script was fairly short and straightforward.  The "User CPU Utilization" and "System CPU Utilization" are in the 20th and 21st columns.  I added an optional argument to truncate the graph after a specific amount of time to account for the cases where the vmstat process was not killed immediately after the benchmark.  A bash "here document" is used to enter gnuplot commands.

#!/usr/bin/bash

dir=$1
file=$1/vmstat.out

if [ $# == 2 ] ; then
  minutes=$2
  (( seconds = minutes * 60 ))
  cat $file | head -$seconds > /tmp/data
  file=/tmp/data
fi

gnuplot -persist <<EOF
set title "$dir"
plot "$file" using 20 title "%user" with lines, \
     "$file" using 21 title "%sys" with lines

EOF

Graph of CPU utilization based on vmstat output
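The column numbers 20 and 21 are hard-coded in the script above. If the vmstat layout on a given system differs, the header line can be used to look the positions up instead. The helper below is a sketch with an invented name; it takes the last header field with the requested name, because "sy" appears twice in the header (system calls under "faults" and system time under "cpu").

```shell
# vmstat_column NAME FILE   (hypothetical helper)
# Prints the 1-based position of the LAST header field called NAME,
# using the first vmstat header line that starts with "r b".
vmstat_column() {
  awk -v name="$1" '
    $1 == "r" && $2 == "b" {
      for (i = 1; i <= NF; i++)
        if ($i == name) col = i
      print col
      exit
    }' "$2"
}

# Example:
#   user_col=$(vmstat_column us "$dir/vmstat.out")
#   sys_col=$(vmstat_column sy "$dir/vmstat.out")
# and then use "using $user_col" and "using $sys_col" in the gnuplot commands.
```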

3. Graphing IO throughput from "iostat -nmzxc 1" data

This script was a little bit more complicated for three reasons:

  • The data file contains statistics for several filesystems that are not interesting and need to be filtered out.  The script must be launched with an argument that is used to select one device.
  • I used the 'z' option to iostat, which does not print traces when the device is idle (zero I/O).  The 'z' option makes a smaller file that is more human readable, but it is not good for graphing.  Thus I needed to synthesize the zero traces before passing the data to gnuplot.
  • I wanted to include a smooth line for the iostat "%w" and "%b" columns with a scale of 0 to 100.

#!/usr/bin/bash

# This script is used to parse "iostat -nmzxc" data which is formatted like this:
#
#                     extended device statistics
#     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
#     0.0    0.9    0.8    3.8  0.0  0.0    0.0    0.5   0   0 c0t1d0
#     0.0    0.0    0.0    0.0  0.0  0.0    0.0    2.4   0   0 sge_master:/opt/sge6-2/default/common
#     0.0    0.8    1.9  184.5  0.0  0.0    4.1   31.1   0   1 192.168.2.9:/jbod



if [ $# -lt 2 -o $# -gt 3 ] ; then
  echo "Usage: $0 pattern dir [minutes]"
  exit 1
fi

pattern=$1
dir=$2
(( minutes = 24 * 60 )) #default: graph 1 day

if [ $# == 3 ] ; then
  minutes=$3
fi

(( seconds = minutes * 60 ))
all_data=$dir/iostat.out
plot_data=/tmp/plot_data

if [ ! -r $all_data ] ; then
  echo "can not read $all_data"
  exit 1
fi

# For each time interval, either:
#   print the trace for the device that matches the pattern, or
#   print a "zero" trace if there is not one in the data file 
# You can tell that there was no trace for the device during an
# interval if you reach the "extended device statistics" line 
# without finding a trace
gawk -v pattern=$pattern '
$0 ~ pattern {
  printf("%s\n",$0);
  found = 1 ;
}

/extended/ {
  if (found == 0)
    printf("    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 192.168.2.9:/jbod \n")
  found = 0;
} ' $all_data | head -$seconds > $plot_data

gnuplot -persist <<EOF
set title "$pattern - $dir"
set ytics nomirror
set y2range [0:100]
set y2tics 0, 20
plot "$plot_data" using  3 title "read (kb/sec)" axis x1y1 with lines, \
     "$plot_data" using  4 title "write (kb/sec)" axis x1y1 with lines, \
     "$plot_data" using  9 title "%w" axis x1y2 smooth bezier with lines, \
     "$plot_data" using 10 title "%b" axis x1y2 smooth bezier with lines

EOF

I created the following graph with the command "graph_iostat.bash jbod NFS_client_10GbE 5" to select data only from the "jbod" NFS mount, where the data is stored in the directory named "NFS_client_10GbE" and only graph the first 5 minutes worth of data.


iostat_NFS_client_10GbE.png

The iostat data was collected on an NFS client connected with a 10 gigabit network.  There is some write activity (green) at the start of the 5 minute sample period, followed by several minutes of intense reading (red) where the client hits speeds of 600-700MB/sec. The purple "%b" line, with values on the right x1y2 axis, indicates that during the intense read phase, the mount point is busy about 90% of the time.  
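The zero-fill step can also be tried out in isolation on a small sample file. The function below is a variant of the awk logic in the script, not a copy of it: it emits the zero row when an interval ends (at the next "extended device statistics" line or at end of file), so that nothing is printed before the first sample. The function name and the "idle" device label are placeholders chosen for this sketch.

```shell
# fill_idle_intervals PATTERN FILE   (hypothetical helper)
# Prints the matching device line for every iostat interval, or an
# all-zero line when the device did not appear in that interval.
fill_idle_intervals() {
  awk -v pattern="$1" '
    function emit_zero() {
      if (seen_header && !found)
        print "    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 idle"
    }
    /extended device statistics/ { emit_zero(); seen_header = 1; found = 0; next }
    $0 ~ pattern                 { print; found = 1 }
    END                          { emit_zero() }
  ' "$2"
}

# Example:
#   fill_idle_intervals jbod NFS_client_10GbE/iostat.out | head -300 > /tmp/plot_data
```

The output has the same eleven columns as the iostat traces, so it can be fed to the same gnuplot commands.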

4. Graphing I/O Service time from "iostat -nmzxc" data.

I also find that columns 7 and 8 of the iostat output, the two service-time columns, are very interesting. They can be graphed using a simplification of the previous script.

  • wsvc_t: average service time in the wait queue, in milliseconds
  • asvc_t: average service time of active transactions, in milliseconds


#!/usr/bin/bash

#                     extended device statistics
#     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
#     0.0    0.9    0.8    3.8  0.0  0.0    0.0    0.5   0   0 c0t1d0
#     0.0    0.0    0.0    0.0  0.0  0.0    0.0    2.4   0   0 sge_master:/opt/sge6-2/default/common
#     0.0    0.8    1.9  184.5  0.0  0.0    4.1   31.1   0   1 192.168.2.9:/jbod


if [ $# -lt 2 -o $# -gt 3 ] ; then
  echo "Usage: $0 pattern dir [minutes]"
  exit 1
fi

pattern=$1
dir=$2
(( minutes = 24 * 60 )) #default: graph 1 day

if [ $# == 3 ] ; then
  minutes=$3
fi

(( seconds = minutes * 60 ))
all_data=$dir/iostat.out
plot_data=/tmp/plot_data

# For each time interval, either:
#   print the trace for the device that matches the pattern, or
#   print a "zero" trace if there is not one in the data file 
# You can tell that there was no trace for the device during an
# interval if you reach the "extended device statistics" line 
# without finding a trace
gawk -v pattern=$pattern '
$0 ~ pattern {
  printf("%s\n",$0);
  found = 1 ;
}

/extended/ {
  if (found == 0)
    printf("    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 192.168.2.9:/jbod \n")
  found = 0;
} ' $all_data | head -$seconds > $plot_data

# Column 7 is wsvc_t and column 8 is asvc_t (see the header above)
gnuplot -persist <<EOF
set title "$pattern - $dir"
set log y
plot "$plot_data" using  7 title "wsvc_t" with lines, \
     "$plot_data" using  8 title "asvc_t" with lines

EOF

Here is the graph produced by the command "graph_iostat_svc_t.bash jbod NFS_client_10GbE 5"

iostat_NFS_client_svc_t_10GbE

5. Graphing network throughput data from "nicstat"

Another very valuable Solaris performance statistics tool is "nicstat".  For the download link, see http://blogs.sun.com/timc/entry/nicstat_the_solaris_and_linux .  A script to graph the data from nicstat follows the same pattern.

#!/usr/bin/bash

if [ $# -lt 2 -o $# -gt 3 ] ; then
  echo "Usage: $0 interface dir [minutes]"
  exit 1
fi

interface=$1
dir=$2
(( minutes = 24 * 60 )) #default: graph 1 day

if [ $# == 3 ] ; then
  minutes=$3
fi

(( seconds = $minutes * 60 ))
all_data=$dir/nicstat.out
plot_data=/tmp/plot_data

if [ ! -r $all_data ] ; then
  echo "can not read $all_data"
  exit 1
fi

grep $interface $all_data | head -$seconds > $plot_data

gnuplot -persist <<EOF
set title "$interface - $dir"
plot "$plot_data" using 3 title "read" with lines, \
     "$plot_data" using 4 title "write" with lines
EOF

Here is the graph produced by the command "graph_nicstat.bash ixgbe2 NFS_server_10GbE 5"

nicstat_NFS_server_10GbE

6. Graphing IO throughput from "zpool iostat" data

The challenge for plotting "zpool iostat" data is that the traces are not in constant units and therefore it is necessary to re-compute the data in constant units, in this example, MB/sec. 

#!/usr/bin/bash

if [ $# -lt 2 -o $# -gt 3 ] ; then
  echo "Usage: $0 pool dir [minutes]"
  exit 1
fi

pool=$1
dir=$2
(( minutes = 24 * 60 )) #default: graph 1 day

if [ $# == 3 ] ; then
  minutes=$3
fi

(( seconds = minutes * 60 ))
all_data=$dir/zpool_iostat.out
plot_data1=/tmp/plot_data1
plot_data2=/tmp/plot_data2

if [ ! -r $all_data ] ; then
  echo "can not read $all_data"
  exit 1
fi

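# Convert the suffixed bandwidth columns (for example 1.5K or 23.4M) to MB/sec.
# "bc -l" keeps the fractional part; plain "bc" divides with scale 0 and
# would truncate every value to whole MB/sec.
grep $pool $all_data | awk '{printf("%s/1048576\n",$6)}' | sed -e 's/K/*1024/g' -e 's/M/*1048576/g' -e 's/G/*1073741824/g' | bc -l | head -$seconds > $plot_data1
grep $pool $all_data | awk '{printf("%s/1048576\n",$7)}' | sed -e 's/K/*1024/g' -e 's/M/*1048576/g' -e 's/G/*1073741824/g' | bc -l | head -$seconds > $plot_data2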

gnuplot -persist <<EOF
set title "$pool - $dir"
set log y
plot "$plot_data1" using 1 title "read (MB/sec)" with lines, \
     "$plot_data2" using 1 title "write (MB/sec)" with lines

EOF
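
The awk/sed/bc pipeline in the script above can also be collapsed into a single awk pass. The following is only a sketch, not part of the original script: the function name "to_mb" is made up for illustration, and it assumes the same K, M and G suffixes (powers of 1024) that the sed expressions above handle.

```shell
# to_mb: hypothetical helper (not part of the original script).
# Reads one zpool-iostat-style bandwidth value per line on stdin
# (e.g. 512K, 1.5M, 2G, or a bare number of bytes) and prints MB/sec.
to_mb() {
  awk '{
    v = $1
    unit = substr(v, length(v), 1)      # last character: K, M, G or a digit
    mult = 1                            # no suffix: value is already in bytes
    if (unit == "K")      mult = 1024
    else if (unit == "M") mult = 1048576
    else if (unit == "G") mult = 1073741824
    if (mult > 1) v = substr(v, 1, length(v) - 1)   # strip the suffix
    printf("%.3f\n", v * mult / 1048576)
  }'
}

# Example: four sample values, one per line
printf '512K\n1.5M\n2G\n0\n' | to_mb
```

In the graphing script this would be used as, for example, grep $pool $all_data | awk '{print $6}' | to_mb | head -$seconds > $plot_data1.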

Graphing the IO throughput of the zpool named "jbod" using the command "graph_iostat_svc_t.bash jbod NFS_client_10GbE 5" shows that the pool can deliver data at close to one gigabyte per second.

zpool_iostat_NFS_client_10GbE.png

It is easy to modify the scripts above to graph the output of many tools that output a table of data in text format.

After the failure of the SATA and USB ports on my Intel D945GCL Atom board, I decided to build out a new file server. Sticking to the Atom theme, I decided to go small and get the CompuLab FIT-PC2. This little toy uses the Z530 1.6GHz CPU and apparently draws only 6 watts of power. I'm assuming that means *without* a hard drive installed.



Measuring in at around 115 x 101 x 27mm (~ 4.5"x4.0"x1.0"), it is only big enough to hold one laptop sized 2.5" SATA drive.



The drive I installed only has 80GB of space. That would run out real quick with my needs, so I decided to get a MediaSonic USB disk enclosure to link up with my server. It can hold up to 4 SATA drives.



The PC sits on top of the enclosure on my bookshelf, taking up 8.5" x 5.0" x 6.5" of space. This is not only power efficient, but space efficient, since I am using 4 x 1TB drives. 4TB total (theoretical), ~2.6TB in a ZFS raidz. If I had purchased 2TB drives, it would be even better.
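
Most of the gap between the 4TB raw figure and the usable space can be reproduced with a little arithmetic. This is a rough sketch under two assumptions of mine: the pool is a single-parity raidz (one drive's worth of space goes to parity), and the drives are sold in decimal terabytes while the OS reports binary units. The function name is made up for illustration.

```shell
# raidz_usable_tib: hypothetical helper.
# $1 = number of drives, $2 = size of each drive in decimal TB.
# Assumes single-parity raidz, so (n - 1) drives hold data.
raidz_usable_tib() {
  awk -v n="$1" -v tb="$2" 'BEGIN {
    printf("%.2f\n", (n - 1) * tb * 1e12 / 2^40)   # decimal bytes -> binary TiB
  }'
}

raidz_usable_tib 4 1   # prints 2.73
```

The remaining difference down to the ~2.6TB I actually see is presumably filesystem overhead, which this sketch does not try to model.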

Doug's blog on the FIT-PC2 gives a good overview of the features of the device and what works. There is no wifi driver and Xorg doesn't work, so you may want to install OpenSolaris on another machine before installing the internal HDD. My server is headless and uses the built-in gigabit ethernet, so I don't care about those issues.


Links and prices Total = $748
Here is a link to a press release from the European Commission, announcing that they have cleared Oracle's proposal to acquire Sun Microsystems.  I find it interesting reading because it describes the fundamental issues that the Commission dealt with before making their decision (MySQL, Java, integration of a complete technology stack).

From an open source point of view, I found it satisfying that they mentioned PostgreSQL as part of their deliberation.  They said that they found the PostgreSQL open source database to be a credible alternative to MySQL; essentially, they're saying that the open source DB landscape is not monocultural, and that there are other viable alternatives.  I think they're right, and I'm glad to see that it's not all about MySQL.

My opinion: they took a long time to make a decision, but they considered the issues you'd think they should consider.

The next few days should be really interesting for us at Sun.  Whether employed or not, at least the long purgatory period is finally coming to an end.  I can't tell you how tiresome it has become telling my friends month after month "No real news yet."  It's nice that friends care, but it sucks that there's been no progress to report.  Finally, that's done with.  (whew!)



Sun Startup Essentials is once again a partner of the Startup Academy for its 2010 edition. This startup support program is now in its third session. Nearly 250 startups have applied since the first edition in November 2008 to benefit from comprehensive, personalized support from a panel of experts. The program is now permanent: entrepreneurs can apply at any time to take part in the competition. This round runs from March 5 (the application deadline) to April 13 (when the results are announced). Each application is published on the Startup Academy blog, which gives the competing companies extended visibility and one more opportunity to become better known to the public, investors and potential partners.

"Canary In A Coalmine"
The Police
from the album "Zenyatta Mondatta", 1980


Poor Steve A.[1] ... This entry is not about Steve A. though. It is about the new PeopleSoft NA Payroll benchmark result that Sun published today.

First things first. Here is the direct URL to our latest benchmark results:

        PeopleSoft Enterprise Payroll 9.0 using Oracle for Solaris on a Sun SPARC Enterprise M4000 (16 job streams[2], referred to simply as 'streams' from here on)

The summary of the benchmark test results is shown below for the 16 stream benchmarks only. These numbers were extracted from the very first page of the benchmark results white papers, where Oracle|PeopleSoft highlights the significance of the results and the actual numbers that are of interest to customers. The results in the following table are sorted by hourly throughput (payments/hour) in descending order. The goal is to achieve as much hourly throughput as possible. Click on the link underneath each hourly throughput value to open the corresponding benchmark result.

Oracle PeopleSoft North American Payroll 9.0 - Number of employees: 240,000 & Number of payments: 360,000

Vendor | OS | Hardware Config | #Job Streams | Elapsed Time (min) | Hourly Throughput (Payments per Hour)
Sun | Solaris 10 5/09 | 1 x Sun SPARC Enterprise M4000 with 4 x 2.53 GHz SPARC64-VII Quad-Core processors and 32 GB memory; 1 x Sun Storage F5100 Flash Array with 40 Flash Modules for data, indexes; 1 x Sun Storage J4200 Array for redo logs | 16 | 43.78 | 493,376
HP | HP-UX | 1 x HP Integrity rx6600 with 4 x 1.6 GHz Intel Itanium2 9000 Dual-Core processors and 32 GB memory; 1 x HP StorageWorks EVA 8100 | 16 | 68.07 | 317,320

This is all public information. Feel free to compare the hardware configurations and the data presented in both rows and draw your own conclusions. Since both Sun and HP used the same benchmark toolkit and workload, and ran the benchmark with the same number of job streams, the comparison should be pretty straightforward.

If you want to compare the 8 stream results, check the other blog entry: PeopleSoft North American Payroll on Sun Solaris with F5100 Flash Array : A blog Reprise. Sun used the same hardware to run both benchmark tests, with 8 and 16 streams respectively. We could have gotten away with 20+ Flash Modules (FMODs), but we wanted to keep the benchmark environment consistent with our prior benchmark effort around the same workload with 8 job streams. Because the hardware setup is the same, we can now easily demonstrate the advantage of parallelism (simply by comparing the test results from the 8 and 16 stream benchmarks) and how resilient and scalable the F5100 Flash Array is.

Our benchmarks showed an improvement of ~55% in overall throughput when the number of job streams was increased from 8 to 16. Our 16 stream results also showed a ~55% improvement in overall throughput over HP's published results with the same number of streams, at a maximum average CPU utilization of 45% compared to HP's maximum average CPU utilization of 89%. The half-populated Sun Storage F5100 Flash Array played the key role in both of those benchmark efforts by demonstrating superior I/O performance over traditional disk based arrays.
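
For anyone who wants to check the 16 stream numbers, the hourly throughput figures in the table and the ~55% gap between Sun and HP can be re-derived from the elapsed times. This is just a sketch: it assumes hourly throughput is simply the number of payments divided by the elapsed time in hours, and the function names are made up for illustration.

```shell
# hourly_throughput: hypothetical helper.
# $1 = number of payments, $2 = elapsed time in minutes.
hourly_throughput() {
  awk -v p="$1" -v m="$2" 'BEGIN { printf("%.0f\n", p / m * 60) }'
}

# percent_gain: hypothetical helper.
# $1 = higher throughput, $2 = baseline throughput.
percent_gain() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf("%.1f\n", (a - b) / b * 100) }'
}

sun=$(hourly_throughput 360000 43.78)   # 493376
hp=$(hourly_throughput 360000 68.07)    # 317320
echo "Sun: $sun  HP: $hp  gain: $(percent_gain "$sun" "$hp")%"
```

Both computed values match the published payments-per-hour figures, and the gain works out to 55.5%.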

Before concluding, I would like to highlight a few known facts (just for the benefit of those people who may fall for the PR trickery):

  1. 8 job streams != 16 job streams. In other words, the results from an 8 stream effort are not comparable to those from a 16 stream effort.
  2. The throughput should go up with an increased number of job streams [ only up to some extent -- do not forget that there will be a saturation point for everything ]. For example, the throughput with 16 streams might be higher than the 8 stream throughput.
  3. The Law of Diminishing Returns applies to the software world too, not just to economics. So, there is no guarantee that the throughput will be much better with 24 or 32 job streams.

Other blog posts and documents of interest:

  1. Best Practices for Oracle PeopleSoft Enterprise Payroll for North America using the Sun Storage F5100 Flash Array or Sun Flash Accelerator F20 PCIe Card
  2. PeopleSoft Enterprise Payroll 9.0 using Oracle for Solaris on a Sun SPARC Enterprise M4000 (8 streams benchmark white paper)
  3. PeopleSoft North American Payroll on Sun Solaris with F5100 Flash Array : A blog Reprise
  4. App benchmarks, incorrect conclusions and the Sun Storage F5100
  5. Oracle PeopleSoft Payroll (NA) Sun SPARC Enterprise M4000 and Sun Storage F5100 World Record Performance

Notes:

[1] Steve A. tried his hardest to make everyone else believe that HP's 16 job stream NA Payroll 240K EE benchmark results are on par with Sun's 8 stream benchmark results. Apparently Steve A. failed and gave up after we showed the world a few screenshots from a benchmark that HP published and eventually withdrew. You can read all his arguments, comparisons, etc. in the comments section of my other blog entry, PeopleSoft North American Payroll on Sun Solaris with F5100 Flash Array : A blog Reprise, as well as in Joerg Moellenkamp's blog entries on the same topic.

[2] In PeopleSoft terminology, a job stream is something that is equivalent to a thread.

The Livescribe Developer Challenge has been extended to run through March 13th.


Livescribe is Hosting a Developer Challenge

Submit your original smartpen app to the Livescribe Developer Challenge for a chance to win over $10,000 in prizes and special promotion within the Application Store and through media outlets. The contest runs worldwide December 7th 2009 – Mar 13th 2010. For official rules and to enter, please visit the Livescribe Developer Challenge website

Livescribe Developer Challenge

"Yet another way the Sun Startup Essentials program can help our startups (and the economy) is to provide current opportunities to our large readership & member-base. We are not trying to be recruitment agents, nor will we get involved in any part of that process, but we have a great platform to assist startups. "

Following on from the last three startup job postings here are some more hot off the press!

Subscribe to the startup program and follow me @scoobeesnac or subscribe to our blog post summaries using the textfield to the right-side of the blog, to hear more.