cmt tags

The free lunch is over.

Multicore processors are not just coming—they’re here.

Leveraging multiple cores requires writing scalable parallel programs, which is incredibly hard.

Tools such as fork/join frameworks based on work-stealing algorithms make the task easier, but it still takes a fair bit of expertise and tuning.

Bulk-data APIs such as parallel arrays allow computations to be expressed in terms of higher-level, SQL-like operations (e.g., filter, map, and reduce), which can be mapped automatically onto the fork/join paradigm.

Working with parallel arrays in Java, unfortunately, requires lots of boilerplate code to solve even simple problems.

Closures can eliminate that boilerplate.

It’s time to add them to Java.

The closures design space

In the last few years three serious proposals for adding closures to Java have been put forward: BGGA, CICE, and FCM. These proposals cover a wide range of complexity and expressive power. My view, having studied them all, is that each contains good ideas yet none is entirely appropriate for a “working programmer’s language.”

To support the principal use case of parallel programming we really only need two key features:

  • A literal syntax, for writing closures, and

  • Function types, so that closures are first-class citizens in the type system.

To integrate closures with the rest of the language and the platform we need two additional features:

  • Closure conversion, so that a closure of appropriate type can be used where an object of a single-method interface or abstract class is required, and

  • Extension methods, so that closure-oriented bulk-data methods can be retrofitted onto existing libraries, and in particular the Collections Framework, without breaking compatibility.

A couple of other integration features worth considering are first-class method references and the ability of function types to include exception parameters.

Some of the other features found in the existing proposals carry considerable additional complexity:

  • Capture of non-final variables,

  • Non-local transfer of control, and

  • Library-defined control structures (i.e., control abstraction).

At present I see no need to add any of these to Java.

Let’s be about it

It’s time to learn from the past debate and move forward. Sun will initiate the design and implementation of a simple closures feature, as outlined above, and add it to JDK 7 so as to enable broad experimentation. If all goes well then we’ll later submit a language-change JSR which would, in turn, be proposed as a component of the eventual Java SE 7 JSR.

Revising a programming language that’s in active use by millions of developers is no small task. Sun neither can nor should do it alone, so I hereby invite everyone who participated in the earlier closures conversations—as well as anyone else with an informed opinion—to join us.

Up next: The straw man

When a thread hits an error in a multithreaded application, that error will take out the entire app. Here's some example code:

#include <pthread.h>
#include <stdio.h>

void *work(void *param)
{
  /* Deliberately write through an invalid pointer to provoke SIGSEGV. */
  int *a;
  a = (int *)(1024 * 1024);
  (*a)++;
  printf("Child thread exit\n");
  return NULL;
}

int main(void)
{
  pthread_t thread;
  pthread_create(&thread, 0, work, 0);
  pthread_join(thread, 0);
  printf("Main thread exit\n");
  return 0;
}

Compiling and running this produces:

% cc -O -mt pthread_error.c
% ./a.out
Segmentation Fault (core dumped)

Not entirely unexpected, that. The app died without the main thread having the chance to clear up resources etc. This is probably not ideal. However, it is possible to write a signal handler to capture the segmentation fault, and terminate the child thread without causing the main thread to terminate. It's important to realise that there's probably little chance of actually recovering from the unspecified error, but this at least might give the app the chance to report the symptoms of its demise.

#include <pthread.h>
#include <stdio.h>
#include <signal.h>

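void *work(void *param)
{
  /* Deliberately write through an invalid pointer to provoke SIGSEGV. */
  int *a;
  a = (int *)(1024 * 1024);
  (*a)++;
  printf("Child thread exit\n");
  return NULL;
}

/* Runs on the faulting thread. Note that printf() and pthread_exit() are
   not async-signal-safe, so this handler is for demonstration only. */
void hsignal(int i)
{
  printf("Signal %i\n", i);
  pthread_exit(0);
}

int main(void)
{
  pthread_t thread;
  sigset(SIGSEGV, hsignal);
  pthread_create(&thread, 0, work, 0);
  pthread_join(thread, 0);
  printf("Main thread exit\n");
  return 0;
}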

Which produces the output:

% cc -O -mt pthread_error.c
% ./a.out
Signal 11
Main thread exit

The Sun SPARC Enterprise T5240 server running Sun Java System Messaging Server 7.2 achieved a world-record SPECmail2009 result using the Sun Storage 7310 Unified Storage System and the ZFS file system. Sun's OpenStorage platforms enable another world record.

  • World-record SPECmail2009 result: the Sun SPARC Enterprise T5240 server (two 1.6GHz UltraSPARC T2 Plus), with Sun Communications Suite 7, Solaris 10, and the Sun Storage 7310 Unified Storage System, achieved 14,500 SPECmail_Ent2009 users at 69,857 Sessions/Hour.

  • This SPECmail2009 benchmark result clearly demonstrates that the Sun Messaging Server 7.2, Solaris 10 and ZFS solution can support a large, enterprise level IMAP mail server environment as a low cost 'Sun on Sun' solution, delivering the best performance and maximizing data integrity and availability of Sun Open Storage and ZFS.

  • The Sun SPARC Enterprise T5240 server supported 2.4 times more users, with a 2.4 times better sessions/hour rate, than the Apple Xserv3,1 solution on the SPECmail2009 benchmark.

  • There are no IBM Power6 results on this benchmark.

  • The configuration using Sun OpenStorage outperformed all previous results with traditional direct attached storage and significantly higher number of disk devices.

SPECmail2009 Performance Landscape (ordered by performance)

System                                                      Users   Sessions/hour  Disks   OS          Messaging Server
Sun SPARC Enterprise T5240 (2 x 1.6GHz UltraSPARC T2 Plus)  14,500  69,857         58 NAS  Solaris 10  CommSuite 7.2, Sun JMS 7.2
Sun SPARC Enterprise T5240 (2 x 1.6GHz UltraSPARC T2 Plus)  12,000  57,758         80 DAS  Solaris 10  CommSuite 5, Sun JMS 6.3
Sun Fire X4275 (2 x 2.93GHz Xeon X5570)                     8,000   38,348         44 NAS  Solaris 10  Sun JMS 6.2
Apple Xserv3,1 (2 x 2.93GHz Xeon X5570)                     6,000   28,887         82 DAS  MacOS 10.6  Dovecot 1.1.14, apple 0.5
Sun SPARC Enterprise T5220 (1 x 1.4GHz UltraSPARC T2)       3,600   17,316         52 DAS  Solaris 10  Sun JMS 6.2

Complete benchmark results may be found at the SPEC benchmark website http://www.spec.org

Users - SPECmail_Ent2009 Users
Sessions/hour - SPECmail2009 Sessions/hour
NAS - Network Attached Storage
DAS - Direct Attached Storage

Results and Configuration Summary

Hardware Configuration:

    Sun SPARC Enterprise T5240
      2 x 1.6 GHz UltraSPARC T2 Plus processors
      128 GB memory
      2 x 146GB, 10K RPM SAS disks, 4 x 32GB SSDs

External Storage:

    2 x Sun Storage 7310 Unified Storage System, each with
      32 GB of memory
      24 x 1 TB 7200 RPM SATA Drives

Software Configuration:

    Solaris 10
    ZFS
    Sun Java Communications Suite 7 Update 2
      Sun Java System Messaging Server 7.2
      Directory Server 6.3

Benchmark Description

The SPECmail2009 benchmark measures the ability of corporate e-mail systems to meet today's demanding e-mail users over fast corporate local area networks (LAN). The SPECmail2009 benchmark simulates corporate mail server workloads that range from 250 to 10,000 or more users, using industry standard SMTP and IMAP4 protocols. This e-mail server benchmark creates client workloads based on a 40,000 user corporation, and uses folder and message MIME structures that include both traditional office documents and a variety of rich media content. The benchmark also adds support for encrypted network connections using industry standard SSL v3.0 and TLS 1.0 technology. SPECmail2009 replaces all versions of SPECmail2008, first released in August 2008. The results from the two benchmarks are not comparable.

Software on one or more client machines generates a benchmark load for a System Under Test (SUT) and measures the SUT response times. A SUT can be a mail server running on a single system or a cluster of systems.

A SPECmail2009 'run' simulates a 100% load level associated with the specific number of users, as defined in the configuration file. The mail server must maintain a specific Quality of Service (QoS) at the 100% load level to produce a valid benchmark result. If the mail server does maintain the specified QoS at the 100% load level, the performance of the mail server is reported as SPECmail_Ent2009 SMTP and IMAP Users at SPECmail2009 Sessions per hour. The SPECmail_Ent2009 users at SPECmail2009 Sessions per Hour metric reflects the unique workload combination for a SPEC IMAP4 user.

Key Points and Best Practices

  • Each Sun Storage 7310 Unified Storage System was configured with one J4400 JBOD array whose 22 x 1 TB SATA drives formed a mirrored device, with 4 shared volumes built on that mirrored device. In total, 8 mirrored volumes from the 2 x Sun Storage 7310 systems were mounted on the system under test (SUT) over the NFSv4 protocol as the file systems for the messaging server's mail indexes and mail messages. Four SSDs were used as the SUT internal disks, each configured as a ZFS file system. Four such ZFS directories are used for the messaging server queue, store metadata, LDAP and queue. The SSDs substantially reduced the store metadata and queue latencies.

  • Each Sun Storage 7310 Unified Storage System was connected to the SUT via a dual 10-Gigabit Ethernet Fiber XFP card.

  • The Sun Storage 7310 Unified Storage System software version is 2009.08.11,1-0.

  • The clients used these Java options: java -d64 -Xms4096m -Xmx4096m -XX:+AggressiveHeap

  • Substantial performance and scalability improvements were observed with Sun Communications Suite 7 Update 2, Java Messaging Server 7.2 and Directory Server 6.2.

  • See the SPEC Report for all OS, network and messaging server tunings.

Disclosure Statement

SPEC and SPECmail are registered trademarks of the Standard Performance Evaluation Corporation. Results as of 10/22/09 on www.spec.org. SPECmail2009: Sun SPARC Enterprise T5240, SPECmail_Ent2009 14,500 users at 69,857 SPECmail2009 Sessions/hour. Apple Xserv3,1, SPECmail_Ent2009 6,000 users at 28,887 SPECmail2009 Sessions/hour.

Hi There!

I've been a little quiet for the last 2 weeks, working on a decent-sized Siebel deployment plan (including Oracle DB, Siebel, Fusion middleware components, as well as integrated 3rd party applications).


The partner came to Sun requesting sizing for the hardware infrastructure (putting Oracle on Sun makes sense for many, many reasons!).
The request was for x86 servers, running RedHat, and 'we'll virtualise everything'. I guess that was fine, except for one little detail… Oracle doesn't provide explicit support for most of its applications when running in virtualised environments, unless it's Oracle VM.
Now this partner didn't have skills in Oracle VM, so what alternative was available?


Answer: LDOMS & Containers.


That's right, Sun's hardware partitioning & OS virtualisation technologies are supported for the deployment of most Oracle applications (support for LDOMs, Containers, or both, depending on the particular application itself).


So how does this affect the infrastructure solution?


Well, the x86 solution could not take advantage of virtualisation and thus required almost 100 physical servers (1RU or blades).
The SPARC CMT solution required less than 40 blades, provided a massive reduction in the number of physical systems that needed to be managed, took up much less floorspace in the datacenter (5 B6000 chassis vs 10), greatly reduced the number of Ethernet & FC SAN switch ports & cabling required (saved approx. 200k on FC switches alone), etc.


And it turned out cheaper too!


Lesson learned?


Not all enterprise applications & ISVs support running their products within virtualised environments (or they will require you to replicate the issue on physical hardware first). LDOMs & Containers are recognised by Oracle & other major ISVs as a supported method of application & OS isolation & consolidation.


Using these technologies to consolidate workloads can significantly reduce the number of physical systems you need to manage, while providing the legendary RAS features of SPARC hardware & the Solaris OS. This usually translates into better uptime, which means happier users and happier admins, and it also saves money, which means happier IT Managers & CxOs!


Bottom Line: LDOMS & Containers = Happiness!



For more info on running Oracle with LDOMS/Containers, go here or here

Since I will likely have to transition blogging systems after the Oracle acquisition, I decided to avoid the trouble and just begin using WordPress now. I intend to use this site for future posts regarding Oracle and Sun performance... and I will likely repost material as well.

My new Oracle/Sun performance blog is:

I already made a post about the value of predictable IO latency with Exadata V2. I hope you enjoy it and this new site.

Take care,
Glenn

TPC-C Sun SPARC Enterprise T5440 with Oracle RAC World Record Database Result

Sun and Oracle demonstrate the World's fastest database performance. Sun Microsystems using 12 Sun SPARC Enterprise T5440 servers, 60 Sun Storage F5100 Flash arrays and Oracle 11g Enterprise Edition with Real Application Clusters and Partitioning delivered a world-record TPC-C benchmark result.

  • The 12-node Sun SPARC Enterprise T5440 server cluster delivered a world record TPC-C benchmark result of 7,646,486.7 tpmC and $2.36/tpmC (USD) using Oracle 11g R1 on a configuration available 12/14/09.

  • The 12-node Sun SPARC Enterprise T5440 server cluster beats the performance of the IBM Power 595 (5GHz) with IBM DB2 9.5 database by 26% and has 16% better price/performance on the TPC-C benchmark.

  • The complete Oracle/Sun solution delivered 10.7x better computational density than the IBM configuration (computational density = performance/rack).

  • The complete Oracle/Sun solution used 8 times fewer racks than the IBM configuration.

  • The complete Oracle/Sun solution has 5.9x better power/performance than the IBM configuration.

  • The 12-node Sun SPARC Enterprise T5440 server cluster beats the performance of the HP Superdome (1.6GHz Itanium2) by 87% and has 19% better price/performance on the TPC-C benchmark.

  • The Oracle/Sun solution utilized Sun FlashFire technology to deliver this result. The Sun Storage F5100 flash array was used for database storage.

  • Oracle 11g Enterprise Edition with Real Application Clusters and Partitioning scales and effectively uses all of the nodes in this configuration to produce the world record performance.

  • This result showed Sun and Oracle's integrated hardware and software stacks provide industry-leading performance.

More information on this benchmark will be posted in the next several days.

Performance Landscape

TPC-C results (sorted by tpmC, bigger is better)


System                           tpmC       Price/tpmC  Avail     Database        Cluster  Racks  w/KtpmC
12 x Sun SPARC Enterprise T5440  7,646,487  2.36 USD    12/14/09  Oracle 11g RAC  Y        9      9.6
IBM Power 595                    6,085,166  2.81 USD    12/10/08  IBM DB2 9.5     N        76     56.4
HP Integrity Superdome           4,092,799  2.93 USD    08/06/07  Oracle 10g R2   N        46     to be added

Avail - Availability date
w/KtpmC - Watts per 1000 tpmC
Racks - clients, servers, storage, infrastructure
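
The headline comparisons can be re-derived from the figures in the table. The sketch below shows that arithmetic. The function names are invented for the illustration, and the "percent better price/performance" claim is assumed to mean the relative reduction in $/tpmC.

```c
#include <stdio.h>

/* Percentage by which throughput a exceeds throughput b. */
double pct_faster(double a, double b)
{
  return (a / b - 1.0) * 100.0;
}

/* Percentage by which cost-per-unit `ours` undercuts `theirs`
   (assumed reading of "better price/performance"). */
double pct_cheaper(double ours, double theirs)
{
  return (1.0 - ours / theirs) * 100.0;
}

/* How many times smaller `ours` is than `theirs` (racks, watts per KtpmC). */
double times_less(double ours, double theirs)
{
  return theirs / ours;
}

/* Prints the Sun-versus-IBM comparison using the table's figures. */
void print_sun_vs_ibm(void)
{
  printf("throughput advantage: %.2f%%\n", pct_faster(7646487.0, 6085166.0));
  printf("price/tpmC reduction: %.2f%%\n", pct_cheaper(2.36, 2.81));
  printf("rack ratio:           %.2f\n", times_less(9.0, 76.0));
  printf("watts/KtpmC ratio:    %.2f\n", times_less(9.6, 56.4));
}
```

With the table's inputs these work out to roughly 26% more tpmC, 16% lower $/tpmC, about 8 times fewer racks, and about 5.9 times fewer watts per 1000 tpmC relative to the IBM result.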

Sun and IBM TPC-C Response times


System                            tpmC       Response Time, New Order 90th% (sec)  Response Time, New Order Average (sec)
12 x Sun SPARC Enterprise T5440   7,646,487  0.170                                 0.168
IBM Power 595                     6,085,166  1.69                                  1.22
Response Time Ratio - Sun Better             9.9x                                  7.3x

Sun uses the 7x figure, based on the average New Order response time, to highlight the difference in response times between Sun's solution and IBM's. Note, though, that Sun is nearly 10x faster on New Order response time at the 90th percentile.

It is also interesting to note that none of Sun's response times, average or 90th percentile, for any transaction is over 0.25 seconds, while IBM does not have even one interactive transaction, not even the menu, below 0.50 seconds. Graphs of Sun's and IBM's response times for New-Order can be found in the full disclosure reports on the TPC's website (TPC-C Official Result Page).

Results and Configuration Summary

Hardware Configuration:

    9 racks used to hold

    Servers:
      12 x Sun SPARC Enterprise T5440
      4 x 1.6 GHz UltraSPARC T2 Plus
      512 GB memory
      10 GbE network for cluster
    Storage:
      60 x Sun Storage F5100 Flash Array
      61 x Sun Fire X4275, Comstar SAS target emulation
      24 x Sun StorageTek 6140 (16 x 300 GB SAS 15K RPM)
      6 x Sun Storage J4400
      3 x 80-port Brocade FC switches
    Clients:
      24 x Sun Fire X4170, each with
      2 x 2.53 GHz X5540
      48 GB memory

Software Configuration:

    Solaris 10 10/09
    OpenSolaris 6/09 (COMSTAR) for Sun Fire X4275
    Oracle 11g Enterprise Edition with Real Application Clusters and Partitioning
    Tuxedo CFS-R Tier 1
    Sun Web Server 7.0 Update 5

Benchmark Description

TPC-C is an OLTP system benchmark. It simulates a complete environment where a population of terminal operators executes transactions against a database. The benchmark is centered around the principal activities (transactions) of an order-entry environment. These transactions include entering and delivering orders, recording payments, checking the status of orders, and monitoring the level of stock at the warehouses.

Disclosure Statement

TPC Benchmark C, tpmC, and TPC-C are trademarks of the Transaction Processing Performance Council (TPC). 12-node Sun SPARC Enterprise T5440 Cluster (1.6GHz UltraSPARC T2 Plus, 4 processor) with Oracle 11g Enterprise Edition with Real Application Clusters and Partitioning, 7,646,486.7 tpmC, $2.36/tpmC. Available 12/14/09. IBM Power 595 (5GHz Power6, 32 chips, 64 cores, 128 threads) with IBM DB2 9.5, 6,085,166 tpmC, $2.81/tpmC, available 12/10/08. HP Integrity Superdome (1.6GHz Itanium2, 64 processors, 128 cores, 256 threads) with Oracle 10g Enterprise Edition, 4,092,799 tpmC, $2.93/tpmC. Available 8/06/07. Source: www.tpc.org, results as of 11/5/09.

The Build Environment Effort has done a lot of analysis of how our current build process works to find out if and how we can improve the experience of building OpenOffice.org.

One of the things we took a look at is scalability. Currently two-way and four-way machines are standard developer hardware, but this will likely change as it will become more common to have more cores and hardware will become cheaper.

Parallelization in the current build process

There are two ways to use concurrent processes in the current build process:

  • Parallel build of source directories provided by build.pl
  • Parallel build of targets in one source directory provided by dmake

Unfortunately, these two ways of parallelization are completely independent and have no way of communicating with each other. If one wants to make sure each core on a four-way system gets used when possible one has to use both kinds of parallelization:
build --all -P4 -- -P4

If one did not specify the first -P4, only one directory would be built at a time, and the build would use fewer than four cores whenever that directory has fewer than four targets that can be built in parallel. If one did not specify the second -P4, the build would use fewer than four cores whenever dependencies leave fewer than four directories buildable.
However, when enough targets to build are available in both kinds of parallelization, there will be 16 processes running. On Linux, this "overload" alone does not severely slow down the build.

For a current four-way system parallelization is not too bad however:

  • a -P4 -- -P4 build takes only 16% longer than a quarter of the single-process build time
  • a -P9 -- -P1 build takes only 21% longer than a quarter of the single-process build time
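
Reading these figures as build times of 1.16 and 1.21 times the ideal quarter of the single-process time, they correspond to speedups of roughly 3.4x and 3.3x on four cores. A small sketch of that arithmetic follows. The function names are invented for the illustration, and the interpretation of "slower" is an assumption.

```c
/* Speedup over a single-process build when the parallel build takes
   (1 + overhead) times the ideal 1/cores share of the single-process time. */
double speedup(int cores, double overhead)
{
  return cores / (1.0 + overhead);
}

/* Fraction of the ideal linear speedup that is actually achieved. */
double efficiency(int cores, double overhead)
{
  return speedup(cores, overhead) / cores;
}
```

For example, efficiency(4, 0.16) is about 0.86 and efficiency(4, 0.21) is about 0.83, so both configurations keep the four cores busy for well over four fifths of the ideal.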

But when you have 20 cores (with distcc or in the not too distant future) you would have a maximum of 400 processes running and that would slow down the build. Also, the build system has no control over the priorities of the 400 jobs and thus cannot put the ones with the most dependencies first. Thus, the build will be slower, because targets with no or few dependencies are "stealing" CPU-time from more important targets with more dependencies.
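
The worst-case process counts quoted here are simply the product of the two independent -P settings. As a hedged sketch, with function names made up for the illustration:

```c
/* Worst case: each of the `dirs` directory-level jobs started by build.pl
   runs a dmake that itself starts `targets` jobs. */
int max_processes(int dirs, int targets)
{
  return dirs * targets;
}

/* Runnable processes per core in that worst case. */
double oversubscription(int dirs, int targets, int cores)
{
  return (double)max_processes(dirs, targets) / cores;
}
```

With -P4 -- -P4 on four cores that is 16 processes, or 4 per core. With -P20 -- -P20 on 20 cores it is 400 processes, or 20 per core, which is why the overload grows with the machine rather than staying constant.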

Here is a visualization of the number of dmakes running in a -P9 -- -P1 build:

visualization of parallelization with -P9 -- -P1

Here is a visualization of the number of dmakes running in a -P4 -- -P4 build:

visualization of parallelization with -P4 -- -P4

Note that 20 or more dmake processes can start and die within one second, and the diagram uses only the last state change in each second. So if N-1 processes are shown running for a -PN build, it is likely that build.pl was just spawning a process at the tick of the second.

Build Bottlenecks

To identify the bottlenecks in the build process one has to track the number of processes over the time of a build.

Here is a diagram showing the number of parallel dmakes in a -P9 -- -P1 build:

P9P1-Timeline

It shows the number of dmakes running and the modules which are being built at the given point in time. The bar representing a module starts at the point in time when it is "announced", i.e. when it is buildable because all its dependencies are there. The bar ends at the point in time when the module was delivered to the solver. Note that the start of the bar does not per se mean that a process is working on the module: for example, a lot of modules depend on svx, and not every module will get a process right after svx has been delivered.

One thing easily identified by examining the diagram is a "critical path": a sequence of modules, where each module follows the dependency of itself that was delivered last:

(stlport ->) soltools -> xml2cmp -> sal -> salhelper -> registry ->
idlc -> udkapi -> offapi -> offuh -> cppu -> cppuhelper ->
jvmfwk -> stoc -> i18npool -> tools -> unotools -> sot -> vcl -> toolkit -> svtools ->
framework -> basic -> sfx2 -> avmedia -> drawinglayer -> svx -> formula -> sc ->
postprocess -> packimages -> instsetoo_native

One can see how the build process "dries out" quite often along this path as modules wait for their dependencies to be delivered. These are the bottlenecks of the build. Stlport was not used in this build, but had it been used it would have been another bottleneck.
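
The critical path was read off the diagram here, but the same idea can be computed from a dependency list and per-module build times. The following is a minimal sketch under stated assumptions: the struct, the function name and the MAX_DEPS limit are invented for the illustration, the modules are assumed to be listed after all of their prerequisites, and an unlimited number of processes is assumed. It is not how build.pl works.

```c
#include <stddef.h>

#define MAX_DEPS 4

/* Hypothetical module record: name, build time, prerequisite indices. */
struct module {
  const char *name;
  double build_secs;
  int deps[MAX_DEPS];   /* indices into the module array, -1 ends the list */
};

/* Computes the finish time of every module, assuming unlimited processes
   and that each module is listed after all of its prerequisites.
   pred[i] receives the prerequisite of module i that was delivered last
   (-1 if it has none). Returns the index of the module that finishes
   last, i.e. the end of the critical path, or -1 if n is 0. */
int critical_path(const struct module *m, size_t n, double *finish, int *pred)
{
  int last = -1;
  for (size_t i = 0; i < n; i++) {
    double ready = 0.0;
    pred[i] = -1;
    for (int k = 0; k < MAX_DEPS && m[i].deps[k] >= 0; k++) {
      int d = m[i].deps[k];
      if (finish[d] > ready) {
        ready = finish[d];
        pred[i] = d;
      }
    }
    finish[i] = ready + m[i].build_secs;
    if (last < 0 || finish[i] > finish[last])
      last = (int)i;
  }
  return last;
}
```

Following pred[] from the returned index back to -1 lists the critical path in reverse order. Each step picks "the dependency that was delivered last", which matches the definition used above.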

Conclusion

Currently parallelization is not as bad as one might have expected for full builds on a regular developer workstation running Linux. However, the comparison of -P9 and -P4 builds shows that the current build system has scalability limitations that will become more noticeable as systems with higher parallelization become more common. Next, we will present the same analysis for builds on the Windows platform, where builds are traditionally much slower.


I'm a great fan of the hardware performance counters that you find on most processors. Often you can look at the profile and instantly identify what the issue is. Sometimes though, it is not obvious, and that's where the performance counters can really help out.

I was looking at one such issue last week: the performance of the application was showing some variation, and it wasn't immediately obvious what the cause was. The usual suspects in these cases are:

  • Excessive system time
  • Process migration
  • Memory placement
  • Page size
  • etc.

Unfortunately, none of these seemed to explain the issue. So I hacked together the following script, cputrackall, which ran the test code under cputrack for all the possible performance counters. I dumped the output into a spreadsheet and compared the fast and slow runs of the app. This is something of a "fishing trip" script, just gathering as much data as possible in the hope that something leaps out, but sometimes that's exactly what's needed. I regularly get to sit in front of a new chip before tools like ripc have been ported, and in those situations the easiest thing to do is to look for hardware counter events that might explain the runtime performance. In this particular instance, it helped me to confirm my suspicion that a difference in branch misprediction rates was causing the issue.

Here is a BestPerf blog index to a variety of benchmarks announced at Oracle Open World and others talked about at the conference.

ORACLEOPENWORLD

CMT Servers

Oct 11, 2009 * TPC-C World Record Sun - Oracle *
Oct 13, 2009 Sun T5440 Oracle BI EE Sun T5440 World Record
Oct 13, 2009 SPECweb2005 Sun T5440 World Record, Solaris Containers and Sun Storage F5100
Sep 01, 2009 String Searching - Sun T5240 & T5440 Outperform IBM Cell Broadband Engine
Aug 27, 2009 Sun T5240 Beats 4-Chip IBM Power 570 POWER6 System on SPECjbb2005
Aug 26, 2009 Sun T5220 Sets Single Chip World Record on SPECjbb2005
Aug 12, 2009 SPECmail2009 on Sun T5240 and Sun Java System Messaging Server 6.3
Jul 23, 2009 World Record Performance of Sun CMT Servers
Jul 22, 2009 Why does 1.6 beat 4.7?
Jul 21, 2009 Zeus ZXTM Traffic Manager World Record on Sun T5240
Jul 21, 2009 Sun T5440 World Record SAP-SD 4-Processor Two-tier SAP ERP 6.0 EP4 (Unicode)

SPARC64 Servers

Oct 13, 2009 SAP 2-tier SD Benchmark on Sun M9000/32 SPARC64 VII
Oct 13, 2009 Oracle PeopleSoft Payroll Sun M4000 and Sun Storage F5100 World Record Performance
Oct 12, 2009 Best Practices: M4000 Sun Storage F5100 is a good option for Peoplesoft Payroll
Oct 13, 2009 Oracle Hyperion Sun M5000 and Sun Storage 7410
Oct 13, 2009 SPECcpu2006 Results On MSeries Servers, New SPARC64 VII

X86 Servers

Oct 13, 2009 SAP 2-tier SD-Parallel on Sun Blade X6270 1-node, 2-node and 4-node
Aug 28, 2009 Sun X4270 World Record SAP-SD 2-Processor Two-tier SAP ERP 6.0 EP 4 (Unicode)
Oct 02, 2009 Sun X4270 VMware VMmark benchmark achieves excellent result
Sep 22, 2009 Sun X4270 Virtualized for Two-tier SAP ERP 6.0 EP4 (Unicode) Standard Sales and Distribution Benchmark

HPC Benchmarks

Oct 13, 2009 Halliburton ProMAX Oil & Gas Appl on Sun 6048/X6275 Cluster and Oracle Database
Oct 13, 2009 MCAE ABAQUS faster on Sun F5100 and Sun X4270 - Single Node World Record
Oct 12, 2009 MCAE ANSYS faster on Sun F5100 and Sun X4270
Oct 12, 2009 MCAE MCS/NASTRAN faster on Sun F5100 and Fire X4270
Oct 13, 2009 CP2K Life Sciences, Ab-initio Chem - Sun C48 with Sun Blade X6275 - QDR InfiniBand
Oct 09, 2009 X6275 Cluster Demonstrates Performance and Scalability on WRF 2.5km CONUS Dataset

Specific Storage Benchmarks

Oct 12, 2009 SPC-2 Sun Storage 6180 RAID 5 & RAID 6 Over 70% Better $/Performance than IBM
Oct 12, 2009 SPC-1 Sun Storage 6180 Over 70% Better $/Performance than IBM
Oct 12, 2009 1.6 Million 4K IOPS in 1RU on Sun Storage F5100 Flash Array

Additional CMT Server Benchmarks

Jul 21, 2009 1.6 GHz SPEC CPU2006 - Rate Benchmarks
Jul 21, 2009 Sun Blade T6320 World Record SPECjbb2005 performance
Jul 21, 2009 Sun T5440 SPECjbb2005 Beats IBM POWER6 Chip-to-Chip

The Sun SPARC Enterprise T5440 server with 1.6GHz UltraSPARC T2 Plus processors, running Solaris Containers, Sun Flash Open Storage, and Sun Java System Web Server 7.0 Update 5, achieved a world-record SPECweb2005 result.

  • Sun has obtained a world-record SPECweb2005 performance result of 100,209 SPECweb2005 on the Sun SPARC Enterprise T5440, running Solaris 10 10/09, Sun Java System Web Server 7.0 Update 5, and Java Hotspot™ Server VM.

  • This result demonstrates performance leadership of the Sun SPARC Enterprise T5440 server and its scalability, by using Solaris Containers to consolidate multiple web serving environments, and Sun OpenStorage Flash technology to store large datasets for fast data retrieval.

  • The Sun SPARC Enterprise T5440 delivers 21% greater SPECweb2005 performance than the HP DL370 G6 with 3.2GHz Xeon W5580 processors.

  • The Sun SPARC Enterprise T5440 delivers 40% greater SPECweb2005 performance than the HP DL 585 G5 with four 3.114 GHz Opteron 8393 SE processors.

  • The Sun SPARC Enterprise T5440 delivers 2x the SPECweb2005 performance of the HP DL 580 G5 with four 2.66GHz Xeon X7460 processors.

  • There are no IBM Power6 results on the SPECweb2005 benchmark.

  • This benchmark result clearly demonstrates that the Sun SPARC Enterprise T5440 running Solaris 10 10/09 and Sun Java System Webserver 7.0 Update 5 can support thousands of concurrent web server sessions and is an industry leader in web serving with a Sun solution.

Performance Landscape

Server       Processor       SPECweb2005  Banking*  Ecomm*   Support*  Webserver       OS
Sun T5440    4x 1.6 T2 Plus  100,209      176,500   133,000  95,000    Java WebServer  Solaris
HP DL370 G6  2x 3.2 W5580    83,073       117,120   142,080  76,352    Rock            RedHat Linux
HP DL585 G5  4x 3.11 O8393   71,629       117,504   123,072  56,320    Rock            RedHat Linux
HP DL580 G5  4x 2.66 X7460   50,013       97,632    69,600   40,800    Rock            RedHat Linux

* Banking - SPECweb2005-Banking
   Ecomm - SPECweb2005-Ecommerce
   Support - SPECweb2005-Support

Results and Configuration Summary

Hardware Configuration:

  1 Sun SPARC Enterprise T5440 with

  • 4 x UltraSPARC T2 Plus processors, each with 8 cores and 64 threads, 1.6 GHz
  • 254 GB memory
  • 6 x 4Gb PCI Express 8-Port Host Adapter (SG-XPCIE8SAS-E-Z)
  • 1 x Sun Storage F5100 Flash Array (TA5100RASA4-80AA)
  • 1 x Sun Storage F5100 Flash Array (TA5100RASA4-40AA)

Server Software Configuration:

  • Solaris 10 10/09
  • JAVA System Web Server 7.0 Update 5
  • Java Hotspot™ Server VM

Network configuration:

  • 1 x Arista DCS-7124s 24-10GbE port  switch
  • 1 x Cisco 2970 series (WS-C2970G-24TS-E) switch for the three 1 GbE networks

Back-end Simulator:

  1 Sun Fire X4270 with

  • 2 x 2.93 GHz Intel X5570 Quad core
  • 48GB memory
  • Solaris 10 10/09
  • JSWS 7.0 Update 5
  • Java Hotspot™ Server VM

Clients:

  8 Sun Blade™ T6320

  • 1 x 1.417 GHz UltraSPARC-T2
  • 64 GB memory
  • Solaris 10 5/09
  • Java Hotspot™ Server VM

  8 Sun Blade™ 6270

  • 2 x 2.93 GHz Intel X5570 Quad core
  • 36 GB memory
  • Solaris 10 5/09
  • Java Hotspot™ Server VM

Benchmark Description

SPECweb2005, successor to SPECweb99 and SPECweb99_SSL, is an industry standard benchmark for evaluating Web Server performance developed by SPEC. The benchmark simulates multiple user sessions accessing a Web Server and generating static and dynamic HTTP requests. The major features of SPECweb2005 are:

  • Measures simultaneous user sessions
  • Dynamic content: currently PHP and JSP implementations
  • Page images requested using 2 parallel HTTP connections
  • Multiple, standardized workloads: Banking (HTTPS), E-commerce (HTTP and HTTPS), and Support (HTTP)
  • Simulates browser caching effects
  • File accesses more accurately simulate today's disk access patterns

Key Points and Best Practices

  • The server was divided into four Solaris Containers and a single web server instance was executed in each container.
  • Four processor sets were created (with varying numbers of threads depending on the workload) to run the web server in. This was done to reduce memory access latency using the physical memory closest to the processor.  All interrupts were run on the remaining threads.
  • Each web server is executed in the FX scheduling class to improve performance by reducing the frequency of context switches.
  • Two Sun Storage F5100 Flash Arrays (holding the target file set and logs) were shared by the four containers  for fast data retrieval.   
  • Use of Solaris Containers highlights the consolidation of multiple web serving environments on a single server.
  • Use of the Sun Ext I/O Expansion unit and Sun Storage F5100 Flash Arrays highlight the expandability of the server.

Disclosure Statement

Sun SPARC Enterprise T5440 (8 cores, 1 chip) 100,209 SPECweb2005, was submitted to SPEC for review on October 13, 2009. HP ProLiant DL370 G6 (8 cores, 2 chips) 83,073 SPECweb2005. HP ProLiant DL585 G5 (16 cores, 4 chips) 71,629 SPECweb2005. HP ProLiant DL580 G5 (24 cores, 4 chips) 50,013 SPECweb2005. SPEC and SPECweb are registered trademarks of the Standard Performance Evaluation Corporation. Results from www.spec.org as of Oct 10, 2009.

    Just released:

    HPC Profiling with the Sun Studio Performance Tools
    Marty Itzkowitz and Yukon Maruyama (Sun Microsystems) describe how to use the Sun Studio Performance Tools to understand the performance issues in single-threaded, multi-threaded, OpenMP, and MPI applications, and the techniques used to profile them. This paper was presented at the Third Parallel Tools Workshop held in Dresden, Germany, in September.

    The link to the article is:

    http://developers.sun.com/sunstudio/documentation/techart/hpc_profiling.pdf

    TPC-C Sun SPARC Enterprise T5440 with Oracle RAC World Record Database Result

    Sun and Oracle demonstrate the World's fastest database performance. Sun Microsystems using 12 Sun SPARC Enterprise T5440 servers, 60 Sun Storage F5100 Flash arrays and Oracle 11g Enterprise Edition with Real Application Clusters and Partitioning delivered a world-record TPC-C benchmark result.

    • The 12-node Sun SPARC Enterprise T5440 server cluster delivered a world record TPC-C benchmark result of 7,646,486.7 tpmC at $2.36/tpmC (USD) using Oracle 11g R1 on a configuration available 12/14/09.

    • The 12-node Sun SPARC Enterprise T5440 server cluster beats the performance of the IBM Power 595 (5GHz) with IBM DB2 9.5 database by 26% and has 16% better price/performance on the TPC-C benchmark.

    • The complete Oracle/Sun solution delivered 10.7x better computational density than the IBM configuration (computational density = performance/rack).

    • The complete Oracle/Sun solution used 8 times fewer racks than the IBM configuration.

    • The complete Oracle/Sun solution has 5.9x better power/performance than the IBM configuration.

    • The 12-node Sun SPARC Enterprise T5440 server cluster beats the performance of the HP Superdome (1.6GHz Itanium2) by 87% and has 19% better price/performance on the TPC-C benchmark.

    • The Oracle/Sun solution utilized Sun FlashFire technology to deliver this result. The Sun Storage F5100 flash array was used for database storage.

    • Oracle 11g Enterprise Edition with Real Application Clusters and Partitioning scales and effectively uses all of the nodes in this configuration to produce the world record performance.

    • This result showed Sun and Oracle's integrated hardware and software stacks provide industry-leading performance.

    More information on this benchmark will be posted in the next several days.

    Performance Landscape

    TPC-C results (sorted by tpmC, bigger is better)


    System                           tpmC       Price/tpmC  Avail     Database        Cluster  Racks  w/KtpmC
    12 x Sun SPARC Enterprise T5440  7,646,487  2.36 USD    12/14/09  Oracle 11g RAC  Y        9      9.6
    IBM Power 595                    6,085,166  2.81 USD    12/10/08  IBM DB2 9.5     N        76     56.4
    Bull Escala PL6460R              6,085,166  2.81 USD    12/15/08  IBM DB2 9.5     N        71     56.4
    HP Integrity Superdome           4,092,799  2.93 USD    08/06/07  Oracle 10g R2   N        46     to be added

    Avail - Availability date
    w/KtpmC - Watts per 1000 tpmC
    Racks - clients, servers, storage, infrastructure
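
The comparative claims in the bullets above can be re-derived from the table. The following is a minimal sketch in Python; the dictionary layout and the helper names are illustrative assumptions, and only the figures themselves come from the published results. Values computed this way may differ slightly from the quoted ones because the published inputs are rounded.

```python
# Figures copied from the TPC-C table above; the data layout and helper
# names are illustrative, not part of any benchmark kit.
results = {
    "sun": {"tpmC": 7_646_487, "price": 2.36, "racks": 9, "watts_per_ktpmc": 9.6},
    "ibm": {"tpmC": 6_085_166, "price": 2.81, "racks": 76, "watts_per_ktpmc": 56.4},
    # Power data for the HP result is listed as "to be added" in the table.
    "hp": {"tpmC": 4_092_799, "price": 2.93, "racks": 46, "watts_per_ktpmc": None},
}


def perf_gain_pct(new, old):
    """Percentage by which the throughput of `new` exceeds that of `old`."""
    return (new["tpmC"] / old["tpmC"] - 1.0) * 100.0


def price_perf_gain_pct(new, old):
    """Percentage reduction in $/tpmC (lower price per tpmC is better)."""
    return (1.0 - new["price"] / old["price"]) * 100.0


def density(result):
    """Computational density as defined above: performance per rack."""
    return result["tpmC"] / result["racks"]


sun, ibm, hp = results["sun"], results["ibm"], results["hp"]

print(f"vs IBM: +{perf_gain_pct(sun, ibm):.1f}% tpmC, "
      f"{price_perf_gain_pct(sun, ibm):.1f}% better price/performance")
print(f"vs HP:  +{perf_gain_pct(sun, hp):.1f}% tpmC, "
      f"{price_perf_gain_pct(sun, hp):.1f}% better price/performance")
print(f"rack ratio (IBM/Sun): {ibm['racks'] / sun['racks']:.1f}x")
print(f"density ratio (Sun/IBM): {density(sun) / density(ibm):.1f}x")
print(f"power ratio (IBM/Sun): {ibm['watts_per_ktpmc'] / sun['watts_per_ktpmc']:.1f}x")
```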

    Results and Configuration Summary

    Hardware Configuration:

      9 racks used to hold

      Servers:
        12 x Sun SPARC Enterprise T5440
        4 x 1.6 GHz UltraSPARC T2 Plus
        512 GB memory
        10 GbE network for cluster
      Storage:
        60 x Sun Storage F5100 Flash Array
        61 x Sun Fire X4275, Comstar SAS target emulation
        24 x Sun StorageTek 6140 (16 x 300 GB SAS 15K RPM)
        6 x Sun Storage J4400
        3 x 80-port Brocade FC switches
      Clients:
        24 x Sun Fire X4170, each with
        2 x 2.53 GHz X5540
        48 GB memory

    Software Configuration:

      Solaris 10 10/09
      OpenSolaris 6/09 (COMSTAR) for Sun Fire X4275
      Oracle 11g Enterprise Edition with Real Application Clusters and Partitioning
      Tuxedo CFS-R Tier 1
      Sun Web Server 7.0 Update 5

    Benchmark Description

    TPC-C is an OLTP system benchmark. It simulates a complete environment where a population of terminal operators executes transactions against a database. The benchmark is centered around the principal activities (transactions) of an order-entry environment. These transactions include entering and delivering orders, recording payments, checking the status of orders, and monitoring the level of stock at the warehouses.

    POSTSCRIPT: Here are some comments on IBM's grasping-at-straws perf/core attacks on the TPC-C result:
    c0t0d0s0 blog: "IBM's Reaction to Sun&Oracle TPC-C"

    Disclosure Statement

    TPC Benchmark C, tpmC, and TPC-C are trademarks of the Transaction Processing Performance Council (TPC). 12-node Sun SPARC Enterprise T5440 Cluster (1.6GHz UltraSPARC T2 Plus, 4 processor) with Oracle 11g Enterprise Edition with Real Application Clusters and Partitioning, 7,646,486.7 tpmC, $2.36/tpmC. Available 12/14/09. IBM Power 595 (5GHz Power6, 32 chips, 64 cores, 128 threads) with IBM DB2 9.5, 6,085,166 tpmC, $2.81/tpmC, available 12/10/08. HP Integrity Superdome (1.6GHz Itanium2, 64 processors, 128 cores, 256 threads) with Oracle 10g Enterprise Edition, 4,092,799 tpmC, $2.93/tpmC. Available 8/06/07. Source: www.tpc.org, results as of 10/11/09.

    Less than two months ago, Sun Microsystems published an Oracle Business Intelligence benchmark with the best single system performance of 28,000 concurrent BI EE users at ~75% CPU utilization. Sun and Oracle Corporation announced another Oracle Business Intelligence benchmark result today with two identical T5440 servers in the Oracle BI Cluster serving 50,000 concurrent BI EE users.

    An Oracle white paper with Sun's 50,000 user benchmark results can be accessed from Oracle's Business Intelligence web.

    The hardware specifications for each of the T5440s are similar to the hardware that was used in the prior benchmark effort on a single T5440 server. This time, however, the Presentation Catalog (also frequently referred to as the Web Catalog) was moved to a T5220 server where the NFS server was running. Besides this, the only other change from the earlier 28,000 user benchmark exercise is the addition of another T5440 to the test rig.

    The following graph shows the scalability of the application from one node to four nodes to eight nodes running on T5440 servers.

    OBIEE on T5440 : Scalability Graph

    Without further ado, here is the summary of the benchmark results along with their significance and some interesting facts:

    • One of the major goals of this benchmark effort is to show the horizontal and vertical scalability of the application (OBIEE) by highlighting the superior performance and the resilience of the underlying hardware (T5440) and the operating system (Solaris). Needless to say the goal has been met.

    • Another goal of this benchmark is to show a decent number of concurrent BI EE users executing transactions with good response times. Since we had already shown the maximum load that can be achieved on a single BI instance (7,500 users) and on a single T5440 server running multiple BI instances (28,000 users), this time we did not attempt to reach the peak number that can be achieved from the two T5440 servers in the benchmark environment. Now that there is an additional server in the test setup taking care of the Presentation Catalog and the database server, 2 * 28,000 = 56,000 BI EE users would have been an achievable target -- but we opted to stop at the "magic" and "respectable" number of 50,000 instead.

    • The entire benchmark run lasted about 9 hours 45 minutes, of which 8 hours were ramp-up time during which the 50,000 BI virtual users logged into the application a few users at a time. The LoadRunner tool reported only 4 errors for the entire duration of the run, and there were zero errors in the 60 minute steady state period during which the statistics reported in the document were collected.

    • Two Sun SPARC Enterprise T5440 servers each with 4 x 8-Core 1.6 GHz UltraSPARC T2 Plus processors delivered the best performance of 50,000 concurrent BI EE users at around 63% CPU utilization.

    • The BI EE Cluster was deployed on two T5440 servers running the Solaris 10 5/09 operating system. All the nodes in the BI Cluster were consolidated onto the two T5440 servers using the free and efficient Solaris Containers virtualization technology.

    • The Presentation Catalog was hosted on a ZFS-powered file system that was created on top of four internal Solid State Drive (SSD) disks. The Catalog was shared among all eight BI nodes in the cluster as an NFS share. One T5220 server powered by an 8-core 1.2 GHz UltraSPARC T2 processor was used to run the NFS server. Because database activity was minimal, the Oracle 11g database was also hosted on the same server. Solaris 10 5/09 is the operating system.

    • Solid State Drive (SSD) disks with ZFS file system showed significant I/O performance improvement over traditional disks for the Presentation Catalog activity. In addition, ZFS helped get past the UFS limitation of 32767 sub-directories in a Presentation Catalog directory.

    • Caching was turned ON at the application server, which led to minimal database activity on the server. Note that the caching mechanism was turned ON in the prior benchmark exercise as well.

    • The low-end CoolThreads CMT Server T5220 and the mid-range T5440 server once again proved to be ideal candidates to deploy and run multi-threaded workloads by exhibiting resilient performance when handling a large number of simultaneous requests from 50,000 BI EE virtual users. The T5220 handled a large number of concurrent asynchronous read/write requests from eight different NFS clients.

    • NFS v3 was configured at the NFS Server as well as at the NFS Client nodes. NFS version 4 is the default on Solaris 10, and it might have worked as expected. However, a handful of bug reports prompted us to go with the more mature and less buggy version 3.

    • 3283 watts is the average power consumption when all the 50,000 concurrent BI users are in the steady state of the benchmark test. That is, in the case of similarly configured workloads, the T5440 server supports 15.2 users per watt of energy consumed and supports 5,000 users per rack unit.

    • A summary of the results with system-wide averages of CPU and memory utilization is shown below. The latest results are in the last row.

      #Vusers  Clustered  #BI Nodes  #CPU  #Core  RAM     CPU Util  Memory Used  Avg Trx Response Time  #Trx/sec
      7,500    No         1          1     8      32 GB   72.85%    18.11 GB     0.22 sec               155
      28,000   Yes        4          4     32     128 GB  75.04%    76.16 GB     0.25 sec               580
      50,000   Yes        8          8     64     256 GB  63.32%    172.21 GB    0.28 sec               1031
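
To see how throughput scales from one BI node to eight, and where the users-per-watt figure comes from, the numbers above can be worked through as follows. This is a rough sketch in Python: the tuple layout and function names are illustrative assumptions, and the efficiency metric (per-node throughput relative to the single-node run) is just one reasonable way to read the table. Keep in mind that the 50,000-user run was deliberately not pushed to its peak, so the efficiency it yields understates what the hardware could do.

```python
# Rows copied from the results table above:
# (virtual users, BI nodes, cores, transactions per second).
runs = [
    (7_500, 1, 8, 155),
    (28_000, 4, 32, 580),
    (50_000, 8, 64, 1031),
]

AVG_POWER_WATTS = 3283  # average steady-state power quoted for the 50,000-user run


def per_node(run):
    """Return (users per BI node, transactions/sec per BI node)."""
    users, nodes, _cores, tps = run
    return users / nodes, tps / nodes


def scaling_efficiency(run, baseline):
    """Per-node throughput relative to the baseline run (1.0 = linear scaling)."""
    return per_node(run)[1] / per_node(baseline)[1]


for run in runs:
    users_per_node, tps_per_node = per_node(run)
    eff = scaling_efficiency(run, runs[0])
    print(f"{run[0]:>6} users: {users_per_node:6.0f} users/node, "
          f"{tps_per_node:6.1f} trx/s/node, efficiency {eff:.2f}")

users_per_watt = runs[-1][0] / AVG_POWER_WATTS
print(f"users per watt: {users_per_watt:.1f}")
```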

    TOPOLOGY DIAGRAM

    The topology diagram in the benchmark results white paper is almost illegible. Here is the original topology diagram that was inserted into the white paper.

    OBIEE on T5440 : 50K User Benchmark Topology

    Quite frankly I'm not very proud of this drawing -- but that's the best that I could come up with in a short span. Rather than showing the flow of communication between each and every component in the benchmark setup, I simplified the drawing by introducing a "black box" sort of thing - "private network" - in the middle, which protected the drawing from getting messy.


    CPU USAGE GRAPH

    The following two-dimensional graph shows the CPU utilization patterns at all 3 nodes in the benchmark setup for the 60 minute steady state of the benchmark run. This graph was generated using the free GNUplot tool with sar data as the inputs.

    OBIEE on T5440 : 50K User Benchmark CPU Usage Graph

    COMPETITIVE LANDSCAPE

    And finally here is a quick summary of all the results that are published by different vendors so far with similar benchmark kit. Feel free to draw your own conclusions. All this is public information. Check the corresponding benchmark reports by clicking on the URLs under the "#Users" column.

    Server                                      Processors                                        #Users  OS
                                                Chips  Cores  Threads  GHz  Type
      2 x Sun SPARC Enterprise T5440 (APP)      8      64     512      1.6  UltraSPARC T2 Plus    50,000  Solaris 10 5/09
      + 1 x Sun SPARC Enterprise T5220 (NFS,DB) 1      8      64       1.2  UltraSPARC T2
      1 x Sun SPARC Enterprise T5440            4      32     256      1.6  UltraSPARC T2 Plus    28,000  Solaris 10 5/09
      5 x Sun Fire T2000                        1      8      32       1.2  UltraSPARC T1         10,000  Solaris 10 11/06
      3 x HP DL380 G4                           2      4      4        2.8  Intel Xeon            5,800   OEL
      1 x IBM x3755                             4      8      8        2.8  AMD Opteron           4,000   RHEL4


    Before you go, do not forget to check the best practices for configuring / deploying Oracle Business Intelligence on top of Solaris 10 running on Sun CMT hardware.

    Related Blog Posts:
    T5440 Rocks [again] with Oracle Business Intelligence Enterprise Edition Workload

    Earlier in the summer I recorded a slidecast on using the Performance Analyzer on parallel codes; it has just come out on the HPC portal.

    It was my pleasure to be invited as a panelist in a discussion on Energy Efficiency at the 2nd International Conference on Climate Change (http://www.iccc2009.org/), held on 7-9 October 2009. The conference brought together many decision makers from government and business to share their thoughts on how to address climate change in concrete ways. This was reported in the local newspaper.

    Regarding energy efficiency and Green House Gas (GHG) emission reduction, Sun has always been a pioneer and an active promoter in the computer industry (see here), and I delivered the following key messages in the panel discussion from the computer industry's perspective:

    • Eco-friendly CPUs and servers - UltraSPARC T2 Plus and T-Series servers. You can find an HSBC endorsement here describing how this kind of server can increase throughput by 3 times while saving 30% on energy and cooling.
    • Datacenter redesign - Improving airflow, replacing old servers with energy-efficient ones, and leveraging virtualization or consolidation of servers and storage can save up to 40-50% in energy consumption and electricity costs, according to live cases in our datacenters in the US and Europe as well as deployments at our customers in Malaysia and Switzerland. To learn more, click here.
    • SSD and tape to tackle storage growth - One study shows 50% annual growth in new data, and the equipment needed to hold that data will consume significant power accordingly. SSD can save around 40% in power per GB and 98% in power per I/O. Tape is a good archiving solution when combined with our sophisticated tape libraries, which use it as a data storage/retrieval medium instead of as dump backup media. Some large banks and hospitals in Hong Kong use this technology to store their huge volumes of CT-scan images, compliance documents and unstructured data. One important difference between tape and disk for archived data is that tape requires no power or cooling for the data stored on it, whereas disks always require power and cooling even when they are idle, and they have a shorter life cycle. A recent archiving system deployed in Hong Kong by our storage specialist, Chan Tsz Hong, uses both disks and tapes to store a large volume of unstructured data. Click here for more details on this solution.

    At the conference, most of the speakers and panelists were from the construction, transportation or finance industries, and I was almost the only one from the IT/computer industry. What I can do is raise awareness among the non-IT audience so that they pay more attention to environment-related metrics in their upcoming IT projects and budgets instead of focusing purely on dollar amounts. Questions such as server utilization, electricity costs, cooling requirements, eco-friendly computer products, datacenter design and modular rack/cooling design should be included in system and project requirements, and the business side should actively monitor them to control GHG emissions from their IT systems.

    Taking a look at the Oracle Processor Core Factor Table, I noticed that on September 24 our friends at Oracle reduced the per-core licensing factor on UltraSPARC T2+ systems. This includes the Sun SPARC Enterprise T5140, T5240, T5440 and T6340 (blade). If you will permit the pun, this is very cool news indeed.

    Barbara Hutchings presents on how Ansys parallel performance provides faster turnaround time and the ability to run bigger, more detailed models.

    SUN CMT SERVERS SET NEW WORLD RECORDS IN ENTERPRISE APPLICATIONS

    THE SUN SPARC ENTERPRISE T5220 SERVER IS 2.6 TIMES FASTER THAN IBM'S POWER6 SYSTEM ON ENTERPRISE JAVA WORKLOADS; THE SUN SPARC ENTERPRISE T5240 SERVER SUPPORTS 2 TIMES MORE CORPORATE EMAIL USERS THAN THE INTEL XEON-BASED SYSTEM

    • Three new world records on email service and Java benchmarks.
    • Enterprise-class performance and management cost savings from Sun's Open Network Systems.
    • Best JVM result among all systems with 32 cores or fewer.
    • High capacity of SPARC Enterprise servers for corporate email services.

    SANTA CLARA, California. August 26, 2009. Sun Microsystems, Inc. (NASDAQ:JAVA) announced that its SPARC Enterprise servers http://www.sun.com/servers/coolthreads/overview/index.jsp with chip multi-threading (CMT) technology set three new world records on industry-standard email service and Java benchmarks. These records highlight the enterprise-class performance and the significant infrastructure and management cost savings of Sun's Open Network Systems with the Solaris Operating System, delivered through the convergence of computing, networking and storage. Sun also announced outstanding performance on the Oracle Business Intelligence Suite Enterprise Edition (BI Suite EE) test and with the Zeus Extensible Traffic Manager (ZXTM).

    The results were:
    *SPECjbb2005* - On this industry-standard benchmark, which emulates a real-world Java application server design, the Sun SPARC Enterprise T5220 server running the open-source OpenSolaris 2009.06 release and the latest version of Sun's Java Platform, Standard Edition (Java SE) software (version 1.6.0_14 Performance Release) set a single-chip world record. The SPARC Enterprise T5220 server with one 1.6 GHz UltraSPARC T2 processor delivered 2.6 times better performance than the IBM p570 with one 4.7 GHz POWER6 processor and consumed only 520 watts on average (1). In addition, the Sun SPARC Enterprise T5440 server with four 1.6 GHz UltraSPARC T2 Plus processors delivered the best JVM result among all systems with 32 cores or fewer (2).

    *SPECmail2009* - The Sun SPARC T5240 server, powered by two eight-core 1.6 GHz UltraSPARC T2 Plus processors, outperformed all other messaging server solutions and set a new SPECmail2009 world record with 12,000 SPECmail_Ent2009 IMAP4 users at 57,758 sessions/hour. The SPARC Enterprise T5240 server running Sun Java System Messaging Server 6.3 on top of the Solaris operating system doubled the sessions/hour and supported 2 times more users than the result published for Apple's Xserve server. The Sun solution with Sun StorageTek 2540 arrays also used 10 percent fewer disks than Apple's Intel Xeon-based solution (3). Designed to simulate a real corporate mail environment, this benchmark result highlights the capacity of SPARC Enterprise T5240 servers for email service in a large enterprise environment using industry-standard protocols.

    *Oracle Business Intelligence Suite Enterprise Edition* - The SPARC Enterprise T5440 server with 1.6 GHz UltraSPARC T2 Plus processors, Solaris Containers and Oracle 11g database software delivered impressive performance with 28,000 concurrent users on the Oracle Business Intelligence Suite Enterprise Edition test. The test of 280,000 named users with 10 percent participation simulates an organization that needs to support a large number of simultaneous users, each performing a variety of tasks such as ad-hoc reporting, application development and report viewing. The business processes in the test closely represent a real scenario with a fully populated underlying database schema.

    This result demonstrates the ability of the SPARC Enterprise T5440 server to handle large enterprise BI deployments, and the scalability of an Oracle Business Intelligence Suite Enterprise Edition cluster with four nodes running in Solaris Containers on a single server (4).

    *Zeus Extensible Traffic Manager (ZXTM)* - The SPARC Enterprise T5240 server equipped with two UltraSPARC T2 Plus processors running at 1.6 GHz set a ZXTM HTTP world record with a throughput result of 13.4 Gbit/sec and a price/performance of $5,500/Mbps. The Sun server delivered 34 percent better throughput and 2.6 times better price/performance than an F5 BIG-IP VIPRION appliance configured with a single blade. This result demonstrates that Sun's general-purpose CMT servers can outperform special-purpose hardware appliances at a lower price.

    For more information on the new 1.6 GHz UltraSPARC T2 and T2 Plus processor-based servers, visit http://www.sun.com/cmt. Details on the announced world records are available at http://www.sun.com/benchmarks.

    Just updated the Selecting The Best Compiler Options article for the developer portal. Minor changes, mainly a bit more clarification on floating point optimisations.

    This article offers an example of tuning web applications to prevent performance bottlenecks on chip multithreading (CMT) platforms. A Sun engineer describes a scenario using Java virtual machine (JVM) tuning and jstack at a leading Web 2.0 independent software vendor (ISV).

    Ben Lippmeier gave an excellent presentation at the recent Haskell conference in Edinburgh on his work on porting the Glasgow Haskell Compiler (GHC) back to SPARC. A video of the talk is available.

    Update: Link to slides

    Here is a great article that was published by the IBM WebSphere Portal Performance team in collaboration with my colleague, Daniel Edwin, of Sun ISV Engineering.

    The paper discusses WebSphere Portal Server v6.1 performance and scalability on Solaris 10 running on the T5240 (Chip Multi-Threading (CMT) based) server, and how to configure and deploy a number of instances using the Solaris Container technology.

    We have expanded the BigAdmin Resource Center for Sun Servers With CoolThreads Technology. Topics include UltraSPARC T1, T2, and T2 Plus Processors, as well as Sun SPARC Enterprise T5140 and T5240 Servers.

    Significance of Results

    Sun SPARC Enterprise T5220, T5240 and T5440 servers ran benchmarks using the Aho-Corasick string searching algorithm. String searching or pattern matching are important to a variety of commercial, government and HPC applications. One of the core functions needed for text identification algorithms in data repositories is real-time string searching. For this benchmark, the IBM, HP and Sun systems used the Aho-Corasick algorithm for string searching.

    Sun SPARC Enterprise T5440

    • A 1.6 GHz Sun SPARC Enterprise T5440 server could search a book as tall as Mt. Everest (29,208 feet, 861 GB book) in 61 seconds, which corresponds to a string search rate of 14.2 GB/s.

    • A 1.6 GHz Sun SPARC Enterprise T5440 server can search at a rate of 14.2 GB/s, which corresponds to searching a book containing one terabyte of data (34,745 feet high) in only 70 seconds.

    • The 4-chip 1.6 GHz Sun SPARC Enterprise T5440 server performed string searching at a rate of 14.2 GB/s, which is 29.9 times as fast as the 2-chip IBM Cell Broadband Engine DD3 Blade that performed string searching at a rate of 0.475 GB/s.

    • The 4-chip 1.6 GHz Sun SPARC Enterprise T5440 server performed string searching 3.7 times as fast as the 4-chip HP DL-580 (2.93 GHz Xeon QC) server that performed string searching at a rate of 3.87 GB/s. The 1.6 GHz Sun SPARC Enterprise T5440 server has a 1.7 times advantage in delivered power-performance over the HP DL-580 (using a power consumption rate of 830 watts for the HP system as measured on other tests).

    • The 1.6 GHz Sun SPARC Enterprise T5440 server demonstrated a 12% improvement over the 1.4 GHz Sun SPARC Enterprise T5440.

    • The 1.6 GHz Sun SPARC Enterprise T5440 server demonstrated a 2x speedup over the 1.6 GHz Sun SPARC Enterprise T5240 server which demonstrated a 2.3x speedup over the 1.4 GHz Sun SPARC Enterprise T5220 server.

    Sun SPARC Enterprise T5240

    • The 2-chip 1.6 GHz Sun SPARC Enterprise T5240 server performed string searching at a rate of 7.22 GB/s which is 15.4 times as fast as the 2-chip IBM Cell Broadband Engine DD3 Blade that performed string searching at a rate of 0.475 GB/s.

    • The 2-chip 1.6 GHz Sun SPARC Enterprise T5240 server performed string searching 1.9 times as fast as the 4-chip HP DL-580 (2.93 GHz Xeon QC) server that performed string searching at a rate of 3.87 GB/s. The 1.6 GHz Sun SPARC Enterprise T5240 server has a 2.4 times advantage in delivered power-performance over the HP DL-580 (using a power consumption rate of 830 watts for the HP system as measured on other tests).

    • The 1.6 GHz Sun SPARC Enterprise T5240 server demonstrated a 14% speedup over the 1.4 GHz Sun SPARC Enterprise T5240 server.

    Sun SPARC Enterprise T5220

    • The 1-chip 1.4 GHz Sun SPARC Enterprise T5220 server performed string searching at a rate of 3.16 GB/s which is 6.7 times as fast as the 2-chip IBM Cell Broadband Engine DD3 Blade that performed string searching at a rate of 0.475 GB/s.

    Performance Landscape

    System                                         Throughput (GB/sec)  Chips  Cores
    Sun SPARC Enterprise T5440 (1.6 GHz)           14.2                 4      32
    Sun SPARC Enterprise T5440 (1.4 GHz)           12.7                 4      32
    Sun SPARC Enterprise T5240 (1.6 GHz)           7.2                  2      16
    Sun SPARC Enterprise T5240 (1.4 GHz)           6.4                  2      16
    HP DL-580 (2.9 GHz)                            3.9                  4      16
    Sun SPARC Enterprise T5220 (1.4 GHz)           3.2                  1      8
    IBM Cell Broadband Engine DD3 Blade (3.2 GHz)  0.475                2      16
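
The speedup factors and search times quoted in the bullets follow from simple division of the measured rates. Here is a small sketch in Python; the dictionary keys and function names are illustrative assumptions, while the rates are the ones quoted above. Because the published rates are rounded, a ratio computed this way can differ from a quoted one in the last digit.

```python
# Search rates in GB/s, taken from the bullets above.
throughput = {
    "T5440 1.6 GHz": 14.2,
    "T5240 1.6 GHz": 7.22,
    "HP DL-580": 3.87,
    "T5220 1.4 GHz": 3.16,
    "IBM Cell DD3 Blade": 0.475,
}


def speedup(system, baseline):
    """How many times faster `system` searches than `baseline`."""
    return throughput[system] / throughput[baseline]


def seconds_to_search(gigabytes, system):
    """Time needed to scan `gigabytes` of text at the system's measured rate."""
    return gigabytes / throughput[system]


for name in ("T5440 1.6 GHz", "T5240 1.6 GHz", "T5220 1.4 GHz"):
    print(f"{name}: {speedup(name, 'IBM Cell DD3 Blade'):.1f}x the Cell blade, "
          f"{speedup(name, 'HP DL-580'):.1f}x the HP DL-580")

print(f"861 GB book: {seconds_to_search(861, 'T5440 1.6 GHz'):.0f} s")
print(f"1 TB book:   {seconds_to_search(1000, 'T5440 1.6 GHz'):.0f} s")
```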

    Results and Configuration Summary

    Hardware Configuration:
      Sun SPARC Enterprise T5440 (1.6 GHz)
        4 x 1.6 GHz UltraSPARC T2 Plus processors
        256 GB
      Sun SPARC Enterprise T5440 (1.4 GHz)
        4 x 1.4 GHz UltraSPARC T2 Plus processors
        128 GB
      Sun SPARC Enterprise T5240 (1.6 GHz)
        2 x 1.6 GHz UltraSPARC T2 Plus processors
        64 GB
      Sun SPARC Enterprise T5240 (1.4 GHz)
        2 x 1.4 GHz UltraSPARC T2 Plus processors
        64 GB
      Sun SPARC Enterprise T5220 (1.4 GHz)
        1 x 1.4 GHz UltraSPARC T2 processor
        32 GB

    Software Configuration:

      Sun SPARC Enterprise T5440 (1.6 GHz)
        OpenSolaris 2009.06
        Sun Studio 12 (Sun C 5.9 2007.05)
      Sun SPARC Enterprise T5440 (1.4 GHz)
        Solaris 10 2008.07
        Sun Studio 12 (Sun C 5.9 2007.05)
      Sun SPARC Enterprise T5240 (1.6 GHz)
        OpenSolaris 2009.06
        Sun Studio 12 (Sun C 5.9 2007.05)
      Sun SPARC Enterprise T5240 (1.4 GHz)
        Solaris 10 2008.07
        Sun Studio 12 (Sun C 5.9 2007.05)
      Sun SPARC Enterprise T5220 (1.4 GHz)
        Solaris 10 2008.07
        Sun Studio 12 (Sun C 5.9 2007.05)

    Benchmark Description

    One of the core functions needed for text identification algorithms in data repositories is real-time string searching. This string searching benchmark demonstrates the usefulness of Sun's UltraSPARC T2 and T2 Plus processors for both ease of code creation and speed of code execution. In IEEE Computer, Volume 41, Number 4, pp. 42-50, April 2008, IBM describes a variant of the Aho-Corasick string searching algorithm that uses deterministic finite automata. The algorithm first constructs a graph that represents a dictionary, then walks that graph using successive input characters from a text file. Each "state" in the graph includes a state transition table (STT) that is accessed using the next input character from the text file to determine the address of the next state in the graph. IBM defines an automaton as a two-step loop that: (1) obtains the address of the next state from the STT, and (2) fetches the next state in the graph.
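
To make the two-step loop concrete, here is a compact textbook-style Aho-Corasick automaton in Python. It is only a sketch under stated assumptions: it is neither IBM's optimized variant nor Sun's ANSI C implementation, and the names `build_automaton`, `search`, `stt` and `out` are chosen here for illustration. The state transition table is stored as one dictionary per state, and failure transitions are folded into the table so that matching is a single lookup per input character.

```python
from collections import deque


def build_automaton(words):
    """Build a deterministic Aho-Corasick automaton.

    Returns (stt, out): stt[s] maps a character to the next state, and
    out[s] lists the dictionary words that end when state s is reached.
    """
    # Step 1: build a trie (the "graph that represents a dictionary").
    goto = [{}]
    out = [[]]
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)

    # Step 2: breadth-first pass that computes failure links and folds
    # them into a full transition table, so matching never backtracks.
    alphabet = {ch for word in words for ch in word}
    fail = [0] * len(goto)
    stt = [{} for _ in goto]
    queue = deque()
    for ch in alphabet:
        nxt = goto[0].get(ch, 0)
        stt[0][ch] = nxt
        if nxt:
            queue.append(nxt)
    while queue:
        state = queue.popleft()
        # Inherit the words recognized by the longest proper suffix state.
        out[state] = out[state] + out[fail[state]]
        for ch in alphabet:
            if ch in goto[state]:
                child = goto[state][ch]
                fail[child] = stt[fail[state]][ch]
                stt[state][ch] = child
                queue.append(child)
            else:
                stt[state][ch] = stt[fail[state]][ch]
    return stt, out


def search(text, stt, out):
    """Yield (end_index, word) for every dictionary word found in text."""
    state = 0
    for i, ch in enumerate(text):
        # (1) look up the next state in the STT; characters that appear in
        # no dictionary word send the automaton back to the root.
        # (2) move to that state and report any words that end there.
        state = stt[state].get(ch, 0)
        for word in out[state]:
            yield i, word


stt, out = build_automaton(["he", "she", "his", "hers"])
matches = list(search("ushers", stt, out))
print(matches)
```

Each automaton in the benchmark corresponds to one such `search` loop running over its own stream of input characters against the single shared table.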

    IBM reports the performance of its Cell Broadband Engine (CBE) to execute this algorithm to search a 4.4 MB version of the King James Bible using a dictionary of the 20,000 most used words in the English language (average word length of 7.59 characters). Each of the 8 synergistic processing elements (SPEs) of each of the two CBEs executes 16 automata, for a total of 256 automata. All automata and hence all SPEs access a single, shared dictionary.

    IBM describes elaborate optimizations of the Aho-Corasick algorithm, including state shuffling, state replication, alphabet shuffling and state caching. These optimizations were required to: (1) overcome "memory congestion", i.e., contention amongst the SPEs for access to the shared dictionary, and (2) compensate for the limited local storage that is associated with each SPE. These optimizations were necessary to achieve the performance reported for the CBE DD3 Blade.

    IBM does not provide references that indicate where to obtain the dictionary and Bible. IBM reports the algorithmic performance in Gbits/s but does not indicate whether an 8-bit byte is extended to 10 bits as required for network transmission.

    In order to closely approximate the dictionary and Bible that were used by IBM, Sun used a dictionary of 25,143 English words (the OpenSolaris file cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/spell/list), for which the average word length is 7.2 characters, and a 4.6 MB version of the King James Bible (www.patriot.net/users/bmcgin/kjv12.zip). For reporting of results in Gbits/s, the length of a byte is assumed to be 8 bits.
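
The choice of 8 versus 10 bits per byte changes the reported bit rate by 25%, which is why the assumption matters when comparing against figures given in Gbits/s. A one-function sketch in Python (the function name is hypothetical; the 14.2 GB/s rate is the T5440 result quoted earlier):

```python
def gbits_per_second(gbytes_per_second, bits_per_byte=8):
    """Convert a byte-based search rate into a bit-based rate."""
    return gbytes_per_second * bits_per_byte


t5440_rate = 14.2  # GB/s, from the results above

# 8 bits per byte, the assumption Sun states for its reporting.
print(f"{gbits_per_second(t5440_rate):.1f} Gbits/s")

# 10 bits per byte, as when a byte is extended for network transmission.
print(f"{gbits_per_second(t5440_rate, 10):.1f} Gbits/s")
```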

    Key Points and Best Practices

    • Power was measured during execution of the Aho-Corasick algorithm using a WattsUp power meter, and the average rate of power consumption is presented.

    • The Aho-Corasick algorithm as deployed on the IBM Cell Broadband Engine DD3 Blade required substantial optimization and tuning to achieve the reported performance, whereas on the Sun SPARC Enterprise T5220, T5240 or T5440 servers only a basic implementation of the algorithm and a simple compilation were needed.

    • In order to demonstrate the usefulness of Sun's UltraSPARC T2 and T2 Plus processors for both ease of code generation and speed of code execution, Sun implemented the Aho-Corasick algorithm using ANSI C. No optimizations of the algorithm were required to achieve the performance reported for the T5220, T5240 and T5440. The source code was compiled using the -m32 -xO3 and -xopenmp options. The dictionary is represented using a graph that comprises 82 MB. Each core of the T5220, T5240 or T5440 executes 8 automata using one OpenMP thread per automaton. Thus, the T5220 executes 64 total automata, the T5240 executes 128 total automata and the T5440 executes 256 total automata. All automata and hence all cores access a single, shared dictionary. Access to this dictionary is accelerated by the large, shared L2 caches of the Sun SPARC Enterprise T5220, T5240 and T5440.
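
The automaton counts in the last point are simply cores times eight, and the underlying idea of many automata scanning the input concurrently against one shared, read-only dictionary can be sketched as follows. This Python sketch makes several assumptions that are not in the original: how the text is divided among automata is not described, so the even split by start position is hypothetical; a naive `str.find` scan stands in for the Aho-Corasick matcher to keep the example short; and Python threads only illustrate the partitioning, whereas the benchmark used OpenMP threads in C.

```python
from concurrent.futures import ThreadPoolExecutor

AUTOMATA_PER_CORE = 8                           # from the text above
CORES = {"T5220": 8, "T5240": 16, "T5440": 32}  # from the Performance Landscape table


def total_automata(system):
    """Number of automata (one OpenMP thread each) run on a system."""
    return CORES[system] * AUTOMATA_PER_CORE


def count_matches(text, start, own_end, words):
    """Count dictionary words that begin inside text[start:own_end].

    The scan is allowed to read past own_end just far enough to complete
    a word that starts before it, so words straddling a boundary are
    counted exactly once, by the worker that owns their first character.
    """
    total = 0
    for word in words:
        limit = own_end + len(word) - 1
        pos = text.find(word, start, limit)
        while pos != -1:
            total += 1
            pos = text.find(word, pos + 1, limit)
    return total


def parallel_count(text, words, n_workers):
    """Split the text into n_workers ranges and scan them concurrently."""
    size = max(1, -(-len(text) // n_workers))  # ceiling division
    ranges = [(s, min(s + size, len(text))) for s in range(0, len(text), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        counts = pool.map(lambda r: count_matches(text, r[0], r[1], words), ranges)
        return sum(counts)


words = ["he", "she", "his", "hers"]  # shared, read-only dictionary
text = "ushers and his heroes"

for system in CORES:
    print(system, total_automata(system))
print(parallel_count(text, words, 4))
```

Assigning each match to the worker that owns its first character is what keeps the parallel total equal to a single sequential scan.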

    Need to know more about parallel and multithreaded programming, but were afraid to ask? 

    Here's a really good set of seven tutorials presented by Ruud van der Pas called:

    "An Introduction to Parallel Programming"

    Last week Stanford University hosted Hot Chips, the symposium on high-performance processors. Sun previewed details of the next generation of its multithreaded chips, currently code-named "Rainbow Falls", which will succeed today's UltraSPARC T2+.

    The headlines, as I see them: the chips will carry 16 cores each, every core with a powerful coprocessor, and the big challenge will be managing the massive throughput required to connect 16 cores to 16 L2 banks. To that end the design will feature important innovations in proximity interconnect using radio-frequency connections, drastically reducing the number of interfaces and shrinking the area of the larger components (L2 tags).

    It will include four coherence units that service memory requests regardless of whether they target local or remote memory, maintaining coherence across the SMP cluster.

    According to Sanjay Patel, the engineer who gave the presentation, the major challenges are: internal connectivity, reducing the number of interfaces, reducing the L2 area, the need for four coherence units, and chip-to-chip connectivity.

    For those interested in digging deeper, I refer you to this blog or to the coverage in the specialist online press, TG Daily or The Register.

    The growing demand for high-throughput systems is evident, as are the increasing encryption requirements of secure web solutions and new services such as VoIP. This technology retains its leadership in that area.

    The world kept turning during my vacation, even if professional activity slows down somewhat. To pick this blog back up, I would like to share a couple of news items from August about Sun's midrange systems that I think are of interest to sectors such as healthcare, education and e-government, which are starting their "school year" with far more needs than resources.

    The first item is the publication of benchmark results for the CMT line of systems, world records included.

    • The SPECjbb2005 benchmark emulates the design of server-side Java applications. The SPARC Enterprise T5220 server with one 1.6 GHz UltraSPARC T2 chip set a single-chip record, delivering 2.6 times better performance than the competition while consuming only 520 watts.
    • The SPARC Enterprise T5440 server with four 1.6 GHz UltraSPARC T2 Plus chips posted the best single-JVM result on the SPECjbb2005 benchmark for systems with 32 or fewer cores.
    • In the SPECmail2009 benchmark a corporate mail server has to handle thousands of users' mail. The SPARC Enterprise T5240 server with eight-core 1.6 GHz UltraSPARC T2 Plus chips beat the marks of other servers with a result of 12,000 IMAP4 users at 57,758 users per hour.
    Benchmark details here.

    Who isn't stretched thin running their application, mail, web and other services? Sun servers with CMT technology are an alternative well worth considering, not only for their performance but also for their price.

    For its part, InfoWorld benchmarked the Sun Fire X2270 and Sun Fire X4270 servers, based on Intel Nehalem CPUs (from the 2.0 GHz E5504 to the 2.93 GHz X5570) and housed in 1U and 2U rack enclosures respectively. They scored 8.2 and 8.8 respectively, both rated "very good".

    The reviewers see the X2270 as suited to the role of front-end web server, small database server or member of a virtualization farm. They consider the X4270 an all-rounder that fits perfectly as an application server, a storage server or any other task you care to give it.


    Two weeks ago the Parallel Computing Laboratory at the University of California Berkeley ran an excellent three-day summer bootcamp on parallel computing. I was one of about 200 people who attended remotely while another large pile of people elected to attend in person on the UCB campus. This was an excellent opportunity to listen to some very well known and talented people in the HPC community. Video and presentation material is available on the web and I would recommend it to anyone interested in parallel computing or HPC. See below for details.

    The bootcamp, which was called the 2009 Par Lab Boot Camp - Short Course on Parallel Programming, covered a wide array of useful topics. These included introductions to many of the current and emerging HPC parallel computing models (pthreads, OpenMP, MPI, UPC, CUDA, OpenCL, etc.), hands-on labs for in-person attendees, and some nice discussions on parallelism and how to find it, with an emphasis on the motifs (patterns) of parallelism identified in The Landscape of Parallel Computing Research: A View From Berkeley. There was also a presentation on performance analysis tools and several application-level talks. It was an excellent event.

    The bootcamp agenda is shown below. Session videos and PDF decks are available here.

    Talk title                                                  Speaker
    Introduction and Welcome                                    Dave Patterson (UCB)
    Introduction to Parallel Architectures                      John Kubiatowicz (UCB)
    Shared Memory Programming with Pthreads, OpenMP and TBB     Katherine Yelick (UCB & LBNL), Tim Mattson (Intel), Michael Wrinn (Intel)
    Sources of parallelism and locality in simulation           James Demmel (UCB)
    Architecting Parallel Software Using Design Patterns        Kurt Keutzer (UCB)
    Data-Parallel Programming on Manycore Graphics Processors  Bryan Catanzaro (UCB)
    OpenCL                                                      Tim Mattson (Intel)
    Computational Patterns of Parallel Programming              James Demmel (UCB)
    Building Parallel Applications                              Ras Bodik (UCB), Nelson Morgan (UCB)
    Distributed Memory Programming in MPI and UPC               Katherine Yelick (UCB & LBNL)
    Performance Analysis Tools                                  Karl Fuerlinger (UCB)
    Cloud Computing                                             Matei Zaharia (UCB)

    Significance of Results

    A Sun SPARC Enterprise T5240 server equipped with two UltraSPARC T2 Plus processors at 1.6GHz delivered a result of 422782 SPECjbb2005 bops, 26424 SPECjbb2005 bops/JVM. The Sun SPARC Enterprise T5240 consumed an average of 875 Watts of power during the execution of the benchmark.

    • The Sun SPARC Enterprise T5240 server running two 1.6 GHz UltraSPARC T2 Plus processors delivered 5% better performance than an IBM Power 570 with four 4.7 GHz POWER6 processors, as measured by the SPECjbb2005 benchmark.

    • The Sun SPARC Enterprise T5240 server equipped with two UltraSPARC T2 Plus processors at 1.6GHz demonstrated 10% better performance than the Sun SPARC Enterprise T5240 server equipped with two UltraSPARC T2 Plus processors at 1.4GHz.
    • One Sun SPARC Enterprise T5240 (two 1.6GHz UltraSPARC T2 Plus chips, 2RU) delivers 2.3 times the power/performance of the IBM Power 570 (8RU), which used four 4.7GHz POWER6 chips.
    • The Sun SPARC Enterprise T5240 used OpenSolaris 2009.06 and the Sun JDK 1.6.0_14 Performance Release to obtain this result.

    Performance Landscape

    SPECjbb2005 Performance Chart (ordered by performance), select results presented.

    bops : SPECjbb2005 Business Operations per Second (bigger is better)

    System                      Chips  Cores  Threads  GHz  Type                bops    bops/JVM
    Sun SPARC Enterprise T5240  2      16     128      1.6  UltraSPARC T2 Plus  422782  26424
    IBM Power 570               4      8      16       4.7  POWER6              402923  100731
    Sun SPARC Enterprise T5240  2      16     128      1.4  UltraSPARC T2 Plus  384934  24058
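
    The percentages quoted under Significance of Results can be checked directly against the table. The following Python sketch uses only figures stated in this post (the dictionary keys and helper names are illustrative). It derives the relative gains, the JVM count implied by bops and bops/JVM, and the bops per watt of the 1.6 GHz T5240. The IBM wattage is not stated here, so the 2.3x power/performance ratio is not recomputed.

```python
# Figures copied from the Performance Landscape table.
results = {
    "T5240 1.6 GHz": {"bops": 422782, "bops_per_jvm": 26424},
    "Power 570 4.7 GHz": {"bops": 402923, "bops_per_jvm": 100731},
    "T5240 1.4 GHz": {"bops": 384934, "bops_per_jvm": 24058},
}


def percent_gain(a, b):
    """How much larger a is than b, in percent."""
    return (a / b - 1.0) * 100.0


def jvm_count(entry):
    """Number of JVM instances implied by total bops and bops per JVM."""
    return round(entry["bops"] / entry["bops_per_jvm"])


top = results["T5240 1.6 GHz"]
gain_vs_ibm = percent_gain(top["bops"], results["Power 570 4.7 GHz"]["bops"])
gain_vs_14 = percent_gain(top["bops"], results["T5240 1.4 GHz"]["bops"])
bops_per_watt = top["bops"] / 875  # 875 W average reported for the 1.6 GHz T5240

print(f"vs IBM Power 570: {gain_vs_ibm:.1f}%")
print(f"vs 1.4 GHz T5240: {gain_vs_14:.1f}%")
print(f"bops per watt:    {bops_per_watt:.0f}")
print({name: jvm_count(entry) for name, entry in results.items()})
```

    The gains come out at roughly 4.9% and 9.8%, matching the rounded 5% and 10% claims, and the implied JVM counts are 16 for each T5240 configuration and 4 for the Power 570.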

    Complete benchmark results may be found at the SPEC benchmark website http://www.spec.org.

    Results and Configuration Summary

    Hardware Configuration:

      Sun SPARC Enterprise T5240
        2 x 1.6 GHz UltraSPARC T2 Plus processors
        64 GB

    Software Configuration:

      OpenSolaris 2009.06
      Java HotSpot(TM) 32-Bit Server, Version 1.6.0_14 Performance Release

    Benchmark Description

    SPECjbb2005 (Java Business Benchmark) measures the performance of a Java implemented application tier (server-side Java). The benchmark is based on the order processing in a wholesale supplier application. The performance of the user tier and the database tier are not measured in this test. The metrics given are number of SPECjbb2005 bops (Business Operations per Second) and SPECjbb2005 bops/JVM (bops per JVM instance).

    Key Points and Best Practices

    • Each JVM executed in the FX scheduling class to improve performance by reducing the frequency of context switches.
    • Each JVM was bound to a separate processor containing 1 core to reduce memory access latency using the physical memory closest to the processor.

    See Also

    Disclosure Statement

    SPEC and SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation. Results as of 8/25/2009 on http://www.spec.org.
    Sun SPARC T5240 (2 chips, 16 cores) 422782 SPECjbb2005 bops, 26424 SPECjbb2005 bops/JVM; Sun SPARC T5240 (2 chips, 16 cores) 384934 SPECjbb2005 bops, 24058 SPECjbb2005 bops/JVM; IBM Power 570 (4 chips, 8 cores) 402923 SPECjbb2005 bops, 100731 SPECjbb2005 bops/JVM.

    Sun watts were measured on the system during the test.

    IBM p 570 4P (2 building blocks) power specifications calculated as 80% of maximum input power reported 7/8/09 in 'Facts and Features Report': ftp://ftp.software.ibm.com/common/ssi/pm/br/n/psb01628usen/PSB01628USEN.PDF