HPC Blogs

When a thread hits an error in a multithreaded application, that error will take out the entire app. Here's some example code:

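#include <pthread.h>
#include <stdio.h>

/* Worker that deliberately writes through an invalid address. */
void *work(void *param)
{
  int *a;
  a = (int *)(1024 * 1024);
  (*a)++;                          /* triggers the segmentation fault */
  printf("Child thread exit\n");   /* never reached */
  return 0;
}

int main(void)
{
  pthread_t thread;
  pthread_create(&thread, 0, work, 0);
  pthread_join(thread, 0);
  printf("Main thread exit\n");
  return 0;
}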

Compiling and running this produces:

% cc -O -mt pthread_error.c
% ./a.out
Segmentation Fault (core dumped)

Not entirely unexpected, that. The app died without the main thread having the chance to clean up resources etc. This is probably not ideal. However, it is possible to write a signal handler to capture the segmentation fault and terminate the child thread without causing the main thread to terminate. This works because a segmentation fault is delivered to the thread that caused it, so the handler runs in that thread and pthread_exit ends only that thread. It's important to realise that there's probably little chance of actually recovering from the unspecified error, but this at least might give the app the chance to report the symptoms of its demise.

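#include <pthread.h>
#include <stdio.h>
#include <signal.h>

/* Worker that deliberately writes through an invalid address. */
void *work(void *param)
{
  int *a;
  a = (int *)(1024 * 1024);
  (*a)++;                          /* raises SIGSEGV in this thread */
  printf("Child thread exit\n");   /* never reached */
  return 0;
}

/* Runs in the faulting thread. Neither printf nor pthread_exit is
   async-signal-safe, so treat this as a last-resort diagnostic only. */
void hsignal(int i)
{
  printf("Signal %i\n", i);
  pthread_exit(0);                 /* ends only the faulting thread */
}

int main(void)
{
  pthread_t thread;
  sigset(SIGSEGV, hsignal);        /* install the handler before starting the thread */
  pthread_create(&thread, 0, work, 0);
  pthread_join(thread, 0);
  printf("Main thread exit\n");
  return 0;
}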

Which produces the output:

% cc -O -mt pthread_error.c
% ./a.out
Signal 11
Main thread exit

Last week the Lustre-Powered Hyperion Project received a coveted HPCwire Readers' Choice Award. The NNSA's Lawrence Livermore National Laboratory teamed with Sun and nine other vendors to build and support a large-scale Linux cluster test bed to explore high performance computing technologies.

MilaX, a small Live CD distribution that runs completely off a CD, can be a good tool for troubleshooting a non-booting Solaris installation. In this tech tip, Bernd Schemmer of the BigAdmin community describes how to convert the MilaX Live CD to a WANBOOT image, so it can be used on a WANBOOT server for troubleshooting machines that run the Solaris OS on SPARC platforms.

This BigAdmin article shows how to add support for booting from SATA DVDs by installing Kernel Update patch 137137-09 to the SPARC miniroot image for the Solaris 10 5/08 or 8/07 release on SPARC platforms.

HPC Needs a Killer App?  An interesting proposal by Mr. Rattner: he claims that our typical list doesn't promote growth; rather, HPC needs a framework that winds up driving growth.  3-D Web?  Hmmm.  What does he mean?  Much like "cloud" (the most overused term in computing today), 3-D Web could mean a lot of things.  According to him:

  • Continuous Simulation
  • Multi-view 3D animated
  • Immersive and Collaborative environments

Continuous simulation implies that we haven't been doing this.  That may be true across the vast commercial and government markets, but in computing (chip design in particular) we have been doing this for years.

Multi-View 3D animation.  In this context, the example shown (fernland) simulates the fern life cycle across multiple genetic makeups of the ferns as well as differences in terrain, cloud cover, soil type, etc.  Interesting, but is it really all that different from what we can do in a traditional scientific simulation the way we've always done it?  Is the ability to visualize this and simulate the environment all that critical to the scientific understanding?  As an old Graphics geek and Sci Viz guy this is a heretical stance, but I'm not sure it adds that much.

 The next example of applying HPC to the fashion industry is something that may very well be worth doing.  Being able to build virtual prototypes of clothing and simulate the real physics of the clothing, textiles etc. can make the same difference to fashion design that crash codes and CFD modeling made to the manufacturing industries.  Pretty cool.  This is where the Siggraph 1989 or so animation and simulation stuff has taken us.  With a few advances in creating the underlying human model and applying realistic movement simulation techniques this is a mostly solved problem that can indeed make the difference that is the subject of the talk.  Disruptive.

In his demonstration codes there was one example of cloth physics that shows a silk cloth falling over an object and then falling to the floor.  Not only did they simulate the drape and fold of the cloth, but also the air flow under the cloth, and the billowing and fluttering that result from the fall and slide.  Each frame of the simulation took 6 min on a 16 node machine.  Not exactly real time, but not as bad as you might think.  The other simulation was a water flow and (more importantly) a water sound simulation that lasts about 10 seconds or so and took on the order of 13 hours of compute time to run the physics and produce the audio and visual simulation.

Next, he went into a sort of sales presentation on Intel processors.  The difference between this and Dell is at least he is showing off the performance instead of death by slideware.  The audience isn't leaving.  So we have that going for us.

Call to action:

  • 3D Web is most important (in his opinion) than the 3D web
  • Government support isn't enough
  • Need to develop, standardize and use 3D Web

Interesting talk.  Definitely something to think about.



We'll be at Supercomputing 09 in Portland this week, at the Sun Booth. Stop by and say hello, if you're attending. 

I hope to have more to say about this year's event, so stay tuned.

GreenBeat 2009 will bring together the nation’s 500 leading entrepreneurs, investors, utilities, technology executives, and policymakers to accelerate the development of a leaner, more efficient electrical grid. GreenBeat 2009 will map out the hottest business and technology opportunities the Smart Grid has to offer. Renovating the power grid requires big ideas from start-ups!

The program will feature participation by Al Gore, former Vice President and Nobel Prize Winner; John Doerr of Kleiner Perkins; Eric Schmidt of Google; and Vinod Khosla of Khosla Ventures.  The program also features executives from Cisco, Tendril, Oracle, and more. Expect lively discussion and power networking. The program also includes an innovation competition that will highlight new technologies and will explore the financial and investment opportunities afforded by the stimulus package.  More details can be found at www.greenbeat2009.com.

Will Smart Grid and smart metering initiatives change consumer behavior, and how? How will the more than $4 billion the Obama Administration has earmarked for Smart Grid make a difference? Which incentives and policies will speed deployment? Where will the most disruption take place in this trillion-dollar business that hasn’t changed for decades? Do startups stand a chance? Where should they focus? What role will cybersecurity and interoperability play in how a new, revamped grid takes shape? GreenBeat will get industry leaders’ answers to these questions and more.

We have made arrangements for 30% off the regular rate and you may use the following link and special code to register. http://greenbeat2009.eventbrite.com/?discount=GBSunVIP



Sun Studio is now in the Rocks roll for both Solaris and Linux. You will find Sun Studio 12 update 1, Sun ClusterTools and Sun Grid Engine in the Linux roll at http://www.rocksclusters.org/wordpress/?page_id=107 and in the Solaris roll at http://www.rocksclusters.org/wordpress/?page_id=108 . Bringing Sun Studio to Rocks makes it more readily available for cluster development, primarily in research organizations.

If you aren't familiar with the Rocks distribution, it is basically an open-source Linux cluster distribution that enables end users to easily build computational clusters, grid endpoints and visualization tiled-display walls. Rocks is used by researchers to create their own clusters. You can get more information about Rocks at their site (here).

Sun will be at SuperComputing 09, highlighting Flash technology and many other future technologies it's working on. Come join Sun at the Birds of a Feather session and at the Whisper Suite. For more info, follow this link.



We have come a long way from what now seems a distant memory: systems that had the CPU built from a large number of integrated circuits.  Enter the age of 'system on a chip' heralded by Sun's innovations with the UltraSPARC T2 processor.

The new-age processor, with a second-generation release in 2007 right on the heels of the original T1 processor that debuted in 2005, is a paradigm shift from traditional thinking.  Having proved itself on the network and application tiers, it has now arrived to fire up database performance. The TPC-C benchmark by Oracle has just proved that, using a dozen multi-socket UltraSPARC T2 systems coupled with other innovations from Sun.  The blazing-fast F5100 storage is a key component, but we focus on the 'system' in this blog.

Chip Multi Threading (CMT)

The T2 processor has eight cores, with each core supporting eight threads using two execution units. While other processors are playing catch-up with this new design philosophy, running fewer threads at higher clock speeds, the T2 processor supports sixty-four threads in all. It also integrates networking, security and I/O: integrated 10 Gb Ethernet, 8-lane PCI-Express, and eight floating-point and cryptographic processing units. The T2 Plus processor lends itself to multi-socket designs using the networking function to provide coherency links.

In the last couple of decades, performance improvements have largely come from increasing clock speeds and enhancing instruction-level parallelism (ILP) by means such as multiple instruction issue, out-of-order execution and branch prediction. However, we have witnessed diminishing returns forced by two constraints, namely

- level of instruction-level parallelism possible in today's commercial applications
Commercial applications tend to have low ILP due to large working sets and poor locality of reference on memory access, resulting in poor cache hit rates.  Branch prediction also becomes tough on data-dependent branches, which makes the discarded work very expensive on complex, power-hungry designs.  Increasing power usage and heat dissipation arising from design complexity further compound the problem.

- memory latency
It is common knowledge that CPU speeds have gone up by an order of magnitude relative to the rest of the system, including memory. On a cache miss, the resulting slower access to memory leaves the processor idle for many clock cycles every time.

Contrast this with the break-away approach of the Sun design in the T2 processor, characterized by a more rounded architecture.  The T2 processor provides a dramatic increase in the number of threads and a high-bandwidth memory subsystem.  Each of the eight processor cores supports two groups of four threads, each group using one of the core's two Integer Execution Units. On every clock, one instruction from each thread group is picked up by one Execution Unit (EXU): the Least Recently Fetched among the ready threads.

The hardware hides memory access delays and pipeline stalls by scheduling other threads onto the execution pipe with zero cycle penalty on the context switch.  Rather than have the processor idle on a cache miss, the T2 processor takes up another thread in the next clock, aided by context switching in the hardware. Each thread has its own program counter, one-line instruction buffer and a register file.  Each EXU contains state for four threads and the Integer Register File (IRF) contains 8 register windows for every thread. The large memory bandwidth available ensures a smooth flow of supplies for the many threads and cores.
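
The "Least Recently Fetched among the ready threads" rule can be pictured with a toy model. The sketch below is only an illustration of the selection policy as described in this post, not the hardware's actual logic; the function name, the ready flags and the last_fetch bookkeeping are all assumptions made for the example.

```c
#define GROUP_SIZE 4   /* threads per thread group, as described in the post */

/* Toy model: return the index of the ready thread whose last fetch is the
   oldest, or -1 if no thread in the group is ready. last_fetch[i] holds the
   cycle on which thread i was last picked (hypothetical bookkeeping). */
int pick_least_recently_fetched(const int ready[GROUP_SIZE],
                                const unsigned long last_fetch[GROUP_SIZE])
{
  int best = -1;
  for (int i = 0; i < GROUP_SIZE; i++) {
    if (!ready[i])
      continue;   /* a stalled thread, e.g. waiting on a cache miss, is skipped */
    if (best < 0 || last_fetch[i] < last_fetch[best])
      best = i;
  }
  return best;
}
```

In this model a thread waiting on memory simply drops out of the candidate set, so the pipe keeps issuing from the remaining threads instead of idling.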

The contrarian design yields the following advantages:
- Reduced thermal envelope by simplifying processor design
- Increased overall system throughput with 64 threads
Together, the above two factors offer a higher 'Performance per Watt.' 

For the moment, the majority of commercial applications fall short of maximizing the throughput potential of this new design. There are also other applications which are heavily dependent on single thread performance making the processor clock speed the single most important factor.

As with all new thinking, a new application development approach is needed. The following are the critical factors:
- The quantum of work done should be large enough to utilize the high thread count.
- The work needs to be broken into independent units, bringing in parallelism:
  - compilers with optimization options take care of the obvious, while
  - the programmer will need to take care of the not-so-obvious possibilities of code parallelization.
- Load balancing across threads becomes the key.
- Thread synchronization penalties and thread creation overheads need to be controlled.
- Linking to high performance libraries supplied by the vendor provides a substantial performance increase.
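
As a rough illustration of these factors, here is a minimal POSIX threads sketch that splits a sum into independent, evenly sized chunks. Each thread writes only to its own slot, so no locking is needed, and the threads are created once per call to keep creation overhead low relative to the work. The names parallel_sum and NTHREADS are hypothetical, and the thread count of 8 is an arbitrary choice for the example.

```c
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 8   /* hypothetical thread count, chosen for illustration */

struct chunk {
  const int *data;
  size_t begin, end;   /* half-open range [begin, end) owned by one thread */
  long sum;            /* private result slot, so no synchronization is needed */
};

static void *sum_chunk(void *arg)
{
  struct chunk *c = arg;
  long s = 0;
  for (size_t i = c->begin; i < c->end; i++)
    s += c->data[i];
  c->sum = s;
  return NULL;
}

/* Sum n integers using NTHREADS workers over near-equal chunks. */
long parallel_sum(const int *data, size_t n)
{
  pthread_t tid[NTHREADS];
  struct chunk part[NTHREADS];
  int started[NTHREADS];
  long total = 0;

  for (int t = 0; t < NTHREADS; t++) {
    part[t].data = data;
    part[t].begin = n * t / NTHREADS;         /* even split balances the load */
    part[t].end = n * (t + 1) / NTHREADS;
    part[t].sum = 0;
    started[t] = (pthread_create(&tid[t], NULL, sum_chunk, &part[t]) == 0);
    if (!started[t])
      sum_chunk(&part[t]);                    /* fall back to doing the work inline */
  }
  for (int t = 0; t < NTHREADS; t++) {
    if (started[t])
      pthread_join(tid[t], NULL);
    total += part[t].sum;                     /* combine after the thread is done */
  }
  return total;
}
```

For a small array the thread creation cost will dominate, which is the first factor in the list: the quantum of work has to be large enough to justify the threads.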

As with any paradigm shift, the support for this approach would grow towards a crescendo. This would be further helped by the direction almost all of the mainstream processor technologies are taking. Sun is clearly in the driver's seat with the big lead it has in the UltraSPARC T2 processor and the Solaris Operating system, the proven platform for multi-threaded workloads.


This story is hard to pass up:  Sun's BestPerf blog (read the details here) recently reported how they got a 12x performance improvement over a single-threaded version on an important Seismic (Reverse Time Migration) benchmark using Sun Studio's OpenMP feature on SLES10. It's a great story of how Sun can deliver performance through a combination of Sun Studio and new hardware (via the Sun Storage F5100 Flash Array). Yes, this is the same Flash Array that has been the talk of the town and has notched up several World Record wins.
Several points come to mind:
  • Sun Studio and OpenMP are key to exploiting parallel performance. Not just with Flash, but also with multiple cores now becoming the mainstay in chip offerings. Multi-threading, parallel performance (and parallel programming, for those who are willing to take the effort) is going to be even more critical to fully utilize system resources now and into the future.
  • Sun Studio performance here is highlighted on SuSE 10. Note this, because I've had to counter the impression that Sun Studio doesn't do as well on Linux; it does. Sun Studio does not leave any performance, features, tools, options, or optimizations out of its offering on Linux.
  • The Flash Array Storage alone gets a 2.2x performance win over 15K disks. But the combination with Sun Studio in achieving parallelism that the Flash Array Storage can exploit is even more attractive.


Sun Studio now runs on Oracle Enterprise Linux. This extends the Linux platforms supported to include RHEL 5, SuSE 10, CentOS 5, and now OEL. Sun Studio continues to be available FREE on Linux as well as Solaris and OpenSolaris platforms.
You can download it from the Sun Download center (here).

Hi There!

I've been a little quiet for the last 2 weeks, working on a decent-size Siebel deployment plan (including Oracle DB, Siebel, Fusion middleware components, as well as integrated 3rd party applications).


The partner came to Sun requesting sizing for the hardware infrastructure (putting Oracle on Sun makes sense for many, many reasons!).
The request was for x86 servers, running Red Hat and 'we'll virtualise everything'. I guess that was fine, except for one little detail… Oracle doesn't provide explicit support for most of its applications when running in virtualised environments, unless it's Oracle VM.
Now this partner didn't have skills in Oracle VM, so what alternative was available?


Answer: LDOMS & Containers.


That's right, Sun's hardware partitioning & OS virtualisation technologies are supported for the deployment of most Oracle applications (support for LDOMs, Containers, or both, depending on the particular application itself).


So how does this affect the infrastructure solution?


Well, the x86 solution could not take advantage of virtualisation and thus required almost 100 physical servers (1RU or blades).
The SPARC CMT solution required fewer than 40 blades, which massively reduced the number of physical systems that needed to be managed, took up much less floorspace in the datacenter (5 B6000 chassis vs 10), and greatly reduced the number of Ethernet & FC SAN switch ports & cabling required (saving approx. 200k on FC switches alone).


And it turned out cheaper too!


Lesson learned?


Not all enterprise applications & ISVs support running their products within virtualised environments (some will require you to replicate the issue on physical hardware first). LDOMS & Containers are recognised by Oracle & other major ISVs as a supported method of application & OS isolation & consolidation.


Using these technologies to consolidate workloads can significantly reduce the number of physical systems you need to manage, while providing the legendary RAS features of SPARC hardware & the Solaris OS. This usually translates into better uptime, which means happier users and happier admins, and it also saves money, which means happier IT Managers & CxOs!


Bottom Line: LDOMS & Containers = Happiness!



For more info on running Oracle with LDOMS/Containers, go here or here

I invite you to attend the second session of this season's Desayunos Tecnológicos (Technology Breakfasts), organized by the systems engineering department. This time my colleagues will explain how to get the most out of the latest innovations in processors and servers. As you know, not all processors are designed for the same job, so it is important to understand the kind of solution you need to run in order to choose well.

Attendance is free, and you can join in person or over the Internet if you cannot come and have breakfast with us. Either way, you need to register at the link given. Those who choose to attend over the Internet will later be sent a link to the session and an access password a few hours before the breakfast. The previous session was very well received, so I encourage you to take part.

Desayunos Tecnológicos. Register now
Dear friend,

Our next Desayunos Tecnológicos session will take place on Wednesday, November 11, on the topic "Strategy in the Datacenter. A processor for every workload". We invite you to take part.

If you cannot attend the in-person session, you can follow it over the Internet; once you register, we will send you the access information.

VENUE
Sun Solution Center
Sun offices in Madrid,
(C/ Serrano Galvache, 56)

AGENDA
09:30 - 10:00 Welcome and coffee
10:00 - 10:30 Sun processor technology:
SPARC and CMT (Chip Multi Threading)
10:30 - 11:00 Matching applications to processors
11:00 - 13:00 Reference architectures, experiences and published benchmarks


We invite you to register here to attend this session.
Here is the schedule for the upcoming sessions so that you can put them in your calendar and attend those that interest you:

Wednesday, November 25
Joint Sun-Oracle breakfast
"Strategy in the Datacenter. Security and administrative access control"
Thursday, December 10
Joint Sun-Oracle breakfast
"Strategy in the Datacenter: Access to data"
Wednesday, December 16 StarOffice 9

It is a good opportunity to catch up on what's new from Sun. Sign up here.

Today, I'm concluding one more batch of MX000 Server Administration training in Bangalore. Anyone interested in knowing details about what's covered in MX000 Server Administration Training, click here.



The Sun Studio Performance Analyzer reference manual, updated for Sun Studio 12 update 1, is now available on docs.sun.com:

Developing high performance applications requires a combination of compiler features, libraries of optimized functions, and tools for performance analysis. The Performance Analyzer manual describes the tools that are available to help you assess the performance of your code, identify potential performance problems, and locate the part of the code where the problems occur.

http://docs.sun.com/app/docs/doc/821-0304


In June 2009, Sun Studio announced a blogging contest that ran until September.
The winners of that contest are now being showcased on the Sun Studio landing page.
The first winner to be showcased, on the Sun Studio page and at SDN Program News, is Sandeep Koranne, whose entry describes how Sun Studio 12 compilers are used to engineer a complex, innovative discrete geometry algorithmic application. Sandeep is happy that he gets a 20% boost from Sun Studio compilers over GCC. But more than just performance, using Sun Studio 12 Compilers allowed him to "experiment with data-structures, perform automated performance tuning and overall presented a better environment for complex algorithmic coding, where the scientific researcher uses the programming environment to not only develop the code, but also to document and collaborate about the algorithm and methods used in the application". The code is written in Standard C++, uses STL and is written with portability in mind. Sandeep makes innovative use of an IDE feature for Automated Task List generation to collect a list of "TODO" items. Neat!
Good work, Sandeep. And congratulations!
And congratulations to the other winners as well.

I spent some time last week at OOW talking with Oracle customers regarding the technology in the Exadata V2 database machine. There were certainly a lot of customers excited to use this for their data warehouses - 21GB/sec disk throughput, 50GB/sec flash cache, and Hybrid Columnar Compression really accelerate this machine past the competition. The viability of Exadata V2 for DW/BI was really a given, but what impressed me the most was the number of customers looking to consolidate applications in this environment.

Ever since I was first brought onto this project, I thought Exadata V2 would be an excellent platform for consolidation. In my experience working on the largest of Sun's servers, I have seen customers with dozens of instances on a single machine. Using M9000 series machines, you can create domains in order to support multiple environments, and this very much mirrors what Exadata V2 can provide. Exadata V2 allows DBAs to deploy multiple instances across a grid of RAC nodes available in the DB machine, and since you are using RAC, availability is a given. Also, the addition of Flash allows for up to 1 million IOPS to support your ERP/OLTP environments. Consider the picture below.

With this environment, your production data warehouse can share the same infrastructure as the ERP, test, and development environments. This model allows the flexibility to add/subtract nodes from a particular database as needed. But, the operational efficiency is not the biggest benefit to consolidation. The savings in terms of power, space, and cooling are substantial.

Consider for a moment the number of drives necessary to match the 1 million IOPS available in the database machine. Assuming you are using the best 15,000 rpm drives, you would get about 250 IOPS per drive. So, to get to 1 million IOPS, you would need 4,000 drives! A highly dense 42U storage rack can house anywhere from 300 to 400 drives. So, you would need at least 10 racks (14 at the lower density) just for the storage, plus at least one rack for servers.
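
The arithmetic is easy to check. The sketch below uses the figures quoted in the paragraph (1 million IOPS, 250 per drive, 300 to 400 drives per rack); the function names are hypothetical and introduced only for this illustration.

```c
/* Ceiling division for positive integers: smallest q with q * b >= a. */
unsigned long ceil_div(unsigned long a, unsigned long b)
{
  return (a + b - 1) / b;
}

/* Drives needed to reach target_iops when each drive delivers iops_per_drive. */
unsigned long drives_needed(unsigned long target_iops,
                            unsigned long iops_per_drive)
{
  return ceil_div(target_iops, iops_per_drive);
}

/* Racks needed to house the given number of drives. */
unsigned long racks_needed(unsigned long drives, unsigned long drives_per_rack)
{
  return ceil_div(drives, drives_per_rack);
}
```

With these inputs, drives_needed(1000000, 250) gives 4,000 drives, which fill 10 racks at 400 drives per rack and 14 racks at 300 drives per rack.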

With Exadata V2, you get more than 10:1 savings in floor space and all the power and cooling benefits as well. It is no wonder people are excited about Exadata V2 as a platform to consolidate compute and storage resources.

I'm a great fan of the hardware performance counters that you find on most processors. Often you can look at the profile and instantly identify what the issue is. Sometimes though, it is not obvious, and that's where the performance counters can really help out.

I was looking at one such issue last week: the performance of the application was showing some variation, and it wasn't immediately obvious what the cause was. The usual suspects in these cases are:

  • Excessive system time
  • Process migration
  • Memory placement
  • Page size
  • etc.

Unfortunately, none of these seemed to explain the issue. So I hacked together a script, cputrackall, which ran the test code under cputrack for all the possible performance counters. I dumped the output into a spreadsheet and compared the fast and slow runs of the app. This is something of a "fishing trip" script, just gathering as much data as possible in the hope that something leaps out, but sometimes that's exactly what's needed. I regularly get to sit in front of a new chip before tools like ripc have been ported, and in those situations the easiest thing to do is to look for hardware counter events that might explain the runtime performance. In this particular instance, it helped me to confirm my suspicion that a difference in branch misprediction rates was causing the issue.

The City of San Antonio, Texas (CoSA) is home to more than 1.5 million people, and provides multiple online services to its residents including bill payment, career assistance, licensing, permits, and public safety information. CoSA employees also rely on access to applications and data for use in daily work activities including financial systems, HR software, and public safety applications used by the police and fire departments.

Over time, the city's server infrastructure had struggled to keep pace with its service delivery, and CoSA was running out of room in its data center. The CoSA IT department also needed to upgrade its IT infrastructure to reduce maintenance costs and enhance services. CoSA already had a long-established relationship with Sun and felt that leveraging Sun's SPARC servers as a platform for the Solaris 10 Operating System and Solaris zones provided the best opportunity for ROI with its virtualization technologies and energy-efficient mainframe-class servers.

[Image: Sun customer City of San Antonio. Image courtesy: City of San Antonio]

The city, managing several separate environments, decided to consolidate its SAP NetWeaver systems on Sun SPARC Enterprise M5000 and M4000 servers. CoSA also migrated from the Solaris 9 OS to the Solaris 10 OS to take advantage of Solaris Zones and allow multiple applications to run in isolation from one another on the same physical hardware. The solution also includes Sun Blade 6000 Modular systems and multiple Sun Fire T2000 servers with energy-efficient CoolThreads technology. Finally, CoSA replaced 80 physical Windows servers with 12 Sun Fire X4600 M2 servers as a VMware virtual infrastructure platform in its Windows environment.

Sun's server and virtualization solution allowed CoSA to consolidate from 16 to 4 racks of servers, and reduce the datacenter footprint for these workloads by over 85%. The solution has also helped CoSA reduce the maintenance overhead, giving administrators more time to deploy new systems that benefit the city and its residents. The consolidation has helped the city achieve considerable cost savings and CoSA expects to realize a full ROI within two and a half years based on the reduced support costs alone. Kevin Goodwin, the assistant director for CoSA IT department said: “Sun's enterprise-class virtualization technologies have served the City of San Antonio well. They're a critical component of our overall IT transformation and optimization strategy, allowing us to rapidly deploy highly available server capacity to meet the city's changing business needs while saving money in the process.”

Check out the complete details here.

I gave a short talk about Fortress, a new parallel language, at the Sun HPC Workshop in Regensburg, Germany and thought I'd post the slides here with commentary. Since I'm certainly not a Fortress expert by any stretch of the imagination, my intent was to give the audience a feel for the language and its origins rather than attempt a deep dive in any particular area. Christine Flood from the SunLabs Programming Languages Research Group helped me with the slides. I also stole liberally from presentations and other materials created by other Fortress team members.

The unofficial Fortress tag line, inspired by Fortress's emphasis on programmer productivity. With Fortress, programmer/scientists express their algorithms in a mathematical notation that is much closer to their domain of expertise than the syntax of the typical programming language. We'll see numerous examples in the following slides.

At the highest level, there are two things to know about Fortress. First, that it started as a SunLabs research project, and, second, that the work is being done in the open as the Project Fortress Community, whose website is here. Source code downloads, documentation, code samples, etc., are all available on the site.

Fortress was conceived as part of Sun's involvement in a DARPA program called High Productivity Computing Systems (HPCS), which was designed to encourage the development of hardware and software approaches that would significantly increase the productivity of the application developers and users of High Performance Computing systems. Each of the three companies selected to continue past the introductory phase of the program proposed a language designed to meet these requirements. IBM chose essentially to extend Java for HPC, while both Cray and Sun proposed new object-oriented languages. Michèle Weiland at the University of Edinburgh has written a short technical report that offers a comparison of the three language approaches. It is available in PDF format here.

I've mentioned productivity, but not defined it. I recommend visiting Michael Van De Vanter's publications page for more insight. Michael was a member of the Sun HPCS team who focused with several colleagues on the issue of productivity in an HPCS context. His considerable publication list is here.

Because I don't believe Sun's HPCS proposal has ever been made public, I won't comment further on the specific scalability goals set for Fortress other than to say they were chosen to complement the proposed hardware approach. Because Sun was not selected to proceed to the final phase of the HPCS program, we have not built the proposed system. We have, however, continued the Fortress project and several other initiatives that we believe are of continuing value.

Growability was a philosophical decision made by Fortress designers and we'll talk about that later. For now, note that Fortress is implemented as a small core with an extensive and growing set of capabilities provided by libraries.

As mentioned earlier, Fortress is designed to accommodate the programmer/scientist by allowing algorithms to be expressed directly in familiar mathematical notation. It is also important to note that Fortress constructs are parallel by default, unlike many other languages which require an explicit declaration to create parallelism. Actually to be more precise, Fortress is "potentially parallel" by default. If parallelism can be found, it will be exploited.

Finally, some code. We will look at several versions of a factorial function over the next several slides to illustrate some features of Fortress. (For additional illustrative Fibonacci examples, go here.) The first version of the function is shown here beneath a concise, mathematical definition of factorial for reference.

The red underlines highlight two Fortressisms. First, the condition in the first conditional is written naturally as a single range rather than as the more conventional (and convoluted) two-clause condition. And, second, the actual recursion shows that juxtaposition can be used to imply multiplication as is common when writing mathematical statements.

This version defines a new operator, the "!" factorial operator, and then uses that operator in the recursive step. The code has also been run through the Fortress pretty printer that converts it from ASCII form to a more mathematically formatted representation. As you can see, the core logic of the code now closely mimics the mathematical definition of factorial.

This non-recursive version of the operator definition uses a loop to compute the factorial.

Since Fortress is parallel by default, all iterations of this loop could theoretically be executed in parallel, depending on the underlying platform. The "atomic" keyword ensures that the update of the variable result is performed atomically to ensure correct execution.
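To make the role of the atomic update concrete, here is a rough C analogue using POSIX threads rather than Fortress itself. It is only a sketch under my own assumptions: the names parallel_factorial, multiply_range and MAX_THREADS are invented for illustration, and a mutex stands in for the Fortress "atomic" keyword.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 8   /* hypothetical cap, chosen only for this sketch */

/* The factors lo..hi (inclusive) handled by one thread. */
struct range { unsigned lo, hi; };

/* Shared accumulator and the lock that makes each update atomic. */
static unsigned long long result;
static pthread_mutex_t result_lock = PTHREAD_MUTEX_INITIALIZER;

static void *multiply_range(void *arg)
{
  const struct range *r = arg;
  for (unsigned i = r->lo; i <= r->hi; i++) {
    pthread_mutex_lock(&result_lock);    /* start of the "atomic" update */
    result *= i;
    pthread_mutex_unlock(&result_lock);  /* end of the "atomic" update */
  }
  return NULL;
}

/* Computes n! by handing each thread a slice of the factors 1..n.
   Only valid while n! fits in an unsigned long long (n <= 20). */
unsigned long long parallel_factorial(unsigned n, unsigned nthreads)
{
  pthread_t tid[MAX_THREADS];
  struct range r[MAX_THREADS];
  int started[MAX_THREADS];

  if (nthreads < 1) nthreads = 1;
  if (nthreads > MAX_THREADS) nthreads = MAX_THREADS;
  unsigned chunk = (n + nthreads - 1) / nthreads;

  result = 1;
  for (unsigned t = 0; t < nthreads; t++) {
    r[t].lo = t * chunk + 1;
    r[t].hi = (t + 1) * chunk < n ? (t + 1) * chunk : n;
    started[t] = pthread_create(&tid[t], NULL, multiply_range, &r[t]) == 0;
    if (!started[t])
      multiply_range(&r[t]);   /* fall back to running the slice inline */
  }
  for (unsigned t = 0; t < nthreads; t++)
    if (started[t])
      pthread_join(tid[t], NULL);
  return result;
}
```

Calling parallel_factorial(10, 4) multiplies the factors in whatever order the threads happen to run. Because integer multiplication is commutative and every update is protected, the order does not change the answer. Without the lock, two threads could read the same old value of result and one of the multiplications would be lost.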

This slide shows an example of how Fortress code is written with a standard keyboard and what the code looks like after it is formatted with Fortify, the Fortress pretty printer. Several common mathematical operators are shown at the bottom of the slide along with their ASCII equivalents.

A few examples of Fortress operator precedence. Perhaps the most interesting point is the fact that white space matters to the Fortress parser. Since the spacing in the 2nd negative example implies a precedence different than the actual precedence, this statement would be rejected by Fortress on the theory that its execution would not compute the result intended by the programmer.

Don't go overboard with juxtaposition as a multiplication operator -- there is clearly still a role for parentheses in Fortress, especially when dealing with complex expressions. While these two statements are supposed to be equivalent, I should point out that the first statement actually has a typo and will be rejected by Fortress. Can you spot the error? It's the "3n" that's the problem because it isn't a valid Fortress number, illustrating one case in which a juxtaposition in everyday math isn't accepted by the language. Put a space between the "3" and the "n" to fix the problem.

Here is a larger example of Fortress code. On the left is the algorithmic description of the conjugate gradient (CG) component of the NAS parallel benchmarks, taken directly from the original 1994 technical report. On the right is the Fortress code. Or do I have that backwards? :-)

More Fortress code is available on the Fortress by Example page at the community site.

Several ways to express ranges in Fortress.

The first static array definition creates a one dimensional array of 1000 32-bit integers. The second definition creates a one dimensional array of length size, initialized to zero. I know it looks like a 2D array, but the 2nd instance of ZZ32 in the Array construct refers to the type of the index rather than specifying a 2nd array dimension.

The last array subexpression is interesting, since it is only partially specified. It extracts a 20x10 subarray from array b starting at its origin.

Tuple components can be evaluated in parallel, including arguments to functions. As with for loops, do clauses execute in parallel.

In Fortress, generators control how loops are run and generators generally run computations in any order, often in parallel. As an example, the sum reduction over X and Y is controlled by a generator that will cause the summation of the products to occur in parallel or at the least in a non-deterministic order if running on a single-processor machine.
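For readers more at home in C, the sum reduction over X and Y can be approximated with POSIX threads. This is a hand-written sketch of what a parallel generator might arrange behind the scenes, not the Fortress implementation; parallel_dot, dot_slice and NWORKERS are names I made up for the example.

```c
#include <pthread.h>
#include <stddef.h>

#define NWORKERS 4   /* hypothetical worker count for the sketch */

/* One worker's share of the index space plus its private partial sum. */
struct slice {
  const double *x, *y;
  size_t begin, end;   /* half-open range [begin, end) */
  double partial;
};

static void *dot_slice(void *arg)
{
  struct slice *s = arg;
  s->partial = 0.0;
  for (size_t i = s->begin; i < s->end; i++)
    s->partial += s->x[i] * s->y[i];
  return NULL;
}

/* Sums x[i] * y[i] over 0..n-1, with each worker reducing its own slice. */
double parallel_dot(const double *x, const double *y, size_t n)
{
  pthread_t tid[NWORKERS];
  struct slice s[NWORKERS];
  int started[NWORKERS];
  double sum = 0.0;

  for (int t = 0; t < NWORKERS; t++) {
    s[t].x = x;
    s[t].y = y;
    s[t].begin = n * (size_t)t / NWORKERS;
    s[t].end = n * (size_t)(t + 1) / NWORKERS;
    started[t] = pthread_create(&tid[t], NULL, dot_slice, &s[t]) == 0;
    if (!started[t])
      dot_slice(&s[t]);          /* no thread available: do it inline */
  }
  for (int t = 0; t < NWORKERS; t++) {
    if (started[t])
      pthread_join(tid[t], NULL);
    sum += s[t].partial;         /* combine the partial sums */
  }
  return sum;
}
```

Each worker keeps a private partial sum, so no locking is needed until the final combine. Note that the products are added in a different order than a serial loop would use. For floating point data that can change the last few bits of the result, which is one reason a language has to say explicitly that a reduction may run in any order.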

In Fortress, when parallelism is generated, the execution of that work is handled using a work stealing strategy similar to that used by Cilk. Essentially, when a compute resource finishes executing its tasks, it pulls work items from other processors' work queues, ensuring that compute resources stay busy by load balancing the available work across all processors.
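The effect of work stealing is easy to see in a toy model. The following C sketch is a single-threaded, discrete-time simulation that I wrote for illustration; it is not how Fortress or Cilk implement their schedulers. All tasks start on worker 0, the owner takes work from one end of its queue, and idle workers steal from the other end of the fullest queue. The names simulate, NWORKERS and MAXTASKS are hypothetical.

```c
#define NWORKERS 4    /* hypothetical number of workers */
#define MAXTASKS 64   /* hypothetical queue capacity */

/* One worker: a queue of task costs plus the time left on its current task. */
struct worker {
  int tasks[MAXTASKS];
  int head, tail;   /* thieves take from head, the owner takes from tail */
  int remaining;    /* time units left on the running task, 0 when idle */
};

static int queued(const struct worker *w) { return w->tail - w->head; }

/* Runs the model and returns the number of time steps until all tasks finish.
   With allow_steal == 0, idle workers simply wait. */
int simulate(const int *costs, int ntasks, int allow_steal)
{
  struct worker w[NWORKERS] = {0};
  int done = 0, now = 0;

  if (ntasks > MAXTASKS) ntasks = MAXTASKS;
  for (int i = 0; i < ntasks; i++)           /* all work starts on worker 0 */
    w[0].tasks[w[0].tail++] = costs[i] > 0 ? costs[i] : 1;

  while (done < ntasks) {
    for (int i = 0; i < NWORKERS; i++) {     /* idle workers look for work */
      if (w[i].remaining > 0) continue;
      if (queued(&w[i]) > 0) {
        w[i].remaining = w[i].tasks[--w[i].tail];
      } else if (allow_steal) {
        int victim = -1;                     /* pick the fullest other queue */
        for (int v = 0; v < NWORKERS; v++)
          if (v != i && queued(&w[v]) > 0 &&
              (victim < 0 || queued(&w[v]) > queued(&w[victim])))
            victim = v;
        if (victim >= 0)
          w[i].remaining = w[victim].tasks[w[victim].head++];
      }
    }
    for (int i = 0; i < NWORKERS; i++)       /* advance one time unit */
      if (w[i].remaining > 0 && --w[i].remaining == 0)
        done++;
    now++;
  }
  return now;
}
```

With eight unit-cost tasks, the model finishes in 2 steps when stealing is allowed and 8 steps when it is not, because in the second case worker 0 does everything itself. A real scheduler also has to pay for the steal and synchronize access to the queues, both of which this model ignores.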

Essentially, a restatement of an earlier point: In Fortress, generators play the role that iterators play in other languages. By relegating the details of how the index space is processed to the generator, it is natural to then also allow the generator to control how the enclosed processing steps are executed. A generator might execute computations serially or in parallel.

A generator could conceivably also control whether computations are done locally on a single system or distributed across a cluster, though the Fortress interpreter currently only executes within a single node. To me, the generator concept is one of the nicer aspects of Fortress.

Guy Steele, who is Fortress Principal Investigator along with Eric Allen, has been working in the programming languages area long enough to know the wisdom of these statements. Watch him live the reality of growing a language in his keynote at the 1998 ACM OOPSLA conference. Be amazed at the cleverness, but listen to the message as well.

The latest version of the Fortress interpreter (source and binary) is available here. If you would like to browse the source code online, do so here.

Some informational pointers. Christine also tells me that the team is working on an overview talk like this one. Except I expect it will be a lot better. :-) Though I only scratched the surface in a superficial way, I hope this brief overview has given you at least the flavor of what Project Fortress is about.


Darryl Gove has a brief but informative slideware discussion showing the use of the Performance Analyzer with a simple OpenMP code that demonstrates the performance differences between static and dynamic scheduling. It's worth watching.

In case you missed news from Oracle Open World this week, Sun announced new world records and the world's fastest flash array, among the Sun products making industry news...

One of the key announcements focused on Sun's revamped SPARC Enterprise Server Line:

Sun and Fujitsu announce new, faster, quad-core SPARC64 VII processors and an enhanced memory controller for the SPARC Enterprise server line. This enhanced server line has already proven itself by setting four new world records, and the faster processors deliver customers up to 20 percent better performance than the previous generation. The update we're announcing today makes it possible for customers to increase the performance of their mission-critical enterprise applications while capitalizing on existing infrastructure investments -- with upgrades at half the cost of IBM.

Our SPARC Enterprise servers, together with Solaris, offer an unequaled combination of mainframe-class performance and reliability, as well as virtualization and consolidation capabilities, in an open system.

Taking into account Sunday's announcement of Sun FlashFire, this week's announcements demonstrate Sun's continued innovation across the systems portfolio, from high-end servers to the integration of Flash technology across our entire software, server and storage portfolio.

For more information on this week's Sun news check out the following:

Oracle Open World Highlights
OOW McNealy Keynote Highlights
OOW Ellison Keynote Highlights

It was heartening to see a lot of Sun hardware at Oracle OW. For years, I've tried to persuade Sun TechDays and other folks to showcase Sun hardware at these developer shows, but it's never really materialized in any meaningful way. Sure, there's the odd server for virtualization, etc. at the shows, but that was mostly it.
By comparison, there was plenty of Sun HW here. I'm going to try and list out some of the big, hunking boxes I saw in the Sun booth and elsewhere. I'm sure my list isn't complete; I expect I will update this blog to make it more so. For now, here is what I saw.
  1. Top of the list, of course, is the Sun Oracle Exadata Version 2 (tagline: Hardware from Sun, Software from Oracle). Basically an OLTP database machine billed as twice as fast as its predecessor. This was the treat of the show, showcased just outside the Keynote location. Impressive piece of iron and it drew a lot of crowds (both onlookers as well as buyers, from what I hear).
  2. StorageTek Modular Library system with 200 to 3000 cartridge slots (machine on display had 700). With a robotic arm that was continuously in motion, this machine made an impressive demo. And it was placed right next to our SunStudio booth, which drew curious onlookers.
  3. Sun Storage 7000 Unified Storage, aka Amber Road. This is an amazing amount of data (those on display were 12TB systems) in a small form-factor and with some amazing ease of administration to go with it.
  4. Sun Storage Flash Array system. This is the secret sauce that makes the Exadata database machine tick! Flash speeds are the talk of the town since they have the potential to increase IOPS by an order of magnitude and save $$$ by making disk/Flash tradeoffs for throughput, storage and price.
  5. Rackmount Servers: The demo stations mostly featured rackmount systems based on UltraSPARC T2 (Enterprise T5240 servers), Nehalem (Sun Fire X4450 servers), or AMD (Sun Fire 4240 servers).
Besides this, there were banners about the Sun branded database machine built out of UltraSPARC T2 5440s that recently claimed #1 status in all 7 key benchmarks (follow this link). The message was clear, from what I could tell: Sun is going to bring performance to the game and Oracle will optimize all Software to work efficiently on Solaris and Sun systems. In view of recent press announcements touting World Record TPC-C performance and a promise to keep Sun customers happy by investing even more in Sun technology than previously, this showcasing of Sun hardware bodes well for Sun customers as well as for Oracle's enterprise partners and customers. Best of all, there seems to be a palpable excitement in both companies about the synergies around this acquisition that was hard to miss both from the Sun booth as well as the Oracle booths.

Yesterday was my first day at OOW. Even though there were some scintillating events over the weekend, in particular these keynotes from Sun's Scott McNealy & James Gosling (view here) and Oracle's Larry Ellison (view here), I wasn't at that portion of OOW.
My first impression, even before I entered Moscone, was Wow! The place was entirely taken over by Oracle. Buses ran billboards advertising Oracle and the event, and there was even a huge tent between Moscone North and South, reserved as a dining area and essentially closing Howard Street (picture here). There was even the scale model BMW Oracle Racing High-tech Catamaran on display at the Fourth and Howard Streets intersection. Exhibitions were in Moscone South AND Moscone West. Essentially, that 6 block area was nothing but Oracle OpenWorld.
My second impression was suits. Lots and lots of them. Essentially different from IDF, which billed itself as the next, next, next big thing, and JavaOne, which is clearly a hacker's conference (and where James reminded Sun CEO Jonathan Schwartz that he was out of place in his suit at the keynote and got huge applause from the audience), this one is a carefully and well-scripted conference. I could not listen to the entire keynote from Phillips and Catz (view here), but what I could hear was very carefully laid out and executed. One astounding fact I gathered (and later could relate to): Oracle has over 3000 products and the portfolio is growing ever faster!
So, I had booth duty on the exhibition floor. Moscone South. Essentially a technology, but even more importantly, a services showcase. All the major partners were there: HP, IBM, Dell, AMD, Intel and of course Sun. And also, networking and wireless partners like Cisco, Brocade, AT&T, Blackberry and Verizon. But also, Infosys, CSC, NetApp, Deloitte, Wipro, EDS, Accenture, KPMG, PricewaterhouseCoopers, Tata Consulting (TCS). I'm singling out that last list because I haven't seen them at any of the developer conferences I usually go to (Sun TechDays, JavaOne, IDF, LinuxWorld, etc). Oracle itself was fairly hidden (or backgrounded), giving their partners essentially all the glory and top spots on the floor. [Moscone West has a HUGE, HUGE Salesforce.com presence which I intend to check out today].
There was a Cloud booth (for those of you who think Oracle is anti-Cloud) and I engaged in some interesting and long discussions with vendors in that booth (except Amazon, I'll corner them today, because they are more of a known quantity as far as I'm concerned, so unlikely that I'll learn anything new). On-Demand computing seems to have a big presence in what Oracle calls "DemoGrounds" (see this picture, eg).
The Sun booths were very strategic and visible. Right next to the main entrance. We had some foot traffic, but for the Sun Studio booth, mostly non-existent. I probably talked to about a dozen to 15 non-Sun folks and some of them were even Oracle folks, who I knew by email before. Given that the crowd was a suited, mostly business IT type crowd, I am not surprised. A few that came by were disappointed that we didn't run on Windows, but were suitably impressed by the offering and demo when I showed them what we had.
An interesting day. Tiring, since the shift turned out to be a 5+ hour shift without a lot of interesting traffic, but I think I learned a bit from others there. Which makes it entirely worthwhile.
More details tomorrow, I hope.

The SPEC CPU2006 benchmarks were run on the new 2.88 GHz and 2.53 GHz SPARC64 VII processors for the Sun SPARC Enterprise M-series servers. The new processors were tested in the Sun SPARC Enterprise M4000, M5000, M8000, and M9000 servers.


  • The Sun SPARC Enterprise M9000 server running the new 2.88 GHz SPARC64 VII processors beats the IBM Power 595 server running 5.0 GHz POWER6 processors by 20% on the SPECint_rate2006 benchmark.

  • The Sun SPARC Enterprise M9000 server running the new 2.88 GHz SPARC64 VII processors beats the IBM Power 595 server running 5.0 GHz POWER6 processors by 29% on the SPECint_rate_base2006 benchmark.

  • The Sun SPARC Enterprise M9000 server with 64 SPARC64 VII 2.88GHz processors delivered results of 2590 SPECint_rate2006 and 2100 SPECfp_rate2006.

  • The Sun SPARC Enterprise M9000 server with 64 SPARC64 VII processors at 2.88GHz improves performance vs. 2.52 GHz by 13% for SPECint_rate2006 and 5% for SPECfp_rate2006.

  • The Sun SPARC Enterprise M9000 server with 32 SPARC64 VII 2.88GHz processors delivered results of 1450 SPECint_rate2006 and 1250 SPECfp_rate2006.

  • The Sun SPARC Enterprise M9000 server with 32 SPARC64 VII processors at 2.88GHz improves performance vs. 2.52 GHz by 17% for SPECint_rate2006 and 13% for SPECfp_rate2006.

  • The Sun SPARC Enterprise M8000 server with 16 SPARC64 VII 2.88GHz processors delivered results of 753 SPECint_rate2006 and 666 SPECfp_rate2006.

  • The Sun SPARC Enterprise M8000 server with 16 SPARC64 VII processors at 2.88GHz improves performance vs. 2.52 GHz by 18% for SPECint_rate2006 and 14% for SPECfp_rate2006.

  • The Sun SPARC Enterprise M5000 server with 8 SPARC64 VII 2.53GHz processors delivered results of 296 SPECint_rate2006 and 234 SPECfp_rate2006.

  • The Sun SPARC Enterprise M5000 server with 8 SPARC64 VII processors at 2.53GHz improves performance vs. 2.40 GHz by 12% for SPECint_rate2006 and 5% for SPECfp_rate2006.

  • The Sun SPARC Enterprise M4000 server with 4 SPARC64 VII 2.53GHz processors delivered results of 152 SPECint_rate2006 and 116 SPECfp_rate2006.

  • The Sun SPARC Enterprise M4000 server with 4 SPARC64 VII processors at 2.53GHz improves performance vs. 2.40 GHz by 13% for SPECint_rate2006 and 4% for SPECfp_rate2006.
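The percentages in the list above follow directly from the published scores. As a quick check, here is a small C helper that reproduces the arithmetic; percent_gain is a name I picked for the example, and the scores passed to it are the peak and base values from the tables on this page.

```c
/* Percentage by which new_score exceeds old_score, rounded to the nearest
   whole percent. Intended for non-negative gains only. */
int percent_gain(double new_score, double old_score)
{
  double gain = (new_score / old_score - 1.0) * 100.0;
  return (int)(gain + 0.5);
}
```

For example, percent_gain(2590, 2155) compares the M9000 against the IBM Power 595 on SPECint_rate2006 and gives 20, percent_gain(2400, 1866) does the same for SPECint_rate_base2006 and gives 29, and percent_gain(2590, 2288) compares the 2.88 GHz and 2.52 GHz M9000 results and gives 13.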

Performance Landscape

SPEC CPU2006 Performance Charts - bigger is better, selected results, please see www.spec.org for complete results. All results as of 10/07/09.

In the tables below
"Base" = SPECint_rate_base2006 or SPECfp_rate_base2006
"Peak" = SPECint_rate2006 or SPECfp_rate2006

SPECint_rate2006 results - large systems

| System | Cores/Chips | Processor Type | GHz | Base Copies | Base | Peak | Comments |
|---|---|---|---|---|---|---|---|
| SGI Altix 4700 Bandwidth | 1024/512 | Itanium 2 | 1.6 | 1020 | 9031 | na | |
| Sun Blade X6440 Cluster | 768/192 | Opteron 8384 | 2.7 | 705 | 8845 | na | |
| SGI Altix 4700 Density | 256/128 | Itanium 2 | 1.66 | 256 | 2893 | 3354 | |
| vSMP Foundation | 128/32 | Xeon X5570 | 2.93 | 255 | 3147 | na | |
| SGI Altix 4700 Bandwidth | 256/128 | Itanium 2 | 1.6 | 256 | 2715 | 2971 | |
| SPARC Enterprise M9000 | 256/64 | SPARC64 VII | 2.88 | 511 | 2400 | 2590 | New |
| SPARC Enterprise M9000 | 256/64 | SPARC64 VII | 2.52 | 511 | 2088 | 2288 | |
| IBM Power 595 | 64/32 | POWER6 | 5.0 | 128 | 1866 | 2155 | |
| HP Superdome | 128/64 | Itanium 2 | 1.6 | 128 | 1534 | 1648 | |
| SPARC Enterprise M9000 | 128/32 | SPARC64 VII | 2.88 | 255 | 1370 | 1450 | New |
| SPARC Enterprise M9000 | 128/64 | SPARC64 VI | 2.4 | 255 | 1111 | 1294 | |
| SPARC Enterprise M9000 | 128/32 | SPARC64 VI | 2.52 | 255 | 1141 | 1240 | |
| Unisys ES7000 | 96/16 | Xeon X7460 | 2.66 | 96 | 999 | 1049 | |
| SGI Altix ICE 8200EX | 32/8 | Xeon X5570 | 2.93 | 64 | 931 | 999 | |
| IBM Power 575 | 32/16 | POWER6 | 4.7 | 64 | 812 | 934 | |
| IBM Power 570 | 32/16 | POWER6+ | 4.2 | 64 | 661 | 832 | |
| SPARC Enterprise M8000 | 64/16 | SPARC64 VII | 2.88 | 127 | 706 | 753 | New |
| SPARC Enterprise M9000 | 64/32 | SPARC64 VI | 2.4 | 127 | 553 | 650 | |
| SPARC Enterprise M8000 | 64/16 | SPARC64 VII | 2.52 | 127 | 565 | 637 | |

SPECint_rate2006 results - small systems

| System | Cores/Chips | Processor Type | GHz | Base Copies | Base | Peak | Comments |
|---|---|---|---|---|---|---|---|
| Sun Fire X4440 | 24/4 | Opteron 8435 SE | 2.6 | 24 | 296 | 377 | |
| SPARC Enterprise M5000 | 32/8 | SPARC64 VII | 2.53 | 64 | 267 | 296 | New |
| Sun Blade X6440 | 16/4 | Opteron 8389 | 2.9 | 16 | 226 | 292 | |
| HP ProLiant BL680c G5 | 24/4 | Xeon E7458 | 2.4 | 24 | 247 | 268 | |
| SPARC Enterprise M5000 | 32/8 | SPARC64 VII | 2.4 | 63 | 232 | 264 | |
| IBM Power 550 | 8/4 | POWER6+ | 5.0 | 16 | 215 | 263 | |
| Sun Fire X2270 | 8/2 | Xeon X5570 | 2.93 | 16 | 223 | 260 | |
| SPARC Enterprise T5240 | 16/2 | UltraSPARC T2 Plus | 1.6 | 127 | 171 | 183 | |
| SPARC Enterprise M4000 | 16/4 | SPARC64 VII | 2.53 | 32 | 136 | 152 | New |
| SPARC Enterprise M4000 | 16/4 | SPARC64 VII | 2.4 | 32 | 118 | 135 | |

SPECfp_rate2006 results - large systems

| System | Cores/Chips | Processor Type | GHz | Base Copies | Base | Peak | Comments |
|---|---|---|---|---|---|---|---|
| SGI Altix 4700 Bandwidth | 1024/512 | Itanium 2 | 1.6 | 1020 | 10583 | na | |
| SGI Altix 4700 Density | 1024/512 | Itanium 2 | 1.66 | 1020 | 10580 | na | |
| Sun Blade X6440 Cluster | 768/192 | Opteron 8384 | 2.7 | 705 | 6502 | na | |
| SGI Altix 4700 Bandwidth | 256/128 | Itanium 2 | 1.6 | 256 | 3419 | 3507 | |
| ScaleMP vSMP Foundation | 128/32 | Xeon X5570 | 2.93 | 255 | 2553 | na | |
| IBM Power 595 | 64/32 | POWER6 | 5.0 | 128 | 1681 | 2184 | |
| IBM Power 595 | 64/32 | POWER6 | 5.0 | 128 | 1822 | 2108 | |
| SPARC Enterprise M9000 | 256/64 | SPARC64 VII | 2.88 | 511 | 1930 | 2100 | New |
| SPARC Enterprise M9000 | 256/64 | SPARC64 VII | 2.52 | 511 | 1861 | 2005 | |
| SGI Altix 4700 Bandwidth | 128/64 | Itanium 2 | 1.66 | 128 | 1832 | 1947 | |
| HP Superdome | 128/64 | Itanium 2 | 1.6 | 128 | 1422 | 1479 | |
| SPARC Enterprise M9000 | 128/32 | SPARC64 VII | 2.88 | 255 | 1190 | 1250 | New |
| SPARC Enterprise M9000 | 128/64 | SPARC64 VI | 2.4 | 255 | 1160 | 1225 | |
| SPARC Enterprise M9000 | 128/32 | SPARC64 VII | 2.52 | 255 | 1059 | 1110 | |
| IBM Power 575 | 32/16 | POWER6 | 4.7 | 64 | 730 | 839 | |
| SPARC Enterprise M8000 | 64/16 | SPARC64 VII | 2.88 | 127 | 616 | 666 | New |
| SPARC Enterprise M9000 | 64/32 | SPARC64 VI | 2.52 | 127 | 588 | 636 | |
| IBM Power 570 | 32/16 | POWER6+ | 4.2 | 64 | 517 | 602 | |
| SPARC Enterprise M8000 | 64/32 | SPARC64 VI | 2.4 | 127 | 538 | 582 | |

SPECfp_rate2006 results - small systems

| System | Cores/Chips | Processor Type | GHz | Base Copies | Base | Peak | Comments |
|---|---|---|---|---|---|---|---|
| Supermicro H8QM8-2 | 24/4 | Opteron 8435 SE | 2.8 | 24 | 261 | 287 | |
| SPARC Enterprise T5440 | 32/4 | UltraSPARC T2 Plus | 1.6 | 255 | 254 | 270 | |
| IBM Power 560 | 16/8 | POWER6+ | 3.6 | 32 | 226 | 263 | |
| SPARC Enterprise M5000 | 32/8 | SPARC64 VII | 2.53 | 64 | 218 | 234 | New |
| SPARC Enterprise M5000 | 32/8 | SPARC64 VII | 2.4 | 63 | 208 | 223 | |
| IBM Power 550 | 8/4 | POWER6+ | 5.0 | 16 | 188 | 222 | |
| ASUS Z8PE-D18 | 8/2 | Xeon X5570 | 2.93 | 16 | 197 | 203 | |
| SPARC Enterprise T5240 | 16/2 | UltraSPARC T2 Plus | 1.6 | 127 | 124 | 133 | |
| SPARC Enterprise M4000 | 16/4 | SPARC64 VII | 2.53 | 32 | 111 | 116 | New |
| SPARC Enterprise M4000 | 16/4 | SPARC64 VII | 2.4 | 32 | 107 | 112 | |

Results and Configuration Summary

Test Configurations:

Sun SPARC Enterprise M9000
64 x 2.88 GHz SPARC64 VII
1152 GB (448 x 2GB + 64 x 4GB)
Solaris 10 5/09
Sun Studio 12 Update 1

Sun SPARC Enterprise M9000
32 x 2.88 GHz SPARC64 VII
704 GB (160 x 2GB + 96 x 4GB)
Solaris 10 5/09
Sun Studio 12 Update 1

Sun SPARC Enterprise M8000
16 x 2.88 GHz SPARC64 VII
512 GB (128 x 4GB)
Solaris 10 10/09
Sun Studio 12 Update 1

Sun SPARC Enterprise M5000
8 x 2.53 GHz SPARC64 VII
128 GB (64 x 2GB)
Solaris 10 10/09
Sun Studio 12 Update 1

Sun SPARC Enterprise M4000
4 x 2.53 GHz SPARC64 VII
32 GB (32 x 1GB)
Solaris 10 10/09
Sun Studio 12 Update 1

Results Summary:

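| Metric | M9000 (64 chips) | M9000 (32 chips) | M8000 | M5000 | M4000 |
|---|---|---|---|---|---|
| SPECint_rate_base2006 | 2400 | 1370 | 706 | 267 | 136 |
| SPECint_rate2006 | 2590 | 1450 | 753 | 296 | 152 |
| SPECfp_rate_base2006 | 1930 | 1190 | 616 | 218 | 111 |
| SPECfp_rate2006 | 2100 | 1250 | 666 | 234 | 116 |
| SPECint_base2006 | - | - | 12.4 | - | 12.1 |
| SPECint2006 | - | - | 13.6 | - | 12.9 |
| SPECfp_base2006 | - | - | 15.6 | - | 13.3 |
| SPECfp2006 | - | - | 16.5 | - | 13.9 |
| SPECfp2006 - autopar | - | - | 28.2 | - | - |
| SPECfp2006 - autopar | - | - | 33.9 | - | - |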

Benchmark Description

SPEC CPU2006 is SPEC's most popular benchmark, with over 8000 results published in the three years since it was introduced. It measures:

  • "Speed" - single copy performance of chip, memory, compiler
  • "Rate" - multiple copy (throughput)

The rate metrics are used for the throughput-oriented systems described on this page. These metrics include:

  • SPECint_rate2006: throughput for 12 integer benchmarks derived from real applications such as perl, gcc, XML processing, and pathfinding
  • SPECfp_rate2006: throughput for 17 floating point benchmarks derived from real applications, including chemistry, physics, genetics, and weather.

There are "base" variants of both the above metrics that require more conservative compilation, such as using the same flags for all benchmarks.

Key Points and Best Practices

Results on this page for the Sun SPARC Enterprise M9000 server were measured on a Fujitsu SPARC Enterprise M9000. The Sun SPARC Enterprise M9000 and Fujitsu SPARC Enterprise M9000 are electronically equivalent. Results for the Sun SPARC Enterprise M8000, M4000 and M5000 were measured on those systems. The similarly named Fujitsu systems are electronically equivalent.

Use the latest compiler. The Sun Studio group is always working to improve the compiler. Sun Studio 12 Update 1, which is used in these submissions, provides updated code generation for a wide variety of SPARC and x86 implementations.

I/O still counts. Even in a CPU-intensive workload, some I/O remains. This point is explored in some detail at http://blogs.sun.com/jhenning/entry/losing_my_fear_of_zfs.

Disclosure Statement

SPEC, SPECint, SPECfp reg tm of Standard Performance Evaluation Corporation. Competitive results from www.spec.org as of 7 October 2009. Sun's new results quoted on this page have been submitted to SPEC. Sun SPARC Enterprise M9000 2400 SPECint_rate_base2006, 2590 SPECint_rate2006, 1930 SPECfp_rate_base2006, 2100 SPECfp_rate2006; Sun SPARC Enterprise M9000 (32 chips) 1370 SPECint_rate_base2006, 1450 SPECint_rate2006, 1190 SPECfp_rate_base2006, 1250 SPECfp_rate2006; Sun SPARC Enterprise M8000 706 SPECint_rate_base2006, 753 SPECint_rate2006, 616 SPECfp_rate_base2006, 666 SPECfp_rate2006; Sun SPARC Enterprise M5000 267 SPECint_rate_base2006, 296 SPECint_rate2006, 218 SPECfp_rate_base2006, 234 SPECfp_rate2006; Sun SPARC Enterprise M4000 136 SPECint_rate_base2006, 152 SPECint_rate2006, 111 SPECfp_rate_base2006, 116 SPECfp_rate2006; Sun SPARC Enterprise M9000 (2.52GHz) 2088 SPECint_rate_base2006, 2288 SPECint_rate2006, 1860 SPECfp_rate_base2006, 2010 SPECfp_rate2006; Sun SPARC Enterprise M9000 (32 chips 2.52GHz) 1140 SPECint_rate_base2006, 1240 SPECint_rate2006, 1060 SPECfp_rate_base2006, 1110 SPECfp_rate2006; Sun SPARC Enterprise M8000 (2.52GHz) 565 SPECint_rate_base2006, 637 SPECint_rate2006, 538 SPECfp_rate_base2006, 582 SPECfp_rate2006; Sun SPARC Enterprise M5000 (2.4GHz) 232 SPECint_rate_base2006, 264 SPECint_rate2006, 208 SPECfp_rate_base2006, 223 SPECfp_rate2006; Sun SPARC Enterprise M4000 (2.4GHz) 118 SPECint_rate_base2006, 135 SPECint_rate2006, 107 SPECfp_rate_base2006, 112 SPECfp_rate2006; IBM Power 595 1866 SPECint_rate_base2006, 2155 SPECint_rate2006,

I had an e-mail which told the sorry tale of a new system which took longer to build a project than an older system of theoretically similar performance. The system showed low utilisation when doing the build, indicating that it was probably spending a lot of time waiting for something.

The first thing to look at was a profile of the build process using `collect -F on`, which produced the interesting result that the build was taking just over 2 minutes of user time, a few seconds of system time, and thousands of seconds of "Other Wait" time.

"Other wait" often means waiting for network, or disk, or just sleeping. The other thing to realise about profiling multiple processes is that all the times are cumulative, so all the processes that are waiting accumulate "other wait" time. Hence it will be a rather large number if multiple processes are doing it. So this confirmed and half explained the performance issue. The build was slow because it was waiting for something.

Sorting the profile by "other wait" indicated two places that the wait was coming from. One was waitpid, meaning that the time was due to a process waiting for another process; well, we knew that! The other was a door call. Tracing up the call stack eventually led into the C and C++ compilers, which were calling gethostbyname. The routine doing the calling was "generate_prefix", which is the routine responsible for generating a random prefix for function names; the IP address of the machine was used as one of the inputs for the generation of a prefix.

The performance problem was due to gethostbyname timing out. Common reasons for this are misconfigurations in the /etc/hosts and /etc/nsswitch.conf files. In this example, adding the host name to the hosts file cured the problem.
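If you suspect the same problem on your own system, a quick way to check is to time the lookup directly. The following C function is a minimal sketch written for this post, not code taken from the compiler; time_host_lookup is a made-up name, and the function just wraps the standard gethostbyname call with a monotonic clock.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* asks for clock_gettime in strict modes */
#endif
#include <netdb.h>
#include <stddef.h>
#include <time.h>

/* Returns the wall-clock seconds spent inside gethostbyname(name).
   If found is not NULL, *found is set to 1 when the name resolved, else 0. */
double time_host_lookup(const char *name, int *found)
{
  struct timespec start, end;

  clock_gettime(CLOCK_MONOTONIC, &start);
  struct hostent *h = gethostbyname(name);
  clock_gettime(CLOCK_MONOTONIC, &end);

  if (found != NULL)
    *found = (h != NULL);
  return (double)(end.tv_sec - start.tv_sec)
       + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Call it with the machine's own host name. A healthy configuration should answer in a small fraction of a second, while a lookup that takes several seconds before failing or succeeding points at the kind of timeout described above.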

The compiler flag -xalias_level allows a user to assert the degree of aliasing that exists within the source code of an application. If the assertion is not true, then the behaviour of the application is undefined. It is definitely worth looking at the examples given in the user's guide, although they can be a bit "dry" to read. So here's an example which illustrates what can happen:

struct stuff{
 int value1;
 int value2;
};

void fill(struct stuff *x)
{
  x->value1=0;      // Clear value1 
  int * r=(int*)x;  // Take the address of the structure
  int var = *r;     // Take the value from value1
  x->value1=var;    // And store it back into value1
}

The above code will clear value1 and then load and store this value back. So for correctly working code, value1 should exit the function containing zero. However, if -xalias_level=basic is used to build the application, this tells the compiler that no two pointers to variables of different types will alias. Here r is a pointer to an int and x is a pointer to a structure, so the compiler assumes that the read from *r does not alias with x->value1.

So with this knowledge the compiler is free to remove the original store to x->value1, because it has been told that nothing will alias with it, and there is a later store to the same address. The later store will overwrite the initial store.

Fortunately the lint utility can pick up these issues:

$ lint -Xalias_level=basic alias.c
(9) warning: cast of nonscalar pointer to scalar pointer is valid only at -xalias_level=any

For the example above the compiler does the correct thing and eliminates all the instructions but the store to value1. For more complex examples there is no guarantee that the code will be correct if it violates the -xalias_level setting.
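One way to stay out of trouble is to avoid the structure-to-int pointer cast altogether. The variant below is a sketch of that idea; fill_safe is a name I've invented, and I have not run lint over it, so treat the claim that it avoids the warning as an expectation rather than a tested result.

```c
struct stuff {
  int value1;
  int value2;
};

/* Same visible effect as fill(), but the int pointer is taken from the
   member itself, so no pointer is cast to a different type. */
void fill_safe(struct stuff *x)
{
  x->value1 = 0;          /* Clear value1 */
  int *r = &x->value1;    /* Address of the int member, no cast needed */
  int var = *r;           /* Read value1 through an int pointer */
  x->value1 = var;        /* And store it back into value1 */
}
```

Here both accesses are to an int through an int-typed expression, so the code no longer relies on the compiler treating pointers of different types as possible aliases. After the call, value1 is zero and value2 is untouched.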

Earlier in the summer I recorded a slidecast on using the Performance Analyzer on parallel codes. It has just come out on the HPC portal.