Chip Multithreading Blogs

My Hot Chips presentation on the next-generation UltraSPARC security accelerator can be found here.

Some great work by Krishna Yenduri has led to nice improvements in multi-socket bulk cipher performance on UltraSPARC T2+ processors. The improvements are available in the current build, snv_117. Krishna has performance data for scaling on a 4-socket T5440 system in his recent blog. Using the same kernel microbenchmark, the following plot shows the scaling on a dual-socket UltraSPARC T2 Plus system:


In this test, the requesting threads are scheduled by Solaris (rather than bound to specific cores), so Solaris will tend to distribute the threads evenly across the 16 cores in the system – this explains why you get this rapid increase in aggregate cryptographic throughput as the number of threads is increased. If the first 8 threads were bound to core 0, the second 8 to core 1 and so on, the scaling would be essentially linear as the cores are added.

So, a 2-socket T2+ system is delivering around 9GBytes/second. Not bad, given that most other dual-socket systems can deliver at most around 2GB/s. Further, from the above it is apparent that we hit 9GB/s on the T2+ system with less than 50% of the HW strands being utilized.

I'll be presenting Sun's next-generation on-chip UltraSPARC security accelerator at this year's Hot Chips. The preliminary program can be found here.

Following from my recent post mentioning the acceleration of encrypt/decrypt and OpenSSL enc using the T2 crypto HW (here) I went and did some basic tests to see what kind of uptick was achieved:

Large file processing. File in /tmp

(1) openssl perf test (SW crypto)


timex /usr/sfw/bin/openssl enc -aes-128-cbc -k testpass -in /tmp/file.data -out /tmp/file.data.enc


(2) openssl perf test (HW crypto)


timex /usr/sfw/bin/openssl enc -aes-128-cbc -k testpass -engine pkcs11 -in /tmp/file.data -out /tmp/file.data.enc


(3) encrypt perf test (HW crypto)


timex encrypt -a aes -i /tmp/file.data -o /tmp/file.data.enc


Comparing (1) versus (2) I saw about a 4X improvement in performance when I started using the T2 HW crypto. With (1) versus (3) I saw a 2.5X improvement. So a fairly decent performance improvement! I looked into why encrypt is currently being outperformed by OpenSSL and it looks like it is due to buffer sizing – OpenSSL uses a buffer that is 2X larger than the one encrypt uses to read(), encrypt and write() the file data. I modified encrypt to use a 64KB buffer and saw its improvement over (1) rise to over 7X.

So, it looks like you can get a serious performance boost from the HW crypto when encrypting large files like ZFS snapshots. In fact, for the above experiment, a simple “cp /tmp/file.data /tmp/file.data.enc” is less than 2X faster than using the enhanced version of encrypt, which performs AES-128-CBC encryption of the data as well.

My slides on using light-weight compression to enhance available offchip bandwidth on future processors can now be found here. By way of introduction, we found that light-weight compression schemes can improve the effective offchip bandwidth by over 3X on a wide variety of important workloads.

An interesting project to back up and securely store ZFS snapshots in the cloud can be found here. This is a great opportunity for the UltraSPARC T2 cryptographic hardware accelerators, which can significantly accelerate the process of encrypting the ZFS snapshot. The shell script for automating the process uses the encrypt command, which will automatically use the UltraSPARC T2 cryptographic accelerators.

Interesting book on Solaris security can be found here. According to the blurb, it covers "the main security features in the Solaris operating system, including roles and privileges, cryptographic services, network security, auditing, and Solaris Trusted Extensions".

An interesting new book, the Developers Edge, brings together a good collection of technical articles harvested from the Sun blogosphere. Naturally, there is some info included on T2 crypto. The book can be found here.

I'm presenting on leveraging compression to increase the effective offchip bandwidth of multicore processors at this week's Multicore Expo in Santa Clara. Details here.

A paper on the UltraSPARC T2 crypto hardware and the Solaris cryptographic framework will be presented at the upcoming International Workshop on Multicore Software Engineering. Details on the workshop can be found here.

When running IPsec on the UltraSPARC T2, performance can frequently be improved by increasing the number of worker threads provided by the Solaris kernel crypto framework. The number of worker threads is controlled by the crypto_taskq_threads variable. This can be set in /etc/system or altered using mdb (n2cp should be unloaded and reloaded after changing via mdb).


UltraSPARC T2 IPsec performance when using the HW crypto accelerators is, not surprisingly, pretty impressive -- especially if you use jumbo frames.

Yesterday's NPR on-point program discussed security -- it can be found here. Mostly obvious stuff, but good to see some of these issues/problems getting air time.

Support for async crypto operations is not provided via the userland cryptographic framework. However, it is pretty easy to create a simple driver that a userland app can use to gain access to the kernel cryptographic framework and its async support. Performance is pretty good -- if you look at how the framework is implemented, requests to the hardware are passed down to the kernel framework via /dev/crypto anyway. You could probably talk to /dev/crypto directly -- looking here -- but there are also plenty of simple driver examples on docs.sun.com that can be easily enhanced to provide this functionality.

I've been investigating IPsec performance on the UltraSPARC T2 and have found uperf (which can be found here) to be very helpful -- especially for multi-threaded stress testing. Currently, I've got two T5220 systems connected directly by 10GbE and I'm investigating peak IPsec performance...

I've been comparing the sync and async APIs to the kernel cryptographic framework and if you are interested in improving single-thread crypto performance on the UltraSPARC T2, async can be interesting:

8KB objects, crypto_encrypt_mac() operations (3DES, MD5)

# operations    Performance improvement (async perf / sync perf)
1               0.77
2               1.97
4               3.1
8               4.74



So, if you have the opportunity to handle multiple outstanding crypto operations per thread, using the async API is a good way to go, potentially improving crypto performance by over 4.7X. If you only have one outstanding request per thread, then sync delivers better performance, because there are no Solaris interrupt overheads.

I've been playing with libsrtp recently, experimenting with enhancing the library to use the T2 HW crypto. Generally, SRTP uses AES counter mode. Looking at the libsrtp code, there is a keystream_buffer which is XORed with the packet stream. Once the keystream buffer is emptied it is refilled. Currently, the buffer is 128 bits, i.e. 1 block. This approach is not too inefficient when performing AES in SW, but will lead to suboptimal performance when using crypto HW. There are typically some SW overheads associated with accessing the crypto hardware, and so performance generally increases with the size of the object being processed. Accordingly, in libsrtp it is preferable if the keystream_buffer is increased considerably in size (e.g. 8KB) and refills are performed much less frequently.
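To make the refill argument concrete, here is a minimal sketch of a keystream buffer whose size is a parameter. It is not the libsrtp code: the names (keystream_t, ks_init, ks_refill, ks_xor) are hypothetical, and the refill routine is a placeholder byte generator standing in for an AES counter-mode request to the crypto hardware. The only point is to count how many refill requests are issued for a given buffer size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KS_MAX_BYTES 8192   /* largest keystream buffer this sketch supports */

/* Hypothetical keystream state; not the libsrtp structure. */
typedef struct {
    uint8_t       buf[KS_MAX_BYTES];
    size_t        capacity;  /* keystream bytes generated per refill */
    size_t        avail;     /* unused keystream bytes left in buf */
    uint64_t      counter;   /* position in the keystream */
    unsigned long refills;   /* number of refill requests issued */
} keystream_t;

/*
 * Placeholder generator, NOT AES. In a real port this would be a single
 * AES counter-mode request to the crypto provider for 'capacity' bytes,
 * so the fixed per-request overhead is paid once per refill.
 */
static void ks_refill(keystream_t *ks)
{
    for (size_t i = 0; i < ks->capacity; i++) {
        ks->buf[i] = (uint8_t)(ks->counter * 167u + 13u);
        ks->counter++;
    }
    ks->avail = ks->capacity;
    ks->refills++;
}

void ks_init(keystream_t *ks, size_t capacity)
{
    assert(capacity > 0 && capacity <= KS_MAX_BYTES);
    ks->capacity = capacity;
    ks->avail = 0;
    ks->counter = 0;
    ks->refills = 0;
}

/* XOR 'len' bytes of packet data with the keystream, refilling on demand. */
void ks_xor(keystream_t *ks, uint8_t *data, size_t len)
{
    while (len > 0) {
        if (ks->avail == 0)
            ks_refill(ks);
        size_t off = ks->capacity - ks->avail;
        size_t n = len < ks->avail ? len : ks->avail;
        for (size_t i = 0; i < n; i++)
            data[i] ^= ks->buf[off + i];
        data += n;
        len -= n;
        ks->avail -= n;
    }
}
```

Pushing 64KB of packet data through this sketch issues 4096 refill requests with a 16-byte buffer but only 8 with an 8KB buffer, and the output is identical in both cases. With hardware crypto, each of those requests carries a fixed software overhead, which is why the larger buffer should help.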

Browsing a few internet forums, it is interesting to note that there is quite a debate on the use of autopar in SPEC results, with some calling for its use to be disallowed. While it is true that when autopar is permitted, the resulting performance is no longer single-thread performance, it is nevertheless an interesting measure of how well a particular processor/compiler combination performs (even before the common use of autopar, SPEC was never just a measure of processor performance, but was heavily influenced by the sophistication of the compiler). Further, in today's multi-core processors, where caches & off-chip bandwidth are shared, even single-thread SPEC runs don't give all of the information one needs to fully understand the impact on the processor. Rather, SPEC ratio is merely an indication of how well the processor/system can run a single instance of an application (i.e. peak single-application performance) and SPECrate is a measure of the peak throughput.


Finally, it can be argued that many HPC codes are amenable to autopar, so its use in SPECfp is relevant. Its use in SPECint is more problematic as most integer codes are more difficult to noticeably accelerate using autopar -- it's just unfortunate that SPEC06int includes libquantum.....

With my recent talk of T2s, CMTs and autopar I thought it might be interesting to provide a link to Sun's recently released OpenSPARC Internals book. This provides detailed information on many of these topics and can be downloaded for free here. I contributed the first chapter, so it's probably safe to skip that one :-)

Following from previous discussions on the benefits of autopar for SPECcpu, the next logical question is "what kind of benefit does autopar provide on CMT processors like the UltraSPARC T2". On UltraSPARC T2, we have a multitude of hardware strands available and, as previously discussed, the bare-metal inter-thread communication latencies are extremely low. I talked with some of the compiler gurus at Sun and sure enough this analysis had been undertaken for SPECfp. The results are as follows:



Pretty cool! 7 of the suite's X benchmarks show some benefit, with 1 showing over 30X speedup, a further 3 showing a benefit of over 10X, and the remainder showing 2-4X improvements. Some of the benchmarks show peak performance when using fewer than 64 threads. This is not unexpected: this is an out-of-the-box run, and given that T2 has shared pipelines and, like most multi-core processors, shared caches and offchip bandwidth, some tweaking is required to maximize performance.

Continuing the SPECcpu theme, an interesting paper from Intel describing the performance upticks from autopar, SSE and other optimizations can be found here. On a dual-core they show decent gains on 6 benchmarks and slight gains on a further two.


Similarly, using the Sun Studio compiler, 8 benchmarks benefit from autopar, delivering a 16% improvement in the geometric mean on a dual-core processor (as illustrated here).



It is interesting to look at recent SPEC2006 results and compare them with the results from just a year or so ago. In addition to the normal improvements one would expect as compilers become increasingly familiar with this fairly new benchmark suite, autoparallelism is being employed to boost scores on this traditionally single-thread performance benchmark.

For SPECint, the autopar gains seem to be limited to one benchmark – libquantum, while on the FP side, there are several. Looking at libquantum:

1) August 07,    3.00 GHz Intel Dual core, 4MB L2$  (4MB between 2-cores), no autopar : libquantum  31.6
2) August 07,    3.00 GHz Intel Dual core, 4MB L2$  (4MB between 2-cores), autopar :    libquantum  78.9
3) September 08, 3.20 GHz Intel Quad core, 12MB L2$ (6MB between 2-cores), autopar :    libquantum 283
4) September 08, 2.66 GHz Intel Six core,  9MB L2$ (3MB between 2-cores), autopar :     libquantum 316

For libquantum there are obviously other compiler optimizations being undertaken in these latest benchmark submissions (as evidenced by various interim results e.g. here), but the comparison of (3) and (4) alone illustrates the power of threading – a 50% reduction in per-core L2$ and a roughly 20% reduction in frequency, yet the libquantum score improves by around 10%. Indeed, if you factor out the frequency difference, the scaling from 4 to 6 cores is very good. It is also interesting to note that the new libquantum scores are around 10X higher than the scores for the other benchmarks. In fact, if you assume that (3) would be say 3X lower without autoparallelism support in the compiler, then the processor's overall SPECint ratio score (which is a geometric mean of all of the benchmark scores) falls by over 10%. If the libquantum score is reduced to that seen in (1), then the overall score falls by almost 20%!


It is interesting to note that comparing (1) and (3), the overall score has improved from 21 to 29.3 – an almost 40% improvement. However, roughly half of that gain comes from threading libquantum! Of the remainder, some comes from frequency, some comes from compiler optimizations that significantly improve hmmer performance, and the remainder from a number of other small increases.


It is interesting that the majority of the gains on this traditionally single-threaded benchmark that we have observed in the last several processor generations come from multithreading.....

Similarly, for SPECfp, there are a number of benchmarks which benefit significantly from autoparallelism. This is not surprising, as FP workloads are typically more amenable to this style of optimization. Comparing the recent results from a quad-core Opteron both with and without autopar, it looks like there are 4 FP benchmarks that benefit significantly:


bwaves : 3X improvement

cactusADM : 7.6X improvement

gemsFDTD : 2.2X improvement

wrf : 1.48X improvement


Cumulatively, threading these 4 benchmarks delivers about a 30% improvement in the SPECfp score!
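As a rough sanity check on that figure, the sketch below works out how much a geometric-mean score moves when only a few benchmarks speed up. It rests on two assumptions that are not stated in the post: that the SPECfp2006 suite has 17 benchmarks, and that the scores of the other 13 are unchanged. The function names are made up for illustration, and the root is found by bisection simply to avoid needing the math library.

```c
#include <assert.h>

/* n-th root of a positive value by bisection, to avoid a libm dependency. */
static double nth_root(double value, int n)
{
    double lo = 0.0;
    double hi = value > 1.0 ? value : 1.0;

    for (int iter = 0; iter < 200; iter++) {
        double mid = 0.5 * (lo + hi);
        double power = 1.0;
        for (int k = 0; k < n; k++)
            power *= mid;
        if (power < value)
            lo = mid;
        else
            hi = mid;
    }
    return 0.5 * (lo + hi);
}

/*
 * Factor by which a geometric-mean score changes when n_improved of the
 * n_total benchmarks speed up by the given factors and the rest stay put:
 * (s1 * s2 * ... * sk) ^ (1 / n_total).
 */
double suite_gain(const double *speedups, int n_improved, int n_total)
{
    assert(n_total > 0 && n_improved >= 0 && n_improved <= n_total);

    double product = 1.0;
    for (int i = 0; i < n_improved; i++)
        product *= speedups[i];
    return nth_root(product, n_total);
}
```

Calling suite_gain() with the four speedups listed above (3, 7.6, 2.2 and 1.48) and n_total = 17 gives roughly 1.29, which lines up with the "about 30%" improvement.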

Following from the recent post discussing modifying OpenSSL to enable OpenSSH to take advantage of the UltraSPARC T2 crypto accelerators, I should also mention that it is possible to just use the PKCS11-engine-enabled OpenSSL that Sun provides. You should use the --with-ssl-engine option when you configure OpenSSH. Further, it may just be my mistake, but I am having problems getting OpenSSH to use the PKCS11 engine unless I modify openssl-compat.c. In the unmodified code, ssh_SSLeay_add_all_algorithms() does:

/* Enable use of crypto hardware */
ENGINE_load_builtin_engines();
ENGINE_register_all_complete();

I changed this to:

ENGINE *pkcengine;
/* Enable use of crypto hardware */
ENGINE_load_builtin_engines();
pkcengine = ENGINE_by_id("pkcs11");
ENGINE_init(pkcengine);
ENGINE_set_default_ciphers(pkcengine);

and things started working fine. I need to find some cycles to go back and see if I had things misconfigured.

Following from the last entry about recent enhancements to SunSSH to enable it to take advantage of the UltraSPARC T2 cryptographic accelerators, for those who use OpenSSH, it's also possible to leverage the T2 cryptographic accelerators. One simple way to achieve this, without modifying OpenSSH itself, is just to use a version of OpenSSL that has been modified to take advantage of the HW crypto; for the standard aes-128-cbc operating mode, the simplest way to achieve this is to modify aes_cbc.c to call libpkcs11. It's about a 10-line modification and can be applied to any version of OpenSSL. I will post the required code later today here.

Great to see from Jan's recent blog entry that SunSSH has been enhanced to take advantage of the UltraSPARC T2 hardware cryptographic accelerators – see here for more details.

I will spend some time playing with this later this week and report more generally on the performance benefits I observe.

Recent discussion about whether AMD's upcoming SSE5 instructions can be used to significantly accelerate (5X) AES can be seen here. Given they don't seem to provide dedicated AES instructions, it's tricky to see how this can be readily achieved -- especially given that for AES-CBC even special-purpose AES instructions only provide around 6X.....

Any thoughts?



A new CMT whitepaper discussing how to accelerate multithreaded applications on CMT processors can be found here. The whitepaper touches on high-performance cryptography on CMT processors and microparallelism techniques.

The busstat tool can be a useful performance tuning aid, allowing one to drill into the load an application is placing on the memory sub-system. However, one note of caution: on the T2/T2+ the bank_busy_stalls counters should not be used, as erroneous results are returned – they make it look like the application is causing bank busy stalls, even when this is not the case. Future revs of busstat are aware of this, but in the interim, this is a performance counter to ignore when tuning your app.

On a recent rev of Nevada, I just ran some ECC (elliptic curve cryptography) microbenchmarks, comparing an UltraSPARC T2 using the HW crypto accelerators and a 3GHz Xeon:




These numbers are for ECDSA operations. The Xeon #s are from OpenSSL speed (optimized compilation), while the T2 #s are generated by interacting directly with the framework. These numbers are for binary curves using Galois field operations.

While cryptography is typically viewed as computationally intensive (and so less well suited to CMT processors), software implementations of common cryptographic algorithms can be readily optimized to excel on CMT processors. Current software implementations have been optimized for traditional processors, with multiple lookup tables sized to all fit in a processor's small level-1 caches. It is this use of multiple small tables that leads to the high computational overheads associated with most cryptographic implementations -- due to the significant arithmetic operations needed to manipulate access to the tables and recombine the results from the various tables.

For example, consider the Kasumi algorithm, which is essential to 3G mobile telephony. In Kasumi, a block is 8-bytes, the key is 128-bits (although it is expanded to a 1024-bit key schedule before use), and processing consists of 8 rounds per block. While a variety of operations are performed per block, the most costly operation is termed FI and consists of the following (in C notation):

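static u16 FI( u16 in, u16 subkey )
{
    u16 nine, seven;

    /* Split the 16-bit input into a 9-bit and a 7-bit half */
    nine = (u16)(in>>7);
    seven = (u16)(in&0x7F);

    /* First pass through the S-boxes */
    nine = (u16)(S9[nine] ^ seven);
    seven = (u16)(S7[seven] ^ (nine & 0x7F));

    /* Mix in the subkey, then a second pass through the S-boxes */
    seven ^= (subkey>>9);
    nine ^= (subkey&0x1FF);
    nine = (u16)(S9[nine] ^ seven);
    seven = (u16)(S7[seven] ^ (nine & 0x7F));

    /* Recombine the two halves */
    in = (u16)((seven<<9) + nine);
    return( in );
}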

where in and subkey are two-byte variables, S9 is a 512-element lookup table and S7 is a 128-element lookup table. This operation is performed three times per round, for a total of 24 times per block. Each FI operation requires 22 instructions (for SPARC), for a total of 528 FI-derived instructions per block. Given the abundance of logical and shift operations, it is apparent that superscalar processors will perform this function very well, with an Instructions Per Cycle (IPC) of 2.5 or more. In contrast, the Niagara single-strand IPC is around 0.65. Further, due to the compute intensive nature of the code, the single Niagara strand uses around two thirds of the processor core's issue resources. As a result, performance does not scale as additional Kasumi threads run on a core.

To overcome this problem, the implementation can be optimized to radically reduce the instruction count. A reduction in instruction count may be achieved by replacing large parts of the FI function using a large lookup table. In the original Kasumi code, the 16-bit elements are divided into two smaller elements, one 7-bits and one 9-bits. These smaller elements are processed independently and the results combined. While this ensures that the lookup tables are small, significant logical and arithmetic operations are required to split the 16-bit elements and later recombine the smaller 7-bit and 9-bit elements back into the 16-bit elements. Significant computational saving may be achieved by processing an entire 16-bit element at once, using large lookup tables, as shown below:


t0 = LT0[in];
t0 = t0 ^ subkey;
in = LT1[t0];


The new lookup tables (LT0 and LT1) are now much larger, each being composed of 65536 2-byte elements. Note that the lookup tables are constant, may be precomputed, and are independent of the keys. Using this approach, the FI function now only requires five instructions, a more than four times reduction from the original implementation. Further note that in both the optimized and the original code, the lookup table accesses are dependent and cannot be performed in parallel or prefetched in advance.
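For anyone wanting to try this, here is one way the two large tables could be precomputed and checked against the original FI. It is a sketch under a loud assumption: the S7 and S9 arrays below are filled with placeholder values, NOT the real Kasumi S-boxes (which come from the 3GPP specification), and the helper names are invented for illustration. The construction itself does not depend on the S-box contents: LT0 captures everything before the subkey is mixed in, packed as (seven<<9)|nine so that a single XOR with subkey hits both halves, and LT1 captures everything after.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t u16;

static u16 S7[128], S9[512];        /* small S-boxes (placeholders here) */
static u16 LT0[65536], LT1[65536];  /* large precomputed tables */

/* Placeholder contents, NOT the real Kasumi S-boxes. */
void init_placeholder_sboxes(void)
{
    for (unsigned i = 0; i < 128; i++)
        S7[i] = (u16)((i * 37u + 11u) & 0x7F);
    for (unsigned i = 0; i < 512; i++)
        S9[i] = (u16)((i * 73u + 5u) & 0x1FF);
}

/* Original FI, as in the fragment shown earlier. */
u16 fi_reference(u16 in, u16 subkey)
{
    u16 nine = (u16)(in >> 7);
    u16 seven = (u16)(in & 0x7F);
    nine = (u16)(S9[nine] ^ seven);
    seven = (u16)(S7[seven] ^ (nine & 0x7F));
    seven ^= (subkey >> 9);
    nine ^= (subkey & 0x1FF);
    nine = (u16)(S9[nine] ^ seven);
    seven = (u16)(S7[seven] ^ (nine & 0x7F));
    return (u16)((seven << 9) + nine);
}

/* Precompute both halves of FI for every possible 16-bit value. */
void build_large_tables(void)
{
    for (unsigned v = 0; v < 65536; v++) {
        /* LT0: the work done before the subkey is mixed in */
        u16 nine = (u16)(v >> 7);
        u16 seven = (u16)(v & 0x7F);
        nine = (u16)(S9[nine] ^ seven);
        seven = (u16)(S7[seven] ^ (nine & 0x7F));
        LT0[v] = (u16)((seven << 9) | nine);

        /* LT1: the work done after the subkey is mixed in */
        seven = (u16)(v >> 9);
        nine = (u16)(v & 0x1FF);
        nine = (u16)(S9[nine] ^ seven);
        seven = (u16)(S7[seven] ^ (nine & 0x7F));
        LT1[v] = (u16)((seven << 9) + nine);
    }
}

/* Table-driven FI: two dependent loads and one XOR. */
u16 fi_table(u16 in, u16 subkey)
{
    u16 t0 = LT0[in];
    t0 = (u16)(t0 ^ subkey);
    return LT1[t0];
}

/* Check that both versions agree for every input under one subkey. */
void verify_tables(u16 subkey)
{
    for (unsigned v = 0; v < 65536; v++)
        assert(fi_table((u16)v, subkey) == fi_reference((u16)v, subkey));
}
```

Call init_placeholder_sboxes() and build_large_tables() once at startup, then verify_tables() with a few subkeys to confirm the two versions agree. The two tables together occupy 256KB, which is what pushes them out of the L1 cache.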


The lookup tables that once fitted in the L1 cache are now much larger and will largely reside in the L2 cache -- the instruction count reduction has come at the expense of increased memory stalls, but here we are playing to the strengths of a CMT processor. At first glance, it would appear that the performance of the code will remain largely unchanged, having traded a reduced instruction count for increased memory stalls. However, this optimization technique is beneficial for at least two reasons. First, MT (multithreading) performance is improved. For the initial implementation, a single strand is capable of consuming almost all of a processor core’s resources, so as additional VT/SMT strands are leveraged, they rapidly start to deprive the other strands of resources, and aggregate core performance improves very little. In contrast, in the optimized version, the strands spend most of their time stalled waiting for accesses to the lookup tables to complete and consume a much smaller fraction of a processor core’s resources. As a result, as the number of strands is increased, performance scales almost linearly. Indeed, for Niagara, per-core Kasumi performance is around 8 times the performance of a single strand, and per-chip Kasumi performance is close to 64X single-strand performance. Further, single-core Kasumi performance is around 1.3X the performance of a single core of a 3GHz Xeon processor.


I've been gradually expanding the crypto wiki (which can be found here); adding additional info and some code examples. Please let me know what additional information would be useful to add, how the wiki could be improved, and even add your own thoughts....