My Hot Chips presentation on the next-generation UltraSPARC security accelerator can be found here.
My Hot Chips presentation on the next-generation UltraSPARC security accelerator can be found here.
Some great work by Krishan Yenduri has led to nice improvements in the multi-socket bulk cipher performance on UltraSPARC T2+ processors. The improvements are available in the current build, snv_117. Krishna has performance data for scaling on a 4-socket T5440 system in his recent blog. Using the same kernel umicrobenchmark, the following plot shows the scaling on a dual-socket UltraSPARC T2 Plus system:

In this test, the requesting threads are scheduled by Solaris (rather than bound to specific cores), so Solaris will tend to even distribute the threads across the 16 cores in the system – this explains by you get this rapid increase in aggregate cryptographic throughout as the number of threads is increased. If the first 8-threads where bound to core 0, the second 8 to core 1 and so on, the scaling would be essentially linear as the cores are added.
So, a 2-socket T2+ system is delivering around 9GBytes/second. Not bad, given most other dual-socket systems can deliver at max around 2GB/s. Further, from the above it is apparent that we hit 9GB/s on the T2 system with less than 50% of the HW strands being utilized.
I'll be presenting Sun's next-generation on-chip UltraSPARC security accelerator at this year's Hot Chips. The preliminary program can be found here.
Following from my recent post mentioning the acceleration of encrypt/decrypt and OpenSSL enc using the T2 crypto HW (here) I went and did some basic tests to see what kind of uptick was achieved:
Large file processing. File in /tmp
(1) openssl perf test (SW crypto)
timex /usr/sfw/bin/openssl enc -aes-128-cbc -k testpass -in /tmp/file.data -out /tmp/file.data.enc
(2) openssl perf test (HW crypto)
timex /usr/sfw/bin/openssl enc -aes-128-cbc -k testpass -engine pkcs11 -in /tmp/file.data -out /tmp/file.data.enc
(3) encrypt perf test (HW crypto)
timex encrypt -a aes -i /tmp/file.data -o /tmp/file.data.enc
Comparing (1) versus (2) I saw about a 4X improvement in performance when I started using the T2 HW crypto. With (1) versus (3) I saw a 2.5X improvement. So a fairly decent performance improvement! I looked into why encrypt is currently being outperformed by OpenSSL and it looks like it is due to buffer sizing – OpenSSL is using a buffer that is 2X larger than is being used by encrypt to read(), encrypt and write() the file data. I modified encrypt to use a 64KB buffer size and saw encrypt performance improvement over (1) increase to over 7X.
So, it looks like you can get get a serious performance from the HW crypto when encrypting large files like ZFS snapshots. In fact, for the above experiment just doing a simple “cp /tmp/file.data /tmp/file.data.enc” is less than 2X faster than using the enhanced version of encrypt to perform AES-128-CBC encryption of the data too.
My slides on using light-weight compression to enhance available offchip bandwidth on future processors can now be found here. As way of an introduction, we found that light-weight compression schemes can improve the effective offchip bandwidth by over 3X on a wide variety of important workloads.
An interesting project to backup and securely store ZFS snapshots to the cloud can be found here. This is a great opportunity for the UltraSPARC T2 cryptographic hardware accelerators that can be used to significantly accelerate the process of encrypting the ZFS snapshot. The shell script for automating the process uses the encrypt function that will automatically use the UltraSPARC T2 cryptographic accelerators.
Interesting book on Solaris security can be found here. According to the blurb, it covers "the main security features in the Solaris operating system, including roles and privileges, cryptographic services, network security, auditing, and Solaris Trusted Extensions".
Interesting new book, the Developers Edge, brings together a good collection of technical articles harvested from the Sun Blogosphre. Naturally, there is some info included on T2 crypto. The book can be found here.
I'm presenting on leveraging compression to increase the effective offchip bandwidth of multicore processors at this weeks Multicore Expo in Santa Clara. Details here.
A paper on the UltraSPARC T2 crypto hardware and the Solaris cryptographic framework will be presented at the upcoming International Workshop on Multicore Software Engineering. Details on the workshop can be found here.
When running IPsec on the UltraSPARC T2, performance can frequently be improved by increasing the number of worker threads provided by the Solaris kernel crypto framework. The number of worker threads is controlled by the crypto_taskq_threads variable. This can be set in /etc/system or altered using mdb (n2cp should be unloaded and reloaded after changing via mdb).
UltraSPARC T2 IPsec performance when using the HW crypto accelerators is, not surprisingly, pretty impressive -- especially, if you use jumbo frames.
Yesterday's NPR on-point program discussed security -- it can be found here. Mostly obvious stuff, but good to see some of these issues/problems getting air time.
Support for async crypto operations is not provided via the userland cryptographic framework. However, it is pretty simple to create a simple driver that can be used by a userland app to gain access to the kernel cryptographic framework and async support. Performance is pretty good -- if you look at how the framework is implemented, requests to the hardware are passed down to the kernel framework via /dev/crypto anyway. You could probably talk to /dev/crypto directly -- looking here -- but there are also plenty of simple of driver examples on docs.sun.com that can be easily enhanced to provide this functionality.
I've been investigating IPsec performance on the UltraSPARC T2 and have found uperf (which can be found here) to be very helpful -- especially for multi-threaded stress testing. Currently, I've got two T5220 systems connected directly by 10GbE and I'm investigating peak IPsec performance...
I've been comparing the sync and async APIs to the kernel cryptographic framework and if you are interested in improving single-thread crypto performance on the UltraSPARC T2, async can be interesting:
8KB objects, crypto_encrypt_mac() operations (3DES, MD5)
|
# operations |
Performance improvement (async perf / sync perf) |
|
1 |
0.77 |
|
2 |
1.97 |
|
4 |
3.1 |
|
8 |
4.74 |
So, if you have the opportunity to handle multiple outstanding crypto operations per thread, using the async API is a good way to go, potentially improving crypto performance by over 4.7X. If you only have one outstanding request per thread, then sync delivers better performance, because there are no Solaris interrupt overheads.
Playing with libsrtp recently and just experimenting with enhancing the library to use the T2 HW crypto. Generally, strp uses AES counter mode. Looking at the libstrp code, there is a keystream_buffer buffer which is XORed with the packet stream. Once the keybuffer is emptied it is refilled. Currently, the buffer is 128-bits i.e. 1 block. This approach is not too ineffecient when performing AES in SW, but will lead to suboptimal performance when using crypto HW. There are typically some SW overheads associated with accessing the crypto hardware, and so performance generally increases with the size of the object being processed. Accordingly, in libsrtp it is preferable if the keystream_buffer is increased considerably in size (e.g. 8KB) and refills are performed much less frequently.
Browsing a few internet forums, it interesting to note that there is quite a debate on the use of autopar in SPEC results, with some calling for its use to be disallowed. While it is true that when autopar is permitted, the resulting performance is no longer single-thread performance, it nevertheless it an interesting measure of how well a particular processor/compiler combination perform (Even before the common use of autopar, SPEC was never just a measure of performance, but was heavily influenced by the sophistication of the compiler). Further, in today's multi-core processors, where caches & off-chip bandwidth are shared, even single-thread SPEC runs don't give all the of information one needs to fully understand the impact on the processor. Rather, SPEC ratio is merely an indication of of well the processor/system can run a single-instance of an application (i.e. peak single-application performance) and SPECrate is a measure of the peak throughput.
Finally, it can be argued that many HPC codes are amenable to autopar, so its use in SPECfp is relevant. Its use in SPECint is more problematic as most integer codes are more difficult to noticeably accelerate using autopar -- its just unfortunate that SPEC06int includes libquantum.....
With my recent talk of T2s, CMTs and autopar I thought it might be interesting to provide a link to Sun's recently released OpenSPARC Internals book. This provides detailed information on many of these topics and can be downloaded for free here. I contributed the first chapter, so its probably safe to skip that one :-)
Following from previous discussions on the benefits of autopar for SPECcpu, the next logical question is "what kind of benefit does autopar provide on CMT processors like the UltraSPARC T2". On UltraSPARC T2, we have a multitude of hardware strands available and, as previously discussed, the bare-metal inter-thread communication latencies are extremely low. I talked with some of the compiler gurus at Sun and sure enough this analysis had been undertaken for SPECfp. The results are as follows:
Pretty cool! 7 of the suites X benchmarks show some benefit, with 1 showing over 30X speedup, a further 3 showing a benefit of over 10X, and the remainder showing 2-4X improvements. Some of the benchmarks show peak performance when using less than 64-threads. This is not unexpected, as this is an out-of-the-box run and given that T2 has shared pipelines and, like most multi-core processors, shared caches and offchip bandwidth, some tweaking is required to maximize performance.
Continuing the SPECcpu theme, an interesting paper from Intel describing the performance upticks from autopar, SSE and other optimizations can be found here. On a dual-core they show decent gains on 6 benchmarks and slight gains on a further two.
Similarly, using the Sun Studio compiler, 8 benchmarks benefit from autopar, delivering a 16% improvement in the geometric mean on a dual-core processor (as illustrated here).
It is interesting to look at recent SPEC2006 results and compare them with the results from just a year or so ago. In addition to the normal improvements one would expect as compilers become increasingly familiar with this fairly new benchmark suite, autoparallelism is being employed to boost scores on this traditionally single-thread performance benchmark.
For SPECint, the autopar gains seem to be limited to one benchmark – libquantum, while on the FP side, there are several. Looking at libquantum:
1) August 07, 3.00 GHz Intel Dual core, 4MB L2$ (4MB between 2-cores), no autopar : libquantum 31.6 2) August 07, 3.00 GHz Intel Dual core, 4MB L2$ (4MB between 2-cores), autopar : libquantum 78.9 3) September 08, 3.20 GHz Intel Quad core, 12MB L2$ (6MB between 2-cores), autopar : libquantum 283 4) September 08, 2.66 Ghz Intel Six core, 9MB L2$ (3MB between 2-cores), autopar : libquantum 316
For libquantum there are obviously other compiler optimizations being undertaken in these latest benchmark submissions (as evidenced by various interim results e.g. here), but the comparison of (3) and (4) alone illustrate the power of threading – 50% reduction in per core L2$, and a 20% reduction in frequency, but the libquantum score improves by 10%. Indeed, if you factor out the frequency difference, the 4 to 6 score scaling is very good. It is also interesting to note that the new libquantum scores are around 10X higher than the scores for the other benchmarks. In fact, if you assume that (3) would be say 3X lower without autoparallelism support in the compiler, then the processors whole SPECint ratio score (which is a geometric mean of all of the benchmark scores) falls by over 10%. If the libquantum score is reduced to that seen in (1), then the overall score falls by almost 20%!
It is interesting to note that comparing (1) and (3), the overall score has improved from 21 to 29.3 – an almost 40% improvement. However, roughly half of that gain comes from threading libquantum! Of the remainder, some comes from frequency, some comes from compiler optimizations that significantly improve hmmer performance, and the remainder from a number of other small increases.
It is interesting that the majority of the gains on this traditionally single-threaded benchmark that we have observed in the last several processor generations come from multithreading.....
Similarly, for SPECfp, there are a number of benchmarks which benefit significantly from autoparallelism. This is not surprising, as FP workloads are typically more amenable to this style of optimization. Comparing the recent results from a quad-core Opteron both with and without autopar, it looks like there are 4 FP benchmarks that benefit significantly:
bwaves : 3X improvement
cactusADM : 7.6X improvement
gemsFDTD : 2.2X improvement
wrf : 1.48X improvement
Cumulatively, threading these 3 benchmarks delivers about a 30% improvement in the SPECfp score!
Following from the recent post discussing modifying OpenSSL to enable OpenSSH to take advantage of the UltraSPARC T2 crypto accelerators, I should also mention that it is possible to just use the PKCS11 engine modified OpenSSL that Sun provides. You should use the –with-ssl-engine when you configure OpenSSH. Further, it may just be my mistake, but I am having problems getting OpenSSH to use the PKCS11 engine unless I modify openssl-compat.c. In the unmodified code, ssh_SSLeay_add_all_algorithms() does:
/* Enable use of crypto hardware */ ENGINE_load_builtin_engines(); ENGINE_register_all_complete();
I changed this to:
ENGINE *pkcengine;
/* Enable use of crypto hardware */
ENGINE_load_builtin_engines();
pkcengine = ENGINE_by_id("pkcs11");
ENGINE_init(pkcengine);
ENGINE_set_default_ciphers(pkcengine);
and things started working fine. I need to find some cycles to go back I see if I had things misconfigured.
Following from the last entry about recent enhancements to SunSSH to enable it to take advantage of the UltraSPARC T2 cryptographic accelerators, for those who use OpenSSH, its also possible to leverage the T2 cryptographic accelerators. One simple way to achieve this, without modifying OpenSSH itself, is just to use a version of OpenSSL that has been modified to take advantage of the HW crypto; for the standard aes-128-cbc operating mode, the simplest way to achieve this is to modify aes_cbc.c to call libpkcs11. Its about a 10 line modification and can be applied to any version of OpenSSL. I will post the required code later today here.
Great to see from Jan's recent blog entry that SunSSH has been enhanced to take advantage of the UltraSPARC T2 hardware cryptographic accelerators – see here for more details.
I will spend some time playing with this later this week and report more generally on the performance benefits I observeRecent discussion about whether AMD's upcoming SSE5 instructions can be used to significantly accelerate (5X) AES can be seen here. Given they don't seem to provide dedicated AES instructions, its tricky to see how this can be readily achieved -- especially given for AES-CBC even special purpose AES instructions only provide around 6X.....
Any thoughts?
A new CMT whitepaper discussing how to accelerate multithreaded applications on CMT processors can be found here. The whitepaper touches on high-performance cryptography on CMT processors and microparallelism techniques.
The busstat tool can be a useful performance tuning aid, allowing one to drill into the load an application is placing on the memory sub-system. However, one note of caution, on the T2/T2+ the bank_busy_stalls counters should not be used, as erroneous results are returned – makes it looks like the application is causing bank busy stalls, even when this is not the case. Future revs of busstat are aware of this, but in the interim, this is a performance counter to ignore when tuning your app.
On a recent rev of Nevada, I just ran some ECC (elliptic curve cryptography) ubenchmarks, comparing a UltraSPARC T2 using the HW crypto accelerators and a 3GHz Xeon:

These numbers are for ecdsa operations. The Xeon #s are from openSSL speed (optimized compilation), while the T2 #s are generating from interacting directly with the framework. These numbers are for binary curves using Galois field operations.
While cryptography is typically viewed as computationally
intensive (and so less well suited to CMT processors), software
implementations of common cryptographic algorithms can be readily
optimized to excel on CMT processors. Current software
implementations have been optimized for traditional processors, with
multiple lookup tables sized to all fit in a processors small level-1
caches. It is this use of multiple small tables that leads to the
high computational overheads associated with most cryptographic
implementations -- due to the significant arithmetic operations
needed to manipulate access to the tables and recombine the results
from the various tables.
For example, consider the Kasumi
algorithm, which is essential to 3G mobile telephony. In Kasumi, a
block is 8-bytes, the key is 128-bits (although it is expanded to a
1024-bit key schedule before use), and processing consists of 8
rounds per block. While a variety of operations are performed per
block, the most costly operation is termed FI and consists of the
following (in C notation):
nine = (u16)(in>>7);
seven
= (u16)(in&0x7F);
nine = (u16)(S9[nine] ^ seven);
seven =
(u16)(S7[seven] ^ (nine & 0x7F));
seven ^= (subkey>>9);
nine ^= (subkey&0x1FF);
nine = (u16)(S9[nine] ^ seven);
seven = (u16)(S7[seven] ^ (nine & 0x7F));
in =
(u16)((seven<<9) + nine);
return( in );
where in
and subkey are two-byte variables, S9 is a 512-element lookup table
and S7 is a 128-element lookup table. This operation is performed
three times per round, for a total of 24 times per block. Each FI
operation requires 22 instructions (for SPARC), for a total of 576
FI-derived instructions per block. Given the abundance of logical and
shift operations, it is apparent that superscalar processors will
perform this function very well, with an Instructions Per Cycle (IPC)
of 2.5 or more. In contrast, the Niagara single-strand IPC is around
0.65. Further, due to the compute intensive nature of the code, the
single Niagara strand uses around two thirds of the processor core's
issue resources. As a result, performance does not scale as
additional Kasumi threads run on a core.
To overcome this
problem, the implementation can be optimized to radically reduce the
instruction count. A reduction in instruction count may be achieved
by replacing large parts of the FI function using a large lookup
table. In the original Kasumi code, the 16-bit elements are divided
into two smaller elements, one 7-bits and one 9-bits. These smaller
elements are processed independently and the results combined. While
this ensures that the lookup tables are small, significant logical
and arithmetic operations are required to split the 16-bit elements
and later recombine the smaller 7-bit and 9-bit elements back into
the 16-bit elements. Significant computational saving may be achieved
by processing an entire 16-bit element at once, using large lookup
tables, as shown below:
t0 = LT0[in];
t0 = t0 ^
subkey;
in = LT1[t0];
The new lookup tables (LT0 and
LT1) are now much larger, each being composed of 65536 2-byte
elements. Note that the lookup tables are constant, may be
precomputed, and are independent of the keys. However, using this
approach, the FI function now only requires five instructions, a four
times reduction from the original implementation. Further note that
in both the optimized and the original code, the lookup table
accesses are dependent and cannot be performed in parallel or
prefetched in advance.
The lookup tables that once fitted
in the L1 cache are now much larger and will now largely reside in
the L2 cache -- this instruction count reduction has been at the
expense of increased memory stalls, but here we are laying to the
strengths of a CMT processor. As a result, it would appear that the
performance of the code will remain largely unchanged, having traded
increased instruction count for increased memory stalls. This
optimization technique is beneficial for at least two reasons. First,
MT (multithreading) performance is improved. For the initial
implementation, due to the large computational requirements of the
algorithm, as additional strands are leveraged, aggregate core
performance improves very little. Given that a single strand is
capable of consuming almost all of a processor core’s
resources, as additional VT/SMT strands are leveraged, these strands
rapidly start to deprive the other strands of resources, and the
aggregate core performance is improved very little. In contrast, in
the optimized version, the strands spend most of their time stalled
waiting for accesses to the lookup tables to complete and consume a
much smaller fraction of a processor core’s resources. As a
result, as the number of strands is increased, performance scales
almost linearly. Indeed, for Niagara, per-core Kasumi performance is
around 8 times the performance of a single strand, and per-chip
Kasumi performance is close to 64X single-strand performance. Indeed,
single-core Kasumi performance is around 1.3X the performance of a
single-core of a 3GHz Xeon processor.
I've been gradually expanding the crypto wiki (which can be found here); adding additional info and some code examples. Please let me know what additional information would be useful to add, how the wiki could be improved, and even add your own thoughts....