This short tutorial describes how vector optimization techniques, originally developed for Cray supercomputers, can now provide an order of magnitude increase in performance for many audio apps running on standard Intel computers of the current era. It provides an example in Cycling '74 gen~ code. But first, you really might enjoy how I learned it, or perhaps not, as it's a very tragic story.

The Cray backplane was curved, marketeers said, to reduce the distance that the TTL-ECL digital-logic signals needed to travel, making the distinctive Cray shape. Obviously though, that makes no difference, as signals travel the same distance regardless. The real reason was to reduce its massive footprint.

1. The Man with 100 Supercomputers in his Basement

Although this is a very old coding technique, very few people know about it. I was taught it in 1985 by Shawn Hailey, shortly after he left Stanford to work on optimizing analog circuit simulation for the most powerful computers of the era.

Shawn's first rule was that you had to sit with him in silence for at least half an hour. That's something almost no one figured out, and many who did thought it rude. But he was Texan, and for a Texan engineer in those days, that was actually not so long. Silicon Valley had just taken over, and as far as they were concerned, Texas Instruments had been robbed of its throne. Undeservedly. And when Shawn asked me what I was thinking during those early silences, I told him that, and added that I knew what he was thinking about too.

"Really. How so?"

"It's obvious. Since you started running a company, that's the only time you get to figure out the software problems your staff are having." He laughed his head off.

After the silence, he could talk for days at a time, fueled by the supply of sandwiches and sodas he always had arrive on periodic order. He arranged them in advance a year at a time, to reduce hassle. At that time physical mundanities such as food were of no interest to him at all, and in fact it rather surprised him how some people would hang around simply to get a free sandwich. Anyway, I will try to be more brief than Shawn. As I was one of the original engineers on the Intel Pentium I, by now I've got a pretty fair idea what people need to know about vector programming, but given the nature of my mentor on this, a few more words are still quite appropriate here.

Shawn eventually retired to Hawaii, totally fed up with teaching morons how to program. Some say, he was so fed up with it, that's why he had a sex change operation. We can't know, as he's since passed away. If you ask me anything about that, I'd say first, I was at Oxford University and have an IQ of 175, and I felt like a moron next to him too. I will recount one brief anecdote, from standing next to all the supercomputers in his basement, about 10 years after we first met. He'd put my company on monthly retainer, I'm told, because of me, but all I ever actually did was insist they wait until Shawn spoke first.

So finally I was there. We hadn't seen each other for years. And he took me down to his basement. There the supercomputers were, as neat as he could make them, in a dozen rows or so, humming and groaning and rumbling. It was very noisy. No one ever went down there, normally. It was all hooked up with massive Ethernet snakes, and even more massive power lines and conditioners, on the porous floor with foam that would blossom up through its holes if a fire started. By that time, supercomputer prices had already started to plummet. So there wasn't one in the room worth more than 10 million dollars any more, even though there were, well, a hundred of them.

"I don't ask for these things. They keep just sending them to me," Shawn remarked in complaint. "I've had to move to a larger building twice already."

"Which is best?" I asked, trying not to be too excited. I could feel the vibrations through the soles of my feet, even with air sneakers on. If we really had anything in common, we were all jeans and sneakers guys. And how many people ever had a hundred supercomputers in their basement? It was mind blowing.

Shawn screwed up his eyes and said, after his usual incredibly long pause, "The best one is...the one you program. I just finished teaching you."

...That was the last time I ever saw Shawn except once, when he appeared in a meeting, briefly, that his twin brother had organized with my entire company. They both laughed as we all remarked, in turn, how rarely we had seen them together, so rarely in fact, we had started to wonder, seriously, if they were actually one person pretending to be two. So they asked me, directly, what I thought they should do with all those supercomputers. They really were not sure at all. My tutelage was over, and it was time for an overly long silence from me.

Finally I replied, as I had been doing some work for a supercomputing journal, that I sympathized with that problem, but Moore's law meant they were hosed, and they were being exploited by supercomputer manufacturers, who would continue to send them millions of dollars worth of junk every month just so they could write it off. They asked if I was sure, saying 'Moore's law was just hocus pocus for the masses.' With utter resignation I acknowledged their skepticism, but told them how I was getting together with Intel's chairman Gordon Moore on Sundays to play pool at the time, and frankly, in my own opinion the best thing they could do was cut their losses and expenses as much as possible. I never normally told people about my friendship with Gordon, I explained, but I really believed he was right, not because we were friends, but because of my prior knowledge of semiconductor technology, from when I was a full-time journalist at CMP on a semiconductor journal called 'VLSI Design.' During that earlier time, I had met hundreds of Presidents in Silicon Valley, which was really the best education anyone could get in those days (there weren't even many computer engineering degrees around). So, that was my honest answer, as they had always been honest with me, and now, what did they want me to do? Ask Gordon to put them in his basement instead?

They didn't know about my friendship with him, but then they agreed: much as they disliked Gordon Moore, they didn't want to ask him that. So that was the conclusion of a decade-long, bizarre camaraderie, and the end of my own company too, in fact. My own President, Jim Burkhardt, demanded I monetize my friendship with Gordon, in compensation for the lavish retirement that he believed I had just frittered away, stupidly, by being honest. When I refused, and we parted ways entirely, Jim embarked on a solid and continuous year-long cocaine spree, reducing himself and everything he owned to rubble--between incoherent fits of Coltrane adulation and driving at 100MPH the wrong way down empty freeways at 2am--until, even more tragically, eventually killing himself during an affair with his brother's wife. He was 30 years older than me, I couldn't stop his insanity either, and it's another thing I've never talked about.

Notwithstanding, partly due to incredible advances in X-ray lithography, Gordon turned out to be far more right than almost anyone believed even in 1995, including Gordon himself. Semiconductors carried on shrinking long after his retirement. Meanwhile, not even Shawn's twin brother could save Shawn, who, however brilliant he was technically, had no real business acumen. But the Hailey knowledge remains as it was, and moreover--ironically--something no one could have really anticipated either--their knowledge is now more applicable than ever. So I share it below, after one final note.

I really don't share such sad stories very often, but no matter what happened at their ends, there truly was a brief epoch of brilliance, there, that deserved far more than the decade it got before they all drove themselves insane, most likely, because of a billion dollars of rapidly depreciating junk in a basement. You can think whatever you like. That's what I think. The only thing I can retort in defense, for a story with such a miserable ending, is that my lovely wife Ruth frequently said I should write my past down, when we were married. So I have written it down. There are facts here that no one else in the world knew before. Not even Ruth ever heard this story. We were at Oxford together. Ruth left me for a richer man, but she was born into abject poverty, extraordinary punk-loving scholar that she was, and I couldn't really hold it against her for wanting more material things. I married her because, well, there was the time she asked to borrow some scotch tape, after the old scotch tape holding her broken eyeglasses together had worn out. She was a Coventry punk with nerdy glasses. So I helped her move to California and get a decent job and a PhD. Now she's a Coventry punk scholar, and she's still teaching. We just never stop working.

If you ever wondered what it was really like to be Ivy League in Silicon Valley during the Internet boom, rather than what the TV shows portray: that's how it was. Warts and all.

If you are interested in more such experiences, please see DAWs and Me: Crossing Paths with Millionaires. Or, you can see my own work in the area so far, in the Remix3D Products that implement this technology.

2. The Nitty and Gritty

On modern CPUs, most instructions can complete in one clock cycle because the difficult ones are pipelined. The pipeline length introduces latency, which can stall subsequent instructions if they depend on the results. Most commonly these stalls occur due to floating-point adds and multiplies, which are typically much slower than their integer equivalents. Code branches can also stall subsequent instructions, and in typical apps cost about as much latency as a floating-point multiply+add combination. Divides, exponents, logs, and trigonometric functions take even longer. But if the code is properly structured, the latency can be minimized. And as processor speeds reached over 1GHz, the pipelines got even longer, and the stalls even more frequent (partly marketed as 'power saving'). Those are the most important things to know.

True vector optimization starts with loop unwrapping (more commonly called loop unrolling): code loops add conditional tests at each loop iteration, which are deadly to floating-point DSP. Also, it does not use arrays; ideally these are compiled out, as arrays add index lookup operations. Then the operations are broken down into their simplest possible forms, rather than combining them into compound expressions, so that each floating-point operation can be individually tuned. After that, the ordering of variables in the app's compiled code is hand optimized, so that sequential instructions make use of the same operands where possible, keeping CPU registers full and minimizing cache fetches. Finally, the results are pipeline-ordered for the minimal possible interdependency of sequential instructions, eliminating floating-point-unit stalls. These coding techniques were pioneered on Cray supercomputers; knowledgeable programmers refer to them as vector optimization because individual processing paths for multiple interleaved tasks can be visualized as vectors through the runtime code.
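To make the idea concrete outside gen~, here is a minimal C sketch of the same steps on a toy weighted sum. It is an illustration under my own assumptions, not code from the app: the function names are hypothetical, arrays are kept only so the example stays short, and the unrolled version assumes the length is a multiple of four. Modern compilers and out-of-order CPUs do some of this on their own, but without special flags a compiler will not usually regroup floating-point sums the way the second function does by hand.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled version: a single accumulator, so each add has to wait for the
   previous add to finish, and the loop test runs once per element. */
double weighted_sum_rolled(const double *in, const double *gain, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += in[i] * gain[i];
    return sum;
}

/* Unrolled by four with every intermediate named. The four products are
   independent of each other, and so are the four partial sums, so the
   floating-point unit can overlap them instead of stalling on one chain. */
double weighted_sum_unrolled4(const double *in, const double *gain, size_t n)
{
    assert(n % 4 == 0);   /* simplifying assumption for this sketch */
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (size_t i = 0; i < n; i += 4) {
        /* Stage 1: four independent multiplies. */
        double p0 = in[i]     * gain[i];
        double p1 = in[i + 1] * gain[i + 1];
        double p2 = in[i + 2] * gain[i + 2];
        double p3 = in[i + 3] * gain[i + 3];
        /* Stage 2: four independent accumulations. */
        s0 += p0;
        s1 += p1;
        s2 += p2;
        s3 += p3;
    }
    /* Stage 3: combine the partial sums pairwise. */
    double s01 = s0 + s1;
    double s23 = s2 + s3;
    return s01 + s23;
}
```

Because the unrolled version adds in a different order, its result can differ from the rolled one in the last bits for general inputs; for audio gain staging that is normally irrelevant.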

This example shows RemixQuad3D's code sequence for two controls: effects send level and effects send pan. There are four sends for each channel and four channels, so there are 16 vectored processing paths here. The lazy and quick way to do this:

    for (x = 0; x < 4; x++) {
       for (y = 0; y < 4; y++) {
          outL[y] += inL[y] * (fxpan[x][y] * .01) * dbtoa(fxsend[x][y]);
          outR[y] += inR[y] * (1 - fxpan[x][y] * .01) * dbtoa(fxsend[x][y]);
       }
    }

Of course this highly inefficient code is typically lauded for its comparative clarity. Now, here is the hand-coded version to optimize FPU throughput, which is about eight times faster. Note that for the interleaving to eliminate pipeline stalls, intermediate results must all be named. The app's internal coding conventions use 3-character variable names for all primary coefficients, and 4-character names for intermediate coefficients, in a non-colliding global namespace. Of course, most variables are scoped rather than global, but a uniform naming convention is essential for sanity when coding!

    //  EFFECTS PAN ***************************************************
    xab1 *= .01; xab2 *= .01; xab3 *= .01; xab4 *= .01;
    xbb1 *= .01; xbb2 *= .01; xbb3 *= .01; xbb4 *= .01;
    xcb1 *= .01; xcb2 *= .01; xcb3 *= .01; xcb4 *= .01;
    xdb1 *= .01; xdb2 *= .01; xdb3 *= .01; xdb4 *= .01;
    xa1L = xaa1 *(1 -xab1); xa1R = xaa1 *xab1;
    xa2L = xaa2 *(1 -xab2); xa2R = xaa2 *xab2;
    xa3L = xaa3 *(1 -xab3); xa3R = xaa3 *xab3;
    xa4L = xaa4 *(1 -xab4); xa4R = xaa4 *xab4;
    xb1L = xba1 *(1 -xbb1); xb1R = xba1 *xbb1;
    xb2L = xba2 *(1 -xbb2); xb2R = xba2 *xbb2;
    xb3L = xba3 *(1 -xbb3); xb3R = xba3 *xbb3;
    xb4L = xba4 *(1 -xbb4); xb4R = xba4 *xbb4;
    xc1L = xca1 *(1 -xcb1); xc1R = xca1 *xcb1;
    xc2L = xca2 *(1 -xcb2); xc2R = xca2 *xcb2;
    xc3L = xca3 *(1 -xcb3); xc3R = xca3 *xcb3;
    xc4L = xca4 *(1 -xcb4); xc4R = xca4 *xcb4;
    xd1L = xda1 *(1 -xdb1); xd1R = xda1 *xdb1;
    xd2L = xda2 *(1 -xdb2); xd2R = xda2 *xdb2;
    xd3L = xda3 *(1 -xdb3); xd3R = xda3 *xdb3;
    xd4L = xda4 *(1 -xdb4); xd4R = xda4 *xdb4;
    //  EFFECTS LEVEL ***************************************************
    xLa = db2a(xaL); xRa = db2a(xaR); xLb = db2a(xbL); xRb = db2a(xbR);
    xLc = db2a(xcL); xRc = db2a(xcR); xLd = db2a(xdL); xRd = db2a(xdR);
    xa1L *= xLa; xa2L *= xLa; xa3L *= xLa; xa4L *= xLa;
    xa1R *= xRa; xa2R *= xRa; xa3R *= xRa; xa4R *= xRa;
    xb1L *= xLb; xb2L *= xLb; xb3L *= xLb; xb4L *= xLb;
    xb1R *= xRb; xb2R *= xRb; xb3R *= xRb; xb4R *= xRb;
    xc1L *= xLc; xc2L *= xLc; xc3L *= xLc; xc4L *= xLc;
    xc1R *= xRc; xc2R *= xRc; xc3R *= xRc; xc4R *= xRc;
    xd1L *= xLd; xd2L *= xLd; xd3L *= xLd; xd4L *= xLd;
    xd1R *= xRd; xd2R *= xRd; xd3R *= xRd; xd4R *= xRd;
    //  EFFECTS CHANNEL SUM *********************************************
    xa1L += xa2L; xa1R += xa2R; xb1L += xb2L; xb1R += xb2R;
    xc1L += xc2L; xc1R += xc2R; xd1L += xd2L; xd1R += xd2R;
    xa1L += xa3L; xa1R += xa3R; xb1L += xb3L; xb1R += xb3R;
    xc1L += xc3L; xc1R += xc3R; xd1L += xd3L; xd1R += xd3R;
    xa1L += xa4L; xa1R += xa4R; xb1L += xb4L; xb1R += xb4R;
    xc1L += xc4L; xc1R += xc4R; xd1L += xd4L; xd1R += xd4R;

As a consequence of the hand ordering of control-parameter instructions, MIDI can modulate multiple controls, in real time, with minimal additional CPU load (typical load is 5% on a 4GHz i7 at 48kHz sample rate). All values are calculated at 64-bit precision.

But why doesn't the above pan use 3dB center compensation? That question means you can read enough code to recognize, instantly, that the above code is for linear stereo pan, and know enough about mixing to recognize that linear stereo pan causes changes in total sound pressure. However, for effects sends the compensation rarely makes sense, as the effects change the sound pressure so much as it is, and thus almost all mixers use linear stereo pan for signal sends, as the additional accuracy from center weighting makes no effective difference. But if you want it, it is not difficult to add, as a number of possible weighted pans are already available in the Synthcore library. If you are interested in more such detail, please see the Synthcore Library on this site.

2.1. Appendix A: Latency Adjustments

- There is Audio Latency and Control Latency.
  • Audio Latency is the delay between the reception of an audio-signal value and its output from the app. You can adjust this buffer size to your audio driver and system with the I/O Vector Size in the app's Audio Settings panel. Values as low as 32 samples are typically available, with lower latency in custom ASIO hardware such as from RME. Internally, this app double-buffers the I/O to improve performance, so the actual latency will be twice the I/O vector size setting. The Plugin I/O is also buffered, so if you include plugins, each one adds an additional buffer delay.
  • Control Latency is the time it takes for a MIDI control signal to change a parameter. First, the standard MIDI serial rate is 31.25 kbps. This app samples MIDI at 80 kbps, over double the standard rate. Then, internally, the software calculates control signals in blocks, set by Signal Vector Size in the app's Audio Settings panel. Signal vector size must be the same as or smaller than I/O vector size.
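As a rough cross-check on these settings, the nominal duration of one block is just its size in samples divided by the sample rate, and double-buffering doubles the audio figure. The C sketch below computes only those nominal values. The function names are hypothetical, and latency measured on a real system depends on the driver and hardware, so measured figures will generally differ from this arithmetic.

```c
#include <assert.h>

/* Nominal duration of one block of `frames` samples, in milliseconds. */
double block_ms(unsigned frames, double sample_rate_hz)
{
    assert(sample_rate_hz > 0.0);
    return 1000.0 * (double)frames / sample_rate_hz;
}

/* Nominal audio latency when the I/O is double-buffered:
   two I/O vectors are in flight at any time. */
double double_buffered_ms(unsigned io_vector, double sample_rate_hz)
{
    return 2.0 * block_ms(io_vector, sample_rate_hz);
}

/* Control update rate implied by the signal vector size, in Hz:
   one control update per signal vector. */
double control_rate_hz(unsigned signal_vector, double sample_rate_hz)
{
    assert(signal_vector > 0);
    return sample_rate_hz / (double)signal_vector;
}
```

Plugging in different vector sizes shows how quickly the nominal numbers move: halving a vector size halves the block time and doubles the control rate.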

Typical latency expectations are well met with the default sizes of 256 audio samples for the I/O vector buffer size, and 256 samples for the signal vector size. That provides the following:

Total latency, with default settings, is 4.8 milliseconds at 48 kHz sample rate for audio, and 4.8 milliseconds for control signals, with ~5% CPU load on a 4 GHz i7.

But Reaktor has a typical control rate of 400Hz. - By default, Reaktor processes control signals at 2.5ms intervals. If you find this necessary, reduce the default I/O and signal vector size to 128 samples at 48kHz; the app then processes control signals at 2ms intervals, faster than Reaktor's default. You will need at least a 2GHz machine for control intervals shorter than 2ms, or you would need to Enable Hardware Interrupts for consistent audio performance. However, enabling hardware interrupts is generally not necessary, and can cause other audio apps to complain as they are already using hardware interrupts.

So why is the default at 256 samples? - RemixQuad3D can easily process control rates faster than 250Hz, and can typically obtain 0.2ms latency, but CPU load is more uneven and reaches 10% on occasion. On top of that, for mixing it is rarely advisable to change parameters much more quickly anyway, because it can cause clicks in the audio source. So internally, all the continuous control signals are low-pass filtered by highly efficient custom integrators, with a settling time of 46 audio clock cycles and tail clipping to eliminate something called 'denormals.' Denormals (also called subnormals) are numbers so close to zero that they fall below the normal floating-point range; many CPUs handle them through a much slower path, which greatly increases CPU load. The app's control integrators smooth the control signals so that clicks are avoided, without introducing denormals, down to 64-sample DSP vector blocks. So it can work. On the other hand, an I/O vector size four times the signal vector size results in the smoothest CPU load, because RemixQuad calculates everything in batches of four. So by reducing signal vector size down to 64, and leaving I/O vectors at 256, you will obtain a consistent 0.8ms latency whether you are using plugins or not (for a slightly higher CPU load, and not recommended on slower machines).
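The general shape of such a smoother can be sketched in a few lines of C. This is a plain one-pole low-pass with a snap-to-target threshold, written under my own assumptions: it is not the app's custom integrator, it does not reproduce the 46-sample settling time, and the `Smoother` type, its fields, and the threshold value are hypothetical. What it shows is the tail clipping: once the remaining distance to the target is below the threshold, the value jumps straight to the target instead of decaying through ever smaller numbers into the denormal range.

```c
#include <assert.h>

/* State for a one-pole control smoother with tail clipping. */
typedef struct {
    double value;   /* current smoothed output */
    double coeff;   /* 0 < coeff <= 1; larger values settle faster */
    double snap;    /* tail-clip threshold: closer than this, jump to target */
} Smoother;

void smoother_init(Smoother *s, double start, double coeff, double snap)
{
    assert(coeff > 0.0 && coeff <= 1.0);
    assert(snap > 0.0);
    s->value = start;
    s->coeff = coeff;
    s->snap = snap;
}

/* Advance one step toward `target` and return the new smoothed value. */
double smoother_step(Smoother *s, double target)
{
    double diff = target - s->value;
    if (diff < s->snap && diff > -s->snap) {
        /* Tail clip: stop the exponential decay before it can produce
           denormal-sized differences. */
        s->value = target;
    } else {
        /* Ordinary one-pole step: move a fixed fraction of the way. */
        s->value += s->coeff * diff;
    }
    return s->value;
}
```

Without the threshold test, a value decaying toward zero keeps shrinking by the same fraction on every step and eventually lands in the denormal range, which is where the CPU cost appears.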

Even faster smoothing rates are possible by adding zero-crossover logic. This additional coding technique causes control level changes only to be applied exactly when the audio signal crosses zero. Zero crossover is implemented in Yofiel's Synthcore library, and may be implemented as an upgrade in the future, depending on demand.
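A minimal C sketch of the zero-crossover idea follows. It is my own illustration, not the Synthcore code: the `ZcGain` type and `zc_process` function are hypothetical names, and a crossing is detected simply as a sign change (or an exact zero) between consecutive samples.

```c
#include <assert.h>
#include <stddef.h>

/* State for a gain stage that defers gain changes to zero crossings. */
typedef struct {
    double gain;      /* gain currently applied to the audio */
    double pending;   /* gain most recently requested by the control */
    double prev_in;   /* previous input sample, for crossing detection */
} ZcGain;

/* Process n samples. A requested gain change takes effect only on a
   sample where the input changes sign or touches zero, so the step in
   the output happens where the signal is smallest. */
void zc_process(ZcGain *st, const double *in, double *out, size_t n)
{
    assert(st != NULL);
    for (size_t i = 0; i < n; i++) {
        double x = in[i];
        /* A product <= 0 means the signs differ or one sample is zero. */
        if (x * st->prev_in <= 0.0)
            st->gain = st->pending;
        out[i] = x * st->gain;
        st->prev_in = x;
    }
}
```

The per-sample branch is the price of this technique, which runs against the earlier advice about conditionals, so it is a trade between click-free fast changes and raw throughput.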


For a reference describing more settings, please see the RemixQuad3D Manual.