Arduino performance

In recent days we’ve been optimizing the Piggy’s firmware to squeeze out every last drop of performance and create even better sounds. Part of the optimization is to get the modulation frequency as high as possible, so that fat FM sounds are possible even at high pitch.

The Piggy is part analogue (VCA, VCO, filter), but also has digital components. It has, for instance, a second, digital oscillator (inside the Arduino) which is used for frequency modulation. The modulation oscillator has to follow the pitch of the analogue VCO, which can get quite high. A few kilohertz is nothing special.

For the frequency to be high, the code needs to run quickly. Typically, we want everything to be done in 300 microseconds, and then a new cycle can begin. But doing this proves a little harder than first thought. The Piggy uses the Arduino’s microcontroller, the Atmel AVR. This baby runs at 16 MHz and can execute many instructions in a single clock cycle, so in only 62.5 nanoseconds! That gives us about 4800 clock cycles per 300 microsecond cycle. This should be a walk in the park then, right? I mean, we’re only talking about a single oscillator and one envelope?

Well, it would be that easy, if it weren’t for the fact that the AVR is 8 bit. This means bigger numbers (16 or 32 bit, which are typically needed for precision applications) need to be handled in multiple steps. For instance, a multiplication of two 16 bit numbers is actually four 8-bit multiplies. Think A x B = (a + c) x (b + d) = ab + ad + cb + cd, where each number is split into a high part and a low part. That means 4 terms. For 32 bit numbers this is actually 16 terms, and all of those need to be added too. So, watch out with bigger numbers and multiplies. They can take a couple of microseconds in the worst case.
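To make the four terms concrete, here is a minimal sketch in plain C of a 16 x 16 bit multiply built only from 8 x 8 bit products. The function name mul16_via_8bit is made up for this post; the compiler normally generates the equivalent for you, so this only illustrates where the work goes.

```c
#include <stdint.h>

/* Illustrative only: multiply two 16-bit numbers using nothing but
   8x8 -> 16 bit products, the way an 8-bit CPU has to do it. */
uint32_t mul16_via_8bit(uint16_t x, uint16_t y)
{
    /* Split each operand into a high byte and a low byte. */
    uint8_t xh = (uint8_t)(x >> 8), xl = (uint8_t)(x & 0xFF);
    uint8_t yh = (uint8_t)(y >> 8), yl = (uint8_t)(y & 0xFF);

    /* The four partial products (each fits in 16 bits). */
    uint16_t hh = (uint16_t)xh * yh;
    uint16_t hl = (uint16_t)xh * yl;
    uint16_t lh = (uint16_t)xl * yh;
    uint16_t ll = (uint16_t)xl * yl;

    /* Shift each term to its place value and add them all up. */
    return ((uint32_t)hh << 16)
         + ((uint32_t)hl << 8)
         + ((uint32_t)lh << 8)
         + (uint32_t)ll;
}
```

Four multiplies plus a handful of multi-byte additions: that is where the extra cycles go, and a 32 bit multiply repeats the same pattern with 16 partial products.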

But even worse, the AVR, like most 8 bit processors, does not have a divide instruction. Instead, division is typically done by a software long division routine. It turns out a full 32 bit divide takes no less than 50 microseconds (!!). We measured this with a GPIO pin on the oscilloscope and with a special benchmark program (run the division 10000 times and print the elapsed time). This is deadly for performance: a single divide eats a sixth of our 300 microsecond budget. In such cases the following optimizations may be in order:

  • Resort to shifts: use powers of two (2, 4, 8, 16, ...) as divisors, and for unsigned values the compiler will replace the division routine with a simple shift. This saves almost all of the 50 microseconds.
  • If you cannot use a power of two, you can perhaps pre-calculate the inverse and multiply by it at run-time. This will be 10 times faster or more.
  • Avoid division altogether. Sometimes there are smart algorithms for this (for instance the Bresenham line algorithm).
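The first two tricks can be sketched as follows. This is a rough illustration in plain C, not the Piggy’s actual code: the function names are made up, and the constant 52429 is the well-known fixed-point reciprocal of 10 (2^19 / 10, rounded up), which gives exact results for every 16 bit input.

```c
#include <stdint.h>

/* Divide by a power of two: a shift does the job (unsigned values). */
uint16_t div_by_8(uint16_t x)
{
    return x >> 3;   /* same result as x / 8 */
}

/* Divide by 10 without a divide: multiply by a pre-calculated
   fixed-point inverse, then shift the scale factor back out.
   52429 = ceil(2^19 / 10); the product always fits in 32 bits. */
uint16_t div_by_10(uint16_t x)
{
    return (uint16_t)(((uint32_t)x * 52429UL) >> 19);
}
```

For example, div_by_10(12345) gives 1234, just like 12345 / 10. The same idea works with an inverse that is calculated once at run-time outside the fast loop, as long as you can live with a small rounding error.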

When designing a synthesizer it’s important to understand that division is typically required for envelopes. The rest is primarily linear signal processing, with perhaps a non-linear, look-up table based operation here and there. This can be solved with cheap operations. Envelope data does not need to be computed for every sample; it can be done at something like 500 or 1000 Hz.
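One simple way to get that lower envelope rate is a divider counter inside the fast loop. The sketch below is an assumption-laden illustration, not the Piggy’s firmware: the names (synth_state, synth_tick, ENV_DIVIDER), the decay formula and the values are all made up. With a 300 microsecond cycle, updating every 4th tick gives roughly 830 Hz.

```c
#include <stdint.h>

#define ENV_DIVIDER 4u   /* hypothetical: update envelope every 4th tick */

typedef struct {
    uint8_t  countdown;    /* ticks left until the next envelope update */
    uint16_t env_level;    /* current envelope level, 0..65535 */
    uint16_t decay_div;    /* hypothetical decay setting (the divisor) */
    uint32_t env_updates;  /* how often the slow path has run */
} synth_state;

void synth_init(synth_state *s, uint16_t level, uint16_t decay_div)
{
    s->countdown   = ENV_DIVIDER;
    s->env_level   = level;
    s->decay_div   = decay_div ? decay_div : 1;  /* never divide by zero */
    s->env_updates = 0;
}

/* Called once per cycle. The expensive divide only runs on every
   ENV_DIVIDER-th call; all other calls take the cheap path. */
uint8_t synth_tick(synth_state *s, uint8_t osc_sample)
{
    if (--s->countdown == 0) {
        s->countdown = ENV_DIVIDER;
        /* Slow path: placeholder decay that needs a real division. */
        s->env_level -= s->env_level / s->decay_div;
        s->env_updates++;
    }
    /* Fast path: scale the oscillator sample by the envelope level. */
    return (uint8_t)(((uint32_t)osc_sample * s->env_level) >> 16);
}
```

The cost of the divide is still paid in full on the tick where it runs, so this only lowers the average load; it does not shorten the worst-case cycle.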

So, you can put envelope processing in a background task, and let modulation run in the foreground. But that’s something for another post 😉