Forward Error Correction with the TI CC1120 Sub-GHz transceiver
We're building a new, fancier flight computer and decided to give the new, fancier TI CC1120 chip a try. It has significantly more transmit power (+16dBm) and better receive sensitivity (-109dBm) than the transceiver built into the TI CC1111 RF SoC that we've built our other boards with. Given that we'd already decided to use a fancier STM32L processor, using the nicest stand-alone radio chip we could find seemed to make good sense.
One of the nice features found in the CC1111 is support for FEC using a 1/2 rate constraint length 4 convolutional code. Both the encoder and the soft-decision Viterbi decoder are built right into the chip; all we had to do was flip a couple of bits and we got error correction, interleaving, data whitening and CRC checking all for free.
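For reference, flipping those bits on the CC1111 looks something like the sketch below. The register and bit names come from the CC1101-family radio core that the CC1111 shares; a real driver sets many more registers than this:

/* Enable the CC1111 packet-handling hardware. These registers are
 * memory-mapped on the 8051, so with the chip header included,
 * plain C assignments work.
 */
MDMCFG1  |= (1 << 7);               /* FEC_EN: convolutional coding + interleaving */
PKTCTRL0 |= (1 << 6) | (1 << 2);    /* WHITE_DATA and CRC_EN */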
We assumed that, as the CC1120 was advertised as a 'better' replacement for our CC1111 radio, it would also include all of these fine features. After we got the chip all soldered down to our first prototype boards, we read through the reference manual looking for the right register values to flip for forward error correction. We found support for CRC computation and data whitening, but no mention of interleaving or forward error correction at all.
Thanks, TI. Of course, without the FEC pieces, the other packet handling hardware is worthless—you can't exactly do a CRC on the encoded data and expect it to be useful on the other end of the link, and data whitening happens after the CRC.
Basic CC1120 driver
I assumed I'd eventually figure out something to do about the FEC pieces and started writing a basic CC1120 driver for AltOS, the operating system used in all of our projects. I started with the transmit side, which seemed easier: I could upload the desired bit sequence, using whatever fancy encoding I liked, and the existing packet hardware would happily transmit it.
I first tested that at a low data rate (2kbps), transmitting an alternating sequence of zeros and ones to generate a 1kHz tone on a regular 70cm receiver. That worked just fine, demonstrating that I had some grasp of the basic operation of the chip.
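Generating that test pattern is about as simple as it gets; here's a sketch of the idea (ao_radio_send is a hypothetical stand-in for the driver's transmit entry point):

#include <stdint.h>
#include <string.h>

/* Hypothetical driver entry point */
extern void ao_radio_send(const uint8_t *data, uint8_t len);

/* Alternating bits at 2kbps form a square wave at half the bit
 * rate, i.e. a 1kHz tone on an FM receiver. 0x55 is 01010101.
 */
static void
ao_radio_test_tone(void)
{
    static uint8_t tone[250];   /* 2000 bits = one second at 2kbps */

    memset(tone, 0x55, sizeof(tone));
    ao_radio_send(tone, sizeof(tone));
}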
Convolutional Encoding in Software
Not to be daunted by a small challenge in software, I set out to replicate the encoding scheme used in the CC1111 chips for use in our CC1120-based design. We obviously want our CC1120 boards to interoperate with the CC1111 boards. Fortunately, TI fully documents the CRC, data whitening, convolutional encoding and interleaving done in the hardware.
I created my own implementation of the encoder (licensed, as usual, under the GPLv2) which you can see in the file ao_fec_tx.c. It's not quite as flexible as the hardware: it always whitens the data and always appends the CRC bytes to the end of the packet, along with the necessary trellis termination bytes.
The core of that code is surprisingly simple:
static const uint8_t ao_fec_encode_table[16] = {
    /* next   0  1    state */
              0, 3,   /* 000 */
              1, 2,   /* 001 */
              3, 0,   /* 010 */
              2, 1,   /* 011 */
              3, 0,   /* 100 */
              2, 1,   /* 101 */
              0, 3,   /* 110 */
              1, 2    /* 111 */
};
uint8_t
ao_fec_encode(const uint8_t *in, uint8_t len, uint8_t *out)
{
    uint8_t extra[AO_FEC_PREPARE_EXTRA];
    uint8_t extra_len;
    uint32_t encode, interleave;
    uint8_t pair, byte, bit;
    uint16_t fec = 0;
    const uint8_t *whiten = ao_fec_whiten_table;

    extra_len = ao_fec_prepare(in, len, extra);
    for (pair = 0; pair < len + extra_len; pair += 2) {
        encode = 0;
        for (byte = 0; byte < 2; byte++) {
            if (pair + byte == len)
                in = extra;
            fec |= *in++ ^ *whiten++;
            for (bit = 0; bit < 8; bit++) {
                encode = encode << 2 | ao_fec_encode_table[fec >> 7];
                fec = (fec << 1) & 0x7ff;
            }
        }

        interleave = 0;
        for (bit = 0; bit < 4 * 4; bit++) {
            uint8_t byte_shift = (bit & 0x3) << 3;
            uint8_t bit_shift = (bit & 0xc) >> 1;

            interleave = (interleave << 2) | ((encode >> (byte_shift + bit_shift)) & 0x3);
        }
        *out++ = interleave >> 24;
        *out++ = interleave >> 16;
        *out++ = interleave >> 8;
        *out++ = interleave >> 0;
    }
    return (len + extra_len) * 2;
}
ao_fec_encode_table takes care of the convolutional piece, ao_fec_whiten_table (not shown) contains the data whitening codes, and the little loop at the bottom does the interleaving. I didn't spend a lot of time optimizing this as it seemed to run plenty fast on the 32MHz processor.
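As a usage sketch, here's a tiny test harness for the encoder. It assumes it's compiled together with ao_fec_tx.c, and that AO_FEC_PREPARE_EXTRA covers the four extra bytes (two CRC, two trellis termination) described above:

#include <stdint.h>
#include <stdio.h>

/* Declarations from the AltOS sources; the value of
 * AO_FEC_PREPARE_EXTRA here is assumed (2 CRC + 2 termination).
 */
#define AO_FEC_PREPARE_EXTRA 4
uint8_t ao_fec_encode(const uint8_t *in, uint8_t len, uint8_t *out);

int
main(void)
{
    static const uint8_t payload[] = "hello";
    uint8_t encoded[(sizeof(payload) + AO_FEC_PREPARE_EXTRA) * 2];
    uint8_t i, len;

    /* Whitens, appends CRC and termination, encodes and interleaves */
    len = ao_fec_encode(payload, sizeof(payload), encoded);
    for (i = 0; i < len; i++)
        printf("%02x ", encoded[i]);
    printf("\n");
    return 0;
}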
With the bits all nicely encoded, I passed them to my simple CC1120 driver and let it send them along to our CC1111 receiver. Much to my delight, with only a few bugs fixed, it worked just fine! I'd managed to get one direction working, validating the hardware and making it clear that we'd at least be able to use the board in flight for telemetry.
Getting Soft-decision data out of the CC1120
In packet mode, the CC1120 receiver converts the incoming GFSK stream directly into bits, making its best guess as to whether each detected value was a one or a zero. Of course, like any FM modulation, the receiver will be producing intermediate values, especially when the signal is weak or masked by interference. A Viterbi decoder can make good use of this information; these noisy intermediate values contribute less to the final path cost, which lets the strong 0/1 values pull the solution toward the right one. Soft decision decoding yields a theoretical 2dB gain over hard decision decoding, and given our desire for maximum packet reception over long distances, it seemed important to make this work.
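To make that concrete, here's a minimal sketch of the kind of branch metric involved. The 0x00/0xFF representation and XOR-style distance match what the decoder code shown later uses; the specific numbers are just an illustration:

#include <stdint.h>
#include <stdio.h>

/* Distance between a received soft symbol and an expected hard
 * symbol. A confident zero arrives near 0x00, a confident one near
 * 0xff, a noisy bit somewhere in between. When expected is 0x00 or
 * 0xff, XOR is exactly the absolute difference.
 */
static uint32_t
soft_distance(uint8_t received, uint8_t expected)
{
    return received ^ expected;
}

int
main(void)
{
    /* A strong zero is cheap to decode as a zero... */
    printf("%u\n", (unsigned) soft_distance(0x03, 0x00));   /* 3 */
    /* ...while an ambiguous value costs nearly the same either way,
     * so it barely influences the path choice. */
    printf("%u\n", (unsigned) soft_distance(0x85, 0x00));   /* 133 */
    printf("%u\n", (unsigned) soft_distance(0x85, 0xff));   /* 122 */
    return 0;
}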
The CC1120 has a way to recover the soft decision values, but only one 8-bit value at a time. It does, however, helpfully provide a GPIO wire which can be programmed to interrupt the CPU when the soft decision data has arrived.
So, I hooked up a GPIO line on the processor to a GPIO line on the CC1120 and let it interrupt once per bit time, or 38400 times per second, to read this valuable soft decision data out of the radio part. According to TI support, this is the only way to make this work, and they generally recommend that developers use the packet mode stuff and throw away this valuable source of data.
I was happily surprised to learn that the STM32L can in fact survive this interrupt rate, and by setting interrupt priorities appropriately, I managed to reliably capture an entire packet's worth of data without dropping a single bit.
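The capture path boils down to a very small interrupt handler feeding a buffer. Here's a sketch of the shape of it; ao_radio_spi_recv_byte and the buffer bookkeeping are hypothetical stand-ins for the real AltOS SPI and interrupt plumbing:

#include <stdint.h>

#define AO_FEC_MAX_SOFT 1024

static uint8_t soft_bits[AO_FEC_MAX_SOFT];
static uint16_t soft_count;

/* Hypothetical: fetch one soft-decision byte from the CC1120 over SPI */
extern uint8_t ao_radio_spi_recv_byte(void);

/* Runs once per bit time (38400 times per second), triggered by the
 * CC1120 GPIO line; it has roughly 26µs to finish before the next
 * soft value arrives.
 */
static void
ao_radio_soft_isr(void)
{
    if (soft_count < AO_FEC_MAX_SOFT)
        soft_bits[soft_count++] = ao_radio_spi_recv_byte();
}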
I set up a simple packet dumping function to run on the prototype board and captured raw encoded bytes from our CC1111-based flight computer so that I'd have some "real" data to work from.
Soft-decision Viterbi Decoding
The sample encoding implementation from TI made that piece fairly straightforward. Lacking a similar sample for the decoding piece meant that I'd be starting from scratch, implementing something that seemed like magic to me at the outset. I started by reading a few articles on Viterbi decoding.
I found Chip Fleming's tutorial very useful as it included a bunch of worked examples using a convolutional encoder with a constraint length (k) of 3 instead of 4; this makes all of the trellis diagrams half the size.
I also sent my friend, Phil Karn, a query about this process, and he responded with a bunch of helpful suggestions, while also recommending that I just steal his DSP and FEC library implementation. Of course, I don't like using code that I can't understand, and his code only implements much longer codes than I needed, so I'd have to understand it anyway to make it work for me. Phil's code is a marvel, but it's not exactly comprehensible to someone who doesn't even know what a Viterbi decoder is supposed to do.
So, with Chip's tutorial on my screen, I wrote a primitive Viterbi decoder. This did only the Viterbi step, and split that into two phases. The first computed the cost for each state in the system for every received pair of symbols. It also tracked the chain of states through the entire decode process:
for (state = 0; state < 8; state++) {
    struct ao_soft_sym zero = ao_soft_sym(ao_fec_encode_table[state * 2 + 0]);
    struct ao_soft_sym one = ao_soft_sym(ao_fec_encode_table[state * 2 + 1]);
    uint8_t zero_state = ao_next_state(state, 0);
    uint8_t one_state = ao_next_state(state, 1);
    int zero_cost = ao_cost(s, zero);
    int one_cost = ao_cost(s, one);

    zero_cost += cost[b][state];
    one_cost += cost[b][state];
    if (zero_cost < cost[b+1][zero_state]) {
        prev[b+1][zero_state] = state;
        cost[b+1][zero_state] = zero_cost;
    }
    if (one_cost < cost[b+1][one_state]) {
        prev[b+1][one_state] = state;
        cost[b+1][one_state] = one_cost;
    }
}
At the end of the packet, it found the least costly path and walked that from back to front, generating output bits (yes, one bit per element of the 'bits' array):
/* Start from state 0; the loop finds the cheapest final state */
c = cost[b][0];
min_state = 0;
for (state = 1; state < 8; state++) {
    if (cost[b][state] < c) {
        c = cost[b][state];
        min_state = state;
    }
}
for (b = len/2; b > 0; b--) {
    bits[b-1] = min_state & 1;
    min_state = prev[b][min_state];
}
Completely separate from this code were the interleaving and data whitening stages. Keeping those separate allowed them to be tested independently, which made debugging them far easier.
None of this code ever ran on the STM32L processor; it was instead run in a test framework on my laptop. I first got this working with data generated by the software convolutional encoder, then I managed to get it to decode the raw packets that I had recorded directly from the receiver itself.
This marked a significant milestone in my mind: I'd managed to replace, in software, the CC1111 FEC encoder and decoder. However, the decoder was not yet fast enough to manage back-to-back packets. Having something in hand that worked meant that further development could be compared directly against a known-working baseline, making optimization much easier. Yet another lesson in making it work, and then making it work fast.
Optimizing the Viterbi Decoder
The first step was to reduce memory usage within the decoder. While it's best to defer computing the path until the very last bit has been received, it's also true that the chance of a new bit affecting old data decreases as the distance between the new and old data increases. With the rule of thumb being a distance of k × 6 or k × 7, I saved 32 output bits for each state and wrote out the oldest 8 each time the buffer filled up, leaving a distance of 24 bits between the newest saved data and the current decoding position. That meant a fixed amount of storage was needed for an arbitrarily long packet.
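Here's a sketch of that idea, assuming the path history for each of the 8 states lives in a 32-bit word as in the code shown later; the surrounding bookkeeping here is illustrative, not the actual AltOS code:

#include <stdint.h>

#define NUM_STATE 8

static uint32_t bits[NUM_STATE];    /* 32 bits of path history per state */
static uint8_t history;             /* bits decoded since the last flush */

/* Called after each decoded bit, once the cheapest current state is
 * known. Every 8 bits, the oldest 8 bits of history sit 24 bits
 * behind the decode front, past the k*6 rule of thumb, so all
 * surviving paths agree on them and they can be written out.
 */
static void
ao_fec_flush(uint8_t min_state, uint8_t *out, uint16_t *out_len)
{
    if (++history == 8) {
        out[(*out_len)++] = (uint8_t) (bits[min_state] >> 24);
        history = 0;
    }
}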
I tested a variety of different history sizes from 8 to 56 bits (corresponding to 16, 32 and 64 bits of saved data for each state). Using 100000 random packets with Gaussian noise added to each bit, here are the error rates for the different history lengths:
History Length | Total Packets | Correct | Incorrect
---|---|---|---
8 bits | 100000 | 90520 (90.52%) | 9480 (9.48%)
24 bits | 100000 | 93614 (93.61%) | 6386 (6.39%)
56 bits | 100000 | 93620 (93.62%) | 6380 (6.38%)
The run-time performance of the 8-bit and 24-bit history versions turned out to be essentially identical (given the 32-bit nature of the STM32L, that shouldn't be too surprising). The 56-bit version, which manipulated uint64_t values, was significantly slower and, given the very modest improvement in error rate, was deemed not worth it. If we move to a larger constraint length (k > 4), or start using a punctured convolutional encoding, we would want to re-measure this.
The next step was to combine the de-interleaving, decoding, de-whitening and CRC computation into a single loop. This would eliminate a pile of data copying and extra memory usage. Nothing fancy here, but it made enough of a difference that I could decode most of the incoming packets now, dropping only about 25% of them.
To get to being able to decode every packet, I needed to start decoding the packet while it was being received. Because of interleaving, I had to have 32 bits of data before I could decode anything, so it was a simple matter of having my per-soft-value interrupt handler kick the decoder after each interleave block of data was received. With this in place, I was able to decode a 76-byte payload (160 transmitted bytes) in 14.7ms, or just about 50% of the time it took to transmit the packet. Unrolling one of the inner decoding loops eliminated a bunch of computation and sped this up even more, decoding packets in 9.5ms.
Success at last: a solid 100% of packets received were being decoded, as long as the CPU was otherwise idle.
Finally, I went and re-read Phil Karn's code and pondered over how it works. It was opaque until I wrote a message back to Phil asking how it worked and pointing out sections that were confusing. Fortunately, simply phrasing my confusion in words made me understand how the code works. Here's my original code for computing the per-state cost metric:
/* Metric for a zero bit */
uint32_t bitcost = ((uint32_t) (s0 ^ ao_fec_decode_table[(state<<1)]) +
                    (uint32_t) (s1 ^ ao_fec_decode_table[(state<<1)+1]));

{
    /* Total path metric and state for a zero bit */
    uint32_t cost0 = cost[p][state] + bitcost;
    uint8_t state0 = ao_next_state(state, 0);

    /* Compare the total path metric against all existing
     * metrics heading into the same new state, choosing
     * the least expensive and killing the others. Record
     * the new lowest cost and output bit.
     */
    if (cost0 < cost[n][state0]) {
        cost[n][state0] = cost0;
        bits[n][state0] = (bits[p][state] << 1) | (state & 1);
    }
}

{
    /* Total path metric and state for a one bit. Because
     * encoding a one is always exactly the opposite of
     * encoding a zero from any state, this cost is
     * 255*2 - bitcost
     */
    uint32_t cost1 = cost[p][state] + 510 - bitcost;
    uint8_t state1 = ao_next_state(state, 1);

    if (cost1 < cost[n][state1]) {
        cost[n][state1] = cost1;
        bits[n][state1] = (bits[p][state] << 1) | (state & 1);
    }
}
Before this loop is run, I have to initialize cost[n] to a large value so that the comparisons will be true for the first case to reach each state. Phil's code, in contrast, doesn't have to initialize cost[n] as we know which two states can reach any possible new state. His code looks like this:
/* Metric for a zero bit */
bitcost = ((uint32_t) (s0 ^ ao_fec_decode_table[(state<<1)]) +
(uint32_t) (s1 ^ ao_fec_decode_table[(state<<1)|1]));
/* Only state and state+4 reach state<<1 with a zero bit */
/* Cost from state to state<<1 for a zero bit */
m0 = cost[p][state] + bitcost;
/* Cost from state+4 to state<<1 for a zero bit */
m1 = cost[p][state+4] + (510 - bitcost);
/* Which source state is cheaper? */
bit = m0 > m1;
/* Record the cheaper cost and add to the path of bits */
cost[n][state<<1] = bit ? m1 : m0;
bits[n][state<<1] = (bits[p][state + (bit<<2)] << 1) | (state&1);
/* State and state+4 reach (state<<1)+1 with a one bit */
/* Cost from state to (state<<1)+1 for a one bit */
m0 -= (bitcost+bitcost-510);
/* Cost from state+4 to (state<<1)+1 for a one bit */
m1 += (bitcost+bitcost-510);
/* Which source state is cheaper? */
bit = m0 > m1;
/* Record cheaper cost and add to the path of bits */
cost[n][(state<<1)+1] = bit ? m1 : m0;
bits[n][(state<<1)+1] = (bits[p][state + (bit<<2)] << 1) | (state&1);
This code does two states at a time, and so while it's slightly longer than the above, it only needs to run half as many times. The resulting code now decodes packets in 5.7ms.
The final version of the code can be seen in ao_fec_rx.c.
The whole AltOS operating system, with many drivers and support for CC1111, AVR and STM32L processors, is available under the GPLv2 from git://git.gag.com/scm/git/fw/altos