Float/String Conversion in Picolibc: Enter “Ryū”
I recently wrote about this topic having concluded that the best route for now was to use the malloc-free, but imprecise, conversion routines in the tinystdio alternative.
A few days later, Sreepathi Pai pointed me at some very recent work in this area:
This is amazing! Thirty years after the papers referenced in the previous post, Ulf Adams came up with some really cool ideas and managed to reduce the math required for 64-bit conversion to 128 bit integers. This is a huge leap forward; we were doing long multi-precision computations before, and now it's all short enough to fit in registers (ok, a lot of registers, but still).
Getting the Ryū Code
The code is available on github: https://github.com/ulfjack/ryu. Reading through it, it's very clear that the author focuses on performance with lots of tuning for common cases. Still, it's quite readable, especially compared with the newlib multi-precision based code.
Picolibc String/Float conversion interface
Picolibc has some pretty basic needs for the float/string conversion code, it wants four functions:
__dtoa_engine
int __dtoa_engine(double x, struct dtoa *dtoa, uint8_t max_digits, uint8_t max_decimals);
This converts the double
x
to a string of decimal digits and a decimal exponent stored inside the 'dtoa' struct. It limits the total number of digits tomax_digits
and, optionally (when max_decimals is non-zero), limits the number of fractional digits tomax_decimals - 1
. This latter supports 'f' formats. Returns the number of digits stored, which is <= max_digits. Less if the number can be accurately represented in fewer digits.__ftoa_engine
int __ftoa_engine(float x, struct ftoa *ftoa, uint8_t max_digits, uint8_t max_decimals);
The same as
__dtoa_engine
, except for floats.__atod_engine
double __atod_engine(uint64_t m10, int e10);
To avoid needing to handle stdio inside the conversion function,
__atod_engine
receives fully parsed values, the base-10 significand (m10
) and exponent (e10
). The value to convert ism10 * pow(10, e10)
.__atof_engine
float __atof_engine(uint32_t m10, int e10);
The same as
__atod_engine
, except for floats.
With these, it can do printf, scanf, ecvt, fcvt, gcvt, strtod, strtof and atof.
Porting Ryū to Picolibc
The existing Ryū float-to-string code always generates the number of
digits necessary for accurate output. I had to hack it up to generate
correctly rounded shorter output when max_digits
or max_decimals
were smaller. I'm not sure I managed to do that correctly, but at
least it appears to be passing all of the test cases I have. In normal
operation, Ryū iteratively removes digits from the answer that aren't
necessary to disambiguate with neighboring values.
What I changed was to keep removing digits using that method until the answer had few enough digits to fit in the desired length. There's some tricky rounding code that adjusts the final result and I had to bypass that if I'd removed extra digits.
That was about the only change necessary to the core algorithm. I also trimmed the code to only include the general case and not the performance improvements, then wrapped it with code to provide the _engine interface.
On the string-to-float side, most of what I needed to do was remove the string parsing bits at the start of the function and switch from performance-optimized to space-optimized versions of a couple of internal routines.
Correctness Results
Because these new functions are now 'exact', I was able to adjust the picolibc tests to compare all of the bits for string/float conversion instead of having to permit a bit of slop in the answers. With those changes, the picolibc test suite passes, which offers some assurance that things aren't completely broken.
Size Results
Snek uses the 32-bit float versions of the conversion routines, and for that, the size difference is:
text data bss dec hex filename
59068 44 37968 97080 17b38 snek-qemu-riscv-orig.elf
59430 44 37968 97442 17ca2 snek-qemu-riscv-ryu.elf
362
362 bytes added to gain accurate printf/strtof results seems like a good trade-off in this case.
Performance
I haven't measured performance at all, but I suspect that it won't be nearly as problematic on most platforms as the source code makes it appear. And that's because Ryū is entirely integer arithmetic with no floating point at all. This avoids using the soft fp code for platforms without hardware float support.
Pointers to the Code
I haven't merged this to picolibc master yet, it's on the ryu branch:
Review, especially of the hack above to return short results, would be greatly appreciated!
Thanks again to Ulf Adams for creating this code and to Sreepathi Pai for sending me a note about it!