In "Using SD Cards with Raspberry Pi Pico and TinyGo (FatFs)", SD card access with TinyGo was considerably slower than with C++, especially for reads, so I investigated the cause and attempted improvements.
Causes of Speed Degradation
The following projects were used for comparison: pico_fatfs_test (C++) and pico_tinygo_fatfs_test (TinyGo).
Logic Analyzer Waveform Comparison
First, the SPI waveform during read access (512 byte transfer) was monitored with a logic analyzer.
[Figure: SPI Waveform with TinyGo]
[Figure: SPI Waveform with C++]
While the SPI SCK frequency is 125 MHz / 4 = 31.25 MHz in both cases, the time from the start of one 8-bit access to the start of the next differs greatly: 310 ns for C++ versus 1.75 µs for TinyGo.
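To put these numbers in context, the following is a small sketch of the arithmetic, not part of either project. It computes the minimum time 8 bits occupy on the wire at 31.25 MHz (about 256 ns) and the time a 512-byte transfer takes at the two measured byte-to-byte periods (about 159 µs and 896 µs). The function names are made up for illustration.

```go
package main

import "fmt"

// byteWireTimeNs returns the time in nanoseconds needed to clock 8 bits
// at the given SCK frequency in Hz.
func byteWireTimeNs(sckHz float64) float64 {
	return 8 / sckHz * 1e9
}

// blockTimeUs returns the time in microseconds to transfer n bytes when
// each byte starts periodNs nanoseconds after the previous one.
func blockTimeUs(n int, periodNs float64) float64 {
	return float64(n) * periodNs / 1e3
}

func main() {
	sck := 125e6 / 4 // 31.25 MHz, the SCK frequency seen in both waveforms

	fmt.Printf("wire time per byte: %.0f ns\n", byteWireTimeNs(sck))
	fmt.Printf("C++    512 bytes:   %.1f us\n", blockTimeUs(512, 310))
	fmt.Printf("TinyGo 512 bytes:   %.1f us\n", blockTimeUs(512, 1750))
}
```

With these figures, the C++ loop leaves only a small gap beyond the wire time between bytes, while the TinyGo loop spends most of each period outside the actual transfer.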
Code Comparison
The code on both sides corresponding to the waveform section confirmed with the logic analyzer was examined. Both are code on the SDK side, not in the projects.
tinygo/src/machine/machine_rp2040_spi.go
func (spi SPI) rx(rx []byte, txrepeat byte) error {
    var deadline = ticks() + _SPITimeout
    plen := len(rx)
    const fifoDepth = 8 // see txrx
    var rxleft, txleft = plen, plen
    for txleft != 0 || rxleft != 0 {
        if txleft != 0 && spi.isWritable() && rxleft < txleft+fifoDepth {
            spi.Bus.SSPDR.Set(uint32(txrepeat))
            txleft--
        }
        if rxleft != 0 && spi.isReadable() {
            rx[plen-rxleft] = uint8(spi.Bus.SSPDR.Get())
            rxleft--
            continue // if reading successfully in rx there is no need to check deadline.
        }
        if ticks() > deadline {
            return ErrSPITimeout
        }
    }
    return nil
}
pico-sdk/src/rp2_common/hardware_spi/spi.c
int __not_in_flash_func(spi_read_blocking)(spi_inst_t *spi, uint8_t repeated_tx_data, uint8_t *dst, size_t len) {
    invalid_params_if(SPI, 0 > (int)len);
    const size_t fifo_depth = 8;
    size_t rx_remaining = len, tx_remaining = len;
    while (rx_remaining || tx_remaining) {
        if (tx_remaining && spi_is_writable(spi) && rx_remaining < tx_remaining + fifo_depth) {
            spi_get_hw(spi)->dr = (uint32_t) repeated_tx_data;
            --tx_remaining;
        }
        if (rx_remaining && spi_is_readable(spi)) {
            *dst++ = (uint8_t) spi_get_hw(spi)->dr;
            --rx_remaining;
        }
    }
    return (int)len;
}
In summary, the findings are as follows.
- In TinyGo, everything down to the lowest layer of the machine package, including microcontroller register access, is written in (Tiny)Go.
- The TinyGo rx() function is a direct port of the spi.c spi_read_blocking() function, aside from the addition of timeout error handling and differences in return values.
Since the processing content itself is essentially identical between TinyGo and C, the cause appears to be the overhead from register access via TinyGo.
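Both loops share the same flow-control rule: a new dummy byte is only written while fewer than 8 bytes (the FIFO depth) are outstanding, so the receive FIFO cannot overflow. The following host-side sketch illustrates that rule with a simulated peripheral. The fakeSPI type and its latency model are hypothetical stand-ins, and the writable check on the TX FIFO is left out for brevity.

```go
package main

import "fmt"

const fifoDepth = 8

// fakeSPI is a hypothetical stand-in for the SPI peripheral: a written
// byte becomes readable only after a number of polls has passed.
type fakeSPI struct {
	pending int // bytes written but not yet shifted into the RX FIFO
	ready   int // bytes waiting in the RX FIFO
	polls   int // number of polls so far
	latency int // polls needed to shift one byte (must be >= 1)
}

// poll advances the simulated hardware by one step.
func (s *fakeSPI) poll() {
	s.polls++
	if s.pending > 0 && s.polls%s.latency == 0 {
		s.pending--
		s.ready++
	}
}

// readBlocking mirrors the loop structure of rx() and spi_read_blocking()
// and returns the largest number of bytes that were ever outstanding.
func readBlocking(s *fakeSPI, n int) (maxInFlight int) {
	rxLeft, txLeft := n, n
	for txLeft != 0 || rxLeft != 0 {
		s.poll()
		// Send only while fewer than fifoDepth bytes are outstanding.
		if txLeft != 0 && rxLeft < txLeft+fifoDepth {
			s.pending++
			txLeft--
		}
		if inFlight := rxLeft - txLeft; inFlight > maxInFlight {
			maxInFlight = inFlight
		}
		// Drain one received byte if one is available.
		if rxLeft != 0 && s.ready > 0 {
			s.ready--
			rxLeft--
		}
	}
	return maxInFlight
}

func main() {
	s := &fakeSPI{latency: 3}
	fmt.Println("max bytes in flight:", readBlocking(s, 512))
}
```

Because the rule is identical on both sides, the simulation supports the reading above that the difference lies in how fast each loop iteration runs, not in what the loop does.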
Speed Improvement Modifications
machine Package
With the target functions identified, modifications were considered to call C functions for performance-critical parts of machine.SPI within the machine package.
For normal packages, placing a modified version locally and referencing it would allow replacement without renaming the existing package, but this approach did not work for the machine package.
The reason, as described here, is that machine is not strictly a package but a TinyGo library.
Therefore, a mymachine.SPI type was defined using embedding, and functions with mymachine.SPI as the receiver were defined.
Additionally, since the sdcard package (tinygo.org/x/drivers/sdcard) that calls machine.SPI needed to use mymachine.SPI instead, it was copied locally and modified.
pico_tinygo_fatfs_test/mymachine/machine_rp2040_spi.go
// +build rp2040

package mymachine

// #include "./spi.h"
import "C"

import (
    "machine"
    "device/rp"
    "errors"
    "unsafe"
)

type SPI struct {
    *machine.SPI
}
...
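The embedding approach can be illustrated on a host machine with stand-in types. In the sketch below, Bus and FastBus are hypothetical names standing in for machine.SPI and mymachine.SPI. It shows that methods of the embedded type stay available on the wrapper, and that a method of the embedded type keeps calling its own helper. This is presumably why the calling methods have to be defined again on the wrapper, not only rx().

```go
package main

import "fmt"

// Bus is a hypothetical stand-in for machine.SPI.
type Bus struct{ name string }

// rx is the inner helper that the original Tx relies on.
func (b *Bus) rx() string { return "go rx on " + b.name }

// Tx calls the helper through the *Bus receiver.
func (b *Bus) Tx() string { return b.rx() }

// Configure stands for a method that needs no replacement.
func (b *Bus) Configure() string { return "configured " + b.name }

// FastBus embeds *Bus, in the way mymachine.SPI embeds *machine.SPI.
type FastBus struct{ *Bus }

// rx replaces the helper for callers that hold a FastBus.
func (f FastBus) rx() string { return "c rx on " + f.name }

// Tx is redefined as well: the promoted Bus.Tx would still call Bus.rx,
// because Go does not dispatch back to the outer type.
func (f FastBus) Tx() string { return f.rx() }

func main() {
	f := FastBus{&Bus{name: "SPI0"}}
	fmt.Println(f.Configure()) // promoted from the embedded *Bus
	fmt.Println(f.Tx())        // FastBus.Tx, which uses FastBus.rx
	fmt.Println(f.Bus.Tx())    // original path, still using Bus.rx
}
```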
pico_tinygo_fatfs_test/sdcard/sdcard.go
package sdcard

import (
    "fmt"
    "machine"
    "time"
    "pico_tinygo_fatfs_test/mymachine"
)
...
func New(b mymachine.SPI, sck, sdo, sdi, cs machine.Pin) Device { // Replace machine.SPI with mymachine.SPI. (Keep machine for others)
    return Device{
        bus:        b,
        cs:         cs,
        sck:        sck,
        sdo:        sdo,
        sdi:        sdi,
        cmdbuf:     make([]byte, 6),
        dummybuf:   make([]byte, 512),
        tokenbuf:   make([]byte, 1),
        sdCardType: 0,
    }
}
...
On the main side, the imports of the modified packages were switched to the local copies, and the type of the spi variable was changed from machine.SPI to mymachine.SPI.
pico_tinygo_fatfs_test/main.go
package main

import (
    "fmt"
    "machine"
    "time"
    "os"
    //"tinygo.org/x/drivers/sdcard"
    "pico_tinygo_fatfs_test/sdcard"
    //"tinygo.org/x/tinyfs/fatfs"
    "pico_tinygo_fatfs_test/fatfs"
    "pico_tinygo_fatfs_test/mymachine"
)

var (
    spi    mymachine.SPI
    sckPin machine.Pin
    sdoPin machine.Pin
    sdiPin machine.Pin
    csPin  machine.Pin
    ledPin machine.Pin
    serial = machine.Serial
)
...
C Call from rx() Function
The rx() function is rewritten to call the C-side spi_read_blocking() function. The first argument of spi_read_blocking(), spi_inst, must point to the base address of the corresponding SPI device's registers, so the address of the first register (SSPCR0) is passed. The other arguments are simply cast to the matching C types. Note that the timeout handling was removed, so the call blocks until the transfer completes.
pico_tinygo_fatfs_test/mymachine/machine_rp2040_spi.go
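func (spi SPI) rx(rx []byte, txrepeat byte) error {
    // Nothing to read; this also avoids indexing an empty slice below.
    if len(rx) == 0 {
        return nil
    }
    // SSPCR0 is the first register, so its address is the base address of the SPI block.
    spi_inst := (*C.spi_inst_t)(unsafe.Pointer(&spi.Bus.SSPCR0.Reg))
    repeated_tx_data := C.uint8_t(txrepeat)
    dst := (*C.uint8_t)(unsafe.Pointer(&rx[0]))
    plen := C.size_t(len(rx))
    // Blocks until all bytes are received; there is no timeout on this path.
    C.spi_read_blocking(spi_inst, repeated_tx_data, dst, plen)
    return nil
}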
Regarding the C source, pico-sdk/src/rp2_common/hardware_spi/spi.c was placed under /pico_tinygo_fatfs_test/mymachine in as close to its original form as possible. Additionally, several files from pico-sdk were placed to enable compilation of spi.c and spi.h. (under mymachine/hardware)
pico_tinygo_fatfs_test/mymachine/spi.c
int spi_read_blocking(spi_inst_t *spi, uint8_t repeated_tx_data, uint8_t *dst, size_t len) {
    invalid_params_if(SPI, 0 > (int)len);
    const size_t fifo_depth = 8;
    size_t rx_remaining = len, tx_remaining = len;
    while (rx_remaining || tx_remaining) {
        if (tx_remaining && spi_is_writable(spi) && rx_remaining < tx_remaining + fifo_depth) {
            spi_get_hw(spi)->dr = (uint32_t) repeated_tx_data;
            --tx_remaining;
        }
        if (rx_remaining && spi_is_readable(spi)) {
            *dst++ = (uint8_t) spi_get_hw(spi)->dr;
            --rx_remaining;
        }
    }
    return (int)len;
}
Confirming Speed Improvement
After changing the internals of the rx() function to a C call, the benchmark and logic analyzer waveform were re-examined.
Benchmark
- pico_fatfs_test (C++)
=====================
== pico_fatfs_test ==
=====================
mount ok
Type is FAT32
Card size: 32.00 GB (GB = 1E9 bytes)
FILE_SIZE_MB = 5
BUF_SIZE = 512 bytes
Starting write test, please wait.
write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
447.7192, 6896, 1007, 1142
446.4797, 7589, 1024, 1145
Starting read test, please wait.
read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
974.9766, 1050, 403, 524
974.4066, 1049, 402, 524
- pico_tinygo_fatfs_test (TinyGo after changing machine.SPI to C calls)
============================
== pico_tinygo_fatfs_test ==
============================
mount ok
Type is FAT32
Card size: 32.00 GB (GB = 1E9 bytes)
FILE_SIZE_MB = 5
BUF_SIZE = 512 bytes
Starting write test, please wait.
write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
390.2342, 17208, 1092, 1292
362.8215, 54979, 897, 1383
Starting read test, please wait.
read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
835.5080, 16601, 554, 591
830.9257, 21483, 559, 594
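The speed was significantly improved, achieving performance close to C++: reads reached about 831 to 836 KB/s against about 975 KB/s for C++ (roughly 85%), and writes about 363 to 390 KB/s against about 447 KB/s. The maximum latency, however, remains considerably higher with TinyGo (up to about 21 ms for reads and 55 ms for writes, against about 1 ms and 7.6 ms for C++).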
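As a quick check of how close the two results are, the sketch below averages the two runs of each benchmark listed above and expresses the TinyGo throughput as a percentage of the C++ throughput (roughly 85% for reads and 84% for writes). The helper names are made up for illustration.

```go
package main

import "fmt"

// mean returns the arithmetic mean of xs.
func mean(xs []float64) float64 {
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// percentOf returns a as a percentage of b.
func percentOf(a, b float64) float64 {
	return a / b * 100
}

func main() {
	// Throughput in KB/s, copied from the benchmark output above.
	cppWrite := []float64{447.7192, 446.4797}
	cppRead := []float64{974.9766, 974.4066}
	goWrite := []float64{390.2342, 362.8215}
	goRead := []float64{835.5080, 830.9257}

	fmt.Printf("read:  %.1f%% of C++\n", percentOf(mean(goRead), mean(cppRead)))
	fmt.Printf("write: %.1f%% of C++\n", percentOf(mean(goWrite), mean(cppWrite)))
}
```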
Logic Analyzer Waveform
[Figure: TinyGo After Changing machine.SPI to C Calls]
It was confirmed that the time from the start of one 8-bit access to the start of the next improved significantly, from 1.75 µs to 360 ns. Although exactly the same C function as in the C++ version is used in the TinyGo environment, a slight difference remains (360 ns versus 310 ns). This may be because the C++ version is compiled with arm-none-eabi-gcc while TinyGo uses LLVM/Clang, but since very close speeds were achieved, I did not investigate this further.
Summary
By converting parts of TinyGo's machine package to Cgo, C/C++ level access speeds were achieved. The Cgo functionality can be said to be very simple and easy to use, despite the need for copying before calls. On the other hand, the original intent behind writing the entire machine package in TinyGo is presumably to preserve the advantages of the Go language, so whether the Cgo approach for performance improvement is a viable option is a matter of debate. For example, for data transfer parts like this, methods that maintain Go code while reducing overhead by leveraging DMA rather than frequent direct register access should perhaps also be considered.



