Using SD Cards with TinyGo (Speed Improvement Edition)

Saturday, May 7, 2022

Raspberry Pi Pico TinyGo


In the earlier post "Using SD Cards with Raspberry Pi Pico and TinyGo (FatFs)", SD card access with TinyGo was considerably slower than with C++, especially for reads, so I investigated the cause and attempted improvements.

Causes of Speed Degradation

The following projects were used for comparison: pico_fatfs_test (C++) and pico_tinygo_fatfs_test (TinyGo).

Logic Analyzer Waveform Comparison

First, the SPI waveform during read access (512 byte transfer) was monitored with a logic analyzer.

SPI Waveform with TinyGo
SPI Waveform with C++

While the SPI SCK frequency is 125 MHz / 4 = 31.25 MHz in both cases, a very large difference was confirmed in the time from starting one 8-bit access to starting the next: 310ns for C++ versus 1.75us for TinyGo.

Code Comparison

The code on both sides corresponding to the waveform section confirmed with the logic analyzer was examined. Both are code on the SDK side, not in the projects.

tinygo/src/machine/machine_rp2040_spi.go

func (spi SPI) rx(rx []byte, txrepeat byte) error {
	var deadline = ticks() + _SPITimeout
	plen := len(rx)
	const fifoDepth = 8 // see txrx
	var rxleft, txleft = plen, plen
	for txleft != 0 || rxleft != 0 {
		if txleft != 0 && spi.isWritable() && rxleft < txleft+fifoDepth {
			spi.Bus.SSPDR.Set(uint32(txrepeat))
			txleft--
		}
		if rxleft != 0 && spi.isReadable() {
			rx[plen-rxleft] = uint8(spi.Bus.SSPDR.Get())
			rxleft--
			continue // if reading successfully in rx there is no need to check deadline.
		}
		if ticks() > deadline {
			return ErrSPITimeout
		}
	}
	return nil
}

pico-sdk/src/rp2_common/hardware_spi/spi.c

int __not_in_flash_func(spi_read_blocking)(spi_inst_t *spi, uint8_t repeated_tx_data, uint8_t *dst, size_t len) {
    invalid_params_if(SPI, 0 > (int)len);
    const size_t fifo_depth = 8;
    size_t rx_remaining = len, tx_remaining = len;
 
    while (rx_remaining || tx_remaining) {
        if (tx_remaining && spi_is_writable(spi) && rx_remaining < tx_remaining + fifo_depth) {
            spi_get_hw(spi)->dr = (uint32_t) repeated_tx_data;
            --tx_remaining;
        }
        if (rx_remaining && spi_is_readable(spi)) {
            *dst++ = (uint8_t) spi_get_hw(spi)->dr;
            --rx_remaining;
        }
    }
 
    return (int)len;
}

In summary, the findings are as follows.

  • In TinyGo, everything down to the lowest layer of the machine package, including microcontroller register access, is written in (Tiny)Go.
  • The TinyGo rx() function is a direct port of the spi.c spi_read_blocking() function, aside from the addition of timeout error handling and differences in return values.

Since the processing content itself is essentially identical between TinyGo and C, the cause appears to be the overhead from register access via TinyGo.
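The loop shared by both implementations can be tried out on a PC with a small simulation. The sketch below is only an illustration: fakeSPI is a hypothetical stand-in for the SPI FIFOs (it is not TinyGo or pico-sdk code), and the period parameter is an assumed knob for how often the pretend bus shifts a byte. It shows that the condition rxleft < txleft+fifoDepth keeps the number of bytes written but not yet read at or below the FIFO depth, so the RX FIFO cannot overrun.

```go
package main

import "fmt"

const fifoDepth = 8

// fakeSPI is a hypothetical host-side model of the TX and RX FIFOs.
type fakeSPI struct {
	tx, rx  []byte
	next    byte // next byte the pretend SD card answers with
	overrun bool // set if a byte arrives while the RX FIFO is full
}

func (s *fakeSPI) isWritable() bool { return len(s.tx) < fifoDepth }
func (s *fakeSPI) isReadable() bool { return len(s.rx) > 0 }
func (s *fakeSPI) write(b byte)     { s.tx = append(s.tx, b) }
func (s *fakeSPI) read() byte       { b := s.rx[0]; s.rx = s.rx[1:]; return b }

// clock shifts one byte out of the TX FIFO and one answer into the RX FIFO.
func (s *fakeSPI) clock() {
	if len(s.tx) == 0 {
		return
	}
	s.tx = s.tx[1:]
	if len(s.rx) == fifoDepth {
		s.overrun = true
		return
	}
	s.rx = append(s.rx, s.next)
	s.next++
}

// readBlocking mirrors the structure of rx() and spi_read_blocking().
// The pretend bus shifts one byte every period iterations (period >= 1).
// It returns the largest number of bytes written but not yet read.
func readBlocking(s *fakeSPI, dst []byte, txrepeat byte, period int) int {
	rxleft, txleft := len(dst), len(dst)
	maxInFlight := 0
	for iter := 1; txleft != 0 || rxleft != 0; iter++ {
		if txleft != 0 && s.isWritable() && rxleft < txleft+fifoDepth {
			s.write(txrepeat)
			txleft--
		}
		if inFlight := rxleft - txleft; inFlight > maxInFlight {
			maxInFlight = inFlight
		}
		if iter%period == 0 {
			s.clock() // the real peripheral does this on its own
		}
		if rxleft != 0 && s.isReadable() {
			dst[len(dst)-rxleft] = s.read()
			rxleft--
		}
	}
	return maxInFlight
}

func main() {
	s := &fakeSPI{}
	dst := make([]byte, 512)
	maxInFlight := readBlocking(s, dst, 0xFF, 4)
	fmt.Println("max bytes in flight:", maxInFlight, "overrun:", s.overrun)
}
```

The simulation says nothing about timing on the real hardware; it only makes the bookkeeping of the two counters easier to follow.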

Speed Improvement Modifications

machine Package

With the target functions identified, the plan was to call C functions for the performance-critical parts of machine.SPI within the machine package. For normal packages, placing a modified copy locally and importing it would allow replacement without renaming the existing package, but this approach did not work for the machine package. The reason is that machine is not strictly a package but a TinyGo library. Therefore, a mymachine.SPI type was defined by embedding machine.SPI, and the functions to be replaced were defined with mymachine.SPI as the receiver.
Additionally, since the sdcard package (tinygo.org/x/drivers/sdcard) takes a machine.SPI and had to take a mymachine.SPI instead, it was copied locally and modified.


pico_tinygo_fatfs_test/mymachine/machine_rp2040_spi.go

// +build rp2040
 
package mymachine
 
// #include "./spi.h"
import "C"
import (
	"machine"
	"device/rp"
	"errors"
	"unsafe"
)
 
type SPI struct {
	*machine.SPI
}
...
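One general Go property is worth keeping in mind with this embedding approach: embedding promotes the methods of the embedded type, but it does not give virtual dispatch. The sketch below uses hypothetical stand-in types (baseSPI for machine.SPI, fastSPI and partialSPI for wrappers) and is not the elided code above. It shows that redefining only rx() on the wrapper is not enough if callers go through a promoted method such as Tx(), because the promoted method still calls the original rx().

```go
package main

import "fmt"

// baseSPI is a hypothetical stand-in for machine.SPI.
type baseSPI struct{ name string }

func (s *baseSPI) rx() string { return "go rx on " + s.name }

// Tx calls rx internally, as a driver-facing method would.
func (s *baseSPI) Tx() string { return s.rx() }

// Configure is untouched by the wrappers and is simply promoted.
func (s *baseSPI) Configure() string { return "configured " + s.name }

// partialSPI redefines only rx.
type partialSPI struct{ *baseSPI }

func (s partialSPI) rx() string { return "c rx on " + s.name }

// fastSPI redefines rx and also the Tx method that calls it.
type fastSPI struct{ *baseSPI }

func (s fastSPI) rx() string { return "c rx on " + s.name }
func (s fastSPI) Tx() string { return s.rx() }

func main() {
	b := &baseSPI{name: "SPI0"}
	fmt.Println(partialSPI{b}.Tx())     // promoted baseSPI.Tx, still uses baseSPI.rx
	fmt.Println(fastSPI{b}.Tx())        // fastSPI.Tx, uses fastSPI.rx
	fmt.Println(fastSPI{b}.Configure()) // promoted unchanged
}
```

In other words, every method on the call path between the driver and the replaced function has to be defined on the wrapper type as well, while everything else can be left to promotion.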

pico_tinygo_fatfs_test/sdcard/sdcard.go

package sdcard
 
import (
	"fmt"
	"machine"
	"time"
 
	"pico_tinygo_fatfs_test/mymachine"
)
...
func New(b mymachine.SPI, sck, sdo, sdi, cs machine.Pin) Device { // Replace machine.SPI with mymachine.SPI. (Keep machine for others)
	return Device{
		bus:        b,
		cs:         cs,
		sck:        sck,
		sdo:        sdo,
		sdi:        sdi,
		cmdbuf:     make([]byte, 6),
		dummybuf:   make([]byte, 512),
		tokenbuf:   make([]byte, 1),
		sdCardType: 0,
	}
}
...

In main.go, the imports of the modified packages were switched to the local copies, and the type of the spi variable was changed from machine.SPI to mymachine.SPI.


pico_tinygo_fatfs_test/main.go

package main
 
import (
	"fmt"
	"machine"
	"time"
	"os"
 
	//"tinygo.org/x/drivers/sdcard"
	"pico_tinygo_fatfs_test/sdcard"
	//"tinygo.org/x/tinyfs/fatfs"
	"pico_tinygo_fatfs_test/fatfs"
	"pico_tinygo_fatfs_test/mymachine"
)
 
var (
	spi    mymachine.SPI
	sckPin machine.Pin
	sdoPin machine.Pin
	sdiPin machine.Pin
	csPin  machine.Pin
	ledPin machine.Pin
 
	serial  = machine.Serial
)
...

C Call from rx() Function

The rx() function is rewritten to call the C-side spi_read_blocking() function. The first argument of spi_read_blocking(), spi_inst, must point to the base address of the corresponding SPI device's register block; since SSPCR0 is the first register in the block, its address is used. The other arguments are simply cast to the matching C-side types. Note that the timeout handling was removed, so the call blocks until all bytes have been received.


pico_tinygo_fatfs_test/mymachine/machine_rp2040_spi.go

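func (spi SPI) rx(rx []byte, txrepeat byte) error {
	// Nothing to read; this also avoids the index panic from &rx[0] below.
	if len(rx) == 0 {
		return nil
	}
	// SSPCR0 is the first register, so its address is the block's base address.
	spi_inst := (*C.spi_inst_t)(unsafe.Pointer(&spi.Bus.SSPCR0.Reg))
	repeated_tx_data := C.uint8_t(txrepeat)
	dst := (*C.uint8_t)(unsafe.Pointer(&rx[0]))
	plen := C.size_t(len(rx))
	// Blocks until all bytes are received; unlike the Go version there is no timeout.
	C.spi_read_blocking(spi_inst, repeated_tx_data, dst, plen)
	return nil
}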

As for the C source, pico-sdk/src/rp2_common/hardware_spi/spi.c was placed under pico_tinygo_fatfs_test/mymachine in as close to its original form as possible. In addition, several header files from pico-sdk were placed under mymachine/hardware so that spi.c and spi.h compile.


pico_tinygo_fatfs_test/mymachine/spi.c

int spi_read_blocking(spi_inst_t *spi, uint8_t repeated_tx_data, uint8_t *dst, size_t len) {
    invalid_params_if(SPI, 0 > (int)len);
    const size_t fifo_depth = 8;
    size_t rx_remaining = len, tx_remaining = len;
 
    while (rx_remaining || tx_remaining) {
        if (tx_remaining && spi_is_writable(spi) && rx_remaining < tx_remaining + fifo_depth) {
            spi_get_hw(spi)->dr = (uint32_t) repeated_tx_data;
            --tx_remaining;
        }
        if (rx_remaining && spi_is_readable(spi)) {
            *dst++ = (uint8_t) spi_get_hw(spi)->dr;
            --rx_remaining;
        }
    }
 
    return (int)len;
}

Confirming Speed Improvement

After changing the internals of the rx() function to a C call, the benchmark and logic analyzer waveform were re-examined.

Benchmark

  • pico_fatfs_test (C++)

    =====================
    == pico_fatfs_test ==
    =====================
    mount ok
    Type is FAT32
    Card size:   32.00 GB (GB = 1E9 bytes)
     
    FILE_SIZE_MB = 5
    BUF_SIZE = 512 bytes
    Starting write test, please wait.
     
    write speed and latency
    speed,max,min,avg
    KB/Sec,usec,usec,usec
    447.7192, 6896, 1007, 1142
    446.4797, 7589, 1024, 1145
     
    Starting read test, please wait.
     
    read speed and latency
    speed,max,min,avg
    KB/Sec,usec,usec,usec
    974.9766, 1050, 403, 524
    974.4066, 1049, 402, 524
        

  • pico_tinygo_fatfs_test (TinyGo after changing machine.SPI to C calls)

    ============================
    == pico_tinygo_fatfs_test ==
    ============================
    mount ok
    Type is FAT32
    Card size:   32.00 GB (GB = 1E9 bytes)
     
    FILE_SIZE_MB = 5
    BUF_SIZE = 512 bytes
    Starting write test, please wait.
     
    write speed and latency
    speed,max,min,avg
    KB/Sec,usec,usec,usec
    390.2342, 17208, 1092, 1292
    362.8215, 54979, 897, 1383
     
    Starting read test, please wait.
     
    read speed and latency
    speed,max,min,avg
    KB/Sec,usec,usec,usec
    835.5080, 16601, 554, 591
    830.9257, 21483, 559, 594
        


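The speed was significantly improved, achieving performance close to C++: read throughput reached about 830 KB/s against about 975 KB/s for C++ (roughly 85%), and write throughput about 360 to 390 KB/s against about 447 KB/s (roughly 81 to 87%). The maximum latencies, however, remain noticeably larger with TinyGo (up to about 55 ms for writes, against about 7.6 ms for C++).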
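For reference, the comparison between the two logs can be reproduced with a few lines of Go. This is only a convenience sketch; percentOf is a made-up helper, and the figures are the first-run values copied from the benchmark output above.

```go
package main

import "fmt"

// percentOf returns a as a percentage of b.
func percentOf(a, b float64) float64 { return a / b * 100 }

func main() {
	// First-run throughput figures from the logs above, in KB/Sec.
	const (
		cppWrite, goWrite = 447.7192, 390.2342
		cppRead, goRead   = 974.9766, 835.5080
	)
	fmt.Printf("write: %.1f%% of C++\n", percentOf(goWrite, cppWrite))
	fmt.Printf("read:  %.1f%% of C++\n", percentOf(goRead, cppRead))
}
```

For the first runs this comes out at roughly 87% for writes and 86% for reads.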

Logic Analyzer Waveform

TinyGo After Changing machine.SPI to C Calls

The time from starting one 8-bit access to starting the next improved significantly, from 1.75us to 360ns. Although the TinyGo environment now uses exactly the same C function as the C++ version, a slight difference remains (360ns versus 310ns). This may be because the C++ version is compiled with arm-none-eabi-gcc while TinyGo uses LLVM/Clang, but since the speeds are very close, I will leave it at that for now.


Summary

By converting parts of TinyGo's machine package to Cgo, access speeds on par with C/C++ were achieved. Cgo itself is very simple and easy to use, despite the need for copying before calls. On the other hand, the original intent behind writing the entire machine package in TinyGo is presumably to preserve the advantages of the Go language, so whether Cgo is a viable route to better performance is open to debate. For data transfer parts like this one, approaches that keep the code in Go and reduce overhead by using DMA instead of frequent direct register access should perhaps also be considered.
