Lock-Free Ring Buffers for ISR-to-Task Communication

By embeddedSoft

Published in Embedded C/C++

May 27, 2026

4 min read

Lock-Free Ring Buffers for ISR-to-Task Communication

Introduction

Why Lock-Free? The Cost of Critical Sections

The SPSC Ring Buffer Invariant

The Hidden Danger: Memory Reordering

Complete Implementation

DMB vs DSB vs ISB: When to Use Which

The Volatile Myth

When You Need More Than SPSC

Testing and Verification

Conclusion

References

Introduction

In virtually every embedded system, interrupt service routines (ISRs) need to pass data to the background main loop or to an RTOS task. The UART receive interrupt collects bytes and hands them off. A timer ISR samples a sensor reading for later processing. A DMA complete interrupt signals that a block of data is ready. The fundamental challenge is the same: how do you safely share data between an asynchronous interrupt handler and synchronous application code without corrupting state?

The obvious answer—disable interrupts around shared data access—works, but it introduces jitter into your interrupt response and scales poorly as system complexity grows. A more elegant solution is the single-producer single-consumer (SPSC) lock-free ring buffer, a data structure that requires no mutexes, no critical sections, and no interrupt disabling. The only hidden ingredient is correct use of memory barriers—and getting them right is where most embedded developers stumble.

This article walks through the complete design and implementation of a lock-free SPSC ring buffer for ISR-to-task communication on ARM Cortex-M processors, explains when and why memory barriers are needed, and identifies the most common pitfalls that cause silent data corruption in production firmware.

Why Lock-Free? The Cost of Critical Sections

Consider a typical UART receive driver. The UART RX interrupt fires every time a byte arrives. The ISR must store the byte somewhere so the main loop can process it later. The naive approach wraps the buffer access in a critical section:

// Naive: disabling interrupts around shared data
void UART4_IRQHandler(void) {
    uint8_t byte = UART4->RXDR;
    __disable_irq();           // CRITICAL SECTION START
    rx_buffer[head] = byte;
    head = (head + 1) & MASK;
    __enable_irq();            // CRITICAL SECTION END
}

This works, but every byte received adds interrupt latency for the entire system. If your UART runs at 1 Mbps (one byte every 8 µs), the processor spends a meaningful fraction of its time with interrupts disabled. Worse, if the main loop also disables interrupts for other purposes, you create nested critical sections with unpredictable worst-case latency.

A lock-free ring buffer eliminates this entirely. The ISR never disables interrupts. The main loop never disables interrupts. The two contexts coordinate through a carefully designed protocol typically requiring one barrier on the producer side.

The SPSC Ring Buffer Invariant

The ring buffer uses a fixed-size array and two indices: head (written only by the producer) and tail (written only by the consumer). The key insight is that with only one writer per index, no read-modify-write protocol is needed for the indices themselves—each context has exclusive ownership of its index.

The invariant that makes this work:

The producer writes data into buf[head], then advances head.
The consumer reads data from buf[tail], then advances tail.
Each index variable is written by exactly one context and read by the other.

Because there is no concurrent writer for any single variable, we do not need atomic compare-and-swap operations or hardware locks. The only requirement is that the memory access order visible to the other context matches the program order.

The Hidden Danger: Memory Reordering

While many Cortex-M systems behave as strongly ordered for normal memory, write buffers, caches (e.g., Cortex-M7), and bus interactions can still affect memory visibility ordering. This means the hardware and compiler may reorder memory accesses for performance. Consider the producer’s code:

buf[head] = byte;   // Store the data byte
head = next_head;  // Publish by advancing head

Without a barrier, the processor could reorder these stores—publishing the new head value before the data byte is actually written to memory. The consumer sees head advance, reads buf[old_head], and gets stale or zero data. This is not a theoretical concern; it happens in practice on Cortex-M7 and any processor with a write buffer or store forwarding.

The fix is a Data Memory Barrier (__DMB()), a CMSIS intrinsic that compiles to the ARM DMB instruction:

rb->buf[head] = byte;   // Store the data byte
__DMB();                // Barrier: ensure store completes before next store
rb->head = next_head;   // Publish: consumer can now see the new byte

The DMB instruction guarantees that all explicit memory accesses before the barrier are observed before any explicit memory accesses after the barrier. It does not stall the pipeline until writes reach external memory—it only enforces ordering, making it far cheaper than a DSB (Data Synchronization Barrier).

Complete Implementation

Here is a production-quality SPSC ring buffer implementation for a UART receive path on ARM Cortex-M:

#include <stdint.h>
#include <stdbool.h>

// CMSIS intrinsics provide __DMB, __DSB, __ISB on Cortex-M
// For non-ARM targets, use compiler barriers or stdatomic
#if defined(__ARM_ARCH)
#include "cmsis_gcc.h"   // or cmsis_armcc.h / cmsis_compiler.h"
#else
#define __DMB() __asm volatile("" ::: "memory")
#endif

#define RING_BUF_SIZE   256U   /* Must be power of 2 */
#define RING_BUF_MASK   (RING_BUF_SIZE - 1U)

typedef struct {
    uint8_t         buf[RING_BUF_SIZE];
    volatile uint32_t head;   /* Written by producer (ISR) only */
    volatile uint32_t tail;   /* Written by consumer (main) only */
} RingBuf_t;

/**
 * Write one byte to the ring buffer (producer side).
 * Called from ISR context. Never disables interrupts.
 * Returns true on success, false if buffer is full.
 */
bool ring_buf_write(RingBuf_t *rb, uint8_t byte) {
    uint32_t head = rb->head;
    uint32_t next = (head + 1U) & RING_BUF_MASK;

    /* Full check: next == tail means no room (usable capacity is SIZE-1) */
    if (next == rb->tail) {
        return false;
    }

    rb->buf[head] = byte;     /* Store data */
    __DMB();                  /* Ensure data is visible before head advances */
    rb->head = next;          /* Publish */
    return true;
}

/**
 * Read one byte from the ring buffer (consumer side).
 * Called from main loop context.
 * Returns true on success, false if buffer is empty.
 */
bool ring_buf_read(RingBuf_t *rb, uint8_t *out) {
    uint32_t tail = rb->tail;

    /* Empty check: tail == head means nothing to read */
    if (tail == rb->head) {
        return false;
    }

    __DMB();                  /* Ensure head read completes before buf access */
    *out = rb->buf[tail];     /* Read data */
    rb->tail = (tail + 1U) & RING_BUF_MASK;
    return true;
}

/* ---- UART RX ISR ---- */
static RingBuf_t g_uart_rx = { .head = 0, .tail = 0 };

void UART4_IRQHandler(void) {
    uint8_t byte = (uint8_t)(UART4->RXDR & 0xFFU);
    ring_buf_write(&g_uart_rx, byte);
    UART4->ICR = USART_ICR_RXNECF;  /* Clear overrun flag if needed */
}

/* ---- Main loop ---- */
int main(void) {
    system_init();
    uart_init();
    for (;;) {
        uint8_t byte;
        while (ring_buf_read(&g_uart_rx, &byte)) {
            process_received_byte(byte);
        }
        enter_low_power_mode();
    }
}

DMB vs DSB vs ISB: When to Use Which

ARM provides three barrier instructions, and using the wrong one can cause subtle bugs or unnecessary performance overhead:

Barrier	Guarantees	Use Case
DMB (Data Memory Barrier)	Ordering of explicit memory accesses	Lock-free data structures, producer-consumer protocols
DSB (Data Synchronization Barrier)	DMB + completion of all prior memory accesses	Before `WFI`/`WFE`, before sleep, after configuring MPU
ISB (Instruction Synchronization Barrier)	DSB + flush pipeline, refetch all subsequent instructions	After modifying vector tables, after self-modifying code, after MPU updates

Rule of thumb: Use DMB for lock-free ring buffers. Use DSB before entering low-power sleep modes to ensure all writes reach the memory system. Use ISB only when changing the instruction execution context itself.

The consumer-side DMB in ring_buf_read is a conservative safety measure. It ensures the load of rb->head is completed before the load of rb->buf[tail]. Without it, the architecture does not guarantee ordering between independent loads, so buf[tail] could be read before the latest head value is fully observed. On simpler Cortex-M systems it is often not strictly required, but improves portability and safety.

The Volatile Myth

A common misconception is that declaring shared variables as volatile is sufficient for ISR safety. volatile prevents the compiler from caching a register across sequence points and ensures each access generates a load/store instruction. However, volatile does not prevent the hardware from reordering memory accesses.

volatile uint32_t head;  // Compiler won't optimize away loads/stores,
                         // BUT the CPU may still reorder them!

Always combine volatile with appropriate memory barriers. volatile handles the compiler; barriers handle the hardware.

When You Need More Than SPSC

The lock-free pattern described here is safe only when there is exactly one writer per index variable. If you have:

Multiple ISRs writing to the same buffer (multi-producer): Use atomic compare-and-swap (LDREX/STREX on ARM) or fall back to critical sections.
Multiple tasks reading from the same buffer (multi-consumer): Same as above—atomic operations or mutexes.
Single producer, single consumer: SPSC ring buffer with DMB barriers—no atomics needed.

For MPMC (multiple producer, multiple consumer) scenarios, ARMv8-A provides LSE atomic instructions (LDADD, LDEOR, etc.) that offer better scalability than exclusive monitor-based loops.

Testing and Verification

Lock-free algorithms are notoriously difficult to test because failures are timing-dependent. Strategies that help:

Stress test at maximum interrupt rate: Feed data as fast as the peripheral allows; any race condition will eventually manifest.
Enable all compiler optimizations: Race conditions often only appear at -O2 or higher, where the compiler reorders and optimizes aggressively.
Use a logic analyzer: Verify that no bytes are dropped or duplicated under sustained load.
Check the DSB before WFI: If your main loop enters sleep after draining the buffer, add a DSB before the WFI instruction to ensure all memory writes are visible before the processor suspends execution.

Conclusion

The lock-free SPSC ring buffer is one of the most useful patterns in embedded systems programming. It decouples interrupt processing from application logic with zero interrupt latency overhead and zero risk of deadlock. The implementation is small—under 30 lines of C—but it depends on a correct understanding of memory ordering on modern processors.

The key takeaways: declare shared indices as volatile to prevent compiler reordering, insert __DMB() barriers to prevent hardware reordering, enforce single-writer ownership of each index variable, and resist the temptation to add more synchronization than the pattern requires. When applied correctly, this technique scales to any number of producer-consumer pairs in your firmware, from UART drivers to DMA completion handlers to custom event queues.