Atomic Operations in Embedded C: Lock-Free Synchronization for Cortex-M

By Jithin Tom

Published in Embedded C/C++

July 03, 2026

Atomic operations are the foundation of lock-free programming in embedded systems. While mutexes and semaphores provide simple mutual exclusion, they carry overhead: context switches, priority inversion risk, and memory footprint. For high-frequency interrupt-to-task communication or multi-core synchronization, atomic operations offer deterministic, lock-free alternatives, provided you understand the memory model and the hardware primitives.

The Cortex-M Atomic Hardware Primitives

ARM Cortex-M processors (M3/M4/M7/M33) implement the ARMv7-M or ARMv8-M architecture, both of which provide exclusive access instructions. ARMv6-M cores such as Cortex-M0/M0+ do not have them.

+------------------------------------------------------------------+
|              Cortex-M Exclusive Access Instructions              |
+------------------------------------------------------------------+
|                                                                  |
|  LDREX   Rd, [Rn]      Load Register Exclusive (8/16/32-bit)     |
|  STREX   Rd, Rt, [Rn]  Store Register Exclusive (8/16/32-bit)    |
|  CLREX                 Clear Exclusive Monitor                   |
|                                                                  |
|  LDREX/STREX work as a pair: LDREX tags the address, STREX       |
|  succeeds only if no other context modified the address since    |
|  the LDREX. Returns 0 on success, 1 on failure (retry loop).     |
+------------------------------------------------------------------+

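These instructions form the hardware basis for compare-and-swap (CAS), atomic increment, and other read-modify-write operations. The exclusive monitor is per-core. How much memory it tags (the exclusives reservation granule) is implementation defined, so check the core's technical reference manual rather than assuming a size. An exception taken between LDREX and STREX clears the local monitor, so the interrupted sequence fails its STREX and retries.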
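
The retry pattern can be sketched in portable C11. This is a minimal sketch rather than the article's own code: atomic_increment_cas is a hypothetical helper name, and the mapping of a weak compare-exchange onto an LDREX/STREX loop is what compilers targeting ARMv7-M typically emit, not something the C standard promises.

```c
#include <stdatomic.h>
#include <stdint.h>

// Hypothetical helper: increments *value with a compare-and-swap retry loop.
// The weak CAS may fail spuriously, much as STREX can fail when the
// exclusive monitor was cleared, so it always sits inside a loop.
uint32_t atomic_increment_cas(_Atomic uint32_t *value) {
    uint32_t observed = atomic_load_explicit(value, memory_order_relaxed);
    // On failure, 'observed' is refreshed with the current contents of *value.
    while (!atomic_compare_exchange_weak_explicit(
               value, &observed, observed + 1u,
               memory_order_relaxed, memory_order_relaxed)) {
        // Retry with the refreshed value.
    }
    return observed + 1u;  // the value this call stored
}
```

In practice atomic_fetch_add does the same job in one call. The explicit loop is mainly useful when the new value is computed from the old one in a way no built-in operation covers.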

Natural Atomicity

On Cortex-M, aligned 32-bit loads and stores are naturally atomic — they cannot be interrupted mid-transfer. This means:

volatile uint32_t shared_flag = 0;

// Thread A
shared_flag = 1;        // Atomic store (single STR instruction)

// Thread B
if (shared_flag == 1) { /* react to the flag */ }  // Atomic load (single LDR instruction)

However, read-modify-write sequences are NOT atomic:

shared_counter++;       // NOT atomic! Compiles to LDR, ADD, STR
                        // Interrupt between LDR and STR loses updates

For RMW operations, you need LDREX/STREX or C11 atomics.
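
The lost update can be replayed deterministically on any host by writing the three steps out by hand. The sketch below is only a simulation under stated assumptions: fake_isr is a hypothetical stand-in for a real interrupt handler, and it is called explicitly at the point where the interrupt would fire.

```c
#include <stdint.h>

static volatile uint32_t shared_counter;

// Stand-in for an interrupt handler that also increments the counter.
static void fake_isr(void) {
    shared_counter = shared_counter + 1u;
}

// Replays shared_counter++ as its three separate steps, with the "interrupt"
// firing between the load and the store.
uint32_t lost_update_demo(void) {
    shared_counter = 0;
    uint32_t tmp = shared_counter;  // LDR: reads 0
    fake_isr();                     // ISR increments the counter to 1
    tmp = tmp + 1u;                 // ADD: computes 1 from the stale value
    shared_counter = tmp;           // STR: overwrites the ISR's update
    return shared_counter;          // 1, although two increments ran
}
```

Two increments ran but the counter ends at 1. Note that volatile did nothing to prevent it.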

C11 Atomics in Embedded C

The C11 _Atomic type qualifier and <stdatomic.h> provide portable atomic operations. Modern embedded toolchains (GCC 10+, Clang 12+, IAR 9+) support them for Cortex-M.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

// Atomic flag - lock-free on Cortex-M
atomic_flag lock = ATOMIC_FLAG_INIT;

// Spinlock using test-and-set
// WARNING: Avoid spinlocks on single-core systems without disabling interrupts,
// as a high-priority thread will spin forever if the lock holder is preempted.
void lock_acquire(atomic_flag *lock) {
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire)) {
        // Spin - in embedded, consider __WFE() for low-power wait
        __WFE();
    }
}

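// __WFE(), __SEV() and __DSB() are CMSIS-Core intrinsics, normally pulled in
// through your device header.
void lock_release(atomic_flag *lock) {
    atomic_flag_clear_explicit(lock, memory_order_release);
    __DSB();  // Make sure the cleared flag is visible before signalling
    __SEV();  // Wake waiting cores
}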
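
Where spinning is not acceptable, for example in an ISR on a single-core part, a single non-blocking attempt is a common alternative. The following is a minimal sketch under that assumption. lock_try_acquire and lock_release_plain are hypothetical names, and the WFE/SEV handling is left out so the block stays portable.

```c
#include <stdatomic.h>
#include <stdbool.h>

// Hypothetical non-blocking variant: one attempt, no spinning.
bool lock_try_acquire(atomic_flag *flag) {
    // test_and_set returns the previous state: false means we took the lock.
    return !atomic_flag_test_and_set_explicit(flag, memory_order_acquire);
}

// Release without the SEV wake-up used by the spinning version.
void lock_release_plain(atomic_flag *flag) {
    atomic_flag_clear_explicit(flag, memory_order_release);
}
```

A caller that gets false back can defer the work instead of waiting on a lock holder it has itself preempted.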

Atomic Integer Types

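atomic_int counter = 0;
atomic_uintptr_t shared_ptr = 0;

// Atomic increment (fetch_add returns the old value).
// bump_counter is an illustrative wrapper name.
int bump_counter(void) {
    return atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
}

// Compare-and-swap (CAS) - returns true on success.
// On failure, 'expected' is overwritten with the value actually found.
// swap_ptr is an illustrative wrapper name.
bool swap_ptr(uintptr_t old_ptr, uintptr_t new_ptr) {
    uintptr_t expected = old_ptr;
    return atomic_compare_exchange_strong_explicit(
        &shared_ptr, &expected, new_ptr,
        memory_order_acq_rel, memory_order_relaxed
    );
}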

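Critical: Always check the ATOMIC_INT_LOCK_FREE macro. It is 2 when the type is always lock-free, 1 when it is sometimes lock-free, and 0 when it never is. On Cortex-M3 and later cores, 32-bit atomics are lock-free (value 2). 64-bit atomics (check ATOMIC_LLONG_LOCK_FREE) are typically not and may fall back to library routines; avoid them in ISRs.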
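
One way to enforce the check is to fail the build if the guarantee is missing. This is a minimal sketch, and llong_atomics_isr_safe is a hypothetical helper name. The macros themselves are standard C11 and defined in <stdatomic.h>.

```c
#include <stdatomic.h>
#include <stdbool.h>

// Refuse to build if plain int atomics would need a lock.
_Static_assert(ATOMIC_INT_LOCK_FREE == 2,
               "atomic_int must always be lock-free on this target");

// Hypothetical helper: reports whether 64-bit atomics are always lock-free
// and therefore usable from an ISR.
// Macro values: 2 = always lock-free, 1 = sometimes, 0 = never.
bool llong_atomics_isr_safe(void) {
    return ATOMIC_LLONG_LOCK_FREE == 2;
}
```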

Memory Ordering: The Subtle Bug Source

The hardest part of atomics is memory ordering. While Cortex-M enforces strict access ordering for Device and Strongly-ordered memory (e.g., peripherals), Normal memory (SRAM) is weakly ordered, allowing compiler and hardware reordering.

+------------------------------------------------------------------+
|              C11 Memory Order Quick Reference                    |
+------------------------------------------------------------------+
|                                                                  |
|  memory_order_relaxed    No ordering, atomicity only             |
|  memory_order_consume    Load ordering (deprecated, avoid)       |
|  memory_order_acquire    Load: subsequent ops don't move before  |
|  memory_order_release    Store: prior ops don't move after       |
|  memory_order_acq_rel    RMW: acquire + release                  |
|  memory_order_seq_cst    Full barrier, global total order        |
|                                                                  |
|  ARM mapping (typical 32-bit):                                   |
|    acquire load  -> LDR + DMB                                    |
|    release store -> DMB + STR                                    |
|    RMW operation -> LDREX + ... + STREX                          |
+------------------------------------------------------------------+

Practical Example: Lock-Free Ring Buffer

#define RB_SIZE 256
typedef struct {
    // Note: Align elements to cache line size (e.g. 32 bytes on Cortex-M7)
    // to prevent cache line thrashing (false sharing) between cores.
    _Alignas(32) atomic_uint head; // Producer index
    _Alignas(32) atomic_uint tail; // Consumer index
    _Alignas(32) uint8_t     buf[RB_SIZE];
} ringbuf_t;

_Static_assert((RB_SIZE & (RB_SIZE - 1)) == 0, "RB_SIZE must be a power of 2");

bool rb_push(ringbuf_t *rb, uint8_t data) {
    uint32_t head = atomic_load_explicit(&rb->head, memory_order_relaxed);
    uint32_t next = (head + 1) & (RB_SIZE - 1); // Fast modulo for power-of-2
    
    // Check full - need acquire to see consumer's tail update
    if (next == atomic_load_explicit(&rb->tail, memory_order_acquire))
        return false;  // Full
    
    rb->buf[head] = data;
    
    // Release: ensure data write visible before head update
    atomic_store_explicit(&rb->head, next, memory_order_release);
    return true;
}

bool rb_pop(ringbuf_t *rb, uint8_t *data) {
    uint32_t tail = atomic_load_explicit(&rb->tail, memory_order_relaxed);
    
    // Check empty - need acquire to see producer's head update
    if (tail == atomic_load_explicit(&rb->head, memory_order_acquire))
        return false;  // Empty
    
    *data = rb->buf[tail];
    uint32_t next = (tail + 1) & (RB_SIZE - 1); // Fast modulo for power-of-2
    
    // Release: ensure data read complete before tail update
    atomic_store_explicit(&rb->tail, next, memory_order_release);
    return true;
}

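This single-producer, single-consumer (SPSC) ring buffer is wait-free: each operation completes in a bounded number of steps with no retry loop. The producer’s release store to head pairs with the consumer’s acquire load of head, so the data byte is visible before the consumer can observe the new index. The consumer’s release store to tail likewise guarantees the slot has been read before the producer can reuse it. Because a full buffer is detected as next == tail, one slot always stays empty and the usable capacity is RB_SIZE - 1 bytes.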
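
The index arithmetic also gives the fill level. The helpers below are a sketch with hypothetical names (rb_used, rb_free). They take snapshot values of head and tail, so on a live buffer the result is only a momentary estimate.

```c
#include <stdint.h>

#define RB_SIZE 256u  // must be a power of two, matching the buffer above

// Hypothetical helper: number of stored bytes for the given index snapshot.
// Unsigned wraparound plus the mask handles head having wrapped past tail.
uint32_t rb_used(uint32_t head, uint32_t tail) {
    return (head - tail) & (RB_SIZE - 1u);
}

// Hypothetical helper: free slots. One slot always stays empty so that
// head == tail can only mean "empty".
uint32_t rb_free(uint32_t head, uint32_t tail) {
    return (RB_SIZE - 1u) - rb_used(head, tail);
}
```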

Memory Barriers for DMA and Peripherals

Atomics synchronize between CPU threads/cores. For CPU↔DMA or CPU↔peripheral sharing, you need hardware memory barriers:

// DMA buffer descriptor shared with peripheral
typedef struct {
    volatile uint32_t src_addr;
    volatile uint32_t dst_addr;
    volatile uint32_t length;
    volatile uint32_t control;
    // Pad to 32-byte cache line size to prevent cache corruption 
    // of adjacent variables during cache invalidation.
    uint32_t          reserved[4]; 
} dma_desc_t;

dma_desc_t desc __attribute__((aligned(32)));  // Cache line aligned and padded

void start_dma_transfer(uint32_t src, uint32_t dst, uint32_t len) {
    desc.src_addr = src;
    desc.dst_addr = dst;
    desc.length = len;
    
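    // Note: On Cortex-M7 with D-Cache enabled, clean the descriptor out of the
    // cache here. A clean only writes back data stored before it, so if the
    // controller fetches control from RAM too, clean again after that write.
    // SCB_CleanDCache_by_Addr((uint32_t*)&desc, sizeof(dma_desc_t));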
    
    // DMB: Ensure all writes to desc complete before
    // the control write that triggers DMA
    __DMB();
    
    desc.control = DMA_CTRL_ENABLE | DMA_CTRL_IRQ_EN;
    
    // DSB: Ensure control write completes before return
    __DSB();
}

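Key difference: __DMB() (Data Memory Barrier) orders the memory accesses before it relative to those after it, without waiting for them to finish. __DSB() (Data Synchronization Barrier) stalls execution until all earlier memory accesses have completed. For DMA, you typically need both.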

Common Pitfalls

1. Volatile ≠ Atomic

volatile int counter;  // Prevents compiler optimization, NOT atomic!
counter++;             // Still compiles to LDR/ADD/STR — race condition!

2. Missing Memory Order on Flags

// WRONG: relaxed store, relaxed load - no ordering guarantee
// ... payload writes ...
atomic_store_explicit(&ready, 1, memory_order_relaxed);
// Consumer may see ready=1 before the payload writes are visible!

// CORRECT: release store, acquire load
atomic_store_explicit(&ready, 1, memory_order_release);
// Consumer:
if (atomic_load_explicit(&ready, memory_order_acquire)) {
    // All prior writes now visible
}
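
Put together, the release/acquire pairing looks like the sketch below. The names payload, publish and try_consume are hypothetical, and the code assumes one producer context and one consumer context.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static uint32_t payload;   // ordinary data guarded by the flag
static atomic_bool ready;  // static storage: starts out false

// Producer: write the data first, then publish with a release store.
void publish(uint32_t value) {
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

// Consumer: returns true and copies the payload only once it is published.
bool try_consume(uint32_t *out) {
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;  // nothing published yet
    *out = payload;    // ordered after the acquire load
    return true;
}
```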

3. ABA Problem in Lock-Free Structures

// CAS loop vulnerable to ABA: value changes A->B->A, CAS succeeds incorrectly
// Solution: Use tagged pointers or double-width CAS (not on Cortex-M)
// Or accept ABA if logically harmless (e.g., reference counting)
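
A tagged value can be sketched by packing a 16-bit index and a 16-bit update counter into one 32-bit atomic word. Everything here is an assumption made for illustration: the field widths, the helper names, and the use of an index rather than a raw pointer. The tag also wraps after 65536 updates, so this narrows the ABA window rather than closing it.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

// Low 16 bits: slot index. High 16 bits: tag bumped on every update.
static inline uint32_t pack(uint16_t index, uint16_t tag) {
    return ((uint32_t)tag << 16) | index;
}
static inline uint16_t index_of(uint32_t word) { return (uint16_t)(word & 0xFFFFu); }
static inline uint16_t tag_of(uint32_t word)   { return (uint16_t)(word >> 16); }

// Hypothetical helper: replace the index only if the whole word (index and
// tag) still equals 'snapshot'. A stale snapshot fails even when the index
// has cycled back to the same value.
bool tagged_update(_Atomic uint32_t *head, uint32_t snapshot, uint16_t new_index) {
    uint32_t desired = pack(new_index, (uint16_t)(tag_of(snapshot) + 1u));
    return atomic_compare_exchange_strong_explicit(
        head, &snapshot, desired,
        memory_order_acq_rel, memory_order_relaxed);
}
```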

4. Cache Coherency (Cortex-M7)

Memory barriers (__DMB()) ensure ordering, but they do not flush the data cache. If D-Cache is enabled, DMA might read stale data. Always use SCB_CleanDCache_by_Addr() / SCB_InvalidateDCache_by_Addr() or configure shared RAM as non-cacheable via the MPU.
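
Cache maintenance by address works on whole cache lines, so a buffer range usually has to be widened to line boundaries first. The arithmetic can be sketched as below. CACHE_LINE, cache_range_t and cache_align_range are hypothetical names, and the 32-byte line size is the Cortex-M7 value assumed elsewhere in this article.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32u  // assumption: Cortex-M7 D-Cache line size

typedef struct {
    uintptr_t start;  // rounded down to a line boundary
    size_t    size;   // multiple of CACHE_LINE covering the whole buffer
} cache_range_t;

// Hypothetical helper: widen [addr, addr + len) to whole cache lines.
cache_range_t cache_align_range(uintptr_t addr, size_t len) {
    uintptr_t mask  = ~(uintptr_t)(CACHE_LINE - 1u);
    uintptr_t start = addr & mask;
    uintptr_t end   = (addr + len + (CACHE_LINE - 1u)) & mask;
    cache_range_t r = { start, (size_t)(end - start) };
    return r;
}
```

Invalidating a widened range discards any unrelated data that shares those lines, which is why DMA buffers are better aligned and padded to the line size in the first place.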

When to Choose Atomics Over Mutexes

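+-------------------------------------+------------------------------------+
| Scenario                            | Recommended Primitive              |
+-------------------------------------+------------------------------------+
| Simple counter/flag shared ISR↔Task | Atomic (lock-free)                 |
| Single pointer handoff              | Atomic with acquire/release        |
| SPSC ring buffer                    | Atomic indices                     |
| Complex multi-variable invariant    | Mutex                              |
| Need to block/wait with timeout     | Semaphore/Mutex                    |
| Multi-producer/multi-consumer       | Mutex or lock-free queue (complex) |
+-------------------------------------+------------------------------------+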
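
The single pointer handoff case can be sketched as a one-slot mailbox built on atomic exchange. The names message_t, mailbox, mailbox_post and mailbox_take are hypothetical. The sketch assumes the poster gives up ownership of the message until the taker is finished with it.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t id; uint32_t value; } message_t;  // illustrative payload

static _Atomic(message_t *) mailbox;  // static storage: starts out NULL

// Producer: hand over ownership of 'msg'. Returns the previous message if it
// was never collected (NULL when the mailbox was empty).
message_t *mailbox_post(message_t *msg) {
    return atomic_exchange_explicit(&mailbox, msg, memory_order_acq_rel);
}

// Consumer: take whatever is there and leave the mailbox empty.
message_t *mailbox_take(void) {
    return atomic_exchange_explicit(&mailbox, (message_t *)NULL,
                                    memory_order_acq_rel);
}
```

The acq_rel exchange makes the fields written before mailbox_post visible to whoever receives the pointer from mailbox_take.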

Summary

Atomic operations on Cortex-M map directly to LDREX/STREX hardware primitives, providing lock-free synchronization for simple shared data. The C11 <stdatomic.h> interface gives portable access with explicit memory ordering. Master the acquire/release pairing; it’s the key to correct lock-free code. For DMA and peripheral sharing, supplement with __DMB()/__DSB() barriers. Choose atomics for high-frequency, low-contention synchronization; reach for mutexes when invariants span multiple variables or you need blocking semantics.

References

ARM Architecture Reference Manual ARMv7-M and ARMv8-M: LDREX/STREX specification — https://developer.arm.com/documentation/ddi0403
C11 Standard ISO/IEC 9899:2011, Section 7.17 — Atomic operations library
Michael Wong, “Memory Ordering in C11 and C++11” — https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3419.pdf
STM32 Programming Manual PM0214: Cortex-M4 MCUs and MPUs — https://www.st.com/resource/en/programming_manual/pm0214-stm32-cortexm4-mcus-and-mpus-programming-manual-stmicroelectronics.pdf
Anthony Williams, C++ Concurrency in Action (2nd ed., Manning, 2019), Chapter 5 — Memory model and atomics. ISBN 978-1617294693

Frequently Asked Questions

What makes an operation atomic on Cortex-M processors?

An operation is atomic if other contexts can only observe it as not started or fully complete, never half done. On Cortex-M, aligned 8-, 16-, and 32-bit loads and stores are naturally atomic single instructions, while read-modify-write operations are built from LDREX/STREX retry loops. On cores with exclusive access support, compilers typically implement C11 _Atomic operations with these primitives.

When should I use atomic operations instead of a mutex?

Use atomics for simple shared variables (counters, flags, pointers) where the critical section is a single read-modify-write. Mutexes are better for complex invariants spanning multiple variables or when you need to block the caller. Atomics avoid context switches and priority inversion but require lock-free algorithms.

What is the difference between memory_order_relaxed and memory_order_acq_rel?

memory_order_relaxed only guarantees atomicity — no ordering constraints on surrounding memory accesses. memory_order_acq_rel provides acquire semantics on load (subsequent reads/writes won't move before it) and release semantics on store (preceding reads/writes won't move after it), establishing happens-before relationships between threads.

Can I use atomic operations for DMA buffer sharing between CPU and peripheral?

Not on their own. C11 atomics order accesses between CPU contexts, while the DMA controller is a separate bus master that sees memory only through the bus. Use volatile for descriptor fields and buffer pointers, and add explicit __DMB()/__DSB() intrinsics around the write that starts the transfer. On Cortex-M7 with D-Cache, you must also perform cache maintenance or use non-cacheable memory.