
Atomic Operations in Embedded C: Lock-Free Synchronization for Cortex-M

By Jithin Tom
Published in Embedded C/C++
July 03, 2026

Table of Contents

  1. The Cortex-M Atomic Hardware Primitives
  2. C11 Atomics in Embedded C
  3. Memory Ordering: The Subtle Bug Source
  4. Memory Barriers for DMA and Peripherals
  5. Common Pitfalls
  6. When to Choose Atomics Over Mutexes
  7. Summary
  8. References
  9. Frequently Asked Questions

Atomic operations are the foundation of lock-free programming in embedded systems. While mutexes and semaphores provide simple mutual exclusion, they carry overhead: context switches, priority inversion risk, and memory footprint. For high-frequency interrupt-to-task communication or multi-core synchronization, atomic operations offer a deterministic, lock-free alternative, provided you understand the memory model and hardware primitives.

The Cortex-M Atomic Hardware Primitives

ARM Cortex-M processors (M3/M4/M7/M33) implement the ARMv7-M/ARMv8-M architecture, which provides exclusive access instructions (Cortex-M0/M0+, based on ARMv6-M, do not have them):

+------------------------------------------------------------------+
|              Cortex-M Exclusive Access Instructions              |
+------------------------------------------------------------------+
|                                                                  |
|  LDREX Rt, [Rn]       Load Register Exclusive (8/16/32-bit)      |
|  STREX Rd, Rt, [Rn]   Store Register Exclusive (8/16/32-bit)     |
|  CLREX                Clear Exclusive Monitor                    |
|                                                                  |
|  LDREX/STREX work as a pair: LDREX sets the exclusive monitor,   |
|  STREX succeeds only if the monitor is still set (it is cleared  |
|  by any exception or by CLREX). STREX writes 0 to Rd on          |
|  success, 1 on failure (retry loop).                             |
+------------------------------------------------------------------+

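These instructions form the hardware basis for compare-and-swap (CAS), atomic increment, and other read-modify-write operations. Each core has its own local exclusive monitor, and the size of the region it tracks (the exclusives reservation granule) is implementation defined. On Cortex-M3/M4 the local monitor does not record the address at all, and any exception clears it, so an interrupted LDREX/STREX sequence always fails its STREX and retries.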

Natural Atomicity

On Cortex-M, aligned 32-bit loads and stores are naturally atomic — they cannot be interrupted mid-transfer. This means:

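volatile uint32_t shared_flag = 0;

// Thread A
shared_flag = 1;          // Atomic store (single STR instruction)

// Thread B
if (shared_flag == 1) {   // Atomic load (single LDR instruction)
    // ... react to the flag; note that volatile adds no ordering
    // guarantee for other variables written before the flag
}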

However, read-modify-write sequences are NOT atomic:

shared_counter++; // NOT atomic! Compiles to LDR, ADD, STR
// Interrupt between LDR and STR loses updates

For RMW operations, you need LDREX/STREX or C11 atomics.
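As a minimal sketch of the C11 route (the function names below are illustrative, not taken from any library), a plain increment can use atomic_fetch_add, while a custom read-modify-write such as a saturating increment can be built from a compare-exchange retry loop. On Cortex-M3 and later, compilers typically lower both to an LDREX/STREX loop.

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t shared_counter;   /* shared between ISR and task */

/* Plain atomic increment: returns the value before the add. */
uint32_t counter_increment(void) {
    return atomic_fetch_add_explicit(&shared_counter, 1u, memory_order_relaxed);
}

/* Saturating increment built from a CAS retry loop (hypothetical helper).
 * Returns the new value, or the current value unchanged once it has
 * reached `limit`. */
uint32_t counter_increment_saturating(uint32_t limit) {
    uint32_t seen = atomic_load_explicit(&shared_counter, memory_order_relaxed);
    while (seen < limit) {
        /* On failure, `seen` is refreshed with the current value and we retry. */
        if (atomic_compare_exchange_weak_explicit(&shared_counter, &seen, seen + 1u,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed)) {
            return seen + 1u;
        }
    }
    return seen;
}
```

The weak compare-exchange is allowed to fail spuriously, which matches how a STREX can fail after an unrelated interrupt, so it belongs inside a loop.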

C11 Atomics in Embedded C

The C11 _Atomic type qualifier and <stdatomic.h> provide portable atomic operations. Modern embedded toolchains (GCC 10+, Clang 12+, IAR 9+) support them for Cortex-M.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
// __WFE() and __SEV() are CMSIS-Core intrinsics; include your device header for them.

// Atomic flag - lock-free on Cortex-M
atomic_flag lock = ATOMIC_FLAG_INIT;

// Spinlock using test-and-set
// WARNING: Avoid spinlocks on single-core systems without disabling interrupts:
// a higher-priority context that preempts the lock holder will spin forever.
void lock_acquire(atomic_flag *lock) {
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire)) {
        // Spin - in embedded, consider __WFE() for low-power wait
        __WFE();
    }
}

void lock_release(atomic_flag *lock) {
    atomic_flag_clear_explicit(lock, memory_order_release);
    __SEV(); // Wake waiting cores
}
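Where spinning is not acceptable, for example in an ISR on a single-core part, one alternative is a non-blocking try-lock that makes a single attempt and lets the caller decide what to do on failure. The sketch below uses hypothetical names (lock_try_acquire, lock_release_plain) and leaves out the __SEV() wake-up so that it stays portable.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Try once to take the lock; never spins.
 * Returns true if the caller now owns the lock. */
bool lock_try_acquire(atomic_flag *lock) {
    /* test_and_set returns the previous state: false means it was free. */
    return !atomic_flag_test_and_set_explicit(lock, memory_order_acquire);
}

/* Release without signalling an event (no waiter is sleeping in WFE). */
void lock_release_plain(atomic_flag *lock) {
    atomic_flag_clear_explicit(lock, memory_order_release);
}
```

A caller that fails to get the lock can defer the work (set a pending flag, retry on the next tick) instead of blocking a context that the lock holder needs in order to make progress.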

Atomic Integer Types

atomic_int counter = 0;
atomic_uintptr_t shared_ptr = 0;

// Atomic increment (fetch_add returns old value)
int old = atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);

// Compare-and-swap (CAS) - returns true on success.
// old_ptr and new_ptr stand for the pointer values being swapped.
// On failure, expected is overwritten with the value actually found.
uintptr_t expected = old_ptr;
bool success = atomic_compare_exchange_strong_explicit(
    &shared_ptr, &expected, new_ptr,
    memory_order_acq_rel, memory_order_relaxed
);

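Critical: Always check the ATOMIC_*_LOCK_FREE macros. A value of 2 means always lock-free, 1 means sometimes lock-free, and 0 means never lock-free. On Cortex-M3 and later, ATOMIC_INT_LOCK_FREE is 2 because 32-bit atomics map to LDREX/STREX. Cortex-M has no 64-bit exclusive-access instruction, so 64-bit atomics are not guaranteed lock-free (check ATOMIC_LLONG_LOCK_FREE) and the compiler falls back to library calls that may take a lock. Avoid them in ISRs.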
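One way to enforce this check is at build time, so that a port to a core without exclusive-access instructions fails to compile instead of silently using a lock. The sketch below assumes a C11 toolchain; event_count and the two helper functions are illustrative names.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Fail the build if int-sized atomics could ever fall back to a lock. */
_Static_assert(ATOMIC_INT_LOCK_FREE == 2, "atomic_int must always be lock-free");

static atomic_int event_count;

/* Run-time query for one specific object. */
bool event_count_is_lock_free(void) {
    return atomic_is_lock_free(&event_count);
}

/* True only when 64-bit atomics are guaranteed lock-free on this target. */
bool llong_always_lock_free(void) {
    return ATOMIC_LLONG_LOCK_FREE == 2;
}
```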

Memory Ordering: The Subtle Bug Source

The hardest part of atomics is memory ordering. While Cortex-M enforces strict access ordering for Device and Strongly-ordered memory (e.g., peripherals), Normal memory (SRAM) is weakly ordered, allowing compiler and hardware reordering.

+------------------------------------------------------------------+
|                 C11 Memory Order Quick Reference                 |
+------------------------------------------------------------------+
|                                                                  |
|  memory_order_relaxed   No ordering, atomicity only              |
|  memory_order_consume   Load ordering (discouraged, avoid)       |
|  memory_order_acquire   Load: subsequent ops don't move before   |
|  memory_order_release   Store: prior ops don't move after        |
|  memory_order_acq_rel   RMW: acquire + release                   |
|  memory_order_seq_cst   Full barrier, global total order         |
|                                                                  |
|  ARM mapping (typical 32-bit):                                   |
|    acquire load    -> LDR + DMB                                  |
|    release store   -> DMB + STR                                  |
|    RMW operation   -> LDREX + ... + STREX                        |
+------------------------------------------------------------------+

Practical Example: Lock-Free Ring Buffer

#define RB_SIZE 256

typedef struct {
    // Note: Align elements to cache line size (e.g. 32 bytes on Cortex-M7)
    // to prevent cache line thrashing (false sharing) between cores.
    _Alignas(32) atomic_uint head; // Producer index
    _Alignas(32) atomic_uint tail; // Consumer index
    _Alignas(32) uint8_t buf[RB_SIZE];
} ringbuf_t;

_Static_assert((RB_SIZE & (RB_SIZE - 1)) == 0, "RB_SIZE must be a power of 2");

bool rb_push(ringbuf_t *rb, uint8_t data) {
    uint32_t head = atomic_load_explicit(&rb->head, memory_order_relaxed);
    uint32_t next = (head + 1) & (RB_SIZE - 1); // Fast modulo for power-of-2
    // Check full - need acquire to see consumer's tail update
    if (next == atomic_load_explicit(&rb->tail, memory_order_acquire))
        return false; // Full
    rb->buf[head] = data;
    // Release: ensure data write visible before head update
    atomic_store_explicit(&rb->head, next, memory_order_release);
    return true;
}

bool rb_pop(ringbuf_t *rb, uint8_t *data) {
    uint32_t tail = atomic_load_explicit(&rb->tail, memory_order_relaxed);
    // Check empty - need acquire to see producer's head update
    if (tail == atomic_load_explicit(&rb->head, memory_order_acquire))
        return false; // Empty
    *data = rb->buf[tail];
    uint32_t next = (tail + 1) & (RB_SIZE - 1); // Fast modulo for power-of-2
    // Release: ensure data read complete before tail update
    atomic_store_explicit(&rb->tail, next, memory_order_release);
    return true;
}

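This single-producer, single-consumer (SPSC) ring buffer is wait-free: each operation completes in a bounded number of steps, with no retry loops. The release store to head pairs with the consumer's acquire load of head, so the consumer cannot observe the new index without also observing the data written before it. The tail update pairs the same way in the other direction, which stops the producer from overwriting a slot that is still being read. Because a full buffer is detected as next == tail, one slot always stays empty and the usable capacity is RB_SIZE - 1.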
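The empty and full conditions are easy to get wrong, so it can help to exercise the same index logic on a host machine with a very small buffer. The following is a reduced, self-contained sketch of the ring buffer above (4 slots, no alignment attributes); the demo_ names are illustrative only.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_SIZE 4u   /* power of two; small so "full" is easy to reach */

typedef struct {
    atomic_uint head;          /* written by the producer only */
    atomic_uint tail;          /* written by the consumer only */
    uint8_t buf[DEMO_SIZE];
} demo_rb_t;

bool demo_push(demo_rb_t *rb, uint8_t data) {
    unsigned head = atomic_load_explicit(&rb->head, memory_order_relaxed);
    unsigned next = (head + 1u) & (DEMO_SIZE - 1u);
    if (next == atomic_load_explicit(&rb->tail, memory_order_acquire))
        return false;                                   /* full */
    rb->buf[head] = data;
    atomic_store_explicit(&rb->head, next, memory_order_release);
    return true;
}

bool demo_pop(demo_rb_t *rb, uint8_t *data) {
    unsigned tail = atomic_load_explicit(&rb->tail, memory_order_relaxed);
    if (tail == atomic_load_explicit(&rb->head, memory_order_acquire))
        return false;                                   /* empty */
    *data = rb->buf[tail];
    atomic_store_explicit(&rb->tail, (tail + 1u) & (DEMO_SIZE - 1u),
                          memory_order_release);
    return true;
}

/* Push 0, 1, 2, ... until the buffer reports full; returns how many fit. */
unsigned demo_fill(demo_rb_t *rb) {
    unsigned n = 0;
    while (demo_push(rb, (uint8_t)n))
        n++;
    return n;
}
```

With 4 slots, demo_fill stores 3 items, which shows the one-slot-empty convention. Items come back out in the order they were pushed.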

Memory Barriers for DMA and Peripherals

Atomics synchronize between CPU threads/cores. For CPU↔DMA or CPU↔peripheral sharing, you need hardware memory barriers:

// DMA buffer descriptor shared with peripheral
typedef struct {
    volatile uint32_t src_addr;
    volatile uint32_t dst_addr;
    volatile uint32_t length;
    volatile uint32_t control;
    // Pad to 32-byte cache line size to prevent cache corruption
    // of adjacent variables during cache invalidation.
    uint32_t reserved[4];
} dma_desc_t;

dma_desc_t desc __attribute__((aligned(32))); // Cache line aligned and padded

// DMA_CTRL_ENABLE and DMA_CTRL_IRQ_EN are device-specific bit masks.
void start_dma_transfer(uint32_t src, uint32_t dst, uint32_t len) {
    desc.src_addr = src;
    desc.dst_addr = dst;
    desc.length = len;
    // Note: on Cortex-M7 with D-Cache enabled, a descriptor in cacheable RAM
    // must be cleaned to memory before the DMA reads it, e.g.
    // SCB_CleanDCache_by_Addr((uint32_t*)&desc, sizeof(dma_desc_t));
    // (repeat after the control write below, or keep desc non-cacheable).
    // DMB: Ensure all writes to desc complete before
    // the control write that triggers DMA
    __DMB();
    desc.control = DMA_CTRL_ENABLE | DMA_CTRL_IRQ_EN;
    // DSB: Ensure control write completes before return
    __DSB();
}

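Key difference: __DMB() (Data Memory Barrier) guarantees that memory accesses before the barrier are observed before those after it, but it does not stall the core. __DSB() (Data Synchronization Barrier) additionally blocks execution until all earlier memory accesses have completed. For DMA, you typically need both.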

Common Pitfalls

1. Volatile ≠ Atomic

volatile int counter; // Prevents compiler optimization, NOT atomic!
counter++; // Still compiles to LDR/ADD/STR — race condition!

2. Missing Memory Order on Flags

// The producer writes the payload first, then sets the flag.

// WRONG: relaxed store, relaxed load - no ordering guarantee
// ... payload writes ...
atomic_store_explicit(&ready, 1, memory_order_relaxed);
// Consumer can see ready == 1 before the payload writes are visible!

// CORRECT: release store, acquire load
// ... payload writes ...
atomic_store_explicit(&ready, 1, memory_order_release);

// Consumer:
if (atomic_load_explicit(&ready, memory_order_acquire)) {
    // All writes made before the release store are now visible
}
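Put together as a complete unit, the release/acquire flag pattern might look like the sketch below. The payload type and function names are hypothetical, and the code assumes a single producer that does not publish again until the consumer has cleared the flag.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t id;
    uint32_t value;
} sample_t;                       /* hypothetical payload */

static sample_t latest;           /* plain data, guarded by `ready` */
static atomic_bool ready;         /* false = slot free, true = slot full */

/* Producer (e.g. an ISR): write the payload first, then publish it. */
void sample_publish(uint32_t id, uint32_t value) {
    latest.id = id;
    latest.value = value;
    /* Release: the payload writes above cannot move below this store. */
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer: copies the payload and returns true only if one was published. */
bool sample_try_take(sample_t *out) {
    /* Acquire: the payload reads below cannot move above this load. */
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = latest;
    /* Release: the copy is complete before the slot is handed back. */
    atomic_store_explicit(&ready, false, memory_order_release);
    return true;
}
```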

3. ABA Problem in Lock-Free Structures

// CAS loop vulnerable to ABA: value changes A->B->A, CAS succeeds incorrectly
// Solution: Use tagged pointers or double-width CAS (not on Cortex-M)
// Or accept ABA if logically harmless (e.g., reference counting)
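To make the tagged-pointer idea concrete, here is a minimal sketch of a lock-free stack over a static node pool. Instead of a pointer, the 32-bit head word packs a 16-bit node index with a 16-bit tag that is bumped on every successful update, so a head that went A->B->A no longer compares equal. All names are illustrative. The scheme assumes at most 65535 nodes, and the tag wraps after 65536 updates, which narrows the ABA window without closing it entirely.

```c
#include <stdatomic.h>
#include <stdint.h>

#define POOL_SIZE 8u
#define NIL 0xFFFFu                 /* index meaning "no node" */

typedef struct {
    _Atomic uint16_t next;          /* index of the node below this one */
    uint32_t payload;
} node_t;

static node_t pool[POOL_SIZE];

/* Head word: bits 31..16 = tag, bits 15..0 = index of the top node. */
static _Atomic uint32_t head = NIL;

/* Build a new head word from the old one, bumping the tag. */
static uint32_t pack(uint32_t old_head, uint16_t new_idx) {
    uint32_t tag = ((old_head >> 16) + 1u) & 0xFFFFu;
    return (tag << 16) | new_idx;
}

void stack_push(uint16_t idx) {
    uint32_t old = atomic_load_explicit(&head, memory_order_relaxed);
    do {
        /* Link the new node to the current top before publishing it. */
        atomic_store_explicit(&pool[idx].next, (uint16_t)(old & 0xFFFFu),
                              memory_order_relaxed);
    } while (!atomic_compare_exchange_weak_explicit(
                 &head, &old, pack(old, idx),
                 memory_order_release, memory_order_relaxed));
}

/* Returns the popped node index, or NIL if the stack is empty. */
uint16_t stack_pop(void) {
    uint32_t old = atomic_load_explicit(&head, memory_order_acquire);
    for (;;) {
        uint16_t idx = (uint16_t)(old & 0xFFFFu);
        if (idx == NIL)
            return NIL;
        uint16_t below = atomic_load_explicit(&pool[idx].next,
                                              memory_order_relaxed);
        /* Fails if the index OR the tag changed since `old` was read. */
        if (atomic_compare_exchange_weak_explicit(
                &head, &old, pack(old, below),
                memory_order_acquire, memory_order_acquire))
            return idx;
    }
}
```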

4. Cache Coherency (Cortex-M7)

Memory barriers (__DMB()) ensure ordering, but they do not flush the data cache. If D-Cache is enabled, DMA might read stale data. Always use SCB_CleanDCache_by_Addr() / SCB_InvalidateDCache_by_Addr() or configure shared RAM as non-cacheable via the MPU.
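Cache maintenance works on whole cache lines, so a buffer that does not start and end on a 32-byte boundary shares lines with its neighbours. A small helper that rounds an address range out to line boundaries makes this explicit. This is a sketch with hypothetical names (cache_span_t, cache_span_for) and assumes the 32-byte Cortex-M7 line size mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

#define DCACHE_LINE 32u   /* Cortex-M7 D-cache line size in bytes */

typedef struct {
    uintptr_t start;   /* rounded down to a line boundary */
    size_t    size;    /* multiple of the line size */
} cache_span_t;

/* Smallest line-aligned range that covers [addr, addr + len). */
cache_span_t cache_span_for(uintptr_t addr, size_t len) {
    const uintptr_t mask = ~(uintptr_t)(DCACHE_LINE - 1u);
    uintptr_t start = addr & mask;
    uintptr_t end   = (addr + len + DCACHE_LINE - 1u) & mask;
    cache_span_t span = { start, (size_t)(end - start) };
    return span;
}
```

The resulting start and size are the values to pass to the clean or invalidate call. If the span is larger than the buffer itself, invalidating it also discards unrelated data that happens to share those lines, which is why DMA buffers are best aligned and padded to the line size in the first place.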

When to Choose Atomics Over Mutexes

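+-------------------------------------+------------------------------------+
| Scenario                            | Recommended Primitive              |
+-------------------------------------+------------------------------------+
| Simple counter/flag shared ISR↔Task | Atomic (lock-free)                 |
| Single pointer handoff              | Atomic with acquire/release        |
| SPSC ring buffer                    | Atomic indices                     |
| Complex multi-variable invariant    | Mutex                              |
| Need to block/wait with timeout     | Semaphore/Mutex                    |
| Multi-producer/multi-consumer       | Mutex or lock-free queue (complex) |
+-------------------------------------+------------------------------------+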
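For the single pointer handoff case, one possible shape is a one-slot mailbox built on atomic_exchange. The sketch below uses hypothetical names (msg_t, mailbox_post, mailbox_take) and assumes the producer finishes writing a message before posting it and does not touch it again afterwards.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t len;
    uint8_t  bytes[16];
} msg_t;                              /* hypothetical message type */

static _Atomic(msg_t *) mailbox;      /* NULL means empty */

/* Producer: hand over a fully written message.
 * Returns the message that was displaced (unread), or NULL. */
msg_t *mailbox_post(msg_t *m) {
    return atomic_exchange_explicit(&mailbox, m, memory_order_acq_rel);
}

/* Consumer: take ownership of the pending message, or NULL if none. */
msg_t *mailbox_take(void) {
    return atomic_exchange_explicit(&mailbox, (msg_t *)NULL,
                                    memory_order_acquire);
}
```

The release half of the post pairs with the acquire in the take, so the consumer sees the message contents that were written before the pointer was published.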

Summary

Atomic read-modify-write operations on Cortex-M are built on the LDREX/STREX hardware primitives, providing lock-free synchronization for simple shared data. The C11 <stdatomic.h> interface gives portable access with explicit memory ordering. Master the acquire/release pairing: it is the key to correct lock-free code. For DMA and peripheral sharing, supplement with __DMB()/__DSB() barriers and, on parts with a data cache, cache maintenance. Choose atomics for high-frequency, low-contention synchronization; reach for mutexes when invariants span multiple variables or you need blocking semantics.

References

  1. ARM Architecture Reference Manual ARMv7-M and ARMv8-M: LDREX/STREX specification — https://developer.arm.com/documentation/ddi0403
  2. C11 Standard ISO/IEC 9899:2011, Section 7.17 — Atomic operations library
  3. Michael Wong, “Memory Ordering in C11 and C++11” — https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3419.pdf
  4. STM32 Programming Manual PM0214: Cortex-M4 MCUs and MPUs — https://www.st.com/resource/en/programming_manual/pm0214-stm32-cortexm4-mcus-and-mpus-programming-manual-stmicroelectronics.pdf
  5. Anthony Williams, C++ Concurrency in Action (2nd ed., Manning, 2019), Chapter 5 — Memory model and atomics. ISBN 978-1617294693

Frequently Asked Questions

What makes an operation atomic on Cortex-M processors?

An operation is atomic if other execution contexts (interrupts, tasks, other cores) can only observe it as not started or fully completed, never half done. On Cortex-M, aligned 8-, 16-, and 32-bit loads and stores are atomic as single instructions, and LDREX/STREX retry loops make read-modify-write sequences atomic even though they span several instructions. C11 _Atomic operations are compiled to these hardware primitives.

When should I use atomic operations instead of a mutex?

Use atomics for simple shared variables (counters, flags, pointers) where the critical section is a single read-modify-write. Mutexes are better for complex invariants spanning multiple variables or when you need to block the caller. Atomics avoid context switches and priority inversion but require lock-free algorithms.

What is the difference between memory_order_relaxed and memory_order_acq_rel?

memory_order_relaxed only guarantees atomicity, with no ordering constraints on surrounding memory accesses. memory_order_acq_rel applies to read-modify-write operations: it gives the read acquire semantics (subsequent reads/writes won't move before it) and the write release semantics (preceding reads/writes won't move after it), establishing happens-before relationships between threads.

Can I use atomic operations for DMA buffer sharing between CPU and peripheral?

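Yes, but atomics alone are not enough: you also need memory barriers (DMB/DSB). The CPU and DMA controller are separate bus masters, and the DMA controller takes no part in the C11 memory model. Use volatile for descriptor and register accesses that the hardware reads, and explicit __DMB() intrinsics to order them. On Cortex-M7 with D-Cache, you must also perform cache maintenance or use non-cacheable memory.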
