
Performance profiling is a critical yet often deferred phase in embedded development. On microcontrollers running an RTOS, understanding where CPU cycles go and whether deadlines are met can determine product success.
This article covers practical techniques for measuring and optimizing RTOS performance — from built-in FreeRTOS runtime stats to commercial trace visualization.
Many developers rely on intuition for CPU utilization estimates. This fails because RTOS systems exhibit complex runtime behaviors that are hard to reason about statically. Priority inversion, variable interrupt latency, task interference causing jitter, and peak stack usage during worst-case interrupt nesting all demand quantitative analysis. Profiling reveals actual CPU load, execution times, response times, and stack headroom.
FreeRTOS includes built-in support for per-task CPU usage measurement. Enabling it requires three configuration directives and a hardware timer:
```c
/* In FreeRTOSConfig.h */
#define configUSE_TRACE_FACILITY            1
#define configGENERATE_RUN_TIME_STATS       1
#define configUSE_STATS_FORMATTING_FUNCS    1

#ifndef __ASSEMBLER__
extern void vConfigureTimerForRunTimeStats(void);
extern uint32_t ulGetRunTimeCounterValue(void);

/* Hardware timer access macros */
#define portCONFIGURE_TIMER_FOR_RUN_TIME_STATS() vConfigureTimerForRunTimeStats()
#define portGET_RUN_TIME_COUNTER_VALUE()         ulGetRunTimeCounterValue()
#endif
```
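The run-time counter should run at least 10 times faster than the tick interrupt; the FreeRTOS documentation recommends a time base between 10 and 100 times the tick frequency. For a 1 kHz tick rate, a 1 MHz timer gives microsecond granularity, at the cost of a 32-bit count that wraps after roughly 71.6 minutes. On STM32 parts where TIM2 is a 32-bit timer (the STM32F4 family, for example), configuring it as a free-running counter is a common choice: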
```c
void vConfigureTimerForRunTimeStats(void)
{
    __HAL_RCC_TIM2_CLK_ENABLE();

    uint32_t tim_clk = HAL_RCC_GetPCLK1Freq();

    /* On STM32, if the APB1 prescaler is not 1, the timer clock is 2 * PCLK1.
     * We check the MSB of the PPRE1 prescaler field (RCC_CFGR_PPRE1_2) to see
     * if it is configured. */
    if ((RCC->CFGR & RCC_CFGR_PPRE1_2) != 0) {
        tim_clk *= 2;
    }

    TIM2->PSC = (tim_clk / 1000000UL) - 1;
    TIM2->ARR = 0xFFFFFFFF;
    TIM2->EGR = TIM_EGR_UG;   /* Trigger update event to load prescaler */
    TIM2->CR1 = TIM_CR1_CEN;
}

uint32_t ulGetRunTimeCounterValue(void)
{
    return TIM2->CNT;
}
```
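The prescaler arithmetic and the counter's wrap period can be checked on a host machine before touching hardware. The following is a minimal sketch; the helper names `stats_timer_psc` and `stats_timer_wrap_seconds` are hypothetical, and the 84 MHz timer clock in the usage note is only an assumed example value.

```c
#include <stdint.h>

/* Prescaler register value that divides tim_clk_hz down to target_hz.
 * Returns 0 (no division) if the inputs cannot produce a valid divider. */
uint32_t stats_timer_psc(uint32_t tim_clk_hz, uint32_t target_hz)
{
    if (target_hz == 0U || tim_clk_hz < target_hz) {
        return 0U;
    }
    return (tim_clk_hz / target_hz) - 1U;
}

/* Whole seconds until a free-running 32-bit counter wraps at counter_hz. */
uint32_t stats_timer_wrap_seconds(uint32_t counter_hz)
{
    if (counter_hz == 0U) {
        return 0U;
    }
    return (uint32_t)((UINT64_C(1) << 32) / counter_hz);
}
```

With an assumed 84 MHz timer clock and a 1 MHz target, `stats_timer_psc` yields 83, and `stats_timer_wrap_seconds(1000000)` yields 4294 seconds. Any statistics that rely on the raw 32-bit count should be sampled more often than that.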
With these options set, two formatting functions become available:
```c
/* Per-task CPU usage as a percentage */
char buf[512];
vTaskGetRunTimeStats(buf);
printf("%s\n", buf);

/* System-wide task state dump */
vTaskList(buf);
printf("%s\n", buf);
```
The vTaskList output includes each task’s name, state, priority, stack high-water mark, and task number. The vTaskGetRunTimeStats output provides the accumulated runtime count and automatically calculates the CPU utilization percentage for each task.
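The built-in percentages are cumulative since the counter started, which can hide short bursts of load. One way to get a per-window figure is to difference two snapshots of a task's run-time count and of the total run time. The sketch below shows only that arithmetic; `cpu_percent_window` is a hypothetical helper, and the snapshot values are assumed to come from something like `uxTaskGetSystemState()`.

```c
#include <stdint.h>

/* Integer CPU share (0-100) of one task over a measurement window.
 * Unsigned subtraction keeps the deltas correct across a single
 * 32-bit counter wrap between the two snapshots. */
uint32_t cpu_percent_window(uint32_t task_prev, uint32_t task_now,
                            uint32_t total_prev, uint32_t total_now)
{
    uint32_t task_delta  = task_now - task_prev;
    uint32_t total_delta = total_now - total_prev;

    if (total_delta == 0U) {
        return 0U;   /* no time elapsed: avoid dividing by zero */
    }
    /* 64-bit intermediate so task_delta * 100 cannot overflow */
    return (uint32_t)(((uint64_t)task_delta * 100U) / total_delta);
}
```

For example, a task that accumulated 250,000 counts during a 1,000,000-count window is reported as 25%. The result truncates toward zero, so tasks below 1% show as 0.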
Effective profiling focuses on four core metrics:
| Metric | Definition | Why It Matters |
|---|---|---|
| Execution Time | Total CPU time a task spends in the Running state | Reveals computational cost of algorithm and ISR interference |
| Response Time | Time from task activation to task completion | Captures blocking from mutexes, queue waits, and preemption |
| CPU Load | Percentage of total time the CPU is executing (not idle) | Determines headroom for new features and burst workloads |
| Stack High Water Mark | Minimum free stack space since task creation | Safety margin for stack overflow prevention |
The distinction between execution time and response time is critical. A control algorithm that executes in 2 ms but has a response time of 15 ms is spending 13 ms not running — likely preempted by a higher-priority task, or blocked waiting on a mutex or queue. Execution-time-only profiling would miss this entirely.
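The relationship between the two metrics is simple arithmetic on three timestamps per task instance. The sketch below assumes microsecond timestamps from some trace source; the struct and function names are illustrative and not part of FreeRTOS.

```c
#include <stdint.h>

/* Timing of one task instance (one activation through completion),
 * in microseconds. All field names are illustrative. */
typedef struct {
    uint32_t activated_us;  /* when the task became ready to run */
    uint32_t completed_us;  /* when it finished its work */
    uint32_t executed_us;   /* time actually spent in the Running state */
} task_instance_t;

/* Response time: activation to completion. */
uint32_t response_time_us(const task_instance_t *t)
{
    return t->completed_us - t->activated_us;
}

/* Time spent preempted or blocked rather than running. */
uint32_t waiting_time_us(const task_instance_t *t)
{
    return response_time_us(t) - t->executed_us;
}
```

For the control-algorithm example above (2 ms executed, 15 ms from activation to completion), `waiting_time_us` returns 13000, which is the portion an execution-time-only measurement would not show.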
Runtime stack monitoring via uxTaskGetStackHighWaterMark() is the most reliable method for sizing task stacks. The recommended workflow during development:
```c
/* Ensure INCLUDE_uxTaskGetStackHighWaterMark is set to 1 in FreeRTOSConfig.h */
/* Call periodically from a monitor task during integration testing */
void profile_task_stacks(void)
{
    TaskHandle_t handles[] = { sensorTaskHandle, commsTaskHandle, controlTaskHandle };
    const char *names[] = { "Sensor", "Comms", "Control" };

    for (int i = 0; i < 3; i++) {
        UBaseType_t hwm = uxTaskGetStackHighWaterMark(handles[i]);
        /* On Cortex-M, hwm is in words (4 bytes each) */
        uint32_t free_bytes = hwm * sizeof(StackType_t);
        printf("%s: %lu bytes free\n", names[i], (unsigned long)free_bytes);
    }
}
```
Final stack size should be set to peak observed usage plus a 25–50% safety margin. This margin accounts for untested code paths, future changes, and interrupt nesting depth variations that may occur in the field.
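The sizing rule can be written down as a small calculation. This is a sketch under the assumption that both the allocated depth and the high-water mark are expressed in words, as FreeRTOS reports them; `recommended_stack_words` is a hypothetical helper name.

```c
#include <stdint.h>

/* Suggest a stack depth (in words) from the currently allocated depth and
 * the observed high-water mark (minimum free words). margin_pct is the
 * safety margin added on top of peak usage; the result is rounded up. */
uint32_t recommended_stack_words(uint32_t allocated_words,
                                 uint32_t hwm_words,
                                 uint32_t margin_pct)
{
    if (hwm_words > allocated_words) {
        return allocated_words;   /* inconsistent input: leave size unchanged */
    }
    uint32_t peak_used = allocated_words - hwm_words;
    uint64_t scaled = (uint64_t)peak_used * (100U + margin_pct);
    return (uint32_t)((scaled + 99U) / 100U);
}
```

A task allocated 512 words with a high-water mark of 192 words peaked at 320 words used, so a 25% margin suggests 400 words and a 50% margin suggests 480.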
For complex timing issues — sporadic deadline misses, intermittent priority inversion, variable response times — timeline visualization is invaluable. Percepio Tracealyzer records FreeRTOS kernel events (context switches, API calls, ISR entries/exits) and displays them as an interactive Gantt-chart-like trace view.
```
Execution Timeline (simplified Tracealyzer view)
=================================================
ISR       /-\       /-\       /-\       /-\
          | |       | |       | |       | |
Task A       +----------+         +----------+
             | running  |         | running  |
             +----------+         +----------+
                        ^ blocked by mutex held by Task B
Task B   +--+            +-------+
         |  |            |running|
         +--+            +-------+
         ^                       ^
         |                       +-- releases mutex, Task A unblocks
         +-- acquires mutex
```
Tracealyzer automatically calculates execution time, response time, and blocking time for each task instance. The tool also visualizes CPU load, stack usage, and heap allocation over time — making it straightforward to correlate a performance anomaly with its root cause.
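A timeline view exposes patterns that are nearly impossible to derive from printf debugging alone, such as intermittent priority inversion, sporadic deadline misses, and CPU load spikes.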
Once profiling data is available, a few common optimizations tend to yield the largest gains:
1. Reduce ISR Duration. Long ISRs increase interrupt latency for all lower-priority interrupts. Move processing from the ISR to task context using a queue or task notification: the ISR sends a notification to a handler task, which does the actual work.
2. Eliminate Unnecessary Blocking. If profiling reveals that a high-priority task spends significant time waiting on a mutex held by a lower-priority task, restructure the access pattern: use priority inheritance mutexes (FreeRTOS supports this natively), reduce critical section duration, or switch to lock-free ring buffers for ISR-to-task data transfer.
3. Batch Processing. A task that wakes every 1 ms to process a single sample can spend more CPU time on context-switch overhead than on useful work. Batching 10 samples per wake-up cuts context switches by a factor of 10, at the cost of up to 9 ms of added latency for the oldest sample in each batch.
4. Tick Rate Selection. The tick interrupt frequency (configTICK_RATE_HZ) affects both timing resolution and power consumption. If deadlines permit, enabling tickless idle mode (configUSE_TICKLESS_IDLE) instead of a continuous tick significantly reduces power draw.
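The lock-free ring buffer and batching ideas from the list above can be combined in one structure. The following is a minimal single-producer, single-consumer sketch using C11 atomics. It assumes exactly one writer (for example an ISR) and one reader (a task), and every name in it is hypothetical. The FreeRTOS wiring, such as notifying the consumer task from the ISR, is left out.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define RING_CAPACITY 16U   /* must be a power of two */

typedef struct {
    uint16_t buf[RING_CAPACITY];
    atomic_uint head;   /* written only by the producer (e.g. an ISR) */
    atomic_uint tail;   /* written only by the consumer (a task) */
} sample_ring_t;

/* Producer side: returns 0 if the ring is full and the sample was dropped. */
int ring_push(sample_ring_t *r, uint16_t sample)
{
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_CAPACITY) {
        return 0;
    }
    r->buf[head & (RING_CAPACITY - 1U)] = sample;
    /* Release: the sample is visible before the new head is published. */
    atomic_store_explicit(&r->head, head + 1U, memory_order_release);
    return 1;
}

/* Consumer side: copies up to max samples in one wake-up, returns the count. */
size_t ring_pop_batch(sample_ring_t *r, uint16_t *out, size_t max)
{
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);
    size_t n = 0;

    while (n < max && tail != head) {
        out[n++] = r->buf[tail & (RING_CAPACITY - 1U)];
        tail++;
    }
    /* Release: the slots are free for reuse only after they were copied out. */
    atomic_store_explicit(&r->tail, tail, memory_order_release);
    return n;
}
```

A consumer that wakes every 10 ms and calls `ring_pop_batch` with `max` set to 10 handles the same data as one waking every 1 ms, with a tenth of the wake-ups. The capacity must cover the worst-case number of samples that can arrive between wake-ups, otherwise `ring_push` starts dropping samples.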
The FreeRTOS run-time counter adds little overhead, essentially one timer read per context switch, so it can remain active in production. Be more cautious with uxTaskGetStackHighWaterMark(): it scans the unused stack region for the fill pattern written at task creation, so its cost grows with the amount of free stack. In production, high-water-mark polling should be disabled or run very infrequently from a low-priority diagnostic task.
RTOS performance profiling transforms debugging from guesswork into engineering. Enable FreeRTOS runtime statistics early in development — the setup cost is minimal and the data is invaluable. Use uxTaskGetStackHighWaterMark() to right-size task stacks with empirical data instead of estimates. For complex timing issues, trace visualization tools like Tracealyzer make the invisible visible: priority inversion, sporadic deadline misses, and CPU load spikes become obvious on an interactive timeline. Profile early, profile often, and let data drive optimization decisions.





