RTOS Performance Profiling and Optimization Techniques

By Jithin Tom

Published in Embedded OS

June 10, 2026

Why Profile an RTOS Application?

Enabling FreeRTOS Runtime Statistics

Understanding Key Performance Metrics

Stack Usage Optimization

Trace-Based Profiling with Tracealyzer

Practical Optimization Strategies

Profiling in Production

Summary

Frequently Asked Questions

Performance profiling is a critical yet often deferred phase in embedded development. On microcontrollers running an RTOS, understanding where CPU cycles go and whether deadlines are met can determine product success.

This article covers practical techniques for measuring and optimizing RTOS performance — from built-in FreeRTOS runtime stats to commercial trace visualization.

Why Profile an RTOS Application?

Many developers rely on intuition for CPU utilization estimates. This fails because RTOS systems exhibit complex runtime behaviors that are hard to reason about statically. Priority inversion, variable interrupt latency, task interference causing jitter, and peak stack usage during worst-case interrupt nesting all demand quantitative analysis. Profiling reveals actual CPU load, execution times, response times, and stack headroom.

Enabling FreeRTOS Runtime Statistics

FreeRTOS includes built-in support for per-task CPU usage measurement. Enabling it requires three configuration directives and a hardware timer:

/* In FreeRTOSConfig.h */
#define configUSE_TRACE_FACILITY         1
#define configGENERATE_RUN_TIME_STATS    1
#define configUSE_STATS_FORMATTING_FUNCS 1

#ifndef __ASSEMBLER__
extern void vConfigureTimerForRunTimeStats(void);
extern uint32_t ulGetRunTimeCounterValue(void);

/* Hardware timer access macros */
#define portCONFIGURE_TIMER_FOR_RUN_TIME_STATS()  vConfigureTimerForRunTimeStats()
#define portGET_RUN_TIME_COUNTER_VALUE()          ulGetRunTimeCounterValue()
#endif

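The runtime counter should run at least 10 times faster than the tick interrupt; the FreeRTOS documentation suggests 10 to 100 times. For a 1 kHz tick rate, a 1 MHz timer provides microsecond granularity. Note that a 32-bit counter clocked at 1 MHz wraps after about 71.6 minutes (2^32 µs), so accumulated totals are only meaningful within that window. On STM32 parts where TIM2 is a 32-bit timer (the STM32F4 family, for example), TIM2 configured as a free-running counter is a common choice: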

void vConfigureTimerForRunTimeStats(void)
{
    __HAL_RCC_TIM2_CLK_ENABLE();
    
    uint32_t tim_clk = HAL_RCC_GetPCLK1Freq();
    /* On STM32 families such as F4, the timer clock is 2 * PCLK1 whenever the
     * APB1 prescaler is not 1. The MSB of the PPRE1 field (RCC_CFGR_PPRE1_2)
     * is set for every divide-by-2-or-more setting, so testing it is enough. */
    if ((RCC->CFGR & RCC_CFGR_PPRE1_2) != 0) {
        tim_clk *= 2;
    }
    
    TIM2->PSC = (tim_clk / 1000000UL) - 1;
    TIM2->ARR = 0xFFFFFFFF;
    TIM2->EGR = TIM_EGR_UG; /* Trigger update event to load prescaler */
    TIM2->CR1 = TIM_CR1_CEN;
}

uint32_t ulGetRunTimeCounterValue(void)
{
    return TIM2->CNT;
}
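
The prescaler arithmetic and the counter's wrap behavior can be checked on a host before touching hardware. The following is a minimal sketch; `prescaler_for` and `counter_delta` are hypothetical helper names, not part of FreeRTOS or the STM32 HAL, and the sketch assumes the timer clock is an exact multiple of the desired counter rate.

```c
#include <stdint.h>

/* Hypothetical helper: prescaler register value that divides the timer
 * input clock down to the desired counter rate. Assumes timer_clk_hz is an
 * exact multiple of counter_hz and that the result fits the PSC register. */
static inline uint32_t prescaler_for(uint32_t timer_clk_hz, uint32_t counter_hz)
{
    return (timer_clk_hz / counter_hz) - 1U;
}

/* Hypothetical helper: elapsed counts between two reads of a free-running
 * 32-bit counter. Unsigned subtraction is modulo 2^32, so the result stays
 * correct across a single wrap of the counter. */
static inline uint32_t counter_delta(uint32_t earlier, uint32_t later)
{
    return later - earlier;
}
```

For an 84 MHz timer clock and a 1 MHz target rate, `prescaler_for` yields 83, which matches the `TIM2->PSC` expression above. Measuring intervals with `counter_delta` rather than comparing raw counter values keeps short measurements valid even when the counter wraps in between.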

With runtime stats enabled, two APIs become available:

/* Per-task CPU usage as a percentage */
char buf[512];
vTaskGetRunTimeStats(buf);
printf("%s\n", buf);

/* System-wide task state dump */
vTaskList(buf);
printf("%s\n", buf);

The vTaskList output includes each task’s name, state, priority, stack high-water mark, and task number. The vTaskGetRunTimeStats output provides the accumulated runtime count and automatically calculates the CPU utilization percentage for each task.
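
The percentage column is simply each task's accumulated counter divided by the total run time. A minimal sketch of that arithmetic, using a hypothetical helper `task_cpu_percent` (an illustration, not the kernel's own code), looks like this:

```c
#include <stdint.h>

/* Hypothetical helper: share of the total run time consumed by one task,
 * as an integer percentage rounded down. A 64-bit intermediate avoids
 * overflow when the counters are close to the 32-bit limit. */
static inline uint32_t task_cpu_percent(uint32_t task_counts,
                                        uint32_t total_counts)
{
    if (total_counts == 0U) {
        return 0U; /* nothing measured yet */
    }
    return (uint32_t)(((uint64_t)task_counts * 100U) / total_counts);
}
```

Because the result is rounded down, a task that uses less than 1% of the CPU evaluates to 0 here; the formatted FreeRTOS output reports such tasks as below 1% rather than as zero.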

Understanding Key Performance Metrics

Effective profiling focuses on four core metrics:

Metric	Definition	Why It Matters
Execution Time	Total CPU time a task spends in the Running state	Reveals computational cost of algorithm and ISR interference
Response Time	Time from task activation to task completion	Captures blocking from mutexes, queue waits, and preemption
CPU Load	Percentage of total time the CPU is executing (not idle)	Determines headroom for new features and burst workloads
Stack High Water Mark	Minimum free stack space since task creation	Safety margin for stack overflow prevention

The distinction between execution time and response time is critical. A control algorithm that executes in 2 ms but has a response time of 15 ms is spending 13 ms not running — likely preempted by a higher-priority task, or blocked waiting on a mutex or queue. Execution-time-only profiling would miss this entirely.
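
The arithmetic behind that example can be written down directly. This is a minimal sketch with a hypothetical `task_instance_t` record; in practice the three timestamps would come from a trace tool or from manual instrumentation, neither of which is shown here.

```c
#include <stdint.h>

/* Hypothetical record of one task instance, from activation to completion.
 * All values are in microseconds. */
typedef struct {
    uint32_t activated_us;  /* when the task became ready to run */
    uint32_t completed_us;  /* when it finished this job */
    uint32_t executed_us;   /* time actually spent in the Running state */
} task_instance_t;

/* Response time: activation to completion, regardless of who was running. */
static inline uint32_t response_time_us(const task_instance_t *t)
{
    return t->completed_us - t->activated_us;
}

/* Time spent preempted or blocked: response time minus execution time. */
static inline uint32_t waiting_time_us(const task_instance_t *t)
{
    return response_time_us(t) - t->executed_us;
}
```

For a job activated at 1,000 µs, completed at 16,000 µs, and executing for 2,000 µs, the response time is 15,000 µs and the waiting time is 13,000 µs, which is the 2 ms versus 15 ms case described above.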

Stack Usage Optimization

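Runtime stack monitoring via uxTaskGetStackHighWaterMark() is a practical, empirical way to size task stacks, provided the tests exercise the worst-case code paths: the function reports only the deepest stack usage actually reached so far. The recommended workflow during development: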

/* Ensure INCLUDE_uxTaskGetStackHighWaterMark is set to 1 in FreeRTOSConfig.h */

/* Call periodically from a monitor task during integration testing */
void profile_task_stacks(void)
{
    TaskHandle_t handles[] = { sensorTaskHandle,
                               commsTaskHandle,
                               controlTaskHandle };
    const char *names[] = { "Sensor", "Comms", "Control" };

    for (size_t i = 0; i < sizeof(handles) / sizeof(handles[0]); i++)
    {
        UBaseType_t hwm = uxTaskGetStackHighWaterMark(handles[i]);
        /* On Cortex-M, hwm is in words (4 bytes each) */
        uint32_t free_bytes = hwm * sizeof(StackType_t);
        printf("%s: %lu bytes free\n", names[i], (unsigned long)free_bytes);
    }
}

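Final stack size should be set to peak observed usage (allocated size minus the high-water mark) plus a 25–50% safety margin. This margin accounts for untested code paths and future changes. Interrupt nesting matters less for task stacks on Cortex-M ports, where ISRs run on the separate main stack: nesting depth mainly affects that stack, while each task stack still has to hold the saved context whenever the task is preempted.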
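
The sizing rule can be turned into a small calculation. The sketch below uses a hypothetical helper, `recommended_stack_words`, and assumes all quantities are in stack words, the same unit that xTaskCreate() and uxTaskGetStackHighWaterMark() use.

```c
#include <stdint.h>

/* Hypothetical helper: recommended stack depth in words.
 *   allocated_words  depth the task was created with
 *   hwm_words        value returned by uxTaskGetStackHighWaterMark()
 *   margin_percent   safety margin, for example 25 to 50
 * Assumes hwm_words <= allocated_words. */
static inline uint32_t recommended_stack_words(uint32_t allocated_words,
                                               uint32_t hwm_words,
                                               uint32_t margin_percent)
{
    uint32_t peak_used = allocated_words - hwm_words;
    /* Round the margin up so it is never smaller than requested. */
    return peak_used + (peak_used * margin_percent + 99U) / 100U;
}
```

A task created with 512 words that reports a high-water mark of 212 words has used at most 300 words; with a 25% margin the helper suggests 375 words, and with 50% it suggests 450.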

Trace-Based Profiling with Tracealyzer

For complex timing issues — sporadic deadline misses, intermittent priority inversion, variable response times — timeline visualization is invaluable. Percepio Tracealyzer records FreeRTOS kernel events (context switches, API calls, ISR entries/exits) and displays them as an interactive Gantt-chart-like trace view.

Execution Timeline  (simplified Tracealyzer view)
=================================================

ISR      /-\  /-\    /-\        /-\
         | |  | |    | |        | |
Task A   +----------+  +----------+
         |  running |  |  running |
         +----------+  +----------+
              ^ blocked by mutex held by Task B

Task B   +--+      +-------+
         |  |      |running|
         +--+      +-------+
         ^         ^
         |         +-- releases mutex, Task A unblocks
         +-- acquires mutex

Tracealyzer automatically calculates execution time, response time, and blocking time for each task instance. The tool also visualizes CPU load, stack usage, and heap allocation over time — making it straightforward to correlate a performance anomaly with its root cause.

Questions that are nearly impossible to answer with printf debugging alone:

Which specific preemption caused the worst-case response time?
How often does priority inversion occur on a given mutex?
Is there a periodic CPU load spike correlated with a specific event?

Practical Optimization Strategies

Once profiling data is available, four optimizations tend to pay off most often:

1. Reduce ISR Duration: Long ISRs delay every interrupt of equal or lower priority, and every task. Move processing out of the ISR and into task context using a queue or task notification: the ISR sends a notification to a handler task, which does the actual work.

2. Eliminate Unnecessary Blocking: If profiling reveals that a high-priority task spends significant time waiting on a mutex held by a lower-priority task, restructure the access pattern. Use a mutex with priority inheritance (FreeRTOS mutexes created with xSemaphoreCreateMutex() provide it; binary semaphores do not), shorten the critical section, or switch to a lock-free ring buffer for ISR-to-task data transfer.

3. Batch Processing: A task that wakes every 1 ms to process a single sample can spend more CPU time on context-switch overhead than on actual work. Batching 10 samples per wake-up cuts context switches tenfold, at the cost of up to 9 ms of added latency for the oldest sample in each batch.

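4. Tick Rate Selection: The tick interrupt frequency affects both timing resolution and power consumption, because every tick wakes the CPU. If deadlines permit, tickless idle mode (configUSE_TICKLESS_IDLE) suppresses the tick while the system is idle, which can significantly reduce power draw in applications that sleep most of the time.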
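
Strategies 2 and 3 combine naturally: an ISR can push samples into a lock-free ring buffer, and the consumer task can drain several samples per wake-up. The following is a minimal single-producer, single-consumer sketch using C11 atomics. The names `ring_t`, `ring_push`, and `ring_pop_batch`, the `uint16_t` sample type, and the capacity of 16 are all assumptions made for illustration; the code is only safe with exactly one producer and one consumer.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_CAPACITY 16U /* must be a power of two */

typedef struct {
    uint16_t slots[RING_CAPACITY];
    atomic_uint head; /* written only by the producer (for example, an ISR) */
    atomic_uint tail; /* written only by the consumer (a task) */
} ring_t;

/* Producer side: returns false when the buffer is full (sample dropped). */
static inline bool ring_push(ring_t *r, uint16_t sample)
{
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_CAPACITY) {
        return false;
    }
    r->slots[head & (RING_CAPACITY - 1U)] = sample;
    /* Publish the slot only after the sample has been written. */
    atomic_store_explicit(&r->head, head + 1U, memory_order_release);
    return true;
}

/* Consumer side: copies up to max samples into out and returns the count. */
static inline size_t ring_pop_batch(ring_t *r, uint16_t *out, size_t max)
{
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);
    size_t n = 0;
    while (n < max && tail != head) {
        out[n++] = r->slots[tail & (RING_CAPACITY - 1U)];
        tail++;
    }
    /* Hand the drained slots back to the producer. */
    atomic_store_explicit(&r->tail, tail, memory_order_release);
    return n;
}
```

On a target, the consumer task would typically block on a task notification and call `ring_pop_batch` with a batch size of 10 when it wakes. The buffer must be deep enough to absorb the samples that arrive while the task is waiting to run, otherwise `ring_push` starts dropping data.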

Profiling in Production

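The FreeRTOS runtime counter adds negligible overhead, a single timer read per context switch, so it can remain active in production. Be more cautious with uxTaskGetStackHighWaterMark(): it scans the unused stack memory for the fill pattern, so its cost grows with the amount of free stack. In production, high-water-mark checks should be disabled or run very infrequently from a low-priority diagnostic task. The formatting helpers vTaskGetRunTimeStats() and vTaskList() are likewise described in the FreeRTOS documentation as debug aids not intended for normal runtime use, so keep them out of time-critical paths.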

Summary

RTOS performance profiling transforms debugging from guesswork into engineering. Enable FreeRTOS runtime statistics early in development — the setup cost is minimal and the data is invaluable. Use uxTaskGetStackHighWaterMark() to right-size task stacks with empirical data instead of estimates. For complex timing issues, trace visualization tools like Tracealyzer make the invisible visible: priority inversion, sporadic deadline misses, and CPU load spikes become obvious on an interactive timeline. Profile early, profile often, and let data drive optimization decisions.

Frequently Asked Questions

How do you measure CPU utilization in an RTOS?

Measure the run time of the idle task. Because the idle task runs only when no other task is ready, CPU utilization is `100% × (1 − Idle Task Run Time / Total Time)`.
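
Computing the load over a recent window, rather than since boot, makes bursts visible. The sketch below assumes two snapshots of the idle task's run-time counter and the total run-time counter (values that uxTaskGetSystemState() can report when runtime stats are enabled). `load_sample_t` and `cpu_load_permille` are hypothetical names used only for this illustration.

```c
#include <stdint.h>

/* Hypothetical snapshot of two run-time counters taken at the same moment. */
typedef struct {
    uint32_t idle_counts;  /* accumulated run time of the idle task */
    uint32_t total_counts; /* total run-time counter */
} load_sample_t;

/* CPU load over the window between two snapshots, in tenths of a percent
 * (0 to 1000). Returns 0 if no time has elapsed. */
static inline uint32_t cpu_load_permille(load_sample_t prev, load_sample_t now)
{
    /* Unsigned subtraction stays correct across one counter wrap. */
    uint32_t total = now.total_counts - prev.total_counts;
    uint32_t idle = now.idle_counts - prev.idle_counts;

    if (total == 0U) {
        return 0U;
    }
    if (idle > total) {
        idle = total; /* guard against snapshots taken slightly apart */
    }
    return 1000U - (uint32_t)(((uint64_t)idle * 1000U) / total);
}
```

If the idle task accumulated 750,000 counts during a window of 1,000,000 counts, the function returns 250, that is, a 25.0% CPU load.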

What is execution tracing and why is it useful?

Tracing records scheduling events, context switches, interrupts, and queue operations with timestamps. Visualization tools (like Percepio Tracealyzer) help diagnose timing jitter, priority inversions, and deadlocks.

How does queue buffer sizing affect performance?

FreeRTOS queues copy each item by value, so sending a large item costs a large memory copy on both send and receive. For large items, queue a pointer to a buffer taken from a statically allocated memory pool instead of the data structure itself.
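
A static pool of this kind can be very small. The following is a minimal sketch with hypothetical names (`big_msg_t`, `pool_alloc`, `pool_free`) and an assumed pool of four 256-byte messages. It is not thread-safe as written; on a target, the allocation and release would need to run inside a critical section.

```c
#include <stddef.h>
#include <stdint.h>

#define POOL_BLOCKS 4U

/* Hypothetical large message; only a pointer to it travels through the queue. */
typedef struct {
    uint8_t payload[256];
} big_msg_t;

static big_msg_t pool_storage[POOL_BLOCKS];
static uint8_t pool_in_use[POOL_BLOCKS];

/* Returns a free block, or NULL when the pool is exhausted. */
static inline big_msg_t *pool_alloc(void)
{
    for (size_t i = 0; i < POOL_BLOCKS; i++) {
        if (!pool_in_use[i]) {
            pool_in_use[i] = 1U;
            return &pool_storage[i];
        }
    }
    return NULL;
}

/* Returns a block to the pool; pointers outside the pool are ignored. */
static inline void pool_free(big_msg_t *msg)
{
    if (msg == NULL) {
        return;
    }
    size_t i = (size_t)(msg - pool_storage);
    if (i < POOL_BLOCKS) {
        pool_in_use[i] = 0U;
    }
}
```

The sender allocates a block, fills it, and queues the pointer; the receiver processes the data and calls `pool_free`. The queue then copies only a pointer-sized item per message instead of 256 bytes, and ownership of the block passes from sender to receiver along with the pointer.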