
RTOS Performance Profiling and Optimization Techniques

By embeddedSoft
Published in Embedded OS
June 10, 2026
3 min read

Table Of Contents

01 Why Profile an RTOS Application?
02 Enabling FreeRTOS Runtime Statistics
03 Understanding Key Performance Metrics
04 Stack Usage Optimization
05 Trace-Based Profiling with Tracealyzer
06 Practical Optimization Strategies
07 Profiling in Production
08 Summary
09 References

Performance profiling is a critical yet often deferred phase in embedded development. On microcontrollers running an RTOS, understanding where CPU cycles go and whether deadlines are met can determine product success.

This article covers practical techniques for measuring and optimizing RTOS performance — from built-in FreeRTOS runtime stats to commercial trace visualization.

Why Profile an RTOS Application?

Many developers rely on intuition for CPU utilization estimates. This fails because RTOS systems exhibit complex runtime behaviors that are hard to reason about statically. Priority inversion, variable interrupt latency, task interference causing jitter, and peak stack usage during worst-case interrupt nesting all demand quantitative analysis. Profiling reveals actual CPU load, execution times, response times, and stack headroom.

Enabling FreeRTOS Runtime Statistics

FreeRTOS includes built-in support for per-task CPU usage measurement. Enabling it requires three configuration directives and a hardware timer:

/* In FreeRTOSConfig.h */
#define configUSE_TRACE_FACILITY 1
#define configGENERATE_RUN_TIME_STATS 1
#define configUSE_STATS_FORMATTING_FUNCS 1
#ifndef __ASSEMBLER__
extern void vConfigureTimerForRunTimeStats(void);
extern uint32_t ulGetRunTimeCounterValue(void);
/* Hardware timer access macros */
#define portCONFIGURE_TIMER_FOR_RUN_TIME_STATS() vConfigureTimerForRunTimeStats()
#define portGET_RUN_TIME_COUNTER_VALUE() ulGetRunTimeCounterValue()
#endif

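The runtime counter should run at least 10 times faster than the tick frequency (FreeRTOS recommends a time base 10 to 100 times faster than the tick). A faster time base gives finer resolution but overflows sooner: with a 1 kHz tick rate, a 1 MHz timer provides microsecond granularity, and a 32-bit counter at that rate wraps roughly every 71.6 minutes. On STM32 parts where TIM2 is a 32-bit timer, configuring it as a free-running counter is a common choice: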

void vConfigureTimerForRunTimeStats(void)
{
    __HAL_RCC_TIM2_CLK_ENABLE();
    uint32_t tim_clk = HAL_RCC_GetPCLK1Freq();
    /* On STM32, if the APB1 prescaler is not 1, the timer clock is 2 * PCLK1.
     * We check the MSB of the PPRE1 prescaler field (RCC_CFGR_PPRE1_2) to see if it is configured. */
    if ((RCC->CFGR & RCC_CFGR_PPRE1_2) != 0) {
        tim_clk *= 2;
    }
    TIM2->PSC = (tim_clk / 1000000UL) - 1;
    TIM2->ARR = 0xFFFFFFFF;
    TIM2->EGR = TIM_EGR_UG; /* Trigger update event to load prescaler */
    TIM2->CR1 = TIM_CR1_CEN;
}

uint32_t ulGetRunTimeCounterValue(void)
{
    return TIM2->CNT;
}
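
When the same free-running counter is also used to time individual code sections, the readings have to survive a counter wrap. The following is a minimal sketch, assuming a 32-bit counter such as the one configured above; the helper names elapsed_counts and wrap_period_seconds are hypothetical and not part of FreeRTOS or the STM32 HAL.

```c
#include <stdint.h>

/* Elapsed counts between two readings of a free-running 32-bit counter.
 * Unsigned subtraction is performed modulo 2^32, so the result stays
 * correct as long as the counter wrapped at most once between readings. */
static inline uint32_t elapsed_counts(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Whole seconds until a 32-bit counter clocked at hz wraps around.
 * hz must be non-zero. */
static inline uint32_t wrap_period_seconds(uint32_t hz)
{
    return (uint32_t)(0x100000000ULL / hz);
}
```

For example, a section that starts at count 0xFFFFFFF0 and ends at 0x00000010 measures 0x20 counts despite the wrap, and a 1 MHz counter wraps after 4294 whole seconds.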

With runtime stats enabled, two APIs become available:

/* Per-task CPU usage as a percentage */
char buf[512];
vTaskGetRunTimeStats(buf);
printf("%s\n", buf);
/* System-wide task state dump */
vTaskList(buf);
printf("%s\n", buf);

The vTaskList output includes each task’s name, state, priority, stack high-water mark, and task number. The vTaskGetRunTimeStats output provides the accumulated runtime count and automatically calculates the CPU utilization percentage for each task.
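
To make the percentage column less of a black box, the calculation can be sketched in plain C. This is an illustration under stated assumptions, not the kernel's code: cpu_percent is a hypothetical helper, and both arguments are assumed to be 32-bit run-time counts taken from the same counter.

```c
#include <stdint.h>

/* Integer CPU share of one task, in percent.
 * task_counts:  run-time counts accumulated by the task
 * total_counts: run-time counts accumulated by the whole system
 * Dividing the total by 100 first avoids overflowing a 32-bit multiply.
 * The result is truncated, so shares below 1% read as 0. */
static inline uint32_t cpu_percent(uint32_t task_counts, uint32_t total_counts)
{
    uint32_t counts_per_percent = total_counts / 100U;
    if (counts_per_percent == 0U) {
        return 0U; /* too little data collected to compute a share */
    }
    return task_counts / counts_per_percent;
}
```

A task that accumulated 250,000 counts out of a system total of 1,000,000 comes out at 25%.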

Understanding Key Performance Metrics

Effective profiling focuses on four core metrics:

| Metric | Definition | Why It Matters |
|---|---|---|
| Execution Time | Total CPU time a task spends in the Running state | Reveals computational cost of algorithm and ISR interference |
| Response Time | Time from task activation to task completion | Captures blocking from mutexes, queue waits, and preemption |
| CPU Load | Percentage of total time the CPU is executing (not idle) | Determines headroom for new features and burst workloads |
| Stack High Water Mark | Minimum free stack space since task creation | Safety margin for stack overflow prevention |

The distinction between execution time and response time is critical. A control algorithm that executes in 2 ms but has a response time of 15 ms is spending 13 ms not running — likely preempted by a higher-priority task, or blocked waiting on a mutex or queue. Execution-time-only profiling would miss this entirely.
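
The relationship between the two metrics can be written down as a small bookkeeping structure. The sketch below is hypothetical (job_timing_t and its helpers are not FreeRTOS types) and assumes all three values are in microseconds from the same free-running counter.

```c
#include <stdint.h>

/* Timing record for one task instance (job). */
typedef struct {
    uint32_t activated_us; /* job became ready, e.g. its period started */
    uint32_t completed_us; /* job finished its work */
    uint32_t executed_us;  /* time actually spent in the Running state */
} job_timing_t;

/* Response time: activation to completion (wrap-safe subtraction). */
static inline uint32_t job_response_us(const job_timing_t *job)
{
    return job->completed_us - job->activated_us;
}

/* Time the job was ready or blocked but not running. */
static inline uint32_t job_waiting_us(const job_timing_t *job)
{
    return job_response_us(job) - job->executed_us;
}
```

With the figures from the example above (2 ms of execution inside a 15 ms response), job_waiting_us reports 13,000 µs, which is the portion that execution-time-only profiling never sees.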

Stack Usage Optimization

Runtime stack monitoring via uxTaskGetStackHighWaterMark() is the most reliable method for sizing task stacks. The recommended workflow during development:

/* Ensure INCLUDE_uxTaskGetStackHighWaterMark is set to 1 in FreeRTOSConfig.h */
/* Call periodically from a monitor task during integration testing */
void profile_task_stacks(void)
{
    TaskHandle_t handles[] = { sensorTaskHandle,
                               commsTaskHandle,
                               controlTaskHandle };
    const char *names[] = { "Sensor", "Comms", "Control" };
    for (int i = 0; i < 3; i++)
    {
        UBaseType_t hwm = uxTaskGetStackHighWaterMark(handles[i]);
        /* On Cortex-M, hwm is in words (4 bytes each) */
        uint32_t free_bytes = hwm * sizeof(StackType_t);
        printf("%s: %lu bytes free\n", names[i], (unsigned long)free_bytes);
    }
}

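Set the final stack size to the peak observed usage plus a 25–50% safety margin. The margin covers untested code paths, future changes, and, on ports where interrupts run on the interrupted task's stack, variations in interrupt nesting depth in the field. On Cortex-M, nested handlers run on the separate main stack, so each task stack only has to absorb a single exception frame.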
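
The sizing arithmetic is simple enough to automate in a test harness. A minimal sketch, assuming the allocated depth and the high-water mark are both expressed in stack words; the function names are hypothetical:

```c
#include <stdint.h>

/* Peak stack usage in words: the depth passed at task creation minus
 * the value reported by uxTaskGetStackHighWaterMark(). */
static inline uint32_t peak_stack_words(uint32_t allocated_words,
                                        uint32_t hwm_words)
{
    return allocated_words - hwm_words;
}

/* Suggested stack depth: peak usage plus margin_pct percent, rounded up. */
static inline uint32_t suggested_stack_words(uint32_t peak_words,
                                             uint32_t margin_pct)
{
    return (peak_words * (100U + margin_pct) + 99U) / 100U;
}
```

For a task created with 256 words that reports a high-water mark of 96 words, peak usage is 160 words; a 25% margin suggests 200 words and a 50% margin suggests 240.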

Trace-Based Profiling with Tracealyzer

For complex timing issues — sporadic deadline misses, intermittent priority inversion, variable response times — timeline visualization is invaluable. Percepio Tracealyzer records FreeRTOS kernel events (context switches, API calls, ISR entries/exits) and displays them as an interactive Gantt-chart-like trace view.

Execution Timeline (simplified Tracealyzer view)
=================================================
ISR           /-\       /-\       /-\       /-\
              | |       | |       | |       | |
Task A      +----------+         +----------+
            | running  |         | running  |
            +----------+         +----------+
                       ^ blocked by mutex held by Task B
Task B  +--+            +-------+
        |  |            |running|
        +--+            +-------+
         ^                      ^
         |                      +-- releases mutex, Task A unblocks
         +-- acquires mutex

Tracealyzer automatically calculates execution time, response time, and blocking time for each task instance. The tool also visualizes CPU load, stack usage, and heap allocation over time — making it straightforward to correlate a performance anomaly with its root cause.

Questions that are nearly impossible to answer with printf debugging alone:

  • Which specific preemption caused the worst-case response time?
  • How often does priority inversion occur on a given mutex?
  • Is there a periodic CPU load spike correlated with a specific event?

Practical Optimization Strategies

Once profiling data is available, the most common optimizations yield significant results:

1. Reduce ISR Duration: Long ISRs delay every interrupt of equal or lower priority, and every task. Move processing out of the ISR and into task context: the ISR sends a queue item or task notification to a handler task, which does the actual work.

2. Eliminate Unnecessary Blocking: If profiling reveals that a high-priority task spends significant time waiting on a mutex held by a lower-priority task, restructure the access pattern: protect shared resources with mutexes rather than binary semaphores (FreeRTOS mutexes implement priority inheritance), shorten critical sections, or switch to lock-free ring buffers for ISR-to-task data transfer.

3. Batch Processing: A task that wakes every 1 ms to process a single sample can spend more CPU time on context switching than on actual work. Batching 10 samples per wake-up cuts the number of wake-ups by a factor of 10, at the cost of up to 9 ms of added latency for the oldest sample in each batch.

4. Tick Rate Selection: The tick interrupt frequency affects both timing resolution and power consumption. If deadlines permit, tickless idle mode (configUSE_TICKLESS_IDLE) suppresses the periodic tick while the system is idle, which can significantly reduce power draw.
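
The first three strategies combine naturally: an ISR hands samples to a task through a lock-free buffer, and the task drains them in batches. The following is a sketch under stated assumptions, not a drop-in driver: it uses C11 atomics, supports exactly one producer and one consumer, and the names sample_rb_t, rb_push, and rb_pop_batch are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RB_CAPACITY 16U /* must be a power of two */

typedef struct {
    uint16_t slots[RB_CAPACITY];
    atomic_uint head; /* free-running index, written only by the producer */
    atomic_uint tail; /* free-running index, written only by the consumer */
} sample_rb_t;

/* Producer side (e.g. an ISR). Returns false if the buffer is full,
 * in which case the sample is dropped. */
static inline bool rb_push(sample_rb_t *rb, uint16_t sample)
{
    unsigned head = atomic_load_explicit(&rb->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&rb->tail, memory_order_acquire);
    if (head - tail == RB_CAPACITY) {
        return false;
    }
    rb->slots[head & (RB_CAPACITY - 1U)] = sample;
    /* Release ordering publishes the slot before the new head is visible. */
    atomic_store_explicit(&rb->head, head + 1U, memory_order_release);
    return true;
}

/* Consumer side (e.g. a handler task). Copies up to max samples into
 * out and returns how many were copied. */
static inline size_t rb_pop_batch(sample_rb_t *rb, uint16_t *out, size_t max)
{
    unsigned tail = atomic_load_explicit(&rb->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&rb->head, memory_order_acquire);
    size_t n = 0;
    while (n < max && tail != head) {
        out[n++] = rb->slots[tail & (RB_CAPACITY - 1U)];
        tail++;
    }
    atomic_store_explicit(&rb->tail, tail, memory_order_release);
    return n;
}
```

In a FreeRTOS application the ISR would typically call rb_push() and then wake the handler with vTaskNotifyGiveFromISR() only once enough samples have accumulated, while the task blocks in ulTaskNotifyTake() and drains the buffer with rb_pop_batch(). The buffer instance must be zero-initialized, for example by giving it static storage duration.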

Profiling in Production

The FreeRTOS runtime counter adds negligible overhead, a single timer read per context switch, so it can remain active in production. However, be cautious with uxTaskGetStackHighWaterMark(): it scans the unused stack memory byte by byte for the fill pattern, so its cost grows with the amount of free stack. In production, either disable these checks or run them very infrequently from a low-priority diagnostic task.
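
The cost of a high-water-mark query is easier to judge with a simplified model of the scan. The sketch below is an illustration, not the kernel's implementation: it assumes a descending stack that was pre-filled with 0xA5 (the fill value FreeRTOS uses), and unused_stack_bytes is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL_BYTE 0xA5U /* pattern written to a new task's stack */

/* Counts the bytes at the low end of the stack that still hold the fill
 * pattern. On a descending stack the untouched region starts at the
 * lowest address, so the loop runs once per unused byte. */
static inline size_t unused_stack_bytes(const uint8_t *stack_low, size_t size)
{
    size_t n = 0;
    while (n < size && stack_low[n] == STACK_FILL_BYTE) {
        n++;
    }
    return n;
}
```

The loop length is proportional to the free space, so a generously sized stack is the most expensive one to check. A stack region that happens to contain the fill value as real data can also make the result slightly optimistic.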

Summary

RTOS performance profiling transforms debugging from guesswork into engineering. Enable FreeRTOS runtime statistics early in development — the setup cost is minimal and the data is invaluable. Use uxTaskGetStackHighWaterMark() to right-size task stacks with empirical data instead of estimates. For complex timing issues, trace visualization tools like Tracealyzer make the invisible visible: priority inversion, sporadic deadline misses, and CPU load spikes become obvious on an interactive timeline. Profile early, profile often, and let data drive optimization decisions.

References

  • Percepio Tracealyzer for FreeRTOS – Real-Time Trace Visualization
  • FreeRTOS Runtime Statistics Documentation – Official Kernel Reference

Embedded Systems Articles by Jithin Tom & Hermes (AI Agent)

Related Posts

Low-Power Design Patterns for RTOS-Based Embedded Systems
Low-Power Design Patterns for RTOS-Based Embedded Systems
June 07, 2026
4 min
© 2026, All Rights Reserved.
Powered By Netlyft

Quick Links

Advertise with usAbout UsContact Us

Social Media