Multitasking OS for Xmega

11-02-2021 1461 words 7 minutes

Contents

Last year, I received a request to extend the firmware of an existing project for a new product. The code, which was already quite intricate, risked becoming an incomprehensible and bug-ridden mess. So, while chatting with a friend, we agreed that “if there were threads, the code would be much more linear”. And that’s how, after a series of searches on GitHub and similar platforms, I ended up implementing a multitasking system for AVR Xmega.

Why AVR Again?

With the advent of 32-bit ARM controllers, the AVR architecture has got a bad rap. So why, in 2021, develop a multitasking system for this platform? The answer is rather simple: the hardware was already using this particular microcontroller, and a significant portion of the software had already been written.

Features

As a starting point, I referred to the xmultitasking repository ¹, which provided a very basic implementation that I extended to meet the following requirements:

Non-preemptive: it is the task currently in execution that voluntarily releases the CPU;
Static: no dynamic memory allocation;
“Systick” presence: a timer generates periodic interrupts (e.g., 1ms). A task can release the CPU to wait for a timeout;
Simple synchronization primitives.

Task Descriptor

In practice, a task corresponds to a function executed cyclically, for example:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13


void task1() __attribute__((noreturn));

void task1() {
  /* do something, but only once */

  for(;;) {
    /* do something */
    // Release CPU

    /* do something else */
    // Release CPU again
  }
}

Each task can be in one of the following three states:

Running
Ready
Blocked

Since the AVR is a single-core processor, only one task can be running at a time, thus holding CPU usage. A blocked task is waiting for some kind of event, such as a timeout, a specific interrupt, or an event triggered by another task. A ready task is part of a list, together with other tasks, waiting to resume execution as soon as the CPU becomes available (context switch).

Stack Partitioning

/images/multitasking-xmega-memory.png — AVR memory layout.

In a typical memory layout, static variables are allocated at the lowest memory addresses, at the beginning of the RAM. Traditionally, this area is divided into two sections, .data and .bss, with the latter reserved for uninitialized static variables. The sizes are known at compile-time and can be obtained by querying avr-size project.elf.

 text    data     bss     dec     hex filename
   58       2       2      62      3e project.elf

Automatic variables require a variable-sized area of memory, known as the stack. The stack area starts from the bottom of the RAM and grows towards lower addresses. The size is variable: memory is allocated when new variables are added to the scope and automatically deallocated when they go out of scope. Additionally, the stack stores the return address upon function call. To keep track of the stack head, a dedicated processor register called the stack pointer (SP) is used.

/images/multitasking-xmega-memory-task.png — Memory layout with MAX_TASK = 3.

Each task must have its own scope with its stack of automatic variables. Therefore, it is necessary to partition the available memory to virtually have MAX_TASK stack areas. In the current implementation, each task has the same stack size, STACK_SIZE, which is 200 bytes.

Stack Initialization

A task is created using the TASK_create function, which, except for a very short prologue, is an alter ego of the assembly routine TASK_init. As explained in the next paragraph, the context switch routine expects a specific stack layout. When creating a new task, all registers are set to zero except the program counter, which must match the address of the task’s handler function. Some models of Xmega CPU have bigger memory address space, larger than 64 kwords. In such cases, the program counter’s size is 24 bits, and an additional byte is needed on the stack. In this simple multitasking system, it is assumed that the handler is not located in this “extended memory,” so the most significant byte is always set to zero.

Context Switch

When switching from one task to another, both the CPU register contents and the current state of the stack pointer must be preserved. The context switch routine is implemented within the TASK_yield() function, which will be discussed in the next paragraph.

/images/multitasking-xmega-switch.png — Context switch execution between task i (orange) and task j (blue).

The operations to be performed, in order, are:

Saving the program counter for task i (return address, or RA).
Saving all general-purpose registers (r0-r31) and any auxiliary registers.
Saving the current stack pointer, SPi.
Loading the new stack pointer, SPj.
Loading the auxiliary registers and general-purpose registers.
Returning to the address of task j.

These are the only parts of the program that need to be implemented in pure assembly, as they directly modify the stack content and the respective pointer.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51


context_switch:
// Save general CPU registers (task i context)
push	r31
push	r30
; ... omitted ...
push	r1
push	r0
; save Xmega I/O register, omitted

// Save current stack pointer
; load SP vector base
ldi		zl, lo8(task_stack_ptr)
ldi		zh, hi8(task_stack_ptr)
; move to &SPi
lsl		r19				// old task index * 2
add		zl, r19
adc		zh, r1			// GGC keeps r1 always 0
; critical section, save stack pointer in SPi
cli
in		r16, CPU_SPL
st		Z+, r16
in		r16, CPU_SPH
sei
st		Z, r16

// Load next stack pointer
; load SP vector base (again!)
ldi		zl, lo8(task_stack_ptr)
ldi		zh, hi8(task_stack_ptr)
; move to &SPj
lsl		r17				// new task index * 2
add		zl, r17
adc		zh, r1			// GGC keeps r1 always 0
; critical section, store SPj in stack pointer
ld		r16, Z+
cli
out		CPU_SPL, r16
ld		r16, Z
out		CPU_SPH, r16
sei

// Reload general CPU registers (task j context)
; restore Xmega I/O register, omitted

pop		r0
pop		r1
; ... omitted ...
pop		r30
pop		r31

; ret

It is interesting to note that at no point in the code are the return addresses manipulated, or at least not visibly. This is because the call instruction, used here to call TASK_yield, already saves the return address on the stack by pushing the two or three bytes of the program counter. Conversely, ret performs the inverse operation to return to the caller. The trick that allows TASK_yield to perform the context switch lies in moving the stack pointer to the intermediate code section between call and ret.

Scheduler

The scheduler implements a simple round-robin algorithm without priorities or preemption, so it must be manually invoked using the aforementioned TASK_yield() function.

In short, the scheduler continues cycling as long as the enabled task mask task_enable_mask_AT is zero. When at least one active task is found, the first one after the current task is selected, and a context switch is performed.

If the current task is the only one that has been awakened, a context switch is avoided, saving some machine cycles.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24


// find next task
	lds		r17, task_index
find_next_task:
	lds		r16, task_index_mask
	lsl		r16
	inc		r17
	cpi		r17, 8
	brlt	no_wrap
	ldi		r17, 0			// task_index
	ldi		r16, 1			// task_index_mask
no_wrap:
	sts		task_index, r17
	sts		task_index_mask, r16

	//lds		r18, task_enable_mask_AT
	and		r16, r18		// & task_enable_mask_AT
	breq	find_next_task	// not enabled

	// Found a task

	// 1) check if same task, avoiding some context switch overhead
	cp		r17, r19
	brne	context_switch
	jmp		same_task

Stack Smash!

The AVR architecture does not have any hardware memory protection mechanisms. However, since a task could allocate more stack space than necessary, leading to disastrous consequences, I deemed it appropriate to implement at least a rudimentary software mechanism for control. I took inspiration from canary bytes².

The principle is simple: at the end of each stack segment, I place a byte initialized with a known value (canary byte). During each context switch, I check if the content remains unchanged. If it has changed, I can conclude that the task about to release the CPU has grown too large and corrupted the stacks of other tasks.

In these unfortunate cases, there is little that can be done. I have implemented the program to abort, potentially flashing a menacing red LED. If I am debugging, I can inspect the register contents to note which task caused the damage and save a memory dump.

Improvements

Stack smash detector
Watchdog
Sleep mode