RISC-V Bare Metal Programming - Chapter 4: Another Brick in the Wall

Submitted by MarcAdmin on Sat, 11/30/2019 - 01:41

Chapter 3 of this RISC-V bare metal tutorial studied the linking process and how a developer can control where code and data are placed in memory. Constants, initialized variables and uninitialized variables were defined and explicitly positioned in RAM as prescribed by a linker script. The running example program was updated to read operands from RAM to perform its task, and subsequently store the result in a different location in RAM. However, up to this point only the base RV64I instruction set has been used. This chapter will explore some of the standard extensions available in the RISC-V ISA.

One of the objectives in the design of the RISC-V ISA is to support many different deployment environments which may have varying constraints for efficiency, performance, and cost. For this reason, the base instruction set was restricted to the minimum required to build a useful program. This reduces the processor complexity, potentially yielding performance and efficiency gains. However, these gains may be lost when performing more complex computations. To address potential limitations in the base instruction set, optional standard extensions have been defined to expand the available set of instructions. The standard extensions available for the 32- and 64-bit instruction sets include:

M: Support for multiply and divide (RV32M and RV64M).
A: Atomic operations (RV32A and RV64A).
F: Floating point support (RV32F and RV64F).
D: Double precision floating point support (RV32D and RV64D).

This set of standard extensions is typically included in most implementations of RISC-V cores. The base set plus these extensions is often referred to as the G instruction set (RV32G or RV64G). Each of these standard extensions will be explored in this chapter.

Multiply

The M extension provides instructions for multiplying and dividing integers using both word and double-word length operands. When using word length operands, the product never requires more than 64 bits, so it fits in a single RV64I register. The following listing of the product.s source file shows the assembly code of a function to multiply word sized integer operands:

 1:         .text
 2:         .align 2
 3:         .global __imul32
 4: __imul32:
 5:         # Input:
 6:         # a0: 32-bit multiplicand
 7:         # a1: 32-bit multiplier
 8:         # Result:
 9:         # a0: 64-bit product
10:         addi    sp, sp, -32
11:         sd      ra, 24(sp)
12:         mul     a0, a0, a1
13:         ld      ra, 24(sp)
14:         addi    sp, sp, 32
15:         ret

Because the arguments of this function are expected to be sign-extended word-length values (as produced by lw), the full product fits in 64 bits and can be calculated using a single instruction (mul on line 12). The mulw instruction is not used here because it keeps only the low 32 bits of the product, sign-extended to 64 bits. The main program can be updated as follows to invoke the __imul32 function:

 1:         .section ".text.init"
 2:         .align 2
 3:         .global _start
 4:         .global _stack_end
 5: _start:
 6:         lw      a0, operand1
 7:         lw      a1, operand2
 8:         la      sp,_stack_end
 9:         call    sum
10:         la      t1, result1
11:         sw      a0, 0(t1)
12:         call    __imul32
13:         la      t1, result2
14:         sd      a0, 0(t1)
15: stop:   j       stop
16:         .section ".rodata"
17: operand1:       .word   4
18:         .data
19: operand2:       .word   5
20:         .bss
21: result1:        .word   0
22: result2:        .dword  0

The .bss section of the ELF file was updated to declare two result variables: result1 on line 21 which will hold the sum of the operands in a word, and result2 on line 22 which will hold their product in a double-word.

After the sum of the operands is calculated, and the result is saved in memory, it is kept in register a0 to be used as the multiplicand. The value of operand2 will be used as the multiplier; its value should still be in the a1 register since its content is not modified by the sum function. The __imul32 function is then called on line 12 and the result is saved in memory at line 14.

The program can be assembled, linked and executed in qemu using the following sequence of commands:

riscv64-unknown-elf-as -o add.o add.s
riscv64-unknown-elf-as -o main.o main.s
riscv64-unknown-elf-as -o product.o product.s
riscv64-unknown-elf-ld -T chapter3.lds -o main.elf add.o main.o product.o
qemu-system-riscv64 -M virt -serial /dev/null -nographic -kernel main.elf
QEMU 3.1.0 monitor - type 'help' for more information
(qemu)

The chapter3.lds linker script is the same one that was used in chapter 3. The result values can be inspected from the qemu console using the xp command:

(qemu) xp /1wd 0x80001004
0000000080001004:          9
(qemu) xp /1gd 0x80001008
0000000080001008:         45
(qemu)

The location of result1 in memory is the same as result from the previous chapter. The memory location of result2 will be 4 bytes beyond result1 since result1 is 32 bits wide. Therefore the product can be found at address 0x80001008. This can easily be verified using the objdump utility:

$ riscv64-unknown-elf-objdump -D -j.bss main.elf 

main.elf:     file format elf64-littleriscv


Disassembly of section .bss:

0000000080001004 <result1>:
    80001004:	0000                	unimp
	...

0000000080001008 <result2>:
	...

As expected, the product of 9 and 5 is 45.
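The difference between the word and double-word multiply instructions can be sketched with a small Python model. This is not part of the original program: the helper names (sext, mul, mulw) are illustrative, and the functions only mimic the register semantics described in the RISC-V specification, where mul keeps the low 64 bits of the product and mulw keeps the low 32 bits and sign-extends them.

```python
MASK32 = (1 << 32) - 1


def sext(value, bits):
    """Interpret the low `bits` bits of value as a two's complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mul(rs1, rs2):
    """Model of RV64 mul: low 64 bits of the product, read as a signed value."""
    return sext(rs1 * rs2, 64)


def mulw(rs1, rs2):
    """Model of RV64 mulw: low 32 bits of the product, sign-extended."""
    return sext((rs1 & MASK32) * (rs2 & MASK32), 32)


print(mul(9, 5), mulw(9, 5))      # 45 45: small operands give the same result
print(mul(100000, 100000))        # 10000000000: the full product fits in 64 bits
print(mulw(100000, 100000))       # 1410065408: only the low 32 bits survive
```

With small operands such as 9 and 5 both instructions agree, which is why the difference only shows up once the product exceeds 32 bits.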

Multiplication using registers is a little more complicated when dealing with 64-bit values, because the product can be wider (in bits) than either the multiplier or the multiplicand. The __imul32 function assumes that the operands are word-length values, therefore the result will fit in a single double-word register. However, the calculated product will be truncated if double-word length operands are provided. The product of two 64-bit values may have as many as 128 bits, which is wider than any available register in the RV64I instruction set. To mitigate this problem, the RISC-V ISA provides two instructions to perform a full multiplication: one to calculate the most significant double-word (mulh), and a second to calculate the least significant double-word (mul). The following listing illustrates the __imul64 function that can handle 64-bit operands:

 1:         .global __imul64
 2: __imul64:
 3:         # Input:
 4:         # a0: 64-bit multiplicand
 5:         # a1: 64-bit multiplier
 6:         # Result:
 7:         # a0: low 64-bits of the product
 8:         # a1: high 64-bits of the product
 9:         addi    sp, sp, -32
10:         sd      ra, 24(sp)
11:         sd      t1, 16(sp)
12:         sd      t0, 8(sp)
13:         mv      t0, a0
14:         mv      t1, a1
15:         mul     a0, t0, t1
16:         mulh    a1, t0, t1
17:         ld      t0, 8(sp)
18:         ld      t1, 16(sp)
19:         ld      ra, 24(sp)
20:         addi    sp, sp, 32
21:         ret

This code can be added to the product.s source file to provide a multiplication operation that uses 64-bit integers. The first thing this function does is save the contents of registers t1 (line 11) and t0 (line 12) which will be used by this function.

Note: t0 and t1 are caller-saved registers, so the caller of the product function would normally be expected to have saved them already. They are saved here anyway.

The values of the function arguments are then moved into the temporary registers (lines 13 and 14). This is required because, unlike the first version of this function, the arguments need to be reused and the value of a0 will be overwritten by the first multiplication on line 15, which calculates the low 64 bits of the 128-bit product. The second multiplication (line 16) calculates the high 64 bits of the product, treating both operands as signed values, and stores the result in a1.
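The way mul and mulh split a 128-bit product can be illustrated with a short Python model. This is a sketch only: the function names mirror the instructions but are otherwise hypothetical, and Python's arbitrary-precision integers stand in for the 64-bit registers.

```python
MASK64 = (1 << 64) - 1


def to_signed64(value):
    """Interpret a 64-bit register pattern as a signed integer."""
    value &= MASK64
    return value - (1 << 64) if value >> 63 else value


def mul(rs1, rs2):
    """Low 64 bits of the product, as a register bit pattern."""
    return (rs1 * rs2) & MASK64


def mulh(rs1, rs2):
    """High 64 bits of the signed 128-bit product, as a register bit pattern."""
    return ((to_signed64(rs1) * to_signed64(rs2)) >> 64) & MASK64


def imul64(a, b):
    """Return (low, high), the values __imul64 leaves in a0 and a1."""
    return mul(a, b), mulh(a, b)


print(imul64(9, 5))          # (45, 0)
print(imul64(1 << 62, 8))    # (0, 2): the product 2**65 spills into the high half
```

Recombining the two halves as (high << 64) | low gives back the full product, which is what the two sd instructions in the main program preserve in memory.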

The main program must be updated to handle a potential 128-bit result from the __imul64 function:

 1:         .section ".text.init"
 2:         .align 2
 3:         .global _start
 4:         .global _stack_end
 5: _start:
 6:         lw      a0, operand1
 7:         lw      a1, operand2
 8:         la      sp,_stack_end
 9:         call    sum
10:         la      t1, result1
11:         sw      a0, 0(t1)
12:         call    __imul64
13:         la      t1, result2
14:         sd      a0, 0(t1)
15:         sd      a1, 8(t1)
16: stop:   j       stop
17:         .section ".rodata"
18: operand1:       .word   4
19:         .data
20: operand2:       .word   5
21:         .bss
22: result1:        .word   0
23: result2:        .dword  0, 0

The most significant change is that the result must be stored to memory using two instructions: one to store the low 64 bits of the product (line 14), and one to store the high 64 bits (line 15). The result2 variable on line 23 must also be updated to reserve 128 bits for the product. The arguments of the __imul64 function are the same as those of the __imul32 function. Therefore the new function can be invoked by simply changing the call label on line 12.

After reassembling and linking the modified source files, the result can be inspected in the qemu console by printing out 2 double-word values at address 0x80001008:

$ qemu-system-riscv64 -M virt -serial /dev/null -nographic -kernel main.elf
QEMU 3.1.0 monitor - type 'help' for more information
(qemu) xp /2gd 0x80001008
0000000080001008:                   45                    0
(qemu) quit

Note that the __imul64 function can also be used with 32-bit operands. The value of the high double-word will be zero in this case, since the product of two non-negative 32-bit values always fits in 64 bits.

Divide

The RVM extension also provides instructions to calculate the quotient and remainder of a division of one integer by another. This is slightly less complicated than multiplication because the result cannot be wider than the operands. However, this also limits divisions to dividends and divisors with a maximum of 64 bits. Therefore this is not a true reciprocal of the multiplication, which can have a 128-bit result.

The following listing illustrates the contents of the divide.s source file which defines the function to divide an unsigned 64-bit integer dividend by an unsigned 64-bit integer divisor.

 1:         .text
 2:         .align 2
 3:         .global __idiv64u
 4: __idiv64u:
 5:         addi    sp, sp, -32
 6:         sd      ra, 24(sp)
 7:         beqz    a1, __idiv64u_exit
 8:         divu    a0, a0, a1
 9: __idiv64u_exit:
10:         ld      ra, 24(sp)
11:         addi    sp, sp, 32
12:         ret

This function is fairly straightforward: after ensuring that the divisor is not zero, it simply executes the divu instruction to calculate the unsigned quotient. The check on line 7 is necessary because RV64M does not trap on a division by zero; instead, divu returns a quotient with all bits set. If the divisor is zero, the divu instruction is skipped and a0 is returned unchanged.
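The behaviour of an unsigned division by zero can be sketched in Python. The functions below are an illustrative model (the names simply mirror the divu and remu instructions); the zero-divisor results follow the RISC-V specification, which defines them instead of raising a trap.

```python
MASK64 = (1 << 64) - 1


def divu(rs1, rs2):
    """Model of RV64 divu: unsigned quotient, all bits set when dividing by zero."""
    rs1 &= MASK64
    rs2 &= MASK64
    return MASK64 if rs2 == 0 else rs1 // rs2


def remu(rs1, rs2):
    """Model of RV64 remu: unsigned remainder, the dividend when dividing by zero."""
    rs1 &= MASK64
    rs2 &= MASK64
    return rs1 if rs2 == 0 else rs1 % rs2


print(divu(45, 5))              # 9
print(divu(45, 4), remu(45, 4)) # 11 1
print(hex(divu(45, 0)))         # 0xffffffffffffffff
print(remu(45, 0))              # 45
```

Because no trap is raised, a program that needs to distinguish a zero divisor from a legitimate all-ones quotient has to test the divisor explicitly, as the beqz instruction in the listing does.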

Since the result of the __imul64 function fits in 64 bits due to its small operands, the __idiv64u function can be invoked on the result to verify its accuracy. The main.s program can be updated as follows to divide the result of __imul64 by operand2, and save the result in a variable in the .bss section named result3.

 1:         .section ".text.init"
 2:         .align 2
 3:         .global _start
 4:         .global _stack_end
 5: _start:
 6:         lw      a0, operand1
 7:         lw      a1, operand2
 8:         la      sp,_stack_end
 9:         call    sum
10:         la      t1, result1
11:         sw      a0, 0(t1)
12:         call    __imul64
13:         la      t1, result2
14:         sd      a0, 0(t1)
15:         sd      a1, 8(t1)
16:         bnez    a1, stop
17:         lw      a1, operand2
18:         call    __idiv64u
19:         la      t0, result3
20:         sd      a0, 0(t0)
21: stop:   j       stop
22:         .section ".rodata"
23: operand1:       .word   4
24:         .data
25: operand2:       .word   5
26:         .bss
27: result1:        .word   0
28: result2:        .dword  0, 0
29: result3:        .dword  0

After __imul64 returns, the value is checked for overflow (line 16) by asserting that the value returned in a1 is zero. This will ensure that the result of the multiplication fits in a single 64-bit register. If the result is greater than 64-bits wide, the division will be skipped. Otherwise operand2 is loaded into register a1. This check is not strictly necessary unless different operand values are used which may result in an overflow.

The divide function will determine the quotient of the __imul64 result by the value of operand2. The quotient will be stored in the result3 variable. This should be the same as the result of the sum function (in result1). This can be verified by assembling and linking this program and running the binary in qemu. The value of result3 can then be inspected from the qemu monitor:

riscv64-unknown-elf-as -o add.o add.s
riscv64-unknown-elf-as -o divide.o divide.s
riscv64-unknown-elf-as -o main.o main.s
riscv64-unknown-elf-as -o product.o product.s
riscv64-unknown-elf-ld -T chapter3.lds -o main.elf add.o divide.o main.o product.o
qemu-system-riscv64 -M virt -serial /dev/null -nographic -kernel main.elf
QEMU 3.1.0 monitor - type 'help' for more information
(qemu) xp /1wd 0x80001004
0000000080001004:          9
(qemu) xp /1gd 0x80001018
0000000080001018:                    9
(qemu)

The address of the result3 variable will be 0x80001018; it is 16 bytes beyond the result2 variable, which is located at 0x80001008 (therefore +0x10). This can be verified using objdump as in the previous example.
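The address arithmetic can be double-checked with a few lines of Python. This is a sketch that assumes the variables are packed back to back with no padding, which matches the addresses observed above; the layout helper is hypothetical and does not model the linker.

```python
def layout(base, symbols):
    """Assign consecutive addresses to (name, size_in_bytes) pairs."""
    addresses = {}
    address = base
    for name, size in symbols:
        addresses[name] = address
        address += size
    return addresses


# result1 is a word (4 bytes), result2 two double-words (16 bytes),
# result3 a single double-word (8 bytes).
bss = layout(0x80001004, [("result1", 4), ("result2", 16), ("result3", 8)])

for name, address in bss.items():
    print(name, hex(address))
# result1 0x80001004
# result2 0x80001008
# result3 0x80001018
```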

As expected, result3 contains the integer 9 which is the result of the sum function in variable result1 at offset 0x80001004.

This value is convenient because 5 divides 45 exactly. If we divided the result of __imul64 by operand1 instead, the result would be 11 and there would be a remainder of 1. In the current implementation, this value is lost. However, the divide function can be updated to calculate the quotient and the remainder of a division. The updated __idiv64u function is illustrated in the following listing.

 1: __idiv64u:
 2:         # Input:
 3:         # a0: 64-bit dividend
 4:         # a1: 64-bit divisor
 5:         # Returns:
 6:         # a0 => 64-bit quotient
 7:         # a1 => 64-bit remainder
 8:         addi    sp, sp, -32
 9:         sd      ra, 24(sp)
10:         sd      t1, 16(sp)
11:         sd      t0, 8(sp)
12:         beqz    a1, __idiv64u_exit
13:         mv      t0, a0
14:         mv      t1, a1
15:         divu    a0, t0, t1
16:         remu    a1, t0, t1
17: __idiv64u_exit: 
18:         ld      t0, 8(sp)
19:         ld      t1, 16(sp)
20:         ld      ra, 24(sp)
21:         addi    sp, sp, 32
22:         ret

This new implementation saves the argument values in temporary registers because this is a two-step function and the first argument would be overwritten in the first step. The divide function then calculates the quotient on line 15, and the remainder on line 16. The main.s program must also be updated to save the result of the new divide function in two double words.

 1:         .section ".text.init"
 2:         .align 2
 3:         .global _start
 4:         .global _stack_end
 5: _start:
 6:         lw      a0, operand1
 7:         lw      a1, operand2
 8:         la      sp,_stack_end
 9:         call    sum
10:         la      t1, result1
11:         sw      a0, 0(t1)
12:         call    __imul64
13:         la      t1, result2
14:         sd      a0, 0(t1)
15:         sd      a1, 8(t1)
16:         bnez    a1, stop
17:         lw      a1, operand1
18:         beqz    a1, stop
19:         call    __idiv64u
20:         la      t0, result3
21:         sd      a0, 0(t0)
22:         sd      a1, 8(t0)
23: stop:   j       stop
24:         .section ".rodata"
25: operand1:       .word   4
26:         .data
27: operand2:       .word   5
28:         .bss
29: result1:        .word   0
30: result2:        .dword  0, 0
31: result3:        .dword  0, 0

The changes are that operand1 is used as the divisor on line 17, a check was added on line 18 to skip the division if the divisor is zero, and an instruction was added on line 22 to store the remainder in RAM. The result3 variable was also updated to allocate two double-words of memory on line 31. If this program is assembled and linked, then executed in qemu (as in the previous example), the contents of result3 can be inspected to see that both the quotient and remainder have been calculated:

(qemu) xp /2gd 0x80001018
0000000080001018:                   11                    1

This provides a more flexible implementation of __idiv64u, but if a true reciprocal of the __imul64 function is desired, the function must allow for a 128-bit dividend argument. The RV64M extension does not define an instruction to calculate this, therefore the calculation must be performed in parts.

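If the 128-bit dividend is broken up into four 32-bit words, the division can be carried out one word at a time, carrying the remainder from each step into the next. This is possible because of the following identity, shown here for a value \(x\) split into a high part \(w_h\) and a low 32-bit word \(w_l\):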

\(x = 2^{32}w_h + w_l\)

The quotient of \(x\) by some integer \(d\) can be calculated as:

\(\lfloor x/d \rfloor = 2^{32}\lfloor w_h/d \rfloor + \lfloor (2^{32}(w_h \bmod d) + w_l)/d \rfloor\)

This calculation can be implemented with the following RISC-V assembly code:

 1:         .global __idiv128u
 2: __idiv128u:
 3:         # Input:
 4:         # a0: Address where the 128-bit quotient will be stored (high
 5:         #     dword, low dword).
 6:         # a1: Divisor (must be less than 2^32)
 7:         # a2: Address of the 128-bit dividend (high dword, low dword)
 8:         # Returns:
 9:         # a0: 64-bit remainder (unchanged if the divisor is zero)
10:         # a1: Divisor (unchanged)
11:         addi    sp, sp, -32
12:         sd      ra, 24(sp)
13:         # Check for divide by zero
14:         beqz    a1, __idiv128u_exit
15:         addi    t2, a2, 16
16:         li      t3, 0           # t3 = remainder
17: __idiv128u_next_dword:
18:         lwu     t1, (a2)        # t1 = low word
19:         ld      t0, (a2)
20:         srli    t0, t0, 32      # t0 = high word
21: __idiv128u_high_word:
22:         slli    t3, t3, 32      
23:         add     t0, t0, t3
24:         divu    t4, t0, a1      # t4 = t0/a1
25:         slli    t5, t4, 32      # t5 = t4 * 2^32
26:         remu    t3, t0, a1      # t3 = t0 mod a1
27: __idiv128u_low_word:
28:         slli    t3, t3, 32      # t3 = t3 * 2^32
29:         add     t0, t1, t3
30:         divu    t4, t0, a1
31:         add     t5, t5, t4
32:         remu    t3, t0, a1
33:         sd      t5, (a0)
34:         addi    a2, a2, 8
35:         addi    a0, a0, 8
36:         bne     t2, a2, __idiv128u_next_dword
37:         mv      a0, t3
38: __idiv128u_exit:        
39:         ld      ra, 24(sp)
40:         addi    sp, sp, 32
41:         ret

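This function iteratively performs 64-bit divisions on the 32-bit words of the dividend, starting with the most significant word. After each step the remainder is scaled by \(2^{32}\) (line 28), then added to the next word of the dividend (line 29), and the process is repeated for the next 64-bit double word. Note that the scaled remainder only fits in a 64-bit register if the remainder is less than \(2^{32}\), so this implementation assumes that the divisor in a1 fits in 32 bits.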
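The word-by-word procedure can be checked against Python's built-in arbitrary-precision division. The following is a minimal sketch rather than a translation of the assembly listing: the function name and argument order are illustrative, and it assumes a divisor below \(2^{32}\) so that every intermediate value would fit in a 64-bit register.

```python
MASK32 = (1 << 32) - 1


def idiv128u(dividend, divisor):
    """Divide a 128-bit unsigned value by a divisor below 2**32.

    Processes one 32-bit word at a time, most significant first,
    and returns (quotient, remainder).
    """
    if not 0 < divisor < (1 << 32):
        raise ValueError("divisor must be in the range 1 .. 2**32 - 1")
    quotient = 0
    remainder = 0
    for shift in (96, 64, 32, 0):
        word = (dividend >> shift) & MASK32
        # remainder < divisor < 2**32, so this sum stays below 2**64
        partial = (remainder << 32) + word
        # partial // divisor is below 2**32, so it fills exactly one word
        quotient = (quotient << 32) | (partial // divisor)
        remainder = partial % divisor
    return quotient, remainder


print(idiv128u(45, 4))   # (11, 1)

x = (1 << 127) + 12345678901234567890
print(idiv128u(x, 0xFFFFFFFB) == divmod(x, 0xFFFFFFFB))   # True
```

Each pass through the loop corresponds to one divu/remu pair in the assembly function, with the remainder carried into the next word.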

The following listing illustrates an updated main.s file:

 1:         .section ".text.init"
 2:         .align 2
 3:         .global _start
 4:         .global _stack_end
 5: _start:
 6:         lw      a0, operand1
 7:         lw      a1, operand2
 8:         la      sp, _stack_end
 9:         call    sum
10:         la      t1, result1
11:         sw      a0, 0(t1)
12:         call    __imul64
13:         la      t1, dividend
14:         sd      a0, 8(t1)
15:         sd      a1, 0(t1)
16:         la      a0, quotient
17:         lw      a1, operand1
18:         la      a2, dividend
19:         call    __idiv128u
20:         la      t0, remainder
21:         sd      a0, (t0)
22: stop:   j       stop
23:         .section ".rodata"
24: operand1:       .word   4
25:         .data
26: operand2:       .word   5
27:         .bss
28: result1:        .word   0
29: result2:        .dword  0, 0
30: result3:        .dword  0, 0
31: dividend:       .dword  0, 0
32: quotient:       .dword  0, 0
33: remainder:      .dword  0

This updated main program does not perform an overflow check since the __idiv128u function can handle a 128-bit dividend. This function also reads the dividend directly from memory rather than from registers because it may not fit in a single register. The memory at label quotient will be updated with the result of the division. The remainder is returned by the function in a0, and is then saved to the memory at label remainder on lines 20 and 21.

Atomic Instructions

Synchronization is an important feature in multiprocessing systems. Thus far, the examples have used a single hardware thread, or hart, therefore there has not been any need to synchronize memory access. RISC-V defines the A extension which provides instructions to atomically read-modify-write data in memory. These instructions can be used to support synchronization between multiple hardware threads running in the same memory space.

The most basic synchronization primitive is the atomic compare and swap operation. This will compare a value in a register with a value in memory. If the two values are equal, the value in another register will be swapped with the value in memory. The pseudo code for this is as follows:

Load value in register R1
Load address of the second value in R2
Load the value at address R2 into a temporary register T1
Load swap value in register R3
If R1 == T1:
1. Store R3 at memory location R2
2. R3 := T1

This entire sequence is expected to be performed atomically (i.e. there can be no interruption between the time the value T1 is read from memory and the end of the procedure). This can be implemented using the Load Reserved/Store Conditional instructions provided by the RVA extension. The following listing illustrates the implementation of a compare-and-swap function:

 1:         .text
 2:         .align 2
 3:         .global compare_and_swap
 4:         # a0: Address of value operand
 5:         # a1: Value to compare
 6:         # a2: Value to swap if (a0) == a1
 7:         # return: a0 == 0 => CAS successful
 8:         # return: a0 == 1 => CAS failed
 9: compare_and_swap:
10:         lr.d    t0, (a0)
11:         bne     t0, a1, nomatch
12:         sc.d    t0, a2, (a0)
13:         bnez    t0, compare_and_swap
14:         li      a0, 0
15:         ret
16: nomatch:
17:         li      a0, 1
18:         ret

This function will atomically compare the value in memory located at the address in a0 with the value in register a1, and store the value of a2 at the location in a0 if they match.

The load-reserved instruction on line 10 loads the value at memory location a0 into register t0, and registers a reservation on the address in memory. The nature of the memory reservation is specific to the implementation of the RISC-V core and is transparent to the program. The memory range that is reserved can be arbitrarily sized; however, it must be at least large enough to enclose the value that was loaded.

The value of t0 is then compared with a1. If the values match, the store-conditional instruction on line 12 will save the value in a2 to the memory location of a0. This will also release the reservation on the memory address. If the values do not match, the memory is not updated (this instruction is skipped).

If another hardware thread writes data to the memory for which there is a reservation, then the store-conditional instruction will fail and a non-zero error code will be written to the destination register, which is t0 in this function (line 12). The status is written to t0 rather than a0 so that the address in a0 is still valid if the operation has to be retried. In this case, the compare-and-swap operation is restarted (line 13). On success, a0 is set to zero on line 14.
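The retry behaviour can be illustrated with a toy Python model of a single memory location and its reservation. This is only a sketch of the idea: the Memory class, its method names, and the interfere hook (which simulates a write from another hart between the load and the store) are all hypothetical, and real hardware reservations are considerably more subtle.

```python
class Memory:
    """Toy model of one double-word with a single reservation."""

    def __init__(self, value):
        self.value = value
        self.reserved = False

    def load_reserved(self):
        self.reserved = True
        return self.value

    def store(self, value):
        """Plain store, e.g. from another hart: breaks the reservation."""
        self.value = value
        self.reserved = False

    def store_conditional(self, value):
        """Return 0 on success, 1 if the reservation was lost."""
        if not self.reserved:
            return 1
        self.value = value
        self.reserved = False
        return 0


def compare_and_swap(mem, expected, new, interfere=None):
    """Return 0 if the swap happened, 1 if the comparison failed."""
    while True:
        old = mem.load_reserved()
        if old != expected:
            return 1
        if interfere is not None:
            interfere(mem)      # simulated write from another hart
            interfere = None    # interfere only once
        if mem.store_conditional(new) == 0:
            return 0


n = Memory(5)
print(compare_and_swap(n, 5, 6), n.value)   # 0 6
print(compare_and_swap(n, 5, 7), n.value)   # 1 6
```

Passing an interfere callback such as `lambda m: m.store(5)` makes the first store-conditional fail, after which the loop reloads the value and succeeds on the second attempt.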

The main program shown in the following listing will invoke the compare-and-swap function:

 1:         .section ".text.init"
 2:         .align 2
 3:         .global _start
 4:         .global _stack_end
 5: _start:
 6:         la      sp, _stack_end
 7:         la      a0, n
 8:         li      a1, 5
 9:         li      a2, 6
10:         call    compare_and_swap
11:         la      a0, n
12:         li      a1, 5
13:         li      a2, 7
14:         call    compare_and_swap
15: stop:   j       stop
16:         .balign 8
17: n:              .dword  5

Starting at line 7, the function arguments are set up by first loading the address of the variable n into a0. Note that the data loaded by the lr.d instruction must be aligned on an 8-byte boundary (similarly, the lr.w instruction expects the data to be aligned on a 4-byte boundary). The .balign (byte align) assembler directive on line 16 ensures that this is the case.

The first invocation of the function on line 10 will succeed, thus the value of n will be updated to 6. The second invocation will fail, thus the value of n will not be changed. This can be verified by assembling the program and inspecting the memory from the qemu monitor:

riscv64-unknown-elf-as  -o chapter4_cas_main.o chapter4_cas_main.s
riscv64-unknown-elf-as  -o cas.o cas.s
riscv64-unknown-elf-ld -T chapter3.lds -o chapter4-cas.elf chapter4_cas_main.o cas.o
qemu-system-riscv64 -M virt -serial /dev/null -nographic -kernel chapter4-cas.elf
QEMU 3.1.0 monitor - type 'help' for more information
(qemu) xp /1gd 0x80001008
0000000080001008:                    6
(qemu)

In addition to the load-reserved/store-conditional instructions, the RVA extension also provides atomic memory operations. These atomically perform an operation on a value in memory, and load the previous content of the memory location into the destination register. The supported operations include: add, and, or, xor, max, min, and swap. Moreover, the min and max instructions have signed and unsigned variants. These instructions are convenient for defining another useful synchronization primitive: the test-and-set spinlock.

Spinlocks can be acquired by setting a sentinel value in a specific memory location, but only if that value is not already set therein. If the target memory location already contains the sentinel value, the spinlock will loop until it is released. The lock is released by clearing the memory location (i.e. setting it to zero). The implementation of a spinlock acquire/release pair is illustrated in the listing that follows:

 1:         .text
 2:         .align 2
 3:         .global spinlock_acquire
 4: spinlock_acquire:
 5:         # a0 = memory address of the spinlock
 6:         li      t1, 1
 7:         amoswap.d.aq    t0, t1, (a0)
 8:         bnez    t0, spinlock_acquire
 9:         ret
10: 
11:         .global spinlock_release
12: spinlock_release:
13:         # a0 = memory address of the spinlock
14:         amoswap.d.rl zero, zero, (a0)
15:         ret

This listing defines two sub-routines: one to acquire a spinlock, and one to release it. The spinlock_acquire function loads the value 1 to use as the sentinel on line 6. Then the atomic memory operation amoswap is used on line 7 to swap the value of the sentinel with the contents of the memory location specified in a0. The value contained in the lock location will be saved in register t0. If this value is not zero, the lock was already acquired by another thread, therefore the function will try again (line 8), otherwise the function returns.

The spinlock_release function simply writes zero into the memory location specified in a0. This will allow another thread that is spinning on the lock to acquire it.

The amoswap instruction has two variants: one for double-words (amoswap.d) and one for word values (amoswap.w). Moreover, there are flags which define the release consistency semantics of the memory operation (the .aq and .rl suffixes). When the .aq (acquire) suffix is set, the effects of memory operations that follow this instruction in the current hardware thread will not be observed by another thread before the effect of the instruction itself. Conversely, when the .rl (release) suffix is specified, the effects of memory operations that precede the instruction will not be observed by other threads after its own effect.
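The test-and-set logic of the spinlock can be sketched in Python. This single-threaded model is illustrative only: the function names and the lock address are hypothetical, a dictionary stands in for memory, and the atomicity and ordering that amoswap provides in hardware are simply assumed rather than modelled.

```python
def amoswap(memory, address, value):
    """Model of amoswap: store value and return the previous contents.

    On real hardware this read-modify-write happens as one atomic step.
    """
    old = memory[address]
    memory[address] = value
    return old


def spinlock_try_acquire(memory, address):
    """One iteration of the acquire loop: True if the lock was free."""
    return amoswap(memory, address, 1) == 0


def spinlock_release(memory, address):
    """Clear the lock so that another thread can acquire it."""
    amoswap(memory, address, 0)


LOCK = 0x80001000            # hypothetical address of the lock variable
memory = {LOCK: 0}

print(spinlock_try_acquire(memory, LOCK))   # True: the lock was free
print(spinlock_try_acquire(memory, LOCK))   # False: already held, keep spinning
spinlock_release(memory, LOCK)
print(spinlock_try_acquire(memory, LOCK))   # True: free again after release
```

The key point is that the decision to retry is based on the old value returned by the swap, not on the sentinel that was written.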

The following program illustrates the use of the spinlock functions to define a critical section:

 1:         .section ".text.init"
 2:         .align 2
 3:         .global _start
 4:         .global _stack_end
 5: _start:
 6:         la      sp, _stack_end
 7:         la      a0, lock
 8:         call    spinlock_acquire
 9:         la      t0, n
10:         ld      a0, (t0)
11:         li      a1, 1
12:         call    sum
13:         la      t0, n
14:         sd      a0, (t0)
15:         la      a0, lock
16:         call    spinlock_release
17: stop:   j       stop
18:         .data
19:         .balign 8
20: lock:   .dword  0
21: n:      .dword 0

This program will attempt to acquire the spinlock on line 8 (the address of the lock variable is loaded on line 7). This function call will block until the lock is acquired. Since there is only a single hardware thread, the lock should be acquired immediately. The critical section starts on line 9. The variable n is loaded and incremented by calling the sum function (defined in a previous chapter). The critical section ends on line 14, at which point the program releases the spinlock (line 16). Following the execution of this program, the contents of the variable n should be 1:

riscv64-unknown-elf-as  -o chapter4_spinlock_main.o chapter4_spinlock_main.s
riscv64-unknown-elf-as  -o spinlock.o spinlock.s
riscv64-unknown-elf-as  -o add.o add.s
riscv64-unknown-elf-ld -T chapter3.lds -o chapter4-spinlock.elf chapter4_spinlock_main.o spinlock.o add.o
qemu-system-riscv64 -M virt -serial /dev/null -nographic -kernel chapter4-spinlock.elf
QEMU 3.1.0 monitor - type 'help' for more information
(qemu) xp /1gd 0x80001008
0000000080001008:                    1

Floating Point

In chapter 2, the base integer (I) registers were enumerated. However, when inspecting the virt machine in qemu using the info registers command, certain registers were listed that are not described in that table. These registers exist to support the F and D extensions, which provide floating point arithmetic instructions that work with operands which conform to the IEEE 754-2008 standard. The F extension provides support for single-precision values and operands, and the D extension provides the same instructions for double-precision values.

The 32 additional registers, f0-f31, are used exclusively by the instructions provided by the RVF and RVD extensions. This doubles the number of registers available to the processor without increasing the space required for the register specifier in the instruction op-code since only enough bits to enumerate 32 registers are required (5 bits).

If only the RVF extension is supported, the f registers will be 32-bits wide. If the RVD extension is supported, the f registers will be 64-bits wide. If both RVF and RVD are supported, the RVF instructions will use only the lower 32-bits of the 64-bit registers.

The f registers are enumerated in the following table with their ABI name and a description:

Register(s)	ABI Name(s)	Description
f0-f7	ft0-ft7	Temporary
f8-f9	fs0-fs1	Saved register
f10-f11	fa0-fa1	Function argument/Return value
f12-f17	fa2-fa7	Function argument
f18-f27	fs2-fs11	Saved register
f28-f31	ft8-ft11	Temporary

These registers roughly mirror the base integer registers with two notable exceptions: unlike x0, f0 is not hardwired to 0; it can be used just like every other register. Moreover, there are no registers to manage return addresses, stacks, globals, or threads. The equivalent f registers are used as temporaries.

The convention for who is responsible for saving the contents of the registers is essentially the same as for the equivalent base integer registers: saved registers (fs0-fs11) must be preserved by the callee, while temporary and argument registers must be saved by the caller if their contents are needed after a call.

In addition to the 32 f registers, the RVF and RVD extensions define a control and status register: fcsr. The frcsr instruction reads this register, storing its value into the destination integer register. Similarly, the fscsr instruction will copy the original value of fcsr into the destination integer register, and then write the value in the source integer register to fcsr.

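The fcsr register prescribes the dynamic rounding mode, which floating-point operations use unless an instruction encodes a rounding mode of its own. The rounding mode field occupies bits 5-7 of the register. The RVF and RVD extensions also define the frrm instruction to retrieve the rounding mode and the fsrm instruction to change it.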

The fcsr register also contains flags that indicate exception conditions that may have occurred while executing floating-point arithmetic since the flags were last cleared. These conditions include:

NV: Invalid operation (fcsr[4])
DZ: Divide by zero (fcsr[3])
OF: Overflow (fcsr[2])
UF: Underflow (fcsr[1])
NX: Inexact (fcsr[0])

The floating-point exception flags can also be retrieved using the frflags instruction, which saves their state in the specified integer register.
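The layout of fcsr described above can be illustrated by decoding a raw register value in C. This is a sketch under stated assumptions: the constant and function names are hypothetical, the fcsr value is passed in as a plain integer (on hardware it would be obtained with frcsr), and the rounding mode encodings are taken from the RISC-V specification rather than from this tutorial.

```c
#include <stdint.h>

/* Exception flag masks, fcsr[4:0]. */
enum {
    FCSR_NX = 1 << 0,  /* inexact           */
    FCSR_UF = 1 << 1,  /* underflow         */
    FCSR_OF = 1 << 2,  /* overflow          */
    FCSR_DZ = 1 << 3,  /* divide by zero    */
    FCSR_NV = 1 << 4   /* invalid operation */
};

/* Rounding mode encodings held in fcsr[7:5]. */
enum {
    FRM_RNE = 0,  /* round to nearest, ties to even          */
    FRM_RTZ = 1,  /* round towards zero                      */
    FRM_RDN = 2,  /* round down, towards negative infinity   */
    FRM_RUP = 3,  /* round up, towards positive infinity     */
    FRM_RMM = 4   /* round to nearest, ties to max magnitude */
};

/* Extract the 3-bit rounding mode field (bits 5-7). */
uint32_t fcsr_rounding_mode(uint32_t fcsr)
{
    return (fcsr >> 5) & 0x7u;
}

/* Extract the 5 exception flags (bits 0-4). */
uint32_t fcsr_flags(uint32_t fcsr)
{
    return fcsr & 0x1Fu;
}
```

With these helpers, an fcsr value of 0x21 decodes to the round-towards-zero mode with only the inexact flag set.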

The RVF and RVD extensions define two load instructions (flw and fld) and two store instructions (fsw and fsd). These essentially mirror the base load and store instructions, but transfer data to and from the f registers rather than the x integer registers; the base address still comes from an x register. Therefore their addressing mode and format are the same as those of the lw, ld, sw and sd instructions.

The RVF and RVD extensions also provide a set of arithmetic instructions including:

fadd
fsub
fmul
fdiv
fsqrt

Each instruction has a single- and double-precision variant which can be specified by adding a .s or .d suffix to the instruction respectively.

The floating-point arithmetic instructions operate only on the f registers; therefore the extensions also provide instructions to move data between the integer and floating-point registers.
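Two different kinds of transfer exist, and the difference is easy to miss: a move such as fmv.d.x copies the 64 bits unchanged, whereas a conversion such as fcvt.d.l turns an integer value into the nearest floating-point value. The following C sketch models both. The function names are hypothetical, and the code assumes the host represents double as IEEE 754 binary64.

```c
#include <stdint.h>
#include <string.h>

/* Model of a bit-for-bit move (as fmv.d.x does): the integer's 64 bits
 * are reinterpreted as a double without any conversion. */
double move_bits_to_double(uint64_t bits)
{
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Model of a value conversion (as fcvt.d.l does): the signed integer
 * value is converted to the double that represents the same number. */
double convert_int_to_double(int64_t value)
{
    return (double)value;
}
```

Converting the integer 2 yields 2.0, but moving the bit pattern 2 yields a tiny subnormal number; the bit pattern that represents 2.0 is 0x4000000000000000.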

The following function implementation demonstrates some of these instructions. The function in fvector.s multiplies each element of an array of double-precision floating-point values by a double-precision floating-point scalar:

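 1:         .text
 2:         .align 2
 3:         .global __vec_scalef
 4:         # a0: number of elements, 'n', in the array
 5:         # fa0: A double-precision floating-point scalar 'a'
 6:         # a1: Address of array of x[n] double-precision floating-point values.
 7: __vec_scalef:
 8:         addi    sp, sp, -32             # reserve a stack frame
 9:         sd      ra, 24(sp)              # save the return address
10:         beqz    a0, __vec_scalef_exit   # nothing to do when n == 0
11: __vec_scalef_loop:
12:         fld     fa5, 0(a1)              # fa5 = current element
13:         fmul.d  fa5, fa5, fa0           # fa5 = element * a
14:         fsd     fa5, 0(a1)              # store the product back in place
15:         addi    a1, a1, 8               # advance to the next double (8 bytes)
16:         addi    a0, a0, -1              # one fewer element remaining
17:         bnez    a0, __vec_scalef_loop   # repeat until none remain
18: __vec_scalef_exit:
19:         ld      ra, 24(sp)              # restore the return address
20:         addi    sp, sp, 32              # release the stack frame
21:         ret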

In this function, each value of the double array is loaded on line 12, once per iteration (the number of iterations is set by the integer value in a0). The loaded value is multiplied by the double-precision floating-point value in fa0 on line 13, then stored back to the same memory location on line 14.
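For readers who find the loop easier to follow in a high-level language, the same behaviour can be written as a C reference function. This is a sketch for comparison only: the name vec_scale_ref is hypothetical and the function is not part of the tutorial's sources.

```c
#include <stddef.h>

/* Multiply each of the n doubles in x by the scalar a, in place.
 * When n is zero the array is left untouched, matching the early
 * exit taken by the assembly version. */
void vec_scale_ref(size_t n, double a, double *x)
{
    for (size_t i = 0; i < n; i++)
        x[i] *= a;
}
```

Under the standard RISC-V calling convention, a C function with this signature would receive n in a0, a in fa0 and x in a1, which matches the register usage documented in the comments of the assembly listing.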

The source data for the function can be defined using the .double assembler directive. This directive will store double-precision floating-point values in successive memory double-words. The .float directive will do the same for single-precision floating-point values.

There are many more instructions defined in the RVF and RVD extensions, enough to dedicate an entire chapter to this topic. Moreover, qemu support for RVF and RVD does not seem to be fully implemented in the version available in the Debian 10 packages. A more thorough investigation of these extensions will be reserved for a future chapter.

Conclusion

The RISC-V architecture is designed to be as simple as possible, but no simpler. Therefore a building-block philosophy is followed to allow chip designers to include as many or as few instructions as needed. This gives system designers some flexibility to satisfy the cost, efficiency, and performance constraints specific to the application domain.

Breaking out instructions into optional extensions is like having Lego bricks representing sub-sets of the total RISC-V ISA. In this chapter the M, A, F, and D extensions were used to create a small library of functions that can be re-used in the future to perform more complex calculations, and to synchronize memory access across hardware threads.

In addition to these, there are other optional standard extensions that were not covered in this chapter, including:

C: Compressed instructions.
V: Vector instructions for SIMD processing.

Discussion of these extensions will be reserved for a future chapter.

In the next chapter, the privileged instruction set will be described. It defines the privilege levels at which software runs and what is available at each level. In that chapter, the utility functions defined so far will be used to create more complex programs. The synchronization utilities will be particularly useful when dealing with interrupts.