Example 3: Machine Language for the RiSC-16
This example explains how an assembler takes an assembly file and resolves symbolic addresses to turn the assembly code into machine language. It also hints at the translation from C code to assembly code that a compiler would perform, but it does not go into detailed explanation of that -- it is mostly hand-waving, since high-level language compilers are a huge topic for another course.In this example, we will trace a piece of C code to assembly code and then to machine language. We are mostly interested in the translation from assembly to machine language, but the C code helps to show exactly what the purpose of each instruction is. We will use the RiSC Level-0 assembly code in this example, since it is far simpler than the MIPS assembly code, and since your first assignment is to build a RiSC Level-0 assembler, to build a RiSC Level-0 simulator, and to write a short piece of code in RiSC Level-0 assembly code. Once you understand this (relatively) simple language, we will move into the slightly more complex MIPS assembly code.
The following is a piece of C code that simply moves through an array A, one item at a time. It compares the value found at the ith element of array A to the constant k -- if they are equal, we move to the next element; if they are not, we exit the loop.
int A[10]; int i=0; int j=1; int k=0; while (A[i] == k) { i = i + j; }First, remember that an "integer" in the world of the RiSC is a 16-bit (2-byte) quantity. Also remember that the RiSC is a word-addressed machine, not a byte-addressed machine, therefore memory address 0 refers to the first 2-byte word in memory, address 1 refers to the second 2-byte word in memory, etc.
The following is an equivalent piece of RiSC assembly code.
lw 1, 0, i # reg 1 <- value of i lw 2, 0, j # reg 2 <- value of j lw 3, 0, k # reg 3 <- value of k loop: lw 4, 1, A # reg 4 <- i + A (A [i]) beq 4, 3, 1 # if A[i] == k, jump ahead to PC+1+1 beq 0, 0, exit # otherwise, unconditional jump to exit add 1, 1, 2 # i = i + j beq 0, 0, loop # back to top of while-loop exit: halt # when while-loop finishes, halt i: .fill 0 # int i = 0; j: .fill 1 # int j = 1; k: .fill 0 # int k = 0; A: .fill 0 # int A[10]; (this is A[0]) .fill 0 # int A[10]; (this is A[1]) .fill 0 # int A[10]; (this is A[2]) .fill 0 # int A[10]; (this is A[3]) .fill 0 # int A[10]; (this is A[4]) .fill 0 # int A[10]; (this is A[5]) .fill 0 # int A[10]; (this is A[6]) .fill 0 # int A[10]; (this is A[7]) .fill 0 # int A[10]; (this is A[8]) .fill 0 # int A[10]; (this is A[9])
How the assembly code relates to the C code
The individual integers i, j, and k are each created by a .fill directive, which tells the assembler to create a (2-byte) word-sized value and place it at this spot, rather than an instruction. The array A is created by a whole bunch of .fill directives; since there are 10 items in the array, there are 10 .fill directives. By default, we have initialized the values in the array to zero.Note that in RiSC code, we place the data at the end of the program; this is simply a convention that allows the RiSC to start executing at location 0 as soon as the program is loaded. However, real-world machines might not behave this way; it is not a rule, it is just the convention that we use in this simple computer.
The first thing that the assembly code does is load all of the variables: i, j, and k. This is done with the load-word instruction, lw. Each of the first three lines adds the value of the label i, j, or k to the value found in register 0 (it will always contain zero) and loads the value from memory found at this address; therefore the first three loads simply load values found at memory locations i, j, and k.
lw 1, 0, i # reg 1 <- value of i lw 2, 0, j # reg 2 <- value of j lw 3, 0, k # reg 3 <- value of kAfter the variables are loaded, at the "loop" label, we load a value from array A. The instruction says to add the location of A to the value found in register 1 (variable i, which at this point has the value 0), and use the sum as a memory address.
loop: lw 4, 1, A # reg 4 <- i + A (A [i])This gives us the address for A[0] (the first item in A) -- we load the value found at this address and place it into register 4. Register 4 now contains the value A[i], and in this instance i=0, so register 4 has A[0]. The next time we execute the loop, we will branch back to this instruction, which will again load A[i], but in later instances the value of i might have changed; in particular, the next time around, the value in register 1 will be 1 (i=1), therefore when we execute this instruction, we will load the value A[1].
Back to the present -- we have just loaded the value A[0] into register 4. Using the branch-if-equal instruction, beq, we compare the value of A[0] (in register 4) to the value of k (in register 3):
beq 4, 3, 1 # if A[i] == k, jump ahead to PC+1+1and if they are equal we branch over the following instruction:
beq 0, 0, exit # otherwise, unconditional jump to exitwhich exits the while-loop. If the value in k is not equal to the value in A[i], the first branch is not taken and control falls through to the second branch, which is an unconditional jump to the label exit. It is considered unconditional because the jump occurs if the value in register 0 is equal to itself (it always will be).
If A[i] == k, then the first branch instruction succeeds and we jump past the jump-to-exit instruction to the following two instructions:
add 1, 1, 2 # i = i + j beq 0, 0, loop # back to top of while-loopThe add increments register 1 (variable i) by the value in register 2 (variable j), or i = i + j. The following beq instruction is an unconditional jump to the beginning of the loop, where the process starts again at the point of loading the next entry in array A.
How the assembler translates the assembly code to machine code
On the first pass, the assembler assigns addresses to each of the instructions and labels. For instance, the following addresses work perfectly well:0: lw 1, 0, i 1: lw 2, 0, j 2: lw 3, 0, k 3: loop: lw 4, 1, A 4: beq 4, 3, 1 5: beq 0, 0, exit 6: add 1, 1, 2 7: beq 0, 0, loop 8: exit: halt 9: i: .fill 0 10: j: .fill 1 11: k: .fill 0 12: A: .fill 0 13: .fill 0 14: .fill 0 15: .fill 0 16: .fill 0 17: .fill 0 18: .fill 0 19: .fill 0 20: .fill 0 21: .fill 0At this point, the assembler has made the following connections:
loop -> address 3 exit -> address 8 i -> address 9 j -> address 10 k -> address 11 A -> starts at address 12Aside: Here is a subtle point that is interesting, but it is not necessary for understanding how assemblers work. The assembler does not realize that A is in fact an array; it merely knows where the label A is, and that it is followed by a whole bunch of integers. The assembler does not need to know how A is being used; that is up to the compiler, or the writer of the assembly code. The assembler just does what it is told (create an integer with label A and follow it with 9 more integers), and it assumes that the author of the assembly code (either a compiler or a human) knows what it is doing.
On the second pass over the code, the assembler replaces the labels:
0: lw 1, 0, 9 1: lw 2, 0, 10 2: lw 3, 0, 11 3: lw 4, 1, 12 4: beq 4, 3, immed(1) 5: beq 0, 0, 8 6: add 1, 1, 2 7: beq 0, 0, 3 8: halt 9: 0 10: 1 11: 0 12: 0 13: 0 14: 0 15: 0 16: 0 17: 0 18: 0 19: 0 20: 0 21: 0We will get to the "immed(1)" in a moment.
Each lw assembly-code instruction has the following format:
lw regA, regB, immedand the first four instructions get translated to the Load-Type machine-code instruction format:
opcode:3 regA:3 regB:3 immed:7Here are the first four loads:
0: lw 1, 0, 9 1: lw 2, 0, 10 2: lw 3, 0, 11 3: lw 4, 1, 12The binary opcode for lw is 101; its decimal value is 5. This gives us the following fields that make up the first four instructions (in decimal first, then in binary to show the widths of the fields):
5 1 0 9 5 2 0 10 5 3 0 11 5 4 1 12 101 001 000 0001001 101 010 000 0001010 101 011 000 0001011 101 100 001 0001100The next two instructions are beq instructions, which have the following assembly-code format:
beq regA, regB, target-addressand these beq instructions get translated to the CondBranch-Type machine-code instruction format:
opcode:3 regA:3 regB:3 immed:7Here are the two beq instructions:
4: beq 4, 3, immed(1) 5: beq 0, 0, 8Note that the target-address must be translated to the immediate value; the immediate value that goes into the machine-code instruction must be PC-relative. Therefore, for the instruction at address 4, the PC is 4 and the target-address is "immed(1)", which simply means that the assembly-language programmer used a constant value rather than a symbolic address (a label). So in this case, we can just put the constant right into the instruction field (assuming that it is within the bounds of -64 and 63).
The instruction at address 5 is different: it used a label for its target-address, which has resolved to the absolute location 8. We solve the following equation for immediate:
8 = PC + 1 + immediateThe PC at this address is 5, the target address is 8, therefore the immediate value that should go into the instruction is the value 2.
The binary opcode for beq is 110; its decimal value is 6. This gives us the following fields that make up the two beq instructions at addresses 4 and 5 (in decimal first, then in binary to show the widths of the fields):
6 4 3 1 6 0 0 2 110 100 011 0000001 110 000 000 0000010The next instruction, add, has the following assembly-code format:
add regA, regB, regCand an add instruction gets translated to the Register-type machine-code instruction format:
opcode:3 regA:3 regB:3 0000 dest:3Note that the middle four bits of an add instruction are unused and are therefore set to zero.
This is the add instruction to translate:
6: add 1, 1, 2The binary opcode for add is 000, decimal 0. This gives us the following fields that make up the add instruction (decimal first, then binary):
0 1 1 0 2 000 001 001 0000 010The next instruction to translate is another beq instruction:
7: beq 0, 0, 3Like the beq instruction at address 5, this one used a label for its target address, and the label resolved to the absolute address 3. We solve the following for immediate:
3 = PC + 1 + immediateThe PC at this address is 7, so the immediate value is -5. The instruction therefore has the following machine-language encoding (decimal, then binary):
6 0 0 -5 110 000 000 1111011In general, forward branches have positive immediate values, and backward branches have negative immediate values.
Lastly, we encode the halt instruction at location 8.
8: haltAs described in the project description, the encoding for HALT looks like the following (decimal & binary):
jalr 0, 0, 0x71 111 000 000 1110001Thus, we have the following for the instructions:
0: 101 001 000 0001001 1: 101 010 000 0001010 2: 101 011 000 0001011 3: 101 100 001 0001100 4: 110 100 011 0000001 5: 110 000 000 0000010 6: 000 001 001 0000 010 7: 110 000 000 1111011 8: 111 000 000 1110001Next come the pieces of data, which are pretty simple. They resolve to the following:
9: 0000000000000000 10: 0000000000000001 11: 0000000000000000 12: 0000000000000000 13: 0000000000000000 14: 0000000000000000 15: 0000000000000000 16: 0000000000000000 17: 0000000000000000 18: 0000000000000000 19: 0000000000000000 20: 0000000000000000 21: 0000000000000000The final output, in binary, looks like the following:
1010010000001001 1010100000001010 1010110000001011 1011000010001100 1101000110000001 1100000000000010 0000010010000010 1100000001111011 1110000001110001 0000000000000000 0000000000000001 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000With spaces added for readability, it looks like this:
1010 0100 0000 1001 1010 1000 0000 1010 1010 1100 0000 1011 1011 0000 1000 1100 1101 0001 1000 0001 1100 0000 0000 0010 0000 0100 1000 0010 1100 0000 0111 1011 1110 0000 0111 0001 0000 0000 0000 0000 0000 0000 0000 0001 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000And looks like the following in hex:
a409 a80a ac0b b08c d181 c002 0482 c07b e071 0000 0001 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
Hope this helps,
Prof. Jacob