Lab 4: TinyRV1 Processor
Part C: Accumulate Accelerator
Lab 4 will give you experience designing, implementing, testing, and prototyping a single-cycle processor microarchitecture and a specialized accelerator. The processor will implement the TinyRV1 instruction set. The instruction set manual is located here:
The lab will continue to leverage concepts from Topic 2: Combinational Logic, Topic 3: Boolean Algebra, Topic 4: Combinational Building Blocks, Topic 6: Sequential Logic, Topic 7: Finite-State Machines, and Topic 8: Sequential Building Blocks. The lab will also leverage concepts from Topic 9: Instruction Set Architecture and Topic 10: Single-Cycle Processors. The lab will continue to provide opportunities to leverage the three key abstraction principles: modularity, hierarchy, and regularity.
The lab includes seven parts:
- Part A: Processor Components
    - Due 11/6 @ 11:59pm via GitHub
    - Students should work on Part A before, during, and after your assigned lab section during the week of 11/3
    - Pre-lab survey on Canvas is (roughly) due by end of lab section during the week of 11/3
- Part B: TinyRV1 Processor
    - Due 11/13 @ 11:59pm via GitHub
    - Students should work on Part B before, during, and after your assigned lab section during the week of 11/10
- Part C: Accumulate Accelerator
    - Due 11/25 @ 11:59pm via GitHub
    - Students should plan to submit Part C before they leave for Thanksgiving Break
- Part D: FPGA Prototype v1
    - Due week of 11/17 during assigned lab section
    - This part will focus on prototyping the code developed in Part A+B
    - Even though completed with a partner, every student must turn in their own paper check-off sheet in their lab section!
- Part E: FPGA Prototype v2
    - Due week of 12/1 during assigned lab section
    - This part will focus on prototyping the code developed in Part A+B+C
    - Even though completed with a partner, every student must turn in their own paper check-off sheet in their lab section!
- Part F: TinyRV1 Assembly
    - Due 12/4 @ 11:59pm via GitHub
    - This part will include all of the assembly developed during Part D+E
- Part G: Report
    - Due on 12/8 at 11:59pm for all groups!
    - Post-lab survey on Canvas is due at the same time as the report
All parts of Lab 4 must be done with a partner. You can confirm your partner on Canvas (Click on People, then Groups, then search for your name to find your lab group).
Both students must contribute to all parts!
It is not acceptable for one student to exclusively work on the code while the other student exclusively works on the report. It is not acceptable for one student to exclusively work on hardware design while the other student exclusively works on testing. Both students must contribute to all parts. Student understanding of Verilog design and testing will be assessed on the prelim exams, final exam, and Verilog coding exam. The instructors will also survey the Git commit log on GitHub to confirm that both students are contributing equally. If you are using pair programming, then both students must take turns using their own account so both students have representative Git commits. Students should create commits after finishing each step of the lab, so their contribution is clear in the Git commit log. A student whose contribution is limited as represented by the Git commit log will receive a significant deduction to their lab score.
This handout assumes that you have read and understood the course tutorials and that you have attended the discussion sections. This handout assumes you have successfully completed Part A and Part B. You should have already cloned your group remote repository, so use git pull to ensure you have any recent updates before working on your lab assignment.
where XX should be replaced with your group number.
The following table shows all of the hardware modules you will be developing in Lab 4.

1. TinyRV1 Accumulate Assembly Program
We will start by implementing a simple accumulate assembly program. We
have provided you a template in lab4/asm/accumulate.asm. Take a look at
this template.
start:
# wait for button to be pressed
addi x1, x0, 1
wait_posedge:
lw x2, 0x208(x0)
bne x2, x1, wait_posedge
# wait for button to be unpressed
wait_negedge:
lw x2, 0x208(x0)
bne x2, x0, wait_negedge
# read and display size
lw x1, 0x200(x0)
sw x1, 0x210(x0)
# set breadboard pin high for timing
addi x3, x0, 1
sw x3, 0x21c(x0)
#''' LAB ASSIGNMENT ''''''''''''''''''''''''''''''''''''''''''''''''''''
# Write your accumulate loop
#'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
# Be sure to understand the code above and below so you know what
# register stores the size and what register stores the result.
# set breadboard pin low for timing
sw x0, 0x21c(x0)
# done
sw x4, 0x214(x0)
jal x0, start
.data
# result result seven
# size (dec) (hex) segment
# ----------------------------
.word 36 # 1 36 0x024 4
.word 26 # 2 62 0x03e 30
.word 69 # 3 131 0x083 3
.word 57 # 4 188 0x0bc 28
...
We have provided you assembly code to wait for a button to be pressed. Note that we cannot simply wait for in2 to be one because then we might end up triggering multiple accumulations. So we must first wait for the button to go from zero to one, and then wait for the button to go from one to zero. You can assume the behavior of your assembly program is undefined when size is zero.
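To see why the template waits for both edges, here is a minimal Python sketch (not part of the lab code) that counts how many accumulations would start for a given sequence of polled in2 values. The function names and the list of samples are hypothetical, and the model ignores the time spent inside the accumulate loop.

```python
def count_edge_triggers(samples):
    """Count accumulations started by the wait_posedge/wait_negedge pattern.

    `samples` is a hypothetical list of in2 values, one per poll.
    """
    triggers = 0
    state = "wait_posedge"
    for in2 in samples:
        if state == "wait_posedge":
            if in2 == 1:            # button went from zero to one
                state = "wait_negedge"
        else:
            if in2 == 0:            # button went from one back to zero
                triggers += 1       # one accumulation per full press
                state = "wait_posedge"
    return triggers


def count_level_triggers(samples):
    """Naive alternative: start an accumulation whenever in2 reads one."""
    return sum(1 for in2 in samples if in2 == 1)


# One long button press, sampled eight times
held = [0, 0, 1, 1, 1, 1, 0, 0]
print(count_edge_triggers(held))   # 1
print(count_level_triggers(held))  # 4
```

A single long press starts one accumulation with the two-edge wait, but several with simple level polling.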
The input data is specified using the .data and .word assembler
directives. This means the input data will be loaded in the data section of
physical memory, and recall from the TinyRV1 ISA that the data section
starts at address 0x100.
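As a rough illustration of where each .word lands, the following Python sketch models the data section as a dictionary keyed by byte address and computes the sum a correct accumulate loop should produce. The helper names are hypothetical and this is only a golden model for checking results, not the assembly you need to write.

```python
DATA_BASE = 0x100   # start of the data section per the TinyRV1 ISA
WORD_BYTES = 4      # each .word is a 32-bit (4B) value


def load_words(words, base=DATA_BASE):
    """Place consecutive .word values at base, base+4, base+8, ..."""
    return {base + WORD_BYTES * i: w for i, w in enumerate(words)}


def expected_sum(mem, size, base=DATA_BASE):
    """Sum the first `size` words, wrapping to 32 bits like the hardware."""
    total = 0
    for i in range(size):
        total = (total + mem[base + WORD_BYTES * i]) & 0xFFFFFFFF
    return total


mem = load_words([36, 26, 69, 57])
print([hex(addr) for addr in sorted(mem)])  # ['0x100', '0x104', '0x108', '0x10c']
print(expected_sum(mem, 4))                 # 188
```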
Once the button has been pressed then we read in0 and display the size on the seven-segment displays. We then write one to out3 before we start executing the accumulate loop and then write zero to out3 after we have finished executing the accumulate loop. Out3 will be connected to a pin on the breadboard; our plan is to probe this pin on the breadboard with the oscilloscope so we can precisely measure the time to accumulate an array of numbers and then compare this to specialized hardware.
After the accumulate loop, we again set the breadboard pin to be low, before writing the result to out1 to be displayed on a seven-segment display. The very last step is to unconditionally jump back to the start of the program.
After implementing the accumulate loop, you can test it using the ISA simulator. You will need to use the TUI because you need to set in2 to one and then set it back to zero to start the accumulate loop. Try running the accumulate loop multiple times within the same execution with different size values to ensure your assembly program is working correctly.
% cd ${HOME}/ece2300/groupXX/build
% make proc-isa-sim
% make accumulate.bin
% ./proc-isa-sim +bin=accumulate.bin +tui
You can also test it using the RTL single-cycle processor simulator.
% cd ${HOME}/ece2300/groupXX/build
% make proc-scycle-sim
% make accumulate.bin
% ./proc-scycle-sim +bin=accumulate.bin +tui
2. Interface and Implementation Specification
Your TinyRV1 single-cycle processor is programmable, meaning it can perform different functionality simply by executing different assembly-level programs. We will also be implementing a specialized accelerator which can only perform a single function. We will then do a comparative analysis to understand the performance and area of both our general-purpose programmable processor and a specialized accelerator.
Our accelerator will implement the same functionality as the accumulate assembly program you implemented in the previous part. It will accumulate 32-bit integer values stored in an array in memory to produce a single sum. The interface for our accelerator is shown below.
module AccumXcel
(
(* keep=1 *) input logic clk,
(* keep=1 *) input logic rst,
// Input val/rdy interface
(* keep=1 *) input logic in_val,
(* keep=1 *) output logic in_rdy,
(* keep=1 *) input logic [6:0] in_size,
// Result
(* keep=1 *) output logic [31:0] result,
// Memory interface
(* keep=1 *) output logic mem_val,
(* keep=1 *) output logic [31:0] mem_addr,
(* keep=1 *) input logic [31:0] mem_rdata
);

The accelerator uses a latency insensitive valid/ready interface; this is the same interface you used in Lab 3. Recall that a val/rdy interface uses the following micro-protocol:
Valid/Ready Micro-Protocol
Assume we have a producer that wishes to send a message to a consumer using the valid/ready micro-protocol. At the beginning of the cycle, the producer determines if it has a new message to send to the consumer. If so, it sets the message bits appropriately and then sets the valid signal high. Also at the beginning of the cycle, the consumer determines if it is able to accept a new message from the producer. If so, it sets the ready signal high. At the end of the cycle, the producer and consumer can independently AND the valid and ready signals together; if both signals are true then the message is considered to have been sent from the producer to the consumer and both sides can update their internal state appropriately. Otherwise, we will try again on the next cycle. To avoid long combinational paths and/or combinational loops, we should avoid making the valid signal depend on the ready signal or the ready signal depend on the valid signal. If you absolutely must, you can make the ready signal depend on the valid signal but it is considered very bad practice to make the valid signal depend on the ready signal. As long as you adhere to this valid/ready micro-protocol, composing modules via the stream interfaces should not cause significant timing issues.
So to send an accumulate message to the accelerator we need to: (1) set
in_size to be the number of elements to accumulate; and (2) set
in_val to be one. Whenever in_val and in_rdy are both one then the
message has been sent. If in_val is one and in_rdy is zero, then the
accelerator is busy and the message is not sent. If in_val is zero and
in_rdy is one, then the accelerator is not busy but there is no message
being sent. Do not make in_rdy depend on in_val!
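The handshake can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the producer holds a hypothetical queue of pending sizes, and the consumer stays busy for a fixed, made-up number of cycles after accepting a message. Note that in_val is computed only from producer state and in_rdy only from consumer state, so neither signal depends on the other.

```python
def simulate(messages, busy_cycles, num_cycles):
    """Return a list of (cycle, message) pairs for every completed transfer."""
    pending = list(messages)   # producer state
    busy = 0                   # consumer state: cycles of work remaining
    accepted = []
    for cycle in range(num_cycles):
        in_val = 1 if pending else 0     # depends only on the producer
        in_rdy = 1 if busy == 0 else 0   # depends only on the consumer
        if in_val and in_rdy:
            # Both sides see val && rdy, so the message is sent this cycle
            accepted.append((cycle, pending.pop(0)))
            busy = busy_cycles
        elif busy > 0:
            busy -= 1                    # consumer keeps working
    return accepted


# Two back-to-back messages; the second waits until the consumer is ready
print(simulate([3, 4], busy_cycles=4, num_cycles=12))  # [(0, 3), (5, 4)]
```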
The output result port is used to indicate the accumulated sum across
the elements in an array. When in_rdy is zero, then result is
undefined. When in_rdy is one, then result must always be the
accumulated sum of the array elements from the most recent accumulate
message. After resetting the accelerator (i.e., when the rst signal is
one for one or more cycles), result should be zero. You can assume
the behavior of your accelerator is undefined when size is zero.
The accelerator will assume the array starts at memory address 0x000 and
that the array is no longer than 512B (i.e., 128 4B values). Note that
the size is specified in elements not bytes! So if the size is 4 then
the accelerator should accumulate the values stored at memory addresses
0x000, 0x004, 0x008, and 0x00c. The memory interface is very similar to
the processor/memory interface. To read from memory set mem_addr to the
memory address to be read and mem_val to one; then mem_rdata will
have the read data. Remember the physical memory uses a combinational
read.
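Because the size is given in elements, the byte address of element i is simply 4*i. A tiny Python sketch of the address sequence (the function name is hypothetical):

```python
def mem_addrs(in_size):
    """Byte addresses read when accumulating in_size 4B elements from 0x000."""
    return [4 * i for i in range(in_size)]


print([f"0x{addr:03x}" for addr in mem_addrs(4)])
# ['0x000', '0x004', '0x008', '0x00c']
```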
Assume physical memory is initialized with an array containing the values 1, 2, 3, and 4 (i.e., the value 1 is at address 0x000, the value 2 is at address 0x004, etc). Here is an example trace of one possible correct execution for accumulating the four elements in the array.
---- in ----
cycles val rdy size result memory request
------------------------------------------------------
0: 0 1 () 00000000 | # reset result to zero!
1: 0 1 () 00000000 |
2: 1 1 4 () 00000000 |
3: 0 0 . () | rd:00000000:00000001
4: 0 0 . () | rd:00000004:00000002
5: 0 0 . () | rd:00000008:00000003
6: 0 0 . () | rd:0000000c:00000004
7: 0 0 . () | rd:00000010:xxxxxxxx
8: 0 1 () 0000000a |
9: 0 1 () 0000000a |
The first column shows the cycle count. The size column is one of four options:
- number: this means in_val && in_rdy so a message is successfully sent to the accelerator; the number shows what size is being sent to the accelerator
- #: this means in_val && !in_rdy so we have a valid message to send to the accelerator, but the accelerator is not ready
- .: this means !in_val && !in_rdy so we do not have a valid message to send to the accelerator, but regardless the accelerator is not ready
- blank: this means !in_val && in_rdy so we do not have a valid message to send to the accelerator, but the accelerator is still ready (and waiting for work to do)
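The four cases can be written as a small Python function. This is only an illustration of how to read the size column; the function name is hypothetical and it is not the tracing code used by the test harness.

```python
def size_column(in_val, in_rdy, in_size):
    """Return the symbol shown in the size column of the trace."""
    if in_val and in_rdy:
        return str(in_size)   # message sent; show the size
    if in_val and not in_rdy:
        return "#"            # valid message, accelerator not ready
    if not in_val and not in_rdy:
        return "."            # no message, accelerator not ready
    return ""                 # no message, accelerator ready and waiting


print(repr(size_column(1, 1, 4)))  # '4'
print(repr(size_column(1, 0, 4)))  # '#'
```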
The accelerator has correctly reset the result to zero at the beginning
of the trace. A message with size of 4 is successfully sent to the
accelerator on cycle 2. On cycles 3-7, the accelerator is not ready, but
we do not have any new accumulate messages to send the accelerator
anyways. The result column shows the result output only when in_rdy is
one. We can see that in_rdy is one on cycle 8 and the accelerator has
produced the correct result (1+2+3+4 = 10 which is 0x0000000a in hex).
The memory request column shows the type, address, and read data for
valid memory requests being sent to the physical memory; we can see this
accelerator has sent five memory requests to physical memory. The read
data from the final memory request was not used since we are only
accumulating four elements.
Your accelerator must correctly handle multiple consecutive accumulate messages. Again, assume physical memory is initialized with an array containing the values 1, 2, 3, and 4 (i.e., the value 1 is at address 0x000, the value 2 is at address 0x004, etc). Here is an example trace of one possible correct execution for first accumulating the first three elements in the array and then accumulating the first four elements in the array.
---- in ----
cycles val rdy size result memory request
------------------------------------------------------
0: 0 1 () 00000000 | # reset result to zero!
1: 0 1 () 00000000 |
2: 1 1 3 () 00000000 |
3: 1 0 # () | rd:00000000:00000001
4: 1 0 # () | rd:00000004:00000002
5: 1 0 # () | rd:00000008:00000003
6: 1 0 # () | rd:0000000c:00000004
7: 1 1 4 () 00000006 |
8: 0 0 . () | rd:00000000:00000001
9: 0 0 . () | rd:00000004:00000002
10: 0 0 . () | rd:00000008:00000003
11: 0 0 . () | rd:0000000c:00000004
12: 0 0 . () | rd:00000010:xxxxxxxx
13: 0 1 () 0000000a |
14: 0 1 () 0000000a |
The accelerator has correctly reset the result to zero at the beginning
of the trace. A message with size of 3 is successfully sent to the
accelerator on cycle 2. On cycles 3-6, the producer has a second
accumulate message to send to the accelerator, but the accelerator is not
ready. On cycle 7, the accelerator is now ready, so we see two things:
(1) since in_rdy is one, the result is shown as 1+2+3 = 6; and (2)
since in_val is one and in_rdy is one the second accumulate message
with size of 4 is successfully sent to the accelerator. On cycles 8-12,
the accelerator is busy (i.e., not ready) executing the second
transaction. On cycle 13, the accelerator is once again ready and we see
the accelerator has produced the correct result (1+2+3+4 = 10 which is
0x0000000a in hex).
Your design does not need to match the above traces cycle-by-cycle.
It just needs to implement the specification. If your accelerator takes
a few cycles more or less to implement the accumulate that is perfectly
fine. The key is that when the accumulator is done it sets result to
the final result and in_rdy to one, and keeps both of these output
ports set this way until it starts executing the next transaction.
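To make the specification concrete, here is one possible cycle-level behavioral model in Python. It is a sketch under assumptions, not the required structural Verilog: the class and method names are hypothetical, unmapped memory reads return zero, and this particular model spends one cycle fewer per transaction than the example traces, which the specification permits. As in the specification, a size of zero is not handled.

```python
class AccumXcelModel:
    """Behavioral sketch of the accumulate accelerator (hypothetical names)."""

    def __init__(self, mem):
        self.mem = mem          # dict: byte address -> 32-bit word
        self.busy = False
        self.remaining = 0      # elements still to fetch
        self.index = 0          # index of the next element to fetch
        self.result = 0         # result is zero after reset

    def tick(self, in_val, in_size):
        """Simulate one cycle; return (in_rdy, result or None if undefined)."""
        in_rdy = 0 if self.busy else 1            # depends only on state
        visible = self.result if in_rdy else None
        if self.busy:
            # Combinational memory read plus accumulate in the same cycle
            word = self.mem.get(4 * self.index, 0)
            self.result = (self.result + word) & 0xFFFFFFFF
            self.index += 1
            self.remaining -= 1
            if self.remaining == 0:
                self.busy = False                 # result held until next message
        elif in_val:
            # in_val && in_rdy: accept the message and start a new transaction
            self.busy = True
            self.remaining = in_size
            self.index = 0
            self.result = 0
        return in_rdy, visible


xcel = AccumXcelModel({0x000: 1, 0x004: 2, 0x008: 3, 0x00c: 4})
xcel.tick(1, 4)                  # send a message with size 4
for _ in range(4):
    xcel.tick(0, 0)              # accelerator is busy
print(xcel.tick(0, 0))           # (1, 10)
```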
You must implement your accelerator using a datapath
(lab4/AccumXcelDpath.v) and a control unit (lab4/AccumXcelCtrl.v). We
have very specific requirements on what kind of hardware modeling is
permitted in these two files.
- Datapath Rules: The datapath must be completely structural. You can use any of the components previously developed in the lab assignments. You should only instantiate and connect modules that you have implemented and tested separately. This means you cannot directly use any logic in this module; no always blocks and nothing in an assign statement other than basic connectivity.
- Control Unit Rules: The control unit can either be purely combinational or use a finite-state machine (FSM). If you do use an FSM, it must have three parts: (1) the state register which should be implemented using a DFFR_RTL, Register_16b_RTL, or Register_32b_RTL; (2) an always_comb block to implement the combinational next state logic; and (3) an always_comb block to implement the output logic. There should be no other logic in the control unit. You cannot use always_ff blocks directly in your control unit!
You are free to structure your datapath however you like, and you are
free to use any kind of combinational logic or finite-state-machine for
the control unit; but you must follow the above rules. The provided
AccumXcel.v file composes the datapath and control units. You will need
to modify this file to add new control and/or status signals.
You might want to take a look at the music player for inspiration on how to read an array of values from memory. We strongly recommend using your counter from Lab 3 as the core of your datapath to generate memory addresses similar to the approach we used in the music player. However, this is not required and you are welcome to take any approach you like as long as you follow the above rules.
Use an incremental design approach!
Do not implement the entire datapath, then implement the entire control unit, and then try to run your first test! Take an incremental approach. We recommend you develop your accelerator using the following three steps (this is exactly what the course staff did!). Draw an FSM and/or datapath block diagram for each step before writing any code!
- Step 1: Fetch: Implement a datapath and FSM that just fetches data from memory and never stops. Run the basic test case and look at the cycle-level trace. Dump the waveforms and look at the waveforms in Surfer. See if your accelerator is correctly fetching each element from memory.
- Step 2: Fetch and Stop: Augment your datapath and FSM so that it fetches data from memory and stops once it fetches in_size elements. Run the basic test case and look at the cycle-level trace. Dump the waveforms and look at the waveforms in Surfer. See if your accelerator is correctly fetching each element from memory and also stops after fetching four elements.
- Step 3: Fetch, Accumulate, and Stop: Now that you know your accelerator can fetch data correctly, augment your datapath and FSM so that it fetches data from memory, does the accumulation, and stops once it fetches in_size elements. Run the basic test case and look at the cycle-level trace. Dump the waveforms and look at the waveforms in Surfer. See if your accelerator is correctly fetching each element from memory, accumulating the values, stopping after fetching four elements, and outputting the correct sum.
You really do need an accelerator block diagram!
Once you have your accumulator accelerator working you really do need to make sure you have a clean block diagram which shows all of the registers, muxes, adders, etc and how they are connected. You will need to show this to the TAs in Lab 4, Part E and then you will annotate this diagram with the actual critical path and component delays as analyzed by Quartus. If you do not draw this block diagram now, you will need to draw it during Lab 4, Part E wasting precious time!
3. Testing Strategy
You are responsible for writing directed tests for your accelerator in
lab4/test/AccumXcel-test.v. You do not need to implement random or
xprop tests.
3.1. Directed Testing
The test cases will look like this:
task test_case_1_basic();
t.test_case_begin( "test_case_1_basic" );
// Load test data into test memory (remember, accelerator always
// starts accumulating from address 0x000!)
data( 'h000, 1 );
data( 'h004, 2 );
data( 'h008, 3 );
data( 'h00c, 4 );
// Send message to accumulate 4 elements
// ---- in ----
// val rdy size result
check( 1, 1, 4, 0 );
// Simulate for 20 cycles
for ( int i = 0; i < 20; i = i+1 )
check( 0, 0, 0, 0, IGNORE_OUTPUTS );
// Check result is correct
// ---- in ----
// val rdy size result
check( 0, 1, 0, 10 );
t.test_case_end();
endtask
You can use the data task to load data into the test memory just like
we did in our processor tests in Part B. The check task is similar to
what we have used in the past to set in_val and in_size and check
in_rdy and result. Notice how the check task includes an optional
fifth argument which can be used to ignore the outputs. So the above test
case first loads a four-element array into memory. The test case sends a
message to the accelerator with a size of four, and then waits for 20
cycles (without checking in_rdy nor result!). After 20 cycles, the
test case checks to make sure the result is correct.
We have added cycle-by-cycle tracing code in the check task which
produces a trace similar in spirit to what was shown above. You can use
these traces for preliminary debugging of your accelerator before
potentially moving to using waveforms.
Be sure to test various data values stored in the array, various array sizes, trying to send a message to the accelerator when it is busy, and sending multiple messages to the accelerator within the same test case (i.e., reusing the accelerator multiple times).
3.2. Interactive Simulator
We have provided you an interactive simulator which will emulate the FPGA prototype you will be implementing in Part E. After finishing implementing and thoroughly testing your accelerator, you can build and run the simulator like this:
The simulator will show the cycle-level trace of the accelerator and then the final result using the seven-segment displays.
0: xxx () |
1: () 00000000 |
2: () 00000000 |
3: 4 () 00000000 |
4: . () | rd:00000000:00000024
5: . () | rd:00000004:0000001a
6: . () | rd:00000008:00000045
7: . () | rd:0000000c:00000039
8: . () | rd:00000010:0000000b
9: () 000000bc |
10: () 000000bc |
11: () 000000bc |
12: () 000000bc |
13: () 000000bc |
=== === ===
| | | | | | |
| | | | | | |
=== === ===
| | | | | |
| | | | | |
=== === ===
Your design does not need to match the above trace cycle-by-cycle.
It just needs to implement the specification. If your accelerator takes
a few cycles more or less to implement the accumulate that is perfectly
fine. The key is that result should be reset to zero and that in_rdy
should be one when the accumulator produces the final result.
The interactive simulator is using the following array loaded into the memory.
addr data size result result seven
(hex) (dec) (dec) (dec) (hex) segment
---------------------------------------
000 36 1 36 0x024 4
004 26 2 62 0x03e 30
008 69 3 131 0x083 3
00c 57 4 188 0x0bc 28
010 11 5 199 0x0c7 7
014 68 6 267 0x10b 11
018 41 7 308 0x134 20
01c 90 8 398 0x18e 14
020 32 9 430 0x1ae 14
024 76 10 506 0x1fa 26
028 44 11 550 0x226 6
02c 19 12 569 0x239 25
030 17 13 586 0x24a 10
034 59 14 645 0x285 5
038 99 15 744 0x2e8 8
03c 49 16 793 0x319 25
040 65 17 858 0x35a 26
044 12 18 870 0x366 6
048 55 19 925 0x39d 29
04c 0 20 925 0x39d 29
050 51 21 976 0x3d0 16
054 42 22 1018 0x3fa 26
058 82 23 1100 0x44c 12
05c 23 24 1123 0x463 3
060 21 25 1144 0x478 24
064 54 26 1198 0x4ae 14
068 83 27 1281 0x501 1
06c 31 28 1312 0x520 0
070 16 29 1328 0x530 16
074 76 30 1404 0x57c 28
078 21 31 1425 0x591 17
07c 4 32 1429 0x595 21
The switches were set to 00100 in binary which is 4 in decimal. We find the row in the table where the size is 4. The final sum should be 188 which is 0x0bc in hex. We can see in the cycle-level trace that the final result is indeed 0x0bc. Since the seven-segment display only shows values from 0-31, it will only be able to show the bottom five bits of the complete result. Then we can look in the "seven segment" column to see what the seven-segment display should show in the real FPGA prototype. For size 4, the seven-segment display should show 28 and indeed that is what we see using the interactive simulator.
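The columns of the table can be recomputed with a short Python sketch, which may help when checking other sizes by hand. The function name is hypothetical; the data values are the first eight entries of the array above, and the seven-segment value is taken as the bottom five bits of the running sum.

```python
DATA = [36, 26, 69, 57, 11, 68, 41, 90]  # first eight array values from the table


def table_rows(data):
    """Return (size, sum in decimal, sum in hex, seven-segment value) per row."""
    rows = []
    total = 0
    for size, value in enumerate(data, start=1):
        total = (total + value) & 0xFFFFFFFF   # 32-bit running sum
        seven_seg = total & 0x1F               # bottom five bits (0-31)
        rows.append((size, total, f"0x{total:03x}", seven_seg))
    return rows


for row in table_rows(DATA):
    print(row)
# The row for size 4 is (4, 188, '0x0bc', 28)
```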
4. Lab Code Submission
To submit your code you simply push your code to GitHub. You can push
your code as many times as you like before the deadline. Students are
responsible for going to the GitHub website for their repository, browsing
the source code, and confirming the code on GitHub is the code they want
to submit. Be sure to verify your code is passing your tests both on
ecelinux and on GitHub Actions. Your design code will be assessed in
terms of code quality, verification quality, and functionality.
4.1. Code Quality
Your code quality score will be based on how well you follow the course coding conventions posted here:
4.2. Verification Quality
Verification quality is based on how well your testing enables making a compelling case for correctness. You will need to write compelling directed test cases. Use comments appropriately to describe your test cases.
4.3. Functionality
Your functionality score will be determined by running your code against a series of tests developed by the instructors to test its correctness. Note that we will be using the automated build system to test your final code submission as shown below.