Lab 4: TinyRV1 Processor
Part C: Accumulate Accelerator

Lab 4 will give you experience designing, implementing, testing, and prototyping a single-cycle processor microarchitecture and a specialized accelerator. The processor will implement the TinyRV1 instruction set. The instruction set manual is located here:

https://cornell-ece2300.github.io/ece2300-mkdocs/ece2300-tinyrv1-isa

The lab will continue to leverage concepts from Topic 2: Combinational Logic, Topic 3: Boolean Algebra, Topic 4: Combinational Building Blocks, Topic 6: Sequential Logic, Topic 7: Finite-State Machines, and Topic 8: Sequential Building Blocks. The lab will also leverage concepts from Topic 9: Instruction Set Architecture and Topic 10: Single-Cycle Processors. The lab will continue to provide opportunities to leverage the three key abstraction principles: modularity, hierarchy, and regularity.

The lab includes seven parts:

Part A: Processor Components
- Due 11/6 @ 11:59pm via GitHub
- Students should work on Part A before, during, and after your assigned lab section during the week of 11/3
- Pre-lab survey on Canvas is (roughly) due by end of lab section during the week of 11/3
Part B: TinyRV1 Processor
- Due 11/13 @ 11:59pm via GitHub
- Students should work on Part B before, during, and after your assigned lab section during the week of 11/10
Part C: Accumulate Accelerator
- Due 11/25 @ 11:59pm via GitHub
- Students should plan to submit Part C before they leave for Thanksgiving Break
Part D: FPGA Prototype v1
- Due week of 11/17 during assigned lab section
- This part will focus on prototyping the code developed in Part A+B
- Even though completed with a partner, every student must turn in their own paper check-off sheet in their lab section!
Part E: FPGA Prototype v2
- Due week of 12/1 during assigned lab section
- This part will focus on prototyping the code developed in Part A+B+C
- Even though completed with a partner, every student must turn in their own paper check-off sheet in their lab section!
Part F: TinyRV1 Assembly
- Due 12/4 @ 11:59pm via GitHub
- This part will include all of the assembly developed during Part D+E
Part G: Report
- Due on 12/8 at 11:59pm for all groups!
- Post-lab survey on Canvas is due at the same time as the report

All parts of Lab 4 must be done with a partner. You can confirm your partner on Canvas (Click on People, then Groups, then search for your name to find your lab group).

Both students must contribute to all parts!

It is not acceptable for one student to exclusively work on the code while the other student exclusively works on the report. It is not acceptable for one student to exclusively work on hardware design while the other student exclusively works on testing. Both students must contribute to all parts. Student understanding of Verilog design and testing will be assessed on the prelim exams, final exam, and Verilog coding exam. The instructors will also survey the Git commit log on GitHub to confirm that both students are contributing equally. If you are using pair programming, then both students must take turns using their own account so both students have representative Git commits. Students should create commits after finishing each step of the lab, so their contribution is clear in the Git commit log. A student whose contribution is limited as represented by the Git commit log will receive a significant deduction to their lab score.

This handout assumes that you have read and understand the course tutorials and that you have attended the discussion sections. This handout also assumes you have successfully completed Part A and Part B. You should have already cloned your group remote repository, so use git pull to ensure you have any recent updates before working on your lab assignment.

% cd ${HOME}/ece2300/groupXX
% git pull
% tree

where XX should be replaced with your group number.

The following table shows all of the hardware modules you will be developing in Lab 4.

1. TinyRV1 Accumulate Assembly Program

We will start by implementing a simple accumulate assembly program. We have provided you a template in lab4/asm/accumulate.asm. Take a look at this template.

start:

  # wait for button to be pressed

  addi x1, x0, 1
wait_posedge:
  lw   x2, 0x208(x0)
  bne  x2, x1, wait_posedge

  # wait for button to be unpressed

wait_negedge:
  lw   x2, 0x208(x0)
  bne  x2, x0, wait_negedge

  # read and display size

  lw   x1, 0x200(x0)
  sw   x1, 0x210(x0)

  # set breadboard pin high for timing

  addi x3, x0, 1
  sw   x3, 0x21c(x0)

  #''' LAB ASSIGNMENT ''''''''''''''''''''''''''''''''''''''''''''''''''''
  # Write your accumulate loop
  #'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
  # Be sure to understand the code above and below so you know what
  # register stores the size and what register stores the result.

  # set breadboard pin low for timing

  sw   x0, 0x21c(x0)

  # done

  sw   x4, 0x214(x0)
  jal  x0, start

 .data
           #       result result   seven
           #  size  (dec)  (hex) segment
           # ----------------------------
  .word 36 #     1     36  0x024    4
  .word 26 #     2     62  0x03e   30
  .word 69 #     3    131  0x083    3
  .word 57 #     4    188  0x0bc   28
  ...

We have provided you assembly code to wait for a button to be pressed. Note that we cannot simply wait for in2 to be one because then we might end up triggering multiple accumulations. So we must first wait for the button to go from zero to one, and then wait for the button to go from one to zero. You can assume the behavior of your assembly program is undefined when size is zero.
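To see why the template waits for a full press and release, the sketch below compares two polling strategies on a made-up sequence of button samples. It is written in Python purely for illustration; the function names and the sample sequence are hypothetical and are not part of the lab code, and treating each sample as one loop iteration is a simplifying assumption.

```python
def count_level_triggers(samples):
    """Naive polling: start an accumulation every time the button reads one."""
    return sum(1 for s in samples if s == 1)


def count_press_release(samples):
    """Template's approach: wait for a one, then wait for a zero, then run once."""
    count = 0
    waiting_for_press = True
    for s in samples:
        if waiting_for_press:
            if s == 1:
                waiting_for_press = False  # saw the button go high
        elif s == 0:
            count += 1                     # saw the release: one accumulation
            waiting_for_press = True
    return count


# Hypothetical samples: one long press followed by one short press.
samples = [0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0]

print(count_level_triggers(samples), count_press_release(samples))  # prints "6 2"
```

Two physical presses produce two accumulations only with the press-and-release approach; the level-based version fires once per sample while the button is held.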

The input data is specified using the .data and .word assembler directives. This means the input data will be loaded in the data section of physical memory, and recall from the TinyRV1 ISA that the data section starts at address 0x100.

Once the button has been pressed then we read in0 and display the size on the seven-segment displays. We then write one to out3 before we start executing the accumulate loop and then write zero to out3 after we have finished executing the accumulate loop. Out3 will be connected to a pin on the breadboard; our plan is to probe this pin on the breadboard with the oscilloscope so we can precisely measure the time to accumulate an array of numbers and then compare this to specialized hardware.

After the accumulate loop, we again set the breadboard pin to be low, before writing the result to out1 to be displayed on a seven-segment display. The very last step is to unconditionally jump back to the start of the program.

After implementing the accumulate loop, you can test it using the ISA simulator. You will need to use the TUI because you need to set in2 to one and then set it back to zero to start the accumulate loop. Try running the accumulate loop multiple times within the same execution with different size values to ensure your assembly program is working correctly.

% cd ${HOME}/ece2300/groupXX/build
% make proc-isa-sim
% make accumulate.bin
% ./proc-isa-sim +bin=accumulate.bin +tui

You can also test it using the RTL single-cycle processor simulator.

% cd ${HOME}/ece2300/groupXX/build
% make proc-scycle-sim
% make accumulate.bin
% ./proc-scycle-sim +bin=accumulate.bin +tui

2. Interface and Implementation Specification

Your TinyRV1 single-cycle processor is programmable, meaning it can perform different functionality simply by executing different assembly-level programs. We will also be implementing a specialized accelerator which can only perform a single function. We will then do a comparative analysis to understand the performance and area of both our general-purpose programmable processor and a specialized accelerator.

Our accelerator will implement the same functionality as the accumulate assembly program you implemented in the previous part. It will accumulate 32-bit integer values stored in an array in memory to produce a single sum. The interface for our accelerator is shown below.

module AccumXcel
(
  (* keep=1 *) input  logic        clk,
  (* keep=1 *) input  logic        rst,

  // Input val/rdy interface

  (* keep=1 *) input  logic        in_val,
  (* keep=1 *) output logic        in_rdy,
  (* keep=1 *) input  logic  [6:0] in_size,

  // Result

  (* keep=1 *) output logic [31:0] result,

  // Memory interface

  (* keep=1 *) output logic        mem_val,
  (* keep=1 *) output logic [31:0] mem_addr,
  (* keep=1 *) input  logic [31:0] mem_rdata
);

The accelerator uses a latency-insensitive valid/ready interface; this is the same interface you used in Lab 3. Recall that a val/rdy interface uses the following micro-protocol:

Valid/Ready Micro-Protocol

Assume we have a producer that wishes to send a message to a consumer using the valid/ready micro-protocol. At the beginning of the cycle, the producer determines if it has a new message to send to the consumer. If so, it sets the message bits appropriately and then sets the valid signal high. Also at the beginning of the cycle, the consumer determines if it is able to accept a new message from the producer. If so, it sets the ready signal high. At the end of the cycle, the producer and consumer can independently AND the valid and ready signals together; if both signals are true then the message is considered to have been sent from the producer to the consumer and both sides can update their internal state appropriately. Otherwise, we will try again on the next cycle. To avoid long combinational paths and/or combinational loops, we should avoid making the valid signal depend on the ready signal or the ready signal depend on the valid signal. If you absolutely must, you can make the ready signal depend on the valid signal but it is considered very bad practice to make the valid signal depend on the ready signal. As long as you adhere to this valid/ready micro-protocol, composing modules via the stream interfaces should not cause significant timing issues.

So to send an accumulate message to the accelerator we need to: (1) set in_size to be the number of elements to accumulate; and (2) set in_val to be one. Whenever in_val and in_rdy are both one then the message has been sent. If in_val is one and in_rdy is zero, then the accelerator is busy and the message is not sent. If in_val is zero and in_rdy is one, then the accelerator is not busy but there is no message being sent. Do not make in_rdy depend on in_val!

The output result port is used to indicate the accumulated sum across the elements in an array. When in_rdy is zero, then result is undefined. When in_rdy is one, then result must always be the accumulated sum of the array elements from the most recent accumulate message. After resetting the accelerator (i.e., when the rst signal is one for one or more cycles), result should be zero. You can assume the behavior of your accelerator is undefined when size is zero.

The accelerator will assume the array starts at memory address 0x000 and that the array is no longer than 512B (i.e., 128 4B values). Note that the size is specified in elements not bytes! So if the size is 4 then the accelerator should accumulate the values stored at memory addresses 0x000, 0x004, 0x008, and 0x00c. The memory interface is very similar to the processor/memory interface. To read from memory set mem_addr to the memory address to be read and mem_val to one; then mem_rdata will have the read data. Remember the physical memory uses a combinational read.
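The element-versus-byte distinction above can be sketched as a small Python reference model. This is only an illustration: the names element_addresses, reference_sum, and MASK32 are hypothetical, the dictionary stands in for the combinational-read memory, and wrapping the sum to 32 bits is an assumption based on the width of the result port.

```python
MASK32 = 0xFFFFFFFF  # assumption: the sum wraps like a 32-bit adder


def element_addresses(size):
    """Byte addresses read when accumulating `size` elements from 0x000."""
    return [4 * i for i in range(size)]


def reference_sum(memory, size):
    """Sum the first `size` words; `memory` maps byte address -> word."""
    total = 0
    for addr in element_addresses(size):
        total = (total + memory[addr]) & MASK32
    return total


# Example memory contents: the values 1, 2, 3, 4 starting at address 0x000.
memory = {0x000: 1, 0x004: 2, 0x008: 3, 0x00C: 4}

print([f"0x{a:03x}" for a in element_addresses(4)])
print(reference_sum(memory, 4))
```

A size of 4 therefore touches 16 bytes of memory, not 4.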

Assume physical memory is initialized with an array containing the values 1, 2, 3, and 4 (i.e., the value 1 is at address 0x000, the value 2 is at address 0x004, etc). Here is an example trace of one possible correct execution for accumulating the four elements in the array.

       ---- in ----
cycles val rdy size   result     memory request
------------------------------------------------------
    0:   0   1      () 00000000 |                       # reset result to zero!
    1:   0   1      () 00000000 |
    2:   1   1    4 () 00000000 |
    3:   0   0    . ()          | rd:00000000:00000001
    4:   0   0    . ()          | rd:00000004:00000002
    5:   0   0    . ()          | rd:00000008:00000003
    6:   0   0    . ()          | rd:0000000c:00000004
    7:   0   0    . ()          | rd:00000010:xxxxxxxx
    8:   0   1      () 0000000a |
    9:   0   1      () 0000000a |

The first column shows the cycle count. The size column is one of four options:

number: this means in_val && in_rdy so a message is successfully sent to the accelerator; the number shows what size is being sent to the accelerator
#: this means in_val && !in_rdy so we have a valid message to send to the accelerator, but the accelerator is not ready
.: this means !in_val && !in_rdy so we do not have a valid message to send to the accelerator, but regardless the accelerator is not ready
blank: this means !in_val && in_rdy so we do not have a valid message to send to the accelerator, but the accelerator is still ready (and waiting for work to do)
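The four cases above can be restated as a small Python helper. This is only a reading aid for the trace format; the function name size_column is hypothetical and is not part of the provided test harness.

```python
def size_column(in_val, in_rdy, in_size):
    """Return the text shown in the size column for one cycle of the trace."""
    if in_val and in_rdy:
        return str(in_size)  # message accepted this cycle
    if in_val and not in_rdy:
        return "#"           # producer is waiting on a busy accelerator
    if not in_val and not in_rdy:
        return "."           # nothing to send and accelerator is busy
    return ""                # accelerator is idle and waiting for work


for val, rdy in [(1, 1), (1, 0), (0, 0), (0, 1)]:
    print(val, rdy, repr(size_column(val, rdy, 4)))
```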

The accelerator has correctly reset the result to zero at the beginning of the trace. A message with size of 4 is successfully sent to the accelerator on cycle 2. On cycles 3-7, the accelerator is not ready, but we do not have any new accumulate messages to send the accelerator anyway. The result column shows the result output only when in_rdy is one. We can see that in_rdy is one on cycle 8 and the accelerator has produced the correct result (1+2+3+4 = 10 which is 0x0000000a in hex). The memory request column shows the type, address, and read data for valid memory requests being sent to the physical memory; we can see this accelerator has sent five memory requests to physical memory. The read data from the final memory request was not used since we are only accumulating four elements.

Your accelerator must correctly handle multiple consecutive accumulate messages. Again, assume physical memory is initialized with an array containing the values 1, 2, 3, and 4 (i.e., the value 1 is at address 0x000, the value 2 is at address 0x004, etc). Here is an example trace of one possible correct execution for first accumulating the first three elements in the array and then accumulating the first four elements in the array.

       ---- in ----
cycles val rdy size   result     memory request
------------------------------------------------------
    0:   0   1      () 00000000 |                      # reset result to zero!
    1:   0   1      () 00000000 |
    2:   1   1    3 () 00000000 |
    3:   1   0    # ()          | rd:00000000:00000001
    4:   1   0    # ()          | rd:00000004:00000002
    5:   1   0    # ()          | rd:00000008:00000003
    6:   1   0    # ()          | rd:0000000c:00000004
    7:   1   1    4 () 00000006 |
    8:   0   0    . ()          | rd:00000000:00000001
    9:   0   0    . ()          | rd:00000004:00000002
   10:   0   0    . ()          | rd:00000008:00000003
   11:   0   0    . ()          | rd:0000000c:00000004
   12:   0   0    . ()          | rd:00000010:xxxxxxxx
   13:   0   1      () 0000000a |
   14:   0   1      () 0000000a |

The accelerator has correctly reset the result to zero at the beginning of the trace. A message with size of 3 is successfully sent to the accelerator on cycle 2. On cycles 3-6, the producer has a second accumulate message to send to the accelerator, but the accelerator is not ready. On cycle 7, the accelerator is now ready, so we see two things: (1) since in_rdy is one, the result is shown as 1+2+3 = 6; and (2) since in_val is one and in_rdy is one the second accumulate message with size of 4 is successfully sent to the accelerator. On cycles 8-12, the accelerator is busy (i.e., not ready) executing the second transaction. On cycle 13, the accelerator is once again ready and we see the accelerator has produced the correct result (1+2+3+4 = 10 which is 0x0000000a in hex).

Your design does not need to match the above traces cycle-by-cycle. It just needs to implement the specification. If your accelerator takes a few cycles more or less to implement the accumulate, that is perfectly fine. The key is that when the accelerator is done it sets result to the final result and in_rdy to one, and keeps both of these output ports set this way until it starts executing the next transaction!
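One way to reason about the specification before drawing any hardware is a cycle-level behavioral model. The Python sketch below is an assumption-laden illustration, not the required design: the class AccumXcelModel, the helper run, and the one-fetch-per-cycle timing are all hypothetical, it takes fewer cycles than the traces above, and it says nothing about how to split the logic into a datapath and a control unit.

```python
MASK32 = 0xFFFFFFFF  # assumption: the sum wraps to 32 bits


class AccumXcelModel:
    """Behavioral model: idle/busy state plus a running sum."""

    def __init__(self, words):
        self.words = list(words)  # memory contents starting at address 0x000
        self.busy = False
        self.acc = 0              # reset value of result is zero
        self.index = 0
        self.size = 0

    def cycle(self, in_val, in_size):
        """Simulate one cycle; returns (in_rdy, result, mem_addr)."""
        in_rdy = not self.busy                 # depends on state only, never on in_val
        result = self.acc if in_rdy else None  # result is undefined while busy
        mem_addr = None
        if self.busy:
            mem_addr = 4 * self.index          # size counts elements, not bytes
            self.acc = (self.acc + self.words[self.index]) & MASK32
            self.index += 1
            if self.index == self.size:        # size of zero is undefined behavior
                self.busy = False
        elif in_val:                           # in_val && in_rdy: message accepted
            self.busy, self.size, self.index, self.acc = True, in_size, 0, 0
        return in_rdy, result, mem_addr


def run(model, sizes, num_cycles):
    """Producer that holds in_val high until each message is accepted."""
    pending, trace = list(sizes), []
    for _ in range(num_cycles):
        in_val = len(pending) > 0
        in_rdy, result, mem_addr = model.cycle(in_val, pending[0] if in_val else 0)
        if in_val and in_rdy:
            pending.pop(0)
        trace.append((in_val, in_rdy, result, mem_addr))
    return trace


trace = run(AccumXcelModel([1, 2, 3, 4]), sizes=[3, 4], num_cycles=12)
for cycle, row in enumerate(trace):
    print(cycle, row)
```

In this sketch the result stays at the previous sum on the cycle a new message is accepted and is only cleared afterwards, which mirrors the requirement that result holds the most recent sum whenever in_rdy is one.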

You must implement your accelerator using a datapath (lab4/AccumXcelDpath.v) and a control unit (lab4/AccumXcelCtrl.v). We have very specific requirements on what kind of hardware modeling is permitted in these two files.

Datapath Rules: The datapath must be completely structural. You can use any of the components previously developed in the lab assignments. You should only instantiate and connect modules that you have implemented and tested separately. This means you cannot directly use any logic in this module; no always blocks and nothing in an assign statement other than basic connectivity.
Control Unit Rules: The control unit can either be purely combinational or use a finite-state machine (FSM). If you do use an FSM, it must have three parts: (1) the state register which should be implemented using a DFFR_RTL, Register_16b_RTL, or Register_32b_RTL; (2) an always_comb block to implement the combinational next state logic; and (3) an always_comb block to implement the output logic. There should be no other logic in the control unit. You cannot use always_ff blocks directly in your control unit!

You are free to structure your datapath however you like, and you are free to use any kind of combinational logic or finite-state-machine for the control unit; but you must follow the above rules. The provided AccumXcel.v file composes the datapath and control units. You will need to modify this file to add new control and/or status signals.

You might want to take a look at the music player for inspiration on how to read an array of values from memory. We strongly recommend using your counter from Lab 3 as the core of your datapath to generate memory addresses similar to the approach we used in the music player. However, this is not required and you are welcome to take any approach you like as long as you follow the above rules.

Use an incremental design approach!

Do not implement the entire datapath, then implement the entire control unit, and then try to run your first test! Take an incremental approach. We recommend you develop your accelerator using the following three steps (this is exactly what the course staff did!). Draw an FSM and/or datapath block diagram for each step before writing any code!

Step 1: Fetch: Implement a datapath and FSM that just fetches data from memory and never stops. Run the basic test case and look at the cycle-level trace. Dump the waveforms and look at the waveforms in Surfer. See if your accelerator is correctly fetching each element from memory.
Step 2: Fetch and Stop: Augment your datapath and FSM so that it fetches data from memory and stops once it fetches in_size elements. Run the basic test case and look at the cycle-level trace. Dump the waveforms and look at the waveforms in Surfer. See if your accelerator is correctly fetching each element from memory and also stopping after fetching four elements.
Step 3: Fetch, Accumulate, and Stop: Now that you know your accelerator can fetch data correctly, augment your datapath and FSM so that it fetches data from memory, does the accumulation, and stops once it fetches in_size elements. Run the basic test case and look at the cycle-level trace. Dump the waveforms and look at the waveforms in Surfer. See if your accelerator is correctly fetching each element from memory, accumulating the values, stopping after fetching four elements, and outputting the correct sum.

You really do need an accelerator block diagram!

Once you have your accumulate accelerator working you really do need to make sure you have a clean block diagram which shows all of the registers, muxes, adders, etc., and how they are connected. You will need to show this to the TAs in Lab 4, Part E and then you will annotate this diagram with the actual critical path and component delays as analyzed by Quartus. If you do not draw this block diagram now, you will need to draw it during Lab 4, Part E, wasting precious time!

3. Testing Strategy

You are responsible for writing directed tests for your accelerator in lab4/test/AccumXcel-test.v. You do not need to implement random or xprop tests.

3.1. Directed Testing

The test cases will look like this:

task test_case_1_basic();
  t.test_case_begin( "test_case_1_basic" );

  // Load test data into test memory (remember, accelerator always
  // starts accumulating from address 0x000!)

  data( 'h000, 1 );
  data( 'h004, 2 );
  data( 'h008, 3 );
  data( 'h00c, 4 );

  // Send message to accumulate 4 elements

  //     ---- in ----
  //     val rdy size result
  check( 1,  1,  4,   0 );

  // Simulate for 20 cycles

  for ( int i = 0; i < 20; i = i+1 )
    check( 0, 0, 0, 0, IGNORE_OUTPUTS );

  // Check result is correct

  //     ---- in ----
  //     val rdy size result
  check( 0,  1,  0,   10 );

  t.test_case_end();
endtask

You can use the data task to load data into the test memory just like we did in our processor tests in Part B. The check task is similar to what we have used in the past to set in_val and in_size and check in_rdy and result. Notice how the check task includes an optional fifth argument which can be used to ignore the outputs. So the above test case first loads a four-element array into memory. The test case sends a message to the accelerator with a size of four, and then waits for 20 cycles (without checking in_rdy or result!). After 20 cycles, the test case checks to make sure the result is correct.

We have added cycle-by-cycle tracing code in the check task which produces a trace similar in spirit to what was shown above. You can use these traces for preliminary debugging of your accelerator before potentially moving to using waveforms.

Be sure to test various data values stored in the array, various array sizes, trying to send a message to the accelerator when it is busy, and sending multiple messages to the accelerator within the same test case (i.e., reusing the accelerator multiple times).

3.2. Interactive Simulator

We have provided you an interactive simulator which will emulate the FPGA prototype you will be implementing in Part E. After finishing implementing and thoroughly testing your accelerator, you can build and run the simulator like this:

% cd ${HOME}/ece2300/groupXX/build
% make accum-xcel-sim
% ./accum-xcel-sim +switches=00100

The simulator will show the cycle-level trace of the accelerator and then the final result using the seven-segment displays.

  0: xxx ()          |
  1:     () 00000000 |
  2:     () 00000000 |
  3:   4 () 00000000 |
  4:   . ()          | rd:00000000:00000024
  5:   . ()          | rd:00000004:0000001a
  6:   . ()          | rd:00000008:00000045
  7:   . ()          | rd:0000000c:00000039
  8:   . ()          | rd:00000010:0000000b
  9:     () 000000bc |
 10:     () 000000bc |
 11:     () 000000bc |
 12:     () 000000bc |
 13:     () 000000bc |

   ===             ===    ===
  |   |  |   |        |  |   |
  |   |  |   |        |  |   |
          ===      ===    ===
  |   |      |    |      |   |
  |   |      |    |      |   |
   ===             ===    ===

Your design does not need to match the above trace cycle-by-cycle. It just needs to implement the specification. If your accelerator takes a few cycles more or less to implement the accumulate, that is perfectly fine. The key is that result should be reset to zero and that in_rdy should be one when the accelerator produces the final result.

The interactive simulator is using the following array loaded into the memory.

addr  data  size  result result seven
(hex) (dec) (dec) (dec)  (hex)  segment
---------------------------------------
000   36     1       36  0x024    4
004   26     2       62  0x03e   30
008   69     3      131  0x083    3
00c   57     4      188  0x0bc   28
010   11     5      199  0x0c7    7
014   68     6      267  0x10b   11
018   41     7      308  0x134   20
01c   90     8      398  0x18e   14

020   32     9      430  0x1ae   14
024   76    10      506  0x1fa   26
028   44    11      550  0x226    6
02c   19    12      569  0x239   25
030   17    13      586  0x24a   10
034   59    14      645  0x285    5
038   99    15      744  0x2e8    8
03c   49    16      793  0x319   25

040   65    17      858  0x35a   26
044   12    18      870  0x366    6
048   55    19      925  0x39d   29
04c    0    20      925  0x39d   29
050   51    21      976  0x3d0   16
054   42    22     1018  0x3fa   26
058   82    23     1100  0x44c   12
05c   23    24     1123  0x463    3

060   21    25     1144  0x478   24
064   54    26     1198  0x4ae   14
068   83    27     1281  0x501    1
06c   31    28     1312  0x520    0
070   16    29     1328  0x530   16
074   76    30     1404  0x57c   28
078   21    31     1425  0x591   17
07c    4    32     1429  0x595   21

The switches were set to 00100 in binary which is 4 in decimal. We find the row in the table where the size is 4. The final sum should be 188 which is 0x0bc in hex. We can see in the cycle-level trace that the final result is indeed 0x0bc. Since the seven-segment display only shows values from 0-31, it will only be able to show the bottom five bits of the complete result. Then we can look in the "seven segment" column to see what the seven-segment display should show in the real FPGA prototype. For size 4, the seven-segment display should show 28 and indeed that is what we see using the interactive simulator.
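The columns of the table can be recomputed with a few lines of Python, which is handy when checking the simulator output for other switch settings. The sketch below uses the first eight data values from the table; the function name table_rows is hypothetical.

```python
# First eight data values from the table above (addresses 0x000 to 0x01c).
data = [36, 26, 69, 57, 11, 68, 41, 90]


def table_rows(values):
    """Return (size, result_dec, result_hex, seven_segment) for each prefix."""
    rows = []
    total = 0
    for size, value in enumerate(values, start=1):
        total += value
        # The seven-segment display only shows the bottom five bits (0-31).
        rows.append((size, total, f"0x{total:03x}", total & 0x1F))
    return rows


for row in table_rows(data):
    print(row)
```

For size 4 the bottom five bits of the sum are 28, which matches the "seven segment" column.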

4. Lab Code Submission

To submit your code you simply push your code to GitHub. You can push your code as many times as you like before the deadline. Students are responsible for going to the GitHub website for their repository, browsing the source code, and confirming that the code on GitHub is the code they want to submit. Be sure to verify your code is passing your tests both on ecelinux and on GitHub Actions. Your design code will be assessed in terms of code quality, verification quality, and functionality.

4.1. Code Quality

Your code quality score will be based on how well you follow the course coding conventions posted here:

https://cornell-ece2300.github.io/ece2300-mkdocs/ece2300-coding-conventions

4.2. Verification Quality

Verification quality is based on how well your testing enables making a compelling case for correctness. You will need to write compelling directed test cases. Use comments appropriately to describe your test cases.

4.3. Functionality

Your functionality score will be determined by running your code against a series of tests developed by the instructors to test its correctness. Note that we will be using the automated build system to test your final code submission as shown below.

% mkdir -p ${HOME}/ece2300
% cd ${HOME}/ece2300
% git clone git@github.com:cornell-ece2300/groupXX
% cd groupXX

% mkdir -p build
% cd build
% ../configure
% make check-lab4-partC
