Lab 4: TinyRV1 Processor
Part B: TinyRV1 Processor

Lab 4 will give you experience designing, implementing, testing, and prototyping a single-cycle processor microarchitecture and a specialized accelerator. The processor will implement the TinyRV1 instruction set. The instruction set manual is located here:

https://cornell-ece2300.github.io/ece2300-mkdocs/ece2300-tinyrv1-isa

The lab will continue to leverage concepts from Topic 2: Combinational Logic, Topic 3: Boolean Algebra, Topic 4: Combinational Building Blocks, Topic 6: Sequential Logic, Topic 7: Finite-State Machines, and Topic 8: Sequential Building Blocks. The lab will also leverage concepts from Topic 9: Instruction Set Architecture and Topic 10: Single-Cycle Processors. The lab will continue to provide opportunities to leverage the three key abstraction principles: modularity, hierarchy, and regularity.

The lab includes seven parts:

Part A: Processor Components
- Due 11/6 @ 11:59pm via GitHub
- Students should work on Part A before, during, and after their assigned lab section during the week of 11/3
- Pre-lab survey on Canvas is (roughly) due by end of lab section during the week of 11/3
Part B: TinyRV1 Processor
- Due 11/13 @ 11:59pm via GitHub
- Students should work on Part B before, during, and after their assigned lab section during the week of 11/10
Part C: Accumulator Accelerator
- Due 11/24 @ 11:59pm via GitHub
- Students should plan to submit Part C before they leave for Thanksgiving Break
Part D: FPGA Prototype v1
- Due week of 11/17 during assigned lab section
- This part will focus on prototyping the code developed in Part A+B
- Even though completed with a partner, every student must turn in their own paper check-off sheet in their lab section!
Part E: FPGA Prototype v2
- Due week of 12/1 during assigned lab section
- This part will focus on prototyping the code developed in Part A+B+C
- Even though completed with a partner, every student must turn in their own paper check-off sheet in their lab section!
Part F: TinyRV1 Assembly
- Due 12/4 @ 11:59pm via GitHub
- This part will include all of the assembly developed during Part D+E
Part G: Report
- Due on 12/8 at 11:59pm for all groups!
- Post-lab survey on Canvas is due at the same time as the report

All parts of Lab 4 must be done with a partner. You can confirm your partner on Canvas (Click on People, then Groups, then search for your name to find your lab group).

Both students must contribute to all parts!

It is not acceptable for one student to exclusively work on the code while the other student exclusively works on the report. It is not acceptable for one student to exclusively work on hardware design while the other student exclusively works on testing. Both students must contribute to all parts. Student understanding of Verilog design and testing will be assessed on the prelim exams, final exam, and Verilog coding exam. The instructors will also survey the Git commit log on GitHub to confirm that both students are contributing equally. If you are using pair programming, then both students must take turns using their own account so both students have representative Git commits. Students should create commits after finishing each step of the lab, so their contribution is clear in the Git commit log. A student whose contribution is limited as represented by the Git commit log will receive a significant deduction to their lab score.

This handout assumes that you have read and understood the course tutorials and that you have attended the discussion sections. This handout also assumes you have successfully completed Part A. You should have already cloned your individual remote repository, so use git pull to ensure you have any recent updates before working on your lab assignment.

% cd ${HOME}/ece2300/groupXX
% git pull
% tree

where XX should be replaced with your group number.

The following table shows all of the hardware modules you will be developing in Lab 4.

1. Interface and Implementation Specification

A complete TinyRV1 embedded system includes the following key components: (1) the TinyRV1 processor itself; (2) 512 bytes of physical memory for storing instructions and data; (3) an SPI unit to enable a host workstation to load assembly programs into physical memory; (4) various external I/O devices (e.g., switches, seven-segment displays, distance sensor, multi-note player); and (5) a memory bus which interconnects the processor, physical memory, SPI, and external I/O devices.

We will provide you the physical memory module and you have already implemented various external I/O devices in previous labs. In Part B, you will be using the datapath components from Part A along with components from previous labs to implement the TinyRV1 processor, memory bus, and SPI.

1.1. TinyRV1 Single-Cycle Processor

The single-cycle TinyRV1 processor has the following interface:

module ProcScycle
(
  (* keep=1 *) input  logic        clk,
  (* keep=1 *) input  logic        rst,

  // Memory Interfaces

  (* keep=1 *) output logic        imem_val,
  (* keep=1 *) input  logic        imem_wait,
  (* keep=1 *) output logic [31:0] imem_addr,
  (* keep=1 *) input  logic [31:0] imem_rdata,

  (* keep=1 *) output logic        dmem_val,
  (* keep=1 *) input  logic        dmem_wait,
  (* keep=1 *) output logic        dmem_type,
  (* keep=1 *) output logic [31:0] dmem_addr,
  (* keep=1 *) output logic [31:0] dmem_wdata,
  (* keep=1 *) input  logic [31:0] dmem_rdata,

  // Trace Interface

  (* keep=1 *) output logic        trace_val,
  (* keep=1 *) output logic [31:0] trace_addr,
  (* keep=1 *) output logic        trace_wen,
  (* keep=1 *) output logic [4:0]  trace_wreg,
  (* keep=1 *) output logic [31:0] trace_wdata
);

Memory Interfaces

The instruction memory interface is used to read instructions, similar to how we read notes in Lab 3. To read an instruction, set the imem_val output port high and set the imem_addr output port to the desired instruction address; the instruction will be returned via the imem_rdata input port combinationally (i.e., in the same cycle). The data memory interface enables load and store instructions to read and write memory. It is similar to the instruction memory interface except that it has an additional dmem_type output port which specifies whether we want to read memory (i.e., dmem_type is zero) or write memory (i.e., dmem_type is one). We also need the dmem_wdata output port for the write data. Finally, both the instruction and data memory interfaces include a wait input which enables the memory to tell the processor to wait and try the memory request again on the next cycle.
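To make the read/write protocol concrete, here is a minimal sketch of a memory that would sit on the other side of the data memory interface. This is not the physical memory module provided with the lab; the module name MemIfaceSketch and its port names are hypothetical, and we assume writes take effect on the rising clock edge and that the memory never asserts wait.

```systemverilog
// Hypothetical memory model for illustration only;
// this is NOT the course-provided physical memory module.
module MemIfaceSketch
(
  input  logic        clk,
  input  logic        mem_val,    // request is valid this cycle
  input  logic        mem_type,   // 0 = read, 1 = write
  input  logic [31:0] mem_addr,   // byte address
  input  logic [31:0] mem_wdata,  // write data (only used for writes)
  output logic        mem_wait,   // 1 = retry the request next cycle
  output logic [31:0] mem_rdata   // read data (same cycle as request)
);

  // 128 words x 4 bytes per word = 512 bytes
  logic [31:0] mem [128];

  // Assumption: this sketch never asks the processor to wait
  assign mem_wait = 1'b0;

  // Combinational read: data is returned in the same cycle
  assign mem_rdata
    = ( mem_val && !mem_type ) ? mem[ mem_addr[8:2] ] : 'x;

  // Assumption: writes take effect on the rising clock edge
  always_ff @( posedge clk ) begin
    if ( mem_val && mem_type )
      mem[ mem_addr[8:2] ] <= mem_wdata;
  end

endmodule
```

The word index uses mem_addr[8:2] because every word is four bytes, so the bottom two address bits select a byte within a word and are ignored here.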

Trace Interface

The trace interface is used for verification and should produce a "trace" of all instructions executed by the processor. This interface was not discussed in lecture, but is an important practical aspect of implementing a real processor. We need a simple way to check that each instruction has executed correctly.

Whenever the processor executes an instruction it should set the trace_val output high. It should also set the trace_addr output port to the address of the executed instruction. If the instruction writes the register file, it should set trace_wen to one, trace_wreg to the corresponding write destination register, and trace_wdata to the data written to the register file. If the instruction writes x0 then trace_wen should be one, trace_wreg should be zero, and trace_wdata is undefined. We will be able to write test cases that check the trace interface to verify the processor is executing a given assembly program correctly.

Initial Implementation

The TinyRV1 single-cycle processor implementation will be decomposed into a datapath and a control unit. The datapath must be implemented structurally, meaning we should only instantiate other components without using any always blocks or combinational logic. The control unit will be implemented using a single casez in an always_comb block along with assign statements. We give you an initial processor implementation that supports the ADDI and ADD instructions but does not support the memory wait signals. The datapath for this initial processor is shown below.

The control signal table for this initial processor is shown below.

  // Localparams for op2_sel control signal

  localparam rf  = 1'd0;
  localparam imm = 1'd1;

  // Task for setting control signals

  task automatic cs
  (
    input logic op2_sel_
  );
    op2_sel = op2_sel_;
  endtask

  // Control signal table

  always_comb begin
    casez ( inst )
                          //  op2
                          //  sel
      `TINYRV1_INST_ADDI: cs( imm );
      `TINYRV1_INST_ADD:  cs( rf  );
      default:            cs( 'x  );
    endcase
  end

A control signal table has a column for every control signal and a row for every instruction. Since the initial implementation only supports two instructions, the initial control signal table only has two rows. The first row has all of the control signals required to execute the ADDI instruction, and the second row has all of the control signals required to execute the ADD instruction. We are using a casez because we want to support don't cares in the instruction decoding (i.e., see TINYRV1_INST_ADDI in tinyrv1.v). In this case, we only need one column since we currently only have a single control signal: op2_sel which controls the mux for operand 2 of the ALU. Notice how we are using a SystemVerilog task to set the control signals. This will make it easy to add more columns in a clean way. Also notice how we have created localparams (i.e., rf, imm) for the two values of the mux select. While not required, these kinds of localparams can make your control signal table more readable.
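The role of the don't cares in the casez can be seen in a small standalone sketch. The two patterns below assume the standard RV32 encodings for ADDI and ADD; the macro names SKETCH_INST_ADDI and SKETCH_INST_ADD and the module name DecodeSketch are hypothetical, and the authoritative patterns are the ones defined in tinyrv1.v.

```systemverilog
// Illustrative patterns assuming standard RV32 encodings;
// the authoritative definitions live in tinyrv1.v.
//                           funct7  rs2   rs1   f3  rd    opcode
`define SKETCH_INST_ADDI 32'b???????_?????_?????_000_?????_0010011
`define SKETCH_INST_ADD  32'b0000000_?????_?????_000_?????_0110011

module DecodeSketch
(
  input  logic [31:0] inst,
  output logic        op2_sel
);

  // Each ? matches either 0 or 1, so the register specifiers and
  // the immediate do not affect which row of the table is chosen.

  always_comb begin
    casez ( inst )
      `SKETCH_INST_ADDI: op2_sel = 1'b1; // operand 2 is the immediate
      `SKETCH_INST_ADD:  op2_sel = 1'b0; // operand 2 is the register file
      default:           op2_sel = 'x;
    endcase
  end

endmodule
```

Under these assumed encodings, addi x1, x0, 2 is the machine instruction 32'h0020_0093, which matches the first pattern regardless of the register and immediate fields.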

The following code will connect the initial control unit up to the initial datapath.

// Control Signals (Control Unit -> Datapath)

logic op2_sel;

// Status Signals (Datapath -> Control Unit)

logic [31:0] inst;

// Instantiate/Connect Datapath

ProcSimpleDpath dpath
(
  .clk         (clk),
  .rst         (rst),
  .imem_addr   (imem_addr),
  .imem_rdata  (imem_rdata),
  .dmem_addr   (dmem_addr),
  .dmem_wdata  (dmem_wdata),
  .dmem_rdata  (dmem_rdata),
  .trace_addr  (trace_addr),
  .trace_wen   (trace_wen),
  .trace_wreg  (trace_wreg),
  .trace_wdata (trace_wdata),
  .op2_sel     (op2_sel),
  .inst        (inst)
);

// Instantiate/Connect Control Unit

ProcSimpleCtrl ctrl
(
  .rst         (rst),
  .imem_val    (imem_val),
  .imem_wait   (imem_wait),
  .dmem_val    (dmem_val ),
  .dmem_wait   (dmem_wait),
  .dmem_type   (dmem_type),
  .trace_val   (trace_val),
  .op2_sel     (op2_sel),
  .inst        (inst)
);

There are two things to notice: (1) connecting all of these ports is quite tedious; and (2) every port connects to a signal with the exact same name. Take a quick look at the code in ProcScycle which actually connects the datapath to the control unit.

// Control Signals (Control Unit -> Datapath)

logic op2_sel;

// Status Signals (Datapath -> Control Unit)

logic [31:0] inst;

// Instantiate/Connect Datapath and Control Unit

ProcSimpleDpath dpath
(
  .*
);

ProcSimpleCtrl ctrl
(
  .*
);

Notice how much more compact this is! We are using the SystemVerilog .* operator which connects every port to a signal with the exact same name. This way whenever we want to add a new control or status signal all we need to do is declare a signal with the right name and we are done.
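For readers who have not used implicit port connections before, here is a self-contained sketch of the same idea on a much smaller design. The modules SketchDpath, SketchCtrl, and SketchTop are hypothetical and their logic is only a placeholder; the point is that .* connects each port to the signal in the enclosing module that has the same name.

```systemverilog
// Hypothetical modules for illustrating .* only

module SketchDpath
(
  input  logic        op2_sel,
  input  logic [31:0] imem_rdata,
  output logic [31:0] inst,
  output logic [31:0] op2
);
  assign inst = imem_rdata;

  // Placeholder mux: sign-extended immediate or zero
  assign op2 = op2_sel ? { {20{inst[31]}}, inst[31:20] } : 32'd0;
endmodule

module SketchCtrl
(
  input  logic [31:0] inst,
  output logic        op2_sel
);
  // Placeholder decode: opcode bit 5 is 0 for ADDI and 1 for ADD
  // (assuming standard RV32 opcodes)
  assign op2_sel = !inst[5];
endmodule

module SketchTop
(
  input  logic [31:0] imem_rdata,
  output logic [31:0] op2
);

  // Signals named exactly like the ports they connect

  logic        op2_sel;
  logic [31:0] inst;

  // Same as writing .op2_sel(op2_sel), .inst(inst), and so on

  SketchDpath dpath ( .* );
  SketchCtrl  ctrl  ( .* );

endmodule
```

Keep in mind that .* requires a signal with a matching name and a compatible type for every port; a port without a matching signal is reported as an error rather than silently left unconnected.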

This initial implementation does not handle the memory wait signals and indeed we recommend waiting until all eight instructions are fully working without memory waits before incorporating this support into your processor.

Final Implementation

The final datapath is shown below.

This is the same as in lecture except with the trace interface.

The final control signal table will have eight columns and eight rows just like in lecture. This is an example of what the first two rows might look like, but keep in mind that your implementation might look different depending on how you implement your processor.

task automatic cs
(
  input logic [1:0] pc_sel_,
  input logic [1:0] imm_type_,
  input logic       op2_sel_,
  input logic       alu_func_,
  input logic [1:0] wb_sel_,
  input logic       rf_wen_pre_,
  input logic       dmem_val_pre_,
  input logic       dmem_type_
);
  pc_sel       = pc_sel_;
  imm_type     = imm_type_;
  op2_sel      = op2_sel_;
  alu_func     = alu_func_;
  wb_sel       = wb_sel_;
  rf_wen_pre   = rf_wen_pre_;
  dmem_val_pre = dmem_val_pre_;
  dmem_type    = dmem_type_;
endtask

// Control signal table

always_comb begin
  casez ( inst )
                        //  pc          imm  op2  alu  wb   rf   dmem dmem
                        //  sel         type sel  func sel  wen  val  type
    `TINYRV1_INST_ADDI: cs( pc_plus4,   I,   imm, add, alu, 1,   0,   'x  );
    `TINYRV1_INST_ADD:  cs( pc_plus4,   'x,  rf,  add, alu, 1,   0,   'x  );
    ...
    default:            cs( 'x,         'x,  'x,  'x,  'x,  'x,  'x,  'x  );
  endcase
end

Again, we are using a SystemVerilog task to enable cleanly setting eight signals in one line of the casez, and we are using localparams for the various entries in the control signal table (e.g., I, imm, rf, add, alu, etc). While not required, these localparams can improve the readability of your control signal table.

Integrating the Memory Wait Signals

Once you have all eight instructions thoroughly tested without any memory waits, you can try to incorporate the imem_wait and dmem_wait signals. The first step is to add a new pc_en control signal that is connected to the enable input of the PC. Then make sure the PC is not enabled if we are waiting on memory.

We also need to make sure we do not write garbage data to the register file if we are waiting on either the instruction memory or the data memory. This is why the control signal table writes a "preliminary" version of the rf_wen control signal; we use the _pre suffix to indicate it is a preliminary version. We will need to add some additional logic (in an assign statement after the always_comb) to incorporate the memory wait signals. If either imem_wait or dmem_wait is one, we need to ensure that rf_wen is zero so we do not accidentally write garbage to a register in the register file; if both imem_wait and dmem_wait are zero, then rf_wen is the same as the preliminary version. So for example, the additional combinational logic for rf_wen to factor in the memory wait signals might look like this:

assign rf_wen = !rst && !imem_wait && !dmem_wait && rf_wen_pre;

So rf_wen is one if: we are not in reset and we are not waiting on the instruction memory and we are not waiting on the data memory and the current instruction really does want to write the register file.

Finally, we need to make sure we do not write garbage data to memory if we are waiting on the instruction memory. Again, this is why the control signal table writes a "preliminary" version of dmem_val. We can then combine the imem_wait signal and dmem_val_pre to determine dmem_val. Do not make dmem_val depend on dmem_wait since this could cause a combinational loop!
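The gating described in this section for pc_en and dmem_val might be sketched as follows. This is only one possible arrangement and not the required implementation: the module name WaitGatingSketch is hypothetical, and including rst in dmem_val is an assumption made to mirror the rf_wen logic.

```systemverilog
// Hypothetical standalone module; in the lab this logic would
// live in the control unit after the always_comb block.
module WaitGatingSketch
(
  input  logic rst,
  input  logic imem_wait,
  input  logic dmem_wait,
  input  logic dmem_val_pre,
  output logic pc_en,
  output logic dmem_val
);

  // Hold the PC whenever either memory asks us to wait

  assign pc_en = !imem_wait && !dmem_wait;

  // Gate the preliminary dmem_val on imem_wait only. dmem_wait is
  // deliberately left out to avoid a combinational loop. Including
  // rst here is an assumption, not something the handout requires.

  assign dmem_val = !rst && !imem_wait && dmem_val_pre;

endmodule
```

The loop concern comes from the memory side: if the memory computes dmem_wait combinationally from dmem_val, then making dmem_val depend on dmem_wait closes a cycle with no register in it.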

Use an incremental design approach!

Implementing the processor is the culmination of many months of work on the lab assignments. It is a complex hardware module so we strongly discourage using a big bang design approach. Do not just start writing code, try to implement the entire datapath and control unit in one shot, and then see if you can get it to work. This approach can easily take hours and hours.

Your goal instead should be to incrementally modify the control unit and datapath to add support for each of the remaining six instructions (i.e., MUL, LW, SW, JAL, JR, BNE) one at a time. Adding each instruction should involve the following seven steps:

Step 1: Modify the datapath
Step 2: Add ports to the dpath and ctrl for new control or status signals
Step 3: Add new signals in ProcScycle to connect the dpath and ctrl
Step 4: Add a column in the control signal table for each new control signal
Step 5: Add a new row in the control signal table for the new instruction
Step 6: Fill in all empty entries in the control signal table
Step 7: Thoroughly test the instruction

Use this seven-step process. Implement one instruction, thoroughly test that instruction, then move on to the next instruction. Once you have finished implementing all eight instructions and thoroughly tested each one, then you can work on adding support for the memory wait signals.

Let's illustrate this process by adding the MUL instruction. First, we need to modify the datapath to include a multiplier and a new write-back multiplexor which will choose whether the ALU or the multiplier writes the register file (Step 1). Here is the modified datapath.

Notice this means we have a new control signal wb_sel. So we need to add a new one-bit output port to ProcScycleCtrl named wb_sel and a new one-bit input port to ProcScycleDpath named wb_sel (Step 2). We also need to add a wb_sel signal to ProcScycle to connect these two ports together (Step 3) like this:

// Control Signals (Control Unit -> Datapath)

logic op2_sel;
logic wb_sel;

// Status Signals (Datapath -> Control Unit)

logic [31:0] inst;

// Instantiate/Connect Datapath and Control Unit

ProcSimpleDpath dpath
(
  .*
);

ProcSimpleCtrl ctrl
(
  .*
);

The .* makes these top-level connections very simple! In the control unit, we need to add a new column to the control signal table (Step 4) and a new row for the MUL instruction (Step 5) like this:

  // Localparams for op2_sel control signal

  localparam rf  = 1'd0;
  localparam imm = 1'd1;

  // Localparams for wb_sel control signal

  localparam alu = 1'd0;
  localparam mul = 1'd1;

  // Task for setting control signals

  task automatic cs
  (
    input logic op2_sel_,
    input logic wb_sel_
  );
    op2_sel = op2_sel_;
    wb_sel  = wb_sel_;
  endtask

  // Control signal table

  always_comb begin
    casez ( inst )
                          //  op2  wb
                          //  sel  sel
      `TINYRV1_INST_ADDI: cs( imm, ??? );
      `TINYRV1_INST_ADD:  cs( rf,  ??? );
      `TINYRV1_INST_MUL:  cs( ???, ??? );
      default:            cs( 'x,  'x  );
    endcase
  end

Now we need to fill in the empty entries marked ??? with the correct values (Step 6). We have created two new localparams (alu,mul) to make it easier to understand the values in the control signal table. Finally, we must thoroughly test this instruction before moving on to the next instruction (Step 7).

1.2. Memory Bus

To be posted soon!

1.3. SPI

To be posted soon!

2. Testing Strategy

As always, we need a compelling strategy to ensure the correctness of each component: the TinyRV1 processor, the memory bus, and the SPI.

2.1. Testing the Processor

Testing the processor is more complex than testing individual hardware blocks. We have provided you some testing infrastructure to simplify the process, but students should still expect to dedicate significant time to verifying that their processor correctly implements the TinyRV1 ISA.

We have provided you a functional-level (FL) processor model (also called an instruction set simulator) located in test/ProcFL.v. The FL processor model executes the instruction semantics behaviorally using high-level Verilog. It is not meant to model hardware. The FL processor model can be used to make sure your tests are correct before you run those tests on your single-cycle processor.

The test cases for the processor are located in these test files:

test/Proc-addi-test-cases.v
test/Proc-add-test-cases.v
test/Proc-mul-test-cases.v
test/Proc-lw-test-cases.v
test/Proc-sw-test-cases.v
test/Proc-jal-test-cases.v
test/Proc-jr-test-cases.v
test/Proc-bne-test-cases.v
test/Proc-wait-test-cases.v

Each file should only test a single instruction with the final wait test cases used to test if the processor can correctly handle memory waits. Processor test cases look like this:

task test_case_1_basic();
  t.test_case_begin( "test_case_1_basic" );

  // Write assembly program into memory

  asm( 'h000, "addi x1, x0, 2"   );
  asm( 'h004, "addi x2, x1, 2"   );

  // Check each executed instruction
  //           addr   en reg   data
  check_trace( 'h000, 1, 5'd1, 32'h0000_0002 ); // addi x1, x0, 2
  check_trace( 'h004, 1, 5'd2, 32'h0000_0004 ); // addi x2, x1, 2

  t.test_case_end();
endtask

Every processor test case includes two parts.

- asm tasks are used to write instructions into the memory. The asm task takes two arguments: the address for the instruction and an assembly instruction represented as a string. The asm task will take care of converting the assembly instruction into a machine instruction. The asm tasks represent the static instruction sequence (i.e., what instructions are stored in memory before the processor starts executing).
- check_trace tasks are like the check tasks you have seen elsewhere, but check_trace tasks will wait for the trace_val signal to be high before checking to see if the trace outputs from the processor (trace_addr, trace_wen, trace_wreg, trace_wdata) match the desired values. The check_trace tasks are used to check the dynamic instruction sequence (i.e., what instructions the processor actually executes at runtime).

The above basic test case for the ADDI instruction uses the trace to make sure the first ADDI instruction writes the value 2 to the register file and the second ADDI instruction writes the value 4 to the register file. When writing register X0, the trace data is undefined. We do not want to enforce that the register write data is zero when writing X0 since this would require special hardware to handle this case. Here is how you might test reading and writing register X0.

task test_case_2_regX0();
  t.test_case_begin( "test_case_2_regX0" );

  // Write assembly program into memory

  asm( 'h000, "addi x1, x0, 0"   );
  asm( 'h004, "addi x0, x1, 0"   );

  // Check each executed instruction
  //           addr   en reg   data
  check_trace( 'h000, 1, 5'd1, 32'h0000_0000 ); // addi x1, x0, 0
  check_trace( 'h004, 1, 5'd0, 32'hxxxx_xxxx ); // addi x0, x1, 0

  t.test_case_end();
endtask

Processor test cases for memory can include an additional part:

task test_case_1_basic();
  t.test_case_begin( "test_case_1_basic" );

  // Write assembly program into memory

  asm( 'h000, "addi x1, x0, 0x100" );
  asm( 'h004, "lw   x2, 0(x1)"     );

  // Write data into memory

  data( 'h100, 32'hdead_beef );

  // Check each executed instruction
  //           addr   en reg   data
  check_trace( 'h000, 1, 5'd1, 32'h0000_0100 ); // addi x1, x0, 0x100
  check_trace( 'h004, 1, 5'd2, 32'hdead_beef ); // lw   x2, 0(x1)

  t.test_case_end();
endtask

In addition to the asm tasks and check_trace tasks, we can also use a data task to write data into the memory. The above basic test case for the LW instruction first uses an ADDI instruction to get the memory address 0x100 into register x1. The test case then performs a LW instruction to load the data from address 0x100 into register x2. The check_trace tasks verify that the ADDI instruction correctly writes the address to the register file, and that the LW instruction correctly loads the value 0xdeadbeef from memory address 0x100.

The check_trace tasks become particularly important when testing control flow instructions. The following test case is for the JAL instruction:

task test_case_1_basic();
  t.test_case_begin( "test_case_1_basic" );

  // Write assembly program into memory

  asm( 'h000, "addi x1, x0, 1" );
  asm( 'h004, "jal  x2, 0x00c" );
  asm( 'h008, "addi x1, x0, 2" );
  asm( 'h00c, "addi x1, x0, 3" );

  // Check each executed instruction
  //           addr   en reg   data
  check_trace( 'h000, 1, 5'd1, 32'h0000_0001 ); // addi x1, x0, 1
  check_trace( 'h004, 1, 5'd2, 32'h0000_0008 ); // jal  x2, 0x00c
  check_trace( 'h00c, 1, 5'd1, 32'h0000_0003 ); // addi x1, x0, 3

  t.test_case_end();
endtask

Here we can see the static instruction sequence includes four instructions, but the dynamic instruction sequence only includes three instructions because the JAL instruction jumps over the instruction at address 0x008. Note that in the assembly format used for testing our processor, the literal in a JAL or BNE instruction is the absolute address of the target, not the actual immediate. The assembler will take care of creating the appropriate PC-relative immediate.
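As a worked example of that conversion, the PC-relative immediate is the target address minus the address of the jump or branch itself. The sketch below is not part of the provided assembler; the module JalImmSketch and the function pc_rel_imm are hypothetical names used only to show the arithmetic.

```systemverilog
module JalImmSketch;

  // Hypothetical helper: PC-relative immediate for a jump or branch
  // located at address pc whose absolute target address is target

  function automatic logic [31:0] pc_rel_imm
  (
    input logic [31:0] pc,
    input logic [31:0] target
  );
    return target - pc;
  endfunction

  initial begin
    // The jal at address 0x004 in the test case above targets 0x00c
    $display( "imm = 0x%h", pc_rel_imm( 32'h0000_0004, 32'h0000_000c ) );
    // prints: imm = 0x00000008
    $finish;
  end

endmodule
```

In this particular test case the immediate happens to equal the value written to x2, but they are different quantities: the immediate is 0x00c - 0x004, while the value written to x2 is the return address 0x004 + 4.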

You can run the test cases for the ADDI and ADD instructions on the FL processor model like this:

% cd ${HOME}/ece2300/groupXX/build
% make ProcFL-addi-test && ./ProcFL-addi-test
% make ProcFL-add-test  && ./ProcFL-add-test

You will need to add more test cases to the appropriate -test-cases.v file. Do not simply have a single directed test case (i.e., a single task); you must have many directed test cases (i.e., many tasks). Each directed test case should focus on testing a different aspect of the corresponding instruction. Remember to always make sure your tests pass on the FL processor model before attempting to run those tests on your single-cycle processor model!
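As one illustration of a directed test case that targets a different aspect of an instruction, the fragment below checks ADDI when the source and destination registers are the same. It is a sketch, not a provided test: the task name test_case_3_src_eq_dest is hypothetical, and the fragment assumes it is placed in test/Proc-addi-test-cases.v alongside the existing tasks and invoked the same way they are.

```systemverilog
// Hypothetical additional directed test case for ADDI; relies on the
// asm, check_trace, and t tasks/objects from the lab test infrastructure.

task test_case_3_src_eq_dest();
  t.test_case_begin( "test_case_3_src_eq_dest" );

  // Write assembly program into memory

  asm( 'h000, "addi x1, x0, 1" );
  asm( 'h004, "addi x1, x1, 1" );

  // Check each executed instruction
  //           addr   en reg   data
  check_trace( 'h000, 1, 5'd1, 32'h0000_0001 ); // addi x1, x0, 1
  check_trace( 'h004, 1, 5'd1, 32'h0000_0002 ); // addi x1, x1, 1

  t.test_case_end();
endtask
```

Other aspects worth separate tasks include different source and destination registers, the largest and smallest immediates, and longer dependent chains of instructions.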

2.2. Testing the Memory Bus and SPI

We provide tests for the MemoryBusAddrDecoder, MemoryBusDataMux, and the MemoryBus itself. We also provide tests for the SPI. While you are free to add more tests for these components it is not required.

3. Lab Code Submission

To submit your code you simply push your code to GitHub. You can push your code as many times as you like before the deadline. Students are responsible for going to the GitHub website for their repository, browsing the source code, and confirming that the code on GitHub is the code they want to submit. Be sure to verify your code is passing your tests both on ecelinux and on GitHub Actions. Your design code will be assessed in terms of code quality, verification quality, and functionality.

3.1. Code Quality

Your code quality score will be based on how well you follow the course coding conventions posted here:

https://cornell-ece2300.github.io/ece2300-mkdocs/ece2300-coding-conventions

3.2. Verification Quality

Verification quality is based on how well your testing enables making a compelling case for correctness. You will need to write compelling directed test cases, use reasonable random testing, and include a simple X-propagation test case. Use comments appropriately to describe your test cases.

3.3. Functionality

Your functionality score will be determined by running your code against a series of tests developed by the instructors to test its correctness. Note that we will be using the automated build system to test your final code submission as shown below.

% mkdir -p ${HOME}/ece2300
% cd ${HOME}/ece2300
% git clone git@github.com:cornell-ece2300/groupXX
% cd groupXX

% mkdir -p build
% cd build
% ../configure
% make check-lab4-partB
