I have a microblaze instantiated with 16 stream interfaces with a custom IP attached to two. What is the correct header file or function to communicate over these interfaces in Vitis (Not HLS)?
Based on the full example that you can find here, I am going to provide a general idea:
Include the mb_interface.h in your C source
Use the putfsl and getfsl macros to write and read from the stream.
Such macros are wrapper around special assembly instructions that the microblaze will execute by writing the data on the axi stream interface. The ìd is the stream id. Here you can find all the possible functions and here you can explore the ISA.
#define putfsl(val, id) asm volatile ("put\t%0,rfsl" stringify(id) :: "d" (val))
The fundamental issue is that
#include "mb_interface.h"
/*
* Write 4 32-bit words.
*/
static void inline write_axis(volatile unsigned int *a)
{
register int a0, a1, a2, a3;
a3 = a[3]; a1 = a[1]; a2 = a[2]; a0 = a[0];
putfsl(a0, 0); putfsl(a1, 0); putfsl(a2, 0); putfsl(a3, 0);
}
int main()
{
volatile unsigned int outbuffer[BUFFER_SIZE] = { 0x0, 0x1, 0x2, 0x3 }
};
/* Perform transfers */
write_axis(outbuffer);
return 0;
}
Related
I've read some of the common write policies in the microarchitecture of GPUs. For most of the GPU the written policy is the same as the below picture (the picture is from the gpgpu-sim manual). based on the below picture I have a question. can we have dirty data on the l1 cache?
The L1 on some GPU architectures is a write-back cache for global accesses. Note that this topic varies by GPU architecture, e.g. for whether global activity is cached in L1.
Speaking generally, then, yes you can have dirty data. By this I mean that the data in the L1 cache is modified (compared to what is otherwise in global space or the L2 cache) and it has not yet been "flushed" or updated into the L2 cache. (You can also have "stale" data - data in the L1 that has not been modified, but is not consistent with the L2.)
We can create a simple proof point for this (dirty data).
The following code, when executed on a cc7.0 device (and probably some other archtectures as well) will not give the expected answer of 1024.
This is due to the fact that the L1, which is a separate entity per SM, is not immediately flushed to the L2. It therefore has "dirty data" by the above definition.
(The code is broken for this reason. Don't use this code. It's just a proof point.)
#include <iostream>
#include <cuda_runtime.h>
constexpr int num_blocks = 1024;
constexpr int num_threads = 32;
struct Lock {
int *locked;
Lock() {
int init = 0;
cudaMalloc(&locked, sizeof(int));
cudaMemcpy(locked, &init, sizeof(int), cudaMemcpyHostToDevice);
}
~Lock() {
if (locked) cudaFree(locked);
locked = NULL;
}
__device__ __forceinline__ void acquire_lock() {
while (atomicCAS(locked, 0, 1) != 0);
}
__device__ __forceinline__ void unlock() {
atomicExch(locked, 0);
}
};
__global__ void counter(Lock lock, int *total) {
if (threadIdx.x == 1) {
lock.acquire_lock();
*total = *total + 1;
// __threadfence(); uncomment this line to fix
lock.unlock();
}
}
int main() {
int *total_dev;
cudaMalloc(&total_dev, sizeof(int));
int total_host = 0;
cudaMemcpy(total_dev, &total_host, sizeof(int), cudaMemcpyHostToDevice);
{
Lock lock;
counter<<<num_blocks, num_threads>>>(lock, total_dev);
cudaDeviceSynchronize();
cudaMemcpy(&total_host, total_dev, sizeof(int), cudaMemcpyDeviceToHost);
std::cout << total_host << std::endl;
}
cudaFree(total_dev);
}
In case there is any further doubt about whether this is a proper proof (e.g. to dispel arguments about things being "optimized into a register" etc.) we can study the resultant sass code. The end of the above kernel has code that looks like this:
/*0130*/ LDG.E.SYS R0, [R4] ; /* 0x0000000004007381 */
// load *total /* 0x000ea400001ee900 */
/*0140*/ IADD3 R7, R0, 0x1, RZ ; /* 0x0000000100077810 */
// add 1 /* 0x004fd00007ffe0ff */
/*0150*/ STG.E.SYS [R4], R7 ; /* 0x0000000704007386 */
// store *total /* 0x000fe8000010e900 */
/*0160*/ ATOMG.E.EXCH.STRONG.GPU PT, RZ, [R2], RZ ; /* 0x000000ff02ff73a8 */
//lock.unlock /* 0x000fe200041f41ff */
/*0170*/ EXIT ;
Since the result register has definitely been stored to the global space, we can infer that if another thread (in another SM) reads an unexpected value in global space for *total it must be due to the fact that the store from another SM has not reached the L2, i.e. has not reached device-wide consistency/coherency. Therefore the data in some other SM is "dirty". We can (presumably) rule out the "stale" case here (the data in the other L1 was written, but I have "old" data in my L1) because the global load indicated above does not happen until the lock is acquired in the SM.
Note that the above code "fails" on cc7.0 devices (and probably some other device architectures). It does not necessarily fail on the GPU you are using. But it is still "broken".
I'm working on Keil software and using LM3S316 microcontroller. Usually we address registers in microcontrollers in form of:
#define GPIO_PORTC_DATA_R (*((volatile uint32_t *)0x400063FC))
My question is how can I access to single pin of register for example, if I have this method:
char process_key(int a)
{ PC_0 = a ;}
How can I get PC_0 and how to define it?
Thank you
Given say:
#define PIN0 (1u<<0)
#define PIN1 (1u<<1)
#define PIN2 (1u<<2)
// etc...
Then:
char process_key(int a)
{
if( a != 0 )
{
// Set bit
GPIO_PORTC_DATA_R |= PIN0 ;
}
else
{
// Clear bit
GPIO_PORTC_DATA_R &= ~PIN0 ;
}
}
A generalisation of this idiomatic technique is presented at How do you set, clear, and toggle a single bit?
However the read-modify-write implied by |= / &= can be problematic if the register might be accessed in different thread/interrupt contexts, as well as adding a possibly undesirable overhead. Cortex-M3/4 parts have a feature known as bit-banding that allows individual bits to be addressed directly and atomically. Given:
volatile uint32_t* getBitBandAddress( volatile const void* address, int bit )
{
__IO uint32_t* bit_address = 0;
uint32_t addr = reinterpret_cast<uint32_t>(address);
// This bit maniplation makes the function valid for RAM
// and Peripheral bitband regions
uint32_t word_band_base = addr & 0xf0000000u;
uint32_t bit_band_base = word_band_base | 0x02000000u;
uint32_t offset = addr - word_band_base;
// Calculate bit band address
bit_address = reinterpret_cast<__IO uint32_t*>(bit_band_base + (offset * 32u) + (static_cast<uint32_t>(bit) * 4u));
return bit_address ;
}
Then you can have:
char process_key(int a)
{
static volatile uint32_t* PC0_BB_ADDR = getBitBandAddress( &GPIO_PORTC_DATA_R, 0 ) ;
*PC0_BB_ADDR = a ;
}
You could of course determine and hard-code the bit-band address; for example:
#define PC0 (*((volatile uint32_t *)0x420C7F88u))
Then:
char process_key(int a)
{
PC0 = a ;
}
Details of the bit-band address calculation can be found ARM Cortex-M Technical Reference Manual, and there is an on-line calculator here.
I'm working on a project using a MC9S12ZVM family processor and need to be able to get, save and restore the current interrupt enabled state. This is needed to access variables from the main line code that may be modified by the interrupt handler that are larger than word in size and therefore not atomic.
pseudo code: (variable is 32bits and -= isn't atomic anyhow)
state_save = current_interrupt_state();
DisableInterrupt();
variable -= x;
RestoreInterrupts(state_save);
Edit: I found something that works, but has the issue of modifying the stack.
asm(PSH CCW);
asm(SEI);
Variable++;
asm(PUL CCW);
This is ok as long as I don't need to do anything other than a simple variable++, but I don't like exiting a block with the stack modified.
It seems you are referring to the global interrupt mask. If so, then this is one way to disable it and then restore it to previous state:
static const uint8_t CCR_I_MASK = 0x10;
static uint8_t ccr;
void disable_interrupts (void)
{
__asm PSHA;
__asm TPA; // transfer CCR to A
__asm STA ccr; // store CCR in RAM variable
__asm PULA;
__asm SEI;
}
void restore_interrupts (void)
{
if((ccr & CCR_I_MASK) == 0)
{
__asm CLI; // i was not set, clear it
}
else
{
; // i was set, do nothing
}
}
__asm is specific to the Codewarrior compiler, with or without "strict ANSI" option set.
Ok, I've found an answer to my problem, with thanks to those who commented.
static volatile uint16_t v = 0u;
void testfunction(void);
void testfunction(void)
{
static uint16_t L_CCR;
asm( PSH D2 );
asm( TFR CCW, D2);
asm( ST D2, L_CCR );
asm( PUL D2 );
asm( SEI );
v++;
asm( PSH D2 );
asm( LD D2, L_CCR );
asm( TFR D2, CCW);
asm( PUL D2 );
}
I was looking into spi driver in u boot , here is a small snippet from
omap_spi.c
void spi_init(void)
{
gpMCSPIRegs = (MCSPI_REGS *)MCSPI_SPI1_IO_BASE;
unsigned long u, n;
/* initialize the multipad and interface clock */
spi_init_spi1();
/* soft reset */
CSP_BITFINS(gpMCSPIRegs->SYSCONFIG, SPI_SYSCONFIG_SOFTRESET, 1);
for (n = 0; n < 100; n++) {
u = CSP_BITFEXT(gpMCSPIRegs->SYSSTATUS,
SPI_SYSSTATUS_RESETDONE);
if (u)
break;
}
...more code
}
here in
omap_spi.h
#define CSP_BITFINS(var, bit, val) \
(CSP_BITFCLR(var, bit)); (var |= CSP_BITFVAL(bit, val))
my confusion here is that when they do soft reset , they call this CSP_BITFINS macro. inside this macro all they do is just manipulate bits and fill structures. where do they access that hardware registers to configure ?
If you look further, you'll find that the pointer they are using, gpMCSPIRregs, is volatile and pointing at the memory-mapper hardware registers. The bits they are setting/clearing are in the hardware registers.
Im making a xor gate in SystemC, from the binding of four NAND gates. I want the module to receive a vector of N bits, where N is passed as parameter. I should be able to perform & and not bitwise operations (for the NAND gate).
The best solution may be using sc_bv_base type, but I don't know how to initialize it in the constructor.
How can I create a bit vector using a custom length?
A way to parameterise the module is to create a new C++ template for the module.
In this example, the width of the input vector can be set at the level of the instantiation of this module
#ifndef MY_XOR_H_
#define MY_XOR_H_
#include <systemc.h>
template<int depth>
struct my_xor: sc_module {
sc_in<bool > clk;
sc_in<sc_uint<depth> > din;
sc_out<bool > dout;
void p1() {
dout.write(xor_reduce(din.read()));
}
SC_CTOR(my_xor) {
SC_METHOD(p1);
sensitive << clk.pos();
}
};
#endif /* MY_XOR_H_ */
Note that the struct my_xor: sc_module is used i.s.o. the SC_MODULE macro. (See page 40 , 5.2.5 SC_MODULE of the IEEE Std 1666-2011).
You can test this with the following testbench:
//------------------------------------------------------------------
// Simple Testbench for xor file
//------------------------------------------------------------------
#include <systemc.h>
#include "my_xor.h"
int sc_main(int argc, char* argv[]) {
const int WIDTH = 8;
sc_signal<sc_uint<WIDTH> > din;
sc_signal<bool> dout;
sc_clock clk("clk", 10, SC_NS, 0.5); // Create a clock signal
my_xor<WIDTH> DUT("my_xor"); // Instantiate Device Under Test
DUT.din(din); // Connect ports
DUT.dout(dout);
DUT.clk(clk);
sc_trace_file *fp; // Create VCD file
fp = sc_create_vcd_trace_file("wave"); // open(fp), create wave.vcd file
fp->set_time_unit(100, SC_PS); // set tracing resolution to ns
sc_trace(fp, clk, "clk"); // Add signals to trace file
sc_trace(fp, din, "din");
sc_trace(fp, dout, "dout");
sc_start(31, SC_NS); // Run simulation
din = 0x00;
sc_start(31, SC_NS); // Run simulation
din = 0x01;
sc_start(31, SC_NS); // Run simulation
din = 0xFF;
sc_start(31, SC_NS); // Run simulation
sc_close_vcd_trace_file(fp); // close(fp)
return 0;
}
Note that I'm using a struct and not a class. A class is also possible.
class my_xor: public sc_module{
public:
The XOR in this code is just the xor_reduce. You can find more about in the IEEE Std 1666-2011 at page 197 (7.2.8 Reduction operators). But I assume this is not the solution you wanted to have.