Can we have dirty data on l1 cache in gpu? - gpu

I've read about some of the common write policies in GPU microarchitectures. For most GPUs the write policy is the same as in the picture below (taken from the GPGPU-Sim manual). Based on that picture, I have a question: can we have dirty data in the L1 cache?

The L1 on some GPU architectures is a write-back cache for global accesses. Note that this topic varies by GPU architecture, e.g. for whether global activity is cached in L1.
Speaking generally, then, yes you can have dirty data. By this I mean that the data in the L1 cache is modified (compared to what is otherwise in global space or the L2 cache) and it has not yet been "flushed" or updated into the L2 cache. (You can also have "stale" data - data in the L1 that has not been modified, but is not consistent with the L2.)
We can create a simple proof point for this (dirty data).
The following code, when executed on a cc7.0 device (and probably some other architectures as well), will not give the expected answer of 1024.
This is due to the fact that the L1, which is a separate entity per SM, is not immediately flushed to the L2. It therefore has "dirty data" by the above definition.
(The code is broken for this reason. Don't use this code. It's just a proof point.)
#include <iostream>
#include <cuda_runtime.h>

constexpr int num_blocks = 1024;
constexpr int num_threads = 32;

struct Lock {
    int *locked;
    Lock() {
        int init = 0;
        cudaMalloc(&locked, sizeof(int));
        cudaMemcpy(locked, &init, sizeof(int), cudaMemcpyHostToDevice);
    }
    ~Lock() {
        if (locked) cudaFree(locked);
        locked = NULL;
    }
    __device__ __forceinline__ void acquire_lock() {
        while (atomicCAS(locked, 0, 1) != 0);
    }
    __device__ __forceinline__ void unlock() {
        atomicExch(locked, 0);
    }
};

__global__ void counter(Lock lock, int *total) {
    if (threadIdx.x == 1) {
        lock.acquire_lock();
        *total = *total + 1;
        // __threadfence(); uncomment this line to fix
        lock.unlock();
    }
}

int main() {
    int *total_dev;
    cudaMalloc(&total_dev, sizeof(int));
    int total_host = 0;
    cudaMemcpy(total_dev, &total_host, sizeof(int), cudaMemcpyHostToDevice);
    {
        Lock lock;
        counter<<<num_blocks, num_threads>>>(lock, total_dev);
        cudaDeviceSynchronize();
        cudaMemcpy(&total_host, total_dev, sizeof(int), cudaMemcpyDeviceToHost);
        std::cout << total_host << std::endl;
    }
    cudaFree(total_dev);
}
In case there is any further doubt about whether this is a proper proof (e.g. to dispel arguments about things being "optimized into a register" etc.), we can study the resulting SASS code. The end of the above kernel has code that looks like this:
/*0130*/ LDG.E.SYS R0, [R4] ; /* 0x0000000004007381 */
// load *total /* 0x000ea400001ee900 */
/*0140*/ IADD3 R7, R0, 0x1, RZ ; /* 0x0000000100077810 */
// add 1 /* 0x004fd00007ffe0ff */
/*0150*/ STG.E.SYS [R4], R7 ; /* 0x0000000704007386 */
// store *total /* 0x000fe8000010e900 */
/*0160*/ ATOMG.E.EXCH.STRONG.GPU PT, RZ, [R2], RZ ; /* 0x000000ff02ff73a8 */
//lock.unlock /* 0x000fe200041f41ff */
/*0170*/ EXIT ;
Since the result register has definitely been stored to the global space, we can infer that if another thread (in another SM) reads an unexpected value in global space for *total it must be due to the fact that the store from another SM has not reached the L2, i.e. has not reached device-wide consistency/coherency. Therefore the data in some other SM is "dirty". We can (presumably) rule out the "stale" case here (the data in the other L1 was written, but I have "old" data in my L1) because the global load indicated above does not happen until the lock is acquired in the SM.
Note that the above code "fails" on cc7.0 devices (and probably some other device architectures). It does not necessarily fail on the GPU you are using. But it is still "broken".

Related

Microblaze How to use AXI Stream?

I have a microblaze instantiated with 16 stream interfaces with a custom IP attached to two. What is the correct header file or function to communicate over these interfaces in Vitis (Not HLS)?
Based on the full example that you can find here, I am going to provide a general idea:
Include the mb_interface.h in your C source
Use the putfsl and getfsl macros to write and read from the stream.
Such macros are wrappers around special assembly instructions that the MicroBlaze executes to write data to the AXI Stream interface. The id is the stream ID. Here you can find all the possible functions and here you can explore the ISA.
#define putfsl(val, id) asm volatile ("put\t%0,rfsl" stringify(id) :: "d" (val))
The fundamental issue is that
#include "mb_interface.h"

/* Assumed value: the original snippet does not show this definition. */
#define BUFFER_SIZE 4

/*
 * Write 4 32-bit words.
 */
static inline void write_axis(volatile unsigned int *a)
{
    register int a0, a1, a2, a3;
    a3 = a[3]; a1 = a[1]; a2 = a[2]; a0 = a[0];
    putfsl(a0, 0); putfsl(a1, 0); putfsl(a2, 0); putfsl(a3, 0);
}

int main()
{
    volatile unsigned int outbuffer[BUFFER_SIZE] = { 0x0, 0x1, 0x2, 0x3 };

    /* Perform transfers */
    write_axis(outbuffer);
    return 0;
}
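The answer above only shows the write direction. A minimal sketch of the matching read side might look like the following, assuming the blocking getfsl macro from mb_interface.h and stream 0. The read_axis name, the FIFO_DEPTH constant and the sim_* software FIFO are hypothetical: the FIFO is a host-only stand-in so the logic can be compiled and checked off-target, and it is not part of any Xilinx header.

```c
#include <stdint.h>

#ifdef __MICROBLAZE__
#include "mb_interface.h"   /* real putfsl/getfsl on the target */
#else
/* Host-only stand-ins (hypothetical): a small software FIFO that loops
 * written words back to the reader. The stream id is ignored here. */
#define FIFO_DEPTH 16u
static uint32_t sim_fifo[FIFO_DEPTH];
static unsigned sim_head, sim_tail;
#define putfsl(val, id) (sim_fifo[sim_head++ % FIFO_DEPTH] = (uint32_t)(val))
#define getfsl(val, id) ((val) = sim_fifo[sim_tail++ % FIFO_DEPTH])
#endif

/* Read 4 32-bit words from stream 0 into dst.
 * On the target, getfsl blocks until a word is available. */
static inline void read_axis(uint32_t *dst)
{
    uint32_t w0, w1, w2, w3;
    getfsl(w0, 0); getfsl(w1, 0); getfsl(w2, 0); getfsl(w3, 0);
    dst[0] = w0; dst[1] = w1; dst[2] = w2; dst[3] = w3;
}
```

On the target the stream id has to be a literal constant, because the real macros paste it into the instruction's register name (rfsl0, rfsl1, and so on).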

Addressing pins of Register in microcontrollers

I'm working on Keil software and using LM3S316 microcontroller. Usually we address registers in microcontrollers in form of:
#define GPIO_PORTC_DATA_R (*((volatile uint32_t *)0x400063FC))
My question is: how can I access a single pin of a register? For example, if I have this function:
char process_key(int a)
{ PC_0 = a ;}
How can I get PC_0 and how to define it?
Thank you
Given say:
#define PIN0 (1u<<0)
#define PIN1 (1u<<1)
#define PIN2 (1u<<2)
// etc...
Then:
char process_key(int a)
{
    if( a != 0 )
    {
        // Set bit
        GPIO_PORTC_DATA_R |= PIN0 ;
    }
    else
    {
        // Clear bit
        GPIO_PORTC_DATA_R &= ~PIN0 ;
    }
}
A generalisation of this idiomatic technique is presented at How do you set, clear, and toggle a single bit?
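As a rough illustration of that generalisation, the set/clear pattern can be wrapped in small macros and exercised on a host machine. This is only a sketch: the BIT_* macro names, the write_pin helper and the fake_port variable are hypothetical, and fake_port stands in for a real register such as GPIO_PORTC_DATA_R.

```c
#include <stdint.h>

/* Hypothetical helper macros for single-bit manipulation. */
#define BIT(n)             (1u << (n))
#define BIT_SET(reg, n)    ((reg) |=  BIT(n))
#define BIT_CLEAR(reg, n)  ((reg) &= ~BIT(n))
#define BIT_TOGGLE(reg, n) ((reg) ^=  BIT(n))
#define BIT_TEST(reg, n)   (((reg) & BIT(n)) != 0u)

/* Stand-in for a memory-mapped port register (assumption for host testing). */
static volatile uint32_t fake_port;

/* Drive one pin high or low; each branch is still a read-modify-write. */
static void write_pin(unsigned pin, int level)
{
    if (level != 0) {
        BIT_SET(fake_port, pin);
    } else {
        BIT_CLEAR(fake_port, pin);
    }
}
```

On real hardware the same macros would be applied to the register definition itself, for example BIT_SET(GPIO_PORTC_DATA_R, 0).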
However the read-modify-write implied by |= / &= can be problematic if the register might be accessed in different thread/interrupt contexts, as well as adding a possibly undesirable overhead. Cortex-M3/4 parts have a feature known as bit-banding that allows individual bits to be addressed directly and atomically. Given:
volatile uint32_t* getBitBandAddress( volatile const void* address, int bit )
{
    __IO uint32_t* bit_address = 0;
    uint32_t addr = reinterpret_cast<uint32_t>(address);
    // This bit manipulation makes the function valid for RAM
    // and Peripheral bitband regions
    uint32_t word_band_base = addr & 0xf0000000u;
    uint32_t bit_band_base = word_band_base | 0x02000000u;
    uint32_t offset = addr - word_band_base;
    // Calculate bit band address
    bit_address = reinterpret_cast<__IO uint32_t*>(bit_band_base + (offset * 32u) + (static_cast<uint32_t>(bit) * 4u));
    return bit_address ;
}
Then you can have:
char process_key(int a)
{
static volatile uint32_t* PC0_BB_ADDR = getBitBandAddress( &GPIO_PORTC_DATA_R, 0 ) ;
*PC0_BB_ADDR = a ;
}
You could of course determine and hard-code the bit-band address; for example:
#define PC0 (*((volatile uint32_t *)0x420C7F80u))
Then:
char process_key(int a)
{
PC0 = a ;
}
Details of the bit-band address calculation can be found in the ARM Cortex-M Technical Reference Manual, and there is an on-line calculator here.
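The alias arithmetic can also be checked on a host machine with plain integers. The sketch below mirrors the calculation in getBitBandAddress but returns the address as a number instead of a pointer; the bitband_alias name is hypothetical. For GPIO_PORTC_DATA_R at 0x400063FC, bit 0 works out to 0x42000000 + 0x63FC * 32 + 0 * 4 = 0x420C7F80.

```c
#include <stdint.h>

/* Compute the bit-band alias address for a given bit of a word located in
 * the SRAM (0x20000000) or peripheral (0x40000000) bit-band region.
 * Only the first 1 MB of each region has a bit-band alias. */
static uint32_t bitband_alias(uint32_t addr, unsigned bit)
{
    uint32_t region_base = addr & 0xF0000000u;        /* 0x20000000 or 0x40000000 */
    uint32_t alias_base  = region_base | 0x02000000u; /* 0x22000000 or 0x42000000 */
    uint32_t byte_offset = addr - region_base;

    /* Each bit of the region maps to one 32-bit word in the alias area. */
    return alias_base + byte_offset * 32u + (uint32_t)bit * 4u;
}
```

Calling bitband_alias(0x400063FCu, 0) gives the value to hard-code for pin 0 of that register; passing a different bit number moves the result up in steps of 4 bytes.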

Writing to non-volatile memory without disrupting UART interrupts execution on STM32F4XX

I get several OVERRUN errors on the UART peripheral because I keep receiving UART data while my code is stalled during a flash write operation.
I'm using interrupts for the UART and, as explained in Application Note AN3969:
EEPROM emulation firmware runs from the internal Flash, thus access to
the Flash will be stalled during operations requiring Flash erase or
programming (EEPROM initialization, variable update or page erase). As
a consequence, the application code is not executed and the interrupt
can not be served.
This behavior may be acceptable for many applications, however for
applications with realtime constraints, you need to run the critical
processes from the internal RAM.
In this case:
Relocate the vector table in the internal RAM.
Execute all critical processes and interrupt service routines from the internal RAM. The compiler provides a keyword to declare functions
as a RAM function; the function is copied from the Flash to the RAM at
system startup just like any initialized variable. It is important to
note that for a RAM function, all used variable(s) and called
function(s) should be within the RAM.
So I searched the internet and found AN4808, which provides examples of how to keep interrupts running during flash operations.
I went ahead and modified my code :
Linker script : Added vector table to SRAM and define a .ramfunc section
/* stm32f417.dld */
ENTRY(Reset_Handler)
MEMORY
{
ccmram(xrw) : ORIGIN = 0x10000000, LENGTH = 64k
sram : ORIGIN = 0x20000000, LENGTH = 112k
eeprom_default : ORIGIN = 0x08004008, LENGTH = 16376
eeprom_s1 : ORIGIN = 0x08008000, LENGTH = 16k
eeprom_s2 : ORIGIN = 0x0800C000, LENGTH = 16k
flash_unused : ORIGIN = 0x08010000, LENGTH = 64k
flash : ORIGIN = 0x08020000, LENGTH = 896k
}
_end_stack = 0x2001BFF0;
SECTIONS
{
. = ORIGIN(eeprom_default);
.eeprom_data :
{
*(.eeprom_data)
} >eeprom_default
. = ORIGIN(flash);
.vectors :
{
_load_vector = LOADADDR(.vectors);
_start_vector = .;
*(.vectors)
_end_vector = .;
} >sram AT >flash
.text :
{
*(.text)
*(.rodata)
*(.rodata*)
_end_text = .;
} >flash
.data :
{
_load_data = LOADADDR(.data);
. = ALIGN(4);
_start_data = .;
*(.data)
} >sram AT >flash
.ramfunc :
{
. = ALIGN(4);
*(.ramfunc)
*(.ramfunc.*)
. = ALIGN(4);
_end_data = .;
} >sram AT >flash
.ccmram :
{
_load_ccmram = LOADADDR(.ccmram);
. = ALIGN(4);
_start_ccmram = .;
*(.ccmram)
*(.ccmram*)
. = ALIGN(4);
_end_ccmram = .;
} > ccmram AT >flash
.bss :
{
_start_bss = .;
*(.bss)
_end_bss = .;
} >sram
. = ALIGN(4);
_start_stack = .;
}
_end = .;
PROVIDE(end = .);
Reset Handler : Added the vector table copy from flash to SRAM
void Reset_Handler(void)
{
    unsigned int *src, *dst;
    /* Copy vector table from flash to RAM */
    src = &_load_vector;
    dst = &_start_vector;
    while (dst < &_end_vector)
        *dst++ = *src++;
    /* Copy data section from flash to RAM */
    src = &_load_data;
    dst = &_start_data;
    while (dst < &_end_data)
        *dst++ = *src++;
    /* Copy data section from flash to CCRAM */
    src = &_load_ccmram;
    dst = &_start_ccmram;
    while (dst < &_end_ccmram)
        *dst++ = *src++;
    /* Clear the bss section */
    dst = &_start_bss;
    while (dst < &_end_bss)
        *dst++ = 0;
    SystemInit();
    SystemCoreClockUpdate();
    RCC->AHB1ENR = 0xFFFFFFFF;
    RCC->AHB2ENR = 0xFFFFFFFF;
    RCC->AHB3ENR = 0xFFFFFFFF;
    RCC->APB1ENR = 0xFFFFFFFF;
    RCC->APB2ENR = 0xFFFFFFFF;
    RCC->AHB1ENR |= RCC_AHB1ENR_GPIOAEN;
    RCC->AHB1ENR |= RCC_AHB1ENR_GPIOBEN;
    RCC->AHB1ENR |= RCC_AHB1ENR_GPIOCEN;
    RCC->AHB1ENR |= RCC_AHB1ENR_GPIODEN;
    RCC->AHB1ENR |= RCC_AHB1ENR_GPIOEEN;
    RCC->AHB1ENR |= RCC_AHB1ENR_GPIOFEN;
    RCC->AHB1ENR |= RCC_AHB1ENR_GPIOGEN;
    RCC->AHB1ENR |= RCC_AHB1ENR_GPIOHEN;
    RCC->AHB1ENR |= RCC_AHB1ENR_GPIOIEN;
    RCC->AHB1ENR |= RCC_AHB1ENR_CCMDATARAMEN;
    main();
    while(1);
}
system_stm32f4xxx.c : Uncommented VECT_TAB_SRAM define
/*!< Uncomment the following line if you need to relocate your vector Table in
Internal SRAM. */
#define VECT_TAB_SRAM
#define VECT_TAB_OFFSET 0x00 /*!< Vector Table base offset field.
This value must be a multiple of 0x200. */
Added a definition of RAMFUNC to set section attributes :
#define RAMFUNC __attribute__ ((section (".ramfunc")))
Added RAMFUNC before the UART-related functions and prototypes so that they run from RAM.
RAMFUNC void USART1_IRQHandler(void)
{
    uint32_t sr = USART1->SR;
    USART1->SR & USART_SR_ORE ? GPIO_SET(LED_ERROR_PORT, LED_ERROR_PIN_bp):GPIO_CLR(LED_ERROR_PORT, LED_ERROR_PIN_bp);
    if(sr & USART_SR_TXE)
    {
        if(uart_1_send_write_pos != uart_1_send_read_pos)
        {
            USART1->DR = uart_1_send_buffer[uart_1_send_read_pos];
            uart_1_send_read_pos = (uart_1_send_read_pos + 1) % USART_1_SEND_BUF_SIZE;
        }
        else
        {
            USART1->CR1 &= ~USART_CR1_TXEIE;
        }
    }
    if(sr & (USART_SR_RXNE | USART_SR_ORE))
    {
        USART1->SR &= ~(USART_SR_RXNE | USART_SR_ORE);
        uint8_t byte = USART1->DR;
        uart_1_recv_buffer[uart_1_recv_write_pos] = byte;
        uart_1_recv_write_pos = (uart_1_recv_write_pos + 1) % USART_1_RECV_BUF_SIZE;
    }
}
My target runs properly with the vector table and UART function in RAM, but I still get an overrun on the USART. I'm also not disabling interrupts when performing the flash write operation.
I also tried to run code from CCM RAM instead of SRAM, but as I saw in this post, code can't be executed from CCM RAM on STM32F4XX...
Any ideas? Thanks.
Any attempt to read from flash while a write operation is ongoing causes the bus to stall.
In order not to be blocked by flash writes, I think not only the interrupt code but also the interrupted function has to run from RAM; otherwise the core cannot proceed to a state in which interrupts are possible.
Try relocating the flash handling code to RAM.
If it's possible, I'd advise switching to an MCU with two independent banks of flash memory, like the pin- and software-compatible 427/429/437/439 series. You can dedicate one bank to program code and the other to EEPROM-like data storage, then writing the second bank won't disturb code running from the first bank.
As suggested, it might be necessary to execute code from RAM; or, rather, make sure that no flash read operations are performed while the write is in progress.
To test, you might want to compile the entire executable for RAM, rather than flash (i.e., place everything into RAM and not use the flash at all).
You could then use gdb to load the binary and start execution... test your uart and make sure it is working as expected. At least this way you can be sure the flash is unused.
Some micros have READ WHILE WRITE sections that do not have a problem performing multiple operations simultaneously.

How do I get the current interrupt state (enabled, disabled or current level) on a MC9S12ZVM processor

I'm working on a project using a MC9S12ZVM family processor and need to be able to get, save and restore the current interrupt enabled state. This is needed to access variables from the main line code that may be modified by the interrupt handler that are larger than word in size and therefore not atomic.
pseudo code: (variable is 32bits and -= isn't atomic anyhow)
state_save = current_interrupt_state();
DisableInterrupt();
variable -= x;
RestoreInterrupts(state_save);
Edit: I found something that works, but has the issue of modifying the stack.
asm(PSH CCW);
asm(SEI);
Variable++;
asm(PUL CCW);
This is ok as long as I don't need to do anything other than a simple variable++, but I don't like exiting a block with the stack modified.
It seems you are referring to the global interrupt mask. If so, then this is one way to disable it and then restore it to previous state:
static const uint8_t CCR_I_MASK = 0x10;
static uint8_t ccr;
void disable_interrupts (void)
{
    __asm PSHA;
    __asm TPA; // transfer CCR to A
    __asm STA ccr; // store CCR in RAM variable
    __asm PULA;
    __asm SEI;
}
void restore_interrupts (void)
{
    if((ccr & CCR_I_MASK) == 0)
    {
        __asm CLI; // i was not set, clear it
    }
    else
    {
        ; // i was set, do nothing
    }
}
__asm is specific to the Codewarrior compiler, with or without "strict ANSI" option set.
Ok, I've found an answer to my problem, with thanks to those who commented.
static volatile uint16_t v = 0u;
void testfunction(void);
void testfunction(void)
{
    static uint16_t L_CCR;
    asm( PSH D2 );
    asm( TFR CCW, D2);
    asm( ST D2, L_CCR );
    asm( PUL D2 );
    asm( SEI );
    v++;
    asm( PSH D2 );
    asm( LD D2, L_CCR );
    asm( TFR D2, CCW);
    asm( PUL D2 );
}
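The save/disable/restore shape from the question's pseudo code can be modelled on a host machine to check the logic independently of the assembly. This is only a sketch under stated assumptions: sim_ccr is a hypothetical variable standing in for the CPU's condition code register, the function names are made up for illustration, and on the target the two helpers would be replaced by the inline assembly shown above. Returning the saved state to the caller, instead of keeping it in one shared static variable, lets critical sections nest.

```c
#include <stdint.h>

#define CCR_I_MASK 0x10u  /* interrupt mask bit, as tested in the answer above */

/* Hypothetical stand-in for the CPU condition code register. */
static uint8_t sim_ccr;

/* Return the previous register value, then set the I bit (models SEI). */
static uint8_t irq_save_and_disable(void)
{
    uint8_t saved = sim_ccr;
    sim_ccr |= CCR_I_MASK;
    return saved;
}

/* Put the I bit back to whatever it was when the state was saved. */
static void irq_restore(uint8_t saved)
{
    sim_ccr = (uint8_t)((sim_ccr & ~CCR_I_MASK) | (saved & CCR_I_MASK));
}

static volatile uint32_t variable = 100u;

/* The 32-bit update is wrapped so an interrupt cannot observe it half done. */
static void subtract(uint32_t x)
{
    uint8_t state = irq_save_and_disable();
    variable -= x;
    irq_restore(state);
}
```

If interrupts were already masked on entry, irq_restore leaves them masked, which matches the "i was set, do nothing" branch of the first answer.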

u-boot spi initialisation in omap3

I was looking into the SPI driver in U-Boot; here is a small snippet from
omap_spi.c
void spi_init(void)
{
    gpMCSPIRegs = (MCSPI_REGS *)MCSPI_SPI1_IO_BASE;
    unsigned long u, n;
    /* initialize the multipad and interface clock */
    spi_init_spi1();
    /* soft reset */
    CSP_BITFINS(gpMCSPIRegs->SYSCONFIG, SPI_SYSCONFIG_SOFTRESET, 1);
    for (n = 0; n < 100; n++) {
        u = CSP_BITFEXT(gpMCSPIRegs->SYSSTATUS,
                        SPI_SYSSTATUS_RESETDONE);
        if (u)
            break;
    }
    ...more code
}
here in
omap_spi.h
#define CSP_BITFINS(var, bit, val) \
(CSP_BITFCLR(var, bit)); (var |= CSP_BITFVAL(bit, val))
My confusion is that when they do the soft reset, they call this CSP_BITFINS macro. Inside the macro, all they do is manipulate bits and fill structures. Where do they access the hardware registers to configure them?
If you look further, you'll find that the pointer they are using, gpMCSPIRegs, is volatile and points at the memory-mapped hardware registers. The bits they are setting/clearing are in the hardware registers.
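A small host-side sketch may make this concrete. It assumes a much reduced register layout and a single-bit SOFTRESET field at bit 1; the BITF_MASK and BITF_INS macros, the SOFTRESET_* constants and the fake_mcspi object are hypothetical and are not the definitions from omap_spi.h. On the target the pointer would be set to the peripheral's base address, so the same read-modify-write through a volatile member becomes a bus access to the device register.

```c
#include <stdint.h>

/* Hypothetical, reduced register layout; the real MCSPI_REGS has more fields. */
typedef struct {
    volatile uint32_t SYSCONFIG;
    volatile uint32_t SYSSTATUS;
} MCSPI_REGS;

/* Assumed field description: SOFTRESET is one bit wide at bit position 1. */
#define SOFTRESET_LSH 1u
#define SOFTRESET_WID 1u

#define BITF_MASK(lsh, wid) ((((uint32_t)1 << (wid)) - 1u) << (lsh))

/* Clear the field, then OR in the new value: one read-modify-write. */
#define BITF_INS(var, lsh, wid, val)                                \
    do {                                                            \
        (var) = ((var) & ~BITF_MASK(lsh, wid)) |                    \
                (((uint32_t)(val) << (lsh)) & BITF_MASK(lsh, wid)); \
    } while (0)

/* On hardware this pointer would hold the peripheral base address;
 * here it points at an ordinary object so the code runs on a host. */
static MCSPI_REGS fake_mcspi;
static MCSPI_REGS *const gpMCSPIRegs = &fake_mcspi;

static void soft_reset(void)
{
    BITF_INS(gpMCSPIRegs->SYSCONFIG, SOFTRESET_LSH, SOFTRESET_WID, 1);
}
```

The do/while wrapper is a design choice of this sketch: the CSP_BITFINS macro quoted in the question expands to two separate statements, so it would misbehave as the body of an unbraced if.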