I don't understand where the Redis ziplist saves so much memory.
A ziplist entry has the structure below:
struct ZipListEntry{
int prev_entry_bytes_length;
int encoding;
char* contents;
};
A common list node has this structure:
class ListNode{
int prev;
int next;
int encoding;
char* data;
};
Comparing the two, ListNode has just one more member (next) than ZipListEntry, so where does the Redis ziplist save so much memory? Is it because a ziplist's memory is allocated once and is therefore contiguous, while a common ListNode's memory is allocated many times, which causes memory fragmentation?
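A rough byte count can make the comparison concrete. The sketch below is an illustration, not Redis source code. It assumes the ziplist entry layout as commonly documented: a 1- or 5-byte previous-length field, a 1-, 2- or 5-byte encoding field for strings, and the content stored inline rather than behind a char*. PointerListNode and the helper function names are hypothetical, and per-allocation allocator overhead is ignored.

```cpp
#include <cstddef>

// Size of the "previous entry length" field: 1 byte when the previous
// entry is shorter than 254 bytes, otherwise 5 bytes.
constexpr std::size_t prevLenFieldSize(std::size_t prevEntryLen) {
    return prevEntryLen < 254 ? 1 : 5;
}

// Size of the encoding field for a string payload: 1 byte up to 63 bytes,
// 2 bytes up to 16383 bytes, 5 bytes beyond that.
constexpr std::size_t stringEncodingFieldSize(std::size_t contentLen) {
    return contentLen <= 63 ? 1 : (contentLen <= 16383 ? 2 : 5);
}

// Total bytes one string entry occupies inside the contiguous ziplist blob.
constexpr std::size_t zipEntrySize(std::size_t prevEntryLen, std::size_t contentLen) {
    return prevLenFieldSize(prevEntryLen) + stringEncodingFieldSize(contentLen) + contentLen;
}

// Hypothetical pointer-based node: two link pointers plus a pointer to
// separately stored content.
struct PointerListNode {
    PointerListNode* prev;
    PointerListNode* next;
    char* data;
};

// Bytes for one node plus its content, ignoring allocator bookkeeping.
constexpr std::size_t pointerNodeSize(std::size_t contentLen) {
    return sizeof(PointerListNode) + contentLen;
}
```

Under these assumptions a 5-byte string costs 7 bytes in the ziplist, while the pointer-based node costs three pointers plus the 5 content bytes, before counting any allocator overhead. The saving comes mostly from the variable-width header fields and the inline content, not from dropping the next member.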
I'm in a C++/CLI project, and I have a byte* variable that I want to fully convert into a managed array<byte> as efficiently as possible.
Currently, the only way that I've seen is to manually create the managed array<byte> object, and then copy individual bytes from the byte* variable, as shown below:
void Foo(byte* source, int bytesCount)
{
auto buffer = gcnew array<byte>(bytesCount);
for (int i = 0; i < bytesCount; ++i)
{
buffer[i] = source[i];
}
}
Is there any other way to do this more efficiently? Ideally, to not have to copy the memory at all.
If not, is there any way to do this more cleanly?
You can't create a managed array from an unmanaged buffer without copying.
However, you don't need to copy the individual bytes in a loop. You can pin the managed array (see pin_ptr) and then copy the bytes directly from the unmanaged buffer into the memory of the managed array, such as with memcpy() or equivalent.
void Foo(byte* source, int bytesCount)
{
auto buffer = gcnew array<byte>(bytesCount);
{
pin_ptr<byte> p = &buffer[0];
byte *cp = p;
memcpy(cp, source, bytesCount);
}
// use buffer as needed...
}
I wanted to ask how SPI RX DMA on an STM32 will behave in the following situation.
I have an array A of (for example) 96 bytes, which is intended to store the data received from SPI. I turn on circular SPI DMA, which operates byte by byte and is configured for 96 bytes.
Is it possible, once the DMA has filled my 96-byte array and the Transfer Complete interrupt fires, to quickly copy the array to another one, B, before the circular DMA starts writing to A again (and destroys the data before it is saved in B)?
Every time I get new data from A into B, I want to transfer the data from B quickly over USB to the PC.
I'm just wondering how to transmit a continuous SPI data stream from the STM32 over USB to the PC, because I think sending a 96-byte block over USB at regular intervals is easier than streaming SPI to USB in real time on the STM32. I don't know if it's even possible.
For that to work, you would have to guarantee that you can copy all the data before the next SPI byte is received and transferred to the start of the buffer. Whether that is possible depends on the clock speed of the processor and the speed of the SPI, and on being able to guarantee that no higher priority interrupts occur that might delay the copy. To be safe it would need an exceptionally slow SPI speed, and in that case you would probably not need DMA at all.
All in all it is a bad idea and entirely unnecessary. The DMA controller has a "half-transfer" interrupt for exactly this purpose. You will get the HT interrupt when the first 48 bytes have been transferred, and the DMA will continue transferring the remaining 48 bytes while you copy the lower half of the buffer. When you get the transfer complete interrupt, you copy the upper half. That extends the time you have to copy the data from the receive time of a single byte to the receive time of 48 bytes.
If you actually need 96 bytes on each transfer, then you simply make your buffer 192 bytes long (2 x 96).
In pseudo-code:
#define BUFFER_LENGTH 96
char DMA_Buffer[2][BUFFER_LENGTH] ;
void DMA_IRQHandler()
{
if( DMA_IT_Flag(DMA_HT) == SET )
{
memcpy( B, DMA_Buffer[0], BUFFER_LENGTH ) ;
Clear_IT_Flag(DMA_HT) ;
}
else if( DMA_IT_Flag(DMA_TC) == SET )
{
memcpy( B, DMA_Buffer[1], BUFFER_LENGTH ) ;
Clear_IT_Flag(DMA_TC) ;
}
}
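To see the ping-pong pattern run without hardware, here is a minimal host-side simulation. It is only a sketch: PingPongSimulator, dmaStoreByte and completedBlocks are hypothetical names, the "DMA" is an ordinary function call, and the copy happens synchronously, so it shows the ordering of the half-transfer and transfer-complete events rather than the real timing.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t kBlockLength = 96;  // bytes per half, as in the pseudo-code

// Host-side stand-in for a circular DMA channel with half-transfer (HT)
// and transfer-complete (TC) events.
class PingPongSimulator {
public:
    using Block = std::array<std::uint8_t, kBlockLength>;

    // Models the DMA controller storing one received SPI byte.
    void dmaStoreByte(std::uint8_t byte) {
        dmaBuffer_[writePos_++] = byte;
        if (writePos_ == kBlockLength) {
            copyOut(0);      // HT: the lower half is complete
        } else if (writePos_ == 2 * kBlockLength) {
            writePos_ = 0;   // circular mode wraps to the start
            copyOut(1);      // TC: the upper half is complete
        }
    }

    const std::vector<Block>& completedBlocks() const { return completed_; }

private:
    // Copies one finished half out while the other half keeps filling.
    void copyOut(std::size_t half) {
        assert(half < 2);
        Block block;
        std::memcpy(block.data(), dmaBuffer_.data() + half * kBlockLength, kBlockLength);
        completed_.push_back(block);
    }

    std::array<std::uint8_t, 2 * kBlockLength> dmaBuffer_{};
    std::size_t writePos_ = 0;
    std::vector<Block> completed_;
};
```

Feeding it 192 bytes yields two completed 96-byte blocks, one per event, which is the behaviour the handler above relies on.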
With respect to transferring the data to a PC over USB, first of all you need to be sure that your USB transfer rate is at least as fast as the SPI transfer rate. It is likely that the USB transfer is less deterministic (because it is controlled by the PC host, that is, you can only output data on the USB when the host explicitly asks for it), so even if the average transfer rate is sufficient, there may be latency that requires further buffering. Rather than simply copying from the DMA buffer A to a USB buffer B, you may need a circular buffer or FIFO queue to feed the USB. On the other hand, if you already have the buffers DMA_Buffer[0], DMA_Buffer[1] and B, you already effectively have a FIFO of three blocks of 96 bytes, which may be sufficient.
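If more buffering turns out to be needed, a FIFO of fixed-size blocks is one option. The following is a minimal sketch under stated assumptions: BlockFifo and its members are hypothetical names, and the class is not interrupt-safe as written (count_ is modified by both producer and consumer), so on a real target push() and pop() would need to be guarded, for example by briefly masking the DMA interrupt.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Fixed-capacity FIFO of equally sized blocks.
template <std::size_t BlockSize, std::size_t Capacity>
class BlockFifo {
public:
    using Block = std::array<std::uint8_t, BlockSize>;

    bool empty() const { return count_ == 0; }
    bool full() const { return count_ == Capacity; }
    std::size_t size() const { return count_; }

    // Producer side (e.g. a DMA half or complete event). Returns false and
    // drops the block if the consumer has fallen too far behind.
    bool push(const Block& block) {
        if (full()) {
            return false;
        }
        blocks_[(head_ + count_) % Capacity] = block;
        ++count_;
        return true;
    }

    // Consumer side (e.g. a USB IN request). Precondition: !empty().
    Block pop() {
        assert(!empty());
        Block block = blocks_[head_];
        head_ = (head_ + 1) % Capacity;
        --count_;
        return block;
    }

private:
    std::array<Block, Capacity> blocks_{};
    std::size_t head_ = 0;   // index of the oldest block
    std::size_t count_ = 0;  // number of queued blocks
};
```

With BlockFifo<96, 3>, the USB side can lag by up to three blocks before push() starts reporting dropped data, which makes overruns visible instead of silent.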
In one of my projects I faced a similar problem. The task was to transfer data coming from an external ADC chip (connected with SPI) to PC over full speed USB. The data was (8 ch x 16-bit) and I was requested to achieve the fastest sampling frequency possible.
I ended up with a triple buffer solution. There are 4 possible states a buffer can be in:
READY: Buffer is full of data, ready to be sent over USB
SENT: Buffer is already sent and outdated
IN_USE: DMA (requested by SPI) is currently filling this buffer
NEXT: This buffer is considered empty and will be used when IN_USE is full.
As the timing of the USB requests can't be synchronized with the SPI process, I believe a double buffer solution wouldn't work. If you don't have a NEXT buffer, by the time you decide to send the READY buffer, DMA may finish filling the IN_USE buffer and start corrupting the READY buffer. But in a triple buffer solution, the READY buffer is safe to send over USB, as it won't be overwritten even if the current IN_USE buffer becomes full.
So the buffer states look like this as the time passes:
Buf0 Buf1 Buf2
==== ==== ====
READY IN_USE NEXT
SENT IN_USE NEXT
NEXT READY IN_USE
NEXT SENT IN_USE
IN_USE NEXT READY
Of course, if the PC doesn't issue USB requests fast enough, you may still lose a READY buffer as soon as it turns into NEXT (before becoming SENT). The PC sends USB IN requests asynchronously, with no information about the current buffer states. If there is no READY buffer (it's in the SENT state), the STM32 responds with a ZLP (zero-length packet) and the PC tries again after a 1 ms delay.
For the implementation on STM32, I use double buffered mode and I modify M0AR & M1AR registers in the DMA Transfer Complete ISR to address 3 buffers.
BTW, I used (3 x 4000) bytes buffers and achieved 32 kHz sampling frequency at the end. USB is configured as vendor specific class and it uses bulk transfers.
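The state rotation in the table can be sketched as a small piece of bookkeeping. This is not the code from my project, only an illustration: TripleBufferStates, onDmaComplete and onUsbInRequest are hypothetical names, the object starts in the first row of the table, and the register updates (M0AR/M1AR) and the actual USB transfer are left out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

enum class BufState { READY, SENT, IN_USE, NEXT };

class TripleBufferStates {
public:
    // DMA transfer complete: the filled buffer becomes READY, the NEXT
    // buffer becomes IN_USE, and the remaining buffer (SENT, or a READY
    // buffer that was never sent and is now lost) becomes NEXT.
    void onDmaComplete() {
        const std::size_t inUse = indexOf(BufState::IN_USE);
        const std::size_t next = indexOf(BufState::NEXT);
        const std::size_t other = 3 - inUse - next;  // indices 0 + 1 + 2 == 3
        states_[inUse] = BufState::READY;
        states_[next] = BufState::IN_USE;
        states_[other] = BufState::NEXT;
    }

    // USB IN request: returns the index of the buffer to send and marks it
    // SENT, or returns -1 when nothing is READY (answer with a ZLP).
    int onUsbInRequest() {
        for (std::size_t i = 0; i < states_.size(); ++i) {
            if (states_[i] == BufState::READY) {
                states_[i] = BufState::SENT;
                return static_cast<int>(i);
            }
        }
        return -1;
    }

    BufState state(std::size_t i) const {
        assert(i < states_.size());
        return states_[i];
    }

private:
    std::size_t indexOf(BufState s) const {
        for (std::size_t i = 0; i < states_.size(); ++i) {
            if (states_[i] == s) {
                return i;
            }
        }
        assert(false && "exactly one buffer is always IN_USE and one NEXT");
        return 0;
    }

    // First row of the table: Buf0 READY, Buf1 IN_USE, Buf2 NEXT.
    std::array<BufState, 3> states_{{BufState::READY, BufState::IN_USE, BufState::NEXT}};
};
```

Alternating onUsbInRequest() and onDmaComplete() walks through the rows of the table in order, and calling onDmaComplete() twice in a row reproduces the lost-buffer case described above.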
Generally using circular DMA only works if you trigger on the half full/half empty, otherwise you don't have enough time to copy information out of the buffer.
I would recommend against copying the data out of the buffer during the interrupt. Rather, use the data directly from the buffer without an additional copy step.
If you do the copy in the interrupt, you are blocking other lower priority interrupts during the copy. On an STM32, a simple naive byte copy of 48 bytes may take an additional 48*6 ≈ 300 clock cycles.
If you track the buffer's read and write positions independently, you just need to update a single pointer and post a delayed notification call to the consumer of the buffer.
If you want a longer period, then don't use circular DMA; rather, use normal DMA in 48-byte blocks and implement a circular byte buffer as a data structure.
I did this for a USART at 460k baud that asynchronously receives variable-length packets. If you ensure that the producer only updates the write pointer and the consumer only updates the read pointer, you can avoid data races in most of it. Note that reads and writes of an aligned <=32-bit variable on Cortex-M3/M4 are atomic.
The included code is a simplified version of the circular buffer with DMA support that I used. It is limited to buffer sizes that are 2^n and uses Templates and C++11 functionality so it may not be suitable depending on your development/platform constraints.
To use the buffer, call getDmaReadBlock() or getDmaWriteBlock() to get the DMA memory address and block length. Once the DMA completes, use skipRead() / skipWrite() to increment the read or write pointer by the actual amount that was transferred.
/**
* Creates a circular buffer. There is a read pointer and a write pointer
* The buffer is full when the write pointer is = read pointer -1
*/
template<uint16_t SIZE=256>
class CircularByteBuffer {
public:
struct MemBlock {
uint8_t *blockStart;
uint16_t blockLength;
};
private:
uint8_t *_data;
uint16_t _readIndex;
uint16_t _writeIndex;
static constexpr uint16_t _mask = SIZE - 1;
// is the circular buffer a power of 2
static_assert((SIZE & (SIZE - 1)) == 0, "SIZE must be a power of 2");
public:
CircularByteBuffer &operator=(const CircularByteBuffer &) = default;
CircularByteBuffer(uint8_t (&data)[SIZE]);
CircularByteBuffer(const CircularByteBuffer &) = default;
~CircularByteBuffer() = default;
private:
static uint16_t wrapIndex(int32_t index);
public:
/*
* The number of bytes available to be read. Writing bytes to the buffer can only increase this amount.
*/
uint16_t readBytesAvail() const;
/**
* Return the number of bytes that can still be written. Reading bytes can only increase this amount.
*/
uint16_t writeBytesAvail() const;
/**
* Read a byte from the buffer and increment the read pointer
*/
uint8_t readByte();
/**
* Write a byte to the buffer and increment the write pointer. Throws away the byte if there is no space left.
* @param byte the byte to write
*/
void writeByte(uint8_t byte);
/**
* Provide read only access to the buffer without incrementing the pointer. While memory accesses outside the
* allocated memory cannot be performed, garbage data can still be read if that byte does not contain valid data.
* @param pos the offset from the current read pointer
* @return the byte at the given offset in the buffer.
*/
uint8_t operator[](uint32_t pos) const;
/**
* Increment the read pointer by a given amount
*/
void skipRead(uint16_t amount);
/**
* Increment the write pointer by a given amount
*/
void skipWrite(uint16_t amount);
/**
* Get the start and length of the memory block used for DMA writes into the queue.
* @return the start address and length of the block
*/
MemBlock getDmaWriteBlock();
/**
* Get the start and length of the memory block used for DMA reads from the queue.
* @return the start address and length of the block
*/
MemBlock getDmaReadBlock();
};
// CircularByteBuffer
// ------------------
template<uint16_t SIZE>
inline CircularByteBuffer<SIZE>::CircularByteBuffer(uint8_t (&data)[SIZE]):
_data(data),
_readIndex(0),
_writeIndex(0) {
}
template<uint16_t SIZE>
inline uint16_t CircularByteBuffer<SIZE>::wrapIndex(int32_t index){
return static_cast<uint16_t>(index & _mask);
}
template<uint16_t SIZE>
inline uint16_t CircularByteBuffer<SIZE>::readBytesAvail() const {
return wrapIndex(_writeIndex - _readIndex);
}
template<uint16_t SIZE>
inline uint16_t CircularByteBuffer<SIZE>::writeBytesAvail() const {
return wrapIndex(_readIndex - _writeIndex - 1);
}
template<uint16_t SIZE>
inline uint8_t CircularByteBuffer<SIZE>::readByte() {
if (readBytesAvail()) {
uint8_t result = _data[_readIndex];
_readIndex = wrapIndex(_readIndex+1);
return result;
} else {
return 0;
}
}
template<uint16_t SIZE>
inline void CircularByteBuffer<SIZE>::writeByte(uint8_t byte) {
if (writeBytesAvail()) {
_data[_writeIndex] = byte;
_writeIndex = wrapIndex(_writeIndex+1);
}
}
template<uint16_t SIZE>
inline uint8_t CircularByteBuffer<SIZE>::operator[](uint32_t pos) const {
return _data[wrapIndex(_readIndex + pos)];
}
template<uint16_t SIZE>
inline void CircularByteBuffer<SIZE>::skipRead(uint16_t amount) {
_readIndex = wrapIndex(_readIndex+ amount);
}
template<uint16_t SIZE>
inline void CircularByteBuffer<SIZE>::skipWrite(uint16_t amount) {
_writeIndex = wrapIndex(_writeIndex+ amount);
}
template <uint16_t SIZE>
inline typename CircularByteBuffer<SIZE>::MemBlock CircularByteBuffer<SIZE>::getDmaWriteBlock(){
uint16_t len = static_cast<uint16_t>(SIZE - _writeIndex);
// full is (write == (read -1)) so on wrap around we need to ensure that we stop 1 off from the read pointer.
if( _readIndex == 0){
len = static_cast<uint16_t>(len - 1);
}
if( _readIndex > _writeIndex){
len = static_cast<uint16_t>(_readIndex - _writeIndex - 1);
}
return {&_data[_writeIndex], len};
}
template <uint16_t SIZE>
inline typename CircularByteBuffer<SIZE>::MemBlock CircularByteBuffer<SIZE>::getDmaReadBlock(){
if( _readIndex > _writeIndex){
return {&_data[_readIndex], static_cast<uint16_t>(SIZE- _readIndex)};
} else {
return {&_data[_readIndex], static_cast<uint16_t>(_writeIndex - _readIndex)};
}
}
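The masking arithmetic behind wrapIndex(), readBytesAvail() and writeBytesAvail() above can be checked in isolation. The snippet below is a stand-alone sketch with an assumed buffer size of 8. kSize, kMask, readAvail and writeAvail are hypothetical free-function versions of the members above, written only to show why a power-of-2 size lets a single AND handle wrap-around, including negative differences.

```cpp
#include <cstdint>

constexpr std::uint16_t kSize = 8;          // assumed size, must be a power of 2
constexpr std::uint16_t kMask = kSize - 1;  // 0b0111
static_assert((kSize & (kSize - 1)) == 0, "kSize must be a power of 2");

// Wraps any (possibly negative) index into the range [0, kSize).
constexpr std::uint16_t wrapIndex(std::int32_t index) {
    return static_cast<std::uint16_t>(index & kMask);
}

// Bytes that can be read: distance from the read index to the write index.
constexpr std::uint16_t readAvail(std::uint16_t readIdx, std::uint16_t writeIdx) {
    return wrapIndex(writeIdx - readIdx);
}

// Bytes that can be written: one slot is always kept free so that
// "full" (write == read - 1) can be told apart from "empty" (write == read).
constexpr std::uint16_t writeAvail(std::uint16_t readIdx, std::uint16_t writeIdx) {
    return wrapIndex(readIdx - writeIdx - 1);
}
```

For example, with the read index at 6 and the write index wrapped around to 2, four bytes (slots 6, 7, 0, 1) are readable and three are writable, so the two counts always add up to kSize - 1.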
I was going through the CCL code samples that come with the oneAPI toolkit.
In the DPC++ (SYCL) code below, a buffer sendbuf is first created on the CPU side and is not initialised. In the part where work is offloaded to the target device, dev_acc_sbuf[id], a variable in the kernel scope, is modified. This variable (dev_acc_sbuf) is not used again in the program, nor is its value copied back to sendbuf. Then, in the next line, sendbuf is used for the allreduce. I am not able to understand how changing dev_acc_sbuf changes sendbuf.
cl::sycl::queue q;
cl::sycl::buffer<int, 1> sendbuf(COUNT);
/* open sendbuf and modify it on the target device side */
q.submit([&](cl::sycl::handler& cgh) {
auto dev_acc_sbuf = sendbuf.get_access<mode::write>(cgh);
cgh.parallel_for<class allreduce_test_sbuf_modify>(range<1>{COUNT}, [=](item<1> id) {
dev_acc_sbuf[id] += 1;
});
});
/* invoke ccl_allreduce on the CPU side */
ccl_allreduce(&sendbuf,
&recvbuf,
COUNT,
ccl_dtype_int,
ccl_reduction_sum,
NULL,
NULL,
stream,
&request);
In the line "auto dev_acc_sbuf = sendbuf.get_access<mode::write>(cgh);", dev_acc_sbuf is a handle (an accessor) that accesses sendbuf, not a separate buffer. Changes made through the dev_acc_sbuf handle are reflected in the original buffer, i.e. sendbuf. This is an advantage of SYCL: changes made in the kernel scope are automatically copied back to the original variable.
On most systems, the host and the device do not share physical memory: the CPU might use RAM and the GPU might use its own global memory. SYCL needs to know which data it will be sharing between the host and the devices.
For this purpose, SYCL uses buffers. The buffer class is generic over the element type and the number of dimensions. When passed a raw pointer, the buffer(T* ptr, range size) constructor takes ownership of the memory it has been passed. This means that we absolutely cannot use that memory ourselves while the buffer exists, which is why we begin a C++ scope. At the end of their scope, the buffers will be destroyed and the memory returned to the user. The size argument is a range object, which has to have the same number of dimensions as the buffer and is initialized with the number of elements in each dimension. In the example below, there is one dimension with ten elements.
Buffers are not associated with a particular queue or context, so they are capable of handling data transparently between multiple devices.
Accessors are used to request access to the device memory from the buffer objects. Their modes take care of data movement between host and device, so we do not have to explicitly copy the result back from device to host.
Below is the example for more clarification:
#include <cstdlib>
#include <iostream>
#include <CL/sycl.hpp>
using namespace std;
class vector_addition;
int main(int, char**) {
//creating host memory
int *a=(int *)malloc(10*sizeof(int));
int *b=(int *)malloc(10*sizeof(int));
int *c=(int *)malloc(10*sizeof(int));
for(int i=0;i<10;i++){
a[i]=i;
b[i]=10-i;
}
cl::sycl::default_selector device_selector;
cl::sycl::queue queue(device_selector);
std::cout << "Running on "<< queue.get_device().get_info<cl::sycl::info::device::name>()<< "\n";
{
//creating buffer from pointer of host memory
cl::sycl::buffer<int, 1> a_sycl{a, cl::sycl::range<1>{10} };
cl::sycl::buffer<int, 1> b_sycl{b, cl::sycl::range<1>{10} };
cl::sycl::buffer<int, 1> c_sycl{c, cl::sycl::range<1>{10} };
queue.submit([&] (cl::sycl::handler& cgh) {
//creating accessor of buffer with proper mode
auto a_acc = a_sycl.get_access<cl::sycl::access::mode::read>(cgh);
auto b_acc = b_sycl.get_access<cl::sycl::access::mode::read>(cgh);
auto c_acc = c_sycl.get_access<cl::sycl::access::mode::write>(cgh);//responsible for copying back to host memory
//kernel for execution
cgh.parallel_for<class vector_addition>(cl::sycl::range<1>{ 10 }, [=](cl::sycl::id<1> idx) {
c_acc[idx] = a_acc[idx] + b_acc[idx];
});
});
}
for(int i=0;i<10;i++){
cout<<c[i]<<" ";
}
cout<<"\n";
return 0;
}
I regularly check lwIP, a free TCP/IP stack, with Coverity.
As a network stack, we have untrusted data coming in from the network which is stored in struct pbuf (some members omitted for clarity):
struct pbuf {
void *payload;
u16_t len;
u16_t ref;
};
My questions are:
1) I want to model that "void* payload" of struct pbuf ALWAYS points to tainted data, every access to it must be untrusted. How can I do this?
2) We use refcounting (u16_t ref). Is there any way to model refcounting in Coverity?
Say I have a CUDA GPU kernel for a custom TensorFlow op that uses constant memory:
__constant__ int cdata[100];
__global__ void frobulate(float * data)
{
int i = blockDim.x*blockIdx.x + threadIdx.x;
float value = data[i];
for(int j=0; j < 100; ++j) {
value += cdata[j];
}
}
Then, when implementing the Compute method in my Frobulate custom op
class Frobulate : public tensorflow::OpKernel
{
public:
void Compute(OpKernelContext * context) override
{
...
// Get the current device
const Device & device = context->eigen_device<Eigen::GpuDevice>();
// Local, mutating version of constant data.
// For illustration purposes only
int local_data[100];
// Reason about our local shape
TensorShape local_shape(100);
// Create a pointer to hold allocated output
Tensor * pinned_ary_ptr = nullptr;
// Allocate memory for the complex_phase,
// I don't think allocate_output is correct here...
// but we need pinned host memory for an async transfer
OP_REQUIRES_OK(context, context->allocate_output(
0, local_shape, &pinned_ary_ptr));
for(int i=0; i<100; ++i)
{ pinned_ary_ptr[i] = local_data[i]; }
// Get the symbol address of cdata and enqueue an
// async transfer on the device's stream
int * d_cdata_ptr;
cudaGetSymbolAddress((void **)&d_cdata_ptr, &cdata);
cudaMemcpyAsync(d_cdata_ptr, pinned_ary_ptr, sizeof(int)*100,
cudaMemcpyHostToDevice, device.stream());
// Call the kernel
frobulate<<<grid, blocks, 0, device.stream()>>>(data);
}
};
Is this the right way to go about doing things? i.e. Ideally it would be good to make cdata an Input or Attr in my REGISTER_OP, but I don't think this will link up to the constant data correctly. I think the cudaGetSymbolAddress is necessary...
Is it safe? i.e. Will I interfere with tensorflow's GPU Stream Executor by enqueueing my own cuda commands and memcpys on the supplied stream?
Is context->allocate_output the correct method to call to get some pinned memory? Looking in the tensorflow codebase suggests that there are temp and scratch allocators, but I don't know if they're exposed to the user...
Edit 1: Does this allocate pinned memory? (memory usually allocated with cudaHostAlloc, whose pages are pinned for DMA transfers to the GPU, i.e. they're prevented from being swapped out by the OS).
tensorflow::AllocatorAttributes pinned_allocator;
pinned_allocator.set_on_host(true);
pinned_allocator.set_gpu_compatible(true);
// Allocate memory for the constant data
OP_REQUIRES_OK(context, context->allocate_temp(
DT_UINT8, cdata_shape, &cdata_tensor,
pinned_allocator));
Yes the cudaGetSymbolAddress is necessary. Constant memory is specific to the kernel and should not
It should not. Just make sure that the sequence of operations in your stream execution are in the right order and synced up properly.
Yes, output is the memory that the kernel will write as the result of the operation. The scratch memory is mainly used for memory that you need just for a single operation of the kernel. Some cuDNN kernels, like the convolution ones, use it. See tensorflow/kernels/conv_ops.cc