Cache block swap between different ways - gem5

I am trying to implement functionality in the cache that swaps blocks within the same set but across different ways. For example, the data stored in way 0 is moved to way 1, and the data stored in way 1 is moved to way 0. The blocks should not be evicted from the cache during this swap.
The cache model is the classic set-associative cache (not Ruby).
I noticed that cache_blk.hh describes what is stored in a cache block, so I wrote the following in base_tag to perform the content swap.
blkA->tag = blkB->tag;
blkA->srcMasterId = blkB->srcMasterId;
blkA->task_id = blkB->task_id;
blkA->tickInserted = blkB->tickInserted;
blkA->refCount = blkB->refCount;
blkA->whenReady = blkB->whenReady;
blkA->status = blkB->status;
for (int i = 0; i < blk_size; i++)
{
    *(blkA->data + i) = *(blkB->data + i);
}
This modification caused data corruption, so it did not work correctly. Can someone give me some suggestions about this?
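For what it's worth, I realize the code above copies blkB into blkA rather than exchanging the two blocks, so after it runs both ways hold the same tag, which may well be the corruption. A full two-way swap would presumably go through std::swap (or a temporary block), something like this sketch (field names as in cache_blk.hh; the CacheBlk type name is assumed, and I am not sure every field is safe to exchange):
void swapBlocks(CacheBlk *blkA, CacheBlk *blkB, unsigned blk_size)
{
    std::swap(blkA->tag, blkB->tag);                 // needs <utility>
    std::swap(blkA->srcMasterId, blkB->srcMasterId);
    std::swap(blkA->task_id, blkB->task_id);
    std::swap(blkA->tickInserted, blkB->tickInserted);
    std::swap(blkA->refCount, blkB->refCount);
    std::swap(blkA->whenReady, blkB->whenReady);
    std::swap(blkA->status, blkB->status);
    for (unsigned i = 0; i < blk_size; i++)          // swap the data payloads too
        std::swap(blkA->data[i], blkB->data[i]);
}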
Many Thanks!

Displaying a pymongo cursor without loading in a list

I am working on a PyQt5 app. I want to display a large dataset extracted from a MongoDB database.
To do so, I extract my collection into 3 cursors (I need to sort the display). Until now, I was casting the cursors into a list, then emitting it.
But now my database has grown significantly in size, and runtime has become a major issue. Going through lists is time-consuming. Therefore, I am trying to find a way to directly access the cursor in its entirety without looping through all of it (I know it sounds a bit tricky, since the cursor is just a reference to the collection).
An example of what is done now :
D1 = list(self.collection.find({"$and":[ {"location": current_location}, {"vehicle":"Car" }]}))
D2 = list(self.collection.find({"$and":[ {"location": current_location}, {"vehicle":{"$nin":["Car","Truck"]}}]}))
D3 = list(self.collection.find({"$and":[ {"location": current_location}, {"vehicle" : "Truck"}]}).sort([("_id",-1)]).limit(60 - len(D1)))
extracted = D1 + D2 + D3
l_data_extracted.emit(extracted) # send the loaded data to the front
Loading the 3 cursors is time-consuming, but then combining them into 1 list makes it even heavier for the app.
I looked for resources about cursor management, but every time I see answers that involve looping or casting (which I already do).
Would there be a way to directly emit the cursor, a bit like passing an argument by reference in C to get at the pointed-to data? Or am I bound to loop/cast into a list due to the cursor's particular nature?
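Conceptually, what I would like is to declare the signal with a generic object payload and emit the cursors themselves, so the receiving slot can iterate them lazily. Something like this (hypothetical sketch, untested):
from PyQt5.QtCore import QObject, pyqtSignal

class Loader(QObject):
    # 'object' lets the signal carry any Python value, including a pymongo cursor
    l_data_extracted = pyqtSignal(object)

    def load(self, current_location):
        d1 = self.collection.find({"$and": [{"location": current_location}, {"vehicle": "Car"}]})
        # ... D2 and D3 built as above ...
        self.l_data_extracted.emit(d1)  # emit the cursor itself, not list(d1)
But I don't know whether emitting the cursor like this is legitimate, or whether I'm bound to materialise it anyway.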

Chronicle Queue - reader/tailer latency when run at same time while writing

I'm setting up market data back-testing using Chronicle Queue (CQ): reading data from a binary file, writing it into a single CQ, and simultaneously reading the data from that CQ and dumping statistics. This is a POC to replace our existing real-time market data feed handler worker queue.
While doing basic read/write testing on a Linux/SSD setup, I see reads lagging behind writes; in fact, the latency is accumulating. Both the Appender and the Tailer run as separate processes on the same host.
I would like to know if there is any issue in the code I am using.
Below is the code snippet -
Writer -
In constructor -
myQueue = SingleChronicleQueueBuilder.binary(queueName).build();
myAppender = myQueue.acquireAppender();
In data callback -
myAppender.writeDocument(myDataPacket);
myQueue.close();
where myDataPacket is a Java object wrapping the byte[] and other fields.
Tailer -
In Constructor -
myQueue = SingleChronicleQueueBuilder.binary(queueName).build();
myTailer = myQueue.createTailer();
In Read method -
while (notLastRecord)
{
    if (myTailer.readDocument(myDataPacket))
    {
        notLastRecord = !myDataPacket.isLast(); // placeholder end-of-stream check
        // do stuff
    }
}
myQueue.close();
Any help is highly appreciated.
Thanks,
Pavan
First of all, I assume that by "reads are lagging behind writes - in fact latency is accumulating" you mean that for every subsequent message, the time at which the message is read from the queue drifts further from the time at which the event was written to the queue.
If you see latency accumulating like that, most likely the data is produced much more quickly than you can consume it, which is very plausible for the use case you described: if all the write side needs to do is parse a simple text line and dump it into a queue file, that's quick, but if you do some processing when you read the entry from the queue, it might be slower.
From the snippets it's not clear how much work your code is doing, and the code looks OK to me, except that you probably shouldn't call queue.close() after each appender.writeDocument() call; then again, most likely you are not doing that, otherwise it would blow up.
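To illustrate, here is a sketch of the writer lifecycle I'd expect (assuming myDataPacket is Marshallable and that your callback fires many times): the queue is built and closed exactly once, with the callback only writing in between:
import net.openhft.chronicle.queue.ExcerptAppender;
import net.openhft.chronicle.queue.impl.single.SingleChronicleQueue;
import net.openhft.chronicle.queue.impl.single.SingleChronicleQueueBuilder;

try (SingleChronicleQueue queue = SingleChronicleQueueBuilder.binary(queueName).build()) {
    ExcerptAppender appender = queue.acquireAppender();
    // the data callback runs here, potentially millions of times:
    appender.writeDocument(myDataPacket);
}   // the queue is closed once, on shutdown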
Without seeing actual code or test case it's impossible to say more.

How to use VkPipelineCache?

If I understand it correctly, I'm supposed to create an empty VkPipelineCache object, pass it into vkCreateGraphicsPipelines, and data will be written into it. I can then reuse it with other pipelines I'm creating, or save it to a file and use it on the next run.
I've tried following the LunarG example to extract the info:
uint32_t headerLength = pData[0];
uint32_t cacheHeaderVersion = pData[1];
uint32_t vendorID = pData[2];
uint32_t deviceID = pData[3];
But I always get headerLength is 32 and the rest 0. Looking at the spec (https://vulkan.lunarg.com/doc/view/1.0.26.0/linux/vkspec.chunked/ch09s06.html Table 9.1), the cacheHeaderVersion should always be 1, as the only available cache header version is VK_PIPELINE_CACHE_HEADER_VERSION_ONE = 1.
Also the size of pData is usually only 32 bytes, even when I create 10 pipelines with it. What am I doing wrong?
A Vulkan pipeline cache is an opaque object that's only meaningful to the driver. There are very few operations that you're supposed to use on it.
Creating a pipeline cache, optionally with a block of opaque binary data that was saved from an earlier run
Getting the opaque binary data from an existing pipeline cache, typically to serialize to disk before exiting your application
Destroying a pipeline cache as part of the proper shutdown process.
The idea is that the driver can use the cache to speed up creation of pipelines within your program, and also to speed up pipeline creation on subsequent runs of your application.
You should not be attempting to interpret the cache data returned from vkGetPipelineCacheData at all. The only purpose for that data is to be passed into a later call to vkCreatePipelineCache.
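For example, a minimal save-and-restore sketch (file I/O and error handling elided; variable names are illustrative):
size_t size = 0;
vkGetPipelineCacheData(device, pipelineCache, &size, NULL);   // first call: query the blob size
void *blob = malloc(size);
vkGetPipelineCacheData(device, pipelineCache, &size, blob);   // second call: fetch the opaque blob
// ... write the blob to disk before shutdown ...

// On the next run, read the blob back and feed it in unmodified:
VkPipelineCacheCreateInfo info = {};
info.sType = VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO;
info.initialDataSize = size;
info.pInitialData = blob;
vkCreatePipelineCache(device, &info, NULL, &pipelineCache);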
Also the size of pData is usually only 32 bytes, even when I create 10 pipelines with it. What am I doing wrong?
Drivers must implement vkCreatePipelineCache, vkGetPipelineCacheData, etc. But they don't actually have to support caching. So if you're working with a driver that doesn't have anything it can cache, or hasn't done the work to support caching, then you'd naturally get back an empty cache (other than the header).

VB.NET, best practice for sorting concurrent results of threadpooling?

In short, I'm trying to "sort" incoming results of using threadpooling as they finish. I have a functional solution, but there's no way in the world it's the best way to do it (it's prone to huge pauses). So here I am! I'll try to hit the bullet points of what's going on/what needs to happen and then post my current solution.
The intent of the code is to get information about files in a directory, and then write that to a text file.
I have a list (Counter.ListOfFiles) that is a list of the file paths sorted in a particular way. This is the guide that dictates the order I need to write to the text file.
I'm using a threadpool to collect the information about each file and create a StringBuilder with all of the text ready to write to the text file. I then call a procedure (SyncUpdateResource, included below), passing the StringBuilder (strBld) from that thread along with the path of the file that particular thread just wrote about (Xref).
The procedure includes a SyncLock to hold all the other threads until it finds a thread passing the correct information, that being when the Xref passed by the thread matches the first item in my list (FirstListItem). When that happens, I write to the text file, remove the first item from the list, and do it again with the next thread.
The way I'm using the monitor is probably not great; in fact, I have little doubt I'm using it in an offensively wanton manner. Basically, while the Xref (from the thread) <> the first item in my list, I do a PulseAll on the monitor. I originally used Monitor.Wait, but it would eventually just give up trying to sort through the list, even when using a Pulse elsewhere. I may have just been coding something awkwardly. Either way, I don't think it's going to change anything.
Basically the problem comes down to the fact that the monitor will pulse through all of the items in its queue, when there's a good chance the item I'm looking for was passed to it earlier in the queue, and it will now sort through all of the items again before looping back around to find a match. The result is that my code hits one of these and takes a huge amount of time to complete.
I'm open to believing I'm just using the wrong tool for the job, or just not using the tool I have correctly. I would strongly prefer some sort of threaded solution (unsurprisingly, it's much faster!). I've been messing around a bit with the Parallel Task functionality today, and a lot of that stuff looks promising, but I have even less experience with it than with the threadpool, and you can see how I'm abusing that! Maybe something with a queue? You get the idea. I am directionless. Anything someone could suggest would be much appreciated. Thanks! Let me know if you need any additional information.
Private Sub SyncUpdateResource(strBld As Object, Xref As String)
    SyncLock (CType(strBld, StringBuilder))
        Dim FirstListitem As String = Counter.ListOfFiles.First
        Do While Xref <> FirstListitem
            FirstListitem = Counter.ListOfFiles.First
            'This makes the code much faster for reasons I can only guess at.
            Thread.Sleep(5)
            Monitor.PulseAll(CType(strBld, StringBuilder))
        Loop
        Dim strVol As String = Form1.Volname
        Dim strLFPPath As String = Form1.txtPathDir
        My.Computer.FileSystem.WriteAllText(strLFPPath & "\" & strVol & ".txt", strBld.ToString, True)
        Counter.ListOfFiles.Remove(Xref)
    End SyncLock
End Sub
This is a pretty typical multiple producer, single consumer application. The only wrinkle is that you have to order the results before they're written to the output. That's not difficult to do. So let's let that requirement slide for a moment.
The easiest way in .NET to implement a producer/consumer relationship is with BlockingCollection, which is a thread-safe FIFO queue. Basically, you do this:
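In outline (a sketch; ItemType stands in for whatever your work item type is):
using System.Collections.Concurrent;   // home of BlockingCollection

var queue = new BlockingCollection<ItemType>();

// Each producer thread processes a file, then:
queue.Add(item);

// When every producer has finished:
queue.CompleteAdding();

// The consumer loops until adding completes and the queue drains:
foreach (var item in queue.GetConsumingEnumerable())
{
    // write item to the output file
}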
In your case, the producer threads get items, do whatever processing they need to, and then put the item onto the queue. There's no need for any explicit synchronization--the BlockingCollection class implementation does that for you.
Your consumer pulls things from the queue and outputs them. You can see a really simple example of this in my article Simple Multithreading, Part 2. (Scroll down to the third example if you're just interested in the code.) That example just uses one producer and one consumer, but you can have N producers if you want.
Your requirements have a little wrinkle in that the consumer can't just write items to the file as it gets them. It has to make sure that it's writing them in the proper order. As I said, that's not too difficult to do.
What you want is a priority queue of some sort onto which you can place an item if it comes in out of order. Your priority queue can be a sorted list or even just a sequential list if the number of items you expect to get out of order isn't very large. That is, if you typically have only a half dozen items at a time that could be out of order, then a sequential list could work just fine.
I'd use a heap because it performs well. The .NET Framework doesn't supply a heap, but I have a simple one that works well for jobs like this. See A Generic BinaryHeap Class.
So here's how I'd write the consumer (the code is in pseudo-C#, but you can probably convert it easily enough).
The assumption here is that you have a BlockingCollection called sharedQueue that contains the items. The producers put items on that queue. Consumers do this:
int expectedSequenceKey = 0; // first key expected in the output sequence
var heap = new BinaryHeap<ItemType>();
foreach (var item in sharedQueue.GetConsumingEnumerable())
{
if (item.SequenceKey == expectedSequenceKey)
{
// output this item
// then check the heap to see if other items need to be output
expectedSequenceKey = expectedSequenceKey + 1;
while (heap.Count > 0 && heap.Peek().SequenceKey == expectedSequenceKey)
{
var heapItem = heap.RemoveRoot();
// output heapItem
expectedSequenceKey = expectedSequenceKey + 1;
}
}
else
{
// item is out of order
// put it on the heap
heap.Insert(item);
}
}
// if the heap contains items after everything is processed,
// then some error occurred.
One glaring problem with this approach as written is that the heap could grow without bound if one of your consumers crashes or goes into an infinite loop. But then, your other approach probably would suffer from that as well. If you think that's an issue, you'll have to add some way to skip an item that you think won't ever be forthcoming. Or kill the program. Or something.
If you don't have a binary heap or don't want to use one, you can do the same thing with a SortedList<ItemType>. SortedList will be faster than List, but slower than BinaryHeap if the number of items in the list is even moderately large (a couple dozen). Fewer than that and it's probably a wash.
I know that's a lot of info. I'm happy to answer any questions you might have.

Concurrent access to a single FFTSetup data structure in GCD

Is it okay to create a single FFTSetup data structure and use it to perform multiple FFT computations concurrently? Would something like the following work?
FFTSetup fftSetup = vDSP_create_fftsetup(
16, // vDSP_Length __vDSP_log2n,
kFFTRadix2 // FFTRadix __vDSP_radix
);
NSAssert(fftSetup != NULL, @"vDSP_create_fftsetup() failed to allocate storage");
for (int i = 0; i < 100; i++)
{
dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0),
^{
vDSP_fft_zrip(
fftSetup, // FFTSetup __vDSP_setup,
&(splitComplex[i]), // DSPSplitComplex *__vDSP_ioData,
1, // vDSP_Stride __vDSP_stride,
16, // vDSP_Length __vDSP_log2n,
kFFTDirection_Forward // FFTDirection __vDSP_direction
);
});
}
I suppose that the answer depends on the following considerations:
1) Does vDSP_fft_zrip() only access the data within fftSetup (or the data pointed to by it) in a "read-only" fashion? Or are there perhaps some temporary buffers (scratch space) within fftSetup that are written to by vDSP_fft_zrip() in performing its FFT computations?
2) If data like that in fftSetup is being accessed in a "read-only" fashion, is it okay for multiple processes/threads/tasks/blocks to access it simultaneously? (I am thinking of the case where it is possible for more than one process to open the same file for reading, though not necessarily for writing or appending. Is this analogy appropriate?)
On a related note, just how much memory is being taken up by the FFTSetup data structure? Is there any way to find out? (It is an opaque data type.)
You may create one FFT setup and use it repeatedly and concurrently. This is the intended use. (I am the author of the current implementations of vDSP_fft_zrip and other FFT implementations in vDSP.)
In Using Fourier Transforms we're told that the FFTSetup contains the FFT weight array, which is a series of complex exponentials. The vDSP_create_fftsetup documentation says
Once prepared, the setup structure can be used repeatedly by FFT functions (which read the data in the structure and do not alter it) for any (power of two) length up to that specified when you created the structure.
so
conceptually, vDSP_fft_zrip should not need to modify the weight array, so it would appear to be one of the FFT functions that do not alter the FFTSetup (I haven't seen any that do, apart from create/destroy). However, there are no guarantees about what the actual implementation does; it could do anything.
if vDSP_fft_zrip truly accesses its FFTSetup in a read-only fashion, then it's fine to do that from multiple threads.
As for memory usage, the FFT weight array is e^{i*k*2*M_PI/N} for k = [0..N-1], which is N complex float values, so that would be 2*N*sizeof(float) bytes.
But those complex exponentials are very symmetric so who knows, under the hood the implementation could require less memory. Or more!
In your case, N = 2^16, so by the formula above the full array would be 2 * 65536 * 4 bytes = 512 kB; given the symmetry, it wouldn't be strange to see something in the 256 kB range actually used.
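If you want a rough empirical number, you can compare heap statistics before and after creating the setup (a Darwin-specific sketch; any other allocations made in between will pollute the figure):
#include <stdio.h>
#include <malloc/malloc.h>
#include <Accelerate/Accelerate.h>

malloc_statistics_t before, after;
malloc_zone_statistics(NULL, &before);   // NULL selects the default malloc zone
FFTSetup setup = vDSP_create_fftsetup(16, kFFTRadix2);
malloc_zone_statistics(NULL, &after);
printf("approx. FFTSetup footprint: %zu bytes\n",
       (size_t)(after.size_in_use - before.size_in_use));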
Where does that leave you? I think it seems reasonable that the FFTSetup be accessible from multiple threads, but it appears to be undocumented. You could be lucky. Or unlucky and unpleasantly surprised now or in a future version of the framework.
So... do you feel lucky?
I wouldn't attempt any explicit concurrency with the vDSP functions, or any other function in the Accelerate framework (of which vDSP is a part) for that matter. Why? Because Accelerate is already designed to take advantage of multiple cores, as well as specific nuances of a given processor implementation, on your behalf - see http://developer.apple.com/library/mac/#DOCUMENTATION/Darwin/Reference/ManPages/man7/vecLib.7.html. You may end up essentially re-parallelizing already parallel computations that are internal to the implementation (if not now, then possibly in a later version).
The best approach to the Accelerate framework is generally to assume that it's more clever than you are and just use it in the simplest way possible, then do your performance measurements. If those measurements reflect a level of performance that is somehow insufficient for your needs, then try your own optimizations (and/or file a bug report against the Accelerate framework at http://bugreport.apple.com, since the authors of that framework are always interested in knowing where or if their efforts somehow fell short of developer requirements).