I have the following problem:
A microcontroller with USB 1.1, a 32-byte buffer for bulk transfers, and a lot of real-time data to move to a Linux (kernel 2.6) PC.
As far as I understand, the maximum theoretical bandwidth available for bulk transfers in this case is 19 transfers * 32 bytes / frame (1 ms) = 608 Kbytes/second.
The problem for me is that this is still not enough to move the data in real time, and changing to a USB 2.0 uC is not possible...
Is there anything I can do in software (e.g. create a patch for Linux 2.6) to get 1 or 2 extra bulk transfers per frame?
Thanks,
George
Since the limit is imposed by the physical USB hardware, there is no way to speed up transfer short of implementing compression on both sides of the transfer.
Even then, it is unlikely you will be able to speed up the transfer considerably.
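The estimate in the question can be reproduced with a few lines. This is only a sketch of the questioner's own arithmetic (19 transfers per 1 ms frame, 32-byte payload); the function name is made up for illustration, and the figure of 21 transfers is the hypothetical "2 extra transfers" case, not something the bus is known to allow.

```python
def bulk_bandwidth_bytes_per_s(transfers_per_frame, payload_bytes, frames_per_s=1000):
    """Payload bytes per second for a given number of bulk transfers per frame."""
    return transfers_per_frame * payload_bytes * frames_per_s

# Figures taken from the question: 19 transfers of 32 bytes per 1 ms frame.
question_estimate = bulk_bandwidth_bytes_per_s(19, 32)

# Hypothetical: what 2 extra transfers per frame would buy, if they were possible.
with_two_extra = bulk_bandwidth_bytes_per_s(21, 32)

print(question_estimate)                    # bytes per second
print(with_two_extra - question_estimate)   # gain in bytes per second
```

Even in that hypothetical case the gain is about 10%, which is consistent with the answer that software changes cannot speed things up considerably.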
When I run scylla_setup, the iotune result is:
Measuring sequential write bandwidth: 473 MB/s
Measuring sequential read bandwidth: 499 MB/s
Measuring random write IOPS: 1902 IOPS
Measuring random read IOPS: 1999 IOPS
So the IOPS is 1900-2000.
When I use fio:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=/dev/sdc1 --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
the result is:
test: (groupid=0, jobs=1): err= 0: pid=11697: Wed Jun 26 08:58:13 2019
read: IOPS=47.6k, BW=186MiB/s (195MB/s)(3070MiB/16521msec)
bw ( KiB/s): min=187240, max=192136, per=100.00%, avg=190278.42, stdev=985.15, samples=33
iops : min=46810, max=48034, avg=47569.61, stdev=246.38, samples=33
write: IOPS=15.9k, BW=62.1MiB/s (65.1MB/s)(1026MiB/16521msec)
bw ( KiB/s): min=62656, max=65072, per=100.00%, avg=63591.52, stdev=590.96, samples=33
iops : min=15664, max=16268, avg=15897.88, stdev=147.74, samples=33
cpu : usr=4.82%, sys=12.81%, ctx=164053, majf=0, minf=23
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwt: total=785920,262656,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
READ: bw=186MiB/s (195MB/s), 186MiB/s-186MiB/s (195MB/s-195MB/s), io=3070MiB (3219MB), run=16521-16521msec
WRITE: bw=62.1MiB/s (65.1MB/s), 62.1MiB/s-62.1MiB/s (65.1MB/s-65.1MB/s), io=1026MiB (1076MB), run=16521-16521msec
Disk stats (read/write):
sdc: ios=780115/260679, merge=0/0, ticks=792798/230409, in_queue=1023170, util=99.47%
Read IOPS is 46000-48000 and write IOPS is 15000-16000.
(NB: It looks like the questioner filed this as a Scylla Github issue too - https://github.com/scylladb/scylla/issues/4604 )
[Why is] the disk iops from scylla_setup iotune [...] different from fio test data
Different benchmarks, different results:
Scylla may have been using a much bigger block size (e.g. 64k) per I/O (this is likely the biggest factor). As you make the block size bigger (up to some maximum due to diminishing returns), the bandwidth (i.e. the total amount of data you can send in, say, a second) achieved with that block size goes up, but the IOPS you get will typically go down (you are sending more data per I/O after all). This is normal!
Scylla could be using buffered I/O (rather than direct I/O)
Scylla may have been benchmarking reads and writes separately
Scylla may have been using a bigger queue depth
Scylla may have been batching its submissions differently
Scylla may be writing a different type of data
And so on...
In general, it's very difficult to take benchmarks done with different tools and compare them directly to each other - you would need to know what they are doing under the hood for any comparison to be meaningful. Trying to look at IOPS or bandwidth in isolation without more context is meaningless, as you typically trade one off against the other. It's better to use the same benchmark tool with identical options to compare two different machines or to measure the impact of tuning changes on the same machine.
TL;DR: This is likely an apples to oranges comparison where the tools are measuring different contexts.
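The link between IOPS, block size and bandwidth is just a multiplication, which can be checked against the fio output above. The sketch below uses the 4k block size from the fio command line and the rounded IOPS figures from its report; the 64 KiB case is a purely hypothetical illustration of the trade-off, not a measurement of what iotune does.

```python
MIB = 1024 * 1024

def mib_per_s(iops, block_bytes):
    """Bandwidth implied by an IOPS figure at a given block size."""
    return iops * block_bytes / MIB

def iops_at(bandwidth_mib_per_s, block_bytes):
    """IOPS needed to sustain a given bandwidth at a given block size."""
    return bandwidth_mib_per_s * MIB / block_bytes

# fio reported 47.6k read IOPS and 15.9k write IOPS with --bs=4k.
read_bw = mib_per_s(47_600, 4096)
write_bw = mib_per_s(15_900, 4096)
print(round(read_bw, 1), round(write_bw, 1))

# Hypothetical: the same read bandwidth delivered in 64 KiB blocks
# needs 16 times fewer I/Os per second.
print(round(iops_at(read_bw, 64 * 1024)))
```

The computed bandwidths line up with the 186MiB/s and 62.1MiB/s that fio printed, which shows that an IOPS number only means something together with the block size it was measured at.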
PS: gtod_reduce is a go faster stripe that very few people actually need. If your hardware isn't capable of doing gigabytes per second and you're not seeing your CPU maxed out it's unlikely reducing gettimeofday calls is going to nudge the result very much.
(This question might be more appropriate for Server Fault (and thus get better replies there) because it's not directly about programming)
I'm going to improve OCL kernel performance and want to clarify how memory transactions work and what memory access pattern is really better (and why).
The kernel is fed with vectors of 8 integers which are defined as array: int v[8], that means, before doing any computation entire vector must be loaded into GPRs. So, I believe the bottleneck of this code is initial data load.
First, I consider some theory basics.
Target HW is a Radeon RX 480/580, which has a 256-bit GDDR5 memory bus, on which a burst read/write transaction has a granularity of 8 words; hence, one memory transaction reads 2048 bits or 256 bytes. That, I believe, is what CL_DEVICE_MEM_BASE_ADDR_ALIGN refers to:
Alignment (bits) of base address: 2048.
Thus, my first question: what is the physical sense of 128-byte cacheline? Does it keep the portion of data fetched by single burst read but not really requested? What happens with the rest if we requested, say, 32 or 64 bytes - thus, the leftover exceeds the cache line size? (I suppose, it will be just discarded - then, which part: head, tail...?)
Now back to my kernel, I think that cache does not play a significant role in my case because one burst reads 64 integers -> one memory transaction can theoretically feed 8 work items at once, there is no extra data to read, and memory is always coalesced.
But still, I can place my data with two different access patterns:
1) contiguous
a[i] = v[get_global_id(0) * 8 + i];
(which is actually performed as)
*(int8*)a = *(int8*)v;
2) interleaved
a[i] = v[get_global_id(0) + i * get_global_size(0)];
I expect that in my case contiguous would be faster because, as said above, one memory transaction can completely fill 8 work items with data. However, I do not know how the scheduler in a compute unit physically works: does it need all data to be ready for all SIMD lanes, or would just the first portion for 4 parallel SIMD elements be enough? Nevertheless, I suppose it is smart enough to fully supply at least one CU with data first, since CUs may execute command flows independently.
While in second case we need to perform 8 * global_size / 64 transactions to get a complete vector.
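The two layouts can be compared by counting how many 256-byte bursts one wavefront touches. This is a rough model only: it assumes a 64-work-item wavefront, a contiguous layout that stores each work item's eight ints back to back (stride 8), a global size that is a multiple of 64, and it ignores caches and memory channels entirely. All names are illustrative.

```python
VEC = 8          # ints per vector, from the question
WAVE = 64        # work items per wavefront (assumed)
BURST_INTS = 64  # 256-byte burst / 4-byte int, from the question

def contiguous_index(gid, i):
    # Each work item's vector occupies 8 adjacent ints.
    return gid * VEC + i

def interleaved_index(gid, i, global_size):
    # Element i of every work item is stored in one contiguous row.
    return gid + i * global_size

def bursts_touched(indices):
    # Number of distinct 256-byte bursts covering the given int indices.
    return len({idx // BURST_INTS for idx in indices})

global_size = 1024  # arbitrary example value

# Bursts needed for one load instruction (element 0 of every work item).
per_step_contig = bursts_touched(contiguous_index(g, 0) for g in range(WAVE))
per_step_inter = bursts_touched(interleaved_index(g, 0, global_size) for g in range(WAVE))

# Bursts needed for the complete vectors of the whole wavefront.
total_contig = bursts_touched(contiguous_index(g, i) for g in range(WAVE) for i in range(VEC))
total_inter = bursts_touched(interleaved_index(g, i, global_size) for g in range(WAVE) for i in range(VEC))

print(per_step_contig, per_step_inter, total_contig, total_inter)
```

Under these assumptions both layouts touch the same number of bursts for the complete vectors; they differ only in how the bursts are spread across the individual load steps.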
So, my second question: is my assumption right?
Now, the practice.
Actually, I split the entire task into two kernels because one part has less register pressure than the other and can therefore employ more work items. So first I played with the pattern in which the data is stored in transition between the kernels (using vload8/vstore8 or casting to int8 gives the same result), and the result was somewhat strange: the kernel that reads data contiguously works about 10% faster (both in CodeXL and by OS time measurement), but the kernel that stores data contiguously performs surprisingly slower. The overall time for the two kernels is then roughly the same. I would expect both to behave the same way, either both slower or both faster, so these inverse results seem unexplainable.
And my third question is: can anyone explain such a result? Or maybe I am doing something wrong? (Or completely wrong?)
Well, this does not really answer all my questions, but some information found in the vastness of the internet puts things together in a clearer way, at least for me (unlike the above-mentioned AMD Optimization Guide, which seems unclear and sometimes confusing):
«the hardware performs some coalescing, but it's complicated...
memory accesses in a warp do not necessarily have to be contiguous, but it does matter how many 32 byte global memory segments (and 128 byte l1 cache segments) they fall into. the memory controller can load 1, 2 or 4 of those 32 byte segments in a single transaction, but that's read through the cache in 128 byte cache lines.
thus, if every lane in a warp loads a random word in a 128 byte range, then there is no penalty; it's 1 transaction and the reading is at full efficiency. but, if every lane in a warp loads 4 bytes with a stride of 128 bytes, then this is very bad: 4096 bytes are loaded but only 128 are used, resulting in ~3% efficiency.»
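The ~3% figure in the quote follows directly from counting cache lines. A small sketch, assuming a 32-lane warp, 4-byte loads, 128-byte cache lines, distinct addresses per lane and no load straddling a line boundary (the helper name is made up for illustration):

```python
def load_efficiency(addresses, bytes_per_lane=4, line_bytes=128):
    """Fraction of fetched bytes that the lanes actually use."""
    lines = {addr // line_bytes for addr in addresses}  # distinct cache lines hit
    loaded = len(lines) * line_bytes
    used = len(addresses) * bytes_per_lane
    return used / loaded

LANES = 32

# Every lane loads 4 bytes with a stride of 128 bytes: one cache line per lane.
strided = [lane * 128 for lane in range(LANES)]

# Every lane loads an adjacent 4-byte word: all lanes share one cache line.
packed = [lane * 4 for lane in range(LANES)]

print(load_efficiency(strided))  # 128 used out of 4096 loaded
print(load_efficiency(packed))   # 128 used out of 128 loaded
```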
So, in my case it does not really matter how the data is read/stored as long as it is always contiguous, but the order in which the parts of the vectors are loaded may affect the subsequent command flow (re)scheduling by the compiler.
I also can imagine that newer GCN architecture can do cached/coalesced writes, that is why my results are different from those prompted by that Optimization Guide.
Have a look at chapter 2.1 in the AMD OpenCL Optimization Guide. It focuses mostly on older generation cards, but the GCN architecture did not completely change, so it should still apply to your device (Polaris).
In general AMD cards have multiple memory controllers to which in every clock cycle memory requests are distributed. If you for example access your values in column-major instead of row-major logic your performance will be worse because the requests are sent to the same memory controller. (by column major I mean a column of your matrix is accessed together by all the work-items executed in the current clock cycle, this is what you refer to as coalesced vs interleaved). If you access one row of elements (meaning coalesced) in a single clock cycle (meaning all work-items access values within the same row), those requests should be distributed to different memory controllers rather than the same.
Regarding alignment and cache line sizes, I'm wondering if this really helps improve the performance. If I were in your situation, I would check whether I can optimize the algorithm itself, or, if the values are accessed often, whether it would make sense to copy them to local memory. But then again, it is hard to tell without any knowledge of what your kernels execute.
Best Regards,
Michael
I'm currently working on a school project to design a network, and we're asked to assess traffic on the network. In our solution (dealing with taxi drivers), each driver will have a smartphone that can be used to track its position to assign him the best ride possible (through Google Maps, for instance).
What would be the size of data sent and received by a single app during one day? (I need a rough estimate, no real need for a precise answer to the closest bit)
Thanks
GPS positions stored compactly, but not compressed, need this number of bytes:
time: 8 (4 bytes is possible too)
latitude: 4 (if used as integer or float) or 8
longitude: 4 or 8
speed: 2-4 (short: 2, integer: 4)
course: 2-4
So binary stored in main memory, one location including the most important attributes, will need 20 - 24 bytes.
If you store them in main memory as single location objects, an additional 16 bytes per object are needed in a simple (Java) solution.
The maximum recording frequency is usually once per second (1/s). Per hour this needs 3600 s * 40 bytes = 144 kB, so a smartphone easily stores that even in main memory.
Not sure if you want to transmit the data:
When transmitting this to a server, the amount of data will usually increase, depending on the transmission protocol used.
But it mainly depends on how you transmit the data and how often.
If you transmit a position every 5 minutes, you don't have to care, even when you use a simple solution that transmits 100 times more bytes than necessary.
For your school project, try to transmit not more than every 5 or better 10 minutes.
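The numbers above can be turned into a rough daily estimate. The sketch below uses the 40 bytes per position from this answer (24 bytes of fields plus 16 bytes of object overhead) and assumes a 24-hour day; the 100x factor is the deliberately pessimistic protocol overhead mentioned above, not a measured value.

```python
BYTES_PER_POSITION = 24 + 16   # fields + per-object overhead, from the answer
SECONDS_PER_DAY = 24 * 3600

def positions_per_day(interval_s):
    """How many positions are produced per day at a given interval."""
    return SECONDS_PER_DAY // interval_s

# Recording once per second, kept in memory.
stored_per_hour = 3600 * BYTES_PER_POSITION
stored_per_day = positions_per_day(1) * BYTES_PER_POSITION

# Transmitting one position every 5 minutes, with a 100x overhead.
sent_per_day_raw = positions_per_day(5 * 60) * BYTES_PER_POSITION
sent_per_day_pessimistic = sent_per_day_raw * 100

print(stored_per_hour, stored_per_day)
print(sent_per_day_raw, sent_per_day_pessimistic)
```

Even the pessimistic transmit figure stays at roughly one megabyte per driver per day under these assumptions.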
Encryption adds a considerable overhead.
To save bytes:
- Collect as long as feasible, then transmit at once.
- Favor binary protocols over text-based ones (BSON rather than JSON). (This might be out of scope for your school project.)
The USB specification (Table 5-4) states that, given an isochronous endpoint with a maxPacketSize of 128 bytes, as many as 10 transactions can be done per frame. This gives 128 * 10 * 1000 = 1.28 MB/s of theoretical bandwidth.
At the same time it states
The host must not issue more than 1 transaction in a single frame for a specific isochronous endpoint.
Doesn't this contradict the aforementioned table?
I've done some tests and found that only 1 transaction is done per frame on my device. I also found on several web sites that just 1 transaction can be done per frame (1 ms). Of course I suppose the spec is the correct reference, so my question is: what could be the cause of receiving only 1 packet per frame? Am I misunderstanding the spec, and what I think are transactions are actually something else?
The host must not issue more than 1 transaction in a single frame for a specific isochronous endpoint.
Assuming USB Full Speed you could still have 10 isochronous 128 byte transactions per frame by using 10 different endpoints.
The Table 5-4 seems to miss calculations for chapter 5.6.4 "Isochronous Transfer Bus Access Constraints". The 90% rule reduces the max number of 128 byte isochr. transactions to nine.
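The three cases discussed here differ only in the number of transactions per frame. A sketch of the arithmetic, using the 128-byte packet size and 1 ms frame from the question and the transaction counts quoted in this thread (10 from the table, 9 after the 90% rule, 1 for a single endpoint); the function name is illustrative:

```python
def iso_bandwidth_bytes_per_s(transactions_per_frame, packet_bytes=128, frames_per_s=1000):
    """Payload bytes per second for isochronous transactions at full speed."""
    return transactions_per_frame * packet_bytes * frames_per_s

table_figure = iso_bandwidth_bytes_per_s(10)    # Table 5-4 value from the question
with_90_rule = iso_bandwidth_bytes_per_s(9)     # after the 90% rule, per this answer
single_endpoint = iso_bandwidth_bytes_per_s(1)  # one transaction per frame per endpoint

print(table_figure, with_90_rule, single_endpoint)
```

So a single 128-byte isochronous endpoint tops out at 128,000 bytes/s, and the larger figures are only reachable by spreading the data over several endpoints.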
We've been trying to figure out why we only achieve writing speed of ~53MBps on UHS104 cards that claim 90MBps.
Due to hardware constraints, clock frequency supplied to the card is only 148.5 MHz (instead of 208MHz).
Does that mean that we should achieve speed of (148.5 * 4)/8 = 74.25MBps?
Or is our calculation wrong, since it assumes that if the card guarantees a speed of 90MBps at a frequency of 208MHz, then it should guarantee a speed of 74.25MBps at a frequency of 148.5MHz?
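The two ways of reasoning in the question give different numbers, which is easy to see when both are written out. The sketch assumes a 4-bit data bus with one bit per line per clock and no protocol overhead for the raw bus figure, and a purely linear scaling of the card's claimed speed with clock frequency for the second figure; neither assumption is guaranteed by the card.

```python
def raw_bus_mb_per_s(clock_mhz, data_lines=4):
    """Raw bus throughput in MB/s: one bit per data line per clock, 8 bits per byte."""
    return clock_mhz * data_lines / 8

def scaled_claim_mb_per_s(claimed_mb_per_s, rated_mhz, actual_mhz):
    """Claimed speed scaled linearly with clock frequency (an assumption)."""
    return claimed_mb_per_s * actual_mhz / rated_mhz

raw_at_148 = raw_bus_mb_per_s(148.5)                   # upper bound of the bus
raw_at_208 = raw_bus_mb_per_s(208)                     # for comparison
scaled_claim = scaled_claim_mb_per_s(90, 208, 148.5)   # card claim at reduced clock

print(raw_at_148, raw_at_208, round(scaled_claim, 2))
```

Under these assumptions 74.25MBps is the ceiling of the bus itself, while the card's 90MBps claim scaled to 148.5MHz comes out lower than that, closer to the ~53MBps that was measured.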
The simplified physical layer spec states that for maximum performance you need to write full AU blocks - usually 2 or 4 MByte, otherwise the card will have to copy data around internally when writing across block boundaries. Unfortunately, most of the Speed Class Specification is missing in the 4.13 chapter.
The first AUs may have a different wear level strategy, as they are normally used for the FATs. This could make them slower to write to.