I'm currently working on a school project to design a network, and we're asked to assess traffic on the network. In our solution (dealing with taxi drivers), each driver will have a smartphone that can be used to track his position so he can be assigned the best possible ride (through Google Maps, for instance).
What would be the size of data sent and received by a single app during one day? (I need a rough estimate, no real need for a precise answer to the closest bit)
Thanks
GPS positions stored compactly but not compressed need this number of bytes:
time: 8 (4 bytes is possible, too)
latitude: 4 (if stored as integer or float) or 8
longitude: 4 or 8
speed: 2-4 (short: 2, int: 4)
course: 2-4
So stored in binary in main memory, one location including the most important attributes will need 20-24 bytes.
If you store them in main memory as single location objects, an additional 16 bytes per object are needed in a simple (Java) solution.
The maximum recording frequency is usually once per second (1/s). Per hour this needs 3600 s * 40 bytes = 144 kB, so a smartphone easily stores that even in main memory.
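To make that concrete, here is a minimal sketch in C, assuming the field widths above (a millisecond timestamp, latitude/longitude as 1e-7 degree integers, 16-bit speed and course); the struct and its field names are made up for illustration and land at the lower end of the 20-24 byte estimate:

#include <stdint.h>
#include <stdio.h>

/* One GPS fix, packed as suggested above (field widths are assumptions). */
#pragma pack(push, 1)
typedef struct {
    int64_t  time_ms;    /* Unix time in milliseconds     */
    int32_t  lat_e7;     /* latitude  * 1e7 as an integer */
    int32_t  lon_e7;     /* longitude * 1e7 as an integer */
    uint16_t speed_cms;  /* speed in cm/s                 */
    uint16_t course_cd;  /* course in 1/100 degree        */
} GpsFix;
#pragma pack(pop)

int main(void) {
    /* 20 bytes per fix; one hour at 1 Hz is about 72 kB of raw data
       (the 144 kB above includes the per-object overhead). */
    printf("bytes per fix: %zu\n", sizeof(GpsFix));
    printf("bytes per hour at 1 Hz: %zu\n", sizeof(GpsFix) * 3600);
    return 0;
}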
Not sure if you want to transmit the data:
When transmitting this to a server, the data volume will usually grow, depending on the transmission protocol used.
But it mainly depends on how you transmit the data and how often.
If you transmit a position every 5 minutes, you don't have to care, even
when you use a simple solution that transmits 100 times more bytes than necessary.
For your school project, try to transmit no more than every 5 or, better, every 10 minutes.
Encryption adds a huge overhead.
To save bytes:
- Collect as long as feasible, then transmit at once.
- Favor binary protocols over text-based ones (BSON is better than JSON). (This might be out of scope for your school project.)
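Putting the numbers above together, a rough daily figure for one driver can be estimated like this (a sketch in C; the 12-hour shift, 1 Hz fix rate and 10x protocol/encryption overhead factor are assumptions, not measurements):

#include <stdio.h>

int main(void) {
    /* Rough daily traffic estimate for a single driver app.
       All inputs are assumptions for illustration. */
    const double bytes_per_fix   = 20.0;            /* raw fix, see above    */
    const double fixes_per_sec   = 1.0;             /* 1 Hz recording        */
    const double seconds_per_day = 12.0 * 3600.0;   /* a 12-hour shift       */
    const double overhead_factor = 10.0;            /* JSON/HTTP/TLS padding */

    double raw   = bytes_per_fix * fixes_per_sec * seconds_per_day;
    double total = raw * overhead_factor;

    printf("raw fix data per day: %.0f kB\n", raw / 1000.0);
    printf("with 10x overhead:    %.1f MB\n", total / 1e6);
    return 0;
}

With those assumptions, the raw fixes come to well under 1 MB per day, and even a very wasteful transmission format only pushes that into the single-digit MB range per driver per day.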
So what I mean by live second data is something like the stock market, where every second new data is being recorded for each specific stock item.
How would the data look in the database? Does it have a timestamp of each second? If so, wouldn't that cause the database to quickly fill up? Are there specific Databases that manage this type of stuff?
Thank you!
Given the sheer amount of money that gets thrown around in fintech, I'd be surprised if trading platforms even use traditional RDBMS databases to store their trading data, but I digress...
How would the data look in the database?
(Again, assuming they're even using a relational model in the first place) it would be something like this in SQL:
CREATE TABLE SymbolPrices (
    Symbol char(4)  NOT NULL, -- 4 bytes, or even 3 bytes given a symbol char only needs 5 bits per char.
    Utc    datetime NOT NULL, -- 8-byte timestamp (nanosecond precision)
    Price  int      NOT NULL  -- Assuming integer cents (not 4 digits), that's 4 bytes
)
...which has a fixed row length of 16 bytes.
Does it have a timestamp of each second?
It can do, but not per second - you'd need far greater granularity than that: I wouldn't be surprised if they were using at least 100-nanosecond resolution, which is a common unit for computer system clock "ticks" (e.g. .NET's DateTime.Ticks is a 64-bit integer value of 100-nanosecond units). Java and JavaScript both use milliseconds, though this resolution might be too coarse.
Storage space requirements for changing numeric values can always be significantly optimized if you store deltas instead of absolute values: I reckon it could come down to 8 bytes per record:
I reason that 3 bytes is sufficient to store trade timestamp deltas at ~1.5ms resolution, assuming 100,000 trades per day per stock: 3 bytes can hold 16.7m values, enough to cover a 7-hour (25,200s) trading window.
Price deltas can likely also be reduced to a 2-byte value (-$327.68 to +$327.67).
And assuming symbols never exceed 4 uppercase Latin characters (A-Z), they can be represented in 3 bytes.
Giving an improved fixed row length of 8 bytes (3 + 3 + 2).
Though you would now need to store "keyframe" data every few thousand rows to prevent needing to re-play every trade from the very beginning to get the current price.
If data is physically partitioned by symbol (i.e. using a separate file on disk for each symbol), then you don't need to include the symbol in the record at all, bringing the row length down to merely 5 bytes.
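As a sketch of what such a delta record could look like in C (the exact layout, field names and units are assumptions for illustration, following the 3 + 2 + 3 byte split above):

#include <stdint.h>
#include <stdio.h>

/* 8-byte delta record: 3-byte time delta in ~1.5 ms units,
   2-byte signed price delta in cents, 3-byte packed A-Z symbol. */
typedef struct {
    uint8_t bytes[8];
} TradeDelta;

/* Pack a 4-character A-Z symbol into 20 bits (5 bits per character).
   Assumes exactly 4 uppercase characters; shorter symbols would need padding. */
static uint32_t pack_symbol(const char sym[4]) {
    uint32_t packed = 0;
    for (int i = 0; i < 4; i++)
        packed = (packed << 5) | (uint32_t)(sym[i] - 'A');
    return packed;
}

static TradeDelta encode(uint32_t time_delta_units,  /* must be < 2^24 */
                         int16_t  price_delta_cents,
                         const char sym[4]) {
    TradeDelta r;
    uint32_t s = pack_symbol(sym);
    r.bytes[0] = (uint8_t)(time_delta_units >> 16);
    r.bytes[1] = (uint8_t)(time_delta_units >> 8);
    r.bytes[2] = (uint8_t)(time_delta_units);
    r.bytes[3] = (uint8_t)((uint16_t)price_delta_cents >> 8);
    r.bytes[4] = (uint8_t)((uint16_t)price_delta_cents);
    r.bytes[5] = (uint8_t)(s >> 16);
    r.bytes[6] = (uint8_t)(s >> 8);
    r.bytes[7] = (uint8_t)(s);
    return r;
}

int main(void) {
    TradeDelta d = encode(123, -50, "MSFT");  /* a made-up trade */
    printf("record size: %zu bytes\n", sizeof(d.bytes));
    return 0;
}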
If so, wouldn't that cause the database to quickly fill up?
No, not really (at least assuming you're using HDDs made since the early 2000s); consider that:
Major stock-exchanges really don't have that many stocks, e.g. NASDAQ only has a few thousand stocks (5,015 apparently).
While high-profile stocks (AAPL, AMD, MSFT, etc.) typically have 30-day sales volumes on the order of 20-130m, that's only the most popular ~50 stocks; most stocks have 30-day volumes far below that.
Let's just assume all 5,000 stocks have a 30-day volume of 3m.
That's ~100,000 trades per day, per stock on average.
That would require 100,000 * 16 bytes per day per stock.
That's 1,600,000 bytes per day per stock.
Or 1.5MiB per day per stock.
556MiB per year per stock.
For the entire exchange (of 5,000 stocks) that's 7.5GiB/day.
Or 2.7TB/year.
When using deltas instead of absolute values, then the storage space requirements are halved to ~278MiB/year per stock, or 1.39TB/year for the entire exchange.
In practice, historical information would likely be archived and compressed (likely using a column-major approach to make it more amenable to good compression with general-purpose compression schemes; and if data is grouped by symbol, that shaves off another 4 bytes).
Even without compression, partitioning by symbol and using deltas means needing only around 870GB/year for the entire exchange.
That's small enough to fit on a $40 HDD from Amazon.
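For reference, the figures above can be reproduced with a quick back-of-the-envelope calculation (assuming, as above, 5,000 symbols, 100,000 trades per day per symbol, and a 365-day year):

#include <stdio.h>

int main(void) {
    const double trades_per_day = 100000.0;
    const double symbols        = 5000.0;
    const double days           = 365.0;
    const double MiB = 1024.0 * 1024.0;
    const double GiB = 1024.0 * MiB;
    const double TiB = 1024.0 * GiB;

    double full  = 16.0 * trades_per_day;  /* absolute values, 16-byte rows  */
    double delta =  8.0 * trades_per_day;  /* delta records,    8-byte rows  */
    double part  =  5.0 * trades_per_day;  /* deltas w/o symbol, 5-byte rows */

    printf("per stock per day,  16-byte rows: %.1f MiB\n", full / MiB);
    printf("per stock per year, 16-byte rows: %.0f MiB\n", full * days / MiB);
    printf("exchange per year,  16-byte rows: %.2f TiB\n", full * days * symbols / TiB);
    printf("exchange per year,   8-byte rows: %.2f TiB\n", delta * days * symbols / TiB);
    printf("exchange per year,   5-byte rows: %.0f GiB\n", part * days * symbols / GiB);
    return 0;
}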
Are there specific Databases that manage this type of stuff?
Undoubtedly, but I don't think they'd need to optimize for storage-space specifically - more likely write-performance and security.
They use big data architectures like Kappa and Lambda, where data is processed in both near-real-time and batch pipelines. In this case, live second data is "stored" in a messaging engine like Apache Kafka and then retrieved, processed, and ingested into databases with stream processing engines like Apache Spark Streaming.
They often don't use RDBMS databases like MySQL, SQL Server and so forth to store the data; instead they use NoSQL data storage, or formats like Apache Avro or Apache Parquet stored in buckets like AWS S3 or Google Cloud Storage, properly partitioned to improve performance.
A full example can be found here: Streaming Architecture with Apache Spark and Kafka
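As a rough idea of the "live data lands in a messaging engine first" part, here is a minimal producer sketch using librdkafka (the C client for Kafka); the broker address, topic name and JSON payload are placeholders, and error handling is kept to a bare minimum:

#include <librdkafka/rdkafka.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    char errstr[512];

    /* Point the producer at a (placeholder) local broker. */
    rd_kafka_conf_t *conf = rd_kafka_conf_new();
    if (rd_kafka_conf_set(conf, "bootstrap.servers", "localhost:9092",
                          errstr, sizeof(errstr)) != RD_KAFKA_CONF_OK) {
        fprintf(stderr, "%s\n", errstr);
        return 1;
    }

    rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_PRODUCER, conf, errstr, sizeof(errstr));
    if (!rk) {
        fprintf(stderr, "%s\n", errstr);
        return 1;
    }

    /* One "live second" tick; a real feed would call this per update. */
    const char *payload = "{\"symbol\":\"MSFT\",\"price_cents\":32767}";
    rd_kafka_producev(rk,
                      RD_KAFKA_V_TOPIC("ticks"),
                      RD_KAFKA_V_VALUE((void *)payload, strlen(payload)),
                      RD_KAFKA_V_MSGFLAGS(RD_KAFKA_MSG_F_COPY),
                      RD_KAFKA_V_END);

    rd_kafka_flush(rk, 10 * 1000);  /* wait up to 10 s for delivery */
    rd_kafka_destroy(rk);
    return 0;
}

A Spark Streaming (or other stream processing) job would then consume that topic and write the processed results to the downstream store.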
I'm going to improve OCL kernel performance and want to clarify how memory transactions work and what memory access pattern is really better (and why).
The kernel is fed with vectors of 8 integers which are defined as an array: int v[8]. That means the entire vector must be loaded into GPRs before doing any computation. So, I believe the bottleneck of this code is the initial data load.
First, I consider some theory basics.
Target HW is a Radeon RX 480/580, which has a 256-bit GDDR5 memory bus on which a burst read/write transaction has 8-word granularity; hence, one memory transaction reads 2048 bits or 256 bytes. That, I believe, is what CL_DEVICE_MEM_BASE_ADDR_ALIGN refers to:
Alignment (bits) of base address: 2048.
Thus, my first question: what is the physical sense of the 128-byte cache line? Does it keep the portion of data fetched by a single burst read but not actually requested? What happens with the rest if we requested, say, 32 or 64 bytes, so the leftover exceeds the cache line size? (I suppose it will just be discarded - then which part: head, tail...?)
Now back to my kernel: I think that cache does not play a significant role in my case because one burst reads 64 integers -> one memory transaction can theoretically feed 8 work items at once, there is no extra data to read, and memory is always coalesced.
But still, I can place my data with two different access patterns:
1) contiguous
a[i] = v[get_global_id(0) * get_global_size(0) + i];
(which is actually performed as)
*(int8*)a = *(int8*)v;
2) interleaved
a[i] = v[get_global_id(0) + i * get_global_size(0)];
I expect contiguous would be faster in my case because, as said above, one memory transaction can completely stuff 8 work items with data. However, I do not know how the scheduler in the compute unit physically works: does it need all data to be ready for all SIMD lanes, or would just the first portion for 4 parallel SIMD elements be enough? Nevertheless, I suppose it is smart enough to fully provide at least one CU with data first, since CUs may execute command flows independently.
While in second case we need to perform 8 * global_size / 64 transactions to get a complete vector.
So, my second question: is my assumption right?
Now, the practice.
Actually, I split the entire task into two kernels because one part has less register pressure than the other and can therefore employ more work items. So first I played with the pattern in which the data is stored in the transition between kernels (using vload8/vstore8 or casting to int8 gives the same result), and the result was somewhat strange: the kernel that reads data in a contiguous way works about 10% faster (both in CodeXL and by OS time measuring), but the kernel that stores data contiguously performs surprisingly slower. The overall time for the two kernels is then roughly the same. In my thoughts, both should behave at least the same way - either be slower or be faster - but these inverse results seemed unexplainable.
And my third question is: can anyone explain such a result? Or maybe I am doing something wrong? (Or completely wrong?)
Well, this doesn't really answer all my questions, but some information found in the vastness of the internet puts things together in a clearer way, at least for me (unlike the above-mentioned AMD Optimization Guide, which seems unclear and sometimes confusing):
«the hardware performs some coalescing, but it's complicated...
memory accesses in a warp do not necessarily have to be contiguous, but it does matter how many 32 byte global memory segments (and 128 byte l1 cache segments) they fall into. the memory controller can load 1, 2 or 4 of those 32 byte segments in a single transaction, but that's read through the cache in 128 byte cache lines.
thus, if every lane in a warp loads a random word in a 128 byte range, then there is no penalty; it's 1 transaction and the reading is at full efficiency. but, if every lane in a warp loads 4 bytes with a stride of 128 bytes, then this is very bad: 4096 bytes are loaded but only 128 are used, resulting in ~3% efficiency.»
So, for my case it does not really matter how the data is read/stored as long as it is always contiguous, but the order in which the parts of the vectors are loaded may affect the subsequent command flow (re)scheduling by the compiler.
I can also imagine that the newer GCN architecture can do cached/coalesced writes, which is why my results differ from those suggested by that Optimization Guide.
Have a look at chapter 2.1 in the AMD OpenCL Optimization Guide. It focuses mostly on older-generation cards, but the GCN architecture did not change completely, so it should still apply to your device (Polaris).
In general, AMD cards have multiple memory controllers to which memory requests are distributed every clock cycle. If you, for example, access your values in column-major instead of row-major logic, your performance will be worse because the requests are sent to the same memory controller. (By column-major I mean that a column of your matrix is accessed together by all the work-items executed in the current clock cycle; this is what you refer to as coalesced vs. interleaved.) If you access one row of elements (meaning coalesced) in a single clock cycle (meaning all work-items access values within the same row), those requests should be distributed to different memory controllers rather than the same one.
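To make the two patterns concrete, here is an illustrative pair of OpenCL C kernels (not the asker's actual code; the summation is made up for the example). The first corresponds to the question's "interleaved" pattern, i.e. the row/coalesced access described above: in each loop iteration, neighbouring work-items read neighbouring ints. The second corresponds to the "contiguous" pattern, where each work-item reads its own block of 8 ints; whether that is actually slower depends on whether the compiler turns it into a single int8 vector load, as discussed in the question.

__kernel void read_lane_adjacent(__global const int *v, __global int *out) {
    const size_t gid = get_global_id(0);
    const size_t n   = get_global_size(0);
    int acc = 0;
    for (int i = 0; i < 8; ++i)
        acc += v[gid + (size_t)i * n];  /* per iteration, adjacent lanes hit
                                           adjacent addresses */
    out[gid] = acc;
}

__kernel void read_per_item_block(__global const int *v, __global int *out) {
    const size_t gid = get_global_id(0);
    int acc = 0;
    for (int i = 0; i < 8; ++i)
        acc += v[gid * 8 + i];          /* each work-item walks its own
                                           contiguous block of 8 ints */
    out[gid] = acc;
}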
Regarding alignment and cache line sizes, I'm wondering if this really helps improve performance. If I were in your situation, I would try to see whether I can optimize the algorithm itself, or whether I access the values often enough that it would make sense to copy them to local memory. But then again, it is hard to tell without any knowledge of what your kernels execute.
Best Regards,
Michael
I am using the new CMSensorRecorder with watchOS 2 and have no problems up until I try to get data out of CMRecordedAccelerometerData. I believe I am using fast enumeration to go through the list as recommended in the docs:
for (CMRecordedAccelerometerData* data in list) {
    NSLog(@"Sample: (%f),(%f),(%f) ", data.acceleration.x, data.acceleration.y, data.acceleration.z);
}
However, going through even 60 min of data takes a long time to process (the accelerometer records at 50 Hz, i.e. 50 data points a second). The WWDC video on Core Motion recommends decimating the data to decrease processing time; how can this be implemented? Or am I misunderstanding the intention of using this: are we intended to send the CMSensorDataList to the iPhone for processing? I would like to add each axis to an array on the Apple Watch without iPhone help.
If, for example, I recorded for the maximum of 12 hours, there would be over 2 million data points to read over (12 h * 3600 s * 50 Hz); even in this case, would it be possible for the Apple Watch to read through them?
At the moment it is taking minutes to read through a couple million data points.
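For what it's worth, decimating just means keeping every Nth sample before further processing. A minimal sketch of that idea in plain C (the real code would walk CMSensorDataList in Objective-C or Swift; the types and numbers here are made up for illustration):

#include <stdio.h>
#include <stddef.h>

typedef struct { double x, y, z; } Sample;

/* Keep every Nth sample; e.g. every_nth = 10 turns a 50 Hz stream
   into an effective 5 Hz one. */
static size_t decimate(const Sample *in, size_t count, size_t every_nth,
                       Sample *out /* capacity >= count / every_nth + 1 */) {
    size_t kept = 0;
    for (size_t i = 0; i < count; i += every_nth)
        out[kept++] = in[i];
    return kept;
}

int main(void) {
    Sample in[500] = {{0}};                    /* 10 s of 50 Hz data      */
    Sample out[51];
    size_t kept = decimate(in, 500, 10, out);  /* keep every 10th sample  */
    printf("kept %zu of 500 samples\n", kept);
    return 0;
}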
In the USB specification (Table 5-4) it is stated that, given an isochronous endpoint with a maxPacketSize of 128 bytes, as many as 10 transactions can be done per frame. This gives 128 * 10 * 1000 = 1.28 MB/s of theoretical bandwidth.
At the same time it states
The host must not issue more than 1 transaction in a single frame for a specific isochronous endpoint.
Isn't that contradictory to the aforementioned table?
I've done some tests and found that only 1 transaction is done per frame on my device. I also found on several web sites that just 1 transaction can be done per frame (ms). Of course I assume the spec is the correct reference, so my question is: what could be the cause of receiving only 1 packet per frame? Am I misunderstanding the spec, and are what I think are transactions actually something else?
The host must not issue more than 1 transaction in a single frame for a specific isochronous endpoint.
Assuming USB Full Speed, you could still have 10 isochronous 128-byte transactions per frame by using 10 different endpoints.
Table 5-4 seems to omit the calculations from chapter 5.6.4, "Isochronous Transfer Bus Access Constraints". The 90% rule reduces the maximum number of 128-byte isochronous transactions to nine.
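The arithmetic behind those two numbers, assuming the 1500-byte full-speed frame (12 Mbit/s over 1 ms) and the 9 bytes of per-transaction protocol overhead that Table 5-4 itself uses:

#include <stdio.h>

int main(void) {
    const double frame_bytes   = 12e6 / 8.0 / 1000.0;  /* 1500 bytes per 1 ms frame   */
    const double per_txn_bytes = 128.0 + 9.0;          /* payload + protocol overhead */

    printf("max transactions per frame:  %d\n",
           (int)(frame_bytes / per_txn_bytes));         /* 10, as in Table 5-4 */
    printf("with the 90%% rule (5.6.4):  %d\n",
           (int)(frame_bytes * 0.9 / per_txn_bytes));   /* 9 */
    printf("bandwidth at 9 transactions: %.0f kB/s\n",
           9.0 * 128.0);                                /* bytes per 1 ms frame = kB/s */
    return 0;
}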
In the MS-assisted case, it is the GPS receiver which sends the measurements for the SLP to calculate the position and send it back. I understand the measurements that are sent include the Ephemeris, Iono, DGPS etc. plus the Doppler shift. Please let me know if my understanding is right.
Does the SET send the code (the entire data transmitted by the satellites) as is, or does it split it into the above components and send those?
All the assistance information in SUPL is encapsulated using the RRLP protocol (Radio Resource Location services (LCS) Protocol, for GSM), RRC (Radio Resource Control, for UMTS), TIA-801 (for CDMA2000) or LPP (LTE Positioning Protocol, for LTE). I'm just looking at the RRLP standard, ETSI TS 101 527. The following part sounds interesting:
A.3.2.5 GPS Measurement Information Element
The purpose of the GPS Measurement Information element is to provide
GPS measurement information from the MS to the SMLC. This information
includes the measurements of code phase and Doppler, which enables the
network-based GPS method where position is computed in the SMLC. The
proposed contents are shown in table A.5 below, and the individual
fields are described subsequently.
In a subsequent section it is defined as:
reference frame - optional, 16 bits - the frame number of the last measured burst from the reference BTS modulo 42432
GPS TOW (time of week) - mandatory, 24 bits, unit of 1ms
number of satellites - mandatory, 4 bits
Then for each satellite the following set of data is transmitted:
satellite ID - 6 bits
C/No - 6 bits
Doppler shift - 16 bits, 0.2Hz unit
Whole Chips - 10 bits
Fractional Chips - 10 bits
Multipath Indicator - 2 bits
Pseudorange Multipath Error - 3+3 bits (mantissa/exponent)
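Adding up the field widths above gives a feel for how small the MS-assisted measurement set actually is (a quick calculation, ignoring whatever ASN.1/PER framing overhead RRLP adds on top):

#include <stdio.h>

int main(void) {
    const int header_bits  = 16 + 24 + 4;                   /* ref. frame + GPS TOW + sat count */
    const int per_sat_bits = 6 + 6 + 16 + 10 + 10 + 2 + 6;  /* = 56 bits per satellite          */

    for (int sats = 4; sats <= 12; sats += 4) {
        int total_bits = header_bits + sats * per_sat_bits;
        printf("%2d satellites: %3d bits (~%d bytes)\n",
               sats, total_bits, (total_bits + 7) / 8);
    }
    return 0;
}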
I'm not familiar enough with GPS operation to understand all the parameters, but as far as I understand:
C/No is simply the carrier-to-noise (signal-to-noise) ratio
Doppler shift - gives the frequency shift for a given satellite, obviously
Whole/Fractional Chips together give the phase (and thus satellite distance)
My understanding is that things like almanac, ephemeris, Iono, DGPS are all known on the network side. As far as I know those things are transferred from network to MS in MS-based mode.
Hope that helps.
Measurements collected from MS-assisted location requests include:
Satellite ID
code phase - whole chips
code phase - fractional chips
Doppler
Signal strength
Multipath indicator
pseudorange RMS indicator
In addition, the GPS time of measurement is also provided as one value (in milliseconds), giving the time at which all measurements are valid.
In practice, the required fields that need to be accurate and correct are:
Satellite ID
code phase - whole chips
code phase - fractional chips
Doppler
The code phase values for each satellite are almost always used for the most accurate location calculation. Doppler values can be used to estimate a rough location but aren't usually accurate enough to really contribute to the final solution.
The other values for signal strength, multipath indication, and RMS indicator usually vary in meaning so much between vendors that they don't really provide much benefit for the position calculation. They would normally be used for things like weighting other values so that good satellites count more in the final position.
The network already knows (or should know) the ephemeris and ionospheric model. They are not measurements collected by the handset.