Why doesn't a query against the streaming buffer get billed? - google-bigquery

When a query references only data in the streaming buffer, it returns the correct results but doesn't show any bytes billed. Any links that explain or confirm this behavior would be helpful.

Last year, it was confirmed to us that data accumulated in the streaming buffer is not counted towards "bytes processed" and is therefore not billed.
It's unknown whether the BigQuery team will change this, but since data usually doesn't stay in the buffer for long (typically less than 90 minutes, and much less if you stream at a high rate), this behavior may well stay as it is for some time.
Just keep in mind that once the data is moved to permanent storage, all subsequent queries reading it will report bytes processed and be billed accordingly.

Related

How do databases store live second data?

So what I mean by live second data is something like the stock market, where every second new data is recorded for each specific stock item.
How would the data look in the database? Does it have a timestamp for each second? If so, wouldn't that cause the database to fill up quickly? Are there specific databases that manage this type of thing?
Thank you!
Given the sheer amount of money that gets thrown around in fintech, I'd be surprised if trading platforms even use traditional RDBMS databases to store their trading data, but I digress...
How would the data look in the database?
(Again, assuming they're even using a relational model in the first place) then something like this in SQL:
CREATE TABLE SymbolPrices (
Symbol char(4) NOT NULL, -- 4 bytes, or even 3 bytes given that an A-Z character only needs 5 bits (32 possible values).
Utc datetime NOT NULL, -- 8-byte timestamp (nanosecond precision)
Price int NOT NULL -- Assuming integer cents (not 4 decimal digits), that's 4 bytes
)
...which has a fixed row length of 16 bytes.
Does it have a timestamp of each second?
It can do, but not per second - you'd need far greater granularity than that: I wouldn't be surprised if they were using at least 100-nanosecond resolution, which is a common unit for computer system clock "ticks" (e.g. .NET's DateTime.Ticks is a 64-bit integer value of 100-nanosecond units). Java and JavaScript both use milliseconds, though this resolution might be too coarse.
Storage space requirements for changing numeric values can always be significantly optimized if you store deltas instead of absolute values. I reckon it could come down to 8 bytes per record:
I reason that 3 bytes is sufficient to store trade timestamp deltas at ~1.5ms resolution, assuming 100,000 trades per day per stock: that's 16.7m values to represent a 7-hour (25,200s) trading window.
Price deltas could also likely be reduced to a 2-byte value (-$327.68 to +$327.67).
And assuming symbols never exceed 4 uppercase Latin characters (A-Z), they can be represented in 3 bytes.
That gives an improved fixed row length of 8 bytes (3 + 3 + 2).
You would now need to store "keyframe" data every few thousand rows, though, to avoid having to replay every trade from the very beginning to get the current price.
If data is physically partitioned by symbol (i.e. using a separate file on disk for each symbol) then you don't need to include the symbol in the record at all, bringing the row length down to a mere 5 bytes.
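The 8-byte layout above can be sketched in Python as follows. This is only an illustration under stated assumptions: the field order, the base-27 symbol packing, and the helper names are hypothetical, and the 3-byte time field is treated as an offset from the session open, quantised into 2^24 ticks over the 25,200-second window.

```python
import struct

TICKS_PER_SESSION = 1 << 24      # 3-byte time field: 16,777,216 distinct values
SESSION_SECONDS = 7 * 60 * 60    # 25,200 s trading window (from the text above)


def encode_symbol(symbol: str) -> int:
    """Pack 1-4 uppercase A-Z letters into a base-27 integer (0 = padding)."""
    if not 1 <= len(symbol) <= 4 or not all("A" <= c <= "Z" for c in symbol):
        raise ValueError("symbol must be 1-4 uppercase A-Z letters")
    code = 0
    for ch in symbol.ljust(4):
        code = code * 27 + (0 if ch == " " else ord(ch) - ord("A") + 1)
    return code  # at most 27**4 - 1 = 531,440, which fits in 3 bytes


def decode_symbol(code: int) -> str:
    """Inverse of encode_symbol."""
    chars = []
    for _ in range(4):
        code, digit = divmod(code, 27)
        chars.append("" if digit == 0 else chr(digit - 1 + ord("A")))
    return "".join(reversed(chars))


def pack_record(symbol: str, seconds_since_open: float, price_delta_cents: int) -> bytes:
    """Build one 8-byte record: 3-byte symbol, 3-byte tick, 2-byte signed delta."""
    tick = int(seconds_since_open * TICKS_PER_SESSION / SESSION_SECONDS)
    if not 0 <= tick < TICKS_PER_SESSION:
        raise ValueError("timestamp outside the trading window")
    return (
        encode_symbol(symbol).to_bytes(3, "big")
        + tick.to_bytes(3, "big")
        + struct.pack(">h", price_delta_cents)  # raises if outside -32768..32767
    )


def unpack_record(record: bytes):
    """Return (symbol, seconds_since_open, price_delta_cents)."""
    symbol = decode_symbol(int.from_bytes(record[0:3], "big"))
    tick = int.from_bytes(record[3:6], "big")
    (price_delta_cents,) = struct.unpack(">h", record[6:8])
    return symbol, tick * SESSION_SECONDS / TICKS_PER_SESSION, price_delta_cents
```

Dropping the first three bytes gives the 5-byte per-symbol variant. A reader would still need the most recent keyframe price to turn the deltas back into absolute prices.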
If so, wouldn't that cause the database to quickly fill up?
No, not really (at least assuming you're using HDDs made since the early 2000s); consider that:
Major stock-exchanges really don't have that many stocks, e.g. NASDAQ only has a few thousand stocks (5,015 apparently).
While high-profile stocks (AAPL, AMD, MSFT, etc.) typically have 30-day sales volumes on the order of 20-130m, that's only the most popular ~50 stocks; most stocks have 30-day volumes far below that.
Let's just assume all 5,000 stocks have a 30-day volume of 3m.
That's ~100,000 trades per day, per stock on average.
That would require 100,000 * 16 bytes per day per stock.
That's 1,600,000 bytes per day per stock.
Or 1.5MiB per day per stock.
556MiB per year per stock.
For the entire exchange (of 5,000 stocks) that's 7.5GiB/day.
Or about 2.7TiB/year (roughly 2.9TB).
When using deltas instead of absolute values, the storage space requirements are halved to ~278MiB/year per stock, or about 1.3TiB/year for the entire exchange.
In practice, historical information would likely be archived and compressed (probably using a column-major approach to make it more amenable to good compression with general-purpose compression schemes, and if data is grouped by symbol then that shaves off another 4 bytes).
Even without compression, partitioning by symbol and using deltas means needing only around 850GiB/year (roughly 0.9TB) for the entire exchange.
That's small enough to fit onto a $40 HDD from Amazon.
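The back-of-the-envelope arithmetic above is easy to reproduce. The sketch below uses the same assumptions as the answer (100,000 trades per day per stock, 5,000 stocks, 365 days); the function name and defaults are illustrative only. Printing both binary (TiB) and decimal (TB) units shows how much the totals depend on which unit is meant.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4


def yearly_bytes(row_bytes, trades_per_day=100_000, stocks=5_000, days=365):
    """Raw storage per year for fixed-length rows, before any compression."""
    return row_bytes * trades_per_day * stocks * days


if __name__ == "__main__":
    # 16 = absolute values, 8 = deltas, 5 = deltas partitioned by symbol
    for row_bytes in (16, 8, 5):
        total = yearly_bytes(row_bytes)
        print(
            f"{row_bytes:>2} bytes/row: "
            f"{total / TIB:.2f} TiB/year ({total / 1e12:.2f} TB/year)"
        )
```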
Are there specific Databases that manage this type of stuff?
Undoubtedly, but I don't think they'd need to optimize for storage-space specifically - more likely write-performance and security.
They use different big data architectures like Kappa and Lambda, where data is processed in both near-real-time and batch pipelines. In this case, live second data is "stored" in a messaging engine like Apache Kafka and is then retrieved, processed, and ingested into databases by stream processing engines like Apache Spark Streaming.
They often don't use RDBMS databases like MySQL, SQL Server and so forth to store the data. Instead they use NoSQL data stores, or formats like Apache Avro or Apache Parquet stored in buckets like AWS S3 or Google Cloud Storage, properly partitioned to improve performance.
A full example can be found here: Streaming Architecture with Apache Spark and Kafka

How to interpret the "query will process" GB figure in BigQuery?

I am using a free trial of Google BigQuery. This is the query that I am using.
select * from `test`.events where subject_id = 124 and id = 256064 and time >= '2166-01-15T14:00:00' and time <='2166-01-15T14:15:00' and id_1 in (3655,223762,223761,678,211,220045,8368,8441,225310,8555,8440)
This query is expected to return at most 300 records and not more than that.
However, I see a message like the one below.
The table this query operates on is really huge, though. Does the message indicate the table size? I also ran this query multiple times a day.
This resulted in the error below:
Quota exceeded: Your project exceeded quota for free query bytes scanned. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors
How long do I have to wait for this error to go away? Is the daily limit 1 TB? If so, I didn't use close to 400 GB.
How can I view my daily usage?
If I can edit the quota, can you let me know which option I should be editing?
Can you help me with the above questions?
According to the official documentation:
"BigQuery charges for queries by using one metric: the number of bytes processed (also referred to as bytes read)", regardless of how large the output is. This means that if you do a count(*) on a 1TB table, you will supposedly be charged $5, even though the final output is tiny.
Note that, due to storage optimizations that BigQuery performs internally, the bytes processed might not equal the raw size of the table when you created it.
For the error you're seeing, go to "IAM & admin" and then "Quotas" in the Google Cloud Console, where you can search for quotas specific to the BigQuery service.
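As a rough illustration of that billing rule, the cost follows from bytes processed alone and ignores the number of rows returned. The sketch below assumes the $5-per-TB figure quoted above and assumes a TB here means 2^40 bytes. Both are assumptions to check against current pricing, and the function name is hypothetical.

```python
TIB = 2 ** 40
PRICE_PER_TIB_USD = 5.00  # assumption: the $5/TB figure quoted in the answer


def estimate_query_cost(bytes_processed, price_per_tib=PRICE_PER_TIB_USD):
    """Estimate the on-demand cost of one query from its bytes processed."""
    return bytes_processed / TIB * price_per_tib


if __name__ == "__main__":
    # A query that scans 400 GiB costs the same whether it returns
    # 300 rows or 300 million rows.
    print(f"${estimate_query_cost(400 * 2 ** 30):.2f}")
```

This is also why a selective WHERE clause does not by itself reduce the figure: on an unpartitioned table the referenced columns are still read in full.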
Hope this helps!
Flavien

What is the maximum LIMIT DURATION in the LAG function in ASA?

I am streaming data from devices and I want to use the LAG function to identify the last value received from a particular device. The data is not streamed at a regular period and in rare cases it could be days between receiving data from a device.
Is there a maximum period for the LIMIT DURATION clause?
Is there any down-side to having long LIMIT DURATION periods?
There is no maximum period for LIMIT DURATION in the language. However, it is limited by the amount of data the input source can hold, e.g. 1 day is the default retention policy for Event Hub (it can be increased in the configuration).
When a job starts, Azure Stream Analytics reads up to LIMIT DURATION worth of data from the source to make sure it has the correct value for LAG at job start time. If the data volume is high, this can increase the job start time.
If you need to use data that is more than several days old, it may make more sense to use it as reference data (which can be updated at daily intervals, for example).
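To make the effect of the duration limit concrete, here is a small Python emulation of "previous value per device, but only if it arrived within the limit". It is a sketch of the semantics only, not of how Azure Stream Analytics implements LAG; the function name, the tuple layout, and the inclusive boundary check are all assumptions.

```python
from datetime import datetime, timedelta


def lag_with_limit_duration(events, limit):
    """Yield (device_id, timestamp, value, previous_value) for each event.

    `events` must be ordered by timestamp. `previous_value` is None when the
    device has no earlier event within `limit` of the current one.
    """
    last_seen = {}  # device_id -> (timestamp, value) of the latest event
    for device_id, ts, value in events:
        previous = last_seen.get(device_id)
        if previous is not None and ts - previous[0] <= limit:
            previous_value = previous[1]
        else:
            previous_value = None
        yield device_id, ts, value, previous_value
        last_seen[device_id] = (ts, value)


if __name__ == "__main__":
    t0 = datetime(2020, 1, 1)
    readings = [
        ("dev-a", t0, 1),
        ("dev-a", t0 + timedelta(hours=1), 2),
        ("dev-a", t0 + timedelta(days=3), 3),  # gap longer than the limit
    ]
    for row in lag_with_limit_duration(readings, timedelta(days=1)):
        print(row)
```

With a one-day limit, the third reading gets no previous value. That is the case where a longer LIMIT DURATION, or reference data, would be needed.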

Measure upload and download speed on iPhone

I would like to measure the upload and download speed of data on an iPhone. Is any API available to achieve this? Is it correct to measure it by dividing the total bytes received by the time taken for the response?
Yes, dividing the total bytes by the time taken is correct; that is exactly what the speed is. If you want to show the download speed continuously, you might want to compute it over small chunks instead, e.g. take 500 bytes and the time it took to download those particular bytes.
To do this you could, for example, use an NSMutableData as a buffer that you empty every 2 seconds or so. Then [bufferData length] / 2 tells you how many bytes per second you received during those 2 seconds. When you empty the buffer, append its contents to the data you are downloading, of course.
There is no direct API to know the speed.
Total data received/sent divided by total time only gives you the average speed. The speed usually varies a lot over time, so if you want a more accurate value, calculate the speed from samples.
(Data transferred in 1 minute) / (60 seconds): use this approach only if you need greater accuracy in the speed calculation. The sampling duration can be changed based on the level of accuracy required.
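The two approaches described above (overall average versus per-window samples) can be sketched independently of any iOS API. The Python below is illustrative only: the class and method names are hypothetical, and timestamps are passed in explicitly so the logic does not depend on a particular clock.

```python
class ThroughputSampler:
    """Track an overall average speed plus fixed-length window samples."""

    def __init__(self, start_time, window_seconds=2.0):
        self.start_time = start_time
        self.window_seconds = window_seconds
        self.window_start = start_time
        self.window_bytes = 0
        self.total_bytes = 0
        self.last_time = start_time
        self.samples = []  # bytes per second, one entry per completed window

    def record(self, num_bytes, now):
        """Call whenever a chunk of `num_bytes` arrives at time `now`."""
        # Close every window that ended before this chunk arrived.
        while now - self.window_start >= self.window_seconds:
            self.samples.append(self.window_bytes / self.window_seconds)
            self.window_bytes = 0
            self.window_start += self.window_seconds
        self.window_bytes += num_bytes
        self.total_bytes += num_bytes
        self.last_time = now

    def average_speed(self):
        """Total bytes divided by total elapsed time, in bytes per second."""
        elapsed = self.last_time - self.start_time
        return self.total_bytes / elapsed if elapsed > 0 else 0.0
```

In a real client, `now` would come from a monotonic clock and `record` would be called from the callback that delivers each chunk of response data.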

SQL connection lifetime

I am working on an API to query a database server (Oracle in my case) to retrieve a massive amount of data. (This is actually a layer on top of JDBC.)
The API I created tries to limit as much as possible the loading of all the queried information into memory. I mean that I prefer to iterate over the result set and process the returned rows one by one instead of loading all the rows into memory and processing them later.
But I am wondering if this is the best practice, since it has some issues:
The result set is kept open during the whole processing. If the processing takes as long as retrieving the data, my result set will be open twice as long.
Doing another query inside my processing loop means opening another result set while I am already using one. It may not be a good idea to open too many result sets simultaneously.
On the other hand, it has some advantages:
I never have more than one row of data in memory for a result set. Since my queries tend to return around 100k rows, it may be worth it.
Since my framework is heavily based on functional programming concepts, I never rely on multiple rows being in memory at the same time.
Starting the processing on the first rows returned while the database engine is still returning other rows is a great performance boost.
In response to Gandalf, I add some more information:
I will always have to process the entire result set
I am not doing any aggregation of rows
I am integrating with a master data management application and retrieving data in order to either validate them or export them using many different formats (to the ERP, to the web platform, etc.)
There is no universal answer. I personally implemented both solutions dozens of times.
This depends on what matters more to you: memory or network traffic.
If you have a fast network connection (LAN) and a poor client machine, then fetch data row by row from the server.
If you work over the Internet, then batch fetching will help you.
You can tune the prefetch count or your database layer's properties to find a happy medium.
The rule of thumb is: fetch everything that you can keep in memory without noticing it.
If you need a more detailed analysis, there are six factors involved:
Row generation response time / rate (how soon Oracle generates the first row / last row)
Row delivery response time / rate (how soon can you get first row / last row)
Row processing response time / rate (how soon can you show first row / last row)
One of them will be the bottleneck.
As a rule, rate and response time are antagonists.
With prefetching, you can control the row delivery response time and row delivery rate: a higher prefetch count will increase the rate but worsen the response time (the first row arrives later), while a lower prefetch count will do the opposite.
Choose which one is more important to you.
You can also do the following: create separate threads for fetching and processing.
Select just enough rows to keep the user amused in low prefetch mode (with a fast response), then switch into high prefetch mode.
It will fetch the rows in the background and you can process them in the background too, while the user browses over the first rows.
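The "separate threads for fetching and processing" idea can be sketched as a producer/consumer pair joined by a bounded queue. The example uses Python's built-in sqlite3 module purely as a stand-in for Oracle/JDBC; the function names are hypothetical, and `batch_size` plays the role of the prefetch count.

```python
import queue
import sqlite3
import threading

_DONE = object()  # sentinel telling the consumer that no more batches follow


def fetch_rows(db_path, sql, out_queue, batch_size):
    """Producer: read the result set in batches and hand them to the queue."""
    try:
        conn = sqlite3.connect(db_path)
        try:
            cursor = conn.execute(sql)
            while True:
                batch = cursor.fetchmany(batch_size)
                if not batch:
                    break
                out_queue.put(batch)  # blocks while the consumer is behind
        finally:
            conn.close()
    finally:
        out_queue.put(_DONE)


def process_in_background(db_path, sql, handle_row, batch_size=100, max_batches=4):
    """Consumer: process rows while the fetcher thread keeps reading ahead."""
    # The bounded queue caps memory use at roughly batch_size * max_batches rows.
    batches = queue.Queue(maxsize=max_batches)
    fetcher = threading.Thread(
        target=fetch_rows, args=(db_path, sql, batches, batch_size)
    )
    fetcher.start()
    processed = 0
    while True:
        batch = batches.get()
        if batch is _DONE:
            break
        for row in batch:
            handle_row(row)
            processed += 1
    fetcher.join()
    return processed
```

A smaller `batch_size` delivers the first rows sooner, while a larger one reduces the number of round trips, which mirrors the response time versus rate trade-off described above.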