When using streaming inserts to load data into BigQuery, is the data size calculated from the JSON size or from the data size in BigQuery storage?
I'm asking because the JSON might be three times bigger than the actual data it contains.
I've read all the Google documentation I could find without getting a clear answer.
Streaming inserts are billed according to BigQuery's internal data sizes (the same per-type sizes used for storage), not the size of the JSON you send. As per the official Google docs:
Storage pricing is based on the amount of data stored in your tables, which we calculate based on the types of data you store. Streaming Inserts $0.05 per GB, with individual rows calculated using a 1 KB minimum size.
https://cloud.google.com/bigquery/pricing#storage
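To make the 1 KB per-row minimum concrete, here is a small sketch (not official pricing code) of how the billed size works out, using the $0.05/GB rate quoted above; the row counts and sizes are illustrative assumptions.

```python
# Hedged sketch: estimate streaming-insert cost given a per-row size computed
# from BigQuery's data-type sizes (not the JSON payload size).
def streaming_insert_cost_usd(row_size_bytes: int, num_rows: int,
                              price_per_gb: float = 0.05) -> float:
    billed_per_row = max(row_size_bytes, 1024)        # 1 KB minimum per row
    billed_gb = billed_per_row * num_rows / 1024 ** 3  # pricing GB = 2^30 bytes
    return billed_gb * price_per_gb

# Illustrative example: 10 million rows that are 200 bytes each in BigQuery's
# representation are billed as if they were 1 KB each.
print(streaming_insert_cost_usd(200, 10_000_000))  # ~$0.48 vs ~$0.09 without the minimum
```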
I have an app where I receive 300m JSON text files (10m daily, retention = 30 days) from a Kafka topic.
The data it contains needs to be aggregated every day based on different properties.
We would like to build it with Apache Spark on Azure Databricks, because the size of the data will grow, we can no longer scale this process vertically (it currently runs on one Postgres server), and we also need something that is cost-effective.
Having this job in Apache Spark is straightforward in theory, but I haven't found any practical advice on how to process JSON objects efficiently.
These are the options as I see them:
Store the data in Postgres and ingest it with the Spark job (SQL) - may be slow to transfer the data
Store the data in Azure Blob Storage in JSON format - we may hit limits on the number of files that can be stored, and reading that many small files also seems inefficient
Store the JSON data in big chunks, e.g. 100,000 JSON objects per file - it could be slow to delete/reinsert when the data changes
Convert the data to CSV or some binary format with a fixed structure and store it in big chunks in blob storage - changing the format would be a challenge, but that would rarely happen, and CSV/binary is quicker to parse
Any practical advice would be really appreciated. Thanks in advance.
There are multiple factors to consider:
If you need to read the data daily, I strongly suggest storing it in Parquet format in Databricks. If you are not accessing it daily, keep it in Azure storage instead (the computation cost will be lower).
If the JSON data needs to be flattened, do all the data manipulations in Spark and write the result into Delta tables, running OPTIMIZE to compact the files, as sketched below.
If the 30-day retention is really mandatory, be careful with the file format, because the data will grow quickly day by day. Otherwise, alter the table properties to set the retention period to 7 or 15 days.
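As a rough illustration of the Parquet/Delta suggestion above, here is a minimal PySpark sketch, assuming a Databricks/Delta environment; the storage path, column names, and aggregation keys are placeholders, not taken from the question.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # provided automatically on Databricks

# Read one day's JSON documents from blob storage (the path is a placeholder).
raw = spark.read.json("abfss://events@myaccount.dfs.core.windows.net/ingest/2024-01-01/")

# Flatten the nested JSON and keep only the properties needed for aggregation
# (these column names are assumptions for illustration).
flat = raw.select(
    F.col("event_id"),
    F.to_date("timestamp").alias("event_date"),
    F.col("payload.property_a").alias("property_a"),
    F.col("payload.property_b").alias("property_b"),
)

# Daily aggregation by the chosen properties.
daily = flat.groupBy("event_date", "property_a", "property_b").count()

# Append to a Delta table partitioned by day, then compact small files.
(daily.write.format("delta")
      .mode("append")
      .partitionBy("event_date")
      .saveAsTable("analytics.daily_events"))

spark.sql("OPTIMIZE analytics.daily_events")  # Delta/Databricks file compaction
```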
I have a BigQuery table with the following properties:
Table size: 1.64 TB
Number of rows: 9,883,491,153
The data is put there using streaming inserts (in batches of 500 rows each).
According to the Google Cloud Pricing Calculator, the cost of these inserts so far should be roughly $86.
But in reality, it turns out to be around $482.
The explanation is in the pricing docs:
Streaming inserts (tabledata.insertAll): $0.010 per 200 MB (You are charged for rows that are successfully inserted. Individual rows are calculated using a 1 KB minimum size.)
So, in the case of my table, each row is just 182 bytes on average, but I am charged for the full 1,024 bytes per row, resulting in ~562% of the originally (incorrectly) estimated cost.
Is there a canonical (and of course legal) way to improve the situation, i.e., reduce cost? (Something like inserting into a temp table with just one array-of-struct column, to hold multiple rows in a row, and then split-moving regularly into the actual target table?)
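A hedged sketch of that split-move idea, assuming a hypothetical staging table with a single ARRAY-of-STRUCT column named packed_rows; every project, table, and column name here is a placeholder.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Periodically flatten the packed rows from the staging table into the real
# target table (all dataset/table/column names are placeholders).
split_move = """
INSERT INTO `my-project.my_dataset.events` (ts, user_id, payload)
SELECT r.ts, r.user_id, r.payload
FROM `my-project.my_dataset.events_staging` AS s, UNNEST(s.packed_rows) AS r
WHERE s.inserted_at < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
"""
client.query(split_move).result()

# Cleaning up the staging table afterwards needs care: rows still in the
# streaming buffer cannot be deleted with DML, so the staging table is
# typically day-partitioned and old partitions are dropped instead.
```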
I can suggest these options:
Use the BigQuery Storage Write API. You can stream records into BigQuery so they become available for reads as they are written, or you can batch a large number of records and commit them in a single operation.
Some advantages are:
Lower cost because you have 2 TB per month free.
It supports exactly-once semantics through the use of stream offset.
If a table schema changes while a client is streaming, the BigQuery Storage Write API notifies the client.
Here is more information about BigQuery Storage Write.
Another option: you could use Beam/Dataflow to batch the records and write them into BigQuery with BigQueryIO, using the batch (file loads) write method, as sketched below.
You can see more information here.
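A minimal Apache Beam (Python) sketch of that batch-load approach, assuming hypothetical table and schema names; WriteToBigQuery with FILE_LOADS stages the rows in files and issues load jobs rather than per-row streaming inserts.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder values; replace with your own project, dataset and schema.
TABLE = "my-project:my_dataset.events"
SCHEMA = "ts:TIMESTAMP,user_id:STRING,payload:STRING"

with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | "Read" >> beam.Create([{"ts": "2024-01-01 00:00:00", "user_id": "u1", "payload": "{}"}])
        | "Write" >> beam.io.WriteToBigQuery(
            TABLE,
            schema=SCHEMA,
            # FILE_LOADS batches rows into files and runs load jobs instead of
            # streaming inserts, which avoids the 1 KB per-row minimum charge.
            method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```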
We have a requirement to store website analytical data (think: views on a page, interactions, etc). Note: this is separate from Google Analytics data, as we want to own the data and enrich it as we see fit.
Storage requirements:
each 'event' will have a timestamp, event type and some other metadata (user id, etc)
the storage is append only. No updates or deletes
writes are consistent, but not IoT scale: maybe 50/sec
estimating growth of about 100 million rows a year
Query requirements:
graphing data cumulatively over a period of time
slice/filter data by all the metadata as well as day/week/month/year slices
will likely need to be integrated into a larger data warehouse
Question: Is this a no-brainer for a time series DB like InfluxDB, or can I get away with a well-tuned SQL Server table?
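As a quick back-of-the-envelope check on the volumes above (a sketch only; the ~100-byte average row size is an assumed figure, not from the question):

```python
# Rough sizing: 100 million rows/year at an assumed ~100 bytes per event row.
rows_per_year = 100_000_000
assumed_bytes_per_row = 100          # assumption: timestamp + event type + small metadata
writes_per_sec = 50

gb_per_year = rows_per_year * assumed_bytes_per_row / 1024 ** 3
print(f"~{gb_per_year:.1f} GB/year of raw row data at {writes_per_sec} writes/sec")
# => on the order of ~10 GB/year, a fairly modest volume for either option;
#    the harder part is the ad-hoc slicing and cumulative queries.
```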
Given a 1-terabyte data set that arrives from the sources as a couple hundred CSV files and divides naturally into two large tables, what's the best way to store the data in Google Cloud Storage? Partitioning by date does not apply, since the data is relatively static and only updated quarterly. Is it best to combine all of the data into two large files and map each to a BigQuery table? Is it better to partition? If so, on what basis? Is there a threshold file size above which BigQuery performance degrades?
Depending on the use case:
To query data => then load it into BigQuery from GCS.
To store the data => leave it in GCS.
Question: "I want to query and have created a table in BiqQuery, but with only a subset of the data totaling a few GB. My question is if I have a TB of data should I keep it in one giant file GCS or should I split it up?"
Answer: Just load it all into BigQuery. BigQuery eats TB's for breakfast.
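For the "load it into BigQuery" path, here is a minimal sketch with the Python client, assuming hypothetical bucket and dataset names; a wildcard URI lets a single load job pick up all of a table's CSV files at once.

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,     # assumes each CSV has a header row
    autodetect=True,         # or pass an explicit schema instead
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

# One load job can ingest many files via a wildcard; paths are placeholders.
load_job = client.load_table_from_uri(
    "gs://my-bucket/table_a/*.csv",
    "my-project.my_dataset.table_a",
    job_config=job_config,
)
load_job.result()  # waits for the load job to finish
```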
Do BigQuery datasets have a maximum size (GB of inserted data)?
I can't find an answer to this in the BigQuery documentation. The quota policy page talks about the maximum size of uploaded files and the max number of load jobs per day, but does not specify a maximum size per dataset or table.
I need to know how much data I can upload to a dataset for academic research.
Thanks
"Unlimited" is a big word, but one of the strong points about BigQuery is you shouldn't be able to find the limit.
The daily load limit is almost 10 petabytes per project. If you have more data than that, just keep pushing the next day; it should not break.
https://developers.google.com/bigquery/docs/quota-policy#import
(Before doing a multi-terabyte import, it's a good idea to contact sales, to make sure there is physical capacity ready to handle the theoretical limits)