BigQuery - Maximum dataset size

Do BigQuery datasets have a maximum size (GB of inserted data)?
I can't find an answer to this in the BigQuery documentation. The quota policy page talks about the maximum size of uploaded files and the maximum number of load jobs per day, but it does not specify a maximum size per dataset or table.
I need to know how much data I can upload to a dataset for academic research.
Thanks

"Unlimited" is a big word, but one of the strong points about BigQuery is you shouldn't be able to find the limit.
The load limit is almost 10 petabytes per project per day. If you have more data than that, just keep pushing the next day; it should not break.
https://developers.google.com/bigquery/docs/quota-policy#import
(Before doing a multi-terabyte import, it's a good idea to contact sales, to make sure there is physical capacity ready to handle the theoretical limits)
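If it helps to keep an eye on how much you have loaded so far, the current size of a dataset can be summed from its __TABLES__ meta-table. A rough sketch, with placeholder project and dataset names:
-- Total logical size and row count of every table in one dataset.
-- 'my_project' and 'my_dataset' are placeholders.
SELECT
  SUM(size_bytes) / POW(2, 30) AS total_gb,
  SUM(row_count)               AS total_rows
FROM `my_project.my_dataset.__TABLES__`;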

Related

Snowflake Compute Capacity

We use Snowflake on AWS and have some "Daily Jobs" that rebuild our data on a daily basis. Some of the jobs take ~2+ hours to run on a Large Snowflake warehouse. Regarding price, according to this document, 2 hours of computation on a Medium warehouse costs the same as 1 hour of computation on a Large warehouse. Should we consider running our jobs on an X-Large warehouse for ~30 minutes, and are there any engineering/SQL-related drawbacks or issues? We want to get our data modelled in less time for the same amount of money.
Thank you in advance!
https://docs.snowflake.com/en/user-guide/credits.html
The way it is supposed to work is that yes, the credit rate doubles, but the computational power also doubles, so the total cost stays the same because the job takes half the time.
This is not always true for things like data loading, where increasing warehouse size does not always mean faster loading (many factors are involved, such as file size).
If your tasks are ordered correctly and you don't have a situation where a previous task must finish first due to a dependency, then I see no reason why you shouldn't do it.
At the very least, you should test it on one of the daily runs, for example:
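A hedged sketch of such a test (the warehouse name DAILY_JOBS_WH is a placeholder; the metering view is Snowflake's standard ACCOUNT_USAGE view, which lags by a few hours):
-- Resize the warehouse for one daily run, then compare elapsed time and credits.
-- 'DAILY_JOBS_WH' is a hypothetical name; use your own warehouse.
ALTER WAREHOUSE DAILY_JOBS_WH SET WAREHOUSE_SIZE = 'XLARGE';

-- ... run the daily rebuild jobs here ...

-- Inspect credits consumed by this warehouse.
SELECT WAREHOUSE_NAME, START_TIME, END_TIME, CREDITS_USED
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
WHERE WAREHOUSE_NAME = 'DAILY_JOBS_WH'
ORDER BY START_TIME DESC;

-- Scale back when the test is done.
ALTER WAREHOUSE DAILY_JOBS_WH SET WAREHOUSE_SIZE = 'LARGE';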

Does Google BigQuery charge by processing time?

So, this is somewhat of a "realtime" question. I ran a query and it has currently been running for almost 12K seconds, but it told me "This query will process 239.3 GB when run." Does BQ charge by time in addition to the data processed? Should I stop this now?
I assume you are using the on-demand pricing model - you are billed based on the amount of bytes processed, and processing time is not involved.
BigQuery uses a columnar data structure. You're charged according to the total data processed in the columns you select, and the total data per column is calculated based on the types of data in the column. For more information about how your data size is calculated, see data size calculation.
You aren't charged for queries that return an error, or for queries that retrieve results from the cache.
Charges are rounded to the nearest MB, with a minimum 10 MB data processed per table referenced by the query, and with a minimum 10 MB data processed per query.
Cancelling a running query job may incur charges up to the full cost for the query were it allowed to run to completion.
When you run a query, you're charged according to the data processed in the columns you select, even if you set an explicit LIMIT on the results.
See more at BigQuery pricing
For on-demand queries you are charged for the amount of data processed by the BigQuery engine; only for particularly expensive (high-compute) queries are you charged extra for the complexity (which can manifest itself as increased query time).
The amount of data processed is reflected in totalBytesProcessed, and also in totalBytesBilled, which is the same for ordinary queries. For complex/expensive queries the extra charge for complexity shows up as totalBytesBilled becoming larger than totalBytesProcessed.
More details: see this link
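If you want to check this on your own jobs, the two values are exposed in the jobs metadata views. A sketch, assuming your project runs in the US multi-region (adjust the region qualifier otherwise):
-- Compare bytes processed vs. bytes billed for recent query jobs in this project.
SELECT
  job_id,
  creation_time,
  total_bytes_processed,
  total_bytes_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY total_bytes_billed DESC;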

Is there an option to limit the number of columns in a sink export into BigQuery?

I created a sink export to load audit logs into BigQuery. However, there are a large number of columns that I don't need from the audit log. Is there a way to pick and choose the columns in the sink export?
We need to define our reason for wanting to reduce the number of columns. My thinking is that you are concerned about costs. If we look at active storage, we find that the current price is $0.02/GB, with the first 10 GB free each month. If the data is untouched for 90 days, that storage cost drops to $0.01/GB. Next we have to estimate how much storage is used for recording all columns for a month vs. recording just the columns you want to keep. If we can make some projections, then we can make a call on how much the cost might change if we reduced storage usage. What we will want to estimate is the number of log records exported per month and the size of the average log record written as-is today vs. a log record with only the minimally needed fields.
If we do find a distinction that makes a significant cost saving, one further thought would be to export the log entries to Pub/Sub and have them trigger a Cloud Function. However, I suspect we might end up finding that the savings on BQ storage are then lost to the cost of Pub/Sub and Cloud Functions (and possibly BQ streaming inserts).
Another thought might be to realize that the BQ log records are written to tables named by "day". We could have a batch job that runs after a day's worth of records are written and copies only the columns of interest to a new table, as sketched below. Again, we will have to watch that we don't end up with higher costs elsewhere in our attempt to reduce storage costs.
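A rough sketch of that daily batch copy; the table suffix and the selected fields are illustrative and depend on your sink's export schema:
-- Keep only the audit-log columns of interest from one day's export table.
-- Table and field names are illustrative; adjust them to your own schema.
CREATE TABLE IF NOT EXISTS my_dataset.activity_slim_20190101 AS
SELECT
  timestamp,
  resource.type AS resource_type,
  protopayload_auditlog.methodName AS method_name,
  protopayload_auditlog.authenticationInfo.principalEmail AS principal_email
FROM my_dataset.cloudaudit_googleapis_com_activity_20190101;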

At what scale of data is the ROI of partitioning the most valuable?

So I'm looking into data warehousing and partitioning, and am very curious as to what scale makes the most sense for partitioning data on a key (for instance, SaleDate).
Tutorials often mention that you're trying to break it down into logical chunks so as to make updating the data less likely to cause service disruptions.
So let's say I'm a medium scale company working in a given US state. I do a lot of work in relation to SaleDate, often tens of thousands of transactions a day (with requisite transaction details, 4-50 each?), and have about 5 years of data. I would like to query and build trend information off of that; for instance:
On a yearly basis to know what items are becoming less popular over time.
On a monthly basis to see what items get popular at a certain time of year (ice in summer)
On a weekly basis to see how well my individual stores are doing
On a daily basis to observe theft trends or something
Now my business unit also wants to query that data, but I'd like to be able to keep it responsive.
How do I know that it would be best to partition on Year, Month, Week, Day, etc for this data set? Is it just whatever I actually observe as providing the best response time by testing out each scenario? Or is there some kind of scale that I can use to understand where my partitions would be the most efficient?
Edit: I, personally, am using SQL Server 2012. But I'm curious as to how others view this question in relation to the core concept rather than the implementation (unless this isn't one of those cases where you can do so).
Things to consider:
What type of database are you using? This is really important; there are different strategies for Oracle vs. SQL Server vs. IBM, etc.
Sample queries and run times. Partition usage depends on the conditions in your WHERE clause - what are you filtering on?
Does it make sense to create/use aggregate tables? Seems like a monthly aggregate would save you some time (see the sketch after this list).
There are lots of options based on the hardware and storage available to you; more details are needed to make a more specific recommendation.
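For instance, a monthly aggregate could be built along these lines - a sketch only, in SQL Server syntax since that's what the question mentions, with hypothetical table and column names:
-- Rebuild a monthly sales summary so trend queries stay off the big transaction table.
-- dbo.SaleTransactions and its columns are hypothetical.
SELECT
    YEAR(SaleDate)  AS SaleYear,
    MONTH(SaleDate) AS SaleMonth,
    StoreId,
    ItemId,
    COUNT(*)        AS TransactionCount,
    SUM(SaleAmount) AS TotalSales
INTO dbo.MonthlySalesSummary
FROM dbo.SaleTransactions
GROUP BY YEAR(SaleDate), MONTH(SaleDate), StoreId, ItemId;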
Here is an MS SQL 2012 database with 7 million records a day, with an ambition to grow the database to 6 years of data for trend analyses.
The partitions are based on the YearWeek column, expressed as an integer (after 201453 comes 201501). So each partition holds one week of transaction data.
This makes for a maximum of 320 partitions, which is comfortably below the maximum of 1,000 partitions within a scheme. The maximum size of one partition of one table is now approx. 10 GB, which makes it much easier to handle than the 3 TB size of the total.
A new file in the partition scheme is used for each new year. The 500 GB data files are suitable for backup and deletion.
When calculating data for one month, the four processors work in parallel, each handling one partition.
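A minimal sketch of what weekly partitioning on an integer YearWeek column looks like in SQL Server; the object names, boundary values and filegroup mapping are illustrative, not taken from the setup described above:
-- One boundary value per week; RANGE RIGHT puts each boundary in the partition to its right.
CREATE PARTITION FUNCTION pfYearWeek (int)
    AS RANGE RIGHT FOR VALUES (201501, 201502, 201503);

-- Map all partitions to one filegroup here; in practice each year could get its own file.
CREATE PARTITION SCHEME psYearWeek
    AS PARTITION pfYearWeek ALL TO ([PRIMARY]);

-- The transaction table is created on the scheme, partitioned by YearWeek.
CREATE TABLE dbo.SaleTransactions
(
    SaleId     bigint        NOT NULL,
    YearWeek   int           NOT NULL,
    SaleDate   date          NOT NULL,
    SaleAmount decimal(18,2) NOT NULL
) ON psYearWeek (YearWeek);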

Google BigQuery pricing

I'm a PhD student from Singapore Management University. Currently I'm working at Carnegie Mellon University on a research project that needs the historical events from the GitHub Archive (http://www.githubarchive.org/). I noticed that Google BigQuery has the GitHub Archive data, so I ran a program to crawl data using the Google BigQuery service.
I just found out that the price Google BigQuery shows on the console is not updated in real time... When I had been running the program for a few hours, the fee was only a little over 4 dollars, so I thought the price was reasonable and kept the program running. After 1-2 days, I checked the price again on Sep 13, 2013, and it had become $1,388... I therefore immediately stopped using the Google BigQuery service. And just now I checked the price again; it turns out I need to pay $4,179...
It is my fault that I didn't realize I would need to pay this much money for executing queries and obtaining data from Google BigQuery.
This project is only for research, not for commercial purposes. I would like to know whether it is possible to waive the fee. I really need [Google Bigquery team]'s kind help.
Thank you very much & Best Regards,
Lisa
An update a year later:
Please note some big developments since this situation:
Query prices are down 85%.
GithubArchive now publishes daily and yearly tables - so while developing your queries, always test them on smaller datasets.
BigQuery pricing is based on the amount of data queried. One of its highlights is how easily it scales, going from scanning a few gigabytes to terabytes in a matter of seconds.
Pricing that scales linearly is a feature: most (or all?) other databases I know of would require exponentially more expensive resources, or are simply not able to handle these amounts of data - at least not in a reasonable time frame.
That said, linear scaling means that a query over a terabyte is 1,000 times more expensive than a query over a gigabyte. BigQuery users need to be aware of this and plan accordingly. For these purposes BigQuery offers the "dry run" flag, which lets you see exactly how much data will be queried before running the query - and adjust accordingly.
In this case WeiGong was querying a 105 GB table. Ten SELECT * LIMIT 10 queries will quickly amount to a terabyte of data, and so on.
There are ways to make these same queries consume much less data:
Instead of querying SELECT * LIMIT 10, select only the columns you are looking for. BigQuery charges based on the columns you query, so including unnecessary columns adds unnecessary costs.
For example, SELECT * ... queries 105 GB, while SELECT repository_url, repository_name, payload_ref_type, payload_pull_request_deletions FROM [githubarchive:github.timeline] only goes through 8.72 GB, making this query more than 10 times less expensive.
Instead of "SELECT *" use tabledata.list when looking to download the whole table. It's free.
The GitHub archive table contains data for all time. Partition it if you only want to look at one month of data.
For example, extracting all of the January data with a query leaves a new table of only 91.7 MB. Querying this table is a thousand times less expensive than the big one!
SELECT *
FROM [githubarchive:github.timeline]
WHERE created_at BETWEEN '2014-01-01' AND '2014-02-01'
-- save the result into a new table 'timeline_201401'
Combining these methods, you can go from a $4,000 bill to a $4 one, for the same amount of quick and insightful results.
(I'm working with Github archive's owner to get them to store monthly data instead of one monolithic table to make this even easier)