Price of my project with Google BigQuery pricing - google-bigquery

I am a little confused about the pricing of Google BigQuery. I need to work out the final price of my project in BigQuery. So, how much will one month of using BigQuery cost if the project needs the following:
10GB of new data will be added each day
25 million inserts will be made each day into a given table - each insert will be 0.4 KB in size.
Each day, 1,000 queries will be run over the whole current table of stored data.
All data will be kept (without deletion) for 1 year.

With the information you provided, I'm guessing you'll need to use streaming inserts, as normal load jobs into BigQuery have two daily limits (daily destination table update limit - 1,000 updates per table per day, and load jobs per project per day - 50,000, including failures) that would be hard to work around.
The price of your project will be composed of three parts: storage, streaming inserts, and queries.
BigQuery Storage
For the first month you’ll pay $0.0067 per day for each lot of 10 GB ($0.02 per GB per month * 10 GB * 1/30 of a month). So for 30 days the total cost will be about 4 USD (estimated for the daily added lots of 10 GB). For the next 11 months it will cost you 0.02 * 300 GB * 11 months = 66 USD.
If a table is not edited for 90 consecutive days, the price of storage for that table automatically drops by 50 percent, to $0.01 per GB per month. Also, if you don’t access the data, you can transfer it to a Cloud Storage bucket and choose the Nearline storage class at $0.01 per GB per month, resulting in a cost of 11 months * 300 GB * 0.01 = 33 USD.
Streaming inserts
The cost for streaming inserts, with 25,000,000 daily inserts billed at 1 KB each (1 KB is the minimum size per individual row): 25 GB per day → 750 GB per month.
Total price per month: 37.5 USD (750 GB * $0.05 per GB successfully inserted).
Query price
I roughly estimated that each query would require 1 GB of data to be processed, resulting in about 1 TB per day (1,000 queries * 1 GB), so it will cost $5 per day → 150 USD per month.
As this is the most expensive part of your project, you should estimate it carefully. I’d advise you to run some tests with the queries you plan to use, maybe against the public datasets, and see how much they would cost - for example with a check like the one sketched below. Also keep the best practices for controlling query costs in mind.
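One way to see what your test queries actually scanned, and roughly what they would cost, is the INFORMATION_SCHEMA jobs view - a minimal sketch, assuming your jobs run in the US multi-region and the $5 per TB on-demand rate:
-- Approximate on-demand cost of last week's queries ($5/TB is an assumption; check current pricing)
SELECT
  job_id,
  total_bytes_billed,
  total_bytes_billed / POW(2, 40) * 5 AS approx_cost_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
ORDER BY total_bytes_billed DESC;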
The total cost is 4 + 66 + 37.5 + 150 = 257.5 USD, not far from the estimate given by the BigQuery pricing calculator ($255.40).
As this is an estimate, I excluded the free tiers (10 GB of storage per month and 1 TB of queries per month) from the calculations.

Related

Azure Analytics: Kusto query dashboard grouped by client_CountryOrRegion shows a deviation in results whenever the UTC time range (past 24 hours) is changed

pageViews
| where (url contains "https://***.com")
| summarize TotalUserCount = dcount(user_Id)
| project TotalUserCount
Now when summarizing by client_CountryOrRegion, there is a deviation in the results for the different time ranges selected (i.e. 24 hours, 2 days, 3 days, 7 days, etc.): the user count by country does not match the total count. Is it due to the UTC timezone?
pageViews
| where (url contains "https://***.com")
| summarize Users= dcount(user_Id) by client_CountryOrRegion
Any help or suggestion would be like oxygen.
Quoting the documentation of dcount:
The dcount() aggregation function is primarily useful for estimating the cardinality of huge sets. It trades performance for accuracy, and may return a result that varies between executions. The order of inputs may have an effect on its output.
You can increase the accuracy of the estimation by providing the accuracy argument to dcount, e.g. dcount(user_Id, 4). Note that this will improve the estimation (at the cost of query performance), but still won't be 100% accurate. You can read more about this in the doc.
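For example, reusing the query above with an explicit accuracy level (the level 4 here is only an illustration; higher levels are slower but more accurate):
pageViews
| where (url contains "https://***.com")
| summarize Users = dcount(user_Id, 4) by client_CountryOrRegion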

What's a good balance to decide when to partition a table in BigQuery?

We are using a public dataset to benchmark BigQuery. We took the same table and partitioned it by day, but it's not clear we are getting many benefits. What's a good balance?
SELECT sum(score)
FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`
WHERE creation_date > "2019-01-01"
Takes 1 second, and processes 270.7MB.
Same, with partitions:
SELECT sum(score)
FROM `temp.questions_partitioned`
WHERE creation_date > "2019-01-01"
Takes 2 seconds and processes 14.3 MB.
So we see a benefit in MBs processed, but the query is slower.
What's a good strategy to decide when to partition?
(from an email I received today)
When partitioning a table, you need to consider having enough data for each partition. Think of each partition as a separate file - and opening 365 files might be slower than opening one huge file.
In this case, the table used for the benchmark has 1.6 GB of data for 2019 (up to June in this snapshot). That's 1.6 GB / 180 ≈ 9 MB of data for each daily partition.
For such a low amount of data, arranging it in daily partitions won't bring much benefit. Consider partitioning the data by year instead. See the following question to learn how:
Partition by week/month/quarter/year to get over the partition limit?
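A minimal sketch of the yearly approach, using integer range partitioning over an extracted year column (the table name, the creation_year column, and the year bounds are illustrative assumptions):
CREATE TABLE `temp.questions_partitioned_by_year`
PARTITION BY RANGE_BUCKET(creation_year, GENERATE_ARRAY(2008, 2030, 1))
AS
SELECT *, EXTRACT(YEAR FROM creation_date) AS creation_year
FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`
Queries then need a filter on creation_year (in addition to creation_date) to benefit from partition pruning.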
Another alternative is not partitioning the table at all, and instead using clustering to sort the data by date. Then BigQuery can choose the ideal size of each block.
If you want to run your own benchmarks, do this:
CREATE TABLE `temp.questions_partitioned`
PARTITION BY DATE(creation_date)
AS
SELECT *
FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`
vs no partitions, just clustering by date:
CREATE TABLE `temp.questions_clustered`
PARTITION BY fake_date
CLUSTER BY creation_date
AS
SELECT *, DATE('2000-01-01') fake_date
FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`
Then my query over the clustered table would be:
SELECT sum(score)
FROM `temp.questions_clustered`
WHERE creation_date > "2019-01-01"
And it took 0.5 seconds, 17 MB processed.
Compared:
Raw table: 1 sec, 270.7MB
Partitioned: 2 sec, 14.3 MB
Clustered: 0.5 sec, 17 MB
We have a winner! Clustering organized the daily data (which isn't much for this table) into more efficient blocks than strictly partitioning it by day.
It's also interesting to look at the execution details for each query on these tables:
Slot time consumed
Raw table: 10.683 sec
Partitioned: 7.308 sec
Clustered: 0.718 sec
As you can see, the query over the raw table used a lot of slots (parallelism) to get the results in 1 second: 50 workers processed the whole table, with multiple years of data, reading 17.7M rows. The query over the partitioned table also had to use a lot of slots, because each slot was assigned a smallish daily partition; that read used 153 parallel workers over 0.9M rows. The clustered query, by contrast, was able to use a very small number of slots: the data was well organized and read by 57 parallel workers over 1.12M rows.
See also:
https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b
How can I improve the amount of data queried with a partitioned+clustered table?
how clustering works in BigQuery
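Note that newer BigQuery releases also support clustering without any partitioning, so the fake_date workaround above may no longer be necessary - a minimal sketch (the table name is illustrative):
CREATE TABLE `temp.questions_clustered_only`
CLUSTER BY creation_date
AS
SELECT *
FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`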

How to partition based on the month and year in Azure SQL Data Warehouse

I am going to use ADF to copy 5 billion rows to Azure SQL Data Warehouse. Azure SQL DW will distribute the table into 60 distributions by default, but I want to add another 50 partitions based on the month and year, as follows:
PARTITION ( DateP RANGE RIGHT FOR VALUES
(
'2015-01-01', '2015-02-01', ......, '2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01', .......
))
But the column that I am using to partition the table includes date and time together:
2015-01-01 00:30:00
Do you think my partitioning approach is correct?
5B rows / (50 partitions x 60 Distributions) = 1.7M rows/partition on average
That's probably too many partitions, but if you have a lot of single-month queries it might be worth it. You would definitely want to defragment your columnstores after load.
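For reference, a columnstore rebuild after the load can be as simple as the statement below (dbo.FactTrips is a placeholder for your fact table):
ALTER INDEX ALL ON dbo.FactTrips REBUILD;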
I tend to agree with David that this is probably overkill for the number of partitions. You'll want to make sure that you have a fairly even distribution of data, and with roughly 1.7M rows per partition you'll be on the lower side. You can probably move to quarter-based partitions (e.g., '2018-01-01', '2018-04-01', '2018-07-01', '2018-10-01') to get good results for query performance. This would give you 4 partitions a year since 2015 (or 20 total). So the math is:
5B rows / (20 partitions * 60 distributions) = 4.167M rows/partition.
While the number of partitions does matter for partition elimination scenarios, this is a fact table with columnstore indexes, which will do an additional level of index segment elimination at query time. Over-partitioning can make the situation worse rather than better.
Microsoft's guidance on sizing partitions, especially for columnstore-indexed tables in Azure SQL DW, is a minimum of about 60 million rows per partition; anything lower may not give optimum performance. The logic behind that number is that there should be at least 1 million rows per distribution per partition, and since every partition is spread across the sixty distributions, the minimum works out to roughly 60M rows per partition.
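Putting the quarterly suggestion into DDL form, a minimal sketch (dbo.FactTrips, its columns, and the HASH distribution key are placeholders; extend the boundary list through the last quarter you load):
CREATE TABLE dbo.FactTrips
(
    TripKey  bigint     NOT NULL,
    DateP    datetime2  NOT NULL
    -- remaining columns go here
)
WITH
(
    DISTRIBUTION = HASH(TripKey),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ( DateP RANGE RIGHT FOR VALUES (
        '2015-01-01', '2015-04-01', '2015-07-01', '2015-10-01',
        '2016-01-01'  -- continue with one boundary per quarter
    ))
);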

Big Query Error: Your project exceeded quota for free query bytes scanned

I am trying to access the records for the year 2015 from the NYC yellow cabs table in BigQuery, and I keep running into this error.
Error: Quota exceeded: Your project exceeded quota for free query bytes scanned. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors.
My Query is:
SELECT
DAYOFWEEK(pickup_datetime) dayofweek,
INTEGER(100*AVG(trip_distance/((dropoff_datetime-pickup_datetime)/3600000000)))/100 speed,
FROM
[nyc-tlc:yellow.trips]
WHERE
fare_amount/trip_distance between 2
and 10
and year(pickup_datetime) = 2015
GROUP BY 1
ORDER BY 1
I am using the Free trial.
Every month you get 1 free terabyte to query data in BigQuery - it seems you ran out of it.
But don't worry! It replenishes on an ongoing basis, so you only need to wait a couple hours, not a full month to continue querying.
This query processes about 33 GB, so you can run around 30 queries like it for free every month.
As you are only looking at 2015, a way to save quota is to use the monthly tables that the TLC provided. For example, if you change the query to cover only July 2015:
SELECT
DAYOFWEEK(pickup_datetime) dayofweek,
INTEGER(100*AVG(trip_distance/((dropoff_datetime-pickup_datetime)/3600000000)))/100 speed,
FROM
[nyc-tlc:yellow.trips_2015_07]
WHERE
fare_amount/trip_distance between 2 and 10
and year(pickup_datetime) = 2015
GROUP BY 1
ORDER BY 1
This one only processes 353MB, so you could run 2,800 free queries like this a month (way better!).
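If you prefer standard SQL, a roughly equivalent query over the same monthly table might look like the sketch below (SAFE_DIVIDE avoids division-by-zero errors, and ROUND replaces the INTEGER(100*...)/100 truncation, so results can differ slightly):
SELECT
  EXTRACT(DAYOFWEEK FROM pickup_datetime) AS dayofweek,
  ROUND(AVG(SAFE_DIVIDE(trip_distance,
        TIMESTAMP_DIFF(dropoff_datetime, pickup_datetime, SECOND) / 3600)), 2) AS speed
FROM `nyc-tlc.yellow.trips_2015_07`
WHERE SAFE_DIVIDE(fare_amount, trip_distance) BETWEEN 2 AND 10
  AND EXTRACT(YEAR FROM pickup_datetime) = 2015
GROUP BY dayofweek
ORDER BY dayofweek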