Is a cost incurred when partitioning date-sharded tables in BigQuery? - google-bigquery

The BigQuery documentation gives this command for creating a partitioned table from existing sharded tables:
bq partition mydataset.sharded_ mydataset.partitioned
(see partitioned tables)
But when I run this, I can see that the data is actually being moved. Since selecting data from large raw tables is very expensive, I wonder how Google bills for this operation.

The bq partition CLI command uses copy jobs rather than queries, so it does not incur query execution costs (though you are still charged for the storage that the new partitioned table consumes).
If you're using the CLI, copy jobs can be specified using the bq cp command.
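For contrast, doing the same consolidation with a query instead of the copy-based bq partition command would be billed for the bytes scanned. A minimal sketch of that query-based route in BigQuery standard SQL, reusing the table names from the question and assuming the shards carry a YYYYMMDD suffix (the shard_date column is illustrative):

-- Query-based alternative: billed for bytes scanned, unlike the copy-based bq partition path.
CREATE TABLE mydataset.partitioned
PARTITION BY shard_date
AS
SELECT
  PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) AS shard_date,  -- recover the shard date from the table name
  *
FROM `mydataset.sharded_*`;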

Related

Duplicate several tables in bigquery project at once

In our BQ export schema, we have one table for each day.
I want to copy the tables from before a certain date (2021-Feb-07). I know how to copy one day at a time via the UI, but is there a way to use the Cloud Console to script copying the selected date range all at once? Or maybe a SQL command run directly from a query window?
I think you should transform your sharded tables into a single partitioned table, so you can handle your whole data set with a single query. As mentioned in the official documentation, partitioned tables also perform better.
To make the conversion, you can execute the following command from the console.
bq partition \
--time_partitioning_type=DAY \
--time_partitioning_expiration 259200 \
mydataset.sourcetable_ \
mydataset.mytable_partitioned
This will turn your sharded tables sourcetable_(xxx) into a single partitioned table, mytable_partitioned, which can be queried across your entire set of data with a single query.
SELECT
*
FROM
`myprojectid.mydataset.mytable_partitioned`
WHERE
_PARTITIONTIME BETWEEN TIMESTAMP('2022-01-01') AND TIMESTAMP('2022-01-03')
For more details about the conversion command you can check this link. I also recommend checking the documentation on querying partitioned tables and on partitioned tables in general.
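If the goal is just to materialise all rows from the days before the cut-off date into a single table, a wildcard query with a _TABLE_SUFFIX filter can also do that directly from the query window. A minimal sketch, assuming daily tables named with the prefix events_ and a YYYYMMDD suffix (these names are hypothetical):

CREATE TABLE `myprojectid.mydataset.events_before_20210207` AS
SELECT
  *,
  _TABLE_SUFFIX AS source_day  -- keep track of which daily table each row came from
FROM
  `myprojectid.mydataset.events_*`
WHERE
  _TABLE_SUFFIX < '20210207';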

AWS Equivalent For Incremental SQL Query

I can create a materialised view in RDS (PostgreSQL) to keep track of the 'latest' output of a SQL query, and then visualise it in QuickSight. This process is also very 'quick', as it doesn't involve calling additional AWS services or re-processing all of the data through the SQL query again. My assumption about how this works is that it runs the SQL once and then re-runs it only over the new data, so that if you structure the query correctly you can end up with a 'real-time running total' metric, for example.
The issue is that creating materialised views (refreshed every 5 seconds) for hundreds of queries, and storing them all in one database, is not scalable. Imagine a DB with 1 TB of data: creating an incremental/materialised view seems much less painful than using other AWS services, but eventually it won't be optimal in terms of processing time, cost, etc.
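For reference, a plain materialised-view version of the setup described above might look roughly like this in PostgreSQL; the view name is illustrative and the table matches the example in the edit below. Note that REFRESH MATERIALIZED VIEW re-executes the full defining query, which is part of the scaling concern.

CREATE MATERIALIZED VIEW revenue_by_customer AS
SELECT customer_id, sum(revenue) AS total_revenue
FROM table1
GROUP BY customer_id;

-- Re-run on a schedule or after loads; this recomputes the view from scratch.
REFRESH MATERIALIZED VIEW revenue_by_customer;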
I have explored various AWS services, none of which seem to solve this problem.
I tried AWS Glue. You would need to create one script per query and write its output to a DB. The lag between reading and writing the incremental data is larger than with a materialised view, because although you can process data incrementally, appending it to the current 'total' metric is a separate step.
I explored using AWS Kinesis followed by a Lambda to run a SQL on the 'new' data in the stream, and store the value in S3 or RDS. Again, this adds latency and doesn't work as well as a materialised view.
I read that AWS Redshift does not have materialised views, so I stuck with RDS (PostgreSQL).
Any thoughts?
[A similar issue: incremental SQL query - except I want to avoid running the SQL on "all" data to avoid massive processing costs.]
Edit (example):
table1 has schema (datetime, customer_id, revenue)
I run this query: select customer_id, sum(revenue) from table1 group by customer_id.
This scans the whole table to come up with a metric per customer_id.
table1 now gets updated with new data as time progresses, e.g. 1 extra hour of data.
If I run the same query again, it scans all of the data again.
A more efficient way is to just compute the query on the new data, and append the result.
Also, I want the query to run actively whenever the data changes, rather than on a schedule, so that my front-end dashboards basically 'auto-update' without the customer doing much.
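For concreteness, here is a minimal sketch of the incremental pattern described in the edit, in plain PostgreSQL. The revenue_totals summary table and its columns are hypothetical, and something still has to drive the statement (a trigger, LISTEN/NOTIFY, or a scheduler):

-- Hypothetical running-total table keyed by customer.
CREATE TABLE revenue_totals (
    customer_id    bigint PRIMARY KEY,
    total_revenue  numeric NOT NULL DEFAULT 0,
    last_processed timestamptz NOT NULL
);

-- Fold in only the rows newer than the last processed timestamp.
WITH new_rows AS (
    SELECT customer_id,
           sum(revenue)  AS delta,
           max(datetime) AS latest
    FROM table1
    WHERE datetime > coalesce((SELECT max(last_processed) FROM revenue_totals),
                              '-infinity'::timestamptz)
    GROUP BY customer_id
)
INSERT INTO revenue_totals (customer_id, total_revenue, last_processed)
SELECT customer_id, delta, latest FROM new_rows
ON CONFLICT (customer_id) DO UPDATE
SET total_revenue  = revenue_totals.total_revenue + EXCLUDED.total_revenue,
    last_processed = EXCLUDED.last_processed;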

AWS Athena MSCK REPAIR TABLE tablename command

Is there a number of partitions at which we would expect this command
MSCK REPAIR TABLE tablename;
to fail?
I have a system that currently has over 27k partitions. When the schema changes for the Athena table, we drop the table, recreate it with the new column(s) tacked onto the end, and then run
MSCK REPAIR TABLE tablename;
We had no luck getting this command to do any work whatsoever, even after letting it run for 5 hours; not a single partition was added. I'm wondering if anyone has information about a partition limit we may have hit that isn't documented anywhere.
MSCK REPAIR TABLE is an extremely inefficient command. I really wish the documentation didn't encourage people to use it.
What to do instead depends on a number of things that are unique to your situation.
In the general case I would recommend writing a script that performs S3 listings, constructs a list of partitions with their locations, and uses the Glue BatchCreatePartition API to add the partitions to your table.
When your S3 location contains lots of files, like it sounds yours does, I would either use S3 Inventory to avoid listing everything, or list objects with a delimiter of / so that I could list only the directory/partition structure part of the bucket and skip listing all files. 27K partitions can be listed fairly quickly if you avoid listing everything.
Glue's BatchCreatePartition is a bit annoying to use, since you have to specify all columns, the serde, and everything else for each partition, but it's faster than running ALTER TABLE … ADD PARTITION … and waiting for query execution to finish, and ridiculously faster than MSCK REPAIR TABLE ….
When it comes to adding new partitions to an existing table you should also never use MSCK REPAIR TABLE, for mostly the same reasons. Almost always when you add new partitions to a table you know their locations, and ALTER TABLE … ADD PARTITION … or Glue's BatchCreatePartition can be used directly, with no scripting necessary.
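For example, when the new partition locations are known, a single Athena DDL statement can register several of them at once (the table name, partition key, and S3 paths here are illustrative):

ALTER TABLE mydb.tablename ADD IF NOT EXISTS
  PARTITION (dt = '2021-02-07') LOCATION 's3://my-bucket/data/dt=2021-02-07/'
  PARTITION (dt = '2021-02-08') LOCATION 's3://my-bucket/data/dt=2021-02-08/';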
If the process that adds new data is separate from the process that adds new partitions, I would recommend setting up S3 notifications to an SQS queue and periodically reading the messages, aggregating the locations of new files and constructing the list of new partitions from that.

How to handle hive locking across hive and presto

I have a few Hive tables that are insert-overwritten from Spark and Hive. Those tables are also accessed by analysts through Presto. Naturally, we're running into windows of time in which users hit an incomplete data set, because Presto ignores Hive locks.
The options I can think of:
Fork the presto-hive connector to support Hive shared (S) and exclusive (X) locks appropriately. This isn't too bad, but it is time-consuming to do properly.
Swap the table location in the Hive metastore once an insert-overwrite completes. This is OK, but a little messy, because we like to set explicit locations at the database level and let the tables inherit them.
Stop doing insert-overwrite on these tables and instead just add a new partition for the things that have changed; once a new partition is written, alter the Hive table to see it. Then we can put views on top of the data that properly reconcile the latest version of each row (see the sketch after this list).
Stop doing insert-overwrite on S3, which has a long copy window from the Hive staging location to the target table. If we move to HDFS for all insert-overwrites, we still have the issue, but only over the span of time it takes to do an hdfs mv, which is significantly faster. (Probably bad: there's still a window in which we can get incomplete data.)
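A rough sketch of what such a reconciling view could look like (Hive/Presto SQL; the table, key, and version columns are all illustrative):

-- Each load adds rows under a new load_dt partition; the view exposes only the latest version of each row.
CREATE VIEW db.orders_latest AS
SELECT order_id, payload, load_dt
FROM (
    SELECT t.*,
           row_number() OVER (PARTITION BY order_id
                              ORDER BY load_dt DESC) AS rn
    FROM db.orders_appends t
) v
WHERE rn = 1;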
My question is: how do people generally handle this? It seems like a common scenario that would have an established solution, but I seem to be missing it. The same question applies to any third-party tool that can query the Hive metastore and interact with HDFS/S3 directly while not respecting Hive locks.

Result of a BigQuery job running on a table into which data is loaded via the streaming API

I have a BQ wildcard query that merges a couple of tables with the same schema (company_*) into a new, single table (all_companies). (all_companies will be exported later into Google Cloud Storage)
I'm running this query using the BQ CLI with all_companies as the destination table and this generates a BQ Job (runtime: 20mins+).
The company_* tables are populated constantly using the streaming API.
I've read about BigQuery jobs, but I can't find any information about streaming behavior.
If I start the BQ CLI query at T0, the streamingAPI adds data to company_* tables at T0+1min and the BQ CLI query finishes at T0+20min, will the data added at T0+1min be present in my destination table or not?
As described here, the query engine will look at both columnar storage and the streaming buffer, so the query can potentially see the streamed data.
It depends on what you mean by a runtime of 20 minutes+. If the query is run 20 minutes after you create the job, then all data that is in the streaming buffer by T0+20min will be included.
If, on the other hand, the job starts immediately and takes 20 minutes to complete, you will only see the data that was in the streaming buffer at the moment the table was queried.