Insert query on partitioned tables in BigQuery - Hive

I have some INSERT queries written in Hive that need to be migrated to BigQuery.
For example:
insert into test.abc partition(yrmth) select * from test.xyz
In BigQuery, partitioning is only supported in YYYYMMDD format. I'm able to load the data into the partitioned table through the bq command-line tool by targeting test.abc$20171125.
How can I achieve the same using DML statements in BigQuery?
I have learnt that Legacy SQL doesn't support DML statements, and Standard SQL doesn't support partition decorators like test.abc$20171125, which are required for loading data into the corresponding partition.

You are correct - DML statements are not yet supported over partitioned tables.
Just run a simple select * from test.xyz with destination table test.abc$20171125. This is supported by the Web UI, the bq command line, the API, and any client of your choice.
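For example, on the bq command line this could look roughly as follows (a sketch only: the table names are the ones from the question, and the flags are the usual bq query flags):
bq query --append_table --use_legacy_sql=false --destination_table 'test.abc$20171125' 'select * from test.xyz'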
Check https://issuetracker.google.com/issues/36383555 if you want to try the alpha release for column-based partitioned tables - DML over partitioned tables is part of it.

Related

BigQuery - Updating GCS path of BigQuery external tables

I have a requirement where I might have to update BigQuery external tables on a periodic basis.
The GCS location has a timestamp for every incremental run; I would like to update the external table's path to point to the latest timestamp folder.
The only way I see is dropping the table and creating it again, pointing it to the latest folder. Is there any other way to update it without dropping the table?
As suggested by @Samuel, you can use the SQL statement CREATE OR REPLACE EXTERNAL TABLE for your requirement. Scheduled queries support DML and DDL statements, which can be used to create the new tables. You can use the below-mentioned query parameter to create the table according to your schedule:
My_database_name.my_table_name.my_results_{run_date}
For more information you can refer to this documentation.
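For illustration, a CREATE OR REPLACE EXTERNAL TABLE statement that re-points the table at a newer GCS folder could look roughly like this (the dataset, table, bucket, folder, and format below are hypothetical placeholders):
-- Hypothetical names; re-run (or schedule) with the newest timestamp folder in the URI
CREATE OR REPLACE EXTERNAL TABLE my_dataset.my_external_table
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/exports/20230101T0000/*.parquet']
);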

Optimize Temporary Table on Presto/Hive SQL

I would like to optimize the computation time of queries run on Presto/Hive SQL. One of the techniques I used on Redshift was to improve the efficiency of temporary tables, as in the following:
BEGIN;
CREATE TEMPORARY TABLE my_temp_table(
column_a varchar(128) encode lzo,
column_b char(4) encode bytedict)
distkey (column_a) -- Assuming you intend to join this table on column_a
sortkey (column_b) -- Assuming you are sorting or grouping by column_b
;
INSERT INTO my_temp_table SELECT column_a, column_b FROM my_table;
COMMIT;
I have tried that on Presto/Hive SQL but it is not supported. Do you know the equivalent of this technique on Presto/Hive SQL?
Many thanks!
Redshift is a relational database; Presto is a distributed SQL query engine. Presto currently doesn't support the creation of temporary tables, nor the creation of indexes. But you may create tables based on a SQL statement via CREATE TABLE AS - see the Presto documentation.
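For instance, a minimal CREATE TABLE AS sketch against the Hive connector (the catalog, schema, and table names are placeholders):
CREATE TABLE hive.staging.my_precomputed AS
SELECT column_a, column_b
FROM hive.prod.my_table;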
You optimize the performance of Presto in two ways:
Optimizing the query itself
Optimizing how the underlying data is stored
One of the best articles around is Top 10 Performance Tuning Tips for Amazon Athena - Athena is an AWS service based on Presto 0.172, and therefore the tips should also work for Presto.
I am not a Redshift expert but it seems you want to precompute a data set, distributing it and sorting by selected columns, so that it is faster to query.
This corresponds to Presto Hive connector ability to:
partition data -- data with same value in partitioning column(s) will form a single partition, which is a folder on storage; do not use partitioning on high cardinality columns. This is defined using partitioned_by table property
bucket data -- data is grouped in files using hash of bucketing column(s); this is similar to partitioning to a certain extent. This is defined using bucketed_by and bucket_count table properties.
sort data -- within data file, data is sorted by given column(s). This is defined using sorted_by table property.
See examples in Trino (formerly Presto SQL) Hive connector documentation
Note: while I realize the documentation is scarce at the moment, I filed an issue to improve it. In the meantime, you can get additional information on the Trino (formerly Presto SQL) community Slack.
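Putting those three properties together, a CREATE TABLE AS statement on the Hive connector might look roughly like this (catalog, schema, table, and column names are placeholders; note that partition columns must come last in the select list, and sorted_by applies within the bucket files):
CREATE TABLE hive.staging.my_precomputed_layout
WITH (
  partitioned_by = ARRAY['event_date'], -- one storage folder per event_date value
  bucketed_by = ARRAY['column_a'],      -- hash-distribute rows by the join key
  bucket_count = 32,
  sorted_by = ARRAY['column_b']         -- sort rows within each bucket file
)
AS
SELECT column_a, column_b, event_date
FROM hive.prod.my_table;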

Write BigQuery results to BigQuery in Datalab

I ran a query abc and got the result table m in Datalab.
Is there any way I can create a new table in BigQuery and write the content of table m to it in Datalab?
%%bq query --name abc
select *
from `test`
m=abc.execute().result()
Yes you can. Datalab supports a number of BigQuery magic commands; among them is create, which allows for the creation of datasets and tables.
See the documentation here:
http://googledatalab.github.io/pydatalab/datalab.magics.html
And specifically this section, which details all the available BigQuery commands:
positional arguments:
{sample,create,delete,dryrun,udf,execute,pipeline,table,schema,datasets,tables,extract,load}
commands
sample Display a sample of the results of a BigQuery SQL
query. The cell can optionally contain arguments for
expanding variables in the query, if -q/--query was
used, or it can contain SQL for a query.
create Create a dataset or table.
delete Delete a dataset or table.
dryrun Execute a dry run of a BigQuery query and display
approximate usage statistics
udf Create a named Javascript BigQuery UDF
execute Execute a BigQuery SQL query and optionally send the
results to a named table. The cell can optionally
contain arguments for expanding variables in the
query.
pipeline Define a deployable pipeline based on a BigQuery
query. The cell can optionally contain arguments for
expanding variables in the query.
table View a BigQuery table.
schema View a BigQuery table or view schema.
datasets List the datasets in a BigQuery project.
tables List the tables in a BigQuery project or dataset.
extract Extract BigQuery query results or table to GCS.
load Load data from GCS into a BigQuery table.
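As a concrete sketch building on the question's own snippet, the query object defined by %%bq query can also be executed with an output table through the pydatalab Python API (the destination table name below is a hypothetical placeholder; check the exact option names against the documentation linked above):
import google.datalab.bigquery as bq
# 'abc' is the query object defined by the %%bq query --name abc cell above.
# Write its results into the (hypothetical) table my_dataset.my_new_table, overwriting it if it exists.
output = bq.QueryOutput.table('my_dataset.my_new_table', mode='overwrite')
m = abc.execute(output_options=output).result()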

How do I query the streaming buffer in BigQuery if the _PARTITIONTIME field isn't available with Standard SQL

We have a partitioned BigQuery table that's partitioned on a custom field (event_timestamp). This was due to needing to import historical data while still keeping things partitioned and query costs under control.
This means the table doesn't have a _PARTITIONTIME field, and we can't do queries WHERE _PARTITIONTIME IS NULL to view recently sent records sitting in the streaming buffer.
So is there a trick to being able to query the streaming buffer even if you're using a custom field for partitioning?
Edit: We're using Standard SQL, not Legacy SQL; sorry for not being explicit about that previously.
Even though tables partitioned on a custom time-partitioning field can only be queried via Standard SQL, I believe Legacy SQL is still supported for querying the streaming inserts into the table:
#legacySQL
select * from [<project-name>:<data-set>.<table>$__UNPARTITIONED__]
Reference:
https://cloud.google.com/bigquery/docs/partitioned-tables
Query Streaming Data Using --time_partitioning_field

BigQuery insert into a partitioned table from an existing table

I have two tables with the same schema, tab1 and tab1_partitioned, where the latter is partitioned by day.
I am trying to insert data into the partitioned table with the following command:
bq query --allow_large_results --replace --noflatten_results --destination_table 'advertiser.development_partitioned$20160101' 'select * from advertiser.development where ymd = 20160101';
but I get the following error:
BigQuery error in query operation: Error processing job 'total-handler-133811:bqjob_r78379ac2513cb515_000001553afb7196_1': Provided Schema does not match Table
Both have exactly the same schema and I really don't understand why I am getting that error. Can someone shed some light on my issue?
In fact, I would prefer it if BigQuery supported the dynamic partitioning insert that Hive supports, but several days of searching suggest that it is not possible :-/
The behavior you are seeing is due to how we treat write dispositions when using them with table partitions.
You should be able to append to the partition using a WRITE_APPEND disposition to get the query to go through.
bq query --allow_large_results --append_table --noflatten_results --destination_table 'advertiser.development_partitioned$20160101' 'select * from advertiser.development where ymd = 20160101';
There are some complications to making it work with --replace, but we are looking into improved schema support for table partitions at this time.
Please let me know if this doesn't work for you. Thanks!
To answer the other part of your question about dynamic partitioning - we do plan to support richer flavors of partitioning and we believe that they will handle the majority of use cases.
FYI, I don't think it was always so, but there is now a way to copy data from non-partitioned to partitioned tables in BigQuery using just DML from the BigQuery UI. For example, if you have a date string of the form YYYY-MM-DD in your origin table, you could run this to move the data to a partitioned table:
create table my_dataset.my_table (sesh STRING, prod STRING) partition by DATE(_PARTITIONTIME);
insert into my_dataset.my_table (_PARTITIONTIME, sesh, prod) select CAST(PARSE_DATE('%Y-%m-%d', mydatestr) as TIMESTAMP), sesh, prod from my_dataset.my_orig_table;
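Once loaded this way, rows for a single day's partition can then be selected by filtering on the _PARTITIONTIME pseudo column (the date below is just an example):
select sesh, prod
from my_dataset.my_table
where _PARTITIONTIME = TIMESTAMP('2017-11-25');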