Is it possible to overwrite a partition in BigQuery? - google-bigquery

This article: https://cloud.google.com/bigquery/docs/writing-results states that it is possible to overwrite a BigQuery table with new data; however, what I'd like to do is overwrite a partition (or multiple partitions). Is that possible?
I've read through tonnes of documentation about inserting data into BigQuery (e.g. https://cloud.google.com/bigquery/docs/creating-column-partitions) and can't find any reference to overwriting partitions, so I assume the answer to my question is "no", but thought I'd ask anyway.

You can always overwrite a partition of a partitioned table in BQ by using the $YYYYMMDD partition decorator as a suffix on the destination table name of your query, along with WRITE_TRUNCATE as your write disposition (i.e. truncate whatever exists in that partition and write the new results).
So, let's say you run your query and want to overwrite the partition for date 2019-01-15 in your table named xyz: you just set the output destination for your query results to yourdataset.xyz$20190115 and specify the write disposition WRITE_TRUNCATE.
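A minimal sketch of what that looks like; the staging table and date column below are assumptions for illustration, not from the question:
-- Run this query with destination table `yourdataset.xyz$20190115` and
-- write disposition WRITE_TRUNCATE (set via the console, API, or job
-- config); only that one partition is replaced, the rest are untouched.
SELECT *
FROM yourdataset.xyz_staging
WHERE event_date = '2019-01-15';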
Hope it helps.

You are in luck! This is possible through MERGE DML.
https://cloud.google.com/bigquery/docs/using-dml-with-partitioned-tables#pruning_partitions_when_using_a_merge_statement
My advice is to play around with it a bit. If you can't get it working, post a new question with specific data/queries.
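A hedged sketch of the pattern from the linked page; the table and column names (a target partitioned on event_date, plus a staging table) are made up for illustration:
-- The constant filter on the partitioning column in the ON clause
-- lets BigQuery prune the MERGE down to the one affected partition.
MERGE dataset.target AS t
USING dataset.target_staging AS s
ON t.id = s.id AND t.event_date = '2019-01-15'
WHEN MATCHED THEN
  UPDATE SET value = s.value
WHEN NOT MATCHED THEN
  INSERT (id, event_date, value) VALUES (s.id, s.event_date, s.value);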

Related

"Write append" options bigquery

I am working in BigQuery with standard SQL and I have the following problem.
I am transforming a table with millions of rows, but I will only work with the data from yesterday and today.
The result of that query (which is already written) has to be stored in another table.
The problem is that this must be executed every hour, and when creating the scheduled query with the "write append" option, the data that was previously saved gets duplicated.
I need something like "write to table if it does not exist".
You should write your scheduled query with replacement in mind:
CREATE OR REPLACE TABLE `dataset.mytable`
AS
SELECT 1;
This way you fully replace the table on each run.
Update:
You may use a MERGE statement to skip existing rows and add only new ones, as sketched below.
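A minimal insert-only sketch, assuming id uniquely identifies a row; the table names and the timestamp column are placeholders:
-- Only rows whose id is not already in the target get inserted,
-- so re-running the hourly job does not create duplicates.
MERGE `dataset.mytable` AS t
USING (
  SELECT id, payload
  FROM `dataset.source_table`
  WHERE ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
) AS s
ON t.id = s.id
WHEN NOT MATCHED THEN
  INSERT (id, payload) VALUES (s.id, s.payload);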
Materialized views append only new data.
They can query only a single table, support only a limited set of aggregation functions (APPROX_COUNT_DISTINCT, ARRAY_AGG, AVG, COUNT, HLL_COUNT.INIT, MAX, MIN, SUM), and do not support computation on top of an aggregation, but maybe they will fit your use case.

Unable to load partitions in Athena with case sensitivity ON

I have data in S3 which is partitioned in YYYY/MM/DD/HH/ structure (not year=YYYY/month=MM/day=DD/hour=HH)
I set up a Glue crawler for this, which creates a table in Athena, but when I query the data in Athena I get an error because one field has a duplicate name (URL and url, which the SerDe converts to lowercase, causing a name conflict).
To fix this, I manually created another table (using the table definition from SHOW CREATE TABLE above), adding 'case.insensitive'= FALSE to the SERDEPROPERTIES:
WITH SERDEPROPERTIES ('paths'='deviceType,emailId,inactiveDuration,pageData,platform,timeStamp,totalTime,userId','case.insensitive'= FALSE)
I changed the S3 directory structure to the Hive-compatible naming year=/month=/day=/hour=, created the table with 'case.insensitive'= FALSE, and then ran MSCK REPAIR TABLE for the new table, which loaded all the partitions.
(Complete CREATE TABLE QUERY)
But upon querying, I can only see one data column (platform) plus the partition columns; the rest of the columns are not parsed, even though I actually copied the Glue-generated CREATE TABLE query with the case_insensitive=false condition.
How can I fix this?
I think you have multiple separate issues: one with the crawler, one with the serde, and one with duplicate keys.
Glue Crawler
If Glue Crawlers delivered on what they promise, they would be a fairly good solution for most situations and would save us from writing the same code over and over again. Unfortunately, if you stray outside of the (undocumented) use cases they were designed for, you often end up with various issues, from the strange to the completely broken (see for example this question, this question, this question, this question, this question, or this question).
I recommend that you skip Glue Crawler and instead write the table DDL by hand (you have a good template in what the crawler created, it just isn't good enough). Then you write a Lambda function (or shell script) that you run on a schedule to add new partitions.
Since your partitioning is only on time, this is a fairly simple script: it just needs to run every once in a while and add the partition for the next period.
It looks like your data is from Kinesis Data Firehose which produces a partitioned structure at hour granularity. Unless you have lots of data coming every hour I recommend you create a table that is only partitioned on date, and run the Lambda function or script once per day to add the next day's partition.
A benefit from not using Glue Crawler is that you don't have to have a one-to-one correspondence between path components and partition keys. You can have a single partition key that is typed as date, and add partitions like this: ALTER TABLE foo ADD PARTITION (dt = '2020-05-13') LOCATION 's3://some-bucket/data/2020/05/13/'. This is convenient because it's much easier to do range queries on a full date than when the components are separate.
If you really need hourly granularity you can either have two partition keys, one which is the date and one the hour, or just the one with the full timestamp, e.g. ALTER TABLE foo ADD PARTITION (ts = '2020-05-13 10:00:00') LOCATION 's3://some-bucket/data/2020/05/13/10/'. Then run the Lambda function or script every hour, adding the next hour's partition.
Having too granular a partitioning scheme doesn't help with performance, and can instead hurt it (although the performance hit comes mostly from the small files and the many directories).
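To make the hand-written DDL advice above concrete, here is a sketch of a date-partitioned table plus the statement a scheduled script would run; the table name, columns, and bucket are assumptions:
-- Date-typed partition key; partitions are added explicitly, so the
-- S3 layout does not need to follow the key=value convention.
CREATE EXTERNAL TABLE `events` (
  `userid` string,
  `platform` string
)
PARTITIONED BY (`dt` date)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://some-bucket/data/';

-- What the daily scheduled job would execute for the next day:
ALTER TABLE `events` ADD IF NOT EXISTS
PARTITION (dt = '2020-05-14') LOCATION 's3://some-bucket/data/2020/05/14/';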
SerDe config
As for why you're only seeing values for the platform column: it's the only column whose name has the same casing in the table definition and in the JSON properties.
It's a bit surprising that the DDL you link to doesn't work, but I can confirm that it really doesn't. I tried creating a table from that DDL, but without the pagedata column (I also skipped the partitioning, but that shouldn't make a difference for the test), and indeed only the platform column had any value when I queried the table.
However, when I removed the case.insensitive serde property it worked as expected, which got me thinking that it might not work the way you think it does. I tried setting it to TRUE instead of FALSE, which made the table work as expected again. I think we can conclude from this that the Athena documentation is just wrong when it says "By default, Athena requires that all keys in your JSON dataset use lowercase". In fact, what happens is that Athena lower cases the column names, but it also lower cases the property names when reading the JSON.
With further experimentation it turned out the 'paths' property was redundant too. This is a table that worked for me:
CREATE EXTERNAL TABLE `json_case_test` (
`devicetype` string,
`timestamp` string,
`totaltime` string,
`inactiveduration` int,
`emailid` string,
`userid` string,
`platform` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://some-bucket/data/'
I'd say that case.insensitive seems to cause more problems than it solves.
Duplicate keys
When I added the pagedata column (as struct<url:string>) and added "pageData":{"URL":"URL","url":"url"} to the data, I got the error:
HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: Duplicate key "url"
And I got the error regardless of whether the pagedata column was involved in the query or not (e.g. SELECT userid FROM json_case_test also errored). I tried the case.insensitive serde property with both TRUE and FALSE, but it had no effect.
Next, I took a look at the source documentation for the serde, which first of all is worded much better, and secondly contains the key piece of information: that you also need to provide mappings for the columns when you turn off case insensitivity.
With the following serde properties I was able to get the duplicate key issue to go away:
WITH SERDEPROPERTIES (
"case.insensitive" = "false",
"mapping.pagedata" = "pageData",
"mapping.pagedata.url" = "pagedata.url",
"mapping.pagedata.url2"= "pagedata.URL"
)
You would have to provide mappings for all the columns except for platform, too.
Alternative: use JSON functions
You mentioned in a comment to this answer that the schema of the pageData property is not constant. This is another case where Glue Crawlers unfortunately don't really work. If you're unlucky you'll end up with a flapping schema that includes some properties some days (see for example this question).
What I realised when I saw your comment is that there is another solution to your problem: set up the table manually (as described above) and use string as the type for the pagedata column. Then you can use functions like JSON_EXTRACT_SCALAR to extract the properties you want at query time.
This solution trades increased complexity of the queries for way fewer headaches trying to keep up with an evolving schema.
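A usage sketch of that approach, assuming pagedata was declared as a plain string column:
-- json_extract_scalar pulls one property out of the raw JSON text at
-- query time, so the schema of pagedata can evolve freely.
SELECT userid,
       json_extract_scalar(pagedata, '$.url') AS page_url
FROM json_case_test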

Query rows modified in certain timespan

Is it possible to query only rows that were modified in a certain timespan?
The table in question is generated by Google Analytics, and it doesn't have an obvious way to query something like this (for example, a last_modified timestamp or something similar).
As others have commented, BigQuery does not have a concept of updated rows; it is append-only.
If you want to get newly inserted rows for a given timespan, you could either use timestamps when inserting and query using that column, or use table decorators [1].
[1] https://cloud.google.com/bigquery/table-decorators
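A legacy SQL sketch of the decorator approach; the project, dataset, and table names are placeholders, and note that decorators only reach back over a limited window:
#legacySQL
-- @-3600000- is a relative range decorator: rows added from one hour
-- ago (3,600,000 ms) up to now.
SELECT *
FROM [yourproject:yourdataset.yourtable@-3600000-]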

Dynamically Querying Multiple Tables In BigQuery

I have a BigQuery database where daily data is uploaded into its own table. So I have tables named "20131201", "20131202", etc. I can write a fixed query to "merge" those tables by doing:
SELECT * FROM db.20131201, db.20131202, ...
I'd like to have a single query that does not require me to update the Custom SQL every time a new table is added. Something like:
SELECT * FROM db.*
Which currently doesn't work. I would like to avoid making one giant table. Is there a work-around that I can do, or will this have to be a feature request?
End-goal is for a Tableau data connection to all the tables.
This isn't exactly what you've asked for, but I've managed to use the table wildcard functions (https://developers.google.com/bigquery/query-reference#tablewildcardfunctions), in particular
TABLE_DATE_RANGE(prefix, timestamp1, timestamp2)
to achieve a similar result for use in Tableau. You'll still need to provide two date parameters, but it's substantially better than dynamically generating the FROM clause.
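A hedged legacy SQL sketch against the table layout from the question (dataset db, tables named by date with an empty prefix):
#legacySQL
-- Unions every table named <prefix>YYYYMMDD in the range; here the
-- prefix is just the dataset, since the tables are named by date alone.
SELECT *
FROM (TABLE_DATE_RANGE([db.],
                       TIMESTAMP('2013-12-01'),
                       TIMESTAMP('2013-12-31')))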
Hope this helps.
As of now, this kind of dynamic SQL (like "EXECUTE SQL" in MS SQL Server) is not available in Google BigQuery. Surely Google will look into this, I believe :)

Question on how to use SQL Server Integration Services

I have a table called book; its attributes are booked_id, yearmon, and day_01 ... day_31. I need to unpivot the table and transform day_01 ... day_31 into rows, which I have succeeded in doing. The problem is that my yearmon has a format like 200805, and I need to append a day to it based on day_01, day_02, etc., so that I can create a new column with date information. For example, for day_01 it should look like 20080501. Instead of writing a huge query, does anyone know how to do this transformation in SSIS?
You should be able to use the Unpivot component and the Derived Column component to do what you need. Look into those and post back if they don't seem to do what you need.
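If a plain T-SQL alternative to the SSIS components is acceptable, here is a minimal sketch; the value column name is made up, and the column types are assumed:
-- UNPIVOT exposes the source column name ('day_01', ...) in day_col,
-- so its last two characters are the day number to append to yearmon.
SELECT booked_id,
       CAST(yearmon AS varchar(6)) + RIGHT(day_col, 2) AS date_yyyymmdd,
       day_value
FROM book
UNPIVOT (day_value FOR day_col IN (day_01, day_02, day_03 /* ... day_31 */)) AS u;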