Creating a partitioned external table in BigQuery - google-bigquery

I want to create an external table in BigQuery that loads data from Google Cloud Storage.
When creating the table from the Web UI, the Partitioning Type option is disabled.
Is there any way I can create a partitioned external table?
My data is already partitioned by date on GCS.
Ex: /somepath/data/dt=2018-03-22

Federated tables in GCS automatically act as partitioned tables when you use the file name as a variable (_FILE_NAME).
For example, this view transforms the file name into a native date:
#standardSQL
CREATE VIEW `fh-bigquery.views.wikipedia_views_test_ddl`
AS SELECT
PARSE_TIMESTAMP('%Y%m%d-%H%M%S', REGEXP_EXTRACT(_FILE_NAME, '[0-9]+-[0-9]+')) datehour
, _FILE_NAME filename
, line
FROM `fh-bigquery.views.wikipedia_views_gcs`
Later I can write queries like:
#standardSQL
SELECT *
FROM `fh-bigquery.views.wikipedia_views_test_ddl`
WHERE EXTRACT(YEAR FROM datehour)=2015
AND EXTRACT(MONTH FROM datehour)=10
AND EXTRACT(DAY FROM datehour)=21
AND EXTRACT(HOUR FROM datehour)=7
... and these queries will only open the files with names that match this pattern.
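For the dt=YYYY-MM-DD layout in the question, a minimal sketch of the same idea (the project, dataset, and table names here are placeholders, assuming an external table already defined over gs://somepath/data/*):
#standardSQL
CREATE VIEW `your_project.your_dataset.data_partitioned`
AS SELECT
  *
  -- recover the partition date from the file path, e.g. /somepath/data/dt=2018-03-22/file.csv
  , PARSE_DATE('%Y-%m-%d', REGEXP_EXTRACT(_FILE_NAME, r'dt=([0-9]{4}-[0-9]{2}-[0-9]{2})')) AS dt
  , _FILE_NAME AS filename
FROM `your_project.your_dataset.data_external`
Filtering on dt in queries against this view should then only read the files under the matching dt= folders, just like the wikipedia example above.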
I wrote a whole story about this at https://medium.com/google-cloud/bigquery-lazy-data-loading-ddl-dml-partitions-and-half-a-trillion-wikipedia-pageviews-cd3eacd657b6.

Related

BigQuery: (backup) copy of sharded Google Analytics tables

When Google Analytics data is exported to BigQuery, it is put into so-called sharded tables, one for each day. They all start with ga_sessions_ followed by a date suffix.
I want to make a backup copy of these sharded tables.
How do I do that?
If you want to back up the tables to a Cloud Storage bucket, you can try the following.
Query the table metadata to get the tables to export.
SELECT
table_name
FROM
`MyDataSet.INFORMATION_SCHEMA.TABLES`
WHERE
table_name LIKE 'ga_sessions_%'
Use the BigQuery EXPORT DATA statement to export the tables to the bucket.
-- If the tables are nested use json/avro/parquet
-- But be aware of the data type conversions:
-- https://cloud.google.com/bigquery/docs/exporting-data#exporting_data_stored_in
EXPORT DATA OPTIONS(
uri='gs://bucket/folder/ga_sessions_<date>_*.json',
format='JSON',
overwrite=true,
header=true,
field_delimiter=';') AS
SELECT * FROM mydataset.ga_sessions_<date>
Put it together in a BQ script with a loop, using FORMAT to create the query and EXECUTE IMMEDIATE to run the query.
BEGIN
  DECLARE backup_date STRING DEFAULT CAST(CURRENT_DATE('UTC') AS STRING);
  FOR record IN
  (
    SELECT
      table_name
    FROM
      `MyDataSet.INFORMATION_SCHEMA.TABLES`
    WHERE
      table_name LIKE 'ga_sessions_%')
  DO
    EXECUTE IMMEDIATE
      FORMAT("""
        EXPORT DATA
        OPTIONS(
          uri=CONCAT('gs://your_backup_bucket/path/to/folder/', '%s', '/', '%s', '_*.json'),
          format='JSON',
          overwrite=true,
          header=true,
          field_delimiter=';')
        AS SELECT * FROM my_project.my_data_set.%s
        """,
        backup_date, record.table_name, record.table_name);
  END FOR;
END;
backup_date is used to create a 'folder' with the export date as a name in the bucket for the tables.
The * in the URI allows a table to be exported into multiple files. This only matters if the exported table is bigger than 1 GB (see here).
Set a lifecycle rule on your storage bucket to archive files after an appropriate time, or set it to archive by default if it's only for backup purposes (accessed less than once a year, see storage classes).
Props to Tim Lou for this article on using table meta data.
This answer relates to creating the backup in BQ:
So a sharded table such as ga_sessions_ basically consists of many tables, one for each day. We need to copy all those tables separately.
The answer below is a copy from the article below and all respect goes to the author:
https://medium.com/@Nayana22/playing-with-sharded-tables-in-bigquery-123e1ec5e453
So you can do as follows:
Take the list of all the days that you wish to copy by running the SQL command below. [You can modify this as per your need]
SELECT
  REPLACE(STRING_AGG(CONCAT('"', partition_name, '"') ORDER BY partition_name), ",", " ")
FROM
(
  SELECT
    DISTINCT date AS partition_name
  FROM
    `[Project_ID].[DATASET_NAME].ga_sessions_*`
  ORDER BY PARSE_DATE('%Y%m%d', date)
)
Output:
"20180322" "20180323" "20180324" "20180325" "20180326" "20180327"...
Now, go to google console, select the project and click on Activate Cloud Shell.
Check the current project using echo $DEVSHELL_PROJECT_ID. If you’re in an incorrect project then change it and move to step 4.
Create a bash script and specify the days that you just got from the above query.
tables=("20180322" "20180323" "20180324" "20180325" "20180326" "20180327"…)
Iterate through all the days available in tables variable and use BigQuery’s copy command to move a table from source to destination.
Syntax: [Note: There is a space between the source and destination table name]
bq cp -a source_table destination_table
Our script file looks like this:
tables=("20180322" "20180323" "20180324" "20180325" "20180326" "20180327"…)
for val in ${tables[*]}; do
bq cp -a [source_dataset].ga_sessions_$val [destination_dataset].ga_sessions_backup_$val
done
How to validate whether all the tables are copied or not?
WITH first_ga_session AS (
  SELECT MIN(PARSE_DATE("%Y%m%d", REGEXP_EXTRACT(table_id, '20[0-9]{6,6}'))) AS day
  FROM `[PROJECT_ID].[DATASET_NAME].__TABLES__` AS ga_tables
  WHERE table_id LIKE 'ga_sessions_backup_2%'
),
all_days AS (
  SELECT period
  FROM UNNEST(GENERATE_DATE_ARRAY((SELECT day FROM first_ga_session), CURRENT_DATE())) AS period
),
available_ga_sessions AS (
  SELECT PARSE_DATE("%Y%m%d", REGEXP_EXTRACT(table_id, '20[0-9]{6,6}')) AS ga_day
  FROM `[PROJECT_ID].[DATASET_NAME].__TABLES__` AS ga_tables
  WHERE table_id LIKE 'ga_sessions_backup_2%'
)
SELECT A.period AS Day, B.ga_day AS Available_session
FROM all_days A
LEFT JOIN available_ga_sessions B
  ON A.period = B.ga_day
WHERE B.ga_day IS NULL
The above query will give us all the days that are missing in the destination table.
My bash script ended up looking like this (note I'm using cloning with the --clone flag):
tables=("20220331" "20220401")
bq --location=eu mk --dataset destination_project_id:destination_dataset
for val in ${tables[*]}; do
echo source_project_id:source_dataset.ga_sessions_$val
bq cp --clone source_project_id:source_dataset.ga_sessions_$val destination_project_id:destination_dataset.ga_sessions_backup_$val
done

AWS Athena: partition by multiple columns in the same path

I am trying to create a table in Athena based on a directory in S3 that looks something like this:
folders/
  id=1/
    folder1/
    folder2/
    folder3/
      dt=***/
      dt=***/
  id=2/
    ...
I want to partition by two columns. One is the id, and one is the dt.
So eventually I want my table to have an id column, and for each id, all of the dt's in its sub-folder folder3. Is there any solution for this that doesn't force me to have a path like this: ...\id=\dt=?
I tried simply setting these two columns in the "partition by" section, with the location set to the "folders" path, but then the table has no data.
I then tried using injection and setting a specific id in a WHERE clause when querying the table, but then the table contains data I don't need, and it seems the partitioning doesn't work as I expected.
Table DDL:
CREATE EXTERNAL TABLE IF NOT EXISTS `database`.`test_table` (
  `col1` string,
  `col2` string
) PARTITIONED BY (
  id string,
  dt string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
) LOCATION 's3://folders/'
Appreciate any help!
You can "manually" add the partitions using something like
alter table your_table add if not exists
partition (id=1, dt=0)
location '/id=1/folder3/dt=0/'
partition (id=1, dt=1)
location 'id=1/folder3/dt=1'
...
You can programmatically add all your partitions on S3 this way: use the AWS CLI to list all the folders, loop over them, and add each one to the table with a query like the one above (see the docs).
An alternative is to use partition projection with custom storage locations, which has the benefit of giving you faster queries and removes the need to manually add new partitions when new data arrives in S3 (see the partition projection docs, especially the section on custom S3 locations).
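For reference, a rough sketch of what that could look like for the layout in the question. The dt format, the date range, and the use of an injected id are all assumptions to adjust; an injected projection means each query must pin id to a literal value in the WHERE clause, which matches the injection approach you tried:
CREATE EXTERNAL TABLE IF NOT EXISTS `database`.`test_table_projected` (
  `col1` string,
  `col2` string
) PARTITIONED BY (
  id string,
  dt string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
) LOCATION 's3://folders/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.id.type' = 'injected',
  'projection.dt.type' = 'date',
  'projection.dt.format' = 'yyyy-MM-dd',
  'projection.dt.range' = '2018-01-01,NOW',
  'storage.location.template' = 's3://folders/id=${id}/folder3/dt=${dt}/'
);

-- Example query: with an injected id, the id must be supplied as a literal
SELECT col1, col2
FROM `database`.`test_table_projected`
WHERE id = '1' AND dt >= '2018-03-01';
With projection enabled, Athena derives the partitions and their S3 locations from these properties instead of the metastore, so no ALTER TABLE ADD PARTITION calls are needed as new dt folders arrive.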

Can we add column to an existing table in AWS Athena using SQL query?

I have a table in AWS Athena which contains 2 records. Is there a SQL query with which a new column can be added to the table?
You can find more information about adding columns to a table in the Athena documentation.
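For instance, a minimal sketch using ALTER TABLE ... ADD COLUMNS (the column name is just illustrative, and sample_test is the same example table used with CTAS below); it appends the new column at the end of the schema, before any partition columns:
ALTER TABLE sample_test ADD COLUMNS (new_col string)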
Or you can use CTAS
For example, you have a table with
CREATE EXTERNAL TABLE sample_test(
id string)
LOCATION
's3://bucket/path'
and you can create another table from sample_test with the query
CREATE TABLE new_test
AS
SELECT *, 'new' AS new_col FROM sample_test
You can use any valid SELECT query after AS
This is mainly for future readers like me, who were struggling to get this working for a Hive table with AVRO data without creating a new table, i.e. by updating the schema of the existing table. It works for CSV using 'add columns', but not for Hive + AVRO. For Hive + AVRO, to append columns at the end, before the partition columns, the solution is available at this link. However, there are a couple of things to note: we need to pass the full schema to the literal attribute, not just the changes; and (not sure why, but) we had to alter the Hive table with all 3 statements in the same order - 1. add the columns using add columns, 2. set tblproperties, and 3. set serdeproperties. Hopefully it helps someone.
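A rough sketch of that three-step sequence, assuming the table's schema is defined via the avro.schema.literal property (the table name is a placeholder, and the literal must be the full updated Avro schema JSON, not just the added field):
-- 1. Append the new column(s) at the end, before partition columns
ALTER TABLE my_avro_table ADD COLUMNS (new_col string);

-- 2. Update the table properties with the full new Avro schema
ALTER TABLE my_avro_table SET TBLPROPERTIES (
  'avro.schema.literal' = '<full updated Avro schema JSON>'
);

-- 3. Update the SerDe properties with the same full schema
ALTER TABLE my_avro_table SET SERDEPROPERTIES (
  'avro.schema.literal' = '<full updated Avro schema JSON>'
);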

Why can't Hive select data from HDFS when using a partition?

I use Flume to write data to HDFS, with paths like /hive/logs/dt=20151002. Then I use Hive to select the data, but the count returned is always 0.
Here is my create table SQL: CREATE EXTERNAL TABLE IF NOT EXISTS test (id STRING) partitioned by (dt string) ROW FORMAT DELIMITED fields terminated by '\t' lines terminated by '\n' STORED AS TEXTFILE LOCATION '/hive/logs'
Here is my select SQL: select count(*) from test
It seems that you are not registering the partition in the Hive metastore.
Although the partition directory is present in the HDFS path, Hive won't know about it if it is not registered in the metastore. To register it you can do the following:
ALTER TABLE test ADD PARTITION (dt='20151002') location '/hive/logs/dt=20151002';
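Since the Flume output already follows the dt=<value> directory naming convention, a sketched alternative is to let Hive discover every missing partition at once:
-- scans the table location (/hive/logs) for dt=... directories
-- and registers any partitions missing from the metastore
MSCK REPAIR TABLE test;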

External table does not return the data in its folder

I have created an external table in Hive at this location:
CREATE EXTERNAL TABLE tb
(
...
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/cloudera/data';
The data is present in the folder but when I query the table, it returns nothing. The table is structured in a way that it fits the data structure.
SELECT * FROM tb LIMIT 3;
Is there a kind of permission issue with Hive tables: do specific users have permissions to query some tables?
Do you know some solutions or workarounds?
You have created your table as a partitioned table based on the column datehour, but you are putting your data in /user/cloudera/data. Hive will look for data in /user/cloudera/data/datehour=(some int value). Since it is an external table, Hive will not update the metastore automatically. You need to run an ALTER statement to update it.
So here are the steps for external tables with a partition:
1.) In your external location /user/cloudera/data, create a directory datehour=0909201401
OR
Load data using: LOAD DATA [LOCAL] INPATH '/path/to/data/file' INTO TABLE tb PARTITION (datehour=0909201401)
2.) After creating your table, run an ALTER statement:
ALTER TABLE tb ADD PARTITION (datehour=0909201401)
Hope it helps...!!!
When we create an EXTERNAL TABLE with PARTITION, we have to ALTER the EXTERNAL TABLE with the data location for that given partition. However, it need not be the same path as we specify while creating the EXTERNAL TABLE.
hive> ALTER TABLE tb ADD PARTITION (datehour=0909201401)
    > LOCATION '/user/cloudera/data/somedatafor_datehour';
When we specify LOCATION '/user/cloudera/data' (though it's optional) while creating an EXTERNAL TABLE, we can take advantage of repair operations on that table. So when we copy files into that directory through some process like ETL, we can sync up the partitions with the EXTERNAL TABLE instead of writing an ALTER TABLE statement to create each new partition.
If we already know the directory structure of the partition that HIVE would create, we can simply place the data file in that location like '/user/cloudera/data/datehour=0909201401/data.txt' and run the statement as shown below:
hive> MSCK REPAIR TABLE tb;
The above statement will sync up the partition to the Hive metastore of the table "tb".