BigQuery: (backup) copy of sharded Google Analytics tables - google-bigquery

When Google Analytics is exported to BigQuery, the data is written to so-called sharded tables, one for each day. They all start with ga_sessions_ followed by a date suffix.
I want to make a backup copy of these sharded tables.
How do I do that?

If you want to back up the tables to a Cloud Storage bucket, you can try the following.
Query the table metadata to get the tables to export.
SELECT
table_name
FROM
`MyDataSet.INFORMATION_SCHEMA.TABLES`
WHERE
table_name LIKE 'ga_sessions_%'
Use the BigQuery EXPORT DATA statement to export the tables to the bucket.
-- If the tables are nested use JSON/Avro/Parquet,
-- but be aware of the data type conversions:
-- https://cloud.google.com/bigquery/docs/exporting-data#exporting_data_stored_in
-- (header and field_delimiter only apply to CSV exports, so they are omitted here)
EXPORT DATA OPTIONS(
  uri='gs://bucket/folder/ga_sessions_<date>_*.json',
  format='JSON',
  overwrite=true) AS
SELECT * FROM mydataset.ga_sessions_<date>
Put it together in a BQ script with a loop, using FORMAT to create the query and EXECUTE IMMEDIATE to run the query.
BEGIN
DECLARE backup_date STRING DEFAULT CAST(CURRENT_DATE('UTC') AS STRING);
FOR record IN
(
SELECT
table_name
FROM
`MyDataSet.INFORMATION_SCHEMA.TABLES`
WHERE
table_name LIKE 'ga_sessions_%')
DO
EXECUTE IMMEDIATE
FORMAT("""
EXPORT DATA
OPTIONS(
  uri=CONCAT('gs://your_backup_bucket/path/to/folder/', '%s', '/', '%s', '_*.json'),
  format='JSON',
  overwrite=true)
AS SELECT * FROM my_project.my_data_set.%s
"""
, backup_date, record.table_name, record.table_name);
END FOR;
END;
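The script can be pasted into the BigQuery console, or run from a shell by piping it into the bq CLI (assuming it is saved as backup_script.sql; the file name is a placeholder):
# Run the multi-statement backup script with the bq CLI
bq query --use_legacy_sql=false < backup_script.sql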
backup_date is used to create a 'folder' in the bucket, named after the export date, that holds the exported tables.
The * in the URI allows a table to be exported into multiple files; this only matters if the exported table is bigger than 1 GB.
Set a lifecycle rule on your storage bucket to archive files after an appropriate time, or set the bucket's default storage class to Archive if it is only used for backups (accessed less than once a year; see storage classes).
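For example, a lifecycle rule that moves objects to the Archive storage class 30 days after creation could be applied with gsutil roughly like this (the bucket name and the age are placeholders):
# Lifecycle config: move objects to Archive storage 30 days after creation
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {"action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"}, "condition": {"age": 30}}
  ]
}
EOF
gsutil lifecycle set lifecycle.json gs://your_backup_bucket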
Props to Tim Lou for this article on using table metadata.

This answer relates to creating the backup in BQ:
So a sharded table such as ga_sessions_ basically consists of many tables, one for each day. We need to copy all of those tables separately.
The answer below is copied from the following article, and all respect goes to the author:
https://medium.com/@Nayana22/playing-with-sharded-tables-in-bigquery-123e1ec5e453
So you can do as follows:
Get the list of all the days that you wish to copy by running the SQL command below. [You can modify this as per your need.]
SELECT
REPLACE(STRING_AGG(CONCAT('"', partition_name, '"') ORDER BY partition_name), ",", " ")
FROM
(
SELECT
DISTINCT date AS partition_name
FROM
`[Project_ID].[DATASET_NAME].ga_sessions_*`
)
Output:
"20180322" "20180323" "20180324" "20180325" "20180326" "20180327"...
Now, go to the Google Cloud console, select the project, and click Activate Cloud Shell.
Check the current project using echo $DEVSHELL_PROJECT_ID. If you’re in an incorrect project then change it and move to step 4.
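For example (the project ID below is a placeholder):
# Show the project Cloud Shell currently points at
echo $DEVSHELL_PROJECT_ID
# Switch projects if needed
gcloud config set project my-backup-project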
Create a bash script and specify the days that you just got from the above query.
tables=("20180322" "20180323" "20180324" "20180325" "20180326" "20180327"…)
Iterate through all the days in the tables variable and use BigQuery's copy command to copy each table from source to destination.
Syntax: [Note: There is a space between the source and destination table name]
bq cp -a source_table destination_table
Our script file looks like this:
tables=("20180322" "20180323" "20180324" "20180325" "20180326" "20180327"…)
for val in ${tables[*]}; do
bq cp -a [source_dataset].ga_sessions_$val [destination_dataset].ga_sessions_backup_$val
done
How to validate whether all the tables are copied or not?
WITH first_ga_session AS (
SELECT MIN(PARSE_DATE("%Y%m%d", REGEXP_EXTRACT(table_id, '20[0-9]{6}'))) AS day
FROM `[PROJECT_ID].[DATASET_NAME].__TABLES__` AS ga_tables
WHERE table_id LIKE 'ga_sessions_backup_2%'
),
all_days AS (
SELECT period
FROM
UNNEST(GENERATE_DATE_ARRAY((SELECT day from first_ga_session),
CURRENT_DATE())) AS period
),
available_ga_sessions AS (
SELECT PARSE_DATE("%Y%m%d", REGEXP_EXTRACT(table_id, '20[0-9]{6}')) AS ga_day
FROM `[PROJECT_ID].[DATASET_NAME].__TABLES__` AS ga_tables
WHERE table_id LIKE 'ga_sessions_backup_2%'
)
SELECT A.period AS Day, B.ga_day AS Available_session
FROM all_days A
LEFT JOIN available_ga_sessions B
ON A.period = B.ga_day
WHERE B.ga_day IS NULL
The above query will give us all the days that are missing from the destination dataset.
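Alternatively, a rough shell-side sanity check is to simply count the copied tables in the destination dataset (the dataset name is a placeholder):
# Count the backup shards present in the destination dataset
bq ls -n 10000 destination_dataset | grep -c 'ga_sessions_backup_'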
My bash script ended up looking like this (note I'm cloning the tables with the --clone flag):
tables=("20220331" "20220401")
bq --location=eu mk --dataset destination_project_id:destination_dataset
for val in ${tables[*]}; do
echo source_project_id:source_dataset.ga_sessions_$val
bq cp --clone source_project_id:source_dataset.ga_sessions_$val destination_project_id:destination_dataset.ga_sessions_backup_$val
done

Related

Schema change in Delta table - How to remove a partition from the table schema without overwriting?

Given a Delta table:
CREATE TABLE IF NOT EXISTS mytable (
...
)
USING DELTA
PARTITIONED BY (part_a, part_b, part_c)
LOCATION '/some/path/'
This table already has tons of data. However, the desired schema is:
CREATE TABLE IF NOT EXISTS mytable (
...
)
USING DELTA
PARTITIONED BY (part_a, part_b) -- <<-- ONLY part_a and part_b for partitions, i.e., removing part_c
LOCATION '/some/path/'
How can this schema change be achieved?
I eventually took the following approach:
Back up the original table to be on the safe side
spark.read.table it into memory
df.write.option("overwriteSchema", "true") it back to the original location
I chose this approach so I don’t need to change the original data location.
In more details:
1. Back up the original table to be on the safe side
Since this was on Databricks, I could use its proprietary deep clone feature:
create table mydb.mytable_backup_before_schema_migration_v1
deep clone mydb.mytable
location 'dbfs:/mnt/defaultDatalake/backups/zones/mydb/mytable_backup_before_schema_migration_v1'
If you are not on Databricks and don't have access to its deep clone, you can still back up the table by reading it and writing a copy to another place, as sketched below.
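For example, a minimal sketch using the Spark SQL CLI and a plain CTAS copy (the backup table name is a placeholder; this copies the current data only, not the table history):
# Copy the Delta table into a new backup table
spark-sql -e "CREATE TABLE mydb.mytable_backup USING DELTA AS SELECT * FROM mydb.mytable"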
2. read and 3. overwrite with new schema
val df = spark.read.format("delta").table("mydb.mytable")
df
.write
.format("delta")
.mode("overwrite")
.partitionBy("part_a", "part_b")
.option("overwriteSchema", "true")
.saveAsTable("mydb.mytable") // same table, same location, but different data physical organization because partition changes

Select count(*) from Table, Select * from Table doesn't yield any output

I am trying to build a managed table (ORC formatted, bucketed, and with the transactional table property set to true) on which I can run UPDATE/INSERT statements in Hive.
I am running this whole setup on AWS EMR; the Hive version is 2.4.3 and the default directory to store the data is S3.
I am able to populate the table from another external table.
However, I am getting zero for SELECT COUNT(*) and no output for SELECT *.
I dropped the table, recreated it, and repopulated the data.
ANALYZE TABLE TABLE-NAME COMPUTE STATISTICS gives proper output.

BigQuery Equivalent of "CREATE TABLE my_table (LIKE your_table)"

I want to create a table whose schema is exactly the same as another table's. In other SQL engines, I think I was able to use "CREATE TABLE my_table (LIKE your_table)" or some variation.
I couldn't find the equivalent in BigQuery yet. Is this possible in some fashion?
Use this form:
CREATE TABLE dataset.new_table AS
SELECT *
FROM dataset.existing_table
LIMIT 0
This creates a new table with the same schema as the old one, and there is no cost due to the LIMIT 0.
Note that this does not preserve partitioning, table description, etc., however. Another option is to use the CLI (or API), making a copy of the table and then overwriting its contents, e.g.:
$ bq cp dataset.existing_table dataset.new_table
$ bq query --use_legacy_sql --replace --destination_table=dataset.new_table \
"SELECT * FROM dataset.new_table LIMIT 0;"
Now the new table has the same structure and attributes as the original did.
To create a partitioned and/or clustered table the syntax would be:
CREATE TABLE project.dataset.clustered_table
PARTITION BY DATE(created_time)
CLUSTER BY
account_id
AS SELECT * FROM project.dataset.example_table LIMIT 0

Creating partitioned external table in bigquery

I wanted to create an external table in BigQuery which loads data from Google Cloud Storage.
When creating the table from the web UI, the Partitioning Type option is disabled.
Is there any way I can create a partitioned external table?
My data is already partitioned by date on GCS.
Ex: /somepath/data/dt=2018-03-22
Federated tables in GCS automatically act as partitioned tables when you use the file name as a variable (_FILE_NAME).
For example, this view parses the file name into a native TIMESTAMP column:
#standardSQL
CREATE VIEW `fh-bigquery.views.wikipedia_views_test_ddl`
AS SELECT
PARSE_TIMESTAMP('%Y%m%d-%H%M%S', REGEXP_EXTRACT(_FILE_NAME, '[0-9]+-[0-9]+')) datehour
, _FILE_NAME filename
, line
FROM `fh-bigquery.views.wikipedia_views_gcs`
Later I can write queries like:
#standardSQL
SELECT *
FROM `fh-bigquery.views.wikipedia_views_test_ddl`
WHERE EXTRACT(YEAR FROM datehour)=2015
AND EXTRACT(MONTH FROM datehour)=10
AND EXTRACT(DAY FROM datehour)=21
AND EXTRACT(HOUR FROM datehour)=7
... and these queries will only open the files with names that match this pattern.
I wrote a whole story about this at https://medium.com/google-cloud/bigquery-lazy-data-loading-ddl-dml-partitions-and-half-a-trillion-wikipedia-pageviews-cd3eacd657b6.
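For reference, the underlying federated table (wikipedia_views_gcs in the example above) can be defined over the GCS files with the bq CLI; a rough sketch, where the bucket path, source format, and dataset are placeholders:
# Build an external table definition over the files in GCS and create the federated table
bq mkdef --autodetect --source_format=CSV "gs://my_bucket/somepath/data/*" > table_def.json
bq mk --external_table_definition=table_def.json mydataset.wikipedia_views_gcs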

Export the data from a hive/impala table with few conditions into file

What is an efficient way to export data from a Hive/Impala table, with conditions, into a file (the data would be huge, close to 10 GB)? The format of the Hive table is Parquet with Snappy compression, and the output file should be CSV.
The table is partitioned daily and the data needs to be extracted on a daily basis. I would like to know whether
1) Impala approach
impala-shell -k -i servername:portname -B -q 'select * from table where year_month_date=$$$$$$$$' -o filename '--output_delimiter=\001'
2) Hive approach
Insert overwrite directory '/path' select * from table where year_month_date=$$$$$$$$
would be efficient
Assuming tbl is your Hive Parquet table and condition is your filter condition.
CTAS command:
CREATE TABLE tbl_text ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/tmp/data' AS select * from tbl where condition;
You will find your CSV text file (delimited by ',') at /tmp/data in HDFS.
You can get this file to your local file system if needed using:
hadoop fs -get /tmp/data
Please try to use dynamic partitioning for your Hive/Impala table to efficiently export the data conditionally.
Partition your table on the columns of interest, based on your queries, for best results.
Step 1: Create a Temporary Hive Table TmpTable and load your raw data into it
Step 2: Set hive parameters to support Dynamic partition
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true;
Step 3: Create your main Hive table with partition columns, for example:
CREATE TABLE employee (
emp_id int,
emp_name string
)
PARTITIONED BY (location string)
STORED AS PARQUET;
Step 4: Load data from Temporary table to your employee table (Main Table)
insert overwrite table employee partition(location)
select emp_id,emp_name, location from TmpTable;
Step 5: Export the data from Hive with a condition
INSERT OVERWRITE DIRECTORY '/path/to/output/dir' SELECT * FROM employee WHERE location='CALIFORNIA';
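Note that INSERT OVERWRITE DIRECTORY writes Ctrl-A delimited text by default; if a comma-delimited file is needed, the row format can be specified (Hive 0.11+). A sketch via the Hive CLI, with the path and filter taken from the example above:
hive -e "INSERT OVERWRITE DIRECTORY '/path/to/output/dir'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM employee WHERE location='CALIFORNIA';"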
Please refer to this link:
Dynamic Partition Concept
Hope this is useful.