I am streaming data into BigQuery. Every day I run a scheduled Dataprep job that takes the last 24 hours of data, modifies some of it, and writes the result to a new table in the same BigQuery dataset.
The original table though stays unmodified and keeps on gathering data.
What I would like to do is delete all rows in the original table after Dataprep makes its copy, so that the table only gathers the next 24 hours of streamed data.
How can I automate this? I can't find anything in Dataprep that drops the original table and creates a new one.
You can do this by setting up your table as a partitioned table, since you are ingesting data constantly.
One option is to delete a partition manually:
bq rm '[YOUR_DATASET].[YOUR_TABLE]$xxxxxxx'
With a partition expiration you can set how long, in seconds, each partition is kept before its data is deleted:
bq update --time_partitioning_expiration [INTEGER] [YOUR_PROJECT_ID]:[YOUR_DATASET].[YOUR_TABLE]
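As a rough sketch of how the two commands above could be filled in from a script: the helper below builds the partition decorator for a given day and converts a retention period in days to the seconds value the expiration flag expects. It assumes a day-partitioned table, whose partitions are addressed as `table$YYYYMMDD`. The function names and the dataset, table, and project names are hypothetical placeholders.

```python
from datetime import date, timedelta


def partition_decorator(dataset: str, table: str, day: date) -> str:
    """Return 'dataset.table$YYYYMMDD', addressing one daily partition."""
    return f"{dataset}.{table}${day:%Y%m%d}"


def expiration_seconds(days: int) -> int:
    """Convert a retention period in days to seconds."""
    return days * 24 * 60 * 60


if __name__ == "__main__":
    # Hypothetical names; replace with your own dataset, table, and project.
    yesterday = date.today() - timedelta(days=1)
    target = partition_decorator("my_dataset", "my_table", yesterday)
    # Single quotes keep the shell from expanding the '$' in the decorator.
    print(f"bq rm '{target}'")
    print(
        "bq update --time_partitioning_expiration "
        f"{expiration_seconds(1)} my-project:my_dataset.my_table"
    )
```

The script only prints the commands so they can be reviewed before anything is deleted.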
You could use a Scheduled Query to clear out the table:
https://cloud.google.com/bigquery/docs/scheduling-queries
Scheduled queries support DDL and DML, so you could schedule a query that deletes all the rows from this table, or drops the table entirely, every day at a specific time.
I have a table in BigQuery with 100 columns. Now I want to append more rows to it via a transfer, but the new CSV has only 99 columns. How should I proceed?
I tried creating a schema and adding that column as NULLABLE, but it didn't work.
I presume your CSV file is stored in a GCS bucket and that you are using the BigQuery Data Transfer Service to load it on a schedule.
You cannot load or append the data directly into the BigQuery table because of the schema mismatch.
As an alternative, create a staging table named staging_table_csv with 99 columns and schedule a Data Transfer Service job to load the CSV into this table in overwrite mode.
Then write a query that appends the contents of staging_table_csv to the target BigQuery table.
Query might look like this:
#standardSQL
INSERT INTO `project.dataset.target_table`
SELECT
*,
<DEFAULT_VALUE> AS COL100
FROM
`project.dataset.staging_table_csv`
Now schedule this query to run after the staging table is loaded.
Make sure to leave a buffer between the staging table load and the target table load. You can run a few trials to find a suitable buffer.
For example, if the transfer is scheduled at 12:00, schedule the target table load query at 12:05 or 12:10.
Note: Creating an extra staging table incurs storage costs, but since it is overwritten on each load, no historical data cost is incurred.
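The query above relies on `SELECT *` lining up with the target column order, which only holds if the missing column is the last one. As a hedged sketch, the helper below generates the same kind of INSERT statement with an explicit column list, filling any column absent from the staging table with NULL or a supplied default. The function name and the example column names are hypothetical.

```python
def build_append_query(target_table, staging_table, target_columns,
                       staging_columns, defaults=None):
    """Build an INSERT ... SELECT that pads columns missing from staging.

    defaults maps a missing column name to a SQL literal; columns without
    an entry are filled with NULL.
    """
    defaults = defaults or {}
    unknown = [c for c in staging_columns if c not in target_columns]
    if unknown:
        raise ValueError(f"staging columns not in target: {unknown}")

    select_items = []
    for col in target_columns:
        if col in staging_columns:
            select_items.append(f"`{col}`")
        else:
            select_items.append(f"{defaults.get(col, 'NULL')} AS `{col}`")

    column_list = ", ".join(f"`{c}`" for c in target_columns)
    return (
        f"INSERT INTO `{target_table}` ({column_list})\n"
        f"SELECT {', '.join(select_items)}\n"
        f"FROM `{staging_table}`"
    )


if __name__ == "__main__":
    # Hypothetical three-column example standing in for the 100-column table.
    print(build_append_query(
        "project.dataset.target_table",
        "project.dataset.staging_table_csv",
        ["a", "b", "c"],
        ["a", "c"],
    ))
```

Listing the columns explicitly keeps the append correct even when the missing column sits in the middle of the schema.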
It seems that the BigQuery CLI supports restoring tables in a dataset after they have been deleted by using BigQuery Time Travel functionality -- as in:
bq cp dataset.table@TIME_AGO_UNIX dataset.table
However, this assumes we know the names of the tables. I want to write a script to iterate over all the tables that were in the dataset at TIME_AGO_UNIX time.
How would I go about finding those tables at that time?
I have a daily ingestion of data into HDFS. From that data I generate Hive tables partitioned by date and one other column. One day is 130 GB of data. After generating the data, I run msck repair. Each msck run now takes more than 2 hours. As I understand it, msck scans the whole table (we have about 200 days of data) and then updates the metadata. My question is: is there a way to make msck scan only the last day of data and then update the metadata, to speed up the whole process? By the way, there is no small-files issue; I already merge the small files before running msck.
When you create an external table or repair/recover partitions with this configuration:
set hive.stats.autogather=true;
Hive scans each file in the table location to gather statistics, which can take a long time.
The solution is to switch it off before creating or altering the table or recovering partitions:
set hive.stats.autogather=false;
See these related tickets: HIVE-18743, HIVE-19489, HIVE-17478
If you need statistics, you can gather them only for the new partitions using
ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]
COMPUTE STATISTICS
See details here: ANALYZE TABLE
Also, if you know which partitions should be added, use ALTER TABLE ADD PARTITION; you can add many partitions in a single command.
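As a minimal sketch of the ALTER TABLE ADD PARTITION approach, the helper below generates one statement that registers all of a day's partitions at once. It assumes string-valued partition columns, values without quote characters, and data stored under the default partition directories. The table name and the partition column names (`dt`, `region`) are hypothetical.

```python
def add_partitions_statement(table, partitions):
    """Build one Hive ALTER TABLE statement adding several partitions.

    partitions is a list of dicts mapping partition column to value,
    e.g. [{"dt": "2020-01-01", "region": "eu"}].
    """
    if not partitions:
        raise ValueError("at least one partition is required")
    specs = []
    for part in partitions:
        pairs = ", ".join(f"{col}='{val}'" for col, val in part.items())
        specs.append(f"PARTITION ({pairs})")
    # Hive accepts multiple PARTITION clauses in a single ADD.
    return f"ALTER TABLE {table} ADD IF NOT EXISTS\n  " + "\n  ".join(specs) + ";"


if __name__ == "__main__":
    # Hypothetical table and partition values for the newest day only.
    new_day = "2020-01-01"
    parts = [{"dt": new_day, "region": r} for r in ("eu", "us")]
    print(add_partitions_statement("db.events", parts))
```

Because only the newest day's partitions are listed, the metastore update does not depend on how many days of history the table holds.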
I have a BQ wildcard query that merges a couple of tables with the same schema (company_*) into a new, single table (all_companies). (all_companies will be exported later into Google Cloud Storage)
I'm running this query using the BQ CLI with all_companies as the destination table and this generates a BQ Job (runtime: 20mins+).
The company_* tables are populated constantly using the streaming API.
I've read about BigQuery jobs, but I can't find any information about streaming behavior.
If I start the BQ CLI query at T0, the streaming API adds data to the company_* tables at T0+1min, and the BQ CLI query finishes at T0+20min, will the data added at T0+1min be present in my destination table or not?
As described here, the query engine looks at both the columnar storage and the streaming buffer, so the query can potentially see the streamed data.
It depends on what you mean by a runtime of 20 minutes+. If the query actually runs 20 minutes after you create the job, then all data in the streaming buffer by T0+20min will be included.
If, on the other hand, the job starts immediately and takes 20 minutes to complete, you will only see data that is in the streaming buffer at the moment the table is queried.
I submitted a BigQuery load job; it ran and returned with the status successful. But the data didn't make it into the destination table.
Here was the command that was run:
/usr/local/bin/bq load --nosynchronous_mode --project_id=ardent-course-601 --job_id=logsToBq_load_impressions_20140816_1674a956_6c39_4859_bc45_eb09db7ef99a --source_format=NEWLINE_DELIMITED_JSON dw_logs_impressions.impressions_20140816 gs://sm-uk-hadoop/queries/logsToBq_transformLogs/impressions/20140816/9307f6e3-0b3a-44ca-8571-7107c399998c/part* /opt/sm-analytics/projects/logsTobqMR/jsonschema/impressionsSchema.txt
I checked the job status of the job logsToBq_load_impressions_20140816_1674a956_6c39_4859_bc45_eb09db7ef99a. The input file count and size showed the correct number of input files and total size.
Does anyone know why the data didn't make it into the table even though the job is reported as successful?
In case this was not a mistake on our side, I ran the load job again with a different destination table, and this time the data made it into the destination table fine.
Thank you.
I experienced this recently with BigQuery in sandbox mode without a billing account.
In this mode the partition expiration is automatically set to 60 days. If you load rows whose partitioning column (e.g. date) holds a value more than 60 days old, they won't show up in the table. The load job still succeeds and reports the correct number of output rows.
This is very surprising, but I've confirmed via the logs that this is indeed the case.
Unfortunately, the detailed logs for this job, which ran on August 16, are no longer available. We're investigating whether this may have affected other jobs more recently. Please ping this thread if you see this issue again.
We had this issue in our system. The table was partitioned on a timestamp column and had a partition expiry of 30 days. When someone ingested data older than the partition expiry, the BigQuery load jobs completed successfully in Spark, but we saw no data in the ingestion tables, because it was deleted moments after it was ingested due to the partition expiry.
Please check your BigQuery table's partition expiry setting and the partition column values of the incoming data. If a value is older than the partition expiry allows, you won't see that data in the table; it will be deleted right after ingestion.
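The check described above can be approximated before loading. This sketch flags partition dates that would already be past the table's partition expiry when they arrive. It works at whole-day granularity and treats a partition as expired once its date plus the expiry period is on or before today; this is a simplification, not BigQuery's exact rule. The function names are hypothetical.

```python
from datetime import date, timedelta


def is_expired_on_arrival(partition_day, expiration_days, today):
    """True if a partition for partition_day is already past its expiry."""
    return partition_day + timedelta(days=expiration_days) <= today


def split_by_expiry(partition_days, expiration_days, today):
    """Split partition dates into (kept, dropped) lists, preserving order."""
    kept, dropped = [], []
    for day in partition_days:
        if is_expired_on_arrival(day, expiration_days, today):
            dropped.append(day)
        else:
            kept.append(day)
    return kept, dropped


if __name__ == "__main__":
    # Example: a 60-day expiry, as in sandbox mode, checked on a fixed date.
    today = date(2014, 8, 16)
    incoming = [date(2014, 1, 1), date(2014, 8, 1)]
    kept, dropped = split_by_expiry(incoming, 60, today)
    print("kept:", kept)
    print("would vanish after load:", dropped)
```

Running a check like this against the partition column of the incoming data shows whether a successful load will still leave the table empty.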