Remove Records in BigQuery Table By Partition Metadata?

Coming across an issue and wondering if anyone would be able to help. There is a designated table in our BQ project that hosts sales data, myproject_dataset.sales_table. This table is not partitioned by _PARTITIONTIME but by the date identifier in the sales files, Sales_Date, so I can't query data in this table by the day it was ingested, only by the date in the sales file.
A file was loaded into myproject_dataset.sales_table with incorrect data for a particular date, ex. 2022-10-19. The issue is that this file also contains records from previous dates, so executing the following command to remove the incorrect data won't solve the issue:
DELETE FROM myproject_dataset.sales_table
WHERE Sales_Date = '2022-10-19'
I queried using INFORMATION_SCHEMA.PARTITIONS to get the partition_ID of the incorrect file loaded into myproject_dataset.sales_table on the particular date, ex. 2022-10-19.
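For reference, a sketch of that metadata lookup (dataset and table names as in the question; which PARTITIONS columns are useful here is my assumption):
SELECT partition_id, total_rows, last_modified_time
FROM `myproject_dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'sales_table'
ORDER BY last_modified_time DESC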
Is there a way to delete records by partition metadata, ex. partition_ID in a BQ table?

Related

Is there a metadata table that shows information other than the column name/id/data type for a partitioned table

big query table metadata as seen on the UI console
Is there an info schema table or equivalent that shows the following metadata information about a partitioned table:
Partition range end,
Partition range interval,
Partition range start,
Partitioned by (eg day, hour, month, year)
I can see them in the console UI when clicking on the Details tab of the table, but cannot find which, if any, info schema tables contain them. I've looked in all the ones listed on the Google site but cannot see them there:
https://cloud.google.com/bigquery/docs/information-schema-intro
Is there a table in BQ where this is contained that's accessible?
At the moment the only way I can find to determine whether a table has been date-partitioned by, for example, month or day is to look at the length of the partition_id for each partition and work it out that way. It would be more useful/reliable if I could see the information as it's displayed in the console, where it shows partitioned by DAY, MONTH, YEAR etc.
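To illustrate that partition_id-length workaround (the dataset name is a placeholder; the length mapping assumes the documented partition_id formats YYYY, YYYYMM, YYYYMMDD and YYYYMMDDHH):
SELECT
  table_name,
  partition_id,
  CASE LENGTH(partition_id)
    WHEN 4 THEN 'YEAR'
    WHEN 6 THEN 'MONTH'
    WHEN 8 THEN 'DAY'
    WHEN 10 THEN 'HOUR'
  END AS inferred_granularity
FROM `your_dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE partition_id NOT IN ('__NULL__', '__UNPARTITIONED__')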
Use below
select *
from `your-project.your_dataset.INFORMATION_SCHEMA.TABLES` tbl
where regexp_contains(lower(ddl), 'partition by')
Below are the fields available to choose from; among them are table_catalog, table_schema, table_name, table_type, is_insertable_into, is_typed, creation_time and ddl.
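Building on that, a sketch that pulls the partition expression itself out of the generated DDL (the regex is an assumption about the DDL's layout):
select
  table_name,
  regexp_extract(ddl, r'PARTITION BY\s+([^\n]+)') as partition_expression
from `your-project.your_dataset.INFORMATION_SCHEMA.TABLES`
where regexp_contains(lower(ddl), 'partition by')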

Scheduled query to append data to partitioned BigQuery table - incompatible table partitioning specification

I am trying to append data to a table partitioned by month using the BQ Console.
The SQL used to create the table and partition was:
CREATE TABLE xxxxxx
PARTITION BY DATE_TRUNC(Event_Dt, MONTH)
I used Event_Dt as the partitioning field in the BQ Console.
The scheduled query does not run and I get the following error message:
"Incompatible table partitioning specification. Destination table exists with partitioning specification interval(type:MONTH,field:Event_Dt), but transfer target partitioning specification is interval(type:DAY,field:Event_Dt). Please retry after updating either the destination table or the transfer partitioning specification."
How do I enter Event_Dt in the BQ Console to indicate that it is partitioned by month and not day?
I solved my problem. All I needed to do was remove Event_Dt from the Destination table partitioning field in the BQ Console. The partitioned table updated successfully when I left the field blank.
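For context, a minimal sketch of a setup that avoids the mismatch (the project, dataset, table and amount column are hypothetical; Event_Dt is from the question). The MONTH spec lives on the destination table, and the scheduled query itself carries no partitioning:
-- Destination table created once, carrying the MONTH partition spec
CREATE TABLE `your_project.your_dataset.events`
(
  Event_Dt DATE,
  amount NUMERIC
)
PARTITION BY DATE_TRUNC(Event_Dt, MONTH);
-- Scheduled query body (destination table set in the console, partitioning field left blank)
SELECT Event_Dt, amount
FROM `your_project.your_dataset.events_staging`;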

Automatically add date for each day in SQL

I'm working on BigQuery and have created a view using multiple tables. Each day the data needs to be synced with multiple platforms. I need to insert a date or some other field via SQL through which I can identify which rows were added to the view each day, or which rows got updated, so that I only take that data forward each day instead of syncing everything every day. The best way I can think of is to somehow add the current date wherever a row is updated, but that date needs to stay constant until a further update happens for that record.
Ex:
Sample data
Say we get the view T1 on 1st September and T2 on the 2nd. I need to spot only ID 2 for 1st September and IDs 3, 4, 5 for 2nd September. Note: no such date column exists. I need help creating such a column, or any other approach to verify which rows are getting updated/added daily.
You can create a BigQuery scheduled query with frequency set to daily (24 hours) using the below INSERT statement:
INSERT INTO dataset.T1
SELECT *
FROM dataset.T2
WHERE date > (SELECT MAX(date) FROM dataset.T1);
Your table that the data is getting streamed into (in your case: the sample data) needs to be configured as a partitioned table. Therefore, use "Partition by ingestion time" so that you don't need to handle the date yourself.
Configuration in BQ
After you have recreated that table, append your existing data to the new table with the help of the format options in BQ (append) and RUN.
Then you create a view based on that table with:
SELECT * EXCEPT (rank)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY invoice_id ORDER BY _PARTITIONTIME DESC) AS rank
  FROM `your_dataset.your_sample_data_table`
)
WHERE rank = 1
From then on, always use the view.
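As a possible follow-up for the daily sync (table name as in the answer; this assumes the ingestion-time partitioning described above), only the rows ingested today can be read via the _PARTITIONTIME pseudo-column:
SELECT *
FROM `your_dataset.your_sample_data_table`
WHERE DATE(_PARTITIONTIME) = CURRENT_DATE()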

Keeping older data on partition hive table

Keep history data in a partitioned table
Team,
I have a scenario here - I have 2 tables - one is non-partitioned and the other is a partitioned table, partitioned on one date field.
I have loaded the data from the non-partitioned table into the partitioned table, and I have set the below options to load into the partitioned table.
df.write.partitionBy("date") \
    .format("orc") \
    .mode("overwrite") \
    .saveAsTable("schema.table1")
Now both table counts match, with 3 years of data, which is as expected.
Now I have refreshed only the latest one year of data and tried to load the partitioned table, but it got loaded with only 1 year of data, whereas I need all 3 years of data in the partitioned table.
What am I missing here? I have to refresh only 1 year of data, load it into the partitioned table, and keep building history.
Kindly suggest. Thanks
I need to keep history, with the latest data refreshed on a daily basis.

Extract rows which haven't been extracted already using SQL Server stored procedure

I have a table Customers. I'm trying to design a way to extract data from the Customers table daily and create a CSV of this data. I want to pick only those records which haven't been extracted yet. How can I keep track of whether a record has been extracted or not? I cannot alter the Customers table to add a flag.
So far I'm planning to use a staging table which will have this flag. I'm writing a stored procedure to get the data from the Customers table with the flag set to 0 for each of these records, then using SSIS to create the CSV after pulling this data from the staging table, and once the records have been extracted into the CSV, updating the staging table with flag=1 for those records.
What is a good design for this problem?
Customer table:
CustomerID | Name | RecordCreated | RecordUpdated
Create another table, tblExportedEmpID, with a column CustomerID. Add the customer ID of each customer extracted from the Customer table into that new table. Then, to extract the customers from the Customer table which have not been extracted yet, you can use this query:
SELECT *
FROM Customer
WHERE CustomerID NOT IN (SELECT CustomerID FROM tblExportedEmpID)
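To complete that approach, a sketch of the bookkeeping step that records each exported ID (to be run once the extract has succeeded; table and column names as above):
INSERT INTO tblExportedEmpID (CustomerID)
SELECT c.CustomerID
FROM Customer AS c
WHERE c.CustomerID NOT IN (SELECT e.CustomerID FROM tblExportedEmpID AS e);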
You have RecordCreated and RecordUpdated. Why even bother with a separate record-tracking table if you have that information?
You'll need to create a table or equivalent "saved until next run" data area. The first thing you have your script do is grab the current time, and whatever was stored in that data area. Then, have your statement query everything:
SELECT <list of columns and transformations>
FROM Customers
WHERE RecordCreated >= @lastRunTime AND RecordCreated < @currentRunTime
(or RecordUpdated, if you need to re-extract when the customer's name changes)
Note that you want the exclusive upper-bound (<) to cover the case where your stored timestamp has less resolution than the mechanism getting the timestamp.
For the last step, store off your run start - whatever the script grabbed for "current time" - into the "saved until next run" data area.
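A minimal T-SQL sketch of that pattern (the watermark table dbo.ExtractWatermark and the procedure name are hypothetical; the Customers columns are from the question):
-- One-row table holding the "saved until next run" timestamp
CREATE TABLE dbo.ExtractWatermark (LastRunTime DATETIME2 NOT NULL);
INSERT INTO dbo.ExtractWatermark (LastRunTime) VALUES ('1900-01-01');
GO
CREATE PROCEDURE dbo.ExtractNewCustomers
AS
BEGIN
    -- Grab the current time and the previously saved watermark
    DECLARE @currentRunTime DATETIME2 = SYSUTCDATETIME();
    DECLARE @lastRunTime DATETIME2 = (SELECT TOP (1) LastRunTime FROM dbo.ExtractWatermark);

    -- Rows created or updated since the last run (exclusive upper bound)
    SELECT CustomerID, Name, RecordCreated, RecordUpdated
    FROM dbo.Customers
    WHERE RecordUpdated >= @lastRunTime AND RecordUpdated < @currentRunTime;

    -- Store this run's start time for the next run
    UPDATE dbo.ExtractWatermark SET LastRunTime = @currentRunTime;
END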