I have a requirement to find the latest partition and the total number of records for multiple tables.
Currently I am doing this manually. Is there a better way to do it?
Thank you
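If this is BigQuery, a minimal sketch is to read the dataset-level INFORMATION_SCHEMA.PARTITIONS view, which exposes partition_id and total_rows per partition; mydataset is a placeholder, and the filter assumes date/timestamp partitions:

-- Latest partition and total row count per table (sketch, assuming BigQuery)
SELECT
  table_name,
  MAX(partition_id) AS latest_partition,
  SUM(total_rows)   AS total_rows_in_table
FROM mydataset.INFORMATION_SCHEMA.PARTITIONS
WHERE partition_id NOT IN ('__NULL__', '__UNPARTITIONED__')
GROUP BY table_name
ORDER BY table_name;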
I am trying to find the best way to de-duplicate my data set, and am having trouble finding a solution I can implement. To explain, I currently have a GCS bucket that I upload a .csv into daily. Each of these .csv files has the data for the previous 5 days. Because of the many overlapping days I end up with quite a few duplicates in my table in BigQuery. (Note: I have to upload this way because, due to delays in our systems, records sometimes don't get added for a day or two, and the date they show is the date the transaction took place, not the date it was added.)
When the duplicates are added they are completely identical across all 30 columns in the data set. What do I need to do to remove the duplicates? We have a "Transaction ID" column as the first column, and it is a distinct ID per transaction. I assume I can use this, I'm just not sure how.
I tried the query below; it removed the all-null rows, but not the duplicates that contain actual data.
CREATE OR REPLACE TABLE project.dataset.tableDeDuped
AS
SELECT DISTINCT * FROM project.dataset.table
You could try either ROW_NUMBER() or ARRAY_AGG(), as shown here:
Delete duplicate rows from a BigQuery table
Check Jordan's and Felipe's answers; if you have any questions, let us know.
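A minimal sketch of both approaches, assuming the Transaction ID column is named transaction_id and reusing the table names from the question; since the duplicate rows are completely identical, picking an arbitrary row within each group is safe:

-- Option 1: keep one row per transaction_id with ROW_NUMBER()
CREATE OR REPLACE TABLE `project.dataset.tableDeDuped` AS
SELECT * EXCEPT(rn)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY transaction_id) AS rn
  FROM `project.dataset.table`
)
WHERE rn = 1;

-- Option 2: the ARRAY_AGG() variant, keeping one full row per transaction_id
CREATE OR REPLACE TABLE `project.dataset.tableDeDuped` AS
SELECT any_row.*
FROM (
  SELECT ARRAY_AGG(t ORDER BY t.transaction_id LIMIT 1)[OFFSET(0)] AS any_row
  FROM `project.dataset.table` AS t
  GROUP BY transaction_id
);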
I have an application using an AWS Aurora PostgreSQL 10 database that expects 5M+ records per day in a table. The application will be running in a Kubernetes environment with ~5 pods.
One of the application's requirements is to expose a method that builds an object with all the possible values of 5 columns of the table, i.e. all distinct values of the name column.
We expect ~100 different values per column. A DISTINCT/GROUP BY takes more than 1 s per column, so the process does not meet the non-functional requirement (processing time).
The solution I found was to create a table/view with the distinct values of each column; that table/view would be refreshed by a cron-like task.
Is this the most effective approach to meet the non-functional/processing-time requirement using only PostgreSQL tools?
One possible solution is a materialized view that you regularly refresh. Between these refreshes, the data will become slightly stale.
Alternatively, you can maintain a separate table with just the distinct values and use triggers to keep the information up to date whenever rows are modified. This will require a combined index on all the affected columns to be fast.
DISTINCT is always a performance problem if it affects many rows.
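A minimal sketch of the materialized-view option in PostgreSQL, with the table name (transactions) and column names as placeholder assumptions; the unique index is what allows a non-blocking concurrent refresh from the cron-like task:

-- One small row per (column_name, value) pair
CREATE MATERIALIZED VIEW distinct_values AS
SELECT 'name' AS column_name, name::text AS value FROM transactions GROUP BY name
UNION ALL
SELECT 'status', status::text FROM transactions GROUP BY status
UNION ALL
SELECT 'region', region::text FROM transactions GROUP BY region;

-- Required for REFRESH ... CONCURRENTLY
CREATE UNIQUE INDEX ON distinct_values (column_name, value);

-- Run this on a schedule (cron, pg_cron, etc.)
REFRESH MATERIALIZED VIEW CONCURRENTLY distinct_values;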
I have about 11 years of data in a bunch of Avro files. I wanted to partition by the date of each row, but from the documentation it appears I can't, because there are too many distinct dates?
Does clustering help with this? The natural clustering key for my data would still have some values with data spanning more than 4,000 days.
Two solutions I see:
1)
Combine table sharding (per year) with time partitioning based on your column. I never tested that myself, but it should work, as every shard is seen as a separate table in BQ.
That way you can easily address the shard plus the partition with one wildcard/variable.
2)
A good workaround is to create an extra column with the date of the field you want to partition on.
For every entry older than 9 years (e.g. DATE_DIFF(CURRENT_DATE(), DATE('2009-01-01'), YEAR)), truncate that date to the 1st of its month (a sketch follows after this answer).
That way you gain room for roughly another 29 years of data.
Be aware that you cannot filter on that column with a date filter, e.g. in Data Studio, but it works in queries.
Best, Thomas
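A minimal sketch of option 2 above, with event_date, the 9-year cutoff, and the table name as placeholder assumptions:

-- part_date is the extra partitioning column; event_date is the real date
SELECT
  *,
  IF(event_date < DATE_SUB(CURRENT_DATE(), INTERVAL 9 YEAR),
     DATE_TRUNC(event_date, MONTH),   -- old rows: one partition per month
     event_date) AS part_date         -- recent rows: one partition per day
FROM dataset.source_table;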
Currently, as per the docs, clustering is supported for partitioned tables only. In the future it might support non-partitioned tables.
You can put old data into a single partition per year.
You need to add an extra column to your table to partition on.
Say all data for the year 2011 goes into the partition 20110101.
For newer data (2019) you can have a separate partition for each date.
This is not a clean solution to the problem, but with it you can optimize further by using clustering to keep table scans minimal (see the sketch below).
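A sketch of that layout as a CREATE TABLE ... AS SELECT, with dataset, table, and column names assumed; old years collapse into one partition per year (e.g. 20110101) and clustering narrows the scan inside each partition:

CREATE TABLE dataset.events_partitioned
PARTITION BY part_date
CLUSTER BY natural_key           -- placeholder for your clustering column(s)
AS
SELECT
  *,
  IF(EXTRACT(YEAR FROM event_date) < 2019,
     DATE(EXTRACT(YEAR FROM event_date), 1, 1),   -- e.g. 2011 -> 2011-01-01
     event_date) AS part_date
FROM dataset.events_raw;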
4,000 daily partitions is just over 10 years of data. If you require a 'table' with more than 10 years of data, one workaround is to use a view:
Split your table into decades, ensuring all tables are partitioned on the same field and have the same schema
Union the tables together in a BigQuery view (a sketch follows below)
This results in a view spanning 4,000+ partitions, which business users can query without worrying about which table they need to use or unioning the tables themselves.
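A minimal sketch of such a view, with decade table names assumed; a filter on the shared partition field should still prune partitions inside each underlying table:

CREATE OR REPLACE VIEW dataset.events_all AS
SELECT * FROM dataset.events_2000s
UNION ALL
SELECT * FROM dataset.events_2010s
UNION ALL
SELECT * FROM dataset.events_2020s;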
It might make sense to partition by week/month/year instead of day - depending on how much data you have per day.
In that case, see:
Partition by week/year/month to get over the partition limit?
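BigQuery has since added monthly and yearly partitioning granularity, which sidesteps the daily-partition limit directly; a sketch with placeholder names:

CREATE TABLE dataset.events_monthly
PARTITION BY DATE_TRUNC(event_date, MONTH)   -- or YEAR for even fewer partitions
AS
SELECT * FROM dataset.events_raw;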
BigQuery can save a query result into a specified table, but when the target table has day partitions, I currently use a Python loop to query one day of data at a time and save it into the table. Is there a faster way?
Thanks!
There are two related feature requests that you can vote for and monitor for progress: Update date-partitioned tables from results of a query and Partition on non-date field.
Meantime, conceptually, the way you are approaching this with a loop is correct and the only way as of now (August 2017).
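For reference, a sketch of the per-day query such a loop would run (table and column names are placeholders); the destination partition, e.g. dataset.target$20170801, is specified in the query job's destination-table setting, not in the SQL itself:

-- Issued once per day D in the loop, with the job's destination set to dataset.target$YYYYMMDD
SELECT *
FROM dataset.source
WHERE DATE(event_timestamp) = DATE '2017-08-01';   -- substitute D on each iteration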
I have a daily partitioned table, and I want to delete older partitions by API.
The documentation only says that older partitions which have not been updated for 3 months are stored at a 50% discount. Thanks, Google, but I really do not intend to keep that data for half a century.
I thought the whole point of partitioned tables was to allow deleting old data, but all I can find is a discount. Is there a way to do this?
You can use the Tables: delete API to delete a specific partition of the table by addressing that partition as yourTable$YYYYMMDD.
And you can use the timePartitioning.expirationMs property to set the number of milliseconds for which to keep the storage of a partition. You can set this property either while creating the table via the Tables: insert API, or by patching an existing table via the Tables: patch API.
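Both of those can now also be expressed in standard SQL; a hedged sketch with placeholder table names and dates:

-- Remove the data in partitions older than a cutoff (ingestion-time partitioned table)
DELETE FROM dataset.mytable
WHERE _PARTITIONTIME < TIMESTAMP('2015-01-01');

-- Or let BigQuery expire old partitions automatically
ALTER TABLE dataset.mytable
SET OPTIONS (partition_expiration_days = 90);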