I am looking for a way to define a Teradata Data Transfer custom schema that implements a month based date partition. The documentation only provides a method to do this at a timestamp or date level.
https://cloud.google.com/bigquery-transfer/docs/teradata-migration-options#custom_schema_file
Is there an undocumented approach to defining a custom schema file that handles this? Or is the alternative to migrate at the day level, then once in BigQuery insert into a table that is defined with the month partition?
As mentioned in the answer:
In Teradata, you might find trunc() to be a simple method:
select a.id, a.name, a.number, a.date
from (select a.*,
             row_number() over (partition by trunc(date, 'MON') order by date desc) as seqnum
      from tableA a
     ) a
where seqnum = 1;
Teradata also supports qualify:
select a.id, a.name, a.number, a.date
from tableA a
qualify row_number() over (partition by trunc(date, 'MON') order by date desc) = 1
As explained in the documentation:
A partitioned primary index enables Teradata Database to partition the rows of a table or uncompressed join index in such a way that row subsets can be accessed efficiently without resorting to full-table scans. If the partitioning expression is defined using an updatable current date or updatable current timestamp, the partition that contains the most recent rows can be defined to be as narrow as possible to optimize efficient access to those rows. An additional benefit of an updatable current date or updatable current timestamp for partitioning is that the partitioning expression can be designed in such a way that it might not need to be changed as a function of time.
You can specify the DATE, CURRENT_DATE, or CURRENT_TIMESTAMP functions in the partitioning expression of a table or uncompressed join index and then periodically update the resolution of their values. This enables rows to be repartitioned on the newly resolved values of the DATE, CURRENT_DATE, or CURRENT_TIMESTAMP functions at any time you determine that they require reconciliation. You can update the resolution of your partitioning scheme by submitting appropriate ALTER TABLE TO CURRENT statements.
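For illustration, here is a minimal sketch of a month-based partitioning expression that uses an updatable CURRENT_DATE (the table name, column names, and bounds are hypothetical, not from the original question):

CREATE TABLE sales_history (
  sale_id   INTEGER NOT NULL,
  sale_date DATE NOT NULL
)
PRIMARY INDEX (sale_id)
PARTITION BY RANGE_N(
  sale_date BETWEEN CURRENT_DATE - INTERVAL '2' YEAR AND CURRENT_DATE
  EACH INTERVAL '1' MONTH
);

-- Periodically roll the partitioning window forward to the newly
-- resolved CURRENT_DATE, as described in the documentation above:
ALTER TABLE sales_history TO CURRENT;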
For more information, you can refer to the documentation.
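As for the fallback you mention, migrating at the day level and then loading into a month-partitioned table is straightforward on the BigQuery side. A minimal sketch, assuming a staging table `mydataset.staging_table` with a `tx_date` DATE column (both names are hypothetical):

#standardSQL
CREATE OR REPLACE TABLE `mydataset.monthly_table`
PARTITION BY DATE_TRUNC(tx_date, MONTH)
AS
SELECT *
FROM `mydataset.staging_table`;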
Related
My data is partitioned by day in the standard Hive format:
/year=2020/month=10/day=01
/year=2020/month=10/day=02
/year=2020/month=10/day=03
/year=2020/month=10/day=04
...
I want to query all data from the last 60 days, using Amazon Athena (i.e. Presto). I want this query to use the partitioned columns (year, month, day) so that only the necessary partition files are scanned. Assuming I can't change the file partition format, what is the best approach to this problem?
You don't have to use year, month, day as the partition keys for the table. You can have a single partition key called date and add the partitions like this:
ALTER TABLE the_table ADD
PARTITION (`date` = '2020-10-01') LOCATION 's3://the-bucket/data/year=2020/month=10/day=01'
PARTITION (`date` = '2020-10-02') LOCATION 's3://the-bucket/data/year=2020/month=10/day=02'
...
With this setup you can even set the type of the partition key to date:
PARTITIONED BY (`date` date)
Now you have a table with a date column typed as a DATE, and you can use any of the date and time functions to do calculations on it.
What you won't be able to do with this setup is use MSCK REPAIR TABLE to load partitions, but you really shouldn't do that anyway – it's extremely slow and inefficient and really something you only do when you have a couple of partitions to load into a new table.
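To make the setup concrete, here is a sketch of the table definition together with a last-60-days query (the bucket name, columns, and storage format are assumptions for illustration):

CREATE EXTERNAL TABLE the_table (
  id string
)
PARTITIONED BY (`date` date)
STORED AS PARQUET
LOCATION 's3://the-bucket/data/';

-- The typed partition key supports Presto date arithmetic, so Athena
-- prunes to roughly the last 60 partitions instead of scanning everything:
SELECT *
FROM the_table
WHERE "date" >= current_date - interval '60' day;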
An alternative to the approach proposed by Theo is to use the following syntax, e.g.:
select ... from my_table where year||month||day between '20200630' and '20201010'
This works when the year, month, and day columns are zero-padded strings. It's particularly useful for querying across months.
Porting some stuff to BigQuery, and I've come across an issue.
We have a bunch of data with no unique key value. Unfortunately, some report logic requires a unique value for each row.
So in systems like Oracle I would just use the ROWNUM or ROWID pseudo-columns.
In Vertica, which doesn't have those pseudo-columns, I would use ROW_NUMBER() OVER(). But in BigQuery that is failing with the error:
'dataset:bqjob_r79e7b4147102bdd7_0000016482b3957c_1': Resources exceeded during query execution: The query could not be executed in the allotted memory.
OVER() operator used too much memory..
The value does not have to be persistent, just a unique value within the query results.
I would like to avoid extract-process-reload if possible.
So is there any way to assign a unique value to query result rows in BigQuery SQL?
Edit: Sorry, I should have clarified. I'm using standard SQL, not legacy.
For ROW_NUMBER() OVER() to scale, you'll need to use PARTITION.
See https://stackoverflow.com/a/16534965/132438
#standardSQL
SELECT *
, FORMAT('%i-%i-%i', year, month, ROW_NUMBER() OVER(PARTITION BY year, month)) id
FROM `publicdata.samples.natality`
I can't find any documentation relating to this. Is time_partitioning_type=DAY the only way to partition a table in BigQuery? Can this parameter take any other values besides a date?
Note that even if you partition on day granularity, you can still write your queries to operate at the level of months using an appropriate filter on _PARTITIONTIME. For example,
#standardSQL
SELECT * FROM MyDatePartitionedTable
WHERE DATE_TRUNC(EXTRACT(DATE FROM _PARTITIONTIME), MONTH) = '2017-01-01';
This selects all rows from January 2017.
Unfortunately not. BigQuery currently only supports date-partitioned tables.
https://cloud.google.com/bigquery/docs/partitioned-tables
BigQuery offers date-partitioned tables, which means that the table is divided into a separate partition for each date
It seems like this would work:
#standardSQL
CREATE OR REPLACE TABLE `My_Partition_Table`
PARTITION BY event_month
OPTIONS (
  description="this is a table partitioned by month"
) AS
SELECT
  DATE_TRUNC(DATE(some_event_timestamp), MONTH) AS event_month,
  *
FROM `TableThatNeedsPartitioning`
For those who run into the error "Too many partitions produced by query, allowed 4000, query produces at least X partitions", due to BigQuery's 4,000-partition limit (as of 2023-02), you can do the following:
CREATE OR REPLACE TABLE `My_Partition_Table`
PARTITION BY DATE_TRUNC(date_column, MONTH)
OPTIONS (
  description="This is a table partitioned by month"
) AS
-- Your query
Basically, take @david-salmela's answer, but move the DATE_TRUNC part to the PARTITION BY clause.
It seems to work exactly like PARTITION BY date_column in terms of querying the table (e.g. WHERE date_column = '2023-02-20'), but my understanding is that, cost-wise, you always read a whole month's worth of data.
I would like to know how to order a table by date (the oldest date first) after the insert of a new value into the same table, using a trigger.
Thank you
RDBMSs (or at least the common ones) have no notion of ordering a table - a table is just a bunch of rows, which the database may return in any arbitrary order.
The way to control this order is to explicitly declare the order you want with an ORDER BY clause in your query:
SELECT *
FROM my_table
ORDER BY last_modification_time
I have one table in Oracle consisting of around 55 million records, with a partition on a date column.
This table stores around 600,000 records for each day based on some position.
Now, some analytic functions are used in one SELECT query in a procedure, e.g. LEAD, LAG, ROW_NUMBER() OVER (PARTITION BY col1, date ORDER BY col1, date), which is taking too much time due to the PARTITION BY and ORDER BY clauses on the date column.
Is there any other alternative to optimize the query ?
Have you considered using a materialized view where you store the results of your analytical functions?
More information about materialized views:
http://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_6002.htm
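A minimal sketch of the idea (the table, column, and view names are hypothetical, and the refresh policy is just one option):

CREATE MATERIALIZED VIEW mv_analytics
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
AS
SELECT t.*,
       LEAD(t.position)  OVER (PARTITION BY t.col1, t.dt ORDER BY t.col1, t.dt) AS next_position,
       LAG(t.position)   OVER (PARTITION BY t.col1, t.dt ORDER BY t.col1, t.dt) AS prev_position,
       ROW_NUMBER()      OVER (PARTITION BY t.col1, t.dt ORDER BY t.col1, t.dt) AS rn
FROM big_table t;

-- Refresh periodically, e.g. after each daily load:
-- EXEC DBMS_MVIEW.REFRESH('MV_ANALYTICS');

The procedure can then read the precomputed columns from the materialized view instead of re-evaluating the window functions over 55 million rows on every call.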