I tried to partition a table based on a TIMESTAMP column. I ran the following query:
CREATE OR REPLACE TABLE `stackoverflow.questions_2018_partitioned`
PARTITION BY DATE(creation_date)
AS SELECT * FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE creation_date BETWEEN '2018-01-01' AND '2018-07-01';
but the partitioned table is empty. I copied this query from HERE.
Would you please help me find my mistake?
Maybe the default partition expiration time in your environment is 60 days. Expiration can be set to never with OPTIONS (partition_expiration_days = NULL). The full command is:
CREATE OR REPLACE TABLE `stackoverflow.questions_2018_partitioned`
PARTITION BY DATE(creation_date)
OPTIONS (partition_expiration_days = NULL)
AS SELECT * FROM `bigquery-public-data.stackoverflow.posts_questions` WHERE creation_date BETWEEN '2018-01-01' AND '2018-07-01';
Your personal Google Cloud account with billing enabled will let you have partitioned tables that don't expire. This will NOT work, however, in Qwiklabs.
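If the 60-day expiration is coming from the dataset's default rather than from the table itself, you can also clear it at the dataset level. A minimal sketch, assuming the dataset is named stackoverflow (adjust to your own dataset):
-- Clear the dataset-wide default so newly created partitioned tables never expire
ALTER SCHEMA stackoverflow
SET OPTIONS (default_partition_expiration_days = NULL);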
I have around 600 date-sharded tables called table.ga_session. Each table covers one day, and each has its own unique name; for example, the table for 30/12/2021 is named table.ga_session_20211230. The same goes for the other tables, so the naming format is table.ga_session_YYYYMMDD.
Now, when I try to query all of these tables at once, I cannot use a command like this; the error says that _PARTITIONTIME is unrecognized:
SELECT
*,
_PARTITIONTIME pt
FROM `table.ga_sessions_20211228`
where _PARTITIONTIME
BETWEEN TIMESTAMP('2019-01-01')
AND TIMESTAMP('2020-01-02')
I also tried this, and it does not work:
select *
from between `table.ga_sessions_20211228`
and
`table.ga_sessions_20211229`
I also cannot use FROM `table.ga_sessions` with a WHERE clause to take out a range of time, as that table does not exist. How do I query all of these tables? Thank you in advance!
You can query using wildcard tables. For example:
SELECT max
FROM `bigquery-public-data.noaa_gsod.gsod*`
WHERE _TABLE_SUFFIX = '1929'
This will specifically query the gsod1929 table, but the _TABLE_SUFFIX clause can be excluded if desired.
In your scenario you could do:
select *
from `table.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20190101' and '20200102'
For more information see the documentation here:
https://cloud.google.com/bigquery/docs/reference/standard-sql/wildcard-table-reference
When creating a table, let's say "orders", with partitioning in the following way, my result gets truncated compared to creating it without partitioning (commenting and uncommenting lines 5 and 6).
I suspect that it might have something to do with the BQ limits (found here), but I can't figure out which one. ts is a TIMESTAMP field and order_id is a UUID string.
That is, the COUNT(DISTINCT order_id) on the last line yields very different results: when partitioned, it returns far fewer order_ids than without partitioning.
DROP TABLE IF EXISTS
`project.dataset.orders`;
CREATE OR REPLACE TABLE
`project.dataset.orders`
-- PARTITION BY
-- DATE(ts)
AS
SELECT
ts,
order_id,
SUM(order_value) AS order_value
FROM
`project.dataset.raw_orders`
GROUP BY
1, 2;
SELECT COUNT(DISTINCT order_id) FROM `project.dataset.orders`;
(This is not a valid 'answer'; I just need a better place to write SQL than the comment box. I don't mind if a moderator converts this answer into a comment AFTER it serves its purpose.)
What is the number you get if you run the query below, and which one does it align with (partitioned or non-partitioned)?
SELECT COUNT(DISTINCT order_id) FROM (
SELECT
ts,
order_id,
SUM(order_value) AS order_value
FROM
`project.dataset.raw_orders`
GROUP BY
1, 2
) t;
It turns out that there's a 60-day partition expiration!
https://cloud.google.com/bigquery/docs/managing-partitioned-tables#partition-expiration
So by updating the partition expiration I could get the full range.
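For reference, a minimal sketch of clearing that expiration on an existing table (reusing the example table name from above; adjust to your own project and dataset):
-- Remove the partition expiration so older partitions are no longer dropped
ALTER TABLE `project.dataset.orders`
SET OPTIONS (partition_expiration_days = NULL);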
We have a set of Google BigQuery tables which are all distinguished by a wildcard for technical reasons, for example content_owner_asset_metadata_*. These tables are updated daily, but at different times.
We need to select the latest partition from each table in the wildcard.
Right now we are using this query to build our derived tables:
SELECT
*
FROM
`project.content_owner_asset_metadata_*`
WHERE
_PARTITIONTIME = (
SELECT
MIN(time)
FROM (
SELECT
MAX(_PARTITIONTIME) as time
FROM
`project.content_owner_asset_metadata_*`
WHERE
_PARTITIONTIME > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
)
)
This statement finds the date that all the up-to-date tables are guaranteed to have and selects that date's data; however, I need a filter that selects the data from the maximum partition time of each table. I know that I'd need to use _TABLE_SUFFIX with _PARTITIONTIME, but cannot quite work out how to make a select work without just loading all our data (very costly) and using a standard greatest-n-per-group solution.
We cannot just union a bunch of static tables, as our dataset ingestion is liable to change and the scripts we build need to be able to accommodate that.
With BigQuery scripting (in Beta at the time of writing), there is a way to prune the partitions.
Basically, a scripting variable is defined to capture the dynamic part of a subquery. Then, in the subsequent query, the scripting variable is used as a filter to prune the partitions to be scanned.
The example below uses a BigQuery public dataset to demonstrate how to prune partitions so that only the latest day of data is queried and scanned.
DECLARE max_date TIMESTAMP
DEFAULT (SELECT MAX(_PARTITIONTIME) FROM `bigquery-public-data.sec_quarterly_financials.numbers`);
SELECT * FROM `bigquery-public-data.sec_quarterly_financials.numbers`
WHERE _PARTITIONTIME = max_date;
With INFORMATION_SCHEMA.PARTITIONS (preview) as of posting, this can be achieved by joining to the PARTITIONS table as follows (e.g. with HOUR partitioning):
SELECT i.*
FROM `project.dataset.prefix_*` i
JOIN (
SELECT * EXCEPT (r)
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY table_name ORDER BY partition_id DESC) AS r
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name LIKE "%prefix%"
AND partition_id NOT IN ("__NULL__", "__UNPARTITIONED__"))
WHERE r = 1) p
ON (FORMAT_TIMESTAMP("%Y%m%d%H", i._PARTITIONTIME) = p.partition_id
AND CONCAT("prefix_", i._TABLE_SUFFIX) = p.table_name)
I need to compare and find a percent error for several columns from two different Oracle databases. The DB1 DATE_TIME column only has a date, no time (DD-MON-YY); the DB2 DATE_TIME column has both date and time. Each row represents one hour, and DB1 is in the correct order by hour, but with no actual times. I need the relevant columns to match up between id and date_time (specifically, by hour), but I've found that a WHERE clause testing for DATE equality only gives the DB1 entry corresponding to 12:00:00 AM, because DB1's format has no time component, so I'm not able to compare the correct entries for other times. How can I get around this?
Code below to better illustrate:
SELECT db1.field1, db2.field1, db1.date_time, db2.date_time
FROM db1, db2
WHERE db1.date_time = db2.date_time AND db1.id = X
ORDER BY db2.date_time DESC;
This query runs, but none of the data actually matches because it's only returning the first row of each day from DB1 (corresponding to 12:00:00 AM).
I've thought of somehow inserting corresponding timestamps into the DB1 DATE_TIME column based on position so I can include time in the WHERE clause, but I'm not sure how to do that, or whether it will even work. I've seen that running a test query using BETWEEN day1 AND day2 (instead of =) returns the results I want for a given range of days, but I'm not sure how to implement that in the JOIN that I'm trying to do with DB2.
Any ideas?
If you care about performance, I would suggest that you create a function-based index on db2:
create index idx_db2_datetime_date on db2(trunc(date_time));
Then, you can use this construct in the query:
SELECT db1.field1, db2.field1, db1.date_time, db2.date_time
FROM db1 JOIN
db2
ON db1.date_time = trunc(db2.date_time)
WHERE db1.id = X
ORDER BY db2.date_time DESC;
For this query, an index on db1(id, date_time) is also helpful.
The indexes are not necessary for the query to work, but the function-based index is a nice way to write a performant query with a function in the ON clause.
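For completeness, a sketch of that second index (the index name is just illustrative):
create index idx_db1_id_datetime on db1(id, date_time);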
Note: Learn to use proper, explicit JOIN syntax. Never use commas in the FROM clause.
For a SELECT, you might want to try something along these lines:
SELECT db1_.field1, db2.field1, db1_.date_time, db2.date_time
FROM (
SELECT
id
, field1
, (date_time + (RANK() OVER (PARTITION BY date_time ORDER BY id) - 1) / 24) date_time
FROM DB1
) DB1_
JOIN db2
ON db1_.date_time = TRUNC(db2.date_time, 'HH24')
AND db1_.id = X
ORDER BY db2.date_time DESC;
If you prefer to get the hours added to DB1.date_time, please try:
UPDATE
DB1 TDB1
SET TDB1.date_time =
(SELECT
d_t
FROM
(SELECT
id
, (date_time + (RANK() OVER (PARTITION BY date_time ORDER BY id) - 1) / 24) d_t
FROM DB1)
WHERE id = TDB1.id
)
;
Sorry, no suitable test data to verify in full at this time.
Please comment if and as this requires adjustment / further detail.
In Oracle 11g, I have created a fact table partitioned by date and sub-partitioned by site_id.
An analyze (statistics gathering) step is performed on this table daily, at a one-day interval.
In the SQL Developer tool, when I open the table definition, under the Partitions tab I can see the partition 23-JAN-2016. For each site_id, I can see a sub-partition.
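For context, a hypothetical sketch of what such a composite-partitioned table might look like (column and partition names here are illustrative only; the actual DDL is not shown in the question):
-- Hypothetical: range partitions per day, list sub-partitions per site_id
CREATE TABLE eh_modem_hist_prfrm_fact (
  time_stamp DATE,
  site_id    NUMBER,
  metric_val NUMBER
)
PARTITION BY RANGE (time_stamp)
SUBPARTITION BY LIST (site_id)
SUBPARTITION TEMPLATE (
  SUBPARTITION sp_580  VALUES (580),
  SUBPARTITION sp_rest VALUES (DEFAULT)
)
(
  PARTITION p_20160123 VALUES LESS THAN (TO_DATE('24-01-2016', 'DD-MM-YYYY'))
);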
Select * from NPM.EH_MODEM_HIST_PRFRM_FACT subpartition(SYS_SUBP1256625);
When I run the above query, I am able to see the data.
But when I run the query below from the report SQL, the table does not return any data:
select * from NPM.EH_MODEM_HIST_PRFRM_FACT
where time_stamp ='23-JAN-16' and site_id =580
Is there any problem in managing this table?
Probably, what you're actually after is something like:
select *
from NPM.EH_MODEM_HIST_PRFRM_FACT
where time_stamp >= to_date('23/01/2016', 'dd/mm/yyyy')
and time_stamp < to_date('23/01/2016', 'dd/mm/yyyy') + 1
and site_id = 580;
The above assumes that the datatype for the time_stamp column is DATE. If it's actually TIMESTAMP then you should use the SQL below:
select *
from NPM.EH_MODEM_HIST_PRFRM_FACT
where time_stamp >= to_date('23/01/2016', 'dd/mm/yyyy')
and time_stamp < to_date('23/01/2016', 'dd/mm/yyyy') + interval '1' day
and site_id = 580;
Note also that I have specified the date with a four-digit year. Two-digit years are just soooo pre-y2k! *{;-)