How does one do a SQL select over multiple partitions?

Is there a more efficient way than:
select * from transactions partition( partition1 )
union all
select * from transactions partition( partition2 )
union all
select * from transactions partition( partition3 );

It should be exceptionally rare that you use the PARTITION( partitionN ) syntax in a query.
You would normally just want to specify values for the partition key and allow Oracle to perform partition elimination. If your table is partitioned daily based on TRANSACTION_DATE, for example
SELECT *
FROM transactions
WHERE transaction_date IN (date '2010-11-22',
date '2010-11-23',
date '2010-11-24')
would select all the data from today's partition, yesterday's partition, and the day before's partition.
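If TRANSACTION_DATE carries a time component, the IN list above matches only rows stamped exactly at midnight, so a half-open range on the partition key is a common alternative. The following is a minimal sketch, assuming the same daily-partitioned transactions table as in the example. It wraps the query in EXPLAIN PLAN so that the Pstart and Pstop columns of the plan show which partitions Oracle intends to touch.

```sql
-- Range predicate on the partition key covering three daily partitions
-- (2010-11-22 through 2010-11-24, inclusive of any time of day).
EXPLAIN PLAN FOR
SELECT *
FROM transactions
WHERE transaction_date >= DATE '2010-11-22'
  AND transaction_date <  DATE '2010-11-25';

-- Show the plan; the Pstart and Pstop columns indicate the partitions scanned.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

If Pstart and Pstop span every partition in the table, the predicate is probably not on the partition key, or it is wrapped in a function that prevents pruning.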

Can you provide additional context? What are your predicates? What makes you think that you need to explicitly tell the optimizer to go against multiple partitions? You may have the wrong partition key in use, for example.

Related

Creating a partitioned table from query in Big Query does not yield same as without partitioning

When I create a table, say "orders", with partitioning in the following way, the result is truncated compared to creating it without partitioning (that is, commenting and uncommenting lines 5 and 6 of the script).
I suspect that it might have something to do with the BQ limits (found here) but I can't figure out what. The ts is a timestamp field and order_id is a UUID string.
That is, the COUNT(DISTINCT order_id) on the last line yields very different results: when the table is partitioned, it returns far fewer order_ids than without partitioning.
DROP TABLE IF EXISTS
`project.dataset.orders`;
CREATE OR REPLACE TABLE
`project.dataset.orders`
-- PARTITION BY
-- DATE(ts)
AS
SELECT
ts,
order_id,
SUM(order_value) AS order_value
FROM
`project.dataset.raw_orders`
GROUP BY
1, 2;
SELECT COUNT(DISTINCT order_id) FROM `project.dataset.orders`;
(This is not a valid 'answer'; I just need a better place to write SQL than the comment box. I don't mind if a moderator converts this answer into a comment AFTER it serves its purpose.)
What number do you get if you run the query below, and which one does it align with (partitioned or non-partitioned)?
SELECT COUNT(DISTINCT order_id) FROM (
SELECT
ts,
order_id,
SUM(order_value) AS order_value
FROM
`project.dataset.raw_orders`
GROUP BY
1, 2
) t;
It turns out that there's a 60 day partition expiration!
https://cloud.google.com/bigquery/docs/managing-partitioned-tables#partition-expiration
So by updating the partition expiration I could get the full range.
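A sketch of how the expiration could be inspected and changed with DDL, assuming the table name from the question. The 365-day value is purely illustrative, and the 60-day setting may have come from a dataset-level default rather than from the table itself.

```sql
-- Inspect the current partition expiration (if any) for tables in the dataset.
SELECT table_name, option_name, option_value
FROM `project.dataset.INFORMATION_SCHEMA.TABLE_OPTIONS`
WHERE option_name = 'partition_expiration_days';

-- Raise the expiration on the partitioned table (365 is an illustrative value).
ALTER TABLE `project.dataset.orders`
SET OPTIONS (partition_expiration_days = 365);
```

Partitions that have already expired are not restored by changing the option, so the table has to be rebuilt from raw_orders after the setting is updated.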

Bigquery Select all latest partitions from a wildcard set of tables

We have a set of Google BigQuery tables which are all distinguished by a wildcard for technical reasons, for example content_owner_asset_metadata_*. These tables are updated daily, but at different times.
We need to select the latest partition from each table in the wildcard.
Right now we are using this query to build our derived tables:
SELECT
*
FROM
`project.content_owner_asset_metadata_*`
WHERE
_PARTITIONTIME = (
SELECT
MIN(time)
FROM (
SELECT
MAX(_PARTITIONTIME) as time
FROM
`project.content_owner_asset_metadata_*`
WHERE
_PARTITIONTIME > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
)
)
This statement finds the date that all the up-to-date tables are guaranteed to have and selects that date's data. However, I need a filter that selects the data from the maximum partition time of each table. I know that I'd need to use _TABLE_SUFFIX with _PARTITIONTIME, but I cannot quite work out how to make a select work without just loading all our data (very costly) and using a standard greatest-n-per-group solution.
We cannot just union a bunch of static tables, as our dataset ingestion is liable to change and the scripts we build need to be able to accommodate that.
With BigQuery scripting (Beta now), there is a way to prune the partitions.
Basically, a scripting variable is declared to capture the result of the dynamic subquery. Then, in a subsequent query, the variable is used as a filter, which lets BigQuery prune the partitions to be scanned.
The example below uses a BigQuery public dataset to demonstrate how to prune partitions so that only the latest day of data is queried and scanned.
DECLARE max_date TIMESTAMP
DEFAULT (SELECT MAX(_PARTITIONTIME) FROM `bigquery-public-data.sec_quarterly_financials.numbers`);
SELECT * FROM `bigquery-public-data.sec_quarterly_financials.numbers`
WHERE _PARTITIONTIME = max_date;
With INFORMATION_SCHEMA.PARTITIONS (in preview as of posting), this can be achieved by joining to the PARTITIONS view as follows (e.g. with HOUR partitioning):
SELECT i.*
FROM `project.dataset.prefix_*` i
JOIN (
SELECT * EXCEPT (r)
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY table_name ORDER BY partition_id DESC) AS r
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name LIKE "%prefix%"
AND partition_id NOT IN ("__NULL__", "__UNPARTITIONED__"))
WHERE r = 1) p
ON (FORMAT_TIMESTAMP("%Y%m%d%H", i._PARTITIONTIME) = p.partition_id
AND CONCAT("prefix_", i._TABLE_SUFFIX) = p.table_name)

Will the partition be hit in an inner Union?

I have the following SQL statement:
SELECT *
FROM (
SELECT eu_dupcheck AS dupcheck
, eu_date AS threshold
FROM WF_EU_EVENT_UNPROCESSED
WHERE eu_dupcheck IS NOT NULL
UNION
SELECT he_dupcheck AS dupcheck
, he_date AS threshold
FROM WF_HE_HISTORY_EVENT
WHERE he_dupcheck IS NOT NULL
)
WHERE threshold > sysdate - 30
The second table is partitioned by date but the first isn't. I need to know if the partition of the second table will be hit in this query, or will it do a full table scan?
I would be surprised if Oracle were smart enough to avoid a full table scan. Remember that UNION processes the data by removing duplicates. So, Oracle would have to recognize that:
The where clause is appropriate for the partitioning (this is actually easy).
That partitioning does not affect the duplicate removal (this is a bit harder, but true because the date is in the select).
Oracle has a smart optimizer, so perhaps it can recognize this situation (and it would probably avoid the full table scan for a UNION ALL). However, it is safer to move the condition into the subqueries:
SELECT *
FROM ((SELECT eu_dupcheck AS dupcheck, eu_date AS threshold
FROM WF_EU_EVENT_UNPROCESSED
WHERE eu_dupcheck IS NOT NULL AND eu_date > sysdate - 30
) UNION
(SELECT he_dupcheck AS dupcheck, he_date AS threshold
FROM WF_HE_HISTORY_EVENT
WHERE he_dupcheck IS NOT NULL AND he_date > sysdate - 30
)
) eh;

How to create a temporary table in Oracle SQL and add data to it?

I have built cohorts of accounts based on date of first usage of our service. I need to use these cohorts in a handful of different queries, but don't want to have to rebuild the cohorts in each of these downstream queries. Reason: getting the data the first time took more than 60 minutes, so I don't want to pay that tax for all the other queries.
I know that I could do a statement like the below:
WHERE ACCOUNT_ID IN ('1234567','7891011','1213141'...)
But, I'm wondering if there is a way to create a temporary table that I prepopulate with my data, something like
WITH MAY_COHORT AS ( SELECT ACCOUNT_ID Account_ID, '1234567' Account_ID, '7891011' Account_ID, '1213141' )
I know that the above won't work, but would appreciate any advice or counsel here.
thanks.
Unless I am missing something, you're already on the right track, just an adjustment to your CTE should work:
WITH MAY_COHORT AS ( SELECT Account_ID from TableName WHERE ACCOUNT_ID IN ('1234567','7891011','1213141'...) )
This should give you the May_Cohort table to use for subsequent queries.
You can also use a sub-select for your IDs (no WITH MAY_COHORT):
WHERE ACCOUNT_ID IN (
SELECT Account_ID
from TableName "Where ... your condition to build your cohort ..." )
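Neither form above creates an actual temporary table. If the cohort IDs are already known and should be reusable across several queries in one session, an Oracle global temporary table is one option. The following is a minimal sketch: the table name may_cohort, the VARCHAR2(20) length, and the downstream join against TableName are illustrative assumptions, and the IDs are the three sample values from the question.

```sql
-- Define the temporary table once; rows are private to each session
-- and are kept until the session ends (ON COMMIT PRESERVE ROWS).
CREATE GLOBAL TEMPORARY TABLE may_cohort (
  account_id VARCHAR2(20) PRIMARY KEY
) ON COMMIT PRESERVE ROWS;

-- Prepopulate it with the known cohort IDs.
INSERT INTO may_cohort (account_id)
SELECT '1234567' FROM dual UNION ALL
SELECT '7891011' FROM dual UNION ALL
SELECT '1213141' FROM dual;

-- Reuse the cohort in any downstream query in the same session.
SELECT t.*
FROM TableName t
JOIN may_cohort c ON c.account_id = t.account_id;
```

The INSERT could instead select from the expensive 60-minute cohort query, so that cost is paid once per session rather than once per downstream query.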

Max and Min Time query

How can I show the max time in the first row and the min time in the second row, for Access using VB6?
What about:
SELECT time_value
FROM (SELECT MIN(time_column) AS time_value FROM SomeTable
UNION
SELECT MAX(time_column) AS time_value FROM SomeTable
)
ORDER BY time_value DESC;
That should do the job unless there are no rows in SomeTable (or your DBMS does not support the notation).
Simplifying per suggestion in comments - thanks!
SELECT MIN(time_column) AS time_value FROM SomeTable
UNION
SELECT MAX(time_column) AS time_value FROM SomeTable
ORDER BY time_value DESC;
If you can get two values from one query, you may improve the performance of the query using:
SELECT MIN(time_column) AS min_time,
MAX(time_column) AS max_time
FROM SomeTable;
A really good optimizer might be able to deal with both halves of the UNION version in one pass over the data (or index), but it is quite easy to imagine an optimizer tackling each half of the UNION separately and processing the data twice. If there is no index on the time column to speed things up, that could involve two table scans, which would be much slower than a single table scan for the two-value, one-row query (if the table is big enough for such things to matter).
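One edge case in the UNION versions: when the minimum and maximum are the same value, UNION removes the duplicate and only one row comes back. If two rows are always wanted, a UNION ALL variant with an explicit ordering column is one possibility. This is a sketch only; sort_key and the alias t are illustrative names, and some DBMSs (Oracle, for example) require dropping the AS before a table alias.

```sql
-- UNION ALL keeps both rows even when MIN and MAX are equal;
-- sort_key pins the MAX value to the first row and MIN to the second.
SELECT time_value
FROM (SELECT MAX(time_column) AS time_value, 1 AS sort_key FROM SomeTable
      UNION ALL
      SELECT MIN(time_column) AS time_value, 2 AS sort_key FROM SomeTable
     ) AS t
ORDER BY sort_key;
```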