Querying a partitioned BigQuery table across multiple far-apart _PARTITIONDATE days? - google-bigquery

I have a number of very large tables partitioned by _PARTITIONDATE which I'd like to query regularly and efficiently. Each time I run the query I only need to search a small number of dates, but the dates change on every run and may be months or years apart from one another.
To capture these dates I could use _PARTITIONDATE >= '2015-01-01', but that makes the queries very slow, since each partition holds millions of rows. I could also use _PARTITIONDATE BETWEEN '2015-01-01' AND '2017-01-01', but the exact date range changes every run. What I'd like is something like _PARTITIONDATE IN ("2015-03-10", "2016-01-24", "2016-03-22", "2017-06-14") so that the query only scans the partitions for the dates provided, which from my testing appears to work.
The problem I'm running into is that the list of dates changes every time, which means I first have to join in the list of dates from a temp table. When I do that, with source._PARTITIONDATE IN (datelist.date) as the only WHERE condition on a partition-required table, the query fails with an error.
Any advice on how to get this to work, or on another approach for querying specific partitions that aren't back to back, without scanning the whole table?
I've been reading through the BigQuery documentation but I don't see an answer to this question. It does say that the following "doesn't limit the scanned partitions, because it uses table values, which are dynamic", so possibly what I'm trying to do is impossible with current BigQuery limitations:
_PARTITIONTIME = (SELECT MAX(timestamp) from dataset.table1)

Scripting is a possible solution:
DECLARE max_date DEFAULT (SELECT MAX(...) FROM ...);
SELECT .... FROM ... WHERE _PARTITIONDATE = max_date;
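The same scripting idea extends to a list of dates. Below is a sketch, with placeholder dataset/table names; in my understanding scripting variables set this way act as runtime constants for partition pruning, but it is worth confirming against the bytes-processed estimate for your own table:

```sql
-- Collect the wanted dates into a scripting variable,
-- then filter the partition pseudo-column on it.
DECLARE date_list ARRAY<DATE>;
SET date_list = (SELECT ARRAY_AGG(d) FROM mydataset.datelist);

SELECT *
FROM mydataset.big_table
WHERE _PARTITIONDATE IN UNNEST(date_list);
```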

Related

Optimize SELECT MAX(timestamp) query

I would like to run this query about once every 5 minutes so I can run an incremental MERGE into another table.
SELECT MAX(timestamp) FROM dataset.myTable
-- timestamp is of type TIMESTAMP
My concern is that this will do a full scan of myTable on a regular basis.
What are the best practices for optimizing this query? Will partitioning help even though SELECT MAX doesn't filter on the partitioning column? Or is the columnar nature of BigQuery alone enough to make this efficient?
Thank you.
What you can do is, instead of querying your table directly, query the INFORMATION_SCHEMA.PARTITIONS view within your dataset (see the BigQuery documentation on it).
You can for instance go for:
SELECT LAST_MODIFIED_TIME
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE TABLE_NAME = "myTable"
The PARTITIONS view holds metadata at the rate of one record per partition. It is therefore far smaller than your table, so querying it is an easy way to cut your query costs (it is also much faster to query).
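If what you need is the most recent modification time across all partitions, a sketch along these lines should work (the __NULL__ and __UNPARTITIONED__ pseudo-partitions are excluded so they don't skew the result):

```sql
SELECT MAX(LAST_MODIFIED_TIME) AS last_modified
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE TABLE_NAME = "myTable"
  AND PARTITION_ID NOT IN ("__NULL__", "__UNPARTITIONED__");
```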

BigQuery: querying the latest partition, bytes to be processed vs. actually processed

I'm struggling to efficiently query the latest partition of a table partitioned on a date or datetime field. The first approach was to filter like this:
SELECT *
FROM my_table
WHERE observation_date = (SELECT MAX(observation_date) FROM my_table)
But that, according to BigQuery's processing estimation, scans the entire table and does not use the partitions. Even Google states this happens in their documentation. It does work if I use an exact value for the partition:
SELECT *
FROM my_table
WHERE observation_date = CURRENT_DATE
But if the table is not up to date then the query will not return any results and my automatic processes will fail. If I include an offset like observation_date = DATE_SUB(CURRENT_DATE, INTERVAL 2 DAY), I will likely miss the latest partition.
What is the best practice to get the latest partition efficiently?
What makes this worse is that BigQuery's estimate of the bytes to be processed does not match what is actually processed, unless I'm misinterpreting those numbers. Below is a screenshot of the mismatching values.
[Screenshot: BigQuery console showing apparently mismatching processed bytes]
Finally a couple of scenarios that I also tested:
If I store a max_date with a DECLARE statement first, as suggested in this post, the estimate seems to work, though it is not clear why. However, the bytes actually processed after running the query are no different from the case that filters on the latest partition in the WHERE clause.
Using the same declared max_date on a table that is both partitioned and clustered, the estimate works only when filtering on the partition; it fails if I also include a filter on the clustering column.
After some iterations I got an answer from Google and although it doesn't resolve the issue, it acknowledges it happens.
Tables partitioned on DATE or DATETIME fields cannot be efficiently queried for their latest partition. The best practice remains to filter with something like WHERE observation_date = (SELECT MAX(observation_date) FROM my_table), and that will scan the whole table.
They made a note to try to improve this in the future, but for now we have to live with it. I hope this helps somebody trying to do the same thing.
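For reference, the scripting workaround discussed above looks roughly like this (a sketch using my_table and observation_date from the question; the estimator accepts it, though as noted the bytes actually billed may not differ):

```sql
DECLARE max_date DATE DEFAULT (
  SELECT MAX(observation_date) FROM my_table);

SELECT *
FROM my_table
WHERE observation_date = max_date;
```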

A query with MIN(date) not finished in 20 hours: should it be like that, or I did something wrong?

Inspired by a post by Tommaso Pifferi, I've created a PostgreSQL (11) database for my time-series data: 316K financial instruments, 139M records in total. Time series for different instruments vary in length and time period, and often have gaps. There are two tables: one describing the instruments, and one holding the time-series records. The structure is very simple:
TABLE instruments has
- instr_id INT PRIMARY KEY, and
- 9 more columns describing each instrument.
TABLE timeseries has
- PRIMARY KEY (instr_id, date), where
- instr_id INT REFERENCES instruments(instr_id) connects time-series records with the instrument description,
- date DATE NOT NULL is the date of the time-series record (there is no index on date), and
- 5 more columns containing indicators such as price, trading volume, etc.
I work in Python 3.7, use psycopg2 as the driver and sqlalchemy as the ORM (but this is probably irrelevant). First I filled in the database using DataFrame.to_sql, ran VACUUM, and checked that simple queries work correctly. Then I wanted to add to the instruments table some columns summarizing time-series properties. Here is the first query I ran, using cursor.execute(), to test this idea. It is supposed to find, for each time series, the date of its earliest record:
ALTER TABLE instruments
  ADD begin DATE;

UPDATE instruments SET
  begin = (
    SELECT MIN(date) FROM timeseries
    WHERE timeseries.instr_id = instruments.instr_id
  );
This query has been running on a desktop PC (Intel i5, 8GB memory, Windows 7) for about 20 hours with no result. The server activity displayed in pgAdmin 4 looks as below.
I am new to relational databases and SQL. Is it normal that such a query performs so long, or do I do anything wrong?
Updates like that are typically faster if you aggregate once over everything and join the result into the UPDATE statement (note that MIN(date) needs an alias so it can be referenced as t.start_date):
UPDATE instruments
SET "begin" = t.start_date
FROM (
  SELECT instr_id, MIN(date) AS start_date
  FROM timeseries
  GROUP BY instr_id
) t
WHERE t.instr_id = instruments.instr_id;
The answer by a_horse_with_no_name is the correct one, but if you want to speed up the query without rewriting it, you should
CREATE INDEX ON timeseries (date);
That would speed up the repeated subselect and hence the whole query considerably.
What has to be done to get MIN(date)? The whole table of 139M records has to be scanned, once for every instrument, and that is the explanation.
To see how the query is executed, use EXPLAIN (see the PostgreSQL documentation on it). Note that EXPLAIN ANALYZE can take just as long as the original query, since the query has to be executed in order to collect all the information.
What to do? You can create an index. The question is whether it would be used: PostgreSQL will use an index when the query fetches less than roughly 2% of the table; otherwise it will go with a seqscan, a read of the whole table. If you suspect a seqscan is your case, consider adding date to the index; that way, instead of reading the table, the database can answer from the index alone. To check, use EXPLAIN.
That is the general answer. Try playing with it, and if you have more questions we can build up a final answer.
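A minimal sketch of checking the plan for one instrument before running the full update (42 is a hypothetical instr_id; plain EXPLAIN does not execute the statement):

```sql
EXPLAIN
SELECT MIN(date)
FROM timeseries
WHERE instr_id = 42;
```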

how a hive table should be created when we need query based on daily data

I want to understand how I should design a table when the query is something like the one below.
Select * from table_name where date_column > sysdate-2 and date_column < sysdate;
Note: my intention is to select each day's data specifically.
How should I design my table for better results? I think partitioning by date will give too many partitions and lead to a performance bottleneck, and I'm not sure whether bucketing works here. Please suggest an approach, with some explanation.
If the data on a daily basis is just not enough to justify a partition per day, consider partitioning by yyyyMM (year and month). In that case, your query changes to
Select * from table_name where
my_partition_col in (date_format(sysdate,'yyyyMM'), date_format(sysdate-2,'yyyyMM'))
AND date_column > sysdate-2 and date_column < sysdate;
This optimizes both the storage and the performance requirements.
You should partition by date.
You are correct that this will create a lot of partitions. Within Hive, each date will be a separate partition (a directory of files), and yes, Hive will need to maintain all of that, but that's exactly what Hive is best at.
note: my intention is to select data of each day to be specific
Since this is your intention, you'll get the best performance with daily partitions.
Other sorts of queries, running across many dates, may run into the performance bottleneck you're concerned about. If that occurs, you could consider creating a different table to address that use case.
For your primary, current use case, daily partitions are the solution.
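A sketch of what a daily-partitioned Hive table could look like (all names are placeholders, and ORC is just one reasonable storage format). With a filter on the partition column, Hive reads only the matching partitions:

```sql
-- Hypothetical daily-partitioned table
CREATE TABLE events (
  id      BIGINT,
  payload STRING
)
PARTITIONED BY (event_date DATE)
STORED AS ORC;

-- The filter on event_date prunes to the matching daily partitions
SELECT *
FROM events
WHERE event_date > date_sub(current_date, 2)
  AND event_date < current_date;
```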

What is the fastest way to perform a date query in Oracle SQL?

We have a 6B row table that is giving us challenges when retrieving data.
Our query returns values instantly when doing a...
SELECT * WHERE Event_Code = 102225120
That type of instant result is exactly what we need. We now want to filter to receive values for just a particular year - but the moment we add...
AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
...the query takes over 10 minutes to begin returning any values.
Another SO post mentions that indexes don't necessarily help date queries when pulling many rows as opposed to an individual row. There are other approaches like using TRUNC, or BETWEEN, or specifying the datetime in YYYY-MM-DD format for doing comparisons.
Of note, we do not have the option to add indexes to the database as it is a vendor's database.
What is the way to add a date filtering query and enable Oracle to begin streaming the results back in the fastest way possible?
Another SO post mentions that indexes don't necessarily help date queries when pulling many rows as opposed to an individual row
That question is quite different from yours. Firstly, your statement above applies to any data type, not only dates. Also, "many" is relative to the number of records in the table: if the optimizer decides that the query will return a large share of all records, it may decide that a full scan of the table is faster than using the index. In your situation this translates to: how many records fall in 2017, out of all records in the table? That calculation gives you the cardinality of your query, which in turn tells you whether an index is likely to be faster.
Now, if you decide that an index will be faster, the next step is knowing how to build it. For the optimizer to use an index, it must match the condition you're using. You are not comparing dates in your query; you are comparing only the year part, so an index on the raw date column will not be used. You need an index on the year expression, so create the index with the same expression you use in the condition.
we do not have the option to add indexes to the database as it is a vendor's database.
If you cannot modify the database, there is no way to optimize your query. You need to talk to the vendor and get access to modify the database or ask them to add the index for you.
A function applied to the column can also cause slowness given the number of records involved. A function-based index might help here; it's worth trying.
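A sketch of such a function-based index, with a placeholder table name (and, as noted elsewhere in this thread, it requires permission to modify the vendor's database):

```sql
CREATE INDEX ix_performed_year
  ON event_table (EXTRACT(YEAR FROM PERFORMED_DATE_TIME));
```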
Have you tried adding a year column to the table? If not, try adding one and populating it with the code below.
UPDATE table
SET year = EXTRACT(YEAR FROM PERFORMED_DATE_TIME);
This will take time, though.
But after that, you can run the query below.
SELECT *
FROM table
WHERE Event_Code = 102225120 AND year = 2017;
Also, consider partitioning the table for data this big. For starters, see the link below.
link: https://oracle-base.com/articles/8i/partitioned-tables-and-indexes
Your question is a bit ambiguous IMHO:
but the moment we add...
AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
...the query takes over 10 minutes to begin returning any values.
Do you mean that
SELECT * WHERE Event_Code = 102225120
is fast, but
SELECT * WHERE Event_Code = 102225120 AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
is slow?
For starters, I'll agree with Mitch Wheat that you should use PERFORMED_DATE_TIME between Jan 1, 2017 and Dec 31, 2017 instead of Year(field) = 2017. Even with an index on the field, the latter could hardly make use of it, while the first form benefits enormously.
I'm also hoping you want to be more specific than just 'give me all of 2017', because returning over 1B rows is never going to be fast.
Next, if you can't make changes to the database, would you be able to maintain a 'shadow' in another database? This would require creating a table in another database with all the date values AND the PK of the original table, querying it to find the relevant PK values, and then joining those back to the original table to find whatever you need. The biggest problem is keeping the shadow in sync with the original. If you know the original table only changes overnight, you could merge the changes in the morning and query all day. If the application is 'real-time(ish)', this probably won't work without some clever thinking... And yes, the initial load of 6B values will be rather heavy =)
This may be useful, because it avoids functions (a cause of context switching), and if you have an index on your date field it can be used:
WITH dt AS (
  SELECT
    to_date('01/01/2017', 'DD/MM/YYYY') AS d1,
    to_date('31/01/2017', 'DD/MM/YYYY') AS d2
  FROM dual
),
dates AS (
  SELECT dt.d1 + rownum - 1 AS d
  FROM dt
  CONNECT BY dt.d1 + rownum - 1 <= dt.d2
)
SELECT *
FROM your_table, dates
WHERE dates.d = PERFORMED_DATE_TIME
Move the date literal to RHS:
AND PERFORMED_DATE_TIME >= date '2017-01-01'
AND PERFORMED_DATE_TIME < date '2018-01-01'
But without an (undisclosed) appropriate index on PERFORMED_DATE_TIME, the query is unlikely to be any faster.
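Should adding an index ever become an option (see the note on vendor databases below), a composite index covering both filters would let Oracle answer the rewritten query without a full scan. A sketch with a placeholder table name:

```sql
CREATE INDEX ix_event_perf
  ON event_table (event_code, performed_date_time);
```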
One option to create indexes in third party databases is to script in the index and then before any vendor upgrade run a script to remove any indexes you've added. If the index is important, ask the vendor to add it to their database design.