Simple Hive select not giving full results - SQL

I'm querying a table, let's call it 'customer_table', which is partitioned by day in 'yyyymmdd' format.
When looking at the partitions with 'show partitions customer_table' I can see there is data for yesterday and the day before.
However, when I run a query like:
select customer, customer_joined_date, name, address, city
from customer_table ct
left join addresses addr on ct.cust_id=addr.cust_id
where ct.customer_joined_date >= date_format(date_sub(current_date,7),'yyyyMMdd');
the results do not include the data from yesterday or the day before.
My instinct is that the previous two days' partitions have some kind of lock that prevents querying while data is still streaming into them.
Can you suggest what's happening? Is there an environment parameter I can set so the query ignores 'locks'?

Could you run the query below, just to check that there's no funky stuff going on in the join:
select max(customer_joined_date)
from customer_table ct
where ct.customer_joined_date >= date_format(date_sub(current_date,7),'yyyyMMdd');
If you still don't see the data from the latest partitions, try gathering stats for those partitions once and see if that results in any change. Below is an example of the syntax.
ANALYZE TABLE Table1 PARTITION(ds='2008-04-09') COMPUTE STATISTICS;
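Applied to the table in the question, it would look something like this (a sketch; the partition column name day_key and the date value are assumptions, so substitute your real partition column and a day you expect data for):
-- day_key and '20240101' are placeholders for the actual partition column and day
ANALYZE TABLE customer_table PARTITION(day_key='20240101') COMPUTE STATISTICS;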

Related

Specify the partition # based on date range for that pkey value

We have a DW query that needs to extract data from a very large table, around 10 TB, which is partitioned by a datetime column (let's say time) that we use to purge data every day. So my understanding is that each partition holds one day's worth of data. From the storage tab (SSMS GUI) I see the # of partitions is 1995.
There is no clustered index on this table, as it's mostly intended for write operations. Just a design by the vendor. Here is the partition-targeted query, where the partition # (1896 for 06/01) is hard-coded:
SELECT
a.*
FROM dbo.VLTB AS a
CROSS APPLY
(
VALUES($PARTITION.a_func(a.time))
) AS c (pid)
WHERE c.pid = 1896;
Currently the query submitted is:
SELECT * from dbo.VLTB
WHERE time >= convert(datetime,'20210601',112)
AND time < convert(datetime,'20210602',112)
So replacing the inequality predicates with an equality check against that day's specific partition might help. Users can control the dates they send via the app, but how will they manage if we want them to use a partition # as in the first query?
Question
How do I make the above query find the partition number for that day, rather than hard-coding it (for 06/01 I had to give partition # 1896)? Is there a better way to have the script find the partition # so that not all partitions are scanned, and insert the correct partition # into the WHERE clause?
Thank you
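One possibility (a sketch, not part of the original thread): apply $PARTITION to the date literal itself, so the partition number is derived from the date the users already send rather than hard-coded.
SELECT a.*
FROM dbo.VLTB AS a
-- a_func is the partition function from the question; the date literal is an example
WHERE $PARTITION.a_func(a.time) = $PARTITION.a_func(convert(datetime, '20210601', 112));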

BigQuery: querying the latest partition, bytes to be processed vs. actually processed

I'm struggling to query the last partition of a table efficiently, using a date or datetime field. My first approach was to filter like this:
SELECT *
FROM my_table
WHERE observation_date = (SELECT MAX(observation_date) FROM my_table)
But according to BigQuery's processing estimate, that scans the entire table and does not use the partitions. Google even states in their documentation that this happens. It does work if I use an exact value for the partition:
SELECT *
FROM my_table
WHERE observation_date = CURRENT_DATE
But if the table is not up to date, then the query will not return any results and my automatic processes will fail. If I include an offset like observation_date = DATE_SUB(CURRENT_DATE, INTERVAL 2 DAY), I will likely miss the latest partition.
What is the best practice to get the latest partition efficiently?
What makes this worse is that BigQuery's estimate of the bytes to be processed by the active query does not match what was actually processed, unless I'm not interpreting those numbers correctly. Below is a screenshot of the mismatching values.
[Screenshot: BigQuery console showing apparently mismatching processed bytes]
Finally a couple of scenarios that I also tested:
If I store a max_date with a DECLARE statement first, as suggested in this post, the estimate seems to work, but it is not clear why. However, the actual bytes processed after running the query are no different from the case that filters the latest partition in the WHERE clause (see the sketch after these two scenarios).
Using the same declared max_date on a table that is both partitioned and clustered, the estimate works only when filtering on the partition, but fails if I include a filter on the cluster.
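For reference, the DECLARE pattern mentioned above looks something like this (a sketch using this question's table and column names):
-- Resolve the latest date into a variable first, then filter with it
DECLARE max_date DATE DEFAULT (SELECT MAX(observation_date) FROM my_table);
SELECT * FROM my_table WHERE observation_date = max_date;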
After some iterations I got an answer from Google, and although it doesn't resolve the issue, it acknowledges that it happens.
Tables partitioned on DATE or DATETIME fields cannot be efficiently queried for their latest partition. The best practice remains to filter with something like WHERE observation_date = (SELECT MAX(observation_date) FROM my_table), and that will scan the whole table.
They made notes to try to improve this in the future, but for now we have to deal with it. I hope this helps somebody who was trying to do the same thing.
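A possible mitigation, in the same spirit as the Athena lookback trick further down this page (an assumption on my part, not part of Google's reply): bound both the outer filter and the MAX subquery with a constant range, so partition pruning can use the constant predicate, at the cost of assuming the latest partition is at most, say, 7 days old.
SELECT *
FROM my_table
WHERE observation_date >= DATE_SUB(CURRENT_DATE, INTERVAL 7 DAY)  -- constant predicate: prunes partitions
  AND observation_date = (
    SELECT MAX(observation_date)
    FROM my_table
    WHERE observation_date >= DATE_SUB(CURRENT_DATE, INTERVAL 7 DAY)
  );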

A query with MIN(date) not finished in 20 hours: should it be like that, or I did something wrong?

Inspired by a post by Tommaso Pifferi, I've created a PostgreSQL (11) database for my time series data: 316K financial instruments, 139M records in total. The time series of different instruments vary in length and time period, and often have gaps. There are two tables: one describing the instruments and one holding the time series records. The structure is very simple:
TABLE instruments has
- instr_id INT PRIMARY KEY, and
- 9 more columns describing each instrument.
TABLE timeseries has
- PRIMARY KEY (instr_id, date), where
- instr_id INT REFERENCES instruments(instr_id) connects time series records with instrument descriptions,
- date DATE NOT NULL is the date of the time series record (there is no index on date), and
- 5 more columns containing indicators such as price, trading volume, etc.
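A minimal sketch of that schema in DDL (only the columns named above; the remaining columns are omitted):
CREATE TABLE instruments (
    instr_id INT PRIMARY KEY
    -- ... 9 more columns describing each instrument
);
CREATE TABLE timeseries (
    instr_id INT REFERENCES instruments(instr_id),
    date     DATE NOT NULL,
    -- ... 5 more columns with indicators such as price and trading volume
    PRIMARY KEY (instr_id, date)
);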
I work in Python 3.7 and use psycopg2 as the driver and SQLAlchemy as the ORM (but this is probably irrelevant). First I filled the database using DataFrame.to_sql, ran VACUUM, and checked that simple queries work correctly. Then I wanted to add to the instruments table some columns summarizing time series properties. Here is the first query I ran, using cursor.execute(), to test the idea. It is supposed to find, for each time series, the date of its earliest record:
ALTER TABLE instruments
ADD begin DATE;
UPDATE instruments SET
begin = (
SELECT MIN(date) FROM timeseries
WHERE timeseries.instr_id=instruments.instr_id
);
This query has been running on a desktop PC (Intel i5, 8 GB memory, Windows 7) for about 20 hours with no result. The server activity displayed in pgAdmin 4 is shown below. [Screenshot: pgAdmin 4 server activity]
I am new to relational databases and SQL. Is it normal that such a query runs this long, or did I do something wrong?
Updates like that are typically faster if you aggregate once over everything and join the result into the UPDATE statement:
UPDATE instruments
SET "begin" = t.start_date
FROM (
    SELECT instr_id, MIN(date) AS start_date
    FROM timeseries
    GROUP BY instr_id
) t
WHERE t.instr_id = instruments.instr_id;
The answer by a_horse_with_no_name is the correct one, but if you want to speed up the query without rewriting it, you should
CREATE INDEX ON timeseries (date);
That would speed up the repeated subselect and hence the whole query considerably.
What has to be done to get MIN(date)? The whole table of 139M records has to be scanned, once for every instrument, and that is the explanation.
To see how the query is executed, please use EXPLAIN (here you can find the documentation). Note that EXPLAIN ANALYZE can itself take hours, since the query has to be executed in order to collect all the information.
What to do? You can create an index. The question is whether it would be used: PG will use an index if the query fetches less than about 2% of the table; in other cases it will go with a seqscan, a read of the whole table. If you feel that a seqscan is your case, you can consider adding date to the index; that way, instead of reading the table, the DB can use the stats of that index. To check, use EXPLAIN.
That is the general answer. Just try to play with it. If you have more questions, we can try to build up a final answer.
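For example, to inspect the plan of the rewritten UPDATE from the first answer without executing it (a sketch; EXPLAIN ANALYZE would also run the statement):
EXPLAIN
UPDATE instruments
SET "begin" = t.start_date
FROM (
    SELECT instr_id, MIN(date) AS start_date
    FROM timeseries
    GROUP BY instr_id
) t
WHERE t.instr_id = instruments.instr_id;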

reduce the amount of data scanned by Athena when using aggregate functions

The below query scans 100 MB of data:
select * from table where column1 = 'val' and partition_id = '20190309';
However, the below query scans 15 GB of data (there are over 90 partitions):
select * from table where column1 = 'val' and partition_id in (select max(partition_id) from table);
How can I optimize the second query to scan the same amount of data as the first?
There are two problems here: the efficiency of the scalar subquery select max(partition_id) from table, and the one @PiotrFindeisen pointed out around dynamic filtering.
The first problem is that queries over the partition keys of a Hive table are a lot more complex than they appear. Most folks would think that if you want the max value of a partition key, you can simply execute a query over the partition keys, but that doesn't work, because Hive allows partitions to be empty (and it also allows non-empty files that contain no rows). Specifically, the scalar subquery select max(partition_id) from table requires Trino (formerly PrestoSQL) to find the max partition containing at least one row. The ideal solution would be to have perfect stats in Hive, but short of that the engine would need custom logic for Hive that opens files of the partitions until it finds a non-empty one.
If you are sure that your warehouse does not contain empty partitions (or if you are OK with the implications of that), you can replace the scalar subquery with one over the hidden "table$partitions" table:
select *
from table
where column1 = 'val' and
partition_id = (select max(partition_id) from "table$partitions");
The second problem is the one @PiotrFindeisen pointed out, and it has to do with the way queries are planned and executed. Most people would look at the above query and expect the engine to figure out the value of select max(partition_id) from "table$partitions" during planning, inline it into the plan, and then continue with optimization. Unfortunately, that is a pretty complex decision to make generically, so the engine instead simply models this as a broadcast join, where one part of the execution figures out that value and broadcasts it to the rest of the workers. The problem is that the rest of the execution has no way to add this new information into the existing processing, so it simply scans all of the data and then filters out the values you are trying to skip. There is a project in progress to add this dynamic filtering, but it is not complete yet.
This means the best you can do today is to run two separate queries: one to get the max partition_id, and a second one with the inlined value.
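Concretely, the two-step approach would look something like this (a sketch reusing the table and literal from the question above; in practice the second query's literal comes from the first query's result):
-- Step 1: metadata-only lookup for the newest partition
select max(partition_id) from "table$partitions";
-- Step 2: inline the value returned by step 1
select * from table where column1 = 'val' and partition_id = '20190309';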
BTW, the hidden "$partitions" table was added in Presto 0.199, and we fixed some minor bugs in 0.201. I'm not sure which version Athena is based on, but I believe it is pretty far out of date (the current release at the time I'm writing this answer is 309).
EDIT: Presto removed the __internal_partitions__ table in their 0.193 release, so I'd suggest not using the solution defined in the "Slow aggregation queries for partition keys" section below in any production systems, since Athena 'transparently' updates Presto versions. I ended up just going with the naive SELECT max(partition_date) ... query, but also using the same lookback trick outlined in the "Lack of Dynamic Filtering" section. It's about 3x slower than using the __internal_partitions__ table, but at least it won't break when Athena decides to update their Presto version.
----- Original Post -----
So I've come up with a fairly hacky way to accomplish this for date-based partitions on large datasets, for when you only need to look back over a few partitions' worth of data for a match on the max. However, please note that I'm not 100% sure how brittle the usage of the information_schema.__internal_partitions__ table is.
As @Dain noted above, there are really two issues: the first being how slow an aggregation of the max(partition_date) query is, and the second being Presto's lack of support for dynamic filtering.
Slow aggregation queries for partition keys
To solve the first issue, I'm using the information_schema.__internal_partitions__ table which allows me to get quick aggregations on the partitions of a table without scanning the data inside the files. (Note that partition_value, partition_key, and partition_number in the below queries are all column names of the __internal_partitions__ table and not related to your table's columns)
If you only have a single partition key for your table, you can do something like:
SELECT max(partition_value) FROM information_schema.__internal_partitions__
WHERE table_schema = 'DATABASE_NAME' AND table_name = 'TABLE_NAME'
But if you have multiple partition keys, you'll need something more like this:
SELECT max(partition_date) as latest_partition_date from (
  SELECT max(case when partition_key = 'partition_date' then partition_value end) as partition_date,
         max(case when partition_key = 'another_partition_key' then partition_value end) as another_partition_key
  FROM information_schema.__internal_partitions__
  WHERE table_schema = 'DATABASE_NAME' AND table_name = 'TABLE_NAME'
  GROUP BY partition_number
)
WHERE
  -- ... Filter down by values for e.g. another_partition_key
These queries should run fairly quickly (mine run in about 1-2 seconds) without scanning through the actual data in the files, but again, I'm not sure if there are any gotchas with using this approach.
Lack of Dynamic Filtering
I'm able to mitigate the worst effects of the second problem for my specific use-case because I expect there to always be a partition within a finite amount of time back from the current date (e.g. I can guarantee any data-production or partition-loading issues will be remedied within 3 days). It turns out that Athena does do some pre-processing when using presto's datetime functions, so this does not have the same types of issues with Dynamic Filtering as using a sub-query.
So you can change your query to limit how far it will look back for the actual max using the datetime functions so that the amount of data scanned will be limited.
SELECT * FROM "DATABASE_NAME"."TABLE_NAME"
WHERE partition_date >= cast(date '2019-06-25' - interval '3' day as varchar) -- Will only scan partitions from 3 days before '2019-06-25'
AND partition_date = (
-- Insert the partition aggregation query from above here
)
I don't know if it is still relevant, but just found out:
Instead of:
select * from table where column1 = 'val' and partition_id in (select max(partition_id) from table);
Use:
select a.* from table a
inner join (select max(partition_id) max_id from table) b on a.partition_id=b.max_id
where column1 = 'val';
I think it has something to do with how joins are optimized to make use of partitions.

SQL Server 2005 query plan optimizer choking on date partitioned tables

We have TABLE_A, partitioned by date, which does not contain data from today; it only contains data from the prior day back through the year to date.
We have TABLE_B, also partitioned by date, which does contain data from today as well as data from the prior day back through the year to date. On top of TABLE_B there is a view, View_B, which joins against View_C and View_D and left outer joins TABLE_E. View_C and View_D are each selects from one table and do not have any other tables joined in. So View_B looks something like:
SELECT b.Foo, c.cItem, d.dItem, e.eItem
FROM TABLE_B b JOIN View_C c on c.cItem = b.cItem
JOIN View_D d on b.dItem = d.dItem
LEFT OUTER JOIN TABLE_E e on b.eItem = e.eItem
View_AB joins TABLE_A and View_B on the extract date as well as one other constraint. So it looks something like:
SELECT a.Col_1, b.Col_2, ...
FROM TABLE_A a LEFT OUTER JOIN View_B b
on a.ExtractDate = b.ExtractDate and a.Foo=b.Foo
-- no where clause
When searching for data from anything other than the prior day, the query optimizer does what would be expected: it uses a hash match join to complete the outer join and reads about 116 pages of data from TABLE_B. When run for the prior day, however, the optimizer freaks out, uses a nested loops join, scans the table 7000+ times, and reads 8,000,000+ pages in the join.
We can fake it/force it to use a different query plan by using join hints, but then any constraints in the view that look at TABLE_B cause the optimizer to throw an error that the query can't be completed due to the join hints.
Editing to add that the pages/scans equal the same number as is hit in one scan when run for a date other than the prior day, where the optimizer correctly chooses a hash join instead of a nested loops join.
As mentioned in the comments, we have severely reduced the impact by creating a covering index on TABLE_B to cover the join in View_B, but the IO is still higher than it would be if the optimizer chose the correct plan, especially since the index is essentially redundant for all but prior-day searches.
The sqlplan is at http://pastebin.com/m53789da9, sorry that it's not the nicely formatted version.
If you can post the .sqlplan for each of the queries it would help for sure, but my hunch is that you are getting a parallel plan when querying for dates prior to the current day, and the nested loop is possibly a constant loop over the partitions included in the table, which would then spawn a worker thread for each partition (for more information, see the SQLCAT post on parallel plans with partitioned tables in SQL 2005). I can't verify whether this is the case without seeing the plans, however.
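If that hunch is right, forcing a serial plan should change the plan shape. A quick test (a sketch, not from the original thread; the column names are taken from the View_AB example above):
SELECT a.Col_1, b.Col_2
FROM TABLE_A a LEFT OUTER JOIN View_B b
    on a.ExtractDate = b.ExtractDate and a.Foo = b.Foo
OPTION (MAXDOP 1); -- MAXDOP 1 disables parallelism for this statement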
In case anyone ever runs into this: the issue appears to be only tangentially related to the partitioning scheme. Even though we run a statistics update nightly, it appears that SQL Server
- didn't create a statistic on ExtractDate, and
- even when the ExtractDate statistic was explicitly created, didn't pick up that the prior day had data.
We resolved it by doing a CREATE STATISTICS TABLE_A_ExtractDate_Stats ON TABLE_A (ExtractDate) WITH FULLSCAN. Now searching for the prior day, and a random sampling of days, appears to generate the correct plan.