I have a table of page hits on a website, and for each page of a specific type (marketing_page) I am trying to identify the next page a customer hits. My query would probably look something like this:
Select * from
(
    Select page_id
    , hit_time
    , customer_id
    , session_id
    , page_type
    , LEAD(page_id, 1) over (PARTITION BY customer_id, session_id ORDER BY hit_time) as next_page_id
    FROM page_hits
) t
WHERE page_type = 'marketing_page'
The problem with this approach is that the sub-query becomes HUGE if I keep the WHERE clause outside the sub-query. Ideally I'd like to be able to do something like:
Select page_id
, hit_time
, customer_id
, session_id
, page_type
, LEAD(page_id, 1) over (PARTITION BY customer_id, session_id ORDER BY hit_time) as next_page_id
FROM page_hits
WHERE page_type = 'marketing_page'
but have it still account for pages excluded by the WHERE clause when computing the LEAD. I understand that window functions are evaluated after the WHERE, so this is not possible as written.
I would also like to avoid a self join because of the efficiency issue. Is there a fast/simple way to achieve this?
Thanks!
This is too long for a comment.
If a simple lead() does not work on a Redshift table, that could mean one of several things. What comes to mind:
The Redshift database is busy, having used up all query connections, and you are just waiting. I'll assume this is not the case.
Your data is seriously big.
Your "table" is really a complicated view.
Given the nature of the data, I would assume that the data is seriously big. I would further assume that it is partitioned by some time unit, probably day.
You need to limit the query to one or a handful of partitions to run it. Your question provides no information on how that might be done.
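If sheer size is the blocker, one hedged option is to restrict the rows that feed the window function first and only then keep the marketing pages. A minimal sketch, reusing the column names from the question; the seven-day bound is just a placeholder for however the data is actually time-limited:

WITH recent_hits AS (
    SELECT page_id, hit_time, customer_id, session_id, page_type,
           LEAD(page_id, 1) OVER (PARTITION BY customer_id, session_id ORDER BY hit_time) AS next_page_id
    FROM page_hits
    -- placeholder bound: this restricts what LEAD sees, so sessions straddling the cutoff lose their tail
    WHERE hit_time >= DATEADD(day, -7, CURRENT_DATE)
)
SELECT *
FROM recent_hits
WHERE page_type = 'marketing_page';

This keeps the LEAD computed over every page type (within the bounded range), while the marketing_page filter is applied only to the final output.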
I am looking to retrieve the first row and last row over a window in HiveQL.
I know there are a couple ways to do this:
Use FIRST_VALUE and LAST_VALUE on the columns I am interested in.
SELECT customer,
       FIRST_VALUE(product) OVER (W) AS product_first,
       FIRST_VALUE(time)    OVER (W) AS time_first,
       LAST_VALUE(product)  OVER (W) AS product_last,
       LAST_VALUE(time)     OVER (W) AS time_last
FROM table
-- the explicit frame is needed so LAST_VALUE sees the whole partition,
-- not just the rows up to the current one (the default frame)
WINDOW W AS (PARTITION BY customer ORDER BY cost
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
Calculate ROW_NUMBER() for each row and use a WHERE clause on row_number = 1.
WITH table_wRN AS
(
SELECT *,
row_number() over (partition by customer order by cost ASC) rn_B,
row_number() over (partition by customer order by cost DESC) rn_E
FROM table
),
table_first_last AS
(
SELECT *
FROM table_wRN
WHERE (rn_E=1 OR rn_B=1)
)
SELECT table_first.customer,
       table_first.product AS product_first, table_first.time AS time_first,
       table_last.product  AS product_last,  table_last.time  AS time_last
FROM table_first_last AS table_first
JOIN table_first_last AS table_last
  ON table_first.customer = table_last.customer
WHERE table_first.rn_B = 1
  AND table_last.rn_E = 1
My questions:
Does anyone know which of these two is more efficient?
Intuitively, I think the first one should be faster because there is no need for a sub-query or a CTE.
Experimentally, I feel the second is faster but this could be because I am running first_value on a number of columns.
Is there a way to apply first_value and retrieve multiple columns in one shot?
I am looking to reduce the number of times the windowing is done / evaluated (something like caching the window).
Example of pseudo-code:
FIRST_VALUE(product,time) OVER (W) AS product_first, time_first
Thank you!
I am almost certain that the first would be more efficient. I mean, window functions over a single window versus two window functions, filtering, and a self join?
Once you multiply the number of columns, which is faster might become a real question. That said, look at the execution plan. I would expect all window functions that share the same window specification to reuse the same windowing step, with just a tweak for each value.
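One way to check, sketched on the assumption that the table and window are as in the question: Hive's EXPLAIN prints the operator plan, so you can see whether the four FIRST_VALUE/LAST_VALUE calls end up in a single windowing (PTF) stage or in several:

EXPLAIN
SELECT customer,
       FIRST_VALUE(product) OVER (W), FIRST_VALUE(time) OVER (W),
       LAST_VALUE(product)  OVER (W), LAST_VALUE(time)  OVER (W)
FROM table
WINDOW W AS (PARTITION BY customer ORDER BY cost
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);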
Hive does not have very good support for complex data types such as structs and arrays. In databases that do support them, it is easy enough to return a complex type from the window function.
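Purely as an illustration of that last point: in a dialect that lets a window function return a struct, the pseudo-code from the question might be written roughly like the sketch below (named_struct is the Hive-style constructor; whether the engine accepts a complex type here is exactly the support question above):

SELECT customer,
       FIRST_VALUE(named_struct('product', product, 'time', time)) OVER (W) AS first_pt,
       LAST_VALUE(named_struct('product', product, 'time', time))  OVER (W) AS last_pt
FROM table
WINDOW W AS (PARTITION BY customer ORDER BY cost
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)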
I have a really complicated query:
select * from (
select * from tbl_user ...
where ...
and date_created between :date_from and :today
...
order by date_created desc
) where rownum <=50;
Currently the query is fast enough because of the WHERE clause (only 3 months before today; date_from = today - 90 days).
I have to remove this clause, but doing so causes performance degradation.
What if I first calculate date_from with
SELECT MIN(date_created) FROM tbl_user WHERE ...
and then insert that value into the main query? The set of data will be the same. Will it improve performance? Does it make sense?
Does anyone have any suggestions for optimizing this?
Using an order by operation will of course cause the query to take a little longer to return. That being said, it is almost always faster to sort in the DB than it is to sort in your application logic.
It's hard to really optimize without the full query and schema information, but I'll take a stab at what seems like the most obvious to me.
Converting to Rank()
Your query could be a lot more efficient if you use a windowed rank() function. I've also converted it to use a common table expression (aka CTE). This doesn't improve performance, but does make it easier to read.
with cte as (
select
    u.*
    , rank() over (
        partition by
            -- insert what fields differentiate your rows here
            -- unlike a group by clause, this doesn't need to be
            -- every field
        order by
            date_created desc
    ) as rk
from
    tbl_user u
...
where
...
and date_created between :date_from and :today
)
select
*
from
cte
where
rk <= 50
Indexing
If date_created is not indexed, it probably should be.
Take a look at your autotrace results and figure out which filters have the highest cost. Those probably reference unindexed columns that maybe should be indexed.
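For example, a minimal sketch of the most likely candidate, using the names from the query above (the index name is arbitrary):

-- only worthwhile if date_created is filtered on but not yet indexed;
-- compare autotrace before and after
create index idx_tbl_user_date_created on tbl_user (date_created);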
If you post your schema, I'd be happy to make better suggestions.
When I run the following query, I get a 'resources exceeded' error. If I remove the last line (the ORDER BY clause) it works:
SELECT
id,
INTEGER(-position / (CASE WHEN fallback = 0 THEN 2 ELSE 1 END)) AS major_sort
FROM (
SELECT
id,
fallback,
ROW_NUMBER() OVER(PARTITION BY fallback) AS position
FROM
[table] AS r
ORDER BY
r.score DESC ) AS r
ORDER BY major_sort DESC
Actually the entire last line would be:
ORDER BY major_sort DESC, r.score DESC
But that would probably make things even worse.
Any idea how I could change the query to circumvent this problem?
((If you wonder what this query does: the table contains a 'ranking' with multiple fallback strategies and I want to create an ordering like this: 'AABAABAABAAB' with 'A' and 'B' being the fallback strategies. If you have a better idea how to achieve this; please feel free to tell me :D))
A top-level ORDER BY will always serialize execution of your query: it will force all computation onto a single node for the purpose of sorting. That's the cause of the resources exceeded error.
I'm not sure I fully understand your goal with the query, so it's hard to suggest alternatives, but you might consider putting an ORDER BY clause within the OVER(PARTITION BY ...) clause. Sorting a single partition can be done in parallel and may be closer to what you want.
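Applied to the query in the question, that suggestion might look like the sketch below: the inner global ORDER BY r.score DESC is dropped and the ordering moves inside the window, where it is done per partition (score is assumed to be a column of [table], as the original inner ORDER BY implies):

SELECT
  id,
  INTEGER(-position / (CASE WHEN fallback = 0 THEN 2 ELSE 1 END)) AS major_sort
FROM (
  SELECT
    id,
    fallback,
    -- the sort now happens per fallback partition, in parallel
    ROW_NUMBER() OVER(PARTITION BY fallback ORDER BY score DESC) AS position
  FROM
    [table] ) AS r
-- a final global ORDER BY major_sort DESC would reintroduce the serializing step;
-- keep it only if the output is small, or apply the ordering downstream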
More general advice on ordering:
Order is not preserved during BQ queries, so if there's an ordering that you want to preserve on the input rows, make sure it's encoded in your data as an extra field.
The use cases for large amounts of globally-sorted data are somewhat limited. Often when users run into resource limitations with ORDER BY, we find that they're actually looking for something slightly different (locally ordered data, or "top N"), and that it's possible to get rid of the global ORDER BY completely.
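For the "top N" case, a hedged sketch of that shape: rank within each fallback partition and filter, with no global ORDER BY at all (the 1000 cutoff is a placeholder):

SELECT id, fallback, position
FROM (
  SELECT
    id,
    fallback,
    ROW_NUMBER() OVER(PARTITION BY fallback ORDER BY score DESC) AS position
  FROM
    [table] )
WHERE position <= 1000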
Here is my (simplified) problem, very common I guess:
create table sample (client, recordDate, amount)
I want to find out the latest recording, for each client, with recordDate and amount.
I made the below code, which works, but I wonder if there is any better pattern or Oracle tweak to improve the efficiency of such a SELECT. (I am not allowed to modify the structure of the database, so indexes etc. are out of reach for me, and out of scope for the question.)
select s.client, s.recordDate, s.Amount
from sample s
inner join (select client, max(recordDate) lastDate
            from sample
            group by client) t on s.client = t.client and s.recordDate = t.lastDate
The table has half a million records and the select takes 2-4 secs, which is acceptable but I am curious to see if that can be improved.
Thanks
In most cases windowed aggregate functions might perform better (at the very least, they are easier to write):
select client, recordDate, Amount
from
(
select client, recordDate, Amount,
rank() over (partition by client order by recordDate desc) as rn
from sample s
) dt
where rn = 1
Another structure for the query is not exists. This can perform faster under some circumstances:
select client, recordDate, Amount
from sample s
where not exists (select 1
from sample s2
where s2.client = s.client and
s2.recordDate > s.recordDate
);
This would take good advantage of an index on sample(client, recordDate), if one were available.
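The question rules out adding indexes, so purely for reference, that index would be a one-liner (the name is arbitrary):

create index idx_sample_client_date on sample (client, recordDate);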
And, another thing to try is keep:
select client, max(recordDate),
max(Amount) keep (dense_rank first order by recordDate desc)
from sample s
group by client;
This version assumes only one max record date per client (your original query does not make that assumption).
These queries (plus the one by dnoeth) should all have different query plans and you might get lucky on one of them. The best solution, though, is to have the appropriate index.
I have built cohorts of accounts based on the date of first usage of our service. I need to use these cohorts in a handful of different queries, but I don't want to have to rebuild the cohort in each of these downstream queries. Reason: getting the data the first time took more than 60 minutes, so I don't want to pay that tax for all the other queries.
I know that I could do a statement like the below:
WHERE ACCOUNT_ID IN ('1234567','7891011','1213141'...)
But, I'm wondering if there is a way to create a temporary table that I prepopulate with my data, something like
WITH MAY_COHORT AS ( SELECT ACCOUNT_ID Account_ID, '1234567' Account_ID, '7891011' Account_ID, '1213141' )
I know that the above won't work, but would appreciate any advice or counsel here.
thanks.
Unless I am missing something, you're already on the right track; just an adjustment to your CTE should work:
WITH MAY_COHORT AS (
    SELECT Account_ID
    FROM TableName
    WHERE ACCOUNT_ID IN ('1234567', '7891011', '1213141'...)
)
This should give you the May_Cohort table to use for subsequent queries.
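For example, a downstream query would just prepend the same CTE and join to it; downstream_table and its columns are placeholders here, not something from your schema:

WITH MAY_COHORT AS (
    SELECT Account_ID
    FROM TableName
    WHERE ACCOUNT_ID IN ('1234567', '7891011', '1213141')
)
SELECT d.*
FROM downstream_table d
JOIN MAY_COHORT m
  ON d.Account_ID = m.Account_ID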
You can also use a sub-select for your IDs (without the WITH MAY_COHORT):
WHERE ACCOUNT_ID IN (
    SELECT Account_ID
    FROM TableName
    WHERE ... your condition to build your cohort ...
)