HQL: LEAD inside CREATE TABLE returns wrong lead data - sql

Bad case:
CREATE TABLE IF NOT EXISTS tmp12 AS
WITH tmp_table AS
(
SELECT dt, time, LEAD(time,1,'99999') OVER(PARTITION BY urs) AS last_time, urs, tag FROM
(
SELECT dt, urs, time, tag, count(1) FROM table0 LIMIT 10000
) t1
GROUP BY dt, urs, time, tag
ORDER BY urs, time
)
SELECT * FROM tmp_table
Then
select * from tmp12
I get wrong data like:
WRONG RESULT(dt, time, last_time, urs)
Some rows have last_time < time.
When I remove CREATE TABLE, I get a good case:
SELECT dt, time, LEAD(time,1,'99999') OVER(PARTITION BY urs) AS last_time, urs, tag FROM
(
SELECT dt, urs, time, tag, count(1) FROM table0 LIMIT 10000
) t1
GROUP BY dt, urs, time, tag
ORDER BY urs, time
with correct result like:
GOOD RESULT(dt, time, last_time, urs)
all the last_time > time.
Why? I only create the table and then select from it, yet last_time comes out wrong.

Could you try this:
CREATE TABLE IF NOT EXISTS tmp12 AS
WITH tmp_table AS
(
SELECT dt, time, LEAD(time,1,'99999') OVER(PARTITION BY urs ORDER BY time) AS last_time, urs, tag FROM
(
SELECT dt, urs, time, tag, count(1) FROM table0 LIMIT 10000
) t1
GROUP BY dt, urs, time, tag
ORDER BY urs, time
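)
SELECT * FROM tmp_table
I am guessing that the missing ORDER BY in the window function is the issue. Without it, the order of rows inside each urs partition is undefined, so LEAD can return the time of any other row in the partition. The outer ORDER BY only sorts the final output and has no effect on the window.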
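To see the effect in isolation, here is a minimal sketch on a few made-up rows. The sample CTE and its values are hypothetical stand-ins for table0, and it assumes an engine that accepts SELECT without FROM (quote the time column if your version reserves that word). With ORDER BY time inside OVER, last_time is always the next larger time for the same urs:

```sql
-- Hypothetical sample rows standing in for table0 (not from the original post).
WITH sample AS (
  SELECT 'u1' AS urs, '100' AS time UNION ALL
  SELECT 'u1', '300' UNION ALL
  SELECT 'u1', '200' UNION ALL
  SELECT 'u2', '150'
)
SELECT urs,
       time,
       -- ORDER BY inside OVER fixes which row counts as "next" in each partition.
       LEAD(time, 1, '99999') OVER (PARTITION BY urs ORDER BY time) AS last_time
FROM sample
ORDER BY urs, time;
-- Expected rows:
-- u1 | 100 | 200
-- u1 | 200 | 300
-- u1 | 300 | 99999
-- u2 | 150 | 99999
```

Dropping the ORDER BY from the OVER clause leaves the engine free to pair rows in whatever order it happens to read them, which can differ between a plain SELECT and a CREATE TABLE AS job.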

Related

I get different results every time I run my query, which uses the lead function in Impala SQL

I have the following code:
select *, lead(session_end_type) over (partition by user_id, session_id order by user_id, session_id, log_time) as next_session_end_type
from table_name;
However, it seems like this results in different results every time I run it.
What makes this difference?
Thanks in advance!
(I've checked that the code outputs different results through the following code:
create table t1 as
select *, lead(session_end_type) over (partition by user_id, session_id order by user_id, session_id, log_time) as next_session_end_type
from table_name;
create table t2 as
select *, lead(session_end_type) over (partition by user_id, session_id order by user_id, session_id, log_time) as next_session_end_type
from table_name;
select count (*) from
(
select * from t1
union
select * from t2
) as t;
The row count of the union differs from the row counts of t1 and t2, which means the two tables do not contain the same rows.)
First, there is no need to repeat the partition by columns in the order by. You can simplify this to:
lead(session_end_type) over (partition by user_id, session_id order by log_time) as next_session_end_type
Second, if log_time is not unique for a given user_id/session_id, then the results are unstable. Remember, SQL tables represent unordered sets, so if there are ties in sort keys then there is no "natural" order to fall back on.
You can check this with:
select user_id, session_id, log_time, count(*)
from table_name
group by user_id, session_id, log_time
having count(*) > 1
order by count(*) desc;
If you do have a column that uniquely identifies each row (or each user/user session row), then include that in the order by:
lead(session_end_type) over (partition by user_id, session_id
order by log_time, <make it stable column>) as next_session_end_type
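As a small illustration of the tie-breaker idea, here is a sketch on made-up rows. The sample CTE, its values, and the row_id column are all hypothetical (row_id plays the role of the "make it stable" column), and it assumes an engine that accepts SELECT without FROM. Two rows share log_time = 10, so only the extra sort key makes the result repeatable:

```sql
-- Hypothetical rows; row_id is an assumed unique column, not part of the original table.
WITH sample AS (
  SELECT 1 AS row_id, 'u1' AS user_id, 's1' AS session_id, 10 AS log_time, 'start' AS session_end_type UNION ALL
  SELECT 2, 'u1', 's1', 10, 'pause' UNION ALL
  SELECT 3, 'u1', 's1', 20, 'end'
)
SELECT row_id,
       log_time,
       session_end_type,
       -- row_id breaks the tie between the two rows with log_time = 10.
       LEAD(session_end_type) OVER (PARTITION BY user_id, session_id
                                    ORDER BY log_time, row_id) AS next_session_end_type
FROM sample
ORDER BY row_id;
-- Expected rows:
-- 1 | 10 | start | pause
-- 2 | 10 | pause | end
-- 3 | 20 | end   | NULL
```

With ORDER BY log_time alone, rows 1 and 2 could swap places between runs, and their next_session_end_type values would change with them.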

Display Prev and Current value based on a ID - SQL

I am not sure if a similar question has been posted. I was unable to find one.
I have the following table:
What I am trying to get is the below:
Any advice will be appreciated.
Thanks in advance,
Sam
Worked both in Oracle and Snowflake:
SELECT t.ID,
t.prev_dt,
t.current_dt,
t.prev_code,
t.curr_code
FROM (
SELECT id,
order_dt,
LAG(order_dt, 1) OVER (PARTITION by id ORDER BY id, order_dt) prev_dt,
upd_dt current_dt,
LAG(code, 1) OVER (PARTITION by id ORDER BY id, upd_dt) prev_code,
code curr_code
FROM t111
) t
INNER JOIN (
SELECT id,
MAX(order_dt) max_date
FROM t111
GROUP BY id
) idm
ON idm.id=t.id AND t.order_dt=idm.max_date
You seem to want window function lag():
select
id,
lag(order_dt) over(partition by id order by order_by_id) prev_dt,
order_dt current_dt,
lag(code) over(partition by id order by order_by_id) prev_code,
code curr_code
from mytable
Note that the above query does not filter the records of the table. When there is no preceding record, lag() returns null. If you want to filter out the first record per group, and assuming that such a record is identified by order_by_id = 1, you can do:
select *
from (
select
id,
lag(order_dt) over(partition by id order by order_by_id) prev_dt,
order_dt current_dt,
lag(code) over(partition by id order by order_by_id) prev_code,
code curr_code,
order_by_id
from mytable
) t
where order_by_id > 1
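Since the sample table from the question is not shown here, the following sketch uses made-up rows to show what lag() produces. The values in the mytable CTE are hypothetical, and it assumes an engine that accepts SELECT without FROM, as Snowflake does (Oracle would typically need FROM DUAL on each branch):

```sql
-- Hypothetical rows; only the column names follow the answer above.
WITH mytable AS (
  SELECT 1 AS id, 1 AS order_by_id, '2020-01-01' AS order_dt, 'A' AS code UNION ALL
  SELECT 1, 2, '2020-02-01', 'B' UNION ALL
  SELECT 2, 1, '2020-03-01', 'C'
)
SELECT id,
       order_by_id,
       LAG(order_dt) OVER (PARTITION BY id ORDER BY order_by_id) AS prev_dt,
       order_dt AS current_dt,
       LAG(code) OVER (PARTITION BY id ORDER BY order_by_id) AS prev_code,
       code AS curr_code
FROM mytable
ORDER BY id, order_by_id;
-- Expected rows:
-- 1 | 1 | NULL       | 2020-01-01 | NULL | A
-- 1 | 2 | 2020-01-01 | 2020-02-01 | A    | B
-- 2 | 1 | NULL       | 2020-03-01 | NULL | C
```

Filtering on order_by_id > 1 would keep only the middle row, the one that actually has a previous record.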
Window functions might be the best approach. But you could also use join:
select t1.id, t1.order_dt as prev_dt, t2.upd_dt as curr_date,
t1.code as prev_code, t2.code as curr_code
from t t1 join
t t2
on t1.id = t2.id and t1.order_by_id = 1 and t2.order_by_id = 2;
In Snowflake, I simply do not know whether this would have better, worse, or similar performance to using window functions.

SQL query to get maximum value for each day

So I have a table that looks something like this:
Now, I want the max totalcst for both days, something like this:
I tried using different variations of max and the Row_number function but still can't seem to get the result I want. My query:
select date,pid,max(quan*cst), totalcst
from dbo.test1
group by date, pid
But this returns all the records. So if somebody can point me towards the right direction, that would be very helpful.
Thanks in advance!
ROW_NUMBER should work just fine:
WITH CTE AS
(
SELECT *,
RN = ROW_NUMBER() OVER(PARTITION BY [date] ORDER BY totalcst DESC) -- DESC so that RN = 1 is the highest totalcst per date
FROM dbo.YourTable
)
SELECT [date],
pid,
totalcst
FROM CTE
WHERE RN = 1
;
Here is one simple way:
select t.*
from test1 t
where t.totalcst = (select max(t2.totalcst) from test1 t2 where t2.date = t.date);
This often has the best performance if you have an index on (date, totalcst). Of course, row_number()/rank() is also a very acceptable solution:
select t.*
from (select t.*, row_number() over (partition by date order by totalcst desc) as seqnum
from test1 t
) t
where seqnum = 1;
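One difference worth knowing: if two rows share the maximum totalcst on a date, the correlated subquery returns both of them, while row_number() keeps just one, picked arbitrarily. rank() behaves like the subquery. A minimal sketch on made-up rows (the values in the test1 CTE are hypothetical, and it assumes an engine that accepts SELECT without FROM, such as SQL Server):

```sql
-- Hypothetical rows; pid 1 and pid 2 tie for the maximum on the first date.
WITH test1 AS (
  SELECT '2020-01-01' AS date, 1 AS pid, 100 AS totalcst UNION ALL
  SELECT '2020-01-01', 2, 100 UNION ALL
  SELECT '2020-01-01', 3, 40 UNION ALL
  SELECT '2020-01-02', 1, 70
)
SELECT date, pid, totalcst
FROM (SELECT t.*,
             -- rank() gives every tied row the same number, so all maxima survive the filter.
             RANK() OVER (PARTITION BY date ORDER BY totalcst DESC) AS rnk
      FROM test1 t
     ) ranked
WHERE rnk = 1
ORDER BY date, pid;
-- Expected rows:
-- 2020-01-01 | 1 | 100
-- 2020-01-01 | 2 | 100
-- 2020-01-02 | 1 | 70
```

Swapping RANK() for ROW_NUMBER() would return exactly one row per date; add a second sort key such as pid if you need that choice to be repeatable.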

select top 10 items per city in SparkSQL

I have the following table in SparkSQL:
user_id, city, timestamp, item_id
I need to find the top 10 items for each city on each date, ranked by the number of times the item_id appeared in that city on that date.
I then did the following:
SELECT *
FROM (
SELECT *,
row_number() OVER partition BY city AS rn
FROM mytable) AS foo
ORDER BY rn DESC
However, although it sorts by rn, it doesn't give me just the top 10 items for each date. What would be a proper way to fix this? Thanks!
I don't know which function truncates the time from a timestamp in Spark.
But first you need to calculate the count, and then the row_number:
SELECT *
FROM (
SELECT city, item_id, theDATE, cnt,
ROW_NUMBER() OVER (PARTITION BY city, theDATE
ORDER BY cnt DESC) rn -- DESC so that rn = 1 is the most frequent item
FROM (SELECT city,
item_id,
to_date(timestamp) as theDATE, -- remove time and leave just date.
COUNT(*) AS cnt -- one row per city, date and item with its number of appearances
FROM mytable
GROUP BY city, item_id, to_date(timestamp)
) AS foo
) AS boo
WHERE rn <= 10
ORDER BY city, theDATE, rn
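To check the count-then-rank idea on a small data set, here is a sketch with made-up rows. The values in the mytable CTE are hypothetical, the counting is done with GROUP BY so that each item appears once per city and date, and item_id is added as a second sort key only to make ties repeatable. It assumes Spark SQL, where to_date accepts a timestamp string:

```sql
-- Hypothetical rows; only the column names come from the question.
WITH mytable AS (
  SELECT 'u1' AS user_id, 'NYC' AS city, '2020-01-01 09:00:00' AS timestamp, 'i1' AS item_id UNION ALL
  SELECT 'u2', 'NYC', '2020-01-01 10:00:00', 'i1' UNION ALL
  SELECT 'u3', 'NYC', '2020-01-01 11:00:00', 'i2' UNION ALL
  SELECT 'u1', 'NYC', '2020-01-02 09:00:00', 'i2'
),
counts AS (
  -- Step 1: how often each item appeared per city and date.
  SELECT city, to_date(timestamp) AS theDATE, item_id, COUNT(*) AS cnt
  FROM mytable
  GROUP BY city, to_date(timestamp), item_id
)
SELECT city, theDATE, item_id, cnt, rn
FROM (
  -- Step 2: rank items inside each city/date group, most frequent first.
  SELECT c.*,
         ROW_NUMBER() OVER (PARTITION BY city, theDATE
                            ORDER BY cnt DESC, item_id) AS rn
  FROM counts c
) ranked
WHERE rn <= 10
ORDER BY city, theDATE, rn;
-- Expected rows:
-- NYC | 2020-01-01 | i1 | 2 | 1
-- NYC | 2020-01-01 | i2 | 1 | 2
-- NYC | 2020-01-02 | i2 | 1 | 1
```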

Use MIN() where you cannot GROUP?

I feel pretty dumb, but I get stuck with an apparently very easy query. I have something like this, where every row is a user that watched a movie:
user_id date duration
1 01-01-01 62m
1 03-01-01 95m
2 02-01-01 58m
2 06-01-01 25m
2 08-01-01 95m
3 03-01-01 96m
Now, what I would like to have is a table where I have the first movie watched by each user and its duration. The problem is if I use MIN() then I have to GROUP both user_id and duration. But if I GROUP for duration as well, then I am basically going to have the same table back. How can I solve the problem?
You can use a ranking function like ROW_NUMBER:
WITH CTE AS
(
SELECT rn = ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY date ASC),
user_id, date, duration
FROM dbo.TableName
)
SELECT user_id, date, duration FROM CTE WHERE rn = 1
The advantage of ROW_NUMBER is that you can change the logic easily. For example, if you want to reverse the logic and get the row of the last watched film per user, you just have to change ORDER BY date ASC to ORDER BY date DESC.
The advantage of the CTE (common table expression) is that you can also use it to delete or update these records. It is often used to identify or delete duplicates, so you can run a select first to see what will be deleted or updated before you execute the change.
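A rough sketch of that select-first, delete-second workflow, assuming SQL Server. The temp table #Watched and its rows are hypothetical, and the dates are written in ISO format for clarity:

```sql
-- Hypothetical temp table standing in for dbo.TableName.
CREATE TABLE #Watched (user_id INT, [date] DATE, duration INT);
INSERT INTO #Watched VALUES
  (1, '2001-01-01', 62),
  (1, '2001-01-03', 95),
  (2, '2001-01-02', 58);

-- Preview: these are the rows the delete below would remove.
WITH CTE AS
(
  SELECT rn = ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY [date] ASC),
         user_id, [date], duration
  FROM #Watched
)
SELECT user_id, [date], duration FROM CTE WHERE rn > 1;

-- Same CTE, but now deleting through it: everything except the first row per user goes.
WITH CTE AS
(
  SELECT rn = ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY [date] ASC)
  FROM #Watched
)
DELETE FROM CTE WHERE rn > 1;

SELECT user_id, [date], duration FROM #Watched ORDER BY user_id;
-- Expected rows after the delete:
-- 1 | 2001-01-01 | 62
-- 2 | 2001-01-02 | 58
```

Deleting through a CTE like this works in SQL Server because the CTE reads from a single base table; other engines may need a different form.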
Try this query. I haven't tested it.
SELECT user_id, date, duration FROM tablename n
WHERE NOT EXISTS(
SELECT date, user_id FROM tablename g
WHERE n.user_id = g.user_id AND g.date < n.date
);
Assuming there can only be a single record per user per date, it'd be something like this:
select t.*
from table t
inner join (
select user_id, min(date) mindate
from table
group by user_id
) t1
on t.user_id = t1.user_id
and t.date = t1.mindate
You can use ROW_NUMBER(), a ranking function that generates a sequential number within each group, ordered by the column you choose. With ROW_NUMBER(), if there is a tie, only one record per user is selected; if you want to select all tied records, use DENSE_RANK() instead:
SELECT user_id, date, duration
FROM
(
SELECT user_id, date, duration,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY date) rn
FROM tableName
) a
WHERE rn = 1
This also assumes that the data type of the date column is DATE.
If you are using SQL Server 2005 or later, you can use windowing functions.
SELECT *
FROM
(
SELECT user_id, date, duration, MIN(date) OVER(PARTITION BY user_id) AS MIN_DATE
FROM MY_TABLE
) AS RESULTS
WHERE date = MIN_DATE
The OVER clause with PARTITION BY will "group by" the user_id and compute the min date per user_id without eliminating any rows. Then you filter to the rows where the date equals the min date, and you are left with the first date per user_id. This is a common trick once you know about windowing functions.
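To make the "without eliminating any rows" part concrete, here is a sketch of just the inner step on made-up rows. The watched CTE, its values, and the column name watch_date are hypothetical, and it assumes an engine that accepts SELECT without FROM, such as SQL Server:

```sql
-- Hypothetical rows; watch_date stands in for the date column.
WITH watched AS (
  SELECT 1 AS user_id, '2001-01-01' AS watch_date, 62 AS duration UNION ALL
  SELECT 1, '2001-01-03', 95 UNION ALL
  SELECT 2, '2001-01-02', 58 UNION ALL
  SELECT 2, '2001-01-06', 25
)
SELECT user_id,
       watch_date,
       duration,
       -- Every row keeps its own values and also gets its user's earliest date.
       MIN(watch_date) OVER (PARTITION BY user_id) AS min_date
FROM watched
ORDER BY user_id, watch_date;
-- Expected rows:
-- 1 | 2001-01-01 | 62 | 2001-01-01
-- 1 | 2001-01-03 | 95 | 2001-01-01
-- 2 | 2001-01-02 | 58 | 2001-01-02
-- 2 | 2001-01-06 | 25 | 2001-01-02
```

Wrapping this in an outer query with WHERE watch_date = min_date then keeps the first and third rows, one per user.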
If you want the first watch_date per user, there should be no date before this date for this user:
SELECT *
FROM watched_movies wm
WHERE NOT EXISTS (
SELECT *
FROM watched_movies nx
WHERE nx.user_id = wm.user_id
AND nx.watch_date < wm.watch_date
);
Note: I replaced the date column by watch_date, since date is a reserved word (type name).
This should give you the duration of the first movie watched on the earliest date:
SELECT a.user_id, b.date, a.duration
FROM table a
INNER JOIN (SELECT user_id,min(date) date FROM table GROUP BY user_id) b ON a.user_id = b.user_id AND a.date = b.date
INNER JOIN (SELECT user_id,date,min(session_id) session_id FROM table GROUP BY user_id, date) c ON b.user_id = c.user_id AND b.date = c.date AND a.session_id = c.session_id
Try this:
WITH TABLE1
AS (SELECT
'1' AS USER_ID,
'01-01-01' AS DT,
62 AS DURATION
FROM
DUAL
UNION ALL
SELECT
'1' AS USER_ID,
'03-01-01' AS DT,
95 AS DURATION
FROM
DUAL
UNION ALL
SELECT
'2' AS USER_ID,
'02-01-01' AS DT,
58 AS DURATION
FROM
DUAL
UNION ALL
SELECT
'2' AS USER_ID,
'06-01-01' AS DT,
25 AS DURATION
FROM
DUAL
UNION ALL
SELECT
'2' AS USER_ID,
'08-01-01' AS DT,
95 AS DURATION
FROM
DUAL
UNION ALL
SELECT
'3' AS USER_ID,
'03-01-01' AS DT,
96 AS DURATION
FROM
DUAL)
SELECT
*
FROM
(SELECT
USER_ID,
DT,
DURATION,
RANK ( ) OVER (PARTITION BY USER_ID ORDER BY DT ASC) AS ROW_RANK
FROM
TABLE1)
WHERE
ROW_RANK = 1
Use a sub-query to get the min date then join that back to the table to get all other relevant columns.
SELECT T2.user_id
,T2.date
,T2.duration
FROM YourTable T2
INNER JOIN
(
SELECT T1.user_id
,MIN(T1.date) as first_date
FROM YourTable T1
GROUP BY T1.user_id
) SQ
ON T2.user_id = sq.user_id
AND T2.date = sq.first_date