Value from previous row in GROUP BY as column - sql

I have this table:
+----------+-------+-------------------+-----+
|    userId| testId|               date| note|
+----------+-------+-------------------+-----+
| 123123123|      1|2019-01-22 02:03:00|  aaa|
| 123123123|      1|2019-02-22 02:03:00|  bbb|
| 123456789|      2|2019-03-23 02:03:00|  ccc|
| 123456789|      2|2019-04-23 02:03:00|  ddd|
| 321321321|      3|2019-05-23 02:03:00|  eee|
+----------+-------+-------------------+-----+
I would like to get the newest note (the whole row) for each userId and testId group:
SELECT
n.userId,
n.testId,
n.date,
n.note
FROM
notes n
INNER JOIN (
SELECT
userId,
testId,
MAX(date) as maxDate
FROM
notes
GROUP BY
userId,
testId
) temp ON n.userId = temp.userId AND n.testId = temp.testId AND n.date = temp.maxDate
It works.
But now I'd like to also have previous note in each row:
+----------+-------+-------------------+-----+------------+
|    userId| testId|               date| note|previousNote|
+----------+-------+-------------------+-----+------------+
| 123123123|      1|2019-02-22 02:03:00|  bbb|         aaa|
| 123456789|      2|2019-04-23 02:03:00|  ddd|         ccc|
| 321321321|      3|2019-05-23 02:03:00|  eee|        null|
+----------+-------+-------------------+-----+------------+
I have no idea how to do it. I have heard that the LAG() function might be useful, but I found no good examples for my case.
I'd like to use it on a DataFrame in PySpark (in case that matters).

Use the lag() and row_number() analytic functions:
select userid,testid,date,note,previous_note
from
(select userid,testid,date,note,
lag(note) over(partition by userid,testid order by date) as previous_note,
row_number() over(partition by userid,testid order by date desc) rn
from table_name
) a where a.rn=1

select userid,testid,date,note,previous_note from
(select userid,testid,date,note,lead(note)
over(partition by userid,testid order by date desc) as previous_note,
row_number() over(partition by userid,testid order by date desc) srno
from Table_Name
) a where a.srno=1
I hope this gives you the answer you want: the row with the latest date, plus the note from the previous date as previous_note.
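For reference, here is a minimal sketch of the LAG() plus ROW_NUMBER() approach run against the sample data from the question. It uses Python's built-in sqlite3 module purely as a stand-in engine (an assumption on my part; the question targets PySpark, where the same SQL should be usable through spark.sql on a registered temp view). The function name latest_with_previous is made up for illustration, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

# Sample rows copied from the question; the table name `notes` follows the asker's query.
ROWS = [
    (123123123, 1, "2019-01-22 02:03:00", "aaa"),
    (123123123, 1, "2019-02-22 02:03:00", "bbb"),
    (123456789, 2, "2019-03-23 02:03:00", "ccc"),
    (123456789, 2, "2019-04-23 02:03:00", "ddd"),
    (321321321, 3, "2019-05-23 02:03:00", "eee"),
]

QUERY = """
SELECT userId, testId, date, note, previousNote
FROM (
    SELECT userId, testId, date, note,
           -- note of the chronologically previous row in the same group
           LAG(note) OVER (PARTITION BY userId, testId ORDER BY date) AS previousNote,
           -- 1 marks the newest row in each group
           ROW_NUMBER() OVER (PARTITION BY userId, testId ORDER BY date DESC) AS rn
    FROM notes
) AS ranked
WHERE rn = 1
ORDER BY userId
"""


def latest_with_previous(rows):
    """Hypothetical helper: load rows into an in-memory table and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE notes (userId INTEGER, testId INTEGER, date TEXT, note TEXT)"
    )
    conn.executemany("INSERT INTO notes VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


for row in latest_with_previous(ROWS):
    print(row)
```

A group with a single row gets None (NULL) as previousNote, which matches the desired output for userId 321321321.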

Related

How to return all records with the latest datetime value [PostgreSQL]

How can I return only the records with the latest upload_date(s) from the data below?
My data is as follows:
upload_date |day_name |rows_added|row_count_delta|days_since_last_update|
-----------------------+---------+----------+---------------+----------------------+
2022-05-01 00:00:00.000|Sunday | 526043| | |
2022-05-02 00:00:00.000|Monday | 467082| -58961| 1|
2022-05-02 15:58:54.094|Monday | 421427| -45655| 0|
2022-05-02 18:19:22.894|Monday | 421427| 0| 0|
2022-05-03 16:54:04.136|Tuesday | 496021| 74594| 1|
2022-05-03 18:17:27.502|Tuesday | 496021| 0| 0|
2022-05-04 18:19:26.392|Wednesday| 487154| -8867| 1|
2022-05-05 18:18:15.277|Thursday | 489713| 2559| 1|
2022-05-06 16:15:39.518|Friday | 489713| 0| 1|
2022-05-07 16:18:00.916|Saturday | 482955| -6758| 1|
My desired results should be:
upload_date |day_name |rows_added|row_count_delta|days_since_last_update|
-----------------------+---------+----------+---------------+----------------------+
2022-05-01 00:00:00.000|Sunday | 526043| | |
2022-05-02 18:19:22.894|Monday | 421427| 0| 0|
2022-05-03 18:17:27.502|Tuesday | 496021| 0| 0|
2022-05-04 18:19:26.392|Wednesday| 487154| -8867| 1|
2022-05-05 18:18:15.277|Thursday | 489713| 2559| 1|
2022-05-06 16:15:39.518|Friday | 489713| 0| 1|
2022-05-07 16:18:00.916|Saturday | 482955| -6758| 1|
NOTE only the latest upload_date for 2022-05-02 and 2022-05-03 should be in the result set.
You can use a window function to PARTITION by day (casting the timestamp to a date) and sort the results by most recent first by ordering by upload_date descending. Using ROW_NUMBER() it will assign a 1 to the most recent record per date. Then just filter on that row number. Note that I am assuming the datatype for upload_date is TIMESTAMP in this case.
SELECT
*
FROM (
SELECT
your_table.*,
ROW_NUMBER() OVER (PARTITION BY CAST(upload_date AS DATE)
ORDER BY upload_date DESC) rownum
FROM your_table
) AS t
WHERE rownum = 1
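As a rough illustration of the ROW_NUMBER() per day idea, here is a small sketch using Python's built-in sqlite3 module as a stand-in engine (an assumption; the question is about PostgreSQL, where CAST(upload_date AS DATE) plays the role of SQLite's date() used here). The table name uploads and the helper latest_per_day are hypothetical, and only two columns from the first few sample rows are loaded to keep it short. Window functions need SQLite 3.25 or newer.

```python
import sqlite3

# A subset of the sample data from the question (upload_date, rows_added).
ROWS = [
    ("2022-05-01 00:00:00.000", 526043),
    ("2022-05-02 00:00:00.000", 467082),
    ("2022-05-02 15:58:54.094", 421427),
    ("2022-05-02 18:19:22.894", 421427),
    ("2022-05-03 16:54:04.136", 496021),
    ("2022-05-03 18:17:27.502", 496021),
    ("2022-05-04 18:19:26.392", 487154),
]

QUERY = """
SELECT upload_date, rows_added
FROM (
    SELECT upload_date, rows_added,
           -- number the rows of each calendar day, newest first
           ROW_NUMBER() OVER (PARTITION BY date(upload_date)
                              ORDER BY upload_date DESC) AS rownum
    FROM uploads
) AS t
WHERE rownum = 1
ORDER BY upload_date
"""


def latest_per_day(rows):
    """Hypothetical helper: return the newest row of each day."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE uploads (upload_date TEXT, rows_added INTEGER)")
    conn.executemany("INSERT INTO uploads VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


for row in latest_per_day(ROWS):
    print(row)
```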
WITH cte AS (
SELECT
max(upload_date) OVER (PARTITION BY upload_date::date) AS max_upload_date,
upload_date,
day_name,
rows_added,
row_count_delta,
days_since_last_update
FROM test101 ORDER BY 1
)
SELECT
upload_date,
day_name,
rows_added,
row_count_delta,
days_since_last_update
FROM
cte
WHERE
max_upload_date = upload_date;
This is more verbose but I find it easier to read and build:
SELECT *
FROM mytable t1
JOIN (
SELECT CAST(upload_date AS DATE) day_date, MAX(upload_date) max_date
FROM mytable
GROUP BY day_date) t2
ON t1.upload_date = t2.max_date AND
CAST(upload_date AS DATE) = t2.day_date;
I don't know about performance offhand, but I suspect the window function is slower because it needs a sort, which is usually a slow operation unless your table already has a suitable index.
Use DISTINCT ON:
SELECT DISTINCT ON (date_trunc('day', upload_date))
to_char(upload_date, 'Day') AS weekday, * -- added weekday optional
FROM tbl
ORDER BY date_trunc('day', upload_date), upload_date DESC;
For few rows per day (like your sample data suggests) it's the simplest and fastest solution possible. See:
Select first row in each GROUP BY group?
I dropped the column day_name from the table. It is just a redundant representation of the timestamp. Storing it only adds cost, noise, and opportunities for inconsistent data. If you need the weekday displayed, use to_char(upload_date, 'Day') AS weekday as demonstrated above.
The query works for any number of days, not restricted to 7 weekdays.

Concat a column's value with other column's lead() value in impala

I have a table like the below:
+---------------------+----------------+---------------------+
| prompt              | answer         | step_timestamp      |
+---------------------+----------------+---------------------+
| hi Lary             |                | 2022-04-04 10:00:00 |
| how are you?        |                | 2022-04-04 10:02:00 |
| how is your pet?    | I am fine      | 2022-04-04 10:05:00 |
| what is your hobby? | my pet is good | 2022-04-04 10:15:00 |
| ok thanks           | football       | 2022-04-04 10:25:00 |
+---------------------+----------------+---------------------+
Each answer has to be matched with the prompt of the previous row.
Expected result:
hi Lary, how are you?I am fine. how is your pet?my pet is good. what is your hobby? football. ok thanks
To get this, I wrote the following:
WITH SUPER AS(
SELECT call_id, group_concat(tall,'\t') as dialog_text
FROM
(SELECT ROW_NUMBER() OVER (PARTITION BY tall,call_id
ORDER BY step_timestamp ASC) AS rn,call_id,tall
FROM
(SELECT call_id,step_timestamp, concat(prompt,':',lead(answer) over(PARTITION BY call_id,step_timestamp order by step_timestamp asc)) tall
FROM db.table
ORDER BY step_timestamp ASC
limit 100000000
)as inq
ORDER BY step_timestamp ASC
limit 100000000
) b
WHERE rn =1
GROUP BY call_id,call_ani
)select distinct call_id, dialog_text
from super;
But it does not work as expected. For example, sometimes I get something like this:
hi lary, how are you?I am fine. how is your pet?my pet is good. how is your pet?I am fine. what is your hobby? football. ok thanks
You probably know the reason already: group_concat() in Impala doesn't preserve ORDER BY. Even with LIMIT 10000000, the rows may not all end up on the same node, so an ordered concatenation isn't guaranteed.
Use Hive's collect_list() instead.
I couldn't see the relevance of your row_number(), so I removed it to keep the solution simple. Please test the code below with your original data and add row_number() back if needed.
select
id call_id,
concat( concat_ws(',', min(g)) ) dialog_text
from
(
select
s.id,
-- use collect_list to concatenate all dialogues in order of timestamp
collect_list(s.tall) over (partition by s.id order by s.step_timestamp desc rows between unbounded preceding and unbounded following) g
from
(
SELECT call_id id,step_timestamp,
concat(prompt,':',lead(answer) over(PARTITION BY call_id,step_timestamp order by step_timestamp asc)) tall
FROM db.table -- this is the main data
) s
) gs
-- Need to group by 'id' since we have duplicate collect_list values
group by id
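To make the pairing itself concrete, here is a small sketch that attaches each prompt to the answer on the following row and then concatenates in timestamp order. It is not Impala or Hive code: it uses Python's built-in sqlite3 module as a stand-in (an assumption), and it does the final concatenation in Python so that the order is explicit. Two further assumptions: the window is partitioned by call_id only (partitioning by step_timestamp as well would leave one row per partition whenever timestamps are unique, so LEAD() would return NULL), and all sample rows belong to a made-up call_id of 1. The table name dialog and the helper prompt_answer_pairs are hypothetical.

```python
import sqlite3

# Sample rows from the question; call_id = 1 is invented for illustration.
ROWS = [
    (1, "hi Lary", "", "2022-04-04 10:00:00"),
    (1, "how are you?", "", "2022-04-04 10:02:00"),
    (1, "how is your pet?", "I am fine", "2022-04-04 10:05:00"),
    (1, "what is your hobby?", "my pet is good", "2022-04-04 10:15:00"),
    (1, "ok thanks", "football", "2022-04-04 10:25:00"),
]

QUERY = """
SELECT prompt,
       -- the answer to a prompt is stored on the next row of the same call
       COALESCE(LEAD(answer) OVER (PARTITION BY call_id
                                   ORDER BY step_timestamp), '') AS next_answer
FROM dialog
ORDER BY call_id, step_timestamp
"""


def prompt_answer_pairs(rows):
    """Hypothetical helper: return (prompt, next_answer) in timestamp order."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dialog (call_id INTEGER, prompt TEXT, answer TEXT, step_timestamp TEXT)"
    )
    conn.executemany("INSERT INTO dialog VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


pairs = prompt_answer_pairs(ROWS)
# Join with a tab, like the '\t' separator in the asker's group_concat call.
dialog_text = "\t".join(prompt + answer for prompt, answer in pairs)
print(dialog_text)
```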

Amount based on a priority

Please feel free to edit the title, or suggest a better one, so that it reflects what I am trying to ask.
I have a query that gives the following result:
select
Customer.customer_id,
Transaction.amount
From Customer inner join Transaction on Customer.customer_id = Transaction.customer_id
Result:
customer_id| amount
01456 |50
01456 |100
01456 |400
01456 |0
01963 |50
01963 |100
01963 |221
01963 |0
Now, I want to add a priority field with a value of 1, 2, or 3. The lower the amount, the higher the priority number. Note: for an amount of 0, I want the priority to show the text 'Negative' instead. The ranking should cover every amount except 0.
This is what I want.
customer_id| amount| priority
01456| 50| 3
01456| 100| 2
01456| 400| 1
01456| 0| Negative
01963| 50| 3
01963| 100| 2
01963| 221| 1
01963| 0| Negative
Is this achievable? Your help will be greatly appreciated.
Window functions like ROW_NUMBER() are perfect for this:
SELECT c.customer_id,
t.amount,
ROW_NUMBER() over (PARTITION BY c.customer_id ORDER BY t.amount desc) priority
FROM Customer c
JOIN [Transaction] t on c.customer_id = t.customer_id
The partition by resets the numbering on each unique customer_id, and the order by decides which direction and order to number the rows.
Use row_number() or rank():
select customer_id, amount,
row_number() over (partition by customer_id order by amount desc) as priority
from t;
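Neither query above handles the 'Negative' label for amount 0. One possible way to cover it, sketched here under some assumptions, is to wrap the ranking in a CASE expression and put the zero rows in their own partition so they never take up a rank. The sketch runs on Python's built-in sqlite3 module as a stand-in engine (the question does not name its database; the bracketed [Transaction] hints at SQL Server, where the boolean partition key would have to be written as a CASE expression). The table name customer_amounts and the helper with_priority are made up, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

# The joined result from the question, loaded as a single hypothetical table.
ROWS = [
    ("01456", 50), ("01456", 100), ("01456", 400), ("01456", 0),
    ("01963", 50), ("01963", 100), ("01963", 221), ("01963", 0),
]

QUERY = """
SELECT customer_id, amount,
       CASE WHEN amount = 0 THEN 'Negative'
            -- rank non-zero amounts per customer, largest amount first;
            -- `amount = 0` in the partition keeps zero rows out of the ranking
            ELSE CAST(ROW_NUMBER() OVER (PARTITION BY customer_id, amount = 0
                                         ORDER BY amount DESC) AS TEXT)
       END AS priority
FROM customer_amounts
ORDER BY customer_id, amount = 0, amount
"""


def with_priority(rows):
    """Hypothetical helper: return (customer_id, amount, priority) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer_amounts (customer_id TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO customer_amounts VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


for row in with_priority(ROWS):
    print(row)
```

The priority column comes back as text because it mixes numbers with the 'Negative' label.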

calculate distance between different entries in one column

I have this kind of column in my table:
Table A:
geom_etrs(geometry)
"0101000020E8640000FE2EAF0B3C981C414E499E34DFE65441"
"0101000020E864..."
"0101000020E875..."
"0101000020E867..."
How can I calculate the distances between each of the entries (they are already defined as POINT)?
I want to create a new column where the distances between 1 and 2, then between 2 and 3, then between 3 and 4 and so on, are displayed.
select st_distance(point, lead(point,1) over (partition by rn))
from ( select point, row_number() over (partition by id) as rn
from table_1)t;
gis=# \d users
user_id | bigint |
geog | geography |
select st_astext(geog), st_astext(lead(geog,1) over (partition by rn)) from ( select geog, row_number() over (partition by user_id) as rn from users limit 10)t;
st_astext | st_astext
-------------------------------------------+-------------------------------------------
POINT(-70.0777937636872 41.6670617084209) | POINT(-70.0783833464664 41.6675384387944)
POINT(-70.0783833464664 41.6675384387944) | POINT(-70.0793901822679 41.667476122803)
POINT(-70.0793901822679 41.667476122803) | POINT(-70.0787530494335 41.6671461707966)
POINT(-70.0787530494335 41.6671461707966) | POINT(-70.07908017161 41.6663672501228)
POINT(-70.07908017161 41.6663672501228) | POINT(-70.0795407352778 41.6669886861798)
POINT(-70.0795407352778 41.6669886861798) | POINT(-70.0798881265976 41.6663775240468)
POINT(-70.0798881265976 41.6663775240468) | POINT(-70.0781470955597 41.6667824284963)
POINT(-70.0781470955597 41.6667824284963) | POINT(-70.0790447962989 41.6675773546665)
POINT(-70.0790447962989 41.6675773546665) | POINT(-70.0778760883834 41.6675017901701)
POINT(-70.0778760883834 41.6675017901701) |
gis=# select st_distance(geog, lead(geog,1) over (partition by rn)) from ( select geog, row_number() over (partition by user_id) as rn from users limit 10)t;
st_distance
--------------
72.21147623
84.13511302
64.48606246
90.70040367
78.96272466
73.78817244
151.81026032
115.69092832
97.69189128
This should work for you
Using the window function LEAD gives you the next value as a new column next to the current value:
SELECT
point_column,
LEAD(point_column) OVER ()
FROM
table
Now you are able to calculate the distance with PostGIS' st_distance:
SELECT
st_distance(
point_column,
LEAD(point_column) OVER ()
)
FROM
table
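One caveat worth spelling out: LEAD() with an empty OVER () has no defined row order, so "next" is only meaningful when the window has an ORDER BY. The sketch below shows the idea with an explicit ordering column. It is not PostGIS: it uses Python's built-in sqlite3 module with plain x/y coordinates and computes a Euclidean distance in Python, standing in for st_distance. The table name points, the id column, the sample coordinates, and the helper consecutive_distances are all invented for illustration.

```python
import math
import sqlite3

# Made-up points (id, x, y); id is an assumed column that defines the order.
ROWS = [(1, 0.0, 0.0), (2, 3.0, 4.0), (3, 3.0, 10.0), (4, 11.0, 16.0)]

QUERY = """
SELECT id, x, y,
       LEAD(x) OVER (ORDER BY id) AS next_x,  -- coordinates of the next point
       LEAD(y) OVER (ORDER BY id) AS next_y
FROM points
ORDER BY id
"""


def consecutive_distances(rows):
    """Hypothetical helper: distance from each point to the next one by id."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE points (id INTEGER, x REAL, y REAL)")
    conn.executemany("INSERT INTO points VALUES (?, ?, ?)", rows)
    distances = []
    for _id, x, y, next_x, next_y in conn.execute(QUERY):
        if next_x is None:
            distances.append(None)  # the last point has no successor
        else:
            distances.append(math.hypot(next_x - x, next_y - y))
    conn.close()
    return distances


print(consecutive_distances(ROWS))
```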

Selecting the most recent row before a certain timestamp

I have a table like this called tt
ID|Name|Date|Value|
------------------------------------
0| S1| 2017-03-05 00:00:00| 1.5|
1| S1| 2017-04-05 00:00:00| 1.2|
2| S2| 2017-04-06 00:00:00| 1.2|
3| S3| 2017-04-07 00:00:00| 1.1|
4| S3| 2017-05-07 00:00:00| 1.2|
For each Name, I need to select the row with the latest Date that is earlier than theTime.
theTime is just a variable holding the timestamp. In the example you could hardcode a date string, e.g. < DATE '2017-05-01'. I will inject the value of the variable programmatically later, from another language.
I'm having a difficult time figuring out how to do this... does anyone know?
Also, I would like to know how to select what I described above but limited to a specific name, e.g. name='S3'
It would be nice if hsqldb really supported row_number():
select t.*
from (select tt.*,
row_number() over (partition by name order by date desc) as seqnum
from tt
where . . .
) t
where seqnum = 1;
Lacking that, use a group by and join:
select tt.*
from tt join
(select name, max(date) as maxd
from tt
where date < THETIME
group by name
) ttn
on tt.name = ttn.name and tt.date = ttn.maxd;
Note: this will return duplicates if the maximum date has duplicates for a given name.
The WHERE clause applies the restriction on your timestamp.
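Here is a minimal sketch of the GROUP BY and JOIN variant with the timestamp passed in as a parameter, plus an optional filter for a single name as asked at the end of the question. It runs on Python's built-in sqlite3 module as a stand-in engine (an assumption; the question is about HSQLDB), with dates stored as text so that string comparison matches chronological order. The helper latest_before and the parameter names the_time and name are hypothetical.

```python
import sqlite3

# Sample rows from the question (ID, Name, Date, Value).
ROWS = [
    (0, "S1", "2017-03-05 00:00:00", 1.5),
    (1, "S1", "2017-04-05 00:00:00", 1.2),
    (2, "S2", "2017-04-06 00:00:00", 1.2),
    (3, "S3", "2017-04-07 00:00:00", 1.1),
    (4, "S3", "2017-05-07 00:00:00", 1.2),
]

QUERY = """
SELECT tt.id, tt.name, tt.date, tt.value
FROM tt
JOIN (
    SELECT name, MAX(date) AS maxd
    FROM tt
    WHERE date < :the_time                     -- only rows before the cutoff
      AND (:name IS NULL OR name = :name)      -- optional single-name filter
    GROUP BY name
) AS ttn ON tt.name = ttn.name AND tt.date = ttn.maxd
ORDER BY tt.name
"""


def latest_before(the_time, name=None):
    """Hypothetical helper: latest row per name strictly before the_time."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tt (id INTEGER, name TEXT, date TEXT, value REAL)")
    conn.executemany("INSERT INTO tt VALUES (?, ?, ?, ?)", ROWS)
    result = conn.execute(QUERY, {"the_time": the_time, "name": name}).fetchall()
    conn.close()
    return result


print(latest_before("2017-05-01"))
print(latest_before("2017-05-01", name="S3"))
```

Passing the cutoff as a bound parameter rather than splicing it into the SQL string also avoids quoting problems when the value is injected from another language.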