Selecting the most recent row before a certain timestamp - SQL

I have a table called tt, like this:
ID|Name|Date               |Value|
--+----+-------------------+-----+
 0|  S1|2017-03-05 00:00:00|  1.5|
 1|  S1|2017-04-05 00:00:00|  1.2|
 2|  S2|2017-04-06 00:00:00|  1.2|
 3|  S3|2017-04-07 00:00:00|  1.1|
 4|  S3|2017-05-07 00:00:00|  1.2|
I need to select, for each Name, the row with the highest Date that is < theTime, where theTime is just a variable holding a timestamp. For this example you could hardcode a date string, e.g. < DATE '2017-05-01'; I will inject the variable's value programmatically from another language later.
I'm having a difficult time figuring out how to do this... does anyone know?
Also, I would like to know how to do the same selection limited to a specific name, e.g. name='S3'.

It would be nice if hsqldb really supported row_number():
select t.*
from (select tt.*,
             row_number() over (partition by name order by date desc) as seqnum
      from tt
      where . . .
     ) t
where seqnum = 1;
Lacking that, use a group by and join:
select tt.*
from tt join
     (select name, max(date) as maxd
      from tt
      where date < THETIME
      group by name
     ) ttn
     on tt.name = ttn.name and tt.date = ttn.maxd;
Note: this will return duplicates if the maximum date has duplicates for a given name.
The WHERE clause inside the derived table applies your timestamp restriction.
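For the follow-up (limiting the result to a specific name such as 'S3'), it is enough to add the name filter inside the derived table, since the outer join is on name. A minimal sketch, reusing the THETIME placeholder from above:
select tt.*
from tt join
     (select name, max(date) as maxd
      from tt
      where date < THETIME and name = 'S3'
      group by name
     ) ttn
     on tt.name = ttn.name and tt.date = ttn.maxd;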


Running total between two dates SQL

I have a problem building an efficient query to get a running total of sales between two dates.
Right now I have this query:
select SalesId,
       sum(Sales) as number_of_sales,
       Sales_DATE as SalesDate,
       ADD_MONTHS(Sales_DATE, -12) as SalesDatePrevYear
from DWH.L_SALES
group by SalesId, Sales_DATE
With the result:
| SalesId | number_of_sales | SalesDate | SalesDatePrevYear |
|--------:|----------------:|----------:|------------------:|
| 1000    | 1               | 20200101  | 20190101          |
| 1001    | 1               | 20220101  | 20210101          |
| 1002    | 1               | 20220201  | 20210201          |
| 1003    | 1               | 20220301  | 20210301          |
The preferred result is the following:
| SalesId | number_of_sales | running total of sales | SalesDate | SalesDatePrevYear |
|--------:|----------------:|-----------------------:|----------:|------------------:|
| 1000    | 1               | 1                      | 20200101  | 20190101          |
| 1001    | 1               | 1                      | 20220101  | 20210101          |
| 1002    | 1               | 2                      | 20220201  | 20210201          |
| 1003    | 1               | 3                      | 20220301  | 20210301          |
As you can see, I want the running total of sales between the two dates, but because I also need the lower level (SalesId) in the grouping, it always stays at 1.
How can I get this efficiently?
You have already gotten a result that gives you the start and end dates you care about, so you just need to join it back to the original data with an inequality join and then sum the results. I suggest looking into the style of using CTEs (Common Table Expressions), which is helpful for learning and debugging.
For example,
WITH CTE_BASE_RESULT AS
(
    -- your query from above goes here
)
SELECT CTE_BASE_RESULT.SalesId, CTE_BASE_RESULT.SalesDate, SUM(L_Sales.Sales) AS Total_Sales_Prior_Year
FROM CTE_BASE_RESULT
INNER JOIN DWH.L_Sales
    ON CTE_BASE_RESULT.SalesId = L_Sales.SalesId
    AND CTE_BASE_RESULT.SalesDate >= L_Sales.Sales_DATE
    AND CTE_BASE_RESULT.SalesDatePrevYear < L_Sales.Sales_DATE
GROUP BY CTE_BASE_RESULT.SalesId, CTE_BASE_RESULT.SalesDate
I also recommend a website like SQL Generator, which can help write complex operations; this pattern is what it calls a Timeseries Aggregate.
This syntax works for Snowflake; I didn't see which system you're on.
Alternatively,
WITH BASIC_OFFSET_1YEAR AS (
    SELECT
        A.SalesId,
        A.Sales_DATE,
        SUM(B.Sales) AS SUM_SALES_PAST1YEAR
    FROM
        L_Sales A
        INNER JOIN L_Sales B ON A.SalesId = B.SalesId
    WHERE
        B.Sales_DATE >= DATEADD(YEAR, -1, A.Sales_DATE)
        AND B.Sales_DATE <= A.Sales_DATE
    GROUP BY
        A.SalesId,
        A.Sales_DATE
)
SELECT
    src.*, BASIC_OFFSET_1YEAR.SUM_SALES_PAST1YEAR
FROM
    L_Sales src
    LEFT OUTER JOIN BASIC_OFFSET_1YEAR
        ON BASIC_OFFSET_1YEAR.Sales_DATE = src.Sales_DATE
        AND BASIC_OFFSET_1YEAR.SalesId = src.SalesId
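If your engine supports window frames with interval offsets (PostgreSQL 11+ and Oracle do; Snowflake, as far as I know, does not), a single-pass window function can replace the self-join entirely. A hedged sketch, assuming Sales_DATE is a true date/timestamp column and one row per sale in DWH.L_SALES; the alias running_total_of_sales is made up:
SELECT SalesId,
       Sales_DATE AS SalesDate,
       -- sum over the trailing 12 months up to and including the current row
       SUM(Sales) OVER (
           ORDER BY Sales_DATE
           RANGE BETWEEN INTERVAL '1 year' PRECEDING AND CURRENT ROW
       ) AS running_total_of_sales
FROM DWH.L_SALES;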

How to return all records with the latest datetime value [PostgreSQL]

How can I return only the records with the latest upload_date(s) from the data below?
My data is as follows:
upload_date |day_name |rows_added|row_count_delta|days_since_last_update|
-----------------------+---------+----------+---------------+----------------------+
2022-05-01 00:00:00.000|Sunday | 526043| | |
2022-05-02 00:00:00.000|Monday | 467082| -58961| 1|
2022-05-02 15:58:54.094|Monday | 421427| -45655| 0|
2022-05-02 18:19:22.894|Monday | 421427| 0| 0|
2022-05-03 16:54:04.136|Tuesday | 496021| 74594| 1|
2022-05-03 18:17:27.502|Tuesday | 496021| 0| 0|
2022-05-04 18:19:26.392|Wednesday| 487154| -8867| 1|
2022-05-05 18:18:15.277|Thursday | 489713| 2559| 1|
2022-05-06 16:15:39.518|Friday | 489713| 0| 1|
2022-05-07 16:18:00.916|Saturday | 482955| -6758| 1|
My desired results should be:
upload_date |day_name |rows_added|row_count_delta|days_since_last_update|
-----------------------+---------+----------+---------------+----------------------+
2022-05-01 00:00:00.000|Sunday | 526043| | |
2022-05-02 18:19:22.894|Monday | 421427| 0| 0|
2022-05-03 18:17:27.502|Tuesday | 496021| 0| 0|
2022-05-04 18:19:26.392|Wednesday| 487154| -8867| 1|
2022-05-05 18:18:15.277|Thursday | 489713| 2559| 1|
2022-05-06 16:15:39.518|Friday | 489713| 0| 1|
2022-05-07 16:18:00.916|Saturday | 482955| -6758| 1|
NOTE: only the latest upload_date for 2022-05-02 and 2022-05-03 should be in the result set.
You can use a window function that PARTITIONs by day (casting the timestamp to a date) and sorts the most recent rows first by ordering on upload_date descending. ROW_NUMBER() then assigns 1 to the most recent record per date, so you can just filter on that row number. Note that I am assuming the datatype of upload_date is TIMESTAMP here.
SELECT *
FROM (
    SELECT
        your_table.*,
        ROW_NUMBER() OVER (PARTITION BY CAST(upload_date AS DATE)
                           ORDER BY upload_date DESC) AS rownum
    FROM your_table
) t
WHERE rownum = 1
demo
WITH cte AS (
    SELECT
        max(upload_date) OVER (PARTITION BY upload_date::date) AS max_upload_date,
        upload_date,
        day_name,
        rows_added,
        row_count_delta,
        days_since_last_update
    FROM test101
)
SELECT
    upload_date,
    day_name,
    rows_added,
    row_count_delta,
    days_since_last_update
FROM
    cte
WHERE
    max_upload_date = upload_date;
This is more verbose but I find it easier to read and build:
SELECT t1.*
FROM mytable t1
JOIN (
    SELECT CAST(upload_date AS DATE) AS day_date, MAX(upload_date) AS max_date
    FROM mytable
    GROUP BY day_date) t2
    ON t1.upload_date = t2.max_date AND
       CAST(t1.upload_date AS DATE) = t2.day_date;
I don't know about performance offhand, but I suspect the window function is worse because it needs an ORDER BY, which is usually a slow operation unless your table already has an index for it.
Use DISTINCT ON:
SELECT DISTINCT ON (date_trunc('day', upload_date))
       to_char(upload_date, 'Day') AS weekday, *  -- weekday column added (optional)
FROM tbl
ORDER BY date_trunc('day', upload_date), upload_date DESC;
db<>fiddle here
With few rows per day (as your sample data suggests), it's the simplest and fastest solution possible. See:
Select first row in each GROUP BY group?
I dropped the column day_name from the table; it's just a redundant representation of the timestamp. Storing it only adds cost, noise, and opportunities for inconsistent data. If you need the weekday displayed, use to_char(upload_date, 'Day') AS weekday as demonstrated above.
The query works for any number of days, not restricted to 7 weekdays.
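If the table grows large, an expression index matching the DISTINCT ON / ORDER BY pair keeps this fast. A minimal sketch, assuming upload_date is timestamp without time zone (as in the sample data; for timestamptz, date_trunc is not immutable without a time zone argument); the index name is made up:
-- index supporting DISTINCT ON (date_trunc('day', upload_date)) ... ORDER BY upload_date DESC
CREATE INDEX tbl_day_upload_idx
    ON tbl (date_trunc('day', upload_date), upload_date DESC);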

Value from previous row in GROUP BY as column

I have this table:
+----------+-------------+-------------------+------------------+
| userId| testId| date| note|
+----------+-------------+-------------------+------------------+
| 123123123| 1|2019-01-22 02:03:00| aaa|
| 123123123| 1|2019-02-22 02:03:00| bbb|
| 123456789| 2|2019-03-23 02:03:00| ccc|
| 123456789| 2|2019-04-23 02:03:00| ddd|
| 321321321| 3|2019-05-23 02:03:00| eee|
+----------+-------------+-------------------+------------------+
I'd like to get the newest note (the whole row) for each (userId, testId) group:
SELECT
    n.userId,
    n.testId,
    n.date,
    n.note
FROM
    notes n
INNER JOIN (
    SELECT
        userId,
        testId,
        MAX(date) as maxDate
    FROM
        notes
    GROUP BY
        userId,
        testId
) temp ON n.userId = temp.userId AND n.testId = temp.testId AND n.date = temp.maxDate
It works.
But now I'd like to also have previous note in each row:
+----------+-------------+-------------------+-------------+------------+
| userId| testId| date| note|previousNote|
+----------+-------------+-------------------+-------------+------------+
| 123123123| 1|2019-02-22 02:03:00| bbb| aaa|
| 123456789| 2|2019-04-23 02:03:00| ddd| ccc|
| 321321321| 3|2019-05-23 02:03:00| eee| null|
+----------+-------------+-------------------+-------------+------------+
I have no idea how to do it. I've heard about the LAG() function, which might be useful, but I found no good examples for my case.
I'd like to use it on a DataFrame in PySpark (if that's important).
Use the lag() and row_number() analytic functions:
select userid, testid, date, note, previous_note
from
    (select userid, testid, date, note,
            lag(note) over (partition by userid, testid order by date) as previous_note,
            row_number() over (partition by userid, testid order by date desc) as rn
     from table_name
    ) a
where a.rn = 1
select userid, testid, date, note, previous_note
from
    (select userid, testid, date, note,
            lead(note) over (partition by userid, testid order by date desc) as previous_note,
            row_number() over (partition by userid, testid order by date desc) as srno
     from Table_Name
    ) a
where a.srno = 1
I hope this gives you the answer you want: the latest date as the new record, and the previous date's note as previous_note.
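Since you mention PySpark: LAG() and ROW_NUMBER() are standard Spark SQL window functions, so either answer above runs unchanged through spark.sql once the DataFrame is registered as a temporary view. A minimal sketch, assuming a view name of notes (hypothetical):
-- In PySpark, first register the DataFrame: df.createOrReplaceTempView("notes")
SELECT userId, testId, date, note, previousNote
FROM (
    SELECT userId, testId, date, note,
           LAG(note) OVER (PARTITION BY userId, testId ORDER BY date) AS previousNote,
           ROW_NUMBER() OVER (PARTITION BY userId, testId ORDER BY date DESC) AS rn
    FROM notes
) t
WHERE rn = 1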

Hive - Error: missing EOF at 'WHERE'

I'm trying to learn Hive, especially functions like unix_timestamp and from_unixtime.
I have three tables
emp (employee table)
+---+----------------+
| id| name|
+---+----------------+
| 1| James Gordon|
| 2| Harvey Bullock|
| 3| Kristen Kringle|
+---+----------------+
txn (transaction table)
+------+----------+---------+
|acc_id|trans_date|trans_amt|
+------+----------+---------+
| 101| 20180105| 951|
| 102| 20180205| 800|
| 103| 20180131| 100|
| 101| 20180112| 50|
| 102| 20180126| 800|
| 103| 20180203| 500|
+------+----------+---------+
acc (account table)
+---+------+--------+
| id|acc_id|cred_lim|
+---+------+--------+
| 1| 101| 1000|
| 2| 102| 1500|
| 3| 103| 800|
+---+------+--------+
I want to find out the people whose trans_amt exceeded their cred_lim in the month of Jan 2018.
The query I'm trying to use is
WITH tabl as
(
SELECT e.id, e.name, a.acc_id, t.trans_amt, a.cred_lim, from_unixtime(unix_timestamp(t.trans_date, 'yyyyMMdd'), 'MMM yyyy') month
FROM emp e JOIN acc a on e.id = a.id JOIN txn t on a.acc_id = t.acc_id
)
SELECT acc_id, sum(trans_amt) total_amt
FROM tabl
GROUP BY tabl.acc_id, tabl.month
WHERE tabl.month = 'Jan 2018' AND tabl.total_amt > cred_lim;
But when I run it, I get an error saying
FAILED: ParseException line 9:2 missing EOF at 'WHERE' near 'month'
This error persists even when I change the where clause to
WHERE tabl.total_amt > cred_lim;
This makes me think the error comes from the GROUP BY clause but I can't seem to figure this out.
Could someone help me with this?
Your query has several problems:
1. The WHERE clause must come before GROUP BY.
2. There is an extra ')' after the GROUP BY columns.
3. tabl.total_amt > cred_lim cannot be used in the WHERE clause, because the alias total_amt refers to an aggregate that has not yet been computed when WHERE is evaluated. Use a HAVING clause instead.
I've made these changes in the query below, which should work for you:
WITH tabl
AS (
SELECT e.id
,e.name
,a.acc_id
,t.trans_amt
,a.cred_lim
,from_unixtime(unix_timestamp(t.trans_date, 'yyyyMMdd'), 'MMM yyyy') month
FROM emp e
INNER JOIN acc a ON e.id = a.id
INNER JOIN txn t ON a.acc_id = t.acc_id
)
SELECT acc_id
,sum(trans_amt) total_amt
FROM tabl
WHERE month = 'Jan 2018'
GROUP BY acc_id
,month
HAVING SUM(trans_amt) > MAX(cred_lim);
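As a side note on efficiency: since trans_date is stored as yyyyMMdd, you can isolate January 2018 with a plain range predicate instead of formatting every row through unix_timestamp/from_unixtime. A hedged sketch (assuming trans_date is a string; use unquoted numeric literals if it is an int):
SELECT t.acc_id,
       SUM(t.trans_amt) AS total_amt
FROM txn t
JOIN acc a ON a.acc_id = t.acc_id
-- range predicate avoids per-row date parsing
WHERE t.trans_date >= '20180101'
  AND t.trans_date < '20180201'
GROUP BY t.acc_id
HAVING SUM(t.trans_amt) > MAX(a.cred_lim);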

Aggregate by aggregate (ARRAY_AGG)?

Let's say I have a simple table agg_test with 3 columns - id, column_1 and column_2. Dataset, for example:
id|column_1|column_2
--------------------
1| 1| 1
2| 1| 2
3| 1| 3
4| 1| 4
5| 2| 1
6| 3| 2
7| 4| 3
8| 4| 4
9| 5| 3
10| 5| 4
A query like this (with self join):
SELECT
    a1.column_1,
    a2.column_1,
    ARRAY_AGG(DISTINCT a1.column_2 ORDER BY a1.column_2)
FROM agg_test a1
JOIN agg_test a2 ON a1.column_2 = a2.column_2 AND a1.column_1 <> a2.column_1
WHERE a1.column_1 = 1
GROUP BY a1.column_1, a2.column_1
Will produce a result like this:
column_1|column_1|array_agg
---------------------------
1| 2| {1}
1| 3| {2}
1| 4| {3,4}
1| 5| {3,4}
We can see that for values 4 and 5 from the joined table we have the same result in the last column. So, is it possible to somehow group the results by it, e.g:
column_1|column_1|array_agg
---------------------------
1| {2}| {1}
1| {3}| {2}
1| {4,5}| {3,4}
Thanks for any answers. If anything isn't clear or can be presented in a better way - tell me in the comments and I'll try to make this question as readable as I can.
I'm not sure if you can aggregate by an array. If you can, here is one approach:
select col1, array_agg(col2), ar
from (SELECT a1.column_1 as col1, a2.column_1 as col2,
             ARRAY_AGG(DISTINCT a1.column_2 ORDER BY a1.column_2) as ar
      FROM agg_test a1 JOIN
           agg_test a2
           ON a1.column_2 = a2.column_2 AND a1.column_1 <> a2.column_1
      WHERE a1.column_1 = 1
      GROUP BY a1.column_1, a2.column_1
     ) t
group by col1, ar
The alternative is to use array_to_string() to convert the array values into a string.
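For instance, a variant of the query above keyed on the stringified array (a sketch; the alias ar_text is made up):
select col1, array_agg(col2), ar_text
from (SELECT a1.column_1 as col1, a2.column_1 as col2,
             -- array_to_string(anyarray, delimiter) turns {3,4} into '3,4'
             array_to_string(ARRAY_AGG(DISTINCT a1.column_2 ORDER BY a1.column_2), ',') as ar_text
      FROM agg_test a1 JOIN
           agg_test a2
           ON a1.column_2 = a2.column_2 AND a1.column_1 <> a2.column_1
      WHERE a1.column_1 = 1
      GROUP BY a1.column_1, a2.column_1
     ) t
group by col1, ar_text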
You could also try something like this:
SELECT DISTINCT
    a1.column_1,
    ARRAY_AGG(a2.column_1) OVER (
        PARTITION BY
            a1.column_1,
            ARRAY_AGG(DISTINCT a1.column_2 ORDER BY a1.column_2)
    ) AS "a2.column_1 agg",
    ARRAY_AGG(DISTINCT a1.column_2 ORDER BY a1.column_2)
FROM agg_test a1
JOIN agg_test a2 ON a1.column_2 = a2.column_2 AND a1.column_1 <> a2.column_1
WHERE a1.column_1 = 1
GROUP BY a1.column_1, a2.column_1
;
The above uses a window ARRAY_AGG to combine the values of a2.column_1 alongside the other ARRAY_AGG, using the latter's result as one of the partitioning criteria. Without the DISTINCT, it would produce two {4,5} rows for your example, so DISTINCT is needed to eliminate the duplicates.
Here's a SQL Fiddle demo: http://sqlfiddle.com/#!1/df5c3/4
Note, though, that the window ARRAY_AGG cannot take an ORDER BY like its "normal" aggregate counterpart. That means the order of the a2.column_1 values in the list is indeterminate, although in the linked demo it happens to match the one in your expected output.