How to get the latest values day wise from a timeseries table? - sql

I want to get the latest values of each SIZE_TYPE day wise, ordered by TIMESTAMP. So, only 1 value of each SIZE_TYPE must be present for a given day, and that is the latest value for the day.
How do I get the desired output? I'm using PostgreSQL here.
Input
|TIMESTAMP |SIZE_TYPE|SIZE|
|----------------------------------------|---------|----|
|1595833641356 [Mon Jul 27 2020 07:07:21]|0 |541 |
|1595833641356 [Mon Jul 27 2020 07:07:21]|1 |743 |
|1595833641356 [Mon Jul 27 2020 07:07:21]|2 |912 |
|1595876841356 [Mon Jul 27 2020 19:07:21]|1 |714 |
|1595876841356 [Mon Jul 27 2020 19:07:21]|2 |987 |
|1595963241356 [Tue Jul 28 2020 19:07:21]|0 |498 |
|1595920041356 [Tue Jul 28 2020 07:07:21]|2 |974 |
|1595920041356 [Tue Jul 28 2020 07:07:21]|0 |512 |
*Note: the TIMESTAMP values are in UNIX time. I have given
the date-time string for reference*
Output
|TIMESTAMP |SIZE_TYPE|SIZE|
|----------------------------------------|---------|----|
|1595833641356 [Mon Jul 27 2020 07:07:21]|0 |541 |
|1595876841356 [Mon Jul 27 2020 19:07:21]|1 |714 |
|1595876841356 [Mon Jul 27 2020 19:07:21]|2 |987 |
|1595920041356 [Tue Jul 28 2020 07:07:21]|2 |974 |
|1595963241356 [Tue Jul 28 2020 19:07:21]|0 |498 |
*Note: the TIMESTAMP values are in UNIX time. I have given
the date-time string for reference*
Explanation
For July 27, the latest values for
0: 541 (no other entries for the day)
1: 714
2: 987
For July 28, the latest values for
0: 498
1: nothing (ignore)
2: 974 (no other entries for the day)

You can use distinct on:
select distinct on (floor(timestamp / (24 * 60 * 60 * 1000)), size_type) t.*
from input t
order by floor(timestamp / (24 * 60 * 60 * 1000)), size_type,
timestamp desc;
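The arithmetic just extracts the day from the timestamp: dividing the millisecond UNIX timestamp by the number of milliseconds per day gives a day number (counted in UTC).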
Here is a db<>fiddle.
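To see what the day-bucketing arithmetic does, here is a minimal Python sketch of the same idea, not the PostgreSQL query itself. It assumes the timestamps are milliseconds since the UNIX epoch and that days are cut at UTC midnight; the names `MS_PER_DAY` and `latest_per_day` are made up for illustration.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # milliseconds in one day

# (timestamp_ms, size_type, size) rows copied from the question's input table
rows = [
    (1595833641356, 0, 541),
    (1595833641356, 1, 743),
    (1595833641356, 2, 912),
    (1595876841356, 1, 714),
    (1595876841356, 2, 987),
    (1595963241356, 0, 498),
    (1595920041356, 2, 974),
    (1595920041356, 0, 512),
]


def latest_per_day(rows):
    """Keep the row with the largest timestamp for each (day, size_type)."""
    latest = {}
    for ts, size_type, size in rows:
        key = (ts // MS_PER_DAY, size_type)  # integer day number, like floor(ts / ms_per_day)
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, size_type, size)
    # Sort by timestamp (then size_type) to match the ordering the question asks for
    return sorted(latest.values())


result = latest_per_day(rows)
for row in result:
    print(row)
```

The dictionary key plays the role of the `distinct on` expressions, and the "keep the larger timestamp" check plays the role of `order by ... timestamp desc`.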

Related

Get the number of unique days with overlapping dates (in SAS)

I couldn't explain the problem briefly, so I'll try to explain it this way. Let's say I have a table similar to the one below.
How do I get, per student, the total number of days in October on which that student had at least 1 book checked out?
Please note that a single student can check out more than 1 book at a time, which causes the overlapping dates.
|Student|Book|Date_Borrowed|Date_Returned|
|-------|----|-------------|-------------|
|David|A Thousand Splendid Suns|01 Oct 2021|05 Oct 2021|
|David|Jane Eyre|09 Oct 2021|13 Oct 2021|
|David|Please Look After Mom|21 Oct 2021|29 Oct 2021|
|Fiona|Sense and Sensibility|05 Oct 2021|14 Oct 2021|
|Fiona|The Girl Who Saved the King of Sweden|05 Oct 2021|14 Oct 2021|
|Fiona|A Fort of Nine Towers|02 Oct 2021|17 Oct 2021|
|Fiona|One Hundred Years of Solitude|20 Oct 2021|30 Oct 2021|
|Fiona|The Unbearable Lightness of Being|20 Oct 2021|30 Oct 2021|
|Greg|Fahrenheit 451|06 Oct 2021|11 Oct 2021|
|Greg|One Hundred Years of Solitude|10 Oct 2021|17 Oct 2021|
|Greg|Please Look After Mom|15 Oct 2021|21 Oct 2021|
|Greg|4 3 2 1|20 Oct 2021|27 Oct 2021|
|Greg|The Girl Who Saved the King of Sweden|27 Oct 2021|03 Nov 2021|
|Marcus|Fahrenheit 451|01 Oct 2021|04 Oct 2021|
|Marcus|Nectar in a Sieve|15 Oct 2021|15 Oct 2021|
|Marcus|Please Look After Mom|30 Oct 2021|31 Oct 2021|
|Priya|Like Water for Chocolate|02 Oct 2021|21 Oct 2021|
|Priya|Fahrenheit 451|21 Oct 2021|22 Oct 2021|
|Sasha|Baudolino|03 Oct 2021|29 Oct 2021|
|Sasha|A Thousand Splendid Suns|07 Oct 2021|16 Oct 2021|
|Sasha|A Fort of Nine Towers|26 Oct 2021|01 Nov 2021|
Thanks in advance!
Using the data step, you can expand each date into a long format. From there, you can use SQL to do a simple count by student after removing overlapping dates.
data foo;
set have;
do date = date_borrowed to date_returned;
output;
end;
keep student date;
format date date9.;
run;
This gets us a long table of all the dates with at least one book checked out for each student.
student date
David 01OCT2021
David 02OCT2021
David 03OCT2021
David 04OCT2021
David 05OCT2021
David 09OCT2021
...
Now we need to remove the overlapping dates.
proc sort data=foo nodupkey;
by student date;
run;
From here, we can do a simple SQL count per student.
proc sql noprint;
create table want as
select student
, intnx('month', date, 0, 'B') as month format=monyy7.
, count(*) as days_checked_out
from foo
where calculated month = '01OCT2021'd
group by student, calculated month
;
quit;
Output:
student month days_checked_out
David OCT2021 19
Fiona OCT2021 27
Greg OCT2021 26
Marcus OCT2021 7
Priya OCT2021 21
Sasha OCT2021 29
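For readers who want to check the logic outside SAS, the expand, de-duplicate, and count steps can be sketched in Python. This is only an illustration of the idea, not a translation of the SAS code: a set per student does the de-duplication, and the function name `covered_days` and the two-student sample are assumptions made for the example.

```python
from datetime import date, timedelta

# (student, date_borrowed, date_returned) for two students from the question
loans = [
    ("David", date(2021, 10, 1), date(2021, 10, 5)),
    ("David", date(2021, 10, 9), date(2021, 10, 13)),
    ("David", date(2021, 10, 21), date(2021, 10, 29)),
    ("Greg", date(2021, 10, 6), date(2021, 10, 11)),
    ("Greg", date(2021, 10, 10), date(2021, 10, 17)),
    ("Greg", date(2021, 10, 15), date(2021, 10, 21)),
    ("Greg", date(2021, 10, 20), date(2021, 10, 27)),
    ("Greg", date(2021, 10, 27), date(2021, 11, 3)),
]


def covered_days(loans, first, last):
    """Count distinct days in [first, last] with at least one loan, per student."""
    days = {}
    for student, borrowed, returned in loans:
        day = max(borrowed, first)   # clip the loan to the reporting window
        end = min(returned, last)
        while day <= end:
            days.setdefault(student, set()).add(day)  # the set drops overlaps
            day += timedelta(days=1)
    return {student: len(ds) for student, ds in days.items()}


counts = covered_days(loans, date(2021, 10, 1), date(2021, 10, 31))
print(counts)
```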
An easy way is to make a temporary array with one variable for each day in the time period you want to count. Then just use a do loop to set the variables representing those days to 1. When you have reached the last record for a student then take the sum to find the number of days covered.
First let's convert your posted table into a dataset.
data have;
infile cards dsd dlm='|' truncover;
input Student :$20. Book :$100. (Date_Borrowed Date_Returned) (:date.);
format Date_Borrowed Date_Returned date11.;
cards;
David|A Thousand Splendid Suns|01 Oct 2021|05 Oct 2021
David|Jane Eyre|09 Oct 2021|13 Oct 2021
David|Please Look After Mom|21 Oct 2021|29 Oct 2021
Fiona|Sense and Sensibility|05 Oct 2021|14 Oct 2021
Fiona|The Girl Who Saved the King of Sweden|05 Oct 2021|14 Oct 2021
Fiona|A Fort of Nine Towers|02 Oct 2021|17 Oct 2021
Fiona|One Hundred Years of Solitude|20 Oct 2021|30 Oct 2021
Fiona|The Unbearable Lightness of Being|20 Oct 2021|30 Oct 2021
Greg|Fahrenheit 451|06 Oct 2021|11 Oct 2021
Greg|One Hundred Years of Solitude|10 Oct 2021|17 Oct 2021
Greg|Please Look After Mom|15 Oct 2021|21 Oct 2021
Greg|4 3 2 1|20 Oct 2021|27 Oct 2021
Greg|The Girl Who Saved the King of Sweden|27 Oct 2021|03 Nov 2021
Marcus|Fahrenheit 451|01 Oct 2021|04 Oct 2021
Marcus|Nectar in a Sieve|15 Oct 2021|15 Oct 2021
Marcus|Please Look After Mom|30 Oct 2021|31 Oct 2021
Priya|Like Water for Chocolate|02 Oct 2021|21 Oct 2021
Priya|Fahrenheit 451|21 Oct 2021|22 Oct 2021
Sasha|Baudolino|03 Oct 2021|29 Oct 2021
Sasha|A Thousand Splendid Suns|07 Oct 2021|16 Oct 2021
Sasha|A Fort of Nine Towers|26 Oct 2021|01 Nov 2021
;
Now we can use BY group processing in a data step to aggregate per student. We can set the lower and upper index of the array to the values SAS uses to represent those days. Temporary arrays are automatically retained across observations, so we just need to clear the array when we start a new student.
The SAS compiler does not expect to see a date literal as the index boundaries for an array so we can use %SYSEVALF() to convert the date literal to the integer it represents.
data want;
set have;
by student ;
array october [%sysevalf('01oct2021'd):%sysevalf('31oct2021'd)] _temporary_ ;
if first.student then call missing(of october[*]);
do date=max(date_borrowed,'01oct2021'd) to min(date_returned,'31oct2021'd);
october[date]=1;
end;
if last.student;
days = sum(0, of october[*]);
keep student days;
run;
Results:
Obs Student days
1 David 19
2 Fiona 27
3 Greg 26
4 Marcus 7
5 Priya 21
6 Sasha 29
You could also modify it slightly to not only count the number of "covered" (or unique) days, but also the total number of "book" days.
data want;
set have;
by student ;
array october [%sysevalf('01oct2021'd):%sysevalf('31oct2021'd)] _temporary_ ;
if first.student then call missing(of october[*]);
do date=max(date_borrowed,'01oct2021'd) to min(date_returned,'31oct2021'd);
october[date]=sum(october[date],1);
end;
if last.student;
unique_days = n(of october[*]);
book_days = sum(0,of october[*]);
keep student unique_days book_days;
run;
Results:
unique_ book_
Obs Student days days
1 David 19 19
2 Fiona 27 58
3 Greg 26 34
4 Marcus 7 7
5 Priya 21 22
6 Sasha 29 43
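The per-day counter idea behind the temporary array can also be sketched in Python, under the assumption that a plain list with one slot per October day stands in for the SAS array. The names `october_usage`, `FIRST`, and `LAST` are illustrative, and only two students from the question are included.

```python
from collections import defaultdict
from datetime import date

FIRST, LAST = date(2021, 10, 1), date(2021, 10, 31)  # reporting window

loans = [
    ("Priya", date(2021, 10, 2), date(2021, 10, 21)),
    ("Priya", date(2021, 10, 21), date(2021, 10, 22)),
    ("Sasha", date(2021, 10, 3), date(2021, 10, 29)),
    ("Sasha", date(2021, 10, 7), date(2021, 10, 16)),
    ("Sasha", date(2021, 10, 26), date(2021, 11, 1)),
]


def october_usage(loans):
    """Return {student: (unique_days, book_days)} for October 2021."""
    n_days = LAST.toordinal() - FIRST.toordinal() + 1
    slots = defaultdict(lambda: [0] * n_days)  # one counter per day, per student
    for student, borrowed, returned in loans:
        start = max(borrowed, FIRST).toordinal()
        end = min(returned, LAST).toordinal()
        for day in range(start, end + 1):      # empty if the loan misses October
            slots[student][day - FIRST.toordinal()] += 1
    return {
        student: (sum(1 for c in counts if c), sum(counts))
        for student, counts in slots.items()
    }


usage = october_usage(loans)
print(usage)
```

Counting the non-zero slots gives the unique days, while summing the slots gives the book days, mirroring `n(of october[*])` and `sum(0, of october[*])`.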

How to calculate median monthly from date of month table?

My dataset:
Date Num_orders
Mar 21 2019 69
Mar 22 2019 82
Mar 24 2019 312
Mar 25 2019 199
Mar 26 2019 2,629
Mar 27 2019 2,819
Mar 28 2019 3,123
Mar 29 2019 3,332
Mar 30 2019 1,863
Mar 31 2019 1,097
Apr 01 2019 1,578
Apr 02 2019 2,353
Apr 03 2019 2,768
Apr 04 2019 2,648
Apr 05 2019 3,192
Apr 06 2019 2,363
Apr 07 2019 1,578
Apr 08 2019 3,090
Apr 09 2019 3,814
Apr 10 2019 3,836
...
I need to calculate the monthly median number of orders from days of the same month:
The desired results:
Month Median_monthly
Mar 2019 1,863
Apr 2019 2,768
May 2019 2,876
Jun 2019 ...
...
I tried to use function date_trunc to extract month from the dataset then group by 'month' but it didn't work out. Thanks for your help, I use Google Bigquery (#standard) environment!
You probably tried to use PERCENTILE_CONT, which BigQuery provides as an analytic (window) function, so it cannot be used with GROUP BY.
Try APPROX_QUANTILES(x, 100)[OFFSET(50)] instead. It is an aggregate function, so it works with GROUP BY, but note that it returns an approximate result.
SELECT Month, APPROX_QUANTILES(Num_orders, 100)[OFFSET(50)] AS median
FROM myTable
GROUP BY Month
Alternatively, you can use PERCENTILE_CONT within a subquery:
SELECT
DISTINCT Month, median
FROM (
SELECT
Month,
PERCENTILE_CONT(Num_orders, 0.5) OVER(PARTITION BY Month) AS median
FROM myTable
)
This would often be done using DISTINCT:
SELECT DISTINCT DATE_TRUNC(date, MONTH) AS month,
PERCENTILE_CONT(Num_orders, 0.5) OVER (PARTITION BY DATE_TRUNC(date, MONTH)) AS median
FROM myTable;
Note: There are two percentile functions, PERCENTILE_CONT() and PERCENTILE_DISC(). They have different results when there is a "tie" in the middle of the data.
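The difference between the two flavours of median is easy to see on a tiny example. The Python sketch below is not BigQuery code: it uses made-up order counts, `statistics.median` as a stand-in for the interpolating behaviour of PERCENTILE_CONT, and `statistics.median_low` as a stand-in for PERCENTILE_DISC at 0.5, which is an assumption about how the two correspond.

```python
from collections import defaultdict
from datetime import date
from statistics import median, median_low

# Hypothetical (date, num_orders) rows, not the question's data
orders = [
    (date(2019, 3, 21), 10),
    (date(2019, 3, 22), 40),
    (date(2019, 3, 24), 20),
    (date(2019, 3, 25), 30),
    (date(2019, 4, 1), 9),
    (date(2019, 4, 2), 5),
    (date(2019, 4, 3), 7),
]


def monthly_medians(orders):
    """Group by first-of-month and return (interpolated, discrete) medians."""
    by_month = defaultdict(list)
    for day, num_orders in orders:
        by_month[day.replace(day=1)].append(num_orders)  # like truncating to month
    return {
        month: (median(values), median_low(values))
        for month, values in sorted(by_month.items())
    }


medians = monthly_medians(orders)
for month, (cont, disc) in medians.items():
    print(month, cont, disc)
```

With an even number of rows (March here) the interpolated median falls between the two middle values, while the discrete one picks an actual value from the data; with an odd number of rows (April) both agree.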

SQL Table based on a pattern

I have a Table like:
source target
jan feb
mar apr
jun
feb aug
apr jul
oct dec
aug nov
dec may
The output (where I want to create a new_target column):
source target new_target
jan feb aug
mar apr jul
jun
feb aug nov
apr jul
oct dec may
aug nov
dec may
The aim is to create the new_target column by following the chain from source to target. For example, jan in source has the value feb in target; feb in source in turn has the value aug in target, and aug has nov in target.
So the new_target column holds the 3rd value in the chain: tracing source to target gives jan->feb->aug->nov, and since aug is the 3rd value, it is the output in the new_target column.
This looks like a left join:
select t.*, tnext.target as new_target
from t left join
t tnext
on t.target = tnext.source
Try this:
select m1.source,
m1.target,
m2.target as new_target
from mytable m1
left join mytable m2 on
m1.target = m2.source
The left join will maintain all rows from the original table, while adding values to the new_target column if there is a match.
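The self-join amounts to a second lookup in the same table. A minimal Python sketch of that lookup follows, assuming each source value appears at most once (as in the sample data); the names `pairs` and `next_of` are illustrative.

```python
# (source, target) rows from the question; None marks the empty target for jun
pairs = [
    ("jan", "feb"),
    ("mar", "apr"),
    ("jun", None),
    ("feb", "aug"),
    ("apr", "jul"),
    ("oct", "dec"),
    ("aug", "nov"),
    ("dec", "may"),
]

# Lookup from source to target, the role played by the second copy of the table
next_of = {source: target for source, target in pairs if target is not None}

# For each row, look up the row whose source equals this row's target.
# dict.get returns None when there is no match, like the NULL from a left join.
with_new_target = [(source, target, next_of.get(target)) for source, target in pairs]

for row in with_new_target:
    print(row)
```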

Adding set lists of future dates to rows in a SQL query

So I am doing a cohort analysis for customers, where a cohort is a group of people who started using the product in the same month. I then keep track of each cohort's total use for every subsequent month up till present time.
For example, the first "cohort month" is January 2012, then I have "use months" Jan 12, Feb 12, Mar 12, ..., Mar 17 (the current month). One column is "cohort month", and another is "use month". This process repeats for every subsequent cohort month. The table looks like:
Jan 12 | Jan 12
Jan 12 | Feb 12
...
Jan 12 | Mar 17
Feb 12 | Feb 12
Feb 12 | Mar 12
...
Feb 12 | Mar 17
...
Feb 17 | Feb 17
Feb 17 | Mar 17
Mar 17 | Mar 17
The problem arises because I want to do forecasting for one year out for both existing and future cohorts.
That means for the Jan 12 cohort, I want to do prediction for April 17 to Mar 18.
I also want to do predictions for the April 17 cohort (which doesn't exist yet) from April 17 to Mar 18. And so on till predictions for the Mar 18 cohort in Mar 18.
I can handle the predictions, don't worry about that.
My issue is that I cannot figure out how to add this list of (April 17 ... Mar 18) to the "use month" column before every cohort switches.
I also need to add the cohorts April 17 to Mar 18, and have the applicable parts of this list of (April 17 ... Mar 18) for each of these future cohorts.
So I want the table to look like:
Jan 12 | Jan 12
Jan 12 | Feb 12
...
Jan 12 | Mar 17
Jan 12 | Apr 17
..
Jan 12 | Mar 18
Feb 12 | Feb 12
Feb 12 | Mar 12
...
Feb 12 | Mar 17
Feb 12 | Apr 17
...
Feb 12 | Mar 18
...
...
Feb 17 | Feb 17
Feb 17 | Mar 17
...
Feb 17 | Mar 18
Mar 17 | Mar 17
...
Mar 17 | Mar 18
I know the first solution to come to mind is to create a list of all months from Jan 12 to Mar 18, cross join it to itself, and then left outer join it to my current table (where cohort / use months range from Jan 12 to Mar 17). However, this is not scalable.
Is there a way I can just iteratively add in this list of the months of the next year?
I am using HP Vertica, could use Presto or Hive if absolutely necessary
I think you should use the query below to create a temporary table out of nothing, and join it with the rest of your query. You can't do anything in a procedural manner in SQL, I'm afraid, so you won't be able to get away without a CROSS JOIN. But here, the CROSS JOIN is limited to generating the first-of-month pairs that you need.
Here goes:
WITH
-- create a list of integers from 0 to 100 using the TIMESERIES clause
i(i) AS (
SELECT dt::DATE - '2000-01-01'::DATE
FROM (
SELECT '2000-01-01'::DATE + 0
UNION ALL SELECT '2000-01-01'::DATE + 100
) d(d)
TIMESERIES dt AS '1 day' OVER(ORDER BY d::TIMESTAMP)
)
,
-- limits are Jan-2012 to the first of the current month plus one year
month_limits(month_limit) AS (
SELECT '2012-01-01'::DATE
UNION ALL SELECT ADD_MONTHS(TRUNC(CURRENT_DATE,'MONTH'),12)
)
-- create the list of possible months as a CROSS JOIN of the i table
-- containing the integers and the month_limits table, using ADD_MONTHS()
-- and the smallest and greatest month of the month limits
,month_list AS (
SELECT
ADD_MONTHS(MIN(month_limit),i) AS month_first
FROM month_limits CROSS JOIN i
GROUP BY i
HAVING ADD_MONTHS(MIN(month_limit),i) <= (
SELECT MAX(month_limit) FROM month_limits
)
)
-- finally, CROSS JOIN the obtained month list with itself with the
-- filters needed.
SELECT
cohort.month_first AS cohort_month
, use.month_first AS use_month
FROM month_list AS cohort
CROSS JOIN month_list AS use
WHERE use.month_first >= cohort.month_first
ORDER BY 1,2
;
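To check what the final result should contain, the pair generation can be sketched in Python. This is an illustration of the logic only, not Vertica code: the helper names `month_range` and `cohort_pairs` are made up, and the end month is passed in explicitly instead of being derived from the current date.

```python
from datetime import date


def month_range(start, end):
    """List the first day of every month from start to end, inclusive."""
    months = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.append(date(year, month, 1))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


def cohort_pairs(start, end):
    """All (cohort_month, use_month) pairs with use_month >= cohort_month."""
    months = month_range(start, end)
    return [(cohort, use) for cohort in months for use in months if use >= cohort]


# Cohorts from Jan 2012 through Mar 2018, the example range from the question
pairs = cohort_pairs(date(2012, 1, 1), date(2018, 3, 1))
print(len(pairs), pairs[0], pairs[-1])
```

The filter `use >= cohort` corresponds to the WHERE clause of the final SELECT, so only the upper triangle of the month-by-month grid is kept.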

Trying to pull the required rows from the single table with applying conditional statements on columns in sql server?

I have tried many ways to solve this, but unfortunately I got stuck every time.
source table
id year jan feb mar apr may jun jul aug sep oct nov dec
1234 2014 05 06 12 15 16 17 18 19 20 21 22 23
1234 2013 05 06 12 15 16 17 18 19 20 21 22 23
Task: Assume the current month is March 2014. We need the values for the preceding 12 months (i.e., from Mar 2013 to Feb 2014); all other month values must be zero, leaving year and id unchanged.
Solution:
id year jan feb mar apr may jun jul aug sep oct nov dec
1234 2014 05 06 0 0 0 0 0 0 0 0 0 0
1234 2013 0 0 12 15 16 17 18 19 20 21 22 23
This needs a code solution for SQL Server 2008. I would be very grateful if anybody can solve this.
Note:
I got stuck trying to pull the column names dynamically.
You can try this:
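select id, year,
case when datediff(month, convert(datetime, cast(year as char(4)) + '0101'), getdate()) between 1 and 12 then jan else 0 end as jan,
case when datediff(month, convert(datetime, cast(year as char(4)) + '0201'), getdate()) between 1 and 12 then feb else 0 end as feb, ....
Each month column is kept only when its month falls 1 to 12 months before the current month; otherwise it is replaced by 0.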
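The rolling 12-month window that the task describes can be checked with a short Python sketch. It is not SQL Server code: the function name `mask_row` is made up, the current month is passed in explicitly (March 2014, as in the question) instead of being read from the clock, and the values are treated as integers.

```python
def mask_row(year, values, cur_year, cur_month):
    """Zero out every month that is not 1 to 12 months before the current month.

    values holds the 12 monthly values for `year`, January first.
    """
    masked = []
    for month, value in enumerate(values, start=1):
        # Whole months between this column's month and the current month
        months_back = (cur_year - year) * 12 + (cur_month - month)
        masked.append(value if 1 <= months_back <= 12 else 0)
    return masked


# Rows from the question's source table: (id, year, jan..dec)
source = [
    (1234, 2014, [5, 6, 12, 15, 16, 17, 18, 19, 20, 21, 22, 23]),
    (1234, 2013, [5, 6, 12, 15, 16, 17, 18, 19, 20, 21, 22, 23]),
]

masked_rows = [
    (row_id, year, mask_row(year, values, 2014, 3))
    for row_id, year, values in source
]
for row in masked_rows:
    print(row)
```

The `months_back` value corresponds to a month-level date difference between the column's month and the current date, so the same condition can be applied to each of the twelve columns.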