SQL: Calculate field based on three rows

How could I calculate a field based on values from previous and next rows?
I have this list of users with a date (month and year) and a field indicating whether the user has 1+ purchases in that month-year:
id_user  Date      Has_purchases  Active
15678    Jan 2021  0              1
15678    Feb 2021  1              1
15678    Mar 2021  0              1
15678    Apr 2021  0              1
15678    May 2021  0              0
15678    Jun 2021  0              1
15678    Jul 2021  0              1
15678    Aug 2021  1              1
15678    Sep 2021  0              1
15678    Oct 2021  0              1
15678    Nov 2021  0              1
15678    Dec 2021  1              1
I need to calculate whether the user was active on a date (month-year). An active user is defined as a user who has at least one purchase in the last 3 months.
E.g. user 15678 is 'active' in March because the user has purchases in February; the same user is inactive in May because it has no purchases in March and April, and also no purchases in June and July.
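As a starting point, here is a minimal pandas sketch of the rolling-window logic; in SQL the same idea would be a window aggregate over the preceding rows. The column names and the exact window size are assumptions, not from the original post.
import pandas as pd

# Illustrative data matching the table above (names are assumptions).
df = pd.DataFrame({
    "id_user": [15678] * 12,
    "date": pd.date_range("2021-01-01", periods=12, freq="MS"),
    "has_purchases": [0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
})

# "Active" here means at least one purchase in the three preceding months;
# tweak the shift/window to match the exact definition in the question.
df["active"] = (
    df.sort_values("date")
      .groupby("id_user")["has_purchases"]
      .transform(lambda s: s.shift(1).rolling(3, min_periods=1).max())
      .fillna(0)
      .astype(int)
)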

Related

Pandas Sort Two Columns with Day of Year Wrap-Around to New Year

I have data where, around the first of each year, a day_of_year sequence requires changing the "year" column to the new year when day_of_year == 1. It is a trick that I have not been able to figure out, and I am not sure how to start, so any help here is much appreciated. My data looks like this:
Here is my df1 =
day_of_year year var_1
364 2017 17.71666667
364 2018 5.166666667
364 2019 2
364 2020 1.595833333
364 2021 3.75
364 2022 6.8875
365 2017 14.83333333
365 2018 2.758333333
365 2019 4.108333333
365 2020 5.766666667
365 2021 5.291666667
365 2022 10.58636364
1 2017 2.0125
1 2018 14.0125
1 2019 -0.504166667
1 2020 7.666666667
1 2021 5.520833333
1 2022 1.229166667
2 2017 1.7625
2 2018 15.10416667
2 2019 -0.391666667
2 2020 9.5
2 2021 7.645833333
2 2022 0.9125
And, after the re-formatting, I need it to look like the sorted df below, with "n/a" for any missing or expected data in a year that might be missing data. Thank you again.
final df:
day_of_year year var_1
364 2017 17.71666667
365 2017 14.83333333
1 2018 14.0125
2 2018 15.10416667
364 2018 5.166666667
365 2018 2.758333333
1 2019 -0.504166667
2 2019 -0.391666667
364 2019 2
365 2019 4.108333333
1 2020 7.666666667
2 2020 9.5
364 2020 1.595833333
365 2020 5.766666667
1 2021 5.520833333
2 2021 7.645833333
364 2021 3.75
365 2021 5.291666667
1 2022 1.229166667
2 2022 0.9125
364 2022 6.8875
365 2022 10.58636364
n/a n/a n/a
n/a n/a n/a
Why would you change the year based on the day? Just sort by the two columns:
df.sort_values(by=['year', 'day_of_year'])
Output:
day_of_year year var_1
12 1 2017 2.012500
18 2 2017 1.762500
0 364 2017 17.716667
6 365 2017 14.833333
13 1 2018 14.012500
19 2 2018 15.104167
1 364 2018 5.166667
7 365 2018 2.758333
14 1 2019 -0.504167
20 2 2019 -0.391667
2 364 2019 2.000000
8 365 2019 4.108333
15 1 2020 7.666667
21 2 2020 9.500000
3 364 2020 1.595833
9 365 2020 5.766667
16 1 2021 5.520833
22 2 2021 7.645833
4 364 2021 3.750000
10 365 2021 5.291667
17 1 2022 1.229167
23 2 2022 0.912500
5 364 2022 6.887500
11 365 2022 10.586364
If for some reason you really need to fix the year, use a conditional with mask:
(df.assign(year=df['year'].mask(df['day_of_year'].le(2), df['year'].add(1)))
.sort_values(by=['year', 'day_of_year'])
)
Or, if you want to update the years after a change from 365 to a lower day:
(df.assign(year=df['year'].add(df['day_of_year'].diff().lt(0).cumsum()))
.sort_values(by=['year', 'day_of_year'])
)
Output:
day_of_year year var_1
0 364 2017 17.716667
6 365 2017 14.833333
12 1 2018 2.012500
18 2 2018 1.762500
1 364 2018 5.166667
7 365 2018 2.758333
13 1 2019 14.012500
19 2 2019 15.104167
2 364 2019 2.000000
8 365 2019 4.108333
14 1 2020 -0.504167
20 2 2020 -0.391667
3 364 2020 1.595833
9 365 2020 5.766667
15 1 2021 7.666667
21 2 2021 9.500000
4 364 2021 3.750000
10 365 2021 5.291667
16 1 2022 5.520833
22 2 2022 7.645833
5 364 2022 6.887500
11 365 2022 10.586364
17 1 2023 1.229167
23 2 2023 0.912500
I would convert everything to datetime first. Just run:
df['ymd'] = pd.to_datetime(df['day_of_year'].astype(str) + '-' + df['year'].astype(str),
                           format='%j-%Y')
Sorting on ymd then yields the following:
>>> df.sort_values('ymd')
day_of_year year var_1 ymd
12 1 2017 2.012500 2017-01-01
18 2 2017 1.762500 2017-01-02
0 364 2017 17.716667 2017-12-30
6 365 2017 14.833333 2017-12-31
13 1 2018 14.012500 2018-01-01
19 2 2018 15.104167 2018-01-02
1 364 2018 5.166667 2018-12-30
7 365 2018 2.758333 2018-12-31
14 1 2019 -0.504167 2019-01-01
20 2 2019 -0.391667 2019-01-02
2 364 2019 2.000000 2019-12-30
8 365 2019 4.108333 2019-12-31
15 1 2020 7.666667 2020-01-01
21 2 2020 9.500000 2020-01-02
3 364 2020 1.595833 2020-12-29
9 365 2020 5.766667 2020-12-30
16 1 2021 5.520833 2021-01-01
22 2 2021 7.645833 2021-01-02
4 364 2021 3.750000 2021-12-30
10 365 2021 5.291667 2021-12-31
17 1 2022 1.229167 2022-01-01
23 2 2022 0.912500 2022-01-02
5 364 2022 6.887500 2022-12-30
11 365 2022 10.586364 2022-12-31

Merge Time Series-Data with different time delta

I am trying to merge two dataframes with different time deltas. One represents the returns of an asset (df2) on a daily basis, and the other is the inflation rate (df1), which is published once a month but not at a regular interval. I am trying to merge the two.
df1 =
First Release
Original Release Date
30 Jun 2010 10:01 1.4%
30 Jul 2010 10:00 1.7%
31 Aug 2010 10:00 1.6%
30 Sep 2010 10:00 1.8%
29 Oct 2010 10:02 1.9%
... ...
17 Mar 2022 11:00 5.9%
21 Apr 2022 10:00 7.4%
18 May 2022 10:00 7.4%
17 Jun 2022 10:00 8.1%
19 Jul 2022 10:00 8.6%
[145 rows x 1 columns]
df2 =
Date
2010-08-11 -0.001654
2010-08-12 -0.028538
2010-08-13 0.001072
2010-08-16 -0.007665
2010-08-17 0.002667
...
2022-01-25 0.029663
2022-01-26 0.026082
2022-01-27 -0.000115
2022-01-28 0.002425
2022-01-31 0.007184
Obviously the inflation rate should be placed in the new column from the day after it is released until there is a new release. For example, 30 June is the first announcement and 30 July the second, so from 1 July to 30 July the value should be 1.4%. The result is published on the 30th, but to avoid look-ahead bias it is more appropriate to apply it only from the following day. Does someone have an idea, or has maybe encountered a similar problem?
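A minimal sketch of one way to do this with pandas.merge_asof, which matches each daily row to the most recent earlier release; the column handling is an assumption based on the excerpts above.
import pandas as pd

# Make the release date a column and shift it one day forward, so a value
# only applies from the day after publication (avoiding look-ahead bias).
inflation = df1.reset_index().rename(columns={'Original Release Date': 'release'})
inflation['release'] = pd.to_datetime(inflation['release']).dt.normalize() + pd.Timedelta(days=1)

returns = df2.reset_index()
returns['Date'] = pd.to_datetime(returns['Date'])

# Backward asof-merge: each daily row gets the latest release at or before it.
merged = pd.merge_asof(
    returns.sort_values('Date'),
    inflation[['release', 'First Release']].sort_values('release'),
    left_on='Date', right_on='release',
    direction='backward',
)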

SQL Custom unique Ordering with repeated sequence

I have a datetime column (data type timestamp without time zone) named time. I can best explain my issue with an example.
Example: I've got the following data in this column (prettifying the timestamps for this example):
ID TIME
1 1 Mar 2022 - 1PM
2 1 Mar 2022 - 2PM
3 1 Mar 2022 - 1PM
4 1 Mar 2022 - 3PM
5 1 Mar 2022 - 2PM
6 2 Mar 2022 - 2PM
7 2 Mar 2022 - 1PM
8 2 Mar 2022 - 3PM
9 2 Mar 2022 - 1PM
10 1 Mar 2022 - 3PM
11 2 Mar 2022 - 2PM
12 2 Mar 2022 - 3PM
13 3 Mar 2022 - 4PM
14 3 Mar 2022 - 3PM
15 3 Mar 2022 - 3PM
16 3 Mar 2022 - 4PM
If I do ORDER BY time, I get the following result:
ID TIME
1 1 Mar 2022 - 1PM
3 1 Mar 2022 - 1PM
2 1 Mar 2022 - 2PM
5 1 Mar 2022 - 2PM
4 1 Mar 2022 - 3PM
10 1 Mar 2022 - 3PM
7 2 Mar 2022 - 1PM
9 2 Mar 2022 - 1PM
6 2 Mar 2022 - 2PM
11 2 Mar 2022 - 2PM
8 2 Mar 2022 - 3PM
12 2 Mar 2022 - 3PM
14 3 Mar 2022 - 3PM
15 3 Mar 2022 - 3PM
13 3 Mar 2022 - 4PM
16 3 Mar 2022 - 4PM
But I want the result this way:
ID TIME
1 1 Mar 2022 - 1PM
2 1 Mar 2022 - 2PM
4 1 Mar 2022 - 3PM
13 3 Mar 2022 - 4PM
3 1 Mar 2022 - 1PM
5 1 Mar 2022 - 2PM
10 1 Mar 2022 - 3PM
16 3 Mar 2022 - 4PM
7 2 Mar 2022 - 1PM
6 2 Mar 2022 - 2PM
8 2 Mar 2022 - 3PM
9 2 Mar 2022 - 1PM
11 2 Mar 2022 - 2PM
12 2 Mar 2022 - 3PM
14 3 Mar 2022 - 3PM
13 3 Mar 2022 - 4PM
As you can see, the first 4 rows have unique timestamps, and the sequence should repeat based on time (1PM, 2PM, 3PM).
How can we do this in SQL? I'm using PostgreSQL as my DB and Rails for my backend.
EDIT:
I've added more context to the example to explain my scenario.
One way is to use the ROW_NUMBER window function together with the REPLACE function (an ORDER BY inside the OVER clause keeps the numbering deterministic):
SELECT time
FROM (
    SELECT *,
           REPLACE(time, 'PM', '') AS val,
           ROW_NUMBER() OVER (PARTITION BY REPLACE(time, 'PM', '') ORDER BY time) AS rn
    FROM T
) t1
ORDER BY rn, val
For example, to repeat the sequence of column a:
with tbl(a, othercol) as (
    SELECT 1,1 UNION ALL
    SELECT 1,2 UNION ALL
    SELECT 1,3 UNION ALL
    SELECT 2,4 UNION ALL
    SELECT 2,5 UNION ALL
    SELECT 2,6 UNION ALL
    SELECT 3,7 UNION ALL
    SELECT 3,8 UNION ALL
    SELECT 3,9
),
cte as (
    SELECT *, row_number() over(partition by a order by a) rn
    from tbl
)
select a, othercol
from cte
order by rn, a
The problem you have at hand is a direct result of not choosing the correct data type for the values you store.
To get the sorting correct, you need to convert the string to a proper time value. There is no to_time() function in Postgres, but you can convert it to a timestamp and then cast that to a time:
order by to_timestamp("time", 'hham')::time
You should fix your database design and convert that column to a proper time type, which will also prevent invalid values ('3 in the afternoon' or '128foo') from being stored in that column.

Query to find three instances the same in one column, but must be have three different results in another column

I have a table like the following:
InspectDate | Serial Number | Reference | Error | PartNumber
I need to find the errors that occurred in the last 10 days. I can get that, but then I need to keep only those errors that occurred on the same reference across three or more different serial numbers.
Please let me know if I need to provide any more info. I have tried using COUNT and filtering for counts of three or more, but that only shows me any one serial number that has three or more errors on that reference.
Sample Data:
InspectDate SerialNumber Reference Error PartNumber
Oct 12 2021 1:58PM 50012 A21 1 PL2-001
Oct 12 2021 3:22PM 50013 A21 1 PL2-001
Oct 12 2021 5:59PM 50062 A21 1 PL2-001
Oct 18 2021 11:24AM 50071 A21 1 PL2-001
Oct 18 2021 12:20PM 50071 A21 2 PL2-001
Oct 18 2021 12:36PM 50071 A21 3 PL2-001
Oct 12 2021 5:59PM 50055 B44 5 AL1-440
Oct 18 2021 11:19AM 50062 B72 1 AL1-660
Oct 18 2021 11:22AM 50071 B72 2 AL1-660
Oct 12 2021 5:39PM 50047 B83 5 AL1-550
Oct 12 2021 3:03PM 50013 V310 2 PL3-010
Oct 18 2021 12:00PM 50071 V310 2 PL3-010
Oct 18 2021 12:37PM 50098 V310 4 PL3-010
Expected Results:
InspectDate SerialNumber Reference Error PartNumber
Oct 12 2021 1:58PM 50012 A21 1 PL2-001
Oct 12 2021 3:22PM 50013 A21 1 PL2-001
Oct 12 2021 5:59PM 50062 A21 1 PL2-001
Oct 18 2021 11:24AM 50071 A21 1 PL2-001
Oct 12 2021 3:03PM 50013 V310 2 PL3-010
Oct 18 2021 12:00PM 50071 V310 2 PL3-010
Oct 18 2021 12:37PM 50098 V310 4 PL3-010
Attempted code:
SELECT (all columns), COUNT(*) AS Instances
FROM (Table)
WHERE InspectDate >= DATEADD(day, -10, GETDATE())
GROUP BY (all columns)
HAVING COUNT(*) >= 3
ORDER BY CAST(InspectDate AS datetime) DESC
What you need here is a windowed COUNT(DISTINCT ...). Unfortunately, SQL Server does not allow COUNT(DISTINCT ...) as a window function, but we can simulate it using DENSE_RANK and MAX, both as window functions:
WITH Ranked AS (
    SELECT *,
           rn = DENSE_RANK() OVER (PARTITION BY Reference ORDER BY SerialNumber)
    FROM [Table]
    WHERE InspectDate >= DATEADD(day, -10, GETDATE())
),
DistinctCount AS (
    SELECT *,
           maxrn = MAX(rn) OVER (PARTITION BY Reference)
    FROM Ranked
)
SELECT *
FROM DistinctCount
WHERE maxrn >= 3;

Databricks: replicate columns

Suppose I have the following DataFrame:
YEAR MONTH Value
2019 JAN 100
2019 JAN 200
2019 MAR 400
2019 MAR 100
And I pivot with a group by on YEAR (df.groupBy().pivot()....):
YEAR JAN MAR
2019 300 500
But I would also like a column for every month of the year, even when there is no data for that month, which means I would like to have:
YEAR JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC
2019 300 0 500 0 0 0 0 0 0 0 0 0
Thanks
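One possible approach, sketched below: PySpark's pivot accepts an explicit list of values, so passing all twelve months forces a column per month, and fillna(0) fills the empty ones. The aggregation (sum of Value) is an assumption based on the totals shown above.
from pyspark.sql import functions as F

# An explicit values list makes pivot emit a column for every month, even
# months with no rows; the missing sums come back as null.
months = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

result = (df.groupBy("YEAR")
            .pivot("MONTH", months)
            .agg(F.sum("Value"))
            .fillna(0))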