Sum/aggregate data based on dates - SQL

I have the following table of items sold in every shop.
table columns
shop_id: the specific id of the shop,
sold: the purchase amount in dollars ($),
time: the date and time of the purchase.
data
shop_id sold time
1 12.44 23/10/2014 20:20
1 12.77 24/10/2014 20:18
1 10.72 24/10/2014 20:18
1 14.51 24/10/2014 20:18
2 5.94 22/10/2014 20:11
2 15.69 23/10/2014 20:23
2 8.55 24/10/2014 20:12
2 6.96 24/10/2014 20:18
3 8.84 22/10/2014 20:21
3 7.82 22/10/2014 20:21
3 22.19 23/10/2014 20:23
3 13.21 23/10/2014 20:23
4 14.60 23/10/2014 20:20
4 12.19 23/10/2014 20:23
4 5.41 24/10/2014 20:18
4 10.93 24/10/2014 20:19
5 18.54 22/10/2014 20:21
5 7.48 22/10/2014 20:21
5 10.67 24/10/2014 20:18
5 15.96 24/10/2014 20:18
I have 3 classifiers per purchase:
purchase classifiers
low: $0-8
medium: $8-12
high: $12 and higher
What I would like to do is write a PostgreSQL query that produces the total purchases for each day, broken down by purchase classifier.
desired output
date low medium high
22/10/2014 29.10 14.51 12.77
23/10/2014 0 0 70.06
24/10/2014 16.34 51.24 41.39
Thank you very much in advance for your kind help. I am not experienced with PostgreSQL. I could do this easily in R, but since moving a huge database table into R is very cumbersome, I need to learn some PostgreSQL.

I believe you can do this using conditional aggregation like this:
select
  cast("time" as date) as date,
  sum(case when sold > 0 and sold <= 8 then sold else 0 end) as low,
  sum(case when sold > 8 and sold <= 12 then sold else 0 end) as medium,
  sum(case when sold > 12 then sold else 0 end) as high
from your_table
group by cast("time" as date);
You might have to tweak the ranges a bit - should 8 fall into both low and medium? Also, cast("time" as date) is in fact valid PostgreSQL syntax; the Postgres-specific shorthand "time"::date does the same thing. Quoting "time" is a good idea, since time is also the name of a built-in type.
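As a side note, if you are on PostgreSQL 9.4 or newer (an assumption about your setup), the same conditional aggregation can be written with the FILTER clause; your_table is the same placeholder as above:
select
  "time"::date as date,
  -- sum only the rows that match each classifier's range
  coalesce(sum(sold) filter (where sold > 0 and sold <= 8), 0) as low,
  coalesce(sum(sold) filter (where sold > 8 and sold <= 12), 0) as medium,
  coalesce(sum(sold) filter (where sold > 12), 0) as high
from your_table
group by "time"::date;
The coalesce calls are needed because a filtered sum yields NULL, not 0, for a group where no rows match.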

Related

SQL: how to average across groups, while taking a time constraint into account

I have a table named orders in a Postgres database that looks like this:
customer_id order_id order_date price product
1 2 2021-03-05 15 books
1 13 2022-03-07 3 music
1 14 2022-06-15 900 travel
1 11 2021-11-17 25 books
1 16 2022-08-03 32 books
2 4 2021-04-12 4 music
2 7 2021-06-29 9 music
2 20 2022-11-03 8 music
2 22 2022-11-07 575 travel
2 24 2022-11-20 95 food
3 3 2021-03-17 25 books
3 5 2021-06-01 650 travel
3 17 2022-08-17 1200 travel
3 19 2022-10-02 6 music
3 23 2022-11-08 70 food
4 9 2021-08-20 3200 travel
4 10 2021-10-29 2750 travel
4 15 2022-07-15 1820 travel
4 21 2022-11-05 8000 travel
4 25 2022-11-29 27 books
5 1 2021-01-04 3 music
5 6 2021-06-09 820 travel
5 8 2021-07-30 19 books
5 12 2021-12-10 22 music
5 18 2022-09-19 20 books
Here's a SQL Fiddle: http://sqlfiddle.com/#!17/262fc/1
I'd like to return the average money spent by customers per product, but only consider orders within the first 12 months of a given customer's first purchase within the given product group. (yes, this is challenging!)
For example, for customer 1, order ID 2 and order ID 11 would be factored into the average for books (because order ID 11 took place less than 12 months after customer 1's first order for books, which was order ID 2), but order ID 16 would not be factored into the average (because 8/3/22 is more than 12 months from customer 1's first purchase of books, which took place on 3/5/21).
(The original question included a matrix image, not reproduced here, showing which orders would be included for each product.)
The desired output would look as follows:
product  average_spent
books    22.20
music    7.83
travel   1530.71
food     82.50
How would I do this?
Thanks in advance for any assistance you can give!
You can use a correlated subquery to check whether or not to include an order's price in the average:
select o.product, round(avg(o.price), 2) as average_spent
from orders o
where o.order_date < (select min(o1.order_date)
                      from orders o1
                      where o1.product = o.product
                        and o1.customer_id = o.customer_id) + interval '12 months'
group by o.product;
See fiddle
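As an alternative sketch for the same schema, a window function can replace the correlated subquery by computing each customer's first order date per product in a single pass:
-- compute each customer's first order date per product, then filter to the 12-month window
select t.product, round(avg(t.price), 2) as average_spent
from (
    select o.*,
           min(o.order_date) over (partition by o.customer_id, o.product) as first_order_date
    from orders o
) t
where t.order_date < t.first_order_date + interval '12 months'
group by t.product;
On large tables this avoids re-scanning orders once per row, though the optimizer may rewrite the correlated form similarly.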

Count number of records by month over the last five years where record date > selected month

I need to show the number of valid inspectors we have by month over the last five years. Inspectors are considered valid when the expiration date on their certification has not yet passed, recorded as the month-end date. The SQL below is the query I use to count valid inspectors for January 2017:
SELECT Count(*) AS RecordCount
FROM dbo_Insp_Type
WHERE dbo_Insp_Type.CERT_EXP_DTE >= #2/1/2017#;
Rather than designing 60 queries, one for each month, and compiling the results in a final table (or, err, query), are there other methods I can use that call for less manual input?
From this sample:
Id  CERT_EXP_DTE
1   2022-01-15
2   2022-01-23
3   2022-02-01
4   2022-02-03
5   2022-05-01
6   2022-06-06
7   2022-06-07
8   2022-07-21
9   2022-02-20
10  2021-11-05
11  2021-12-01
12  2021-12-24
this single query:
SELECT
    Format([CERT_EXP_DTE], "yyyy-mm") AS YearMonth,
    Count(*) AS AllInspectors,
    Sum(Abs([CERT_EXP_DTE] >= DateSerial(Year([CERT_EXP_DTE]), Month([CERT_EXP_DTE]), 2))) AS ValidInspectors
FROM dbo_Insp_Type
GROUP BY Format([CERT_EXP_DTE], "yyyy-mm");
will return:
YearMonth  AllInspectors  ValidInspectors
2021-11    1              1
2021-12    2              1
2022-01    2              2
2022-02    3              2
2022-05    1              0
2022-06    2              2
2022-07    1              1
Given this extended sample, which also includes certification issue dates:
ID  Cert_Iss_Dte  Cert_Exp_Dte
1   1/15/2020     1/15/2022
2   1/23/2020     1/23/2022
3   2/1/2020      2/1/2022
4   2/3/2020      2/3/2022
5   5/1/2020      5/1/2022
6   6/6/2020      6/6/2022
7   6/7/2020      6/7/2022
8   7/21/2020     7/21/2022
9   2/20/2020     2/20/2022
10  11/5/2021     11/5/2023
11  12/1/2021     12/1/2023
12  12/24/2021    12/24/2023
A UNION query could calculate a record for each of up to 50 months, but since you want 60, UNION is out.
Or a query with 60 calculated fields using IIf() and Count(), referencing a textbox on a form for the start date:
SELECT Count(IIf(CERT_EXP_DTE>=Forms!formname!tbxDate,1,Null)) AS Dt1,
Count(IIf(CERT_EXP_DTE>=DateAdd("m",1,Forms!formname!tbxDate),1,Null)) AS Dt2,
...
FROM dbo_Insp_Type
Using the above data, the following is the output for Feb and Mar 2022. I did a test with Cert_Iss_Dte included in the criteria and it did not make a difference for this sample data.
Dt1  Dt2
10   8
Or a report with 60 textboxes, each calling a DCount() expression with the same criteria as used in the query.
Or a VBA procedure that writes data to a 'temp' table.
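Or a helper calendar table. As a sketch, assuming you create a table hypothetically named Months with one MonthStart date per month of interest (60 rows), a single cross join covers every month (Access SQL does not allow inline comments, so the logic is described here: each inspector is counted under every month whose start date their certification outlives):
SELECT m.MonthStart, Count(i.ID) AS ValidInspectors
FROM Months AS m, dbo_Insp_Type AS i
WHERE i.CERT_EXP_DTE >= m.MonthStart
GROUP BY m.MonthStart;
Months in which no certification is still valid will simply not appear in the output, so seed the helper table with all 60 month-start dates you need.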

How do you get the last entry for each month in SQL?

I am looking to filter very large tables to the latest entry per user per month. I'm not sure I have found the best way to do this. I know I "should" trust the SQL engine (Snowflake), but there is a part of me that does not like the join on three columns.
Note that this is a very common operation on many big tables, and I want to use it in DBT views which means it will get run all the time.
To illustrate, my data is of this form:
mytable
userId  loginDate   year  month  value
1       2021-01-04  2021  1      41.1
1       2021-01-06  2021  1      411.1
1       2021-01-25  2021  1      251.1
2       2021-01-05  2021  1      4369
2       2021-02-06  2021  2      32
2       2021-02-14  2021  2      731
3       2021-01-20  2021  1      258
3       2021-02-19  2021  2      4251
3       2021-03-15  2021  3      171
And I'm trying to use SQL to get the last value (by loginDate) for each month.
I'm currently doing a groupby & a join as follows:
WITH latest_entry_by_month AS (
    SELECT "userId", "year", "month", max("loginDate") AS "loginDate"
    FROM mytable
    GROUP BY "userId", "year", "month"
)
SELECT * FROM mytable NATURAL JOIN latest_entry_by_month
The above results in my desired output:
userId  loginDate   year  month  value
1       2021-01-25  2021  1      251.1
2       2021-01-05  2021  1      4369
2       2021-02-14  2021  2      731
3       2021-01-20  2021  1      258
3       2021-02-19  2021  2      4251
3       2021-03-15  2021  3      171
But I'm not sure if it's optimal.
Any guidance on how to do this faster? Note that I am not materializing the underlying data, so it is effectively un-clustered (I'm getting it from a vendor via the Snowflake marketplace).
Using QUALIFY and a window function (ROW_NUMBER):
SELECT *
FROM mytable
QUALIFY ROW_NUMBER() OVER (PARTITION BY "userId", "year", "month"
                           ORDER BY "loginDate" DESC) = 1
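If your Snowflake account has the MAX_BY aggregate available (an assumption - it is a relatively recent addition), a plain GROUP BY returns the same rows without a window function:
-- keep the value belonging to the latest loginDate in each group
SELECT "userId", "year", "month",
       MAX("loginDate") AS "loginDate",
       MAX_BY("value", "loginDate") AS "value"
FROM mytable
GROUP BY "userId", "year", "month";
Note that MAX_BY picks an arbitrary row among ties on "loginDate", whereas ROW_NUMBER with an explicit ORDER BY lets you control tie-breaking.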

Pandas: to get mean for each data category daily [duplicate]

I am a somewhat beginner programmer learning Python (+pandas) and hope I can explain this well enough. I have a large time-series pandas DataFrame of over 3 million rows and initially 12 columns, spanning a number of years. It covers people taking a ticket from different locations denoted by Id numbers (350 of them). Each row is one instance (one ticket taken).
I have searched many questions like counting records per hour per day and getting the average per hour over several years. However, I run into trouble including the 'Id' variable.
I'm looking to get the mean value of people taking a ticket for each hour, for each day of the week (mon-fri) and per station.
I have the following, setting datetime to index:
Id Start_date Count Day_name_no
149 2011-12-31 21:30:00 1 5
150 2011-12-31 20:51:00 1 0
259 2011-12-31 20:48:00 1 1
3015 2011-12-31 19:38:00 1 4
28 2011-12-31 19:37:00 1 4
Using groupby and Start_date.index.hour, I can't seem to include the 'Id'.
My alternative approach is to split the hour out of the date and have the following:
Id Count Day_name_no Trip_hour
149 1 2 5
150 1 4 10
153 1 2 15
1867 1 4 11
2387 1 2 7
I then get the count first with:
Count_Item = TestFreq.groupby([TestFreq['Id'], TestFreq['Day_name_no'], TestFreq['Hour']]).count().reset_index()
Id Day_name_no Trip_hour Count
1 0 7 24
1 0 8 48
1 0 9 31
1 0 10 28
1 0 11 26
1 0 12 25
Then use groupby and mean:
Mean_Count = Count_Item.groupby(['Id', 'Day_name_no', 'Hour']).mean().reset_index()
However, this does not give the desired result as the mean values are incorrect.
I hope I have explained this issue clearly. I am looking for the mean per hour, per day, per Id, as I plan to do clustering to separate my dataset into groups before applying a predictive model to those groups.
Any help would be appreciated, and if possible an explanation of what I am doing wrong, either in my code or in my approach.
Thanks in advance.
I have edited this to try to make it a little clearer. Writing a question with a lack of sleep is probably not advisable.
A toy dataset that I start with:
Date Id Dow Hour Count
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
04/01/2015 1234 1 11 1
I now realise I would have to use the date first and get something like:
Date Id Dow Hour Count
12/12/2014 1234 0 9 5
19/12/2014 1234 0 9 3
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 4
04/01/2015 1234 1 11 1
And then calculate the mean per Id, per Dow, per hour. And want to get this:
Id Dow Hour Mean
1234 0 9 4
1234 0 10 1
1234 1 11 2.5
I hope this makes it a bit clearer. My real dataset spans 3 years, has 3 million rows, and contains 350 Id numbers.
Your question is not very clear, but I hope this helps:
df.reset_index(inplace=True)
# helper columns with date, hour and dow
df['date'] = df['Start_date'].dt.date
df['hour'] = df['Start_date'].dt.hour
df['dow'] = df['Start_date'].dt.dayofweek
# sum of counts for all combinations
df = df.groupby(['Id', 'date', 'dow', 'hour']).sum()
# take the mean over all dates
df = df.reset_index().groupby(['Id', 'dow', 'hour']).mean()
You can also group by the 'Id' column and then use resample followed by .sum() (the old how='sum' argument has been removed from recent pandas versions).

How to construct a query to include historical data

I am new to SQL and trying to learn by doing some projects. Currently, I have a query that takes a start and end date as parameters and filters by breed_type and race_date. I need the data for this query to include breed_type 1, 2, and 3 for race_date >= 1/1/2021, but only breed_type 1 and 2 for race_date < 1/1/2021.
Sample Data:
race_date   breed_type  sales
12/30/2020  1           20
12/30/2020  2           10
12/30/2020  3           40
12/31/2020  3           10
12/31/2020  2           20
1/1/2021    1           25
1/1/2021    2           20
1/2/2021    1           10
1/2/2021    2           10
1/2/2021    3           20
What I currently have:
SELECT SUM(nvl(sales,0)) sales
FROM results t
WHERE t.race_date BETWEEN '12/30/2020' AND '01/02/2021'
AND breed_type in (1,2,3)
but I want this query to count breed_type 1 and 2 across the whole date range, and include breed_type 3 only for race_date on or after 01/01/2021.
Expected sales should be: 135
Actual sales: 185
Thank you for the help in advance.
Do you mean?
SELECT SUM(sales) AS sales
FROM results
WHERE race_date BETWEEN '12/30/2020' AND '01/02/2021'
  AND (breed_type IN (1, 2)
       OR (breed_type = 3 AND race_date >= '01/01/2021'))
With your sample data this returns 135, matching your expected output. Be aware that your question said > 01/01/2021 for breed_type 3; the >= used here makes no difference for this sample, since there are no breed_type 3 rows on that date, but pick whichever boundary you actually intend.
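Since NVL in your original query hints at Oracle (an assumption on my part), here is a minimal sketch of the same logic with unambiguous DATE literals instead of format-dependent date strings:
-- breed_type 1 and 2 count across the whole range;
-- breed_type 3 only from 2021-01-01 onward
SELECT SUM(NVL(t.sales, 0)) AS sales
FROM results t
WHERE t.race_date BETWEEN DATE '2020-12-30' AND DATE '2021-01-02'
  AND (t.breed_type IN (1, 2)
       OR (t.breed_type = 3 AND t.race_date >= DATE '2021-01-01'));
DATE literals avoid relying on the session's NLS_DATE_FORMAT, which string comparisons like '01/01/2021' silently depend on.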