pandas count groupby 2 attributes

I have a list that looks like this:
Name arrival_date location
Tom 2019-12-12 Hardware store
Tina 2019-12-31 Post office
Tina 2019-12-14 Post office
Tina 2019-11-30 Police station
It has a few thousand entries. The data goes from April 2018 to April 2020.
Now I would like to count the number of arrivals at each location for each month over the 2 years,
so that it basically looks like this:
October 2018
Hardware Store:26
Police Station:13
...
November 2019
Hardware Store:226
Police Station:113
...
What is a good way to do this with pandas?

Use Series.dt.strftime with GroupBy.size to get the counts per combination of both attributes:
#if necessary
#df['arrival_date'] = pd.to_datetime(df['arrival_date'])
#df = df.sort_values('arrival_date')
s = df['arrival_date'].dt.strftime('%B %Y').rename('month-year')
df = df.groupby([s, 'location'], sort=False).size().reset_index(name='count')
print(df)
month-year location count
0 December 2019 Hardware store 1
1 December 2019 Post office 2
2 November 2019 Police station 1
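If you want the months to come out in chronological order without relying on the commented-out sort_values, a minimal alternative sketch (assuming the same df with a datetime arrival_date column) is to group by a monthly period and only format it for display at the end:
import pandas as pd

# group by calendar month; Period values sort chronologically
per = df['arrival_date'].dt.to_period('M').rename('month-year')
out = (df.groupby([per, 'location'])
         .size()
         .reset_index(name='count')
         .sort_values('month-year'))

# format the period as "Month Year" only for display
out['month-year'] = out['month-year'].dt.strftime('%B %Y')
print(out)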

Related

Produce weekly and quarterly stats from a monthly figure

I have a sample of a table as below:
Customer Ref   Bear Rate  Distance  Month       Revenue
ABA-IFNL-001   1000                 01/01/2022  -135
ABA-IFNL-001   1000                 01/02/2022  -135
ABA-IFNL-001   1000                 01/03/2022  -135
ABA-IFNL-001   1000                 01/04/2022  -135
ABA-IFNL-001   1000                 01/05/2022  -135
ABA-IFNL-001   1000                 01/06/2022  -135
I also have a sample of a calendar table as below:
Date        Year  Week  Quarter  WeekDay  Qtr Start   Qtr End     Week Day
04/11/2022  2022  45    4        Fri      30/09/2022  29/12/2022  1
05/11/2022  2022  45    4        Sat      30/09/2022  29/12/2022  2
06/11/2022  2022  45    4        Sun      30/09/2022  29/12/2022  3
07/11/2022  2022  45    4        Mon      30/09/2022  29/12/2022  4
08/11/2022  2022  45    4        Tue      30/09/2022  29/12/2022  5
09/11/2022  2022  45    4        Wed      30/09/2022  29/12/2022  6
10/11/2022  2022  45    4        Thu      30/09/2022  29/12/2022  7
11/11/2022  2022  46    4        Fri      30/09/2022  29/12/2022  1
12/11/2022  2022  46    4        Sat      30/09/2022  29/12/2022  2
13/11/2022  2022  46    4        Sun      30/09/2022  29/12/2022  3
14/11/2022  2022  46    4        Mon      30/09/2022  29/12/2022  4
15/11/2022  2022  46    4        Tue      30/09/2022  29/12/2022  5
16/11/2022  2022  46    4        Wed      30/09/2022  29/12/2022  6
17/11/2022  2022  46    4        Thu      30/09/2022  29/12/2022  7
How can I join/link the tables to report on revenue over weekly and quarterly periods using the calendar table? I can output two tables if needed, e.g.:
Quarter Starting  Quarter  Revenue
31/12/2021        1        500
01/04/2022        2        400
01/07/2022        3        540
30/09/2022        4        540

Week Date Start  Week  Revenue
31/12/2021       41    33.75
07/01/2022       42    33.75
14/01/2022       43    33.75
21/01/2022       44    33.75
I am using Alteryx for this, but I wouldn't mind an explanation of the possible logic in SQL so I can apply it in the system.
Thanks
Before I get into the answer, you're going to have an issue regarding data integrity. All the revenue data is aggregated at a monthly level, whereas your quarters start and end on some day within the month.
For example, Q4 starts September 30th (Friday) and ends Dec. 29th (Thursday). You may have a day or two that bleeds from another month into the quarter, which might throw off the data a bit (especially if there's a large amount of revenue during the days that bleed into a quarter).
Additionally, your revenue is aggregated at a monthly level - unless you have more granular data (weekly, daily would be best), it doesn't make sense to do a weekly calculation since you'll probably just be dividing revenue by 4.
That being said, you'll want to use a cross tab feature in Alteryx to get the data how you want it. But before you do that, you want to aggregate your data at the quarterly level first.
You can do this with an IF statement or some other data cleansing tool (sorry, it's been a while since I used Alteryx). Something like:
# Pseudo code - this won't actually work!
# For determining quarter
if (month) between (30/09/2022,29/12/2022) then 4
where you can derive the logic from your calendar table. Then once you have the quarter, you can join in the Quarter Start date based on your quarter calculation.
Now you have a nice clean table that might look something like this:
Month       Revenue  Quarter  Quarter Start Date
01/01/2022  -135     4        30/09/2022
01/01/2022  -135     4        30/09/2022
Aggregate on your quarter to get a cleaner table:
Quarter Start Date  Quarter  Revenue
30/09/2022          4        300
Then use cross tab, where you pivot on the Quarter start date.
For SQL, you'd be pivoting the data: essentially, taking the value from a row of data and converting it into a column. It will look a bit janky because the data is so customized, but here's a good question that goes over pivoting - Simple way to transpose columns and rows in SQL?
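If it helps to see the logic outside Alteryx, here is a minimal pandas sketch of the same steps (assign the quarter by joining the calendar table, bring in the quarter start date, aggregate, then pivot). The column names and the tiny frames below are assumptions based on the samples above, not a definitive implementation:
import pandas as pd

# illustrative stand-ins for the revenue and calendar tables (names and values assumed)
revenue = pd.DataFrame({'Customer Ref': ['ABA-IFNL-001', 'ABA-IFNL-001'],
                        'Month': pd.to_datetime(['2022-01-01', '2022-01-01']),
                        'Revenue': [-135, -135]})
calendar = pd.DataFrame({'Date': pd.to_datetime(['2022-01-01']),
                         'Quarter': [4],
                         'Qtr Start': pd.to_datetime(['2022-09-30'])})

# join the month onto the calendar to pick up Quarter and Qtr Start
merged = revenue.merge(calendar, left_on='Month', right_on='Date', how='left')

# aggregate revenue per quarter
quarterly = merged.groupby(['Qtr Start', 'Quarter'], as_index=False)['Revenue'].sum()

# "cross tab": pivot so each quarter start date becomes a column
pivoted = quarterly.pivot_table(index='Quarter', columns='Qtr Start', values='Revenue')
print(pivoted)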

Sort SQL by value

I have data like this:
Customer ID  Name       Type  Last Submit
1            Patricio   C     January 2022
2            Dale       A     June 2022
3            Yvonne     C     July 2022
4            Pawe       C     June 2022
5            Sergio     B     August 2022
6            Roland     C     August 2022
7            Georg      D     November 2022
8            Catherine  D     October 2022
9            Pascale    E     October 2022
10           Irene      A     November 2022
How can I sort so that type A comes out first, in the order A, B, C, D, E, F, with the most recent Last Submit at the top within each type?
The example output:
Customer ID  Name       Type  Last Submit
10           Irene      A     November 2022
2            Dale       A     June 2022
5            Sergio     B     August 2022
6            Roland     C     August 2022
3            Yvonne     C     July 2022
4            Pawe       C     June 2022
1            Patricio   C     January 2022
7            Georg      D     November 2022
8            Catherine  D     October 2022
9            Pascale    E     October 2022
So basically you want to sort by 2 different columns; this is detailed in this other answer: SQL multiple column ordering
In your example you would do
ORDER BY type, last_submit DESC
Hi, you can use a simple ORDER BY in PostgreSQL, like this:
SELECT
*
FROM
your_table  -- your table name
ORDER BY
type ASC, last_submit DESC;
In this case, you need to sort your query results by the two columns, in that order.
Add this to the end of your query:
ORDER BY type, last_submit DESC;
Check out this question "SQL multiple column ordering"

pandas groupby and filling in missing frequencies

I have a dataset of events, each of which occurred on a specific day. Using pandas I have been able to aggregate these into a count of events per month using the groupby function, and then plot a graph with Matplotlib. However, in the original dataset some months do not have any events, so there is no count of events for such a month. Those months therefore do not appear on the graph, but I would like to include them somehow with their zero count.
bpm2 = df2_yr1.groupby(['month_year', 'month'])['event_no'].count()
which produces
month_year month
2016-01 January 9
2016-02 February 7
2016-04 April 1
2016-06 June 4
2016-07 July 1
2016-08 August 3
2016-09 September 2
2016-10 October 5
2016-11 November 17
2016-12 December 3
I have been trying to find a way of filling the missing months in the dataframe generated by the groupby function with a 'count' value of 0 for, in this example, March and May.
Can anyone offer some advice on how this might be achieved? I have been trying to carry out ffill on the month column but with little success, and can't work out how to add in a corresponding zero value for the missing months.
First of all, if bpm2 = df2_yr1.groupby(['month_year', 'month'])['event_no'].count() is your code, then it is a series. So, let's change it to a dataframe with bpm2 = df2_yr1.groupby(['month_year', 'month'])['event_no'].count().reset_index(). Now, into the problem.
Convert month_year to a date format and use pd.Grouper, then change it back to string format. Also add back the month column and fix the formatting of the event_no column:
bpm2 = df2_yr1.groupby(['month_year', 'month'])['event_no'].count().reset_index()
bpm2['month_year'] = bpm2['month_year'].astype(str)
bpm2['month_year'] = pd.to_datetime(bpm2['month_year'])
bpm2 = bpm2.groupby([pd.Grouper(key='month_year', freq='1M')])['event_no'].first().fillna(0).astype(int).reset_index()
bpm2['month'] = bpm2['month_year'].dt.strftime('%B')
bpm2['month_year'] = bpm2['month_year'].dt.strftime('%Y-%m')
bpm2
output:
month_year event_no month
0 2016-01 9 January
1 2016-02 7 February
2 2016-03 0 March
3 2016-04 1 April
4 2016-05 0 May
5 2016-06 4 June
6 2016-07 1 July
7 2016-08 3 August
8 2016-09 2 September
9 2016-10 5 October
10 2016-11 17 November
11 2016-12 3 December
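An alternative sketch, if you prefer to avoid the second groupby: convert month_year to a monthly period and reindex against the full range of months (this assumes the same bpm2 dataframe after reset_index, with columns month_year, month and event_no):
import pandas as pd

# month_year values like '2016-01' become monthly periods
bpm2['month_year'] = pd.PeriodIndex(bpm2['month_year'], freq='M')

# every month between the first and last observed month
full_range = pd.period_range(bpm2['month_year'].min(), bpm2['month_year'].max(), freq='M')

bpm2 = (bpm2.set_index('month_year')
            .reindex(full_range, fill_value=0)   # missing months get a count of 0
            .rename_axis('month_year')
            .reset_index())

# rebuild the month name and the display format
bpm2['month'] = bpm2['month_year'].dt.strftime('%B')
bpm2['month_year'] = bpm2['month_year'].dt.strftime('%Y-%m')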

Compare Cumulative Sales per Year-End

Using this sample dataframe:
import numpy as np
import pandas as pd

np.random.seed(1111)
df = pd.DataFrame({
    'Category': np.random.choice(['Group A', 'Group B', 'Group C', 'Group D'], 10000),
    'Sub-Category': np.random.choice(['X', 'Y', 'Z'], 10000),
    'Sub-Category-2': np.random.choice(['G', 'F', 'I'], 10000),
    'Product': np.random.choice(['Product 1', 'Product 2', 'Product 3'], 10000),
    'Units_Sold': np.random.randint(1, 100, size=10000),
    'Dollars_Sold': np.random.randint(100, 1000, size=10000),
    'Customer': np.random.choice(pd.util.testing.rands_array(10, 25, dtype='str'), 10000),
    'Date': np.random.choice(pd.date_range('1/1/2016', '12/31/2020', freq='M'), 10000)})
I am trying to compare 12 month time frames with seaborn plots for a sub-grouping of category. For example, I'd like to compare the cumulative 12 months for each year ending 4-30 vs. the same time period for each year. I cannot wrap my head around how to get a running total of data for each respective year (5/1/17-4/30/18, 5/1/18-4/30/19, 5/1/19-4/30/20). The dates are just examples - I'd like to be able to compare different year-end data points, even better would be able to compare 365 days. For instance, I'd love to compare 3/15/19-3/14/20 to 3/15/18-3/14/19, etc.
I envision a graph for each 'Category' (A,B,C,D) with lines for each respective year representing the running total starting with zero on May 1, building through April 30 of the next year. The x axis would be the month (starting with May 1) & y axis would be 'Units_Sold' as it grows.
Any help would be greatly appreciated!
One way is to convert the date to fiscal quarters and extract the fiscal year:
df = pd.DataFrame({'Date': pd.date_range('2019-01-01', '2019-12-31', freq='M'),
                   'Values': np.arange(12)})
df['fiscal_year'] = df.Date.dt.to_period('Q-APR').dt.qyear
Output:
Date Values fiscal_year
0 2019-01-31 0 2019
1 2019-02-28 1 2019
2 2019-03-31 2 2019
3 2019-04-30 3 2019
4 2019-05-31 4 2020
5 2019-06-30 5 2020
6 2019-07-31 6 2020
7 2019-08-31 7 2020
8 2019-09-30 8 2020
9 2019-10-31 9 2020
10 2019-11-30 10 2020
11 2019-12-31 11 2020
And now you can group by fiscal_year to your heart's content.
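To get the running totals the question describes (building from May 1 through April 30), one minimal sketch on the question's df could look like the following; the column name cumulative_units is introduced here only for illustration:
import pandas as pd

# tag each row with its fiscal year (ending in April), as above
df['fiscal_year'] = df['Date'].dt.to_period('Q-APR').dt.qyear

# month-end totals per Category and fiscal year, then a running total within each year
monthly = (df.groupby(['Category', 'fiscal_year', pd.Grouper(key='Date', freq='M')])['Units_Sold']
             .sum()
             .reset_index())
monthly['cumulative_units'] = monthly.groupby(['Category', 'fiscal_year'])['Units_Sold'].cumsum()
print(monthly.head())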

MS Access selecting by year intervals

I have a table where every row has its own date (year of purchase). I need to select the purchases grouped into year intervals.
Example:
Zetor 1993
Zetor 1993
JOHN DEERE 2001
JOHN DEERE 2001
JOHN DEERE 2001
That means I have 2 Zetor purchases in 1993 and 3 John Deere purchases in 2001. I need to select the count of the purchases grouped into these year intervals:
<=1959
1960-1969
1970-1979
1980-1989
1990-1994
1995-1999
2000-2004
2004-2009
2010-2013
I have no idea how I should do this.
The result should look like this on the example above:
<=1959 0
1960-1969 0
1970-1979 0
1980-1989 0
1990-1994 2
1995-1999 0
2000-2004 3
2004-2009 0
2010-2013 0
Create a table with the intervals:
tblRanges([RangeName],[Begins],[Ends])
Populate it with your intervals.
Use GROUP BY with your table tblPurchases([Item],[YearOfDeal]):
SELECT tblRanges.RangeName, Count(tblPurchases.YearOfDeal)
FROM tblRanges INNER JOIN tblPurchases ON (tblRanges.Begins <= tblPurchases.YearOfDeal) AND (tblRanges.Ends >= tblPurchases.YearOfDeal)
GROUP BY tblRanges.RangeName;
You may wish to consider Partition for future use:
SELECT Partition([Year],1960,2014,10) AS [Group], Count(Stock.Year) AS CountOfYear
FROM Stock
GROUP BY Partition([Year],1960,2014,10)
Input:
Tractor     Year
Zetor       1993
Zetor       1993
JOHN DEERE  2001
JOHN DEERE  2001
JOHN DEERE  2001
Pre 59      1945
1960        1960
Result:
Group CountOfYear
:1959 1
1960:1969 1
1990:1999 2
2000:2009 3
Reference: http://office.microsoft.com/en-ie/access-help/partition-function-HA001228892.aspx