How to group by date and count unique values in each group with pandas

How to group by date and count unique values in each group with pandas?
Count the number of unique MAC addresses each day:
pd.concat([df[['date','Client MAC']], df8[['date',"MAC address"]].rename(columns={"MAC address":"Client MAC"})]).groupby(["date"])
One of the columns, with example data:
Association Time
Mon May 14 19:41:20 HKT 2018
Mon May 14 19:43:22 HKT 2018
Tue May 15 09:24:57 HKT 2018
Mon May 14 19:53:33 HKT 2018
I use:
starttime = datetime.datetime.now()
dff4 = (df4[['Association Time', 'Client MAC Address']]
        .groupby(pd.to_datetime(df4["Association Time"]).dt.date
                     .apply(lambda x: dt.datetime.strftime(x, '%Y-%m-%d')))
        .nunique())
print(datetime.datetime.now() - starttime)
It runs for 2 minutes, but it also counts Association Time, which is wrong;
there is no need to count Association Time.
Association Time Client MAC Address
Association Time
2017-06-21 1 3
2018-02-21 2 8
2018-02-27 1 1
2018-03-07 3 3

I believe you need to add ['Client MAC'].nunique():
df = (pd.concat([df[['date','Client MAC']],
df8[['date',"MAC address"]].rename(columns={"MAC address":"Client MAC"})])
.groupby(["date"])['Client MAC']
.nunique())
If dates are datetimes:
df = (pd.concat([df[['date','Client MAC']],
df8[['date',"MAC address"]].rename(columns={"MAC address":"Client MAC"})]))
df = df['Client MAC'].groupby(df["date"].dt.date).nunique()
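For reference, a minimal self-contained sketch of the same idea, using made-up sample data (the column names mirror the question):

import pandas as pd

# made-up sample data for illustration
df = pd.DataFrame({'date': ['2018-05-14', '2018-05-14', '2018-05-15', '2018-05-14'],
                   'Client MAC': ['aa:bb', 'aa:bb', 'cc:dd', 'ee:ff']})

# count distinct MAC addresses per day
print(df.groupby('date')['Client MAC'].nunique())
# date
# 2018-05-14    2
# 2018-05-15    1
# Name: Client MAC, dtype: int64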

Related

How do you get the last entry for each month in SQL?

I am looking to filter very large tables to the latest entry per user per month. I'm not sure if I found the best way to do this. I know I "should" trust the SQL engine (Snowflake), but there is a part of me that does not like the join on three columns.
Note that this is a very common operation on many big tables, and I want to use it in DBT views which means it will get run all the time.
To illustrate, my data is of this form:
mytable
userId  loginDate   year  month  value
1       2021-01-04  2021  1      41.1
1       2021-01-06  2021  1      411.1
1       2021-01-25  2021  1      251.1
2       2021-01-05  2021  1      4369
2       2021-02-06  2021  2      32
2       2021-02-14  2021  2      731
3       2021-01-20  2021  1      258
3       2021-02-19  2021  2      4251
3       2021-03-15  2021  3      171
And I'm trying to use SQL to get the last value (by loginDate) for each month.
I'm currently doing a groupby & a join as follows:
WITH latest_entry_by_month AS (
    SELECT "userId", "year", "month", max("loginDate") AS "loginDate"
    FROM mytable
    GROUP BY "userId", "year", "month"
)
SELECT * FROM mytable NATURAL JOIN latest_entry_by_month
The above results in my desired output:
userId  loginDate   year  month  value
1       2021-01-25  2021  1      251.1
2       2021-01-05  2021  1      4369
2       2021-02-14  2021  2      731
3       2021-01-20  2021  1      258
3       2021-02-19  2021  2      4251
3       2021-03-15  2021  3      171
But I'm not sure if it's optimal.
Any guidance on how to do this faster? Note that I am not materializing the underlying data, so it is effectively un-clustered (I'm getting it from a vendor via the Snowflake marketplace).
Using QUALIFY and the ROW_NUMBER window function:
SELECT *
FROM mytable
QUALIFY ROW_NUMBER() OVER(PARTITION BY userId, year, month
ORDER BY loginDate DESC) = 1
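For anyone doing the same thing on the pandas side instead, a rough equivalent of the ROW_NUMBER trick is to sort and keep the last row per group; a minimal sketch using an abbreviated version of the table above (userId 2 only):

import pandas as pd

# abbreviated copy of the question's table, for illustration
df = pd.DataFrame({'userId': [2, 2, 2],
                   'loginDate': pd.to_datetime(['2021-01-05', '2021-02-06', '2021-02-14']),
                   'year': [2021, 2021, 2021],
                   'month': [1, 2, 2],
                   'value': [4369, 32, 731]})

# keep the row with the latest loginDate per userId/year/month,
# mirroring ROW_NUMBER() OVER (PARTITION BY ... ORDER BY loginDate DESC) = 1
latest = df.sort_values('loginDate').groupby(['userId', 'year', 'month']).tail(1)
print(latest)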

Pandas Convert Year/Month Int Columns to Datetime and Quarter Average

I have data in a df that is separated into year and month columns, and I'm trying to find the average of the observed data columns. I cannot find how to convert the 'year' and 'month' columns to datetime and then find the Q1, Q2, Q3, etc. averages.
year month data
0 2021 1 7.100427005789888
1 2021 2 7.22523237179488
2 2021 3 8.301528122415217
3 2021 4 6.843885683760697
4 2021 5 6.12365177832918
5 2021 6 6.049659188034206
6 2021 7 5.271174524400343
7 2021 8 5.098493589743587
8 2021 9 6.260155982906011
I need the final data to look like -
year Quarter Q data
2021 1 7.542395833
2021 2 6.33906555
2021 3 5.543274699
I've tried variations of this to change the 'year' and 'month' columns to datetime, but it gives dates starting with year 1970:
df.iloc[:, 1:2] = df.iloc[:, 1:2].apply(pd.to_datetime)
year month wind_speed_ms
0 2021 1970-01-01 00:00:00.000000001 7.100427
1 2021 1970-01-01 00:00:00.000000002 7.225232
2 2021 1970-01-01 00:00:00.000000003 8.301528
3 2021 1970-01-01 00:00:00.000000004 6.843886
4 2021 1970-01-01 00:00:00.000000005 6.123652
5 2021 1970-01-01 00:00:00.000000006 6.049659
6 2021 1970-01-01 00:00:00.000000007 5.271175
7 2021 1970-01-01 00:00:00.000000008 5.098494
8 2021 1970-01-01 00:00:00.000000009 6.260156
Thank you,
I hope this will work for you:
# create a period column combining the year and month columns
df["period"] = (df.apply(lambda x: f"{int(x.year)}-{int(x.month)}", axis=1)
                  .apply(pd.to_datetime)
                  .dt.to_period('Q'))
# group by period
df = df.groupby("period").mean().reset_index()
df["Quarter"] = df.period.astype(str).str[-2:]
df = df[["year", "Quarter", "data"]]
df.rename(columns={"data": "Q data"})
year Quarter Q data
0 2021.0 Q1 7.542396
1 2021.0 Q2 6.339066
2 2021.0 Q3 5.543275
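A variant sketch of the same conversion that avoids string formatting: pd.to_datetime accepts a frame with year/month/day columns, so the quarter can be derived directly from the integer columns (abbreviated data from the question; the Q1/Q2 means come out as 7.542396 and 6.339066, matching the output above):

import pandas as pd

# abbreviated frame from the question
df = pd.DataFrame({'year': [2021] * 6,
                   'month': [1, 2, 3, 4, 5, 6],
                   'data': [7.100427, 7.225232, 8.301528, 6.843886, 6.123652, 6.049659]})

# build real timestamps from the integer year/month columns (day=1 is a placeholder)
dates = pd.to_datetime(df[['year', 'month']].assign(day=1))

out = (df.assign(Quarter=dates.dt.quarter)
         .groupby(['year', 'Quarter'], as_index=False)['data']
         .mean()
         .rename(columns={'data': 'Q data'}))
print(out)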

Compare Cumulative Sales per Year-End

Using this sample dataframe:
import numpy as np
import pandas as pd

np.random.seed(1111)
df = pd.DataFrame({
    'Category': np.random.choice(['Group A', 'Group B', 'Group C', 'Group D'], 10000),
    'Sub-Category': np.random.choice(['X', 'Y', 'Z'], 10000),
    'Sub-Category-2': np.random.choice(['G', 'F', 'I'], 10000),
    'Product': np.random.choice(['Product 1', 'Product 2', 'Product 3'], 10000),
    'Units_Sold': np.random.randint(1, 100, size=10000),
    'Dollars_Sold': np.random.randint(100, 1000, size=10000),
    'Customer': np.random.choice(pd.util.testing.rands_array(10, 25, dtype='str'), 10000),
    'Date': np.random.choice(pd.date_range('1/1/2016', '12/31/2020', freq='M'), 10000)})
I am trying to compare 12 month time frames with seaborn plots for a sub-grouping of category. For example, I'd like to compare the cumulative 12 months for each year ending 4-30 vs. the same time period for each year. I cannot wrap my head around how to get a running total of data for each respective year (5/1/17-4/30/18, 5/1/18-4/30/19, 5/1/19-4/30/20). The dates are just examples - I'd like to be able to compare different year-end data points, even better would be able to compare 365 days. For instance, I'd love to compare 3/15/19-3/14/20 to 3/15/18-3/14/19, etc.
I envision a graph for each 'Category' (A,B,C,D) with lines for each respective year representing the running total starting with zero on May 1, building through April 30 of the next year. The x axis would be the month (starting with May 1) & y axis would be 'Units_Sold' as it grows.
Any help would be greatly appreciated!
One way to convert the date to fiscal quarters and extract the fiscal year:
df = pd.DataFrame({'Date':pd.date_range('2019-01-01', '2019-12-31', freq='M'),
'Values':np.arange(12)})
df['fiscal_year'] = df.Date.dt.to_period('Q-APR').dt.qyear
Output:
Date Values fiscal_year
0 2019-01-31 0 2019
1 2019-02-28 1 2019
2 2019-03-31 2 2019
3 2019-04-30 3 2019
4 2019-05-31 4 2020
5 2019-06-30 5 2020
6 2019-07-31 6 2020
7 2019-08-31 7 2020
8 2019-09-30 8 2020
9 2019-10-31 9 2020
10 2019-11-30 10 2020
11 2019-12-31 11 2020
And now you can group by fiscal_year to your heart's content.
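Building on that, a minimal sketch of the running total described in the question (cumulative Units_Sold per Category within each May-to-April fiscal year), reusing the sample df and imports from the question:

# tag each row with its fiscal year ending in April
df['fiscal_year'] = df['Date'].dt.to_period('Q-APR').dt.qyear

# monthly totals per Category and fiscal year, then a running total within each year
monthly = df.groupby(['Category', 'fiscal_year', df['Date'].dt.to_period('M')])['Units_Sold'].sum()
running = monthly.groupby(level=['Category', 'fiscal_year']).cumsum().reset_index()
# 'running' can then be plotted, e.g. one seaborn line per fiscal_year for each Category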

SSRS: Horizontal alignment on a group

On my dataset I select information from four different years sorted by date and how many subscriptions I had on said date, which looks something like this:
Date Year Subs Day
15/09/2014 2015 57 1
16/09/2014 2015 18 2
17/09/2014 2015 16 3
14/09/2015 2016 10 1
15/09/2015 2016 45 2
16/09/2015 2016 28 3
12/09/2016 2017 32 1
13/09/2016 2017 11 2
14/09/2016 2017 68 3
24/08/2017 2018 23 1
25/08/2017 2018 53 2
26/08/2017 2018 13 3
What I'm trying to do is create a 'Year' column group to align them horizontally, but when I do that, this is the result:
[screenshot: actual result]
Expected result:
[screenshot: expected result]
Is this achievable in SSRS? I've tried removing the group =(Details), which gives me the desired result, except it only returns one line of information.
Any insight appreciated.
By default, the Details group causes you to get one row per row in the dataset. In your case, I would suggest grouping the Rows by the Day column and create a column group by Year.
First, create the two groups and add columns inside the column group.
Then, add a row outside and above the Day row group. Place the headings here and then delete the top row. It should look like this:
[screenshot: report design]
Now these 4 columns will repeat to the right for each year and you will get rows based on the number of days in your dataset.

How to perform multiple table calculation with joins and group by

I have two tables client and grouping. They look like this:
Client
C_id  C_grouping_id  Month  Profit

Grouping
Grouping_id  Month  Profit
The client table contains monthly profit for every client and every client belongs to a specific grouping scheme specified by C_grouping_id.
The grouping table contains all the groups and their monthly profits.
I'm struggling with a query that essentially calculates the monthly residual for every subscriber:
Residual = (subscriber monthly profit - grouping monthly profit) * (average subscriber monthly profit over all months / average monthly profit over all months for the grouping the subscriber belongs to)
I have come up with the following query so far but the results seem to be incorrect:
SELECT client.C_id, client.C_grouping_Id, client.Month,
((client.Profit - grouping.profit) * (avg(client.Profit)/avg(grouping.profit))) as "residual"
FROM client
INNER JOIN grouping
ON "C_grouping_id"="Grouping_id"
group by client.C_id, client.C_grouping_Id,client.Month, grouping.profit
I would appreciate it if someone can shed some light on what I'm doing wrong and how to correct it.
EDIT: Adding sample data and desired results
Client
C_id C_grouping_id Month Profit
001 aaa jul 10$
001 aaa aug 12$
001 aaa sep 8$
016 abc jan 25$
016 abc feb 21$
Grouping
Grouping_id Month Profit
aaa Jul 30$
aaa aug 50$
aaa Sep 15$
abc Jan 21$
abc Feb 27$
Query Result:
C_ID C_grouping_id Month Residual
001 aaa Jul (10-30)*(10/31.3)=-6.38
... and so on for every month for every client.
This can be done in a pretty straightforward way.
The main difficulty is obviously that you are dealing with different levels of aggregation at once (the average of the group and of the client, as well as the current record).
This is rather difficult/clumsy with plain SELECT ... GROUP BY SQL.
But with analytic functions, aka window functions, this is very easy.
Start with combining the tables and calculating the base numbers:
select c.c_id          as client_id,
       c.c_grouping_id as grouping_id,
       c.month,
       c.profit as client_profit,
       g.profit as group_profit,
       avg(c.profit) over (partition by c.c_id)        as avg_client_profit,
       avg(g.profit) over (partition by g.grouping_id) as avg_group_profit
from client c
inner join grouping g
  on c."C_GROUPING_ID" = g."GROUPING_ID"
 and c."MONTH" = g."MONTH";
With this you already get the average profits by client and by grouping_id.
Be aware that I changed the data type of the currency column to DECIMAL(10,3), as a VARCHAR with a $ sign in it is just hard to convert.
I also fixed the MONTH data, as the test data contained different upper/lower-case spellings which prevented the join from working.
Finally, I turned all column names into upper case in order to make typing easier.
Anyhow, running this provides you with the following result set:
CLIENT_ID GROUPING_ID MONTH CLIENT_PROFIT GROUP_PROFIT AVG_CLIENT_PROFIT AVG_GROUP_PROFIT
16 abc JAN 25 21 23 24
16 abc FEB 21 27 23 24
1 aaa JUL 10 30 10 31.666
1 aaa AUG 12 50 10 31.666
1 aaa SEP 8 15 10 31.666
From here it's only one step further to the residual calculation.
You can either put this SQL into a view to make it reusable for other queries or use it as an inline view.
I chose to use it as a common table expression (CTE) aka WITH clause because it's nice and easy to read:
with p as
  (select c.c_id          as client_id,
          c.c_grouping_id as grouping_id,
          c.month,
          c.profit as client_profit,
          g.profit as group_profit,
          avg(c.profit) over (partition by c.c_id)        as avg_client_profit,
          avg(g.profit) over (partition by g.grouping_id) as avg_group_profit
   from client c
   inner join grouping g
     on c."C_GROUPING_ID" = g."GROUPING_ID"
    and c."MONTH" = g."MONTH")
select client_id, grouping_id, month,
       client_profit, group_profit,
       avg_client_profit, avg_group_profit,
       round((client_profit - group_profit)
             * (avg_client_profit / avg_group_profit), 2) as residual
from p
order by grouping_id, month, client_id;
Notice how easy the whole statement is to read and how straightforward the residual calculation is.
The result is then this:
CLIENT_ID GROUPING_ID MONTH CLIENT_PROFIT GROUP_PROFIT AVG_CLIENT_PROFIT AVG_GROUP_PROFIT RESIDUAL
1 aaa AUG 12 50 10 31.666 -12
1 aaa JUL 10 30 10 31.666 -6.32
1 aaa SEP 8 15 10 31.666 -2.21
16 abc FEB 21 27 23 24 -5.75
16 abc JAN 25 21 23 24 3.83
Cheers,
Lars
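As an aside, for anyone working in pandas rather than SQL, the same mix of row-level values and per-group averages can be sketched with a merge plus groupby/transform, using the question's sample data with numeric profits and consistent month spellings:

import pandas as pd

# the question's sample data, cleaned up for illustration
client = pd.DataFrame({'C_id': ['001', '001', '001', '016', '016'],
                       'C_grouping_id': ['aaa', 'aaa', 'aaa', 'abc', 'abc'],
                       'Month': ['jul', 'aug', 'sep', 'jan', 'feb'],
                       'Profit': [10, 12, 8, 25, 21]})
grouping = pd.DataFrame({'Grouping_id': ['aaa', 'aaa', 'aaa', 'abc', 'abc'],
                         'Month': ['jul', 'aug', 'sep', 'jan', 'feb'],
                         'Profit': [30, 50, 15, 21, 27]})

# join each client row to its group's profit for the same month
merged = client.merge(grouping, left_on=['C_grouping_id', 'Month'],
                      right_on=['Grouping_id', 'Month'],
                      suffixes=('_client', '_group'))

# per-client and per-group averages broadcast back to every row (like AVG ... OVER)
avg_client = merged.groupby('C_id')['Profit_client'].transform('mean')
avg_group = merged.groupby('Grouping_id')['Profit_group'].transform('mean')

merged['residual'] = ((merged['Profit_client'] - merged['Profit_group'])
                      * (avg_client / avg_group)).round(2)
print(merged[['C_id', 'C_grouping_id', 'Month', 'residual']])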