How to groupby external time series data together - pandas

I have been trying to group some state data together. This is what my data looks like, for example, with Date as the index and the rest as features:
Date        Population  Num_Men  Num_Women  State  Region
2020-01-01  500         300      200        NY     North
2020-02-01  800         500      300        GL     Middle
2020-02-01  1000        400      600        ""     Middle
2020-02-01  200         50       150        nan    Middle
2020-02-01  600         400      200        NY     North
I know how to group the NY rows, but I'm not sure how to group together the rows whose State values are GL, "", and nan.
I was looking for the final result to look like:
Date        Population  Num_Men  Num_Women  State  Region
2020-01-01  500         300      200        NY     North
2020-02-01  2000        950      1050       GL     Middle
2020-02-01  600         400      200        NY     North
I tried something like df.groupby(df.index, {'State': ["GL", "", np.nan]}), but that didn't work. Any help would be appreciated! Thanks!

Let us do replace, then groupby with sum and first:
# normalise the placeholder State values to NaN
df.State = df.State.replace({'': np.nan, "''": np.nan, 'nan': np.nan})
out = df.groupby(['Region', 'Date'], as_index=False).agg(
    {'Population': 'sum',
     'Num_Men': 'sum',
     'Num_Women': 'sum',
     'State': 'first'})
Out[99]:
Region Date Population Num_Men Num_Women State
0 Middle 2020-02-01 2000 950 1050 GL
1 North 2020-01-01 500 300 200 NY
2 North 2020-02-01 600 400 200 NY
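An alternative sketch, if you would rather target the State values directly instead of leaning on Region: map GL, "", and nan to one shared label, then aggregate on Date + State. How the blank/NaN cells are actually stored is an assumption here.

import numpy as np

# Collapse the placeholder State markers into one label (GL is assumed
# to be the label you want to keep), then group on Date + State.
df['State'] = df['State'].replace({'': 'GL', "''": 'GL', 'nan': 'GL', np.nan: 'GL'})
out = (df.groupby(['Date', 'State'], as_index=False)
         .agg({'Population': 'sum', 'Num_Men': 'sum',
               'Num_Women': 'sum', 'Region': 'first'}))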

Related

How to find the right budget value for the activities in Tableau (Relationship)

I have related two custom SQL queries in Tableau (by creating a relationship).
The output of the queries looks like:
Q1 (shows the start date of the valid budget; if a user has multiple rows in this table, their budget has been updated with a new amount):
id_user  budgete_start_date  budget_amount
1234     06-11-2021          120
1234     06-07-2022          200
56789    06-01-2022          1200
56789    06-07-2022          2000
643      05-05-2022          30
Q2 (shows the budget usage):
id_user  budgete_usage_date  amount_usage
1234     01-12-2021          50
1234     05-08-2022          100
56789    10-02-2022          60
56789    07-08-2022          500
643      05-07-2022          17
I need to find a way to create the following view, to know what the valid budget was at each budgete_usage_date:
id_user  budgete_usage_date  amount_usage  valid budget
1234     01-12-2021          50            120
1234     05-08-2022          100           200
56789    10-02-2022          60            1200
56789    07-08-2022          500           2000
643      05-07-2022          17            30
How can I do that with a calculated field in Tableau (with the data model made by the relationship)?
If that's not possible, how can I do it directly in a query (changing the relationship to a single query)?
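No Tableau answer is recorded here, but the lookup being described (the latest budget whose start date falls on or before each usage date) is an as-of join, so here is a minimal pandas sketch of that logic, assuming the two query outputs are exported as DataFrames and the dates are day-first strings:

import pandas as pd

# The two query outputs from the question, reconstructed by hand.
budgets = pd.DataFrame({
    'id_user': [1234, 1234, 56789, 56789, 643],
    'budgete_start_date': ['06-11-2021', '06-07-2022', '06-01-2022',
                           '06-07-2022', '05-05-2022'],
    'budget_amount': [120, 200, 1200, 2000, 30]})
usage = pd.DataFrame({
    'id_user': [1234, 1234, 56789, 56789, 643],
    'budgete_usage_date': ['01-12-2021', '05-08-2022', '10-02-2022',
                           '07-08-2022', '05-07-2022'],
    'amount_usage': [50, 100, 60, 500, 17]})

budgets['budgete_start_date'] = pd.to_datetime(budgets['budgete_start_date'], dayfirst=True)
usage['budgete_usage_date'] = pd.to_datetime(usage['budgete_usage_date'], dayfirst=True)

# For each usage row, merge_asof picks the latest budget row (per user)
# whose start date is on or before the usage date; both frames must be
# sorted on their date keys.
out = pd.merge_asof(usage.sort_values('budgete_usage_date'),
                    budgets.sort_values('budgete_start_date'),
                    left_on='budgete_usage_date',
                    right_on='budgete_start_date',
                    by='id_user',
                    direction='backward').rename(columns={'budget_amount': 'valid budget'})

In SQL, the same backward-looking lookup is usually written as a correlated subquery that selects the budget row with the greatest start date not exceeding the usage date.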

Quicksight Calculated field: sum of average?

The dataset I have is currently like so:
country  itemid  device    num_purchases  total_views_per_country_and_day  day
USA      ABC     iPhone11  2              900                              2022-06-15
USA      ABC     iPhoneX   5              900                              2022-06-15
USA      DEF     iPhoneX   8              900                              2022-06-15
UK       ABC     iPhone11  10             350                              2022-06-15
UK       DEF     iPhone11  20             350                              2022-06-15
total_views_per_country_and_day is already pre-calculated to be the sum grouped by country and day. That is why for each country-day pair, the number is the same.
I have a Quicksight analysis with a filter for day.
The first thing I want is to have a table on my dashboard that shows the number of total views for each country.
However, if I were to do it with the dataset just like that, the table would sum everything:
country  total_views
USA      900+900+900 = 2700
UK       350+350 = 700
So what I did was create a calculated field that is the average of total_views. That worked, but only if my day filter on the dashboard was for ONE day.
When filtered for day = 2022-06-15 (correct):

country  avg(total_views)
USA      2700/3 = 900
UK       700/2 = 350
But if we have data from 2022-06-16 as well, the averaging method doesn't work, because it averages over the entire dataset. So, an example dataset with two days:
country  itemid  device    num_purchases  total_views_per_country_and_day  day
USA      ABC     iPhone11  2              900                              2022-06-15
USA      ABC     iPhoneX   5              900                              2022-06-15
USA      DEF     iPhoneX   8              900                              2022-06-15
UK       ABC     iPhone11  10             350                              2022-06-15
UK       DEF     iPhone11  20             350                              2022-06-15
USA      ABC     iPhone11  2              1000                             2022-06-16
USA      ABC     iPhoneX   5              1000                             2022-06-16
UK       ABC     iPhone11  10             500                              2022-06-16
UK       DEF     iPhone11  20             500                              2022-06-16
Desired table visualization:

country  total_views
USA      900 + 1000 = 1900
UK       350 + 500 = 850
USA calculation: (900 * 3)/3 + (1000 * 2)/2 = 900 + 1000
UK calculation: (350 * 2)/2 + (500 * 2)/2 = 350 + 500
Basically, a sum of averages.
However, instead it is calculated like:

country  avg(total_views)
USA      [(900 * 3) + (1000 * 2)] / 5 = 940
UK       [(350 * 2) + (500 * 2)] / 4 = 425
I want to be able to use this calculation later on as well to calculate num_purchases / total_views. So ideally I would want it to be a calculated field. Is there a formula that can do this?
I also tried, instead of a calculated field, just aggregating total_views by average instead of sum in the analysis. Exact same issue, but I could actually keep a running total if I include day in the table visualization, e.g.:
country  day         running total of avg(total_views)
USA      2022-06-15  900
USA      2022-06-16  900+1000 = 1900
UK       2022-06-15  350
UK       2022-06-16  350+500 = 850
So you can see that the totals (2nd and 4th rows) are my desired values. However, this is not exactly what I want: I don't want to have to add the day into the table to get it right.
I've tried avgOver with day as a partition; that also requires you to have day in the table visualization.
sum({total_views_per_country_and_day}) / distinct_count( {day})
Basically, your average is calculated as the sum of the metric divided by the number of unique days. The above should help.
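For reference, the "sum of per-day averages" being asked for can be sanity-checked outside Quicksight; a pandas sketch, assuming the two-day table above is loaded as df:

import pandas as pd

# Within each (country, day) group the pre-aggregated views column is
# constant, so its mean recovers the single daily value; summing those
# daily values per country gives the desired total.
daily = df.groupby(['country', 'day'])['total_views_per_country_and_day'].mean()
total_views = daily.groupby('country').sum()
# USA: 900 + 1000 = 1900, UK: 350 + 500 = 850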

Merge Dataframes based on values from different datasets

I have the following data frames:
print(df)
id_code  turnover  costs
001      100       200
002      100       200
003      100       200
004      100       200
print(df_db)
Description  Code1  Code2  ...  CodeN
Retail       001    002    ...  nan
Wholesale    003    nan    ...  nan
Supply       004    nan    ...  nan
And I would like to create the following final_df, adding a column representing the description in df_db; basically, if the id_code is present in a row of df_db, merge the values:
print(final_df)
id_code  turnover  costs  Description
001      100       200    Retail
002      100       200    Retail
003      100       200    Wholesale
004      100       200    Supply
I tried with pd.pivot, but it does not produce the desired result. How can I obtain final_df?
Use DataFrame.melt + Series.map if there are no duplicate codes in df_db:
mapper = df_db.melt('Description').set_index('value')['Description']
df['Description'] = df['id_code'].map(mapper)
print(df)
id_code turnover costs Description
0 1 100 200 Retail
1 2 100 200 Retail
2 3 100 200 Wholesale
3 4 100 200 Supply
Detail:
print(mapper)
value
1 Retail
3 Wholesale
4 Supply
2 Retail
5 Wholesale
6 Supply
Name: Description, dtype: object
Alternatively, we can use melt before merge:
final_df = (df.merge(df_db.melt('Description').drop('variable', axis=1),
                     left_on='id_code', right_on='value')
              .drop('value', axis=1))
Out[157]:
id_code turnover costs Description
0 1 100 200 Retail
1 2 100 200 Retail
2 3 100 200 Wholesale
3 4 100 200 Supply
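To reproduce either answer end to end, the inputs can be built like this (a sketch; reading id_code as strings preserves the leading zeros, which the integer-typed outputs above dropped):

import numpy as np
import pandas as pd

df = pd.DataFrame({'id_code': ['001', '002', '003', '004'],
                   'turnover': [100, 100, 100, 100],
                   'costs': [200, 200, 200, 200]})
df_db = pd.DataFrame({'Description': ['Retail', 'Wholesale', 'Supply'],
                      'Code1': ['001', '003', '004'],
                      'Code2': ['002', np.nan, np.nan]})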

groupby sum month wise on date time data

I have transaction data as shown below, covering three months.
Card_Number Card_type Category Amount Date
0 1 PLATINUM GROCERY 100 10-Jan-18
1 1 PLATINUM HOTEL 2000 14-Jan-18
2 1 PLATINUM GROCERY 500 17-Jan-18
3 1 PLATINUM GROCERY 300 20-Jan-18
4 1 PLATINUM RESTRAUNT 400 22-Jan-18
5 1 PLATINUM HOTEL 500 5-Feb-18
6 1 PLATINUM GROCERY 400 11-Feb-18
7 1 PLATINUM RESTRAUNT 600 21-Feb-18
8 1 PLATINUM GROCERY 800 17-Mar-18
9 1 PLATINUM GROCERY 200 21-Mar-18
10 2 GOLD GROCERY 1000 12-Jan-18
11 2 GOLD HOTEL 3000 14-Jan-18
12 2 GOLD RESTRAUNT 500 19-Jan-18
13 2 GOLD GROCERY 300 20-Jan-18
14 2 GOLD GROCERY 400 25-Jan-18
15 2 GOLD HOTEL 1500 5-Feb-18
16 2 GOLD GROCERY 400 11-Feb-18
17 2 GOLD RESTRAUNT 600 21-Mar-18
18 2 GOLD GROCERY 200 21-Mar-18
19 2 GOLD HOTEL 700 25-Mar-18
20 3 SILVER RESTRAUNT 1000 13-Jan-18
21 3 SILVER HOTEL 1000 16-Jan-18
22 3 SILVER GROCERY 500 18-Jan-18
23 3 SILVER GROCERY 300 23-Jan-18
24 3 SILVER GROCERY 400 28-Jan-18
25 3 SILVER HOTEL 500 5-Feb-18
26 3 SILVER GROCERY 400 11-Feb-18
27 3 SILVER HOTEL 600 25-Mar-18
28 3 SILVER GROCERY 200 29-Mar-18
29 3 SILVER RESTRAUNT 700 30-Mar-18
I am struggling to get the dataframe below.
Card_No  Card_Type  D   Jan_Sp  Jan_N  Feb_Sp  Feb_N  Mar_Sp  GR_T  RES_T
1        PLATINUM   70  3300    5      1500    3      1000    2300  100
2        GOLD       72  5200    5      1900    2      1500    2300  1100
3        SILVER     76  2900    5      900     2      1500    1800  1700
D = duration in days from the first transaction to the last transaction.
Jan_Sp = total spending in January.
Feb_Sp = total spending in February.
Mar_Sp = total spending in March.
Jan_N = number of transactions in January.
Feb_N = number of transactions in February.
GR_T = total spending on GROCERY.
RES_T = total spending on RESTRAUNT.
I tried the following code. I am very new to pandas.
q9['Date'] = pd.to_datetime(q9['Date'])
q9 = q9.sort_values(['Card_Number', 'Date'])
q9['D'] = q9.groupby('Card_Number')['Date'].diff().dt.days
My approach has three steps (each assumes Date has already been parsed as a datetime; see the sketch right after this list):

1. get the date range
2. get the monthly spending
3. get the category spending
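A minimal parsing sketch for the 10-Jan-18-style dates shown in the question:

import pandas as pd

# Parse the day-abbreviated-month-two-digit-year strings so that the
# .dt accessor used in the steps below works.
df['Date'] = pd.to_datetime(df['Date'], format='%d-%b-%y')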
Step 1: Date
date_df = df.groupby('Card_type').Date.apply(lambda x: (x.max()-x.min()).days)
Step 2: Month
month_df = (df.groupby(['Card_type', df.Date.dt.month_name().str[:3]])
              .Amount
              .agg(['sum', 'count'])
              .rename({'sum': '_Sp', 'count': '_N'}, axis=1)
              .unstack('Date'))
# rename
month_df.columns = [b + a for a, b in month_df.columns]
Step 3: Category
cat_df = df.pivot_table(index='Card_type',
                        columns='Category',
                        values='Amount',
                        aggfunc='sum')
# rename
cat_df.columns = [a[:2] + "_T" for a in cat_df.columns]
And finally concat:
pd.concat((date_df, month_df, cat_df), axis=1)
gives:
Date Feb_Sp Jan_Sp Mar_Sp Feb_N Jan_N Mar_N GR_T HO_T RE_T
Card_type
GOLD 72 1900 5200 1500 2 5 3 2300 5200 1100
PLATINUM 70 1500 3300 1000 3 5 2 2300 2500 1000
SILVER 76 900 3200 1500 2 5 3 1800 2100 1700
If your data span several years and you want to separate them by year, you can add df.Date.dt.year to each groupby above:
date_df = df.groupby([df.Date.dt.year, 'Card_type']).Date.apply(
    lambda x: (x.max() - x.min()).days)
month_df = (df.groupby([df.Date.dt.year, 'Card_type',
                        df.Date.dt.month_name().str[:3]])
              .Amount
              .agg(['sum', 'count'])
              .rename({'sum': '_Sp', 'count': '_N'}, axis=1)
              .unstack(level=-1))
# rename
month_df.columns = [b + a for a, b in month_df.columns]
cat_df = (df.groupby([df.Date.dt.year, 'Card_type', 'Category'])
            .Amount
            .sum()
            .unstack(level=-1))
# rename
cat_df.columns = [a[:2] + "_T" for a in cat_df.columns]
pd.concat((date_df, month_df, cat_df), axis=1)
gives:
Date Feb_Sp Jan_Sp Mar_Sp Feb_N Jan_N Mar_N GR_T HO_T
Date Card_type
2017 GOLD 72 1900 5200 1500 2 5 3 2300 5200
PLATINUM 70 1500 3300 1000 3 5 2 2300 2500
SILVER 76 900 3200 1500 2 5 3 1800 2100
2018 GOLD 72 1900 5200 1500 2 5 3 2300 5200
PLATINUM 70 1500 3300 1000 3 5 2 2300 2500
SILVER 76 900 3200 1500 2 5 3 1800 2100
I would recommend keeping the dataframe this way, so you can access the annual data, e.g. result_df.loc[2017] gives you the 2017 data. If you really want the years as a column level, you can do result_df.unstack(level=0).

SQL query to get maximum of all maximum values per customer

I'm struggling with this specific Access 2010 SQL query for quite some time now. Let me first show you what my table looks like:
customerID value
123456789 100
123456789 -100
123456789 300
123456789 -300
123456789 150
123456789 -150
123456789 200
123456789 200
987654321 500
987654321 -500
987654321 200
987654321 -200
987654321 210
987654321 210
You see I have multiple entries for one customerID, with several values. These values can be positive or negative; negative values represent corrections, so the corresponding positive value "gets nulled".
What I need to query now is the maximum of all maximum values per customerID. In the example above, the maximum value for customerID 123456789 is 200, because all the other values on this customerID annul each other. The maximum value for customerID 987654321 is hence 210.
Ultimately my query should return 210, the maximum of all per-customerID maximum values that didn't get corrected/annulled by negative values.
Can you please help me with this?
Edit: Added (duplicate) values 200 and 210 to both customerIDs to make clear that a SUM() won't work here.
Edit #2: Here's some (nearly) real life data: http://pastebin.com/TbNRTw5A
I don't know if this would be your answer; it just assumes that every negative value is paired with one corresponding equal positive value.
SELECT CustomerID, SUM(Stack1.Value) FROM Stack1
GROUP BY CustomerID
So the result would be:
CustomerID Value
123456789 200
987654321 210
Hope this helps
How about this?
WITH tmpPositive AS (SELECT
Stack1.CustomerID, Stack1.Value FROM Stack1 WHERE Stack1.Value > 0),
tmpNegative AS (SELECT
Stack1.CustomerID, Stack1.Value FROM Stack1 WHERE Stack1.Value < 0)
SELECT tmpPositive.CustomerID, MAX(tmpPositive.Value) AS MaxValue FROM tmpPositive
LEFT OUTER JOIN tmpNegative
ON tmpPositive.CustomerID = tmpNegative.CustomerID AND
-tmpPositive.Value = tmpNegative.Value
WHERE tmpNegative.CustomerID IS NULL
GROUP BY tmpPositive.CustomerID;
Here's the test data:
CustomerID Value
---------------------
123456789 100
123456789 -100
123456789 300
123456789 -300
123456789 150
123456789 -150
123456789 200
987654321 500
987654321 -500
987654321 200
987654321 -200
987654321 210
123456789 200
123456789 110
987654321 1250
And the result I get for the above query:
CustomerID MaxValue
--------------------
123456789 200
987654321 1250
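For reference, the cancellation rule from the question can be sanity-checked outside SQL with a short Python sketch (pure illustration, not Access syntax):

from collections import Counter

# Each negative value cancels one matching positive value; the answer is
# the overall maximum of the surviving positives per customer.
rows = [(123456789, 100), (123456789, -100), (123456789, 300),
        (123456789, -300), (123456789, 150), (123456789, -150),
        (123456789, 200), (123456789, 200),
        (987654321, 500), (987654321, -500), (987654321, 200),
        (987654321, -200), (987654321, 210), (987654321, 210)]

by_customer = {}
for cid, value in rows:
    by_customer.setdefault(cid, Counter())[value] += 1

max_per_customer = {}
for cid, counts in by_customer.items():
    surviving = [v for v, n in counts.items()
                 if v > 0 and n > counts.get(-v, 0)]
    max_per_customer[cid] = max(surviving, default=None)

print(max_per_customer)  # {123456789: 200, 987654321: 210}
print(max(v for v in max_per_customer.values() if v is not None))  # 210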