I've been working with seaborn.catplot to build a bar plot (data sample below) that adds up the values in a counts column for a set of reasons, separated by a group of companies:
sns.catplot(x='Bill_Name', y='counts', hue='Reason',
            data=data, kind='bar', height=6, aspect=13/6,
            legend=True, palette='hls')
Now each value also has a year column, so I was thinking of using seaborn.FacetGrid to lay the above out in a grid of rows.
If I understood how this works correctly, sns.FacetGrid must be fed the data and, in this case, the year column as the row argument, and then g.map should be called with sns.catplot and its corresponding parameters. However, this is not working:
g = sns.FacetGrid(data, row="year", height=4, aspect=.5)
g = g.map(sns.catplot, x='Bill_Name', y='counts', hue='Reason',
          data=data, kind='bar', height=6, aspect=13/6,
          legend=True, palette='hls')
What am I doing wrong?
Here's a sample of the data:
Bill_Name year Reason counts
0 CompanyC 2018.0 Reason6 2
1 CompanyC 2017.0 Reason5 8
2 CompanyB 2017.0 Reason3 146
3 CompanyC 2015.0 Reason6 2
4 CompanyC 2017.0 Reason1 1828
5 CompanyC 2016.0 Reason3 237
6 CompanyB 2018.0 Reason4 1097
7 CompanyC 2016.0 Reason4 11
8 CompanyB 2016.0 Reason5 12
9 CompanyC 2017.0 Reason2 834
10 CompanyB 2016.0 Reason3 97
11 CompanyC 2017.0 Reason6 714
12 CompanyA 2017.0 Reason1 4288
13 CompanyA 2016.0 Reason2 2444
14 CompanyC 2017.0 Reason3 293
15 CompanyB 2016.0 Reason1 1576
16 CompanyA 2016.0 Reason4 37
17 CompanyA 2018.0 Reason5 1
18 CompanyC 2018.0 Reason1 908
19 CompanyC 2018.0 Reason2 478
20 CompanyA 2015.0 Reason1 3826
21 CompanyB 2016.0 Reason4 119
22 CompanyB 2017.0 Reason2 1404
23 CompanyC 2016.0 Reason1 1884
24 CompanyC 2015.0 Reason4 1
25 CompanyA 2016.0 Reason1 6360
26 CompanyA 2018.0 Reason3 225
27 CompanyA 2018.0 Reason4 63
28 CompanyC 2018.0 Reason4 162
29 CompanyC 2016.0 Reason2 1504
You can avoid the FacetGrid entirely if you add the row='year' argument to the seaborn catplot:
sns.catplot(x='Bill_Name', y='counts', hue='Reason', row='year',
            data=data, kind='bar', height=6, aspect=13/6,
            legend=True, palette='hls')
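The reason the FacetGrid attempt fails is that catplot is a figure-level function: it builds its own FacetGrid internally, so passing it to g.map draws into new figures instead of the grid's axes. A minimal sketch of the working approach, on a small hypothetical dataset with the question's column names:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so this runs without a display
import pandas as pd
import seaborn as sns

# Hypothetical mini-dataset with the same columns as in the question
data = pd.DataFrame({
    'Bill_Name': ['CompanyA', 'CompanyB', 'CompanyA', 'CompanyB'],
    'counts': [10, 20, 30, 40],
    'Reason': ['Reason1', 'Reason1', 'Reason2', 'Reason2'],
    'year': [2017.0, 2017.0, 2018.0, 2018.0],
})

# catplot builds its own FacetGrid, so faceting is requested via row=
g = sns.catplot(x='Bill_Name', y='counts', hue='Reason', row='year',
                data=data, kind='bar', height=3, aspect=2, palette='hls')
```

With two distinct years, the returned FacetGrid has a 2x1 grid of axes, one row per year.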
I am trying to reorganize a temperature data set to get it in the same format as other data sets I have been using. I am having trouble iterating through the data frame and appending the data to a new data frame.
Here is the data:
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
0 1901 -3.16 -4.14 2.05 6.85 13.72 18.27 22.22 20.54 15.30 10.50 2.60 -2.68
1 1902 -3.73 -2.67 1.78 7.62 14.35 18.21 20.51 19.81 14.97 9.93 3.20 -4.02
2 1903 -3.93 -4.39 2.44 7.18 13.07 17.22 20.25 19.67 15.00 9.35 1.52 -2.84
3 1904 -5.49 -3.92 1.83 7.22 13.46 17.78 20.22 19.25 15.87 9.60 3.20 -2.31
4 1905 -4.89 -4.40 4.54 8.01 13.20 18.24 20.25 20.21 16.15 8.42 3.47 -3.28
.. ... ... ... ... ... ... ... ... ... ... ... ... ...
116 2017 -2.07 1.77 3.84 10.02 14.21 19.69 22.57 20.38 17.15 10.85 4.40 -0.77
117 2018 -2.36 -0.56 3.39 7.49 16.39 20.09 22.39 21.01 17.57 10.37 2.48 -0.57
118 2019 -2.38 -1.85 2.93 9.53 14.10 19.21 22.38 21.31 18.41 9.37 3.00 -0.08
119 2020 -1.85 -0.98 4.50 8.34 14.61 19.66 22.42 21.69 16.75 9.99 4.92 -0.38
120 2021 -0.98 -3.86 3.94 8.41 14.06 20.63 22.22 21.23 17.48 11.47 3.54 0.88
Here is the code that I have tried:
import pandas as pds

df = pds.read_excel("Temp_Data.xlsx")
data = pds.DataFrame()
for i in range(len(df)):
    data1 = df.iloc[i]
    data.append(data1)
Here is the result of that code:
print(data)
Feb -0.72
Mar 0.75
Apr 6.77
May 14.44
Jun 18.40
Jul 20.80
Aug 20.13
Sep 16.17
Oct 10.64
Nov 2.71
Dec -2.80
Name: 43, dtype: float64, Year 1945.00
Jan -2.62
Feb -0.75
Mar 4.00
Apr 7.29
May 12.31
Jun 16.98
Jul 20.76
Aug 20.11
Sep 16.08
Oct 9.82
Nov 2.09
Dec -3.87
Note: for some reason the data starts at 1945 and goes to 2021.
Here is how I am trying to format the data eventually:
Date Temp
0 190101 -3.16
1 190102 -4.14
2 190103 2.05
3 190104 6.85
4 190105 13.72
5 190106 18.27
6 190107 22.22
7 190108 20.54
8 190109 15.30
9 190110 10.50
10 190111 2.60
11 190112 -2.68
12 190201 -3.73
13 190202 -2.67
14 190203 1.78
15 190204 7.62
16 190205 14.35
17 190206 18.21
18 190207 20.51
19 190208 19.81
20 190209 14.97
21 190210 9.93
22 190211 3.20
23 190212 -4.02
You can use melt to reshape your dataframe, then create the Date column from the Year and Month columns:
months = {'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04',
          'May': '05', 'Jun': '06', 'Jul': '07', 'Aug': '08',
          'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}
# Convert Year and Month columns to YYYYMM
to_date = lambda x: x.pop('Year').astype(str) + x.pop('Month').map(months)
out = (df.melt(id_vars='Year', var_name='Month', value_name='Temp')
         .assign(Date=to_date).set_index('Date').sort_index().reset_index())
Output:
>>> out
Date Temp
0 190101 -3.16
1 190102 -4.14
2 190103 2.05
3 190104 6.85
4 190105 13.72
.. ... ...
115 202108 21.23
116 202109 17.48
117 202110 11.47
118 202111 3.54
119 202112 0.88
[120 rows x 2 columns]
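As a quick sanity check, the same recipe can be run on a tiny two-year, two-month slice of the wide table (values copied from its first rows):

```python
import pandas as pd

# Two years x two months, taken from the head of the wide table
df = pd.DataFrame({
    'Year': [1901, 1902],
    'Jan': [-3.16, -3.73],
    'Feb': [-4.14, -2.67],
})

months = {'Jan': '01', 'Feb': '02'}

# Same melt + pop trick as above, just with the reduced month map
out = (df.melt(id_vars='Year', var_name='Month', value_name='Temp')
         .assign(Date=lambda x: x.pop('Year').astype(str) + x.pop('Month').map(months))
         .set_index('Date').sort_index().reset_index())

# Date column comes out as 190101, 190102, 190201, 190202
```

Note that x.pop('Year') inside assign both builds the Date string and drops the Year column from the intermediate frame, which is why the result ends up with just Date and Temp.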
I have a dataset in this form:
company_name date
0 global_infotech 2019-06-15
1 global_infotech 2020-03-22
2 global_infotech 2020-08-30
3 global_infotech 2018-06-19
4 global_infotech 2018-06-15
5 global_infotech 2018-02-15
6 global_infotech 2018-11-22
7 global_infotech 2019-01-15
8 global_infotech 2018-12-15
9 global_infotech 2019-06-15
10 global_infotech 2018-12-19
11 global_infotech 2019-12-31
12 global_infotech 2019-02-18
13 global_infotech 2018-06-16
14 global_infotech 2019-02-10
15 global_infotech 2019-03-15
16 Qualcom 2019-07-11
17 Qualcom 2018-01-11
18 Qualcom 2018-05-29
19 Qualcom 2018-10-06
20 Qualcom 2018-11-11
21 Qualcom 2019-08-17
22 Qualcom 2019-02-22
23 Qualcom 2019-10-16
24 Qualcom 2018-06-22
25 Qualcom 2018-06-14
26 Qualcom 2018-06-16
27 Syscin 2018-02-10
28 Syscin 2019-02-16
29 Syscin 2018-04-12
30 Syscin 2018-08-22
31 Syscin 2018-09-16
32 Syscin 2019-04-20
33 Syscin 2018-02-28
34 Syscin 2018-01-19
Considering today's date as 1st January 2020, I want to write code that finds the number of times each company name occurs in the last 3 months. For example, suppose that from 1st Oct 2019 to 1st Jan 2020 global_infotech's name appears 5 times; then 5 should appear in front of every global_infotech row, like:
company_name date appearance_count_last_3_months
0 global_infotech 2019-06-15 5
1 global_infotech 2020-03-22 5
2 global_infotech 2020-08-30 5
3 global_infotech 2018-06-19 5
4 global_infotech 2018-06-15 5
5 global_infotech 2018-02-15 5
6 global_infotech 2018-11-22 5
7 global_infotech 2019-01-15 5
8 global_infotech 2018-12-15 5
9 global_infotech 2019-06-15 5
10 global_infotech 2018-12-19 5
11 global_infotech 2019-12-31 5
12 global_infotech 2019-02-18 5
13 global_infotech 2018-06-16 5
14 global_infotech 2019-02-10 5
15 global_infotech 2019-03-15 5
IIUC, you can create a custom function:
def getcount(company, month=3, df=df):
    df = df.copy()
    df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d', errors='coerce')
    df = df[df['company_name'].eq(company)]
    val = df.groupby(pd.Grouper(key='date', freq=str(month) + 'M')).count().max().get(0)
    df['appearance_count_last_3_months'] = val
    return df
getcount('global_infotech')
# OR
getcount('global_infotech', 3)
Update:
Since you have 92 different companies, you can use a for loop:
lst = []
for x in df['company_name'].unique():
    lst.append(getcount(x))
out = pd.concat(lst)
If you print out, you will get your desired output.
You can first filter the data for the last 3 months, and then groupby company name and merge back into the original dataframe.
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta
# sample data
df = pd.DataFrame({
    'company_name': ['global_infotech', 'global_infotech', 'Qualcom', 'another_company'],
    'date': ['2019-02-18', '2021-07-02', '2021-07-01', '2019-02-18']
})
df['date'] = pd.to_datetime(df['date'])
# filter for last 3 months
summary = df[df['date'] >= datetime.now() - relativedelta(months=3)]
# groupby then aggregate with desired column name
summary = summary.rename(columns={'date':'appearance_count_last_3_months'})
summary = summary.groupby('company_name')
summary = summary.agg('count')
# merge summary back into original df, filling missing values with 0
df = df.merge(summary, left_on='company_name', right_index=True, how='left')
df['appearance_count_last_3_months'] = df['appearance_count_last_3_months'].fillna(0).astype('int')
# result:
df
company_name date appearance_count_last_3_months
0 global_infotech 2019-02-18 1
1 global_infotech 2021-07-02 1
2 Qualcom 2021-07-01 1
3 another_company 2019-02-18 0
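The question actually fixes "today" at 1st January 2020 rather than the real current date. Under that assumption (and on a hypothetical mini-sample), the same per-row count can also be computed without a merge, using value_counts and map:

```python
import pandas as pd

# Hypothetical mini-sample with the question's columns
df = pd.DataFrame({
    'company_name': ['global_infotech', 'global_infotech', 'Qualcom', 'Syscin'],
    'date': ['2019-12-31', '2019-06-15', '2019-11-20', '2018-02-10'],
})
df['date'] = pd.to_datetime(df['date'])

# Fixed reference date, as stated in the question
today = pd.Timestamp('2020-01-01')
in_window = df['date'].between(today - pd.DateOffset(months=3), today)

# Count appearances per company inside the window, broadcast to every row
counts = df.loc[in_window, 'company_name'].value_counts()
df['appearance_count_last_3_months'] = (
    df['company_name'].map(counts).fillna(0).astype(int)
)
```

Companies with no match in the window (Syscin here) get 0 via fillna, mirroring the merge-based answer above.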
Dear friends, I want to transpose the following dataframe into a single column. I can't figure out a way to transform it, so your help is welcome! I tried pivot_table but so far no success.
X 0.00 1.25 1.75 2.25 2.99 3.25
X 3.99 4.50 4.75 5.25 5.50 6.00
X 6.25 6.50 6.75 7.50 8.24 9.00
X 9.50 9.75 10.25 10.50 10.75 11.25
X 11.50 11.75 12.00 12.25 12.49 12.75
X 13.25 13.99 14.25 14.49 14.99 15.50
and it should look like this
X
0.00
1.25
1.75
2.25
2.99
3.25
3.99
4.5
4.75
5.25
5.50
6.00
6.25
etc..
This will do it; df.columns[0] is used as I don't know what your headers are:
df = pd.DataFrame({'X': df.set_index(df.columns[0]).stack().reset_index(drop=True)})
df
X
0 0.00
1 1.25
2 1.75
3 2.25
4 2.99
5 3.25
6 3.99
7 4.50
8 4.75
9 5.25
10 5.50
11 6.00
12 6.25
13 6.50
14 6.75
15 7.50
16 8.24
17 9.00
18 9.50
19 9.75
20 10.25
21 10.50
22 10.75
23 11.25
24 11.50
25 11.75
26 12.00
27 12.25
28 12.49
29 12.75
30 13.25
31 13.99
32 14.25
33 14.49
34 14.99
35 15.50
Thank you so much! A follow-up question: is it also possible to stack the df into 2 columns, X and Y?
This is the data set:
1 2 3 4 5 6 7
X 0.00 1.25 1.75 2.25 2.99 3.25
Y -1.08 -1.07 -1.07 -1.00 -0.81 -0.73
X 3.99 4.50 4.75 5.25 5.50 6.00
Y -0.37 -0.20 -0.15 -0.17 -0.15 -0.16
X 6.25 6.50 6.75 7.50 8.24 9.00
Y -0.17 -0.18 -0.24 -0.58 -0.93 -1.24
X 9.50 9.75 10.25 10.50 10.75 11.25
Y -1.38 -1.42 -1.51 -1.57 -1.64 -1.75
X 11.50 11.75 12.00 12.25 12.49 12.75
Y -1.89 -2.00 -2.00 -2.04 -2.04 -2.10
X 13.25 13.99 14.25 14.49 14.99 15.50
Y -2.08 -2.13 -2.18 -2.18 -2.27 -2.46
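The follow-up isn't answered above; here's a sketch under the assumption that the rows strictly alternate X, Y, X, Y with the label in the first column (hypothetical mini-version of the first two row pairs):

```python
import pandas as pd

# Hypothetical reconstruction: first three values of the first two X/Y pairs
df = pd.DataFrame([
    ['X', 0.00, 1.25, 1.75],
    ['Y', -1.08, -1.07, -1.07],
    ['X', 3.99, 4.50, 4.75],
    ['Y', -0.37, -0.20, -0.15],
])

# Stack every labelled row into one long series, then split by label;
# row order is preserved, so the i-th X still pairs with the i-th Y
long = df.set_index(0).stack().reset_index(level=1, drop=True)
out = pd.DataFrame({
    'X': long.loc['X'].reset_index(drop=True),
    'Y': long.loc['Y'].reset_index(drop=True),
})
```

This is the same stack idea as the single-column answer, applied once per label.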
For example, mtcars:
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
The first "column" isn't named. I want to access it like:
mtcars$hp[mtcars$***carnamecolumn*** == "Mazda RX4"]
to ask "What's the horsepower of the Mazda RX4?"
Can I use a $ accessor for this?
I want to build a column for my dataframe df['days_since_last'] that shows the days since the last match for each player_id for each event_id and nan if the row is the first match for the player in the dataset.
Example of my data:
event_id player_id match_date
0 1470993 227485 2015-11-29
1 1492031 227485 2016-07-23
2 1489240 227485 2016-06-19
3 1495581 227485 2016-09-02
4 1490222 227485 2016-07-03
5 1469624 227485 2015-11-14
6 1493822 227485 2016-08-13
7 1428946 313444 2014-08-10
8 1483245 313444 2016-05-21
9 1472260 313444 2015-12-13
I tried the code in Find days since last event pandas dataframe but got nonsensical results.
It seems you need to sort first:
df['days_since_last_event'] = (df.sort_values(['player_id', 'match_date'])
                                 .groupby('player_id')['match_date'].diff()
                                 .dt.days)
print (df)
event_id player_id match_date days_since_last_event
0 1470993 227485 2015-11-29 15.0
1 1492031 227485 2016-07-23 20.0
2 1489240 227485 2016-06-19 203.0
3 1495581 227485 2016-09-02 20.0
4 1490222 227485 2016-07-03 14.0
5 1469624 227485 2015-11-14 NaN
6 1493822 227485 2016-08-13 21.0
7 1428946 313444 2014-08-10 NaN
8 1483245 313444 2016-05-21 160.0
9 1472260 313444 2015-12-13 490.0
Demo (note that this computes days relative to each player's most recent match, not the previous one):
In [174]: df['days_since_last'] = (df.groupby('player_id')['match_date']
.transform(lambda x: (x.max()-x).dt.days))
In [175]: df
Out[175]:
event_id player_id match_date days_since_last
0 1470993 227485 2015-11-29 278
1 1492031 227485 2016-07-23 41
2 1489240 227485 2016-06-19 75
3 1495581 227485 2016-09-02 0
4 1490222 227485 2016-07-03 61
5 1469624 227485 2015-11-14 293
6 1493822 227485 2016-08-13 20
7 1428946 313444 2014-08-10 650
8 1483245 313444 2016-05-21 0
9 1472260 313444 2015-12-13 160
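To see the sort + diff recipe from the first answer end to end, here is a small self-contained run on four rows taken from the question's sample:

```python
import math

import pandas as pd

# Four rows from the question's sample
df = pd.DataFrame({
    'player_id': [227485, 227485, 227485, 313444],
    'match_date': pd.to_datetime(
        ['2015-11-29', '2015-11-14', '2016-06-19', '2014-08-10']),
})

# Sort so diff() compares consecutive matches of the same player;
# index alignment then writes the result back in the original row order
df['days_since_last'] = (
    df.sort_values(['player_id', 'match_date'])
      .groupby('player_id')['match_date']
      .diff()
      .dt.days
)
```

Each player's earliest match gets NaN (no previous match), matching the behaviour the question asks for.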