Pandas dataframe median of a column with condition

So I have a dataframe with two columns (price, location). Now I want to get the median of price where the location is, e.g., "Paris". How do I achieve that?
dataframe:
location price
paris 5
paris 2
rome 5
paris 4
...
desired result: 4 (median of 2,5,4)

You can use df.groupby to group on location and then take .median() of the price column:
median = df.groupby('location')['price'].median()
To get the value for a given location:
median.loc['paris']
Output:
4.0
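A self-contained version of this groupby approach, using the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({'location': ['paris', 'paris', 'rome', 'paris'],
                   'price': [5, 2, 5, 4]})

# Median price per location; selecting the column first keeps the result numeric
median = df.groupby('location')['price'].median()
print(median.loc['paris'])  # 4.0
```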

import pandas as pd
# Build dataframe
data = [['Paris', 2], ['New York', 3], ['Rome', 4], ['Paris', 5], ['Paris', 4]]
df = pd.DataFrame(data, columns=['location', 'price'])
# Get paris only rows
df_paris = df[df['location'] == 'Paris']
# Print median
print(df_paris['price'].median())

Related

extract conditional max based on a reference data frame

my reference data frame is of the following type:
tbl = pd.DataFrame([['Afghanistan', 'AFN', 4],
                    ['Albania', 'ALL', 2],
                    ['France', 'EUR', 1]],
                   columns=['country', 'currency', 'score'])
a toy working df:
df = pd.DataFrame(
    [['France', 'AFN'], ['France', 'ALL'], ['France', 'EUR'],
     ['Albania', 'AFN'], ['Albania', 'ALL'], ['Albania', 'EUR'],
     ['Afghanistan', 'AFN'], ['Afghanistan', 'ALL'], ['Afghanistan', 'EUR']],
    columns=['country', 'currency'])
Since my working df may pair country and currency differently (for example country == 'France' and currency == 'AFN'), I would like to create a column with the max score based on either, i.e., this country/currency combo would imply a score of 4.
Desired output:
Out[102]:
       country currency  score
0       France      AFN      4
1       France      ALL      2
2       France      EUR      1
3      Albania      AFN      4
4      Albania      ALL      2
5      Albania      EUR      2
6  Afghanistan      AFN      4
7  Afghanistan      ALL      4
8  Afghanistan      EUR      4
Here is what I have so far, but it's multiline and extremely clunky:
df = pd.merge(df, tbl[['country', 'score']],
              how='left', on='country')
df['em_score'] = df['score']
df = df.drop('score', axis=1)
df = pd.merge(df, tbl[['currency', 'score']],
              how='left', on='currency')
df['em_score'] = df[['em_score', 'score']].max(axis=1)
df = df.drop('score', axis=1)
Here's a way to do it:
byCol = {col: tbl[[col, 'score']].set_index(col) for col in tbl.columns if col != 'score'}
df['em_score'] = pd.concat([
    df.join(byCol[col], on=col).score.rename('score_' + col) for col in byCol
], axis=1).max(axis=1)
Explanation:
for each column in tbl other than score (in your case, country and currency), build a score lookup keyed by that column using set_index()
use pd.concat() to create a new dataframe with one column per lookup, each a Series produced by join()-ing the working df with that lookup
use max() on each row to get the desired em_score.
Full test code with sample df:
import pandas as pd
tbl = pd.DataFrame([['Afghanistan', 'AFN', 4],
                    ['Albania', 'ALL', 2],
                    ['France', 'EUR', 1]],
                   columns=['country', 'currency', 'score'])
df = pd.DataFrame(
    [['France', 'AFN'], ['France', 'ALL'], ['France', 'EUR'],
     ['Albania', 'AFN'], ['Albania', 'ALL'], ['Albania', 'EUR']],
    columns=['country', 'currency'])
print('','tbl',tbl,sep='\n')
print('','df',df,sep='\n')
byCol = {col: tbl[[col, 'score']].set_index(col) for col in tbl.columns if col != 'score'}
df['em_score'] = pd.concat([
    df.join(byCol[col], on=col).score.rename('score_' + col) for col in byCol
], axis=1).max(axis=1)
print('','output',df,sep='\n')
Output:
tbl
       country currency  score
0  Afghanistan      AFN      4
1      Albania      ALL      2
2       France      EUR      1

df
   country currency
0   France      AFN
1   France      ALL
2   France      EUR
3  Albania      AFN
4  Albania      ALL
5  Albania      EUR

output
   country currency  em_score
0   France      AFN         4
1   France      ALL         2
2   France      EUR         1
3  Albania      AFN         4
4  Albania      ALL         2
5  Albania      EUR         2
So, for the case where you have
tbl = pd.DataFrame([['Afghanistan', 'AFN', 4],
                    ['Albania', 'ALL', 2],
                    ['France', 'EUR', 1],
                    ['France', 'AFN', 0]],
                   columns=['country', 'currency', 'score'])
this code (it also needs import numpy as np) will find, for each row, the larger of the max score for its country and the max score for its currency:
np.maximum(np.array(tbl.groupby(['country']).max().loc[tbl['country'], 'score']),
           np.array(tbl.groupby(['currency']).max().loc[tbl['currency'], 'score']))
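A runnable sketch of the same idea, applied to the working df rather than to tbl itself, so each row of df gets the larger of its country's best score and its currency's best score (this assumes every country and currency in df appears in tbl):

```python
import numpy as np
import pandas as pd

tbl = pd.DataFrame([['Afghanistan', 'AFN', 4],
                    ['Albania', 'ALL', 2],
                    ['France', 'EUR', 1]],
                   columns=['country', 'currency', 'score'])
df = pd.DataFrame([['France', 'AFN'], ['France', 'ALL'], ['France', 'EUR']],
                  columns=['country', 'currency'])

# Best score seen for each country / currency in the reference table
country_max = tbl.groupby('country')['score'].max()
currency_max = tbl.groupby('currency')['score'].max()

# Row-wise max of the two lookups
df['em_score'] = np.maximum(country_max.loc[df['country']].to_numpy(),
                            currency_max.loc[df['currency']].to_numpy())
```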

Copy first of group down and sum total - pre-defined groups

I have previously asked how to iterate through a prescribed grouping of items and received the solution.
import pandas as pd
data = [['apple', 1], ['orange', 2], ['pear', 3], ['peach', 4], ['plum', 5], ['grape', 6]]
# index_groups = [0], [1, 2], [3, 4, 5]
df = pd.DataFrame(data, columns=['Name', 'Number'])
for i in range(len(df)):
    print(df['Number'][i])
     Name  Number
0   apple       1
1  orange       2
2    pear       3
3   peach       4
4    plum       5
5   grape       6
where:
for group in index_groups:
    print(df.loc[group])
gave me just what I needed. Following up on this I would like to now sum the numbers per group but also copy down the first 'Name' in each group to the other names in the group, and then concatenate so one line per 'Name'.
In the above example the output I'm seeking would be
     Name  Age
0   apple    1
1  orange    5
2   peach   15
I can append the sums to a list easily enough
group_sum = []
group_sum.append(sum(df['Number'].loc[group]))
But I can't get the 'Names' in order to merge with the sums.
You could try:
df_final = pd.DataFrame()
for group in index_groups:
    _df = df.loc[group].copy()  # copy() avoids a SettingWithCopyWarning
    _df["Name"] = df.loc[group].Name.iloc[0]
    df_final = pd.concat([df_final, _df])
df_final.groupby("Name").agg(Age=("Number", "sum")).reset_index()
Output:
     Name  Age
0   apple    1
1  orange    5
2   peach   15
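If you prefer to avoid growing a dataframe inside the loop, one vectorized alternative is to turn index_groups into a row-to-group mapping and aggregate in a single pass (a sketch, assuming every row index appears in exactly one group):

```python
import pandas as pd

data = [['apple', 1], ['orange', 2], ['pear', 3], ['peach', 4], ['plum', 5], ['grape', 6]]
df = pd.DataFrame(data, columns=['Name', 'Number'])
index_groups = [[0], [1, 2], [3, 4, 5]]

# Map each row index to its group number, then aggregate once
labels = {i: g for g, idx in enumerate(index_groups) for i in idx}
out = (df.groupby(df.index.map(labels))
         .agg(Name=('Name', 'first'), Age=('Number', 'sum'))
         .reset_index(drop=True))
```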

Group by count in pandas dataframe

In a pandas dataframe I want to create two new columns that count how often a value occurs, plus a third column with the ratio:
ratio = count_occurrence_both_columns / count_occurrence_columnA * 100
df = pd.DataFrame({"column A": ["Atlanta", "Atlanta", "New York", "New York","New York"], "column B": ["AT", "AT", "NY", "NY", "AT"]})
df
columnA columnB  occurrence_columnA  occurrence_both_columns   Ratio
Atlanta      AT                   2                        2    100%
Atlanta      AT                   2                        2    100%
Newyork      NY                   3                        2  66.66%
Newyork      NY                   3                        2  66.66%
Newyork      AT                   3                        1  33.33%
First, you can create a dictionary whose keys are the unique values of column A and whose values are their counts.
>>> column_a_mapping = df['column A'].value_counts().to_dict()
>>> column_a_mapping
{'New York': 3, 'Atlanta': 2}
Then, you can create a new column that has the two columns merged in order to have the same value counts dictionary as above.
>>> df['both_columns'] = (
...     df[['column A', 'column B']]
...     .apply(lambda row: '_'.join(row), axis=1)
... )
>>> both_columns_mapping = df['both_columns'].value_counts().to_dict()
>>> both_columns_mapping
{'New York_NY': 2, 'Atlanta_AT': 2, 'New York_AT': 1}
Once you have the unique-value counts, you can simply use the pd.Series.replace method.
>>> df['count_occurrence_both_columns'] = df['both_columns'].replace(both_columns_mapping)
>>> df['count_occurrence_columnA'] = df['column A'].replace(column_a_mapping)
Lastly, you can compute your ratio column and then drop the merged helper column:
>>> df['ratio'] = df['count_occurrence_both_columns'] / df['count_occurrence_columnA'] * 100
>>> df.drop('both_columns', axis=1, inplace=True)
You should obtain this dataframe:
   column A column B  count_occurrence_columnA  count_occurrence_both_columns       ratio
0   Atlanta       AT                         2                              2  100.000000
1   Atlanta       AT                         2                              2  100.000000
2  New York       NY                         3                              2   66.666667
3  New York       NY                         3                              2   66.666667
4  New York       AT                         3                              1   33.333333
Use pandas groupby to count the items
df['occurrence_columnA'] = df.groupby(['column A'])['column B'].transform(len)
df['occurrence_both_columns'] = df.groupby(['column A','column B'])['occurrence_columnA'].transform(len)
An alternative is transform('count'), but note that it ignores NaNs.
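Putting the groupby idea together, here is a compact sketch of the whole task using transform('size'), which (unlike 'count') does not skip NaNs:

```python
import pandas as pd

df = pd.DataFrame({"column A": ["Atlanta", "Atlanta", "New York", "New York", "New York"],
                   "column B": ["AT", "AT", "NY", "NY", "AT"]})

# transform('size') broadcasts each group's row count back to the original rows
df['count_occurrence_columnA'] = df.groupby('column A')['column A'].transform('size')
df['count_occurrence_both_columns'] = df.groupby(['column A', 'column B'])['column B'].transform('size')
df['ratio'] = df['count_occurrence_both_columns'] / df['count_occurrence_columnA'] * 100
```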

Merge dataframe

I have a dataframe df as follows:
I would like to convert it in the following way.
How do I go about it?
All help appreciated.
Thanks
Try creating a DataFrame for each class_id then concat on axis=1:
import pandas as pd
df = pd.DataFrame({'student_id': [1, 1, 1, 2],
                   'class_id': [1, 2, 3, 1],
                   'teacher': ['rex', 'fred', 'hulio', 'ross']})
# Generate a DataFrame for each class_id
dfs = tuple(df[df['class_id'].eq(c_id)].reset_index(drop=True)
            for c_id in df['class_id'].unique())
# Concat on axis 1
new_df = pd.concat(dfs, axis=1)
# For display
print(new_df.fillna('').to_string(index=False))
new_df:
 student_id  class_id teacher  student_id  class_id teacher  student_id  class_id teacher
          1         1     rex         1.0       2.0    fred         1.0       3.0   hulio
          2         1    ross
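The same reshaping can be sketched with groupby instead of a manual mask per class_id; this is equivalent here because groupby sorts its keys, which matches the order of df['class_id'].unique() in this example:

```python
import pandas as pd

df = pd.DataFrame({'student_id': [1, 1, 1, 2],
                   'class_id': [1, 2, 3, 1],
                   'teacher': ['rex', 'fred', 'hulio', 'ross']})

# One frame per class_id, re-indexed from 0 so concat aligns rows positionally
dfs = [g.reset_index(drop=True) for _, g in df.groupby('class_id')]
new_df = pd.concat(dfs, axis=1)
```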

Copy columns to dataframe using panda

I have two dataframes and I want to copy the values from one to the other, but the copied values came back as NaN.
These are my df:
data1 = [[1, 2], [3, 4], [5, 6]]
rc = pd.DataFrame(data1, columns=['Sold', 'Leads'])
data2 = [['Company1', '2017-05-01', 0, 0],
         ['Company1', '2017-05-01', 0, 0],
         ['Company1', '2017-05-01', 0, 0]]
final = pd.DataFrame(data2, columns=['company', 'date', '2019_sold', '2019_leads'])
I tried loc indexing
final.loc[(final['date'] == '2017-05-01') & (final['company'] == 'Company1'),['2019_sold','2019_leads']] = rc[['Leads','Sold']]
I expected this to copy the rc values into the final df, but the values came back as NaN.
By using update
rc.index=final.index[(final['date'] == '2017-05-01') & (final['company'] == 'Company1')]
rc.columns=['2019_sold','2019_leads']
final.update(rc)
final
Out[165]:
    company        date  2019_sold  2019_leads
0  Company1  2017-05-01          1           2
1  Company1  2017-05-01          3           4
2  Company1  2017-05-01          5           6
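For completeness, the original .loc assignment can also be made to work by stripping rc's index and column labels with .to_numpy(), so pandas places the values positionally instead of trying to align labels (a sketch, assuming the mask selects exactly len(rc) rows):

```python
import pandas as pd

data1 = [[1, 2], [3, 4], [5, 6]]
rc = pd.DataFrame(data1, columns=['Sold', 'Leads'])
data2 = [['Company1', '2017-05-01', 0, 0],
         ['Company1', '2017-05-01', 0, 0],
         ['Company1', '2017-05-01', 0, 0]]
final = pd.DataFrame(data2, columns=['company', 'date', '2019_sold', '2019_leads'])

mask = (final['date'] == '2017-05-01') & (final['company'] == 'Company1')
# .to_numpy() drops rc's labels, so the values are assigned by position
final.loc[mask, ['2019_sold', '2019_leads']] = rc[['Sold', 'Leads']].to_numpy()
```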