How to group by a column of lists - pandas

I have an imaginary movie dataframe. I would like to group Sales by the values in the lists of the Genre column. How can I do it (preferably without exploding the Genre column)? For example, to get the total sales by genre.
Thanks
import pandas as pd

data = {
    "Movie": ["Avatar", "Leap Year", "Life is Beautiful", "Roman Holiday"],
    "Sales": [5000, 2500, 2800, 4050],
    "Genre": [["Sci-fi", "Action"], ["Romantic", "Comedy"], ["Tragic", "Comdey"], ["Romantic"]]
}
df = pd.DataFrame(data)
sales_by_genre = df.groupby(df['Genre'].map(tuple))['Sales'].sum() # <<< This line not working

I can't think of a straightforward way of doing this without exploding the list, so here is an example with explode:
df = df.explode(column='Genre', ignore_index=True)[['Sales','Genre']].groupby('Genre').sum()
print(df)
result:
          Sales
Genre
Action     5000
Comdey     2800
Comedy     2500
Romantic   6550
Sci-fi     5000
Tragic     2800
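Since the question asks for a way that avoids explode, one alternative sketch is to aggregate into a plain Counter by iterating over the list column directly (assuming every Genre cell is a list):

```python
from collections import Counter

import pandas as pd

data = {
    "Movie": ["Avatar", "Leap Year", "Life is Beautiful", "Roman Holiday"],
    "Sales": [5000, 2500, 2800, 4050],
    "Genre": [["Sci-fi", "Action"], ["Romantic", "Comedy"], ["Tragic", "Comdey"], ["Romantic"]],
}
df = pd.DataFrame(data)

# accumulate each movie's sales into every genre it belongs to
counts = Counter()
for sales, genres in zip(df["Sales"], df["Genre"]):
    for genre in genres:
        counts[genre] += sales

sales_by_genre = pd.Series(counts, name="Sales").sort_index()
```

This keeps the frame intact, though for large frames explode plus groupby is usually still the more idiomatic pandas route.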

Related

How do I prevent str.contains() from searching for a sub-string?

I want pandas to search my data frame for the complete string, not a sub-string. Here is a minimal working example to explain my problem:
import pandas as pd

data = [['tom', 'wells fargo', 'retired'],
        ['nick', 'bank of america', 'partner'],
        ['juli', 'chase', 'director - oil well']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Place', 'Position'])
df

val = 'well'
df.loc[df.apply(lambda col: col.str.contains(val, case=False)).any(axis="columns")]
This returns the two rows below, but the correct code should have returned only the second one (index 2), not the first:
   Name        Place             Position
0   tom  wells fargo              retired
2  juli        chase  director - oil well
Update: my intention is to have a search that matches only the exact string requested. While looking for "well", the algorithm shouldn't also pull out "wells". Based on the comments, I understand how my question might be misleading.
IIUC, you can use:
>>> df[df['Position'].str.contains(fr'\b{val}\b')]
   Name  Place             Position
2  juli  chase  director - oil well
And for all columns:
>>> df[df.apply(lambda x: x.str.contains(fr'\b{val}\b', case=False)).any(axis=1)]
   Name  Place             Position
2  juli  chase  director - oil well
The regular expression anchor \b, which is a word boundary, is what you want.
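A quick standalone check of the word-boundary behaviour with plain `re`, outside pandas:

```python
import re

# \b is a word boundary: the match fails if 'well' is glued to other word characters
pattern = re.compile(r'\bwell\b', re.IGNORECASE)

print(bool(pattern.search('wells fargo')))          # False: 'well' is only a prefix of 'wells'
print(bool(pattern.search('director - oil well')))  # True: 'well' stands alone as a word
```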
I added additional data to your code to illustrate further:
import pandas as pd

data = [
    ['tom', 'wells fargo', 'retired'],
    ['nick', 'bank of america', 'partner'],
    ['john', 'bank of welly', 'blah'],
    ['jan', 'bank of somewell knwon', "well that's it"],
    ['juli', 'chase', 'director - oil well'],
]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Place', 'Position'])
# print dataframe
df

val = 'well'
df.loc[df.apply(lambda col: col.str.contains(fr"\b{val}\b", case=False)).any(axis="columns")]
EDIT
In Python 3 you can substitute the variable into the string by putting f in front of the " or ', and the r prefix marks it as a raw string for the regular expression, so fr"\b{val}\b" searches for the current value of val as a whole word. Thanks @smci.
and the output is like this:
   Name                   Place             Position
3   jan  bank of somewell knwon       well that's it
4  juli                   chase  director - oil well

How to convert income from one currency into another using a historical FX table in Python with pandas

I have two dataframes: an income df and an fx df. The income df shows income from different accounts on different dates, but it also shows extra income in a different currency. The fx df shows the FX rates for certain currency pairs on the same dates the extra income came into the accounts.
I want to convert the currency of the extra income into the same currency as the account. For example, account HP on 23/3 has extra income = 35 GBP, and I want to convert that into EUR, as that's the currency of the account. Please note it has to use the fx table, as I have a long history of data points and other accounts to fill, so I do not want to manually code 35 * the fx rate. Finally, I want to create another column in the income df that sums the daily income and the extra income in the same currency.
I'm not sure how to bring both dfs together so I can look up the correct fx rate for that specific date and convert the extra income into the currency of the account.
My code is below:
import pandas as pd

income_data = {'date': ['23/3/22', '23/3/22', '24/3/22', '25/3/22'],
               'account': ['HP', 'HP', 'JJ', 'JJ'],
               'daily_income': [1000, 1000, 2000, 2000],
               'ccy of account': ['EUR', 'EUR', 'USD', 'USD'],
               'extra_income': [50, 35, 10, 12.5],
               'ccy of extra_income': ['EUR', 'GBP', 'EUR', 'USD']}
income_df = pd.DataFrame(income_data)

fx_data = {'date': ['23/3/22', '23/3/22', '24/3/22', '25/3/22'],
           'EUR/GBP': [0.833522, 0.833522, 0.833621, 0.833066],
           'USD/EUR': [0.90874, 0.90874, 0.91006, 0.90991]}
fx_df = pd.DataFrame(fx_data)
The final df should look like this (I flipped the fx rate, i.e. 1/0.833522, to get some of the values).
Would really appreciate it if someone could help me with this. My initial thought was merge, but I don't have a common column, and I'm not sure the map function would work either, as I don't have a dictionary. Apologies in advance if any of my code is not great; I am still self-learning. Thanks!
Consider creating a common column for merging in both data frames. Below uses assign to add columns and Series operators (rather than the arithmetic ones: +, -, *, /). Two fixes to note: the reversed ratios are reciprocals (rdiv(1), not div(-1)), and the label is built as extra ccy/account ccy so that multiplying the extra income by the matched rate converts it into the account currency.
# ADD NEW COLUMN AS CONCAT OF CCY COLUMNS
income_df = income_df.assign(
    currency_ratio = lambda df: df["ccy of extra_income"] + "/" + df["ccy of account"]
)

# ADD REVERSED CURRENCY RATIOS (RECIPROCALS)
# RESHAPE WIDE TO LONG FORMAT
fx_df_long = pd.melt(
    fx_df.assign(**{
        "GBP/EUR": lambda df: df["EUR/GBP"].rdiv(1),
        "EUR/USD": lambda df: df["USD/EUR"].rdiv(1)
    }),
    id_vars = "date",
    var_name = "currency_ratio",
    value_name = "fx_rate"
).drop_duplicates()  # the fx table repeats dates, which would otherwise duplicate merged rows

# MERGE AND CALCULATE (SAME-CURRENCY ROWS FIND NO MATCH, SO DEFAULT THE RATE TO 1)
income_df = (
    income_df.merge(
        fx_df_long,
        on = ["date", "currency_ratio"],
        how = "left"
    ).assign(
        total_income = lambda df: df["daily_income"].add(
            df["extra_income"].mul(df["fx_rate"].fillna(1))
        )
    )
)
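As a self-contained sketch of the whole pipeline (assuming the pairs are quoted base/quote, so converting the other way needs the reciprocal, and assuming a rate of 1 when both currencies match):

```python
import pandas as pd

income_df = pd.DataFrame({
    'date': ['23/3/22', '23/3/22', '24/3/22', '25/3/22'],
    'account': ['HP', 'HP', 'JJ', 'JJ'],
    'daily_income': [1000, 1000, 2000, 2000],
    'ccy of account': ['EUR', 'EUR', 'USD', 'USD'],
    'extra_income': [50, 35, 10, 12.5],
    'ccy of extra_income': ['EUR', 'GBP', 'EUR', 'USD'],
})
fx_df = pd.DataFrame({
    'date': ['23/3/22', '23/3/22', '24/3/22', '25/3/22'],
    'EUR/GBP': [0.833522, 0.833522, 0.833621, 0.833066],
    'USD/EUR': [0.90874, 0.90874, 0.91006, 0.90991],
})

# label each income row with the conversion it needs: extra ccy -> account ccy
income_df['currency_ratio'] = income_df['ccy of extra_income'] + '/' + income_df['ccy of account']

# reshape fx wide -> long, adding the reciprocal pairs so both directions exist;
# the fx table repeats dates, so dedupe to avoid duplicating income rows on merge
fx_long = pd.melt(
    fx_df.assign(**{'GBP/EUR': 1 / fx_df['EUR/GBP'], 'EUR/USD': 1 / fx_df['USD/EUR']}),
    id_vars='date', var_name='currency_ratio', value_name='fx_rate',
).drop_duplicates()

merged = income_df.merge(fx_long, on=['date', 'currency_ratio'], how='left')
# same-currency rows find no fx row; their rate defaults to 1
merged['total_income'] = merged['daily_income'] + merged['extra_income'] * merged['fx_rate'].fillna(1)
```

For the HP row on 23/3, the 35 GBP extra becomes 35 / 0.833522 ≈ 41.99 EUR, for a total of about 1041.99.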

How to rename all values with count 1 as 'Other'

I am classifying movies by genre (Action Adventure SciFi, Thriller Horror Action, ...) and so on. I get 200 classes, and 50 of those classes have only one value each when I group by. I want to find each of these rows (with occurrence = 1) and rename their genre as 'Other', so that the 'Other' count will be 50.
Please advise on the code.
The dataframe is df and the column name is genre.
Thanks
You could compute the frequency and use np.where to replace, like this:
import numpy as np

# compute the frequency of each genre, aligned to the original rows:
counts = df.groupby('genre')['genre'].transform('size')
# keep the genre where it occurs more than once, otherwise map it to 'Other':
df['new_genre'] = np.where(counts > 1, df['genre'], 'Other')
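A runnable sketch of the idea on a tiny made-up dataframe (selecting the `'genre'` column makes `transform` return a Series aligned to the rows):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D', 'E'],
    'genre': ['Action', 'Action', 'Drama', 'SciFi Horror', 'Drama'],
})

# per-row frequency of that row's genre
counts = df.groupby('genre')['genre'].transform('size')

# keep genres that occur more than once; lump the singletons into 'Other'
df['new_genre'] = np.where(counts > 1, df['genre'], 'Other')
```

'SciFi Horror' appears once, so it is the only row mapped to 'Other'.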

Inserting a column to a Pandas dataframe using another dataframe as a dictionary

I have a dataframe that looks like this:
item_id genre
14441607 COMEDY
14778825 CHILDREN'S
10227943 ACTION/ADVENTURE
10221687 DRAMA
14778833 ACTION/ADVENTURE
I have another dataframe which has sales data for each of the above items for 155 weeks:
item_id sales
10221687 1.2
10221687 0.98
"" ""
So, 155 such rows for each item. What I am wanting to do is to append the genre for each item into the sales dataframe. The resultant dataframe would look like this:
item_id sales genre
10221687 1.2 DRAMA
10221687 0.98 DRAMA
"" "" "
I have looked at DataFrame.insert(), but I don't see how to achieve this.
Considering your sales data is stored in df2 and the genre data in df1, the following merge will do it; passing on='item_id' makes the join key explicit:
dfMerge = df2.merge(df1, on='item_id', how='left')
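A minimal sketch with made-up values from the question; the shared `item_id` column drives the merge:

```python
import pandas as pd

# df1: one genre per item
df1 = pd.DataFrame({
    'item_id': [14441607, 10221687],
    'genre': ['COMEDY', 'DRAMA'],
})
# df2: many weekly sales rows per item
df2 = pd.DataFrame({
    'item_id': [10221687, 10221687, 14441607],
    'sales': [1.2, 0.98, 2.5],
})

# left merge keeps every sales row and appends the matching genre
dfMerge = df2.merge(df1, on='item_id', how='left')
```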

Get all rows that are the same and put into own dataframe

Say I have a dataframe where there are different values in a column, e.g.,
import numpy as np
import pandas as pd

raw_data = {'first_name': ['Jason', 'Molly', np.nan, np.nan, np.nan],
            'nationality': ['USA', 'USA', 'France', 'UK', 'UK'],
            'age': [42, 52, 36, 24, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'nationality', 'age'])
df
How do I create new dataframes, where each dataframe contains only the values for USA, only the values for UK, and only the values for France? But here is the thing: say I don't want to specify a condition like
Don't want this
# Create variable with TRUE if nationality is USA
american = df['nationality'] == "USA"
I want the data aggregated for each nationality, whatever the nationality is, without having to specify a nationality condition. I just want all the rows with the same nationality together in their own dataframe, including all the columns that pertain to those rows.
So, for example, the function
SplitDFIntoSeveralDFWhereColumnValueAllTheSame(column):
    code
will return an array of dataframes where, within each dataframe, all the values of the column are equal.
So if I had more data and more nationalities, the aggregation into new dataframes would work without changing the code.
This will give you a dictionary of dataframes where the keys are the unique values of the 'nationality' column and the values are the dataframes you are looking for.
{name: group for name, group in df.groupby('nationality')}
demo
dodf = {name: group for name, group in df.groupby('nationality')}
for k in dodf:
print(k, '\n'*2, dodf[k], '\n'*2)
France

   first_name nationality  age
2         NaN      France   36

USA

  first_name nationality  age
0      Jason         USA   42
1      Molly         USA   52

UK

  first_name nationality  age
3        NaN          UK   24
4        NaN          UK   70
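Wrapped as the generic function the question sketches (with a shorter, hypothetical name), the same dict comprehension works for any column:

```python
import numpy as np
import pandas as pd

def split_df_by_column(df, column):
    """Return a dict mapping each unique value in `column` to its sub-DataFrame."""
    return {name: group for name, group in df.groupby(column)}

raw_data = {'first_name': ['Jason', 'Molly', np.nan, np.nan, np.nan],
            'nationality': ['USA', 'USA', 'France', 'UK', 'UK'],
            'age': [42, 52, 36, 24, 70]}
df = pd.DataFrame(raw_data)

# one sub-dataframe per nationality, no condition spelled out anywhere
dodf = split_df_by_column(df, 'nationality')
```

Adding more rows or more nationalities changes nothing in the code; new keys simply appear in the dict.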