Inserting a column into a Pandas dataframe using another dataframe as a dictionary - pandas

I have a dataframe that looks like this:
item_id genre
14441607 COMEDY
14778825 CHILDREN'S
10227943 ACTION/ADVENTURE
10221687 DRAMA
14778833 ACTION/ADVENTURE
I have another dataframe which has sales data for each of the above items for 155 weeks:
item_id sales
10221687 1.2
10221687 0.98
... ...
So, 155 such rows for each item. What I want to do is append the genre for each item to the sales dataframe. The resulting dataframe would look like this:
item_id sales genre
10221687 1.2 DRAMA
10221687 0.98 DRAMA
... ... ...
I have looked at DataFrame.insert(), but I don't see how to achieve this.

Considering your sales data is stored in df2 and the genre data in df1, the following merge will do it:
dfMerge = df2.merge(df1, how='left')
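A runnable sketch of that merge, using the sample rows from the question (frame names df1/df2 as above):

```python
import pandas as pd

# Genre lookup (df1) and per-week sales (df2), mirroring the question's data
df1 = pd.DataFrame({
    "item_id": [14441607, 14778825, 10227943, 10221687, 14778833],
    "genre": ["COMEDY", "CHILDREN'S", "ACTION/ADVENTURE", "DRAMA", "ACTION/ADVENTURE"],
})
df2 = pd.DataFrame({
    "item_id": [10221687, 10221687],
    "sales": [1.2, 0.98],
})

# Left merge keeps every sales row and attaches the matching genre
dfMerge = df2.merge(df1, on="item_id", how="left")
```

Passing on='item_id' explicitly is optional here, since merge defaults to the shared column, but it makes the join key obvious.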

Related

Pandas sum corresponding values based on values in another column

df1 contains Itemlist1 and Itemlist2, where each cell can contain any number of items. df2 contains the Price and Cost of each item.
I want to obtain a final df with two new columns, Totalprice and Totalcost, added to df1. Totalprice and Totalcost are the sums over all the items in each row of df1.
I managed to arrive at df3, where each item is put in a separate cell. Any suggestions from here, please? Thank you.
From your df3, do the replace, then sum with axis=1
cost_dict = dict(zip(df2.Itemcode,df2.Cost))
price_dict = dict(zip(df2.Itemcode,df2.Price))
df1['totalcost'] = df3.replace(cost_dict).sum(axis=1)
df1['totalprice'] = df3.replace(price_dict).sum(axis=1)
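A minimal sketch of how the replace-then-sum works, with made-up data (the Itemcode column name and the two-column df3 are assumptions):

```python
import pandas as pd

# Hypothetical lookup table (df2) and exploded item grid (df3)
df2 = pd.DataFrame({"Itemcode": ["A", "B", "C"],
                    "Price": [10, 20, 30],
                    "Cost": [1, 2, 3]})
df3 = pd.DataFrame({"item1": ["A", "B"],
                    "item2": ["C", "A"]})

cost_dict = dict(zip(df2.Itemcode, df2.Cost))
price_dict = dict(zip(df2.Itemcode, df2.Price))

# Each item code becomes its cost/price, then rows are summed across columns
totalcost = df3.replace(cost_dict).sum(axis=1)    # row 0: 1 + 3, row 1: 2 + 1
totalprice = df3.replace(price_dict).sum(axis=1)  # row 0: 10 + 30, row 1: 20 + 10
```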

Pandas Dataframe MultiIndex groupby with 2 levels including "all" for both levels

I have a Dataframe with jobtype and age_group as categorical variables, and F1 and F2 as numerical variables. Each row is one person's reply to a questionnaire, and I want to calculate the mean values for each jobtype - age_group combination:
jobtype | age_group | F1 | F2
"office" | "20-30" | 1.2| 2.4
"hospital"| "40-50" | 2.3| 5.4
...
I have calculated the mean values for each combination of jobtype and age_group by
data_means_by_jobtype_age_group = data.groupby(["jobtype", "age_group"]).mean()
The result is a MultiIndex Dataframe with each combination of jobtype and age_group in the index and the mean values of F1 and F2, as it should be. But isn't there a neat way to include "all" and "all" in the combinations, that is, "all ages without filtering, for each jobtype" and "all jobtypes without filtering, for each age group"?
I have now done this separately, but combining and renaming the indices seems like a lot of work. Is there a neat way to do this in one groupby statement?
data_means_by_jobtype = data.groupby(["jobtype"]).mean()
and vice versa
data_means_by_age_group = data.groupby(["age_group"]).mean()
I think the simplest approach is to overwrite each column with the constant value 'All' before aggregating the mean, then join the pieces together with concat:
df1 = data.groupby(["jobtype", "age_group"]).mean()
df2 = data.assign(age_group='All').groupby(["jobtype", "age_group"]).mean()
df3 = data.assign(jobtype='All').groupby(["jobtype", "age_group"]).mean()
df = pd.concat([df1, df2, df3])
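A small end-to-end sketch with toy questionnaire data to show the shape of the result:

```python
import pandas as pd

data = pd.DataFrame({
    "jobtype":   ["office", "office", "hospital"],
    "age_group": ["20-30", "40-50", "40-50"],
    "F1": [1.0, 2.0, 3.0],
    "F2": [2.0, 4.0, 6.0],
})

# Per-combination means plus 'All' rows for each marginal
df1 = data.groupby(["jobtype", "age_group"]).mean()
df2 = data.assign(age_group="All").groupby(["jobtype", "age_group"]).mean()
df3 = data.assign(jobtype="All").groupby(["jobtype", "age_group"]).mean()
df = pd.concat([df1, df2, df3])

# ('office', 'All') is the mean over all office rows,
# ('All', '40-50') the mean over all 40-50 rows, and so on
```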

Extracting a value from a pd dataframe

I have a dataframe column such as below.
{"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/film%20&%20video/narrative%20film"}},"color":16734574,"parent_id":11,"name":"Narrative Film","id":31,"position":13,"slug":"film & video/narrative film"}
I want to extract the info against the word 'slug' (in this instance it is film & video/narrative film) and store the info as a new dataframe column.
How can I do this?
Many thanks
This is a (nested) dictionary with different kinds of entries, so it does not make much sense to treat it as a DataFrame column. You could treat it as a DataFrame row, with the dictionary keys giving the column names:
import pandas as pd
d = {"urls": {"web": {"discover": "http://www.kickstarter.com/discover/categories/film%20&%20video/narrative%20film"}},
     "color": 16734574, "parent_id": 11, "name": "Narrative Film", "id": 31, "position": 13,
     "slug": "film & video/narrative film"}  # named 'd' to avoid shadowing the builtin dict
df = pd.DataFrame(d, index=[0])
display(df)
Output:
urls color parent_id name id position slug
0 NaN 16734574 11 Narrative Film 31 13 film & video/narrative film
Note that the urls entry is not recognized, due to the sub-dictionary.
In any case, this does yield slug as a column, so please let me know if this answers your question.
Of course, you could also extract the slug entry directly from your dictionary:
d['slug']
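If the column actually holds one such dictionary per row, the per-row extraction can be sketched like this (the column name category is an assumption):

```python
import pandas as pd

df = pd.DataFrame({"category": [
    {"slug": "film & video/narrative film", "id": 31},
    {"slug": "music/rock", "id": 14},
]})

# Pull the 'slug' entry out of each row's dictionary into a new column
df["slug"] = df["category"].apply(lambda d: d.get("slug"))
```

If the column holds JSON strings rather than dicts, parse them with json.loads first.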

How can I combine same-named columns into one in a pandas dataframe so all the columns are unique?

I have a dataframe that looks like this:
In [268]: dft.head()
Out[268]:
ticker BYND UBER UBER UBER ... ZM ZM BYND ZM
0 analyst worlds uber revenue ... company owning pet things
1 moskow apac note uber ... try things humanization users
2 growth anheuserbusch growth target ... postipo unicorn products revenue
3 stock kong analysts raised ... software revenue things million
4 target uberbeating stock rising ... earnings million pets direct
[5 rows x 500 columns]
In [269]: dft.columns.unique()
Out[269]: Index(['BYND', 'UBER', 'LYFT', 'SPY', 'WORK', 'CRWD', 'ZM'], dtype='object', name='ticker')
How do I combine the columns so there is only a single unique column name for each ticker?
One approach is to build one column per unique name and extend it with the values of every same-named duplicate.
Code:
First, convert all column names to one case (lower or upper) so that there is no mismatch in header case:
import pandas as pd

def merge_(df):
    '''Return a frame where same-named (case-insensitive) columns are concatenated'''
    # Get the set of unique column names in lowercase
    columns = set(map(str.lower, df.columns))
    # Start from empty strings so '+' concatenates cleanly
    df1 = pd.DataFrame('', index=df.index, columns=sorted(columns))
    # Merge the matching columns; select by position, because duplicate
    # labels would make df[col] return a whole sub-frame
    for i, col in enumerate(df.columns):
        df1[col.lower()] += df.iloc[:, i]  # words are str, so '+' concatenates
    return df1
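An alternative sketch that avoids the per-column loop entirely: transpose, group the rows by lower-cased label, and join the strings in each group (toy data; the space-separated join is an assumption):

```python
import pandas as pd

# Toy frame with duplicate column labels
dft = pd.DataFrame([["a", "b", "c"],
                    ["d", "e", "f"]],
                   columns=["UBER", "uber", "BYND"])

# Transpose so duplicate labels become rows, group them case-insensitively,
# join each group's strings, then transpose back
merged = dft.T.groupby(dft.columns.str.lower()).agg(" ".join).T
```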

Sum column's values from duplicate rows python3

I have an old.csv like this:
Name,State,Brand,Model,Price
Adam,MO,Toyota,RV4,26500
Berry,KS,Toyota,Camry,18000
Berry,KS,Toyota,Camry,12000
Kavin,CA,Ford,F150,23000
Yuke,OR,Nissan,Murano,31000
and I need a new.csv like this:
Name,State,Brand,Model,Price
Adam,MO,Toyota,RV4,26500
Berry,KS,Toyota,Camry,30000
Kavin,CA,Ford,F150,23000
Yuke,OR,Nissan,Murano,31000
As you can see the difference from these two is:
Berry,KS,Toyota,Camry,18000
Berry,KS,Toyota,Camry,12000
merge to
Berry,KS,Toyota,Camry,30000
Here is my code:
import pandas as pd
df=pd.read_csv('old.csv')
df1=df.sort_values('Name').groupby('Name','State','Brand','Model')
.agg({'Name':'first','Price':'sum'})
print(df1[['Name','State','Brand','Model','Price']])
It didn't work, and I got this error:
File "------\venv\lib\site-packages\pandas\core\frame.py", line 4421, in sort_values stacklevel=stacklevel)
File "------- \venv\lib\site-packages\pandas\core\generic.py", line 1382, in _get_label_or_level_values raise KeyError(key)
KeyError: 'Name'
I am totally new to Python, and I found a solution on Stack Overflow:
Sum values from Duplicated rows
The question above is similar to mine, but its answer is SQL code, not Python.
Any help will be greatly appreciated.
import pandas as pd
df = pd.read_csv('old.csv')
Group by the four fields ('Name', 'State', 'Brand', 'Model'), select the Price column, and apply the sum aggregate to it:
df1 = df.groupby(['Name', 'State', 'Brand', 'Model'])['Price'].agg(['sum'])
print(df1)
This will give you the required output:
sum
Name State Brand Model
Adam MO Toyota RV4 26500
Berry KS Toyota Camry 30000
Kavin CA Ford F150 23000
Yuke OR Nissan Murano 31000
Note: there is only one column, sum, in df1. The other four columns are index levels, so to write this to CSV we first need to convert those four index levels back into regular dataframe columns.
list(df1['sum'].index.get_level_values('Name')) will give you an output like this,
['Adam', 'Berry', 'Kavin', 'Yuke']
Now, for all indexes, do this,
df2 = pd.DataFrame()
cols = ['Name', 'State', 'Brand', 'Model']
for col in cols:
    df2[col] = list(df1['sum'].index.get_level_values(col))
df2['Price'] = df1['sum'].values
Now, just write df2 to a CSV file like this:
df2.to_csv('new.csv', index = False)
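For reference, the manual index unpacking above can be skipped with as_index=False, which keeps the grouping keys as ordinary columns (a sketch with the question's data):

```python
import pandas as pd

df = pd.DataFrame({
    "Name":  ["Adam", "Berry", "Berry"],
    "State": ["MO", "KS", "KS"],
    "Brand": ["Toyota", "Toyota", "Toyota"],
    "Model": ["RV4", "Camry", "Camry"],
    "Price": [26500, 18000, 12000],
})

# as_index=False keeps Name/State/Brand/Model as regular columns,
# so the result can be written straight to CSV
df2 = df.groupby(["Name", "State", "Brand", "Model"], as_index=False)["Price"].sum()
df2.to_csv("new.csv", index=False)
```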