CREATE A CONDITION ON THE .LOC FUNCTION WORKING WITH A LIST OF LABELS - indexing

I have two dataframes with kinds of foods in their indexes:
df_1.index.name = 'foods_column'
df_2.index.name = 'foods_column'
df_1              df_2
foods_column      foods_column
rice              rice
nuts              nuts
pizza             coffee
coffee            coffee
nutella           nutella
milk              milk
I want to select only these labels:
labels =["rice", "nuts", "pizza"]
df_1_new = df_1.loc[labels]
df_2_new = df_2.loc[labels]
But "pizza" doesn't appear in df_2, and Python gives me this error:
KeyError: "['pizza'] not in index"

Try this. Since foods_column is the index of df_2 (not a regular column), build the mask from the index:
df_2_new = df_2.loc[df_2.index.isin(labels)]
See more info here
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html
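Because foods_column is the index rather than a regular column, the mask has to come from the index itself. A minimal runnable sketch (the data is reconstructed from the question; the price column is a made-up placeholder):

```python
import pandas as pd

# Reconstructed df_2 from the question; the price values are invented.
df_2 = pd.DataFrame(
    {"price": [1, 2, 3, 4, 5]},
    index=pd.Index(["rice", "nuts", "coffee", "nutella", "milk"], name="foods_column"),
)
labels = ["rice", "nuts", "pizza"]

# Boolean mask over the index keeps only the labels that actually exist:
df_2_new = df_2.loc[df_2.index.isin(labels)]
print(df_2_new.index.tolist())  # ['rice', 'nuts']

# Equivalent: intersect the index with the label list first.
df_2_new = df_2.loc[df_2.index.intersection(labels)]
```

Either way, the missing "pizza" label is silently skipped instead of raising a KeyError.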

Related

How to group by a column of lists

I have an imaginary movie dataframe. I would like to group Sales by the values in the list of the Genre column. How can I do it (preferably without exploding the Genre column)? For example, the total sales by genre.
Thanks
data = {
    "Movie": ["Avatar", "Leap Year", "Life is Beautiful", "Roman Holiday"],
    "Sales": [5000, 2500, 2800, 4050],
    "Genre": [["Sci-fi", "Action"], ["Romantic", "Comedy"], ["Tragic", "Comdey"], ["Romantic"]]
}
df = pd.DataFrame(data)
sales_by_genre = df.groupby(df['Genre'].map(tuple))['Sales'].sum()  # <<< This line not working
I can't think of a straightforward way of doing this without exploding the list, so here is an example with explode:
df = df.explode(column='Genre', ignore_index=True)[['Sales','Genre']].groupby('Genre').sum()
print(df)
result:
          Sales
Genre
Action     5000
Comdey     2800
Comedy     2500
Romantic   6550
Sci-fi     5000
Tragic     2800
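Since the question asks for a way that avoids explode, one alternative (a sketch, not necessarily faster) is to accumulate per-genre totals with collections.Counter and wrap the result in a Series:

```python
from collections import Counter

import pandas as pd

# The movie data from the question (including its 'Comdey' typo).
data = {
    "Movie": ["Avatar", "Leap Year", "Life is Beautiful", "Roman Holiday"],
    "Sales": [5000, 2500, 2800, 4050],
    "Genre": [["Sci-fi", "Action"], ["Romantic", "Comedy"],
              ["Tragic", "Comdey"], ["Romantic"]],
}
df = pd.DataFrame(data)

# Walk the rows once, crediting every genre in a row's list with that
# row's sales -- no exploded frame is ever materialized.
totals = Counter()
for sales, genres in zip(df["Sales"], df["Genre"]):
    for genre in genres:
        totals[genre] += sales

sales_by_genre = pd.Series(totals, name="Sales").sort_index()
print(sales_by_genre)
```

This produces the same totals as the explode version (e.g. Romantic 6550).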

Is there anything in Pandas similar to dplyr's 'list columns'

I'm currently transitioning from R to Python in my data analysis, and there's one thing I haven't seen in any tutorials out there: is there anything in Pandas similar to dplyr's 'list columns' ?
Link to reference:
https://www.rstudio.com/resources/webinars/how-to-work-with-list-columns/
pandas will accept any object type, including lists, in an object type column.
df = pd.DataFrame()
df['genre']=['drama, comedy, action', 'romance, sci-fi, drama','horror']
df.genre = df.genre.str.split(', ')
print(df, '\n', df.genre.dtype, '\n', type(df.genre[0]))
# Output:
genre
0 [drama, comedy, action]
1 [romance, sci-fi, drama]
2 [horror]
object
<class 'list'>
We can see that:
genre is a column of lists.
The dtype of the genre column is object
The type of the first value of genre is list.
There are a number of str functions that work with lists.
For example:
print(df.genre.str.join(' | '))
# Output:
0 drama | comedy | action
1 romance | sci-fi | drama
2 horror
Name: genre, dtype: object
print(df.genre.str[::2])
# Output:
0 [drama, action]
1 [romance, drama]
2 [horror]
Name: genre, dtype: object
Others can typically be done with an apply function if there isn't a built-in method:
print(df.genre.apply(lambda x: max(x)))
# Output:
0 drama
1 sci-fi
2 horror
Name: genre, dtype: object
See the documentation for more... pandas str functions
As for nesting dataframes within one another, it is possible, but I believe it's considered an anti-pattern, and pandas will fight you the whole way there:
data = {'df1': df, 'df2': df}
df2 = pd.Series(data.values(), data.keys()).to_frame()
df2.columns = ['dfs']
print(df2)
# Output:
dfs
df1 genre
0 [drama, comedy...
df2 genre
0 [drama, comedy...
print(df2['dfs'][0])
# Output:
genre
0 [drama, comedy, action]
1 [romance, sci-fi, drama]
2 [horror]
See:
Link1
Link2
A possibly acceptable workaround would be storing them as numpy arrays:
df2 = df2.applymap(np.array)
print(df2)
print(df2['dfs'][0])
# Output:
dfs
df1 [[[drama, comedy, action]], [[romance, sci-fi,...
df2 [[[drama, comedy, action]], [[romance, sci-fi,...
array([[list(['drama', 'comedy', 'action'])],
[list(['romance', 'sci-fi', 'drama'])],
[list(['horror'])]], dtype=object)
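For completeness, the closest pandas analogues to dplyr's nest()/unnest() on list columns are groupby(...).agg(list) and explode(). A small sketch with toy data (explode's ignore_index parameter needs pandas >= 1.1):

```python
import pandas as pd

df = pd.DataFrame({
    "rating": [7, 7, 8],
    "genre": ["drama", "comedy", "horror"],
})

# "nest": collapse genre into a list column, one row per rating.
nested = df.groupby("rating", as_index=False).agg({"genre": list})
print(nested)

# "unnest": one row per list element, back to the long frame.
long = nested.explode("genre", ignore_index=True)
print(long)
```

The round trip is lossless here because each genre belongs to exactly one rating group.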

Trying to print entire dataframe after str.replace on one column

I can't figure out why this is throwing the error:
KeyError(f"None of [{key}] are in the [{axis_name}]")
here is the code:
def get_company_name(df):
    company_name = [col for col in df if col.lower().startswith('comp')]
    return company_name

df = df[df[get_company_name(master_leads)[0]].str.replace(punc, '', regex=True)]
this is what df.head() looks like:
Company / Account Website
0 Big Moose RV, & Boat Sales, Service, Camper Re... https://bigmooservsales.com/
1 Holifield Pest Management of Hattiesburg NaN
2 Steve Nichols Insurance NaN
3 Sandel Law Firm sandellaw.com
4 Duplicate - Checkered Flag FIAT of Newport News NaN
I have tried putting the [] in every place possible, but I must be missing something. I was under the impression that this is how you run transformations on one column of a dataframe without pulling the series out of the dataframe.
Thanks!
The outer df[...] is what raises the KeyError: it takes the cleaned company-name strings and tries to use them as column labels. Drop the outer indexing and assign the cleaned column back instead.
You can get the first company column name with
company_name_col = [col for col in df if col.lower().startswith('comp')][0]
You can preview the cleaned-up company names with
df[company_name_col].str.replace(punc, "", regex=True)
and to apply the replacement, assign it back:
df[company_name_col] = df[company_name_col].str.replace(punc, "", regex=True)
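Put together, a runnable sketch of the whole fix. The question never shows what punc is, so a generic strip-all-punctuation pattern is assumed here:

```python
import pandas as pd

punc = r"[^\w\s]"  # assumption: drop anything that isn't a word character or whitespace

# Two rows modeled on the df.head() shown in the question.
df = pd.DataFrame({
    "Company / Account": ["Big Moose RV, & Boat Sales", "Sandel Law Firm"],
    "Website": ["https://bigmooservsales.com/", "sandellaw.com"],
})

# Find the first column whose name starts with 'comp' (case-insensitive).
company_name_col = [col for col in df if col.lower().startswith("comp")][0]

# Clean that one column in place; the rest of the frame is untouched.
df[company_name_col] = df[company_name_col].str.replace(punc, "", regex=True)
print(df[company_name_col].tolist())
```

Note that the whole dataframe survives; only the company column is rewritten.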

Sum a column's values from duplicate rows (Python 3)

I have an old.csv like this:
Name,State,Brand,Model,Price
Adam,MO,Toyota,RV4,26500
Berry,KS,Toyota,Camry,18000
Berry,KS,Toyota,Camry,12000
Kavin,CA,Ford,F150,23000
Yuke,OR,Nissan,Murano,31000
and I need a new.csv like this:
Name,State,Brand,Model,Price
Adam,MO,Toyota,RV4,26500
Berry,KS,Toyota,Camry,30000
Kavin,CA,Ford,F150,23000
Yuke,OR,Nissan,Murano,31000
As you can see, the difference between these two is that:
Berry,KS,Toyota,Camry,18000
Berry,KS,Toyota,Camry,12000
are merged into
Berry,KS,Toyota,Camry,30000
Here is my code:
import pandas as pd

df = pd.read_csv('old.csv')
df1 = df.sort_values('Name').groupby('Name','State','Brand','Model') \
        .agg({'Name':'first','Price':'sum'})
print(df1[['Name','State','Brand','Model','Price']])
It didn't work, and I got this error:
File "------\venv\lib\site-packages\pandas\core\frame.py", line 4421, in sort_values stacklevel=stacklevel)
File "------- \venv\lib\site-packages\pandas\core\generic.py", line 1382, in _get_label_or_level_values raise KeyError(key)
KeyError: 'Name'
I am totally new to Python, and I found a solution on Stack Overflow:
Sum values from Duplicated rows
That question is similar to mine, but it's SQL code, not Python.
Any help will be greatly appreciated.
import pandas as pd
df = pd.read_csv('old.csv')
Group by the 4 fields ('Name', 'State', 'Brand', 'Model'), select the Price column, and apply the sum aggregate to it:
df1 = df.groupby(['Name', 'State', 'Brand', 'Model'])['Price'].agg(['sum'])
print(df1)
This will give you the required output:
                          sum
Name  State Brand  Model
Adam  MO    Toyota RV4    26500
Berry KS    Toyota Camry  30000
Kavin CA    Ford   F150   23000
Yuke  OR    Nissan Murano 31000
Note: df1 has only one column, sum. The other 4 fields are index levels, so to write a proper csv we first need to convert those index levels back into dataframe columns.
list(df1['sum'].index.get_level_values('Name')) will give you an output like this,
['Adam', 'Berry', 'Kavin', 'Yuke']
Now, for all index levels, do this:
df2 = pd.DataFrame()
cols = ['Name', 'State', 'Brand', 'Model']
for col in cols:
    df2[col] = list(df1['sum'].index.get_level_values(col))
df2['Price'] = df1['sum'].values
Now, just write df2 to a csv file like this:
df2.to_csv('new.csv', index = False)
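For what it's worth, the index juggling can be avoided entirely: passing as_index=False to groupby keeps the four fields as regular columns, so the result can be written straight to csv. A sketch with the question's data inlined so it is self-contained:

```python
import io

import pandas as pd

# old.csv from the question, inlined via StringIO.
old_csv = io.StringIO(
    "Name,State,Brand,Model,Price\n"
    "Adam,MO,Toyota,RV4,26500\n"
    "Berry,KS,Toyota,Camry,18000\n"
    "Berry,KS,Toyota,Camry,12000\n"
    "Kavin,CA,Ford,F150,23000\n"
    "Yuke,OR,Nissan,Murano,31000\n"
)
df = pd.read_csv(old_csv)

# as_index=False keeps the grouping fields as columns instead of index levels.
df2 = df.groupby(["Name", "State", "Brand", "Model"], as_index=False)["Price"].sum()
df2.to_csv("new.csv", index=False)
print(df2)
```

The duplicate Berry/Camry rows collapse into one row with Price 30000, and new.csv has the same five columns as old.csv.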

For loop to create pandas dataframes - varying dataframe names?

I would like to create 3 dataframes as follows:
basket = [['Apple', 'Banana', 'Orange']]
for fruit in basket:
    fruit = pd.DataFrame(np.random.rand(10,3))
However, after running this, running something like:
Apple
Gives the error
NameError: name 'Apple' is not defined
But 'fruit' as a dataframe does work.
How is it possible to have each dataframe produced take a variable as its name?
This would work:
basket = ['Apple', 'Banana', 'Orange']
for fruit in basket:
    vars()[fruit] = pd.DataFrame(np.random.rand(10,3))
However it would be better practice to perhaps assign to a dictionary e.g.:
var_dict = {}
basket = ['Apple', 'Banana', 'Orange']
for fruit in basket:
    var_dict[fruit] = pd.DataFrame(np.random.rand(10,3))
Instead of creating variables, use a dict to store the dfs; it's not good practice to create variables in a loop, i.e.
basket = ['Apple', 'Banana', 'Orange']
d_o_dfs = {x: pd.DataFrame(np.random.rand(10,3)) for x in basket}
Not recommended, but if you do want each dataframe in its own variable, use globals(), i.e.
for i in basket:
    globals()[i] = pd.DataFrame(np.random.rand(10,3))
Output of Banana and d_o_dfs['Banana']:
0 1 2
0 0.822190 0.115136 0.807569
1 0.698041 0.936516 0.438414
2 0.184000 0.772022 0.006315
3 0.684076 0.988414 0.991671
4 0.017289 0.560416 0.349688
5 0.379464 0.642631 0.373243
6 0.956938 0.485344 0.276470
7 0.910433 0.062117 0.670629
8 0.507549 0.393622 0.003585
9 0.878740 0.209498 0.498594
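One more point in the dict approach's favor: the names stay discoverable and the frames stay easy to iterate over, which globals()-created variables are not. A short sketch:

```python
import numpy as np
import pandas as pd

basket = ["Apple", "Banana", "Orange"]
d_o_dfs = {x: pd.DataFrame(np.random.rand(10, 3)) for x in basket}

# The dict's keys tell you exactly what you have; names pushed into
# globals() get buried among every other name in the session.
print(list(d_o_dfs))            # ['Apple', 'Banana', 'Orange']
print(d_o_dfs["Banana"].shape)  # (10, 3)

# And it is trivial to act on all of the frames at once:
means = {name: frame.mean().mean() for name, frame in d_o_dfs.items()}
```
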