Pandas DataFrame groupby on 3 columns and make one column lowercase

I have a dataframe:
country rating owner
0 England a John Smith
1 England b John Smith
2 France a Frank Foo
3 France a Frank foo
4 France a Frank Foo
5 France b Frank Foo
I'd like to produce a count of owners after grouping by country and rating, while
ignoring case
ignoring any spaces (leading, trailing or in between)
I am expecting:
country rating owner count
0 England a John Smith 1
1 England b John Smith 1
2 France a Frank Foo 3
3 France b Frank Foo 1
I have tried:
df.group_by(['rating','owner'])['owner'].count()
and
df.group_by(['rating','owner'].str.lower())['owner'].count()

Use title and replace to rework the string and groupby.size to aggregate:
out = (df.groupby(['country', 'rating',
df['owner'].str.title().str.replace(r'\s+', ' ', regex=True)])
.size().reset_index(name='count')
)
Output:
country rating owner count
0 England a John Smith 1
1 England b John Smith 1
2 France a Frank Foo 3
3 France b Frank Foo 1
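The snippet above does not trim leading or trailing spaces, which the question also asks for. A minimal self-contained sketch follows, assuming the question's sample data with extra spaces added to one row for illustration (that messy row is not in the original data):

```python
import pandas as pd

# Sample data rebuilt from the question; the " frank  foo " row is an
# illustrative assumption to exercise the whitespace handling.
df = pd.DataFrame({
    "country": ["England", "England", "France", "France", "France", "France"],
    "rating": ["a", "b", "a", "a", "a", "b"],
    "owner": ["John Smith", "John Smith", "Frank Foo", " frank  foo ",
              "Frank Foo", "Frank Foo"],
})

# Normalised grouping key: trim the ends, collapse whitespace runs, title-case.
owner_key = (
    df["owner"]
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)
    .str.title()
)

# Group by two column names plus the normalised Series, then count rows.
out = (
    df.groupby(["country", "rating", owner_key])
    .size()
    .reset_index(name="count")
)
print(out)
```

Because the key is passed as a Series rather than written back to the frame, the original owner column stays untouched.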

Use Series.str.strip, Series.str.title and Series.str.replace to collapse repeated spaces, then aggregate with GroupBy.size.
DataFrameGroupBy.count counts only non-missing values, which does not seem necessary here.
df1 = (df.groupby(['country','rating',
df['owner'].str.strip().str.title().str.replace(r'\s+',' ', regex=True)])
.size()
.reset_index(name='count'))
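To illustrate the difference between size and count mentioned above, here is a small sketch on made-up data (the missing owner is an assumption added for the demonstration):

```python
import pandas as pd

# Hypothetical data with one missing owner.
demo = pd.DataFrame({
    "country": ["France", "France", "France"],
    "owner": ["Frank Foo", None, "Frank Foo"],
})

grouped = demo.groupby("country")["owner"]
n_rows = grouped.size()       # counts every row in the group
n_present = grouped.count()   # counts only non-missing values

print(n_rows)
print(n_present)
```

If the owner column can never be missing, both give the same numbers.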

Related

Putting zeros in column except for the single row in the category in sql

Let's say I have a table that looks like that:
Name Category Subject Score
Alice 1 Math 2
Alice 1 Biology 3
Bob 2 Math 4
Bob 2 Biology 2
I would like to keep just one occurrence of Score in each batch of Category for each Name and set the rest to zero, so the result would be:
Alice 1 Math 2
Alice 1 Biology 0
Bob 2 Math 4
Bob 2 Biology 0
Is it possible to do?

How to group and filter by 2 columns?

Imagine the following dataset that represents owners of Restaurants and Bars in USA and UK:
Owner Property Country
0 John Restaurant UK
1 John Bar USA
2 George Bar USA
3 George Restaurant USA
How can I find the owners that have both types of properties in the same country?
Use DataFrameGroupBy.nunique with GroupBy.transform, compare the result with 2 and filter with boolean indexing:
df1 = df[df.groupby(['Owner', 'Country'])['Property'].transform('nunique').eq(2)]
print (df1)
Owner Property Country
2 George Bar USA
3 George Restaurant USA
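A self-contained version of the same approach, with the sample data rebuilt from the question. The intermediate names n_types, both and owners are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "Owner": ["John", "John", "George", "George"],
    "Property": ["Restaurant", "Bar", "Bar", "Restaurant"],
    "Country": ["UK", "USA", "USA", "USA"],
})

# Number of distinct property types per (Owner, Country), aligned to each row.
n_types = df.groupby(["Owner", "Country"])["Property"].transform("nunique")

# Keep rows whose group has both types, then reduce to one row per owner/country.
both = df[n_types.eq(2)]
owners = both[["Owner", "Country"]].drop_duplicates()
print(owners)
```

Comparing with 2 relies on there being exactly two property types in the data; with more types the threshold would need adjusting.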

How to differentiate mini dataframes appended to a bigger dataframe

I am trying to create a bigger dataframe from other dataframes, but I need to identify them separately. I want to create a new column with an index for every dataframe.
frames = [dataTotal,dataFrame]
dataTotal = dataTotal.append(dataFrame, ignore_index=False, sort=False)
I tried to use pd.concat with the keys argument, but it doesn't work because the dataframes have different sizes.
What should I do?
Example:
I have this dataframe and want to append another one to it, creating an index to differentiate them:
name LastName
0 Vitor Albres
1 Matheus Wilson
2 Andrew George
3 Filipe Dircksen
4 Eli Matthew
Add another dataframe
name LastName
0 Ana Lee
1 Renato Cristian
2 Joe Jonh
To create something like
data_id name LastName
0 0 Vitor Albres
1 Matheus Wilson
2 Andrew George
3 Filipe Dircksen
4 Eli Matthew
1 0 Ana Lee
1 Renato Cristian
2 Joe Jonh
Passing the keys argument doesn't work, because it says I can't concat dataframes with different levels. I don't know if I'm doing it wrong.
You can use pd.concat with the keys argument:
In [1831]: df1
Out[1831]:
name LastName
0 Vitor Albres
1 Matheus Wilson
2 Andrew George
3 Filipe Dircksen
4 Eli Matthew
In [1832]: df2
Out[1832]:
name LastName
0 Ana Lee
1 Renato Cristian
2 Joe Jonh
In [1830]: df_list = [df1, df2]
In [1833]: df = pd.concat(df_list, keys=range(len(df_list)))
Then name the MultiIndex levels using df.index.names:
In [1837]: df.index.names = ['data_id', '']
In [1838]: df
Out[1838]:
name LastName
data_id
0 0 Vitor Albres
1 Matheus Wilson
2 Andrew George
3 Filipe Dircksen
4 Eli Matthew
1 0 Ana Lee
1 Renato Cristian
2 Joe Jonh
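If data_id is wanted as an ordinary column rather than an index level, one possible sketch is to name the outer level at concat time and then reset only that level. The variable names frames, combined and flat are illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({
    "name": ["Vitor", "Matheus", "Andrew", "Filipe", "Eli"],
    "LastName": ["Albres", "Wilson", "George", "Dircksen", "Matthew"],
})
df2 = pd.DataFrame({
    "name": ["Ana", "Renato", "Joe"],
    "LastName": ["Lee", "Cristian", "Jonh"],
})

frames = [df1, df2]

# keys labels each input frame; names gives the new outer index level a name.
combined = pd.concat(frames, keys=range(len(frames)), names=["data_id"])

# Move only the data_id level into a column, keeping each frame's own row index.
flat = combined.reset_index(level="data_id")
print(flat)
```

The frames do not need to have the same number of rows for this to work.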

Manipulating series in a dataframe

My dataframe has a list of comma separated values in one column. I want to find the list of distinct entries, create a new column for each distinct entry in the dataframe, then fill the new columns with 1 or 0 depending on whether the row has the city name.
The idea is to use the new columns in building a logistic regression model.
As an example
Before
Name City
Jack NewYork,Chicago,Seattle
Jill Seattle, SanFrancisco
Ted Chicago,SanFrancisco
Bill NewYork,Seattle
After
Name NewYork Chicago Seattle SanFrancisco
Jack 1 1 1 0
Jill 0 0 1 1
Ted 0 1 0 1
Bill 1 0 1 0
You can do this with the get_dummies str method:
import pandas as pd
df = pd.DataFrame(
{"Name": ["Jack", "Jill", "Ted", "Bill"],
"City": ["NewYork,Chicago,Seattle", "Seattle,SanFrancisco", "Chicago,SanFrancisco", "NewYork,Seattle"]}
)
print(pd.concat((df, df.City.str.get_dummies(",")), axis=1))
Result:
Name City Chicago NewYork SanFrancisco Seattle
0 Jack NewYork,Chicago,Seattle 1 1 0 1
1 Jill Seattle,SanFrancisco 0 0 1 1
2 Ted Chicago,SanFrancisco 1 0 1 0
3 Bill NewYork,Seattle 0 1 0 1
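Note that the question's data has a space after one comma ("Seattle, SanFrancisco"), which would produce a separate " SanFrancisco" column. A hedged sketch that normalises the separators first and drops the original City column, assuming city names themselves contain no commas:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Jack", "Jill", "Ted", "Bill"],
    "City": ["NewYork,Chicago,Seattle", "Seattle, SanFrancisco",
             "Chicago,SanFrancisco", "NewYork,Seattle"],
})

# Remove any whitespace around the commas before building indicator columns.
dummies = (
    df["City"]
    .str.replace(r"\s*,\s*", ",", regex=True)
    .str.get_dummies(",")
)

# Keep only Name plus the 0/1 city columns.
result = pd.concat([df[["Name"]], dummies], axis=1)
print(result)
```

The indicator columns come out in alphabetical order, not in order of first appearance.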

split index column based on existence of a substring

I have the following df:
stuff
james__America by Estonia : 2
luke__Spain by Italy 3
michael 4
Louis__Portugal by USA 2
If the index contains the substring "__", I would like to split the index on it and then split the second part on ' by ' into 2 new columns, to get the following output:
name1 name2 stuff
james America Estonia 2
luke Spain Italy 3
michael 0 0 4
Louis Portugal USA 2
I thought of using:
df.index.str.split('__', expand=True).split(' by ',expand=True).rename(columns={0:'name1',1:'name2'})
However it does not seem to work.
Convert the Index to a Series with Index.to_series, split it on the first separator with Series.str.split, split the resulting second column on the second separator, join the original columns, and finally overwrite the index:
df1 = df.index.to_series().str.split('__', expand=True)
df2 = df1[1].str.split(' by ',expand=True).rename(columns={0:'name1',1:'name2'}).fillna('0')
df = df2.join(df)
df.index = df1[0].rename(None)
print (df)
name1 name2 stuff
james America Estonia 2
luke Spain Italy 3
michael 0 0 4
Louis Portugal USA 2
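The steps above can be run end to end as the following sketch. The sample frame is rebuilt from the question with clean index strings, and the names parts, names and out are illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    {"stuff": [2, 3, 4, 2]},
    index=["james__America by Estonia", "luke__Spain by Italy",
           "michael", "Louis__Portugal by USA"],
)

# First split: column 0 holds the name, column 1 the "X by Y" part (missing for michael).
parts = df.index.to_series().str.split("__", expand=True)

# Second split: break "X by Y" into two columns and fill the missing row with "0".
names = (
    parts[1]
    .str.split(" by ", expand=True)
    .rename(columns={0: "name1", 1: "name2"})
    .fillna("0")
)

# Join on the original index, then replace the index with the short name.
out = names.join(df)
out.index = parts[0].rename(None)
print(out)
```

The fill value is the string "0", so name1 and name2 remain text columns.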