I am classifying movies by genre (Action, Adventure, SciFi, Thriller, Horror, and so on). I get 200 classes, and 50 of those classes have only one row when I group by genre. I want to rename the genre of each of these single-occurrence rows to 'Other', so the count for 'Other' becomes 50.
Please advise on the code.
The dataframe is df and the column name is genre.
Thanks
You could compute the frequency and use np.where to replace the rare values like this:
import numpy as np
# compute the per-genre frequency:
counts = df.groupby('genre')['genre'].transform('size')
# keep genres that occur more than once, map the rest to 'Other':
df['new_genre'] = np.where(counts > 1, df['genre'], 'Other')
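A minimal runnable sketch of that approach, using made-up genre data (the column names match the question; the sample values are invented):

```python
import numpy as np
import pandas as pd

# Made-up sample: 'Western' and 'Noir' each occur only once
df = pd.DataFrame({'genre': ['Action', 'Action', 'SciFi', 'SciFi', 'Western', 'Noir']})

# per-row frequency of each genre
counts = df.groupby('genre')['genre'].transform('size')
# genres seen more than once keep their name, the rest become 'Other'
df['new_genre'] = np.where(counts > 1, df['genre'], 'Other')
print(df['new_genre'].tolist())
# ['Action', 'Action', 'SciFi', 'SciFi', 'Other', 'Other']
```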
So I am new to using Python pandas DataFrames.
I have a dataframe with one column holding customer ids and the others holding flavors and satisfaction scores, looking something like this.
Although each customer should have 6 rows dedicated to them, Customer 1 only has 5. How do I create a new dataframe that contains only the customers who have 6 rows?
I tried df['Customer No'].value_counts() == 6 but it is not working.
Here is one way to do it.
If you post the data as code (preferably) or text, I would be able to share the result.
# create a temporary column 'c' by grouping on Customer No
# and assigning the group count to it using transform,
# finally using loc to select the rows where the count equals 6
(df.loc[df.assign(
    c=df.groupby(['Customer No'])['Customer No']
        .transform('count'))['c'].eq(6)]
)
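An alternative sketch of the same idea with groupby.filter, on an invented two-customer frame (the column name is assumed to match the question; here the required group size is 3 instead of 6 to keep the example small):

```python
import pandas as pd

# Invented data: customer 1 has 2 rows, customer 2 has the full 3
df = pd.DataFrame({'Customer No': [1, 1, 2, 2, 2],
                   'flavor': ['mint', 'lime', 'plum', 'pear', 'kiwi']})

# keep only the groups whose size equals the required row count
full = df.groupby('Customer No').filter(lambda g: len(g) == 3)
print(full['Customer No'].unique())  # [2]
```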
I am ranking a bunch of movies using pandas, from 1-100, and I was wondering how to create a separate column called score where the scores are the inverse. For example:
rank  score
1     100
2     99
3     98
...
100   1
Thank you!
If you already have a rank column with the values 1 to 100 (and the rows are in rank order):
df['rank'] = range(1, 101)
df['score'] = range(len(df), 0, -1)
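If the rows are not guaranteed to be in rank order, computing the score arithmetically from the rank itself is safer; a small sketch with an invented, deliberately unsorted frame:

```python
import pandas as pd

df = pd.DataFrame({'rank': [3, 1, 2]})  # invented, deliberately unsorted
# invert the ranking: rank 1 gets the highest score
df['score'] = df['rank'].max() + 1 - df['rank']
print(df['score'].tolist())  # [1, 3, 2]
```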
US_Sales=pd.read_excel("C:\\Users\\xxxxx\\Desktop\\US_Sales.xlsx")
US_Sales
US_Sales.State.nlargest(2,'Sales').groupby(['Sales'])
I want the second-highest Sales for each State.
No sample data was provided, so it is simulated below.
Sort, shift, and take the first value; that gives the result you want:
import pandas as pd

df = pd.DataFrame([{"state": "Florida", "sales": [22, 4, 5, 6, 7, 8]},
                   {"state": "California", "sales": [99, 9, 10, 11]}]).explode("sales").reset_index(drop=True)
df.sort_values(["state", "sales"], ascending=[True, False]).groupby("state").agg({"sales": lambda x: x.shift(-1).values[0]})
            sales
state
California     11
Florida         8
A utility function:
import functools

def nlargest(x, n=2):
    # sort descending, then shift up by n-1 so the nth-largest value comes first
    return x.sort_values(ascending=False).shift(-(n - 1)).values[0]

df.groupby("state", as_index=False).agg({"sales": functools.partial(nlargest, n=2)})
You can sort the Sales column descending, then take the 2nd row in each group with pandas.core.groupby.GroupBy.nth(). Note that n in nth() is zero-indexed.
US_Sales.sort_values(['State', 'Sales'], ascending=[True, False]).groupby('State').nth(1).reset_index()
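A self-contained sketch of that nth(1) approach on simulated data (the State and Sales values are invented):

```python
import pandas as pd

# Simulated stand-in for US_Sales
US_Sales = pd.DataFrame({'State': ['FL', 'FL', 'FL', 'CA', 'CA'],
                         'Sales': [22, 8, 5, 99, 11]})

# sort Sales descending within each State, then take the 2nd row (n=1)
second = (US_Sales.sort_values(['State', 'Sales'], ascending=[True, False])
                  .groupby('State')
                  .nth(1))
print(second)
```

Note that the exact shape of the nth() result changed in pandas 2.0 (it now returns the original rows rather than a per-group aggregate), but the selected Sales values are the same either way.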
You can also choose the largest 2 values then keep the last by various methods:
largest2 = US_Sales.sort_values(['State', 'Sales'], ascending=[True, False]).groupby('State')['Sales'].nlargest(2)
# Method 1
# Drop duplicates by `State`, keep the last one
largest2.reset_index().drop('level_1', axis=1).drop_duplicates(['State'], keep='last')
# Method 2
# Group by `State`, keep the last one
largest2.groupby('State').tail(1).reset_index().drop('level_1', axis=1)
I want to sample n rows for each different value in the column named club.
columns = ['long_name','age','dob','height_cm','weight_kg','club']
teams = ['Real Madrid','FC Barcelona','Chelsea','CA Osasuna','Paris Saint-Germain','FC Bayern München','Atlético Madrid','Manchester City','Liverpool','Hull City']
playersDataDB = playersData.loc[playersData['club'].isin(teams)][columns]
playersDataDB.head()
In the code above I have selected my desired columns based on the players belonging to the selected teams.
The output of this code is a 299 rows × 6 columns DataFrame, meaning that I am getting all the players from those teams, but I want just 16 of them from each club.
Not sure what your dataframe looks like, but you could group by club and then use head(16) to keep only the first 16 rows of each group.
df.groupby('club').head(16)
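If the 16 rows should be a random sample rather than the first 16 in row order, GroupBy.sample draws per group directly; a toy sketch (made-up clubs, sampling 2 per club instead of 16):

```python
import pandas as pd

# Toy stand-in for playersDataDB: three players per club
df = pd.DataFrame({'club': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'long_name': ['u', 'v', 'w', 'x', 'y', 'z']})

# draw 2 random rows from each club (random_state for reproducibility)
picked = df.groupby('club').sample(n=2, random_state=0)
print(len(picked))  # 4
```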
You can use isin like this:
playersDataDB = playersData[playersData['club'].isin(teams)]
playersDataDB.head()
I have a dataframe that looks like this:
In [268]: dft.head()
Out[268]:
ticker BYND UBER UBER UBER ... ZM ZM BYND ZM
0 analyst worlds uber revenue ... company owning pet things
1 moskow apac note uber ... try things humanization users
2 growth anheuserbusch growth target ... postipo unicorn products revenue
3 stock kong analysts raised ... software revenue things million
4 target uberbeating stock rising ... earnings million pets direct
[5 rows x 500 columns]
In [269]: dft.columns.unique()
Out[269]: Index(['BYND', 'UBER', 'LYFT', 'SPY', 'WORK', 'CRWD', 'ZM'], dtype='object', name='ticker')
How do I combine the columns so that there is only a single column for each unique ticker name?
Maybe you should try making a copy of the column you wish to join, then extending the first column with that copy.
Code :
First convert all the column names to one case, either lower or upper, so that there is no mismatch in header case.
def merge_(df):
    '''Return a data-frame where columns sharing the same lowercase name are merged.'''
    # Get the set of unique column names in lowercase
    columns = set(map(str.lower, df.columns))
    # Start from empty strings so that '+' concatenates the text cells
    df1 = pd.DataFrame('', index=df.index, columns=sorted(columns))
    # Walk the columns by position so duplicate column names are handled too
    for i, col in enumerate(df.columns):
        df1[col.lower()] += df.iloc[:, i]  # words are str, so '+' concatenates
    return df1
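A quick sanity check of that merging idea on a toy frame with duplicate ticker columns (the cell values are invented); iterating by position keeps it safe even when two columns have exactly the same name:

```python
import pandas as pd

def merge_(df):
    '''Concatenate the text cells of columns sharing the same lowercase name.'''
    columns = set(map(str.lower, df.columns))
    out = pd.DataFrame('', index=df.index, columns=sorted(columns))
    for i, col in enumerate(df.columns):  # by position: safe with duplicate names
        out[col.lower()] += df.iloc[:, i]
    return out

df = pd.DataFrame([['analyst', 'uber', 'zoom']], columns=['UBER', 'UBER', 'ZM'])
print(merge_(df).loc[0, 'uber'])  # analystuber
```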