Ranking pandas df by ordered groups - pandas

Consider the following pre-ordered dataframe:
import pandas as pd
df = pd.DataFrame({'Item':['Hat','Necklace','Bag','Bag','Necklace','Hat','Hat','Bag','Bag']})
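I want to dense-rank the Item column without affecting the current order of the dataframe.
Namely, each item should be ranked by the order in which it first appears, so 'Hat' gets 1, 'Necklace' 2 and 'Bag' 3.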
I have tried using
df['bad_rank'] = df['Item'].rank(ascending=False, method='dense').astype(int)
However, this is not what I want since 'Hat' is ranked second and should be ranked first.
I have also coded a dirty answer but I am surprised that there is not a simpler way using the rank method.
ordered_rank = dict(zip(df['Item'].unique(), range(1,len(df['Item'].unique())+1)))
df['good_rank'] = df['Item'].map(ordered_rank)
Would someone be willing to help me out?
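One possible shortcut, offered as a sketch rather than the definitive answer: pd.factorize numbers distinct values in order of first appearance, which is the same ordering the dict/zip workaround builds by hand. The column name good_rank is reused from the question.

```python
import pandas as pd

df = pd.DataFrame({'Item': ['Hat', 'Necklace', 'Bag', 'Bag', 'Necklace',
                            'Hat', 'Hat', 'Bag', 'Bag']})

# factorize numbers distinct values from 0 in order of first appearance
# (sort=False is the default), so adding 1 gives a 1-based dense rank.
codes, uniques = pd.factorize(df['Item'])
df['good_rank'] = codes + 1
print(df)
```

The row order of df is untouched, because factorize only computes one code per existing row.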

Related

pandas: split pandas columns of unequal length list into multiple columns

I have a dataframe with one column holding lists of unequal length, which I want to split into multiple columns (the item values will be the column names). An example is given below.
I have done this with iterrows, iterating through the rows and examining the list in each row. It seems workable as my dataframe has few rows. However, I wonder if there is a cleaner method.
I have also tried additional_df = pd.DataFrame(venue_df.location.values.tolist())
However, the lists break down as below.
Thanks for your help.
Can you try this code? It is built assuming venue_df.location contains the lists you have shown in the cells.
venue_df['school'] = venue_df.location.apply(lambda x: ('school' in x)+0)
venue_df['office'] = venue_df.location.apply(lambda x: ('office' in x)+0)
venue_df['home'] = venue_df.location.apply(lambda x: ('home' in x)+0)
venue_df['public_area'] = venue_df.location.apply(lambda x: ('public_area' in x)+0)
Hope this helps!
First, let's explode your location column so that each list item gets its own row.
import pandas as pd
s = venue_df['location'].explode()
Then let's use crosstab on that series, against its own index, to get your end result.
pd.crosstab(s.index, s)
I didn't test it out because I don't know your base df.
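A minimal, self-contained sketch of the explode-then-crosstab idea. The sample rows are hypothetical, since the real venue_df is not shown in the question; only the column name location is taken from it.

```python
import pandas as pd

# Hypothetical sample data; the real venue_df is not shown in the question.
venue_df = pd.DataFrame({
    'location': [['school', 'office'], ['home'], ['office', 'home', 'public_area']]
})

# One row per list item; the original row label is repeated in the index.
s = venue_df['location'].explode()

# Count each item per original row: rows are the original labels,
# columns are the distinct items.
indicators = pd.crosstab(s.index, s)
print(indicators)
```

Because crosstab counts occurrences, an item repeated inside one list would show up as 2 rather than 1. If only 0/1 flags are wanted, the counts can be clipped afterwards.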

Groupby does return previous df without changing it

df=pd.read_csv('../input/tipping/tips.csv')
df_1 = df.groupby(['day','time'])
df_1.head()
Guys, what am I missing here? It returns the previous dataframe to me, without any grouping.
We can print it using the following:
df_1 = df.groupby(['day','time']).apply(print)
groupby doesn't work the way you are assuming, by the sounds of it. Calling head on the grouped dataframe returns the first 5 rows of each group, kept in their original row order, so the result looks like a slice of the original dataframe rather than a grouped table. You can use @tlentali's approach to print out each group, but df_1 will not hold the grouped dataframe that way, because print returns None for every group.
The loop below gives a lot of control over how to display the groups and their keys.
It might also help you understand how the grouped dataframe structure in pandas works.
df_1 = df.groupby(['day','time'])
# iterate over each (day, time) key and its group of rows
for key, group in df_1:
    # show the (day, time) key
    print(key)
    # show the first rows of this group
    print(group.head())
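To make the difference concrete, here is a small sketch on made-up data. The frame below is a hypothetical stand-in for tips.csv; only the day and time column names come from the question. It contrasts head, which keeps rows, with an aggregation, which collapses each group.

```python
import pandas as pd

# Hypothetical stand-in for tips.csv with the same 'day' and 'time' columns.
df = pd.DataFrame({
    'day':  ['Sun', 'Sun', 'Sat', 'Sat', 'Sun'],
    'time': ['Dinner', 'Dinner', 'Dinner', 'Lunch', 'Dinner'],
    'tip':  [1.0, 2.0, 3.0, 4.0, 5.0],
})

grouped = df.groupby(['day', 'time'])

# head(n) keeps the first n rows of every group, in the original row order.
first_rows = grouped.head(1)

# An aggregation collapses each group into a single row.
mean_tip = grouped['tip'].mean()

print(first_rows)
print(mean_tip)
```

The groupby call on its own only records how to split the rows. A visibly grouped result appears once an aggregation such as mean, size or agg is applied.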

how to plot graded letters like A* in matplotlib

I'm a complete beginner and I have a college stats project: I'm comparing exam scores for our year group and the one below. I collected my own data and, since I do CS, I decided to try to visualize the data with pandas and matplotlib (my first time). I was able to read the csv file into a dataframe with columns = Level,Grade,Difficulty,Revision,Happy,MAG. Level is just the year group, e.g. AS or A2, and MAG is like a minimum expected grade; the rest are numeric values out of 5.
I want to do some plotting but I can't seem to get it to work.
I want to plot revision against difficulty for the AS group and try to show a correlation. I also want to show a bar chart (if appropriate) for Grade vs MAG.
Here is the csv: https://docs.google.com/spreadsheets/d/169UKfcet1qh8ld-eI7B4U14HIl7pvgZfQLE45NrleX8/edit?usp=sharing
This is the code so far:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv('Report Task.csv')
df.columns = ['Level','Grade','Difficulty','Revision','Happy','MAG'] #numerical values are out of 5
df[df.Level.str.match('AS')] #to get only AS group
plt.plot(df.Revision, df.Difficulty)
This is my first time ever posting on Stack, so I'm really sorry if I did something wrong.
For difficulty vs revision, you were using a line plot. You're probably looking for a scatter plot:
df = df[df.Level.str.match('AS')] # note the extra `df =` as per comments
plt.scatter(x=df.Revision, y=df.Difficulty)
plt.xlabel('Revision')
plt.ylabel('Difficulty')
Alternatively you can plot via pandas directly:
df = df[df.Level.str.match('AS')] # note the extra `df =` as per comments
df.plot.scatter(x='Revision', y='Difficulty')
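The extra `df =` matters because boolean indexing returns a new dataframe instead of filtering in place. A minimal sketch on hypothetical rows (the real data lives in 'Report Task.csv'; the name as_only is made up for illustration):

```python
import pandas as pd

# Hypothetical rows; the real data lives in 'Report Task.csv'.
df = pd.DataFrame({
    'Level': ['AS', 'A2', 'AS', 'A2'],
    'Revision': [3, 4, 2, 5],
    'Difficulty': [4, 2, 5, 1],
})

# Boolean indexing builds a new dataframe and leaves df unchanged,
# so the result of this line is simply thrown away.
df[df.Level.str.match('AS')]
rows_without_assignment = len(df)

# Keep the result in a variable to work with the AS group only.
as_only = df[df.Level.str.match('AS')]
print(as_only)
```

Storing the filtered rows under a separate name keeps the full dataframe available for the other year group as well.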

Find dates and difference between extreme observations

The function passed to apply must take a dataframe as its first argument and return a DataFrame, Series or scalar. apply will then take care of combining the results back together into a single dataframe or series. apply is therefore a highly flexible grouping method.
While apply is a very flexible method, its downside is that using it can be quite a bit slower than using more specific methods like agg or transform. Pandas offers a wide range of methods that will be much faster than apply for their specific purposes, so try to use them before reaching for apply.
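As a small illustration of the point above, the sketch below computes the same per-group minimum three ways. The data is hypothetical; only the ticker and holdings column names are borrowed from this thread.

```python
import pandas as pd

# Hypothetical holdings data for illustration.
df = pd.DataFrame({
    'ticker': ['A', 'A', 'B', 'B'],
    'holdings': [10, 7, 3, 8],
})

grouped = df.groupby('ticker')['holdings']

# Generic: apply calls a Python function once per group.
via_apply = grouped.apply(lambda s: s.min())

# Specific: the built-in aggregation gives the same values.
via_min = grouped.min()

# transform returns one value per original row instead of one per group.
per_row_min = grouped.transform('min')

print(via_apply, via_min, per_row_min, sep='\n')
```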
The easiest way is an aggregation with groupby, followed by a select:
# make index a column
df = df.reset_index()
# get min of holdings for each ticker
lowest = df[['ticker','holdings']].groupby('ticker').min()
print(lowest)
# select the rows holding each minimum by performing a left join with the original df
# this returns only the matching rows of df
lowest_dates = lowest.merge(df, on=['ticker','holdings'], how='left')
print(lowest_dates)
If you just want a series of Date you can use this function.
def getLowest(df):
    df = df.reset_index()
    lowest = df[['ticker','holdings']].groupby('ticker').min()
    lowest_dates = lowest.merge(df, on=['ticker','holdings'], how='left')
    return lowest_dates['Date']
From my point of view it would be better to return the entire dataframe, to know which ticker was lowest when. In this case you can replace the last line with:
    return lowest_dates
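An alternative sketch, not part of the answer above: idxmin can locate the row of each minimum directly. It assumes, as the answer does, that Date is the index and that the columns are named ticker and holdings. The values are made up.

```python
import pandas as pd

# Hypothetical data: a Date index plus 'ticker' and 'holdings' columns.
df = pd.DataFrame(
    {'ticker': ['A', 'A', 'B', 'B'], 'holdings': [10, 7, 3, 8]},
    index=pd.DatetimeIndex(
        ['2020-01-01', '2020-01-02', '2020-01-01', '2020-01-02'], name='Date'),
)

# Move Date into a column so every row has a unique integer label.
flat = df.reset_index()

# idxmin returns, per ticker, the row label of the first minimum of holdings.
lowest_rows = flat.loc[flat.groupby('ticker')['holdings'].idxmin()]
print(lowest_rows)
```

Unlike the merge, which returns every row tied for the minimum, this keeps only the first such row per ticker.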

How to translate a pandas group by without aggregation to pyspark?

I am trying to convert the following pandas line into pyspark:
df = df.groupby('ID', as_index=False).head(1)
Now, I am familiar with the df.groupby("col1", "col2") method in pyspark, as well as the following to get the first element within a group:
df = df.withColumn("row_num", row_number().over(Window.partitionBy("ID").orderBy("SOME_DATE_COLUMN"))).where(col("row_num") < 2)
However, without an orderBy argument, this grouping and fetching of the first element in each group doesn't work (and I am literally trying to convert from pandas to spark, whatever the pandas line does):
An error occurred while calling o2547.withColumn.
: org.apache.spark.sql.AnalysisException: Window function row_number() requires window to be ordered, please add ORDER BY clause. For example SELECT row_number()(value_expr) OVER (PARTITION BY window_partition ORDER BY window_ordering) from table
Looking at the pandas groupby documentation, I cannot grasp what groupby does without a following sort/agg function applied to the groups; i.e. what is the default order within a group from which .head(1) fetches the first element?
It depends on the order of your pandas dataframe before the groupby. From the pandas groupby documentation, on the sort parameter:
Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.
Converting the pandas behaviour exactly to pyspark is impossible, as pyspark dataframes aren't ordered. But if your data source can provide a row number or something similar to order by, it is possible.
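The pandas side of this can be checked with a short sketch on hypothetical data. Only the ID column name comes from the question; the value column is made up. It shows that head(1) picks whichever row of each ID currently comes first, so reordering the frame changes the result.

```python
import pandas as pd

# Hypothetical data; 'ID' matches the question, 'value' is made up.
df = pd.DataFrame({'ID': [2, 1, 2, 1, 3], 'value': ['a', 'b', 'c', 'd', 'e']})

# The first row of each ID, as the rows currently appear in df.
first = df.groupby('ID', as_index=False).head(1)

# Reversing the rows beforehand changes which row counts as "first".
reordered = df.iloc[::-1]
first_reordered = reordered.groupby('ID', as_index=False).head(1)

print(first)
print(first_reordered)
```

This is why a Spark port needs an explicit ordering column: the pandas result is defined by row position, which a pyspark dataframe does not carry by itself.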