I have a pivot table like this in pandas:
I want to add a column [DIFFERENCE] and sort the table by that new [DIFFERENCE] column.
I have played around with table.diff(axis=1), but somehow I can't get the sorting to work.
Any ideas are very much appreciated.
Creating a column based on the difference is straightforward, and sorting is possible with sort_values:
df['Difference'] = df['2021'] - df['2020']
df.sort_values('Difference', inplace=True)
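As a minimal sketch of the approach above, the following uses made-up numbers in place of the pivot table from the question (the row labels and values are assumptions, since the original table was not included):

```python
import pandas as pd

# Hypothetical data standing in for the pivot table in the question.
df = pd.DataFrame(
    {"2020": [10, 40, 25], "2021": [30, 35, 50]},
    index=["A", "B", "C"],
)

# New column holding the change between the two years.
df["Difference"] = df["2021"] - df["2020"]

# sort_values returns a new frame unless inplace=True is passed.
df_sorted = df.sort_values("Difference", ascending=False)
print(df_sorted)
```

If the column labels of the pivot table are integers rather than strings, they would need to be addressed as df[2021] and df[2020] instead.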
Related
I have the following code:
df.orderBy(expr("COUNTRY_NAME").desc, expr("count").asc).show()
I expect the count column to be arranged in ascending order for a given COUNTRY_NAME. But I see something like this:
The last value of 12 is not what I expected.
Why is it so?
If you output df.printSchema(), you'll see that your "count" column is of the string datatype, resulting in the undesired alphanumeric sort.
In pyspark, you can use the following to accomplish what you are looking for:
df = df.withColumn('count',df['count'].cast('int'))
df.orderBy(['COUNTRY_NAME', 'count'], ascending=[False, True]).show()
You should create and apply your schema when the data is read in - if possible.
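The effect of sorting numbers stored as strings can be reproduced without Spark. This plain-Python sketch uses made-up values and only illustrates why a string column sorts differently from an integer column:

```python
# Counts stored as text, as they would be in a string-typed column.
counts_as_text = ["1", "5", "12"]

# Strings compare character by character, so "12" sorts before "5".
text_order = sorted(counts_as_text)

# Converting to int first gives the numeric order.
numeric_order = sorted(counts_as_text, key=int)

print(text_order)     # ['1', '12', '5']
print(numeric_order)  # ['1', '5', '12']
```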
I have the following pivot table
I want to get the max value from each row, but also, I need to get the column it came from.
So far I know how to get the max value of every row using this:
dff['State'] = stateRace.max(axis=1)
dff
I get this:
which is returning the correct max value but not the column it came from.
You are at a disadvantage in getting help because you supplied images and the question is not clear. Happy to help further if the answer below doesn't work.
stateRace = stateRace.assign(
    max_value=stateRace.select_dtypes(exclude='object').max(axis=1),
    max_column=stateRace.select_dtypes(exclude='object').idxmax(axis=1),
)
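To make the max/idxmax pairing concrete, here is a small sketch on invented data. The column names and values are assumptions, because the original table was only posted as an image:

```python
import pandas as pd

# Hypothetical pivot table: one row per state, one column per group.
stateRace = pd.DataFrame(
    {"White": [5, 1], "Black": [2, 7], "Asian": [3, 4]},
    index=["AL", "AK"],
)

# Restrict to numeric columns so text columns do not break max/idxmax.
numeric = stateRace.select_dtypes(include="number")

result = stateRace.assign(
    max_value=numeric.max(axis=1),      # largest value in each row
    max_column=numeric.idxmax(axis=1),  # label of the column holding it
)
print(result)
```

Computing the numeric subset once and reusing it avoids calling select_dtypes twice.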
I have a dataframe like this
I want to find out :
unique viewers: 3
unique viewers who reviewed movies: 2
I am able to do that using the following code:
movie['Viewer_ID'].nunique()
movie.loc[movie['watched']==1,:]['Viewer_ID'].nunique()
However, I was wondering if there is a better way to combine both in one, something like
movie.agg({'Viewer_ID':'nunique'
,'watched': 'sum'
})
Is there a way I can write a conditional count within the agg function?
You can use .groupby():
view_count = movie.groupby('Viewer_ID').watched.sum()
Now view_count is a Series with the viewer ID as index and the sum of watched as values. You can filter with:
filtered = view_count.loc[view_count > 0]
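Both numbers from the question can be read off the grouped Series. This is a sketch on invented rows (the data is an assumption, chosen to give 3 viewers of whom 2 watched something):

```python
import pandas as pd

# Hypothetical data: viewer 2 never watched anything.
movie = pd.DataFrame({
    "Viewer_ID": [1, 1, 2, 3, 3],
    "watched":   [1, 0, 0, 1, 1],
})

# One row per viewer, holding how many items that viewer watched.
per_viewer = movie.groupby("Viewer_ID")["watched"].sum()

unique_viewers = len(per_viewer)                   # all distinct viewers
viewers_who_watched = int((per_viewer > 0).sum())  # viewers with at least one watch

print(unique_viewers, viewers_who_watched)  # 3 2
```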
I am trying to make a pivot table with a data set with many columns.
When making a pivot table with code below I get all the columns which I don't want.
I only want the counts and not any other columns. Can I achieve this?
table1 = pd.pivot_table(dfCALCNoExcecption,index=['AD Platform','Agent Program'],columns=None,aggfunc='count')
The output of the above code, exported to Excel, looks like the table below (I have not pasted all of it, as there are around 50 columns):
The desired output I am trying to get:
You can group your data by the columns 'AD Platform' and 'Agent Program'. After that, you can sum all the values of the column that holds the quantity of machines. Here is my code:
df.groupby(['AD Platform', 'Agent Program'])['AD Hostname'].sum()
This is not complete, but part of it can be achieved with groupby. I am not sure how to rename the third column to "Count".
dfAgentTable3 = dfCALCNoExcecption.groupby(['AD Platform', 'Agent Program'])['AD Hostname'].count().sort_index(ascending=True)
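One way to get a column named "Count" is to pass name= to reset_index on the grouped Series. This is a sketch on invented rows; the platform, program, and hostname values are assumptions:

```python
import pandas as pd

# Hypothetical rows standing in for dfCALCNoExcecption.
df = pd.DataFrame({
    "AD Platform":   ["Windows", "Windows", "Linux", "Windows"],
    "Agent Program": ["X", "X", "Y", "Z"],
    "AD Hostname":   ["h1", "h2", "h3", "h4"],
})

counts = (
    df.groupby(["AD Platform", "Agent Program"])["AD Hostname"]
      .count()                     # non-null hostnames per group
      .reset_index(name="Count")   # back to a DataFrame, value column renamed
)
print(counts)
```

Selecting a single column before counting is what keeps the other columns out of the result.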
In a pandas DataFrame, I am trying to run some statistical analysis on a column (heart rate). I aggregate by patient ID and hour of measurement, then compute the statistics (mean, max, etc.).
My question is how to rename the returned columns (for example, sum_heart_rate instead of sum, and min_heart_rate instead of min).
My code is as follows:
newdataframe= df2.groupby(['DayHour','subject_id']).agg({"Heart Rate":['sum' ,'min','max','std', 'count','var','skew']})
You can use the template below and add more columns if needed.
newdataframe = df2.groupby(['DayHour','subject_id']).agg(sum_heart_rate=('Heart Rate', 'sum'), min_heart_rate=('Heart Rate', 'min'))
For pandas versions below 0.25, use the code below:
newdataframe = df2.groupby(['DayHour','subject_id'])['Heart Rate'].agg([('sum_heart_rate','sum'), ('min_heart_rate','min')])
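When many statistics are requested at once, another option is to keep the list-style agg call from the question and flatten the resulting two-level column index afterwards. The sketch below uses invented readings, and the "_heart_rate" suffix is just the naming asked for in the question:

```python
import pandas as pd

# Hypothetical measurements for one patient over two hours.
df2 = pd.DataFrame({
    "DayHour":    [1, 1, 2, 2],
    "subject_id": [10, 10, 10, 10],
    "Heart Rate": [60, 80, 70, 90],
})

newdataframe = df2.groupby(["DayHour", "subject_id"]).agg(
    {"Heart Rate": ["sum", "min", "max"]}
)

# Columns are (column, function) pairs; keep the function name and add a suffix.
newdataframe.columns = [f"{func}_heart_rate" for _, func in newdataframe.columns]
print(newdataframe)
```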