Is there a way in pandas to compare two values within one column and sum how many times the second value is greater?

In my pandas dataframe, I have one column, score, whose rows are values such as [80, 100], [90, 100], etc. What I want to do is go through this column and, whenever the second value in the list is greater than the first, count it, so that I end up with a total of the number of times where, in [a, b], b was greater. How would I do this?

print(len([x for x in df['score'] if x[1] > x[0]]))
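If the column is large, a vectorized version may be faster. A minimal sketch, assuming every entry in score is a two-element list (the toy data here is made up):

import pandas as pd

df = pd.DataFrame({'score': [[80, 100], [90, 100], [95, 90]]})  # toy data

# split the pairs into two columns, then compare them element-wise
pairs = pd.DataFrame(df['score'].tolist(), columns=['a', 'b'], index=df.index)
print((pairs['b'] > pairs['a']).sum())  # 2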

Related

pandas df: replace values with np.NaN if character count does not match across multiple columns

I'm currently stuck with something I hope to find an answer for in this forum:
I have a df with multiple columns containing URLs. My index consists of URLs as well.
AIM: I'd like to replace df values across all columns with np.NaN if the number of "/" (count()) in the index is not equal to the number of "/" (count()) in the values of each of the other columns.
First, you need one column to compare to.
counts = df['id_url'].str.count('/')
Then you evaluate all the rows at once.
mask = df.apply(lambda col: col.str.count('/')).eq(counts, axis=0)
Then we want to find the rows where all the values are equal.
mask = mask.all(axis=1)
Now that we have a mask marking the rows where every count matches, we can use the ~ (NOT) operator to select the rows where at least one column does not match.
df.loc[~mask, :] = np.nan # replaces every value in the row with np.nan
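Putting it together, a self-contained sketch with made-up URLs (the id_url column name comes from the answer above; the data is hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id_url': ['a/b/c', 'a/b', 'x/y/z'],
    'url_1': ['d/e/f', 'd/e', 'x/y'],
    'url_2': ['g/h/i', 'g/h/i', 'p/q/r'],
})

counts = df['id_url'].str.count('/')  # slashes in the reference column
mask = df.apply(lambda col: col.str.count('/')).eq(counts, axis=0)
df.loc[~mask.all(axis=1), :] = np.nan  # wipe rows where any column disagrees
print(df)  # only the first row survives intact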

Conditional row number Pandas

I need to add a row number to my dataframe based on a certain condition; the input dataframe was posted as an image.
I need a row number column in my dataframe, as illustrated in the image (the Rank column).
So whenever the "RequestResubmitted" value is found within a group, I want to reset the rank to 1 again.
Let us try cumsum to create the grouping key, then groupby + cumcount:
s = df.groupby([df['Word Order Code'], df['Status Code'].eq('Request Submitted').cumsum()]).cumcount() + 1
df['rank'] = s
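A toy run of the same idea; the Word Order Code and Status Code names come from the answer, the data itself is made up:

import pandas as pd

df = pd.DataFrame({
    'Word Order Code': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Status Code': ['Request Submitted', 'In Progress', 'Request Submitted',
                    'Closed', 'Request Submitted', 'In Progress'],
})

# each 'Request Submitted' bumps the cumsum, starting a fresh sub-group
key = df['Status Code'].eq('Request Submitted').cumsum()
df['rank'] = df.groupby([df['Word Order Code'], key]).cumcount() + 1
print(df)  # rank restarts at 1 on every 'Request Submitted' row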

Pandas groupby year filtering the dataframe by n largest values

I have a dataframe at hourly level with several columns. I want to extract the full rows (all columns) for the top 10 values of a specific column, for every year in my dataframe.
So far I ran the following code:
df = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10))
The problem here is that I only get the top 10 values of that specific column for each year, and I lose the other columns. How can I do this operation while keeping the values of the other columns that correspond to the top 10 values of 'totaldemand' per year?
We usually do head after sort_values. Sort first, then take the first 10 rows of each year group; note that the groupby key has to come from the sorted frame, not the original, or the years will no longer line up with the rows:
df = df.sort_values('totaldemand', ascending=False)
df = df.groupby(df.index.year).head(10)
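A quick sketch with made-up daily data (totaldemand comes from the question; the index is assumed to be a DatetimeIndex):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range('2019-01-01', periods=500, freq='D')  # spans 2019-2020
df = pd.DataFrame({'totaldemand': rng.random(500),
                   'othercol': rng.random(500)}, index=idx)

df_sorted = df.sort_values('totaldemand', ascending=False)
top = df_sorted.groupby(df_sorted.index.year).head(10)  # 10 full rows per year
print(top.sort_index())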
nlargest can be applied to each group, passing the column in which to look for the largest values.
So run:
df.groupby([df.index.year]).apply(lambda grp: grp.nlargest(3, 'totaldemand'))
Of course, in the final version replace 3 with your actual value.
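One wrinkle: apply here prepends the year as an extra index level. If you want to keep the original index flat, group_keys=False drops the added level; a minimal sketch:

df.groupby(df.index.year, group_keys=False).apply(lambda grp: grp.nlargest(10, 'totaldemand'))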
Get the index of your query and use it as a mask on your original df:
idx = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10)).index.get_level_values(-1)
df.loc[idx]
(or something to that extent, I can't test right now without any test data; the groupby/apply result carries a (year, original index) MultiIndex, hence taking the last level before indexing)

counting each value in dataframe

So I want to create a plot or graph. I have time series data.
My dataframe looks like this:
df.head()
I need to count values in df['status'] (there are 4 different values) and df['group_name'] (2 different values) for each day.
So I want to have a date index and a count of how many times each value from df['status'] appears, as well as from df['group_name']. It should return a Series.
I used spam.groupby('date')['column'].value_counts().unstack().fillna(0).astype(int) and it is working as it should. Thank you all for the help.
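A toy reconstruction of that pattern; the spam name and the column names come from the question, the values are made up:

import pandas as pd

spam = pd.DataFrame({
    'date': pd.to_datetime(['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02']),
    'status': ['open', 'closed', 'open', 'open'],
    'group_name': ['a', 'b', 'a', 'a'],
})

# one wide frame per categorical column: one row per day, one column per value
status_counts = spam.groupby('date')['status'].value_counts().unstack().fillna(0).astype(int)
group_counts = spam.groupby('date')['group_name'].value_counts().unstack().fillna(0).astype(int)
print(status_counts.join(group_counts))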

How can I add continuous 'Ident' column to a dataframe in Pyspark, not as monotonically_increasing_id()?

I have a dataframe 'df', and I want to add an 'Ident' numeric column where the values are continuous. I tried with monotonically_increasing_id() but the values are not continuous. As its description says: "The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive."
So, my question is, how could I do that?
You could try something like this:
df = df.rdd.zipWithIndex().map(lambda x: [x[1]] + [y for y in x[0]]).toDF(['Ident']+df.columns)
This will give you the first column as your identifier, with consecutive values starting from 0 to N-1, where N is the total number of records in df.
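An alternative sketch using a window function; ordering by monotonically_increasing_id preserves the current row order, though the single unpartitioned window moves all rows through one executor, so it may not scale as well:

from pyspark.sql.functions import monotonically_increasing_id, row_number
from pyspark.sql.window import Window

w = Window.orderBy(monotonically_increasing_id())
df = df.withColumn('Ident', row_number().over(w) - 1)  # consecutive ids starting at 0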