pandas row-wise comparison and apply condition

This is my dataframe:
df = pd.DataFrame(
{
"name": ["bob_x", "mad", "jay_x", "bob_y", "jay_y", "joe"],
"score": [3, 5, 6, 2, 4, 1],
}
)
I want to compare the score of bob_x with bob_y and retain the row with the lowest score, and do the same for jay_x and jay_y. No change is required for mad and joe.

You can first split the names by _ and keep the first part, then groupby and keep the lowest value:
import pandas as pd
df = pd.DataFrame({"name": ["bob_x", "mad", "jay_x", "bob_y", "jay_y", "joe"],"score": [3, 5, 6, 2, 4, 1]})
df['name'] = df['name'].str.split('_').str[0]
df.groupby('name')['score'].min().reset_index()
Result:
  name  score
0  bob      2
1  jay      4
2  joe      1
3  mad      5
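If you need to keep the whole original rows (name and score as they appear in df) rather than just the minimum scores, here is a sketch of the same idea using idxmin to pick, per group, the index of the row with the lowest score:

import pandas as pd

df = pd.DataFrame({
    "name": ["bob_x", "mad", "jay_x", "bob_y", "jay_y", "joe"],
    "score": [3, 5, 6, 2, 4, 1],
})

# group key: the part of the name before the underscore
key = df["name"].str.split("_").str[0]

# idxmin returns, for each group, the index label of the lowest score;
# df.loc then selects those full rows
out = df.loc[df.groupby(key)["score"].idxmin()]
print(out)

This keeps bob_y, jay_y, joe and mad with their original names and scores.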

Related

Concatenate rows in pandas with conditions and calculations

If I have a dataframe:
myData = {'start': [1, 2, 3, 4, 5],
          'end': [2, 3, 5, 7, 6],
          'number': [1, 2, 7, 9, 7]
          }
df = pd.DataFrame(myData, columns=['start', 'end', 'number'])
df
And I need to do something like:
result = {'start': [1, 4, 5],
          'end': [7, 7, 6],
          'number': [10, 9, 7]
          }
df = pd.DataFrame(result, columns=['start', 'end', 'number'])
df
If the difference is less than 1: start = start (previous row), end = end (current row), then delete the previous rows.
That is, when the difference between the end of the first row and the beginning of the second is less than 1, merge the rows: rewrite the new beginning, sum the number values, and delete the first row.
Can I do this without iteration?
You can use:
# identify when end - previous_start > 2
# and create a new group
group = df['end'].sub(df['start'].shift()).gt(2).cumsum()
# aggregate
out = df.groupby(group).agg({'start': 'first', 'end': 'last', 'number': 'sum'})
Output:
   start  end  number
0      1    3       3
1      3    5       7
2      4    6      16
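For reference, a self-contained sketch of the same approach with the intermediate mask spelled out (it reproduces the output above; the group labels are 0, 0, 1, 2, 2):

import pandas as pd

df = pd.DataFrame({'start': [1, 2, 3, 4, 5],
                   'end': [2, 3, 5, 7, 6],
                   'number': [1, 2, 7, 9, 7]})

# True wherever end - previous_start > 2, i.e. where a new group begins
new_group = df['end'].sub(df['start'].shift()).gt(2)

# the cumulative sum turns the boolean mask into consecutive group labels
group = new_group.cumsum()

out = (df.groupby(group)
         .agg({'start': 'first', 'end': 'last', 'number': 'sum'})
         .reset_index(drop=True))
print(out)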

Calculate the difference between all rows and a specific row in the dataframe

This is a similar question to this thread.
Let's consider df as:
df = pd.DataFrame([["a", 2, 3], ["b", 5, 6], ["c", 8, 9],["a", 0, 0], ["a", 8, 7], ["c", 2, 1]], columns = ["A", "B", "C"])
How can you calculate the difference between all rows and the row at the lowest index of EACH group for column "B", and put it in column "D"? I want to calculate the mean square displacement for my data, so I need the difference between each value in a column and the value of the first row that appears in that group.
I tried:
df['D'] = df.groupby(["A"])['B'].sub(df.groupby(['A'])["B"].iloc[0])
Group = df.groupby(["A"])
However, using .sub on a groupby raises the following error:
AttributeError: 'SeriesGroupBy' object has no attribute 'sub'
the desired result would be like this:
   A  B  C  D
0  a  2  3  0   *lowest index in group "a"
1  b  5  6  0   *lowest index in group "b"
2  c  8  9  0   *lowest index in group "c"
3  a  0  0 -2
4  a  8  7  6
5  c  2  1 -6
I guess this answer could be enough of a hint for you:
import pandas as pd
df = pd.DataFrame([["a", 2, 3], ["b", 5, 6], ["c", 8, 9],["a", 0, 0], ["a", 8, 7], ["c", 2, 1]], columns = ["A", "B", "C"])
print("df:")
print(df)
print()
groupA = df.groupby(['A'])
print("groupA:")
print(groupA.groups)
print()
print("lowest indices for each group from columnA:")
lowest_indices = dict()
for k, v in groupA.groups.items():
    lowest_indices[k] = v[0]
print(lowest_indices)
print()
columnB = df['B']
print("columnB:")
print(columnB)
print()
df['D'] = df['B']
for i in range(len(df)):
    group_at_i = df['A'].iloc[i]
    lowest_index_of_that = lowest_indices[group_at_i]
    b_element_at_that_index = df['B'].iloc[lowest_index_of_that]
    the_difference = df['B'].iloc[i] - b_element_at_that_index
    df.loc[i, 'D'] = the_difference
print("df:")
print(df)
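As an aside (not part of the answer above, but a sketch of a vectorized alternative): groupby(...)['B'].transform('first') broadcasts the first B value of each group back to every row, so the whole column D can be computed without a Python loop:

import pandas as pd

df = pd.DataFrame([["a", 2, 3], ["b", 5, 6], ["c", 8, 9],
                   ["a", 0, 0], ["a", 8, 7], ["c", 2, 1]],
                  columns=["A", "B", "C"])

# subtract, from each row, the B value of the first row in its group
df["D"] = df["B"] - df.groupby("A")["B"].transform("first")
print(df)

This produces the desired D column shown in the question.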

how to change value based on criteria pandas

I have the following problem. I have this df:
d = {'id': [1, 1, 2, 2, 3], 'value': [0, 1, 0, 0, 1]}
df = pd.DataFrame(data=d)
I would like a new column where the value is 1 if any row with the same id also has value 1. See the desired output:
d = {'id': [1, 1, 2, 2, 3], 'value': [0, 1, 0, 0, 1], 'newvalue': [1, 1, 0, 0, 1]}
df = pd.DataFrame(data=d)
How can I do it please?
If you need to set 0/1 by a condition (here: at least one 1 per group), use GroupBy.transform with 'any' to build a mask, then cast the True/False values to 1/0 integers:
df['newvalue'] = df['value'].eq(1).groupby(df['id']).transform('any').astype(int)
Alternative:
df['newvalue'] = df['id'].isin(df.loc[df['value'].eq(1), 'id']).astype(int)
Or, if only 0 and 1 values are possible, simplify the solution by taking the maximum value per group:
df['newvalue'] = df.groupby('id')['value'].transform('max')
print(df)
   id  value  newvalue
0   1      0         1
1   1      1         1
2   2      0         0
3   2      0         0
4   3      1         1
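A small check (not from the original answer) that the three variants agree on this data:

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2, 3], 'value': [0, 1, 0, 0, 1]})

v1 = df['value'].eq(1).groupby(df['id']).transform('any').astype(int)
v2 = df['id'].isin(df.loc[df['value'].eq(1), 'id']).astype(int)
v3 = df.groupby('id')['value'].transform('max')

print((v1 == v2).all() and (v2 == v3).all())  # True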

pandas sort values based on another column's value for each row

Generally, I want to sort the cells of some columns in a pandas dataframe based on one column's value; that single column stores the rank of the other columns' values.
Suppose I have a dataframe like this, where chrs holds the characters I want to sort and rank_of_chr is the order of the characters for each row:
import pandas as pd
import numpy as np
import string
from operator import itemgetter
letters = list(string.ascii_lowercase)
np.random.seed(0)
# generate length for each row
data = pd.DataFrame({'col0': np.random.randint(2,10,10)})
# generate random string for each row
data['chrs'] = data.col0.apply(lambda x: ','.join(np.random.choice(letters) for i in range(x)))
# generate random rank for each row
data['rank_of_chr'] = data.col0.apply(lambda x: np.random.choice(x,x,replace = False))
data.iloc[:,1:]
chrs rank_of_chr
0 v,s,e,x,g,y [2, 3, 5, 1, 4, 0]
1 y,m,b,g,h,x,o,y,r [0, 4, 2, 3, 5, 6, 7, 1, 8]
2 f,z,n,i,j,u,t [4, 1, 5, 0, 6, 2, 3]
3 q,t [0, 1]
4 f,p,p,a,s [3, 0, 2, 1, 4]
5 d,y,r,t,t [1, 4, 2, 0, 3]
6 t,o,h,a,b [1, 2, 0, 3, 4]
7 j,z,a,k,u,x,d,l,s [7, 5, 1, 2, 3, 8, 6, 0, 4]
8 x,c,a [2, 0, 1]
9 a,e,v,f,g [0, 2, 3, 4, 1]
I want to sort the chrs value based on the rank_of_chr value for each row. For instance, for row 9, a,e,v,f,g with rank [0, 2, 3, 4, 1] should become a,g,e,v,f (rank is ascending, just like rank() in SQL).
Since the real data has 50,000,000 rows, I want to find the fastest method for this task.
What I have tried is:
Use itertuples over the rows, with a for loop over each column I want to sort.
For each row, use np.argsort to get the sorted order from rank_of_chr, then use itemgetter to reorder the original chrs values.
Write the new values in place using data.at[index, col_name] = new_value.
cols_need_sort = ['chrs']
for i in data.itertuples():
    this_order = np.argsort(list(map(int, data.loc[i.Index, 'rank_of_chr'])))
    for col_name in cols_need_sort:
        data.at[i.Index, col_name] = itemgetter(*this_order)(data.loc[i.Index, col_name].split(','))
data.iloc[:,1:]
Any method to boost performance for this task?
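One direction worth trying (a sketch, not benchmarked, starting from the unsorted data above): keep the per-row work in plain Python/NumPy and assign the whole column once, instead of writing back with .at on every iteration:

import numpy as np

def sort_chrs(chrs, rank):
    # reorder the comma-separated characters by their rank (ascending)
    parts = chrs.split(',')
    return ','.join(parts[i] for i in np.argsort(rank))

data['chrs'] = [sort_chrs(c, r) for c, r in zip(data['chrs'], data['rank_of_chr'])]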

Use pandas cut function in Dask

How can I use pd.cut() in Dask?
Because the dataset is too large to fit in memory, I cannot run pd.cut() in plain pandas.
Current code that is working in Pandas but needs to be changed to Dask:
import pandas as pd
d = {'name': [1, 5, 1, 10, 5, 1], 'amount': [1, 5, 3, 8, 4, 1]}
df = pd.DataFrame(data=d)
#Groupby name and add column sum (of amounts) and count (number of grouped rows)
df = (df.groupby('name')['amount'].agg(['sum', 'count']).reset_index().sort_values(by='name', ascending=True))
print(df.head(15))
#Group by bins and change sum and count based on grouped rows
df = df.groupby(pd.cut(df['name'],
bins=[0,4,8,100],
labels=['namebin1', 'namebin2', 'namebin3']))['sum', 'count'].sum().reset_index()
print(df.head(15))
Output:
       name  sum  count
0  namebin1    5      3
1  namebin2    9      2
2  namebin3    8      1
I tried:
import pandas as pd
import dask.dataframe as dd
d = {'name': [1, 5, 1, 10, 5, 1], 'amount': [1, 5, 3, 8, 4, 1]}
df = pd.DataFrame(data=d)
df = dd.from_pandas(df, npartitions=2)
df = df.groupby('name')['amount'].agg(['sum', 'count']).reset_index()
print(df.head(15))
df = df.groupby(df.map_partitions(pd.cut,
df['name'],
bins=[0,4,8,100],
labels=['namebin1', 'namebin2', 'namebin3']))['sum', 'count'].sum().reset_index()
print(df.head(15))
Gives error:
TypeError("cut() got multiple values for argument 'bins'",)
The reason you're seeing this error is that map_partitions passes the partition to pd.cut() as its first positional argument, so df['name'] lands in the bins slot and collides with the bins= keyword (see the docs).
You can wrap it in a custom function and call that instead, like so:
import pandas as pd
import dask.dataframe as dd
def custom_cut(partition, bins, labels):
    result = pd.cut(x=partition["name"], bins=bins, labels=labels)
    return result
d = {'name': [1, 5, 1, 10, 5, 1], 'amount': [1, 5, 3, 8, 4, 1]}
df = pd.DataFrame(data=d)
df = dd.from_pandas(df, npartitions=2)
df = df.groupby('name')['amount'].agg(['sum', 'count']).reset_index()
df = df.groupby(df.map_partitions(custom_cut,
bins=[0,4,8,100],
labels=['namebin1', 'namebin2', 'namebin3']))[['sum', 'count']].sum().reset_index()
df.compute()
       name  sum  count
0  namebin1    5      3
1  namebin2    9      2
2  namebin3    8      1
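To see why this works: map_partitions calls custom_cut(partition, bins=..., labels=...) with the partition as the only positional argument, and the wrapper selects the name column itself, so nothing collides with bins. A quick check on a plain pandas frame (the sample data here is made up, just for illustration):

sample = pd.DataFrame({'name': [1, 5, 10], 'sum': [0, 0, 0], 'count': [0, 0, 0]})
print(custom_cut(sample, bins=[0, 4, 8, 100],
                 labels=['namebin1', 'namebin2', 'namebin3']))
# 0    namebin1
# 1    namebin2
# 2    namebin3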