Replacing the first row of a selected column in each group with 0 - pandas

Existing df :
Id status value
A1 clear 23
A1 in-process 50
A1 done 20
B1 start 2
B1 end 30
Expected df :
Id status value
A1 clear 0
A1 in-process 50
A1 done 20
B1 start 0
B1 end 30
I am looking to replace the first value of each group with 0.

Use Series.duplicated to flag repeated Id values, invert the mask with ~ so that it selects the first row of each group, and assign 0 through DataFrame.loc:
df.loc[~df['Id'].duplicated(), 'value'] = 0
print (df)
Id status value
0 A1 clear 0
1 A1 in-process 50
2 A1 done 20
3 B1 start 0
4 B1 end 30

One approach could be as follows:
Compare the values for each row in df.Id with the next row, combining Series.shift with Series.ne. This will return a boolean Series with True for each first row of a new Id value.
Next, use df.loc to select only rows with True for column value and assign 0.
df.loc[df.Id.ne(df.Id.shift()), 'value'] = 0
print(df)
Id status value
0 A1 clear 0
1 A1 in-process 50
2 A1 done 20
3 B1 start 0
4 B1 end 30
N.B. This approach assumes that the "groups" in Id are sorted, so that each group is contiguous (as they appear to be here). If that is not the case, you could call df.sort_values('Id', inplace=True) first, but if sorting is necessary, the answer by @jezrael will most likely be faster.
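To illustrate the caveat about sorting, here is a minimal sketch on a hypothetical frame whose Id groups are interleaved (the data is invented for illustration and is not from the question). The shift-based mask flags every change of Id, while the duplicated-based mask flags only the first occurrence of each Id.

```python
import pandas as pd

# Hypothetical data: the Id groups are interleaved rather than contiguous.
df = pd.DataFrame({
    "Id": ["A1", "B1", "A1", "B1"],
    "value": [23, 2, 50, 30],
})

# True wherever Id differs from the previous row (start of each run).
run_start = df["Id"].ne(df["Id"].shift())

# True only for the first occurrence of each Id, regardless of row order.
first_seen = ~df["Id"].duplicated()

by_shift = df.copy()
by_shift.loc[run_start, "value"] = 0

by_duplicated = df.copy()
by_duplicated.loc[first_seen, "value"] = 0

print(by_shift["value"].tolist())       # every row starts a new run here
print(by_duplicated["value"].tolist())  # only rows 0 and 1 are first occurrences
```

On sorted data the two masks coincide, which is why both answers give the same result for the frame in the question.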

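Series.mask on the value column gives the same result (calling mask on the whole DataFrame would set every column of those rows to 0):
df1['value'] = df1['value'].mask(~df1['Id'].duplicated(), 0)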

Related

Pivoting data without column

Starting from a df imported from Excel like this:
Code Material Text QTY
A1 X222 Model3 1
A2 4027721 Gruoup1 1
A2 4647273 Gruoup1.1 4
A1 573828 Gruoup1.2 1
I want to create a new pivot table like this:
Code Qty
A1 2
A2 5
I tried the following commands, but they do not work:
df.pivot(index='Code', columns='',values='Qty')
df_pivot = df ("Code").Qty([sum, max])
You don't need pivot here; use groupby instead:
out = df.groupby('Code', as_index=False)['QTY'].sum()
# Or
out = df.groupby('Code')['QTY'].agg(['sum', 'max']).reset_index()
Output:
>>> out
Code sum max
0 A1 2 1
1 A2 5 4
The equivalent code with pivot_table:
out = (df.pivot_table('QTY', 'Code', aggfunc=['sum', 'max'])
         .droplevel(1, axis=1).reset_index())
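As a runnable sketch of the groupby approach, the frame below is rebuilt by hand from the table in the question (the variable names totals and stats are my own):

```python
import pandas as pd

# Data copied from the question's table.
df = pd.DataFrame({
    "Code": ["A1", "A2", "A2", "A1"],
    "Material": ["X222", "4027721", "4647273", "573828"],
    "Text": ["Model3", "Gruoup1", "Gruoup1.1", "Gruoup1.2"],
    "QTY": [1, 1, 4, 1],
})

# Total quantity per Code, with Code kept as a regular column.
totals = df.groupby("Code", as_index=False)["QTY"].sum()

# Several aggregations at once: one output column per function.
stats = df.groupby("Code")["QTY"].agg(["sum", "max"]).reset_index()

print(totals)
print(stats)
```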

Find the third field if two fields have non zero values

My dataset -
A B C
abc 0 12
ert 0 45
ghj 14 0
kli 56 78
qas 0 0
I want to find the values of A for which values of B and C together are non-zero.
Expected output-
A B C
kli 56 78
I tried-
aggr(
sum({<[B]={"<>0"},[C]={"<>0"}>}A)
,[B],[C])
It depends on where you are doing this, in the data load script or through set analysis on the front end, but the following expression works both in the load editor and in a table:
if("B" <> 0 and "C" <> 0, 'Non-Zero Value', 'Zero Value')

Assigning Score based on Order Sequence in pandas

Following are the dataframes I have
score_df
col1_id col2_id score
1 2 10
5 6 20
records_df
date col_id
D1 6
D2 4
D3 1
D4 2
D5 5
D6 7
I would like to compute a score based on the following criteria:
When 2 occurs after 1, the score should be 10; likewise, when 1 occurs after 2, the score should be 10.
That is, (1,2) gives a score of 10, and (2,1) gets the same score of 10.
Consider (1,2): when 1 occurs for the first time, we don't assign a score. We flag the row and wait for 2 to occur. When 2 then occurs in the column, we assign the score 10.
Consider (2,1): when 2 comes first, we assign 0 and wait for 1 to occur. When 1 occurs, we assign the score 10.
So, on the first occurrence, don't assign the score; wait for the corresponding event to occur and then assign it.
So, my result dataframe should look something like this
result
date col_id score
D1 6 0 -- Even though 6 is in the score list, it occurred for the first time, so 0
D2 4 0 -- 4 is not in the list at all
D3 1 0 -- 1 occurred for the first time, so 0
D4 2 10 -- 1 occurred previously and 2 occurs now, so we can assign 10
D5 5 20 -- 6 occurred previously, so we can assign 20
D6 7 0 -- 7 is not in the list
I have around 100k rows in both score_df and records_df, and looping over them to assign scores takes too long. Can someone help with logic that avoids looping over the entire dataframe?
From what I understand, you can try melt for unpivoting and then merge. Keeping the index from the melted df, we check where that index is duplicated and return the score from the merge there, else 0. (The code assumes both frames also carry a uid column, as in the output below.)
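# Unpivot the two id columns; 'index' identifies the original score_df row (the pair)
m = score_df.reset_index().melt(['index', 'uid', 'score'],
                                var_name='col_name', value_name='col_id')
final = records_df.merge(m.drop(columns='col_name'), on=['uid', 'col_id'], how='left')
# True where a member of the same pair has already appeared in an earlier row
c = final.duplicated(['index']) & final['index'].notna()
final = final.drop(columns='index').assign(score=lambda x: x['score'].where(c, 0))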
print(final)
uid date col_id score
0 123 D1 6 0.0
1 123 D2 4 0.0
2 123 D3 1 0.0
3 123 D4 2 10.0
4 123 D5 5 20.0
5 123 D6 7 0.0
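For reference, here is a self-contained sketch of the same melt-and-merge idea applied to the frames exactly as posted in the question, which have no uid column (dropping uid is my adaptation, not part of the answer). Note that the mask is True for any repeated hit on the same pair, so a row also scores when the same id simply occurs twice; whether that matches the intended rule is an assumption.

```python
import pandas as pd

score_df = pd.DataFrame({"col1_id": [1, 5], "col2_id": [2, 6], "score": [10, 20]})
records_df = pd.DataFrame({
    "date": ["D1", "D2", "D3", "D4", "D5", "D6"],
    "col_id": [6, 4, 1, 2, 5, 7],
})

# One row per (pair, member id); the "index" column identifies the pair.
m = score_df.reset_index().melt(
    ["index", "score"], var_name="col_name", value_name="col_id"
)

# Attach the pair index and score to each record (NaN when the id is in no pair).
final = records_df.merge(m.drop(columns="col_name"), on="col_id", how="left")

# True when some member of the same pair appeared in an earlier row.
seen_before = final.duplicated(["index"]) & final["index"].notna()

final = final.drop(columns="index").assign(
    score=lambda x: x["score"].where(seen_before, 0)
)
print(final)
```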

Is it possible to obtain 'groupby-transform-apply' style results when the function returns a Series rather than a scalar?

I want to achieve the following behavior:
res = df.groupby(['dimension'], as_index=False)['metric'].transform(lambda x: foo(x))
where foo(x) returns a Series of the same size as its input, which is df['metric'].
However, this throws the following error:
ValueError: transform must return a scalar value for each group
I know I can use a for loop, but how can I achieve this in a groupby manner?
e.g.
df:
col1 col2 col3
0 A1 B1 1
1 A1 B1 2
2 A2 B2 3
and I want to achieve:
col1 col2 col3
0 A1 B1 1 - (1+2)/2
1 A1 B1 2 - (1+2)/2
2 A2 B2 3 - 3
If you want to return a Series you should use apply instead of transform:
res = df.groupby(['dimension'], as_index=False)['metric'].apply(lambda x: foo(x))
As the error states, transform expects a scalar value for each group here, which is then placed in every row of that group. apply, however, will work with a Series returned for each group.
If this doesn't work, please provide input and expected output so that the problem is easier to understand.
You can do this using transform, subtracting the group mean from each value:
df['col3'] = df.col3 - df.groupby(['col1','col2'])['col3'].transform('mean')
Or using apply (slower):
df['col3'] = df.groupby(['col1','col2'], group_keys=False)['col3'].apply(lambda x: x - x.mean())
col1 col2 col3
0 A1 B1 -0.5
1 A1 B1 0.5
2 A2 B2 0.0
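A minimal runnable sketch, reading the expected output in the question as "each value minus its group mean" (that reading, the helper name demean, and the new column names are my assumptions). It shows that transform also accepts a function that returns a Series of the same length as the group:

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["A1", "A1", "A2"],
    "col2": ["B1", "B1", "B2"],
    "col3": [1, 2, 3],
})

def demean(s: pd.Series) -> pd.Series:
    """Return the group's values minus the group mean (same length as input)."""
    return s - s.mean()

grouped = df.groupby(["col1", "col2"])["col3"]

# Function returning a like-sized Series: the result is aligned to df's index.
df["centered_udf"] = grouped.transform(demean)

# Equivalent vectorised form: broadcast the group mean, then subtract.
df["centered_vec"] = df["col3"] - grouped.transform("mean")

print(df)
```

The vectorised form avoids calling a Python function once per group, which usually matters when there are many groups.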

Multiple group-by with one common variable with pandas?

I want to mark duplicate values within an ID group. For example
ID A B
i1 a1 b1
i1 a1 b2
i1 a2 b2
i2 a1 b2
should become
ID A An B Bn
i1 a1 2 b1 1
i1 a1 2 b2 2
i1 a2 1 b2 2
i2 a1 1 b2 1
Basically, An and Bn count multiplicity within each ID group. How can I do this in pandas? I've found groupby, but it was quite messy to put everything together. I also tried individual groupby calls for ID, A and ID, B. Maybe there is a way to pre-group by ID first and then handle all the other variables? (There are many variables and I have very many rows!)
Regarding "Also I tried individual groupby for ID, A and ID, B": I think this is a straightforward way to tackle it. As you suggest, you can group by each pair separately and compute the size of the groups, using transform so that the results can easily be added to the original dataframe:
df['An'] = df.groupby(['ID','A'])['A'].transform('size')
df['Bn'] = df.groupby(['ID','B'])['B'].transform('size')
print(df)
ID A B An Bn
0 i1 a1 b1 2 1
1 i1 a1 b2 2 2
2 i1 a2 b2 1 2
3 i2 a1 b2 1 1
Of course, with lots of columns you could do:
for col in ['A','B']:
    df[col + 'n'] = df.groupby(['ID',col])[col].transform('size')
The duplicated method can also be used to give you something similar, but it will mark observations within a group after the first as duplicates:
for col in ['A','B']:
    df[col + 'n'] = df.duplicated(['ID',col])
print(df)
ID A B An Bn
0 i1 a1 b1 False False
1 i1 a1 b2 True False
2 i1 a2 b2 False True
3 i2 a1 b2 False False
EDIT: increasing performance for large data. I did it on a large dataset (4 million rows) and it was significantly faster if I avoided transform with something like the following (it is much less elegant):
for col in ['A','B']:
    x = df.groupby(['ID',col]).size()
    df.set_index(['ID',col],inplace=True)
    df[col + 'n'] = x
    df.reset_index(inplace=True)
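A possible variant of the same idea, sketched under the assumption that joining the group sizes back on the key columns is acceptable: DataFrame.join with on= leaves the original index and column order untouched, so no set_index/reset_index round trip is needed. I have not benchmarked it, so whether it matches the speed-up reported above is not established.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "ID": ["i1", "i1", "i1", "i2"],
    "A": ["a1", "a1", "a2", "a1"],
    "B": ["b1", "b2", "b2", "b2"],
})

for col in ["A", "B"]:
    # One count per (ID, col) combination, named like the target column.
    counts = df.groupby(["ID", col]).size().rename(col + "n")
    # Look up each row's (ID, col) pair in the counts' MultiIndex.
    df = df.join(counts, on=["ID", col])

print(df)
```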