pandas - slice dataframe with condition of end point

I would like to use a search result as the end point to slice a DataFrame.
import pandas as pd
df = pd.DataFrame({'A':['apple','orange','bananna','watermelon'],'B':[1,2,3,2]})
print(df)
pos = df[df['A'].str.contains('ban')]
print(pos)
            A  B
0       apple  1
1      orange  2
2     bananna  3
3  watermelon  2

         A  B
2  bananna  3
For the example below, I would like to get the output from the first row up to the row that starts with 'ban', as below:
         A  B
0    apple  1
1   orange  2
2  bananna  3

You can use boolean masking and the .index attribute to locate the last match:
condition = df[df['A'].str.contains('ban')].index[-1]
Now finally use the loc[] or iloc[] accessor (loc slicing is end-inclusive, while iloc is end-exclusive, hence the +1):
result = df.loc[:condition, :]
OR
result = df.iloc[:condition + 1, :]
Now if you print result you will get:
         A  B
0    apple  1
1   orange  2
2  bananna  3
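If at least one match is guaranteed, the same slice can be written in one line with idxmax, which returns the label of the first True in the boolean mask (a minimal sketch; with no match it would wrongly return the first row's label):
result = df.loc[:df['A'].str.contains('ban').idxmax()]
print(result)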

Here is a traditional loop-based method:
lst = []
for row in df.iterrows():
    lst.append(list(row[1]))                # appending every row as a list to a temporary list
    if str(row[1]['A']).startswith('ban'):  # stop once the condition is met
        break
new_df = pd.DataFrame(lst)
print(new_df)
Output:
         0  1
0    apple  1
1   orange  2
2  bananna  3
:)
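For larger frames, a vectorized alternative to the loop is a cumulative mask that stays True up to and including the first match (a sketch, assuming the same df as above):
mask = df['A'].str.startswith('ban')
# cumsum is 0 strictly before the first match; shifting by one keeps the matching row
result = df[mask.cumsum().shift(fill_value=0).eq(0)]
print(result)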

Related

pandas finding duplicate rows with different label

I have a case where I want to sanity-check labeled data. I have hundreds of features and want to find points that have the same features but a different label. Each such cluster of disagreeing labels should then be numbered and put into a new dataframe.
This isn't hard but I am wondering what the most elegant solution for this is.
Here an example:
import pandas as pd
df = pd.DataFrame({
    "feature_1": [0, 0, 0, 4, 4, 2],
    "feature_2": [0, 5, 5, 1, 1, 3],
    "label": ["A", "A", "B", "B", "D", "A"]
})
result_df = pd.DataFrame({
    "cluster_index": [0, 0, 1, 1],
    "feature_1": [0, 0, 4, 4],
    "feature_2": [5, 5, 1, 1],
    "label": ["A", "B", "B", "D"]
})
In order to get the output you want (both de-duplication and cluster_index), you can use a groupby approach:
g = df.groupby(['feature_1', 'feature_2'])['label']
(df.assign(cluster_index=g.ngroup())     # get the group number
   .loc[g.transform('size').gt(1)]       # filter out the non-duplicates
   # line below only to get a nice cluster_index range (0, 1, ...)
   .assign(cluster_index=lambda d: d['cluster_index'].factorize()[0])
)
Output:
   feature_1  feature_2 label  cluster_index
1          0          5     A              0
2          0          5     B              0
3          4          1     B              1
4          4          1     D              1
First get all rows duplicated on the feature columns; then, if necessary, remove rows duplicated across all columns (not needed for this sample data); last, add GroupBy.ngroup for the group indices:
df = df[df.duplicated(['feature_1','feature_2'],keep=False)].drop_duplicates()
df['cluster_index'] = df.groupby(['feature_1', 'feature_2'])['label'].ngroup()
print(df)
   feature_1  feature_2 label  cluster_index
1          0          5     A              0
2          0          5     B              0
3          4          1     B              1
4          4          1     D              1
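Both answers filter on group size after dropping exact duplicates; to be strict about the labels actually disagreeing, you can instead filter on the number of distinct labels per feature group (a minimal sketch, assuming the same df):
dedup = df.drop_duplicates()
mask = dedup.groupby(['feature_1', 'feature_2'])['label'].transform('nunique') > 1
print(dedup[mask])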

Restructuring a Pandas series

I have the following series:
r = [1,2,3,4,'None']
ser = pd.Series(r, copy=False)
The output of which is:
ser
Out[406]:
0 1
1 2
2 3
3 4
4 None
At ser[1], I want to set the value to 'NULL' and shift the values [2, 3, 4] down by one index.
Therefore the desired output would be:
ser
Out[406]:
0 1
1 NULL
2 2
3 3
4 4
I did the following, which is not working:
slice_ser = ser[1:-1]
ser[2] = 'NULL'
ser[3:-1] = slice_ser
I am getting an error 'ValueError: cannot set using a slice indexer with a different length than the value'. How do I fix the issue?
I'd use shift for this:
>>> ser[1:] = ser[1:].shift(1).fillna('NULL')
>>> ser
0 1
1 NULL
2 2
3 3
4 4
dtype: object
You can shift the values after position 1 and assign them back. Note this leaves NaN rather than the string 'NULL'; chain .fillna('NULL') if you need the literal string:
ser.iloc[1:] = ser.iloc[1:].shift()
ser
0 1
1 NaN
2 2
3 3
4 4
dtype: object
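As for why the original slice attempt fails: ser[3:-1] selects just one element while slice_ser holds three, hence the length-mismatch ValueError. A minimal fix of that approach, using explicit positional indexing (a sketch, assuming the same ser as above):
slice_ser = ser.iloc[1:-1].copy()  # values [2, 3, 4]
ser.iloc[1] = 'NULL'
ser.iloc[2:] = slice_ser.values    # lengths now match: 3 == 3
print(ser)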

I want to remove specific rows and restart the values from 1

I have a dataframe that looks like this:
Time Value
1 5
2 3
3 3
4 2
5 1
I want to remove the first two rows and then restart time from 1. The dataframe should then look like:
Time Value
1 3
2 2
3 1
I attach the code:
file0 = pd.read_excel(r'C:......xlsx')
df = file0.loc[(file0['Time'] > 2) & (file0['Time'] < 11)]
df = df.reset_index()
Now what I get is:
index Time Value
0 3 3
1 4 2
2 5 1
Thank you!
You can use the .loc[] accessor and the reset_index() method:
df = df.loc[2:].reset_index(drop=True)
Finally, use a list comprehension:
df['Time'] = [x for x in range(1, len(df) + 1)]
Now if you print df you will get your desired output:
   Time  Value
0     1      3
1     2      2
2     3      1
You can use df.loc to extract the subset of the dataframe, reset the index, and then change the value of the Time column:
df = df.loc[2:].reset_index(drop=True)
df['Time'] = df.index + 1
print(df)
You have two ways to do that.
First:
df[2:].assign(Time=df.Time.values[:-2])
which returns your desired output:
   Time  Value
2     1      3
3     2      2
4     3      1
Second:
df = df.set_index('Time')
df['Value'] = df['Value'].shift(-2)
df.dropna()
This returns your output too, but it turns the numbers into float64:
      Value
Time
1       3.0
2       2.0
3       1.0
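If the float64 upcast from shift is unwanted, the integer dtype can be restored once the NaN rows are dropped (a minimal sketch of the same approach):
df = df.set_index('Time')
df['Value'] = df['Value'].shift(-2)
df = df.dropna().astype({'Value': 'int64'})
print(df)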

Append new column to DF after sum?

I have a sample dataframe below:
sn C1-1 C1-2 C1-3 H2-1 H2-2 K3-1 K3-2
1 4 3 5 4 1 4 2
2 2 2 0 2 0 1 2
3 1 2 0 0 2 1 2
I would like to sum based on the prefixes C1, H2, and K3 and output three new columns with the total sums. The final result is this:
sn total_c1 total_h2 total_k3
1 12 5 6
2 4 2 3
3 3 2 3
What I have tried on my original df:
lst = ["C1", "H2", "K3"]
lst2 = ["total_c1", "total_h2", "total_k3"]
for k in lst:
    idx = df.columns.str.startswith(k)
    for j in lst2:
        df[j] = df.iloc[:, idx].sum(axis=1)
df1 = df.append(df, sort=False)
But I kept getting this error:
IndexError: Item wrong length 35 instead of 36.
I can't figure out how to append the new total column to produce my end result in the loop.
Any help will be appreciated (or a better suggestion as opposed to a loop). Thank you.
You can use groupby:
# columns of interest
cols = df.columns[1:]
col_groups = cols.str.split('-').str[0]
out_df = df[['sn']].join(df[cols].groupby(col_groups, axis=1)
.sum()
.add_prefix('total_')
)
Output:
sn total_C1 total_H2 total_K3
0 1 12 5 6
1 2 4 2 3
2 3 3 2 3
Let us try split, then groupby with axis=1:
out = (df.groupby(df.columns.str.split('-').str[0], axis=1)
         .sum()
         .set_index('sn')
         .add_prefix('Total_')
         .reset_index())
Out[84]:
sn Total_C1 Total_H2 Total_K3
0 1 12 5 6
1 2 4 2 3
2 3 3 2 3
Another option, where we create a dictionary to groupby the columns:
mapping = {entry: f"total_{entry[:2]}" for entry in df.columns[1:]}
result = df.groupby(mapping, axis=1).sum()
result.insert(0, "sn", df.sn)
result
sn total_C1 total_H2 total_K3
0 1 12 5 6
1 2 4 2 3
2 3 3 2 3
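Note that DataFrame.groupby(..., axis=1), used in the answers above, is deprecated in recent pandas releases; a transpose-based equivalent (a sketch, assuming the same df) is:
out = (df.set_index('sn')
         .T
         .groupby(lambda c: c.split('-')[0])
         .sum()
         .T
         .add_prefix('total_')
         .reset_index())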

Sort data in Pandas dataframe alphabetically

I have a dataframe where I need to sort the contents of one column (comma-separated) alphabetically:
ID Data
1 Mo,Ab,ZZz
2 Ab,Ma,Bt
3 Xe,Aa
4 Xe,Re,Fi,Ab
Output:
ID Data
1 Ab,Mo,ZZz
2 Ab,Bt,Ma
3 Aa,Xe
4 Ab,Fi,Re,Xe
I have tried:
df.sort_values(by='Data')
But this does not work: sort_values reorders whole rows rather than sorting the values within each cell.
You can split, sort, and then join back:
df['Data'] = df['Data'].apply(lambda x: ','.join(sorted(x.split(','))))
Or use a list comprehension alternative:
df['Data'] = [','.join(sorted(x.split(','))) for x in df['Data']]
print(df)
ID Data
0 1 Ab,Mo,ZZz
1 2 Ab,Bt,Ma
2 3 Aa,Xe
3 4 Ab,Fi,Re,Xe
IIUC, use get_dummies: the indicator columns it produces are alphabetically sorted, and a dot product with the column names concatenates the names present in each row:
s = df.Data.str.get_dummies(',')
df['n'] = s.dot(s.columns + ',').str[:-1]
df
Out[216]:
ID Data n
0 1 Mo,Ab,ZZz Ab,Mo,ZZz
1 2 Ab,Ma,Bt Ab,Bt,Ma
2 3 Xe,Aa Aa,Xe
3 4 Xe,Re,Fi,Ab Ab,Fi,Re,Xe
IIUC you can use a list comprehension:
[','.join(sorted(i.split(','))) for i in df['Data']]
#['Ab,Mo,ZZz', 'Ab,Bt,Ma', 'Aa,Xe', 'Ab,Fi,Re,Xe']
Using explode and sort_values:
df["Sorted_Data"] = (
df["Data"].str.split(",").explode().sort_values().groupby(level=0).agg(','.join)
)
print(df)
ID Data Sorted_Data
0 1 Mo,Ab,ZZz Ab,Mo,ZZz
1 2 Ab,Ma,Bt Ab,Bt,Ma
2 3 Xe,Aa Aa,Xe
3 4 Xe,Re,Fi,Ab Ab,Fi,Re,Xe
Using row iteration, writing back with .at (mutating the row object yielded by iterrows is not guaranteed to update the frame):
for index, row in df.iterrows():
    df.at[index, 'Data'] = ','.join(sorted(row['Data'].split(',')))
In [29]: df
Out[29]:
   ID         Data
0   1    Ab,Mo,ZZz
1   2     Ab,Bt,Ma
2   3        Aa,Xe
3   4  Ab,Fi,Re,Xe
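A vectorized variant using pandas string methods (a minimal sketch, assuming the column contains no NaN values):
df['Data'] = df['Data'].str.split(',').apply(sorted).str.join(',')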