How to drop duplicated columns based on column name in pandas


Assume I have a table like below
A B C B
0 0 1 2 3
1 4 5 6 7
I'd like to drop the duplicated column B. I tried drop_duplicates, but it seems to work only on duplicated data, not on duplicated headers.
I hope someone knows how to do this.

Use Index.duplicated with loc or iloc and boolean indexing:
print (~df.columns.duplicated())
[ True True True False]
df = df.loc[:, ~df.columns.duplicated()]
print (df)
A B C
0 0 1 2
1 4 5 6
df = df.iloc[:, ~df.columns.duplicated()]
print (df)
A B C
0 0 1 2
1 4 5 6
Timings:
np.random.seed(123)
cols = ['A','B','C','B']
#[1000 rows x 30 columns]
df = pd.DataFrame(np.random.randint(10, size=(1000,30)),columns = np.random.choice(cols, 30))
print (df)
In [115]: %timeit (df.groupby(level=0, axis=1).first())
1000 loops, best of 3: 1.48 ms per loop
In [116]: %timeit (df.groupby(level=0, axis=1).mean())
1000 loops, best of 3: 1.58 ms per loop
In [117]: %timeit (df.iloc[:, ~df.columns.duplicated()])
1000 loops, best of 3: 338 µs per loop
In [118]: %timeit (df.loc[:, ~df.columns.duplicated()])
1000 loops, best of 3: 346 µs per loop
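The boolean-mask approach above can be collected into a small self-contained sketch. The helper name drop_duplicate_columns and its keep argument are only illustrative; keep is simply forwarded to Index.duplicated, which accepts 'first', 'last', or False.

```python
import pandas as pd

def drop_duplicate_columns(df, keep="first"):
    """Return a copy of df without columns whose name is duplicated.

    keep is passed to Index.duplicated: "first" keeps the first column of
    each name, "last" keeps the last one, and False drops every column
    whose name occurs more than once.
    """
    return df.loc[:, ~df.columns.duplicated(keep=keep)].copy()

df = pd.DataFrame([[0, 1, 2, 3], [4, 5, 6, 7]], columns=["A", "B", "C", "B"])

first = drop_duplicate_columns(df)                    # columns A, B, C with B = [1, 5]
last = drop_duplicate_columns(df, keep="last")        # columns A, C, B with B = [3, 7]
unique_only = drop_duplicate_columns(df, keep=False)  # columns A, C
```

Note that keep="last" preserves the position of the surviving column, so the result is ordered A, C, B rather than A, B, C.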

You can use groupby.
We use the axis=1 and level=0 parameters to specify that we are grouping by columns. Then we use the first method to take the first column within each group of identically named columns.
df.groupby(level=0, axis=1).first()
A B C
0 0 1 2
1 4 5 6
We could have also used last
df.groupby(level=0, axis=1).last()
A B C
0 0 3 2
1 4 7 6
Or mean
df.groupby(level=0, axis=1).mean()
A B C
0 0 2 2
1 4 6 6
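In recent pandas releases (2.1 and later, as far as I am aware), groupby with axis=1 is deprecated. A minimal sketch of the same idea that avoids axis=1 is to transpose, group on the index, and transpose back. This assumes the frame is small enough, and uniform enough in dtype, that transposing is acceptable.

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2, 3], [4, 5, 6, 7]], columns=["A", "B", "C", "B"])

# After transposing, the column names become the row index,
# so an ordinary row-wise groupby on level 0 groups equal names together.
grouped = df.T.groupby(level=0)

first = grouped.first().T  # B taken from the first B column
last = grouped.last().T    # B taken from the last B column
mean = grouped.mean().T    # B averaged over both B columns
```

As with the axis=1 version, the resulting columns come out sorted by name.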

Related

pandas dataframe how to replace extreme outliers for all columns

I have a pandas dataframe with some very extreme values, more than 5 std.
I want to replace, per column, each value that is more than 5 std with the maximum of the other values.
For example,
df = A B
1 2
1 6
2 8
1 115
191 1
Will become:
df = A B
1 2
1 6
2 8
1 8
2 1
What is the best way to do it without a for loop over the columns?

s = df.mask((df - df.apply(lambda x: x.std())).gt(5))  # mask where the condition applies
s = s.assign(A=s.A.fillna(s.A.max()), B=s.B.fillna(s.B.max())).sort_index(axis=0)  # fill with the max per column and re-sort the frame
A B
0 1.0 2.0
1 1.0 6.0
2 2.0 8.0
3 1.0 8.0
4 2.0 1.0

Per the discussion in the comments, you need to decide what your threshold is. Say it is q = 100; then you can do:
q = 100
df.loc[df['A'] > q,'A'] = max(df.loc[df['A'] < q,'A'] )
df
This fixes column A:
A B
0 1 2
1 1 6
2 2 8
3 1 115
4 2 1
Do the same for B.
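One way to apply the same threshold to every column at once, instead of repeating the assignment per column, is sketched below. It assumes a single threshold q shared by all columns (the value 100 comes from the example above) and uses DataFrame.clip with a per-column cap; the names caps and result are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 1, 191], "B": [2, 6, 8, 115, 1]})
q = 100

# Per-column cap: the largest value in each column that does not exceed q.
caps = df.where(df <= q).max()

# Values above the cap are pulled down to it; everything else is unchanged.
# axis=1 aligns the caps with the column labels.
result = df.clip(upper=caps, axis=1)
```

Because the caps are computed through NaN-masked data, the result may come back as floats even when the input is integer.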

Calculate a column-wise z-score (assuming you consider a value an outlier when it lies more than a given number of standard deviations from the column mean), then build a boolean mask of the values outside your desired range:
def calc_zscore(col):
    return (col - col.mean()) / col.std()
zscores = df.apply(calc_zscore, axis=0)
outlier_mask = zscores > 5
After that it's up to you to fill the values marked with the boolean mask.
df[outlier_mask] = something
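A minimal sketch of the fill step, under the assumption that "something" means the largest non-outlier value of the same column, as the question asks. The function name replace_upper_outliers is hypothetical. One caveat: with the sample standard deviation, a z-score can never exceed (n - 1) / sqrt(n), which is about 1.79 for the five-row example, so the demonstration uses an illustrative threshold of 1.5 rather than 5.

```python
import pandas as pd

def replace_upper_outliers(df, threshold):
    """Replace values whose z-score exceeds threshold with the column's
    largest remaining value. Only upper outliers are handled."""
    zscores = (df - df.mean()) / df.std()
    kept = df.mask(zscores > threshold)  # outliers become NaN
    return kept.fillna(kept.max())       # a Series fills column by column

df = pd.DataFrame({"A": [1, 1, 2, 1, 191], "B": [2, 6, 8, 115, 1]})
result = replace_upper_outliers(df, threshold=1.5)
```

On real data with many rows, the threshold of 5 from the question can be passed unchanged.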

Remove rows in pandas df with index values within a range

I would like to remove all rows in a pandas df that have an index value within 4 counts of the index value of the previous row.
In the pandas df below,
A B
0 1 1
5 5 5
8 9 9
9 10 10
Only the row with index value 0 should remain.
Thanks!

Compare each index value with the next one, collect the labels to keep in a list, and pass that list to loc. I chose a list so that loc returns a DataFrame as the final output.
ind = [ a for a,b in zip(df.index,df.index[1:]) if b-a > 4]
df.loc[ind]
A B
0 1 1

You can use reset_index, diff and shift:
In [1309]: df
Out[1309]:
A B
0 1 1
5 5 5
8 9 9
9 10 10
In [1310]: d = df.reset_index()
In [1313]: df = d[d['index'].diff(1).shift(-1) >= 4].drop('index', axis=1)
In [1314]: df
Out[1314]:
A B
0 1 1
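Both answers keep a row when the gap to the following index value is large enough, which reproduces the expected output. The same idea can be sketched without reset_index, working on the index directly. The strict cutoff of > 4 is an assumption here (the two answers above differ between > 4 and >= 4), and the name gap_to_next is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 5, 9, 10], "B": [1, 5, 9, 10]}, index=[0, 5, 8, 9])

idx = df.index.to_series()
gap_to_next = idx.shift(-1) - idx  # distance to the next index value; NaN for the last row

# Comparisons with NaN are False, so the last row is never kept.
result = df[gap_to_next > 4]
```

If the intent is instead to compare each row with the previous one, idx.diff() gives that gap and the first row needs separate handling, since its gap is NaN.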

pandas dataframe filter by sequence of values in a specific column

I have a dataframe
A B C
1 2 3
2 3 4
3 8 7
I want to keep only the rows that form the sequence 3, 4 in column C (in this example, the first two rows).
What is the best way to do so?

You can use rolling for a general solution that works with any pattern:
pat = np.asarray([3,4])
N = len(pat)
mask = (df['C'].rolling(window=N, min_periods=N)
               .apply(lambda x: (x == pat).all(), raw=True)
               .mask(lambda x: x == 0)
               .bfill(limit=N-1)
               .fillna(0)
               .astype(bool))
df = df[mask]
print (df)
A B C
0 1 2 3
1 2 3 4
Explanation:
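Use rolling.apply to test each window against the pattern.
Replace the 0s with NaNs using mask.
Use bfill with limit to back-fill the NaNs that precede a match, so every row of a matched window is flagged.
Replace the remaining NaNs with 0 using fillna.
Finally, cast to bool with astype.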
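The same window test can also be sketched directly in NumPy, which some may find easier to follow than the rolling chain. This is an alternative technique, not the answer's method: it assumes NumPy 1.20 or later for sliding_window_view, and the helper name rows_in_pattern is hypothetical.

```python
import numpy as np
import pandas as pd

def rows_in_pattern(values, pattern):
    """Boolean mask that is True for every element belonging to a
    contiguous occurrence of pattern in values."""
    values = np.asarray(values)
    pattern = np.asarray(pattern)
    n = len(pattern)
    mask = np.zeros(len(values), dtype=bool)
    if n == 0 or len(values) < n:
        return mask
    # One row per window of length n, compared element-wise with the pattern.
    windows = np.lib.stride_tricks.sliding_window_view(values, n)
    starts = np.flatnonzero((windows == pattern).all(axis=1))
    # Flag all n positions covered by each matching window.
    for offset in range(n):
        mask[starts + offset] = True
    return mask

df = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 8], "C": [3, 4, 7]})
result = df[rows_in_pattern(df["C"], [3, 4])]
```

The loop runs over the pattern length, not over the rows, so it stays cheap for short patterns.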

Use shift
In [1085]: s = df.eq(3).any(axis=1) & df.shift(-1).eq(4).any(axis=1)
In [1086]: df[s | s.shift()]
Out[1086]:
A B C
0 1 2 3
1 2 3 4
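Note that the expression above matches a 3 followed by a 4 in any columns, not necessarily in C. A sketch restricted to column C, as the question describes, might look like this; the names starts and mask are only illustrative, and the two-value pattern is hard-coded.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 8], "C": [3, 4, 7]})

# True on a row where C is 3 and the next row's C is 4.
starts = df["C"].eq(3) & df["C"].shift(-1).eq(4)

# Also flag the row right after each start; fill_value avoids NaN in the mask.
mask = starts | starts.shift(fill_value=False)
result = df[mask]
```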

How to keep index in pandas pivot table

Suppose I create a pandas pivot table:
adults_per_hh= pd.pivot_table(data,index=["hh_id"],values=["adult"],aggfunc=np.sum)
adults_per_hh.shape
(1000,1)
I want to keep hh_id as a column in addition to adult. What is the most efficient way to do this?

I think you need reset_index if you use pivot_table, because the first column is the index:
print (data)
adult hh_id
0 4 1
1 5 1
2 6 3
3 1 2
4 2 2
print (pd.pivot_table(data,index=["hh_id"],values=["adult"],aggfunc=np.sum))
adult
hh_id
1 9
2 3
3 6
adults_per_hh = pd.pivot_table(data, index=["hh_id"], values=["adult"], aggfunc=np.sum).reset_index()
print (adults_per_hh)
hh_id adult
0 1 9
1 2 3
2 3 6
Another solution is to use groupby and aggregate with sum:
adults_per_hh = data.groupby("hh_id")["adult"].sum().reset_index()
print (adults_per_hh)
hh_id adult
0 1 9
1 2 3
2 3 6
Timings:
#random dataframe
np.random.seed(100)
N = 10000000
data = pd.DataFrame(np.random.randint(50, size=(N,2)), columns=['hh_id','adult'])
#[10000000 rows x 2 columns]
print (data)
In [60]: %timeit (pd.pivot_table(data,index=["hh_id"],values=["adult"],aggfunc=np.sum).reset_index())
1 loop, best of 3: 384 ms per loop
In [61]: %timeit (data.groupby("hh_id", as_index=False)["adult"].sum())
1 loop, best of 3: 381 ms per loop
In [62]: %timeit (data.groupby("hh_id")["adult"].sum().reset_index())
1 loop, best of 3: 355 ms per loop
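For reference, the variants compared above can be put side by side on the small sample frame. This is only a sketch: it passes the aggregation as the string "sum" instead of np.sum, which I believe avoids a warning in newer pandas versions, and the variable names are illustrative.

```python
import pandas as pd

data = pd.DataFrame({"adult": [4, 5, 6, 1, 2], "hh_id": [1, 1, 3, 2, 2]})

# pivot_table puts hh_id in the index; reset_index turns it back into a column.
via_pivot = pd.pivot_table(data, index="hh_id", values="adult", aggfunc="sum").reset_index()

# as_index=False keeps hh_id as a regular column from the start.
via_groupby = data.groupby("hh_id", as_index=False)["adult"].sum()
```

Both results have the columns hh_id and adult, with one row per household.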

How to set a pandas dataframe equal to a row?

I know how to set the pandas data frame equal to a column.
i.e.:
df = df['col1']
What is the equivalent for a row, say by taking its index? And how would I remove one or more rows?
Many thanks.

If you want to take a copy of a row then you can either use loc for label indexing or iloc for integer based indexing:
In [104]:
df = pd.DataFrame({'a':np.random.randn(10),'b':np.random.randn(10)})
df
Out[104]:
a b
0 1.216387 -1.298502
1 1.043843 0.379970
2 0.114923 -0.125396
3 0.531293 -0.386598
4 -0.278565 1.224272
5 0.491417 -0.498816
6 0.222941 0.183743
7 0.322535 -0.510449
8 0.695988 -0.300045
9 -0.904195 -1.226186
In [106]:
row = df.iloc[3]
row
Out[106]:
a 0.531293
b -0.386598
Name: 3, dtype: float64
If you want to remove that row then you can use drop:
In [107]:
df.drop(3)
Out[107]:
a b
0 1.216387 -1.298502
1 1.043843 0.379970
2 0.114923 -0.125396
4 -0.278565 1.224272
5 0.491417 -0.498816
6 0.222941 0.183743
7 0.322535 -0.510449
8 0.695988 -0.300045
9 -0.904195 -1.226186
You can also use a slice or pass a list of labels:
In [109]:
rows = df.loc[[3,5]]
row_slice = df.loc[3:5]
print(rows)
print(row_slice)
a b
3 0.531293 -0.386598
5 0.491417 -0.498816
a b
3 0.531293 -0.386598
4 -0.278565 1.224272
5 0.491417 -0.498816
Similarly you can pass a list to drop:
In [110]:
df.drop([3,5])
Out[110]:
a b
0 1.216387 -1.298502
1 1.043843 0.379970
2 0.114923 -0.125396
4 -0.278565 1.224272
6 0.222941 0.183743
7 0.322535 -0.510449
8 0.695988 -0.300045
9 -0.904195 -1.226186
If you wanted to drop a slice then you can slice your index and pass this to drop:
In [112]:
df.drop(df.index[3:5])
Out[112]:
a b
0 1.216387 -1.298502
1 1.043843 0.379970
2 0.114923 -0.125396
5 0.491417 -0.498816
6 0.222941 0.183743
7 0.322535 -0.510449
8 0.695988 -0.300045
9 -0.904195 -1.226186
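With the default 0..n-1 index used above, labels and positions coincide, which hides the difference between loc and iloc. A short sketch with a non-default index (the labels 10, 20, 30, 40 are made up for illustration) shows where they diverge:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]},
                  index=[10, 20, 30, 40])

by_label = df.loc[20]        # row whose index label is 20
by_position = df.iloc[1]     # second row, whatever its label is
label_slice = df.loc[20:30]  # label slices include both endpoints
pos_slice = df.iloc[1:3]     # position slices exclude the end position

without_20 = df.drop(20)                   # drop takes labels, not positions
without_first_two = df.drop(df.index[:2])  # convert positions to labels first
```

None of these calls modify df; each returns a new object, so assign the result back if you want to keep it.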
