Pandas: Delete duplicated items in a specific column

I have a pandas DataFrame (here represented using Excel):
Now I would like to delete all duplicates (1) in a specific column (B).
How can I do it?
For this example, the result would look like this:

You can use duplicated to build a boolean mask and then set the duplicates to NaN with loc, mask or numpy.where:
df.loc[df['B'].duplicated(), 'B'] = np.nan
df['B'] = df['B'].mask(df['B'].duplicated())
df['B'] = np.where(df['B'].duplicated(), np.nan, df['B'])
Alternatively, if you need to remove the duplicated rows by column B:
df = df.drop_duplicates(subset=['B'])
Sample:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'A': [1, 5, 7, 9],
'B': [1, 2, 1, 3]
})
print (df)
A B
0 1 1
1 5 2
2 7 1
3 9 3
df.loc[df['B'].duplicated(), 'B'] = np.nan
print (df)
A B
0 1 1.0
1 5 2.0
2 7 NaN
3 9 3.0
# alternative, applied to the original df (before the NaN assignment above)
df = df.drop_duplicates(subset=['B'])
print (df)
A B
0 1 1
1 5 2
3 9 3
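As a minimal sketch of how the mask behaves, assuming the same four-row sample frame: duplicated() keeps the first occurrence by default, and its keep parameter changes which members of a duplicated group get flagged. The variable names below are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 5, 7, 9], 'B': [1, 2, 1, 3]})

# Default (keep='first'): only repeats after the first occurrence are flagged
first_kept = df['B'].duplicated()

# keep=False: every member of a duplicated group is flagged, including the first
all_marked = df['B'].duplicated(keep=False)

# Blank out the repeats but keep the rows
masked = df.assign(B=df['B'].mask(first_kept))

# Drop the rows that hold a repeated B value
dropped = df.drop_duplicates(subset=['B'])

print(first_kept.tolist())   # [False, False, True, False]
print(all_marked.tolist())   # [True, False, True, False]
print(dropped.index.tolist())  # [0, 1, 3]
```

Masking keeps the frame's shape but turns B into a float column because of the NaN, while drop_duplicates keeps the integer dtype and removes whole rows.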

Related

creating a new dataframe from 3 other dataframes but columns must have specific order without specifying the name of the columns

Assume we have 3 dataframes named df1, df2, df3. Each of these dataframes has 100 rows and 15 columns. I want to create a new dataframe that will have the first column of df1, then the first column of df2, then the first column of df3. Then it will have the second column of df1, then the second column of df2, then the second column of df3, and so on until all 15 columns of each of the three dataframes are included. For example
df1
A B C ... O
1 1 1 1
1 1 1 1
... ... ... ...
df2
A B C ... O
2 2 2 2
2 2 2 2
... ... ... ...
df3
A B C ... O
3 3 3 3
3 3 3 3
... ... ... ...
The expected output should be something like the following
dfnew
A_df1 A_df2 A_df3 B_df1 B_df2 B_df3 ... O_df1 O_df2 O_df3
1 2 3 1 2 3 1 2 3
1 2 3 1 2 3 1 2 3
... ... ... ...
My issue is that I cannot use the names of the columns to specify them. For example I know how to do it like this
# create a list of the dataframes
dfs = [df1, df2, df3]
# concatenate the dataframes along the columns axis (axis=1)
dfnew = pd.concat(dfs, axis=1)
# specify the column names for the new dataframe
column_names = ["column1", "column2", ..., "column15"]
# concatenate the dataframes along the columns axis (axis=1)
# and specify the column names for the new dataframe
dfnew = pd.concat(dfs, axis=1, columns=column_names)
but I cannot use the column names because they will change every time. Plus, it seems like there could be a faster way than hard-coding them, perhaps by using the .loc function
Example
data1 = {'A': {0: 1, 1: 1}, 'B': {0: 1, 1: 1}, 'C': {0: 1, 1: 1}}
df1 = pd.DataFrame(data1)
df2 = df1.replace(1, 2).copy()
df3 = df1.replace(1, 3).copy()
df1
A B C
0 1 1 1
1 1 1 1
df2
A B C
0 2 2 2
1 2 2 2
df3
A B C
0 3 3 3
1 3 3 3
Code
dfs = (pd.concat([df1, df2, df3], axis=1, keys=['df1', 'df2', 'df3'])
.sort_index(level=1, axis=1).swaplevel(0, 1, axis=1))
dfs
A B C
df1 df2 df3 df1 df2 df3 df1 df2 df3
0 1 2 3 1 2 3 1 2 3
1 1 2 3 1 2 3 1 2 3
dfs.set_axis(dfs.columns.map('_'.join), axis=1)
A_df1 A_df2 A_df3 B_df1 B_df2 B_df3 C_df1 C_df2 C_df3
0 1 2 3 1 2 3 1 2 3
1 1 2 3 1 2 3 1 2 3
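Note that sort_index orders the columns by label, so the approach above relies on the column names already being in sorted order. A position-only variant can be sketched as follows; it assumes every frame has the same number of columns in the same order, and the dictionary keys used as suffixes are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 1], 'B': [1, 1], 'C': [1, 1]})
df2 = df1.replace(1, 2)
df3 = df1.replace(1, 3)
frames = {'df1': df1, 'df2': df2, 'df3': df3}

# Suffix each frame's columns and place the frames side by side
wide = pd.concat(
    [frame.add_suffix(f'_{name}') for name, frame in frames.items()],
    axis=1,
)

n_frames = len(frames)
n_cols = df1.shape[1]

# Positions laid out as (frame, column), transposed to (column, frame)
# and flattened: 0, 3, 6, 1, 4, 7, 2, 5, 8 for three 3-column frames
order = np.arange(n_frames * n_cols).reshape(n_frames, n_cols).T.ravel()

dfnew = wide.iloc[:, order]
print(list(dfnew.columns))
# ['A_df1', 'A_df2', 'A_df3', 'B_df1', 'B_df2', 'B_df3', 'C_df1', 'C_df2', 'C_df3']
```

Because the interleaving is computed from positions with iloc, no column name is ever written out and the original column order is preserved.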

how to subtract 2 data frame with the same size?

How can I subtract the values of 2 different data frames with the same size and columns?
for example df1-df2 in the following:
df1:
A B
4 5
0 6
df2:
A B
6 0
7 1
output:
diff:
A B
-2 5
-7 5
Note: I have too many columns and rows, so please don't suggest manual methods. No for loops, please.
I guess this is what you want.
df1 = pd.DataFrame({"A": [4,0], "B": [5,6]})
df2 = pd.DataFrame({"A": [6,7], "B": [0,1]})
df = df1 - df2
df
Out[4]:
A B
0 -2 5
1 -7 5
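One thing worth knowing about df1 - df2 is that pandas aligns on row and column labels before subtracting, not on position. The sketch below assumes a hypothetical df2 whose row labels differ from df1's, to show the difference between the aligned and the positional result.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [4, 0], "B": [5, 6]})
# Hypothetical: same shape and columns as df1, but different row labels
df2 = pd.DataFrame({"A": [6, 7], "B": [0, 1]}, index=[10, 11])

# Label alignment: no row label is shared, so every cell is NaN
aligned = df1 - df2

# Positional subtraction: drop df2's labels by converting to a NumPy array
positional = df1 - df2.to_numpy()

print(aligned.shape)                # (4, 2)
print(positional.values.tolist())   # [[-2, 5], [-7, 5]]
```

When both frames share the same index and columns, as in the question, the plain df1 - df2 already gives the positional result.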

How to merge two dataframes according to their indexes?

I am trying to use pandas for data analysis. I need to merge two DataFrames according to their indexes. However, their indexes are totally different. The rule is that if an index value of df2 is a substring of an index value of df1, the two rows should be merged. For example, df1.index == ['a/aa/aaa','b/bb/bbb', 'c/cc/ccc'] and df2.index == ['bb/bbb', 'ccc', 'hello']. Here df1 and df2 have two index values in common, so they should be merged on those. What should I do?
Given your DataFrames:
>>> df1 = pd.DataFrame({'col_a': [1, 2, 3]}, index=['a/aa/aaa','b/bb/bbb', 'c/cc/ccc'])
>>> df2 = pd.DataFrame({'col_b': [4, 5, 6]}, index=['bb/bbb', 'ccc', 'hello'])
And turning the index of each into a column:
>>> df1=df1.reset_index(drop=False)
>>> df1 = df1.rename(columns={'index': 'value_df1'})
>>> df1
value_df1 col_a
0 a/aa/aaa 1
1 b/bb/bbb 2
2 c/cc/ccc 3
>>> df2=df2.reset_index(drop=False)
>>> df2 = df2.rename(columns={'index': 'value_df2'})
>>> df2
value_df2 col_b
0 bb/bbb 4
1 ccc 5
2 hello 6
We merge both DataFrames on a constant join column, which pairs every row of df1 with every row of df2:
>>> df1['join'] = 1
>>> df2['join'] = 1
>>> dfFull = df1.merge(df2, on='join').drop('join', axis=1)
>>> dfFull
value_df1 col_a value_df2 col_b
0 a/aa/aaa 1 bb/bbb 4
1 a/aa/aaa 1 ccc 5
2 a/aa/aaa 1 hello 6
3 b/bb/bbb 2 bb/bbb 4
4 b/bb/bbb 2 ccc 5
5 b/bb/bbb 2 hello 6
6 c/cc/ccc 3 bb/bbb 4
7 c/cc/ccc 3 ccc 5
8 c/cc/ccc 3 hello 6
Then we use an apply to match the initial index values:
>>> df2.drop('join', axis=1, inplace=True)
>>> dfFull['match'] = dfFull.apply(lambda x: x['value_df1'].find(x['value_df2']), axis=1).ge(0)
>>> dfFull
value_df1 col_a value_df2 col_b match
0 a/aa/aaa 1 bb/bbb 4 False
1 a/aa/aaa 1 ccc 5 False
2 a/aa/aaa 1 hello 6 False
3 b/bb/bbb 2 bb/bbb 4 True
4 b/bb/bbb 2 ccc 5 False
5 b/bb/bbb 2 hello 6 False
6 c/cc/ccc 3 bb/bbb 4 False
7 c/cc/ccc 3 ccc 5 True
8 c/cc/ccc 3 hello 6 False
Filtering on the rows where the match column is True and dropping the match column, we get the expected result:
>>> dfFull[dfFull['match']].drop(['match'], axis=1)
value_df1 col_a value_df2 col_b
3 b/bb/bbb 2 bb/bbb 4
7 c/cc/ccc 3 ccc 5
This solution is inspired by this post.
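The same idea can be sketched more compactly, assuming pandas 1.2 or later, where merge accepts how='cross' and the temporary join column is not needed. The substring test uses Python's in operator; variable names such as pairs and is_match are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'col_a': [1, 2, 3]}, index=['a/aa/aaa', 'b/bb/bbb', 'c/cc/ccc'])
df2 = pd.DataFrame({'col_b': [4, 5, 6]}, index=['bb/bbb', 'ccc', 'hello'])

# Move each index into a named column
left = df1.rename_axis('value_df1').reset_index()
right = df2.rename_axis('value_df2').reset_index()

# Cartesian product of the two frames (requires pandas >= 1.2)
pairs = left.merge(right, how='cross')

# Keep the pairs where the df2 value is a substring of the df1 value
is_match = [b in a for a, b in zip(pairs['value_df1'], pairs['value_df2'])]
result = pairs[is_match].reset_index(drop=True)

print(result)
```

Like the version above, this builds all len(df1) * len(df2) pairs, so it is only practical when that product fits comfortably in memory.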
Since you have a known delimiter, you can split on that delimiter, do some merging, and then add back in the original data.
# sample data
df1 = pd.DataFrame({'ColumnA': [1,2,3]}, index=['a/aa/aaa','b/bb/bbb', 'c/cc/ccc'])
df2 = pd.DataFrame({'ColumnB': [4,5,6]}, index=['bb/bbb', 'ccc', 'hello'])
# set original index as column
# make a copy of each dataframe to preserve original data
# reset index of copy to keep track of original row number
df1 = df1.reset_index()
copy_df1 = df1
copy_df1.index.name = 'row_df1'
copy_df1 = df1.reset_index()
df2 = df2.reset_index()
copy_df2 = df2
copy_df2.index.name = 'row_df2'
copy_df2 = copy_df2.reset_index()
# split on known delimiter and explode into rows for each substring
copy_df1['index'] = copy_df1['index'].str.split('/')
copy_df1 = copy_df1.explode('index')
copy_df2['index'] = copy_df2['index'].str.split('/')
copy_df2 = copy_df2.explode('index')
# merge based on substrings, drop duplicates in case of multiple substring matches
mrg = copy_df1[['row_df1','index']].merge(copy_df2[['row_df2','index']]).drop(columns='index')
mrg = mrg.drop_duplicates()
# merge back in original details
mrg = mrg.merge(df1, left_on='row_df1', right_index=True)
mrg = mrg.merge(df2, left_on='row_df2', right_index=True, suffixes=('_df1','_df2'))
The final output would be:
row_df1 row_df2 index_df1 ColumnA index_df2 ColumnB
0 1 0 b/bb/bbb 2 bb/bbb 4
2 2 1 c/cc/ccc 3 ccc 5
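Note that splitting on the delimiter matches rows that share at least one path component, which is a looser rule than the substring rule in the question. The two agree on the sample data, but they can differ, as this small sketch with hypothetical index values suggests:

```python
# Hypothetical index values, chosen only to show where the two rules differ
value_df1 = 'a/aa/aaa'
value_df2 = 'aa/zzz'

# Rule used by the split/explode approach: at least one shared component
shared_parts = set(value_df1.split('/')) & set(value_df2.split('/'))

# Rule stated in the question: value_df2 is a substring of value_df1
is_substring = value_df2 in value_df1

print(shared_parts)   # {'aa'}
print(is_substring)   # False
```

If strict substring matching matters, the merged result can be filtered afterwards with a substring check on index_df1 and index_df2.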

Sum columns in pandas having string and number

I need to sum column a and column b, which contain strings in the 1st row
>>> df
a b
0 c d
1 1 2
2 3 4
>>> df['sum'] = df.sum(1)
>>> df
a b sum
0 c d cd
1 1 2 3
2 3 4 7
I only need to add numeric values and get an output like
>>> df
a b sum
0 c d "dummyString/NaN"
1 1 2 3
2 3 4 7
I need to add only some columns
df['sum']=df['a']+df['b']
Solution for mixed data (numbers with strings):
I think the simplest way is to convert the non-numeric values to NaN after the sum with to_numeric:
df['sum'] = pd.to_numeric(df[['a','b']].sum(1), errors='coerce')
Or:
df['sum'] = pd.to_numeric(df['a']+df['b'], errors='coerce')
print (df)
a b sum
0 c d NaN
1 1 2 3.0
2 3 4 7.0
EDIT:
Solutions if the numbers are stored as strings: first convert to numeric and then sum:
df['sum'] = pd.to_numeric(df['a'], errors='coerce') + pd.to_numeric(df['b'], errors='coerce')
print (df)
a b sum
0 c d NaN
1 1 2 3.0
2 3 4 7.0
Or:
df['sum'] = (df[['a', 'b']].apply(lambda x: pd.to_numeric(x, errors='coerce'))
.sum(axis=1, min_count=1))
print (df)
a b sum
0 c d NaN
1 1 2 3.0
2 3 4 7.0
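The min_count=1 argument in the last variant matters because sum skips NaN by default, so a row with no valid numbers would otherwise sum to 0 rather than NaN. A minimal sketch, assuming the columns hold strings as in the EDIT above:

```python
import pandas as pd

df = pd.DataFrame({'a': ['c', '1', '3'], 'b': ['d', '2', '4']})

# Convert every column to numeric; non-numeric strings become NaN
numeric = df[['a', 'b']].apply(pd.to_numeric, errors='coerce')

# Default: NaN values are skipped, so an all-NaN row sums to 0.0
default_sum = numeric.sum(axis=1)

# min_count=1: at least one valid value is required, otherwise NaN
strict_sum = numeric.sum(axis=1, min_count=1)

print(default_sum.tolist())          # [0.0, 3.0, 7.0]
print(strict_sum.isna().tolist())    # [True, False, False]
```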

Re-index to insert missing rows in a multi-indexed dataframe

I have a MultiIndexed DataFrame with three levels of indices. I would like to expand my third level to contain all values in a given range, but only for the existing values in the two upper levels.
For example, assume the first level is name, the second level is date and the third level is hour. I would like to have rows for all 24 possible hours (even if some are currently missing), but only for the already existing names and dates. The values in new rows can be filled with zeros.
So a simple example input would be:
>>> import pandas as pd
>>> df = pd.DataFrame([[1,1,1,3],[2,2,1,4], [3,3,2,5]], columns=['A', 'B', 'C','val'])
>>> df.set_index(['A', 'B', 'C'], inplace=True)
>>> df
val
A B C
1 1 1 3
2 2 1 4
3 3 2 5
if the required values for C are [1,2,3], the desired output would be:
val
A B C
1 1 1 3
2 0
3 0
2 2 1 4
2 0
3 0
3 3 1 0
2 5
3 0
I know how to achieve this using groupby and applying a defined function for each group, but I was wondering if there was a cleaner way of doing this with reindex (I couldn't make this one work for a MultiIndex case, but perhaps I'm missing something)
Use -
partial_indices = [ i[0:2] for i in df.index.values ]
C_reqd = [1, 2, 3]
final_indices = [j+(i,) for j in partial_indices for i in C_reqd]
index = pd.MultiIndex.from_tuples(final_indices, names=['A', 'B', 'C'])
df2 = pd.DataFrame(pd.Series(0, index), columns=['val'])
df2.update(df)
Output
df2
val
A B C
1 1 1 3.0
2 0.0
3 0.0
2 2 1 4.0
2 0.0
3 0.0
3 3 1 0.0
2 5.0
3 0.0
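Since the question asks about reindex specifically, here is a hedged sketch of that route: build the full MultiIndex from the existing (A, B) pairs and the required C values, then reindex with fill_value=0. It assumes the original index has no duplicate entries; names such as c_required and full_index are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 1, 3], [2, 2, 1, 4], [3, 3, 2, 5]],
    columns=['A', 'B', 'C', 'val'],
).set_index(['A', 'B', 'C'])

c_required = [1, 2, 3]

# Unique (A, B) pairs that already exist in the index
pairs = df.index.droplevel('C').unique()

# Every existing (A, B) pair combined with every required C value
full_index = pd.MultiIndex.from_tuples(
    [(a, b, c) for a, b in pairs for c in c_required],
    names=df.index.names,
)

# Rows missing from df are created and filled with 0
result = df.reindex(full_index, fill_value=0)

print(result['val'].tolist())   # [3, 0, 0, 4, 0, 0, 0, 5, 0]
```

Taking the unique (A, B) pairs first also avoids repeated entries in the new index when several existing rows share the same pair.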