Remove the identical values and leave only the different ones - pandas

I would like to know if there is a better solution for keeping only the differing values (so they are easy to spot) and removing identical values under certain columns.
merged = pd.merge(us_df, gb_df, how='outer', indicator=True)
res = pd.merge(merged[merged['_merge'] == 'left_only'].drop('_merge', axis=1),
               merged[merged['_merge'] == 'right_only'].drop('_merge', axis=1),
               on=us_df.columns.tolist()[0:col_range],
               how='outer',
               suffixes=('_US', '_GB')).fillna(' ')
cols = [col for col in res.columns.tolist() if '_US' in col or '_GB' in col]
sorted_cols = [col for col in res.columns.tolist() if '_US' not in col and '_GB' not in col] + sorted(cols)
I get this table (res):
Id   ages_GB   ages_US   salary_GB   salary_US
6    45        45        34          67
43   12        11        65          65
So far, I used this iteration:
cols = ['ages_US', 'salary_US', 'ages_GB', 'salary_GB']
for i, row in res.iterrows():
    for us, gb in zip(cols[:len(cols) // 2], cols[len(cols) // 2:]):
        if row[us] == row[gb]:
            res.at[i, us] = res.at[i, gb] = ' '
to get the result (where identical values under columns in cols are replaced with " " (space)):
Id   ages_GB   ages_US   salary_GB   salary_US
6                        34          67
43   12        11
Is there another method to get a similar result?

Given your example, I think loc offers a simpler solution, assuming you want to compare two sets of columns.
I will first recreate a reproducible example of your dataset (I would recommend you include this in future questions, as it makes it easier to understand and answer your question: How to create a Minimal, Reproducible Example).
import pandas as pd

d = {
    'ages_GB': [45, 12],
    'ages_US': [45, 11],
    'salary_GB': [34, 65],
    'salary_US': [67, 65]
}
df = pd.DataFrame(data=d)
print(df)
Initial DataFrame
ages_GB ages_US salary_GB salary_US
0 45 45 34 67
1 12 11 65 65
The simplest solution I can think of is to use loc to reassign records to "" (or NaN) where ages_GB == ages_US, and again where salary_GB == salary_US.
df.loc[df.ages_GB == df.ages_US, ['ages_GB', 'ages_US']] = ["", ""]
df.loc[df.salary_GB == df.salary_US, ['salary_GB', 'salary_US']] = ["", ""]
Output
ages_GB ages_US salary_GB salary_US
0 34 67
1 12 11
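If you have many such column pairs, the same loc pattern generalizes with a small loop over the shared prefixes. A minimal sketch, assuming every paired column follows the '<name>_GB' / '<name>_US' naming convention used here:

# Blank out each _GB/_US pair wherever the two values match.
# Assumption: all paired columns end with '_GB' or '_US'.
prefixes = {c.rsplit('_', 1)[0] for c in df.columns if c.endswith(('_GB', '_US'))}
for p in prefixes:
    gb, us = f'{p}_GB', f'{p}_US'
    df.loc[df[gb] == df[us], [gb, us]] = ""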

For a generic method, you can groupby on axis=1 using the column prefixes, and flag the duplicated values to use with mask:
prefix = df.columns.str.extract('^([^_]+)', expand=False)
# ['Id', 'ages', 'ages', 'salary', 'salary']
m = df.groupby(prefix, axis=1).transform(lambda s: s.duplicated(keep=False))
out = df.mask(m, '')
Output:
Id ages_GB ages_US salary_GB salary_US
0 6 34 67
1 43 12 11
Intermediate m:
Id ages_GB ages_US salary_GB salary_US
0 False True True False False
1 False False False True True
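Note that groupby(..., axis=1) is deprecated in recent pandas (2.x). A sketch of the same idea via a double transpose, assuming the prefix Series built above:

# Group the transposed frame by prefix, flag duplicated values, transpose back.
m = (df.T
       .groupby(prefix.values)
       .transform(lambda s: s.duplicated(keep=False))
       .T)
out = df.mask(m, '')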

Related

Merge rows with same id, different values in 1 column to multiple columns

What I have: the number of values per id can differ, so sometimes one id has 4 rows with different values in the column val; the other columns all have the same values.
df1 = pd.DataFrame({'id':[1,1,1,2,2,2,3,3,3], 'val': ['06123','nick','#gmail','06454','abey','#gmail','06888','sisi'], 'media': ['nrc','nrc','nrc','nrc','nrc','nrc','nrc','nrc']})
What I need:
id   kolom 1   kolom 2   kolom 3   media
1    06123     nick      #gmail    nrc
2    06454     abey      #gmail    nrc
3    06888     sisi      None      nrc
I hope I gave a good example; thanks for the help.
# Collect all values per id into lists, then spread them over separate columns.
df2 = df1.groupby('id').agg(list)
df2['col 1'] = df2['val'].apply(lambda x: x[0] if len(x) > 0 else 'None')
df2['col 2'] = df2['val'].apply(lambda x: x[1] if len(x) > 1 else 'None')
df2['col 3'] = df2['val'].apply(lambda x: x[2] if len(x) > 2 else 'None')
df2['media'] = df2['media'].apply(lambda x: x[0] if len(x) > 0 else 'None')
df2 = df2.drop(columns='val')
Here is another way. Since your original dataframe doesn't have lists of the same length (which will get you a ValueError), you can define it as:
data = {"id":[1,1,1,2,2,2,3,3,3],
"val": ["06123","nick","#gmail","06454","abey","#gmail","06888","sisi"],
"media": ["nrc","nrc","nrc","nrc","nrc","nrc","nrc","nrc"]}
df = pd.DataFrame.from_dict(data, orient="index")
df = df.transpose()
>>> df
id val media
0 1 06123 nrc
1 1 nick nrc
2 1 #gmail nrc
3 2 06454 nrc
4 2 abey nrc
5 2 #gmail nrc
6 3 06888 nrc
7 3 sisi nrc
8 3 NaN NaN
Afterwards, you can replace the np.nan values with an empty string, so that you can group by the id column and join the values in val separated by a ,.
import numpy as np

df = df.replace(np.nan, "", regex=True)
df_new = df.groupby(["id"])["val"].apply(lambda x: ",".join(x)).reset_index()
>>> df_new
id val
0 1.0 06123,nick,#gmail
1 2.0 06454,abey,#gmail
2 3.0 06888,sisi,
Then, you only need to transform the new val column into 3 columns by splitting the string inside, with any method you want. For example,
new_cols = df_new["val"].str.split(",", expand=True)  # Good ol' split
df_new["kolom 1"] = new_cols[0]  # Assign to new columns
df_new["kolom 2"] = new_cols[1]
df_new["kolom 3"] = new_cols[2]
df_new = df_new.drop(columns="val")  # Delete previous val (the positional axis argument is gone in pandas 2.x)
df_new["media"] = "nrc"  # Add the media column again
df_new = df_new.replace("", np.nan, regex=True)  # If necessary, replace empty strings with np.nan
>>> df_new
id kolom 1 kolom 2 kolom 3 media
0 1.0 06123 nick #gmail nrc
1 2.0 06454 abey #gmail nrc
2 3.0 06888 sisi NaN nrc
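As a side note, a more compact route is possible once the data is tidy like the df built above, applied right after the transpose (while the missing value is still NaN): number the values within each id with cumcount and pivot them into columns. A hedged sketch under that assumption:

# Number each value within its id, then pivot the values into 'kolom' columns.
out = (df.dropna(subset=['val'])
         .assign(col=lambda d: d.groupby('id').cumcount().add(1))
         .pivot(index='id', columns='col', values='val')
         .add_prefix('kolom ')
         .join(df.groupby('id')['media'].first())
         .reset_index())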

Subset two consecutive event occurrences in pandas

I'm trying to get a subset of my data whenever two events occur consecutively, in that order. The events are time-stamped. So every time there is a run of 2's followed by a run of 3's, I want to subset those rows into a dataframe and add it to a dictionary. The following code does that, but I have to apply it to a very large dataframe of more than 20 million observations, and it is extremely slow using iterrows. How can I make it fast?
df = pd.DataFrame({'Date': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111,
                            112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122],
                   'Event': [1, 1, 2, 2, 2, 3, 3, 1, 3, 2, 2, 3, 1, 2, 3, 2, 3, 2, 2, 3, 3, 3]})
dfb = pd.DataFrame(columns=df.columns)
C = {}
f1 = 0
for index, row in df.iterrows():
    if (row['Event'] == 2) & (3 not in dfb['Event'].values):
        dfb = dfb.append(row)
        f1 = 1
    elif (row['Event'] == 3) & (f1 == 1):
        dfb = dfb.append(row)
    elif 3 in dfb['Event'].values:
        f1 = 0
        C[str(dfb.iloc[0, 0])] = dfb
        del dfb
        dfb = pd.DataFrame(columns=df.columns)
        if row['Event'] == 2:
            dfb = dfb.append(row)
            f1 = 1
    else:
        f1 = 0
        del dfb
        dfb = pd.DataFrame(columns=df.columns)
Edit: The desired output is basically a dictionary of the subsets shown in the image: https://i.stack.imgur.com/ClWZs.png
If you want to accelerate it, you should vectorize your code. You could try it like this (df is the same as in your code):
vec = df.copy()
vec['Event_y'] = vec['Event'].shift(1).fillna(0).astype(int)
vec['Same_Flag'] = float('nan')
vec.loc[(vec['Event_y'] == vec['Event']) & (vec['Event'] != 1), 'Same_Flag'] = 1
vec.dropna(inplace=True)
vec.loc[:, ('Date', 'Event')]
Output is:
Date Event
3 104 2
4 105 2
6 107 3
10 111 2
18 119 2
20 121 3
21 122 3
I think that's close to what you need; you could improve on it from there.
I don't understand why dates 104, 105, and 107 are not counted.
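To build the dictionary itself without iterrows, one option is to label consecutive runs of equal Event values and keep every run of 2's that is immediately followed by a run of 3's. A hedged sketch, assuming df as defined in the question:

# Label consecutive runs of identical Event values.
run_id = df['Event'].ne(df['Event'].shift()).cumsum()
first_event = df.groupby(run_id)['Event'].first()
# Run ids where a run of 2's is immediately followed by a run of 3's.
starts = first_event.index[(first_event == 2) & (first_event.shift(-1) == 3)]
C = {}
for r in starts:
    block = df[run_id.isin([r, r + 1])]  # the 2-run plus its following 3-run
    C[str(block.iloc[0, 0])] = block

The remaining loop runs once per matched block rather than once per row, so it stays cheap even on tens of millions of rows.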

Function creation with pandas dataframe?

I have the following pandas DataFrame named table
grh pm_0 age_0
0 1 39054414 74
1 2 34054409 37
2 3 3715955000 65
3 4 19373605 53
4 5 99411 64
5 6 25664143 37
6 7 5161112 77
7 8 41517547 80
8 9 9517054000 72
9 10 538129400 52
I have a loop iterating over it like this:
df2 = df.copy()
for k in range(1, 3):
    for i in range(1, 5):
        df["pm_" + str(i)] = df["pm_" + str(i - 1)] / k
    df2 = df2.append(df)
print(df2.head(15))
It works, but I would like to encapsulate it in a function.
I tried something like this but it doesn't work.
I think I did something wrong.
def sto(scn):
    df4 = df.copy()
    for k in range(1, scn):
        for i in range(1, 5):
            df["pm_" + str(i)] = df["pm_" + str(i - 1)] / k
        df4 = df4.append(df)

sto(3)
print(df4)
Traceback (most recent call last):
  File "", line 11, in <module>
    print(df4)
NameError: name 'df4' is not defined
Any idea?
You just need to explicitly mark it as a global object:
df4 = df.copy()

def sto(scn):
    global df4
    for k in range(1, scn):
        for i in range(1, 5):
            df["pm_" + str(i)] = df["pm_" + str(i - 1)] / k
        df4 = df4.append(df)
    return df4

sto(3)
You can pass a dataframe to the function:
def sto(scn, df):
    dfx = df.copy()
    for k in range(1, scn):
        for i in range(1, 5):
            df["pm_" + str(i)] = df["pm_" + str(i - 1)] / k
        dfx = dfx.append(df)
    return dfx

df4 = sto(3, df)
You need to send the df to the function and return it at the end:
def sto(scn, df):
.....
.....
return df
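Note that DataFrame.append was removed in pandas 2.0. A sketch of the same function rewritten with pd.concat, assuming the df shown above (with columns pm_0 and age_0):

import pandas as pd

def sto(scn, df):
    work = df.copy()
    frames = [df.copy()]
    for k in range(1, scn):
        for i in range(1, 5):
            work["pm_" + str(i)] = work["pm_" + str(i - 1)] / k
        frames.append(work.copy())  # snapshot after each k
    return pd.concat(frames, ignore_index=True)

df4 = sto(3, df)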

Pandas subtract columns with groupby and mask

For groups under one "SN", I would like to compute three performance indicators for each group. A group's boundaries are the serial number SN and a run of sequential Boolean True values in mask. (So multiple True sequences can exist under one SN.)
The first indicator I want is Csub, the difference between the first and last values of each group in column 'C'. The second, Bmean, is the mean of each group in column 'B'.
For example:
In:
df = pd.DataFrame({"SN" : ["66", "66", "66", "77", "77", "77", "77", "77"], "B" : [-2, -1, -2, 3, 1, -1, 1, 1], "C" : [1, 2, 3, 15, 11, 2, 1, 2],
"mask" : [False, False, False, True, True, False, True, True] })
SN B C mask
0 66 -2 1 False
1 66 -1 2 False
2 66 -2 3 False
3 77 3 15 True
4 77 1 11 True
5 77 -1 2 False
6 77 1 1 True
7 77 1 2 True
Out:
SN B C mask Csub Bmean CdivB
0 66 -2 1 False NaN NaN NaN
1 66 -1 2 False NaN NaN NaN
2 66 -2 3 False NaN NaN NaN
3 77 3 15 True -4 13 -0.3
4 77 1 11 True -4 13 -0.3
5 77 -1 2 False NaN NaN NaN
6 77 1 1 True 1 1 1
7 77 1 2 True 1 1 1
I cooked up something like this, but it groups by the mask True/False values. It should group by SN and sequential True values, not ALL True values. Further, I cannot figure out how to squeeze a subtraction into this.
# Extracting performance values
perf = (df.assign(Bmean=df['B'], CdivB=df['C'] / df['B'])
          .groupby(['SN', 'mask'])
          .agg(dict(Bmean='mean', CdivB='mean'))
          .reset_index(drop=False))
It's not pretty, but you can try the following.
First, prepare a 'group_key' column in order to group by consecutive True values in 'mask':
# Select the rows where 'mask' is True preceded by False.
first_true = df.loc[
    (df['mask'] == True)
    & (df['mask'].shift(fill_value=False) == False)
]
# Add the column, initially all-NaN.
df['group_key'] = float('nan')
# Each row in first_true gets assigned a different 'group_key' value.
df.loc[first_true.index, 'group_key'] = range(len(first_true))
# Forward fill 'group_key' on mask.
df.loc[df['mask'], 'group_key'] = df.loc[df['mask'], 'group_key'].ffill()
Then we can group by 'SN' and 'group_key' and compute and assign the indicator values.
# Group by 'SN' and 'group_key'.
gdf = df.groupby(by=['SN', 'group_key'], as_index=False)
# Compute indicator values
indicators = pd.DataFrame(gdf.nth(0))  # pd.DataFrame used here to avoid a SettingWithCopyWarning.
indicators['Csub'] = gdf.nth(0)['C'].array - gdf.nth(-1)['C'].array
indicators['Bmean'] = gdf.mean()['B'].array
# Write values to original dataframe
df = df.join(indicators.reindex(columns=['Csub', 'Bmean']))
# Forward fill the indicator values
df.loc[df['mask'], ['Csub', 'Bmean']] = df.loc[df['mask'], ['Csub', 'Bmean']].ffill()
# Drop 'group_key' column
df = df.drop(columns=['group_key'])
I excluded 'CdivB' since I couldn't understand what its value should be.
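A more compact variant, offered as a hedged sketch (it keeps the same first-minus-last convention for Csub as the code above and assumes the example df): label consecutive runs of 'mask' with cumsum, then broadcast the group statistics back with transform.

# Label consecutive runs of equal 'mask' values, then group only the True rows.
run = (df['mask'] != df['mask'].shift()).cumsum()
sub = df[df['mask']]
g = sub.groupby([sub['SN'], run[sub.index]])
df.loc[df['mask'], 'Csub'] = g['C'].transform(lambda s: s.iloc[0] - s.iloc[-1])
df.loc[df['mask'], 'Bmean'] = g['B'].transform('mean')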

Python Pandas: merge, join, concat

I have a dataframe with a non-unique GEO_ID, an attribute column FTYPE (one of 6 values) for each GEO_ID, and an associated length for each FTYPE.
df
FID GEO_ID FTYPE Length_km
0 1400000US06001400100 428 3.291467766
1 1400000US06001400100 460 7.566487367
2 1400000US06001401700 460 0.262190266
3 1400000US06001401700 566 10.49899202
4 1400000US06001403300 428 0.138171389
5 1400000US06001403300 558 0.532913513
How do I make 6 new columns for FTYPE (with 1 and 0 indicating whether that row has the FTYPE) and 6 new columns for the FTYPE length, so that each row has a unique GEO_ID?
I want my new dataframe to have a structure like this (with 6 FTYPE-s):
FID GEO_ID FTYPE_428 FTYPE_428_length FTYPE_460 FTYPE_460_length
0 1400000US06001400100 1 3.291467766 1 7.566487367
So far, what I have tried is doing something like this:
import pandas as pd
fname = "filename.csv"
df = pd.read_csv(fname)
nhd = [334, 336, 420, 428, 460, 558, 556]
df1 = df.loc[df['FTYPE']==nhd[0]]
df2 = df.loc[df['FTYPE']==nhd[1]]
df3 = df.loc[df['FTYPE']==nhd[2]]
df4 = df.loc[df['FTYPE']==nhd[3]]
df5 = df.loc[df['FTYPE']==nhd[4]]
df6 = df.loc[df['FTYPE']==nhd[5]]
df7 = df.loc[df['FTYPE']==nhd[6]]
df12 = df1.merge(df2, how='left', left_on='GEO_ID', right_on='GEO_ID')
df23 = df12.merge(df3,how='left', left_on='GEO_ID', right_on='GEO_ID')
df34 = df23.merge(df4,how='left', left_on='GEO_ID', right_on='GEO_ID')
df45 = df34.merge(df5,how='left', left_on='GEO_ID', right_on='GEO_ID')
df56 = df45.merge(df6,how='left', left_on='GEO_ID', right_on='GEO_ID')
df67 = df56.merge(df7,how='left', left_on='GEO_ID', right_on='GEO_ID')
cols = [0,4,7,10,13,16,19]
df67.drop(df67.columns[cols],axis=1,inplace=True)
df67.columns =['GEO_ID','334','len_334','336','len_336','420','len_420','428','len_428','460','len_460','558','len_558','566','len_566']
But this approach is problematic because it reduces the rows to the ones that have the first two FTYPE-s. Is there a way to merge with multiple columns at once?
It's probably easier to write a for loop that goes over each row and uses a condition to fill in the values, like this:
nhd = [334, 336, 420, 428, 460, 558, 556]
for x in nhd:
    df[str(x)] = None
    df["length_" + str(x)] = None
df.head()

for geoid in df["GEO_ID"]:
    # print geoid
    for x in nhd:
        df.ix[(df['FTYPE'] == x) & (df['GEO_ID'] == geoid)][str(nhd)] = 1
But this takes too much time and there is probably a one liner in Pandas to do the same thing.
Any help on this is appreciated!
Thanks,
Solomon
I don't quite see the point of your _length columns: they seem to carry the same information as whether or not the matching value is null, which makes them redundant. They're easy enough to create, though.
While we could cram this into one line if we insisted, what's the point? This is SO, not code golf. So I might do something like:
# Pivot so each FTYPE becomes a column holding its Length_km.
df = df.pivot(index="GEO_ID", columns="FTYPE", values="Length_km")
df.columns = "FTYPE_" + df.columns.astype(str)
# 0/1 indicators for whether each FTYPE is present (suffixed '_length' to match the requested names).
has_value = df.notnull().astype(int)
has_value.columns += '_length'
final = pd.concat([df, has_value], axis=1).sort_index(axis='columns')
which gives me (using your input data, which only has 5 distinct FTYPEs):
In [49]: final
Out[49]:
FTYPE_334 FTYPE_334_length FTYPE_428 \
GEO_ID
1400000US06001400100 NaN 0 3.291468
1400000US06001401700 NaN 0 NaN
1400000US06001403300 NaN 0 0.138171
1400000US06001403400 0.04308 1 NaN
FTYPE_428_length FTYPE_460 FTYPE_460_length \
GEO_ID
1400000US06001400100 1 7.566487 1
1400000US06001401700 0 0.262190 1
1400000US06001403300 1 NaN 0
1400000US06001403400 0 NaN 0
FTYPE_558 FTYPE_558_length FTYPE_566 FTYPE_566_length
GEO_ID
1400000US06001400100 NaN 0 NaN 0
1400000US06001401700 NaN 0 10.498992 1
1400000US06001403300 0.532914 1 1.518864 1
1400000US06001403400 NaN 0 NaN 0
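If a GEO_ID can carry the same FTYPE more than once, plain pivot raises a ValueError on the duplicate entries. A hedged variant that aggregates the lengths first with pivot_table (the aggfunc="sum" choice is an assumption):

import pandas as pd

# Aggregate duplicate (GEO_ID, FTYPE) pairs, then derive the 0/1 indicators.
lengths = df.pivot_table(index="GEO_ID", columns="FTYPE",
                         values="Length_km", aggfunc="sum")
lengths.columns = "FTYPE_" + lengths.columns.astype(str)
flags = lengths.notnull().astype(int).add_suffix("_length")
final = pd.concat([lengths, flags], axis=1).sort_index(axis="columns")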