Python Pandas: merge, join, concat - pandas

I have a dataframe with a non-unique GEO_ID, an attribute column FTYPE (one of 6 values) for each GEO_ID, and an associated length for each FTYPE.
df
FID GEO_ID FTYPE Length_km
0 1400000US06001400100 428 3.291467766
1 1400000US06001400100 460 7.566487367
2 1400000US06001401700 460 0.262190266
3 1400000US06001401700 566 10.49899202
4 1400000US06001403300 428 0.138171389
5 1400000US06001403300 558 0.532913513
How do I make 6 new columns for FTYPE (with 1 and 0 to indicate if that row has the FTYPE) and 6 new columns for FTYPE_Length to make each row have a unique GEO_ID?
I want my new dataframe to have a structure like this (with all 6 FTYPEs):
FID GEO_ID FTYPE_428 FTYPE_428_length FTYPE_460 FTYPE_460_length
0 1400000US06001400100 1 3.291467766 1 7.566487367
So far, what I have tried is doing something like this:
import pandas as pd
fname = "filename.csv"
df = pd.read_csv(fname)
nhd = [334, 336, 420, 428, 460, 558, 566]
df1 = df.loc[df['FTYPE']==nhd[0]]
df2 = df.loc[df['FTYPE']==nhd[1]]
df3 = df.loc[df['FTYPE']==nhd[2]]
df4 = df.loc[df['FTYPE']==nhd[3]]
df5 = df.loc[df['FTYPE']==nhd[4]]
df6 = df.loc[df['FTYPE']==nhd[5]]
df7 = df.loc[df['FTYPE']==nhd[6]]
df12 = df1.merge(df2, how='left', left_on='GEO_ID', right_on='GEO_ID')
df23 = df12.merge(df3,how='left', left_on='GEO_ID', right_on='GEO_ID')
df34 = df23.merge(df4,how='left', left_on='GEO_ID', right_on='GEO_ID')
df45 = df34.merge(df5,how='left', left_on='GEO_ID', right_on='GEO_ID')
df56 = df45.merge(df6,how='left', left_on='GEO_ID', right_on='GEO_ID')
df67 = df56.merge(df7,how='left', left_on='GEO_ID', right_on='GEO_ID')
cols = [0,4,7,10,13,16,19]
df67.drop(df67.columns[cols],axis=1,inplace=True)
df67.columns =['GEO_ID','334','len_334','336','len_336','420','len_420','428','len_428','460','len_460','558','len_558','566','len_566']
But this approach is problematic because it reduces the rows to the ones that have the first two FTYPE-s. Is there a way to merge with multiple columns at once?
It's probably easier to write a for loop that goes over each row and uses a condition to fill in the values, like this:
nhd = [334, 336, 420, 428, 460, 558, 566]
for x in nhd:
    df[str(x)] = None
    df["length_"+str(x)] = None
df.head()
for geoid in df["GEO_ID"]:
    #print geoid
    for x in nhd:
        df.ix[(df['FTYPE']==x) & (df['GEO_ID'] == geoid)][str(x)] = 1
But this takes too much time and there is probably a one liner in Pandas to do the same thing.
Any help on this is appreciated!
Thanks,
Solomon
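One way to tackle the literal merge question while keeping every GEO_ID is to build one small frame per FTYPE and fold them together with functools.reduce, using how='outer' so GEO_IDs that lack the first FTYPEs are not dropped. This is only a sketch using the column names from the post:
from functools import reduce
import pandas as pd
df = pd.read_csv("filename.csv")
nhd = [334, 336, 420, 428, 460, 558, 566]
# One frame per FTYPE: an indicator column named after the FTYPE plus its length.
frames = [
    df.loc[df["FTYPE"] == x, ["GEO_ID", "Length_km"]]
      .assign(**{str(x): 1})
      .rename(columns={"Length_km": "len_" + str(x)})
    for x in nhd
]
# Outer-merge them all on GEO_ID so no GEO_ID is lost.
merged = reduce(lambda left, right: left.merge(right, on="GEO_ID", how="outer"), frames)
Indicator columns for missing FTYPEs come out as NaN after the outer merge; a final fillna(0) on those columns would give the 0/1 form asked for.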

I don't quite see the point of your 0/1 indicator columns: they carry the same information as whether or not the matching length value is null, which makes them redundant. They're easy enough to create, though.
While we could cram this into one line if we insisted, what's the point? This is SO, not code golf. So I might do something like:
df = df.pivot(index="GEO_ID", columns="FTYPE", values="Length_km")
df.columns = "FTYPE_" + df.columns.astype(str)
has_value = df.notnull().astype(int)   # 0/1 indicators keep the plain FTYPE_xxx names
df.columns += '_length'                # the pivoted lengths get the _length suffix
final = pd.concat([df, has_value], axis=1).sort_index(axis='columns')
which gives me (using your input data, which only has 5 distinct FTYPEs):
In [49]: final
Out[49]:
FTYPE_334 FTYPE_334_length FTYPE_428 \
GEO_ID
1400000US06001400100 0 NaN 1
1400000US06001401700 0 NaN 0
1400000US06001403300 0 NaN 1
1400000US06001403400 1 0.04308 0
FTYPE_428_length FTYPE_460 FTYPE_460_length \
GEO_ID
1400000US06001400100 3.291468 1 7.566487
1400000US06001401700 NaN 1 0.262190
1400000US06001403300 0.138171 0 NaN
1400000US06001403400 NaN 0 NaN
FTYPE_558 FTYPE_558_length FTYPE_566 FTYPE_566_length
GEO_ID
1400000US06001400100 0 NaN 0 NaN
1400000US06001401700 0 NaN 1 10.498992
1400000US06001403300 1 0.532914 1 1.518864
1400000US06001403400 0 NaN 0 NaN
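One caveat, assuming the real data can contain more than one row per (GEO_ID, FTYPE) pair: DataFrame.pivot raises a ValueError on duplicate index/column combinations, while pivot_table aggregates them. A sketch of the duplicate-tolerant variant, starting again from the raw df (summing duplicate lengths is an assumption on my part):
# Same reshape, but tolerant of duplicate (GEO_ID, FTYPE) pairs.
df = df.pivot_table(index="GEO_ID", columns="FTYPE", values="Length_km", aggfunc="sum")
df.columns = "FTYPE_" + df.columns.astype(str)
has_value = df.notnull().astype(int)
df.columns += '_length'
final = pd.concat([df, has_value], axis=1).sort_index(axis='columns')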

Related

Merge rows with same id, different values in 1 column to multiple columns

What I have: the number of rows can differ per id, so sometimes one id has 4 rows with different values in column val; the other columns all have the same values.
df1 = pd.DataFrame({'id':[1,1,1,2,2,2,3,3,3], 'val': ['06123','nick','#gmail','06454','abey','#gmail','06888','sisi'], 'media': ['nrc','nrc','nrc','nrc','nrc','nrc','nrc','nrc']})
What I need:
id kolom 1 kolom 2 kolom 3 media
1 06123 nick #gmail nrc
2 06454 abey #gmail nrc
3 06888 sisi None nrc
I hope this example is clear. Thanks for the help.
# Collect each id's values into a list, then spread the list over fixed columns.
df2 = df1.groupby('id').agg(list)
df2['col 1'] = df2['val'].apply(lambda x: x[0] if len(x) > 0 else 'None')
df2['col 2'] = df2['val'].apply(lambda x: x[1] if len(x) > 1 else 'None')
df2['col 3'] = df2['val'].apply(lambda x: x[2] if len(x) > 2 else 'None')
df2['media'] = df2['media'].apply(lambda x: x[0] if len(x) > 0 else 'None')
df2 = df2.drop(columns='val')
Here is another way. Since the lists in your original dataframe don't all have the same length (which will get you a ValueError), you can define it as:
data = {"id":[1,1,1,2,2,2,3,3,3],
"val": ["06123","nick","#gmail","06454","abey","#gmail","06888","sisi"],
"media": ["nrc","nrc","nrc","nrc","nrc","nrc","nrc","nrc"]}
df = pd.DataFrame.from_dict(data, orient="index")
df = df.transpose()
>>> df
id val media
0 1 06123 nrc
1 1 nick nrc
2 1 #gmail nrc
3 2 06454 nrc
4 2 abey nrc
5 2 #gmail nrc
6 3 06888 nrc
7 3 sisi nrc
8 3 NaN NaN
Afterwards, you can replace the np.nan values with an empty string, so that you can group by your id column and join the values in val separated by a ,.
import numpy as np
df = df.replace(np.nan, "", regex=True)
df_new = df.groupby(["id"])["val"].apply(lambda x: ",".join(x)).reset_index()
>>> df_new
id val
0 1.0 06123,nick,#gmail
1 2.0 06454,abey,#gmail
2 3.0 06888,sisi,
Then, you only need to transform the new val column into 3 columns by splitting the string inside, with any method you want. For example,
new_cols = df_new["val"].str.split(",", expand=True) # Good ol' split
df_new["kolom 1"] = new_cols[0] # Assign to new columns
df_new["kolom 2"] = new_cols[1]
df_new["kolom 3"] = new_cols[2]
df_new.drop("val", 1, inplace=True) # Delete previous val
df_new["media"] = "nrc" # Add the media column again
df_new = df_new.replace("", np.nan, regex=True) # If necessary, replace empty string with np.nan
>>> df_new
id kolom 1 kolom 2 kolom 3 media
0 1.0 06123 nick #gmail nrc
1 2.0 06454 abey #gmail nrc
2 3.0 06888 sisi NaN nrc
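A more direct reshape, assuming the padded data above (NaN for the missing third value of id 3), is to number the values within each id with cumcount and pivot on that position, which avoids the join/split round trip. A sketch:
import numpy as np
import pandas as pd
df = pd.DataFrame({"id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "val": ["06123", "nick", "#gmail", "06454", "abey", "#gmail", "06888", "sisi", np.nan],
                   "media": ["nrc"] * 9})
# Position of each val within its id group: 0, 1, 2, ...
df["pos"] = df.groupby("id").cumcount()
wide = (df.dropna(subset=["val"])
          .pivot(index="id", columns="pos", values="val")
          .rename(columns=lambda i: "kolom " + str(i + 1)))
wide["media"] = df.groupby("id")["media"].first()
wide = wide.reset_index()
This gives the kolom 1 / kolom 2 / kolom 3 layout directly, with NaN where an id has fewer values.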

Subset two consecutive event occurrence in pandas

I'm trying to get a subset of my data whenever there is a consecutive occurrence of two events, in that order. The events are time-stamped. So every time there are consecutive 2's and then consecutive 3's, I want to subset that block into a dataframe and append it to a dictionary. The following code does that, but I have to apply it to a very large dataframe of more than 20 million observations, and it is extremely slow using iterrows. How can I make this fast?
df = pd.DataFrame({'Date': [101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122],
                   'Event': [1,1,2,2,2,3,3,1,3,2,2,3,1,2,3,2,3,2,2,3,3,3]})
dfb = pd.DataFrame(columns = df.columns)
C = {}
f1 = 0
for index, row in df.iterrows():
    if ((row['Event'] == 2) & (3 not in dfb['Event'].values)):
        dfb = dfb.append(row)
        f1 = 1
    elif ((row['Event'] == 3) & (f1 == 1)):
        dfb = dfb.append(row)
    elif 3 in dfb['Event'].values:
        f1 = 0
        C[str(dfb.iloc[0,0])] = dfb
        del dfb
        dfb = pd.DataFrame(columns = df.columns)
        if row['Event'] == 2:
            dfb = dfb.append(row)
            f1 = 1
    else:
        f1 = 0
        del dfb
        dfb = pd.DataFrame(columns = df.columns)
Edit: The desired output is basically a dictionary of the subsets shown in the image: https://i.stack.imgur.com/ClWZs.png
If you want to accelerate it, you should vectorize your code. You could try something like this (df is the same as in your code):
vec = df.copy()
vec['Event_y'] = vec['Event'].shift(1).fillna(0).astype(int)
vec['Same_Flag'] = float('nan')
vec.loc[(vec['Event_y'] == vec['Event']) & (vec['Event'] != 1), 'Same_Flag'] = 1
vec.dropna(inplace=True)
vec.loc[:, ('Date', 'Event')]
Output is:
Date Event
3 104 2
4 105 2
6 107 3
10 111 2
18 119 2
20 121 3
21 122 3
I think that's close to what you need. You could improve based on that.
I don't understand why dates 104, 105, 107 are not counted.
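For the original goal, a dictionary with one dataframe per block of consecutive 2's followed by consecutive 3's, you can also label runs with shift/cumsum and then slice once per block instead of once per row. This is a sketch of that idea; the block definition is my reading of the desired output:
import pandas as pd
df = pd.DataFrame({'Date': [101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122],
                   'Event': [1,1,2,2,2,3,3,1,3,2,2,3,1,2,3,2,3,2,2,3,3,3]})
# Give every maximal run of identical Event values its own id.
run_id = (df['Event'] != df['Event'].shift()).cumsum()
runs = df.groupby(run_id).agg(Event=('Event', 'first'),
                              start=('Date', 'idxmin'),
                              end=('Date', 'idxmax'))
# A block is a run of 2's immediately followed by a run of 3's.
block_starts = runs.index[(runs['Event'] == 2) & (runs['Event'].shift(-1) == 3)]
C = {}
for rid in block_starts:  # loops once per block, not once per row
    block = df.loc[runs.loc[rid, 'start']:runs.loc[rid + 1, 'end']]
    C[str(block.iloc[0, 0])] = block
On the sample data this produces five blocks, keyed by their first Date (103, 110, 114, 116, 118).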

Pandas subtract columns with groupby and mask

For groups under one "SN", I would like to subtract three performance indicators for each group. One group boundaries are the serial number SN and sequential Boolean True values in mask. (So multiple True sequances can exist under one SN).
The first indicator I want is, Csub that subtracts between the first and last values of each group in column 'C'. Second, Bmean, is the mean of each group in column 'B'.
For example:
In:
df = pd.DataFrame({"SN" : ["66", "66", "66", "77", "77", "77", "77", "77"], "B" : [-2, -1, -2, 3, 1, -1, 1, 1], "C" : [1, 2, 3, 15, 11, 2, 1, 2],
"mask" : [False, False, False, True, True, False, True, True] })
SN B C mask
0 66 -2 1 False
1 66 -1 2 False
2 66 -2 3 False
3 77 3 15 True
4 77 1 11 True
5 77 -1 2 False
6 77 1 1 True
7 77 1 2 True
Out:
SN B C mask Csub Bmean CdivB
0 66 -2 1 False NaN NaN NaN
1 66 -1 2 False NaN NaN NaN
2 66 -2 3 False NaN NaN NaN
3 77 3 15 True -4 13 -0.3
4 77 1 11 True -4 13 -0.3
5 77 -1 2 False NaN NaN NaN
6 77 1 1 True 1 1 1
7 77 1 2 True 1 1 1
I cooked up something like this, but it groups by the mask T/F values. It should group by SN and sequential True values, not ALL True values. Further, I cannot figure out how to get the subtraction squeezed into this.
# Extracting performance values
perf = (df.assign(
Bmean = df['B'], CdivB = df['C']/df['B']
).groupby(['SN','mask'])
.agg(dict(Bmean ='mean', CdivB = 'mean'))
.reset_index(drop=False)
)
It's not pretty, but you can try the following.
First, prepare a 'group_key' column in order to group by consecutive True values in 'mask':
# Select the rows where 'mask' is True preceded by False.
first_true = df.loc[
(df['mask'] == True)
& (df['mask'].shift(fill_value=False) == False)
]
# Add the column.
df['group_key'] = pd.Series()
# Each row in first_true gets assigned a different 'group_key' value.
df.loc[first_true.index, 'group_key'] = range(len(first_true))
# Forward fill 'group_key' on mask.
df.loc[df['mask'], 'group_key'] = df.loc[df['mask'], 'group_key'].ffill()
Then we can group by 'SN' and 'group_key' and compute and assign the indicator values.
# Group by 'SN' and 'group_key'.
gdf = df.groupby(by=['SN', 'group_key'], as_index=False)
# Compute indicator values
indicators = pd.DataFrame(gdf.nth(0)) # pd.DataFrame used here to avoid a SettingWithCopyWarning.
indicators['Csub'] = gdf.nth(0)['C'].array - gdf.nth(-1)['C'].array
indicators['Bmean'] = gdf.mean()['B'].array
# Write values to original dataframe
df = df.join(indicators.reindex(columns=['Csub', 'Bmean']))
# Forward fill the indicator values
df.loc[df['mask'], ['Csub', 'Bmean']] = df.loc[df['mask'], ['Csub', 'Bmean']].ffill()
# Drop 'group_key' column
df = df.drop(columns=['group_key'])
I excluded 'CdivB' since I couldn't understand what its value should be.
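A more compact variant of the same idea is to build the run label in one expression and let groupby().transform broadcast the per-group results back onto the masked rows, so no join is needed. This is a sketch; it follows the answer's first-minus-last convention for Csub rather than the sample output:
import pandas as pd
df = pd.DataFrame({"SN": ["66", "66", "66", "77", "77", "77", "77", "77"],
                   "B": [-2, -1, -2, 3, 1, -1, 1, 1],
                   "C": [1, 2, 3, 15, 11, 2, 1, 2],
                   "mask": [False, False, False, True, True, False, True, True]})
# Number each consecutive run of True values; NaN where mask is False.
group_key = (df['mask'] & ~df['mask'].shift(fill_value=False)).cumsum().where(df['mask'])
grouped = df.groupby(['SN', group_key])
df['Csub'] = grouped['C'].transform('first') - grouped['C'].transform('last')
df['Bmean'] = grouped['B'].transform('mean')
Rows where mask is False fall outside every group, so transform leaves their Csub and Bmean as NaN, which matches the desired output.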

Dataframe transposition and mapping

I need to perform a two-sample t-test, for which I have to transpose my sample file and map values from another CSV file onto the sample file. I am new to Python; so far I have tried this:
with open('project.csv') as f_project:
    df = pd.read_csv('project.csv', delimiter=',', dtype='unicode', error_bad_lines=False)
df.set_index('TaxID', inplace=True)
df_kraken = df.T
df_meta = pd.read_csv('Meta3251.csv', delimiter=',', dtype='unicode',
                      error_bad_lines=False, usecols=['SRA ID', '(0/1)'])
df_kraken['Meta'] = df_kraken['TaxID'].map(df_meta.set_index('SRA ID')['(0/1)'])
My sample file dataframe after transposition looks like this:
333046 1049 337090
PRJEB3251_ERR169499 0.05 0.03 0.01
PRJEB3251_ERR169500 0 0 0
PRJEB3251_ERR169501 0 0 0
PRJEB3251_ERR169502 0.05 0 0
PRJEB3251_ERR169503 0.03 1.9 0
PRJEB3251_ERR169507 0.01 0 0
PRJEB3251_ERR169508 0 0.1 0
PRJEB3251_ERR169509 0 0.05 0
The index has not been set to TaxID.
I have another CSV file which I have read into another dataframe so that I can map the values. It looks like:
SRA ID (0/1)
ERR169611 1
ERR169610 1
ERR169609 1
ERR169608 1
ERR169607 0
ERR169606 0
ERR169605 1
ERR169604 1
ERR169484 0
I need to map the zero/one values to the first column of the 1st dataframe. I'm stuck with the error: KeyError: 'TaxID'.
Any help regarding this will be highly appreciated.
After your suggestion I have this:
import pandas as pd
df = pd.read_csv('project.csv').set_index('ID').T
df = df.reset_index().rename(columns={'index': 'Project ID'})
df_meta = pd.read_csv('Meta3251.csv', delimiter=',', dtype='unicode',
                      error_bad_lines=False, usecols=['SRA ID', '(0/1)'])
df['KEY'] = df['Project ID'].str.split('_').str[1]
df['Meta ID'] = df['KEY'].replace(dict(zip(df_meta['SRA ID'], df['(Project ID)'])))
df.to_csv('R.csv')
After this I have the following result:
Project ID 333046 1049 KEY Meta ID
0 PRJEB3251_ERR169499 0.05 0.03 ERR169499 PRJEB3251_ERR169636
1 PRJEB3251_ERR169500 0 0 ERR169500 PRJEB3251_ERR169635
2 PRJEB3251_ERR169501 0 0 ERR169501 PRJEB3251_ERR169626
3 PRJEB3251_ERR169502 0.05 0 ERR169502 PRJEB3251_ERR169625
I still have the index, but the good part is that now I'm able to rename my columns; the mapping is not working though.
Here is a solution that could work:
df = pd.read_csv('project.csv', delimiter = ',', dtype= 'unicode', error_bad_lines=False)
df.set_index('TaxID', inplace=True)
df_kraken = df.T.reset_index() # Make sure 'TaxID' is a column
df_meta = pd.read_csv('Meta3251.csv', delimiter = ',', dtype= 'unicode', error_bad_lines=False, usecols = ['SRA ID', '(0/1)'])
# In your example the second dataframe only matches what's after the '_'
# so you can isolate that part
df_kraken['KEY'] = df_kraken['TaxID'].str.split('_').str[1]
df_kraken['Meta'] = df_kraken['KEY'].replace(dict(zip(df_meta['SRA ID'], df_meta['(0/1)'])))
EDIT
The question has been edited.
After read_csv() (first line):
TaxID PRJEB3251_ERR169499 PRJEB3251_ERR169500 PRJEB3251_ERR169501
0 333046 0.05 0 0
1 1049 0.03 0 0
2 337090 0.01 0 0
3 288681 3.6 0 0
4 267889 0.02 0 0
...
Then
df = df.set_index('TaxID').T
print(df)
TaxID 333046 1049 337090
PRJEB3251_ERR169499 0.05 0.03 0.01
PRJEB3251_ERR169500 0.00 0.00 0.00
PRJEB3251_ERR169501 0.00 0.00 0.00
Note that at this point TaxID is the name of the columns index, not of the row index. If you want to have TaxID as a column:
df = df.reset_index().rename(columns={'index': 'TaxID'})
To avoid confusion you can remove TaxID from the column name:
df.columns.name = None
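Putting the pieces together, a minimal end-to-end sketch might look like the following; the 'Sample' column name is made up here, the file names and the '(0/1)' column come from the question, and the final map line is the step the question was missing:
import pandas as pd
# Transpose so each sample (PRJEB3251_ERRxxxxxx) becomes a row.
df = pd.read_csv('project.csv').set_index('TaxID').T
df.columns.name = None
df = df.reset_index().rename(columns={'index': 'Sample'})
# Metadata: SRA ID -> 0/1 group label.
df_meta = pd.read_csv('Meta3251.csv', usecols=['SRA ID', '(0/1)'])
# The sample names only match the metadata after the underscore.
df['KEY'] = df['Sample'].str.split('_').str[1]
df['Meta'] = df['KEY'].map(df_meta.set_index('SRA ID')['(0/1)'])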

Imposing a threshold on values in dataframe in Pandas

I have the following code:
import numpy
t = 12
s = numpy.array(df.Array.tolist())
s[s<t] = 0
thresh = numpy.where(s>0, s-t, 0)
df['NewArray'] = list(thresh)
while it works, surely there must be a more pandas-like way of doing it.
EDIT:
df.Array.head() looks like this:
0 [0.771511552006, 0.771515476223, 0.77143569165...
1 [3.66720695274, 3.66722560562, 3.66684636758, ...
2 [2.3047433839, 2.30475510675, 2.30451676559, 2...
3 [0.999991522708, 0.999996609066, 0.99989319662...
4 [1.11132718786, 1.11133284052, 0.999679589875,...
Name: Array, dtype: object
IIUC you can simply subtract and use clip_lower:
In [29]: df["NewArray"] = (df["Array"] - 12).clip_lower(0)
In [30]: df
Out[30]:
Array NewArray
0 10 0
1 11 0
2 12 0
3 13 1
4 14 2
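A note for current pandas versions: clip_lower was deprecated and later removed, and clip(lower=...) is the replacement. A sketch on scalar data like the answer's toy example:
import pandas as pd
df = pd.DataFrame({"Array": [10, 11, 12, 13, 14]})
t = 12
# Subtract the threshold and floor the result at 0 (what clip_lower(0) used to do).
df["NewArray"] = (df["Array"] - t).clip(lower=0)
If each cell actually holds a NumPy array, as in the question's df.Array.head(), the same idea can be applied per cell, for example df['Array'].apply(lambda a: (a - t).clip(min=0)).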