Dataframe transposition and mapping - pandas

I need to perform a two-sample t-test, for which I have to transpose my sample file and map values from another csv file onto it. I am new to Python; so far I have tried this:
import pandas as pd

df = pd.read_csv('project.csv', delimiter=',', dtype='unicode',
                 error_bad_lines=False)
df.set_index('TaxID', inplace=True)
df_kraken = df.T
df_meta = pd.read_csv('Meta3251.csv', delimiter=',', dtype='unicode',
                      error_bad_lines=False, usecols=['SRA ID', '(0/1)'])
df_kraken['Meta'] = df_kraken['TaxID'].map(df_meta.set_index('SRA ID')['(0/1)'])
My sample file dataframe after transposition looks like this:
333046 1049 337090
PRJEB3251_ERR169499 0.05 0.03 0.01
PRJEB3251_ERR169500 0 0 0
PRJEB3251_ERR169501 0 0 0
PRJEB3251_ERR169502 0.05 0 0
PRJEB3251_ERR169503 0.03 1.9 0
PRJEB3251_ERR169507 0.01 0 0
PRJEB3251_ERR169508 0 0.1 0
PRJEB3251_ERR169509 0 0.05 0
The index has not been set to TaxID.
I have another csv file which I have read into another dataframe so that I can map the values. It looks like:
SRA ID (0/1)
ERR169611 1
ERR169610 1
ERR169609 1
ERR169608 1
ERR169607 0
ERR169606 0
ERR169605 1
ERR169604 1
ERR169484 0
I need to map the zero/one values onto the first column of the 1st dataframe. I'm stuck with the error: KeyError: 'TaxID'.
Any help regarding this will be highly appreciated.
After your suggestion I have this:
import pandas as pd

df = pd.read_csv('project.csv').set_index('ID').T
df = df.reset_index().rename(columns={'index': 'Project ID'})
df_meta = pd.read_csv('Meta3251.csv', delimiter=',', dtype='unicode',
                      error_bad_lines=False, usecols=['SRA ID', '(0/1)'])
df['KEY'] = df['Project ID'].str.split('_').str[1]
df['Meta ID'] = df['KEY'].replace(dict(zip(df_meta['SRA ID'], df['Project ID'])))
df.to_csv('R.csv')
After this I have the following result:
Project ID 333046 1049 KEY Meta ID
0 PRJEB3251_ERR169499 0.05 0.03 ERR169499 PRJEB3251_ERR169636
1 PRJEB3251_ERR169500 0 0 ERR169500 PRJEB3251_ERR169635
2 PRJEB3251_ERR169501 0 0 ERR169501 PRJEB3251_ERR169626
3 PRJEB3251_ERR169502 0.05 0 ERR169502 PRJEB3251_ERR169625
I still have the index, but the good part is that now I'm able to rename my column; the mapping is not working though.

Here is a solution that could work:
df = pd.read_csv('project.csv', delimiter=',', dtype='unicode', error_bad_lines=False)
df.set_index('TaxID', inplace=True)
df_kraken = df.T.reset_index().rename(columns={'index': 'TaxID'})  # make sure 'TaxID' is a column
df_meta = pd.read_csv('Meta3251.csv', delimiter=',', dtype='unicode',
                      error_bad_lines=False, usecols=['SRA ID', '(0/1)'])
# In your example the second dataframe only matches what's after the '_',
# so you can isolate that part
df_kraken['KEY'] = df_kraken['TaxID'].str.split('_').str[1]
df_kraken['Meta'] = df_kraken['KEY'].replace(dict(zip(df_meta['SRA ID'], df_meta['(0/1)'])))
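Equivalently, since you already tried map, the same lookup should work with Series.map once the key has been isolated (a sketch assuming the meta columns are named 'SRA ID' and '(0/1)' as above):

df_kraken['Meta'] = df_kraken['KEY'].map(df_meta.set_index('SRA ID')['(0/1)'])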
EDIT
The question has been edited.
After read_csv() (first line):
TaxID PRJEB3251_ERR169499 PRJEB3251_ERR169500 PRJEB3251_ERR169501
0 333046 0.05 0 0
1 1049 0.03 0 0
2 337090 0.01 0 0
3 288681 3.6 0 0
4 267889 0.02 0 0
...
Then
df = df.set_index('TaxID').T
print(df)
TaxID 333046 1049 337090
PRJEB3251_ERR169499 0.05 0.03 0.01
PRJEB3251_ERR169500 0.00 0.00 0.00
PRJEB3251_ERR169501 0.00 0.00 0.00
Note that at this point TaxID is the name of the columns index, not of the row index. If you want to have TaxID as a column:
df = df.reset_index().rename(columns={'index': 'TaxID'})
To avoid confusion you can remove TaxID from the column name:
df.columns.name = None
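Putting the pieces together, a minimal end-to-end sketch under the same assumptions (sample IDs like PRJEB3251_ERR169499 in project.csv, meta columns named 'SRA ID' and '(0/1)'; the 'Sample ID' column name is just illustrative):

import pandas as pd

# transpose so each sample is a row, and keep the sample IDs as a column
df = pd.read_csv('project.csv', dtype='unicode')
df = df.set_index('TaxID').T
df = df.reset_index().rename(columns={'index': 'Sample ID'})  # 'Sample ID' is illustrative
df.columns.name = None  # drop the leftover 'TaxID' columns name

# the meta file keys on the part of the sample ID after the underscore
df_meta = pd.read_csv('Meta3251.csv', dtype='unicode', usecols=['SRA ID', '(0/1)'])
df['KEY'] = df['Sample ID'].str.split('_').str[1]
df['Meta'] = df['KEY'].map(df_meta.set_index('SRA ID')['(0/1)'])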

Related

Populate empty pandas dataframe with specific conditions

I want to create a pandas dataframe with 5000 columns (n=5000) and one row (row G). Row G should contain 1 (in 10% of samples) or 0 (in 90% of samples).
import numpy as np
import pandas as pd

df = pd.DataFrame({"G": np.random.choice([1, 0], p=[0.1, 0.9], size=5000)}).T
I also want to add column names so that they are "Cell" followed by 1..5000:
   Cell1  Cell2  Cell3  ...  Cell5000
G      0      0      1  ...         0
The columns will default to a RangeIndex from 0-4999. You can add 1 to the column values, and then use DataFrame.add_prefix to add the string "Cell" before all of the column names.
df.columns += 1
df = df.add_prefix("Cell")
print(df)
Cell1 Cell2 Cell3 ... Cell5000
G 0 0 0 ... 0
For a one-liner, you can also add 1 and prefix with "Cell" by converting the column index dtype manually.
df.columns = "Cell" + (df.columns + 1).astype(str)
To make a single row DataFrame, I would construct my data with numpy in the correct shape instead of transposing a DataFrame. You can also pass in the columns as you want them numbered and the index labelled.
import numpy as np
import pandas as pd

size = 5000
df = pd.DataFrame(
    np.random.choice([1, 0], p=[.1, .9], size=(1, size)),
    columns=np.arange(1, size + 1),
    index=["G"]
).add_prefix("Cell")
print(df)
Cell1 Cell2 Cell3 ... Cell4999 Cell5000
G 0 0 0 ... 0 0
Another method could be:
size = 5000
pd.DataFrame.from_dict(
    {"G": np.random.choice([1, 0], p=[0.1, 0.9], size=size)},
    columns=[f'Cell{x}' for x in range(1, size + 1)],
    orient='index'
)
Output:
Cell1 Cell2 Cell3 Cell4 Cell5 Cell6 Cell7 Cell8 Cell9 ... Cell4992 Cell4993 Cell4994 Cell4995 Cell4996 Cell4997 Cell4998 Cell4999 Cell5000
G 0 0 0 0 0 1 0 1 0 ... 0 0 0 0 0 0 0 0 0
[1 rows x 5000 columns]

Merge rows with same id, different values in 1 column to multiple columns

What I have: the number of rows per id can differ, so sometimes 1 id has 4 rows with different values in the column val; the other columns all have the same values.
df1 = pd.DataFrame({'id': [1,1,1,2,2,2,3,3,3],
                    'val': ['06123','nick','#gmail','06454','abey','#gmail','06888','sisi'],
                    'media': ['nrc','nrc','nrc','nrc','nrc','nrc','nrc','nrc']})
What I need:
id  kolom 1  kolom 2  kolom 3  media
1   06123    nick     #gmail   nrc
2   06454    abey     #gmail   nrc
3   06888    sisi     None     nrc
I hope I gave a good example, now in corrected form; thanks for the help.
df2 = df1.groupby('id').agg(list)
df2['col 1'] = df2['val'].apply(lambda x: x[0] if len(x) > 0 else 'None')
df2['col 2'] = df2['val'].apply(lambda x: x[1] if len(x) > 1 else 'None')
df2['col 3'] = df2['val'].apply(lambda x: x[2] if len(x) > 2 else 'None')
df2['media'] = df2['media'].apply(lambda x: x[0] if len(x) > 0 else 'None')
df2 = df2.drop(columns='val')
Here is another way. Since your original dataframe's lists don't all have the same length (which will get you a ValueError), you can define it as:
data = {"id":[1,1,1,2,2,2,3,3,3],
"val": ["06123","nick","#gmail","06454","abey","#gmail","06888","sisi"],
"media": ["nrc","nrc","nrc","nrc","nrc","nrc","nrc","nrc"]}
df = pd.DataFrame.from_dict(data, orient="index")
df = df.transpose()
>>> df
id val media
0 1 06123 nrc
1 1 nick nrc
2 1 #gmail nrc
3 2 06454 nrc
4 2 abey nrc
5 2 #gmail nrc
6 3 06888 nrc
7 3 sisi nrc
8 3 NaN NaN
Afterwards, you can replace the np.nan values with an empty string, so that you can groupby your id column and join the values in val separated by a ','.
df = df.replace(np.nan, "", regex=True)
df_new = df.groupby(["id"])["val"].apply(lambda x: ",".join(x)).reset_index()
>>> df_new
id val
0 1.0 06123,nick,#gmail
1 2.0 06454,abey,#gmail
2 3.0 06888,sisi,
Then, you only need to transform the new val column into 3 columns by splitting the string inside, with any method you want. For example,
new_cols = df_new["val"].str.split(",", expand=True) # Good ol' split
df_new["kolom 1"] = new_cols[0] # Assign to new columns
df_new["kolom 2"] = new_cols[1]
df_new["kolom 3"] = new_cols[2]
df_new.drop("val", 1, inplace=True) # Delete previous val
df_new["media"] = "nrc" # Add the media column again
df_new = df_new.replace("", np.nan, regex=True) # If necessary, replace empty string with np.nan
>>> df_new
id kolom 1 kolom 2 kolom 3 media
0 1.0 06123 nick #gmail nrc
1 2.0 06454 abey #gmail nrc
2 3.0 06888 sisi NaN nrc
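For completeness, a more compact variant of the same reshape (a sketch, assuming the corrected 8-element lists from above): number the rows within each id with cumcount, then unstack val into columns.

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 3, 3],
                   'val': ['06123', 'nick', '#gmail', '06454', 'abey', '#gmail', '06888', 'sisi'],
                   'media': ['nrc'] * 8})

# the position of each row within its id group becomes the new column number
pos = df.groupby('id').cumcount() + 1
wide = df.set_index(['id', pos])['val'].unstack()
wide.columns = ['kolom ' + str(c) for c in wide.columns]

# media is constant per id, so taking the first value is enough
out = wide.join(df.groupby('id')['media'].first()).reset_index()
print(out)

Missing combinations (id 3 has only two values) come out as NaN, matching the desired third row.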

Pandas Dataframe Manipulation logic

Can you please help with the below problem:
Given two dataframes df1 and df2, I need to get something like the result dataframe below.
import pandas as pd
import numpy as np
feature_list = [str(i) for i in range(6)]
df1 = pd.DataFrame({'value': [0, 3, 0, 4, 2, 5]})
df2 = pd.DataFrame(0, index=np.arange(6), columns=feature_list)
Expected dataframe: this needs to be driven by comparing values from df1 with the column names (features) in df2; if they match, we put a 1 in resultDf. The expected output (resultDf) is the table shown below.
I think you need:
(pd.get_dummies(df1['value'])
   .rename(columns=str)
   .reindex(columns=df2.columns, index=df2.index, fill_value=0))
0 1 2 3 4 5
0 1 0 0 0 0 0
1 0 0 0 1 0 0
2 1 0 0 0 0 0
3 0 0 0 0 1 0
4 0 0 1 0 0 0
5 0 0 0 0 0 1
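An equivalent without get_dummies, in case you prefer building the matrix by broadcasting (a sketch assuming, as in the setup above, that the feature names are the stringified integers 0-5):

import numpy as np
import pandas as pd

feature_list = [str(i) for i in range(6)]
df1 = pd.DataFrame({'value': [0, 3, 0, 4, 2, 5]})

# compare every value against every feature number at once
mask = df1['value'].to_numpy()[:, None] == np.arange(len(feature_list))
result = pd.DataFrame(mask.astype(int), columns=feature_list)
print(result)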

Python Pandas: merge, join, concat

I have a dataframe with a non-unique GEO_ID, an attribute column FTYPE (one of 6 values) for each GEO_ID, and an associated length for each FTYPE.
df
FID GEO_ID FTYPE Length_km
0 1400000US06001400100 428 3.291467766
1 1400000US06001400100 460 7.566487367
2 1400000US06001401700 460 0.262190266
3 1400000US06001401700 566 10.49899202
4 1400000US06001403300 428 0.138171389
5 1400000US06001403300 558 0.532913513
How do I make 6 new columns for FTYPE (with 1 and 0 to indicate if that row has the FTYPE) and 6 new columns for FTYPE_Length to make each row have a unique GEO_ID?
I want my new dataframe to have a structure like this (with 6 FTYPE-s):
FID GEO_ID FTYPE_428 FTYPE_428_length FTYPE_460 FTYPE_460_length
0 1400000US06001400100 1 3.291467766 1 7.566487367
So far, what I have tried is doing something like this:
import pandas as pd
fname = "filename.csv"
df = pd.read_csv(fname)
nhd = [334, 336, 420, 428, 460, 558, 556]
df1 = df.loc[df['FTYPE']==nhd[0]]
df2 = df.loc[df['FTYPE']==nhd[1]]
df3 = df.loc[df['FTYPE']==nhd[2]]
df4 = df.loc[df['FTYPE']==nhd[3]]
df5 = df.loc[df['FTYPE']==nhd[4]]
df6 = df.loc[df['FTYPE']==nhd[5]]
df7 = df.loc[df['FTYPE']==nhd[6]]
df12 = df1.merge(df2, how='left', left_on='GEO_ID', right_on='GEO_ID')
df23 = df12.merge(df3,how='left', left_on='GEO_ID', right_on='GEO_ID')
df34 = df23.merge(df4,how='left', left_on='GEO_ID', right_on='GEO_ID')
df45 = df34.merge(df5,how='left', left_on='GEO_ID', right_on='GEO_ID')
df56 = df45.merge(df6,how='left', left_on='GEO_ID', right_on='GEO_ID')
df67 = df56.merge(df7,how='left', left_on='GEO_ID', right_on='GEO_ID')
cols = [0,4,7,10,13,16,19]
df67.drop(df67.columns[cols],axis=1,inplace=True)
df67.columns =['GEO_ID','334','len_334','336','len_336','420','len_420','428','len_428','460','len_460','558','len_558','566','len_566']
But this approach is problematic because it reduces the rows to the ones that have the first two FTYPE-s. Is there a way to merge with multiple columns at once?
It's probably easier to write a for loop, go over each row, and use a condition to fill in the values like this:
nhd = [334, 336, 420, 428, 460, 558, 556]
for x in nhd:
    df[str(x)] = None
    df["length_" + str(x)] = None
df.head()
for geoid in df["GEO_ID"]:
    # print(geoid)
    for x in nhd:
        df.loc[(df['FTYPE'] == x) & (df['GEO_ID'] == geoid), str(x)] = 1
But this takes too much time, and there is probably a one-liner in pandas to do the same thing.
Any help on this is appreciated!
Thanks,
Solomon
I don't quite see the point of your _length columns: they seem to carry the same information as whether or not the matching value is null, which makes them redundant. They're easy enough to create, though.
While we could cram this into one line if we insisted, what's the point? This is SO, not codegolf. So I might do something like:
df = df.pivot(index="GEO_ID", columns="FTYPE", values="Length_km")
df.columns = "FTYPE_" + df.columns.astype(str)
has_value = df.notnull().astype(int)
has_value.columns += '_length'
final = pd.concat([df, has_value], axis=1).sort_index(axis='columns')
which gives me (using your input data, which only has 5 distinct FTYPEs):
In [49]: final
Out[49]:
FTYPE_334 FTYPE_334_length FTYPE_428 \
GEO_ID
1400000US06001400100 NaN 0 3.291468
1400000US06001401700 NaN 0 NaN
1400000US06001403300 NaN 0 0.138171
1400000US06001403400 0.04308 1 NaN
FTYPE_428_length FTYPE_460 FTYPE_460_length \
GEO_ID
1400000US06001400100 1 7.566487 1
1400000US06001401700 0 0.262190 1
1400000US06001403300 1 NaN 0
1400000US06001403400 0 NaN 0
FTYPE_558 FTYPE_558_length FTYPE_566 FTYPE_566_length
GEO_ID
1400000US06001400100 NaN 0 NaN 0
1400000US06001401700 NaN 0 10.498992 1
1400000US06001403300 0.532914 1 1.518864 1
1400000US06001403400 NaN 0 NaN 0
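If you do want the naming exactly as the question asked, with FTYPE_x as the 0/1 flag and FTYPE_x_length holding the length, a small variant of the same pivot idea (a sketch; df here means the original long-format frame from the question, before the pivot above):

# pivot once, then derive both the flag and the length columns from it
wide = df.pivot(index='GEO_ID', columns='FTYPE', values='Length_km')
flags = wide.notnull().astype(int).add_prefix('FTYPE_')  # 0/1 presence flags
lengths = wide.add_prefix('FTYPE_')
lengths.columns += '_length'  # FTYPE_x_length holds the Length_km values
final = pd.concat([flags, lengths], axis=1).sort_index(axis='columns')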

Unexpected results in dataframe

When I try the below code, I am getting unexpected results.
This is my code:
def copy_blanks(df, column):
    df.iloc[:, -1] = df.iloc[:, column]
    df.loc[df.iloc[:, column] == '', -1] = ''
My input file:
word,number
abc,0
adf,0
gfsgs,0
,0
sdfgsd,0
fgsdfg,0
sfdgs,0
I am getting output like below:
word,number,word_clean
NA,0,abc
NA,0,adf
NA,0,gfsgs
,0,
NA,0,sdfgsd
NA,0,fgsdfg
NA,0,sfdgs
I want to get output like below:
word,number,word_clean
abc,0,abc
adf,0,adf
gfsgs,0,gfsgs
,0,
sdfgsd,0,sdfgsd
fgsdfg,0,fgsdfg
sfdgs,0,sfdgs
Please advise. I think the issue is caused by this line: df.loc[df.iloc[:, column] == '', -1] = ''.
IIUC you need fillna for replacing NaN with an empty string:
print (df)
word number
0 abc 0
1 adf 0
2 gfsgs 0
3 NaN 0
4 sdfgsd 0
5 fgsdfg 0
6 sfdgs 0
def copy_blanks(df, column):
    df['new'] = df.iloc[:, column]
    df['new'] = df['new'].fillna('')
    return df

column = 0
print(copy_blanks(df, column))
word number new
0 abc 0 abc
1 adf 0 adf
2 gfsgs 0 gfsgs
3 NaN 0
4 sdfgsd 0 sdfgsd
5 fgsdfg 0 fgsdfg
6 sfdgs 0 sfdgs
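For what it's worth, a likely reason the original function misbehaves: .loc is label-based, so in df.loc[mask, -1] the -1 is treated as a column label rather than "the last column" (with assignment it can even create a new column literally labelled -1); also, blanks come in from read_csv as NaN, not as '', so the == '' mask never matches anything. A positional rewrite of the original idea (a sketch; 'word_clean' is the output column name from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({'word': ['abc', 'adf', np.nan, 'sdfgsd'], 'number': 0})

def copy_blanks(df, column):
    # copy the column by position; blanks arrive as NaN here, so fill those
    df['word_clean'] = df.iloc[:, column].fillna('')
    return df

print(copy_blanks(df, 0))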