How do you split All columns in a large pandas data frame? - pandas

I have a very large data frame that I want to split ALL of the columns except first two based on a comma delimiter. So I need to logically reference column names in a loop or some other way to split all the columns in one swoop.
In my testing of the split method:
I have been able to explicitly refer to ( i.e. HARD CODE) a single column name (rs145629793) as one of the required parameters and the result was 2 new columns as I wanted.
See python code below
HARDCODED COLUMN NAME --
df[['rs1','rs2']] = df.rs145629793.str.split(",", expand = True)
The problem:
It is not feasible to refer to the actual column names and repeat code.
I then replaced the actual column name rs145629793 with columns[2] in the split method parameter list.
It results in an ERROR
'str has ni str attribute'

You can index columns by position rather than name using iloc. For example, to get the third column:
df.iloc[:, 2]
Thus you can easily loop over the columns you need.

I know what you are asking, but it's still helpful to provide some input data and expected output data. I have included random input data in my code below, so you can just copy and paste this to run, and try to apply it to your dataframe:
import pandas as pd
your_dataframe=pd.DataFrame({'a':['1,2,3', '9,8,7'],
'b':['4,5,6', '6,5,4'],
'c':['7,8,9', '3,2,1']})
import copy
def split_cols(df):
dict_of_df = {}
cols=df.columns.to_list()
for col in cols:
key_name = 'df'+str(col)
dict_of_df[key_name] = copy.deepcopy(df)
var=df[col].str.split(',', expand=True).add_prefix(col)
df=pd.merge(df, var, how='left', left_index=True, right_index=True).drop(col, axis=1)
return df
split_cols(your_dataframe)
Essentially, in this solution you create a list of the columns that you want to loop through. Then you loop through that list and create new dataframes for each column where you run the split() function. Then you merge everything back together on the index. I also:
included a prefix of the column name, so the column names did not have duplicate names and could be more easily identifiable
dropped the old column that we did the split on.
Just import copy and use the split_cols() function that I have created and pass the name of your dataframe.

Related

How can I create a pandas column based on another pandas column that has for values a list?

I am working with a dataframe and one of the columns has for values a list of strings in each row. The list contains a number of links (each list can have a different number of links). I want to create a new column that will be based on this column of lists but keep only the links that have the keyword "uploads".
To my example, the first entry of the column is like that:
['https://seekingalpha.com/instablog/5006891-hfir/4960045-natural-gas-daily',
'https://seekingalpha.com/article/4116929-weekly-natural-gas-storage-report',
'https://static.seekingalpha.com/uploads/2017/10/26/5006891-15090647719993095_origin.png',
'https://static.seekingalpha.com/uploads/2017/10/26/5006891-15090647854075453_origin.png',
'https://static.seekingalpha.com/uploads/2017/10/26/5006891-1509065004154725_origin.png',
'https://seekingalpha.com/account/research/subscribe?slug=hfir-energy&sasource=upsell']
And I want to keep only
['https://static.seekingalpha.com/uploads/2017/10/26/5006891-15090647719993095_origin.png',
'https://static.seekingalpha.com/uploads/2017/10/26/5006891-15090647854075453_origin.png',
'https://static.seekingalpha.com/uploads/2017/10/26/5006891-1509065004154725_origin.png']
And put the clean version in a new column of the same dataframe.
Can you please suggest a way to do it?
I just found a way where I create a function that looks within a list for a specific pattern (in my case the keyword "uploads")
def clean_alt_list(list_):
list_ = [s for s in list_ if "uploads" in s]
return list_
And then I apply this function into the column I am interested in
df['clean_links'] = df['links'].apply(clean_alt_list)
IIUC, this should work for you:
df = pd.DataFrame({'url': [['https://seekingalpha.com/instablog/5006891-hfir/4960045-natural-gas-daily', 'https://seekingalpha.com/article/4116929-weekly-natural-gas-storage-report', 'https://static.seekingalpha.com/uploads/2017/10/26/5006891-15090647719993095_origin.png', 'https://static.seekingalpha.com/uploads/2017/10/26/5006891-15090647854075453_origin.png', 'https://static.seekingalpha.com/uploads/2017/10/26/5006891-1509065004154725_origin.png', 'https://seekingalpha.com/account/research/subscribe?slug=hfir-energy&sasource=upsell']]})
df = df.explode('url').reset_index(drop=True)
df[df['url'].str.contains('uploads')]
Result:
url
2 https://static.seekingalpha.com/uploads/2017/1...
3 https://static.seekingalpha.com/uploads/2017/1...
4 https://static.seekingalpha.com/uploads/2017/1...

Pandas splitting a column with new line separator

I am extracting tables from pdf using Camelot. Two of the columns are getting merged together with a newline separator. Is there a way to separate them into two columns?
Suppose the column looks like this.
A\nB
1\n2
2\n3
3\n4
Desired output:
|A|B|
|-|-|
|1|2|
|2|3|
|3|4|
I have tried df['A\nB'].str.split('\n', 2, expand=True) and that splits it into two columns however I want the new column names to be A and B and not 0 and 1. Also I need to pass a generalized column label instead of actual column name since I need to implement this for several docs which may have different column names. I can determine such column name in my dataframe using
colNew = df.columns[df.columns.str.contains(pat = '\n')]
However when I pass colNew in split function, it throws an attribute error
df[colNew].str.split('\n', 2, expand=True)
AttributeError: DataFrame object has no attribute 'str'
You can take advantage of the Pandas split function.
import pandas as pd
# recreate your pandas series above.
df = pd.DataFrame({'A\nB':['1\n2','2\n3','3\n4']})
# first: Turn the col into str.
# second. split the col based on seperator \n
# third: make sure expand as True since you want the after split col become two new col
test = df['A\nB'].astype('str').str.split('\n',expand=True)
# some rename
test.columns = ['A','B']
I hope this is helpful.
I reproduced the error from my side... I guess the issue is that "df[colNew]" is still a dataframe as it contains the indexes.
But .str.split() only works on Series. So taking as example your code, I would convert the dataframe to series using iloc[:,0].
Then another line to split the column headers:
df2=df[colNew].iloc[:,0].str.split('\n', 2, expand=True)
df2.columns = 'A\nB'.split('\n')

Selecting columns from a dataframe

I have a dataframe of monthly returns for 1,000 stocks with ids as column names.
monthly returns
I need to select only the columns that match the values in another dataframe which includes the ids I want.
permno list
I'm sure this is really quite simple, but I have been struggling for 2 days and if someone has an easy solution it would be so very much appreciated. Thank you.
You could convert the single-column permno list dataframe (osr_curr_permnos) into a list, and then use that list to select certain columns from your main dataframe (all_rets).
To convert the osr_curr_permnos column "0" into a list, you can use .to_list()
Then, you can use that list to slice all_rets and .copy() to make a fresh copy of it into a new dataframe.
The python code might look something like:
keep = osr_curr_permnos['0'].to_list()
selected_rets = all_rets[keep].copy()
"keep" would be a list, and "selected_rets" would be your new dataframe.
If there's a chance that osr_curr_permnos would have duplicates, you'll want to filter those out:
keep = osr_curr_permnos['0'].drop_duplicates().to_list()
selected_rets = all_rets[keep].copy()
As I expected, the answer was more simple than I was making it. Basically, I needed to take the integer values in my permnos list and recast those as strings.
osr_curr_permnos['0'] = osr_curr_permnos['0'].apply(str)
keep = osr_curr_permnos['0'].values
Then I can use that to select columns from my returns dataframe which had string values as column headers.
all_rets[keep]
It was all just a mismatch of int vs. string.

Working with dataframe / matrix to create an input for sklearn & Tensorflow

I am working with pandas / python /numpy / datalab/bigQuery to generate an input table for machine learning processing. The data is genomic - and right now, I am working with small subset of
174 rows
12430 columns
The column names are extracted from bigQuery (df_pik3ca_features = bq.Query(std_sql_features).to_dataframe(dialect='standard',use_cache=True))
at the same way, the row names are extracted: samples_rows = bq.Query('SELECT sample_id FROMspeedy-emissary-167213.pgp_orielresearch.pgp_PIK3CA_all_features_values_step_3GROUP BY sample_id')
what would be the easiest way to create a dataframe / matrix with named rows and columns that were extracted.
I explored the dataframes in pandas and could not find the way to pass the names as parameter.
for empty array, I was able to find the following (numpy) with no names:
a = np.full([num_of_rows, num_of_columns], np.nan)
a.columns
I know R very well (if there is no other way - I hope that I can use it with datalab)
any idea?
Many thanks!
If you have your column names and row names stored in lists then you can just use .loc to select the exact rows and columns you desire. Just make sure that the row names are in the index. You might need to do df.set_index('sample_id') to put the correct row name in the index.
Assuming the rows and columns are in variables row_names and col_names, do this.
df.loc[row_names, col_names]

pandas merge produce duplicate columns

n1 = DataFrame({'zhanghui':[1,2,3,4] , 'wudi':[17,'gx',356,23] ,'sas'[234,51,354,123] })
n2 = DataFrame({'zhanghui_x':[1,2,3,5] , 'wudi':[17,23,'sd',23] ,'wudi_x':[17,23,'x356',23] ,'wudi_y':[17,23,'y356',23] ,'ddd':[234,51,354,123] })
code above defined two DataFrame objects. I wanna use 'zhanghui' field from n1 and 'zhanghui_x' field from n2 as "on" field merge n1 and n2,so my code like this:
n1.merge(n2,how = 'inner',left_on = 'zhanghui',right_on='zhanghui_x')
and then result columns given like this :
sas wudi_x zhanghui ddd wudi_y wudi_x wudi_y zhanghui_x
Some duplicate columns appeared,such as 'wudi_x' ,'wudi_y'.
So it's a pandas inner problems or I had a wrong usage about pd.merge ?
From pandas documentation, the merge() function has following properties;
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=True,
suffixes=('_x', '_y'), copy=True, indicator=False,
validate=None)
where suffixes denote default suffix string to be attached to 'over-lapping' columns with defaults '_x' and '_y'.
I'm not sure if I understood your follow-up question correctly, but;
#case1
if the first dataFrame has column 'column_name_x' and the second dataFrame has column 'column_name' then there are no over-lapping columns and therefore no suffixes are attached.
#case2
if the first dataFrame has columns 'column_name', 'column_name_x' and the second dataFrame also has column 'column_name', the default suffixes attach to over-lapping columns and therefore the first frame's 'columnn_name' becomes 'column_name_x' and result in a duplicate of already existing column.
You can however, pass a None value to one(not all) of the suffixes to ensure that column names of certain dataFrame remain as-is.
Your approach is right, pandas automatically gives postscripts after merging the columns that are "duplicated" with the original headers given a postscript _x, _y, etc.
you can first select what columns to merge and proceed:
cols_to_use = n2.columns - n1.columns
n1.merge(n2[cols_to_use],how = 'inner',left_on = 'zhanghui',right_on='zhanghui_x')
result columns:
sas wudi zhanghui ddd wudi_x wudi_y zhanghui_x
When I tried to run cols_to_use = n2.columns - n1.columns,it gave me a TypeError like this:
cannot perform __sub__ with this index type: <class pandas.core.indexes.base.Index'>
then I tried to use code below:
cols_to_use = [i for i in list(n2.columns) if i not in list(n1.columns) ]
It worked fine,result columns given like this:
sas wudi zhanghui ddd wudi_x wudi_y zhanghui_x
So,#S Ringne's method really resolved my problems.
=============================================
Pandas just simply add suffix such as '_x' to resolve the duplicate-column-name problem when it comes to merging two Frame objects.
But what will it happen if the name form of 'a-column-name'+'_x' appears in either Frame object? I used to think that it will check if the name form of 'a-column-name'+'_x' appears, But actually pandas doesn't have this check?