How to concatenate numerous column names in pandas? - pandas

I would like to concatenate all the columns with comma-delimitted in pandas.
But as you can seem it is very laborious tasks since I manually typed all the column indices.
de = data[3]+","+data[4]+","+data[5]+....+","+data[1511]
do you have any idea to avoid above procedure in pandas in python3?

First convert all columns to strings by DataFrame.astype and then possible add join per rows:
df = data.astype(str).apply(','.join, axis=1)
Or after convert to strings add ,, then sum and last remove last , by Series.str.rstrip:
df = data.astype(str).add(',').sum(axis=1).str.rstrip(',')

Related

Proper way to join data based on coditions

I want to add a new column to a datframe "table" (name: conc) which uses the values in columns (plate, ab) to get the numeric value from the dataframe "concs"
Below is what I mean, with the dataframe "exp" used to show what I expect the data to look like
what is the proper way to do this. Is it using some multiple condition, or do I need to reshape the concs dataframe somehow?
Use DataFrame.melt with left join for new column concs, if no match is created NaNs:
exp = concs.melt('plate', var_name='ab', value_name='concs').merge(table,on=['plate', 'ab'], how='left')
Solution should be simplify - if same columns names 'plate', 'ab' in both DataFrames and need merge by both is possible omit on parameter:
exp = concs.melt('plate', var_name='ab', value_name='concs').merge(table, how='left')
First melt the concs dataframe and then merge with table:
out = concs.melt(id_vars=['plate'],
value_vars=concs.columns.drop('plate').tolist(),
var_name='ab').merge(table, on=['plate', 'ab'
]).rename(columns={'value': 'concs'})
or just make good use of parameters of melt like in jezraels' answer:
out = concs.melt(id_vars=['plate'],
value_name='concs',
var_name='ab').merge(table, on=['plate', 'ab'])

Pandas splitting a column with new line separator

I am extracting tables from pdf using Camelot. Two of the columns are getting merged together with a newline separator. Is there a way to separate them into two columns?
Suppose the column looks like this.
A\nB
1\n2
2\n3
3\n4
Desired output:
|A|B|
|-|-|
|1|2|
|2|3|
|3|4|
I have tried df['A\nB'].str.split('\n', 2, expand=True) and that splits it into two columns however I want the new column names to be A and B and not 0 and 1. Also I need to pass a generalized column label instead of actual column name since I need to implement this for several docs which may have different column names. I can determine such column name in my dataframe using
colNew = df.columns[df.columns.str.contains(pat = '\n')]
However when I pass colNew in split function, it throws an attribute error
df[colNew].str.split('\n', 2, expand=True)
AttributeError: DataFrame object has no attribute 'str'
You can take advantage of the Pandas split function.
import pandas as pd
# recreate your pandas series above.
df = pd.DataFrame({'A\nB':['1\n2','2\n3','3\n4']})
# first: Turn the col into str.
# second. split the col based on seperator \n
# third: make sure expand as True since you want the after split col become two new col
test = df['A\nB'].astype('str').str.split('\n',expand=True)
# some rename
test.columns = ['A','B']
I hope this is helpful.
I reproduced the error from my side... I guess the issue is that "df[colNew]" is still a dataframe as it contains the indexes.
But .str.split() only works on Series. So taking as example your code, I would convert the dataframe to series using iloc[:,0].
Then another line to split the column headers:
df2=df[colNew].iloc[:,0].str.split('\n', 2, expand=True)
df2.columns = 'A\nB'.split('\n')

How do you split All columns in a large pandas data frame?

I have a very large data frame that I want to split ALL of the columns except first two based on a comma delimiter. So I need to logically reference column names in a loop or some other way to split all the columns in one swoop.
In my testing of the split method:
I have been able to explicitly refer to ( i.e. HARD CODE) a single column name (rs145629793) as one of the required parameters and the result was 2 new columns as I wanted.
See python code below
HARDCODED COLUMN NAME --
df[['rs1','rs2']] = df.rs145629793.str.split(",", expand = True)
The problem:
It is not feasible to refer to the actual column names and repeat code.
I then replaced the actual column name rs145629793 with columns[2] in the split method parameter list.
It results in an ERROR
'str has ni str attribute'
You can index columns by position rather than name using iloc. For example, to get the third column:
df.iloc[:, 2]
Thus you can easily loop over the columns you need.
I know what you are asking, but it's still helpful to provide some input data and expected output data. I have included random input data in my code below, so you can just copy and paste this to run, and try to apply it to your dataframe:
import pandas as pd
your_dataframe=pd.DataFrame({'a':['1,2,3', '9,8,7'],
'b':['4,5,6', '6,5,4'],
'c':['7,8,9', '3,2,1']})
import copy
def split_cols(df):
dict_of_df = {}
cols=df.columns.to_list()
for col in cols:
key_name = 'df'+str(col)
dict_of_df[key_name] = copy.deepcopy(df)
var=df[col].str.split(',', expand=True).add_prefix(col)
df=pd.merge(df, var, how='left', left_index=True, right_index=True).drop(col, axis=1)
return df
split_cols(your_dataframe)
Essentially, in this solution you create a list of the columns that you want to loop through. Then you loop through that list and create new dataframes for each column where you run the split() function. Then you merge everything back together on the index. I also:
included a prefix of the column name, so the column names did not have duplicate names and could be more easily identifiable
dropped the old column that we did the split on.
Just import copy and use the split_cols() function that I have created and pass the name of your dataframe.

Pandas split list inside a column into separate columns

I have a dataset with 71 columns and 113 rows. Each column is a array of values. I want to split these arrays into separate columns. Then rename the columns with the prefix
!wget https://raw.githubusercontent.com/pranavn91/sample/master/audioonly.csv
audio = pd.read_csv("audioonly.csv")
zcr = pd.DataFrame(audio['zcr'].str.split().values.tolist())
zcr.columns = ['zcr_' + str(col) for col in zcr.columns]
I can do it for each column individually and combine as single dataframe.
Please propose a faster method.
you can use concat and a list comprehension:
audio_exploded = pd.concat([pd.DataFrame(audio[col].str.split().values.tolist())\
.add_prefix(f'{col}_')
for col in audio.columns],
axis=1)

How to concat 3 dataframes with each into sequential columns

I'm trying to understand how to concat three individual dataframes (i.e df1, df2, df3) into a new dataframe say df4 whereby each individual dataframe has its own column left to right order.
I've tried using concat with axis = 1 to do this, but it appears not possible to automate this with a single action.
Table1_updated = pd.DataFrame(columns=['3P','2PG-3Io','3Io'])
Table1_updated=pd.concat([get_table1_3P,get_table1_2P_max_3Io,get_table1_3Io])
Note that with the exception of get_table1_2P_max_3Io, which has two columns, all other dataframes have one column
For example,
get_table1_3P =
get_table1_2P_max_3Io =
get_table1_3Io =
Ultimately, i would like to see the following:
I believe you need first concat and tthen change order by list of columns names:
Table1_updated=pd.concat([get_table1_3P,get_table1_2P_max_3Io,get_table1_3Io], axis=1)
Table1_updated = Table1_updated[['3P','2PG-3Io','3Io']]