Instead of appending a value as a new column on the same row, pandas adds a new column AND a new row

What I have below is an example of the type of concatenation that I am trying to do.
df = pd.DataFrame(np.array([1, 2, 3]).reshape((1, 3)), columns = ['col1', 'col2', 'col3'], index = ['a'])
df2 = pd.DataFrame() # already exists elsewhere in code
df2 = df2.append([df, pd.Series(1, name = 'label')])
The result I am hoping for is:
col1 col2 col3 label
a 1.0 2.0 3.0 1
but what I get is
col1 col2 col3 0
a 1.0 2.0 3.0 NaN
0 NaN NaN NaN 1.0
I know that I'm joining these wrong, but I cannot seem to figure out how it's done. Any advice?

This is because the series you are adding has an incompatible index. The original dataframe has ['a'] as the specified index and there is no index specified in the series. If you want to add a new column without specifying an index, the following will give you what you want:
df = pd.DataFrame(np.array([1, 2, 3]).reshape((1, 3)), columns = ['col1', 'col2', 'col3'], index = ['a'])
df2 = pd.DataFrame() # already exists elsewhere in code
df2 = df2.append([df]) # append the desired dataframe
df2['label'] = 1 # add a new column with the value 1 across all rows
print(df2.to_string())
col1 col2 col3 label
a 1 2 3 1
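As an aside, DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the same result needs pd.concat. A minimal sketch of the equivalent:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([1, 2, 3]).reshape((1, 3)),
                  columns=['col1', 'col2', 'col3'], index=['a'])
df2 = pd.DataFrame()  # already exists elsewhere in code

# concat replaces the removed append; add the label column afterwards
df2 = pd.concat([df2, df])
df2['label'] = 1
print(df2)
```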


Why is Pandas DataFrame.loc different for a DataFrame with two columns as index?

I am having trouble adding a row to a pandas DataFrame with two columns as index. This is the code I'm using:
df = pd.DataFrame(columns=['id', 'idx1', 'val'])
df = df.set_index(['id', 'idx1'])
df.loc[123, 'a'] = [1]
Then df becomes:
val a
id idx1
123 NaN 1
However, I expect to get this:
val
id idx1
123 a 1
When I change the length of the index to three (or one), I get what I expect. For example, if I run this code:
df = pd.DataFrame(columns=['id', 'idx1', 'idx2', 'val'])
df = df.set_index(['id', 'idx1', 'idx2'])
df.loc[123, 'a', 'b'] = [1]
df becomes:
val
id idx1 idx2
123 a b 1
Is there something different when referring to two columns as index?
Your dataframe is empty. Only index and column names are defined. So how should Pandas know what you mean by df.loc[123, 'a'] = 1?
create an entry with the first index 123 and a column 'a', or
use 123 and 'a' as two levels of a multiindex?
Solution:
df.loc[(123, 'a'), 'val'] = 1
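A runnable sketch of that fix, using the question's own setup: the tuple selects a single key of the two-level index, and the second argument names the column.

```python
import pandas as pd

df = pd.DataFrame(columns=['id', 'idx1', 'val'])
df = df.set_index(['id', 'idx1'])

# (123, 'a') is one MultiIndex key; 'val' is the column to fill
df.loc[(123, 'a'), 'val'] = 1
print(df)
```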

Pandas: Replace pattern in column values with list values

Stuck on the following in Pandas...
I have the dataframe:
df = pd.DataFrame({'col1': ['aaCHANGEMEbb', 'nnANDMEdd', 'hhMETOOkk'], 'index': ['a', 'b', 'c']}).set_index('index')
col1
a aaCHANGEMEbb
b nnANDMEdd
c hhMETOOkk
And I want to change all uppercase cases in column 'col1' with values from m_list:
m_list = ['1.0', '2.0', '3.0']
One of my attempts that seems close to the truth:
df['col1'] = df['col1'].str.replace('[A-Z]+', lambda x: [i for i in m_list], regex=True)
And another one:
df['col1'].str.replace('([A-Z]+)', lambda x: m_list[x.group(0)])
It doesn't work. I got this in both cases:
col1
a NaN
b NaN
c NaN
But the expected df below:
col1
a aa1.0bb
b nn2.0dd
c hh3.0kk
Please, share your thoughts about this. Thanks!
You can do it with split:
s = df['col1'].str.split(r'([A-Z]+)', expand=True)
s.loc[:, 1] = m_list
df['col1'] = s.agg(''.join, axis=1)
df
Out[255]:
col1
index
a aa1.0bb
b nn2.0dd
c hh3.0kk
import re

for i, x in enumerate(m_list):
    df.iloc[i, 0] = re.sub(r'[A-Z]+', x, df.iloc[i, 0])
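The question's callable idea can also be made to work: str.replace accepts a function of the match object when regex=True, so an iterator can hand out one replacement per match. A sketch, assuming exactly one uppercase run per row:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['aaCHANGEMEbb', 'nnANDMEdd', 'hhMETOOkk'],
                   'index': ['a', 'b', 'c']}).set_index('index')
m_list = ['1.0', '2.0', '3.0']

# each regex match consumes the next element of m_list, in row order
rep = iter(m_list)
df['col1'] = df['col1'].str.replace(r'[A-Z]+', lambda m: next(rep), regex=True)
print(df)
```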

Throw an exception and move on in pandas

I have created a pandas dataframe called df with the following code:
import numpy as np
import pandas as pd
ds = {'col1' : ["1","2","3","A"], "col2": [45,6,7,87], "col3" : ["23","4","5","6"]}
df = pd.DataFrame(ds)
The dataframe looks like this:
print(df)
col1 col2 col3
0 1 45 23
1 2 6 4
2 3 7 5
3 A 87 6
Now, col1 and col3 are objects:
print(df.info())
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 col1 4 non-null object
1 col2 4 non-null int64
2 col3 4 non-null object
I want to transform, where possible, the object columns into floats.
For example, I can convert col3 into a float like this:
df['col3'] = df['col3'].astype(float)
But I cannot convert col1 into a float:
df['col1'] = df['col1'].astype(float)
ValueError: could not convert string to float: 'A'
Is it possible to create a code that converts, where possible, the object columns into float and by-passes the cases in which it is not possible (so, without throwing an error which stops the process)? I guess it has to do with exceptions?
I think you can test whether the content is a string/object or not, in which case the conversion won't be made. Did you try this?
for y in df.columns:
    if df[y].dtype == object:
        continue
    else:
        pass  # your treatment here
or, apparently in pandas 0.20.2, there is a function that performs this test: is_string_dtype(df['col1'])
This works when all the values of a column are of the same type; if the values are mixed, iterate over df.values.
I have sorted it.
def convert_float(x):
    try:
        return x.astype(float)
    except (ValueError, TypeError):
        return x

cols = df.columns
for i in range(len(cols)):
    df[cols[i]] = convert_float(df[cols[i]])
print(df)
print(df.info())
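pandas also ships a converter for exactly this: pd.to_numeric raises on strings like 'A', so wrapping it per column gives the same behavior without a bare except. A sketch using the question's data:

```python
import pandas as pd

ds = {'col1': ["1", "2", "3", "A"], "col2": [45, 6, 7, 87], "col3": ["23", "4", "5", "6"]}
df = pd.DataFrame(ds)

# convert column-by-column; skip any column that cannot be parsed
for col in df.columns:
    try:
        df[col] = pd.to_numeric(df[col])
    except (ValueError, TypeError):
        pass  # 'A' in col1 cannot be parsed, so col1 stays object

print(df.dtypes)
```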

Converting Lists within Pandas Dataframe into New DataFrame

I have a dataframe:
df =
col1 col2
0 [0.1,0.2,0.3] [1,2,3]
1 [0.5,0.6,0.7] [11,12,13]
My goal: to re-create data frame from index 0:
new_df =
new_col1 new_col2
0 0.1 1
1 0.2 2
2 0.3 3
What I tried was trying to access row by row:
new_col1 = df.col1[0]
new_col2 = df.col2[0]
But new_col1 results in the output below instead of a list, so I am unsure how to approach this.
0 [0.1,0.2,0.3]
Name: col1, dtype: object
Thanks.
Here is a way using apply:
df.apply(pd.Series.explode).loc[0]
You can create new DataFrame by select first row by DataFrame.loc or DataFrame.iloc and then transpose by DataFrame.T with DataFrame.add_prefix for new columns names:
df1 = pd.DataFrame(df.iloc[0].tolist(), index=df.columns).T.add_prefix('new_')
print (df1)
new_col1 new_col2
0 0.1 1.0
1 0.2 2.0
2 0.3 3.0
new_df = pd.DataFrame([new_col1, new_col2]).transpose()
If you want to add column names,
new_df.columns = ["new_col1","new_col2"]
you can use the list() function for this
>>> new_col1
[0.1, 0.2, 0.3]
>>> new_col1_=list(new_col1)
[0.1, 0.2, 0.3]
>>> type(new_col1_)
<class 'list'>
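The two extracted lists can also be fed straight into the DataFrame constructor as a dict, which yields the expected shape directly. A sketch using the question's data:

```python
import pandas as pd

df = pd.DataFrame({'col1': [[0.1, 0.2, 0.3], [0.5, 0.6, 0.7]],
                   'col2': [[1, 2, 3], [11, 12, 13]]})

# each list from row 0 becomes one column of the new frame
new_df = pd.DataFrame({'new_col1': df.at[0, 'col1'],
                       'new_col2': df.at[0, 'col2']})
print(new_df)
```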

append one CSV to another as a dataframe based on certain column names without headers in pandas

I have a CSV in a data frame with these columns and data
ID. Col1. Col2. Col3 Col4
I have another CSV with just
ID. Column2. Column3
How can I append the 2nd CSV's data to the 1st under the corresponding headers, without including the CSV2 header?
My Expected Dataframe
ID. Col1. Col2. Col3 Col4
Data.CSV1 Data.CSV1 Data.CSV1 Data.CSV1 Data.CSV1
ID.DataCSV2. Column2.DataCSV2. Column3.DataCSV2
Given that the column names in CSV 2 are different.
IIUC, you'll need to clean your column names; then you can do a simple concat.
import re

def col_cleaner(cols):
    new_cols = [re.sub(r'\s+|\.', '', x) for x in cols]
    return new_cols

df1.columns = col_cleaner(df1.columns)
df2.columns = col_cleaner(df2.columns)
# output:
# ['ID', 'Col1', 'Col2', 'Col3', 'Col4']
# ['ID', 'Column2', 'Column3']
new_df = pd.concat([df1,df2],axis=0)
new_df.to_csv('your_csv.csv')
I think you can use .append
df1.append(df2)
col1 col2 col3
0 1 2 2.0
1 2 3 3.0
2 3 4 4.0
0 3 2 NaN
1 4 3 NaN
2 5 4 NaN
Sample Data
df1 = pd.DataFrame({'col1': [1,2,3], 'col2':[2,3,4], 'col3':[2,3,4]})
df2 = pd.DataFrame({'col1': [3,4,5], 'col2':[2,3,4]})
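Combining the two answers with the question's actual goal (append without writing CSV2's header), one hedged sketch; the data and file handling are illustrative, not taken from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1], 'Col1': ['a'], 'Col2': ['b'],
                    'Col3': ['c'], 'Col4': ['d']})
df2 = pd.DataFrame({'ID': [2], 'Col2': ['x'], 'Col3': ['y']})

# give df2 the full column set (missing ones become NaN), then stack the rows
combined = pd.concat([df1, df2.reindex(columns=df1.columns)],
                     ignore_index=True)
print(combined)
```

To write the result, combined.to_csv('out.csv', index=False) emits a single header; appending df2 rows to an existing file instead would use mode='a', header=False.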