Converting Lists within Pandas Dataframe into New DataFrame - pandas

I have a dataframe:
df =
              col1          col2
0  [0.1, 0.2, 0.3]     [1, 2, 3]
1  [0.5, 0.6, 0.7]  [11, 12, 13]
My goal is to re-create a data frame from the lists at index 0:
new_df =
   new_col1  new_col2
0       0.1         1
1       0.2         2
2       0.3         3
What I tried was accessing the lists row by row:
new_col1 = df.col1[0]
new_col2 = df.col2[0]
But new_col1 comes back as the following instead of a list, so I am unsure how to approach this:
0    [0.1, 0.2, 0.3]
Name: col1, dtype: object
Thanks.

Here is a way, using apply with Series.explode:
df.apply(pd.Series.explode).loc[0]
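A minimal, self-contained sketch of this approach, recreating the df from the question (the reset_index and add_prefix steps are additions to match the asker's expected column names):

```python
import pandas as pd

# df as in the question
df = pd.DataFrame({'col1': [[0.1, 0.2, 0.3], [0.5, 0.6, 0.7]],
                   'col2': [[1, 2, 3], [11, 12, 13]]})

# Explode each column, keep only the rows that came from index 0,
# then rename the columns with a 'new_' prefix
new_df = (df.apply(pd.Series.explode)
            .loc[0]
            .reset_index(drop=True)
            .add_prefix('new_'))
```

After exploding, the original index labels repeat once per list element, which is why .loc[0] selects all three rows derived from row 0.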

You can create a new DataFrame by selecting the first row with DataFrame.loc or DataFrame.iloc, then transposing with DataFrame.T and using DataFrame.add_prefix for the new column names:
df1 = pd.DataFrame(df.iloc[0].tolist(), index=df.columns).T.add_prefix('new_')
print(df1)
   new_col1  new_col2
0       0.1       1.0
1       0.2       2.0
2       0.3       3.0

new_df = pd.DataFrame([new_col1, new_col2]).transpose()
If you want to add column names,
new_df.columns = ["new_col1","new_col2"]
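A self-contained sketch of this answer, assuming new_col1 and new_col2 were extracted as in the question (note the values become floats because the two lists share columns after transposing):

```python
import pandas as pd

# df as in the question
df = pd.DataFrame({'col1': [[0.1, 0.2, 0.3], [0.5, 0.6, 0.7]],
                   'col2': [[1, 2, 3], [11, 12, 13]]})

new_col1 = df.col1[0]  # the list [0.1, 0.2, 0.3]
new_col2 = df.col2[0]  # the list [1, 2, 3]

# Each list becomes a row, so transpose to turn them into columns
new_df = pd.DataFrame([new_col1, new_col2]).transpose()
new_df.columns = ["new_col1", "new_col2"]
```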

You can use the list() function for this:
>>> new_col1
[0.1, 0.2, 0.3]
>>> new_col1_ = list(new_col1)
>>> new_col1_
[0.1, 0.2, 0.3]
>>> type(new_col1_)
<class 'list'>

Related

drop rows from a Pandas dataframe based on which rows have missing values in another dataframe

I'm trying to drop rows with missing values in any of several dataframes.
They all have the same number of rows, so I tried this:
model_data_with_NA = pd.concat([other_df,
                                standardized_numerical_data,
                                encode_categorical_data], axis=1)
ok_rows = ~(model_data_with_NA.isna().all(axis=1))
model_data = model_data_with_NA.dropna()
assert(sum(ok_rows) == len(model_data))
The assertion fails!
As a newbie in Python, I wonder why this doesn't work. Also, would it be better to use hierarchical indexing, so that I can extract the original columns from model_data?
In Short
I believe the all in ~(model_data_with_NA.isna().all(axis=1)) should be replaced with any.
The reason is that all checks whether every value in a row is missing, while any checks whether at least one value is missing.
Full Example
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'a':[1, 2, 3]})
df2 = pd.DataFrame({'b':[1, np.nan]})
df3 = pd.DataFrame({'c': [1, 2, np.nan]})
model_data_with_na = pd.concat([df1, df2, df3], axis=1)
ok_rows = ~(model_data_with_na.isna().any(axis=1))
model_data = model_data_with_na.dropna()
assert(sum(ok_rows) == len(model_data))
model_data_with_na
   a    b    c
0  1  1.0  1.0
1  2  NaN  2.0
2  3  NaN  NaN
model_data
   a    b    c
0  1  1.0  1.0

Pandas: Replace pattern in column values with list values

Stuck on the following in Pandas...
I have the dataframe:
df = pd.DataFrame({'col1': ['aaCHANGEMEbb', 'nnANDMEdd', 'hhMETOOkk'], 'index': ['a', 'b', 'c']}).set_index('index')
              col1
index
a     aaCHANGEMEbb
b        nnANDMEdd
c        hhMETOOkk
And I want to replace each run of uppercase letters in column 'col1' with the corresponding value from m_list:
m_list = ['1.0', '2.0', '3.0']
One of my attempts that seemed close to the truth:
df['col1'] = df['col1'].str.replace('[A-Z]+', lambda x: [i for i in m_list], regex=True)
And another one:
df['col1'].str.replace('([A-Z]+)', lambda x: m_list[x.group(0)])
It doesn't work. I got this in both cases:
col1
a NaN
b NaN
c NaN
But the expected df below:
col1
a aa1.0bb
b nn2.0dd
c hh3.0kk
Please, share your thoughts about this. Thanks!
You can do it with split:
s = df['col1'].str.split(r'([A-Z]+)', expand=True)
s.loc[:, 1] = m_list
df['col1'] = s.agg(''.join, axis=1)
df
Out[255]:
col1
index
a aa1.0bb
b nn2.0dd
c hh3.0kk
import re

for idx, x in zip(df.index, m_list):
    df.loc[idx, 'col1'] = re.sub(r'[A-Z]+', x, df.loc[idx, 'col1'])
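A vectorized alternative is to feed the replacement values from an iterator, so each regex match consumes the next item. This is a sketch that assumes exactly one uppercase run per row and that rows are processed in order:

```python
import pandas as pd

# df and m_list as in the question
df = pd.DataFrame({'col1': ['aaCHANGEMEbb', 'nnANDMEdd', 'hhMETOOkk'],
                   'index': ['a', 'b', 'c']}).set_index('index')
m_list = ['1.0', '2.0', '3.0']

# With a callable repl, str.replace passes each match object to the
# function; here we ignore the match and pull the next list value
it = iter(m_list)
df['col1'] = df['col1'].str.replace(r'[A-Z]+', lambda m: next(it), regex=True)
```

This only works because the number of matches equals len(m_list); with zero or multiple uppercase runs per row, the iterator and the rows would fall out of step.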

How to delete rows with decimal points from a column of mixed type Pandas dataframe

I'm wondering how to delete rows with decimal points from a column of mixed type in a Pandas data frame.
Suppose I have a column of mixed type (dtype 'O').
d = {'col1': [1, 2.3, 'Level1']}
test1 = pd.DataFrame(data=d)
test1['col1'].dtypes
dtype('O')
test1
col1
0 1
1 2.3
2 Level1
I would like to delete the rows that contain decimal points:
test1
     col1
0       1
2  Level1
I tried str.isdecimal() and str.contains('.'), but neither worked. Thanks in advance.
This may help:
d = {'col1': [1, 2.3, 'Level1']}
test1 = pd.DataFrame(data=d)
test2 = test1.copy()
for i in range(len(test1)):
    if "." in str(test1.iloc[i, 0]):
        test2.drop(i, axis=0, inplace=True)
What about using a regex?
m = test1['col1'].astype(str).str.fullmatch(r'\d+\.\d+')
test1[~m]
Or testing the real object type:
m = test1['col1'].apply(lambda x: isinstance(x, float))
test1[~m]
Output:
col1
0 1
2 Level1
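Both masks can be checked side by side in a minimal, self-contained run (same data as above; the variable names m_regex and m_float are illustrative):

```python
import pandas as pd

d = {'col1': [1, 2.3, 'Level1']}
test1 = pd.DataFrame(data=d)

# Regex approach: match strings that look like a decimal number
m_regex = test1['col1'].astype(str).str.fullmatch(r'\d+\.\d+')

# Type approach: in this data, floats are exactly the values
# with a decimal point
m_float = test1['col1'].apply(lambda x: isinstance(x, float))

result = test1[~m_float]
```

Note the type test would also drop a float like 2.0, whose string form still contains a decimal point, so the two masks agree here.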

Instead of appending value as a new column on the same row, pandas adds a new column AND new row

What I have below is an example of the type of concatenation that I am trying to do.
df = pd.DataFrame(np.array([1, 2, 3]).reshape((1, 3)), columns = ['col1', 'col2', 'col3'], index = ['a'])
df2 = pd.DataFrame() # already exists elsewhere in code
df2 = df2.append([df, pd.Series(1, name = 'label')])
The result I am hoping for is:
col1 col2 col3 label
a 1.0 2.0 3.0 1
but I get is
col1 col2 col3 0
a 1.0 2.0 3.0 NaN
0 NaN NaN NaN 1.0
I know that I'm joining these wrong, but I cannot seem to figure out how its done. Any advice?
This is because the series you are adding has an incompatible index. The original dataframe has ['a'] as the specified index and there is no index specified in the series. If you want to add a new column without specifying an index, the following will give you what you want:
df = pd.DataFrame(np.array([1, 2, 3]).reshape((1, 3)), columns = ['col1', 'col2', 'col3'], index = ['a'])
df2 = pd.DataFrame() # already exists elsewhere in code
df2 = df2.append([df]) # append the desired dataframe
df2['label'] = 1 # add a new column with the value 1 across all rows
print(df2.to_string())
col1 col2 col3 label
a 1 2 3 1
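Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same fix can be sketched with pd.concat:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.array([1, 2, 3]).reshape((1, 3)),
                  columns=['col1', 'col2', 'col3'], index=['a'])
df2 = pd.DataFrame()  # already exists elsewhere in code

# pd.concat replaces the removed DataFrame.append
df2 = pd.concat([df2, df])
df2['label'] = 1  # add the new column with the value 1 across all rows
```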

Assigning index column to empty pandas dataframe

I am creating an empty dataframe that I then want to add data to one row at a time. I want to index on the first column, 'customer_ID'.
I have this:
In[1]: df = pd.DataFrame(columns = ['customer_ID','a','b','c'],index=['customer_ID'])
In[2]: df
Out[3]:
customer_ID a b c
customer_ID NaN NaN NaN NaN
So there is already a row of NaN that I don't want.
Can I point the index to the first column without adding a row of data?
The answer, I think, as hinted at by @JD Long, is to set the index in a separate instruction:
In[1]: df = pd.DataFrame(columns = ['customer_ID','a','b','c'])
In[2]: df.set_index('customer_ID',inplace = True)
In[3]: df
Out[3]:
Empty DataFrame
Columns: [customer_ID, a, b, c]
Index: []
I can then add rows:
In[4]: id='x123'
In[5]: df.loc[id]=[id,4,5,6]
In[6]: df
Out[7]:
customer_ID a b c
x123 x123 4.0 5.0 6.0
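A self-contained sketch of this row-at-a-time pattern (the customer IDs and values are made up for illustration; drop=False keeps customer_ID as a column as well as the index, matching the output above):

```python
import pandas as pd

df = pd.DataFrame(columns=['customer_ID', 'a', 'b', 'c'])
df = df.set_index('customer_ID', drop=False)

# Add rows one at a time via .loc on the (initially empty) frame
for cid, row in [('x123', [4, 5, 6]), ('x456', [7, 8, 9])]:
    df.loc[cid] = [cid] + row
```

For many rows it is usually faster to collect records in a list and build the frame once at the end, since each .loc insertion copies data.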
yes... and you can dropna at any time if you are so inclined:
df = df.set_index('customer_ID').dropna()
df
That is because your dataframe has no rows when you first create it.
df = pd.DataFrame({'customer_ID': ['2'], 'a': ['1'], 'b': ['A'], 'c': ['1']})
df = df.set_index('customer_ID', drop=False)
df