I have a dataframe:
df =
col1 col2
0 [0.1,0.2,0.3] [1,2,3]
1 [0.5,0.6,0.7] [11,12,13]
My goal is to re-create a data frame from the lists in row 0:
new_df =
new_col1 new_col2
0 0.1 1
1 0.2 2
2 0.3 3
What I tried was accessing the lists row by row:
new_col1 = df.col1[0]
new_col2 = df.col2[0]
But new_col1 comes back as a Series instead of a list, so I am unsure how to approach this:
0 [0.1,0.2,0.3]
Name: col1, dtype: object
Thanks.
Here is a way by using apply.
df.apply(pd.Series.explode).loc[0]
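A self-contained sketch of that approach, renaming the columns to match the expected output (this assumes the list columns in a row all have equal lengths, so explode keeps them aligned):

```python
import pandas as pd

df = pd.DataFrame({'col1': [[0.1, 0.2, 0.3], [0.5, 0.6, 0.7]],
                   'col2': [[1, 2, 3], [11, 12, 13]]})

# Explode every list column (index labels repeat per list element),
# keep only the rows that came from index 0, and rename the columns.
new_df = (df.apply(pd.Series.explode)
            .loc[0]
            .reset_index(drop=True)
            .add_prefix('new_'))
```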
You can create a new DataFrame by selecting the first row with DataFrame.loc or DataFrame.iloc, then transposing with DataFrame.T and adding new column names with DataFrame.add_prefix:
df1 = pd.DataFrame(df.iloc[0].tolist(), index=df.columns).T.add_prefix('new_')
print (df1)
new_col1 new_col2
0 0.1 1.0
1 0.2 2.0
2 0.3 3.0
new_df = pd.DataFrame([new_col1, new_col2]).transpose()
If you want to add column names,
new_df.columns = ["new_col1","new_col2"]
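Putting that answer together as a runnable sketch (assuming the frame has the default integer index, so df.col1[0] really returns the list):

```python
import pandas as pd

df = pd.DataFrame({'col1': [[0.1, 0.2, 0.3], [0.5, 0.6, 0.7]],
                   'col2': [[1, 2, 3], [11, 12, 13]]})

new_col1 = df.col1[0]  # the list [0.1, 0.2, 0.3]
new_col2 = df.col2[0]  # the list [1, 2, 3]

# Two rows of equal length, transposed into two columns
new_df = pd.DataFrame([new_col1, new_col2]).transpose()
new_df.columns = ["new_col1", "new_col2"]
```

Note that mixing the float and int lists in one array gives float columns, so new_col2 comes out as 1.0, 2.0, 3.0.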
You can use the list() function for this:
>>> new_col1
[0.1, 0.2, 0.3]
>>> new_col1_ = list(new_col1)
>>> new_col1_
[0.1, 0.2, 0.3]
>>> type(new_col1_)
<class 'list'>
I'm trying to drop rows with missing values in any of several dataframes.
They all have the same number of rows, so I tried this:
model_data_with_NA = pd.concat([other_df,
standardized_numerical_data,
encode_categorical_data], axis=1)
ok_rows = ~(model_data_with_NA.isna().all(axis=1))
model_data = model_data_with_NA.dropna()
assert(sum(ok_rows) == len(model_data))
The assertion fails!
As a Python newbie, I wonder why this doesn't work. Also, is it better to use hierarchical indexing? Then I could extract the original columns from model_data.
In Short
I believe the all in ~(model_data_with_NA.isna().all(axis=1)) should be replaced with any.
The reason is that all checks here if every value in a row is missing, and any checks if one of the values is missing.
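The difference on a single row, as a quick sketch:

```python
import numpy as np
import pandas as pd

row = pd.DataFrame({'a': [1.0], 'b': [np.nan]})

all_missing = row.isna().all(axis=1).iloc[0]  # False: 'a' is present
any_missing = row.isna().any(axis=1).iloc[0]  # True: 'b' is missing
```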
Full Example
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'a':[1, 2, 3]})
df2 = pd.DataFrame({'b':[1, np.nan]})
df3 = pd.DataFrame({'c': [1, 2, np.nan]})
model_data_with_na = pd.concat([df1, df2, df3], axis=1)
ok_rows = ~(model_data_with_na.isna().any(axis=1))
model_data = model_data_with_na.dropna()
assert(sum(ok_rows) == len(model_data))
model_data_with_na
   a    b    c
0  1  1.0  1.0
1  2  NaN  2.0
2  3  NaN  NaN
model_data
   a    b    c
0  1  1.0  1.0
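On the hierarchical-indexing question: passing keys= to pd.concat gives the combined frame a column MultiIndex, so the original columns can be pulled back out by group. A minimal sketch (the 'num' and 'cat' labels here are made up for illustration):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = pd.DataFrame({'b': [1, np.nan]})

# keys= builds a column MultiIndex: ('num', 'a'), ('cat', 'b')
combined = pd.concat([df1, df2], axis=1, keys=['num', 'cat'])

original = combined['num']  # recover df1's columns
```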
Stuck on the following in Pandas...
I have the dataframe:
df = pd.DataFrame({'col1': ['aaCHANGEMEbb', 'nnANDMEdd', 'hhMETOOkk'], 'index': ['a', 'b', 'c']}).set_index('index')
col1
a aaCHANGEMEbb
b nnANDMEdd
c hhMETOOkk
And I want to change all uppercase cases in column 'col1' with values from m_list:
m_list = ['1.0', '2.0', '3.0']
One of my attempts that seems close to the truth:
df['col1'] = df['col1'].str.replace('[A-Z]+', lambda x: [i for i in m_list], regex=True)
And another one:
df['col1'].str.replace('([A-Z]+)', lambda x: m_list[x.group(0)])
It doesn't work. I got this in both cases:
col1
a NaN
b NaN
c NaN
But the expected df below:
col1
a aa1.0bb
b nn2.0dd
c hh3.0kk
Please, share your thoughts about this. Thanks!
You can do it with str.split:
s = df['col1'].str.split(r'([A-Z]+)',expand=True)
s.loc[:,1] = m_list
df['col1'] = s.agg(''.join, axis=1)
df
Out[255]:
col1
index
a aa1.0bb
b nn2.0dd
c hh3.0kk
import re

# Use iloc: the frame's index is 'a', 'b', 'c', so df.loc[0] would fail
for idx, x in enumerate(m_list):
    df.iloc[idx, 0] = re.sub(r'[A-Z]+', x, df.iloc[idx, 0])
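An alternative sketch: re.sub accepts a callable replacement, so the substitutions can be consumed from an iterator over m_list in one pass (this assumes one uppercase run per row, in row order):

```python
import re
import pandas as pd

df = pd.DataFrame({'col1': ['aaCHANGEMEbb', 'nnANDMEdd', 'hhMETOOkk'],
                   'index': ['a', 'b', 'c']}).set_index('index')
m_list = ['1.0', '2.0', '3.0']

replacements = iter(m_list)
# Each uppercase run is replaced with the next value from the iterator
df['col1'] = df['col1'].apply(
    lambda s: re.sub(r'[A-Z]+', lambda m: next(replacements), s))
```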
I'm wondering how to delete rows with decimal points from a column of mixed type in a Pandas data frame.
Suppose I have a column of mixed type (type 'o').
d = {'col1': [1, 2.3, 'Level1']}
test1 = pd.DataFrame(data=d)
test1['col1'].dtypes
dtype('O')
test1
col1
0 1
1 2.3
2 Level1
I would like to delete the rows that contain decimal points.
test1
col1
0 1
2 Level1
I tried str.isdecimal() and str.contains('.'), but they didn't work. Thanks in advance.
This may help:
d = {'col1': [1, 2.3, 'Level1']}
test1 = pd.DataFrame(data=d)
test2 = test1.copy()
for i in range(len(test1)):
    if "." in str(test1.iloc[i, 0]):
        test2.drop(i, axis=0, inplace=True)
What about using a regex?
m = test1['col1'].astype(str).str.fullmatch(r'\d+\.\d+')
test1[~m]
Or testing the real object type:
m = test1['col1'].apply(lambda x: isinstance(x, float))
test1[~m]
Output:
col1
0 1
2 Level1
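For reference, a self-contained version of the regex approach (str.fullmatch needs pandas 1.1 or later):

```python
import pandas as pd

test1 = pd.DataFrame({'col1': [1, 2.3, 'Level1']})

# Keep rows whose string form does not look like a decimal number
m = test1['col1'].astype(str).str.fullmatch(r'\d+\.\d+')
result = test1[~m]
```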
What I have below is an example of the type of concatenation that I am trying to do.
df = pd.DataFrame(np.array([1, 2, 3]).reshape((1, 3)), columns = ['col1', 'col2', 'col3'], index = ['a'])
df2 = pd.DataFrame() # already exists elsewhere in code
df2 = df2.append([df, pd.Series(1, name = 'label')])
The result I am hoping for is:
col1 col2 col3 label
a 1.0 2.0 3.0 1
but what I get is:
col1 col2 col3 0
a 1.0 2.0 3.0 NaN
0 NaN NaN NaN 1.0
I know that I'm joining these wrong, but I cannot seem to figure out how it's done. Any advice?
This is because the series you are adding has an incompatible index. The original dataframe has ['a'] as the specified index and there is no index specified in the series. If you want to add a new column without specifying an index, the following will give you what you want:
df = pd.DataFrame(np.array([1, 2, 3]).reshape((1, 3)), columns = ['col1', 'col2', 'col3'], index = ['a'])
df2 = pd.DataFrame() # already exists elsewhere in code
df2 = df2.append([df]) # append the desired dataframe
df2['label'] = 1 # add a new column with the value 1 across all rows
print(df2.to_string())
col1 col2 col3 label
a 1 2 3 1
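Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; an equivalent sketch of the same answer with pd.concat:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([1, 2, 3]).reshape((1, 3)),
                  columns=['col1', 'col2', 'col3'], index=['a'])
df2 = pd.DataFrame()  # already exists elsewhere in code

df2 = pd.concat([df2, df])  # concat instead of the removed append
df2['label'] = 1            # one value broadcast to every row
```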
I am creating an empty dataframe that I then want to add data to one row at a time. I want to index on the first column, 'customer_ID'.
I have this:
In[1]: df = pd.DataFrame(columns = ['customer_ID','a','b','c'],index=['customer_ID'])
In[2]: df
Out[3]:
customer_ID a b c
customer_ID NaN NaN NaN NaN
So there is already a row of NaN that I don't want.
Can I point the index to the first column without adding a row of data?
The answer, I think, as hinted at by @JD Long, is to set the index in a separate instruction:
In[1]: df = pd.DataFrame(columns = ['customer_ID','a','b','c'])
In[2]: df.set_index('customer_ID', drop=False, inplace=True)
In[3]: df
Out[3]:
Empty DataFrame
Columns: [customer_ID, a, b, c]
Index: []
I can then add rows:
In[4]: id='x123'
In[5]: df.loc[id]=[id,4,5,6]
In[6]: df
Out[7]:
customer_ID a b c
x123 x123 4.0 5.0 6.0
yes... and you can dropna at any time if you are so inclined:
df = df.set_index('customer_ID').dropna()
df
That's because you didn't have any rows in your dataframe when you created it.
df= pd.DataFrame({'customer_ID': ['2'],'a': ['1'],'b': ['A'],'c': ['1']})
df = df.set_index('customer_ID', drop=False)
df