How to split a series of arrays into multiple series? - pandas

One column of my dataset contains numpy arrays as elements. I want to split it into multiple columns, each holding one value of the array.
The data currently looks like:
   column1              column2  column3
0        1  np.array([1,2,3,4])      4.5
1        2  np.array([5,6,7,8])      3.0
I want to convert it into:
   column1  col1  col2  col3  col4  column3
0        1     1     2     3     4      4.5
1        2     5     6     7     8      3.0
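For reference, a minimal reproducible setup for this question (a sketch; the answers below call the array column 'col', so that name is used here in place of column2):
import numpy as np
import pandas as pd

# Toy frame matching the question; 'col' stands in for column2
df = pd.DataFrame({
    'column1': [1, 2],
    'col': [np.array([1, 2, 3, 4]), np.array([5, 6, 7, 8])],
    'column3': [4.5, 3],
})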

Another possible solution, based on pandas.DataFrame.from_records:
out = pd.DataFrame.from_records(
    df['col'],
    columns=[f'col{i+1}' for i in range(len(df.loc[0, 'col']))]
)
Output:
   col1  col2  col3  col4
0     1     2     3     4
1     5     6     7     8
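To reattach the remaining columns of the original frame, pd.concat works since out shares the same index (a sketch, building on the setup above):
# Glue the expanded columns back between column1 and column3
result = pd.concat([df[['column1']], out, df[['column3']]], axis=1)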

As an alternative:
df = pd.DataFrame(data={'col': [np.array([1,2,3,4]), np.array([5,6,7,8])]})
new_df = pd.DataFrame(df.col.tolist(), index=df.index)  # expand the column into a new dataframe, keeping the index of the old df
new_df.columns = ["col_{}".format(i) for i in range(1, len(new_df.columns) + 1)]
'''
   col_1  col_2  col_3  col_4
0      1      2      3      4
1      5      6      7      8
'''
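Since new_df keeps the old index, joining it back onto the original frame is straightforward (a sketch, assuming the full frame also holds column1 and column3):
# Drop the array column and attach the expanded one; the indexes align
df = df.drop(columns='col').join(new_df)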

I hope I've understood your question correctly. You can leverage the result_type="expand" option of the .apply method:
df = df.apply(
    lambda x: {f"col{k}": vv for v in x for k, vv in enumerate(v, 1)},
    result_type="expand",
    axis=1,
)
print(df)
Prints:
   col1  col2  col3  col4
0     1     2     3     4
1     5     6     7     8
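Note that the dictionary comprehension iterates over every cell in the row, so this assumes the frame holds only the array column. If the frame also carries column1 and column3, one way (a sketch under that assumption) is to apply it to a one-column subset and concatenate:
# Expand only the array column, then put the other columns back
expanded = df[['col']].apply(
    lambda x: {f"col{k}": vv for v in x for k, vv in enumerate(v, 1)},
    result_type="expand",
    axis=1,
)
result = pd.concat([df.drop(columns='col'), expanded], axis=1)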

Related

Is there a way to count the number of values in a row that are greater than a "variable" value in Pandas?

I have two separate DataFrames:
df1:
Col1  Col2  Col3  Col4  Col5
ID1      2     3     5     0
ID2      7     6    11     5
ID3      9    16    20    12
df2:
Col1  ColB
ID1      2
ID2      7
ID3      9
Is there a way to count how many values in the first row of df1 are greater than the first value of column ColB in df2? I need this count for each row, appended at the end of df1, so that df1 looks like this:
df1:
Col1  Col2  Col3  Col4  Col5  COUNT
ID1      2     3     5     0      2
ID2      7     6    11     5      1
ID3      9    16    20    12      3
Thank you for any suggestions!
This assumes 'Col1' is the index. If it is not, add .set_index('Col1') after df1/df2 on the right-hand side of the assignments:
You can use the underlying numpy array:
df1['COUNT'] = (df1.values > df2.values).sum(axis=1)
# if "Col1" is not the index
df1['COUNT'] = (df1.set_index('Col1').values > df2.set_index('Col1').values).sum(axis=1)
or:
df1['COUNT'] = df1.gt(df2['ColB'].values[:, None]).sum(axis=1)
# if "Col1" is not the index (.values avoids index misalignment when assigning back)
df1['COUNT'] = df1.set_index('Col1').gt(df2['ColB'].values[:, None]).sum(axis=1).values
output:
      Col2  Col3  Col4  Col5  COUNT
Col1
ID1      2     3     5     0      2
ID2      7     6    11     5      1
ID3      9    16    20    12      3
Try this:
df1 = df1.set_index('Col1')
df1.assign(COUNT=df1.gt(df2.set_index('Col1').squeeze(), axis=0).sum(axis=1))
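A quick end-to-end check with the sample data (a sketch; 'Col1' starts out as a regular column, as in the question):
import pandas as pd

df1 = pd.DataFrame({'Col1': ['ID1', 'ID2', 'ID3'],
                    'Col2': [2, 7, 9], 'Col3': [3, 6, 16],
                    'Col4': [5, 11, 20], 'Col5': [0, 5, 12]})
df2 = pd.DataFrame({'Col1': ['ID1', 'ID2', 'ID3'], 'ColB': [2, 7, 9]})

# Compare each row against its ColB value, aligned on Col1
counts = (df1.set_index('Col1')
             .gt(df2.set_index('Col1')['ColB'], axis=0)
             .sum(axis=1))
df1['COUNT'] = counts.values  # .values sidesteps index misalignment
print(df1)
#   Col1  Col2  Col3  Col4  Col5  COUNT
# 0  ID1     2     3     5     0      2
# 1  ID2     7     6    11     5      1
# 2  ID3     9    16    20    12      3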

Convert a one-column pandas dataframe to 3 columns based on index

I have:
   col1
0     1
1     2
2     3
3     4
4     5
5     6
...
I want every 3 rows of the original dataframe to become a single row in the new dataframe:
   col1  col2  col3
0     1     2     3
1     4     5     6
...
Any suggestions?
The values of the dataframe form an array that can be reshaped with numpy's reshape method. Then create a new dataframe from the reshaped values. Assuming your existing dataframe is df:
df_2 = pd.DataFrame(df.values.reshape(2, 3), columns=['col1', 'col2', 'col3'])
This will create the new dataframe with two rows and 3 columns.
   col1  col2  col3
0     1     2     3
1     4     5     6
You can use set_index and unstack to get the right shape, and add_prefix to change the column names:
print (df.set_index([df.index//3, df.index%3+1])['col1'].unstack().add_prefix('col'))
   col1  col2  col3
0     1     2     3
1     4     5     6
In case the original index is not consecutive but you still want to reshape every 3 rows, replace df.index with np.arange(len(df)) in both places inside the set_index, as sketched below.
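A sketch of that variant, using positions instead of the index:
import numpy as np

# Positional counter, so gaps in the original index don't matter
pos = np.arange(len(df))
out = df.set_index([pos // 3, pos % 3 + 1])['col1'].unstack().add_prefix('col')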
You can convert the column to a numpy array and then reshape it:
In [27]: np.array(df['col1']).reshape(len(df) // 3, 3)
Out[27]:
array([[1, 2, 3],
       [4, 5, 6]])
In [..]: reshaped_cols = np.array(df['col1']).reshape(len(df) // 3, 3)
In [..]: pd.DataFrame(data=reshaped_cols, columns=['col1', 'col2', 'col3'])
Out[30]:
   col1  col2  col3
0     1     2     3
1     4     5     6
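Both reshape-based approaches assume the row count is an exact multiple of 3; a small guard (an addition, not part of the original answers) turns the otherwise cryptic reshape error into a clear message:
if len(df) % 3 != 0:
    raise ValueError(f"expected a multiple of 3 rows, got {len(df)}")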

Updating pandas dataframe values assigns NaN

I have a dataframe with 3 columns: Col1, Col2 and Col3.
Toy example:
d = {'Col1': ['hello','k','hello','we','r'],
     'Col2': [10,20,30,40,50],
     'Col3': [1,2,3,4,5]}
df = pd.DataFrame(d)
Which gives:
    Col1  Col2  Col3
0  hello    10     1
1      k    20     2
2  hello    30     3
3     we    40     4
4      r    50     5
I am selecting the values of Col2 where the value in Col1 is 'hello':
my_values = df.loc[df['Col1']=='hello']['Col2']
This returns a Series where I can see the values of Col2 as well as the index:
0    10
2    30
Name: Col2, dtype: int64
Now suppose I want to assign these values to Col3, replacing only those rows (index 0 and 2) and keeping the other values in Col3 unmodified.
I tried:
df['Col3'] = my_values
But this assigns NaN to the other values (the ones where Col1 is not 'hello'):
    Col1  Col2  Col3
0  hello    10  10.0
1      k    20   NaN
2  hello    30  30.0
3     we    40   NaN
4      r    50   NaN
How can I update certain values in Col3 leaving the others untouched?
    Col1  Col2  Col3
0  hello    10    10
1      k    20     2
2  hello    30    30
3     we    40     4
4      r    50     5
So, in short: having my_values, I want to put them into Col3.
Or just use np.where:
df['Col3'] = np.where(df['Col1'] == 'hello', df.Col2, df.Col3)
Or, based on your my_values:
df.loc[my_values.index, 'Col3'] = my_values
Or you can simply use update:
df['Col3'].update(my_values)
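For completeness, the same update can be done in one step with .loc and a boolean mask, without building my_values first (a variant not shown in the answers above):
# Overwrite Col3 only where Col1 equals 'hello'
mask = df['Col1'] == 'hello'
df.loc[mask, 'Col3'] = df.loc[mask, 'Col2']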

Append Pandas Series to DataFrame as a column [duplicate]

This question already has answers here:
vlookup in Pandas using join
(3 answers)
Closed 6 years ago.
I have a pandas dataframe (df) with columns ['key','col1','col2','col3'] and a pandas series (sr) whose index is the same as 'key' in the dataframe. I want to append the series to the dataframe as a new column called col4, matched on 'key'. I have the following code:
for index, row in segmention.iterrows():
    df[df['key'] == row['key']]['col4'] = sr.loc[row['key']]
The code is very slow. I assume there is a more efficient way to do this. Could you please help?
You can simply do:
df['col4'] = sr
That is, if I haven't misunderstood your question.
Use map, as mentioned by EdChum:
df['col4'] = df['key'].map(sr)
print (df)
   col1  col2  col3 key  col4
0     4     7     1   A     2
1     5     8     3   B     4
2     6     9     5   C     1
Or assign with set_index:
df = df.set_index('key')
df['col4'] = sr
print (df)
     col1  col2  col3  col4
key
A       4     7     1     2
B       5     8     3     4
C       6     9     5     1
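If key should become a regular column again afterwards, reset_index restores it (a sketch):
df = df.reset_index()  # move 'key' back from the index into a column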
If you don't need to align the data in the Series by key, use the raw values (note the difference: col4 is 4,1,2 here instead of 2,4,1):
df['col4'] = sr.values
print (df)
   col1  col2  col3 key  col4
0     4     7     1   A     4
1     5     8     3   B     1
2     6     9     5   C     2
Sample:
df = pd.DataFrame({'col1': [4,5,6],
                   'col2': [7,8,9],
                   'col3': [1,3,5],
                   'key': list('ABC')})
print (df)
   col1  col2  col3 key
0     4     7     1   A
1     5     8     3   B
2     6     9     5   C
sr = pd.Series([4,1,2], index=list('BCA'))
print (sr)
B    4
C    1
A    2
dtype: int64
df['col4'] = df['key'].map(sr)
print (df)
   col1  col2  col3 key  col4
0     4     7     1   A     2
1     5     8     3   B     4
2     6     9     5   C     1
df = df.set_index('key')
df['col4'] = sr
print (df)
     col1  col2  col3  col4
key
A       4     7     1     2
B       5     8     3     4
C       6     9     5     1
This is really a good use case for join, where the left dataframe aligns a column with the index of the right dataframe/series. You have to make sure your Series has a name for this to work:
sr.name = 'some name'
df.join(sr, on='key')
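Since the target column is col4, renaming the Series inline keeps this to one step (a sketch):
# Series.rename with a scalar sets the name, which becomes the column label
df = df.join(sr.rename('col4'), on='key')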

Select some columns based on WHERE in a dataframe

So, I am working with Blaze and wanted to perform this query on a dataframe:
SELECT col1,col2 FROM table WHERE col1 > 0
For SELECT *, this works: d[d.col1 > 0]. But I want only col1 and col2 rather than all columns. How should I go about it?
Thanks in advance!
Edit: Here I create d as: d = Data('postgresql://uri')
This also works: d[d.col1 > 0][['col1','col2']]
I think you can first subset the columns and then use boolean indexing:
print (d)
   col1  col2  col3
0    -1     4     7
1     2     5     8
2     3     6     9
d = d[['col1','col2']]
print (d)
   col1  col2
0    -1     4
1     2     5
2     3     6
print (d[d.col1 > 0])
   col1  col2
1     2     5
2     3     6
This is the same as:
print (d[['col1','col2']][d.col1 > 0])
   col1  col2
1     2     5
2     3     6
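With a plain pandas DataFrame, .loc selects the rows and the columns in one step (this may not carry over to a Blaze Data object):
print (d.loc[d.col1 > 0, ['col1', 'col2']])
   col1  col2
1     2     5
2     3     6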