Convert multiple columns in pandas dataframe to array of arrays

I have the following dataframe:
col1 col2 col3
1 1 2 3
2 4 5 6
3 7 8 9
4 10 11 12
I want to create a new column holding an array of arrays: each cell contains a single inner array built from specific columns, cast to float.
So given column names, say "col2" and "col3", the output dataframe would look like this.
col1 col2 col3 new
1 1 2 3 [[2,3]]
2 4 5 6 [[5,6]]
3 7 8 9 [[8,9]]
4 10 11 12 [[11,12]]
What I have so far works, but seems clumsy and I believe there's a better way. I'm fairly new to pandas and numpy.
selected_columns = ["col2", "col3"]
df[selected_columns] = df[selected_columns].astype(float)
df['new'] = df.apply(lambda r: tuple(r[selected_columns]), axis=1)
.apply(np.array)
.apply(lambda r: tuple(r[["new"]]), axis=1)
.apply(np.array)
Appreciate your help, Thanks!

Using agg:
cols = ['col2', 'col3']
df['new'] = df[cols].agg(list, axis=1)
Using numpy:
df['new'] = df[cols].to_numpy().tolist()
Output:
col1 col2 col3 new
1 1 2 3 [2, 3]
2 4 5 6 [5, 6]
3 7 8 9 [8, 9]
4 10 11 12 [11, 12]
For 2D lists:
cols = ['col2', 'col3']
df['new'] = df[cols].agg(lambda x: [list(x)], axis=1)
# or
df['new'] = df[cols].to_numpy()[:,None].tolist()
Output:
col1 col2 col3 new
1 1 2 3 [[2, 3]]
2 4 5 6 [[5, 6]]
3 7 8 9 [[8, 9]]
4 10 11 12 [[11, 12]]
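Neither snippet above casts to float, which the question asked for. Here is a minimal sketch of the numpy variant with the cast added, assuming the sample data from the question (rebuilt by hand here) and that nested Python lists, rather than NumPy arrays, are acceptable in each cell:

```python
import pandas as pd

# Sample frame rebuilt from the question (index 1..4).
df = pd.DataFrame(
    {"col1": [1, 4, 7, 10], "col2": [2, 5, 8, 11], "col3": [3, 6, 9, 12]},
    index=[1, 2, 3, 4],
)
cols = ["col2", "col3"]

# Cast the selected columns to float while converting to a NumPy array.
values = df[cols].to_numpy(dtype=float)  # shape (4, 2)

# Insert a length-1 axis so every row becomes a (1, 2) block, then turn
# each block into a nested list such as [[2.0, 3.0]].
df["new"] = values[:, None, :].tolist()

print(df)
```

The cast happens on the extracted array only, so col2 and col3 keep their original integer dtype in df.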

Related

convert from one column pandas dataframe to 3 columns based on index

I have:
col1
0 1
1 2
2 3
3 4
4 5
5 6
...
I want, every 3 rows of the original dataframe to become a single row in the new dataframe:
col1 col2 col3
0 1 2 3
1 4 5 6
...
Any suggestions?
The values of the dataframe are a NumPy array that can be reshaped using numpy's reshape method. Then, create a new dataframe from the reshaped values. Assuming your existing dataframe is df:
df_2 = pd.DataFrame(df.values.reshape(-1, 3), columns=['col1', 'col2', 'col3'])
Passing -1 lets NumPy infer the number of rows, so this works whenever the number of rows in df is a multiple of 3. For the six rows shown in the question it creates a new dataframe of two rows and 3 columns:
col1 col2 col3
0 1 2 3
1 4 5 6
You can use set_index and unstack to get the right shape, and add_prefix to change the column names:
print (df.set_index([df.index//3, df.index%3+1])['col1'].unstack().add_prefix('col'))
col1 col2 col3
0 1 2 3
1 4 5 6
If the original index is not made of consecutive values but you still want to reshape every 3 rows, replace df.index with np.arange(len(df)) in both places inside set_index.
You can convert the column to a NumPy array and then reshape it.
In [27]: np.array(df['col1']).reshape( len(df) // 3 , 3 )
Out[27]:
array([[1, 2, 3],
[4, 5, 6]])
In [..] :reshaped_cols = np.array(df['col1']).reshape( len(df) // 3 , 3 )
pd.DataFrame( data = reshaped_cols , columns = ['col1' , 'col2' , 'col3' ] )
Out[30]:
col1 col2 col3
0 1 2 3
1 4 5 6
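The reshape-based answers require the number of rows to be a multiple of 3. Here is a rough sketch of one way to handle leftover rows by padding with NaN. The helper name reshape_column and its parameters are made up for illustration, and the cast to float is an assumption needed so that NaN can be stored:

```python
import numpy as np
import pandas as pd


def reshape_column(df, col="col1", width=3):
    """Lay one column out row-wise into `width` columns (hypothetical helper)."""
    values = df[col].to_numpy(dtype=float)
    # Number of NaN cells needed to reach the next multiple of `width`.
    pad = (-len(values)) % width
    values = np.concatenate([values, np.full(pad, np.nan)])
    names = [f"col{i + 1}" for i in range(width)]
    return pd.DataFrame(values.reshape(-1, width), columns=names)


# Seven rows: the last output row is only partially filled.
df = pd.DataFrame({"col1": [1, 2, 3, 4, 5, 6, 7]})
wide = reshape_column(df)
print(wide)
```

If the length is already a multiple of width, no padding is added and the result matches the plain reshape, apart from the float dtype.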

adding multiple lists into one column DataFrame pandas

l = {'col1': [[1,2,3], [4,5,6]]}
df = pd.DataFrame(data = l)
col1
0 [1, 2, 3]
1 [4, 5, 6]
Desired output:
col1
0 1
1 2
2 3
3 4
4 5
5 6
You can use explode:
df.explode('col1')
col1
0 1
0 2
0 3
1 4
1 5
1 6
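Note that explode repeats the original index (0, 0, 0, 1, 1, 1), while the desired output is numbered 0 to 5. A small sketch of getting there by resetting the index afterwards, using the same data as the question. The cast back to int is optional and assumes every list element is an integer:

```python
import pandas as pd

df = pd.DataFrame({"col1": [[1, 2, 3], [4, 5, 6]]})

# explode keeps the source row's index label for every element,
# so drop it and renumber from 0.
flat = df.explode("col1").reset_index(drop=True)

# explode leaves the column as object dtype; convert it to integers.
flat["col1"] = flat["col1"].astype(int)

print(flat)
```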
You can use np.ravel to flatten the list of lists:
import numpy as np, pandas as pd
l = {'col1': [[1,2,3], [4,5,6]]}
df = pd.DataFrame(np.ravel(*l.values()),columns=l.keys())
>>> df
col1
0 1
1 2
2 3
3 4
4 5
5 6

Dropping Rows with a does not equal condition

I'm attempting to create a new dataframe that drops a certain segment of records from an existing dataframe.
df2=df[df['AgeSeg']!='0-1']
When I look at df2, the records with the '0-1' age segment are still there.
[Image: output with the 0-1 records still in it]
I would expect the new dataframe to not have them. What am I doing wrong?
You can use isin (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isin.html)
Simple example:
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 3, 2, 9], 'col2': [4, 5, 6, 3, 0]})
df = df[~df['col1'].isin([2])]
df before:
col1 col2
0 1 4
1 2 5
2 3 6
3 2 3
4 9 0
df after:
col1 col2
0 1 4
2 3 6
4 9 0
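The filter in the question is valid as written, so the stored values probably do not match '0-1' exactly. One possible cause (purely an assumption, since the real data is not shown) is stray whitespace around the strings. A sketch with made-up data that reproduces the symptom and strips the values before comparing:

```python
import pandas as pd

# Made-up data: two of the '0-1' entries carry leading or trailing spaces.
df = pd.DataFrame(
    {
        "AgeSeg": ["0-1", "0-1 ", "2-5", " 0-1", "6-10"],
        "n": [1, 2, 3, 4, 5],
    }
)

# The exact comparison only drops the first row; the padded ones survive.
naive = df[df["AgeSeg"] != "0-1"]

# Stripping whitespace first drops every variant of '0-1'.
cleaned = df[df["AgeSeg"].str.strip() != "0-1"]

print(naive)
print(cleaned)
```

Printing df['AgeSeg'].unique() on the real dataframe is a quick way to check whether this is actually the problem.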

Combine multiple columns into two columns: "column name" and "value"

There is probably an easy way of doing this, so I hope someone has a nice solution (currently I am doing it with ugly for loops).
My data looks like:
In [1]: df = pd.DataFrame({'Ref': [5, 6, 7],
'Col1': [10,11,12],
'Col2': [20,21,22],
'Col3': [30,31,32]})
In [2]: df
Out[2]:
Col1 Col2 Col3 Ref
0 10 20 30 5
1 11 21 31 6
2 12 22 32 7
And I am trying to flatten the table (for 2D histograms) to use a single column for the column id and one column for the actual values while keeping the corresponding Ref, like this:
Ref Col Value
0 5 1 10
1 5 2 20
2 5 3 30
3 6 1 11
4 6 2 21
5 6 3 31
6 7 1 12
7 7 2 22
8 7 3 32
I remember there was some kind of a join/group operation to do the reverse operation, but I cannot recall it anymore...
Maybe not the most elegant solution, but it works on your data. Using a combination of pivot_table and stack.
import pandas as pd
df = pd.DataFrame({'Ref': [5, 6, 7],
'Col1': [10,11,12],
'Col2': [20,21,22],
'Col3': [30,31,32]})
# In [23]: df
# Out[23]:
# Col1 Col2 Col3 Ref
# 0 10 20 30 5
# 1 11 21 31 6
# 2 12 22 32 7
piv = df.pivot_table(index=['Ref']).stack()
df2 = pd.DataFrame(piv)
df2.reset_index(inplace=True)
df2.columns = ['Ref','Col','Value']
# In [19]: df2
# Out[19]:
# Ref Col Value
# 0 5 Col1 10
# 1 5 Col2 20
# 2 5 Col3 30
# 3 6 Col1 11
# 4 6 Col2 21
# 5 6 Col3 31
# 6 7 Col1 12
# 7 7 Col2 22
# 8 7 Col3 32
If you want 'Col' to be just the last digit of the column name, you could do something like this:
df2.Col = df2.Col.apply(lambda x: x[-1:])
# In [21]: df2
# Out[21]:
# Ref Col Value
# 0 5 1 10
# 1 5 2 20
# 2 5 3 30
# 3 6 1 11
# 4 6 2 21
# 5 6 3 31
# 6 7 1 12
# 7 7 2 22
# 8 7 3 32
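As an alternative to pivot_table plus stack (a different technique from the answer above), the same reshaping can be sketched with melt, which is the usual counterpart of pivot. The variable name long_df and the final conversion of 'Col' to an integer are choices made for this example:

```python
import pandas as pd

df = pd.DataFrame({'Ref': [5, 6, 7],
                   'Col1': [10, 11, 12],
                   'Col2': [20, 21, 22],
                   'Col3': [30, 31, 32]})

# Keep 'Ref' as the identifier; every other column becomes a (Col, Value) pair.
long_df = df.melt(id_vars='Ref', var_name='Col', value_name='Value')

# melt stacks column by column, so sort to group the rows by Ref.
long_df = long_df.sort_values(['Ref', 'Col']).reset_index(drop=True)

# Keep only the trailing digit of the column name, as in the desired output.
long_df['Col'] = long_df['Col'].str[-1].astype(int)

print(long_df)
```

Unlike pivot_table, melt does not aggregate, so repeated Ref values are kept as separate rows.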

In pandas, how to set_index with using column index instead of referring to column names?

For example:
We have a Pandas dataFrame foo with 2 columns ['A', 'B'].
I want to call something like
foo.set_index([0,1])
instead of
foo.set_index(['A', 'B'])
I have also tried foo.set_index([[0,1]]), but it failed with this error:
Length mismatch: Expected axis has 9 elements, new values have 2 elements
If the column index is unique you could use:
df.set_index(list(df.columns[cols]))
where cols is a list of ordinal indices.
For example,
In [77]: np.random.seed(2016)
In [79]: df = pd.DataFrame(np.random.randint(10, size=(5,4)), columns=list('ABCD'))
In [80]: df
Out[80]:
A B C D
0 3 7 2 3
1 8 4 8 7
2 9 2 6 3
3 4 1 9 1
4 2 2 8 9
In [81]: df.set_index(list(df.columns[[0,2]]))
Out[81]:
B D
A C
3 2 7 3
8 8 4 7
9 6 2 3
4 9 1 1
2 8 2 9
If the DataFrame's column index is not unique, then setting the index by label
is impossible and by ordinals more complicated:
import numpy as np
import pandas as pd
np.random.seed(2016)
def set_ordinal_index(df, cols):
    columns, df.columns = df.columns, np.arange(len(df.columns))
    mask = df.columns.isin(cols)
    df = df.set_index(cols)
    df.columns = columns[~mask]
    df.index.names = columns[mask]
    return df
df = pd.DataFrame(np.random.randint(10, size=(5,4)), columns=list('AAAA'))
print(set_ordinal_index(df, [0,2]))
yields
A A
A A
3 2 7 3
8 8 4 7
9 6 2 3
4 9 1 1
2 8 2 9
This worked for me, the other answer didn't.
# single column
df.set_index(df.columns[1])
# multi column
df.set_index(df.columns[[1, 0]].tolist())