Adding multiple lists into one column of a pandas DataFrame

import pandas as pd

l = {'col1': [[1,2,3], [4,5,6]]}
df = pd.DataFrame(data=l)
col1
0 [1, 2, 3]
1 [4, 5, 6]
Desired output:
col1
0 1
1 2
2 3
3 4
4 5
5 6

Here explode does the job:
df.explode('col1')
col1
0 1
0 2
0 3
1 4
1 5
1 6
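Note that explode repeats the original index. To get the renumbered 0..5 index from the desired output directly, pass ignore_index=True (available since pandas 1.1):

```python
import pandas as pd

df = pd.DataFrame({'col1': [[1, 2, 3], [4, 5, 6]]})

# ignore_index=True renumbers the exploded result 0..n-1,
# matching the desired output exactly
out = df.explode('col1', ignore_index=True)
print(out)
```

The exploded values keep object dtype; chain .astype(int) if you need a numeric column.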

You can use np.ravel to flatten the list of lists:
import numpy as np, pandas as pd
l = {'col1': [[1,2,3], [4,5,6]]}
df = pd.DataFrame(np.ravel(*l.values()), columns=l.keys())
>>> df
col1
0 1
1 2
2 3
3 4
4 5
5 6
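One caveat: np.ravel only flattens cleanly when the sublists have equal length (so they form a proper 2-D array). For ragged sublists, a sketch using itertools.chain works regardless of length:

```python
from itertools import chain

import pandas as pd

# ragged sublists -- np.ravel would leave these as an object array
l = {'col1': [[1, 2], [3, 4, 5]]}
df = pd.DataFrame(list(chain.from_iterable(l['col1'])), columns=['col1'])
print(df)
```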

Convert multiple columns in pandas dataframe to array of arrays

I have the following dataframe:
col1 col2 col3
1 1 2 3
2 4 5 6
3 7 8 9
4 10 11 12
I want to create a new column containing an array of arrays: a single inner array built from specific columns, cast to float.
So given the column names, say "col2" and "col3", the output dataframe would look like this.
col1 col2 col3 new
1 1 2 3 [[2,3]]
2 4 5 6 [[5,6]]
3 7 8 9 [[8,9]]
4 10 11 12 [[11,12]]
What I have so far works, but seems clumsy and I believe there's a better way. I'm fairly new to pandas and numpy.
selected_columns = ["col2", "col3"]
df[selected_columns] = df[selected_columns].astype(float)
df['new'] = df.apply(lambda r: tuple(r[selected_columns]), axis=1).apply(np.array)
df['new'] = df.apply(lambda r: tuple(r[["new"]]), axis=1).apply(np.array)
Appreciate your help, Thanks!
Using agg:
cols = ['col2', 'col3']
df['new'] = df[cols].agg(list, axis=1)
Using numpy:
df['new'] = df[cols].to_numpy().tolist()
Output:
col1 col2 col3 new
1 1 2 3 [2, 3]
2 4 5 6 [5, 6]
3 7 8 9 [8, 9]
4 10 11 12 [11, 12]
2D lists
cols = ['col2', 'col3']
df['new'] = df[cols].agg(lambda x: [list(x)], axis=1)
# or
df['new'] = df[cols].to_numpy()[:,None].tolist()
Output:
col1 col2 col3 new
1 1 2 3 [[2, 3]]
2 4 5 6 [[5, 6]]
3 7 8 9 [[8, 9]]
4 10 11 12 [[11, 12]]
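The question also asked for the values cast to float, which the snippets above skip. A sketch combining the cast with the 2D-list approach (the sample data is rebuilt from the table in the question):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 4, 7, 10],
                   'col2': [2, 5, 8, 11],
                   'col3': [3, 6, 9, 12]})

cols = ['col2', 'col3']
# cast to float first, then wrap each row in an outer list for the 2D shape
df['new'] = df[cols].astype(float).to_numpy()[:, None].tolist()
print(df['new'].tolist())
```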

different substring for each row based on condition

How does one add a different substring to each row based on a condition in pandas?
Here is a dummy dataframe that I created:
import numpy as np
import pandas as pd
np.random.seed(1)
df = pd.DataFrame(np.random.randint(0,5,size=(5, 2)))
df.columns = ['A','B']
If I wanted to replace the values in B with the string YYYY for those rows where the value in A is less than 2, I would do it this way:
df.loc[df['A'] < 2, 'B'] = 'YYYY'
This is the current output of original df:
A B
0 3 4
1 0 1
2 3 0
3 0 1
4 4 4
Of replaced df:
A B
0 3 4
1 0 YYYY
2 3 0
3 0 YYYY
4 4 4
What I instead want is:
A B
0 3 4
1 0 1_1
2 3 0
3 0 1_2
4 4 4
Here you need to generate a list the same size as the number of True values in the mask (using range with m.sum()), convert it to strings, and join it on:
m = df['A'] < 2
df.loc[m, 'B'] = df.loc[m, 'B'].astype(str) + '_' + list(map(str, range(1, m.sum() + 1)))
print (df)
A B
0 3 4
1 0 1_1
2 3 0
3 0 1_2
4 4 4
Or you can use an f-string comprehension to build the new values:
m = df['A'] < 2
df.loc[m, 'B'] = [f'{b}_{a}' for a, b in zip(range(1, m.sum() + 1), df.loc[m, 'B'])]
EDIT: to give repeated values in B their own running counter (note the wider mask A < 4 here), use groupby with cumcount:
m = df['A'] < 4
df.loc[m, 'B'] = df.loc[m, 'B'].astype(str) + '_' + df[m].groupby('B').cumcount().add(1).astype(str)
print (df)
A B
0 3 4_1
1 0 1_1
2 3 0_1
3 0 1_2
4 4 4
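A self-contained version of the counter approach, using the seeded frame from the question. Casting B to string first keeps the column a single dtype when mixed values are written back (newer pandas warns about assigning strings into an int column):

```python
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame(np.random.randint(0, 5, size=(5, 2)), columns=['A', 'B'])
df['B'] = df['B'].astype(str)  # keep the column string-typed for the mixed result

m = df['A'] < 2
# per-value counter: rows sharing the same B get _1, _2, ...
df.loc[m, 'B'] = df.loc[m, 'B'] + '_' + df[m].groupby('B').cumcount().add(1).astype(str)
print(df)
```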

How to select all the rows that are between a range of values in specific column in pandas dataframe

I have the below sample df, and I'd like to select all the rows that are between a range of values in a specific column:
0 1 2 3 4 5 index
0 -252.44 -393.07 886.72 -2.04 1.58 -2.41 0
1 -260.25 -415.53 881.35 -3.07 0.08 -1.66 1
2 -267.58 -412.60 893.07 -2.98 -1.15 -2.66 2
3 -279.30 -417.97 880.86 -1.15 -0.50 -1.37 3
4 -252.93 -395.51 883.30 -1.30 1.43 4.17 4
I'd like to get the below df (all the rows between index value of 1-3):
0 1 2 3 4 5 index
1 -260.25 -415.53 881.35 -3.07 0.08 -1.66 1
2 -267.58 -412.60 893.07 -2.98 -1.15 -2.66 2
3 -279.30 -417.97 880.86 -1.15 -0.50 -1.37 3
How can I do it?
I tried the below which didn't work:
new_df = df[df['index'] >= 1 & df['index'] <= 3]
To keep the rows between a min and max, use between():
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1,2,3], 'b':[11,12,13]})
>>> df
a b
0 1 11
1 2 12
2 3 13
>>> df[df.a.between(1,2)]
a b
0 1 11
1 2 12
Your attempt new_df = df[df['index'] >= 1 & df['index'] <= 3] fails because & binds more tightly than the comparison operators, so Python evaluates 1 & df['index'] first.
When combining multiple filters, wrap each condition in parentheses: df[(df['index'] >= 1) & (df['index'] <= 3)]
(If you meant the DataFrame's actual index rather than a column named 'index', use df.index instead.)
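Putting that together on a frame shaped like the question's, where 'index' is a regular column (the 'val' column here is a stand-in for the numeric columns):

```python
import pandas as pd

# 'index' is an ordinary column, as in the question's frame
df = pd.DataFrame({'val': [10, 20, 30, 40, 50], 'index': [0, 1, 2, 3, 4]})

# between() is inclusive on both ends by default
new_df = df[df['index'].between(1, 3)]
print(new_df)
```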

Dropping Rows with a does not equal condition

I'm attempting to create a new dataframe that drops a certain segment of records from an existing dataframe.
df2=df[df['AgeSeg']!='0-1']
when I look at df2, the records with '0-1' Age Segment are still there.
I would expect the new dataframe to not have them. What am I doing wrong?
You can use isin (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isin.html)
Simple example:
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 3, 2, 9], 'col2': [4, 5, 6, 3, 0]})
df = df[df['col1'].isin([2]) != True]
df before:
col1 col2
0 1 4
1 2 5
2 3 6
3 2 3
4 9 0
df after:
col1 col2
0 1 4
2 3 6
4 9 0
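The != filter itself is correct pandas, so if the '0-1' rows survive, a likely culprit is stray whitespace in the values (an assumption here, since the actual data isn't shown). Normalizing first and negating the mask with ~, the idiomatic complement of isin, is one robust pattern:

```python
import pandas as pd

# hypothetical data with a stray leading space in one value
df = pd.DataFrame({'AgeSeg': [' 0-1', '0-1', '2-5'], 'n': [1, 2, 3]})

# strip whitespace before comparing, then negate the mask with ~
df2 = df[~df['AgeSeg'].str.strip().eq('0-1')]
print(df2)
```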

pandas convert lists in multiple columns within DataFrame to separate columns

I am trying to convert a list within multiple columns of a pandas DataFrame into separate columns.
Say, I have a dataframe like this:
0 1
0 [1, 2, 3] [4, 5, 6]
1 [1, 2, 3] [4, 5, 6]
2 [1, 2, 3] [4, 5, 6]
And would like to convert it to something like this:
0 1 2 0 1 2
0 1 2 3 4 5 6
1 1 2 3 4 5 6
2 1 2 3 4 5 6
I have managed to do this in a loop. However, I would like to do this in fewer lines.
My code snippet so far is as follows:
import pandas as pd
df = pd.DataFrame([[[1,2,3],[4,5,6]],[[1,2,3],[4,5,6]],[[1,2,3],[4,5,6]]])
output1 = df[0].apply(pd.Series)
output2 = df[1].apply(pd.Series)
output = pd.concat([output1, output2], axis=1)
If you don't care about the column names you could do:
>>> import numpy as np
>>> df.apply(np.hstack, axis=1).apply(pd.Series)
0 1 2 3 4 5
0 1 2 3 4 5 6
1 1 2 3 4 5 6
2 1 2 3 4 5 6
Using sum, which concatenates the lists along each row:
pd.DataFrame(df.sum(axis=1).tolist())
0 1 2 3 4 5
0 1 2 3 4 5 6
1 1 2 3 4 5 6
2 1 2 3 4 5 6
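A variant of the concat approach from the question that avoids np.hstack/sum and keeps distinct column names (the 0_/1_ prefixes are my own naming choice, not from the original):

```python
import pandas as pd

df = pd.DataFrame([[[1, 2, 3], [4, 5, 6]]] * 3)

# expand each list column once, prefixing with the source column name
out = pd.concat(
    [pd.DataFrame(df[c].tolist(), index=df.index).add_prefix(f'{c}_')
     for c in df.columns],
    axis=1,
)
print(out)
```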