Consider the code below:
import pandas as pd
d = {'col1': [1, 2, 3 ,4 ,5, 5, 6, 5], 'col2': [3, 4, 3 ,4 , 5, 6 , 6, 5], 'col3': [5, 6, 3 ,4 , 5, 6 ,6, 5], 'col4': [7, 8, 3 , 4 , 5, 4 , 6, 4], }
df = pd.DataFrame(data=d)
df=df.T
This code gives me the following output:
# 0 1 2 3 4 5 6 7
# col1 1 2 3 4 5 5 6 5
# col2 3 4 3 4 5 6 6 5
# col3 5 6 3 4 5 6 6 5
# col4 7 8 3 4 5 4 6 4
I would like to reshape the dataframe in such a way that the columns are rearranged as shown below:
# 0 1
# col1 1 2
# col2 3 4
# col3 5 6
# col4 7 8
# col1 3 4
# col2 3 4
# col3 3 4
# col4 3 4
# col1 5 5
# col2 5 6
# col3 5 6
# col4 5 4
# col1 6 5
# col2 6 5
# col3 6 5
# col4 6 4
The code should allow some room for modification so that one can choose two columns as in the above example or three columns or four columns and so on. Any ideas how to implement this?
Try this:
import pandas as pd
d = {'col1': [1, 2, 3 ,4 ,5, 5, 6, 5], 'col2': [3, 4, 3 ,4 , 5, 6 , 6, 5], 'col3': [5, 6, 3 ,4 , 5, 6 ,6, 5], 'col4': [7, 8, 3 , 4 , 5, 4 , 6, 4], }
df = pd.DataFrame(data=d)
df = df.T
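number = 2  # number of columns per block; change this to 3, 4, ...
df1 = df.iloc[:, :number]
# start at `number`: the first block is already in df1
for x in range(number, len(df.columns), number):
    # relabel each block's columns 0..number-1 so the blocks line up when stacked
    df1 = pd.concat([df1, df.iloc[:, x:x + number].T.reset_index(drop=True).T])
print(df1)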
A much faster way is to use numpy, provided the number of columns is a multiple of the block width (2 here).
A plain np.reshape(df.to_numpy(), (-1, 2)) walks each row to its end before moving to the next, so the pairs come out grouped by row rather than by column block. To get the requested order, split the columns into blocks of 2, move the block axis to the front, then flatten:
import numpy as np
data = df.to_numpy().reshape(len(df), -1, 2).swapaxes(0, 1).reshape(-1, 2)
data
array([[1, 2],
       [3, 4],
       [5, 6],
       [7, 8],
       [3, 4],
       [3, 4],
       [3, 4],
       [3, 4],
       [5, 5],
       [5, 6],
       [5, 6],
       [5, 4],
       [6, 5],
       [6, 5],
       [6, 5],
       [6, 4]])
The current index has length 4; after reshaping, the new index needs 4 * (number of columns // 2) = 16 entries, so the labels are tiled:
index = np.tile(df.index, df.columns.size//2)
index
array(['col1', 'col2', 'col3', 'col4', 'col1', 'col2', 'col3', 'col4',
       'col1', 'col2', 'col3', 'col4', 'col1', 'col2', 'col3', 'col4'],
      dtype=object)
All that is left is to create a new dataframe:
pd.DataFrame(data, index = index)
0 1
col1 1 2
col2 3 4
col3 5 6
col4 7 8
col1 3 4
col2 3 4
col3 3 4
col4 3 4
col1 5 5
col2 5 6
col3 5 6
col4 5 4
col1 6 5
col2 6 5
col3 6 5
col4 6 4
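Since the question asks for a version where the block width can be changed, here is a minimal sketch that wraps the numpy idea in a function. The name stack_column_blocks and the width parameter are my own, not part of pandas or numpy, and it assumes the number of columns is an exact multiple of width.

```python
import numpy as np
import pandas as pd

def stack_column_blocks(df, width):
    """Stack consecutive blocks of `width` columns on top of each other.

    Hypothetical helper: assumes df.shape[1] is a multiple of `width`.
    """
    n_rows, n_cols = df.shape
    if n_cols % width:
        raise ValueError("number of columns must be a multiple of width")
    n_blocks = n_cols // width
    # (rows, blocks, width) -> (blocks, rows, width) -> (blocks * rows, width)
    blocks = df.to_numpy().reshape(n_rows, n_blocks, width)
    data = blocks.swapaxes(0, 1).reshape(-1, width)
    # repeat the whole row-label sequence once per block
    index = np.tile(df.index, n_blocks)
    return pd.DataFrame(data, index=index)

d = {'col1': [1, 2, 3, 4, 5, 5, 6, 5], 'col2': [3, 4, 3, 4, 5, 6, 6, 5],
     'col3': [5, 6, 3, 4, 5, 6, 6, 5], 'col4': [7, 8, 3, 4, 5, 4, 6, 4]}
df = pd.DataFrame(data=d).T

two_wide = stack_column_blocks(df, 2)   # 16 rows, 2 columns
four_wide = stack_column_blocks(df, 4)  # 8 rows, 4 columns
print(two_wide)
```

Moving the block axis to the front before the final reshape is what keeps col1..col4 together inside each block.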
Another option is to reshape on even and odd column positions with pyjanitor's pivot_longer function, collating the even (0) and odd (1) positions into separate columns:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor
(df.set_axis((df.columns % 2).astype(str), axis=1)
.pivot_longer(ignore_index=False,
names_to = ['0', '1'],
names_pattern=['0', '1'])
)
0 1
col1 1 2
col2 3 4
col3 5 6
col4 7 8
col1 3 4
col2 3 4
col3 3 4
col4 3 4
col1 5 5
col2 5 6
col3 5 6
col4 5 4
col1 6 5
col2 6 5
col3 6 5
col4 6 4
Again, the numpy approach is much faster.
Let's say I have a dataframe called df:
A 10
A 20
15
20
B 10
B 10
The result I want is
A 30
35
B 20
I imagine your blanks are actually NaNs; if so, use dropna=False:
df.groupby('col1', dropna=False).sum()
If they really are empty strings, then it should work with the default.
Example:
df = pd.DataFrame({'col1': ['A', 'A', float('nan'), float('nan'), 'B', 'B'],
'col2': [10, 20, 15, 20, 10, 10]})
df.groupby('col1', dropna=False).sum()
output:
col2
col1
A 30
B 20
NaN 35
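For the empty-string case mentioned above, a small sketch (the sample data mirrors the question, with '' assumed for the blanks). Passing sort=False keeps the groups in order of first appearance, which matches the A, blank, B order the question asks for.

```python
import pandas as pd

# assumption: the blanks are empty strings, not NaN
df = pd.DataFrame({'col1': ['A', 'A', '', '', 'B', 'B'],
                   'col2': [10, 20, 15, 20, 10, 10]})

# empty strings are ordinary group keys, so dropna is not needed;
# sort=False keeps the keys in the order they first appear
out = df.groupby('col1', sort=False).sum()
print(out)
```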
Group consecutive runs of equal values and aggregate each column.
Suppose your dataframe has 2 columns, 'col1' and 'col2':
>>> df
col1 col2
0 A 10 # <- group 1
1 A 20 # <- group 1
2 15 # <- group 2
3 20 # <- group 2
4 B 10 # <- group 3
5 B 10 # <- group 3
grp = df.iloc[:, 0].ne(df.iloc[:, 0].shift()).cumsum()
out = df.groupby(grp, as_index=False).agg({'col1': 'first', 'col2': 'sum'})
Output result:
>>> out
col1 col2
0 A 30
1 35
2 B 20
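One caveat with the consecutive-run approach: it relies on the blanks comparing equal to each other, which holds for empty strings but not for NaN (NaN != NaN, so every NaN row would start its own group). A sketch that handles NaN blanks by comparing on a filled copy of the key; the names key and run are just for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', float('nan'), float('nan'), 'B', 'B'],
                   'col2': [10, 20, 15, 20, 10, 10]})

# compare on a copy where NaN is replaced by '' so adjacent blanks count as equal
key = df['col1'].fillna('')
# a new run starts wherever the key differs from the previous row
run = key.ne(key.shift()).cumsum().rename('run')

out = (df.groupby(run)
         .agg(col1=('col1', 'first'), col2=('col2', 'sum'))
         .reset_index(drop=True))
print(out)
```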
I have:
col1
0 1
1 2
2 3
3 4
4 5
5 6
...
I want every 3 rows of the original dataframe to become a single row in the new dataframe:
col1 col2 col3
0 1 2 3
1 4 5 6
...
Any suggestions?
The values of the dataframe are an array that can be reshaped with numpy's reshape method. Then create a new dataframe from the reshaped values. Assuming your existing dataframe is df:
df_2 = pd.DataFrame(df.values.reshape(-1, 3), columns=['col1', 'col2', 'col3'])
This creates a new dataframe with 3 columns; -1 lets numpy work out the number of rows (2 for the six values shown). The number of rows in df must be a multiple of 3.
col1 col2 col3
0 1 2 3
1 4 5 6
You can use set_index and unstack to get the right shape, and add_prefix to change the column names:
print (df.set_index([df.index//3, df.index%3+1])['col1'].unstack().add_prefix('col'))
col1 col2 col3
0 1 2 3
1 4 5 6
In case the original index is not made of consecutive values but you still want to reshape every 3 rows, replace df.index with np.arange(len(df)) in both places inside set_index.
You can convert the column to a numpy array and then reshape it:
In [27]: np.array(df['col1']).reshape( len(df) // 3 , 3 )
Out[27]:
array([[1, 2, 3],
[4, 5, 6]])
In [..] :reshaped_cols = np.array(df['col1']).reshape( len(df) // 3 , 3 )
pd.DataFrame( data = reshaped_cols , columns = ['col1' , 'col2' , 'col3' ] )
Out[30]:
col1 col2 col3
0 1 2 3
1 4 5 6
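All of the reshape-based answers assume the number of rows is an exact multiple of 3. If that may not hold, one possible sketch is to pad the tail with NaN before reshaping. The helper name rows_to_columns is my own, and the conversion to float is an assumption made so that NaN can be stored.

```python
import numpy as np
import pandas as pd

def rows_to_columns(s, width):
    """Fold every `width` values of a Series into one row (hypothetical helper).

    Values are converted to float so a short last row can be padded with NaN.
    """
    values = s.to_numpy(dtype=float)
    pad = (-len(values)) % width  # how many cells are missing in the last row
    padded = np.concatenate([values, np.full(pad, np.nan)])
    cols = [f"col{i + 1}" for i in range(width)]
    return pd.DataFrame(padded.reshape(-1, width), columns=cols)

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6, 7]})  # 7 rows: not a multiple of 3
wide = rows_to_columns(df['col1'], 3)
print(wide)
```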
l = {'col1': [[1,2,3], [4,5,6]]}
df = pd.DataFrame(data = l)
col1
0 [1, 2, 3]
1 [4, 5, 6]
Desired output:
col1
0 1
1 2
2 3
3 4
4 5
5 6
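Use explode. By default it repeats the original row label for every element, so pass ignore_index=True (pandas 1.1 or later) to get the 0..5 index from the desired output:
df.explode('col1', ignore_index=True)
col1
0 1
1 2
2 3
3 4
4 5
5 6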
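A short sketch of the explode route end to end. One thing worth knowing is that the exploded column keeps object dtype, so a cast is needed if integers are wanted; the int64 target here is an assumption about the data.

```python
import pandas as pd

l = {'col1': [[1, 2, 3], [4, 5, 6]]}
df = pd.DataFrame(data=l)

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 0, 0, 1, 1, 1
flat = df.explode('col1', ignore_index=True)

# the exploded column is still object dtype; cast it if integers are expected
flat = flat.astype({'col1': 'int64'})
print(flat)
```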
You can use np.ravel to flatten the list of lists:
import numpy as np, pandas as pd
l = {'col1': [[1,2,3], [4,5,6]]}
df = pd.DataFrame(np.ravel(*l.values()),columns=l.keys())
>>> df
col1
0 1
1 2
2 3
3 4
4 5
5 6
I'm attempting to create a new dataframe that drops a certain segment of records from an existing dataframe.
df2=df[df['AgeSeg']!='0-1']
When I look at df2, the records with the '0-1' age segment are still there.
[Screenshot: output with '0-1' records still in it]
I would expect the new dataframe to not have them. What am I doing wrong?
You can use isin (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isin.html)
Simple example:
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 3, 2, 9], 'col2': [4, 5, 6, 3, 0]})
df = df[~df['col1'].isin([2])]  # ~ negates the mask: keep rows whose col1 is not 2
df before:
col1 col2
0 1 4
1 2 5
2 3 6
3 2 3
4 9 0
df after:
col1 col2
0 1 4
2 3 6
4 9 0
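The filter in the question is valid as written, so if rows survive it, the stored values are probably not exactly '0-1'. A common cause is stray whitespace, although that is only a guess since the real data is not shown. The sketch below uses made-up data to show the symptom and one way to normalize the column before comparing.

```python
import pandas as pd

# made-up data: two of the '0-1' values carry stray spaces (assumed cause)
df = pd.DataFrame({'AgeSeg': ['0-1', ' 0-1', '0-1 ', '2-5', '6-10'],
                   'n': [1, 2, 3, 4, 5]})

# the exact comparison only removes the clean '0-1' row
naive = df[df['AgeSeg'] != '0-1']

# strip surrounding whitespace first, then compare
cleaned = df['AgeSeg'].str.strip()
df2 = df[cleaned != '0-1']
print(df2)
```

Printing df['AgeSeg'].unique() on the real data is a quick way to see what the values actually look like.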