Pandas grouping based on a column in a very specific format - pandas

I have a data-frame df -
a b c
0 1 5 0
1 1 6 1
2 1 7 0
3 3 8 0
I need to group it based on column c, like this:
a b c
0 [1, 1] [5, 6] [0, 1]
1 1 7 0
2 3 8 0
It can be done by iterating over the df. Is there a more pandas-like way, such as a grouping operation?

Not sure, but is this what you need?
k = 0
temp = []
for i in df.c:
    # every 0 in column c starts a new group
    if i == 0:
        k += 1
    temp.append(k)
df = df.groupby(temp).agg(list)
Output:
a b c
1 [1, 1] [5, 6] [0, 1]
2 [1] [7] [0]
3 [3] [8] [0]

You don't need a loop. Here is a two-line solution for this particular frame.
Change index 0 to 1 so that the first two rows share an index label. This relies on rows 0 and 1 being the only rows that belong together.
Then group on the index with groupby and collect each column's values into a list for each group.
df.rename(index={0:1}, inplace=True)
df = df.groupby(df.index).agg(list)
print(df)
a b c
1 [1, 1] [5, 6] [0, 1]
2 [1] [7] [0]
3 [3] [8] [0]
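The index rename above is tied to this particular frame. A more general sketch, assuming (as in the loop-based answer) that every 0 in column c starts a new group, builds the group key with a cumulative sum. The key name `group` is only an illustrative label, not something from the question:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 1, 3], "b": [5, 6, 7, 8], "c": [0, 1, 0, 0]})

# Assumption: each 0 in column c opens a new group, so a running count of
# zeros labels the rows 1, 1, 2, 3.
group_key = df["c"].eq(0).cumsum().rename("group")

# Collect every column's values into one list per group.
result = df.groupby(group_key).agg(list)
print(result)
```

Single-row groups still come back as one-element lists, as in the answers above, rather than as the bare scalars shown in the question.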

Related

How to explode row list into multiple cumulative lists?

Firstly I did a groupby operation: df.groupby('a')['b'].agg(list).reset_index(name='b')
a b
A 1
A 2
B 5
B 5
B 4
C 6
Resulting in this df:
a b
A [1,2]
B [5,5,4]
C [6]
Now I want to explode these lists into multiple cumulative lists by row.
a b
A [1]
A [1,2]
B [5]
B [5,5]
B [5,5,4]
C [6]
You first need to convert each cell value to a one-element list; then a cumsum within each group concatenates the lists.
df['out'] = df['b'].apply(lambda x : [x]).groupby(df['a']).apply(lambda x : x.cumsum() )
Out[382]:
0 [1]
1 [1, 2]
2 [5]
3 [5, 5]
4 [5, 5, 4]
5 [6]
Name: b, dtype: object
As DataFrame.expanding() seems only to work on numeric data, I resort to this nested list comprehension:
df['b'] = [subdf['b'].tolist()[:i+1]
           for group, subdf in df.groupby('a')
           for i in range(subdf.shape[0])]
print(df)
a b
0 A [1]
1 A [1, 2]
2 B [5]
3 B [5, 5]
4 B [5, 5, 4]
5 C [6]
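The list comprehension relies on the rows already being ordered by group, because groupby iterates over the keys in sorted order. A hedged variant that aligns on the index instead, so it should also cope with unsorted rows, might look like this (the column name `out` and the helper list `parts` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": list("AABBBC"), "b": [1, 2, 5, 5, 4, 6]})

parts = []
for _, group in df.groupby("a"):
    values = group["b"].tolist()
    # One growing prefix of the group's values per row, keyed by the
    # original row index so the assignment below aligns correctly.
    prefixes = [values[: i + 1] for i in range(len(values))]
    parts.append(pd.Series(prefixes, index=group.index))

df["out"] = pd.concat(parts)
print(df)
```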

Numpy: Add something before array

When I print a NumPy array, I want to add a label before each row, like this:
G1: first row
G2: second row
G3: Third row
This is what I have tried, but the result is not what I want.
import numpy as np
c = np.arange(9).reshape(3,3)
for i in range(1,3):
    for row in c:
        print('G'+str(i))
        print(row)
Result:
G1
[0 1 2]
G1
[3 4 5]
G1
[6 7 8]
G2
[0 1 2]
G2
[3 4 5]
G2
[6 7 8]
c = np.arange(9).reshape(3,3)
for i, row in enumerate(c):
    print('G' + str(i+1) + ': ' + str(row))
Result:
G1: [0 1 2]
G2: [3 4 5]
G3: [6 7 8]
I think this does what you want.
import numpy as np
c = np.arange(9).reshape(3,3)
for i in range(c.shape[0]):
    print(f'G{i+1}: {c[i]}')
Result:
G1: [0 1 2]
G2: [3 4 5]
G3: [6 7 8]
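Just a small tweak to your code: number the rows with enumerate instead of nesting two loops, and use end=' ' to keep the label on the same line as the row.
c = np.arange(9).reshape(3,3)
for i, row in enumerate(c, start=1):
    print('G'+str(i)+':', end=' ')
    print(row)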

Reshaping a pandas dataframe from (12,1) to a specific shaped (3,4)

I have a specific reshaping I'm trying to accomplish. I don't see how to use np.reshape or pd.pivot to get this to work. Any help would be appreciated.
df = [1,2,3,4,1,2,3,4,1,2,3,4]
#I would like the output to look like:
0 1 2 3
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
Use pandas.DataFrame.values or pandas.DataFrame.to_numpy (both are also available on a Series) together with numpy.reshape.
The pandas documentation recommends to_numpy() over .values.
import numpy as np
import pandas as pd
data = [1,2,3,4,1,2,3,4,1,2,3,4]
df = pd.Series(data)
# Option 1: 'values' with reshape()
print('Option1 : \n', df.values.reshape(3,4).T)
# Option 2: 'to_numpy()' with reshape()
print('Option2 : \n', df.to_numpy().reshape(3,4).T)
# Build the reshaped DataFrame, then flatten it back to a vector
df1 = pd.DataFrame(df.to_numpy().reshape(3,4).T)
# DataFrame to vector, Option 1
print('Option1: Convert dataframe to vector: \n', np.reshape(df1.values.T, (1, df1.size)))
# DataFrame to vector, Option 2
print('Option2: Convert dataframe to vector: \n', df1.to_numpy().T.reshape(1, df1.size))
# NumPy array to vector
df2 = df.to_numpy().reshape(3,4).T
print('Array to vector: \n', np.reshape(df2.T, (1, df2.size)))
Out:
Option1 :
[[1 1 1]
[2 2 2]
[3 3 3]
[4 4 4]]
Option2 :
[[1 1 1]
[2 2 2]
[3 3 3]
[4 4 4]]
Option1: Convert dataframe to vector:
[[1 2 3 4 1 2 3 4 1 2 3 4]]
Option2: Convert dataframe to vector:
[[1 2 3 4 1 2 3 4 1 2 3 4]]
Array to vector:
[[1 2 3 4 1 2 3 4 1 2 3 4]]
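The reshape-then-transpose step can also be written as a single column-major reshape. This is a small sketch, not part of the answer above; the variable names are illustrative:

```python
import numpy as np

values = np.array([1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4])

# order="F" fills the (4, 3) result column by column, which matches
# reshaping to (3, 4) and then transposing.
column_major = values.reshape((4, 3), order="F")
transposed = values.reshape(3, 4).T

# Flattening in the same column-major order recovers the original vector.
flat_again = column_major.ravel(order="F")
print(column_major)
```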

Using numpy.argsort() gives wrong indices array

I'm new to numpy, so I might be missing something obvious here.
The following small argsort() test script gives strange results. Any pointers?
import numpy as np
a = np.array([[3, 5, 6, 4, 1] , [2, 7 ,4 ,1 , 2] , [8, 6, 7, 2, 1]])
print(a)
print(a.argsort(axis=0))
print(a.argsort(axis=1))
output:
[[3 5 6 4 1]
[2 7 4 1 2]
[8 6 7 2 1]]
[[1 0 1 1 0] # bad 4th & 5th columns ?
[0 2 0 2 2]
[2 1 2 0 1]]
[[4 0 3 1 2] # what's going on here ?
[3 0 4 2 1]
[4 3 1 2 0]]
As others have indicated, this method is working correctly, so here is an explanation of how .argsort() works. a.argsort() returns the indices (not the values) that would sort the array along the specified axis.
In your example
a = np.array([[3, 5, 6, 4, 1] , [2, 7 ,4 ,1 , 2] , [8, 6, 7, 2, 1]])
print(a)
print(a.argsort(axis=0))
returns
[[3 5 6 4 1]
[2 7 4 1 2]
[8 6 7 2 1]]
[[1 0 1 1 0]
[0 2 0 2 2]
[2 1 2 0 1]]
because along
[[3 ...
[2 ...
[8 ...
2 is the smallest value. Therefore the index of 2 (which is 1) takes the first position along this axis in the matrix returned by argsort(). The second smallest value is 3 at index 0, therefore the second position along this axis in the returned matrix will be 0. Finally, the largest element is 8, which occurs at index 2 along axis 0, so the final element of the returned matrix will be 2. Thus:
[[1 ...
[0 ...
[2 ...
The same process is repeated along the other 4 sequences along axis 0:
[[...5 ...] [[...0 ...]
[...7 ...] becomes ----> [... 2 ...]
[...6 ...]] [... 1 ...]]
[[...6 ...] [[...1 ...]
[...4 ...] becomes ----> [... 0 ...]
[...7 ...]] [... 2 ...]]
[[...4 ...] [[...1 ...]
[...1 ...] becomes ----> [... 2 ...]
[...2 ...]] [... 0 ...]]
[[...1] [[...0]
[...2] becomes ----> [... 2]
[...1]] [... 1]]
Changing the axis from 0 to 1 results in the same process being applied along sequences in the 1st axis:
[[3 5 6 4 1 becomes ----> [[4 0 3 1 2
again because the smallest element is 1 which is at index 4, then 3 at index 0, then 4 at index 3, 5 at index 1 and finally 6 is the largest at index 2.
As before this process is repeated across each of
[2 7 4 1 2] ----> [3 0 4 2 1]
[8 6 7 2 1] ----> [4 3 1 2 0]
giving
[[4 0 3 1 2]
[3 0 4 2 1]
[4 3 1 2 0]]
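One way to check this reading of argsort() is to use the returned indices to pull the values back out: the result should equal the sorted array. A minimal sketch using np.take_along_axis, with kind="stable" chosen here so that tied values keep their original order:

```python
import numpy as np

a = np.array([[3, 5, 6, 4, 1], [2, 7, 4, 1, 2], [8, 6, 7, 2, 1]])

# Indices that would sort each column (axis=0) and each row (axis=1).
idx_cols = a.argsort(axis=0, kind="stable")
idx_rows = a.argsort(axis=1, kind="stable")

# Gathering the values at those indices yields the sorted arrays.
sorted_cols = np.take_along_axis(a, idx_cols, axis=0)
sorted_rows = np.take_along_axis(a, idx_rows, axis=1)
print(sorted_rows)
```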
argsort() returns an array in sorted order, but its elements are the indices of the values in the original array rather than the values themselves.
This says the first element in our sorted array would be the element whose index is '1', which in turn is '0'.

create a dataframe from a list of length-unequal lists

I am trying to convert a list like this:
l = [[1, 2, 3, 17], [4, 19], [5]]
to a dataframe that has each number as the index and the position of its containing list as the value.
For example, 19 is in the second list, so I expect to get a row with "19" as index and "1" as value, and so on.
I managed to get it (see the boilerplate below), but I guess there is a simpler way.
>>> df=pd.DataFrame(l)
>>> df=df.unstack().reset_index(level=0,drop=True)
>>> df=df[df.notnull()==True] # remove NaN rows
>>> df=pd.DataFrame(df)
>>> df = df.reset_index().set_index(0)
>>> print(df)
index
0
1 0
4 1
5 2
2 0
19 1
3 0
17 0
Thanks in advance.
In [52]: pd.DataFrame([(item, i) for i, seq in enumerate(l)
                       for item in seq]).set_index(0)
Out[52]:
1
0
1 0
2 0
3 0
17 0
4 1
19 1
5 2
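Another possible route, assuming a pandas version that provides Series.explode (0.25 or later), is to explode the list of lists and then swap index and values. The names `item` and `list_position` are illustrative labels, not from the question:

```python
import pandas as pd

l = [[1, 2, 3, 17], [4, 19], [5]]

# explode() gives one row per item; the index holds the position of the
# sub-list each item came from.
exploded = pd.Series(l).explode()

# Swap the roles: items become the index, list positions become the values.
lookup = pd.Series(
    exploded.index.to_numpy(),
    index=pd.Index(exploded.to_numpy().astype(int), name="item"),
    name="list_position",
)
print(lookup)
```

With this layout, `lookup.loc[19]` returns the position of the list that contains 19.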