Replace pandas values as index to another array - pandas

Consider an array
a = np.array([5, 12, 56, 36])
and a pandas dataframe
b = pandas.DataFrame(np.array([1, 3, 0, 3, 1, 0, 2]))
How does one replace the values in b by using them as indexes into a? That is, the intended result is:
c = pandas.DataFrame([12, 36, 5, 36, 12, 5, 56])
Can't quite figure this out.

One way is using apply,
c = b.apply(lambda x: a[x])
Or by indexing the numpy array and passing the values to DataFrame,
c = pd.DataFrame(a[b[0].values])
0
0 12
1 36
2 5
3 36
4 12
5 5
6 56
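The indexing approach can be checked end to end with a short sketch. It assumes, as in the question, that b has a single column labelled 0 and that every value in it is a valid position in a. Passing index=b.index is an addition here, so that c keeps the row labels of b.

```python
import numpy as np
import pandas as pd

a = np.array([5, 12, 56, 36])
b = pd.DataFrame(np.array([1, 3, 0, 3, 1, 0, 2]))

# Integer-array ("fancy") indexing: each value in column 0 of b
# is used as a position into a.
c = pd.DataFrame(a[b[0].to_numpy()], index=b.index)

print(c[0].tolist())
```

This should print the intended values from the question, [12, 36, 5, 36, 12, 5, 56].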

Let us try something different, Series.get:
pd.Series(a).get(b[0])
Out[57]:
1 12
3 36
0 5
3 36
1 12
0 5
2 56
dtype: int32

map can be used (here the column of b is assumed to be named a, as the Name: a line in the output shows).
b.a.map({i:j for i,j in enumerate(a)})
0 12
1 36
2 5
3 36
4 12
5 5
6 56
Name: a, dtype: int64
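A small variation, offered only as a sketch: map also accepts a Series, in which case each value is looked up by index label, so the dict comprehension is optional. The column name a and the variable name lookup are assumptions made for this example.

```python
import numpy as np
import pandas as pd

a = np.array([5, 12, 56, 36])
# Assumption: the column of b is named "a", matching the output above.
b = pd.DataFrame({"a": [1, 3, 0, 3, 1, 0, 2]})

# With a Series as the mapper, each value of b["a"] is replaced by
# the entry of lookup that has that value as its index label.
lookup = pd.Series(a)
c = b["a"].map(lookup)

print(c.tolist())
```

Values of b that have no matching label in lookup would come back as NaN rather than raising an error.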

Reshape wide to long for many columns with a common prefix

My frame has many pairs of identically named columns, with the only difference being the prefix. For example, player1.player.id and player2.player.id.
Here's an example (with fewer and shorter columns):
pd.DataFrame({'p1.a': {0: 4, 1: 0}, 'p1.b': {0: 1, 1: 4},
'p1.c': {0: 2, 1: 8}, 'p1.d': {0: 3, 1: 12},
'p1.e': {0: 4, 1: 16}, 'p1.f': {0: 5, 1: 20},
'p1.g': {0: 6, 1: 24},
'p2.a': {0: 0, 1: 0}, 'p2.b': {0: 3, 1: 12},
'p2.c': {0: 6, 1: 24}, 'p2.d': {0: 9, 1: 36},
'p2.e': {0: 12, 1: 48}, 'p2.f': {0: 15, 1: 60},
'p2.g': {0: 18, 1: 72}})
p1.a p1.b p1.c p1.d p1.e p1.f p1.g p2.a p2.b p2.c p2.d p2.e p2.f p2.g
0 4 1 2 3 4 5 6 0 3 6 9 12 15 18
1 0 4 8 12 16 20 24 0 12 24 36 48 60 72
I'd like to turn it into a long format, with a new side column denoting either p1 or p2. I have several crappy ways of doing it, for example:
df1 = df.filter(regex='^p1.*').assign(side='p1')
df2 = df.filter(regex='^p2.*').assign(side='p2')
df1.columns = [c.replace('p1.', '') for c in df1.columns]
df2.columns = [c.replace('p2.', '') for c in df2.columns]
pd.concat([df1, df2]).head()
a b c d e f g side
0 4 1 2 3 4 5 6 p1
1 0 4 8 12 16 20 24 p1
0 0 3 6 9 12 15 18 p2
1 0 12 24 36 48 60 72 p2
This feels non-idiomatic, and I couldn't get pd.wide_to_long() to work here.
I'd appreciate an answer which also handles arbitrary substrings, not just prefix, i.e., I'm also interested in something like this:
foo.p1.a foo.p1.b foo.p1.c foo.p1.d foo.p1.e foo.p1.f foo.p1.g foo.p2.a foo.p2.b foo.p2.c foo.p2.d foo.p2.e foo.p2.f foo.p2.g
0 4 1 2 3 4 5 6 0 3 6 9 12 15 18
1 0 4 8 12 16 20 24 0 12 24 36 48 60 72
Turning into:
foo.a foo.b foo.c foo.d foo.e foo.f foo.g side
0 4 1 2 3 4 5 6 p1
1 0 4 8 12 16 20 24 p1
0 0 3 6 9 12 15 18 p2
1 0 12 24 36 48 60 72 p2
But if there's an idiomatic way to handle prefixes whereas substrings require complexity, I'd appreciate learning about both.
What's the idiomatic (pythonic? pandonic?) way of doing this?
A couple of options to do this:
With pd.wide_to_long, you need to reorder the parts of each column name around the delimiter; in this case we move a, b, ... to the front and p1, p2 to the back before reshaping:
temp = df.copy()
# reverse each name around the dot: 'p1.a' -> 'a.p1'
temp = temp.rename(columns = lambda col: ".".join(col.split(".")[::-1]))
(pd.wide_to_long(temp.reset_index(),
                 stubnames = ["a", "b", "c", "d", "e", "f", "g"],
                 sep=".",
                 suffix=".+",
                 i = "index",
                 j = "side")
   .droplevel('index')
   .reset_index()
)
side a b c d e f g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
One limitation with pd.wide_to_long is the reshaping of positions. The other limitation is that the stubnames have to be explicitly specified.
Another option is via stack, where the columns are split, based on the delimiter and reshaped:
temp = df.copy()
temp.columns = temp.columns.str.split(".", expand = True)
temp.stack(0).droplevel(0).rename_axis('side').reset_index()
side a b c d e f g
0 p1 4 1 2 3 4 5 6
1 p2 0 3 6 9 12 15 18
2 p1 0 4 8 12 16 20 24
3 p2 0 12 24 36 48 60 72
stack is quite flexible, and did not require us to list the column names. The limitation of stack is that it fails if the index is not unique.
Another option is pivot_longer from pyjanitor, which abstracts the process:
# pip install pyjanitor
import janitor
df.pivot_longer(index = None,
names_to = ("side", ".value"),
names_sep=".")
side a b c d e f g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
The worker here is .value. This tells the code that anything after . should remain as column names, while anything before . should be collated into a new column (side). Note that, unlike wide_to_long, the stubnames do not need to be stated - it abstracts that for us. Also, it can handle duplicate indices, since it uses pd.melt under the hood.
One limitation of pivot_longer is that you have to install the pyjanitor library.
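If installing another library is not an option, the melt-based idea can be approximated in plain pandas. The following is a rough sketch, not pyjanitor's actual implementation; the helper column names row, column and name are made up for the example, and the frame is shortened to two value columns.

```python
import pandas as pd

# Shortened version of the example frame (only columns a and b).
df = pd.DataFrame({"p1.a": [4, 0], "p1.b": [1, 4],
                   "p2.a": [0, 0], "p2.b": [3, 12]})

# Use a positional row counter so duplicate index labels cannot collide.
melted = (df.reset_index(drop=True)
            .rename_axis("row")
            .reset_index()
            .melt(id_vars="row", var_name="column"))

# Split "p1.a" into side "p1" and name "a" at the first dot.
melted[["side", "name"]] = melted["column"].str.split(".", n=1, expand=True)

# Spread the names back out into columns, one row per (side, row) pair.
long = (melted.pivot(index=["side", "row"], columns="name", values="value")
              .reset_index()
              .drop(columns="row")
              .rename_axis(columns=None))

print(long)
```

Passing a list to the index argument of pivot needs a reasonably recent pandas (1.1 or later, as far as I know).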
For the other example, I'll use stack and pivot_longer; you can still use pd.wide_to_long to solve it.
With stack:
first split the columns and convert into a MultiIndex:
temp = df.copy()
temp.columns = temp.columns.str.split(".", expand = True)
Reshape the data:
temp = temp.stack(1).droplevel(0).rename_axis('side')
Merge the column names:
temp.columns = temp.columns.map(".".join)
Reset the index:
temp.reset_index()
side foo.a foo.b foo.c foo.d foo.e foo.f foo.g
0 p1 4 1 2 3 4 5 6
1 p2 0 3 6 9 12 15 18
2 p1 0 4 8 12 16 20 24
3 p2 0 12 24 36 48 60 72
With pivot_longer, one option is to reorder the columns, before reshaping:
temp = df.copy()
temp.columns = ["".join([first, last, middle])
for first, middle, last in
temp.columns.str.split(r'(\.p\d)')]
(
temp
.pivot_longer(
index = None,
names_to = ('.value', 'side'),
names_pattern = r"(.+)\.(p\d)")
)
side foo.a foo.b foo.c foo.d foo.e foo.f foo.g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
In the dev version however, the column reorder is not necessary; we can simply use multiple .value to reshape the dataframe - note that you'll have to install from the repo to get the latest dev version:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
(df
.pivot_longer(
index = None,
names_to = ('.value', 'side', '.value'),
names_pattern = r"(.+)\.(.\d)(.+)")
)
side foo.a foo.b foo.c foo.d foo.e foo.f foo.g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
Another option with names_sep:
(df
.pivot_longer(
index = None,
names_to = ('.value', 'side', '.value'),
names_sep = r'\.(p\d)')
)
side foo.a foo.b foo.c foo.d foo.e foo.f foo.g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
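For the substring case, a plain-pandas alternative is to group the columns by a regex match and concatenate the pieces. This is a minimal sketch under the assumption that the side marker looks like .p followed by digits and occurs once per column name; the names marker, side_of and without_side are hypothetical helpers, and the frame is shortened to two value columns.

```python
import re
import pandas as pd

# Shortened version of the second example frame.
df = pd.DataFrame({"foo.p1.a": [4, 0], "foo.p1.b": [1, 4],
                   "foo.p2.a": [0, 0], "foo.p2.b": [3, 12]})

# Assumption: the side marker is ".p" followed by digits, once per name.
marker = re.compile(r"\.(p\d+)")

def side_of(col):
    """Return the side label embedded in a column name, e.g. 'p1'."""
    return marker.search(col).group(1)

def without_side(col):
    """Drop the side marker: 'foo.p1.a' -> 'foo.a'."""
    return marker.sub("", col, count=1)

# One sub-frame per side, with the marker removed from its column names.
parts = {
    side: df[[c for c in df.columns if side_of(c) == side]].rename(columns=without_side)
    for side in sorted({side_of(c) for c in df.columns})
}

# Stack the sub-frames; the dict keys become the "side" level.
long = pd.concat(parts, names=["side", "row"]).reset_index(level="side")

print(long)
```

Because the pieces are concatenated rather than pivoted, this does not depend on the original index being unique.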

Passing Tuple to a function via apply

I am trying to run the function below, which takes two points:
pointA = (2, 3)
pointB = (4, 5)
def Somefunc(pointA, pointB):
    x = pointA[0] + pointB[1]
    return x
Now, when I try to create a separate column based on this function, it throws errors like cannot convert the series to <class 'float'>, so I tried this:
df['T']=df.apply(Somefunc((df['A'].apply(lambda x: float(x)),df['B'].apply(lambda x: float(x))),\
(df['C'].apply(lambda x: float(x)),df['D'].apply(lambda x: float(x)))),axis=0))
Sample dataframe below:
A B C D
1 2 3 5
2 4 7 8
4 7 9 0
Any help will be appreciated.
This is the best guess I can make as to what you're trying to do:
df['T']=df.apply(lambda row: [(row['A'],row['B']),(row['C'],row['D'])],axis=1)
Edit: to apply your function:
df['T'] = df.apply(lambda row: Somefunc((row['A'],row['B']),(row['C'],row['D'])),axis=1)
That being said, if all you need is each row's values as a tuple, that can be done more quickly and idiomatically like so:
>>> df
A B C D
0 2 7 3 3
1 3 1 5 7
2 2 0 6 2
3 3 9 5 9
4 0 2 3 7
>>> df['T']=df.apply(tuple,axis=1)
>>> df
A B C D T
0 2 7 3 3 (2, 7, 3, 3)
1 3 1 5 7 (3, 1, 5, 7)
2 2 0 6 2 (2, 0, 6, 2)
3 3 9 5 9 (3, 9, 5, 9)
4 0 2 3 7 (0, 2, 3, 7)
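For the specific Somefunc in the question, which adds the first coordinate of the first point to the second coordinate of the second, the row-wise apply can be compared with a vectorized expression. This is a sketch using the sample dataframe from the question; the column names T_apply and T_vec are made up for the comparison.

```python
import pandas as pd

def somefunc(point_a, point_b):
    # Same logic as the question's Somefunc: x of point A plus y of point B.
    return point_a[0] + point_b[1]

df = pd.DataFrame({"A": [1, 2, 4], "B": [2, 4, 7],
                   "C": [3, 7, 9], "D": [5, 8, 0]})

# Row-wise: build the two points from each row and call the function.
df["T_apply"] = df.apply(
    lambda row: somefunc((row["A"], row["B"]), (row["C"], row["D"])),
    axis=1,
)

# Vectorized equivalent for this particular function: A is pointA[0], D is pointB[1].
df["T_vec"] = df["A"] + df["D"]

print(df)
```

Both columns should hold 6, 10 and 4. The vectorized form only works because this function reduces to column arithmetic; a function with more involved logic may still need apply.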

Length of passed values is 1, index implies 10

Why does this error occur, and what does it mean? It says Length of passed values is 1, index implies 10. I ran the code many times and keep getting the same error:
ser = pd.Series(np.random.randint(1, 50, 10))
result = np.argwhere(ser % 3==0)
print(result)
argwhere() operates on a NumPy array, not a pandas Series. See below:
a = np.random.randint(1, 50, 12)
a = pd.Series(a)
print(a)
np.argwhere(a.values%3==0)
output
0 28
1 46
2 4
3 40
4 19
5 26
6 6
7 24
8 26
9 30
10 33
11 27
dtype: int64
Out[250]:
array([[ 6],
[ 7],
[ 9],
[10],
[11]])
Please read the documentation for numpy.random.randint. You will see that the parameters are (low, high, size).
In your case, you are passing (1, 50, 10), so 10 random integers will be generated from 1 (inclusive) to 50 (exclusive).
If you want the multiples of 3, use ser[ser % 3==0] instead of np.argwhere.
See similar issue raised earlier and answered on Stack Overflow
import pandas as pd
import numpy as np
ser = pd.Series(np.random.randint(1, 50, 10))
print (ser)
result = ser[ser % 3==0]
print(result)
Output of this will be:
Original Series.
0 17
1 34
2 29
3 15
4 24
5 20
6 21
7 48
8 6
9 42
dtype: int64
Multiples of 3 will be:
3 15
4 24
6 21
7 48
8 6
9 42
dtype: int64
Use Index.tolist:
In [1374]: ser
Out[1374]:
0 44
1 5
2 35
3 10
4 16
5 20
6 25
7 9
8 44
9 16
dtype: int64
In [1372]: l = ser[ser % 3 == 0].index.tolist()
In [1373]: l
Out[1373]: [7]
where l will be a list of indexes of elements which are a multiple of 3.
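The answers above can be put side by side in one sketch: the boolean mask gives the matching values, the index of the masked Series gives their labels, and np.flatnonzero on the underlying array gives their integer positions. A fixed Series (the one printed in the second answer) is used instead of random data so the result is reproducible; the variable names are chosen for this example.

```python
import numpy as np
import pandas as pd

ser = pd.Series([17, 34, 29, 15, 24, 20, 21, 48, 6, 42])

# Boolean mask: True where the value is a multiple of 3.
mask = ser % 3 == 0

multiples = ser[mask]                         # the matching values
labels = ser.index[mask].tolist()             # their index labels
positions = np.flatnonzero(mask.to_numpy())   # their integer positions

print(multiples.tolist())
print(labels)
print(positions)
```

With the default RangeIndex, labels and positions coincide; they differ once the Series has a custom index.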

groupby list of lists of indexes

I have a list of np.arrays representing indexes of a pandas dataframe.
I need to group by index so that each array yields one group.
Let's say this is the df:
index values
0 2
1 3
2 2
3 2
4 4
5 4
6 1
7 4
8 4
9 4
and that is the list of np.arrays:
[array([0, 1, 2, 3]), array([6, 7, 8])]
From this data I expect to get 2 groups, without loop operations, as a single groupby object:
group1:
index values
0 2
1 3
2 2
3 2
group2:
index values
6 1
7 4
8 4
I would stress again that finally I need to get a single groupby object.
Thank you!
I am still using a for loop to create the groupby key dict:
l=[np.array([0, 1, 2, 3]), np.array([6, 7, 8])]
df=pd.DataFrame([2, 3, 2, 2, 4, 4, 1, 4, 4, 4],columns=['values'])
from collections import ChainMap
L=dict(ChainMap(*[dict.fromkeys(y,x) for x, y in enumerate(l)]))
list(df.groupby(L))
Out[33]:
[(0.0, values
index
0 2
1 3
2 2
3 2), (1.0, values
index
6 1
7 4
8 4)]
df=pd.DataFrame([2,3,2,2,4,4,1,4,4,4],columns=['values'])
df.index.name ='index'
l=[np.array([0, 1, 2, 3]), np.array([6, 7, 8])]
group1= df.loc[pd.Series(l[0])]
group2= df.loc[pd.Series(l[1])]
This seems like an X-Y problem:
l = [np.array([0,1,2,3]), np.array([6,7,8])]
df_indx = pd.DataFrame(l).stack().reset_index()
df_new = df.assign(foo=df['index'].map(df_indx.set_index(0)['level_0']))
for n,g in df_new.groupby('foo'):
    print(g)
Output:
index values foo
0 0 2 0.0
1 1 3 0.0
2 2 2 0.0
3 3 2 0.0
index values foo
6 6 1 1.0
7 7 4 1.0
8 8 4 1.0
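The grouping key can also be built with NumPy alone, which avoids writing a Python-level loop over the individual indexes (a list comprehension over the arrays' lengths remains). This is a sketch that assumes the dataframe uses the default integer index and that no index appears in more than one array; the names groups, key and members are chosen for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"values": [2, 3, 2, 2, 4, 4, 1, 4, 4, 4]})
groups = [np.array([0, 1, 2, 3]), np.array([6, 7, 8])]

# One label per listed index: label i for every index in groups[i].
key = pd.Series(
    np.repeat(np.arange(len(groups)), [len(g) for g in groups]),
    index=np.concatenate(groups),
)

# groupby aligns key with df's index; rows without a label are left out.
grouped = df.groupby(key)

members = [g.index.tolist() for _, g in grouped]
print(members)
```

Rows 4, 5 and 9 are not listed in any array, so they get no key and do not appear in any group, which matches the expected output in the question.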

create a dataframe from a list of length-unequal lists

I am trying to convert a list like this:
l = [[1, 2, 3, 17], [4, 19], [5]]
into a dataframe with each number as the index and the position of its containing list as the value.
For example, 19 is in the second list, so I expect to get a row with "19" as index and "1" as value, and so on.
I managed to get it (cf. the boilerplate below), but I guess there is something simpler:
>>> df=pd.DataFrame(l)
>>> df=df.unstack().reset_index(level=0,drop=True)
>>> df=df[df.notnull()==True] # remove NaN rows
>>> df=pd.DataFrame(df)
>>> df = df.reset_index().set_index(0)
>>> print(df)
index
0
1 0
4 1
5 2
2 0
19 1
3 0
17 0
Thanks in advance.
In [52]: pd.DataFrame([(item, i) for i, seq in enumerate(l)
for item in seq]).set_index(0)
Out[52]:
1
0
1 0
2 0
3 0
17 0
4 1
19 1
5 2
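In newer pandas versions (0.25 and later) Series.explode offers another route. The sketch below assumes that no inner list is empty and that all items are integers; the name position is made up for the example.

```python
import pandas as pd

l = [[1, 2, 3, 17], [4, 19], [5]]

# One row per number; the index records which inner list it came from.
exploded = pd.Series(l).explode()

# Swap the roles: numbers become the index, list positions the values.
result = pd.Series(exploded.index.to_numpy(),
                   index=exploded.astype(int).to_numpy(),
                   name="position")

print(result)
```

Calling result.to_frame() turns this into a one-column dataframe if a DataFrame is required.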