Length of passed values is 1, index implies 10 - pandas

Why does this error occur, and what does it mean? It shows "Length of passed values is 1, index implies 10". I have tried running the code many times and keep getting the same error:
ser = pd.Series(np.random.randint(1, 50, 10))
result = np.argwhere(ser % 3==0)
print(result)

argwhere() operates on a NumPy array, not a pandas Series. See below:
a = np.random.randint(1, 50, 12)
a = pd.Series(a)
print(a)
np.argwhere(a.values%3==0)
Output:
0 28
1 46
2 4
3 40
4 19
5 26
6 6
7 24
8 26
9 30
10 33
11 27
dtype: int64
Out[250]:
array([[ 6],
[ 7],
[ 9],
[10],
[11]])

Please read the documentation for numpy.random.randint. You will see that the parameters are (low, high, size).
In your case, you are passing (1, 50, 10), so 10 random numbers will be generated between 1 and 50.
If you want multiples of 3, then you need ser[ser % 3 == 0], not np.argwhere.
See a similar issue raised earlier and answered on Stack Overflow.
import pandas as pd
import numpy as np
ser = pd.Series(np.random.randint(1, 50, 10))
print("Original Series.")
print(ser)
result = ser[ser % 3 == 0]
print("Multiples of 3 will be:")
print(result)
Output of this will be:
Original Series.
0 17
1 34
2 29
3 15
4 24
5 20
6 21
7 48
8 6
9 42
dtype: int64
Multiples of 3 will be:
3 15
4 24
6 21
7 48
8 6
9 42
dtype: int64

Use Index.tolist:
In [1374]: ser
Out[1374]:
0 44
1 5
2 35
3 10
4 16
5 20
6 25
7 9
8 44
9 16
dtype: int64
In [1372]: l = ser[ser % 3 == 0].index.tolist()
In [1373]: l
Out[1373]: [7]
where l will be a list of the indices of the elements that are multiples of 3.
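For completeness, if you specifically want argwhere-style positions, a minimal sketch (assuming the same ser as above) is to run np.argwhere on the underlying array and flatten its 2-D result:
import numpy as np
import pandas as pd

ser = pd.Series(np.random.randint(1, 50, 10))
# np.argwhere returns a 2-D array of shape (k, 1); working on the raw array
# avoids pandas trying (and failing) to wrap that result in a Series with
# the original 10-element index, which is what triggers the error above.
positions = np.argwhere(ser.to_numpy() % 3 == 0).ravel().tolist()
print(positions)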

Related

How can I plot two lines in one graph where values of the lines do not exist for the same x axis?

I would like to plot SupDem (variable) where e_boix_regime==1 and SupDem where e_boix_regime==0.
My data:
year  SupDem  e_boix_regime
1997    0.98              1
1998    0.75              0
My code:
dem = dem_aut[dem_aut["e_boix_regime"]==1].SupDem
aut = dem_aut[dem_aut["e_boix_regime"]==0].SupDem
year = dem_aut["year"]
plt.plot(year, dem, label="Support for Democracy in Democracies")
plt.plot(year, aut, label="Support for Democracy in Autocracies")
plt.show()
The error is the following: x and y must have same first dimension, but have shapes (53,) and (28,)
I just wanted to plot two lines together.
This can help you solve the problem. I hope you can reproduce the code with it:
two (or more) graphs in one plot with different x-axis AND y-axis scales in python
Issue
Your issue is regarding shape of x and y. For plotting graph you need same data point/shape of x-values and y-values.
Solution
Take each year under the same dem_aut["e_boix_regime"] condition that you apply to SupDem (the demo below codes the regimes as 1 and 2)
Source Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(
    {
        "SupDem": np.random.randint(1, 11, 30),
        "year": np.random.randint(10, 21, 30),
        "e_boix_regime": np.random.randint(1, 3, 30),
    }
)  # see DataFrame below
df["e_boix_regime"].value_counts() # 1 = 18, 2 = 12
df[df["e_boix_regime"] == 2][["SupDem", "year"]] # see below
# you need same no. of data points for both x/y axis i.e. `year` and `SupDem`
plt.plot(
df[df["e_boix_regime"] == 1]["year"], df[df["e_boix_regime"] == 1]["SupDem"], marker="o", label="e_boix_regime==1"
)
# hence applying same condition for grabbing year which is applied for SupDem
plt.plot(
df[df["e_boix_regime"] == 2]["year"], df[df["e_boix_regime"] == 2]["SupDem"], marker="o", label="e_boix_regime==2"
)
plt.xlabel("Year")
plt.ylabel("SupDem")
plt.legend()
plt.show()
Output
PS: Ignore the actual data points plotted; they are generated from random values
DataFrame Outputs
SupDem year e_boix_regime
0 1 12 2
1 10 10 1
2 5 19 2
3 4 14 2
4 8 14 2
5 4 17 2
6 2 15 2
7 10 11 1
8 8 11 2
9 6 19 2
10 5 15 1
11 8 17 1
12 9 10 2
13 1 14 2
14 8 18 1
15 3 13 2
16 6 16 2
17 1 16 1
18 7 13 1
19 8 15 2
20 2 17 2
21 5 10 2
22 1 19 2
23 5 20 2
24 7 16 1
25 10 14 1
26 2 11 2
27 1 18 1
28 5 16 1
29 10 18 2
df[df["e_boix_regime"] == 2][["SupDem", "year"]]
SupDem year
0 1 12
2 5 19
3 4 14
4 8 14
5 4 17
6 2 15
8 8 11
9 6 19
12 9 10
13 1 14
15 3 13
16 6 16
19 8 15
20 2 17
21 5 10
22 1 19
23 5 20
26 2 11
29 10 18
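Applied back to the question's own variables, a minimal sketch (assuming dem_aut is the original DataFrame) filters year under the same condition as SupDem:
import matplotlib.pyplot as plt

# filter the rows once, so year and SupDem always have matching lengths
dem = dem_aut[dem_aut["e_boix_regime"] == 1]
aut = dem_aut[dem_aut["e_boix_regime"] == 0]
plt.plot(dem["year"], dem["SupDem"], label="Support for Democracy in Democracies")
plt.plot(aut["year"], aut["SupDem"], label="Support for Democracy in Autocracies")
plt.legend()
plt.show()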

Pandas append row without specifying columns

I wanted to add or append a row (in the form of a list) to a dataframe. All the methods require that I turn the list into another dataframe first, e.g.
df = df.append(another_df)
df = df.merge(another_df)
df = pd.concat([df, another_df])
I've found a trick that works if the index is a running number, at https://www.statology.org/pandas-add-row-to-dataframe/
import pandas as pd
#create DataFrame
df = pd.DataFrame({'points': [10, 12, 12, 14, 13, 18],
'rebounds': [7, 7, 8, 13, 7, 4],
'assists': [11, 8, 10, 6, 6, 5]})
#view DataFrame
df
points rebounds assists
0 10 7 11
1 12 7 8
2 12 8 10
3 14 13 6
4 13 7 6
5 18 4 5
#add new row to end of DataFrame
df.loc[len(df.index)] = [20, 7, 5]
#view updated DataFrame
df
points rebounds assists
0 10 7 11
1 12 7 8
2 12 8 10
3 14 13 6
4 13 7 6
5 18 4 5
6 20 7 5
However, the dataframe must have a running-number index, or else the add/append will override existing rows.
So my question is: is there a simple, foolproof way to just append/add a list to a dataframe?
Thanks very much!!!
>>> df
points rebounds assists
3 10 7 11
1 12 7 8
2 12 8 10
If the indexes are "numbers", you could add 1 to the max index.
>>> df.loc[max(df.index) + 1] = 'my', 'new', 'row'
>>> df
points rebounds assists
3 10 7 11
1 12 7 8
2 12 8 10
4 my new row
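If the index labels are not numbers at all, a concat-based sketch (assuming the list's order matches df.columns) sidesteps label arithmetic entirely:
import pandas as pd

new_row = pd.DataFrame([[20, 7, 5]], columns=df.columns)
# ignore_index=True renumbers all rows 0..n-1, so no label can collide
df = pd.concat([df, new_row], ignore_index=True)
The trade-off is that ignore_index=True discards the existing index labels.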

Reshape wide to long for many columns with a common prefix

My frame has many pairs of identically named columns, with the only difference being the prefix. For example, player1.player.id and player2.player.id.
Here's an example (with fewer and shorter columns):
pd.DataFrame({'p1.a': {0: 4, 1: 0}, 'p1.b': {0: 1, 1: 4},
'p1.c': {0: 2, 1: 8}, 'p1.d': {0: 3, 1: 12},
'p1.e': {0: 4, 1: 16}, 'p1.f': {0: 5, 1: 20},
'p1.g': {0: 6, 1: 24},
'p2.a': {0: 0, 1: 0}, 'p2.b': {0: 3, 1: 12},
'p2.c': {0: 6, 1: 24}, 'p2.d': {0: 9, 1: 36},
'p2.e': {0: 12, 1: 48}, 'p2.f': {0: 15, 1: 60},
'p2.g': {0: 18, 1: 72}})
p1.a p1.b p1.c p1.d p1.e p1.f p1.g p2.a p2.b p2.c p2.d p2.e p2.f p2.g
0 4 1 2 3 4 5 6 0 3 6 9 12 15 18
1 0 4 8 12 16 20 24 0 12 24 36 48 60 72
I'd like to turn it into a long format, with a new side column denoting either p1 or p2. I have several crappy ways of doing it, for example:
df1 = df.filter(regex='^p1.*').assign(side='p1')
df2 = df.filter(regex='^p2.*').assign(side='p2')
df1.columns = [c.replace('p1.', '') for c in df1.columns]
df2.columns = [c.replace('p2.', '') for c in df2.columns]
pd.concat([df1, df2]).head()
a b c d e f g side
0 4 1 2 3 4 5 6 p1
1 0 4 8 12 16 20 24 p1
0 0 3 6 9 12 15 18 p2
1 0 12 24 36 48 60 72 p2
This feels non-idiomatic, and I couldn't get pd.wide_to_long() to work here.
I'd appreciate an answer which also handles arbitrary substrings, not just prefix, i.e., I'm also interested in something like this:
foo.p1.a foo.p1.b foo.p1.c foo.p1.d foo.p1.e foo.p1.f foo.p1.g foo.p2.a foo.p2.b foo.p2.c foo.p2.d foo.p2.e foo.p2.f foo.p2.g
0 4 1 2 3 4 5 6 0 3 6 9 12 15 18
1 0 4 8 12 16 20 24 0 12 24 36 48 60 72
Turning into:
foo.a foo.b foo.c foo.d foo.e foo.f foo.g side
0 4 1 2 3 4 5 6 p1
1 0 4 8 12 16 20 24 p1
0 0 3 6 9 12 15 18 p2
1 0 12 24 36 48 60 72 p2
But if there's an idiomatic way to handle prefixes whereas substrings require complexity, I'd appreciate learning about both.
What's the idiomatic (pythonic? pandonic?) way of doing this?
A couple of options to do this:
with pd.wide_to_long, you need to reorder the positions based on the delimiter; in this case we move the a, b, ... to the fore and the p1, p2 to the back, before reshaping:
temp = df.copy()
temp = temp.rename(columns=lambda col: ".".join(col.split(".")[::-1]))
(pd.wide_to_long(temp.reset_index(),
                 stubnames=["a", "b", "c", "d", "e", "f", "g"],
                 sep=".",
                 suffix=".+",
                 i="index",
                 j="side")
 .droplevel('index')
 .reset_index())
side a b c d e f g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
One limitation of pd.wide_to_long is that you may need to reorder the name parts first. The other limitation is that the stubnames have to be explicitly specified.
Another option is via stack, where the columns are split based on the delimiter and reshaped:
temp = df.copy()
temp.columns = temp.columns.str.split(".", expand = True)
temp.stack(0).droplevel(0).rename_axis('side').reset_index()
side a b c d e f g
0 p1 4 1 2 3 4 5 6
1 p2 0 3 6 9 12 15 18
2 p1 0 4 8 12 16 20 24
3 p2 0 12 24 36 48 60 72
stack is quite flexible and does not require us to list the column names. The limitation of stack is that it fails if the index is not unique.
Another option is pivot_longer from pyjanitor, which abstracts the process:
# pip install pyjanitor
import janitor
df.pivot_longer(index = None,
names_to = ("side", ".value"),
names_sep=".")
side a b c d e f g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
The key here is .value. This tells the function that anything after the . should remain as column names, while anything before the . should be collated into a new column (side). Note that, unlike wide_to_long, the stubnames do not need to be stated - the function abstracts that for us. Also, it can handle duplicate indices, since it uses pd.melt under the hood.
One limitation of pivot_longer is that you have to install the pyjanitor library.
For the other example, I'll use stack and pivot_longer; you can still use pd.wide_to_long to solve it.
With stack:
first split the columns and convert into a MultiIndex:
temp = df.copy()
temp.columns = temp.columns.str.split(".", expand = True)
Reshape the data:
temp = temp.stack(1).droplevel(0).rename_axis('side')
Merge the column names:
temp.columns = temp.columns.map(".".join)
Reset the index:
temp.reset_index()
side foo.a foo.b foo.c foo.d foo.e foo.f foo.g
0 p1 4 1 2 3 4 5 6
1 p2 0 3 6 9 12 15 18
2 p1 0 4 8 12 16 20 24
3 p2 0 12 24 36 48 60 72
With pivot_longer, one option is to reorder the columns, before reshaping:
temp = df.copy()
temp.columns = ["".join([first, last, middle])
for first, middle, last in
temp.columns.str.split(r'(\.p\d)')]
(
temp
.pivot_longer(
index = None,
names_to = ('.value', 'side'),
names_pattern = r"(.+)\.(p\d)")
)
side foo.a foo.b foo.c foo.d foo.e foo.f foo.g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
In the dev version however, the column reorder is not necessary; we can simply use multiple .value to reshape the dataframe - note that you'll have to install from the repo to get the latest dev version:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
(df
.pivot_longer(
index = None,
names_to = ('.value', 'side', '.value'),
names_pattern = r"(.+)\.(.\d)(.+)")
)
side foo.a foo.b foo.c foo.d foo.e foo.f foo.g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
Another option with names_sep:
(df
.pivot_longer(
index = None,
names_to = ('.value', 'side', '.value'),
names_sep = r'\.(p\d)')
)
side foo.a foo.b foo.c foo.d foo.e foo.f foo.g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
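If you would rather stay in plain pandas for the arbitrary-substring case as well, here is a sketch along the lines of the stack approach above; it assumes the side token always matches p\d and extracts it from wherever it sits in the column name:
temp = df.copy()
# split each name into (before, side, after), e.g. "foo.p1.a" -> "foo", "p1", "a"
parts = temp.columns.str.extract(r"^(?:(.*)\.)?(p\d)(?:\.(.*))?$")
rest = (parts[0].fillna("") + "." + parts[2].fillna("")).str.strip(".")
temp.columns = pd.MultiIndex.from_arrays([parts[1], rest])
# stack the side level into the index, then tidy up as before
temp.stack(0).droplevel(0).rename_axis("side").reset_index()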

Replace pandas values as index to another array

Consider an array
a = np.array([5, 12, 56, 36])
and a pandas dataframe
b = pd.DataFrame(np.array([1, 3, 0, 3, 1, 0, 2]))
how does one replace the values of b by using them as indices into a? That is, the intended result is:
c = pd.DataFrame([12, 36, 5, 36, 12, 5, 56])
Can't quite figure this out.
One way is using apply,
c = b.apply(lambda x: a[x])
Or by indexing the numpy array and passing the values to DataFrame,
c = pd.DataFrame(a[b[0].values])
0
0 12
1 36
2 5
3 36
4 12
5 5
6 56
Let us try something different: Series.get
pd.Series(a).get(b[0])
Out[57]:
1 12
3 36
0 5
3 36
1 12
0 5
2 56
dtype: int32
map can be used (note that b's only column is labelled 0, not a):
b[0].map(dict(enumerate(a)))
0 12
1 36
2 5
3 36
4 12
5 5
6 56
Name: 0, dtype: int64
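One more possible route is DataFrame.replace with a position-to-value dict built from a. This is only a sketch: replace matches on values, so it behaves correctly here only because the replacements (5, 12, 56, 36) never collide with the keys (0-3):
import numpy as np
import pandas as pd

a = np.array([5, 12, 56, 36])
b = pd.DataFrame(np.array([1, 3, 0, 3, 1, 0, 2]))
# {0: 5, 1: 12, 2: 56, 3: 36} maps each cell through its position in a
c = b.replace(dict(enumerate(a)))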

Filter multiples in a pandas dataframe

My data can be easily converted into a pandas dataframe that looks something like:
import pandas as pd
data = {'a': ["t", "g"]*9,
        'b': [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
        'distance': [10, 15, 290, 300, 315, 320, 350, 360, 10, 25, 225, 240, 325, 335, 365, 205, 15, 35]}
df = pd.DataFrame(data, columns=['a', 'b', 'distance'])
print(df)
a b distance
0 t 1 10
1 g 2 15
2 t 3 290
3 g 4 300
4 t 5 315
5 g 6 320
6 t 1 350
7 g 2 360
8 t 3 10
9 g 4 25
10 t 5 225
11 g 6 240
12 t 1 325
13 g 2 335
14 t 3 365
15 g 4 205
16 t 5 15
17 g 6 35
I want to erase all the rows that share a value in the "b" column, keeping only the row with the smallest value in the "distance" column. In this example that means erasing all rows with a "distance" greater than 200, so that only the rows with index 0, 1, 8, 9, 16, 17 remain. In the end every row should have a distinct "b" value and the smallest "distance". It would look like:
a b distance
0 t 1 10
1 g 2 15
2 t 3 10
3 g 4 25
4 t 5 15
5 g 6 35
How could I do that?
groupby on the b column and call idxmin on the distance column to index the original df:
In [114]:
df.loc[df.groupby('b')['distance'].idxmin()]
Out[114]:
a b distance
0 t 1 10
1 g 2 15
8 t 3 10
9 g 4 25
16 t 5 15
17 g 6 35
Here you can see that idxmin returns the indices of the lowest values:
In [115]:
df.groupby('b')['distance'].idxmin()
Out[115]:
b
1 0
2 1
3 8
4 9
5 16
6 17
Name: distance, dtype: int64
Try this (note that min is applied to each column separately within a group, so the value in a may come from a different row than the minimum distance; it happens to match here):
df.groupby('b')[['a', 'b', 'distance']].min()
# a b distance
# b
# 1 t 1 10
# 2 g 2 15
# 3 t 3 10
# 4 g 4 25
# 5 t 5 15
# 6 g 6 35
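A sort-based alternative (a sketch that is equivalent for this data, and avoids the column-mixing caveat above): order by distance so the smallest row per b comes first, then drop duplicates:
# keep the first (smallest-distance) row for each value of b
df.sort_values('distance').drop_duplicates('b').sort_values('b')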