I have created a 2d numpy array with 2 rows and 5 columns.
import numpy as np
import pandas as pd
arr = np.zeros((2, 5))
arr[0] = [12, 94, 4, 4, 2]
arr[1] = [1, 3, 4, 12, 46]
I have also created a dataframe with two columns, col1 and col2:
list1 = [1,2,3,4,5]
list2 = [2,3,4,5,6]
df = pd.DataFrame({'col1': list1, 'col2': list2})
I used the pandas isin function with col1 and col2 to create a boolean Series, like this:
df['col1'].isin(df['col2'])
output
0 False
1 True
2 True
3 True
4 True
Now I want to use these bool values to slice the 2d array that I created before. I can do that for a single row, but not for the whole 2d array at once:
print(arr[0][df['col1'].isin(df['col2'])])
print(arr[1][df['col1'].isin(df['col2'])])
output:
[94. 4. 4. 2.]
[ 3. 4. 12. 46.]
But when I try the same for the whole array:
print(arr[df['col1'].isin(df['col2'])])
I get this error:
IndexError: boolean index did not match indexed array along dimension 0; dimension is 2 but corresponding boolean dimension is 5
Is there a way to achieve this?
The boolean mask has length 5, which matches the number of columns, so you should slice on the second dimension of the array:
arr[:, df['col1'].isin(df['col2'])]
output:
array([[94.,  4.,  4.,  2.],
       [ 3.,  4., 12., 46.]])
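The mask is aligned with the columns, not the rows. A small sketch, reusing the objects defined above, makes the shapes explicit:
mask = df['col1'].isin(df['col2']).to_numpy()
print(mask.shape, arr.shape)  # (5,) vs (2, 5): the mask matches axis 1
print(arr[:, mask])           # keeps both rows, drops column 0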
Related
I have a larger, more complex function that essentially boils down to the function f(x) below. Why can't this function take the output of linspace as a parameter?
This error is given:
"TypeError: only integer scalar arrays can be converted to a scalar index"
import numpy as np
import matplotlib.pyplot as plt
def f(x):
    for i in range(x):
        x += 1
    return x
xlist = np.linspace(0, 100, 10)
ylist = f(xlist)
plt.figure(num=0)
linspace produces a numpy array, not a list. Don't confuse the two.
In [3]: x = np.linspace(0,100,10)
In [4]: x
Out[4]:
array([ 0. , 11.11111111, 22.22222222, 33.33333333,
44.44444444, 55.55555556, 66.66666667, 77.77777778,
88.88888889, 100. ])
The numbers will look nicer if we take into account that linspace includes the end points:
In [5]: x = np.linspace(0,100,11)
In [6]: x
Out[6]: array([ 0., 10., 20., 30., 40., 50., 60., 70., 80., 90., 100.])
As an array, we can simply add a scalar; no need to write a loop:
In [7]: x + 1
Out[7]: array([ 1., 11., 21., 31., 41., 51., 61., 71., 81., 91., 101.])
range takes one number (or 3), but never an array (of more than 1 element):
In [8]: for i in range(5): print(i)
0
1
2
3
4
We can't modify i inside such a loop (or the x used to create the range) and change the iteration. But we can write a list comprehension:
In [9]: [i+1 for i in range(5)]
Out[9]: [1, 2, 3, 4, 5]
references:
https://www.w3schools.com/python/ref_func_range.asp
https://numpy.org/doc/stable/reference/generated/numpy.linspace.html
Applied to the original code, the scalar addition becomes a one-liner:
xlist = np.linspace(0, 100, 10) + 1
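If the loop's net effect in f really is doubling (it runs x times and adds 1 each pass, assuming the return sits outside the loop), the whole function also vectorizes; a sketch:
def f(x):
    # plain arithmetic works element-wise on arrays, no loop or range needed
    return 2 * x
ylist = f(np.linspace(0, 100, 11))  # no TypeError; operates on the whole array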
Based on your comment, it seems you are just trying to plot a few points. Your use of range and linspace is redundant, and you don't need a loop at all: a for loop is only needed if you are plotting multiple curves or doing something fancy with each point. For example, this plots 10 points:
x = np.linspace(0, 100, 10)
y = x * x
plt.plot(x, y, '.')
plt.show()
I think this is a good place to start: https://matplotlib.org/3.5.1/tutorials/introductory/pyplot.html
What is the easiest way (I am looking for the minimum number of code lines) to convert a pandas dataframe of 4 columns into a 3d tensor, padding the missing values along the way?
import pandas as pd
# initialize data of lists.
data = {'Animal': ['Cat', 'Dog', 'Dog', 'Dog'],
        'Country': ['USA', 'Canada', 'USA', 'Canada'],
        'Likes': ['Petting', 'Hunting', 'Petting', 'Petting'],
        'Age': [1, 2, 3, 4]}
# there are no duplicate lines in terms of Animal, Country and Likes, so I do not need any aggregation function
# Create DataFrame
dfAnimals = pd.DataFrame(data)
dfAnimals
I want to create a 3d tensor with shape (2, 2, 2) --> (Animal, Country, Likes), where Age is the value. I also want to fill the missing values with 0.
There might be a solution with fewer lines and more optimized library calls, but this seems to do the trick:
import pandas as pd
import numpy as np
import torch

data = ...  # the dict from the question
df = pd.DataFrame(data)
CAT = df.columns.tolist()
CAT.remove("Age")
# encode categories as integers and extract the shape
shape = []
for c in CAT:
    shape.append(len(df[c].unique()))
    df[c] = df[c].astype("category").cat.codes
shape = tuple(shape)
# get indices as tuples and the corresponding values
idx = [tuple(t) for t in df.values[:, :-1]]
values = df.values[:, -1]
# init the final array with zeros and fill it from the indices
A = np.zeros(shape)
for i, v in zip(idx, values):
    A[i] = v
# convert to a pytorch tensor
A = torch.tensor(A)
print(A)
print(A)
tensor([[[0., 0.],
         [0., 1.]],

        [[2., 4.],
         [0., 3.]]], dtype=torch.float64)
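A shorter variant (my own sketch, not part of the answer above): set the three categorical columns as a MultiIndex, reindex over the full cartesian product of its levels to pad the missing combinations with 0, and reshape:
s = dfAnimals.set_index(['Animal', 'Country', 'Likes'])['Age']
full = pd.MultiIndex.from_product(s.index.levels)
shape = [len(level) for level in s.index.levels]
A = torch.tensor(s.reindex(full, fill_value=0).to_numpy(dtype='float64').reshape(shape))
This relies on from_product iterating the last level fastest, which matches the C-order reshape.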
I have a Dataframe with MultiIndex. The two Levels are 'Nr' and 'Price'. Is it possible to use np.where on Index Level 1 ('Price') to create a new column ('ZZ')?
'ZZ' should be calculated as column 'first' multiplied by 2 wherever Level 1 ('Price') equals 'x'.
import pandas as pd
index = pd.MultiIndex.from_product([['s1', 's2','s3'],['x','y']])
df = pd.DataFrame([1, 2, 3, 4, 5, 6], index, columns=['first'])
df.index.names = ['Nr', 'Price']
df
I tried:
df['ZZ'] = np.where(df['Price']=='x',df['0']*2,np.nan)
I obtain a KeyError, since 'Price' is an index level rather than a regular column.
Thank you!
You should use get_level_values:
np.where(df.index.get_level_values(1)=='x', df['first']*2, np.nan)
array([ 2., nan, 6., nan, 10., nan])
To assign it as the new column:
df['ZZ'] = np.where(df.index.get_level_values(1)=='x', df['first']*2, np.nan)
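Putting it together (using the level name instead of its position, which is equivalent here):
import numpy as np
import pandas as pd
index = pd.MultiIndex.from_product([['s1', 's2', 's3'], ['x', 'y']], names=['Nr', 'Price'])
df = pd.DataFrame([1, 2, 3, 4, 5, 6], index, columns=['first'])
df['ZZ'] = np.where(df.index.get_level_values('Price') == 'x', df['first'] * 2, np.nan)
print(df)  # ZZ is 2.0, 6.0, 10.0 on the 'x' rows and NaN on the 'y' rows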
Problem description
The column 'a' has type integer, not float. The apply function should not change the type just because the dataframe has another, unrelated float column.
I understand why it happens: apply detects the most suitable common type for each row Series. I still consider it unintuitive that when I select a group of columns and apply a function that only works on ints, not on floats, removing one unrelated column suddenly produces an exception, because now all the columns are numeric and all the ints became floats.
>>> import pandas as pd
# This works.
>>> pd.DataFrame({'a': [1, 2, 3], 'b': ['', '', '']}).apply(lambda row: row['a'], axis=1)
0 1
1 2
2 3
dtype: int64
# Here we also expect 1, 2, 3, as above.
>>> pd.DataFrame({'a': [1, 2, 3], 'b': [0., 0., 0.]}).apply(lambda row: row['a'], axis=1)
0 1.0
1 2.0
2 3.0
# Why floats?!?!?!?!?!
# It's an integer column:
>>> pd.DataFrame({'a': [1, 2, 3], 'b': [0., 0., 0.]})['a'].dtype
dtype('int64')
Expected Output
0 1
1 2
2 3
dtype: int64
Specifically, in my problem I am trying to use the value inside the apply function to index into a list. I need this to be performant, and recasting to int inside the apply is too slow.
>>> pd.DataFrame({'a': [1, 2, 3], 'b': [0., 0., 0.]}).apply(lambda row: myList[row['a']], axis=1)
https://github.com/pandas-dev/pandas/issues/23230
This is the only source I could find describing the same problem.
It seems like your underlying problem is to index a list by the values in one of your DataFrame columns. This can be done by converting your list to an array, which you can then index normally:
Sample Data
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 0, 3], 'b': ['', '', '']})
myList = ['foo', 'bar', 'baz', 'boo']
Code:
np.array(myList)[df.a.to_numpy()]
#array(['bar', 'baz', 'boo'], dtype='<U3')
Or if you want the Series:
pd.Series(np.array(myList)[df.a.to_numpy()], index=df.index)
#0 bar
#1 foo
#2 boo
#dtype: object
Alternatively, with a list comprehension:
[myList[i] for i in df.a]
#['bar', 'foo', 'boo']
You are getting caught by pandas upcasting. Certain operations result in an upcast column dtype. The 0.24 docs describe these gotchas here: https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#gotchas
Here are a few examples of operations where this happens.
import pandas as pd
import numpy as np
print(pd.__version__)
# float64 is the default dtype of an empty dataframe
df = pd.DataFrame({'a': [], 'b': []})
print(df['a'].dtype)  # float64

# even if the default is changed, this is a surprise,
# because referring to all columns does convert to float
df = pd.DataFrame(columns=["col1", "col2"], dtype=np.int64)
# setting a new row upcasts "col1" and "col2" to float
df.loc["a", "col1":"col2"] = np.int64(0)
print(df.dtypes)

df = pd.DataFrame(columns=["col1", "col2"], dtype=np.int64)
# not upcast
df.loc[:, "col1"] = np.int64(0)
print(df.dtypes)
Taking a shot at a performant answer that works around such upcasting behavior:
import pandas as pd
import numpy as np
print(pd.__version__)
df = pd.DataFrame({'a': [1, 2, 3], 'b': [0., 0., 0.]})
df['a'] = df['a'].apply(lambda v: v + 1)
df['b'] = df['b'].apply(lambda v: v + 1)
print(df)
print(df['a'].dtype)
print(df['b'].dtype)
The dtypes are preserved:
0.24.2
   a    b
0  2  1.0
1  3  1.0
2  4  1.0
int64
float64
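The same per-column idea covers the original goal of indexing a list (reusing myList from the answer above); a sketch:
df = pd.DataFrame({'a': [1, 2, 3], 'b': [0., 0., 0.]})
myList = ['foo', 'bar', 'baz', 'boo']
# 'a' keeps its int64 dtype, so its values can index the list directly
out = df['a'].apply(lambda v: myList[v])
print(out)  # bar, baz, boo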
I tried to find entries in an array containing a substring with np.where and an in condition:
import numpy as np
foo = "aa"
bar = np.array(["aaa", "aab", "aca"])
np.where(foo in bar)
This only returns an empty array.
Why is that so?
And is there a good alternative solution?
We can use np.core.defchararray.find to find the position of the foo string in each element of bar; it returns -1 wherever foo is not found. Checking the output of find against -1 thus tells us whether foo is present in each element. Finally, we use np.flatnonzero to get the indices of the matches. So, we would have an implementation like so -
np.flatnonzero(np.core.defchararray.find(bar,foo)!=-1)
Sample run -
In [91]: bar
Out[91]:
array(['aaa', 'aab', 'aca'],
dtype='|S3')
In [92]: foo
Out[92]: 'aa'
In [93]: np.flatnonzero(np.core.defchararray.find(bar,foo)!=-1)
Out[93]: array([0, 1])
In [94]: bar[2] = 'jaa'
In [95]: np.flatnonzero(np.core.defchararray.find(bar,foo)!=-1)
Out[95]: array([0, 1, 2])
Look at some examples of using in:
In [19]: bar = np.array(["aaa", "aab", "aca"])
In [20]: 'aa' in bar
Out[20]: False
In [21]: 'aaa' in bar
Out[21]: True
In [22]: 'aab' in bar
Out[22]: True
In [23]: 'aab' in list(bar)
Out[23]: True
It looks like in, when used with an array, works as though the array were a list. ndarray does have a __contains__ method, so in works, but it is probably a simple whole-element test.
But in any case, note that in with a list does not check for substrings. The str __contains__ does do the substring test, but I don't know of any builtin class that propagates the test down to the component strings.
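A quick illustration of that difference (my own example):
In [24]: alist = ['aaa', 'aab', 'aca']
In [25]: 'aa' in alist
Out[25]: False
In [26]: any('aa' in s for s in alist)
Out[26]: True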
As Divakar shows, there is a collection of numpy functions that applies string methods to the individual elements of an array.
In [42]: np.char.find(bar, 'aa')
Out[42]: array([ 0, 0, -1])
Docstring:
This module contains a set of functions for vectorized string
operations and methods.
The preferred alias for defchararray is numpy.char.
For operations like this, I think the np.char speeds are about the same as with:
In [49]: np.frompyfunc(lambda x: x.find('aa'), 1, 1)(bar)
Out[49]: array([0, 0, -1], dtype=object)
In [50]: np.frompyfunc(lambda x: 'aa' in x, 1, 1)(bar)
Out[50]: array([True, True, False], dtype=object)
Further tests suggest that the ndarray __contains__ operates on the flat version of the array - that is, shape doesn't affect its behavior.
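A quick check of that flat behavior (my own example):
In [51]: 'aab' in bar.reshape(3, 1)
Out[51]: True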
If using pandas is acceptable, the str.contains method does exactly this.
import numpy as np
import pandas as pd

entries = np.array(["aaa", "aab", "aca"])
pd.Series(entries).str.contains('aa') # <----
Results in:
0 True
1 True
2 False
dtype: bool
The method also accepts regular expressions for more complex patterns:
pd.Series(entries).str.contains(r'a.a')
Results in:
0 True
1 False
2 True
dtype: bool
The way you are trying to use np.where is incorrect. The first argument of np.where should be a boolean array, and you are simply passing it a boolean.
>>> foo in bar
False
>>> np.where(False)
(array([], dtype=int32),)
>>> np.where(np.array([True, True, False]))
(array([0, 1], dtype=int32),)
The problem is that numpy does not define the in operator as an element-wise boolean operation.
One way you could accomplish what you want is with a list comprehension.
foo = 'aa'
bar = np.array(['aaa', 'aab', 'aca'])
out = [i for i, v in enumerate(bar) if foo in v]
# out = [0, 1]
bar = ['aca', 'bba', 'baa', 'aaf', 'ccc']
out = [i for i, v in enumerate(bar) if foo in v]
# out = [2, 3]
You can also build the boolean mask with a comprehension and index with it directly:
mask = np.array([foo in x for x in bar])
filtered = bar[mask]
# filtered = array(['aaa', 'aab'], dtype='<U3')