I have a pandas DataFrame of the form
"a" "b" "c" #first level index
0, 1, 2 0, 1, 2 0, 1, 2 #second level index
index
0 1,2,3 6,7,8 5,3,4
1 2,3,4 7,5,4 9,2,5
2 3,4,5 4,5,6 0,4,5
...
representing a spot (a, b, or c) where a measurement took place and the results of the measurements (0, 1, 2) taken on that spot.
I want to do the following:
pick a slice of the sample (say measurement 0 on each spot)
average each i-th measurement across the spots (mean("a"[0], "b"[0], "c"[0]), mean("a"[1], "b"[1], "c"[1]), ...)
I tried to get the hang of the pandas MultiIndex documentation but cannot manage to slice on the second level.
This is the column index:
MultiIndex(levels=[['a', 'b', 'c', ... , 'y'], [0, 1, 2, ... , 49]],
labels=[[0, 0, 0, ... , 0, 1, 1, 1, ... 1, ..., 49, 49, 49, ... 49]])
And this is the row index:
Float64Index([204.477752686, 204.484664917, 204.491577148, ..., 868.723022461], dtype='float64', name='wavelength', length=43274)
Using
df[:][0]
yields a KeyError (0 not in index).
df.iloc[0]
returns the horizontal slice
0 "a":(1,2,3), "b":(6,7,8), "c":(5,3,4)
but I would like to have
"a":(1,2,3), "b":(6,7,4), "c":(5,9,0)
Thanks for any help.
PS: versions: pandas 0.19, Python 3.4
The trick was to specify the axis...
df.loc(axis=1)[:,0]
provides the 0-th measurement of each spot.
Since I use integers on the second-level index, I am not sure whether this actually selects the label 0 or just the 0-th measurement in the DataFrame, label-agnostic.
But for my use-case, this is actually sufficient.
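For reference, .loc is label-based, so this should select the label 0 rather than the 0-th position. A minimal sketch on a small made-up frame (column layout as in the question; names are placeholders) showing two equivalent label-based spellings, plus the per-row mean over the selected spots:
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_product([["a", "b", "c"], [0, 1, 2]],
                                  names=["spot", "measurement"])
df = pd.DataFrame(np.arange(27).reshape(3, 9), columns=cols)

slice0 = df.loc[:, pd.IndexSlice[:, 0]]            # same selection as df.loc(axis=1)[:, 0]
slice0_xs = df.xs(0, axis=1, level="measurement")  # cross-section on the second level
means = slice0.mean(axis=1)                        # mean of a[0], b[0], c[0] per row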
Related
I have a series that I want to apply an external function to in subsets/chunks of three. Although the actual external function is more complex, for the sake of an example, let's just assume my external function takes an ndarray of integers and returns the sum of all values. So, for example:
series = pd.Series([1,1,1,1,1,1,1,1,1])
# Some pandas magic similar to:
result = series.resample(3).apply(myFunction)
# where 3 just represents every 3 values and
# result == pd.Series([3,3,3])
I looked at combining Series.resample and Series.apply as hinted at by the pseudocode above, but it appears resample depends on a datetime index. Any ideas on how I can effectively downsample by applying an external function like this without a datetime index? Or do you just recommend creating a temporary datetime index to do this and then reverting to the original index?
pandas groupby would do the trick here. What you need is a repeated index to specify the subsets/chunks.
Create chunks
import numpy as np
import pandas as pd

n = 3  # chunk size
repeat_idx = np.repeat(np.arange(0, len(series), n), n)[:len(series)]  # series as defined in the question
print(repeat_idx)
array([0, 0, 0, 3, 3, 3, 6, 6, 6])
Groupby
def myFunction(l):
    output = 0
    for item in l:
        output += item
    return output
series = pd.Series([1,1,1,1,1,1,1,1,1])
result = series.groupby(repeat_idx).apply(myFunction)
print(result)
0 3
3 3
6 3
The solution also works when the chunk size does not evenly divide the length of the series:
n = 4
repeat_idx = np.repeat(np.arange(0,len(series), n), n)[:len(series)]
print(repeat_idx)
array([0, 0, 0, 0, 4, 4, 4, 4, 8])
result = series.groupby(repeat_idx).apply(myFunction)
print(result)
0 4
4 4
8 1
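An equivalent, slightly shorter way to build the grouping key is integer division of the positional index by the chunk size (a sketch using the built-in sum for brevity; any callable can go into .apply, and the group labels become 0, 1, 2, ... instead of 0, 3, 6, ...):
import numpy as np
import pandas as pd

n = 3
series = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 1])
result = series.groupby(np.arange(len(series)) // n).sum()
# 0    3
# 1    3
# 2    3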
Using NumPy, given a sorted 1D array, how can I efficiently obtain a 1D array of the same size where the value at each position is the number of preceding equal elements? I have very large arrays, and processing each element in Python code one way or another is not acceptable.
Example:
input = [0, 0, 4, 4, 4, 5, 5, 5, 5, 6]
output = [0, 1, 0, 1, 2, 0, 1, 2, 3, 0]
import numpy as np
A = np.array([0, 0, 4, 4, 4, 5, 5, 5, 5, 6])
uni, counts = np.unique(A, return_counts=True)
out = np.concatenate([np.arange(n) for n in counts])
print(out)
Not certain about the efficiency (there is probably a better way to form the out array than concatenating), but this is a very straightforward way to get the result you are looking for: count the unique elements, call np.arange on each count to get the ascending sequences, then concatenate those arrays together.
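If concatenating many small arrays turns out to be a bottleneck, one possible fully vectorized variant (a sketch that likewise relies on the input being sorted) subtracts the start offset of each run instead:
import numpy as np

A = np.array([0, 0, 4, 4, 4, 5, 5, 5, 5, 6])
uni, counts = np.unique(A, return_counts=True)
starts = np.cumsum(counts) - counts                  # index at which each run of equal values begins
out = np.arange(len(A)) - np.repeat(starts, counts)
print(out)    # [0 1 0 1 2 0 1 2 3 0]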
I have a Pandas series of lists of categorical variables. For example:
df = pd.Series([["A", "A", "B"], ["A", "C"]])
Note that in my case the series is pretty long (50K elements) and the number of possible distinct elements in the list is also big (20K elements).
I would like to obtain a matrix having a column for each distinct feature and its count as value. For the previous example, this means:
[[2, 1, 0], [1, 0, 1]]
This is the same output as the one obtained with OneHot encoding, except that it contains the count instead of just 1.
What is the best way to achieve this?
Let's try explode:
df.explode().groupby(level=0).value_counts().unstack(fill_value=0)
Output:
A B C
0 2 1 0
1 1 0 1
To get the underlying array (rather than a DataFrame), chain the above with .values:
array([[2, 1, 0],
[1, 0, 1]])
Note that you will end up with a 50K x 20K array.
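If the dense 50K x 20K result is too large to hold in memory, one possible alternative (a sketch, assuming scikit-learn is available) is to build a sparse count matrix instead:
from collections import Counter
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse=True)                           # sparse output keeps memory manageable
counts = vec.fit_transform(df.apply(Counter).tolist())      # scipy.sparse matrix, one row per list
# counts.toarray() reproduces the dense result above for the small example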
I have three numpy arrays
a = [0, 1, 2, 3, 4]
b = [5, 1, 7, 3, 9]
c = [10, 1, 3, 3, 1]
and I want to compute how many elements of a, b, c are equal to 3 at the same position; for this example the result would be 3.
An elegant solution is to use Numpy functions, like:
np.count_nonzero(np.vstack([a, b, c])==3, axis=0).max()
Details:
np.vstack([a, b, c]) - generate an array with 3 rows, composed of your 3 source arrays.
np.count_nonzero(...==3, axis=0) - count how many values of 3 occur in each column. For your data the result is array([0, 0, 1, 3, 0], dtype=int64).
max() - take the greatest value, in your case 3.
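The same computation spelled out step by step, as a minimal runnable sketch:
import numpy as np

a = np.array([0, 1, 2, 3, 4])
b = np.array([5, 1, 7, 3, 9])
c = np.array([10, 1, 3, 3, 1])

stacked = np.vstack([a, b, c])                           # shape (3, 5)
per_position = np.count_nonzero(stacked == 3, axis=0)    # array([0, 0, 1, 3, 0])
print(per_position.max())                                # 3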
I have an array A of shape (50, 3) and another array B of shape (1, 3).
Actually this B is a row of A, so I need to find its row location.
I used np.where(A == B), but it gives the locations matched element-wise. For example, below is the result I got:
>>> np.where(A == B)
(array([ 3, 3, 3, 30, 37, 44]), array([0, 1, 2, 1, 2, 0]))
Actually B is the 4th row of A (in my case), but the above result gives (3,0), (3,1), (3,2) and others, which are element-wise matches.
Instead, I need the answer 3, i.e. the row index where B matches A as a whole, with partial matches like (30,1), (37,2), ... dropped.
How can I do this in NumPy?
Thank you.
You can reduce the comparison over the column axis so that only full-row matches count:
np.where((A == B).all(axis=1))
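A minimal self-contained check with made-up data of the same shapes (A here is hypothetical; note the trailing [0] to pull the row indices out of the tuple that np.where returns):
import numpy as np

A = np.arange(150).reshape(50, 3)    # hypothetical (50, 3) data
B = A[3:4]                           # B is the 4th row, shape (1, 3)

rows = np.where((A == B).all(axis=1))[0]
print(rows)    # [3]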