I have a numpy array. I want to keep the first number, replace each following number with its difference from the previous number, and replace the last number with zero.
a=np.array([10,20,22,44])
expected output
np.array([10,10,2,0])
I tried the np.diff function, but it drops the first number. I hope the experts can suggest a better solution.
You want a custom output, so use a custom concatenation:
out = np.r_[a[0], np.diff(a[:-1]), 0]
Output: array([10, 10, 2, 0])
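An alternative sketch, assuming NumPy 1.16 or later (where np.diff accepts a prepend argument): prepend a zero so the first element survives the differencing, then overwrite the last entry. The helper name diff_keep_first is made up for illustration, and the input is assumed to have at least two elements.

```python
import numpy as np

def diff_keep_first(a):
    """Keep a[0], take successive differences, and zero the last entry.

    Hypothetical helper; assumes a is a 1-D array with at least two elements.
    """
    a = np.asarray(a)
    out = np.diff(a, prepend=0)  # out[0] == a[0] - 0, out[i] == a[i] - a[i-1]
    out[-1] = 0                  # the question asks for a zero in the last slot
    return out

a = np.array([10, 20, 22, 44])
print(diff_keep_first(a))               # [10 10  2  0]
print(np.r_[a[0], np.diff(a[:-1]), 0])  # same result via concatenation
```

Both forms give the same result for this input; the concatenation version never computes the last difference, while this one computes it and then discards it.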
I spent a few hours on this, so any help would be amazing!
I have a pandas dataframe df. Then I group by one of the columns (A), focus on another column (B) and get the mean of each group:
group_mean = df.groupby('A').B.agg('mean')
group = df.groupby('A').B
In the same order as above, these are the types Python reports:
<class 'pandas.core.series.Series'>
<class 'pandas.core.groupby.generic.SeriesGroupBy'>
Now the question: how can I, for each group in "group", identify the index of the first element that is equal to or greater than the mean? In other words, if a group has elements 5, 3, 7, 9, 1, 10 and the mean is 8, I want to return the value 3 (to point to "9").
The result can be another groupby object with one number per group (the index).
Thanks in advance!
You can use apply to flag, per group, the values greater than or equal to the mean, and idxmax to get the index label of the first True value:
df.groupby('A')['B'].apply(lambda x: x.ge(x.mean()).idxmax())
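A small sketch of this on made-up data (the frame below is an assumption, not the asker's data). Note that idxmax returns the index label of the original frame; if the position inside each group is wanted instead, one option is to run argmax on the underlying boolean array.

```python
import pandas as pd

# Made-up example data with a default 0..8 integer index.
df = pd.DataFrame({
    "A": ["x"] * 6 + ["y"] * 3,
    "B": [5, 3, 7, 9, 1, 10, 2, 4, 6],
})

# Index label of the first value >= the group mean.
first_label = df.groupby("A")["B"].apply(lambda x: x.ge(x.mean()).idxmax())

# 0-based position within each group instead of the index label.
first_pos = df.groupby("A")["B"].apply(
    lambda x: int(x.ge(x.mean()).to_numpy().argmax())
)

print(first_label.tolist())  # [2, 7]
print(first_pos.tolist())    # [2, 1]
```

Group "x" has mean 35/6, so the first qualifying value is 7 (label 2); group "y" has mean 4, so the first qualifying value is 4 (label 7, position 1 within the group).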
I have a DataFrame with 28 columns (features) and 600 rows (instances). I want to select all rows, but only columns from 0-12 and 16-27, meaning that I don't want to select columns 12-15.
I wrote the following code, but it doesn't work and throws a syntax error at : in 0:12 and 16:. Can someone help me understand why?
X = df.iloc[:,[0:12,16:]]
I know there are other ways of selecting these columns, but I am curious to learn why this one does not work, and how I should write it to work (if there is a way).
For now, I have written it as:
X = df.iloc[:,0:12]
X = X + df.iloc[:,16:]
Which seems to return an incorrect result, because I have already treated the NaN values of df, but when I use this code, X includes lots of NaNs!
Thanks for your feedback in advance.
You can use np.r_ to concatenate the slices:
# np.r_ needs an explicit stop: an open-ended 16: is not expanded to the last column
x = df.iloc[:, np.r_[0:12, 16:df.shape[1]]]
iloc has these allowed inputs (from the docs):
An integer, e.g. 5.
A list or array of integers, e.g. [4, 3, 0].
A slice object with ints, e.g. 1:7.
A boolean array.
A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above). This is useful in method chains, when you don’t have a reference to the calling object, but would like to base your selection on some value.
What you're passing to iloc in X = df.iloc[:,[0:12,16:]] is not a list of integers or a slice of ints, but a list of slice objects. You need to convert those slices to a list of integers, and the best way to do that is using the numpy.r_ function.
X = df.iloc[:, np.r_[0:13, 16:28]]
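A minimal sketch on a made-up 3x28 frame (the data is an assumption for illustration). It shows that np.r_ expands the slices into a plain integer array, and also why the X + df.iloc[:,16:] attempt in the question fills the result with NaN: + is element-wise addition aligned on column labels, not concatenation. pd.concat with axis=1 is one way to join the two column blocks instead.

```python
import numpy as np
import pandas as pd

# Made-up frame: 3 rows, 28 columns labelled 0..27.
df = pd.DataFrame(np.arange(3 * 28).reshape(3, 28))

cols = np.r_[0:13, 16:28]        # integer array: 0..12 followed by 16..27
X = df.iloc[:, cols]
print(X.shape)                   # (3, 25)

# Addition aligns on column labels; the two blocks share no labels,
# so every cell of the result is NaN.
added = df.iloc[:, 0:12] + df.iloc[:, 16:]
print(added.isna().all().all())  # True

# Concatenating along axis=1 places the two blocks side by side.
Y = pd.concat([df.iloc[:, 0:12], df.iloc[:, 16:]], axis=1)
print(Y.shape)                   # (3, 24)
```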
Is it possible to initialize a pandas SparseArray by providing only the dense entries? I could not figure this out from the documentation: http://pandas.pydata.org/pandas-docs/stable/sparse.html .
For example, say I want a length 1000 SparseArray with a one at index 9 and zeros everywhere else, how would I go about creating it? This is one way:
a = [0] * 1000
a[9] = 1
sparse_a = pd.SparseArray(data=a, fill_value=0)
But, in the above, we have to create the dense array before the sparse one. Is there a way to specify only the indices and the dense entries to create the SparseArray directly?
A length 10 SparseArray with a one as the 9th element (index 8) and zeros everywhere else:
pd.SparseArray(1, index= range(1), kind='block',
sparse_index= BlockIndex(10, [8], [1]),
fill_value=0)
Notes:
index can be any list as long as its length equals the number of non-sparse entries in the array (the smaller part of the data), in this case the number of 1s in the sparse array.
BlockIndex(10, [8], [1]) is the object pointing to the positions of the non-sparse part of the data: the first argument is the TOTAL length of the array (sparse + non-sparse), the second argument is a list of starting positions of the non-sparse blocks, and the third argument is a list of the lengths of those blocks. Note that the length of the index list described in the first note is the sum of all elements of the third argument of this BlockIndex.
A more general example: to make a length 20 SparseArray where the 2nd, 3rd, 6th, 7th and 8th elements are 1 and the rest are 0:
pd.SparseArray(1, index= range(5), kind='block',
sparse_index= BlockIndex(20, [1,5], [2,3]),
fill_value=0)
or
pd.SparseArray(1, index= [None, 3, 2, 7, np.inf], kind='block',
sparse_index= BlockIndex(20, [1,5], [2,3]),
fill_value=0)
Sadly, I don't know of a good way to pass an array of non-sparse data as the first argument to SparseArray. That does not mean it can't be done; this is only a disclaimer. I think that as long as you specify index=..., pandas requires a scalar for the first argument (the data).
Tested on Windows 7 with pandas 0.20.2, installed via Anaconda.
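In recent pandas releases the class is available as pd.arrays.SparseArray. A rough sketch of building it from positions and stored values only is below. It assumes IntIndex can be imported from pandas._libs.sparse, which is a private module and may change between releases, so treat this as an illustration rather than a supported recipe.

```python
import numpy as np
import pandas as pd
from pandas._libs.sparse import IntIndex  # private module: an assumption, may move

length = 1000
positions = [9]  # 0-based positions of the stored (non-fill) entries
values = [1]     # one stored value per position

sparse_a = pd.arrays.SparseArray(
    values,
    sparse_index=IntIndex(length, positions),
    fill_value=0,
    dtype=np.int64,
)

# Densify only to check the result; the sparse array itself stores one value.
dense = np.asarray(sparse_a)
print(len(sparse_a), dense[9], dense.sum())
```

Here the array passed as data holds only the stored values, and the IntIndex says where they go, so no dense list of 1000 zeros is built up front.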
I have a 2-dimensional numpy array, say as follows:
[["cat","dog","dog","mouse","man"],
["rhino","rhino","bat","rhino","dino","dino"],
["zebra","alien","alien","alien","alien"]]
I want to perform numpy.unique along each row in order to count the number of occurrences of each label. Unfortunately, I don't think this is possible, as numpy.unique would return vectors of different lengths:
[["cat","dog","mouse","man"]
["rhino","bat","dino"]
["zebra","alien"]]
(and similarly for the counts), so this obviously won't work.
Does anybody know of a way I can get around this problem?
Try this:
a = pd.DataFrame([["cat","dog","dog","mouse","man"],
["rhino","rhino","bat","rhino","dino","dino"],
["zebra","alien","alien","alien","alien"]])
a.apply(lambda x: pd.Series(x.unique()), axis=1)
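If the counts are needed as well, one option is to call np.unique with return_counts=True on each row separately and keep the results in a list of dicts, which sidesteps the different-length problem. This is a sketch; the helper name row_counts is made up.

```python
import numpy as np

rows = [
    ["cat", "dog", "dog", "mouse", "man"],
    ["rhino", "rhino", "bat", "rhino", "dino", "dino"],
    ["zebra", "alien", "alien", "alien", "alien"],
]

def row_counts(row):
    """Return {label: count} for one row using np.unique (hypothetical helper)."""
    labels, counts = np.unique(row, return_counts=True)
    return dict(zip(labels.tolist(), counts.tolist()))

per_row = [row_counts(row) for row in rows]
print(per_row[1])  # {'bat': 1, 'dino': 2, 'rhino': 3}
```

np.unique returns the labels in sorted order, so each dict is ordered alphabetically rather than by first appearance.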
I have a numpy.ndarray in which the maximum value will mostly occur more than once.
EDIT: This is subtly different from "numpy.argmax: how to get the index corresponding to the *last* occurrence, in case of multiple occurrences of the maximum values", because the author of that question says:
"Or, even better, is it possible to get a list of indices of all the occurrences of the maximum value in the array?"
whereas in my case getting such a list may prove very expensive.
Is it possible to find the index of the last occurrence of the maximum value by using something like numpy.argmax? I want to find only the index of the last occurrence, not an array of all occurrences (since there may be several hundred).
For example, this will return the index of the first occurrence, i.e. 2:
import numpy as np
a=np.array([0,0,4,4,4,4,2,2,2,2])
print(np.argmax(a))
However I want it to output 5.
numpy.argmax only returns the index of the first occurrence. You could apply argmax to a reversed view of the array:
import numpy as np
a = np.array([0,0,4,4,4,4,2,2,2,2])
b = a[::-1]
i = len(b) - np.argmax(b) - 1
i # 5
a[i:] # array([4, 2, 2, 2, 2])
Note numpy doesn't copy the array but instead creates a view of the original with a stride that accesses it in reverse order.
id(a) == id(b.base) # True
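The same idea wrapped in a small function, as a sketch. The name last_argmax is made up, and the input is assumed to be a non-empty 1-D array.

```python
import numpy as np

def last_argmax(a):
    """Index of the last occurrence of the maximum in a 1-D array.

    Hypothetical helper; assumes a is non-empty and one-dimensional.
    """
    a = np.asarray(a)
    # argmax on the reversed view hits the last maximum first;
    # convert that position back to an index into the original array.
    return len(a) - 1 - int(np.argmax(a[::-1]))

a = np.array([0, 0, 4, 4, 4, 4, 2, 2, 2, 2])
print(np.argmax(a))    # 2 (first occurrence)
print(last_argmax(a))  # 5 (last occurrence)
```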
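If your array is made up of integers and has fewer than 1e15 elements, you can also handle this by adding a small noise term that increases linearly with the index, so that later occurrences of the maximum become slightly larger. This only works while the values are small enough that the added noise survives floating-point rounding; for large values the noise is rounded away and argmax returns the first occurrence again.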
>>> import numpy as np
>>> a = np.array([0,0,4,4,4,4,2,2,2,2])
>>> noise = np.arange(len(a)) * 1e-15
>>> print(np.argmax(a + noise))
5