I'd like to generate a random integer, either 0 or 1, n times. I know the code to generate this:
num_samp = np.random.randint(0, 2, size=20)
However, I need to specify a condition such that 1 can only appear a set number of times and the rest should be zeros. For instance, if I want 1 to appear only 5 times among the 20 values above, I would get something like [0,1,0,0,0,1,1,0,0,0,1,0,1,0,0,0,0,0,0,0].
Can someone help with the code to generate something like this? Thanks
Then it looks more like shuffling an array of 0s and 1s:
import numpy as np

N, k = 20, 5  # total number of values, and number of 1s
arr = np.zeros(N, dtype=int)
arr[:k] = 1              # place the k ones at the front...
np.random.shuffle(arr)   # ...then shuffle them into random positions
# arr now contains random 0s and 1s, with exactly five 1s. For example:
# array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0])
Check numpy.random.choice().
It draws random samples from a range (or array), can accept a probability distribution, and can sample without replacement.
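A minimal sketch of that idea, assuming the goal is exactly k ones among N values: draw k distinct index positions without replacement and set them to 1.

import numpy as np

N, k = 20, 5
num_samp = np.zeros(N, dtype=int)
# choice with replace=False gives k distinct positions for the 1s
ones_at = np.random.choice(N, size=k, replace=False)
num_samp[ones_at] = 1
print(num_samp)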
Background
Suppose there is a list of values representing a person's activity over several hours. The person did not move at all during those hours, so all the values are 0.
What raised the question?
Searching on Google, I found the standard formula for skewness; the same formula appears on several other sites as well. Its denominator contains the standard deviation (SD). For a list of identical non-zero values (e.g., [1, 1, 1]), and equally for all zeros (i.e., [0, 0, 0]), the SD is 0. Therefore, I expected to get NaN (something divided by 0) for the skewness. Surprisingly, I get 0 when calling pandas.DataFrame.skew().
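(Presumably the formula in question is the adjusted Fisher-Pearson sample skewness, which also matches the pandas source quoted in the answer below; note that sqrt(m2) is the SD, hence the SD in the denominator:)

G1 = sqrt(n*(n-1)) / (n-2) * m3 / m2**1.5,   where m_k = mean((x - mean(x))**k)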
My Question
Why does pandas.DataFrame.skew() return 0 when the SD of a list of values is 0?
Minimum Reproducible Example
import pandas as pd

ot_df = pd.DataFrame(data={'Day 1': [0, 0, 0, 0, 0, 0],
                           'Day 2': [0, 0, 0, 0, 0, 0],
                           'Day 3': [0, 0, 0, 0, 0, 0]})
print(ot_df.skew(axis=1))
Note: I have checked several Q&As on this site (e.g., this one: How does pandas calculate skew?) and elsewhere (e.g., this one on GitHub), but I did not find the answer to my question.
You can find the implementation here:
https://github.com/pandas-dev/pandas/blob/main/pandas/core/nanops.py
As you can see there is a:
with np.errstate(invalid="ignore", divide="ignore"):
    result = (count * (count - 1) ** 0.5 / (count - 2)) * (m3 / m2 ** 1.5)

dtype = values.dtype
if is_float_dtype(dtype):
    result = result.astype(dtype)

if isinstance(result, np.ndarray):
    result = np.where(m2 == 0, 0, result)
    result[count < 3] = np.nan
else:
    result = 0 if m2 == 0 else result
    if count < 3:
        return np.nan
As you can see, if m2 (the sum of squared deviations, which is 0 for any constant series) equals 0, the result is forced to 0.
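A quick sketch of the difference this makes, with m2 and m3 computed as sums of squared/cubed deviations as in the pandas source: the raw formula yields NaN, while pandas maps the m2 == 0 case to 0.

import numpy as np
import pandas as pd

s = pd.Series([0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
n = len(s)
m2 = ((s - s.mean()) ** 2).sum()  # sum of squared deviations -> 0.0
m3 = ((s - s.mean()) ** 3).sum()  # sum of cubed deviations   -> 0.0

with np.errstate(invalid="ignore", divide="ignore"):
    raw = (n * (n - 1) ** 0.5 / (n - 2)) * (m3 / m2 ** 1.5)

print(raw)       # nan -- the 0/0 you expected
print(s.skew())  # 0.0 -- pandas replaces the m2 == 0 case with 0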
If you are asking why it is implemented this way, I can only speculate. I suppose it was done for practical reasons: when you calculate skewness, you want to check whether the distribution of a variable is symmetrical (and you can argue that a constant distribution indeed is: https://stats.stackexchange.com/questions/114823/skewness-of-a-random-variable-that-have-zero-variance-and-zero-third-central-mom).
EDIT: It was done due to:
https://github.com/pandas-dev/pandas/issues/11974
https://github.com/pandas-dev/pandas/pull/12121
You could probably open an issue proposing a flag to control this method's behaviour for constant-valued input. It should be easy to fix.
Using numpy, given a sorted 1D array, how can I efficiently obtain a 1D array of equal size where the value at each position is the number of preceding equal elements? I have very large arrays, and processing each element in Python code, one way or another, is not acceptable.
Example:
input = [0, 0, 4, 4, 4, 5, 5, 5, 5, 6]
output = [0, 1, 0, 1, 2, 0, 1, 2, 3, 0]
import numpy as np

A = np.array([0, 0, 4, 4, 4, 5, 5, 5, 5, 6])
uni, counts = np.unique(A, return_counts=True)
out = np.concatenate([np.arange(n) for n in counts])
print(out)  # [0 1 0 1 2 0 1 2 3 0]
Not certain about the efficiency (there is probably a better way to form the out array than concatenating), but this is a very straightforward way to get the result you are looking for: count the unique elements, build an ascending np.arange for each count, then concatenate these arrays together.
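If the Python-level loop over the counts becomes a bottleneck, here is a sketch of a fully vectorized alternative that exploits the fact that the input is already sorted: find where each run of equal values starts, then subtract each run's start index from every position in that run.

import numpy as np

a = np.array([0, 0, 4, 4, 4, 5, 5, 5, 5, 6])

# index of the first element of each run of equal values
starts = np.r_[0, np.flatnonzero(a[1:] != a[:-1]) + 1]
# broadcast each run's start index across the run, then subtract
run_lengths = np.diff(np.r_[starts, len(a)])
out = np.arange(len(a)) - np.repeat(starts, run_lengths)
print(out)  # [0 1 0 1 2 0 1 2 3 0]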
In my table, each row holds an integer; let's call this integer column height.
for example, the values in this column could be: [5, 4, 0, 1, 9]
I need the sequence in this example to be rearranged into [0, 1, 4, 5, 9], and then I want these values to be replaced by their positions: [1, 2, 3, 4, 5].
The idea I am trying to express in SQL is to take the minimum first and, for each value, count how many values are less than it.
How can I write/translate this idea into an SQL query?
You seem to want a ranking:
select t.*, row_number() over (order by height) as height_seqnum
from t
order by height_seqnum;
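For the sample values above, assuming a one-column table t(height), this would return:

height | height_seqnum
-------+--------------
     0 | 1
     1 | 2
     4 | 3
     5 | 4
     9 | 5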
I have a pandas Dataframe of the form
"a" "b" "c" #first level index
0, 1, 2 0, 1, 2 0, 1, 2 #second level index
index
0 1,2,3 6,7,8 5,3,4
1 2,3,4 7,5,4 9,2,5
2 3,4,5 4,5,6 0,4,5
...
representing the spot ("a", "b" or "c") where a measurement took place and the results of the measurements (0, 1, 2) taken at that spot.
I want to do the following:
pick a slice of the sample (say measurement 0 on each spot)
average each i-th measurement across spots: (mean("a"[0], "b"[0], "c"[0]), mean("a"[1], "b"[1], "c"[1]), ...)
I tried to get the hang of the pandas MultiIndex documentation but did not manage to slice on the second level.
This is the column index:
MultiIndex(levels=[['a', 'b', 'c', ..., 'y'], [0, 1, 2, ..., 49]],
           labels=[[0, 0, 0, ..., 0, 1, 1, 1, ..., 1, ..., 49, 49, 49, ..., 49]])
And the row index:
Float64Index([204.477752686, 204.484664917, 204.491577148, ..., 868.723022461], dtype='float64', name='wavelength', length=43274)
Using
df[:][0]
raises a KeyError (0 not in index), and
df.iloc[0]
returns the horizontal slice
0 "a":(1,2,3), "b":(6,7,8), "c":(5,3,4)
but I would like to have
"a":(1,2,3), "b":(6,7,4), "c":(5,9,0)
Thanks for any help.
PS: version:pandas-0.19, python-3.4
The trick was to specify the axis:
df.loc(axis=1)[:,0]
provides measurement 0 of each spot.
Since I use integers in the second-level index, I am not sure whether this actually selects by the label "0" or just the 0-th measurement in the DataFrame positionally, label-agnostic.
But for my use-case, this is actually sufficient.
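For what it's worth, .loc always selects by label, so this picks the label 0, not the 0-th position. Below is a minimal sketch reproducing the layout above with the question's numbers; pd.IndexSlice is an equivalent spelling.

import pandas as pd

cols = pd.MultiIndex.from_product([['a', 'b', 'c'], [0, 1, 2]],
                                  names=['spot', 'measurement'])
df = pd.DataFrame([[1, 2, 3, 6, 7, 8, 5, 3, 4],
                   [2, 3, 4, 7, 5, 4, 9, 2, 5],
                   [3, 4, 5, 4, 5, 6, 0, 4, 5]], columns=cols)

first = df.loc(axis=1)[:, 0]   # measurement 0 of every spot, selected by label
# equivalent: df.loc[:, pd.IndexSlice[:, 0]]
print(first)
print(first.mean(axis=1))      # mean over the spots, per row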