calculating probability from long series data in python pandas - pandas

I have a data ranging from 19 to 49. How can I calculate the probability of the data occurred in between 25 to 40?
46.58762816
30.50477684
27.4195249
47.98157313
44.55425608
30.21066503
34.27381019
48.19934524
46.82233375
46.05077036
42.63647302
40.11270346
48.04909583
24.18660332
24.47549276
44.45442651
19.24542913
37.44141763
28.41079638
21.69325455
31.32887617
26.26988582
18.19898804
19.01329026
28.33846808

Simplest you can do is to use the % of values that fall between 25 and 40.
If s is your pandas.Series you gave us:
In [1]: s.head()
Out[1]:
0 46.587628
1 30.504777
2 27.419525
3 47.981573
4 44.554256
Name: 0, dtype: float64
In [2]: # calculate number of values between 25 and 40 and divide by total count
s.between(25,40).sum()/float(s.count())
Out[2]: 0.3599
Otherwise it would require trying to find what distribution your data might be following (from the data you gave, which might be just a small sample of your data, it doesn't appear to be following any distribution I know...), testing if it actually follows the distribution you think it follows (using Kolmogorov-Smirnov test or another like it), then you can use that distribution to calculate the probability etc.

Related

Summing time series with slight variance in timestamps

I imagine that I have several time series like following, from different "sources":
time events
0 1000 1080000
1 2003 2122386
2 3007 3043985
3 4007 3872544
4 5007 4853763
Here, an monotonic increasing count events is sampled every 1000 ms. The sampling is not exact so most of the timestamps vary from their ideal values by a few ms - e.g., the second point is at 2003 instead of 2000.
I want to sum several of these time series: they will all be sampled at ~1000 ms but may not agree to the exact millsecond. E.g another time series could be:
time events
0 1000 1070000
1 2002 2122486
2 3006 3063985
3 4007 3872544
4 5009 4853763
I'd like something reasonable in terms of the final result. For example the same number of rows as each of the input dataframes, with a timestamp column the same as the first, or average of the inputs times. As long as the inputs are smooth, the outputs should be too.
I'd suggest DataFrame.reindex() with nearest method. Example:
def combine_datasources(reference_df, extra_dfs, tolerance_ms=100):
reindexed_df_list = [df.reindex(reference_df.index, method='nearest', tolerance=tolerance_ms) for df in extra_dfs]
combined = pd.concat([reference_df, *reindexed_df_list])
return combined.groupby(combined.index).sum()
combine_datasources(df_a, [df_b])
This code changes the index on the dataframes in the extra_dfs list to match the index for the reference dataframe. Then, it concatenates all of the dataframes together. It uses groupby to do the sum, which requires that the indexes match exactly to work. The timestamps will be the same as the one on the reference dataframe.
Note that if you have data from a time period not covered by the reference dataframe, that data will be dropped.
Here's the output for the dataset in your question:
events
time
1000 2150000
2003 4244872
3007 6107970
4007 7745088
5007 9707526

Pandas run function only on subset of whole Dataframe

Lets say i have Dataframe, which has 200 values, prices for products. I want to run some operation on this dataframe, like calculate average price for last 10 prices.
The way i understand it, right now pandas will go through every single row and calculate average for each row. Ie first 9 rows will be Nan, then from 10-200, it would calculate average for each row.
My issue is that i need to do a lot of these calculations and performance is an issue. For that reason, i would want to run the average only on say on last 10 values (dont need more) from all values, while i want to keep those values in the dataframe. Ie i dont want to get rid of those values or create new Dataframe.
I just essentially want to do calculation on less data, so it is faster.
Is something like that possible? Hopefully the question is clear.
Building off Chicodelarose's answer, you can achieve this in a more "pandas-like" syntax.
Defining your df as follows, we get 200 prices up to within [0, 1000).
df = pd.DataFrame((np.random.rand(200) * 1000.).round(decimals=2), columns=["price"])
The bit you're looking for, though, would the following:
def add10(n: float) -> float:
"""An exceptionally simple function to demonstrate you can set
values, too.
"""
return n + 10
df["price"].iloc[-12:] = df["price"].iloc[-12:].apply(add10)
Of course, you can also use these selections to return something else without setting values, too.
>>> df["price"].iloc[-12:].mean().round(decimals=2)
309.63 # this will, of course, be different as we're using random numbers
The primary justification for this approach lies in the use of pandas tooling. Say you want to operate over a subset of your data with multiple columns, you simply need to adjust your .apply(...) to contain an axis parameter, as follows: .apply(fn, axis=1).
This becomes much more readable the longer you spend in pandas. 🙂
Given a dataframe like the following:
Price
0 197.45
1 59.30
2 131.63
3 127.22
4 35.22
.. ...
195 73.05
196 47.73
197 107.58
198 162.31
199 195.02
[200 rows x 1 columns]
Call the following to obtain the mean over the last n rows of the dataframe:
def mean_over_n_last_rows(df, n, colname):
return df.iloc[-n:][colname].mean().round(decimals=2)
print(mean_over_n_last_rows(df, 2, "Price"))
Output:
178.67

Is there a way to use cumsum with a threshold to create bins?

Is there a way to use numpy to add numbers in a series up to a threshold, then restart the counter. The intention is to form groupby based on the categories created.
amount price
0 27 22.372505
1 17 126.562276
2 33 101.061767
3 78 152.076373
4 15 103.482099
5 96 41.662766
6 108 98.460743
7 143 126.125865
8 82 87.749286
9 70 56.065133
The only solutions I found iterate with .loc which is slow. I tried building a solution based on this answer https://stackoverflow.com/a/56904899:
sumvals = np.frompyfunc(lambda a,b: a+b if a <= 100 else b,2,1)
df['cumvals'] = sumvals.accumulate(df['amount'], dtype=np.object)
The use-case is to find the average price of every 75 sold amounts of the thing.
Solution #1 Interpreting the following one way will get my solution below: "The use-case is to find the average price of every 75 sold amounts of the thing." If you are trying to do this calculation the "hard way" instead of pd.cut, then here is a solution that will work well but the speed / memory will depend on the cumsum() of the amount column, which you can find out if you do df['amount'].cumsum(). The output will take about 1 second per every 10 million of the cumsum, as that is how many rows is created with np.repeat. Again, this solution is not horrible if you have less than ~10 million in cumsum (1 second) or even 100 million in cumsum (~10 seconds):
i = 75
df = np.repeat(df['price'], df['amount']).to_frame().reset_index(drop=True)
g = df.index // i
df = df.groupby(g)['price'].mean()
df.index = (df.index * i).astype(str) + '-' + (df.index * i +75).astype(str)
df
Out[1]:
0-75 78.513748
75-150 150.715984
150-225 61.387540
225-300 67.411182
300-375 98.829611
375-450 126.125865
450-525 122.032363
525-600 87.326831
600-675 56.065133
Name: price, dtype: float64
Solution #2 (I believe this is wrong but keeping just in case)
I do not believe you are tying to do it this way, which was my initial solution, but I will keep it here in case, as you haven't included expected output. You can create a new series with cumsum and then use pd.cut and pass bins=np.arange(0, df['Group'].max(), 75) to create groups of cumulative 75. Then, groupby the groups of cumulative 75 and take the mean. Finally, use pd.IntervalIndex to clean up the format and change to a sting:
df['Group'] = df['amount'].cumsum()
s = pd.cut(df['Group'], bins=np.arange(0, df['Group'].max(), 75))
df = df.groupby(s)['price'].mean().reset_index()
df['Group'] = pd.IntervalIndex(df['Group']).left.astype(str) + '-' + pd.IntervalIndex(df['Group']).right.astype(str)
df
Out[1]:
Group price
0 0-75 74.467390
1 75-150 101.061767
2 150-225 127.779236
3 225-300 41.662766
4 300-375 98.460743
5 375-450 NaN
6 450-525 126.125865
7 525-600 87.749286

Why can't I read all of the values in the matrix in scilab?

i am trying to read a csv file and my code is as follows
param=csvRead("C:\Users\USER\Dropbox\VOA-BK code\assets\Iris.csv",",","%i",'double',[],[],[1 2 3 4]); //reads number of clusters and features
data=csvRead("C:\Users\USER\Dropbox\VOA-BK code\assets\Iris.csv",",","%f",'double',[],[],[3 1 19 4]); //reads the values
numft=param(1,1);//save number of features
numcl=param(2,1);//save number of clusters
data_pts=0;
data_pts = max(size(data, "r"));//checks how many number of rows
disp(data(numft-3:data_pts,:));//print all data points (I added -3 otherwise it displays only 15 rows)
disp(numft);//print features
disp(data_pts);//print features
disp(param);
endfunction
below is the values that i am trying to read
features,4,,
clusters,3,,
5.1,3.5,1.4,0.2
4.9,3,1.4,0.2
4.7,3.2,1.3,0.2
4.6,3.1,1.5,0.2
5,3.6,1.4,0.2
7,3.2,4.7,1.4
6.4,3.2,4.5,1.5
6.9,3.1,4.9,1.5
5.5,2.3,4,1.3
6.5,2.8,4.6,1.5
5.7,2.8,4.5,1.3
6.3,3.3,6,2.5
5.8,2.7,5.1,1.9
7.1,3,5.9,2.1
6.3,2.9,5.6,1.8
6.5,3,5.8,2.2
7.6,3,6.6,2.1
I do not know why the code only displays 15 rows instead of 17. The only time it displays the correct matrix is when i put -3 in numft but with that, the number of columns would be 1. I am so confused. Is there a better way to read the values?
In the csvRead call in the first line of your script the boundaries of the region to read is incorrect, it should be corrected like this:
param=csvRead("C:\Users\USER\Dropbox\VOA-BK code\assets\Iris.csv",",","%i",'double',[],[],[1 2 2 2]);

Grouping nearby data in pandas

Lets say I have the following dataframe:
df = pd.DataFrame({'a':[1,1.1,1.03,3,3.1], 'b':[10,11,12,13,14]})
df
a b
0 1.00 10
1 1.10 11
2 1.03 12
3 3.00 13
4 3.10 14
And I want to group nearby points, eg.
df.groupby(#SOMETHING).mean():
a b
a
0 1.043333 11.0
1 3.050000 13.5
Now, I could use
#SOMETHING = pd.cut(df.a, np.arange(0, 5, 2), labels=False)
But only if I know the boundaries beforehand. How can I accomplish similar behavior if I don't know where to place the cuts? ie. I want to group nearby points (with nearby being defined as within some epsilon).
I know this isn't trivial because point x might be near point y, and point y might be near point z, but point x might be too far z; so then its ambiguous what to do--this is kind of a k-means problem, but I'm wondering if pandas has any tools built in to make this easy.
Use case: I have several processes that generate data on regular intervals, but they're not quite synced up, so the timestamps are close, but not identical, and I want to aggregate their data.
Based on this answer
df.groupby( (df.a.diff() > 1).cumsum() ).mean()