How to redistribute outliers over the previous time period? - pandas

Imagine a dataframe that looks like this:
1
2
3
4
5
6
7
50
16
17
Normally we would apply an algorithm from Detect and exclude outliers in a pandas DataFrame to entirely remove the 50; however, my particular dataset instead requires me to distribute the value of the 50 over the previous 7 days:
8
9
10
11
12
13
14
15
16
17
How can I make this work in pandas? I can detect the outliers pretty easily, but I'm not sure how to spread their values out into previous days. Note that a simple moving average doesn't work well for this type of data, as there would still be a jump in the average value when the 50 shows up. What I need to do is smooth the 50 out into the previous days so that no jump is visible.
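One way to approach it, as a sketch rather than a canonical recipe: flag outliers with a z-score test (the threshold of 2 is an assumption), replace each one with an interpolated value, and spread the excess evenly over up to 7 preceding rows so the column total is preserved. The question's expected numbers may follow a slightly different split, but the jump disappears:

import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 50, 16, 17], dtype=float)

# flag outliers with a simple z-score test
z = (s - s.mean()) / s.std()
outliers = z.abs() > 2

# interpolated replacement values for the flagged rows
replacement = s.mask(outliers).interpolate()

for i in s.index[outliers]:
    prev = s.index[s.index < i][-7:]  # up to 7 preceding rows (assumes at least one)
    excess = s[i] - replacement[i]
    s[i] = replacement[i]
    s[prev] += excess / len(prev)     # spread the excess evenly

print(s)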

Related

One step left to filter consecutive numbers in GPS data according to conditions

Many posts have been published about filtering desired data from a dataframe. I reviewed most of them and noticed a gap: responders mostly just correct the code or recommend a line to solve the presented issue. So I would be grateful if, in your answer, you could also recommend training resources to improve our knowledge of filtering data.
ID CI timestamp speed Lat Long
1 1 2013-01-08 10:22:36 20 23.01 33.54
2 1 2013-01-08 10:22:42 21 23.04 33.54
3 1 2013-01-08 10:22:47 25 23.05 33.54
4 1 2013-01-08 10:22:51 10 23.06 33.54
5 2 2013-01-08 10:22:27 24 23.07 33.54
6 2 2013-01-08 10:22:29 18 23.08 33.54
7 1 2013-01-08 10:33:15 07 23.09 33.55
8 1 2013-01-08 10:33:36 20 24.01 33.55
9 1 2013-01-08 10:33:42 21 24.11 33.55
10 1 2013-01-08 10:33:47 25 24.14 33.55
11 1 2013-01-08 10:33:51 10 24.21 33.55
12 1 2013-01-08 10:33:57 24 24.31 33.55
13 1 2013-01-08 10:33:59 10 24.51 33.55
14 1 2013-01-08 10:34:04 24 24.61 33.55
I have a dataframe that includes 200 thousand records. It has three columns: cyclist ID (CI), timestamp, and speed. Per CI, the time difference (TD) is calculated by subtracting the timestamps of two successive rows; I used groupby to calculate TD per CI. Then I dropped the CIs whose number of records was less than five. In the presented data sample, CI "2" is eliminated because it has only 2 records. Next, I counted the lengths of the runs of consecutive values whose time difference is less than seven and put them in a list for each CI. Obviously, each CI can have runs of various lengths; for instance, in the sample shown, CI 1 has runs of lengths 4 and 8.
What step is still left as the final one to finish the task:
I need to check the generated list for each CI and save the records belonging to the longest run of consecutive values to a new CSV file.
Below is my code; I need to complete it, and a faster solution is also welcome.
from itertools import groupby

import pandas as pd

grouped = df.sort_values(by='timestamp').groupby('CI')
for i in grouped.groups.keys():
    p = grouped.get_group(i).copy()
    if len(p.index) > 13:
        p['Time_diff'] = pd.to_datetime(p['timestamp'].astype(str)).diff(1).dt.total_seconds()
        # lengths of the runs of consecutive rows whose time difference is below 7
        y = [len(list(g)) for k, g in groupby(p['Time_diff'] < 7) if k]
        if len(y) != 0:
            if max(y) > 5:
                ######## till here, my code works perfectly ########
                # ??? generating code to save only the consecutive values
                # having a maximum length for each CI
                p.to_csv('D:/out/' f"{i}.csv", sep=';')
Expected output:
In this case, only the last 7 records are saved to a CSV file.
Thanks in advance
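A sketch of the missing step (not the asker's final code; it labels each run of consecutive rows with Time_diff < 7 via a cumulative-sum trick and writes only the longest run, reusing p and i from the loop above):

fast = p['Time_diff'] < 7
# the counter increases whenever the condition flips, so rows in the
# same consecutive run share one label
run_id = (fast != fast.shift()).cumsum()

# sizes of the runs that satisfy the condition; keep only the longest
run_sizes = run_id[fast].value_counts()
if not run_sizes.empty and run_sizes.max() > 5:
    longest = run_sizes.idxmax()
    # note: a run of N small diffs spans N+1 records; whether to also
    # include the run's first record is left as a detail
    p[fast & (run_id == longest)].to_csv('D:/out/' f"{i}.csv", sep=';')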

Why does Pandas df.mode() return a zero before the actual modal value?

When I run df.mode() on the below dataframe I get a leading zero before the expected output. Why is that?
df
sample 1 2 3 4 5 6 7 8 9 10
zone run
2 5 14 12 22 23 24 22 23 22 23 23
print(df.iloc[:,3:10].mode(axis=1))
gives
0
zone run
2 5 23
expecting
zone run
2 5 23
pd.Series.mode
Return the mode(s) of the dataset. Always returns Series even if only one value is returned.
So that's how it is by design: the result must have an index, and it starts counting from 0. This ensures that the return type is stable regardless of whether there is a single mode or multiple values tied for the mode.
So if you take a slice where values are tied for the mode, the return carries the labels 0, ..., N-1 for the N values tied for the mode (modal values in sorted order).
df.iloc[:, 4:7]
#sample 5 6 7
#zone run
#2 5 24 22 23
df.iloc[:,4:7].mode(axis=1)
# 0 1 2 # <- 3 values tied for mode so 3 labels
#zone run
#2 5 22 23 24
My thinking is: df.mode returns a dataframe. By default, if no column names are given, a dataframe allocates integer positions as column names; here 0 is allocated because that is where pandas/Python begins counting.
Because it is a dataframe, the way to change the column name (which in this case is the integer 0) is the .rename(columns=...) method. Hence, to get what you need you will have to:
df.iloc[:,3:10].agg('mode', axis=1).reset_index().rename(columns={0:''})
zone run
0 2 5 23
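If you only want the modal values without the 0 label, selecting the first column of the result also works (a minimal sketch; mode(axis=1) sorts ties, so column 0 holds the smallest modal value per row):

modes = df.iloc[:, 3:10].mode(axis=1)  # DataFrame with columns 0..N-1
first_mode = modes[0]                  # Series: first modal value per row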

How to Create a CDF out of a PDF in SQL

So I have a datatable that looks something like the following. id represents an object, bin represents how I am segmenting the data, and percent is how much of the data falls into that bin.
id bin percent
2 8 0.20030698388
2 16 0.14504988488
2 24 0.12356101304
2 32 0.09976976208
2 40 0.09056024558
2 48 0.07137375287
2 56 0.04067536454
2 64 0.03914044512
2 72 0.02916346891
2 80 0.16039907904
3 8 0.36316695352
3 16 0.03958691910
3 24 0.11876075731
3 32 0.13253012048
3 40 0.03098106712
3 48 0.07228915662
3 56 0.07745266781
3 64 0.02581755593
3 72 0.02065404475
3 80 0.11876075731
I am looking for a function to turn this dataset into a CDF, partitioning by id. I have tried cume_dist and percent_rank, but they do not appear to work.
I am facing a similar problem and found this great tutorial for doing exactly that:
https://dwaincsql.com/2015/05/14/excel-in-t-sql-part-2-the-normal-distribution-norm-dist-density-functions/
It tries to rebuild the Excel NORM.DIST function, which gives you the PDF if you set the cumulative flag to FALSE and the CDF if you set it to TRUE. I assumed that CUME_DIST would do the exact same thing in SQL. However, it turns out that the latter builds the distribution by counting the elements, whereas Excel uses the relative differences in the values.
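Since the percent column already holds the PDF, a running sum within each id yields the CDF; in SQL that is a window function along the lines of SUM(percent) OVER (PARTITION BY id ORDER BY bin). For illustration, a minimal pandas sketch of the same computation (assuming the table sits in a dataframe df with columns id, bin, and percent):

import pandas as pd

# CDF = running sum of the per-bin probabilities within each id
df = df.sort_values(['id', 'bin'])
df['cdf'] = df.groupby('id')['percent'].cumsum()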

pandas applying function to columns array is very slow

os hour day
0 13 14 0
1 19 14 0
2 13 14 0
3 13 14 0
4 13 14 0
Here is my dataframe, and I just want to get a new column which is str(os)+'_'+str(hour)+'_'+str(day). I use the apply function to process the dataframe but it is very slow.
Is there any high-performance method to achieve this?
I also tried converting the df to an array and processing every row; that seems slow too.
There are nearly two hundred million rows in the dataframe.
Not sure what code you are using, but you can try:
df.astype(str).apply('_'.join, axis = 1)
0 13_14_0
1 19_14_0
2 13_14_0
3 13_14_0
4 13_14_0
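With that many rows, the row-wise apply is likely the bottleneck; plain vectorized string concatenation of the columns is usually much faster (a sketch producing the same strings; the column name 'combined' is just a placeholder):

df['combined'] = (df['os'].astype(str) + '_'
                  + df['hour'].astype(str) + '_'
                  + df['day'].astype(str))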

Removing index duplicates in pandas

There is some seemingly inconsistent behaviour when removing duplicates in pandas.
Problem set up: I have a dataframe with three columns and 3330 timeseries observations as shown below:
data.describe()
Mean Buy Sell
count 3330 3330 3330
Checking if the data contains any duplicates, shows there are duplicate indices.
data.index.duplicated().any()
True
How many duplicates are in the data?
data.loc[data.index.duplicated()].count()
Mean 38
Buy 38
Sell 38
The duplicates can be visually inspected too
`data[data.index.duplicated()]`
Dilemma: Clearly there are duplicates in the data, and it seems there are 38 of them per column. However, when I use the DataFrame's drop_duplicates(), more data is dropped than expected.
`data.drop_duplicates().count()`
Mean 3241
Buy 3241
Sell 3241
dtype: int64
`data.count() - data.drop_duplicates().count()`
Mean 89
Buy 89
Sell 89
Any ideas on the cause of this disparity, or the detail I'm missing, would be appreciated. Note: it is possible to have similar entries of data, but dates should not be duplicated, hence the reasonable way to clean the data is to remove the duplicate days.
The disparity comes from what each method compares: drop_duplicates() compares the row values across all columns and ignores the index, while index.duplicated() compares only the index labels, which is why the two counts differ. If I understand you correctly, you want to keep only the first occurrence (row / record) where there are duplicates in your index?
This will accomplish that.
import pandas as pd
df = pd.DataFrame({'IDX': [1, 2, 2, 2, 3, 4, 5, 5, 6],
                   'Mean': [1, 2, 3, 4, 5, 6, 7, 8, 9]}).set_index('IDX')
df
Mean
IDX
1 1
2 2
2 3
2 4
3 5
4 6
5 7
5 8
6 9
duplicates = df.index.duplicated()
duplicates
array([False, False, True, True, False, False, False, True, False])
df.loc[~duplicates, :]
Mean
IDX
1 1
2 2
3 5
4 6
5 7
6 9
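Applied to the data in the question, the same idea is a one-liner (keep='first', the default, retains the first row for each duplicated date):

data = data[~data.index.duplicated(keep='first')]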