pandas outliers with and without calculations - pandas

I'm contemplating making decisions about outliers on a dataset with over 300 features. I'd like to analyse the frame without removing the data hastily. I have a frame:
| | A | B | C | D | E |
|---:|----:|----:|-----:|----:|----:|
| 0 | 100 | 99 | 1000 | 300 | 250 |
| 1 | 665 | 6 | 9 | 1 | 9 |
| 2 | 7 | 665 | 4 | 9 | 1 |
| 3 | 1 | 3 | 4 | 3 | 6 |
| 4 | 1 | 9 | 1 | 665 | 5 |
| 5 | 3 | 4 | 6 | 1 | 9 |
| 6 | 5 | 9 | 1 | 3 | 2 |
| 7 | 1 | 665 | 3 | 2 | 3 |
| 8 | 2 | 665 | 9 | 1 | 0 |
| 9 | 5 | 0 | 7 | 6 | 5 |
| 10 | 0 | 3 | 3 | 7 | 3 |
| 11 | 6 | 3 | 0 | 3 | 6 |
| 12 | 6 | 6 | 5 | 1 | 5 |
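For reproducibility, the frame can be rebuilt as follows (values transcribed by hand from the table above; note that the first answer below refers to this frame as data):
import pandas as pd

df = pd.DataFrame({
    "A": [100, 665, 7, 1, 1, 3, 5, 1, 2, 5, 0, 6, 6],
    "B": [99, 6, 665, 3, 9, 4, 9, 665, 665, 0, 3, 3, 6],
    "C": [1000, 9, 4, 4, 1, 6, 1, 3, 9, 7, 3, 0, 5],
    "D": [300, 1, 9, 3, 665, 1, 3, 2, 1, 6, 7, 3, 1],
    "E": [250, 9, 1, 6, 5, 9, 2, 3, 0, 5, 3, 6, 5],
})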
I have coded some introspection to be saved in another frame called _outliers:
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = (Q3 - Q1)
min_ = (Q1 - (1.5 * IQR))
max_ = (Q3 + (1.5 * IQR))
# Counts outliers in columns
_outliers = ((df.le(min_)) | (df.ge(max_))).sum().to_frame(name="outliers")
# Gives percentage of data that outliers represent in the column
_outliers["percent"] = (_outliers['outliers'] / _outliers['outliers'].sum()) * 100
# Shows max value in the column
_outliers["max_val"] = df[_outliers.index].max()
# Shows min value in the column
_outliers["min_val"] = df[_outliers.index].min()
# Shows median value in the column
_outliers["median"] = df[_outliers.index].median()
# Shows mean value in the column
_outliers["mean"] = df[_outliers.index].mean()
That yields:
| | outliers | percent | max_val | min_val | median | mean |
|:---|-----------:|----------:|----------:|----------:|---------:|---------:|
| A | 2 | 22.2222 | 665 | 0 | 5 | 61.6923 |
| B | 3 | 33.3333 | 665 | 0 | 6 | 164.385 |
| C | 1 | 11.1111 | 1000 | 0 | 4 | 80.9231 |
| D | 2 | 22.2222 | 665 | 1 | 3 | 77.0769 |
| E | 1 | 11.1111 | 250 | 0 | 5 | 23.3846 |
I would like to quantify the impact of the outliers on each column by calculating the mean and the median without them, but I don't want to remove them from the frame to do this calculation. I suppose the best way is to add "~" to the outlier filter, but I get lost in the code... This should benefit a lot of people, since a search on removing outliers yields plenty of results. Beyond the question of why they sneaked into the data in the first place, I just don't think the removal decision should be made without considering its potential impact. Feel free to add other considerations (skewness, sigma, n, etc.).
As always, I'm grateful to this community!
EDIT: I added variance and its square root, the standard deviation, with and without outliers. In some fields you might want to keep outliers and go into ML directly. At least, by inspecting your data beforehand, you'll know how much they contribute to your results. Used with nlargest() on the outliers column, you get a quick view of which features contain the most. You could use this as a basis for filtering features by setting thresholds on variance or mean. Thanks to the contributors, I have a powerful analytics tool now. Hope it can be useful to others.
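For reference, here is a minimal sketch of how those extra columns can be appended using the "~" idea, reusing the min_ and max_ bounds from above (the new column names are my own choice):
# Boolean frame: True where a value lies outside the IQR fences
mask = df.le(min_) | df.ge(max_)

# Indexing with ~mask keeps the frame's shape; outliers become NaN and are skipped by the aggregations
clean = df[~mask]

_outliers["mean_no_outliers"] = clean.mean()
_outliers["median_no_outliers"] = clean.median()
_outliers["var"] = df.var()
_outliers["var_no_outliers"] = clean.var()
_outliers["std"] = df.std()
_outliers["std_no_outliers"] = clean.std()

# Quick view of the features holding the most outliers
_outliers.nlargest(5, "outliers")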

Take advantage of the apply method of DataFrame.
Series generator
Just define how you want the robust mean to be computed by creating a function that consumes a Series and returns a scalar, then apply it to your DataFrame.
For the IQR-filtered mean, here is a simple snippet:
def irq_agg(x, factor=1.5, aggregate=pd.Series.mean):
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    return aggregate(x[(q1 - factor*(q3 - q1) < x) & (x < q3 + factor*(q3 - q1))])
data.apply(irq_agg)
# A 3.363636
# B 14.200000
# C 4.333333
# D 3.363636
# E 4.500000
# dtype: float64
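Since apply forwards keyword arguments to the function, the same snippet yields the IQR-trimmed median by swapping the aggregate (the values match the irq_median column of the combined output further down):
data.apply(irq_agg, aggregate=pd.Series.median)
# A    3.0
# B    5.0
# C    4.0
# D    3.0
# E    5.0
# dtype: float64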
The same can be done to filter out based on percentiles (two-sided version):
def quantile_agg(x, alpha=0.05, aggregate=pd.Series.mean):
    return aggregate(x[(x.quantile(alpha/2) < x) & (x < x.quantile(1 - alpha/2))])
data.apply(quantile_agg, alpha=0.01)
# A 12.454545
# B 15.777778
# C 4.727273
# D 41.625000
# E 4.909091
# dtype: float64
Frame generator
Even better, create a function that returns a Series; apply will then build a DataFrame. That way we can compute a whole set of different means and medians at once in order to compare them. We can also reuse the Series generator functions defined above:
def analyze(x, alpha=0.05, factor=1.5):
    return pd.Series({
        "p_mean": quantile_agg(x, alpha=alpha),
        "p_median": quantile_agg(x, alpha=alpha, aggregate=pd.Series.median),
        "irq_mean": irq_agg(x, factor=factor),
        "irq_median": irq_agg(x, factor=factor, aggregate=pd.Series.median),
        "standard": x[((x - x.mean())/x.std()).abs() < 1].mean(),
        "mean": x.mean(),
        "median": x.median(),
    })
data.apply(analyze).T
# p_mean p_median irq_mean irq_median standard mean median
# A 12.454545 5.0 3.363636 3.0 11.416667 61.692308 5.0
# B 15.777778 6.0 14.200000 5.0 14.200000 164.384615 6.0
# C 4.727273 4.0 4.333333 4.0 4.333333 80.923077 4.0
# D 41.625000 4.5 3.363636 3.0 3.363636 77.076923 3.0
# E 4.909091 5.0 4.500000 5.0 4.500000 23.384615 5.0
Now you can filter out outliers in several ways and compute the relevant aggregates, such as the mean or median, on what remains.

No comment on whether this is an appropriate method to filter out your outliers. The code below should do what you asked:
import numpy as np
import pandas as pd

q1, q3 = df.quantile([0.25, 0.75]).to_numpy()
delta = (q3 - q1) * 1.5
min_val, max_val = q1 - delta, q3 + delta
outliers = (df < min_val) | (max_val < df)
result = pd.concat(
    [
        pd.DataFrame(
            {
                "outliers": outliers.sum(),
                # share of each column's rows flagged as outliers
                "percent": outliers.sum() / len(df) * 100,
                "max_val": max_val,
                "min_val": min_val,
            }
        ),
        df.agg(["median", "mean"]).T,
        df.mask(outliers, np.nan).agg(["median", "mean"]).T.add_suffix("_no_outliers"),
    ],
    axis=1,
)
Result:
outliers percent max_val min_val median mean median_no_outliers mean_no_outliers
A 2 15.384615 13.5 -6.5 5.0 61.692308 3.0 3.363636
B 3 23.076923 243.0 -141.0 6.0 164.384615 5.0 14.200000
C 1 7.692308 13.0 -3.0 4.0 80.923077 4.0 4.333333
D 2 15.384615 16.0 -8.0 3.0 77.076923 3.0 3.363636
E 1 7.692308 10.5 -1.5 5.0 23.384615 5.0 4.500000
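If you also want the variance and standard deviation mentioned in the question's EDIT, a small extension of the same pattern (reusing outliers and result from above) could look like:
result = result.join(
    pd.concat(
        [
            df.agg(["var", "std"]).T,
            df.mask(outliers).agg(["var", "std"]).T.add_suffix("_no_outliers"),
        ],
        axis=1,
    )
)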

Related

How do I clean this dataframe?

In row 2, I have a value "AVE" in the 'address' column that I would like to join with the 'address' value in row 1. The result should be that row 1's 'address' reads "NEWPORT AVE / HIGHLAND AVE". How do I do this?
I also need to perform the same function with row 3 where 'action_taken' reads as "SERVICE RENDERED" with "RENDERED" taken from row 4.
| | incident_no | date_reported | time_reported | address | incident_type | action_taken |
|---:|------------:|:--------------|:--------------|:-----------------------|:---------------|:-------------|
| 1 | 2100030948 | 2021-05-16 | 23:21:00 | NEWPORT AVE / HIGHLAND | ERRATIC M/V | UNFOUNDED |
| 2 | <NA> | NaT | NaT | AVE | NaN | NaN |
| 3 | 2100030947 | 2021-05-16 | 23:16:00 | FALMOUTH ST | SECURITY CHECK | SERVICE |
| 4 | <NA> | NaT | NaT | NaN | NaN | RENDERED |
| 5 | 2100030946 | 2021-05-16 | 22:55:00 | PINE RD | SECURITY CHECK | SERVICE |
First forward fill missing values in the columns from the list, then group by them and aggregate with a join that removes the missing values:
cols = ['incident_no','date_reported','time_reported']
df[cols] = df[cols].ffill()
df = df.groupby(cols).agg(lambda x: ' '.join(x.dropna())).reset_index()
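A self-contained sketch of that recipe against a rebuild of the sample frame (the values are transcribed by hand from the question's table, so treat the construction as an assumption):
import pandas as pd

df = pd.DataFrame({
    "incident_no":   [2100030948, None, 2100030947, None, 2100030946],
    "date_reported": ["2021-05-16", None, "2021-05-16", None, "2021-05-16"],
    "time_reported": ["23:21:00", None, "23:16:00", None, "22:55:00"],
    "address":       ["NEWPORT AVE / HIGHLAND", "AVE", "FALMOUTH ST", None, "PINE RD"],
    "incident_type": ["ERRATIC M/V", None, "SECURITY CHECK", None, "SECURITY CHECK"],
    "action_taken":  ["UNFOUNDED", None, "SERVICE", "RENDERED", "SERVICE"],
})

cols = ['incident_no', 'date_reported', 'time_reported']
df[cols] = df[cols].ffill()                      # push the identifying columns down to the spill-over rows
out = (
    df.groupby(cols, sort=False)                 # sort=False keeps the original incident order
      .agg(lambda x: ' '.join(x.dropna()))       # glue the string fragments back together, skipping NaN
      .reset_index()
)
# out['address'].iloc[0]      -> 'NEWPORT AVE / HIGHLAND AVE'
# out['action_taken'].iloc[1] -> 'SERVICE RENDERED'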

How do you control float formatting when using DataFrame.to_markdown in pandas?

I'm trying to use DataFrame.to_markdown with a dataframe that contains float values that I'd like to have rounded off. Without to_markdown() I can just set pd.options.display.float_format and everything works fine, but to_markdown doesn't seem to be respecting that option.
Repro:
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [42.42, 99.11234123412341234, -23]])
pd.options.display.float_format = '{:,.0f}'.format
print(df)
print()
print(df.to_markdown())
outputs:
0 1 2
0 1 2 3
1 42 99 -23
| | 0 | 1 | 2 |
|---:|------:|--------:|----:|
| 0 | 1 | 2 | 3 |
| 1 | 42.42 | 99.1123 | -23 |
(compare the 42.42 and 99.1123 in the to_markdown table to the 42 and 99 in the plain old df)
Is this a bug or am I missing something about how to use to_markdown?
It looks like pandas uses tabulate for this formatting. If it's installed, you can use something like:
df.to_markdown(floatfmt=".0f")
output:
| | 0 | 1 | 2 |
|---:|----:|----:|----:|
| 0 | 1 | 2 | 3 |
| 1 | 42 | 99 | -23 |
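floatfmt is handed to tabulate and accepts a standard Python float format spec, so other specs should work too; for instance, two decimals with a thousands separator (a sketch, not exhaustive):
print(df.to_markdown(floatfmt=",.2f"))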

Iterate through pandas data frame and replace some strings with numbers

I have a dataframe sample_df that looks like:
bar foo
0 rejected unidentified
1 clear caution
2 caution NaN
Note this is just a random made-up df; there are a lot of other columns, let's say with different data types than just text. bar and foo might also have lots of empty cells/values, which are NaNs.
The actual df looks like this, the above is just a sample btw:
| | Unnamed: 0 | user_id | result | face_comparison_result | created_at | facial_image_integrity_result | visual_authenticity_result | properties | attempt_id |
|-----:|-------------:|:---------------------------------|:---------|:-------------------------|:--------------------|:--------------------------------|:-----------------------------|:----------------|:---------------------------------|
| 0 | 58 | ecee468d4a124a8eafeec61271cd0da1 | clear | clear | 2017-06-20 17:50:43 | clear | clear | {} | 9e4277fc1ddf4a059da3dd2db35f6c76 |
| 1 | 76 | 1895d2b1782740bb8503b9bf3edf1ead | clear | clear | 2017-06-20 13:28:00 | clear | clear | {} | ab259d3cb33b4711b0a5174e4de1d72c |
| 2 | 217 | e71b27ea145249878b10f5b3f1fb4317 | clear | clear | 2017-06-18 21:18:31 | clear | clear | {} | 2b7f1c6f3fc5416286d9f1c97b15e8f9 |
| 3 | 221 | f512dc74bd1b4c109d9bd2981518a9f8 | clear | clear | 2017-06-18 22:17:29 | clear | clear | {} | ab5989375b514968b2ff2b21095ed1ef |
| 4 | 251 | 0685c7945d1349b7a954e1a0869bae4b | clear | clear | 2017-06-18 19:54:21 | caution | clear | {} | dd1b0b2dbe234f4cb747cc054de2fdd3 |
| 5 | 253 | 1a1a994f540147ab913fcd61b7a859d9 | clear | clear | 2017-06-18 20:05:05 | clear | clear | {} | 1475037353a848318a32324539a6947e |
| 6 | 334 | 26e89e4a60f1451285e70ca8dc5bc90e | clear | clear | 2017-06-17 20:21:54 | suspected | clear | {} | 244fa3e7cfdb48afb44844f064134fec |
| 7 | 340 | 41afdea02a9c42098a15d94a05e8452b | NaN | clear | 2017-06-17 20:42:53 | clear | clear | {} | b066a4043122437bafae3ddcf6c2ab07 |
| 8 | 424 | 6cf6eb05a3cc4aabb69c19956a055eb9 | rejected | NaN | 2017-06-16 20:00:26 |
I want to replace any strings I find with numbers, per the below mapping.
def no_strings(df):
    columns = list(df)
    for column in columns:
        df[column] = df[column].map(result_map)

# We will need a mapping of strings to numbers to be able to analyse later.
result_map = {'unidentified': 0, "clear": 1, 'suspected': 2, "caution": 3, 'rejected': 4}
So the output might look like:
bar foo
0 4 0
1 1 3
2 3 NaN
For some reason, when I run no_strings(sample_df) I get errors.
What am I doing wrong?
df['bar'] = df['bar'].map(result_map)
df['foo'] = df['foo'].map(result_map)
df
bar foo
0 4 0
1 1 3
2 3 2
However, if you wish to be on the safe side (assuming a key/value is not in your result_map and you don't want to see a NaN), do this:
df['foo'] = df['foo'].map(lambda x: result_map.get(x, 'not found'))
df['bar'] = df['bar'].map(lambda x: result_map.get(x, 'not found'))
So an output for this df
bar foo
0 rejected unidentified
1 clear caution
2 caution suspected
3 sdgdg 0000
will result in:
bar foo
0 4 0
1 1 3
2 3 2
3 not found not found
To be extra efficient:
cols = ['foo', 'bar', 'other_columns']
for c in cols:
    df[c] = df[c].map(lambda x: result_map.get(x, 'not found'))
Let's try stack, map the dict and then unstack:
df.stack().to_frame()[0].map(result_map).unstack()
bar foo
0 4 0
1 1 3
2 3 2
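Putting the pieces together, here is one way to write a corrected no_strings (a sketch: result_map is defined before it is used, the function returns the frame, and the select_dtypes restriction to text columns is my own addition so the other dtypes mentioned in the question stay untouched):
import pandas as pd

result_map = {'unidentified': 0, "clear": 1, 'suspected': 2, "caution": 3, 'rejected': 4}

def no_strings(df):
    out = df.copy()
    for column in out.select_dtypes(include="object"):   # only map text columns
        out[column] = out[column].map(result_map)
    return out

sample_df = pd.DataFrame({"bar": ["rejected", "clear", "caution"],
                          "foo": ["unidentified", "caution", None]})
print(no_strings(sample_df))
#    bar  foo
# 0    4  0.0
# 1    1  3.0
# 2    3  NaN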

Pandas: need to create dataframe for weekly search per event occurrence

If I have this events dataframe df_e below:
|------|------------|-------|
| group| event date | count |
| x123 | 2016-01-06 | 1 |
| | 2016-01-08 | 10 |
| | 2016-02-15 | 9 |
| | 2016-05-22 | 6 |
| | 2016-05-29 | 2 |
| | 2016-05-31 | 6 |
| | 2016-12-29 | 1 |
| x124 | 2016-01-01 | 1 |
...
and I also know t0, which is the beginning of time (let's say for x123 it's 2016-01-01), and tN, which is the end of the experiment from another dataframe df_s (2017-05-25), then how can I create the dataframe df_new, which should look like this:
|------|------------|---------------|--------|
| group| obs. weekly| lifetime, week| status |
| x123 | 2016-01-01 | 1 | 1 |
| | 2016-01-08 | 0 | 0 |
| | 2016-01-15 | 0 | 0 |
| | 2016-01-22 | 1 | 1 |
| | 2016-01-29 | 2 | 1 |
...
| | 2017-05-18 | 1 | 1 |
| | 2017-05-25 | 1 | 1 |
...
| x124 | 2017-05-18 | 1 | 1 |
| x124 | 2017-05-25 | 1 | 1 |
Explanation: take t0 and generate rows up to tN, one per week. For each row R, search within that group for an event date that falls inside R's week; if one is found, count how long in weeks it lives there and set status = 1 (alive); otherwise set the lifetime and status columns for this R to 0 (dead).
Questions:
1) How to generate dataframes per group given t0 and tN values, e.g. generate [group, obs. weekly, lifetime, status] columns for (tN - t0) / week rows?
2) How to accomplish the construction of such df_new dataframe explained above?
I can begin with this so far =)
import pandas as pd

# 1. generate dataframes per group to get the boundary within `t0` and `tN` from the df_s dataframe,
#    where each dataframe has "group, obs, lifetime, status" columns X (tN - t0) / week rows filled with 0 values.
df_all = pd.concat([df_group1, df_group2])

def do_that(R):
    found_event_row = df_e.iloc[[R.group]]
    # check if found_event_row['date'] falls into R['obs'] week
    # if True, then find how long it's there

df_new = df_all.apply(do_that)
I'm not really sure if I get you, but group one is not related to group two, right? If that's the case, I think what you want is something like this:
import pandas as pd
df_group1 = df_group1.set_index('event date')
df_group1.index = pd.to_datetime(df_group1.index) #convert the index to datetime so you can 'resample'
df_group1['lifetime, week'] = df_group1.resample('1W').apply(lambda x: yourfunction(x))
df_group1 = df_group1.reset_index()
df_group1['status'] = df_group1.apply(lambda x: 1 if x['lifetime, week'] > 0 else 0, axis=1)
#do the same with group2 and concat to create df_all
I'm not sure how you get 'lifetime, week' but all that's left is creating the function that generates it.
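For the row-generation part of the question, here is a minimal single-group sketch; the 'lifetime, week' column is left out since, as noted above, it isn't fully specified, and anchoring the weekly bins at t0 is an assumption (t0, tN and the event dates are taken from the question):
import pandas as pd

t0, tN = pd.Timestamp("2016-01-01"), pd.Timestamp("2017-05-25")
events = pd.to_datetime(["2016-01-06", "2016-01-08", "2016-02-15", "2016-05-22",
                         "2016-05-29", "2016-05-31", "2016-12-29"])

weeks = pd.date_range(t0, tN, freq="7D")        # one observation row per week

counts = (
    pd.Series(1, index=events)                  # one marker per event
      .resample("7D", origin=t0).sum()          # events per 7-day bin anchored at t0
      .reindex(weeks, fill_value=0)             # weeks with no events get 0
)

df_new = pd.DataFrame({
    "group": "x123",
    "obs. weekly": weeks,
    "count": counts.to_numpy(),
    "status": (counts > 0).astype(int).to_numpy(),
})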

influxdb/SQL get field count

I have an influxdb table, let's call it my_table.
my_table is structured like this (simplified):
| Time | m1 | m2 |
|-----:|---:|---:|
| 1 | 8 | 4 |
| 2 | 1 | 12 |
| 3 | 6 | 18 |
| 4 | 4 | 1 |
However I was wondering if it is possible to find out how many of the metrics are larger than a certain (dynamic) threshold for each time.
So let's say I want to know how many of the metrics (columns) are higher than 5,
I would want to do something like this:
select fieldcount(/m*/) from my_table where /m*/ > 5
Returning:
1
1
2
0
I am relatively restricted in structuring the database as I'm using the diamond collector (python), which takes care of all data collection for me and flushes it to my influxdb without me telling it what the tables should look like.
EDIT
I am aware of a possible solution if I hardcode the threshold and add a third metric named mGreaterThan5:
| Time | m1 | m2 | mGreaterThan5 |
|-----:|---:|---:|--------------:|
| 1 | 8 | 4 | 1 |
| 2 | 1 | 12 | 1 |
| 3 | 6 | 18 | 2 |
| 4 | 4 | 1 | 0 |
However this means that I can't easily change this threshold to 6 or any other number, so that's why I would prefer a better solution if there is one.
EDIT2
Another similar problem occurs when trying to retrieve the highest x metrics, e.g.:
On Jan 1st what were the highest 3 values of m? Given table:
| Time | m1 | m2 | m3 | m4 | m5 | m6 |
|-----:|---:|---:|---:|---:|---:|---:|
| 1/1 | 8 | 4 | 1 | 7 | 2 | 0 |
Am I screwed if I keep the table structured this way?
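Not an InfluxQL answer, but for comparison: once the points are pulled into a pandas DataFrame, both the threshold count and the EDIT2 "top x" lookup are one-liners. A sketch against the simplified table (column names taken from the question):
import pandas as pd

df = pd.DataFrame({"m1": [8, 1, 6, 4], "m2": [4, 12, 18, 1]},
                  index=pd.Index([1, 2, 3, 4], name="Time"))

threshold = 5                          # stays dynamic, nothing baked into the data
over = df.gt(threshold).sum(axis=1)    # how many metrics exceed the threshold at each time
# Time
# 1    1
# 2    1
# 3    2
# 4    0
# dtype: int64

top3 = df.loc[3].nlargest(3)           # highest metric values at one time (only m1, m2 here)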