IsolationForest, transforming data - data-science

A colleague and I are trying to detect anomalies in a large dataset. We want to try out different algorithms (LOF, OC-SVM, DBSCAN, etc.), but we are currently working with IsolationForest.
Our dataset is currently shaped as follows. It's a count of the number of event types logged per user per day, and contains more than 300,000 records:
date      user    event   count
6/1/2021  user_a  Open    2
6/2/2021  user_a  Open    4
6/1/2021  user_b  Modify  3
6/2/2021  user_b  Open    5
6/2/2021  user_b  Delete  2
6/3/2021  user_b  Open    7
6/5/2021  user_b  Move    4
6/4/2021  user_c  Modify  3
6/4/2021  user_c  Move    6
Our goal is to automatically detect anomalous counts of events per user. For example, for a user who normally logs between 5 and 10 "Open" events per day, a count of 400 would be an outlier. My colleague and I are having a discussion on how we should prepare the dataset for the IsolationForest algorithm.
One of us says we should drop the date field and label-encode the rest of the data, i.e. encode all strings as integers, and let IsolationForest calculate an outlier score for each of the records.
The other is of the opinion that label encoding should NOT be done, since categorical data cannot meaningfully be replaced by integers. Instead, the data should be scaled, the user column should be dropped (or set as index), and the event column should be pivoted to generate more dimensions (the example below shows what he wants to do; a short pandas sketch of this pivot follows the table):
date      user    event_Open  event_Modify  event_Delete  event_Move
6/1/2021  user_a  2           NaN           NaN           NaN
6/2/2021  user_a  4           NaN           NaN           NaN
6/1/2021  user_b  NaN         3             NaN           NaN
6/2/2021  user_b  5           NaN           2             NaN
6/3/2021  user_b  7           NaN           NaN           NaN
6/5/2021  user_b  NaN         NaN           NaN           4
6/4/2021  user_c  NaN         3             NaN           6
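For reference, here is a minimal sketch of that pivot in pandas, rebuilt from the example rows above (column names are taken from the question; this only illustrates the transformation, not an endorsement of it):

import pandas as pd

# Rebuild the example frame shown above.
df = pd.DataFrame({
    'date':  ['6/1/2021', '6/2/2021', '6/1/2021', '6/2/2021', '6/2/2021',
              '6/3/2021', '6/5/2021', '6/4/2021', '6/4/2021'],
    'user':  ['user_a', 'user_a', 'user_b', 'user_b', 'user_b',
              'user_b', 'user_b', 'user_c', 'user_c'],
    'event': ['Open', 'Open', 'Modify', 'Open', 'Delete',
              'Open', 'Move', 'Modify', 'Move'],
    'count': [2, 4, 3, 5, 2, 7, 4, 3, 6],
})

# Spread the event types into one column per event; date/user combinations that
# never logged a given event stay NaN, exactly as in the table above.
pivoted = (df.pivot_table(index=['date', 'user'], columns='event',
                          values='count', aggfunc='sum')
             .add_prefix('event_')
             .reset_index())
print(pivoted)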
So we're in disagreement on a couple of points. I'll list them below and include my thoughts on them:
- Labelencoding: is a must and does not affect the categorical nature of the dataset.
- Scaling: IsolationForest is by nature insensitive to scaling, making scaling superfluous.
- Drop date column: the date is actually not a feature in the dataset, as the date does not have any correlation to the anomalousness of the count per event type per user.
- Drop user column: user is actually a (critical) feature and should not be dropped.
- Pivot event column: this generates a sparse matrix, which can be bad practice. It also introduces relations within the data that are not there in reality (for example, user_b on 2 June logged 5 Open events and 2 Delete events, but these are not related and should therefore not form a single record).
I am very curious to hear your thoughts on these points. What's the best practice regarding the issues listed above when using the IsolationForest algorithm for anomaly detection?
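Purely to show the mechanics of the algorithm, and not as a verdict on which preparation is right, here is a minimal sketch of fitting scikit-learn's IsolationForest on a pivoted frame like the one above (filling NaN with 0 and the contamination value are assumptions made only for illustration):

from sklearn.ensemble import IsolationForest

# Assumes `pivoted` is the event_* frame sketched above; keep only the numeric
# event columns and treat "no events logged that day" as a count of 0.
X = pivoted.filter(like='event_').fillna(0)

# contamination=0.01 is an arbitrary guess for illustration.
iso = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
pivoted['anomaly'] = iso.fit_predict(X)       # -1 = flagged as anomalous, 1 = normal
pivoted['score'] = iso.decision_function(X)   # lower = more anomalous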

Related

fillna in one column only where two other columns are equal from different data frames python

I am trying to update my master data table with the information from my custom table.
where mt.type is null update mt.type when mt.item = ct.item
I can't find a solution online for updating one column in a data frame based on a matching column shared between the main data frame and another one.
I think maybe I need something like this, but with the condition that mt['item'] matches ct['item']:
mt['type'] = mt['type'].fillna(ct['type'])
I have also tried using a lambda and a mapper, but I can't figure it out.
Tables below:
custom table as ct

Type      Item
Cupboard  Pasta
Fresh     Apple
Frozen    Peas

master table as mt

Type      Item    Weather  Shopping Week
Cupboard  Beans   Sunny    1
NULL      Pasta   Rainy    NULL
NULL      Apples  NULL     2
NULL      Peas    Cloudy   5
...       ...     ...      ...
desired output

Type      Item    Weather  Shopping Week
Cupboard  Beans   Sunny    1
Cupboard  Pasta   Rainy    NULL
Fresh     Apples  NULL     2
Frozen    Peas    Cloudy   5
...       ...     ...      ...
Thanks!
Here is one way to do it using fillna with a little help from set_index:
out = mt.assign(Type=mt.set_index("Item")["Type"]
                      .fillna(ct.set_index("Item")["Type"])
                      .reset_index(drop=True))
This will create a new DataFrame. If you need to overwrite the column "Type" in mt, use this:
mt["Type"] = (mt.set_index("Item")["Type"]
                .fillna(ct.set_index("Item")["Type"])
                .reset_index(drop=True))
Output:
print(out) # or print(mt)
Type Item Shopping_Week
0 Fresh Orange 1.0
1 Fresh Apple 2.0
2 Fresh Banana 3.0
3 Cupboard NaN NaN
4 Cupboard Beans 4.0
5 Frozen Peas 7.0
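As a side note (not part of the answer above), the same fill can be done with Series.map, which avoids the index juggling; a minimal sketch, assuming mt and ct are the tables from the question and that Item values are unique in ct:

# Build an Item -> Type lookup from ct and use it to fill the gaps in mt.
mt['Type'] = mt['Type'].fillna(mt['Item'].map(ct.set_index('Item')['Type']))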

Join values in different dataframes

I am trying to join two dataframes in such a way that the result contains info about both of them. My dataframes are similar to:
>> df_1
user_id  hashtag1      hashtag2  hashtag3
0000     '#breakfast'  '#lunch'  '#dinner'
0001     '#day'        '#night'  NaN
0002     '#breakfast'  NaN       NaN
The second dataframe contains a unique identifier of the hashtags and their respective score:
>> df_2
hashtag1      score
'#breakfast'  10
'#lunch'      8
'#dinner'     9
'#day'        -5
'#night'      6
I want to add a set of columns to my first dataframe that contain the score of each hashtag used, such as:
user_id  hashtag1      hashtag2  hashtag3   score1  score2  score3
0000     '#breakfast'  '#lunch'  '#dinner'  10      8       9
0001     '#day'        '#night'  NaN        -5      6       NaN
0002     '#breakfast'  NaN       NaN        10      NaN     NaN
I tried to use df.join() but I get an error: "ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat"
My code is as follows:
new_df = df_1.join(df_2, how='left', on='hashtag1')
I appreciate any help, thank you
You should try pandas.merge:
pandas.merge(df_1, df_2, on='hashtag1', how='left')
If you want to use .join, you need to set the index of df_2.
df_1.join(df_2.set_index('hashtag1'), on='hashtag1', how='left')
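If the end goal is a score column per hashtag column (score1, score2, score3 as in the desired output), one option beyond the single merge above is to map every hashtag column through a lookup Series; a sketch, assuming df_1 and df_2 are as shown in the question:

# Turn df_2 into a hashtag -> score lookup and apply it to each hashtag column.
scores = df_2.set_index('hashtag1')['score']
for i, col in enumerate(['hashtag1', 'hashtag2', 'hashtag3'], start=1):
    df_1[f'score{i}'] = df_1[col].map(scores)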
Some resources:
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging
Trouble with df.join(): ValueError: You are trying to merge on object and int64 columns

Pandas subtract dates to get a surgery patient length of stay

I have a dataframe of surgical activity with admission dates (ADMIDATE) and discharge dates (DISDATE). It is 600k rows by 78 columns but I have filtered it for a particular surgery. I want to calculate the length of stay and add it as a further column.
Usually I use
df["los"] = (df["DISDATE"] - df["ADMIDATE"]).dt.days
I recently had to clean the data and must have done it differently to before, because I am now getting a negative los, e.g.:
DISDATE     ADMIDATE    los
2019-12-24  2019-12-08  -43805
2019-05-15  2019-03-26  50
2019-10-11  2019-10-07  4
2019-06-20  2019-06-16  4
2019-04-11  2019-04-08  3
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 78 columns):
 5   ADMIDATE  5 non-null  datetime64[ns]
 28  DISDATE   5 non-null  datetime64[ns]
I am not sure how to ask the right questions about this problem, or why it is only affecting some rows. In cleaning the data, some of the DISDATE values had to be populated from another column (also a date column) because they were incomplete, and I wonder if it is these rows which come out negative due to some retention of the original data somehow, even though printing the new DISDATE looks fine.
Your sample works fine and gives the right output (16 days for the first row).
Can you try this and check if the problem persists:
import io
import pandas as pd

data = df[['DISDATE', 'ADMIDATE']].to_csv()
test = pd.read_csv(io.StringIO(data), index_col=0,
                   parse_dates=['DISDATE', 'ADMIDATE'])
print(test['DISDATE'].sub(test['ADMIDATE']).dt.days)
Output:
0 16
1 50
2 4
3 4
4 3
dtype: int64
Update
To debug your bad dates, try:
df.loc[pd.to_datetime(df['ADMIDATE'], errors='coerce').isna(), 'ADMIDATE']
You should see the rows where the values are not valid dates.
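If some DISDATE values were stitched in as text during the cleaning step, re-coercing both columns and recomputing the length of stay is a reasonable follow-up; a minimal sketch, assuming df is the filtered frame from the question:

import pandas as pd

# Force both columns to real datetimes; anything unparseable becomes NaT
# instead of silently producing nonsense differences.
df['ADMIDATE'] = pd.to_datetime(df['ADMIDATE'], errors='coerce')
df['DISDATE'] = pd.to_datetime(df['DISDATE'], errors='coerce')

# Recompute the length of stay; rows with NaT in either column come out as NaN.
df['los'] = (df['DISDATE'] - df['ADMIDATE']).dt.days
print(df.loc[df['los'] < 0, ['ADMIDATE', 'DISDATE', 'los']])   # inspect any leftovers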

How to count the distance in cells (e.g. in indices) between two repeating values in one column in Pandas dataframe?

I have the following dataset. It lists the words that were presented to a participant in a psycholinguistic experiment (I set the order of presentation of each word as the index):
import pandas as pd

data = {'Stimulus': ['sword', 'apple', 'tap', 'stick', 'elephant', 'boots', 'berry', 'apple', 'pear', 'apple', 'stick'],
        'Order': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]}
df = pd.DataFrame(data, columns=['Stimulus', 'Order'])
df.set_index('Order', inplace=True)
       Stimulus
Order
1         sword
2         apple
3           tap
4         stick
5      elephant
6         boots
7         berry
8         apple
9          pear
10        apple
11        stick
Some values in this dataset are repeated (e.g. apple), some are not. The problem is that I need to calculate the distance in cells based on the order column between each occurrence of repeated values and store it in a separate column, like this:
       Stimulus  Distance
Order
1         sword        NA
2         apple        NA
3           tap        NA
4         stick        NA
5      elephant        NA
6         boots        NA
7         berry        NA
8         apple         6
9          pear        NA
10        apple         2
11        stick         7
It shouldn't be hard to implement, but I've got stuck. Initially, I made a dictionary of duplicates where I store items as keys and their indices as values:
{'apple': [2,8,10],'stick': [4, 11]}
And then I failed to find a solution to put those values into a column. If there is a simpler way to do it in a loop without using dictionaries, please let me know. I will appreciate any advice.
Use df.groupby on Stimulus, then transform the Order column using pd.Series.diff:
df = df.reset_index()
df['Distance'] = df.groupby('Stimulus')['Order'].transform(pd.Series.diff)
df = df.set_index('Order')
# print(df)
       Stimulus  Distance
Order
1         sword       NaN
2         apple       NaN
3           tap       NaN
4         stick       NaN
5      elephant       NaN
6         boots       NaN
7         berry       NaN
8         apple       6.0
9          pear       NaN
10        apple       2.0
11        stick       7.0
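An equivalent sketch that skips the reset_index round-trip, grouping the index values themselves by Stimulus (same result, just a different route; assumes df is still indexed by Order and pandas is imported as pd):

# Diff the Order index labels within each Stimulus group.
order = pd.Series(df.index, index=df.index)
df['Distance'] = order.groupby(df['Stimulus']).diff()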

dataframe panda row information

I have been searching the pandas documentation and can't seem to locate any information.
Here is what I am trying to do: I am trying to explore data I got using pandas. I have the dataframe below:
        michael  larry  Moe  Carol
height  NaN      5'8''  6'   5'3"
weight  150      230    NaN  60
eyes    NaN      NaN    NaN  blue
hair    black    NaN    NaN  NaN
This is a simplification, but I will have 50 rows and hundreds of columns.
I want to be able to count the number of NaN or missing data per row, so I can say, for example, that we are missing 75% of the eyes information, 25% of the height information, etc.
I have tried basic commands like
df.isnull().sum(axis=1)
which I believe should give me the count per row, but it is all zeros, which doesn't make sense.
Thanks.
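For what it's worth, if df.isnull().sum(axis=1) really returns all zeros, the "NaN" cells are most likely stored as literal strings rather than true missing values; a minimal sketch of checking that and computing the per-row missing percentage (the frame below is a hypothetical reconstruction of the example):

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'michael': [np.nan, 150, np.nan, 'black'],
     'larry':   ["5'8''", 230, np.nan, np.nan],
     'Moe':     ["6'", np.nan, np.nan, np.nan],
     'Carol':   ["5'3\"", 60, 'blue', np.nan]},
    index=['height', 'weight', 'eyes', 'hair'])

# If the missing cells were read in as the text "NaN", convert them first.
df = df.replace('NaN', np.nan)

missing_per_row = df.isnull().sum(axis=1)        # count of missing cells per row
pct_missing = df.isnull().mean(axis=1) * 100     # percentage missing per row
print(pct_missing)                               # e.g. eyes comes out at 75.0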