I have an issue where I record a daily entry for every user in my system (several thousand users, possibly 100,000+). Each entry has 3 main features: "date", "file_count", and "user_id". For example:
date        file_count  user_id
2021-09-28  200         5
2021-09-28  10          7
2021-09-29  210         5
2021-09-29  50          7
What I am unsure about is how to run an anomaly detection algorithm efficiently across all these users.
My goal is to be able to report whether a user has some abnormal behavior each day.
In this example, user 7 should be flagged as an anomaly because their file_count is suddenly 5x higher than "normal".
My first idea was to create a model for each user, but since there are so many users this might not be feasible.
Could you explain how to do this efficiently, or suggest an algorithm that could solve this problem?
Any help is greatly appreciated!
Many articles on anomaly detection in audit data can be found on the Internet.
One simple article with many examples/approaches is available in the original language (Czech) here: https://blog.root.cz/trpaslikuv-blog/detekce-anomalii-v-auditnich-zaznamech-casove-rady/ or translated with Google: https://blog-root-cz.translate.goog/trpaslikuv-blog/detekce-anomalii-v-auditnich-zaznamech-casove-rady/?_x_tr_sl=auto&_x_tr_tl=en&_x_tr_hl=sk&_x_tr_pto=wapp
PS: Clustering (a clustering-based unsupervised approach) can be the way to go when you are looking for a simple algorithm.
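A cheap alternative to one model per user is a rolling per-user baseline computed in a single vectorized pass. Below is a minimal sketch of that idea (a simple rolling z-score, not the clustering approach from the linked article), assuming the data sits in a pandas DataFrame with the date, file_count and user_id columns from the question; the 30-day window and 3-sigma threshold are arbitrary illustration values.

import pandas as pd

def flag_anomalies(df, window=30, threshold=3.0):
    # Sort so each user's rows are in chronological order.
    df = df.sort_values(["user_id", "date"]).copy()
    grouped = df.groupby("user_id")["file_count"]
    # Rolling mean/std over each user's previous days (shift(1) excludes today).
    df["baseline_mean"] = grouped.transform(
        lambda s: s.shift(1).rolling(window, min_periods=3).mean())
    df["baseline_std"] = grouped.transform(
        lambda s: s.shift(1).rolling(window, min_periods=3).std())
    # z-score of today's file_count against the user's own history.
    df["z_score"] = (df["file_count"] - df["baseline_mean"]) / df["baseline_std"]
    df["is_anomaly"] = df["z_score"].abs() > threshold
    return df

Because the groupby/rolling operations process the whole table in one pass, 100,000 users with one row per day each stays cheap, and the same idea works incrementally if you keep running per-user means and variances instead of the full history.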
I have a dataset of job metrics, and one of my features is industry. It is a categorical feature with 1200 unique values. Before I go on to build a model, I need to figure out how best to encode it, especially because it has 1200 unique values. Does anyone have any tips or guidance on where I should start?
The picture below shows the top 9 industries. I am thinking of selective encoding, maybe only using one-hot encoding for the 15-20 most frequent values, but I would be thankful for any suggestions. Thanks
I tried looking for resources, but couldn't find anything promising so far.
[Image: the 9 most frequently occurring industries] https://i.stack.imgur.com/tDAEk.jpg
You could one-hot encode everything, and maybe check correlations against the target to see which job categories may be informative features.
If the data is too large to do this, then yes, perhaps selective encoding as you said: conditionally fill everything else as "other" and then proceed with one-hot encoding.
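As a rough illustration of the selective-encoding idea, here is a minimal pandas sketch; the column name industry and the cutoff of 20 categories are assumptions for illustration, not part of the original data.

import pandas as pd

def selective_one_hot(df, column="industry", top_n=20):
    # Keep only the top_n most frequent categories; everything else becomes "other".
    top = df[column].value_counts().nlargest(top_n).index
    reduced = df[column].where(df[column].isin(top), other="other")
    # One-hot encode the reduced column and drop the original.
    dummies = pd.get_dummies(reduced, prefix=column)
    return pd.concat([df.drop(columns=[column]), dummies], axis=1)

This keeps the feature matrix at top_n + 1 columns instead of 1200, and you can then check correlations of those dummy columns against the target as suggested above.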
I have been exploring the CrUX dataset in BigQuery for the last 10 days to extract data for a Data Studio report. Though I consider myself good at SQL, as I have mostly worked with Oracle and SQL Server, I am finding it very hard to write queries against this dataset. I started from this article by Rick Viscomi and explored the queries in his GitHub repo, but I am still unable to figure it out.
I am trying to use the materialized table chrome-ux-report.materialized.metrics_summary to get some of the metrics, but I am not sure whether the min/avg/max LCP (in milliseconds) for a time period (a month, for example) can be extracted from this table. What other queries could I try that require less data processing? (Some of the queries I tried used up my free TB of data processing on BigQuery.)
Any suggestions, advice, solutions, or queries are more than welcome, since the documentation about the structure of the dataset and queries against it is not very clear.
For details about the fields used in the report, you can check the main documentation for the Chrome UX Report, especially the last part on the data format, which shows the dimensions and how they are interpreted, as shown below:
Dimension                       Value
origin                          "https://example.com"
effective_connection_type.name  "4G"
form_factor.name                "phone"
first_paint.histogram.start     1000
first_paint.histogram.end       1200
first_paint.histogram.density   0.123
For example, the above shows a sample record from the Chrome User Experience Report, which indicates that 12.3% of page loads had a "first paint time" measurement in the range of 1000-1200 milliseconds when loading "http://example.com" on a "phone" device over a "4G"-like connection. To obtain a cumulative value of users experiencing a first paint time below 1200 milliseconds, you can add up all records whose histogram "end" value is less than or equal to 1200.
For the metrics, the initial link has a section called Methodology where you can get information about the metrics and dimensions of the report. I recommend going to the actual origin source tables (per country and per site) rather than the summary, as the data you are looking for can be obtained there. In the BigQuery part of the documentation you will find samples of how to query those tables. I find this one relevant:
SELECT
  SUM(bin.density) AS density
FROM
  `chrome-ux-report.chrome_ux_report.201710`,
  UNNEST(first_contentful_paint.histogram.bin) AS bin
WHERE
  bin.start < 1000 AND
  origin = 'http://example.com'
In the example above we’re adding all of the density values in the FCP histogram for “http://example.com” where the FCP bin’s start value is less than 1000 ms. The result is 0.7537, which indicates that ~75.4% of page loads experience the FCP in under a second.
Regarding query cost estimation, see the "Estimating query costs" guide in the official BigQuery documentation. These tables, due to their nature, consume a lot of processing, so filter them as much as possible.
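One way to protect the free quota while experimenting is to dry-run queries first. The sketch below uses the google-cloud-bigquery Python client (assuming it is installed and authenticated) to estimate how many bytes a query would scan without executing it; the query is the FCP example from above.

from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT SUM(bin.density) AS density
FROM `chrome-ux-report.chrome_ux_report.201710`,
  UNNEST(first_contentful_paint.histogram.bin) AS bin
WHERE bin.start < 1000
  AND origin = 'http://example.com'
"""

# dry_run=True makes BigQuery report the estimated scan size without running the query.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(query, job_config=job_config)
print(f"This query would process about {job.total_bytes_processed / 1e9:.2f} GB")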
We are trying to solve a VRP with Optaplanner where it is important that two (or more) customers are served at the same time.
This means, for example, that if customer #1 is supplied at 10 o'clock, then customer #2 must also be supplied at 10 o'clock.
It is not allowed to deliver to one customer and leave the other unscheduled.
Such constellations occur for approximately 50% of all customers, out of a total of 1,000 customers.
It is not sufficient to apply the "delay till last" pattern.
All other conditions remain the same as in the VRP example.
How can we proceed in order to solve this problem with Optaplanner?
Are there any examples of such constellations?
In the docs, take a look at the Design Patterns chapter, and specifically the "auto delay until last" pattern.
I have a simple statistical question and hope someone here has a quick answer.
I have a set of 200 documents; each document should contain exactly 3 pages. My assumption is that 100% of those documents have 3 pages. I want to take a sample that would statistically confirm that the set is homogeneous, meaning that all documents have exactly 3 pages. If I find even one document in the sample with != 3 pages, I know my set is inhomogeneous.
How many documents do I have to look at to be 80% sure my set is homogeneous? Should I have more than 200 documents in my base set, for instance 1000?
I am not sure, but I don't think that can be calculated from the given details; you would need to know the standard deviation of the base set.
You are trying to test whether all documents are 3 pages. A statistical test will not help here. In most cases what you will have are 5% and 1% significance tests that the mean number of pages is 3. This means there will be a 1 in 20 or 1 in 100 chance, respectively, that the pages might be different from 3.
I have an app that is displaying metrics about defects in a project.
I have the option of making one query that returns all the defects, and from that I can break out about four different metrics (how many defects escaped QA in 90 days, how many in 180 days, and then the same two metrics again but only counting sev1/sev2 defects).
I could make four queries and limit the results to one each so that I just get a count for each. Or I could make one query that encompasses them all (all defects that escaped QA in 180 days) and then count up the breakdowns myself.
I figure the number of defects that escaped QA in the last six months will generally be less than 100, and certainly less than 500 in the worst case.
Which would you do: four queries with one result each, or one single query that on average might return 50 results, perhaps 500 in the worst case?
And I guess the key question is: where are the inflection points? Perhaps tomorrow I have more metrics (who knows, 8?) and different average defect counts. Is there a rule of thumb I could use to help choose between the approaches?
Well, I would probably make the series of four queries and use the result counts. If you are expecting up to 500 defects, the single query will end up being three paged requests of up to 200 defects each anyway.
The solution where you do each individual query and use the total result count would be safe even with a very large number of defects. Plus, I usually find it a bad plan to assume I know the data sets an app will be dealing with. Most of my apps end up living much longer, and being used on larger datasets, than I intended.
The max page size is 200, so it sounds like you'd be requesting between 1 and 3 pages to get all the data vs. 4 queries with a page size of 1 and using the TotalResultCount...
You'd definitely have less aggregation code to write if you use the multi query approach (letting the server do the counting for you based on your supplied filters).
I'd guess the 4 independent queries might be faster but it would be interesting to hear back your experimental results...
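To make the trade-off concrete, here is a rough sketch of the two shapes being discussed; query_count and query_defects are hypothetical placeholders for whatever client API the app actually uses, and the filter keys are invented for illustration only.

from datetime import date, timedelta

# Hypothetical helpers standing in for the real client API:
#   query_count(filters)   -> int   (server-side count, one tiny response per query)
#   query_defects(filters) -> list  (full defect records, paged 200 at a time)

def metrics_via_four_queries(query_count):
    today = date.today()
    return {
        "escaped_90d":  query_count({"escaped_after": today - timedelta(days=90)}),
        "escaped_180d": query_count({"escaped_after": today - timedelta(days=180)}),
        "sev12_90d":  query_count({"escaped_after": today - timedelta(days=90),
                                   "severity": ["sev1", "sev2"]}),
        "sev12_180d": query_count({"escaped_after": today - timedelta(days=180),
                                   "severity": ["sev1", "sev2"]}),
    }

def metrics_via_one_query(query_defects):
    today = date.today()
    cutoff_90 = today - timedelta(days=90)
    # One broad query (everything that escaped QA in 180 days), counted locally.
    defects = query_defects({"escaped_after": today - timedelta(days=180)})
    sev12 = [d for d in defects if d["severity"] in ("sev1", "sev2")]
    return {
        "escaped_90d":  sum(1 for d in defects if d["escaped_on"] >= cutoff_90),
        "escaped_180d": len(defects),
        "sev12_90d":  sum(1 for d in sev12 if d["escaped_on"] >= cutoff_90),
        "sev12_180d": len(sev12),
    }

The first version keeps all the counting on the server and stays cheap no matter how many defects exist; the second pulls at most a few pages and trades a little client-side aggregation code for a single round of filters, which matches the trade-off described in the answers above.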