Data preprocessing of click stream data in real time - pandas

I am working on a project to detect anomalies in web users' activity in real time. Any ill-intentioned or malicious activity by a user has to be detected in real time. The input is clickstream data, where each click record contains a user ID (unique user identifier), click URL (URL of the web page), click text (the text/function on the website that the user clicked) and information (any text typed by the user). The project is similar to an intrusion detection system (IDS). I am using Python 3.6 and have the following queries:
What is the best approach to data preprocessing, considering that all the attributes in the dataset are categorical?
Encoding methods like one-hot encoding or label encoding could be applied, but the data has to be processed in real time, which makes them difficult to apply.
As per the project requirements, three columns (click URL, click text and typed information) are considered feature columns.
I am really confused about how to approach the data preprocessing. Any insight or suggestions would be appreciated.

In some recent personal and professional projects, when faced with the challenge of applying ML to streaming data, I have had success with the Python library River: https://github.com/online-ml/river.
Some online algorithms can handle categorical values directly (like Hoeffding trees), so depending on what you want to achieve you may not need to do any preprocessing at all.
If you do need preprocessing, label encoding and one-hot encoding can both be applied in an incremental fashion. Below is some code to get you started. River also has a number of classes to help with feature extraction and feature selection, e.g. TF-IDF, bag of words or frequency aggregations.
online_label_enc = {}
feature_of_interest = "click_url"  # whichever feature column you are encoding

# Incremental label encoding: assign each previously unseen categorical
# value the next integer ID as it arrives in the stream.
for click in click_stream:
    value = click[feature_of_interest]
    try:
        label_enc = online_label_enc[value]
    except KeyError:
        # First time this value is seen: register it under a new ID.
        online_label_enc[value] = len(online_label_enc)
        label_enc = online_label_enc[value]
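Since River's feature-extraction helpers were mentioned above, here is a minimal sketch of an incremental pipeline. The labelled stream and the "click_text" field are made-up assumptions, and the class names match recent River releases, so check them against your installed version.
from river import compose, feature_extraction, tree

# TF-IDF on the clicked text feeding an online Hoeffding tree classifier.
model = compose.Pipeline(
    feature_extraction.TFIDF(on="click_text"),
    tree.HoeffdingTreeClassifier(),
)

# Hypothetical labelled stream of (click, is_malicious) pairs.
labelled_click_stream = [
    ({"click_text": "transfer funds"}, False),
    ({"click_text": "drop table users"}, True),
]

for click, label in labelled_click_stream:
    y_pred = model.predict_one(click)  # score the click before learning...
    model.learn_one(click, label)      # ...then update the model incrementally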
I am not sure exactly what you are asking, but if you are approaching the problem online/incrementally, then extract the features you want and pass them to your online algorithm of choice, which should then update and learn at every data increment.

Related

Kapow Robot - Extract business Operating hours from Google Search Results

Is it possible to create a Kapow Robot that can search Google for the operating hours of the businesses in our list/database and update the timings if changes are made?
Please also share any approaches more efficient than a Kapow robot that could be implemented with minimal effort and good cost-effectiveness.
That's what the Google Places API is there for. While you could in theory just open Google Maps in a Load Page action, enter the query string and then parse the results, I would advise against it. Here's why:
The API will be faster, returning results in a structured manner (JSON)
Kapow has actions for calling RESTful services and parsing/modifying JSON
Google does not like robots parsing their pages, and most likely will lock you out (i.e. present you with Captchas sooner or later)
If you decide to go for the API, here's what you should do:
Get your API key first; see this page for details: https://developers.google.com/places/web-service/get-api-key. Note that the free plan allows 1,000 requests per 24 hours (https://developers.google.com/places/web-service/usage).
Maintain the place ids for all the businesses you'd like to query regularly, and update your list.
For each place, retrieve the details as described in the API documentation. The opening hours will be within the JSON response (see the sketch after this list): https://developers.google.com/places/web-service/details
Update your list. I'd recommend using a definite type in Kapow for that, together with the Store in Database and Query Database actions. In case you need the data elsewhere, you can create additional robots (e.g. for Excel files, sending data by email, et cetera).
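For illustration, here is the sketch referenced in step 3: a hedged Python example of the details call. The endpoint and the opening_hours/weekday_text response fields follow the docs linked above; the API key and place id are placeholders.
import requests

API_KEY = "YOUR_API_KEY"                  # placeholder
PLACE_ID = "ChIJN1t_tDeuEmsRUsoyG83frY4"  # example place id

resp = requests.get(
    "https://maps.googleapis.com/maps/api/place/details/json",
    params={"placeid": PLACE_ID, "key": API_KEY},
)
details = resp.json()

# opening_hours is only present when Google has the data for this place.
hours = details.get("result", {}).get("opening_hours", {})
for day in hours.get("weekday_text", []):
    print(day)
In Kapow itself you would do the equivalent with the RESTful service and JSON actions mentioned earlier; the sketch just shows the shape of the request and response.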

Best way to deal with IDs only containing numbers

We're trying to display some booking information to users, and we ask them for the booking ID, which is a 10-digit number like this one: 1553296942.
In the stories, we try to identify the user input with an intent called bookingStatus and an entity called uid.
The thing is, these IDs are recognized as the wit/location type (they look like coordinates to it, I guess), and it doesn't recognize them properly most of the time.
What would be the best approach to handle this situation?
For now, in the Understanding tab we're feeding the bot lots of these IDs, adding the bookingStatus intent and marking them as the uid entity as well. Is this the right thing to do, and should we continue training it this way?
You can keep feeding it 10-digit numbers. There are only 10^10 possibilities for the uid entity, so in principle you could feed every one of them with a simple curl command wrapped in a loop.
How do you feed your NLP app without the Understanding tab?
Check the HTTP API Docs here
https://wit.ai/docs/http/20160526#post--entities-:entity-id-values-link
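By way of a hedged sketch (Python instead of curl), adding values to the uid entity might look like this; the endpoint path follows the docs linked above, but the token header and payload fields are assumptions to verify against your API version.
import requests

WIT_TOKEN = "YOUR_SERVER_ACCESS_TOKEN"  # placeholder
HEADERS = {"Authorization": "Bearer " + WIT_TOKEN}

# Add a handful of sample IDs as known values of the `uid` entity.
for uid in ("1553296942", "1553296943", "1553296944"):
    resp = requests.post(
        "https://api.wit.ai/entities/uid/values",
        headers=HEADERS,
        json={"value": uid},
    )
    resp.raise_for_status()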
Have a nice day!

API architecture - output consistent data in SI units or intelligently adapt to return user friendly data and units

Two scenarios. The first is an API that consistently outputs data in SI units: if a device transmits 0.0001 V you get that same output; if it posts 1000 W it returns 1000 W. Any sanitization to make the data more user-friendly has to be done by the application making the GET requests, and potentially many applications will require user-friendly data.
The alternative approach is to code intelligence into the API so that the output is user-friendly: if a device posts 10000 W, the user gets 10 kW back. Basically, if a figure can be represented with fewer digits and a more appropriate unit, the API figures that out and returns it in that form. The output is then not consistent but depends on the values themselves.
In terms of RESTful API design and best practices, which method is more appropriate, and why? The argument is that since many applications will require user-friendly data, doing the conversion once in the back end would save time and energy.
Do both. Include the actual numerical value, the unit, and the user-friendly value as three separate properties in the response.
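For illustration, a 10000 W reading could then be returned with all three properties; the field names here are made up for the example rather than a prescribed schema.
import json

reading = {
    "value": 10000,      # raw value, always in the SI unit given below
    "unit": "W",         # SI unit of the raw value
    "display": "10 kW",  # pre-formatted, user-friendly representation
}
print(json.dumps(reading, indent=2))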

Structuring user-entered data - design/architecture

I am working on an outdoors site that lets users enter the types of things they saw while outdoors. Obviously, any item can be written in a large number of ways.
For example, the animal "coyote" can be written like this:
coyote,
wolf,
coyotes,
wild coyotes,
cayotees
So if I let users enter data, how can I have the system understand that all the above examples are about something classified as a "coyote"?
Why don't you try relying on Google Sets for each new entry, and then create links to already existing matched entries in your system?
You could even crowdsource the validity checking of the links by adding a "Report as unrelated" function.
There are various (non-official) versions of a Google Sets API.
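To make the linking step concrete, here is a hedged sketch that uses Python's stdlib difflib in place of the Google Sets lookup described above; the canonical term list and the plural-stripping rule are assumptions for the example.
from difflib import get_close_matches

canonical_terms = ["coyote", "deer", "hawk"]  # hypothetical existing entries

def link_entry(user_text):
    # Normalise: lowercase and strip a trailing plural "s" before matching.
    token = user_text.strip().lower().rstrip("s")
    matches = get_close_matches(token, canonical_terms, n=1, cutoff=0.6)
    return matches[0] if matches else None

print(link_entry("cayotees"))      # -> coyote
print(link_entry("wild coyotes"))  # -> coyote
A "Report non-related" button would then let users flag bad links, crowdsourcing the validation as suggested above.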

RSS Feeds or API to access REIT information?

I have a web application that needs to display up-to-date information on REITs and tickers like AX.UN, BEI.UN, CAR.UN, etc.
For example, I need to automate consumption of information on pages such as
http://ca.finance.yahoo.com/q?s=AX-UN.TO
Are there RSS feeds or APIs I can use to import this kind of data? I don't want to copy and paste this information into my website on a daily basis.
On the very site you link to, there's a small link that says "download data". If you have a database with the symbols you want to track, it would be pretty easy to construct that download URL on the fly, query it, parse the CSV, and load that data into your database for display on your website.
ETA: Here's the link:
http://ca.finance.yahoo.com/d/quotes.csv?s=AX-UN.TO&f=sl1d1t1c1ohgv&e=.csv
Just have your program replace the "AX-UN.TO" with whatever symbol you want, and it should grab the data for that symbol.
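For example, here is a hedged Python sketch of that substitute-and-fetch approach (note the quotes.csv endpoint shown above is historical and has since been retired by Yahoo, so treat it as illustrative only).
import csv
import io
import urllib.request

def fetch_quote(symbol):
    # Build the download URL with the requested symbol substituted in.
    url = ("http://ca.finance.yahoo.com/d/quotes.csv?s=" + symbol +
           "&f=sl1d1t1c1ohgv&e=.csv")
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode("utf-8")
    # The response is a single CSV row of the requested fields:
    # symbol, last trade, date, time, change, open, high, low, volume.
    return next(csv.reader(io.StringIO(text)))

print(fetch_quote("AX-UN.TO"))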
Take a look at http://www.mergent.com/servius - particularly the Historical Securities Data API. Not sure if it has REIT data - if it doesn't, it may be available by special arrangement from them.