Adapting Machine Learning Algorithms to my Problem - authentication

I'm working on a project and need your ideas and advice.
First, let me describe the problem.
A machine has a power button and some other keys, and only one user is authorized to use it. There are
no other authentication methods, and the machine sits in a public area of a company.
The machine is operated by pressing the power button and some other keys in combination.
The order of pressing the keys is secret, but we don't trust that secrecy: anybody can learn the password and access the machine.
I have the capability of measuring the key hold time and also some other metrics,
such as the time differences between key presses (horizontal or vertical press-time differences).
All this means I have a set of inputs.
Now I'm trying to build a user profile by analysing these inputs.
My idea is to have the authorized user enter the password n times and derive a threshold (or something similar) from those samples.
This method could also be called BIOMETRICS: anyone else who knows the machine's button combination can try the password, but if their timing falls outside this range they cannot get access.
How can I adapt this into an algorithm? Where should I start?
I don't want to delve deep into machine learning, and I can see that on my first try the false positive and false negative rates will be really high, but I can manage that by changing my inputs.
Thanks.

To me this seems like a good candidate for a classification problem. You have two classes (correct password input / incorrect), and your data could be the times (from time 0) at which the buttons were pressed. You could train a learning algorithm by giving it several examples of correct and incorrect password data. Once your classifier is trained and working satisfactorily, you could use it to predict whether new password input attempts are correct.
You could try out several classifiers in Weka, a GUI-based machine learning tool: http://www.cs.waikato.ac.nz/ml/weka/
What you need is your data in a simple table format for experimenting in Weka, something like the following:
Attempt No | 1st button time | 2nd button time | 3rd button time | is_correct
-----------|-----------------|-----------------|-----------------|------------
1 | 1.2 | 1.5 | 2.4 | YES
2 | 1.3 | 1.8 | 2.2 | YES
3 | 1.1 | 1.9 | 2.0 | YES
4 | 0.8 | 2.1 | 2.9 | NO
5 | 1.2 | 1.9 | 2.2 | YES
6 | 1.1 | 1.8 | 2.1 | NO
This would be a training set. The outcome (which is known) is the class is_correct. You would run this data through Weka, selecting a classifier (Naive Bayes, for example). This would produce a classifier (for example, a set of rules) which could be used to predict future entries.
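If you'd rather experiment in code than in the Weka GUI, the same idea can be sketched with a hand-rolled Gaussian Naive Bayes over the toy table above (a minimal sketch, not production code; real data would need many more attempts per class):

```python
import math

# Training data from the table above: button press times and whether the
# attempt was the genuine user (YES) or not (NO).
DATA = [
    ((1.2, 1.5, 2.4), "YES"),
    ((1.3, 1.8, 2.2), "YES"),
    ((1.1, 1.9, 2.0), "YES"),
    ((0.8, 2.1, 2.9), "NO"),
    ((1.2, 1.9, 2.2), "YES"),
    ((1.1, 1.8, 2.1), "NO"),
]

def train(data):
    """Per-class prior plus per-feature mean and variance."""
    model = {}
    for label in {lbl for _, lbl in data}:
        rows = [x for x, lbl in data if lbl == label]
        prior = len(rows) / len(data)
        stats = []
        for feature in zip(*rows):
            mean = sum(feature) / len(feature)
            var = sum((v - mean) ** 2 for v in feature) / len(feature) + 1e-6
            stats.append((mean, var))
        model[label] = (prior, stats)
    return model

def classify(model, x):
    """Pick the class with the highest Gaussian log-likelihood."""
    def log_likelihood(prior, stats):
        ll = math.log(prior)
        for v, (mean, var) in zip(x, stats):
            ll += -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
        return ll
    return max(model, key=lambda lbl: log_likelihood(*model[lbl]))

model = train(DATA)
print(classify(model, (1.2, 1.8, 2.2)))  # YES: timings close to the genuine user
```

A fresh attempt whose timing vector sits near the YES cluster is accepted; one that drifts toward the NO examples is rejected, which is exactly the thresholding behaviour the question describes.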

The key to this sort of problem is devising good metrics. Once you have a vector of input values, you can use one of a number of machine learning algorithms to classify it as authorised or declined. So the first step should be to determine which metrics (of those you mention) will be the most useful, and pick a small number of them (5-10). You can probably benefit from collapsing some of them by averaging (for example, the average length of any key press, rather than a separate value for every key).
Then you will need to pick an algorithm. A good one for classifying vectors of real numbers is the support vector machine (SVM) - at this point you should read up on it, particularly on what the "kernel" function is, so you can choose one to use. Then you will need to gather a set of learning examples (vectors with a known result), train the algorithm on them, and test the trained SVM on a fresh set of examples to see how it performs. If the performance is poor with a simple kernel (e.g. linear), you may choose a higher-dimensional one. Good luck!
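The metric-collapsing step can be sketched like this; the raw (key, press time, release time) events and the choice of three summary features are illustrative assumptions, not a prescription:

```python
# Hypothetical raw capture for one password attempt:
# (key, press_time, release_time) in seconds from time 0.
attempt = [
    ("PWR", 0.00, 0.12),
    ("A",   0.45, 0.58),
    ("B",   0.90, 1.01),
    ("C",   1.40, 1.55),
]

def features(events):
    """Collapse the raw timings into a small fixed-length vector:
    average hold time, average gap between keys, and total duration."""
    holds = [release - press for _, press, release in events]
    gaps = [events[i + 1][1] - events[i][2] for i in range(len(events) - 1)]
    return (
        sum(holds) / len(holds),
        sum(gaps) / len(gaps),
        events[-1][2] - events[0][1],
    )

print(features(attempt))
```

The resulting 3-vector (rather than one value per key) is what you would feed to the SVM, keeping the feature count small regardless of password length.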

Related

Detecting anomalies among several thousand users

I have an issue where I record a daily entry for all users in my system (several thousand, even 100,000+). These entries have 3 main features: "date", "file_count", "user_id".
date       | file_count | user_id
-----------|------------|--------
2021-09-28 | 200        | 5
2021-09-28 | 10         | 7
2021-09-29 | 210        | 5
2021-09-29 | 50         | 7
Where I am in doubt is how to run an anomaly detection algorithm efficiently on all these users.
My goal is to be able to report whether a user has some abnormal behavior each day.
In this example, user 7 should be flagged as an anomaly because its file_count is suddenly 5x higher than "normal".
My idea was firstly to create a model for each user but since there are so many users this might not be feasible.
Could you explain how to do this in an efficient manner, if you know an algorithm that could solve this problem?
Any help is greatly appreciated!
Many articles on anomaly detection in audit data can be found on the Internet.
One simple article with many examples/approaches is available in the original language (Czech) here: https://blog.root.cz/trpaslikuv-blog/detekce-anomalii-v-auditnich-zaznamech-casove-rady/ or translated by Google here: https://blog-root-cz.translate.goog/trpaslikuv-blog/detekce-anomalii-v-auditnich-zaznamech-casove-rady/?_x_tr_sl=auto&_x_tr_tl=en&_x_tr_hl=sk&_x_tr_pto=wapp
PS: Clustering (a clustering-based unsupervised approach) can be the way to go if you are looking for a simple algorithm.
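One approach that avoids building a separate model object per user is a running z-score against each user's own history, computed in a single pass. A minimal sketch (the numbers are made up, but user 7's jump mirrors the question's example):

```python
from collections import defaultdict
from statistics import mean, stdev

# Daily rows: (date, file_count, user_id), as in the question's table.
rows = [
    ("2021-09-01", 9, 7), ("2021-09-02", 11, 7), ("2021-09-03", 10, 7),
    ("2021-09-04", 10, 7), ("2021-09-05", 50, 7),
    ("2021-09-01", 200, 5), ("2021-09-02", 210, 5), ("2021-09-03", 205, 5),
    ("2021-09-04", 195, 5), ("2021-09-05", 202, 5),
]

def daily_anomalies(rows, z_threshold=3.0):
    """Yield (user_id, date, file_count) for days where a user's count
    deviates from that user's own history by more than z_threshold
    standard deviations. One pass over date-sorted rows; the only state
    kept per user is their list of past counts."""
    history = defaultdict(list)
    for date, count, user in sorted(rows):
        past = history[user]
        if len(past) >= 3:  # need a little history before judging
            mu, sigma = mean(past), stdev(past)
            if sigma > 0 and abs(count - mu) / sigma > z_threshold:
                yield user, date, count
        past.append(count)

print(list(daily_anomalies(rows)))  # [(7, '2021-09-05', 50)]
```

For 100,000+ users you would keep only a bounded window of recent counts per user (or a running mean/variance), but the per-user state stays tiny either way, which is the point: no per-user model training is needed.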

What's the most efficient way to simulate a car's ECU logic?

I've been thinking about this a lot lately when driving my car: inside the ECU there is a memory module with pre-calculated values for almost everything. For example, the ECU can calculate how much fuel to inject based on several readings such as throttle position, current RPM, etc. When people remap their cars they change these predefined values, which in turn changes the output the ECU calculates in real time.
Let's keep it simple and imagine we have 2 parameters we constantly juggle on a predefined 2D map. We have 4 reference points: A1 (2000 RPM - 200 foo units), A2 (3000 RPM - 270 foo units), A3 (4000 RPM - 350 foo units), A4 (5000 RPM - 400 foo units). The question I'm struggling with is how you can calculate the exact amount of foo units at, say, 3650 RPM in real time on "slow" hardware without errors or delays.
I'd love to see some C-style pseudocode on how this could be implemented logic-wise to run efficiently. The first thing that comes to my mind is 2 arrays (a matrix), but things get messy when you account for multiple variables affecting the final outcome. I'd like to experiment with this and try to write a small program to do this kind of math, but I'm stuck on choosing a clean, sane way of representing and manipulating the values...
Sorry for no formatting, wrote this post on my phone!
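The usual answer to this is piecewise-linear interpolation over a sorted lookup table, which is what ECU fuel maps typically do. A sketch (in Python rather than the requested C pseudocode, using the reference points from the question):

```python
# Reference points from the question: (RPM, foo units).
TABLE = [(2000, 200), (3000, 270), (4000, 350), (5000, 400)]

def lookup(rpm, table=TABLE):
    """Piecewise-linear interpolation over a sorted lookup table.
    Inputs outside the table are clamped to the nearest endpoint."""
    if rpm <= table[0][0]:
        return table[0][1]
    if rpm >= table[-1][0]:
        return table[-1][1]
    # Find the segment containing rpm and interpolate between its ends.
    for (x0, y0), (x1, y1) in zip(table, table[1:]):
        if x0 <= rpm <= x1:
            return y0 + (y1 - y0) * (rpm - x0) / (x1 - x0)

print(lookup(3650))  # 270 + 80 * 650/1000 = 322.0
```

A real ECU would typically store the table with evenly spaced axis points so the segment can be found with a shift instead of a scan, and do the arithmetic in fixed-point integers, but the interpolation logic is the same. With two input variables (e.g. RPM and throttle) this generalizes to bilinear interpolation over a 2D grid.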

Database model to describe IT environment [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
I'm looking at writing a Django app to help document fairly small IT environments. I'm getting stuck at how best to model the data as the number of attributes per device can vary, even between devices of the same type. For example, a SAN will have 1 or more arrays, and 1 or more volumes. The arrays will then have an attribute of Name, RAID Level, Size, Number of disks, and the volumes will have attributes of Size and Name. Different SANs will have a different number of arrays and volumes.
Same goes for servers, each server could have a different number of disks/partitions, all of which will have attributes of Size, Used space, etc, and this will vary between servers.
Another device type may be a switch, which won't have arrays or volumes, but will have a number of network ports, some of which may be gigabit, others 10/100, others 10Gigabit, etc.
Further, I would like the ability to add device types in the future without changing the model. A new device type may be a phone system, which will have its own unique attributes which may vary between different phone systems.
I've looked into EAV database designs but it seems to get very complicated very quickly, and I'm unclear on whether it's the best way to go about this. I was thinking something along the lines of the model as shown in the picture.
http://i.stack.imgur.com/ZMnNl.jpg
A bonus would be the ability to create 'snapshots' of environments at a particular time, making it possible to view changes to the environment over time. Adding a date column to the attributes table may be a way to solve this.
For the record, this app won't need to scale very much (at most 1000 devices), so massive scalability isn't a big concern.
Since your attributes are per model instance and differ for each instance,
I would suggest going with a completely free schema:
class ITEntity(Model):
    name = CharField()

class ITAttribute(Model):
    name = CharField()
    value = CharField()
    entity = ForeignKey(ITEntity, related_name="attrs")
This is a very simple model, and you can do the rest - like templates (i.e. a switch template, router template, etc.) - in your app code. It's much more straightforward than using a complicated model like EAV (I do like EAV, but this does not seem like the use case for it).
Adding history is also simple - just add a timestamp to ITAttribute. When changing an attribute, create a new one instead. Then, when fetching an attribute, pick the one with the latest timestamp. That way you always have a point-in-time view of your environment.
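The timestamp-based history idea can be sketched in plain Python (hypothetical data, no ORM; in the Django app the list comprehension would become a queryset filter with ordering):

```python
from datetime import date

# Hypothetical attribute history: (entity, attribute name, value, timestamp).
# Changing an attribute appends a new row instead of updating in place.
attrs = [
    ("san1", "size", "10TB", date(2014, 1, 1)),
    ("san1", "size", "20TB", date(2015, 6, 1)),
]

def value_at(attrs, entity, name, when):
    """Point-in-time view: the newest matching row at or before `when`."""
    rows = [a for a in attrs if a[0] == entity and a[1] == name and a[3] <= when]
    return max(rows, key=lambda a: a[3])[2] if rows else None

print(value_at(attrs, "san1", "size", date(2015, 1, 1)))  # 10TB
```

Asking for a date before the change returns the old value, and a later date returns the new one, which gives you the environment "snapshot" feature for free.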
If you are more comfortable with something along the lines of the image you posted, below is a slightly modified version (sorry I can't upload an image, don't have enough rep).
+-------------+
| Device Type |
|-------------|
| type        |
+-------------+
       ^
       |
+---------------+     +--------------------+     +-----------+
| Device        |----<| DeviceAttributeMap |>----| Attribute |
|---------------|     |--------------------|     |-----------|
| name          |     | Device             |     | name      |
| DeviceType    |     | Attribute          |     +-----------+
| parent_device |     | value              |
| Site          |     +--------------------+
+---------------+
       |
       v
+-------------+
| Site        |
|-------------|
| location    |
+-------------+
I added a linker table, DeviceAttributeMap, so you have more control over an Attribute catalog, allowing queries for devices with the same Attribute but differing values. I also added a field in the Device model named parent_device, intended as a self-referential foreign key to capture a device's relationship to its parent device. You'll likely want to make this field optional; to make the foreign key parent_device optional in Django, set the field's null and blank attributes to True.
You could try a document based NoSQL database, like MongoDB. Each document can represent a device with as many different fields as you like.

Developing Rainbow Tables

I am currently working on a parallel computing project where I am trying to crack passwords using rainbow tables.
The first step I have thought of is to implement a very small version that cracks passwords of length 5 or 6 (numeric-only passwords to begin with). To begin with, I have some questions about the configuration settings.
1 - What size should I start with? My first guess is a table with 1000 (initial, final) pairs. Is this a good size to start with?
2 - Number of chains - I really found no information online about what the length of a chain should be.
3 - Reduction function - can someone give me any information about how I should go about building one?
Also, if anyone has any information or examples, it would be really helpful.
There is already a wealth of rainbow tables available online. Calculating rainbow tables simply moves the computational burden from the time the attack is run to the pre-computation stage.
http://www.freerainbowtables.com/en/tables/
http://www.renderlab.net/projects/WPA-tables/
http://ophcrack.sourceforge.net/tables.php
http://www.codinghorror.com/blog/2007/09/rainbow-hash-cracking.html
It's a time-space tradeoff. The longer the chains are, the fewer of them you need, so the less space the table will take up, but the longer cracking each password will take.
So, the answer is always to build the biggest table you can in the space that you have available. This will determine your chain length and number of chains.
As for choosing the reduction function, it should be fast and behave pseudo-randomly. For your proposed plaintext set, you could just pick 20 bits from the hash and interpret them as a decimal number (choosing a different set of 20 bits at each step in the chain).
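A minimal sketch of such a reduction function, assuming MD5 and a 6-digit numeric keyspace for illustration (10^6 plaintexts is roughly 2^20, which is why 20 bits of the hash suffice):

```python
import hashlib

KEYSPACE = 10 ** 6  # 6-digit numeric passwords (an assumption from the question)

def reduce_hash(digest: bytes, step: int) -> str:
    """Pick a step-dependent 3-byte window from the digest, keep ~20 bits
    of it, and fold the result into the plaintext space. Using a different
    window at each chain step makes each step's reduction distinct."""
    start = step % (len(digest) - 3)
    window = int.from_bytes(digest[start:start + 3], "big")  # 24 bits
    return f"{(window >> 4) % KEYSPACE:06d}"                 # zero-padded plaintext

def chain_step(plaintext: str, step: int) -> str:
    """One hash-then-reduce step of a rainbow chain."""
    digest = hashlib.md5(plaintext.encode()).digest()
    return reduce_hash(digest, step)

print(chain_step("123456", 0))
```

Iterating chain_step with step = 0, 1, 2, ... builds one chain; only the first and last plaintexts are stored. The function is fast, deterministic, and spreads hash outputs roughly uniformly over the keyspace, which is all a reduction function needs to do.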

Should I initialize my AUTO_INCREMENT id column to 2^32+1 instead of 0?

I'm designing a new system to store short text messages [sic].
I'm going to identify each message by a unique identifier in the database, and use an AUTO_INCREMENT column to generate these identifiers.
Conventional wisdom says that it's okay to start with 0 and number my messages from there, but I'm concerned about the longevity of my service. If I make an external API, and make it to 2^31 messages, some people who use the API may have improperly stored my identifier in a signed 32-bit integer. At this point, they would overflow or crash or something horrible would happen. I'd like to avoid this kind of foo-pocalypse if possible.
Should I "UPDATE message SET id=2^32+1;" before I launch my service, forcing everyone to store my identifiers as signed 64-bit numbers from the start?
If you want to achieve your goal and avoid the problems that cletus mentioned, the solution is to set your starting value to 2^32+1. There are still plenty of IDs to go, and the value won't fit in a 32-bit integer, signed or otherwise.
Of course, documenting the value's range and providing guidance to your API or data customers is the only right solution. Someone's always going to try to stick a long into a char and wonder why it doesn't work.
What if you provided a test suite or a test service that used messages in the "high but still valid" range, and persuaded your service users to validate their code with it? Starting at an arbitrary value for defensive reasons seems a little weird to me; providing sanity tests sits better.
Actually 0 can be problematic with many persistence libraries. That's because they use it as some sort of sentinel value (a substitute for NULL). Rightly or wrongly, I would avoid using 0 as a primary key value. Convention is to start at 1 and go up. With negative numbers you're likely just to confuse people for no good reason.
If everyone alive on the planet sent one message per second every second non-stop, your counter wouldn't wrap until the year 2050 using 64 bit integers.
Probably just starting at 1 would be sufficient.
(But if you did start at the signed 64-bit lower bound, it would extend into the start of 2092.)
Why use incrementing IDs? These require locking and will kill any plans for distributing your service over multiple machines. I would use UUIDs. API users will likely store these as opaque character strings, which means you can probably change the scheme later if you like.
If you want to ensure that messages have an order, implement the ordering like a linked list:
---
id: 61746144-3A3A-5555-4944-3D5343414C41
msg: "Hello, world"
next: 006F6F66-0000-0000-655F-444E53000000
prev: null
posted_by: jrockway
---
id: 006F6F66-0000-0000-655F-444E53000000
msg: "This is my second message EVER!"
next: 00726162-0000-0000-655F-444E53000000
prev: 61746144-3A3A-5555-4944-3D5343414C41
posted_by: jrockway
---
id: 00726162-0000-0000-655F-444E53000000
msg: "OH HAI"
next: null
prev: 006F6F66-0000-0000-655F-444E53000000
posted_by: jrockway
(As an aside, if you are actually returning the results as YAML, you can use & and * references instead of just using the IDs as data. Then the client will get the linked-list structure "for free".)
One thing I don't understand is why developers don't grasp that they don't need to expose their AUTO_INCREMENT field. For example, richardtallent mentioned using GUIDs as the primary key. I say go one better: use a 64-bit int for your table ID/primary key, but also use a GUID, or something similar, as your publicly exposed ID.
An example Message table:
Name | Data Type
-------------------------------------
Id | BigInt - Primary Key
Code | Guid
Message | Text
DateCreated | DateTime
Then your data looks like:
Id | Code                                 | Message | DateCreated
---|--------------------------------------|---------|--------------------------
1  | 81e3ab7e-dde8-4c43-b9eb-4915966cf2c4 | ....... | 2008-09-25T19:07:32-07:00
2  | c69a5ca7-f984-43dd-8884-c24c7e01720d | ....... | 2007-07-22T18:00:02-07:00
3  | dc17db92-a62a-4571-b5bf-d1619210245a | ....... | 2001-01-09T06:04:22-08:00
4  | 700910f9-a191-4f63-9e80-bdc691b0c67f | ....... | 2004-08-06T15:44:04-07:00
5  | 3b094cf9-f6ab-458e-965d-8bda6afeb54d | ....... | 2005-07-16T18:10:51-07:00
Where Code is what you would expose to the public, whether via a URL, service, CSV, XML, etc.
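The internal-Id/public-Code split can be sketched like this, with in-memory dicts standing in for the Message table:

```python
import uuid

# In-memory stand-ins for the Message table: the integer Id stays
# internal, and only the random Code is ever handed out.
messages = {}   # internal id -> row
by_code = {}    # public code -> internal id
next_id = 1

def create_message(text):
    """Insert a row with an auto-incrementing internal Id and a random
    public Code; return only the Code."""
    global next_id
    code = str(uuid.uuid4())
    messages[next_id] = {"Id": next_id, "Code": code, "Message": text}
    by_code[code] = next_id
    next_id += 1
    return code  # the only identifier that leaves the system

def get_message(code):
    return messages[by_code[code]]["Message"]

code = create_message("hello")
print(get_message(code))  # hello
```

Because clients only ever see the Code, the internal counter can start at 1, change width, or be replaced entirely without breaking anyone's stored identifiers.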
Don't want to be the next Twitter, eh? lol
If you're worried about scalability, consider using a GUID (uniqueidentifier) instead.
They are only 16 bytes (twice that of a bigint), but they can be assigned independently on multiple database or BL servers without worrying about collisions.
Since they are random, use NEWSEQUENTIALID() (in SQL Server) or a COMB technique (in your business logic or pre-MSSQL 2005 database) to ensure that each GUID is "higher" than the last one (speeds inserts into your table).
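The COMB idea can be sketched like this; the 6-byte millisecond-timestamp layout is one common variant and an assumption here, not the exact NEWSEQUENTIALID() scheme:

```python
import time
import uuid

def comb_guid():
    """COMB-style GUID: a random UUID whose first 6 bytes are replaced
    by a millisecond timestamp, so successive values sort by creation
    time and clustered-index inserts stay roughly append-only."""
    ts = int(time.time() * 1000).to_bytes(6, "big")
    rand = uuid.uuid4().bytes
    return uuid.UUID(bytes=ts + rand[6:])

a = comb_guid()
time.sleep(0.01)  # a later wall-clock time yields a "higher" GUID
b = comb_guid()
print(a < b)  # True: b was created later
```

The remaining 10 bytes stay random, so collision resistance across independent servers is largely preserved while the sort order tracks insertion time.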
If you start with a number that high, some "genius" programmer will either subtract 2^32 to squeeze it in an int, or will just ignore the first digit (which is "always the same" until you pass your first billion or so messages).