How to consider features while forecasting?

I have to forecast the future utilisation of my employees based on their past data, broken down by zone and slot.
Zone and slot are the two features I want to include while forecasting. Any suggestions on how to proceed?
The data looks like this:
dt zone slot utilization
-- ---- ----- -----------
2019-06-23 236 1 87.018695
2019-07-07 218 3 37.497308
2019-07-08 218 2 49.132561
Python is the programming language we are using here.

Maybe you can give some more details? You say you want to forecast the utilization value using the zone and slot features, but then you also mention 'hour'. What is 'hour'?
To answer your question: this type of problem can be treated as a regression problem, since you want to estimate a numerical value, utilization. What language are you working with? For Python, the scikit-learn library has some easy-to-use regression models.
Also, if you consider your explanatory features zone and slot to be categorical, you will probably need to one-hot (dummy) encode them.
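For concreteness, here is a minimal sketch of that approach with pandas and scikit-learn, using the three sample rows above. The calendar features (day of week, month) and the random-forest regressor are illustrative assumptions, not something from the original question.

```python
# Minimal sketch: treat utilization forecasting as a regression problem,
# with zone and slot one-hot encoded and simple calendar features from the date.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

df = pd.DataFrame({
    "dt": pd.to_datetime(["2019-06-23", "2019-07-07", "2019-07-08"]),
    "zone": [236, 218, 218],
    "slot": [1, 3, 2],
    "utilization": [87.018695, 37.497308, 49.132561],
})

# Derive calendar features so the model can pick up weekly/seasonal patterns.
df["dayofweek"] = df["dt"].dt.dayofweek
df["month"] = df["dt"].dt.month

X = df[["zone", "slot", "dayofweek", "month"]]
y = df["utilization"]

model = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), ["zone", "slot"])],
        remainder="passthrough",
    )),
    ("regress", RandomForestRegressor(n_estimators=200, random_state=0)),
])
model.fit(X, y)

# Predict utilization for a future date, zone and slot.
future = pd.DataFrame({"zone": [218], "slot": [1], "dayofweek": [2], "month": [8]})
print(model.predict(future))
```

With more history you would also want to validate with a time-based split (train on earlier dates, test on later ones) rather than a random split.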

Related

Pyomo mixed-integer linear optimization with multiple sets of variables, binary variables, and prices

Hello everyone,
I am a new user to pyomo and have some problem at the moment.
I am trying to develop a multi-time-period mixed-integer optimization problem in Python with Pyomo. I have 4 technologies for which I want to optimize the capacity over 12 periods (1-12). If technology 1 is chosen in a period, technology 2 is not chosen in that period; the same goes for technologies 3 and 4. Each technology has its own price per period. I set up a list of the variables for each technology in each period (x11-x124), a list for the binary variable of each technology in each period, and a list for the price of each technology in each period. However, I am unable to write a working objective function over all these variables.
I would appreciate any help!
Below is an image of the code I have tried. I get the error: list indices must be integers or slices, not str.
I have also tried first transforming the lists into numpy arrays, but I then get an error because I cannot use numpy arrays directly in a Pyomo optimization.
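A minimal sketch of how this kind of model is usually written in Pyomo, using indexed Sets, Params and Vars instead of plain Python lists (indexing Python lists with set elements is what typically triggers the "list indices must be integers or slices, not str" error). The price data, demand, big-M value and the minimization objective below are illustrative assumptions, not taken from the question:

```python
# Minimal sketch (not the asker's actual model): index variables by Pyomo Sets
# instead of storing them in Python lists.
import pyomo.environ as pyo

model = pyo.ConcreteModel()
model.T = pyo.RangeSet(1, 12)                      # periods 1..12
model.TECH = pyo.Set(initialize=[1, 2, 3, 4])      # four technologies

# Hypothetical price data: price[(tech, period)]
price = {(k, t): 10.0 + k + 0.5 * t for k in [1, 2, 3, 4] for t in range(1, 13)}
model.price = pyo.Param(model.TECH, model.T, initialize=price)

model.x = pyo.Var(model.TECH, model.T, domain=pyo.NonNegativeReals)  # capacity
model.y = pyo.Var(model.TECH, model.T, domain=pyo.Binary)            # chosen?

# Mutually exclusive pairs: tech 1 vs 2 and tech 3 vs 4, per period.
def exclusive_12(m, t):
    return m.y[1, t] + m.y[2, t] <= 1
model.c12 = pyo.Constraint(model.T, rule=exclusive_12)

def exclusive_34(m, t):
    return m.y[3, t] + m.y[4, t] <= 1
model.c34 = pyo.Constraint(model.T, rule=exclusive_34)

# Capacity only if the technology is chosen (big-M link, M assumed to be 100).
def link(m, k, t):
    return m.x[k, t] <= 100 * m.y[k, t]
model.link = pyo.Constraint(model.TECH, model.T, rule=link)

# Hypothetical demand that must be covered in every period.
def demand(m, t):
    return sum(m.x[k, t] for k in m.TECH) >= 50
model.demand = pyo.Constraint(model.T, rule=demand)

# Objective: total cost over all technologies and periods.
model.obj = pyo.Objective(
    expr=sum(model.price[k, t] * model.x[k, t] for k in model.TECH for t in model.T),
    sense=pyo.minimize,
)
```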

Detecting anomalies among several thousand users

I have this issue where I record a daily entry for all users in my system (several thousand, even 100,000+). These entries have 3 main features: "date", "file_count", "user_id".
date        file_count  user_id
----------  ----------  -------
2021-09-28  200         5
2021-09-28  10          7
2021-09-29  210         5
2021-09-29  50          7
Where I am in doubt is how to run an anomaly detection algorithm efficiently on all these users.
My goal is to be able to report whether a user has some abnormal behavior each day.
In this example, user 7 should be flagged as an anomaly because the file_count is suddenly 5x higher than "normal".
My first idea was to create a model for each user, but since there are so many users this might not be feasible.
Could you explain how to do this efficiently, or point me to an algorithm that could solve this problem?
Any help is greatly appreciated!
Many articles on anomaly detection in audit data can be found on the Internet.
One simple article with many examples/approaches is available in the original (Czech) language here: https://blog.root.cz/trpaslikuv-blog/detekce-anomalii-v-auditnich-zaznamech-casove-rady/ or translated via Google here: https://blog-root-cz.translate.goog/trpaslikuv-blog/detekce-anomalii-v-auditnich-zaznamech-casove-rady/?_x_tr_sl=auto&_x_tr_tl=en&_x_tr_hl=sk&_x_tr_pto=wapp
PS: Clustering (a clustering-based unsupervised approach) can be the way to go when you are looking for a simple algorithm.
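If an even simpler baseline is acceptable, one way to handle all users at once is a vectorized per-user rolling statistic rather than one model per user. A minimal sketch with pandas (the 30-day window and the 3-sigma/3x thresholds are illustrative assumptions):

```python
# Minimal sketch: flag a user's daily file_count as anomalous when it is far
# above that user's own rolling baseline. A single vectorized groupby handles
# all users at once, so no per-user model is needed.
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2021-09-28", "2021-09-28", "2021-09-29", "2021-09-29"]),
    "file_count": [200, 10, 210, 50],
    "user_id": [5, 7, 5, 7],
}).sort_values(["user_id", "date"])

# Rolling mean/std of each user's history, excluding the current day (shift(1)).
grp = df.groupby("user_id")["file_count"]
df["baseline_mean"] = grp.transform(lambda s: s.shift(1).rolling(30, min_periods=1).mean())
df["baseline_std"] = grp.transform(lambda s: s.shift(1).rolling(30, min_periods=1).std())

# Flag days that are more than 3 standard deviations above the baseline,
# and also at least 3x the baseline mean (guards against a tiny std).
df["anomaly"] = (
    (df["file_count"] > df["baseline_mean"] + 3 * df["baseline_std"].fillna(0))
    & (df["file_count"] > 3 * df["baseline_mean"])
)
print(df[["date", "user_id", "file_count", "anomaly"]])
```

On the sample rows this flags user 7 on 2021-09-29 (50 vs a baseline of 10) but not user 5 (210 vs 200).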

Using Optaplanner for long trip planning of a fleet of vehicles in a Vehicle Routing Problem (VRP)

I am applying the OptaPlanner VRP example with time windows, and I get feasible solutions whenever I define time windows within a 24-hour range (00:00 to 23:59). But I need to:
Manage long trips, where I know that the duration between leaving the depot and the first visit, or between visits, will be more than 24 hours. Currently it does not give me workable solutions because the time-window format is a 24-hour format: when the "arrivalAfterDueTime" scoring rule is applied, "arrivalTime" is always higher than "dueTime", because "dueTime" lies in the range 00:00 to 23:59 while "arrivalTime" falls on the next day.
I have thought that I should take each time window of each Customer and add more time windows to it, one for each day that is planned.
For example, if I am planning a trip over 3 days and Customer 1 is available from [08:00-10:00], then I would give it 3 time windows, saying it is also available from [32:00-34:00] and [56:00-58:00], which are the equivalent windows on the following days.
I handle the times as long values, converted to milliseconds.
I don't know if this is the right way. My question is really about ideas for approaching this constraint; if you have faced a similar problem, any idea would be very much appreciated.
Sorry for the wording, I am a Spanish speaker. Thank you.
Without having checked the example, handling multiple days shouldn't be complicated. It all depends on how you model your time variable.
For example, you could:
model the timestamps as a long value denoting seconds since epoch. This is how most of the examples are modeled, if I remember correctly. It is not very human-readable, but it is the fastest to compute with;
use a time data type, e.g. LocalTime. This is a human-readable time format, but it only works within a 24-hour range and is slower than a primitive data type;
use a date-time data type, e.g. LocalDateTime. This is also human-readable, works over any time range, and is likewise slower than a primitive data type.
I would strongly encourage you not to simply map the current day or current hour to a zero value and start counting from there. In your example you denote the times as [32:00-34:00], which makes it look as if you are using the current day's midnight as hour 0 and counting up from there. While you can do this, it will hurt the debuggability and maintainability of your code. That is just my general advice; you don't have to follow it.
What I would advise is to have your own domain models and map them to OptaPlanner models in which you use a long value, denoted as seconds since epoch, for any timestamp (see the sketch below).
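A small illustration of that representation (plain Python, not OptaPlanner's API; the dates are made up): once every time window and arrival time is an epoch-seconds value, comparisons work across any number of days.

```python
# Illustration: represent time windows and arrival times as epoch seconds so
# that multi-day trips compare correctly, with no [32:00-34:00] style windows.
from datetime import datetime, timezone

def to_epoch_seconds(dt: datetime) -> int:
    return int(dt.replace(tzinfo=timezone.utc).timestamp())

# Customer window 08:00-10:00 on day 2 of the plan; arrival on day 2 at 09:30.
ready_time = to_epoch_seconds(datetime(2023, 5, 2, 8, 0))
due_time = to_epoch_seconds(datetime(2023, 5, 2, 10, 0))
arrival_time = to_epoch_seconds(datetime(2023, 5, 2, 9, 30))

# The comparison behind an "arrivalAfterDueTime"-style rule now works even when
# the arrival falls on a later day than the departure.
arrival_after_due = arrival_time > due_time
print(arrival_after_due)  # False
```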

SQL database standard way of designing a table

There is an argument between me and my colleague about the design of a table in a SQL database. The objective of the table is to store the values of different types of parameters based on date and time.
My suggestion is to create the table as below:
id date time temperature pressure duration flowrate steps
1 1/27/2018 11:13:00 24.5 0.343 57 8 pumping start
2 1/28/2018 12:13:00 25.4 0.452 788 10 pumping end
3 1/29/2018 13:13:00 24.5 3.342 332 6 pumping start
4 1/30/2018 14:13:00 30.5 4.323 33 3 vacuum start
5 1/31/2018 15:13:00 24.5 0.358 232 8 pumping start
As you can see, the 'tags' represent different parameters, each of which has a different data type: double, int, text, etc.
My arguments are:
we should not store numbers as text
we should not store multiple types in one column
queries become complicated; you may need a lot of WHEN/AND clauses
values must be converted from text back to numeric types when doing calculations
My colleague's idea is to design the table as below:
id date time tags value(use text data type)
1 1/27/2018 11:13:00 temperature 24.5
2 1/27/2018 12:13:00 pressure 0.343
3 1/27/2018 13:13:00 duration 57
4 1/27/2018 14:13:00 flowrate 8
5 1/27/2018 15:13:00 pressure 9
6 1/27/2018 16:13:00 temperature 30.1
7 1/27/2018 17:13:00 temperature 23.4
8 1/27/2018 18:13:00 steps pumping start
9 1/27/2018 19:13:00 steps pumping end
His arguments are:
each tag is independent in terms of timing
no structure modification is needed when we add a tag
it reduces the size of the database
Apparently my words are not enough to convince him, and perhaps I am wrong in this case. So I need your advice on which is the best practice, and why. It would be even better to have an official reference about standards/normalization on this topic, so that I can make my case stronger.
There are really two separable questions here.
The first is whether or not two parameters like temperature and pressure should be bound to the same date and time by placing them in the same row. It sounds like, in the real world, these two parameters come out of one observation, that has one date and time. So binding them together is both more efficient and better data management.
The second question is whether making the database structure independent of the specific tags is a good idea or a bad idea. Your friend's design is very much like the EAV (entity-attribute-value) pattern, or anti-pattern, depending on your point of view. This is a deep philosophical debate with passionate partisans on both sides, and it is unlikely to be resolved between you and your friend.
I'm firmly in the anti-EAV camp, though I'm forced to admit there are some exceptional cases where EAV turns out to be the right way to go. These are cases where analyzing the subject matter to discover the data is impossible or impractical, and you have to capture data before you understand the scope of the project.
Most of the time, data analysis of the subject matter is eminently practical and worthwhile, even though time consuming. The result is a database whose logical structure mirrors the conceptual structure of the real world. When the information requirements change (such as a new tag), the structure of the database changes.
Changing the structure of the database is labor intensive, and transforming the existing data is hard. But the result is much better data management, where the data definitions inside the DBMS are helping you with data management. It's both better use of machine resources and better use of human resources.
So I think you are right in the argument, but unlikely to prevail over your friend. Your friend would rather do his own data management, without the DBMS helping or hindering. Good luck to him. He's going to need it when his projects get beyond the beginner stage.
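To make the query-complexity argument concrete, here is a small sketch (in pandas, purely illustrative, with made-up rows) of the extra work the tag/value layout forces on every read: the narrow rows have to be pivoted back into one observation per timestamp, and numeric values cast back from text.

```python
# Illustration of reading the EAV ("tags") layout: pivot back to one row per
# timestamp, then convert the text values to numbers before any calculation.
import pandas as pd

eav = pd.DataFrame({
    "dt": pd.to_datetime(["2018-01-27 11:13"] * 3),
    "tag": ["temperature", "pressure", "steps"],
    "value": ["24.5", "0.343", "pumping start"],   # everything stored as text
})

wide = eav.pivot(index="dt", columns="tag", values="value")
# Numeric columns come back as strings and must be converted before any math.
wide["temperature"] = pd.to_numeric(wide["temperature"])
wide["pressure"] = pd.to_numeric(wide["pressure"])
print(wide)
```

With the wide table from the question, none of this reconstruction or casting is needed; the DBMS enforces one type per column.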
I think this is the best way:
id date time temperature pressure duration flowrate steps
1 1/27/2018 11:13:00 24.5 0.343 57 8 pumping start
2 1/28/2018 12:13:00 25.4 0.452 788 10 pumping end
3 1/29/2018 13:13:00 24.5 3.342 332 6 pumping start
4 1/30/2018 14:13:00 30.5 4.323 33 3 vacuum start
5 1/31/2018 15:13:00 24.5 0.358 232 8 pumping start
The second design has fewer columns, but it contains a lot of repeated data: records 1-5 all describe the same observation split across several rows, which makes the database heavier and full of duplication.
I hope this is helpful.

Adapting Machine Learning Algorithms to my Problem

I'm working on a project and need your ideas and advice.
First of all, let me describe my problem.
A machine has a power button and some other keys, and only one user is authorised to use it. There are no other authentication methods, and the machine is in a public area of a company.
The machine is operated by pressing the power button and some other keys in combination.
The order of the key presses is secret, but we don't trust that alone: anybody could learn the password and access the machine.
I am able to measure the key hold time, as well as other metrics such as the time differences between key presses (e.g. horizontal or vertical key-press time differences).
All this means I have some inputs.
Now I'm trying to build a user profile by analysing these inputs.
My idea is to have the authorised user enter the password n times and derive a threshold, or something similar, from those samples.
This amounts to a form of biometrics: anyone else who knows the button combination can try the password, but if they fall outside this range they cannot get access.
How can I turn this into an algorithm? Where should I start?
I don't want to delve too deep into machine learning, and I can see that on my first try the false positive and false negative rates may be quite high, but I can manage that by changing my inputs.
Thanks.
To me this seems like a good candidate for a classification problem. You have two classes (correct password input / incorrect), and your data could be the times (from time 0) at which buttons were pressed. You could train a learning algorithm by giving it several examples of correct and incorrect password data. Once your classifier is trained and working satisfactorily, you could use it to predict whether new password attempts are genuine.
You could try out several classifiers in Weka, a GUI-based machine learning tool: http://www.cs.waikato.ac.nz/ml/weka/
For experimenting in Weka, your data needs to be in a simple table format, something like the following:
Attempt No | 1st button time | 2nd button time | 3rd button time | is_correct
-----------|-----------------|-----------------|-----------------|------------
1 | 1.2 | 1.5 | 2.4 | YES
2 | 1.3 | 1.8 | 2.2 | YES
3 | 1.1 | 1.9 | 2.0 | YES
4 | 0.8 | 2.1 | 2.9 | NO
5 | 1.2 | 1.9 | 2.2 | YES
6 | 1.1 | 1.8 | 2.1 | NO
This would be a training set. The outcome (which is known) is the class is_correct. You would run this data through Weka, selecting a classifier (Naive Bayes, for example). This would produce a classifier (for example, a set of rules) that could be used to predict future entries.
The key to this sort of problem is devising good metrics. Once you have a vector of input values, you can use one of a number of machine learning algorithms to classify it as authorised or rejected. So the first step should be to determine which of the metrics you mention will be the most useful and pick a small number of them (5-10). You can probably benefit from collapsing some of them by averaging (for example, the average length of any key press rather than a separate value for every key). Then you will need to pick an algorithm. A good one for classifying vectors of real numbers is the support vector machine; at this point you should read up on it, particularly on what the "kernel" function is, so you can choose one to use. Then you will need to gather a set of training examples (vectors with a known result), train the algorithm on them, and test the trained SVM on a fresh set of examples to see how it performs. If the performance is poor with a simple kernel (e.g. linear), you may choose a higher-dimensional one. Good luck!
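A minimal sketch of that workflow with scikit-learn, loosely based on the timing table above. The fourth feature (average hold time), the RBF kernel, and the specific timing values are illustrative assumptions, not taken from the posts.

```python
# Minimal sketch: classify keystroke-timing vectors as authorised vs impostor
# with a scaled RBF-kernel SVM.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# Each row: [1st button time, 2nd button time, 3rd button time, avg hold time]
X = np.array([
    [1.2, 1.5, 2.4, 0.20],
    [1.3, 1.8, 2.2, 0.22],
    [1.1, 1.9, 2.0, 0.19],
    [0.8, 2.1, 2.9, 0.35],
    [1.2, 1.9, 2.2, 0.21],
    [1.1, 1.8, 2.1, 0.33],
])
y = np.array([1, 1, 1, 0, 1, 0])   # 1 = authorised user, 0 = impostor

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y
)

# Scale the timings, then fit the SVM (the "kernel" choice mentioned above).
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))

# Classify a new password attempt from its timing vector.
print(clf.predict([[1.15, 1.85, 2.15, 0.20]]))
```

In practice you would collect many more genuine samples than this and evaluate the false accept/reject rates before trusting the classifier.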