Suitable Clustering Approach - data-science

I've got a total of 9 sensors in the ground, which measure the water content of the soil. Sensors 1-3 are at a depth of 1 m, sensors 4-6 at a depth of 2 m, and sensors 7-9 at a depth of 3 m.
My dataset also contains the precipitation at the location. It is hourly data:
Time              Sensor-ID  Precipitation  Soil Water Content
2022-01-01 11:00  1          74             120
2022-01-01 11:00  2          74             100
2022-01-01 11:00  3          74             110
...               ...        ...            ...
2022-01-01 11:00  9          74             30
The goal now is to find out whether the different soil depths behave differently with regard to water content after rainfall (over time).
I thought about using a clustering method to find out whether the sensors can be grouped based on the data, and to confirm this. Since I'm not very experienced in data science: would that be the right approach, and is it even possible to analyse this with clustering?

For clustering, you can add a new column with three new classes to your data - sensors 1-3: Class 1, sensors 4-6: Class 2, sensors 7-9: Class 3 - and perform your analysis using the new classes. This can be done using Python, Power BI or Excel.
You should start by analyzing the different variables with respect to the sensors at the different ground depths: use univariate, bivariate and multivariate plots to work toward your goal. A rough sketch of this in Python is shown below.
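As a minimal sketch (assuming the data sits in a pandas DataFrame with the columns shown above; the file name and the choice of summary features are illustrative, not prescriptive), you could summarize each sensor and check whether a k-means clustering of the nine sensors recovers the three depth classes:

import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv("soil_sensors.csv", parse_dates=["Time"])  # hypothetical file name

# Label each sensor with its depth class (1 m, 2 m, 3 m).
depth_class = {s: (s - 1) // 3 + 1 for s in range(1, 10)}
df["Class"] = df["Sensor-ID"].map(depth_class)

# Simple per-sensor summary features: mean water content and its
# correlation with precipitation.
features = df.groupby("Sensor-ID").apply(
    lambda g: pd.Series({
        "mean_swc": g["Soil Water Content"].mean(),
        "corr_precip": g["Soil Water Content"].corr(g["Precipitation"]),
    })
)

# Cluster the 9 sensors into 3 groups and compare with the depth classes.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
print(features.assign(cluster=labels, depth=features.index.map(depth_class)))

If the clusters line up with the depth classes, that supports the idea that the depths respond differently to rain; if not, plotting the water content of each depth class against time around rain events is a simpler, more direct check.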


Expected Value formula

I have a table as follows
Day  Savings
1    251
2    722
3    1132
4    1929
5    3006
6    4522
7    8467
...  ...
14   x
These savings are growing day by day. I want to find a formula to estimate the final value on day 14, which is x.
I didn't look at the data in any detail, but it seems like an exponential growth situation. If that's the case, then you can estimate the growth rate by fitting an exponential curve to the data using a least-squares approximation to get an estimated interest rate, r. If you find the data not conducive to that, you could try fitting some other curve. You can then use the estimated interest rate to compute the expected funds using the standard formula a = p*exp(r*i), where p is the initial principal and i is the elapsed time.
This all assumes compounding interest which is an exponential growth situation. If that assumption is incorrect, this approach is probably not going to work for you.
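A minimal sketch of such a fit (not the answerer's code; it simply applies scipy's least-squares curve fit to the table above and extrapolates to day 14):

import numpy as np
from scipy.optimize import curve_fit

days = np.array([1, 2, 3, 4, 5, 6, 7])
savings = np.array([251, 722, 1132, 1929, 3006, 4522, 8467])

def growth(t, p, r):
    # a = p * exp(r * t), the compound-growth model from the answer
    return p * np.exp(r * t)

(p, r), _ = curve_fit(growth, days, savings, p0=(100.0, 0.5))
print("estimated rate r =", r)
print("expected value on day 14 =", growth(14, p, r))

If the residuals look systematic, that's a sign the exponential assumption is off and another curve (for example a power law) may fit better.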

Plotting data from two sets with different shapes in the same plot

I am using data collected from two different instruments, which have different resolution because of each instrument's sampling rate. For a given time interval, one of the sets has >10k entries while the other has ~2.5k. They capture data over the same time interval, however, and I want to plot them on top of each other even though they have different resolution. The minimum and maximum x of both sets are the same; one of them simply has more entries.
Simplified it could look like this:
1st set from instrument with higher sampling rate:
time(s) value
0.0 10
0.2 11
0.4 12
0.6 13
0.8 14
... ..
100 50
2nd set from instrument with lower sampling rate:
time(s) value
0 100
1 120
2 125
3 128
4 130
. ...
100 430
They are measuring different things, but I would like to display them in the same plot. How can I accomplish this?
I found the mistake. I was trying to plot both datasets using the time data from the first instrument. Of course they need to be plotted against their respective time data; I had put the first instrument's time data into the second plot by mistake.
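For reference, a minimal matplotlib sketch (with placeholder arrays standing in for the two instruments' data) that plots each series against its own time values; since the instruments measure different quantities, a second y-axis is used:

import numpy as np
import matplotlib.pyplot as plt

t1 = np.linspace(0, 100, 10001)   # high-rate instrument (placeholder data)
v1 = 10 + 0.4 * t1
t2 = np.linspace(0, 100, 101)     # low-rate instrument (placeholder data)
v2 = 100 + 3.3 * t2

fig, ax1 = plt.subplots()
ax1.plot(t1, v1, label="instrument 1")
ax1.set_xlabel("time (s)")
ax1.set_ylabel("instrument 1 value")

ax2 = ax1.twinx()                 # second y-axis for the other quantity
ax2.plot(t2, v2, color="tab:orange", label="instrument 2")
ax2.set_ylabel("instrument 2 value")

plt.show()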

Composite indexing using Redis in a hierarchical data model

I have a data model like this:
Fields:
counter number (e.g. 00888, 00777, 00123 etc)
counter code (e.g. XA, XD, ZA, SI etc)
start date (e.g. 2017-12-31 ...)
end date (e.g. 2017-12-31 ...)
Other counter date (e.g. xxxxx)
The current data structure organization is like this (root with multiple children):
counter_num + counter_code
---> start_date + end_date --> xxxxxxxx
---> start_date + end_date --> xxxxxxxx
---> start_date + end_date --> xxxxxxxx
Example:
00888 + XA
---> Jan 10 + Jan 20 --> xxxxxxxx
---> Jan 21 + Jan 31 --> xxxxxxxx
---> Feb 01 + Dec 31 --> xxxxxxxx
00888 + ZI
---> Jan 09 + Feb 24 --> xxxxxxxx
---> Feb 25 + Dec 31 --> xxxxxxxx
00777 + XA
---> Jan 09 + Feb 24 --> xxxxxxxx
---> Feb 25 + Dec 31 --> xxxxxxxx
Today the retrieval happens in 2 ways:
//Fetch unique counter data using all the composite keys
counter_number + counter_code + date (start_date <= date <= end_date)
//Fetch all the counter codes and corresponding data matching the below conditions
counter_number + date (start_date <= date <= end_date)
What's the best way to model this in Redis, given that I need to cache some of the frequently hit data? I feel sorted sets should handle this somehow, but I am unable to model it.
UPDATE:
Just to remove the confusion: the ask here is not for an SQL "BETWEEN"-like query, because I don't know what the start_date and end_date values are. Think of them as just column names.
What I don't want is
SELECT * FROM redis_db
WHERE counter_num AND
date_value BETWEEN start_date AND end_date
What I want is
SELECT * FROM redis_db
WHERE counter_num AND
start_date <= specific_date AND end_date >= specific_date
NOTE: The requirement is pretty much close to 2D indexing of what is proposed in Redis multi-dimensional indexing document
https://redis.io/topics/indexes#multi-dimensional-indexes
I understood the concept but am unable to digest the implementation details given there.
I'm unlikely to get this done in time for the bounty, but what the hell...
This sounds like a job for geohashing. Geohashing is what you do when you want to index a 2-dimensional (or higher) dataset. For example, if you have a database of cities and you want to be able to quickly respond to queries like "find all the cities within 50km of X", you use geohashing.
For the purposes of this question, you can think of start_date and end_date as x and y coordinates. Normally in geohashing you're searching for points in your dataset near a particular point in space, or in a certain bounded region of space. In this case you just have a lower bound on one of the coordinates and an upper bound on the other one. But I suppose in practice the whole dataset is bounded anyway, so that's not a problem.
It would be nice if there were a library for doing this in Redis. There probably is, if you look hard enough. Newer versions of Redis have built-in geohashing functionality; see the commands starting with GEO. But it doesn't claim to be very accurate, and it's designed for the surface of a sphere rather than a flat surface.
So as far as I can see you have 3 options:
Map your search space to a small part of the sphere, preferably near the equator, and use the Redis GEO commands. To search, use GEORADIUS with a circle covering the triangle you're trying to search, taking into account the built-in inaccuracy and the distortion you get from mapping onto the sphere, then filter the results to get the ones that are actually inside the triangle.
Find some 3rd-party geohashing client for Redis which works on flat space and is more accurate than GEO.
Read the rest of this answer, or some other primer on geohashing, then implement it yourself on top of Redis. This is the hardest (but most educational) option.
If you have a database that indexes data using a numerical ordering, such that you can do queries like "find all the rows/records for which z is between a and b", you can build a geohash index on top of it. Suppose the coordinates are (non-negative) integers x and y. Then you add an integer-valued column z, and index by z. To calculate z, write x and y in binary, then take alternate digits from each. Example:
x = 969 = 0 1 1 1 1 0 0 1 0 0 1
y = 1130 = 1 0 0 0 1 1 0 1 0 1 0
z = 1750214 = 0110101011010011000110
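A minimal sketch of that interleaving (a standard Morton/Z-order encoding, not code from the answer), with each bit of x taking the higher slot of its pair so that it reproduces the example above:

def interleave(x, y, bits=32):
    # Interleave the bits of x and y into a single Morton/Z-order code;
    # bit i of x lands at position 2*i + 1, bit i of y at position 2*i.
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)
        z |= ((y >> i) & 1) << (2 * i)
    return z

print(interleave(969, 1130))  # 1750214, matching the example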
Note that the index allows you to find, for example, all records positioned with z between 0101100000000000000000 and 0101101111111111111111 inclusive. In other words, all records for which z starts with 010110. Or to put it another way, you can find all records for which x starts with 001 and y starts with 110. This set of records corresponds to a square in the 2-dimensional space we are trying to search.
Not all squares can be searched in this way. We'll call these ones searchable squares. Suppose the client sends a request for all records for which (x,y) is inside a particular rectangle. (Or a circle, or some other reasonable geometric shape.) Then you need to find a set of searchable squares which cover the rectangle. Then, for each of these squares you've chosen, query the database for records inside that square and send the results to the client. (But you'll have to filter the results, because not all the records in the square are actually in the original rectangle.)
There's a balance to be struck. If you choose a small number of large searchable squares, you'll probably end up covering a much larger area of the map than you need; the query to the database will return lots of extra results that you'll have to filter out. Alternatively, if you use lots of little searchable squares, you'll be doing lots of queries to the database, many of which will return no results.
I said above that x and y could be start_date and end_date. But actually the distribution of your dataset won't be as symmetrical as in most uses of geohashing, so the performance might be better (or worse) if you use x = end_date + start_date and y = end_date - start_date.
Because your question is a bit vague about how you want to query your data, it is unclear exactly how to solve it. With that in mind, however, here are my thoughts on how I might model your data.
Updated answer, detailing how to use SORTED SET
I have edited this answer to show how to store your values in a way that you can query by dynamic date ranges. This edit assumes that your database values are timestamps, i.e. each value is for a single point in time, not two, as in your current setup.
Yes, you are correct that Sorted Sets can accomplish this. I suggest that you always use a Unix timestamp for the score component in these sorted sets.
In case you are not already familiar with Redis, let's explain its indexing limitations. Redis is a simple key-value store designed to quickly retrieve values by a key. Because of this design, it does not contain many features of a traditional DBMS, like indexing a column, for instance.
In Redis, you accomplish indexing by using a key. The most nested key-like structures are available in HASH and SORTED SET, but you only get two key-like levels. In a HASH, you have the key (same as for any data type) and an inner hash key, which can take the form of any string.
In a SORTED SET, you have the key (same as any data type), and a numeric value.
A HASH is nice for keeping grouped data organized.
A SORTED SET is nice if you want to query by a range of values. This could be a good fit for your data.
Your SORTED SET would look like the following:
key
00888:XA =>
score (date value)        value
1452427200 (2016-01-10)   xxxxxxxx
1452859200 (2016-01-15)   yyyyxxxx
1453291200 (2016-01-20)   zzzzxxxx
Let's use a more intuitive example, the 2017 Juventus roster:
To produce the SORTED SET in the table below, issue this command in your redis client:
ZADD JUVENTUS 32 "Emil Audero" 1 "Gianluigi Buffon" 42 "Mattia Del Favero" 36 "Leonardo Loria" 25 "Neto" 15 "Andrea Barzagli" 4 "Medhi Benatia" 19 "Leonardo Bonucci" 3 "Giorgio Chiellini" 40 "Luca Coccolo" 29 "Paolo De Ceglie" 26 "Stephan Lichtsteiner" 12 "Alex Sandro" 24 "Daniele Rugani" 43 "Alessandro Semprini" 23 "Dani Alves" 22 "Kwadwo Asamoah" 7 "Juan Cuadrado" 6 "Sami Khedira" 18 "Mario Lemina" 46 "Mehdi Leris" 38 "Rolando Mandragora" 8 "Claudio Marchisio" 14 "Federico Mattiello" 45 "Simone Muratore" 20 "Marko Pjaca" 5 "Miralem Pjanic" 28 "Tomás Rincón" 27 "Stefano Sturaro" 21 "Paulo Dybala" 9 "Gonzalo Higuaín" 34 "Moise Kean" 17 "Mario Mandzukic"
Jersey  Name                  Jersey  Name
32      Emil Audero           23      Dani Alves
1       Gianluigi Buffon      42      Mattia Del Favero
36      Leonardo Loria        25      Neto
15      Andrea Barzagli       4       Medhi Benatia
19      Leonardo Bonucci      3       Giorgio Chiellini
40      Luca Coccolo          29      Paolo De Ceglie
26      Stephan Lichtsteiner  12      Alex Sandro
24      Daniele Rugani        43      Alessandro Semprini
22      Kwadwo Asamoah        7       Juan Cuadrado
6       Sami Khedira          18      Mario Lemina
46      Mehdi Leris           38      Rolando Mandragora
8       Claudio Marchisio     14      Federico Mattiello
45      Simone Muratore       20      Marko Pjaca
5       Miralem Pjanic        28      Tomás Rincón
27      Stefano Sturaro       21      Paulo Dybala
9       Gonzalo Higuaín       34      Moise Kean
17      Mario Mandzukic
To query the roster by a range of jersey numbers:
ZRANGEBYSCORE JUVENTUS 1 5
Output:
1) "Gianluigi Buffon"
2) "Giorgio Chiellini"
3) "Medhi Benatia"
4) "Miralem Pjanic"
Note that the scores are not returned; however, the ZRANGEBYSCORE command orders the results in ascending order by score.
To add the scores, append WITHSCORES to the command, like so: ZRANGEBYSCORE JUVENTUS 1 5 WITHSCORES
By using ZRANGEBYSCORE, you should be able to query any key (counter number + counter code) with a date range, producing the values in that range. A rough sketch of this with a Redis client is shown below.
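A minimal sketch using the redis-py client (the key name and values reuse the 00888 + XA example from the question; the dates are illustrative), storing each value under its Unix timestamp and querying a date range with ZRANGEBYSCORE:

from datetime import datetime, timezone
import redis

r = redis.Redis(decode_responses=True)

def ts(date_str):
    # Convert a YYYY-MM-DD string to a Unix timestamp (UTC midnight).
    return int(datetime.strptime(date_str, "%Y-%m-%d")
               .replace(tzinfo=timezone.utc).timestamp())

# Store values under their timestamps in the sorted set for 00888:XA.
r.zadd("00888:XA", {"xxxxxxxx": ts("2016-01-10"),
                    "yyyyxxxx": ts("2016-01-15"),
                    "zzzzxxxx": ts("2016-01-20")})

# Fetch everything for 00888:XA between two dates.
print(r.zrangebyscore("00888:XA", ts("2016-01-10"), ts("2016-01-16")))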
Original: Below is my original answer, recommending HASH
Based on your examples, I recommend you use a HASH.
With a hash, you would have a main key to find the hash (Ex. 00888:XA). Then within the hash, you have key -> value pairs (Ex. 2017-01-10:2017-01-20 -> xxxxxxxx). I prefer to delimit or tokenize my keys' components with the colon char :, but you can use any delimiter.
HASH follows your example data structure very well:
key
00888:XA =>
hashkey value
2017-01-10:2017-01-20 xxxxxxxx
2017-01-21:2017-01-31 yyyyxxxx
2017-02-01:2017-12-31 zzzzxxxx
key
00888:ZI =>
hashkey value
2017-01-10:2017-01-20 xxxxxxxx
2017-01-21:2017-01-31 xxxxyyyy
2017-02-01:2017-12-31 xxxxzzzz
When querying for data, instead of GET key, you would query with HGET key hashkey. Same for setting values, instead of SET key value, use HSET key hashkey value.
Example commands
HSET 00777:XA 2017-01-10:2017-01-20 xxxxxxxx
HSET 00777:XA 2017-01-21:2017-01-31 yyyyyyyy
HSET 00777:XA 2017-02-01:2017-12-31 zzzzzzzz
(Note: there is also a HMSET to simplify this into a single command)
Then:
HGET 00777:XA 2017-01-21:2017-01-31
Would return yyyyyyyy
Unless there is some specific performance consideration, or other goal for your data, I think Hashes will work great for your system.
It's also very convenient if you want to get all hashkeys or all values for a given hash, using commands like HKEYS, HVALS, or HGETALL.

How can I do SQL like operations on a R data frame?

For example, I have a data frame with data across categories and subcategories, and I want to be able to get the row with the maximum value in a particular column, etc.
SQL is what comes to mind first. But since I am not interested in joins or indices etc., Python's list comprehensions would do the same thing better, with a more modern syntax.
What's best practice in R for such operations?
EDIT:
For now I think I am fine with which.max. The reason I asked the question the way I did is simply that I have come to learn that in R there are many libraries doing pretty much the same thing. Just by reading the documentation it's very hard to evaluate how popular a library is (i.e. how well it fulfills its purpose). My personal experience with Python is that the day you figure out how to use list comprehensions (with itertools as a bonus), you are pretty much covered. Over time this has evolved as best practice; you don't see lambda and filter that often in the general Python debate these days, as list comprehensions do the same thing more easily and more uniformly.
If you really mean SQL, a pretty straightforward answer is the 'sqldf' package:
http://cran.at.r-project.org/web/packages/sqldf/index.html
From the help for ?sqldf
library(sqldf)
a1s <- sqldf("select * from warpbreaks limit 6")
Some additional context would help, but from the sounds of it you may be looking for which.max() or related functions. For group-by operations, I default to the plyr family of functions, but there are certainly faster alternatives in base R if speed is of utmost importance.
library(plyr)
#Make a local copy of the mtcars data and add the rownames as a column since ddply
#seems to drop them. I've never encountered that before actually...
myCars <- mtcars
myCars$carname <- rownames(myCars)
#Find the max mpg
myCars[which.max(myCars$mpg) ,]
mpg cyl disp hp drat wt qsec vs am gear carb carname
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1 Toyota Corolla
#Find the max mpg by cylinder category
ddply(myCars, "cyl", function(x) x[which.max(x$mpg) ,])
mpg cyl disp hp drat wt qsec vs am gear carb carname
1 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 Toyota Corolla
2 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Hornet 4 Drive
3 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 Pontiac Firebird

Predicting game outcomes at a point in time

Hi, I am trying to estimate, from past information I have in an MSSQL database, the predicted outcome of soccer games (win, tie or loss for the home team) at any point in time, based on the minutes played and the scoreline.
What I had envisaged as output was something like fangraphs does for baseball
http://www.fangraphs.com/scoreboard.aspx?date=2010-11-01
although with two lines as there are three rather than two possible outcomes
From the data and the existing tables I can create game records like this
Time  TeamID  Venue  MatchID  Result
6     TOT     H      5        W
27    ASV     A      5        W
58    ASV     A      5        W
66    TOT     H      5        W
77    TOT     H      5        W
So for the graph for this game, the home team TOT would start with the win line at around 45% (based on the historical probability of a home win). It would spike when they score their goal, dip significantly after ASV score twice, but probably be above 90% when they score to go 3-2 up, and then rise gently to 100% at the closing 90-minute mark.
So I want to go through the 7500 games I have data on and establish, for every minute of a 90-minute game, the chances of a win, tie or loss for the home team based on those results.
For instance, in the simplest situation: after 1 minute of play, in actuality 44 of the home teams had scored; 33 of them went on to win, 6 tied and 5 lost. In the corresponding case where the away team scored, there have been 9 wins, 8 ties and 23 losses for the home team. However, I am having trouble getting my head around how to get the scorelines for all 90 minutes and compare them with the final result (only one goal can be scored in any specific minute).
TIA for any help
There will always be things you can add to the model, but the first thing I would do is, for each game, pull out the score at each minute, and assume that the probability of winning doesn't depend on when the goal was scored, but depends on what the score is now.
So now you would have 90 data points per game.
game1:
Minute: 0 10 20 30 40 50 60 70 80 90
Score :[0 0] [0 0] [0 1] [0 1] [1 1] [1 1] [1 1] [2 1] [2 1] [3 1]
The next thing I would do is, for each minute slice, add up the number of wins, losses, and draws over all games, for each configuration of scores.
So each entry in that table might correspond to something like this:
#minute 77, for score {home:5, away:2} : {homeWins: 9, draws: 1, homeLosses: 0}
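A minimal sketch (hypothetical data layout, not the answerer's code) of turning one game's goal events into a score at every minute and tallying the final outcome per (minute, score) configuration:

from collections import defaultdict

# counts[(minute, home_goals, away_goals)] -> [home_wins, draws, home_losses]
counts = defaultdict(lambda: [0, 0, 0])

def tally_game(goal_events, outcome):
    # goal_events: list of (minute, scored_by_home) tuples for one game
    # outcome: 0 = home win, 1 = draw, 2 = home loss (the final result)
    home = away = 0
    events = sorted(goal_events)
    i = 0
    for minute in range(91):
        while i < len(events) and events[i][0] <= minute:
            if events[i][1]:
                home += 1
            else:
                away += 1
            i += 1
        counts[(minute, home, away)][outcome] += 1

# Example: the TOT vs ASV game above (TOT at home, final score 3-2, home win).
tally_game([(6, True), (27, False), (58, False), (66, True), (77, True)], outcome=0)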
You might want to try using the difference in score instead of the actual score values.
Either way, once you have the data formatted that way, getting a reasonable solution is easy.
If a game is on, and it's minute 77, and the score is {home:5, away:2}, the maximum-likelihood (MLE) estimate is 90% wins, 10% draws, 0% losses (according to the example table entry above).
You can see already how it will help to include "Laplace smoothing": adding +1 to the final value of each of those win/draw/loss counters. This way, if you've never seen a loss in this exact situation, you don't say it's impossible; impossible is a very strong word (look up the beta or Dirichlet distributions for background). A sketch of the smoothed estimate follows.
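Continuing the hypothetical counts table sketched above, a Laplace-smoothed estimate might look like this (the minute-77, 5-2 entry is the example from above):

from collections import defaultdict

counts = defaultdict(lambda: [0, 0, 0])   # as built in the previous sketch
counts[(77, 5, 2)] = [9, 1, 0]            # the example entry from above

def estimate(minute, home, away, alpha=1):
    # Laplace smoothing: add alpha to each win/draw/loss counter before normalising.
    smoothed = [n + alpha for n in counts[(minute, home, away)]]
    total = sum(smoothed)
    return [n / total for n in smoothed]  # [P(home win), P(draw), P(home loss)]

print(estimate(77, 5, 2))  # roughly [0.77, 0.15, 0.08] instead of the raw 90/10/0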
The obvious problem with this approach is that if you've never seen a particular score combination before it will predict (33%,33%,33%), which is obviously wrong in some cases.
The simplest fix would be to enforce a rule like "leading by 6 points is at least as good as leading by 5 points". It's ugly, but it's a start.
To avoid that sort of special-case logic, you could try averaging that approach with a Monte Carlo approximation.
The simplest such approach is to say: over all my data I expect about a 1 in 30 chance of a goal by each team in each minute of play, so simulate the game 10,000 times from the current point, count the number of wins/losses/draws, and you're done.
If that's too random, or too processor-intensive, switch to Markov chains. A rough sketch of the simulation idea is below.
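A minimal sketch of the Monte Carlo idea above, assuming a flat 1-in-30 chance of a goal per team per minute (the stated simplification; a real model would estimate that rate, and possibly the single-goal-per-minute constraint, from the data):

import random

def simulate(minute, home, away, p_goal=1 / 30, n_sims=10000):
    # Play out the remaining minutes many times and count the outcomes.
    wins = draws = losses = 0
    for _ in range(n_sims):
        h, a = home, away
        for _ in range(minute, 90):
            if random.random() < p_goal:
                h += 1
            if random.random() < p_goal:
                a += 1
        if h > a:
            wins += 1
        elif h == a:
            draws += 1
        else:
            losses += 1
    return wins / n_sims, draws / n_sims, losses / n_sims

# e.g. probabilities (win, draw, loss) for the home team at minute 77, score 5-2:
print(simulate(77, 5, 2))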