SQL: Pushing a Dataset "Backwards"?

I have the following dataset that contains information on different students who took a fitness test: the date they took the fitness test, their weight at the time of the fitness test, and whether or not they passed the fitness test:
   name  date_test_taken  result_of_test  weight_after_time_of_test
1  john  2013-01-01       pass            165
2  john  2016-01-01       fail            183
3  john  2017-01-01       fail            175
4  john  2020-01-01       pass            182
5  alex  2019-01-01       fail            220
6  alex  2020-01-01       fail            225
7  tim   2018-01-01       pass            176
In this example, the student participates in the fitness test, then is told if they passed or failed, and then the student records their weight. The students don't necessarily take the test every year.
I am interested in building a statistical/machine learning model that will predict whether the student will pass or fail the NEXT fitness test they take based on the CURRENT weight of the student AND the result of their last fitness test.
This means if you take the second row of this dataset: John weighed 183 lbs after his second test, but his last known weight was actually 165 lbs. Therefore, I would be interested in "shifting" the dataset backward for each student, so that I am predicting whether John would have passed his second fitness test when his last known weight was 165 lbs and not 183 lbs.
Thus, using SQL code, I would like to "shift" the data for each student backward to modify the dataset. This way, the teacher can predict who will fail the next fitness test based on the results of the current fitness test - and then help those students more throughout the year.
Can someone please show me how to do this?
Thanks!

Obviously, with the information provided, we won't be able to provide you with the model to predict results. However, we can help with providing the data you need for your modelling.
The key task here is to get an individual's previous results into the same row (for analysis) as their current results.
SQL (at least SQL Server, and multiple other products) provides the window functions LEAD and LAG, which let you 'look forward' (LEAD) or 'look backward' (LAG) through the dataset according to a given partitioning and ordering that tells them how to identify the previous/next rows.
In this case, we want to partition by the individual (name) and take their previous result (LAG function, 1 row back), ordered by the date they took the test.
The following SQL gets the previous results for an individual onto the same row as the current data (note - I'm assuming the data table is called #FT and the first column is called 'Auto_ID'):
SELECT [name],
       [date_test_taken],
       LAG([date_test_taken], 1) OVER (PARTITION BY [name] ORDER BY [date_test_taken], [Auto_Id]) AS [date_test_taken_Previous],
       LAG([result_of_test], 1) OVER (PARTITION BY [name] ORDER BY [date_test_taken], [Auto_Id]) AS [result_of_test_Previous],
       LAG([weight_after_time_of_test], 1) OVER (PARTITION BY [name] ORDER BY [date_test_taken], [Auto_Id]) AS [weight_after_time_of_test_Previous],
       [result_of_test],
       [weight_after_time_of_test]
FROM #FT
Note that if they don't have a previous record, the previous results are NULL.
Here are the results:
|name|date_test_taken|date_test_taken_Previous|result_of_test_Previous|weight_after_time_of_test_Previous|result_of_test|weight_after_time_of_test|
|----|---------------|------------------------|-----------------------|----------------------------------|--------------|-------------------------|
|alex|2019-01-01 |NULL |NULL |NULL |fail |220.00 |
|alex|2020-01-01 |2019-01-01 |fail |220.00 |fail |225.00 |
|john|2013-01-01 |NULL |NULL |NULL |pass |165.00 |
|john|2016-01-01 |2013-01-01 |pass |165.00 |fail |183.00 |
|john|2017-01-01 |2016-01-01 |fail |183.00 |fail |175.00 |
|john|2020-01-01 |2017-01-01 |fail |175.00 |pass |182.00 |
|tim |2018-01-01 |NULL |NULL |NULL |pass |176.00 |
To see it in action, here is a dbfiddle with the data, query, and results.
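If you only want rows that are usable for the model (i.e. where a previous test exists), one option is to wrap the query and filter out the NULL rows. A sketch, using the same assumed #FT table:
WITH Shifted AS (
    SELECT [name],
           [date_test_taken],
           LAG([result_of_test], 1) OVER (PARTITION BY [name] ORDER BY [date_test_taken], [Auto_Id]) AS [result_of_test_Previous],
           LAG([weight_after_time_of_test], 1) OVER (PARTITION BY [name] ORDER BY [date_test_taken], [Auto_Id]) AS [weight_after_time_of_test_Previous],
           [result_of_test]
    FROM #FT
)
SELECT *
FROM Shifted
WHERE [result_of_test_Previous] IS NOT NULL;  -- drop each student's first test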

Related

Identifying newest records in parallel

We're using U-SQL to extract sensor data from a set of .csv files. Each record contains a sensor ID, time of measurement and value, as well as a time for when the record was received:
+----------+---------------------+------------------+---------------------+
| SensorID | MeasurementTime | MeasurementValue | ReceivedTime |
+----------+---------------------+------------------+---------------------+
| xxx | 2017-09-10 11:00:00 | 12.342 | 2017-09-19 14:25:17 |
| xxx | 2017-09-10 12:00:00 | 14.654 | 2017-09-19 14:25:17 |
| yyy | 2017-09-10 11:00:00 | 1.054 | 2017-09-19 14:25:17 |
| yyy | 2017-09-10 12:00:00 | 1.354 | 2017-09-19 14:25:17 |
...
| xxx | 2017-09-10 11:00:00 | 10.261 | 2017-09-19 15:25:17 |
+----------+---------------------+------------------+---------------------+
The files are stored in ADLS in a path based on the date-portion of the measurement time, so the data seen above would be found in /Data/2017/09/10/measurements.csv, where the first four rows were written at 14:25:17 on the 19th of September, and the last row was appended one hour later, at 15:25:17.
As the above example illustrates, new values for the same SensorID and MeasurementTime can be received at a later time. Each partition holds a few million rows, with a few thousand rows being appended to a small number of partitions every day. We want to run a batch job, say every 24 hours, that outputs only the newest value for any given SensorID and MeasurementTime. For this, we use a U-SQL script that looks similar to this:
@newestMeasurements_addRN =
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY PDate,
                                           SensorId,
                                           MeasurementTime
                              ORDER BY ReceivedTime DESC) AS MeasurementRN
    FROM @measurements; // rowset produced by the preceding EXTRACT (assumed name; not shown here)

@newestMeasurements =
    SELECT SensorId,
           MeasurementTime,
           MeasurementValue
    FROM @newestMeasurements_addRN
    WHERE MeasurementRN == 1;
Here, PDate is a virtual column inferred from the yyyy/MM/dd in the path of the CSV file (equals the date-portion of MeasurementTime).
Now, since we use PDate in the PARTITION BY part of the window function, I expected that this operation could be parallelised: we don't have to consider different days (partitions) when trying to find the newest record for any given SensorID and MeasurementTime. Unfortunately, judging by the job graph, that does not seem to be the case.
Here, we are extracting data from 4 different days. Each of the Extract vertices outputs the full number of records, leaving the task of identifying only the newest records to the Combine vertex at the bottom, indicating that the ROW_NUMBER and subsequent filtering do not happen in parallel.
Is this a bug in the implementation of ROW_NUMBER?
Is there a different U-SQL technique we can use to ensure parallelism?
I managed to find a usable solution: I encapsulated the U-SQL that detects the latest measurements inside a U-SQL stored proc, which takes a value corresponding to PDate as an input parameter.
Then, I simply execute this stored proc several times, with a list of dates that I want to process in parallel:
DetectLatestMeasurements(20170910);
DetectLatestMeasurements(20170911);
DetectLatestMeasurements(20170912);
DetectLatestMeasurements(20170913);
The stored proc handles the EXTRACT, transformation and OUTPUT of one day's worth of data, so this does the job, and it is parallelised the way I expect.
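For reference, the wrapper might look roughly like this. This is a sketch only: the schema, paths, and the date-to-path conversion are assumptions, not the actual production script.
CREATE PROCEDURE IF NOT EXISTS DetectLatestMeasurements(@pdate int)
AS
BEGIN
    // 20170910 -> "2017/09/10", matching the assumed ADLS folder layout
    DECLARE @day string = @pdate.ToString().Insert(4, "/").Insert(7, "/");

    @measurements =
        EXTRACT SensorId string,
                MeasurementTime DateTime,
                MeasurementValue double,
                ReceivedTime DateTime
        FROM "/Data/" + @day + "/measurements.csv"
        USING Extractors.Csv();

    // newest record per (SensorId, MeasurementTime); PDate is not needed
    // here because each call already processes a single day
    @newest =
        SELECT SensorId,
               MeasurementTime,
               MeasurementValue,
               ROW_NUMBER() OVER (PARTITION BY SensorId, MeasurementTime
                                  ORDER BY ReceivedTime DESC) AS MeasurementRN
        FROM @measurements;

    @latest =
        SELECT SensorId, MeasurementTime, MeasurementValue
        FROM @newest
        WHERE MeasurementRN == 1;

    OUTPUT @latest
    TO "/Output/" + @day + "/latestMeasurements.csv"
    USING Outputters.Csv();
END;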

Access Query: get difference of dates with a twist

I'm going to do my best to explain this so I apologize in advance if my explanation is a little awkward. If I am foggy somewhere, please tell me what would help you out.
I have a table filled with circuits and dates. Each circuit gets trimmed on a time cycle of about 36 or 48 months, and I have a column that gives me this info. I have one record for every time a circuit's trim cycle has been completed. I am attempting to link a known circuit outage list to a table with their outage data, and to a table with the circuit's trim history. The twist is the following:
I only want to get back circuits that have exceeded their trim cycles by 6 months. So I would need to take all records for a circuit, look at each individual record, find the most recent previous record relative to the record currently being examined (I will need every record examined individually), calculate the difference between the two records in months, and then return only the records that exceeded 6 months of difference between any two entries for a given feeder.
Here is an example of the data:
+----+--------+----------+-------+
| ID | feeder | comp     | cycle |
+----+--------+----------+-------+
|  1 | 123456 | 1/1/2001 |    36 |
|  2 | 123456 | 1/1/2004 |    36 |
|  3 | 123456 | 7/1/2007 |    36 |
|  4 | 123456 | 3/1/2011 |    36 |
|  5 | 123456 | 1/1/2014 |    36 |
+----+--------+----------+-------+
Here is an example of the result set I would want (please note: cycle can vary by circuit, so the value in the cycle column needs to be in the calculation to determine if I exceeded the cycle by 6 months between trimmings):
+----+--------+----------+-------+
| ID | feeder | comp     | cycle |
+----+--------+----------+-------+
|  3 | 123456 | 7/1/2007 |    36 |
|  4 | 123456 | 3/1/2011 |    36 |
+----+--------+----------+-------+
This is the query I started, but I'm struggling to work out how to do the date calculations correctly:
SELECT temp_feederList.Feeder,
       Temp_outagesInfo.causeType,
       Temp_outagesInfo.StormNameThunder,
       Temp_outagesInfo.deviceGroup,
       Temp_outagesInfo.beginTime,
       tbl_Trim_History.COMP,
       tbl_Trim_History.CYCLE
FROM (temp_feederList
      LEFT JOIN Temp_outagesInfo
             ON temp_feederList.Feeder = Temp_outagesInfo.Feeder)
LEFT JOIN tbl_Trim_History
       ON Temp_outagesInfo.Feeder = tbl_Trim_History.CIRCUIT_ID;
I wasn't really able to figure out where I need to go from here to get that most recent entry and perform the mathematical comparison. I've never been asked to do SQL this complex before, so I want to thank all of you for your patience and any assistance you're willing to lend.
I'm making some assumptions, but this uses a subquery to give you the rows in the trim history where the completion date is more than the cycle length (in months) after the previous completion date:
SELECT tbl_Trim_History.ID, tbl_Trim_History.feeder,
tbl_Trim_History.comp, tbl_Trim_History.cycle
FROM tbl_Trim_History
WHERE tbl_Trim_History.comp>
(SELECT Max(DateAdd("m", tbl_Trim_History.cycle, comp))
FROM tbl_Trim_History T2
WHERE T2.feeder = tbl_Trim_History.feeder AND
T2.comp < tbl_Trim_History.comp)
If you need to allow extra slack beyond the cycle (such as the 6 months you described), you can add that value to the months calculated by the DateAdd function; see the sketch below.
Also, I don't know whether the value of cycle specifies the number of months from the prior cycle or the number of months to the next one. If the latter, I would change tbl_Trim_History.cycle in the DateAdd function to just cycle.
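For instance, a version with the 6-month grace applied might look like this. This is a sketch: I've qualified comp as T2.comp to make the correlation explicit, and used >= so that a trim exactly 6 months over its cycle is included, which matches your sample result set (it returns IDs 3 and 4 only against the sample data).
SELECT tbl_Trim_History.ID, tbl_Trim_History.feeder,
       tbl_Trim_History.comp, tbl_Trim_History.cycle
FROM tbl_Trim_History
WHERE tbl_Trim_History.comp >=
    (SELECT Max(DateAdd("m", tbl_Trim_History.cycle + 6, T2.comp))
     FROM tbl_Trim_History AS T2
     WHERE T2.feeder = tbl_Trim_History.feeder
       AND T2.comp < tbl_Trim_History.comp);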
SELECT tbl_trim_history.ID, tbl_trim_history.Feeder,
       tbl_trim_history.Comp, tbl_trim_history.Cycle,
       (SELECT Max(comp) FROM tbl_trim_history AS T
        WHERE T.feeder = tbl_trim_history.feeder AND
              T.comp < tbl_trim_history.comp) AS PriorComp,
       IIf(DateDiff("m", [PriorComp], [Comp]) > 36, "x", Null) AS [Select]
FROM tbl_trim_history;
This query identifies (with an X in the last column) the records from tbl_trim_history that exceed the cycle time - but as noted in the comments I'm not entirely sure if this is what you need or not, or how to incorporate the other 2 tables. Once you see what it is doing you can modify it to only keep the records you need.

DAX SUMMARIZE() with filter - Power Pivot

Rephrasing a previous question after further research. I have a denormalised hierarchy of cases, each with an ID, a reference to their parent (or themselves) and a closure date.
Cases
ID | Client    | ParentMatterName | MatterName | ClaimAmount | OpenDate  | CloseDate
1  | Mr. Smith | ABC Ltd          | ABC Ltd    | $40,000     | 1 Jan 15  | 4 Aug 15
2  | Mr. Smith | ABC Ltd          | John       | $0          | 20 Jan 15 | 7 Oct 15
3  | Mr. Smith | ABC Ltd          | Jenny      | $0          | 1 Jan 15  | 20 Jan 15
4  | Mrs Bow   | JQ Public        | JQ Public  | $7,000      | 1 Jan 15  | 4 Aug 15
After the help of greggyb I also have another column, Cases[LastClosed], which will be true if the current row is closed, and is the last closed of the parent group.
There is also a second table of payments, related to Cases[ID]. These payments could be received in parent or child matters. I sum payments received as follows:
Recovery All Time:=CALCULATE([Recovery This Period], ALL(Date_Table[dateDate]))
I am looking for a new measure which will calculate the total recovered for a unique ParentMatterName, if the last closed matter in this group was closed in the financial year we are looking at (year ending 30 June).
I am now looking at the SUMMARIZE() function to do the first part of this, but I don't know how to filter it; the layers of CALCULATE are confusing. I've looked at this MSDN blog post, but it appears that this will filter down to only the total payments for the matter that was last closed (not adding the related children).
My current formula is:
Recovery on Closed This FY :=
CALCULATE (
SUMX (
SUMMARIZE (
MatterListView,
MatterListView[UniqueParentName],
"RecoveryAllTime", [Recovery All Time]
),
[RecoveryAllTime]
)
)
All help appreciated.
Again, your problem is much more easily solved with a model addition. Remember: storage is cheap, and your end users are impatient.
Just store in your Cases table a column with the LastClosedDate of every parent matter, which indicates the date associated with the last closed child matter. Then it's a simple filter to return only those payments/matters that have LastClosedDate in the current fiscal year. Alternately, if you know for certain that you are only concerned with the year, you could store just LastClosedFiscalYear, to make your filter predicate a bit simpler.
If you need help with specific measures or how you might implement the additional field, let us know (I'd recommend adding these fields at the source, or deriving them in the source query rather than using calculated columns).
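For illustration, once a Cases[LastClosedFiscalYear] column exists, the measure might look something like this. This is a sketch only: Date_Table[FiscalYear] is an assumed column, and the names will differ in your model.
Recovery on Closed This FY :=
IF (
    HASONEVALUE ( Date_Table[FiscalYear] ),
    CALCULATE (
        [Recovery All Time],
        FILTER (
            Cases,
            -- keep only matters whose group last closed in the fiscal year in context
            Cases[LastClosedFiscalYear] = VALUES ( Date_Table[FiscalYear] )
        )
    )
)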

Quartile for subgroups in SQL Server 2008

I have a table with the times athletes of a sports club take to run a lap around the field. Each athlete has several entries in that table, one for each time they ran, and for statistics purposes I need to gather some statistics regarding the times they take.
I already have the basic statistics like average time, median time, etc. However, I have no idea how exactly to do the bottom and top quartiles.
I have seen some examples for quartiles over a whole table (in this case the whole club), but I have no idea how to compute them for subgroups like the distinct athletes of a table. Could anyone point me in the right direction or give me an example?
The relevant data is in a very simple structure like this (there are more columns, but in this case they don't matter):
LAP_ID | ATHLETE | TIME
     1 | Ath_X   |  120
     2 | Ath_Y   |  160
     3 | Ath_X   |   90
     4 | Ath_X   |   80
     5 | Ath_Z   |  113
     6 | Ath_X   |  115
EDIT: There seems to be some misunderstanding. By quartile I mean the 1st and 3rd quartiles: the value that splits off the lowest 25% of the data from the highest 75%, and the value that splits off the highest 25% of the data from the lowest 75%.
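For what it's worth, one common approach on SQL Server 2008 (which lacks PERCENTILE_CONT, added in 2012) is a nearest-rank calculation with window functions. A sketch, assuming the table is called Laps:
WITH Ranked AS (
    SELECT ATHLETE,
           [TIME],
           ROW_NUMBER() OVER (PARTITION BY ATHLETE ORDER BY [TIME]) AS rn,
           COUNT(*)     OVER (PARTITION BY ATHLETE)                 AS cnt
    FROM Laps
)
SELECT ATHLETE,
       -- first TIME whose rank reaches 25% / 75% of the group (nearest-rank quartiles)
       MIN(CASE WHEN rn >= 0.25 * cnt THEN [TIME] END) AS Q1,
       MIN(CASE WHEN rn >= 0.75 * cnt THEN [TIME] END) AS Q3
FROM Ranked
GROUP BY ATHLETE;
For Ath_X above (80, 90, 115, 120), this gives Q1 = 80 and Q3 = 115.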

What to do when missing some data in a date series?

I am trying to graph a count over time from multiple sources, but having issues when the collection job fails on one (or more, but not all) of the sources.
Suppose I have a set of data like:
date | count
---------------------
10-11-2013 | 50
11-11-2013 | 52
13-11-2013 | 63
and another like
date | count
---------------------
10-11-2013 | 15
11-11-2013 | 19
12-11-2013 | 17
13-11-2013 | 20
For whatever reason I am missing the data entry on the 12th for the first one. If I am just working with this single object then I can graph it fine by skipping that element, and the line will simply be inaccurate on that day.
The problem I get is when I have multiple sources and at least one of them succeeded in reporting its results for that day. I have a queryset that gets a sum of all the daily counts:
DailyCount.objects.values('date').annotate(count=Sum('count')).order_by('date')
The results from this show a much lower number on the entry for the 12th, making the graph look very wrong whenever this happens.
date | count
---------------------
10-11-2013 | 65
11-11-2013 | 71
12-11-2013 | 17
13-11-2013 | 83
Is there a way to have my queryset use the previous date's count when one doesn't exist? I thought about adding the previous day's count to the database, but it doesn't seem right to add some (probably wrong) data to the database when I can't verify it.
Ideally, I think it would look like:
date | count
---------------------
10-11-2013 | 65
11-11-2013 | 71
12-11-2013 | 69
13-11-2013 | 83
It depends on how you display the graph. pandas can store time series of data and provides exactly the functionality you describe: backfill or forward-fill any missing values using a previous or future value (i.e., pandas.DataFrame.fillna). Using the library for just that one feature is overkill, but you may find it useful if you're planning on doing more data manipulation.
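A minimal sketch with pandas, using the data from the question: forward-fill each source separately and then sum, which reproduces your "ideal" output.
import pandas as pd

source1 = pd.Series([50, 52, 63],
                    index=pd.to_datetime(["2013-11-10", "2013-11-11", "2013-11-13"]))
source2 = pd.Series([15, 19, 17, 20],
                    index=pd.to_datetime(["2013-11-10", "2013-11-11",
                                          "2013-11-12", "2013-11-13"]))

# Reindex both onto the full daily range, carrying the last known
# count forward into any gaps (e.g. source1 on the 12th -> 52).
days = pd.date_range("2013-11-10", "2013-11-13", freq="D")
total = source1.reindex(days).ffill() + source2.reindex(days).ffill()
print(total)  # -> 65, 71, 69, 83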
I don't think a Django QuerySet can fill in missing values, as it was not built to do that. However, you could compute it manually by taking the values from the query result and computing the right daily values before displaying the graph.
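If you'd rather avoid the dependency, a plain-Python version of the same idea might look like this. A sketch: the row shape matches the .values() queryset above, and it must be applied per source before summing.
from datetime import date, timedelta

def fill_gaps(rows):
    # rows: list of {'date': date, 'count': int} for ONE source, ordered by date
    filled, prev = [], None
    for row in rows:
        # emit carried-forward entries for any skipped days
        while prev is not None and (row['date'] - prev['date']) > timedelta(days=1):
            prev = {'date': prev['date'] + timedelta(days=1),
                    'count': prev['count']}
            filled.append(prev)
        filled.append(row)
        prev = row
    return filled

# e.g. the first source: the 12th gets the 11th's count carried forward
print(fill_gaps([
    {'date': date(2013, 11, 10), 'count': 50},
    {'date': date(2013, 11, 11), 'count': 52},
    {'date': date(2013, 11, 13), 'count': 63},
]))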