Identifying newest records in parallel - azure-data-lake

We're using U-SQL to extract sensor data from a set of .csv files. Each record contains a sensor ID, time of measurement and value, as well as a time for when the record was received:
+----------+---------------------+------------------+---------------------+
| SensorID | MeasurementTime | MeasurementValue | ReceivedTime |
+----------+---------------------+------------------+---------------------+
| xxx | 2017-09-10 11:00:00 | 12.342 | 2017-09-19 14:25:17 |
| xxx | 2017-09-10 12:00:00 | 14.654 | 2017-09-19 14:25:17 |
| yyy | 2017-09-10 11:00:00 | 1.054 | 2017-09-19 14:25:17 |
| yyy | 2017-09-10 12:00:00 | 1.354 | 2017-09-19 14:25:17 |
...
| xxx | 2017-09-10 11:00:00 | 10.261 | 2017-09-19 15:25:17 |
+----------+---------------------+------------------+---------------------+
The files are stored in ADLS in a path based on the date-portion of the measurement time, so the data seen above would be found in /Data/2017/09/10/measurements.csv, where the first four rows were written at 14:25:17 on the 19th of September, and the last row was appended one hour later, at 15:25:17.
As the above example illustrates, new values for the same SensorID and MeasurementTime can be received at a later time. Each partition holds a few million rows, with a few thousand rows being appended to a small number of partitions every day. We want to run a batch job say every 24 hours, that will output only the newest values, for any given SensorID and MeasurementTime. For this, we use a U-SQL script that looks similar to this:
#newestMeasurements_addRN =
SELECT *,
ROW_NUMBER() OVER (PARTITION BY PDate,
SensorId,
MeasurementTime
ORDER BY ReceivedTime DESC) AS MeasurementRN;
#newestMeasurements =
SELECT SensorId,
MeasurementTime,
MeasurementValue
FROM #newestMeasurements_addRN
WHERE MeasurementRN == 1;
Here, PDate is a virtual column inferred from the yyyy/MM/dd in the path of the CSV file (equals the date-portion of MeasurementTime).
Now, since we use PDate in the PARTITION BY part of the window function, I expected that this operation could be parallelised, since we don't have to consider different days (partitions) when trying to find the newest record for any given SensorID and MeasurementTime. Unfortunately, that does not seem to be the case, looking at a job graph:
Here, we are extracting data from 4 different days. Each of the Extract vertices outputs the full number of records, leaving the task of identifying only the newest records to the Combine vertex at the bottom, indicating that the ROW_NUMBER and subsequent filtering does not happen in parallel.
Is this a bug in the implementation of ROW_NUMBER?
Is there a different U-SQL technique we can use to ensure parallelism?

I managed to find a usable solution, in which I encapsulated the U-SQL that detects the latest measurements inside a U-SQL stored proc, which takes a value corresponding to pdate as input parameter.
Then, I simply execute this stored proc several times, with a list of dates that I want to process in parallel:
DetectLatestMeasurements(20170910);
DetectLatestMeasurements(20170911);
DetectLatestMeasurements(20170912);
DetectLatestMeasurements(20170913);
The stored proc handles EXTRACT, transformation and OUTPUT of one days worth of data, so this does the job, and it is parallelised the way I expect.

Related

How to track day wise change in Power BI?

I am creating a Power BI report using data from https://www.mohfw.gov.in/ website which provides latest corona virus data for all Indian states/union territories.
Data is in below format -
+-----+-----------------------------+-----------+-------+-------+
| SNo | State | Confirmed | Cured | Death |
+-----+-----------------------------+-----------+-------+-------+
| 1 | Andaman and Nicobar Islands | 14 | 11 | 0 |
| 2 | Andhra Pradesh | 603 | 42 | 15 |
| 3 | Arunachal Pradesh | 1 | 0 | 0 |
| 4 | Assam | 35 | 12 | 1 |
| 5 | Bihar | 86 | 37 | 2 |
They website is refreshed with new data everyday, so there is no date wise tracker. I wanted to track the day wise change(increment/decrements) in cases for every state, is there any way I can model it in power BI to achieve this?
For now what I am doing is I am downloading the table from the web page everyday and adding a date column which will be today's date(getdate()) and loading the data into a SQL table. So everyday I am inserting a row for each of the state with that day's date-stamp in the table and then I can subtract it from previous day's data to see the changes, but I feel it is a inefficient way and the table size keep on increasing everyday.
So any suggestion to improve it, either by some changes in Power BI data model, or in SQL will be much appreciated.
Context
considering the data source is updated according to SCD 1 (Overwriting) the only way to track day wise change is to historize data every day. In practice, schedule a daily load of the data source and store the new data of that day.
Answer
You are implementing SCD 2 (Create a new record on change) in the correct way. It is important to make sure adding a technical field to each record with the timestamp when it was generated so you can study the trend later.
Extra
You could eventually optimize this approach by normalizing the model in order to reduce the size of the table you are applying SCD 2 (Create a new record on change).
Please let me give a simple example. Consider a table with:
only 1 record
1000 fields of which only 1 field (LAST_UPDATE) can change using SCD 2 (Create a new record on change)
If LAST_UPDATE changes 100,000 times a day, every days it triggers the creation of 100,000 new version of the same record (because we track its changes). Therefore, after one year the table would have still 1,000 fields and 36,500,000 records. Instead, if we normalize the model such that LAST_UPDATE field (historized with SCD 2) is stored in a separate table, after one year we would have one table with 1 record and 999 columns, and a different table with 1 column and 36,500,000 records.
In the case your database is a row database, you would much benefit from normalizing the model. Instead, if your database is columnar database, everything is already taken care of because each column is individually compressed instead of compressing row-wise.

Subtract two aggregated values in Bar Chart

My data is like -
+-----------+------------------+-----------------+-------------+
| Issue Num | Created On | Closed at | Issue Owner |
+-----------+------------------+-----------------+-------------+
| 1 | 12/21/2016 15:26 | 1/13/2017 9:48 | Name 1 |
| 2 | 1/10/2017 7:38 | 1/13/2017 9:08 | Name 2 |
| 3 | 1/13/2017 8:57 | 1/13/2017 8:58 | Name 2 |
| 4 | 12/20/2016 20:30 | 1/13/2017 5:46 | Name 2 |
| 5 | 12/21/2016 19:30 | 1/13/2017 1:14 | Name 1 |
| 6 | 12/20/2016 20:30 | 1/12/2017 9:11 | Name 1 |
| 7 | 1/9/2017 17:44 | 1/12/2017 1:52 | Name 1 |
| 8 | 12/21/2016 19:36 | 1/11/2017 16:59 | Name 1 |
| 9 | 12/20/2016 19:54 | 1/11/2017 15:45 | Name 1 |
+-----------+------------------+-----------------+-------------+
What I am trying to achieve is
Number of issues created per week
Number of issues closed per week
Net number of issues remaining per week
I am able to resolve the top two points but unable to approach the last.
My attempt -
This gives me number of issues created every week.
Similarly I have done for Closed per week.
For Net number of issues (Created-Closed) -
I tried adding Closed At column along with Created On but I can't see second bar in the chart along with Created On either.
Something like this
I tried doing the same in excel -
I want something of this sort but with another column as the difference of
number of issues created that week - number of issues closed that week.
In this case, 8-6=2.
You could use a calculated field(Analysis->Create Calculated Field). Something like this:
{FIXED [Create Date]:Count(if DATEPART('year',[Create Date]) = 2016 then [Number of Records] end)} - {FIXED [Closed Date]:Count(if DATEPART('year',[Closed Date]) = 2016 then [Number of Records] end)}
This function is using LOD expressions to pull back both sets of values. It will filter on all 2016 results for both date sets and then minus them from each other.
For more on LOD's see here:
https://www.tableau.com/about/blog/LOD-expressions
Use this as your measure and pull in one of your date fields as the dimension.
The normal way to solve this problem is to reshape the data so you have one row per status change instead of one row per issue, with a column named [Date] and a column named [Action]. The action can be submit and close (or in a more complex world include approve, reject, whatever - tracking the history.
You can do the reshaping without modifying your source data by using a UNION to get two copies of each row with appropriate calculated fields to make the visible columns make sense (e.g., create calculated a field called Date that returns the submission date or closing date depending on whether the row is from the first or second union, with a similar one called Action whose value depends on that as well. Filter out Close actions that have a null date)
Or you can preprocess the data to reshape it.
Or you can use data blending to make two sources that point to the same data source but customizing the linking fields to line up the submit and close dates (e.g., duplicate the data connection and rename both date fields to have the same name). But in this case, you probably want to create scaffolding source that has every date, but no other data, to use as the primary data source to avoid filtering out data from the secondary for dates that don't appear in the primary. The blending approach can be brittle.
Assuming you used the UNION approach instead of Data Blending, then you can count the number of submissions and closures within a certain date range, or compute a running total of the difference to see the backlog size over time.

Access Query: get difference of dates with a twist

I'm going to do my best to explain this so I apologize in advance if my explanation is a little awkward. If I am foggy somewhere, please tell me what would help you out.
I have a table filled with circuits and dates. Each circuit gets trimmed on a time cycle of about 36 months or 48 months. I have a column that gives me this info. I have one record for every time the a circuit's trim cycle has been completed. I am attempting to link a known circuit outage list, to a table with their outage data, to a table with the circuit's trim history. The twist is the following:
I only want to get back circuits that have exceeded their trim cycles by 6 months. So I would need to take all records for a circuit, look at each individual record, find the most recent previous record relative to the record currently being examined (I will need every record examined invididually), calculate the difference between the two records in months, then return only the records that exceeded 6 months of difference between any two entries for a given feeder.
Here is an example of the data:
+----+--------+----------+-------+
| ID | feeder | comp | cycle |
| 1 | 123456 | 1/1/2001 | 36 |
| 2 | 123456 | 1/1/2004 | 36 |
| 3 | 123456 | 7/1/2007 | 36 |
| 4 | 123456 | 3/1/2011 | 36 |
| 5 | 123456 | 1/1/2014 | 36 |
+----+--------+----------+-------+
Here is an example of the result set I would want (please note: cycle can vary by circuit, so the value in the cycle column needs to be in the calculation to determine if I exceeded the cycle by 6 months between trimmings):
+----+--------+----------+-------+
| ID | feeder | comp | cycle |
| 3 | 123456 | 7/1/2007 | 36 |
| 4 | 123456 | 3/1/2011 | 36 |
+----+--------+----------+-------+
This is the query I started but I'm failing really hard at determining how to make the date calculations correctly:
SELECT temp_feederList.Feeder, Temp_outagesInfo.causeType, Temp_outagesInfo.StormNameThunder, Temp_outagesInfo.deviceGroup, Temp_outagesInfo.beginTime, tbl_Trim_History.COMP, tbl_Trim_History.CYCLE
FROM (temp_feederList
LEFT JOIN Temp_outagesInfo ON temp_feederList.Feeder = Temp_outagesInfo.Feeder)
LEFT JOIN tbl_Trim_History ON Temp_outagesInfo.Feeder = tbl_Trim_History.CIRCUIT_ID;
I wasn't really able to figure out where I need to go from here to get that most recent entry and perform the mathematical comparison. I've never been asked to do SQL this complex before, so I want to thank all of you for your patience and any assistance you're willing to lend.
I'm making some assumptions, but this uses a subquery to give you rows in the feeder list where the previous completed date was greater than the number of months ago indicated by the cycle:
SELECT tbl_Trim_History.ID, tbl_Trim_History.feeder,
tbl_Trim_History.comp, tbl_Trim_History.cycle
FROM tbl_Trim_History
WHERE tbl_Trim_History.comp>
(SELECT Max(DateAdd("m", tbl_Trim_History.cycle, comp))
FROM tbl_Trim_History T2
WHERE T2.feeder = tbl_Trim_History.feeder AND
T2.comp < tbl_Trim_History.comp)
If you needed to check for longer than 36 months you could add an arbitrary value to the months calculated by the DateAdd function.
Also I don't know if the value of cycle specified the number of month from the prior cycle or the number of months to the next one. If the latter I would change tbl_Trim_History.cycle in the DateAdd function to just cycle.
SELECT tbl_trim_history.ID, tbl_trim_history.Feeder,
tbl_trim_history.Comp, tbl_trim_history.Cycle,
(select max(comp) from tbl_trim_history T
where T.feeder=tbl_trim_history.feeder and
t.comp<tbl_trim_history.comp) AS PriorComp,
IIf(DateDiff("m",[priorcomp],[comp])>36,"x") AS [Select]
FROM tbl_trim_history;
This query identifies (with an X in the last column) the records from tbl_trim_history that exceed the cycle time - but as noted in the comments I'm not entirely sure if this is what you need or not, or how to incorporate the other 2 tables. Once you see what it is doing you can modify it to only keep the records you need.

TSQL reduce the amount of data returned by a query to a parametric defined sample

I have a table containing a large amount of data which is stored on change.
tbl_bigOne
----------
timestamp | var01 | var02 | ...
2016-01-14 15:20:21 | 10.1 | 100.6 | ...
2016-01-14 15:20:26 | 11.2 | 110.3 | ...`
2016-01-14 15:21:27 | 52.1 | 620.1 | ...
2016-01-14 15:35:00 | 13.5 | 230.6 | ...
...
2016-01-15 09:18:01 | 94.4 | 140.0 | ...
2016-01-15 10:01:15 | 105.3 | 188.7 | ...
...
and so on for years of data
What I would like to obtain is a query/stored procedure that given two datetime references (date_from and date_to) gives the required selected data.
Now, the query just mentioned is pretty straight forward what I would also like to achieve is to set the maximum number of rows returned per day (if data is available) while doing the average of the values.
Let's give a few examples:
date_from: 2016-01-14 00:00:00
date_to: 2016-01-20 23:59:59
max_points:12
in this case the time windows is of 7 days and in this one i would like to have a maximum of 12 rows for each days of the 7 day window, giving a max total of 84 rows whilst doing the average from all the grouping done since, the data for each day is now partitioned by 12.
It is possible to see this partitioning as if every hour worth of data for that specific day is averaged, generating one row of the 12 required for a day.
date_from: 2016-01-14 00:00:00
date_to: 2016-01-14 23:59:59
max_points:1440
in this case the time window is one day worth and, if available, i would like to have a maximum of 1440 rows (for each day) for the selected period.
In this way the parameter defines the maximum number of rows for each day. The minimum time window is one day nothing below that.
Can something like this be achieved just using TSQL?
Thank you.
edit for taking care of the observations raised by #Thorsten Kettner
Use the analytic function ROW_NUMBER() to number the matching rows per day. Then only keep rows up to the given limit. If you want the rows arbitrarily chosen when there exist more than needed, then number the rows in random order using NEWID().
select timestmp, var01, var02, var03
from
(
select
mytable.*,
row_number() over (partition by convert(date, timestmp) order by newid()) as rn
from mytable
where convert(date, timestmp) between #start_date and #end_date
) numbered
where rn <= #limit
order by timestmp;

What to do when missing some data in a date series?

I am trying to graph a count over time from multiple sources, but having issues when the collection job fails on one (or more, but not all) of the sources.
Suppose I have a set of data like:
date | count
---------------------
10-11-2013 | 50
11-11-2013 | 52
13-11-2013 | 63
and another like
date | count
---------------------
10-11-2013 | 15
11-11-2013 | 19
12-11-2013 | 17
13-11-2013 | 20
for whatever reason I am missing the data entry on the 12th for the first one. If I am just working with this single object then I can graph it fine by just skipping that element and the line will just be inaccurate on that day.
The problem I get is when I have multiple sources, and at least one of them succeeded in reporting its results for that day. I have a queryset that gets a sum of the all the daily counts:
DailyCount.objects.values('date').annotate(count=Sum('count')).order_by('date')
The results from this show a much lower number on the entry for the 12th. Making the graph look very wrong whenever this happens.
date | count
---------------------
10-11-2013 | 65
11-11-2013 | 71
12-11-2013 | 17
13-11-2013 | 83
Is there a way to have my queryset use the previous date's count if it doesn't exist? I thought about adding the previous day's count to the database, but it doesn't seem right to be adding some (probably wrong) data to the database when I can't verify it.
ideally I think it would look like:
date | count
---------------------
10-11-2013 | 65
11-11-2013 | 71
12-11-2013 | 69
13-11-2013 | 83
It depends on how you display the graph. In pandas you can store time series of data as well and they provide exactly the functionality you describe: backfill or forward fill any missing values by using a previous or future value (i.e., pandas.DataFrame.fillna). On one hand using that library for just that functionality is overkill but you may find it useful if you're planning on doing more data manipulation.
I don't think a Django QuerySet can fill in missing values as it was not built to do that. However you could compute it manually by taking the values from the query result and computing the right daily values before displaying the graph.