Tracking Growth of a Metric over Time in TimescaleDB

I'm currently running TimescaleDB. I have a table that looks similar to the following:
one_day                 | name | metric_value
------------------------+------+--------------
2022-05-30 00:00:00+00  | foo  |          400
2022-05-30 00:00:00+00  | bar  |          200
2022-06-01 00:00:00+00  | foo  |          800
2022-06-01 00:00:00+00  | bar  |         1000
I'd like a query that returns the % growth and raw growth of the metric, so something like the following:
name | % growth | growth
-----+----------+-------
foo  | 200%     |    400
bar  | 500%     |    800
I'm fairly new to TimescaleDB and not sure what the most efficient way to do this is. I've tried using LAG, but the main problem I'm facing is that OVER (GROUP BY time, url) doesn't restrict the window to rows with the same name, and I can't seem to get around that. The query works fine for a single name.
Thanks!

Use LAG to get the previous value for the same name using the PARTITION option:
lag(metric_value,1,0) over (partition by name order by one_day)
This says: within each 'name', ordered by 'one_day', give me the value of 'metric_value' from the previous row (the second parameter to LAG says 1 row back); if there is no previous row, give me '0' (the third parameter).
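Putting it together, a minimal sketch of a full query, assuming the hypertable is called metrics (the real name isn't given in the question) and that "% growth" means the current value as a percentage of the previous one, as in the example output above:
SELECT name,
       metric_value - prev_value AS growth,
       round(100.0 * metric_value / NULLIF(prev_value, 0)) AS pct_growth
FROM (
    SELECT name,
           one_day,
           metric_value,
           lag(metric_value, 1, 0) OVER (PARTITION BY name ORDER BY one_day) AS prev_value
    FROM metrics
) t
WHERE prev_value <> 0;  -- skip the first bucket per name, where there is nothing to compare against
This returns one row per name for every bucket after the first; filter on one_day if you only want the growth for the latest bucket.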

Related

Using Window Functions to search back through rows based on time

I'm new to SQL and have been battling for days to understand how to search backwards through previous rows based on time.
I found that the window LAG function may help me here, but I have not found a way to define a time period for it to search back through.
If I enter:
SELECT food_word_1,
date,
lead(food_word_1,2) OVER (ORDER BY date DESC) as prev_food_word_1
FROM bookmark
WHERE mood = 'allergies';
The result looks like the following:
food_word_1 | date | prev_food_word_1
-------------+----------------------------+------------------
burritos | 2019-02-01 09:56:40.943341 |
burritos | 2019-02-01 09:56:31.56869 |
burritos | 2019-02-01 09:56:31.34883 | burritos
cereal bar | 2019-01-10 07:24:29.602226 | burritos
almonds | 2019-01-09 08:37:34.223448 | burritos
fennel | 2019-01-09 08:35:44.186134 | cereal bar
I get a result searching back 2 rows, but what I would like is to search backwards (lag) over the rows from the previous 36 hours, instead of having to define a fixed number of rows that have no time period attached to them.
Does anyone know the best approach for this please?
Thanks
This answer is for Oracle, because the question was originally tagged Oracle.
Oracle supports RANGE BETWEEN with numeric offsets; when the ORDER BY key is a date, the offset is interpreted as a number of days, so 1.5 corresponds to 36 hours. Try this:
SELECT food_word_1,
date,
lead(food_word_1) OVER (ORDER BY date DESC RANGE BETWEEN 1.5 PRECEDING AND CURRENT ROW) as prev_food_word_1
FROM bookmark
WHERE mood = 'allergies';
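One caveat: LEAD and LAG work on row offsets and, in Oracle at least, do not accept a windowing clause, so the frame above may be rejected or ignored. A hedged alternative sketch (my own, not the original answer's code): take the ordinary previous row with LAG and blank it out when it is more than 36 hours older. Column names are kept as written in the question; date may need quoting or renaming depending on the database.
SELECT food_word_1,
       date,
       CASE
           WHEN LAG(date) OVER (ORDER BY date) >= date - INTERVAL '36' HOUR
           THEN LAG(food_word_1) OVER (ORDER BY date)
       END AS prev_food_word_1
FROM bookmark
WHERE mood = 'allergies';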

How to get subtotals with time datatype in SQL?

I'm stuck generating a SQL query. I have a table in a Firebird DB like the following one:
ID | PROCESS | STEP | TIME
654 | 1 | 1 | 09:08:40
655 | 1 | 2 | 09:09:32
656 | 1 | 3 | 09:10:04
...
670 | 2 | 15 | 09:30:05
671 | 2 | 16 | 09:31:00
and so on.
I need the subtotals for each process group (there are about 7 of these). The table uses the TIME type for the TIME column. I have been trying DATEDIFF, but it doesn't work.
You need to use SUM.
This question has been answered here:
How to sum up time field in SQL Server
and here:
SUM total time in SQL Server
For more specific Firebird documentation, read up on the SUM function here:
Sum() - Firebird Official Documentation
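As a hedged sketch of what that can look like in Firebird 3+ (the table name process_steps is an assumption, and "TIME" is quoted only because TIME is a reserved word; use your real column name): compute the gap between consecutive steps with LAG, then SUM the gaps per process.
SELECT process,
       SUM(DATEDIFF(SECOND FROM prev_time TO "TIME")) AS total_seconds
FROM (
    SELECT process,
           "TIME",
           LAG("TIME") OVER (PARTITION BY process ORDER BY step) AS prev_time
    FROM process_steps
) t
GROUP BY process;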
I think you should use GROUP BY to get the max and min time and use them in the DATEDIFF function (note the argument order: DATEDIFF returns the second argument minus the first). Something like this:
select process, datediff(second, min(time), max(time)) as nb_seconds
from your_table
group by process;

Identifying newest records in parallel

We're using U-SQL to extract sensor data from a set of .csv files. Each record contains a sensor ID, time of measurement and value, as well as a time for when the record was received:
+----------+---------------------+------------------+---------------------+
| SensorID | MeasurementTime | MeasurementValue | ReceivedTime |
+----------+---------------------+------------------+---------------------+
| xxx | 2017-09-10 11:00:00 | 12.342 | 2017-09-19 14:25:17 |
| xxx | 2017-09-10 12:00:00 | 14.654 | 2017-09-19 14:25:17 |
| yyy | 2017-09-10 11:00:00 | 1.054 | 2017-09-19 14:25:17 |
| yyy | 2017-09-10 12:00:00 | 1.354 | 2017-09-19 14:25:17 |
...
| xxx | 2017-09-10 11:00:00 | 10.261 | 2017-09-19 15:25:17 |
+----------+---------------------+------------------+---------------------+
The files are stored in ADLS in a path based on the date-portion of the measurement time, so the data seen above would be found in /Data/2017/09/10/measurements.csv, where the first four rows were written at 14:25:17 on the 19th of September, and the last row was appended one hour later, at 15:25:17.
As the above example illustrates, new values for the same SensorID and MeasurementTime can be received at a later time. Each partition holds a few million rows, with a few thousand rows being appended to a small number of partitions every day. We want to run a batch job say every 24 hours, that will output only the newest values, for any given SensorID and MeasurementTime. For this, we use a U-SQL script that looks similar to this:
@newestMeasurements_addRN =
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY PDate,
                                           SensorId,
                                           MeasurementTime
                              ORDER BY ReceivedTime DESC) AS MeasurementRN
    FROM @measurements;   // rowset produced by the EXTRACT step (name assumed)

@newestMeasurements =
    SELECT SensorId,
           MeasurementTime,
           MeasurementValue
    FROM @newestMeasurements_addRN
    WHERE MeasurementRN == 1;
Here, PDate is a virtual column inferred from the yyyy/MM/dd in the path of the CSV file (equals the date-portion of MeasurementTime).
Now, since we use PDate in the PARTITION BY part of the window function, I expected that this operation could be parallelised, since we don't have to consider different days (partitions) when trying to find the newest record for any given SensorID and MeasurementTime. Unfortunately, that does not seem to be the case, looking at a job graph:
Here, we are extracting data from 4 different days. Each of the Extract vertices outputs the full number of records, leaving the task of identifying only the newest records to the Combine vertex at the bottom, indicating that the ROW_NUMBER and subsequent filtering does not happen in parallel.
Is this a bug in the implementation of ROW_NUMBER?
Is there a different U-SQL technique we can use to ensure parallelism?
I managed to find a usable solution, in which I encapsulated the U-SQL that detects the latest measurements inside a U-SQL stored proc, which takes a value corresponding to pdate as input parameter.
Then, I simply execute this stored proc several times, with a list of dates that I want to process in parallel:
DetectLatestMeasurements(20170910);
DetectLatestMeasurements(20170911);
DetectLatestMeasurements(20170912);
DetectLatestMeasurements(20170913);
The stored proc handles the EXTRACT, transformation and OUTPUT of one day's worth of data, so this does the job, and it is parallelised the way I expect.
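For illustration, a hedged sketch of what such a procedure could look like; the parameter type, paths and extractor options are my assumptions, not the author's actual code:
CREATE PROCEDURE DetectLatestMeasurements(@pdate string)
AS
BEGIN
    // Paths are assumptions; @pdate would be passed as e.g. "2017/09/10" here.
    DECLARE @inputPath string = "/Data/" + @pdate + "/measurements.csv";
    DECLARE @outputPath string = "/Output/" + @pdate + "/newest.csv";

    @measurements =
        EXTRACT SensorId string,
                MeasurementTime DateTime,
                MeasurementValue double,
                ReceivedTime DateTime
        FROM @inputPath
        USING Extractors.Csv(skipFirstNRows:1);

    // PDate is no longer needed in the PARTITION BY: the proc only ever sees one day.
    @ranked =
        SELECT SensorId,
               MeasurementTime,
               MeasurementValue,
               ROW_NUMBER() OVER (PARTITION BY SensorId, MeasurementTime
                                  ORDER BY ReceivedTime DESC) AS MeasurementRN
        FROM @measurements;

    @latest =
        SELECT SensorId, MeasurementTime, MeasurementValue
        FROM @ranked
        WHERE MeasurementRN == 1;

    OUTPUT @latest
    TO @outputPath
    USING Outputters.Csv();
END;
Each call then extracts, transforms and outputs a single day, which is what lets the work for different days proceed independently.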

Subtract two aggregated values in Bar Chart

My data looks like this:
+-----------+------------------+-----------------+-------------+
| Issue Num | Created On | Closed at | Issue Owner |
+-----------+------------------+-----------------+-------------+
| 1 | 12/21/2016 15:26 | 1/13/2017 9:48 | Name 1 |
| 2 | 1/10/2017 7:38 | 1/13/2017 9:08 | Name 2 |
| 3 | 1/13/2017 8:57 | 1/13/2017 8:58 | Name 2 |
| 4 | 12/20/2016 20:30 | 1/13/2017 5:46 | Name 2 |
| 5 | 12/21/2016 19:30 | 1/13/2017 1:14 | Name 1 |
| 6 | 12/20/2016 20:30 | 1/12/2017 9:11 | Name 1 |
| 7 | 1/9/2017 17:44 | 1/12/2017 1:52 | Name 1 |
| 8 | 12/21/2016 19:36 | 1/11/2017 16:59 | Name 1 |
| 9 | 12/20/2016 19:54 | 1/11/2017 15:45 | Name 1 |
+-----------+------------------+-----------------+-------------+
What I am trying to achieve is
Number of issues created per week
Number of issues closed per week
Net number of issues remaining per week
I am able to resolve the top two points but unable to approach the last.
My attempt so far gives me the number of issues created every week, and I have done the same for issues closed per week.
For the net number of issues (created minus closed), I tried adding the Closed At column alongside Created On, but I can't get a second bar to show in the chart next to Created On.
I tried mocking up the same thing in Excel: I want another column showing the difference between the number of issues created that week and the number of issues closed that week. In this case, 8 - 6 = 2.
You could use a calculated field (Analysis -> Create Calculated Field). Something like this:
{FIXED [Create Date]:Count(if DATEPART('year',[Create Date]) = 2016 then [Number of Records] end)} - {FIXED [Closed Date]:Count(if DATEPART('year',[Closed Date]) = 2016 then [Number of Records] end)}
This calculation uses LOD expressions to pull back both sets of values. It filters both date fields to 2016 results and then subtracts one count from the other.
For more on LODs, see here:
https://www.tableau.com/about/blog/LOD-expressions
Use this as your measure and pull in one of your date fields as the dimension.
The normal way to solve this problem is to reshape the data so you have one row per status change instead of one row per issue, with a column named [Date] and a column named [Action]. The action can be submit or close (or, in a more complex world, approve, reject, whatever, to track the full history).
You can do the reshaping without modifying your source data by using a UNION to get two copies of each row, with calculated fields that make the visible columns make sense: for example, create a calculated field called Date that returns the submission date or the closing date depending on whether the row comes from the first or second half of the union, and a similar one called Action whose value depends on that as well. Filter out Close actions that have a null date.
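For concreteness, a hedged sketch of that union written as custom SQL; the table and column names are guesses based on the sample data, and the same reshaping can be done with Tableau's built-in union feature or upstream of Tableau:
SELECT issue_num, issue_owner, created_on AS action_date, 'Submit' AS action
FROM issues
UNION ALL
SELECT issue_num, issue_owner, closed_at AS action_date, 'Close' AS action
FROM issues
WHERE closed_at IS NOT NULL   -- drop Close actions that have a null date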
Or you can preprocess the data to reshape it.
Or you can use data blending to make two sources that point to the same data, customizing the linking fields to line up the submit and close dates (e.g., duplicate the data connection and rename both date fields to have the same name). In this case, you probably want to create a scaffolding source that has every date but no other data, to use as the primary data source, so that dates that appear only in the secondary source are not filtered out. The blending approach can be brittle.
Assuming you used the UNION approach instead of Data Blending, then you can count the number of submissions and closures within a certain date range, or compute a running total of the difference to see the backlog size over time.
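With the reshaped data, the running backlog can be a simple table calculation; a hedged sketch of such a calculated field, using the [Action] field from the union above:
RUNNING_SUM(SUM(IF [Action] = 'Submit' THEN 1 ELSE -1 END))
Plot this against the week of [Date] to see the net number of open issues over time.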

Oracle, MySQL, how to get average

How can I get the average fuel consumption using only MySQL or Oracle? My current query:
SELECT te.fuelName,
zkd.fuelCapacity,
zkd.odometer,
zkd.data AS tanking
FROM ZakupKartyDrogowej zkd
JOIN TypElementu te
ON te.typElementu_Id = zkd.typElementu_Id
AND te.idFirmy = zkd.idFirmy
AND te.typElementu_Id IN (3,4,5)
WHERE zkd.idFirmy = 1054
AND zkd.kartaDrogowa_Id = 42
AND zkd.data BETWEEN to_date('2015-09-01','YYYY-MM-DD')
AND to_date('2015-09-30','YYYY-MM-DD');
Result of this query is:
fuelName | fuelCapacity | odometer | tanking
-------------------------------------------------
'ON' | 534 | 1284172 | 2015-09-29
'ON' | 571 | 1276284 | 2015-09-02
'ON' | 470 | 1277715 | 2015-09-07
'ON' | 580.01 | 1279700 | 2015-09-11
'ON' | 490 | 1281103 | 2015-09-17
'ON' | 520 | 1282690 | 2015-09-23
We could compute it later in Java or PHP, but we want to get the result directly from the query. How should we modify the above query to do that?
fuelCapacity is the number of liters of fuel that was poured into the car's tank at the gas station.
For one total average, what you need is the sum of the refills divided by the difference between the odometer readings at the start and the end, i.e. fuel used / distance travelled.
I don't have your table structure at hand, but this alteration to the select statement should do the trick:
select cast(sum(zkd.fuelCapacity) as float) / (max(zkd.odometer) - min(zkd.odometer)) as consumption ...
The cast(field AS float) does what the name implies and typecasts the field as a float, so the result will also be a float. (I suspect that your fuelCapacity column is already a float, because there is one fractional value in your example, but this makes sure.)
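Combined with the original query, a hedged sketch of the full statement (the column alias is my own; multiply by 100 if you prefer the usual litres-per-100-km figure):
SELECT CAST(SUM(zkd.fuelCapacity) AS float)
       / (MAX(zkd.odometer) - MIN(zkd.odometer)) AS consumption
FROM ZakupKartyDrogowej zkd
JOIN TypElementu te
  ON te.typElementu_Id = zkd.typElementu_Id
 AND te.idFirmy = zkd.idFirmy
 AND te.typElementu_Id IN (3,4,5)
WHERE zkd.idFirmy = 1054
  AND zkd.kartaDrogowa_Id = 42
  AND zkd.data BETWEEN to_date('2015-09-01','YYYY-MM-DD')
                   AND to_date('2015-09-30','YYYY-MM-DD');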