TSQL reduce the amount of data returned by a query to a parametric defined sample - sql

I have a table containing a large amount of data which is stored on change.
tbl_bigOne
----------
timestamp | var01 | var02 | ...
2016-01-14 15:20:21 | 10.1 | 100.6 | ...
2016-01-14 15:20:26 | 11.2 | 110.3 | ...
2016-01-14 15:21:27 | 52.1 | 620.1 | ...
2016-01-14 15:35:00 | 13.5 | 230.6 | ...
...
2016-01-15 09:18:01 | 94.4 | 140.0 | ...
2016-01-15 10:01:15 | 105.3 | 188.7 | ...
...
and so on for years of data
What I would like is a query or stored procedure that, given two datetime references (date_from and date_to), returns the data in that range.
That query by itself is pretty straightforward. What I would also like to do is cap the number of rows returned per day (if data is available) by averaging the values.
Let's give a few examples:
date_from: 2016-01-14 00:00:00
date_to: 2016-01-20 23:59:59
max_points:12
In this case the time window is 7 days, and I would like a maximum of 12 rows for each day of the window, giving a maximum total of 84 rows. Each day's data is split into 12 groups, and each returned row is the average of one group.
It is possible to see this partitioning as if every hour worth of data for that specific day is averaged, generating one row of the 12 required for a day.
date_from: 2016-01-14 00:00:00
date_to: 2016-01-14 23:59:59
max_points:1440
In this case the time window is one day, and I would like a maximum of 1440 rows (for each day) for the selected period, if available.
In this way the parameter defines the maximum number of rows for each day. The minimum time window is one day, nothing below that.
Can something like this be achieved just using TSQL?
Thank you.
Edited to take care of the observations raised by @Thorsten Kettner.

Use the analytic function ROW_NUMBER() to number the matching rows per day. Then only keep rows up to the given limit. If you want the rows arbitrarily chosen when there exist more than needed, then number the rows in random order using NEWID().
select timestmp, var01, var02, var03
from
(
select
mytable.*,
row_number() over (partition by convert(date, timestmp) order by newid()) as rn
from mytable
where convert(date, timestmp) between @start_date and @end_date
) numbered
where rn <= @limit
order by timestmp;
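The query above picks random rows per day; the question also asks for averaged values. Here is a minimal sketch of that bucketing idea, written in Python with an in-memory SQLite database standing in for SQL Server so it can be run anywhere. The function names, the column name timestmp and the choice of equal-width time buckets (86400 seconds divided by max_points, rounded up) are my assumptions, not something the question fixes.

```python
import sqlite3


def daily_bucket_averages(conn, date_from, date_to, max_points):
    """Average var01/var02 into at most max_points time buckets per day."""
    # Bucket width in seconds, rounded up so that one day can never
    # produce more than max_points buckets (assumption: equal-width buckets).
    width = -(-86400 // max_points)
    sql = """
        SELECT date(timestmp) AS day,
               (CAST(strftime('%s', timestmp) AS INTEGER) % 86400) / :width AS bucket,
               AVG(var01) AS avg_var01,
               AVG(var02) AS avg_var02
          FROM tbl_bigOne
         WHERE timestmp BETWEEN :date_from AND :date_to
         GROUP BY day, bucket
         ORDER BY day, bucket
    """
    params = {"width": width, "date_from": date_from, "date_to": date_to}
    return conn.execute(sql, params).fetchall()


def sample_connection():
    """Build an in-memory table holding the rows shown in the question."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tbl_bigOne (timestmp TEXT, var01 REAL, var02 REAL)")
    conn.executemany(
        "INSERT INTO tbl_bigOne VALUES (?, ?, ?)",
        [
            ("2016-01-14 15:20:21", 10.1, 100.6),
            ("2016-01-14 15:20:26", 11.2, 110.3),
            ("2016-01-14 15:21:27", 52.1, 620.1),
            ("2016-01-14 15:35:00", 13.5, 230.6),
            ("2016-01-15 09:18:01", 94.4, 140.0),
            ("2016-01-15 10:01:15", 105.3, 188.7),
        ],
    )
    return conn
```

With max_points = 12 each bucket covers two hours, so the four rows from 2016-01-14 collapse into one averaged row. In T-SQL the same grouping key could probably be built with DATEDIFF(second, ...) against midnight of each day, but I have not tested that.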

Related

Using Window Functions to search back through rows based on time

I'm new to SQL and have been battling for days to understand how to search backwards through previous rows based on time.
I found that the LAG window function may help me here, but I have not found a way to define a time period for it to search back through.
If I enter: -
SELECT food_word_1,
date,
lead(food_word_1,2) OVER (ORDER BY date DESC) as prev_food_word_1
FROM bookmark
WHERE mood = 'allergies'
The result looks like the following: -
food_word_1 | date | prev_food_word_1
-------------+----------------------------+------------------
burritos | 2019-02-01 09:56:40.943341 |
burritos | 2019-02-01 09:56:31.56869 |
burritos | 2019-02-01 09:56:31.34883 | burritos
cereal bar | 2019-01-10 07:24:29.602226 | burritos
almonds | 2019-01-09 08:37:34.223448 | burritos
fennel | 2019-01-09 08:35:44.186134 | cereal bar
I get a result searching back 2 rows but what I would like to do is have this searching backwards (lag) for rows 36 hours previous instead of me having to define the number of rows with no time associated with them.
Does anyone know the best approach for this please?
Thanks
This answer is for Oracle, because the question was originally tagged Oracle.
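Oracle supports RANGE BETWEEN with numeric ranges, and these also work for dates, where 1.5 means 1.5 days (36 hours). LEAD and LAG do not accept a windowing clause, though, so use a function that does, such as FIRST_VALUE, which here returns the earliest food word in the 36 hours up to the current row. Try this:
SELECT food_word_1,
date,
first_value(food_word_1) OVER (ORDER BY date RANGE BETWEEN 1.5 PRECEDING AND CURRENT ROW) as prev_food_word_1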
FROM bookmark
WHERE mood = 'allergies';
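The same "look back 36 hours" idea can also be written without a window frame, as a correlated subquery. This is only a sketch: it uses Python with an in-memory SQLite database as a stand-in, the function names are made up, and I am assuming "previous" means the most recent earlier row that falls inside the 36 hour window.

```python
import sqlite3


def previous_within_hours(conn, hours=36):
    """For each 'allergies' row, find the latest earlier row within `hours`."""
    sql = """
        SELECT cur.food_word_1,
               cur.date,
               (SELECT prev.food_word_1
                  FROM bookmark AS prev
                 WHERE prev.mood = cur.mood
                   AND prev.date < cur.date
                   -- datetime() drops fractional seconds, so the lower
                   -- bound is up to one second earlier than exactly N hours
                   AND prev.date >= datetime(cur.date, :offset)
                 ORDER BY prev.date DESC
                 LIMIT 1) AS prev_food_word_1
          FROM bookmark AS cur
         WHERE cur.mood = 'allergies'
         ORDER BY cur.date DESC
    """
    return conn.execute(sql, {"offset": f"-{hours} hours"}).fetchall()


def sample_connection():
    """Build an in-memory table holding the rows shown in the question."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bookmark (food_word_1 TEXT, date TEXT, mood TEXT)")
    conn.executemany(
        "INSERT INTO bookmark VALUES (?, ?, 'allergies')",
        [
            ("burritos", "2019-02-01 09:56:40.943341"),
            ("burritos", "2019-02-01 09:56:31.56869"),
            ("burritos", "2019-02-01 09:56:31.34883"),
            ("cereal bar", "2019-01-10 07:24:29.602226"),
            ("almonds", "2019-01-09 08:37:34.223448"),
            ("fennel", "2019-01-09 08:35:44.186134"),
        ],
    )
    return conn
```

Rows with nothing in the preceding 36 hours get NULL (None in Python) instead of a value from an unrelated, much older row.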

Identifying newest records in parallel

We're using U-SQL to extract sensor data from a set of .csv files. Each record contains a sensor ID, time of measurement and value, as well as a time for when the record was received:
+----------+---------------------+------------------+---------------------+
| SensorID | MeasurementTime | MeasurementValue | ReceivedTime |
+----------+---------------------+------------------+---------------------+
| xxx | 2017-09-10 11:00:00 | 12.342 | 2017-09-19 14:25:17 |
| xxx | 2017-09-10 12:00:00 | 14.654 | 2017-09-19 14:25:17 |
| yyy | 2017-09-10 11:00:00 | 1.054 | 2017-09-19 14:25:17 |
| yyy | 2017-09-10 12:00:00 | 1.354 | 2017-09-19 14:25:17 |
...
| xxx | 2017-09-10 11:00:00 | 10.261 | 2017-09-19 15:25:17 |
+----------+---------------------+------------------+---------------------+
The files are stored in ADLS in a path based on the date-portion of the measurement time, so the data seen above would be found in /Data/2017/09/10/measurements.csv, where the first four rows were written at 14:25:17 on the 19th of September, and the last row was appended one hour later, at 15:25:17.
As the above example illustrates, new values for the same SensorID and MeasurementTime can be received at a later time. Each partition holds a few million rows, with a few thousand rows being appended to a small number of partitions every day. We want to run a batch job say every 24 hours, that will output only the newest values, for any given SensorID and MeasurementTime. For this, we use a U-SQL script that looks similar to this:
@newestMeasurements_addRN =
SELECT *,
ROW_NUMBER() OVER (PARTITION BY PDate,
SensorId,
MeasurementTime
ORDER BY ReceivedTime DESC) AS MeasurementRN;
@newestMeasurements =
SELECT SensorId,
MeasurementTime,
MeasurementValue
FROM @newestMeasurements_addRN
WHERE MeasurementRN == 1;
Here, PDate is a virtual column inferred from the yyyy/MM/dd in the path of the CSV file (equals the date-portion of MeasurementTime).
Now, since we use PDate in the PARTITION BY part of the window function, I expected that this operation could be parallelised, since we don't have to consider different days (partitions) when trying to find the newest record for any given SensorID and MeasurementTime. Unfortunately, that does not seem to be the case, looking at a job graph:
Here, we are extracting data from 4 different days. Each of the Extract vertices outputs the full number of records, leaving the task of identifying only the newest records to the Combine vertex at the bottom, indicating that the ROW_NUMBER and subsequent filtering does not happen in parallel.
Is this a bug in the implementation of ROW_NUMBER?
Is there a different U-SQL technique we can use to ensure parallelism?
I managed to find a usable solution, in which I encapsulated the U-SQL that detects the latest measurements inside a U-SQL stored proc, which takes a value corresponding to pdate as input parameter.
Then, I simply execute this stored proc several times, with a list of dates that I want to process in parallel:
DetectLatestMeasurements(20170910);
DetectLatestMeasurements(20170911);
DetectLatestMeasurements(20170912);
DetectLatestMeasurements(20170913);
The stored proc handles EXTRACT, transformation and OUTPUT of one day's worth of data, so this does the job, and it is parallelised the way I expect.
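For readers less familiar with the ROW_NUMBER ... == 1 pattern, here is a rough Python sketch of the logic each per-day run performs: keep only the most recently received value for every (SensorID, MeasurementTime) pair, with each day handled independently. It is not U-SQL and says nothing about how the engine schedules work; the function names and the tuple layout are assumptions made for illustration.

```python
from collections import defaultdict


def newest_per_key(rows):
    """Keep the latest-received value per (sensor_id, measurement_time).

    rows: iterable of (sensor_id, measurement_time, value, received_time),
    with times as 'YYYY-MM-DD HH:MM:SS' strings, which sort chronologically.
    """
    newest = {}
    for sensor_id, measurement_time, value, received_time in rows:
        key = (sensor_id, measurement_time)
        if key not in newest or received_time > newest[key][1]:
            newest[key] = (value, received_time)
    return {key: value for key, (value, _) in newest.items()}


def newest_per_day(rows):
    """Split rows by the date part of measurement_time, then dedupe each day."""
    by_day = defaultdict(list)
    for row in rows:
        by_day[row[1][:10]].append(row)
    # Each day is independent of the others, which is what makes
    # one-run-per-day (and therefore parallel runs) possible.
    return {day: newest_per_key(day_rows) for day, day_rows in by_day.items()}


SAMPLE_ROWS = [
    ("xxx", "2017-09-10 11:00:00", 12.342, "2017-09-19 14:25:17"),
    ("xxx", "2017-09-10 12:00:00", 14.654, "2017-09-19 14:25:17"),
    ("yyy", "2017-09-10 11:00:00", 1.054, "2017-09-19 14:25:17"),
    ("yyy", "2017-09-10 12:00:00", 1.354, "2017-09-19 14:25:17"),
    ("xxx", "2017-09-10 11:00:00", 10.261, "2017-09-19 15:25:17"),
]
```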

Get records after a certain time in PostgreSQL

I have a table that looks like this:
id | flight_number | departure_time | arrival_time
---+---------------+----------------+-------------
1 | UAL123 | 07:00:00 | 08:30:00
---+---------------+----------------+-------------
2 | AAL456 | 07:30:00 | 08:40:00
---+---------------+----------------+-------------
3 | SWA789 | 07:45:00 | 09:10:00
I'm trying to figure out an SQL query that can get upcoming flights based on departure time given the current time. For instance, at 07:20, I would like to return AAL456, SWA789 since those flights have not departed yet. At 07:40, I would like to just return SWA789. What is a good way to do this?
Well, you can use LOCALTIME to get the current time. So, if the departure_time is stored as a time, then:
select t.*
from t
where t.departure_time > localtime;
This assumes no time zone information is part of the time value. Also, it will return no flights after the last flight has departed for a day (which is consistent with the phrasing of your question).
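If you want to try the comparison without waiting for the clock, here is a small sketch in Python with an in-memory SQLite table. The current time is passed in as a parameter instead of using LOCALTIME, and the times are stored as zero-padded HH:MM:SS text so that plain comparison orders them correctly; both choices, and the function names, are assumptions for the sake of the example.

```python
import sqlite3


def upcoming_flights(conn, now):
    """Return flight numbers whose departure_time is later than `now`."""
    rows = conn.execute(
        "SELECT flight_number FROM t WHERE departure_time > ? ORDER BY departure_time",
        (now,),
    )
    return [row[0] for row in rows]


def sample_connection():
    """Build an in-memory table holding the rows shown in the question."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE t (id INTEGER, flight_number TEXT, departure_time TEXT, arrival_time TEXT)"
    )
    conn.executemany(
        "INSERT INTO t VALUES (?, ?, ?, ?)",
        [
            (1, "UAL123", "07:00:00", "08:30:00"),
            (2, "AAL456", "07:30:00", "08:40:00"),
            (3, "SWA789", "07:45:00", "09:10:00"),
        ],
    )
    return conn
```

Calling upcoming_flights(conn, "07:20:00") gives AAL456 and SWA789, and "07:40:00" gives only SWA789, matching the cases in the question.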

How to average data on periods from a table in SQL

I'm trying to average data over specific periods of time, and also average the datetimes of the rows in each period.
Having data like:
value | datetime
-------+------------------------
15 | 2015-08-16 01:00:40+02
22 | 2015-08-16 01:01:40+02
16 | 2015-08-16 01:02:40+02
19 | 2015-08-16 01:03:40+02
21 | 2015-08-16 01:04:40+02
18 | 2015-08-16 01:05:40+02
29 | 2015-08-16 01:06:40+02
16 | 2015-08-16 01:07:40+02
16 | 2015-08-16 01:08:40+02
15 | 2015-08-16 01:09:40+02
I would like to obtain something like in one query:
value | datetime
-------+------------------------
18.6 | 2015-08-16 01:03:00+02
18.8 | 2015-08-16 01:08:00+02
where value is the average of the first 5 initial values and datetime is the middle (or average) of the 5 initial datetimes, with 5 being the interval n.
I saw some posts that put me on the track with avg, group by and averaging date format in SQL but I'm still not able to find out what to do exactly.
I'm working under PostgreSQL 9.4
You would need to share more information, but here is one way to do it:
mysql> SELECT AVG(value), AVG(datetime)
FROM database.table
WHERE datetime > date1
AND datetime < date2;
Something like
SELECT
to_timestamp(round(AVG(EXTRACT(epoch from datetime)))) as middleDate,
avg(value) AS avgValue
FROM
myTable
GROUP BY
(id) / ((SELECT Count(*) FROM myTable) / 100);
roughly met my requirements, with 100 controlling the length of the averaged intervals (overall it equals the number of output lines).
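To make the intended result concrete, here is a minimal Python sketch of "average every n consecutive rows", using the sample data from the question. It is not a PostgreSQL query; the function name and the choice to average the timestamps through their epoch seconds are assumptions. Note that the averaged datetime is the exact mean and is not rounded to the minute the way the expected output in the question is.

```python
from datetime import datetime, timedelta, timezone


def average_in_chunks(rows, n):
    """Average values and timestamps over consecutive chunks of n rows.

    rows: list of (value, timezone-aware datetime), already sorted by time.
    Returns a list of (average value, average datetime) tuples.
    """
    result = []
    for start in range(0, len(rows), n):
        chunk = rows[start:start + n]
        avg_value = sum(value for value, _ in chunk) / len(chunk)
        # Average the timestamps via their epoch seconds.
        avg_epoch = sum(ts.timestamp() for _, ts in chunk) / len(chunk)
        tz = chunk[0][1].tzinfo
        result.append((avg_value, datetime.fromtimestamp(avg_epoch, tz=tz)))
    return result


# Sample data from the question: one reading per minute at +02.
TZ = timezone(timedelta(hours=2))
VALUES = [15, 22, 16, 19, 21, 18, 29, 16, 16, 15]
SAMPLE = [
    (value, datetime(2015, 8, 16, 1, minute, 40, tzinfo=TZ))
    for minute, value in enumerate(VALUES)
]
```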

Group records by time

I have a table containing a datetime column and some misc other columns. The datetime column represents an event happening. It can either contain a time (event happened at that time) or NULL (event didn't happen).
I now want to count the number of records happening in specific intervals (15 minutes), but do not know how to do that.
example:
id | time | foreign_key
1 | 2012-01-01 00:00:01 | 2
2 | 2012-01-01 00:02:01 | 4
3 | 2012-01-01 00:16:00 | 1
4 | 2012-01-01 00:17:00 | 9
5 | 2012-01-01 00:31:00 | 6
I now want to create a query that creates a result set similar to:
interval | COUNT(id)
2012-01-01 00:00:00 | 2
2012-01-01 00:15:00 | 2
2012-01-01 00:30:00 | 1
Is this possible in SQL or can anyone advise what other tools I could use? (e.g. exporting the data to a spreadsheet program would not be a problem)
Give this a try:
select datetime((strftime('%s', time) / 900) * 900, 'unixepoch') interval,
count(*) cnt
from t
group by interval
order by interval
Check the fiddle here.
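For anyone who wants to run the query above locally, here is a small sketch using Python's built-in sqlite3 module and the sample rows from the question. The function names and the alias bucket_start are my own, and the WHERE clause that skips NULL times is an assumption based on the question saying NULL means the event didn't happen.

```python
import sqlite3


def counts_per_quarter_hour(conn):
    """Count rows per 15-minute interval, rounding each time down to its bucket."""
    sql = """
        SELECT datetime((strftime('%s', time) / 900) * 900, 'unixepoch') AS bucket_start,
               COUNT(*) AS cnt
          FROM t
         WHERE time IS NOT NULL
         GROUP BY bucket_start
         ORDER BY bucket_start
    """
    return conn.execute(sql).fetchall()


def sample_connection():
    """Build an in-memory table holding the rows shown in the question."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, time TEXT, foreign_key INTEGER)")
    conn.executemany(
        "INSERT INTO t VALUES (?, ?, ?)",
        [
            (1, "2012-01-01 00:00:01", 2),
            (2, "2012-01-01 00:02:01", 4),
            (3, "2012-01-01 00:16:00", 1),
            (4, "2012-01-01 00:17:00", 9),
            (5, "2012-01-01 00:31:00", 6),
        ],
    )
    return conn
```

The 900 is just 15 minutes in seconds: integer division followed by multiplication rounds each epoch value down to the start of its interval.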
I have limited SQLite background (and no practice instance), but I'd try grabbing the minutes using
strftime( FORMAT, TIMESTRING, MOD, MOD, ...)
with the %M modifier (http://souptonuts.sourceforge.net/readme_sqlite_tutorial.html)
Then divide that by 15 and get the FLOOR of your quotient to figure out which quarter-hour you're in (e.g., 0, 1, 2, or 3)
cast(x as int)
Getting the floor value of a number in SQLite?
Strung together it might look something like:
select cast(strftime('%M', your_time_field) as int) / 15 from your_table
(strftime returns a string, so cast it to an integer before dividing by 15; SQLite integer division then truncates, which acts as the floor here)
Then group by the quarter-hour.
Sorry I don't have exact syntax for you, but that approach should enable you to get the functional groupings, after which you can massage the output to make it look how you want.