Select in specific days range even if there is no data [duplicate] - sql

I'm building a quick csv from a mysql table with a query like:
select DATE(date),count(date) from table group by DATE(date) order by date asc;
and just dumping them to a file in perl over a:
while(my($date,$sum) = $sth->fetchrow) {
print CSV "$date,$sum\n"
}
There are date gaps in the data, though:
| 2008-08-05 | 4 |
| 2008-08-07 | 23 |
I would like to pad the data to fill in the missing days with zero-count entries to end up with:
| 2008-08-05 | 4 |
| 2008-08-06 | 0 |
| 2008-08-07 | 23 |
I slapped together a really awkward (and almost certainly buggy) workaround with an array of days-per-month and some math, but there has to be something more straightforward either on the mysql or perl side.
Any genius ideas/slaps in the face for why I am being so dumb?
I ended up going with a stored procedure which generated a temp table for the date range in question for a couple of reasons:
I know the date range I'll be looking for every time
The server in question unfortunately was not one that I can install perl modules on atm, and the state of it was decrepit enough that it didn't have anything remotely Date::-y installed
The perl Date/DateTime-iterating answers were also very good, I wish I could select multiple answers!

When you need something like that on server side, you usually create a table which contains all possible dates between two points in time, and then left join this table with query results. Something like this:
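-- in the mysql client, run "DELIMITER //" first so the semicolons inside the body do not end the statement
create procedure sp1(d1 date, d2 date)
begin
declare d date;
create temporary table foo (d date not null);
set d = d1;
while d <= d2 do
insert into foo (d) values (d);
set d = date_add(d, interval 1 day);
end while;
-- count(t.date) counts only matched rows, so days without data come out as 0
select foo.d, count(t.date)
from foo left join `table` t on foo.d = date(t.date)
group by foo.d order by foo.d asc;
drop temporary table foo;
end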
In this particular case it would be better to put a little check on the client side: if the current date is not the previous date + 1, output some additional rows.

When I had to deal with this problem, to fill in missing dates I actually created a reference table that just contained all dates I'm interested in and joined the data table on the date field. It's crude, but it works.
SELECT DATE(r.date),count(d.date)
FROM dates AS r
LEFT JOIN table AS d ON d.date = r.date
GROUP BY DATE(r.date)
ORDER BY r.date ASC;
As for output, I'd just use SELECT INTO OUTFILE instead of generating the CSV by hand. Leaves us free from worrying about escaping special characters as well.
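The reference-table idea can be tried end to end with a small sketch. This one uses Python's built-in sqlite3 module as a stand-in for MySQL, so the table names (`dates`, `events`), the column names and the helper `counts_per_day` are all assumptions made for illustration, not part of the original answer.

```python
import sqlite3
from datetime import date, timedelta


def counts_per_day(events, first, last):
    """Return [(iso_day, count)] for every day in [first, last], zero-filled."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE dates (d TEXT PRIMARY KEY)")   # reference table
    con.execute("CREATE TABLE events (ts TEXT NOT NULL)")    # the real data
    span = (last - first).days
    con.executemany(
        "INSERT INTO dates VALUES (?)",
        [((first + timedelta(days=i)).isoformat(),) for i in range(span + 1)],
    )
    con.executemany("INSERT INTO events VALUES (?)", [(e,) for e in events])
    # COUNT(e.ts) ignores the NULLs produced by unmatched days, giving 0.
    rows = con.execute(
        "SELECT r.d, COUNT(e.ts) FROM dates AS r "
        "LEFT JOIN events AS e ON DATE(e.ts) = r.d "
        "GROUP BY r.d ORDER BY r.d"
    ).fetchall()
    con.close()
    return rows


sample = ["2008-08-05 10:00:00", "2008-08-05 11:00:00", "2008-08-07 09:30:00"]
for day, count in counts_per_day(sample, date(2008, 8, 5), date(2008, 8, 7)):
    print(f"{day},{count}")
```

The key detail is counting a column from the data table rather than `*`, so that a day with no matching rows counts as zero instead of one.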

Not dumb; inserting the empty date values isn't something that MySQL does. I do this in Perl with a two-step process. First, load all of the data from the query into a hash organised by date. Then, I create a Date::EzDate object and increment it by day, so...
my $current_date = Date::EzDate->new();
$current_date->{'default'} = '{YEAR}-{MONTH NUMBER BASE 1}-{DAY OF MONTH}';
while ($current_date <= $final_date)
{
my $count = $hash_o_data{"$current_date"} || 0; # days missing from the query count as zero
print "$current_date\t|\t$count\n"; # EzDate stringifies automatically in the format specified in 'default'
$current_date++;
}
where $final_date is another EzDate object or a string containing the end of your date range.
EzDate isn't on CPAN right now, but you can probably find another perl mod that will do date compares and provide a date incrementor.

You could use a DateTime object:
use DateTime;
my $dt;
while ( my ($date, $sum) = $sth->fetchrow ) {
if (defined $dt) {
print CSV $dt->ymd . ",0\n" while $dt->add(days => 1)->ymd lt $date;
}
else {
my ($y, $m, $d) = split /-/, $date;
$dt = DateTime->new(year => $y, month => $m, day => $d);
}
print CSV "$date,$sum\n";
}
The above code keeps the last printed date in a
DateTime object $dt. When the current date is more than one day
after it, the code increments $dt one day at a time (printing a
zero-count line to CSV for each day) until it reaches the current date.
This way you don't need extra tables, and don't need to fetch all your
rows in advance.

I hope you will figure out the rest.
select * from (
select date_add('2003-01-01 00:00:00.000', INTERVAL n5.num*10000+n4.num*1000+n3.num*100+n2.num*10+n1.num DAY ) as date from
(select 0 as num
union all select 1
union all select 2
union all select 3
union all select 4
union all select 5
union all select 6
union all select 7
union all select 8
union all select 9) n1,
(select 0 as num
union all select 1
union all select 2
union all select 3
union all select 4
union all select 5
union all select 6
union all select 7
union all select 8
union all select 9) n2,
(select 0 as num
union all select 1
union all select 2
union all select 3
union all select 4
union all select 5
union all select 6
union all select 7
union all select 8
union all select 9) n3,
(select 0 as num
union all select 1
union all select 2
union all select 3
union all select 4
union all select 5
union all select 6
union all select 7
union all select 8
union all select 9) n4,
(select 0 as num
union all select 1
union all select 2
union all select 3
union all select 4
union all select 5
union all select 6
union all select 7
union all select 8
union all select 9) n5
) a
where date >'2011-01-02 00:00:00.000' and date < NOW()
order by date
With
select n3.num*100+n2.num*10+n1.num as date
you will get a column with numbers from 0 to max(n3)*100+max(n2)*10+max(n1).
For example, if n3 only went up to 3, the SELECT would return the numbers 0 to 399, i.e. 400 records (dates in the calendar). The full query above uses five digit tables that each run from 0 to 9, so it generates the numbers 0 to 99999.
You can tune your dynamic calendar by limiting it, for example to the range from the min(date) in your data to now().
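To see why cross joining digit tables yields a gap-free run of numbers, here is a minimal Python sketch of the same arithmetic. The names `generated_numbers` and `calendar` are hypothetical helpers invented for this illustration; `itertools.product` plays the role of the cross join.

```python
from datetime import date, timedelta
from itertools import product

DIGITS = range(10)  # stands in for each "select 0 union all ... select 9"


def generated_numbers(places):
    """Cross join `places` digit tables and combine each row into one integer."""
    return sorted(
        sum(digit * 10 ** power for power, digit in enumerate(combo))
        for combo in product(DIGITS, repeat=places)
    )


def calendar(start, places, first, last):
    """Dates start + n for every generated n, kept when first < date < last."""
    days = (start + timedelta(days=n) for n in generated_numbers(places))
    return [d for d in days if first < d < last]


print(len(generated_numbers(3)))  # three digit tables cover 0..999
print(calendar(date(2003, 1, 1), 2, date(2003, 1, 2), date(2003, 1, 6)))
```

Each extra digit table multiplies the number of generated rows by ten, which is why limiting the range in the outer WHERE clause matters.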

Since you don't know where the gaps are, and yet you want all the values (presumably) from the first date in your list to the last one, do something like:
use DateTime;
use DateTime::Format::Strptime qw(strptime);
my @row = $sth->fetchrow;
my $countdate = strptime("%Y-%m-%d", $row[0]);
my $thisdate = strptime("%Y-%m-%d", $row[0]);
while (@row) {
# keep looping countdate until it hits the next db row date
if (DateTime->compare($countdate, $thisdate) == -1) {
# counter not reached next date yet
print CSV $countdate->ymd . ",0\n";
$countdate->add( days => 1 );
next;
}
# countdate is equal to next row's date, so print that instead
print CSV $thisdate->ymd . ",$row[1]\n";
# increase both; stop once the result set is exhausted
(@row = $sth->fetchrow) or last;
$thisdate = strptime("%Y-%m-%d", $row[0]);
$countdate->add( days => 1 );
}
Hmm, that turned out to be more complicated than I thought it would be.. I hope it makes sense!

I think the simplest general solution to the problem would be to create an Ordinal table with the highest number of rows that you need (in your case 31*3 = 93).
CREATE TABLE IF NOT EXISTS `Ordinal` (
`n` int(10) unsigned NOT NULL AUTO_INCREMENT, PRIMARY KEY (`n`)
);
INSERT INTO `Ordinal` (`n`)
VALUES (NULL), (NULL), (NULL); #etc
Next, do a LEFT JOIN from Ordinal onto your data. Here's a simple case, getting every day in the last week:
SELECT CURDATE() - INTERVAL `n` DAY AS `day`
FROM `Ordinal` WHERE `n` <= 7
ORDER BY `n` ASC
The two things you would need to change about this are the starting point and the interval. I have used SET @var = 'value' syntax for clarity.
SET @end = CURDATE() - INTERVAL DAY(CURDATE()) DAY;
SET @begin = @end - INTERVAL 3 MONTH;
SET @period = DATEDIFF(@end, @begin);
SELECT @begin + INTERVAL (`n` + 1) DAY AS `date`
FROM `Ordinal` WHERE `n` < @period
ORDER BY `n` ASC;
So the final code would look something like this, if you were joining to get the number of messages per day over the last three months:
SELECT COUNT(`msg`.`id`) AS `message_count`, `ord`.`date` FROM (
SELECT ((CURDATE() - INTERVAL DAY(CURDATE()) DAY) - INTERVAL 3 MONTH) + INTERVAL (`n` + 1) DAY AS `date`
FROM `Ordinal`
WHERE `n` < (DATEDIFF((CURDATE() - INTERVAL DAY(CURDATE()) DAY), ((CURDATE() - INTERVAL DAY(CURDATE()) DAY) - INTERVAL 3 MONTH)))
ORDER BY `n` ASC
) AS `ord`
LEFT JOIN `Message` AS `msg`
ON `ord`.`date` = `msg`.`date`
GROUP BY `ord`.`date`
Tips and Comments:
Probably the hardest part of your query was determining the number of days to use when limiting Ordinal. By comparison, transforming that integer sequence into dates was easy.
You can use Ordinal for all of your uninterrupted-sequence needs. Just make sure it contains more rows than your longest sequence.
You can use multiple queries on Ordinal for multiple sequences, for example listing every weekday (1-5) for the past seven (1-7) weeks.
You could make it faster by storing dates in your Ordinal table, but it would be less flexible. This way you only need one Ordinal table, no matter how many times you use it. Still, if the speed is worth it, try the INSERT INTO ... SELECT syntax.

Use a Perl module to do the date calculations, such as the recommended DateTime or Time::Piece (core from 5.10). Just increment the date, printing the date and 0, until the date matches the current row.
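The increment-and-print idea is independent of the language. As a rough sketch, here it is in Python rather than Perl (a deliberate substitution); `pad_missing_days` is a hypothetical helper name, and the input is assumed to be sorted by day, as the ORDER BY in the question's query provides.

```python
from datetime import date, timedelta


def pad_missing_days(rows):
    """Yield (day, count) for every day from the first row to the last row."""
    expected = None  # the next day we expect to see, once a row has been read
    for day, count in rows:
        # Emit zero-count rows until we catch up with the fetched day.
        while expected is not None and expected < day:
            yield expected, 0
            expected += timedelta(days=1)
        yield day, count
        expected = day + timedelta(days=1)


rows = [(date(2008, 8, 5), 4), (date(2008, 8, 7), 23)]
for day, count in pad_missing_days(rows):
    print(f"{day.isoformat()},{count}")
```

Because it is a generator, rows are padded as they are fetched, so nothing has to be loaded into memory up front.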

I don't know if this would work, but how about if you created a new table which contained all the possible dates (that might be the problem with this idea, if the range of dates is going to change unpredictably...) and then do a left join on the two tables? I guess it's a crazy solution if there are a vast number of possible dates, or no way to predict the first and last date, but if the range of dates is either fixed or easy to work out, then this might work.

Related

How to recursively calculate yearly rollover in SQL?

I need to calculate yearly rollover for a system that keeps track of when people have used days off.
The rollover calculation itself is simple: [TOTALDAYSALLOWED] - [USED]
Provided that number is not higher than [MAXROLLOVER] (and > 0)
Where this gets complicated is the [TOTALDAYSALLOWED] column, which is [NUMDAYSALLOWED] combined with the previous year's rollover to get the total number of days that can be used in a current year.
I've tried several different ways of getting this calculation, but all of them have failed to account for the previous year's rollover being a part of the current year's allowed days.
Creating columns for the LAG of days used, joining the data to itself but shifted back a year, etc. I'm not including examples of code I've tried because the approach was wrong in all of the attempts. That would just make this long post even longer.
Here's the data I'm working with:
Here's how it should look after the calculation:
This is a per-person calculation, so there's no need to consider any personal ID here. DAYTYPE only has one value currently, but I want to include it in the calculation in case another is added. The [HOW] column is only for clarity in this post.
Here's some code to generate the sample data (SQL Server or Azure SQL):
IF OBJECT_ID('tempdb..#COUNTS') IS NOT NULL DROP TABLE #COUNTS
CREATE TABLE #COUNTS (USED INT, DAYTYPE VARCHAR(20), THEYEAR INT)
INSERT INTO #COUNTS (USED, DAYTYPE, THEYEAR)
SELECT 1, 'X', 2019
UNION
SELECT 3, 'X', 2020
UNION
SELECT 0, 'X', 2021
IF OBJECT_ID('tempdb..#ALLOWANCES') IS NOT NULL DROP TABLE #ALLOWANCES
CREATE TABLE #ALLOWANCES (THEYEAR INT, DAYTYPE VARCHAR(20), NUMDAYSALLOWED INT, MAXROLLOVER INT)
INSERT INTO #ALLOWANCES (THEYEAR, DAYTYPE, NUMDAYSALLOWED, MAXROLLOVER)
SELECT 2019, 'X', 3, 3
UNION
SELECT 2020, 'X', 3, 3
UNION
SELECT 2021, 'X', 3, 3
SELECT C.*, A.NUMDAYSALLOWED, A.MAXROLLOVER
FROM #COUNTS C
JOIN #ALLOWANCES A ON C.DAYTYPE = A.DAYTYPE AND C.THEYEAR = A.THEYEAR
The tricky part is to limit the rollover amount. This is maybe possible with window functions, but I think this is easier to do with a recursive query:
with
data as (
select c.*, a.numdaysallowed, a.maxrollover,
row_number() over(partition by c.daytype order by c.theyear) rn
from #counts c
inner join #allowances a on a.theyear = c.theyear and a.daytype = c.daytype
),
cte as (
select d.*,
numdaysallowed as totaldaysallowed,
numdaysallowed - used as actualrollover
from data d
where rn = 1
union all
select d.*,
d.numdaysallowed + c.actualrollover,
case when d.numdaysallowed + c.actualrollover - d.used > d.maxrollover
then d.maxrollover
else d.numdaysallowed + c.actualrollover - d.used
end
from cte c
inner join data d on d.rn = c.rn + 1 and d.daytype = c.daytype
)
select * from cte order by theyear
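The recurrence that the recursive query walks through can be checked by hand with a short Python sketch. The function name `yearly_rollover` and the dictionary layout are assumptions for illustration. The sketch also clamps the rollover at zero, following the question's "(and > 0)" condition, which is an interpretation rather than something the query above does.

```python
def yearly_rollover(years):
    """years: list of dicts sorted by 'theyear', with keys
    'used', 'numdaysallowed' and 'maxrollover'."""
    carried = 0  # rollover brought in from the previous year
    result = []
    for y in years:
        total = y["numdaysallowed"] + carried
        # Unused days carry over, capped at maxrollover and never negative.
        carried = max(0, min(total - y["used"], y["maxrollover"]))
        result.append({**y, "totaldaysallowed": total, "actualrollover": carried})
    return result


sample = [
    {"theyear": 2019, "used": 1, "numdaysallowed": 3, "maxrollover": 3},
    {"theyear": 2020, "used": 3, "numdaysallowed": 3, "maxrollover": 3},
    {"theyear": 2021, "used": 0, "numdaysallowed": 3, "maxrollover": 3},
]
for row in yearly_rollover(sample):
    print(row["theyear"], row["totaldaysallowed"], row["actualrollover"])
```

Each year depends on the capped result of the year before, which is why a plain LAG over the raw columns is not enough.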

Adding missing date rows to a BigQuery Table

I have a table where one of the columns is an integer that represents the row's time. The problem is that the table isn't complete: some timestamps are missing.
I would like to fill in the missing values so that there is a row every 10 seconds. I want the rest of the columns to be nulls (later I'll forward-fill these nulls).
In this column's units, 10 seconds corresponds to a step of 10,000.
If this was python the range would be:
range(
min(table[column]),
max(table[column]),
10000
)
If your values are spaced exactly 10 seconds apart, and only some multiples of the 10-second interval are missing, you can use this approach to fill the holes in your data:
WITH minsmax AS (
SELECT
MIN(time) AS minval,
MAX(time) AS maxval
FROM `dataset.table`
)
SELECT
IF (d.time <= i.time, d.time, i.time) as time,
MAX(IF(d.time <= i.time, d.value, NULL)) as value
FROM (
SELECT time FROM minsmax m, UNNEST(GENERATE_ARRAY(m.minval, m.maxval+100, 100)) AS time
) AS i
LEFT JOIN `dataset.table` d ON 1=1
WHERE ABS(d.time - i.time) >= 100
GROUP BY 1
ORDER BY 1
Hope this helps.
You can use arrays. For numbers, you can do:
select n
from unnest(generate_array(1, 1000, 1)) n;
There are similar functions for generate_timestamp_array() and generate_date_array() if you really need those types.
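The shape of the result (a full grid of timestamps, left joined to the rows that exist) can be sketched in plain Python. `fill_time_grid` is a hypothetical helper, and the data is assumed to be a simple mapping from time to value. Note that `GENERATE_ARRAY` includes its end point while Python's `range` does not, hence the `+ 1`.

```python
def fill_time_grid(rows, step=10_000):
    """rows: dict mapping integer time -> value.

    Returns a list of (time, value) covering every `step` from the smallest
    to the largest time, with None where no row exists.
    """
    if not rows:
        return []
    lo, hi = min(rows), max(rows)
    # hi + 1 makes the upper bound inclusive, like GENERATE_ARRAY.
    return [(t, rows.get(t)) for t in range(lo, hi + 1, step)]


readings = {0: "a", 30_000: "b"}
for time, value in fill_time_grid(readings):
    print(time, value)
```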
I ended up using the following query through python API:
"""
SELECT
i.time,
Sensor_Reading,
Sensor_Name
FROM (
SELECT time FROM UNNEST(GENERATE_ARRAY({min_time}, {max_time}+{sampling_period}+1, {sampling_period})) AS time
) AS i
LEFT JOIN
`{input_table}` AS input
ON
i.time =input.Time
ORDER BY i.time
""".format(sampling_period=sampling_period, min_time=min_time,
max_time=max_time,
input_table=input_table)
Thanks to both answers

SQL Server iterating through time series data

I am using SQL Server and wondering if it is possible to iterate through time series data until specific condition is met and based on that label my data in other table?
For example, let's say I have a table like this:
Id Date Some_kind_of_event
+--+----------+------------------
1 |2018-01-01|dsdf...
1 |2018-01-06|sdfs...
1 |2018-01-29|fsdfs...
2 |2018-05-10|sdfs...
2 |2018-05-11|fgdf...
2 |2018-05-12|asda...
3 |2018-02-15|sgsd...
3 |2018-02-16|rgw...
3 |2018-02-17|sgs...
3 |2018-02-28|sgs...
What I want is to calculate, for each key, the difference between each pair of adjacent events and find out whether any two adjacent events are more than 10 days apart. If so, I want to stop iterating for that key and put the label 'inactive' in my other table; otherwise the label is 'active'. After we finish with one key, we start with another.
So for example id = 1 would get the label 'inactive' because it has two adjacent dates with a difference bigger than 10 days. The final result would look like this:
Id Label
+--+----------+
1 |inactive
2 |active
3 |inactive
Any ideas how to do that? Is it possible to do it with SQL?
When working with a DBMS you need to get away from the idea of thinking iteratively. Instead you need to try and think in sets. "Instead of thinking about what you want to do to a row, think about what you want to do to a column."
If I understand correctly, is this what you're after?
CREATE TABLE SomeEvent (ID int, EventDate date, EventName varchar(10));
INSERT INTO SomeEvent
VALUES (1,'20180101','dsdf...'),
(1,'20180106','sdfs...'),
(1,'20180129','fsdfs..'),
(2,'20180510','sdfs...'),
(2,'20180511','fgdf...'),
(2,'20180512','asda...'),
(3,'20180215','sgsd...'),
(3,'20180216','rgw....'),
(3,'20180217','sgs....'),
(3,'20180228','sgs....');
GO
WITH Gaps AS(
SELECT *,
DATEDIFF(DAY,LAG(EventDate) OVER (PARTITION BY ID ORDER BY EventDate),EventDate) AS EventGap
FROM SomeEvent)
SELECT ID,
CASE WHEN MAX(EventGap) > 10 THEN 'inactive' ELSE 'active' END AS Label
FROM Gaps
GROUP BY ID
ORDER BY ID;
GO
DROP TABLE SomeEvent;
GO
This assumes you are using SQL Server 2012+, as it uses the LAG function, and SQL Server 2008 has less than 12 months of any kind of support.
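For readers who want to check the expected labels independently of SQL, the same set-based logic can be sketched in Python. `label_ids` is a hypothetical helper; it groups events by id, computes the gaps between consecutive dates (what LAG does per partition) and labels an id inactive when any gap exceeds the threshold.

```python
from datetime import date
from itertools import groupby


def label_ids(events, max_gap_days=10):
    """events: iterable of (id, date). Returns {id: 'active' or 'inactive'}."""
    labels = {}
    ordered = sorted(events)  # by id, then by date
    for key, group in groupby(ordered, key=lambda e: e[0]):
        dates = [d for _, d in group]
        # Gap between each date and the one before it within the same id.
        gaps = [(b - a).days for a, b in zip(dates, dates[1:])]
        inactive = any(g > max_gap_days for g in gaps)
        labels[key] = "inactive" if inactive else "active"
    return labels


events = [
    (1, date(2018, 1, 1)), (1, date(2018, 1, 6)), (1, date(2018, 1, 29)),
    (2, date(2018, 5, 10)), (2, date(2018, 5, 11)), (2, date(2018, 5, 12)),
    (3, date(2018, 2, 15)), (3, date(2018, 2, 16)), (3, date(2018, 2, 17)),
    (3, date(2018, 2, 28)),
]
print(label_ids(events))
```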
Try this. Note, replace #MyTable with your actual table.
WITH Diffs AS (
SELECT
Id
,DATEDIFF(DAY,[Date],LEAD([Date]) OVER (PARTITION BY [Id] ORDER BY [Date])) Diff
FROM #MyTable)
SELECT
Id
,CASE WHEN MAX(Diff) > 10 THEN 'Inactive' ELSE 'Active' END
FROM Diffs
GROUP BY Id
Just to share another approach (without a CTE).
SELECT
ID
, CASE WHEN SUM(TotalDays) = (MAX(CNT) - 1) THEN 'Active' ELSE 'Inactive' END Label
FROM (
SELECT
ID
, EventDate
, CASE WHEN DATEDIFF(DAY, EventDate, LEAD(EventDate) OVER(PARTITION BY ID ORDER BY EventDate)) <= 10 THEN 1 ELSE 0 END TotalDays
, COUNT(ID) OVER(PARTITION BY ID) CNT
FROM EventsTable
) D
GROUP BY ID
The method counts how many records each ID has (CNT) and, for each record, looks at the difference in days between its date and the next date for the same ID: a difference of at most 10 days scores 1, anything else scores 0. The last record of each ID has no next date, so it always scores 0.
If the scores sum to the number of records minus one, every gap was within 10 days and the ID is labelled Active; otherwise it is Inactive.

Why does dbms_random.value return the same value in graph queries (connect by)?

On Oracle 11.2.0.4.0, when I run the following query then each row gets a different result:
select r.n from (
select trunc(dbms_random.value(1, 100)) n from dual
) r
connect by level < 100; -- returns random values
But as soon as I use the obtained random value in a join or subquery then each row gets the same value from dbms_random.value:
select r.n, (select r.n from dual) from (
select trunc(dbms_random.value(1, 100)) n from dual
) r
connect by level < 100; -- returns the same value each time
Is it possible to make the second query return random values for each row?
UPDATE
My example was maybe over-simplified, here's what I am trying to do:
with reservations(val) as (
select 1 from dual union all
select 3 from dual union all
select 4 from dual union all
select 5 from dual union all
select 8 from dual
)
select * from (
select rnd.val, CONNECT_BY_ISLEAF leaf from (
select trunc(dbms_random.value(1, 10)) val from dual
) rnd
left outer join reservations res on res.val = rnd.val
connect by res.val is not null
)
where leaf = 1;
But with reservations which can go from 1 to 1.000.000.000 (and more).
Sometimes that query returns correctly (if it immediately picked a random value for which there was no reservation), and sometimes it gives an out-of-memory error because it keeps retrying with the same value of dbms_random.value.
Your comment "...and I want to avoid concurrency problems" made me think.
Why don't you just try to insert a random number, watch out for duplicate violations, and retry until successful? Even a very clever solution that looks up available numbers might come up with identical new numbers in two separate sessions. So, only an inserted and committed reservation number is safe.
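A minimal sketch of the insert-and-retry loop, using Python's built-in sqlite3 module in place of Oracle. The `reservations` table, its `val` primary key and the helper `reserve_random_id` are assumptions for illustration; the point is that the unique constraint, not a prior lookup, decides whether a number is free.

```python
import random
import sqlite3


def reserve_random_id(con, low, high, max_tries=100, rng=random):
    """Insert a random value in [low, high]; retry on duplicates.

    Returns the reserved value, or None if every attempt collided.
    """
    for _ in range(max_tries):
        candidate = rng.randint(low, high)
        try:
            with con:  # commits on success, rolls back on error
                con.execute(
                    "INSERT INTO reservations (val) VALUES (?)", (candidate,)
                )
            return candidate
        except sqlite3.IntegrityError:
            continue  # value already taken, possibly by another session
    return None


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE reservations (val INTEGER PRIMARY KEY)")
con.executemany("INSERT INTO reservations VALUES (?)", [(1,), (3,), (4,), (5,), (8,)])
con.commit()
print(reserve_random_id(con, 1, 9, max_tries=1000))
```

This works well when the range is sparsely used; when most values are taken, the expected number of retries grows and one of the gap-search approaches becomes more attractive.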
You can move the connect-by clause inside the subquery:
select r.n, (select r.n from dual) from (
select trunc(dbms_random.value(1, 100)) n from dual
connect by level < 100
) r;
N (SELECTR.NFROMDUAL)
---------- -------------------
90 90
69 69
15 15
53 53
8 8
3 3
...
what I try to do is generate a sequence of random numbers and find the first one for which I don't have a record in some table
You could potentially do something like:
select r.n
from (
select trunc(dbms_random.value(1, 100)) n from dual
connect by level < 100
) r
where not exists (
select id from your_table where id = r.n
)
and rownum = 1;
but it will generate all 100 random values before checking any of them, which is a bit wasteful. You also might not find a gap in those 100, and there may be duplicates among them. So you either need a much larger range, which is also expensive, though it doesn't need as many random calls:
select min(r.n) over (order by dbms_random.value) as n
from (
select level n from dual
connect by level < 100 -- or entire range of possible values
) r
where not exists (
select id from your_table where id = r.n
)
and rownum = 1;
Or repeat a single check until a match is found.
Another approach is to have a look-up table of all possible IDs with a column indicating if they are used or free, maybe with a bitmap index; and then use that to find the first (or any random) free value. But then you have to maintain that table too, and update atomically as you use and release the IDs in your main table, which means making things more complicated and serialising access - though you probably can't avoid that anyway really if you don't want to use a sequence. You could probably use a materialised view to simplify things.
And if you have a relatively small number of gaps (and you really want to reuse those) then you could possibly only search for a gap within the assigned range and then fall back to a sequence if there are no gaps. Say you only have values in the range 1 to 1000 currently used, with a few missing; you could look for a free value in that 1-1000 range, and if there are none then use a sequence to get 1001 instead, rather than always including your entire possible range of values in your gap search. That would also fill in gaps in preference to extending the used range, which may or may not be useful. (I'm not sure if "I don't need those numbers to be consecutive" means they should not be consecutive, or that it doesn't matter).
Unless you particularly have a business need to fill in the gaps and for the assigned values to not be consecutive, though, I'd just use a sequence and ignore the gaps.
I managed to obtain a correct result with the following query but I am not sure if this approach is really advisable:
with
reservations(val) as (
select 1 from dual union all
select 3 from dual union all
select 4 from dual union all
select 5 from dual union all
select 8 from dual
),
rand(v) as (
select trunc(dbms_random.value(1, 10)) from dual
),
next_res(v, ok) as (
select v, case when exists (select 1 from reservations r where r.val = rand.v) then 0 else 1 end from rand
),
recursive(i, v, ok) AS (
select 0, 0, 0 from dual
union all
select i + 1, next_res.v, next_res.ok from recursive, next_res where i < 100 /*maxtries*/ and recursive.ok = 0
)
select v from recursive where ok = 1;

Create a Timeline Query with date periods in SQL ACCESS

I would like to write a Query in SQL Access (no choice with the software...) in order to create a timeline from a Table of records.
Hereunder a simple example in order to explain what I want to do.
I have a table with 2 records. Each record is defined by its ID, a Value, a date of begin, a date of end and a Level of confidence.
Id | vValue | dtBegin | dtEnd | lLevel
-------------------------------------------
1 | a |20/06/2016|28/06/2016| Low
2 | b |23/06/2016|25/06/2016| High
The query should return a timeline with the highest level of information (and its value) available for each period of time. In the example the result of the query should be:
vValue| dtBegin | dtEnd |lLevel
------------------------------------
a |20/06/2016|23/06/2016| Low
b |23/06/2016|25/06/2016| High
a |25/06/2016|28/06/2016| Low
From the 20/06/2016 to the 23/06/2016 the highest level of confidence available in the table is 'Low' and the associated value is 'a'.
From the 23/06 to the 25/06, the period is covered by two records but the highest level of confidence available is 'High', therefore the value is 'b'
Thanks a lot for your help
Thibaud
There may be an easier way of doing it, but you could start with a base query to get all the necessary dates:
SELECT dtBegin
FROM Table1
GROUP BY dtBegin
UNION
SELECT dtEnd AS dtBegin
FROM Table1
With this query you can get the non-overlapping date ranges:
SELECT Timeline.dtBegin, Min(Timeline_1.dtBegin) AS dtEnd
FROM Query1 AS Timeline, Query1 AS Timeline_1
WHERE (((Timeline_1.dtBegin) > Timeline.dtBegin))
GROUP BY Timeline.dtBegin
From this you could cross join back to your original table, and get the max level and vValue for any table entry that overlaps the date range using a subquery:
SELECT Query2.dtBegin, Query2.dtEnd, Table1.lLevel, Table1.vValue
FROM Query2, Table1
WHERE (((Table1.lLevel) =
(SELECT MAX(lLevel)
FROM Table1 T
WHERE T.dtBegin <=Query2.dtBegin AND
T.dtEnd >=Query2.dtEnd))) AND
((Table1.dtBegin)<=Query2.dtBegin) AND
((Table1.dtEnd)>=Query2.dtEnd)
You'll have to do some extra logic to get a max level, of course, depending on how your choices are configured.
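The three queries can be traced with a small Python sketch: collect every boundary date, pair consecutive boundaries into segments, then pick the highest-ranked record covering each segment. The `RANK` mapping is an assumed ordering of the confidence levels (this is the "extra logic" mentioned above), and `timeline` is a hypothetical helper name.

```python
from datetime import date

RANK = {"Low": 1, "High": 2}  # assumed ordering of confidence levels


def timeline(records):
    """records: list of (value, begin, end, level).

    Returns (value, begin, end, level) for each non-overlapping segment,
    using the highest-level record that covers the whole segment.
    """
    # Step 1: every begin or end date is a segment boundary.
    bounds = sorted({d for _, begin, end, _ in records for d in (begin, end)})
    out = []
    # Step 2: consecutive boundaries form the non-overlapping date ranges.
    for seg_begin, seg_end in zip(bounds, bounds[1:]):
        # Step 3: keep records spanning the segment and take the best level.
        covering = [r for r in records if r[1] <= seg_begin and r[2] >= seg_end]
        if covering:
            value, _, _, level = max(covering, key=lambda r: RANK[r[3]])
            out.append((value, seg_begin, seg_end, level))
    return out


records = [
    ("a", date(2016, 6, 20), date(2016, 6, 28), "Low"),
    ("b", date(2016, 6, 23), date(2016, 6, 25), "High"),
]
for row in timeline(records):
    print(row)
```

The sketch does not merge adjacent segments that end up with the same value and level; that would be a further step if the timeline needs to be as compact as possible.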
It's hard to help without seeing the data structure. If the begin / end dates don't overlap in your data this could possibly work:
/* Finds distinct dates which only have a low record */
SELECT
vValue,dtBegin,dtEnd,'Low' AS lLevel
FROM
(
SELECT DISTINCT vValue,dtBegin,dtEnd
FROM tbl
WHERE lLevel = "Low"
EXCEPT
SELECT DISTINCT vValue,dtBegin,dtEnd
FROM tbl
WHERE lLevel = "High"
) AS X
/* Appends distinct date sets */
UNION
SELECT DISTINCT vValue,dtBegin,dtEnd,lLevel
FROM tbl
WHERE lLevel = "High"