SQL LIKE function broken? Or limited?

I am trying to use the LIKE function to get data with similar names. Everything looks fine but the data I get in return is missing some values when I get back more than ~20 rows of data.
I have a very basic query. I just want data that starts with Lab, ideally for the whole day, or at least 12 hours. The code below misses some data and I cannot discern a pattern for what it picks to skip.
SELECT History.TagName, DateTime, Value FROM History
WHERE History.TagName like ('Lab%')
AND Quality = 0
AND wwRetrievalMode = 'Full'
AND DateTime >= '20150811 6:00'
AND DateTime <= '20150811 18:00'
To give you an idea of the data I am pulling, I have Lab.Raw.NTU, Lab.Raw.Alk, Lab.Sett.NTU, etc. Most of the data should have values at 6am/pm, 10am/pm, and 2am/pm. Some have more, a few have less; not important. When I change the query to be more specific (i.e. only a 1 hour window, or LIKE "Lab.Raw.NTU") I get all of my data. Currently, this will spit out data for all tags and I get both 6am data and 6pm data, but certain values will be missing, such as Lab.Raw.NTU at 6pm. Other data seems to be missing if I change the window to the previous day or the night shift, so I don't think it has to do with the data itself. Something weird is going on with the LIKE function but I have no idea what.
Is there another way to get the tagnames that I want besides like? Such as Tagname > Lab and Tagname <= Labz? (that gives me an error, so I am thinking not)
Please help.

It appears that you are using the LIKE operator correctly, so that may be a red herring. Check the data type of the DateTime field. If it is character-based, such as varchar, you are doing string comparisons instead of date comparisons, which could cause unexpected results. Try an explicit cast to ensure the values are compared as dates:
DateTime >= convert(datetime, '20150811 6:00')
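For example, a minimal sketch of the full query with both bounds cast explicitly (table and column names are taken from the question; adjust as needed):
SELECT History.TagName, DateTime, Value
FROM History
WHERE History.TagName LIKE 'Lab%'
AND Quality = 0
AND wwRetrievalMode = 'Full'
-- explicit casts so both bounds are compared as datetime, not as strings
AND DateTime >= convert(datetime, '20150811 6:00')
AND DateTime <= convert(datetime, '20150811 18:00')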

Related

How to set a max range condition with timescale time_bucket_gapfill() in order to not fill real missing values?

I'd like some advice on whether what I need to do is achievable with TimescaleDB functions.
I've just found out I can use time_bucket_gapfill() to complete missing data, which is amazing! I need data every 5 minutes, but I can receive 10-minute, 30-minute or 1-hour data. So the function helps me fill in the missing points so that I have only 5-minute points. Also, I use locf() to set the gap-filled value to the last value found.
My question is: can I set a max range when I set the last value found with locf(), so that it never goes back more than 1 hour?
Example: if the last value found is more than 1 hour old, I don't want to fill the gaps; I need to leave them empty to show that we have real missing values there.
I think I'm close to something with this, but apparently I'm not allowed to use locf() more than once in the same CASE expression.
ERROR: multiple interpolate/locf function calls per resultset column not supported
Does somebody have an idea how I can resolve that?
How to reproduce:
Create table powers
CREATE table powers (
delivery_point_id BIGINT NOT NULL,
at timestamp NOT NULL,
value BIGINT NOT NULL
);
Create hypertable
SELECT create_hypertable('powers', 'at');
Create indexes
CREATE UNIQUE INDEX idx_dpid_at ON powers(delivery_point_id, at);
CREATE INDEX index_at ON powers(at);
Insert data for one day, one delivery point, one point every 10 minutes
INSERT INTO powers SELECT 1, at, round(random()*10000) FROM generate_series(TIMESTAMP '2021-01-01 00:00:00', TIMESTAMP '2021-01-02 00:00:00', INTERVAL '10 minutes') AS at;
Remove three hours of data from 4am to 7am
DELETE FROM powers WHERE delivery_point_id = 1 AND at < '2021-01-01 07:00:00' AND at > '2021-01-01 04:00:00';
The query that needs to be fixed
SELECT
time_bucket_gapfill('5 minutes', at) AS point_five,
avg(value) AS avg,
CASE
WHEN (locf(at) - at) > interval '1 hour' THEN null
ELSE locf(avg(value))
END AS gapfilled
FROM powers
GROUP BY point_five, at
ORDER BY point_five;
Actual: ERROR: multiple interpolate/locf function calls per resultset column not supported
Expected: Gapfilled values each 5 minutes except between 4am and 7 am (real missing values).
This is a great question! I'm going to provide a workaround for how to do this with the current stuff, but I think it'd be great if you'd open a Github issue as well, because there might be a way to add an option for this that doesn't require a workaround like this.
I also think your attempt was a good approach and just requires a few tweaks to get it right!
The error that you're seeing means we can't have multiple locf calls per result column. That limitation is pretty easy to work around by shifting both calls into a subquery, but that alone isn't enough. The other thing we need to change is that locf only works on aggregates; right now you're trying to use it on a column (at) that isn't aggregated, which isn't going to work, because it wouldn't know which of the values of at in a time bucket to "pull forward" for the gapfill.
Now, you said you want to fill data as long as the previous point wasn't more than one hour ago, so we can take the last value of at in the bucket using last(at, at); this is also the max(at), so either of those aggregates would work. We put that into a CTE (common table expression, or WITH query) and then do the CASE statement outside, like so:
WITH filled as (SELECT
time_bucket_gapfill('5 minutes', at) AS point_five,
avg(value) AS avg,
locf(last(at, at)) as filled_from,
locf(avg(value)) as filled_avg
FROM powers
WHERE at BETWEEN '2021-01-01 01:30:00' AND '2021-01-01 08:30:00'
AND delivery_point_id = 1
GROUP BY point_five
ORDER BY point_five)
SELECT point_five,
avg,
filled_from,
CASE WHEN point_five - filled_from > '1 hour'::interval THEN NULL
ELSE filled_avg
END as gapfilled
FROM filled;
Note that I’ve tried to name my CTE expressively so that it’s a little easier to read!
Also, I wanted to point out a couple other hyperfunctions that you might think about using:
heartbeat_agg is a new/experimental one that will help you determine periods when your system is up or down, so if you're expecting points at least every hour, you can use it to find the periods where the delivery point was down or the like.
When you have more irregular sampling or want to deal with different data frequencies from different delivery points, I'd take a look at the time_weight family of functions. They can be more efficient than using something like gapfill to upsample, because they let you treat all the different sample rates similarly without having to create more points and more work. Even if you want to, for instance, compare sums of values, you'd use something like integral to get the time-weighted sum over a period based on the locf interpolation.
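For instance, a rough sketch of a time-weighted average per delivery point with the toolkit's time_weight aggregate (this assumes the timescaledb_toolkit extension is installed; the exact function names may differ between versions):
-- Time-weighted average over one day, weighting samples with LOCF between points
SELECT delivery_point_id,
       average(time_weight('LOCF', at, value)) AS time_weighted_avg
FROM powers
WHERE at >= '2021-01-01 00:00:00' AND at < '2021-01-02 00:00:00'
GROUP BY delivery_point_id;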
Anyway, hope all that is helpful!

Why does using CONVERT(DATETIME, [date], [format]) in WHERE clause take so long?

I'm running the following code on a dataset of 100M to test some things out before I eventually join the entire range (not just the top 10) on another table to make it even smaller.
SELECT TOP 10 *
FROM Table
WHERE CONVERT(datetime, DATE, 112) BETWEEN '2020-07-04 00:00:00' AND '2020-07-04 23:59:59'
The table isn't mine but a client's, so unfortunately I'm not responsible for the data types of the columns. The DATE column, along with the rest of the data, is in varchar. As for the dates in the BETWEEN clause, I just put in a relatively small range for testing.
I have heard that CONVERT shouldn't be in the WHERE clause, but I need to convert it to dates in order to filter. What is the proper way of going about this?
Going to summarise my comments here, as they are "second class citizens" and thus could be removed.
Firstly, the reason your query is slow is because of the CONVERT on the column DATE in your WHERE. Applying functions to a column in your WHERE will almost always make your query non-SARGable (there are some exceptions, but that doesn't make them a good idea). As a result, the entire table must be scanned to find rows that are applicable for your WHERE; it can't use an index to help it.
The real problem, therefore, is that you are storing a date (and time) value in your table as a non-date (and time) datatype; presumably a (n)varchar. This is, in truth, a major design flaw and needs to be fixed. String type values aren't validated to be valid dates, so someone could easily insert the "date" '20210229' or even '20211332'. Fixing the design not only stops this, but also makes your data smaller (a date is 3 bytes in size, a varchar(8) would be 10 bytes), and you could pass strongly typed date and time values to your query and it would be SARGable.
"Fortunately" it appears your data is in the style code 112, which is yyyyMMdd; this at least means that the ordering of the dates is the same as if it were a strongly typed date (and time) data type. This means that the below query will work and return the results you want:
SELECT TOP 10 * --Ideally don't use * and list your columns properly
FROM dbo.[Table]
WHERE [DATE] >= '20210704' AND [DATE] < '20210705'
ORDER BY {Some Column};
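If the column's data type can't be changed right away, one common workaround, sketched here with made-up column and index names, is to add a strongly typed computed column, index it, and filter on that instead:
ALTER TABLE dbo.[Table]
ADD DateTyped AS TRY_CONVERT(date, [DATE], 112) PERSISTED; -- hypothetical computed column
CREATE INDEX IX_Table_DateTyped ON dbo.[Table] (DateTyped);

SELECT TOP 10 *
FROM dbo.[Table]
WHERE DateTyped >= '20200704' AND DateTyped < '20200705' -- SARGable against the new index
ORDER BY {Some Column};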
You can use something like this to get better performance:
SELECT TOP 10 *
FROM Table
WHERE cast(DATE as date) BETWEEN '2020-07-04' AND '2020-07-04' and cast(DATE as time) BETWEEN '00:00:00' AND '23:59:59'
There is no need to include the time portion if you want to search the full day.

How To Compare two dates, SQL/SSIS Task

So I've got a task that takes a random 20% of a table's results from the previous day to use as a control group. These results are put into a table, and then shoved into a .CSV file for use by the employer.
That works perfectly well. The only problem is, it's in a group of tasks that are often tested, which means that when the task gets repeated, more random data gets dumped into the file - meaning manual deletion of rows. I'm looking for a fix.
Because the process is run once a day, a unique key is the TransactionDateID, formatted INT (20150603). I need to check against that column to make sure that nothing has been run on that same day. The problem is exacerbated because it involves yesterday's records.
For example: in order to check today's date to see if it has been run, getDate() would be used to get today's date, then converted to INT (20150604). But I can't simply check to see if there is a numerical difference of 1, because once the month switches, a simple +1 will throw the entire thing out of whack:
(20150630) + 1 =/= (20150701)
I'm just wondering if this is going to be casting/converting back and forth because of the difference in variable types, or if there's something I can do with a BIT to add a column if the task has been completed for the day, something along those lines.
A colleague suggested using MAX(TransactionDateID) and then checking getDate() against that column.
Unfortunately, I run into a problem the following day:
Initial task run at 2015-06-04-09:30:ss:mm
2015-06-04-11:45:ss:mm etc.. > 2015-06-04-09:30:ss:mm, DO NOT RUN
2015-06-05-09:30:ss:mm etc.. > 2015-06-04-09:30:ss:mm, I want it to run ...
To convert your date to a formatted int, try this:
DECLARE @today date = getdate();
SELECT year(@today) * 10000 + month(@today) * 100 + day(@today);
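Building on that, a rough sketch of how the guard could look when checking whether yesterday's key has already been processed (ControlGroup is a made-up name for whatever table holds the 20% sample):
DECLARE @yesterday date = DATEADD(day, -1, CAST(getdate() AS date));
DECLARE @yesterdayKey int = year(@yesterday) * 10000 + month(@yesterday) * 100 + day(@yesterday);

IF NOT EXISTS (SELECT 1 FROM ControlGroup WHERE TransactionDateID = @yesterdayKey)
BEGIN
    -- safe to run: yesterday's sample has not been generated yet
    PRINT 'Run the 20% sampling task';
END;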

Reduce 'X' hours from a column in Oracle

I have a table 'ABC' with a column TIME_THU which has values like 04:00, 04:02, 04:03. The column is VARCHAR2 and I need to reduce the values to 00:00, 00:02, 00:03, i.e. shift them back by 4 hours.
I need to use an UPDATE query to make this change.
Something like,
Update ABC set TIME_THU = TIME_THU -4;
Thanks
Given your data, the best option seems to be to convert to a date, subtract the hours, then convert back.
Update ABC set TIME_THU = to_char(to_date(TIME_THU,'hh24:mi')-(4/24),'hh24:mi')
The caveat here is that if any of your times are less than 04:00, you may get incorrect results. I don't know enough of your requirements to know what the correct result would be in that scenario, so I can't solve it.
You may also want to investigate the interval data type, which is better suited to storing times disconnected from a date.
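As a rough sketch of that idea (the new column name is made up, and it assumes every TIME_THU value is a valid hh24:mi string):
-- Add an interval column and populate it from the existing text values
ALTER TABLE ABC ADD (TIME_THU_IV INTERVAL DAY TO SECOND);
UPDATE ABC SET TIME_THU_IV = TO_DSINTERVAL('0 ' || TIME_THU || ':00');

-- Shifting by 4 hours then becomes plain interval arithmetic
UPDATE ABC SET TIME_THU_IV = TIME_THU_IV - INTERVAL '4' HOUR;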

Using BETWEEN clause

Whenever you write a query where you need to filter rows on a range of values, should you use the BETWEEN clause or <= and >=?
Which one is better for performance?
Neither. They create exactly the same execution plan.
When I use one or the other depends not on performance, but on the data.
If the data are Discrete Values, then I use BETWEEN...
x BETWEEN 0 AND 9
But if the data are Continuous Values, then that doesn't work so well...
x BETWEEN 0.000 AND 9.999999999999999999
Instead, I use >= AND <...
x >= 0 AND x < 10
Interestingly, however, the >= AND < technique actually works for both Continuous and Discrete data types. So, in general, I rarely use BETWEEN at all.
Also, don't use BETWEEN for date/time range queries.
What does the following really mean?
BETWEEN '20120201' AND '20120229'
Some people think that means get me all the data from February, including all of the data anytime on February 29th. The above gets translated to:
BETWEEN '20120201 00:00:00.000' AND '20120229 00:00:00.000'
So if there is data on the 29th any time after midnight, your report is going to be incomplete.
People also try to be clever and pick the "end" of the day:
BETWEEN '20120201 00:00:00.000' AND '20120229 23:59:59.997'
That works if the data type is datetime. If it is smalldatetime the end of the range gets rounded up, and you may include data from the next day that you didn't mean to. If it's datetime2 you might actually miss a small portion of data that happened in the last 2+ milliseconds of the day. In most cases statistically irrelevant, but if the query is wrong, the query is wrong.
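A quick way to see that rounding behaviour (a throwaway illustration, not part of the original answer):
-- smalldatetime rounds 23:59:59.997 up to the next minute, i.e. into March 1st,
-- while datetime keeps it on February 29th
SELECT CAST('20120229 23:59:59.997' AS smalldatetime) AS smalldatetime_end,
       CAST('20120229 23:59:59.997' AS datetime) AS datetime_end;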
So for date range queries I always strongly recommend using an open-ended range, e.g. to report on the month of February the WHERE clause would say "on or after February 1st, and before March 1st" as follows:
WHERE date_col >= '20120201' AND date_col < '20120301'
BETWEEN can work as expected using the date type only, but I still prefer an open-ended range in queries because later someone may change that underlying data type to allow it to include time.
I blogged a lot more details here:
What do BETWEEN and the devil have in common?