SQL find non-null columns

I have a table of time-series data in which I need to find all columns that contain at least one non-null value within a given time period. So far I am using the following query:
select max(field1),max(field2),max(field3),...
from series where t_stamp between x and y
Afterwards I check each field of the result to see if it contains a non-null value.
The table has around 70 columns and a time period can contain >100k entries.
I wonder if there is a faster way to do this (using only standard SQL).
EDIT:
Unfortunately, refactoring the table design is not an option for me.

The EXISTS operation may be faster since it can stop searching as soon as it finds any row that matches the criteria (vs. the MAX which you are using). It depends on your data and how smart your SQL server is. If most of your columns have a high rate of non-null data then this method will find rows quickly and it should run quickly. If your columns are mostly NULL values then your method may be faster. I would give them both a shot and see how they are each optimized and how they run. Also keep in mind that performance may change over time if the distribution of your data changes significantly.
Also, I've only tested this on MS SQL Server. I haven't had to code strict ANSI compatible SQL in over a year, so I'm not sure that this is completely generic.
SELECT
CASE WHEN EXISTS (SELECT * FROM Series WHERE t_stamp BETWEEN @x AND @y AND field1 IS NOT NULL) THEN 1 ELSE 0 END AS field1,
CASE WHEN EXISTS (SELECT * FROM Series WHERE t_stamp BETWEEN @x AND @y AND field2 IS NOT NULL) THEN 1 ELSE 0 END AS field2,
...
EDIT: Just to clarify, the MAX method might be faster since it could determine those values with a single pass through the data. Theoretically, the method here could as well, and potentially with less than a full pass, but your optimizer may not recognize that all of the subqueries are related, so it might do separate passes for each. That still might be faster, but as I said it depends on your data.
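If the optimizer does end up making a separate pass per subquery, a conditional-aggregation variant is a possible middle ground: it returns the same 1/0 flags but is guaranteed to read the time range only once. This is only a sketch against the columns from the question (extend it for the remaining columns); unlike EXISTS it cannot stop early, and @x/@y stand for the period bounds:
SELECT
  MAX(CASE WHEN field1 IS NOT NULL THEN 1 ELSE 0 END) AS field1,
  MAX(CASE WHEN field2 IS NOT NULL THEN 1 ELSE 0 END) AS field2
FROM Series
WHERE t_stamp BETWEEN @x AND @y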

It would be faster with a different table design:
create table series (fieldno integer, t_stamp date);
select distinct fieldno from series where t_stamp between x and y;
Having a table with 70 "similar" fields is not generally a good idea.
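If restructuring ever becomes possible, a rough sketch of migrating the existing wide rows into that layout could look like the following; series_narrow is a hypothetical name for the new table, and only the first three columns are shown (the same pattern repeats for the rest):
insert into series_narrow (fieldno, t_stamp)
select 1, t_stamp from series where field1 is not null
union all
select 2, t_stamp from series where field2 is not null
union all
select 3, t_stamp from series where field3 is not null;
The select distinct fieldno query above then answers the original question directly.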

When you say "a faster way to do this", if you mean a faster way for the query to run, then yes, here's how to do it: break it out into one query per column:
select top 1 field1 from series where t_stamp between x and y and field1 is not null
select top 1 field2 from series where t_stamp between x and y and field2 is not null
select top 1 field3 from series where t_stamp between x and y and field3 is not null
This way, you won't be doing a table scan across the entire table to find the maximum value. Instead, the database engine will stop as soon as it finds a non-null value. Assuming your data isn't 99% nulls, this should give you faster execution - but at the expense of more programming time to set this up.

How about this... You query for a list of field names that you can iterate through.
select 'field1' as fieldname from series
where field1 is not null and t_stamp between x and y
UNION
select 'field2' from series where field2 is not null and t_stamp between x and y
... etc
Then you have a recordset that will only contain the names of the fields that have non-null data. You can loop over this recordset to build your real query as dynamic SQL and ignore fields that don't have any data. The "select 'field2'" branch will not return a row when nothing matches its where clause.
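If you happen to be on SQL Server, a rough sketch of that dynamic-SQL step might look like this; #non_null_fields is a hypothetical temp table holding the fieldname values returned by the UNION query above, and STRING_AGG needs SQL Server 2017+ (use a loop or FOR XML PATH on older versions):
-- sample period bounds for illustration only
DECLARE @x datetime = '2009-01-01', @y datetime = '2009-02-01';
DECLARE @cols nvarchar(max), @sql nvarchar(max);
-- build the column list from the field names that actually have data
SELECT @cols = STRING_AGG(fieldname, ', ') FROM #non_null_fields;
SET @sql = N'SELECT t_stamp, ' + @cols +
           N' FROM series WHERE t_stamp BETWEEN @x AND @y';
EXEC sp_executesql @sql, N'@x datetime, @y datetime', @x = @x, @y = @y;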

Edit: I think I misread the question... this will give you all the rows with a non-null value. I'll leave it here in case it helps someone but it's not the answer to your question. Thanks @Pax
I think you want to use COALESCE:
SELECT ... WHERE COALESCE(field1, field2, field3) IS NOT NULL

For a start, this is a very bad idea with standard SQL since not all DBMSs sort with NULLs last.
There are all sorts of tricky ways you could do this and most would be interminably slow.
I'd suggest you (sort of) normalize the database some more so that each of the columns is in a separate table, which would make the select easier, but that's probably not what you want.
After the edit to the question: if refactoring the table design is not an option, your given solution is probably the best, especially if you have indexes on all 70 columns.
Since the indexes are likely to slow down inserts quite a bit, you may want to use a non-indexed table for maximum insert speed and periodically (overnight?) transfer the data to an indexed table, which would run your selects at best speed (by avoiding a full table scan).
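A minimal sketch of that periodic transfer, assuming a hypothetical unindexed staging table series_staging with the same columns as series (the column list is abbreviated here):
begin transaction;
insert into series (t_stamp, field1, field2, field3)
select t_stamp, field1, field2, field3
from series_staging;
delete from series_staging;
commit;
The daytime inserts then hit only the cheap staging table, and the indexed series table pays the indexing cost once, during the overnight load.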

select count(field1),count(field2),count(field3),...
from series where t_stamp between x and y
will tell you how many non-null values are in each column. Unfortunately, it's not much better than the way you're doing it now.

Try this:
SELECT CASE WHEN field1 IS NOT NULL THEN '' ELSE 'contains null' END AS field1_stat,
CASE WHEN field2 IS NOT NULL THEN '' ELSE 'contains null' END AS field2_stat,
... for every field to be checked
FROM series
WHERE t_stamp BETWEEN x AND y
GROUP BY CASE WHEN field1 IS NOT NULL THEN '' ELSE 'contains null' END,
CASE WHEN field2 IS NOT NULL THEN '' ELSE 'contains null' END
... etc
This will give you a summary of the combinations of 'nulled' fields in the table.

Related

Select ' ' from TableA

What exactly does
select '' from TableA
do?
When I run it on a given table I get back one blank record for every row in the table, with the header '(No column name)' because no alias was used.
I have seen this query used as a subquery in 'not exists' statements.
At what times would this query be useful and is it a good practice to query this way?
For instance, when I first saw it I thought it would return one blank row, but in fact it returns all rows in the table, and they are blank.
I've looked around and haven't found an answer for this.
Thank you
When checking whether something exists in a table, it is common to select an arbitrary value rather than an actual column, because it has an effect on the execution plan (if you select a real column, the execution plan takes that column into account and it can take a little longer, even though you don't use the column).
Most commonly, I have seen 1 used:
IF EXISTS (SELECT 1 FROM MyTable WHERE SomeColumn > 10)
If you just care whether there is any row, you can short-circuit the query rather than getting all rows... although I suspect the EXISTS statement would stop as soon as any row was found anyway.
IF EXISTS (SELECT TOP 1 '' FROM TableA)
You would use this syntax if you want to add a static value as part of your query, for whatever reason.
E.g.:
SELECT
'SELECT TOP 10 * FROM '+name
from sys.objects
where type = 'U'
This will automatically create queries for all tables you have in your database.

Conditional offset on union select

Consider a complex union select from dozens of tables with different structures but similar field meanings:
SELECT a1.abc as field1,
a1.bcd as field2,
a1.date as order_date
FROM a1_table a1
UNION ALL
SELECT a2.def as field1,
a2.fff as field2,
a2.ts as order_date
FROM a2_table a2
UNION ALL ...
ORDER BY order_date
Notice also that the overall results are sorted by the "synthetic" field order_date.
This query gives a huge number of rows, and we want to work with pages from this set of rows. Each page is defined by two parameters:
page size
field2 value of last item from previous page
The most important thing is that we cannot change the way a page is defined, i.e. it is not possible to use the row number or the date of the last item from the previous page: only the field2 value is acceptable.
The current paging algorithm is implemented in quite an ugly way:
1) the query above is wrapped in an additional select with a row_number() column and then wrapped in a stored procedure union_wrapper, which returns an appropriate
table (field1 ..., field2 character varying),
2) then this complex select is performed:
RETURN QUERY
with tmp as (
select
rownum, field1, field2 from union_wrapper()
)
SELECT field1, field2
FROM tmp
WHERE rownum > (SELECT rownum
FROM tmp
WHERE field2 = last_field_id
LIMIT 1)
LIMIT page_size
The problem is that we have to build the full union-select result in memory in order to later detect the row number from which we want to cut the new page. This is quite slow and takes an unacceptably long time to perform.
Is there any way to restructure these operations in order to significantly reduce query complexity and increase speed?
And again: we cannot change the paging condition and we cannot change the structure of the tables. Only the way the rows are retrieved.
UPD: I also cannot use temp tables, because I'm working on a read replica of the database.
You have successfully maneuvered yourself into a tight spot. The query and its ORDER BY expression contradict your paging requirements.
ORDER BY order_date is not a deterministic sort order (there could be multiple rows with the same order_date) - which you need before you do anything else here. And field2 does not seem to be unique either. You need both: define a deterministic sort order and a unique indicator for page end / start. Ideally, the indicator matches the sort order. It could be (order_date, field2), with both columns defined NOT NULL and the combination UNIQUE. Your restriction "only the field2 value is acceptable" contradicts your query.
That's all before thinking about how to get best performance ...
There are proven solutions with row values and multi-column indexes for paging:
Optimize query with OFFSET on large table
But drawing from a combination of multiple source tables complicates matters. Optimization depends on the details of your setup.
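For illustration only, here is what a keyset-style page fetch over the combined indicator could look like in PostgreSQL, assuming the (order_date, field2) pair is made NOT NULL and unique as described above; last_order_date, last_field_id and page_size are placeholders for the values carried over from the previous page:
SELECT field1, field2, order_date
FROM (
    SELECT a1.abc AS field1, a1.bcd AS field2, a1.date AS order_date FROM a1_table a1
    UNION ALL
    SELECT a2.def, a2.fff, a2.ts FROM a2_table a2
) u
WHERE (order_date, field2) > (last_order_date, last_field_id)
ORDER BY order_date, field2
LIMIT page_size;
Whether the row-value predicate actually gets pushed down into each UNION ALL branch (and can use an index on each source table) depends on the planner and on matching per-table indexes, so treat this as a sketch to measure, not a guaranteed win.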
If you can't get the performance you need, your only remaining alternative is to materialize the query results somehow. Temp table, cursor, materialized view - the best tool depends on details of your setup.
Of course, general performance tuning might help, too.

How does IN query work in SQL

Select *
from TableName
where columnA = value1
AND columnB = value2
and column3 IN (list of ids);
How will the above query work in either of the DBs?
How is it different from the following loop of queries:
for x in list_of_ids:
Select * from TableName where columnA = value1 AND columnB = value2 and column3 = x;
The second example is simply a loop where the select will be executed X times.
The execution of the first example:
1. If the number of values is small, IN will be transformed into ORed conditions (sketched below).
2. If the number of values is huge, some DB engines will optimize it, create a spool with all the values, and do a join with your table.
In any case, the first example will be faster.
Note: SQL queries are not valid in NoSQL databases. The above scenario is for traditional DBs only.
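To make point 1 concrete, with a small literal list the two forms below are treated essentially the same, and both run as a single statement rather than one statement per id (the values 10, 20, 30 are just examples):
SELECT * FROM TableName
WHERE columnA = value1 AND columnB = value2 AND column3 IN (10, 20, 30);
-- is handled much like
SELECT * FROM TableName
WHERE columnA = value1 AND columnB = value2
  AND (column3 = 10 OR column3 = 20 OR column3 = 30);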
The latter example is not a query. You have a loop (probably Python) that is executing a specific query (with a specific value for column3) each time.
Is this a fictitious example or taken from a specific DB?
The first example should be more efficient as it will perform the select * from table once.
The second example will perform the select X times and therefore is likely to be less efficient than the first.
The link below gives the detail on the IN clause that I think you are after:
https://social.msdn.microsoft.com/Forums/sqlserver/en-US/4ca39765-6521-4dec-aec1-373b1adf0be7/how-does-the-in-clause-affect-performance?forum=transactsql

Changing NULL's position in sorting

I am sorting a table. The fiddle can be found here.
CREATE TABLE test
(
field date NULL
);
INSERT INTO test VALUES
('2000-01-05'),
('2004-01-05'),
(NULL),
('2008-01-05');
SELECT * FROM test ORDER BY field DESC;
The results I get:
2008-01-05
2004-01-05
2000-01-05
(null)
However I need the results to be like this:
(null)
2008-01-05
2004-01-05
2000-01-05
So the NULL value is treated as if it is higher than any other value. Is it possible to do so?
Easiest is to add an extra sort condition first:
ORDER BY CASE WHEN field IS NULL THEN 0 ELSE 1 END, field DESC
Or, you can try setting it to the maximum of its datatype:
ORDER BY COALESCE(field,'99991231') DESC
COALESCE/ISNULL work fine, provided you don't have "real" data using that same maximum value. If you do, and you need to distinguish them, use the first form.
Use an 'end of time' marker to replace nulls:
SELECT * FROM test
ORDER BY ISNULL(field, '9999-01-01') DESC;
Be wary of queries that invoke per-row functions; they rarely scale well.
That may not be a problem for smaller data sets, but it will be if they become large. This should be monitored by regularly testing the queries. Database optimisation is only a set-and-forget operation if your data never changes (which is very rare).
Sometimes it's better to introduce an artificial primary sort column, such as with:
select 1 as art_id, mydate, col1, col2 from mytable where mydate is null
union all
select 2 as art_id, mydate, col1, col2 from mytable where mydate is not null
order by art_id, mydate desc
Then only use result_set["everything except art_id"] in your programs.
By doing that, you don't introduce (possibly) slow per-row functions, instead you rely on fast index lookup on the mydate column. And advanced execution engines can actually run these two queries concurrently, combining them once they're both finished.

Does ISNULL or OR have better performance?

I have the SQL query:
SELECT ISNULL(t.column1, t.column2) as [result]
FROM t
I need to filter out data by [result] column. What is the best approach regarding performance from the two listed below:
WHERE ISNULL(t.column1, t.column2) = @filterValue
or:
WHERE t.column1 = @filterValue OR t.column2 = @filterValue
UPDATE: Sorry, I forgot to mention that column2 is always null if column1 is filled.
Measure, don't guess! This is something you should be doing yourself, with production-like data. We don't know the make-up of your data and that makes a big difference.
Having said that, I wouldn't do it either way. I'd create another column, column3 to store column1 if non-NULL and column2 if column1 is NULL.
Then I'd have an insert/update trigger to populate that column correctly, index it and use the screaming-banshee-speed:
select t.column3 as [result] from t
The vast majority of databases are read more often than written and it's better if this calculation is done as few times as possible (i.e., when the data changes, not every time you select it). If you want your databases to be scalable, don't use per-row functions.
It's perfectly valid to sacrifice disk space for speed and the triggers ensure that the data doesn't become inconsistent.
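As a sketch only, here is one way that extra column could look in T-SQL, using a persisted computed column rather than hand-written triggers (for a simple ISNULL expression it gives the same calculate-on-write behaviour; the index name is made up):
-- hypothetical column and index names, shown for illustration
ALTER TABLE t ADD column3 AS ISNULL(column1, column2) PERSISTED;
CREATE INDEX IX_t_column3 ON t (column3);
-- the filter then becomes a plain, index-friendly comparison
SELECT t.column3 AS [result]
FROM t
WHERE t.column3 = @filterValue;
If the derivation ever becomes more complex than ISNULL, the trigger approach described above is the more general tool.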
If adding another column and triggers is out of the question, I'd go for the OR solution since it can often be split into two parallel queries by the smarter DBMS engines.
An alternative, which MarkB gave but since deleted his answer so I'll have to go hunting for another good answer of his to upvote :-), is to use UNION ALL. If your DBMS isn't quite smart enough to recognise OR as a chance for parallelism, it may be smart enough to recognise UNION ALL in that context, something like:
select column1 as c from t where column1 is not NULL
union all
select column2 as c from t where column1 is NULL
But again, it depends on both your database and your data. A smart DBA would put the whole thing in a stored procedure so they could swap in a new method seamlessly should the data change its properties.
On an MSSQL table (MSSQL 2000) with 13,000,000 entries and indexes on Col1 and Col2, I get the following results:
select top 1000000 * from Table1 with(nolock) where isnull(Col1,Col2) > '0'
-- Compile-Time: 4ms
-- CPU-Time: 18265ms
-- Elapsed-Time: 24882ms = ~25s
select top 1000000 * from Table1 with(nolock) where Col1 > '0' or (Col1 is null and Col2 > '0')
-- Compile-Time: 9ms
-- CPU-Time: 7781ms
-- Elapsed-Time: 25734 = ~26s
The measured values are subject to strong fluctuations based on the workload of the server.
The first statement needs less time to compile but takes more CPU time to execute (clustered index scan).
It's important to know that many storage engines have an optimizer that reorganizes the statement for better results and execution times. Ultimately, both statements will be rewritten by the optimizer into mostly the same statement.
I think your replacement expression does not mean the same thing. Assume filterValue is 2; then ISNULL(1,2)=2 is false, but 1=2 or 2=2 is true. The expression you need looks more like:
(c1=filter) or ((c1 is null) and (c2 = filter));
There is a chance that a server can answer this from the index on c1. The first part of the solution is an index scan over c1=filter. The second part is a scan over c1 is null and then a linear search for c2=filter. I'd even say that a clustered index on (c1,c2) could work here.
OTOH, you should measure before making assumptions like this; speculation usually doesn't work in SQL unless you have intimate knowledge of the implementation. For example, I'm pretty sure the query planner already knows that ISNULL(X,Y) can be decomposed into a boolean statement with its implications for searching, but I would not rely on that; rather, measure and then decide what to do.
I have measured the performance of both queries on SQL Server 2008.
And I got the following results:
Both approaches had almost the same Estimated Subtree Cost metric.
But the OR approach had a more accurate value for the Estimated Number of Rows metric.
So the query optimizer will build a more appropriate execution plan for the OR approach than for the ISNULL approach.