How to count a specific value (e.g. -1) in each column of a big table? - sql

My task is to analyze a big table (250 columns, millions of rows). I need to find out how many times a specific value (e.g. -1) appears in each column. I have a solution that loops through the columns of my table and uses the methods described in the following links:
Fastest way to count exact number of rows in a very large table?
https://learn.microsoft.com/de-de/archive/blogs/martijnh/sql-serverhow-to-quickly-retrieve-accurate-row-count-for-table
However, for each column I first have to do:
select column into #tab from MyBigTable where column = -1
And then apply the methods to #tab.
Do you see any way this can be dealt with efficiently?

You could use conditional aggregation:
select sum(case when col1 = -1 then 1 else 0 end) col1sum,
       sum(case when col2 = -1 then 1 else 0 end) col2sum,
       ...
       sum(case when coln = -1 then 1 else 0 end) colnsum
from yourtable
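With 250 columns, writing those CASE expressions by hand is tedious. A minimal sketch of generating the statement instead, assuming SQL Server 2017+ (the #temp table in the question suggests SQL Server); the dbo schema, the table name, and the numeric-type filter are assumptions to adjust:
declare @sql nvarchar(max);

-- Build one SUM(CASE ...) per numeric column of dbo.MyBigTable.
-- The CAST keeps STRING_AGG from hitting the 8000-byte limit with 250 columns.
select @sql = string_agg(
           cast('sum(case when ' + quotename(COLUMN_NAME) + ' = -1 then 1 else 0 end) as '
                + quotename(COLUMN_NAME + '_cnt') as nvarchar(max)),
           ', ')
from INFORMATION_SCHEMA.COLUMNS
where TABLE_SCHEMA = 'dbo'
  and TABLE_NAME = 'MyBigTable'
  and DATA_TYPE in ('int', 'bigint', 'smallint', 'tinyint', 'decimal', 'numeric', 'float', 'real');

set @sql = 'select ' + @sql + ' from dbo.MyBigTable;';

print @sql;               -- inspect the generated query before running it
exec sp_executesql @sql;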


Create a Query to check if any Column in a table is Null

I have zero experience with SQL but am trying to learn how to validate tables. I am trying to see whether any of the columns in a table contain nulls.
Currently I have been using a script that just counts the number of nulls, and I am doing this for each column separately. Is there a better script I can use to check all the columns in a table at once?
select count(id) from schema.table where id is not null
If there are 100 records, I would expect every column to come back with 100, but if a column is entirely null it will show 0.
You can count each column in a single query by using sum and case:
select
sum(case when Column1 is null then 1 else 0 end) Column1NullCount
, sum(case when Column2 is null then 1 else 0 end) Column2NullCount
-- ...
, sum(case when ColumnN is null then 1 else 0 end) ColumnNNullCount
from MySchema.MyTable
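Since count(SomeColumn) only counts non-NULL values, an equivalent sketch (same placeholder names) that avoids the CASE expressions entirely:
select
  count(*) - count(Column1) as Column1NullCount
, count(*) - count(Column2) as Column2NullCount
-- ...
, count(*) - count(ColumnN) as ColumnNNullCount
from MySchema.MyTable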

Grouping multiple column operations in 1 table scan

I need to translate SAS code (PROC SQL) to (Postgres) SQL, in particular the calculated keyword in SAS, which allows a variable defined in the query to be re-used directly in the same query to compute another variable:
SELECT
id,
sum( case
when (sales > 0) then 1
when (sales = 0) then 0
else -1
end) as pre_freq,
(case
when calculated pre_freq > 0 then calculated pre_freq
else 1
end) as freq
FROM my_table
GROUP BY id
This is not possible (AFAIK) in SQL, so I need to break down each step of the computation.
I was wondering what the best option was, knowing that, from my understanding, it is better to have more computation and fewer table scans, i.e. do as much computation as possible during a single scan instead of multiple table scans with small computation steps.
In this particular example I could use:
SELECT
id
, greatest(1, sum( case
when (sales > 0) then 1
when (sales = 0) then 0
else -1
end)) as freq
FROM
my_table
GROUP BY id
or:
SELECT
id
, (case when sum(case
when (sales > 0) then 1
when (sales < 0) then -1
else 0
end) > 0 then sum(case
when (sales > 0) then 1
when (sales < 0) then -1
else 0
end) else 1 end) as freq
FROM
my_table
GROUP BY id
... which is starting to be hard to read...
Is there anyway to define a variable for a snippet of SQL code that will be repeated?
More generally speaking than this illustration, what is the best (most efficient) approach?
calculated is a nice feature of proc sql. However, you cannot re-use aliases in databases in general (this is not a Postgres-specific limitation). A simple method is to use a subquery or CTE:
select id, pre_freq,
(case when pre_freq > 0 then pre_freq
else 1
end) as freq
from (select id,
sum(case when (sales > 0) then 1
when (sales = 0) then 0
else -1
end) as pre_freq
from my_table t
group by id
) t;
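The same query written with a CTE, if you prefer that shape (identical logic, still a single scan):
with agg as (
    select id,
           sum(case when sales > 0 then 1
                    when sales = 0 then 0
                    else -1
               end) as pre_freq
    from my_table
    group by id
)
select id,
       pre_freq,
       case when pre_freq > 0 then pre_freq else 1 end as freq
from agg;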
However, the simplest solution is to use sign():
select id, sum(sign(sales)) as pre_freq,
greatest(sum(sign(sales)), 1) as freq
from my_table t
group by id;
Note: This is slightly different. It basically ignores NULL values. If you really need to treat NULL as -1, then use coalesce().
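For example, a sketch where coalesce() maps NULL sales back to -1, matching the original CASE:
select id,
       sum(sign(coalesce(sales, -1))) as pre_freq,
       greatest(sum(sign(coalesce(sales, -1))), 1) as freq
from my_table
group by id;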

nested SQL queries on one table

I am having trouble formulating a query to get the desired output.
This query involves one table and two columns.
The first column, bld_stat, has 4 different values: Private, public, Public-Abandoned, Private-Abandoned. The other column, bld_type, has: single_flr, multi_flr, trailer, Whs.
I need the results pivoted: one row per bld_stat, with a count column for each bld_type.
So far I can get the first two columns, but beyond that I have not been able to get a query to work:
SELECT bld_stat, COUNT(grade) AS single_flr
FROM (SELECT bld_stat,bld_type
FROM bld_inventory WHERE bld_type = 'single_flr') AS grade
GROUP BY bld_stat,bld_type,grade
The term you are looking for is pivoting. I think this should work; there is no need for the subquery, and I've changed your GROUP BY to only bld_stat:
SELECT bld_stat,
sum(case when bld_type = 'single_flr' then 1 else 0 end) AS single_flr,
sum(case when bld_type = 'multi_flr' then 1 else 0 end) AS multi_flr,
sum(case when bld_type = 'trailer' then 1 else 0 end) AS trailer,
sum(case when bld_type = 'Whs' then 1 else 0 end) AS WHS
FROM bld_inventory
GROUP BY bld_stat
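If your database supports the standard aggregate FILTER clause (PostgreSQL does, for instance), the same pivot can be written a bit more compactly; this is just an optional variant of the query above:
SELECT bld_stat,
       count(*) FILTER (WHERE bld_type = 'single_flr') AS single_flr,
       count(*) FILTER (WHERE bld_type = 'multi_flr') AS multi_flr,
       count(*) FILTER (WHERE bld_type = 'trailer') AS trailer,
       count(*) FILTER (WHERE bld_type = 'Whs') AS whs
FROM bld_inventory
GROUP BY bld_stat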

SQL: sort by number of empty columns

I have a SQL query which displays a list of results. Every row in my database has about
20 columns and not every column is mandatory. I would like the result of the SQL query to be
sorted by the number of filled-in columns: rows with the fewest empty columns at the top, those with the most at the bottom. Do any of you have an idea how to do this?
I thought about adding an extra column to the table that is updated every time the user edits their row; this number would indicate the number of empty columns, and I could sort my list by it. This, however, sounds like unnecessary trouble, but maybe there is no other way? I'm sure somebody on here will know!
Thanks,
Sander
You can do it in just about any database with a giant case statement:
order by ((case when col1 is not null then 1 else 0 end) +
(case when col2 is not null then 1 else 0 end) +
. . .
(case when col20 is not null then 1 else 0 end)
) desc
You could order by the number of empty columns:
order by
case when col1 is null then 1 else 0 end +
case when col2 is null then 1 else 0 end +
case when col3 is null then 1 else 0 end +
...
case when col20 is null then 1 else 0 end
(Note the + at the end of the lines: this is a single expression giving the integer count of empty fields, sorted in ascending order.)
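If you also want to see that count in the output, you can select the same expression and sort by its alias (a sketch with placeholder names; most databases allow ordering by a select-list alias):
select yourtable.*,
       (case when col1 is null then 1 else 0 end +
        case when col2 is null then 1 else 0 end +
        ...
        case when col20 is null then 1 else 0 end) as empty_columns
from yourtable
order by empty_columns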

SQL 2 counts with different filter

I have a table and I need calculate two aggregate functions with different conditions in one statement. How can I do this?
Pseudocode below:
SELECT count(CoumntA) /* where < 0 */, count(CoumntA) /* where > 0 */
FROM dbo.TableA
This is the same idea as tombom's answer, but with SQL Server syntax:
SELECT
SUM(CASE WHEN CoumntA < 0 THEN 1 ELSE 0 END) AS LessThanZero,
SUM(CASE WHEN CoumntA > 0 THEN 1 ELSE 0 END) AS GreaterThanZero
FROM TableA
As @tombom demonstrated, this can be done as a single query. But that doesn't mean it should be.
SELECT
SUM(CASE WHEN CoumntA < 0 THEN 1 ELSE 0 END) AS less_than_zero,
SUM(CASE WHEN CoumntA > 0 THEN 1 ELSE 0 END) AS greater_than_zero
FROM
TableA
The time when this is not so good is...
- There is an index on CoumntA
- Most values (50% or more feels about right) are exactly zero
In that case, two queries will be faster, because each query can use the index to home in on the section to be counted, so only the relevant records are read.
The example I gave, however, scans the whole table every time. Only once, but always the whole table. That is worth it when you're counting most of the records. In your case it looks like you're counting most or all of them, so this is probably a good way of doing it.
It is possible to do this in one select statement.
The way I've done it before is like this:
SELECT SUM(CASE WHEN ColumnA < 0 THEN 1 END) AS LessThanZero,
SUM(CASE WHEN ColumnA > 0 THEN 1 END) AS GreaterThanZero
FROM dbo.TableA
This is the correct MS SQL syntax and I believe this is a very efficient way of doing it.
Don't forget you are not covering the case when ColumnA = 0!
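If the zero bucket matters, the same pattern extends directly (a sketch; note that SUM with no ELSE returns NULL rather than 0 for an empty bucket, so wrap it in COALESCE if that is an issue):
SELECT SUM(CASE WHEN ColumnA < 0 THEN 1 END) AS LessThanZero,
       SUM(CASE WHEN ColumnA = 0 THEN 1 END) AS EqualToZero,
       SUM(CASE WHEN ColumnA > 0 THEN 1 END) AS GreaterThanZero
FROM dbo.TableA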
select '< 0' as filter, COUNT(0) as cnt from TableA where [condition 1]
union
select '> 0' as filter, COUNT(0) as cnt from TableA where [condition 2]
Be sure that condition 1 and condition 2 partition the original set of records, otherwise the same records could be counted in both groups.
For SQL Server, one way would be:
SELECT COUNT(CASE WHEN CoumntA<0 THEN 1 ELSE NULL END),
COUNT(CASE WHEN CoumntA>0 THEN 1 ELSE NULL END)
FROM dbo.TableA
SELECT
SUM(IF(CoumntA < 0, 1, 0)) AS lowerThanZero,
SUM(IF(CoumntA > 0, 1, 0)) AS greaterThanZero
FROM
TableA
Is it clear what's happening? Ask if you have any more questions.
A shorter form would be
SELECT
SUM(CoumntA < 0) AS lowerThanZero,
SUM(CoumntA > 0) AS greaterThanZero
FROM
TableA
This is possible since, in MySQL, a true condition evaluates to 1 and a false condition to 0.
EDIT: okay, okay, sorry, I don't know why I thought this was about MySQL. See the other answers for the correct syntax.