I have a SQL table with 6 columns. 1 ID int column and 5 Datetime columns Round1, Round2, ..., Round5
The data looks something like this. Either there is a date or the cell is empty.
I would like the query to show, for each ID, the number of filled datetime columns.
Can you please give me a hint on how to build this query? Would it involve an aggregate function?
Thank you
Consider:
SELECT ID, IIf(Round1 Is Null, 0, 1) + IIf(Round2 Is Null, 0, 1) +
IIf(Round3 Is Null, 0, 1) + IIf(Round4 Is Null, 0, 1) + IIf(Round5 Is Null, 0, 1) AS Cnt
FROM Table;
An aggregate function is not helpful unless you first normalize the data with a UNION query.
SELECT ID, Round1 AS Dte, "R1" AS Src FROM table
UNION SELECT ID, Round2, "R2" FROM table
UNION SELECT ID, Round3, "R3" FROM table
UNION SELECT ID, Round4, "R4" FROM table
UNION SELECT ID, Round5, "R5" FROM table;
Save that query (here as Q1), then use it in an aggregate query.
SELECT ID, Count(Dte) AS CntD FROM Q1 GROUP BY ID;
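The normalize-then-aggregate idea above can be tried out with a minimal sketch in Python's built-in sqlite3 module. The table name rounds and the sample dates are assumptions made up for illustration, and the UNION query is inlined as a subquery instead of being saved as Q1.

```python
import sqlite3

# Hypothetical sample data: the table name "rounds" and the dates are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rounds (ID INTEGER, Round1 TEXT, Round2 TEXT, "
    "Round3 TEXT, Round4 TEXT, Round5 TEXT)"
)
conn.executemany(
    "INSERT INTO rounds VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "2020-01-01", "2020-02-01", None, None, None),
        (2, "2020-01-05", None, None, None, None),
        (3, None, None, None, None, None),
    ],
)

# Step 1: normalize the five date columns into rows (ID, Dte, Src).
unpivot = " UNION ALL ".join(
    f"SELECT ID, Round{i} AS Dte, 'R{i}' AS Src FROM rounds" for i in range(1, 6)
)

# Step 2: COUNT(Dte) skips NULLs, so it yields the number of filled columns.
rows = conn.execute(
    f"SELECT ID, COUNT(Dte) AS CntD FROM ({unpivot}) AS q1 GROUP BY ID ORDER BY ID"
).fetchall()
print(rows)  # [(1, 2), (2, 1), (3, 0)]
```

UNION ALL is used here because the Src column already makes every row distinct, so there is nothing to deduplicate.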
You can use CASE expressions to return 1 when a value is not NULL or 0 otherwise. Then just add all that.
SELECT id,
CASE
WHEN round1 IS NOT NULL THEN
1
ELSE
0
END
+
...
CASE
WHEN round5 IS NOT NULL THEN
1
ELSE
0
END total
FROM elbat;
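Since the query above elides the middle columns, here is a hedged sketch that generates the full CASE sum for any list of columns and runs it against an in-memory SQLite table. The helper name filled_count_sql and the sample rows are assumptions for illustration only.

```python
import sqlite3

def filled_count_sql(columns, table):
    """Build a query that adds one CASE term per column (1 if filled, else 0)."""
    terms = [f"CASE WHEN {c} IS NOT NULL THEN 1 ELSE 0 END" for c in columns]
    return f"SELECT id, {' + '.join(terms)} AS total FROM {table} ORDER BY id"

columns = [f"round{i}" for i in range(1, 6)]

conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE elbat (id INTEGER, {', '.join(c + ' TEXT' for c in columns)})")
# Hypothetical sample rows: three filled columns for id 1, none for id 2.
conn.executemany(
    "INSERT INTO elbat VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "2020-01-01", None, "2020-03-01", None, "2020-05-01"),
        (2, None, None, None, None, None),
    ],
)

totals = conn.execute(filled_count_sql(columns, "elbat")).fetchall()
print(totals)  # [(1, 3), (2, 0)]
```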
And next time do not post images of tables. Post their CREATE statements with sample data as INSERT statements. And tag the specific DBMS you're using.
I am performing data QA testing.
I have this query to establish any errors between the source table and the destination table.
select
count(case when coalesce(x.col1,1) = coalesce(y.col1,1) then null else 1 end) as cnt_col1,
count(case when coalesce(x.col2,"1") = coalesce(y.col2,"1") then null else 1 end) as cnt_col2
from
`DatasetA.Table` x
FULL OUTER JOIN
`DatasetB.Table` y
on x.col1 = y.col1
The output of this query is like this:
col1, col2
null, null
null, null
1, null
null, 1
I have 200 tables that I need to run this test on, and the number of columns varies. The table above has only two columns; some have 50.
I already have the queries for the tables, but I need to bring the output of all the tests into a single output. My plan is to conform each query to a unified output and combine them with UNION ALL.
The output set should say:
COLUMN, COUNT_OF_ERRORS
cnt_col1, 1
cnt_col2, 1
...
cnt_col15, 0
My question is this.
How do I reverse pivot this so I can achieve the output I'm looking for.
Thanks
Assuming you have table `data`
col1 col2 col3
---- ---- ----
null null null
null null 1
null 1 1
1 null 1
1 null 1
1 null 1
And you need to reverse pivot it to
column count_of_errors
-------- ---------------
cnt_col1 3
cnt_col2 1
cnt_col3 5
Below is for BigQuery Standard SQL and does exactly this
#standardSQL
WITH `data` AS (
SELECT NULL AS col1, NULL AS col2, NULL AS col3 UNION ALL
SELECT NULL, NULL, 1 UNION ALL
SELECT 1, NULL, 1 UNION ALL
SELECT NULL, 1, 1 UNION ALL
SELECT 1, NULL, 1 UNION ALL
SELECT 1, NULL, 1
)
SELECT r.* FROM (
SELECT
[
STRUCT<column STRING, count_of_errors INT64>
('cnt_col1', SUM(col1)),
('cnt_col2', SUM(col2)),
('cnt_col3', SUM(col3))
] AS row
FROM `data`
), UNNEST(row) AS r
It is simple enough and easy to adjust to any number of columns in your initial `data` table: you just need to add one ('cnt_colN', SUM(colN)) entry per column, which can be done manually, or you can write a simple script to generate those lines (or the whole query).
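A generator script of the kind mentioned above might look like the following sketch. The function name build_unpivot_query is hypothetical, and the script only builds the query text; it does not connect to BigQuery.

```python
def build_unpivot_query(table, columns):
    """Return the reverse-pivot query text with one struct entry per column."""
    entries = ",\n".join(f"      ('cnt_{c}', SUM({c}))" for c in columns)
    return (
        "#standardSQL\n"
        "SELECT r.* FROM (\n"
        "  SELECT\n"
        "    [\n"
        "      STRUCT<column STRING, count_of_errors INT64>\n"
        f"{entries}\n"
        "    ] AS row\n"
        f"  FROM `{table}`\n"
        "), UNNEST(row) AS r"
    )

# Example: a table with 15 error-flag columns named col1 .. col15.
query = build_unpivot_query("data", [f"col{i}" for i in range(1, 16)])
print(query)
```

The per-table queries produced this way all share the same two output columns, so they can be chained with UNION ALL as planned in the question.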
About "comparing 2 tables" in Big Data: I don't think that joins are the best approach, because joins are quite slow in general and you then have to handle the rows that only appear on one side of an outer join.
I worked on this topic years ago (https://community.hortonworks.com/articles/1283/hive-script-to-validate-tables-compare-one-with-an.html) and I am now trying to port this knowledge to compare Hive tables with BigQuery tables.
One of my main ideas is to use checksums to make sure that one table is fully identical to the other.
Here is a "basic example":
with one_string as(
select concat( sessionid ,'|',referrercode ,'|',purchaseid ,'|',customerid ,'|', cast(bouncerateind as string),'|', cast( productpagevisit as string),'|', cast( itemordervalue as string),'|', cast( purchaseinsession as string),'|', cast( hit_time_gmt as string),'|',datedir ,'|',productcategory ,'|',post_cookies) as bigstring from bidwh2.omniture_2017_03_24_v2
),
shas as(
select TO_BASE64( sha1( bigstring)) as sha from one_string
),
shas_prefix as(
select substr( sha, 0 , 1) as prefix, sha from shas
),
shas_ordered as(
select prefix, sha from shas_prefix order by sha
),
results_prefix as(
select concat( prefix, ' ', TO_BASE64( sha1( STRING_AGG( sha, '|')))) as res from shas_ordered group by prefix
),
results_ordered as(
select 1 as myall, res from results_prefix order by res
)
select SHA1( STRING_AGG( res, '|')) as sha from results_ordered group by myall;
So you do that on each of the 2 tables and compare the 2 checksums.
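The core of the checksum idea can be sketched outside SQL as follows. This is a simplified illustration, not the query above: it skips the per-prefix grouping, and the function names and the choice to render NULL as an empty string are assumptions.

```python
import hashlib

def row_digest(row):
    """Hash one row after joining its values with '|' (None becomes '')."""
    joined = "|".join("" if v is None else str(v) for v in row)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()

def table_checksum(rows):
    """Sort the row hashes so that row order does not affect the result."""
    digests = sorted(row_digest(r) for r in rows)
    return hashlib.sha1("|".join(digests).encode("utf-8")).hexdigest()

source = [(1, "a", 10.5), (2, "b", None), (3, "c", 7)]
same_rows_other_order = [(3, "c", 7), (1, "a", 10.5), (2, "b", None)]
one_value_changed = [(1, "a", 10.5), (2, "b", None), (3, "c", 8)]

print(table_checksum(source) == table_checksum(same_rows_other_order))  # True
print(table_checksum(source) == table_checksum(one_value_changed))      # False
```

Sorting the row hashes before the final hash plays the same role as the ORDER BY steps in the query: both tables must feed their rows to the last hash in the same order.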
The final idea is to have a Python script (not finished yet; I hope my company allows me to open-source it when finished) that would do the following:
count the rows for some "buckets" (groups of rows for which the checksum of a well-distributed column is the same modulo a big number) and compare the results (because there is no need to checksum the whole table if the number of rows does not match).
visually show the differences if the counts do not match
use the bucket/rows technique plus some other "buckets/columns" to compute checksums in a similar way as in the example above, and compare all those checksums.
visually show the differences if the checksums do not match
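The first step of that plan, counting rows per bucket before computing any full checksum, could look like this sketch. It is not taken from the finished script; the function names, the use of SHA-1 for bucketing and the default of 8 buckets are all assumptions.

```python
import hashlib
from collections import Counter

def bucket_of(value, num_buckets):
    """Map a key value to a bucket: hash of the value modulo num_buckets."""
    digest = hashlib.sha1(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

def bucket_counts(rows, key_index, num_buckets=8):
    """Count rows per bucket, bucketing on the column at key_index."""
    return Counter(bucket_of(r[key_index], num_buckets) for r in rows)

def mismatched_buckets(rows_a, rows_b, key_index, num_buckets=8):
    """Return the buckets whose row counts differ between the two tables."""
    a = bucket_counts(rows_a, key_index, num_buckets)
    b = bucket_counts(rows_b, key_index, num_buckets)
    return sorted(k for k in set(a) | set(b) if a[k] != b[k])

table_a = [(i, f"name{i}") for i in range(100)]
table_b = [r for r in table_a if r[0] != 42]  # one row missing

print(mismatched_buckets(table_a, table_a, key_index=0))  # []
print(mismatched_buckets(table_a, table_b, key_index=0))  # the bucket of key 42
```

Only the buckets reported here would need a closer look, which keeps the expensive row-level comparison small.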
Edit on 03/11/2017: script is finished and can be found at: https://github.com/bolcom/hive_compared_bq
I found this question on here: How to select columns of data in BigQuery that has all NULL values
but I would like to do the opposite and find all the columns with non-null values. How would I flip this previous solution to accomplish the opposite? I am not that familiar with regexp syntax and I couldn't figure out a solution trying to research this online.
Thank you for your help in advance.
The script of How to select columns of data in BigQuery that has all NULL values
can be modified as following:
WITH `project.dataset.table` AS (
SELECT 77 A, 1 B, NULL C UNION ALL
SELECT NULL, 6, NULL UNION ALL
SELECT NULL, 2, NULL UNION ALL
SELECT NULL, 3, NULL
)
SELECT all_column, count(null_column) as count_null, count(1) as total_rows
FROM `project.dataset.table` AS t,
UNNEST(REGEXP_EXTRACT_ALL(
TO_JSON_STRING(t),
r'\"([a-zA-Z0-9\_]+)\":')
) AS all_column
left join UNNEST(REGEXP_EXTRACT_ALL(
TO_JSON_STRING(t),
r'\"([a-zA-Z0-9\_]+)\":null')
) AS null_column
on null_column=all_column
GROUP BY 1
HAVING count(null_column)=count(1)
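TO_JSON_STRING converts each row to a JSON string of the form {"column_name":value,...}.
The first REGEXP_EXTRACT_ALL( ... , r'\"([a-zA-Z0-9\_]+)\":') extracts every column name from that string.
The second one, whose pattern ends in :null, extracts a column name only if the value is null.
As written, the HAVING clause keeps the columns in which every value is NULL; to list the columns that contain no NULL at all, change it to HAVING count(null_column)=0.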
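To see what the two regular expressions pick up, here is a small sketch in plain Python that mimics the approach on the same sample rows. json.dumps with compact separators stands in for TO_JSON_STRING; that substitution and the variable names are assumptions for illustration.

```python
import json
import re
from collections import Counter

rows = [
    {"A": 77, "B": 1, "C": None},
    {"A": None, "B": 6, "C": None},
    {"A": None, "B": 2, "C": None},
    {"A": None, "B": 3, "C": None},
]

ALL_COLUMNS = re.compile(r'"([a-zA-Z0-9_]+)":')       # every column name
NULL_COLUMNS = re.compile(r'"([a-zA-Z0-9_]+)":null')  # only names whose value is null

total_rows = Counter()
null_rows = Counter()
for row in rows:
    # Compact separators give {"A":77,"B":1,"C":null}, with no space after ':'.
    text = json.dumps(row, separators=(",", ":"))
    total_rows.update(ALL_COLUMNS.findall(text))
    null_rows.update(NULL_COLUMNS.findall(text))

all_null = sorted(c for c in total_rows if null_rows[c] == total_rows[c])
no_null = sorted(c for c in total_rows if null_rows[c] == 0)
print(all_null)  # ['C']
print(no_null)   # ['B']
```

Column A has one filled value and three NULLs, so it appears in neither list; which of the two conditions is wanted depends on how "columns with non-null values" is meant.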
Let's say I have a table Category with columns
id, childCategory, hasParts
Let's say I want to group by id and check if any value in hasParts has value true.
How to do this efficiently?
this has got to be the most vague post that i've seen on here but i'll take a stab at it. based on my own imagination and the 3 sentences provided, here we go:
create table category (id int, childcategory nvarchar(25), hasparts bit)
insert category
select 1, 'stroller', 1
union all
select 1, 'rocker', 1
union all
select 2, 'car', 0
union all
select 2, 'doll', 0
union all
select 3, 'nasal sprayer', 0
union all
select 3, 'thermometer', 1
select *,
case when exists (select 1 from category b where a.id = b.id and b.hasparts = 1) then 'has true value' end as truecheck
from
(
select id, count(*) as inventory
from category
group by id
) a
drop table category
this should theoretically get you what you want. adjust as needed.
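As an alternative to the EXISTS subquery, a single GROUP BY with MAX can answer "is any hasParts true per id". The sketch below runs it in SQLite, assuming hasParts is stored as 0/1; in SQL Server a bit column would need a cast to int before MAX can be applied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE category (id INTEGER, childCategory TEXT, hasParts INTEGER)")
conn.executemany(
    "INSERT INTO category VALUES (?, ?, ?)",
    [
        (1, "stroller", 1),
        (1, "rocker", 1),
        (2, "car", 0),
        (2, "doll", 0),
        (3, "nasal sprayer", 0),
        (3, "thermometer", 1),
    ],
)

# MAX(hasParts) is 1 as soon as one row of the group has hasParts = 1.
result = conn.execute(
    "SELECT id, COUNT(*) AS inventory, MAX(hasParts) AS any_has_parts "
    "FROM category GROUP BY id ORDER BY id"
).fetchall()
print(result)  # [(1, 2, 1), (2, 2, 0), (3, 2, 1)]
```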
I have a table like below:
CREATE TABLE public.test_table
(
"ID" serial PRIMARY KEY NOT NULL,
"CID" integer NOT NULL,
"SEG" integer NOT NULL,
"DDN" character varying(3) NOT NULL
)
and data looks like this:
ID CID SEG DDN
1 1 1 "711"
2 1 2 "800"
3 1 3 "124"
4 2 1 "711"
5 3 1 "711"
6 3 2 "802"
7 4 1 "799"
8 5 1 "799"
9 5 2 "804"
10 6 1 "799"
I need to group this data by the CID column and then, for each DDN value found in the first segment of a CID, get two counts: the number of CIDs that have only one row and the number of CIDs that have more than one.
I'm really sorry if I couldn't explain this clearly. Let me show you what I need:
DDN END TRA
711 1 2
799 2 1
As you can see, DDN 711 has 1 CID with a single row (ID 4). This is the END column.
But 2 CIDs have multiple SEG rows (IDs 1 to 3 and IDs 5 to 6). This is the TRA column.
I am not sure which column should go in the GROUP BY clause!
My solution:
Just found a solution like below
WITH x AS (
SELECT
(SELECT t1."DDN" FROM public.test_table AS t1
WHERE t1."CID"=t."CID" AND t1."SEG"=1) AS ddn,
COUNT("CID") AS seg_count
FROM public.test_table AS t
GROUP BY "CID"
)
SELECT ddn, COUNT(seg_count) AS "TOTAL",
SUM(CASE WHEN x.seg_count=1 THEN 1 ELSE 0 END) as "END",
SUM(CASE WHEN x.seg_count>1 THEN 1 ELSE 0 END) as "TRA"
FROM x
GROUP BY ddn;
Equivalent, faster query:
SELECT "DDN"
, COUNT(*) AS "TOTAL"
, COUNT(*) FILTER (WHERE seg_count = 1) AS "END"
, COUNT(*) FILTER (WHERE seg_count > 1) AS "TRA"
FROM (
SELECT DISTINCT ON ("CID")
"DDN" -- assuming min "SEG" is always 1
, COUNT(*) OVER (PARTITION BY "CID") AS seg_count
FROM test_table
ORDER BY "CID", "SEG"
) sub
GROUP BY "DDN";
Notes
In Postgres versions before 12, CTEs act as optimization fences, so they are typically slower and should only be used where needed.
This is equivalent to the query in the question assuming that the minimum "SEG" per "CID" is always 1 - since this query returns the row with the minimum "SEG" while your query returns the one with "SEG" = 1. Typically, you would want the "first" segment and my query implements this requirement more reliably, but that's not clear from the question.
COUNT(*) is slightly faster than COUNT(column) and equivalent as long as no NULL values are involved (which applies here). Related:
PostgreSQL: running count of rows for a query 'by minute'
About DISTINCT ON:
Select first row in each GROUP BY group?
The aggregate FILTER syntax requires Postgres 9.4+:
Conditional SQL count
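For readers without a Postgres instance at hand, the same END/TRA logic can be checked with the sketch below in SQLite. It is a swapped-in formulation, not the query above: it avoids DISTINCT ON and FILTER (which SQLite lacks or only supports in recent versions) and instead joins each CID back to its row with the minimum SEG. The output column names end_count and tra_count are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_table (ID INTEGER PRIMARY KEY, CID INTEGER, SEG INTEGER, DDN TEXT)")
conn.executemany(
    "INSERT INTO test_table (CID, SEG, DDN) VALUES (?, ?, ?)",
    [
        (1, 1, "711"), (1, 2, "800"), (1, 3, "124"),
        (2, 1, "711"),
        (3, 1, "711"), (3, 2, "802"),
        (4, 1, "799"),
        (5, 1, "799"), (5, 2, "804"),
        (6, 1, "799"),
    ],
)

# c: one row per CID with its segment count and its first (minimum) SEG.
# f: the row holding that first segment, which supplies the DDN to group by.
counts = conn.execute(
    """
    SELECT f.DDN,
           SUM(CASE WHEN c.seg_count = 1 THEN 1 ELSE 0 END) AS end_count,
           SUM(CASE WHEN c.seg_count > 1 THEN 1 ELSE 0 END) AS tra_count
    FROM (SELECT CID, COUNT(*) AS seg_count, MIN(SEG) AS first_seg
          FROM test_table
          GROUP BY CID) AS c
    JOIN test_table AS f ON f.CID = c.CID AND f.SEG = c.first_seg
    GROUP BY f.DDN
    ORDER BY f.DDN
    """
).fetchall()
print(counts)  # [('711', 1, 2), ('799', 2, 1)]
```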
Here is the solution I propose; the query can probably be simplified.
CREATE TABLE test_table
(
ID serial PRIMARY KEY NOT NULL,
CID integer NOT NULL,
SEG integer NOT NULL,
DDN character varying(3) NOT NULL
);
insert into test_table(CID,SEG,DDN)
values
( 1, 1, '711'),
( 1, 2, '800'),
( 1, 3, '124'),
( 2, 1, '711'),
( 3, 1, '711'),
( 3, 2, '802'),
( 4, 1, '799'),
( 5, 1, '799'),
( 5, 2, '804'),
( 6, 1, '799');
with summary as (with ddn_t as (select cid,ddn,row_number() OVER( PARTITION BY cid ORDER BY seg)from test_table)
select a.cid,count(distinct a.ddn),b.ddn
from ddn_t a
join ddn_t b on b.cid=a.cid and b.row_number=1
group by a.cid, b.ddn)
select ddn,
sum (case when count >1 then 1 else 0 end) as TRA,
sum (case when count = 1 then 1 else 0 end) as END
from summary
group by ddn;
Consider the following table (snapshot):
I would like to write a query to select rows from the table for which
At least 4 out of 7 column values (VAL, EQ, EFF, ..., SY) are not NULL.
Any idea how to do that?
Nothing fancy here, just count the number of non-null per row:
SELECT *
FROM Table1
WHERE
IIF(VAL IS NULL, 0, 1) +
IIF(EQ IS NULL, 0, 1) +
IIF(EFF IS NULL, 0, 1) +
IIF(SIZE IS NULL, 0, 1) +
IIF(FSCR IS NULL, 0, 1) +
IIF(MSCR IS NULL, 0, 1) +
IIF(SY IS NULL, 0, 1) >= 4
Just noticed you tagged sql-server-2005. IIF requires SQL Server 2012, but you can substitute CASE WHEN VAL IS NULL THEN 0 ELSE 1 END.
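The CASE-based variant of the filter can be checked with a short sketch in SQLite. The id column and the sample values are assumptions added for illustration; the column names are quoted only as a precaution.

```python
import sqlite3

COLUMNS = ["VAL", "EQ", "EFF", "SIZE", "FSCR", "MSCR", "SY"]

conn = sqlite3.connect(":memory:")
column_defs = ", ".join(f'"{c}" REAL' for c in COLUMNS)
conn.execute(f"CREATE TABLE Table1 (id INTEGER, {column_defs})")
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 1, 2, 3, 4, None, None, None),           # 4 non-NULL values: kept
        (2, 1, None, None, None, None, None, 2),     # 2 non-NULL values: dropped
        (3, 1, 2, 3, 4, 5, 6, 7),                    # 7 non-NULL values: kept
    ],
)

# One CASE term per column: 0 when NULL, 1 otherwise; the sum is the filled count.
non_null_sum = " + ".join(f'CASE WHEN "{c}" IS NULL THEN 0 ELSE 1 END' for c in COLUMNS)
kept_ids = [
    row[0]
    for row in conn.execute(
        f"SELECT id FROM Table1 WHERE {non_null_sum} >= 4 ORDER BY id"
    )
]
print(kept_ids)  # [1, 3]
```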
How about this? Turning your columns into "rows" and use SQL to count not nulls:
select *
from Table1 as t
where
(
select count(*) from (values
(t.VAL), (t.EQ), (t.EFF), (t.SIZE), (t.FSCR), (t.MSCR), (t.SY)
) as a(val) where a.val is not null
) >= 4
I like this solution because it separates the data from the data processing: once you have this derived "table of values", you can do anything with it, and it is easy to change the logic in the future. You can sum, count, or apply any aggregate you want. If it were something like case when t.VAL then ... end + ..., then you would have to change the logic in many places.
For example, suppose you want to sum all non-null elements greater than 2. In this solution you just change count to sum, adjust the where clause, and you are done. If it were iif(Val is null, 0, 1) + ..., you would first have to think about what should be done and then change every item to, for example, case when Val > 2 then Val else 0 end.
Since the values are either numeric or NULL you can use ISNUMERIC() for this:
SELECT *
FROM YourTable
WHERE ISNUMERIC(VAL)+ISNUMERIC(EQ)+ISNUMERIC(EFF)+ISNUMERIC(SIZE)
+ISNUMERIC(FSCR)+ISNUMERIC(MSCR)+ISNUMERIC(SY) >= 4