Given the data set:
field_a   description_1   metric_1
ball      large           20
ball      small           4
cat       null            null
I want to pack fields description_1 and metric_1 into a struct and then zip them into an array by field_a:
WITH
DATA AS (
SELECT
'ball' AS field_a,
'large' AS description_1,
20 AS metric_1
UNION ALL
SELECT
'ball',
'small',
4
UNION ALL
SELECT
'cat',
NULL,
NULL )
SELECT
field_a,
ARRAY_AGG(STRUCT(description_1,
metric_1))
FROM
DATA
GROUP BY
1;
That gives me:
field_a   f0_
ball      [{large, 20}, {small, 4}]
cat       [{null, null}]
However, if all fields in the struct are null I would like to see an empty array instead of an array of size 1 with a struct of nulls inside it.
Desired output:
field_a   f0_
ball      [{large, 20}, {small, 4}]
cat       []
I figured out the following query to test the fields before packing them into the struct:
WITH
DATA AS (
SELECT
'ball' AS field_a,
'large' AS description_1,
20 AS metric_1
UNION ALL
SELECT
'ball' AS field_a,
'small',
4
UNION ALL
SELECT
'cat',
NULL,
NULL )
SELECT
field_a,
ARRAY_AGG(
IF
(description_1 IS NOT NULL
OR metric_1 IS NOT NULL,
STRUCT(description_1,
metric_1),
NULL) IGNORE NULLS)
FROM
DATA
GROUP BY
1;
However, it feels like a tedious solution. Is there a better way to test whether a struct contains at least one non-null value, or any other solution to achieve the desired output more elegantly?
The approach below might be a slightly cleaner one:
select * from (select distinct field_a from your_table)
left join (
select field_a, array_agg(struct(description_1, metric_1))
from your_table
where not (description_1 is null and metric_1 is null)
group by field_a
)
using(field_a)
if applied to the sample data in your question, the output is:
field_a   f0_
ball      [{large, 20}, {small, 4}]
cat       []
Note: there are more ways to write that where clause, depending on your taste,
for example
where format('%t', (description_1, metric_1)) != '(NULL, NULL)'
or
where to_json_string((description_1, metric_1)) != '{"":null,"":null}'
But, I feel, using the explicit not ... is null version is more natural and explicit, so it's easier to swallow if you have just two or three columns to check.
When you have more - like 5 or 10, etc. - the last two versions can produce more compact code ...
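For instance, here is a minimal sketch (my combination of the left-join query above with the to_json_string predicate; the items alias is just for readability) where the filter stays a single line no matter how many columns you check:
select * from (select distinct field_a from your_table)
left join (
  select field_a, array_agg(struct(description_1, metric_1)) as items
  from your_table
  where to_json_string((description_1, metric_1)) != '{"":null,"":null}'
  group by field_a
)
using(field_a)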
Related
I am performing data QA testing.
I have this query to establish any errors between the source table and the destination table.
select
count(case when coalesce(x.col1,1) = coalesce(y.col1,1) then null else 1 end) as cnt_col1,
count(case when coalesce(x.col2,"1") = coalesce(y.col2,"1") then null else 1 end) as cnt_col2
from
`DatasetA.Table` x
FULL OUTER JOIN
`DatasetB.Table` y
on x.col1 = y.col1
The output of this query is like this:
col1, col2
null, null
null, null
1, null
null, 1
I have 200 tables that I need to perform this test on, and the number of columns is dynamic. The table above has only two columns; some have 50.
I have the queries for the tables already, but I need to conform the output of all of the tests into a single output. My plan is to conform each query into a unified output and join them together using a UNION ALL.
The output set should say:
COLUMN, COUNT_OF_ERRORS
cnt_col1, 1
cnt_col2, 1
...
cnt_col15, 0
My question is this: how do I reverse pivot this so I can achieve the output I'm looking for?
Thanks
How do I reverse pivot this so I can achieve the output I'm looking for?
Assuming you have table `data`
col1 col2 col3
---- ---- ----
null null null
null null 1
null 1 1
1 null 1
1 null 1
1 null 1
And you need reverse pivot it to
column count_of_errors
-------- ---------------
cnt_col1 3
cnt_col2 1
cnt_col3 5
Below is for BigQuery Standard SQL and does exactly this
#standardSQL
WITH `data` AS (
SELECT NULL AS col1, NULL AS col2, NULL AS col3 UNION ALL
SELECT NULL, NULL, 1 UNION ALL
SELECT 1, NULL, 1 UNION ALL
SELECT NULL, 1, 1 UNION ALL
SELECT 1, NULL, 1 UNION ALL
SELECT 1, NULL, 1
)
SELECT r.* FROM (
SELECT
[
STRUCT<column STRING, count_of_errors INT64>
('cnt_col1', SUM(col1)),
('cnt_col2', SUM(col2)),
('cnt_col3', SUM(col3))
] AS row
FROM `data`
), UNNEST(row) AS r
It is simple enough and easy to adjust to any number of columns in your initial `data` table - you just need to add the respective number of ('cnt_colN', SUM(colN)), entries. This can be done manually, or you can write a simple script to generate those lines (or the whole query) - see the sketch below.
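If you'd rather generate those lines than type them, a possible sketch (my assumption: BigQuery's INFORMATION_SCHEMA.COLUMNS is available for your dataset; your_dataset and your_table are placeholders):
-- emits one ('cnt_colN', SUM(colN)), line per column;
-- filter out key columns as needed and drop the trailing comma on the last line
SELECT STRING_AGG(FORMAT("('cnt_%s', SUM(%s)),", column_name, column_name), '\n')
FROM your_dataset.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'your_table';
Paste the result into the [...] list of the query above.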
About "comparing 2 tables" in Big Data, I don't think that doing some Joins is the best approach, because Joins are quite slow in general and then you have to handle the case of "outer" joins rows.
I worked on this topic years ago (https://community.hortonworks.com/articles/1283/hive-script-to-validate-tables-compare-one-with-an.html) and I am now trying to backport this knowledge to compare Hive tables with BigQuery tables.
One of my main ideas is to use checksums to be sure that a table is fully identical to the other one.
Here is a "basic example":
with one_string as(
select concat(
    sessionid, '|', referrercode, '|', purchaseid, '|', customerid, '|',
    cast(bouncerateind as string), '|', cast(productpagevisit as string), '|',
    cast(itemordervalue as string), '|', cast(purchaseinsession as string), '|',
    cast(hit_time_gmt as string), '|', datedir, '|', productcategory, '|',
    post_cookies) as bigstring
from bidwh2.omniture_2017_03_24_v2
),
shas as(
select TO_BASE64( sha1( bigstring)) as sha from one_string
),
shas_prefix as(
select substr( sha, 0 , 1) as prefix, sha from shas
),
shas_ordered as(
select prefix, sha from shas_prefix order by sha
),
results_prefix as(
select concat( prefix, ' ', TO_BASE64( sha1( STRING_AGG( sha, '|')))) as res from shas_ordered group by prefix
),
results_ordered as(
select 1 as myall, res from results_prefix order by res
)
select SHA1( STRING_AGG( res, '|')) as sha from results_ordered group by myall;
So you do that on each of the 2 tables and compare the 2 checksums.
The final idea is to have a Python script (not finished yet - I hope my company allows me to open-source it when finished) that would do the following:
- count the rows for some "buckets" (groups of rows whose well-distributed column has the same checksum modulo a big number) and compare the results, because there is no need to checksum the whole table if the row counts do not match (see the sketch below)
- visually show the differences if counts do not match
- use the bucket/rows technique plus some other "buckets/columns" to do checksums in a similar way as shown in the example above, and compare all those checksums together
- visually show the differences if checksums do not match
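As a rough illustration of the bucket-counting step, a sketch in BigQuery (my assumption of how it could look; key_col stands in for the well-distributed column, and 1000 is an arbitrary bucket count):
-- run on both tables and diff the per-bucket counts;
-- only buckets whose counts differ need the full checksum treatment
SELECT MOD(ABS(FARM_FINGERPRINT(CAST(key_col AS STRING))), 1000) AS bucket,
       COUNT(*) AS row_count
FROM your_dataset.your_table
GROUP BY bucket
ORDER BY bucket;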
Edit on 03/11/2017: script is finished and can be found at: https://github.com/bolcom/hive_compared_bq
So I have a table called Value that's associated with different 'Fields'. Note that some of these fields have similar 'names' but they are named differently. Ultimately I want these 'similar names' to be pivoted/grouped as the same field name in the result set.
VALUE_ID VALUE_TX FIELD_NAME Version_ID
1 Yes Adult 1
2 18 Age 1
3 Black Eye Color 1
4 Yes Is_Adult 2
5 25 Years_old 2
6 Brown Color_of_Eyes 2
I have a table called Submitted that looks like the following:
Version_ID Version_Name
1 TEST_RUN
2 REAL_RUN
I need a result set that Looks like this:
Submitted_Name Adult? Age Eye_Color
TEST_RUN Yes 18 Black
REAL_RUN Yes 25 Brown
I've tried the following:
SELECT * FROM (
select value_Tx, field_name, version_id
from VALUE
)
PIVOT (max (value_tx) for field_name in (('Adult', 'Is_Adult') as 'Adult?', ('Age', 'Years_old') as 'Age', ('Eye Color', 'Color_of_Eyes') as 'Eye_Color')
);
What am I doing wrong? Please let me know if I need to add any additional details / data.
Thanks in advance!
The error message that I am getting is the following:
ORA-00907: missing right parenthesis
I would change the field names in the subquery:
SELECT *
FROM (select value_Tx,
(case when field_name in ('Adult', 'Is_Adult') then 'Adult?'
             when field_name in ('Age', 'Years_old') then 'Age'
             when field_name in ('Eye Color', 'Color_of_Eyes') then 'Eye_Color'
else field_name
end) as field_name, version_id
from VALUE
)
PIVOT (max(value_tx) for field_name in ('Adult?', 'Age', 'Eye_Color'));
You can use double quotes for column aliasing within the pivot clause, and I think the decode function suits this question well. You can consider using the following query:
with value( value_id, value_tx, field_name, version_id ) as
(
select 1 ,'Yes' ,'Adult' ,1 from dual union all
select 2 ,'18' ,'Age' ,1 from dual union all
select 3 ,'Black','Eye_Color' ,1 from dual union all
select 4 ,'Yes' ,'Is_Adult' ,2 from dual union all
select 5 ,'25' ,'Years_old' ,2 from dual union all
select 6 ,'Brown','Color_of_Eyes',2 from dual
), Submitted( version_id, version_name ) as
(
select 1 ,'TEST_RUN' from dual union all
select 2 ,'REAL_RUN' from dual
)
select * from
(
select s.version_name as "Submitted_Name", v.value_Tx,
decode(v.field_name,'Adult','Is_Adult','Age','Years_old','Eye_Color',
'Color_of_Eyes',v.field_name) field_name
from value v
join Submitted s
on s.version_id = v.version_id
group by decode(v.field_name,'Adult','Is_Adult','Age','Years_old','Eye_Color',
'Color_of_Eyes',v.field_name),
v.value_Tx, s.Version_Name
)
pivot(
max(value_tx) for field_name in ( 'Is_Adult' as "Adult?", 'Years_old' as "Age",
'Color_of_Eyes' as "Eye_Color" )
);
Submitted_Name Adult? Age Eye_Color
REAL_RUN Yes 25 Brown
TEST_RUN Yes 18 Black
I think it's better to solve this in a shorter way; as an example, using modular arithmetic would even be better, as below:
select *
from
(
select s.version_name as "Submitted_Name", v.value_Tx, mod(v.value_id,3) as value_id
from value v
join Submitted s
on s.version_id = v.version_id
group by v.value_Tx, s.version_name, mod(v.value_id,3)
)
pivot(
max(value_tx) for value_id in ( 1 as "Adult?", 2 as "Age", 0 as "Eye_Color" )
)
Demo
Is there a concept (with an implementation - in Oracle SQL for starters) which behaves like a 'universal' matcher?
What I mean is: I know NULL is not equal to anything - including NULL.
Which is why you have to be careful to use 'IS NULL' rather than '= NULL' in SQL expressions.
I also know it is useful to use the NVL (in Oracle) function to detect a NULL and replace it with something in the output.
However: what you replace the NULL with using NVL has to match the datatype of the underlying column; otherwise you'll (rightly) get an error.
An example:
I have a table with a NULLABLE column 'name' of type VARCHAR2; and this contains a NULL row.
I can fetch out the NULL and replace it with an NVL like this:
SELECT NVL(name, 'NullyMcNullFace’) from my_table;
Great.
But if the column happens to be a NUMBER (say 'age'), then I have to change my NVL:
SELECT NVL(age, 32) from my_table;
Also great.
Now if the column happens to be a DATE (say 'somedate'), then I have to change my NVL again:
SELECT NVL(somedate, sysdate) from my_table;
What I'm getting at here is that in order to deal with NULLs you have to replace them with a specific something; and that specific something has to 'fit' the datatype.
So is there a construct/concept of (for want of a better word) 'ANY' here?
Where 'ANY' would fit into a column of any datatype (like NULL), but (unlike NULL and unlike all other specific values) would match ANYTHING (including NULL - ? probably urghhh dunno).
So that I could do:
SELECT NVL(whatever_column, ANY) from my_table;
I think the answer is probably no; and probably 'go away, NULLs are bad enough - never mind this monster you have half-thought of'.
No, there's no "universal acceptor" value in SQL that is equal to everything.
What you can do is raise the NVL into your comparison. Like if you're trying to do a JOIN:
SELECT ...
FROM my_table AS m
JOIN other_table AS o ON o.name = NVL(m.name, o.name)
So if m.name is NULL, then the join will compare o.name to o.name, which is true whenever o.name itself is not NULL (a NULL o.name still won't match anything).
For other uses of NULL, you might have to use another technique that suits the situation.
Addressing the question in the comment on Bill Karwin's answer:
I want to output a 1 if the NEW and OLD value differ and a 0 if they are the same. But (for my purposes) I want to also return 0 for two NULLS.
select
Case When (:New = :Old) or
(:New is NULL and :Old is NULL) then 0
Else
1
End
from dual
In a WHERE clause you can put a condition like this:
WHERE column1 LIKE NVL(any_column_or_param, '%')
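For example, a sketch against a hypothetical employees table, with :p_dept as an optional bind parameter:
-- when :p_dept is NULL the filter degrades to LIKE '%', i.e. no filtering
SELECT *
FROM employees
WHERE department LIKE NVL(:p_dept, '%');
Note that LIKE '%' still won't match rows where department itself is NULL, so this is not quite a universal matcher either.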
Perhaps DECODE() would suit your purpose here?
WITH t1 AS (SELECT 1 ID, NULL val FROM dual UNION ALL
SELECT 2 ID, NULL val FROM dual UNION ALL
SELECT 3 ID, 1 val FROM dual UNION ALL
SELECT 4 ID, 2 val FROM dual UNION ALL
SELECT 5 ID, 5 val FROM dual),
t2 AS (SELECT 1 ID, NULL val FROM dual UNION ALL
SELECT 2 ID, 3 val FROM dual UNION ALL
SELECT 3 ID, 1 val FROM dual UNION ALL
SELECT 4 ID, 4 val FROM dual UNION ALL
SELECT 6 ID, 5 val FROM dual)
SELECT t1.id t1_id,
t1.val t1_val,
t2.id t2_id,
t2.val t2_val,
DECODE(t1.val, t2.val, 0, 1) different_vals
FROM t1
FULL OUTER JOIN t2 ON t1.id = t2.id
ORDER BY COALESCE(t1.id, t2.id);
T1_ID T1_VAL T2_ID T2_VAL DIFFERENT_VALS
---------- ---------- ---------- ---------- --------------
1 1 0
2 2 3 1
3 1 3 1 0
4 2 4 4 1
5 5 1
6 5 1
I have two tables with the following data:
[Animals].[Males]
DataID HerdNumber HerdID NaabCode
e46fff54-a784-46ed-9a7f-4c81e649e6a0 4 'GOLDA' '7JE1067'
fee3e66b-7248-44dd-8670-791a6daa5d49 1 '35' NULL
[Animals].[Females]
DataID HerdNumber HerdID BangsNumber
987110c6-c938-43a7-a5db-194ce2162a20 1 '9' 'NB3829483909488'
1fc83693-9b8a-4054-9d79-fbd66ee99091 2 'NATTIE' 'ID2314843985499'
I want to merge these tables into a view that looks like this:
DataID HerdNumber HerdID NaabCode BangsNumber
e46fff54-a784-46ed-9a7f-4c81e649e6a0 4 'GOLDA' '7JE1067' NULL
fee3e66b-7248-44dd-8670-791a6daa5d49 1 '35' NULL NULL
987110c6-c938-43a7-a5db-194ce2162a20 1 '9' NULL 'NB3829483909488'
1fc83693-9b8a-4054-9d79-fbd66ee99091 2 'NATTIE' NULL 'ID2314843985499'
When I used the UNION keyword, SQL Server produced a view that merged the NaabCode and BangsNumber into one column. A book that I have on regular SQL suggested UNION CORRESPONDING syntax like so:
SELECT *
FROM [Animals].[Males]
UNION CORRESPONDING (DataID, HerdNumber, HerdID)
SELECT *
FROM [Animals].[Females]
But when I type this SQL Server says "Incorrect syntax near 'CORRESPONDING'."
Can anyone tell me how to achieve my desired result and/or how to use UNION CORRESPONDING in T-SQL?
You can just do:
SELECT DataID, HerdNumber, HerdID, NaabCode, NULL as BangsNumber
FROM [Animals].[Males]
UNION ALL
SELECT DataID, HerdNumber, HerdID, NULL as NaabCode, BangsNumber
FROM [Animals].[Females]
SQL Fiddle
SQL Server does not support the CORRESPONDING syntax - it is in the SQL standard, but T-SQL never implemented it.
Anyway, this query will select null for the BangsNumber column for the males, and for the NaabCode column for the females, while selecting everything else correctly.
Just do the union explicitly listing the columns:
select DataID, HerdNumber, HerdID, NaabCode, NULL as BangsNumber
from Animals.Males
union all
select DataID, HerdNumber, HerdID, NULL, BangsNumber
from Animals.Females;
Note: you should use union all instead of union (assuming that no single animal is both male and female). union incurs a performance overhead to remove duplicates.
SELECT DataID, HerdNumber, HerdID, NaabCode, '' AS BangsNumber
FROM [Animals].[Males]
UNION ALL
SELECT DataID, HerdNumber, HerdID, '' as NaabCode, BangsNumber
FROM [Animals].[Females]
You need to state the columns in each select
SELECT DataID, HerdNumber, HerdID
FROM [Animals].[Males]
UNION
SELECT DataID, HerdNumber, HerdID
FROM [Animals].[Females]
I have a table with test fields, Example
id | test1 | test2 | test3 | test4 | test5
+----------+----------+----------+----------+----------+----------+
12345 | P | P | F | I | P
So for each record I want to know how many Passed, Failed or Incomplete (P, F or I) values there are.
Is there a way to GROUP BY value?
Pseudo:
SELECT ('P' IN (fields)) AS pass
WHERE id = 12345
I have about 40 test fields that I need to somehow group together and I really don't want to write this super ugly, long query. Yes I know I should rewrite the table into two or three separate tables but this is another problem.
Expected Results:
passed | failed | incomplete
+----------+----------+----------+
3 | 1 | 1
Suggestions?
Note: I'm running PostgreSQL 7.4 and yes we are upgrading
I may have come up with a solution:
SELECT id
,l - length(replace(t, 'P', '')) AS nr_p
,l - length(replace(t, 'F', '')) AS nr_f
,l - length(replace(t, 'I', '')) AS nr_i
FROM (SELECT id, test::text AS t, length(test::text) AS l FROM test) t
The trick works like this:
- Transform the rowtype into its text representation.
- Measure character-length.
- Replace the character you want to count and measure the change in length.
- Compute the length of the original row in the subselect for repeated use.
This requires that P, F, I are present nowhere else in the row. Use a sub-select to exclude any other columns that might interfere.
Tested in 8.4 - 9.1. Nobody uses PostgreSQL 7.4 anymore these days, so you'll have to test it yourself. I only use basic functions, but I am not sure whether casting the rowtype to text is feasible in 7.4. If that doesn't work, you'll have to concatenate all test columns by hand:
SELECT id
,length(t) - length(replace(t, 'P', '')) AS nr_p
,length(t) - length(replace(t, 'F', '')) AS nr_f
,length(t) - length(replace(t, 'I', '')) AS nr_i
FROM (SELECT id, test1||test2||test3||test4||test5 AS t FROM test) t
This requires all columns to be NOT NULL.
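If some of them can be NULL, a sketch of the usual workaround (my addition, in the same spirit) is to wrap each column in COALESCE before concatenating:
SELECT id
      ,length(t) - length(replace(t, 'P', '')) AS nr_p
      ,length(t) - length(replace(t, 'F', '')) AS nr_f
      ,length(t) - length(replace(t, 'I', '')) AS nr_i
FROM (SELECT id, coalesce(test1,'') || coalesce(test2,'') || coalesce(test3,'')
              || coalesce(test4,'') || coalesce(test5,'') AS t FROM test) t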
Essentially, you need to unpivot your data by test:
id | test | result
+----------+----------+----------+
12345 | test1 | P
12345 | test2 | P
12345 | test3 | F
12345 | test4 | I
12345 | test5 | P
...
- so that you can then group it by test result (the grouping step is sketched after the query below).
Unfortunately, PostgreSQL doesn't have pivot/unpivot functionality built in, so the simplest way to do this would be something like:
select id, 'test1' test, test1 result from mytable union all
select id, 'test2' test, test2 result from mytable union all
select id, 'test3' test, test3 result from mytable union all
select id, 'test4' test, test4 result from mytable union all
select id, 'test5' test, test5 result from mytable union all
...
There are other ways of approaching this, but with 40 columns of data this is going to get really ugly.
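Once the data is unpivoted, the grouping step might look like this (a sketch; extend the union to all 40 test columns):
select result, count(*) as cnt
from (
  select id, 'test1' test, test1 result from mytable union all
  select id, 'test2' test, test2 result from mytable
  -- ... one branch per remaining test column ...
) unpivoted
where id = 12345
group by result;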
EDIT: an alternative approach -
select r.result,
       sum(char_length(replace(replace(test1||test2||test3||test4||test5, excl1, ''), excl2, '')))
from mytable m,
(select 'P' result, 'F' excl1, 'I' excl2 union all
select 'F' result, 'P' excl1, 'I' excl2 union all
select 'I' result, 'F' excl1, 'P' excl2) r
group by r.result
You could use an auxiliary on-the-fly table to turn columns into rows, then you would be able to apply aggregate functions, something like this:
SELECT
  SUM(CASE WHEN fields = 'P' THEN 1 ELSE 0 END) AS passed,
  SUM(CASE WHEN fields = 'F' THEN 1 ELSE 0 END) AS failed,
  SUM(CASE WHEN fields = 'I' THEN 1 ELSE 0 END) AS incomplete
FROM (
SELECT
t.id,
CASE x.idx
WHEN 1 THEN t.test1
WHEN 2 THEN t.test2
WHEN 3 THEN t.test3
WHEN 4 THEN t.test4
WHEN 5 THEN t.test5
END AS fields
FROM atable t
CROSS JOIN (
SELECT 1 AS idx
UNION ALL SELECT 2
UNION ALL SELECT 3
UNION ALL SELECT 4
UNION ALL SELECT 5
) x
WHERE t.id = 12345
) s
Edit: just saw the comment about 7.4, I don't think this will work with that ancient version (unnest() came a lot later). If anyone thinks this is not worth keeping, I'll delete it.
Taking Erwin's idea of using the "row representation" as a base for the solution a bit further and automatically "normalizing" the table on the fly:
select id,
sum(case when flag = 'F' then 1 else null end) as failed,
sum(case when flag = 'P' then 1 else null end) as passed,
sum(case when flag = 'I' then 1 else null end) as incomplete
from (
select id,
unnest(string_to_array(trim(trailing ')' from substr(all_column_values,strpos(all_column_values, ',') + 1)), ',')) flag
from (
SELECT id,
not_normalized::text AS all_column_values
FROM not_normalized
) t1
) t2
group by id
The heart of the solution is Erwin's trick of making a single value out of the complete row using the cast not_normalized::text. The string functions are applied to strip off the leading id value and the parentheses around it.
The result of that is transformed into an array and that array is transformed into a result set using the unnest() function.
To understand that part, simply run the inner selects step by step.
Then the result is grouped and the corresponding values are counted.