I have two large tables that share some columns, and some rows have the same values in those shared columns. Here's a toy example (the real tables have dozens of columns, both shared and not):
Table 1: a, b, c
Table 2: a, d, e
Some values of a are in only one table, some are in both.
Is there a query that will let me generate a table with all values where available:
Table 3: a, b, c, d, e
My current query requires listing every column, which is very verbose with dozens of columns, and inflexible when the schema changes:
SELECT
  coalesce(t1.a, t2.a) AS a,
  t1.b,
  t1.c,
  t2.d,
  t2.e
FROM t1
FULL JOIN t2
USING (a)
Things I've tried: UNION seems to require the same schema, SELECT t1.*, t2.* raises an error on overlapping columns, SELECT t1.* ... USING (a) will give nulls for values in a where there are values only in t1.a.
Before BigQuery Standard SQL became available to everyone on June 2, 2016, I was extremely happy with what is now called BigQuery Legacy SQL. I still enjoy it from time to time for some specific use cases.
I think the case you describe in your question is exactly one where you can use a feature of Legacy SQL to resolve your issue.
So, the query below is for BigQuery Legacy SQL:
#legacySQL
SELECT *
FROM [project:dataset.table1],
[project:dataset.table2]
Note: in BigQuery Legacy SQL, a comma (,) between tables means UNION ALL.
A super-simplified example of the above is:
#legacySQL
SELECT *
FROM (SELECT 1 a, 2 b, 3 c, 11 x),
(SELECT 1 a, 4 d, 5 e, 12 x)
with result
Row  a  b     c     x   d     e
1    1  2     3     11  null  null
2    1  null  null  12  4     5
Note: you cannot mix Legacy and Standard SQL in the same query, so if you need to use Standard SQL against the resulting UNION, you will first need to materialize (save) the result as a table and then query that table using Standard SQL.
Is there any way with Standard SQL?
You can use INFORMATION_SCHEMA to script out the columns of both tables and build a list of all the columns involved, but you will still need to copy and paste the result into the final query to run it.
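As a rough sketch of that scripting step, the final query text can be generated from the two column lists. The helper name build_full_join_query is hypothetical, and the sketch assumes the column names have already been fetched (for example from INFORMATION_SCHEMA.COLUMNS) and that the join key is the only column the two tables share:

```python
def build_full_join_query(t1, t1_cols, t2, t2_cols, key):
    """Return the text of a FULL JOIN query that lists every column once.

    Hypothetical helper. Assumes `key` is the only column the two tables share.
    """
    select = [f"COALESCE({t1}.{key}, {t2}.{key}) AS {key}"]
    select += [f"{t1}.{c}" for c in t1_cols if c != key]
    select += [f"{t2}.{c}" for c in t2_cols if c != key]
    return (
        "SELECT\n  " + ",\n  ".join(select)
        + f"\nFROM {t1}\nFULL JOIN {t2}\nUSING ({key})"
    )


# Column lists for the toy example from the question.
query = build_full_join_query("t1", ["a", "b", "c"], "t2", ["a", "d", "e"], "a")
print(query)
```

The generated text still has to be pasted into (or submitted as) the final query, but it no longer needs to be maintained by hand when the schema changes.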
I'm currently writing a relatively complex SQL statement which selects data from multiple tables and has quite a few sub-statements and joins.
In my "final" data set, I want to return raw data as well as comparisons between the raw data. While I can do this when the raw data is found using a Join, is it possible to do this while the raw data is found in a sub-query?
For example:
If I have a query which is
SELECT
A
,(SELECT B FROM BETA WHERE Row = ALPHA.Betalink) B
FROM ALPHA
WHERE A > 1
Can I add a column which compares A and B without adding another Select?
The only way I know to solve this would be to do the above select, then select on that:
SELECT
A
,B
,greater(A,B)
FROM
(SELECT A
,(SELECT B FROM BETA WHERE Row = ALPHA.Betalink) B
FROM ALPHA
WHERE A > 1
)
TIA
I think you are looking for a with clause.
The most important point is that the WITH query is run once, while a scalar subquery can be run for every returned row.
You can read about it here:
subquery_factoring_clause
In your example it could look like this:
WITH SUBQ_DATA as (SELECT B,Row FROM BETA)
SELECT alpha.A
,sub.B
,greater(alpha.A,sub.B)
FROM ALPHA alpha
JOIN SUBQ_DATA sub on sub.Row = alpha.Betalink
WHERE A > 1
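To try the same shape of query without an Oracle instance, here is a minimal sketch using Python's built-in sqlite3 module. The sample rows are made up, the BETA column Row is renamed row_id to stay clear of keywords, and SQLite's two-argument max() stands in for the greater() comparison in the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE alpha (a INTEGER, betalink INTEGER);
    CREATE TABLE beta  (row_id INTEGER, b INTEGER);
    -- Made-up sample data for illustration only.
    INSERT INTO alpha VALUES (1, 10), (5, 20), (7, 30);
    INSERT INTO beta  VALUES (10, 9), (20, 3), (30, 8);
""")

query = """
WITH subq_data AS (SELECT b, row_id FROM beta)
SELECT alpha.a,
       sub.b,
       max(alpha.a, sub.b) AS larger   -- stand-in for greater(A, B)
FROM alpha
JOIN subq_data sub ON sub.row_id = alpha.betalink
WHERE alpha.a > 1
ORDER BY alpha.a
"""
rows = conn.execute(query).fetchall()
print(rows)  # [(5, 3, 5), (7, 8, 8)]
```

Note that this is an inner join, so ALPHA rows with no matching BETA row are dropped, whereas the scalar subquery in the question would return them with a NULL B.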
Hive's function explode is documented here
It is essentially a very practical function that generates many rows from a single one. Its basic version takes a column whose value is an array of values and produces a copy of the same row for each of those values.
I wonder whether such a thing exists in Impala. I haven't been able to find it in the documentation.
Impala does not have any function like EXPLODE in Hive to read complex data types and generate multiple rows.
Currently, through Impala, we can only read the complex data types in Hive-generated tables using dot notation, like select employee.empid from table1.
Impala can query complex type columns only from Parquet tables or Parquet partitions within partitioned tables.
The very very tricky approach:
with A as (select 'row 1' as key, 'a;b;c' as value
union all
select 'row 2' as key, 'd;e' as value
union all
select 'row 3' as key, 'f' as value),
B as (select *, length(value) - length(regexp_replace(value,';','')) + 1 as n from A),
-- assuming you have at least as many rows as different values in a single row
C as (select row_number() over(order by key) as seq, n from B),
D as (select seq from C where seq <= (select max(n) from C))
select key, value, split_part(value,';',seq) as part
from B
cross join D
where seq <= n
order by key,seq
Important: notice the comment "assuming you have at least as many rows as different values in a single row".
Just run it and see the result.
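For readers without an Impala cluster, the same row-numbering trick can be sketched with Python's built-in sqlite3 module. This is only an approximation of the query above: SQLite has no split_part, so a stand-in is registered from Python (assumed 1-based, like Impala's), replace() is used instead of regexp_replace(), and the columns are renamed row_key and txt. It needs an SQLite build with window functions (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")


def split_part(value, delim, index):
    """Stand-in for Impala's split_part: 1-based, None when out of range."""
    parts = value.split(delim)
    return parts[index - 1] if 1 <= index <= len(parts) else None


conn.create_function("split_part", 3, split_part)

query = """
with A as (select 'row 1' as row_key, 'a;b;c' as txt
           union all select 'row 2', 'd;e'
           union all select 'row 3', 'f'),
-- n = number of ';'-separated values in each row
B as (select *, length(txt) - length(replace(txt, ';', '')) + 1 as n from A),
-- reuse the table's own rows as a source of sequence numbers
C as (select row_number() over (order by row_key) as seq, n from B),
D as (select seq from C where seq <= (select max(n) from C))
select row_key, split_part(txt, ';', seq) as part
from B cross join D
where seq <= n
order by row_key, seq
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With this sample data each input row yields one output row per value, which works only because the table has three rows and no value list is longer than three items.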
Lateral views. In CDH 5.5 / Impala 2.3 and higher, Impala supports queries on complex types (STRUCT, ARRAY, or MAP), using join notation rather than the EXPLODE() keyword. See Complex Types (CDH 5.5 or higher only) for details about Impala support for complex types.
Use another table B that is unfolded from the array of values, and then inner join table A with table B.
See Impala Doc
https://docs.cloudera.com/documentation/enterprise/5-9-x/topics/impala_langref_unsupported.html
Ok...so what I'm trying to do is to have a query (I can't use PL/SQL as the query is utilized by an application that can't handle PL/SQL) that simply queries a table and if a particular condition isn't met, it actually creates a record with that condition in the returned results (not actually create a record in a table).
To set this up, imagine there is only one table with the following columns: ID, TEST, and SPEC and may have data like the following:
1234 LIMIT_TEST Total of limits
4321 LIMIT_TEST Total of limits
5678 LIMIT_TEST Etha
8765 LIMIT_TEST Metha
The SPEC column is produced by a case, when, then statement that pulls expressions out of a SPECIFICATION column.
So you'll see there are actually 3 LIMIT_TESTs:
Total of Limits
Etha
Metha
However, for ID 1234, there is only "Total of limits". What I need to have the query return is something like:
1234 LIMIT_TEST Total of limits
1234 LIMIT_TEST null Etha
1234 LIMIT_TEST null Metha
(Imagine in the case statement a column is added to put what the nulls are for).
Any ideas are appreciated.
You could form a UNION between your main query and another which includes a static NULL in its SELECT clause, and uses a NOT EXISTS in its WHERE clause to determine the absence of Etha and Metha.
select id, test, decode(spec, ms, spec) spec, nullif(ms, spec) missing
from (select id, test, spec, ms,
row_number() over (partition by id, ms order by decode(spec, ms, 1)) rn
from t cross join (select distinct spec ms from t) dt )
where rn = 1
SQLFiddle (I added one row here for id=1234, spec='Etha' to check the scenario
where two specs exist for one id). The table name is T, not creative.
Explanation:
select distinct spec - obvious step
cross join distinct specs with our table - probably must be done somehow in any solution (union, exists, etc.)
enumerate the rows so that rows where the specs are equal get priority - this is done by row_number()
take only rows with rn = 1, rest is the matter of presentation (functions decode and nullif).
This will do it...
select
c.id, c.test, d.spec, case when d.spec is null then c.spec else null end as missing_spec
from
(select a.id, a.test, b.spec from TABLE_NAME a, (select distinct spec from TABLE_NAME) b) c,
TABLE_NAME d
where c.id = d.id (+) and c.test = d.test (+) and c.spec = d.spec (+)
order by c.id, c.spec;
Assumption: There will only ever be one record in the table for each unique combination of id, test, and spec.
1) Cartesian join the source table with a distinct list of the spec values. This will provide a base result list having a record for each unique combination of all possible ids, tests, and spec values.
2) Left outer join the source table. This will allow you to identify which of all the possible unique combination are actually present in the source table.
3) Add a case to the select clause for the final results column that displays null when the combination is found and the spec value if missing.
If it is possible for the source table to have multiple records for a single combination of id, test, and spec, then you would want to add distinct before the a.id in line 4 (as mentioned by Ponder Stibbons).
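The three steps can be tried out with a small sketch using Python's built-in sqlite3 module. It is not the Oracle query itself: the table is named t, the (+) outer-join syntax is replaced by an ANSI LEFT JOIN, the sample rows are the four from the question, and the optional distinct mentioned above is already applied:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (id integer, test text, spec text)")
conn.executemany(
    "insert into t values (?, ?, ?)",
    [
        (1234, "LIMIT_TEST", "Total of limits"),
        (4321, "LIMIT_TEST", "Total of limits"),
        (5678, "LIMIT_TEST", "Etha"),
        (8765, "LIMIT_TEST", "Metha"),
    ],
)

query = """
select c.id, c.test, d.spec,
       case when d.spec is null then c.spec end as missing_spec
from (select distinct a.id, a.test, b.spec          -- step 1: every id x every spec
      from t a cross join (select distinct spec from t) b) c
left join t d                                       -- step 2: which combinations exist
  on c.id = d.id and c.test = d.test and c.spec = d.spec
order by c.id, c.spec
"""
rows = conn.execute(query).fetchall()
for row in rows[:3]:
    print(row)
# (1234, 'LIMIT_TEST', None, 'Etha')
# (1234, 'LIMIT_TEST', None, 'Metha')
# (1234, 'LIMIT_TEST', 'Total of limits', None)
```

Each of the four ids comes back with three rows, one per distinct spec, with missing_spec filled in only where the combination is absent from the table.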
I have two tables A and B with the same column names. I have to combine them into table C.
When I run the following query, the count does not match -
select * into C
from
(
select * from A
union
select * from B
)X
The record count of C does not match the combined count of A and B. There is a difference of 89 rows, so I figured that there are duplicates.
I used the following query to find the duplicates -
select * from A
INTERSECT
select * from B
-- 80 rows returned
Can anybody tell me why INTERSECT returns 80 duplicates whereas the count difference when using UNION is 89?
There are probably duplicates inside of A and/or B as well. The set operators UNION, INTERSECT and EXCEPT (without ALL) perform an implicit DISTINCT on the result (logically, not necessarily physically).
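A small sketch with Python's built-in sqlite3 module shows how the two numbers can differ. The data is made up: A contains an internal duplicate, and only one value is shared with B.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (x INTEGER);
    CREATE TABLE B (x INTEGER);
    -- Made-up data: the value 1 is duplicated inside A, the value 2 is in both tables.
    INSERT INTO A VALUES (1), (1), (2);
    INSERT INTO B VALUES (2), (3);
""")


def count(sql):
    """Number of rows returned by the given query."""
    return conn.execute(f"SELECT count(*) FROM ({sql})").fetchone()[0]


total = count("SELECT * FROM A") + count("SELECT * FROM B")
union_rows = count("SELECT * FROM A UNION SELECT * FROM B")
union_all_rows = count("SELECT * FROM A UNION ALL SELECT * FROM B")
intersect_rows = count("SELECT * FROM A INTERSECT SELECT * FROM B")

print(total - union_rows, intersect_rows)  # 2 1
```

UNION drops two rows here (the duplicate inside A and the row shared with B), while INTERSECT reports only the one shared value. UNION ALL keeps all five rows.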
Duplicate rows are usually a data-quality issue or an outright bug. I usually mitigate this risk by adding unique indexes on all columns and column sets that are supposed to be unique. I especially make sure that every table has a primary key if that is at all possible.
Suppose I have three tables Table A,Table B and Table C.
Table A contains the col t1 with entries 1,2,2,3,4,4.
Table B has col t2 with entries 1,3,4,4.
Table C has col t3 with entries 1,2,4,4.
The query given was
SELECT * FROM A EXCEPT (SELECT * FROM B INTERSECT SELECT * FROM C ).
I saw this question in a test paper. It was mentioned that the expected answer was 2 but the answer obtained from this query was 1,2,4. I am not able to understand the principle behind this.
Well, as I see it, both the expected answer and the answer you obtained are wrong. It may be the RDBMS that you are using, but analyzing your query the results should be 2,3. First you should do the INTERSECT between tables B and C, the values that intersect are 1 and 4. Taking that result, you should take all the values from table A except 1 and 4, that leaves us with 2 and 3 (since EXCEPT and INTERSECT return only distinct values). Here is a sqlfiddle with this for you to try.
Because of the parentheses, the INTERSECT between B and C is done first, resulting in (1,4). You can even verify this just by taking the latter part and running it in isolation:
SELECT * FROM B INTERSECT SELECT * FROM C
The next step is to select everything in A EXCEPT those that exist in the previous result of (1,4), which leaves (2,3).
The answer should be 2 and 3, not 1,2 and 4.
BTW, it should be mentioned that even if you had no parentheses in the query at all, the result should still be the same, because the INTERSECT operator has a higher precedence than the EXCEPT/UNION operators. This is the SQL Server documentation, but it's consistent with the standard that applies to any DBMS that implements these operators.
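The two steps can be checked with a short sketch using Python's built-in sqlite3 module and the data from the question. One adaptation is needed: SQLite does not accept a parenthesized SELECT as an operand of EXCEPT, so the INTERSECT is wrapped in a derived table to keep the grouping explicit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (t1 INTEGER);
    CREATE TABLE B (t2 INTEGER);
    CREATE TABLE C (t3 INTEGER);
    INSERT INTO A VALUES (1), (2), (2), (3), (4), (4);
    INSERT INTO B VALUES (1), (3), (4), (4);
    INSERT INTO C VALUES (1), (2), (4), (4);
""")

# Step 1: the part inside the parentheses.
inner = conn.execute(
    "SELECT * FROM B INTERSECT SELECT * FROM C ORDER BY 1"
).fetchall()

# Step 2: everything in A except the values found in step 1.
result = conn.execute("""
    SELECT * FROM A
    EXCEPT
    SELECT * FROM (SELECT * FROM B INTERSECT SELECT * FROM C)
    ORDER BY 1
""").fetchall()

print(inner)   # [(1,), (4,)]
print(result)  # [(2,), (3,)]
```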