BigQuery - FULL OUTER JOIN with USING causes casting error - google-bigquery

I work in BigQuery and I have a task: I have to compare the values of two tables and search for differences. I decided to convert the values to strings, unpivot both tables, and join them on the unique ID and the column name. I chose a FULL OUTER JOIN because I can't be certain that every single ID is present in both tables. When I join the tables with a USING clause, I get a casting error as soon as I use a plain CAST in the WHERE clause.
Error running query
Invalid date: 'test'
This code reproduces the issue:
WITH RawData AS (
SELECT 'text' AS type, 'test' AS value
UNION ALL
SELECT 'number' AS type, '123' AS value
UNION ALL
SELECT 'date' AS type, '2020-12-12' AS value
),
Joined AS (
SELECT
IFNULL(a.type, b.type) AS type,
-- type,
a.value AS a_value,
b.value AS b_value
FROM RawData AS a
FULL OUTER JOIN RawData AS b ON a.type = b.type -- NO ISSUE
-- FULL OUTER JOIN RawData AS b USING (type) -- CASTING ERROR
-- LEFT JOIN RawData AS b USING (type) -- NO ISSUE
-- RIGHT JOIN RawData AS b USING (type) -- NO ISSUE
-- JOIN RawData AS b USING (type) -- NO ISSUE
),
Filtered AS (
SELECT * FROM Joined WHERE type = 'date'
)
SELECT *, CAST(a_value AS DATE), CAST(b_value AS DATE)
FROM Filtered
WHERE 1=1
AND CAST(a_value AS DATE) BETWEEN DATE('2020-12-01') AND DATE('2020-12-31')
AND CAST(b_value AS DATE) BETWEEN DATE('2020-12-01') AND DATE('2020-12-31')

I believe that, given the way BigQuery compiles your CTEs, the CAST(a_value AS DATE) in your final WHERE clause can be evaluated as if it sat right after your self join, so it hits the non-date rows and raises the conversion error before your other filters are applied.
I find it helpful to do as much casting as early as possible:
WITH
RawData AS (...),
Joined AS (
SELECT
IFNULL(a.type, b.type) AS type,
SAFE_CAST(a.value AS DATE) AS a_date, -- Handle casting ASAP
SAFE_CAST(b.value AS DATE) AS b_date
FROM RawData AS a
FULL OUTER JOIN RawData AS b USING (type)
),
Filtered AS (
SELECT * FROM Joined WHERE type = 'date'
)
SELECT * FROM Filtered
WHERE a_date BETWEEN '2020-12-01' AND '2020-12-31'
AND b_date BETWEEN '2020-12-01' AND '2020-12-31'
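The key difference is that SAFE_CAST yields NULL where CAST would raise an error. That behavior can be sketched outside SQL in plain Python (cast_date and safe_cast_date are illustrative stand-ins, not BigQuery APIs):

```python
from datetime import date

def cast_date(value):
    # Strict conversion, like CAST(value AS DATE): raises on non-date input
    return date.fromisoformat(value)

def safe_cast_date(value):
    # Lenient conversion, like SAFE_CAST(value AS DATE): returns None instead
    try:
        return date.fromisoformat(value)
    except ValueError:
        return None

# The unpivoted sample rows from the question
rows = [("text", "test"), ("number", "123"), ("date", "2020-12-12")]

# SAFE_CAST-style handling survives the non-date rows
print([safe_cast_date(v) for _, v in rows])  # [None, None, datetime.date(2020, 12, 12)]
```

Because the NULLs simply fail the BETWEEN predicate, the order in which BigQuery applies your filters no longer matters.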
Even better would be to filter your table before you do a join so it is joining fewer rows!
WITH
RawData AS (...),
Filtered AS (
SELECT
*, SAFE_CAST(value AS DATE) AS date_value -- Handle casting ASAP
FROM RawData
WHERE type = 'date' -- Reduce Size Early On!
AND SAFE_CAST(value as DATE) BETWEEN '2020-12-01' AND '2020-12-31' -- Only do it 1 time here since it is the same range!
),
Joined AS (
SELECT
COALESCE(a.type, b.type) AS type,
a.date_value AS a_date,
b.date_value AS b_date
FROM Filtered AS a
FULL OUTER JOIN Filtered AS b USING (type)
)
SELECT * FROM Joined

Related

How do I complete a cross join with redshift?

I have two tables. One has a user ID and a date and another has a list of dates.
with first_day as (
select '2020-03-01' AS DAY_CREATED, '123' AS USER_ID
),
date_series as (
SELECT ('2020-02-28'::date + x)::date as day_n,
'one' as join_key
FROM generate_series(1, 30, 1) x
)
SELECT * from first_day cross join date_series
I'm getting this error with redshift
Error running query: Specified types or functions (one per INFO message) not supported on Redshift tables.
Can I do a cross join with redshift?
Alas, Redshift supports generate_series() but only in a very limited way -- on the leader node. That basically renders it useless for joining against regular tables.
Assuming you have a table with enough rows, you can use that:
with first_day as (
select '2020-03-01' AS DAY_CREATED, '123' AS USER_ID
),
date_series as (
select ('2020-02-28'::date + x)::date as day_n,
'one' as join_key
from (select t.*, row_number() over () as x
from t -- big enough table
limit 30
) x
)
select *
from first_day cross join
date_series;
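The row_number() trick can be reproduced locally with Python's sqlite3 (SQLite 3.25+ for window functions; the table t and its contents are invented stand-ins for the "big enough table"):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Any table with at least 30 rows stands in for Redshift's "big enough table"
conn.execute("CREATE TABLE t (dummy INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100)])

# row_number() over the big table replaces generate_series(1, 30, 1)
rows = conn.execute("""
    SELECT date('2020-02-28', '+' || x || ' day') AS day_n
    FROM (SELECT row_number() OVER () AS x FROM t ORDER BY x LIMIT 30)
""").fetchall()

print(len(rows))  # 30 consecutive days starting 2020-02-29 (2020 is a leap year)
```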

How to return two values from PostgreSQL subquery?

I have a problem where I need to get the last item across various tables in PostgreSQL.
The following code works and returns me the type of the latest update and when it was last updated.
The problem is, this query needs to be used as a subquery, so I want to select both the type and the last updated value from this query and PostgreSQL does not seem to like this... (Subquery must return only one column)
Any suggestions?
SELECT last.type, last.max FROM (
SELECT MAX(a.updated_at), 'a' AS type FROM table_a a WHERE a.ref = 5 UNION
SELECT MAX(b.updated_at), 'b' AS type FROM table_b b WHERE b.ref = 5
) AS last ORDER BY max LIMIT 1
The query is used like this inside a CTE:
WITH sql_query as (
SELECT id, name, address, (...other columns),
last.type, last.max FROM (
SELECT MAX(a.updated_at), 'a' AS type FROM table_a a WHERE a.ref = 5 UNION
SELECT MAX(b.updated_at), 'b' AS type FROM table_b b WHERE b.ref = 5
) AS last ORDER BY max LIMIT 1
FROM table_c
WHERE table_c.fk_id = 1
)
The inherent problem is that SQL (all SQL, not just Postgres) requires that a subquery used within a select clause return only a single value. If you think about that restriction for a while, it does make sense. The select clause returns a fixed number of columns for each row, so each row/column location is a single position within a grid. You can bend that rule a bit by putting a concatenation (or a single "complex type" like a JSON value) into one position, but it remains a single position in that grid regardless.
Here however you do want 2 separate columns AND you need to return both columns from the same row, so instead of LIMIT 1 I suggest using ROW_NUMBER() instead to facilitate this:
WITH LastVals as (
SELECT type
, max_date
, row_number() over(order by max_date DESC) as rn
FROM (
SELECT MAX(a.updated_at) AS max_date, 'a' AS type FROM table_a a WHERE a.ref = 5
UNION ALL
SELECT MAX(b.updated_at) AS max_date, 'b' AS type FROM table_b b WHERE b.ref = 5
) AS vals -- Postgres requires an alias on a derived table
)
, sql_query as (
SELECT id
, name, address, (...other columns)
, (select type from lastVals where rn = 1) as last_type
, (select max_date from lastVals where rn = 1) as last_date
FROM table_c
WHERE table_c.fk_id = 1
)
----
By the way in your subquery you should use UNION ALL with type being a constant like 'a' or 'b' then even if MAX(a.updated_at) was identical for 2 or more tables, the rows would still be unique because of the difference in type. UNION will attempt to remove duplicate rows but here it just isn't going to help, so avoid that wasted effort by using UNION ALL.
----
For another way to skin this cat, consider using a LEFT JOIN instead
SELECT id
, name, address, (...other columns)
, LastVals.type
, LastVals.last_date
FROM table_c
LEFT JOIN (
SELECT type
, last_date
, row_number() over(order by last_date DESC) as rn
FROM (
SELECT MAX(a.updated_at) AS last_date, 'a' AS type FROM table_a a WHERE a.ref = 5
UNION ALL
SELECT MAX(b.updated_at) AS last_date, 'b' AS type FROM table_b b WHERE b.ref = 5
) AS vals
) LastVals ON LastVals.rn = 1
WHERE table_c.fk_id = 1
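Both the scalar-subquery and the LEFT JOIN versions hinge on rn = 1 selecting the single overall latest row. A minimal check with Python's sqlite3 (table contents invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (ref INTEGER, updated_at TEXT);
    CREATE TABLE table_b (ref INTEGER, updated_at TEXT);
    INSERT INTO table_a VALUES (5, '2021-01-01'), (5, '2021-03-01');
    INSERT INTO table_b VALUES (5, '2021-02-01');
""")

# rn = 1 picks the single latest (type, last_date) pair across both tables
row = conn.execute("""
    SELECT type, last_date
    FROM (
        SELECT type, last_date,
               row_number() OVER (ORDER BY last_date DESC) AS rn
        FROM (
            SELECT MAX(a.updated_at) AS last_date, 'a' AS type
            FROM table_a a WHERE a.ref = 5
            UNION ALL
            SELECT MAX(b.updated_at) AS last_date, 'b' AS type
            FROM table_b b WHERE b.ref = 5
        )
    )
    WHERE rn = 1
""").fetchone()
print(row)  # ('a', '2021-03-01')
```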

how to SUM two columns in different table between two date

This is my query, but the result is wrong whenever the row counts differ; that is, whenever tableA returns 2 rows and tableB returns 3, the result is false:
select sum(tableA.value)+sum(tableB.value1)
from tableA,tableB
where tableA.data between '2016-01-21' and '2016-03-09'
and tableB.date2 between '2016-01-21' and '2016-03-09'
You need to do the sums in subqueries before joining. A simple rule: never use commas in the from clause.
select coalesce(avalue, 0) + coalesce(bvalue, 0)
from (select sum(a.value) as avalue
from tableA a
where a.data between '2016-01-21' and '2016-03-09'
) a cross join
(select sum(b.value1) as bvalue
from tableB b
where b.date2 between '2016-01-21' and '2016-03-09'
) b;
OK, so here's my understanding: you are trying to sum up a column from each of two different tables and add the two sums together. Isn't it? Correct me if I am wrong. If that is the case, then
a simple scalar subquery can come to your rescue:
Select
(Select SUM(value) From tableA
where data between '2016-01-21' and '2016-03-09') +
(Select SUM(value1) From tableB
where date2 between '2016-01-21' and '2016-03-09') FinalValue
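The scalar-subquery approach is easy to verify with Python's sqlite3 (sample rows invented to match the question's column names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (data TEXT, value REAL);
    CREATE TABLE tableB (date2 TEXT, value1 REAL);
    INSERT INTO tableA VALUES ('2016-02-01', 10), ('2016-02-02', 20), ('2016-05-01', 99);
    INSERT INTO tableB VALUES ('2016-02-01', 5), ('2016-03-01', 7);
""")

# Each scalar subquery aggregates independently, so mismatched row counts
# can't inflate the sums the way an unconstrained comma join does
total = conn.execute("""
    SELECT (SELECT SUM(value)  FROM tableA
            WHERE data  BETWEEN '2016-01-21' AND '2016-03-09')
         + (SELECT SUM(value1) FROM tableB
            WHERE date2 BETWEEN '2016-01-21' AND '2016-03-09') AS FinalValue
""").fetchone()[0]
print(total)  # 30 + 12 = 42.0
```

If either date range can be empty, wrap each subquery in COALESCE(..., 0): SUM over zero rows is NULL, and NULL + anything is NULL.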

SQL SUM function inquiry

I'm having a hard time summing up a column on two tables. The scenario is something like this (refer to the image below)
Table 1 may have many rows per Date, but Table 2 may consist of only two rows per Date. What I want to do is sum all Item/Price rows in Table1 by Date and ADD them to the corresponding SUM of Item/Price rows in Table2; the grouping category for the SUM is Date.
I tried every join (left, right, and inner), but none produces the result I expect. My expected result is the Result table, but my query produces a much higher value.
Thanks.
Use a UNION clause like this:
WITH t(d, p) AS (
SELECT [Date], Price FROM Table1
UNION ALL
SELECT [Date], Price FROM Table2
)
SELECT d, SUM(p) FROM t GROUP BY d
You can do this with UNION ALL in either a subquery or a cte, cte shown here:
;WITH cte AS (SELECT [Date], Price
FROM Table1
UNION ALL
SELECT [Date], Price
FROM Table2
)
SELECT [Date], SUM(Price) AS Total_Price
FROM cte
GROUP BY [Date]
Demo: SQL Fiddle
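The UNION ALL-then-GROUP BY pattern can be sketched with Python's sqlite3 (the [Date] column is renamed d here, and the sample rows are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (d TEXT, Price REAL);
    CREATE TABLE Table2 (d TEXT, Price REAL);
    INSERT INTO Table1 VALUES ('2016-01-01', 1), ('2016-01-01', 2), ('2016-01-02', 3);
    INSERT INTO Table2 VALUES ('2016-01-01', 10), ('2016-01-02', 20);
""")

# Stack both tables first, then aggregate once per date
totals = conn.execute("""
    WITH cte AS (
        SELECT d, Price FROM Table1
        UNION ALL
        SELECT d, Price FROM Table2
    )
    SELECT d, SUM(Price) AS Total_Price FROM cte GROUP BY d ORDER BY d
""").fetchall()
print(totals)  # [('2016-01-01', 13.0), ('2016-01-02', 23.0)]
```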
Try this:
with cte (C_Date,C_Price)
as
(
SELECT date,SUM(price) FROM table_1
group by date
union all -- UNION would drop a (date, sum) pair that happens to match across tables
SELECT date,SUM(price) FROM table_2
group by date
)
select c_date,SUM(c_price) from cte
group by C_Date
Try this:
Select t.date, P1 + coalesce(P2, 0) as Total
from (
Select Date, sum(Price) P1
from table1
group by Date
) t
left join
(
Select Date, sum(Price) P2
from table2
group by Date
) t1 on t.date = t1.date

How to use group by with union in T-SQL

How can I use GROUP BY with UNION in T-SQL? I want to group by the first column of the result of a union. I wrote the following SQL, but it doesn't work; I just don't know how to reference the specified column (in this case, column 1) of the union result.
SELECT *
FROM ( SELECT a.id ,
a.time
FROM dbo.a
UNION
SELECT b.id ,
b.time
FROM dbo.b
)
GROUP BY 1
You need to alias the subquery. Thus, your statement should be:
Select Z.id
From (
Select id, time
From dbo.tablea
Union All
Select id, time
From dbo.tableb
) As Z
Group By Z.id
GROUP BY 1
GROUP BY doesn't accept ordinals in T-SQL; only ORDER BY does. (Some other engines, such as MySQL and PostgreSQL, do allow GROUP BY 1.) Either way, ordinals aren't recommended practice, because they depend on the order of the SELECT list -- if that changes, so does your ORDER BY (or GROUP BY, where supported).
There's also no need to run GROUP BY on the contents when you're using UNION -- UNION ensures that duplicates are removed; UNION ALL is faster because it doesn't, and in that case you would need the GROUP BY.
Your query only needs to be:
SELECT a.id,
a.time
FROM dbo.TABLE_A a
UNION
SELECT b.id,
b.time
FROM dbo.TABLE_B b
Identifying the column is easy:
SELECT *
FROM ( SELECT id,
time
FROM dbo.a
UNION
SELECT id,
time
FROM dbo.b
) AS u -- T-SQL requires an alias on a derived table
GROUP BY id
But it doesn't solve the main problem of this query: what should be done with the second column's values when grouping by the first? Since (peculiarly!) you're using UNION rather than UNION ALL, you won't have entirely duplicated rows between the two subtables in the union, but you may still have several values of time for one value of id, and you give no hint of what you want to do with them -- min, max, avg, sum, or what? The SQL engine should raise an error because of that (though some engines, such as MySQL in its legacy mode, just pick an arbitrary value; SQL Server is stricter than that).
So, for example, change the first line to SELECT id, MAX(time) or the like!
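The id, MAX(time) suggestion can be checked with Python's sqlite3 (time is renamed t, and the rows are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, t TEXT);
    CREATE TABLE b (id INTEGER, t TEXT);
    INSERT INTO a VALUES (1, '2020-01-01'), (1, '2020-01-05');
    INSERT INTO b VALUES (1, '2020-01-03'), (2, '2020-01-02');
""")

# Aggregate the second column when grouping the union by the first
rows = conn.execute("""
    SELECT id, MAX(t) AS latest
    FROM (SELECT id, t FROM a UNION ALL SELECT id, t FROM b)
    GROUP BY id
    ORDER BY id
""").fetchall()
print(rows)  # [(1, '2020-01-05'), (2, '2020-01-02')]
```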
with UnionTable as
(
SELECT a.id, a.time FROM dbo.a
UNION
SELECT b.id, b.time FROM dbo.b
) SELECT id FROM UnionTable GROUP BY id