Spark SQL SELECT from UNION returns erroneous count

Spark SQL SELECT from UNION returns erroneous count - apache-spark-sql

I encounter a weird bug with Spark SQL. Considering the following query
select cast(file_modification_time as date) mod_on, count(*)
from ((select id, _metadata.file_modification_time from table_a
where _metadata.file_modification_time >= '2022-10-01')
union
(select id, _metadata.file_modification_time from table_b
where _metadata.file_modification_time >= '2022-10-01')
)
group by mod_on
order by mod_on;
table_a and table_b have identical schema. The statement above returns the expect result. However, if I omit id from the sub-query, Spark (3.3.0) returns erroneous counts -- way less than expected. What is happening here?

Related

Understanding the semantics of subquery in FROM clause in PostgreSQL

I have an rttmm table with conversation_id and duration fields. There's a query using two sub-queries in a FROM-clause, one of them is not used. I would expect it to be semantically equivalent to the one where you would remove the unused subquery, but it behaves very differently. Here's the query in question:
select
sum(subq2.dur) as res
from (
select sum(rttmm.duration) as dur, rttmm.conversation_id as conv_id
from rttmm
group by rttmm.conversation_id
) as subq1,
(
select sum(rttmm.duration) as dur, rttmm.conversation_id as conv_id
from rttmm
group by rttmm.conversation_id
) as subq2
and here's what I would expect it to be equivalent to (just removing the subq1):
select
sum(subq2.dur) as res
from
(
select sum(rttmm.duration) as dur, rttmm.conversation_id as conv_id
from rttmm
group by rttmm.conversation_id
) as subq2
Turns out it's not the same at all. What is the proper understanding of the first query here?

The first query uses the ancient SQL-89 join syntax and cross-joins two subqueries, whereas the second query does a simple select from the first subquery.
In simple words, the difference is:
select * from table1, table2 vs select * from table1
which is equivalent for
select * from table1 cross join table2 vs select * from table1

Compare two tables of data in HIVE

I have to find out if data in both the tables is same for a given view_date. If same my SQL should return zero, else non zero.
Table1/Table2 columns:
Source
view_date
count
start_date
end_date
I tried in the below way:
SELECT *
FROM (
SELECT count(*)
FROM table1
) a
JOIN (
SELECT count(*)
FROM TABLE 2
) b
WHERE view_date = '05/08/2016'
AND a.x != b.y;
But I am not getting the expected result. Could someone please help me?

Here is one method that counts the number of rows that are unique in each table:
select count(*)
from (select source, count, start_date, end_date,
min(which) as minwhich, max(which) as maxwhich
from ((select source, count, start_date, end_date, 1 as which
from table1
where viewdate = '2016-06-08'
) union all
(select source, count, start_date, end_date, 2 as which
from table2
where viewdate = '2016-06-08'
)
) t12
group by source, count, start_date, end_date
having minwhich = maxwhich
) t;
Note: If rows are duplicated across all values in a table, this does not check that the same number of duplicates are in each table.

To do a full comparison of 2 tables, you not only need to make sure that the number of rows match, but you must check that all the data in all the columns for all the rows match!
This can be a complicated problem (when I worked at Hortonworks, for 1 project we developed 3 different programs to try to solve this). Lately I had the opportunity to develop a program that solves this in an elegant and efficient way: https://github.com/bolcom/hive_compared_bq
The program shows you the differences in a webpage (which is something you could skip if you don't need it) and also gives you a return value 0/1 which is what you currently want.

SELECT statement in WHERE clause on BigQuery not working

I'm trying to run the following query on Google BigQuery:
SELECT SUM(var1) AS Revenue
FROM [table1]
WHERE timeStamp = (SELECT MAX(timeStamp) FROM [table1])
I'm getting the following error:
Error: Encountered "" at line 3, column 19. Was expecting one of:
Is this not supported in BigQuery? If so, would there be an elegant alternative?

Subselect in a comparison predicate is not supported, but you can use IN.
SELECT SUM(var1) AS Revenue
FROM [table1]
WHERE timeStamp IN (SELECT MAX(timeStamp) FROM [table1])

I would use Rank() to get the max timestamp, and filter the #1s in the where clause.
select SUM(var1) AS Revenue
From
(SELECT var1
,RANK() OVER (ORDER BY timestamp DESC) as RNK
FROM [table1]
)
where RNK=1
I don't know how it works with BQ, but in other DB technologies it would be more efficient as it involves only single table scan rather than 2.

Order by not working in Oracle subquery

I'm trying to return 7 events from a table, from todays date, and have them in date order:
SELECT ID
FROM table
where ID in (select ID from table
where DATEFIELD >= trunc(sysdate)
order by DATEFIELD ASC)
and rownum <= 7
If I remove the 'order by' it returns the IDs just fine and the query works, but it's not in the right order. Would appreciate any help with this since I can't seem to figure out what I'm doing wrong!
(edit) for clarification, I was using this before, and the order returned was really out:
select ID
from TABLE
where DATEFIELD >= trunc(sysdate)
and rownum <= 7
order by DATEFIELD
Thanks

The values for the ROWNUM "function" are applied before the ORDER BY is processed. That why it doesn't work the way you used it (See the manual for a similar explanation)
When limiting a query using ROWNUM and an ORDER BY is involved, the ordering must be done in an inner select and the limit must be applied in the outer select:
select *
from (
select *
from table
where datefield >= trunc(sysdate)
order by datefield ASC
)
where rownum <= 7

You cannot use order by in where id in (select id from ...) kind of subquery. It wouldn't make sense anyway. This condition only checks if id is in subquery. If it affects the order of output, it's only incidental. With different data query execution plan might be different and output order would be different as well. Use explicit order by at the end of the main query.
It is well known 'feature' of Oracle that rownum doesn't play nice with order by. See http://www.adp-gmbh.ch/ora/sql/examples/first_rows.html for more information. In your case you should use something like:
SELECT ID
FROM (select ID, row_number() over (order by DATEFIELD ) r
from table
where DATEFIELD >= trunc(sysdate))
WHERE r <= 7
See also:
http://www.orafaq.com/faq/how_does_one_select_the_top_n_rows_from_a_table
http://www.oracle.com/technetwork/issue-archive/2006/06-sep/o56asktom-086197.html
http://asktom.oracle.com/pls/asktom/f?p=100:11:507524690399301::::P11_QUESTION_ID:127412348064
See also other similar questions on SO, eg.:
Oracle SELECT TOP 10 records
Oracle/SQL - Select specified range of sequential records

Your outer query cant "see" the ORDER in the inner query and in this case the order in the inner doesn't make sense because it (the inner) is only being used to create a subset of data that will be used on the WHERE of the outer one, so the order of this subset doesn't matter.
maybe if you explain better what you want to do, we can help you

ORDER BY CLAUSE IN Subqueries:
the order by clause is not allowed inside a subquery, with the exception of the inline views. If attempt to include an ORDER BY clause, you receive an error message
An inline View is a query at the from clause.
SELECT t.*
FROM (SELECT id, name FROM student) t

SELECT , COUNT() in SQLite

If i perform a standard query in SQLite:
SELECT * FROM my_table
I get all records in my table as expected. If i perform following query:
SELECT *, 1 FROM my_table
I get all records as expected with rightmost column holding '1' in all records. But if i perform the query:
SELECT *, COUNT(*) FROM my_table
I get only ONE row (with rightmost column is a correct count).
Why is such results? I'm not very good in SQL, maybe such behavior is expected? It seems very strange and unlogical to me :(.

SELECT *, COUNT(*) FROM my_table is not what you want, and it's not really valid SQL, you have to group by all the columns that's not an aggregate.
You'd want something like
SELECT somecolumn,someothercolumn, COUNT(*)
FROM my_table
GROUP BY somecolumn,someothercolumn

If you want to count the number of records in your table, simply run:
SELECT COUNT(*) FROM your_table;

count(*) is an aggregate function. Aggregate functions need to be grouped for a meaningful results. You can read: count columns group by

If what you want is the total number of records in the table appended to each row you can do something like
SELECT *
FROM my_table
CROSS JOIN (SELECT COUNT(*) AS COUNT_OF_RECS_IN_MY_TABLE
FROM MY_TABLE)

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Spark SQL SELECT from UNION returns erroneous count - apache-spark-sql

Related

Understanding the semantics of subquery in FROM clause in PostgreSQL

Compare two tables of data in HIVE

SELECT statement in WHERE clause on BigQuery not working

Order by not working in Oracle subquery

SELECT , COUNT() in SQLite

Categories

Resources

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Spark SQL SELECT from UNION returns erroneous count - apache-spark-sql

Related

Understanding the semantics of subquery in FROM clause in PostgreSQL

Compare two tables of data in HIVE

SELECT statement in WHERE clause on BigQuery not working

Order by not working in Oracle subquery

SELECT *, COUNT(*) in SQLite

Categories

Resources

SELECT , COUNT() in SQLite