Hello, I have a Hive query that contains a NOT IN clause. When I try to run this query in Spark SQL, it throws an unsupported-functionality exception.
Select A.primary_key, A.name, A.salary
from spark.old A
where A.primary_key NOT IN (Select B.primary_key from new B)
UNION ALL
Select C.primary_key, C.name, C.salary
from spark.new C
Any other way to write this query?
UNION ALL is working perfectly.
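One rewrite that is often suggested for this situation is an anti-join: a LEFT OUTER JOIN that keeps only the rows with no match. The sketch below is not Spark; it uses Python's built-in sqlite3 just to show that the two forms return the same rows. The table names old_t and new_t and the sample rows are made up for illustration, and the equivalence assumes primary_key is never NULL in the second table.

```python
import sqlite3

# In-memory stand-ins for spark.old and spark.new (hypothetical names and data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE old_t (primary_key INTEGER, name TEXT, salary INTEGER);
    CREATE TABLE new_t (primary_key INTEGER, name TEXT, salary INTEGER);
    INSERT INTO old_t VALUES (1, 'ann', 100), (2, 'bob', 200), (3, 'cy', 300);
    INSERT INTO new_t VALUES (2, 'bob', 250), (4, 'dee', 400);
""")

# Original shape of the query: NOT IN with a subquery.
not_in_sql = """
    SELECT a.primary_key, a.name, a.salary
    FROM old_t a
    WHERE a.primary_key NOT IN (SELECT b.primary_key FROM new_t b)
    UNION ALL
    SELECT c.primary_key, c.name, c.salary FROM new_t c
"""

# Anti-join shape: keep rows of old_t that found no partner in new_t.
anti_join_sql = """
    SELECT a.primary_key, a.name, a.salary
    FROM old_t a
    LEFT OUTER JOIN new_t b ON a.primary_key = b.primary_key
    WHERE b.primary_key IS NULL
    UNION ALL
    SELECT c.primary_key, c.name, c.salary FROM new_t c
"""

not_in_rows = sorted(conn.execute(not_in_sql).fetchall())
anti_join_rows = sorted(conn.execute(anti_join_sql).fetchall())
print(anti_join_rows)
```

The join form only uses ordinary join syntax, so it does not rely on subquery support in the WHERE clause.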
I am using DBeaver to query a PostgreSQL database.
I have this query, which simply selects the highest id per Enterprise_Nbr. The query works but is really slow. Is there any way I can rewrite it to improve performance?
I am using the query tool DBeaver because I don't have direct access to PostgreSQL. The ultimate goal is to link PostgreSQL with Power BI.
select *
from public.address
where "ID" in (select max("ID")
from public.address a
group by "Enterprise_Nbr")
Queries for greatest-n-per-group problems are typically faster when written with Postgres' proprietary DISTINCT ON () operator:
select distinct on ("Enterprise_Nbr") *
from public.address
order by "Enterprise_Nbr", "ID" desc;
Your query could be rewritten as: for each value of Enterprise_Nbr, retrieve the row for which no other row exists with the same Enterprise_Nbr and a greater ID.
SELECT *
FROM public.address a
WHERE NOT EXISTS (
SELECT 1
FROM public.address b
WHERE b."Enterprise_Nbr" = a."Enterprise_Nbr" AND b."ID" > a."ID"
)
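To check that the NOT EXISTS form really selects the same rows as the original IN (select max(...)) query, here is a small sketch using Python's built-in sqlite3 instead of PostgreSQL. The city column, the sample rows, and the index name address_ent_id are invented for the example; an index on ("Enterprise_Nbr", "ID") is only a suggestion that commonly helps this kind of query, not something the original post mentions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE address ("ID" INTEGER PRIMARY KEY, "Enterprise_Nbr" TEXT, city TEXT);
    INSERT INTO address VALUES
        (1, 'E1', 'Ghent'), (2, 'E1', 'Bruges'),
        (3, 'E2', 'Liege'), (4, 'E2', 'Namur'),
        (5, 'E3', 'Mons');
    -- Hypothetical supporting index for the correlated lookup.
    CREATE INDEX address_ent_id ON address ("Enterprise_Nbr", "ID");
""")

# Original formulation: keep rows whose ID is a per-group maximum.
in_max_sql = """
    SELECT * FROM address
    WHERE "ID" IN (SELECT MAX("ID") FROM address GROUP BY "Enterprise_Nbr")
    ORDER BY "ID"
"""

# NOT EXISTS formulation: keep rows that have no newer sibling.
not_exists_sql = """
    SELECT * FROM address a
    WHERE NOT EXISTS (
        SELECT 1 FROM address b
        WHERE b."Enterprise_Nbr" = a."Enterprise_Nbr" AND b."ID" > a."ID"
    )
    ORDER BY a."ID"
"""

in_max_rows = conn.execute(in_max_sql).fetchall()
not_exists_rows = conn.execute(not_exists_sql).fetchall()
print(not_exists_rows)
```

Which form is faster on the real table depends on the data and the available indexes, so it is worth comparing the plans with EXPLAIN ANALYZE.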
I want the latest records from a Hive table using the following query:
WITH lot as (select *
from to_burn_in as a where a.rel_lot='${Rel_Lot}')
select a.* from lot AS a
where not exists (select 1 from lot as b
where a.Rel_Lot=b.Rel_Lot and a.SerialNum=b.SerialNum and a.Test_Stage=b.Test_Stage
and cast(a.test_datetime as TIMESTAMP) < cast(b.Test_Datetime as TIMESTAMP))
order by a.SerialNum
This query throws the following error:
Error while compiling statement: FAILED: SemanticException line 0:undefined:-1 Unsupported SubQuery Expression 'Test_Datetime': SubQuery expression refers to both Parent and SubQuery expressions and is not a valid join condition.
I tried running it with the equality operator in place of the less-than operator in the subquery, and it runs fine. I read the Hive documentation at
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries
and couldn't figure out why it throws an error, since subqueries in the WHERE clause are supported.
What might be the problem here?
EXISTS actually works the same way as a join. Non-equality join conditions are not supported in Hive prior to Hive 2.2.0 (see HIVE-15211, HIVE-15251).
It seems you are trying to get the records with the latest timestamp per Rel_Lot, SerialNum, Test_Stage. Your query can be rewritten using the dense_rank() or rank() function:
WITH lot as (select *
from to_burn_in as a where a.rel_lot='${Rel_Lot}'
)
select * from
(
select a.*,
dense_rank() over(partition by Rel_Lot,SerialNum,Test_Stage order by cast(a.test_datetime as TIMESTAMP) desc) as rnk
from lot AS a
)s
where rnk=1
order by s.SerialNum
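The ranking approach can be tried out on toy data. This is a sketch in Python's built-in sqlite3 (which needs SQLite 3.25 or later for window functions), not Hive; the sample rows are invented, and the timestamps are stored as ISO-8601 strings so that plain string ordering is chronological and no cast is needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE to_burn_in (rel_lot TEXT, serialnum TEXT, test_stage TEXT, test_datetime TEXT);
    INSERT INTO to_burn_in VALUES
        ('L1', 'S1', 'BI', '2020-01-01 10:00:00'),
        ('L1', 'S1', 'BI', '2020-01-02 10:00:00'),
        ('L1', 'S2', 'BI', '2020-01-01 09:00:00'),
        ('L1', 'S2', 'FT', '2020-01-02 09:00:00'),
        ('L1', 'S2', 'FT', '2020-01-03 09:00:00'),
        ('L2', 'S9', 'BI', '2020-01-05 09:00:00');
""")

latest_sql = """
    WITH lot AS (SELECT * FROM to_burn_in WHERE rel_lot = ?)
    SELECT serialnum, test_stage, test_datetime
    FROM (
        SELECT a.*,
               dense_rank() OVER (
                   PARTITION BY rel_lot, serialnum, test_stage
                   ORDER BY test_datetime DESC
               ) AS rnk
        FROM lot AS a
    ) s
    WHERE rnk = 1
    ORDER BY serialnum, test_stage
"""

# The lot id is passed as a bound parameter instead of the ${Rel_Lot} substitution.
latest_rows = conn.execute(latest_sql, ("L1",)).fetchall()
print(latest_rows)
```

With dense_rank() or rank(), rows that tie on the latest timestamp are all kept, which matches the NOT EXISTS query; row_number() would keep exactly one row per group instead.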
I have a requirement to get the hierarchical structure for the employees.
Below is the recursive query I am currently using:
WITH RECURSIVE resource_tbl AS (
select pers_id,pers_full_nm,mgr_id,mgr_full_nm, 1 as level from resource
UNION ALL
select t.pers_id,t.pers_full_nm,t.mgr_id,t.mgr_full_nm, c.level+1 from resource_tbl c
INNER JOIN resource t ON t.mgr_id = c.pers_id
)
SELECT *
FROM resource_tbl
ORDER BY level;
When I run this query I get the error below:
ERROR: RECURSIVE option in WITH clause is not supported SQL state: 0AM00
Has anyone had this problem before? The PostgreSQL version is 8.2.
If the version is the problem, how can I implement this in our current PostgreSQL environment?
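For context, WITH RECURSIVE was only added in PostgreSQL 8.4, so an 8.2-based server cannot run the query as written. If upgrading is not an option, one fallback is to fetch the flat rows and walk the hierarchy level by level in client code. The Python sketch below illustrates that idea only; the function name assign_levels is made up, and it assumes that top-level employees have a NULL mgr_id (None in Python), which the original query does not state.

```python
from collections import defaultdict


def assign_levels(rows):
    """Return {pers_id: level} for rows of (pers_id, mgr_id); roots get level 1.

    Assumes root employees have mgr_id None. Employees that are not
    reachable from any root are left out of the result.
    """
    # Map each manager to the list of people reporting to them.
    reports = defaultdict(list)
    for pers_id, mgr_id in rows:
        reports[mgr_id].append(pers_id)

    levels = {}
    frontier = list(reports.get(None, []))
    level = 1
    while frontier:
        next_frontier = []
        for pers_id in frontier:
            if pers_id in levels:
                continue  # already visited; guards against cycles in the data
            levels[pers_id] = level
            next_frontier.extend(reports.get(pers_id, []))
        frontier = next_frontier
        level += 1
    return levels


# Tiny made-up org chart: 1 manages 2 and 3, and 2 manages 4.
sample_rows = [(1, None), (2, 1), (3, 1), (4, 2)]
levels = assign_levels(sample_rows)
print(levels)
```

The same level-by-level expansion can also be done on the server with a loop in a PL/pgSQL function, if that is available in your environment.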
I am trying to query multiple tables in BigQuery using a wildcard (my tables have a _[0-9] suffix).
This query for a specific table works:
SELECT
count(*)
FROM `maw_qa.rt_content_secondly_0`
where _PARTITIONTIME = timestamp('2017-01-24');
But this doesn't:
SELECT
count(*)
FROM `maw_qa.rt_content_secondly_*`
where _PARTITIONTIME = timestamp('2017-01-24');
Error:
Query Failed
Error: Unrecognized name: _PARTITIONTIME at [5:7]
I am using standard SQL. Legacy SQL does not even take wildcard * in the query.
What is the way to do this correctly?
It looks like wildcards and partitions do not work together in a query.
Try the query below. It is written in BigQuery Legacy SQL because that version is less verbose.
It assumes you have 4 tables; if you have more, you need to list all of them here.
SELECT COUNT(*)
FROM
[maw_qa.rt_content_secondly_0],
[maw_qa.rt_content_secondly_1],
[maw_qa.rt_content_secondly_2],
[maw_qa.rt_content_secondly_3]
WHERE _PARTITIONTIME = TIMESTAMP('2017-01-24')
Of course, something similar can be written in BigQuery Standard SQL, but it requires more typing with UNION ALL, etc.
For Standard SQL it can look like this:
SELECT COUNT(*) FROM (
SELECT * FROM `maw_qa.rt_content_secondly_0` WHERE _PARTITIONTIME = TIMESTAMP('2017-01-24') UNION ALL
SELECT * FROM `maw_qa.rt_content_secondly_1` WHERE _PARTITIONTIME = TIMESTAMP('2017-01-24') UNION ALL
SELECT * FROM `maw_qa.rt_content_secondly_2` WHERE _PARTITIONTIME = TIMESTAMP('2017-01-24') UNION ALL
SELECT * FROM `maw_qa.rt_content_secondly_3` WHERE _PARTITIONTIME = TIMESTAMP('2017-01-24')
)
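If there are more than a handful of tables, typing the UNION ALL branches by hand gets tedious. The following is a small Python sketch that only builds the query text shown above; the function name build_partition_count_query and its parameters are made up for illustration, and it assumes the table suffixes are consecutive integers starting at 0. It does not talk to BigQuery.

```python
def build_partition_count_query(dataset, table_prefix, table_count, partition_date):
    """Build a Standard SQL COUNT(*) over several suffixed, partitioned tables.

    Inputs are interpolated directly into the SQL text, so they must come
    from trusted configuration, never from user input.
    """
    selects = [
        f"SELECT * FROM `{dataset}.{table_prefix}{i}` "
        f"WHERE _PARTITIONTIME = TIMESTAMP('{partition_date}')"
        for i in range(table_count)
    ]
    body = " UNION ALL\n  ".join(selects)
    return f"SELECT COUNT(*) FROM (\n  {body}\n)"


query = build_partition_count_query("maw_qa", "rt_content_secondly_", 4, "2017-01-24")
print(query)
```

The printed text can then be pasted into the query editor or submitted through whichever client you normally use.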
When you query a partitioned table, you don't need to use the _* syntax, which is reserved for table wildcards (where you filter on _TABLE_SUFFIX). In your case, you should just do:
SELECT
count(*)
FROM `maw_qa.rt_content_secondly`
where _PARTITIONTIME = '2017-01-24';
BigQuery does not seem to have support for UNION yet:
https://developers.google.com/bigquery/docs/query-reference
(I don't mean unioning tables together for the source. It has that.)
Is it coming soon?
If you want UNION so that you can combine query results, you can use subselects in BigQuery:
SELECT foo, bar
FROM
(SELECT integer(id) AS foo, string(title) AS bar
FROM publicdata:samples.wikipedia limit 10),
(SELECT integer(year) AS foo, string(state) AS bar
FROM publicdata:samples.natality limit 10);
This is almost exactly equivalent to the SQL
SELECT id AS foo, title AS bar
FROM publicdata:samples.wikipedia limit 10
UNION ALL
SELECT year AS foo, state AS bar
FROM publicdata:samples.natality limit 10;
(Note that if you want SQL UNION rather than UNION ALL, this won't work.)
Alternately, you could run two queries and append the result.
BigQuery recently added support for Standard SQL, including the UNION operation.
When submitting a query through the web UI, just make sure to uncheck "Use Legacy SQL" under the SQL Version rubric.
You can always do:
SELECT * FROM (query 1), (query 2);
In legacy SQL this does the same thing as:
SELECT * FROM (query 1) UNION ALL SELECT * FROM (query 2);
Note that, if you're using standard SQL, the comma operator now means JOIN - you have to use the UNION syntax if you want a union:
In legacy SQL, the comma operator , has the non-standard meaning of UNION ALL when applied to tables. In standard SQL, the comma operator has the standard meaning of JOIN.
For example:
#standardSQL
SELECT
column_name,
count(*)
from
(SELECT * FROM me.table1 UNION ALL SELECT * FROM me.table2)
group by 1
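The difference between the two meanings of the comma is easy to see on toy data. The sketch below uses Python's built-in sqlite3, where the comma has the standard meaning of a (cross) join, and compares it with UNION ALL. It is not BigQuery, and the table contents are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (column_name TEXT);
    CREATE TABLE table2 (column_name TEXT);
    INSERT INTO table1 VALUES ('a'), ('b');
    INSERT INTO table2 VALUES ('a'), ('c'), ('c');
""")

# Standard meaning of the comma: every row of table1 paired with every row of table2.
comma_rows = conn.execute("SELECT * FROM table1, table2").fetchall()

# UNION ALL: the rows of both tables stacked on top of each other.
union_rows = conn.execute(
    "SELECT * FROM table1 UNION ALL SELECT * FROM table2"
).fetchall()

# Same shape as the grouped example above, with the column named explicitly.
grouped_rows = conn.execute("""
    SELECT column_name, COUNT(*)
    FROM (SELECT * FROM table1 UNION ALL SELECT * FROM table2)
    GROUP BY column_name
    ORDER BY column_name
""").fetchall()

print(len(comma_rows), len(union_rows), grouped_rows)
```

With 2 and 3 input rows, the comma form yields 2 x 3 paired rows while UNION ALL yields 2 + 3 rows, which is why a legacy query that relied on the comma has to be rewritten when switching dialects.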
This helped me a great deal when doing an INTERSECT in BigQuery's standard SQL.
#standardSQL
WITH
a AS (
SELECT
*
FROM
table_a),
b AS (
SELECT
*
FROM
table_b)
SELECT
*
FROM
a INTERSECT DISTINCT
SELECT
*
FROM
b
I adapted this example from: https://gist.github.com/yancya/bf38d1b60edf972140492e3efd0955d0
Unions are indeed supported. An excerpt from the link that you posted:
Note: Unlike many other SQL-based systems, BigQuery uses the comma syntax to indicate table unions, not joins. This means you can run a query over several tables with compatible schemas as follows:
// Find suspicious activity over several days
SELECT FORMAT_UTC_USEC(event.timestamp_in_usec) AS time, request_url
FROM [applogs.events_20120501], [applogs.events_20120502], [applogs.events_20120503]
WHERE event.username = 'root' AND NOT event.source_ip.is_internal;