What's wrong with this Hive SQL statement? - hive

user_id 821044249473 is returned in the result set after executing the Hive SQL below:
select distinct(t1.user_id) from b2c_d.A t1 inner join b2c_d.B t2 on t1.user_id = t2.user_id inner join b2cdc.C t3 on t1.user_id = t3.base_uid
where t1.active_rate > 0.9 and t1.micloud_usage > 0.9 and t1.user_level > 5
and t3.order_accessory_amount >100 and t3.order_accessory_amount < 3000 and t3.order_total_amount > 10000 and t3.order_total_amount < 50000
and t1.play_date > 10
and t1.user_id not in(
...
)
So what's wrong? Why is user_id 821044249473 contained in the result set?
Thanks

Could you be more specific about what the input is and what you expect in the output?
From just browsing the SQL, I think you will have to use a SEMI JOIN, since in HiveQL you won't be able to use IN with a subquery the way you can in MySQL.
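As a rough illustration of the join-based rewrite suggested above, the sketch below runs a NOT IN filter and a join form side by side. It is only a sketch under stated assumptions: it uses Python's built-in sqlite3 module rather than Hive, the tables `a` and `excluded` and their rows are made up, and the elided subquery from the question is not reconstructed. A LEFT SEMI JOIN covers the IN case; for the NOT IN in the question, the join form shown here is a LEFT OUTER JOIN with an IS NULL filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (user_id INTEGER);         -- hypothetical stand-in for b2c_d.A
    CREATE TABLE excluded (user_id INTEGER);  -- hypothetical stand-in for the NOT IN subquery
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO excluded VALUES (2);
""")

# Filter written with NOT IN, as in the question.
not_in_rows = conn.execute("""
    SELECT DISTINCT a.user_id
    FROM a
    WHERE a.user_id NOT IN (SELECT user_id FROM excluded)
    ORDER BY a.user_id
""").fetchall()

# Same filter written as an outer join that keeps only unmatched rows.
anti_join_rows = conn.execute("""
    SELECT DISTINCT a.user_id
    FROM a
    LEFT OUTER JOIN excluded e ON a.user_id = e.user_id
    WHERE e.user_id IS NULL
    ORDER BY a.user_id
""").fetchall()

print(not_in_rows, anti_join_rows)
```

The two forms agree here only because `excluded.user_id` contains no NULLs. If the subquery can return NULL, NOT IN filters out every row while the join form does not, so it is worth comparing both on the real data.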

Related

Impala Error : AnalysisException: Multiple subqueries are not supported in expression: (CASE *** WHEN THEN)

I have a WHERE clause that needs to check whether values exist in a table, and I'm doing that in a subquery. The problem is that the check depends on the
values 'FIX' and 'VAR': depending on the value, we need to check a different table (subquery). To achieve that I'm using a CASE WHEN expression in the WHERE clause, as shown below:
select *
FROM T1
where
(upper(trim(ITAXAVAR)) = 'S'
and
(
upper(trim(CTIPAMOR)) not in ('A','U','F')
)
)
and
--problem starts here.....
(case ucase(trim(CTIPTXFX)) --Values 'FIX';'VAR';'PUR'
WHEN 'FIX'
THEN
(concat(trim(CPRZTXFX),trim(CTAXAREF)) not in
(select trim(A.tayd91c0_celemtab)
from cd_estruturais.tat91_tabelas A
where A.tayd91c0_ctabela = 'W03' and
--data_date_part = '${Data_ref}' and --sometimes there is no TAT91 update for the same data_ref as the tables
A.data_date_part = (select max(B.data_date_part)
from cd_estruturais.tat91_tabelas B
where A.tayd91c0_ctabela = B.tayd91c0_ctabela and
B.data_date_part > date_add(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())),-5)
)
and length(nvl(trim(A.tayd91c0_celemtab),'')) <> 0
)
)
WHEN 'VAR'
THEN
(concat(trim(CTAXAREF),trim(CPERRVTX)) not in
(select concat(trim(A.CTXREF),trim(A.CPERRVTX))
from land_estruturais.cat01_taxref A
where A.data_date_part > date_add(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())),-5)
and length(nvl(concat(trim(A.CTXREF),trim(A.CPERRVTX)),'')) <> 0
)
)
END
)
;
Below is a simplified view of the same query:
select *
FROM T1
where
(--first criteria
)
and
--problem starts here.....
(case ucase(trim(CTIPTXFX)) --Values 'FIX';'VAR';'PUR'
WHEN 'FIX'
THEN
(field1 not in
(subquery 1)
)
WHEN 'VAR'
THEN
(field1 not in
(subquery 2)
)
END
)
;
Can anyone tell me what I'm doing wrong, please?
It seems to me that Impala does not support subqueries inside a CASE WHEN expression.
Thank you.
Impala doesn't support subqueries in the select list.
So you need to rewrite the SQL as follows:
Use LEFT ANTI JOIN in place of NOT IN () to link the subqueries to T1.
To handle the CASE WHEN, use UNION ALL for the different conditions.
SELECT * FROM T1
LEFT ANTI JOIN subqry1 y ON T1.id = y.id
WHERE col='FIX'
UNION ALL
SELECT * FROM T1
LEFT ANTI JOIN subqry2 y ON T1.id = y.id
WHERE col='VAR'
I tried to adapt the simple SQL you posted above. The main SQL is too complex and would need table setup and data to prove the logic.
Here is my version of your simple SQL:
select * FROM T1
LEFT ANTI JOIN subquery1 ON subquery1.column = T1.field1
where (/* first criteria */)
and ucase(trim(CTIPTXFX))='FIX'
UNION ALL
select * FROM T1
LEFT ANTI JOIN subquery2 ON subquery2.column = T1.field1
where (/* first criteria */)
and ucase(trim(CTIPTXFX))='VAR'
Please note that anti joins and UNION ALL can be expensive, so if your tables are huge, tune them accordingly.
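The idea of splitting the CASE into one branch per value can be tried out on a toy data set. The sketch below is an assumption-laden stand-in, not the original query: it runs on SQLite through Python's sqlite3 module, which has no LEFT ANTI JOIN, so each branch uses NOT EXISTS as the anti-join, and the tables `t1`, `fix_ref` and `var_ref` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER, kind TEXT, field1 TEXT);  -- hypothetical main table
    CREATE TABLE fix_ref (val TEXT);                       -- stands in for subquery 1
    CREATE TABLE var_ref (val TEXT);                       -- stands in for subquery 2
    INSERT INTO t1 VALUES (1, 'FIX', 'a'), (2, 'FIX', 'b'),
                          (3, 'VAR', 'a'), (4, 'VAR', 'c'),
                          (5, 'PUR', 'z');
    INSERT INTO fix_ref VALUES ('a');
    INSERT INTO var_ref VALUES ('c');
""")

# One branch per CASE value; each branch excludes rows found in its own lookup table.
rows = conn.execute("""
    SELECT id FROM t1
    WHERE upper(trim(kind)) = 'FIX'
      AND NOT EXISTS (SELECT 1 FROM fix_ref r WHERE r.val = t1.field1)
    UNION ALL
    SELECT id FROM t1
    WHERE upper(trim(kind)) = 'VAR'
      AND NOT EXISTS (SELECT 1 FROM var_ref r WHERE r.val = t1.field1)
    ORDER BY id
""").fetchall()

print(rows)
```

Rows with any other value (here 'PUR') fall into neither branch and are dropped, which matches a CASE expression without an ELSE evaluating to NULL in the WHERE clause.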

Most efficient way to filter a table with 1 to many parent and child mappings

I have a client table with a foreign key to itself, where each client has a specific id in each department but one master id. I am trying to find the most efficient way to restrict my query to just the master entry.
Here are the two (simplified) queries I have that work, but I feel there is a more efficient way to accomplish this, especially when joining to other large tables:
-- version 1
select
client.id
from
client
join client client2 on client.id = client2.masterid
and client2.id = client2.masterid
--version 2
select
client.id
from
client
where
client.id = client.masterid
-- Expanded view
select
t1.id masterid,
t1.dob dob,
trunc((months_between(trunc(sysdate),t1.dob)/12),0) age,
case
when substr(t1.zip,1,5) in ('48502','48503','48504','48505','48506','48507','48529','48532') then null
else
(select
max(audit1.operationid)
from
t2 audit1
where
t1.id = audit1.sourceid
and audit1.fieldname = 'ZIP'
and substr(audit1.oldvalue,1,5) in ('48502','48503','48504','48505','48506','48507','48529','48532')
and audit1.created >= to_date('04/25/2014', 'MM/DD/YYYY')
and 1 < (
select
count(audr.id)
from
t2 audr
WHERE
audr.operationid = audit1.operationid
and audr.fieldname in ('ADDRESS1','CITY')
)
) end auditref,
t1.address1 addr1,
t1.address2 addr2,
t1.city city,
substr(t1.zip,1,5) zip
from
t1
where
t1.id = t1.masterid
and 1 = case
when substr(t1.zip,1,5) in ('48502','48503','48504','48505','48506','48507','48529','48532') then 1
when substr(t1.zip,1,5) not in ('48502','48503','48504','48505','48506','48507','48529','48532') and exists
(select
1
from
t2 audit2
where
audit2.sourceid = t1.id
and audit2.fieldname = 'ZIP'
and substr(audit2.oldvalue,1,5) in ('48502','48503','48504','48505','48506','48507','48529','48532')
and audit2.created >= to_date('04/25/2014', 'MM/DD/YYYY')
) then 1
else 0
end
Any thoughts would be appreciated; every other way I have tried these joins has caused duplicate rows, as there can be many ids for each masterid.
Edit:
Here is a more expanded version of the query, but there are more joins and filters in use, and filtering with client.id = client.masterid makes the query run much slower.
The question is what the most effective way is to limit the t1 and t2 table scans, as these tables are huge...
Using the following join accomplished the goal of limiting the table scans:
from
client
left join client client1 on client1.masterid = client.id and client1.id is null
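To check that the two simplified queries (version 1 and version 2) pick out the same master rows without duplicates, they can be run against a small made-up client table. This is only a sketch: it uses SQLite through Python's sqlite3 module, the sample rows are hypothetical, and it says nothing about performance on the real tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE client (id INTEGER PRIMARY KEY, masterid INTEGER);
    -- hypothetical data: ids 1 and 4 are masters, the others point at them
    INSERT INTO client VALUES (1, 1), (2, 1), (3, 1), (4, 4), (5, 4);
""")

# Version 1: self join restricted to rows that are their own master.
version1 = conn.execute("""
    SELECT client.id
    FROM client
    JOIN client client2
      ON client.id = client2.masterid
     AND client2.id = client2.masterid
    ORDER BY client.id
""").fetchall()

# Version 2: plain filter on the same table, no join.
version2 = conn.execute("""
    SELECT client.id
    FROM client
    WHERE client.id = client.masterid
    ORDER BY client.id
""").fetchall()

print(version1, version2)
```

Both return one row per master on this data, so any difference between them in practice would come from the execution plan rather than from the result.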

USING limit/offset in a JOIN query

I have 4 tables:
An account table (user accounts)
user_id | username | password
---------+----------+----------
A projects table
project_id | project_name | category_id
------------+------------------------------+-------------
An accounts_projects table (many-to-many relationship)
accounts_projects_id | account_id | project_id
----------------------+------------+------------
A project_messages table (a project will have many messages)
message_id | project_id |message| username
------------+------------+--------+---------
At login, I'm running a query that fetches the projects a user belongs to and the messages for each project, using the query below:
SELECT account.user_id,account.username,
array_agg(json_build_object('message',project_messages.message,'username',project_messages.username)) AS messages,
project.project_name
FROM account
JOIN accounts_projects ON account.user_id = accounts_projects.account_id
JOIN project_messages ON accounts_projects.project_id = project_messages.project_id
JOIN project ON project.project_id = accounts_projects.project_id
WHERE account.username=$1
GROUP BY project.project_name,account.user_id
This gives me the output below:
user_id, username, messages (json array object), project_name
87;"kannaj";"{"{\"message\" : \"saklep\", \"username\" : \"kannaj\"}"}";"Football with Javascript"
87;"kannaj";"{"{\"message\" : \"work\", \"username\" : \"kannaj\"}","{\"message\" : \"you've been down to long in the midnight sea\", \"username\" : \"kannaj\"}","{\"message\" : \"Yeaaaa\", \"username\" : \"house\"}"}";"Machine Learning with Python"
87;"kannaj";"{"{\"message\" : \"holyy DIVVEERRR\", \"username\" : \"kannaj\"}"}";"Beethoven with react"
Is there a way I can use the LIMIT/OFFSET function when retrieving the messages from the project_messages table?
To make the examples simpler, let's say we have two linked tables:
t1(id);
t2(id, t1_id);
And the query is:
select t1.id, array_agg(t2.id)
from t1 join t2 on (t1.id = t2.t1_id)
group by t1.id;
As you can see, it is a very simplified variant of your large query.
1) Arrays
select t1.id, (array_agg(t2.id order by t2.id desc))[3:5]
from t1 join t2 on (t1.id = t2.t1_id)
group by t1.id;
This query works just like the original, but returns only elements 3, 4 and 5 of the array, which is equivalent to offset 2 limit 3.
2) Subquery and lateral
select
t1.id,
array_agg(t.x)
from
t1 join lateral
(select t2.id as x from t2 where t1.id = t2.t1_id order by t2.id desc offset 2 limit 3) t on (true)
group by t1.id;
Here the lateral keyword allows the subquery to use fields from other tables mentioned in the main from clause (t1.id).
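The "offset 2 limit 3 per group" idea can also be tried outside PostgreSQL. The sketch below is a swapped-in technique, not either of the two queries above: SQLite has neither array slices nor LATERAL, so it numbers the rows per group with ROW_NUMBER() and keeps positions 3 to 5. It assumes an SQLite build with window functions (3.25 or later), and the t1/t2 rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER);
    CREATE TABLE t2 (id INTEGER, t1_id INTEGER);
    INSERT INTO t1 VALUES (1), (2);
    -- hypothetical data: six children for t1.id = 1, two for t1.id = 2
    INSERT INTO t2 VALUES (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1),
                          (20, 2), (21, 2);
""")

# Number each group's rows newest-first, then keep positions 3..5 (offset 2, limit 3).
rows = conn.execute("""
    SELECT ranked.t1_id, ranked.id
    FROM (
        SELECT t2.t1_id, t2.id,
               ROW_NUMBER() OVER (PARTITION BY t2.t1_id ORDER BY t2.id DESC) AS rn
        FROM t2
    ) AS ranked
    WHERE ranked.rn BETWEEN 3 AND 5
    ORDER BY ranked.t1_id, ranked.rn
""").fetchall()

# Collect the surviving ids per parent, mimicking array_agg on the client side.
sliced = {}
for t1_id, t2_id in rows:
    sliced.setdefault(t1_id, []).append(t2_id)

print(sliced)
```

Parents with fewer than three children (t1.id = 2 here) do not appear in the result at all, so a real query would need an outer join from t1 if those parents must still be listed.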

How to convert SUBSELECT with TOP and ORDER BY to JOIN

I have a working sql select, which looks like this
[Edited: I'm sorry, I made a mistake in the question. I edited the alias of Table1, but I'm trying the answers.]
SELECT
m.Column1
,t2.Column2
,COALESCE
(
(
SELECT TOP 1 Vat
FROM LinkedDBServer.DatabaseName.dbo.TableName t3
WHERE
m.MaterialNumber = t3.MaterialNumber COLLATE Czech_CI_AS
and t3.Currency = …
and ...
ORDER BY [Date] DESC
), m.Vat
) as Vat
FROM Table1 m
JOIN Table2 t2 on (m.Column1 = t2.Column1)
It works, but the problem is that it takes too long, and the linked server cuts my connection because the query runs for more than 10 minutes. The purpose of the query is to get newer data from a different database if it exists (I get the newest data with TOP and ordering by date, and the precondition is that all data in that database is newer than in mine, which is why I'm using COALESCE).
My thought is that if I were able to rewrite it as a JOIN, it could be faster. Another problem could be that I don't have a primary key (and can't change that).
How can I speed that query up? (I'm using SQL Server 2008 R2)
Thank you
Here I attached the estimated query plan (readable with browser zoom). The estimate is for 2 COALESCE columns.
Try rewriting the query using OUTER APPLY:
SELECT
m.Column1
,t2.Column2
,COALESCE(ou.vat, m.Vat) as Vat
FROM Table1 m
JOIN Table2 t2 on (m.Column1 = t2.Column1)
outer apply
(
SELECT TOP 1 Vat
FROM LinkedDBServer.DatabaseName.dbo.TableName t3
WHERE
m.MaterialNumber = t3.MaterialNumber COLLATE Czech_CI_AS
and t3.Currency = …
and ...
ORDER BY [Date] DESC
) ou
Another option:
; WITH vat AS (
SELECT MaterialNumber COLLATE Czech_CI_AS As MaterialNumber
, Vat
, Row_Number() OVER (PARTITION BY MaterialNumber ORDER BY "Date" DESC) As sequence
FROM LinkedDBServer.DatabaseName.dbo.TableName
WHERE Currency = ...
AND ...
)
SELECT m.Column1
, t2.Column2
, Coalesce(vat.Vat, m.Vat) As Vat
FROM Table1 As m
INNER JOIN Table2 As t2
ON t2.Column1 = m.Column1
LEFT JOIN vat
ON vat.MaterialNumber = m.MaterialNumber
AND vat.sequence = 1
;
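The "rank the remote rows, then left join on rank 1" pattern from the second option can be exercised on a tiny data set. This is a sketch under assumptions: it runs on SQLite (3.25 or later, for window functions) through Python's sqlite3 module instead of SQL Server with a linked server, and the tables `local_materials` and `remote_prices`, their columns and their rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE local_materials (material TEXT, vat INTEGER);               -- hypothetical local table
    CREATE TABLE remote_prices (material TEXT, vat INTEGER, price_date TEXT); -- hypothetical remote table
    INSERT INTO local_materials VALUES ('M1', 21), ('M2', 15);
    INSERT INTO remote_prices VALUES ('M1', 19, '2013-01-01'),
                                     ('M1', 20, '2014-01-01');
""")

# Rank remote rows per material, newest first, and join only the newest one.
rows = conn.execute("""
    WITH latest AS (
        SELECT material, vat,
               ROW_NUMBER() OVER (PARTITION BY material ORDER BY price_date DESC) AS seq
        FROM remote_prices
    )
    SELECT m.material, COALESCE(latest.vat, m.vat) AS vat
    FROM local_materials AS m
    LEFT JOIN latest
      ON latest.material = m.material
     AND latest.seq = 1
    ORDER BY m.material
""").fetchall()

print(rows)
```

M1 takes the newest remote value and M2, which has no remote row, falls back to its local value, which is the behaviour the COALESCE in the original query is after.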

SQL Query to get time periods contained in another one in the same table

I have the following table (person_program):
program_id | person_id | start_date | end_date
1          | 15588499  | 01-01-2014 | 02-16-2014
2          | 15588499  | 02-17-2014 | 03-01-2014
3          | 15588499  | 02-15-2014 | 02-21-2014
I need to get the program_id that are contained by another time period in the same table (in this case, program_id = 3).
Any idea to solve this?
Thanks!!
Yes, you can reference the same table and get overlapping periods:
select t1.program_id ThisOne, t2.program_id OverlapsWith
from person_program t1
inner join person_program t2
on t1.program_id < t2.program_id
and t1.person_id = t2.person_id
and t2.start_date <= t1.end_date
and t1.start_date <= t2.end_date
Try:
select *
from person_program x
where exists (select 1 from person_program y where y.start_date <= x.start_date and y.person_id = x.person_id and y.program_id <> x.program_id)
or exists (select 1 from person_program y where y.end_date >= x.end_date and y.person_id = x.person_id and y.program_id <> x.program_id)
Note: This finds situations where, for the same person_id, there is another record in the table, with a different program_id, having a start and end date FULLY contained within the range of another.
EDIT: I just changed the AND to OR, since from your post it looks like you are looking for partial, not necessarily full, overlaps. This should do that.
Assuming you are passing in the program_id you want to find overlaps for, and partial overlaps are OK:
DECLARE @program_id int = 3
SELECT PP.program_id,
PP.person_id
FROM person_program PP
INNER JOIN person_program Source
ON PP.start_date <= Source.end_date
AND PP.end_date >= Source.start_date
AND Source.program_id = @program_id
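The partial-overlap test used in the last answer (one period starts before the other ends, and ends after the other starts) can be checked against the sample rows from the question. The sketch below makes a few assumptions: it runs on SQLite through Python's sqlite3 module, it stores the dates as ISO 'YYYY-MM-DD' text so that plain string comparison orders them correctly, and it additionally matches on person_id and lists each pair once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person_program (
        program_id INTEGER, person_id INTEGER, start_date TEXT, end_date TEXT
    );
    -- rows from the question, with dates rewritten as ISO text (an assumption of this sketch)
    INSERT INTO person_program VALUES
        (1, 15588499, '2014-01-01', '2014-02-16'),
        (2, 15588499, '2014-02-17', '2014-03-01'),
        (3, 15588499, '2014-02-15', '2014-02-21');
""")

# Two periods overlap when each one starts on or before the other one ends.
pairs = conn.execute("""
    SELECT a.program_id, b.program_id
    FROM person_program a
    JOIN person_program b
      ON a.person_id = b.person_id
     AND a.program_id < b.program_id      -- report each pair once
     AND a.start_date <= b.end_date
     AND b.start_date <= a.end_date
    ORDER BY a.program_id, b.program_id
""").fetchall()

print(pairs)
```

On this data program 3 overlaps both program 1 and program 2 but is not fully inside either of them, so a strict containment test would return nothing for it.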