Error with Hive SubQuery - Unsupported SubQuery Expression

create table db.temp
location '/user/temp' as
SELECT t1.mobile_no
FROM db.temp t1
WHERE NOT EXISTS ( SELECT NULL
FROM db.temp t2
WHERE t1.mobile_no = t2.mobile_no
AND t1.cell != t2.cell
AND t2.access_time BETWEEN t1.access_time
AND t1.access_time_5);
I need to get all the users who used the same cell throughout the 5-hour interval that starts at access_time and ends at access_time_5. This code works perfectly fine in Impala, but it does not work in Hive.
Hive gives this error:
"Error while compiling statement: FAILED:
SemanticException [Error 10249]: line 23:25 Unsupported SubQuery
Expression"
I looked at a similar question related to this error but can't figure out the solution. Any help would be highly appreciated!

Hive supports neither a correlated BETWEEN in a subquery nor non-equi joins. Try rewriting the query with a LEFT JOIN: count the rows that match your condition, then filter on that count:
select mobile_no from
(
SELECT t1.mobile_no,
sum(case when t1.cell != t2.cell
and t2.access_time between t1.access_time and t1.access_time_5
then 1 else 0
end) as cnt_exclude
FROM db.temp t1
LEFT JOIN db.temp t2 on t1.mobile_no = t2.mobile_no
GROUP BY t1.mobile_no
)s
where cnt_exclude=0
The problem with this solution is that the LEFT JOIN may produce a huge number of duplicated rows, which will hurt performance, though it may work if the data is not too big.
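To see the count-and-filter rewrite in action, here is a minimal sketch that runs the same query shape against an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in for Hive here; the table name visits, the integer timestamps, and the sample rows are assumptions made for the illustration.

```python
import sqlite3


def users_on_one_cell(rows):
    """Return mobile_no values that never see a different cell inside a window.

    rows: (mobile_no, cell, access_time, access_time_5) tuples. The integer
    timestamps (seconds) and the table name are assumptions for this sketch.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE visits (mobile_no TEXT, cell TEXT, "
        "access_time INTEGER, access_time_5 INTEGER)"
    )
    con.executemany("INSERT INTO visits VALUES (?, ?, ?, ?)", rows)
    query = """
        SELECT mobile_no
        FROM (
            SELECT t1.mobile_no,
                   -- count the row pairs that the NOT EXISTS would have rejected
                   SUM(CASE WHEN t1.cell != t2.cell
                             AND t2.access_time BETWEEN t1.access_time
                                                    AND t1.access_time_5
                            THEN 1 ELSE 0 END) AS cnt_exclude
            FROM visits t1
            LEFT JOIN visits t2 ON t1.mobile_no = t2.mobile_no
            GROUP BY t1.mobile_no
        ) s
        WHERE cnt_exclude = 0
        ORDER BY mobile_no
    """
    result = [r[0] for r in con.execute(query)]
    con.close()
    return result


sample = [
    ("A", "c1", 0, 18000),      # A stays on c1
    ("A", "c1", 3600, 21600),
    ("B", "c1", 0, 18000),      # B switches to c2 two hours later
    ("B", "c2", 7200, 25200),
    ("C", "c1", 0, 18000),      # C switches, but only after the window ends
    ("C", "c2", 20000, 38000),
]
print(users_on_one_cell(sample))
```

Note that the join pairs every row of a user with every other row of the same user, which is the duplication cost mentioned above.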

It seems to me that window functions would be better for both databases. Let me assume that access_time is a Unix time (i.e. measured in seconds); if it is not, you can easily convert it to one. Five hours is then 18,000 seconds:
SELECT t1.mobile_no
FROM (SELECT t1.*,
MIN(t1.cell) OVER (PARTITION BY mobile_no
ORDER BY access_time
RANGE BETWEEN 17999 preceding AND CURRENT ROW
) as min_cell,
MAX(t1.cell) OVER (PARTITION BY mobile_no
ORDER BY access_time
RANGE BETWEEN 17999 preceding AND CURRENT ROW
) as max_cell
FROM db.temp t1
) t1
WHERE min_cell = max_cell;
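A small sketch of how this range window behaves, again using SQLite through Python's sqlite3 as a stand-in engine (it needs SQLite 3.28 or newer for RANGE with a numeric offset). The visits table and the sample timestamps are assumptions. Note that the window looks back 5 hours from each row, and the query returns rows rather than distinct users.

```python
import sqlite3


def rows_with_single_cell(rows):
    """Return (mobile_no, access_time) rows whose trailing 5 hours use one cell.

    rows: (mobile_no, cell, access_time) with access_time in Unix seconds.
    The table name and sample values are assumptions for this sketch.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE visits (mobile_no TEXT, cell TEXT, access_time INTEGER)"
    )
    con.executemany("INSERT INTO visits VALUES (?, ?, ?)", rows)
    query = """
        SELECT mobile_no, access_time
        FROM (SELECT v.*,
                     MIN(v.cell) OVER (PARTITION BY mobile_no
                                       ORDER BY access_time
                                       RANGE BETWEEN 17999 PRECEDING AND CURRENT ROW
                                      ) AS min_cell,
                     MAX(v.cell) OVER (PARTITION BY mobile_no
                                       ORDER BY access_time
                                       RANGE BETWEEN 17999 PRECEDING AND CURRENT ROW
                                      ) AS max_cell
              FROM visits v
             ) t
        WHERE min_cell = max_cell   -- one distinct cell inside the window
        ORDER BY mobile_no, access_time
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


sample = [
    ("A", "c1", 0),
    ("A", "c1", 3600),
    ("A", "c2", 10000),   # c1 rows are still inside the 5-hour window
    ("A", "c2", 40000),   # earlier rows have dropped out of the window
]
print(rows_with_single_cell(sample))
```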

Related

LAG() function in sql 2008

I have looked at a few other questions regarding this problem. We are trying to get a stored procedure working that contains the LAG() function, but the machine we are now installing an instance on runs SQL Server 2008, which does not support LAG().
SELECT se.SetID,SetName,ParentSetId,
qu.QuestionID,qu.QuestionText,qu.QuestionTypeID,qu.IsPublished,qu.IsFilter,
qu.IsRequired,qu.QueCode,qu.IsDisplayInTable,
Case when (LAG(se.ParentSetId) OVER(ORDER BY se.ParentSetId) <> ParentSetId) then 2 else 1 end level ,
QuestionType
FROM tblSet se
LEFT join tblQuestion qu on qu.SetID=se.SetID
Inner join tblQuestionType qt on qt.QuestionTypeID=qu.QuestionTypeID and qt.IsAnswer=1
where CollectionId=@colID and se.IsDeleted=0
order by se.SetID
What I've tried so far (edited to reflect Zohar Peled's suggestion):
SELECT se.SetID,se.SetName,se.ParentSetId,
qu.QuestionID,qu.QuestionText,qu.QuestionTypeID,qu.IsPublished,qu.IsFilter,
qu.IsRequired,qu.QueCode,qu.IsDisplayInTable,
(case when row_number() over (partition by se.parentsetid
order by se.parentsetid
) = 1
then 1 else 2
end) as level,
QuestionType
FROM tblSet se
left join tblSet se2 on se.ParentSetId = se2.ParentSetId -1
LEFT join tblQuestion qu on qu.SetID=se.SetID
Inner join tblQuestionType qt on qt.QuestionTypeID=qu.QuestionTypeID and qt.IsAnswer=1
where se.CollectionId=@colID and se.IsDeleted=0
order by se.SetID
It does not seem to return all of the same records when I run the two queries side by side, and the level values also differ.
I have put some of the output into an HTML table: the first set of results comes from the version containing LAG(), and the second from the new version, where the levels do not come out the same.
https://jsfiddle.net/gyn8Lv3u/
LAG() can be implemented using a self-join as Jeroen wrote in his comment, or by using a correlated subquery. In this case, it's a simple lag() so the correlated subquery is also simple:
SELECT se.SetID,SetName,ParentSetId,
qu.QuestionID,qu.QuestionText,qu.QuestionTypeID,qu.IsPublished,qu.IsFilter,
qu.IsRequired,qu.QueCode,qu.IsDisplayInTable,
Case when (
(
SELECT TOP 1 ParentSetId
FROM tblSet seInner
WHERE seInner.ParentSetId < se.ParentSetId
ORDER BY seInner.ParentSetId DESC
)
<> ParentSetId) then 2 else 1 end level ,
QuestionType
FROM tblSet se
LEFT join tblQuestion qu on qu.SetID=se.SetID
Inner join tblQuestionType qt on qt.QuestionTypeID=qu.QuestionTypeID and qt.IsAnswer=1
where CollectionId=@colID and se.IsDeleted=0
order by se.SetID
If you had specified an offset, it would be harder to implement using a correlated subquery, and a self-join would be a much easier solution.
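As a rough illustration of the correlated-subquery technique, the sketch below runs it in SQLite through Python's sqlite3, where LIMIT 1 plays the role of SQL Server's TOP 1. The table sets and its sample rows are assumptions. One thing it makes visible: the subquery returns the largest strictly smaller ParentSetId, so rows that share a ParentSetId all get the same "previous" value.

```python
import sqlite3


def levels(rows):
    """Emulate a one-step LAG over parent_set_id with a correlated subquery.

    rows: (set_id, parent_set_id). Table and column names are assumptions.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sets (set_id INTEGER, parent_set_id INTEGER)")
    con.executemany("INSERT INTO sets VALUES (?, ?)", rows)
    query = """
        SELECT set_id, parent_set_id, prev_parent,
               CASE WHEN prev_parent <> parent_set_id THEN 2 ELSE 1 END AS level
        FROM (
            SELECT s.set_id, s.parent_set_id,
                   -- largest parent_set_id strictly below the current one
                   (SELECT i.parent_set_id
                    FROM sets i
                    WHERE i.parent_set_id < s.parent_set_id
                    ORDER BY i.parent_set_id DESC
                    LIMIT 1) AS prev_parent
            FROM sets s
        ) t
        ORDER BY set_id
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


sample = [(1, 10), (2, 10), (3, 20), (4, 30)]
for row in levels(sample):
    print(row)
```

When there is no smaller value, prev_parent is NULL, the comparison is not true, and the CASE falls through to level 1.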
Sample data and desired results would help. This construct:
(case when (LAG(se.ParentSetId) OVER(ORDER BY se.ParentSetId) <> ParentSetId) then 2 else 1
end) as level
is quite strange. You are lagging by the only column used in the order by. That makes sense. But then you are comparing the value to the same column, implying that there are duplicates.
If you have duplicates, then order by se.ParentSetId is unstable. That is, the "previous" row is indeterminate because of the duplicate values being ordered. You can run the query twice and get different results.
I am guessing you want one row with the value 1 for each parent set id. If so, then in either database, you would use:
(case when row_number() over (partition by se.parentsetid
order by se.parentsetid
) = 1
then 1 else 2
end) as level
This also has the problem with an unstable ordering. You can fix this by changing the order by to what you really want.

Select prev date in column TERADATA

I have a table with a date column.
I need to select this column and, in addition, the previous date (which is not stored in the database) if it exists, or otherwise the current date.
I tried the following query:
select hst1.QUERYID,hst1.starttime,
ZEROIFNULL(hst2.starttime) as delta
from dbqlogtbl_dba_hst hst1
left outer join dbqlogtbl_dba_hst hst2 on
hst1.QUERYID = hst2.QUERYID;
I am getting errors when fetching the results.
You seem to just want lag():
select hst1.QUERYID, hst1.starttime,
lag(hst1.starttime) over (order by hst1.starttime)
from dbqlogtbl_dba_hst hst1 left join
dbqlogtbl_dba_hst hst2
on hst1.QUERYID = hst2.QUERYID ;
I am guessing that you really want this per queryid, so you would then need partition by:
lag(hst1.starttime) over (partition by hst1.QUERYID order by hst1.starttime)
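To show what the PARTITION BY changes, here is a minimal sketch using SQLite (3.25 or newer) through Python's sqlite3 rather than Teradata. The table name hst, the text timestamps, and the sample rows are assumptions for the illustration; the self-join is left out so that only the effect of lag() is visible.

```python
import sqlite3


def previous_start(partitioned):
    """Return (queryid, starttime, previous starttime) using lag().

    partitioned=True restarts the lag for each queryid. The table name and
    sample rows are assumptions for this sketch.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE hst (queryid INTEGER, starttime TEXT)")
    con.executemany(
        "INSERT INTO hst VALUES (?, ?)",
        [
            (1, "2020-01-01 10:00"),
            (1, "2020-01-01 11:00"),
            (2, "2020-01-01 10:30"),
            (2, "2020-01-01 12:00"),
        ],
    )
    window = "ORDER BY starttime"
    if partitioned:
        window = "PARTITION BY queryid " + window
    query = f"""
        SELECT queryid, starttime,
               lag(starttime) OVER ({window}) AS prev_starttime
        FROM hst
        ORDER BY queryid, starttime
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


print(previous_start(partitioned=False))
print(previous_start(partitioned=True))
```

Without the partition, the previous row can belong to a different queryid; with it, the first row of each queryid gets NULL.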

ORACLE SQL - Compare dates without join

I have a very large table of data (1+ billion rows). If I try to join that table to itself to do a comparison, the cost on the estimated plan is unrunnable (cost: 226831405289150). Is there a way I can achieve the same results as the query below without a join, perhaps with OVER (PARTITION BY ...)?
What I need to do is make sure that no other event happened within 24 hours before or after the event with the wildcare was received.
Thanks so much for your help!
select e2.SYSTEM_NO,
min(e2.DT) as dt
from SYSTEM_EVENT e2
inner join table1.event el2
on el2.event_id = e2.event_id
left join ( Select se.DT
from SYSTEM_EVENT se
where
--fails
( se.event_id in ('101','102','103','104')
--restores
or se.event_id in ('106','107','108','109')
)
) e3
on e3.dt-e2.dt between .0001 and 1
or e3.dt-e2.dt between -1 and .0001
where el2.descr like '%WILDCARE%'
and e3.dt is null
and e2.REC_STS_CD = 'A'
group by e2.SYSTEM_NO
Without any test data it is difficult to determine what you are trying to achieve, but it appears you could try using an analytic function with a range window:
SELECT system_no,
MIN( dt ) AS dt
FROM (
SELECT se.system_no,
se.dt,
se.event_id, -- needed by the outer EXISTS
se.rec_sts_cd, -- needed by the outer WHERE
COUNT(
CASE
WHEN ( se.event_id in ('101','102','103','104') --fails
OR se.event_id in ('106','107','108','109') ) --restores
THEN 1
END
) OVER (
ORDER BY se.dt
RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING
) AS num
FROM system_event se
) se
WHERE num = 0
AND REC_STS_CD = 'A'
AND EXISTS(
SELECT 1
FROM table1.event te
WHERE te.descr like '%WILDCARE%'
AND te.event_id = se.event_id
)
GROUP BY system_no
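A minimal sketch of the core idea, the conditional COUNT over a range window of one day either side, run in SQLite (3.28 or newer) through Python's sqlite3 instead of Oracle. Here dt is stored as a plain number of days and event id '900' is a made-up non-failure event; both are assumptions for the illustration.

```python
import sqlite3

FAIL_OR_RESTORE = "('101','102','103','104','106','107','108','109')"


def events_without_nearby_failures(events):
    """Return (event_id, dt) rows with no fail/restore event within 1 day.

    events: (event_id, dt) where dt is a number of days. The numeric dates
    and the sample event ids are assumptions for this sketch.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE system_event (event_id TEXT, dt REAL)")
    con.executemany("INSERT INTO system_event VALUES (?, ?)", events)
    query = f"""
        SELECT event_id, dt
        FROM (
            SELECT event_id, dt,
                   -- COUNT skips NULLs, so only fail/restore rows are counted
                   COUNT(CASE WHEN event_id IN {FAIL_OR_RESTORE} THEN 1 END)
                       OVER (ORDER BY dt
                             RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS num
            FROM system_event
        ) se
        WHERE num = 0
        ORDER BY dt
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


sample = [
    ("900", 1.0),   # a failure follows half a day later
    ("101", 1.5),   # a failure; it counts itself
    ("900", 5.0),   # nothing within a day on either side
    ("106", 7.5),   # a restore; it counts itself
]
print(events_without_nearby_failures(sample))
```

The table is scanned once and ordered by dt, so no self-join is needed.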
This is not a direct answer to your question, but it is a bit too long for a comment.
How old can newly inserted data be? If data is inserted incrementally, a 48-hour window means you only need to check a subset of the data, not the whole 1-billion-row table. If that is the case, reduce the data being compared with a WITH clause or a temporary table.
If you still need to compare across the whole table, I would partition by event_id, or by another attribute if it gives a better partition, and compare each group separately.
The condition where el2.descr like '%WILDCARE%' is a performance killer for such a huge table.

Count query giving wrong column name error

select COUNT(analysed) from Results where analysed="True"
I want to display count of rows in which analysed value is true.
However, my query gives the error: "The multi-part identifier "Results.runId" could not be bound.".
This is the actual query:
select ((SELECT COUNT(*) AS 'Count'
FROM Results
WHERE Analysed = 'True')/failCount) as PercentAnalysed
from Runs
where Runs.runId=Analysed.runId
My table schema is:
The value I want for a particular runId is: (the number of entries where analysed=true)/failCount
EDIT : How to merge these two queries?
i) select runId,Runs.prodId,prodDate,prodName,buildNumber,totalCount as TotalTestCases,(passCount*100)/(passCount+failCount) as PassPercent,
passCount,failCount,runOwner from Runs,Product where Runs.prodId=Product.prodId
ii) select (cast(counts.Count as decimal(10,4)) / cast(failCount as decimal(10,4))) as PercentAnalysed
from Runs
inner join
(
SELECT COUNT(*) AS 'Count', runId
FROM Results
WHERE Analysed = 'True'
GROUP BY runId
) counts
on counts.runId = Runs.runId
I tried this :
select runId,Runs.prodId,prodDate,prodName,buildNumber,totalCount as TotalTestCases,(passCount*100)/(passCount+failCount) as PassPercent,
passCount,failCount,runOwner,counts.runId,(cast(counts.Count as decimal(10,4)) / cast(failCount as decimal(10,4))) as PercentAnalysed
from Runs,Product
inner join
(
SELECT COUNT(*) AS 'Count', runId
FROM Results
WHERE Analysed = 'True'
GROUP BY runId
) counts
on counts.runId = Runs.runId
where Runs.prodId=Product.prodId
but it gives an error.
Your problems are arising from improper joining of tables. You need information from both Runs and Results, but they aren't combined properly in your query. You have the right idea with a nested subquery, but it's in the wrong spot. You're also referencing the Analysed table in the outer where clause, but it hasn't been included in the from clause.
Try this instead:
select (cast(counts.Count as decimal(10,4)) / cast(failCount as decimal(10,4))) as PercentAnalysed
from Runs
inner join
(
SELECT COUNT(*) AS 'Count', runId
FROM Results
WHERE Analysed = 'True'
GROUP BY runId
) counts
on counts.runId = Runs.runId
I've set this up as an inner join to eliminate any runs which don't have analysed results; you can change it to a left join if you want those rows, but will need to add code to handle the null case. I've also added casts to the two numbers, because otherwise the query will perform integer division and truncate any fractional amounts.
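The effect of the casts can be seen in a small sketch run in SQLite through Python's sqlite3. SQLite has no decimal(10,4), so REAL stands in for it here; the sample rows and the alias cnt are also assumptions for the illustration.

```python
import sqlite3


def percent_analysed():
    """Return (runId, integer quotient, cast quotient) for each run.

    The sample data is made up; REAL replaces decimal(10,4) because this
    sketch runs on SQLite rather than SQL Server.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE Runs (runId INTEGER, failCount INTEGER)")
    con.execute("CREATE TABLE Results (runId INTEGER, Analysed TEXT)")
    con.executemany("INSERT INTO Runs VALUES (?, ?)", [(1, 4), (2, 2)])
    con.executemany(
        "INSERT INTO Results VALUES (?, ?)",
        [(1, "True"), (1, "True"), (1, "True"), (1, "False"), (2, "True")],
    )
    query = """
        SELECT Runs.runId,
               counts.cnt / failCount AS truncated,           -- integer division
               CAST(counts.cnt AS REAL) / CAST(failCount AS REAL) AS PercentAnalysed
        FROM Runs
        INNER JOIN (
            SELECT COUNT(*) AS cnt, runId
            FROM Results
            WHERE Analysed = 'True'
            GROUP BY runId
        ) counts
        ON counts.runId = Runs.runId
        ORDER BY Runs.runId
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


for row in percent_analysed():
    print(row)
```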
I'd try the following query:
SELECT COUNT(*) AS 'Count'
FROM Results
WHERE Analysed = 'True'
This will count all of your rows where Analysed is 'True'. This should work if the datatype of your Analysed column is either BIT (Boolean) or STRING(VARCHAR, NVARCHAR).
Use CASE in Count
SELECT COUNT(CASE WHEN analysed='True' THEN analysed END) [COUNT]
FROM Results
select COUNT(*) from Results where analysed='True'

SQL Using PARTITION when comparing values in consecutive DataRows

I'm using a SQL statement to compare consecutive values of a field [Allocation] as follows:
;WITH cteMain AS
(SELECT AllocID, CaseNo, FeeEarner, Allocation, ROW_NUMBER() OVER (ORDER BY AllocID) AS sn
FROM tblAllocations)
SELECT m.AllocID, m.CaseNo, m.FeeEarner, m.Allocation,
ISNULL(sLag.Allocation, 0) AS prevAllocation,
(m.Allocation - ISNULL(sLag.Allocation, 0)) AS movement
FROM cteMain AS m
LEFT OUTER JOIN cteMain AS sLag
ON sLag.sn = m.sn-1;
The query returns a calculated field [movement] which is the increase or decrease in consecutive values of [Allocation].
I have included a screen shot of the data returned by this query.
However the query is not yet complete. I need to revise the statement so that the consecutive values of [Allocation] compared are grouped / partitioned by [FeeEarner] and [CaseNo].
For example, at line 18 of the data, the [Allocation] is 800 and is compared to a previous value of 600. But the previous value belongs to a different [CaseNo] i.e. 6 rather than 31. In fact [FeeEarner] 'PJW' has no previous [Allocation] on [CaseNo] '31' and so the [prevAllocation] should be '0' from the ISNULL keyword.
I have tried changing
OVER (ORDER BY AllocID)
to
OVER (PARTITION BY CaseNo, FeeEarner ORDER BY AllocID)
But that results in a lot of lines of data being repeated.
Can someone advise how to compare consecutive values of [Allocation] but only between rows of data with matching [FeeEarner] AND [CaseNo] please?
NOTE - I cannot use LAG because my customer is using SQL Server 2008 R2 which does not support Parallel Data Warehousing.
I believe you were close. Try this (notice the added pieces in the join clause to match the partition - without this you will match every row number 3 with every row number 2 across partitions, which is what you were seeing):
;WITH cteMain AS
(
SELECT AllocID, CaseNo, FeeEarner, Allocation,
ROW_NUMBER() OVER (PARTITION BY CaseNo, FeeEarner ORDER BY AllocID) AS sn
FROM tblAllocations
)
SELECT m.AllocID, m.CaseNo, m.FeeEarner, m.Allocation,
ISNULL(sLag.Allocation, 0) AS prevAllocation,
(m.Allocation - ISNULL(sLag.Allocation, 0)) AS movement
FROM cteMain AS m
LEFT OUTER JOIN cteMain AS sLag
ON sLag.CaseNo = m.CaseNo
AND sLag.FeeEarner = m.FeeEarner
AND sLag.sn = m.sn-1
You need to change your join condition as well:
FROM cteMain m LEFT OUTER JOIN
cteMain sLag
ON sLag.sn = m.sn-1 and sLag.FeeEarner = m.FeeEarner and slag.CaseNo = m.CaseNo
Also, you should have only one order by in the row_number() call.
Also, if you are using Oracle, SQL Server 2012, newer versions of DB2, or Postgres, then the lead()/lag() functions would be a better choice.
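To check why the join has to repeat the partition columns, here is a minimal sketch run in SQLite (3.25 or newer) through Python's sqlite3, with COALESCE standing in for ISNULL. The four sample allocations are assumptions for the illustration.

```python
import sqlite3


def movements(match_partition):
    """Return (AllocID, Allocation, prevAllocation, movement) rows.

    match_partition=True also joins on CaseNo and FeeEarner, mirroring the
    PARTITION BY. The sample rows are assumptions for this sketch.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE tblAllocations "
        "(AllocID INTEGER, CaseNo INTEGER, FeeEarner TEXT, Allocation INTEGER)"
    )
    con.executemany(
        "INSERT INTO tblAllocations VALUES (?, ?, ?, ?)",
        [(1, 6, "PJW", 600), (2, 31, "PJW", 800),
         (3, 6, "PJW", 700), (4, 31, "PJW", 900)],
    )
    extra = ""
    if match_partition:
        extra = "AND sLag.CaseNo = m.CaseNo AND sLag.FeeEarner = m.FeeEarner"
    query = f"""
        WITH cteMain AS (
            SELECT AllocID, CaseNo, FeeEarner, Allocation,
                   ROW_NUMBER() OVER (PARTITION BY CaseNo, FeeEarner
                                      ORDER BY AllocID) AS sn
            FROM tblAllocations
        )
        SELECT m.AllocID, m.Allocation,
               COALESCE(sLag.Allocation, 0) AS prevAllocation,
               m.Allocation - COALESCE(sLag.Allocation, 0) AS movement
        FROM cteMain AS m
        LEFT OUTER JOIN cteMain AS sLag
          ON sLag.sn = m.sn - 1 {extra}
        ORDER BY m.AllocID, prevAllocation
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


print(len(movements(match_partition=False)))  # repeated rows across partitions
print(movements(match_partition=True))
```

With the row number restarting in every partition, joining on sn alone matches each row 2 with row 1 of every partition, which is the repetition described in the question.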
One more option with OUTER APPLY and EXISTS
SELECT t1.AllocID, t1.CaseNo, t1.FeeEarner, t1.Allocation,
ISNULL(o.Allocation, 0) AS PrevAllocation,
(t1.Allocation - ISNULL(o.Allocation, 0)) AS movement
FROM tblAllocations t1
OUTER APPLY (
SELECT t2.AllocID, t2.CaseNo, t2.FeeEarner, t2.Allocation
FROM tblAllocations t2
WHERE EXISTS (
SELECT 1
FROM tblAllocations t3
WHERE t1.AllocID > t3.AllocID
-- only look at earlier rows of the same case and fee earner
AND t3.CaseNo = t1.CaseNo
AND t3.FeeEarner = t1.FeeEarner
HAVING MAX(t3.AllocID) = t2.AllocID
) AND t1.CaseNo = t2.CaseNo
) o