Issue with Hive Rank queries

Issue with Hive Rank queries - sql

select season,violation_code, cnt,
RANK() over (Partition BY season order by cnt desc) AS rank
from
( select season,violation_code,
count(*) as cnt
from ParkingViolations_seondary
group by season,violation_code
) tmp
where rank <= 3
I'm new to Hive. Can somebody help me what is wrong with the above query?
It throws the following error:
Error while compiling statement:
FAILED: SemanticException [Error 10004]: line 4:6 Invalid table alias
or column reference 'rank': (possible column names are: season,
violation_code, cnt)
Any quick help would be appreciated.

Use subquery to be able to address rank in the where clause:
select season, violation_code, cnt, rnk
from
( select season,violation_code, cnt,
RANK() over (Partition BY season order by cnt desc) AS rnk
from
( select season,violation_code,
count(*) as cnt
from ParkingViolations_seondary
group by season,violation_code
) tmp
)s
where rnk <= 3

Yes i was also able to get it working with the following:
SELECT * FROM
(
SELECT season,violation_code, cnt, RANK() over (Partition BY season ORDER BY cnt DESC) AS frequency
FROM
(SELECT season,violation_code, COUNT(*) as cnt FROM ParkingViolations_seondary
WHERE (violation_code <> 0) and (street_code1 <> 0 or street_code2 <> 0 or street_code3 <> 0)
GROUP BY season,violation_code)TMP
)TMP1
WHERE frequency <= 3;

Related

How to use conditions with a RANK statement

The following piece of code does its job : it gives me the top 10 results for each category.
SELECT *
FROM (
SELECT *, RANK() OVER (PARTITION BY "pera_id" ORDER BY "surface" DESC) AS rnk
FROM "testBadaBing"
) AS x
WHERE rnk <= 10
Now I'd like to add conditions so that the number of results may vary based on a criteria. Example : if "note" = 1, then I want to retain 1 result, else make it 3.
I tried something along the lines which you can see below using the CASE WHEN statement but as you might expect it doesn't work. Error returned :
1 - near "CASE": syntax error
SELECT *
CASE WHEN "note" = 1 THEN
SELECT *
FROM (
SELECT *, RANK() OVER (PARTITION BY "pera_id" ORDER BY "surface" DESC) AS rnk
FROM "testBadaBing"
) AS x
WHERE rnk <= 1
ELSE
SELECT *
FROM (
SELECT *, RANK() OVER (PARTITION BY "pera_id" ORDER BY "surface" DESC) AS rnk
FROM "testBadaBing"
) AS x
WHERE rnk <= 3
END
Do you have any ideas how to make this work? My knowledge of SQL is pretty limited. The code has to be SQLite/SpatiaLite compatible as I'm working in the QGIS environment. Thanks.

You can use boolean logic in the WHERE clause of the outer query:
SELECT *
FROM (
SELECT t.*,
RANK() OVER (PARTITION BY "pera_id" ORDER BY "surface" DESC) AS rnk
FROM "testBadaBing" t
) AS x
WHERE ("note" = 1 and rnk = 1) OR rnk <= 3

sqlite query python

i want to perform the below query
SELECT t1.patient_report, COUNT(*) AS cnt, t1.doctor_report, (SELECT t2.doctor_report FROM infoTable t2 WHERE t2.patient_report = t1.patient_report AND cnt > 1 LIMIT 3) AS Doctors FROM infoTable t1 WHERE t1.patient_report != 'N/A' GROUP BY t1.patient_report ORDER BY cnt DESC
but i got this error!
Result: no such column: cnt
please how can i solve the problem ?

The subquery can't access the aliased column cnt of the outer query.
But even if it did have accesss, another error would be thrown because the subquery may return more than 1 rows.
I think that you can do what you want with window functions and GROUP_CONCAT():
SELECT t.patient_report, t.cnt,
GROUP_CONCAT(doctor_report) AS Doctors
FROM (
SELECT patient_report,
doctor_report,
ROW_NUMBER() OVER (PARTITION BY patient_report) rn,
COUNT(*) OVER (PARTITION BY patient_report) cnt
FROM infoTable
WHERE patient_report <> 'N/A'
) t
WHERE t.rn <= 3 AND t.cnt > 1
GROUP BY t.patient_report, t.cnt
ORDER BY t.cnt DESC

Select Random Values for Grouped Dataset

I'm no whizz at SQL. However I'm using the following query:
select count(*) as countis, avclassfamily
from malwarehashesandstrings
where behaviouralbinary IS true and
avclassfamily != 'SINGLETON'
group by avclassfamily
ORDER BY countis desc
LIMIT 50;
I would like to select 3 random hashes from the malwarehashsha256 column grouped by the avclassfamily column.
The following query works, question over:
select count(*) as countis,avclassfamily from malwarehashesandstrings where behaviouralbinary IS true and avclassfamily != 'SINGLETON' group by avclassfamily ORDER BY countis desc LIMIT 50;
virustotal=# select m.avclassfamily, m.cnt,
array_agg(malwarehashsha256)
from (select malwarehashesandstrings.*,
count(*) over (partition by avclassfamily) as cnt,
row_number() over (partition by avclassfamily order by random()) as seqnum
from malwarehashesandstrings
where behaviouralbinary and
avclassfamily <> 'SINGLETON'
) as m
where seqnum <= 3
group by m.avclassfamily, m.cnt ORDER BY m.cnt DESC LIMIT 50;

If I understand correctly, you can use row_number():
select m.*
from (select m.*,
row_number() over (partition by m.avclassfamily order by random()) as seqnum
from malwarehashesandstrings m
where m.behaviouralbinary and
m.avclassfamily <> 'SINGLETON'
) m
where seqnum <= 3;
If you want this in a column in your existing query, one method is:
select m.avgclassfamily, m.cnt,
array_agg(m.malwarehashsha256)
from (select m.*,
count(*) over (partition by m.avgclassfamily) as cnt,
row_number() over (partition by m.avclassfamily order by random()) as seqnum
from malwarehashesandstrings m
where m.behaviouralbinary and
m.avclassfamily <> 'SINGLETON'
) m
where seqnum <= 3
group by m.avgclassfamily, m.cnt;

Select most recent status for each ID and department code

I have the following table:
I want to get the most recent status for each dept_code that a CL_ID has. So the desired output would be this:
I have tried the following but this give me just the most recent status for each client and not each of their dept_codes.
SELECT *
FROM [CIMSHR6_MERGED].[dbo].[C3CLSTAT] C
INNER JOIN
(SELECT CLIENT_NUMBER, MAX(STATUS_DATE) AS SDATE
FROM [CIMSHR6_MERGED].[dbo].[C3CLSTAT]
GROUP BY CLIENT_NUMBER) X
ON X.CLIENT_NUMBER = C.CLIENT_NUMBER
AND X.SDATE = C.STATUS_DATE
ORDER BY C.CLIENT_NUMBER
Any help would be much appreciated. Thanks.

A convenient method that works in SQL Server is:
select top (1) cl.*
from [CIMSHR6_MERGED].[dbo].[C3CLSTAT] cl
order by row_number() over (partition by cl_id, dept_code order by status_date desc);
A method that is efficient with the right indexes in almost any database is:
select cl.*
from [CIMSHR6_MERGED].[dbo].[C3CLSTAT] cl
where cl.status_date = (select max(cl2.status_date)
from [CIMSHR6_MERGED].[dbo].[C3CLSTAT] cl2
where cl2.cl_id = cl.cl_id and cl2.dept_code = cl.dept_code
);
The right index is on (cl_id, dept_code, status_date).

I would also use ROW_NUMBER, but with a subquery:
SELECT CL_ID, Status_date, Status, Dept_code
FROM
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY CL_ID, Dept_code ORDER BY Status_date DESC) rn
FROM CIMSHR6_MERGED].[dbo].[C3CLSTAT]
) t
WHERE rn = 1;

1) Firstly group everything on Dept_Code,CL_ID and assign rank for each row with in the group in descending order.
2) Select all the rows with rnk=1 which would display your desired result.
SELECT Z.CL_ID,
Z.Status_Date,
Z.Status,
Z.Dept_Code
FROM
(
SELECT *,
RANK() OVER( PARTITION BY Dept_Code,CL_ID, ORDER BY Status_Date DESC ) AS rnk
FROM [CIMSHR6_MERGED].[dbo].[C3CLSTAT]
) Z
WHERE Z.rnk = 1;

This would work for almost all databases
select * from c3clstat c
where exists
(select 1 from c3clstat c1
where c1.cl_id=c.cl_id
and c1.dept_code=c.dept_code
group by cl_id,dept_code
having c.status_date=max(c1.status_date)
)

Select only 20 rows of every distinct name

I have a table in which I have over 1000+ rows, in which there is a column "AnaId", values of this column are repeated many times like name 003912 is repeated 85 times, name 003156 in repeated 70 time, I want to select maximum 20 rows of every distinct AnaID. I have no idea how to do it.
SELECT dbo.Analysis.AnaId, Analysis.CasNo, MoleculeId,
SUM(dbo.AnalysisSummary.Area) as TotalArea
FROM dbo.Analysis LEFT JOIN dbo.AnalysisSummary
ON dbo.AnalysisSummary.AnaId = dbo.Analysis.AnaId
WHERE dbo.Analysis.Sample like '%Oil%'
GROUP BY dbo.Analysis.AnaId,Analysis.CasNo, MoleculeId ORDER BY
TotalArea DESC

You can use row_number():
select t.*
from (select t.*, row_number() over (partition by name order by name) as seqnum
from t
) t
where seqnum <= 20;
With the edits to your question, you can do:
with t as (
<your query here without order by>
)
select t.*
from (select t.*, row_number() over (partition by name order by name) as seqnum
from t
) t
where seqnum <= 20;
If you have another table of names, you can also use cross apply:
select t.*
from names n cross apply
(select top 20 t.*
from t
where t.name = n.name
) t;

Using Rank()
select t.*
from (select t.*, rank() over (partition by name order by name) as seqnum
from t
) t
where seqnum <= 20;
Using Dense_Rank()
select t.*
from (select t.*, Dense_Rank() over (partition by name order by name) as seqnum
from t
) t
where seqnum <= 20;
Using Row_Number
select t.*
from (select t.*, row_number() over (partition by name order by name) as seqnum
from t
) t
where seqnum <= 20;
This will help uunderstand usage of each Special Functions
Base Code Credits:-#gordon

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Issue with Hive Rank queries - sql

Related

How to use conditions with a RANK statement

sqlite query python

Select Random Values for Grouped Dataset

Select most recent status for each ID and department code

Select only 20 rows of every distinct name

Categories

Resources