I need to inner-join tables on two common columns, org_id and time_stamp. The data is in Avro format in S3 and is queried through Athena.
I have tried:
SELECT year(from_iso8601_timestamp(em.time_stamp)) time_unit,
sum(em.column1) column1,
sum(spa.column2) column2,
sum(vir.column3) column3
FROM "schemaName".table1 em
JOIN "schemaName".table2 spa
ON year(from_iso8601_timestamp(em.time_stamp)) = year(from_iso8601_timestamp(spa.time_stamp))
AND em.org_id = spa.org_id
JOIN "schemaName".table3 vir
ON year(from_iso8601_timestamp(vir.time_stamp)) = year(from_iso8601_timestamp(spa.time_stamp))
AND vir.org_id = spa.org_id
WHERE em.org_id = 'org_id_test'
AND (from_iso8601_timestamp(em.time_stamp)) <= (cast(from_iso8601_timestamp('2019-11-22T23:59:31') AS timestamp))
AND (from_iso8601_timestamp(em.time_stamp)) >= (cast(from_iso8601_timestamp('2019-11-22T23:59:31') AS timestamp) - interval '10' year)
GROUP BY em.org_id, year(from_iso8601_timestamp(em.time_stamp))
ORDER BY time_unit DESC limit 11
But what I am getting looks like the result of a cross join.
Results:
time_unit |column1 |column2 |column3
1 2019 |48384 |299040 |712
whereas if I aggregate each table separately with the same WHERE conditions, the values are:
table1
column1
504
table2
column2
280
table3
column3
5
Can somebody help me figure out what I am doing wrong, and what the right way to achieve this is?
If I followed you correctly, what is happening is that multiple records match the conditions in each join, so the same record ends up being counted multiple times when you aggregate.
A typical way around this is to aggregate in subqueries first, and then join the aggregated results.
Something like this might be what you are looking for:
select
em.time_unit,
em.column1,
spa.column2,
vir.column3
from (
select
org_id,
year(from_iso8601_timestamp(time_stamp)) time_unit,
sum(column1) column1
from "schemaname".table1
group by org_id, year(from_iso8601_timestamp(time_stamp))
) em
join (
select
org_id,
year(from_iso8601_timestamp(time_stamp)) time_unit,
sum(column2) column2
from "schemaname".table2
group by org_id, year(from_iso8601_timestamp(time_stamp))
) spa on spa.time_unit = em.time_unit and spa.org_id = em.org_id
join (
select
org_id,
year(from_iso8601_timestamp(time_stamp)) time_unit,
sum(column3) column3
from "schemaname".table3
group by org_id, year(from_iso8601_timestamp(time_stamp))
) vir on vir.time_unit = em.time_unit and vir.org_id = em.org_id
where
em.org_id = 'org_id_test'
and em.time_unit between 2009 and 2019
order by em.time_unit desc
limit 11
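To see why joining before aggregating inflates the sums, here is a minimal sketch using Python's built-in sqlite3 rather than Athena. The tiny tables, the integer `yr` column (standing in for the year extracted from time_stamp) and the values are made up for illustration; only the shape of the two queries follows the question and the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (org_id TEXT, yr INTEGER, column1 INTEGER);
    CREATE TABLE table2 (org_id TEXT, yr INTEGER, column2 INTEGER);
    CREATE TABLE table3 (org_id TEXT, yr INTEGER, column3 INTEGER);
    -- hypothetical data: 3, 2 and 1 rows for the same org and year
    INSERT INTO table1 VALUES ('o', 2019, 1), ('o', 2019, 2), ('o', 2019, 3);
    INSERT INTO table2 VALUES ('o', 2019, 10), ('o', 2019, 20);
    INSERT INTO table3 VALUES ('o', 2019, 100);
""")

# Join first, then aggregate: every table1 row is paired with every
# matching table2 and table3 row, so each value is summed several times.
join_then_sum = conn.execute("""
    SELECT sum(em.column1), sum(spa.column2), sum(vir.column3)
    FROM table1 em
    JOIN table2 spa ON spa.org_id = em.org_id AND spa.yr = em.yr
    JOIN table3 vir ON vir.org_id = em.org_id AND vir.yr = em.yr
""").fetchone()

# Aggregate first, then join: each subquery yields one row per (org_id, yr),
# so the join cannot multiply rows.
sum_then_join = conn.execute("""
    SELECT em.column1, spa.column2, vir.column3
    FROM (SELECT org_id, yr, sum(column1) AS column1
          FROM table1 GROUP BY org_id, yr) em
    JOIN (SELECT org_id, yr, sum(column2) AS column2
          FROM table2 GROUP BY org_id, yr) spa
      ON spa.org_id = em.org_id AND spa.yr = em.yr
    JOIN (SELECT org_id, yr, sum(column3) AS column3
          FROM table3 GROUP BY org_id, yr) vir
      ON vir.org_id = em.org_id AND vir.yr = em.yr
""").fetchone()

print(join_then_sum)  # (12, 90, 600)
print(sum_then_join)  # (6, 30, 100)
```

Each inflated sum is the true sum multiplied by the number of matching rows in the other two tables, which is the same pattern as the oversized totals in the question.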
I have written a query to find the 10 busiest airports in the USA from March to April. It produces the desired output, but I want to try to optimize it further.
Are there any HiveQL specific optimizations that can be applied to the query? Is GROUPING SETS applicable here? I'm new to Hive and for now this is the shortest query that I've come up with.
SELECT airports.airport, COUNT(Flights.FlightsNum) AS Total_Flights
FROM (
SELECT Origin AS Airport, FlightsNum
FROM flights_stats
WHERE (Cancelled = 0 AND Month IN (3,4))
UNION ALL
SELECT Dest AS Airport, FlightsNum
FROM flights_stats
WHERE (Cancelled = 0 AND Month IN (3,4))
) Flights
INNER JOIN airports ON (Flights.Airport = airports.iata AND airports.country = 'USA')
GROUP BY airports.airport
ORDER BY Total_Flights DESC
LIMIT 10;
The table columns are as follows:
Airports
|iata|airport|city|state|country|
Flights_stats
|originAirport|destAirport|FlightsNum|Cancelled|Month|
Filter by airport (the inner join) and aggregate before the UNION ALL to reduce the dataset passed to the final aggregation reducer. The UNION ALL subqueries with their joins should run in parallel, and should be faster than joining the bigger dataset after the UNION ALL.
SELECT f.airport, SUM(cnt) AS Total_Flights
FROM (
SELECT a.airport, COUNT(*) as cnt
FROM flights_stats f
INNER JOIN airports a ON f.Origin=a.iata AND a.country='USA'
WHERE Cancelled = 0 AND Month IN (3,4)
GROUP BY a.airport
UNION ALL
SELECT a.airport, COUNT(*) as cnt
FROM flights_stats f
INNER JOIN airports a ON f.Dest=a.iata AND a.country='USA'
WHERE Cancelled = 0 AND Month IN (3,4)
GROUP BY a.airport
) f
GROUP BY f.airport
ORDER BY Total_Flights DESC
LIMIT 10
;
Tune mapjoins and enable parallel execution:
set hive.exec.parallel=true;
set hive.auto.convert.join=true; --this enables map-join
set hive.mapjoin.smalltable.filesize=25000000; --size of table to fit in memory
Use Tez and vectorizing, tune mappers and reducers parallelism: https://stackoverflow.com/a/48487306/2700344
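As a sanity check that aggregating before the UNION ALL gives the same totals as aggregating after it, here is a small sketch run against SQLite from Python, not Hive. The airport codes, names and flight rows are invented, and FlightsNum is left out, so this only checks that the two query shapes agree; it says nothing about Hive performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE airports (iata TEXT, airport TEXT, country TEXT);
    CREATE TABLE flights_stats (Origin TEXT, Dest TEXT, Cancelled INTEGER, Month INTEGER);
    -- hypothetical data
    INSERT INTO airports VALUES
        ('AAA', 'Alpha', 'USA'), ('BBB', 'Bravo', 'USA'), ('CCC', 'Charlie', 'Canada');
    INSERT INTO flights_stats VALUES
        ('AAA', 'BBB', 0, 3), ('AAA', 'CCC', 0, 4), ('BBB', 'AAA', 0, 3),
        ('CCC', 'AAA', 1, 3), ('BBB', 'AAA', 0, 5);
""")

# Shape of the original query: UNION ALL the raw rows, then join and count.
union_then_count = conn.execute("""
    SELECT a.airport, COUNT(*) AS Total_Flights
    FROM (
        SELECT Origin AS Airport FROM flights_stats
        WHERE Cancelled = 0 AND Month IN (3, 4)
        UNION ALL
        SELECT Dest AS Airport FROM flights_stats
        WHERE Cancelled = 0 AND Month IN (3, 4)
    ) f
    JOIN airports a ON f.Airport = a.iata AND a.country = 'USA'
    GROUP BY a.airport
    ORDER BY Total_Flights DESC, a.airport
""").fetchall()

# Shape of the suggested query: join and count inside each branch,
# then add up the partial counts.
count_then_union = conn.execute("""
    SELECT f.airport, SUM(cnt) AS Total_Flights
    FROM (
        SELECT a.airport, COUNT(*) AS cnt
        FROM flights_stats f
        JOIN airports a ON f.Origin = a.iata AND a.country = 'USA'
        WHERE Cancelled = 0 AND Month IN (3, 4)
        GROUP BY a.airport
        UNION ALL
        SELECT a.airport, COUNT(*) AS cnt
        FROM flights_stats f
        JOIN airports a ON f.Dest = a.iata AND a.country = 'USA'
        WHERE Cancelled = 0 AND Month IN (3, 4)
        GROUP BY a.airport
    ) f
    GROUP BY f.airport
    ORDER BY Total_Flights DESC, f.airport
""").fetchall()

print(union_then_count)  # [('Alpha', 3), ('Bravo', 2)]
print(count_then_union)  # [('Alpha', 3), ('Bravo', 2)]
```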
It might help if you do the aggregation before the union all:
SELECT a.airport, SUM(cnt) AS Total_Flights
FROM ((SELECT Origin AS Airport, COUNT(*) as cnt
FROM flights_stats
WHERE (Cancelled = 0 AND Month IN (3,4))
GROUP BY Origin
) UNION ALL
(SELECT Dest AS Airport, COUNT(*) as cnt
FROM flights_stats
WHERE Cancelled = 0 AND Month IN (3,4)
GROUP BY Dest
)
) f INNER JOIN
airports a
ON f.Airport = a.iata AND a.country = 'USA'
GROUP BY a.airport
ORDER BY Total_Flights DESC
LIMIT 10;
I don't think GROUPING SETS are applicable here because you are only grouping by one field.
From Apache Wiki:
"The GROUPING SETS clause in GROUP BY allows us to specify more than one GROUP BY option in the same record set."
You can test this, but this is a case where a UNION may be better, so you really need to test it and report back:
SELECT airports.airport,
SUM(
CASE
WHEN T1.FlightsNum IS NOT NULL THEN 1
WHEN T2.FlightsNum IS NOT NULL THEN 1
ELSE 0
END
) AS Total_Flights
FROM airports
LEFT JOIN (SELECT Origin AS Airport, FlightsNum
FROM flights_stats
WHERE (Cancelled = 0 AND Month IN (3,4))) t1
on t1.Airport = airports.iata
LEFT JOIN (SELECT Dest AS Airport, FlightsNum
FROM flights_stats
WHERE (Cancelled = 0 AND Month IN (3,4))) t2
on t2.Airport = airports.iata
GROUP BY airports.airport
ORDER BY Total_Flights DESC
I'm trying to pull 6 records using the code below but there are some cases where the information is updated and therefore it is pulling duplicate records.
My code:
SELECT column2, count(*) as 'Count'
FROM ServiceTable p
join HIERARCHY h
on p.LOCATION_CODE = h.LOCATION
where Report_date between '2017-04-01' and '2017-04-30'
and Column1 = 'Issue '
and LOCATION = '8789'
and
( record_code = 'INCIDENT' or
(
SUBMIT_METHOD = 'Web' and
not exists
(
select *
from ServiceTable p2
where p2.record_code = 'INCIDENT'
and p2.incident_id = p.incident_id
)
)
)
The problem is that instead of the six records it is pulling eight. I would just use distinct * but the file_date is different on the duplicate entries:
FILE_DATE Incident_ID Column1 Column2
4/4/17 123 Issue Service - Red
4/4/17 123 Issue Service - Blue
4/5/17 123 Issue Service - Red
4/5/17 123 Issue Service - Blue
The desired output is:
COLUMN2 COUNT
Service - Red 1
Service - Blue 1
Any help would be greatly appreciated! If you need any other info just let me know.
If you turn your original select statement, without the aggregation function, into a subquery, you can apply DISTINCT to the columns other than the changing date, then select a COUNT from there. Don't forget your GROUP BY clause at the end.
SELECT Column2, COUNT(Incident_ID) AS Service_Count
FROM (SELECT DISTINCT Incident_ID, Column1, Column2
FROM ServiceTable p
JOIN HIERARCHY h ON p.LOCATION_CODE = h.LOCATION
WHERE Report_date BETWEEN '2017-04-01' AND '2017-04-30'
AND Column1 = 'Issue '
AND LOCATION = '8789'
AND
( record_code = 'INCIDENT' or
(
SUBMIT_METHOD = 'Web' and
NOT EXISTS
(
SELECT *
FROM ServiceTable p2
WHERE p2.record_code = 'INCIDENT'
AND p2.incident_id = p.incident_id)
)
)
) AS d
GROUP BY Column2
Also, if you are joining tables it is a good practice to fully qualify the field you are selecting. Example: p.Column2, p.Incident_ID, h.LOCATION. That way, even your distinct fields are easier to follow where they came from and how they relate.
Finally, don't forget that COUNT is also the name of the aggregate function, so using it as an alias is confusing. I modified your alias accordingly.
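A minimal sketch of the DISTINCT-then-COUNT idea, using Python's sqlite3 and only the four sample rows from the question. The join to HIERARCHY and the WHERE filters are left out, and the derived-table alias `d` is an arbitrary name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ServiceTable (FILE_DATE TEXT, Incident_ID INTEGER, Column1 TEXT, Column2 TEXT);
    INSERT INTO ServiceTable VALUES
        ('4/4/17', 123, 'Issue', 'Service - Red'),
        ('4/4/17', 123, 'Issue', 'Service - Blue'),
        ('4/5/17', 123, 'Issue', 'Service - Red'),
        ('4/5/17', 123, 'Issue', 'Service - Blue');
""")

# Counting the raw rows counts each incident once per FILE_DATE.
naive = dict(conn.execute("""
    SELECT Column2, COUNT(*) FROM ServiceTable GROUP BY Column2
""").fetchall())

# Dropping FILE_DATE and applying DISTINCT first collapses the repeats.
deduped = dict(conn.execute("""
    SELECT Column2, COUNT(Incident_ID) AS Service_Count
    FROM (SELECT DISTINCT Incident_ID, Column1, Column2
          FROM ServiceTable) AS d
    GROUP BY Column2
""").fetchall())

print(naive)    # both services counted twice
print(deduped)  # both services counted once
```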
If you are using an aggregation function (count), you should use group by for the column not in the aggregation function:
SELECT column2, count(*) as 'Count'
FROM ServiceTable p
join HIERARCHY h
on p.LOCATION_CODE = h.LOCATION
where Report_date between '2017-04-01' and '2017-04-30'
and Column1 = 'Issue '
and LOCATION = '8789'
and
( record_code = 'INCIDENT' or
(
SUBMIT_METHOD = 'Web' and
not exists
(
select *
from ServiceTable p2
where p2.record_code = 'INCIDENT'
and p2.incident_id = p.incident_id
)
)
)
group by column2
I can't seem to figure out how to set the logic up for my particular problem. I'm trying to count the number of times the word "Service" appears but only when the RECORD_CODE is INCIDENT. When the RECORD is INCIDENT-UPDATE, it is normally already somewhere else as an INCIDENT so I exclude them to keep from duplicating my data.
However, there are a small number of cases where the SUBMIT_METHOD is "WEB" and the only record is an INCIDENT-UPDATE. I cannot figure out how to select only rows where RECORD = 'INCIDENT', except when a record has a SUBMIT_METHOD of "WEB" and there is no record for that report # with a RECORD of INCIDENT, in which case the update should be counted. It could be a simple problem and I'm just overthinking it, but I cannot think of how to do it. Any help would be GREATLY appreciated!
My query:
SELECT column2, count(*) as 'COUNT'
from Service.Table
where date between '1/1/17' and '1/31/17'
and column1 = 'Issue'
and RECORD = 'INCIDENT'
group by column2
Sample of the data:
REPORT # RECORD SUBMIT_METHOD SUBMIT_DATE COLUMN2
1234 Incident Web 1/1/2017 Service
1234 Incident-Update Web 1/1/2017 Service
1235 Incident Phone 1/15/2017 Other
1235 Incident-Update Phone 1/15/2017 Other
1236 Incident-Update Web 1/18/2017 Service
The expected output in this case would be:
COLUMN2 COUNT
Service 3
If I can provide any other info just let me know!
You are looking for a group by like
select column2, count(*)
from tbl1
where SUBMIT_METHOD = 'Web'
group by column2;
;With cte(REPORT#,RECORD ,SUBMIT_METHOD,SUBMIT_DATE,COLUMN2)
AS
(
SELECT 1234,'Incident' ,'Web' , '1/1/2017' ,'Service' Union all
SELECT 1234,'Incident-Update' ,'Web' , '1/1/2017' ,'Service' Union all
SELECT 1235,'Incident' ,'Phone', '1/15/2017', 'Other' Union all
SELECT 1235,'Incident-Update' ,'Phone', '1/15/2017', 'Other' Union all
SELECT 1236,'Incident-Update' ,'Web' , '1/18/2017', 'Service'
)
SELECT COLUMN2
,CountCOLUMN2
FROM (
SELECT *
,COUNT(COLUMN2) OVER (
PARTITION BY COLUMN2 ORDER BY COLUMN2
) CountCOLUMN2
,ROW_NUMBER() OVER (
PARTITION BY COLUMN2 ORDER BY COLUMN2
) Seq
FROM cte
) Dt
WHERE SUBMIT_DATE BETWEEN '1/1/17'
AND '1/31/17'
AND RECORD = 'INCIDENT'
ORDER BY 1 DESC
Output
COLUMN2 CountCOLUMN2
--------------------
Service 3
Other 2
You could use a NOT EXISTS subquery to ensure there is no other row with the same Report # and a record type of INCIDENT:
select *
from Service.Table t1
where date between '1/1/17' and '1/31/17' and
column1 = 'Issue' and
(
record = 'Incident' or
(
record = 'Incident-Update' and
submit_method = 'Web' and
not exists
(
select *
from Service.Table t2
where t2.record = 'INCIDENT'
and t2.[Report #] = t1.[Report #]
)
)
)
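Here is a small sketch of that NOT EXISTS filter, run through Python's sqlite3 on the five sample rows from the question. The table name `reports` and the column name `report_no` are stand-ins for Service.Table and [Report #], and the date and column1 filters are omitted because the sample does not include those values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reports (report_no INTEGER, record TEXT, submit_method TEXT, column2 TEXT);
    INSERT INTO reports VALUES
        (1234, 'Incident',        'Web',   'Service'),
        (1234, 'Incident-Update', 'Web',   'Service'),
        (1235, 'Incident',        'Phone', 'Other'),
        (1235, 'Incident-Update', 'Phone', 'Other'),
        (1236, 'Incident-Update', 'Web',   'Service');
""")

# Keep every Incident row, plus Web updates whose report has no Incident row.
rows = conn.execute("""
    SELECT t1.report_no, t1.record
    FROM reports t1
    WHERE t1.record = 'Incident'
       OR (t1.record = 'Incident-Update'
           AND t1.submit_method = 'Web'
           AND NOT EXISTS (SELECT 1
                           FROM reports t2
                           WHERE t2.record = 'Incident'
                             AND t2.report_no = t1.report_no))
    ORDER BY t1.report_no
""").fetchall()

print(rows)  # [(1234, 'Incident'), (1235, 'Incident'), (1236, 'Incident-Update')]
```

Report 1236 is kept through its update row because it has no Incident row, while the updates for 1234 and 1235 are dropped.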
I have four different select queries.
Select A,Round(B) as P,Round(C) as Q,Round(D) as R,Round(E) as S from tb_name1 a Inner Join tb_name2 b on (a.X1 =b.X2 and a.T_KEY=b.T_KEY) where a.X3="something" and a.X4="xyz" and b.X5="1243" GROUP BY A ORDER BY A DESC
Select A,Round(F) as T from tb_name4 a Join tb_name5 b on (a.K1 = b.K2 and a.K3 and b.K4 ) where a.X6="something" and a.X7="xyz1" and b.X8="1233" GROUP BY A ORDER BY A DESC
Select A,Round(G) as Q from tb_name6 a Join tb_name7 b on (a.K5 = b.K6 and a.K7 and b.K8 ) where a.X9="something" and a.X10="xyz2" and b.X11="123" GROUP BY A ORDER BY A DESC
Select A,Round(H) as R from tb_name8 a Join tb_name9 b on (a.K9 = b.K10 and a.K11 and b.K12 ) where a.X12="something" and a.X13="xyz3" and b.X14="1123" GROUP BY A ORDER BY A DESC
I have tried UNION, but it's not working. I want a single output from the four queries, with the values displayed one after another, like below:
Output:--
Column's Name Column1 Column2 Column3 Column4 Column5 Column6 Column7
Row 1 valu1 valu2 valu3 valu4 valu5 valu6 valu7
Row 2 valu8 valu9 valu10 valu11 valu12 valu13 valu14
Row 3 valu15 valu16 valu17 valu18 valu19 valu20 valu21
Afternoon/Evening all,
I'm looking for the final touches to the below query. I need to remove the duplicate occurrences of a column in a particular row. Currently using the below SQL:
SELECT CBNEW.*
FROM CallbackNewID CBNEW
INNER JOIN (SELECT IDNEW, MAX(CallbackDate) AS MaxDate
FROM CallbackNewID
GROUP BY IDNEW) AS groupedCBNEW
ON (CBNEW.CallbackDate = groupedCBNEW.MaxDate) AND (CBNEW.IDNEW = groupedCBNEW.IDNEW);
My result set looks like the below
ID RecID Comp Rem Date_ IDNEW IDOLD CB? CallbackDate
138618 83209 1 0 2012-03-16 12:40:00 83209 83209 2 16-Mar-12
138619 83209 1 0 2012-03-16 12:40:00 83209 83209 2 16-Mar-12
110470 83799 1 0 2011-07-27 11:46:00 83799 83799 10 27-Jul-11
110471 83799 1 0 2011-07-27 11:46:00 83799 83799 10 27-Jul-11
This however gives me duplicate values in the CallBackDate and IDNEW Column because in the table there are some different Primary Keys with the same IDNEW and CallbackDate values.
If I dump this result into Excel, I can just use remove duplicates on the first ID column, and the problem's solved.
But what I want to do is make sure my result only includes the FIRST instance of the ID column, where IDNEW and CallbackDate are duplicated.
I'm sure I just need to append a tiny piece of SQL, but I haven't been able to find the answer so far.
Your help is very much appreciated.
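Try adding MIN(ID) to the inner query and then adding it to the ON clause as well. Note that this only returns a row for an IDNEW when its smallest ID also carries the latest CallbackDate, as is the case in your sample data: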
SELECT CBNEW.*
FROM CallbackNewID CBNEW
INNER JOIN (SELECT IDNEW, MIN(ID) AS MinId, MAX(CallbackDate) AS MaxDate
FROM CallbackNewID
GROUP BY IDNEW) AS groupedCBNEW
ON (CBNEW.CallbackDate = groupedCBNEW.MaxDate)
AND (CBNEW.IDNEW = groupedCBNEW.IDNEW)
AND (CBNEW.ID = groupedCBNEW.MinId) ;
Here is a rather "brute force" approach. It just takes the results of your original query and does Min() on [ID], Max() on [Comp] and [Rem], and GROUP BY on everything else:
SELECT
Min(t.ID) AS MinOfID,
t.RecID,
Max(t.Comp) AS MaxOfComp,
Max(t.Rem) AS MaxOfRem,
t.Date_,
t.IDNEW,
t.IDOLD,
t.[CB?],
t.CallbackDate
FROM
(
SELECT CBNEW.*
FROM
CallbackNewID CBNEW
INNER JOIN
(
SELECT IDNEW, MAX(CallbackDate) AS MaxDate
FROM CallbackNewID
GROUP BY IDNEW
) AS groupedCBNEW
ON (CBNEW.CallbackDate = groupedCBNEW.MaxDate)
AND (CBNEW.IDNEW = groupedCBNEW.IDNEW)
) t
GROUP BY
t.RecID,
t.Date_,
t.IDNEW,
t.IDOLD,
t.[CB?],
t.CallbackDate;
It might not be terribly elegant, but if it works....
In MS SQL Server, I think you are looking for the ROW_NUMBER() function.
Something like this should help you get what you are looking for:
SELECT
X.*
FROM
(
SELECT
CBNEW.*,
ROW_NUMBER() OVER (PARTITION BY CBNEW.IDNEW, CBNEW.CallbackDate ORDER BY CBNEW.ID) AS [row_num]
FROM
CallbackNewID CBNEW
INNER JOIN
(
SELECT
IDNEW,
MAX(CallbackDate) AS MaxDate
FROM
CallbackNewID
GROUP BY
IDNEW
) AS groupedCBNEW ON (CBNEW.CallbackDate = groupedCBNEW.MaxDate) AND (CBNEW.IDNEW = groupedCBNEW.IDNEW)
) X
WHERE
X.row_num = 1
SELECT
A.*
FROM
(SELECT
*,
ROW_NUMBER() OVER (PARTITION BY IDNEW ORDER BY CallbackDate DESC)
AS [row_num]
FROM CallbackNewID
) A
WHERE
A.row_num = 1
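A runnable sketch of the ROW_NUMBER() approach, using Python's sqlite3 instead of SQL Server (window functions need SQLite 3.25 or newer). The table is cut down to three columns, the row with ID 138000 is hypothetical, and ordering by ID as a tie-breaker is an assumption about what "first instance" should mean.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CallbackNewID (ID INTEGER PRIMARY KEY, IDNEW INTEGER, CallbackDate TEXT);
    INSERT INTO CallbackNewID VALUES
        (138618, 83209, '2012-03-16'),
        (138619, 83209, '2012-03-16'),
        (138000, 83209, '2012-01-01'),  -- hypothetical older callback
        (110470, 83799, '2011-07-27'),
        (110471, 83799, '2011-07-27');
""")

# Number the rows of each IDNEW from latest to earliest callback, breaking
# ties on the lowest ID, and keep only the first row of each group.
rows = conn.execute("""
    SELECT ID, IDNEW, CallbackDate
    FROM (SELECT *,
                 ROW_NUMBER() OVER (PARTITION BY IDNEW
                                    ORDER BY CallbackDate DESC, ID) AS row_num
          FROM CallbackNewID) AS A
    WHERE row_num = 1
    ORDER BY ID
""").fetchall()

print(rows)  # [(110470, 83799, '2011-07-27'), (138618, 83209, '2012-03-16')]
```

Without the ID tie-breaker, which of the two rows sharing the latest CallbackDate gets row_num = 1 is left to the database.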