Errors with Inequality join in Hive - sql

I am performing a self join in Hive and computing an aggregate (percentile) over a window defined in the join condition, and I am getting the following errors:
Error 1:
FAILED: SemanticException Line x: Both left and right aliases encountered in JOIN ...
Error 2:
Invalid function 'DATE_PARSE'
The code looks like this:
SELECT
a.id
, a.date
, a.groups
, a.items
, PERCENTILE_APPROX(b.quantity, 0.75) AS rolling_percent_75
FROM (
SELECT
DISTINCT
id
, date
, groups
, items
FROM
table1) AS a
LEFT OUTER JOIN
table1 AS b
ON
a.id = b.id
AND a.groups = b.groups
AND a.items = b.items
AND b.date >= DATE_FORMAT(DATE_ADD(DATE_PARSE(a.date, 'yyyyMMdd'), -10, 'yyyyMMdd'))
AND b.date <= a.date
GROUP BY
1, 2, 3, 4
ORDER BY
1, 2, 3, 4
How can I resolve these errors?

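Hive has no DATE_PARSE function. You can use from_unixtime(unix_timestamp(str, format)) to convert the string to a date, and DATE_ADD with a negative value to subtract days.
Replace the existing inequality in your query with the following:
AND b.date >= DATE_ADD(from_unixtime(unix_timestamp(a.date, 'yyyyMMdd')), -10)
AND b.date <= a.date
Note that DATE_ADD returns a value in yyyy-MM-dd form; if b.date is stored as a yyyyMMdd string, you may need to convert the result back to that format before comparing.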
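To make the intended 10-day window concrete, here is a minimal sketch in Python of what the date expression is meant to compute. It assumes the dates are yyyyMMdd strings, as in the question; the helper names window_start and in_window are hypothetical and only illustrate the logic, not Hive itself.

```python
from datetime import datetime, timedelta


def window_start(date_str: str, days: int = 10, fmt: str = "%Y%m%d") -> str:
    """Return the date `days` before `date_str`, in the same string format."""
    parsed = datetime.strptime(date_str, fmt)
    return (parsed - timedelta(days=days)).strftime(fmt)


def in_window(b_date: str, a_date: str, days: int = 10) -> bool:
    """Mirror of: b.date >= (a.date - days) AND b.date <= a.date."""
    # yyyyMMdd strings sort in calendar order, so string comparison is safe
    # as long as both sides use the same format.
    return window_start(a_date, days) <= b_date <= a_date


print(window_start("20210105"))
```

The comparison only works because both sides are in the same format, which is why the output of the Hive expression should be checked against how b.date is stored.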

Related

Converting NOT in into LEFT join giving incorrect results

Please help me convert the following NOT IN query to a LEFT JOIN, as I want a date column in the SELECT clause. I have to use this query as a source for Amazon QuickSight. QuickSight cannot pass date parameters created in the report to my source query, so I have to put the date filtering condition in the WHERE clause.
Not in:
SELECT DISTINCT Date(h.Created_Date) DATE, ( h.Vehicle_ID)Decline
FROM awsdatacatalog.waves.Recurring_Transaction_History h
Left JOIN awsdatacatalog.waves.Wash_Invoice wi on h.invoice_id=wi.invoice_id
WHERE Date(h.Created_Date) BETWEEN date('2021-05-01') AND date('2021-05-01')
AND h.Status IN ('Declined','DECLINED')
AND h.Vehicle_ID NOT IN (
SELECT Distinct ut.vehicle_Id
FROM awsdatacatalog.waves.Unlimited_Wash_Transaction ut
WHERE (Is_Refunded IS NULL OR CAST(Is_Refunded AS INTEGER) =0)
AND (Status ='RECURRING' or Status ='RESIGNUP')
AND DATE(DATE) BETWEEN date('2021-05-01') AND date('2021-05-01')
)
Left join query:
SELECT DISTINCT Date(h.Created_Date) DATE, ( h.Vehicle_ID)Decline
FROM awsdatacatalog.waves.Recurring_Transaction_History h
Left JOIN awsdatacatalog.waves.Wash_Invoice wi on h.invoice_id=wi.invoice_id
LEFT JOIN (
SELECT Distinct ut.vehicle_Id, DATE(DATE) DATE
FROM awsdatacatalog.waves.Unlimited_Wash_Transaction ut
WHERE (Is_Refunded IS NULL OR CAST(Is_Refunded AS INTEGER) =0)
AND (Status ='RECURRING' or Status ='RESIGNUP')
) A
ON H.Vehicle_ID = A.Vehicle_ID
AND DATE(h.Created_Date) <= A.DATE
WHERE Date(h.Created_Date) BETWEEN date('2021-05-01') AND date('2021-05-01')
AND A.Vehicle_ID IS NULL
AND A.DATE IS NULL
AND h.Status IN ('Declined','DECLINED')
ORDER BY Date(h.Created_Date) , ( h.Vehicle_ID)
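
As a point of comparison, here is a minimal sketch of the NOT IN to LEFT JOIN ... IS NULL (anti-join) pattern, run against an in-memory SQLite database. The tables declined and active and their columns are hypothetical, simplified stand-ins for the tables in the question. The sketch assumes that every filter from the NOT IN subquery moves into the ON clause and that the dates are matched with equality; the `<=` date predicate in the query above is one possible reason the two versions return different rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE declined (vehicle_id INTEGER, created_date TEXT);
CREATE TABLE active   (vehicle_id INTEGER, txn_date TEXT, status TEXT);
INSERT INTO declined VALUES (1, '2021-05-01'), (2, '2021-05-01'), (3, '2021-05-01');
-- vehicle 1: active on the same day, so it must be excluded
-- vehicle 2: active on another day, so it must be kept
-- vehicle 3: wrong status, so it must be kept
INSERT INTO active VALUES
  (1, '2021-05-01', 'RECURRING'),
  (2, '2021-04-30', 'RECURRING'),
  (3, '2021-05-01', 'CANCELLED');
""")

not_in_sql = """
SELECT DISTINCT d.created_date, d.vehicle_id
FROM declined d
WHERE d.created_date = '2021-05-01'
  AND d.vehicle_id NOT IN (
        SELECT a.vehicle_id
        FROM active a
        WHERE a.status IN ('RECURRING', 'RESIGNUP')
          AND a.txn_date = '2021-05-01')
ORDER BY d.vehicle_id
"""

left_join_sql = """
SELECT DISTINCT d.created_date, d.vehicle_id
FROM declined d
LEFT JOIN active a
       ON a.vehicle_id = d.vehicle_id
      AND a.txn_date   = d.created_date
      AND a.status IN ('RECURRING', 'RESIGNUP')
WHERE d.created_date = '2021-05-01'
  AND a.vehicle_id IS NULL
ORDER BY d.vehicle_id
"""

not_in_rows = conn.execute(not_in_sql).fetchall()
left_join_rows = conn.execute(left_join_sql).fetchall()
print(not_in_rows == left_join_rows, left_join_rows)
```

Because the date lives in the outer table's column, this form keeps a date column available for filtering in the outer WHERE clause.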

Modify Select Statement To Sum Single Field

I have the below select statement that has extracted all the data I need, but I am trying to modify it so that the REJECTS by SHIFT by PROD_DATE are summed.
SELECT B.PROD_DATE,B.SHIFT,B.REJECTS
FROM REJECTS B
LEFT OUTER JOIN HIST_ILLUM_PART C ON B.HIST_ILLUM_PART_ID = C.ID
LEFT OUTER JOIN HIST_ILLUM_RT A ON A.ID = C.HIST_ILLUM_RT_ID
WHERE
B. REJECT_CODE NOT in ('START','SETUP','QC')
AND B.PROD_DATE >= SYSDATE - 8
ORDER BY SHIFT, PROD_DATE
I have tried
SELECT B.PROD_DATE,B.SHIFT,SUM(B.REJECTS)
I receive the following error: ORA-00937: not a single-group group function
Do I need a subquery?

Add a GROUP BY clause to your query when doing aggregations. In pseudocode, it means: for each distinct combination of B.PROD_DATE and B.SHIFT, get the sum of all B.REJECTS in that group:
SELECT B.PROD_DATE,B.SHIFT,SUM(B.REJECTS) AS REJECTS
FROM REJECTS B
LEFT OUTER JOIN HIST_ILLUM_PART C ON B.HIST_ILLUM_PART_ID = C.ID
LEFT OUTER JOIN HIST_ILLUM_RT A ON A.ID = C.HIST_ILLUM_RT_ID
WHERE
B. REJECT_CODE NOT in ('START','SETUP','QC')
AND B.PROD_DATE >= SYSDATE - 8
GROUP BY B.PROD_DATE,B.SHIFT
ORDER BY SHIFT, PROD_DATE
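
The effect of the GROUP BY can be seen on a small example. This is a minimal sketch using an in-memory SQLite database rather than Oracle; the table contents are made up for illustration and the joins from the original query are left out, since they do not affect the grouping.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rejects (prod_date TEXT, shift INTEGER, reject_code TEXT, rejects INTEGER);
INSERT INTO rejects VALUES
  ('2024-01-01', 1, 'SCRAP', 5),
  ('2024-01-01', 1, 'BURR',  3),
  ('2024-01-01', 2, 'SCRAP', 4),
  ('2024-01-01', 2, 'SETUP', 9),
  ('2024-01-02', 1, 'SCRAP', 2);
""")

# Every non-aggregated column in the SELECT list also appears in GROUP BY,
# so each output row is one (prod_date, shift) group.
rows = conn.execute("""
SELECT prod_date, shift, SUM(rejects) AS rejects
FROM rejects
WHERE reject_code NOT IN ('START', 'SETUP', 'QC')
GROUP BY prod_date, shift
ORDER BY shift, prod_date
""").fetchall()

for row in rows:
    print(row)
```

The 'SETUP' row is filtered out by the WHERE clause before grouping, so it does not contribute to the sum for shift 2.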

How to optimize this query for my school project

This is my assignment; kindly help me optimize the two queries below.
Optimize assignment 1:
SELECT
n.node_id,
MIN(LEAST(n.date,ec.date)) date
FROM
n, ec
WHERE
(n.node_id = ec.node_id_from OR n.node_id = ec.node_id_to)
AND n.date - ec.date > 0
GROUP BY
n.node_id;
Optimize assignment 2:
SELECT
TO_CHAR(CONVERT_TIMEZONE ('UTC','America/Los_Angeles', tableA."date"), 'YYYY-MM') AS "date_month",
COUNT(DISTINCT CASE WHEN (tableB."date" IS NOT NULL) THEN tableB._id ELSE NULL END) AS "tableB.countB",
COUNT(DISTINCT CASE WHEN (tableC."date" IS NOT NULL) THEN tableC._id ELSE NULL END) AS "tableC.countC"
FROM
tableA AS tableA
LEFT JOIN
tableB AS tableB ON (DATE (CONVERT_TIMEZONE ('UTC', 'America/Los_Angeles',tableB."date"))) = (DATE (CONVERT_TIMEZONE ('UTC', 'America/Los_Angeles',tableA."date")))
LEFT JOIN
tableC AS tableC ON (DATE (CONVERT_TIMEZONE ('UTC', 'America/Los_Angeles',tableC."date"))) = (DATE (CONVERT_TIMEZONE ('UTC', 'America/Los_Angeles',tableA."date")))
WHERE
tableA."date" >= CONVERT_TIMEZONE ('America/Los_Angeles', 'UTC', DATEADD (month, -17, DATE_TRUNC('month', DATE_TRUNC('day', CONVERT_TIMEZONE ('UTC', 'America/Los_Angeles',GETDATE ())))))
GROUP BY
1
ORDER BY
1 DESC
LIMIT 500;

Use short aliases; they make the SQL query shorter and cleaner.
Here is the optimized version of the second query:
SELECT DatePart(month, a.Date-8/24) date_month,
sum(case when b.date is Not null then 1 else 0 end) countb,
sum(case when c.date is Not null then 1 else 0 end) countc
FROM tableA a
LEFT JOIN tableB b
ON b.Date = a.Date -- Timezone offsets are not necessary,
LEFT JOIN tableC c
ON c.date = a.date -- both in same timezone
WHERE a.date >= DateAdd(hour, 8,
DATEADD (month,-17,DATE_TRUNC('month',
GETDATE () )))
GROUP BY 1
ORDER BY 1 DESC LIMIT 500;

A very simple solution for assignment #1:
SELECT n.node_id, MIN(ec.date) as date
FROM n
JOIN ec
ON n.node_id IN (ec.node_id_from, ec.node_id_to) AND ec.date < n.date
GROUP BY n.node_id;
This just uses MIN(ec.date) instead of MIN(LEAST(n.date, ec.date)), because the JOIN condition already forces ec.date to be lower than n.date.
Also note that a WHERE clause like
where (x >= y and x <= z)
can be changed to
where (x between y and z)
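
The MIN(ec.date) simplification for assignment #1 can be checked on toy data. This is a minimal sketch using an in-memory SQLite database; the rows are made up, the dates are plain integers for brevity, and SQLite's two-argument scalar MIN(a, b) stands in for LEAST, which SQLite does not have.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE n  (node_id INTEGER, date INTEGER);
CREATE TABLE ec (node_id_from INTEGER, node_id_to INTEGER, date INTEGER);
INSERT INTO n  VALUES (1, 10), (2, 20), (3, 5);
INSERT INTO ec VALUES (1, 2, 7), (2, 3, 15), (3, 1, 3), (1, 3, 12);
""")

# Original form: comma join, OR condition, LEAST inside the aggregate.
original = conn.execute("""
SELECT n.node_id, MIN(MIN(n.date, ec.date)) AS date
FROM n, ec
WHERE (n.node_id = ec.node_id_from OR n.node_id = ec.node_id_to)
  AND n.date - ec.date > 0
GROUP BY n.node_id
ORDER BY n.node_id
""").fetchall()

# Simplified form: the join already guarantees ec.date < n.date,
# so the smaller of the two is always ec.date.
simplified = conn.execute("""
SELECT n.node_id, MIN(ec.date) AS date
FROM n
JOIN ec
  ON n.node_id IN (ec.node_id_from, ec.node_id_to)
 AND ec.date < n.date
GROUP BY n.node_id
ORDER BY n.node_id
""").fetchall()

print(original == simplified, simplified)
```

Whether the rewrite is actually faster depends on the database and its indexes; the sketch only shows that the two forms return the same rows on this data.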

Are two inner joins best for query optimization?

I just got a challenge from school to optimize this query; this is a theoretical question.
Challenge:
SELECT TO_CHAR(CONVERT_TIMEZONE ('UTC','America/Los_Angeles',tableA."date"),'YYYY-MM') AS "date_month",
COUNT(DISTINCT CASE WHEN (tableB."date" IS NOT NULL) THEN tableB._id ELSE NULL END) AS "tableB.countB",
COUNT(DISTINCT CASE WHEN (tableC."date" IS NOT NULL) THEN tableC._id ELSE NULL END) AS "tableC.countC"
FROM tableA AS tableA
LEFT JOIN tableB AS tableB ON (DATE (CONVERT_TIMEZONE ('UTC','America/Los_Angeles',tableB."date"))) = (DATE (CONVERT_TIMEZONE ('UTC','America/Los_Angeles',tableA."date")))
LEFT JOIN tableC AS tableC ON (DATE (CONVERT_TIMEZONE ('UTC','America/Los_Angeles',tableC."date"))) = (DATE (CONVERT_TIMEZONE ('UTC','America/Los_Angeles',tableA."date")))
WHERE tableA."date" >= CONVERT_TIMEZONE ('America/Los_Angeles','UTC',DATEADD (month,-17,DATE_TRUNC('month',DATE_TRUNC('day',CONVERT_TIMEZONE ('UTC','America/Los_Angeles',GETDATE ())))))
GROUP BY 1
ORDER BY 1 DESC LIMIT 500;
To optimize it, I just removed the CASE expressions from the query above; I think this will also improve the query's efficiency:
SELECT To_char(Convert_timezone ('UTC','America/Los_Angeles',tablea."date"),'YYYY-MM') AS "date_month",
Count(DISTINCT
decode(tableb."date", not null,tableb._id,null)
AS "tableB.countB",
Count(DISTINCT
decode(tablec."date", not null,tablec._id ,null)
AS "tableC.countC"
FROM tablea AS tablea
LEFT JOIN tableb AS tableb
ON (
Date (Convert_timezone ('UTC','America/Los_Angeles',tableb."date"))) = (Date (Convert_timezone ('UTC','America/Los_Angeles',tablea."date")))
LEFT JOIN tablec AS tablec
ON (
Date (Convert_timezone ('UTC','America/Los_Angeles',tablec."date"))) = (Date (Convert_timezone ('UTC','America/Los_Angeles',tablea."date")))
WHERE tablea."date" >= convert_timezone ('America/Los_Angeles','UTC',Dateadd (month,-17,Date_trunc('month',Date_trunc('day',Convert_timezone ('UTC','America/Los_Angeles',Getdate ())))) group BY 1 ORDER BY 1 DESC limit 500;
What do you suggest? If we remove one LEFT JOIN and merge the statements, is that fine for optimization?

... or, use a shorter alias that actually makes the SQL shorter and cleaner. This also helps readability. Also, format it to separate the clauses (SELECT, FROM, JOIN, WHERE, ORDER BY, GROUP BY, HAVING, etc.) so they are easy to distinguish by eye, and use indentation that is consistent with the logical structure and supports, rather than hinders, your ability to tell those sections apart.
Just as an example, here's your first SQL query reformatted, but identical in logical structure to what you posted:
SELECT TO_CHAR(CONVERT_TIMEZONE ('UTC','America/Los_Angeles', a.date),'YYYY-MM') date_month,
COUNT(DISTINCT CASE WHEN (b."date" IS NOT NULL) THEN b._id ELSE NULL END) countB,
COUNT(DISTINCT CASE WHEN (c."date" IS NOT NULL) THEN c._id ELSE NULL END) countC
FROM tableA a
LEFT JOIN tableB b
ON (DATE (CONVERT_TIMEZONE ('UTC','America/Los_Angeles',b.date))) =
(DATE (CONVERT_TIMEZONE ('UTC','America/Los_Angeles',a.date)))
LEFT JOIN tableC c
ON (DATE (CONVERT_TIMEZONE ('UTC','America/Los_Angeles',c.date))) =
(DATE (CONVERT_TIMEZONE ('UTC','America/Los_Angeles',a.date)))
WHERE a.date >= CONVERT_TIMEZONE ('America/Los_Angeles', 'UTC',
DATEADD (month,-17,DATE_TRUNC('month',
DATE_TRUNC('day',CONVERT_TIMEZONE ('UTC','America/Los_Angeles',
GETDATE ())))))
GROUP BY 1
ORDER BY 1 DESC LIMIT 500;

Here is an optimized version:
SELECT DatePart(month, a.Date-8/24) date_month,
sum(case when b.date is Not null then 1 else 0 end) countb,
sum(case when c.date is Not null then 1 else 0 end) countc
FROM tableA a
LEFT JOIN tableB b
ON b.Date = a.Date -- Timezone offsets are not necessary,
LEFT JOIN tableC c
ON c.date = a.date -- both in same timezone
WHERE a.date >= DateAdd(hour, 8,
DATEADD (month,-17,DATE_TRUNC('month',
GETDATE () )))
GROUP BY 1
ORDER BY 1 DESC LIMIT 500;

Presumably, the _id columns are unique. So:
SELECT TO_CHAR(CONVERT_TIMEZONE('UTC','America/Los_Angeles', a."date"), 'YYYY-MM') AS date_month,
SUM(CASE WHEN b."date" IS NOT NULL THEN 1 ELSE 0 END) AS tableB_countB,
SUM(CASE WHEN c."date" IS NOT NULL THEN 1 ELSE 0 END) AS tableC_countC
FROM tableA a LEFT JOIN
tableB b
ON DATE(CONVERT_TIMEZONE ('UTC', 'America/Los_Angeles', b."date")) = DATE(CONVERT_TIMEZONE ('UTC', 'America/Los_Angeles', a."date")) LEFT JOIN
tableC c
ON DATE(CONVERT_TIMEZONE('UTC', 'America/Los_Angeles', c."date")) = DATE(CONVERT_TIMEZONE('UTC', 'America/Los_Angeles', a."date"))
WHERE a."date" >= CONVERT_TIMEZONE('America/Los_Angeles', 'UTC',
DATEADD(month, -17, DATE_TRUNC('month', DATE_TRUNC('day', CONVERT_TIMEZONE('UTC', 'America/Los_Angeles', GETDATE ())))))
GROUP BY 1
ORDER BY 1 DESC
LIMIT 500;
Then, the date conversions in the ON clause don't seem necessary, because the two sides are being converted from the same time zone. If the values have no time component (as suggested by a name like date), then the DATE() is not needed either:
SELECT TO_CHAR(CONVERT_TIMEZONE('UTC', 'America/Los_Angeles', a."date"), 'YYYY-MM') AS date_month,
SUM(CASE WHEN b."date" IS NOT NULL THEN 1 ELSE 0 END) AS tableB_countB,
SUM(CASE WHEN c."date" IS NOT NULL THEN 1 ELSE 0 END) AS tableC_countC
FROM tableA a LEFT JOIN
tableB b
ON b."date" = a."date" LEFT JOIN
tableC c
ON c."date" = a."date"
WHERE a."date" >= CONVERT_TIMEZONE('America/Los_Angeles', 'UTC',
DATEADD(month, -17, DATE_TRUNC('month', DATE_TRUNC('day', CONVERT_TIMEZONE('UTC', 'America/Los_Angeles', GETDATE ())))))
GROUP BY 1
ORDER BY 1 DESC
LIMIT 500;
The WHERE clause is fine. It can take advantage of an index on a(date).
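
Replacing COUNT(DISTINCT ...) with SUM(CASE ...) rests on the uniqueness assumption stated above, and it is worth checking how the two behave when a tableA row matches several rows in the other tables. The following is a minimal sketch using an in-memory SQLite database; the tables a, b, c and the column d are hypothetical, simplified stand-ins for tableA, tableB, tableC and their date columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (d TEXT);
CREATE TABLE b (_id INTEGER PRIMARY KEY, d TEXT);
CREATE TABLE c (_id INTEGER PRIMARY KEY, d TEXT);
INSERT INTO a VALUES ('2024-01-01'), ('2024-01-02');
INSERT INTO b VALUES (1, '2024-01-01');
INSERT INTO c VALUES (1, '2024-01-02');
""")

QUERY = """
SELECT COUNT(DISTINCT b._id) AS distinct_b,
       SUM(CASE WHEN b.d IS NOT NULL THEN 1 ELSE 0 END) AS sum_b
FROM a
LEFT JOIN b ON b.d = a.d
LEFT JOIN c ON c.d = a.d
"""

# Each row of a matches at most one row in b and in c: both counts agree.
one_to_one = conn.execute(QUERY).fetchone()

# Two c rows now share a date with the single b row, so the join repeats
# that b row and SUM counts it twice while COUNT(DISTINCT) still counts it once.
conn.execute("INSERT INTO c VALUES (2, '2024-01-01'), (3, '2024-01-01')")
fan_out = conn.execute(QUERY).fetchone()

print(one_to_one, fan_out)
```

On this data the two expressions agree only while the joins do not multiply rows, so the SUM form is a safe substitute only if that holds for the real tables.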

not able to select a column outside left join

I am working with the below query
SELECT * FROM
(SELECT DISTINCT
a.Number
,a.Description
,ISNULL(temp.Quantity,0) Quantity
,LastReceived
,LastIssued
FROM Article a
LEFT JOIN (
select ss.ArticleId
, ss.Quantity
, max(lastreceiveddate) as LastReceived
, max(lastissueddate) as LastIssued
from StockSummary ss
where ss.UnitId = 8
group by ss.ArticleId, ss.StockQuantity
having (MAX(ss.LastReceivedDate) < '2014-09-01' or MAX(ss.LastReceivedDate) is NULL)
AND (MAX(ss.LastIssuedDate) < '2014-09-01' or MAX(ss.LastIssuedDate) is NULL)
) temp on a.Id = temp.ArticleId
WHERE a.UnitId = 8
) main
ORDER BY main.Number
What I want to achieve is to select only the articles that satisfy the MAX(ss.LastReceivedDate) and MAX(ss.LastIssuedDate) conditions in the LEFT JOIN subquery, and then select the Quantity in the main query.
Note: the quantity column can be 0 or NULL.
Kindly help