Hive SQL cross join on a condition

Wonderful coders of the universe, I have a question:
I'm writing a Hive SQL script, and I'm wondering whether it's possible to cross join on a condition (the condition below being that the day of week is a Friday), or whether there's a performance-light alternative to what I'm doing below. I only need to add two rows for dates that are Fridays, which just persists the Friday data into Saturday and Sunday. I get an error on the join condition, but I'm wondering if it's possible to bypass that somehow.
To be crystal clear, the query as written below gives me an error (specifically on the "and DAYOFWEEK(performance_end_date) = 6" part). I'm just wondering if there's a way to write this so the syntax is accepted.
Please advise.
select
    portfolio_name
    ,Cast(Date_add(a.performance_end_date, crs.crs) AS TIMESTAMP) AS performance_end_date
    ,return
    ,nav
    ,nav_id
    ,row_no
from
(
    SELECT portfolio_name, performance_end_date, return, cast(cast(nav as decimal(20,2)) as string) as nav, nav_id
        ,row_number() over (partition by a.portfolio_code, a.performance_end_date order by a.nav_id desc) as row_no
    FROM carsales a
    WHERE portfolio_code IN ('1994','2078','2155','2365','2367')
        and year=2020 and month=09
) a
CROSS JOIN (SELECT stack(2, 1, 2) as crs) crs and DAYOFWEEK(performance_end_date) = 6
where a.row_no = 1

CROSS JOIN doesn't have a join condition, so move your criteria to the WHERE clause:
CROSS JOIN (SELECT stack(2, 1,2) as crs) crs
where a.row_no = 1 and DAYOFWEEK(performance_end_date) = 6
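In case the stack() call looks opaque: Hive's stack(n, v1, ..., vk) UDTF simply emits its values as n rows, so the derived table here is just an inline two-row table of day offsets. For reference, a sketch of what it produces:
SELECT stack(2, 1, 2) AS crs
-- crs
-- 1
-- 2
Cross joining against it duplicates each Friday row twice, and Date_add(performance_end_date, crs) then shifts those copies forward to Saturday and Sunday.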

Related

PostgreSQL GROUP BY that includes zeros

I have a SQL query (postgresql) that looks something like this:
SELECT
my_timestamp::timestamp::date as the_date,
count(*) as count
FROM my_table
WHERE ...
GROUP BY the_date
ORDER BY the_date
The result is a table of YYYY-MM-DD, count pairs.
Now I've been asked to fill in the empty dates with zero. So if I was previously providing
2022-03-15 3
2022-03-17 1
I'd now want to return
2022-03-15 3
2022-03-16 0
2022-03-17 1
Now I can easily do this client-side (relative to the database) and let my program compute and return the zero-augmented list to its clients based on the original list from postgres. But perhaps it would be better if I could just tell postgresql to include the zeros.
I suspect this isn't easy at all, because postgres has no obvious way of knowing what I'm up to. But in the interests of learning more about postgres and SQL, I thought I'd have try. The try isn't too promising thus far...
Any pointers before I conclude that I was right to leave this to my (postgres client) program?
Update
This is an interesting case where my simplification of the problem led to a correct answer that didn't work for me. For those who come after, I thought it worth documenting what followed, because it takes some fun twists through constructing SQL queries.
#a_horse_with_no_name responded with a query that I've verified works if I simplify my own query to match. Unfortunately, my query had some extra baggage that I didn't think pertinent, and so had trimmed out when posting the original question.
Here's my real (original) query, with all names preserved (if shortened):
-- current query
SELECT
LEAST(time1, time2, time3, time4)::timestamp::date as the_date,
count(*) as count
FROM reading_group_reader rgr
INNER JOIN ( SELECT group_id, group_type ::group_type_name
FROM (VALUES (31198, 'excerpt')) as T(group_id, group_type)) TT
ON TT.group_id = rgr.group_id
AND TT.group_type = rgr.group_type
WHERE LEAST(time1, time2, time3, time4) > current_date - 30
GROUP BY the_date
ORDER BY the_date;
If I translate that directly into the proposed solution, however, the inner join between reading_group_reader and the temporary table TT causes the left join to become inner (I think) and the date sequence drops its zeros again. FWIW, TT is written as a derived table because sometimes it really is a subselect.
So I transformed my query into this:
SELECT
g.dt::date as the_date,
count(*) as count
FROM generate_series(date '2022-03-06', date '2022-04-06', interval '1 day') as g(dt)
LEFT JOIN (
SELECT
LEAST(rgr.time1, rgr.time2, rgr.time3, rgr.time4)::timestamp::date as the_date
FROM reading_group_reader rgr
INNER JOIN (
SELECT group_id, group_type ::group_type_name
FROM (VALUES (31198, 'excerpt')) as T(group_id, group_type)) TT
ON TT.group_id = rgr.group_id
AND TT.group_type = rgr.group_type
) rgrt
ON rgrt.the_date = g.dt::date
GROUP BY g.dt
ORDER BY the_date;
but this outputs 1's instead of 0's at the places that should be 0.
The reason is that I've now selected every date, so of course there's one row for each. I need to include an additional field (which will be NULL on unmatched dates) and count that instead of counting rows.
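The distinction in miniature: count(*) counts rows, while count(expr) counts only non-NULL values. A quick illustration:
SELECT count(*), count(x)
FROM (VALUES (1), (NULL)) AS v(x);
-- returns 2, 1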
So this query finally does what I want:
SELECT
g.dt::date as the_date,
count(rgrt.device_id) as count
FROM generate_series(date '2022-03-06', date '2022-04-06', interval '1 day') as g(dt)
LEFT JOIN (
SELECT
LEAST(rgr.time1, rgr.time2, rgr.time3, rgr.time4)::timestamp::date as the_date,
rgr.device_id
FROM reading_group_reader rgr
INNER JOIN (
SELECT group_id, group_type ::group_type_name
FROM (VALUES (31198, 'excerpt')) as T(group_id, group_type)
) TT
ON TT.group_id = rgr.group_id
AND TT.group_type = rgr.group_type
) rgrt(the_date)
ON rgrt.the_date = g.dt::date
GROUP BY g.dt
ORDER BY g.dt;
And, of course, on re-reading the accepted answer, I eventually saw that he did count an unrelated field, which I'd simply missed on my first several readings.
You will need to join to a list of dates. This can be done e.g. using generate_series():
SELECT g.dt::date as the_date,
count(t.my_timestamp) as count
FROM generate_series(date '2022-03-01',
date '2022-03-31',
interval '1 day') as g(dt)
LEFT JOIN my_table as t
ON t.my_timestamp::date = g.dt::date
AND ... -- the original WHERE clause goes here!
GROUP BY the_date
ORDER BY the_date;
Note that the original WHERE conditions need to go into the join condition of the LEFT JOIN. You can't put them into a WHERE clause because that would turn the outer join back into an inner join (which means the missing dates wouldn't be returned).
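To see why, here is the same query with the filter misplaced in a WHERE clause (a sketch reusing the names above). The unmatched dates come back as rows that are NULL on the my_table side, so the filter throws them away:
SELECT g.dt::date as the_date,
       count(t.my_timestamp) as count
FROM generate_series(date '2022-03-01',
                     date '2022-03-31',
                     interval '1 day') as g(dt)
LEFT JOIN my_table as t
       ON t.my_timestamp::date = g.dt::date
WHERE ...            -- discards the NULL-extended rows, effectively
GROUP BY the_date    -- turning the LEFT JOIN back into an INNER JOIN
ORDER BY the_date;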

SQL - Get the sum of several groups of records

DESIRED RESULT
Get the SUM of all [Hours], including only a single row from each [DevelopmentID]: the one where [Revision] is the highest value.
e.g. SUM of rows 1, 2, 3, 5 and 6 (result should be 22.00)
I'm stuck trying to get the appropriate grouping.
DECLARE @CompanyID INT = 1
SELECT
SUM([s].[Hours]) AS [Hours]
FROM
[dbo].[tblDev] [d] WITH (NOLOCK)
JOIN
[dbo].[tblSpec] [s] WITH (NOLOCK) ON [d].[DevID] = [s].[DevID]
WHERE
[s].[Revision] = (
SELECT MAX([s2].[Revision]) FROM [tblSpec] [s2]
)
GROUP BY
[s].[Hours]
Use row_number() to identify the latest revision:
SELECT SUM([Hours])
FROM (
    SELECT *, R = ROW_NUMBER() OVER (PARTITION BY d.DevID
                                     ORDER BY s.Revision DESC) -- DESC so the highest revision gets R = 1
    FROM [dbo].[tblDev] d
    JOIN [dbo].[tblSpec] s
        ON d.[DevID] = s.[DevID]
) d
WHERE R = 1
If you want one row per DevId, then that should be in the GROUP BY (and presumably in the SELECT as well):
SELECT s.DevId, SUM(s.Hours) as hours
FROM [dbo].[tblDev] d JOIN
[dbo].[tblSpec] s
ON [d].[DevID] = [s].[DevID]
WHERE s.Revision = (SELECT MAX(s2.Revision) FROM tblSpec s2
                    WHERE s2.DevID = s.DevID) -- correlated: the highest revision per DevID
GROUP BY s.DevId;
Also, don't use WITH (NOLOCK) unless you really know what you are doing -- and I'm guessing you do not. It is basically a license that says: "You can give me data even if it is not 100% accurate."
I would also dispense with all the square brackets. They just make the query harder to write and to read.

List values with MaxDate

I'm trying to create a query to show items with the MAX DATE, but I don't know how!
Here are the script and the result:
Select
results.severity As "Count_severity",
tasks.name As task,
results.host,
to_timestamp(results.date)::date
From
tasks Inner Join
results On results.task = tasks.id
Where
tasks.name Like '%CORP 0%' And
results.severity >= 7 And
results.qod > 70
I need to show only the rows with the last date for each task.
Can you help me?
You seem to be using Postgres (as suggested by the use of the casting operator ::). If so - and if I follow you correctly - you can use distinct on:
select distinct on(t.name)
r.severity, t.name as task, r.host, to_timestamp(r.date::bigint)::date
from tasks t
inner join results r on r.task = t.id
where t.name like '%corp 0%' and r.severity >= 7 and r.qod > 70
order by t.name, to_timestamp(r.date::bigint)::date desc
This guarantees only one row per task; which row is picked is controlled by the order by clause, so the above gets the row with the greatest date (time portion aside). If there are ties, it is undefined which row is returned. You might want to adapt the order by clause to your exact requirement, if it differs from what I understood.
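For example, if the results table had a unique id column (an assumption, purely illustrative), appending it to the order by would make the picked row deterministic:
order by t.name, to_timestamp(r.date::bigint)::date desc, r.id desc -- r.id is hypothetical; any unique column works as a tiebreaker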
On the other hand, if you want the tied top rows included, use window functions:
select *
from (
select r.severity, t.name as task, r.host, to_timestamp(r.date::bigint)::date,
rank() over(partition by t.name order by to_timestamp(r.date::bigint)::date desc) rn
from tasks t
inner join results r on r.task = t.id
where t.name like '%corp 0%' and r.severity >= 7 and r.qod > 70
) t
where rn = 1
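Note that rank() keeps every row tied on the greatest date; if you want exactly one row per task even when there are ties, swap rank() for row_number().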

SQLServer: LAG & LEAD instead of recursive calculation

I am pretty new to the new version of SQL Server 2016 and haven't used the new LAG & LEAD functions yet.
If I understood correctly, they make work easier in cases where we currently use the ROW_NUMBER() function and then join the results to connect the records in a certain order.
A case where I currently connect records this way:
;WITH IncrementingRowNums AS
(
SELECT d.MyKey
,d.Outstanding
,d.Rate
,AMO.PaymentAmount
,AMO.AmoDate
,ROW_NUMBER() OVER (PARTITION BY d.MyKey ORDER BY AMO.AmoDate ASC) AS RowNum
FROM Deals d
INNER JOIN Amortization AMO
ON d.MyKey = AMO.MyKey
),
lagged AS
(
SELECT MyKey
,Outstanding AS new_outstanding
,Rate
,PaymentAmount
,AmoDate
,RowNum
FROM IncrementingRowNums
WHERE RowNum = 1
UNION ALL
SELECT i.MyKey
,(l.new_outstanding - l.PaymentAmount)
* (1 + i.Rate * (DATEDIFF(DAY,l.AmoDate, i.AmoDate)/365.25))
AS new_outstanding
,i.Rate
,i.PaymentAmount
,i.AmoDate
,i.RowNum
FROM IncrementingRowNums i
INNER JOIN lagged l
ON i.RowNum = l.RowNum + 1
AND i.MyKey = l.MyKey
What's the best way to solve this with the LAG & LEAD functions?
I tried several ways, but it never worked out.
The only thing I want to calculate is the column new_outstanding, which is calculated like this:
(previous_record.new_outstanding - previous_record.PaymentAmount)
* (1 + current_record.Rate * (DATEDIFF(DAY,previous_record.AmoDate, current_record.AmoDate)/365.25))
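Plugging in assumed sample numbers (purely illustrative) makes the formula concrete:
-- assume previous new_outstanding = 1000, previous PaymentAmount = 100,
-- current Rate = 0.05, and 30 days between the two AmoDates:
-- (1000 - 100) * (1 + 0.05 * (30 / 365.25)) = 900 * 1.0041... ≈ 903.70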
As there is no SQL Server 2016 version on rextester, I can only provide a little test data and my old solution with the recursive calculation: http://rextester.com/WVTM46505
Thanks

SQL Using PARTITION when comparing values in consecutive DataRows

I'm using a SQL statement to compare consecutive values of a field [Allocation] as follows:
;WITH cteMain AS
(SELECT AllocID, CaseNo, FeeEarner, Allocation, ROW_NUMBER() OVER (ORDER BY AllocID) AS sn
FROM tblAllocations)
SELECT m.AllocID, m.CaseNo, m.FeeEarner, m.Allocation,
ISNULL(sLag.Allocation, 0) AS prevAllocation,
(m.Allocation - ISNULL(sLag.Allocation, 0)) AS movement
FROM cteMain AS m
LEFT OUTER JOIN cteMain AS sLag
ON sLag.sn = m.sn-1;
The query returns a calculated field [movement] which is the increase or decrease in consecutive values of [Allocation].
I have included a screen shot of the data returned by this query.
However the query is not yet complete. I need to revise the statement so that the consecutive values of [Allocation] compared are grouped / partitioned by [FeeEarner] and [CaseNo].
For example, at line 18 of the data, the [Allocation] is 800 and is compared to a previous value of 600. But that previous value belongs to a different [CaseNo], i.e. 6 rather than 31. In fact, [FeeEarner] 'PJW' has no previous [Allocation] on [CaseNo] '31', so the [prevAllocation] should be 0 thanks to the ISNULL.
I have tried changing
OVER (ORDER BY AllocID)
to
OVER (PARTITION BY CaseNo, FeeEarner ORDER BY AllocID)
But that results in a lot of lines of data being repeated.
Can someone advise how to compare consecutive values of [Allocation] but only between rows of data with matching [FeeEarner] AND [CaseNo] please?
NOTE - I cannot use LAG because my customer is using SQL Server 2008 R2, which does not support it (LAG/LEAD require SQL Server 2012, or the Parallel Data Warehouse edition).
I believe you were close. Try this (notice the added pieces in the join clause to match the partition - without them you would match every row number 3 with every row number 2 across partitions, which is what you were seeing):
;WITH cteMain AS
(
SELECT AllocID, CaseNo, FeeEarner, Allocation,
ROW_NUMBER() OVER (PARTITION BY CaseNo, FeeEarner ORDER BY AllocID) AS sn
FROM tblAllocations
)
SELECT m.AllocID, m.CaseNo, m.FeeEarner, m.Allocation,
ISNULL(sLag.Allocation, 0) AS prevAllocation,
(m.Allocation - ISNULL(sLag.Allocation, 0)) AS movement
FROM cteMain AS m
LEFT OUTER JOIN cteMain AS sLag
ON sLag.CaseNo = m.CaseNo
AND sLag.FeeEarner = m.FeeEarner
AND sLag.sn = m.sn-1
You need to change your join condition as well:
FROM cteMain m LEFT OUTER JOIN
cteMain sLag
ON sLag.sn = m.sn-1 and sLag.FeeEarner = m.FeeEarner and slag.CaseNo = m.CaseNo
Also, you should have only one order by in the row_number() call.
Also, if you are using Oracle, SQL Server 2012, newer versions of DB2, or Postgres, then the lead()/lag() functions would be a better choice.
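For instance, on SQL Server 2012 or later the whole comparison collapses into a single pass (a sketch over the same table):
SELECT AllocID, CaseNo, FeeEarner, Allocation,
       LAG(Allocation, 1, 0) OVER (PARTITION BY CaseNo, FeeEarner
                                   ORDER BY AllocID) AS prevAllocation,
       Allocation - LAG(Allocation, 1, 0) OVER (PARTITION BY CaseNo, FeeEarner
                                                ORDER BY AllocID) AS movement
FROM tblAllocations;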
One more option, with OUTER APPLY and EXISTS:
SELECT t1.AllocID, t1.CaseNo, t1.FeeEarner, t1.Allocation,
    ISNULL(o.Allocation, 0) AS PrevAllocation,
    (t1.Allocation - ISNULL(o.Allocation, 0)) AS movement
FROM tblAllocations t1
OUTER APPLY (
    SELECT t2.AllocID, t2.CaseNo, t2.FeeEarner, t2.Allocation
    FROM tblAllocations t2
    WHERE EXISTS (
        -- pins t2 to the immediately preceding AllocID
        -- within the same CaseNo/FeeEarner group
        SELECT 1
        FROM tblAllocations t3
        WHERE t3.AllocID < t1.AllocID
            AND t3.CaseNo = t1.CaseNo
            AND t3.FeeEarner = t1.FeeEarner
        HAVING MAX(t3.AllocID) = t2.AllocID
    ) AND t1.CaseNo = t2.CaseNo
      AND t1.FeeEarner = t2.FeeEarner
) o