How can I write this Postgres query in Amazon Redshift so that it is as optimized as it was in Postgres?

Here is my original query that I was using in postgres -
SELECT a.id,
(SELECT val
FROM database.detail x
WHERE name = 'blablah'
AND x.id = b.id) AS myGroup,
c.username,
a.someCode,
a.timeTaken,
a.date ::timestamp WITH time ZONE AT time ZONE 'PST' AS date,
SUM (CASE WHEN (b.name = 'name1') THEN b.val ::INTEGER ELSE 0 END ) AS name11,
SUM (CASE WHEN (b.name = 'name2') THEN b.val ::INTEGER ELSE 0 END ) AS name12
FROM
database.myTable a,
database.detail b,
database.client c
WHERE
a.id = b.id
AND a.c_id = c.c_id
AND a.date > current_date - interval '2 weeks'
GROUP BY 1, 2, 3, 4, 5, 6
The following is how I converted this query into an Amazon Redshift query.
SELECT a.id,
b.val AS myGroup,
c.username,
a.someCode,
a.timeTaken,
convert_timezone('PST', a.date) AS date,
SUM (CASE WHEN (b.name = 'name1') THEN b.val ::INTEGER ELSE 0 END ) AS name11,
SUM (CASE WHEN (b.name = 'name2') THEN b.val ::INTEGER ELSE 0 END ) AS name12
FROM
database.myTable a,
database.detail b,
database.client c
WHERE
a.id = b.id
AND b.name = 'blablah'
AND a.c_id = c.c_id
AND a.date > current_date - interval '2 weeks'
GROUP BY 1, 2, 3, 4, 5, 6 LIMIT 10
The CASE statement does not seem to be executing the way I expect: the values for name11 and name12 are all zero. My Postgres query returns valid values for these, but the Redshift query does not.
Also, this query is very slow. The Postgres query takes about 150 ms, while this one takes 2 minutes.
How can we do this better?

Redshift query optimization comes from the cluster configuration, table design, data loading, and vacuuming and analyzing the tables.
Let me address some core touch points from that list.
1. Make sure your tables myTable, detail, and client have an appropriate SORTKEY and DISTKEY (see the sketch after this list).
2. Make sure all the tables in the join are analyzed and vacuumed properly.
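For illustration only, a hypothetical table definition and the maintenance commands might look like this (the column types and key choices below are placeholders; pick keys that match your own join and filter columns):
-- Hypothetical DDL: distribute on the join column, sort on the filtered date column
CREATE TABLE database.myTable (
    id        BIGINT,
    c_id      BIGINT,
    someCode  VARCHAR(32),
    timeTaken INTEGER,
    date      TIMESTAMP
)
DISTKEY (id)
SORTKEY (date);

-- Re-sort rows and refresh planner statistics after loading data
VACUUM database.myTable;
ANALYZE database.myTable;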
Here is another version of the same SQL written for Redshift.
A few tweaks I made:
Used a WITH clause to optimize cluster-level computation.
Used explicit join syntax; whether a left/right join is appropriate depends on your data.
Used a date_range WITH-clause table to isolate the date filter.
Used GROUP BY in the main SQL below.
My version of the Redshift SQL:
/** Date range computation **/
with date_range as (
select ( current_date - interval '2 weeks' ) as two_weeks
),
/** Filter the main result set **/
myGroupSet as (
SELECT a.id,
b.val AS myGroup,
c.username,
a.someCode,
a.timeTaken,
convert_timezone('PST', a.date) AS date,
(case when b.name = 'name1' THEN b.val::INTEGER ELSE 0 END) as name11,
(case when b.name = 'name2' THEN b.val::INTEGER ELSE 0 END) as name12
FROM database.myTable a
join date_range dr on a.date > dr.two_weeks
join database.detail b on b.id = a.id
join database.client c on c.c_id = a.c_id
)
/** Apply aggregation **/
select id, myGroup, username, someCode, timeTaken, date,
sum(name11) as name11, sum(name12) as name12
from myGroupSet
group by id, myGroup, username, someCode, timeTaken, date

Related

SELECT list expression references column integration_start_date which is neither grouped nor aggregated at

I'm facing an issue with the following query. It gives me this error: [SELECT list expression references column integration_start_date which is neither grouped nor aggregated at [34:63]]. In particular, it points to the first 'when' in the result table, which I don't know how to fix. This is on BigQuery, if that helps. As far as I can tell everything is written correctly, but I could be wrong. Any help appreciated.
with plan_data as (
select format_date("%Y-%m-%d",last_day(date(a.basis_date))) as invoice_date,
a.sponsor_id as sponsor_id,
b.company_name as sponsor_name,
REPLACE(SUBSTR(d.meta,STRPOS(d.meta,'merchant_id')+12,13),'"','') as merchant_id,
a.state as plan_state,
date(c.start_date) as plan_start_date,
a.employee_id as square_employee_id,
date(
(select min(date)
from glproductionview.stats_sponsors
where sponsor_id = a.sponsor_id and sponsor_payroll_provider_identifier = 'square' and date >= c.start_date) )
as integration_start_date,
count(distinct a.employee_id) as eligible_pts_count, --pts that are in active plan and have payroll activities (payroll deductions) in the reporting month
from glproductionview.payroll_activities as a
left join glproductionview.sponsors as b
on a.sponsor_id = b.id
left join glproductionview.dc_plans as c
on a.plan_id = c.id
left join glproductionview.payroll_connections as d
on a.sponsor_id = d.sponsor_id and d.provider_identifier = 'rocket' and a.company_id = d.payroll_id
where a.payroll_provider_identifier = 'rocket'
and format_date("%Y-%m",date(a.basis_date)) = '2021-07'
and a.amount_cents > 0
group by 1,2,3,4,5,6,7,8
order by 2 asc
)
select invoice_date,
sponsor_id,
sponsor_name,
eligible_pts_count,
case
when eligible_pts_count <= 5 and date_diff(current_date(),integration_start_date, month) <= 12 then 20
when eligible_pts_count <= 5 and date_diff(current_date(),integration_start_date, month) > 12 then 15
when eligible_pts_count > 5 and date_diff(current_date(),integration_start_date, month) <= 12 then count(distinct square_employee_id)*4
when eligible_pts_count > 5 and date_diff(current_date(),integration_start_date, month) > 12 then count(distinct square_employee_id)*3
else 0
end as fees
from plan_data
group by 1,2,3,4;
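For what it's worth, BigQuery raises that error whenever a non-aggregated column (here integration_start_date) appears in a SELECT list expression without being in the GROUP BY. One possible fix, sketched below and untested against the real schema, is to group by that column as well (or wrap it in an aggregate such as MIN):
select invoice_date,
sponsor_id,
sponsor_name,
eligible_pts_count,
case
when eligible_pts_count <= 5 and date_diff(current_date(),integration_start_date, month) <= 12 then 20
when eligible_pts_count <= 5 and date_diff(current_date(),integration_start_date, month) > 12 then 15
when eligible_pts_count > 5 and date_diff(current_date(),integration_start_date, month) <= 12 then count(distinct square_employee_id)*4
when eligible_pts_count > 5 and date_diff(current_date(),integration_start_date, month) > 12 then count(distinct square_employee_id)*3
else 0
end as fees
from plan_data
group by invoice_date, sponsor_id, sponsor_name, eligible_pts_count, integration_start_date;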

Building a subquery to be a column/field name

I am unable to bundle groups of subqueries correctly in order to create a column titled "Discharge_To".
I am using Teradata Studio Express. I was asked to create a column for a field that is not inside a table we normally use. We want to know where a patient was discharged to from a previous place of service. There are several steps needed to determine that. So far, it reads correctly until line 94.
Select S.Member_ID, S.PAC_Sty_ID, S.Stay_Admit_Date, S.Stay_Discharge_Date, S.POS, S.LOS,
(
Select
S.Member_ID, S.PAC_Sty_ID,
Case
When S.Discharge_To is null
and H.POS is not null And S.POS <> '12' then 'Home With Care'
When S.Discharge_To is null then 'Home Without Care'
Else S.Discharge_To
End Discharge_To
From (
Select
S.Member_ID, S.PAC_Sty_ID, S.Stay_Admit_Date, S.Stay_Discharge_Date, S.POS,
Case trim(D.POS)
When '21' then 'Hospital' When '23' then 'ER' When '31' then 'SNF'
When '61' then 'IRF' When 'LTAC' then 'LTAC'
End Discharge_To
From ECONIMICS.PAC_02_MODEL_SUMMARY_Combined S
Left Join (
Select S.Member_ID, S.PAC_Sty_ID, S.POS, S.Stay_Admit_Date, S.Stay_Discharge_Date
From ECONIMICS.PAC_02_MODEL_SUMMARY_Combined S
Where PAC_Sty_ID is not null
And POS <> '12'
) D On D.Member_ID = S.Member_ID And D.PAC_Sty_ID <> S.PAC_Sty_ID
And D.Stay_Admit_Date Between S.Stay_Admit_Date and S.Stay_Discharge_Date + 1
Where S.PAC_Sty_ID is not null
Qualify Row_Number() Over (
Partition By S.PAC_Sty_ID Order By Case trim(D.POS)
When '21' then 1 When 'LTAC' then 2 when '61' then 3 When '31' then 4 end
) = 1
) S
Left Join (
Select *
From ECONIMICS.PAC_02_MODEL_SUMMARY_Combined
Where POS = '12'
) H On H.Member_ID = S.Member_ID
And H.From_Date Between S.Stay_Discharge_Date and S.Stay_Discharge_Date + 7
Qualify Row_Number() Over (Partition By S.PAC_Sty_ID Order By H.From_Date) = 1
) E On E.Member_ID = S.Member_ID And E.PAC_Sty_ID = S.PAC_Sty_ID
Where S.PAC_Sty_ID is not Null
AND S.STAY_DISCHARGE_DATE between '2017-01-01' and '2020-12-31'
AND S.LOB in ('CARE', 'DUAL')
AND S.ORPHAN_CLM_ID IS NULL
AND S.ORPHAN_CLM_LN IS NULL
Group By 1, 2, 3, 4, 5, 6
There should be 7 columns, with the 7th column titled "Discharge_To"; the values in the seventh column would be text (e.g., "Home Without Care").
Posting here, since it's easier. Your query doesn't seem to be formatted correctly. It's of this form:
select S.Member_ID, ... ,
(
Select ... -- Sub-query trying to derive Discharge_To field
) E on E.Member_ID = S.Member_ID ...
where ...
A few notes:
There is no FROM clause in the outer query, yet you are trying to return S. fields.
There is no S result set to join your E result to.
The E result set is written as a sub-SELECT in the column list, yet it has an alias and an ON condition as if it were a joined table.
Not knowing what your error message is, I'd suggest breaking the query apart into its sub-queries and running those individually to determine where the problem lies.
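Purely as a structural sketch (not tested, and reusing your existing inner sub-query for E), the outer query probably needs a shape more like this:
Select S.Member_ID, S.PAC_Sty_ID, S.Stay_Admit_Date, S.Stay_Discharge_Date, S.POS, S.LOS,
E.Discharge_To
From ECONIMICS.PAC_02_MODEL_SUMMARY_Combined S
Left Join (
Select ... -- your existing sub-query that derives Discharge_To,
           -- returning Member_ID, PAC_Sty_ID and Discharge_To
) E On E.Member_ID = S.Member_ID And E.PAC_Sty_ID = S.PAC_Sty_ID
Where S.PAC_Sty_ID is not Null
...
Group By 1, 2, 3, 4, 5, 6, 7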

How to join 2 queries into one in an Excel Oracle connection

I have 2 queries.
I am trying to combine them so I can export from one query instead of manually joining the results in Excel.
(SELECT
b.OUT_NO,
a.ACCNO,
a.BILL_ACCNO,
a.NAME,
a.HOUSE_NO,
a.STREET,
a.HOUSE_NO2,
a.ZIP,
a.ID,
b.TIME_STAMP,
b.REST_DATE,
c.RESTORED_TIME,
b.OUT_CMNT
FROM brook.account a,
brook.problem b,
brook.history c
WHERE c.OUT_NO = b.OUT_NO
AND a.ID = c.ID
AND ( (a.NAME Is Not Null)
AND (a.DISC Is Null)
AND (b.TIME_STAMP>?)
AND (c.RESTORED_TIME<?))
)
and
(SELECT
b.OUT_NO,
a.ACCNO,
a.BILL_ACCNO,
a.NAME,
a.HOUSE_NO,
a.STREET,
a.HOUSE_NO2,
a.ZIP,
a.ID,
b.TIME_STAMP,
b.REST_DATE,
c.RESTORED_TIME,
b.OUT_CMNT
FROM brook.account a,
brook.problem b,
brook.history c
WHERE c.OUTAGE_NO = b.OUTAGE_NO
AND a.ID = c.ID
AND ( (a.NAME Is Not Null)
AND (a.DISC Is Null)
AND (b.TIME_STAMP > ? And b.TIME_STAMP < ?)
AND (c.RESTORED_TIME > ? And c.RESTORED_TIME < ?)
)
)
How can I combine these 2 into 1? I tried UNION ALL but I get an ORA-01847 "day of month must be between 1 and last day of month" error.
The ?s are parameters; they are linked to cells on the spreadsheet.
Format of the Excel date parameter: 11/04/2013 00:00:00
Thanks
The error is about a date format, not about the union.
If you pass cell values as string parameters, Oracle tries to convert them to dates to compare them with DATE or TIMESTAMP columns in the tables. For this conversion Oracle uses its default date format, which in your case is not mm/dd/yyyy hh24:mi:ss.
There are 2 ways to fix the situation:
Pass the parameters to the query with a date type and convert the values to dates before passing them to Oracle. Check the examples on MSDN and the descriptions of the CreateParameter and Parameters.Append methods.
Convert the values to dates in the query with Oracle's to_date function.
Change the conditions in the query from
AND (b.TIME_STAMP>?)
AND (c.RESTORED_TIME<?))
and
AND (b.TIME_STAMP > ? And b.TIME_STAMP < ?)
AND (c.RESTORED_TIME > ? And c.RESTORED_TIME < ?)
to
AND (b.TIME_STAMP > to_date(?,'mm/dd/yyyy hh24:mi:ss') )
AND (c.RESTORED_TIME < to_date(?,'mm/dd/yyyy hh24:mi:ss') ))
and
AND (
b.TIME_STAMP > to_date(?,'mm/dd/yyyy hh24:mi:ss')
And
b.TIME_STAMP < to_date(?,'mm/dd/yyyy hh24:mi:ss')
)
AND (
c.RESTORED_TIME > to_date(?,'mm/dd/yyyy hh24:mi:ss')
And
c.RESTORED_TIME < to_date(?,'mm/dd/yyyy hh24:mi:ss')
)
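Once both comparisons go through to_date, the two statements can be combined with UNION ALL the way you originally tried. A sketch (untested), keeping your own column lists and filters:
SELECT b.OUT_NO, a.ACCNO, a.BILL_ACCNO, a.NAME, a.HOUSE_NO, a.STREET, a.HOUSE_NO2,
       a.ZIP, a.ID, b.TIME_STAMP, b.REST_DATE, c.RESTORED_TIME, b.OUT_CMNT
FROM brook.account a, brook.problem b, brook.history c
WHERE c.OUT_NO = b.OUT_NO
  AND a.ID = c.ID
  AND a.NAME IS NOT NULL
  AND a.DISC IS NULL
  AND b.TIME_STAMP > to_date(?,'mm/dd/yyyy hh24:mi:ss')
  AND c.RESTORED_TIME < to_date(?,'mm/dd/yyyy hh24:mi:ss')
UNION ALL
SELECT b.OUT_NO, a.ACCNO, a.BILL_ACCNO, a.NAME, a.HOUSE_NO, a.STREET, a.HOUSE_NO2,
       a.ZIP, a.ID, b.TIME_STAMP, b.REST_DATE, c.RESTORED_TIME, b.OUT_CMNT
FROM brook.account a, brook.problem b, brook.history c
WHERE c.OUTAGE_NO = b.OUTAGE_NO
  AND a.ID = c.ID
  AND a.NAME IS NOT NULL
  AND a.DISC IS NULL
  AND b.TIME_STAMP > to_date(?,'mm/dd/yyyy hh24:mi:ss')
  AND b.TIME_STAMP < to_date(?,'mm/dd/yyyy hh24:mi:ss')
  AND c.RESTORED_TIME > to_date(?,'mm/dd/yyyy hh24:mi:ss')
  AND c.RESTORED_TIME < to_date(?,'mm/dd/yyyy hh24:mi:ss')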

Running sum with aggregate function

I am retrieving the results of the mlog table and calculating the running subtotal of qtyn with the help of code 1 below. I am stuck on how to combine my second snippet's criteria with the first.
Thanks for any help
1.
SELECT autn, date, itcode, qtyn, out,
date, phstock,
qtyn + COALESCE(
(SELECT SUM(qtyn) FROM dbo.mlog b
WHERE b.autn < a.autn
AND itcode = '40'), 0) AS balance
FROM dbo.mlog a
WHERE (itcode = '40')
ORDER BY autn
2.
date >=(SELECT MAX([date]) FROM mlog)
To append a condition to the query, use AND or OR, e.g.:
SELECT a.autn, a.date, a.itcode, a.qtyn, a.out,
a.date, a.phstock,
a.qtyn + COALESCE(
(SELECT SUM(b.qtyn) FROM dbo.mlog b
WHERE b.autn < a.autn
AND b.itcode = '40'), 0) AS balance
FROM dbo.mlog a
WHERE (a.itcode = '40' AND a.date >= (SELECT MAX(c.[date]) FROM mlog c) )
ORDER BY a.autn
Not tested, but should do what you want
I have heard that SQL Server is rather inefficient with coalesce(), because it runs the first part twice. Here is an alternative way of writing this:
with ml as (
SELECT ml.autn, ml.date, ml.itcode, ml.qtyn, ml.out, ml.phstock
FROM dbo.mlog ml
WHERE ml.itcode = '40' AND ml.date >= (SELECT MAX(ml1.[date]) FROM mlog ml1)
)
select ml.*,
(select sum(ml1.qtyn) from ml ml1 where ml1.autn <= ml.autn) as balance
from ml
ORDER BY ml.autn
I also wonder if the where clause would be more efficient as:
WHERE ml.itcode = '40' AND ml.date = (SELECT top 1 ml1.date FROM mlog ml1 order by ml1.date desc)

Discrete Derivative in SQL

I've got sensor data in a table in the form:
Time Value
10 100
20 200
36 330
46 440
I'd like to pull the change in values for each time period. Ideally, I'd like to get:
Starttime Endtime Change
10 20 100
20 36 130
36 46 110
My SQL skills are pretty rudimentary, so my inclination is to pull all the data out to a script that processes it and then push it back to the new table, but I thought I'd ask if there was a slick way to do this all in the database.
Select a.Time as StartTime
, b.time as EndTime
, b.time-a.time as TimeChange
, b.value-a.value as ValueChange
FROM YourTable a
Left outer Join YourTable b ON b.time>a.time
Left outer Join YourTable c ON c.time<b.time AND c.time > a.time
Where c.time is null
Order By a.time
Select a.Time as StartTime, b.time as EndTime, b.time-a.time as TimeChange, b.value-a.value as ValueChange
FROM YourTable a, YourTable b
WHERE b.time = (Select MIN(c.time) FROM YourTable c WHERE c.time>a.time)
You could use a SQL window function; below is an example based on BigQuery syntax.
SELECT
LAG(time, 1) OVER (ORDER BY time) AS start_time,
time AS end_time,
value - LAG(value, 1) OVER (ORDER BY time) AS change
from data
First off, I would add an id column to the table so that you have something that predictably increases from row to row.
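(For instance, if this is SQL Server, a hypothetical identity column would do the job; adjust the syntax for your database.)
-- Hypothetical: add an auto-incrementing id (SQL Server syntax)
ALTER TABLE SensorData ADD id INT IDENTITY(1,1);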
Then, I would try the following query:
SELECT t1.Time AS 'Starttime', t2.Time AS 'Endtime',
(t2.Value - t1.Value) AS 'Change'
FROM SensorData t1
INNER JOIN SensorData t2 ON (t2.id - 1) = t1.id
ORDER BY t1.Time ASC
I'm going to create a test table to try this for myself, so I don't know if it works yet, but it's worth a shot!
Update
Fixed one minor issue (CHANGE is a reserved word and had to be quoted), but I tested it and it works! It produces exactly the results defined above.
Does this work?
WITH T AS
(
SELECT [Time]
, Value
, RN1 = ROW_NUMBER() OVER (ORDER BY [Time])
, RN2 = ROW_NUMBER() OVER (ORDER BY [Time]) - 1
FROM SensorData
)
SELECT
StartTime = ISNULL(t1.[time], t2.[time])
, EndTime = ISNULL(t2.[time], 0)
, Change = t2.value - t1.value
FROM T t1
LEFT OUTER JOIN
T t2
ON t1.RN1 = t2.RN2