Having difficulty writing sub-query - hive

I am at a beginner level with HiveQL and am trying to write a faster, more efficient query, but I'm having trouble with it. Can someone help me rewrite this query? Any tips for improving my queries in general would be appreciated as well.
select "AUDIOONLYtopctrbyweek37Q32015", weekofyear(day),op.order_id,oppty_amount, mv.order_start_date, mv.order_end_date, count(distinct rdz.listener_id) as listeners, sum(impressions) , sum(clicks), (sum(clicks)/sum(impressions)) as ctr, sum(oline_net_amount)
from ROLLUP_PST rdz
join dfp2ss mv on (rdz.order_id = mv.dfp_order_id)
join oppty_order_oline op on (mv.order_id = op.order_id)
where day >= '2015-09-07'
and day <= '2015-09-13'
and creative_size in ('2000x132','134x1285','2000x114')
group by "AUDIOONLYtopctrbyweek37Q32015", weekofyear(day),op.order_id,oppty_amount, mv.order_start_date, mv.order_end_date
order by ctr desc
limit 150;

Please try the modified query below. It should work for you.
select "AUDIOONLYtopctrbyweek37Q32015",week_of_year,order_id,oppty_amount,order_start_date,order_end_date, count(distinct listener_id) over (partition by "AUDIOONLYtopctrbyweek37Q32015",week_of_year,order_id,oppty_amount,order_start_date,order_end_date) from (select "AUDIOONLYtopctrbyweek37Q32015", weekofyear(day) as week_of_year,op.order_id as order_id,
oppty_amount, mv.order_start_date as order_start_date, mv.order_end_date as order_end_date,rdz.listener_id as listener_id
from
ROLLUP_PST rdz,
dfp2ss mv,
oppty_order_oline op where rdz.order_id = mv.dfp_order_id and mv.order_id = op.order_id and day >= '2015-09-07' and day <= '2015-09-13'
and creative_size in ('2000x132','134x1285','2000x114')) z
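Note that this rewrite returns one row per listener and drops the impression/click sums, the CTR, and the ORDER BY/LIMIT from the original query. A common Hive alternative that keeps all of them is two-stage aggregation: pre-aggregate per listener, then count rows instead of using COUNT(DISTINCT). A sketch assembled from the columns in the original query, untested against your schema:
select "AUDIOONLYtopctrbyweek37Q32015",
       week_of_year,
       order_id,
       oppty_amount,
       order_start_date,
       order_end_date,
       count(*) as listeners,  -- one row per listener after stage 1
       sum(impressions),
       sum(clicks),
       sum(clicks) / sum(impressions) as ctr,
       sum(net_amount)
from (select weekofyear(day) as week_of_year,
             op.order_id as order_id,
             oppty_amount,
             mv.order_start_date as order_start_date,
             mv.order_end_date as order_end_date,
             rdz.listener_id,
             sum(impressions) as impressions,
             sum(clicks) as clicks,
             sum(oline_net_amount) as net_amount
      from ROLLUP_PST rdz
      join dfp2ss mv on rdz.order_id = mv.dfp_order_id
      join oppty_order_oline op on mv.order_id = op.order_id
      where day >= '2015-09-07'
        and day <= '2015-09-13'
        and creative_size in ('2000x132', '134x1285', '2000x114')
      group by weekofyear(day), op.order_id, oppty_amount,
               mv.order_start_date, mv.order_end_date, rdz.listener_id) per_listener
group by "AUDIOONLYtopctrbyweek37Q32015", week_of_year, order_id, oppty_amount,
         order_start_date, order_end_date
order by ctr desc
limit 150;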


Retrieve data if next line of data equals a particular value

I am very new to SQL and I need some assistance with a query.
I am writing a script that reviews a log file. The query retrieves the instance when a particular status occurred. This works as expected, but I would now like to add a condition: only include a row if the status immediately after it equals 'Accepted' or 'Attended'. How would I do this? I have pasted the current script below, with a comment marking where I think this condition should go. Any help would be greatly appreciated!
WITH Test AS
(
Select j.jobcode, min(log.timestamp) as 'Time First Assigned'
from Job J
inner join JobLog Log
on J.JobID = Log.JobID
and log.JobStatusID = 'Assigned' -- and record after this equals accepted or attended
where j.CompletionDate >= #Start_date
and j.CompletionDate < #End_date
Group by j.jobcode
)
I recommend lead(), but using it in a subquery on one table:
with test as (
      select j.jobcode, min(jl.timestamp) as time_first_assigned
      from Job j join
           (select jl.*,
                   lead(jl.JobStatusID) over (partition by jl.jobid order by jl.timestamp) as next_status
            from JobLog jl
           ) jl
           on j.JobID = jl.JobID
      where jl.JobStatusID = 'Assigned' and
            jl.next_status in ('Accepted', 'Attended') and
            j.CompletionDate >= #Start_date and
            j.CompletionDate < #End_date
      group by j.jobcode
     )
In particular, this enables the optimizer to use an index on JobLog(jobid, timestamp, JobStatusId) for the lead(). That said, this will not always improve performance, particularly if the filter on the CompletionDate filters out most rows.
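For reference, that covering index might be created along these lines (the index name is illustrative):
CREATE INDEX ix_joblog_jobid_timestamp_status
    ON JobLog (JobID, timestamp, JobStatusID);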
You can use the LEAD window function as follows:
Select jobcode, min(ts) as 'Time First Assigned' from
(select j.jobcode, Log.timestamp as ts, Log.JobStatusID,
        lead(Log.JobStatusID)
            over (partition by Log.JobID order by Log.timestamp) as lead_statusid
 from Job j
 inner join JobLog Log on j.JobID = Log.JobID
 where j.CompletionDate >= #Start_date and j.CompletionDate < #End_date
) t
where JobStatusID = 'Assigned' and lead_statusid in ('Accepted', 'Attended')
Group by jobcode
Thank you very much.
I used Gordon's suggested code and, once I changed the values to the names used in my code, I can confirm that it works.
I did look at the LEAD function, but I didn't know how to apply it.
Again, thanks to everyone for helping with my query.

Is this simple SQL query correct?

The query below is pretty self-explanatory, and although I'm not good at SQL, I can't find anything wrong with it. However, the number it yields is not in accordance with my gut feeling, and I would like it double-checked, if that is appropriate for Stack Overflow.
I'm simply trying to get the number of users that joined my website in 2020, and also made a payment in 2020. I'm trying to figure out "new revenue".
This is the query:
SELECT Count(DISTINCT( auth_user.id )) AS "2020"
FROM auth_user
JOIN subscription_transaction
ON ( subscription_transaction.event = 'one-time payment'
AND subscription_transaction.user_id = auth_user.id
AND subscription_transaction.timestamp >= '2020-01-01'
AND subscription_transaction.timestamp <= '2020-12-31' )
WHERE auth_user.date_joined >= '2020-01-01'
AND auth_user.date_joined <= '2020-12-31';
I use PostgreSQL 10.
Thanks in advance!
I would write the query using EXISTS to get rid of the COUNT(DISTINCT):
SELECT count(*) AS "2020"
FROM auth_user au
WHERE au.date_joined >= '2020-01-01' AND
au.date_joined < '2021-01-01' AND
EXISTS (SELECT 1
FROM subscription_transaction st
WHERE st.event = 'one-time payment' AND
st.user_id = au.id AND
st.timestamp >= '2020-01-01' AND
st.timestamp < '2021-01-01'
) ;
This should be faster than your version. One subtle difference: with a timestamp column, <= '2020-12-31' only matches rows at exactly midnight on December 31, whereas the half-open range < '2021-01-01' covers the whole day, which is almost certainly what you intended.
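If it is still slow, an index that covers the EXISTS probe may help (the index name is illustrative):
CREATE INDEX idx_subscription_transaction_user_event_ts
    ON subscription_transaction (user_id, event, timestamp);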

Slow Aggregates using as-of date

I have a query that's intended as the base dataset for an AR Aging report in a BI tool. The report has to be able to show AR as of a given date across a several-month range. I have the logic working, but I'm seeing pretty slow performance. Code below:
WITH
DAT AS (
SELECT
MY_DATE AS_OF_DATE
FROM
NS_REPORTS."PUBLIC".NETSUITE_DATE_TABLE
WHERE
CAST(CAST(MY_DATE AS TIMESTAMP) AS DATE) BETWEEN '2020-01-01' AND CAST(CAST(CURRENT_DATE() AS TIMESTAMP) AS DATE)
), INV AS
(
WITH BASE AS
(
SELECT
BAS1.TRANSACTION_ID
, DAT.AS_OF_DATE
, SUM(BAS1.AMOUNT) ORIG_AMOUNT_BASE
FROM
"PUBLIC".BILL_TRANS_LINES_BASE BAS1
CROSS JOIN DAT
WHERE
BAS1.TRANSACTION_TYPE = 'Invoice'
AND BAS1.TRANSACTION_DATE <= DAT.AS_OF_DATE
--AND BAS1.TRANSACTION_ID = 6114380
GROUP BY
BAS1.TRANSACTION_ID
, DAT.AS_OF_DATE
)
, TAX AS
(
SELECT
TRL1.TRANSACTION_ID
, SUM(TRL1.AMOUNT_TAXED * - 1) ORIG_AMOUNT_TAX
FROM
CONNECTORS.NETSUITE.TRANSACTION_LINES TRL1
WHERE
TRL1.AMOUNT_TAXED IS NOT NULL
AND TRL1.TRANSACTION_ID IN (SELECT TRANSACTION_ID FROM BASE)
GROUP BY
TRL1.TRANSACTION_ID
)
SELECT
BASE.TRANSACTION_ID
, BASE.AS_OF_DATE
, BASE.ORIG_AMOUNT_BASE
, COALESCE(TAX.ORIG_AMOUNT_TAX, 0) ORIG_AMOUNT_TAX
FROM
BASE
LEFT JOIN TAX ON TAX.TRANSACTION_ID = BASE.TRANSACTION_ID
)
SELECT
AR.*
, CASE
WHEN AR.DAYS_OUTSTANDING < 0
THEN 'Current'
WHEN AR.DAYS_OUTSTANDING BETWEEN 0 AND 30
THEN '0 - 30'
WHEN AR.DAYS_OUTSTANDING BETWEEN 31 AND 60
THEN '31 - 60'
WHEN AR.DAYS_OUTSTANDING BETWEEN 61 AND 90
THEN '61 - 90'
WHEN AR.DAYS_OUTSTANDING > 90
THEN '91+'
ELSE NULL
END DO_BUCKET
FROM
(
SELECT
AR1.*
, TRA1.TRANSACTION_TYPE
, DATEDIFF('day', AR1.AS_OF_DATE, CAST(CAST(TRA1.DUE_DATE AS TIMESTAMP) AS DATE)) DAYS_OUTSTANDING
, AR1.ORIG_AMOUNT_BASE + AR1.ORIG_AMOUNT_TAX + AR1.PMT_AMOUNT AMOUNT_OUTSTANDING
FROM
(
SELECT
INV.TRANSACTION_ID
, INV.AS_OF_DATE
, INV.ORIG_AMOUNT_BASE
, INV.ORIG_AMOUNT_TAX
, COALESCE(PMT.PMT_AMOUNT, 0) PMT_AMOUNT
FROM
INV
LEFT JOIN (
SELECT
TLK.ORIGINAL_TRANSACTION_ID
, DAT.AS_OF_DATE
, SUM(TLK.AMOUNT_LINKED * - 1) PMT_AMOUNT
FROM
CONNECTORS.NETSUITE."TRANSACTION_LINKS" AS TLK
CROSS JOIN DAT
WHERE
TLK.LINK_TYPE = 'Payment'
AND CAST(CAST(TLK.ORIGINAL_DATE_POSTED AS TIMESTAMP) AS DATE) <= DAT.AS_OF_DATE
GROUP BY
TLK.ORIGINAL_TRANSACTION_ID
, DAT.AS_OF_DATE
) PMT ON PMT.ORIGINAL_TRANSACTION_ID = INV.TRANSACTION_ID
AND PMT.AS_OF_DATE = INV.AS_OF_DATE
) AR1
JOIN CONNECTORS.NETSUITE."TRANSACTIONS" TRA1 ON TRA1.TRANSACTION_ID = AR1.TRANSACTION_ID
)
AR
WHERE
1 = 1
--AND CAST(AMOUNT_OUTSTANDING AS NUMERIC(15, 2)) > 0
AND AS_OF_DATE >= '2020-04-22'
As you can see, I'm using a date table for the as-of date logic. I think this is the best way to do it, but I welcome any suggestions for better practice.
If I run the query with a single as-of date, it takes 1 min 6 sec, and the two main aggregates, on TRANSACTION_LINKS and BILL_TRANS_LINES_BASE, each take about 25% of processing time; I'm not sure why. If I run it with the filter shown (>= '2020-04-22'), it takes 3 min 33 sec and the aggregates each take about 10% of processing time; they're lower because the ResultWorker takes 63% of processing time writing out the much larger result set.
I'm new to Snowflake but not to SQL. My understanding is that Snowflake does not allow manual creation of indexes, but again, I'm happy to be wrong. Please let me know if you have any ideas for improving the performance of this query.
Thanks in advance.
EDIT 1:
[Screenshot of the most expensive node in the query profile]
Without seeing the full explain plan and having some sample data to play with, it is difficult to give any definitive answers, but here are a few thoughts, for what they are worth...
The first are more about readability and may not help performance much:
Don't embed CTEs within each other, just define them in the order that they are needed. There is no need to define BASE and TAX within INV
Use CTEs as much as possible. Your main SELECT statement has 2 other SELECT statements embedded within it. It would be much more readable if these were defined using CTEs
Specific performance issues:
Keep data volumes as low as possible for as long as possible. Your CROSS JOINs create cartesian products that massively increase the volume of data, so implement them as late in your SQL as possible rather than right at the start as you have done.
While it may make your SQL less readable, use as few SQL statements as possible. For example, you should be able to create your INV CTE with a single SELECT statement rather than the 3 statements/CTEs that you are using; see the sketch below.
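For example, here is a hedged sketch of INV as a single statement, pre-aggregating tax in a derived table (table and column names are taken from your query, but this is untested without your data):
INV AS (
    SELECT
        BAS1.TRANSACTION_ID
        , DAT.AS_OF_DATE
        , SUM(BAS1.AMOUNT) AS ORIG_AMOUNT_BASE
        , COALESCE(TAX.ORIG_AMOUNT_TAX, 0) AS ORIG_AMOUNT_TAX
    FROM "PUBLIC".BILL_TRANS_LINES_BASE BAS1
    CROSS JOIN DAT
    LEFT JOIN (
        SELECT TRL1.TRANSACTION_ID
             , SUM(TRL1.AMOUNT_TAXED * -1) AS ORIG_AMOUNT_TAX
        FROM CONNECTORS.NETSUITE.TRANSACTION_LINES TRL1
        WHERE TRL1.AMOUNT_TAXED IS NOT NULL
        GROUP BY TRL1.TRANSACTION_ID
    ) TAX ON TAX.TRANSACTION_ID = BAS1.TRANSACTION_ID
    WHERE BAS1.TRANSACTION_TYPE = 'Invoice'
      AND BAS1.TRANSACTION_DATE <= DAT.AS_OF_DATE
    GROUP BY BAS1.TRANSACTION_ID, DAT.AS_OF_DATE, TAX.ORIG_AMOUNT_TAX
)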

very slow oracle select statement

I have a select statement that returns hundreds of thousands of rows, but the execution time is very slow, taking longer than 15 minutes. Is there any way I can improve the execution time of this select statement?
select a.levelP,
a.code,
a.descP,
(select nvl(SUM(amount),0) from ca_glopen where code = a.code and acc_mth = '2016' ) ocf,
(select nvl(SUM(amount),0) from ca_glmaintrx where code = a.code and to_char(doc_date,'yyyy') = '2016' and to_char(doc_date,'yyyymm') < '201601') bcf,
(select nvl(SUM(amount),0) from ca_glmaintrx where jum_amaun > 0 and code = a.code and to_char(doc_date,'yyyymm') = '201601' ) debit,
(select nvl(SUM(amount),0) from ca_glmaintrx where jum_amaun < 0 and code = a.code and to_char(doc_date,'yyyymm') = '201601' ) credit
from ca_chartAcc a
where a.code is not null
order by to_number(a.code), to_number(levelP)
Please help me find a way to speed up this query. Thank you.
Your primary problem is that most of your subqueries apply functions to your search criteria, including some awkward ones on your dates. It's much better to flip that around and explicitly qualify the expected range by supplying actual dates (a one-month range is usually a small percentage of total rows, so this is very likely to hit an index).
SELECT Chart.levelP, Chart.code, Chart.descP,
       COALESCE(GL_Sum.ocf, 0) AS ocf,
       COALESCE(Transactions.bcf, 0) AS bcf,
       COALESCE(Transactions.debit, 0) AS debit,
       COALESCE(Transactions.credit, 0) AS credit
FROM ca_ChartAcc Chart
LEFT JOIN (SELECT code, SUM(amount) AS ocf
           FROM ca_GLOpen
           WHERE acc_mth = '2016'
           GROUP BY code) GL_Sum
       ON GL_Sum.code = Chart.code
LEFT JOIN (SELECT code,
                  SUM(amount) AS bcf,
                  SUM(CASE WHEN amount > 0 THEN amount END) AS debit,
                  SUM(CASE WHEN amount < 0 THEN amount END) AS credit
           FROM ca_GLMainTrx
           WHERE doc_date >= TO_DATE('2016-01-01', 'YYYY-MM-DD')
             AND doc_date < TO_DATE('2016-02-01', 'YYYY-MM-DD')
           GROUP BY code) Transactions
       ON Transactions.code = Chart.code
WHERE Chart.code IS NOT NULL
ORDER BY TO_NUMBER(Chart.code), TO_NUMBER(Chart.levelP)
If you only need a few codes, it may yield better results to push those values into the subqueries as well (although note that the optimizer is likely to perform this for you).
It may be possible to remove the calls to TO_NUMBER(...) from the ORDER BY clause; however, this depends on the format of the values, since how they were encoded may change the ordering of results.
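If the query is still slow, composite indexes along these lines would let both derived tables resolve with index range scans (index names are illustrative; check whether equivalents already exist):
CREATE INDEX ix_glmaintrx_code_docdate ON ca_GLMainTrx (code, doc_date);
CREATE INDEX ix_glopen_code_accmth ON ca_GLOpen (code, acc_mth);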

Trying to create a SQL query

I am trying to create a query that retrieves only the ten companies with the highest number of pickups over the six-month period; this means pickup occasions, not the number of items picked up.
I have done this
SELECT *
FROM customer
JOIN (SELECT manifest.pickup_customer_ref reference,
DENSE_RANK() OVER (PARTITION BY manifest.pickup_customer_ref ORDER BY COUNT(manifest.trip_id) DESC) rnk
FROM manifest
INNER JOIN trip ON manifest.trip_id = trip.trip_id
WHERE trip.departure_date > TRUNC(SYSDATE) - interval '6' month
GROUP BY manifest.pickup_customer_ref) cm ON customer.reference = cm.reference
WHERE cm.rnk < 11;
This uses dense_rank to determine the order of customers with the highest number of trips first.
Hmm, well, I don't have Oracle so I can't test it 100%, but I believe you're looking for something like the following.
Keep in mind that when you use GROUP BY, you have to narrow the SELECT down to the same fields you group by. Hope this helps at least give you an idea of what to look at.
select TOP 10
       c.company_name,
       m.pickup_customer_ref,
       count(*) as pickup_count
from customer c
inner join manifest m on m.pickup_customer_ref = c.reference
inner join trip t on t.trip_id = m.trip_id
where t.departure_date > DATEADD(month, -6, GETDATE())
group by c.company_name, m.pickup_customer_ref
order by pickup_count desc, c.company_name
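Since the original attempt looks like Oracle (TRUNC(SYSDATE), interval literals), here is a hedged translation of the same idea, assuming Oracle 12c or later for the FETCH FIRST syntax:
SELECT c.company_name,
       m.pickup_customer_ref,
       COUNT(*) AS pickup_count
FROM customer c
JOIN manifest m ON m.pickup_customer_ref = c.reference
JOIN trip t ON t.trip_id = m.trip_id
WHERE t.departure_date > TRUNC(SYSDATE) - INTERVAL '6' MONTH
GROUP BY c.company_name, m.pickup_customer_ref
ORDER BY pickup_count DESC
FETCH FIRST 10 ROWS ONLY;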