Combine two queries with monthly average - sql

I need to put together the results of these two queries into a single return with the following structure:
"date", avg(selic."Taxa"), avg(titulos."puVenda")
Partial structure of tables:
selic
"dtFechamento" date,
"pTaxa" real
titulos
"dtTitulo" date,
"puVenda" real,
"nomeTitulo" character(30)
Query table selic:
select to_char("dtFechamento", 'YYYY-MM') as data, avg("pTaxa")
from "selic"
group by data
order by data
Query table titulos:
select to_char("dtTitulo", 'YYYY-MM') as data, avg("puVenda")
from "titulos"
where "nomeTitulo" = 'LFT010321'
group by data
order by data
I tried a subquery, but it just returned the fields side by side and I cannot combine them by month.
select *
from (select to_char("dtFechamento", 'YYYY-MM') as data, avg("pTaxa")
from "selic"
group by data
order by data) as selic,
(select to_char("dtTitulo", 'YYYY-MM') as data, avg("puVenda")
from "titulos"
where "nomeTitulo" = 'LFT010321'
group by data
order by data) as LFT010321;

Assuming you want to return one row per month where either of your two queries returns a row, and to pad missing values from the other query with NULL, use a FULL [OUTER] JOIN:
SELECT to_char(mon, 'YYYY-MM') AS data, s.avg_taxa, t.avg_venda
FROM (
   SELECT date_trunc('month', "dtFechamento") AS mon, avg("pTaxa") AS avg_taxa
   FROM   selic
   GROUP  BY 1
   ) s
FULL JOIN (
   SELECT date_trunc('month', "dtTitulo") AS mon, avg("puVenda") AS avg_venda
   FROM   titulos
   WHERE  "nomeTitulo" = 'LFT010321'
   GROUP  BY 1
   ) t USING (mon)
ORDER BY mon;
It is substantially faster to join after aggregating than before (fewer join operations).
It is also faster to GROUP BY, JOIN and ORDER on timestamp values than on a text rendition. Typically also cleaner and less error prone (although text is unambiguous in this particular case). That's why I use date_trunc() instead of to_char() on lower levels.
If the format for the month is not important, you can just return the timestamp value. Else you can format any way you like after you are done processing.
Similar case with more explanation:
PostgreSQL merge two queries with COUNT and GROUP BY in each

This should get what you need. The inner "PQ" (pre-query) does a UNION ALL between the two sources, but also adds a flag column to identify which average each row is associated with. Each part is grouped by date, so the outer query has AT MOST two records for a given date: one for the tax average, one for the venda average. You therefore don't need a full outer join, nor do you need to build a dynamic calendar basis to cover all possible dates.
So a given date can have only a tax average, only a venda average, or both.
SELECT
   PQ.Data,
   SUM( CASE WHEN PQ.SumType = 'T' THEN PQ.TypeAvg ELSE 0 END ) AS AvgTax,
   SUM( CASE WHEN PQ.SumType = 'V' THEN PQ.TypeAvg ELSE 0 END ) AS AvgVenda
FROM
   ( SELECT
        to_char( "dtFechamento", 'YYYY-MM' ) AS Data,
        'T' AS SumType,
        avg( "pTaxa" ) AS TypeAvg
     FROM
        selic
     GROUP BY
        to_char( "dtFechamento", 'YYYY-MM' )
     UNION ALL
     SELECT
        to_char( "dtTitulo", 'YYYY-MM' ) AS Data,
        'V' AS SumType,
        avg( "puVenda" ) AS TypeAvg
     FROM
        titulos
     WHERE
        "nomeTitulo" = 'LFT010321'
     GROUP BY
        to_char( "dtTitulo", 'YYYY-MM' ) ) PQ
GROUP BY
   PQ.Data
ORDER BY
   PQ.Data

Related

How to reference fields from table created in sub-query's of large JOIN

I am writing a large query with many JOINs (shortened in the example here) and I am trying to reference values from other sub-queries, but I can't figure out how.
This is my example query:
DROP TABLE IF EXISTS breakdown;
CREATE TEMP TABLE breakdown AS
SELECT * FROM
(
SELECT COUNT(DISTINCT s_id) AS before, date_trunc('day', time) AS day FROM table_a
WHERE date_trunc('sec',earliest) < date_trunc('sec',time) GROUP BY day
)
JOIN
(
SELECT ROUND(before * 100.0 / total, 1) AS Percent_1, day
FROM breakdown
GROUP BY day
) USING (day)
JOIN
(
SELECT COUNT(DISTINCT s_id) AS equal, date_trunc('day', time) AS day FROM table_a
WHERE date_trunc('sec',earliest) = date_trunc('sec',time) GROUP BY day
) USING (day)
JOIN
(
SELECT COUNT(DISTINCT s_id) AS after, date_trunc('day', time) AS day FROM table_a
WHERE date_trunc('sec',earliest) > date_trunc('sec',time) GROUP BY day
) USING (day)
JOIN
(
SELECT COUNT(DISTINCT s_id) AS total, date_trunc('day', earliest) AS day
FROM first
GROUP BY 2
) USING (day)
ORDER BY day;
SELECT * FROM breakdown ORDER BY day;
The last query gives me the total and for each of the previous subqueries I want to get the percentages as well.
I found the code for getting the percentage (second JOIN) but I don't know how to reference the values from the other tables.
E.g. for the percentage from the first query, I want to take the COUNT of the first query (which I renamed before) and divide it by the COUNT of the last query (which I renamed total). (If there is an easier way to get the percentage for each of the sub-queries, please let me know.) But I can't seem to find how to reference them. I tried adding AS x to the end of each subquery and referring to it that way (x.total), as well as referencing via the parent table (breakdown.total), but neither worked.
How can I do this without changing my table too much, as it is a long table with a lot of sub-queries?
This is what my table looks like; I would like to add a percentage for each column.
Using Redshift, BTW.
Thanks
I'm a little confused by all that is going on, as you drop table breakdown and then, in the second subquery of the CREATE TABLE, you reference breakdown. I suspect that there are some issues in the provided sample of SQL. Please update if there are issues.
For a number of these subqueries it looks like you are using a subquery where a CASE statement will do. In Redshift you don't want to scan the same table over and over if you can prevent it. For example, if we look at the 3rd and 4th subqueries, you can replace them with one query. Also, in these cases I like to use the DECODE() statement rather than CASE since it is more readable in these simple cases.
(
SELECT COUNT(DISTINCT s_id) AS equal, date_trunc('day', time) AS day
FROM table_a
WHERE date_trunc('sec',earliest) = date_trunc('sec',time)
GROUP BY day
) USING (day)
JOIN
(
SELECT COUNT(DISTINCT s_id) AS after, date_trunc('day', time) AS day
FROM table_a
WHERE date_trunc('sec',earliest) > date_trunc('sec',time)
GROUP BY day
)
Becomes:
(
SELECT COUNT(DISTINCT DECODE(date_trunc('sec',earliest) = date_trunc('sec',time), true, s_id, NULL)) AS equal,
COUNT(DISTINCT DECODE(date_trunc('sec',earliest) > date_trunc('sec',time), true, s_id, NULL)) AS after,
date_trunc('day', time) AS day
FROM table_a
GROUP BY day
)
Read each table once (if at all possible) and calculate the desired results. Then you will have all your values in one layer of query and can reference these new values. This will be faster (especially on Redshift).
=============================
Expanding based on a comment made by the poster.
It appears that using DECODE() and referencing derived columns in a single query can produce what you want. I don't have your data, so I cannot test this, but here is what I'd want to move to (the percentage is computed last, so it can reference the before and total aliases defined earlier in the select list):
SELECT
  date_trunc('day', time) AS day,
  COUNT(DISTINCT DECODE(date_trunc('sec',earliest) < date_trunc('sec',time), true, s_id)) AS before,
  COUNT(DISTINCT DECODE(date_trunc('sec',earliest) = date_trunc('sec',time), true, s_id)) AS equal,
  COUNT(DISTINCT DECODE(date_trunc('sec',earliest) > date_trunc('sec',time), true, s_id)) AS after,
  COUNT(DISTINCT s_id) AS total,
  ROUND(before * 100.0 / total, 1) AS Percent_1
FROM table_a
GROUP BY date_trunc('day', time);
This should be a complete replacement for the SELECT currently inside your CREATE TEMP TABLE. However, I don't have sample data so this is untested.
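To make that concrete, here is an untested sketch of how the replacement might slot into the original CREATE TEMP TABLE, with the percentage for each bucket computed in a follow-up SELECT (the before_pct / equal_pct / after_pct names are just placeholders):
DROP TABLE IF EXISTS breakdown;
CREATE TEMP TABLE breakdown AS
SELECT
  date_trunc('day', time) AS day,
  COUNT(DISTINCT DECODE(date_trunc('sec',earliest) < date_trunc('sec',time), true, s_id)) AS before,
  COUNT(DISTINCT DECODE(date_trunc('sec',earliest) = date_trunc('sec',time), true, s_id)) AS equal,
  COUNT(DISTINCT DECODE(date_trunc('sec',earliest) > date_trunc('sec',time), true, s_id)) AS after,
  COUNT(DISTINCT s_id) AS total
FROM table_a
GROUP BY date_trunc('day', time);

-- Percentages for every bucket, derived from the single-scan aggregates above
SELECT day,
  before, ROUND(before * 100.0 / total, 1) AS before_pct,
  equal,  ROUND(equal  * 100.0 / total, 1) AS equal_pct,
  after,  ROUND(after  * 100.0 / total, 1) AS after_pct,
  total
FROM breakdown
ORDER BY day;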

Using Subquery in Sequence function PrestoSQL

Use case -
I am trying to find weekly frequency of a customer from a dataset. Now, not all customers have "events" happening in all of the weeks, and I would need to fill them in with zero values for the "count" column.
I was trying to do this using the SEQUENCE function of PrestoSQL. However, this would need me to get the value of the max week from the customer's orders itself (I don't want to hardcode this, since the result will be going into a BI tool and I don't want to update it manually every week).
with all_orders_2020 as (select customer, cast(date_parse(orderdate, '%Y-%m-%d') as date) as order_date
from orders
where orderdate > '2020-01-01' and customer in (select customer from some_customers)),
orders_with_week_number as (select *, week(order_date) as week_number from all_orders_2020),
weekly_count as (select customer, week_number, count(*) as ride_count from orders_with_week_number
where customer = {{some_customer}} group by customer, week_number)
SELECT
week_number
FROM
(VALUES
(SEQUENCE(1,(select max(week_number) from weekly_count)))
) AS t1(week_array)
CROSS JOIN
UNNEST(week_array) AS t2(week_number)
Presto complains about this, saying:
Unexpected subquery expression in logical plan: (SELECT "max"(week_number)
FROM
weekly_count
)
Any clues how this can be done?
Had a similar use case and followed the example from here: https://docs.aws.amazon.com/athena/latest/ug/flattening-arrays.html
Bring the SEQUENCE out and define the subquery using a WITH clause:
WITH dataset AS (
SELECT SEQUENCE(1, (SELECT MAX(week_number) FROM weekly_count)) AS week_array
)
SELECT week_number FROM dataset
CROSS JOIN UNNEST(week_array) as t(week_number)
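For the original goal of filling weeks that have no events with a zero count, one possible (untested) follow-up, reusing the question's orders table and {{some_customer}} template variable in a simplified weekly_count, is to LEFT JOIN the generated week numbers back to the weekly counts and COALESCE the gaps to zero:
WITH weekly_count AS (
    SELECT week(CAST(date_parse(orderdate, '%Y-%m-%d') AS date)) AS week_number,
           count(*) AS ride_count
    FROM orders
    WHERE orderdate > '2020-01-01'
      AND customer = {{some_customer}}
    GROUP BY 1
),
all_weeks AS (
    SELECT week_number
    FROM (SELECT SEQUENCE(1, (SELECT MAX(week_number) FROM weekly_count)) AS week_array) AS s
    CROSS JOIN UNNEST(week_array) AS t(week_number)
)
SELECT a.week_number,
       COALESCE(w.ride_count, 0) AS ride_count   -- weeks without orders show up as 0
FROM all_weeks a
LEFT JOIN weekly_count w ON w.week_number = a.week_number
ORDER BY a.week_number;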

Same output in two different lateral joins

I'm working on a bit of PostgreSQL to grab the first 10 and last 10 invoices of every month between certain dates. I am having unexpected output in the lateral joins. Firstly the limit is not working, and each of the array_agg aggregates is returning hundreds of rows instead of limiting to 10. Secondly, the aggregates appear to be the same, even though one is ordered ASC and the other DESC.
How can I retrieve only the first 10 and last 10 invoices of each month group?
SELECT first.invoice_month,
array_agg(first.id) first_ten,
array_agg(last.id) last_ten
FROM public.invoice i
JOIN LATERAL (
SELECT id, to_char(invoice_date, 'Mon-yy') AS invoice_month
FROM public.invoice
WHERE id = i.id
ORDER BY invoice_date, id ASC
LIMIT 10
) first ON i.id = first.id
JOIN LATERAL (
SELECT id, to_char(invoice_date, 'Mon-yy') AS invoice_month
FROM public.invoice
WHERE id = i.id
ORDER BY invoice_date, id DESC
LIMIT 10
) last on i.id = last.id
WHERE i.invoice_date BETWEEN date '2017-10-01' AND date '2018-09-30'
GROUP BY first.invoice_month, last.invoice_month;
This can be done with a recursive query that will generate the interval of months for which we need to find the first and last 10 invoices.
WITH RECURSIVE all_months AS (
SELECT date_trunc('month', '2018-01-01'::TIMESTAMP) as c_date,
       date_trunc('month', '2018-05-11'::TIMESTAMP) as end_date,
       to_char('2018-01-01'::TIMESTAMP, 'YYYY-MM') as current_month
UNION
SELECT c_date + interval '1 month' as c_date,
end_date,
to_char(c_date + INTERVAL '1 month', 'YYYY-MM') as current_month
FROM all_months
WHERE c_date + INTERVAL '1 month' <= end_date
),
invoices_with_month as (
SELECT *, to_char(invoice_date::TIMESTAMP, 'YYYY-MM') invoice_month FROM invoice
)
SELECT current_month, array_agg(first_10.id), 'FIRST 10' as type FROM all_months
JOIN LATERAL (
SELECT * FROM invoices_with_month
WHERE all_months.current_month = invoice_month AND invoice_date >= '2018-01-01' AND invoice_date <= '2018-05-11'
ORDER BY invoice_date ASC limit 10
) first_10 ON TRUE
GROUP BY current_month
UNION
SELECT current_month, array_agg(last_10.id), 'LAST 10' as type FROM all_months
JOIN LATERAL (
SELECT * FROM invoices_with_month
WHERE all_months.current_month = invoice_month AND invoice_date >= '2018-01-01' AND invoice_date <= '2018-05-11'
ORDER BY invoice_date DESC limit 10
) last_10 ON TRUE
GROUP BY current_month;
In the code above, '2018-01-01' and '2018-05-11' represent the dates between which we want to find the invoices. Based on those dates, we generate the months (2018-01, 2018-02, 2018-03, 2018-04, 2018-05) that we need to find the invoices for.
We store this data in all_months.
After we get the months, we do a lateral join in order to join the invoices for every month. We need 2 lateral joins in order to get the first and last 10 invoices.
Finally, the result is represented as:
current_month - the month
array_agg - ids of all selected invoices for that month
type - type of the selected invoices ('first 10' or 'last 10').
So in the current implementation, you will have 2 rows for each month (if there is at least 1 invoice for that month). You can easily combine them into one row per month if you need to.
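For example, if the UNION query above is wrapped in a CTE or view (here assumed to be called monthly_sets, with the array column aliased as ids), the two typed rows can be pivoted into one row per month with aggregate FILTER clauses (PostgreSQL 9.4+). Untested sketch:
SELECT current_month,
       MAX(ids) FILTER (WHERE type = 'FIRST 10') AS first_ten,  -- array of the first 10 ids
       MAX(ids) FILTER (WHERE type = 'LAST 10')  AS last_ten    -- array of the last 10 ids
FROM monthly_sets
GROUP BY current_month
ORDER BY current_month;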
LIMIT is working fine. It's your query that's broken. JOIN is just 100% the wrong tool here; it doesn't even do anything close to what you need. By joining up to 10 rows with up to another 10 rows, you get up to 100 rows back. There's also no reason to self join just to combine filters.
Consider instead window queries. In particular, we have the dense_rank function, which can number every row in the result set according to groups:
SELECT
invoice_month,
time_of_month,
ARRAY_AGG(id) invoice_ids
FROM (
SELECT
id,
invoice_month,
-- Categorize as end or beginning of month
CASE
WHEN month_rank <= 10 THEN 'beginning'
WHEN month_reverse_rank <= 10 THEN 'end'
ELSE 'bug' -- Should never happen. Just a fall back in case of a bug.
END AS time_of_month
FROM (
SELECT
id,
invoice_month,
dense_rank() OVER (PARTITION BY invoice_month ORDER BY invoice_date) month_rank,
dense_rank() OVER (PARTITION BY invoice_month ORDER BY invoice_date DESC) month_reverse_rank
FROM (
SELECT
id,
invoice_date,
to_char(invoice_date, 'Mon-yy') AS invoice_month
FROM public.invoice
WHERE invoice_date BETWEEN date '2017-10-01' AND date '2018-09-30'
) AS fiscal_year_invoices
) ranked_invoices
-- Get first and last 10
WHERE month_rank <= 10 OR month_reverse_rank <= 10
) first_and_last_by_month
GROUP BY
invoice_month,
time_of_month
Don't be intimidated by the length. This query is actually very straightforward; it just needed a few subqueries.
This is what it does logically:
Fetch the rows for the fiscal year in question
Assign a "rank" to the row within its month, both counting from the beginning and from the end
Filter out everything that doesn't rank in the 10 top for its month (counting from either direction)
Adds an indicator as to whether it was at the beginning or end of the month. (Note that if there are fewer than 20 rows in a month, more of them will be categorized as "beginning".)
Aggregate the IDs together
This is the tool set designed for the job you're trying to do. If really needed, you can adjust this approach slightly to get them into the same row, but you have to aggregate before joining the results together and then join on the month; you can't join and then aggregate.
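As a rough, untested sketch of that adjustment, reusing the dense_rank() idea: aggregate the first ten and the last ten per month in separate subqueries, then join the two aggregates on the month:
WITH ranked_invoices AS (
    SELECT id,
           to_char(invoice_date, 'Mon-yy') AS invoice_month,
           dense_rank() OVER (PARTITION BY to_char(invoice_date, 'Mon-yy')
                              ORDER BY invoice_date)      AS month_rank,
           dense_rank() OVER (PARTITION BY to_char(invoice_date, 'Mon-yy')
                              ORDER BY invoice_date DESC) AS month_reverse_rank
    FROM public.invoice
    WHERE invoice_date BETWEEN date '2017-10-01' AND date '2018-09-30'
)
SELECT invoice_month, f.first_ten, l.last_ten
FROM (
    SELECT invoice_month, array_agg(id) AS first_ten
    FROM ranked_invoices
    WHERE month_rank <= 10
    GROUP BY invoice_month
) f
FULL JOIN (
    SELECT invoice_month, array_agg(id) AS last_ten
    FROM ranked_invoices
    WHERE month_reverse_rank <= 10
    GROUP BY invoice_month
) l USING (invoice_month)
ORDER BY invoice_month;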

Calculating business days in Teradata

I need help with a business days calculation.
I've two tables
1) One table ACTUAL_TABLE containing order date and contact date with timestamp datatypes.
2) The second table BUSINESS_DATES has each of the calendar dates listed and has a flag to indicate weekend days.
Using these two tables, I need to calculate business days, not calendar days (which is the current logic), between these two fields.
My thought process was to first get a range of dates by comparing ORDER_DATE with the TABLE_DATE field and then do a similar comparison of CONTACT_DATE to the TABLE_DATE field. This would get me a range from the BUSINESS_DATES table which I can then use to calculate COUNT(*) of days and SUM(Holiday_WKND_Flag), making the result look like:
Order# | Count(*) As DAYS | SUM(WEEKEND DATES)
100 | 25 | 8
However this only works when I use a specific order number and I can't bring all order numbers into a sub query.
My Query:
SELECT SUM(Holiday_WKND_Flag), COUNT(*)
FROM
 (
   SELECT *
   FROM BUSINESS_DATES
   WHERE TABLE_DATE BETWEEN (SELECT ORDER_DATE
                             FROM ACTUAL_TABLE
                             WHERE ORDER# = '100')
                        AND (SELECT CONTACT_DATE
                             FROM ACTUAL_TABLE
                             WHERE ORDER# = '100')
 ) TEMP
Uploading the table structure for your reference.
SELECT ORDER#, SUM(Holiday_WKND_Flag), COUNT(*)
FROM business_dates bd
INNER JOIN actual_table at ON bd.table_date BETWEEN at.order_date AND at.contact_date
GROUP BY ORDER#
Instead of joining on a BETWEEN (which always results in a bad product join) followed by a COUNT, you are better off assigning a business day number to each date (ideally this is calculated only once and added as a column to your calendar table). Then it's two equi-joins and no aggregation is needed:
WITH cte AS
 (
   SELECT
      Cast(table_date AS DATE) AS table_date,
      -- assign a consecutive number to each business day, i.e. not increased during weekends, etc.
      Sum(CASE WHEN Holiday_WKND_Flag = 1 THEN 0 ELSE 1 END)
      Over (ORDER BY table_date
            ROWS Unbounded Preceding) AS business_day_nbr
   FROM business_dates
 )
SELECT ORDER#,
   Cast(t.contact_date AS DATE) - Cast(t.order_date AS DATE) AS #_of_days,
   b2.business_day_nbr - b1.business_day_nbr AS #_of_business_days
FROM actual_table AS t
JOIN cte AS b1
  ON Cast(t.order_date AS DATE) = b1.table_date
JOIN cte AS b2
  ON Cast(t.contact_date AS DATE) = b2.table_date
Btw, why are table_date and order_date timestamp instead of a date?
Porting from Oracle?
You can use this query. Hope it helps
select order#,
order_date,
contact_date,
(select count(1)
from business_dates_table
where table_date between a.order_date and a.contact_date
and holiday_wknd_flag = 0
) business_days
from actual_table a

Last day of the month with a twist in SQLPLUS

I would appreciate a little expert help please.
In an SQL SELECT statement I am trying to get the last day with data per month for the last year.
For example, I am easily able to get the last day of each month and join that to my data table, but the problem is that if the last day of the month has no data, then nothing is returned. What I need is for the SELECT to return the last day with data for each month.
This is probably easy to do, but to be honest, my brain fart is starting to hurt.
I've attached the select below that works for returning the data for only the last day of the month for the last 12 months.
Thanks in advance for your help!
SELECT fd.cust_id,fd.server_name,fd.instance_name,
TRUNC(fd.coll_date) AS coll_date,fd.column_name
FROM super_table fd,
(SELECT TRUNC(daterange,'MM')-1 first_of_month
FROM (
select TRUNC(sysdate-365,'MM') + level as DateRange
from dual
connect by level<=365)
GROUP BY TRUNC(daterange,'MM')) fom
WHERE fd.cust_id = :CUST_ID
AND fd.coll_date > SYSDATE-400
AND TRUNC(fd.coll_date) = fom.first_of_month
GROUP BY fd.cust_id,fd.server_name,fd.instance_name,
TRUNC(fd.coll_date),fd.column_name
ORDER BY fd.server_name,fd.instance_name,TRUNC(fd.coll_date)
You probably need to group your data so that each month's data is in the group, and then within the group select the maximum date present. The sub-query might be:
SELECT MAX(coll_date) AS last_day_of_month
FROM Super_Table AS fd
GROUP BY YEAR(coll_date) * 100 + MONTH(coll_date);
This presumes that the functions YEAR() and MONTH() exist to extract the year and month from a date as an integer value. Clearly, this doesn't constrain the range of dates - you can do that, too. If you don't have the functions in Oracle, then you do some sort of manipulation to get the equivalent result.
Using information from Rhose (thanks):
SELECT MAX(coll_date) AS last_day_of_month
FROM Super_Table AS fd
GROUP BY TO_CHAR(coll_date, 'YYYYMM');
This achieves the same net result, putting all dates from the same calendar month into a group and then determining the maximum value present within that group.
Here's another approach, if ANSI row_number() is supported:
with RevDayRanked(itemDate,rn) as (
select
cast(coll_date as date),
row_number() over (
partition by datediff(month,coll_date,'2000-01-01') -- rewrite datediff as needed for your platform
order by coll_date desc
)
from super_table
)
select itemDate
from RevDayRanked
where rn = 1;
Rows numbered 1 will be nondeterministically chosen among the rows on the last active date of the month, so you don't need DISTINCT. If you want information out of the table for all rows on those dates, use rank() ordered by the day instead of row_number() ordered by the exact coll_date values, so that the value 1 appears for every row on the last active date of the month, and select the additional columns you need:
with RevDayRanked(cust_id, server_name, coll_date, rk) as (
select
cust_id, server_name, coll_date,
rank() over (
partition by datediff(month,coll_date,'2000-01-01')
order by cast(coll_date as date) desc
)
from super_table
)
select cust_id, server_name, coll_date
from RevDayRanked
where rk = 1;
If row_number() and rank() aren't supported, another approach is this (for the second query above). Select all rows from your table for which there's no row in the table from a later day in the same month.
select
cust_id, server_name, coll_date
from super_table as ST1
where not exists (
select *
from super_table as ST2
where datediff(month,ST1.coll_date,ST2.coll_date) = 0
and cast(ST2.coll_date as date) > cast(ST1.coll_date as date)
)
If you have to do this kind of thing a lot, see if you can create an index over computed columns that hold cast(coll_date as date) and a month indicator like datediff(month,'2001-01-01',coll_date). That'll make more of the predicates SARGs.
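Purely as an illustration (untested, in the same SQL Server-style dialect as the datediff/cast syntax above; on Oracle the analogue would be function-based indexes), the computed columns and index might look like the following, with the month key written via YEAR/MONTH so the columns stay deterministic and indexable. The column and index names here are placeholders:
-- day-precision date and a numeric month key, e.g. 201710
ALTER TABLE super_table ADD coll_day AS CAST(coll_date AS date);
ALTER TABLE super_table ADD coll_month AS YEAR(coll_date) * 100 + MONTH(coll_date);
-- index the computed columns so month/day predicates become index seeks
CREATE INDEX ix_super_table_month_day ON super_table (coll_month, coll_day);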
Putting the above pieces together, would something like this work for you?
SELECT fd.cust_id,
fd.server_name,
fd.instance_name,
TRUNC(fd.coll_date) AS coll_date,
fd.column_name
FROM super_table fd
WHERE fd.cust_id = :CUST_ID
AND TRUNC(fd.coll_date) IN (
SELECT MAX(TRUNC(coll_date))
FROM super_table
WHERE coll_date > SYSDATE - 400
AND cust_id = :CUST_ID
GROUP BY TO_CHAR(coll_date,'YYYYMM')
)
GROUP BY fd.cust_id,fd.server_name,fd.instance_name,TRUNC(fd.coll_date),fd.column_name
ORDER BY fd.server_name,fd.instance_name,TRUNC(fd.coll_date)