I am trying to aggregate values from two relations (buys and uses) by time period and join them, so that I get the results in one report along with a ratio between them. I am using PostgreSQL. The report required is: dateTime, u.sum, b.sum, b.sum/u.sum
The following query works but scales very poorly with larger table sizes.
SELECT b2.datetime AS dateTime, b2.sum AS BUY_VOLUME, u1.sum AS USE_VOLUME,
       CASE u1.sum
         WHEN 0 THEN 0
         ELSE (b2.sum / u1.sum)
       END AS buyToUseRatio
FROM (SELECT SUM(b.total / 100.0) AS sum,
             date_trunc('week', (b.datetime + INTERVAL '1 day')) - INTERVAL '1 day' AS datetime
      FROM buys AS b
      WHERE datetime > date_trunc('month', CURRENT_DATE) - INTERVAL '1 year'
      GROUP BY datetime) AS b2
INNER JOIN (SELECT SUM(u.amount) / 100.00 AS sum,
                   date_trunc('week', (u.datetime + INTERVAL '1 day')) - INTERVAL '1 day' AS datetime
            FROM uses AS u
            WHERE datetime > date_trunc('month', CURRENT_DATE) - INTERVAL '1 year'
            GROUP BY datetime) AS u1 ON b2.datetime = u1.datetime
ORDER BY b2.datetime ASC;
I was wondering if anyone could help me by providing an alternative query that gets the required end result and is faster to execute.
I appreciate any help on this :-) My junior-level SQL is a little rusty, and I can't think of another way of doing this without creating indexes. Thanks in advance.
At the least, these indexes should help your query:
create index idx_buys_datetime on buys(datetime);
create index idx_uses_datetime on uses(datetime);
Your query seems fine. However, you could use a full join (instead of an inner join) to get all rows where at least one of your tables has data. You could even use generate_series() to always have one year of results, even when there is no data in either of your tables, but I'm not sure if that's what you need. Also, some things can be written more simply; your query could look like this:
select dt, buy_volume, use_volume, buy_volume / nullif(use_volume, 0.0) buy_to_use_ratio
from (select sum(total / 100.0) buy_volume, date_trunc('week', (datetime + interval '1 day')) - interval '1 day' dt
from buys
where datetime > date_trunc('month', current_timestamp - interval '1 year')
group by 2) b
full join (select sum(amount) / 100.0 use_volume, date_trunc('week', (datetime + interval '1 day')) - interval '1 day' dt
from uses
where datetime > date_trunc('month', current_timestamp - interval '1 year')
group by 2) u using (dt)
order by 1
http://rextester.com/YVASV92568
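If the generate_series() variant mentioned above is what you need, a rough sketch could look like this (the series bounds mirror the month/week alignment used in the aggregates and may need adjusting):
select gs.dt,
       b.buy_volume,
       u.use_volume,
       b.buy_volume / nullif(u.use_volume, 0.0) as buy_to_use_ratio
from generate_series(
       -- same "+1 day ... -1 day" Sunday-start shift as the weekly buckets below
       date_trunc('week', date_trunc('month', current_timestamp - interval '1 year') + interval '1 day') - interval '1 day',
       date_trunc('week', current_timestamp + interval '1 day') - interval '1 day',
       interval '1 week') as gs(dt)
left join (select sum(total / 100.0) as buy_volume,
                  date_trunc('week', datetime + interval '1 day') - interval '1 day' as dt
           from buys
           where datetime > date_trunc('month', current_timestamp - interval '1 year')
           group by 2) b on b.dt = gs.dt
left join (select sum(amount) / 100.0 as use_volume,
                  date_trunc('week', datetime + interval '1 day') - interval '1 day' as dt
           from uses
           where datetime > date_trunc('month', current_timestamp - interval '1 year')
           group by 2) u on u.dt = gs.dt
order by gs.dt;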
So the answer depends on how large your tables are, but if it were me, I would create one or two new "summary" tables based on your query and make sure to keep them updated (run a batch job once a day, or once an hour with all the data that has changed recently).
Then I would query those summary tables instead, which would be much faster.
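As a sketch of that idea, a materialized view per relation would do; the name weekly_buys is illustrative, not from the question:
create materialized view weekly_buys as
select date_trunc('week', datetime + interval '1 day') - interval '1 day' as dt,
       sum(total / 100.0) as buy_volume
from buys
group by 1;

create index on weekly_buys (dt);

-- run this from the batch job (hourly or nightly):
refresh materialized view weekly_buys;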
If, however, your tables are very small, then just keep going the way you are and play around with indexes until you get timings that are acceptable.
Related
It sounds simple, and it should be simple, but for some reason I can't seem to make it happen in Netezza. So far I tried:
select *
from table
where placed_dt >= DATEADD(YEAR, DATEDIFF(YEAR, 0, GETDATE()) - 1, 0);
and it looked like the DATEADD function doesn't work in Netezza. So I tried:
select *
from table
where placed_dt between (current_date - interval 1 year) and current_date
but still had no luck. Any help would be appreciated!
If you want the last year from the current date:
where placed_dt >= current_date - interval '1 year'
Note that the single quotes are needed.
You can also include placed_dt <= current_date if an upper bound is needed.
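Putting both bounds together, a minimal example (yourTable stands in for your table name, since table itself is reserved):
SELECT *
FROM yourTable
WHERE placed_dt >= current_date - interval '1 year'
  AND placed_dt <= current_date;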
If you want the last calendar year, there are various methods, but one is:
where date_trunc('year', placed_dt) = date_trunc('year', current_date) - interval '1 year'
You may try:
SELECT *
FROM yourTable
WHERE
placed_dt >= ADD_MONTHS(DATE_TRUNC('year', current_timestamp), -12) AND
placed_dt < DATE_TRUNC('year', current_timestamp);
In the above inequality in the WHERE clause, for a current year of 2020, the lower bound represents 2019-01-01 and the upper bound represents 2020-01-01.
In my PostgreSQL database I have an invitations table with the following fields:
Invitations: id, created_at, completed_at (timestamp)
I am working to write a PostgreSQL query that returns the number of records completed between 2-7 days of the creation date.
Here is what I have so far:
SELECT round(count(i.completed_at <= i.created_at + interval '7 day'
                   and i.completed_at > i.created_at + interval '1 day')::decimal
             / count(DISTINCT i.id), 2) * 100 AS "CR in D2-D7"
FROM invitations i
The select statement is not returning the correct value. What am I doing wrong?
The expression you are feeding to the first COUNT yields a boolean and never a NULL (unless its inputs are NULL). But COUNT counts all of its non-NULL input, so whether the expression returns true or false, the count still increments. There are many ways to fix this; a simple one (probably not the best, just the smallest change to what you already have) would be to use nullif to convert false to NULL inside the first COUNT.
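For instance, that minimal nullif variant would look like this (only the first COUNT changes from your original):
SELECT round(count(nullif(i.completed_at <= i.created_at + interval '7 day'
                          and i.completed_at > i.created_at + interval '1 day', false))::decimal
             / count(DISTINCT i.id), 2) * 100 AS "CR in D2-D7"
FROM invitations i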
But even then, is this correct? It seems odd that one COUNT has a DISTINCT and the other does not.
So a more complete solution may be something like:
SELECT
  round(
    count(DISTINCT i.id) FILTER (WHERE i.completed_at <= i.created_at + interval '7 day'
                                   AND i.completed_at > i.created_at + interval '1 day')::decimal
    / count(DISTINCT i.id),
  2) * 100 AS "CR in D2-D7"
FROM invitations i
To get the matching records themselves, just do this:
SELECT *
FROM invitations i
WHERE i.completed_at <= i.created_at + interval '7 day'
  AND i.completed_at > i.created_at + interval '1 day'
Here is another way:
SELECT COUNT(*)
FROM invitations i
WHERE completed_at BETWEEN (created_at + '2 days'::interval) AND (created_at + '7 days'::interval);
For large datasets, which option is better: multiple selects or CASE?
CASE EXAMPLE:
SELECT SUM(CASE WHEN(created_at > (CURRENT_DATE - INTERVAL '1 days')) THEN 1 ELSE 0 END) as day_count,
SUM(CASE WHEN(created_at > (CURRENT_DATE - INTERVAL '1 months')) THEN 1 ELSE 0 END) as month_count,
SUM(CASE WHEN(created_at > (CURRENT_DATE - INTERVAL '3 months')) THEN 1 ELSE 0 END) as quarter_count,
SUM(CASE WHEN(created_at > (CURRENT_DATE - INTERVAL '6 months')) THEN 1 ELSE 0 END) as half_year_count,
SUM(CASE WHEN(created_at > (CURRENT_DATE - INTERVAL '1 years')) THEN 1 ELSE 0 END) as year_count,
count(*) as total_count from wallets;
Multiple Select Query:
SELECT count(*) from wallets where created_at > CURRENT_DATE - INTERVAL '1 days';
SELECT count(*) from wallets where created_at > CURRENT_DATE - INTERVAL '1 months';
SELECT count(*) from wallets where created_at > CURRENT_DATE - INTERVAL '3 months';
SELECT count(*) from wallets where created_at > CURRENT_DATE - INTERVAL '6 months';
SELECT count(*) from wallets where created_at > CURRENT_DATE - INTERVAL '1 years';
SELECT count(*) from wallets;
The requirement is to find wallet counts by day, month, 3 months, 6 months, and year.
If I go with multiple selects, then 6 queries are needed to fetch the data.
Using CASE we can get the data in a single query, but I am not sure it is best practice to use CASE like this on large datasets.
Please find the query analysis below; note that I have only 10 records in my DB:
[Case query analysis]
[Multiple query analysis]
The single query is going to be better. You will get an improvement in performance using FILTER:
SELECT COUNT(*) FILTER (WHERE created_at > (CURRENT_DATE - INTERVAL '1 days')) as day_count,
COUNT(*) FILTER (WHERE created_at > (CURRENT_DATE - INTERVAL '1 months')) as month_count,
COUNT(*) FILTER (WHERE created_at > (CURRENT_DATE - INTERVAL '3 months')) as quarter_count,
COUNT(*) FILTER (WHERE created_at > (CURRENT_DATE - INTERVAL '6 months')) as half_year_count,
COUNT(*) FILTER (WHERE created_at > (CURRENT_DATE - INTERVAL '1 years')) as year_count,
COUNT(*) as total_count
FROM wallets;
If you have an index on created_at, this should also help Postgres optimize the query to use only that index.
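If that index doesn't exist yet, creating it is a one-liner (the index name is illustrative):
CREATE INDEX idx_wallets_created_at ON wallets (created_at);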
I can make an educated guess, but without actual data, tests are of little use.
The multiple-select approach is easier for the database planner to optimise. PostgreSQL 9.6+ can use index-only scans for this, so it may end up as a few very fast queries.
The CASE example is very hard to read. I'm afraid no index can serve it, so the query will be forced to scan the whole table, which could be a horribly slow operation.
From the knex point of view, the difference is that multiple queries can be sent to the DB through separate connections and executed in parallel. Doing just a single query is probably more performant overall, causing less stress / data-transfer overhead on the DB server.
The biggest drawback of the first way of doing the query is that you cannot build it nicely with knex, and it looks horrible to anyone who reads the code.
A better way to achieve this kind of packing of multiple queries into a single one is to use the Postgres WITH statement (common table expressions) https://www.postgresql.org/docs/9.6/static/queries-with.html which knex also supports http://knexjs.org/#Builder-with
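In plain SQL, that packing could look roughly like this (a sketch showing only two of the periods; the rest follow the same pattern):
with day_c as (
  select count(*) as n from wallets
  where created_at > current_date - interval '1 day'
),
month_c as (
  select count(*) as n from wallets
  where created_at > current_date - interval '1 month'
)
select day_c.n as day_count,
       month_c.n as month_count
from day_c, month_c;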
EDIT: or just do multiple queries in a single select, somewhat like Gordon Linoff suggested:
knex
.select(
knex('wallets')
.where('created_at', '>', knex.raw("CURRENT_DATE - INTERVAL '1 days'"))
.count()
.as('lastDay'),
knex('wallets')
.where('created_at', '>', knex.raw("CURRENT_DATE - INTERVAL '1 months'"))
.count()
.as('lastMonth'),
... rest of the queries ...
);
https://runkit.com/embed/wsy01ar1hb73
PostgreSQL should be able to optimize the multiple subqueries to be executed with a plan similar to the one Gordon's answer promotes.
I have a table of 20,000 records. Each record has a datetime field. I want to select all records where the gap between one record and the subsequent record is more than one hour (the condition is to be applied to the datetime field).
Can anyone give me the SQL code for this purpose?
Regards,
KAM
ANSI SQL supports the lead() function, but date/time functions vary by database. The following is the logic you want, although the exact syntax depends on the database:
select t.*
from (select t.*,
lead(datetimefield) over (order by datetimefield) as next_datetimefield
from t
) t
where datetimefield + interval '1 hour' < next_datetimefield;
Note: In Teradata, the where would be:
where datetimefield + interval '1' hour < next_datetimefield;
This can also be done with a subquery, which should work on all DBMSes. As Gordon said, date/time functions differ in each one.
SELECT t.*
FROM YourTable t
WHERE t.DateCol + interval '1 hour' < (SELECT min(s.DateCol)
                                       FROM YourTable s
                                       WHERE s.DateCol > t.DateCol)
You can replace this:
t.DateCol + interval '1 hour'
With one of these so it will work on almost every DBMS:
DATE_ADD(t.DateCol, INTERVAL 1 HOUR) -- e.g. MySQL
DATEADD(hour, 1, t.DateCol) -- e.g. SQL Server
Although Teradata doesn't support Standard SQL's LEAD, it's easy to rewrite:
select tab.*,
min(ts) over (order by ts rows between 1 following and 1 following) as next_ts
from tab
qualify
ts < next_ts - interval '1' hour
If you don't need to show the next timestamp:
select *
from tab
qualify
ts < min(ts) over (order by ts rows between 1 following and 1 following) - interval '1' hour
QUALIFY is a Teradata extension, but it is really nice to have: it filters on window-function results, similar to how HAVING filters after GROUP BY.
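For comparison, the same filter without QUALIFY needs a derived table, roughly:
select *
from (select tab.*,
             min(ts) over (order by ts rows between 1 following and 1 following) as next_ts
      from tab
     ) t
where ts < next_ts - interval '1' hour;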
I'd like to select all the users in our database that put their accounts on hold the day they created them. Users can only go on hold for 1 month, so the logic would be selecting users where....
day(hold_until - 1 month) = day(signup)
How can I achieve this in SQL?
Try:
SELECT *
FROM TheTable
WHERE date(signup) = date(hold_until - INTERVAL '1 month')
It depends on the RDBMS you're using, but assuming this is PostgreSQL, you can try this:
...where
date_trunc('day', hold_until - interval '1 month') = date_trunc('day', signup)
Assuming you have timestamp columns:
... WHERE (hold_until - interval '1 month')::date = signup::date