Best way to deal with huge postgres database - sql

I've created a scraper that collects huge amounts of data to the Postgres database. One of the tables has more than 120 million records and still grows.
It creates obvious problems with even simple selects, but when I run aggregate
functions like COUNT(), it takes ages to get a result. I want to display this data using a web service, but it is definitely too slow to do it directly. I thought about materialized views, but even there if I run some more advanced query (query with subqueries to show trend) it throws an error with not enough memory, and if the query is simple, then it takes about an hour to complete. I am asking about general rules (I haven't managed to find any) with dealing with such huge databases.
The example queries which I use:
The simple query takes about an hour to complete (Items table have 120 million records, ItemTypes have about 30k - they keep the names and all information for the Items)
SELECT
IT."name",
COUNT("Items".id) AS item_count,
(CAST(COUNT("Items".id) AS DECIMAL(10,1))/(SELECT COUNT(id) FROM "Items"))*100 as percentage_of_all
FROM "Items" JOIN "ItemTypes" IT on "Items"."itemTypeId" = IT.id
GROUP BY IT."name"
ORDER BY item_count DESC;
When I run the above query with subquery which returns COUNT("Items".id) AS item_count % trend which is the count of them from a week ago compared to count now, it throws an error that memory was exceeded.
As I wrote above I am looking for tips, how to optimize it. The first thing I plan to optimize the above query is to move names from ItemTypes to Items, to Items. It won't be required to join ItemTypes anymore, but I already tried to mock it and the results aren't a lot better.

You don't need a subquery, so an equivalent version is:
SELECT IT."name",
COUNT(*) AS item_count,
COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () as percentage_of_all
FROM "Items" JOIN
"ItemTypes" IT
ON "Items"."itemTypeId" = IT.id
GROUP BY IT."name"
ORDER BY item_count DESC;
I'm not sure if this will fix your resource problem. In addition, this assumes that all items have a valid ItemType. If that is not the case, use a LEFT JOIN instead of JOIN.

Related

SUM(…) OVER (ORDER BY …) in view causes poor performance

I've got a view defined that lists transactions together with a running total, something like
CREATE VIEW historyView AS
SELECT
a.createdDate,
a.value,
m.memberId,
SUM(a.value) OVER (ORDER BY a.createdDate) as runningTotal,
...many more columns...
FROM allocations a
JOIN member m ON m.id = a.memberId
JOIN ...many joins...
The biggest tables this query looks at have ~10 million rows, but on average when the view is queried it will only return a few tens of rows.
My issue is that when this SELECT statement is run directly for a given member, it executes extremely quickly and returns results in a couple of milliseconds. However, when queried as a view...
SELECT h.createdDate, h.value, h.runningTotal
FROM historyView h
WHERE member.username = 'blah#blah.com'
...the performance is dreadful. The two query plans are very different - in the first case it is pretty much ideal but in the latter case, there are loads of scans and hundreds of thousands/millions of rows being read. This is clearly because the filter on member is being run last thing after everything else has been done, rather than right up front at the start.
If I remove the SUM(x) OVER (ORDER BY y) clause, this problem goes away.
Is there something I can do to ensure that the SUM(x) OVER (ORDER BY y) clause does not ruin the query plan?
One solution to my problem is to let the query optimiser know it is safe to filter before running the windowed function by PARTITION'ing by that property. The change to the view is:
CREATE VIEW historyView AS
SELECT
a.createdDate,
a.value,
m.memberId,
SUM(a.value) OVER (PARTITION BY m.username ORDER BY a.createdDate) as runningTotal,
...many more columns...
FROM allocations a
JOIN member m ON m.id = a.memberId
JOIN ...many joins...
Unfortunately this only creates the correct plan if filtering my member's username is part of the query.
That's because there's probably an index on m.username. When it comes to query tuning it takes some trial and error.
When using window functions there is the concept of 'POC' index to take into consideration - just search on google (Itzik Ben-Gan has good references about this as well).
From the book 'High-Performance T-SQL Using Window Functions':
Absent a POC index, the plan includes a Sort iterator, and with large input sets, it can be quite
expensive. Sorting has N * LOG(N) complexity, which is worse than linear. This means that with more
rows, you pay more per row. For example 1000 * LOG(1000) = 3000 and 10000 * LOG(10000) =
40000. This means that 10 times more rows results in 13 times more work, and it gets worse the further you go.
Here's a reference link to get started on window functions and indexes.

How to combine many single result queries into a single query with multiple results

I have currently have a table (post) with the following columns:
id, stock_code, posted_at
For a given stock code SC and time T1, I can retrieve the newest post after a certain time with something like
SELECT * FROM post WHERE stock_code = SC AND time > T1 ORDER BY time asc LIMIT 1; (not actually tested, but you get the gist)
However, I want to get that result for a set of multiple stocks (or even for every distinct stock code in the table). I could simply run this query multiple times, however that quickly becomes inefficient, and it would be best to combine into one SQL query, however I can't wrap my head around how to do that. I would like each row to be the newest post after a certain time for a given stock, and have one row for each stock. How do I go about doing this?
P.S. Using Postgres 9.4.8, and SqlAlchemy on the python side. Would be happy with just SQL, however if there is some SqlAlchemy magic to get to the same result that would be awesome.
Use distinct on:
SELECT DISTINCT ON (stock_code) p.*
FROM post p
WHERE p.stock_code = 'SC' AND p.time > T1
ORDER BY p.stock_code, time asc;
Obviously, with the WHERE clause, this will return one row. You can remove the p.stock_code = 'SC' and get one row per stock_code.
Use union or union all to group results from may queries in one.

Time based accumulation based on type: Speed considerations in SQL

Based on surfing the web, I came up with two methods of counting the records in a table "Table1". The counter field increments according to a date field "TheDate". It does this by summing records with an older TheDate value. Furthermore, records with different values for the compound field (Field1,Field2) are counted using separate counters. Field3 is just an informational field that is included for added awareness and does not affect the counting or how records are grouped for counting.
Method 1: Use corrrelated subquery
SELECT MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate,
(
SELECT SUM(1) FROM Table1 InnerQuery
WHERE InnerQuery.Field1 = MainQuery.Field1 AND
InnerQuery.Field2 = MainQuery.Field2 AND
InnerQuery.TheDate <= MainQuery.TheDate
) AS RunningCounter
FROM Table1 MainQuery
ORDER BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.TheDate,
MainQuery.Field3
Method 2: Use join and group-by
SELECT MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate,
SUM(1) AS RunningCounter
FROM Table1 MainQuery INNER JOIN Table1 InnerQuery
ON InnerQuery.Field1 = MainQuery.Field1 AND
InnerQuery.Field2 = MainQuery.Field2 AND
InnerQuery.TheDate <= MainQuery.TheDate
GROUP BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate
ORDER BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.TheDate,
MainQuery.Field3
There is no inner query per se in Method 2, but I use the table alias InnerQuery so that a ready parellel with Method 1 can be drawn. The role is the same; the 2nd instance of Table 1 is for accumulating the counts of the records which have TheDate less than that of any record in MainQuery (1st instance of Table 1) with the same Field1 and Field2 values.
Note that in Method 2, Field 3 is include in the Group-By clause even though I said that it does not affect how the records are grouped for counting. This is still true, since the counting is done using the matching records in InnerQuery, whereas the GROUP By applies to Field 3 in MainQuery.
I found that Method 1 is noticably faster. I'm surprised by this because it uses a correlated subquery. The way I think of a correlated subquery is that it is executed for each record in MainQuery (whether or not that is done in practice after optimization). On the other hand, Method 2 doesn't run an inner query over and over again. However, the inner join still has multiple records in InnerQuery matching each record in MainQuery, so in a sense, it deals with a similar order of complexity.
Is there a decent intuitive explanation for this speed difference, as well as best practice or considerations in choosing an approach for time-base accumulation?
I've posted this to
Microsoft Answers
Stack Exchange
In fact, I think the easiest way is to do this:
SELECT MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate,
COUNT(*)
FROM Table1 MainQuery
GROUP BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate
ORDER BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.TheDate,
MainQuery.Field3
(The order by isn't required to get the same data, just to order it. In other words, removing it will not change the number or contents of each row returned, just the order in which they are returned.)
You only need to specify the table once. Doing a self-join (joining a table to itself as both your queries do) is not required. The performance of your two queries will depend on a whole load of things which I don't know - what the primary keys are, the number of rows, how much memory is available, and so on.
First, your experience makes a lot of sense. I'm not sure why you need more intuition. I imagine you learned, somewhere along the way, that correlated subqueries are evil. Well, as with some of the things we teach kids as being really bad ("don't cross the street when the walk sign is not green") turn out to be not so bad, the same is true of correlated subqueries.
The easiest intuition is that the uncorrelated subquery has to aggregate all the data in the table. The correlated version only has to aggregate matching fields, although it has to do this over and over.
To put numbers to it, say you have 1,000 rows with 10 rows per group. The output is 100 rows. The first version does 100 aggregations of 10 rows each. The second does one aggregation of 1,000 rows. Well, aggregation generally scales in a super-linear fashion (O(n log n), technically). That means that 100 aggregations of 10 records takes less time than 1 aggregation of 1000 records.
You asked for intuition, so the above is to provide some intuition. There are a zillion caveats that go both ways. For instance, the correlated subquery might be able to make better use of indexes for the aggregation. And, the two queries are not equivalent, because the correct join would be LEFT JOIN.
Actually, I was wrong in my original post. The inner join is way, way faster than the correlated subquery. However, the correlated subquery is able to display its results records as they are generated, so it appears faster.
As a side curiosity, I'm finding that if the correlated sub-query approach is modified to use sum(-1) instead of sum(1), the number of returned records seems to vary from N-3 to N (where N is the correct number, i.e., the number of records in Table1). I'm not sure if this is due to some misbehaviour in Access's rush to display initial records or what-not.
While it seems that the INNER JOIN wins hands-down, there is a major insidious caveat. If the GROUP BY fields do not uniquely distinguish each record in Table1, then you will not get an individual SUM for each record of Table1. Imagine that a particular combination of GROUP BY field values matching (say) THREE records in Table1. You will then get a single SUM for all of them. The problem is, each of these 3 records in MainQuery also matches all 3 of the same records in InnerQuery, so those instances in InnerQuery get counted multiple times. Very insidious (I find).
So it seems that the sub-query may be the way to go, which is awfully disturbing in view of the above problem with repeatability (2nd paragraph above). That is a serious problem that should send shivers down any spine. Another possible solution that I'm looking at is to turn MainQuery into a subquery by SELECTing the fields of interest and DISTINCTifying them before INNER JOINing the result with InnerQuery.

Derived Tables for summary statistics

I often find myself running a query to get the number of people who meet a certain criteria, the total number of people in that population and the finding the percentage that meets that criteria. I've been doing it for the same way for a while and I was wondering what SO would do to solve the same type of problem. Below is how I wrote the query:
select m.state_cd
,m.injurylevel
,COUNT(distinct m.patid) as pplOnRx
,x.totalPatientsPerState
,round((COUNT(distinct m.patid) /cast(x.totalPatientsPerState as float))*100,2) as percentPrescribedNarcotics
from members as m
inner join rx on rx.patid=m.PATID
inner join DrugTable as dt on dt.drugClass=rx.drugClass
inner join
(
select m2.state_cd, m2.injurylevel, COUNT(distinct m2.patid) as totalPatientsPerState
from members as m2
inner join rx on rx.patid=m2.PATID
group by m2.STATE_CD,m2.injuryLevel
) x on x.state_cd=m.state_cd and m.injuryLevel=x.injurylevel
where drugText like '%narcotics%'
group by m.state_cd,m.injurylevel,x.totalPatientsPerState
order by m.STATE_CD,m.injuryLevel
In this example not everyone who appears in the members table is in the rx table. The derived table makes sure that everyone whose in rx is also in members without the condition of drugText like narcotics. From what little I've played with it it seems that the over(partition by clause might work here. I have no idea if it does, just seems like it to me. How would someone else go about tackling this problem?
results:
This is exactly what MDX and SSAS is designed to do. If you insist on doing it in SQL (nothing wrong with that), are you asking for a way to do it with better performance? In that case, it would depend on how the tables are indexed, tempdb speed, and if the tables are partitioned, then that too.
Also, the distinct count is going to be one of larger performance hits. The like '%narcotics%' in the predicate is going to force a full table scan and should be avoided at all costs (can this be an integer key in the data model?)
To answer your question, not really sure windowing (over partition by) is going to perform any better. I would test it and see, but there is nothing "wrong" with the query.
You could rewrite the count distinct's as virtual tables or temp tables with group by's or a combination of those two.
To illustrate, this is a stub for windowing that you could grow into the same query:
select a.state_cd,a.injurylevel,a.totalpatid, count(*) over (partition by a.state_cd, a.injurylevel)
from
(select state_cd,injurylevel,count(*) as totalpatid, count(distinct patid) as patid
from
#members
group by state_cd,injurylevel
) a
see what I mean about not really being that helpful? Then again, sometimes rewriting a query slightly can improve performance by selecting a better execution plan, but rather then taking stabs in the dark, I'd first find the bottlenecks in the query you have, since you already took the time to write it.

Sub-query Optimization Talk with an example case

I need advises and want to share my experience about Query Optimization. This week, I found myself stuck in an interesting dilemma.
I'm a novice person in mySql (2 years theory, less than one practical)
Environment :
I have a table that contains articles with a column 'type', and another table article_version that contain a date where an article is added in the DB, and a third table that contains all the article types along with types label and stuffs...
The 2 first tables are huge (800000+ fields and growing daily), the 3rd one is naturally small sized. The article tables have a lot of column, but we will only need 'ID' and 'type' in articles and 'dateAdded' in article_version to simplify things...
What I want to do :
A Query that, for a specified 'dateAdded', returns the number of articles for each types (there is ~ 50 types to scan).
What was already in place is 50 separate count, one for each document types oO ( not efficient, long(~ 5sec in general), ).
I wanted to do it all in one query and I came up with that :
SELECT type,
(SELECT COUNT(DISTINCT articles.ID)
FROM articles
INNER JOIN article_version
ON article_version.ARTI_ID = legi_arti.ID
WHERE type = td.NEW_ID
AND dateAdded = '2009-01-01 00:00:00') AS nbrArti
FROM type_document td
WHERE td.NEW_ID != ''
GROUP BY td.NEW_ID;
The external select (type_document) allow me to get the 55 types of documents I need.
The sub-Query is counting the articles for each type_document for the given date '2009-01-01'.
A common result is like :
* type * nbrArti *
*************************
* 123456 * 23 *
* 789456 * 5 *
* 16578 * 98 *
* .... * .... *
* .... * .... *
*************************
This query get the job done, but the join in the sub-query is making this extremely slow, The reason, if I'm right, is that a join is made by the server for each types, so 50+ times, this solution is even more slower than doing the 50 queries independently for each types, awesome :/
A Solution
I came up with a solution myself that drastically improve the performance with the same result, I just created a view corresponding to the subQuery, making the join on ids for each types... And Boom, it's f.a.s.t.
I think, correct me if I'm wrong, that the reason is the server only runs the JOIN statement once.
This solution is ~5 time faster than the solution that was already there, and ~20 times faster than my first attempt. Sweet
Questions / thoughts
With yet another view, I'll now need to check if I don't loose more than win when documents get inserted...
Is there a way to improve the original Query, by getting the JOIN statement out of the sub-query? (And getting rid of the view)
Any other tips/thoughts? (In Server Optimizing for example...)
Apologies for my approximating English, it'is not my primary language.
You cannot create a single index on (type, date_added), because these fields are in different tables.
Without the view, the subquery most probably selects article as a leading table and the index on type which is not very selective.
By creating the view, you force the subquery to calculate the sums for all types first (using a selective the index on date) and then use a JOIN BUFFER (which is fast enough for only 55 types).
You can achieve similar results by rewriting your query as this:
SELECT new_id, COALESCE(cnt, 0) AS cnt
FROM type_document td
LEFT JOIN
(
SELECT type, COUNT(DISTINCT article_id) AS cnt
FROM article_versions av
JOIN articles a
ON a.id = av.article_id
WHERE av.date = '2009-01-01 00:00:00'
GROUP BY
type
) q
ON q.type = td.new_id
Unfortunately, MySQL is not able to do table spools or hash joins, so to improve the performance you'll need to denormalize your tables: add type to article_version and create a composite index on (date, type).