PostgreSQL efficient count of number of rows WHERE parameter=TRUE - sql

I have a very massive PostgreSQL database table. One of its columns stores a boolean parameter. What I want is a very simple query:
SELECT COUNT(*) FROM myTable WHERE myTable.isSthTrue;
The problem I have is that it is very slow as it needs to check every row if it satisfies the criteria. Adding an index to this parameter speeds up the calculation roughly two times, but doesn't really improve the complexity in general.
Some people on the internet suggest adding triggers that update the count which is stored in a separate table. But that feels like too much effort for an easy problem like this.

If you need an exact result.
Then, yes, a trigger-based solution is likely the best path to go.
If an estimate is okay, consider Materialized Views. (Postgres 9.3+)
Something like CREATE MATERIALIZED VIEW myCount AS SELECT COUNT(*) FROM myTable WHERE myTable.isSthTrue; would maintain a copy of the expensive query you reference. The only caveat is that this aggregate view would not be automatically updated. To do that you need to call REFRESH MATERIALIZED VIEW, which could be done in a cron, or other timed task.

Related

Select Query VS View

Generally we are aggregating transaction data to one table using SQL Job periodically. Now the customer needs live data so 1 hour or hour is too high. Instead of running job in minutes, team suggest to create a view which includes the logic for selecting data from the transaction table itself.
But my confusion is any performance related difference between select query and View. Putting data in aggregate table on a periodica manner and select from this aggregation is easy because
1. It does not affect main tables
Direct select statement in direct transactions tables impact insert opertions as it frequently happening. If i create a view i think the concept is same as select statement. Now also we are selecting from the main tables. Or my understanding is wrong?
Please suggest the best approach for this and why
If i create a view i think the concept is same as select statement
Correct, a view in its simplest form is nothing more than saved query definition. When this view is used in a query the definition will be expanded out into the outer query and optimised accordingly. There is not performance benefit. This is not true of course for indexed views, where the view essentially becomes it's own table and using the NOEXPAND query hint will stop the definition being expanded, and will simply read from the view's index(es). Since this is an aggregation query though I suspect an indexed view won't even be possible, never mind a viable solution.
The next part about having a table to store the aggregation is more difficult to answer. Yes this could benefit performance, but at the cost of not having up to the minute data, and also having to maintain the table. Whether this is a suitable solution is entirely dependent on your needs, how up to date the data needs to be, how often it is required, how long it takes to (a) populate the reporting table (b) run the query on it's own.
For example if it takes 20 seconds to run the query, but it is only needed twice a day, then there is no point running this query every hour to maintain a reporting table to assist a query that is run twice a day.
Another option could be maintaining this reporting table through triggers, i.e. when a row is inserted/updated, cascade the change to the reporting table there and then, this would make the reporting table up to date, but again, if you are inserting millions of transactions a day and running the report a few times, you have to weigh up if the additional overhead cause by the trigger is worth it.
You could reduce the impact on write operations by using a transaction isolation level of READ UNCOMMITTED for the SELECT, but as with the summary table option, this comes at the cost of having up to the second accurate information, as you will be reading uncommitted transactions.
A final option could be an intermediate option, create a summary table daily, and partition/index your main table by date, then you can get the summary from historic data from the table created daily, and union this with only today's data which should be relatively fast with the right indexes.
A good approach is to create the view and profile your DB. Yes the best way to check if the view is a good solution is creating it and testing the results.
A similar question was asked here:
Is a view faster than a simple query?

Complexity comparison between temporary table + index creation vice multi-table group by without index

I have two potential roads to take on the following problem, the try it and see methodology won't pay off for this solution as the load on the server is constantly in flux. The two approaches I have are as follows:
select *
from
(
select foo.a,bar.b,baz.c
from foo,bar,baz
-- updated for clarity sake
where foo.a=b.bar
and b.bar=baz.c
)
group by a,b,c
vice
create table results as
select foo.a,bar.b,baz.c
from foo,bar,baz
where foo.a=b.bar
and b.bar=baz.c ;
create index results_spanning on results(a,b,c);
select * from results group by a,b,c;
So in case it isn't clear. The top query performs the group by outright against the multi-table select thus preventing me from using an index. The second query allows me to create a new table that stores the results of the query, proceeding to create a spanning index, then finishing the group by query to utilize the index.
What is the complexity difference of these two approaches, i.e. how do they scale and which is preferable in the case of large quantities of data. Also, the main issue is the performance of the overall select so that is what I am attempting to fix here.
Comments
Are you really doing a CROSS JOIN on three tables? Are those three
columns indexed in their own right? How often do you want to run the
query which delivers the end result?
1) No.
2) Yes, where clause omitted for the sake of discussion as this is clearly a super trivial example
3) Doesn't matter.
2nd Update
This is a temporary table as it is only valid for a brief moment in time, so yes this table will only be queried against one time.
If your query is executed frequently and unacceptably slow, you could look into creating materialized views to pre-compute the results. This gives you the benefit of an indexable "table", without the overhead of creating a table every time.
You'll need to refresh the materialized view (preferably fast if the tables are large) either on commit or on demand. There are some restrictions on how you can create on commit, fast refreshable views, and they will add to your commit time processing slightly, but they will always give the same result as running the base query. On demand MVs will become stale as the underlying data changes until these are refreshed. You'll need to determine whether this is acceptable or not.
So the question is, which is quicker?
Run a query once and sort the result set?
Run a query once to build a table, then build an index, then run the query again and sort the result set?
Hmmm. Tricky one.
The use cases for temporary tables are pretty rare in Oracle. They normally onlya apply when we need to freeze a result set which we are then going to query repeatedly. That is apparently not the case here.
So, take the first option and just tune the query if necessary.
The answer is, as is so often the case with tuning questions, it depends.
Why are you doing a GROUP BY in the first place. The query as you posted it doesn't do any aggregation so the only reason for doing GROUP BY woudl be to eliminate duplicate rows, i.e. a DISTINCT operation. If this is actually the case then you doing some form of cartesian join and one tuning the query would be to fix the WHERE clause so that it only returns discrete records.

I want to create a copy of a table, fewer columns than original... but still updated from it

I have a 30gb table, which has 30-40 columns. I create reports using this table and it causes performance problems. I just use 4-5 columns of this table for the reports. So that, I want to create a second table for the reports. But the second table must be updated when the original table is changed without using triggers.
No matter what my query is, When the query is executed, sql tries to cache all 30gb. When the cache is fully loaded, sql starts to use disk. Actually I want to aviod this
How can I do this?
Is there a way of doing this using ssis
thanks in advance
CREATE VIEW myView
AS
SELECT
column1,
column3,
column4 * column7
FROM
yourTable
A view is effectively just a stored query, like a macro. You can then select from that view as if it were a normal table.
Unless you go for matierialised views, it's not really a table, it's just a query. So it won't speed anything up, but it does encapsulate code and assist in controlling what data different users/logins can read.
If you are using SQL Server, what you want is an indexed view. Create a view using the column you want and then place an index on them.
An indexed view stores the data in the view. It should keep the view up-to-date with the underlying table, and it should reduce the I/O for reading the table. Note: this assumes that your 4-5 columns are much narrower than the overall table.
Dems answer with the view seems ideal, but if you are truly looking for a new table, create it and have it automatically updated with triggers.
Triggers placed on the primary table can be added for all Insert, Update and Delete actions upon it. When the action happens, the trigger fires and can be used to do additional function... such as update your new secondary table. You will pull from the Inserted and Deleted tables (MSDN)
There are many great existing articles here on triggers:
Article 1, Article 2, Google Search
You can create that second table just like you're thinking, and use triggers to update table 2 whenever table 1 is updated.
However, triggers present performance problems of their own; the speed of your inserts and updates will suffer. I would recommend looking for more conventional alternatives to improve query performance, which sounds like SQL Server since you mentioned SSIS.
Since it's only 4-5 out of 30 columns, have you tried adding an index which covers the query? I'm not sure if there are even more columns in your WHERE clause, but you should try that first. A covering index would actually do exactly what you're describing, since the table would never need to be touched by the query. Of course, this does cost a little in terms of space and insert/update performance. There's always a tradeoff.
On top of that, I can't believe that you would need to pull a large percentage of rows for any given report out of a 30 gb table. It's simply too much data for a report to have. A filtered index can improve query performance even more by only indexing the rows that are most likely to be asked for. If you have a report which lists the results for the past calendar month, you could add a condition to only index the rows WHERE report_date > '5/1/2012' for example.

SQL: using count() vs. keeping a separate field

I have a question regarding SQL performance and was hoping someone would have the answer.
I have the database table tbl_users and I want to get the total number of users I have. I could write it as SELECT COUNT(*) FROM tbl_users. I presume such query would have performance implications were I to have a handful of users vs. several millions of them. (So, assumption #1 is that the more rows I have, the more resources this query will consume).
In this particular case I need to run this query at a relatively high frequency and each time I need to get up-to-date data (so, caching is not an option).
Assuming my assmption #1 is correct, I then thought of structuring it the following way:
create tbl_stats with a field userCounter
each time there is an insert in tbl_users, userCounter is updated +1
each time I need to get my user count, I can pull that one field from tbl_stats
Now, I realize that by doing it this way, the data in userCounter is technically a duplicate, which is bad form.
So, will my first query (assuming millions of rows of data) consume that many resources to warrant me to implement my alternative design? If so (or if possibly yes), then is my alternative design consistent with best practices?
If your table is indexed, which it almost certainly will be, then the performance of select count(*) probably will not be as bad as you might anticipate - even if you have millions of rows.
But, if it does become a concern, then rather than roll your own solution, look into using an indexed view.
I have a database table with almost 5 million records, the following query returns in less than a second
select count(userID) from tblUsers
This query returns in 2 seconds
select count(*) from tblUsers
I'd personally just go with select count() rather than creating a duplicate field
I think this is one of those scenarios where you really need to measure the performance to make a good decision. I would wager that a simple COUNT() isn't going to create enough latency that you would need to implement your proposed work-around.
If you are worried I would encapsulate your COUNT() in a function or stored procedure so you can quickly swap it out later if performance does become a problem.
On some systems you can ask the system to maintain the counts for you. For example, in SQL Server you can have an indexed view on the count:
create view vwCountUsers
with schema binding
as
select count_big(*) as count
from dbo.tbl_users;
create clustered index cdxCountUsers on vwCountUsers (count);
The system will maintain the count for you and will always be available at nearly no cost.
If you have a desperate need and a real business case for up to the minute accurate counts, then the trigger would be the way to go. Just make sure it caters for all multi-user issues such as concurrency and transactions.
It could become a bottleneck because instead of 5 transactions being able to insert into a new table, they will queue up waiting to update the userCounter table, and you may even get deadlocks.
There are other options for less accurate counts, but if you want accurate then there are very few other choices, but I'll try to think of some:
You could partition the data and in userCounter store a count by day. If the data only gets added for the current day, do a select sum(dailycount) from counter + select count(*) from table where {date=today}
You could at least use the nolock or readpast options to lessen resource usage:
select * from tbl with (readpast)
select * from tbl with (nolock)
There are somethings it makes sense to precalculate for performace reasons (comlex calculations over years of data). That's why data warehouses exists much of the time to speed reporting. Select count(*) is generally not one of them if you have any indexing on the table at all. There are far worse performance problems to solve than that. I get 1 second to return the count on a table with 13 million rows.
I'm all about writing code that will is more likely to perform well than the alternative (avoiding correlated subqueries, using set-based operatiosn instead of cursors, having sargable where clauses), but this is a mirco optimization that should not be addressed until there is a real performance problem.

How do Views work in a DBM?

Say that I have two tables like those:
Employers (id, name, .... , deptId).
Depts(id, deptName, ...).
But Those data is not going to be modified so often and I want that a query like this
SELECT name, deptName FROM Employers, Depts
WHERE deptId = Depts.id AND Employers.id="ID"
be as faster as it can.
To my head comes two possible solutions:
Denormalize the table:
Despite that with this solution I will lose some of the great advantages of have "normalized databases, but here the performance is a MUST.
Create a View for that Denormalize data.
I will keep the Data Normalized and (here is my question), the performance of a query over that view will be faster that without that view.
Or another way to ask the same question, the View is "Interpreted" every time that you make a query over it, or how works the views Stuff in a DBA?.
Generally, unless you "materialize" a view, which is an option in some software like MS SQL Server, the view is just translated into queries against the base tables, and is therefore no faster or slower than the original (minus the minuscule amount of time it takes to translate the query, which is nothing compared to actually executing the query).
How do you know you've got performance problems? Are you profiling it under load? Have you verified that the performance bottleneck is these two tables? Generally, until you've got hard data, don't assume you know where performance problems come from, and don't spend any time optimizing until you know you're optimizing the right thing - 80% of the performance issues come from 20% of the code.
If Depts.ID is the primary key of that table, and you index the Employers.DeptID field, then this query should remain very fast even over millions of records.
Denormalizing doesn't make sense to me in that scenario.
Generally speaking, performance of a view will be almost exactly the same as performance when running the query itself. The advantage of a view is simply to abstract that query away, so you don't have to think about it.
You could use a Materialized View (or "snapshot" as some say), but then your data is only going to be as recent as your last refresh.
In a comment to one of the replies, the author of the question explains that he is looking for a way to create a materialized view in MySQL.
MySQL does not wrap the concept of the materialized view in a nice package for you like other DBMSes, but it does have all the tools you need to create one.
What you need to do is this:
Create the initial materialization of the result of your query.
Create a trigger on insert into the employers table that inserts into the materialized table all rows that match the newly inserted employer.
Create a trigger on delete in the employers table that deletes the corresponding rows from the materialized table.
Create a trigger on update in the employers table that updates the corresponding rows in the materialized table.
Same for the departments table.
This may work ok if your underlying tables are not frequently updated; but you need to be aware of the added cost of create/update/delete operations once you do this.
Also you'll want to make sure some DBA who doesn't know about your trickery doesn't go migrating the database without migrating the triggers, when time comes. So document it well.
Sounds like premature optimisation unless you know it is a clear and present problem.
MySQL does not materialise views, they are no faster than queries against the base tables. Moreover, in some cases they are slower as they get optimised less well.
But views also "hide" stuff from developers maintaining the code in the future to make them imagine that the query is less complex than it actually is.