select count(*) vs keep a counter - sql

Assuming indexes are in place, and absolute count accuracy is not necessary (it's okay to be off by one or two), is it okay to use:
Option A
select count(*)
from Table
where Property = @Property
vs
Option B
update PropertyCounters
SET PropertyCount = PropertyCount + 1
where Property = @Property
then doing:
select PropertyCount
from PropertyCounters
where Property = @Property
How much performance degradation can I reasonably expect from doing select count(*) as the table grows into thousands/millions of records?

Keeping a separate count column in addition to the real data is a denormalisation. There are reasons why you might need to do it for performance, but you shouldn't go there until you really need to. It makes your code more complicated, with more chance of inconsistencies creeping in.
For the simple case where the query really is just SELECT COUNT(property) FROM table WHERE property=..., there's no reason to denormalise; you can make that fast by adding an index on the property column.
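For example, something like this (using the question's placeholder names; brackets because TABLE is a reserved word) lets the engine answer the count with a cheap index seek and range scan instead of reading the whole table:
-- "Table" and "Property" are the question's placeholders
create index IX_Table_Property on [Table] (Property);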

You didn't specify the platform, but since you use T-SQL syntax for @variables I'll venture a SQL Server-specific answer:
count(*), or strictly speaking count_big(*), is an expression that can be used in indexed views; see Designing Indexed Views.
create view vwCounts
with schemabinding
as select Property, count_big(*) as [Count]
from dbo.Table
group by Property;
create unique clustered index cdxCounts on vwCounts(Property);
select [Count]
from vwCounts with (noexpand)
where Property = @property;
On Enterprise Edition the optimizer will even use the indexed view for your original query:
select count_big(*)
from Table
where Property = @property;
So in the end you get to have your cake and eat it too: the count per property is already aggregated and maintained for you, for free, by the engine. The price is that updates have to maintain the indexed view (they will not recompute the aggregate count from scratch, though) and the aggregation will create hot spots for contention (locks on separate rows of Table will contend for the same count row update on the indexed view).

If you say that you do not need absolute accuracy, then Option B is a strange approach. If Option A becomes too heavy (even after adding indexes), you can cache the output of Option A in memory or in another table (your PropertyCounters), and periodically refresh it.
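As a rough sketch of such a periodic refresh, assuming the PropertyCounters table from the question and a scheduled job to run it (the MERGE approach is just one option):
-- rebuild the cached counts from the source table
merge PropertyCounters as target
using (select Property, count_big(*) as cnt
       from [Table] group by Property) as src
    on target.Property = src.Property
when matched then
    update set PropertyCount = src.cnt
when not matched then
    insert (Property, PropertyCount) values (src.Property, src.cnt);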

This isn't something that can be answered in general SQL terms. Quite apart from the normal caveats about indices and so on affecting queries, it's also something where there is considerable difference between platforms.
I'd bet on better performance on this from SQL Server than Postgres, to the point where I'd consider the latter approach sooner on Postgres and not on SQL Server. However, with a partial index set up just right for matching the criteria, I'd bet on Postgres beating out SQL Server. That's just what I'd bet small winnings on, though; either way, I'd test if I needed to think about it for real.
If you do go for the latter approach, enforce it with a trigger or similar, so that the count can't become inaccurate.
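A minimal sketch of such a trigger on SQL Server, assuming the question's table names (it ignores brand-new Property values and updates that change Property, which a production version would also have to handle):
create trigger trg_Table_MaintainCount on [Table]
after insert, delete
as
begin
    set nocount on;
    -- add counts for newly inserted rows, per property value
    update pc
    set PropertyCount = PropertyCount + i.cnt
    from PropertyCounters pc
    join (select Property, count(*) as cnt from inserted group by Property) i
        on pc.Property = i.Property;
    -- subtract counts for deleted rows, per property value
    update pc
    set PropertyCount = PropertyCount - d.cnt
    from PropertyCounters pc
    join (select Property, count(*) as cnt from deleted group by Property) d
        on pc.Property = d.Property;
end;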

On SQL Server, if you don't need absolutely accurate counts, you could also inspect the catalog views. This would be much easier to do - you don't have to keep a count yourself - and it's a lot less taxing on the system. After all, if you need to count all the rows in a table, you need to scan that table, one way or another - no way around that.
With this SQL statement here, you'll get all the tables in your database, and their row counts, as kept by SQL Server:
SELECT
    t.NAME AS TableName,
    SUM(p.rows) AS RowCounts
FROM sys.tables t
INNER JOIN sys.indexes i
    ON t.OBJECT_ID = i.object_id
INNER JOIN sys.partitions p
    ON i.object_id = p.OBJECT_ID
    AND i.index_id = p.index_id
WHERE
    t.NAME NOT LIKE 'dt%'   -- skip old system diagram tables
    AND i.OBJECT_ID > 255   -- skip system objects
    AND i.index_id <= 1     -- heap (0) or clustered index (1) only
GROUP BY
    t.NAME, i.object_id, i.index_id, i.name
ORDER BY
    OBJECT_NAME(i.object_id)
I couldn't find any documentation on exactly how current those numbers typically are - but from my own experience, they're usually spot on (unless you're doing some bulk loading or something - but in that case, you wouldn't want to constantly scan the table to get the exact count either).
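If you only care about a single table, a variant of the same idea (the table name is a placeholder) is:
-- index_id 0 = heap, 1 = clustered index; together they count every row exactly once
select sum(p.rows) as ApproxRowCount
from sys.partitions p
where p.object_id = OBJECT_ID('dbo.YourTable')
  and p.index_id <= 1;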

Related

Optimize query when updating

I have the following query that takes too much time to execute.
How can I optimize it?
Update Fact_IU_Lead
set
Fact_IU_Lead.Latitude_Point_Vente = Adr.Latitude,
Fact_IU_Lead.Longitude_Point_Vente = Adr.Longitude
FROM Dim_IU_PointVente
INNER JOIN
Data_I_Adresse AS Adr ON Dim_IU_PointVente.Code_Point_Vente = Adr.Code_Point_Vente
INNER JOIN
Fact_IU_Lead ON Dim_IU_PointVente.Code_Point_Vente = Fact_IU_Lead.Code_Point_Vente
WHERE
Latitude_Point_Vente is null
or Longitude_Point_Vente is null and Adr.[Error]=0
A couple of things I would look at here to help.
How many records are in each table? If it's millions, then you may need to cycle through them in batches.
Are the columns you're joining on or filtering on indexed in each table? If not, add them in! There's typically a huge speed difference at little cost (see the sketch just after this list).
Are the columns you're joining on stored as text instead of geo-spatial? I've had much better performance out of geo-spatial data types in this scenario. Just make sure your SRIDs are the same across tables.
Are the columns you're updating indexed, or is the table that's being updated heavy with indexes? Tons of indexes on a large table can be great for looking things up, but kill update/insert speeds.
Take a look at those first.
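For the indexing point, a hedged sketch of what those indexes might look like (the index names are made up; verify against the actual execution plan before committing to them):
-- cover the join key and the columns the update reads
create index IX_DataIAdresse_CodePV
    on Data_I_Adresse (Code_Point_Vente) include ([Error], Latitude, Longitude);
create index IX_FactIULead_CodePV
    on Fact_IU_Lead (Code_Point_Vente);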
I've done a bit of light cleanup on your code with regard to aliases.
Also, take a look at the two WHERE clause options below and choose one of them.
When you mix ANDs and ORs, the best thing you can ever do is add parentheses.
At a minimum, you'll have no doubt about what you meant when you wrote it.
At best, you'll know that SQL is executing your logic correctly.
Update fl --update via the alias, since the table is aliased in the FROM clause
set
Latitude_Point_Vente = adr.Latitude --Note the table prefix is removed
, Longitude_Point_Vente = adr.Longitude --Note the table prefix is removed
FROM Dim_IU_PointVente as pv --Added alias
INNER JOIN
Data_I_Adresse AS adr ON pv.Code_Point_Vente = adr.Code_Point_Vente --carried alias
INNER JOIN
Fact_IU_Lead as fl ON pv.Code_Point_Vente = fl.Code_Point_Vente --added/carried alias
WHERE
(pv.Latitude_Point_Vente is null or pv.Longitude_Point_Vente is null) and adr.[Error] = 0 --option one for the WHERE change
--pv.Latitude_Point_Vente is null or (pv.Longitude_Point_Vente is null and adr.[Error] = 0) --option two for the WHERE change (commented out; keep exactly one)
Making joins is usually expensive; the best approach in your case will be to place the update into a stored procedure, split your update into selects instead, and use a transaction to keep everything consistent (if needed).
Hope this answer points you in the right direction :)

SQL reduce data in join or where

I want to know which is faster, assuming I have the following queries and that they retrieve the same data:
select * from tableA a inner join tableB b on a.id = b.id where b.columnX = value
or
select * from tableA a inner join (select * from tableB where columnX = value) b on a.id = b.id
I think it makes sense to reduce the dataset from tableB in advance, but I can't find anything to back up my perception.
In a database such as Teradata, the two should have exactly the same performance characteristics.
SQL is not a procedural language. A SQL query describes the result set. It does not specify the sequence of actions.
SQL engines process a query in three steps:
Parse the query.
Optimize the parsed query.
Execute the optimized query.
The second step gives the engine a lot of flexibility. And most query engines will be quite intelligent about ignoring subqueries, using indexes and partitions based on where clauses, and so on.
Most SQL dialects compile your query into an execution plan. Teradata and most SQL systems show the expected execution plan with the "explain" command. Teradata has a visual explain too, which is simple to learn from.
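For example, in Teradata you simply prefix the statement (using the question's placeholder names):
explain
select *
from tableA a
inner join tableB b on a.id = b.id
where b.columnX = value;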
Whether either method would be advantageous depends on the data volumes and the key types in each table.
Most SQL compilers will work this out correctly using the current table statistics (data size and spread).
In some SQL systems your second command would be worse, as it may force a full temporary table build with ALL fields of tableB.
It should be (not that I recommend this query style at all)
select * from tableA a inner join (select id from tableB where columnX = value) b on a.id = b.id
In most cases, don't worry about this unless you have a specific performance issue, and then use the explain commands to work out why.
A better way in general is to use common table expressions (CTEs) to break the problem down. This leads to better queries that can be tested and maintained over the long term.
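As a sketch, the question's second query rewritten with a CTE (same placeholder names, assuming a dialect with WITH support, such as current Teradata):
with filtered_b as (
    select *
    from tableB
    where columnX = value
)
select *
from tableA a
inner join filtered_b b on a.id = b.id;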
Whenever you come across such scenarios and wonder which query would yield results faster in Teradata, use the EXPLAIN plan - it shows exactly how the PE (parsing engine) is going to retrieve the records. If you are using Teradata SQL Assistant, you can select the query and press F6.
The DBMS decides the access path that will be used to resolve the query; you can't decide it yourself. But you can do certain things, like declaring indexes, so that the DBMS takes those indexes into consideration when deciding which access path to use, and then you may get better performance.
For instance, in this example you are filtering tableB by b.columnX. Normally, if there are no indexes declared on tableB, the DBMS will have to do a full table scan to determine which rows fulfill that condition. But suppose you declare an index on tableB by columnX; in that case the DBMS will probably consider that index and determine an access path that makes use of it, getting much better performance than a full table scan, especially if the table is big.
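As a minimal sketch in Teradata's secondary index syntax (placeholder names from the question), declaring such an index looks like:
-- a secondary index the optimizer can use instead of a full table scan
create index (columnX) on tableB;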

Most efficient way to store queries and counts of large SQL data

I have a SQL Server database with a large amount of data (65 million rows mostly of text, 8Gb total). The data gets changed only once per week. I have an ASP.NET web application that will run several SQL queries on this data that will count the number of rows satisfying various conditions. Since the data gets changed only once per week, what is the most efficient way to store both the SQL queries and their counts for the week? Should I store it in the database or in the application?
If the data is only modified once a week, as part of and at the end of that (ETL?) process, perform your "basic" counts and store the results in a table in the database. Thereafter, rather than lengthy queries on the big tables, you can just query those small summary tables.
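A minimal sketch of that idea, with hypothetical table names and conditions (none of these are specified in the question):
-- run at the end of the weekly load
truncate table dbo.WeeklyCounts;
insert into dbo.WeeklyCounts (QueryName, RowsCounted)
select 'active_rows', count(*)
from dbo.BigTable
where Status = 'Active';
-- ...one insert (or one UNION ALL branch) per saved query...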
If you do not need 100% up-to-the-minute accurate row counts, you could query SQL Server's internal info:
-- legacy compatibility views from SQL Server 2000 and earlier
Select so.name as 'TableName', si.rowcnt as 'RowCount'
from sysobjects so
inner join sysindexes si on so.id = si.id
where so.type = 'u'   -- user tables only
  and si.indid < 2    -- heap (0) or clustered index (1)
Very quick to execute and no extra tables required. Not accurate where many updates are occurring but might be accurate enough in your intended usage. [Thank you to commenters!]
Update: did a bit of digging and this does produce accurate counts (slower due to the sum, but still quick):
SELECT OBJECT_SCHEMA_NAME(ps.object_id) AS SchemaName,
       OBJECT_NAME(ps.object_id) AS ObjectName,
       SUM(ps.row_count) AS row_count
FROM sys.dm_db_partition_stats ps
JOIN sys.indexes i
    ON i.object_id = ps.object_id
    AND i.index_id = ps.index_id
WHERE i.type_desc IN ('CLUSTERED', 'HEAP')
  AND OBJECT_SCHEMA_NAME(ps.object_id) <> 'sys'
GROUP BY ps.object_id
ORDER BY OBJECT_NAME(ps.object_id), OBJECT_SCHEMA_NAME(ps.object_id)
Ref.
Remember that the stored count information was not always 100% accurate in SQL Server 2000. For a new table created on 2005 the counts will be accurate. But for a table that existed in 2000 and now resides on 2005 through a restore or update, you need to run (only once after the move to 2005) either sp_spaceused @updateusage = N'true' or DBCC UPDATEUSAGE with the COUNT_ROWS option.
The queries should be stored as stored procedures or views, depending on complexity.
For your situation I would look into indexed views.
They let you both store a query AND the result set for things like aggregation that otherwise cannot be indexed.
As a bonus, the query optimizer "knows" it has this data as well, so if you check for a count or something else stored in the view index in another query (even one not referencing the view directly) it can still use that stored data.
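As a hedged sketch with hypothetical names (the real grouping column depends on your queries), an indexed view for this scenario could look like:
-- schemabinding and count_big(*) are required for an indexed view
create view dbo.vwCategoryCounts
with schemabinding
as
select Category, count_big(*) as CountRows
from dbo.BigTable
group by Category;
create unique clustered index cdx_vwCategoryCounts
    on dbo.vwCategoryCounts (Category);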

SQL Distinct keyword bogs down performance?

I have received a SQL query that makes use of the DISTINCT keyword. When I tried running the query, it took at least a minute to join two tables with hundreds of thousands of records and actually return something.
I then took out the DISTINCT and it came back in 0.2 seconds. Does the DISTINCT keyword really make things that bad?
Here's the query:
SELECT DISTINCT
c.username, o.orderno, o.totalcredits, o.totalrefunds,
o.recstatus, o.reason
FROM management.contacts c
JOIN management.orders o ON (c.custID = o.custID)
WHERE o.recDate > to_date('2010-01-01', 'YYYY-MM-DD')
Yes, as using DISTINCT will (sometimes, according to a comment) cause results to be ordered. Sorting hundreds of thousands of records takes time.
Try GROUP BY on all your columns; it can sometimes lead the query optimiser to choose a more efficient algorithm (at least with Oracle I noticed a significant performance gain).
Distinct always sets off alarm bells for me - it usually signifies a bad table design or a developer who's unsure of themselves. It is used to remove duplicate rows, but if the joins are correct, it should rarely be needed. And yes, there is a large cost to using it.
What's the primary key of the orders table? Assuming it's orderno then that should be sufficient to guarantee no duplicates. If it's something else, then you may need to do a bit more with the query, but you should make it a goal to remove those distincts! ;-)
Also you mentioned the query was taking a while to run when you were checking the number of rows - it can often be quicker to wrap the entire query in "select count(*) from ( )" especially if you're getting large quantities of rows returned. Just while you're testing obviously. ;-)
Finally, make sure you have indexed custID on the orders table (and maybe recDate too).
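Something along these lines, as a sketch (the index name is made up; the query dialect looks like Oracle, so Oracle syntax is assumed):
-- supports the join on custID and the filter on recDate
create index orders_custid_recdate_ix on management.orders (custID, recDate);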
The purpose of DISTINCT is to prune duplicate records from the result set for all the selected columns.
If any of the selected columns is unique after join you can drop DISTINCT.
If you don't know that, but you know that the combination of the values of selected column is unique, you can drop DISTINCT.
Actually, with properly designed databases you rarely need DISTINCT, and in the cases where you do, it is obvious that you need it. The RDBMS, however, cannot leave it to chance and must actually build a sorting or indexing structure to establish it.
Normally you find DISTINCT all over the place when people are not sure about JOINs and relationships between tables.
Also, in classes when talking about pure relational databases where the result should be a proper set (with no repeating elements = records), you can find it quite common for people to stick DISTINCT in to guarantee this property for purposes of theoretical correctness. Sometimes this creeps into production systems.
You can try a GROUP BY like this:
SELECT c.username,
o.orderno,
o.totalcredits,
o.totalrefunds,
o.recstatus,
o.reason
FROM management.contacts c,
management.orders o
WHERE c.custID = o.custID
AND o.recDate > to_date('2010-01-01', 'YYYY-MM-DD')
GROUP BY c.username,
o.orderno,
o.totalcredits,
o.totalrefunds,
o.recstatus,
o.reason
Also verify that you have an index on o.recDate.

simple sql query

Which one is faster?
select * from parents p
inner join children c on p.id = c.pid
where p.x = 2
OR
select * from
(select * from parents where x = 2)
p
inner join children c on p.id = c.pid
where p.x = 2
In MySQL, the first one is faster:
SELECT *
FROM parents p
INNER JOIN
children c
ON c.pid = p.id
WHERE p.x = 2
, since using an inline view implies generating and passing the records twice.
In other engines, they are usually optimized to use one execution plan.
MySQL is not very good in parallelizing and pipelining the result streams.
Like this query:
SELECT *
FROM mytable
LIMIT 1
is instant, while this one (which is semantically identical):
SELECT *
FROM (
SELECT *
FROM mytable
) t -- MySQL requires an alias for a derived table
LIMIT 1
will first select all values from mytable, buffer them somewhere and then fetch the first record.
For Oracle, SQL Server and PostgreSQL, the queries above (and both of your queries) will most probably yield the same execution plans.
I know this is a simple case, but your first option is much more readable than the second one. As long as the two query plans are comparable, I'd always opt for the more maintainable SQL code, which for me is your first example.
It depends on how good the database is at optimising the query.
If the database manages to optimise the second one into the first one, they are equally fast, otherwise the first one is faster.
The first one gives more freedom for the database to optimise the query. The second one suggests a specific order of doing things. Either the database is able to see past this and optimise it into a single query, or it will run the query as two separate queries with the subquery as an intermediate result.
A database like SQL Server keeps statistics on what the database tables contain, which it uses to determine how to execute the query in the most efficient way. For example, depending on what will eliminate most records, it can either start with joining the tables or with filtering the parents table on the condition. If you write a query that forces a specific order, that might not be the most efficient order.
I'd think the first. I'm not sure if the optimizer would use any indexes on the derived table in the second query, or if it would copy out all the matching rows into memory before joining back to the children.
This is why you have DBAs. It depends entirely on the DBMS, and how your tables and indexes are configured, as to which one runs the fastest.
Database tuning is not a set-and-forget operation, it should be done regularly, as the data changes, to ensure your database runs at peak performance. The question is not really meaningful without specifying:
which DBMS you are asking about.
what indexes you have on the tables.
a host of other possible configuration items (which may also depend on the DBMS, such as clustering).
You should run both those queries through the query optimizer to see which one is fastest, then start using that one. That's assuming the difference is noticeable in the first place. If the difference is minimal, go for the easiest to read/maintain.
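For example, assuming MySQL (other engines have equivalents, such as EXPLAIN in PostgreSQL or the estimated plan in SQL Server):
-- compare the plans the optimizer produces for each form
explain select * from parents p inner join children c on p.id = c.pid where p.x = 2;
explain select * from (select * from parents where x = 2) p inner join children c on p.id = c.pid;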
For me, the second query is saying: "I don't trust the optimizer to optimize this query, so I'll provide some 'hints'."
I'd say: trust the optimizer until it lets you down, and only then consider trying to do the optimizer's job for it.