SELECT COUNT(*) FROM BigTable_1
Which way should I use to get the number of rows in a table if it has more than 1 billion rows?
UPDATE: For example, if we have a timeout problem with the query above, is there any way to optimize it? How can we make it quicker?
If you need an exact count, you have to use COUNT(*).
If you are OK with a rough count, you can use a sum of the row counts in the partitions:
SELECT SUM(rows)
FROM sys.partitions
WHERE index_id IN (0, 1)
  AND object_id = OBJECT_ID('Database.schema.Table');
If you want to be funny with your COUNT, you can do the following:
select COUNT(1/0) from BigTable_1
A very fast ESTIMATE:
select count(*) from table
But don't execute it! Highlight the code, press Ctrl-L to bring up the estimated query plan, then hover over the leftmost arrow. A yellow box appears with the estimated number of rows.
You can query system tables to get the same data, but that is harder to remember. This way is much more impressive to onlookers.
:)
You can use sys.dm_db_partition_stats.
select sum(row_count)
from sys.dm_db_partition_stats
where object_id = object_id('TableName') and index_id < 2
Depending on your concurrency, speed, and accuracy requirements, you can get an approximate answer with triggers. Create a table:
CREATE TABLE TABLE_COUNTS (TABLE_NAME VARCHAR(128), R_COUNT BIGINT DEFAULT 0);
INSERT INTO TABLE_COUNTS (TABLE_NAME, R_COUNT) VALUES ('BigTable_1', 0);
(I'm going to leave out adding a key, etc., for brevity.)
Now set up triggers.
CREATE TRIGGER bt1count_1 AFTER INSERT ON BigTable_1 FOR EACH ROW
BEGIN
UPDATE TABLE_COUNTS SET R_COUNT=R_COUNT+1 WHERE TABLE_NAME='BigTable_1';
END;
A corresponding decrement trigger goes on DELETEs. Now instead of a COUNT, you query the TABLE_COUNT table. Your result will be a little off in the case of pending transactions, but you may be able to live with that. And the cost is amortized over all of the INSERT and DELETE operations; getting the row count when you need it is fast.
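For completeness, a sketch of the matching decrement trigger and the fast lookup, in the same generic (MySQL-style) trigger syntax as the INSERT trigger above:
CREATE TRIGGER bt1count_2 AFTER DELETE ON BigTable_1 FOR EACH ROW
BEGIN
UPDATE TABLE_COUNTS SET R_COUNT=R_COUNT-1 WHERE TABLE_NAME='BigTable_1';
END;

-- getting the count is now a single-row read
SELECT R_COUNT FROM TABLE_COUNTS WHERE TABLE_NAME='BigTable_1';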
Try this:
select sum(P.rows) from sys.partitions P with (nolock)
join sys.tables T with (nolock) on P.object_id = T.object_id
where T.Name = 'Table_1' and index_id = 1
It should be a lot faster. Got it from here: SELECT COUNT(*) FOR BIG TABLE
Your query will get the number of rows regardless of the quantity. Try using the query you listed in your question.
There's only one accurate way to count the rows in a table: COUNT(*). sp_spaceused or looking at the statistics won't necessarily give you the correct answer.
If you've got a primary key, you should be able to do this:
select count(PrimaryKey) from table_1
Related
In SQL server, obviously one way of getting the number of rows in a table is
SELECT COUNT(*) FROM MyTable
but I assume that's O(n) time where n is the number of rows. Is there any metadata I can access that has the number of rows stored?
Yes, you can use sys.partitions; it might not be the exact number, but it's extremely fast:
SELECT SUM(rows)
FROM sys.partitions
WHERE [object_id] = OBJECT_ID('dbo.MyTable')
AND index_id IN (0,1);
INSERT INTO <TABLED>
SELECT A.*
FROM <TABLEA> A
WHERE A.MED_DTL_STATUS = '0'
AND A.TRANS_ID NOT IN
(
    SELECT DISTINCT TRANS_ID_X_REF FROM <TABLEB>
    UNION
    SELECT DISTINCT TRANS_ID FROM <TABLEA> WHERE ADJUSTMENT_TYPE = '3'
);
The table has more than 250 columns.
The SELECT statement will return more than 300,000 records, and the query above runs for a long time. I have never worked on performance tuning. Could someone please help me tune this, or point me to some good links on how to tune Oracle queries?
Thanks in advance.
I find that NOT IN clauses are really slow. I would rewrite the query with NOT EXISTS instead.
INSERT INTO <TABLED>
SELECT A.* FROM <TABLEA> A
WHERE A.MED_DTL_STATUS='0'
AND NOT EXISTS (
SELECT B.TRANS_ID_X_REF
FROM <TABLEB> B
WHERE B.TRANS_ID_X_REF = A.TRANS_ID
)
AND NOT EXISTS (
SELECT A2.TRANS_ID
FROM <TABLEA> A2
WHERE A2.TRANS_ID = A.TRANS_ID
AND A2.ADJUSTMENT_TYPE='3'
);
The query above assumes there are indexes on TRANS_ID on TableA and TableB. This may not really solve your problem, but without knowing the data model and indexes it may be worth a shot.
Apart from the good suggestions already given, whenever you are inserting a large number of records into a table it is best practice to drop the indexes on that table. When the INSERT process has finished, then recreate the indexes.
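For illustration, a minimal sketch of that pattern with a hypothetical index name (the placeholder table names are from the question):
-- hypothetical index name, for illustration only
DROP INDEX tabled_trans_id_ix;

-- ... run the big INSERT INTO <TABLED> ... from the question here ...

CREATE INDEX tabled_trans_id_ix ON <TABLED> (TRANS_ID);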
How selective is this predicate?
A.MED_DTL_STATUS='0'
If it filters out a large proportion of the rows in the table then creating an index on MED_DTL_STATUS might help.
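For example (hypothetical index name, placeholder table name as in the question):
CREATE INDEX tablea_med_dtl_status_ix ON <TABLEA> (MED_DTL_STATUS);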
Note that Oracle has (or at least used to have) a limit of ~1,000 items for an IN list (ORA-01795); strictly speaking that limit applies to explicit expression lists rather than subqueries, but if you ever hit it, the IN can be rewritten using a left outer join.
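A sketch of that anti-join rewrite for the first subquery, reusing the placeholder names from the question (note that NOT IN and an outer-join anti-join treat NULLs differently, and the second subquery on <TABLEA> would need its own anti-join, as in the NOT EXISTS answer above):
SELECT A.*
FROM <TABLEA> A
LEFT OUTER JOIN <TABLEB> B
  ON B.TRANS_ID_X_REF = A.TRANS_ID
WHERE A.MED_DTL_STATUS = '0'
  AND B.TRANS_ID_X_REF IS NULL;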
I am performing an update with a query like this:
UPDATE (SELECT h.m_id,
m.id
FROM h
INNER JOIN m
ON h.foo = m.foo)
SET m_id = id
WHERE m_id IS NULL
Some info:
Table h is roughly ~5 million rows
All rows in table h have NULL values for m_id
Table m is roughly ~500 thousand rows
m_id on table h is an indexed foreign key pointing to id on table m
id on table m is the primary key
There are indexes on m.foo and h.foo
The EXPLAIN PLAN for this query indicated a hash join and full table scans, but I'm no DBA, so I can't really interpret it very well.
The query itself ran for several hours and did not complete. I would have expected it to complete in no more than a few minutes. I've also attempted the following query rewrite:
UPDATE h
SET m_id = (SELECT id
FROM m
WHERE m.foo = h.foo)
WHERE m_id IS NULL
The EXPLAIN PLAN for this mentioned ROWID lookups and index usage, but it also went on for several hours without completing. I've also always been under the impression that queries like this would cause the subquery to be executed for every result from the outer query's predicate, so I would expect very poor performance from this rewrite anyway.
Is there anything wrong with my approach, or is my problem related to indexes, tablespace, or some other non-query-related factor?
Edit:
I'm also having abysmal performance from simple count queries like this:
SELECT COUNT(*)
FROM h
WHERE m_id IS NULL
These queries are taking anywhere from ~30 seconds to sometimes ~30 minutes(!).
I am noticing no locks, but the tablespace for these tables is sitting at 99.5% usage (only ~6MB free) right now. I've been told that this shouldn't matter as long as indexes are being used, but I don't know...
Some points:
Oracle does not index NULL values (a B-tree index stores a NULL only as part of a composite key in which at least one column is non-null, but that's about it).
Oracle is going for a HASH JOIN because of the size of both h and m. This is likely the best option performance-wise.
The second UPDATE might get Oracle to use indexes, but then Oracle is usually smart about merging subqueries. And it would be a worse plan anyway.
Do you have recent, reasonable statistics for your schema? Oracle really needs decent statistics.
In your execution plan, which is the first table in the HASH JOIN? For best performance it should be the smaller table (m in your case). If you don't have good cardinality statistics, Oracle will get messed up. You can force Oracle to assume fixed cardinalities with the cardinality hint; it may help Oracle get a better plan.
For example, in your first query:
UPDATE (SELECT /*+ cardinality(h 5000000) cardinality(m 500000) */
h.m_id, m.id
FROM h
INNER JOIN m
ON h.foo = m.foo)
SET m_id = id
WHERE m_id IS NULL
In Oracle, FULL SCAN reads not only every record in the table, it basically reads all storage allocated up to the maximum used (the high water mark in Oracle documentation). So if you have had a lot of deleted rows your tables might need some cleaning up. I have seen a SELECT COUNT(*) on an empty table consume 30+ seconds because the table in question had like 250 million deleted rows. If that is the case, I suggest analyzing your specific case with a DBA, so he/she can reclaim space from deleted rows and lower the high water mark.
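If deleted-row bloat turns out to be the problem, one common remedy (Oracle 10g and later; something to coordinate with that DBA) is a segment shrink, sketched here:
-- requires row movement; test on a quiet system first
ALTER TABLE h ENABLE ROW MOVEMENT;
ALTER TABLE h SHRINK SPACE CASCADE;  -- CASCADE also shrinks dependent indexes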
As far as I remember, a WHERE m_id IS NULL performs a full-table scan, since NULL values cannot be indexed.
Full-table scan means, that the engine needs to read every record in the table to evaluate the WHERE condition, and cannot use an index.
You could try to add a virtual column set to a not-null value if m_id IS NULL, and index this column, and use this column in the WHERE condition.
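A sketch of that idea on Oracle 11g or later, with hypothetical column and index names (on older versions, a function-based index on the same CASE expression achieves the same effect):
-- the virtual column is non-null only for rows that still need fixing
ALTER TABLE h ADD (m_id_missing AS (CASE WHEN m_id IS NULL THEN 1 END));
CREATE INDEX h_m_id_missing_ix ON h (m_id_missing);

-- the UPDATE would then filter on the indexed column:
-- WHERE m_id_missing = 1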
Then you could also move the WHERE condition from the UPDATE statement to the sub-select, which will probably make the statement faster.
Since JOINs are expensive, rewriting INNER JOIN m ON h.foo = m.foo as
WHERE h.foo IN (SELECT m.foo FROM m WHERE m.foo IS NOT NULL)
may also help.
For large tables, MERGE is often much faster than UPDATE. Try this (untested):
MERGE INTO h USING
(SELECT h.h_id,
m.id as new_m_id
FROM h
INNER JOIN m
ON h.foo = m.foo
WHERE h.m_id IS NULL
) new_data
ON (h.h_id = new_data.h_id)
WHEN MATCHED THEN
UPDATE SET h.m_id = new_data.new_m_id;
Try the undocumented hint /*+ BYPASS_UJVC */. If it works, add a UNIQUE/PK constraint on m.foo.
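A sketch of both, applied to the first query from the question (the hint is unsupported, so the constraint is the safer route; it makes m key-preserved, which is what allows the join view to be updated):
-- undocumented hint: use at your own risk
UPDATE /*+ BYPASS_UJVC */
       (SELECT h.m_id, m.id
          FROM h
         INNER JOIN m ON h.foo = m.foo)
   SET m_id = id
 WHERE m_id IS NULL;

-- the supported alternative: declare m.foo unique
ALTER TABLE m ADD CONSTRAINT m_foo_uk UNIQUE (foo);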
I would update the table in iterations: for example, add a condition like WHERE h.date_created > SYSDATE - 30, and after that finishes run the same query with the condition changed to WHERE h.date_created BETWEEN SYSDATE - 60 AND SYSDATE - 30, and so on. If you don't have a column like date_created, maybe there's another column you can filter by, for example: WHERE m.foo = h.foo AND m.foo BETWEEN 1 AND 10.
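A sketch of the first two batches, assuming the hypothetical date_created column:
-- newest 30 days first
UPDATE h
   SET m_id = (SELECT id FROM m WHERE m.foo = h.foo)
 WHERE m_id IS NULL
   AND date_created > SYSDATE - 30;
COMMIT;

-- next batch: 30 to 60 days back, and so on
UPDATE h
   SET m_id = (SELECT id FROM m WHERE m.foo = h.foo)
 WHERE m_id IS NULL
   AND date_created BETWEEN SYSDATE - 60 AND SYSDATE - 30;
COMMIT;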
Only the execution plan can explain why the cost of this update is high, but an educated guess is that both tables are very big and that there are many NULL values as well as a lot of matching rows (m.foo = h.foo)...
I have a SQL stored procedure in which one statement is taking 95% of the total time (10 minutes) to complete. #Records has approximately 133,000 rows and Records has approximately 12,000 rows.
-- Check Category 1 first
UPDATE #Records
SET Id = (SELECT TOP 1 Id FROM Records WHERE Cat1=#Records.Cat1)
WHERE Cat1 IS NOT NULL
I have tried adding an index on Cat1 in #Records, but the statement time did not improve.
CREATE CLUSTERED INDEX IDX_C_Records_Cat1 ON #Records(Cat1)
A similar statement that follows takes only a fraction of the time:
-- Check Category 2
UPDATE #Records
SET Id = (SELECT TOP 1 Id FROM Records WHERE Cat2=#Records.Cat2)
WHERE ID IS NULL
Any ideas on why this is happening or what I can do to make this statement more time effective?
Thanks in advance.
I am running this on Microsoft SQL Server 2005.
Update with a join, maybe:
UPDATE t
SET t.Id = r.Id
FROM (SELECT MIN(Id) AS Id, Cat1 FROM Records GROUP BY Cat1) r
INNER JOIN #Records t ON r.Cat1 = t.Cat1
WHERE t.Cat1 IS NOT NULL
I would say your problem is probably that you are using a correlated subquery instead of a join. Joins work in sets; correlated subqueries run row-by-agonizing-row and are essentially cursors.
In my experience, when you are trying to update a large number of records, it is sometimes faster to use a cursor and iterate through the records rather than use a single update query.
Maybe this helps in your case.
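A minimal sketch of that approach in T-SQL, assuming Cat1 is an integer; this is just to show the shape, not a tuned implementation:
DECLARE @cat1 INT, @id INT;

DECLARE cat_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT DISTINCT Cat1 FROM #Records WHERE Cat1 IS NOT NULL;

OPEN cat_cursor;
FETCH NEXT FROM cat_cursor INTO @cat1;

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @id = NULL; -- reset so a missing match doesn't reuse the previous Id
    SELECT TOP 1 @id = Id FROM Records WHERE Cat1 = @cat1;
    UPDATE #Records SET Id = @id WHERE Cat1 = @cat1;
    FETCH NEXT FROM cat_cursor INTO @cat1;
END;

CLOSE cat_cursor;
DEALLOCATE cat_cursor;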
I have an SQL Query (For SQL Server 2008 R2) that takes a very long time to complete. I was wondering if there was a better way of doing it?
SELECT @count = COUNT(Name)
FROM Table1 t
WHERE t.Name = @name AND t.Code NOT IN (SELECT Code FROM ExcludedCodes)
Table1 has around 90Million rows in it and is indexed by Name and Code.
ExcludedCodes only has around 30 rows in it.
This query is in a stored procedure and gets called around 40k times; the total time the procedure takes to finish is 27 minutes. I believe this is my biggest bottleneck because of the massive number of rows it queries against and the number of times it does it.
So if you know of a good way to optimize this, it would be greatly appreciated! If it cannot be optimized, then I guess I'm stuck with 27 min...
EDIT
I changed the NOT IN to NOT EXISTS and it cut the time down to 10:59, so that alone is a massive gain on my part. I am still going to attempt the GROUP BY approach as suggested below, but that will require a complete rewrite of the stored procedure and might take some time... (as I said before, I'm not the best at SQL but it is starting to grow on me. ^^)
In addition to workarounds to get the query itself to respond faster, have you considered maintaining a column in the table that tells whether it is in this set or not? It requires a lot of maintenance but if the ExcludedCodes table does not change often, it might be better to do that maintenance. For example you could add a BIT column:
ALTER TABLE dbo.Table1 ADD IsExcluded BIT;
Make it NOT NULL and default to 0. Then you could create a filtered index:
CREATE INDEX n ON dbo.Table1(name)
WHERE IsExcluded = 0;
Now you just have to update the table once:
UPDATE t
SET IsExcluded = 1
FROM dbo.Table1 AS t
INNER JOIN dbo.ExcludedCodes AS x
ON t.Code = x.Code;
And ongoing you'd have to maintain this with triggers on both tables. With this in place, your query becomes:
SELECT @Count = COUNT(Name)
FROM dbo.Table1 WHERE IsExcluded = 0;
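As a sketch, the maintenance trigger for newly excluded codes might look like this (handling inserts into Table1, and deletes from ExcludedCodes, follows the same pattern):
CREATE TRIGGER dbo.trg_ExcludedCodes_Insert
ON dbo.ExcludedCodes
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- flag every Table1 row whose code was just excluded
    UPDATE t
    SET IsExcluded = 1
    FROM dbo.Table1 AS t
    INNER JOIN inserted AS i
    ON t.Code = i.Code;
END;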
EDIT
As for "NOT IN being slower than LEFT JOIN": I performed a simple test on only a few thousand rows (the timing results are not reproduced here).
EDIT 2
I'm not sure why this query wouldn't do what you're after, and be far more efficient than your 40K loop:
SELECT src.Name, COUNT(*)
FROM dbo.Table1 AS src
INNER JOIN #temptable AS t
ON src.Name = t.Name
WHERE src.Code NOT IN (SELECT Code FROM dbo.ExcludedCodes)
GROUP BY src.Name;
Or the LEFT JOIN equivalent:
SELECT src.Name, COUNT(*)
FROM dbo.Table1 AS src
INNER JOIN #temptable AS t
ON src.Name = t.Name
LEFT OUTER JOIN dbo.ExcludedCodes AS x
ON src.Code = x.Code
WHERE x.Code IS NULL
GROUP BY src.Name;
I would put money on either of those queries taking less than 27 minutes. I would even suggest that running both queries sequentially will be far faster than your one query that takes 27 minutes.
Finally, you might consider an indexed view. I don't know your table structure or whether you'd violate any of the restrictions, but it is worth investigating IMHO.
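If the restrictions do work out, a sketch of what such an indexed view could look like, reusing the IsExcluded flag from above (hypothetical names; the exact shape depends on your schema):
CREATE VIEW dbo.NameCounts
WITH SCHEMABINDING
AS
SELECT Name, COUNT_BIG(*) AS NameCount
FROM dbo.Table1
WHERE IsExcluded = 0
GROUP BY Name;
GO
-- materializes the view; the counts are maintained automatically from then on
CREATE UNIQUE CLUSTERED INDEX IX_NameCounts ON dbo.NameCounts(Name);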
You say this gets called around 40K times. Why? Is it in a cursor? If so, do you really need a cursor? Couldn't you put the values you want for @name in a temp table, index it, and then join to it?
select t.name, count(t.name)
from table t
join #name n on t.name = n.name
where NOT EXISTS (SELECT Code FROM ExcludedCodes WHERE Code = t.code)
group by t.name
That might get you all your results in one query and is almost certainly faster than 40K separate queries. Of course, if you need the count of all the names, it's even simpler:
select t.name, count(t.name)
from table t
where NOT EXISTS (SELECT Code FROM ExcludedCodes WHERE Code = t.Code)
group by t.name
NOT EXISTS typically performs better than NOT IN, but you should test it on your system.
SELECT @count = COUNT(Name)
FROM Table1 t
WHERE t.Name = @name AND NOT EXISTS (SELECT 1 FROM ExcludedCodes e WHERE e.Code = t.Code)
Without knowing more about your query it's tough to supply concrete optimization suggestions (i.e. code suitable for copy/paste). Does it really need to run 40,000 times? Sounds like your stored procedure needs reworking, if that's feasible. You could exec the above once at the start of the proc and insert the results in a temp table, which can keep the indexes from Table1, and then join on that instead of running this query.
This particular bit might not even be the bottleneck that makes your query run 27 minutes. For example, are you using a cursor over those 90 million rows, or scalar-valued UDFs in your WHERE clauses?
Have you thought about doing the query once and populating the data in a table variable or temp table? Something like:
insert into #temp (name, NameCount)
select name, count(name)
from table1
where name not in (select code from ExcludedCodes)
group by name
And don't forget that you could possibly use a filtered index as long as the excluded codes table is somewhat static.
Start evaluating the execution plan. Which is the heaviest part to compute?
Regarding the relation between the two tables, use a JOIN on indexed columns: indexes will optimize query execution.
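As a starting point, a quick way to see where the time goes in SQL Server:
-- run once in the session, then execute the slow statement;
-- the Messages tab will show logical reads and CPU/elapsed time
SET STATISTICS IO ON;
SET STATISTICS TIME ON;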