Will one query run faster than multiple queries, if they are deleting the same amount of records - sql

I have a table like this:
Table company
companyid | companyname | owner |
5 |coffecompany |Mike |
6 |juicecompany |Mike |
For some reason, I need to use this:
DELETE FROM company WHERE companyid='5';
DELETE FROM company WHERE companyid='6';
instead of
DELETE FROM company WHERE owner='Mike';
But I wonder if the second choice run faster, if it does, will it run much faster? In the future, I may have to use it to delete a large amount of records, so I really need to know.

delete from company where companyId in (5, 6); should always be faster, even though the difference might be negligible if eg. you've got proper indexes, no concurrent queries, no issues with locking etc.
Note that my query is for MS SQL, if your database server allows using the same construct (ie. specifying all the values in such concise way), you should probably use it, if not, go with something like delete from company where companyId = 5 or companyId = 6; Also, don't use string literals if companyid is a number (is the table column actually a number, or a text?).
In any case, it gives the server more lee-way in implementing the actual operation, and DB servers tend to be very good at query optimization.
One possible bottle-neck for deletion could be in the transaction logs, however. It might very well be that if you're deleting a huge amount of rows at once, it would be better to do a few separate deletes in separate transactions to fit within transaction size limits.

Generally, SQL is language operating on sets of data so second query will be much faster for huge amount of rows.
First choice might be slower as you'll have to send query text as many times as you have rows to delete. Imagine network traffic if you want to delete 1 000 000 rows.
On small amounts of rows probably you won't be able to see any difference.

If you are using Oracle, think of using bind variable :
execute immediate 'DELETE FROM company WHERE companyid=:ID' USING 6;
But other than that, there is no specific answer to your question, you need to benchmark yourself, it depends on the amount of data, your indexes etc...

When using Where clause in a query, RDBMS will go to find the result set applying the condition.
Normally RDBMS will do a full-table-scan to find the result set, it means that any records will be investigated to see if the condition is matched. Based on the table size that will be time consuming.
Above approach will differed when when the column(s) listed in the where condition are indexed.
Indexing is a way of sorting a number of records on multiple fields. Creating an index on a field in a table creates another data structure which holds the field value, and pointer to the record it relates to. This index structure is then sorted, allowing Binary Searches to be performed on it.
As a simplified sample:
A linear search (full-table-scan) on the field A of table T containing N records would require an average of N/2 accesses to find a value.
If 'A' field is indexed then a sorted binary search will requiring an average of log2 N block accesses. Asuming that N=1,000,000 then we will have
N/2 = 500,000
log2 1000000 = 19.93 = 20
Instantly we can see this is a drastic improvement.
It looks like the companyid is the primary key of company table, if so any primary key column will be indexed automatically by RDMS and the search will be more effective than searching by owner.

Related

Proper way to create dynamic 1:M SQL table

Simplified example: Two tables - people and times. Goal is to keep track of all the times a person walks through a doorway.
A person could have between 0 and 50 entries in the times table daily.
What is the proper and most efficient way to keep track of these records? Is it
times table
-----------
person_id
timestamp
I'm worried that this table can get well over a million records rather quickly. Insertion and retrieval times are of utmost importance.
ALSO: Obviously non-normalized but would it be a better idea to do
times table
-----------
person_id
serialized_timestamps_for_the_day
date
We need to access each individual timestamp for the person but ONLY query records on date or the person's id.
The second solution has some problems:
Since you need to access individual timestamps1, serialized_timestamps_for_the_day cannot be considered atomic and would violate the 1NF, causing a bunch of problems.
On top of that, you are introducing a redundancy: the date can be inferred from the contents of the serialized_timestamps_for_the_day, and your application code would need to make sure they never become "desynchronized", which is vulnerable to bugs.2
Therefore go with the first solution. If properly indexed, a modern database on modern hardware can handle much more than mere "well over a million records". In this specific case:
A composite index on {person_id, timestamp} will allow you to query for person or combination of person and date by a simple index range scan, which can be very efficient.
If you need just "by date" query, you'll need an index on {timestamp}. You can easily search for all timestamps within a specific date by searching for a range 00:00 to 24:00 of the given day.
1 Even if you don't query for individual timestamps, you still need to write them to the database one-by-one. If you have a serialized field, you first need to read the whole field to append just one value, and then write the whole result back to the database, which may become a performance problem rather quickly. And there are other problems, as mentioned in the link above.
2 As a general rule, what can be inferred should not be stored, unless there is a good performance reason to do so, and I don't see any here.
Consider what are we talking about here. Accounting for just raw data (event_time, user_id) this would be (4 + 4) * 1M ~ 8MB per 1M rows. Let's try to roughly estimate this in a DB.
One integer 4 bytes, timestamp 4 bytes; row header, say 18 bytes -- this brings the first estimate of the row size to 4 + 4 + 18 = 26 bytes. Using page fill factor of about 0.7; ==> 26 / 0.7 ~ 37 bytes per row.
So, for 1 M rows that would be about 37 MB. You will need index on (user_id, event_time), so let's simply double the original to 37 * 2 = 74 MB.
This brings the very rough, inacurate estimate to 74MB per 1M rows.
So, to keep this in memory all the time, you would need 0.074 GB for each 1M rows of this table.
To get better estimate, simply create a table, add the index and fill it with few million rows.
Given the expected data volume, this can all easily be tested with 10M rows even on a laptop -- testing always beats speculating.
P.S. Your option 2 does not look "obviously better idea" too me at all.
I think first option would be a better option.
Even if you go for second option, the size of the index might not reduce. In fact there will be an additional column.
And the data for different users is not related, you can shard the database based on person_id. i.e. let's say your data cannot be fit on a single database server node and requires two nodes. Then data for half the users will be stored on one node and rest of the data will be stored on another node.
This can be done using RDBMS like MySQL or Document oriented databases like MongoDB and OrientDB as well.

How much is performance improved when using LIMIT in a SQL sentence?

Let's suppose I have a table in my database with 1.000.000 records.
If I execute:
SELECT * FROM [Table] LIMIT 1000
Will this query take the same time as if I have that table with 1000 records and just do:
SELECT * FROM [Table]
?
I'm not looking for if it will take exactly the same time. I just want to know if the first one will take much more time to execute than the second one.
I said 1.000.000 records, but it could be 20.000.000. That was just an example.
Edit:
Of course that when using LIMIT and without using it in the same table, the query built using LIMIT should be executed faster, but I'm not asking that...
To make it generic:
Table1: X records
Table2: Y records
(X << Y)
What I want to compare is:
SELECT * FROM Table1
and
SELECT * FROM Table2 LIMIT X
Edit 2:
Here is why I'm asking this:
I have a database, with 5 tables and relationships between some of them. One of those tables will (I'm 100% sure) contain about 5.000.000 records. I'm using SQL Server CE 3.5, Entity Framework as the ORM and LINQ to SQL to make the queries.
I need to perform basically three kind of non-simple queries, and I was thinking about showing to the user a limit of records (just like lot of websites do). If the user wants to see more records, the option he/she has is to restrict more the search.
So, the question came up because I was thinking about doing this (limiting to X records per query) or if storing in the database only X results (the recent ones), which will require to do some deletions in the database, but I was just thinking...
So, that table could contain 5.000.000 records or more, and what I don't want is to show the user 1000 or so, and even like this, the query still be as slow as if it would be returning the 5.000.000 rows.
TAKE 1000 from a table of 1000000 records - will be 1000000/1000 (= 1000) times faster because it only needs to look at (and return) 1000/1000000 records. Since it does less, it is naturally faster.
The result will be pretty (pseudo-)random, since you haven't specified any order in which to TAKE. However, if you do introduce an order, then one of two below becomes true:
The ORDER BY clause follows an index - the above statement is still true.
The ORDER BY clause cannot use any index - it will be only marginally faster than without the TAKE, because
it has to inspect ALL records, and sort by ORDER BY
deliver only a subset (TAKE count)
so it is not faster in the first step, but the 2nd step involves less IO/network than ALL records
If you TAKE 1000 records from a table of 1000 records, it will be equivalent (with little significant differences) to TAKE 1000 records from 1 billion, as long as you are following the case of (1) no order by, or (2) order by against an index
Assuming both tables are equivalent in terms of index, row-sizing and other structures. Also assuming that you are running that simple SELECT statement. If you have an ORDER BY clause in your SQL statements, then obviously the larger table will be slower. I suppose you're not asking that.
If X = Y, then obviously they should run in similar speed, since the query engine will be going through the records in exactly the same order -- basically a table scan -- for this simple SELECT statement. There will be no difference in query plan.
If Y > X only by a little bit, then also similar speed.
However, if Y >> X (meaning Y has many many more rows than X), then the LIMIT version MAY be slower. Not because of query plan -- again should be the same -- but simply because that the internal structure of data layout may have several more levels. For example, if data is stored as leafs on a tree, there may be more tree levels, so it may take slightly more time to access the same number of pages.
In other words, 1000 rows may be stored in 1 tree level in 10 pages, say. 1000000 rows may be stored in 3-4 tree levels in 10000 pages. Even when taking only 10 pages from those 10000 pages, the storage engine still has to go through 3-4 tree levels, which may take slightly longer.
Now, if the storage engine stores data pages sequentially or as a linked list, say, then there will be no difference in execution speed.
It would be approximately linear, as long as you specify no fields, no ordering, and all the records. But that doesn't buy you much. It falls apart as soon as your query wants to do something useful.
This would be quite a bit more interesting if you intended to draw some useful conclusion and tell us about the way it would be used to make a design choice in some context.
Thanks for the clarification.
In my experience, real applications with real users seldom have interesting or useful queries that return entire million-row tables. Users want to know about their own activity, or a specific forum thread, etc. So unless yours is an unusual case, by the time you've really got their selection criteria in hand, you'll be talking about reasonable result sizes.
In any case, users wouldn't be able to do anything useful with many rows over several hundred, transporting them would take a long time, and they couldn't scroll through it in any reasonable way.
MySQL has the LIMIT and OFFSET (starting record #) modifiers primarlly for the exact purpose of creating chunks of a list for paging as you describe.
It's way counterproductive to start thinking about schema design and record purging until you've used up this and a bunch of other strategies. In this case don't solve problems you don't have yet. Several-million-row tables are not big, practically speaking, as long as they are correctly indexed.

How can I improve performance of average method in SQL?

I'm having some performance problems where a SQL query calculating the average of a column is progressively getting slower as the number of records grows. Is there an index type that I can add to the column that will allow for faster average calculations?
The DB in question is PostgreSQL and I'm aware that particular index type might not be available, but I'm also interested in the theoretical answer, weather this is even possible without some sort of caching solution.
To be more specific, the data in question is essentially a log with this sort of definition:
table log {
int duration
date time
string event
}
I'm doing queries like
SELECT average(duration) FROM log WHERE event = 'finished'; # gets average time to completion
SELECT average(duration) FROM log WHERE event = 'finished' and date > $yesterday; # average today
The second one is always fairly fast since it has a more restrictive WHERE clause, but the total average duration one is the type of query that is causing the problem. I understand that I could cache the values, using OLAP or something, my question is weather there is a way I can do this entirely by DB side optimisations such as indices.
The performance of calculating an average will always get slower the more records you have, at it always has to use values from every record in the result.
An index can still help, if the index contains less data than the table itself. Creating an index for the field that you want the average for generally isn't helpful as you don't want to do a lookup, you just want to get to all the data as efficiently as possible. Typically you would add the field as an output field in an index that is already used by the query.
Depends what you are doing? If you aren't filtering the data then beyond having the clustered index in order, how else is the database to calculate an average of the column?
There are systems which perform online analytical processing (OLAP) which will do things like keeping running sums and averages down the information you wish to examine. It all depends one what you are doing and your definition of "slow".
If you have a web based program for instance, perhaps you can generate an average once a minute and then cache it, serving the cached value out to users over and over again.
Speeding up aggregates is usually done by keeping additional tables.
Assuming sizeable table detail(id, dimA, dimB, dimC, value) if you would like to make the performance of AVG (or other aggregate functions) be nearly constant time regardless of number of records you could introduce a new table
dimAavg(dimA, avgValue)
The size of this table will depend only on the number of distinct values of dimA (furthermore this table could make sense in your design as it can hold the domain of the values available for dimA in detail (and other attributes related to the domain values; you might/should already have such table)
This table is only helpful if you will anlayze by dimA only, once you'll need AVG(value) according to dimA and dimB it becomes useless. So, you need to know by which attributes you will want to do fast analysis on. The number of rows required for keeping aggregates on multiple attributes is n(dimA) x n(dimB) x n(dimC) x ... which may or may not grow pretty quickly.
Maintaining this table increases the costs of updates (incl. inserts and deletes), but there are further optimizations that you can employ...
For example let us assume that system predominantly does inserts and only occasionally updates and deletes.
Lets further assume that you want to analyze by dimA only and that ids are increasing. Then having structure such as
dimA_agg(dimA, Total, Count, LastID)
can help without a big impact on the system.
This is because you could have triggers that would not fire on every insert, but lets say on ever 100 inserts.
This way you can still get accurate aggregates from this table and the details table with
SELECT a.dimA, (SUM(d.value)+MAX(a.Total))/(COUNT(d.id)+MAX(a.Count)) as avgDimA
FROM details d INNER JOIN
dimA_agg a ON a.dimA = d.dimA AND d.id > a.LastID
GROUP BY a.dimA
The above query with proper indexes would get one row from dimA_agg and only less then 100 rows from detail - this would perform in near constant time (~logfanoutn) and would not require update to dimA_agg for every insert (reducing update penalties).
The value of 100 was just given as an example, you should find optimal value yourself (or even keep it variable, though triggers only will not be enough in that case).
Maintaining deletes and updates must fire on each operation but you can still inspect if the id of the record to be deleted or updated is in the stats already or not to avoid the unnecessary updates (will save some I/O).
Note: The analysis is done for the domain with discreet attributes; when dealing with time series the situation gets more complicated - you have to decide the granularity of the domain in which you want to keep the summary.
EDIT
There are also materialized views, 2, 3
Just a guess, but indexes won't help much since average must read all the record (in any order), indexes are usefull the find subsets of rows, ubt if you have to iterate on all rows with no special ordering indexes are not helping...
This might not be what you're looking for, but if your table has some way to order the data (e.g. by date), then you can just do incremental computations and store the results.
For example, if your data has a date column, you could compute the average for records 1 - Date1 then store the average for that batch along with Date1 and the #records you averaged. The next time you compute, you restrict your query to results Date1..Date2, and add the # of records, and update the last date queried. You have all the information you need to compute the new average.
When doing this, it would obviously be helpful to have an index on the date, or whatever column(s) you are using for the ordering.

Organizing lots of timestamped values in a DB (sql / nosql)

I have a device I'm polling for lots of different fields, every x milliseconds
the device returns a list of ids and values which I need to store with a time stamp in a DB of sorts.
Users of the system need to be able to query this DB for historic logs to create graphs, or query the last timestamp for each value.
A simple approach would be to define a MySQL table with
id,value_id,timestamp,value
and let users select
Select value form t where value_id=x order by timestamp desc limit 1
and just push everything there with index on timestamp and id, But my question is what's the best approach performance / size wise for designing the schema? or using nosql? can anyone comment on possible design trade offs. Will such a design scale with millions of records?
When you say "... or query the last timestamp for each value" is this what you had in mind?
select max(timestamp) from T where value = ?
If you have millions of records, and the above is what you meant (i.e. value is alone in the WHERE clause), then you'd need an index on the value column, otherwise you'd have to do a full table scan. But if queries will ALWAYS have [timestamp] column in the WHERE clause, you do not need an index on [value] column if there's an index on timestamp.
You need an index on the timestamp column if your users will issue queries where the timestamp column appears alone in the WHERE clause:
select * from T where timestamp > x and timestamp < y
You could index all three columns, but you want to make sure the writes do not slow down because of the indexing overhead.
The rule of thumb when you have a very large database is that every query should be able to make use of an index, so you can avoid a full table scan.
EDIT:
Adding some additional remarks after your clarification.
I am wondering how you will know the id? Is [id] perhaps a product code?
A single simple index on id might not scale very well if there are not many different product codes, i.e. if it's a low-cardinality index. The rebalancing of the trees could slow down the batch inserts that are happening every x milliseconds. A composite index on (id,timestamp) would be better than a simple index.
If you rarely need to sort multiple products but are most often selecting based on a single product-code, then a non-traditional DBMS that uses a hashed-key sparse-table rather than a b-tree might be a very viable even a superior alternative for you. In such a database, all of the records for a given key would be found physically on the same set of contiguous "pages"; the hashing algorithm looks at the key and returns the page number where the record will be found. There is no need to rebalance an index as there isn't an index, and so you completely avoid the related scaling worries.
However, while hashed-file databases excel at low-overhead nearly instant retrieval based on a key value, they tend to be poor performers at sorting large groups of records on an attribute, because the data are not stored physically in any meaningful order, and gathering the records can involve much thrashing. In your case, timestamp would be that attribute. If I were in your shoes, I would base my decision on the cardinality of the id: in a dataset of a million records, how many DISTINCT ids would be found?
YET ANOTHER EDIT SINCE THE SITE IS NOT LETTING ME ADD ANOTHER ANSWER:
Simplest way is to have two tables, one with the ongoing history, which is always having new values inserted, and the other, containing only 250 records, one per part, where the latest value overwrites/replaces the previous one.
Update latest
set value = x
where id = ?
You have a choice of
indexes (composite; covering value_id, timestamp and value, or some combination of them): you should test performance with different indexes; composite and non-composite, also be aware that there are quite a few significantly different ways to get 'max per group' (search so, especially mysql version with variables)
triggers - you might use triggers to maintain max row values in another table (best performance of further selects; this is redundant and could be kept in memory)
lazy statistics/triggers, since your database is updated quite often you can save cycles if you update your statistics periodically (if you can allow the stats to be y seconds old and if you poll 1000 / x times a second, then you potentially save y * 100 / x potential updates; and this can be noticeable, especially in terms of scalability)
The above is true if you are looking for last bit of performance, if not keep it simple.

Appropriate query and indexes for a logging table in SQL

Assume a table named 'log', there are huge records in it.
The application usually retrieves data by simple SQL:
SELECT *
FROM log
WHERE logLevel=2 AND (creationData BETWEEN ? AND ?)
logLevel and creationData have indexes, but the number of records makes it take longer to retrieve data.
How do we fix this?
Look at your execution plan / "EXPLAIN PLAN" result - if you are retrieving large amounts of data then there is very little that you can do to improve performance - you could try changing your SELECT statement to only include columns you are interested in, however it won't change the number of logical reads that you are doing and so I suspect it will only have a neglible effect on performance.
If you are only retrieving small numbers of records then an index of LogLevel and an index on CreationDate should do the trick.
UPDATE: SQL server is mostly geared around querying small subsets of massive databases (e.g. returning a single customer record out of a database of millions). Its not really geared up for returning truly large data sets. If the amount of data that you are returning is genuinely large then there is only a certain amount that you will be able to do and so I'd have to ask:
What is it that you are actually trying to achieve?
If you are displaying log messages to a user, then they are only going to be interested in a small subset at a time, and so you might also want to look into efficient methods of paging SQL data - if you are only returning even say 500 or so records at a time it should still be very fast.
If you are trying to do some sort of statistical analysis then you might want to replicate your data into a data store more suited to statistical analysis. (Not sure what however, that isn't my area of expertise)
1: Never use Select *
2: make sure your indexes are correct, and your statistics are up-to-date
3: (Optional) If you find you're not looking at log data past a certain time (in my experience, if it happened more than a week ago, I'm probably not going to need the log for it) set up a job to archive that to some back-up, and then remove unused records. That will keep the table size down reducing the amount of time it takes search the table.
Depending on what kinda of SQL database you're using, you might look into Horizaontal Partitioning. Oftentimes, this can be done entirely on the database side of things so you won't need to change your code.
Do you need all columns? First step should be to select only those you actually need to retrieve.
Another aspect is what you do with the data after it arrives to your application (populate a data set/read it sequentially/?).
There can be some potential for improvement on the side of the processing application.
You should answer yourself these questions:
Do you need to hold all the returned data in memory at once? How much memory do you allocate per row on the retrieving side? How much memory do you need at once? Can you reuse some memory?
A couple of things
do you need all the columns, people usually do SELECT * because they are too lazy to list 5 columns of the 15 that the table has.
Get more RAM, themore RAM you have the more data can live in cache which is 1000 times faster than reading from disk
For me there are two things that you can do,
Partition the table horizontally based on the date column
Use the concept of pre-aggregation.
Pre-aggregation:
In preagg you would have a "logs" table, "logs_temp" table, a "logs_summary" table and a "logs_archive" table. The structure of logs and logs_temp table is identical. The flow of application would be in this way, all logs are logged in the logs table, then every hour a cron job runs that does the following things:
a. Copy the data from the logs table to "logs_temp" table and empty the logs table. This can be done using the Shadow Table trick.
b. Aggregate the logs for that particular hour from the logs_temp table
c. Save the aggregated results in the summary table
d. Copy the records from the logs_temp table to the logs_archive table and then empty the logs_temp table.
This way results are pre-aggregated in the summary table.
Whenever you wish to select the result, you would select it from the summary table.
This way the selects are very fast, because the number of records are far less as the data has been pre-aggregated per hour. You could even increase the threshold from an hour to a day. It all depends on your needs.
Now the inserts would be fast too, because the amount of data is not much in the logs table as it holds the data only for the last hour, so index regeneration on inserts would take very less time as compared to very large data-set hence making the inserts fast.
You can read more about Shadow Table trick here
I employed the pre-aggregation method in a news website built on wordpress. I had to develop a plugin for the news website that would show recently popular (popular during the last 3 days) news items, and there are like 100K hits per day, and this pre-aggregation thing has really helped us a lot. The query time came down from more than 2 secs to under a second. I intend on making the plugin publically available soon.
As per other answers, do not use 'select *' unless you really need all the fields.
logLevel and creationData have indexes
You need a single index with both values, what order you put them in will affect performance, but assuming you have a small number of possible loglevel values (and the data is not skewed) you'll get better performance putting creationData first.
Note that optimally an index will reduce the cost of a query to log(N) i.e. it will still get slower as the number of records increases.
C.
I really hope that by creationData you mean creationDate.
First of all, it is not enough to have indexes on logLevel and creationData. If you have 2 separate indexes, Oracle will only be able to use 1.
What you need is a single index on both fields:
CREATE INDEX i_log_1 ON log (creationData, logLevel);
Note that I put creationData first. This way, if you only put that field in the WHERE clause, it will still be able to use the index. (Filtering on just date seems more likely scenario that on just log level).
Then, make sure the table is populated with data (as much data as you will use in production) and refresh the statistics on the table.
If the table is large (at least few hundred thousand rows), use the following code to refresh the statistics:
DECLARE
l_ownname VARCHAR2(255) := 'owner'; -- Owner (schema) of table to analyze
l_tabname VARCHAR2(255) := 'log'; -- Table to analyze
l_estimate_percent NUMBER(3) := 5; -- Percentage of rows to estimate (NULL means compute)
BEGIN
dbms_stats.gather_table_stats (
ownname => l_ownname ,
tabname => l_tabname,
estimate_percent => l_estimate_percent,
method_opt => 'FOR ALL INDEXED COLUMNS',
cascade => TRUE
);
END;
Otherwise, if the table is small, use
ANALYZE TABLE log COMPUTE STATISTICS FOR ALL INDEXED COLUMNS;
Additionally, if the table grows large, you shoud consider to partition it by range on creationDate column. See these links for the details:
Oracle Documentation: Range Partitioning
OraFAQ: Range partitions
How to Create and Manage Partition Tables in Oracle