Computational Complexity of SQL Query

If I have a table of, say, blog posts, with columns such as post_id and author_id, and I run the SQL query "SELECT * FROM post_table WHERE author_id = 34", what would be the computational complexity of that query? Would it simply look through each row and check whether it has the correct author id, O(n), or does it do something more efficient?
I ask because I'm in a situation where I could either search an SQL database with this data, or load an XML file with a list of posts and search through those, and I was wondering which would be faster.

There are two basic ways that such a simple query would be executed.
The first is to do a full table scan. This would have O(n) performance.
The second is to look up the value in an index, then load the corresponding page and return the results. The index lookup should be O(log(n)); loading the page should be O(1).
With a more complicated query, it would be hard to make such a general statement. But any SQL engine is generally going to take one of these two paths. Oh, there is a third option if the table is partitioned on author_id, but you are probably not interested in that.
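If you want the index path, a minimal sketch (post_table and the column come from the question; the index name is illustrative):

CREATE INDEX idx_post_author ON post_table (author_id);

-- with the index in place, this should be an index lookup rather than a full scan
SELECT * FROM post_table WHERE author_id = 34;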
That said, the power of a database is not in these details. It is in the management of memory. The database will cache the data and index in memory, so you do not have to re-read data pages. The database will take advantage of multiple processors and multiple disks, so you do not have to code this. The database keeps everything consistent, in the face of updates and deletes.
As for your specific question. If the data is in the database, search it there. Loading all the data into an xml file and then doing the search in memory requires a lot of overhead. You would only want to do that if the connection to your database is slow and you are doing many such queries.

Have a look at the EXPLAIN command. It shows you what the database actually does when executing a given SELECT query.
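For example (the exact syntax and output vary by engine; this is the MySQL/PostgreSQL form):

EXPLAIN SELECT * FROM post_table WHERE author_id = 34;

The plan will show whether the engine chose a full table scan or an index lookup.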

Related

How do I improve performance when querying on a column that changes frequently in SQL Azure using LINQ to SQL

I have an SQL Azure database, and one of the tables contains over 400k objects. One of the columns in this table is a count of the number of times that the object has been downloaded.
I have several queries that include this particular column (call it timesdownloaded), sorted descending, in order to find the results.
Here's an example query in LINQ to SQL (I'm writing all this in C# .NET):
var query = from t in db.tablename
            where t.textcolumn.StartsWith(searchfield)
            orderby t.timesdownloaded descending
            select t.textcolumn;
// grab the first 5
var items = query.Take(5);
This query is called perhaps 90 times per minute on average.
Objects are downloaded perhaps 10 times per minute on average, so this timesdownloaded column is updated that frequently.
As you can imagine, any index involving the timesdownloaded column gets over 30% fragmented in a matter of hours. I have implemented an index maintenance plan that checks and rebuilds these indexes when necessary every few hours. This helps, but of course adds spikes in query response times whenever the indexes are rebuilt which I would like to avoid or minimize.
I have tried a variety of indexing schemes.
The best performing indexes are covering indexes that include both the textcolumn and timesdownloaded columns. When these indexes are rebuilt, the queries are amazingly quick of course.
However, these indexes fragment badly and I end up with pretty frequent delay spikes due to rebuilding indexes and other factors that I don't understand.
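For reference, the covering index I mean is roughly this shape, and the LINQ above compiles to roughly the query below it (all names are the placeholders from my example):

CREATE NONCLUSTERED INDEX IX_tablename_text_downloads
ON tablename (textcolumn, timesdownloaded);

SELECT TOP (5) textcolumn
FROM tablename
WHERE textcolumn LIKE @searchfield + '%'
ORDER BY timesdownloaded DESC;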
I have also tried simply not indexing the timesdownloaded column. This seems to perform more consistently overall, though slower of course. And when I check the SQL query execution plan, it seems to be pretty inconsistent in how SQL tries to optimize this query. Of course it ends up with a lot of logical reads, as it has to fetch the timesdownloaded column from the table rather than from an organized index. So this isn't optimal.
What I'm trying to figure out is if I am fundamentally missing something in how I have configured or manage this database.
I'm no SQL expert, and I've yet to find a good answer for how to do this.
I've seen some suggestions that Stored Procedures could help, but I don't understand why and haven't tried to get those going with LINQ just yet.
As commented below, I have considered caching but haven't taken that step yet either.
For some context, this query is a part of a search suggestion feature. So it is called frequently with many different search terms.
Any suggestions would be appreciated!
Based on the comments to my question and further testing, I ended up using an Azure Table to cache my results. This is working really well and I get a lot of hits off of my cache and many fewer SQL queries. The overall performance of my API is much better now.
I did try Azure In-Role Caching, but that approach doesn't appear to work well for my needs. It ended up using too much memory (no matter how I configured it, which I don't understand), swapped to disk like crazy, and brought my little Small instances to their knees. I don't want to pay more at the moment, so Tables it is.
Thanks for the suggestions!

Storing Search Results for Future Use

I have the following scenario: a search returns a list of user_id values (1,2,3,4,5,6... etc.). If the search were run again, the results would be guaranteed to change after some time had passed. However, I need to store that instance of the search results to be used in the future.
We have a current implementation (legacy), which creates a record for the search_id with the criteria and inserts every row returned into a different table with the associated search_id.
table search_results
    search_id    unsigned int, FK, PK (clustered index)
    user_id      unsigned int, FK
This is an unacceptable approach, as this table has grown to millions of records. I've considered partitioning the table, but that would leave me with numerous partitions (thousands).
I've already optimized the existing tables so that search results expire unless they're used elsewhere; all the search results that remain are referenced somewhere else.
In the current schema, I cannot store the results as serialized arrays or XML. I am looking to efficiently store the search result information, such that it can be efficiently accessed later without being burdened by the number of records.
EDIT: Thank you for the answers. I don't have any problems running the searches themselves; the result set in this case gets used for recipient lists, which will be used over and over again. The purpose of storing is exactly to have a snapshot of the data at the given time.
The answer is don't store query results. It's a terrible idea!
It introduces statefulness, which is very bad unless you really (really really) need it
It isn't scalable (as you're finding out)
The data is stale as soon as it's stored
The correct approach is to fix your query/database so it runs acceptably quickly.
If you can't make the queries faster using better SQL and/or indexes etc., I recommend using Lucene (or any text-based search engine) and denormalizing your database into it. Lucene queries are incredibly fast.
I recently did exactly this on a large web site that was doing what you're doing: it was caching query results from the production relational database in the session object in an attempt to speed up queries, but it was a mess and wasn't much faster anyway.
I put in Solr (a Lucene-based search server written in Java) and kept Solr up to date with the relational database (using work queues); the web queries now take just a few milliseconds.
Is there a reason why you need to store every search? Surely you would want the most up-to-date information available for the user?
I'll admit first, this isn't a great solution.
Setup another database alongside your current one [SYS_Searches]
The save script could use SELECT INTO [SYS_Searches].Results_{Search_ID}
The script that retrieves can do a simple SELECT out of the matching table.
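A minimal sketch of those two scripts (everything beyond SYS_Searches is illustrative; the application would substitute the real search id into the table name):

-- save: persist this run's matching user ids under, say, search id 12345
SELECT u.user_id
INTO [SYS_Searches].dbo.Results_12345
FROM users AS u
WHERE u.last_login >= '2010-01-01';  -- the original search criteria (illustrative)

-- retrieve: a simple read of the matching table
SELECT user_id FROM [SYS_Searches].dbo.Results_12345;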
Benefits:
Every search is neatly packed into its own table [preferably in another DB]
The retrieval query is very simple
The retrieval time should be very quick, no massive table scans.
Drawbacks:
You will end up with one table per stored search (x users * y searches each user can store).
This could get very silly very quickly unless there is management involved to expire results, or each user can only have one cached search result set.
Not pretty, but I can't think of another way.

Under what conditions would SELECT by PRIMARY KEY be slow?

Chasing down some DB performance issues in a fairly typical EclipseLink/JPA application.
I am seeing frequent queries that are taking 25-100ms. These are simple queries, just selecting all columns from a table where its primary key is equal to a value. They shouldn't be slow.
I'm looking at the query time in the Postgres log, using log_min_duration_statement, so this should eliminate any network or application overhead.
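(For reference, that's this setting in postgresql.conf; the 20 ms threshold here is just illustrative:)

log_min_duration_statement = 20   # log any statement that runs longer than 20 ms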
This query is not slow, but it is used very often.
Why would selecting * by primary key be slow?
Is this specific to postgres or is it a generic DB issue?
How can I speed this up? In general? For postgres?
Sample query from the pg log:
2010-07-28 08:19:08 PDT - LOG: duration: 61.405 ms statement: EXECUTE <unnamed> [PREPARE: SELECT coded_element_key, code_system, code_system_label, description, label, code, concept_key, alternate_code_key FROM coded_element WHERE (coded_element_key = $1)]
Table has around 3.5 million rows.
I have also run EXPLAIN and EXPLAIN ANALYZE on this query; it's only doing an index scan.
SELECT * makes your database work harder and, as a general rule, is a bad practice. There are tons of questions/answers on Stack Overflow talking about that.
Have you tried replacing * with the field names?
Could you be getting some kind of locking contention? What kind of locks are you taking when performing these queries?
Well, I don't know much about Postgres, so I'll give you a tip for MS SQL Server which might be applicable.
MS SQL Server has the concept of a clustered index, which is the physical layout of the data on the disk. It's good to use on a field where you'll be seeking a range between two values (date fields, mostly). It's not much use if you're looking for an exact value (like a primary key lookup). However, sometimes the primary key index is inadvertently set as a clustered index. This can turn an index lookup into a table scan.
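If that applies, a sketch of separating the two in MS SQL Server (all table, column, and index names here are illustrative):

-- declare the PK nonclustered so the clustered index can go on a range column
CREATE TABLE sales (
    sale_id   INT      NOT NULL CONSTRAINT PK_sales PRIMARY KEY NONCLUSTERED,
    sale_date DATETIME NOT NULL,
    amount    MONEY    NOT NULL
);
CREATE CLUSTERED INDEX IX_sales_date ON sales (sale_date);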
Is the row unusually large, or does it contain BLOBs or large binary fields?
Is this directly through a console, or is this query being run through some data access API like JDBC or ADO.NET? You mention JPA, which looks like a data access API. For short queries, the data access API becomes a larger percentage of execution time: creating the command, creating objects to hold the rows and cells, and so on.
SELECT * is almost always a very, very bad idea.
If the order of the fields changes, it will break your code.
According to comments, this isn't really important given the abstraction library you're using.
You're probably returning more data from the table than you actually want. Selecting for the specific fields you want can save transfer time.
25ms is about the lower bound you're going to see on almost any kind of SQL query -- that's only two disk accesses! You might want to look into ways to reduce the number of times the query is run rather than trying to optimize the query.

SQL Server Express performance issue

I know my questions will sound silly and probably nobody will have a perfect answer, but since I am at a complete dead end with this situation, it will make me feel better to post it here.
So...
I have a SQL Server Express database that's 500 MB. It contains 5 tables and maybe 30 stored procedures. This database is used to store articles and is used for the Developer It web site. Normally the web pages load quickly, let's say in 2 or 3 seconds, BUT the sqlservr process uses 100% of the processor for those 2 or 3 seconds.
I tried to find which stored procedure was the problem and could not find one. It seems like every read from the table that contains the articles is the problem (there are about 155,000 articles, and 20 or so get added every 15 minutes).
I added a few indexes, but without luck...
Is it because the table is full-text indexed?
Should I order by the primary key instead of by date? I never had any problems with ordering by dates...
Should I use dynamic SQL?
Should I add the primary key to the URL of the articles?
Should I use multiple indexes on separate columns, or one big index?
If you want more details or code bits, just ask.
Basically, every little hint is much appreciated.
Thanks.
If your index is not being used, then it usually indicates one of two problems:
Non-sargable predicate conditions, such as WHERE DATEPART(YY, Column) = <something>. Wrapping columns in a function will impair or eliminate the optimizer's ability to effectively use an index.
Non-covered columns in the output list, which is very likely if you're in the habit of writing SELECT * instead of SELECT specific_columns. If the index doesn't cover your query, then SQL Server needs to perform a RID/key lookup for every row, one by one, which can slow down the query so much that the optimizer just decides to do a table scan instead.
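To illustrate the first point, here's a non-sargable predicate and a sargable rewrite that lets an index on the column be used (articles and published_date are illustrative names):

-- non-sargable: wrapping the column in DATEPART hides it from the index
SELECT * FROM articles WHERE DATEPART(YY, published_date) = 2010;

-- sargable rewrite: a plain range lets the optimizer seek an index on published_date
SELECT * FROM articles
WHERE published_date >= '2010-01-01' AND published_date < '2011-01-01';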
See if one of these might apply to your situation; if you're still confused, I'd recommend updating the question with more information about your schema, the data, and the queries that are slow. 500 MB is very small for a SQL database, so this shouldn't be slow. Also post what's in the execution plan.
Use SQL Profiler to capture a lot of typical queries used in your app. Then run the profiler results through index tuning wizard. That will tell you what indexes can be added to optimize.
Then look at the worst performing queries and analyze their execution plans manually.

How do you optimize tables for specific queries?

What are the patterns you use to determine the frequent queries?
How do you select the optimization factors?
What are the types of changes one can make?
This is a nice question, if rather broad (and none the worse for that).
If I understand you, then you're asking how to attack the problem of optimisation starting from scratch.
The first question to ask is: "is there a performance problem?"
If there is no problem, then you're done. This is often the case. Nice.
On the other hand...
Determine Frequent Queries
Logging will get you your frequent queries.
If you're using some kind of data access layer, then it might be simple to add code to log all queries.
It is also a good idea to log when the query was executed and how long each query takes. This can give you an idea of where the problems are.
Also, ask the users which bits annoy them. If a slow response doesn't annoy the user, then it doesn't matter.
Select the optimization factors?
(I may be misunderstanding this part of the question)
You're looking for any patterns in the queries / response times.
These will typically be queries over large tables, or queries which join many tables in a single query - but if you log response times, you can be guided by those.
Types of changes one can make?
You're specifically asking about optimising tables.
Here are some of the things you can look for:
Denormalisation. This brings several tables together into one wider table, so instead of your query joining several tables together, you can just read one table. This is a very common and powerful technique. NB. I advise keeping the original normalised tables and building the denormalised table in addition - this way, you're not throwing anything away. How you keep it up to date is another question. You might use triggers on the underlying tables, or run a refresh process periodically.
Normalisation. This is not often considered to be an optimisation process, but it is in 2 cases:
Updates. Normalisation makes updates much faster, because each update is the smallest it can be (you are updating the smallest possible table, in terms of both columns and rows). This is almost the very definition of normalisation.
Querying a denormalised table to get information which exists on a much smaller (fewer rows) table may be causing a problem. In this case, store the normalised table as well as the denormalised one (see above).
Horizontal partitioning. This means making tables smaller by putting some rows in another, identical table. A common use case is to have all of this month's rows in table ThisMonthSales, and all older rows in table OldSales, where both tables have an identical schema (see the sketch after this list). If most queries are for recent data, this strategy can mean that 99% of all queries are only looking at 1% of the data - a huge performance win.
Vertical partitioning. This is chopping fields off a table and putting them in a new table which is joined back to the main table by the primary key. This can be useful for very wide tables (e.g. with dozens of fields), and may possibly help if tables are sparsely populated.
Indices. I'm not sure if your question covers these, but there are plenty of other answers on SO concerning the use of indices. A good way to find a case for an index is: find a slow query, look at the query plan, and find a table scan. Index fields on that table so as to remove the table scan. I can write more on this if required - leave a comment.
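As promised above, a small sketch of the horizontal partitioning and indexing ideas (ThisMonthSales and OldSales come from the example; customer_id is illustrative):

-- identical schemas, plus a view for the occasional query that needs both
CREATE VIEW AllSales AS
SELECT * FROM ThisMonthSales
UNION ALL
SELECT * FROM OldSales;

-- index recipe: the plan shows a table scan on OldSales filtered by customer_id, so
CREATE INDEX idx_oldsales_customer ON OldSales (customer_id);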
You might also like my post on this.
That's difficult to answer without knowing which system you're talking about.
In Oracle, for example, the Enterprise Manager lets you see which queries took up the most time, lets you compare different execution profiles, and lets you analyze queries over a block of time so that you don't add an index that's going to help one query at the expense of every other one you run.
Your question is a bit vague. Which DB platform?
If we are talking about SQL Server:
Use the Dynamic Management Views. Use SQL Profiler. Install the SP2 and the performance dashboard reports.
After determining the most costly queries (i.e. number of times run x cost of one query), examine their execution plans, look at the sizes of the tables involved, and check whether they are predominantly read, write, or a mixture of both.
If the system is under your full control (apps and DB), you can often re-write queries that are badly formed (quite a common occurrence), such as deep correlated sub-queries, which can often be re-written as derived table joins with a little thought. Otherwise, your options are to create covering non-clustered indexes and ensure that statistics are kept up to date.
For MySQL there is a feature called the slow query log.
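For example, in my.cnf (these are the MySQL 5.1+ setting names; older versions used log_slow_queries, and the threshold here is just illustrative):

slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time = 1   # log queries that take longer than 1 second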
The rest depends on what kind of data you have and how it is set up.
In SQL Server you can use a trace to find out how your query is performing, and Ctrl+K or Ctrl+L in the query tool to show its execution plan.
For example, if you see a full table scan happening on a table with a large number of records, it is probably not a good query.
A more specific question will definitely fetch you better answers.
If your table is predominantly read, place a clustered index on the table.
My experience is with mainly DB2 and a smattering of Oracle in the early days.
If your DBMS is any good, it will have the ability to collect stats on specific queries and explain the plan it used for extracting the data.
For example, if you have a table (x) with two columns (date and diskusage) and only have an index on date, the query:
select diskusage from x where date = '2008-01-01'
will be very efficient since it can use the index. On the other hand, the query
select date from x where diskusage > 90
would not be so efficient. In the former case, the "explain plan" would tell you that it could use the index. In the latter, it would have said that it had to do a table scan to get the rows (that's basically looking at every row to see if it matches).
Really intelligent DBMSs may also suggest what you should do to improve the performance (add an index on diskusage, in this case).
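In this example, that fix would simply be:

CREATE INDEX idx_x_diskusage ON x (diskusage);

After that, the second query can use the index instead of scanning every row.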
As to how to see what queries are being run, you can either collect that from the DBMS (if it allows it) or force everyone to do their queries through stored procedures, so that the DBAs control what the queries are - that's their job, keeping the DB running efficiently.
Indices on PKs and FKs, and one thing that always helps: PARTITIONING...
1. What are the patterns you use to determine the frequent queries?
Depends on the level at which you're dealing with the database. If you're a DBA or have access to the tools, DBs like Oracle allow you to run jobs and generate stats/reports over a specified period of time. If you're a developer writing an application against a DB, you can just do performance profiling within your app.
2. How do you select the optimization factors?
I try to get a general feel for how the table is being used and the data it contains. I go about it with the following questions.
Is it going to be updated a ton and on what fields do updates occur?
Does it have columns with low cardinality?
Is it worth indexing? (tables that are very small can be slowed down if accessed by an index)
How much maintenance/headache is it worth to have it run faster?
Ratio of updates/inserts vs queries?
etc.
3. What are the types of changes one can make?
-- If using Oracle, keep statistics up to date! =)
-- Normalization/denormalization: either one can improve performance, depending on the usage of the table. I almost always normalize, and denormalize only if there is no other practical way to make the query faster. A nice way to denormalize for queries, when your situation allows it, is to keep the real tables normalized and create a denormalized "table" with a materialized view (see the sketch after this list).
-- Index judiciously. Too many can be bad on many levels. Bitmap indexes are great in Oracle, as long as you're not updating the column frequently and that column has a low cardinality.
-- Use index-organized tables.
-- Partitioned and sub-partitioned tables and indexes
-- Use stored procedures to reduce round trips by applications, increase security, and enable query optimization without affecting users.
-- Pin tables in memory if appropriate (accessed a lot and fairly small)
-- Device partitioning between index and table database files.
..... the list goes on. =)
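As mentioned in point 3, the materialized-view flavour of denormalization might look like this in Oracle (all table and column names are illustrative):

CREATE MATERIALIZED VIEW order_summary
REFRESH COMPLETE ON DEMAND AS
SELECT c.customer_name, o.order_date, o.total
FROM customers c JOIN orders o ON o.customer_id = c.customer_id;

The base tables stay normalized; the view gives queries the pre-joined shape, refreshed on your schedule.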
Hope this is helpful for you.