Is there any sql database not creating an index for a unique constraint? - sql

I have seen this question, but really, it's only about MySQL. Is there any SQL database out there that does not create an index for a unique constraint?

In one sense, no one can give you a definitive answer. As we speak, someone could be creating that very thing. But it's a fair bet that any DBMS you've heard of or are likely to hear of will use indexes to enforce uniqueness, because that's what the science dictates.
DBMSs use indexes for this because searching them is quick. The index uses some kind of structure that supports a binary search, providing O(log N) time complexity.
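You can see this directly in MySQL, for example (the table, column, and constraint names here are just illustrative):
CREATE TABLE accounts (
    id    INT NOT NULL PRIMARY KEY,
    email VARCHAR(100) NOT NULL,
    CONSTRAINT uq_accounts_email UNIQUE (email)
);
-- The unique constraint shows up as a unique index on email:
SHOW INDEX FROM accounts;
Other DBMSs generally expose the same information through their system catalogs.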
Consider what the system would have to do without such a structure.
for each row to be inserted
    scan all rows in table
    error if found
In the best case -- when there's no error -- each inserted row would cause a scan of the entire table. That's O(n*m) complexity, which when n and m are of similar size is quadratic time.
Suppose for example you're inserting 10,000 rows into a 10,000-row table. You're looking at 100,000,000 = 10,000 * 10,000 comparisons! A binary search, by contrast, requires ~13 comparisons for 10,000 rows, and ~14 for 20,000. Because we're inserting into the same table we're comparing against, the average is about 14 comparisons per insert, so the total is roughly 140,000 = 14 * 10,000, or about 0.14% of the work.
Databases are all about scale, and quadratic time even at modest scale is infeasible.
On an ordinary machine I have handy, a simple program to compare two unsorted arrays of 10,000 integers takes 0.1 seconds. As we might expect, 100,000 integers takes 10 seconds, 100 times longer. At 1,000,000 integers, we could expect 1000 seconds, or about 15 minutes. A cool billion would take a million times longer, until sometime in the year 2042.
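If you want to see that difference from inside the database rather than from a standalone benchmark, MySQL's EXPLAIN makes it visible (accounts_noindex here is a hypothetical copy of the table without the unique index; exact output varies by version):
-- Without an index on email, a duplicate check has to look at every row:
EXPLAIN SELECT 1 FROM accounts_noindex WHERE email = 'a@example.com';
-- typically: type = ALL, rows roughly the size of the table

-- With the unique index, the same check is a single B-tree lookup:
EXPLAIN SELECT 1 FROM accounts WHERE email = 'a@example.com';
-- typically: type = const, rows = 1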
Rob Pike likes to say, "Fancy algorithms are slow when n is small, and n is usually small." It's true. But his rule #5 is just as important: "Data dominates."

Related

Is the time complexity of querying an indexed column O(1)?

Let's suppose that table A has a column named X which is numeric and indexed.
If the query is something like:
find all rows where X is greater than some value
Is the time complexity of retrieving the result O(1)?
In other words, it does not matter whether table A has 1 million rows versus 10 billion rows?
Question 2:
Let's suppose that table A has another column Y which is also numeric and indexed.
If the query is now:
find all rows where
X is greater than some value
AND
Y is smaller than some value
Would this query take twice as long as the first query?
This is a very vague question, so let me break it apart into several cases.
Firstly, nothing here is O(1): regardless of how you fetch your data, the work you do scales with the size of the data.
Case 1 - no indexes that support the queries exist.
In this case, no matter which query you use, Mongo will perform a "collection scan": all data in the collection is checked to see whether it matches the query, which in complexity terms is O(N). This is true for both queries, so their overall complexity is the same.
Case 2 - an index exists that satisfies both queries ( { x: 1, y: 1 } ).
In this case Mongo will perform an "index scan": it scans the index trees (B-trees) instead of the entire collection, giving you logarithmic complexity. I'm not entirely sure of the exact bound, as it depends on how Mongo implements these things, but overall it should be around O(t log(n)) for query 1, with t being the number of matching documents. Because a compound index nests the index trees, the complexity for query 2 should be the same times some constant.
Now we can answer both questions:
In other words, it does not matter whether table A has 1 million rows versus 10 billion rows?
Obviously it matters. The asymptotic complexity of each search is the same regardless of scale, but in real-life terms scale matters a great deal: work proportional to 1 million rows is not the same as work proportional to 10 billion rows, even if the growth rate is the same.
Would this query take twice as long as the first query?
This is a little harder to answer, and I would argue it's more dependent on scale than anything else. For case 1 (collection scan) and smallish scale, both queries will probably run in around the same time. The best way for you to answer this is to run your own benchmarks that match your use case.
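For what it's worth, in SQL terms (the question is phrased against a table) the Case 2 setup and the two queries would look roughly like this; running EXPLAIN on them tells you whether and how the index is used (the index name and constants are placeholders):
-- Compound index corresponding to { x: 1, y: 1 }
CREATE INDEX idx_a_x_y ON A (X, Y);

-- Query 1
EXPLAIN SELECT * FROM A WHERE X > 100;

-- Query 2
EXPLAIN SELECT * FROM A WHERE X > 100 AND Y < 50;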

What is the mathematical relationship between "no. of rows affected" and "execution time" of a sql query?

The query remains constant, i.e. it will remain the same.
e.g. a select query takes 30 minutes if it returns 10000 rows.
Would the same query take 1 hour if it has to return 20000 rows?
I am interested in knowing the mathematical relation between the no. of rows (N) and execution time (T), keeping other parameters constant (K).
i.e. T = N*K, or
T = N*K + C, or
any other formula?
I am reading http://research.microsoft.com/pubs/76556/progress.pdf in case it helps. Anybody who understands it before me, please do reply. Thanks...
Well, that is a good question :), but there is no exact formula, because it depends on the execution plan.
The SQL query optimizer could choose a different execution plan for a query that returns a different number of rows.
I guess that if the execution plan is the same for both queries and you have some "lab" conditions, then the time growth could be linear. You should research more on SQL execution plans and statistics.
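For example, in MySQL you can compare the plans yourself; the optimizer may switch from an index range scan to a full table scan as the expected number of matching rows grows, which is exactly why the relationship is not a simple formula (table and column names are placeholders):
-- A selective predicate may be planned as an index range scan...
EXPLAIN SELECT * FROM orders WHERE created_at >= '2024-01-01';

-- ...while a predicate matching most of the table may be planned as a full scan,
-- so the time per returned row is not constant between the two.
EXPLAIN SELECT * FROM orders WHERE created_at >= '2000-01-01';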
Take the very simple example of reading every row in a single table.
In the worst case, you will have to read every page of the table from your underlying storage, and the worst case for each page read is a random seek. The seek time will then dominate all other factors, so you can estimate the total time.
time ~= seek time x number of data pages
Assuming your rows are of a fairly regular size, this is linear in the number of rows.
However, databases do a number of things to try to avoid this worst case. For example, in SQL Server table storage is often allocated in extents of 8 consecutive pages. A hard drive has a much faster streaming IO rate than random IO rate, and if you have a clustered index, reading the pages in cluster order tends to produce a lot more streaming IO than random IO.
The best case time, ignoring memory caching, is (8KB is the SQL Server page size)
time ~= 8KB * number of data pages / streaming IO rate in KB/s
This is also linear in the number of rows.
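To put rough, purely illustrative numbers on those two formulas (not measurements from any particular system):
1,000,000 rows at ~100 rows per 8KB page    ->  ~10,000 pages (~80MB)
best case:   80MB / 100 MB/s streaming rate    ->  ~0.8 seconds
worst case:  10,000 random seeks at ~10ms each ->  ~100 seconds
Doubling the row count roughly doubles both estimates, which is the linearity being described.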
As long as you do a reasonable job managing fragmentation, you could reasonably extrapolate linearly in this simple case. This assumes your data is much larger than the buffer cache. If not, you also have to worry about the cliff edge where your query changes from reading from buffer to reading from disk.
I'm also ignoring details like parallel storage paths and access.

mysql - Creating rows vs. columns performance

I built an analytics engine that pulls 50-100 rows of raw data from my database (let's call it raw_table), runs a bunch of statistical measurements on it in PHP, and then comes up with exactly 140 datapoints that I then need to store in another table (let's call it results_table). All of these data points are very small ints ("40", "2.23", "-1024" are good examples of the types of data).
I know the maximum # of columns for mysql is quite high (4000+) but there appears to be a lot of grey area as far as when performance really starts to degrade.
So a few questions here on best performance practices:
1) The 140 datapoints could be, if it is better, broken up into 20 rows of 7 data points all with the same 'experiment_id' if fewer columns is better. HOWEVER I would always need to pull ALL 20 rows (with 7 columns each, plus id, etc) so I wouldn't think this would be better performance than pulling 1 row of 140 columns. So the question: is it better to store 20 rows of 7-9 columns (that would all need to be pulled at once) or 1 row of 140-143 columns?
2) Given my data examples ("40","2.23","-1024" are good examples of what will be stored) I'm thinking smallint for the structure type. Any feedback there, performance-wise or otherwise?
3) Any other feedback on mysql performance issues or tips is welcome.
Thanks in advance for your input.
I think the advantage of storing the data as more rows (i.e. normalized) comes down to design and maintenance considerations in the face of change.
Also consider whether the 140 columns always have the same meaning or whether it differs per experiment, and model the data properly according to normalization rules, i.e. how the data relates to a candidate key.
As far as performance, if all the columns are used it makes very little difference. Sometimes a pivot/unpivot operation can be expensive over a large amount of data, but it makes little difference on a single key access pattern. Sometimes a pivot in the database can make your frontend code a lot simpler and backend code more flexible in the face of change.
If you have a lot of NULLs, it might be possible to eliminate rows in a normalized design and this would save space. I don't know if MySQL has support for a sparse table concept, which could come into play there.
You have 140 data items to return every time, each of type double.
It makes no practical difference whether this is 1x140 or 20x7 or 7x20 or 4x35, etc. One shape could of course be marginally quicker, but have you considered the extra complexity in the PHP code needed to deal with a different shape?
Do you have a verified bottleneck, or is this just random premature optimisation?
You've made no suggestion that you intend to store big data in the database, but for the purposes of this argument, I will assume that you have 1 billion (10^9) data points.
If you store them in 140 columns, you'll have a mere 7 million rows; however, if you want to retrieve a single data point from lots of experiments, the database will have to fetch a large number of very wide rows.
These very wide rows will take up more space in your innodb_buffer_pool, hence you won't be able to cache so many; this will potentially slow you down when you access them again.
If you store one datapoint per row, in a table with very few columns (experiment_id, datapoint_id, value) then you'll need to pull out the same number of smaller rows.
However, the size of the rows makes little difference to the number of IO operations required. If we assume that your 1 billion data points don't fit in RAM (which is NOT a safe assumption nowadays), the resulting performance may be approximately the same either way.
It is probably better database design to use few columns, but it will use less disk space and perhaps be faster to populate if you use lots of columns.
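For concreteness, the two layouts being compared would look roughly like this (table, column names, and types are hypothetical):
-- Wide layout: one row per experiment, 140 value columns
CREATE TABLE results_wide (
    experiment_id INT NOT NULL PRIMARY KEY,
    value_001     DOUBLE,
    value_002     DOUBLE
    -- ... and so on up to value_140
);

-- Narrow layout: one row per data point
CREATE TABLE results_narrow (
    experiment_id INT NOT NULL,
    datapoint_id  SMALLINT NOT NULL,
    value         DOUBLE,
    PRIMARY KEY (experiment_id, datapoint_id)
);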

What's the database performance improvement from storing as numbers rather than text?

Suppose I have text such as "Win", "Lose", "Incomplete", "Forfeit", etc. I can store the text directly in the database. If I instead use numbers such as 0 = Win, 1 = Lose, etc., would I get a material improvement in database performance? Specifically on queries where the field is part of my WHERE clause.
At the CPU level, comparing two fixed-size integers takes just one instruction, whereas comparing variable-length strings usually involves looping through each character. So for a very large dataset there should be a significant performance gain with using integers.
Moreover, a fixed-size integer will generally take less space and can allow the database engine to perform faster algorithms based on random seeking.
Most database systems, however, have an enum type which is meant for cases like yours: in the query you compare the field value against a fixed set of literals, while it is internally stored as an integer.
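In MySQL, for example, that looks like this (hypothetical table; the literals come from the question). The ENUM value is stored internally as a small integer:
CREATE TABLE games (
    id     INT NOT NULL PRIMARY KEY,
    result ENUM('Win', 'Lose', 'Incomplete', 'Forfeit') NOT NULL
);
-- You still write the readable literal in queries:
SELECT * FROM games WHERE result = 'Win';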
There might be significant performance gains if the column is used in an index.
It could range anywhere from negligible to extremely beneficial depending on the table size, the number of possible values being enumerated and the database engine / configuration.
That said, it almost certainly will never perform worse to use a number to represent an enumerated type.
Don't guess. Measure.
Performance depends on how selective the index is (how many distinct values are in it), whether critical information is available in the natural key, how long the natural key is, and so on. You really need to test with representative data.
When I was designing the database for my employer's operational data store, I built a testbed with tables designed around natural keys and with tables designed around id numbers. Both those schemas have more than 13 million rows of computer-generated sample data. In a few cases, queries on the id number schema outperformed the natural key schema by 50%. (So a complex query that took 20 seconds with id numbers took 30 seconds with natural keys.) But 80% of the test queries had faster SELECT performance against the natural key schema. And sometimes it was staggeringly faster--a difference of 30 to 1.
The reason, of course, is that lots of the queries on the natural key schema need no joins at all--the most commonly needed information is naturally carried in the natural key. (I know that sounds odd, but it happens surprisingly often. How often is probably application-dependent.) But zero joins is often going to be faster than three joins, even if you join on integers.
Clearly if your data structures are shorter, they are faster to compare AND faster to store and retrieve.
How much faster? 1x, 2x, 1000x? It all depends on the size of the table and so on.
For example: say you have a table with a productId and a varchar text column.
Each row will take roughly 4 bytes for the int and then another 3 to 24 bytes for the text in your example (depending on whether the column is nullable or Unicode).
Compare that to 5 bytes per row for the same data with a byte status column.
This huge space saving means more rows fit in a page, more data fits in the cache, fewer writes happen when you load or store data, and so on.
Also, comparing strings is at best as fast as comparing bytes, and at worst much slower.
There is a second huge issue with storing text where you intended to have an enum: what happens when people start storing "Incompete" as opposed to "Incomplete"?
Having a skinnier column means that you can fit more rows per page.
There is a HUGE difference between a varchar(20) and an integer.
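A minimal sketch of the two choices in MySQL terms (table and column names are made up):
-- Text status: up to ~20 bytes per value, and typo-prone
CREATE TABLE orders_text (
    product_id INT NOT NULL,
    status     VARCHAR(20) NOT NULL
);

-- Numeric status: 1 byte per value
CREATE TABLE orders_num (
    product_id INT NOT NULL,
    status     TINYINT NOT NULL  -- 0, 1, 2, ... mapped to the allowed status values
);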

Complexity of adding n entries to a database

What is the complexity, in big-O notation, of adding n entries to a database table with m entries and i indexes in MySQL, and afterwards committing?
Inserting into a MyISAM table without indexes takes O(n) (linear) time.
Inserting into an InnoDB table, and into any index, takes O(n log m) time, i.e. linear in n with a logarithmic factor from the number of already existing records (assuming m >> n), since InnoDB tables and indexes are B-trees.
Overall time is the sum of these values.
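As a rough back-of-the-envelope illustration of that sum (ignoring constants, caching, and transaction logging):
n = 10,000 new rows, m = 10,000,000 existing rows, i = 3 secondary indexes
per row:  ~log2(m) ≈ 23 comparisons per B-tree, over (1 + i) = 4 trees
total:    ~n * (1 + i) * log2(m) ≈ 10,000 * 4 * 23 ≈ 920,000 comparisons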
That would depend on the number of indexes you have in your tables, among other factors.
Each individual operation in a database has a different complexity. For example, the time complexity for B-Tree search operations is O(log n), and the time for an actual search depends on whether a table scan takes place, which is O(n).
I would imagine that you could build an equation that is quite complex for what you are describing. You would have to account for each operation individually, and I'm not sure that can be done in a deterministic way, given database systems' propensity for deciding in an ad hoc way how they will execute things using query plans, etc.