INT vs VARCHAR in search - sql

Which one of the following queries will be faster and more optimal (and why):
SELECT * FROM items WHERE w = 320 AND h = 200 (w and h are INT)
SELECT * FROM items WHERE dimensions = '320x200'(dimensions is VARCHAR)

Here are some actual measurements. (Using SQLite; may try it with MySQL later.)
Data = All 1,000,000 combinations of w, h ∈ {1...1000}, in randomized order.
CREATE TABLE items (id INTEGER PRIMARY KEY, w INTEGER, h INTEGER)
Average time (of 20 runs) to execute SELECT * FROM items WHERE w = 320 and h = 200 was 5.39±0.29 µs.
CREATE TABLE items (id INTEGER PRIMARY KEY, dimensions TEXT)
Average time to execute SELECT * FROM items WHERE dimensions = '320x200' was 5.69±0.23 µs.
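For reference, the 1,000,000 test rows can be generated directly in SQLite with a recursive CTE (a sketch, not the script actually used for the measurements):
-- Two-column schema:
WITH RECURSIVE n(i) AS (SELECT 1 UNION ALL SELECT i + 1 FROM n WHERE i < 1000)
INSERT INTO items (w, h)
SELECT a.i, b.i FROM n AS a CROSS JOIN n AS b ORDER BY random();
-- Single-column schema:
WITH RECURSIVE n(i) AS (SELECT 1 UNION ALL SELECT i + 1 FROM n WHERE i < 1000)
INSERT INTO items (dimensions)
SELECT a.i || 'x' || b.i FROM n AS a CROSS JOIN n AS b ORDER BY random();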
There is no significant difference, efficiency-wise.
But
There is a huge difference in terms of usability. For example, if you want to calculate the area and perimeter of the rectangles, the two-column approach is easy:
SELECT w * h, 2 * (w + h) FROM items
Try to write the corresponding query for the other way.
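For the record, it can be done with string functions, but it is not pretty. A sketch in SQLite (the parsing of the 'WxH' format is illustrative):
SELECT CAST(substr(dimensions, 1, instr(dimensions, 'x') - 1) AS INTEGER) *
       CAST(substr(dimensions, instr(dimensions, 'x') + 1) AS INTEGER) AS area,
       2 * (CAST(substr(dimensions, 1, instr(dimensions, 'x') - 1) AS INTEGER) +
            CAST(substr(dimensions, instr(dimensions, 'x') + 1) AS INTEGER)) AS perimeter
FROM items;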

Intuitively, if you do not create indexes on those columns, integer comparison seems faster.
An integer comparison checks two 32-bit values for equality directly, in a single operation.
Strings, on the other hand, are character arrays and have to be compared character by character, which is more work.
Another point, however, is that in the 2nd query you have one field to compare, while in the 1st query you have two. If you have 1,000,000 records and no indexes on the columns, that means up to 1,000,000 string comparisons in the worst case (when the match is unluckily the last row examined, or there is no match at all).
On the other hand, if you have 1,000,000 records and all of them have w = 320, you will also be comparing h for each of them; that means 2,000,000 comparisons. However, if you create indexes on those fields, IMHO the two will be almost identical, since the VARCHAR key is either hashed (O(1)) or looked up through the index in O(log n), much like the integer lookup.
Conclusion: it depends. Prefer indexes on searchable columns, and use ints.
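For instance, a minimal sketch of the indexes for the two schemas (the index names are illustrative):
CREATE INDEX idx_items_w_h ON items (w, h);               -- composite index for the two-column schema
CREATE INDEX idx_items_dimensions ON items (dimensions);  -- index for the VARCHAR schema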

Probably the only way to know is to run it. I would suspect that if all the columns used are indexed, there would be basically no difference. Since an INT is 4 bytes, it will be almost the same size as the string.
The one wrinkle is in how VARCHAR is stored. A constant-size string type might be faster than VARCHAR, but mostly because your SELECT * needs to go fetch it.
The huge advantage of using INT is that you can do much more sophisticated filtering. That alone should be a reason to prefer it. What if you need a range, or just width, or you want to do math on width in the filtering? What about constraints based on the columns, or aggregates?
Also, when you get the values into your programming language, you won't need to parse them before using them (which takes time).
EDIT: Some other answers mention string compares. If the column is indexed, there won't be many string compares done. And it is possible to implement very fast comparison algorithms that don't need to loop byte by byte. You'd have to know the details of what MySQL does to know for sure.

The second query, because the chance of matching the exact string is smaller, which means a smaller set of matching records (greater cardinality).
In the first query the chance of matching the first column is higher, so more rows are potentially matched (lesser cardinality).
This assumes, of course, that indexes are defined for both scenarios.

The first one, because it is faster to compare numeric data.

It depends on the data and the available indexes. But it is quite possible for the VARCHAR version to be faster, because searching a single index can be faster than searching two. If the combination of values provides a unique (or "mostly" unique) result while each individual H/W value has multiple entries, it could narrow things down to a much smaller set using the single index.
On the other hand, if you have a multi-column index on the two integer columns, that would likely be the most efficient.

Related

Closest position between randomly moving objects

I have a large database table that contains grid references (X and Y) associated with various objects (each with a unique object identifier) as they move over time. The objects move at approximately constant speed but in random directions.
The table looks something like this….
CREATE TABLE positions (
objectId INTEGER,
x_coord INTEGER,
y_coord INTEGER,
posTime TIMESTAMP);
I want to find which two objects got closest to each other and at what time.
Finding the distance between two fixes is relatively easy – simple Pythagoras for the differences between the X and Y values should do the trick.
The first problem seems to be one of volume. The grid itself is large, 100,000 possible X co-ordinates and a similar number of Y co-ordinates. For any given time period the table might contain 10,000 grid reference positions for 1000 different objects – 10 million rows in total.
That’s not in itself a large number, but I can’t think of a way of avoiding doing a ‘product query’ to compare every fix to every other fix. Doing this with 10 million rows will produce 100 million million results.
The next issue is that I’m not just interested in the closest two fixes to each other, I’m interested in the closest two fixes from different objects.
Another issue is that I need to match time as well as position – I’m not just interested in two objects that have visited the same grid square, they need to have done so at the same time.
The other point (which may not be relevant) is that the items are unlikely to ever occupy exactly the same location at the same time.
I’ve got as far as a simple product query with a few sample rows, but I’m not sure of my next steps. I’m beginning to think this isn’t something I can pull off with a single SQL query (please prove me wrong), and I’m likely to have to extract the data and subject it to some procedural programming.
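Such a product query, restricted to matching times as described above, might look roughly like this (a sketch rather than the exact query used; Pythagorean distance, each pair counted once, LIMIT syntax varies by database):
SELECT a.objectId AS object_a, b.objectId AS object_b, a.posTime,
       sqrt(power(a.x_coord - b.x_coord, 2) + power(a.y_coord - b.y_coord, 2)) AS dist
FROM positions a
JOIN positions b
  ON b.posTime = a.posTime
 AND b.objectId > a.objectId
ORDER BY dist
LIMIT 1;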
Any suggestions?
I’m not sure what SE forum this best suited for – database SQL? Programming? Maths?
UPDATE - Another issue to add to the complexity: the timestamping for each object and position is irregular; one item might have a position recorded at 14:10:00 and another at 14:10:01. If those two positions are right next to each other and one second apart, they may actually represent the closest approach even though the times don't match!
In order to reduce the number of tested combinations you should segregate them by postime using subqueries. It is also recommended to create an index on postime to improve performance.
create index ix1_time on positions (postime);
Since you didn't mention any specific database I assumed PostgreSQL, since it's easy to use (for me). The solution should look like this:
with t as (
  select distinct postime as pt from positions
)
select t.pt, y.aid, y.bid, y.dist
from t
cross join lateral (          -- lateral: the subquery runs once per distinct postime
  select
    a.objectid as aid,
    b.objectid as bid,
    a.x_coord + a.y_coord + b.x_coord + b.y_coord as dist -- fix here!
  from positions a
  join positions b on b.postime = a.postime
  where a.postime = t.pt
    and a.objectid <> b.objectid
  order by dist               -- ascending: smallest distance first
  limit 1
) y;
This SQL compares the positions against each other grouped by postime: it will test 10 million combinations for each distinct postime value, but never across postime values.
Please note: I used a.x_coord + a.y_coord + b.x_coord + b.y_coord as the distance formula. I leave the correct one for you to implement here.
In total it will compute 10 million x 1000 time values: a total of 10 billion comparisons. It will return the closest pair for each postime, that is, 1000 rows in total.

Most efficient way to query for lat-long rectangle in SQL

I'm constantly making queries for a block of land within a given latitude/longitude rectangle. The coordinates are stored as individual double-precision values. I've created a single index over both columns, and the current query, which returns 15,240 tiles, takes 0.10 seconds on my local machine.
At the moment there are 23 million rows in the table, but there will be around 800 million upon completion, so I expect this query to get much slower.
Here's the query I'm running, with example values:
SELECT * FROM territories
WHERE nwlat < 47.606977 and nwlat > 47.506977
and nwlng < -122.232991 and nwlng > -122.338991;
Is there a more efficient way of doing this? I'm fairly new to large databases, so any help is appreciated. FYI, I'm using PostgreSQL.
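For reference, the existing index over both columns is presumably something along these lines (the index name is illustrative):
CREATE INDEX territories_nwlat_nwlng_idx ON territories (nwlat, nwlng);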
It would be much more efficient with a GiST or an SP-GiST index and a "box-contains-points" query ...
GiST index
The index is on a box with zero area, built from the same point (point(nwlat, nwlng)) twice.
There is a related code example in the manual for CREATE INDEX.
CREATE INDEX territories_box_gist_idx ON territories
USING gist (box(point(nwlat, nwlng), point(nwlat, nwlng)));
Query with the "overlaps" operator &&:
SELECT *
FROM territories
WHERE box(point(nwlat, nwlng), point(nwlat, nwlng))
&& '(47.606977, -122.232991), (47.506977, -122.338991)'::box;
SP-GiST index
Smaller index on just points:
CREATE INDEX territories_box_spgist_idx ON territories
USING spgist (point(nwlat, nwlng));
Query with the contains operator #>:
SELECT *
FROM territories
WHERE '(47.606977, -122.232991), (47.506977, -122.338991)'::box
#> point(nwlat, nwlng);
I get fastest results for the SP-GiST index in a simple test with 1M rows on Postgres 9.6.1.
For more sophisticated needs consider the PostGIS extension.
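A rough sketch of what that could look like, assuming the points were stored in a PostGIS geometry column named geom with SRID 4326 (which the table in the question does not have):
CREATE EXTENSION IF NOT EXISTS postgis;
CREATE INDEX territories_geom_gist_idx ON territories USING gist (geom);

SELECT *
FROM territories
WHERE geom && ST_MakeEnvelope(-122.338991, 47.506977, -122.232991, 47.606977, 4326);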

Should I reverse order a queryset before slicing the first N records, or count it to slice the last N records?

Let's say I want to get the last 50 records of a query that returns around 10k records, in a table with 1M records. I could do (at the computational cost of ordering):
data = MyModel.objects.filter(criteria=something).order_by('-pk')[:50]
I could also do (at the cost of 2 database hits):
# assume I don't care about new records being added between
# the two queries being executed
index = MyModel.objects.filter(criteria=something).count()
data = MyModel.objects.filter(criteria=something)[index-50:]
Which is better for just an ordinary relational database with no indexing on the criteria (eg postgres in my case; no columnar storage or anything fancy)? Most importantly, why?
Does the answer change if the table or queryset is significantly bigger (eg 100k records from a 10M row table)?
This one is going to be very slow:
data = MyModel.objects.filter(criteria=something)[index-50:]
Why? Because it translates into
SELECT * FROM myapp_mymodel OFFSET (index-50)
You are not enforcing any ordering here, so the server has to calculate the whole result set and then jump to the end of it; that involves a lot of reading and will be very slow. Let us not forget that count() queries aren't all that hot either.
OTOH, this one is going to be fast:
data = MyModel.objects.filter(criteria=something).order_by('-pk')[:50]
You are reverse-ordering on the primary key and taking the first 50. The first 50 in ascending order you can fetch equally quickly with
data = MyModel.objects.filter(criteria=something).order_by('pk')[:50]
So this is what you really should be doing
data1 = MyModel.objects.filter(criteria=something).order_by('-pk')[:50]
data2 = MyModel.objects.filter(criteria=something).order_by('pk')[:50]
The cost of ordering on the primary key is very low.
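For comparison, the fast queryset translates into roughly the following SQL (a sketch; the actual table, column and WHERE clause depend on your app, model and filter):
SELECT * FROM myapp_mymodel
WHERE criteria = 'something'   -- whatever filter(criteria=something) resolves to
ORDER BY id DESC
LIMIT 50;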

PostgreSQL, find strings differ by n characters

Suppose I have a table like this
id  data
1   0001
2   1000
3   2010
4   0120
5   0020
6   0002
sql fiddle demo
id is the primary key; data is a fixed-length string whose characters can be 0, 1 or 2.
Is there a way to build an index so I can quickly find the strings that differ from a given string by n characters? For example, for the string 0001 and n = 1 I want to get row 6.
Thanks.
There is the levenshtein() function, provided by the additional module fuzzystrmatch. It does exactly what you are asking for:
SELECT *
FROM a
WHERE levenshtein(data, '1110') = 1;
SQL Fiddle.
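The additional module has to be installed once per database before levenshtein() is available:
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;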
But it is not very fast: it is slow with big tables, because it can't use an index.
You might get somewhere with the similarity or distance operators provided by the additional module pg_trgm. Those can use a trigram index, as detailed in the linked manual pages. I did not get anywhere with them, though; the module uses a different definition of "similarity".
Generally the problem seems to fit in the KNN ("k nearest neighbours") search pattern.
If your case is as simple as the example in the question, you can use LIKE in combination with a trigram GIN index, which should be reasonably fast with big tables:
SELECT *
FROM a
WHERE data <> '1110'
AND (data LIKE '_110' OR
data LIKE '1_10' OR
data LIKE '11_0' OR
data LIKE '111_');
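The trigram GIN index mentioned above could be created like this (the index name is illustrative):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX a_data_gin_trgm_idx ON a USING gin (data gin_trgm_ops);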
Obviously, this technique quickly becomes unfeasible with longer strings and more than 1 difference.
However, since the string is so short, any query will match a rather big percentage of the base table. Therefore, index support will hardly buy you anything. Most of the time it will be faster for Postgres to scan sequentially.
I tested with 10k and 100k rows with and without a trigram GIN index. Since ~ 19% match the criteria for the given test case, a sequential scan is faster and levenshtein() still wins. For more selective queries matching less than around 5 % of the rows (depends), a query using an index is (much) faster.

Any idea why contains(...) querys so slow in SQL Server 2005

I've got a simple SELECT query which normally executes in under 1 second, but when I add a CONTAINS(column, 'text') predicate to the WHERE clause it suddenly runs for 20 seconds up to a minute. The table it selects from has around 208k rows.
Any idea what would cause this query to run so slowly just from adding the CONTAINS clause?
Substring matching is a computationally expensive operation. Is the field indexed? If this is a major feature implementation, consider a search-caching table so you can simply look up where the words exist.
Depending on the search keyword and the median length of the values in the column, it is logical that it would take a long time.
Consider searching for 'cookie' in a column with a median length of 100 characters, in a dataset of 200k rows.
Best case, with early outs, you would do 100 * 200k = 20m comparisons.
Worst case, near-missing on every compare, you would do (5 * 100) * 200k = 100m comparisons.
Generally I would:
reorder your query to filter out as much as possible before doing any string matching
limit number of the results if you don't need all of them at once (TOP x)
reduce the number characters in your search term
reduce the number of search terms by filtering out terms that are likely to match a lot, or not at all (if applicable)
cache query results if possible (however cache invalidation can get pretty tricky if you want to do it right)
Try this:
SELECT *
FROM table
WHERE CONTAINS((column1, column2, column3), '"*keyword*"')
Instead of this:
SELECT *
FROM table
WHERE CONTAINS(column1, '"*keyword*"')
OR CONTAINS(column2, '"*keyword*"')
OR CONTAINS(column3, '"*keyword*"')
The first one is a lot faster.
CONTAINS does a lot of extra work. There are a few things to note here:
NVarChar is always faster, so do CONTAINS(column, N'text')
If all you want to do is see if the word is in there, compare the performance to column LIKE '%' + text + '%'.
Compare the query plans before and after: did it switch to a table scan? If so, post more details so we can figure out why.
As a last resort, you can break the text up into its individual words in a separate table so they can be indexed.
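A sketch of what such a word table might look like (all names are illustrative):
CREATE TABLE document_words (
    document_id INT NOT NULL,
    word NVARCHAR(100) NOT NULL
);
CREATE INDEX IX_document_words_word ON document_words (word);

-- Finding rows that contain a word then becomes a plain indexed lookup:
SELECT document_id FROM document_words WHERE word = N'keyword';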