What's more efficient: string searches, or joins through SQL? - sql

I have two lists of words and I need to find matches (intersection of the two sets.)
Should I store each list as a string and find matches through string functions (like a regular expression) or store the words in a table, and have SQL find matches by joining?

It is almost impossible to say without more information about the problem. Here are some things to consider:
How many different distinct items do you have?
How many different combinations would be on a typical row?
Do your searches require looking for wildcards?
How long are the individual items?
Specifics on the database engine and hardware you are running on.
I want to emphasize that in almost all situations, you want to store the values in another table. Performance is not necessarily the primary reason. More important are ease of updating and deleting individual values, and the ability to support many more types of queries (such as a list of all available values).
But, we can still think about the performance issues. Storing values in a single string simply requires fetching the page with the record on it, and then applying a function that goes through the string. For simple patterns (such as identifying the presence of a fixed substring), this should go quite fast. There are few things that computers do faster than looping through strings and comparing values (assuming a reasonable implementation).
In the fastest possible join, both tables need to be read in, and the keys need to be matched. This requires additional effort. The situation is even worse, because you really want two additional tables, one for the individual string items and the other for the relationship between the original records and the items.
At this point, you may think "gosh, strings seem like a better idea". This is wrong. One of the big differences is in average size. If you items are, on average, longer than, say, 4 characters, then you save space by using a reference table. This saved space immediately translates into improved performance, because there is less I/O. With indexes, the additional tables would be in memory anyway, so the matching would be quite fast.
And, there is the issue of querying. You can use standard SQL functions for queries such as records that have A and B (many string functions are database specific). You can easily find out exactly which items are in the database, and relatively easily find what pairs exist on records. You can keep track of when an item is added to a record, and the first time it appears in the database. Generally, this flexible functionality -- which is just basic SQL functionality -- is what you need when managing this type of data.

Storing in a table will be much faster than a SQL string manipulation function in most circumstances especially if you can index the words.

I think you're asking if this:
SELECT word FROM table_one WHERE word in (SELECT word FROM table_two)
is faster than this:
SELECT table_one.word FROM table_one
INNER JOIN table_two ON table_one.word = table_two.word
The first answer should be faster, because the second creates a (potentially large) temporary object (the joined table).
Note that I assume you have an index on word. Also: if the strings are very long (URLs, for example), this will be very slow, and you should match on a hash instead.

Related

How to generate a numeric identifier for entries based on a string

I'm working in Redshift SQL syntax, and want to know a way to convert a string id for each entry in a table to a numeric id (since numeric joins between tables are supposedly much quicker and more efficient than string joins).
Currently the ids look like this - a bunch of strings with both numbers and letters
01r00001ABCDeAAF
01r00001IJKLmAAN
...
01r00001OPQRtAAN
What I would like is to turn this into a purely numeric identifier, using the string id as an input and ensuring that each output is unique and corresponds only to a single input with no collisions (which can be replicated across tables so that accurate joins are possible).
I've tried using some hash functions within SQL like CHECKSUM() and BINARY_CHECKSUM() over the columns, but I'm a little unclear which would be the most applicable here - I understand some are case-sensitive and others aren't, while some generate collisions and others don't.
First, your reference for strings versus integers is based on an entirely different database. I would not generalize from SQL Server performance to other databases, particularly a massively parallel columnar database. There is also a lot of information that is taken out of context and generalized to wrong situations.
Second, you can test on tables in Amazon Redshift. Generating the data and doing the tests should be faster than modifying existing data. You will probably find no need to change anything.
You need to understand what is happening "under the hood" before making a change like this, particularly if you think it is for performance reasons.
Strings can be troublesome for a variety of reasons. First, they can have different collations or character sets -- information that is hidden. Such differences would preclude the use of indexes -- a major hit in a database such as SQL Server. Not using indexes is generally not an issue in Redshift.
Strings can also have variable lengths. This makes indexes slightly less efficient. They also require a wee bit more overhead to compare than numbers, because those collations and character sets need to be taken into account. They also need to be compared character-by-character, whereas most hardware has built-in comparisons for numbers. The extra cycles here is usually minimal compared to the cost of moving data.
When you do a join in Amazon Redshift, the first thing it is going to do is collocate the data, probably by hashing the values and sending the data to the same nodes in the parallel environment. Moving the data is expensive. Hashing the values, much less so.
In Redshift, you should be more concerned about how your data is distributed. Although I haven't tested it, adding a new column that is a number might make the query more expensive, because in a columnar database, the number of columns referenced has an impact on performance.

Use separate fields or single field as CSV in SQL

I need to save about 500 values in structured database (SQL, postgresql) or whatever. So what is the best way to store data. Is it to take 500 fields or single field as (CSV) comma separated values.
What would be the pros and cons.
What would be easy to maintain.
What would be better to retrieve data.
A comma-separated value is just about never the right way to store values.
The traditional SQL method would be a junction or association table, with one row per field and per entity. This multiplies the number of rows, but that is okay, because databases can handle big tables. This has several advantages, though:
Foreign key relationships can be properly defined.
The correct type can be implemented for the object.
Check constraints are more naturally written.
Indexes can be built, incorporating the column and improving performance.
Queries do not need to depend on string functions (which might be slow).
Postgres also supports two other methods for such data, arrays and JSON-encoding. Under some circumstances one or the other might be appropriate as well. A comma-separated string would almost never be the right choice.

Postgresql performance comparison between arrays and joins

What are the performance implications in postgres of using an array to store values as compared to creating another table to store the values with a has-many relationship?
I have one table that needs to be able to store anywhere from about 1-100 different string values in either an array column or a separate table. These values will need to be frequently searched for exact matches, so lookup performance is critical. Would the array solution be faster, or would it be faster to use joins to lookup the values in the separate table?
These values will need to be frequently searched
Searched how? This is crucial.
Prefix pattern match only? Infix/suffix pattern matches too? Fuzzy string search / similarity matching? Stubbing and normalization for root words, de-pluralization? Synonym search? Is the data character sequences or natural language text? One language, or multiple different languages?
Hand-waving around "searched" makes any answer that ignores that part pretty much invalid.
so lookup performance is critical. Would the array solution be faster, or would it be faster to use joins to lookup the values in the separate table?
Impossible to be strictly sure without proper info on the data you're searching.
Searching text fields is much more flexible, giving you many options you don't have with an array search. It also generally reduces the amount of data that must be read.
In general, I strongly second Clodaldo: Design it right. Optimize later, if you need to.
According to the official PostgreSQL reference documentation, searching for specific elements in a table is expected to perform better than in an array
https://www.postgresql.org/docs/current/arrays.html#ARRAYS-SEARCHING :
Arrays are not sets; searching for specific array elements can be a
sign of database misdesign. Consider using a separate table with a row
for each item that would be an array element. This will be easier to
search, and is likely to scale better for a large number of elements.
The reason for the worse search performance on array elements than on tables could be that arrays are internally stored as strings as stated here
https://www.postgresql.org/message-id/op.swbsduk5v14azh%40oren-mazors-computer.local
the array is actually stored as a string by postgres. a string that
happens to have lots of brackets in it.
although I could not corroborate this statement by any official PostgreSQL documentation. I also do not have any evidence that handling well-structured strings is necessarily less performant than handling tables.

MySQL Table with TEXT column

I've been working on a database and I have to deal with a TEXT field.
Now, I believe I've seen some place mentioning it would be best to isolate the TEXT column from the rest of the table(putting it in a table of its own).
However, now I can't find this reference anywhere and since it was quite a while ago, I'm starting to think that maybe I misinterpreted this information.
Some research revealed this, suggesting that
Separate text/blobs from metadata, don't put text/blobs in results if you don't need them.
However, I am not familiar with the definition of "metadata" being used here.
So I wonder if there are any relevant advantages in putting a TEXT column in a table of its own. What are the potential problems of having it with the rest of the fields? And potential problems of keeping it in a separated table?
This table(without the TEXT field) is supposed to be searched(SELECTed) rather frequently. Is "premature optimization considered evil" important here? (If there really is a penalty in TEXT columns, how relevant is it, considering it is fairly easy to change this later if needed).
Besides, are there any good links on this topic? (Perhaps stackoverflow questions&answers? I've tried to search this topic but I only found TEXT vs VARCHAR discussions)
Yep, it seems you've misinterpreted the meaning of the sentence. What it says is that you should only do a SELECT including a TEXT field if you really need the contents of that field. This is because TEXT/BLOB columns can contain huge amounts of data which would need to be delivered to your application - this takes time and of course resources.
Best wishes,
Fabian
This is probably premature optimisation. Performance tuning MySQL is really tricky and can only be done with real performance data for your application. I've seen plenty of attempts to second guess what makes MySQL slow without real data and the result each time has been a messy schema and complex code which will actually make performance tuning harder later on.
Start with a normalised simple schema, then when something proves too slow add a complexity only where/if needed.
As others have pointed out the quote you mentioned is more applicable to query results than the schema definition, in any case your choice of storage engine would affect the validity of the advice anyway.
If you do find yourself needing to add the complexity of moving TEXT/BLOB columns to a separate table, then it's probably worth considering the option of moving them out of the database altogether. Often file storage has advantages over database storage especially if you don't do any relational queries on the contents of the TEXT/BLOB column.
Basically, get some data before taking any MySQL tuning advice you get on the Internet, including this!
The data for a TEXT column is already stored separately. Whenever you SELECT * from a table with text column(s), each row in the result-set requires a lookup into the text storage area. This coupled with the very real possibility of huge amounts of data would be a big overhead to your system.
Moving the column to another table simply requires an additional lookup, one into the secondary table, and the normal one into the text storage area.
The only time that moving TEXT columns into another table will offer any benefit is if there it a tendency to usually select all columns from tables. This is merely introducing a second bad practice to compensate for the first. It should go without saying the two wrongs is not the same as three lefts.
The concern is that a large text field—like way over 8,192 bytes—will cause excessive paging and/or file i/o during complex queries on unindexed fields. In such cases, it's better to migrate the large field to another table and replace it with the new table's row id or index (which would then be metadata since it doesn't actually contain data).
The disadvantages are:
a) More complicated schema
b) If the large field is using inspected or retrieved, there is no advantage
c) Ensuring data consistency is more complicated and a potential source of database malaise.
There might be some good reasons to separate a text field out of your table definition. For instance, if you are using an ORM that loads the complete record no matter what, you might want to create a properties table to hold the text field so it doesn't load all the time. However if you are controlling the code 100%, for simplicity, leave the field on the table, then only select it when you need it to cut down on data trasfer and reading time.
Now, I believe I've seen some place mentioning it would be best to isolate the TEXT column from the rest of the table(putting it in a table of its own).
However, now I can't find this reference anywhere and since it was quite a while ago, I'm starting to think that maybe I misinterpreted this information.
You probably saw this, from the MySQL manual
http://dev.mysql.com/doc/refman/5.5/en/optimize-character.html
If a table contains string columns such as name and address, but many queries do not retrieve those columns, consider splitting the string columns into a separate table and using join queries with a foreign key when necessary. When MySQL retrieves any value from a row, it reads a data block containing all the columns of that row (and possibly other adjacent rows). Keeping each row small, with only the most frequently used columns, allows more rows to fit in each data block. Such compact tables reduce disk I/O and memory usage for common queries.
Which indeed is telling you that in MySQL you are discouraged from keeping TEXT data (and BLOB, as written elsewhere) in tables frequently searched

How (and where) should I combine one-to-many relationships?

I have a user table, and then a number of dependent tables with a one to many relationship
e.g. an email table, an address table and a groups table. (i.e. one user can have multiple email addresses, physical addresses and can be a member of many groups)
Is it better to:
Join all these tables, and process the heap of data in code,
Use something like GROUP_CONCAT and return one row, and split apart the fields in code,
Or query each table independently?
Thanks.
It really depends on how much data you have in the related tables and on how many users you're querying at a time.
Option 1 tends to be messy to deal with in code.
Option 2 tends to be messy to deal with as well in addition to the fact that grouping tends to be slow especially on large datasets.
Option 3 is easiest to deal with but generates more queries overall. If your data-set is small and you're not planning to scale much beyond your current needs its probably the best option. It's definitely the best option if you're only trying to display one record.
There is a fourth option however that is a middle of the road approach which I use in my job in which we deal with a very similar situation. Instead of getting the related records for each row 1 at a time, use IN() to get all of the related records for your results set. Then loop in your code to match them to the appropriate record for display. If you cache search queries you can cache that second query as well. Its only two queries and only one loop in the code (no parsing, use hashes to relate things by their key)
Personally, assuming my table indexes where up to scratch I'd going with a table join and get all the data out in one go and then process that to end up with a nested data structure. This way you're playing to each systems strengths.
Generally speaking, do the most efficient query for the situation you're in. So don't create a mega query that you use in all cases. Create case specific queries that return just the information you need.
In terms of processing the results, if you use GROUP_CONCAT you have to split all the resulting values during processing. If there are extra delimiter characters in your GROUP_CONCAT'd values, this can be problematic. My preferred method is to put the GROUPed BY field into a $holder during the output loop. Compare that field to the $holder each time through and change your output accordingly.