What are the performance implications in Postgres of using an array to store values, compared to creating another table to store the values with a has-many relationship?
I have one table that needs to be able to store anywhere from about 1-100 different string values in either an array column or a separate table. These values will need to be frequently searched for exact matches, so lookup performance is critical. Would the array solution be faster, or would it be faster to use joins to look up the values in the separate table?
These values will need to be frequently searched
Searched how? This is crucial.
Prefix pattern match only? Infix/suffix pattern matches too? Fuzzy string search / similarity matching? Stemming and normalization for root words, de-pluralization? Synonym search? Is the data character sequences or natural language text? One language, or multiple different languages?
Hand-waving around "searched" makes any answer that ignores that part pretty much invalid.
so lookup performance is critical. Would the array solution be faster, or would it be faster to use joins to lookup the values in the separate table?
Impossible to be strictly sure without proper info on the data you're searching.
Searching text fields is much more flexible, giving you many options you don't have with an array search. It also generally reduces the amount of data that must be read.
In general, I strongly second Clodaldo: Design it right. Optimize later, if you need to.
According to the official PostgreSQL reference documentation, searching for specific elements in a table is expected to perform better than in an array
https://www.postgresql.org/docs/current/arrays.html#ARRAYS-SEARCHING :
Arrays are not sets; searching for specific array elements can be a
sign of database misdesign. Consider using a separate table with a row
for each item that would be an array element. This will be easier to
search, and is likely to scale better for a large number of elements.
The reason for the worse search performance on array elements than on tables could be that arrays are internally stored as strings as stated here
https://www.postgresql.org/message-id/op.swbsduk5v14azh%40oren-mazors-computer.local
the array is actually stored as a string by postgres. a string that
happens to have lots of brackets in it.
although I could not corroborate this statement with any official PostgreSQL documentation, and it may describe the text form in which arrays are entered and displayed rather than how they are stored. I also have no evidence that handling well-structured strings is necessarily slower than handling tables.
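For the exact-match case in the question, the separate-table layout that the documentation recommends might look like the following. This is only a minimal sketch: it uses Python's bundled sqlite3 module so that it runs anywhere, the table and column names (item, item_value) are made up for illustration, and the SQL is plain enough that it should carry over to Postgres with little change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    -- One row per value instead of one array per item.
    CREATE TABLE item_value (
        item_id INTEGER NOT NULL REFERENCES item(id),
        value   TEXT    NOT NULL,
        PRIMARY KEY (item_id, value)
    );
    -- Index led by value, so an exact-match search can use the index.
    CREATE INDEX item_value_value_idx ON item_value (value, item_id);
""")

conn.executemany("INSERT INTO item (id, name) VALUES (?, ?)",
                 [(1, "first"), (2, "second"), (3, "third")])
conn.executemany("INSERT INTO item_value (item_id, value) VALUES (?, ?)",
                 [(1, "red"), (1, "green"), (2, "green"), (3, "blue")])

def items_with_value(value):
    """Return the ids of all items that carry exactly this value."""
    rows = conn.execute(
        "SELECT item_id FROM item_value WHERE value = ? ORDER BY item_id",
        (value,),
    )
    return [item_id for (item_id,) in rows]

print(items_with_value("green"))
```

If the array column is kept instead, Postgres can index it with GIN and search it with the array containment operator (@>); which layout wins for a given workload is something to measure rather than assume.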
Related
I'm working in Redshift SQL syntax, and want to know a way to convert a string id for each entry in a table to a numeric id (since numeric joins between tables are supposedly much quicker and more efficient than string joins).
Currently the ids look like this - a bunch of strings with both numbers and letters
01r00001ABCDeAAF
01r00001IJKLmAAN
...
01r00001OPQRtAAN
What I would like is to turn this into a purely numeric identifier, using the string id as an input and ensuring that each output is unique and corresponds only to a single input with no collisions (which can be replicated across tables so that accurate joins are possible).
I've tried using some hash functions within SQL like CHECKSUM() and BINARY_CHECKSUM() over the columns, but I'm a little unclear which would be the most applicable here - I understand some are case-sensitive and others aren't, while some generate collisions and others don't.
First, your reference for strings versus integers is based on an entirely different database. I would not generalize from SQL Server performance to other databases, particularly a massively parallel columnar database. There is also a lot of information that is taken out of context and generalized to wrong situations.
Second, you can test on tables in Amazon Redshift. Generating the data and doing the tests should be faster than modifying existing data. You will probably find no need to change anything.
You need to understand what is happening "under the hood" before making a change like this, particularly if you think it is for performance reasons.
Strings can be troublesome for a variety of reasons. First, they can have different collations or character sets -- information that is hidden. Such differences would preclude the use of indexes -- a major hit in a database such as SQL Server. Not using indexes is generally not an issue in Redshift.
Strings can also have variable lengths. This makes indexes slightly less efficient. They also require a wee bit more overhead to compare than numbers, because those collations and character sets need to be taken into account. They also need to be compared character-by-character, whereas most hardware has built-in comparisons for numbers. The extra cycles here are usually minimal compared to the cost of moving data.
When you do a join in Amazon Redshift, the first thing it is going to do is collocate the data, probably by hashing the values and sending the data to the same nodes in the parallel environment. Moving the data is expensive. Hashing the values, much less so.
In Redshift, you should be more concerned about how your data is distributed. Although I haven't tested it, adding a new column that is a number might make the query more expensive, because in a columnar database, the number of columns referenced has an impact on performance.
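If testing shows that a numeric key really is worth having, one collision-free option is a shared lookup table that numbers each distinct string once, rather than a hash. The sketch below is an assumption-laden illustration, not a Redshift recipe: it runs on Python's sqlite3 module (SQLite 3.25 or later for window functions), and the table names accounts, events, and id_map are hypothetical. The sample ids are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (string_id TEXT, amount INTEGER);
    CREATE TABLE events   (string_id TEXT, kind TEXT);

    INSERT INTO accounts VALUES
        ('01r00001ABCDeAAF', 10), ('01r00001IJKLmAAN', 20);
    INSERT INTO events VALUES
        ('01r00001IJKLmAAN', 'open'), ('01r00001OPQRtAAN', 'close'),
        ('01r00001IJKLmAAN', 'close');

    -- Number every distinct string id seen in either table exactly once.
    CREATE TABLE id_map AS
    SELECT string_id,
           ROW_NUMBER() OVER (ORDER BY string_id) AS numeric_id
    FROM (
        SELECT string_id FROM accounts
        UNION              -- UNION (not UNION ALL) removes duplicates
        SELECT string_id FROM events
    ) AS all_ids;
""")

# string id -> numeric id, unique by construction
id_map = dict(conn.execute("SELECT string_id, numeric_id FROM id_map"))
print(id_map)
```

The numbers are only stable as long as every table joins through the same id_map; rebuilding the map after new ids arrive would renumber existing rows.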
I need to save about 500 values in a structured database (SQL, PostgreSQL, or whatever). What is the best way to store the data: 500 separate fields, or a single field holding comma-separated values (CSV)?
What would be the pros and cons?
Which would be easier to maintain?
Which would be better for retrieving data?
A comma-separated value is just about never the right way to store values.
The traditional SQL method would be a junction or association table, with one row per field and per entity. This multiplies the number of rows, but that is okay, because databases can handle big tables. This has several advantages, though:
Foreign key relationships can be properly defined.
The correct type can be implemented for the object.
Check constraints are more naturally written.
Indexes can be built, incorporating the column and improving performance.
Queries do not need to depend on string functions (which might be slow).
Postgres also supports two other methods for such data, arrays and JSON-encoding. Under some circumstances one or the other might be appropriate as well. A comma-separated string would almost never be the right choice.
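The advantages listed above can be seen in a small junction-table sketch. It is written against Python's sqlite3 module purely so it can be run as-is; the names entity, field, and entity_value and the "value must be non-negative" rule are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
    CREATE TABLE entity (id INTEGER PRIMARY KEY);
    CREATE TABLE field  (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);

    -- One row per (entity, field) pair instead of 500 columns or one CSV string.
    CREATE TABLE entity_value (
        entity_id INTEGER NOT NULL REFERENCES entity(id),
        field_id  INTEGER NOT NULL REFERENCES field(id),
        value     REAL    NOT NULL CHECK (value >= 0),
        PRIMARY KEY (entity_id, field_id)
    );
    CREATE INDEX entity_value_field_idx ON entity_value (field_id, value);

    INSERT INTO entity VALUES (1), (2), (3);
    INSERT INTO field  VALUES (1, 'height'), (2, 'weight');
    INSERT INTO entity_value VALUES (1, 1, 1.8), (1, 2, 75.0), (2, 1, 1.6);
""")

def try_insert(entity_id, field_id, value):
    """Attempt an insert and report whether the constraints accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO entity_value VALUES (?, ?, ?)",
                         (entity_id, field_id, value))
        return True
    except sqlite3.IntegrityError:
        return False

print(try_insert(2, 2, 60.0))   # valid row
print(try_insert(2, 9, 60.0))   # unknown field: foreign key violation
print(try_insert(3, 1, -5.0))   # negative value: check constraint violation
```

None of these checks can be expressed on a comma-separated string without parsing it first.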
I'm working with a big table (millions of rows) in a PostgreSQL database. Each row has a name column, and I would like to perform a search on that column.
For instance, if I'm searching for the movie Django Unchained, I would like the query to return the movie whether I search for Django or for Unchained (or Dj or Uncha), just like the IMDB search engine.
I've looked at full text search, but I believe it is intended more for long text; my name column will never be more than 4-5 words.
I've thought about having a keywords table with a many-to-many relationship, but I'm not sure that's the best way to do it.
What would be the most efficient way to query my database?
My guess is that for what you want to do, full text search is the best solution. (Documented here.)
It does allow you to search for any complete words. It allows you to search for prefixes on words (such as "Dja"). Plus, you can add synonyms as necessary. It doesn't allow for wildcards at the beginning of a word, so "Jango" would need to be handled with a synonym.
If this doesn't meet your needs and you need the capabilities of like, I would suggest the following. Put the title into a separate table that basically has two columns: an id and the title. The goal is to make the scanning of the table as fast as possible, which in turn means getting the titles to fit in the smallest space possible.
There is an alternative solution, which is n-gram searching. I'm not sure if Postgres supports it natively, but here is an interesting article on the subject that includes Postgres code for implementing it.
The standard way to search for a sub-string anywhere in a larger string is using the LIKE operator:
SELECT *
FROM mytable
WHERE name LIKE '%Unchai%';
However, with millions of rows it will be slow, because the leading wildcard prevents an ordinary B-tree index on name from being used.
You might want to dabble with multiple strategies, such as first retrieving records where the value for name starts with the search string (which can benefit from an index on the name column - LIKE 'Unchai%';) and then adding middle-of-the-string hits after a second non-indexed pass. Humans tend to be significantly slower than computers on interpreting strings, so the user may not suffer.
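The two-pass strategy described above could be sketched roughly as follows. This is an illustration under assumptions rather than a tuned implementation: it runs on Python's sqlite3 module, the search helper and its limit parameter are hypothetical, and wildcard characters (% and _) in the user's input are not escaped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE INDEX mytable_name_idx ON mytable (name);
    INSERT INTO mytable (name) VALUES
        ('Django Unchained'), ('Unchained Melody'), ('Unchain My Heart'),
        ('The Matrix');
""")

def search(term, limit=10):
    """Prefix hits first; infix hits fill any remaining slots."""
    # Pass 1: anchored pattern, the form an index on name can help with.
    prefix_hits = [name for (name,) in conn.execute(
        "SELECT name FROM mytable WHERE name LIKE ? ORDER BY name LIMIT ?",
        (term + "%", limit))]
    remaining = limit - len(prefix_hits)
    if remaining <= 0:
        return prefix_hits
    # Pass 2: unanchored pattern, skipping rows that pass 1 already returned.
    infix_hits = [name for (name,) in conn.execute(
        "SELECT name FROM mytable "
        "WHERE name LIKE ? AND name NOT LIKE ? ORDER BY name LIMIT ?",
        ("%" + term + "%", term + "%", remaining))]
    return prefix_hits + infix_hits

print(search("Unchai"))
```

In Postgres, whether the anchored pattern actually uses a B-tree index depends on the collation or operator class of the index, and the pg_trgm extension provides trigram indexes that can also serve the unanchored pattern, so it is worth checking the query plan for both passes.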
This question is very much related to the autocomplete in forms. You will find several threads for that.
Basically, you will need a special kind of index, a space partitioning tree. Postgres has a built-in index type called SP-GiST which supports such index structures. You will find a bunch of useful stuff if you google for that.
I have two lists of words and I need to find matches (intersection of the two sets.)
Should I store each list as a string and find matches through string functions (like a regular expression) or store the words in a table, and have SQL find matches by joining?
It is almost impossible to say without more information about the problem. Here are some things to consider:
How many different distinct items do you have?
How many different combinations would be on a typical row?
Do your searches require looking for wildcards?
How long are the individual items?
Specifics on the database engine and hardware you are running on.
I want to emphasize that in almost all situations, you want to store the values in another table. Performance is not necessarily the primary reason. More important are ease of updating and deleting individual values, and the ability to support many more types of queries (such as a list of all available values).
But, we can still think about the performance issues. Storing values in a single string simply requires fetching the page with the record on it, and then applying a function that goes through the string. For simple patterns (such as identifying the presence of a fixed substring), this should go quite fast. There are few things that computers do faster than looping through strings and comparing values (assuming a reasonable implementation).
In the fastest possible join, both tables need to be read in, and the keys need to be matched. This requires additional effort. The situation is even worse, because you really want two additional tables, one for the individual string items and the other for the relationship between the original records and the items.
At this point, you may think "gosh, strings seem like a better idea". This is wrong. One of the big differences is in average size. If your items are, on average, longer than, say, 4 characters, then you save space by using a reference table. This saved space immediately translates into improved performance, because there is less I/O. With indexes, the additional tables would be in memory anyway, so the matching would be quite fast.
And, there is the issue of querying. You can use standard SQL functions for queries such as records that have A and B (many string functions are database specific). You can easily find out exactly which items are in the database, and relatively easily find what pairs exist on records. You can keep track of when an item is added to a record, and the first time it appears in the database. Generally, this flexible functionality -- which is just basic SQL functionality -- is what you need when managing this type of data.
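As an example of the "records that have A and B" query, here is a minimal sketch using a relationship table. It runs on Python's sqlite3 module; the record_item table and the records_with_all helper are hypothetical names, and the query relies on each (record_id, item) pair being stored only once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE record_item (
        record_id INTEGER NOT NULL,
        item      TEXT    NOT NULL,
        PRIMARY KEY (record_id, item)
    );
    INSERT INTO record_item VALUES
        (1, 'A'), (1, 'B'), (1, 'C'),
        (2, 'A'),
        (3, 'B'), (3, 'A');
""")

def records_with_all(items):
    """Return ids of records that contain every one of the given items."""
    wanted = sorted(set(items))
    placeholders = ", ".join("?" for _ in wanted)
    # (record_id, item) is unique, so COUNT(*) per record counts distinct
    # matching items; a record qualifies when it matched all of them.
    sql = (
        "SELECT record_id FROM record_item "
        f"WHERE item IN ({placeholders}) "
        "GROUP BY record_id "
        "HAVING COUNT(*) = ? "
        "ORDER BY record_id"
    )
    return [rid for (rid,) in conn.execute(sql, (*wanted, len(wanted)))]

print(records_with_all(["A", "B"]))
```

Listing every item present in the database is then just SELECT DISTINCT item FROM record_item, with no string parsing involved.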
Storing in a table will be much faster than a SQL string manipulation function in most circumstances, especially if you can index the words.
I think you're asking if this:
SELECT word FROM table_one WHERE word in (SELECT word FROM table_two)
is faster than this:
SELECT table_one.word FROM table_one
INNER JOIN table_two ON table_one.word = table_two.word
The first query should be faster, because the second creates a (potentially large) temporary object (the joined table).
Note that I assume you have an index on word. Also: if the strings are very long (URLs, for example), this will be very slow, and you should match on a hash instead.
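One thing worth checking before comparing speed is that the two queries do not always return the same rows: they differ when table_two contains duplicate words. The following sketch shows this on a tiny data set using Python's sqlite3 module; the sample words are made up. Which form is faster depends on the database's planner, so looking at the actual query plan is safer than assuming.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_one (word TEXT NOT NULL);
    CREATE TABLE table_two (word TEXT NOT NULL);
    CREATE INDEX table_one_word_idx ON table_one (word);
    CREATE INDEX table_two_word_idx ON table_two (word);

    INSERT INTO table_one VALUES ('apple'), ('pear'), ('plum');
    -- 'pear' appears twice on purpose.
    INSERT INTO table_two VALUES ('pear'), ('pear'), ('plum'), ('fig');
""")

in_rows = [w for (w,) in conn.execute(
    "SELECT word FROM table_one "
    "WHERE word IN (SELECT word FROM table_two) ORDER BY word")]

join_rows = [w for (w,) in conn.execute(
    "SELECT table_one.word FROM table_one "
    "INNER JOIN table_two ON table_one.word = table_two.word "
    "ORDER BY table_one.word")]

print(in_rows)    # each matching row of table_one appears once
print(join_rows)  # one row per matching pair, so 'pear' repeats
```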
I have to index different kinds of data (text documents, forum messages, user profile data, etc) that should be searched together (ie, a single search would return results of the different kinds of data).
What are the advantages and disadvantages of having multiple indexes, one for each type of data?
And the advantages and disadvantages of having a single index for all kinds of data?
Thank you.
If you want to search all types of document with one search, it is better to keep all types in one index. In the index you can define as many fields as you need and choose which ones to tokenize or store term vectors for.
Pointing an IndexSearcher at each directory that contains an index takes time.
If you want to search the types separately, it would be better to give each type its own index.
A single index is more structured than multiple indexes.
On the other hand, multiple indexes let you balance the load.
Not necessarily answering your direct questions, but... ;)
I'd go with one index and add a Keyword (indexed, stored) field for the type. It will let you filter if needed, as well as tell apart the different kinds of results you get back.
(And maybe in the vein of your questions: using separate indexes will allow each corpus to have its own relevance score. I don't know whether excessively repeated terms in one corpus will throw off the relevance of documents in others.)
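The idea of a single index with a stored type field can be illustrated with a toy example. To be clear, this is not Lucene and uses none of its API: ToyIndex is a hypothetical in-memory stand-in written in plain Python, with no scoring or analysis, meant only to show how one index can serve both a combined search and a per-type filter.

```python
from collections import defaultdict

class ToyIndex:
    """A tiny in-memory inverted index; a stand-in for illustration, not Lucene."""

    def __init__(self):
        self.postings = defaultdict(set)   # term -> ids of documents containing it
        self.doc_type = {}                 # document id -> stored type keyword

    def add(self, doc_id, doc_type, text):
        self.doc_type[doc_id] = doc_type
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term, only_type=None):
        """Return (doc_id, type) pairs, optionally filtered by type."""
        hits = sorted(self.postings.get(term.lower(), ()))
        return [(d, self.doc_type[d]) for d in hits
                if only_type is None or self.doc_type[d] == only_type]

index = ToyIndex()
index.add(1, "document", "Quarterly budget report")
index.add(2, "forum", "Where is the budget thread")
index.add(3, "profile", "Budget analyst based in Lisbon")

print(index.search("budget"))                     # all kinds of data together
print(index.search("budget", only_type="forum"))  # filtered by the type field
```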
You should think logically about what each dataset contains and design your indexes by subject matter or other criteria (such as geography, business unit, etc.). As a general rule, your index architecture should resemble how you would design databases (you likely wouldn't combine an accounting database with a personnel database, for example, even if technically feasible).
As #llama pointed out, creating a single uber-index affects relevance scores and security/access issues, among other things, and causes a whole new set of headaches.
In summary: think of a logical partitioning structure depending on your business need. Would be hard to explain without further background.
Agree that each kind of data should have its own index, so that all the index options can be set accordingly: analyzers for the fields, what is stored for the fields, term vectors, and similar. It also lets you use different policies for when IndexReaders/Writers are reopened/committed for different kinds of data.
One obvious disadvantage is the need to handle several indexes instead of one. To make it easier, and because I always use more than one index, I created a small library to handle it: Multi Index Lucene Manager