Efficient way to store content translations? - sql

Suppose you have a few quite large (100k+) sets of objects available and can provide this data (e.g. the name) in 20+ languages. What is an efficient way to store/handle this data in a SQL database?
The obvious way to do that looks like this - however, are there other ways which make more sense? I'm a bit worried about performance.
CREATE TABLE "object" (
"id" serial NOT NULL PRIMARY KEY
);
CREATE TABLE "object_name" (
"object_id" integer NOT NULL REFERENCES "object" ("id")
"lang" varchar(5) NOT NULL,
"name" varchar(50) NOT NULL
);
As for usage, the user will only ever select one language, and that will result in potentially large joins over the object_name table.
Premature optimization or not, I'm interested in other approaches, if only to gain some peace of mind that the obvious solution isn't a very stupid one.
To clarify: the actual model is way more complicated; that's just the pattern identified so far.

If you have a combined key on (object_id, lang) there shouldn't be any joins, just an O(1) lookup, right? (Try with EXPLAIN SELECT to be sure)
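A minimal sketch of that, assuming the object_name table from the question (the id 42 and the language code 'en' are just placeholders):
ALTER TABLE "object_name" ADD PRIMARY KEY ("object_id", "lang");

-- A single-object, single-language lookup then becomes one index probe;
-- EXPLAIN should show an index scan on the primary key.
EXPLAIN
SELECT "name"
FROM "object_name"
WHERE "object_id" = 42 AND "lang" = 'en';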

In my own projects, I don't translate at the DB level. I let the user (or the OS) give me a lang code and then I load all the texts in one go into a hash. The DB then sends me IDs for that hash and I translate the texts the moment I display them somewhere.
Note that my IDs are strings, too. That way, you can see which text you're using (compare "USER" with "136" -- who knows what "136" might mean in the UI without looking into the DB?).
[EDIT] If you can't translate at the UI level, then your DB design is the best you can aim for. It's as small as possible, easy to index, and the joins won't cost much.
If you want to take it one step further and you can generate the SQL queries at the app level, you can consider creating views (one per language) and then using the views in the joins, which would give you a way to avoid the two-column join. But I doubt that such a complex approach will have a positive ROI.
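If you do go the view route, a per-language view could look something like this (just a sketch; the view name is made up):
CREATE VIEW "object_name_en" AS
SELECT "object_id", "name"
FROM "object_name"
WHERE "lang" = 'en';

-- Joins against the view only need the single object_id column
SELECT o."id", n."name"
FROM "object" o
JOIN "object_name_en" n ON n."object_id" = o."id";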

Have you considered using multiple tables, one for each language?
It will cost a bit more in terms of coding complexity, but you will be loading/accessing only one table per language, in which the metadata will be smaller and therefore more time-efficient (possibly also space-wise, as you won't have a "lang" column on every row).
Also, if you really want one-table-to-rule-them-all, you can create a view and join them :)
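A sketch of how that could look (the table names and languages are assumptions), with the per-language tables stitched back together into a single view:
CREATE TABLE "object_name_en" (
"object_id" integer NOT NULL PRIMARY KEY REFERENCES "object" ("id"),
"name" varchar(50) NOT NULL
);
CREATE TABLE "object_name_de" (
"object_id" integer NOT NULL PRIMARY KEY REFERENCES "object" ("id"),
"name" varchar(50) NOT NULL
);

-- Optional one-table-to-rule-them-all view over the per-language tables
CREATE VIEW "object_name_all" AS
SELECT 'en'::text AS lang, "object_id", "name" FROM "object_name_en"
UNION ALL
SELECT 'de'::text AS lang, "object_id", "name" FROM "object_name_de";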

In addition to what Wim wrote, the OBJECT table in your case is useless. There's no need for such a table, since it does not store any information that isn't already contained in OBJECT_NAME.

Related

Best performance in filtering large datasets: = ANY() or #> on an array column; XOR on a bitmapped binary column; join a m2m table or something else?

The General Question:
This question pertains to tables with a large number of rows (say millions) that have a many-to-many relationship with a relatively small set of data (say tens). For example, you might have 20 different tags or types or categories, and each of your 10 million records is associated with one or more of them.
You could have a separate table for your tags with 20 rows and then use a many to many table to define the relationships between those 20 tags and your 10 million records.
In Postgres there are array types that could be used instead. In MySQL there is the SET column type, which is similar.
You could also use a bitmap or bit string type and perform an XOR on that column.
My question is which of these solutions (or an alternative solution) is best for performance when querying the large table for records that are associated with one member of the small set. A solution should include what indexes to create and use.
My Specifics:
I've tried to keep the question here as generic as possible because I believe the answer could be applicable in many fairly common scenarios. However, for clarity I'll describe my specific situation now.
I have a table with millions of titles. Each title is associated with one or more languages. For example 'Don Quixote' may be 'Дон Кихот' in Russian, Bulgarian, Kazakh and 'Don Quijote' in a bunch of other languages. I have a search string and a language and want to find the best match in titles. I'm using Postgres and have a trigram index using gin on the titles. An example search would be find matches for 'Дон Кихот' in Bulgarian.
Currently I have the languages in a char(2)[] array column using two-letter language codes. I assume using an int array with language IDs would be better, but I want to go for what is best. I'm not worried about how much effort it would take to set up a bitmap for languages to do an XOR search, or whatever effort and complexity is involved in implementing a particular solution. The performance is what matters.
I would tend to think that JOINing a many to many table would not be the best solution because that table would have multiple entries per row in the title table and so it would be huge. But perhaps I'm wrong about that because that is what RDBMS are designed to do.
Huge thanks to all of you who spend your highly valuable time answering these questions.
The PostgreSQL documentation warns against searching arrays.
Arrays are not sets; searching for specific array elements can be a sign of database misdesign. Consider using a separate table with a row for each item that would be an array element. This will be easier to search, and is likely to scale better for a large number of elements.
SQL in general, and PostgreSQL in particular, is astonishingly good at joining vast many-to-many tables, especially when their rows are modestly sized, their columns' data types match the columns you use for joining, and they have the right indexes.
So avoid arrays for the application you describe.
For a two-column junction table like you describe, you'd define it as something like this:
CREATE TABLE title_lang (
title_id BIGINT NOT NULL,
lang_id SMALLINT NOT NULL,
PRIMARY KEY (title_id, lang_id)
);
CREATE INDEX lang_title ON title_lang (lang_id, title_id);
I used SMALLINT for the language id. When you use character types, the database stores Unicode characters, and those can take unexpected amounts of space, whereas indexes on integers are very efficient. But you should use the data type that makes sense in the rest of your schema.
I suggest a primary key going from title to language, and a reverse index going the other way. You can omit the reverse index if you never need to go from language to title.
To associate a title with a language you insert a row in the table. To get rid of the association you delete the row.
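A sketch of how the language-restricted trigram search could then look (assuming a titles table with a text column named title and the pg_trgm GIN index mentioned in the question; the lang_id value is a placeholder):
SELECT t.id, t.title
FROM titles t
JOIN title_lang tl ON tl.title_id = t.id
WHERE tl.lang_id = 7                  -- e.g. Bulgarian
  AND t.title % 'Дон Кихот'           -- pg_trgm similarity operator, can use the GIN index
ORDER BY similarity(t.title, 'Дон Кихот') DESC
LIMIT 10;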

PostgreSQL - What should I put inside a JSON column

The data I want to store has these characteristics:
There are a finite number of fields (I don't expect to add new fields);
There are some columns that are common to all sets of data (a category field, for instance);
There are some columns that are specific to individual sets of data (each category needs its own fields);
Here's how it would look in a regular table:
I'm having trouble figuring out which would be the better way to store this data in a database for this situation.
Below are the ideas I already had:
Keep it exactly like the regular table (I would have many NULL values);
Divide the categories into tables (I would use joins when needed);
Use JSON type for storing the values (no NULL values and having it all in same table).
So my questions are:
Is one of these solutions (or another one that I have not thought about) better for this case?
Are there other factors, other than the ones presented here, that I should consider to make this decision?
Unless you have very many columns (~ 100), it is usually better to use normal columns. NULL values don't take any storage space in PostgreSQL.
On the other hand, if you have queries that can use any of these columns in the WHERE condition, and you compare with =, a single GIN index on a jsonb column might be better than many B-tree indexes, because maintaining many B-tree indexes would cost more.
The definitive answer depends on the SQL statements that you plan to run on that table.
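For illustration, a sketch of the jsonb variant (table and column names are assumptions, not from the question):
CREATE TABLE item (
id serial PRIMARY KEY,
category text NOT NULL,
attributes jsonb
);

-- One GIN index serves equality lookups on any key inside the jsonb column
CREATE INDEX item_attributes_gin ON item USING gin (attributes);

-- Containment query that can use the GIN index
SELECT id FROM item WHERE attributes @> '{"color": "red"}';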
You have laid out the three options pretty well. Things to consider are:
Performance
Data size
Ease of maintenance
Flexibility
Security
Note that you don't even allude to security considerations. But security at the table level is usually a tad simpler than at the column level and might be important for regulated data such as PII (personally identifiable information).
The primary strength of the JSON solution is flexibility: it is easy to add new columns. But you don't need that, and JSON has costs in data size and in data types (notably, JSON doesn't support date/time values explicitly).
A multiple table solution requires duplicating the primary key but may result in much less storage overall if the columns really are sparse. The "may" may also depend on the data type. A NULL string for instance occupies less space than a NULL float in a table record.
The joins on multiple tables will be 1-1 on primary keys. These should be pretty fast.
What would I do? Unless the answer is obvious, I would dump the data into a single table with a bunch of columns. If that table starts to get unwieldy, then I would think about splitting it into separate tables -- but still have one table for the common columns. The details of one or multiple tables can be hidden behind a view.
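A sketch of that layout (the names are invented): one table for the common columns, one per category for the sparse ones, with the split hidden behind a view:
CREATE TABLE product (
id serial PRIMARY KEY,
category text NOT NULL
);

CREATE TABLE product_book (
id integer PRIMARY KEY REFERENCES product (id),
author text,
isbn text
);

-- The 1-1 join on the primary key is hidden behind a view
CREATE VIEW product_full AS
SELECT p.id, p.category, b.author, b.isbn
FROM product p
LEFT JOIN product_book b ON b.id = p.id;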
It depends on how much data you want to store, but as long as the set of fields is finite it shouldn't make a big difference whether it contains a lot of NULLs or not.

When would combining columns into a single, delimited column be better in an RDB schema?

Consider, for example, the case where you have two pieces of data, where one value is rarely used without the other. As one example, here is a table holding user authentication data:
CREATE TABLE users
(
id INT PRIMARY KEY,
auth_name STRING,
auth_password STRING,
auth_password_salt STRING
)
I think that the password is meaningless without the salt, and the other way around. I also have the option of representing the data this way:
CREATE TABLE users
(
id INT PRIMARY KEY,
auth_name STRING,
auth_secret STRING
)
And in auth_secret, store strings such as D5SDfsuuAedW:unguessable42
In general, are there any situations where combining columns into one, delimited column would be a better choice?
Even if it is never a "better choice" overall, are there any costs (performance, space, anything) to having more columns vs fewer columns (for the same data)? My motivation is better understanding and to be able to more competently argue against it when someone suggests this sort of thing.
--edited I changed the example... original example as follows:
CREATE TABLE points
(
id INT PRIMARY KEY,
x_coordinate INT,
y_coordinate INT,
z_coordinate INT
)
vs
CREATE TABLE points
(
id INT PRIMARY KEY,
position STRING
)
In position, storing strings such as 7:3:15
You do that when there is no chance of needing to join, query, report or aggregate the data.
In other words - never. It is bad database design.
First normal form (1NF) states that attribute values should be atomic - it is the basic requirement.
The only possible answer to this question is never. Never, ever, store delimited data in a column. It defeats the entire point of columns, which are there to delimit your data, and makes it inordinately difficult to do anything that a database has been designed to do. It's a violation of normalisation so huge that you'll spend hours on Stack Overflow trying to correct it in a month's time.
Never do this.
However, "never say never".
In certain, extremely limited, circumstances it's okay. Never assume it's okay but it can be.
A good example is Stack Overflow's own Posts table, which stores the tags in a delimited format for quick reading. The tags a question has are read from the database far more often than they are edited. The tags are stored in a separate table, PostTags, and then denormalised to Posts when they are updated.
In short, even though you can denormalise your data in this way, don't. Try everything possible to avoid it. If you come across a situation where you've been optimizing for days and the only way to get something quicker is to denormalize, then it's okay. Just ensure that you are only ever going to read data from that column and you have a secondary process in place to ensure that it is kept up-to-date. If the update of the denormalised data fails, roll everything back to ensure that your data is consistent.
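As a rough sketch (not Stack Overflow's actual schema; the ids and the tag string are placeholders), the secondary process boils down to updating the normalised rows and the denormalised copy in the same transaction:
BEGIN;

-- Normalised side: the PostTags junction table
DELETE FROM PostTags WHERE PostId = 123;
INSERT INTO PostTags (PostId, TagId) VALUES (123, 1), (123, 2);

-- Denormalised copy on Posts, kept only for fast reads
UPDATE Posts SET Tags = '<sql><postgresql>' WHERE Id = 123;

COMMIT;  -- if any statement fails, issue ROLLBACK instead so both sides stay consistent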
You left out a significant option: create an appropriate user-defined data type. (PostgreSQL has long had an intrinsic data type for 2-space.)
PostgreSQL
Oracle
SQL Server
DB2
These implementations differ quite a lot.
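For instance, in PostgreSQL a composite type for 3-space could be sketched like this (type, table, and column names are made up):
CREATE TYPE coord3d AS (
x integer,
y integer,
z integer
);

CREATE TABLE points (
id integer PRIMARY KEY,
pos coord3d
);

INSERT INTO points VALUES (1, ROW(7, 3, 15));

-- The parts stay addressable without any string parsing
SELECT (pos).x, (pos).y, (pos).z FROM points;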
But you might not have the luxury of using one of those platforms. You might have to use MySQL, for example, which doesn't support user-defined data types.
Relational theory says that data types can be arbitrarily complex; they can have internal structure. The most common data type that has internal structure is the type "date". Relational theory specifies what the dbms is supposed to do with data types like that. The dbms must either
ignore the internal structure entirely, or
provide functions to manipulate the parts.
In the case of dates, every SQL dbms provides functions to manipulate the parts.
You can make a good argument for a single column that stores 3-space coordinates like "7:3:15" in MySQL. To keep in line with relational theory, you'd want the dbms to ignore the structure, and return only the single value "7:3:15"; manipulation of parts is left to application code.
One problem with implementing something like that in MySQL is that MySQL doesn't enforce CHECK constraints. So it's a lot harder to prevent values like "wibble:frog:foo" from finding their way into the database.

How to design a table that only needs one column?

I am creating a database table that'll have a list of all Tags available in my application (just like SO's tags).
Currently, I don't have anything associated with each tag (and I'll probably never have), so my idea was to have something of the form
Tags (Tag(pk) : string)
Should this be the way to do it? Or should I instead do something like
Tags (tag_id(pk) : int, tag : string)
I guess lookups on the table in the 2nd case would be faster than in the first one, but it also takes up more space?
Thanks
I'd go for the second option with the surrogate key.
It will mean the table takes up more space, but will likely reduce space overall, assuming that you use the tag as a foreign key in other tables (e.g. a posts/tags table).
Using an int rather than a string will make the lookups required to enforce the foreign key more efficient, and means that updates of tag titles don't need to affect multiple tables.
Indexes work better with integers than CHAR/VARCHAR, go with a dedicated integer primary key column. If you need tag names to be unique you can add a constraint, but it's probably not worth the hassle.
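A minimal sketch of that second option in PostgreSQL syntax (the post_tags junction table is an assumption for illustration, not part of the question):
CREATE TABLE tags (
tag_id serial PRIMARY KEY,
tag varchar(50) NOT NULL UNIQUE  -- optional uniqueness constraint on the name
);

-- Other tables reference the small integer key, so renaming a tag
-- only touches the tags table
CREATE TABLE post_tags (
post_id integer NOT NULL,
tag_id integer NOT NULL REFERENCES tags (tag_id),
PRIMARY KEY (post_id, tag_id)
);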
You should go for the second option. Firstly, you never know what the future holds. Secondly, you may later want multiple-language support or other things that make the string-as-the-primary-key feel awkward. Thirdly, I like the idea of using a standard procedure for a table definition, i.e. that there always is a column 'id' or 'pk'. It separates business from technology.
Quite possibly you'll have a faster lookup with the index being an integer. Further, consider making your index clustered for even further speedup.
I wouldn't put too much emphasis on the performance issue, though. As soon as a program starts talking to a database over the internet, you have a much bigger delay than 99% of all the queries to your database (with the exception of reporting queries, of course!).
Those two options achieve quite different things. In the first case you have unique tags and in the second you don't. You haven't said what use TAG_ID is in this model. Unless you put in TAG_ID for a good reason then I'd stick with the first design. It's smaller, appears to meet your requirements precisely and Tag seems like a more obvious choice for a key (on grounds of familiarity and simplicity).

Char(4) versus int as StatusID/StatusCode column in a table

I need a status column that will have about a dozen possible values.
Is there any reason why I should choose int (StatusID) over char(4) (StatusCode)?
Since SQL Server doesn't support named constants, char is far more descriptive than int when used as a constant in stored procedures and views.
To clarify, I would still use a lookup table either way, since I will need more descriptive text for the UI. So this decision is only to help me as the developer when I'm maintaining the stored procedures and views.
Right now I'm leaning toward char(4). Especially since designing views in SQL Server Management Studio prevents me from adding comments (I know it's possible to add it in the script editor, but realistically I will use the View Designer far more often, especially if the view is trivial). StateCODE = 'NEW' is much more readable than StateID = 1000.
I guess the question is whether there will be cases where char(4) is problematic. Since the database is pretty small, I'm not too concerned about a slight performance hit (like using tinyint versus int); I'm more afraid of code maintenance problems.
Database purists will say a key should have no meaning in the business domain, and that you should create a status table where you look up the description and other meanings of the status.
But for operators and end users, having a descriptive status code can be a blessing. And it doesn't even have to be char(4), you can make it varchar(20). This allows them to query without joins, and inspect the database in an easier way.
In the end, I think the char(20) organization will run more smoothly and go home earlier on Friday. But the int organization has a better abstraction of the database, and they can enjoy metaprogramming on Friday evening (or boasting on forums).
(All of this assumes that you're writing business support software. One of the more successful business support systems, SAP, makes successful use of meaningful keys.)
There are many pro's and con's to each method. I'm sure other arguments will come up in favour of using a char(4). My reasons for choosing an int over a char include:
I always use lookup tables. They allow for an audit trail of the value to be retained and easily examined. For example, if one of your status codes is 'MING' and a business decision is made to change it from 'MING' to 'MONG' from a certain date, my lookup table handles this.
Smaller index - if you need to index this column, it will be thinner.
Extendability - OK, I made that word up, but if you need to go from 4 chars to 5 chars for example, a lookup table would be a blessing.
Descriptions: we use a lot of TLAs here, which is great once you know what they are, but if I gave a business user a report that said "GDA's 2007 1001", they wouldn't necessarily twig that GDA = Good Dead on Arrival. With a lookup table, I can add this description.
Best practice: I can't find the link to hand, but it might be something I read in a K. Tripp article. Aim to make your clustered primary key incrementing integers to optimise the index.
Of course if you are absolutely positive that you will never need any more than a handful of 4 characters, there is no reason not to bang it in the table.
The best option would be a lookup table with the defined values, related to the original table that uses that enumeration.
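A minimal sketch of that lookup-table approach in SQL Server syntax (table and column names are assumptions):
CREATE TABLE order_status (
status_id tinyint PRIMARY KEY,
status_code char(4) NOT NULL UNIQUE,  -- readable code for developers
description varchar(50) NOT NULL      -- descriptive text for the UI
);

CREATE TABLE orders (
order_id int PRIMARY KEY,
status_id tinyint NOT NULL REFERENCES order_status (status_id)
);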
Collation ambiguities are one reason to say no to char(4): does ABcD = abCD = äBCd?
If you have 12 possible values, why not tinyint/byte and a Status table?
If you have to store the status for 10 million rows, the 3-byte difference and the collation/string comparisons add up.
The place where I've run into this use case is columns that would map onto things that I would typically use an Enum for when programming. Do you store the integer value of the Enum or the name of the Enum in the database column? Honestly, I've done it both ways. Usually, I ask myself if the database will be used outside the application I'm building. If so, I will choose the human readable format to store in the database. If not, then I'll choose the integer value as it saves a little time when reconstituting (it's just a cast instead of a parse operation) the Enum in code.
You could also use a tinyint over an int
I always choose ints, simply because they are easier to map to enums in code.
If you're dealing with huge amounts of data and high throughput then a smallint or tinyint can give better performance and a smaller footprint on the hard disk. If the data in your application is often viewed directly through applications like Access or Cognos then your business people will probably appreciate the descriptive values. I know that when I'm analyzing data as part of my Database Developer role I get tired of joining a lot of lookup tables because I can't remember if 1 = Foo and 2 = Bar or 1 = Bar and 2 = Foo.
Also, although performance will be enhanced if you have to look up rows by these codes (the indexes can be smaller), it can also be hurt (in a minor way) by having to do the joins if you often look up rows regardless of the code but need to include the text value. In most applications that's not an issue, though, and it would probably only come into play in large data warehousing/reporting environments.