We want to extend our database to add multi-language support, but we are unsure how to do this.
Our database looks like this:
ID – Name – Description – (a lot of irrelevant columns)
Option 1 is to add an XML column to the table; in this column we can store the information we need, like this:
<translation>
  <language value='en'>
    <Name value=''/>
    <Description value=''/>
  </language>
  <language value='fr'>
    <Name value=''/>
    <Description value=''/>
  </language>
</translation>
This does the trick, and the advantage is that when I delete the row, I also delete the translations.
Option 2 is to add an extra table. It's easy to create a table to store the information in, but it requires inner joins when retrieving the information, and more effort to delete the translations when the original row is deleted.
What is the preferred option in this case? Or are there other good solutions for this?
I'd recommend the "relational" approach, i.e. separate translation table(s). Consider doing it like this:
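A minimal sketch of the model in SQL, with placeholder names, showing one multi-lingual table and its translation table:

CREATE TABLE LANGUAGE (
    LANGUAGE_ID INTEGER PRIMARY KEY,
    NAME        VARCHAR(50) NOT NULL
);

CREATE TABLE TABLE1 (
    TABLE1_ID INTEGER PRIMARY KEY
    -- ... the language-neutral columns ...
);

CREATE TABLE TABLE1_TRANSLATION (
    LANGUAGE_ID INTEGER NOT NULL REFERENCES LANGUAGE (LANGUAGE_ID) ON DELETE CASCADE,
    TABLE1_ID   INTEGER NOT NULL REFERENCES TABLE1 (TABLE1_ID) ON DELETE CASCADE,
    NAME        VARCHAR(255) NOT NULL,
    DESCRIPTION VARCHAR(2000),
    PRIMARY KEY (LANGUAGE_ID, TABLE1_ID)  -- identifying relationships; usable for clustering
);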
This model has some nice properties:
For each multi-lingual table, create a separate translation table. This way, you can use the fields appropriate for that particular table, and the translation cannot be "misconnected" to the wrong table.
The existence of the LANGUAGE table and the associated FOREIGN KEYs ensures that a translation cannot exist for a non-existent language, unlike with the XML.
The ON DELETE CASCADE referential action ensures no "orphaned" translation can be left behind when a language is removed, unlike with the XML.
While XML may be faster in simpler cases, I suspect JOIN is more scalable when the number of languages grows [1]. In any case, measure the difference and decide for yourself if it's significant enough.
Separate fields such as NAME and DESCRIPTION may be easier to index. With XML, you'd probably need a DBMS with special support for XML, or possibly some sort of full-text index.
Fields such as NAME and DESCRIPTION will likely be just regular VARCHARs. OTOH, putting them together may produce XML too large for a regular VARCHAR, forcing you to use a CLOB/BLOB, which may have its own performance complications.
If your DBMS supports clustering (see below), the whole translation table can be stored in a single B-Tree. XML has a lot of redundant data (opening and closing tags), likely making it larger and less cache-friendly than the B-Tree (even when we count in all the associated overheads).
You'll notice that the model above uses identifying relationships, and the resulting PK, {LANGUAGE_ID, TABLEx_ID}, can be used for clustering (so the translations that belong to the same language are stored physically close together in the database). As long as you have a few predominant (or "hot") languages, this should be OK: caching is done at the database-page level, so avoiding mixing "hot" and "cold" data in the same page avoids caching "cold" data (and effectively shrinking the cache).
OTOH, if you routinely need to query for many languages, consider flipping the clustering key order to {TABLEx_ID, LANGUAGE_ID}, so all the translations of the same row are stored physically close together in the database. Once you retrieve one translation, the other translations of the same row are probably already cached. Or, if you want to extract multiple translations in a single query, you can do it with less I/O.
[1] We can JOIN to just the translation in the desired language. With XML, you must load (and parse) the whole XML before deciding to use only the small portion of it that pertains to the desired language. Whenever you add a new language (and the associated translations to the XML), it slows down the processing of existing rows, even if you rarely use the new language.
I am used to seeing relational databases where distinct entities are stored in different tables (simple example: Country, State, City). Recently I have been seeing more cases where distinct but similar entities are bundled into the same table, combined with different views. I suppose this can economize on tables and data-access programs (maybe at the expense of clarity and flexibility). Re-reading the definition of normalized databases, I don't think this breaks any rules, but it seems less intuitive, and a throwback to old mainframe "miscellaneous" tables where you put anything that was forgotten at the design stage. See the two examples below: multi-table solution vs. single-table solution. Is this phenomenon part of a data or programming design pattern, and does it have a name?
If you have small dedicated tables, then the database can easily cache the ones it needs in memory.
If you take what would otherwise be small tables and cram them together into one, the database doesn't know which entries are important to cache and which aren't.
More importantly, there is more opportunity for errors, because you can inadvertently type in the wrong type code and end up joining to something irrelevant, with no referential integrity (RI) or type checking to warn you. If you use small dedicated tables, then you can specify RI constraints.
Thinking back to a place where I saw the single monster-lookup-table pattern done, I think the attraction was that developers can add more kinds of entries without needing DBA intervention to create more tables. There were a lot of developers and only a few DBAs and this was how the DBAs avoided getting sucked into having to create dedicated lookup tables every time a new type of lookup entry was introduced. (Apparently granting create table rights in dev was not acceptable for the DBAs there.)
This seems like a workaround for environments where database schema changes are hard to come by. But another consideration is it may be easier to internationalize if all your entries are in one table.
And the pattern has an established name: it's called the One True Lookup Table. The linked article calls it out as an antipattern and lists more drawbacks of this technique. Here is the bulleted list from the article, followed by a sketch contrasting the two approaches:
It makes the SQL look ugly.
Many statements will require multiple joins to the lookup table. The extra join columns make the statements look bigger and scarier. There will be the same number of joins when using separate lookup tables, but those joins will be simpler.
Multiple references to the same table can make it hard to determine what is happening in the execution plan, as you will see those repeated references there, and have to refer to the predicates to understand the context of table reference. If you were using separate lookup tables, it would be clear which table you were referring to at any point of the execution plan.
You can't foreign key to this type of table. Technically you can if you are willing to put both columns (lookup_type_code and lookup_key) in the table, but you won't because it is ugly. This means there is a good chance your data integrity will be compromised over time. It's really easy to foreign key to individual lookup tables, and therefore protect your data.
It's hard to control the contents of the table. It's a shared resource, so check constraints and triggers are problematic. If you need users to have different privileges, depending on which lookup they are dealing with, things are going to get messy. That would be really easy with separate lookup tables.
If you need to make a change for one reference type, like extending the size of the key or value, it affects all reference data. Using separate lookup tables isolates the change.
Over time, many reference tables take on additional data. To model that you would need to either split out that reference data from this shared lookup table, or start adding optional columns to cope with the "one-off" issues. A change like this is really simple for separate lookup tables.
Data types matter. You should always use the correct data type, as it will reduce the number of data type conversions needed. Implicit data type conversions are bugs waiting to happen!
Performance can be a problem with the OTLT approach, as it's hard for the optimizer to make sound judgements about the data. The optimizer cares about cardinality, but that is hard to estimate if you are dealing with a large number of rows, most of which are irrelevant in any one specific context. The optimizer also cares about high/low values, but these are not relevant to any one lookup, being shared across all of them. We've also mentioned you probably won't foreign key to this data, which reduces the amount of information the optimizer has when making its decision. You may have artificially made columns optional that are actually mandatory: a key must have a value, but which column? I think you get the message.
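To make the foreign-key point concrete, here is a minimal sketch (all names are made up) of the composite key an OTLT forces on referencing tables, versus the plain key a dedicated lookup table allows:

-- One True Lookup Table: the referencing table must carry the type code
-- on every row just to make the foreign key possible
CREATE TABLE lookup (
    lookup_type_code VARCHAR(20),
    lookup_key       VARCHAR(20),
    lookup_value     VARCHAR(100),
    PRIMARY KEY (lookup_type_code, lookup_key)
);

CREATE TABLE orders_otlt (
    order_id    INTEGER PRIMARY KEY,
    status_type VARCHAR(20) DEFAULT 'ORDER_STATUS',  -- dead weight on every row
    status_key  VARCHAR(20),
    FOREIGN KEY (status_type, status_key)
        REFERENCES lookup (lookup_type_code, lookup_key)
);

-- Dedicated lookup table: a simple, well-typed foreign key
CREATE TABLE order_status (
    status_key   VARCHAR(20) PRIMARY KEY,
    status_value VARCHAR(100)
);

CREATE TABLE orders (
    order_id   INTEGER PRIMARY KEY,
    status_key VARCHAR(20) REFERENCES order_status (status_key)
);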
I think that if you only need a name dictionary (for spellchecking or something similar), the second approach is good enough. Otherwise, if the objects have some additional specific fields, the second approach is very bad.
In just about every SQL-based database application I have worked on so far, sooner or later the following three-faceted requirement has popped up:
There is some entity, linked in a hierarchical fashion (i.e. the tuples form a tree structure).
Users must be able to define any number of custom attributes with values for the tuples, and these values are inherited/overridden towards the leaves of the tree structure. ("Dumb" attributes usually suffice. That is, no uniqueness constraints, no foreign keys, only one value per attribute, ...)
Users must be able to run arbitrary queries on this data (i.e. custom boolean expressions, based upon filters for the values of the user-defined attributes that are linked with AND/OR).
Storing the data, roughly matching the first two bullets above, is quite straightforward:
The hierarchy is built up by giving the respective table a parent column. This column will be null for root nodes, and a pointer to the ID of the parent node for all other nodes.
The user-defined attributes are stored according to the entity-attribute-value pattern.
While there are numerous resources suggesting a different approach, especially for the latter point (e.g. answers here, here, or here), I have not usually been in a position to move away from a traditional static relational database schema. Hence, let's simply take the above as a given. Also, I could hardly ever rely on the specifics of a particular DBMS; the more usual case was systems that were supposed to work with MS SQL Server, Oracle, and possibly others as backends, without requiring two significantly different product versions.
Solving the third item, however, is always problematic (even without considering the hierarchical inheritance of attribute values). The naive approach needs one join to the attribute-value table per distinct attribute in the boolean expression. Alternatively, the number of joins can be reduced somewhat by determining the maximum number of distinct attributes used in any one branch (OR alternative) of the boolean expression, which may save joins, but makes the resulting queries, and the code used to generate them, even less intelligible and maintainable. For instance,
a = 5 or (b = 8 and c = 9)
could do with 2 joins to the attribute-value table.
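For instance, a sketch of such a two-join query, assuming a hypothetical attrs(entity_id, attr, value) attribute-value table and an entities base table (values stored as strings, one of EAV's usual compromises):

SELECT DISTINCT e.id
FROM entities e
JOIN attrs v1 ON v1.entity_id = e.id        -- covers the first attribute of each branch
LEFT JOIN attrs v2 ON v2.entity_id = e.id   -- covers the second attribute of the AND branch
                  AND v2.attr = 'c' AND v2.value = '9'
WHERE (v1.attr = 'a' AND v1.value = '5')
   OR (v1.attr = 'b' AND v1.value = '8' AND v2.entity_id IS NOT NULL);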
I have always been able to do this "somehow", but as this appears to be a fairly ubiquitous situation, I am looking for the "canonical" way to generate SQL queries in this situation. Is there a "standard pattern" to follow here?
Be careful not to fall prey to the inner-platform effect. It is a complicated problem, and SQL itself is designed to handle the complexities. Generate DDL to add and remove columns as needed, and generate simple SELECT statements for queries. Store each tuple type (distinct set of attributes) as a table.
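A sketch of the idea (all table and column names here are hypothetical): instead of inserting EAV rows, the application emits DDL, and queries stay plain SQL.

-- each "tuple type" (distinct set of attributes) is its own table;
-- a new user-defined attribute becomes a generated column
-- (SQL Server spells this without the COLUMN keyword)
ALTER TABLE node_type_a ADD COLUMN weight_kg DECIMAL(10, 2);

-- queries stay simple and optimizer-friendly
SELECT id
FROM node_type_a
WHERE a = 5 OR (b = 8 AND c = 9);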
With regard to inheritance, I recommend handling it in the application or DAL, and only storing the non-inherited values. On retrieval, read all parent rows to calculate the "functional" values. If you do need to access "functional" values from SQL, use an indexed view or triggers to maintain them separately from storage.
Hierarchies can be represented as you describe, but a simple "parent" column can make it difficult to query beyond a single level. Look at hierarchyid on SQL Server or CONNECT BY on Oracle.
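A portable alternative, if you stay with the plain parent column, is a recursive common table expression. A minimal sketch with a hypothetical nodes(id, parent, name) table; this is PostgreSQL/MySQL 8 syntax, and SQL Server omits the RECURSIVE keyword:

WITH RECURSIVE subtree AS (
    -- anchor: the node whose descendants we want
    SELECT id, parent, name
    FROM nodes
    WHERE id = 42
    UNION ALL
    -- recursive step: children of anything already in the set
    SELECT n.id, n.parent, n.name
    FROM nodes n
    JOIN subtree s ON n.parent = s.id
)
SELECT * FROM subtree;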
Avoiding EAV stores allows you to:
Use indexes and statistics where needed
Keep efficient storage (ints stored as ints, money stored as money)
Keep understandable queries (SELECT * FROM vwProducts WHERE Color = 'RED' ORDER BY Price ASC)
If you want an EAV system because you have too many attributes (>1024 per type) or they are not somewhat statically defined (many changes per hour), I would avoid using a relational database in the first place. Use an EAV (NoSQL) database server instead.
tl;dr: If you have a schema, use DDL to tell the server about it. If you don't, use a more appropriate server.
Here's a simple example of an SQL table:
CREATE TABLE persons
(
id INTEGER,
name VARCHAR(255),
height DOUBLE
);
Since I haven't used SQL very much, I haven't yet learned to think in its terms. Effectively, my brain translates the above into this:
#include <string>
using std::string;

struct Person
{
    int id;
    string name;
    double height;
    Person() : id(0), height(0.0) {}  // default constructor, needed for the array below
    Person(int id_, const char* name_, double height_)
        : id(id_), name(name_), height(height_)
    {}
};

Person persons[64];
Then, inserting some elements, in SQL:
INSERT INTO persons (id, name, height) VALUES (1234, 'Frank', 5.125);
INSERT INTO persons (id, name, height) VALUES (5678, 'Jesse', 6.333);
...and how I'm thinking of it:
persons[0] = Person(1234, "Frank", 5.125);
persons[1] = Person(5678, "Jesse", 6.333);
I've read that SQL can be thought of as two major parts: data manipulation and data definition. I'm more concerned about organizing my data in the first place, as opposed to querying and modifying it. There, in manipulation, the distinctions of SQL are immediately obvious. To me, it seems like the subtleties of how data can and should be structured in SQL are a more obscure topic.
To give a concrete example, let's say that I want each entry in my persons table (or each of my Person objects) to contain a field denoting the names of that person's children (actual fruit-of-your-loins children, not hierarchical data structure children). In reality, these would probably be cross-table references (or pointers to objects), but let's keep things simple and make this field contain zero or more names. In my C++ example, I'd modify the declaration like so:
vector<string> namesOfChildren;
...and do something like this:
persons[0].namesOfChildren.push_back("John");
persons[0].namesOfChildren.push_back("Jane");
But, from what I can tell, the typical usage of SQL doesn't mirror this approach. If I'm wrong and there's a simple, straightforward solution, great. If not, I'm sure a SQL novice like myself could benefit greatly from a little cogitation on the subject of how databases of SQL tables are meant to be used in contrast to bare, generic data structures.
To me, it seems like the subtleties of how data can and should be structured in SQL are a more obscure topic.
It's called "data(base) modeling" and sits somewhere between engineering discipline and art (like much of computer programming). If you are really interested in the topic, take a look at the ERwin Methods Guide.
Where does the array-of-structs analogy I'm automatically drawing for myself break down?
At persistency, concurrency, consistency and scalability.
Persistency: The table is automatically saved to the permanent storage. It'll stay there and survive reboots (not that a real database server will reboot much) until you explicitly delete it or there is a catastrophic hardware failure. DBMSes have well-oiled backup procedures that should help in the latter case.
Concurrency: Tables are meant to be accessed and (if need be) modified by many clients concurrently. Mechanisms such as locking and multi-version concurrency control are employed to ensure clients will not "step on each other's toes".
Consistency: You can define certain constraints (such as uniqueness, foreign keys or checks) and the DBMS will make sure they are never broken. Furthermore, this can often be done in a declarative manner, minimizing the chance of errors. On top of that, everything you do in a database is transactional, so you reap the benefits of atomicity, consistency, isolation and durability (a.k.a. "ACID"). In a nutshell, the database will defend itself from bad data.
Scalability: A well designed database schema can grow well beyond the confines of the available RAM, and still keep good performance, using techniques such as indexing, partitioning, clustering etc... Furthermore, SQL is declarative and set-based, which means that the DBMS has the latitude to pick the optimal "query execution plan" for the data at hand, auto-parallelize the query, cache the results in hope they will be reused etc... without changing the meaning of the query.
Your analogy to the array of structs is not bad ... as a start.
After this start, the differences begin with how the data is organized.
Database people love their "normal form" laws. We do not have these laws in C++ or similar programming languages. Organizing data in tables according to these laws helps database engines do their magic (queries, joins) better, i.e. keep databases compact, crunch millions of rows in fractions of a second, and serve multiple requests concurrently. They are not absolute laws: 1NF (first normal form) is followed in 99.9999% of cases, but the bigger the number (2NF, 3NF, ...), the more often DB planners allow themselves to deviate.
A description of the normal forms can be found, for example, here.
I will try to illustrate the differences using your example.
In your example, the fields of your struct correspond to the columns of the database table. Adding a vector of names as a new field of the struct would correspond to adding a comma-separated list of names to a new column of your table. This violates 1NF, which demands that one cell hold one value, not a list of values. To normalize your data, you need two arrays: one of Person structs, and a new one of Child structs. While in C++ we can use plain pointers to link each child to its parent, in SQL we must use the mechanism of keys. You already added an id field to the Person struct; now we need to add a ParentId field to the Child struct so that the database engine can find the parent. Such a ParentId column is called a foreign key. Another approach to satisfying 1NF, instead of creating a new table/struct for children, is to switch to child-centric thinking and have just one table with a record per child, each record including all the information about that child's parent. Info about the parent will obviously be repeated in as many records as the parent has children.
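A sketch of the normalized pair of tables; the persons table is the one from the question, and the children table (with its parent_id foreign key) is the new one:

CREATE TABLE persons
(
    id     INTEGER PRIMARY KEY,
    name   VARCHAR(255),
    height DOUBLE
);

CREATE TABLE children
(
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER NOT NULL REFERENCES persons (id),  -- the foreign key
    name      VARCHAR(255)
);

-- "Who are Frank's children?"
SELECT c.name
FROM children c
JOIN persons p ON c.parent_id = p.id
WHERE p.name = 'Frank';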
Note (this is also considered part of 1NF) that while in an array of structs we always know the order of the elements, in databases it is up to the engine in what order to store the records. It is just a mathematical, unordered set of records; the engine can re-sort it in internal storage for various optimizations as it likes. When you retrieve records with a SELECT statement and care about the order, you need to provide an ORDER BY clause.
2NF is about removing repetition from your records. Imagine your Person struct also had workplace-related fields: the name of the company and the company address. If many Persons in your dataset work for the same company, you would repeat the company's address in their records. We probably wouldn't tolerate such repetition in C++ either, but extracting it into a separate table is what 2NF requires. Strictly speaking, even if there is no repetition and all your Persons work in different places, 2NF still requires extracting the workplace data into a separate table, because it requires that one table represent one entity.
3NF is about removing transitive dependencies and is considered somewhat optional, so I will not describe it here. See the link above.
Another feature of databases quite different from conventional programming of data structures in C++ is indexes. Simplifying a bit, an index is just a copy of a column (or columns), i.e. a vertical slice, into a separate structure where the values are stored in their natural order, and each entry retains a reference to the whole record. So, in your example, to create an index by height you would create another array of 64 elements of the new
struct HeightIndexElem
{
    double height;
    Person* pFullRecord;  // reference back to the whole record
};
and sort that array by height. This allows the DB engine to automatically optimize certain queries; the engine itself decides when to use a given index. In C++ we usually create maps (Dictionaries in C#) to speed up finding an element by a certain characteristic, but we must use those maps ourselves; there is no automatic aspect.
There are major differences:
SQL tables are persistent -- (English translation: written to disk)
They are transactional -- (really written to disk)
They can be an arbitrary size -- (tables of several hundred million rows are quite common)
They support relational algebra -- (Joins with other tables, filtering etc.)
Relational Algebra is provable -- For a given SELECT statement there is only one possible correct answer.
The biggest differences are that when you "UPDATE" and "COMMIT" you know your data is saved in the database and will be there until you decide to "DELETE" it. When you update a structure within an array, it's gone when the machine is switched off.
The other big difference is scale. The size of a modern DBMS is only limited by your hard disk budget.
[I really like farfareast's answer from an academic stand point, but I feel the need to add a more practically oriented answer too.]
SQL tables themselves are "bare, generic data structures," as you call C++'s structures. They are just different data structures: a table is always an array of (fixed-size) structs, and the only pointers you can use are foreign keys.
For example, when you add a vector<string> to your struct, you are already using pointers internally, as strings are only a "fancy" way of writing char*. This would already require a second table in SQL (using a secondary index column to keep the elements in order). Of course there are things like PostgreSQL's arrays that can help in this specific case, but those are "only" shortcuts for similar hand-written constructs.
The real difference in data structures and algorithms comes from the fact that you can easily declare index structures. Say you know you always need to access Persons in order of their height. In C++ you'd use some kind of tree or sorted list to keep them in order; there is an STL container for that. The downside is that when you need to access them in a different order (say, by name), you have to add a second tree and duplicate the data, or start using pointers to Persons. If you add a Person, you need to update all the containers, and so on. This becomes cumbersome, and soon you'll be on the front page of The Daily WTF. SQL tables, on the other hand, can have attached indexes which automatically keep up with new and changed data. Of course their maintenance must also be paid for in performance, but managing them is basically a matter of deciding which ones your access patterns require (something needed in every case) and defining them. Compared to having to rewrite large parts of an application, this is a much more favorable situation.
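For example, reusing the persons table from the question:

-- declare the access paths once; the engine maintains them from now on
CREATE INDEX idx_persons_height ON persons (height);
CREATE INDEX idx_persons_name   ON persons (name);

-- both of these can now be served from an index, with no application-side
-- bookkeeping, unlike the hand-maintained C++ containers
SELECT * FROM persons ORDER BY height;
SELECT * FROM persons WHERE name = 'Frank';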
What I'm doing
I am creating an SQL table that will provide the back-end storage mechanism for complex-typed objects. I am trying to determine how to accomplish this with the best performance. I need to be able to query on each individual simple type value of the complex type (e.g. the String value of a City in an Address complex type).
I was originally thinking that I could store the complex type values in one record as an XML, but now I am concerned about the search performance of this design. I need to be able to create variable schemas on the fly without changing anything about the database access layer.
Where I'm at now
Right now I am thinking to create the following tables.
TABLE: Schemas
COLUMN NAME DATA TYPE
SchemaId uniqueidentifier
Xsd xml //contains the schema for the document of the given complex type
DeserializeType varchar(200) //The Full Type name of the C# class to which the document deserializes.
TABLE: Documents
COLUMN NAME DATA TYPE
DocumentId uniqueidentifier
SchemaId uniqueidentifier
TABLE: Values //The DocumentId+ValueXPath function as a PK
COLUMN NAME DATA TYPE
DocumentId uniqueidentifier
ValueXPath varchar(250)
Value text
From these tables, when performing queries, I would do a series of self-joins on the Values table. When I want to get the entire object by DocumentId, I would have a generic script for creating a view that mimics a denormalized data table of the complex type.
What I want to know
I believe there are better ways to accomplish what I am trying to do, but I am a little too ignorant about the relative performance benefits of different SQL techniques. Specifically, I don't know the performance cost of:
1 - comparing the value of a text field versus that of a varchar field
2 - different kinds of joins versus nested queries
3 - getting a view versus an XML document from the SQL db
4 - doing other things that would affect my query which I don't even know about, but which I am experienced enough to know exist
I would appreciate any information or resources about these performance issues in SQL, as well as a recommendation for how to approach this general issue in a more efficient way.
For example, here's what I am currently planning on doing.
I have a C# class Address which looks like:
public class Address {
    // setters must be public for the object initializer below to work
    public string Line1 {get;set;}
    public string Line2 {get;set;}
    public string City {get;set;}
    public string State {get;set;}
    public string Zip {get;set;}
}
An instance is constructed with new Address{Line1="17 Mulberry Street", Line2="Apt C", City="New York", State="NY", Zip="10001"}, and its XML value would look like:
<Address>
<Line1>17 Mulberry Street</Line1>
<Line2>Apt C</Line2>
<City>New York</City>
<State>NY</State>
<Zip>10001</Zip>
</Address>
Using the db schema from above, I would have a single record in the Schemas table containing an XSD definition of the address XML schema. The Address instance would get a record in the Documents table, whose DocumentId (the PK) identifies it and whose SchemaId references the Address record in the Schemas table. There would then be five records in the Values table to represent this Address.
They would look like:
DocumentId ValueXPath Value
82415E8A-8D95-4bb3-9E5C-AA4365850C70 /Address/Line1 17 Mulberry Street
82415E8A-8D95-4bb3-9E5C-AA4365850C70 /Address/Line2 Apt C
82415E8A-8D95-4bb3-9E5C-AA4365850C70 /Address/City New York
82415E8A-8D95-4bb3-9E5C-AA4365850C70 /Address/State NY
82415E8A-8D95-4bb3-9E5C-AA4365850C70 /Address/Zip 10001
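A sketch of the self-join pattern described above, run against these example rows to find documents where the city is New York and the state is NY. (Note that SQL Server's legacy text type cannot be compared with =, hence the casts; this is one of the costs asked about in point 1.)

-- one self-join per attribute in the predicate
SELECT v1.DocumentId
FROM [Values] v1
JOIN [Values] v2 ON v2.DocumentId = v1.DocumentId
WHERE v1.ValueXPath = '/Address/City'
  AND CAST(v1.Value AS varchar(max)) = 'New York'
  AND v2.ValueXPath = '/Address/State'
  AND CAST(v2.Value AS varchar(max)) = 'NY';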
Just Added a Bounty...
My objective is to obtain the resources I need in order to give my application a data access layer that is fully searchable and has a data-schema generated from the application layer that does not require direct database configuration (i.e. creating a new SQL table) in order to add a new aggregate root to the domain model.
I am open to the possibility of using .NET compatible technologies other than SQL, but I will require that any such suggestions be adequately substantiated in order to be considered.
How about looking for a solution at the architectural level? I was also breaking my head over complex graphs and performance until I discovered CQRS.
[start evangelist mode]
You can go document-based or relational as storage. Even both! (Event Sourcing)
Nice separation of concerns: Read Model vs Write Model
Have your cake and eat it too!
Ok, there is an initial learning / technical curve to get over ;)
[end evangelist mode]
As you stated: "I need to be able to create variable schemas on the fly without changing anything about the database access layer." The key benefit is that your read model can be very fast since it's made for reading. If you add Event Sourcing to the mix, you can drop and rebuild your Read Model to whatever schema you want... even "online".
There are some nice open-source frameworks out there, like NServiceBus, which save lots of time and technical challenges. It all depends on how far you want to take these concepts and what you're willing/able to spend time on. You can even start with just the basics if you follow Greg Young's approach. See the info in the links below.
See
CQRS Examples and Screencasts
CQRS Questions
Intro (Also see the video)
Somehow, what you want sounds like a painful thing to do in SQL. Basically, you should treat the inside of a text field as opaque when querying an SQL database. Text fields were not made for efficient queries.
If you just want to store serialized objects in a text field, that is fine. But do not try to build queries that look inside the text field to find objects.
Your idea sounds like you want to perform some joins, XML parsing, and XPath application to get to a value. This doesn't strike me as the most efficient thing to do.
So, my advice:
Either just store serialized objects in the db, and do nothing more than load them and perform all other operations in memory
Or, if you need to query complex data structures, you may really want to look into document stores/databases like CouchDB or MongoDB; you can also check Wikipedia on the subject. There are even databases specifically designed for storing XML, even though I personally don't like them very much.
Addendum, per your explanations above
Simply put, don't go over the top with this thing:
If you just want to persist C#/.NET objects, just use the XML Serialization already built into the framework, a single table and be done with it.
If you, for some reason, need to store complex XML, use a dedicated XML store
If you have a fixed database schema, but it is too complex for efficient queries, use a Document Store in memory where you keep a denormalized version of your data for faster queries (or just simplify your database schema)
If you don't really need a fixed schema, use just a Document Store, and forget about having any "schema definition" at all
As for your solution, yes, it could work somehow. As could a plain SQL schema, if you set it up right. But for applying an XPath, you'll probably have to parse the whole XML document each time you access a record, which wouldn't be very efficient to begin with.
If you want to check out Document databases, there are .NET drivers for CouchDB and MongoDB. The eXist XML database offers a number of Web protocols, and you can probably create a client class easily with VisualStudio's point-and-shoot interface. Or just google for someone who already did.
I need to be able to create variable schemas on the fly without changing anything about the database access layer.
You are re-implementing the RDBMS within an RDBMS. The DB can do this already - that is what the DDL statements like create table and create schema are for....
I suggest you look into "schemas" and SQL security. There is no reason with the correct security setup you cannot allow your users to create their own tables to store document attributes in, or even generate them automatically.
Edit:
Slightly longer answer: if you don't have full requirements immediately, I would store the data as the XML data type and query it using XPath queries. This will be OK for occasional queries over smallish numbers of rows (fewer than a few thousand, certainly).
Also, your RDBMS may support indexes over XML, which may be another way of solving your problem: CREATE XML INDEX in SQL Server 2008, for example.
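A sketch in SQL Server terms, assuming the Documents table gained an XML column named Body (the names are placeholders):

-- requires a clustered primary key on the table
CREATE PRIMARY XML INDEX IX_Documents_Body ON Documents (Body);

-- XPath-style queries can then use the index
SELECT DocumentId,
       Body.value('(/Address/City)[1]', 'nvarchar(100)') AS City
FROM Documents
WHERE Body.exist('/Address[City="New York"]') = 1;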
However, for frequent queries, you can use triggers or materialized views to create copies of the relevant data in table format, so that more intensive reports can be sped up by querying the breakout tables.
I don't know your requirements, but if you are responsible for creating the reports/queries yourself, this may be an approach to use. If you need to enable users to create their own reports that's a bigger mountain to climb.
I guess what I am saying is: are you sure you need to do this, and can't XML just do the job?
In part, it will depend on your DB engine. You're using SQL Server, aren't you?
Answering your topics:
1 - Comparing the value of a text field versus that of a varchar field: if you're comparing two db fields, varchar fields are the smarter choice. Nvarchar(max) stores data in Unicode with 2*l+2 bytes, where "l" is the length. For performance, you will need to consider how much larger the tables will be, in order to select the best way to index (or not index) your table fields. See the topic.
2 - Sometimes nested queries are easily created and executed, and also serve as a way to reduce query time. But depending on the complexity, it may be better to use some kind of join. The best way is to try it both ways. Execute each query two or more times, since the DB engine "compiles" a query on first execution, making subsequent runs quite a bit faster. Measure the times for different parameters and choose the best option.
"Sometimes you can rewrite a subquery to use JOIN and achieve better performance. The advantage of creating a JOIN is that you can evaluate tables in a different order from that defined by the query. The advantage of using a subquery is that it is frequently not necessary to scan all rows from the subquery to evaluate the subquery expression. For example, an EXISTS subquery can return TRUE upon seeing the first qualifying row." - link
3 - There's not much information in this question, but if you will get the XML document directly from the table, it would be a good idea to do that instead of using a view. Again, it will depend on the view and the document.
4 - Other issues concern the total number of records expected in your table, and the indexing of the columns, for which you need to consider sorting, joining, filtering, and PKs and FKs. Each situation can demand a different approach. My suggestion is to invest some time reading about how your database engine and its queries work, and relating that to your system.
I hope I've helped.
Interesting question.
I think you may be asking the wrong question here. Broadly speaking, as long as you have a FULLTEXT index on your text field, queries will be fast. Much faster than varchar if you have to use wildcards, for instance.
However, if I were you, I'd concentrate on the actual queries you're going to be running. Do you need boolean operators? Wildcards? Numerical comparisons? That's where I think you will encounter the real performance worries.
I would imagine you would need queries like:
"find all addresses in the states of New York, New Jersey and Pennsylvania"
"find all addresses between house numbers 1 and 100 on Mulberry Street"
"find all addresses where the zipcode is missing, and the city is New York"
At a high level, the solution you propose is to store your XML somewhere, and then de-normalize that XML into name/value pairs for querying.
Name/value pairs have a long and proud history, but become unwieldy in complex query situations, because you're not using the built-in optimizations and concepts of the relational database model.
Some refinements I'd recommend: look at the domain model, and at least see if you can factor separate data types out into the "value" column; you might end up with "textValue", "moneyValue", "integerValue" and "dateValue". In the example you give, you might factor "address 1" into "housenumber" (as an integer) and "streetname".
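A sketch of that refinement applied to the Values table from the question (the HouseNumber path is a hypothetical factored-out field, and the column types are guesses):

CREATE TABLE [Values] (
    DocumentId   uniqueidentifier NOT NULL,
    ValueXPath   varchar(250)     NOT NULL,
    textValue    varchar(max)     NULL,
    moneyValue   money            NULL,
    integerValue int              NULL,
    dateValue    datetime         NULL,
    PRIMARY KEY (DocumentId, ValueXPath)
);

-- numeric comparisons no longer go through string semantics:
SELECT DocumentId
FROM [Values]
WHERE ValueXPath = '/Address/HouseNumber'
  AND integerValue BETWEEN 1 AND 100;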
Having said all this, I don't think there's a better solution, short of completely changing tack to a document-focused database.
I've been working on a database and I have to deal with a TEXT field.
Now, I believe I've seen some place mention that it would be best to isolate the TEXT column from the rest of the table (putting it in a table of its own).
However, now I can't find this reference anywhere and since it was quite a while ago, I'm starting to think that maybe I misinterpreted this information.
Some research revealed this, suggesting that
Separate text/blobs from metadata, don't put text/blobs in results if you don't need them.
However, I am not familiar with the definition of "metadata" being used here.
So I wonder: are there any relevant advantages to putting a TEXT column in a table of its own? What are the potential problems of having it with the rest of the fields? And the potential problems of keeping it in a separate table?
This table (without the TEXT field) is supposed to be searched (SELECTed) rather frequently. Is "premature optimization considered evil" relevant here? (If there really is a penalty for TEXT columns, how relevant is it, considering it is fairly easy to change this later if needed?)
Besides, are there any good links on this topic? (Perhaps Stack Overflow questions & answers? I've tried to search this topic, but I only found TEXT vs VARCHAR discussions.)
Yep, it seems you've misinterpreted the meaning of the sentence. What it says is that you should only do a SELECT including a TEXT field if you really need the contents of that field. This is because TEXT/BLOB columns can contain huge amounts of data which would need to be delivered to your application, and this takes time and, of course, resources.
This is probably premature optimisation. Performance tuning MySQL is really tricky and can only be done with real performance data for your application. I've seen plenty of attempts to second-guess what makes MySQL slow without real data, and the result each time has been a messy schema and complex code which actually make performance tuning harder later on.
Start with a normalised, simple schema; then, when something proves too slow, add complexity only where/if needed.
As others have pointed out, the quote you mentioned is more applicable to query results than to the schema definition; in any case, your choice of storage engine would affect the validity of the advice anyway.
If you do find yourself needing to add the complexity of moving TEXT/BLOB columns to a separate table, then it's probably worth considering the option of moving them out of the database altogether. Often file storage has advantages over database storage especially if you don't do any relational queries on the contents of the TEXT/BLOB column.
Basically, get some data before taking any MySQL tuning advice you get on the Internet, including this!
The data for a TEXT column is already stored separately. Whenever you SELECT * from a table with text column(s), each row in the result set requires a lookup into the text storage area. This, coupled with the very real possibility of huge amounts of data, would be a big overhead for your system.
Moving the column to another table simply requires an additional lookup: one into the secondary table, plus the normal one into the text storage area.
The only time that moving TEXT columns into another table will offer any benefit is if there is a tendency to usually select all columns from tables. This is merely introducing a second bad practice to compensate for the first. It should go without saying that two wrongs are not the same as three lefts.
The concern is that a large text field (way over 8,192 bytes, say) will cause excessive paging and/or file I/O during complex queries on unindexed fields. In such cases, it's better to migrate the large field to another table and replace it with the new table's row id or index (which would then be metadata, since it doesn't actually contain the data).
The disadvantages are:
a) More complicated schema
b) If the large field is usually inspected or retrieved, there is no advantage
c) Ensuring data consistency is more complicated and a potential source of database malaise.
There might be some good reasons to separate a text field out of your table definition. For instance, if you are using an ORM that loads the complete record no matter what, you might want to create a properties table to hold the text field so it doesn't load all the time. However, if you control the code 100%, then for simplicity leave the field on the table, and only select it when you need it, to cut down on data transfer and reading time.
Now, I believe I've seen some place mention that it would be best to isolate the TEXT column from the rest of the table (putting it in a table of its own).
However, now I can't find this reference anywhere and since it was quite a while ago, I'm starting to think that maybe I misinterpreted this information.
You probably saw this, from the MySQL manual
http://dev.mysql.com/doc/refman/5.5/en/optimize-character.html
If a table contains string columns such as name and address, but many queries do not retrieve those columns, consider splitting the string columns into a separate table and using join queries with a foreign key when necessary. When MySQL retrieves any value from a row, it reads a data block containing all the columns of that row (and possibly other adjacent rows). Keeping each row small, with only the most frequently used columns, allows more rows to fit in each data block. Such compact tables reduce disk I/O and memory usage for common queries.
This is indeed telling you that, in MySQL, you are discouraged from keeping TEXT data (and BLOB, as written elsewhere) in frequently searched tables.
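A sketch of the split the manual describes, with made-up table names:

-- hot, frequently searched columns stay compact
CREATE TABLE articles (
    id        INT PRIMARY KEY,
    title     VARCHAR(200),
    author_id INT
);

-- the bulky column lives in its own table, joined only when needed
CREATE TABLE article_bodies (
    article_id INT PRIMARY KEY,
    body       TEXT,
    FOREIGN KEY (article_id) REFERENCES articles (id)
);

-- common queries never touch the TEXT data
SELECT id, title FROM articles WHERE author_id = 7;

-- fetch the body only when it is actually needed
SELECT a.title, b.body
FROM articles a
JOIN article_bodies b ON b.article_id = a.id
WHERE a.id = 3;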