Strategy for Storing Multiple Nullable Booleans in SQL

I have an object (happens to be C#) with about 20 properties that are nullable booleans. There will be perhaps a few million such objects persisted to a SQL database (currently SQL Server 2008 R2, but MySQL may need to be supported in the future). The instances themselves are relatively large because they contain about a paragraph of text as well as some other unrelated properties.
For a given object instance, most of the properties will be null most of the time.
When users search for instances of such objects, they will select perhaps 1-3 of the nullable boolean properties and search for instances where at least one of those 1-3 properties is non-null (OR search).
My first thought is to persist the object to a single table with nullable BIT columns representing the nullable boolean properties. However, this strategy will require one index per BIT column to avoid performing a table scan when searching. Further, each index would not be particularly selective since there are only three possible values per index.
Is there a better way to approach this problem?

For performance reasons, I would suggest that you split the table into two tables.
Create one narrow table with the primary key and the bit fields you search on. Have another table that holds the additional data (such as the paragraph). Use the first for the WHERE conditions, joining in the second to get the data you want. Something like:
select p.*
from BitFields bf join
Paragraph p
on bf.bfid = p.bfid
where <conditions on the bit fields>
With a bunch of binary/ternary fields, I wouldn't think that indexes would help much, so the query engine will resort to a full table scan. If you put the bit fields in one narrow table, you can keep that table in memory and achieve good performance.
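To make the idea concrete, here is a minimal sketch of the split (table and column names are placeholders matching the query above; SQL Server syntax assumed):
CREATE TABLE BitFields (
    bfid  INT NOT NULL PRIMARY KEY,
    flag1 BIT NULL,
    flag2 BIT NULL,
    flag3 BIT NULL
    -- ... one nullable BIT column per boolean property (about 20 in total)
);
CREATE TABLE Paragraph (
    bfid      INT NOT NULL PRIMARY KEY,
    body_text NVARCHAR(MAX) NULL,  -- the paragraph of text and the other wide columns
    FOREIGN KEY (bfid) REFERENCES BitFields (bfid)
);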
The alternative is to store the fields as name value pairs. If you really have lots of such fields (say many hundreds or thousands) and only a few are used in a given row (say a dozen or so), then an entity-attribute-value (EAV) structure might work better. This is a table with three important columns:
Entity id (what I call bfid above).
Attribute id (the particular attribute)
Value (true or false)
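A minimal sketch of that EAV table, with hypothetical names (a missing row stands in for null, so only non-null flags are stored):
CREATE TABLE EntityFlag (
    bfid         INT      NOT NULL,  -- entity id (bfid above)
    attribute_id SMALLINT NOT NULL,  -- which boolean property
    flag_value   BIT      NOT NULL,  -- true or false; no row at all means "null"
    PRIMARY KEY (bfid, attribute_id)
);
CREATE INDEX IX_EntityFlag_attribute ON EntityFlag (attribute_id, bfid);

-- OR search over 1-3 selected properties: any matching row means "non-null"
SELECT DISTINCT bfid
FROM EntityFlag
WHERE attribute_id IN (3, 7, 12);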

Related

Best performance in filtering large datasets: = ANY() or #> on an array column; XOR on a bitmapped binary column; join a m2m table or something else?

The General Question:
This question pertains to tables with a large number of rows (say millions) that have a many-to-many relationship with a relatively small set of data (say tens). For example you might have 20 different tags or types or categories and each of your 10 million records is associated with one or more of them.
You could have a separate table for your tags with 20 rows and then use a many to many table to define the relationships between those 20 tags and your 10 million records.
In Postgres there are array types that could be used instead. In MySQL there is the SET column type, which works like that.
You could also use a bitmap or bit string type and perform an XOR on that column.
My question is which of these solutions (or an alternative solution) is best for performance when querying the large table for records that are associated with one member of the small set. A solution should include what indexes to create and use.
My Specifics:
I've tried to keep the question here as generic as possible because I believe the answer could be applicable in many fairly common scenarios. However, for clarity I'll describe my specific situation now.
I have a table with millions of titles. Each title is associated with one or more languages. For example 'Don Quixote' may be 'Дон Кихот' in Russian, Bulgarian, Kazakh and 'Don Quijote' in a bunch of other languages. I have a search string and a language and want to find the best match in titles. I'm using Postgres and have a trigram index using gin on the titles. An example search would be find matches for 'Дон Кихот' in Bulgarian.
Currently I have the languages in a char(2)[] array column using two-letter language codes. I assume using an int array with language IDs would be better, but I want to go for what is best. I'm not worried about how much effort it would take to set up a bitmap for languages to do an XOR search, or whatever effort and complexity is involved in implementing a particular solution. The performance is what matters.
I would tend to think that JOINing a many-to-many table would not be the best solution because that table would have multiple entries per row in the title table, and so it would be huge. But perhaps I'm wrong about that because that is what RDBMSs are designed to do.
Huge thanks to all of you who spend your highly valuable time answering these questions.
The PostgreSQL documentation warns against searching arrays:
Arrays are not sets; searching for specific array elements can be a sign of database misdesign. Consider using a separate table with a row for each item that would be an array element. This will be easier to search, and is likely to scale better for a large number of elements.
SQL in general, and PostgreSQL in particular, is astonishingly good at joining vast many-to-many tables, especially when their rows are modestly sized, their columns' data types match the columns you use for joining, and they have the right indexes.
So avoid arrays for the application you describe.
For a two-column junction table like you describe, you'd define it as something like this:
CREATE TABLE title_lang (
title_id BIGINT NOT NULL,
lang_id SMALLINT NOT NULL,
PRIMARY KEY (title_id, lang_id)
);
CREATE INDEX lang_title ON title_lang (lang_id, title_id);
I used SMALLINT for the language id. When you use character types, the database stores Unicode characters, which can take unexpected amounts of space, while indexes on integers are very efficient. But you should use the data type that makes sense in the rest of your schema.
I suggest a primary key going from title to language, and a reverse key going the other way. You can omit the reverse key if you don't go from language to title.
To associate a title with a language you insert a row in the table. To get rid of the association you delete the row.
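For illustration, assuming a titles table with id and title columns and the pg_trgm extension (the language id 7 is a made-up example), typical statements could look like this:
INSERT INTO title_lang (title_id, lang_id) VALUES (123, 7);
DELETE FROM title_lang WHERE title_id = 123 AND lang_id = 7;

-- best trigram matches among titles in one language
SELECT t.*
FROM titles t
JOIN title_lang tl ON tl.title_id = t.id
WHERE tl.lang_id = 7                        -- e.g. Bulgarian
  AND t.title % 'Дон Кихот'                 -- pg_trgm similarity operator; can use the trigram GIN index
ORDER BY similarity(t.title, 'Дон Кихот') DESC
LIMIT 10;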

PostgreSQL - What should I put inside a JSON column

The data I want to store has these characteristics:
There are a finite number of fields (I don't expect to add new fields);
There are some columns that are common to all sets of data (a category field, for instance);
There are some columns that are specific to individual sets of data (each category needs its own fields);
Here's how it would look in a regular table:
I'm having trouble figuring out which would be the better way to store this data in a database for this situation.
Below are the ideas I already had:
Do exactly as in the regular table above (I would have many NULL values);
Divide the categories into tables (I would use joins when needed);
Use JSON type for storing the values (no NULL values and having it all in same table).
So my questions are:
Is one of these solutions (or one that I have not thought about) better for this case?
Are there other factors, other than the ones presented here, that I should consider to make this decision?
Unless you have very many columns (~ 100), it is usually better to use normal columns. NULL values don't take any storage space in PostgreSQL.
On the other hand, if you have queries that can use any of these columns in the WHERE condition, and you compare with =, a single GIN index on a jsonb column might be better than having many B-tree indexes, because the cost of maintaining all those B-tree indexes would be higher.
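For example, a single GIN index with the jsonb_path_ops operator class can serve equality filters on any key via the containment operator (table and column names here are hypothetical):
CREATE INDEX idx_items_attrs ON items USING gin (attrs jsonb_path_ops);

SELECT *
FROM items
WHERE attrs @> '{"color": "red"}';  -- equality filter on an arbitrary key; can use the GIN index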
The definitive answer depends on the SQL statements that you plan to run on that table.
You have laid out the three options pretty well. Things to consider are:
Performance
Data size
Ease of maintenance
Flexibility
Security
Note that you don't even allude to security considerations. But security at the table level is usually a tad simpler than at the column level and might be important for regulated data such as PII (personally identifiable information).
The primary strength of the JSON solution is flexibility. It is easy to add new columns. But you don't need that. JSON has a cost in data size and data type flexibility (notably JSON doesn't support date/times explicitly).
A multiple table solution requires duplicating the primary key but may result in much less storage overall if the columns really are sparse. The "may" may also depend on the data type. A NULL string for instance occupies less space than a NULL float in a table record.
The joins on multiple tables will be 1-1 on primary keys. These should be pretty fast.
What would I do? Unless the answer is obvious, I would dump the data into a single table with a bunch of columns. If that table starts to get unwieldy, then I would think about splitting it into separate tables -- but still have one table for the common columns. The details of one or multiple tables can be hidden behind a view.
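A rough sketch of that layout, with hypothetical table names: one table for the common columns, one per category, and a view that glues them back together:
CREATE VIEW v_items AS
SELECT c.id, c.category,
       a.field_a1, a.field_a2,
       b.field_b1
FROM items_common c
LEFT JOIN items_cat_a a ON a.id = c.id
LEFT JOIN items_cat_b b ON b.id = c.id;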
Depends on how much data you want to store, but as long as it is finite it shouldn't make a big difference whether it contains a lot of NULLs or not.

Use separate fields or single field as CSV in SQL

I need to save about 500 values in a structured database (SQL, PostgreSQL) or whatever. What is the best way to store the data: 500 separate fields, or a single field with comma-separated values (CSV)?
What would be the pros and cons?
What would be easier to maintain?
Which would be better for retrieving data?
A comma-separated value is just about never the right way to store values.
The traditional SQL method would be a junction or association table, with one row per field and per entity. This multiplies the number of rows, but that is okay, because databases can handle big tables. This has several advantages, though:
Foreign key relationships can be properly defined.
The correct type can be implemented for the object.
Check constraints are more naturally written.
Indexes can be built, incorporating the column and improving performance.
Queries do not need to depend on string functions (which might be slow).
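As a sketch of that junction-table approach (names are assumptions, as is the existence of entities and fields tables; pick the value type that matches the real data):
CREATE TABLE entity_value (
    entity_id BIGINT  NOT NULL REFERENCES entities (id),
    field_id  INTEGER NOT NULL REFERENCES fields (id),
    val       NUMERIC NOT NULL,  -- use whatever type actually fits the data
    PRIMARY KEY (entity_id, field_id)
);
CREATE INDEX ON entity_value (field_id, entity_id);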
Postgres also supports two other methods for such data, arrays and JSON-encoding. Under some circumstances one or the other might be appropriate as well. A comma-separated string would almost never be the right choice.

"Canonical" approach for mapping custom queries to hierarchical entities with user-defined key/value pairs

In about every SQL-based database application I have worked on so far, sooner or later the following three-faceted requirement has popped up:
There is some entity, linked in a hierarchical fashion (i.e. the tuples form a tree structure).
Users must be able to define any number of custom attributes with values for the tuples, and these values are inherited/overridden towards the leaves of the tree structure. ("Dumb" attributes usually suffice. That is, no uniqueness constraints, no foreign keys, only one value per attribute, ...)
Users must be able to run arbitrary queries on this data (i.e. custom boolean expressions, based upon filters for the values of the user-defined attributes that are linked with AND/OR).
Storing the data, roughly matching the first two bullets above, is quite straightforward:
The hierarchy is built up by giving the respective table a parent column. This column will be null for root nodes, and a pointer to the ID of the parent node for all other nodes.
The user-defined attributes are stored according to the entity-attribute-value pattern.
While there are numerous resources that suggest to use a different approach especially in the latter point (e.g. answers here, here, or here), I have not usually been in a position to move away from a traditional static relational database schema. Hence, let's simply assume the above as a given. Also, hardly ever could I rely on the specifics of a particular DBMS; the more usual case was systems that were supposed to work with MS SQL Server, Oracle, and possibly others as backends without requiring two significantly different product versions.
Solving the third item, however, is always problematic (even without considering the hierarchical inheritance of attribute values). The number of joins depends on the number of distinct attributes referenced in the boolean expression. Alternatively, the number of joins can be reduced somewhat by determining the maximum number of distinct attributes that appear in any one branch of the custom boolean expression, which may save joins but makes the resulting queries, and the code used to generate them, even less intelligible and maintainable. For instance,
a = 5 or (b = 8 and c = 9)
could do with 2 joins to the attribute-value table.
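To illustrate (just a sketch against a hypothetical attribute_value(entity_id, attribute, value) table, not a canonical pattern): the largest AND-group here has two attributes (b and c), so two joins suffice:
SELECT DISTINCT e.id
FROM entity e
JOIN attribute_value av1 ON av1.entity_id = e.id
LEFT JOIN attribute_value av2
       ON av2.entity_id = e.id AND av2.attribute = 'c'
WHERE (av1.attribute = 'a' AND av1.value = '5')
   OR (av1.attribute = 'b' AND av1.value = '8' AND av2.value = '9');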
I have always been able to do this "somehow", but as this appears to be a fairly ubiquitous situation, I am looking for the "canonical" way to generate SQL queries in this situation. Is there a "standard pattern" to follow here?
Be careful not to fall prey to the inner-platform effect. It is a complicated problem, and SQL itself is designed to handle the complexities. Generate DDL to add and remove columns as needed, and generate simple SELECT statements for queries. Store each tuple type (distinct set of attributes) as a table.
With regards to inheritance, I recommend handling it in the application or DAL, and only storing the non-inherited values. On retrieval, read all parent rows to calculate the functional values. If you do need to access "functional" values from SQL, use an indexed view or triggers to maintain them separate from storage.
Hierarchies can be represented as you describe, but a simple "Parent" column can make it difficult to query beyond a single level. Look at hierarchyid on SQL Server or CONNECT BY on Oracle.
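If you cannot rely on vendor-specific features, a recursive common table expression is the portable way to walk such a parent-pointer hierarchy (a sketch with a hypothetical node table; SQL Server and Oracle omit the RECURSIVE keyword):
WITH RECURSIVE subtree (id, parent_id, depth) AS (
    SELECT id, parent_id, 0 FROM node WHERE id = 42  -- root of the subtree of interest
    UNION ALL
    SELECT n.id, n.parent_id, s.depth + 1
    FROM node n
    JOIN subtree s ON n.parent_id = s.id
)
SELECT * FROM subtree;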
Avoiding EAV stores allows you to:
Use indexes and statistics where needed
Keep efficient storage (ints stored as ints, money stored as money)
Keep understandable queries (SELECT * FROM vwProducts WHERE Color = 'RED' ORDER BY Price ASC)
If you want an EAV system because you have too many attributes (>1024 per type) or they are not somewhat statically defined (many changes per hour), I would avoid using a relational database in the first place. Use an EAV (NoSQL) database server instead.
tl;dr: If you have a schema, use DDL to tell the server about it. If you don't, use a more appropriate server.

Storing flags in SQL column, and indexing them

I need to store a set of flags that are related to an entity into database. Flags might not be the best word because these are not binary information (on/off), but rather a to-be-defined set of codes.
Normally, you would store each piece of information (say each flag value) in a distinct column, but I'm exploring opportunities for storing such information in data structures different from one-column-for-each-attribute, to prevent a dramatic increase in column mappings. Since each flag applies to each attribute of an entity, for large entities that intrinsically require a large number of columns the total number of columns would double (grow as 2n).
Eventually, these codes can be mapped to a positional string.
I'm thinking about something like: 02A not being interpreted as dec 42 but rather as:
Flag 0 in position 1 (or zero if you prefer...)
Flag 2 in position 2
Flag A in position 3
Data formatted in such a way can be easily processed by high-level programming languages, because PL/SQL is out of the scope of the question and all these values are supposed to be processed by Java.
Now the real problem
One of my specs is to optimize searching. I have been required to find a way (say, an efficient way) to seek for entities that show a certain flag (or a special 0 flag) in a given position.
Normally, in SQL, given the RDBMS-specific substring function, you would
SELECT * FROM ENTITIES WHERE SUBSTRING(FLAGS,{POSITION},1) = {VALUE};
This works, but I'm afraid it may be a little slow on all platforms but Oracle, which, AFAIK, supports creating secondary indexes mapped to a substring.
However, my solution must work in MySQL, Oracle, SQL Server and DB2 thanks to Hibernate.
Given such a design, is there some, possibly cross-platform, indexing strategy that I'm missing?
If performance is an issue, I would go for a somewhat different model here.
Say a table that stores entities and a 1->N relation to another table (say a flags table: entId (FK), flag, position), and this table would have an index on flag and position.
The issue here would be getting these flags back into a single column, which can be done in Java or even in the database (but it would be difficult to write a cross-platform query for that).
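Something like this (hypothetical names; "pos" stands for the position in the answer's sketch), so the lookup by flag and position becomes a plain indexed equality search:
CREATE TABLE entity_flag (
    ent_id BIGINT  NOT NULL,  -- FK to the entity table
    pos    INT     NOT NULL,
    flag   CHAR(1) NOT NULL,
    PRIMARY KEY (ent_id, pos)
);
CREATE INDEX ix_entity_flag ON entity_flag (flag, pos);

-- entities showing flag 'A' in position 3
SELECT DISTINCT ent_id
FROM entity_flag
WHERE flag = 'A' AND pos = 3;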
If you want a database-independent, reasonable method for storing such flags, then use typical SQL data types. For a binary flag, you can use bit or boolean (this differs among databases). For other flags, you can use tinyint or smallint.
Doing bit-fiddling is not going to be portable. If nothing else, the functions used to extract particular bits from data differ among databases.
Second, if performance is an issue, then you may need to create indexes to avoid full table scans. You can create indexes on normal SQL data types (although some databases may not allow indexes on bits).
It sounds like you are trying to be overly clever. You should first get the application to work using reasonable data structures. Then you will understand where the performance issues are and can work on fixing them.
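As a sketch of what "reasonable data structures" could mean here (hypothetical names; SMALLINT chosen only because it is portable and indexable everywhere):
CREATE TABLE entity (
    id     BIGINT   NOT NULL PRIMARY KEY,
    flag_a SMALLINT NULL,  -- one ordinary column per flag; BOOLEAN/BIT where the DBMS supports it
    flag_b SMALLINT NULL,
    flag_c SMALLINT NULL
);
CREATE INDEX ix_entity_flag_a ON entity (flag_a);

SELECT * FROM entity WHERE flag_a = 2;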
I have improved my design and performed a benchmark and found an interesting result.
I created a dummy demographic entity with first/last name columns, birthdate, birthplace, email, SSN...
Then in version 1
I added a column VALIDATION VARCHAR(40) NULL DEFAULT NULL with an index on it.
Instead of positional flags, the new column contains an unordered set of codes each representing a specific format error (e.g. A01 means "last name not specified", etc.). Each code is terminated by a colon : symbol.
Example columns look like
NULL
'A01:A03:A10:'
'A05:'
Typical queries are:
SELECT * FROM ENTITIES WHERE VALIDATION IS {NOT} NULL
Search for entities that are valid/invalid (NULL = no problem)
SELECT * FROM ENTITIES WHERE VALIDATION LIKE '%AXX:';
Selects entities with a specific problem
Then in version 2
I added an indexed column VALID TINYINT NOT NULL, where 0=invalid and 1=valid (Hibernate maps a Boolean to a TINYINT in MySQL).
I added a lookup table
CREATE TABLE ENTITY_VALIDATION (
ID BIGINT NOT NULL PRIMARY KEY,
PERSON_ID BIGINT NOT NULL, -- REFERENCES PERSONS(ID) omitted for performance
ERROR CHAR(3) NOT NULL
)
With indexes on both PERSON_ID and ERROR. This represents the 1:N relationship.
Queries:
SELECT * FROM ENTITIES WHERE VALID = {0|1}
Select invalid/valid entities
SELECT * FROM ENTITIES JOIN ENTITY_VALIDATION ON ENTITIES.ID = ENTITY_VALIDATION.PERSON_ID WHERE ERROR = 'Axx';
Selects entities with a given problem
Then I benchmarked
the COUNT(*) of the queries above via JUnit+JDBC, i.e. the same queries with * replaced by COUNT(*).
I did several benchmarks, with the entity table containing 100k, 250k, 500k, 750k, and 1M entities and a mean entity:flag ratio of 1:3 (on average 3 errors for each entity).
The result
is displayed below. While looking up valid/invalid entities performs equally well in both versions, it looks like MySQL is faster with the LIKE operator than with the JOIN, even though indexes are in place.
Of course,
This was only a benchmark on MySQL. While the approach is cross-platform, the benchmark does not (yet) compare performance across different DBMSes.