Best practices for schemas with many boolean variables

Best practices for schemas with many boolean variables - sql

I'm creating a Postgresql database where we have many (10-40) variables that will have boolean values. I'd like to figure out what the best way to store this data is, given moderate numbers of updates and lots of multi-column searches.
It seems pretty straightforward to just create the 30 or so boolean columns and create multi-column indexes where necessary. Alternatively, someone suggested creating a bit string that combines all of the booleans. It seems like the second option should be faster, but the answers other people have given online seem to be contradictory (see below).
Any suggestions or explanations would be helpful. The data is tens of millions of rows, but not larger, and I expect selects to return somewhere between 1/100 to 1/4 of the data.
https://stackoverflow.com/questions/14067969/optimized-sql-using-bitwise-operator
alternative to bitmap index in postgresql
UPDATE:
I found one resource that suggests using ints or big ints if you have more than a few variables (where you should use separate columns) and fewer than 33 or so (where you switch to bitstrings). This seems to be motivated more by storage size than ease of search.
https://dba.stackexchange.com/questions/25073/should-i-use-the-postgresql-bit-string

I've found a related discussion at the Database Administrators site.
First, I would define/analyze what is "best" in your context. Are you just looking for speed? What is your search pattern? Is data/disk volume an issue?
What alternatives do you have? Besides of bitstrings, one could use ordinary text strings, integer arrays, and separate columns. To get the data fast, you have to think about indexing. You mentioned multi-column indexes. Would it make sense to store/index the same bit variable in several indices?
40 bits without too many duplicate record means up to 2^20 = 1.1E12 records. This makes a full-table scan a lengthy affair. On the other hand, indexing is not really helpful, if you have a lot of duplicate keys.
If you are expecting a resultset of some 25%, you would have to transfer 2.7E11 (partial) records between database and application. Assuming 10,000 records/s, this would take 7,736 hours or 10 months.
My conclusion is that you should think about storing the data in big BLOBs (1.1E12 x 40 bits is just 40 GByte). You could partition your data, read the interesting part into memory and do the search there. This is more or less what a BigData or Datawarehouse system is doing.

Related

SQL Server column design

I always tried to make my sql database as simple and as understandable as possible.
Until now I always used a limited number of columns, I think I never had more than 20. Now, there is one thing, that would make my life easier, if I had much more columns. Let´s say 200 columns. (not rows). What do you think about it?
I just want to know, if it is a bad idea, not why i´m doing this or if there are other possibilities, just if somebody has already experienced something like that and if it is a bad idea to do such a table.

Fewer, smaller width columns is better than lots of columns and/or large width columns.
Why? Because the narrower the row size, the more rows you fit on a 8K page. That means you do less I/O and use less memory to buffer pages. That is always a good thing.
In those (hopefully) rare cases, where the domain requires many attributes on an object (with the assumption of 1-1 object-table mapping), you should consider splitting into two tables ina 1-1 relationship, one containing the frequently used columns.

I don't think it is black and white. Having a large row size (implied by the large number of columns) will hurt performance (i.e., more I/O) -- but there are cases where taking a small hit in performance in one place will be offset by increased performance in others.
I'd say it depends on how many rows you expect this table to have, how often will it will be queried, how many of those additional columns will really be accessed, and how it would compare to your alternative design in terms of efficiency and complexity.

Luke--
It really depends on the type of the system you are working with. Example in transactional systems, most tables have at most 50 columns or so with almost no redundant data attrributes ( If you have a process date, you would not need the Process Month or the process year as a seperate column). This of course is because the records are updated/inserted frequently and you'll need to update all the redundant attributes everytime you update one row.
In Data Warehouse/reporting environments, for Dimension tables (which have the attributes for an entity) it is typical to have 100+ columns as there are could be various ways you want to categorize a given entity.The Updates here are not so much a problem as data is typically loaded once during off-peak hours and then is used mostly in selects.
Take a look at these links to know more..
http://en.wikipedia.org/wiki/Database_normalization
http://en.wikipedia.org/wiki/Star_schema
So the answer is it depends... If you want a perfectly relational system, then may be 200+ columns is kind of a red flag indicating you should look at normalize your data (May be not). Updates and Indexes are two things that you should be concerned with in such a system.

You are using SQL Server, which I think defaults to row-oriented storage (all fields in a row are stored together in a page), which can be a problem with large number of columns. However, if you use column-oriented storage, the number of columns per table does not matter because each column is stored together. I don't know if this is possible with SQL Server.

Maximizing performance of SQL indexes for VARCHAR columns with homogeneous prefixes

I'm designing a DB2 table, one VARCHAR column of which will store an alpha-numeric product identifier. The first few characters of these IDs have very little variation. The column will be indexed, and I'm concerned that performance may suffer because of the common prefixes.
As far as I can tell, DB2 does not use hash codes for selecting VARCHARs. (At least basic DB2, I don't know about any extensions.)
If this is to be a problem, I can think of three obvious solutions.
Create an extra, hash code column.
Store the text backward, to ensure good distribution of initial characters.
Break the product IDs into two columns, one containing a long enough prefix to produce better distribution in the remainder.
Each of these would be a hack, of course.
Solution #2 would provide the best key distribution. The backwards text could be stored in a separate column, or I could reverse the string after reading. Each approach involves overhead, which I would want to profile and compare.
With solution #3, the key distribution still would be non-optimal, and I'd need to concatenate the text after reading, or use 3 columns for the data.
If I leave my product IDs as-is, is my index likely to perform poorly? If so, what is the best method to optimize the performance?

I'm a SQL dba, not db2, but I wouldn't think that having common prefixes would hurt you at all, indexing wise.
The index pages simply store a "from" and "to" range of key values with pointers to the actual pages. The fact that an index page happens to store FrobBar001291 to FrobBar009281 shouldn't matter in the slightest to the db engine.
In fact, having these common prefixes allows the index to take advantage of other queries like:
SELECT * FROM Products WHERE ProductID LIKE 'FrobBar%'

I agree with BradC that I don't think this is a problem at all, and even if there was some small benefit to the alternatives you suggest, I imagine all the overhead and complexity would outweigh any benefits.
If you're looking to understand and improve index performance, there are a number of topics in the Info Center that you should consider (in particular the last two topics seem relevant): http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/nav/2_3_2_4_1 like:
Index structure
Index cleanup and maintenance
Asynchronous index cleanup
Asynchronous index cleanup for MDC tables
Online index defragmentation
Using relational indexes to improve performance
Relational index planning tips
Relational index performance tips

What indexing implementations can handle arbitrary column combinations?

I am developing a little data warehouse system with a web interface where people can do filtered searches. There are current about 50 columns that people may wish to filter on, and about 2.5 million rows. A table scan is painfully slow. The trouble is that the range of queries I'm getting have no common prefixes.
Right now I'm using sqlite3, which will only use an index if there the columns required are the leftmost columns in that index. This seems to mean I'd need a lot of indexes. A quick glance at MySQL suggests it would also require many indexes for this kind of query.
My question is what indexing implementations are available for different database systems which can handle this kind of query on arbitrary combinations of columns?
I've prototyped my own indexing scheme; I store extra tables which list integer primary keys in my big table where each value for each column occur, and I keep enough statistics to be able to first examine the values with the smallest number of matches. It works okay; much better than a table scan but still a bit on the slow side, which is unsurprising for a first version in Python doing many SQL queries.

There are column-oriented databases that store data on a per-column base, where every column is its own index. They are a very good fit for Data Warehouse as they are extremly fast to read, but fairly slow to update.
Kickfire is such an example, which is a customized MySQL engine and has held the TPC-H benchmark top crown for a number of weeks, at an impressive system cost. Note that Kickfire is an appliance, sold as a hardware box.
Infobright would be another similar example, and has a free community edition that runs on Windows and Linux.

When there's too many indexes to create for a table I usually fall back on Full Text Search. Can't say if it will fit your scenario though.

SInce data warehouses are typically optimized for reading data not writing it, I would consider simply indexing all the columns. Yes this will slow down putting data into the warehouse, but typically that happens during non-peak hours and only once a day or less often.

One should only consider introducing "home grown" index structures, based on SQL tables, as a last resort, i.e. if there still exists [business-wise plausible] query cases not properly handled with an traditional index setting. For example if the list of such indexes were to become to big etc.
A few observations
You do not necessarily need indexes that include all of the columns that may be involved in one particular query; only the [collectively] selective ones may be required.
In other words if the query uses, for example, columns a, b, c and d, but if an index with a and b exists and if that produces, statistically only a few thousand rows, it may be acceptable to not introduce indexes with a, b and c (or and d or both), if c or d are not very plausible search criteria (used infrequently), and if their width is such that is would unduly burden the a+b index (or if there were other columns with a better fit for being "tacked-on" to the a+b index).
Aside from the obvious additional demand they put on disk storage, additional indexes, while possibly helping with SELECT (read) operations may also become an impediment with CUD (Create/Update/Delete) operations. It appears the context here is akin to a datawarehouse, where few [unscheduled] CUD operations take place, but it is good to keep this in mind.
See SQLite Optimizer for valuable insight into the way SQLite determines the way a particular query is executed.
Making a list of indexes
A tentative basis for the index scheme for this application may look like this:
[A] A single column index for every column in the table (save maybe the ones which are ridiculously unselective, say a "Married" column w/ "Y/N" values in it....)
[B] A two (or three) columns index for each the likely/common use case queries
[C] Additional two/three column indexes for the cases where some non-common query case involves a set of columns none of which is individually selective.
From this basis we then can define the actual list of indexes needed by:
Adding one (or a few) extra columns at the end of (and in a well thought out order...) to the [B] indexes above. Typically such columns are choosed because of their relative small width (they do grow the index unduly) and because they have a relative chance of being used in combination with the columns cited before them in the index.
Removing the [A] indexes which are generally equivalent to one or several [B] indexes. That is: columns which start with the same column, and for which the extra columns do no burden much the index.
reviewing the TREE of all possible (or all acceptable) cases, and marking off the branches adequately served with the indexes above. Then adding yet more indexes for the odd use cases not readily covered (if only with partial index scan + main table lookup for an acceptable number of rows).
In this situation, I find a hand-written tree structure a useful tool to help manage the otherwise unmanageable lists of possible combinations. Assuming a maximum of 4 search criteria selected from the 50 columns indicated in the question, we have in excess of 230,000 combinations to consider... The tree helps prune this rather quickly.

Sql optimization Techniques

I want to know optimization techniques for databases that has nearly 80,000 records,
list of possibilities for optimizing
i am using for my mobile project in android platform
i use sqlite,i takes lot of time to retreive the data
Thanks

Well, with only 80,000 records and assuming your database is well designed and normalized, just adding indexes on the columns that you frequently use in your WHERE or ORDER BY clauses should be sufficient.
There are other more sophisticated techniques you can use (such as denormalizing certain tables, partitioning, etc.) but those normally only start to come into play when you have millions of records to deal with.
ETA:
I see you updated the question to mention that this is on a mobile platform - that could change things a bit.
Assuming you can't pare down the data set at all, one thing you might be able to do would be to try to partition the database a bit. The idea here is to take your one large table and split it into several smaller identical tables that each hold a subset of the data.
Which of those tables a given row would go into would depend on how you choose to partition it. For example, if you had a "customer_id" field that could range from 0 to 10,000 you might put customers 0 - 2500 in table1, 2,500 - 5,000 in table2, etc. splitting the one large table into 4 smaller ones. You would then have logic in your app that would figure out which table (or tables) to query to retrieve a given record.
You would want to partition your data in such a way that you generally only need to query one of the partitions at a time. Exactly how you would partition the data would depend on what fields you have and how you are using them, but the general idea is the same.

Create indexes
Delete indexes
Normalize
DeNormalize

80k rows isn't many rows these days. Clever index(es) with queries that utlise these indexes will serve you right.

Learn how to display query execution maps, then learn to understand what they mean, then optimize your indices, tables, queries accordingly.

Such a wide topic, which does depend on what you want to optimise for. But the basics:
indexes. A good indexing strategy is important, indexing the right columns that are frequently queried on/ordered by is important. However, the more indexes you add, the slower your INSERTs and UPDATEs will be so there is a trade-off.
maintenance. Keep indexes defragged and statistics up to date
optimised queries. Identify queries that are slow (using profiler/built-in information available from SQL 2005 onwards) and see if they could be written more efficiently (e.g. avoid CURSORs, used set-based operations where possible
parameterisation/SPs. Use parameterised SQL to query the db instead of adhoc SQL with hardcoded search values. This will allow better execution plan caching and reuse.
start with a normalised database schema, and then de-normalise if appropriate to improve performance
80,000 records is not much so I'll stop there (large dbs, with millions of data rows, I'd have suggested partitioning the data)

You really have to be more specific with respect to what you want to do. What is your mix of operations? What is your table structure? The generic advice is to use indices as appropriate but you aren't going to get much help with such a generic question.
Also, 80,000 records is nothing. It is a moderate-sized table and should not make any decent database break a sweat.

First of all, indexes are really a necessity if you want a well-performing database.
Besides that, though, the techniques depend on what you need to optimize for: Size, speed, memory, etc?

One thing that is worth knowing is that using a function in the where statement on the indexed field will cause the index not to be used.
Example (Oracle):
SELECT indexed_text FROM your_table WHERE upper(indexed_text) = 'UPPERCASE TEXT';

Is BIT field faster than int field in SQL Server?

I have table with some fields that the value will be 1 0. This tables will be extremely large overtime. Is it good to use bit datatype or its better to use different type for performance? Of course all fields should be indexed.

I can't give you any stats on performance, however, you should always use the type that is best representative of your data. If all you want is 1-0 then absolutely you should use the bit field.
The more information you can give your database the more likely it is to get it's "guesses" right.

Officially bit will be fastest, especially if you don't allow nulls. In practice it may not matter, even at large usages. But if the value will only be 0 or 1, why not use a bit? Sounds like the the best way to ensure that the value won't get filled with invalid stuff, like 2 or -1.

As I understand it, you still need a byte to store a bit column (but you can store 8 bit columns in a single byte). So having a large number (how many?) of these bit columns could save you a bit on storage. As Yishai said it probably won't make much of a difference in performance (though a bit will translate to a boolean in application code more nicely).
If you can state with 100% confidence that the two options for this column will NEVER change then by all means use the bit. But if you can see a third value popping up in the future it could make life a little easier when that day comes to use a tinyint.
Just a thought, but I'm not sure how much good an index will do you on this column either, unless you see the vast majority of rows going to one side or the other. In a roughly 50/50 distribution you might actually take more of a hit keeping the index up to date than it gains you'd see in querying the table.

It depends.
If you would like to maximize speed of selects, use int (tinyint to save space), because bit in where clause is slower then int (not drastically, but every millisecond counts). Also make the column not null which also speeds things up. Below is link to actual performance test, which I would recommend to run at your own database and also extend it by using not nulls, indexes and using multiple columns at once. At home I even tried to compare using multiple bit columns vs multiple tinyint columns and tinyint columns were faster (select count(*) where A=0 and B=0 and C=0). I thought that SQL Server (2014) would optimize by doing only one comparison using bitmask, so it should by three times faster but that wasn't the case. If you use indexes, you would need more than 5000000 rows (as used in the test) to notice any difference (which I didn't have the patience to do since filling table with multiple millions of rows would take ages on my machine).
https://www.mssqltips.com/sqlservertip/4137/sql-server-performance-test-for-bit-data-type-in-a-where-clause/
If you would like to save space, use bit, since 8 of them can ocuppy one byte whereas 8 tinyints will ocupy 8 bytes. Which is around 7 Megabytes saved on each million of rows.
The differences between those two cases are basically negligable and since using bit has the upside of signalling that the column represents merely a flag, I would recommend using bit.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas