Design Question - Put hundreds of Yes/No switches in columns, rows, or other?

We are porting an old application that used a hierarchical database to a relational web app, and are trying to figure out the best way to port configuration switches (Y/N values).
Our old system had 256 distinct switches (per client) that were each stored as a bit in one of 8 32-bit data fields. Each client would typically have ~100 switches set. To read or set a switch, we'd use bitwise arithmetic against a #define'd value. For example:
if (a_switchbank4 & E_SHOW_SALARY_ON_CHECKS) //If true, print salary on check
We were debating which approach to use for storing switches in our new relational (MS-SQL) database:
Option 1: Put each switch in its own field
Pros: fast and easy to read/write/access - 1 row per client
Cons: seems kludgey; we need to change the schema every time we add a switch
Option 2: Create a row per switch per client
Pros: unlimited switches, no schema changes necessary for new switches
Cons: slightly more arduous to pull data; we lose IntelliSense without extra work
Option 3: Maintain bit fields
Pros: same code can be leveraged, smaller XML data transmissions between machines
Cons: doesn't make any sense to our developers, hard to debug, too easy to use the wrong 'switch bank' field for a comparison
I'm leaning towards #1 ... any thoughts?

It depends on a few factors such as:
How many switches are set for each client
How many switches are actually used
How often switches are added
If I had to guess (and I would be guessing) I'd say what you really want are tags. One table has clients, with a unique ID for each, another has tags (the tag name and a unique ID) and a third has client ID / tag ID pairs, to indicate which clients have which tags.
This differs from your solution #2 in that tags are only present for the clients where that switch is true. In other words, rather than storing a client ID, a switch ID, and a boolean you store just a client ID and a switch ID, but only for the clients with that switch set.
This takes up about one third of the space of solution 2, but the real advantage over solutions 1 and 3 is indexing. If you want to find out things like which clients have switches 7, 45, and 130 set but not 86 or 14, you can answer that efficiently with a single index on the tag table, but there's no practical way to do it with the other solutions.
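A minimal sketch of that layout and the multi-switch query (all table and column names here are illustrative, not from the original post):
-- Tags exist as rows only for the clients where the switch is on
CREATE TABLE client     (client_id INT PRIMARY KEY, name VARCHAR(100));
CREATE TABLE tag        (tag_id INT PRIMARY KEY, name VARCHAR(100));
CREATE TABLE client_tag (
    client_id INT NOT NULL REFERENCES client (client_id),
    tag_id    INT NOT NULL REFERENCES tag (tag_id),
    PRIMARY KEY (client_id, tag_id)
);
-- "Which clients have switches 7, 45, and 130 set but not 86 or 14?"
SELECT ct.client_id
FROM client_tag ct
WHERE ct.tag_id IN (7, 45, 130)
  AND NOT EXISTS (SELECT 1 FROM client_tag x
                  WHERE x.client_id = ct.client_id
                    AND x.tag_id IN (86, 14))
GROUP BY ct.client_id
HAVING COUNT(DISTINCT ct.tag_id) = 3;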

You could think about using database views to give you the best of each solution.
For example, store the data as one row per switch, but use a view that pivots the switches (rows) into columns where that is more convenient.
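A rough sketch of that idea, assuming a hypothetical client_switch (client_id, switch_name, is_set) table and SQL Server's PIVOT operator; the pivoted switch names are invented:
CREATE VIEW client_switches_wide AS
SELECT client_id, [SHOW_SALARY_ON_CHECKS], [PRINT_YTD_TOTALS]
FROM (SELECT client_id, switch_name, CAST(is_set AS INT) AS is_set
      FROM client_switch) AS src
PIVOT (MAX(is_set)
       FOR switch_name IN ([SHOW_SALARY_ON_CHECKS], [PRINT_YTD_TOTALS])) AS p;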

I would go with option #2, one row per flag.
However, I'd also consider a mix of #1 and #2. I don't know your app, but if some switches are related, you could group those into tables where you have multiple columns of switches. You could group them based on use or type. You could, and probably would, still have a generic table with one switch per row for the ones that don't fit into the groups.
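A sketch of what that mix could look like (the grouping and all names are hypothetical):
-- Related switches grouped as columns...
CREATE TABLE client_payroll_switches (
    client_id             INT PRIMARY KEY,
    show_salary_on_checks BIT NOT NULL DEFAULT 0,
    print_ytd_totals      BIT NOT NULL DEFAULT 0
);
-- ...plus a generic one-row-per-switch table for everything else
CREATE TABLE client_switch (
    client_id   INT NOT NULL,
    switch_name VARCHAR(64) NOT NULL,
    is_set      BIT NOT NULL,
    PRIMARY KEY (client_id, switch_name)
);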

Remember too that if you change the method, you may have a lot of application code to change that relies on the existing way of storing the data. Whether you should change the method may depend on exactly how hard it will be and how many hours it will take to change everything associated with it. I agree with Markus' solution, but you do need to consider exactly how hard the refactoring is going to be and whether your project can afford the time.
The refactoring book I've been reading suggests that you maintain both structures for a set time period, with triggers to keep them in sync, while you start fixing all the references. Then, on a set date, you drop the original (and the triggers) from the database. This lets you use the new method going forward, but gives you the flexibility that nothing will break before you get it fixed, so you can roll out the change before all references are fixed. It requires discipline, however, as it is easy to never get rid of the legacy code and columns because everything is working and you are afraid to touch it. If you are in the midst of a total redesign where everything will be tested thoroughly and you have the time built into the project, then go ahead and change everything all at once.
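A rough sketch of that trigger-based sync idea, assuming a legacy table with the old switch-bank columns and a new row-per-switch table (all names and the bit mask are hypothetical):
CREATE TRIGGER trg_sync_switches ON legacy_client_config
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- Clear the row-per-switch entries for the affected clients...
    DELETE cs
    FROM client_switch cs
    JOIN inserted i ON i.client_id = cs.client_id
    WHERE cs.switch_name = 'SHOW_SALARY_ON_CHECKS';
    -- ...and re-create them from the current bit values
    -- (2048 stands in for the E_SHOW_SALARY_ON_CHECKS mask)
    INSERT INTO client_switch (client_id, switch_name, is_set)
    SELECT i.client_id, 'SHOW_SALARY_ON_CHECKS',
           CASE WHEN (i.switchbank4 & 2048) <> 0 THEN 1 ELSE 0 END
    FROM inserted i;
END;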

I'd also lean toward option 1, but would also consider an option 4 in some scenarios.
4 - Store in a dictionary of name/value pairs and serialize it to the database.
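A sketch of option 4, assuming the dictionary is serialized as JSON into a single column (queryable with JSON_VALUE on SQL Server 2016 or later; names are illustrative):
CREATE TABLE client_settings (
    client_id INT PRIMARY KEY,
    switches  NVARCHAR(MAX) NOT NULL  -- e.g. '{"SHOW_SALARY_ON_CHECKS": true}'
);
SELECT client_id
FROM client_settings
WHERE JSON_VALUE(switches, '$.SHOW_SALARY_ON_CHECKS') = 'true';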

I would recommend option 2. It's relatively straightforward to turn a list of tags/rows into a hash in the code, which makes it fairly easy to check variables. Having a table with 256+ columns seems like a nightmare.
One problem with option #2 is that having a crosstab query is a pain:
Client | S1 | S2 | S3 | S4 | S5 | ...
A      | X  |    | X  |    |    |
B      |    | X  | X  | X  |    |
But there are usually methods for doing that in a database-specific way.
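One such method is plain conditional aggregation, sketched here against a hypothetical client_switch (client_id, switch_name, is_set) table:
SELECT client_id,
       MAX(CASE WHEN switch_name = 'S1' THEN 'X' ELSE '' END) AS S1,
       MAX(CASE WHEN switch_name = 'S2' THEN 'X' ELSE '' END) AS S2,
       MAX(CASE WHEN switch_name = 'S3' THEN 'X' ELSE '' END) AS S3
FROM client_switch
WHERE is_set = 1
GROUP BY client_id;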

Related

How should I store 500 key-value pairs in a legacy database?

I have a legacy table with 260 columns: 4 storing global values and 256 columns storing parameters. I need to extend the table to store 512 values.
The optimal solution would be to create a child table with only three columns (id, value, parent_id) and so store as many parameters as needed. However, this will require significant rewriting of the code, and I am not sure about the performance.
Another solution is simply to extend the table (making it wider) to store twice as many values. This is a safer solution as I will need to make only minimal changes to my code. Moreover, the new table will not be modified for the next 4 years (at least), so it might make sense to stick with it.
Just to give a bit more information: my system does not have unit tests (hence my reluctance to modify the code) and does not use an ORM (therefore I will need to make many changes in the code that writes and reads the DB), and the PM is concerned that the first solution might be slower than the existing one. I can expect to store 10,000 rows in the table, so the proposed child table would have about 5 million rows.
Can anyone give me proof that the proposed design is actually faster than the current design? Design considerations are not enough (the current system works...). Is there any other good (or bad) reason to move to the proposed design?
You are looking at a key trade-off between data warehousing (wide, flat tables) and snowflaking (normalized child tables).
If this is a table where you are going to deal with a lot of reads vs. a lot of writes, you have different things to focus on.
Inserting an entry into the database after you break it into your "child table" will now take 512 inserts instead of 1. However, editing a single parameter will modify one small area and will not affect the other 511 areas.
Personally, I think that the 512 columns is a bit excessive, but I would consider whether there is some other relationship that can be turned into logical tables rather than the "anonymous table" of "whatever value I need goes here".
With all database things, the best thing you can do is test it. Only you know how the table is going to be used most often, and writing a test to insert 1M rows and read 1M times is pretty easy.
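As a starting point, here is a sketch of the proposed child table and one way to fill it with 10,000 parents x 512 parameters for a timing test (assuming SQL Server; all names are illustrative):
CREATE TABLE parameter_value (
    parent_id INT NOT NULL,
    param_id  INT NOT NULL,
    value     FLOAT NULL,
    PRIMARY KEY (parent_id, param_id)
);
INSERT INTO parameter_value (parent_id, param_id, value)
SELECT p.n, q.n, RAND(CHECKSUM(NEWID()))
FROM (SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
      FROM sys.all_objects a CROSS JOIN sys.all_objects b) AS p
CROSS JOIN
     (SELECT TOP (512) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
      FROM sys.all_objects) AS q;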

Delete records from a database or simply hide them during Reads?

I'm wondering if someone can provide various rationales/solutions for knowing when to delete records from a database vs. simply hiding them during read operations via a field value, e.g., is_hidden=1.
My application is a social network/e-commerce web application. I tend to favor the is_hidden strategy but as one's site grows I can see this leading to a really badly performing site.
Here's my list. What items on the list am I missing? Is the list's prioritization good?
Delete:
rationale: Reduce table size/improve database performance
rationale: Useful if data is trivial to create
solution: SQL DELETE
is_hidden:
rationale: allow users/DBA to restore data/useful for sensitive & hard to CREATE data
rationale: can DELETE it later if necessary
solution: SQL SELECT ... WHERE is_hidden!=1
Thoughts?
The major reason you might want to do a soft delete is where an audit trail requires it. For example, we might have an invoice table along with a voided column, and we might normally just omit voided invoices. This preserves an audit trail so we know which invoices were entered and which ones were voided.
There are many fields (particularly in finance) where soft deletes are preferred for this reason. Typically the number of deletes are small compared to the data set, and you don't want to really delete because actually doing so might allow someone to cover for theft of money or real-world goods. The "deleted" data can then be shown for those queries which require it.
A good non-db example would be as follows: "When writing in your general journal or general ledger, write with a pen, and if you make an error that you spot right away, cross it out with a single line so that the original data is still legible, and write the correct values underneath. If you find out later, either write in an adjustment entry or write in a reversal and a new one." In that case, your principal reason is to see what was changed and when, so that you can audit those changes if there is ever a question.
The people typically needing to see such information are likely to be financial or other auditors.
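A minimal sketch of that audit-friendly soft delete (table and column names are illustrative):
CREATE TABLE invoice (
    invoice_id INT PRIMARY KEY,
    amount     DECIMAL(12, 2) NOT NULL,
    voided     BIT NOT NULL DEFAULT 0
);
-- "Deleting" an invoice just voids it, so the audit trail survives
UPDATE invoice SET voided = 1 WHERE invoice_id = 42;
-- Everyday queries simply omit voided invoices
SELECT invoice_id, amount FROM invoice WHERE voided = 0;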
You've already said everything in your question:
DELETE will entirely delete the entry and
is_hidden=1 will hide it.
So: If there's the possibility that you will need the data in the future you should use the hiding method. If you are sure that the data will never ever be used again: Use delete.
Concerning performance:
You can use two tables:
1 for visible items
1 for the hidden ones
Or even three tables:
1 for visible
1 for hidden
1 as an archive, where you move all the hidden data that's older than 3 years or something
Or:
1 for visible and hidden ones (using the is_hidden flag)
1 as an archive for old entries
It's all up to you. But if you look at Facebook or Google: they will never ever delete anything! Data == Money == Power ;)
As far as performance and ease of development, it may be possible on your platform to have filtered indexes, indexed views etc which would mean that keeping the soft-deleted data around has little impact on your system.
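For example, on SQL Server a filtered index can cover only the live rows, so the soft-deleted ones stop costing you on reads (a sketch; the table and column names are hypothetical):
CREATE INDEX ix_posts_visible
    ON posts (created_at)
    INCLUDE (title)
    WHERE is_hidden = 0;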

Super column vs serialization vs 2 lookups in Cassandra

We have:
users, each of which has events, each of which has several properties (time, type etc.). Our basic use case is to fetch all events of a given user in a given time-span.
We've been considering the following alternatives in Cassandra for the Events column-family. All alternatives share: key=user_id (UUID), column_name = event_time
Alternative 1: column_value = serialized object of event properties. We will need to read/write all the properties every time (not a problem), but it might also be difficult to debug (can't use the Cassandra command-line client easily).
Alternative 2: the column is actually a super column, and the sub-columns are the separate properties. Means reading all events(?) every time (possible, though sub-optimal). Any other cons?
Alternative 3: column_value is a row key to another CF, where the event properties are stored. Means maintaining two tables -> complicates calls + reads/writes are slower(?).
Anything we're missing? Any standard best-practice here?
Alternative 1: Why go to Cassandra if you are going to store serialized objects? MongoDB or a similar product would perform better at this task, if I get it right (I've never actually tried a document-based NoSQL store, so correct me if I'm wrong on this one). Anyway, I tried this alternative once in MySQL 6 years ago and it is still painful to maintain today.
Alternative 2: Sorry, I haven't had a chance to play with super columns yet. I would use this only if I frequently had to show a lot of information on many users (i.e. much more than just their username and a few qualifiers) and their respective events in one query. It could also make a query based on a given time span a bit tricky if there are conditions on the user itself too, since a user's row is likely to have some event columns that fit in the span and others that don't.
Alternative 3: This would definitely be my choice in most cases. You are not likely to write events and create a user in the same transaction, so there's no worry about consistency. Use the username itself as a standard event column (don't forget to index it) so your calls will be pretty fast. More on this type of data model at http://www.datastax.com/docs/0.8/ddl/index.
Yes, it's a two-call read, but it is two different families of data anyway.
As for best practices, the field is kind of new; I'm not sure there are any widely accepted ones yet.

Applying business rules at the database level

I'm working on a project in which we will need to determine certain types of statuses for a large body of people, stored in a database. The business rules for determining these statuses are fairly complex and may change.
For example,
if a person is part of group X
and (if they have attribute O) has either attribute P or attribute Q,
or (if they don't have attribute O) has attribute P but not Q,
and don't have attribute R,
and aren't part of group Y (unless they also are part of group Z),
then status A is true.
Multiply by several dozen statuses and possibly hundreds of groups and attributes. The people, groups, and attributes are all in the database.
Though this will be consumed by a Java app, we also want to be able to run reports directly against the database, so it would be best if the set of computed statuses were available at the data level.
Our current design plan, then, is to have a table or view that consists of a set of boolean flags (hasStatusA? hasStatusB? hasStatusC?) for each person. This way, if I want to query for everyone who has status C, I don't have to know all of the rules for computing status C; I just check the flag.
(Note that, in real life, the flags will have more meaningful names: isEligibleForReview?, isPastDueForReview?, etc.).
So a) is this a reasonable approach, and b) if so, what's the best way to compute those flags?
Some options we're considering for computing flags:
Make the set of flags a view, and calculate the flag values from the underlying data in real time using SQL or PL/SQL (this is an Oracle DB). This way the values are always accurate, but performance may suffer, and the rules would have to be maintained by a developer. (A rough sketch of such a view follows after this list.)
Make the set of flags consist of static data, and use some type of rules engine to keep those flags up-to-date as the underlying data changes. This way the rules can be maintained more easily, but the flags could potentially be inaccurate at a given point in time. (If we go with this approach, is there a rules engine that can easily manipulate data within a database in this way?)
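For reference, a rough sketch of what the view-based option might look like for the status-A rule above (the person, person_group, and person_attribute tables and their codes are hypothetical):
CREATE VIEW person_status AS
SELECT p.person_id,
       CASE WHEN EXISTS (SELECT 1 FROM person_group g
                         WHERE g.person_id = p.person_id AND g.group_code = 'X')
             AND ((EXISTS (SELECT 1 FROM person_attribute a
                           WHERE a.person_id = p.person_id AND a.attr_code = 'O')
                   AND EXISTS (SELECT 1 FROM person_attribute a
                               WHERE a.person_id = p.person_id AND a.attr_code IN ('P', 'Q')))
                  OR (NOT EXISTS (SELECT 1 FROM person_attribute a
                                  WHERE a.person_id = p.person_id AND a.attr_code = 'O')
                      AND EXISTS (SELECT 1 FROM person_attribute a
                                  WHERE a.person_id = p.person_id AND a.attr_code = 'P')
                      AND NOT EXISTS (SELECT 1 FROM person_attribute a
                                      WHERE a.person_id = p.person_id AND a.attr_code = 'Q')))
             AND NOT EXISTS (SELECT 1 FROM person_attribute a
                             WHERE a.person_id = p.person_id AND a.attr_code = 'R')
             AND (NOT EXISTS (SELECT 1 FROM person_group g
                              WHERE g.person_id = p.person_id AND g.group_code = 'Y')
                  OR EXISTS (SELECT 1 FROM person_group g
                             WHERE g.person_id = p.person_id AND g.group_code = 'Z'))
            THEN 1 ELSE 0 END AS has_status_a
FROM person p;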
In a case like this I suggest applying Ward Cunningham's question: ask yourself "What's the simplest thing that could possibly work?".
In this case, the simplest thing might be to come up with a view that looks at the data as it exists and does the calculations and computations to produce all the fields you care about. Now, load up your database and try it out. Is it fast enough? If so, good - you did the simplest possible thing and it worked out fine. If it's NOT fast enough, good - the first attempt didn't work, but you've got the rules mapped out in the view code. Now you can go on to try the next iteration of "the simplest thing" - perhaps you write a background task that watches for inserts and updates and then jumps in to recompute the flags. If that works, fine and dandy. If not, go to the next iteration...and so on.
Share and enjoy.
I would advise against making the statuses column names; rather, use a status ID and value, such as a customer status table with columns of ID and Value.
I would have two methods for updating statuses. The first is a stored procedure that either has all the logic or calls separate stored procs to figure out each status; you could make all of this dynamic by having a function for each status evaluation, and the one stored proc could then call each function. The second method would be to have whatever stored proc(s) that update user info call a stored proc to go update all of the user's statuses based upon the current data. These two methods would allow you to have both real-time updates for the data that changed and, if you add a new status, the ability to call the method to update all statuses with the new logic.
Hopefully you have one point of updates to the user data, such as a user update stored proc, and you can put the status update stored proc call in that procedure. This would also save having to schedule a task every n seconds to update statuses.
An option I'd consider would be for each flag to be backed by a deterministic function that returns the up-to-date value given the relevant data.
The function might not perform well enough, however, if you're calling it for many rows at a time (e.g. for reporting). So, if you're on Oracle 11g, you can solve this by adding virtual columns (search for "virtual column") to the relevant tables based on the function. The Result Cache feature should improve the performance of the function as well.

Simulating queries of large views for benchmarking purposes

A Windows Forms application of ours pulls records from a view on SQL Server through ADO.NET and a SOAP web service, displaying them in a data grid. We have had several cases with ~25,000 rows, which works relatively smoothly, but a potential customer needs to have many times that much in a single list.
To figure out how well we scale right now, and how (and how far) we can realistically improve, I'd like to implement a simulation: instead of displaying actual data, have the SQL Server send fictional, random data. The client and transport side would be mostly the same; the view (or at least the underlying table) would of course work differently. The user specifies the amount of fictional rows (e.g. 100,000).
For the time being, I just want to know how long it takes for the client to retrieve and process the data and be just about ready to display it.
What I'm trying to figure out is this: how do I make the SQL Server send such data?
Do I:
Create a stored procedure that has to be run beforehand to fill an actual table?
Create a function that I point the view to, thus having the server generate the data 'live'?
Somehow replicate and/or randomize existing data?
The first option sounds to me like it would yield the results closest to the real world. Because the data is actually 'physically there', the SELECT query would be quite similar performance-wise to one on real data. However, it taxes the server with an otherwise meaningless operation. The fake data would also be backed up, as it would live in one and the same database — unless, of course, I delete the data after each benchmark run.
The second and third options tax the server while running the actual simulation, thus potentially giving unrealistically slow results.
In addition, I'm unsure how to create those rows, short of using a loop or cursor. I can use SELECT top <n> random1(), random2(), […] FROM foo if foo actually happens to have <n> entries, but otherwise I'll (obviously) only get as many rows as foo happens to have. A GROUP BY newid() or similar doesn't appear to do the trick.
For data for testing CRM type tables, I highly recommend fakenamegenerator.com, you can get 40,000 fake names for free.
You didn't mention if you're using SQL Server 2008. If you use 2008 and you use Data Compression, be aware that random data will act very differently (slower) than real data. Random data is much harder to compress.
Quest Toad for SQL Server and Microsoft Visual Studio Data Dude both have test data generators that will put fake "real" data into records for you.
If you want results you can rely on you need to make the testing scenario as realistic as possible, which makes option 1 by far your best bet. As you point out if you get results that aren't good enough with the other options you won't be sure that it wasn't due to the different database behaviour.
How you generate the data will depend to a large degree on the problem domain. Can you take data sets from multiple customers and merge them into a single mega-dataset? If the data is time series then maybe it can be duplicated over a different range.
The data is typically CRM-like, i.e. contacts, projects, etc. It would be fine to simply duplicate the data (e.g., if I only have 20,000 rows, I'll copy them five times to get my desired 100,000 rows). Merging, on the other hand, would only work if we never deploy the benchmarking tool publicly, for obvious privacy reasons (unless, of course, I apply a function to each column that renders the original data unintelligible beyond repair? Similar to a hashing function, only without modifying the value's size too much).
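One possible sketch of that scrambling step, using SQL Server's HASHBYTES and keeping roughly the original length (the contacts table and its columns are hypothetical):
SELECT contact_id,
       LEFT(CONVERT(VARCHAR(64), HASHBYTES('SHA1', last_name), 2),
            LEN(last_name)) AS last_name
FROM contacts;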
To populate the rows, perhaps something like this would do:
-- Keep copying the real data until the benchmark table reaches the target size
WHILE (SELECT COUNT(1) FROM benchmark) < 100000
    INSERT INTO benchmark
    SELECT TOP 100000 * FROM actualData
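If you'd rather avoid the loop (or a cursor) entirely, one common trick is to cross join a couple of large system views to get an arbitrary number of rows and generate random values per row; a sketch, assuming SQL Server:
SELECT TOP (100000)
       ABS(CHECKSUM(NEWID())) % 1000 AS random_int,
       LEFT(CONVERT(VARCHAR(36), NEWID()), 8) AS random_text
FROM sys.all_objects a
CROSS JOIN sys.all_objects b;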