Table design options for large number of rows? - sql

I have an application that sends data based on user interaction (not user input). The data sent could be an Integer, String, Date, or Boolean value. There are 140 keys. We may get anywhere from 1 key value pair to all 140 at a time.
We want to store everything but will only be using 20 out of 140 keys within the application. The remaining will be used for an audit trail later on - so we still need to store them.
This data is used by the application to decide where the user needs to go so it needs to access the record by student id and pull the 20 or so options within milliseconds. There could be billions of rows of data (it is an upgrade to an existing application with over 20,000 users) so performance is critical. The user generates a new row each time they access the application.
EXAMPLE DATA:
Score:1
ID:3212
IsLast:False
Action:Completed
I have 2 ideas on how to do this and am looking for some help on which is best, or whether a third option is a better choice.
OPTION 1:
My first idea is to use a single string column for the value, plus a look-up table of possible data types to use when the value needs to be cast for use.
value | dataType
-----------------------
"1" | int
"Completed" | string
While the data being sent is not user generated, I know there must be a gotcha somewhere in this method. The only reason for doing this is that we don't know which key:value pairs will be sent (outside of date and id), and we are trying to avoid more than a few columns.
The SO question How to Handle Unknown Data Type in one Table uses a similar idea.
OPTION 2:
The other solution is to have 140 columns, one for each key. However, the amount of data generated is very large (billions of rows), so I don't think querying this data will be fast enough.
Technical Details:
This is using SQL Server 2008 (not R2) with .NET C# and Reporting Services.
Am I missing something here - what is the best way to create this table for performance?

Vertically segment your data. Put the 20 keys that are necessary for navigational control in one table, all 20 in one row, with a PK that identifies the user interaction (call it, say, InteractionId). Put the other 120 values in another table with a composite primary key based on the PK of the first table (InteractionId, plus a KeyTypeId identifying which of the 120 possible key-value pairs the value is for). Store all the values in this second table as strings. In a third lookup table called, say, KeyTypes, store the KeyTypeId, KeyTypeName, and KeyValueDataType to allow your code to know how to cast the string value and output it properly as a string, datetime, integer, decimal, or whatever...
The first table will be accessed much more often, and so it contains only those values which the application's navigational functionality needs frequent access to, keeping the rows narrower, which allows more rows per page and minimizes disk IO. Putting all 20 values in one row keeps the row count smaller (~1/20th as large), minimizing the depth of the index seeks that will need to be performed for each access.
The other table with all the other 120 key-values will not be accessed as frequently, so its structure can probably be optimized for logical simplicity rather than for performance.
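A minimal sketch of that vertical segmentation in T-SQL, assuming hypothetical table and column names (Interaction, KeyType and InteractionAuditValue are illustrative, not prescribed by the answer):

CREATE TABLE Interaction (
    InteractionId bigint IDENTITY(1,1) PRIMARY KEY,
    StudentId     int          NOT NULL,
    Score         int          NULL,
    IsLast        bit          NULL,
    [Action]      varchar(50)  NULL
    -- ... the remaining navigational columns ...
);

-- Supports the "find the record by student id" lookups
CREATE INDEX IX_Interaction_StudentId ON Interaction (StudentId);

-- Lookup describing the other 120 keys and how to cast their stored string values
CREATE TABLE KeyType (
    KeyTypeId        smallint     PRIMARY KEY,
    KeyTypeName      varchar(100) NOT NULL,
    KeyValueDataType varchar(20)  NOT NULL   -- 'int', 'string', 'date', 'bit', ...
);

-- Audit values: one row per (interaction, key), value stored as a string
CREATE TABLE InteractionAuditValue (
    InteractionId bigint       NOT NULL REFERENCES Interaction (InteractionId),
    KeyTypeId     smallint     NOT NULL REFERENCES KeyType (KeyTypeId),
    KeyValue      varchar(255) NULL,
    PRIMARY KEY (InteractionId, KeyTypeId)
);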

Actually, you might merge the suggestions offered so far:
Create a table with the 20 keys necessary for navigational control, plus one column for a primary key, plus one column of the XML data type to store the rest of the possible data. You could then create an XML schema that handles the data types for each key, plus constraints on certain keys as needed.
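A rough sketch of that hybrid layout in T-SQL, with illustrative names (the 20 navigational columns are elided, and the <Keys>/<k> element structure is just one possible shape for the XML):

CREATE TABLE Interaction (
    InteractionId bigint IDENTITY(1,1) PRIMARY KEY,
    StudentId     int NOT NULL,
    -- ... the 20 navigational columns ...
    AuditKeys     xml NULL   -- remaining ~120 pairs, e.g. <Keys><k name="Action">Completed</k></Keys>
);

-- Pulling a single audit value back out when it is needed later
SELECT AuditKeys.value('(/Keys/k[@name="Action"])[1]', 'varchar(50)') AS [Action]
FROM Interaction
WHERE InteractionId = 1;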

Well, it should be simple enough to test both ideas, but a variation on option 1 looks favoured to me. RDBMSs like SQL Server prefer long, narrow tables (i.e. fewer columns but lots of rows).
I won't go any further because it appears Charles has beaten me to it, with a perfectly sensible suggestion.

Related

Index on string column on big MSSQL database

I have to design a database to store some production process measurements data. The data would be collected from PLCs. I would like to store this data on a remote server, to which all the machines have access. I would like to store the following data:
timestamp
plant
production line
machine
workpiece number
measurement unit
measurement type
Some machines share the same measurement types, some have unique ones. Is it a good solution to make the measurement type column a VARCHAR and give the PLC programmers a free hand in naming new measurement types, without my intervention to insert a new entry in a related foreign-key table and hand them the new ids? The expected count of unique measurement types is around 100. Is an index on this column a solution for later filtering and selecting from this table, which is expected to reach around 50 billion rows in a year?
The size of the table would probably also become a big issue.
EDIT: Should I also separate the measurement value and measurement type into a different table from the part information?
Is there a way for SQL Server to take care of adding a new measurement type to some internal table and handling the ids?
Hopefully I explained my question well enough; otherwise please ask in a comment.
I would say create a new measurement type table. As you mentioned, there will be fewer than 100 records, so make the Id column TINYINT, which will save space as well and help in creating the index.
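A sketch of what that could look like in T-SQL; the table and column names here are illustrative, and the IDENTITY column is just one way to have the ids handled automatically:

-- Lookup table for the ~100 measurement types; TINYINT keeps the key narrow
CREATE TABLE MeasurementType (
    MeasurementTypeId   tinyint IDENTITY(1,1) PRIMARY KEY,
    MeasurementTypeName varchar(100) NOT NULL UNIQUE
);

-- Measurements reference the lookup instead of repeating a VARCHAR name per row
CREATE TABLE Measurement (
    MeasurementId     bigint IDENTITY(1,1) PRIMARY KEY,
    MeasuredAt        datetime2   NOT NULL,
    Plant             varchar(50) NOT NULL,
    ProductionLine    varchar(50) NOT NULL,
    Machine           varchar(50) NOT NULL,
    WorkpieceNumber   varchar(50) NOT NULL,
    MeasurementUnit   varchar(20) NOT NULL,
    MeasurementTypeId tinyint     NOT NULL
        REFERENCES MeasurementType (MeasurementTypeId),
    MeasurementValue  float       NOT NULL
);

-- Narrow index for the later filtering by type
CREATE INDEX IX_Measurement_Type ON Measurement (MeasurementTypeId);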

Storing key value pairs in SQL / Stats Aggregator

I'm writing a web application that needs to periodically collect data from an API and perform analysis on these stats to produce a dashboard for unique users. There are 236 unique 'stats' coming in from the API per user which are essentially key value pairs, where the value consists of either a string or number (or time duration or percent).
I'm trying to figure out how best to store this data. One option I thought of which would be the simplest approach was to store the raw JSON response against a userId and perform all analysis from that JSON. The obvious issue with this is that I need to be able to query the data easily and do things like ordering different users by one of the 236 unique stats. The other option would be in a relational database.
If I were to go the relational route, how is it best to store snapshots of data like this? I imagine creating a column for each of the 236 stats would be a bit of a mess, and annoying to add to in the future. I've looked at other relatively similar questions but haven't found anything right for me.
My thoughts so far:
Create a StatsType(id, typename) containing 236 rows,
and a UserStats(statid, userid, typeid, value, date_added) table, containing 236 rows for each user update from the API.
Would this end up being too huge as the app grows? (Think 200,000+ users) Thoughts would be much appreciated
Different value types are an argument for different columns. Your requirement to order users also suggests having a single row per user.
You could create a kind of data dictionary to keep your code clean and adaptable, with future changes in mind.
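A minimal sketch of that shape (one wide row per user snapshot plus a data dictionary table); all names and the three sample stat columns are made up for illustration:

-- One row per user snapshot, with typed columns for the stats you query and sort on
CREATE TABLE UserStatsSnapshot (
    UserId       int          NOT NULL,
    DateAdded    datetime2    NOT NULL,
    MatchesWon   int          NULL,       -- illustrative stat columns
    WinRatePct   decimal(5,2) NULL,
    PlayTimeSecs bigint       NULL,
    -- ... further stat columns ...
    PRIMARY KEY (UserId, DateAdded)
);

-- Data dictionary: documents each stat column, its source API key and value type
CREATE TABLE StatsDictionary (
    ColumnName  varchar(128) PRIMARY KEY,
    ApiKeyName  varchar(100) NOT NULL,
    ValueType   varchar(20)  NOT NULL,    -- 'number', 'string', 'duration', 'percent'
    Description varchar(400) NULL
);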

Best practice for tables with varying content

Currently I am working on a problem where I have to log data in an Oracle 10g database. I want to store data from up to 40 devices (but not necessarily always 40) as one data point; these share some information, and the rest is device-specific.
So I could either create arrays for every device-specific column, and if the device is in use the corresponding array field gets populated.
ID TIMESTAMP BOARD DEVICE_ID[40] ERROR_CNT[40] TEMP[40] MORE_DATA[40]...
But I think I would be wasting a lot of database space by doing it like that, because the arrays would be sparsely populated.
The other method I can think of would be to just use the same ID for a multi-line entry and then I put as many rows into the table as I have used devices.
ID TIMESTAMP BOARD DEVICE_ID ERROR_CNT TEMP MORE_DATA
1 437892 1 1 100 25 xxx
1 437892 1 2 50 28 yyy
Now the shared information is stored multiple times in the database and the data is scattered across multiple rows.
Another issue is that there might be columns used by only some of the devices while others do not carry that information, so there might be even more unused fields. So maybe it would be best to create multiple tables and split the devices into groups according to the information they have and log their data in the corresponding tables.
I appreciate any help; maybe I am just paranoid about wasted db space and should not worry about it and simply follow the 'easiest' approach, which I think would be the one with arrays.
Never store arrays in a database. Violating first normal form is a big mistake.
Worry more about how the data is queried than how it is stored. Keep the data model "dumb" and there are literally millions of people who can understand how to use it. There are probably only a few hundred people who understand Oracle object types.
For example, using object types, here is the simplest code to create a table, insert data, and query it:
drop table device;
create or replace type error_count_type is table of number;
create table device(id number, error_count error_count_type)
nested table error_count store as error_count_table;
insert into device values(1, error_count_type(10, 20));
commit;
select sum(column_value) error_count
from device
cross join table(error_count);
Not many people or tools understand creating types, store as, instantiating types, COLUMN_VALUE, or TABLE(...). Internally, Oracle stores arrays as tables anyway so there's no performance benefit.
Do it the simple way, with multiple tables. As Gordon pointed out, it's a small database anyway. Keep it simple.
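One plain-tables reading of that advice, sketched with hypothetical names (a header table for the shared values plus a child table with one row per reporting device):

-- Shared information stored once per data point
create table datapoint (
    datapoint_id number primary key,
    sample_time  timestamp not null,
    board        number    not null
);

-- One row per device that actually reported in that data point
create table device_reading (
    datapoint_id number not null references datapoint (datapoint_id),
    device_id    number not null,
    error_cnt    number,
    temp         number,
    more_data    varchar2(100),
    constraint pk_device_reading primary key (datapoint_id, device_id)
);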
I think this is too long for a comment:
1000 hours * 12/hour * 40 devices = 480,000 rows.
This is not a lot of data, so I wouldn't worry about duplication of values. You might want to go with the "other method" because it provides a lot of flexibility.
You can store all the data in columns, but if you get the columns wrong, you have to start messing around with alter table statements and that might affect queries you have already written.

Store an integer for bitwise compare in a permission model using JPA 2

I am using a permission model where I have a table user_permissions. This table will hold one or more columns of type bigint. I will use the bits of each number to compare against certain permission rules (the bit position will be a permission rule and the bit value will indicate whether the rule is active or not).
The problem with this approach is that I have a limited number of bits to work with when using a number such as bigint.
What is the best column type I can use in this case that works in a cross-database environment?
The tags represent the technologies I am aiming for, so any other solution related to those technologies are appreciated.
I was thinking of using the @Lob annotation to store large data; is that the best practice?
UPDATE:
The user_permission table extends the user with a 1:1 relationship and has bigint fields like bin_create, bin_read, bin_update, bin_delete that will hold the binary data as decimal numbers.
To clarify the question:
I am considering comparing the permissions using bitwise operators. So let's assume I have a user with the permission value 10 (1010), and an action requiring 13 (1101). Then 10 & 13 == 8 (1000), so the user has one permission matching the required permissions for the action, and I could allow or deny (it is up to the application rules to define which).
But with this approach I have a limited number of bits to work with (let's say I increase the permissions to be considered, so the numbers will grow too). The max bigint value is 9223372036854775807 (2^63 - 1), which gives me 63 bits and therefore 63 permission flags per field.
So what is the best column type I can use in this case that works in a cross-database environment, to store a huge quantity of binary flags and with the possibility to work with bitwise operators in Java?
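To make the comparison concrete, the check I have in mind looks roughly like this in SQL; bin_read comes from the fields above, while user_id and the literal 13 are illustrative (the & operator works on SQL Server, MySQL and PostgreSQL, whereas Oracle needs BITAND):

-- Does the user hold every bit the action requires (here 13 = 1101)?
SELECT user_id
FROM user_permissions
WHERE bin_read & 13 = 13;

-- The partial-match case from the example: 10 & 13 = 8 (non-zero),
-- i.e. at least one required bit is present.
SELECT user_id
FROM user_permissions
WHERE bin_read & 13 <> 0;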
If you want to store your data in an optimal way, you have to name the target you want to optimize for.
This is an optimal solution for MySQL (by defining BINARY(32)); you can try something similar on your favorite database:
@Column(columnDefinition = "BINARY(32)", length = 32, nullable = false)
private byte[] bits;
Sometimes, with some JPA providers and databases, the column definition ends up as a Lob. That's not the best solution, because reading a Lob is an external (very expensive) operation. Try to change either the provider or the database (if you're working with pure JPA, you can try it).
Options for replacing Lobs are, for example, numeric columns (you can use e.g. 4 columns of 64-bit width, or similar). If you want a nice solution, these container columns can even be @Embedded into your main class. But it all depends on your database.
This way you will have 256 bits (32 bytes), without any conversion and further calculation, and you will have the possibility to extend the range if you want. You have to be careful when changing the column definition, though.
If it's the amount of data you can fit in a field that you're concerned about, why not store the number as a varchar? To my knowledge, pretty much any database will let you go up to at least a varchar(255). If you need more than 255 digits in the number, you could encode it in base 64 to squeeze it down more. If my mental arithmetic is right, that gives you 255 characters * 6 bits per character = 1530 different bits to use. If you need more than that, I might suggest your permissions model is a little excessive.
That's assuming that you're trying to crowd the data into a smallish space in the database. Your question isn't entirely clear on what you're trying to solve for. On the other end of the spectrum, you could unpack the bits and save each bit to its own field or its own row. For example, user_permissions could be a table with two columns: user and permission, where each row is one permission granted to one user.
There are two different approaches:
1) Pretty data model
One row (a user in your example) can have values (here a user permission, which is one bit or a boolean value) where you don't know how many values are possible, i.e. the number of values is in principle unlimited. The normal approach in SQL to handle this is a child table:
You create a table (and a Java class for the mapping / annotations) UserPermission, which contains the user id as a foreign key, a permission id and the boolean value. User id and permission id form a unique key for this table (you can add an id as a separate primary key if you like). You can even add columns for the user who granted the permission, the date when this was done, etc., if you want some auditing, but this is not necessary.
If you want to make it more pretty, then you also create a table (and Java class) Permission, which contains the permission id, a name for the permission and perhaps other information.
This solution needs more database space than your idea with the bits in an integer, but bear in mind you won't have that many users compared to other data in the database, so the extra amount doesn't matter.
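A hypothetical DDL sketch of this child-table model (the column types are illustrative; on SQL Server the boolean would be a BIT):

CREATE TABLE permission (
    permission_id   INTEGER      PRIMARY KEY,
    permission_name VARCHAR(100) NOT NULL
);

CREATE TABLE user_permission (
    user_id       BIGINT  NOT NULL,       -- FK to the user table
    permission_id INTEGER NOT NULL REFERENCES permission (permission_id),
    granted       BOOLEAN NOT NULL,       -- the boolean value of the rule
    PRIMARY KEY (user_id, permission_id)  -- user id + permission id as the unique key
);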
2) Fast solution:
If the solution with extra tables involves too much overhead, because your permissions are not really important, and you worry that an integer could be too short, then you can use the Java type BigInteger (the type allows bit manipulation) and map it with a @Column annotation to a NUMBER or DECIMAL in the database.
Bear in mind the size of a database NUMBER is also limited (for example, Oracle allows at most about 10^40). If this might be a problem, then you must use solution 1).
One more disadvantage of solution 2) is that you can never use an index for the permissions. (A selection of all users having a certain permission set will never use an index.)
I would always use solution 1).

How do I add a column to large sql server table

I have a SQL Server table in production that has millions of rows, and it turns out that I need to add a column to it. Or, to be more accurate, I need to add a field to the entity that the table represents.
Syntactically this isn't a problem, and if the table didn't have so many rows and wasn't in production, this would be easy.
Really what I'm after is the course of action. There are plenty of websites out there with extremely large tables, and they must add fields from time to time. How do they do it without substantial downtime?
One thing I should add, I did not want the column to allow nulls, which would mean that I'd need to have a default value.
So I either need to figure out how to add a column with a default value in a timely manner, or I need to figure out a way to update the column at a later time and then set the column to not allow nulls.
ALTER TABLE table1 ADD
newcolumn int NULL
GO
should not take that long... What takes a long time is to insert columns in the middle of other columns... because then the engine needs to create a new table and copy the data over to it.
I did not want the column to allow nulls, which would mean that I'd need to have a default value.
Adding a NOT NULL column with a DEFAULT constraint to a table of any number of rows (even billions) became a lot easier starting in SQL Server 2012 (though only in Enterprise Edition), as it is allowed to be an online operation (in most cases) where, for existing rows, the value is read from metadata and not actually stored in the row until the row is updated or the clustered index is rebuilt. Rather than paraphrase any more, here is the relevant section from the MSDN page for ALTER TABLE:
Adding NOT NULL Columns as an Online Operation
Starting with SQL Server 2012 Enterprise Edition, adding a NOT NULL column with a default value is an online operation when the default value is a runtime constant. This means that the operation is completed almost instantaneously regardless of the number of rows in the table. This is because the existing rows in the table are not updated during the operation; instead, the default value is stored only in the metadata of the table and the value is looked up as needed in queries that access these rows. This behavior is automatic; no additional syntax is required to implement the online operation beyond the ADD COLUMN syntax. A runtime constant is an expression that produces the same value at runtime for each row in the table regardless of its determinism. For example, the constant expression "My temporary data", or the system function GETUTCDATETIME() are runtime constants. In contrast, the functions NEWID() or NEWSEQUENTIALID() are not runtime constants because a unique value is produced for each row in the table. Adding a NOT NULL column with a default value that is not a runtime constant is always performed offline and an exclusive (SCH-M) lock is acquired for the duration of the operation.
While the existing rows reference the value stored in metadata, the default value is stored on the row for any new rows that are inserted and do not specify another value for the column. The default value stored in metadata is moved to an existing row when the row is updated (even if the actual column is not specified in the UPDATE statement), or if the table or clustered index is rebuilt.
Columns of type varchar(max), nvarchar(max), varbinary(max), xml, text, ntext, image, hierarchyid, geometry, geography, or CLR UDTS, cannot be added in an online operation. A column cannot be added online if doing so causes the maximum possible row size to exceed the 8,060 byte limit. The column is added as an offline operation in this case.
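In practice it is just an ordinary ALTER TABLE; a sketch with made-up names, using a literal default so it qualifies as a runtime constant:

-- On SQL Server 2012+ Enterprise Edition this completes almost instantly even on a
-- huge table, because the default is stored only in metadata for existing rows.
ALTER TABLE dbo.BigTable
    ADD StatusCode int NOT NULL
    CONSTRAINT DF_BigTable_StatusCode DEFAULT (0);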
The only real solution for continuous uptime is redundancy.
I acknowledge @Nestor's answer that adding a new column shouldn't take long in SQL Server, but nevertheless, it could still be an outage that is not acceptable on a production system. An alternative is to make the change in a parallel system, and then, once the operation is complete, swap the new for the old.
For example, if you need to add a column, you may create a copy of the table, then add the column to that copy, and then use sp_rename() to move the old table aside and the new table into place.
If you have referential integrity constraints pointing to this table, this can make the swap even more tricky. You probably have to drop the constraints briefly as you swap the tables.
For some kinds of complex upgrades, you could completely duplicate the database on a separate server host. Once that's ready, just swap the DNS entries for the two servers and voilà!
I supported a stock exchange company in the 1990's who ran three duplicate database servers at all times. That way they could implement upgrades on one server, while retaining one production server and one failover server. Their operations had a standard procedure of rotating the three machines through production, failover, and maintenance roles every day. When they needed to upgrade hardware, software, or alter the database schema, it took three days to propagate the change through their servers, but they could do it with no interruption in service. All thanks to redundancy.
"Add the column and then perform relatively small UPDATE batches to populate the column with a default value. That should prevent any noticeable slowdowns"
And after that you have to set the column to NOT NULL, which will fire off as one big transaction. So everything will run really fast until you do that, so you have probably gained very little. I only know this from first-hand experience.
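For context, the batched-backfill pattern being discussed looks roughly like this (table and column names are made up); the final ALTER is the step criticized above, since it still has to validate every row in one go:

-- Add the column as NULLable first (fast, metadata-only)
ALTER TABLE dbo.BigTable ADD NewColumn int NULL;

-- Backfill in small batches to keep the transaction log and locking manageable
WHILE 1 = 1
BEGIN
    UPDATE TOP (10000) dbo.BigTable
    SET NewColumn = 0
    WHERE NewColumn IS NULL;

    IF @@ROWCOUNT = 0 BREAK;
END;

-- The single big transaction: validating and switching the column to NOT NULL
ALTER TABLE dbo.BigTable ALTER COLUMN NewColumn int NOT NULL;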
You might want to rename the current table from X to Y. You can do this with the command sp_rename '[OldTableName]', '[NewTableName]'.
Recreate the new table as X with the new column set to NOT NULL and then batch insert from Y to X and include a default value either in your insert for the new column or placing a default value on the new column when you recreate table X.
I have done this type of change on a table with hundreds of millions of rows. It still took over an hour, but it didn't blow out our trans log. When I tried to just change the column to NOT NULL with all the data in the table it took over 20 hours before I killed the process.
Have you tested just adding a column filling it with data and setting the column to NOT NULL?
So in the end I don't think there's a magic bullet.
Select into a new table and rename. For example, adding column i to table A:
select *, 1 as i
into A_tmp
from A_tbl
--Add any indexes here
exec sp_rename 'A_tbl', 'A_old'
exec sp_rename 'A_tmp', 'A_tbl'
Should be fast and won't touch your transaction log like inserting in batches might.
(I just did this today w/ a 70 million row table in < 2 min).
You can wrap it in a transaction if you need it to be an online operation (something might change in the table between the select into and the renames).
Another technique is to add the column to a new related table (assume a one-to-one relationship, which you can enforce by giving the FK a unique index). You can then populate this in batches, and then add the join to this table wherever you want the data to appear. Note I would only consider this for a column that I would not want to use in every query on the original table, or if the record width of my original table was getting too large, or if I was adding several columns.
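A small sketch of that related-table approach; the table and column names (BigTable, BigTableId, BigTable_Extra) are hypothetical:

-- Side table holding only the new column, 1:1 with the original table
CREATE TABLE dbo.BigTable_Extra (
    BigTableId int NOT NULL
        CONSTRAINT PK_BigTable_Extra PRIMARY KEY   -- PK doubles as the unique FK
        CONSTRAINT FK_BigTable_Extra REFERENCES dbo.BigTable (BigTableId),
    NewColumn  int NOT NULL
);

-- Join it in only where the new value is actually needed
SELECT b.BigTableId, x.NewColumn
FROM dbo.BigTable b
LEFT JOIN dbo.BigTable_Extra x ON x.BigTableId = b.BigTableId;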