I have to design a database to store production-process measurement data. The data would be collected from PLCs. I would like to store this data on a remote server to which all the machines have access. I would like to store the following data:
timestamp
plant
production line
machine
workpiece number
measurement unit
measurement type
Some machines share the same measurement types, some have unique ones. Would it be a good solution to make the measurement type column a VARCHAR and give the PLC programmers a free hand in naming new measurement types, so that I don't have to intervene to insert a new entry into a related foreign-key table and hand them the new IDs? The expected count of unique measurement types is around 100. Is an index on this column a workable solution for later filtering and selecting from this table, which is expected to reach around 50 billion rows in a year?
The size of the table would probably also become a big issue.
EDIT: Should I also separate the measurement value and measurement type into a different table than the part information?
Is there a way for SQL Server to take care of adding new measurement types to some internal table and handling the IDs itself?
Hopefully I explained my question well enough; otherwise please ask in a comment.
Regards
I would say create a separate measurement type table as you mentioned. Since the record count is less than 100, make the Id column TINYINT, which will save space and also help when creating the index.
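A minimal sketch of what that could look like, assuming SQL Server; the table and column names, and the DECIMAL value column, are just placeholders:

-- Lookup table for measurement types; TINYINT allows up to 255 distinct types.
CREATE TABLE MeasurementType (
    MeasurementTypeId TINYINT IDENTITY(1,1) PRIMARY KEY,
    Name              VARCHAR(100) NOT NULL UNIQUE
);

-- Fact table for the measurements themselves.
CREATE TABLE Measurement (
    MeasurementId     BIGINT IDENTITY(1,1) PRIMARY KEY,
    MeasuredAt        DATETIME     NOT NULL,
    Plant             VARCHAR(50)  NOT NULL,
    ProductionLine    VARCHAR(50)  NOT NULL,
    Machine           VARCHAR(50)  NOT NULL,
    WorkpieceNumber   VARCHAR(50)  NOT NULL,
    MeasurementUnit   VARCHAR(20)  NOT NULL,
    MeasurementTypeId TINYINT      NOT NULL
        REFERENCES MeasurementType (MeasurementTypeId),
    Value             DECIMAL(18,6) NOT NULL
);

-- Index to support later filtering by measurement type.
CREATE INDEX IX_Measurement_Type ON Measurement (MeasurementTypeId);

-- The PLC side can stay free-form: insert the name only if it is new, then look up the id.
-- This insert-if-missing pattern is one answer to the "can SQL Server handle the ids" question.
DECLARE @TypeName VARCHAR(100);
SET @TypeName = 'diameter_mm';  -- hypothetical type name sent by a PLC
IF NOT EXISTS (SELECT 1 FROM MeasurementType WHERE Name = @TypeName)
    INSERT INTO MeasurementType (Name) VALUES (@TypeName);
SELECT MeasurementTypeId FROM MeasurementType WHERE Name = @TypeName;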
I'm planning on making a SQLite database to hold data acquired from numerous sensors. The data would be simple things like date added, volume in mL, temperature, etc.
Would it be good practice to create one table per sensor? Or am I better off creating a column for the sensor name instead and putting everything under one table? I plan to query data from sensor(s) based on the date added attribute.
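For concreteness, a rough sketch of the single-table option I'm considering (assuming SQLite; column names are just placeholders):

-- One table for all sensors; the sensor is identified by a column, not by its own table.
CREATE TABLE readings (
    id          INTEGER PRIMARY KEY,
    sensor_name TEXT NOT NULL,
    date_added  TEXT NOT NULL,   -- ISO-8601 text, e.g. '2024-01-31 12:00:00'
    volume_ml   REAL,
    temperature REAL
);

-- Index to support querying one or more sensors by the date added.
CREATE INDEX idx_readings_sensor_date ON readings (sensor_name, date_added);

-- Example query: all readings from one sensor in a date range.
SELECT *
FROM readings
WHERE sensor_name = 'sensor_a'
  AND date_added BETWEEN '2024-01-01' AND '2024-01-31';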
Is there a correlation between the number of rows/columns used and the impact on the (MS)SQL database?
A little more background:
We have to store lots of data from measurement devices. These devices ping a string of data back to us around 100 times a day. These strings contain roughly 300 fields. Assuming we have 100 devices in operation, that means we get 10,000 records back every day. At our back-end we split these data strings and have to put them into the database. As long as these data strings stay fixed, that means we add around 10,000 new rows to the database each day. No big deal.
However, the contents of these data strings may change over time. There are two options we are considering:
Using vertical tables to store the data dynamically
Using horizontal tables and add a new column now and then when it's needed.
From the perspective of ease we'd like to choose the first approach. However, that means we'd be adding 100*100*300 = 3,000,000 rows each day. Data has to be stored for a year and a month (395 days), so we end up at around 1.2 billion rows, not counting the expected growth.
From a performance perspective, is it smarter to use a 'vertical' or a 'horizontal' approach?
When choosing the 'vertical' solution, how can we actually optimize performance by using PKs/FKs wisely?
When choosing the 'horizontal' solution, are there recommendations for adding columns to the table?
I have a vertical DB with 275 million rows in the "values" table. We took this approach because we couldn't accurately define the schema at the outset either. Inserts are fantastic. Selects suck. To be fair, we throw in a couple of extra doohickies the typical vertical schema doesn't have to deal with.
Have a search for EAV, aka Entity Attribute Value models. You'll find a lot of heat on both sides of the debate. Two good articles on making it work are:
What is so bad about EAV, anyway?
dave’s guide to the eav
My guess is these sensors don't just start sending you extra fields. You have to release new sensors or sensor code for this to happen. That's your chance to do change control on your schema and add the extra columns. If external parties can connect sensors without notifying you this argument is null and void and you may be stuck with an EAV.
For the horizontal option you can split tables, putting the frequently-used columns in one table and the less-used ones in a second; both tables have the same primary key values so you can link the less-used columns to the more-used ones. You can also use the RDBMS's built-in partitioning functionality to separate each day's (or week's or month's) data from the others.
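A rough sketch of that split, assuming SQL Server and hypothetical table/column names:

-- Frequently-used columns, one row per device reading.
CREATE TABLE ReadingCore (
    ReadingId   BIGINT        NOT NULL PRIMARY KEY,
    DeviceId    INT           NOT NULL,
    ReadingTime DATETIME      NOT NULL,
    FieldA      DECIMAL(18,6) NULL,
    FieldB      DECIMAL(18,6) NULL
    -- ...the other commonly queried fields
);

-- Less-used columns, same primary key values, linked 1:1 to the core table.
CREATE TABLE ReadingExtra (
    ReadingId BIGINT NOT NULL PRIMARY KEY
        REFERENCES ReadingCore (ReadingId),
    FieldC    VARCHAR(100) NULL,
    FieldD    VARCHAR(100) NULL
    -- ...the rarely queried fields
);

-- When the extra columns are needed, it is a straight PK-to-PK join.
SELECT c.DeviceId, c.ReadingTime, e.FieldC
FROM ReadingCore AS c
JOIN ReadingExtra AS e ON e.ReadingId = c.ReadingId;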
Generally, you can tune a table more for inserts (or any DML) or for queries. Improving one side comes at the expense of the other. Usually, it's a balancing act.
First of all, 10K inserts a day is not really a large number. Sure, it's not insignificant, but it doesn't even come close to what would be considered "large" nowadays. So, while we don't want to make inserts downright sluggish, this gives you some wiggle room.
Creating an index on the device id and/or entry timestamp will do some logical partitioning of the data for you. The exact makeup of your index(es) will depend on your queries. Are you looking for all entries for a given date or date range? Then index the timestamp column. Are you looking for all entries received from a particular device? Then index the device id column. Are you looking for entries from a particular device on a particular date or date range or sorted by the date? Then create an index on both columns.
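For illustration, the corresponding index definitions might look like this (the Entries table and its column names are assumptions):

-- All entries for a given date or date range.
CREATE INDEX IX_Entries_Timestamp ON Entries (EntryTimestamp);

-- All entries received from a particular device.
CREATE INDEX IX_Entries_Device ON Entries (DeviceId);

-- Entries from a particular device on a date or date range, or sorted by date.
CREATE INDEX IX_Entries_Device_Timestamp ON Entries (DeviceId, EntryTimestamp);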
So if you ask for the entries for device x on date y, then you are going out to the table and looking only at the rows you need. The fact that the table is much larger than the small subset you query is incidental. It's as if the rest of the table doesn't even exist. The total size of the table need not be intimidating.
Another option: As it looks like the data is written to the table and never altered after that, then you may want to create a data warehouse schema for the data. New entries can be moved to the warehouse every day or several times a day. The point is, the warehouse schema can have the data sliced, diced, quartered and cubed to make queries much more efficient. So you can have the existing table tuned for more efficient inserts and the warehouse tuned for more efficient queries. That is, after all, what data warehouses are for.
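A minimal sketch of such a scheduled move, assuming both databases sit on the same instance and using hypothetical table names plus a simple watermark table:

-- Copy the rows that arrived since the last load into the warehouse table...
DECLARE @LoadUpTo DATETIME;
SET @LoadUpTo = GETUTCDATE();

INSERT INTO Warehouse.dbo.EntriesFact (DeviceId, EntryTimestamp, FieldA, FieldB)
SELECT DeviceId, EntryTimestamp, FieldA, FieldB
FROM   Online.dbo.Entries
WHERE  EntryTimestamp >  (SELECT LastLoaded FROM Warehouse.dbo.LoadWatermark)
  AND  EntryTimestamp <= @LoadUpTo;

-- ...then advance the watermark so the next run picks up where this one stopped.
UPDATE Warehouse.dbo.LoadWatermark SET LastLoaded = @LoadUpTo;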
You also imply that some of each entry is (or can be) duplicated from one entry to the next. See if you can segment the data into three types:
Type 1: Data that never changes (the device id, for example)
Type 2: Data that rarely changes
Type 3: Data that changes often
Now all you have is a normalization problem, something a lot easier to solve. Let's say the row is split equally between the types. So you have one table with 100 rows of 33 columns. That's it. It never changes. Linked to that is a table with at least 100 rows of 33 columns, to which maybe a few new rows are added each day. Finally, linked to the second table is a table with rows of 33 columns that possibly grows by the full 10K every day.
This minimizes the grow-space required by the online database. The warehouse could then denormalize back to one huge table for ease of querying.
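A bare-bones sketch of that three-way split; all names, and the exact column split, are illustrative:

-- Type 1: data that never changes, one row per device.
CREATE TABLE Device (
    DeviceId INT         NOT NULL PRIMARY KEY,
    Model    VARCHAR(50) NOT NULL
    -- ...the other static columns
);

-- Type 2: data that rarely changes; a new row only when one of these values changes.
CREATE TABLE DeviceState (
    DeviceStateId   BIGINT      NOT NULL PRIMARY KEY,
    DeviceId        INT         NOT NULL REFERENCES Device (DeviceId),
    ValidFrom       DATETIME    NOT NULL,
    FirmwareVersion VARCHAR(20) NULL
    -- ...the other rarely changing columns
);

-- Type 3: data that changes often; grows by up to the full 10K rows a day.
CREATE TABLE DeviceReading (
    ReadingId     BIGINT        NOT NULL PRIMARY KEY,
    DeviceStateId BIGINT        NOT NULL REFERENCES DeviceState (DeviceStateId),
    ReadingTime   DATETIME      NOT NULL,
    Value1        DECIMAL(18,6) NULL
    -- ...the other frequently changing columns
);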
I am currently writing an application which will have a lot of transactions.
Each transaction will have a value, although the value can be an int, bit, short string, large string, etc.
I want to try to keep processing and storage to a minimum as I would like to run this in the cloud. Should I have lots of different fields on the transaction, e.g.
TransactionLine.valueint
TransactionLine.valuestring
TransactionLine.valuedecimal
TransactionLine.valuebool
or should I have separate tables for each transaction value type?
TransactionLine - Table
---------------
TransactionLine.ValueId
ValueInt - Table
---------------
ValueInt.ValueId
ValueInt.Value
ValueString - Table
---------------
ValueString.ValueId
ValueString.Value
You could store key-value pairs in the database. The only data type that can store any other data type is a VARCHAR(MAX) or a BLOB. That means that all data must be converted to a string before it can be stored. That conversion will take processing time.
In the opposite direction, when you want to do a SUM or a MAX or an AVG, etc., of numeric data, you will first have to convert the string back to its real data type. That conversion too will take processing time.
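For example, an aggregate over string-stored numbers ends up looking something like this (table and column names are hypothetical):

-- Every value must be cast back from VARCHAR before SQL Server can sum it,
-- and that cast runs for every row the query touches.
SELECT SUM(CAST(tl.Value AS DECIMAL(18,2)))
FROM   TransactionLine AS tl
WHERE  tl.ValueType = 'decimal';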
Databases are read a lot more than they are written to. The conversion nightmare will bring your system to its knees. There has been a lot of debate on this topic. The high cost of conversions is the killer.
There are systems that store the whole database in one single table. But in those cases the whole system is built with one clear goal: to support that system in an efficient way in a fast compiled programming language, like C(++, #), not in a relational database language like SQL.
I'm not sure I fully understand what you really want. If you only want to store the transactions, this may be worth trying. But why do you want to store them one field at a time? Data is stored in groups, in records. And the data type of each and every column in a record is known at the creation time of the table.
You should really look into Cassandra. When you say a lot of transactions, do you mean millions of records? For Cassandra, handling millions of records is the norm. You will have a column family (in an RDBMS, a table is similar to a column family) storing many rows, and for each row you do not need to predefine a column. It can be defined on demand, thus reducing the storage dramatically, especially if you are dealing with a lot of records.
You do not need to worry whether the data is of type int, string, decimal or bool, because the default data type for a column value is BytesType. There are other data types which you can predefine too in the column family's column metadata if you want to. Since you are starting to write an application, I suggest you spend some time reading into Cassandra and how it would help in your situation.
I have a couple of databases containing simple data which needs to be imported into a new format schema. I've come up with a flexible schema, but it relies on the critical data of the two older DBs being stored in one table. This table has only a primary key, a foreign key (both ints), a datetime and a decimal field, but adding up the row counts from the two older DBs indicates that the total row count for this new table would be about 200,000,000 rows.
How do I go about dealing with this amount of data? It is data stretching back about 10 years and does need to be available. Fortunately, we don't need to pull out even 1% of it when making queries in the future, but it does all need to be accessible.
I've got ideas based around having multiple tables for year, supplier (of the source data) etc - or even having one database for each year, with the most recent 2 years in one DB (which would also contain the stored procs for managing all this.)
Any and all help, ideas, suggestions very, deeply, much appreciated,
Matt.
Most importantly, consider profiling your queries and measuring where your actual bottlenecks are (try identifying the missing indexes). You might see that you can store everything in a single table, or that buying a few extra hard disks will be enough to get sufficient performance.
Now, for suggestions, have you considered partitioning? You could create partitions per time range, or one partition with the 1% commonly accessed and another with the 99% of the data.
This is roughly equivalent to splitting the tables manually by year or supplier or whatnot, but internally handled by the server.
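A rough sketch of range partitioning by year in SQL Server; the names, boundary values and column list are illustrative, and note that table partitioning requires Enterprise Edition in SQL Server 2005/2008:

-- Partition function: one partition per year boundary.
CREATE PARTITION FUNCTION pfByYear (DATETIME)
AS RANGE RIGHT FOR VALUES ('2008-01-01', '2009-01-01', '2010-01-01');

-- Partition scheme: map every partition to the PRIMARY filegroup for simplicity.
CREATE PARTITION SCHEME psByYear
AS PARTITION pfByYear ALL TO ([PRIMARY]);

-- The table is created on the scheme; the server routes each row to a partition by date.
CREATE TABLE MeasurementHistory (
    MeasurementId BIGINT       NOT NULL,
    SupplierId    INT          NOT NULL,
    MeasuredAt    DATETIME     NOT NULL,
    Amount        DECIMAL(9,2) NOT NULL
) ON psByYear (MeasuredAt);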
On the other hand, it might make more sense to actually split the tables into 'current' and 'historical'.
Another possible size improvement is using an int (like an epoch) instead of a datetime and providing functions to convert from datetime to int, thus allowing queries like
SELECT * FROM megaTable WHERE datetime > dateTimeToEpoch('2010-01-23')
This size saving will probably have a performance cost if you need to do complex datetime queries. Although on cubes there is the standard technique of storing, instead of an epoch, an int in YYYYMMDD format.
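The dateTimeToEpoch function in the query above isn't built in; a possible T-SQL implementation would be something like:

CREATE FUNCTION dbo.dateTimeToEpoch (@dt DATETIME)
RETURNS INT
AS
BEGIN
    -- Seconds since 1970-01-01; INT overflows for dates past 2038, so use BIGINT if that matters.
    RETURN DATEDIFF(SECOND, '1970-01-01', @dt);
END;

-- The YYYYMMDD alternative mentioned above would instead be:
-- YEAR(@dt) * 10000 + MONTH(@dt) * 100 + DAY(@dt), so 2010-01-23 becomes 20100123.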
What's the problem with storing this data in a single table? An enterprise-level SQL server like Microsoft SQL 2005 can handle it without much pain.
By the way, do not do tables per year, tables per supplier or other things like this. If you have to store a similar set of items, you need one and only one table. Setting up multiple tables to store the same type of things will cause problems, like:
Queries would be extremely difficult to write, and performance would suffer if you have to query across multiple tables.
The database design will be very difficult to understand (especially since it's not something natural to store the same type of items in different places).
You will not be able to easily modify your database (maybe it's not a problem in your case), because instead of changing one table, you would have to change every table.
It would require automating a bunch of tasks. Let's say you have a table per year. If a new record is inserted on 2011-01-01 00:00:00.001, will a new table be created? Will you check at each insert whether you must create a new table? How would it affect performance? Can you test it easily?
If there is a real, visible separation between "recent" and "old" data (for example you have to use daily the data saved the last month only, and you have to keep everything older, but you do not use it), you can build a system with two SQL servers (installed on different machines). The first, highly available server, will serve to handle recent data. The second, less available and optimized for writing, will store everything else. Then, on schedule, a program will move old data from the first one to the second.
With such a small tuple size (2 ints, 1 datetime, 1 decimal) I think you will be fine having a single table with all the results in it. SQL server 2005 does not limit the number of rows in a table.
If you go down this road and run in to performance problems, then it is time to look at alternatives. Until then, I would plow ahead.
EDIT: Assuming you are using DECIMAL(9) or smaller, your total tuple size is 21 bytes (two 4-byte ints, an 8-byte datetime and a 5-byte DECIMAL(9)), which means you can store the entire table in less than 4 GB of memory: 200,000,000 rows * 21 bytes is roughly 3.9 GiB. If you have a decent server (8+ GB of memory) and this is the primary memory user, then the table and a secondary index could be stored in memory. This should ensure super fast queries after a slower warm-up time while the cache is populated.
I have an application that sends data based on user interaction (not user input). The data sent could be an Integer, String, Date, or Boolean value. There are 140 keys. We may get anywhere from 1 key value pair to all 140 at a time.
We want to store everything but will only be using 20 out of 140 keys within the application. The remaining will be used for an audit trail later on - so we still need to store them.
This data is used by the application to decide where the user needs to go so it needs to access the record by student id and pull the 20 or so options within milliseconds. There could be billions of rows of data (it is an upgrade to an existing application with over 20,000 users) so performance is critical. The user generates a new row each time they access the application.
EXAMPLE DATA:
Score:1
ID:3212
IsLast:False
Action:Completed
I have 2 ideas on how to do this and am looking for some help on which is best, or whether a third option would be a better choice.
OPTION 1:
My first idea is to store the value in a string column and then have a look-up table of possible data types to use when the value needs to be cast for use.
value | dataType
-----------------------
"1" | int
"Completed" | string
While the data being sent is not user generated, I know there must be a gotcha somewhere in this method. The only reason for doing this is that we don't know which key:value pairs will be sent (outside of date and id), and we are trying to avoid more than a few columns.
The SO question How to Handle Unknown Data Type in one Table uses a similar idea.
OPTION 2:
The other solution is to have 140 columns - one for each key. However, the amount of data generated is very large (billions of rows), so I don't think querying this data would be fast enough.
Technical Details:
This is using SQL Server 2008 (not R2) with .NET C# and Reporting Services.
Am I missing something here - what is the best way to create this table for performance?
Vertically segment your data. Put the 20 keys that are necessary for navigational control in one table, all 20 in one row, with a PK that identifies the user interaction (call it, say, InteractionId). Put the other 120 values in another table with a composite primary key based on the PK of the first table (InteractionId, plus a KeyTypeId identifying which of the 120 possible key-value pairs the value is for). Store all the values in this second table as strings. In a third lookup table called, say, KeyTypes, store the KeyTypeId, KeyTypeName, and KeyValueDataType to allow your code to know how to cast the string value to output it properly as either a string, datetime, integer, or decimal value or whatever...
The first table will be accessed much more often, and so it contains only those values which the application's navigational functionality needs frequent access to, keeping the rows narrower, which allows more rows per page and minimizes disk IO. Putting all 20 values in one row will keep the row count smaller (~1/20th as large), minimizing the depth of the index seeks that will need to be performed for each access.
The other table with all the other 120 key-values will not be accessed as frequently, so its structure can probably be optimized for logical simplicity rather than for performance.
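A minimal T-SQL sketch of that three-table layout, using the names suggested above; the column sizes and sample navigation columns are assumptions:

-- Narrow, hot table: one row per interaction, the 20 navigation keys as real columns.
CREATE TABLE Interaction (
    InteractionId BIGINT      NOT NULL PRIMARY KEY,
    StudentId     INT         NOT NULL,
    Score         INT         NULL,
    IsLast        BIT         NULL,
    [Action]      VARCHAR(50) NULL
    -- ...the rest of the 20 navigation keys
);
CREATE INDEX IX_Interaction_Student ON Interaction (StudentId);

-- Lookup of the other key types and how to cast their string values.
CREATE TABLE KeyTypes (
    KeyTypeId        SMALLINT    NOT NULL PRIMARY KEY,
    KeyTypeName      VARCHAR(50) NOT NULL,
    KeyValueDataType VARCHAR(20) NOT NULL  -- e.g. 'int', 'string', 'date', 'bool'
);

-- Cold audit table: one row per (interaction, key), values stored as strings.
CREATE TABLE InteractionValue (
    InteractionId BIGINT       NOT NULL REFERENCES Interaction (InteractionId),
    KeyTypeId     SMALLINT     NOT NULL REFERENCES KeyTypes (KeyTypeId),
    Value         VARCHAR(400) NULL,
    PRIMARY KEY (InteractionId, KeyTypeId)
);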
Actually, you might merge the suggestions offered so far:
Create a table with the 20 keys necessary for navigational control, plus one column for a primary key, plus one column of the XML data type to store the rest of the possible data. You could then create an XML schema that handles the data types for each key, plus constraints on certain keys as needed.
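A brief sketch of that merged approach, assuming SQL Server 2008; the table name, the ExtraData column and the sample XML shape are placeholders:

-- Hot navigation keys as real columns, everything else in one XML value per row.
CREATE TABLE InteractionXml (
    InteractionId BIGINT NOT NULL PRIMARY KEY,
    StudentId     INT    NOT NULL,
    Score         INT    NULL,
    IsLast        BIT    NULL,
    -- ...the rest of the 20 navigation keys
    ExtraData     XML    NULL   -- the other ~120 keys, e.g. <d><k n="Attempts">3</k></d>
);

-- Pulling one audited value back out later with the xml value() method.
SELECT InteractionId,
       ExtraData.value('(/d/k[@n="Attempts"])[1]', 'int') AS Attempts
FROM   InteractionXml
WHERE  StudentId = 3212;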
Well, it should be simple enough to test both ideas, but a variation on option 1 looks favoured to me. RDBMSs like SQL Server prefer long, narrow tables (i.e. fewer columns but lots of rows).
I won't go any further because it appears Charles has beaten me to it, with a perfectly sensible suggestion.