Database for 5-dimensional data? - sql

I commonly deal with data-sets that have well over 5 billion data points in a 3D grid over time. Each data point has a certain value, which needs to be visualised. So it's a 5-dimensional data-set. Lets say the data for each point looks like (x, y, z, time, value)
I need to run arbitrary queries against these data-sets where for example the value is between a certain range, or below a certain value.
I need to run queries where I need all data for a specific z value
These are the most common queries that I would need to run against this data-set. I have tried the likes of MySQL and MongoDB and created indices for those values, but the resource requirements are quite extreme with long query-runtime. I ended up writing my own file-format to at least store the data for relative easy retrieval. This approach makes it difficult to find data without having to read/scan the entire data-set.
I have looked at the likes of Hadoop and Hive, but the queries are not designed to be run in real-time. In terms of data size it seems a better fit though.
What would be the best method to index such larger amounts of data efficiently? Is a custom indexing system the best approach or to slice the data into smaller chunks and device a specific way of indexing (which way though?). the goal is to be able to run queries against the data and have the results returned in under 0.5 seconds. My best was 5 seconds by running the entire DB on a huge RAM drive.
Any comments and suggestions are welcome.
EDIT:
the data for all x, y, z, time and value are all FLOAT

It really depends on the hardware you have available, but regardless of that and considering the type and the amount of data you are dealing with, I definitely suggest a clustered solution.
As you already mentioned, Hadoop is not a good fit because it is primarily a batch processing tool.
Have a look at Cassandra and see if it solves your problem. I feel like a column-store rdbms like CitusDB (Free up to 6 nodes) or Vertica (Free up to 3 nodes) may also prove useful.

Related

Choosing NOsql or RDMS, and which DataStructure?

I'm going to start a big 3d modelization project, and i need to choose a system to store my data. The 3d-model in its raw format, before my engine process it, is made of billions of colored triangles.
Inputs :
- Each 3d-model will consist of a very large number of triangles ( 3 space points (bigint x,y,z ), and a color (rgb) ).
- If the INSERT is slow it's not a big deal, but the SELECT must be as fast as possible ( SELECT with some linear WHERE conditions ).
- Data consistancy is not important, if i lost a triangle or two in a model that's not very problematic.
- i can make a table per object, so i can make it read-only and i can put some column indexes (RDMS) on it.
there are my questions :
Data structure :
Many triangles will share the same dots. Should i save the triangles in one table like id x1,y1,z1 x2,y2,z2 x3,y3,z3 ,r,g,b ; or a table for dots id,x,y,z and a table for triangles id,dot1Id,dot2Id,dot3Id,r,g,b ( a join on billion rows will be slow i think, and with NOsql we can't even do a join if my memory is good... ) ???
RDMS or NOsql ?
I think NOsql is good for what i need, does it do fast conditionnal Selects on very large data ? RDMS can be good too because my data is very formatted and consists of integers, indexes can do magic here.
Being no expert, I cannot really answer your question. But the comments are to short to give you proper advice there. So here are my "comments" in the "answer section" :-)
I understand the model such that it consists of a list (or a set to be precise) of triangles. You will always have to read the complete data to load your model. You don't only load the trianles where x < 100 and y > 1000 or whatever.
So your thoughts to the problem are good. What links your triangles are the points they share. So yes, you could use an RDBMS and store points in one table and their associated triangles in another. However the "pointers" (in your triangle table pointing to your point records) in an RDBMS are usually quite large (the so called row ids), so you don't really save space, but have the need to re-construct all your triangles from their related points, which takes time. My advice: No RDBMS.
If above assumption is right and your model consists of a mere set of triangles in random order, you can easily store them in a file. In some custom binary format, so the file will be rather small. You could even compress it to use less hard disk sectors. No use to store some relation anywhere, just one file per model. You read from beginning to end. It cannot be faster.
I don't know of any NoSQL database being specialized on just storing large binary files. Its only task would be to give you the raw data for a model. I mean it would only save you the trouble to think of a way to store several model files and find them again (and it might do the compression if appropriate). But well, storing the model names with the file names can be done in registry, config file, small text file, whatever. No big deal, I think. So my second advice would be not to use a NoSQL DBMS either. But as mentioned, I'm no exprt on this, and there may be dbms specialized on exactly this (small) task.

indexed flat-files

We have a batch analytical SQL job – run once daily – that reads data from 2 source tables held in a powerful RDBMS. The source tables are huge (>100TB) but has less than 10 fields combined.
The question I have is can the 2 source tables be held in a compressed and indexed flat file so the entire operation can be much faster and saves on storage and can be run on a low spec server. Also, can we run SQL like queries against these compressed and indexed flat-files? Any pointers on how to go about doing this would be extremely helpful.
Most optimization strategies optimize either speed or size, and trade one off against the other. In general, RDBMS solutions optimize for speed, at the expense of size - for instance, by creating an index, you take up more space, and in return you get faster data access.
So your desire to optimize for both speed AND size is unlikely to be fulfilled - you almost certainly have to trade one against the other.
Secondly, if you want to execute "sql-like" queries, I'm pretty sure that an RDBMS is the best solution - especially with huge data sets.
It may be the case that the underlying data lends itself to a specific optimization - for instance, if you can create a custom indexing scheme based on bitmasks to create integers, and using those integers to access data using boolean operators, you may be able to beat the performance of an RDBMS index.

What would be the most efficient method for storing/updating Interval based data in SQL?

I have a database table with about 700 millions rows plus (growing exponentially) of time based data.
Fields:
PK.ID,
PK.TimeStamp,
Value
I also have 3 other tables grouping this data into Days, Months, Years which contains the sum of the value for each ID in that time period. These tables are updated nightly by a SQL job, the situation has arisen where by the tables will need to updated on the fly when the data in the base table is updated, this can be however up to 2.5 million rows at a time (not very often, typically around 200-500k up to every 5 minutes), is this possible without causing massive performance hits or what would be the best method for achieving this?
N.B
The daily, monthly, year tables can be changed if needed, they are used to speed up queries such as 'Get the monthly totals for these 5 ids for the last 5 years', in raw data this is about 13 million rows of data, from the monthly table its 300 rows.
I do have SSIS available to me.
I cant afford to lock any tables during the process.
700M recors in 5 months mean 8.4B in 5 years (assuming data inflow doesn't grow).
Welcome to the world of big data. It's exciting here and we welcome more and more new residents every day :)
I'll describe three incremental steps that you can take. The first two are just temporary - at some point you'll have too much data and will have to move on. However, each one takes more work and/or more money so it makes sense to take it a step at a time.
Step 1: Better Hardware - Scale up
Faster disks, RAID, and much more RAM will take you some of the way. Scaling up, as this is called, breaks down eventually, but if you data is growing linearly and not exponentially, then it'll keep you floating for a while.
You can also use SQL Server replication to create a copy of your database on another server. Replication works by reading transaction logs and sending them to your replica. Then you can run the scripts that create your aggregate (daily, monthly, annual) tables on a secondary server that won't kill the performance of your primary one.
Step 2: OLAP
Since you have SSIS at your disposal, start discussing multidimensional data. With good design, OLAP Cubes will take you a long way. They may even be enough to manage billions of records and you'll be able to stop there for several years (been there done that, and it carried us for two years or so).
Step 3: Scale Out
Handle more data by distributing the data and its processing over multiple machines. When done right this allows you to scale almost linearly - have more data then add more machines to keep processing time constant.
If you have the $$$, use solutions from Vertica or Greenplum (there may be other options, these are the ones that I'm familiar with).
If you prefer open source / byo, use Hadoop, log event data to files, use MapReduce to process them, store results to HBase or Hypertable. There are many different configurations and solutions here - the whole field is still in its infancy.
Indexed views.
Indexed views will allow you to store and index aggregated data. One of the most useful aspects of them is that you don't even need to directly reference the view in any of your queries. If someone queries an aggregate that's in the view, the query engine will pull data from the view instead of checking the underlying table.
You will pay some overhead to update the view as data changes, but from your scenario it sounds like this would be acceptable.
Why don't you create monthly tables, just to save the info you need for that months. It'd be like simulating multidimensional tables. Or, if you have access to multidimensional systems (oracle, db2 or so), just work with multidimensionality. That works fine with time period problems like yours. At this moment I don't have enough info to give you, but you can learn a lot about it just googling.
Just as an idea.

mysql - Creating rows vs. columns performance

I built an analytics engine that pulls 50-100 rows of raw data from my database (lets call it raw_table), runs a bunch statistical measurements on it in PHP and then comes up with exactly 140 datapoints that I then need to store in another table (lets call it results_table). All of these data points are very small ints ("40","2.23","-1024" are good examples of the types of data).
I know the maximum # of columns for mysql is quite high (4000+) but there appears to be a lot of grey area as far as when performance really starts to degrade.
So a few questions here on best performance practices:
1) The 140 datapoints could be, if it is better, broken up into 20 rows of 7 data points all with the same 'experiment_id' if fewer columns is better. HOWEVER I would always need to pull ALL 20 rows (with 7 columns each, plus id, etc) so I wouldn't think this would be better performance than pulling 1 row of 140 columns. So the question: is it better to store 20 rows of 7-9 columns (that would all need to be pulled at once) or 1 row of 140-143 columns?
2) Given my data examples ("40","2.23","-1024" are good examples of what will be stored) I'm thinking smallint for the structure type. Any feedback there, performance-wise or otherwise?
3) Any other feedback on mysql performance issues or tips is welcome.
Thanks in advance for your input.
I think the advantage to storing as more rows (i.e. normalized) depends on design and maintenance considerations in the face of change.
Also, if the 140 columns have the same meaning or if it differs per experiment - properly modeling the data according to normalization rules - i.e. how is data related to a candidate key.
As far as performance, if all the columns are used it makes very little difference. Sometimes a pivot/unpivot operation can be expensive over a large amount of data, but it makes little difference on a single key access pattern. Sometimes a pivot in the database can make your frontend code a lot simpler and backend code more flexible in the face of change.
If you have a lot of NULLs, it might be possible to eliminate rows in a normalized design and this would save space. I don't know if MySQL has support for a sparse table concept, which could come into play there.
You have a 140 data items to return every time, each of type double.
It makes no practical difference whether this is 1x140 or 20x7 or 7x20 or 4x35 etc. It could be infinitesimally quicker for one shape of course but then have you considered the extra complexity in the PHP code to deal with a different shape.
Do you have a verified bottleneck, or is this just random premature optimisation?
You've made no suggestion that you intend to store big data in the database, but for the purposes of this argument, I will assume that you have 1 billion (10^9) data points.
If you store them in 140 columns, you'll have a mere 7 millon rows, however, if you want to retrieve a single data point from lots of experiments, then it will have to fetch a large number of very wide rows.
These very wide rows will take up more space in your innodb_buffer_pool, hence you won't be able to cache so many; this will potentially slow you down when you access them again.
If you store one datapoint per row, in a table with very few columns (experiment_id, datapoint_id, value) then you'll need to pull out the same number of smaller rows.
However, the size of rows makes little difference to the number of IO operations required. If we assume that your 1 billion datapoints doesn't fit in ram (which is NOT a safe assumption nowadays), maybe the resulting performance will be approximately the same.
It is probably better database design to use few columns; but it will use less disc space and perhaps be faster to populate, if you use lots of columns.

Do relational databases provide a feasible backend for a process historian?

In the process industry, lots of data is read, often at a high frequency, from several different data sources, such as NIR instruments as well as common instruments for pH, temperature, and pressure measurements. This data is often stored in a process historian, usually for a long time.
Due to this, process historians have different requirements than relational databases. Most queries to a process historian require either time stamps or time ranges to operate on, as well as a set of variables of interest.
Frequent and many INSERT, many SELECT, few or no UPDATE, almost no DELETE.
Q1. Is relational databases a good backend for a process historian?
A very naive implementation of a process historian in SQL could be something like this.
+------------------------------------------------+
| Variable |
+------------------------------------------------+
| Id : integer primary key |
| Name : nvarchar(32) |
+------------------------------------------------+
+------------------------------------------------+
| Data |
+------------------------------------------------+
| Id : integer primary key |
| Time : datetime |
| VariableId : integer foreign key (Variable.Id) |
| Value : float |
+------------------------------------------------+
This structure is very simple, but probably slow for normal process historian operations, as it lacks "sufficient" indexes.
But for example if the Variable table would consist of 1.000 rows (rather optimistic number), and data for all these 1.000 variables would be sampled once per minute (also an optimistic number) then the Data table would grow with 1.440.000 rows per day. Lets continue the example, estimate that each row would take about 16 bytes, which gives roughly 23 megabytes per day, not counting additional space for indexes and other overhead.
23 megabytes as such perhaps isn't that much but keep in mind that numbers of variables and samples in the example were optimistic and that the system will need to be operational 24/7/365.
Of course, archiving and compression comes to mind.
Q2. Is there a better way to accomplish this? Perhaps using some other table structure?
I work with a SQL Server 2008 database that has similar characteristics; heavy on insertion and selection, light on update/delete. About 100,000 "nodes" all sampling at least once per hour. And there's a twist; all of the incoming data for each "node" needs to be correlated against the history and used for validation, forecasting, etc. Oh, there's another twist; the data needs to be represented in 4 different ways, so there are essentially 4 different copies of this data, none of which can be derived from any of the other data with reasonable accuracy and within reasonable time. 23 megabytes would be a cakewalk; we're talking hundreds-of-gigabytes to terabytes here.
You'll learn a lot about scale in the process, about what techniques work and what don't, but modern SQL databases are definitely up to the task. This system that I just described? It's running on a 5-year-old IBM xSeries with 2 GB of RAM and a RAID 5 array, and it performs admirably, nobody has to wait more than a few seconds for even the most complex queries.
You'll need to optimize, of course. You'll need to denormalize frequently, and maintain pre-computed aggregates (or a data warehouse) if that's part of your reporting requirement. You might need to think outside the box a little: for example, we use a number of custom CLR types for raw data storage and CLR aggregates/functions for some of the more unusual transactional reports. SQL Server and other DB engines might not offer everything you need up-front, but you can work around their limitations.
You'll also want to cache - heavily. Maintain hourly, daily, weekly summaries. Invest in a front-end server with plenty of memory and cache as many reports as you can. This is in addition to whatever data warehousing solution you come up with if applicable.
One of the things you'll probably want to get rid of is that "Id" key in your hypothetical Data table. My guess is that Data is a leaf table - it usually is in these scenarios - and this makes it one of the few situations where I'd recommend a natural key over a surrogate. The same variable probably can't generate duplicate rows for the same timestamp, so all you really need is the variable and timestamp as your primary key. As the table gets larger and larger, having a separate index on variable and timestamp (which of course needs to be covering) is going to waste enormous amounts of space - 20, 50, 100 GB, easily. And of course every INSERT now needs to update two or more indexes.
I really believe that an RDBMS (or SQL database, if you prefer) is as capable for this task as any other if you exercise sufficient care and planning in your design. If you just start slinging tables together without any regard for performance or scale, then of course you will get into trouble later, and when the database is several hundred GB it will be difficult to dig yourself out of that hole.
But is it feasible? Absolutely. Monitor the performance constantly and over time you will learn what optimizations you need to make.
It sounds like you're talking about telemetry data (time stamps, data points).
We don't use SQL databases for this (although we do use SQL databases to organize it); instead, we use binary streaming files to capture the actual data. There are a number of binary file formats that are suitable for this, including HDF5 and CDF. The file format we use here is a proprietary compressible format. But then, we deal with hundreds of megabytes of telemetry data in one go.
You might find this article interesting (links directly to Microsoft Word document):
http://www.microsoft.com/caseStudies/ServeFileResource.aspx?4000003362
It is a case study from the McClaren group, describing how SQL Server 2008 is used to capture and process telemetry data from formula one race cars. Note that they don't actually store the telemetry data in the database; instead, it is stored in the file system, and the FILESTREAM capability of SQL Server 2008 is used to access it.
I believe you're headed in the right path. We have a similar situation were we work. Data comes from various transport / automation systems across various technologies such as manufacturing, auto, etc. Mainly we deal with the big 3: Ford, Chrysler, GM. But we've had a lot of data coming in from customers like CAT.
We ended up extracting data into a database and as long as you properly index your table, keep updates to a minimum and schedule maintenance (rebuild indexes, purge old data, update statistics) then I see no reason for this to be a bad solution; in fact I think it is a good solution.
Certainly a relational database is suitable for mining the data after the fact.
Various nuclear and particle physics experiments I have been involved with have explored several points from not using a RDBMS at all though storing just the run summaries or the run summaries and the slowly varying environmental conditions in the DB all the way to cramming every bit collected into the DB (though it was staged to disk first).
When and where the data rate allows more and more groups are moving towards putting as much data as possible into the database.
IBM Informix Dynamic Server (IDS) has a TimeSeries DataBlade and RealTime Loader which might provide relevant functionality.
Your naïve schema records each reading 100% independently, which makes it hard to correlate across readings- both for the same variable at different times and for different variables at (approximately) the same time. That may be necessary, but it makes life harder when dealing with subsequent processing. How much of an issue that is depends on how often you will need to run correlations across all 1000 variables (or even a significant percentage of the 1000 variables, where significant might be as small as 1% and would almost certainly start by 10%).
I would look to combine key variables into groups that can be recorded jointly. For example, if you have a monitor unit that records temperature, pressure and acidity (pH) at one location, and there are perhaps a hundred of these monitors in the plant that is being monitored, I would expect to group the three readings plus the location ID (or monitor ID) and time into a single row:
CREATE TABLE MonitorReading
(
MonitorID INTEGER NOT NULL REFERENCES MonitorUnit,
Time DATETIME NOT NULL,
PhReading FLOAT NOT NULL,
Pressure FLOAT NOT NULL,
Temperature FLOAT NOT NULL,
PRIMARY KEY (MonitorID, Time)
);
This saves having to do self-joins to see what the three readings were at a particular location at a particular time, and uses about 20 bytes instead of 3 * 16 = 48 bytes per row. If you are adamant that you need a unique ID integer for the record, that increases to 24 or 28 bytes (depending on whether you use a 4-byte or 8-byte integer for the ID column).
Yes, a DBMS is appropriate for this, although not the fastest option. You will need to invest in a reasonable system to handle the load though. I will address the rest of my answer to this problem.
It depends on how beefy a system you're willing to throw at the problem. There are two main limiters for how fast you can insert data into a DB: bulk I/O speed and seek time. A well-designed relational DB will perform at least 2 seeks per insertion: one to begin the transaction (in case the transaction can not be completed), and one when the transaction is committed. Add to this additional storage to seek to your index entries and update them.
If your data are large, then the limiting factor will be how fast you can write data. For a hard drive, this will be about 60-120 MB/s. For a solid state disk, you can expect upwards of 200 MB/s. You will (of course) want extra disks for a RAID array. The pertinent figure is storage bandwidth AKA sequential I/O speed.
If writing a lot of small transactions, the limitation will be how fast your disk can seek to a spot and write a small piece of data, measured in IO per second (IOPS). We can estimate that it will take 4-8 seeks per transaction (a reasonable case with transactions enabled and an index or two, plus some integrity checks). For a hard drive, the seek time will be several milliseconds, depending on disk RPM. This will limit you to several hundred writes per second. For a solid state disk, the seek time is under 1 ms, so you can write several THOUSAND transactions per second.
When updating indices, you will need to do about O(log n) small seeks to find where to update, so the DB will slow down as the record counts grow. Remember that a DB may not write in the most efficient format possible, so data size may be bigger than you expect.
So, in general, YES, you can do this with a DBMS, although you will want to invest in good storage to ensure it can keep up with your insertion rate. If you wish to cut on cost, you may want to roll data over a specific age (say 1 year) into a secondary, compressed archive format.
EDIT:
A DBMS is probably the easiest system to work with for storing recent data, but you should strongly consider the HDF5/CDF format someone else suggested for storing older, archived data. It is an flexible and widely supported format, provides compression, and provides for compression and VERY efficient storage of large time series and multi-dimensional arrays. I believe it also provides for some methods of indexing in the data. You should be able to write a little code to fetch from these archive files if data is too old to be in the DB.
There is probably a data structure that would be more optimal for your given case than a relational database.
Having said that, there are many reasons to go with a relational DB including robust code support, backup & replication technology and a large community of experts.
Your use case is similar to high-volume financial applications and telco applications. Both are frequently inserting data and frequently doing queries that are both time-based and include other select factors.
I worked on a mid-sized billing project that handled cable bills for millions of subscribers. That meant an average of around 5 rows per subscriber times a few million subscribers per month in the financial transaction table alone. That was easily handled by a mid-size Oracle server using (now) 4 year old hardware and software. Large billing platforms can have 10x that many records per unit time.
Properly architected and with the right hardware, this case can be handled well by modern relational DB's.
Years ago, a customer of ours tried to load an RDBMS with real-time data collected from monitoring plant machinery. It didn't work in a simplistic way.
Is relational databases a good backend for a process historian?
Yes, but. It needs to store summary data, not details.
You'll need a front-end based in-memory and on flat files. Periodic summaries and digests can be loaded into an RDBMS for further analysis.
You'll want to look at Data Warehousing techniques for this. Most of what you want to do is to split your data into two essential parts ---
Facts. The data that has units. Actual measurements.
Dimensions. The various attributes of the facts -- date, location, device, etc.
This leads you to a more sophisticated data model.
Fact: Key, Measure 1, Measure 2, ..., Measure n, Date, Geography, Device, Product Line, Customer, etc.
Dimension 1 (Date/Time): Year, Quarter, Month, Week, Day, Hour
Dimension 2 (Geography): location hierarchy of some kind
Dimension 3 (Device): attributes of the device
Dimension *n*: attributes of each dimension of the fact
You may want to look at KDB. It is specificaly optimized for this kind of usage: many inserts, few or no updates or deletes.
It isn't as easy to use as traditional RDBMS though.
The other aspect to consider is what kind of selects you're doing. Relational/SQL databases are great for doing complex joins dependent on multiple indexes, etc. They really can't be beaten for that. But if you're not doing that kind of thing, they're probably not such a great match.
If all you're doing is storing per-time records, I'd be tempted to roll your own file format ... even just output the stuff as CSV (groans from the audience, I know, but it's hard to beat for wide acceptance)
It really depends on your indexing/lookup requirements, and your willingness to write tools to do it.
You may want to take a look at a Stream Data Manager System (SDMS).
While not addressing all your needs (long-time persistence), sliding windows over time and rows and frequently changing data are their points of strength.
Some useful links:
Stanford Stream Data Manager
Stream Mill
Material about Continuous Queries
AFAIK major database makers all should have some kind of prototype version of an SDMS in the works, so I think it's a paradigm worth checking out.
I know you're asking about relational database systems, but those are unicorns. SQL DBMSs are probably a bad match for your needs because no current SQL system (I know of) provides reasonable facilities to deal with temporal data. depending on your needs you might or might not have another option in specialized tools and formats, see e. g. rrdtool.