Comment on (write once) database design storing network information - sql

I have raw logs of up to a week of associations of wifi enabled devices to the wifi routers in my Institute.
As of now I plan to put it in a database as follows:
One big table having columns:
Router MAC address
Device MAC address
Timestamp
The database will be queried only to retrieve connection history of a particular device over the week.
After a record becomes one week old, it will be deleted or moved somewhere else.
The number of devices is probably between 800 and 2000, so it does not make sense to have a separate table for each device, or does it?
There are fewer than 40 wifi routers, I believe.
Finally, a week of logs will be less than 1 GB.
For now I am using a SQL database (I would really like to stick with it, but please mention an alternative if it offers a drastic improvement).
My question is whether the one-table approach is efficient time-wise, that is, in the time it takes for a query to execute and return results.
Space and redundancy are not a problem, just the speed at which the query comes back*.
Also comment on what possible measures/modifications can be taken to make such a system scalable.
*It should not be so complicated that it becomes difficult to maintain.

I see no reason why you would need to split this up any other way. If your sole purpose is to store Router and Device Addresses with a datetime stamp there really is nothing else you need to do. One table would do it.
The whole idea behind normalization (yes I am over simplifying it here) is that you should never have to repeat the same data over and over again in multiple tables.
For example say you wanted to store this:
Router | Mac Address | Device Type | IP Address | Device Model | Device Serial Number
It would be better to have a router/device table that specifies all this and has a specific DeviceID and one that has say
DeviceID | TimeStamp
You would do this so you wouldn't have to repeat all the information for every log entry.
Hope this helps...
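For what it's worth, here is a minimal sketch of the one-table approach with a composite index that matches the only query described (history of one device over the week). It uses Python's sqlite3 module purely as a runnable stand-in for whatever SQL database is in use; the table name association_log, the index name idx_device_ts, the helper names and the sample MAC addresses are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the real database
conn.executescript("""
CREATE TABLE association_log (
    router_mac TEXT NOT NULL,
    device_mac TEXT NOT NULL,
    ts         TEXT NOT NULL          -- ISO-8601 text sorts chronologically
);
-- One composite index that matches the only query: one device, ordered by time.
CREATE INDEX idx_device_ts ON association_log (device_mac, ts);
""")

conn.executemany(
    "INSERT INTO association_log VALUES (?, ?, ?)",
    [
        ("00:11:22:33:44:01", "aa:bb:cc:dd:ee:01", "2011-03-01T09:00:00"),
        ("00:11:22:33:44:02", "aa:bb:cc:dd:ee:01", "2011-03-01T11:30:00"),
        ("00:11:22:33:44:01", "aa:bb:cc:dd:ee:02", "2011-03-01T10:15:00"),
    ],
)

def device_history(conn, device_mac):
    """Return (timestamp, router) pairs for one device, oldest first."""
    return conn.execute(
        "SELECT ts, router_mac FROM association_log "
        "WHERE device_mac = ? ORDER BY ts",
        (device_mac,),
    ).fetchall()

def purge_older_than(conn, cutoff_ts):
    """Delete rows older than the cutoff; returns the number of rows removed."""
    with conn:
        return conn.execute(
            "DELETE FROM association_log WHERE ts < ?", (cutoff_ts,)
        ).rowcount

history = device_history(conn, "aa:bb:cc:dd:ee:01")
print(history)
```

With well under 1 GB per week, a plain DELETE like purge_older_than is likely fine; partitioning only starts to pay off when the weekly purge itself becomes a problem.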

Due to the way you are storing and removing the data, my suggestion would be to create a single table, partitioned by timestamp. The benefits of this method:
Archiving data (dropping partition, moving partition) has less of a penalty than large DELETEs.
The table size that you are querying will be smaller, thus queries should be faster (caveats always apply).
If you partition by timestamp and use that partitioning key in your queries, constraint exclusion will come into play, meaning that only the partitions that contain the data you are after will be queried, and the others will be discarded from the plan.
On top of that, you can index over multiple columns with indexes designed specifically to quickly retrieve data for your queries. What those indexes will look like is impossible to say at this time as there is no DDL, sample data, or queries (that can be a follow up question).
Using a CLUSTERed index could help too, as you have static data (WORM).
If you need really fast lookups on the INET types (that is, if the built-in network datatypes are not fast enough), then have a look at the IP4R data type: http://pgfoundry.org/projects/ip4r/

Related

How to archive a giant postgres table?

We have a postgres table called history which is almost 900 GB and continuously growing by 10 GB per day.
This table is accessed by a microservice (carts). We have postgres replication set up (one master and one slave).
Two instances of the microservice are running in production: one instance uses the master postgres connection to write data and to read data for some endpoints,
and the other instance uses the postgres slave connection only to read data.
Table definition:
id - uuid
data - jsonb column
internal - jsonb column
context - jsonb column
created_date - date
modified_date - date
In the above table, the data and internal columns are loaded with big JSON for every row. We have come to the conclusion that nulling the data and internal
columns will reduce the total space consumed by this table.
Question:
How to archive this giant table? (meaning only cleaning up the data and internal columns).
How to achieve this with zero downtime and without performance degradation?
Approaches tested as of now.
Using pg_repack (this is the best idea so far, but the issue is that once pg_repack is done, the entire new table needs to be synced with the slave instance, which causes WAL overhead).
Just nullify the data and internal columns alone. The problem with this approach is that it just increases the table size, because postgres follows the MVCC pattern.
Use a temporary table and clone the data:
Create an UNLOGGED table, historyv2.
Copy the data from the original table to the historyv2 table without data and internal.
Then switch the table to LOGGED (I guess this will also cause WAL overhead).
Then rename the tables.
Can you guys give me some pointers on how to achieve this?
Postgres Version: 9.5
I always feel that questions like these conflate a few different ideas, which makes them seem more complicated than they should be. What does minimal performance impact mean? Generating lots of WAL may increase file i/o and cpu and network usage, but in most cases does not affect the system enough to have a client-facing impact. If no downtime is the most important thing, you should focus on optimizing for that, and not worry about what the process is to get there (within reason).
That said, if I woke up in your shoes, I would first set up partitions for data going forward using table inheritance, so that the data could be more easily segmented and worked on in the future. (This isn't entirely necessary, but it probably makes your life easier in the future.) After that, I would write a script to slowly go through the old data and create new partitions with the "nulled out" data, interleaving partition creation, deletion of data, and vacuums against the main table. Once that is automated, you can let it churn slowly or during off-hours until it is done. You might need to do a final repack or vacuum full on the parent once all the data is moved, but it's probably OK even without it. Again, this isn't the simplest idea, and probably not the fastest way to do it (if you could have downtime), but in the end you'll have the schema you want without causing any service disruptions.
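To make the "slow script" idea a bit more concrete, here is a rough sketch of the batching loop only. It runs against SQLite via Python's sqlite3 module, which is just a runnable stand-in: it has no table inheritance, so plain per-month tables named history_YYYY_MM play the role of the child partitions, and the vacuum step is only a comment. The function name archive_month, the month-sized batches and the sample rows are assumptions, not anything from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE history ("
    "id TEXT PRIMARY KEY, data TEXT, internal TEXT, context TEXT, created_date TEXT)"
)
conn.executemany(
    "INSERT INTO history VALUES (?, ?, ?, ?, ?)",
    [(f"id-{i}", "big json", "big json", "ctx", f"2019-0{1 + i % 3}-15")
     for i in range(9)],
)
conn.commit()

def archive_month(conn, month):
    """Move one month (e.g. '2019-01') into its own table, nulling data/internal."""
    table = "history_" + month.replace("-", "_")  # name built from our own data only
    with conn:  # one short transaction per batch keeps locks brief
        # Empty copy of the column layout (constraints are not copied).
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table} AS SELECT * FROM history WHERE 0"
        )
        conn.execute(
            f"INSERT INTO {table} (id, data, internal, context, created_date) "
            "SELECT id, NULL, NULL, context, created_date FROM history "
            "WHERE substr(created_date, 1, 7) = ?",
            (month,),
        )
        conn.execute(
            "DELETE FROM history WHERE substr(created_date, 1, 7) = ?", (month,)
        )
    return table

months = [row[0] for row in conn.execute(
    "SELECT DISTINCT substr(created_date, 1, 7) FROM history ORDER BY 1"
)]
for month in months:
    archive_month(conn, month)
    # On PostgreSQL, a VACUUM of the parent table and a pause would go here.

remaining = conn.execute("SELECT COUNT(*) FROM history").fetchone()[0]
print(months, remaining)
```

On a real system you would probably pick a batch size by watching replication lag rather than by calendar month.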

SQL Server Time Series Modelling Huge datacollection

I have to implement data collection for replay of electrical parameters for hundreds to thousands of devices, with at least 20 parameters to monitor per device. This amounts to a huge data collection, as it will be very similar to a time series. I have to support a resolution of 1 second. Thinking about 1 year with 1000 devices: [365*24*60*60*1000] = 31536000000 rows.
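As a quick sanity check of that estimate, here is the arithmetic as a small Python sketch. It assumes the factor of 1000 is the device count and compares two hypothetical layouts: one wide row per device per second, and one narrow row per parameter per second (20 parameters). The function name rows_per_year is made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000

def rows_per_year(devices, rows_per_device_per_second=1):
    """Rows written in one year at 1-second resolution."""
    return devices * rows_per_device_per_second * SECONDS_PER_YEAR

wide = rows_per_year(1000)        # one row holding all 20 parameters
narrow = rows_per_year(1000, 20)  # one row per parameter

print(f"{wide:,}")    # 31,536,000,000
print(f"{narrow:,}")  # 630,720,000,000
```

So the row layout alone changes the row count by a factor of 20, which is worth deciding before worrying about partitioning.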
I did my research but still have a few questions:
As the data will be huge, is it good to keep the data in the same table, or should the tables be split (the data structure is the same)? Or should I rely on indexes?
Data inserts will also be very frequent, but I can batch them. Still, what is the best way: writing directly to the same database, or using a temporary database for writes and syncing with it?
Does SQL Server have a specific schema recommendation for optimizing time series for selects, updates and inserts? Is there any out-of-the-box help for daily averages or other common aggregate functions? I can write my own, but since this is a standard problem there might be best practices and samples available out of the box.
Please let me know; any help is appreciated. Thanks in advance.
1) You probably want to explore the use of partitions. This will allow very effective and very fast inserts (it's a metadata operation if you do the partitioning correctly). 2) You may want to explore columnstore indexes, because the data (once collected) will never change and you will have very large data sets. Partitioning and columnstore require a learning curve, but it's very doable. There is lots of code on the internet describing the use of date functions in SQL Server.
That is a big number, but I would start with one table and see if it holds up. If you split it into multiple tables, it is still the same amount of data.
Do you ever need to search across devices? If not you can have a separate table for each device.
I have some audit tables that are not that big but still big and have not had any problems. If the data is loaded in time order then make date the first (or only) column of the clustered index.
If the PK is (date, device), then fine, but if you can get two readings in the same second you cannot do that. If this is the PK, then load the data in that sort order if you can, even if you have to stage each second and then load. You just cannot afford to fragment a table that big. If you cannot load in sort order, then use a fill factor of 50%.
If you cannot have a PK then just use date as clustered index but not as PK and put a non clustered index on device.
I have some tables of 3,000,000,000 and I have the luxury of loading by PK with no other indexes. There is no measurable degradation in insert from row 1 to row 3,000,000,000.

archiving the table : searching for the best way

There is a table which has 80,000 rows.
Every day I will clone this table to another log table with a name like 20101129_TABLE, and every day the prefix will change according to the date.
By that calculation, the data will grow by 2,400,000 rows every month.
Advice, please, on saving space and getting fast service, and on other advantages and disadvantages! How should I go about creating the best archive or log?
It is a table that holds account info: branch code, balance, etc.
It is quite tricky to answer your question since you are a bit vague on some important facts:
How often do you need the archived tables?
How free are you in your design-choices?
If you don't need the archived data often and you are free in your design, I'd copy the data into an archive database. That will give you the option of storing the database on a separate disk (cost-efficiency), and you can have a separate backup schedule for that database as well.
You could also store all the data in one table with just an additional column like ArchiveDate datetime. But I think this depends really on how you plan on accessing the data later.
Consider TABLE PARTITIONING (MSDN) - it is designed for exactly this kind of scenario. Not only can you spread data across partitions (and map partitions to different disks), you can also keep all data in the same table and let MSSQL do all the hard work in the background (which partition to use based on select criteria, etc.).

Handling 100's of 1,000,000's of rows in T-SQL2005

I have a couple of databases containing simple data which needs to be imported into a new-format schema. I've come up with a flexible schema, but it relies on the critical data of the two older DBs being stored in one table. This table has only a primary key, a foreign key (both ints), a datetime and a decimal field, but adding the row counts from the two older DBs indicates that the total row count for this new table would be about 200,000,000 rows.
How do I go about dealing with this amount of data? It is data stretching back about 10 years and does need to be available. Fortunately, we don't need to pull out even 1% of it when making queries in the future, but it does all need to be accessible.
I've got ideas based around having multiple tables for year, supplier (of the source data) etc - or even having one database for each year, with the most recent 2 years in one DB (which would also contain the stored procs for managing all this.)
Any and all help, ideas, suggestions very, deeply, much appreciated,
Matt.
Most importantly, consider profiling your queries and measuring where your actual bottlenecks are (try identifying the missing indexes). You might see that you can store everything in a single table, or that buying a few extra hard disks will be enough to get sufficient performance.
Now, for suggestions, have you considered partitioning? You could create partitions per time range, or one partition with the 1% commonly accessed and another with the 99% of the data.
This is roughly equivalent to splitting the tables manually by year or supplier or whatnot, but internally handled by the server.
On the other hand, it might make more sense to actually split the tables into 'current' and 'historical'.
Another possible size improvement is using an int (like an epoch) instead of a datetime and providing functions to convert from datetime to int, thus having queries like
SELECT * FROM megaTable WHERE datetime > dateTimeToEpoch('2010-01-23')
This size saving will probably come at a performance cost if you need to do complex datetime queries. On cubes, though, there is the standard technique of storing an int in YYYYMMDD format instead of an epoch.
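To illustrate the two integer encodings mentioned here, a small Python sketch follows. The function names datetime_to_epoch and datetime_to_yyyymmdd are hypothetical (in the spirit of the dateTimeToEpoch function in the query above), and the epoch variant assumes the date means midnight UTC.

```python
from datetime import datetime, timezone

def datetime_to_epoch(text):
    """Parse 'YYYY-MM-DD' as midnight UTC; return seconds since 1970-01-01."""
    dt = datetime.strptime(text, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())

def datetime_to_yyyymmdd(text):
    """Encode 'YYYY-MM-DD' as a sortable integer such as 20100123."""
    dt = datetime.strptime(text, "%Y-%m-%d")
    return dt.year * 10000 + dt.month * 100 + dt.day

print(datetime_to_epoch("2010-01-23"))     # 1264204800
print(datetime_to_yyyymmdd("2010-01-23"))  # 20100123
```

Both encodings sort in date order, so range predicates keep working; the YYYYMMDD form stays human-readable but drops the time of day.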
What's the problem with storing this data in a single table? An enterprise-level SQL server like Microsoft SQL 2005 can handle it without much pain.
By the way, do not do tables per year, tables per supplier or other things like this. If you have to store similar set of items, you need one and one only table. Setting multiple tables to store the same type of things will cause problems, like:
Queries would be extremely difficult to write, and performance will be decreased if you have to query from multiple tables.
The database design will be very difficult to understand (especially since it's not something natural to store the same type of items in different places).
You will not be able to easily modify your database (maybe it's not a problem in your case), because instead of changing one table, you would have to change every table.
It would require automating a bunch of tasks. Let's say you have a table per year. If a new record is inserted on 2011-01-01 00:00:00.001, will a new table be created? Will you check at each insert whether you must create a new table? How would it affect performance? Can you test it easily?
If there is a real, visible separation between "recent" and "old" data (for example you have to use daily the data saved the last month only, and you have to keep everything older, but you do not use it), you can build a system with two SQL servers (installed on different machines). The first, highly available server, will serve to handle recent data. The second, less available and optimized for writing, will store everything else. Then, on schedule, a program will move old data from the first one to the second.
With such a small tuple size (2 ints, 1 datetime, 1 decimal) I think you will be fine having a single table with all the results in it. SQL server 2005 does not limit the number of rows in a table.
If you go down this road and run into performance problems, then it is time to look at alternatives. Until then, I would plow ahead.
EDIT: Assuming you are using DECIMAL(9) or smaller, your total tuple size is 21 bytes which means that you can store the entire table in less than 4 GB of memory. If you have a decent server(8+ GB of memory) and this is the primary memory user, then the table and a secondary index could be stored in memory. This should ensure super fast queries after a slower warm-up time before the cache is populated.
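The arithmetic behind that estimate, as a short Python sketch. The per-column sizes are the usual SQL Server storage sizes (4 bytes for int, 8 for datetime, 5 for a decimal of precision 1 to 9); row and page overhead are deliberately ignored, so treat the result as a lower bound.

```python
# Assumed fixed-width storage sizes in bytes (no row header, no page overhead)
INT_BYTES = 4
DATETIME_BYTES = 8
DECIMAL9_BYTES = 5  # DECIMAL with precision 1-9

row_bytes = 2 * INT_BYTES + DATETIME_BYTES + DECIMAL9_BYTES  # 21
rows = 200_000_000
total_bytes = rows * row_bytes  # 4,200,000,000

print(row_bytes)                      # 21
print(round(total_bytes / 2**30, 2))  # 3.91 (GiB)
```

Note that "less than 4 GB" holds when GB is read as 2^30 bytes; in decimal units the raw data comes to 4.2 GB.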

Do relational databases provide a feasible backend for a process historian?

In the process industry, lots of data is read, often at a high frequency, from several different data sources, such as NIR instruments as well as common instruments for pH, temperature, and pressure measurements. This data is often stored in a process historian, usually for a long time.
Due to this, process historians have different requirements than relational databases. Most queries to a process historian require either time stamps or time ranges to operate on, as well as a set of variables of interest.
Frequent and many INSERT, many SELECT, few or no UPDATE, almost no DELETE.
Q1. Are relational databases a good backend for a process historian?
A very naive implementation of a process historian in SQL could be something like this.
+------------------------------------------------+
| Variable                                       |
+------------------------------------------------+
| Id   : integer primary key                     |
| Name : nvarchar(32)                            |
+------------------------------------------------+
+------------------------------------------------+
| Data                                           |
+------------------------------------------------+
| Id         : integer primary key               |
| Time       : datetime                          |
| VariableId : integer foreign key (Variable.Id) |
| Value      : float                             |
+------------------------------------------------+
This structure is very simple, but probably slow for normal process historian operations, as it lacks "sufficient" indexes.
But suppose, for example, that the Variable table consisted of 1.000 rows (a rather optimistic number), and data for all these 1.000 variables were sampled once per minute (also an optimistic number); then the Data table would grow by 1.440.000 rows per day. Let's continue the example and estimate that each row takes about 16 bytes, which gives roughly 23 megabytes per day, not counting additional space for indexes and other overhead.
23 megabytes as such perhaps isn't that much but keep in mind that numbers of variables and samples in the example were optimistic and that the system will need to be operational 24/7/365.
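The estimate above can be checked with a few lines of Python. The 16 bytes per row is the question's own rough figure; the yearly total is simply the daily figure times 365 and ignores indexes and overhead, as stated.

```python
variables = 1000
samples_per_day = 24 * 60  # one sample per minute
row_bytes = 16             # rough per-row estimate from the text

rows_per_day = variables * samples_per_day   # 1,440,000
bytes_per_day = rows_per_day * row_bytes     # 23,040,000 (about 23 MB)
bytes_per_year = bytes_per_day * 365         # 8,409,600,000 (about 8.4 GB)

print(f"{rows_per_day:,} rows/day")
print(f"{bytes_per_day / 1e6:.2f} MB/day")
print(f"{bytes_per_year / 1e9:.2f} GB/year")
```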
Of course, archiving and compression come to mind.
Q2. Is there a better way to accomplish this? Perhaps using some other table structure?
I work with a SQL Server 2008 database that has similar characteristics; heavy on insertion and selection, light on update/delete. About 100,000 "nodes" all sampling at least once per hour. And there's a twist; all of the incoming data for each "node" needs to be correlated against the history and used for validation, forecasting, etc. Oh, there's another twist; the data needs to be represented in 4 different ways, so there are essentially 4 different copies of this data, none of which can be derived from any of the other data with reasonable accuracy and within reasonable time. 23 megabytes would be a cakewalk; we're talking hundreds-of-gigabytes to terabytes here.
You'll learn a lot about scale in the process, about what techniques work and what don't, but modern SQL databases are definitely up to the task. This system that I just described? It's running on a 5-year-old IBM xSeries with 2 GB of RAM and a RAID 5 array, and it performs admirably, nobody has to wait more than a few seconds for even the most complex queries.
You'll need to optimize, of course. You'll need to denormalize frequently, and maintain pre-computed aggregates (or a data warehouse) if that's part of your reporting requirement. You might need to think outside the box a little: for example, we use a number of custom CLR types for raw data storage and CLR aggregates/functions for some of the more unusual transactional reports. SQL Server and other DB engines might not offer everything you need up-front, but you can work around their limitations.
You'll also want to cache - heavily. Maintain hourly, daily, weekly summaries. Invest in a front-end server with plenty of memory and cache as many reports as you can. This is in addition to whatever data warehousing solution you come up with if applicable.
One of the things you'll probably want to get rid of is that "Id" key in your hypothetical Data table. My guess is that Data is a leaf table - it usually is in these scenarios - and this makes it one of the few situations where I'd recommend a natural key over a surrogate. The same variable probably can't generate duplicate rows for the same timestamp, so all you really need is the variable and timestamp as your primary key. As the table gets larger and larger, having a separate index on variable and timestamp (which of course needs to be covering) is going to waste enormous amounts of space - 20, 50, 100 GB, easily. And of course every INSERT now needs to update two or more indexes.
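A minimal sketch of the natural-key idea, using Python's sqlite3 module as a stand-in for SQL Server. In SQLite, a WITHOUT ROWID table stores the rows in the primary-key index itself, which is roughly analogous to clustering on (variable, timestamp); the column name sample_time, the helper readings and the sample values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE variable (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
-- No surrogate Id: (variable_id, sample_time) is the key and the only index.
CREATE TABLE data (
    variable_id INTEGER NOT NULL REFERENCES variable (id),
    sample_time TEXT    NOT NULL,
    value       REAL    NOT NULL,
    PRIMARY KEY (variable_id, sample_time)
) WITHOUT ROWID;
""")
conn.execute("INSERT INTO variable VALUES (1, 'pH')")
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [
        (1, "2010-01-01T00:00:00", 7.1),
        (1, "2010-01-01T00:01:00", 7.2),
        (1, "2010-01-01T00:02:00", 7.0),
    ],
)

def readings(conn, variable_id, start, end):
    """Readings for one variable in the half-open time range [start, end)."""
    return conn.execute(
        "SELECT sample_time, value FROM data "
        "WHERE variable_id = ? AND sample_time >= ? AND sample_time < ? "
        "ORDER BY sample_time",
        (variable_id, start, end),
    ).fetchall()

window = readings(conn, 1, "2010-01-01T00:01:00", "2010-01-01T00:03:00")

# The natural key also rejects a second reading for the same variable and time.
try:
    conn.execute("INSERT INTO data VALUES (1, '2010-01-01T00:00:00', 9.9)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(window, duplicate_rejected)
```

The typical historian query (one variable, one time range) is then a single range scan on the key, with no second index to maintain on insert.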
I really believe that an RDBMS (or SQL database, if you prefer) is as capable for this task as any other if you exercise sufficient care and planning in your design. If you just start slinging tables together without any regard for performance or scale, then of course you will get into trouble later, and when the database is several hundred GB it will be difficult to dig yourself out of that hole.
But is it feasible? Absolutely. Monitor the performance constantly and over time you will learn what optimizations you need to make.
It sounds like you're talking about telemetry data (time stamps, data points).
We don't use SQL databases for this (although we do use SQL databases to organize it); instead, we use binary streaming files to capture the actual data. There are a number of binary file formats that are suitable for this, including HDF5 and CDF. The file format we use here is a proprietary compressible format. But then, we deal with hundreds of megabytes of telemetry data in one go.
You might find this article interesting (links directly to Microsoft Word document):
http://www.microsoft.com/caseStudies/ServeFileResource.aspx?4000003362
It is a case study from the McLaren group, describing how SQL Server 2008 is used to capture and process telemetry data from Formula One race cars. Note that they don't actually store the telemetry data in the database; instead, it is stored in the file system, and the FILESTREAM capability of SQL Server 2008 is used to access it.
I believe you're on the right path. We have a similar situation where we work. Data comes from various transport / automation systems across various technologies such as manufacturing, auto, etc. Mainly we deal with the big 3: Ford, Chrysler, GM. But we've had a lot of data coming in from customers like CAT.
We ended up extracting data into a database and as long as you properly index your table, keep updates to a minimum and schedule maintenance (rebuild indexes, purge old data, update statistics) then I see no reason for this to be a bad solution; in fact I think it is a good solution.
Certainly a relational database is suitable for mining the data after the fact.
Various nuclear and particle physics experiments I have been involved with have explored several points along this spectrum: from not using an RDBMS at all, through storing just the run summaries (or the run summaries and the slowly varying environmental conditions) in the DB, all the way to cramming every bit collected into the DB (though it was staged to disk first).
When and where the data rate allows more and more groups are moving towards putting as much data as possible into the database.
IBM Informix Dynamic Server (IDS) has a TimeSeries DataBlade and RealTime Loader which might provide relevant functionality.
Your naïve schema records each reading 100% independently, which makes it hard to correlate across readings- both for the same variable at different times and for different variables at (approximately) the same time. That may be necessary, but it makes life harder when dealing with subsequent processing. How much of an issue that is depends on how often you will need to run correlations across all 1000 variables (or even a significant percentage of the 1000 variables, where significant might be as small as 1% and would almost certainly start by 10%).
I would look to combine key variables into groups that can be recorded jointly. For example, if you have a monitor unit that records temperature, pressure and acidity (pH) at one location, and there are perhaps a hundred of these monitors in the plant that is being monitored, I would expect to group the three readings plus the location ID (or monitor ID) and time into a single row:
CREATE TABLE MonitorReading
(
    MonitorID   INTEGER NOT NULL REFERENCES MonitorUnit,
    Time        DATETIME NOT NULL,
    PhReading   FLOAT NOT NULL,
    Pressure    FLOAT NOT NULL,
    Temperature FLOAT NOT NULL,
    PRIMARY KEY (MonitorID, Time)
);
This saves having to do self-joins to see what the three readings were at a particular location at a particular time, and uses about 20 bytes instead of 3 * 16 = 48 bytes per row. If you are adamant that you need a unique ID integer for the record, that increases to 24 or 28 bytes (depending on whether you use a 4-byte or 8-byte integer for the ID column).
Yes, a DBMS is appropriate for this, although not the fastest option. You will need to invest in a reasonable system to handle the load though. I will address the rest of my answer to this problem.
It depends on how beefy a system you're willing to throw at the problem. There are two main limiters for how fast you can insert data into a DB: bulk I/O speed and seek time. A well-designed relational DB will perform at least 2 seeks per insertion: one to begin the transaction (in case the transaction cannot be completed), and one when the transaction is committed. Add to this the additional seeks needed to reach your index entries and update them.
If your data are large, then the limiting factor will be how fast you can write data. For a hard drive, this will be about 60-120 MB/s. For a solid state disk, you can expect upwards of 200 MB/s. You will (of course) want extra disks for a RAID array. The pertinent figure is storage bandwidth AKA sequential I/O speed.
If writing a lot of small transactions, the limitation will be how fast your disk can seek to a spot and write a small piece of data, measured in IO per second (IOPS). We can estimate that it will take 4-8 seeks per transaction (a reasonable case with transactions enabled and an index or two, plus some integrity checks). For a hard drive, the seek time will be several milliseconds, depending on disk RPM. This will limit you to several hundred writes per second. For a solid state disk, the seek time is under 1 ms, so you can write several THOUSAND transactions per second.
When updating indices, you will need to do about O(log n) small seeks to find where to update, so the DB will slow down as the record counts grow. Remember that a DB may not write in the most efficient format possible, so data size may be bigger than you expect.
So, in general, YES, you can do this with a DBMS, although you will want to invest in good storage to ensure it can keep up with your insertion rate. If you wish to cut on cost, you may want to roll data over a specific age (say 1 year) into a secondary, compressed archive format.
EDIT:
A DBMS is probably the easiest system to work with for storing recent data, but you should strongly consider the HDF5/CDF format someone else suggested for storing older, archived data. It is a flexible and widely supported format that provides compression and VERY efficient storage of large time series and multi-dimensional arrays. I believe it also provides some methods of indexing in the data. You should be able to write a little code to fetch from these archive files if data is too old to be in the DB.
There is probably a data structure that would be more optimal for your given case than a relational database.
Having said that, there are many reasons to go with a relational DB including robust code support, backup & replication technology and a large community of experts.
Your use case is similar to high-volume financial applications and telco applications. Both are frequently inserting data and frequently doing queries that are both time-based and include other select factors.
I worked on a mid-sized billing project that handled cable bills for millions of subscribers. That meant an average of around 5 rows per subscriber times a few million subscribers per month in the financial transaction table alone. That was easily handled by a mid-size Oracle server using (now) 4 year old hardware and software. Large billing platforms can have 10x that many records per unit time.
Properly architected and with the right hardware, this case can be handled well by modern relational DB's.
Years ago, a customer of ours tried to load an RDBMS with real-time data collected from monitoring plant machinery. It didn't work in a simplistic way.
Are relational databases a good backend for a process historian?
Yes, but. It needs to store summary data, not details.
You'll need a front end based on in-memory storage and flat files. Periodic summaries and digests can be loaded into an RDBMS for further analysis.
You'll want to look at Data Warehousing techniques for this. Most of what you want to do is to split your data into two essential parts ---
Facts. The data that has units. Actual measurements.
Dimensions. The various attributes of the facts -- date, location, device, etc.
This leads you to a more sophisticated data model.
Fact: Key, Measure 1, Measure 2, ..., Measure n, Date, Geography, Device, Product Line, Customer, etc.
Dimension 1 (Date/Time): Year, Quarter, Month, Week, Day, Hour
Dimension 2 (Geography): location hierarchy of some kind
Dimension 3 (Device): attributes of the device
Dimension *n*: attributes of each dimension of the fact
You may want to look at KDB. It is specifically optimized for this kind of usage: many inserts, few or no updates or deletes.
It isn't as easy to use as traditional RDBMS though.
The other aspect to consider is what kind of selects you're doing. Relational/SQL databases are great for doing complex joins dependent on multiple indexes, etc. They really can't be beaten for that. But if you're not doing that kind of thing, they're probably not such a great match.
If all you're doing is storing per-time records, I'd be tempted to roll your own file format ... even just output the stuff as CSV (groans from the audience, I know, but it's hard to beat for wide acceptance)
It really depends on your indexing/lookup requirements, and your willingness to write tools to do it.
You may want to take a look at a Stream Data Manager System (SDMS).
While not addressing all your needs (long-time persistence), sliding windows over time and rows and frequently changing data are their points of strength.
Some useful links:
Stanford Stream Data Manager
Stream Mill
Material about Continuous Queries
AFAIK major database makers all should have some kind of prototype version of an SDMS in the works, so I think it's a paradigm worth checking out.
I know you're asking about relational database systems, but those are unicorns. SQL DBMSs are probably a bad match for your needs because no current SQL system (that I know of) provides reasonable facilities to deal with temporal data. Depending on your needs, you might or might not have another option in specialized tools and formats; see e.g. rrdtool.