We are creating a database that will store a large number of records. We estimate millions of records (billions after a few years) in one table, and we will always INSERT and rarely UPDATE or DELETE any record. It's a kind of archive system where we insert historic records on a daily basis. We will generate different sorts of reports on these historic records on user request, so we have some concerns and would like your technical input:
What is the best way to manage this kind of table and database?
What impact might we see in the future with a very large table?
Is there any limitation on the number of records in one table or the size of a table?
How are we supposed to bulk INSERT records from different sources (mostly Excel sheets)?
What is the best way to index large data tables?
Which ORM (object-relational mapping) is best for this project?
Your last statement sums it up. There is no ORM that will deal nicely with this volume of data and reporting queries: employ SQL experts to do it for you. You heard it here first.
Otherwise
On disk: filegroups, partitioning etc
Compress less-used data
Is all data required? (Data retention policies)
No limit of row numbers or table size
INSERT via staging tables or staging databases, clean/scrub/lookup keys, then flush to main table: DO NOT load main table directly
As much RAM as you can buy. Then add more.
Few, efficient indexes
Do you have parent tables or a flat data mart? Have FKs but don't use them (e.g. never update/delete in the parent table) so no indexes are needed on them
Use a SAN (easier to add disk space, more volumes etc)
Normalise
Some of these are based on our experiences of around 10 billion rows through one of our systems in 30 months, with peaks of 40k rows+ per second.
See this too for high volume systems: 10 lessons from 35K tps
Summary: do it properly or not at all...
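To make the staging advice above concrete, here is a minimal sketch of "load to staging, scrub, look up keys, then flush". It uses Python's built-in sqlite3 only so that it runs anywhere; it is not SQL Server code, and the table names (`staging`, `archive`, `source_system`) and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_system (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE archive (
        source_id   INTEGER NOT NULL REFERENCES source_system(id),
        recorded_on TEXT    NOT NULL,
        amount      REAL    NOT NULL
    );
    -- Staging is loosely typed so that dirty rows can land without errors.
    CREATE TABLE staging (source_name TEXT, recorded_on TEXT, amount TEXT);
    INSERT INTO source_system (name) VALUES ('branch_a'), ('branch_b');
""")

# Hypothetical rows as they might arrive from a spreadsheet export.
raw_rows = [
    ("branch_a", "2011-01-03", "10.50"),
    (" Branch_B ", "2011-01-03", "7"),
    ("branch_a", "2011-01-04", "not a number"),  # rejected by the numeric check
    ("unknown", "2011-01-04", "3.00"),           # rejected by the key lookup
]

def load_batch(conn, rows):
    """Load rows into staging, scrub them, then flush clean rows to archive."""
    with conn:  # one transaction: the main table sees all of the batch or none
        conn.execute("DELETE FROM staging")
        conn.executemany("INSERT INTO staging VALUES (?, ?, ?)", rows)
        # Scrub: normalise the lookup value.
        conn.execute("UPDATE staging SET source_name = lower(trim(source_name))")
        # Flush: resolve keys with a join and keep only numeric amounts.
        cur = conn.execute("""
            INSERT INTO archive (source_id, recorded_on, amount)
            SELECT s.id, st.recorded_on, CAST(st.amount AS REAL)
            FROM staging AS st
            JOIN source_system AS s ON s.name = st.source_name
            WHERE st.amount GLOB '[0-9]*' AND st.amount NOT GLOB '*[^0-9.]*'
        """)
        return cur.rowcount

loaded = load_batch(conn, raw_rows)
print(f"{loaded} of {len(raw_rows)} rows reached the main table")
```

The point of the design is that bad rows stay behind in staging where they can be inspected, and the main table only ever receives typed, key-resolved data.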
What is the best way to manage this kind of table and database?
If you are planning to store billions of records then you'll need plenty of disk space. I'd recommend a 64-bit OS running SQL 2008 R2 and as much RAM and HD space as is available. Depending on what performance you need, I'd be tempted to look into SSDs.
What impact might we see in the future with a very large table?
If you have the right hardware and a properly indexed, properly normalized table, the only thing you should notice is that the reports will begin to run slower. Inserts may slow down slightly as the index becomes bigger, and you'll just have to keep an eye on it.
Is there any limitation on the number of records in one table or the size of a table?
On the right setup I described above, no. It's only limited by disk space.
How are we supposed to bulk INSERT records from different sources (mostly Excel sheets)?
I've run into problems running huge SQL queries but I've never tried to import from very large flat files.
What is the best way to index large data tables?
Index as few fields as necessary and keep them to numerical fields only.
Which ORM (object-relational mapping) is best for this project?
Sorry can't advise here.
Billions of rows in a "few years" is not an especially large volume. SQL Server should cope perfectly well with it - assuming your design and implementation is appropriate. There is no particular limit on the size of a table. Stick to solid design principles: normalize your tables, choose keys and data types carefully and have a suitable partitioning and indexing strategy.
My first thought is to load data from S3 into a temporary table, apply the necessary transformations and then INSERT INTO the final target table. All the tables would have the same columns and are in Redshift.
However, how big of a performance hit would there be from using multiple UPDATEs? Would it be better to split the UPDATEs and filtering between multiple temporary tables for daily batch processing?
Instead of S3 -> TEMP -> FINAL, the flow would look like S3 -> TEMP1 -> ... -> TEMPN -> FINAL, where "->" would be "INSERT INTO".
Also, is it better to create temporary tables (CREATE TEMP TABLE) on the spot and drop them every day, or to use persistent temporary tables that are truncated every day? I think persistent temp tables would be the better choice, as they allow me to check how the data looked as it was loaded and transformed that day.
As you are seeing, there are lots of ways to run an update process, and which is better will depend on factors that are not presented here. First off, let's clarify what a TEMP table is and differentiate it from a staging table. A temp table only lives as long as the current session (connection) is active. If the connection drops then so does the TEMP table. A staging table is a permanent table used for staging data, which more closely matches what you are describing in parts of your question. I'll use these two terms to be clear about which is meant (TEMP or staging).
Your question revolves around how big of a performance hit it would be to have a series of tables in the ETL (ELT?) process to improve, I expect, diagnosability / debug-ability. This is a fine goal but there are some downsides as with all tradeoffs in the real world.
If this is correct these tables will need to be staging tables as TEMP tables will disappear when the ETL session ends.
Saving a bunch of staging tables when one could be used has some downsides, but how big these are depends on your situation. If your cluster is fairly idle and the ETL data payload isn't huge then the impact to the ETL process of the extra tables will be real but not large (a couple of seconds or less). These impacts are mostly around setting up (or truncating) the staging or TEMP tables. But if your cluster is running other workloads when ETL runs then the impact can be much larger.
You see there are many "resources" in a Redshift cluster that all need to be shared by everything running on the database. Some like memory allocation can be (somewhat) controlled through the WLM. Others cannot. The two biggies are network bandwidth and disk bandwidth. There is a fixed capacity to these bandwidths in Redshift and even though they are high, they are finite. There are other limits to Redshift's ability to execute a total workload but these in my experience are the big two.
Every time you create a table, TEMP or permanent, the data is stored to disk. This means a write to disk as well as distributing the data per the distribution settings of the table. Then when the table is accessed the data needs to be read from disk. All this unneeded data movement will have some impact; how large will depend on how big the data is and what else is going on at the time. So you see the impact will range from moderately small to very large depending on a lot of factors, not the least of which is how many tables you are creating. The cost of doing this will need to be offset by the benefit of having these extra tables, which is a business decision.
A common pattern is to load (COPY) data to a temp or staging table and then extract the DELETE patterns to one staging table and the INSERT data to another. Once the deletes and inserts are applied to the target, these tables are saved with a date stamp in the name and possibly unloaded to S3. After a while these sets of data are deleted; 1 month is common. This way you can figure out 'what happened' if things go sideways. This plus good database backups can be used to recover from code bugs.
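A rough sketch of that pattern, for illustration only. It runs against Python's built-in sqlite3 rather than Redshift, so there is no COPY or UNLOAD here, and the table names (`final`, `stage_load`, `stage_delete_<date>`, `stage_insert_<date>`) are hypothetical.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE final (id INTEGER PRIMARY KEY, val TEXT);
    INSERT INTO final VALUES (1, 'old'), (2, 'keep');
    CREATE TABLE stage_load (id INTEGER, val TEXT);
""")

def run_batch(conn, rows, run_date):
    """Apply one daily batch and keep date-stamped copies of what was changed."""
    stamp = run_date.strftime("%Y%m%d")  # digits only, safe to embed in a name
    del_tbl = f"stage_delete_{stamp}"
    ins_tbl = f"stage_insert_{stamp}"
    with conn:  # deletes and inserts are applied in a single transaction
        conn.execute("DELETE FROM stage_load")
        conn.executemany("INSERT INTO stage_load VALUES (?, ?)", rows)
        # Rows in the target that this batch replaces, saved before deletion.
        conn.execute(
            f"CREATE TABLE {del_tbl} AS "
            "SELECT f.id, f.val FROM final AS f JOIN stage_load AS s ON s.id = f.id"
        )
        # Rows that this batch adds to the target.
        conn.execute(f"CREATE TABLE {ins_tbl} AS SELECT id, val FROM stage_load")
        conn.execute(f"DELETE FROM final WHERE id IN (SELECT id FROM {del_tbl})")
        conn.execute(f"INSERT INTO final SELECT id, val FROM {ins_tbl}")
    return del_tbl, ins_tbl

del_tbl, ins_tbl = run_batch(conn, [(1, "new"), (3, "added")], date(2024, 1, 15))
print(conn.execute("SELECT * FROM final ORDER BY id").fetchall())
```

Only two extra tables are kept per run, and together they are enough to reconstruct what the batch did to the target. A scheduled job that drops the date-stamped tables older than the retention window is left out of the sketch.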
Your secondary question is about whether it is better to drop and recreate or truncate. There have been a number of performance improvements to both of these statements. Take this with a grain of salt, but I'll offer my slightly dated experience comparing them. Both are fast, but I saw drop and recreate as slightly faster (fewer dependencies to manage). That said, the main difference is in how they interoperate with other aspects of the database. DROP will fail if there are dependent views (unless cascaded) and table permissions will be lost. DROP cannot be run in a transaction block, and since it needs an exclusive lock on the table it can be held off by another session reading the table. TRUNCATE can run in a transaction block but will force a commit, so transaction changes will become visible to all. It is usually these differences that decide TRUNCATE vs. DROP, and there are other options such as DELETE and ALTER TABLE APPEND that have their own sets of pluses and minuses.
So I'd generally advise against creating more tables than are actually needed in the ETL process when all needs are weighed (including performance and business needs). You may have excess capacity now, but Redshift clusters usually get busier over time. The guiding principle here is: don't move large amounts of data more times than is necessary.
I have a large table with 10 million records that is used by one of our existing applications. We are working on a new application that requires only a filtered result set of about 7000 records from that large table.
My question is: will there be any performance gain from using a smaller table with 7000 records vs. querying the large table with a filter condition (it will be joined to a few other tables in the schema which are completely independent of the existing application)? Or should I avoid the redundancy and keep all the data in one table? This is a data warehouse design. Please suggest!
Thank you!
For almost any database, using a sample table will be noticeably faster. This is because reading the records will require loading fewer data pages.
In addition, if the base table is being updated, then a "snapshot" is isolated from page, table, and row locks that occur on the main table. This is good from a performance perspective, but it means that the versions can get out-of-synch, which may be bad.
And, from a querying perspective, the statistics on the sample would be more accurate. This helps the optimizer choose the best query plans.
I can think of two cases where performance might not improve significantly. The first is if your database supports clustered indexes and the rows that you want are defined by a range of index keys (or a single key). These will be "adjacent", so the clustered index would scan about the same number of pages. There is a slight overhead for the actual index structure.
Similarly, if your records were so large that there was one record per data page, then the advantage of a second table would be less. It would eliminate the index access overhead, but not reduce the number of reads.
None of these considerations say whether or not you should use a separate table. You should test in your environment. The overhead of managing a separate table (and there is a cost to creating and deleting it both in terms of performance and application complexity) may outweigh small performance gains.
I have to implement data collection and replay of electrical parameters for hundreds to thousands of devices, with at least 20 parameters to monitor per device. This amounts to a huge volume of data, very similar to a time series. I have to support a resolution of 1 second. For 1000 devices over 1 year that is 365*24*60*60*1000 = 31,536,000,000 rows.
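For anyone who wants to vary the numbers, the estimate can be written as a small calculation. This is just the arithmetic from the post; the `rows_per_reading` parameter is an added assumption to show how the count changes if each of the 20 parameters is stored as its own row instead of as a column.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000

def rows_per_year(devices, readings_per_second=1, rows_per_reading=1):
    """Rows written in one year at a fixed sampling rate."""
    return SECONDS_PER_YEAR * devices * readings_per_second * rows_per_reading

# One row per device per second, with the 20 parameters as columns.
wide = rows_per_year(1000)
# One row per device per parameter per second (assumed "narrow" layout).
narrow = rows_per_year(1000, rows_per_reading=20)

print(f"wide layout:   {wide:,} rows/year")
print(f"narrow layout: {narrow:,} rows/year")
```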
I did my research but still have few questions
As the data will be huge, is it good to keep it in the same table, or should the tables be split (the data structure is the same)? Or should I rely on indexes?
Data inserts will also be very frequent, but I can batch them. Still, what is the best way? Writing directly to the same database, or using a temporary database for writes and syncing with it?
Does SQL Server have a specific schema recommendation for optimizing time series for selects, updates and inserts? Is there any out-of-the-box help for a daily average or other common aggregate functions? I can write my own, but as this is a standard problem there might be some best practices and samples out of the box.
please let me know any help is appreciated, thanks in advance
1) You probably want to explore the use of partitions. This allows very efficient and very fast inserts (it's a metadata operation if you do the partitioning correctly). 2) You may want to explore columnstore indexes, because the data (once collected) will never change and you will have very large data sets. Partitioning and columnstore require a learning curve, but it's very doable. There is lots of code on the internet describing the use of date functions in SQL Server.
That is a big number, but I would start with one table and see if it holds up. If you split it into multiple tables it is still the same amount of data.
Do you ever need to search across devices? If not you can have a separate table for each device.
I have some audit tables that are not that big but still big and have not had any problems. If the data is loaded in time order then make date the first (or only) column of the clustered index.
If the PK is (date, device) then fine, but if you can get two readings for a device in the same second you cannot do that. If this is the PK, then load the data in that sort order if you can, even if you have to stage each second and then load. You just cannot afford to fragment a table that big. If you cannot load in sort order, then use a fill factor of 50%.
If you cannot have a PK then just use date as clustered index but not as PK and put a non clustered index on device.
I have some tables of 3,000,000,000 and I have the luxury of loading by PK with no other indexes. There is no measurable degradation in insert from row 1 to row 3,000,000,000.
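A small sketch of "stage, then load in key order", assuming a key of (timestamp, device). It uses Python's built-in sqlite3, where a WITHOUT ROWID table is stored in primary key order, as a stand-in for a SQL Server clustered index; the table and column names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE reading (
        ts        INTEGER NOT NULL,   -- epoch seconds
        device_id INTEGER NOT NULL,
        value     REAL    NOT NULL,
        PRIMARY KEY (ts, device_id)
    ) WITHOUT ROWID
""")

def flush_batch(conn, batch):
    """Sort a staged batch by the key, then insert it in one transaction."""
    ordered = sorted(batch, key=lambda row: (row[0], row[1]))
    with conn:
        conn.executemany("INSERT INTO reading VALUES (?, ?, ?)", ordered)
    return ordered

# Readings arrive out of order from the collectors (hypothetical values).
staged = [(2, 7, 1.0), (1, 9, 2.0), (1, 3, 3.0)]
ordered = flush_batch(conn, staged)
print(ordered)
```

Because every batch is appended at the tail of the key range, new rows land at the end of the index instead of splitting pages in the middle of it.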
I want to set up a schema for storing a large amount of sequential data, up to billions of rows. Once the data is inserted I'm only going to be reading it in future queries. I have two options for setting up my schema and am wondering which of the two below is better, or if there is another option I'm not thinking of.
Option 1.
Create a massive table to hold billions of rows of data. I like this because it keeps the schema static and life simple but not sure of any performance trade offs.
Option 2.
In this instance I'm storing historical market data for stocks and the second option would be creating a table for each stock to spread the large volume of data across multiple tables in the db system. This feels like it would be more permanent but the down-side is having a mess if I want to add new columns to my data set in the future.
Looking for some warehouse lovers that can help me write this right the first time! - Duncan
In SQL Server, you would want to create Partitioned Views with the data broken into smaller tables by date. This will end up being much better for performance. This will also be helpful with setting up tables on different Filegroups, which you will find is helpful for a backup strategy for that much data.
One table, but partitioned for easier maintenance. As far as performance is concerned, create the appropriate indexes and don't rely on partitioning to improve performance (it can be achieved in some cases, but it is not as simple as it seems).
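To show the shape of the "smaller tables by date behind a view" idea, here is a minimal sketch using Python's built-in sqlite3. It is not SQL Server: there, a partitioned view relies on CHECK constraints on the member tables to skip tables a query cannot match, which as far as I know SQLite does not do. The table names (`quote_2010`, `quote_2011`), the `quote` view and the `insert_quote` helper are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
YEARS = (2010, 2011)

# One member table per year, each constrained to its own date range.
for y in YEARS:
    conn.execute(f"""
        CREATE TABLE quote_{y} (
            symbol      TEXT NOT NULL,
            traded_on   TEXT NOT NULL
                CHECK (traded_on >= '{y}-01-01' AND traded_on < '{y + 1}-01-01'),
            close_price REAL NOT NULL,
            PRIMARY KEY (symbol, traded_on)
        )
    """)

# Readers query one name; the view stitches the member tables together.
conn.execute(
    "CREATE VIEW quote AS "
    + " UNION ALL ".join(
        f"SELECT symbol, traded_on, close_price FROM quote_{y}" for y in YEARS
    )
)

def insert_quote(conn, symbol, traded_on, close_price):
    """Route a row to the member table for its year (ISO dates assumed)."""
    year = int(traded_on[:4])
    if year not in YEARS:
        raise ValueError(f"no member table for year {year}")
    with conn:
        conn.execute(
            f"INSERT INTO quote_{year} VALUES (?, ?, ?)",
            (symbol, traded_on, close_price),
        )

insert_quote(conn, "ABC", "2010-03-01", 10.0)
insert_quote(conn, "ABC", "2011-03-01", 12.0)
insert_quote(conn, "XYZ", "2011-03-01", 5.0)
print(conn.execute("SELECT COUNT(*) FROM quote").fetchone()[0])
```

The columns are defined once in the loop, so adding a column later means changing every member table and the view, which is the maintenance cost the single partitioned table avoids.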
I'm designing my DB for functionality and performance for realtime AJAX web applications, and I don't currently have the resources to add DB server redundancy or load-balancing.
Unfortunately, I have a table in my DB that could potentially end up storing hundreds of millions of rows, and will need to read and write quickly to prevent lagging the web-interface.
Most, if not all, of the columns in this table are individually indexed, and I'd love to know if there are other ways to ease the burden on the server when running queries on large tables. But is there eventually a cap on the size (in rows or GB) of a table before a single unclustered SQL Server starts to choke?
My DB only has a dozen tables, with maybe a couple dozen foreign key relationships. None of my tables have more than 8 or so columns, and only one or two of these tables will end up storing a large number of rows. Hopefully the simplicity of my DB will make up for the massive amounts of data in these couple of tables ...
Rows are limited strictly by the amount of disk space you have available. We have SQL Servers with hundreds of millions of rows of data in them. Of course, those servers are rather large.
In order to keep the web interface snappy you will need to think about how you access that data.
One example is to stay away from any type of aggregate queries which require processing large swaths of data. Things like SUM() can be a killer depending on how much data it's trying to process. In these situations you are much better off calculating any summary or grouped data ahead of time and letting your site query these analytic tables.
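As a rough illustration of computing summaries ahead of time, the sketch below keeps a small `daily_sales` table that the site would read instead of running SUM() over the detail rows. It uses Python's built-in sqlite3, and the `sale` and `daily_sales` tables are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sale (sold_on TEXT NOT NULL, amount REAL NOT NULL);
    -- Small analytic table that the web tier queries instead of `sale`.
    CREATE TABLE daily_sales (
        sold_on TEXT PRIMARY KEY,
        total   REAL    NOT NULL,
        n       INTEGER NOT NULL
    );
    INSERT INTO sale VALUES
        ('2024-01-01', 10.0), ('2024-01-01', 2.5), ('2024-01-02', 4.0);
""")

def refresh_daily_sales(conn, day):
    """Recompute the summary row for one day; safe to run repeatedly."""
    with conn:
        conn.execute("DELETE FROM daily_sales WHERE sold_on = ?", (day,))
        conn.execute(
            """
            INSERT INTO daily_sales (sold_on, total, n)
            SELECT sold_on, SUM(amount), COUNT(*)
            FROM sale
            WHERE sold_on = ?
            GROUP BY sold_on
            """,
            (day,),
        )

refresh_daily_sales(conn, "2024-01-01")
refresh_daily_sales(conn, "2024-01-01")  # a second run does not duplicate the row
summary = conn.execute("SELECT * FROM daily_sales").fetchall()
print(summary)
```

The expensive aggregate runs once per day in a background job, and each page view touches a single row.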
Next you'll need to partition the data. Split those partitions across different drive arrays. When SQL needs to go to disk it makes it easier to parallelize the reads. (#Simon touched on this).
Basically, the problem boils down to how much data you need to access at any one time. This is the main problem regardless of the amount of data you have on disk. Even small databases can be choked if the drives are slow and the amount of available RAM in the DB server isn't enough to keep enough of the DB in memory.
Usually for systems like this large amounts of data are basically inert, meaning that it's rarely accessed. For example, a PO system might maintain a history of all invoices ever created, but they really only deal with any active ones.
If your system has similar requirements, then you might have a table that is for active records and simply archive them to another table as part of a nightly process. You could even have statistics like monthly averages (as an example) recomputed as part of that archival.
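A minimal sketch of such a nightly archive step, using Python's built-in sqlite3. The table names (`invoice_active`, `invoice_archive`) and the rule "archive anything with status 'closed'" are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoice_active (
        id INTEGER PRIMARY KEY, status TEXT NOT NULL, amount REAL NOT NULL
    );
    CREATE TABLE invoice_archive (
        id INTEGER PRIMARY KEY, status TEXT NOT NULL, amount REAL NOT NULL
    );
    INSERT INTO invoice_active VALUES
        (1, 'open', 5.0), (2, 'closed', 7.0), (3, 'closed', 1.0);
""")

def archive_closed(conn):
    """Move closed invoices to the archive table in one transaction."""
    with conn:  # copy and delete commit together, or not at all
        conn.execute(
            "INSERT INTO invoice_archive "
            "SELECT id, status, amount FROM invoice_active WHERE status = 'closed'"
        )
        cur = conn.execute("DELETE FROM invoice_active WHERE status = 'closed'")
        return cur.rowcount

moved = archive_closed(conn)
print(f"archived {moved} invoices")
```

The active table stays small enough to sit in memory, while the archive can live on slower storage and be queried only when history is needed.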
Just some thoughts.
The only limit is the size of your primary key. Is it an INT or a BIGINT?
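To put numbers on the key-size limit: SQL Server's INT tops out at 2,147,483,647 and BIGINT at 9,223,372,036,854,775,807. The sketch below assumes a sequential key that starts at 1 and never reuses values, and the insert rate of 100 rows per second is only an example figure.

```python
INT_MAX = 2**31 - 1     # 2,147,483,647
BIGINT_MAX = 2**63 - 1  # 9,223,372,036,854,775,807

SECONDS_PER_DAY = 86_400

def days_until_exhausted(max_key, rows_per_second):
    """Days until a sequential key starting at 1 runs out at a steady rate."""
    return max_key / rows_per_second / SECONDS_PER_DAY

int_days = days_until_exhausted(INT_MAX, 100)
bigint_years = days_until_exhausted(BIGINT_MAX, 100) / 365

print(f"INT:    about {int_days:.0f} days at 100 rows/s")
print(f"BIGINT: about {bigint_years:.2e} years at 100 rows/s")
```

At that rate an INT key lasts well under a year, while a BIGINT key is effectively inexhaustible.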
SQL will happily store the data without a problem. However, with hundreds of millions of rows, you're best off partitioning the data. There are many good articles on this such as this article.
With partitions, you can have 1 thread per partition working at the same time to parallelise the query even more than is possible without partitioning.
My gut tells me that you will probably be okay, but you'll have to deal with performance. It's going to depend on the acceptable time-to-retrieve results from queries.
For your table with the "hundreds of millions of rows", what percentage of the data is accessed regularly? Is some of the data rarely accessed? Do some users access one subset of the data while other users access a different one? You may benefit from data partitioning.