Process more than 500 million rows to cube without performance issues

Process more than 500 million rows to cube without performance issues - sql

i have a huge database, for example :
My customer loads everyday 500 million records of data for sales in a buffer fact table called "Sales". I have to process this sales to my cube in append/update mode but this is destroying the performance even with 186 GB of RAM.
I've already tried to create indexes on the dimension tables, this help a little but not too much.
My customer said that they expect a 15% sales data increment every 6 months...
There is a smart way in order to load this data without to wait too many ours?
I'm using SQL-Server 2016.
Thanks!

You can adapt column store index feature of sql server 2016.
Columnstore indexes are the standard for storing and querying large data warehousing fact tables. This index uses column-based data storage and query processing to achieve gains up to 10 times the query performance in your data warehouse over traditional row-oriented storage. You can also achieve gains up to 10 times the data compression over the uncompressed data size. Beginning with SQL Server 2016 (13.x), columnstore indexes enable operational analytics: the ability to run performant real-time analytics on a transactional workload.
You can have get more idea about this from microsoft link

If you're using a SAN to store your database. You might want to look into some software like Condusiv V-locity to eliminate a lot of the I/O being sent to and received from the database engine.
I might suggest to create a separate database engine, to ship the transaction log over to a separate server and apply transaction logs to the DB every 15 minutes for you to create analytics without using live data. Also the heavy writes to the production DB will not affect your ability to create complex query that locks the table or rows from time to time on your reporting server.

Related

How to store millions of statistics records efficiently?

We have about 1.7 million products in our eshop, we want to keep record of how many views this products had for 1 year long period, we want to record the views every atleast 2 hours, the question is what structure to use for this task?
Right now we tried keeping stats for 30 days back in records that have 2 columns classified_id,stats where stats is like a stripped json with format date:views,date:views... for example a record would look like
345422,{051216:23212,051217:64233} where 051216,051217=mm/dd/yy and 23212,64233=number of views
This of course is kinda stupid if you want to go 1 year back since if you want to get the sum of views of say 1000 products you need to fetch like 30mb from the database and calculate it your self.
The other way we think of going right now is just to have a massive table with 3 columns classified_id,date,view and store its recording on its own row, this of course will result in a huge table with hundred of millions of rows , for example if we have 1.8 millions of classifieds and keep records 24/7 for one year every 2 hours we need
1800000*365*12=7.884.000.000(billions with a B) rows which while it is way inside the theoritical limit of postgres I imagine the queries on it(say for updating the views), even with the correct indices, will be taking some time.
Any suggestions? I can't even imagine how google analytics stores the stats...

This number is not as high as you think. In current work we store metrics data for websites and total amount of rows we have is much higher. And in previous job I worked with pg database which collected metrics from mobile network and it collected ~2 billions of records per day. So do not be afraid of billions in number of records.
You will definitely need to partition data - most probably by day. With this amount of data you can find indexes quite useless. Depends on planes you will see in EXPLAIN command output. For example that telco app did not use any indexes at all because they would just slow down whole engine.
Another question is how quick responses for queries you will need. And which steps in granularity (sums over hours/days/weeks etc) for queries you will allow for users. You may even need to make some aggregations for granularities like week or month or quarter.
Addition:
Those ~2billions of records per day in that telco app took ~290GB per day. And it meant inserts of ~23000 records per second using bulk inserts with COPY command. Every bulk was several thousands of records. Raw data were partitioned by minutes. To avoid disk waits db had 4 tablespaces on 4 different disks/ arrays and partitions were distributed over them. PostreSQL was able to handle it all without any problems. So you should think about proper HW configuration too.
Good idea also is to move pg_xlog directory to separate disk or array. No just different filesystem. It all must be separate HW. SSDs I can recommend only in arrays with proper error check. Lately we had problems with corrupted database on single SSD.

First, do not use the database for recording statistics. Or, at the very least, use a different database. The write overhead of the logs will degrade the responsiveness of your webapp. And your daily backups will take much longer because of big tables that do not need to be backed up so frequently.
The "do it yourself" solution of my choice would be to write asynchronously to log files and then process these files afterwards to construct the statistics in your analytics database. There is good code snippet of async write in this response. Or you can benchmark any of the many loggers available for Java.
Also note that there are products like Apache Kafka specifically designed to collect this kind of information.
Another possibility is to create a time series in column oriented database like HBase or Cassandra. In this case you'd have one row per product and as many columns as hits.
Last, if you are going to do it with the database, as #JosMac pointed, create partitions, avoid indexes as much as you can. Set fillfactor storage parameter to 100. You can also consider UNLOGGED tables. But read thoroughly PostgreSQL documentation before turning off the write-ahead log.

Just to raise another non-RDBMS option for you (so a little off topic), you could send text files (CSV, TSV, JSON, Parquet, ORC) to Amazon S3 and use AWS Athena to query it directly using SQL.
Since it will query free text files, you may be able to just send it unfiltered weblogs, and query them through JDBC.

Bigquery partitioning table performance

I've got a question about BQ performance in various scenarios, especially revolving around parallelization "under the hood".
I am saving 100M records on a daily basis. At the moment, I am rotating tables every 5 days to avoid high charges due to full table scans.
If I were to run a query with a date range of "last 30 days" (for example), I would be scanning between 6 (if I am at the last day of the partition) and 7 tables.
I could, as an alternative, partition my data into a new table daily. In this case, I will optimize my expenses - as I'm never querying more data than I have too. The question is, will be suffering a performance penalty in terms of getting the results back to the client, because I am now querying potentially 30 or 90 or 365 tables in parallel (Union).
To summarize:
More tables = less data scanned
Less tables =(?) longer response time to the client
Can anyone shed some light on how to find the balance between cost and performance?

A lot depends how you write your queries and how much development costs, but that amount of data doesn't seam like a barrier, and thus you are trying to optimize too early.
When you JOIN tables larger than 8MB, you need to use the EACH modifier, and that query is internally paralleled.
This partitioning means that you can get higher effective read bandwidth because you can read from many of these disks in parallel. Dremel takes advantage of this; when you run a query, it can read your data from thousands of disks at once.
Internally, BigQuery stores tables in
shards; these are discrete chunks of data that can be processed in parallel. If
you have a 100 GB table, it might be stored in 5000 shards, which allows it to be
processed by up to 5000 workers in parallel. You shouldn’t make any assumptions
about the size of number of shards in a table. BigQuery will repartition
data periodically to optimize the storage and query behavior.
Go ahead and create tables for every day, one recommendation is that write your create/patch script that creates tables for far in the future when it runs eg: I create the next 12 months of tables for every day now. This is better than having a script that creates tables each day. And make it part of your deploy/provisioning script.
To read more check out Chapter 11 ■ Managing Data Stored in BigQuery from the book.

Improving query performance of of database table with large number of columns and rows(50 columns, 5mm rows)

We are building an caching solution for our user data. The data is currently stored i sybase and is distributed across 5 - 6 tables but query service built on top of it using hibernate and we are getting a very poor performance. In order to load the data into the cache it would take in the range of 10 - 15 hours.
So we have decided to create a denormalized table of 50 - 60 columns and 5mm rows into another relational database (UDB), populate that table first and then populate the cache from the new denormalized table using JDBC so the time to build us cache is lower. This gives us a lot better performance and now we can build the cache in around an hour but this also does not meet our requirement of building the cache whithin 5 mins. The denormlized table is queried using the following query
select * from users where user id in (...)
Here user id is the primary key. We also tried a query
select * from user where user_location in (...)
and created a non unique index on location also but that also did not help.
So is there a way we can make the queries faster. If not then we are also open to consider some NOSQL solutions.
Which NOSQL solution would be suited for our needs. Apart from the large table we would be making around 1mm updates on the table on a daily basis.
I have read about mongo db and seems that it might work but no one has posted any experience with mongo db with so many rows and so many daily updates.
Please let us know your thoughts.

The short answer here, relating to MongoDB, is yes - it can be used in this way to create a denormalized cache in front of an RDBMS. Others have used MongoDB to store datasets of similar (and larger) sizes to the one you described, and can keep a dataset of that size in RAM. There are some details missing here in terms of your data, but it is certainly not beyond the capabilities of MongoDB and is one of the more frequently used implementations:
http://www.mongodb.org/display/DOCS/The+Database+and+Caching
The key will be the size of your working data set and therefore your available RAM (MongoDB maps data into memory). For larger solutions, write heavy scaling, and similar issues, there are numerous approaches (sharding, replica sets) that can be employed.
With the level of detail given it is hard to say for certain that MongoDB will meet all of your requirements, but given that others have already done similar implementations and based on the information given there is no reason it will not work either.

Large Volume Database

We are creating a database where we store large number of records. We estimate millions (billions after few years) of record in one table and we always INSERT and rarely UPDATE or DELETE any of the record. Its a kind of archive system where we insert historic record on daily basis. We will generate different sort of reports on this historic record on user request so we've some concerns and require technical input from you people:
What is the best way to manage this kind of table and database?
What impact we may see in future for very large table?
Is there any limitation on number of records in one table or size of table?
How we suppose to INSERT bulk record from different sources (mostly from Excel sheet)?
What is the best way to index large data tables?
Which is the best ORM (object relational Mapping) should we use in this project?

You last statement sums it up. There is no ORM that will deal nicely with this volume of data and reporting queries: employ SQL experts to do it for you. You heard it here first.
Otherwise
On disk: filegroups, partitioning etc
Compress less-used data
Is all data required? (Data retention policies)
No limit of row numbers or table size
INSERT via staging tables or staging databases, clean/scrub/lookup keys, then flush to main table: DO NOT load main table directly
As much RAM as you can buy. Then add more.
Few, efficient indexes
Do you have parent tables or flat data mart? Have FKs but don't use them (eg bene update/delete in parent table) so no indexes needed
Use a SAN (easier to add disk space, more volumes etc)
Normalise
Some of these are based on our experiences of around 10 billion rows through one of our systems in 30 months, with peaks of 40k rows+ per second.
See this too for high volume systems: 10 lessons from 35K tps
Summary: do it properly or not at all...

What is the best way to manage this kind of table and database?
If you are planning to store billions of records then you'll be needing plenty of diskspace, I'd recommend a 64bit OS running SQL 2008 R2 and as much RAM and HD space as is available. Depending on what performance you need I'd be tempted to look into SSDs.
What impact we may see in future for very large table?
If you have the right hardware, with a properly indexed table and properly normalized the only thing you should notice are the reports will begin to run slower. Inserts may slow down slightly as the Index file becomes bigger and you'll just have to keep an eye on it.
Is there any limitation on number of records in one table or size of table?
On the right setup I described above, no. It's only limited by disk space.
How we suppose to INSERT bulk record from different sources (mostly from Excel sheet)?
I've run into problems running huge SQL queries but I've never tried to import from very large flat files.
What is the best way to index large data tables?
Index as few fields as necessary and keep them to numerical fields only.
Which is the best ORM (object relational Mapping) should we use in this project?
Sorry can't advise here.

Billions of rows in a "few years" is not an especially large volume. SQL Server should cope perfectly well with it - assuming your design and implementation is appropriate. There is no particular limit on the size of a table. Stick to solid design principles: normalize your tables, choose keys and data types carefully and have a suitable partitioning and indexing strategy.

What is the maximum recommended number of rows that a SQL 2008 R2 standalone server should store in a single table?

I'm designing my DB for functionality and performance for realtime AJAX web applications, and I don't currently have the resources to add DB server redundancy or load-balancing.
Unfortunately, I have a table in my DB that could potentially end up storing hundreds of millions of rows, and will need to read and write quickly to prevent lagging the web-interface.
Most, if not all, of the columns in this table are individually indexed, and I'd love to know if there are other ways to ease the burden on the server when running querys on large tables. But is there eventually a cap for the size (in rows or GB) of a table before a single unclustered SQL server starts to choke?
My DB only has a dozen tables, with maybe a couple dozen foriegn key relationships. None of my tables have more than 8 or so columns, and only one or two of these tables will end up storing a large number of rows. Hopefully the simplicity of my DB will make up for the massive amounts of data in these couple tables ...

Rows are limited strictly by the amount of disk space you have available. We have SQL Servers with hundreds of millions of rows of data in them. Of course, those servers are rather large.
In order to keep the web interface snappy you will need to think about how you access that data.
One example is to stay away from any type of aggregate queries which require processing large swaths of data. Things like SUM() can be a killer depending on how much data it's trying to process. In these situations you are much better off calculating any summary or grouped data ahead of time and letting your site query these analytic tables.
Next you'll need to partition the data. Split those partitions across different drive arrays. When SQL needs to go to disk it makes it easier to parallelize the reads. (#Simon touched on this).
Basically, the problem boils down to how much data you need to access at any one time. This is the main problem regardless of the amount of data you have on disk. Even small databases can be choked if the drives are slow and the amount of available RAM in the DB server isn't enough to keep enough of the DB in memory.
Usually for systems like this large amounts of data are basically inert, meaning that it's rarely accessed. For example, a PO system might maintain a history of all invoices ever created, but they really only deal with any active ones.
If your system has similar requirements, then you might have a table that is for active records and simply archive them to another table as part of a nightly process. You could even have statistics like monthly averages (as an example) recomputed as part of that archival.
Just some thoughts.

The only limit is the size of your primary key. Is it an INT or a BIGINT?
SQL will happily store the data without a problem. However, with 100 millions of rows, your best off partitioning the data. There are many good articles on this such as this article.
With partitions, you can have 1 thread per partition working at the same time to parallelise the query even more than is possible without paritioning.

My gut tells me that you will probably be okay, but you'll have to deal with performance. It's going to depend on the acceptable time-to-retrieve results from queries.
For your table with the "hundreds of millions of rows", what percentage of the data is accessed regularly? Is some of the data, rarely accessed? Do some users access selected data and other users select different data? You may benefit from data partitioning.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas