Oracle: Fastest way of doing stats on millions of rows? - sql

I have developed an application which allows users to enter measurements - these are stored in an Oracle database. Each measurement "session" could contain around 100 measurements. There could be around 100 measurement sessions in a "batch", so that's 10,000 measurements per batch. There could easily be around 1000 batches at some point, bringing the total number of measurements into the millions.
The problem is that calculations and statistics need to be performed on the measurements. It ranges from things like average measurements per batch to statistics across the last 6 months of measurements.
My question is: is there any way that I can make the process of calculating these statistics faster? Either through the types of queries I'm running or the structure of the database?
Thanks!

I assume that most calculations will be performed on either a single session or a single batch. If this is the case, then it's important that sessions and batches aren't distributed all over the disk.
In order to achieve the desired data clustering, you probably want to create an index-organized table (IOT) organized by batch and session. That way, the measurements belonging to the same session or same batch are close togehter on the disk and the queries for a session or batch will be limited to a small number of disk pages.

Unfortunately, as the number of calculations that needed to be carried out are not limited to just a few, I could not calculate them at the end of every measurement session.
In the end the queries did not take so long - around 3 minutes for calculating all the stats. For end users this was still an unacceptably long time to wait, BUT the good thing was that the stats did not necessarily have to be completely up to date.
Therefore I used a materialized view to take a 'snapshot' of the stats, and set it to update itself every morning at 2am. Then, when the user requested the stats from the materialized view it was instant!

Related

What would be the most efficient way to compute aggregated stats real time from a table with a million records which very frequently gets updated?

We provide e-learning video content to our user's and we have a table where we insert their progress with values such as video number, seconds watched. So based on the number of seconds watched we aggregate the data to get the number of minutes watched by a user in the last X days.
The problem is that we want to show real-time stats on the analytics dashboard but running a query that brings an aggregated result by each user puts a good amount of load on our database, which is already a 4 CPU / 8 GB managed Postgres instance.
I thought of using triggers on every insert, to maintain the aggregated data in a separate table but the insert transactions happen at the start, any pause, and at the end. But we are doubtful how efficient is this approach as these can trigger N number of times which will then has to compute and insert the data in another table. Basically a lot of transactions.
I wanted to know how can we efficiently model our database so that we can compute these things in real-time or close to real-time.

Deleting millions of rows without impacting transaction log

Our client database is growing at increasing pace, more specifically entities like Auditing & Logging which are growing much greater speed.
For instance, as of now the Auditing table has ~30 million rows and its is growing with the rate of 1.5 million rows per week.
Similarly, the Logging table is growing at the rate of ~1 million rows per week. This table has ~50 million rows.
We have decided to archive tables based on our data retention policy & delete some 'N' number of records from these tables when ever archiving jobs runs.
I am looking for best advice for defining the chuckSize which will not impact transaction logs of sql server db or table locking. I know this value cannot be straight way derived, we need to run different test scenarios to come with this magic number.
The best advice is to partition by data, presumably by date.
Then you can remove entire partitions without having to log the results.
The subject of partitioning tables is rather broad. The documentation is a good place to start.

How to store millions of statistics records efficiently?

We have about 1.7 million products in our eshop, we want to keep record of how many views this products had for 1 year long period, we want to record the views every atleast 2 hours, the question is what structure to use for this task?
Right now we tried keeping stats for 30 days back in records that have 2 columns classified_id,stats where stats is like a stripped json with format date:views,date:views... for example a record would look like
345422,{051216:23212,051217:64233} where 051216,051217=mm/dd/yy and 23212,64233=number of views
This of course is kinda stupid if you want to go 1 year back since if you want to get the sum of views of say 1000 products you need to fetch like 30mb from the database and calculate it your self.
The other way we think of going right now is just to have a massive table with 3 columns classified_id,date,view and store its recording on its own row, this of course will result in a huge table with hundred of millions of rows , for example if we have 1.8 millions of classifieds and keep records 24/7 for one year every 2 hours we need
1800000*365*12=7.884.000.000(billions with a B) rows which while it is way inside the theoritical limit of postgres I imagine the queries on it(say for updating the views), even with the correct indices, will be taking some time.
Any suggestions? I can't even imagine how google analytics stores the stats...
This number is not as high as you think. In current work we store metrics data for websites and total amount of rows we have is much higher. And in previous job I worked with pg database which collected metrics from mobile network and it collected ~2 billions of records per day. So do not be afraid of billions in number of records.
You will definitely need to partition data - most probably by day. With this amount of data you can find indexes quite useless. Depends on planes you will see in EXPLAIN command output. For example that telco app did not use any indexes at all because they would just slow down whole engine.
Another question is how quick responses for queries you will need. And which steps in granularity (sums over hours/days/weeks etc) for queries you will allow for users. You may even need to make some aggregations for granularities like week or month or quarter.
Addition:
Those ~2billions of records per day in that telco app took ~290GB per day. And it meant inserts of ~23000 records per second using bulk inserts with COPY command. Every bulk was several thousands of records. Raw data were partitioned by minutes. To avoid disk waits db had 4 tablespaces on 4 different disks/ arrays and partitions were distributed over them. PostreSQL was able to handle it all without any problems. So you should think about proper HW configuration too.
Good idea also is to move pg_xlog directory to separate disk or array. No just different filesystem. It all must be separate HW. SSDs I can recommend only in arrays with proper error check. Lately we had problems with corrupted database on single SSD.
First, do not use the database for recording statistics. Or, at the very least, use a different database. The write overhead of the logs will degrade the responsiveness of your webapp. And your daily backups will take much longer because of big tables that do not need to be backed up so frequently.
The "do it yourself" solution of my choice would be to write asynchronously to log files and then process these files afterwards to construct the statistics in your analytics database. There is good code snippet of async write in this response. Or you can benchmark any of the many loggers available for Java.
Also note that there are products like Apache Kafka specifically designed to collect this kind of information.
Another possibility is to create a time series in column oriented database like HBase or Cassandra. In this case you'd have one row per product and as many columns as hits.
Last, if you are going to do it with the database, as #JosMac pointed, create partitions, avoid indexes as much as you can. Set fillfactor storage parameter to 100. You can also consider UNLOGGED tables. But read thoroughly PostgreSQL documentation before turning off the write-ahead log.
Just to raise another non-RDBMS option for you (so a little off topic), you could send text files (CSV, TSV, JSON, Parquet, ORC) to Amazon S3 and use AWS Athena to query it directly using SQL.
Since it will query free text files, you may be able to just send it unfiltered weblogs, and query them through JDBC.

Bigquery partitioning table performance

I've got a question about BQ performance in various scenarios, especially revolving around parallelization "under the hood".
I am saving 100M records on a daily basis. At the moment, I am rotating tables every 5 days to avoid high charges due to full table scans.
If I were to run a query with a date range of "last 30 days" (for example), I would be scanning between 6 (if I am at the last day of the partition) and 7 tables.
I could, as an alternative, partition my data into a new table daily. In this case, I will optimize my expenses - as I'm never querying more data than I have too. The question is, will be suffering a performance penalty in terms of getting the results back to the client, because I am now querying potentially 30 or 90 or 365 tables in parallel (Union).
To summarize:
More tables = less data scanned
Less tables =(?) longer response time to the client
Can anyone shed some light on how to find the balance between cost and performance?
A lot depends how you write your queries and how much development costs, but that amount of data doesn't seam like a barrier, and thus you are trying to optimize too early.
When you JOIN tables larger than 8MB, you need to use the EACH modifier, and that query is internally paralleled.
This partitioning means that you can get higher effective read bandwidth because you can read from many of these disks in parallel. Dremel takes advantage of this; when you run a query, it can read your data from thousands of disks at once.
Internally, BigQuery stores tables in
shards; these are discrete chunks of data that can be processed in parallel. If
you have a 100 GB table, it might be stored in 5000 shards, which allows it to be
processed by up to 5000 workers in parallel. You shouldn’t make any assumptions
about the size of number of shards in a table. BigQuery will repartition
data periodically to optimize the storage and query behavior.
Go ahead and create tables for every day, one recommendation is that write your create/patch script that creates tables for far in the future when it runs eg: I create the next 12 months of tables for every day now. This is better than having a script that creates tables each day. And make it part of your deploy/provisioning script.
To read more check out Chapter 11 ■ Managing Data Stored in BigQuery from the book.

What would be the most efficient method for storing/updating Interval based data in SQL?

I have a database table with about 700 millions rows plus (growing exponentially) of time based data.
Fields:
PK.ID,
PK.TimeStamp,
Value
I also have 3 other tables grouping this data into Days, Months, Years which contains the sum of the value for each ID in that time period. These tables are updated nightly by a SQL job, the situation has arisen where by the tables will need to updated on the fly when the data in the base table is updated, this can be however up to 2.5 million rows at a time (not very often, typically around 200-500k up to every 5 minutes), is this possible without causing massive performance hits or what would be the best method for achieving this?
N.B
The daily, monthly, year tables can be changed if needed, they are used to speed up queries such as 'Get the monthly totals for these 5 ids for the last 5 years', in raw data this is about 13 million rows of data, from the monthly table its 300 rows.
I do have SSIS available to me.
I cant afford to lock any tables during the process.
700M recors in 5 months mean 8.4B in 5 years (assuming data inflow doesn't grow).
Welcome to the world of big data. It's exciting here and we welcome more and more new residents every day :)
I'll describe three incremental steps that you can take. The first two are just temporary - at some point you'll have too much data and will have to move on. However, each one takes more work and/or more money so it makes sense to take it a step at a time.
Step 1: Better Hardware - Scale up
Faster disks, RAID, and much more RAM will take you some of the way. Scaling up, as this is called, breaks down eventually, but if you data is growing linearly and not exponentially, then it'll keep you floating for a while.
You can also use SQL Server replication to create a copy of your database on another server. Replication works by reading transaction logs and sending them to your replica. Then you can run the scripts that create your aggregate (daily, monthly, annual) tables on a secondary server that won't kill the performance of your primary one.
Step 2: OLAP
Since you have SSIS at your disposal, start discussing multidimensional data. With good design, OLAP Cubes will take you a long way. They may even be enough to manage billions of records and you'll be able to stop there for several years (been there done that, and it carried us for two years or so).
Step 3: Scale Out
Handle more data by distributing the data and its processing over multiple machines. When done right this allows you to scale almost linearly - have more data then add more machines to keep processing time constant.
If you have the $$$, use solutions from Vertica or Greenplum (there may be other options, these are the ones that I'm familiar with).
If you prefer open source / byo, use Hadoop, log event data to files, use MapReduce to process them, store results to HBase or Hypertable. There are many different configurations and solutions here - the whole field is still in its infancy.
Indexed views.
Indexed views will allow you to store and index aggregated data. One of the most useful aspects of them is that you don't even need to directly reference the view in any of your queries. If someone queries an aggregate that's in the view, the query engine will pull data from the view instead of checking the underlying table.
You will pay some overhead to update the view as data changes, but from your scenario it sounds like this would be acceptable.
Why don't you create monthly tables, just to save the info you need for that months. It'd be like simulating multidimensional tables. Or, if you have access to multidimensional systems (oracle, db2 or so), just work with multidimensionality. That works fine with time period problems like yours. At this moment I don't have enough info to give you, but you can learn a lot about it just googling.
Just as an idea.