I've got a cube with around 6 facts and 40 dimensions. Right now the cube takes an hour and a half to process, and most of that time is spent on 2 facts. The users are now asking for data that is no more than an hour old, so I'm thinking of storing the partitions of those 2 fact tables in ROLAP mode (right now they are MOLAP). Would that help improve the cube processing time, or should I look for another approach? Also, is it going to make much difference in query performance?
You are on the right track. Partitions are the key to cutting processing time. You have not specified how much fact data you have (time-wise), but let's say you have a month's worth. Then one partition would hold the month minus the latest day, and a second partition would hold only that latest day. You would re-process only this small partition every hour, and process the full cube once a day when there are no users online.
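As a rough illustration of the two-partition split described above, here is a small Python sketch that computes the date boundaries of a large "history" partition and a small "current" partition. The function name, the partition names and the 30-day window are assumptions made for the example; the real boundaries would go into the source queries of your cube partitions.

```python
from datetime import date, timedelta

def split_partitions(today: date, days_of_history: int = 30):
    """Return inclusive date ranges for a history partition and a current-day partition.

    Hypothetical helper: the names and the 30-day window are illustrative assumptions.
    """
    first_day = today - timedelta(days=days_of_history - 1)
    history = (first_day, today - timedelta(days=1))  # full process once a day, off-hours
    current = (today, today)                          # small partition, re-processed hourly
    return {"history": history, "current": current}

parts = split_partitions(date(2015, 3, 31))
for name, (start, end) in parts.items():
    print(name, start.isoformat(), end.isoformat())
```

Only the "current" range changes within a day, so the hourly job only has to touch that one small partition.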
Partitions do help out when it comes to processing time.
If the partitions are small and the logic to access the ROLAP partition is fast (i.e. no complex views, etc.), the increase in query time should not be dramatic. But really, you should just test, as there are many factors influencing performance.
Related
Inserting data to U-SQL table is taking too much time. We are using partitioned tables to recalculate previously processed data. Insertion for the first time took almost 10-12 minutes on three tables with 11, 5 and 1 partitions and parallelism was set to 10. Second time insertion of same data took almost 4 hours. Currently we are using year based partitions. We tested insertion and querying without adding partitions and performance was much better. Is this an issue with partitioned tables?
It is very strange that the same job would take that much longer for the same data and script executed with the same degree of parallelism. If you look at the job graph (or the vertex execution information) from within Visual Studio, can you see where the time was being spent?
Note that (coarse-grained) partitions are more of a data life-cycle management feature that allows you to address individual partitions of a table, and not necessarily a performance feature (although partition elimination can help with query performance). But it should not go from minutes to hours with the same script, resources and data.
I have an SSAS cube with rigid relationships. Each day I get data from the source for the last 2 months only. My cube has data from 2010 onwards.
I am planning to partition that cube and then process it. My questions are:
I know that with rigid relationships I have to use Process Full. Does that mean I have to Process Full all partitions, or can I Process Full only selected partitions?
How should I design my partition strategy? If I use 2-month partitions, I will end up with 6 partitions per year, and the number will keep growing. I thought of going with 6-month partitions, but then in the 7th or 1st month I would have to process two partitions (i.e. the current one plus the previous 6 months). Is that good enough?
Marking attribute relationships as Rigid when they actually do change (meaning the rollups change, such as Product A rolling up to the Cereal category vs. the Oatmeal category) is a bad idea. Just mark them as Flexible relationships. Rigid vs. flexible doesn't impact query performance, just processing performance. And if Rigid forces you to do ProcessFull on dimensions, that means you have to reprocess all your measure group partitions. So change the relationships to Flexible unless you are 100% sure you never run an UPDATE statement on your dimension table in your ETL.
I would partition by month. Then you can just process the most recent two months every day. To be more explicit:
ProcessUpdate your dimensions
ProcessData the most recent two months of partitions.
ProcessIndexes on your cube (which rebuilds indexes and flexible aggs on older partitions)
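The three steps above could be sketched as an ordered plan like the one below. This is only a sketch in Python: the cube, dimension and partition names (`SalesCube`, `Product`, `Sales_YYYYMM`) are made up for illustration, and the code only builds the list of steps. Actually submitting them to SSAS (for example through XMLA or an SSIS task) is left out.

```python
from datetime import date

def monthly_partition_names(today: date, months: int = 2, prefix: str = "Sales"):
    """Names of the most recent `months` monthly partitions, oldest first.

    The `<prefix>_YYYYMM` naming scheme is a hypothetical convention.
    """
    names = []
    year, month = today.year, today.month
    for _ in range(months):
        names.append(f"{prefix}_{year:04d}{month:02d}")
        month -= 1
        if month == 0:  # step back across a year boundary
            year, month = year - 1, 12
    return list(reversed(names))

def daily_processing_plan(today: date, dimensions, cube: str = "SalesCube"):
    """Ordered (processing type, object name) steps for the daily run."""
    plan = [("ProcessUpdate", dim) for dim in dimensions]                 # step 1
    plan += [("ProcessData", p) for p in monthly_partition_names(today)]  # step 2
    plan.append(("ProcessIndexes", cube))                                 # step 3
    return plan

for step in daily_processing_plan(date(2015, 1, 15), ["Product", "Date"]):
    print(step)
```

The order matters: dimensions are updated first, then the two recent partitions get their data, and indexes are rebuilt last.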
I've got a question about BQ performance in various scenarios, especially revolving around parallelization "under the hood".
I am saving 100M records on a daily basis. At the moment, I am rotating tables every 5 days to avoid high charges due to full table scans.
If I were to run a query with a date range of "last 30 days" (for example), I would be scanning between 6 (if I am at the last day of the partition) and 7 tables.
I could, as an alternative, partition my data into a new table daily. In this case I would optimize my expenses, as I'd never be querying more data than I have to. The question is: will I suffer a performance penalty in terms of getting the results back to the client, because I am now querying potentially 30 or 90 or 365 tables in parallel (Union)?
To summarize:
More tables = less data scanned
Fewer tables =(?) longer response time to the client
Can anyone shed some light on how to find the balance between cost and performance?
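To make the cost side of the trade-off concrete, here is a small Python sketch that counts how many tables a date range touches under each rotation scheme. It assumes days are numbered as integer offsets and that table k holds days k*N to (k+1)*N - 1; the row estimate treats every touched table as full, so it is an upper bound, not a billing calculation.

```python
ROWS_PER_DAY = 100_000_000  # daily volume stated in the question

def tables_touched(end_day: int, range_days: int, days_per_table: int) -> int:
    """Number of tables overlapping the `range_days` days ending on `end_day`.

    Assumes non-negative day offsets; table k covers days
    [k * days_per_table, (k + 1) * days_per_table - 1].
    """
    start_day = end_day - range_days + 1
    return end_day // days_per_table - start_day // days_per_table + 1

def rows_scanned_upper_bound(end_day: int, range_days: int, days_per_table: int) -> int:
    """Rows read if every touched table is scanned in full."""
    return tables_touched(end_day, range_days, days_per_table) * days_per_table * ROWS_PER_DAY

for label, end_day, per_table in [
    ("5-day tables, range ends on last day of a table", 104, 5),
    ("5-day tables, range ends mid-table", 102, 5),
    ("daily tables", 102, 1),
]:
    print(label, tables_touched(end_day, 30, per_table),
          rows_scanned_upper_bound(end_day, 30, per_table))
```

With 5-day tables the extra scan is at most a few days' worth of rows on top of the 30 requested, which is the amount you would be trading against the larger table count.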
A lot depends on how you write your queries and how much development costs, but that amount of data doesn't seem like a barrier, so you are trying to optimize too early.
When you JOIN tables larger than 8MB, you need to use the EACH modifier, and that query is parallelized internally.
This partitioning means that you can get higher effective read bandwidth because you can read from many of these disks in parallel. Dremel takes advantage of this; when you run a query, it can read your data from thousands of disks at once.
Internally, BigQuery stores tables in shards; these are discrete chunks of data that can be processed in parallel. If you have a 100 GB table, it might be stored in 5000 shards, which allows it to be processed by up to 5000 workers in parallel. You shouldn’t make any assumptions about the size or number of shards in a table. BigQuery will repartition data periodically to optimize the storage and query behavior.
Go ahead and create tables for every day. One recommendation: write your create/patch script so that it creates tables far into the future when it runs, e.g. I currently create a table for every day of the next 12 months. This is better than having a script that creates tables each day. Make it part of your deploy/provisioning script.
To read more, check out Chapter 11, "Managing Data Stored in BigQuery", from the book.
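A minimal sketch of the "create ahead" idea, in Python: it only generates the list of daily table names for the coming year. The `events_` prefix and the `YYYYMMDD` suffix are assumptions for the example, and the actual table creation (through whatever BigQuery client or CLI your deploy script uses) is deliberately left out.

```python
from datetime import date, timedelta

def daily_table_names(start: date, days: int = 365, prefix: str = "events_"):
    """Names of one table per day, starting at `start`.

    The prefix and the YYYYMMDD suffix are hypothetical naming choices.
    """
    return [f"{prefix}{(start + timedelta(days=i)):%Y%m%d}" for i in range(days)]

names = daily_table_names(date(2015, 1, 1))
print(len(names), names[0], names[-1])
```

A provisioning script could loop over these names and create any table that does not exist yet, so re-running it is harmless.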
So I'm looking into data warehousing and partitioning, and am very curious as to what scale makes the most sense for partitioning data on a key (for instance, SaleDate).
Tutorials often mention that you're trying to break it down into logical chunks so as to make updating the data less likely to cause service disruptions.
So let's say I'm a medium scale company working in a given US state. I do a lot of work in relation to SaleDate, often tens of thousands of transactions a day (with requisite transaction details, 4-50 each?), and have about 5 years of data. I would like to query and build trend information off of that; for instance:
On a yearly basis to know what items are becoming less popular over time.
On a monthly basis to see what items get popular at a certain time of year (ice in summer)
On a weekly basis to see how well my individual stores are doing
On a daily basis to observe theft trends or something
Now my business unit also wants to query that data, but I'd like to be able to keep it responsive.
How do I know that it would be best to partition on Year, Month, Week, Day, etc for this data set? Is it just whatever I actually observe as providing the best response time by testing out each scenario? Or is there some kind of scale that I can use to understand where my partitions would be the most efficient?
Edit: I, personally, am using SQL Server 2012. But I'm curious as to how others view this question in relation to the core concept rather than the implementation (unless this isn't one of those cases where you can do so).
Things to consider:
What type of database are you using? Really important, different strategies for Oracle vs SQLServer vs IBM, etc.
Sample queries and run times. Partition usage depends on the conditions in your WHERE clause: what are you filtering on?
Does it make sense to create/use aggregate tables? Seems like a monthly aggregate would save you some time.
Lots of options based on the hardware and storage options available to you, need more details to make a more specific recommendation.
Here is an Ms-SQL 2012 database with 7 million records a day, with an ambition to grow the database to 6 years of data for trend analyses.
The partitions are based on the YearWeek column, expressed as an integer (after 201453 comes 201501). So each partition holds one week of transaction data.
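For illustration, a YearWeek key like this could be derived as in the Python sketch below. It assumes a week-numbering rule where week 1 starts on January 1 and a new week begins every Sunday, which reproduces the 201453 to 201501 rollover mentioned above; the post does not say which rule is actually used, so treat the rule as an assumption.

```python
from datetime import date

def year_week(d: date) -> int:
    """Integer YearWeek key (YYYYWW).

    Assumed rule: week 1 starts on January 1, weeks roll over on Sunday.
    """
    jan1 = date(d.year, 1, 1)
    jan1_offset = (jan1.weekday() + 1) % 7  # Sunday=0 ... Saturday=6
    week = (d.timetuple().tm_yday - 1 + jan1_offset) // 7 + 1
    return d.year * 100 + week

for d in (date(2014, 12, 31), date(2015, 1, 1)):
    print(d.isoformat(), year_week(d))
```

Because the key is a plain integer that increases over time, it can serve directly as the boundary value of a range partition function.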
This makes for a maximum of about 320 partitions (6 years of weekly partitions), which is well below the partition limit per table (1,000 in older SQL Server versions; SQL Server 2012 raises it to 15,000). The maximum size of one partition in one table is now approx. 10 GB, which makes it much easier to handle than the 3 TB size of the total.
A new file in the partition scheme is used for each new year. The 500 GB data files are suitable for backup and deletion.
When calculating data for one month the 4 processors are working in parallel to handle one partition each.
I have a database table with over 700 million rows (growing exponentially) of time-based data.
Fields:
PK.ID,
PK.TimeStamp,
Value
I also have 3 other tables that group this data into days, months, and years; each contains the sum of the value for each ID in that time period. These tables are updated nightly by a SQL job. The situation has now arisen whereby the tables will need to be updated on the fly when the data in the base table is updated. This can be up to 2.5 million rows at a time (not very often; typically around 200-500k rows, up to every 5 minutes). Is this possible without causing massive performance hits, and what would be the best method for achieving it?
N.B
The daily, monthly, and yearly tables can be changed if needed. They are used to speed up queries such as 'Get the monthly totals for these 5 IDs for the last 5 years': in raw data this is about 13 million rows, from the monthly table it's 300 rows.
I do have SSIS available to me.
I can't afford to lock any tables during the process.
700M records in 5 months means 8.4B in 5 years (assuming data inflow doesn't grow).
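The estimate above is a straight linear projection; as a quick Python check (the function name is just for illustration, and constant monthly inflow is the stated assumption):

```python
def project_rows(rows_so_far: int, months_elapsed: int, months_target: int) -> int:
    """Linear projection assuming a constant monthly inflow (no growth)."""
    per_month = rows_so_far / months_elapsed
    return round(per_month * months_target)

# 700M rows in 5 months, projected over 5 years (60 months)
print(project_rows(700_000_000, 5, 60))
```

If the inflow really is growing exponentially, as the question suggests, the actual number will be higher than this.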
Welcome to the world of big data. It's exciting here and we welcome more and more new residents every day :)
I'll describe three incremental steps that you can take. The first two are just temporary - at some point you'll have too much data and will have to move on. However, each one takes more work and/or more money so it makes sense to take it a step at a time.
Step 1: Better Hardware - Scale up
Faster disks, RAID, and much more RAM will take you some of the way. Scaling up, as this is called, breaks down eventually, but if your data is growing linearly and not exponentially, then it'll keep you afloat for a while.
You can also use SQL Server replication to create a copy of your database on another server. Replication works by reading transaction logs and sending them to your replica. Then you can run the scripts that create your aggregate (daily, monthly, annual) tables on a secondary server that won't kill the performance of your primary one.
Step 2: OLAP
Since you have SSIS at your disposal, start discussing multidimensional data. With good design, OLAP Cubes will take you a long way. They may even be enough to manage billions of records and you'll be able to stop there for several years (been there done that, and it carried us for two years or so).
Step 3: Scale Out
Handle more data by distributing the data and its processing over multiple machines. When done right, this allows you to scale almost linearly: when you have more data, add more machines to keep processing time constant.
If you have the $$$, use solutions from Vertica or Greenplum (there may be other options, these are the ones that I'm familiar with).
If you prefer open source / byo, use Hadoop, log event data to files, use MapReduce to process them, store results to HBase or Hypertable. There are many different configurations and solutions here - the whole field is still in its infancy.
Indexed views.
Indexed views will allow you to store and index aggregated data. One of the most useful aspects of them is that you don't even need to directly reference the view in any of your queries. If someone queries an aggregate that's in the view, the query engine will pull data from the view instead of checking the underlying table.
You will pay some overhead to update the view as data changes, but from your scenario it sounds like this would be acceptable.
Why don't you create monthly tables, just to save the info you need for each month? It'd be like simulating multidimensional tables. Or, if you have access to multidimensional systems (Oracle, DB2 or so), just work with multidimensionality. That works fine with time period problems like yours. At this moment I don't have enough info to give you, but you can learn a lot about it just by googling.
Just as an idea.