I have a measure group that takes too long to process and then times out. The measure group is based on a view that is rather complicated, with many table joins, unions, left outer joins, etc. on an OLTP database. Is this the reason processing takes so long? What are the options? I am thinking of either materializing the view into a data warehouse or using multiple partitions (query binding on the view) so that each query/partition is much smaller. I haven't tried either yet, but I would like to hear your opinions.
Update: the error information is:
OLE DB error: OLE DB or ODBC error: Query timeout expired; HYT00.
Errors in the OLAP storage engine: An error occurred while processing the 'Vw Fact Stock By Day' partition of the 'Stocks' measure group for the 'xxx' cube from the xxx database.
In my case, partitioning speeds up processing greatly (because parallel computations are faster on servers). I have partitions for every day (they are generated via C#).
But to get the performance benefit you should note:
1) Set a slice on every partition, not only the query binding.
2) With partitions you will probably get rid of the timeout, but processing may still take a long time, so a materialized, indexed view is the next step, if needed.
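As a rough sketch (the view, column names, and date keys here are hypothetical), each partition's query binding is just the fact view restricted to one month, with a matching slice set on the partition:

SELECT *
FROM dbo.VwFactStockByDay
WHERE DateKey >= 20150101   -- first day covered by this partition
  AND DateKey <  20150201;  -- exclusive upper bound; the next partition starts here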
I execute a SELECT INTO query where the source data are in a different database than the table I insert into (but on the same server).
When I execute the query while connected to the database that holds my source data (USE DATABASE_MY_SOURCE_DATA), it completes in under a minute. When I change to the database where my target table sits, it doesn't complete within 10 minutes (I don't know the exact time, because I cancelled it).
Why is that? Why is the difference so huge? I can't get my head around it.
Querying cross-database, even using a linked server connection, is always likely (at least in 2021) to present performance concerns.
The first problem is that the optimizer doesn't have access to statistics on the remote table(s), so it can't estimate the number of rows. It also misses the indexes on those tables and resorts to table scans (which tend to be a lot slower on large tables than index seeks).
Another issue is that there is no data caching, so the optimizer makes round-trips to the remote database for every necessary operation.
More information (from a great source):
https://www.brentozar.com/archive/2021/07/why-are-linked-server-queries-so-bad/
Assuming that you want this to be more performant, and that you are doing substantial filtering on the remote data source, you may see some performance benefit from creating, on the remote database, a view that filters to just the rows you want in the target table, and then querying that view for your results.
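As a rough sketch of that view-based suggestion (database, table, and column names, and the filter, are all made up here):

-- on the source database: a view that returns only the rows you actually need
USE SourceDb;
GO
CREATE VIEW dbo.vw_FilteredSource
AS
SELECT col1, col2, col3
FROM dbo.BigSourceTable
WHERE LoadDate >= '2021-01-01';   -- whatever filter you already apply
GO

-- from the target database: SELECT INTO against that view
USE TargetDb;
GO
SELECT col1, col2, col3
INTO dbo.TargetTable
FROM SourceDb.dbo.vw_FilteredSource;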
Alternatively (and likely more correctly) you should wrap these operations in an ETL process (such as SSIS) that better manages these connections.
I build statistical output generated on-demand from data stored in BigQuery tables. Some data is imported daily via Stitch using "Append-Only" mode. This results in duplicated observations in the imported tables (around 20 million rows, growing by 8 million yearly).
I could either schedule a BigQuery query to store deduplicated values in a cleaned table, or build views to do the same, but I don't understand the tradeoffs in terms of:
costs on BigQuery for storing/running scheduled queries and views.
speed of later queries dependent on deduplicated views. Do the views cache?
Am I correct to assume that daily scheduled queries to store deduplicated data are more costly (for re-writing stored tables) but speed up later queries against the deduplicated data (saving on usage costs)?
The deduplicated data will in turn be queried hundreds of times daily to produce dashboard output, for which responsiveness is a concern.
How should I argue when deciding for the better solution?
Let's go to the facts:
The price you pay for the query is the same regardless of whether you are using a View or a Scheduled Query.
When using a Scheduled Query, you will need to pay for the data you store in the de-duplicated table. Since a View does not store any data, you will not have extra storage charges.
In terms of speed, the Scheduled Query approach wins because your data is already de-duplicated and cleaned. If you are going to feed dashboards with this data, the View approach can lead to sluggish dashboard loading.
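For reference, the de-duplication itself, whether it ends up behind a View or inside a Scheduled Query that rewrites a cleaned table, can be a query along these lines (the table name, key column, and Stitch timestamp column are assumptions about your schema):

SELECT * EXCEPT(row_num)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY id                   -- the business key of one observation
      ORDER BY _sdc_batched_at DESC     -- keep the most recently loaded copy
    ) AS row_num
  FROM my_dataset.raw_events
)
WHERE row_num = 1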
Another possible approach for you is using Materialized Views, which are smarter Views that periodically cache results in order to improve performance. In this guide you can find some information about choosing between Scheduled Queries and Materialized Views:
When should I use scheduled queries versus materialized views?
Scheduled queries are a convenient way to run arbitrarily complex calculations periodically. Each time the query runs, it is being run fully. The previous results are not used, and you pay the full price for the query. Scheduled queries are great when you don't need the freshest data and you have a high tolerance for data staleness.
Materialized views are suited for when you need to query the latest data while cutting down latency and cost by reusing the previously computed result. You can use materialized views as pseudo-indexes, accelerating queries to the base table without updating any existing workflows.
As a general guideline, whenever possible and if you are not running arbitrarily complex calculations, use materialized views.
I think it might also be affected by how often your view/table would be queried.
For example - a very complex query over a large dataset will be costly every time it's run. If the result is a significantly smaller dataset, it will be more cost-effective to schedule a query to save the results, and query the results directly - rather than using a view which will perform the very complex query time and time again.
For the speed factor - it definitely is better to query a reduced table directly and not a view.
For the cost factor - I would try to understand how often this view/table will be queried and how are the processing+storage costs for it:
For a view: roughly calculate the processing costs * amount of times it will be queried monthly, for example
For a stored table: scheduled queries performed per month * processing costs + monthly storage costs for the table results
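As a purely hypothetical worked example (all numbers made up): if the de-duplication query scans 5 GB and the dashboards hit the data 300 times a month, then

view:         300 queries x 5 GB scanned on the raw table = 1,500 GB processed per month
stored table: 30 scheduled runs x 5 GB = 150 GB processed per month,
              plus storage for the cleaned table,
              plus 300 dashboard queries that now scan only the smaller cleaned table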
This should give you pretty much the entire case you need to build in order to argue for your solution.
When I try to create a view which queries more than 600 tables, BigQuery runs for a long time and the response is:
BigQuery error in mk operation: Backend Error.
The query itself is like:
'select col1,col2,col3 from t1,t2,t3......t600'
I suspect the operation is timing out. The limit here is whether validating the view query can be completed within the deadline limits for a single synchronous request like view creation. This many tables may just be too many.
A potential work-around might be to shard this view: create smaller view tables, then a single view of the set of smaller views.
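A sketch of that sharding idea (assuming the comma-separated FROM list keeps its legacy SQL union semantics, and with made-up view names):

Query for the first shard view (created e.g. via bq mk, as in the question):
select col1,col2,col3 from t1,t2,t3......t100
Query for the second shard view:
select col1,col2,col3 from t101,t102,t103......t200
Query for the top-level view over the shard views:
select col1,col2,col3 from mydataset.v_shard_1,mydataset.v_shard_2......mydataset.v_shard_6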
An alternate solution would be to explore your data layout. Perhaps you don't need 600 tables to hold your data? The BigQuery team announced at GCP Next 2016 that table partitioning by date will be coming soon, so if you are sharding your tables by day and need to reference years of data, then there will be a single-table solution for you soon.
I have an 'Employee' dimension which is changed (modified) every day. I made monthly partitions in the cube and only do a ProcessFull on the current month's partition. Lately I found that the past months' aggregations are not dropped. I tried 'ProcessUpdate' on this dimension and 'ProcessIndex' on the partition, but they remained the same. I also tried 'ProcessAffectedObjects' and 'ProcessIndex' again, still the same; I tried both with lazy processing set to true and false, with no luck.
So my question is: how do I drop the stale aggregations for past months and rebuild them explicitly?
It is a distinct count measure, and no aggregations were designed via the wizard.
I tried dropping the indexes by using 'ProcessClearIndexes' in an XMLA command, which worked fine, and 'ProcessIndexes' did rebuild the indexes and aggregations; I saw them in the SSMS query execution messages.
So might it be related only to the distinct count, just because it is a non-additive measure?
"Non-additive measures create the following problems on a typical OLAP system:
Roll-ups are not possible. When pre-calculating results during cube processing, the system cannot deduce summaries from other summaries. All results must be calculated from the detail data. This situation places a heavy burden in processing time.
All results must be pre-calculated. With non-additive measures, there is no way to deduce the result for a higher-level summary query from one pre-calculated aggregation. Failure to pre-calculate the results in advance means that the results are not available. It is impossible to perform and maintain incremental updates to the system. A single transaction added to the cube usually invalidates huge portions of previously pre-calculated results. In order to recover from this, a complete recalculation is needed."
"Aggregations
As mentioned before, DISTINCT COUNTs are not additive (and this is the main reason why these measures are so problematic). Therefore, the aggregations, which are all derived from additive operators, are completely useless;"
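A small SQL illustration of why a DISTINCT COUNT cannot be rolled up (the table and column names are hypothetical): suppose January has 100 distinct stock items and February has 120. The two-month value can be anywhere between 120 and 220, so it cannot be derived from the two monthly aggregations; the engine has to go back to the detail rows.

-- distinct items per month: two separate pre-calculated results
SELECT MONTH(StockDate) AS Mth, COUNT(DISTINCT StockItemKey) AS DistinctItems
FROM dbo.FactStockByDay
GROUP BY MONTH(StockDate);

-- distinct items across both months: must be computed from the detail data,
-- not from the two monthly numbers above
SELECT COUNT(DISTINCT StockItemKey) AS DistinctItems
FROM dbo.FactStockByDay
WHERE MONTH(StockDate) IN (1, 2);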
Someone answered my question on MSDN:
https://social.msdn.microsoft.com/Forums/sqlserver/en-US/7302227f-11b8-4adc-98ff-72b6c395775b/ssas-update-a-dimension-wont-drop-aggregation-process-index-wont-rebuild-aggregation?forum=sqlanalysisservices
If you use materialized reference dimensions, ensure you do a ProcessFull to reprocess the fact tables fully again. The reason is that the join to the intermediate dimension happens in the measure group partition processing query:
http://sqlblog.com/blogs/alberto_ferrari/archive/2009/02/25/ssas-reference-materialized-dimension-might-produce-incorrect-results.aspx
I am storing event data in BigQuery, partitioned by day - one table per day. The following query failed with a "Query too large" error:
select count(distinct event)
from TABLE_DATE_RANGE(my_dataset.my_dataset_events_, SEC_TO_TIMESTAMP(1391212800), SEC_TO_TIMESTAMP(1393631999))
Each table is about 8GB in size.
Has anyone else experienced this error? It seems like it's limited by table size, because in this query I've only selected a single column. When I use a smaller time range, it works... but the whole point of using BigQuery was its support for large datasets.
"Query too large" in this case means that the TABLE_RANGE is getting expanded internally to too many tables, generating an internal query that is too large to be processed.
This has 2 workarounds:
Query fewer tables (could you aggregate these tables into a bigger one?).
Wait until the BQ team solves this issue internally. Instead of using a workaround, you should be able to run this query unchanged. Just not today :).
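To illustrate the first workaround (the consolidated table name is hypothetical): if the daily tables for that range were first appended into one monthly table, the same question becomes a single-table query:

select count(distinct event)
from my_dataset.my_dataset_events_201402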