I have a Tabular Model cube where I have split the tables into partitions to make processing more efficient.
When I Process Full the daily partition only, it takes 2h 45m. However, when I Process Full the entire database (that includes daily and historical data), it takes 1h 10m.
Does anyone know what could be causing this?
Thanks!
ProcessFull within a Tabular model is basically a combination of ProcessData (grab the data from the source, build dictionaries, etc.) and ProcessRecalc (build up indexes, attribute hierarchies, etc.). While ProcessData only grabs the most recent data (i.e. the data for the partition), ProcessRecalc needs to be executed against the entire database. A good reference is Cathy Dumas' blog post: http://cathydumas.com/2012/01/25/processing-data-transactionally-in-amo/
To get to the cause, it's best to dig into the profiler traces / logs to determine which actions are taking a very long time to complete. By any chance does your data contain a lot of repeated values, such as audit logs? It may be that processing the entire database (vs. a single partition) is faster because the engine can compress and organize the data more efficiently: the repeated data compresses better and therefore takes up less memory. A potential way to check this is to compare the model size after running ProcessFull on the partition vs. running it on the entire database. If this is what's happening, the latter will result in a smaller database.
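One quick way to compare the compressed sizes is to query the storage DMVs against the Tabular instance after each run. This is just a sketch (run from an SSMS query window connected to the Analysis Services instance); USED_SIZE is reported per column segment in bytes:

```sql
-- Sketch: run after each ProcessFull run and compare the totals.
-- USED_SIZE is the compressed size (bytes) of each column segment.
SELECT DIMENSION_NAME, COLUMN_ID, USED_SIZE
FROM $SYSTEM.DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS
ORDER BY USED_SIZE DESC
```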
HTH!
If I have a BigQuery dataset with data that I would like to make available to 1,000 people (where each person is only allowed to view their own subset of the data, and it's OK for them to see a version of their data that is up to 24 hours stale), how can I do this without exceeding the 50-concurrent-queries limit?
The BigQuery documentation mentions that 50 concurrent queries are permitted, which give on-the-spot accurate data. I would exceed that limit if all of these people needed on-the-spot accurate data, but they don't.
The documentation also mentions batch jobs and saving results into destination tables, which I hope could provide a reliable solution for my scenario, but I'm having difficulty finding information on how reliably or frequently those batch jobs can be expected to run, and whether someone querying results from those destination tables counts toward the 50-concurrent-queries limit.
Any advice appreciated.
Without knowing the specifics of your situation and depending on how much data is in the output, I would suggest putting your own cache in front of BigQuery.
This sounds like a dashboarding/reporting solution, so I assume there is a large amount of data going in and a relatively small amount coming out (per user).
Run one query per day with a batch script to generate your output (grouped by user) and then export it to GCS. You can then break that up into multiple flat files (or just read it into memory on your frontend). When a user hits your frontend, you determine which part of the output to serve to them and respond.
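A rough sketch of that daily job, with hypothetical dataset/table/column names (the real grouping depends on your schema); you'd submit it with batch priority via the client or CLI and then either read the destination table directly or export it to GCS:

```sql
-- Hypothetical daily batch job: materialize each user's slice into a small
-- destination table, then serve reads from this table (or a GCS export of it)
-- instead of hitting the raw dataset once per user.
CREATE OR REPLACE TABLE reporting.daily_user_summary AS
SELECT
  user_id,
  DATE(event_timestamp) AS event_date,
  COUNT(*)              AS event_count,
  SUM(amount)           AS total_amount
FROM raw.events
WHERE DATE(event_timestamp) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
GROUP BY user_id, event_date;
```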
This should be relatively cheap if you can work off the cached data and it is small enough that handling the BigQuery output isn't too much additional processing.
Google Cloud Functions might be an easy way to handle this, if you don't want the extra work of setting up a new VM to host your frontend.
I've been hearing conflicting statements on how many records / how much data Tableau can handle.
In the last week two people have told me they have dashboards with 100m and 600m records respectively. They do incremental refreshes.
If I have a dashboard with xxx million records, do clients only receive the data that is in their aggregated view?
So, if I have a source with 200 million records and the dashboard shows the aggregated total per week per product, let's say that's 400 cells (underneath it's millions of records), is the client only receiving 400 data points?
If I then add filters on sub-product or user-level data, would that mean all of that data is imported because of the filters? If so, how does this affect speed?
Ultimately, Tableau can handle as much data as your datasource can handle. If Tableau connects to the datasource directly (a live connection), only the results of a query are transmitted to the user. I've got billion-row datasources in BigQuery that return reasonably fast aggregated numbers to Tableau.
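To illustrate, with a live connection Tableau pushes the aggregation down to the source, so a "total per week per product" view only pulls back the grouped rows. Something like this hypothetical query (table and column names invented) against BigQuery:

```sql
-- Roughly what a live connection issues for a weekly-totals-per-product view:
-- only the few hundred grouped rows come back over the wire, not the millions
-- of underlying records.
SELECT
  product,
  DATE_TRUNC(order_date, WEEK) AS sales_week,
  SUM(sales_amount)            AS total_sales
FROM `my_project.sales.orders`
GROUP BY product, sales_week;
```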
If your datasource is not fast then this won't give good results in Tableau.
If you are using extracts, where, in effect, Tableau pulls all the data locally, things will usually be faster, but you will have local drive and memory limits on the size of the dataset, and each user will need an extract (unless you are using Tableau Server, in which case the extract can live on the server).
Dashboards built on big datasources sometimes get slow when there are a lot of filters because populating each filter requires a datasource query (which may be triggered every time you use a filter). There are strategies to speed up dashboards with this problem by using partial extracts that generate all the values used for filtering (you can sometimes use parameters for a similar speed gain). Or even just designing the filters intelligently. But speed is usually the limiting factor not the size of the source table.
The only real limit on how much Tableau can handle is how many points are displayed, and that depends on RAM. In my experience a 4GB machine will choke on a chart with a couple of million points (e.g. a map plotting every postcode in the UK), but on a 16GB RAM machine I have never found a limit other than how fast the points are drawn.
Does a simple-recovery database record transaction logs when data is selected from a full-recovery database? I mean, we have a full-recovery database, and it records a lot of transaction log, causing its size to grow.
My question is: does the simple-recovery database still do its minimal logging even if the data is selected from a full-recovery-model database? Thank you!
One thing has nothing to do with the other. Where the data comes from does not affect logging of changes to the tables in the db it's going to.
However, as Martin Smith pointed out, this is treating a symptom: there's naff all point in having full recovery mode on if you (they??) aren't backing up the transaction logs frequently enough to make the overhead useful. The whole point of them, aside from restoring up to a particular transaction in the event of some catastrophe in your applications, is speed and granularity.
Please read the MSDN page for recovery models.
http://msdn.microsoft.com/en-us/library/ms189275.aspx
Here is a quick summary from MSDN.
1 - Simple model - Automatically reclaims log space to keep space requirements small, essentially eliminating the need to manage the transaction log space.
2 - Bulk-Logged model - An adjunct of the full recovery model that permits high-performance bulk copy operations.
**The first two do not support point-in-time recovery!**
3 - Full model - Can recover to an arbitrary point in time (for example, prior to an application or user error). If no tail-log backup is possible, recover to the end of the last log backup.
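To see (and, if appropriate, change) which model a database is actually using, a quick T-SQL sketch, with [YourDb] standing in for your database name:

```sql
-- Check the current recovery model of every database.
SELECT name, recovery_model_desc
FROM sys.databases;

-- If point-in-time recovery is not required for this database, switching to
-- SIMPLE stops the log from growing between log backups. [YourDb] is a
-- placeholder; check with whoever owns the backup/restore SLA first.
ALTER DATABASE [YourDb] SET RECOVERY SIMPLE;
```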
So your problem is with either log usage or log backups.
A - Are you deleting from temporary tables instead of truncating?
http://msdn.microsoft.com/en-us/library/ms177570.aspx The delete operation will log each row in the transaction log; a truncate is only minimally logged.
B - Are you inserting large amounts of data via an ETL job? Each insert will get logged in the T-Log.
If you use bulk copy, or an ETL tool that supports fast (bulk) data loads, the operation can be minimally logged.
However, page density and fill factor come into play when determining the size of the T-LOG.
http://blogs.msdn.com/b/sqlserverfaq/archive/2011/01/07/using-bulk-logged-recovery-model-for-bulk-operations-will-reduce-the-size-of-transaction-log-backups-myths-and-truths.aspx
C - How often are you taking transaction log backups? After each backup, the T-Log space can be reused, resulting in a smaller overall size (see the T-SQL sketch after this list).
D - How fragmented is the T-Log? I suggest shrinking and re-growing the log during a maintenance period. A 20% log-to-data ratio with hourly backups worked fine at my old company; it all depends on how many changes you are making. http://craftydba.com/?p=3374
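For items C and D, a hedged T-SQL sketch ([YourDb], the logical file name, and the paths/sizes are placeholders): schedule frequent log backups so the space can be reused, and during a maintenance window shrink and re-grow the log to reduce VLF fragmentation.

```sql
-- C: frequent log backups let SQL Server reuse the log space.
BACKUP LOG [YourDb]
TO DISK = N'X:\Backups\YourDb_log.trn';            -- placeholder path

-- D: during a maintenance window, shrink the log and re-grow it to its
-- working size in one step to avoid a fragmented (many-VLF) log file.
USE [YourDb];
DBCC SHRINKFILE (YourDb_log, 1024);                -- shrink to ~1 GB (placeholder)
ALTER DATABASE [YourDb]
    MODIFY FILE (NAME = YourDb_log, SIZE = 8GB);   -- re-grow (placeholder size)
```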
In summary, these are the places you should be looking at, not the old data in the system, since that data is probably not being modified.
Moving the old data to a read-only reporting database for ad-hoc queries from novice T-SQL users might not be a bad idea, but that solves other problems, namely possible BLOCKING and DEADLOCKS in your OLTP database.
Every day we have an SSIS package running to import data into a database.
Sometimes people are querying the database at the same time.
The loading (data import) times out because there's a table lock on the specific table.
What is the standard protocol for inserting data and querying data at the same time?
First you need to figure out where those locks are coming from. Use the link below to see if there are any locks.
How to: Determine Which Queries Are Holding Locks
If another process holds a table lock, then there's not much you can do.
Are you sure the error is "not able to OBTAIN a table lock"? If so, look at changing your SSIS package so it does not take table locks.
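As a starting point for finding the blockers, something along these lines run in the relevant database (a sketch; join in sys.dm_exec_sessions and friends as needed to see who the sessions belong to):

```sql
-- Who currently holds, or is waiting on, locks in this database, and on
-- what kind of resource?
SELECT request_session_id,
       resource_type,
       resource_associated_entity_id,  -- object_id for OBJECT locks, hobt_id for PAGE/KEY
       request_mode,
       request_status
FROM sys.dm_tran_locks
WHERE resource_database_id = DB_ID();
```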
There are several strategies.
One approach is to design your ETL pipeline to minimize lock time. All the data is prepared in staging tables and then, when complete, switched in using fast partition-switch operations; see Transferring Data Efficiently by Using Partition Switching. This way the ETL blocks reads only for a very short duration. It also has the advantage that the reads see all the ETL data at once, not intermediate stages. The drawback is a more difficult implementation.
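A minimal sketch of that pattern, assuming a staging table with the same schema, indexes, and constraints as the target (table names and the partition number are hypothetical):

```sql
-- Load dbo.Sales_Staging completely first, without touching the live table.
-- The switch itself is a metadata-only operation, so readers are blocked only
-- for an instant and then see the whole load at once.
BEGIN TRAN;
    ALTER TABLE dbo.Sales_Staging
        SWITCH TO dbo.Sales PARTITION 42;   -- partition number is a placeholder
COMMIT;
```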
Another approach is to enable snapshot isolation and/or read committed snapshot in the database; see Row Versioning-based Isolation Levels in the Database Engine. This way reads no longer block behind the locks held by the ETL. The drawback is resource consumption: the hardware must be able to drive the additional load of row versioning.
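Enabling it is a one-time setting per database; a sketch below, with [YourDb] as a placeholder. Note that switching READ_COMMITTED_SNAPSHOT on needs a moment without other active connections, and tempdb has to absorb the version store.

```sql
-- Readers use row versions instead of waiting on writers' locks.
ALTER DATABASE [YourDb] SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;

-- Optional: also allow sessions to request full SNAPSHOT isolation explicitly.
ALTER DATABASE [YourDb] SET ALLOW_SNAPSHOT_ISOLATION ON;
```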
Yet another approach is to move the data querying to a read-only standby server, e.g. using log shipping.
I've been working on optimizing my site and databases, and I have been using mysqltuner.pl to help with this. I've gotten just about everything right except for the table cache hit rate: no matter how high I raise it in my.cnf, I am still hitting about 0% (284 open / 79k opened).
My problem is that I don't really understand exactly what affects this, so I don't really know what to look for in my queries/database structure to fix it.
The table cache defines the number of simultaneous file descriptors that MySQL has open. So the table cache hit rate will be affected by how many tables you have relative to your limit, as well as how frequently you re-reference tables (keeping in mind that it counts not just a single connection, but all simultaneous connections).
For instance, if your limit is 100 and you have 101 tables and you query each one in order, you'll never get any table cache hits. On the other hand, if you only have 1 table, you should generally get close to a 100% hit rate unless you run FLUSH TABLES a lot (as long as your table_cache is set higher than the number of typically simultaneous connections).
So for tuning, you need to look at how many distinct tables you might reference by one process/client and then look at how many simultaneous connections you might typically have.
Without more details, I can't guess whether your case is due to too many simultaneous connections or too many frequently referenced tables.
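To put numbers on that, a quick sketch of the server-side counters to compare (the variable is table_cache on older MySQL versions and table_open_cache on 5.1+):

```sql
-- How big is the cache, how many tables are open right now, and how many
-- times has MySQL had to open a table since startup? Opened_tables growing
-- fast relative to the cache size suggests the cache is too small for the
-- set of tables being touched concurrently.
SHOW VARIABLES LIKE 'table%cache';
SHOW GLOBAL STATUS LIKE 'Open_tables';
SHOW GLOBAL STATUS LIKE 'Opened_tables';
```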
A cache is supposed to maintain copies of hot data. Hot data is data that is used a lot. If you cannot retrieve data out of a certain cache it means the DB has to go to disk to retrieve it.
--edit--
Sorry if the definition seemed a bit obnoxious. A specific cache often covers a lot of entities, and these are database-specific; you need to find out what is cached by the table cache first.
--edit: some investigation --
Ok, it seems (from the reply to this post) that MySQL uses the table cache for the data structures used to represent a table. These data structures also (via encapsulation, or by having duplicate table entries for each table) represent a set of file descriptors open for the data files on the file system. The MyISAM engine uses one for the table and one for each index; additionally, each active query element requires its own descriptors.
A file descriptor is a kernel entity used for file IO, it represents the low-level context of a particular file read or write.
I think you are either interpreting the values incorrectly or they need to be interpreted differently in this context. 284 is the number of tables open at the instant you took the snapshot, and the second value represents the number of times a table has been opened since you started MySQL.
I would hazard a guess that you need to take multiple snapshots of this reading and see if the first value (active fd's at that instant) ever exceeds your cache size capacity.
P.S. The kernel generally has an upper limit on the number of file descriptors it will allow each process to open, so you might need to tune this if it is too low.