What is the best way to run MDX queries for each drilldown of a BI dashboard chart? For example, if you have four drill levels, should we execute a separate MDX query at every drilldown, or execute only one query initially and keep the data for all four drill levels in an object collection? If you can, please explain with an example.
This depends a lot on what tool you are using to display the BI Dashboard. Is it SSRS, PerformancePoint, something else?
Pull all the data in the initial MDX query, configure the dashboard software to display the top level of detail, and provide users with options for drilldown. As users drill down, unhide the next level of detail. This option requires only one round trip to the database, so initially loading the dashboard may be a bit slower, but the drilldown experience will be very fast (since the data has already been retrieved).
Pull just the top level of detail in the initial MDX query, configure the dashboard software to display the results, and provide users with options for drilldown. As users drill down, the dashboard software sends another MDX query to retrieve the next level of detail from your data source. This option requires multiple round trips to the database: one for the initial top level of detail when the user first loads the dashboard, and another each time the user drills down.
Either option will work but you'll need to make the call on which option best suits your needs after weighing the pros and cons...
how fast is the network between your dashboard and the datasource?
how much concurrency can your data-source handle?
how "big" is the query to pull everything?
how important is speed to your users?
be sure and test each if you are unsure.
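The two options can be compared with a small simulation. This is only a sketch: `FAKE_CUBE`, `CountingSource`, `EagerDashboard` and `LazyDashboard` are hypothetical names, no real MDX is executed, and the lazy variant additionally assumes that a level already fetched is kept in memory so that revisiting it costs nothing.

```python
from typing import Dict, List, Tuple

Rows = List[Tuple[str, int]]

# Hypothetical stand-in for the cube: pre-aggregated rows per drill level.
FAKE_CUBE: Dict[int, Rows] = {
    1: [("2023", 400)],
    2: [("2023-Q1", 150), ("2023-Q2", 250)],
    3: [("2023-01", 50), ("2023-02", 100)],
    4: [("2023-01-01", 20), ("2023-01-02", 30)],
}


class CountingSource:
    """Pretends to run a query and counts round trips to the server."""

    def __init__(self) -> None:
        self.round_trips = 0

    def query_levels(self, levels: List[int]) -> Dict[int, Rows]:
        self.round_trips += 1  # one call stands for one round trip
        return {level: FAKE_CUBE[level] for level in levels}


class EagerDashboard:
    """Option 1: fetch every level up front, then only unhide data."""

    def __init__(self, source: CountingSource, levels: List[int]) -> None:
        self._data = source.query_levels(levels)

    def drill_to(self, level: int) -> Rows:
        return self._data[level]


class LazyDashboard:
    """Option 2: fetch the top level first, other levels on demand."""

    def __init__(self, source: CountingSource) -> None:
        self._source = source
        self._data = source.query_levels([1])

    def drill_to(self, level: int) -> Rows:
        if level not in self._data:  # only query levels not seen yet
            self._data.update(self._source.query_levels([level]))
        return self._data[level]


eager_source = CountingSource()
eager = EagerDashboard(eager_source, [1, 2, 3, 4])
lazy_source = CountingSource()
lazy = LazyDashboard(lazy_source)

for level in (2, 3, 4):
    eager.drill_to(level)
    lazy.drill_to(level)

print(eager_source.round_trips, lazy_source.round_trips)  # prints "1 4"
```

The counts only illustrate the round-trip trade-off; they say nothing about how large the single eager query is, which is the other half of the decision.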
I need some help cleaning my data...
I have a BQ table that receives new entries from my back-end, and I'm using Google Data Studio to present this data.
My problem is that I have a field named sessions that sometimes contains duplicates. I can't solve that directly in my back-end, because a user can send different data from the same session, so I can't just stop recording duplicates.
I've worked around the problem by creating a view that selects the newest duplicate record, and I'm using this view as the data source for my report. The problem with this approach is that I lost the "real-time report" feature, which is important in this case. Another problem is that I also lost "Accelerated by BigQuery BI Engine", and I would like to have that feature too.
Is this the best solution for my problem, so that I need to accept this outcome, or is there another way?
Many thanks in advance, kind regards.
Using the view should work for BI Engine acceleration. Can you please share more details on what BI Engine reports? It should show you the reason the query wasn't accelerated, likely mentioning one of the limitations. If you hover over the "not accelerated" sign, it should give you more details on why your query wasn't supported. Feel free to share it here and I will be happy to help.
Another way you can clean up the data: have a scheduled job preprocess the data. This means the data may not be the most recent, but it gives you the ability to clean up and aggregate it.
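As a rough illustration of what such a preprocessing job would compute, here is a sketch of "keep only the newest record per session" in plain Python. The `recorded_at` and `value` fields are assumptions made for the example; only the `sessions` field comes from the question, and in practice the same logic would live in the scheduled query itself.

```python
from typing import Dict, Iterable, List


def latest_per_session(rows: Iterable[dict]) -> List[dict]:
    """Return one row per session, keeping the most recently recorded one."""
    newest: Dict[str, dict] = {}
    for row in rows:
        key = row["sessions"]
        # ISO 8601 timestamps in the same format compare correctly as strings.
        if key not in newest or row["recorded_at"] > newest[key]["recorded_at"]:
            newest[key] = row
    return sorted(newest.values(), key=lambda r: r["sessions"])


rows = [
    {"sessions": "a", "recorded_at": "2021-01-01T10:00:00", "value": 1},
    {"sessions": "a", "recorded_at": "2021-01-01T10:05:00", "value": 2},
    {"sessions": "b", "recorded_at": "2021-01-01T09:00:00", "value": 3},
]

cleaned = latest_per_session(rows)
print([r["value"] for r in cleaned])  # prints "[2, 3]"
```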
I want to know how to reduce BigQuery data usage by changing settings in Data Portal (Google's BI tool, formerly named Data Studio).
The reason is that unless I reduce my BigQuery data usage, I can't run SQL or afford the cost.
I'm looking for a way that relies on Data Portal settings, not on changing BigQuery settings (including changing the SQL code).
Because the dashboard in Data Portal keeps using BigQuery data capacity, changing the SQL code doesn't solve my problem.
My situation:
1. I made a view in my BigQuery environment.
I tried to write the query so that it doesn't use a lot of BigQuery data capacity.
For example, I didn't use "SELECT * FROM ...".
2. I set the view as a data source in Data Portal.
3. I made the dashboard using that data source.
4. When someone opens the dashboard, the view I made is executed.
So BigQuery data capacity is used every time someone opens the dashboard.
If I'm understanding correctly, you're wanting to reduce the amount of data processed in BigQuery from your Data Studio (or in Japan, Data Portal) reports.
There are a few ways to do this:
Make sure that the "Enable Cache" option is checked in the report settings.
Avoid using BigQuery views as a query source, as these aren't cached at the BigQuery level (the view query is run every time, and likely many times per report for various charts). Instead, use a Custom Query connection or pull the table data directly to allow caching. Another option (which we use heavily) is to run a scheduled query that saves the output of a view as a table and replaces it regularly (or is triggered when the underlying data is refreshed). This way your queries can be cached, but the business logic can still exist within the view.
Create a BI Engine reservation in BigQuery. This adds another level of caching to Data Studio reports, and may give you better results for things that can't be query-cached or cached in Data Studio. (While there will be a cost to the service in the future based on the size of instance you reserve, it's free during their beta period.)
Don't base your queries on tables with a streaming buffer attached (even if it hasn't received rows recently), on wildcard tables, or on external data sources (e.g. files in Cloud Storage or Bigtable). See Caching Exceptions for details.
Pull as little data as possible by using the new Data Source Parameters. This means you can pass the values of your date range or other filters directly to BigQuery and filter the data before it reaches your report. This is especially helpful if you have a date-partitioned table, as you can scan only the needed partitions (which greatly reduces processing and the amount of data returned).
Also, sometimes it seems like you're moving a lot of data but that doesn't always relate to a high cost. Check your cost breakdowns or look at the logging filtered to the user your data source authenticates as, then see how much cost that's incurred. Certain operations fall under a free tier, and others don't result in cost for non-egress use cases like Data Studio. All that to say that you may want to make sure there's a cost problem at the BigQuery level in the first place before killing yourself trying to optimize the usage.
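The "scheduled query that saves the output of a view as a table" idea from the list above can be sketched as follows. This uses Python's built-in `sqlite3` purely as a stand-in so the example runs anywhere; the table and view names are hypothetical, and in BigQuery the refresh step would be a scheduled `CREATE OR REPLACE TABLE ... AS SELECT` over the view rather than the two statements shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (day TEXT, amount INTEGER);
    INSERT INTO events VALUES
        ('2021-01-01', 10), ('2021-01-01', 5), ('2021-01-02', 7);
    -- The business logic stays in the view.
    CREATE VIEW daily_totals_view AS
        SELECT day, SUM(amount) AS total FROM events GROUP BY day;
""")


def refresh_snapshot(connection: sqlite3.Connection) -> None:
    """Replace the snapshot table with the current output of the view."""
    connection.execute("DROP TABLE IF EXISTS daily_totals")
    connection.execute(
        "CREATE TABLE daily_totals AS SELECT * FROM daily_totals_view"
    )
    connection.commit()


def read_snapshot(connection: sqlite3.Connection) -> list:
    """What the dashboard would read: the plain table, not the view."""
    return connection.execute(
        "SELECT day, total FROM daily_totals ORDER BY day"
    ).fetchall()


refresh_snapshot(conn)  # in production this would run on a schedule
snapshot = read_snapshot(conn)
print(snapshot)  # prints "[('2021-01-01', 15), ('2021-01-02', 7)]"
```

The report then points at `daily_totals`, so opening the dashboard no longer re-runs the view; the cost is that the data is only as fresh as the last refresh.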
Does anyone know of any way to remove the public datasets from a BigQuery project?
Though the risk is very low, I don't want my users to be able to run queries against them and rack up costs.
Thanks
It's an old question, but for those who just want to unpin "bigquery-public-data" to tidy up the resources list: click the name in the side panel, then on the far right of the info pane there is an "unpin project" button. Click that.
The whole point of public datasets is that everyone has access to them so they can test BigQuery. Even if a feature request will create the option to disable the listing in the panel of the BigQuery web UI, the users will still have access and could query the public datasets.
It will be more practical to use custom quotas.
So you would create a project with a number of users that share a quota that you consider enough for their activities. When the established quota is reached BigQuery stops and the users receive an error message when trying to run queries.
Another useful tool is budget alerts, with a threshold that you can set by taking the previous month's spend into account. The alert notifies you when the project's bill has reached the amount you set and can save you from bad surprises.
In addition, implementing Audit Logs in your project will give a comprehensive overview of the BigQuery operations. Check this example of an Audit Logs query that will give details on the performed queries. Of course, you will find out about the use of a public dataset only after it happens, but this will point out which user performed the query, and you can reinforce the administration policy of not querying public datasets. To get information on the performed query, including the queried dataset, use this field when querying the Audit Logs:
'protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.query'
As a last resort, you can create a designated project for your users to query the public datasets and, to make sure it will not create additional costs, remove its billing account. By doing so, though, you can query only 1 TB of data per month, which is the BigQuery always-free usage tier.
Also keep in mind these best practices for limiting query costs.
If you close the current tab, the public dataset will disappear from the Google BigQuery page.
On my website, there is a group of 'power users' who are fantastic and add lots of content to my site.
However, their prolific activity has led to their profile pages slowing down a lot. For the other 95% of users, the SPROC that returns the data is very quick. It's only for this group of power users that the very same SPROC is slow.
How does one go about optimising the query for this group of users?
You can assume that the right indexes have already been constructed.
EDIT: Ok, I think I have been a bit too vague. To rephrase the question, how can I optimise my site to enhance the performance for these 5% of users. Given that this SPROC is the same one that is in use for every user and that it is already well optimised, I am guessing the next steps are to explore caching possibilities on the data and application layers?
EDIT2: The only difference between my power users and the rest of the users is the amount of stuff they have added. So I guess the bottleneck is just the sheer number of records that is being fetched. An average user adds about 200 items to my site. These power users add over 10,000 items. On their profile, I am showing all the items they have added (you can scroll through them).
I think you summed it up here:
An average user adds about 200 items to my site. These power users add over 10,000 items. On their profile, I am showing all the items they have added (you can scroll through them).
Implement paging so that it only fetches 100 at a time or something?
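A minimal sketch of that kind of paging, done in the query itself rather than in the UI. It uses Python's built-in `sqlite3` with a hypothetical `items` table so that it is runnable; the real stored procedure would use its own database's paging syntax and take the page number as a parameter.

```python
import sqlite3

PAGE_SIZE = 100

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT)"
)
# One hypothetical power user with 10,000 items.
conn.executemany(
    "INSERT INTO items (user_id, title) VALUES (?, ?)",
    [(1, f"item {n}") for n in range(10_000)],
)
conn.commit()


def fetch_page(connection, user_id, page, page_size=PAGE_SIZE):
    """Return one page of a user's items, newest (highest id) first."""
    offset = page * page_size
    return connection.execute(
        "SELECT id, title FROM items "
        "WHERE user_id = ? ORDER BY id DESC LIMIT ? OFFSET ?",
        (user_id, page_size, offset),
    ).fetchall()


first_page = fetch_page(conn, user_id=1, page=0)
print(len(first_page), first_page[0])  # prints "100 (10000, 'item 9999')"
```

Only 100 rows cross the wire per request, regardless of whether the user has 200 items or 10,000.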
Well you can't optimize a query for a specific result set and leave the query for the rest unchanged. If you know what I mean. I'm guessing there's only one query to change, so you will optimize it for every type of user. Therefore this optimization scenario is no different from any other. Figure out what the problem is; is it too much data being returned? Calculations taking too long because of the amount of data? Where exactly is the cause of the slowdown? Those are questions you need to ask yourself.
However I see you talking about profile pages being slow. When you think the query that returns that information is already optimized (because it works for 95%), you might consider some form of caching of the profile page content. In general, profile pages do not have to supply real-time information.
Caching can be done in a lot of ways, far too many to cover in this answer. But to give you one small example; you could work with a temp table. Your 'profile query' returns information from that temp table, information that is already calculated. Because that query will be simple, it won't take that much time to execute. Meanwhile, you make sure that the temp table periodically gets refreshed.
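To make the idea concrete, here is a sketch of one simple variant: an in-memory cache in the application layer that reloads a profile only after a fixed time has passed. This is a different mechanism from the temp table described above (which would be refreshed by a periodic job in the database), and `ProfileCache`, `slow_profile_query` and the 60-second lifetime are all hypothetical.

```python
import time
from typing import Callable, Dict, List, Tuple


class ProfileCache:
    """Caches profile data per user and reloads it after ttl_seconds."""

    def __init__(
        self,
        loader: Callable[[int], List[str]],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[int, Tuple[float, List[str]]] = {}

    def get(self, user_id: int) -> List[str]:
        now = self._clock()
        entry = self._entries.get(user_id)
        if entry is None or now - entry[0] >= self._ttl:
            # Missing or stale: run the expensive query and remember when.
            entry = (now, self._loader(user_id))
            self._entries[user_id] = entry
        return entry[1]


calls: List[int] = []


def slow_profile_query(user_id: int) -> List[str]:
    """Stand-in for the expensive SPROC call; records each invocation."""
    calls.append(user_id)
    return [f"item {n}" for n in range(3)]


fake_now = [0.0]  # controllable clock so the example needs no sleeping
cache = ProfileCache(slow_profile_query, ttl_seconds=60, clock=lambda: fake_now[0])

cache.get(42)       # first request runs the query
cache.get(42)       # second request is served from the cache
fake_now[0] = 61.0
cache.get(42)       # entry expired, so the query runs again
print(len(calls))   # prints "2"
```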
Just a couple of ideas. I hope they're useful to you.
Edit:
An average user adds about 200 items to my site. These power users add over 10,000 items. On their profile, I am showing all the items they have added (you can scroll through them).
An obvious help for this will be to limit the number of results inside the query, or apply a form of pagination (in the DAL, not UI/BLL!).
You could limit the profile display so that it only shows the most recent 200 items. If your power users want to see more, they can click a button and get the rest of their items. At that point, they would expect a slower response.
Partition / separate the data for those users, so that the tables in question are used only by them.
In a clustered environment I believe SQL recognises this and spreads the load to compensate; in a single-server environment, however, I'm not entirely sure how it does the optimisation.
So essentially (greatly simplified of course) ...
If you have a table called "Articles", have 2 tables instead: "Articles" and "Top5PercentArticles".
Because the data is now separated into 2 smaller subsets, the indexes are smaller and the read and write requests on any single table in the database will drop.
It's not ideal from a business layer point of view, as you would then need some way to track which data is stored in which table, but that's a completely separate problem altogether.
Failing that your only option past execution plans is to scale up your server platform.
I am trying to decide on the best method for audit logging within my application. The main reason for the log is reporting the sequence of events (changes).
I have a hierarchy of objects, and I need to create reports at a later date when something changes on any part of that hierarchy.
I think that I have three options:
Have a log for each table and therefore matching the hierarchy of objects then creating a view for the report.
Flatten the hierarchy and de-normalise the table, making reporting easier - simple select statement.
Have one log table and have a record for each change making reporting harder but more flexible to changes.
I am currently leaning towards option 1.
I have to speak to this subject even though it's old.
It is usually a poor idea to have only one audit table as you will create locking problems in the database as everything hits that table. Use separate audit tables for each table.
It is also a poor idea to have the application do the auditing. Auditing must be done at the database level or you risk losing some of the information. In most databases, data does not change only through applications; no one is going to change the prices of all their products one at a time from the user interface when you need a 10% increase to all 10,000,000 of them. Auditing should capture all changes, not just some of them. In most databases this should be done in a trigger (SQL Server 2008 has a built-in auditing function). Some of the worst possible changes (employees committing fraud or wanting to maliciously destroy data) also frequently come from places other than the application, especially if you allow table-level access to users (which you should not do in any financial database or one that contains personal information). Auditing from the application won't catch this. Developers often forget that in protecting their data, outside sources are not the only threat.
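A small sketch of trigger-based auditing with a per-table audit table, using Python's built-in `sqlite3` as a stand-in for whatever database is actually in use. The `products` and `products_audit` tables and the trigger name are hypothetical, and trigger syntax differs between database products. Prices are stored as integer cents so the 10% increase stays exact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, price_cents INTEGER);

    -- Separate audit table for this one source table.
    CREATE TABLE products_audit (
        product_id      INTEGER,
        old_price_cents INTEGER,
        new_price_cents INTEGER,
        changed_at      TEXT DEFAULT CURRENT_TIMESTAMP
    );

    -- Fires for every updated row, whoever issued the UPDATE.
    CREATE TRIGGER products_price_audit
    AFTER UPDATE OF price_cents ON products
    FOR EACH ROW
    BEGIN
        INSERT INTO products_audit (product_id, old_price_cents, new_price_cents)
        VALUES (OLD.id, OLD.price_cents, NEW.price_cents);
    END;

    INSERT INTO products (id, price_cents) VALUES (1, 1000), (2, 2000);
""")

# A bulk 10% increase issued directly in SQL, bypassing any application code.
conn.execute("UPDATE products SET price_cents = price_cents * 110 / 100")
conn.commit()

audit_rows = conn.execute(
    "SELECT product_id, old_price_cents, new_price_cents "
    "FROM products_audit ORDER BY product_id"
).fetchall()
print(audit_rows)  # prints "[(1, 1000, 1100), (2, 2000, 2200)]"
```

The application never had to know about the audit table, which is the point being made: the bulk update is still captured row by row.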
An audit log is basically a chronological list of events that occurred, who performed these events, and what the events were.
I think a flat view would be better as it can be easily ordered and queried. So I'm leaning more towards your option #2/#3.
Include things like the transaction type, the time, the user id, a description of what's changed, and other pertinent information related to your product.
You can also add things to your product over time and you won't need to continually modify your audit log module.
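As a sketch of what such a flat audit record might look like, here is a small in-memory model. The class and field names are illustrative only (the answer lists the kinds of fields, not their names), and a real log would be persisted rather than kept in a Python list.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class AuditEntry:
    """One flat audit record: when, who, what kind of change, and details."""

    occurred_at: datetime
    user_id: int
    transaction_type: str
    description: str


class AuditLog:
    """Append-only list of entries that can be read back chronologically."""

    def __init__(self) -> None:
        self._entries: List[AuditEntry] = []

    def record(self, entry: AuditEntry) -> None:
        self._entries.append(entry)

    def history(self) -> List[AuditEntry]:
        return sorted(self._entries, key=lambda e: e.occurred_at)

    def by_user(self, user_id: int) -> List[AuditEntry]:
        return [e for e in self.history() if e.user_id == user_id]


log = AuditLog()
log.record(AuditEntry(datetime(2024, 1, 2, tzinfo=timezone.utc), 7, "UPDATE", "price changed"))
log.record(AuditEntry(datetime(2024, 1, 1, tzinfo=timezone.utc), 3, "INSERT", "product created"))
log.record(AuditEntry(datetime(2024, 1, 3, tzinfo=timezone.utc), 7, "DELETE", "product removed"))

print([e.transaction_type for e in log.history()])   # prints "['INSERT', 'UPDATE', 'DELETE']"
print([e.transaction_type for e in log.by_user(7)])  # prints "['UPDATE', 'DELETE']"
```

Because every entry has the same shape, adding a new kind of event to the product only means writing entries with a new `transaction_type`, not changing the log structure.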
If it's for auditing purposes I'd use a true append-only medium rather than a table/tables in the same db.
You suggest it's for change history purposes - in which case I would restructure your application/db to record the actual events in the first place rather than just the current state.
I would go with (2) and (3): create a single table for all Audit entries.
A flat view is good, provided the extra work flattening does not impact performance.
You could look into an AOP framework to help with this. It would allow you to inject logging functionality at the beginning or end of any/all methods. If you go down this road, it might help define what would make sense for storing the log data.