How to debug and improve memory consumption on PowerBI Embedded service - optimization

I am current using PowerBI Embedded service from azure with an A1 unit, which is constantly reaching peak memory consumption and thus causing errors in the visualization of production reports.
1) Is there any way to identify which reports/pages/visuals are consuming the largest share of memory?
2) What would be the overall best strategy (on a high-level, general analysis) to reduce required memory? Would that be reducing the amount of data being loaded, reducing the number of pages, reducing the number of visuals, or any other possible strategy?

You can deploy the report Power BI Premium metrics app, this is for capacities, both Premium and Embedded. It will show dataset memory usage and other metrics on the capacity.
1) Is there any way to identify which reports/pages/visuals are
consuming the largest share of memory?
It will give a good overview of memory usage and whats causing it to time out/evict datasets and reports. Check the link for the full metric lists.
2) What would be the overall best strategy (on a high-level, general
analysis) to reduce required memory? Would that be reducing the amount
of data being loaded, reducing the number of pages, reducing the
number of visuals, or any other possible strategy?
Yes reduce dataset sizes, reports that suck in a number of columns but only use a few of them. Look at badly written queries and data models. For visuals, each visual on a page is a query, each query sucks up memory. I've had issues were people have had 30 visuals on a page, reducing them made it a lot quicker.
Look at the usage, are lots of reports being loaded at once, this can lead to dataset evictions, were it is dumped out of memory as other reports are taking priority. The Metric app will give you some pointers to what is happening, you'll have to take it from there and determine the root cause.
As it is an A sku you can set up an Azure Automation/Logic App to scale up and down the sku, or even pause it when needed. Also A1 & 2 are shared capacity as well not dedicated (A3 onwards) so you may have to account for any noisy neighbour issues in the background, but that will not show up on the metric app.
Hope that helps

Related

Rapidly changing large data processing advise

My team has the following dilemma that we need some architectural/resources advise:
Note: Our data is semi-structured
Over-all Task:
We have a semi-large data that we process during the day
each day this "process" get executed 1-5 times a day
each "process" takes anywhere from 30 minutes to 5 hours
semi-large data = ~1 million rows
each row gets updated anywhere from 1-10 times during the process
during this update ALL other rows may change, as we aggregate these rows for UI
What we are doing currently:
our current system is functional, yet expensive and inconsistent
we use SQL db to store all the data and we retrieve/update as process requires
Unsolved problems and desired goals:
since this processes are user triggered we never know when to scale up/down, which causes high spikes and Azure doesnt make it easy to do autoscale based on demand without data warehouse which we are wanting to stay away from because of lack of aggregates and other various "buggy" issues
because of constant IO to the db we hit 100% of DTU when 1 process begins (we are using Azure P1 DB) which of course will force us to grow even larger if multiple processes start at the same time (which is very likely)
yet we understand the cost comes with high compute tasks, we think there is better way to go about this (SQL is about 99% optimized, so much left to do there)
We are looking for some tool that can:
Process large amount of transactions QUICKLY
Can handle constant updates of this large amount of data
supports all major aggregations
is "reasonably" priced (i know this is an arguable keyword, just take it lightly..)
Considered:
Apache Spark
we don't have ton of experience with HDP so any pros/cons here will certainly be useful (does the use case fit the tool??)
ArangoDB
seems promising.. Seems fast and has all aggregations we need..
Azure Data Warehouse
too many various issues we ran into, just didn't work for us.
Any GPU-accelerated compute or some other high-end ideas are also welcome.
Its hard to try them all and compare which one fits the best, as we have a fully functional system and are required to make adjustments to whichever way we go.
Hence, any before hand opinions are welcome, before we pull the trigger.

Optaplanner - large datasets with millions of rows

There are a couple of threads discussing the scalability of Optaplanner, and I am wondering what's the recommended approach to deal with very large datasets when it comes to millions of rows?
As this blog discussed I am already using heuristic (Simulated Annealing + Tabu Search). The search space of cloud balancing problem is c^p, but the feasible space is unknown/NP-complete.
http://www.optaplanner.org/blog/2014/03/27/IsTheSearchSpaceOfAnOptimizationProblemReallyThatBig.html
The problem I am trying to solve is similar to cloud balancing. But the main difference is in the input data, besides a list of computers and a list of processes, there is also a big two dimensional 'score list/table' which has the scores for each possible combinations that needs to be loaded into memory.
In other words, except for the constraints between computers and processes that the planning needs to satisfy, different valid combinations yield various scores and the higher the score the better.
It's a simple problem but when it comes to hundreds of computers, 100k+ processes and the score table has a million+ combinations, it needs a lot of memory. Even though I could allocate more memory to increase the heap size, the planning could become very slow and struggling, as the steps are sorted with custom planning variable/entity comparator classes.
A straight-forward solution is to divide the dataset into smaller subsets, run each of them individually and then combine the results, so that I could have multiple machines to run at the same time and each machine runs on multi-threads. The biggest drawback of this approach is the result produced is far away from optimal.
I am wondering is there any other better solutions?
The MachineReassignment example also has a big "score combination" matrix. OptaPlanner doesn't care about that - those are just problem facts and the DRL quickly matches the combination(s) that is picked for an assignment. The Solver.solve() causes no big memory consumption or performance impact.
However, loading the problem in your code (before calling Solver.solve()) does cause a huge memory consumption: Understand that if n = 20k, then n² = 400m and an int takes of up 4 bytes, so for 20 000 elements that matrix is 1.6 GB in its most efficient uncompressed form int[][] (both in Java and C++!). So for 20k reserve 2GB RAM, for 40k reserve 8GB RAM for 80k reserve 32 GB RAM. That scales badly.
As for dealing with these big problems, I use combinations of techniques such as Nearby Selection (see my blog article on that), Partitioned Search (what you described, it will be supported out of the box in 7, but I 've implemented it for customers in a CustomPhase), Limited Selection Construction Heuristics (need to research that further, the plumbing is there, usually overkill), ... Partitioned Search does indeed exclude optimal solutions, but above 10k planning entities the trade-off quality vs time taking clearly favors Partitioned Search given a reasonable solving time (minutes/hours/days instead of millenia). The trick is to keep the size of each partition big enough, above 1k entities (hence the use NearbySelection). Score calculation speed also matters a lot, of course.

RRDtool what use are multiple RRAs?

I'm trying to implement rrdtool. I've read the various tutorials and got my first database up and running. However, there is something that I don't understand.
What eludes me is why so many of the examples I come across instruct me to create multiple RRAs?
Allow me to explain: Let's say I have a sensor that I wish to monitor. I will want to ultimately see graphs of the sensor data on an hourly, daily, weekly and monthly basis and one that spans (I'm still on the fence on this one) about 1.5 yrs (for visualising seasonal influences).
Now, why would I want to create an RRA for each of these views? Why not just create a database like this (stepsize=300 seconds):
DS:sensor:GAUGE:600:U:U \
RRA:AVERAGE:0.5:1:160000
If I understand correctly, I can then create any graph I desire, for any given period with whatever resolution I need.
What would be the use of all the other RRAs people tell me I need to define?
BTW: I can imagine that in the past this would have been helpful when computing power was more rare. Nowadays, with fast disks, high-speed interfaces and powerful CPUs I guess you don't need the kind of pre-processing that RRAs seem to be designed for.
EDIT:
I'm aware of this page. Although it explains about consolidation very clearly, it is my understanding that rrdtool graph can do this consolidation aswell at the moment the data is graphed. There still appears (to me) no added value in "harvest-time consolidation".
Each RRA is a pre-consolidated set of data points at a specific resolution. This performs two important functions.
Firstly, it saves on disk space. So, if you are interested in high-detail graphs for the last 24h, but only low-detail graphs for the last year, then you do not need to keep the high-detail data for a whole year -- consolidated data will be sufficient. In this way, you can minimise the amount of storage required to hold the data for graph generation (although of course you lose the detail so cant access it if you should want to). Yes, disk is cheap, but if you have a lot of metrics and are keeping low-resolution data for a long time, this can be a surprisingly large amount of space (in our case, it would be in the hundreds of GB)
Secondly, it means that the consolidation work is moved from graphing time to update time. RRDTool generates graphs very quickly, because most of the calculation work is already done in the RRAs at update time, if there is an RRA of the required configuration. If there is no RRA available at the correct resolution, then RRDtool will perform the consolidation on the fly from a high-granularity RRA, but this takes time and CPU. RRDTool graphs are usually generated on the fly by CGI scripts, so this is important, particularly if you expect to have a large number of queries coming in. In your example, using a single 5min RRA to make a 1.5yr graph (where 1pixel would be about 1 day) you would need to read and process 288 times more data in order to generate the graph than if you had a 1-day granularity RRA available!
In short, yes, you could have a single RRA and let the graphing work harder. If your particular implementation needs faster updates and doesnt care about slower graph generation, and you need to keep the detailed data for the entire time, then maybe this is a solution for you, and RRDTool can be used in this way. However, usually, people will optimise for graph generation and disk space, meaning using tiered sets of RRAs with decreasing granularity.

What is the best way to store highly parametrized entities?

Ok, let met try to explain this in more detail.
I am developing a diagnostic system for airplanes. Let imagine that airplanes has 6 to 8 on-board computers. Each computer has more than 200 different parameters. The diagnostic system receives all this parameters in binary formatted package, then I transfer data according to the formulas (to km, km/h, rpm, min, sec, pascals and so on) and must store it somehow in a database. The new data must be handled each 10 - 20 seconds and stored in persistence again.
We store the data for further analytic processing.
Requirements of storage:
support sharding and replication
fast read: support btree-indexing
NOSQL
fast write
So, I calculated an average disk or RAM usage per one plane per day. It is about 10 - 20 MB of data. So an estimated load is 100 airplanes per day or 2GB of data per day.
It seems that to store all the data in RAM (memcached-liked storages: redis, membase) are not suitable (too expensive). However, now I am looking to the mongodb-side. Since it can utilize as RAM and disk usage, it supports all the addressed requirements.
Please, share your experience and advices.
There is a helpful article on NOSQL DBMS Comparison.
Also you may find information about the ranking and popularity of them, by category.
It seems regarding to your requirements, Apache's Cassandra would be a candidate due to its Linear scalability, column indexes, Map/reduce, materialized views and powerful built-in caching.

When should we store images in database?

I have a table of productList in which i have 4 column, now i have to store image for each row so i have two option for this..
Store image in data base.
Save images in a folder and store only path on table.
So my question is which one is better in this situation and why ?
Microsoft Research published quite an extensive paper on the subject, called To Blob Or Not To Blob.
Their synopsis is:
Application designers often face the question of whether to store large objects in a filesystem or in a database. Often this decision is made for application design simplicity. Sometimes, performance measurements are also used. This paper looks at the question of fragmentation – one of the operational issues that can affect the performance and/or manageability of the system as deployed long term. As expected from the common wisdom, objects smaller than 256K are best stored in a database while objects larger than 1M are best stored in the filesystem. Between 256K and 1M, the read:write ratio and rate of object overwrite or replacement are important factors. We used the notion of “storage age” or number of object overwrites as way of normalizing wall clock time. Storage age allows our results or similar such results to be applied across a number of read:write ratios and object replacement rates.
It depends -
You can store images in DB if you know that they wont increase in size very often. This has its advantage when you are deploying your systems or migrating to new servers. you dont have to worry about copying images seperately.
If the no. of rows increase very frequently on that system, and the images get bulkier, then its good to store on the file system and have a path stored in database for later retrieval. This also will keep you on toes when migrating your servers where you have to take care of copying the images from filepath seperately.