What tools exist for benchmarking Cascading for Hadoop routines? - optimization

I have been given a multi-step Cascading program that takes about ten times as long to run as an equivalent M/R job. How do I go about figuring out which of the steps is running slowest, so I can target it for optimization?

Not a complete answer, but enough to get you started, I think. You need to generate a graphical representation of the MapReduce workflow for your job; see this page for an example: http://www.cascading.org/multitool/. Cascading can also write its flow plan to a DOT file (see Flow#writeDOT), which you can render with Graphviz. The graph should help you figure out where the bottleneck is.

Related

GUROBI only uses single core to setup problem with cvxpy (python)

I have a large MILP that I build with cvxpy and want to solve with GUROBI. When I call cvxpy's solve() function, it takes a really long time to set up and does not start solving for hours. While it does that, only one core of my cluster is being used, at 100%. I would like to use multiple cores to build the model so that the build does not take so long. Running grbprobe also shows that Gurobi knows about the other cores, and it does use multiple cores for solving the problem.
I have tried running with different flags, e.g. turning presolve off and on, or setting the number of Threads to be used (which only seems to apply to the solving).
I also reduced the number of constraints in the problem, and it started solving much faster, which means this is definitely not a problem with the model itself.
In its normal state the problem has about 2,200 constraints; I reduced it to 150 and it took only a couple of seconds until it started searching for a solution.
The problem is that I don't see anything, since it takes so long to reach the "set username parameters" message, and I get no information about what the computer is doing in the meantime.
Is there a way to tell GUROBI or CVXPY that it can use more CPUs for the build-up?
Is there another way to solve this problem?
Sorry. The first part of the solve (cvxpy model generation, setup, presolving, scaling, solving the root relaxation, preprocessing) is almost completely serial. The parallel part begins when the solver really starts working on the branch-and-bound tree. For many problems the parallel part is by far the most expensive, but not for all.
This is not only the case for Gurobi; other high-end solvers behave the same way.
There are options to do less presolving and preprocessing. That may get you into the B&B phase earlier. However, it is usually better not to touch these options.
Running with verbose=True may give you more information. If you have more detailed questions, you may want to share the log.

Parallel Write Operations in GraphDB Free vs. Standard Edition

I am trying to run several SPARQL queries in parallel in Ontotext GraphDB. These queries are identical except for the named graph which they read from. I've attempted a multithreading solution in Scala to launch 3 queries against the database in parallel (see picture below).
The issue is that I am using the Free Edition of GraphDB, which only supports a single core for write operations. What this seems to mean is that the queries which are supposed to run in parallel basically just queue up to run against the single core. As you can see, the first query has completed 41,145 operations in 12 seconds, but no operations have completed on the two other queries. Once the first query completes, the second query will run to completion, and once that completes, the third will run.
I understand this is likely expected behavior for the Free Edition. My question is, will upgrading to the Standard Edition fix this problem and allow the queries to actually run in parallel? Based on the documentation I've looked at, it seems that multiple cores can be made available for the Standard Edition to complete write operations. However, I also saw something which implied that single write queries launched against the Standard Edition would automatically be processed over multiple cores, which might make the multithreading approach obsolete anyway?
Anyone have experience with launching parallel write operations against GraphDB and is able to weigh in?
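For reference, the multithreading approach described above can be sketched roughly like this (in Python rather than Scala; run_update is a hypothetical stand-in for a POST to the repository's SPARQL update endpoint):

```python
from concurrent.futures import ThreadPoolExecutor

GRAPHS = ["urn:graph:1", "urn:graph:2", "urn:graph:3"]

def run_update(graph):
    # Hypothetical stand-in: in practice this would POST a SPARQL update
    # that reads from / writes to the given named graph.
    return f"finished update on {graph}"

# Three client-side threads; with the Free Edition's single write core,
# the updates will still queue up inside the database.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_update, GRAPHS))

print(results)
```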
You can find the difference on the official GraphDB 9.1 benchmark statistics page:
http://graphdb.ontotext.com/documentation/standard/benchmark.html.

Developing and deploying jobs in Apache Flink

We started to develop some jobs with Flink. Our current dev/deployment process looks like this:
1. develop code in local IDE and compile
2. upload jar-file to server (via UI)
3. register new job
However, it turns out that the generated jar file is ~70 MB and the upload takes a few minutes. What is the best way to speed up development (e.g. using an on-server IDE)?
One solution is to use a version control system: after committing and pushing your changes, build the jar on the server itself. You could write a simple script for this.
The other solution, which takes more time and effort, is to set up a CI/CD pipeline that automates the entire process and minimises manual effort.
Also, try to avoid building a fat jar, to minimise the jar size, if you have to scp your jar to the cluster.
First off, uploading 70 MB shouldn't take minutes nowadays. You might want to check your network configuration. Of course, if your internet connection is poor, there's not much you can do about it.
In general, I'd try to avoid cluster test runs as much as possible. It's slow and hard to debug. It should only ever be used for performance tests or right before releasing into production.
All logic should be unit tested. The complete job should be integration tested, and ideally you'd also have an end-to-end test. I recommend using a Docker-based approach for external systems, e.g. Testcontainers for Kafka, so that you can run all tests from your IDE.
Going onto the test cluster should then be a rare thing. If you find an issue that has not been covered by your automated tests, add a new test case and fix it locally, so that there is a high probability it is also fixed on the test cluster.
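As a sketch of the "all logic should be unit tested" point: keep the record-level logic out of the operators so it can be tested without a cluster. (Python here for brevity; parse_record is a hypothetical example, and the same structure applies to a Java Flink job, where the map function would just delegate to a plain testable method.)

```python
# Hypothetical record-parsing logic, factored out of the job so a plain
# unit test can cover it -- the Flink operator would only call parse_record.
def parse_record(line):
    parts = line.split(",")
    if len(parts) != 2:
        return None  # invalid records are filtered out, not crashed on
    key, value = parts
    try:
        return key, int(value)
    except ValueError:
        return None

# Unit tests that run instantly in the IDE, no cluster involved.
assert parse_record("a,1") == ("a", 1)
assert parse_record("broken") is None
assert parse_record("a,not-a-number") is None
```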
edit (addressing comment):
If it's much easier for you to write a Flink job to generate data, then it sounds like a viable option. I'm just fearing that you would also need to test that job...
It rather sounds like you want an end-to-end setup where you run several Flink jobs in succession and compare the final results. That's a common setup for complex pipelines consisting of several Flink jobs. However, it's rather hard to maintain and may have unclear ownership if it involves components from several teams. I prefer to solve it by having plenty of metrics (how many records are produced, how many invalid records are filtered in each pipeline, ...) and specific validation jobs that just assess the quality of the (intermediate) output (potentially involving humans).
So from my experience, either you can test the logic in some local job setup, or the system is so big that you spend much more time setting up and maintaining the test setup than actually writing production code. In the latter case, I'd rather trust and invest in the monitoring and quality-assurance capabilities of (pre-)production, which you need to have anyway.
If you really just want to test one Flink job with another, you can also run the Flink job in Testcontainers, so technically it's not an alternative but an addition.

OptaPlanner: Gaps in Chained Through Time Pattern

I recently started learning OptaPlanner, so please pardon any technically inaccurate descriptions below.
Basically, I have to assign several tasks to a bunch of machines. Tasks have precedence restrictions, such that some tasks cannot be started before the end of another task. In addition, each task can only run on certain machines. The goal is to minimize the makespan of all these tasks.
I modeled this problem with Chained Through Time Pattern in which each machine is the anchor. But the problem is that tasks on certain machine might not be executed sequentially due to the precedence restriction. For example, Task B can only be started after Task A completes while Tasks A and B are executed on machines I and II respectively. This means during the execution of Task A on machine I, if there is no other task that can be run on machine II, then machine II can only keep idle until Task A completes at which point Task B could be started on it. This kind of gap is not deterministic as it depends on the duration of Task A with respect to this example. According to the tutorial of OptaPlanner, it seems that additional planning variable gaps should be introduced for this kind of problem. But I have difficulty in modeling this gap variable now. In general, how to integrate the gap variable in the model using Chained Through Time Pattern? Some detailed explanation or even a simple example would be highly appreciated.
Moreover, I'm actually not sure whether chained through time pattern is suitable for modeling this kind of task assigning problem or I just used an entirely inappropriate method. Could someone please shed some light on this? Thanks in advance.
I'm using the chained through time pattern to solve the same kind of problem as yours. To handle the precedence restriction, you can write Drools rules.

Setting up a lab for developers performance testing

Our product has earned a bad reputation in terms of performance. It's a big, 13-year-old enterprise application that needs a refresh, and specifically a boost in its performance.
We decided to address the performance problem strategically in this version. We are evaluating a few options on how to do that.
We do have experienced load test engineers equipped with the best tools on the market, but they usually get a stable release late in the version's development life cycle, so in the last few versions developers didn't have enough time to fix all their findings. (Yes, I know we need to deliver stable versions earlier; we are working on that process as well, but it's not in my area.)
One of the directions I am pushing is to set up a lab environment installed with the nightly build so developers can test the performance impact of their code.
I'd like this environment to be constantly loaded by scripts simulating real users' activity. On this loaded environment, each developer will have to write a specific script that tests their code (i.e. a single user's experience in a real-world environment). I'd like to generate a report that shows each iteration's impact on existing features, as well as the performance of new features.
I am a bit worried that I'm aiming too high and that it will turn out to be too complicated.
What do you think of such an idea?
Does anyone have an experience with setting up such an environment?
Can you share your experience?
It sounds like a good idea, but in all honesty, if your organisation can't get a build to the expensive load test team it has employed just for this purpose, then it will never get your idea working.
Go for the low hanging fruit first. Get a nightly build available to the performance testing team earlier in the process.
In fact, if this version is all about performance, why not have the team just take this version to address all the performance issues that came late in the iteration for the last version.
EDIT: "Don't developers have a responsibility to performance test code?" was a comment. Yes, true. I personally would give every developer a copy of the YourKit Java profiler (it's cheap and effective) and make sure they know how to use it. However, performance tuning is a really, really fun technical activity, and it is possible to spend a lot of time on it when you would be better off developing features.
If your developer team repeatedly produces noticeably slow code, then education on performance, or better programmers, is the answer, not a more expensive process.
One of the biggest boosts in productivity is an automated build system that runs overnight (this is called Continuous Integration). Errors made yesterday are caught early the next morning, when I'm still fresh and might still remember what I did yesterday (instead of several weeks or months later).
So I suggest to make this happen first because it's the very foundation for anything else. If you can't reliably build your product, you will find it very hard to stabilize the development process.
After you have done this, you will have all the knowledge necessary to create performance tests.
One piece of advice though: Don't try to achieve everything at once. Work one step at a time, fix one issue after the other. If someone comes up with "we must do this, too", you must do the same triage as you do with any other feature request: How important is this? How dangerous? How long will it take to implement? How much will we gain?
Postpone hard but important tasks until you have sorted out the basics.
Nightly builds are the right approach to performance testing. I suggest you require scripts that run automatically each night. Then record the results in a database and provide regular reports. You really need two sorts of reports:
A graph of each metric over time. This will help you see your trends
A comparison of each metric against a baseline. You need to know when something drops dramatically in a day or when it crosses a performance threshold.
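A minimal sketch of that reporting idea, assuming a SQLite store and an illustrative 10% regression threshold (all metric names and numbers here are made up):

```python
import sqlite3

# Store each nightly run's metrics, then flag any metric that regressed
# more than 10% against an agreed-upon baseline.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (run_date TEXT, metric TEXT, value REAL)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [("2024-01-01", "login_ms", 120.0),
     ("2024-01-02", "login_ms", 145.0)],
)

BASELINE = {"login_ms": 125.0}  # illustrative performance thresholds

regressions = [
    (metric, value)
    for metric, value in conn.execute(
        "SELECT metric, value FROM results WHERE run_date = ?", ("2024-01-02",)
    )
    if value > BASELINE[metric] * 1.10  # dropped more than 10% below baseline
]
print(regressions)  # -> [('login_ms', 145.0)]
```

The trend graphs come from the same table: select one metric's values ordered by run_date and plot them.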
A few other suggestions:
Make sure your machines vary similarly to your intended environment: have both low-end and high-end machines in the pool.
Once you start measuring, never change the machines. You need to compare like with like. You can add new machines, but you can't modify any existing ones.
We built a small test bed to do sanity testing, i.e. did the app fire up and work as expected when the buttons were pushed, did the validation work, etc. Ours was a web app, and we used Watir, a Ruby-based toolkit, to drive the browser. The output from those runs was written as XML documents, and our CI tool (CruiseControl) could output the results, errors, and performance as part of each build log. The whole thing worked well and could have been scaled onto multiple PCs for proper load testing.
However, we did all that because we had more bodies than tools. There are big-end stress-test harnesses that will do everything you need. They cost money, but that will be less than the time spent hand-rolling your own. Another issue we had was getting our devs to write Ruby/Watir tests; in the end that fell to one person, and the testing effort became a bottleneck because of it.
Nightly builds are excellent, and lab environments are excellent, but you're in danger of muddling performance testing with straight-up bug testing, I think.
Ensure your lab conditions are isolated and stable (i.e. you vary only one factor at a time, whether that's your application or a windows update) and the hardware is reflective of your target. Remember that your benchmark comparisons will only be bulletproof internally to the lab.
Test scripts written by the developers who wrote the code tend to be a toxic thing. They don't help you drive out misunderstandings in the implementation (since the same misunderstanding will be in the test script), and there is limited motivation to actually find problems. Far better is to take a TDD approach and write the tests first as a group (or have a separate group write them); failing that, you can still improve the process by writing the scripts collaboratively. Hopefully you have some user stories from your design stage, and it may be possible to replay logs for real-world experience (with the app varying).