Spark - Automated Deployment & Performance Testing

We are developing an application which uses Spark & Hive to do static and ad-hoc reporting. These static reports take a number of parameters and then run over a data set. We would like to make it easier to test the performance of these reports on a cluster.
We have a test cluster running with a sufficient sample data set which developers can share. To speed up development time, what is the best way to deploy a Spark application to a Spark cluster (in standalone mode) from an IDE?
I'm thinking we would create an SBT task which would run the spark-submit script. Is there a better way?
Eventually this will feed into some automated performance testing which we plan to run as a twice-daily Jenkins job. If it's an SBT deploy task, it is easy to call from Jenkins. Is there a better way to do this?
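Concretely, I'm imagining something like this in build.sbt (an untested sketch; the master URL, main class, and jar packaging are placeholders for whatever the real build uses):

    import scala.sys.process._

    // Custom task: package the application jar and spark-submit it to the test cluster.
    lazy val sparkSubmit = taskKey[Unit]("Package the app and submit it to the standalone test cluster")

    sparkSubmit := {
      val jar    = (Compile / packageBin).value  // or the sbt-assembly output if a fat jar is needed
      val master = sys.env.getOrElse("SPARK_MASTER", "spark://test-cluster:7077")  // placeholder URL
      val cmd = Seq(
        "spark-submit",
        "--master", master,
        "--class", "com.example.reports.ReportRunner",  // hypothetical main class
        jar.getAbsolutePath
      )
      val exitCode = Process(cmd).!  // run spark-submit and wait for completion
      if (exitCode != 0) sys.error(s"spark-submit failed with exit code $exitCode")
    }

Jenkins could then invoke the same task with "sbt sparkSubmit".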

I've found a project on GitHub; maybe you can get some inspiration from it.
Maybe just add a for loop for submitting jobs and increase the number of iterations until you find the performance limit; I'm not sure whether that's right or not. A rough sketch of the idea follows.
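As an illustration of that loop (a sketch only, not taken from the project above; the master URL and jar name are placeholders), you could ramp up the number of parallel submissions until throughput degrades:

    import scala.sys.process._

    // Ramp-up sketch: submit the same report job with increasing parallelism
    // and watch where the cluster stops keeping up.
    for (parallelism <- Seq(1, 2, 4, 8)) {
      println(s"Submitting $parallelism concurrent jobs")
      val running = (1 to parallelism).map { _ =>
        Process(Seq("spark-submit",
          "--master", "spark://test-cluster:7077",   // placeholder master URL
          "report-assembly.jar")).run()              // placeholder application jar
      }
      val exitCodes = running.map(_.exitValue())     // exitValue() blocks until each run finishes
      println(s"Exit codes: ${exitCodes.mkString(", ")}")
    }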

Related

Automating SQL script runs in AWS Redshift environments (dev, preprod & prod)

I wish to automate runs of SQL (DDLs and DMLs) against the AWS Redshift cluster, i.e. as soon as someone merges a SQL file into the S3 bucket it should run in the configured environment, say dev, preprod or prod.
Is there any way I can do this?
My investigation suggests that AWS CodePipeline is one solution; however, I am not sure how I would connect to the Redshift database from CodePipeline.
Another way is to use a Lambda function, but it has a limit of 5 minutes, I believe, and some of the DDL/DML might take more than 5 minutes to run.
Regards,
Shay
There are a lot of choices out there and which is best will depend on many factors including your team's skillset and your budget. I'll let the community weigh in on all the possibilities.
I would advise using the AWS serverless ecosystem to perform these functions. First off, the Lambda limit is now 15 minutes, but this really isn't the important part. The most important development is the Redshift Data API, which lets you start queries from one Lambda and have other Lambdas check on completion later. See: https://docs.aws.amazon.com/redshift/latest/mgmt/data-api.html
With the Redshift Data API for fire-and-forget access to Redshift and Step Functions to orchestrate the Lambda functions, you can create a low-cost, lightweight infrastructure to perform all sorts of integrations and actions. These can include triggering other tools / services as needed. This is not the best approach in all cases, but Lambda-based solutions should not be excluded due to run-time limits.
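To make the fire-and-forget pattern concrete, here is a minimal sketch against the Redshift Data API using the AWS SDK v2 (the cluster identifier, database, and user below are placeholders):

    import software.amazon.awssdk.services.redshiftdata.RedshiftDataClient
    import software.amazon.awssdk.services.redshiftdata.model.{DescribeStatementRequest, ExecuteStatementRequest}

    object RedshiftSql {
      private val client = RedshiftDataClient.create()

      // Submit the SQL and return immediately with a statement id (fire-and-forget).
      def submit(sql: String): String = {
        val request = ExecuteStatementRequest.builder()
          .clusterIdentifier("dev-cluster")   // placeholder
          .database("analytics")              // placeholder
          .dbUser("deploy_user")              // placeholder
          .sql(sql)
          .build()
        client.executeStatement(request).id()
      }

      // A later invocation (e.g. a polling state in Step Functions) checks completion.
      def status(statementId: String): String = {
        val request = DescribeStatementRequest.builder().id(statementId).build()
        client.describeStatement(request).statusAsString()  // e.g. SUBMITTED, STARTED, FINISHED, FAILED
      }
    }

One Lambda calls submit and passes the statement id into the state machine; a second Lambda polls status until the statement finishes.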

How to use JMeter with a timer

I am having a problem with JMeter: using it with a Timer causes JMeter to crash.
The case is: I want to create a load of requests to be executed every half hour.
Is that something you can do with JMeter?
Every time I try it, JMeter keeps loading, hangs, and requires a shutdown.
If you want to leave JMeter up and running forever, make sure to follow JMeter Best Practices, as certain test elements might cause memory leaks.
If you need to create "spikes" of load every 30 minutes, it might be a better idea to use your operating system's scheduling mechanisms to execute "short" tests every half hour, like:
Windows Task Scheduler
Unix cron
MacOS launchd
Or, even better, go for a Continuous Integration server like Jenkins: it has a very powerful trigger mechanism that allows defining flexible criteria for when to start the job, and you can also benefit from the Performance Plugin, which can automatically mark a build as unstable or failed depending on test metrics, and can build performance trend charts.

Running load testing on Selenium and API tests in Visual Studio Team Services

I am trying to run load tests on my existing Selenium web tests and my API (unit) tests. The tests run in Visual Studio using the load test editor, but it does not collect all the metrics, like response time and requests per second. Are there any additional parameters that I need to add to collect all the metrics?
Load testing: how many Selenium clients are you running? One or two will not generate much load. That is the first issue to think about: you need load generators, and Selenium is a poor way to go about this (unless you are running a headless grid, but even then).
So what is the target server, Windows Server 2012? Google "Create a Data Collector Set to Monitor Performance Counters".
Data collection, and analysis of the same, is your second issue to think about. People pay loads of money for tools like LoadRunner because they provide load generators, sophisticated data collection of servers, databases, WANs and LANs, and analysis reports to pinpoint bottlenecks. Doing this manually is hard and not easily repeatable. Most folks who start down your path eventually abandon it. Look into the various load/performance tools to see which works best for you and which you can afford.

How to proceed with query automation using Import.io

I've successfully created a query with the Extractor tool found in Import.io. It does exactly what I want it to do; however, I now need to run this once or twice a day. Is the purpose of Import.io as an API to allow me to build logic such as data storage and scheduled tasks (running queries multiple times a day) into my own application, or are there ways to schedule queries and make use of long-term storage of my results completely within the Import.io service?
I'm happy to create a Laravel or Rails app to make requests to the API and store the information elsewhere, but if I'm reinventing the wheel by doing so and they provide the means to address this, then that is a true time saver.
Thanks for using the new forum! Yes, we have moved this over to Stack Overflow to maximise the community atmosphere.
At the moment, Import does not have the ability to schedule crawls. However, this is something we are going to roll out in the near future.
For the moment, you can set up a cron job to run the query when you specify.
Another solution, if you are using the free version, is to use a CI tool like Travis or Jenkins to schedule your API scripts.
You can query the extractors live, so you don't need to run them manually every time. Each call will consume one request from your limit.
The endpoint you can use is:
https://extraction.import.io/query/extractor/extractor_id?_apikey=apikey&url=url
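As an example, a minimal live query from Scala (the extractor id and target URL below are placeholders; the endpoint shape is the one shown above):

    import scala.io.Source
    import java.net.URLEncoder

    val apiKey      = sys.env("IMPORT_IO_API_KEY")           // assumed to hold your API key
    val extractorId = "your-extractor-id"                    // placeholder
    val target      = URLEncoder.encode("https://example.com/page", "UTF-8")
    val endpoint    = s"https://extraction.import.io/query/extractor/$extractorId?_apikey=$apiKey&url=$target"

    // Fire the live query and print the raw JSON response.
    println(Source.fromURL(endpoint).mkString)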
Unfortunately the script will not be a very simple one, since most websites return very different response structures to import.io, and as you may already know, the premium version of the tool now provides scheduling capabilities.

Can TeamCity tests be run asynchronously

In our environment we have quite a few long-running functional tests which currently tie up build agents and force other builds to queue. Since these agents are only waiting on test results they could theoretically just be handing off the tests to other machines (test agents) and then run queued builds until the test results are available.
For CI builds (including unit tests) this should remain inline as we want instant feedback on failures, but it would be great to get a better balance between the time taken to run functional tests, the lead time of their results, and the throughput of our collective builds.
As far as I can tell, TeamCity does not natively support this scenario so I'm thinking there are a few options:
Spin up more agents and assign them to a 'Test' pool. Trigger functional build configs to run on these agents (triggered by successful CI builds). While this seems the cleanest, it doesn't scale very well: we then have a lead time on purchasing licenses, and we will often need to run tests in alternate environments, which would temporarily double (or more) the required number of test agents.
Add builds or build steps to launch tests on external machines, then immediately mark the build as successful so queued builds can be processed; then, when the tests are complete, update the build as succeeded/failed. This is reliant on being able to update the results of a previous build (REST API perhaps?). It also feels ugly to mark something as successful and then update it as failed later, but we could always be selective in what we monitor so we only see the final result.
Just keep spinning up agents until we no longer have builds queueing. The problem with this is that it's a moving target. If we knew where the plateau was (or whether it existed) this would be the way to go, but our usage pattern means this isn't viable.
Has anyone had success with a similar scenario, or knows pros/cons of any of the above I haven't thought of?
Your description of the available options seems to be pretty accurate.
If you want live updates of build progress, you will need to have one TeamCity agent "busy" for each running build.
The only downside here seems to be the agent license cost.
If the testing builds just launch processes on other machines, the TeamCity agent processes themselves can run on a low-end machine, and you can even run many agents on a single computer.
An extension to your second scenario could be two build configurations instead of a single one: the first would start the external process, and the second would be triggered on the external process's completion and then publish all of the external process's results as its own. It could also have a snapshot dependency on the starting build to maintain the relation.
For anyone curious, we ended up buying more agents and assigning them to a test pool. Investigation proved that it isn't possible to update build results (I can definitely understand why this ugliness wouldn't be supported out of the box).