Related to speed of execution of a job in Amazon Elastic MapReduce - amazon-emr

My task is:
1) Import the data from MS SQL Server into HDFS using Sqoop.
2) Process the data with Hive and generate the result in one table.
3) Export that result table from Hive back to MS SQL Server.
I want to perform all of this using Amazon Elastic MapReduce.
The data I am importing from MS SQL Server is very large (about 500,000 rows per table, and I have 30 such tables). For this I have written a Hive task that consists only of queries, and each query uses a lot of joins. As a result, performance on my single local machine is very poor (it takes about 3 hours to execute completely).
I want to reduce that time as much as possible, which is why we decided to use Amazon Elastic MapReduce. I am currently using 3 m1.large instances and still get the same performance as on my local machine.
How many instances do I need to use to improve performance?
Are the instances configured automatically as I add them, or do I need to specify something when submitting the JAR for execution? I ask because with two machines the time is the same.
Is there any other way to improve performance besides increasing the number of instances, or am I doing something wrong when executing the JAR?
Please guide me through this, as I don't know much about Amazon's servers.
Thanks.

You could try Ganglia, which can be installed on your EMR cluster using a bootstrap action. It will give you metrics on the performance of each node in the cluster and may help you optimise towards the right-sized cluster:
http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_Ganglia.html
If you use the EMR Ruby client on your local machine, you can set up an SSH tunnel that lets you view the Ganglia web interface in Firefox (you'll also need to set up FoxyProxy, as described here: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/emr-connect-master-node-foxy-proxy.html).
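Beyond monitoring, much of the gain on a job like this usually comes from tuning the Hive scripts themselves rather than just adding nodes. As a rough, untested sketch (the reducer count below is only a placeholder), settings like these at the top of a query file enable parallel stage execution and map-side joins:
-- Run independent stages of a query in parallel instead of one at a time
SET hive.exec.parallel=true;
-- Let Hive convert joins against small tables into map-side joins
SET hive.auto.convert.join=true;
-- Compress intermediate map output to cut shuffle I/O
SET hive.exec.compress.intermediate=true;
-- Placeholder value: size the reducer count to the cluster instead of relying on the default
SET mapred.reduce.tasks=12;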

Related

Index creation time is high on Azure Managed Instance

I am working with Azure Managed Instances for hosting a data warehouse. For the large table loads, the indexes are dropped and rebuilt instead of inserting with the indexes in place. The indexes are re-created by a stored procedure that builds them from a list kept in an admin table. When moving from our on-prem solution to the managed instance, we saw a considerable decrease in performance when building the indexes: the process takes roughly twice as long in Azure as it does on-prem.
The specs for the Azure Managed Instance are higher than the on-prem server's: more cores and more memory. We have looked at IO time and tried increasing file sizes to raise the IO limits, but it has had minimal impact.
Why would it take longer to build indexes on the same data using the same code on an Azure Managed Instance than it does on an on-prem SQL Server?
Is there a setting or configuration in Azure that could be changed to improve performance?
Could you please check the transaction log file for the database? Monitor log space use with sys.dm_db_log_space_usage. This DMV returns information about the amount of log space currently used and indicates when the transaction log needs truncation. See sys.dm_db_log_space_usage (Transact-SQL) - SQL Server | Microsoft Docs.
Since creating an index can easily hit the throughput limit for either the data or the log files, you might need to increase individual file sizes. See Resource limits - Azure SQL Managed Instance | Microsoft Docs.
You can also use this script, managed-instance/MI-GP-storage-perf.sql at master · dimitri-furman/managed-instance · GitHub, to determine the IOPS/throughput seen against each database file.
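As a minimal sketch of those checks (the DMV and DMF names are documented; the thresholds that matter will depend on your workload):
-- How much of the transaction log is in use while the index build runs
SELECT total_log_size_in_bytes / 1048576 AS total_log_mb,
       used_log_space_in_bytes / 1048576 AS used_log_mb,
       used_log_space_in_percent
FROM sys.dm_db_log_space_usage;
-- Per-file I/O stalls for the current database, to see whether the data or log files are hitting their throughput limits
SELECT df.name AS file_name,
       vfs.num_of_reads, vfs.num_of_writes,
       vfs.io_stall_read_ms, vfs.io_stall_write_ms
FROM sys.dm_io_virtual_file_stats(DB_ID(), NULL) AS vfs
JOIN sys.database_files AS df ON vfs.file_id = df.file_id;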

Presto query performance via Hue on AWS EMR

I am currently trying to set up a Presto connection through Hue on an AWS EMR cluster (release 5.24.0).
By default, AWS sets up the connection using the jdbc interface.
The problem with this interface is that, for some reason, it loads only the first 1000 records when performing a query of the type
SELECT * FROM table
which is probably due to a current Hue limitation that hardcodes the fetch size to 1000 rows and cannot be changed.
According to Hue's documentation, however, it is advisable to use sqlalchemy as the interface instead of jdbc whenever possible. So I changed the Presto interpreter settings and installed PyHive as documented in the guide.
The fetch-size problem is gone, but even performing a
SELECT COUNT(1) FROM table
can take several minutes. Moreover, the Hue UI shows the query as completed while the Presto UI shows it is still running.
What is worse, if I submit a new query in the Hue Presto editor, the new query is submitted but the previous one keeps running.
Has anyone experienced similar issues? Does changing some Hue/Presto/Hive settings improve anything?
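For what it is worth, the queries that keep running should also be visible from Presto's own system tables (a generic check, not specific to Hue), for example:
-- Queries the Presto coordinator still considers active, even after Hue reports them as done
SELECT query_id, state, query
FROM system.runtime.queries
WHERE state NOT IN ('FINISHED', 'FAILED');
-- Depending on the Presto version, a stuck query can then be cancelled with something like:
-- CALL system.runtime.kill_query(query_id => '<query_id>', message => 'cancelled manually');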

Changing PostgreSQL server changed Django app characteristics

I had to switch an enterprise Django 1.11 site from a corporate-hosted PostgreSQL 9.4 server to an AWS RDS Aurora PostgreSQL 10 cluster. My initial impression was that it would be a straightforward migration, as I was not using any version-specific code.
Immediately after the migration, the site started breaking down horribly. Queries that used to take milliseconds suddenly jumped to 100x the time, causing timeouts all over the gunicorn threads. I also kept seeing connections being dropped by both RDS and Django.
It kept looking as if there were some setting I needed to match between the previous server and the current one, but despite engaging PostgreSQL experts and AWS support, there were no simple answers (or even complex ones). I finally had to fine-tune most queries in my Django code to bring stability to the site.
The app has several queries that traverse foreign-key relationships, so I used prefetch_related and similar tricks to fix the slowdown. For example, a query that used to take 0.5 seconds had jumped to 80 seconds, and after I added prefetch_related it went back to 0.5 seconds.
Even though the site is now stable, I am posting this in the hope that some PostgreSQL and/or Django expert sees this and recognizes this as a symptom of some wrong setting. I am not in a position to share sample queries and am not asking for query optimization. The question is: what would cause a query to become 100x slower when we move from one PostgreSQL server to another, with no change in application code?
In general, Postgres-compatible Aurora has wildly different performance characteristics from vanilla Postgres, and the configuration and tuning for the two can be very different. If you wanted performance characteristics close to your self-hosted Postgres, the easier path would have been AWS RDS for PostgreSQL rather than Aurora PostgreSQL. There are also a number of configuration details you didn't share, including VPC settings, SSL, and so on, that can affect performance between RDS and a self-hosted server.
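One concrete way to narrow this down is to capture the plan for one of the slow queries on both servers and compare them; a plan that flips from an index scan to a sequential scan, or a wildly wrong row estimate after the migration, points at stale planner statistics or different settings rather than at the application code. A hedged sketch, with placeholder table and column names standing in for one of the slow foreign-key lookups:
-- Run on both the old 9.4 server and the Aurora cluster, then diff the plans
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM orders_order
WHERE customer_id = 42;
-- After a migration it is also worth running a plain ANALYZE; to refresh planner statistics.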

Query Terminating in Redshift

We are migrating our database from SQL Server 2012 to Amazon Redshift.
The front end of our application is developed in MicroStrategy (MSTR) which fires the queries on Redshift.
Although the application is working fine in Production (on SQL Server 2012), we have run into a strange issue in our PoC Environment on Redshift.
When we kicked off a dashboard in MSTR, the query from the dashboard hits Redshift and it completes successfully without any issues.
But when we stress-test the application by running all the dashboards simultaneously, that particular dashboard's query terminates in Redshift. The database does not throw any error message, which is why we cannot troubleshoot why the query is terminating.
Can anyone please suggest how we should go about solving this problem?
Thank you
The problem might be that there is a timeout on the queue you are sending the query to, defined in the WLM configuration.
Redshift is designed differently from other databases: it is optimized for analytical queries. For that reason it doesn't cache query results, as an OLTP database would. The other difference is that you have a predefined concurrency level (also part of WLM - http://docs.aws.amazon.com/redshift/latest/mgmt/workload-mgmt-config.html). Each concurrency slot has its own allocated resources so that big queries complete quickly, but this limits the number of queries that can run concurrently. The default is 5, and you can increase it up to 50. The recommendation is to go no higher than 15-20, because at 50 each query gets only 2% of the cluster's resources, instead of 20% (with 5) or 5% (with 20).
The combination of these two differences is that when you connect many dashboards, each one sends its queries to Redshift and competes for resources (without caching, each query runs again and again), and queries may time out or simply be too slow for an interactive dashboard.
Please make sure that you are using the Redshift-optimized drivers for MicroStrategy, which send queries to Redshift with the above characteristics in mind.
You can also consider putting an RDS instance between your dashboards and Redshift, holding the aggregated data your dashboards need; it can provide in-memory caching and higher concurrency on that summary data. There is an interesting pattern you can implement with pg-bouncer (see here) that helps you route some queries (the analytical ones) to Redshift and others (the aggregated dashboard ones) to a PostgreSQL instance.
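To confirm whether WLM is what is terminating the dashboard query, the system tables can be checked directly. A rough sketch (queue numbering and available columns can vary with the cluster version):
-- Recently aborted queries; a WLM timeout shows up here with aborted = 1
SELECT query, starttime, endtime, aborted, TRIM(querytxt) AS sql_text
FROM stl_query
WHERE aborted = 1
ORDER BY starttime DESC
LIMIT 20;
-- Concurrency slots and per-queue timeout (in milliseconds) for the WLM queues
SELECT service_class, num_query_tasks, max_execution_time
FROM stv_wlm_service_class_config
WHERE service_class > 4;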

Automatic Hive or Cascading for ETL in AWS-EMR

I have a large dataset residing in AWS S3. The data is typically transactional (like call records). I run a sequence of Hive queries that continuously apply aggregation and filtering conditions to produce a couple of final compact files (CSVs with millions of rows at most).
So far with Hive, I have had to run one query after another manually (as some queries occasionally fail due to problems in AWS, etc.).
I have processed 2 months of data so far by manual means.
For subsequent months, I want to write a workflow that executes the queries one by one and, should a query fail, reruns it. This can't be done by running Hive queries from a bash .sh file (at least not with my current approach):
hive -f s3://mybucket/createAndPopulateTableA.sql
hive -f s3://mybucket/createAndPopulateTableB.sql (this might need Table A to be populated before executing).
Alternatively, I have been looking at Cascading and wondering whether it might be the solution to my problem; it does have Lingual, which might fit the case. I am not sure, though, how it fits into the AWS ecosystem.
The best solution would be some kind of Hive query workflow tool, if one exists. Otherwise, what other options do I have in the Hadoop ecosystem?
Edit:
I am looking at Oozie now, though I am facing a lot of issues setting it up on EMR. :(
You can use AWS Data Pipeline:
AWS Data Pipeline helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available
You can configure it to retry actions when a script fails, and it supports Hive scripts: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-hiveactivity.html