Presto query performance via Hue on AWS EMR - amazon-emr

I am currently trying to set up a Presto connection through Hue on an AWS EMR cluster (release 5.24.0).
By default, AWS configures the connection using the jdbc interface.
The problem with this interface is that, for some reason, it loads only the first 1000 records when performing a query of the type
SELECT * FROM table
This is probably due to a current Hue limitation: the fetchsize is hardcoded to 1000 rows and cannot be edited.
According to Hue's documentation, however, it is advisable to use sqlalchemy as interface instead of jdbc whenever possible. So I changed the presto interpreter settings and installed pyhive as documented in the guide.
The fetchsize problem is gone, but even performing a
SELECT COUNT(1) FROM table
can take several minutes. Moreover, the Hue UI shows the query as completed while the Presto UI shows it is still running.
What is worse, if I submit a new query in the Hue Presto editor, the new query is submitted but the previous one keeps running.
Has anyone experienced similar issues? Does changing some Hue/Presto/Hive settings improve anything?
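One way to check, independently of Hue, which queries the coordinator still considers active is to read the same query list the Presto UI uses. The following is a minimal sketch, assuming the coordinator exposes `/v1/query` as a JSON list whose entries carry `queryId`, `state`, and `query` fields. The address `http://localhost:8889` is an assumption (8889 is the usual Presto port on EMR), and the helper names are hypothetical.

```python
import json
import urllib.request

# Assumed coordinator address; adjust host and port for your cluster.
COORDINATOR = "http://localhost:8889"

# Terminal states; anything else is treated as still using resources.
DONE_STATES = {"FINISHED", "FAILED"}


def active_queries(query_infos):
    """Return (queryId, state, query) for every query that has not finished."""
    return [
        (q["queryId"], q["state"], q.get("query", ""))
        for q in query_infos
        if q.get("state") not in DONE_STATES
    ]


def fetch_query_infos(coordinator=COORDINATOR):
    """Fetch the coordinator's query list as parsed JSON."""
    with urllib.request.urlopen(coordinator + "/v1/query", timeout=10) as resp:
        return json.load(resp)
```

Calling `active_queries(fetch_query_infos())` on the master node right after Hue reports a query as done would show whether the earlier query is in fact still active.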

Related

Schedule task to load BigQuery table into Apache Ignite

I have a use case where we need to periodically load a BigQuery table into a cache and support SQL queries from there. I'm researching Apache Ignite and think it could be a good fit for our use case. It's just not clear to me yet how I can get auto-load from BigQuery. By "auto-load" I mean keeping Apache Ignite updated with the BigQuery table data and making this updating transparent to applications. In most cases, our BigQuery tables are updated by other scheduled jobs/queries at intervals from 5 minutes to 1 month.
I'm new to Ignite, and I guess my questions are the following:
Is this a feature supported in Ignite already? (I couldn't find any)
Or are there any existing plugins already? (I couldn't find any)
How can I implement the auto-load cache for BigQuery using Ignite?
You can do this once with Cache Store / loadCache(), but doing a full load every few minutes is infeasible. You may wish to design a BigQuery streamer to Apache Ignite, if BigQuery supports pushing deltas.
If Google BigQuery doesn't open its changelog files to CDC tools, then find a different way to capture those updates and stream them to Ignite via its IgniteDataStreamer API. There should be a way to capture the changes with some pub/sub mechanism.
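The idea of pushing deltas rather than reloading everything can be sketched without either system. The example below is only an illustration, not an Ignite or BigQuery API: `diff_snapshots` and `apply_delta` are hypothetical helpers, the snapshots are plain dicts keyed by primary key, and a dict stands in for the Ignite cache. In a real job the snapshot would come from a BigQuery query and the upserts would go through a data streamer.

```python
def diff_snapshots(previous, current):
    """Compare two {key: row} snapshots and return (upserts, deleted_keys)."""
    # Rows that are new or whose content changed since the last load.
    upserts = {
        key: row
        for key, row in current.items()
        if key not in previous or previous[key] != row
    }
    # Keys that disappeared from the source table.
    deleted = [key for key in previous if key not in current]
    return upserts, deleted


def apply_delta(cache, upserts, deleted):
    """Apply a delta to a dict standing in for the cache."""
    cache.update(upserts)
    for key in deleted:
        cache.pop(key, None)
```

A scheduled job would keep the previous snapshot, compute the delta after each BigQuery refresh, and apply only the changed rows, so applications reading the cache never see a full reload.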

Changing PostgreSQL server changed Django app characteristics

I had to switch an enterprise Django 1.11 site from a corporate-hosted PostgreSQL 9.4 server to an AWS RDS Aurora-PostgreSQL 10 cluster. My initial impression was that it should be a straightforward migration, as I was not using any version-specific code.
Immediately after migration, the site started breaking down horribly. Queries that used to take milliseconds suddenly jumped to 100x the time, causing timeouts all over gunicorn threads. I also kept seeing connections being dropped from both RDS and Django.
It looked as though there was some setting I needed to match between the previous server and the current one, but despite engaging PostgreSQL experts and AWS support, there were no simple answers (or even complex ones). I finally had to fine-tune most queries in my Django code to bring stability to the site.
The app has several queries that refer to foreign relationships, so I used a number of prefetch_related and similar tricks to fix the slowdown. So, a query that was taking 0.5 seconds went to 80 seconds, and after I added prefetch_related, went back to 0.5 seconds.
Even though the site is now stable, I am posting this in the hope that some PostgreSQL and/or Django expert sees this and recognizes this as a symptom of some wrong setting. I am not in a position to share sample queries and am not asking for query optimization. The question is: what would cause a query to become 100x slower when we move from one PostgreSQL server to another, with no change in application code?
In general, Postgres-compatible Aurora has very different performance characteristics from vanilla PostgreSQL, and the configuration and tuning for the two can differ considerably. If you wanted performance close to that of your self-hosted server, the easiest path would have been AWS RDS for PostgreSQL rather than AWS RDS with Aurora PostgreSQL. There are also a number of configuration details you didn't share that affect performance between RDS and a self-hosted server, including VPC settings, SSL, etc.

Query Terminating in Redshift

We are migrating our database from SQL Server 2012 to Amazon Redshift.
The front end of our application is developed in MicroStrategy (MSTR) which fires the queries on Redshift.
Although the application is working fine in Production (on SQL Server 2012), we have run into a strange issue in our PoC Environment on Redshift.
When we kicked off a dashboard in MSTR, the query from the dashboard hits Redshift and it completes successfully without any issues.
But when we stress test the application by running all the dashboards simultaneously, then that particular dashboard's query terminates in Redshift. The database does not throw any error message which is why we cannot troubleshoot why the query is terminating.
Can anyone please suggest how we should go about solving this problem?
Thank you
The problem might be that the WLM configuration sets a timeout on the queue you are sending the query to.
Redshift is designed differently from other databases: it is optimized for analytical queries. For that reason it doesn't cache query results, as you would with an OLTP database. The other difference is that you have a predefined concurrency level (also part of WLM - http://docs.aws.amazon.com/redshift/latest/mgmt/workload-mgmt-config.html). Each concurrency slot gets its own allocated resources so that big queries complete quickly, but this limits the number of queries that can run concurrently. The default concurrency is 5, and you can increase it up to 50. The recommendation is to raise it to no more than 15-20: with 50, each query gets only 2% of the cluster's resources, instead of 20% (with 5) or 5% (with 20).
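The percentages above are just an even split of the queue's resources across its slots. As a quick check of that arithmetic (the function name is hypothetical, and an even split is the assumption stated in this answer):

```python
def share_per_slot(concurrency):
    """Percentage of the queue's resources each slot gets with an even split."""
    if not 1 <= concurrency <= 50:
        raise ValueError("concurrency must be between 1 and 50")
    return 100.0 / concurrency


# Print the share for the default, a recommended upper value, and the maximum.
for level in (5, 20, 50):
    print(f"concurrency {level}: {share_per_slot(level):g}% per query")
```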
The combination of these two differences means that if you connect many dashboards, each one sends its queries to Redshift and competes for the resources (without caching, each query runs again and again), so a query might time out or just be too slow for an interactive dashboard.
Please make sure that you are using the Redshift optimized drivers for MicroStrategy, which are sending queries to Redshift under the above assumptions.
You can also consider putting an RDS instance between your dashboards and Redshift, holding the aggregated data that your dashboards need; it can use in-memory caching and higher concurrency on that summary data. An interesting pattern you can implement with pg-bouncer helps you send some queries (the analytical ones) to Redshift and others (the aggregated dashboard ones) to PostgreSQL.

Automatic Hive or Cascading for ETL in AWS-EMR

I have a large dataset residing in AWS S3. It is typically transactional data (like calling records). I run a sequence of Hive queries that repeatedly apply aggregation and filtering conditions to produce a couple of final compact files (CSVs with millions of rows at most).
So far with Hive, I have had to run one query after another manually (as some queries sometimes fail due to problems in AWS, etc.).
I have processed 2 months of data so far by manual means.
But for subsequent months, I want to be able to write a workflow which will execute the queries one by one and, should a query fail, rerun it. This CAN'T be done by running Hive queries in a bash.sh file (my current approach, at least).
hive -f s3://mybucket/createAndPopulateTableA.sql
hive -f s3://mybucket/createAndPopulateTableB.sql  # this might need Table A to be populated before executing
Alternatively, I have been looking at Cascading wondering whether it might be the solution to my problem and it does have Lingual, which might fit the case. Not sure though, how it fits into the AWS ecosystem.
Ideally, there would be some workflow process for Hive queries. Otherwise, what other options do I have in the Hadoop ecosystem?
Edited:
I am looking at Oozie now, though facing a sh!tload of issues setting up in emr. :(
You can use AWS Data Pipeline:
AWS Data Pipeline helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available
You can configure it to perform or retry actions when a script fails, and it supports Hive scripts: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-hiveactivity.html
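If a managed service is more than you need, the behaviour asked for in the question (run the scripts in order and rerun a failed one) can also be sketched in a small script. This is an alternative to Data Pipeline, not part of it. It is a minimal sketch, assuming `hive -f <script>` exits with a non-zero status on failure; `run_with_retries` and its parameters are hypothetical names.

```python
import subprocess
import time


def run_hive_script(script):
    """Run one Hive script and return its exit status (0 means success)."""
    return subprocess.run(["hive", "-f", script]).returncode


def run_with_retries(scripts, run=run_hive_script, max_attempts=3, delay_seconds=60):
    """Run scripts in order; retry a failing script before giving up."""
    for script in scripts:
        for attempt in range(1, max_attempts + 1):
            if run(script) == 0:
                break  # success: move on to the next script
            if attempt < max_attempts:
                time.sleep(delay_seconds)  # wait before retrying
        else:
            # Every attempt failed; stop so dependent scripts do not run.
            raise RuntimeError(f"{script} failed after {max_attempts} attempts")
```

Passing the two scripts from the question in order, `run_with_retries(["s3://mybucket/createAndPopulateTableA.sql", "s3://mybucket/createAndPopulateTableB.sql"])`, ensures Table B is only attempted once Table A has succeeded.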

Related to speed of execution of Job in Amazon Elastic Mapreduce

My task is:
1) Initially I want to import the data from MS SQL Server into HDFS using Sqoop.
2) Through Hive I process the data and generate the result in one table.
3) That result table from Hive is then exported back to MS SQL Server.
I want to perform all this using Amazon Elastic Map Reduce.
The data I am importing from MS SQL Server is very large (about 500,000 entries in one table, and I have 30 such tables). For this I have written a task in Hive that contains only queries (and each query uses a lot of joins). Because of this, performance is very poor on my single local machine (it takes about 3 hours to execute completely).
I want to reduce that time as much as possible. For that we have decided to use Amazon Elastic MapReduce. Currently I am using 3 m1.large instances and I still get the same performance as on my local machine.
To improve the performance, how many instances do I need to use?
Are the instances I use configured automatically, or do I need to specify something when submitting the JAR for execution? I ask because with two machines the time is the same.
Also, is there any other way to improve the performance besides increasing the number of instances? Or am I doing something wrong while executing the JAR?
Please guide me through this, as I don't know much about the Amazon servers.
Thanks.
You could try Ganglia, which can be installed on your EMR cluster using a bootstrap action. This will give you some metrics on the performance of each node in the cluster and may help you optimise to get the right sized cluster:
http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_Ganglia.html
If you use the EMR Ruby client on your local machine, you can set up an SSH tunnel that lets you view the Ganglia web interface in Firefox (you'll also need to set up FoxyProxy as per the following: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/emr-connect-master-node-foxy-proxy.html)