I am using Hive on Amazon EMR for ETL of CSV data into Parquet data (and applying a strongly typed schema). I am connecting from my SQS listener application to HiveServer2 using thrift with up to 250 open connections.
An ETL looks like: SET parameters, CREATE EXTERNAL TABLE (source), CREATE EXTERNAL TABLE (destination), INSERT OVERWRITE, DROP TABLEs
When running ~100 or fewer ETLs at a time, the SET, CREATE, and DROP statements complete in under a second. The INSERT takes ~100 seconds via an MR job.
Now when I ramp this up to ~250 ETLs at a time, the performance of the SET, CREATE, and DROP statements gets orders of magnitude worse. It almost looks like these simple statements are getting stuck behind the INSERT statements even though each ETL is on a separate connection.
I tried increasing hive.server2.async.exec.threads and hive.server2.async.exec.wait.queue.size but I don't see a difference. Am I missing some other tunable configuration?
I am basically unable to push enough jobs through to occupy the cluster resources. I know batching is a strategy, but I would like to keep the liveliness of my data.
Related
I have a Singlestore (previously MemSQL) cloud database set up.
My software is running in the background, constantly writing to a table.
When I try to query this table, it takes 10+ seconds. When the software is shut off, the query takes milliseconds.
What would be the reason for this? And is there anything that can be done to mitigate against this?
From a high level, cluster resources are much more heavily utilized while the background software constantly writes to the table. The same resources that handle the constant writes are concurrently trying to serve the query, so it makes sense that it's faster when there is no writing.
A 'knob to turn' WRT database ingest performance is partition count - you can try creating a test DB with more partitions than the current DB (say 2x more). Then try querying from the test DB, both while the background software is running and while it is not, and compare this to the DB with fewer partitions.
For general guidance on troubleshooting query performance, see this section of the docs: https://docs.singlestore.com/managed-service/en/query-data/query-procedures/troubleshooting-poorly-performing-queries.html
If you're an active customer, you can file a support ticket so the issue gets some additional analysis of the backend workings.
I'm looking to build a RESTful API in Go that would be in charge of inserting data sent by multiple mobile apps (the data would be stored in an Amazon Redshift cluster). It could receive tens of thousands of requests per second.
From what I have read, Redshift gives slow insert speeds.
That's why a few people have advised me to use an intermediate store like DynamoDB or S3, in which I'd perform the inserts first. Then, in a second step, I'd import the data into Redshift.
I'm wondering why I would need to use Redshift in that case, as the data would already be stored in a database? Do you think I can proceed differently?
I have also thought of a simpler solution: writing to a queue and progressively inserting the data into Redshift. But I think it might be a problem if the queue keeps growing because the insert speed isn't fast enough to keep up with the incoming data.
Thanks in advance for your help! :-)
Advice like this is normally off-topic for StackOverflow, but...
Amazon Redshift is a massively parallel processing (MPP) database with an SQL interface. It can be used to query TBs and even PBs of data and it can do it very efficiently.
You ask "why would I need to use Redshift" -- the answer is if your querying requirements cannot be satisfied with a traditional database. If you can satisfactorily use a normal database for your queries, then there's no real reason to use Redshift.
However, if your queries need Redshift, then you should continue to use it. The design of Redshift is such that the most efficient way to insert data is to load from Amazon S3 via the COPY command. It is inefficient to insert data via normal INSERT statements unless they are inserting many rows per INSERT statement (eg hundreds or thousands).
So, some questions to ask:
Do I need the capabilities of Amazon Redshift for my queries, or can a traditional database suffice?
Do I need to load data in real-time, or is it sufficient to load in batches?
If using batches, how often do I need to load the batch? Can I do it hourly or daily, or does it need to be within a few minutes of the data arriving?
You could also consider using Amazon Kinesis Firehose, which can accept a stream of data and insert it into an Amazon Redshift database automatically.
We are evaluating Amazon Redshift for real time data warehousing.
Data will be streamed and processed through a Java service and it should be stored in the database. We process row by row (real time) and we will only insert one row per transaction.
What is best practice for real time data loading to Amazon Redshift?
Shall we use JDBC and perform INSERT INTO statements, or try to use Kinesis Firehose, or perhaps AWS Lambda?
I'm concerned about using one of these services because both will use Amazon S3 as a middle layer and perform the COPY command which is suitable for bigger data sets, not for "one-row" inserts.
It is not efficient to use individual INSERT statements with Amazon Redshift. It is designed as a Data Warehouse, providing very fast SQL queries. It is not a transaction-processing database where data is frequently updated and inserted.
The best practice is to load batches (or micro-batches) via the COPY command. Kinesis Firehose uses this method. This is much more efficient, because multiple nodes are used to load the data in parallel.
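As a rough illustration of the micro-batch idea, the sketch below splits a batch of rows into several CSV payloads and builds the COPY statement that would load them. It assumes the payloads are uploaded as separate objects under one S3 prefix, so that the load can be spread across the cluster. The table name, bucket, prefix, and IAM role ARN are all hypothetical placeholders, and the upload step itself is left out.

```python
import csv
import io
from typing import Iterable, List, Sequence


def split_into_csv_parts(rows: Iterable[Sequence], parts: int) -> List[str]:
    """Distribute rows round-robin into `parts` CSV payloads (one per S3 object)."""
    buffers = [io.StringIO() for _ in range(parts)]
    writers = [csv.writer(buf, lineterminator="\n") for buf in buffers]
    for index, row in enumerate(rows):
        writers[index % parts].writerow(row)
    return [buf.getvalue() for buf in buffers]


def build_copy_statement(table: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Build a COPY statement that loads every object under the given prefix."""
    return (
        f"COPY {table} "
        f"FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role_arn}' "
        "FORMAT AS CSV"
    )


if __name__ == "__main__":
    # Hypothetical rows and names, for illustration only.
    payloads = split_into_csv_parts([(1, "a"), (2, "b"), (3, "c")], parts=2)
    print(payloads)
    print(build_copy_statement(
        "events",
        "s3://my-bucket/batch-0001/",
        "arn:aws:iam::123456789012:role/MyRedshiftRole",
    ))
```

Each payload would be written to its own object under the prefix before the COPY statement is executed over a normal SQL connection.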
If you are seriously looking at processing data in real-time, then Amazon Redshift might not be the best database to use. Consider using a traditional SQL database (eg those provided by Amazon RDS), a NoSQL database (such as Amazon DynamoDB) or even Elasticsearch. You should only choose to use Redshift if your focus is on reporting across large volumes of data, typically involving many table joins.
As mentioned in the Amazon Redshift Best Practices for Loading Data:
If a COPY command is not an option and you require SQL inserts, use a multi-row insert whenever possible. Data compression is inefficient when you add data only one row or a few rows at a time.
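A minimal sketch of what a multi-row insert could look like when built from application code. It assumes a driver that uses the "%s" placeholder style (psycopg2 is one such driver); the table and column names are hypothetical, and they are interpolated directly, so they must come from trusted code rather than user input.

```python
from typing import Any, List, Sequence, Tuple


def build_multi_row_insert(
    table: str,
    columns: Sequence[str],
    rows: Sequence[Sequence[Any]],
) -> Tuple[str, List[Any]]:
    """Return one INSERT statement covering all rows, plus its flat parameter list."""
    if not rows:
        raise ValueError("rows must not be empty")
    for row in rows:
        if len(row) != len(columns):
            raise ValueError("every row needs one value per column")

    # One "(%s, %s, ...)" group per row, all in a single VALUES clause.
    row_placeholder = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = "INSERT INTO {} ({}) VALUES {}".format(
        table,
        ", ".join(columns),
        ", ".join([row_placeholder] * len(rows)),
    )
    params = [value for row in rows for value in row]
    return sql, params


if __name__ == "__main__":
    # Hypothetical table and data, for illustration only.
    sql, params = build_multi_row_insert("events", ["id", "name"], [(1, "a"), (2, "b")])
    print(sql)
    print(params)
```

The returned pair would be passed to the driver's execute call, so that hundreds or thousands of rows travel in one statement instead of one statement per row.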
The best option is Kinesis Firehose, which is working on batches of events. You are writing the events into Firehose, one by one, and it is batching it in an optimal way, based on your definition. You can define how many minutes to batch the events, or the size of the batch in MB.
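The size-or-time batching described above can be illustrated with a small buffer class. This is not Firehose's implementation, only a sketch of the idea: records are collected one by one and handed to a flush callback once a size limit or an age limit is reached. The class name, parameter names, and limits are hypothetical, and the age limit is only checked when a new record arrives (a real service would also flush on a timer).

```python
import time
from typing import Callable, List, Optional


class MicroBatcher:
    """Collect records and flush them as one batch by total size or by age."""

    def __init__(
        self,
        flush: Callable[[List[bytes]], None],
        max_bytes: int,
        max_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush = flush
        self._max_bytes = max_bytes
        self._max_seconds = max_seconds
        self._clock = clock
        self._buffer: List[bytes] = []
        self._size = 0
        self._started: Optional[float] = None  # arrival time of the first buffered record

    def add(self, record: bytes) -> None:
        now = self._clock()
        if self._started is None:
            self._started = now
        self._buffer.append(record)
        self._size += len(record)
        # Flush when the batch is big enough or old enough.
        if self._size >= self._max_bytes or now - self._started >= self._max_seconds:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            self._flush(self._buffer)
        # Start a fresh list so the batch handed to the callback is not mutated.
        self._buffer = []
        self._size = 0
        self._started = None


if __name__ == "__main__":
    batcher = MicroBatcher(lambda batch: print(len(batch), "records"), max_bytes=10, max_seconds=60)
    for payload in (b"aaaa", b"bbbb", b"cccc"):
        batcher.add(payload)
    batcher.flush()  # push out anything still buffered at shutdown
```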
You might be able to insert a single event into Redshift faster with INSERT, but this method is not scalable. COPY is designed to work at almost any scale.
I have a large dataset residing in AWS S3. This is typically transactional data (like calling records). I run a sequence of Hive queries that apply aggregation and filtering conditions to produce a couple of final compact files (CSVs with millions of rows at most).
So far with Hive, I have had to manually run one query after another (as some queries sometimes fail due to problems in AWS, etc.).
I have processed 2 months of data so far by manual means.
But for subsequent months, I want to be able to write a workflow that executes the queries one by one and, should a query fail, reruns it. This CAN'T be done by running the Hive queries from a bash .sh file (my current approach, at least).
hive -f s3://mybucket/createAndPopulateTableA.sql
hive -f s3://mybucket/createAndPopulateTableB.sql ( this might need Table A to be populated before executing).
Alternatively, I have been looking at Cascading wondering whether it might be the solution to my problem and it does have Lingual, which might fit the case. Not sure though, how it fits into the AWS ecosystem.
The best solution would be some workflow process for Hive queries, if one exists. Otherwise, what other options do I have in the Hadoop ecosystem?
Edited:
I am looking at Oozie now, though facing a sh!tload of issues setting up in emr. :(
You can use AWS Data Pipeline:
AWS Data Pipeline helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available
You can configure it to perform or retry some actions when a script fails, and it supports Hive scripts: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-hiveactivity.html
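For illustration only, the run-in-order-and-retry behaviour asked for in the question can also be sketched as a small driver script. This is not how Data Pipeline works internally. It assumes the hive CLI is on the PATH and exits with a non-zero status when a script fails; the attempt count and delay are hypothetical defaults.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_hive_script(path: str) -> bool:
    """Run one script with the Hive CLI; True when it exits with status 0."""
    return subprocess.run(["hive", "-f", path]).returncode == 0


def run_in_order(
    scripts: Sequence[str],
    run: Callable[[str], bool] = run_hive_script,
    max_attempts: int = 3,
    delay_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Run scripts one after another, retrying each before moving to the next."""
    for script in scripts:
        for attempt in range(1, max_attempts + 1):
            if run(script):
                break  # success: continue with the next script
            if attempt < max_attempts:
                sleep(delay_seconds)
        else:
            # Every attempt failed, so later scripts that depend on this one must not run.
            raise RuntimeError(f"{script} failed after {max_attempts} attempts")


if __name__ == "__main__":
    run_in_order([
        "s3://mybucket/createAndPopulateTableA.sql",
        "s3://mybucket/createAndPopulateTableB.sql",
    ])
```

Because a script only starts after the previous one has succeeded, the Table B script never runs against an unpopulated Table A.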
My Task is
1) Initially I want to import the data from MS SQL Server into HDFS using Sqoop.
2) Through Hive I process the data and generate the result in one table.
3) The Hive table containing that result is then exported back to MS SQL Server.
I want to perform all this using Amazon Elastic Map Reduce.
The data I am importing from MS SQL Server is very large (about 500,000 entries in one table, and I have 30 such tables). For this I have written a task in Hive that contains only queries (and each query uses a lot of joins). Because of this, the performance is very poor on my single local machine (it takes about 3 hours to execute completely).
I want to reduce that time as much as possible, so we have decided to use Amazon Elastic MapReduce. Currently I am using 3 m1.large instances, and I still get the same performance as on my local machine.
How many instances do I need to use in order to improve the performance?
Are the instances we use configured automatically, or do I need to specify something when submitting the JAR for execution? I ask because the run time is the same when I use two machines.
Also, is there any way to improve the performance other than increasing the number of instances? Or am I doing something wrong while executing the JAR?
Please guide me through this, as I don't know much about the Amazon servers.
Thanks.
You could try Ganglia, which can be installed on your EMR cluster using a bootstrap action. This will give you some metrics on the performance of each node in the cluster and may help you optimise to get the right sized cluster:
http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_Ganglia.html
If you use the EMR Ruby client on your local machine, you can set up an SSH tunnel to allow you to view the Ganglia web interface in Firefox (you'll also need to set up FoxyProxy as per the following: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/emr-connect-master-node-foxy-proxy.html).