Change spark configuration for a specific command - apache-spark-sql

I would like to build an executed plan in Spark, without the WholeStageCodegenExec node.
Is there a way to turn off the configuration "spark.sql.codegen.wholeStage" for a single query that would not affect other queries (i.e., by modifying spark session)?
Thanks!
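One possible approach, offered as a minimal sketch rather than a definitive answer: set the option just before the query and restore the previous value afterwards. The helper name `temporary_conf` is hypothetical. It assumes a PySpark `SparkSession` (only its `conf.get` and `conf.set` methods are used) and that the key already has a value or a built-in default, as "spark.sql.codegen.wholeStage" does. The setting is session-wide while the block runs, so queries issued concurrently on the same session would still see it; `spark.newSession()`, which keeps its own SQL configuration, may be a better fit in that case.

```python
from contextlib import contextmanager


@contextmanager
def temporary_conf(spark, key, value):
    """Set a runtime conf on `spark` for the duration of a with-block.

    `spark` is assumed to be a SparkSession-like object exposing
    `conf.get(key)` and `conf.set(key, value)`.
    """
    previous = spark.conf.get(key)  # assumes the key has a value or default
    spark.conf.set(key, value)
    try:
        yield spark
    finally:
        # Restore the old value even if the query raised an exception.
        spark.conf.set(key, previous)


# Hypothetical usage with a real session (not executed here):
#
#   with temporary_conf(spark, "spark.sql.codegen.wholeStage", "false"):
#       spark.sql("SELECT ...").explain()
```

Building and running the query inside the `with` block keeps planning under the temporary setting.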

Related

How can I run multiple queries in parallel on Hive with the Tez execution engine?

We want to run Hive with Tez to query data in HDFS. Multiple users will query Hive, so we need to configure it so that their queries are executed in parallel.
As Tez uses YARN to assign resources across multiple nodes, we are trying to limit the number of containers YARN assigns per Hive query, but we cannot find the proper config for that.
User Limit Factor is a way to control the maximum amount of resources that a single user can consume. I hope the Cloudera blog post below helps.
https://blog.cloudera.com/yarn-capacity-scheduler/#:~:text=User%20Limit%20Factor%20is%20a%20way%20to%20control%20the%20max%20amount%20of%20resources%20that%20a%20single%20user%20can%20consume
There is also a queue-level limit on the share of resources that ApplicationMasters can use (the AM share):
https://community.cloudera.com/t5/Support-Questions/how-to-tune-yarn-scheduler-capacity-maximum-am-resource/td-p/289415
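As a rough illustration of the arithmetic behind the user limit factor, here is a minimal sketch. It assumes the factor is applied as a multiple of the queue's configured capacity and that the result cannot exceed the queue's maximum capacity. The function name and the percentages are made up for illustration, and other limits (such as the minimum user limit percent when several users are active) are ignored.

```python
def max_user_share(queue_capacity_pct, user_limit_factor, queue_max_capacity_pct):
    """Upper bound, as a percentage of the cluster, on what a single user
    can hold in one queue (simplified model, see the assumptions above)."""
    return min(queue_capacity_pct * user_limit_factor, queue_max_capacity_pct)


# A queue with 20% capacity and 40% maximum capacity (illustrative numbers):
print(max_user_share(20.0, 0.5, 40.0))  # 10.0 -> half of the queue's capacity
print(max_user_share(20.0, 3, 40.0))    # 40.0 -> capped at the queue maximum
```

Under this model, a factor below 1 keeps any one user from filling the queue, which leaves room for other users' queries to start in parallel.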

Ignite query on local node: potential issue?

I am new to Ignite and have a use case where I need to run a cleanup job. Ignite is embedded in our Spring Boot application, which runs as multiple instances. I am thinking of having the job run on each instance, querying only the local data and cleaning that up. Do you see any issue with this? I am not sure how often Ignite reshuffles data.
Thanks
Shannon
You can surely do that.
With regard to data reshuffling, it only happens when a node is added to or removed from the cluster. However, the ignite.compute().affinityRun() family of calls guarantees that the code is run near the data.
Otherwise, you could do ignite.compute().broadcast() and only iterate on each affected cache's local entries. You don't have the aforementioned guarantee then, though.

How are Apache Pig UDFs distributed to data nodes?

There is plenty of documentation about how to write Pig UDFs in the various languages, but I haven't found anything on how they are distributed to the data nodes.
Is this done automatically when the Pig script is invoked? If it makes any difference, I'd be writing the UDF in Java.
Let me make this clearer. When we write a UDF and Pig is running in MapReduce (HDFS) mode, the UDF, which initially resides on the local (client) side, is carried to the cluster by Hadoop's internal job-submission machinery. The UDF's work is then performed by the task trackers, and it is the job tracker's duty to assign that work to a task tracker close to the data node where the input file resides.
Note: it is always the job tracker (a master daemon, often co-located with the name node) that decides which task tracker executes the UDFs.
If the input file is on the local file system (local mode), the UDFs are executed in the local JVM.
Apache Pig works in two modes:
1) local mode
2) MapReduce (HDFS) mode
To answer your question, which concerns Pig running in MapReduce (HDFS) mode: we only need to make sure that the input file we are loading is present in HDFS (on the data nodes). As for the UDF, it is simply a function used to process the input file, just like Pig Latin itself. We write both the UDFs and the Pig Latin script on the client-side node, so everything related to them is stored on the client machine.
Above all, we have to configure Pig so that the client can interact with HDFS to produce the required result.
Hope this helps

Automatic Hive or Cascading for ETL in AWS-EMR

I have a large dataset residing in AWS S3. It is typically transactional data (like call records). I run a sequence of Hive queries that repeatedly aggregate and filter the data to produce a couple of final compact files (CSVs with millions of rows at most).
So far with Hive, I have had to run one query after another manually (as some queries sometimes fail due to problems in AWS, etc.).
I have processed 2 months of data so far by manual means.
But for subsequent months, I want to write a workflow that executes the queries one by one and reruns a query if it fails. This can't be done by running the Hive queries in a bash .sh file (my current approach, at least).
hive -f s3://mybucket/createAndPopulateTableA.sql
hive -f s3://mybucket/createAndPopulateTableB.sql ( this might need Table A to be populated before executing).
Alternatively, I have been looking at Cascading, wondering whether it might solve my problem; it does have Lingual, which might fit the case. I am not sure, though, how it fits into the AWS ecosystem.
The best solution would be some kind of Hive query workflow process. Failing that, what other options do I have in the Hadoop ecosystem?
Edited:
I am looking at Oozie now, though facing a sh!tload of issues setting up in emr. :(
You can use AWS Data Pipeline:
AWS Data Pipeline helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available
You can configure it to perform or retry certain actions when a script fails, and it supports Hive scripts: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-hiveactivity.html
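For comparison, the run-in-order, retry-on-failure behaviour asked for in the question can be sketched as a small driver script. This is a minimal illustration under stated assumptions, not a replacement for a managed workflow service: the function name `run_with_retries`, the `max_attempts` and `delay_seconds` parameters, and the injectable `runner` are all hypothetical, and the sketch assumes the `hive` CLI signals failure through a non-zero exit code.

```python
import subprocess
import time


def run_with_retries(commands, max_attempts=3, delay_seconds=0, runner=None):
    """Run each command in order, retrying a failed command up to
    `max_attempts` times. Stops with an error if a command never succeeds,
    so later steps never run against a half-built table.

    Returns a list of (command, attempts_used) pairs.
    """
    if runner is None:
        # Default runner: execute the command and return its exit code.
        def runner(cmd):
            return subprocess.run(cmd).returncode

    history = []
    for cmd in commands:
        for attempt in range(1, max_attempts + 1):
            if runner(cmd) == 0:
                history.append((cmd, attempt))
                break
            if attempt < max_attempts:
                time.sleep(delay_seconds)  # pause before the next attempt
        else:
            raise RuntimeError(
                "command failed after %d attempts: %r" % (max_attempts, cmd)
            )
    return history


# Hypothetical usage with the scripts from the question (not executed here):
#
#   run_with_retries([
#       ["hive", "-f", "s3://mybucket/createAndPopulateTableA.sql"],
#       ["hive", "-f", "s3://mybucket/createAndPopulateTableB.sql"],
#   ], max_attempts=3, delay_seconds=60)
```

Because the commands run strictly in sequence, the Table B script only starts once the Table A script has succeeded.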

Related to speed of execution of Job in Amazon Elastic Mapreduce

My Task is
1) Initially I want to import the data from MS SQL Server into HDFS using SQOOP.
2) Through Hive I am processing the data and generating the result in one table
3) The Hive table containing the result is then exported back to MS SQL Server.
I want to perform all this using Amazon Elastic Map Reduce.
The data I am importing from MS SQL Server is very large (about 500,000 entries in one table, and I have 30 such tables). For this I have written a Hive task that contains only queries (each of which uses a lot of joins). As a result, performance is very poor on my single local machine (it takes about 3 hours to execute completely).
I want to reduce that time as much as possible, so we decided to use Amazon Elastic MapReduce. Currently I am using 3 m1.large instances and still get the same performance as on my local machine.
How many instances do I need to use to improve the performance?
Are the instances I use configured automatically, or do I need to specify something when submitting the JAR for execution? I ask because the run time is the same when I use two machines.
Also, is there any other way to improve the performance besides increasing the number of instances? Or am I doing something wrong when executing the JAR?
Please guide me through this, as I don't know much about the Amazon servers.
Thanks.
You could try Ganglia, which can be installed on your EMR cluster using a bootstrap action. This will give you some metrics on the performance of each node in the cluster and may help you optimise to get the right sized cluster:
http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_Ganglia.html
If you use the EMR Ruby client on your local machine, you can set up an SSH tunnel that lets you view the Ganglia web interface in Firefox (you'll also need to set up FoxyProxy as per the following: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/emr-connect-master-node-foxy-proxy.html)