nifi pyspark - "no module named boto3" - amazon-s3

I'm trying to run a pyspark job I created that downloads and uploads data from S3 using the boto3 library. The job runs fine in PyCharm, but when I try to run it in NiFi using this template https://github.com/Teradata/kylo/blob/master/samples/templates/nifi-1.0/template-starter-pyspark.xml
the ExecutePySpark processor errors with "No module named boto3".
I made sure boto3 is installed in the conda environment that is active.
Any ideas? I'm sure I'm missing something obvious.
Here is a picture of the nifi spark processor.
Thanks,
tim

The Python environment that PySpark runs in is configured via the PYSPARK_PYTHON variable.
Go to Spark installation directory
Go to conf
Edit spark-env.sh
Add this line: export PYSPARK_PYTHON=PATH_TO_YOUR_CONDA_ENV/bin/python (it should point to the Python executable of the conda environment where boto3 is installed)
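If it is unclear which interpreter the executors actually pick up, a quick check from inside the job can confirm it. Below is a minimal sketch (the app name and the single-partition RDD are purely illustrative); it prints the executors' Python path and boto3 version, or fails with the same ImportError if the wrong environment is being used:

# Minimal sketch: confirm which Python the executors run and that boto3 imports there.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("boto3-env-check").getOrCreate()

def check_env(_):
    # These imports run on the executor, so they reflect PYSPARK_PYTHON, not the driver.
    import sys
    import boto3
    yield (sys.executable, boto3.__version__)

print(spark.sparkContext.parallelize([0], numSlices=1).mapPartitions(check_env).collect())
spark.stop()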

Related

Running remote Pycharm interpreter with tensorflow and cuda (with module load)

I am using a remote computer in order to run my program on its GPU. My program contains some code with tensorflow functions, and for easier debugging with PyCharm I would like to connect via SSH with a remote interpreter to the computer with the GPU. This part can be done easily, since PyCharm has this option, so I can connect there. However, tensorflow is not loaded automatically, so I get an import error.
Note that in our institution, we run module load cuda/10.0 and module load tensorflow/1.14.0 each time the computer is loaded. Now this part is the tricky one. Opening a remote terminal creates another session which is not related to the remote interpreter session, so it does not affect the remote interpreter's modules.
I know that module load generally configures the environment, but I am not sure how I can export those environment variables into PyCharm's environment variables that are configured before a run.
Any help would be appreciated. Thanks in advance.
The workaround turned out to be relatively simple: first, I installed the EnvFile plugin, as explained here: https://stackoverflow.com/a/42708476/13236698
Then I created a .env file with a quick Python script, extracting all environment variables and their values from os.environ and writing them to a file in the format <env_variable>=<variable_value>, saved with a .env extension. Then I loaded it into PyCharm, and voila: all tensorflow modules were loaded fine.
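For reference, that quick script can be as short as the sketch below (the output file name is an assumption). Run it in the remote session after the module load commands, so the dumped variables include the CUDA and tensorflow paths:

# Minimal sketch: dump the current session's environment into a file the EnvFile
# plugin can read, one <env_variable>=<variable_value> pair per line.
import os

with open("remote_session.env", "w") as fh:
    for name, value in os.environ.items():
        fh.write(f"{name}={value}\n")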

AWS Elastic Beanstalk: setting up X virtual framebuffer (Xvfb)

I'm trying to get a Selenium script running on an Elastic Beanstalk server. To achieve this I am using the pyvirtualdisplay package, following this answer. However, for the Display driver to run, Xvfb also needs to be installed on the system. I'm getting this error message:
OSError=[Errno 2] No such file or directory: 'Xvfb'
Is there any way to manually install this on EB? I have also set up an EC2 server as suggested here, but the whole process seems unnecessary for this task.
You can create a file in .ebextensions/ like: .ebextensions/xvfb.config with the following content:
packages:
  yum:
    xorg-x11-server-Xvfb: []
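Once Xvfb is installed this way, the pyvirtualdisplay usage the question describes looks roughly like the sketch below (assuming selenium and pyvirtualdisplay are installed in the application's environment; the display size and URL are arbitrary):

# Sketch: start a virtual X display backed by Xvfb before launching the browser.
from pyvirtualdisplay import Display
from selenium import webdriver

display = Display(visible=0, size=(1024, 768))
display.start()

driver = webdriver.Firefox()  # any Selenium webdriver works once a display exists
driver.get("https://example.com")
print(driver.title)

driver.quit()
display.stop()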

running Hadoop with HBase: org.apache.hadoop.hbase.client.HTable.<init>(Lorg/apache/hadoop/conf/Configuration;Ljava/lang/String

I'm trying to make a mapreduce program on Hadoop using HBase. I'm using Hadoop 2.5.1 with HBase 0.98.10.1.
The program compiles successfully and is packaged into a jar file. But when I try to run the jar using "hadoop jar", the program fails with an error:
"org.apache.hadoop.hbase.client.HTable.<init>(Lorg/apache/hadoop/conf/Configuration;Ljava/lang/String".
Here are the lines of code I use to create the HTable:
HBaseConfiguration config = new HBaseConfiguration();
HTable table = new HTable(config, "Test");
What should I change to overcome this error?
I somehow solved this error by just re-installing both Hadoop 2.5.1 and HBase 0.98.10.

Apache oozie sharedlib is showing a blank list

I'm relatively new to Apache Oozie and did an installation on Ubuntu 14.04 with Hadoop 2.6.0 and JDK 1.8. I was able to install Oozie and the web console is visible at port 11000 of my server.
Now, when I copied the examples bundled with Oozie and tried to run them, I ran into an error which says no sharedlib exists.
I installed the sharedlib as below:
bin/oozie-setup.sh sharelib create -fs hdfs://localhost:54310
(my namenode is running on localhost 54310 and JT on localhost 54311)
hadoop fs -ls /user/hduser/share/lib shows the shared library was created as per the oozie-site.xml file. However, when I check the shared library using the command
oozie admin -oozie http://localhost:11000/oozie -shareliblist
the list is blank, and jobs are also failing for the same reason.
Any clues on how I should approach this problem?
Thanks.
The sharelib create command looks fine.
If you haven't done so already, copy the core-site.xml from your Hadoop installation folder into $OOZIE_HOME/conf/hadoop-conf/.
There might already be a "placeholder" core-site.xml in the hadoop-conf folder; delete or rename that one. Oozie doesn't get its Hadoop configuration directly from your Hadoop install (like Hive does, for example) but from the core-site.xml you place in that hadoop-conf folder.
Okay, I got a solution for this.
When I was creating the sharedlib directory it was created on HDFS, but while running the job a local path was being referred to. So I extracted the oozie-sharelib tar.gz file into my local /user/hduser/share/lib directory and it's working now.
But I did not figure out the reason, so it's still an open question.
I have encountered the same issue, and it turned out that Oozie was not able to communicate with HDFS, because it could not find the location of core-site.xml or any other Hadoop configuration, which has to be declared inside oozie-site.xml.
The corresponding property in oozie-site.xml is oozie.service.HadoopAccessorService.hadoop.configurations.
This property was defined wrongly in my case.
I changed it to point to where my Hadoop configuration XMLs are present, and then Oozie started communicating with HDFS and hence was able to locate the sharelib on HDFS.

Sqoop - Could not find or load main class org.apache.sqoop.Sqoop

I installed Hadoop, Hive, HBase, Sqoop and added them to the PATH.
When I try to execute sqoop command, I'm getting this error:
Error: Could not find or load main class org.apache.sqoop.Sqoop
Development Environment:
OS : Ubuntu 12.04 64-bit
Hadoop Version: 1.0.4
Hive Version: 0.9.0
Hbase Version: 0.94.5
Sqoop Version: 1.4.3
Make sure you have sqoop-1.4.3.jar under your SQOOP_HOME directory.
Note: this may be because you downloaded the wrong Sqoop distribution.
I have resolved this issue on CentOS 6.3.
I have Hadoop-1.0.4, hbase-0.94.6, hive-0.10.0, pig-0.11.1, sqoop-1.4.3.bin__hadoop-1.0.0, zookeeper-3.4.5 installed.
I was also running into the same problem with Sqoop: Error - Could not find the main class: org.apache.sqoop.Sqoop.
To resolve this issue I copied the jar file sqoop-1.4.3.jar from $SQOOP_HOME/ into $HADOOP_HOME/lib/.
Hope this helps someone who is struggling to get Sqoop working with Hadoop.
Unfortunately, I didn't find a complete answer for my problems. The Sqoop version I used was 1.4.6. I am not sure whether sqoop-1.4.6.tar.gz requires compiling the source code, but I was able to beat the same error (Error - Could not find the main class: org.apache.sqoop.Sqoop) using the following instructions:
Instead, I downloaded sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz from Apache Sqoop, installed it at /home/ubuntu/SQOOP/, and renamed sqoop-1.4.6.bin__hadoop-2.0.4-alpha to sqoop. I wanted to use it with YARN.
Then set and export $SQOOP_HOME. I used this:
export SQOOP_HOME=/home/ubuntu/SQOOP/sqoop/
export PATH=$PATH:$SQOOP_HOME/bin
Now if you go to $SQOOP_HOME/bin and try
./sqoop help
it should work without any issue.
The problem in my case was that the hadoop-env.sh file had this line in it:
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
It seems that when you call sqoop it internally calls configure-sqoop, which sets HADOOP_CLASSPATH correctly, but then when sqoop calls hadoop, hadoop ignores that variable and resets it back to what is in hadoop-env.sh.
The fix was to change the hadoop-env.sh to have this line instead:
export HADOOP_CLASSPATH="${JAVA_HOME}/lib/tools.jar:$HADOOP_CLASSPATH"
#user225003's solution magically worked. I looked into some of the files, and here is what happens under the hood when you execute the "sqoop" script.
The "sqoop" script essentially executes the "hadoop" script from the $HADOOP_COMMON_HOME/bin/ directory. While configuring Sqoop, in "sqoop-env.sh" we set $HADOOP_COMMON_HOME to the Hadoop installation directory. If your Sqoop and Hadoop installations are not in the regular location /usr/local, I believe sqoop-x.x.x.jar is not in the hadoop script's classpath.