Integrating PySpark with the Salesforce API

I want to read data from Salesforce via MapR/Spark.
I have added the following JAR files to my Spark config.
spark.driver.extraClassPath xyztest/scripts/salesforce/force-partner-api-40.0.0.jar:xyztest/scripts/salesforce/force-wsc-40.0.0.jar:xyztest/scripts/salesforce/jackson-core-2.10.3.jar:xyztest/scripts/salesforce/jackson-dataformat-xml-2.10.3.jar:xyztest/scripts/salesforce/salesforce-wave-api-1.0.9.jar:xyztest/scripts/salesforce/spark-salesforce_2.11-1.1.1.jar
spark.executor.extraClassPath xyztest/scripts/salesforce/force-partner-api-40.0.0.jar:xyztest/scripts/salesforce/force-wsc-40.0.0.jar:xyztest/scripts/salesforce/jackson-core-2.10.3.jar:xyztest/scripts/salesforce/jackson-dataformat-xml-2.10.3.jar:xyztest/scripts/salesforce/salesforce-wave-api-1.0.9.jar:xyztest/scripts/salesforce/spark-salesforce_2.11-1.1.1.jar
But when I execute the following read I get an error.
soql = "SELECT * FROM Goodwill__c"
df = spark \
    .read \
    .format("com.springml.spark.salesforce") \
    .option("username", "xyzUser") \
    .option("password", "passwort1234token1234") \
    .option("soql", soql) \
    .load()
ERROR: com.sforce.ws.ConnectionException: Failed to send request to https://login.salesforce.com/services/Soap/u/35.0
What is wrong with my code? Does anyone have an idea how to get a connection from MapR/Spark to Salesforce?
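Two details may be worth checking. SOQL has no SELECT *, so the fields have to be listed explicitly, and the error URL ends in Soap/u/35.0, which suggests the connector is falling back to its default API version rather than the 40.0 JARs on the classpath. Below is a minimal sketch under the assumption that the springml connector's optional version and login options apply to your setup; the field names in the query are placeholders.

# Hedged sketch: explicit field list (SOQL has no SELECT *) plus explicit
# API version and login URL; "version" and "login" are assumed optional
# settings of the com.springml.spark.salesforce connector.
soql = "SELECT Id, Name FROM Goodwill__c"
df = spark \
    .read \
    .format("com.springml.spark.salesforce") \
    .option("username", "xyzUser") \
    .option("password", "passwort1234token1234") \
    .option("soql", soql) \
    .option("version", "40.0") \
    .option("login", "https://login.salesforce.com") \
    .load()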

Related

Unable to copy Kafka topic into Amazon S3 using writeStream method in Glue PySpark job?

The job runs successfully, but I am unable to write the topic data into the S3 bucket with the writeStream method in the Glue job. It writes something, but it is not the proper topic data. I am using the code below:
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_brokers) \
    .option("subscribe", "topic1,topic2") \
    .option("kafka.sasl.jaas.config",
            f'org.apache.kafka.common.security.scram.ScramLoginModule required username="{scram_user}" password="{scram_pwd}";') \
    .option("kafka.sasl.mechanism", "SCRAM-SHA-512") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("startingOffsets", starting_offsets) \
    .load()
df.writeStream \
    .format("parquet") \
    .outputMode("append") \
    .option("path", "s3://datalake-raw/latest/topics-raw") \
    .option("checkpointLocation", "s3://datalake-raw/latest/topics-raw") \
    .start()
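The Kafka source exposes key and value as binary columns, so writing df straight to Parquet stores raw bytes rather than readable topic data. Below is a minimal sketch of the decoding step, assuming the payloads are UTF-8 text; the separate checkpoint prefix is an assumption as well, used only to keep checkpoint files out of the data path.

# Hedged sketch: cast the binary key/value columns to strings before writing.
# Adjust the casts if the topics carry Avro/JSON rather than plain text.
decoded = df.selectExpr(
    "CAST(key AS STRING) AS key",
    "CAST(value AS STRING) AS value",
    "topic",
    "timestamp")

decoded.writeStream \
    .format("parquet") \
    .outputMode("append") \
    .option("path", "s3://datalake-raw/latest/topics-raw") \
    .option("checkpointLocation", "s3://datalake-raw/latest/_checkpoints/topics-raw") \
    .start()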

I have an error trying to run my Sqoop job (trying to copy a table from Oracle to Hive)

I am trying to copy a table from Oracle to Hadoop (Hive) with a Sqoop script (the table does not already exist in Hive). Within PuTTY, I launch a script called "my_script.sh"; a code sample is below. However, after I run it, it echoes my code back followed by a "no such file or directory" error. Can someone please tell me if I am missing something in my code?
Yes, my source and target directories are correct (I made sure to triple-check).
Thank you.
#!/bin/bash
sqoop import \
  -Dmapred.map.child.java.opts='-Doracle.net.tns_admin=. -Doracle.net.wallet_location=.' \
  -files $WALLET_LOCATION/cwallet.sso,$WALLET_LOCATION/ewallet.p12,$TNS_ADMIN/sqlnet.ora,$TNS_ADMIN/tnsnames.ora \
  --connect jdbc:oracle:thin:/#MY_ORACLE_DATABASE \
  --table original_schema.original_table \
  --hive-drop-import-delims \
  --hive-import \
  --hive-table new_schema.new_table \
  --num-mappers 1 \
  --hive-overwrite \
  --mapreduce-job-name my_sqoop_job \
  --delete-target-dir \
  --target-dir hdfs://myserver/apps/hive/warehouse/new_schema.db \
  --create-hive-table

Error when submitting training job to gcloud

I am new to training on Google Cloud.
When I am running the training job, I get the following error:
(gcloud.ml-engine.jobs.submit.training) Could not copy [research/dist/object_detection-0.1.tar.gz] to [training/packages/c5292b23e57f357dc2d63baab473c04337dbadd2deeb10965e743cd8422b964f/object_detection-0.1.tar.gz]. Please retry: HTTPError 404: Not Found
I am using this command to run the training job:
gcloud ml-engine jobs submit training job1 \
    --job-dir=gs://${ml-project-neu}/training \
    --packages research/dist/object_detection-0.1.tar.gz,research/slim/dist/slim-0.1.tar.gz \
    --module-name object_detection.train \
    --config cloud.yml \
    --runtime-version=1.4 \
    -- \
    --train_dir=gs://${ml-project-neu}/training \
    --pipeline_config_path=gs://${ml-project-neu}/data/faster_rcnn_inception_v2_pets.config
Make sure ${ml-project-neu} is valid (it may be the empty string in your case); make sure gs://${ml-project-neu} exists; and make sure the credentials you are using with gcloud have access to your GCS bucket (consider running gcloud auth login).
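For the last two checks, here is a minimal Python sketch, assuming the google-cloud-storage client library is installed and that "ml-project-neu" is replaced with your actual bucket name:

# Hedged sketch: lookup_bucket returns None when the bucket does not exist
# or the active credentials cannot see it.
from google.cloud import storage

client = storage.Client()
bucket = client.lookup_bucket("ml-project-neu")  # assumed bucket name
if bucket is None:
    print("Bucket is missing or not accessible with the current credentials")
else:
    print(f"Bucket {bucket.name} is reachable")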

rnn translate showing data_utils not found in google-cloud-ml-engine

I want to create a chatbot using TensorFlow. I am using the code in 'github.com/tensorflow/models/tree/master/tutorials/rnn/translate'. While running the code in google-cloud-ml-engine I get the exception '/usr/bin/python: No module named data_utils' and the job fails.
Here are the commands I used:
gcloud ml-engine jobs submit training ${JOB_NAME} \
    --package-path=. \
    --module-name=translate.translate \
    --staging-bucket="${TRAIN_BUCKET}" \
    --region=us-central1 \
    -- \
    --from_train_data=${INPUT_TRAIN_DATA_A} \
    --to_train_data=${INPUT_TRAIN_DATA_B} \
    --from_dev_data=${INPUT_TEST_DATA_A} \
    --to_dev_data=${INPUT_TEST_DATA_B} \
    --train_dir="${TRAIN_PATH}" \
    --data_dir="${TRAIN_PATH}" \
    --steps_per_checkpoint=5 \
    --from_vocab_size=45000 \
    --to_vocab_size=45000
ml_engine log screenshot 1
ml_engine log screenshot 2
Is the problem with ml-engine or TensorFlow?
I followed the blog 'blog.kovalevskyi.com/how-to-train-a-chatbot-with-the-tensorflow-and-google-cloud-ml-3a5617289032' and initially used 'github.com/b0noI/models/tree/translate_tutorial_supports_google_cloud_ml/tutorials/rnn/translate'. It gave the same error.
Neither; it is actually a problem within the code you are uploading, namely satisfying local dependencies. The file data_utils.py is located in the same folder you got the example from. As also mentioned in this post, you should make sure it is available to your model.
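One minimal way to do that, sketched under the assumption that the tutorial code lives in a translate/ package directory next to a setup.py (the file names below mirror the tutorial layout and may differ in your checkout), is to let setuptools bundle data_utils.py with the module you submit:

# setup.py -- packaging sketch (assumed layout: translate/__init__.py,
# translate/translate.py, translate/data_utils.py)
from setuptools import setup, find_packages

setup(
    name="translate",
    version="0.1",
    packages=find_packages(),  # bundles data_utils.py alongside translate.py
)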

Cannot resubmit job to ml-engine because "A job with this id already exists"

I am trying to submit a job to gcloud ml-engine. For reference, the job is using this sample provided by Google.
It went through the first time, but with errors unrelated to this question, and now I am trying to reissue the command after having corrected my errors:
gcloud ml-engine jobs submit training $JOB_NAME \
    --stream-logs \
    --runtime-version 1.0 \
    --job-dir $GCS_JOB_DIR \
    --module-name trainer.task \
    --package-path trainer/ \
    --region us-east1 \
    -- \
    --train-files $TRAIN_GCS_FILE \
    --eval-files $EVAL_GCS_FILE \
    --train-steps $TRAIN_STEPS
Here, $JOB_NAME = census. Unfortunately, it seems that I cannot resubmit the job unless I change $JOB_NAME to something like census2, then census3, etc. for every new job.
The following is the error I receive:
ERROR: (gcloud.ml-engine.jobs.submit.training) Project [my-project-name]
is the subject of a conflict: Field: job.job_id Error: A job with this
id already exists.
Is it part of the design that you cannot resubmit using the same job name, or am I missing something?
Like Chunck just said, simply try setting JOB_NAME as:
JOB_NAME="census_$(date +%Y%m%d_%H%M%S)"
Not sure if this will help, but in Google's sample code for flowers the error is avoided by appending the date and time to the job id, as shown on line 22, e.g.:
declare -r JOB_ID="flowers_${USER}_$(date +%Y%m%d_%H%M%S)"