CaffeOnSpark/Scala: ERROR yarn.ApplicationMaster: User class threw exception: java.lang.NullPointerException

When I train a DNN using CaffeOnSpark with 2 Spark executors over an Ethernet connection, I'm getting an error. I run the job following the example at https://github.com/yahoo/CaffeOnSpark/wiki/GetStarted_yarn:
export SPARK_WORKER_INSTANCES=2
export DEVICES=1
hadoop fs -rm -f hdfs:///mnist.model
hadoop fs -rm -r -f hdfs:///mnist_features_result
spark-submit --master yarn --deploy-mode cluster \
--num-executors 2 \
--files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
--conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
--conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
--class com.yahoo.ml.caffe.CaffeOnSpark \
${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
-train \
-features accuracy,loss -label label \
-conf lenet_memory_solver.prototxt \
-devices 1 \
-connection ethernet \
-model hdfs:///mnist.model \
-output hdfs:///mnist_features_result
Here is the error I'm getting.
When I look at the log of the datanode, the error is as below.
Log of datanode:
Thank you very much for your answer.
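(Not part of the original post, but since the job runs in YARN cluster mode, the full NullPointerException stack trace usually has to be pulled from the aggregated container logs; the application ID below is hypothetical.)
yarn logs -applicationId application_1516000000000_0042 | grep -B 5 -A 30 NullPointerException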

Related

Which API can I use to add my own logs to QEMU for debugging purposes?

I tried to add my own logs to QEMU by using fprintf(stdout, "my own log") and qemu_log("my own log"), then compiled QEMU from source and started a VM with the following command:
/usr/bin/qemu-system-x86_64 \
-D /home/VM1-qemu-log.txt \
-d cpu_reset \
-enable-kvm \
-m 4096 \
-nic user,model=virtio \
-drive file=/var/lib/libvirt/images/VM1.qcow2,media=disk,if=virtio \
-nographic
There are CPU-related logs in VM1-qemu-log.txt; however, I cannot find "my own log" anywhere. Can anyone advise? Thanks!
qemu_log("my own log") works, I added it to the wrong place(i.e., beginning of the 'main()' in 'qemu/vl.c', where the logging has not been setup yet). By adding it to another place(e.g. in virtio_blk_get_request() under qemu/hw/block/virtio-blk.c), I will be able to see "my own log" in /home/VM1-qemu-log.txt. The VM is created by:
/usr/bin/qemu-system-x86_64 \
-D /home/VM1-qemu-log.txt \
-enable-kvm \
-m 4096 \
-nic user,model=virtio \
-drive file=/var/lib/libvirt/images/VM1.qcow2,media=disk,if=virtio \
-nographic
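(A minimal check, not part of the original answer: after rebuilding QEMU and booting the guest, the custom message should show up in the file passed to -D, i.e. the path used in the command above.)
grep "my own log" /home/VM1-qemu-log.txt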

Can't submit training job gcloud ml

I get this error when I try to submit my training job.
ERROR: (gcloud.ml-engine.jobs.submit.training) Could not copy [dist/object_detection-0.1.tar.gz] to [packages/10a409168355064d603079b7c34cdd7010a13b181a8f7776751e9110d66a5bdf/object_detection-0.1.tar.gz]. Please retry: HTTPError 404: Not Found
I'm running the following code:
gcloud ml-engine jobs submit training ${train1} \
--job-dir=gs://${object-detection-tutorial-bucket1/}/train \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
--module-name object_detection.train1 \
--region us-central1 \
--config object_detection/samples/cloud/cloud.yml \
--runtime-version=1.4 \
-- \
--train_dir=gs://${object-detection-tutorial-bucket1/}/train \
--pipeline_config_path=gs://${object-detection-tutorial-bucket1/}/data/ssd_mobilenet_v1_coco.config
It looks like the syntax you're using is incorrect.
If the name of your bucket is object-detection-tutorial-bucket1, then you specify that with:
--job-dir=gs://object-detection-tutorial-bucket1/train
or you can run:
export YOUR_GCS_BUCKET="gs://object-detection-tutorial-bucket1"
and then specify the bucket as:
--job-dir=${YOUR_GCS_BUCKET}/train
The ${} syntax is used for accessing the value of a variable, but object-detection-tutorial-bucket1/ isn't a valid variable name, so the expansion doesn't produce the bucket name you intended.
Sources:
https://cloud.google.com/blog/big-data/2017/06/training-an-object-detector-using-cloud-machine-learning-engine
Difference between ${} and $() in Bash
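(A quick sanity check, not from the original answer: echo the expanded path before submitting so you can see exactly what gcloud will receive; the variable name follows the export in the answer above.)
echo "${YOUR_GCS_BUCKET}/train"
# expected output: gs://object-detection-tutorial-bucket1/train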
Just remove the ${} in the script. Assuming your bucket name is object-detection-tutorial-bucket1, run the script below:
gcloud ml-engine jobs submit training \
--job-dir=gs://object-detection-tutorial-bucket1/train \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
--module-name object_detection.train1 \
--region us-central1 \
--config object_detection/samples/cloud/cloud.yml \
--runtime-version=1.4 \
-- \
--train_dir=gs://object-detection-tutorial-bucket1/train \
--pipeline_config_path=gs://object-detection-tutorial-bucket1/data/ssd_mobilenet_v1_coco.config
A terrible fix, but something that worked for me: just remove the $variable format completely.
Here is an example:
!gcloud ai-platform jobs submit training anurag_card_fraud \
--scale-tier basic \
--job-dir gs://anurag/credit_card_fraud/models/JOB_20210401_194058 \
--master-image-uri gcr.io/anurag/xgboost_fraud_trainer:latest \
--config trainer/hptuning_config.yaml \
--region us-central1 \
-- \
--training_dataset_path=$TRAINING_DATASET_PATH \
--validation_dataset_path=$EVAL_DATASET_PATH \
--hptune
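(A further hedged check, not from any of the answers: the HTTPError 404 in the original error occurred while copying the package to the staging location, most likely because the bucket derived from the broken --job-dir expansion does not exist. Confirming the bucket and the local tarballs before resubmitting can save a retry; names and paths are taken from the question.)
gsutil ls gs://object-detection-tutorial-bucket1
ls -lh dist/object_detection-0.1.tar.gz slim/dist/slim-0.1.tar.gz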

Where is the default sqoop hive import destination directory? Is it controllable?

I need to do a sqoop import of all tables from an existing MySQL database into Hive; the first table is categories.
The command is as below:
sqoop import-all-tables -m 1 \
--connect=jdbc:mysql://ms.itversity.com/retail_db \
--username=retail_user \
--password=itversity \
--hive-import \
--hive-overwrite \
--create-hive-table \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--outdir java_output0322
It failed for the following reason:
Output directory hdfs://nn01.itversity.com:8020/user/paslechoix/categories already exists
I am wondering how I can import them into /apps/hive/warehouse/paslechoix.db/ instead.
paslechoix is the Hive database name.
UPDATE 1 on 20180323, in reply to Bala's first comment:
I've updated the script to:
sqoop import-all-tables -m 1 \
--connect=jdbc:mysql://ms.itversity.com/retail_db \
--username=retail_user \
--password=itversity \
--hive-import \
--hive-overwrite \
--create-hive-table \
--hive-database paslechoix_new \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--outdir java_output0323
I added what you suggested: --hive-database paslechoix_new
paslechoix_new is a new Hive database I just created.
I still receive the error:
AlreadyExistsException: Output directory hdfs://nn01.itversity.com:8020/user/paslechoix/categories already exists
Now, this is really interesting: why does it keep referring to paslechoix? I already indicate in the script that the Hive database is paslechoix_new, so why isn't it recognized?
UPDATE 2 on 20180323:
I took the other suggestion from Bala's comment:
sqoop import-all-tables -m 1 \
--connect=jdbc:mysql://ms.itversity.com/retail_db \
--username=retail_user \
--password=itversity \
--hive-import \
--hive-overwrite \
--create-hive-table \
--hive-database paslechoix_new \
--warehouse-dir /apps/hive/warehouse/paslechoix_new.db \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--outdir java_output0323
Now the import no longer throws an error; however, when I check the Hive database, all the tables are created but contain no data.
Add the option --warehouse-dir to import into a specific directory:
--warehouse-dir /apps/hive/warehouse/paslechoix.db/
If you want to import into a specific Hive database, then use:
--hive-database paslechoix
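(A hedged follow-up, not from the original answer: by default Sqoop stages each imported table under the importing user's HDFS home directory, /user/<username>/<table>, before loading it into Hive, which is why the error keeps pointing at /user/paslechoix/categories even when a different Hive database is named. After a run, checks like the following, using the names from above, show where the data actually ended up.)
hdfs dfs -ls /user/paslechoix
hdfs dfs -ls /apps/hive/warehouse/paslechoix_new.db
hive -e 'USE paslechoix_new; SHOW TABLES; SELECT COUNT(*) FROM categories;'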

Rsnapshot Issue

I am trying to do some backups with Rsnapshot and am constantly getting this error:
/usr/bin/rsync -av --delete --numeric-ids --relative --delete-excluded \
--stats -L --whole-file --exclude=*/web/ --exclude=*/tmp/ \
--exclude=*/dms/ --exclude=*/Recycle\ Bin/ --exclude=*/app/logs/ \
--exclude=*/app/cache/ --exclude=*/vendor/ --exclude=/var/www/files/ \
--exclude=*/releases/ \
--exclude=/var/www/www.xxx.net/app/var/sessions/ \
--rsync-path=rsync_wrapper.sh --exclude=/var/www/psan-static/ \
--rsh=/usr/bin/ssh -p 9922 backup#xxx.xxx.xxx.xxx:/var/www \
/data-ext/backups/rsnapshot/daily.0/myserver/
Unexpected remote arg: backup#xxx.xxx.xxx.xxx:/var/www
rsync error: syntax or usage error (code 1) at main.c(1348) [sender=3.1.0]
----------------------------------------------------------------------------
rsnapshot encountered an error! The program was invoked with these options:
/usr/bin/rsnapshot daily
----------------------------------------------------------------------------
ERROR: /usr/bin/rsync returned 1 while processing backup#xxx.xxx.xxx.xxx:/var/www/
/usr/bin/logger -i -p user.err -t rsnapshot /usr/bin/rsnapshot daily: \
ERROR: /usr/bin/rsync returned 1 while processing \
backup#xxx.xxx.xxx.xxx:/var/www/
I have tried playing with the parameters but cannot figure out what the issue is.
The issue was with
--exclude=*/Recycle\ Bin/
This needs to be quoted, as the space otherwise seems not to work:
--exclude="*/Recycle\ Bin/"

multiple KVM guests script using virt-install

I would like to install 3 KVM guests automatically using kickstart.
I have no problem installing them manually using the virt-install command.
virt-install \
-n dal \
-r 2048 \
--vcpus=1 \
--os-variant=rhel6 \
--accelerate \
--network bridge:br1,model=virtio \
--disk path=/home/dal_internal,size=128 --force \
--location="/home/kvm.iso" \
--nographics \
--extra-args="ks=file:/dal_kick.cfg console=tty0 console=ttyS0,115200n8 serial" \
--initrd-inject=/opt/dal_kick.cfg \
--virt-type kvm
I have 3 scripts like the one above. I would like to install all 3 at the same time; how can I disable the console, or run the installs in the background?
Based on virt-install man page:
http://www.tin.org/bin/man.cgi?section=1&topic=virt-install
--noautoconsole
Don't automatically try to connect to the guest console. The
default behaviour is to launch virt-viewer(1) to display the
graphical console, or to run the "virsh" "console" command to
display the text console. Use of this parameter will disable this
behaviour.
virt-install connects to the console automatically. If you don't want that, simply add --noautoconsole to your command, like:
virt-install \
-n dal \
-r 2048 \
--vcpus=1 \
--quiet \
--noautoconsole \
...... other options
We faced the same problem, and in the end the only way we found was to run each command in the background with &.
We also include the --quiet option (only print fatal error messages); it is not mandatory.
virt-install \
-n dal \
-r 2048 \
--vcpus=1 \
--quiet \
--os-variant=rhel6 \
--accelerate \
--network bridge:br1,model=virtio \
--disk path=/home/dal_internal,size=128 --force \
--location="/home/kvm.iso" \
--nographics \
--extra-args="ks=file:/dal_kick.cfg console=tty0 console=ttyS0,115200n8 serial" \
--initrd-inject=/opt/dal_kick.cfg \
--virt-type kvm &
I know this is kind of old, but I wanted to share my thoughts.
I ran into the same problem, but due to the environment we work in, we need to use sudo with a password (compliance reasons). The solution I came up with was to use timeout instead of &. When we forked it right away, it would hang because the sudo prompt never appeared. So, using timeout with your example above (we obviously did timeout 10 sudo virt-install ...):
timeout 15 virt-install \
-n dal \
-r 2048 \
--vcpus=1 \
--quiet \
--os-variant=rhel6 \
--accelerate \
--network bridge:br1,model=virtio \
--disk path=/home/dal_internal,size=128 --force \
--location="/home/kvm.iso" \
--nographics \
--extra-args="ks=file:/dal_kick.cfg console=tty0 console=ttyS0,115200n8 serial" \
--initrd-inject=/opt/dal_kick.cfg \
--virt-type kvm
This allowed us to interact with our sudo prompt and send the password over, and then start the build. The timeout doesn't kill the process; it continues on, and so can your script.