Can't get $TPU_NAME environment variable to work properly - tensorflow

I'm a newbie! I'm trying to train a BERT model from scratch in a Kaggle kernel, but I can't get the BERT run_pretraining.py script to work on TPUs. It works fine on CPUs, though. I'm guessing the issue is with the $TPU_NAME environment variable.
!python run_pretraining.py \
--input_file='gs://xxxxxxxxxx/*' \
--output_dir=/kaggle/working/model/ \
--do_train=True \
--do_eval=True \
--bert_config_file=/kaggle/input/bert-bangla-test-config/config.json \
--train_batch_size=32 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--num_train_steps=20 \
--num_warmup_steps=2 \
--learning_rate=2e-5 \
--use_tpu=True \
--tpu_name=$TPU_NAME

If the script uses a tf.distribute.cluster_resolver.TPUClusterResolver() (https://www.tensorflow.org/api_docs/python/tf/distribute/cluster_resolver/TPUClusterResolver), then you can simply instantiate the TPUClusterResolver without any arguments, and it will automatically pick up TPU_NAME (https://github.com/tensorflow/tensorflow/blob/v2.3.0/tensorflow/python/tpu/client/client.py#L47).
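For illustration, a minimal sketch of that pattern in TF 2.x (not the BERT script itself, just showing how the resolver can find the address on its own):
import tensorflow as tf

# With no argument, the resolver falls back to the TPU_NAME environment
# variable (or the platform's metadata service) to locate the TPU.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
print(resolver.master())  # e.g. grpc://10.0.0.2:8470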

Okay, I found a rookie solution :P
run:
import os
os.environ
From the returned dictionary you can get the address. Just copy-paste it or something. It'll be in the form 'TPU_NAME': 'grpc://xxxxxxx'.
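If you'd rather not copy-paste, the same value can be read programmatically (a tiny sketch doing exactly what the dictionary lookup above does):
import os

tpu_address = os.environ.get('TPU_NAME')  # e.g. 'grpc://xxxxxxx'
print(tpu_address)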

Related

Pyspark df.count(): why does it work with only one executor?

I am trying to read data from Kafka and I want to get the count of it.
It takes a long time because it works with only one executor. How can I increase it?
spark = SparkSession.builder.appName('oracle_read_test') \
.config("spark.driver.memory", "30g") \
.config("spark.driver.maxResultSize", "64g") \
.config("spark.executor.cores", "10") \
.config("spark.executor.instances", "15") \
.config('spark.executor.memory', '30g') \
.config('num-executors', '20') \
.config('spark.yarn.executor.memoryOverhead', '32g') \
.config("hive.exec.dynamic.partition", "true") \
.config("orc.compress", "ZLIB") \
.config("hive.merge.smallfiles.avgsize", "40000000") \
.config("hive.merge.size.per.task", "209715200") \
.config("dfs.blocksize", "268435456") \
.config("hive.metastore.try.direct.sql", "true") \
.config("spark.sql.orc.enabled", "true") \
.config("spark.dynamicAllocation.enabled", "false") \
.config("spark.sql.sources.partitionOverwriteMode","dynamic") \
.getOrCreate()
df = spark.read.format("kafka") \
.option("kafka.bootstrap.servers","localhost:9092") \
.option("includeHeaders","true") \
.option("subscribe","test") \
.load()
df.count()
How many partitions does your topic have? If it has only one, then adding more executors won't help.
Otherwise, --num-executors exists as a flag to spark-submit.
Also, this code only counts the records returned in one batch, not the entire topic; counting the entire topic would take even longer.
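As a quick check of how the read is parallelized, something like the following could help (a sketch reusing the broker and topic from the question; the minPartitions option needs Spark 2.4+):
# The number of Spark partitions usually mirrors the topic's partition count
print(df.rdd.getNumPartitions())

# If the topic itself has few partitions, the Kafka source can split the
# offset ranges further so more executors get work
df = spark.read.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test") \
    .option("minPartitions", "100") \
    .load()
df.count()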

How to loop YOLOv3 over multiple images on Google Colab?

I'm using Google Colab and iterating over one image using this code:
%cd /content/darknet/
!./darknet detector test \
/content/dataset/data/obj.data \
/content/dataset/models/yolov3.cfg \
/content/dataset/backup/yolov3_final.weights \
/content/test_images/IMG_2022.jpg
show_image('/content/darknet/predictions.jpg')
But I want to iterate over multiple images in one folder. Does anyone have a solution for how to run YOLOv3 over all the images in a folder?
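One possible approach is to drive the same darknet command from Python in a loop (just a sketch, reusing the paths above; note that darknet overwrites predictions.jpg on every run, so each result is copied to its own file):
import glob
import shutil
import subprocess

for image_path in sorted(glob.glob('/content/test_images/*.jpg')):
    # Run the same detector command once per image
    subprocess.run([
        './darknet', 'detector', 'test',
        '/content/dataset/data/obj.data',
        '/content/dataset/models/yolov3.cfg',
        '/content/dataset/backup/yolov3_final.weights',
        image_path,
    ], cwd='/content/darknet')
    # Keep a copy of the annotated output for this image
    name = image_path.split('/')[-1]
    shutil.copy('/content/darknet/predictions.jpg',
                '/content/darknet/pred_' + name)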

Pyspark DataFrame does not quote data on save

I am trying to save a file to HDFS using the com.databricks.spark.csv package, but it does not quote my data, although I set the quote option.
What am I doing wrong?
df.write.format('com.databricks.spark.csv').mode('overwrite').option("header", "false").option("quote","\"").save(output_path)
I am calling it using --packages com.databricks:spark-csv_2.10:1.5.0
output:
john,doo,male
expected:
"john","doo","male"
In Spark >= 2.X you should use the option quoteAll:
df.write \
.format('com.databricks.spark.csv') \
.mode('overwrite') \
.option("header", "false") \
.option("quoteAll","true") \
.save(output_path)
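For what it's worth, Spark 2.x also ships a built-in CSV writer that accepts the same quoteAll option, so the external package may not even be needed (a sketch using the same output_path):
df.write \
    .mode('overwrite') \
    .option("header", "false") \
    .option("quoteAll", "true") \
    .csv(output_path)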

How to periodically evaluate the performance of models in TF-Slim?

I am trying to use DenseNet for a regression problem with TF-Slim. My data contains 60000 JPEG images with 37 float labels for each image. I divided my data into three different tfrecords files: a train set (60%), a validation set (20%) and a test set (20%).
I need to evaluate the validation set during the training loop and make a plot like the image.
In the TF-Slim documentation they only explain the training loop and the evaluation loop separately, so I can only evaluate the validation or test set after the training loop has finished, while, as I said, I need to evaluate during training.
I tried to use the slim.evaluation.evaluation_loop function instead of slim.evaluation.evaluate_once, but it doesn't help.
slim.evaluation.evaluation_loop(
    master=FLAGS.master,
    checkpoint_dir=checkpoint_path,
    logdir=FLAGS.eval_dir,
    num_evals=num_batches,
    eval_op=list(names_to_updates.values()) + print_ops,
    variables_to_restore=variables_to_restore,
    summary_op=tf.summary.merge(summary_ops),
    eval_interval_secs=eval_interval_secs)
I tried evaluation.evaluate_repeatedly as well.
from tensorflow.contrib.training.python.training import evaluation
evaluation.evaluate_repeatedly(
    master=FLAGS.master,
    checkpoint_dir=checkpoint_path,
    eval_ops=list(names_to_updates.values()) + print_ops,
    eval_interval_secs=eval_interval_secs)
Both of these functions just read the latest available checkpoint from checkpoint_dir and then apparently wait for the next one; however, when new checkpoints are generated, they don't run the evaluation at all.
I use Python 2.7.13 and Tensorflow 1.3.0 on CPU.
Any help will be highly appreciated.
Using evaluate_once works just fine with a bash script that sleeps between runs. It appears that TensorBoard is capable of plotting multiple single runs from the given eval_dir...
So I use something like:
#!/bin/bash
set -e
# Paths to model and evaluation results
TRAIN_DIR=~/pDL/tensorflow/model/mobilenet_v1_1_224_rp-v1/run0004
TEST_DIR=${TRAIN_DIR}/eval
# Where the dataset is saved to.
DATASET_DIR=/mnt/data/tensorflow/data
# Run evaluation (using slim.evaluation.evaluate_once)
CONTINUE=1
while [ "$CONTINUE" -ne 0 ]
do
python eval_image_classifier.py \
--checkpoint_path=${TRAIN_DIR} \
--eval_dir=${TEST_DIR} \
--dataset_name=master_db \
--preprocessing_name=preprocess224 \
--dataset_split_name=valid \
--dataset_dir=${DATASET_DIR} \
--model_name=mobilenet_v1 \
--patch_size=64
echo "sleeping for next run"
sleep 600
done
This appears to be an issue of setting the checkpoint_path properly, as addressed here:
https://github.com/tensorflow/tensorflow/issues/13769
where the answer by Ellie68 is to set:
if tf.gfile.IsDirectory(FLAGS.checkpoint_path):
    if tf.train.latest_checkpoint(FLAGS.checkpoint_path):
        checkpoint_path = tf.train.latest_checkpoint(FLAGS.checkpoint_path)
else:
    checkpoint_path = FLAGS.checkpoint_path

Tensorflow Slim Imagenet training

I am trying to prepare the data to train an ImageNet model from scratch, and I am a bit confused about how the training works.
While preparing the TF records I noticed this file inside the Inception model data directory: "imagenet_metadata.txt". The file holds labels for 21842 classes, yet the training script and the "imagenet_lsvrc_2015_synsets.txt" file only work for 1000 classes.
I am wondering what modifications I need to make to train the model on the 21K classes instead of the 1K ones?
It's quite straightforward with slim. To train ImageNet-21K with slim I recommend the following steps:
1. In the tf_models/slim/datasets folder, create a copy of the imagenet.py file (for example imgnet.py). In the new file, change the required variables to your desired values:
_FILE_PATTERN = ####your tfrecord file pattern, for me 'imgnet_%s_*.tfrecord'
_SPLITS_TO_SIZES = {
    'train': ####number of training samples,
    'validation': ####number of validation samples,
}
_NUM_CLASSES = 21841
*The wordnet synsets file contains 21842 entries, but the total number of classes in ImageNet-21K is 21841 (n04399382 is missing), so be sure about the total number of available classes.
*You also need to make a small modification to the code in order to load the synset files from your local path:
base_url = '/home/snf/libraries/tf_models/slim'
synset_url = '{}/listOfTags.txt'.format(base_url)
synset_to_human_url = '{}/imagenet21k_metadata.txt'.format(base_url)
2. Add the new dataset to dataset_factory.py in tf_models/slim/datasets:
from datasets import imgnet
datasets_map = {
    'cifar10': cifar10,
    'flowers': flowers,
    'imagenet': imagenet,
    'mnist': mnist,
    'imgnet': imgnet,  # add this line to datasets_map
}
3. In tf_models/slim/, create a Train_Imgnet.sh file containing these lines:
TRAIN_DIR=trained/imgnet-inception-v4
DATASET_DIR=/media/where/tfrecords/saved/tfRecords-fall11_21k
CUDA_VISIBLE_DEVICES="0,1,2,3" python train_image_classifier.py \
--train_dir=${TRAIN_DIR} \
--dataset_name=imgnet \
--dataset_split_name=train \
--dataset_dir=${DATASET_DIR} \
--model_name=inception_v4 \
--max_number_of_steps=10000000 \
--batch_size=32 \
--learning_rate=0.01 \
--learning_rate_decay_type=fixed \
--save_interval_secs=60 \
--save_summaries_secs=60 \
--log_every_n_steps=100 \
--optimizer=rmsprop \
--weight_decay=0.00004 \
--num_readers=12 \
--num_clones=4
Set the file to be executable (chmod +x Train_Imgnet.sh) and run it (./Train_Imgnet.sh).