Pyspark df.count(): why does it work with only one executor?

I am trying to read data from Kafka and get a count of it.
It takes a long time because it runs with only one executor. How can I increase the number of executors doing the work?
spark = SparkSession.builder.appName('oracle_read_test') \
    .config("spark.driver.memory", "30g") \
    .config("spark.driver.maxResultSize", "64g") \
    .config("spark.executor.cores", "10") \
    .config("spark.executor.instances", "15") \
    .config('spark.executor.memory', '30g') \
    .config('num-executors', '20') \
    .config('spark.yarn.executor.memoryOverhead', '32g') \
    .config("hive.exec.dynamic.partition", "true") \
    .config("orc.compress", "ZLIB") \
    .config("hive.merge.smallfiles.avgsize", "40000000") \
    .config("hive.merge.size.per.task", "209715200") \
    .config("dfs.blocksize", "268435456") \
    .config("hive.metastore.try.direct.sql", "true") \
    .config("spark.sql.orc.enabled", "true") \
    .config("spark.dynamicAllocation.enabled", "false") \
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic") \
    .getOrCreate()
df = spark.read.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("includeHeaders", "true") \
    .option("subscribe", "test") \
    .load()
df.count()

How many partitions does your topic have? If it has only one, Spark will read it with a single task, so extra executors will not help.
Otherwise, --num-executors exists as a flag to spark-submit.
Also, this code only counts the records returned in one batch, not the entire topic. Counting the entire topic would take even longer.
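As a quick check you can look at how many Spark partitions the Kafka read produced; it normally matches the number of partitions of the topic. If the topic has more than one partition and your Spark version supports it, the Kafka source also accepts a minPartitions option to split the data further. A sketch reusing the session and topic above:

print(df.rdd.getNumPartitions())  # usually equals the number of Kafka topic partitions

# Only if your Spark version supports the Kafka minPartitions option (Spark 2.4+):
df = spark.read.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test") \
    .option("minPartitions", "20") \
    .load()
print(df.rdd.getNumPartitions())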

Related

Can't get $TPU_NAME environment variable to work properly

I'm a newbie! I'm trying to train BERT model from scratch on a Kaggle kernel. Can't get the BERT run_pretraining.py script to work on TPUs. Works fine on CPUs though. I'm guessing the issue is with the $TPU_NAME environment variable.
!python run_pretraining.py \
--input_file='gs://xxxxxxxxxx/*' \
--output_dir=/kaggle/working/model/ \
--do_train=True \
--do_eval=True \
--bert_config_file=/kaggle/input/bert-bangla-test-config/config.json \
--train_batch_size=32 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--num_train_steps=20 \
--num_warmup_steps=2 \
--learning_rate=2e-5 \
--use_tpu=True \
--tpu_name=$TPU_NAME
If the script is using tf.distribute.cluster_resolver.TPUClusterResolver (https://www.tensorflow.org/api_docs/python/tf/distribute/cluster_resolver/TPUClusterResolver), then you can simply instantiate the TPUClusterResolver without any arguments, and it will automatically pick up TPU_NAME (https://github.com/tensorflow/tensorflow/blob/v2.3.0/tensorflow/python/tpu/client/client.py#L47).
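A minimal sketch of that approach with the TF 2.x API, assuming TPU_NAME is already set in the environment:

import tensorflow as tf

# No tpu= argument: the resolver falls back to the TPU_NAME environment variable.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)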
Okay, I found a rookie solution :P
run:
import os
os.environ
from the returned dictionary, you can get the address. Just copy paste it or something. It'll be in the form 'TPU_NAME': 'grpc://xxxxxxx'.
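If you only want that one entry rather than the whole dictionary, something like this works (it just reads the environment variable, nothing TPU-specific):

import os
tpu_address = os.environ.get('TPU_NAME')  # e.g. 'grpc://xxxxxxx'
print(tpu_address)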

output of one rule as input of another

I am new to Snakemake and I'm trying to write a complex pipeline with many steps and branching points. One of the earlier steps is a STAR alignment.
I want to use the genome alignment for some things and the transcriptome alignment for others. The rule produces two output files, and I want to use each of them as input for other rules in Snakemake.
If possible, I want to avoid hard-coding the actual filenames and let Snakemake deal with it for me.
rule star:
    input:
        reads=samples.iloc[0,1].split(",")
    output:
        tx_align=temp("/".join([output_directory, "star", samplename+"Aligned.toTranscriptome.out.bam"])),
        genome_align="/".join([output_directory, "star", samplename+"Aligned.sortedByCoord.out.bam"])
    params:
        index=config["resources"]["star_index"],
        gtf=config["resources"]["gtf"],
        prefix="/".join([output_directory, "star", samplename])
    log: config["root_dir"]+"/"+str(samples["samplename"])+"star.log"
    threads: 10
    shell:
        """
        STAR --runMode alignReads \
            --runThreadN {threads} \
            --readFilesCommand zcat \
            --readFilesIn {input.reads} \
            --genomeDir {params.index} \
            --outFileNamePrefix {params.prefix} \
            --twopassMode Basic \
            --sjdbGTFfile {params.gtf} \
            --outFilterType BySJout \
            --limitSjdbInsertNsj 1200000 \
            --outSAMstrandField intronMotif \
            --outFilterIntronMotifs None \
            --alignSoftClipAtReferenceEnds Yes \
            --quantMode TranscriptomeSAM GeneCounts \
            --outSAMtype BAM SortedByCoordinate \
            --outSAMattrRGline ID:{samplename}, SM:sm1 \
            --outSAMattributes All \
            --outSAMunmapped Within \
            --outSAMprimaryFlag AllBestScore \
            --chimSegmentMin 15 \
            --chimJunctionOverhangMin 15 \
            --chimOutType Junctions \
            --chimMainSegmentMultNmax 1 \
            --genomeLoad NoSharedMemory
        """
So then I want to use something like
rule rsem:
    input:
        rules.star.output[0]
    output:
        somefile
    run:
        etc
I'm not even sure if this is possible.
EDIT:
Never mind, here is the solution:
rule rule1:
    input: some_input
    output:
        out1=output1,
        out2=output2
    shell:
        "command {input} {output.out1} {output.out2}"

rule rule2:
    input: rules.rule1.output.out1
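Applied to the STAR rule above, a downstream rule can reference the transcriptome BAM by its name instead of its index. A sketch (the output path and the command are placeholders, not a tested RSEM invocation):

rule rsem:
    input:
        tx_bam=rules.star.output.tx_align   # named transcriptome output of rule star
    output:
        "/".join([output_directory, "rsem", samplename + ".done"])
    shell:
        "echo 'would run RSEM on {input.tx_bam}' > {output}"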

Pyspark DataFrame does not quote data at save

I am trying to save a file to HDFS using the com.databricks.spark.csv package, but it does not quote my data, although I set the quote option.
What am I doing wrong?
df.write.format('com.databricks.spark.csv').mode('overwrite').option("header", "false").option("quote","\"").save(output_path)
I am calling using --packages com.databricks:spark-csv_2.10:1.5.0
output:
john,doo,male
expected:
"john","doo","male"
In Spark >= 2.X you should use the option quoteAll:
df.write \
    .format('com.databricks.spark.csv') \
    .mode('overwrite') \
    .option("header", "false") \
    .option("quoteAll", "true") \
    .save(output_path)
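If you are on Spark 2.x anyway, the built-in CSV writer supports the same quoteAll option, so the external package is not required. A sketch using the same DataFrame and output path as above:

df.write \
    .format("csv") \
    .mode("overwrite") \
    .option("header", "false") \
    .option("quoteAll", "true") \
    .save(output_path)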

Tensorflow Slim Imagenet training

I am trying to prepare the data to train an ImageNet model from scratch and I am a bit confused about how the training works.
While preparing the TFRecords I noticed this file inside the Inception model data directory: "imagenet_metadata.txt". The file holds labels for 21842 classes, yet the training script and the "imagenet_lsvrc_2015_synsets.txt" file only work with 1000 classes.
I am wondering what modifications I need to make to train the model on the 21k classes instead of the 1k ones?
It's quite straightforward with slim. To train ImageNet-21k with slim, I recommend the following steps:
1. In the tf_models/slim/datasets folder, create a copy of the imagenet.py file (for example imgnet.py). In the new file, change the required variables to your desired values:
_FILE_PATTERN = 'imgnet_%s_*.tfrecord'  # your TFRecord file pattern (this one is mine)
_SPLITS_TO_SIZES = {
    'train': ####Training Samples,
    'validation': ####Validation Samples,
}
_NUM_CLASSES = 21841
*The WordNet synset list contains 21842 entries, but the total number of classes in ImageNet-21k is 21841 (n04399382 is missing), so be sure about the total number of available classes (a quick sanity check is sketched after the next code snippet).
*Also, you need to make a small modification to the code so that it loads the synset files from your local path:
base_url = '/home/snf/libraries/tf_models/slim'
synset_url = '{}/listOfTags.txt'.format(base_url)
synset_to_human_url = '{}/imagenet21k_metadata.txt'.format(base_url)
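As a quick sanity check on the class count mentioned above, you can count the entries in the same local synset list used in the paths here:

with open('/home/snf/libraries/tf_models/slim/listOfTags.txt') as f:
    synsets = [line.strip() for line in f if line.strip()]
print(len(synsets))  # should match _NUM_CLASSES (21841)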
Add the new dataset to dataset_factory.py in tf_models/slim/datasets:
from datasets import imgnet

datasets_map = {
    'cifar10': cifar10,
    'flowers': flowers,
    'imagenet': imagenet,
    'mnist': mnist,
    'imgnet': imgnet,  # add this line to datasets_map
}
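Once it is registered there, the training script can resolve it by name just like the built-in datasets, roughly like this (the dataset directory path is illustrative):

from datasets import dataset_factory

# 'imgnet' is the key added to datasets_map above
dataset = dataset_factory.get_dataset('imgnet', 'train', '/path/to/tfRecords-fall11_21k')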
In tf_models/slim/ create a Train_Imgnet.sh file containing these lines:
TRAIN_DIR=trained/imgnet-inception-v4
DATASET_DIR=/media/where/tfrecords/saved/tfRecords-fall11_21k
CUDA_VISIBLE_DEVICES="0,1,2,3" python train_image_classifier.py \
    --train_dir=${TRAIN_DIR} \
    --dataset_name=imgnet \
    --dataset_split_name=train \
    --dataset_dir=${DATASET_DIR} \
    --model_name=inception_v4 \
    --max_number_of_steps=10000000 \
    --batch_size=32 \
    --learning_rate=0.01 \
    --learning_rate_decay_type=fixed \
    --save_interval_secs=60 \
    --save_summaries_secs=60 \
    --log_every_n_steps=100 \
    --optimizer=rmsprop \
    --weight_decay=0.00004 \
    --num_readers=12 \
    --num_clones=4
Make the file executable (chmod +x Train_Imgnet.sh) and run it (./Train_Imgnet.sh).

Querying redshift via spark-redshift is not fast

I connect to Redshift via pyspark using spark-redshift, i.e.
sparkConf = SparkConf()
sc = SparkContext(conf=sparkConf)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", 'AWS_KEY_ID')
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", 'AWS_KEY')

sql_context = SQLContext(sc)
sql_context.getConf("spark.sql.shuffle.partitions", u"5")

df = sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://example.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd") \
    .option("dbtable", "table_name") \
    .option('forward_spark_s3_credentials', True) \
    .option("tempdir", "s3n://bucket") \
    .load()
When I compare the run time of a query, e.g. reading 300k rows, between pyspark and Redshift directly, I find no difference.
I read that the configuration spark.sql.shuffle.partitions should be lowered from the default of 200 depending on the size of the DataFrame.
What are the important configurations I should check, and which ones have people actually seen make a difference?
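One detail worth noting about the snippet above: getConf only reads a configuration value (returning the default if it is unset); to actually change it you have to set it, roughly like this (the value 5 is just the number used above, not a recommendation):

# set it on an existing SQLContext
sql_context.setConf("spark.sql.shuffle.partitions", "5")

# or when building a SparkSession
# SparkSession.builder.config("spark.sql.shuffle.partitions", "5").getOrCreate()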