PySpark DataFrame does not quote data on save

I am trying to save a file to HDFS using the com.databricks.spark.csv package, but it does not quote my data even though I set the quote option.
What am I doing wrong?
df.write.format('com.databricks.spark.csv').mode('overwrite').option("header", "false").option("quote","\"").save(output_path)
I am launching it with --packages com.databricks:spark-csv_2.10:1.5.0.
output:
john,doo,male
expected:
"john","doo","male"

In Spark >= 2.x you should use the quoteAll option:
df.write \
.format('com.databricks.spark.csv') \
.mode('overwrite') \
.option("header", "false") \
.option("quoteAll","true") \
.save(output_path)
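If you are on Spark 2.x, the csv source is also built into Spark itself, so the same write can be expressed without the external package. A minimal sketch under that assumption:
# Built-in CSV writer (Spark >= 2.0); quoteAll forces quoting of every field.
df.write \
.mode('overwrite') \
.option("header", "false") \
.option("quoteAll", "true") \
.csv(output_path)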

Pyspark df.count() Why does it work with only one executor?

I am trying to read data from Kafka and get a count of the records.
It takes a long time because it runs with only one executor. How can I increase the parallelism?
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('oracle_read_test') \
.config("spark.driver.memory", "30g") \
.config("spark.driver.maxResultSize", "64g") \
.config("spark.executor.cores", "10") \
.config("spark.executor.instances", "15") \
.config('spark.executor.memory', '30g') \
.config('num-executors', '20') \
.config('spark.yarn.executor.memoryOverhead', '32g') \
.config("hive.exec.dynamic.partition", "true") \
.config("orc.compress", "ZLIB") \
.config("hive.merge.smallfiles.avgsize", "40000000") \
.config("hive.merge.size.per.task", "209715200") \
.config("dfs.blocksize", "268435456") \
.config("hive.metastore.try.direct.sql", "true") \
.config("spark.sql.orc.enabled", "true") \
.config("spark.dynamicAllocation.enabled", "false") \
.config("spark.sql.sources.partitionOverwriteMode","dynamic") \
.getOrCreate()
df = spark.read.format("kafka") \
.option("kafka.bootstrap.servers","localhost:9092") \
.option("includeHeaders","true") \
.option("subscribe","test") \
.load()
df.count()
How many partitions does your topic have? If it has only one, you cannot get more parallelism: Spark creates one input partition per Kafka topic partition, so a single topic partition means a single task on a single executor.
Otherwise, --num-executors exists as a flag to spark-submit.
Also, this code only counts the records returned in one batch, not the entire topic. Counting the entire topic would take even longer.
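As a quick sanity check (a sketch, not part of the original answer), you can inspect how many input partitions the batch read produced; by default there is one Spark partition per Kafka topic partition. If your Spark version's Kafka source supports the minPartitions option, you can also ask it to split the read into more slices:
# One Spark input partition per Kafka topic partition by default.
print(df.rdd.getNumPartitions())

# Hypothetical: request more input slices if the source supports minPartitions.
df = spark.read.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "test") \
.option("minPartitions", "20") \
.load()
df.count()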

Sample Input from file

I am trying to create the input for rules from a sample file. The sample file contains a column SampleID, which should be used as the sample wildcard. For each SampleID I want to extract the paths of the normal and tumor BAMs from the columns Path_Normal and Path_Tumor of the data frame.
I tried this:
import pandas as pd
input_table = "sampletable.tsv"
samples = pd.read_table(input_table).set_index("SampleID", drop=False)
rule all:
    input:
        expand("/directory/sm_mutect2_paired/vcf/{sample}.mt2.vcf", sample=samples.index)

rule Mutect2:
    input:
        tumor = samples[samples['SampleID']=="{sample}"]['Path_Tumor'],
        normal = samples[samples['SampleID']=="{sample}"]['Path_Normal']
    output:
        "/directory/sm_mutect2_paired/vcf/{sample}.mt2.vcf"
    conda:
        "envs/gatk_mutect2_paired.yaml"
    shell:
        "gatk --java-options '-Xmx16G -XX:+UseParallelGC -XX:ParallelGCThreads=16' Mutect2 \
        -R /directory/ref/genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta \
        {input.tumor} \
        {input.normal} \
        -L /directory/GATK_interval_files_Agilent/S07604514_hs_hg38/S07604514_Covered.bed \
        -O {output} \
        --af-of-alleles-not-in-resource 2.5e-06 \
        --germline-resource /directory/GATK_gnomad/af-only-gnomad.hg38.vcf.gz \
        -pon /home/zyto/unger/GATK_PoN/1000g_pon.hg38.vcf.gz"
...
When doing a dry run I do not get an error message, but the execution fails because the input is empty, which becomes clear when looking at the log:
gatk --java-options '-Xmx16G -XX:+UseParallelGC -XX:ParallelGCThreads=16' Mutect2 -R /directory/GATK_ref/genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta -L /directory/GATK_interval_files_Agilent/S07604514_hs_hg38/S07604514_Covered.bed -O /directory/WES_Rezidiv_HNSCC_Clonality/sm_mutect2_paired/vcf/HL05_Rez_HL05_NG.mt2.vcf --af-of-alleles-not-in-resource 2.5e-06 --germline-resource /directory/GATK_gnomad/af-only-gnomad.hg38.vcf.gz -pon /directory/GATK_PoN/1000g_pon.hg38.vcf.gz
The two input files should appear between "Mutect2" and "-R".
So it looks like I am doing something wrong when defining the inputs...
You need to defer the determination of the input files of that rule to the so-called DAG phase, when jobs and wildcard values are known. This works via input functions. I would strongly recommend doing the official Snakemake tutorial, which covers this topic in depth.
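For illustration, a minimal sketch of the input-function approach (the column names come from the question; the exact lookup is an assumption about your sample table):
def get_tumor(wildcards):
    # Look up the tumor BAM path for the current value of the sample wildcard.
    return samples.loc[wildcards.sample, "Path_Tumor"]

def get_normal(wildcards):
    # Look up the normal BAM path for the current value of the sample wildcard.
    return samples.loc[wildcards.sample, "Path_Normal"]

rule Mutect2:
    input:
        tumor = get_tumor,
        normal = get_normal
    output:
        "/directory/sm_mutect2_paired/vcf/{sample}.mt2.vcf"
    # conda and shell sections unchanged from the question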

Can't get $TPU_NAME environment variable to work properly

I'm a newbie! I'm trying to train a BERT model from scratch on a Kaggle kernel. I can't get the BERT run_pretraining.py script to work on TPUs, although it works fine on CPUs. I'm guessing the issue is with the $TPU_NAME environment variable.
!python run_pretraining.py \
--input_file='gs://xxxxxxxxxx/*' \
--output_dir=/kaggle/working/model/ \
--do_train=True \
--do_eval=True \
--bert_config_file=/kaggle/input/bert-bangla-test-config/config.json \
--train_batch_size=32 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--num_train_steps=20 \
--num_warmup_steps=2 \
--learning_rate=2e-5 \
--use_tpu=True \
--tpu_name=$TPU_NAME
If the script is using tf.distribute.cluster_resolver.TPUClusterResolver (https://www.tensorflow.org/api_docs/python/tf/distribute/cluster_resolver/TPUClusterResolver), then you can simply instantiate the TPUClusterResolver without any arguments, and it will automatically pick up the TPU_NAME (https://github.com/tensorflow/tensorflow/blob/v2.3.0/tensorflow/python/tpu/client/client.py#L47).
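For example, a minimal sketch of that auto-discovery (assuming TensorFlow 2.x on a TPU-enabled kernel):
import tensorflow as tf

# With no arguments, the resolver picks up the TPU address from the
# environment (e.g. TPU_NAME) instead of needing --tpu_name on the CLI.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)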
Okay, I found a rookie solution :P
run:
import os
os.environ
From the returned dictionary you can get the address; just copy-paste it. It will be in the form 'TPU_NAME': 'grpc://xxxxxxx'.
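Equivalently (a small sketch, not from the original answer), you can read just that one variable instead of scanning the whole dictionary:
import os

# Pull the TPU address straight from the environment.
tpu_address = os.environ.get("TPU_NAME")  # e.g. 'grpc://xxxxxxx'
print(tpu_address)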

Tensorflow Slim Imagenet training

I am trying to prepare the data to train an ImageNet model from scratch, and I am a bit confused about how the training works.
While preparing the TF records I noticed this file inside the Inception model data directory: "imagenet_metadata.txt". The file holds labels for 21842 classes, yet the training script and the "imagenet_lsvrc_2015_synsets.txt" file only work with 1000 classes.
I am wondering what modifications I need to make to train the model on the 21K classes rather than the 1K ones?
It's quite straightforward with slim. To train ImageNet-21k with slim I recommend the following steps:
1. In the tf_models/slim/datasets folder, create a copy of the imagenet.py file (for example imgnet.py). In the new file, change the required variables to your desired values:
_FILE_PATTERN = 'imgnet_%s_*.tfrecord'  # your tfrecord file pattern
_SPLITS_TO_SIZES = {
    'train': <number of training samples>,
    'validation': <number of validation samples>,
}
_NUM_CLASSES = 21841
* The WordNet synset file contains 21842 entries, but the total number of classes in ImageNet-21k is 21841 (n04399382 is missing), so be sure about the total number of available classes.
* You also need a small modification to the code so that it loads the synset files from your local path:
base_url = '/home/snf/libraries/tf_models/slim'
synset_url = '{}/listOfTags.txt'.format(base_url)
synset_to_human_url = '{}/imagenet21k_metadata.txt'.format(base_url)
2. Add the new dataset to dataset_factory.py in tf_models/slim/datasets:
from datasets import imgnet

datasets_map = {
    'cifar10': cifar10,
    'flowers': flowers,
    'imagenet': imagenet,
    'mnist': mnist,
    'imgnet': imgnet,  # add this line to datasets_map
}
3. In tf_models/slim/, create a Train_Imgnet.sh file containing these lines:
TRAIN_DIR=trained/imgnet-inception-v4
DATASET_DIR=/media/where/tfrecords/saved/tfRecords-fall11_21k
CUDA_VISIBLE_DEVICES="0,1,2,3" python train_image_classifier.py \
--train_dir=${TRAIN_DIR} \
--dataset_name=imgnet \
--dataset_split_name=train \
--dataset_dir=${DATASET_DIR} \
--model_name=inception_v4 \
--max_number_of_steps=10000000 \
--batch_size=32 \
--learning_rate=0.01 \
--learning_rate_decay_type=fixed \
--save_interval_secs=60 \
--save_summaries_secs=60 \
--log_every_n_steps=100 \
--optimizer=rmsprop \
--weight_decay=0.00004 \
--num_readers=12 \
--num_clones=4
Make the file executable (chmod +x Train_Imgnet.sh) and run it (./Train_Imgnet.sh).

Querying redshift via spark-redshift is not fast

I connect to Redshift from pyspark using spark-redshift, i.e.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

sparkConf = SparkConf()
sc = SparkContext(conf=sparkConf)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", 'AWS_KEY_ID')
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", 'AWS_KEY')
sql_context = SQLContext(sc)
sql_context.getConf("spark.sql.shuffle.partitions", u"5")
df = sql_context.read \
.format("com.databricks.spark.redshift") \
.option("url", "jdbc:redshift://example.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd") \
.option("dbtable", "table_name") \
.option('forward_spark_s3_credentials',True) \
.option("tempdir", "s3n://bucket") \
.load()
When I compare the run time of a query, e.g. one returning 300k rows, run via pyspark vs. directly on Redshift, I find no difference.
I read that spark.sql.shuffle.partitions should be lowered from its default of 200 depending on the size of the DataFrame.
What are the important configurations I should check, i.e. which ones have people found to make a difference?
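As a side note on that setting (a sketch, not part of the original question): SQLContext.getConf only reads a value, returning the supplied default if it is unset, so to actually change the shuffle partition count you would set it explicitly:
# Actually set the shuffle partition count instead of just reading it.
sql_context.setConf("spark.sql.shuffle.partitions", "5")

# Or configure it up front on the SparkConf before creating the context:
sparkConf = SparkConf().set("spark.sql.shuffle.partitions", "5")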