Why does loading 4000 images into Redis using spark-submit take longer (9 minutes) than loading the same images into HBase (2.5 minutes)?

Loading images into Redis should be much faster than loading them into HBase, since Redis keeps its data in RAM while HBase stores it in HDFS. I was surprised that loading 4000 images into Redis took 9 minutes, while the same process with HBase took only 2.5 minutes. Is there an explanation for this? Any suggestions for improving my code? Here is my code:
// The code for loading the images into HBase (adopted from NIST)
val conf = new SparkConf().setAppName("Fingerprint.LoadData")
val sc = new SparkContext(conf)
Image.dropHBaseTable()
Image.createHBaseTable()
val checksum_path = args(0)
println("Reading paths from: %s".format(checksum_path.toString))
val imagepaths = loadImageList(checksum_path)
println("Got %s images".format(imagepaths.length))
imagepaths.foreach(println)
println("Reading files into RDD")
val images = sc.parallelize(imagepaths).map(paths => Image.fromFiles(paths._1, paths._2))
println(s"Saving ${images.count} images to HBase")
Image.toHBase(images)
println("Done")
def toHBase(rdd: RDD[T]): Unit = {
val cfg = HBaseConfiguration.create()
cfg.set(TableOutputFormat.OUTPUT_TABLE, tableName)
val job = Job.getInstance(cfg)
// The code for loading the images into Redis
val images = sc.parallelize(imagepaths).map(paths => Image.fromFiles(paths._1, paths._2)).collect
for(i <- images){
val stringRdd = sc.parallelize(Seq((i.uuid, new String(i.Png, StandardCharsets.UTF_8))))
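Separately from the timing question, the Redis snippet converts the PNG bytes with `new String(i.Png, StandardCharsets.UTF_8)`. PNG data is arbitrary binary, and decoding arbitrary bytes as UTF-8 is generally lossy. The following is a minimal sketch using only the JDK; the object and method names are made up for illustration, and Base64 is just one possible lossless text encoding, not something the original code uses.

```scala
import java.nio.charset.StandardCharsets
import java.util.Base64

object BinaryAsString {
  // The 8-byte PNG signature. Its first byte (0x89) is not a valid
  // standalone UTF-8 sequence.
  val pngHeader: Array[Byte] =
    Array(0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A).map(_.toByte)

  // Round trip through a UTF-8 String, as the Redis snippet does on the way in.
  def viaUtf8(bytes: Array[Byte]): Array[Byte] =
    new String(bytes, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8)

  // Round trip through a Base64 String, which preserves every byte.
  def viaBase64(bytes: Array[Byte]): Array[Byte] =
    Base64.getDecoder.decode(Base64.getEncoder.encodeToString(bytes))
}
```

With this sketch, `BinaryAsString.viaUtf8(BinaryAsString.pngHeader)` no longer matches the input, because the invalid byte is replaced during decoding, while `viaBase64` returns the original bytes unchanged.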


Writing a huge DataFrame iterator to S3: is there any configuration to reclaim memory after a write?

I have this method that writes huge DataFrames to S3. It runs in a Docker container that has been allocated 30 GB of memory, but it causes an OOM on the node and the task dies. Is there a Spark configuration that can be set, and is there a more efficient way to do this? cardsWithdfs can hold 9 DataFrames, or as many as 72, and each DataFrame can be 3 GB in size. I want to know why memory keeps going up even though I am processing only two DataFrames at a time, and how I can do a clean-up to reclaim the memory.
def writeParquetData(basePath: String, batchNumber: Int, resultCatalogCard: ResultsCatalogCard, cardsWithdfs: Iterable[(Int, DataFrame, String)], errorCategory: String, sortColumn: String, partitionColumns: Seq[String]): Iterable[(Int, String, Boolean, String)] = {
  var outLevelIterator: mutable.MutableList[(Int, String, Boolean, String)] = mutable.MutableList.empty
  cardsWithdfs.grouped(2).foreach( batch => {
    val writeDataFrames = Future.traverse(batch) {
      cardDataFrameCodeTriple =>
        Future {
          val (outputLevelCode, df, perspCode) = cardDataFrameCodeTriple
          Try {
            if (!df.isEmpty) {
              saveResultsSortedDesc(df, partitionColumns, Seq(sortColumn), ulfPathLocation)
              (outputLevelCode, pathWithouttenant, false, perspCode)
            } else (outputLevelCode, pathWithouttenant, true, perspCode) // df is empty path does not matter
          } match {
            case Success(x) => {
              val (outLevel, ulfPath, ignoreSchema, perspective) = x
              logger.info(s"RUN: writeParquetData ulfPathLocation written for workflowId: ${workflowId}, analysisId: ${analysisId} outputLevel: ${outputLevelCode} lossPath: ${ulfPath} ignoreSchema ${ignoreSchema},perspective ${perspective}")
              //Add to iterator
            }
            case Failure(ex) => this.synchronized {
              resultsMetadataClient.update(resultCatalogCard.id.toInt, s"""{"$errorCategory": "FAILED", "$CONVERTED_TO_ULF_MIGRATION_ERROR": "FAILED - analysisId: ${analysisId} runningWorkflowId: $runningWorkflowId with ${ex.getMessage}"}""")
              logger.warn(s"Exception Analysis readParquetData df for workflowId: ${workflowId}, analysisId: ${analysisId} outputLevel: ${outputLevelCode} batchNumber: ${batchNumber} failed with ${ex.getMessage} StackTrace: ${ex.printStackTrace()}", ex)
            }
          }
        }
    }
    Await.result(writeDataFrames, Duration.Inf)
  })
}
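The batching pattern in the method above (`grouped(2)`, `Future.traverse`, then `Await.result`) can be looked at in isolation from Spark. The sketch below is only an illustration using the Scala standard library: `fakeWrite` is a hypothetical stand-in for `saveResultsSortedDesc`, and the object and counter names are invented. It shows that at most one batch of futures is in flight at a time, and it returns the per-item results from the futures rather than appending to a shared mutable list from several threads.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success, Try}

object BatchedWrites {
  private val lock = new Object
  private var inFlight = 0
  private var peak = 0

  // Highest number of writes that were running at the same time.
  def peakInFlight: Int = lock.synchronized(peak)

  // Hypothetical stand-in for saveResultsSortedDesc: tracks concurrency,
  // sleeps briefly, and fails for negative ids.
  def fakeWrite(id: Int): Unit = {
    lock.synchronized { inFlight += 1; peak = math.max(peak, inFlight) }
    try {
      Thread.sleep(20)
      if (id < 0) throw new IllegalArgumentException(s"bad id $id")
    } finally lock.synchronized { inFlight -= 1 }
  }

  // Runs the writes batch by batch; each batch is awaited before the next starts.
  def writeAll(ids: Iterable[Int], batchSize: Int): List[(Int, Boolean)] =
    ids.grouped(batchSize).flatMap { batch =>
      val writes = Future.traverse(batch.toList) { id =>
        Future {
          Try(fakeWrite(id)) match {
            case Success(_) => (id, true)
            case Failure(_) => (id, false)
          }
        }
      }
      Await.result(writes, Duration.Inf)
    }.toList
}
```

This only bounds how many writes run concurrently on the driver side. It does not by itself say anything about how much memory Spark holds on to for each DataFrame, which is the open question in the post.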

PDFBox does not clean up tmp files after the convertToImage method

I use PDFBox functions such as convertToImage, and everything works fine, except that PDFBox does not clear its temporary files after the conversion. On my system, the temporary-file directory "/tmp" accumulates many files such as +~JF132216249314633400.tmp. They are deleted only when I restart my application; while the application keeps running, the temporary files are never deleted.
PDFBox version - 1.8.15
When I use this:
page.convertToImage(BufferedImage.TYPE_INT_RGB, 300)
the PDFBox library creates tmp files such as "+~JF132216249314633400.tmp".
My method:
def splitPdfToImages(file: File): List[File] = {
val document = PDDocument.load(file)
val pages = (for (i <- 0 until document.getNumberOfPages)
yield document.getDocumentCatalog.getAllPages.get(i).asInstanceOf[PDPage]).toList
val imgFiles = pages.zipWithIndex.map { case (page, i) =>
val baos = IOUtils.createBAOS
ImageIO.write(page.convertToImage(BufferedImage.TYPE_INT_RGB, 300), "jpg", baos)
val bais = IOUtils.createBAIS(baos.toByteArray)
try {
val img = Image.fromStream(bais)
implicit val writer = JpegWriter().withCompression(100)
val tmpFile = File.createTempFile(s"""${file.getName.split("\\.").head}_$i""", file.getName.split("\\.").last)
} finally {
Please help me to solve this issue.
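In the part of splitPdfToImages that is visible above, the PDDocument returned by `PDDocument.load(file)` is never closed. Whether that is what leaves the `+~JF...tmp` files behind is not established here, but closing the document deterministically is a reasonable first thing to rule out. The helper below is a minimal loan-pattern sketch using only the JDK; `withResource` is a made-up name, and applying it to `PDDocument` assumes that class implements `java.io.Closeable` in the version in use.

```scala
import java.io.Closeable

object Resources {
  // Runs `body` with the resource and always closes it afterwards,
  // whether `body` returns normally or throws.
  def withResource[R <: Closeable, A](resource: R)(body: R => A): A =
    try body(resource)
    finally resource.close()
}
```

Under that assumption, usage would look like `Resources.withResource(PDDocument.load(file)) { document => ... }`, with the page conversion done inside the block.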

Programmatically creating DStreams in Apache Spark

I am writing some self-contained integration tests around Apache Spark Streaming.
I want to test that my code can ingest all kinds of edge cases in my simulated test data.
When I was doing this with regular RDDs (not streaming), I could take my inline data and call "parallelize" on it to turn it into a Spark RDD.
However, I can find no such method for creating DStreams. Ideally I would like to call some "push" function once in a while and have the tuple magically appear in my DStream.
At the moment I'm doing this with Apache Kafka: I create a temp queue and write to it. But this seems like overkill. I'd much rather create the test DStream directly from my test data without having to use Kafka as a mediator.
For testing purposes, you can create an input stream from a queue of RDDs.
Pushing more RDDs in the queue will simulate having processed more events in the batch interval.
val sc = SparkContextHolder.sc
val ssc = new StreamingContext(sc, Seconds(1))
val inputData: mutable.Queue[RDD[Int]] = mutable.Queue()
val inputStream: InputDStream[Int] = ssc.queueStream(inputData)
inputData += sc.makeRDD(List(1, 2)) // Emulate the RDD created during the first batch interval
inputData += sc.makeRDD(List(3, 4)) // 2nd batch interval
// etc
val result = inputStream.map(x => x*x)
result.foreachRDD(rdd => assertSomething(rdd))
ssc.start() // Don't forget to start the streaming context
In addition to Raphael's solution, you may also want to choose between processing one queued RDD per batch and processing everything available at once. Set the optional oneAtATime argument of queueStream accordingly, as shown below:
val slideDuration = Milliseconds(100)
val conf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[8]")
val sparkSession: SparkSession = SparkSession.builder.config(conf).getOrCreate()
val sparkContext: SparkContext = sparkSession.sparkContext
val queueOfRDDs = mutable.Queue[RDD[String]]()
val streamingContext: StreamingContext = new StreamingContext(sparkContext, slideDuration)
val rddOneQueuesAtATimeDS: DStream[String] = streamingContext.queueStream(queueOfRDDs, oneAtATime = true)
val rddFloodOfQueuesDS: DStream[String] = streamingContext.queueStream(queueOfRDDs, oneAtATime = false)
for (i <- (1 to 10)) {
  queueOfRDDs += sparkContext.makeRDD(simplePurchase(i))
  queueOfRDDs += sparkContext.makeRDD(simplePurchase((i + 3) * (i + 3)))
}
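To make the effect of the oneAtATime flag concrete, here is a small plain-Scala model of how a queue is drained per batch interval. It is not Spark code: `Seq[A]` stands in for an RDD, and `QueueStreamModel.drain` is a hypothetical name. It models one stream per queue, with one queued item emitted per interval when oneAtATime is true, and everything queued emitted together when it is false.

```scala
import scala.collection.mutable

object QueueStreamModel {
  // Returns what would be emitted in each of `intervals` simulated batch intervals.
  def drain[A](queue: mutable.Queue[Seq[A]], oneAtATime: Boolean, intervals: Int): List[Seq[A]] =
    (1 to intervals).map { _ =>
      if (queue.isEmpty) Seq.empty[A]          // nothing queued: empty batch
      else if (oneAtATime) queue.dequeue()     // take exactly one queued "RDD"
      else {                                   // take everything, combined into one batch
        val all = queue.toList.flatten
        queue.clear()
        all
      }
    }.toList
}
```

For a queue holding `Seq(1, 2)` and `Seq(3, 4)`, three intervals with `oneAtATime = true` yield the two sequences in turn followed by an empty batch, while `oneAtATime = false` yields `Seq(1, 2, 3, 4)` in the first interval and empty batches afterwards.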
I found this base example:
The key here is calling the "store" command. Replace the contents of store with whatever you want.

Put an ObjectOutputStream on AWS S3

I use Spark MLlib to create a linear regression model. I then tried to save the model to an ObjectOutputStream, so that I could put it on S3 and read it back later. The following is my code:
val algorithm = new LinearRegressionWithSGD()
val model = algorithm.run(trainingData)
val credentials = new BasicAWSCredentials("myKey", "mySecretKey");
val s3Client = new AmazonS3Client(credentials);
val oos = new ObjectOutputStream(new FileOutputStream("myModelFile"));
s3Client.putObject("myBucket", "myPath", oos)
Then I got complaints at this line:
s3Client.putObject("myBucket", "myPath", oos)
What did I miss and how can I fix it? Thanks a lot!
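Two things stand out in the snippet: the model is never written to `oos`, and `putObject` is handed an output stream, whereas an upload needs something it can read from. One possible approach, sketched below with only the JDK, is to serialize the object into a byte array first. The names `ModelBytes`, `serialize`, and `deserialize` are made up for this example, and it assumes the model object is `java.io.Serializable`.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object ModelBytes {
  // Writes the object into an in-memory buffer and returns the bytes.
  def serialize(obj: java.io.Serializable): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(buffer)
    try oos.writeObject(obj)
    finally oos.close()
    buffer.toByteArray
  }

  // Reads an object back from bytes produced by serialize.
  def deserialize[A](bytes: Array[Byte]): A = {
    val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try ois.readObject().asInstanceOf[A]
    finally ois.close()
  }
}
```

The resulting bytes can then be wrapped in a `ByteArrayInputStream` and uploaded through the `putObject` overload that takes an `InputStream` together with an `ObjectMetadata` (with the content length set to the byte count), or written to a local file and uploaded through the overload that takes a `File`.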

Pig Latin Joda-Time error with StanfordCoreNLP

I am trying to create a Pig UDF that extracts the locations mentioned in a tweet using the Stanford CoreNLP package interfaced through the sista Scala API. It works fine when run locally with 'sbt run', but throws a "java.lang.NoSuchMethodError" exception when called from Pig:
Loading default properties from tagger
Reading POS tagger model from
Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz
2013-06-14 10:47:54,952 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > reduce
done [7.5 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ...
2013-06-14 10:48:02,108 [Low Memory Detector] INFO org.apache.pig.impl.util.SpillableMemoryManager - first memory handler call - Collection threshold init = 18546688(18112K) used = 358671232(350264K) committed = 366542848(357952K) max = 699072512(682688K)
done [5.0 sec].
Loading classifier from ...
2013-06-14 10:48:10,522 [Low Memory Detector] INFO org.apache.pig.impl.util.SpillableMemoryManager - first memory handler call- Usage threshold init = 18546688(18112K) used = 590012928(576184K) committed = 597786624(583776K) max = 699072512(682688K)
done [5.6 sec].
2013-06-14 10:48:11,469 [Thread-11] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0001
at edu.stanford.nlp.time.SUTime$Duration.compareTo(SUTime.java:3406)
at edu.stanford.nlp.time.SUTime$Duration.max(SUTime.java:3488)
at edu.stanford.nlp.time.SUTime$Time.difference(SUTime.java:1308)
at edu.stanford.nlp.time.SUTime$Range.<init>(SUTime.java:3793)
at
Here is the relevant code:
object CountryTokenizer {
def tokenize(text: String): String = {
val locations = TweetEntityExtractor.NERLocationFilter(text)
locations.map(x => Cities.country(x)).flatten.mkString(" ")
class PigCountryTokenizer extends EvalFunc[String] {
override def exec(tuple: Tuple): java.lang.String = {
val text: java.lang.String = Util.cast[java.lang.String](tuple.get(0))
object TweetEntityExtractor {
val processor:Processor = new CoreNLPProcessor()
def NERLocationFilter(text: String): List[String] = {
val doc = processor.mkDocument(text)
val locations = doc.sentences.map(sentence => {
val entities = sentence.entities.map(List.fromArray(_)) match {
case Some(l) => l
case _ => List()
val words = List.fromArray(sentence.words)
(words zip entities).filter(x => {
x._1 != "" && x._2 == "LOCATION"
I am using sbt-assembly to construct a fat-jar, and so the joda-time jar file should be accessible. What is going on?
Pig ships with its own version of joda-time (1.6), which is incompatible with 2.x.
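One way to check which copy of a class actually gets loaded at runtime is to ask the JVM where the class came from. The helper below is a minimal sketch using only the JDK; `ClassOrigin.originOf` is a made-up name. It returns the jar or directory a class was loaded from, or `None` for classes loaded by the JDK's bootstrap loader.

```scala
object ClassOrigin {
  // Location (jar or directory) the class was loaded from, if the JVM exposes one.
  def originOf(cls: Class[_]): Option[String] =
    for {
      domain   <- Option(cls.getProtectionDomain)
      source   <- Option(domain.getCodeSource)
      location <- Option(source.getLocation)
    } yield location.toString
}
```

Logging `ClassOrigin.originOf(Class.forName("org.joda.time.DateTime"))` from inside the UDF's `exec` method should show whether the class is coming from the fat jar or from a jar on Pig's own classpath.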