Writing a huge dataframe iterator to S3: any configuration to reclaim memory after write? [apache-spark-sql]

I have a method that writes huge DataFrames to S3. It runs in a Docker container that has been allocated 30 GB of memory, but the node hits an OOM and the task dies. Is there any Spark configuration that can be set, or a more efficient way to do this? cardsWithdfs can hold anywhere from 9 up to 72 frames. I want to know why memory keeps climbing even though I am processing only two dataframes at a time, and how I can perform a cleanup operation to reclaim the memory. A single DataFrame can be 3 GB in size.
def writeParquetData(basePath: String, batchNumber: Int, resultCatalogCard: ResultsCatalogCard, cardsWithdfs: Iterable[(Int, DataFrame, String)],
                     errorCategory: String, sortColumn: String, partitionColumns: Seq[String]): Iterable[(Int, String, Boolean, String)] = {
  var outLevelIterator: mutable.MutableList[(Int, String, Boolean, String)] = mutable.MutableList.empty
  cardsWithdfs.grouped(2).foreach( batch => {
    val writeDataFrames = Future.traverse(batch) { cardDataFrameCodeTriple =>
      Future {
        val (outputLevelCode, df, perspCode) = cardDataFrameCodeTriple
        //.....
        Try {
          if (!df.isEmpty) {
            saveResultsSortedDesc(df, partitionColumns, Seq(sortColumn), ulfPathLocation)
            (outputLevelCode, pathWithouttenant, false, perspCode)
          } else (outputLevelCode, pathWithouttenant, true, perspCode) // df is empty, path does not matter
        } match {
          case Success(x) => {
            val (outLevel, ulfPath, ignoreSchema, perspective) = x
            logger.info(s"RUN: writeParquetData ulfPathLocation written for workflowId: ${workflowId}, analysisId: ${analysisId} outputLevel: ${outputLevelCode} lossPath: ${ulfPath} ignoreSchema ${ignoreSchema}, perspective ${perspective}")
            // Add to iterator
          }
          case Failure(ex) => this.synchronized {
            resultsMetadataClient.update(resultCatalogCard.id.toInt, s"""{"$errorCategory": "FAILED", "$CONVERTED_TO_ULF_MIGRATION_ERROR": "FAILED - analysisId: ${analysisId} runningWorkflowId: $runningWorkflowId with ${ex.getMessage}"}""")
            logger.warn(s"Exception Analysis readParquetData df for workflowId: ${workflowId}, analysisId: ${analysisId} outputLevel: ${outputLevelCode} batchNumber: ${batchNumber} failed with ${ex.getMessage} StackTrace: ${ex.printStackTrace()}", ex)
          }
        }
      }
    }
    Await.result(writeDataFrames, Duration.Inf)
  })
  outLevelIterator
}
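One thing that can keep memory climbing between batches is that nothing releases the DataFrames once they are written. A minimal sketch of a cleanup step, not taken from the original code and only useful if the frames (or their lineage) are cached, is to unpersist each DataFrame after its batch completes (names such as saveResultsSortedDesc and ulfPathLocation are the question's own):

cardsWithdfs.grouped(2).foreach { batch =>
  val writes = Future.traverse(batch) { case (outputLevelCode, df, perspCode) =>
    Future {
      // same write as in the method above
      saveResultsSortedDesc(df, partitionColumns, Seq(sortColumn), ulfPathLocation)
      (outputLevelCode, perspCode)
    }
  }
  Await.result(writes, Duration.Inf)
  // Hypothetical cleanup: drop any cached blocks for this batch before starting the next one.
  batch.foreach { case (_, df, _) => df.unpersist(blocking = true) }
}

If the frames are not cached, the growth may instead come from driver-side bookkeeping (for example Spark UI/listener state, tunable via spark.ui.retainedJobs and spark.ui.retainedStages), which per-DataFrame cleanup will not address.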

Related

Is it possible to load constraints from a file (csv, txt) into Deequ Checks?

Is it possible to save the suggested constraints to a file and then load them back as checks? I was able to do it without saving them with the following code:
val allConstraints = suggestionResult.constraintSuggestions.flatMap {
  case (_, suggestions) =>
    suggestions.map {
      _.constraint
    }
}.toSeq

val generatedCheck = Check(CheckLevel.Error, "generated constraints", allConstraints)

val verificationResult: VerificationResult = {
  VerificationSuite()
    .onData(tested_df)
    .addCheck(generatedCheck)
    .run()
}
However, I want to save them to a file and apply them later when needed. Is there any way to do this?
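One possible direction, assuming the ConstraintSuggestion objects expose the generated Scala snippet as codeForConstraint (as in recent Deequ versions; the output path below is illustrative), is to persist those snippets to a plain text file:

import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

// Collect the suggested constraint code snippets, one per suggestion.
val constraintCode: Seq[String] = suggestionResult.constraintSuggestions.flatMap {
  case (_, suggestions) => suggestions.map(_.codeForConstraint)
}.toSeq

// Write them out, one snippet per line (path is illustrative).
Files.write(Paths.get("/tmp/suggested-constraints.txt"), constraintCode.asJava)

Loading them back as Check objects still requires turning source text into code (runtime compilation or copy-paste), so an alternative is to persist only the inputs you need (column names, thresholds) as JSON and rebuild the checks programmatically.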

Why does loading 4000 images into Redis using spark-submit take longer (9 minutes) than loading the same images into HBase (2.5 minutes)?

Loading images into Redis should be much faster than doing the same thing with HBase, since Redis works in RAM while HBase stores its data on HDFS. I was surprised that loading 4000 images into Redis took 9 minutes, while the same process with HBase took only 2.5 minutes. Is there an explanation for this? Any suggestions to improve my code? Here is my code:
// The code for loading the images into HBase (adopted from NIST)
val conf = new SparkConf().setAppName("Fingerprint.LoadData")
val sc = new SparkContext(conf)
Image.dropHBaseTable()
Image.createHBaseTable()
val checksum_path = args(0)
println("Reading paths from: %s".format(checksum_path.toString))
val imagepaths = loadImageList(checksum_path)
println("Got %s images".format(imagepaths.length))
imagepaths.foreach(println)
println("Reading files into RDD")
val images = sc.parallelize(imagepaths).map(paths => Image.fromFiles(paths._1, paths._2))
println(s"Saving ${images.count} images to HBase")
Image.toHBase(images)
println("Done")

def toHBase(rdd: RDD[T]): Unit = {
  val cfg = HBaseConfiguration.create()
  cfg.set(TableOutputFormat.OUTPUT_TABLE, tableName)
  val job = Job.getInstance(cfg)
  job.setOutputFormatClass(classOf[TableOutputFormat[String]])
  rdd.map(Put).saveAsNewAPIHadoopDataset(job.getConfiguration)
}
// The code for loading images into Redis
val images = sc.parallelize(imagepaths).map(paths => Image.fromFiles(paths._1, paths._2)).collect
for (i <- images) {
  val stringRdd = sc.parallelize(Seq((i.uuid, new String(i.Png, StandardCharsets.UTF_8))))
  sc.toRedisKV(stringRdd)(redisConfig)
  stringRdd.collect
}
println("Done")

How to recover from akka.stream.io.Framing$FramingException

On: akka-stream-experimental_2.11 1.0.
We are using Framing.delimiter in a TCP server. When a message arrives with a length greater than maximumFrameLength, a FramingException is thrown, and we can capture it from OnError of the ActorSubscriber.
Server Code:
def bind(address: String, port: Int, target: ActorRef, maxInFlight: Int, maxFrameLength: Int)
        (implicit system: ActorSystem, actorMaterializer: ActorMaterializer): Future[ServerBinding] = {
  val sink = Sink.foreach { conn: Tcp.IncomingConnection =>
    val targetSubscriber = ActorSubscriber[Message](system.actorOf(Props(new TargetSubscriber(target, maxInFlight))))
    val targetSink = Flow[ByteString]
      .via(Framing.delimiter(ByteString("\n"), maximumFrameLength = maxFrameLength, allowTruncation = true))
      .map(raw ⇒ Message(raw))
      .to(Sink(targetSubscriber))
    conn.flow.to(targetSink).runWith(Source(Promise().future))
  }
  val connections = Tcp().bind(address, port)
  connections.to(sink).run()
}
Subscriber code:
class TargetSubscriber(target: ActorRef, maxInFlight: Int) extends ActorSubscriber with ActorLogging {
  private var inFlight = 0

  override protected def requestStrategy = new MaxInFlightRequestStrategy(maxInFlight) {
    override def inFlightInternally = inFlight
  }

  override def receive = {
    case OnNext(msg: Message) ⇒
      target ! msg
      inFlight += 1
    case OnError(t) ⇒
      inFlight -= 1
      log.error(t, "Subscriber encountered error")
    case TargetAck(_) ⇒
      inFlight -= 1
  }
}
Problem:
Messages that are under the max frame length no longer flow after this exception for that incoming connection. Killing the client and re-running it works fine.
ActorSubscriber does not honor supervision
What is the correct way to skip the bad message and continue with the next good message ?
Have you tried putting supervision on the targetSink flow instead of on the whole materializer? I don't see it anywhere here, and I believe it should be set on that flow directly.
Still, this is more a guess than science ;)
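A sketch of what that suggestion could look like, assuming a newer Akka Streams API where ActorAttributes.supervisionStrategy can be attached to the flow (the attribute and the FramingException package may differ in akka-stream-experimental 1.0, and resuming past a framing failure may still lose the rest of that connection's buffered bytes):

import akka.stream.{ActorAttributes, Supervision}

// Drop the offending oversized frame and keep the stream alive; stop on anything else.
// (FramingException lives in akka.stream.io.Framing in 1.0, akka.stream.scaladsl.Framing later.)
val framingDecider: Supervision.Decider = {
  case _: Framing.FramingException => Supervision.Resume
  case _                           => Supervision.Stop
}

val targetSink = Flow[ByteString]
  .via(Framing.delimiter(ByteString("\n"), maximumFrameLength = maxFrameLength, allowTruncation = true))
  .withAttributes(ActorAttributes.supervisionStrategy(framingDecider))
  .map(raw => Message(raw))
  .to(Sink(targetSubscriber))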
I had the same exception reading from a file, and for me it was solved by putting a return (a trailing newline) after the last line.

How to stream SQL results to JSON using Groovy StreamingJsonBuilder?

I am trying to execute a SQL query and convert the results to JSON as follows. Though I got it working without streaming, I'm having some issues using StreamingJsonBuilder to stream the results.
non-streaming code
def writer = new StringWriter()
def jsonBuilder = new StreamingJsonBuilder(writer)
sql.eachRow("select * from client"){ row ->
jsonBuilder( id: row.id, name: row.name )
}
println writer.toString()
Result from the code above
{"id":123,"name":"ABCD"}{"id":124,"name":"NYU"}
The problem with this result is that all documents are printed on the same line without any delimiter. How do I get the results as an array, with each document pretty-printed as below?
Expected result
[
  {
    id: 123,
    name: "ABCD",
    ...
  },
  {
    id: 124,
    name: "NYU",
    ...
  },
]
I put this here more as a fallback. If your problem is just to have your data properly formatted as JSON, but the sheer amount of data makes you use the streaming API, then you are better off using streaming for your data and handling the "array" yourself.
All the calls in StreamingJsonBuilder take an object and write it directly to the writer, so there is no safe way (that I can see) to have the writer open the array, send the data in chunks you provide, and then close the array. Since we already hold the writer, why not just deal with the array yourself (this part of JSON is rather easy to get right):
new File('/tmp/out.json').withWriter { writer ->
    writer << '['
    def jsonBuilder = new groovy.json.StreamingJsonBuilder(writer)
    def first = true
    10000000.times {
        if (!first) writer << "\n,"
        first = false
        jsonBuilder(id: it, name: it.toString())
    }
    writer << ']'
}
I have no access to any SQL to try, but the following piece of code should do the job (you need to replace the data variable):
import groovy.json.*

def writer = new StringWriter()
def jsonBuilder = new StreamingJsonBuilder(writer)

def data = [
    [id: 1, name: 'n1', other: 'o1'],
    [id: 2, name: 'n2', other: 'o2']
]

def dataJson = jsonBuilder(data.collect { [id: it.id, name: it.name] })
println(JsonOutput.prettyPrint(JsonOutput.toJson(dataJson)))
UPDATE (after #cfrick's comment)
Here, every row is processed one after another, but a key (data in this case) is needed.
import groovy.json.*

def writer = new StringWriter()
def jsonBuilder = new StreamingJsonBuilder(writer)

def data = [
    [id: 1, name: 'n1', other: 'o1'],
    [id: 2, name: 'n2', other: 'o2']
]

def root = jsonBuilder(data: [])
data.each { d ->
    root.data << [id: d.id, name: d.name]
}
println(JsonOutput.prettyPrint(JsonOutput.toJson(root)))

Pig Latin joda-time error with StanfordCoreNLP

I am trying to create a Pig UDF that extracts the locations mentioned in a tweet using the Stanford CoreNLP package interfaced through the sista Scala API. It works fine when run locally with 'sbt run', but throws a "java.lang.NoSuchMethodError" exception when called from Pig:
Loading default properties from tagger edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger
Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz
2013-06-14 10:47:54,952 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > reduce
done [7.5 sec]. Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ...
2013-06-14 10:48:02,108 [Low Memory Detector] INFO org.apache.pig.impl.util.SpillableMemoryManager - first memory handler call - Collection threshold init = 18546688(18112K) used = 358671232(350264K) committed = 366542848(357952K) max = 699072512(682688K)
done [5.0 sec]. Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ...
2013-06-14 10:48:10,522 [Low Memory Detector] INFO org.apache.pig.impl.util.SpillableMemoryManager - first memory handler call - Usage threshold init = 18546688(18112K) used = 590012928(576184K) committed = 597786624(583776K) max = 699072512(682688K)
done [5.6 sec].
2013-06-14 10:48:11,469 [Thread-11] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0001
java.lang.NoSuchMethodError: org.joda.time.Duration.compareTo(Lorg/joda/time/ReadableDuration;)I
    at edu.stanford.nlp.time.SUTime$Duration.compareTo(SUTime.java:3406)
    at edu.stanford.nlp.time.SUTime$Duration.max(SUTime.java:3488)
    at edu.stanford.nlp.time.SUTime$Time.difference(SUTime.java:1308)
    at edu.stanford.nlp.time.SUTime$Range.<init>(SUTime.java:3793)
    at edu.stanford.nlp.time.SUTime.<clinit>(SUTime.java:570)
Here is the relevant code:
object CountryTokenizer {
  def tokenize(text: String): String = {
    val locations = TweetEntityExtractor.NERLocationFilter(text)
    println(locations)
    locations.map(x => Cities.country(x)).flatten.mkString(" ")
  }
}

class PigCountryTokenizer extends EvalFunc[String] {
  override def exec(tuple: Tuple): java.lang.String = {
    val text: java.lang.String = Util.cast[java.lang.String](tuple.get(0))
    CountryTokenizer.tokenize(text)
  }
}

object TweetEntityExtractor {
  val processor: Processor = new CoreNLPProcessor()

  def NERLocationFilter(text: String): List[String] = {
    val doc = processor.mkDocument(text)
    processor.tagPartsOfSpeech(doc)
    processor.lemmatize(doc)
    processor.recognizeNamedEntities(doc)
    val locations = doc.sentences.map(sentence => {
      val entities = sentence.entities.map(List.fromArray(_)) match {
        case Some(l) => l
        case _ => List()
      }
      val words = List.fromArray(sentence.words)
      (words zip entities).filter(x => {
        x._1 != "" && x._2 == "LOCATION"
      }).map(_._1)
    })
    List.fromArray(locations).flatten
  }
}
I am using sbt-assembly to construct a fat-jar, and so the joda-time jar file should be accessible. What is going on?
Pig ships with its own version of joda-time (1.6), which is incompatible with 2.x.
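The usual ways out are to align your joda-time version with Pig's, or to shade joda-time inside the fat jar so the UDF keeps its own relocated copy. The shading route is an assumption on my part rather than part of the answer above; with sbt-assembly it looks roughly like this:

// build.sbt (sbt-assembly 0.14+): relocate org.joda.time inside the assembly so the UDF
// no longer resolves against the joda-time 1.6 that Pig puts on the classpath.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.joda.time.**" -> "shaded.org.joda.time.@1").inAll
)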