Scalding Unit Test - How to Write A Local File?

I work at a place where Scalding writes are augmented with a specific API to track dataset metadata. When converting from normal writes to these special writes, there are some intricacies with respect to Key/Value, TSV/CSV, Thrift ... datasets. I would like to verify that the binary file is identical prior to conversion and after conversion to the special API.
Given that I cannot share the specific API for the metadata-inclusive writes, I am only asking: how can I write a unit test for the .write method on a TypedPipe?
implicit val timeZone: TimeZone = DateOps.UTC
implicit val dateParser: DateParser = DateParser.default
implicit def flowDef: FlowDef = new FlowDef()
implicit def mode: Mode = Local(true)

val fileStrPath = root + "/test"
println("writing data to " + fileStrPath)
TypedPipe
  .from(Seq[Long](1, 2, 3, 4, 5))
  // .map((x: Long) => { println(x.toString); System.out.flush(); x })
  .write(TypedTsv[Long](fileStrPath))
  .forceToDisk
The above doesn't seem to write anything to local (OSX) disk.
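I suspect the plain .write above only registers the sink on the implicit FlowDef, and nothing runs until that flow is actually planned and executed, which would explain why no file appears. A minimal sketch of what might force a purely local run, assuming a reasonably recent Scalding with the Execution API (writeExecution plus waitFor):
import com.twitter.scalding._

// Sketch only: writeExecution builds an Execution that plans and runs the flow when waited on.
val wrote: Execution[Unit] =
  TypedPipe
    .from(Seq[Long](1, 2, 3, 4, 5))
    .writeExecution(TypedTsv[Long](fileStrPath))

// Running in Local mode should materialize the TSV at fileStrPath.
wrote.waitFor(Config.default, Local(true)).get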
So I wonder if I need to use a MiniDFSCluster, something like this:
def setUpTempFolder: String = {
  val tempFolder = new TemporaryFolder
  tempFolder.create()
  tempFolder.getRoot.getAbsolutePath
}

val root: String = setUpTempFolder
println(s"root = $root")
val tempDir = Files.createTempDirectory(setUpTempFolder).toFile
val hdfsCluster: MiniDFSCluster = {
  val configuration = new Configuration()
  configuration.set(MiniDFSCluster.HDFS_MINIDFS_BASEDIR, tempDir.getAbsolutePath)
  configuration.set("io.compression.codecs", classOf[LzopCodec].getName)
  new MiniDFSCluster.Builder(configuration)
    .manageNameDfsDirs(true)
    .manageDataDfsDirs(true)
    .format(true)
    .build()
}
hdfsCluster.waitClusterUp()
val fs: DistributedFileSystem = hdfsCluster.getFileSystem
val rootPath = new Path(root)
fs.mkdirs(rootPath)
However, my attempts to get this MiniCluster to work haven't panned out either - somehow I need to link the MiniCluster with the Scalding write.
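A sketch of what I think the link could look like (untested; it assumes Scalding's Hdfs mode will accept the mini cluster's Configuration, plus the Execution API from above):
import com.twitter.scalding._

// Sketch only: Hdfs mode wraps a Hadoop Configuration, so handing it the mini cluster's
// configuration should route the Scalding write through the MiniDFSCluster.
val hdfsMode: Mode = Hdfs(strict = true, hdfsCluster.getConfiguration(0))

TypedPipe
  .from(Seq[Long](1, 2, 3, 4, 5))
  .writeExecution(TypedTsv[Long](fileStrPath))
  .waitFor(Config.default, hdfsMode)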
Note: the Scalding JobTest framework for unit testing isn't going to work here, because the actual data written is sometimes wrapped in a bijection codec or set up with case class wrappers before the metadata-inclusive write APIs perform the write.
Any ideas how I can write a local file (without using the Scalding REPL) with either Scalding alone or a MiniCluster? (If using the latter, I need a hint on how to read the file back.)

Answering my own question: there is an example of how to use a mini cluster for exactly this kind of reading and writing to HDFS, which will let me cross-read my different writes and examine them. It lives in the tests for Scalding's TypedParquet type.
HadoopPlatformJobTest is an extension for JobTest that uses a MiniCluster.
With some hand-waving over the details in the link, the bulk of the code is this:
"TypedParquetTuple" should {
"read and write correctly" in {
import com.twitter.scalding.parquet.tuple.TestValues._
def toMap[T](i: Iterable[T]): Map[T, Int] = i.groupBy(identity).mapValues(_.size)
HadoopPlatformJobTest(new WriteToTypedParquetTupleJob(_), cluster)
.arg("output", "output1")
.sink[SampleClassB](TypedParquet[SampleClassB](Seq("output1"))) {
toMap(_) shouldBe toMap(values)
}
.run()
HadoopPlatformJobTest(new ReadWithFilterPredicateJob(_), cluster)
.arg("input", "output1")
.arg("output", "output2")
.sink[Boolean]("output2")(toMap(_) shouldBe toMap(values.filter(_.string == "B1").map(_.a.bool)))
.run()
}
}
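For the cross-read part of my original goal I expect to pull the raw bytes back off the cluster's FileSystem and compare them directly. A sketch using only the standard Hadoop FileSystem API (the helper name and the second output path are made up for illustration):
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: concatenate the bytes of all part files under an output directory so that
// two write paths can be compared byte-for-byte. `fs` is the cluster's FileSystem.
def readAllBytes(fs: FileSystem, dir: String): Array[Byte] =
  fs.listStatus(new Path(dir))
    .filter(_.getPath.getName.startsWith("part-"))
    .sortBy(_.getPath.getName)
    .flatMap { status =>
      val in = fs.open(status.getPath)
      try Iterator.continually(in.read()).takeWhile(_ != -1).map(_.toByte).toArray
      finally in.close()
    }

// e.g. readAllBytes(fs, "output1") sameElements readAllBytes(fs, "output1-converted")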

Related

How can I reformat a list of items from Lua to Kotlin?

I am working on a project that originally started in Lua and I want to update it to Kotlin.
I have about 5000 questions/answers that look like:
['Question'][1] = "'7x' was used to refer to the secret ingredient of what drink";
['Answers'][1] = {"coca cola"};
['Question'][2] = "'And the big wheel keep on turning neon burning up above and I'm just high on the world come on and take the low ride with me girl on the.....' What's the Dire Straits song title?";
['Answers'][2] = {"tunnel of love"};
I want to change the format of these without manually going through all 5000, so that they look like:
val que1 = Question(
    1, "'7x' was used to refer to the secret ingredient of what drink",
    "coca cola"
)
val que2 = Question(
    2, "'And the big wheel keep on turning neon burning up above and I'm just high on the world come on and take the low ride with me girl on the.....' What's the Dire Straits song title?",
    "tunnel of love"
)
Please help me figure out how to reformat these questions/answers. Thanks.
You can write a script like this to get the desired output.
Step 1: Create input.txt and paste the Lua code you want to translate into it:
['Question'][1] = "'7x' was used to refer to the secret ingredient of what drink";
['Answers'][1] = {"coca cola"};
['Question'][2] = "'And the big wheel keep on turning neon burning up above and I'm just high on the world come on and take the low ride with me girl on the.....' What's the Dire Straits song title?";
['Answers'][2] = {"tunnel of love"};
Step 2: Create an empty file named output.txt. This file will contain the converted code.
Step 3: Run the main function below. You might need to modify the file paths based on your directory structure. Note: this code is written in Kotlin.
import java.io.File

fun main() {
    val input = File("src/main/kotlin", "input.txt").readLines()
    val outputWriter = File("src/main/kotlin", "output.txt").printWriter()
    // Pair up each 'Question' line with the 'Answers' line that follows it
    val lines = input.windowed(2, 2)
    val questions = mutableListOf<String>()
    lines.forEachIndexed { index, str ->
        val que = str[0].split("\"")[1]
        val ans = str[1].split("\"")[1]
        val question = "val que${index + 1} = Question(${index + 1}, \"$que\", \"$ans\")"
        println(question)
        questions.add(question)
    }
    outputWriter.use { out ->
        questions.forEach {
            out.println(it)
        }
    }
}
After running the script you will get the desired output in output.txt. Alternatively, you can also get this from the console.

Problem getting rights data [execute / read / write] for a file

I'm writing a program that should output meta-info (size, permissions to execute / read / write, time of last modification) for all files in a specified directory.
I managed to get all of this information except the execute / read / write permissions.
I tried to get this info using PosixFilePermissions, but when adding it to the List I get Exception in thread "main" java.lang.UnsupportedOperationException.
Maybe I should use some other library? Or did I make a mistake somewhere? I would be grateful for any advice!
import java.io.File
import java.nio.file.Files
import java.nio.file.Path
import java.nio.file.attribute.BasicFileAttributes
import java.nio.file.attribute.PosixFilePermissions
import java.util.Arrays
import org.apache.commons.io.comparator.NameFileComparator

fun long(path: Path): MutableList<String> {
    val listOfFiles = mutableListOf<String>()
    val files = File("$path").listFiles()
    var attr: BasicFileAttributes
    Arrays.sort(files, NameFileComparator.NAME_COMPARATOR)
    files.forEach {
        if (it.isFile) {
            attr = Files.readAttributes<BasicFileAttributes>(it.toPath(), BasicFileAttributes::class.java)
            listOfFiles.add("${it.name} ${attr.size()} ${attr.lastModifiedTime()}" +
                " ${PosixFilePermissions.toString(Files.getPosixFilePermissions(it.toPath()))}")
        } else {
            listOfFiles.add("dir ${it.name}")
        }
    }
    return listOfFiles
}
PosixFilePermissions is only usable on POSIX-compatible file systems (Linux etc.).
On a Windows system, the permissions have to be accessed directly:
file.canRead()
file.canWrite()
file.canExecute()

How to pass a list/array of request objects to TensorFlow Serving in one server call?

After loading the wide and deep model, I was able to make a prediction for one request object by building a map of features and then serializing it to a string, as shown below.
Is there a way we can create a batch of request objects and send them to the TensorFlow server for prediction in one call?
The code for a single prediction looks like this:
for (each feature in feature list) {   // pseudocode: iterate over the input feature list
    Feature feature = null;
    feature = Feature.newBuilder()
        .setBytesList(BytesList.newBuilder().addValue(ByteString.copyFromUtf8("dummy string")))
        .build();
    if (feature != null) {
        inputFeatureMap.put(name, feature);
    }
}

// Convert the features (in inputFeatureMap) for one request into a 'Features' proto object
Features features = Features.newBuilder().putAllFeature(inputFeatureMap).build();
inputStr = Example.newBuilder().setFeatures(features).build().toByteString();
TensorProto proto = TensorProto.newBuilder()
    .addStringVal(inputStr)
    .setTensorShape(TensorShapeProto.newBuilder()
        .addDim(TensorShapeProto.Dim.newBuilder().setSize(1).build())
        .build())
    .setDtype(DataType.DT_STRING)
    .build();

PredictRequest req = PredictRequest.newBuilder()
    .setModelSpec(ModelSpec.newBuilder()
        .setName("your serving model name")
        .setSignatureName("serving_default")
        .setVersion(Int64Value.newBuilder().setValue(modelVer)))
    .putAllInputs(ImmutableMap.of("inputs", proto))
    .build();

PredictResponse response = stub.predict(req);
System.out.println(response.getOutputsMap());
Is there a way we can send a list of Features objects for prediction, something similar to this:
List<Features> = {some way to create an array/list of inputFeatureMaps which can be converted to serialized strings}
For anyone stumbling here: I found a simple workaround with the Example proto to do a batch request. I will borrow code from this question and modify it for the batch case.
Features features =
    Features.newBuilder()
        .putFeature("Attribute1", feature("A12"))
        .putFeature("Attribute2", feature(12))
        .putFeature("Attribute3", feature("A32"))
        .putFeature("Attribute4", feature("A40"))
        .putFeature("Attribute5", feature(7472))
        .putFeature("Attribute6", feature("A65"))
        .putFeature("Attribute7", feature("A71"))
        .putFeature("Attribute8", feature(1))
        .putFeature("Attribute9", feature("A92"))
        .putFeature("Attribute10", feature("A101"))
        .putFeature("Attribute11", feature(2))
        .putFeature("Attribute12", feature("A121"))
        .putFeature("Attribute13", feature(24))
        .putFeature("Attribute14", feature("A143"))
        .putFeature("Attribute15", feature("A151"))
        .putFeature("Attribute16", feature(1))
        .putFeature("Attribute17", feature("A171"))
        .putFeature("Attribute18", feature(1))
        .putFeature("Attribute19", feature("A191"))
        .putFeature("Attribute20", feature("A201"))
        .build();
Example example = Example.newBuilder().setFeatures(features).build();

String pfad = System.getProperty("user.dir") + "\\1511523781";
try (SavedModelBundle model = SavedModelBundle.load(pfad, "serve")) {
    Session session = model.session();
    final String xName = "input_example_tensor";
    final String scoresName = "dnn/head/predictions/probabilities:0";

    try (Tensor<String> inputBatch = Tensors.create(new byte[][] {
            example.toByteArray(),
            example.toByteArray(),
            example.toByteArray(),
            example.toByteArray()});
         Tensor<Float> output =
             session
                 .runner()
                 .feed(xName, inputBatch)
                 .fetch(scoresName)
                 .run()
                 .get(0)
                 .expect(Float.class)) {
        System.out.println(Arrays.deepToString(output.copyTo(new float[4][2])));
    }
}
Essentially, you can pass each serialized example as an entry in a byte[4][] and you will receive the result in the corresponding shape, float[4][2].

Output for an append job in BigQuery using the Luigi orchestrator

I have a BigQuery task whose only aim is to append a daily temp table (Table-xxxx-xx-xx) to an existing table (PersistingTable).
I am not sure how to handle the output(self) method. Indeed, I cannot just output PersistingTable as a luigi.contrib.bigquery.BigQueryTarget, since it already exists before the process starts. Has anyone run into this question?
I could not find an answer anywhere else, so I will give my solution even though this is a very old question.
I created a new class that inherits from luigi.contrib.bigquery.BigQueryLoadTask:
class BigQueryLoadIncremental(luigi.contrib.bigquery.BigQueryLoadTask):
    '''
    A subclass that checks whether a write-log on GCS exists, in order to append data to the table.
    Needs to define two outputs! [0] of type BigQueryTarget and [1] of type GCSTarget.
    Everything else is left unchanged.
    '''

    def exists(self):
        return luigi.contrib.gcs.GCSClient.exists(self.output()[1].path)

    @property
    def write_disposition(self):
        """
        Set to WRITE_APPEND, as this subclass only makes sense for appends.
        """
        return luigi.contrib.bigquery.WriteDisposition.WRITE_APPEND

    def run(self):
        output = self.output()[0]
        gcs_output = self.output()[1]
        assert isinstance(output, luigi.contrib.bigquery.BigQueryTarget), \
            'Output[0] must be a BigQueryTarget, not %s' % output
        assert isinstance(gcs_output, luigi.contrib.gcs.GCSTarget), \
            'Output[1] must be a Cloud Storage Target, not %s' % gcs_output

        bq_client = output.client

        source_uris = self.source_uris()
        assert all(x.startswith('gs://') for x in source_uris)

        job = {
            'projectId': output.table.project_id,
            'configuration': {
                'load': {
                    'destinationTable': {
                        'projectId': output.table.project_id,
                        'datasetId': output.table.dataset_id,
                        'tableId': output.table.table_id,
                    },
                    'encoding': self.encoding,
                    'sourceFormat': self.source_format,
                    'writeDisposition': self.write_disposition,
                    'sourceUris': source_uris,
                    'maxBadRecords': self.max_bad_records,
                    'ignoreUnknownValues': self.ignore_unknown_values
                }
            }
        }

        if self.source_format == luigi.contrib.bigquery.SourceFormat.CSV:
            job['configuration']['load']['fieldDelimiter'] = self.field_delimiter
            job['configuration']['load']['skipLeadingRows'] = self.skip_leading_rows
            job['configuration']['load']['allowJaggedRows'] = self.allow_jagged_rows
            job['configuration']['load']['allowQuotedNewlines'] = self.allow_quoted_new_lines

        if self.schema:
            job['configuration']['load']['schema'] = {'fields': self.schema}

        # Test write to and removal of the GCS pseudo output in order to make sure this does not fail.
        gcs_output.fs.put_string(
            'test write for task {} (this file should have been removed immediately)'.format(self.task_id),
            gcs_output.path)
        gcs_output.fs.remove(gcs_output.path)

        bq_client.run_job(output.table.project_id, job, dataset=output.table.dataset)

        gcs_output.fs.put_string(
            'success! The following BigQuery Job went through without errors: {}'.format(self.task_id),
            gcs_output.path)
It uses a second output (which might violate Luigi's atomicity principle) on Google Cloud Storage. Example usage:
class LeadsToBigQuery(BigQueryLoadIncremental):
    date = luigi.DateParameter(default=datetime.date.today())

    def output(self):
        return luigi.contrib.bigquery.BigQueryTarget(project_id=...,
                                                     dataset_id=...,
                                                     table_id=...), \
               create_gcs_target(...)

Programmatically creating DStreams in Apache Spark

I am writing some self contained integration tests around Apache Spark Streaming.
I want to test that my code can ingest all kinds of edge cases in my simulated test data.
When I was doing this with regular RDDs (not streaming), I could use my inline data and call "parallelize" on it to turn it into a Spark RDD.
However, I can find no such method for creating DStreams. Ideally I would like to call some "push" function once in a while and have the tuples magically appear in my DStream.
At the moment I'm doing this by using Apache Kafka: I create a temp queue and write to it. But this seems like overkill. I'd much rather create the test DStream directly from my test data without having to use Kafka as a mediator.
For testing purposes, you can create an input stream from a queue of RDDs.
Pushing more RDDs in the queue will simulate having processed more events in the batch interval.
import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.InputDStream

val sc = SparkContextHolder.sc
val ssc = new StreamingContext(sc, Seconds(1))

val inputData: mutable.Queue[RDD[Int]] = mutable.Queue()
val inputStream: InputDStream[Int] = ssc.queueStream(inputData)

inputData += sc.makeRDD(List(1, 2)) // Emulate the RDD created during the first batch interval
inputData += sc.makeRDD(List(3, 4)) // 2nd batch interval
// etc

val result = inputStream.map(x => x * x)
result.foreachRDD(rdd => assertSomething(rdd))

ssc.start() // Don't forget to start the streaming context
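The queue can also keep being fed after the context has started, so each new RDD is picked up in a later batch. A small sketch of driving it that way (the timeout value here is arbitrary):
inputData += sc.makeRDD(List(5, 6)) // shows up in a subsequent batch interval

ssc.awaitTerminationOrTimeout(3000) // let a few 1-second batches run
ssc.stop(stopSparkContext = false)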
In addition to Raphael's solution, I think you may also want to choose between processing one batch at a time and processing everything currently available in the queue. You need to set the oneAtATime flag accordingly on queueStream's optional method argument, as shown below:
import scala.collection.mutable
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Milliseconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

val slideDuration = Milliseconds(100)
val conf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[8]")
val sparkSession: SparkSession = SparkSession.builder.config(conf).getOrCreate()
val sparkContext: SparkContext = sparkSession.sparkContext

val queueOfRDDs = mutable.Queue[RDD[String]]()
val streamingContext: StreamingContext = new StreamingContext(sparkContext, slideDuration)

val rddOneQueuesAtATimeDS: DStream[String] = streamingContext.queueStream(queueOfRDDs, oneAtATime = true)
val rddFloodOfQueuesDS: DStream[String] = streamingContext.queueStream(queueOfRDDs, oneAtATime = false)

rddOneQueuesAtATimeDS.print(120)
rddFloodOfQueuesDS.print(120)

streamingContext.start()

for (i <- 1 to 10) {
  queueOfRDDs += sparkContext.makeRDD(simplePurchase(i))
  queueOfRDDs += sparkContext.makeRDD(simplePurchase((i + 3) * (i + 3)))
  Thread.sleep(slideDuration.milliseconds)
}

Thread.sleep(1000L)
I found this base example:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/CustomReceiver.scala
The key here is calling the "store" method; replace the receiving logic with whatever produces your test data.
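For illustration, a stripped-down receiver along those lines might look like the sketch below (modeled loosely on the linked example; the in-memory Seq stands in for the socket that example reads from):
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Sketch of a push-style test receiver: whatever is handed to store() shows up in the DStream.
class TestReceiver(data: Seq[String]) extends Receiver[String](StorageLevel.MEMORY_ONLY) {

  def onStart(): Unit = {
    // Feed the data on a separate thread so that onStart returns quickly.
    new Thread("TestReceiver feed") {
      override def run(): Unit = data.foreach(item => store(item))
    }.start()
  }

  def onStop(): Unit = () // nothing to clean up in this sketch
}

// Usage: val stream = ssc.receiverStream(new TestReceiver(Seq("a", "b", "c")))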