How to stream SQL results to JSON using Groovy StreamingJsonBuilder?

I am trying to execute a SQL query and convert the results to JSON as follows. Though I got it working without streaming, I'm having some issues using StreamingJsonBuilder to stream the results.
non-streaming code
def writer = new StringWriter()
def jsonBuilder = new StreamingJsonBuilder(writer)
sql.eachRow("select * from client") { row ->
    jsonBuilder( id: row.id, name: row.name )
}
println writer.toString()
Result from the code above
{"id":123,"name":"ABCD"}{"id":124,"name":"NYU"}
The problem with this result is that all the documents are printed on the same line with no delimiter between them. How do I get the results as an array, with each document pretty-printed as below?
Expected result
[
    {
        id: 123,
        name: "ABCD",
        ...
    },
    {
        id: 124,
        name: "NYU",
        ...
    },
]

I put this here more as a fallback. If your problem is just to have your data properly formatted as JSON, but the sheer amount of data makes you use the streaming API, then you are better off streaming your data and handling the "array" yourself.
All the calls in the StreamingJsonBuilder take an object and write it directly to the writer. So there is no safe way (that I can see) to have the builder open the array, then send the data in the chunks you provide, and then close the array. Since we already hold the writer, why not just deal with the array yourself (this part of JSON is rather easy to get right):
new File('/tmp/out.json').withWriter{ writer ->
    writer << '['
    def jsonBuilder = new groovy.json.StreamingJsonBuilder(writer)
    def first = true
    10000000.times{
        if (!first) writer << "\n,"
        first = false
        jsonBuilder(id: it, name: it.toString())
    }
    writer << ']'
}

I have no access to a SQL database to try this against, but the following piece of code should do the job (you need to replace the data variable with your query results):
import groovy.json.*
def writer = new StringWriter()
def jsonBuilder = new StreamingJsonBuilder(writer)
def data = [
    [id:1, name: 'n1', other: 'o1'],
    [id:2, name: 'n2', other: 'o2']
]
def dataJson = jsonBuilder(data.collect { [id:it.id, name:it.name] })
println(JsonOutput.prettyPrint(JsonOutput.toJson(dataJson)))
UPDATE (after @cfrick's comment)
Here, every row is processed one after another, but a key (data in this case) is needed.
import groovy.json.*
def writer = new StringWriter()
def jsonBuilder = new StreamingJsonBuilder(writer)
def data = [
    [id:1, name: 'n1', other: 'o1'],
    [id:2, name: 'n2', other: 'o2']
]
def root = jsonBuilder(data: [])
data.each { d ->
    root.data << [id:d.id, name: d.name]
}
println(JsonOutput.prettyPrint(JsonOutput.toJson(root)))

Related

Pipeline generation - passing in simple data structures like lists/arrays

For a code repository project in Palantir Foundry, I am struggling with reusing some of my transformation logic.
It seems almost trivial, but: is there a way to send an Input to a Transform that is not a dataset/dataframe reference?
In my case I want to pass in strings or lists/arrays.
This is my code:
from pyspark.sql import functions as F
from transforms.api import Transform, Input, Output

def my_computation(result, customFilter, scope, my_categories, my_mappings):
    scope_df = scope.dataframe()
    my_categories_df = my_categories.dataframe()
    my_mappings_df = my_mappings.dataframe()
    filtered_cat_df = (
        my_categories_df
        .filter(F.col('CAT_NAME').isin(customFilter))
    )
    # ... more logic

def generateTransforms(config):
    transforms = []
    for key, value in config.items():
        o = {}
        for outKey, outValue in value['outputs'].items():
            o[outKey] = Output(outValue)
        i = {}
        for inpKey, inpValue in value['inputs'].items():
            i[inpKey] = Input(inpValue)
        i['customFilter'] = Input(value['my_custom_filter'])
        transforms.append(Transform(my_computation, inputs=i, outputs=o))
    return transforms

config = {
    "transform_one": {
        "my_custom_filter": {
            "foo",
            "bar"
        },
        "inputs": {
            "scope": "/my-project/input/scope",
            "my_categories": "/my-project/input/my_categories",
            "my_mappings": "/my-project/input/my_mappings"
        },
        "outputs": {
            "result": "/my-project/output/result"
        }
    }
}

TRANSFORMS = generateTransforms(config)
The concrete question is: how can I send in the values from my_custom_filter into customFilter in the transformation function my_computation?
If I execute it like above, I get the error "TypeError: unhashable type: 'set'"
This looks like a Python issue; any chance you can point out which line is causing the error?
Reading through your code, I would guess it's this line:
i['customFilter'] = Input(value['my_custom_filter'])
Your Python logic is wrong here: if we unpack your code, you're trying to make this call:
i['customFilter'] = Input({"foo", "bar"})
Input() expects a dataset reference (a path), not a plain Python set.
Edit to answer the comment on how to create a Python transform that locks a variable in a closure:
from pyspark.sql import functions as F
from transforms.api import transform

def create_transform(inputs, outputs, my_other_var):
    @transform(**inputs, **outputs)
    def compute(input_foo, input_bar, output_foobar, ctx):
        df = input_foo.dataframe()
        df = df.withColumn("mycol", F.lit(my_other_var))
        output_foobar.write_dataframe(df)
    return compute
and now you can call this:
transforms.append(create_transform(inputs, outputs, "foobar"))
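To tie this back to the original generateTransforms loop, here is a rough sketch of the same idea applied to the question's config. This is my own adaptation rather than the answer's code; it assumes Foundry's transform decorator binds parameters by the keyword names used in inputs/outputs, and it reuses the OP's dataset paths and column names as placeholders:
from pyspark.sql import functions as F
from transforms.api import transform, Input, Output

# Sketch only: the filter values travel through the closure,
# so they are never wrapped in Input() and the unhashable-set error goes away.
def create_transform(inputs, outputs, custom_filter):
    @transform(**inputs, **outputs)
    def compute(result, scope, my_categories, my_mappings, ctx):
        filtered_cat_df = (
            my_categories.dataframe()
            .filter(F.col('CAT_NAME').isin(list(custom_filter)))
        )
        # ... more logic, then write to the declared output
        result.write_dataframe(filtered_cat_df)
    return compute

def generate_transforms(config):
    transforms = []
    for key, value in config.items():
        inputs = {k: Input(v) for k, v in value['inputs'].items()}
        outputs = {k: Output(v) for k, v in value['outputs'].items()}
        transforms.append(create_transform(inputs, outputs, value['my_custom_filter']))
    return transforms

TRANSFORMS = generate_transforms(config)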

Scalding Unit Test - How to Write A Local File?

I work at a place where Scalding writes are augmented with a specific API to track dataset metadata. When converting from normal writes to these special writes, there are some intricacies with respect to Key/Value, TSV/CSV, Thrift ... datasets. I would like to verify that the binary file is the same prior to conversion and after conversion to the special API.
Given that I cannot share the specific API for the metadata-inclusive writes, I only ask: how can I write a unit test for the .write method on a TypedPipe?
implicit val timeZone: TimeZone = DateOps.UTC
implicit val dateParser: DateParser = DateParser.default
implicit def flowDef: FlowDef = new FlowDef()
implicit def mode: Mode = Local(true)
val fileStrPath = root + "/test"
println("writing data to " + fileStrPath)
TypedPipe
  .from(Seq[Long](1, 2, 3, 4, 5))
  // .map((x: Long) => { println(x.toString); System.out.flush(); x })
  .write(TypedTsv[Long](fileStrPath))
  .forceToDisk
The above doesn't seem to write anything to the local (OSX) disk. So I wonder if I need to use a MiniDFSCluster, something like this:
def setUpTempFolder: String = {
  val tempFolder = new TemporaryFolder
  tempFolder.create()
  tempFolder.getRoot.getAbsolutePath
}
val root: String = setUpTempFolder
println(s"root = $root")
val tempDir = Files.createTempDirectory(setUpTempFolder).toFile
val hdfsCluster: MiniDFSCluster = {
  val configuration = new Configuration()
  configuration.set(MiniDFSCluster.HDFS_MINIDFS_BASEDIR, tempDir.getAbsolutePath)
  configuration.set("io.compression.codecs", classOf[LzopCodec].getName)
  new MiniDFSCluster.Builder(configuration)
    .manageNameDfsDirs(true)
    .manageDataDfsDirs(true)
    .format(true)
    .build()
}
hdfsCluster.waitClusterUp()
val fs: DistributedFileSystem = hdfsCluster.getFileSystem
val rootPath = new Path(root)
fs.mkdirs(rootPath)
However, my attempts to get this MiniCluster to work haven't panned out either - somehow I need to link the MiniCluster with the Scalding write.
Note: The Scalding JobTest framework for unit testing isn't going to work, because the actual data written is sometimes wrapped in a bijection codec or set up with case class wrappers prior to the writes made by the metadata-inclusive write APIs.
Any ideas how I can write a local file (without using the Scalding REPL) with either Scalding alone or a MiniCluster? (If using the latter, I need a hint on how to read the file.)
Answering ... There is an example of how to use a mini cluster for exactly this kind of reading from and writing to HDFS. I will be able to cross-read my different writes and examine them. It is in the tests for Scalding's TypedParquet type.
HadoopPlatformJobTest is an extension of JobTest that uses a MiniCluster.
With some hand-waving over the details in the link, the bulk of the code is this:
"TypedParquetTuple" should {
"read and write correctly" in {
import com.twitter.scalding.parquet.tuple.TestValues._
def toMap[T](i: Iterable[T]): Map[T, Int] = i.groupBy(identity).mapValues(_.size)
HadoopPlatformJobTest(new WriteToTypedParquetTupleJob(_), cluster)
.arg("output", "output1")
.sink[SampleClassB](TypedParquet[SampleClassB](Seq("output1"))) {
toMap(_) shouldBe toMap(values)
}
.run()
HadoopPlatformJobTest(new ReadWithFilterPredicateJob(_), cluster)
.arg("input", "output1")
.arg("output", "output2")
.sink[Boolean]("output2")(toMap(_) shouldBe toMap(values.filter(_.string == "B1").map(_.a.bool)))
.run()
}
}

How to create single PredictRequest() with multiple model_spec?

I'm trying to add multiple model_spec entries and their respective inputs into a single predict_pb2.PredictRequest() as follows:
tmp = predict_pb2.PredictRequest()
tmp.model_spec.name = '1'
tmp.inputs['tokens'].CopyFrom(make_tensor_proto([1,2,3]))
tmp.model_spec.name = '2'
tmp.inputs['tokens'].CopyFrom(make_tensor_proto([4,5,6]))
But I'm only getting 2's information:
>> tmp
model_spec {
  name: "2"
}
inputs {
  key: "tokens"
  value {
    dtype: DT_INT32
    tensor_shape {
      dim {
        size: 3
      }
    }
    tensor_content: "\004\000\000\000\005\000\000\000\006\000\000\000"
  }
}
How can I get a single PredictRequest() for multiple models with their respective inputs?
My aim is to create a single request and send it to the tensorflow serving which is serving two models. Is there any other way around this? Creating two separate requests for both models and getting results from tf_serving one after another works, but I'm wondering if I can just combine two requests into one.
I'm afraid it's not possible. In tensorflow_serving/api/predict.proto, each PredictRequest has only one ModelSpec. You could try modifying the serving code to support this yourself.
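If what you ultimately need is results from both models, the workaround you already describe (one request per model, sent one after the other) is the practical route. Below is a minimal sketch of that; the address localhost:8500, the model names '1' and '2', and the 'tokens' input are assumptions taken from your snippet and a default deployment, so adjust them to your setup.
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel('localhost:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# One PredictRequest per model; a single request cannot carry two model_specs.
def predict(model_name, tokens):
    request = predict_pb2.PredictRequest()
    request.model_spec.name = model_name
    request.inputs['tokens'].CopyFrom(tf.make_tensor_proto(tokens))
    return stub.Predict(request, timeout=10.0)

responses = [predict('1', [1, 2, 3]), predict('2', [4, 5, 6])]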
Did you try using a model config file? The contents of the config file can be as shown below:
model_config_list {
  config {
    name: 'my_first_model'
    base_path: '/tmp/my_first_model/'
  }
  config {
    name: 'my_second_model'
    base_path: '/tmp/my_second_model/'
  }
}
For more information, you can refer to the link shown below:
https://www.tensorflow.org/tfx/serving/serving_config
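To tie this back to the question: the config file lets a single server host both models, but each request still targets one model by name. Here is a small sketch of querying the two configured models one after another over the REST API; the port 8501 and the "instances" payload format are assumptions about a default deployment and your model signatures:
import json
import requests

# Sketch: query each model from the config file by name via TensorFlow Serving's REST API.
def rest_predict(model_name, instances):
    url = 'http://localhost:8501/v1/models/{}:predict'.format(model_name)
    response = requests.post(url, data=json.dumps({"instances": instances}))
    response.raise_for_status()
    return response.json()['predictions']

first = rest_predict('my_first_model', [[1, 2, 3]])
second = rest_predict('my_second_model', [[4, 5, 6]])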

Test file structure in Groovy (Spock)

How can I compare a created file tree with an expected file tree in Groovy (Spock)?
Right now I'm using a Set where I specify the paths I expect to get, and I collect the actual paths this way:
Set<String> getCreatedFilePaths(String root) {
    Set<String> createFilePaths = new HashSet<>()
    new File(root).eachFileRecurse {
        createFilePaths << it.absolutePath
    }
    return createFilePaths
}
But the readability of the test isn't so good.
Is it possible in Groovy to write the expected paths as a tree, and then compare that with the actual structure?
For example, expected:
region:
    usa:
        new_york.json
        california.json
    europe:
        spain.json
        italy.json
And the actual structure would be converted to this kind of tree.
Not sure if you can do it with the built-in recursive methods. There certainly are powerful ones, but this is standard recursion code you can use:
import groovy.io.FileType

def path = new File("/Users/me/Downloads")

def printTree(File file, Integer level) {
    println " " * level + "${file.name}:"
    // print only plain files here; subdirectories are handled by the recursive call below
    file.eachFile(FileType.FILES) {
        println " " * (level + 1) + it.name
    }
    file.eachDir {
        printTree(it, level + 1)
    }
}

printTree(path, 1)
That prints the format you describe
You can either build your own parser or use Groovy's built-in JSON parser:
package de.scrum_master.stackoverflow
import groovy.json.JsonParserType
import groovy.json.JsonSlurper
import spock.lang.Specification
class FileRecursionTest extends Specification {
    def jsonDirectoryTree = """{
        com : {
            na : {
                tests : [
                    MyBaseIT.groovy
                ]
            },
            twg : {
                sample : {
                    model : [
                        PrimeNumberCalculatorSpec.groovy
                    ]
                }
            }
        },
        de : {
            scrum_master : {
                stackoverflow : [
                    AllowedPasswordsTest.groovy,
                    CarTest.groovy,
                    FileRecursionTest.groovy,
                    {
                        foo : [
                            LoginIT.groovy,
                            LoginModule.groovy,
                            LoginPage.groovy,
                            LoginValidationPage.groovy,
                            User.groovy
                        ]
                    },
                    LuceneTest.groovy
                ],
                testing : [
                    GebTestHelper.groovy,
                    RestartBrowserIT.groovy,
                    SampleGebIT.groovy
                ]
            }
        }
    }"""

    def "Parse directory tree JSON representation"() {
        given:
        def jsonSlurper = new JsonSlurper(type: JsonParserType.LAX)
        def rootDirectory = jsonSlurper.parseText(jsonDirectoryTree)

        expect:
        rootDirectory.de.scrum_master.stackoverflow.contains("CarTest.groovy")
        rootDirectory.com.twg.sample.model.contains("PrimeNumberCalculatorSpec.groovy")

        when:
        def fileList = objectGraphToFileList("src/test/groovy", rootDirectory)
        fileList.each { println it }

        then:
        fileList.size() == 14
        fileList.contains("src/test/groovy/de/scrum_master/stackoverflow/CarTest.groovy")
        fileList.contains("src/test/groovy/com/twg/sample/model/PrimeNumberCalculatorSpec.groovy")
    }

    List<File> objectGraphToFileList(String directoryPath, Object directoryContent) {
        List<File> files = []
        directoryContent.each {
            switch (it) {
                case String:
                    files << directoryPath + "/" + it
                    break
                case Map:
                    files += objectGraphToFileList(directoryPath, it)
                    break
                case Map.Entry:
                    files += objectGraphToFileList(directoryPath + "/" + (it as Map.Entry).key, (it as Map.Entry).value)
                    break
                default:
                    throw new IllegalArgumentException("unexpected directory content value $it")
            }
        }
        files
    }
}
Please note:
I used new JsonSlurper(type: JsonParserType.LAX) in order to avoid having to quote each single String in the JSON structure. If your file names contain spaces or other special characters, you will have to use something like "my file name", though.
In rootDirectory.de.scrum_master.stackoverflow.contains("CarTest.groovy") you can see how you can nicely interact with the parsed JSON object graph in .property syntax. You might like it or not, need it or not.
Recursive method objectGraphToFileList converts the parsed object graph to a list of files (if you prefer a set, change it; but File.eachFileRecurse(..) should not yield any duplicates, so the set is not needed).
If you do not like the parentheses etc. in the JSON, you can still build your own parser.
You might want to add another utility method to create a JSON string like the given one from a validated directory structure, so you have less work when writing similar tests.
I modified Bavo Bruylandt's answer to collect the file tree paths and to sort them, so the order of files does not matter.
def "check directory structure"() {
expect:
String created = getCreatedFilePaths(new File("/tmp/region"))
String expected = new File("expected.txt").text
created == expected
}
private String getCreatedFilePaths(File root) {
List paths = new ArrayList()
printTree(root, 0, paths)
return paths.join("\n")
}
private void printTree(File file, Integer level, List paths) {
paths << ("\t" * level + "${file.name}:")
file.listFiles().sort{it.name}.each {
if (it.isFile()) {
paths << ("\t" * (level + 1) + it.name)
}
if (it.isDirectory()) {
collectFileTree(it, level + 1, paths)
}
}
}
And the expected files are put in the expected.txt file, indented with tabs (\t), in this way:
region:
	usa:
		new_york.json
		california.json
	europe:
		spain.json
		italy.json

Output for append job in BigQuery using Luigi Orchestrator

I have a BigQuery task whose only aim is to append a daily temp table (Table-xxxx-xx-xx) to an existing table (PersistingTable).
I am not sure how to handle the output(self) method. Indeed, I cannot just output PersistingTable as a luigi.contrib.bigquery.BigQueryTarget, since it already exists before the process starts. Has anyone run into the same question?
I could not find an answer anywhere else so I will give my solution even though this is a very old question.
I created a new class that inherits from luigi.contrib.bigquery.BigQueryLoadTask:
class BigQueryLoadIncremental(luigi.contrib.bigquery.BigQueryLoadTask):
    '''
    a subclass that checks whether a write-log on gcs exists to append data to the table
    needs to define Two Outputs! [0] of type BigQueryTarget and [1] of type GCSTarget
    Everything else is left unchanged
    '''

    def exists(self):
        return luigi.contrib.gcs.GCSClient.exists(self.output()[1].path)

    @property
    def write_disposition(self):
        """
        Set to WRITE_APPEND as this subclass only makes sense for this
        """
        return luigi.contrib.bigquery.WriteDisposition.WRITE_APPEND

    def run(self):
        output = self.output()[0]
        gcs_output = self.output()[1]
        assert isinstance(output, luigi.contrib.bigquery.BigQueryTarget), \
            'Output[0] must be a BigQueryTarget, not %s' % (output)
        assert isinstance(gcs_output, luigi.contrib.gcs.GCSTarget), \
            'Output[1] must be a Cloud Storage Target, not %s' % (gcs_output)

        bq_client = output.client

        source_uris = self.source_uris()
        assert all(x.startswith('gs://') for x in source_uris)

        job = {
            'projectId': output.table.project_id,
            'configuration': {
                'load': {
                    'destinationTable': {
                        'projectId': output.table.project_id,
                        'datasetId': output.table.dataset_id,
                        'tableId': output.table.table_id,
                    },
                    'encoding': self.encoding,
                    'sourceFormat': self.source_format,
                    'writeDisposition': self.write_disposition,
                    'sourceUris': source_uris,
                    'maxBadRecords': self.max_bad_records,
                    'ignoreUnknownValues': self.ignore_unknown_values
                }
            }
        }

        if self.source_format == luigi.contrib.bigquery.SourceFormat.CSV:
            job['configuration']['load']['fieldDelimiter'] = self.field_delimiter
            job['configuration']['load']['skipLeadingRows'] = self.skip_leading_rows
            job['configuration']['load']['allowJaggedRows'] = self.allow_jagged_rows
            job['configuration']['load']['allowQuotedNewlines'] = self.allow_quoted_new_lines

        if self.schema:
            job['configuration']['load']['schema'] = {'fields': self.schema}

        # test write to and removal of GCS pseudo output in order to make sure this does not fail.
        gcs_output.fs.put_string(
            'test write for task {} (this file should have been removed immediately)'.format(self.task_id),
            gcs_output.path)
        gcs_output.fs.remove(gcs_output.path)

        bq_client.run_job(output.table.project_id, job, dataset=output.table.dataset)

        gcs_output.fs.put_string(
            'success! The following BigQuery Job went through without errors: {}'.format(self.task_id),
            gcs_output.path)
It uses a second output (which might violate Luigi's atomicity principle) on Google Cloud Storage. Example usage:
class LeadsToBigQuery(BigQueryLoadIncremental):
    date = luigi.DateParameter(default=datetime.date.today())

    def output(self):
        return luigi.contrib.bigquery.BigQueryTarget(project_id=...,
                                                     dataset_id=...,
                                                     table_id=...), \
               create_gcs_target(...)
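The create_gcs_target(...) call above is left elided in the answer. Purely as an illustration, a hypothetical helper could look like the sketch below; the signature and the bucket layout are my assumptions, not part of the original solution:
import luigi.contrib.gcs

# Hypothetical helper (not part of the original answer): builds the GCS "write log"
# target whose existence tells BigQueryLoadIncremental that the load already ran.
def create_gcs_target(bucket, task_name, date):
    # assumed layout: gs://<bucket>/bq-load-logs/<task_name>/<date>
    path = 'gs://{}/bq-load-logs/{}/{}'.format(bucket, task_name, date.isoformat())
    return luigi.contrib.gcs.GCSTarget(path)
Per the docstring of BigQueryLoadIncremental, this GCS marker acts as the write log recording whether the append for a given run already went through.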