Pipeline generation - passing in simple data structures like lists/arrays - apache-spark-sql

For a code repository project in Palantir Foundry, I am struggling with re-using some of my transformation logic.
It seems almost trivial, but: is there a way to send an Input to a Transform that is not a dataset/dataframe reference?
In my case I want to pass in strings or lists/arrays.
This is my code:
from pyspark.sql import functions as F
from transforms.api import Transform, Input, Output

def my_computation(result, customFilter, scope, my_categories, my_mappings):
    scope_df = scope.dataframe()
    my_categories_df = my_categories.dataframe()
    my_mappings_df = my_mappings.dataframe()
    filtered_cat_df = (
        my_categories_df
        .filter(F.col('CAT_NAME').isin(customFilter))
    )
    # ... more logic
def generateTransforms(config):
    transforms = []
    for key, value in config.items():
        o = {}
        for outKey, outValue in value['outputs'].items():
            o[outKey] = Output(outValue)
        i = {}
        for inpKey, inpValue in value['inputs'].items():
            i[inpKey] = Input(inpValue)
        i['customFilter'] = Input(value['my_custom_filter'])
        transforms.append(Transform(my_computation, inputs=i, outputs=o))
    return transforms
config = {
    "transform_one": {
        "my_custom_filter": {
            "foo",
            "bar"
        },
        "inputs": {
            "scope": "/my-project/input/scope",
            "my_categories": "/my-project/input/my_categories",
            "my_mappings": "/my-project/input/my_mappings"
        },
        "outputs": {
            "result": "/my-project/output/result"
        }
    }
}
TRANSFORMS = generateTransforms(config)
The concrete question is: how can I send in the values from my_custom_filter into customFilter in the transformation function my_computation?
If I execute it like above, I get the error "TypeError: unhashable type: 'set'"

This looks like a Python issue; can you point out which line is causing the error?
Reading through your code, I would guess it's this line:
i['customFilter'] = Input(value['my_custom_filter'])
Your Python logic is wrong: if we unpack your code, you're effectively making this call:
i['customFilter'] = Input({"foo", "bar"})
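As a quick illustration of where the message comes from (plain Python, nothing Foundry-specific; the only assumption is that something downstream tries to hash the value, e.g. to use it as a dict key):

>>> hash({"foo", "bar"})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'set'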
Edit, to answer the comment on how to create a Python transform that locks a variable in a closure:
from transforms.api import transform

def create_transform(inputs, outputs, my_other_var):
    @transform(**inputs, **outputs)
    def compute(input_foo, input_bar, output_foobar, ctx):
        df = input_foo.dataframe()
        df = df.withColumn("mycol", F.lit(my_other_var))
        output_foobar.write_dataframe(df)
    return compute
and now you can call this:
transforms.append(create_transform(inputs, outputs, "foobar"))
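For completeness, here is a minimal sketch of how the same closure trick could be wired back into the question's generateTransforms. It is untested and assumes the transforms.api decorator works as in the snippet above; the point is simply that my_custom_filter is passed as a plain Python list captured by the closure, and never wrapped in Input():

from pyspark.sql import functions as F
from transforms.api import transform, Input, Output

def create_my_computation(inputs, outputs, custom_filter):
    # custom_filter is an ordinary Python list captured in the closure
    @transform(**inputs, **outputs)
    def my_computation(result, scope, my_categories, my_mappings, ctx):
        filtered_cat_df = (
            my_categories.dataframe()
            .filter(F.col('CAT_NAME').isin(custom_filter))
        )
        # ... more logic
        result.write_dataframe(filtered_cat_df)
    return my_computation

def generateTransforms(config):
    transforms = []
    for key, value in config.items():
        o = {outKey: Output(outValue) for outKey, outValue in value['outputs'].items()}
        i = {inpKey: Input(inpValue) for inpKey, inpValue in value['inputs'].items()}
        # pass the filter values as a list, not a set, and not an Input
        transforms.append(create_my_computation(i, o, list(value['my_custom_filter'])))
    return transforms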

Related

Karate : dynamic test data using scenario outline is not working in some cases

I was trying to solve a dynamic test data problem using a dynamic scenario outline, as described in the documentation: https://github.com/karatelabs/karate#dynamic-scenario-outline
It worked perfectly fine when I passed something like this in the Examples section:
Examples:
| [{'entity':country},{'entity':state},{'entity':district},{'entity':corporation}] |
But when I tried to generate this JSON object programmatically, I got a strange error:
WARN com.intuit.karate - ignoring dynamic expression, did not evaluate to list: users - [type: MAP, value: com.intuit.karate.ScriptObjectMap#2b8bb184]
Code to generate the JSON object:
* def user =
  """
  function(response){
    entity_type_ids = []
    var entityTypes = response.entityTypes
    for(var i = 0; i < entityTypes.length; i++){
      object = {}
      object['entity'] = entityTypes[i].id
      entity_type_ids.push(object)
    }
    return JSON.stringify(entity_type_ids)
  }
  """

Terraform 0.12: Output list of buckets, use as input for another module and iterate

I'm using Tf 0.12. I have an s3 module that outputs a list of buckets, which I would like to use as an input for a cloudfront module that I've got.
The problem I'm facing is that when I do terraform plan/apply I get the following error: count.index is 0 | var.redirect-buckets is tuple with 1 element
I've tried all kinds of splats and moving the count.index call around, to no avail. My sample code is below.
module.s3
resource "aws_s3_bucket" "redirect" {
count = length(var.redirects)
bucket = element(var.redirects, count.index)
}
module.s3.output
output "redirect-buckets" {
value = [aws_s3_bucket.redirect.*]
}
module.cdn.variables
...
variable "redirect-buckets" {
description = "Redirect buckets"
default = []
}
....
The error is thrown down here
module.cdn
resource "aws_cloudfront_distribution" "redirect" {
count = length(var.redirect-buckets)
default_cache_behavior {
// Line below throws the error, one amongst many
target_origin_id = "cloudfront-distribution-origin-${var.redirect-buckets[count.index]}.s3.amazonaws.com"
....
//Another error throwing line
target_origin_id = "cloudfront-distribution-origin-${var.redirect-buckets[count.index]}.s3.amazonaws.com"
Any help is greatly appreciated.
module.s3
resource "aws_s3_bucket" "redirects" {
for_each = var.redirects
bucket = each.value
}
Your variable definition for redirects needs to change to something like this:
variable "redirects" {
type = map(string)
}
module.s3.output:
output "redirect_buckets" {
value = aws_s3_bucket.redirects
}
module.cdn
resource "aws_cloudfront_distribution" "redirects" {
for_each = var.redirect_buckets
default_cache_behavior {
target_origin_id = "cloudfront-distribution-origin-${each.value.id}.s3.amazonaws.com"
}
Your variable definition for redirect-buckets needs to change to something like this (note the underscores; kebab-case names behave strangely in some cases and aren't worth the trouble):
variable "redirect_buckets" {
type = map(object(
{
id = string
}
))
}
root module
module "s3" {
source = "../s3" // or whatever the path is
redirects = {
site1 = "some-bucket-name"
site2 = "some-other-bucket"
}
}
module "cdn" {
source = "../cdn" // or whatever the path is
redirects_buckets = module.s3.redirect_buckets
}
From an example perspective, this is interesting, but you don't need to use outputs from S3 here since you could just hand the cdn module the same map of redirects and use for_each on those.
There is a tool called Terragrunt which wraps Terraform and supports dependencies.
https://terragrunt.gruntwork.io/docs/features/execute-terraform-commands-on-multiple-modules-at-once/#dependencies-between-modules

trie implementation in python, object reference

I'm looking at the following implementation of a trie in Python:
tree = {}

def add_to_tree(root, value_string):
    for character in value_string:
        root = root.setdefault(character, {})

def main():
    tree = {}
    add_to_tree(tree, 'abc')
    print tree

if __name__ == "__main__":
    main()
What is not clear to me is:
why is it returning {a:{b:{c:{}}}} instead of {a:{},b:{},c:{}}?
I ran the code through this which gives a visualization of it. After iterating through 'a' I get tree = {'a':{}}, root = {}, then after 'b' I get tree = {a:{b:{}}}, root = {}. What's not clear is which variable is holding the reference to {b:{}} that gets assigned into {a:{}} to change it to {a:{b:{}}}?
You are reassigning root to the newly created dict for every character. Change this line:
root = root.setdefault(character, {})
to become:
root.setdefault(character, {})
This gives the desired output (note that dicts are unordered):
{'a': {}, 'c': {}, 'b': {}}
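To make the reference handling explicit, here is a small self-contained sketch (plain Python, nothing beyond the code already shown) that contrasts the two variants side by side:

def add_nested(root, value_string):
    # original version: root is rebound to the child dict on each iteration,
    # so every following character is inserted one level deeper
    for character in value_string:
        root = root.setdefault(character, {})

def add_flat(root, value_string):
    # suggested fix: root keeps pointing at the top-level dict,
    # so every character becomes a sibling key
    for character in value_string:
        root.setdefault(character, {})

nested, flat = {}, {}
add_nested(nested, 'abc')
add_flat(flat, 'abc')
print(nested)  # {'a': {'b': {'c': {}}}}
print(flat)    # {'a': {}, 'b': {}, 'c': {}}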

output for append job in BigQuery using Luigi Orchestrator

I have a BigQuery task which only aims to append a daily temp table (Table-xxxx-xx-xx) to an existing table (PersistingTable).
I am not sure how to handle the output(self) method. Indeed, I cannot just output PersistingTable as a luigi.contrib.bigquery.BigQueryTarget, since it already exists before the process starts. Has anyone run into the same question?
I could not find an answer anywhere else, so I will give my solution even though this is a very old question.
I created a new class that inherits from luigi.contrib.bigquery.BigQueryLoadTask:
class BigQueryLoadIncremental(luigi.contrib.bigquery.BigQueryLoadTask):
    '''
    a subclass that checks whether a write-log on gcs exists to append data to the table
    needs to define two outputs! [0] of type BigQueryTarget and [1] of type GCSTarget
    Everything else is left unchanged
    '''

    def exists(self):
        return luigi.contrib.gcs.GCSClient.exists(self.output()[1].path)

    @property
    def write_disposition(self):
        """
        Set to WRITE_APPEND as this subclass only makes sense for this
        """
        return luigi.contrib.bigquery.WriteDisposition.WRITE_APPEND

    def run(self):
        output = self.output()[0]
        gcs_output = self.output()[1]
        assert isinstance(output,
                          luigi.contrib.bigquery.BigQueryTarget), 'Output[0] must be a BigQueryTarget, not %s' % (
            output)
        assert isinstance(gcs_output,
                          luigi.contrib.gcs.GCSTarget), 'Output[1] must be a Cloud Storage Target, not %s' % (
            gcs_output)

        bq_client = output.client
        source_uris = self.source_uris()
        assert all(x.startswith('gs://') for x in source_uris)

        job = {
            'projectId': output.table.project_id,
            'configuration': {
                'load': {
                    'destinationTable': {
                        'projectId': output.table.project_id,
                        'datasetId': output.table.dataset_id,
                        'tableId': output.table.table_id,
                    },
                    'encoding': self.encoding,
                    'sourceFormat': self.source_format,
                    'writeDisposition': self.write_disposition,
                    'sourceUris': source_uris,
                    'maxBadRecords': self.max_bad_records,
                    'ignoreUnknownValues': self.ignore_unknown_values
                }
            }
        }

        if self.source_format == luigi.contrib.bigquery.SourceFormat.CSV:
            job['configuration']['load']['fieldDelimiter'] = self.field_delimiter
            job['configuration']['load']['skipLeadingRows'] = self.skip_leading_rows
            job['configuration']['load']['allowJaggedRows'] = self.allow_jagged_rows
            job['configuration']['load']['allowQuotedNewlines'] = self.allow_quoted_new_lines

        if self.schema:
            job['configuration']['load']['schema'] = {'fields': self.schema}

        # test write to and removal of GCS pseudo output in order to make sure this does not fail.
        gcs_output.fs.put_string(
            'test write for task {} (this file should have been removed immediately)'.format(self.task_id),
            gcs_output.path)
        gcs_output.fs.remove(gcs_output.path)

        bq_client.run_job(output.table.project_id, job, dataset=output.table.dataset)

        gcs_output.fs.put_string(
            'success! The following BigQuery Job went through without errors: {}'.format(self.task_id),
            gcs_output.path)
It uses a second output (which might violate Luigi's atomicity principle) on Google Cloud Storage. Example usage:
class LeadsToBigQuery(BigQueryLoadIncremental):
    date = luigi.DateParameter(default=datetime.date.today())

    def output(self):
        return luigi.contrib.bigquery.BigQueryTarget(project_id=...,
                                                     dataset_id=...,
                                                     table_id=...), \
               create_gcs_target(...)
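create_gcs_target is not defined in the answer; purely as an illustration, a helper along these lines would fit the exists() check above (the bucket name, function signature, and path layout are made up):

import luigi.contrib.gcs

def create_gcs_target(bucket, task_id, date):
    # hypothetical helper: one marker object per task and date; its existence
    # on GCS is what exists() above checks before re-running the append job
    path = 'gs://{}/write-logs/{}/{}.success'.format(bucket, task_id, date.isoformat())
    return luigi.contrib.gcs.GCSTarget(path)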

How to stream SQL results to JSON using Groovy StreamingJsonBuilder?

I am trying to execute a SQL query and convert the results to JSON as follows. Though I got it working without streaming, I'm having some issues using StreamingJsonBuilder to stream the results.
non-streaming code
def writer = new StringWriter()
def jsonBuilder = new StreamingJsonBuilder(writer)
sql.eachRow("select * from client") { row ->
    jsonBuilder(id: row.id, name: row.name)
}
println writer.toString()
Result from the code above
{"id":123,"name":"ABCD"}{"id":124,"name":"NYU"}
The problem with this result is that all the documents are printed on the same line without any delimiter. How do I get the results as an array, with each document pretty-printed as below?
Expected result
[
  {
    id: 123,
    name: "ABCD",
    ...
  },
  {
    id: 124,
    name: "NYU",
    ...
  },
]
I put this here more as a fallback. If your problem is just to have your data properly formatted as JSON, but the sheer amount of data makes you use the streaming API, then you are better off using streaming for the data itself and handling the "array" yourself.
All the calls in the StreamingJsonBuilder take an object and directly write it to the writer. So there is no safe way (that I can see) to have the writer open the array, then send the data in chunks you provide, and then close the array. So while we already hold the writer, why not just deal with the array yourself (this part of JSON is rather easy to get right):
new File('/tmp/out.json').withWriter{ writer ->
    writer << '['
    def jsonBuilder = new groovy.json.StreamingJsonBuilder(writer)
    def first = true
    10000000.times{
        if (!first) writer << "\n,"
        first = false
        jsonBuilder(id: it, name: it.toString())
    }
    writer << ']'
}
I have no access to any SQL to try, but the following piece of code should do the job (you need to replace the data variable):
import groovy.json.*
def writer = new StringWriter()
def jsonBuilder = new StreamingJsonBuilder(writer)
def data = [
    [id: 1, name: 'n1', other: 'o1'],
    [id: 2, name: 'n2', other: 'o2']
]
def dataJson = jsonBuilder(data.collect { [id:it.id, name:it.name] })
println(JsonOutput.prettyPrint(JsonOutput.toJson(dataJson)))
UPDATE (after #cfrick's comment)
Here, every row is processed one after another, but a key (data in this case) is needed.
import groovy.json.*
def writer = new StringWriter()
def jsonBuilder = new StreamingJsonBuilder(writer)
def data = [
    [id: 1, name: 'n1', other: 'o1'],
    [id: 2, name: 'n2', other: 'o2']
]
def root = jsonBuilder(data: [])
data.each { d ->
    root.data << [id: d.id, name: d.name]
}
println(JsonOutput.prettyPrint(JsonOutput.toJson(root)))