Apache Flink Error Handling and Conditional Processing - error-handling

I am new to Flink and have gone through the site(s)/examples/blogs to get started. I am struggling with the correct use of operators. Basically I have 2 questions:
Question 1: Does Flink support declarative exception handling? I need to handle parse/validate/... errors.
Can I use org.apache.flink.runtime.operators.sort.ExceptionHandler or something similar to handle errors?
Or is a Rich/FlatMap function my best option?
If Rich/FlatMap is the only option, is there a way to get a handle to the stream inside a Rich/FlatMap function so that sink(s) could be attached for error processing?
Question 2: Can I conditionally attach different Sink(s)?
Based on certain field(s) in the keyed split streams I need to select different sink(s). Do I split the stream again, or use a Rich/FlatMap to handle that?
I am using Flink 1.3.2. Here is the relevant portion of my job:
.....
.....
DataStream<String> eventTextStream = env.addSource(messageSource);
KeyedStream<EventPojo, Tuple> eventPojoStream = eventTextStream
// parse, transform or enrich
.flatMap(new MyParseTransformEnrichFunction())
.assignTimestampsAndWatermarks(new EventAscendingTimestampExtractor())
.keyBy("eventId");
// split stream based on eventType as different reduce and windowing functions need to be applied
SplitStream<EventPojo> splitStream = eventPojoStream
.split(new EventStreamSplitFunction());
// need to apply reduce function
DataStream<EventPojo> event1TypeStream = splitStream.select("event1Type");
// need to apply reduce function
DataStream<EventPojo> event2TypeStream = splitStream.select("event2Type");
// need to apply time based windowing function
DataStream<EventPojo> event3TypeStream = splitStream.select("event3Type");
....
....
env.execute("Event Processing");
Am I using the correct operators here?
Update 1:
Tried using the ProcessFunction as suggested by @alpinegizmo, but that didn't work because it depends on a keyed stream, which I don't have until I parse/validate the input. I get "InvalidProgramException: Field expression must be equal to '*' or '_' for non-composite types."
It's such a common use case where you first parse/validate input and don't have a keyed stream yet, so how do you solve it?
Thanks for your patience and help.

There's one key building block that you've overlooked. Take a look at side outputs.
This mechanism provides a typesafe way to produce any number of additional output streams. This can be a clean way to report errors, among other uses. In Flink 1.3 side outputs can only be used with ProcessFunction, but 1.4 will add side outputs to ProcessWindowFunction.
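For illustration, here is a minimal sketch of that approach applied to the job above. It is an assumption-laden example: EventPojo.parse and MyErrorSink are hypothetical stand-ins for your own parse/validate logic and error sink, the ProcessFunction takes the place of MyParseTransformEnrichFunction, and it relies on ProcessFunction being usable on a non-keyed stream (it is in 1.3, although timers only work on keyed streams).

import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

// Tag identifying the error side output (anonymous subclass keeps the element type information)
final OutputTag<String> parseErrorTag = new OutputTag<String>("parse-errors") {};

SingleOutputStreamOperator<EventPojo> parsed = eventTextStream
    .process(new ProcessFunction<String, EventPojo>() {
        @Override
        public void processElement(String raw, Context ctx, Collector<EventPojo> out) {
            try {
                out.collect(EventPojo.parse(raw));   // hypothetical parse/validate step
            } catch (Exception e) {
                ctx.output(parseErrorTag, raw);      // failed records go to the side output
            }
        }
    });

// The main stream continues as before
parsed
    .assignTimestampsAndWatermarks(new EventAscendingTimestampExtractor())
    .keyBy("eventId");

// The error stream gets its own sink
parsed.getSideOutput(parseErrorTag).addSink(new MyErrorSink());  // hypothetical sink

The same mechanism also covers question 2: you can emit records to different OutputTags based on a field and attach a different sink to each getSideOutput() stream, or simply call addSink on each of the streams you already obtain from split/select.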

Related

Read specific file names in ADF pipeline

I have got a requirement saying that blob storage has multiple files with the names file_1.csv, file_2.csv, file_3.csv, file_4.csv, file_5.csv, file_6.csv, file_7.csv. From these I have to read only the files numbered 5 to 7.
How can we achieve this in an ADF/Synapse pipeline?
I have reproduced this in my lab; please see the repro steps below.
ADF:
Using the Get Metadata activity, get a list of all files.
(Parameterize the source file name in the source dataset to pass ‘*’ in the dataset parameters to get all files.)
Pass the Get Metadata output child items to the ForEach activity:
@activity('Get Metadata1').output.childItems
Add an If Condition activity inside the ForEach and set the true case expression so that only the required files are copied to the sink:
@and(greater(int(substring(item().name,5,1)),4),lessOrEquals(int(substring(item().name,5,1)),7))
When the If Condition is true, add a Copy Data activity to copy the current item (file) to the sink.
I took a slightly different approach using a Filter activity and the endsWith function.
The filter expression is:
@or(or(endsWith(item().name, '_5.csv'),endsWith(item().name, '_6.csv')),endsWith(item().name, '_7.csv'))
Slightly different approaches, similar results; it depends what you need.
You can always do what @NiharikaMoola-MT suggested. But since you already know the range of the files (5-7), I suggest:
Declare two parameters for the lower and upper range.
Create a ForEach loop and pass it a range generated from the two parameters with the range() function.
Create a parameterized dataset for the source.
Use the file number from the ForEach loop to create a dynamic expression like
@concat('file_', item(), '.csv')

Is there a way to execute a text Gremlin query with PartitionStrategy

I'm looking for an implementation to run a text query, e.g. "g.V().limit(1).toList()", while using PartitionStrategy in Apache TinkerPop.
I'm attempting to build a REST interface to run queries on selected graph partitions only. I know how to run a raw query using Client, but I'm looking for an implementation where I can create a multi-tenant graph (https://tinkerpop.apache.org/docs/current/reference/#partitionstrategy) and query only selected tenants using a raw text query instead of a GLV. I'm able to query only selected partitions using gremlin-python, but there is no reference implementation I could find to run a text query on a tenant.
Here is the tenant query implementation:
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.strategies import PartitionStrategy

connection = DriverRemoteConnection('ws://megamind-ws:8182/gremlin', 'g')
g = traversal().withRemote(connection)
partition = PartitionStrategy(partition_key="partition_key",
                              write_partition="tenant_a",
                              read_partitions=["tenant_a"])
partitioned_g = g.withStrategies(partition)
x = partitioned_g.V().limit(1).next()  # query on the selected partition only
Here is how I execute a raw query on the entire graph, but I'm looking for an implementation to run text-based queries on only selected partitions:
from gremlin_python.driver import client
client = client.Client('ws://megamind-ws:8182/gremlin', 'g')
results = client.submitAsync("g.V().limit(1).toList()").result().one()  # runs on the entire graph
print(results)
client.close()
Any suggestions appreciated. TIA
It depends on how the backend store handles text mode queries, but for the query itself you essentially just need to use the Groovy/Java style formulation. This will work with Gremlin Server and Amazon Neptune. For other backends you will need to make sure that this syntax is supported. So from Python you would use something like:
client.submit(
    """g.withStrategies(new PartitionStrategy(partitionKey: "_partition",
                                              writePartition: "b",
                                              readPartitions: ["b"])).V().count()""")

Query parameter handling in karate framework

Is there any easy way to handle a huge set of query params like the one below? Also, I would like to know how I can do run-time parameterization for some of the values.
http://154.213.196.243:7941/v1/banking/Jumio/callback?callBackType=NetVerifyId&jumioIdScanReference=123abcde-1244-8571-3454-abcd12345567&merchantIdScanReference=66a9ff2e-d8ec-e811-a956-000d3ab3f117&verificationStatus=APPROVED_VERIFIED&idScanStatus=SUCCESS&id+ScanSource=API&idCheckDataPositions=OK&idCheckDocumentValidation=OK&idCheckHologram=OK&idCheckMRZcode=OK&idCheckMicroprint=OK&idCheckSecurityFeatures=OK&idCheckSignature=OK&transactionDate=2018-11-20T20%3A53%3A25.797Z&callbackDate=2018-11-20T20%3A53%3A25.797Z&idType=DRIVING_LICENSE&idCountry=GBR&idScanImage+=https%3A%2F%2Fnetverify.com%2Frecognition%2Fv1%2Fidscan%2F123abcde-1244-8571-3454-abcd12345567%2Ffront&idFirstName=ILARIA&idLastName=FURS&idDob=1976-12-23&idExpiry=2025-12-31&personalNumber=123456789&clientIp=xxx.xxx.xxx.xxx&idAddress=%7B%22country%22%3A%22USA%22%2C%20%22stateCode%22%3A%22US-OH%22%7D&idNumber=P12345&idStatus=TESTER961260SS9DL54&identityVerification=%7B%22similarity%22%3A%22MATCH%22%2C%22validity%22%3Atrue%7D HTTP/1.1
Yes. Read the docs: https://github.com/intuit/karate#param
For example:
* param callBackType = 'NetVerifyId'
and so on. Also look at params, where you can set all keys up as one single JSON and do parameterization if needed; there are multiple possibilities: https://github.com/intuit/karate#params
See this example as well: dynamic-params.feature

How to evenly distribute data in apache pig output files?

I've got a pig-latin script that takes in some xml, uses the XPath UDF to pull out some fields and then stores the resulting fields:
REGISTER udf-lib-1.0-SNAPSHOT.jar;
DEFINE XPath com.blah.udfs.XPath();
docs = LOAD '$input' USING com.blah.storage.XMLLoader('root') as (content:chararray);
results = FOREACH docs GENERATE XPath(content, 'root/id'), XPath(content, 'root/otherField'), content;
store results into '$output';
Note that we're using pig-0.12.0 on our cluster, so I ripped the XPath/XMLLoader classes out of pig-0.14.0 and put them in my own jar so that I could use them in 0.12.
The above script works fine and produces the data that I'm looking for. However, it generates over 1,900 part files with only a few MB in each file. I learned about the default_parallel option, so I set that to 128 to try to get 128 part files. I ended up having to add a piece to force a reduce phase to achieve this. My script now looks like:
set default_parallel 128;
REGISTER udf-lib-1.0-SNAPSHOT.jar;
DEFINE XPath com.blah.udfs.XPath();
docs = LOAD '$input' USING com.blah.storage.XMLLoader('root') as (content:chararray);
results = FOREACH docs GENERATE XPath(content, 'root/id'), XPath(content, 'root/otherField'), content;
forced_reduce = FOREACH (GROUP results BY RANDOM()) GENERATE FLATTEN(results);
store forced_reduce into '$output';
Again, this produces the expected data, and I now get 128 part files. My problem now is that the data is not evenly distributed among them: some have 8 GB, others 100 MB. I should have expected this when grouping by RANDOM() :).
My question is: what would be the preferred way to limit the number of part files yet still keep them evenly sized? I'm new to Pig/Pig Latin and assume I'm going about this in completely the wrong way.
P.S. The reason I care about the number of part files is that I'd like to process the output with Spark, and our Spark cluster seems to do a lot better with a smaller number of files.
I'm still looking for a way to do this directly from the Pig script, but for now my "solution" is to repartition the data within the Spark process that works on the output of the Pig script. I use the RDD.coalesce function to rebalance the data.
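For reference, here is a minimal sketch of that workaround using Spark's Java API; the paths, app name, and partition count are placeholders, and coalesce is called with shuffle=true so the skewed part files actually get redistributed:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("rebalance-pig-output"));

// Read the (unevenly sized) Pig output and rebalance it into ~128 similar-sized partitions
JavaRDD<String> pigOutput = sc.textFile("hdfs:///path/to/pig/output");
JavaRDD<String> rebalanced = pigOutput.coalesce(128, true);  // shuffle=true forces a redistribution

rebalanced.saveAsTextFile("hdfs:///path/to/rebalanced/output");
sc.close();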
From the first code snippet, I am assuming it is a map-only job since you are not using any aggregates.
Instead of using reducers, set the property pig.maxCombinedSplitSize:
REGISTER udf-lib-1.0-SNAPSHOT.jar;
DEFINE XPath com.blah.udfs.XPath();
docs = LOAD '$input' USING com.blah.storage.XMLLoader('root') as (content:chararray);
results = FOREACH docs GENERATE XPath(content, 'root/id'), XPath(content, 'root/otherField'), content;
store results into '$output';
exec;
set pig.maxCombinedSplitSize 1000000000; -- 1 GB (size given in bytes)
x = load '$output' using PigStorage();
store x into '$output2' using PigStorage();
pig.maxCombinedSplitSize - setting this property will make sure each mapper reads around 1 GB of data, and the above code works as an identity-mapper job, which helps you write the data in 1 GB part-file chunks.

Endeca UrlENEQuery java API search

I'm currently trying to create an Endeca query using the Java API for a URLENEQuery. The current query is:
collection()/record[CONTACT_ID = "xxxxx" and SALES_OFFICE = "yyyy"]
I need it to be:
collection()/record[(CONTACT_ID = "xxxxx" or CONTACT_ID = "zzzzz") and
SALES_OFFICE = "yyyy"]
Currently this is being done with an ERecSearchList, with CONTACT_ID and the string I'm trying to match in an ERecSearch object, but I'm having difficulty figuring out how to get the UrlENEQuery to generate the or clause in the correct fashion, as I have above. Does anyone know how I can do this?
One of us is confused on multiple levels. Let me try to explain why I am confused:
If CONTACT_ID and SALES_OFFICE are different dimensions, where CONTACT_ID is a multi-or dimension, then you don't need to use EQL (the XPath-like language) to do anything. Just select the appropriate dimension values and your navigation state will reflect the query you are trying to build with XPath, i.e. the CONTACT_IDs ORed together and then ANDed with SALES_OFFICE.
If you do have to use EQL, then the only way to modify it (provided that you have to modify it from the returned results) is via string manipulation.
ERecSearchList gives you the ability to use "search within" functionality, which works completely differently from EQL filtering, though you can achieve similar results by using tricks like searching only a specified field (which would be separate from the generic search interface). I am still not sure what the connection is between the ERecSearchList and the EQL expression above.
Having expressed my confusion, I think what you need to do is to use String manipulation to dynamically build the EQL expression and add it to the Query.
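For example, something along these lines. This is only a sketch: contactId1, contactId2 and salesOffice are placeholders for values coming from your request, and how you hand the finished expression to the UrlENEQuery depends on how your application already supplies its EQL, so no specific setter or URL parameter is shown.

// Build the EQL record filter as a plain string; the three variables are placeholders
String eql = String.format(
    "collection()/record[(CONTACT_ID = \"%s\" or CONTACT_ID = \"%s\") and SALES_OFFICE = \"%s\"]",
    contactId1, contactId2, salesOffice);
// Substitute this expression into the query you pass to UrlENEQuery, in place of the
// single-CONTACT_ID expression that is generated today.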
A code example of what you are doing would be extremely helpful as well.