How to handle input streams in Pentaho with script steps? - pentaho

How many different kinds of steps in Pentaho can accept more than one input stream, such as "Merge Join" or "Stream lookup"?
What are the typical use scenarios for them?
Can any script-related steps accept more than one stream as input, like JavaScript or UDJC? E.g., using one stream as the data source and another as a filter condition?
Thank you all.

All the steps under "Joins" and "Lookup" can: joins work just like table joins, and a lookup uses one stream as the source dataset and another as a "translation" dictionary. This is what I know.

Answers to the 3 questions below:
All the steps available in the "Joins" and "Lookup" sections will accept two streams (I haven't tried with 3 streams). Some filter steps like Java Filter will also accept more than one stream.
The typical use scenario is to get data from one or more streams and apply your business logic to them. There is no specific example I can explain at the moment.
As far as I know, you cannot use more than one stream in the JavaScript step. You might get an error if, for example, you try to stream two columns with different names: Input 1 has column "a" and Input 2 has column "b".
You can avoid this error if you make the columns in both input streams share the same name.
Hope this helps :)
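If UDJC is an option, a second hop can be connected to the step and declared on its Info Steps tab, which allows the "one stream as data source, another as filter condition" scenario asked about above. Below is a rough, untested sketch of the idea; the info-stream tag "filter" and the field name "id" are made-up names for illustration, not anything required by Pentaho.

    // UDJC (User Defined Java Class) step body - a sketch, not a verified implementation.
    // Assumes the second stream is declared on the Info Steps tab with the tag "filter"
    // and that both streams carry a field named "id" (hypothetical names).
    private java.util.HashSet allowedIds = new java.util.HashSet();

    public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException {
        Object[] r = getRow();                       // next row from the main (data source) stream
        if (r == null) {
            setOutputDone();
            return false;
        }
        if (first) {
            first = false;
            // Drain the info stream once and cache the ids it contains.
            RowSet infoStream = findInfoRowSet("filter");
            Object[] infoRow;
            while ((infoRow = getRowFrom(infoStream)) != null) {
                allowedIds.add(infoStream.getRowMeta().getString(infoRow, "id", null));
            }
        }
        // Pass the row through only if its id appeared in the filter stream.
        String id = get(Fields.In, "id").getString(r);
        if (allowedIds.contains(id)) {
            putRow(data.outputRowMeta, r);
        }
        return true;
    }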

Related

Cleaning up raw syslog events coming into Sentinel - json/string

I'm having an issue parsing out syslog data coming into Sentinel. I think it's a misunderstanding of the data types and what my options are when working with them.
I have some raw syslog coming into Sentinel. This data is being ingested with 4 columns: TimeStamp, SyslogMessage, Computer, and Facility. The 'SyslogMessage' column is the one with by far the most data in it, but I'm having issues parsing it out to make it useful. I'd like to be able to take pieces out of the 'SyslogMessage' column and extend new columns from that data, which would give me a better way to manipulate it than a string operator like contains.
For instance, in a separate situation I had some raw event data coming through as what I think is JSON. With that dataset, I was able to do something like extend c = RawEventData.AccountMoniker, which would give me a column 'c' containing only the AccountMoniker data.
The data set that I am currently working with looks to be formatted similarly to JSON, but it seems to have had a string prefixed to the beginning of it, which I think turned the rest of the data into a string.
I've been able to work in some regex and get the 'SyslogMessage' down to just the bracketed material, but have still been having issues when trying to do something like 'parse_json'. Right now, the only way I'm able to search through this data is using 'has' or 'contains'. What are my options for getting the 'SyslogMessage' data into a type that I can more easily search through and project as columns?
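One way to combine the two things the question already describes (regex to isolate the bracketed material, then parse_json) looks roughly like the sketch below; the property name AccountMoniker is only a placeholder for whatever keys actually appear inside your SyslogMessage.

    // Kusto sketch: pull the {...} portion out of the string, parse it, then project columns.
    Syslog
    | extend RawJson = extract(@"(\{.*\})", 1, SyslogMessage)   // isolate the bracketed material
    | extend Parsed  = parse_json(RawJson)                      // string -> dynamic
    | extend AccountMoniker = tostring(Parsed.AccountMoniker)   // placeholder property name
    | where isnotempty(AccountMoniker)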

Using two different tables in a JavaScript step

I have 2 different tables (one with 2 fields and one with 3 fields) from the same Access file, and I want to use them in a JavaScript step, but it gives me a hop error.
I have tried join steps, but one multiplies the rows and the others ask me for a foreign key.
I would like to use the data from the 2 different tables in the JavaScript step.
It is totally possible for two hops to target the same step (a JavaScript step in your case), and you do not have to do anything special to "union" the flows.
Except that the two flows must be strictly similar: columns in the same order, with the same names and the same types. You can use two Select values steps for that.
Something you hate when you develop and love when you maintain.

How to replace mail addresses in table input column with Pentaho

I'm quite new to PDI and currently facing a challenge where I have to replace mail addresses read from the email column of an incoming table (extracted by the Table Input step in Kettle) with other mail addresses.
e.g. user.test#example.com should become abc[seq. number]#example.com.
The goal is to "anonymize" the incoming addresses for further work with the data.
I currently have no solution for this and am hoping you guys have one. :-)
Thank you!
You can implement a Java class, or you can do the following: after the Table input step, create a sequence (Add sequence step). Then, with a Split Fields step, process the mail address, taking the # as the delimiter; in the step configuration, create two fields, one that will contain the initial part of the email and the other the domain (gmail.com, for example). Then take the sequence field you created earlier, concatenate it with a constant # (in the split you lose that symbol), and concatenate that with the domain field. In the end you will get 1#gmail.com, 2#hotmail.com, etc. It is only 4 steps. I hope it helps you, greetings.
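If you prefer a single script step instead of the four steps above, the same split-and-concatenate idea can be written in a Modified Java Script Value step. A rough sketch, assuming the incoming field is called email and an Add sequence step earlier in the stream provides a field called seq (both names are made up here); the thread writes addresses with # instead of @, so # is kept as the separator:

    // Modified Java Script Value sketch - declare anon_email as an output field
    // in the Fields grid at the bottom of the step.
    var sep = email.indexOf("#");
    var domain = email.substring(sep);        // "#example.com"
    var anon_email = "abc" + seq + domain;    // e.g. "abc42#example.com"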
There exists a "Replace in String" step under the "Transform" section exactly for your case.
Nevertheless, I recommend that you read some documentation first.
I solved it. I just took the long way, adding constants and sequences and eventually concatenating them.

Azure Stream Analytics -> how much control over path prefix do I really have?

I'd like to set the prefix based on some of the data coming from event hub.
My data is something like:
{"id":"1234",...}
I'd like to write a blob prefix that is something like:
foo/{id}/guid....
Ultimately I'd like to have one blob for each id. This will help how it gets consumed downstream by a couple of things.
What I don't see is a way to create prefixes that aren't related to date and time. In theory I can write another job to pull from blobs and break it up after the stream analytics step. However, it feels like SA should allow me to break it up immediately.
Any ideas?
{date}, {time} and {partition} are the only tokens supported in the blob output prefix. {partition} is a number.
Using a column value in blob prefix is currently not supported.
If you have a limited number of such {id}s, then you could work around it by writing multiple "SELECT ..." statements with different filters, each writing to a different output, and hardcoding the prefix in each output, as sketched below. Otherwise it is not possible with just ASA.
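A rough sketch of that workaround, with hypothetical input and output names; each output would be a separate blob output configured with its own hardcoded path prefix (for example foo/1234/ and foo/5678/):

    -- Stream Analytics query with one SELECT per id, each writing to its own output.
    SELECT * INTO [output-id-1234] FROM [input-hub] WHERE id = '1234'

    SELECT * INTO [output-id-5678] FROM [input-hub] WHERE id = '5678'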
It should be noted that you actually can do this now. I am not sure when it was implemented, but you can now use a single property from your message as a custom partition key, and the syntax is exactly as the OP has asked for: foo/{id}/something/else
More details are documented here: https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-custom-path-patterns-blob-storage-output
Key points:
Only one custom property allowed
Must be a direct reference to an existing message property (i.e. no concatenations like {prop1+prop2})
If the custom property results in too many partitions (more than 8,000), then an arbitrary number of blobs may be created for the same partition

KeywordFilter field to filter from database values

How can I implement a KeywordFilterField to filter data from a database table as soon as text is typed into the field?
Most of the samples I have come across demonstrate filtering from predefined arrays. What I am looking for is filtering from a database.
Please guide me on how to go about it. Thanks.
I have tried out this example from the BB docs, which uses arrays.
A straightforward approach would be to load the data from the database into a ReadableList and pass that to the KeywordFilterField.
The method used to set the values is
setSourceList(ReadableList list, KeywordProvider helper)
ReadableList is an interface which has a few implementations. The example code you are looking at uses the SortedReadableList but a BasicFilteredList would work nicely too.
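A rough sketch of that approach (BlackBerry Java), assuming a hypothetical loadNamesFromDatabase() helper that runs your database query and returns the values to filter as a Vector of strings:

    import java.util.Enumeration;
    import java.util.Vector;
    import net.rim.device.api.collection.util.SortedReadableList;
    import net.rim.device.api.ui.component.KeywordFilterField;
    import net.rim.device.api.ui.component.KeywordProvider;
    import net.rim.device.api.util.StringComparator;
    import net.rim.device.api.util.StringUtilities;

    // Wraps the database results in a sorted ReadableList that also acts as the KeywordProvider.
    class DbKeywordList extends SortedReadableList implements KeywordProvider {
        DbKeywordList(Vector namesFromDb) {
            super(StringComparator.getInstance(true));     // case-insensitive ordering
            for (Enumeration e = namesFromDb.elements(); e.hasMoreElements();) {
                doAdd(e.nextElement());                    // add each database value to the list
            }
        }
        // KeywordFilterField calls this to get the words to match the typed text against.
        public String[] getKeywords(Object element) {
            return StringUtilities.stringToWords(element.toString());
        }
    }

    // Wiring it up on a screen:
    // Vector names = loadNamesFromDatabase();             // hypothetical DB query helper
    // DbKeywordList list = new DbKeywordList(names);
    // KeywordFilterField field = new KeywordFilterField();
    // field.setSourceList(list, list);
    // add(field.getKeywordField());                       // the text entry field
    // add(field);                                         // the filtered list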