Pentaho JsonInput GET fields

I'm trying to use PDI to read data from an API (JSON). For now I'm simply trying to use the JSON Input step to get a few specific fields, but the Get Fields button on the step gives me:
ERROR (version 8.3.0.0-371, build 8.3.0.0-371 from 2019-06-11 11.09.08 by buildguy) : Index 1 out of bounds for length 1
All the steps execute fine and produce data; it's just that the JSON Input step doesn't want to give me the fields option. I've tried the Text file output and JSON output steps and both write valid JSON, so I don't know what's going on.
PS: this is my first time using PDI.
ISSUE 2:
It looks like PDI uses Jayway for its JSON path parsing, so I've been using this site https://jsonpath.herokuapp.com/ with the Jayway selection, which gives me my expected path. When I put that into the 'fields' of the JSON Input dialog I only get the FIRST instance of that path value, rather than it actually parsing the JSON and giving me every instance. I can't figure out why, though I assume it has something to do with PDI's row-based view on things, but I also don't know how to get it to understand that it's JSON and it should give me back all values that match that path.
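For what it's worth, the Jayway library itself returns every match for a wildcard path, which you can check outside PDI. A minimal sketch in plain Java against a made-up document (the players array and the $.players[*].id path are placeholders, not your API's structure):
import com.jayway.jsonpath.JsonPath;
import java.util.List;

public class JsonPathCheck {
    public static void main(String[] args) {
        // stand-in for the API response
        String json = "{\"players\":[{\"id\":\"5381\"},{\"id\":\"1023\"},{\"id\":\"5300\"}]}";

        // Jayway evaluates a wildcard path to a List containing every match,
        // which suggests the single value seen in PDI comes from how the step
        // maps matches to rows rather than from the path syntax itself.
        List<String> ids = JsonPath.read(json, "$.players[*].id");
        System.out.println(ids); // ["5381", "1023", "5300"]
    }
}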
UPDATE 1:
I've been looking at this: https://forums.pentaho.com/threads/135882-Parsing-JSON-data-without-knowing-field-names/ It seems like the Modified Java Script Value step might be the way to go. Will continue testing.
UPDATE 2:
OK - used the MJSV step as posted above along with a Select Values step and was finally able to get the keys:
// parse the JSON string held in the incoming field (here called mydata)
var obj = JSON.parse(mydata);
var keys = Object.keys(obj);

// emit one output row per key
for (var i = 0; i < keys.length; i++) {
    var row = createRowCopy(getOutputRowMeta().size());
    var idx = getInputRowMeta().size();
    row[idx++] = keys[i];   // write the key into the first added output field
    putRow(row);
}

// don't pass the original (un-split) row downstream
trans_Status = SKIP_TRANSFORMATION;

Related

Can't figure out how to insert keys and values of nested JSON data into SQL rows with NiFi

I'm working on a personal project and very new (learning as I go) to JSON, NiFi, SQL, etc., so forgive any confusing language used here or a potentially really obvious solution. I can clarify as needed.
I need to take the JSON output from a website's API call and insert it into a table in my MariaDB local server that I've set up. The issue is that the JSON data is nested, and two of the key pieces of data that I need to insert are used as variable key objects rather than values, so I don't know how to extract it and put it in the database table. Essentially, I think I need to identify different pieces of the JSON expression and insert them as values, but I'm clueless how to do so.
I've played around with the EvaluateJsonPath, SplitJson, and FlattenJson processors in particular, but I can't make it work. All I can ever do is get the result of the whole expression, rather than each piece of it.
{"5381":{"wind_speed":4.0,"tm_st_snp":26.0,"tm_off_snp":74.0,"tm_def_snp":63.0,"temperature":58.0,"st_snp":8.0,"punts":4.0,"punt_yds":178.0,"punt_lng":55.0,"punt_in_20":1.0,"punt_avg":44.5,"humidity":47.0,"gp":1.0,"gms_active":1.0},
"1023":{"wind_speed":4.0,"tm_st_snp":26.0,"tm_off_snp":82.0,"tm_def_snp":56.0,"temperature":74.0,"off_snp":82.0,"humidity":66.0,"gs":1.0,"gp":1.0,"gms_active":1.0},
"5300":{"wind_speed":17.0,"tm_st_snp":27.0,"tm_off_snp":80.0,"tm_def_snp":64.0,"temperature":64.0,"st_snp":21.0,"pts_std":9.0,"pts_ppr":9.0,"pts_half_ppr":9.0,"idp_tkl_solo":4.0,"idp_tkl_loss":1.0,"idp_tkl":4.0,"idp_sack":1.0,"idp_qb_hit":2.0,"humidity":100.0,"gp":1.0,"gms_active":1.0,"def_snp":23.0},
"608":{"wind_speed":6.0,"tm_st_snp":20.0,"tm_off_snp":53.0,"tm_def_snp":79.0,"temperature":88.0,"st_snp":4.0,"pts_std":5.5,"pts_ppr":5.5,"pts_half_ppr":5.5,"idp_tkl_solo":4.0,"idp_tkl_loss":1.0,"idp_tkl_ast":1.0,"idp_tkl":5.0,"humidity":78.0,"gs":1.0,"gp":1.0,"gms_active":1.0,"def_snp":56.0},
"3396":{"wind_speed":6.0,"tm_st_snp":20.0,"tm_off_snp":60.0,"tm_def_snp":70.0,"temperature":63.0,"st_snp":19.0,"off_snp":13.0,"humidity":100.0,"gp":1.0,"gms_active":1.0}}
This is a snapshot of an output with a couple thousand lines. Each of the numeric keys that you see above (5381, 1023, 5300, etc) are player IDs for the following stats. I have a table set up with three columns: Player ID, Stat ID, and Stat Value. For example, I need that first snippet to be inserted into my table as such:
Player ID Stat ID Stat Value
5381 wind_speed 4.0
5381 tm_st_snp 26.0
5381 tm_off_snp 74.0
And so on, for each piece of data. But I don't know how to have NiFi select the right pieces of data to insert in the right columns.
I believe it's possible to use jolt to transform your JSON into a format like this:
[
{"playerId":"5381", "statId":"wind_speed", "statValue": 0.123},
{"playerId":"5381", "statId":"tm_st_snp", "statValue": 0.456},
...
]
then use PutDatabaseRecord with a JSON record reader.
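If it helps to see the flattening itself, here is the same walk over the player-ID keys and their stat keys written as plain Java with Jackson, outside NiFi. It's only a sketch of the record layout the jolt spec (or record writer) needs to produce; the class name and the trimmed sample document are made up for the example:
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Iterator;
import java.util.Map;

public class FlattenStats {
    public static void main(String[] args) throws Exception {
        // trimmed sample of the flow file content
        String json = "{\"5381\":{\"wind_speed\":4.0,\"tm_st_snp\":26.0}}";
        JsonNode root = new ObjectMapper().readTree(json);

        // outer keys are player IDs, inner keys are stat IDs
        Iterator<Map.Entry<String, JsonNode>> players = root.fields();
        while (players.hasNext()) {
            Map.Entry<String, JsonNode> player = players.next();
            Iterator<Map.Entry<String, JsonNode>> stats = player.getValue().fields();
            while (stats.hasNext()) {
                Map.Entry<String, JsonNode> stat = stats.next();
                // one (playerId, statId, statValue) record per inner entry
                System.out.println(player.getKey() + "\t" + stat.getKey() + "\t" + stat.getValue().asDouble());
            }
        }
    }
}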
Another approach is to use the ExecuteGroovyScript processor.
Add a new property to it named SQL.mydb and link it to your DBCP controller service.
Then use the following script as the Script Body parameter:
import groovy.json.JsonSlurper
import groovy.json.JsonBuilder

def ff = session.get()
if (!ff) return

//read flow file content and parse it
def body = ff.read().withReader("UTF-8") { reader ->
    new JsonSlurper().parse(reader)
}

def results = []

//use defined sql connection to create a batch
SQL.mydb.withTransaction {
    def cmd = 'insert into mytable(playerId, statId, statValue) values(?,?,?)'
    results = SQL.mydb.withBatch(100, cmd) { statement ->
        //run through all keys/subkeys in flow file body
        body.each { pid, keys ->
            keys.each { k, v ->
                statement.addBatch(pid, k, v)
            }
        }
    }
}

//write results as a new flow file content
ff.write("UTF-8") { writer ->
    new JsonBuilder(results).writeTo(writer)
}

//transfer to success
REL_SUCCESS << ff

Lucene query: parse exception

I am using Alfresco and trying to execute the following queries.
1st:
PATH:"/app:company_home/st:sites/cm:swsdp/cm:dataLists/cm:aea88103-517e-4aa0-a3be-de258d0e6465//*"
The 1st query works properly, but the 2nd query cannot be parsed.
2nd:
+PATH:"/app:company_home/st:swsdp/cm:/cm:dataLists/cm:9787a75b-cbc9-4d42-b76c-df88461e62c6//*"
Exception :
Cannot parse '+PATH:"/app:company_home/st:swsdp/cm:/cm:dataLists/cm:9787a75b-cbc9-4d42-b76c-df88461e62c6//*" AND +TYPE:"fdm:formDatalist"': Failed to parse XPath...
Unexpected '9787'
I tried escaping, but I still get the same error:
+PATH:"/app:company_home/st:swsdp/cm:/cm:dataLists/cm:9787a75b\-cbc9\-4d42-b76c\-df88461e62c6//
I noticed that in the 1st query cm:aea88103-517e-4aa0-a3be-de258d0e6465 starts with a letter, but in the 2nd query cm:9787a75b-cbc9-4d42-b76c-df88461e62c6 starts with a number, which seems to be why it can't be parsed.
Please help me solve this error.
Certain characters need to be hex-encoded for Lucene PATH queries.
You need to encode your path this way:
var rawString = "//test:123 DIR/FILE.TXT #";
=> rawString: //test:123 DIR/FILE.TXT #
var encodedString = search.ISO9075Encode(rawString);
=> encodedString: _x002f__x002f_test_x003a_123_x0020_DIR_x002f_FILE.TXT_x0020__x0040_
var decodedString = search.ISO9075Decode(encodedString);
=> decodedString: //test:123 DIR/FILE.TXT #
See the Alfresco documentation for more information: http://docs.alfresco.com/5.2/references/API-JS-iso9075Encode.html
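If you are building the query from Java instead of JavaScript (for example in a repository webscript), the same encoding is available as org.alfresco.util.ISO9075. A rough sketch, assuming the path should mirror your working first query (the st:sites/cm:swsdp prefix is taken from that query, not verified against your repository):
import org.alfresco.util.ISO9075;

public class DataListPathQuery {
    public static void main(String[] args) {
        String listName = "9787a75b-cbc9-4d42-b76c-df88461e62c6";

        // An XPath name may not start with a digit, which is why the parser
        // rejects cm:9787a75b-...; encoding turns the leading 9 into _x0039_.
        String encoded = ISO9075.encode(listName);

        String query = "+PATH:\"/app:company_home/st:sites/cm:swsdp/cm:dataLists/cm:"
                + encoded + "//*\" AND +TYPE:\"fdm:formDatalist\"";
        System.out.println(query);
    }
}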

Talend - Dynamic Column Name (Enterprise version)

Can anyone help me solve this case?
I have many files to process; two of them are shown in the screenshot below, along with my expected output.
I use this transformation on Talend: tFileList---tInputExcel---tUnpivotRow---tMap---tPostgresqlOutput
The output is different from my expected output. This is the screenshot of the output:
Can anyone help me reach my expected output, as shown in my first picture above?
This will be pretty hard. You'd have to handle the input as a text file, and whenever you find the "store" value in the first column you'd update your types with that row's values.
Here's how I'd start:
Basically the tJavaFlex begin part would contain:
String col1Type;
String colNType;
and the main part:
if (input_row.col0.equalsIgnoreCase("store")) {
    col1Type = input_row.col1;
    col2Type = input_row.col2;
    colNType = input_row.colN;
    continue; /* so this record will be ignored by the rest of the components! */
}
output_row.col1Type = col1Type;
output_row.col1Value = Integer.valueOf(input_row.col1); /* because we have text and need numbers */
I think using propagate results will save you from writing down all the other fields.
And from here it would be very simple as you have key-type-value-type-value-type-value results.

How to use Bioproject ID, for example, PRJNA12997, in biopython?

I have an Excel file listing more than 2000 organisms, each of which has a BioProject ID associated with it (like PRJNA12997). The idea is to use these IDs to get the sequences for a later multiple alignment with five other sequences that I have in a text file.
Can anyone help me understand how I can do this using Biopython? At least the part with the BioProject ID.
You can first get the info using Bio.Entrez:
from Bio import Entrez
Entrez.email = "Your.Name.Here@example.org"
# This call to efetch fails sometimes with a 400 error.
handle = Entrez.efetch(db="bioproject", id="PRJNA12997")
I've been trying, and Entrez.read(handle) doesn't seem to work. But if you do record_xml = handle.read() you'll get the XML entry for this record. In this XML you can find the ID for the organism, in this case 12997.
handle = Entrez.esearch(db="nuccore", term="12997[BioProject]")
search_results = Entrez.read(handle)
Now you can efetch from your search results. At this point you should use Biopython to parse whatever you get in the efetch step, playing with the rettype: http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/
for result in search_results["IdList"]:
    entry = Entrez.efetch(db="nuccore", id=result, rettype="fasta")
    this_seq_in_fasta = entry.read()

REGEX_EXTRACT error in PIG

I have a CSV file with 3 columns: tweetid, tweet, and userid. However, within the tweet column there are comma-separated values.
e.g. one row of data:
`396124437168537600`,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.",savava143
I want to extract all 3 fields individually, but REGEX_EXTRACT is giving me an error with this code:
a = LOAD 'tweets' USING PigStorage(',') AS (f1,f2,f3);
b = FILTER a BY REGEX_EXTRACT(f1,'(.*)\\"(.*)',1);
The error is:
error: Filter's condition must evaluate to boolean.
In the use case shared, reading the data using PigStorage(',') will result in missing savava143 (the last field value).
A = LOAD '/Users/muralirao/learning/pig/a.csv' USING PigStorage(',') AS (f1,f2,f3);
DUMP A;
Output : A : Observe that the last field value is missing.
(396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.")
For the use case shared, to extract all the values from a CSV file whose field values contain ',', we can use either CSVExcelStorage or CSVLoader.
Approach 1 : Using CSVExcelStorage
Ref : http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html
Input : a.csv
396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.",savava143
Pig Script :
REGISTER piggybank.jar;
A = LOAD 'a.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS (f1,f2,f3);
DUMP A;
Output : A
(396124437168537600,I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.,savava143)
Approach 2 : Using CSVLoader
Ref : http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/CSVLoader.html
Below script makes use of CSVLoader(), DUMP A will result in the same output seen earlier.
A = LOAD 'a.csv' USING org.apache.pig.piggybank.storage.CSVLoader() AS (f1,f2,f3);
The error is that you do not want to FILTER based on a regex but to GENERATE new fields based on a regex. To filter, you need to know whether the line has to be filtered, hence the boolean requirement.
Therefore, you have to use :
b = FOREACH a GENERATE REGEX_EXTRACT(FIELD, REGEX, HOW_MANY_GROUPS_TO_RETURN);
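If you do go the regex route, it can help to sanity-check the grouping outside Pig first. A quick check with plain java.util.regex against the sample row (the pattern assumes the id is backtick-quoted and the tweet is double-quoted, as in the example line):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TweetRegexCheck {
    public static void main(String[] args) {
        String line = "`396124437168537600`,\"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.\",savava143";

        // group 1 = tweetid, group 2 = tweet, group 3 = userid
        Pattern p = Pattern.compile("^`(\\d+)`,\"(.*)\",([^,]*)$");
        Matcher m = p.matcher(line);
        if (m.matches()) {
            System.out.println(m.group(1)); // 396124437168537600
            System.out.println(m.group(2)); // the full tweet text, commas included
            System.out.println(m.group(3)); // savava143
        }
    }
}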
However, as @Murali Rao said, your values are not just comma separated but CSV (think about how you will handle a comma in a tweet: it is not a field separator, just some content).