Extracting tfidf-vectors by key without destroying the fileformat - apache-pig

I have about 200,000 tf-idf vectors in the output format that seq2sparse produces. I need to extract 500 of them, but not randomly as the split function would. I know the keys of those 500, and I need them in the same data format as the seq2sparse output.
When I open the sequence file with the 200,000 entries, I can see that the keys are encoded as org.apache.hadoop.io.Text and the values as org.apache.mahout.math.VectorWritable.
But when I try to use
https://github.com/kevinweil/elephant-bird/blob/master/mahout/src/main/java/com/twitter/elephantbird/pig/mahout/VectorWritableConverter.java
and
https://github.com/kevinweil/elephant-bird/blob/master/pig/src/main/java/com/twitter/elephantbird/pig/store/SequenceFileStorage.java
in Pig Latin to read and write them, the output has org.apache.hadoop.io.Text for both key and value.
I really need exactly those 500 entries in this format because I want to use them in trainnb and testnb.
Basically it would be enough to know how I can do something like the reverse of mahout seqdumper.

While there's no specific Mahout command to do this, you could write a relatively simple utility function using Mahout's:
org.apache.mahout.common.Pair;
org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;
org.apache.mahout.math.VectorWritable;
and:
org.apache.hadoop.io.SequenceFile;
org.apache.hadoop.io.Text;
com.google.common.io.Closeables;
You could do something like the following:
// load up the 500 desired keys with some function
// (java.util.Vector here, not Mahout's Vector)
Vector<Text> desiredKeys = getDesiredKeys();
// create a new SequenceFile writer for the 500 desired vectors
SequenceFile.Writer writer =
    SequenceFile.createWriter(fs, conf, output500filePath,
        Text.class,
        VectorWritable.class);
try {
    // create an iterator over the tfidfVector sequence file
    SequenceFileIterable<Text, VectorWritable> seqFileIterable =
        new SequenceFileIterable<Text, VectorWritable>(
            tfidfVectorPath, true, conf);
    // loop over tfidf sequence file and write out only Pairs with keys
    // contained in the desiredKeys Vector to the output500file
    for (Pair<Text, VectorWritable> pair : seqFileIterable) {
        if (desiredKeys.contains(pair.getFirst())) {
            writer.append(pair.getFirst(), pair.getSecond());
        }
    }
} finally {
    Closeables.close(writer, false);
}
Then use the path to the "output500file" as the input to trainnb. Using Vector.contains() is not the most efficient way to do it, but this is the general idea.
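The membership test is the part worth changing: a hash set gives constant-time lookups on average, where Vector.contains() scans the whole list for every record. Below is a minimal sketch of that filtering idea in Python. It uses plain tuples as stand-ins for the (Text, VectorWritable) pairs, the name filter_by_keys and the sample data are hypothetical, and it does not read or write real SequenceFiles.

```python
def filter_by_keys(pairs, desired_keys):
    """Yield only the (key, value) pairs whose key is in desired_keys."""
    wanted = set(desired_keys)  # hash set: O(1) average membership test
    for key, value in pairs:
        if key in wanted:
            yield key, value


# Hypothetical stand-in data: document keys mapped to sparse tf-idf vectors.
all_vectors = [
    ("/doc/1", {0: 0.5, 7: 1.2}),
    ("/doc/2", {3: 0.9}),
    ("/doc/3", {1: 0.1, 2: 0.4}),
]
selected = list(filter_by_keys(all_vectors, ["/doc/1", "/doc/3"]))
```

In the Java version the same effect should come from holding the keys in a java.util.HashSet<Text> instead of a Vector, since Text defines equals() and hashCode().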

Related

Pentaho JsonInput GET fields

I'm trying to use PDI to read data from an API (JSON). For now I'm simply trying to use the JSON Input step to get a few specific fields, but the Get Fields button on the input step gives me:
ERROR (version 8.3.0.0-371, build 8.3.0.0-371 from 2019-06-11 11.09.08 by buildguy) : Index 1 out of bounds for length 1
All the steps execute fine and produce data; only the JSON Input step won't give me the fields option. I've tried the text file and JSON output steps, and both write valid JSON, so I don't know what's going on.
PS. this is my first time using PDI
ISSUE 2:
It looks like PDI uses Jayway for its JSONPath parsing, so I've been using this site, https://jsonpath.herokuapp.com/, with the Jayway implementation selected, and it gives me my expected path. When I put that path into the 'Fields' of the JSON Input dialog, I only get the FIRST instance of the path value instead of every instance in the JSON. I can't figure out why, though I assume it has something to do with PDI's row-based view of things. I also don't know how to make it understand that this is JSON and that it should return all values that match the path.
UPDATE 1:
I've been looking at this thread: https://forums.pentaho.com/threads/135882-Parsing-JSON-data-without-knowing-field-names/ It seems like the Modified Java Script Value step might be the way to go. Will continue testing.
UPDATE 2
OK, I used the MJSV step as posted above along with a select fields step and was finally able to get the keys:
var obj = JSON.parse(mydata);
var keys = Object.keys(obj);
// emit one output row per top-level key
for (var i = 0; i < keys.length; i++) {
    var row = createRowCopy(getOutputRowMeta().size());
    var idx = getInputRowMeta().size();
    row[idx++] = keys[i];
    putRow(row);
}
// skip the original input row; only the generated rows are passed on
trans_Status = SKIP_TRANSFORMATION;

Can't figure out how to insert keys and values of nested JSON data into SQL rows with NiFi

I'm working on a personal project and very new (learning as I go) to JSON, NiFi, SQL, etc., so forgive any confusing language used here or a potentially really obvious solution. I can clarify as needed.
I need to take the JSON output from a website's API call and insert it into a table in my MariaDB local server that I've set up. The issue is that the JSON data is nested, and two of the key pieces of data that I need to insert are used as variable key objects rather than values, so I don't know how to extract it and put it in the database table. Essentially, I think I need to identify different pieces of the JSON expression and insert them as values, but I'm clueless how to do so.
I've played around with the EvaluateJsonPath, SplitJson, and FlattenJson processors in particular, but I can't make it work. All I can ever do is get the result of the whole expression, rather than each piece of it.
{"5381":{"wind_speed":4.0,"tm_st_snp":26.0,"tm_off_snp":74.0,"tm_def_snp":63.0,"temperature":58.0,"st_snp":8.0,"punts":4.0,"punt_yds":178.0,"punt_lng":55.0,"punt_in_20":1.0,"punt_avg":44.5,"humidity":47.0,"gp":1.0,"gms_active":1.0},
"1023":{"wind_speed":4.0,"tm_st_snp":26.0,"tm_off_snp":82.0,"tm_def_snp":56.0,"temperature":74.0,"off_snp":82.0,"humidity":66.0,"gs":1.0,"gp":1.0,"gms_active":1.0},
"5300":{"wind_speed":17.0,"tm_st_snp":27.0,"tm_off_snp":80.0,"tm_def_snp":64.0,"temperature":64.0,"st_snp":21.0,"pts_std":9.0,"pts_ppr":9.0,"pts_half_ppr":9.0,"idp_tkl_solo":4.0,"idp_tkl_loss":1.0,"idp_tkl":4.0,"idp_sack":1.0,"idp_qb_hit":2.0,"humidity":100.0,"gp":1.0,"gms_active":1.0,"def_snp":23.0},
"608":{"wind_speed":6.0,"tm_st_snp":20.0,"tm_off_snp":53.0,"tm_def_snp":79.0,"temperature":88.0,"st_snp":4.0,"pts_std":5.5,"pts_ppr":5.5,"pts_half_ppr":5.5,"idp_tkl_solo":4.0,"idp_tkl_loss":1.0,"idp_tkl_ast":1.0,"idp_tkl":5.0,"humidity":78.0,"gs":1.0,"gp":1.0,"gms_active":1.0,"def_snp":56.0},
"3396":{"wind_speed":6.0,"tm_st_snp":20.0,"tm_off_snp":60.0,"tm_def_snp":70.0,"temperature":63.0,"st_snp":19.0,"off_snp":13.0,"humidity":100.0,"gp":1.0,"gms_active":1.0}}
This is a snapshot of an output with a couple thousand lines. Each of the numeric keys that you see above (5381, 1023, 5300, etc.) is a player ID for the stats that follow. I have a table set up with three columns: Player ID, Stat ID, and Stat Value. For example, I need that first snippet to be inserted into my table as such:
Player ID | Stat ID    | Stat Value
5381      | wind_speed | 4.0
5381      | tm_st_snp  | 26.0
5381      | tm_off_snp | 74.0
And so on, for each piece of data. But I don't know how to have NiFi select the right pieces of data to insert in the right columns.
I believe it's possible to use Jolt to transform your JSON into this format:
[
{"playerId":"5381", "statId":"wind_speed", "statValue": 4.0},
{"playerId":"5381", "statId":"tm_st_snp", "statValue": 26.0},
...
]
and then use PutDatabaseRecord with a JSON reader.
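The target shape is easier to see outside NiFi. The following is a sketch in plain Python, not a Jolt spec and not something NiFi runs as-is: it only shows the reshaping from nested player/stat objects to one flat record per stat. The function name flatten_player_stats and the shortened sample input are hypothetical.

```python
import json


def flatten_player_stats(raw_json):
    """Turn {playerId: {statId: statValue}} into a list of flat records."""
    nested = json.loads(raw_json)
    return [
        {"playerId": player_id, "statId": stat_id, "statValue": stat_value}
        for player_id, stats in nested.items()
        for stat_id, stat_value in stats.items()
    ]


# Shortened version of the API output from the question.
sample = '{"5381": {"wind_speed": 4.0, "tm_st_snp": 26.0}, "1023": {"gp": 1.0}}'
records = flatten_player_stats(sample)
```

Each element of records maps directly onto one row of the three-column table, which is the form a record-oriented writer needs.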
Another approach is to use the ExecuteGroovyScript processor.
Add a new parameter named SQL.mydb to it and link it to your DBCP controller service.
Then use the following script as the Script Body parameter:
import groovy.json.JsonSlurper
import groovy.json.JsonBuilder
def ff = session.get()
if (!ff) return
// read flow file content and parse it
def body = ff.read().withReader("UTF-8") { reader ->
    new JsonSlurper().parse(reader)
}
def results = []
// use defined sql connection to create a batch
SQL.mydb.withTransaction {
    def cmd = 'insert into mytable(playerId, statId, statValue) values(?,?,?)'
    results = SQL.mydb.withBatch(100, cmd) { statement ->
        // run through all keys/subkeys in flow file body
        body.each { pid, keys ->
            keys.each { k, v ->
                statement.addBatch(pid, k, v)
            }
        }
    }
}
// write results as a new flow file content
ff.write("UTF-8") { writer ->
    new JsonBuilder(results).writeTo(writer)
}
// transfer to success
REL_SUCCESS << ff

How to reorder the columns of a CSV input to a fixed column order in Pentaho

Scenario:
I have created a transformation to load data into a table from a CSV file, and the CSV file has the following columns:
Customer_Id
Company_Id
Employee_Name
But the user may supply an input file with the columns in a different (random) order, such as:
Employee_Name
Company_Id
Customer_Id
So, if I try to load a file that has a random column order, will Kettle load the correct column values according to the column names?
Using ETL Metadata Injection you can use a transformation like this, to either normalize the data, or to store it to your database:
Then you just need to send the correct data to that transformation. You can read the header line from the CSV, and use Row Normaliser to convert to the format used by ETL Metadata Injection.
I have included a quick example here: csv_inject on Dropbox. If you build something like this and run it from something that runs it once per CSV file, it should work.
Ooh, that's some nasty JavaScript!
The way to do this is with metadata injection. Look at the samples, but basically you need a template transformation that reads the file and writes it back out. You then use another, parent transformation to figure out the headings, configure that template, and execute it.
There are samples in the PDI samples folder; also take a look at the "figuring out file format" example in Matt Casters' blueprints project on GitHub.
You could try something like this as your JavaScript:
//Script here
var seen;
trans_Status = CONTINUE_TRANSFORMATION;
var col_names = ['Customer_Id','Company_Id','Employee_Name'];
var col_pos;
if (!seen) {
    // First line
    trans_Status = SKIP_TRANSFORMATION;
    seen = 1;
    col_pos = [-1,-1,-1];
    for (var i = 0; i < col_names.length; i++) {
        for (var j = 0; j < row.length; j++) {
            if (row[j] == col_names[i]) {
                col_pos[i] = j;
                break;
            }
        }
        if (col_pos[i] === -1) {
            writeToLog("e", "Cannot find " + col_names[i]);
            trans_Status = ERROR_TRANSFORMATION;
            break;
        }
    }
}
var Customer_Id = row[col_pos[0]];
var Company_Id = row[col_pos[1]];
var Employee_Name = row[col_pos[2]];
Here is the .ktr I tried: csv_reorder.ktr
(edit, here are the test csv files)
1.csv:
Customer_Id,Company_Id,Employee_Name
cust1,comp1,emp1
2.csv:
Employee_Name,Company_Id,Customer_Id
emp2,comp2,cust2
Assuming rejecting the input file is not an option, you basically have 4 solutions:
1. Reorder the fields in an external editor (don't use Excel if the file contains dates).
2. Use code within your transformation to detect the column headers and reorder the file.
3. Use metadata injection as proposed by bolav.
4. Create a job. This needs to:
a. load the file into a temporary database.
b. use an SQL statement to retrieve the fields (use a SELECT with an ORDER BY clause).
c. output the file in the correct order.
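The header-detection idea behind option 2 can be sketched outside PDI. The Python below is an illustration only, not a Kettle step: it reads the header line, checks that every expected column is present, and rewrites the rows in the fixed order. The names reorder_csv and TARGET_ORDER are hypothetical; the sample input is the 2.csv test file from the answer above.

```python
import csv
import io

# Fixed column order expected by the target table.
TARGET_ORDER = ["Customer_Id", "Company_Id", "Employee_Name"]


def reorder_csv(text, target_order=TARGET_ORDER):
    """Return the CSV text with its columns rearranged into target_order."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [name for name in target_order if name not in header]
    if missing:
        raise ValueError("Cannot find " + ", ".join(missing))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=target_order,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)  # values are looked up by column name
    return out.getvalue()


shuffled = "Employee_Name,Company_Id,Customer_Id\nemp2,comp2,cust2\n"
fixed = reorder_csv(shuffled)
```

Because values are looked up by name rather than by position, the same call handles any permutation of the three columns.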

How to evenly distribute data in apache pig output files?

I've got a pig-latin script that takes in some xml, uses the XPath UDF to pull out some fields and then stores the resulting fields:
REGISTER udf-lib-1.0-SNAPSHOT.jar;
DEFINE XPath com.blah.udfs.XPath();
docs = LOAD '$input' USING com.blah.storage.XMLLoader('root') as (content:chararray);
results = FOREACH docs GENERATE XPath(content, 'root/id'), XPath(content, 'root/otherField'), content;
store results into '$output';
Note that we're using pig-0.12.0 on our cluster, so I ripped the XPath/XMLLoader classes out of pig-0.14.0 and put them in my own jar so that I could use them in 0.12.
The script above works fine and produces the data that I'm looking for. However, it generates over 1,900 part-files with only a few MB in each file. I learned about the default_parallel option, so I set that to 128 to try to get 128 part-files. I ended up having to add a piece to force a reduce phase to achieve this. My script now looks like:
set default_parallel 128;
REGISTER udf-lib-1.0-SNAPSHOT.jar;
DEFINE XPath com.blah.udfs.XPath();
docs = LOAD '$input' USING com.blah.storage.XMLLoader('root') as (content:chararray);
results = FOREACH docs GENERATE XPath(content, 'root/id'), XPath(content, 'root/otherField'), content;
forced_reduce = FOREACH (GROUP results BY RANDOM()) GENERATE FLATTEN(results);
store forced_reduce into '$output';
Again, this produces the expected data. Also, I now get 128 part-files. My problem now is that the data is not evenly distributed among the part-files. Some have 8 gigs, others have 100 mb. I should have expected this when grouping them by RANDOM() :).
My question is what would be the preferred way to limit the number of part-files yet still have them evenly-sized? I'm new to pig/pig latin and assume I'm going about this in the completely wrong way.
p.s. the reason I care about the number of part-files is because I'd like to process the output with spark and our spark cluster seems to do a lot better with a smaller number of files.
I'm still looking for a way to do this directly from the pig script but for now my "solution" is to repartition the data within the spark process that works on the output of the pig script. I use the RDD.coalesce function to rebalance the data.
From the first code snippet, I assume it is a map-only job, since you are not using any aggregates.
Instead of using reducers, set the property pig.maxCombinedSplitSize:
REGISTER udf-lib-1.0-SNAPSHOT.jar;
DEFINE XPath com.blah.udfs.XPath();
docs = LOAD '$input' USING com.blah.storage.XMLLoader('root') as (content:chararray);
results = FOREACH docs GENERATE XPath(content, 'root/id'), XPath(content, 'root/otherField'), content;
store results into '$output';
exec;
set pig.maxCombinedSplitSize 1000000000; -- 1 GB(given size in bytes)
x = load '$output' using PigStorage();
store x into '$output2' using PigStorage();
Setting pig.maxCombinedSplitSize makes each mapper read around 1 GB of data. The second load/store pair in the code above runs as an identity map-only job, so the data is written out in part-files of roughly 1 GB each.
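To pick a value for a target number of part-files, a rough estimate is the total data size divided by the combined split size. The sketch below shows only that arithmetic; the 128 GB total is a hypothetical figure, and the real count can differ because splits follow file and block boundaries.

```python
import math


def estimate_part_files(total_input_bytes, max_combined_split_bytes):
    """Rough count of mappers (and so part-files) for a map-only job."""
    return max(1, math.ceil(total_input_bytes / max_combined_split_bytes))


# Hypothetical figure: 128 GB of intermediate output, 1 GB combined splits.
total_bytes = 128 * 1000**3
part_files = estimate_part_files(total_bytes, 1_000_000_000)
```

Working backwards, dividing the total size by the desired number of files gives a starting value for the property.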

Write extracted data to a file using jmeter

I am using JMeter v2.5.
I need to get data from the responses of the test and extract data from it (which I am doing using regular exp extractor). How do I store this extracted data to a file?
Just solved a similar problem. After getting the data using a regular expression extractor, add a BeanShell PostProcessor element. Use the code below to write the variables to a file:
name = vars.get("name");
email = vars.get("email");
log.info(email); // if you want to log something to jmeter.log file
// Pass true if you want to append to existing file
// If you want to overwrite, then don't pass the second argument
f = new FileOutputStream("/my/file/path/result.csv", true);
p = new PrintStream(f);
this.interpreter.setOut(p);
print(name + "," + email);
f.close();
import org.apache.jmeter.services.FileServer;
String path=FileServer.getFileServer().getBaseDir();
name1= vars.get("user_Name_value");
name2= vars.get("UserId_value");
f = new FileOutputStream("E://csvfile/result.csv", true); // pass true to append to the file; omit it to overwrite
p = new PrintStream(f);
this.interpreter.setOut(p);
p.println(name1+"," +name2);
f.close();
This worked for me; I hope it works for you too.
If you just want to write extracted variables to CSV results file, then just add to user.properties the variables you want:
sample_variables=name,email
As per doc:
https://jmeter.apache.org/usermanual/properties_reference.html#results_file_config
They will be appended as the last columns of the CSV results file.
You have a couple of options:
1. You can tally the results by adding an Aggregate Report listener to your thread group => Add Listener => Aggregate Report
2. You can get raw results by adding a Simple Data Writer listener to your thread group => Add Listener => Simple Data Writer
Hope this helps.
You may use https://jmeter-plugins.org/wiki/FlexibleFileWriter/ with sample variables set up,
or with a fake Dummy Sampler.
Either way, Flexible File Writer is a good choice for writing data to a file.