Can we do string manipulation and conditional checks in Smooks?

I want to manipulate a large text file, which arrives as plain text, and I want to use Smooks to do it. The file contains a large number of lines, and from each line I have to split off characters and extract information.
E.g., I do the following in Java:
row.substring(0, 4)
row.substring(4, 64)
I have to convert the text content to a CSV file.
Can we do the exact same string manipulation in Smooks too (that is, can I do it in the Smooks configuration)? I believe I can use fixed-length processing for that?
How do I add an if/else condition in the Smooks configuration?
Like in Java:
if (row.length() == 900) {
    // DO
} else {
    // DO
}

We can do the string manipulation using the fixed-length reader [1], but I still cannot find a way to do a condition check, e.g. if/else.
[1] http://www.smooks.org/mediawiki/index.php?title=V1.4:Smooks_v1.4_User_Guide#XML
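For instance, a fixed-length reader configuration along these lines should cover the substring(0, 4) / substring(4, 64) split from the question (a sketch modelled on the v1.4 user guide example; the field names are made up, and the exact namespace URIs should be verified against your Smooks version):

<?xml version="1.0"?>
<smooks-resource-list xmlns="http://www.milyn.org/xsd/smooks-1.1.xsd"
                      xmlns:fl="http://www.milyn.org/xsd/smooks/fixed-length-1.3.xsd">
    <!-- recordType covers characters 0-3, payload covers characters 4-63 -->
    <fl:reader fields="recordType[4],payload[60]" />
</smooks-resource-list>

The reader turns each line into one element per field, which downstream visitors or templates can then write out as CSV.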

If the format does not fit the flatfile reader, then you might be able to use the regex reader: https://github.com/smooks/smooks/tree/v1.5.1/smooks-examples/flatfile-to-xml-regex/
As for the conditional stuff... you really need to bind the data fragments into a Java model of some sort (real or virtual) and then conditionally process those fragments, either by adding logic to the visitors applied to the elements, or by routing the fragments to another process that handles them in parallel, which is a far better way of processing a huge data stream.
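To make the "conditionally process the fragments" part concrete, here is a minimal sketch of a Smooks v1.x DOM visitor carrying the if/else from the question. The class name and the record element it would be configured against are hypothetical; check the visitor interfaces against the Smooks version you are actually on:

import org.milyn.SmooksException;
import org.milyn.container.ExecutionContext;
import org.milyn.delivery.dom.DOMVisitAfter;
import org.w3c.dom.Element;

// Hypothetical visitor, configured in smooks-config.xml to fire on each
// record element produced by the reader.
public class RecordLengthVisitor implements DOMVisitAfter {
    public void visitAfter(Element element, ExecutionContext executionContext) throws SmooksException {
        String row = element.getTextContent();
        if (row.length() == 900) {
            // DO: handle 900-character records
        } else {
            // DO: handle everything else
        }
    }
}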

Related

How to remove stopwords from text during preprocessing in Spark

I have a requirement to preprocess the data in Spark before running the algorithms.
One piece of the preprocessing logic is to remove stopwords from the text. I tried Spark's StopWordsRemover. StopWordsRemover requires both input and output to be Array[String]. After running the program, the final column's output is shown as a collection of strings, but I need a plain string.
My code is as follows:
val tokenizer: RegexTokenizer = new RegexTokenizer().setInputCol("raw").setOutputCol("token")
val stopWordsRemover = new StopWordsRemover().setInputCol("token").setOutputCol("final")
stopWordsRemover.setStopWords(stopWordsRemover.getStopWords ++ customizedStopWords)
val tokenized: DataFrame = tokenizer.transform(hiveDF)
val transformDF = stopWordsRemover.transform(tokenized)
Actual output:
["rt messy support need help with bill"]
Required output:
rt messy support need help with bill
My output should be a string, not an array of strings. Is there any way to do this? I need the column in the DataFrame to come out as a plain string.
I would also like suggestions on which of the options below to use for removing stopwords from text in a Spark program:
StopWordsRemover from Spark MLlib
Stanford CoreNLP library
Which of the options gives better performance when parsing huge files?
Any help appreciated.
Thanks in advance.
You may use df.collect()[0] to get a string instead of an array of strings, if you are sure that only the first item is of interest.
However, that should not be an issue here as long as you traverse the array and read each item from it.
Ultimately hiveDF will give you an RDD[String], and it becomes an Array[String] when you collect it from the RDD.
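If you go the traversal route, a minimal sketch against Spark's Java API might look like this (transformDF is the DataFrame from the question; for large data you would flatten the column inside Spark instead, e.g. with functions.concat_ws, rather than collecting to the driver):

import java.util.List;
import org.apache.spark.sql.Row;

// Pull the "final" column to the driver and flatten each Array[String]
// cell into one space-separated string.
List<Row> rows = transformDF.select("final").collectAsList();
for (Row row : rows) {
    List<String> tokens = row.getList(0);
    String plain = String.join(" ", tokens);
    System.out.println(plain);  // rt messy support need help with bill
}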

Escape special characters in Apache Pig data

I am using Apache Pig to process some data.
My data set has some strings that contain special characters, i.e. (#, {, }, [, ]).
The Programming Pig book says that you can't escape those characters.
So how can I process my data without deleting the special characters?
I thought about replacing them but would like to avoid that.
Thanks
Have you tried loading your data? There is no way to escape these characters when they are part of the values in a tuple, bag, or map, but there is no problem whatsoever in loading these characters in when part of a string. Just specify that field as type chararray.
The only issue you will have to watch out for here is if your strings ever contain the character that Pig is using as field delimiter - for example, if you are USING PigStorage(',') and your strings contain commas. But as long as you are not telling Pig to parse your field as a map, #, [, and ] will be handled just fine.
The easiest way would be:
input = LOAD 'inputLocation' USING TextLoader() as unparsedString:chararray;
TextLoader just reads each line of input into a String regardless of what's inside that string. You could then use your own parsing logic.
When writing your loader function, instead of returning tuples with e.g. maps as a String (and thus later relying on Utf8StorageConverter to get the conversion to a map right):
Tuple tuple = tupleFactory.newTuple( 1 );
tuple.set(0, new DataByteArray("[age#22, name#joel]"));
you can create and set directly a Java map:
HashMap<String, Object> map = new HashMap<String, Object>(2);
map.put("age", 22);
map.put("name", "joel");
tuple.set(0, map);
This is useful especially if you have to do the parsing during loading anyway.
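For context, a hypothetical skeleton of such a loader might look like the following. Only the map-building idea comes from the snippet above; the class name, the key=value line format, and the parsing logic are invented for illustration:

import java.io.IOException;
import java.util.HashMap;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class MapLoader extends LoadFunc {
    private final TupleFactory tupleFactory = TupleFactory.getInstance();
    private RecordReader reader;

    @Override
    public InputFormat getInputFormat() {
        return new TextInputFormat();
    }

    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location);
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) {
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null; // end of input
            }
            // Hypothetical line format: "age=22,name=joel"
            String line = reader.getCurrentValue().toString();
            HashMap<String, Object> map = new HashMap<String, Object>();
            for (String pair : line.split(",")) {
                String[] kv = pair.split("=", 2);
                if (kv.length == 2) {
                    map.put(kv[0].trim(), kv[1].trim());
                }
            }
            // Set the Java map directly instead of a DataByteArray.
            Tuple tuple = tupleFactory.newTuple(1);
            tuple.set(0, map);
            return tuple;
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}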

SBJSON append new data into existing JSON file without parsing it first

I am making an app that lets the user draw on the screen in different colors and brush sizes. I am storing the info about each drawn path in a JSON file once it has been drawn to keep it out of memory. Right now I have it parsing all existing paths, then adding the new one in and writing it all back out again. I want it to simply append the new data into the JSON file without having to read it in and parse it first, that will make it so only one path is ever in memory at a time.
I am using SBJSON; the JSONWriter has a few append functions, but I think you need the JSON string to append to first, not the file, meaning I would have to read in the file anyway. Is there a way to do this without reading in the file at all? I know exactly how the data is structured.
It's possible, but you have to cheat a little. You can just create a stand-alone JSON document per path, and append that to the file. So you'll have something like this in your file:
{"name":"path1", "from": [0,3], "to":[3, 9]}
{"name":"path2", "from": [0,3], "to":[3, 9]}
{"name":"path3", "from": [0,3], "to":[3, 9]}
Note that this is not ONE JSON document but THREE. Handily, however, SBJsonStreamParser supports reading multiple JSON documents in one go. Set the supportMultipleDocuments property and plug it into an SBJsonStreamParserAdapter, and off you go. This also has the benefit that, if you have many, many paths in your file, you can start drawing before you've finished reading the whole file. (Because you get a callback for each path.)
You can see some information on the use case here.
I'm pretty sure it's not possible. What I ended up doing was reading in the JSON file as a string; then, instead of wasting memory converting all of that into dictionaries and arrays, I looked for the spot in the string where I wanted to insert data (e.g. just before where the string "], "texts"" showed up), inserted it there, and wrote the string back out to the file.
As far as I can tell this is the best solution.

Change URL using regex

I have a URL, for example:
http://i.myhost.com/myimage.jpg
I want to change this URL to
http://i.myhost.com/myimageD.jpg
(add a D after the image name and before the dot).
I.e., I want to add some characters after the image name and before the dot, using a regex.
What is the best way to do it using a regex?
Try using ^(.*)\.([a-zA-Z]{3,5}) and replacing with \1D.\2 (the dot has to be re-inserted, since neither group captures it). I'm assuming the extension is 3-5 alphabetic characters, but you can modify that to suit. E.g. if it's just jpg images then you can put jpg instead of the [a-zA-Z]{3,5}.
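For instance, in Java (a quick sketch; the same pattern should carry over to NSRegularExpression, which another answer below mentions, using $1-style templates):

// Capture everything before the last dot plus the extension after it,
// then re-insert the dot in the replacement.
String url = "http://i.myhost.com/myimage.jpg";
String modified = url.replaceAll("^(.*)\\.([a-zA-Z]{3,5})$", "$1D.$2");
System.out.println(modified);  // http://i.myhost.com/myimageD.jpg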
Sounds like a homework question, given that the solution must use a regex; on that assumption, here is an outline to get you going.
If all you have is a URL then @mathematical.coffee's solution will suit. However, if you have a chunk of text containing one or more URLs and you have to locate and change just those, then you'll need something a little more involved.
Look at the structure of a URL: {protocol}{address}{item}; where
{protocol} is "http://", "ftp://" etc.;
{address} is a name, e.g. "www.google.com", or a number, e.g. "74.125.237.116" - there will always be at least one dot in the address; and
{item} is "/name" where name is quite flexible - there will be zero or more items, you can think of them as directories and a file but this isn't strictly true. Also the sequence of items can end in a "/" (including when there are zero of them).
To make a regex which matches a URL, start by matching each part. In the case of the items you'll want to match the last in the sequence separately: you'll have zero or more "directories" and one "file", the latter of the form "name.extension".
Once you have regexes for each part, you just concatenate them to produce a regex for the whole. To form the replacement pattern, you can surround parts of your regex with parentheses and refer to those parts using \number in the replacement string - see @mathematical.coffee's solution for an example, and the sketch below.
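For instance, the parts might be composed like this (a rough sketch in Java; the character classes are simplified assumptions, not a complete URL grammar):

// Hypothetical per-part patterns for the {protocol}{address}{item} structure.
String protocol = "(?:https?|ftp)://";
String address  = "[^/\\s]+\\.[^/\\s]+";           // name or dotted IP: at least one dot, no '/'
String dirs     = "(?:/[^/\\s]*)*";                // zero or more intermediate items
String file     = "([^/.\\s]+)\\.([a-zA-Z]{3,5})"; // final "name.extension", captured

// Group everything before the file name so the replacement can re-emit it.
String urlRegex = "(" + protocol + address + dirs + "/)" + file;
String text = "see http://i.myhost.com/myimage.jpg for details";
String out = text.replaceAll(urlRegex, "$1$2D.$3");
System.out.println(out);  // see http://i.myhost.com/myimageD.jpg for details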
The best way to learn regexes is to use an editor which supports them and just experiment. The exact syntax may not be the same as NSRegularExpression's, but most flavours are pretty similar for the basic stuff, and you can translate from one to another easily.

Take input from function parameters, not a file

All of the examples I see read in a file, then lex and parse it.
I need a function which takes a string (char *, I'm generating C code) as a parameter and acts upon that.
How can I do that best? I thought of writing the string to a stream, then feeding that to the lexer, but it doesn't feel right. Is there any better way?
Thanks in advance
You would need to use the antlr3NewAsciiStringInPlaceStream method. (You didn't say which version of ANTLR you are using, so I'll assume ANTLR v3.)
The inputs to this method are the string to parse, its length, and then you can probably use NULL for the last input.
This produces an input stream similar to the one antlr3AsciiFileStreamNew would give you for parsing a file.
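Roughly like this (a hedged sketch: myString stands for your char * parameter, and the exact parameter types should be checked against the ANTLR3 C runtime headers):

pANTLR3_INPUT_STREAM input = antlr3NewAsciiStringInPlaceStream(
        (pANTLR3_UINT8) myString,          /* the string to parse */
        (ANTLR3_UINT32) strlen(myString),  /* its length */
        NULL);                             /* stream name; NULL is fine here */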
I see that you mentioned writing the input to a stream. If you can use C++ then that's the best method you'll probably come by.
This is the barebones code I normally use:
// Build an input stream from the string you were given (use an ifstream
// here instead if you want to parse a file). pszInput is the char * parameter.
std::string strInput(pszInput);
std::istringstream issInput(strInput);
Lexer oLexer(issInput);
Parser oParser(oLexer);
antlr::ASTFactory oFactory("CommonASTWithHiddenTokens", &antlr::CommonASTWithHiddenTokens::factory);
oParser.initializeASTFactory(oFactory);
oParser.setASTFactory(&oFactory);
oParser.main();
antlr::RefAST ast = oParser.getAST();
if (ast)
{
    TreeWalker oTreeWalker;
    oTreeWalker.main(ast, rPCode); // rPCode comes from the surrounding code
}
I think you should feed it to a stream. You could feed it to stdin if you'd like. That way, your code shouldn't differ too much from reading strings from a file.