Escape special characters in Apache pig data - apache-pig

I am using Apache Pig to process some data.
My data set has some strings that contain special characters i.e (#,{}[]).
This programming pig book says that you can't escape those characters.
So how can I process my data without deleting the special characters?
I thought about replacing them but would like to avoid that.
Thanks

Have you tried loading your data? There is no way to escape these characters when they are part of the values in a tuple, bag, or map, but there is no problem whatsoever in loading these characters in when part of a string. Just specify that field as type chararray.
The only issue you will have to watch out for here is if your strings ever contain the character that Pig is using as field delimiter - for example, if you are USING PigStorage(',') and your strings contain commas. But as long as you are not telling Pig to parse your field as a map, #, [, and ] will be handled just fine.

Easiest way would be,
input = LOAD 'inputLocation' USING TextLoader() as unparsedString:chararray;
TextLoader just reads each line of input into a String regardless of what's inside that string. You could then use your own parsing logic.

When writing your loader function, instead of returning tuples with e.g. maps as a String (and thus later relying on Utf8StorageConverter to get the conversion to a map right):
Tuple tuple = tupleFactory.newTuple( 1 );
tuple.set(0, new DataByteArray("[age#22, name#joel]"));
you can create and set directly a Java map:
HashMap<String, Object> map = new HashMap<String, Object>(2);
map.put("age", 22);
map.put("name", "joel");
tuple.set(0, map);
This is useful especially if you have to do the parsing during loading anyway.

Related

Find records where length of array equal to - Rails 4

In my Room model, I have an attribute named available_days, which is being stored as an array.
For example:
Room.first.available_days
=> ["wed", "thurs", "fri"]
What is the best way to find all Rooms where the size of the array is equal to 3?
I've tried something like
Room.where('LENGTH(available_days) = ?', 3)
with no success.
Update: the data type for available_days is a string, but in order to store an array, I am serializing the attribute from my model:
app/models/room.rb
serialize :available_days
Can't think of a purely sql way of doing it for sqlite since available_days is a string.
But here's one way of doing it without loading all records at once.
rooms = []
Room.in_batches(of: 10).each_record do |r|
rooms << r if r.available_days.length == 3
end
p rooms
If you're using postgres you can parse the serialized string to an array type, then query on the length of the array. I expect other databases may have similar approaches. How to do this depends on how the text is being serialized, but by default for Rails 4 should be YAML, so I expect you data is encoded like this:
---
- first
- second
The following SQL will remove the leading ---\n- as well as the final newline, then split the remaining string on - into an array. It's not strictly necessary to cleanup the extra characters to find the length, but if you want to do other operations you may find it useful to have a cleaned up array (no leading characters or trailing newline). This will only work for simple YAML arrays and simple strings.
Room.where("ARRAY_LENGTH(STRING_TO_ARRAY(RTRIM(REPLACE(available_days,'---\n- ',''),'\n'), '\n- '), 1) = ?", 3)
As you can see, this approach is rather complex. If possible you may want to add a new structured column (array or jsonb) and migrate the serialized string into the a typed column to make this easier and more performant. Rails supports jsonb serialization for postgres.

Parse streaming JSON in Objective C

I am using JSON-RPC over TCP, the problem is that I could not find any JSON parse capable of parsing multiple JSON objects correctly, and it would be relatively hard to split it, since there is no delimiter used.
Anyone knows a way how I could handle i.e. this:
{"foo":false, "bar: true, "baz": "cool"}{"ba
Somehow I need to split it so I end up just with the first, complete JSON object. The remaining string needs to stay in buffer until I have enough data to parse it properly.
XBMC's JSON-RPC doc does give a hint:
As such, your client needs to be able to deal with this, eg. by counting and matching curly braces ({}).
Update: As Jody Hagins pointed out, beware of curly braces inside JSON strings when using this approach.
Another possible and probably much better solution would be using a streaming JSON parser like yajl (or its Objective-C wrapper yajl-objc). You can feed the parser with data until it says the current object is done and then restart parsing.
#ePirat, if someone just concatenates multiple JSON dictionaries without delimiters, they should be shot.
For parsing: JSONSerialization parses NSData which could come in any encoding. Fortunately, if you have multiple JSON dictionaries concatenated, they are quite easy to take apart. All you need is looking at the bytes and check for the characters \ " { and }.
If you find a { then increase the counter for "open brackets".
If you find a } then decrease the counter for "open brackets". If the counter is at zero, you've found the end of a dictionary.
If you find a ", then repeatedly look at the next character. If the next character is a " then skip it and go to the normal processing (you've found the end of a string). If the next character is a \ then skip that character and the following character. If the next character is anything else, skip it.
If you reach the end of the data, then your JSON data is incomplete. You could remember which state you were in (count of open brackets, whether you are parsing a string, and if parsing a string whether you just encountered a backlash character) and continue right where you left off.
No need to convert the NSData to a string until you've separated it into dictionaries. If you suspect that you might be given UTF-16 or UTF-32, check whether bytes 0, 1, 2 or 1, 2, 3 are zero (UTF-32), then check whether bytes 0 and 2 or 1 and 3 are zero (UTF-16). But in that case, if the server sends non-standard JSON in UTF-16 or UTF-32, change "the person responsible should be shot" to "the person responsible must be shot".

Can we do string manipulation and conditional check in smooks?

I want to manipulate a large text file, which is coming as TEXT and want to use smooks to manipulate it. The text file contains large number of lines. And from each line, i have to split the characters and get information out of that.
Eg: i do following in java;
row.substring(0, 4)
row.substring(4, 64)
I have to convert the text content to CSV file.
Can we do exact same string manipulation in smooks too? (that is in smooks configuration can i do that?) I believe i can use Fixed Length processing for that?
How to add IF ELSE condition in smooks configuration?
Like in java;
if (row.length() == 900) {
//DO
}else(){
//DO
}
We can do string manipulation using fixed length reader[1]. but still i do not find a way to do condition check.
Eg: if /else
[1]http://www.smooks.org/mediawiki/index.php?title=V1.4:Smooks_v1.4_User_Guide#XML
If the format does not fit the flatfile reader, then you might be able to use the regex reader: https://github.com/smooks/smooks/tree/v1.5.1/smooks-examples/flatfile-to-xml-regex/
As for the conditional stuff... you really need to bind the data fragments into a Java model of some sort (real or virtual) and then conditionally process those fragments by either adding elements on the visitors being applied, or process the fragments by routing them to another process that processes them in parallel, which is a far better way of processing a huge data stream.

Yaml-cpp parser doesn't handle key:value fragment correctly

today I found following strange behavior in yaml-cpp library.
Following yaml fragment:
- { a: b }
is correctly parsed as key:value element with key=a and value=b. But when I updated fragment to this:
- { a:b }
fragment is evaluated as scalar value "a:b".
Is this correct behavior? And is there a simply way how to force parser to evaluate this fragment as key:value ?
Thanks!
This is correct behavior. From the YAML spec:
Normally, YAML insists the “:” mapping value indicator be separated from the value by white space. A benefit of this restriction is that the “:” character can be used inside plain scalars, as long as it is not followed by white space. This allows for unquoted URLs and timestamps. It is also a potential source for confusion as “a:1” is a plain scalar and not a key: value pair.
...
To ensure JSON compatibility, if a key inside a flow mapping is JSON-like, YAML allows the following value to be specified adjacent to the “:”. This causes no ambiguity, as all JSON-like keys are surrounded by indicators.
For example, you could write:
- { "a":b }
However, as they point out, this isn't very readable; stick with putting a space after the colon.

JSON string containing regular expression as data

I am reading JSON code from the database and then parsing the string using json parsers available for java. But I am getting JSONexception. Even if I try to parse this string on an online parser http://json.parser.online.fr/ there also the strings are taken as errors. Is there a way out to get rid of these errors or in other words how can I take care of such special symbols. The value of match is a regular expression.
Here is subpart of the sample string I am trying to parse as a json object.
{"RULE":[{"replace":{"value":"","type":"text"},"match":{"value":"<a [^>]*><img src="[^"]*WindowsLiveWriter/IconsfordifferentSocialBookmarkingSites[^>]*>\s*</a>","type":"text"}},{"replace":{"value":"","type":"text"},"match":{"value":"<a [^>]*><img src="[^"]*WindowsLiveWriter/IconsfordifferentSocialBookmarkingSites[^>]*>\s*</a>","type":"text"}}]}
use this json
{"RULE":[{"replace":{"value":"","type":"text"},"match":{"value":"<a [^>]*><img src=\"[^\"]*WindowsLiveWriter/IconsfordifferentSocialBookmarkingSites[^>]*>\\s*</a>","type":"text"}},{"replace":{"value":"","type":"text"},"match":{"value":"<a [^>]*><img src=\"[^\"]*WindowsLiveWriter/IconsfordifferentSocialBookmarkingSites[^>]*>\\s*</a>","type":"text"}}]}