Does Pig support LOAD with no delimiter? - apache-pig

I'd like to load a lot of small files from HDFS with Pig and process them as tuples (filename, filecontent).
a=LOAD 'mydir' USING PigStorage('','-tagPath') AS (filepath:chararray, filecontents:chararray);
However it seems like I cannot omit specifying the delimiter. Is there some sort of a "NULL" in Pig or is there any other way to make sure the content of the file will not be split?

You will have to write your own custom loader by extending LoadFunc.
The short answer to your question is no. To make sure the content is not split, use a delimiter that does not occur in the content; that way the whole content is loaded into the field filecontents:chararray. So, assuming your input files do not contain the character '~':
a=LOAD 'mydir' USING PigStorage('~','-tagPath') AS (filepath:chararray, filecontents:chararray);
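If no safe delimiter exists because the files may contain any character, the custom-loader route from the first answer is the robust fix. Below is a minimal sketch of such a loader, assuming the files are small enough to buffer in memory; the names WholeFileLoader and WholeFileRecordReader are illustrative, not part of Pig. It emits one (filepath, filecontents) tuple per input file and never splits on content:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class WholeFileLoader extends LoadFunc {
    private WholeFileRecordReader reader;
    private final TupleFactory tuples = TupleFactory.getInstance();

    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location);
    }

    @Override
    public InputFormat getInputFormat() {
        // One record per file; isSplitable() = false keeps each file in one map task.
        return new FileInputFormat<Text, Text>() {
            @Override
            protected boolean isSplitable(JobContext ctx, Path file) { return false; }

            @Override
            public RecordReader<Text, Text> createRecordReader(InputSplit split,
                    TaskAttemptContext ctx) {
                return new WholeFileRecordReader();
            }
        };
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) {
        this.reader = (WholeFileRecordReader) reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        if (!reader.nextKeyValue()) {
            return null;                                   // no more files
        }
        Tuple t = tuples.newTuple(2);
        t.set(0, reader.getCurrentKey().toString());       // file path
        t.set(1, reader.getCurrentValue().toString());     // whole file body
        return t;
    }

    // Reads one whole (unsplit) file into a single key/value pair.
    static class WholeFileRecordReader extends RecordReader<Text, Text> {
        private FileSplit split;
        private Configuration conf;
        private final Text key = new Text();
        private final Text value = new Text();
        private boolean done = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext ctx) {
            this.split = (FileSplit) split;
            this.conf = ctx.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (done) {
                return false;
            }
            Path file = split.getPath();
            // Files are assumed small enough to buffer in memory.
            byte[] buf = new byte[(int) split.getLength()];
            FSDataInputStream in = file.getFileSystem(conf).open(file);
            try {
                IOUtils.readFully(in, buf, 0, buf.length);
            } finally {
                IOUtils.closeStream(in);
            }
            key.set(file.toString());
            value.set(buf, 0, buf.length);
            done = true;
            return true;
        }

        @Override public Text getCurrentKey() { return key; }
        @Override public Text getCurrentValue() { return value; }
        @Override public float getProgress() { return done ? 1.0f : 0.0f; }
        @Override public void close() {}
    }
}
Build it into a jar and register it in the script (the jar name is illustrative):
REGISTER wholefileloader.jar;
a = LOAD 'mydir' USING WholeFileLoader() AS (filepath:chararray, filecontents:chararray);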

Related

Multi-line text in a .env file

In Vue, is there a way to have a value span multiple lines in an .env file? Ex:
Instead of:
someValue=[{"someValue":"Here is a really really long piece which should be split into multiple lines"}]
I want to do something like:
someValue=`[{"someValue":"Here is a really
really long piece which
should be split into multiple lines"}]`
Doing the latter gives me a JSON parsing error if I try to do JSON.parse(someValue) in my code.
I don't know if this will work, but I can't format a comment well enough to get the point across, so see if this works:
someValue=[{"someValue":"Here is a really\
really long piece which\
should be split into multiple lines"}]
Where "\" should escape the newline similar to how you can write long bash commands while escaping the newline. I'm not certain the .env interpreter will support it though.
EDIT
Looks like this won't work. This syntax was actually proposed, but I don't think it was incorporated. See motdotla/dotenv#333 (which is what Vue uses to parse .env).
Like #zero298 said, this isn't possible. You could instead delimit the entry with a character that wouldn't normally show up in the text (^ is a good candidate), then restore it within the application using someValue.replace(/\^/g, '\n') (note that passing a string as the first argument would replace only the first occurrence).
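A minimal sketch of that workaround, assuming a Vue CLI setup (which only exposes variables prefixed with VUE_APP_ to client code; the variable name is illustrative):
# .env -- newlines encoded as '^' so the value stays on one line
VUE_APP_SOME_VALUE=[{"someValue":"Here is a really^really long piece^which should span multiple lines"}]
// in application code: parse the single-line JSON first, then restore the newlines
const data = JSON.parse(process.env.VUE_APP_SOME_VALUE);
const text = data[0].someValue.replace(/\^/g, '\n');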

Load raw data from a file without dropping backslash characters

I have a file that contains the following content (simplified version that demonstrates the problem):
"abc\"def"
I would like to load the literal content of the file into a table without any mangling of the data. Here is what I am currently doing:
CREATE TABLE file_content (content text);
COPY file_content FROM '/path/to/test.txt';
The resulting line in the table is:
"abc"def"
In other words, the backslash was silently dropped/ignored. I've tried the copy with different encodings (UTF8, LATIN1, SQL_ASCII) without any change in behavior.
Also, the ESCAPE and QUOTE options seemed promising at first, but they are only for COPY ... TO.
Is there a way to load raw data from a file without the mangling? I'm using PostgreSQL version 9.4.6.
You need to change \ to \\. You can use sed for that:
sed -i -- 's/\\/\\\\/g' import.file
Please make sure you have reviewed your data and backed it up before performing the operation above.
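To see why the doubling works with the sample line above: after the sed pass the file contains
"abc\\"def"
and COPY's text format, which treats backslash as its escape character, un-escapes \\ back to a single \, so the stored row is the original "abc\"def".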

Apache Pig filtering out carriage returns

I'm fairly new to Apache Pig and trying to work with some fixed-width text. In Pig, I'm reading every line in as a chararray (I know I can use FixedWidthLoader, but am not in this instance). One of the fields I'm working with is an email field, and one entry has a carriage return that generates extra lines of output in the finished data dump (I get 12 rows instead of the 9 I'm expecting). I know which entry has the error, but I'm unable to filter it out using Pig.
Thus far I've tried to use Pig's REPLACE to replace on \r or \uFFFD, and even tried a Python UDF which works on the command line but not when I run it as a UDF through Pig. Anyone have any suggestions? Please let me know if more details are required.
My original edit with a solution turned out to work only part of the time. In the end I had to clean the data before running it through Pig: on the raw data file I ran perl -i -pe 's/\r//g' filename to remove the rogue carriage return.

How to escape delimiter found in value - pig script?

In a Pig script, I would like to find a way to escape the delimiter character in my data so that it doesn't get interpreted as extra columns. For example, if I'm using a colon as the delimiter and I have a column with the value "foo:bar", I want that string interpreted as a single column, without the loader picking up the colon in the middle.
You can try http://pig.apache.org/docs/r0.12.0/func.html#regex-extract-all
A = LOAD 'somefile' AS (s:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(s, '(.*) : (.*)'));
The regex might have to be adapted.
It seems Pig takes the input as a plain string; it is not intelligent enough to identify which part is data and which is a delimiter.
PigStorage works as a simple tokenizer, so something like
a = LOAD '/abc/def/file.txt' USING PigStorage(':');
doesn't solve your problem. But if we write our own PigStorage()-style method, we could possibly come across a solution.
I will try posting the code to resolve this.
You can use STRSPLIT(string, regex, limit) to split the column on the delimiter.
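A minimal sketch of that approach, assuming the embedded delimiter can only appear in the last field so a split limit protects it (field names are illustrative):
a = LOAD 'somefile' AS (s:chararray);
-- split on ':' at most once: 'key:foo:bar' becomes ('key', 'foo:bar')
b = FOREACH a GENERATE FLATTEN(STRSPLIT(s, ':', 2)) AS (k:chararray, v:chararray);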

Reading a large-single XML line to a variable using Batch Script

I have an XML file which contains only a single line, but the problem is that the line is very long, so it seems I can't store it in a variable.
What I want is this:
given tag1, tag2, ..., tag900, I want to break each tag onto its own line, as follows:
tag1
tag2
tag3
......
tag900
Do not attempt to do this using native batch. It will be extremely difficult, and any solution will be very slow.
The problem is native batch cannot read lines > 8k, and batch does not have a good way to read partial lines.
There is a method that creates a test file with size >= your file, consisting of a single repeated character. A binary file compare (FC /B) is then done, and the results are parsed character by character, expressed as hex codes. It's a bit more complex than that, but I don't think you want to go there.
The only other option is to use SET /P to read in 1021 chars at a time, and then parse and piece things together. But this is unproven, and again, I don't think worth the effort.
If you want to use a native scripting language, then I suggest VBScript or JScript. (Perhaps PowerShell, but I don't really know much about its capabilities.)
You could download a Unix text processing tool like sed that has been ported to Windows.
I don't do much with XML, but I've got to believe there is a free tool geared specifically for XML that would make your job fairly easy.
Basically, use anything except batch! (this is coming from someone whose hobby is solving problems with batch)
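For example, with a GNU sed port for Windows (a sketch; it inserts a line break between every pair of adjacent tags, which approximates the tag-per-line output asked for above):
sed "s/></>\n</g" input.xml > output.xml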