Apache Pig filtering out carriage returns

I'm fairly new to Apache Pig and trying to work with some fixed-width text. In Pig, I'm reading every line in as a chararray (I know I could use FixedWidthLoader, but I'm not in this instance). One of the fields I'm working with is an email field, and one entry has a carriage return that generates extra lines of output in the finished data dump (I see 12 rows instead of the 9 I'm expecting). I know which entry has the error, but I'm unable to filter it out using Pig.
Thus far I've tried Pig's REPLACE to replace on \r or \uFFFD, and I even tried a Python UDF that works on the command line but not when I run it as a UDF through Pig. Does anyone have any suggestions? Please let me know if more details are required.
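For reference, a minimal sketch of the kind of Python (Jython) UDF the asker describes; the file, function, and field names here are hypothetical. In Pig it would be registered with something like REGISTER 'strip_cr.py' USING jython AS udfs; and called as udfs.strip_cr(email):
# strip_cr.py -- when the script is registered USING jython,
# Pig provides the @outputSchema decorator (no import needed).
@outputSchema('cleaned:chararray')
def strip_cr(value):
    # Drop carriage returns that would otherwise split one logical
    # record across several physical output lines.
    if value is None:
        return None
    return value.replace('\r', '')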

My original edit with a solution turned out to work only part of the time. This time I had to clean the data before I ran it through Pig. On the raw data file I ran perl -i -pe 's/\r//g' filename to remove the rogue carriage returns.
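For anyone without perl handy, an equivalent cleanup pass in Python (the filename is a placeholder, as in the one-liner above):
# Rewrite the file with all carriage returns removed --
# the same effect as perl -i -pe 's/\r//g' filename.
with open('filename', 'rb') as f:
    data = f.read()
with open('filename', 'wb') as f:
    f.write(data.replace(b'\r', b''))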

Related

Combining SQL files with the command `copy` in a batch file introduces a syntax error because it adds an invisible character `U+FEFF`

In a pre-build event, a batch file is executed to combine multiple SQL files into a single one.
It is done using this command:
COPY %#ProjectDir%\Migrations\*.sql %#ProjectDir%ContinuousDeployment\AllFilesMergedTogether.sql
Everything appears to work fine, but somehow the result gives a syntax error.
After two hours of investigation, it turned out the issue is caused by an invisible character that remains invisible even in Notepad++.
Using an online website, the character was identified as U+FEFF (the byte order mark).
Here are the two input scripts.
PRINT 'Script1'
PRINT 'Script2'
Here is the output given by the copy command (the U+FEFF before the second PRINT is invisible).
PRINT 'Script1'
PRINT 'Script2'
Additional info:
Batch file is encoded with UTF-8
Input files are encoded with UTF-8-BOM
Output file is encoded with UTF-8-BOM.
I'm not sure it is possible to change the output encoding of the copy command.
I've tried and failed.
What can be done to eradicate this extremely frustrating parasitic character?
It turned out that changing the encoding of the input files to ANSI fixes the issue.
No more pesky character(s).
Also, doing so changes the encoding of the result file to UTF-8 instead of UTF-8-BOM, which is great, I believe.
The encoding can be changed using Notepad++.
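If re-encoding every input by hand is impractical, the merge could instead be done by a small script that decodes each file with utf-8-sig (which consumes a leading U+FEFF) and writes plain UTF-8. A minimal Python sketch, assuming the relative paths from the question:
import glob

# Decode each input with utf-8-sig so any leading BOM (U+FEFF)
# is consumed, then write the concatenation as plain UTF-8.
with open('ContinuousDeployment/AllFilesMergedTogether.sql', 'w', encoding='utf-8') as out:
    for path in sorted(glob.glob('Migrations/*.sql')):
        with open(path, encoding='utf-8-sig') as src:
            out.write(src.read())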

Unable to Remove Special Characters In Pig

I have a text file that I want to load into my Pig engine.
The text file has names in it, one per row, but the data has errors in it: special characters. Something like this:
Ja##$s000on
J##a%^ke
T!!ina
Mel#ani
I want to remove the special characters from all the names using a regex in Pig, so that the output finally looks like this:
Jason
Jake
Tina
Melani
Can someone please tell me the regex that will do this job in Pig?
Also, please write the command that will do it, as I am unable to use the REGEX_EXTRACT and REGEX_EXTRACT_ALL functions.
Also, can someone explain the significance of the number 1 that we pass to this function as an argument after defining the regex?
Any help would be highly appreciated.
You can use REPLACE with a regex to solve this problem.
input.txt
Ja##$s000on
J##a%^ke T!!ina Mel#ani
PigScript:
A = LOAD 'input.txt' AS (line:chararray);
B = FOREACH A GENERATE REPLACE(line,'([^a-zA-Z\\s]+)','');
dump B;
Output:
(Jason)
(Jake Tina Melani)
There is no way to escape these characters when they are part of the values in a tuple, bag, or map, but there is no problem whatsoever in loading these characters as part of a string. Just specify that field as type chararray.
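On the side question about the number 1: in Pig's REGEX_EXTRACT(string, regex, index), the index picks which parenthesized capture group of the regex to return, and 1 means the first group. The numbering works the same as capture groups in Python's re module, illustrated here with a hypothetical host:port example:
import re

# Group 1 is the first parenthesized part of the pattern (the host),
# group 2 the second (the port) -- the same numbering REGEX_EXTRACT uses.
m = re.match(r'(.*):(.*)', '192.168.1.5:8020')
print(m.group(1))  # 192.168.1.5
print(m.group(2))  # 8020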

Removing/handling newlines in a simple text import class

I have an input file that I want to use the string SPLIT function on, line by line, depending on the Type field. However, the description field sometimes has newlines in it, which breaks my file reader since it uses StreamReader's ReadLine() function.
Handled:
Type|Name|User|Description
Type|Name|User|Description
Unhandled:
Type|Name|User|Description line 1
Description Line 2
Type|Name|User|Description
Short of validating the 'Type' field on each line and reading ahead until the next Type field appears, are there any ways folks can come up with to properly read this file?
My solution was to have the file maker replace newline characters in their description field with another unique character that I can later add back in. I'm still interested in solutions from the file reader's perspective, though.
I know I'm talking to myself a lot here, but I found another solution, which is to remove line feeds, since the output file creator wrote out carriage returns for each line.
You could easily set up a conditional statement to check whether the split array contains more than one element, which would indicate that it's a line you want to parse.
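A sketch of that conditional approach in Python (just an illustration, since the original reader is presumably C#), assuming four pipe-delimited fields and that embedded newlines occur only in the trailing Description field:
def read_records(path, n_fields=4, sep='|'):
    # A physical line starts a new record only if it contains enough
    # separators for a full record; anything shorter is treated as a
    # continuation of the previous record's Description.
    record = None
    with open(path) as f:
        for line in f:
            line = line.rstrip('\r\n')
            if record is not None and line.count(sep) < n_fields - 1:
                record += ' ' + line
            else:
                if record is not None:
                    yield record.split(sep, n_fields - 1)
                record = line
    if record is not None:
        yield record.split(sep, n_fields - 1)
Each yielded record is a list of n_fields strings, with the rejoined description last.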

Can I split a command over multiple lines in Apache Pig Latin?

I have some very long lines as Apache Pig (Latin) expressions. Is there a way of splitting these over multiple lines?
I've tried a trailing backslash to no avail; as soon as I press Enter, the (incomplete) command executes...
I'm writing at a Pig 0.8 grunt shell, and when I just hit return it gives me a continuation prompt ">>". It doesn't appear to have any problem splitting individual statements over multiple lines in interactive mode, which is what it sounded like you were asking about. You can also feed a script with multiline statements to Pig via STDIN.
I can't test it right now, but if memory serves me well, the underscore character "_" should help you split your command over multiple lines.
The beginning of my very very very very _
very very very very very very very very _
very very very long command

Is there a tool to clean the output of the script(1) tool?

script(1) is a tool for keeping a record of an interactive terminal session; by default it writes to the file transcript. My problem is that I use ksh93, which has readline features, and so the transcript is mucked up with all sorts of terminal escape sequences and it can be very difficult to reconstruct the command that was actually executed. Not to mention the stray ^M's and the like.
I'm looking for a tool that will read a transcript file written by script, remove all the junk, and reconstruct what the shell thought it was executing, so I have something that shows $PS1 and the commands actually executed. Failing that, I'm looking for suggestions on how to write such a tool, ideally using knowledge from the terminfo database, or failing that, just using ANSI escape sequences.
A cheat that looks in shell history, as long as it really really works, would also be acceptable.
Doesn't cat/more work by default for browsing the transcript? Do you intend to create a script out of the commands actually executed (which in my experience can be dangerous)?
Anyway, 3 years without an answer, so I will give it a shot with an incomplete solution. If you are only interested in the commands actually typed, remove the non-printable characters, then replace PS1' with something readable and unique, and grep for that unique string. Like this:
$ sed -i 's/[^[:print:]]//g' transcript
$ sed 's/]0;cartman@southpark: ~cartman@southpark:~/CARTMAN/g' transcript | grep CARTMAN
Explanation: after the first sed, PS1' can be taken from one of the first few lines of the transcript file, as is -- PS1' is different from PS1 -- and can be replaced with a unique readable string ("CARTMAN" here). Note that the dollar sign at the end of the prompt was left out intentionally.
In the few examples that I tried, this didn't solve everything but took care of most issues.
This is essentially the same question asked recently in Can I programmatically “burn in” ANSI control codes to a file using unix utils? -- removing all nonprinting characters will not fix:
embedded escape sequences
backspace/overstriking for underlining
use of carriage-returns for overstriking
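With those caveats in mind (backspace and carriage-return overstriking need their own handling), here is a minimal Python sketch that goes one step further than stripping non-printables: it removes whole CSI and OSC escape sequences plus stray carriage returns. The transcript filename is assumed, as in the sed pass above:
import re

# CSI sequences (ESC [ ... final byte) and OSC sequences
# (ESC ] ... terminated by BEL or ESC \), the bulk of readline's output.
ANSI = re.compile(r'\x1b\[[0-9;?]*[ -/]*[@-~]|\x1b\][^\x07\x1b]*(?:\x07|\x1b\\)')

with open('transcript', errors='replace') as f:
    for line in f:
        # Drop escape sequences first, then stray carriage returns.
        print(ANSI.sub('', line).replace('\r', ''), end='')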