strsplit issue - Pig - apache-pig

I have following tuple H1 and I want to strsplit its $0 into tuple.However I always get an error message:
DUMP H1:
(item32;item31;,1)
m = FOREACH H1 GENERATE STRSPLIT($0, ";", 50);
ERROR 1000: Error during parsing. Lexical error at line 1, column 40.
Encountered: after : "\";"
Anyone knows what's wrong with the script?

There is an escaping problem in the pig parsing routines when it encounters this semicolon.
You can use a unicode escape sequence for a semicolon: \u003B. However this must also be slash escaped and put in a single quoted string. Alternatively, you can rewrite the command over multiple lines, as per Neil's answer. In all cases, this must be a single quoted string.
H1 = LOAD 'h1.txt' as (splitme:chararray, name);
A1 = FOREACH H1 GENERATE STRSPLIT(splitme,'\\u003B'); -- OK
B1 = FOREACH H1 GENERATE STRSPLIT(splitme,';'); -- ERROR
C1 = FOREACH H1 GENERATE STRSPLIT(splitme,':'); -- OK
D1 = FOREACH H1 { -- OK
splitup = STRSPLIT( splitme, ';' );
GENERATE splitup;
}
A2 = FOREACH H1 GENERATE STRSPLIT(splitme,"\\u003B"); -- ERROR
B2 = FOREACH H1 GENERATE STRSPLIT(splitme,";"); -- ERROR
C2 = FOREACH H1 GENERATE STRSPLIT(splitme,":"); -- ERROR
D2 = FOREACH H1 { -- ERROR
splitup = STRSPLIT( splitme, ";" );
GENERATE splitup;
}
Dump H1;
(item32;item31;,1)
Dump A1;
((item32,item31))
Dump C1;
((item32;item31;))
Dump D1;
((item32,item31))

STRSPLIT on a semi-colon is tricky. I got it to work by putting it inside of a block.
raw = LOAD 'cname.txt' as (name,cname_string:chararray);
xx = FOREACH raw {
cname_split = STRSPLIT(cname_string,';');
GENERATE cname_split;
}
Funny enough, this is how I originally implemented my STRSPLIT() command. Only after trying to get it to split on a semicolon did I run into the same issue.

Related

How to deal with pipe as data in pipe delimited file

I am trying to solve a problem where my csv data looks like below:
A|B|C
"Jon"|"PR | RP"|"MN"
"Pam | Map"|"Ecom"|"unity"
"What"|"is"this" happening"|"?"
That is, it is pipe delimited and has quotes as text qualifier but it also has pipe and quotes with in the data values. I have already tried
Update based on the comments
I tried to select | as delimiter and " as Text Qualifier but when trying to import data to OLEDB Destination i receive the following error:
couldn't find column delimiter for column B
You have to change the Column Delimiter property to | (vertical bar) and the Text Qualifier property to " within the Flat File Connection Manager
If these is still not working then you have some bad rows in the Flat File Source which you must handle using the Error Output:
Flat File source Error Output connection in SSIS
SQL SERVER – SSIS Component Error Outputs
I actually ended up writing a c sharp script to remove the initial and last quote and set the column delimiter to quote pipe quote ("|") in SSIS. Code was as below:
public void Main()
{
String folderSource = "path";
String folderTarget = "path";
foreach (string file in System.IO.Directory.GetFiles(folderSource))
{
String targetfilepath = folderTarget + System.IO.Path.GetFileName(file);
System.IO.File.Delete(targetfilepath);
int icount = 1;
foreach (String row in System.IO.File.ReadAllLines(file))
{
if (icount == 1)
{
System.IO.File.AppendAllText(targetfilepath, row.Replace("|", "\"|\""));
}
else
{
System.IO.File.AppendAllText(targetfilepath, row.Substring(1, row.Length - 2));
}
icount = icount + 1;
System.IO.File.AppendAllText(targetfilepath, Environment.NewLine);
}
}
Dts.TaskResult = (int)ScriptResults.Success;
}

Identifying columns through PiG

I have data set like below :
"column,1A",column2A,column3A
"column,1B",column2B,column3B
"column,1C",column2C,column3C
"column,1D",column2D,column3D
What separator I should be using in this case to separate out above 3 columns.
First column value is => Column,1A
Second column value is => Column2A
Third column value is => Column3A
Let be try my code:
a = LOAD '/home/hduser/pig_ex' USING PigStorage(',') AS (col1,col2,col3,col4);
b = FOREACH a GENERATE REGEX_EXTRACT(col1, '^\\"(.*)', 1) AS (modfirstcol),REGEX_EXTRACT(col2, '^(.*)\\"', 1) AS (modsecondcol),col3,col4;
c = foreach b generate CONCAT($0, CONCAT(', ', $1)), $2 , $3;
dump c;
I am able to resolve it using the below steps:
Input:-
"column,1A",column2A,column3A
"column,1B",column2B,column3B
"column,1C",column2C,column3C
"column,1D",column2D,column3D
PiG Script :-
A = load '/home/hduser/pig_ex' AS line;
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,'\\,',4)) AS (firstcol:chararray,secondcol:chararray,thirdcol:chararray,forthcol:chararray);
C = FOREACH B GENERATE REGEX_EXTRACT(firstcol, '^\\"(.*)', 1) AS (modfirstcol),REGEX_EXTRACT(secondcol, '^(.*)\\"', 1) AS (modsecondcol),thirdcol,forthcol;
D = FOREACH C GENERATE CONCAT(modfirstcol,',',modsecondcol),thirdcol,forthcol;
DUMP D;
Output :-
(column,1A,column2A,column3A)
(column,1B,column2B,column3B)
(column,1C,column2C,column3C)
(column,1D,column2D,column3D)
Please let me know if there is any better way

Regex to extract first part of string in Apache Pig

I need to extract post code district from the input data below
AB55 4
DD7 6LL
DD5 2HI
My Code
A = load 'data' as postcode:chararray;
B = foreach A {
code_district = REGEX_EXTRACT(postcode,'<SOME EXP>',1);
generate code_district;
};
dump B;
Output should look like
AB55
DD7
DD5
what should be the regular expression to extract the first part of the string?
Can you try the below Regex?
Option1:
A = LOAD 'input' as postcode:chararray;
code_district = FOREACH A GENERATE REGEX_EXTRACT(postcode,'(\\w+).*',1);
DUMP code_district;
Option2:
A = LOAD 'input' as postcode:chararray;
code_district = FOREACH A GENERATE REGEX_EXTRACT(postcode,'([a-zA-Z0-9]+).*',1);
DUMP code_district;
Output:
(AB55)
(DD7)
(DD5)

How can I generate schema from text file? (Hadoop-Pig)

Somehow i got filename.log which looks like for example (tab separated)
Name:Peter Age:18
Name:Tom Age:25
Name:Jason Age:35
because the value of key column may differ i cannot define schema when i load text like
a = load 'filename.log' as (Name:chararray,Age:int);
Neither do i want to call column by position like
b = foreach a generate $0,$1;
What I want to do is, from only that filename.log, to make it possible to call each value by key, for example
a = load 'filename.log' using PigStorage('\t');
b = group b by Name;
c = foreach b generate group, COUNT(b);
dump c;
for that purpose, i wrote some Java UDF which seperate key:value and get value for every field in tuple as below
public class SPLITALLGETCOL2 extends EvalFunc<Tuple>{
#Override
public Tuple exec(Tuple input){
TupleFactory mTupleFactory = TupleFactory.getInstance();
ArrayList<String> mProtoTuple = new ArrayList<String>();
Tuple output;
String target=input.toString().substring(1, input.toString().length()-1);
String[] tokenized=target.split(",");
try{
for(int i=0;i<tokenized.length;i++){
mProtoTuple.add(tokenized[i].split(":")[1]);
}
output = mTupleFactory.newTupleNoCopy(mProtoTuple);
return output;
}catch(Exception e){
output = mTupleFactory.newTupleNoCopy(mProtoTuple);
return output;
}
}
}
How should I alter this method to get what I want? or How should I write other UDF to get there?
Whatever you do, don't use a tuple to store the output. Tuples are intended to store a fixed number of fields, where you know what every field contains. Since you don't know that the keys will be in Name,Age form (or even exist, or that there won't be more) you should use a bag. Bags are unordered sets of tuples. They can have any number of tuples in them as long as the tuples have the same schema. These are all valid bags for the schema B: {T:(key:chararray, value:chararray)}:
{(Name,Foo),(Age,Bar)}
{(Age,25),(Name,Jim)}
{(Name,Bob)}
{(Age,30),(Name,Roger),(Hair Color,Brown)}
{(Hair Color,),(Name,Victor)} -- Note the Null value for Hair Color
However, it sounds like you really want a map:
myudf.py
#outputSchema('M:map[]')
def mapize(the_input):
out = {}
for kv in the_input.split(' '):
k, v = kv.split(':')
out[k] = v
return out
myscript.pig
register '../myudf.py' using jython as myudf ;
A = LOAD 'filename.log' AS (total:chararray) ;
B = FOREACH A GENERATE myudf.mapize(total) ;
-- Sample usage, grouping by the name key.
C = GROUP B BY M#'Name' ;
Using the # operator you can pull out all values from the map using the key you give. You can read more about maps here.

Regular Expression to Match All Comments in a T-SQL Script

I need a Regular Expression to capture ALL comments in a block of T-SQL. The Expression will need to work with the .Net Regex Class.
Let's say I have the following T-SQL:
-- This is Comment 1
SELECT Foo FROM Bar
GO
-- This is
-- Comment 2
UPDATE Bar SET Foo == 'Foo'
GO
/* This is Comment 3 */
DELETE FROM Bar WHERE Foo = 'Foo'
/* This is a
multi-line comment */
DROP TABLE Bar
I need to capture all of the comments, including the multi-line ones, so that I can strip them out.
EDIT: It would serve the same purpose to have an expression that takes everything BUT the comments.
This should work:
(--.*)|(((/\*)+?[\w\W]+?(\*/)+))
In PHP, i'm using this code to uncomment SQL (this is the commented version -> x modifier) :
trim( preg_replace( '#
(([\'"]).*?[^\\\]\2) # $1 : Skip single & double quoted expressions
|( # $3 : Match comments
(?:\#|--).*?$ # - Single line comment
| # - Multi line (nested) comments
/\* # . comment open marker
(?: [^/*] # . non comment-marker characters
|/(?!\*) # . not a comment open
|\*(?!/) # . not a comment close
|(?R) # . recursive case
)* # . repeat eventually
\*\/ # . comment close marker
)\s* # Trim after comments
|(?<=;)\s+ # Trim after semi-colon
#msx', '$1', $sql ) );
Short version:
trim( preg_replace( '#(([\'"]).*?[^\\\]\2)|((?:\#|--).*?$|/\*(?:[^/*]|/(?!\*)|\*(?!/)|(?R))*\*\/)\s*|(?<=;)\s+#ms', '$1', $sql ) );
Using this code :
StringCollection resultList = new StringCollection();
try {
Regex regexObj = new Regex(#"/\*(?>(?:(?!\*/|/\*).)*)(?>(?:/\*(?>(?:(?!\*/|/\*).)*)\*/(?>(?:(?!\*/|/\*).)*))*).*?\*/|--.*?\r?[\n]", RegexOptions.Singleline);
Match matchResult = regexObj.Match(subjectString);
while (matchResult.Success) {
resultList.Add(matchResult.Value);
matchResult = matchResult.NextMatch();
}
} catch (ArgumentException ex) {
// Syntax error in the regular expression
}
With the following input :
-- This is Comment 1
SELECT Foo FROM Bar
GO
-- This is
-- Comment 2
UPDATE Bar SET Foo == 'Foo'
GO
/* This is Comment 3 */
DELETE FROM Bar WHERE Foo = 'Foo'
/* This is a
multi-line comment */
DROP TABLE Bar
/* comment /* nesting */ of /* two */ levels supported */
foo...
Produces these matches :
-- This is Comment 1
-- This is
-- Comment 2
/* This is Comment 3 */
/* This is a
multi-line comment */
/* comment /* nesting */ of /* two */ levels supported */
Not that this will only match 2 levels of nested comments, although in my life I have never seen more than one level being used. Ever.
I made this function that removes all SQL comments, using plain regular expressons. It removes both line comments (even when there is not a linebreak after) and block comments (even if there are nested block comments). This function can also replace literals (useful if you are searching for something inside SQL procedures but you want to ignore strings).
My code was based on this answer (which is about C# comments), so I had to change line comments from "//" to "--", but more importantly I had to rewrite the block comments regex (using balancing groups) because SQL allows nested block comments, while C# doesn't.
Also, I have this "preservePositions" argument, which instead of stripping out the comments it just fills comments with whitespace. That's useful if you want to preserve the original position of each SQL command, in case you need to manipulate the original script while preserving original comments.
Regex everythingExceptNewLines = new Regex("[^\r\n]");
public string RemoveComments(string input, bool preservePositions, bool removeLiterals=false)
{
//based on https://stackoverflow.com/questions/3524317/regex-to-strip-line-comments-from-c-sharp/3524689#3524689
var lineComments = #"--(.*?)\r?\n";
var lineCommentsOnLastLine = #"--(.*?)$"; // because it's possible that there's no \r\n after the last line comment
// literals ('literals'), bracketedIdentifiers ([object]) and quotedIdentifiers ("object"), they follow the same structure:
// there's the start character, any consecutive pairs of closing characters are considered part of the literal/identifier, and then comes the closing character
var literals = #"('(('')|[^'])*')"; // 'John', 'O''malley''s', etc
var bracketedIdentifiers = #"\[((\]\])|[^\]])* \]"; // [object], [ % object]] ], etc
var quotedIdentifiers = #"(\""((\""\"")|[^""])*\"")"; // "object", "object[]", etc - when QUOTED_IDENTIFIER is set to ON, they are identifiers, else they are literals
//var blockComments = #"/\*(.*?)\*/"; //the original code was for C#, but Microsoft SQL allows a nested block comments // //https://msdn.microsoft.com/en-us/library/ms178623.aspx
//so we should use balancing groups // http://weblogs.asp.net/whaggard/377025
var nestedBlockComments = #"/\*
(?>
/\* (?<LEVEL>) # On opening push level
|
\*/ (?<-LEVEL>) # On closing pop level
|
(?! /\* | \*/ ) . # Match any char unless the opening and closing strings
)+ # /* or */ in the lookahead string
(?(LEVEL)(?!)) # If level exists then fail
\*/";
string noComments = Regex.Replace(input,
nestedBlockComments + "|" + lineComments + "|" + lineCommentsOnLastLine + "|" + literals + "|" + bracketedIdentifiers + "|" + quotedIdentifiers,
me => {
if (me.Value.StartsWith("/*") && preservePositions)
return everythingExceptNewLines.Replace(me.Value, " "); // preserve positions and keep line-breaks // return new string(' ', me.Value.Length);
else if (me.Value.StartsWith("/*") && !preservePositions)
return "";
else if (me.Value.StartsWith("--") && preservePositions)
return everythingExceptNewLines.Replace(me.Value, " "); // preserve positions and keep line-breaks
else if (me.Value.StartsWith("--") && !preservePositions)
return everythingExceptNewLines.Replace(me.Value, ""); // preserve only line-breaks // Environment.NewLine;
else if (me.Value.StartsWith("[") || me.Value.StartsWith("\""))
return me.Value; // do not remove object identifiers ever
else if (!removeLiterals) // Keep the literal strings
return me.Value;
else if (removeLiterals && preservePositions) // remove literals, but preserving positions and line-breaks
{
var literalWithLineBreaks = everythingExceptNewLines.Replace(me.Value, " ");
return "'" + literalWithLineBreaks.Substring(1, literalWithLineBreaks.Length - 2) + "'";
}
else if (removeLiterals && !preservePositions) // wrap completely all literals
return "''";
else
throw new NotImplementedException();
},
RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
return noComments;
}
Test 1 (first original, then removing comments, last removing comments/literals)
[select /* block comment */ top 1 'a' /* block comment /* nested block comment */*/ from sys.tables --LineComment
union
select top 1 '/* literal with */-- lots of comments symbols' from sys.tables --FinalLineComment]
[select top 1 'a' from sys.tables
union
select top 1 '/* literal with */-- lots of comments symbols' from sys.tables ]
[select top 1 ' ' from sys.tables
union
select top 1 ' ' from sys.tables ]
Test 2 (first original, then removing comments, last removing comments/literals)
Original:
[create table [/*] /*
-- huh? */
(
"--
--" integer identity, -- /*
[*/] varchar(20) /* -- */
default '*/ /* -- */' /* /* /* */ */ */
);
go]
[create table [/*]
(
"--
--" integer identity,
[*/] varchar(20)
default '*/ /* -- */'
);
go]
[create table [/*]
(
"--
--" integer identity,
[*/] varchar(20)
default ' '
);
go]
This works for me:
(/\*(.|[\r\n])*?\*/)|(--(.*|[\r\n]))
It matches all comments starting with -- or enclosed within */ .. */ blocks
I see you're using Microsoft's SQL Server (as opposed to Oracle or MySQL).
If you relax the regex requirement, it's now possible (since 2012) to use Microsoft's own parser:
using Microsoft.SqlServer.Management.TransactSql.ScriptDom;
...
public string StripCommentsFromSQL( string SQL ) {
TSql110Parser parser = new TSql110Parser( true );
IList<ParseError> errors;
var fragments = parser.Parse( new System.IO.StringReader( SQL ), out errors );
// clear comments
string result = string.Join (
string.Empty,
fragments.ScriptTokenStream
.Where( x => x.TokenType != TSqlTokenType.MultilineComment )
.Where( x => x.TokenType != TSqlTokenType.SingleLineComment )
.Select( x => x.Text ) );
return result;
}
See Removing Comments From SQL
The following works fine - pg-minify, and not only for PostgreSQL, but for MS-SQL also.
Presumably, if we remove comments, that means the script is no longer for reading, and minifying it at the same time is a good idea.
That library deletes all comments as part of the script minification.
I am using this java code to remove all sql comments from text. It supports comments like /* ... */ , --..., nested comments, ignores comments inside quoted strings
public static String stripComments(String sqlCommand) {
StringBuilder result = new StringBuilder();
//group 1 must be quoted string
Pattern pattern = Pattern.compile("('(''|[^'])*')|(/\\*(.|[\\r\\n])*?\\*/)|(--(.*|[\\r\\n]))");
Matcher matcher = pattern.matcher(sqlCommand);
int prevIndex = 0;
while(matcher.find()) {
// add previous portion of string that was not found by regexp - meaning this is not a quoted string and not a comment
result.append(sqlCommand, prevIndex, matcher.start());
prevIndex = matcher.end();
// add the quoted string
if (matcher.group(1) != null) {
result.append(sqlCommand, matcher.start(), matcher.end());
}
}
result.append(sqlCommand.substring(prevIndex));
return result.toString();
}
Following up from Jeremy's answer and inspired by Adrien Gibrat's answer.
This is my version that supports comment characters inside single-quoted strings.
.NET C# note you need to enable RegexOptions.IgnorePatternWhitespace
, most other languages this is the x option
(?: (?:'[^']*?') | (?<singleline>--[^\n]*) | (?<multiline>(?:\/\*)+?[\w\W]+?(?:\*\/)+) )
Example
https://regex101.com/r/GMUAnc/3