I have a string in field 'product' in the following form:
";TT_RAV;44;22;"
and I want to first split on the ';' and then split on the '_' so that what is returned is
"RAV"
I know that I can do something like this:
parse_1 = foreach {
splitup = STRSPLIT(product,';',3);
generate splitup.$1 as depiction;
};
This will return the string 'TT_RAV', and then I can do another split and project out the 'RAV'. However, this seems like it will pass the data through multiple map jobs. Is it possible to parse out the desired field in one pass?
This example does NOT work, as the inner STRSPLIT returns tuples, but it shows the logic:
parse_1 = foreach {
splitup = STRSPLIT(STRSPLIT(product,';',3),'_',1);
generate splitup.$1 as depiction;
};
Is it possible to do this in pure piglatin without multiple map phases?
Don't use STRSPLIT. You are looking for REGEX_EXTRACT:
REGEX_EXTRACT(product, '_([^;]*);', 1) AS depiction
If it's important to be able to precisely pick out the second semicolon-delimited field and then the second underscore-delimited subfield, you can make your regex more complicated:
REGEX_EXTRACT(product, '^[^;]*;[^_;]*_([^_;]*)', 1) AS depiction
Here's a breakdown of how that regex works:
^ // Start at the beginning
[^;]* // Match as many non-semicolons as possible, if any (first field)
; // Match the semicolon; now we'll start the second field
[^_;]* // Match any characters in the first subfield
_ // Match the underscore; now we'll start the second subfield (what we want)
( // Start capturing!
[^_;]* // Match any characters in the second subfield
) // End capturing
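As a quick sanity check outside Pig, both patterns can be tried against the sample value. This is only a sketch in Python: it assumes REGEX_EXTRACT searches for the pattern anywhere in the string (hence re.search), and the helper name regex_extract is made up for the illustration.

```python
import re

def regex_extract(text, pattern, index):
    """Hypothetical stand-in for REGEX_EXTRACT: return capture group `index`, or None."""
    match = re.search(pattern, text)
    return match.group(index) if match else None

product = ";TT_RAV;44;22;"

# Simple pattern: the text after the first underscore, up to the next semicolon.
simple = regex_extract(product, r"_([^;]*);", 1)

# Positional pattern: second semicolon-delimited field,
# then its second underscore-delimited subfield.
precise = regex_extract(product, r"^[^;]*;[^_;]*_([^_;]*)", 1)

print(simple, precise)
```

Both calls should give the same subfield for this input; they only differ on rows where an underscore appears in an earlier field.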
The only time there will be multiple maps is if you have an operator that triggers a reduce (JOIN, GROUP, etc...). If you run an explain on the script you can see if there is more than one reduce phase.
I am trying to determine the best delimiter for my CSV file. I've seen answers that pick the delimiter producing the largest number of header fields. Now, instead of doing the standard method, which would look something like this:
val supportedDelimiters: Array<Char> = arrayOf(',', ';', '|', '\t')
fun determineDelimiter(headerRow: String): Char {
    var headerLength = 0
    var chosenDelimiter = ' '
    supportedDelimiters.forEach {
        if (headerRow.split(it).size > headerLength) {
            headerLength = headerRow.split(it).size
            chosenDelimiter = it
        }
    }
    return chosenDelimiter
}
I've been trying to do it with some in-built Kotlin collections methods like filter or maxOf, but to no avail (the code below does not work).
fun determineDelimiter(headerRow: String): Char {
return supportedDelimiters.filter({a,b -> headerRow.split(a).size < headerRow.split(b)})
}
Is there any way I could do it without forEach?
Edit: The header row could look something like this:
val headerRow = "I;am;delimited;with;'semi,colon'"
I put quotes ('') around an entry that could contain another potential delimiter.
You're mostly there, but this seems simpler than you think!
Here's one answer:
fun determineDelimiter(headerRow: String)
= supportedDelimiters.maxByOrNull{ headerRow.split(it).size } ?: ' '
maxByOrNull() does all the hard work: you just tell it the number of headers that a delimiter would give, and it searches through each delimiter to find which one gives the largest number.
It returns null if the list is empty, so the method above returns a space character, like your standard method. (In this case we know that the list isn't empty, so you could replace the ?: ' ' with !! if you wanted that impossible case to give an error, or you could drop it entirely if you wanted it to give a null which would be handled elsewhere.)
As mentioned in a comment, there's no foolproof way to guess the CSV delimiter in general, and so you should be prepared for it to pick the wrong delimiter occasionally. For example, if the intended delimiter was a semicolon but several headers included commas, it could wrongly pick the comma. Without knowing any more about the data, there's no way around that.
With the code as it stands, there could be multiple delimiters which give the same number of headers; it would simply pick the first. You might want to give an error in that case, and require that there's a unique best delimiter. That would give you a little more confidence that you've picked the right one — though there's still no guarantee. (That's not so easy to code, though…)
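The "require a unique best delimiter" idea can be sketched roughly as follows. This is an illustration in Python rather than Kotlin, and the names SUPPORTED_DELIMITERS and determine_delimiter are simply carried over from the question; raising ValueError on a tie is one possible choice, not the only one.

```python
SUPPORTED_DELIMITERS = [",", ";", "|", "\t"]

def determine_delimiter(header_row, delimiters=SUPPORTED_DELIMITERS):
    """Return the delimiter yielding the most fields; raise if the best is not unique."""
    # Number of header fields each candidate delimiter would produce.
    counts = {d: len(header_row.split(d)) for d in delimiters}
    best = max(counts.values())
    winners = [d for d, n in counts.items() if n == best]
    if len(winners) != 1:
        raise ValueError(f"ambiguous delimiter, candidates: {winners!r}")
    return winners[0]

print(repr(determine_delimiter("I;am;delimited;with;'semi,colon'")))
```

A row such as "a,b;c" gives two fields for both ',' and ';', so it raises instead of silently picking the first candidate.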
Just like gidds said in the comment above, I would advise against choosing the delimiter based on how many times each delimiter appears. You would get the wrong answer for a header row like this:
Type of shoe, regardless of colour, even if black;Size of shoe, regardless of shape
In the above header row, the delimiter is obviously ; but your method would erroneously pick ,.
Another problem is that a header column may itself contain a delimiter, if it is enclosed in quotes. Your method doesn't take any notice of possible quoted columns. For this reason, I would recommend that you give up trying to parse CSV files yourself, and instead use one of the many available Open Source CSV parsers.
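To make the quoting problem concrete, here is a small sketch (in Python) that counts a candidate delimiter only when it is outside a quoted span. It assumes single quotes as the quote character, as in the question's edit, and ignores escaped quotes entirely, which is exactly the kind of detail a real CSV parser handles for you. The function name count_outside_quotes is hypothetical.

```python
def count_outside_quotes(row, delimiter, quote="'"):
    """Count occurrences of `delimiter` that are not inside a quoted span."""
    count = 0
    in_quotes = False
    for ch in row:
        if ch == quote:
            in_quotes = not in_quotes  # toggle on every quote character
        elif ch == delimiter and not in_quotes:
            count += 1
    return count

header_row = "I;am;delimited;with;'semi,colon'"
semicolons = count_outside_quotes(header_row, ";")
commas = count_outside_quotes(header_row, ",")
print(semicolons, commas)
```

A plain character count would see one comma in this row; the quote-aware count skips it because it sits inside the quoted entry.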
Nevertheless, if you still want to know how to pick the delimiter based on its frequency, there are a few optimizations to readability that you can make.
First, note that Kotlin strings are iterable; therefore you don't have to use a List of Char. Use a String instead.
Secondly, all you're doing is counting the number of times a character appears in the string, so there's no need to break the string up into pieces just to do that. Instead, count the number of characters directly.
Third, instead of finding the maximum value by hand, take advantage of what the standard library already offers you.
const val supportedDelimiters = ",;|\t"
fun determineDelimiter(headerRow: String): Char =
    supportedDelimiters.maxBy { delimiter -> headerRow.count { it == delimiter } }
fun main() {
    val headerRow = "one,two,three;four,five|six|seven"
    val chosenDelimiter = determineDelimiter(headerRow)
    println(chosenDelimiter) // prints ',' as expected
}
I have a table that looks like:
bl.ah
foo.bar
bar.fight
And I'd like to use HiveQL's regexp_extract to return
bl
foo
bar
Given the docs data about regexp_extract:
regexp_extract(string subject, string pattern, int index)
Returns the string extracted using the pattern. For example, regexp_extract('foothebar', 'foo(.*?)(bar)', 2) returns 'bar.' Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc. The 'index' parameter is the Java regex Matcher group() method index. See docs/api/java/util/regex/Matcher.html for more information on the 'index' or Java regex group() method.
So, if you have a table with a single column (let's call it description for our example) you should be able to use regexp_extract as follows to get the data before a period, if one exists, or the entire string in the absence of a period:
regexp_extract(description,'^([^\.]+)\.?',1)
The components of the regex are as follows:
^ start of string
([^\.]+) any non-period character one or more times, in a capture group
\.? a period either once or no times
Because the part of the string we're interested in will be in the first (and only) capture group, we refer to it by passing the index parameter a value of 1.
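The pattern can be checked quickly outside Hive. The sketch below uses Python's re module, which treats this particular pattern the same way as Java's regex engine; the function name before_period is made up, and returning an empty string when nothing matches is a choice made for this sketch rather than a statement about Hive's behaviour.

```python
import re

PATTERN = re.compile(r"^([^\.]+)\.?")

def before_period(description):
    """Return the text before the first period, or the whole string if there is none."""
    match = PATTERN.search(description)
    return match.group(1) if match else ""

for value in ["bl.ah", "foo.bar", "bar.fight", "noperiod"]:
    print(value, "->", before_period(value))
```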
I have a database with thousands of unknown strings. They may be emails, phone numbers, and so on,
but to me they are not emails or cell numbers; they are just strings, and I want to find their common patterns. Here are the strings, for example purposes:
link to example click here
What I want is to output a pattern if it matches 3 times. Here is what I am doing:
DECLARE @strs2 nvarchar(255)
DECLARE @patternTable table(
id int ,
order by p.pat
but my example returns this:
485-2889
485-2889
) 485-2889
) 485-2889
.aol.com/aol/search?
.aol.com/aol/search?
gmail.com
gmail.com
but I want to get a pattern like this:
[a-zA-Z 0-9] [a-zA-Z 0-9] [a-zA-Z 0-9] - 485-2889
for gmail
[a-zA-Z 0-9] [a-zA-Z 0-9]# gmail.com
First of all, this is much more work than it might seem.
As far as I can tell, it's going to be a method with heavy processing, and probably not something you want to do with a cursor in SQL (cursors are rather inefficient).
You have to define a way for your code to identify a pattern. You will also have to work out priorities for cases where a set of strings matches multiple patterns. For instance, if you implement the following pattern criteria (in your example):
BK-M18B-48
BK-M18B-52
BK-M82B-44
BK-M82S-38
BK-M82S-44
BK-R50B-58
BK-R50B-62
.....
should generate BK-[A-Z][0-9][0-9][A-Z]-[0-9][0-9]
Then next set can have multiple patterns as a result:
fedexcarepackage#outlook.com (example added for explanations)
fedexcarepackage#office.com
fedexcourierexpress#pisem.net
fedexcouriers#gmail.com ( another example added for explanations)
.....
Can generate :
fedexc%#%.% (as you said)
fedexc%#% (depending on processing)
fedexc[A-Z][A-Z]....%#%[A-Z]....[A-Z].[A-Z][A-Z][A-Z] (alphanumerics with '%' to compensate for length difference)
In addition to that, if you take fedexcarepackage#outlook.com away from the string list, you get 1 additional pattern that you probably don't want to have:
fedexc%#%i%.% (because they all have 'i' somewhere between the '#' and the '.' (dot))
Anyway, that is something you will have to consider with your design.
I'll give you some basic logic you can work with:
1. Create a function to identify each distinct pattern (1 pattern per function). For instance, 1 function to check for static pieces of string (and attach wildcards); another to detect [A-Z],[0-9] patterns that match your conditions for this pattern to be valid; more if needed for different patterns.
2. Create a function to test a string against your pattern. So say you have 4 strings and you find a pattern when comparing the first 2 of them. Then you use this function to test whether the pattern applies to the 3rd and 4th strings.
3. Create a function to test whether 2 patterns are mutually exclusive. For instance, 'PersonA#yahoo.%' and 'PersonA#%.net' are not mutually exclusive if they were both tested to be true. 'Person%#yahoo.com' and 'PersonB#yahoo.com' are mutually exclusive (both patterns cannot be true, so 1 is redundant).
4. Create a function to combine patterns that are NOT mutually exclusive (this probably includes the use of the functions in the 2nd and 3rd points). So 'PersonA#yahoo.%' and 'PersonA#%.net' can be combined into 'PersonA#%.%'.
Once you have that set up, loop through each text line, and compare the current line to the next against each pattern criterion. Record any patterns you find in a variable dedicated to that criterion (don't mix them just yet).
Next comes the hardest part. The safest way is to compare each pattern you find against each of the strings, to rule out the ones that don't apply to all strings. However, you could probably work out a way to combine patterns (in the same category) without cross-checking.
Finally, after you have narrowed down your pattern list to 1 pattern per pattern type, combine them into 1 or eliminate the ones
Keep in mind that in your pattern detection functions, you'll probably have to test each line multiple times and combine patterns. Some pseudo code to demonstrate:
Function CompareForStringMatches (String s1, String s2){ -- it should return a possible pattern found.
Array/List pattern;
int patternsFound=0;
For(i = 0, to length of shorter string){
For(x = 0, to length of shorter string){
if(longerString.contains(shorterString.substring(from i, to x))){
--record the pattern somewhere as:
pattern[patternsFound] = Replace(longerString, shorterString.Substring(from i, to x), '%') --pattern = longerString with substring replaced with '%' sign
patternsFound = patternsFound+1;
}
}
}
--After loops make another loop to check (partial) patterns against each other to eliminate patterns that are part of a larger pattern
--for instance Comparing 'random#asd.com' and 'sundom#asd.com' the patterns below should be found:
---compare'%andom#asd.com' and '%ndom#asd.com' and eliminate the first pattern, because both are valid, but second pattern includes the first one.
--You will have a lot of similar matches, but if you do this, you should end up with only a few patterns.
--after first cycle of checks do another one to combine patterns, where possible(for instance if you compare 'random#asd.com' and 'sundom#asd.net' you will end up with these 2 patterns'%ndom#asd.com' and 'Random#asd.%'.
--Since these patterns are true (because they were found during a comparison) you can combine them into '%ndom#asd.%'
--when you combine/eliminate all patterns, you should only have 1 left
return pattern[only pattern left];
}
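The idea behind the pseudo code can be sketched more compactly. The version below is a swapped-in technique, not the brute-force loop above: it finds the longest common substring of two strings, keeps it as a static piece, and recurses on what is left on either side, putting '%' wherever the strings disagree. It is written in Python for brevity; the function names and the min_len threshold (which stops tiny accidental matches from becoming static pieces) are assumptions of this sketch.

```python
def longest_common_substring(a, b):
    """Return (start_a, start_b, length) of the longest common substring."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)  # common-suffix lengths for the previous row
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best

def wildcard_pattern(a, b, min_len=2):
    """Build a LIKE-style pattern (with '%') that matches both strings."""
    if a == b:
        return a
    ia, ib, n = longest_common_substring(a, b)
    if n < min_len:
        return "%"  # nothing substantial in common: one wildcard covers it
    left = wildcard_pattern(a[:ia], b[:ib], min_len)
    right = wildcard_pattern(a[ia + n:], b[ib + n:], min_len)
    return left + a[ia:ia + n] + right

print(wildcard_pattern("random#asd.com", "sundom#asd.com"))
print(wildcard_pattern("random#asd.com", "sundom#asd.net"))
```

Applied pairwise over a list of strings, this gives the candidate patterns that the later steps then have to test, combine, or eliminate.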
PS: You can do things, much more efficiently, but if you have no idea where to start out, you probably need to do it the long way and work on improvements from first working prototypes.
Edit/Update
I suggest you make a wildcard detection method and then apply the other pattern checks you implement before it.
Wildcard detection for comparison of 2 strings (pseudo code), heavy processing version :
Compare 2 strings, check if every possible segment of shorter string is within longer:
for(int i = 0; i<shorterString.Length;i++){
for(int x = i + 1; x<=shorterString.Length;x++){ --x marks the end of the segment that starts at i
if(longerString.contains(shorterString.substring(i,x))){ --from i to x
possiblePattern.Add(longerString.replace(shorterString.substring(i,x),'*'))
--add to pattern list
}
}
--Next compare partial matches and eliminate ones that are a part of larger pattern
--So '*a#gmail.com' and '*na#yahoo.com' comparison should eliminate '*na#gmail.com', because if shorter pattern (with more symbols removed) is valid, then similar one with an extra symbol is part of it
--When that is done, combine remaining matches if there's more than 1 left.
--Remember, all patterns are valid if your first loop was correct, so '*#gmail.com' and 'personA#*.com' can be combined into '*#*.com
}
As for the alphanumeric detection, I would suggest you start by checking the length of all strings. If they are the same, run the wildcard pattern detection method (for all of them). When done, ONLY look for pattern matches in the wildcards.
So, You'll get a pattern like BK-*-* from wildcard detection run. On second iteration loop take 2 strings and only extract sub-strings that are represented by wildcard characters (use an array or an equivalent to store sub-strings, make sure not to combine both wildcards of a single string into 1 string).
So if you compare with pattern found above (BK-*-*) :
BK-M18B-48
BK-M18B-52
You should get following string sets to process after eliminating static characters:
Set 1:M18B and 48
Set 2:M18B and 52
Compare each character to the opposite string in the same position and check whether the characters match your category (like if String1[0].isaLetter AND String2[0].isaLetter). If they do, add that 1 character to a pattern; if not, either:
Add a wildcard character (this will lead to a pattern like BK-[A-Z]*[0-9][0-9]-[0-9][0-9]). If you do this, combine adjacent wildcard characters into 1.
Or treat the pattern as false and abort the check, returning no patterns.
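A minimal sketch of this position-by-position check, assuming all strings have the same length and using the "add a wildcard" option rather than aborting. It is written in Python; the function names are hypothetical, only ASCII uppercase letters and digits are treated as categories, and for brevity it runs over the whole string instead of only the wildcard segments.

```python
def classify(chars):
    """Map the characters found at one position to a pattern element."""
    if len(set(chars)) == 1:
        return chars[0]  # static character, keep it literally
    if all(c.isascii() and c.isupper() for c in chars):
        return "[A-Z]"
    if all(c.isascii() and c.isdigit() for c in chars):
        return "[0-9]"
    return "*"  # mixed categories: fall back to a wildcard

def class_pattern(strings):
    """Build a character-class pattern for strings of equal length."""
    if len({len(s) for s in strings}) != 1:
        raise ValueError("all strings must have the same length")
    merged = []
    for column in zip(*strings):
        part = classify(column)
        if part == "*" and merged and merged[-1] == "*":
            continue  # combine adjacent wildcards into one
        merged.append(part)
    return "".join(merged)

codes = ["BK-M18B-48", "BK-M18B-52", "BK-M82B-44", "BK-M82S-38",
         "BK-M82S-44", "BK-R50B-58", "BK-R50B-62"]
print(class_pattern(codes))
```

Positions where every string has the same character stay literal, so the static 'BK-' prefix and the second dash survive without a separate wildcard pass in this simplified version.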
Use this basic logic to loop through strings, create (and store!) patterns for each set of 2 strings. Loop through patterns, with wildcard detection (possibly a lighter version) to combine/eliminate patterns. So if you get patterns like '*#yahoo.com' and '*#gmail.com' from different sets of strings, you should combine them into '*#*.com'.
Keep in mind there's lots of room for optimization here.
I am working with the example about Parse Tree Matching and XPath shown here. More specifically, I was trying to understand how the following code works:
// assume we are parsing Java
ParserRuleContext tree = parser.compilationUnit();
String xpath = "//blockStatement/*"; // get children of blockStatement
String treePattern = "int <Identifier> = <expression>;";
ParseTreePattern p =
parser.compileParseTreePattern(treePattern,
ExprParser.RULE_localVariableDeclarationStatement);
List<ParseTreeMatch> matches = p.findAll(tree, xpath);
System.out.println(matches);
What I wanted to ask is if we can have regular expressions inside the treePattern string?
For example, I want to write a pattern which identifies all the localVariableDeclarations inside a for loop.
I would like to be able to identify the following code:
for (Object o : list) {
int tempVariable=0;
if ( o.id ==12) {
System.out.println(t);
}
}
The way I have written the pattern (which works) to identify this code is as follows:
String pattern3 = " for ( <className1:type> <localName1:Identifier> : <listName1:expression> ) { <localVariables1:localVariableDeclarationStatement> "
+ "if (<parameter1:expression>.<identifier1:Identifier> == <value1:primary> ) <block1:statement> }";
However, if I have more than one local variable, the pattern doesn't match. I tried to add a '*' at the end, as one would in the grammar file, but I get an invalid tag error:
<localVariables1:localVariableDeclarationStatement>*
Of course I can also add a pattern with two localVariableDeclarationStatement statements, but this again means that I have to create many different patterns for each number of local variables that I want to identify:
<localVariables1:localVariableDeclarationStatement> <localVariables2:localVariableDeclarationStatement> and identify the pattern with
At this time, we don't support repeated elements within the patterns. I thought about that but it essentially means making yet another parser generator whereas static patterns like that are fairly easy to match. It's possible to build one of these, as the last version of ANTLR had tree grammars where you could in fact specify the grammatical structure of subtrees. Until we decide what sort of enhancement to the patterns we can make, I suggest you get creative.
In your specific case, find all of the localVariableDeclarations within for loops as you are doing now and then use a small bit of code to walk that list to identify the contiguous sequences (they are all siblings) and the ones terminated by that particular IF pattern. Would that work?
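The "walk that list" step could look roughly like this, reduced to its core: given the positions of the matched declarations among their siblings, group them into contiguous runs. The sketch is in Python and works on plain integers; in real code the positions would come from the matched parse-tree nodes' child indexes, and the name group_contiguous is made up for the illustration.

```python
def group_contiguous(indices):
    """Group sibling indices into runs of consecutive values."""
    runs = []
    for index in sorted(indices):
        if runs and index == runs[-1][-1] + 1:
            runs[-1].append(index)  # extends the current run of siblings
        else:
            runs.append([index])    # gap found: start a new run
    return runs

# Hypothetical child positions of matched local variable declarations.
print(group_contiguous([0, 1, 2, 5, 6, 9]))
```

Each run can then be checked for whether the sibling right after it matches the IF pattern.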
I do the following:
a = load '/hive/warehouse/' USING PigStorage('^') as (a1,b1,c1);
b = group a by (a1) ;
c = foreach b generate group, a.$2;
dump c;
Output shows all the groups:
abc {(1),(44),(66)}
cde {(1),(44),(66)}
How can I remove the "{" and "(" characters so that the final HDFS file can be read as a comma-delimited file?
You can't do this directly in Pig. The special syntax is required because you are storing a bag, and in order for Pig to be able to read this bag later, it needs to be stored with braces (for the bag) and parentheses (for the tuples contained in the bag).
You have a couple of options. You can read the file back into Pig, but instead of reading it as a bag, read it as a chararray. Then you can perform regex substitution to get rid of the punctuation (untested):
a = LOAD 'output' AS (group:chararray, list:chararray);
b = FOREACH a GENERATE group, REPLACE(list, '[{()}]', '');
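The character class used in the REPLACE call can be checked outside Pig. This is a small Python sketch with a made-up helper name; it only demonstrates what the regex removes from one stored bag string, not how Pig reads the file.

```python
import re

def strip_bag_punctuation(text):
    """Remove the braces and parentheses Pig writes around bags and tuples."""
    return re.sub(r"[{()}]", "", text)

cleaned = strip_bag_punctuation("{(1),(44),(66)}")
print(cleaned)
```

The commas between tuples are left in place, which is what makes the remaining text usable as a comma-delimited list.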
Another option is to write a UDF which will turn a bag into a tuple. Note that this is not a well-defined operation: bags have no particular order, so from one run to the next, your tuple is not guaranteed to be in the same order. But for your purposes it sounds like that may not matter. The UDF could look like (very rough draft, untested):
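import java.io.IOException;
import java.util.Iterator;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
public class BAG_TO_TUPLE extends EvalFunc<Tuple> {
    @Override
    public Tuple exec(Tuple input) throws IOException {
        // The first argument is expected to be the bag to convert.
        DataBag bag = (DataBag) input.get(0);
        Iterator<Tuple> iterator = bag.iterator();
        Tuple out = TupleFactory.getInstance().newTuple();
        while (iterator.hasNext()) {
            // Append the first field of each tuple in the bag.
            out.append(iterator.next().get(0));
        }
        return out;
    }
}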
The above UDF is terrible -- it assumes that you have exactly one element in every tuple of the bag (that you care about) and does no checking whatsoever that the input is valid, etc. But it should get you towards what you want.
The best solution, though, is to find a way to handle the extra punctuation outside of Pig if Pig is not part of your downstream processing.
This functionality is now provided in Pig as a built-in func (I'm using 0.11).
http://pig.apache.org/docs/r0.11.0/api/org/apache/pig/builtin/BagToString.html
c = foreach b generate group, a.$2 as stuff;
d = foreach c generate group, BagToString(stuff, ',');
I don't need a comma-delimited file for my use case, but I assume you can use a store func to get the final comma (between group and the now-comma-delimited-list of bag things).
Try the FLATTEN operator;
c = foreach b generate group, FLATTEN(a.$2);