RegEx help - finding / returning a code - vb.net

I must admit it's been a few years since my RegEx class and since then, I have done little with them. So I turn to the brain power of SO. . .
I have an Excel spreadsheet (2007) with some data. I want to search one of the columns for a pattern (here's the RegEx part). When I find a match I want to copy a portion of the found match to another column in the same row.
A sample of the source data is included below. Each line represents a cell in the source.
I'm looking for a regex that matches "abms feature = XXX" where XXX is a variable-length word - no spaces in it, and I think all alpha characters. Once I find a match, I want to toss out the "abms feature = " portion of the match and place the code (the XXX part) into another column.
I can handle the excel coding part. I just need help with the regex.
If you can provide a solution that works entirely within Excel - no coding required, just native Excel formulas and commands - I would like to hear that, too.
Thanks!
###################################
Structure
abms feature = rl
abms feature = sta
abms feature = pc, pcc, pi, poc, pot, psc, pst, pt, radp
font = 5 abms feature = equl, equr
abms feature = bl
abms feature = tl
abms feature = prl
font = 5
###################################

I am still learning about regex myself, but I have found this site useful for getting ideas or comparing what I came up with; it might help in the future:
http://regexlib.com/

Try this regular expression:
abms feature = (\w+)
Here is an example of how to extract the value from the capture group:
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        Regex regex = new Regex(@"abms feature = (\w+)",
                                RegexOptions.Compiled |
                                RegexOptions.CultureInvariant |
                                RegexOptions.IgnoreCase);

        Match match = regex.Match("abms feature = XXX");
        if (match.Success)
        {
            Console.WriteLine(match.Groups[1].Value);
        }
    }
}

(?<=^abms feature = )[a-zA-Z]*
assuming you're not doing anything with the words after the commas
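A minimal .NET sketch of that lookbehind in action (assumes using System.Text.RegularExpressions; the sample cell comes from the structure above):
string cell = "abms feature = pc, pcc, pi, poc, pot, psc, pst, pt, radp";
Match m = Regex.Match(cell, @"(?<=^abms feature = )[a-zA-Z]*");
if (m.Success)
{
    // Prints "pc" - matching stops at the first comma, per the assumption above.
    Console.WriteLine(m.Value);
}
Note that the ^ inside the lookbehind anchors the match to the start of the cell, so a line like "font = 5 abms feature = equl, equr" would not match; drop the anchor if you need that case.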


Using ssplit options for CoreNLP

According to the documentation, I can use options such as ssplit.isOneSentence for parsing my document into sentences. How exactly do I do this though, given a StanfordCoreNLP object?
Here's my code -
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, depparse");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation(doc);
pipeline.annotate(document);
List<CoreMap> sentences = document.get(SentencesAnnotation.class);
At what point do I add this option and where?
Something like this?
pipeline.ssplit.boundaryTokenRegex = '"'
I'd also like to know how to use it with the specific option boundaryTokenRegex.
EDIT:
I think this seems more appropriate -
props.put("ssplit.boundaryTokenRegex", "\"");
But I still have to verify.
The way to make tokenized sentences end at any instance of a double quote is this -
props.setProperty("ssplit.boundaryMultiTokenRegex", "/''/");
or
props.setProperty("ssplit.boundaryMultiTokenRegex", "/\"/");
depending on how it is stored. (CoreNLP normalizes it as the former.)
And if you want both starting and ending quotes -
props.setProperty("ssplit.boundaryMultiTokenRegex", "/''|``/");

Parse SQL with REGEX to find Physical Update

I've spent a bit of time trying to bend regex to my will, but it's beaten me.
Here's the problem, for the following text...
--to be matched
UPDATE dbo.table
UPDATE TOP 10 PERCENT dbo.table
--do not match
UPDATE #temp
UPDATE TOP 10 PERCENT #temp
I'd like to match the first two UPDATE statements and not match the last two. So far I have the regex...
UPDATE\s?\s+[^#]
I've been trying to get the regex to ignore the TOP 10 PERCENT part, as it just gets in the way, but I haven't been successful.
Thanks in advance.
I'm using .net 3.5
I assume you're trying to parse real SQL syntax (looks like SQL Server) so I've tried something that is more suitable for that (rather than just detecting the presence of #).
You can try regex like:
UPDATE\s+(TOP.*?PERCENT\s+)?(?!(#|TOP.*?PERCENT|\s)).*
It checks for UPDATE followed by an optional TOP.*?PERCENT, then by something that is not TOP.*?PERCENT and doesn't start with #. It doesn't check just for the presence of #, as that may legitimately appear in another position and not mean a temp table.
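A quick sketch of that pattern against the four sample statements (assumes using System.Text.RegularExpressions):
string pattern = @"UPDATE\s+(TOP.*?PERCENT\s+)?(?!(#|TOP.*?PERCENT|\s)).*";
string[] samples = { "UPDATE dbo.table", "UPDATE TOP 10 PERCENT dbo.table",
                     "UPDATE #temp", "UPDATE TOP 10 PERCENT #temp" };
foreach (string sql in samples)
{
    // Expected output: True, True, False, False
    Console.WriteLine("{0} => {1}", sql, Regex.IsMatch(sql, pattern));
}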
As I understand it, you want a regex to operate on SQL code, not to actually query a database?
You can use a negative lookahead to check whether the line contains #temp:
(?m)^(?!.*#temp).*UPDATE
(?!...) will fail the whole match if what's inside it matches; ^ matches the beginning of the line when combined with the m modifier. (?m) is the inline version of this modifier, included since I don't know how/where you plan on using the regex.
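For instance, a minimal sketch (assumes using System.Text.RegularExpressions):
string script = "UPDATE dbo.table\nUPDATE #temp";
// Only the first line yields a match; the lookahead rejects lines containing #temp.
foreach (Match m in Regex.Matches(script, @"(?m)^(?!.*#temp).*UPDATE"))
{
    Console.WriteLine(m.Value);
}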
@Robin's solution is much better, but in case you need a regex with simpler mechanisms employed, I give you this:
UPDATE\s+(TOP\s+10\s+PERCENT\s+)?[a-z\.]+
sqlconsumer, here's a fully functioning C# .NET program. Does it do what you're looking for?
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string s1 = "UPDATE dbo.table";
        string s2 = "UPDATE TOP 10 PERCENT dbo.table";
        string s3 = "UPDATE #temp";
        string s4 = "UPDATE TOP 10 PERCENT #temp";
        string pattern = @"UPDATE\s+(?:TOP 10 PERCENT\s+)?dbo\.\w+";

        Console.WriteLine(Regex.IsMatch(s1, pattern));
        Console.WriteLine(Regex.IsMatch(s2, pattern));
        Console.WriteLine(Regex.IsMatch(s3, pattern));
        Console.WriteLine(Regex.IsMatch(s4, pattern));

        Console.WriteLine("\nPress Any Key to Exit.");
        Console.ReadKey();
    } // END Main
} // END Program
The Output:
True
True
False
False

Lucene.net PerFieldAnalyzerWrapper

I've read up on how to use the per-field analyzer wrapper, but I can't get it to work with a custom analyzer of mine. I can't even get the analyzer's constructor to run, which makes me believe I'm actually calling the per-field analyzer incorrectly.
Here's what I'm doing:
Create the per field analyzer:
PerFieldAnalyzerWrapper perFieldAnalyzer = new PerFieldAnalyzerWrapper(srchInfo.GetAnalyzer(true));
perFieldAnalyzer.AddAnalyzer("<special field>", dta);
Add all the fields to the document as usual, including the special field that we analyze differently.
And add the document using the analyzer like this:
iw.AddDocument(doc, perFieldAnalyzer);
Am I on the right track?
The problem was related to my reliance on the CMS's (Kentico) built-in Lucene helper classes. Basically, using those classes you need to specify the custom analyzer at index level through the CMS, and I did not wish to do that. So I ended up using Lucene.net directly almost everywhere, gaining the flexibility of using any custom analyzer I want.
I also made some changes to how I structure the data and ended up using the tried-and-true KeywordAnalyzer to analyze document tags. Previously I was trying to do some custom tokenization magic on comma-separated values like [tag1, tag2, tag with many parts] and could not get it working reliably with multi-part tags. I still kept that field, but started adding multiple "tag" fields to the document, each storing one tag. So now I have N "tag" fields for N tags, each analyzed as a keyword, meaning each tag (one word or many) is a single token.
I think I overthought it with my initial approach.
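A sketch of that N-field idea (doc and the tag values are stand-ins; the field name matches the code below):
// With KeywordAnalyzer on "documenttags_t", each tag (one word or many) stays one token.
foreach (string tag in new[] { "tag1", "tag2", "tag with many parts" })
{
    doc.Add(new Field("documenttags_t", tag, Field.Store.YES, Field.Index.ANALYZED));
}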
Here is what I ended up with.
On Indexing:
KeywordAnalyzer ka = new KeywordAnalyzer();
PerFieldAnalyzerWrapper perFieldAnalyzer = new PerFieldAnalyzerWrapper(srchInfo.GetAnalyzer(true));
perFieldAnalyzer.AddAnalyzer("documenttags_t", ka);
// Some procedure to compile all documents by reading from DB and putting them into Lucene docs
foreach (var doc in docs)
{
    iw.AddDocument(doc, perFieldAnalyzer);
}
On Searching:
KeywordAnalyzer ka = new KeywordAnalyzer();
PerFieldAnalyzerWrapper perFieldAnalyzer = new PerFieldAnalyzerWrapper(srchInfo.GetAnalyzer(true));
perFieldAnalyzer.AddAnalyzer("documenttags_t", ka);
string baseQuery = "documenttags_t:\"" + tagName + "\"";
Query query = _parser.Parse(baseQuery);
var results = _searcher.Search(query, sortBy);

Need to extract information from free text, information like location, course etc

I need to write a text parser for the education domain which can extract information like institute, location, course, etc. from free text.
Currently I am doing it through Lucene; the steps are as follows:
Index all the data related to institute, courses and location.
Making shingles of the free text, searching each shingle in the location, course, and institute index directories, and then trying to find out which part of the text represents a location, course, etc.
With this approach I am missing a lot of cases; for example, B.tech can be written as btech, b-tech or b.tech.
I want to know whether there is anything available that can do all these kinds of things. I have heard about LingPipe and GATE but don't know how efficient they are.
You definitely need GATE. GATE has two main, most frequently used features (among thousands of others): rules and dictionaries. Dictionaries (gazetteers in GATE's terms) allow you to put all possible cases like "B.tech", "btech" and so on into a single text file and let GATE find and mark them all. Rules (more precisely, JAPE rules) allow you to define patterns in text. For example, here's a pattern to catch MIT's postal address ("77 Massachusetts Ave., Building XX, Cambridge MA 02139"):
{Token.kind == number}(SP){Token.orth == uppercase}(SP){Lookup.majorType == avenue}(COMMA)(SP)
{Token.string == "Building"}(SP){Token.kind == number}(COMMA)(SP)
{Lookup.majorType == city}(SP){Lookup.majorType == USState}(SP){Token.kind == number}
where (SP) and (COMMA) are macros (just to make the text shorter), {Something} is an annotation, {Token.kind == number} is an annotation "Token" with feature "kind" equal to "number" (i.e. just a number in the text), and {Lookup} is an annotation that captures values from a dictionary (BTW, GATE already has dictionaries for such things as US cities). This is quite a simple example, but you should see how easily you can cover even very complicated cases.
I didn't use Lucene, but in your case I would leave the different forms of the same keyword as they are and just keep a link table or some such. In this table I'd keep the relation between these different forms (see the sketch below).
You may need to write a regular expression to cover each possible form of your vocabulary.
Be careful about your choice of analyzer / tokenizer, because words like B.tech can be easily split into 2 different words (i.e. B and tech).
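A minimal sketch of such a link table (hypothetical entries, mapping variant spellings to one canonical form):
var canonical = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
{
    { "b.tech", "btech" },
    { "b-tech", "btech" }
};
// Normalize a term before indexing or searching; unknown terms pass through unchanged.
string term;
if (!canonical.TryGetValue("B.Tech", out term))
    term = "B.Tech";
Console.WriteLine(term); // btech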
You may want to check out UIMA. Like LingPipe and GATE, this framework features text annotation, which is what you are trying to do. Here is a tutorial which will help you write an annotator for UIMA:
http://uima.apache.org/d/uimaj-2.3.1/tutorials_and_users_guides.html#ugr.tug.aae.developing_annotator_code
UIMA has addons, in particular one for Lucene integration.
You can try http://code.google.com/p/graph-expression/
An example of address parsing rules:
GraphRegExp.Matcher Token = match("Token");
GraphRegExp.Matcher Country = GraphUtils.regexp("^USA$", Token);
GraphRegExp.Matcher Number = GraphUtils.regexp("^\\d+$", Token);
GraphRegExp.Matcher StateLike = GraphUtils.regexp("^([A-Z]{2})$", Token);
GraphRegExp.Matcher Postoffice = seq(match("BoxPrefix"), Number);
GraphRegExp.Matcher Postcode =
    mark("Postcode", seq(GraphUtils.regexp("^\\d{5}$", Token), opt(GraphUtils.regexp("^\\d{4}$", Token))));
// mark(String, Matcher) means creating a chunk over the sub-matcher
GraphRegExp.Matcher streetAddress = mark("StreetAddress", seq(Number, times(Token, 2, 5).reluctant()));
// without new lines
streetAddress = regexpNot("\n", streetAddress);
GraphRegExp.Matcher City = mark("City", GraphUtils.regexp("^[A-Z]\\w+$", Token));

Chunker chunker = Chunkers.pipeline(
    Chunkers.regexp("Token", "\\w+"),
    Chunkers.regexp("BoxPrefix", "\\b(POB|PO BOX)\\b"),
    new GraphExpChunker("Address",
        seq(
            opt(streetAddress),
            opt(Postoffice),
            City,
            StateLike,
            Postcode,
            Country
        )
    ).setDebugString(true)
);
B.tech can be written as btech, b-tech or b.tech
Lucene will let you do fuzzy searches based on the Levenshtein Distance. A query for roam~ (note the ~) will find terms like foam and roams.
That might allow you to match the different cases.
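A minimal Lucene.Net sketch (the searcher and field name are stand-ins; how well this works depends on how the terms were tokenized):
var fuzzy = new FuzzyQuery(new Term("course", "btech"));
// Matches terms within the default similarity threshold, e.g. "b-tech".
TopDocs hits = searcher.Search(fuzzy, 10);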

Detailed information in Lucene/Solr results

After having performed a search in Lucene/Solr without having specified a field, how can I know in which fields of a result document the search string was found (and how often)?
You could use Query Highlighting.
Try setting debugQuery=on.
As mentioned, use debugQuery=true. The response will then include an "explain" section. By default, this will give you some awfully formatted text that looks like this:
0.69102794 = (MATCH) weight(body:arrai^1.5 in 6357), product of:
  0.46610788 = queryWeight(body:arrai^1.5), product of:
    1.5 = boost
    5.591044 = idf(docFreq=55709, maxDocs=5492855)
    0.055577915 = queryNorm
  1.4825494 = (MATCH) fieldWeight(body:arrai in 6357), product of:
    2.828427 = tf(termFreq(body:arrai)=8)
    5.591044 = idf(docFreq=55709, maxDocs=5492855)
    0.09375 = fieldNorm(field=body, doc=6357)
For each match in each field, you will get a block like this that explains how SOLR computed the relevancy of this document to your query. What you're asking about (how many matches are in this document's field) is what SOLR calls term frequency, "tf". You can see this on the 7th line of the output I pasted above. In this line, SOLR is telling you that it found 8 matches for arrai in the field called "body".
The other lines stand for things like inverse document frequency, "idf" (how rare the matched term is), and fieldNorm, which relates to how short the document's field is relative to the match. You can learn about these here: http://wiki.apache.org/solr/SolrRelevancyFAQ
FYI, if you need this "explain" information in a structured format instead of clumsy text, you can pass this parameter with your query: debug.explain.structured=true. However, it's still pretty hard to use. =)
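For reference, a request with debugging enabled might look like this (host, core, and query are placeholders):
http://localhost:8983/solr/select?q=arrai&debugQuery=on&debug.explain.structured=true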