I have a lot of data that I need to sort through, and one of the fields contains both the make/model of the vehicle and the registration, sometimes separated by a dash (-) and sometimes not. Here are some examples of such strings:
VehicleModel - TU69YUP
VehicleModel - TU69 YUP
VehicleModel TU69YUP
VehicleModel TU69 YUP
There are also some other variations, but these are the main ones I have encountered. Is there a way that I can reliably go through all of the data and separate the vehicle reg from the model?
The data is currently contained within a Paradox database, which I have no problem going through. I do not have a list of all of the vehicle models and names contained within the database; likewise, I do not have a list of the licence plates.
The project is written in Delphi/SQL so I would prefer to stick with either one of these if at all possible.
Trouble ahead
If that field was originally entered by a user in the form that you're now seeing, then we can assume there was no validation: the original program would simply store whatever the user entered. If that's the case, you can't get 100% accuracy: human beings will always make mistakes, intentionally or unintentionally. Expect these kinds of human errors:
Missing fields (i.e. registration only, no vehicle information, or the other way around)
Meaningless duplication of words (example: "Ford Ford K - TU 69 YUP")
Missing letters, duplicated letters, extra garbage letters. Example: "For K - T69YUP"
Wrong order of fields
Other small errors you can't even dream of.
Plain garbage that not even a human would make sense of.
You might have guessed I'm a bit pessimistic when dealing with human-entered data straight into text fields. I had the distinct misfortune to deal with a database where all data was text and there was no validation: Can you guess the kind of nonsense people typed in unvalidated date fields that allowed free user input?
The plan
Things aren't as dark as they seem: you can probably "fix" lots of things. The trick here is to make sure you only fix data that is unambiguous and let a human sift through the rest. The easiest way to do that is something like this:
Look at the data that hasn't been automatically fixed yet. Figure out a rule that unambiguously applies to lots of records.
Apply the unambiguous rule.
Repeat until only a few records are left. Those should be fixed by hand, because they resisted all automatic methods that were applied.
The implementation
I strongly recommend using regular expressions for all the tests, because you'll surely end up implementing lots of different tests, and regular expressions can easily "express" the slight variations in the search text. For example, the following regex can parse all 4 of your examples and give the correct result:
(.*?)(\ {1,3}-\ {1,3})?(\b[A-Z]{2}\ {0,2}[0-9]{2}\ {0,3}[A-Z]{3}\b)
If you've never worked with regular expressions before, that single expression looks unintelligible, but it's in fact very simple. This is not a regex question so I'm not going into any details; I'd rather explain how I came up with the idea.
First of all, if the text includes vehicle registration numbers, those numbers will be in a very strict format: they'd be easy to match. Given your examples, I assume all registration numbers are of the form:
LLNNLLL
where "L" is a letter and "N" is a number. My regex is rigid in it's interpretation of it: it wants exactly two uppercase letters, followed by a small number of spaces (or no space), followed by exactly two digits, followed by a small number of spaces (or no space), finally followed by exactly 3 uppercase letters. The part of the regex that deals with that is:
[A-Z]{2}\ {0,2}[0-9]{2}\ {0,3}[A-Z]{3}
The rest of the regex makes sure the registration number isn't found embedded in other words, deals with grouping text into capture groups and creates a "lazy capture group" for the VehicleModel.
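If it helps, the pattern itself is engine-agnostic, so here is a quick way to sanity-check it against your four sample strings. The sketch below is in Go purely for illustration (in Delphi you would feed the same pattern to whatever regex engine you use); the escaped spaces ("\ ") are written as plain spaces because Go's engine doesn't need them escaped:

package main

import (
    "fmt"
    "regexp"
    "strings"
)

func main() {
    // Same pattern as above: lazy model, optional dash, rigid registration.
    re := regexp.MustCompile(`(.*?)( {1,3}- {1,3})?(\b[A-Z]{2} {0,2}[0-9]{2} {0,3}[A-Z]{3}\b)`)

    samples := []string{
        "VehicleModel - TU69YUP",
        "VehicleModel - TU69 YUP",
        "VehicleModel TU69YUP",
        "VehicleModel TU69 YUP",
    }
    for _, s := range samples {
        m := re.FindStringSubmatch(s)
        if m == nil {
            fmt.Printf("%-25q -> no match, leave for manual review\n", s)
            continue
        }
        // Capture group 1 is the lazily matched model, group 3 the registration.
        fmt.Printf("%-25q -> model=%q reg=%q\n", s, strings.TrimSpace(m[1]), m[3])
    }
}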
If I were to implement this myself, I'd probably write a "master" function and a number of simpler "case" functions, each function dealing with one kind of variation in user input. Example:
// This function does a validation of the extracted data. For example it validates the
// Registration number, using other, more precise criteria. The parameters are VAR so the
// function may normalize the results.
function ResultsAreValid(var Make, Registration: string): Boolean;
begin
  Result := True; // Only you know what your data looks like and how it can be validated.
end;
// This is a case function that deals with a very rigid interpretation of user data
function VeryStrictInterpretation(const Text: string; out Make, Registration: string): Boolean;
var
  TestMake, TestReg: string;
  // regex engine ...
begin
  Result := False;
  if (your condition) then
    if ResultsAreValid(TestMake, TestReg) then
    begin
      Make := TestMake;
      Registration := TestReg;
      Result := True;
    end;
end;
// Master function calling many different implementations that each deal with all sorts
// of variations of input. The most strict function should be first:
function MasterTest(const Text: string; out Make, Registration: string): Boolean;
begin
  Result := VeryStrictInterpretation(Text, Make, Registration);
  if not Result then Result := SomeOtherImplementation(Text, Make, Registration);
  if not Result then Result := ThirdInterpretation(Text, Make, Registration);
end;
The idea here is to write multiple SIMPLE procedures that each understand one kind of input unambiguously, and to make sure each step doesn't return false positives! And finally, don't forget that a human should deal with the last few cases, so don't aim for a fix-it-all solution.
Well, assuming that they are all of the same format, Word[space]Word,
then you can iterate through them all and, if you encounter a whitespace without a dash, insert a dash. Then split as normal.
Here is a code example.
It will check for the - and also remove possible spaces in the license number.
Note (as commented by Ken White): if the vehicle name contains a space, this will have to be handled as well.
type
  EMySplitError = class(Exception);

procedure SplitVehicleAndLicense(s: String; var vehicle, license: String);
var
  p: Integer;
begin
  vehicle := '';
  license := '';
  p := Pos('-', s);
  if (p = 0) then
  begin // No split delimiter
    p := Pos(' ', s);
    if (p > 0) then
    begin
      vehicle := Trim(Copy(s, 1, p - 1));
      license := Trim(Copy(s, p + 1, Length(s)));
    end
    else
      raise EMySplitError.CreateFmt('Not a valid vehicle/license name:%s', [s]);
  end
  else
  begin
    vehicle := Trim(Copy(s, 1, p - 1));
    license := Trim(Copy(s, p + 1, Length(s)));
  end;
  // Trim spaces in license
  repeat
    p := Pos(' ', license);
    if (p <> 0) then
      Delete(license, p, 1);
  until (p = 0);
end;
Hello ANTLR creators/users,
Some context - I am using PlSql ANTLR4 parser to do some lightweight transpiling of some queries from oracle sql to, let's say, spark sql. I have my listener class setup which extends the base listener.
Example of an issue -
Let's say the input is something like -
SELECT to_char(to_number(substr(ATTRIBUTE_VALUE,1,4))-3)||'0101') from xyz;
Now, I'd like to replace || with CONCAT and to_char with CAST as STRING, so that the final query looks like -
SELECT CONCAT(CAST(to_number(substr(ATTRIBUTE_VALUE,1,4))-3) as STRING),'0101') from xyz;
In my listener class, I am overriding two functions from the base listener to do this: concatenation and string_function. In those, I am using a TokenStreamRewriter's replace to make the necessary transformation. Since the TokenStreamRewriter is evaluated lazily, I am running into this issue:
java.lang.IllegalArgumentException: replace op boundaries of
<ReplaceOp#[#38,228:234='to_char',<2193>,3:15]..[#53,276:276=')',
<2214>,3:63]:"CAST (to_number(substr(ATTRIBUTE_VALUE,1,4))-3 as STRING)">
overlap with previous <ReplaceOp#[#38,228:234='to_char',<2193>,3:15]..
[#56,279:284=''0101'',<2209>,3:66]:"CONCAT
(to_char(to_number(substr(ATTRIBUTE_VALUE,1,4))-3),'0101')">
Clearly, the issue is my two listener functions attempting to replace/transform text on overlapping boundaries.
Is there any workaround for this kind of overlapping-boundary issue in ANTLR4? I'm sure folks run into this sort of thing all the time.
I'd appreciate any workarounds, even dirty ones at this point of time :)
I did realize that ANTLR4 does not allow us to modify the original AST, otherwise this would have been a little bit easier to solve.
Thanks!
A look at how TokenStreamRewriter works leads to the following understanding:
first, a list of all modification operations is built
then, you invoke getText()
here, the modification operations are reduced: the idea is, for example, to merge multiple inserts at the same place into one. This step's role is also to reject multiple replaces on the same data (but I will expand on this point later).
every token is then read; if a modification is listed for that token index, TokenStreamRewriter performs the operation, otherwise it just emits the token as-is.
Let's have a look at how the modification operations are implemented:
for an insert, TokenStreamRewriter simply emits the string to be added at the current token index and then does an index+1, effectively moving on to the next token
for a replace, TokenStreamRewriter replaces a range of tokens with the new string and sets the new index to the end of that range.
So, for TokenStreamRewriter, overlapping replaces are not possible: when you replace, you jump to the end of the range of tokens being replaced. In particular, if you removed the overlap checks, only the first replace would be performed, because afterwards the token index is already past the other replaces.
Basically, this is done because there is no easy way to tell which tokens should be replaced when replaces overlap; you would need symbol recognition and matching for that.
So, what you are trying to do is the following (for each step, the part between '*' is what is modified):
*SELECT to_char(to_number(substr(ATTRIBUTE_VALUE,1,4))-3)||'0101')* from xyz;
|
V
SELECT CONCAT (*to_char(to_number(substr(ATTRIBUTE_VALUE,1,4))-3)*,'0101') from xyz;
|
V
SELECT CONCAT(CAST(to_number(substr(ATTRIBUTE_VALUE,1,4))-3) as STRING),'0101') from xyz;
To achieve your transformation, you could instead do a replace of:
'to_char' -> 'CONCAT(CAST'
'||' -> ' as STRING),'
And, by using a bit of intelligence while walking your tokens (for example, checking whether there is a '||' among them, to know whether it's a string concatenation), you would know what to replace.
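For what it's worth, here is a quick check that those two single-token substitutions produce exactly the output you want. This is plain string replacement in Go, purely to illustrate the mapping (it is not the ANTLR API); in the real listener you would apply the same two substitutions through the rewriter, each anchored to a single token, so no two replace operations ever overlap:

package main

import (
    "fmt"
    "strings"
)

func main() {
    query := "SELECT to_char(to_number(substr(ATTRIBUTE_VALUE,1,4))-3)||'0101') from xyz;"

    // Each substitution touches one distinct token, so the rewrites cannot overlap.
    r := strings.NewReplacer(
        "to_char", "CONCAT(CAST",
        "||", " as STRING),",
    )
    fmt.Println(r.Replace(query))
    // Output: SELECT CONCAT(CAST(to_number(substr(ATTRIBUTE_VALUE,1,4))-3) as STRING),'0101') from xyz;
}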
regards
The way I solve this in multiple projects based on ANTLR is this: I translate the ANTLR parse tree to an AST written using Kolasu, an open-source library we developed at Strumenta.
Kolasu has all sorts of utilities to process and mutate ASTs. For all non-trivial projects I end up doing transformations on the AST.
Kolasu
I've been breaking my head over this for a few days now and can't seem to be able to figure it out. Perhaps it's glaringly obvious, but I don't seem to be able to spot it. I've read up on all the basics of unicode, UTF-8, UTF-16, normalisation, etc, but to no avail. Hopefully somebody's able to help me out here...
I'm using Go's Value function from the testing/quick package to generate random values for the fields in my data structs, in order to implement the Generator interface for the structs in question. Specifically, given a Metadata struct, I've defined the implementation as follows:
func (m *Metadata) Generate(r *rand.Rand, size int) (value reflect.Value) {
    value = reflect.ValueOf(m).Elem()
    for i := 0; i < value.NumField(); i++ {
        if t, ok := quick.Value(value.Field(i).Type(), r); ok {
            value.Field(i).Set(t)
        }
    }
    return
}
Now, in doing so, I'll end up with both the receiver and the return value being set with randomly generated values of the appropriate type (strings, ints, etc. in the receiver, and a reflect.Value in the returned reflect.Value).
Now, the implementation for the Value function states that it will return something of type []rune converted to type string. As far as I know, this should allow me to then use the functions in the runes, unicode and norm packages to define a filter which filters out everything which is not part of 'Latin', 'Letter' or 'Number'. I defined the following filter which uses a transform to filter out letters which are not in those character rangetables (as defined in the unicode package):
func runefilter(in reflect.Value) (out reflect.Value) {
    out = in // Make sure you return something
    if in.Kind() == reflect.String {
        instr := in.String()
        t := transform.Chain(norm.NFD, runes.Remove(runes.NotIn(rangetable.Merge(unicode.Letter, unicode.Latin, unicode.Number))), norm.NFC)
        outstr, _, _ := transform.String(t, instr)
        out = reflect.ValueOf(outstr)
    }
    return
}
Now, I think I've tried just about anything, but I keep ending up with a series of strings which are far from the Latin range, e.g.:
𥗉똿穊
𢷽嚶
秓䝏小𪖹䮋
𪿝ท솲
𡉪䂾
ʋ𥅮ᦸ
堮𡹯憨𥗼𧵕ꥆ
𢝌𐑮𧍛併怃𥊇
鯮
𣏲𝐒
⓿ꐠ槹𬠂黟
𢼭踁퓺𪇖
俇𣄃𔘧
𢝶
𝖸쩈𤫐𢬿詢𬄙
𫱘𨆟𑊙
欓
So, can anybody explain what I'm overlooking here and how I could instead define a transformer which removes/replaces non-letter/number/latin characters so that I can use the Value function as intended (but with a smaller subset of 'random' characters)?
Thanks!
Confusingly, the quick.Generator interface needs the Generate method defined on the type, not on a pointer to the type. You want your method signature to look like:
func (m Metadata) Generate(r *rand.Rand, size int) (value reflect.Value)
You can play with this here. Note: the most important thing to do in that playground is to switch the receiver type of the Generate function from m Metadata to m *Metadata and see that "Hi Mom!" never prints.
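If that playground link doesn't survive, here is a minimal self-contained sketch of the same point (the type names ByValue and ByPointer are made up for illustration): quick.Value only notices a Generate method defined on the value type.

package main

import (
    "fmt"
    "math/rand"
    "reflect"
    "testing/quick"
)

type ByValue struct{}

// Value receiver: ByValue itself implements quick.Generator, so quick.Value uses it.
func (ByValue) Generate(r *rand.Rand, size int) reflect.Value {
    fmt.Println("ByValue.Generate called")
    return reflect.ValueOf(ByValue{})
}

type ByPointer struct{}

// Pointer receiver: only *ByPointer implements quick.Generator, so
// quick.Value(reflect.TypeOf(ByPointer{}), ...) never calls this method.
func (*ByPointer) Generate(r *rand.Rand, size int) reflect.Value {
    fmt.Println("ByPointer.Generate called")
    return reflect.ValueOf(ByPointer{})
}

func main() {
    r := rand.New(rand.NewSource(1))
    quick.Value(reflect.TypeOf(ByValue{}), r)   // prints "ByValue.Generate called"
    quick.Value(reflect.TypeOf(ByPointer{}), r) // prints nothing: falls back to default generation
}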
In addition, I think you would be better served using your own type and writing a generate method for that type using a list of all of the characters you want to use. For example:
type LatinString string
const latin = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz01233456789"
and then use the generator
func (l LatinString) Generate(rand *rand.Rand, size int) reflect.Value {
    var buffer bytes.Buffer
    for i := 0; i < size; i++ {
        buffer.WriteString(string(latin[rand.Intn(len(latin))]))
    }
    s := LatinString(buffer.String())
    return reflect.ValueOf(s)
}
playground
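To see it plugged in end to end, here is a short sketch (assuming the LatinString type, the latin constant and the Generate method above live in the same package); quick.Value picks up the Generator implementation automatically:

package main

import (
    "fmt"
    "math/rand"
    "reflect"
    "testing/quick"
)

func main() {
    r := rand.New(rand.NewSource(42))
    if v, ok := quick.Value(reflect.TypeOf(LatinString("")), r); ok {
        // Only characters from the latin constant can appear here.
        fmt.Println(v.Interface().(LatinString))
    }
}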
Edit: also this library is pretty cool, thanks for showing it to me
The answer to my own question is, it seems, a combination of the answers provided in the comments by @nj_ and @jimb and the answer provided by @benjaminkadish.
In short, the answer boils down to:
"Not such a great idea as you thought it was", or "Bit of an ill-posed question"
"You were using the union of 'Letter', 'Latin' and 'Number' (Letter || Number || Latin), instead of the intersection of 'Latin' with the union of 'Letter' and 'Number' ((Letter || Number) && Latin))
Now for the longer version...
The idea behind me using the testing/quick package is that I wanted random data for (fuzzy) testing of my code. In the past, I've always written the code for doing things like that myself, again and again. This meant a lot of the same code across different projects. Now, I could of course have written my own package for it, but it turns out that, even better than that, there's actually a standard package which does just about exactly what I want.
Now, it turns out the package does exactly what I want very well. The codepoints in the strings which it generates are actually random and not just restricted to what we're accustomed to using in everyday life. Now, this is of course exactly the thing which you want in doing fuzzy testing in order to test the code with values outside the usual assumptions.
In practice, that means I'm running into two problems:
There are some limits on what I would consider reasonable input for a string. Meaning that, in testing the processing of a Name field or a URL field, I can reasonably assume there's not going to be a value like 'James Mc⌢' (let alone 'James Mc🙁') or 'www.🕸site.com', but just 'James McFrown' and 'www.website.com'. Hence, I can't expect a reasonable system to be able to support such values. Of course, things shouldn't completely break down, but it also can't be expected to handle the former examples without any problems.
When I filter the generated string on values which one might consider reasonable, the chance of ending up with a valid string is very small. The set of possible characters used by testing/quick is just so large (up to 0x10FFFF code points) and the set of reasonable characters so small that you end up with empty strings most of the time.
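For reference, this is what the filter looks like with the intersection from point 2 above, i.e. (Letter || Number) && Latin, written with runes.Predicate instead of rangetable.Merge. It is only a sketch, and it illustrates exactly the problem described here: most generated code points are rejected, so the resulting strings come out short or empty most of the time.

package main

import (
    "fmt"
    "unicode"

    "golang.org/x/text/runes"
    "golang.org/x/text/transform"
    "golang.org/x/text/unicode/norm"
)

// notLatinLetterOrNumber matches every rune that is NOT a Latin-script letter
// or number, i.e. the complement of (Letter || Number) && Latin.
// Note that ASCII digits have Script=Common, so a literal Latin intersection
// drops them too; widen the predicate if you want to keep them.
var notLatinLetterOrNumber = runes.Predicate(func(r rune) bool {
    return !(unicode.Is(unicode.Latin, r) && (unicode.IsLetter(r) || unicode.IsNumber(r)))
})

func latinFilter(in string) string {
    // Decompose, drop everything outside the intersection (this also drops
    // spaces and combining marks, just like the original filter), recompose.
    t := transform.Chain(norm.NFD, runes.Remove(notLatinLetterOrNumber), norm.NFC)
    out, _, _ := transform.String(t, in)
    return out
}

func main() {
    fmt.Println(latinFilter("James Mc🙁Frown 𥗉")) // JamesMcFrown
}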
So, what do we need to take away from this?
So, whilst I hoped to use the standard testing/quick package to replace my often repeated code to generate random data for fuzzy testing, it does this so well that it provides data outside the range of what I would consider reasonable for the code to be able to handle. It seems that the choice, in the end, is to:
Either be able to actually handle all fuzzy options, meaning that if somebody's name is 'Arnold 💰💰' ('Arnold Moneybags'), it shouldn't go arse over end. Or...
Use custom/derived types with their own Generator. This means you're going to have to use the derived type instead of the basic type throughout the code. (Comparable to defining a string as wchar_t instead of char in C++ and working with those by default.). Or...
Don't use testing/quick for fuzzy testing, because as soon as you run into a generated string value, you can (and should) expect a very random string.
As always, further comments are of course welcome, as it's quite possible I overlooked something.
So, I'm writing a language using flex/bison and I'm having difficulty with implementing identifiers, specifically when it comes to knowing whether you're looking at an assignment or a reference.
for example:
1) A = 1+2
2) B + C (where B and C have already been assigned values)
Example one I can work out by returning an ID token from flex to bison, and just following a grammar that recognizes that 1+2 is an integer expression, putting A into the symbol table, and setting its value.
Examples two and three are more difficult for me because, after going through my lexer, what's being returned to bison in ex. 2 is "ID PLUS ID". I have a grammar that recognizes arithmetic expressions for numerical values, like INT PLUS INT (which would produce an INT) or DOUBLE MINUS INT (which would produce a DOUBLE). If I have "ID PLUS ID", how do I know what type the return value is?
Here's the best idea that I've come up with so far: when tokenizing, every time an ID comes up, I search for its value and type in the symbol table and switch out the ID token with its respective information. For example: while tokenizing, I come across B, which has a regex that matches it as being an ID. I look in my symbol table and see that it has a value of 51.2 and is a DOUBLE. So instead of returning ID with a value of B to bison, I'm returning DOUBLE with a value of 51.2.
I have two different solutions that contradict each other. Here's why: if I want to assign a value to an ID, I would say to my compiler A = 5. In this situation, if I'm using my previously described solution, what I'm going to get after everything is tokenized might be INT ASGN INT, or STRING ASGN INT, etc. So, in this case, I would use the former solution, as opposed to the latter.
My question would be: what kind of logical device do I use to help my compiler know which solution to use?
NOTE: I didn't think it necessary to post source code to describe my conundrum, but I will if anyone could use it effectively as a reference to help me understand their input on this topic.
Thank you.
The usual way is to have a yacc/bison rule like:
expr: ID { $$ = lookupId($1); }
where the lookupId function looks up a symbol in the symbol table and returns its type and value (or type and storage location if you're writing a compiler rather than a strict interpreter). Then, your other expr rules don't need to care whether their operands come from constants or symbols or other expressions:
expr: expr '+' expr { $$ = DoAddition($1, $3); }
The function DoAddition takes the types and values (or locations) for its two operands and either adds them, producing a result, or produces code to do the addition at run time.
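The answer above is about the bison actions themselves, but the shape of lookupId and DoAddition may be easier to see in a small self-contained sketch. The following is only an illustration (written in Go, with invented names and a minimal tagged-value type), not something specific to yacc/bison:

package main

import "fmt"

// Value carries both a runtime type tag and the data; this is what the ID and
// expr rules pass around instead of raw token types like INT or DOUBLE.
type Kind int

const (
    IntKind Kind = iota
    DoubleKind
)

type Value struct {
    Kind Kind
    I    int64
    D    float64
}

// symbols plays the role of the symbol table consulted by lookupId.
var symbols = map[string]Value{
    "B": {Kind: DoubleKind, D: 51.2},
    "C": {Kind: IntKind, I: 3},
}

// lookupId is what the `expr: ID { $$ = lookupId($1); }` action would call.
func lookupId(name string) Value {
    v, ok := symbols[name]
    if !ok {
        panic("undeclared identifier: " + name)
    }
    return v
}

// doAddition decides the result type from the operand tags, which is why the
// grammar rules no longer care whether operands came from constants or identifiers.
func doAddition(a, b Value) Value {
    if a.Kind == DoubleKind || b.Kind == DoubleKind {
        return Value{Kind: DoubleKind, D: toFloat(a) + toFloat(b)}
    }
    return Value{Kind: IntKind, I: a.I + b.I}
}

func toFloat(v Value) float64 {
    if v.Kind == DoubleKind {
        return v.D
    }
    return float64(v.I)
}

func main() {
    fmt.Printf("%+v\n", doAddition(lookupId("B"), lookupId("C"))) // B + C -> DOUBLE 54.2
}

In a bison grammar, a struct like Value would simply be the %union/YYSTYPE payload, so every expr carries its type tag along with its data.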
If possible, redesign your language so that the situation is unambiguous. This is why even JavaScript has var.
Otherwise you're going to need to disambiguate via semantic rules, for example that the first use of an identifier is its declaration. I don't see what the problem is with your case (2): just generate the appropriate code. If B and C haven't been used yet, a value-reading use like this should be illegal, but that involves you in control flow analysis if taken to the Nth degree of accuracy, so you might prefer to assume initial values of zero.
In any case you can see that it's fundamentally a language design problem rather than a coding problem.
I'm trying to make a program in SML that will read in a series/list/sequence of numbers from the user, process the numbers, and output the result. I don't know how many numbers the user will input. The program can either read in all the numbers and output the results all together or read and output one at a time. I don't care whether the input is in a separate file or manually input at a console.
What do I need to do to be able to read input?
fun fact x = if x<2 then 1 else x*fact(x-1);
let val keepgoing : bool ref = ref true in
    while !keepgoing do
        let val num = valOf (TextIO.inputLine TextIO.stdIn) in
            print (Int.toString (fact (valOf (Int.fromString num))));
            keepgoing := (null (explode num))
        end
end;
Sorry about the convoluted conversions. If you also know an easier way to read in integers, I'd appreciate that, too.
Your logic is just flawed here. You want keepgoing := not (null (explode num)). Right? It works fine for me with that change. You need to implement removal of the final newline (so null explode does what you want) and parsing a line with more than one number, but you basically have the right idea.
Interesting challenge; my client enters some product information in a SQL database. The product is a painting of a famous old Russian composer called Rachmaninoff. So that name is in the description field. Now, only a few of their customers searching for products know exactly how to spell this name, but most of the time it's misspelled. Besides misspelling there are also a lot of international customers who just write this name completely different like, Rachmaninow, Rahmaninov, Рахманінаў.
If I put any of these misspellings or translations into Google, it (almost) always knows how to correct them and redirects me straight to the right page.
Does anyone know what my possibilities are to get some of this magic into my product search? Are there some APIs I can use? Some super free-text option that I don't know of? Or ...
We solved a similar problem with quite some success: searching for people (German names) by a name given over the phone.
E.g.: the very common German last names "Schmidt", "Schmitt", "Schmied", "Schmid", "Schmit" and "Schmiedt" will be all but impossible to tell apart when given by voice. Combine this with a first name of "Sylvia" or "Silvia" or "Sylvya", and a caller saying "Hi, I'm Sylvia Schmidt, I have forgotten my customer number" has no chance of being found quickly.
Our solution was to put up a list of synophones, e.g. (in pseudo-code, for German):
{consonant}+ := {consonant}
ie := i
ii := i
dt* := t
y|j := i
{vowel}v := {vowel}f
etc.; you get the drift. We then stored the synophone-translated strings alongside the original strings to make searching possible. This works really well.
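To make the pseudo-code concrete, here is one possible way to apply a few of those rules before storing and searching. This sketch is in Go purely for illustration; the rule order is my own choice, and the final "d -> t" rule is an extra one added in the same spirit so that the Schmidt variants collapse to a single key:

package main

import (
    "fmt"
    "regexp"
    "strings"
)

// collapseConsonants implements {consonant}+ := {consonant} by dropping
// immediately repeated consonants.
func collapseConsonants(s string) string {
    var b strings.Builder
    var prev rune = -1
    for _, r := range s {
        if r != prev || strings.ContainsRune("aeiouäöü", r) {
            b.WriteRune(r)
        }
        prev = r
    }
    return b.String()
}

// The remaining rules, roughly as listed above (plus a final d -> t).
var rules = []struct {
    re  *regexp.Regexp
    rep string
}{
    {regexp.MustCompile("ie|ii"), "i"},
    {regexp.MustCompile("dt"), "t"},
    {regexp.MustCompile("[yj]"), "i"},
    {regexp.MustCompile("([aeiou])v"), "${1}f"},
    {regexp.MustCompile("d$"), "t"},
}

func synophone(name string) string {
    s := collapseConsonants(strings.ToLower(name))
    for _, r := range rules {
        s = r.re.ReplaceAllString(s, r.rep)
    }
    return s
}

func main() {
    names := []string{"Schmidt", "Schmitt", "Schmied", "Schmid", "Schmit", "Schmiedt", "Sylvia", "Silvia", "Sylvya"}
    for _, n := range names {
        fmt.Printf("%-10s -> %s\n", n, synophone(n)) // all Schmidt variants -> schmit, all Sylvia variants -> silvia
    }
}

You would store synophone(name) in an extra column next to the original name, run the same function over the search input, and compare keys instead of raw strings.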
I understand that MySQL has the Soundex() function for English strings. I would expect MSSQL to have something similar.