Generating Random String of Numbers and Letters Using Go's "testing/quick" Package - testing

I've been breaking my head over this for a few days now and can't seem to be able to figure it out. Perhaps it's glaringly obvious, but I don't seem to be able to spot it. I've read up on all the basics of unicode, UTF-8, UTF-16, normalisation, etc, but to no avail. Hopefully somebody's able to help me out here...
I'm using Go's Value function from the testing/quick package to generate random values for the fields in my data structs, in order to implement the Generator interface for the structs in question. Specifically, given a Metadata struct, I've defined the implementation as follows:
func (m *Metadata) Generate(r *rand.Rand, size int) (value reflect.Value) {
value = reflect.ValueOf(m).Elem()
for i := 0; i < value.NumField(); i++ {
if t, ok := quick.Value(value.Field(i).Type(), r); ok {
value.Field(i).Set(t)
}
}
return
}
Now, in doing so, I'll end up with both the receiver and the return value being set with random generated values of the appropriate type (strings, ints, etc. in the receiver and reflect.Value in the returned reflect.Value).
Now, the implementation for the Value function states that it will return something of type []rune converted to type string. As far as I know, this should allow me to then use the functions in the runes, unicode and norm packages to define a filter which filters out everything which is not part of 'Latin', 'Letter' or 'Number'. I defined the following filter which uses a transform to filter out letters which are not in those character rangetables (as defined in the unicode package):
func runefilter(in reflect.Value) (out reflect.Value) {
out = in // Make sure you return something
if in.Kind() == reflect.String {
instr := in.String()
t := transform.Chain(norm.NFD, runes.Remove(runes.NotIn(rangetable.Merge(unicode.Letter, unicode.Latin, unicode.Number))), norm.NFC)
outstr, _, _ := transform.String(t, instr)
out = reflect.ValueOf(outstr)
}
return
}
Now, I think I've tried just about anything, but I keep ending up with a series of strings which are far from the Latin range, e.g.:
𥗉똿穊
𢷽嚶
秓䝏小𪖹䮋
𪿝ท솲
𡉪䂾
ʋ𥅮ᦸ
堮𡹯憨𥗼𧵕ꥆ
𢝌𐑮𧍛併怃𥊇
鯮
𣏲𝐒
⓿ꐠ槹𬠂黟
𢼭踁퓺𪇖
俇𣄃𔘧
𢝶
𝖸쩈𤫐𢬿詢𬄙
𫱘𨆟𑊙
欓
So, can anybody explain what I'm overlooking here and how I could instead define a transformer which removes/replaces non-letter/number/latin characters so that I can use the Value function as intended (but with a smaller subset of 'random' characters)?
Thanks!

Confusingly, the Generator interface needs a method defined on the type itself, not on a pointer to the type. You want your method signature to look like
func (m Metadata) Generate(r *rand.Rand, size int) (value reflect.Value)
You can play with this here. Note: the most important thing to do in that playground is to switch the type of the generate function from m Metadata to m *Metadata and see that Hi Mom! never prints.
In addition, I think you would be better served using your own type and writing a generate method for that type using a list of all of the characters you want to use. For example:
type LatinString string
const latin = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
and then use the generator
func (l LatinString) Generate(rand *rand.Rand, size int) reflect.Value {
var buffer bytes.Buffer
for i := 0; i < size; i++ {
buffer.WriteString(string(latin[rand.Intn(len(latin))]))
}
s := LatinString(buffer.String())
return reflect.ValueOf(s)
}
playground
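To see the value-receiver generator in action, here is a minimal self-contained sketch that feeds it to quick.Check. The property function onlyLatin is a made-up name for illustration, and strings.Builder is swapped in for bytes.Buffer; everything else follows the snippet above.

```go
package main

import (
	"fmt"
	"math/rand"
	"reflect"
	"strings"
	"testing/quick"
)

// LatinString is a string restricted to ASCII letters and digits.
type LatinString string

const latin = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"

// Generate implements quick.Generator on the value type (not the pointer),
// which is what testing/quick looks for.
func (LatinString) Generate(r *rand.Rand, size int) reflect.Value {
	var b strings.Builder
	for i := 0; i < size; i++ {
		b.WriteByte(latin[r.Intn(len(latin))])
	}
	return reflect.ValueOf(LatinString(b.String()))
}

// onlyLatin is a hypothetical property: every rune of s comes from latin.
func onlyLatin(s LatinString) bool {
	for _, c := range s {
		if !strings.ContainsRune(latin, c) {
			return false
		}
	}
	return true
}

func main() {
	// quick.Check builds each LatinString argument by calling Generate.
	err := quick.Check(onlyLatin, nil)
	fmt.Println("property held:", err == nil)
}
```

If Generate were declared on *LatinString instead, quick would fall back to its built-in string generator and the property would most likely fail on the first non-Latin rune.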
Edit: also this library is pretty cool, thanks for showing it to me

The answer to my own question is, it seems, a combination of the answers provided in the comments by #nj_ and #jimb and the answer provided by #benjaminkadish.
In short, the answer boils down to:
"Not such a great idea as you thought it was", or "Bit of an ill-posed question"
"You were using the union of 'Letter', 'Latin' and 'Number' (Letter || Number || Latin), instead of the intersection of 'Latin' with the union of 'Letter' and 'Number' ((Letter || Number) && Latin)"
Now for the longer version...
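Before going on, the union-versus-intersection point from the short answer can be sketched with nothing but the standard unicode package. The helper names inUnion, inIntersection and keep are made up for this illustration; it does not use the transform chain from the question.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// inUnion mirrors the original filter: Letter || Number || Latin.
func inUnion(r rune) bool {
	return unicode.In(r, unicode.Letter, unicode.Number, unicode.Latin)
}

// inIntersection keeps only (Letter || Number) && Latin.
func inIntersection(r rune) bool {
	return unicode.In(r, unicode.Latin) && unicode.In(r, unicode.Letter, unicode.Number)
}

// keep drops every rune for which pred reports false.
func keep(s string, pred func(rune) bool) string {
	return strings.Map(func(r rune) rune {
		if pred(r) {
			return r
		}
		return -1 // a negative value tells strings.Map to drop the rune
	}, s)
}

func main() {
	in := "堮abc欓"
	fmt.Println(keep(in, inUnion))        // Han characters are letters, so they survive
	fmt.Println(keep(in, inIntersection)) // only the Latin-script letters remain
}
```

One thing to watch: as far as I can tell the ASCII digits belong to the Common script rather than Latin, so inIntersection drops them as well. If you want them, test for them separately.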
The idea behind me using the testing/quick package is that I wanted random data for (fuzzy) testing of my code. In the past, I've always written the code for doing things like that myself, again and again. This meant a lot of the same code across different projects. Now, I could of course have written my own package for it, but it turns out that, even better than that, there's actually a standard package which does just about exactly what I want.
Now, it turns out the package does exactly what I want very well. The codepoints in the strings which it generates are actually random and not just restricted to what we're accustomed to using in everyday life. Now, this is of course exactly the thing which you want in doing fuzzy testing in order to test the code with values outside the usual assumptions.
In practice, that means I'm running into two problems:
There's some limits on what I would consider reasonable input for a string. Meaning that, in testing the processing of a Name field or a URL field, I can reasonably assume there's not going to be a value like 'James Mc⌢' (let alone 'James Mc🙁') or 'www.🕸site.com', but just 'James McFrown' and 'www.website.com'. Hence, I can't expect a reasonable system to be able to support it. Of course, things shouldn't completely break down, but it also can't be expected to handle the former examples without any problems.
When I filter the generated string on values which one might consider reasonable, the chance of ending up with a valid string is very small. The set of possible characters in the set used by the testing/quick is just so large (0x10FFFF) and the set of reasonable characters so small, you end up with empty strings most of the time.
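To put a rough number on that second problem, the sketch below counts how many code points below 0x10FFFF pass a "reasonable" test. The definition of reasonable used here, (Letter || Number) && Latin, is an assumption taken from the short answer, and the estimate assumes each rune is drawn uniformly from that range.

```go
package main

import (
	"fmt"
	"unicode"
)

// reasonable is an assumed definition of an acceptable rune:
// (Letter || Number) && Latin.
func reasonable(r rune) bool {
	return unicode.In(r, unicode.Latin) && unicode.In(r, unicode.Letter, unicode.Number)
}

// countReasonable counts the code points below 0x10FFFF that pass reasonable.
func countReasonable() int {
	n := 0
	for r := rune(0); r < 0x10ffff; r++ {
		if reasonable(r) {
			n++
		}
	}
	return n
}

func main() {
	n := countReasonable()
	fmt.Printf("%d of %d code points (%.3f%%)\n", n, 0x10ffff, 100*float64(n)/0x10ffff)
}
```

Whatever the exact figure on your Go version, the fraction is tiny, which is why filtering a short random string usually leaves nothing behind.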
So, what do we need to take away from this?
So, whilst I hoped to use the standard testing/quick package to replace my often repeated code to generate random data for fuzzy testing, it does this so well that it provides data outside the range of what I would consider reasonable for the code to be able to handle. It seems that the choice, in the end, is to:
Either be able to actually handle all fuzzy options, meaning that if somebody's name is 'Arnold 💰💰' ('Arnold Moneybags'), it shouldn't go arse over end. Or...
Use custom/derived types with their own Generator. This means you're going to have to use the derived type instead of the basic type throughout the code. (Comparable to defining a string as wchar_t instead of char in C++ and working with those by default.). Or...
Don't use testing/quick for fuzzy testing, because as soon as you run into a generated string value, you can (and should) get a very random string.
As always, further comments are of course welcome, as it's quite possible I overlooked something.

Related

Creating 4 digit number with no repeating elements in Kotlin

Thanks to #RedBassett for this resource (Kotlin problem solving): https://kotlinlang.org/docs/tutorials/koans.html
I'm aware this question exists here:
Creating a 4 digit Random Number using java with no repetition in digits
but I'm new to Kotlin and would like to explore the direct Kotlin features.
So as the title suggests, I'm trying to find a Kotlin-specific way to neatly generate a 4 digit number (after that it's easy to make it adaptable for length x) without repeating digits.
This is my current working solution and would like to make it more Kotlin. Would be very grateful for some input.
fun createFourDigitNumber(): Int {
var fourDigitNumber = ""
val rangeList = {(0..9).random()}
while(fourDigitNumber.length < 4)
{
val num = rangeList().toString()
if (!fourDigitNumber.contains(num)) fourDigitNumber +=num
}
return fourDigitNumber.toInt()
}
So the range you define (0..9) is actually already a sequence of numbers. Instead of iterating and repeatedly generating a new random, you can just use a subset of that sequence. In fact, this is the accepted answer's solution to the question you linked. Here are some pointers if you want to implement it yourself to get the practice:
The first for loop in that solution is unnecessary in Kotlin because of the range. 0..9 does the same thing, you're on the right track there.
In Kotlin you can call .shuffled() directly on the range without needing to call Collections.shuffle() with an argument like they do.
You can avoid another loop if you create a string from the whole range and then return a substring.
If you want to look at my solution (with input from others in the comments), it is in a spoiler here:
fun getUniqueNumber(length: Int) = (0..9).shuffled().take(length).joinToString("")
(Note that this doesn't gracefully handle a length above 10, but that's up to you to figure out how to implement. It is up to you to use subList() and then toString(), or toString() and then substring(), the output should be the same.)

Go application making SQL Query using GROUP_CONCAT on FLOATS returns []uint8 instead of actual []float64

Have a problem using group_concat in a query made by my go application.
Any idea why a group_concat of FLOATS would look like a []uint8 on the Go side?
Can't seem to properly convert the suckers either.
It's definitely floats, I can see it in the raw query results, but when I do the same query in Go and try to .Scan the result, Go complains that it's a []uint8, not a []float64 (which it actually is). Attempts to convert to floats give me the wrong values (and way too many of them).
For example, at the database, I query and get 2 floats for the column in question, looks like this:
"5650.50, 5455.00"
On the go side however, go sees a []uint8 instead of []float64. Why does this happen? How does one workaround this to get the actual results?
My problem is that I have to use this SQL with the group_concat. Due to the nature of the database I am working with, this is the best way to get the information, and more importantly the query itself works great and returns the data the function needs, but now I can't read it out because of type issues. No stranger to those, but Go isn't cooperating with me today.
I'd be more than pleased to learn WHY go is doing it this way, and delighted to learn of a way to deal with it.
Example:
SELECT ID, getDistance(33.1543,-110.4353, Loc.Lat, Loc.Lng) as distance,
GROUP_CONCAT(values) FROM stuff INNER JOIN device on device.ID = stuff.ID WHERE (someConditionsETC) GROUP BY ID ORDER BY ID
The actual result, when interfacing with the actual database (not within my application), is
"5650.00, 5850.50"
It's clearly 2 floats.
The same result produces a slice of uint8 when queried from Go and trying to .Scan the result in. If I range through and print those values, I get way more than 2, and they are uint8 (bytes) that look like this:
53,55,56,48,46,48,48
Not sure how Go expects me to handle this.
Solution... stupid simple and not terribly obvious:
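crazyBytes := []uint8("5760.00,5750.50") // what .Scan hands back for the GROUP_CONCAT column
aString := string(crazyBytes)
strSlice := strings.Split(aString, ",") // string representation of our array (of floats)
var floatz []float64
for _, x := range strSlice {
    fmt.Printf("At last, Float: %s \r\n", x)
    // TrimSpace handles results like "5650.50, 5455.00" with a space after the comma
    f, err := strconv.ParseFloat(strings.TrimSpace(x), 64)
    if err != nil {
        fmt.Printf("Error: %s", err)
        continue // don't append a bogus zero for a value that failed to parse
    }
    floatz = append(floatz, f)
    fmt.Printf("as float: %s \r\n", strconv.FormatFloat(f, 'f', -1, 64))
}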
Yea sure, it's obvious NOW.
GROUP_CONCAT returns a string. So in Go you get a byte array of characters, not a float. The result you posted 53,55,56,48,46,48,48 translates into a string "5780.00" which does look like one of your values. So you need to either fix your SQL to return floats or use strings and strconv modules in Go to parse and convert your string into floats. I think the former approach is better, but it is up to you.
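If you stay with the string route, the conversion can be wrapped in a small helper. This is only a sketch: parseFloats is a made-up name, and it assumes the column uses a comma as separator (the GROUP_CONCAT default) with optional spaces around each value.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseFloats converts the raw bytes of a GROUP_CONCAT column
// (e.g. "5650.50, 5455.00") into a slice of float64.
func parseFloats(raw []byte) ([]float64, error) {
	if len(raw) == 0 {
		return nil, nil // NULL or empty column: nothing to parse
	}
	parts := strings.Split(string(raw), ",")
	out := make([]float64, 0, len(parts))
	for _, p := range parts {
		f, err := strconv.ParseFloat(strings.TrimSpace(p), 64)
		if err != nil {
			return nil, fmt.Errorf("parsing %q: %w", p, err)
		}
		out = append(out, f)
	}
	return out, nil
}

func main() {
	// The bytes from the question: 53,55,56,48,46,48,48 spell "5780.00".
	raw := []byte{53, 55, 56, 48, 46, 48, 48}
	fmt.Println(parseFloats(raw))
	fmt.Println(parseFloats([]byte("5650.50, 5455.00")))
}
```

In the real code you would .Scan the column into a []byte (or a string) and pass that to the helper.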

wxWidgets - wxGrid - reading/writing non string cell values

I have a wxGrid to edit an array of numerical data.
I was wondering what's the best way to get non-string data in and out of the cells without going through the string to numeric conversion all the time.
I've used SetCellEditor() to control the data entry.
currently I use this:
// numeric value into cell
str.clear();
str << val1;
m_grid4->SetCellValue(row, col, str);
..
// read value from back into variable
val = atoi(m_grid4->GetCellValue(row, col));
Apart from the fact that atoi() is a bit ugly and a template function with a stringstream would be better, is there a better way to get non-string values in and out of cells?
I was looking at the editors and renderers but can't figure it out.
If you worry about efficiency, you almost certainly should use a custom table class deriving from wxGridTableBase instead of using the default trivial wxGridStringTable implementation which stores everything as strings. Then, and much less importantly, if it makes sense in your case, you can use wxGridCellNumberRenderer which will call your table GetValueAsLong() method instead of GetValue() (which returns a string).
Both of those are demonstrated in wxGrid sample, notably look at BugsGridTable there.
Good luck!

How to tell if an identifier is being assigned or referenced? (FLEX/BISON)

So, I'm writing a language using flex/bison and I'm having difficulty with implementing identifiers, specifically when it comes to knowing when you're looking at an assignment or a reference,
for example:
1) A = 1+2
2) B + C (where B and C have already been assigned values)
Example one I can work out by returning an ID token from flex to bison, and just following a grammar that recognizes that 1+2 is an integer expression, putting A into the symbol table, and setting its value.
examples two and three are more difficult for me because: after going through my lexer, what's being returned in ex.2 to bison is "ID PLUS ID" -> I have a grammar that recognizes arithmetic expressions for numerical values, like INT PLUS INT (which would produce an INT), or DOUBLE MINUS INT (which would produce a DOUBLE). if I have "ID PLUS ID", how do I know what type the return value is?
Here's the best idea that I've come up with so far: When tokenizing, every time an ID comes up, I search for its value and type in the symbol table and switch out the ID token with its respective information; for example: while tokenizing, I come across B, which has a regex that matches it as being an ID. I look in my symbol table and see that it has a value of 51.2 and is a DOUBLE. So instead of returning ID, with a value of B to bison, I'm returning DOUBLE with a value of 51.2
I have two different solutions that contradict each other. Here's why: if I want to assign a value to an ID, I would say to my compiler A = 5. In this situation, if I'm using my previously described solution, What I'm going to get after everything is tokenized might be, INT ASGN INT, or STRING ASGN INT, etc... So, in this case, I would use the former solution, as opposed to the latter.
My question would be: what kind of logical device do I use to help my compiler know which solution to use?
NOTE: I didn't think it necessary to post source code to describe my conundrum, but I will if anyone could use it effectively as a reference to help me understand their input on this topic.
Thank you.
The usual way is to have a yacc/bison rule like:
expr: ID { $$ = lookupId($1); }
where the lookupId function looks up a symbol in the symbol table and returns its type and value (or type and storage location if you're writing a compiler rather than a strict interpreter). Then, your other expr rules don't need to care whether their operands come from constants or symbols or other expressions:
expr: expr '+' expr { $$ = DoAddition($1, $3); }
The function DoAddition takes the types and values (or locations) for its two operands and either adds them, producing a result, or produces code to do the addition at run time.
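The semantic actions themselves would be C inside the .y file, but the idea behind lookupId and DoAddition can be sketched on its own. The version below is in Go purely as an illustration; the Value struct, the two-type Kind enum and the map-based symbol table are assumptions, not anything bison gives you.

```go
package main

import "fmt"

// Kind tags the run-time type of a value (hypothetical: only two types here).
type Kind int

const (
	Int Kind = iota
	Double
)

// Value is what an expr rule would carry in $$: a type plus its payload.
type Value struct {
	Kind Kind
	I    int64
	D    float64
}

// symbols stands in for the symbol table filled by assignment rules.
var symbols = map[string]Value{}

// lookupID resolves an identifier in the parser action, not in the lexer.
func lookupID(name string) (Value, error) {
	v, ok := symbols[name]
	if !ok {
		return Value{}, fmt.Errorf("undefined identifier %q", name)
	}
	return v, nil
}

// asDouble widens an Int value so mixed additions can be done in float64.
func asDouble(v Value) float64 {
	if v.Kind == Double {
		return v.D
	}
	return float64(v.I)
}

// doAddition: INT + INT gives INT, anything involving a DOUBLE gives DOUBLE.
func doAddition(a, b Value) Value {
	if a.Kind == Int && b.Kind == Int {
		return Value{Kind: Int, I: a.I + b.I}
	}
	return Value{Kind: Double, D: asDouble(a) + asDouble(b)}
}

func main() {
	symbols["B"] = Value{Kind: Double, D: 51.2}
	symbols["C"] = Value{Kind: Int, I: 3}
	b, _ := lookupID("B")
	c, _ := lookupID("C")
	fmt.Printf("%+v\n", doAddition(b, c))
}
```

The lexer keeps returning a plain ID token in both situations. Only the rule that reduces ID to expr does the lookup, so the left-hand side of an assignment is never replaced by its value.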
If possible redesign your language so that the situation is unambiguous. This is why even Javascript has var.
Otherwise you're going to need to disambiguate via semantic rules, for example that the first use of an identifier is its declaration. I don't see what the problem is with your case (2): just generate the appropriate code. If B and C haven't been used yet, a value-reading use like this should be illegal, but that involves you in control flow analysis if taken to the Nth degree of accuracy, so you might prefer to assume initial values of zero.
In any case you can see that it's fundamentally a language design problem rather than a coding problem.

Detecting a sub-string

I have a lot of data that I need to sort through and one of the fields contains both the make/model of the vehicle as well as the reg, sometimes separated by a dash (-) sometimes however it is not. Here is an example of such a string:
VehicleModel - TU69YUP
VehicleModel - TU69 YUP
VehicleModel TU69YUP
VehicleModel TU69 YUP
There are also some other variations but they are the main ones I have encountered. Is there a way that I can reliably go through all of the data and separate the vehicle reg from the model?
The data is currently contained within a Paradox database which I have no problem going through. I do not have a list of all of the vehicle models and names that are contained within the database, likewise, I also do not have a list of the licence plates.
The project is written in Delphi/SQL so I would prefer to stick with either one of these if at all possible.
Trouble ahead
If that field was originally entered by a user in the form that you're now seeing, then we can assume there was no validation: the original program would simply store whatever the user entered. If that's the case, you can't get 100% accuracy: human beings will always make mistakes, intentionally or unintentionally. Expect these kinds of human errors:
Missing fields (ie: registration only, no vehicle information - or the other way around)
Meaningless duplication of words (example: "Ford Ford K - TU 69 YUP")
Missing letters, duplicated letters, extra garbage letters. Example: "For K - T69YUP"
Wrong order of fields
Other small errors you can't even dream of.
Plain garbage that not even a human would make sense of.
You might have guessed I'm a bit pessimistic when dealing with human-entered data straight into text fields. I had the distinct misfortune to deal with a database where all data was text and there was no validation: Can you guess the kind of nonsense people typed in unvalidated date fields that allowed free user input?
The plan
Things aren't as dark as they seem, you can probably "fix" lots of things. The trick here is making sure you only fix data that's unambiguous and let a human sift through the rest of the stuff. The easiest way to do that is to do something like this:
Look at the data that hasn't been automatically fixed yet. Figure out a rule that unambiguously applies to lots of records.
Apply the unambiguous rule.
Repeat until only a few records are left. Those should be fixed by hand, because they resisted all automatic methods that were applied.
The implementation
I strongly recommend using regular expressions for all the tests, because you'll surely end up implementing lots of different tests, and regular expressions can easily "express" the slight variations in search text. For example the following reg-ex can parse all 4 of your examples and give the correct result:
(.*?)(\ {1,3}-\ {1,3})?(\b[A-Z]{2}\ {0,2}[0-9]{2}\ {0,3}[A-Z]{3}\b)
If you've never worked with regular expressions before, that single expression looks unintelligible, but it's in fact very simple. This is not a reg-ex question so I'm not going into any details. I'd rather explain how I've come up with the idea.
First of all, if the text includes vehicle registration numbers, those numbers will be in a very strict format: they'd be easy to match. Given your example I assume all registration numbers are of the form:
LLNNLLL
where "L" is a letter and "N" is a number. My regex is rigid in its interpretation of it: it wants exactly two uppercase letters, followed by a small number of spaces (or no space), followed by exactly two digits, followed by a small number of spaces (or no space), finally followed by exactly 3 uppercase letters. The part of the regex that deals with that is:
[A-Z]{2}\ {0,2}[0-9]{2}\ {0,3}[A-Z]{3}
The rest of the regex makes sure the registration number isn't found embedded in other words, groups the text into capture groups and creates a "lazy capture group" for the VehicleModel.
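To check the expression against the four sample strings without a Delphi regex engine at hand, here is a quick sketch in Go. It is only a test harness, not the Delphi implementation: splitMakeReg is a made-up name, the escaped spaces are written as literal spaces, and the trimming of the model and the removal of spaces inside the registration are additions on top of the regex.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// regRe is the expression from the text, with literal spaces instead of "\ ".
var regRe = regexp.MustCompile(`(.*?)( {1,3}- {1,3})?(\b[A-Z]{2} {0,2}[0-9]{2} {0,3}[A-Z]{3}\b)`)

// splitMakeReg returns the make/model and the registration with its inner
// spaces removed. ok is false when no LLNNLLL registration is found.
func splitMakeReg(s string) (model, reg string, ok bool) {
	m := regRe.FindStringSubmatch(s)
	if m == nil {
		return "", "", false
	}
	// Group 1 can end in a space when there is no dash, hence the TrimSpace.
	return strings.TrimSpace(m[1]), strings.ReplaceAll(m[3], " ", ""), true
}

func main() {
	for _, s := range []string{
		"VehicleModel - TU69YUP",
		"VehicleModel - TU69 YUP",
		"VehicleModel TU69YUP",
		"VehicleModel TU69 YUP",
	} {
		fmt.Println(splitMakeReg(s))
	}
}
```

Records for which ok comes back false are the ones to hand to the next, looser rule or to a human.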
If I were to implement this myself, I'd probably write a "master" function and a number of simpler "case" functions, each function dealing with one kind of variation in user input. Example:
// This function does a validation of the extracted data. For example it validates the
// Registration number, using other, more precise criteria. The parameters are VAR so the
// function may normalize the results.
function ResultsAreValid(var Make, Registration:string): Boolean;
begin
Result := True; // Only you know what your data looks like and how it can be validated.
end;
// This is a case function that deals with a very rigid interpretation of user data
function VeryStrictInterpretation(const Text:string; out Make, Registration: string): Boolean;
var TestMake, TestReg: string;
// regex engine ...
begin
Result := False;
if (your condition) then
if ResultsAreValid(TestMake, TestReg) then
begin
Make := TestMake;
Registration := TestReg;
Result := True;
end;
end;
// Master function calling many different implementations that each deal with all sorts
// of variations of input. The most strict function should be first:
function MasterTest(const Text:string; out Make, Registration: string): Boolean;
begin
Result := VeryStrictInterpretation(Text, Make, Registration);
if not Result then Result := SomeOtherImplementation(Text, Make, Registration);
if not Result then Result := ThirdInterpretation(Text, Make, Registration);
end;
The idea here is to try to make multiple SIMPLE procedures, that each understands one kind of input in an unambiguous way; And make sure each step doesn't return false positives! And finally don't forget, a human should deal with the last few cases, so don't aim for a fix-it-all solution.
Well assuming that they are of the same format. Word[space]Word
Then you can iterate through them all, and if you encounter a whitespace without a dash, insert a dash. Then split as normal.
Here is a code example.
It will check for the - and also remove possible spaces in the license number.
Note : (as commented by Ken White), if the vehicle contains a space, this will have to be handled as well.
type
EMySplitError = class(Exception);
procedure SplitVehicleAndLicense( s : String; var vehicle,license : String);
var
p : Integer;
begin
vehicle := '';
license := '';
p := Pos('-',s);
if (p = 0) then
begin // No split delimiter
p := Pos(' ',s);
if (p > 0) then
begin
vehicle := Trim(Copy(s,1,p-1));
license := Trim(Copy(s,p+1,Length(s)));
end
else
Raise EMySplitError.CreateFmt('Not a valid vehicle/license name:%s',[s]);
end
else
begin
vehicle := Trim(Copy( s,1,p-1));
license := Trim(Copy( s,p+1,Length(s)));
end;
// Trim spaces in license
repeat
p := Pos(' ',license);
if (p <> 0) then Delete(license,p,1);
until (p = 0);
end;