Efficient Way In SQL Server To Remove All "InvalidXMLCharacters" From a NVARCHAR

As part of this answer, I determined that one of the things that can break an OLAP cube is feeding it values (in the dimension names/values/etc.) that contain characters considered "InvalidXMLCharacters". Now I would like to filter out these characters so that they never end up in the OLAP cubes I'm building in SQL Server. Often I find myself importing this input data from one table into another, something like the following:
INSERT INTO [dbo].[DestinationTableThatWillBeReferencedInMyOLAPCube]
SELECT TextDataColumn1, TextDataColumn2, etc...
FROM [dbo].[SourceTableContainingColumnsWithValuesWithInvalidXMLCharacters]
WHERE XYZ...
Is there an efficient way to remove all "InvalidXMLCharacters" from my Columns in this query?
The obvious solution that comes to mind is some sort of regex, though based on the previously linked posts that could get quite complex, and I'm not sure about its performance implications.
Another idea I've had is to convert the columns to the XML data type, but that errors if they contain invalid characters, which isn't much help for removing them...
I've looked around and don't see many other cases where developers are trying to do exactly this. Has this been tackled another way in a post I haven't found?

.NET CLR integration in SQL Server could be helpful.
Here is a small C# example you can use as a starting point. The most important line is the XmlConvert.IsXmlChar(ch) call, which is used to drop invalid XML characters.
void Main()
{
    // https://www.w3.org/TR/xml/#charsets
    // ===================================
    // From the XML spec, the valid chars are:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    // i.e. any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
    string content = "fafa\v\f\0";
    Console.WriteLine(IsValidXmlString(content)); // False

    content = RemoveInvalidXmlChars(content).Dump("Clean string"); // .Dump() is LINQPad-specific
    Console.WriteLine(IsValidXmlString(content)); // True
}

// Define other methods and classes here
// (needs the System, System.Linq and System.Xml namespaces)
static string RemoveInvalidXmlChars(string text)
{
    // Keep only characters that the XML spec allows
    return new string(text.Where(ch => XmlConvert.IsXmlChar(ch)).ToArray());
}

static bool IsValidXmlString(string text)
{
    bool rc = true;
    try
    {
        // Throws if the string contains any invalid XML character
        XmlConvert.VerifyXmlChars(text);
    }
    catch
    {
        rc = false;
    }
    return rc;
}
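To call this cleanup from the INSERT ... SELECT in the question, one option is to expose it as a SQL CLR scalar function. Here is a minimal, hedged sketch (it assumes CLR integration is enabled on the instance; the class and function names are made up):

using System.Data.SqlTypes;
using System.Linq;
using System.Xml;
using Microsoft.SqlServer.Server;

public static class XmlCleanup
{
    // Exposed to SQL Server as a scalar UDF, e.g. dbo.RemoveInvalidXmlChars(@value)
    [SqlFunction(IsDeterministic = true, IsPrecise = true)]
    public static SqlString RemoveInvalidXmlChars(SqlString input)
    {
        if (input.IsNull)
            return SqlString.Null;

        // Keep only characters that are valid per the XML 1.0 spec
        return new SqlString(new string(input.Value.Where(XmlConvert.IsXmlChar).ToArray()));
    }
}

After building the assembly you would register it with CREATE ASSEMBLY and CREATE FUNCTION ... EXTERNAL NAME, and then wrap each text column in the SELECT with the function.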

Related

Finding best delimiter by size of resulting array after split Kotlin

I am trying to determine the best delimiter for my CSV file. I've seen answers that find the delimiter producing the largest split of the header row. The standard method would look something like this:
val supportedDelimiters: Array<Char> = arrayOf(',', ';', '|', '\t')

fun determineDelimiter(headerRow: String): Char {
    var headerLength = 0
    var chosenDelimiter = ' '
    supportedDelimiters.forEach {
        if (headerRow.split(it).size > headerLength) {
            headerLength = headerRow.split(it).size
            chosenDelimiter = it
        }
    }
    return chosenDelimiter
}
I've been trying to do it with some built-in Kotlin collection methods like filter or maxOf, but to no avail (the code below does not work).
fun determineDelimiter(headerRow: String): Char {
    return supportedDelimiters.filter({a,b -> headerRow.split(a).size < headerRow.split(b)})
}
Is there any way I could do it without forEach?
Edit: The header row could look something like this:
val headerRow = "I;am;delimited;with;'semi,colon'"
I put the '' around an entry that could contain another potential delimiter.
You're mostly there, but this seems simpler than you think!
Here's one answer:
fun determineDelimiter(headerRow: String)
    = supportedDelimiters.maxByOrNull { headerRow.split(it).size } ?: ' '
maxByOrNull() does all the hard work: you just tell it the number of headers that a delimiter would give, and it searches through each delimiter to find which one gives the largest number.
It returns null if the list is empty, so the method above returns a space character, like your standard method. (In this case we know that the list isn't empty, so you could replace the ?: ' ' with !! if you wanted that impossible case to give an error, or you could drop it entirely if you wanted it to give a null which would be handled elsewhere.)
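For instance, with the header row from the question (and the supportedDelimiters array defined there), it picks the semicolon:

fun main() {
    val headerRow = "I;am;delimited;with;'semi,colon'"
    println(determineDelimiter(headerRow)) // ';' splits into 5 parts, ',' into only 2
}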
As mentioned in a comment, there's no foolproof way to guess the CSV delimiter in general, and so you should be prepared for it to pick the wrong delimiter occasionally. For example, if the intended delimiter was a semicolon but several headers included commas, it could wrongly pick the comma. Without knowing any more about the data, there's no way around that.
With the code as it stands, there could be multiple delimiters which give the same number of headers; it would simply pick the first. You might want to give an error in that case, and require that there's a unique best delimiter. That would give you a little more confidence that you've picked the right one — though there's still no guarantee. (That's not so easy to code, though…)
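For what it's worth, here is one rough sketch of such a check (the function name is made up; it assumes the supportedDelimiters array from the question):

fun determineDelimiterStrict(headerRow: String): Char {
    val counts = supportedDelimiters
        .map { it to headerRow.split(it).size }
        .sortedByDescending { it.second }
    // Require a strict winner; otherwise we can't tell which delimiter is right
    require(counts.size < 2 || counts[0].second > counts[1].second) {
        "No unique best delimiter for: $headerRow"
    }
    return counts.first().first
}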
Just like gidds said in the comment above, I would advise against choosing the delimiter based on how many times each delimiter appears. You would get the wrong answer for a header row like this:
Type of shoe, regardless of colour, even if black;Size of shoe, regardless of shape
In the above header row, the delimiter is obviously ; but your method would erroneously pick ,.
Another problem is that a header column may itself contain a delimiter, if it is enclosed in quotes. Your method doesn't take any notice of possible quoted columns. For this reason, I would recommend that you give up trying to parse CSV files yourself, and instead use one of the many available Open Source CSV parsers.
Nevertheless, if you still want to know how to pick the delimiter based on its frequency, there are a few readability improvements you can make.
First, note that Kotlin strings are iterable; therefore you don't have to use a List of Char. Use a String instead.
Secondly, all you're doing is counting the number of times a character appears in the string, so there's no need to break the string up into pieces just to do that. Instead, count the number of characters directly.
Third, instead of finding the maximum value by hand, take advantage of what the standard library already offers you.
const val supportedDelimiters = ",;|\t"

fun determineDelimiter(headerRow: String): Char =
    supportedDelimiters.maxBy { delimiter -> headerRow.count { it == delimiter } }

fun main() {
    val headerRow = "one,two,three;four,five|six|seven"
    val chosenDelimiter = determineDelimiter(headerRow)
    println(chosenDelimiter) // prints ',' as expected
}

Generating Random String of Numbers and Letters Using Go's "testing/quick" Package

I've been breaking my head over this for a few days now and can't seem to be able to figure it out. Perhaps it's glaringly obvious, but I don't seem to be able to spot it. I've read up on all the basics of unicode, UTF-8, UTF-16, normalisation, etc, but to no avail. Hopefully somebody's able to help me out here...
I'm using Go's Value function from the testing/quick package to generate random values for the fields in my data structs, in order to implement the Generator interface for the structs in question. Specifically, given a Metadata struct, I've defined the implementation as follows:
func (m *Metadata) Generate(r *rand.Rand, size int) (value reflect.Value) {
    value = reflect.ValueOf(m).Elem()
    for i := 0; i < value.NumField(); i++ {
        if t, ok := quick.Value(value.Field(i).Type(), r); ok {
            value.Field(i).Set(t)
        }
    }
    return
}
Now, in doing so, I end up with both the receiver and the return value being set to randomly generated values of the appropriate type (strings, ints, etc. in the receiver, wrapped as a reflect.Value in the return value).
Now, the implementation for the Value function states that it will return something of type []rune converted to type string. As far as I know, this should allow me to then use the functions in the runes, unicode and norm packages to define a filter which filters out everything which is not part of 'Latin', 'Letter' or 'Number'. I defined the following filter which uses a transform to filter out letters which are not in those character rangetables (as defined in the unicode package):
func runefilter(in reflect.Value) (out reflect.Value) {
    out = in // Make sure you return something
    if in.Kind() == reflect.String {
        instr := in.String()
        t := transform.Chain(norm.NFD, runes.Remove(runes.NotIn(rangetable.Merge(unicode.Letter, unicode.Latin, unicode.Number))), norm.NFC)
        outstr, _, _ := transform.String(t, instr)
        out = reflect.ValueOf(outstr)
    }
    return
}
Now, I think I've tried just about anything, but I keep ending up with a series of strings which are far from the Latin range, e.g.:
𥗉똿穊
𢷽嚶
秓䝏小𪖹䮋
𪿝ท솲
𡉪䂾
ʋ𥅮ᦸ
堮𡹯憨𥗼𧵕ꥆ
𢝌𐑮𧍛併怃𥊇
鯮
𣏲𝐒
⓿ꐠ槹𬠂黟
𢼭踁퓺𪇖
俇𣄃𔘧
𢝶
𝖸쩈𤫐𢬿詢𬄙
𫱘𨆟𑊙
欓
So, can anybody explain what I'm overlooking here and how I could instead define a transformer which removes/replaces non-letter/number/latin characters so that I can use the Value function as intended (but with a smaller subset of 'random' characters)?
Thanks!
Confusingly, the Generate interface needs the method defined on the value type, not on a pointer to the type. You want your method signature to look like:
func (m Metadata) Generate(r *rand.Rand, size int) (value reflect.Value)
You can play with this here. Note: the most important thing to do in that playground is to switch the type of the generate function from m Metadata to m *Metadata and see that Hi Mom! never prints.
In addition, I think you would be better served using your own type and writing a generate method for that type using a list of all of the characters you want to use. For example:
type LatinString string
const latin = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
and then use the generator
func (l LatinString) Generate(rand *rand.Rand, size int) reflect.Value {
    var buffer bytes.Buffer
    for i := 0; i < size; i++ {
        buffer.WriteString(string(latin[rand.Intn(len(latin))]))
    }
    s := LatinString(buffer.String())
    return reflect.ValueOf(s)
}
playground
Edit: also this library is pretty cool, thanks for showing it to me
The answer to my own question is, it seems, a combination of the answers provided in the comments by @nj_ and @jimb and the answer provided by @benjaminkadish.
In short, the answer boils down to:
"Not such a great idea as you thought it was", or "Bit of an ill-posed question"
"You were using the union of 'Letter', 'Latin' and 'Number' (Letter || Number || Latin), instead of the intersection of 'Latin' with the union of 'Letter' and 'Number' ((Letter || Number) && Latin))
Now for the longer version...
The idea behind using the testing/quick package is that I wanted random data for (fuzzy) testing of my code. In the past, I've always written the code for doing things like that myself, again and again, which meant a lot of the same code across different projects. Now, I could of course have written my own package for it, but it turns out that, even better than that, there's actually a standard package which does just about exactly what I want.
And the package does this very well: the codepoints in the strings it generates are genuinely random and not just restricted to what we're accustomed to using in everyday life. That is, of course, exactly what you want in fuzzy testing, in order to exercise the code with values outside the usual assumptions.
In practice, that means I'm running into two problems:
There are some limits on what I would consider reasonable input for a string. In testing the processing of a Name field or a URL field, I can reasonably assume the value is not going to be 'James Mc⌢' (let alone 'James Mc🙁') or 'www.🕸site.com', but rather 'James McFrown' and 'www.website.com'. Hence, I can't expect a reasonable system to support such values; of course, things shouldn't completely break down, but the system also can't be expected to handle the former examples without any problems.
When I filter the generated strings down to values one might consider reasonable, the chance of ending up with a non-trivial string is very small. The set of possible characters used by testing/quick is just so large (up to 0x10FFFF) and the set of reasonable characters so small that you end up with empty strings most of the time.
So, what do we need to take away from this?
So, whilst I hoped to use the standard testing/quick package to replace my often repeated code to generate random data for fuzzy testing, it does this so well that it provides data outside the range of what I would consider reasonable for the code to be able to handle. It seems that the choice, in the end, is to:
Either be able to actually handle all fuzzy options, meaning that if somebody's name is 'Arnold 💰💰' ('Arnold Moneybags'), it shouldn't go arse over end. Or...
Use custom/derived types with their own Generator. This means you're going to have to use the derived type instead of the basic type throughout the code. (Comparable to defining a string as wchar_t instead of char in C++ and working with those by default.). Or...
Don't use testing/quick for this kind of fuzzy testing, because any generated string value can (and should) be a very random string.
As always, further comments are of course welcome, as it's quite possible I overlooked something.

JSON issue with SQL line returns

I have an issue when I try to parse my JSON. I create my JSON "by hand" in PHP, like this:
$outp ='{"records":['.$outp.']}';
I create it so I can take fields from my database and show them on the page. The thing is, in my database I have a field "description" where people can give a description of something, and some people put line breaks in it, for example:
Interphone
Equipe:
Canape-lit
Autre:
Local
And when I try to parse my JSON there is an error because of these line breaks: "SyntaxError: Unexpected token".
Here's an example of my JSON :
{"records":[{"Parking":"Aucun","Description":"Interphone
Equipé :
Canapé-lit
","Chauffage":"Fioul"}]}
Can someone help me, please?
You've really dug yourself into a very bad hole here.
The problem
The problem you're running into is that raw newline characters (line feed and carriage return) are not valid inside a JSON string. They must be escaped as \n and \r. You can see the full JSON standard here.
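For example, the Description value from the question would have to be stored with those characters escaped, something like:
{"records":[{"Parking":"Aucun","Description":"Interphone\nEquipé :\nCanapé-lit\n","Chauffage":"Fioul"}]}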
You need to do two things.
Fix your code
In spite of the fact that the JSON standard is comparatively simple, you should not create your JSON by hand. You already know why. You have to handle several edge cases and the like. Your users could enter anything on the page, and you need to make sure that it gets properly encoded no matter what.
You need to use a JSON serialization tool. json_encode is built in as of PHP 5.2. If you can't use it for any reason, find an existing, widely used (and therefore heavily tested) third-party library with a JSON serializer.
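As a rough sketch (the column names come from the question's example; the row fetching is schematic), building the output with json_encode rather than string concatenation would look something like this:

$records = array();
foreach ($dbRows as $row) { // however you fetch your rows
    $records[] = array(
        'Parking'     => $row['Parking'],
        'Description' => $row['Description'], // newlines and quotes get escaped for you
        'Chauffage'   => $row['Chauffage'],
    );
}
$outp = json_encode(array('records' => $records));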
If you're asking, "Why can't I create my own serializer?", you could, in theory. Realistically, there is no point. Yours won't be better than existing ones. It will be much more likely to have bugs and to perform worse than something a lot of people have used in production. It will also take much longer to create and test than using an existing one.
If you need this data in code after you pull it back out of the database, then you need a JSON deserializer. json_decode should also be fine, but again, if you can't use it, look for a widely used third party library.
Fix your data
If you haven't hit production yet, you have really dodged a bullet here, and you can skip this whole section. If you have gone to production and you have data from users, you've got a major problem.
Even after you fix your code, you still have bad data in your production database that won't parse correctly. You have to do something to make this data usable. Unfortunately, it is impossible to automatically recover the original data for every possible case. This is because users might have entered the characters/substrings you added to the data to turn it into "JSON"; for example, they might have entered a comma separated list of quoted words: "dog","cat","pig", and "cow". That is an intractable problem, since you know for a fact you didn't properly serialize all your incoming input. There's no way to tell the difference between text your code generated and text the user entered. You're going to have to settle for a best effort and try to throw errors when you can't figure it out in code, and it might mess up a user's data in some special cases. You might have to fix some things manually.
Start by discussing this with your manager, team lead, whoever you answer to. Assuming that you can't lose the data, this is the most sound process to follow for creating a fix for your data:
Create a database dump of your production data.
Import that dump into a development database.
Develop and test your method of repairing this data against the development database from the last step.
Ensure you have a recovery plan for deployments gone wrong. Test this plan in your testing environment.
Once you've gone through your typical release process, it's time to release the fixed code and the data update together.
Take the website offline.
Back up the database.
Update the website with the new code.
Implement your data fix.
Verify that it worked.
Bring the site online.
If your data fix doesn't work (possibly because you didn't think of an edge case or something), then you have a nice back up you can restore and you can cancel the release. Then go back to step 1.
As for how you can fix the data, I don't recommend queries here. I recommend a little script tool. It would have to load the data from the database, pull the string apart, try to identify all the pieces, build up an object from those pieces, and finally serialize them to JSON correctly, and put them back into the database.
Here's an example function of how you might go about pulling the string apart:
const ELEMENT_SEPARATOR = '","';
const PAIR_SEPARATOR = '":"';

function recover_object_from_malformed_json($malformed_json, $known_keys) {
    $tempData = substr($malformed_json, 14);            // Removes {"records":[{" prefix
    $tempData = substr($tempData, 0, -4);               // Removes "}]} suffix
    $tempData = explode(ELEMENT_SEPARATOR, $tempData);  // Split into what we think are pairs
    $data = array();
    $lastKey = NULL;
    foreach ($tempData as $i) {
        $explodedI = explode(PAIR_SEPARATOR, $i, 2);    // Split what we think is a key/value into key and value
        if (in_array($explodedI[0], $known_keys)) {     // Check if it's actually a key
            // It's a key
            $lastKey = $explodedI[0];
            if (array_key_exists($lastKey, $data)) {
                throw new RuntimeException('Duplicate key: ' . $lastKey);
            }
            // Assign the value to the key
            $data[$lastKey] = $explodedI[1];
        }
        else {
            // This isn't a key/value pair, near as we can tell,
            // so it must actually be part of the last value,
            // and the user actually entered the delimiter as part of the value.
            if (is_null($lastKey)) {
                // This one is REALLY messed up
                throw new RuntimeException('Does not begin with a known key');
            }
            $data[$lastKey] .= ELEMENT_SEPARATOR;
            $data[$lastKey] .= $i;
        }
    }
    return $data;
}
Note that I'm assuming that your "list" is a single element. This gets much harder and much messier if you have more than one. You'll also need to know ahead of time what keys you expect to have. The bottom line is that you have to undo whatever your code did to create the "JSON", and you have to do everything you can to try to not mess up a user's data.
You would use it something like this:
$knownKeys = ["Parking", "Description", "Chauffage"];
// Fetch your rows and loop over them
foreach ($dbRows as $row) {
    try {
        $dataFromDb = $row['myData'];   // or however you would pull out this string
        $recoveredData = recover_object_from_malformed_json($dataFromDb, $knownKeys);
        // Save it back to the DB
        $row['myData'] = json_encode($recoveredData);
        // Make sure to commit here.
    }
    catch (Exception $e) {
        // Log the row's ID, the content that couldn't be fixed, and the exception
        // Make sure to roll back here
    }
}
(Forgive me if the database stuff looks really wonky. I don't do PHP, so I have no idea how that code should look. Hopefully, you can at least get the concept.)
Why I don't recommend trying to parse your data as JSON to recover it.
The bottom line is that your data in the database is not JSON. If you try to parse it as such, all the other edge cases you didn't handle properly will get screwed up in the process. You'll see bad things like:
\\ becomes \
\j becomes j
\t becomes a tab character
In the end, it will just mess up your data even more.
Conclusion
This is a huge mess, and you should never try to convert something into a standard format without using a properly built, well tested serializer. Fixing the data is going to be hard, and it's going to take time. I also seriously doubt you have a lot of background in text processing techniques, and lacking that knowledge is going to make this harder. You can get some good info on text processing by studying how compilers are made. Good luck.

Regex match SQL values string with multiple rows and same number of columns

I tried to match the sql values string (0),(5),(12),... or (0,11),(122,33),(4,51),... or (0,121,12),(31,4,5),(26,227,38),... and so on with the regular expression
\(\s*\d+\s*(\s*,\s*\d+\s*)*\)(\s*,\s*\(\s*\d+\s*(\s*,\s*\d+\s*)*\))*
and it works. But...
How can I ensure that the regex does not match a values string like (0,12),(1,2,3),(56,7) with different number of columns?
Thanks in advance...
As I mentioned in a comment on the question, the best way to check whether the input string is valid (i.e. contains the same count of numbers between each pair of brackets) is to use a client-side program rather than plain SQL.
Implementation:
List<string> s = new List<string>(){
    "(0),(5),(12)", "(0,11),(122,33),(4,51)",
    "(0,121,12),(31,4,5),(26,227,38)", "(0,12),(1,2,3),(56,7)"};

var qry = s.Select(a => new
    {
        orig = a,
        newst = a.Split(new string[]{"),(", "(", ")"},
                        StringSplitOptions.RemoveEmptyEntries)
    })
    .Select(a => new
    {
        orig = a.orig,
        isValid = (a.newst
            .Sum(b => b.Split(new char[]{','},
                              StringSplitOptions.RemoveEmptyEntries).Count()) %
            a.newst.Count()) == 0
    });
Result:
orig isValid
(0),(5),(12) True
(0,11),(122,33),(4,51) True
(0,121,12),(31,4,5),(26,227,38) True
(0,12),(1,2,3),(56,7) False
Note: the second Select statement takes the total count of numbers across all groups (from the comma Split) modulo the count of groups returned by the first Split. If the result isn't zero, the input string is invalid.
I strongly believe there's a simpler way to achieve that, but - at this moment - I don't know how ;)
:(
Unless you add some more constraints, I don't think you can solve this problem only with regular expressions.
Regular expressions can't solve all of your string problems, just as they can't be used to check that opening and closing brackets (like "((())()(()(())))") are properly balanced. That's a more complicated issue.
That's what I learnt in class :P If someone knows a way then that'd be sweet!
I'm sorry, I spent a bit of time looking into how we could turn this string into an array and do more work on it in SQL, but the built-in functionality is lacking and the solution would end up being very hacky.
I'd recommend trying to handle this situation differently, as large-scale string computation isn't the best way to go as your database gradually fills up.
A combination of client-side and server-side validation can be used to help prevent bad data (like the rows with mismatched column counts) from getting into the database.
If you need to keep those numbers then you could rework your schema to include some metadata which you can use in your queries, such as how many numbers there are and whether they all match up nicely. This information can be computed inexpensively on your server and provided to the database.
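For instance, a rough C# sketch in the spirit of the answer above (the method name is made up) could compute that metadata before the insert:

// Requires System and System.Linq
static (int groupCount, int columnsPerGroup, bool isUniform) DescribeValuesString(string values)
{
    var groups = values.Split(new[] { "),(", "(", ")" }, StringSplitOptions.RemoveEmptyEntries);
    var columnCounts = groups.Select(g => g.Split(',').Length).ToList();
    var isUniform = columnCounts.Distinct().Count() == 1;
    return (groups.Length, isUniform ? columnCounts[0] : -1, isUniform);
}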
Good luck!

What is the best way to search an Oracle CLOB column?

I need to search a CLOB column and am looking for the best way to do this, I've seen variants online of using the DBMS_LOB package as well as using something called Oracle Text. Can someone provide a quick example of how to do this?
Oracle Text indexing is the way to go. You can use either a CONTEXT or a CTXRULE index. CONTEXT is used on unstructured documents, whereas CTXRULE is more helpful on structured documents.
This link provides more info on the index types and syntax.
The most important factors you need to consider are the LEXER and the STOPLIST.
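As a rough illustration (the table and column names here are made up), creating a CONTEXT index and querying it with CONTAINS looks something like this:

CREATE INDEX my_clob_text_idx ON my_table (my_clob_col)
    INDEXTYPE IS CTXSYS.CONTEXT;

SELECT id, my_clob_col
FROM my_table
WHERE CONTAINS(my_clob_col, 'searchword') > 0;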
You can also read the posts on asktom.oracle.com
http://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:5533095920114
What is in your CLOB and what are you searching for ?
Oracle Text is good if you are searching for words or phrases (which is probably what you have in a CLOB). Sometimes you'll store something 'strange' in a CLOB, like XML or the return value of a web-service call and that might be a different kettle of fish.
I needed to do this just recently and came up with the following solution (uses Spring JDBC)
String sql = "select * from clobtest where dbms_lob.instr(myclob, ? , 1, 1) > 0";
return (String) getSimpleJdbcTemplate().getJdbcOperations().queryForObject(sql, new RowMapper<Object>() {
    public String mapRow(ResultSet rs, int rowNum) throws SQLException {
        String clobText = lobHandler.getClobAsString(rs, "myclob");
        return clobText;
    }
}, searchText);
Seems to work pretty well, but I'm going to do some performance testing to see how well it works under load.