Weird character (�) in SQL Server View definition - sql

I have generated the Create statement for a SQL Server view.
Pretty standard, although there is a some replacing happening on a varchar column, such as:
select Replace(txt, '�', '-')
What the heck is '�'?
When I run that against a row that contains that character, I am seeing the literal '?' being replaced.
Any ideas? Do I need some special encoding in my editor?
Edit
If it helps the end point is a Google feed.

You need to read the script in the same encoding as that in which it was written. Even then, if your editor's font doesn't include a glyph for the character, it may still not display correctly.
When the script was created, did you choose an encoding, or accept the default? If the later, you need to find out which encoding was used. UTF-8 is likely.
However, in this case, the character may not be a mis-representation. Unicode replacement character explains that this character is used as a replacement for some other character that cannot be represented. It's possible in your case that the code you are looking at is simply saying, if we have some data that could not be represented, treat it as a hyphen instead. In other words, this may be nothing to do with the script generation/viewing process, but rather a deliberate piece of code.

Related

How to find Bad characters in the column

I am trying to pull 'COURSE_TITLE' column value from 'PS_TRAINING' table in PeopleSoft and writing into UTF-8 text file to get loaded into Workday system. The file is erroring out while loading because of bad characters(Ã â and many more) present in the column. I have used a procedure which will convert non-ascii value into space. But because of this procedure, the 'Course_Title' which are written in non-english language like Chinese, Korean, Spanish also replacing with spaces.
I even tried using regular expressions (``regexp_like(course_title, 'Ã) only to find bad characters but since the table has hundreds of thousands of rows, it would be difficult to find all bad characters. Please suggest a way to solve this.
If you change your approach, this may work.
Define what you want, and retrieve it.
select *
from PS_TRAINING
where not regexp_like(course_title, '[0-9A-Za-z]')```
If you take out too much data, just add it to the regex

"¿" (inverted question mark) character in oracle

I found "¿" character (inverted question mark) in the database tables in place of single quote (') character.
Can any one let me know how i can avoid this character from the table.
There are many rows which contains the text with this character but not all single quotes are turning to this ¿ symbol.
I am not even able to filter the rows to update this character (¿) with single quote again.
When i user Like "%¿%" it filters me the text containing ordinary question mark (?)
In general there are two possibilities:
Your database tables really have ¿ characters caused by wrong NLS_LANG settings when data was inserted (or the database character set does not support the special character). In such case the LIKE '%¿%' condition should work. However, this also means you have corrupt data in your database and it is almost impossible to correct them because¿ stands for any wrong character.
Your client (e.g. SQL*Plus) is not able to display the special character caused by wrong NLS_LANG settings or the font does not support the special character.
Which client do you use (SQL*Plus, TOAD, SQL Developer, etc.)?
What is your NLS_LANG Environment variable, resp. your Registry key HKLM\SOFTWARE\ORACLE\KEY_%ORACLE_HOME_NAME%\NLS_LANG or HKLM\SOFTWARE\Wow6432Node\ORACLE\KEY_%ORACLE_HOME_NAME%\NLS_LANG?
What do you get when you select DUMP(... , 1016) from your table?
Was it really a simple quote (ASCII 39)? If they actually were some sort of "smart quotes", those do have mappings in windows-1252 but there is no ISO-8859 mapping for them, so if your database charset is ISO-8859-1 and you try to insert some windows-1252 text, Oracle tries to translete from windows-1252 to ISO-8859-1 and uses chr(191) to signal an unmappable character. chr(191) happens to be the opening question mark.
You can check that by executing this (copy the select to preserve the smart quotes):
select dump('‘’“”') from dual
This behaviour is basically "correct", as what you are asking Oracle to do cannot be done, though it is not very intuitive.
See windows-1252, ISO8859-1 and ISO-8859-15 charsets for comparison. Note that windows uses the range 127-159 that is not used in the ISO charsets

What are the characters that count as the same character under collation of UTF8 Unicode? And what VB.net function can be used to merge them?

Also what's the vb.net function that will map all those different characters into their most standard form.
For example, tolower would map A and a to the same character right?
I need the same function for these characters
german
ß === s
Ü === u
Χιοσ == Χίος
Otherwise, sometimes I insert Χιοσ and latter when I insert Χίος mysql complaints that the ID already exist.
So I want to create a unique ID that maps all those strange characters into a more stable one.
For the encoding aspect of the thing, look at String.Normalize. Notice also its overload that specifies a particular normal form to which you want to convert the string, but the default normal form (C) will work just fine for nearly everyone who wants to "map all those different characters into their most standard form".
However, things get more complicated once you move into the database and deal with collations.
Unicode normalization does not ever change the character case. It covers only cases where the characters are basically equivalent - look the same1, mean the same thing. For example,
Χιοσ != Χίος,
The two sigma characters are considered non-equivalent, and the accented iota (\u1F30) is equivalent to a sequence of two characters, the plain iota (\u03B9) and the accent (\u0313).
Your real problem seems to be that you are using Unicode strings as primary keys, which is not the most popular database design practice. Such primary keys take up more space than needed and are bound to change over time (even if the initial version of the application does not plan to support that). Oh, and I forgot their sensitivity to collations. Instead of identifying records by Unicode strings, have the database schema generate meaningless sequential integers for you as you insert the records, and demote the Unicode strings to mere attributes of the records. This way they can be the same or different as you please.
It may still be useful to normalize them before storing for the purpose of searching and safer subsequent processing; but the particular case insensitive collation that you use will no longer restrict you in any way.
1Almost the same in case of compatibility normalization as opposed to canonical normalization.

How can you query a SQL database for malicious or suspicious data?

Lately I have been doing a security pass on a PHP application and I've already found and fixed one XSS vulnerability (both in validating input and encoding the output).
How can I query the database to make sure there isn't any malicious data still residing in it? The fields in question should be text with allowable symbols (-, #, spaces) but shouldn't have any special html characters (<, ", ', >, etc).
I assume I should use regular expressions in the query; does anyone have prebuilt regexes especially for this purpose?
If you only care about non-alphanumerics and it's SQL Server you can use:
SELECT *
FROM MyTable
WHERE MyField LIKE '%[^a-z0-9]%'
This will show you any row where MyField has anything except a-z and 0-9.
EDIT:
Updated pattern would be: LIKE '%[^a-z0-9!-# ]%' ESCAPE '!'
I had to add the ESCAPE char since you want to allow dashes -.
For the same reason that you shouldn't be validating input against a black-list (i.e. list of illegal characters), I'd try to avoid doing the same in your search. I'm commenting without knowing the intent of the fields holding the data (i.e. name, address, "about me", etc.), but my suggestion would be to construct your query to identify what you do want in your database then identify the exceptions.
Reason being there are just simply so many different character patterns used in XSS. Take a look at the XSS Cheat Sheet and you'll start to get an idea. Particularly when you get into character encoding, just looking for things like angle brackets and quotes is not going to get you too far.

Accented character replacement for search then reinserted afterwards

Basically my issue is that users would like to search for a french word that has accented characters but without typing in the accented characters and then have the actual accented word appeared highlighted if found... So for example they would type in "declare" but in the result sets it would look like "déclare" and if found "déclare" would be highlighted.
My first thought was to just simply replace the characters with a regex but then I remembered that I would need to re-insert the replaced characters after the search... I was thinking of then using some sort of character map that would track position and the character so that when the search was finshed I could put the result set back to the way it was. This seems a little brute force to me and I was wondering if anyone had a better alternative? I'm using Visual Studio 2005 with this app.
Any advice would be much appreciated!
Thanks
A regular expression by default matches text. The "replacement" mode is not the normal mode. So, what you want is in fact the default. The precise syntax will depend on your Regex engine, e.g. in .Net you'd use Regex.IsMatch()