Problems with Turkish SQL Collation (Turkish "I")


I'm having problems with our MSSQL database when it is set to any of the Turkish collations. Because of the "Turkish I" problem, none of our queries containing an 'i' work correctly. For example, if we have a table called "Unit" with a column "UnitID", the query "select unitid from unit" no longer works, because the lowercase "i" in "id" differs from the capital I in "UnitID" as defined. The error message reads "Invalid column name 'unitid'."
I know that this occurs because in Turkish the letters i and I are treated as different letters. However, I am not sure how to fix it. Going through all 1900 SPs in the DB and correcting the casing of the "i"s is not an option.
Any help would be appreciated, even suggestions of other collations that could be used instead of Turkish but would support their character set.
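To make the mismatch concrete, here is a small Python sketch of the Turkish casing rules. The helper names turkish_lower and names_match are hypothetical, and the mapping is written by hand because Python's own str.lower() is locale-independent. It only illustrates why a Turkish case-insensitive comparison treats unitid and UnitID as different names; it is not how SQL Server implements collations.

```python
def turkish_lower(text: str) -> str:
    """Lower-case text using the Turkish rules: I -> ı (dotless) and İ -> i."""
    # Handle the two Turkish-specific pairs first, then let Python do the rest.
    return text.replace("I", "ı").replace("İ", "i").lower()


def names_match(a: str, b: str, turkish: bool) -> bool:
    """Case-insensitive identifier comparison under Turkish or invariant rules."""
    if turkish:
        return turkish_lower(a) == turkish_lower(b)
    return a.lower() == b.lower()


print(names_match("unitid", "UnitID", turkish=False))  # invariant rules
print(names_match("unitid", "UnitID", turkish=True))   # Turkish rules
print(turkish_lower("UnitID"))
```

Under the invariant rules the two names compare equal; under the Turkish rules the capital I lower-cases to a dotless ı, so the comparison fails.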

Turns out that the best solution was to in fact refactor all SQL and the code.
In the last few days I've written a refactoring app to fix up all Stored procs, functions, views, tablenames to be consistent and use the correct casing eg:
select unitid from dbo.unit
would be changed to
select UnitId from dbo.Unit
The app then also goes through the code, finds every occurrence of each stored proc and its parameters, and corrects them to match the case defined in the DB. All datatables in the app are set to the invariant locale (thanks to FxCop for pointing out all the datatables), which means calls from within the code no longer have to match case exactly.
If anyone would like the app or any advice on the process you can contact me on dotnetvixen#gmail.com.

I have developed many systems with Turkish support, and as you said, this is a well-known problem.
The best practice is to change your database settings to UTF-8, and that's it. That should solve the whole problem.
You might still run into problems if you want to support case sensitivity for the pairs (ı-I, i-İ), which can be problematic in SQL Server. If all input comes from the Web, ensure that it is UTF-8 as well.
If you keep both your Web input and your SQL Server settings as UTF-8, everything should go smoothly.

Perhaps I don't understand the problem here, but is this not more likely because the database is case sensitive and your query is not? For example, on Sybase I can do the following:
USE master
GO
EXEC sp_server_info 16
GO
Which tells me that my database is case-insensitive:
attribute_id attribute_name attribute_value
16 IDENTIFIER_CASE MIXED

If you can change the collation that you're using then try the Invariant locale. But make sure you don't impact other things like customer names and addresses. If a customer is accustomed to having case insensitive searching for their own name, they won't like it if ı and I stop being equivalent, or if i and İ stop being equivalent.

Can you change the database collation to the default? This will leave all your text columns with the Turkish collation.
Queries will work, but data will still behave correctly. In theory...
There are some gotchas with temp tables and table variables with varchar columns: you'll have to add COLLATE clauses to these.

I realize you don't want to go through all the stored procedures to fix the issue, but maybe you'd be OK with using a refactoring tool to solve the problem. I'd say take a look at SQL Refactor. I haven't used it, but it looks promising.

Changing the Regional Settings of your machine to English(US) completely saves the day!

Related

Replace Character in String with SQL Server Table Trigger on Insert\Update

I am attempting to create a trigger that will replace a character ’ (MS Word Smart Quote) with a proper apostrophe ' when new data is inserted or updated by a user from our website.
The special apostrophe may be found anywhere on a 5000 NVarchar column and may be found multiple times in the same string.
Any easy replace statement for this?
REPLACE(Column,'’','''')

I'm going to argue that you should probably look at doing this in your applications instead of from within SQL Server. That's NOT the answer you're looking for - but it would probably make more sense.
Typically, when I see questions like this I instantly worry about devs trying to 'defeat' SQL Injection. If that's the case, this approach will NEVER work - as per:
http://sqlmag.com/database-security/sql-injection-beyond-basics
That said, if you're not focused on that and just need to get rid of 'pesky' characters, then REPLACE() will work (and likely be your best option), but I'd still argue that you're probably better off tackling 'formatting' issues like this from within your applications. Or in other words, treat SQL Server as your data repository - something that stores your raw data. Then, if you need to make it 'pretty' or 'tweak' it for various outputs/displays, then do that on the way out to your users by means of your application(s).
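As a rough sketch of doing this cleanup in the application rather than in a trigger, the Python below replaces the Word smart apostrophe (U+2019) with a plain one before the value is stored or displayed. The function name normalize_apostrophes is hypothetical, and the original application's language is not stated, so this only shows the idea.

```python
RIGHT_SINGLE_QUOTE = "\u2019"  # the MS Word "smart" apostrophe


def normalize_apostrophes(text: str) -> str:
    """Replace every smart apostrophe with a plain ASCII apostrophe."""
    return text.replace(RIGHT_SINGLE_QUOTE, "'")


print(normalize_apostrophes("It\u2019s the user\u2019s text"))
```

Like REPLACE() in SQL, str.replace handles every occurrence in the string, so repeated smart quotes in a long column value are all covered.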

Weird OpenEdge query behaviour

We have recently had to do some work with an OpenEdge database that a third party product uses, and today (after much hair-pulling), we finally identified why a view was returning no results.
This view in question combines about 100 separate tables, and is then queried against (we have limited rights to this database). One of the fields returned by this view is a hard-coded string literal, along the lines of
'John Smith' AS TheName
We were having difficulty running queries that included this string, which we were trying to RTrim (the view returned a lot of trailing spaces) and then concatenate with another field.
However, if we used RTrim on this field then, instead of returning an error message, or a null or something like that, the row simply wasn't returned. We weren't trying to use it in a WHERE clause or JOIN, this was simply part of the SELECT ... FROM VIEWNAME. After reviewing the view, it seemed that the view had erroneously detected the length of the string as 9 characters (no length was specified in the definition), and RTrim just didn't work.
Now, I could understand why this might lead to an error message, or a NULL value in the SELECT, but why would the row simply not be returned at all? This doesn't seem like good SQL behaviour and I've never seen it happen with any other RDBMS.
Other info : we're test querying via ODBC and WinSQL, with a view to this being included in an existing ASP.NET app. We don't have access to the backend except via this, although we do have rights to create views.
Update : As a freaky follow-up, we have now discovered that if we attempt to query this view without any WHERE clause, no records are returned. This may have the same cause.

This sounds like it could be related to the SQL-WIDTH within the Progress database. One problem with Progress is that if the content of a field exceeds its SQL-WIDTH, you get strange SQL behaviour (sometimes the driver fails, other times you get no results).
To identify this, use the dbtool command to check for SQL-WIDTHs that may be exceeded.

Make sure you don't have blanks. Trimming doesn't remove blanks only spaces. Blanks are also not nulls. There is a difference in the character set while it is not visibly different in your editor.
I have run into this with a few databases: DBII, Oracle, PostgreSQL. Check the character set of your editor and try viewing the tables; you might see nothing, or you might see big rectangles.

That sounds like very strange behavior. Just code around it, do the trim and/or string manipulation in the application and go on your way.

Openbase SQL case-sensitivity oddities ('=' vs. LIKE) - porting to MySQL

We are porting an app which formerly used Openbase 7 to now use MySQL 5.0.
OB 7 did have quite badly defined (i.e. undocumented) behavior regarding case-sensitivity. We only found this out now when trying the same queries with MySQL.
It appears that OB 7 treats lookups using "=" differently from those using "LIKE": If you have two values "a" and "A", and make a query with WHERE f="a", then it finds only the "a" field, not the "A" field. However, if you use LIKE instead of "=", then it finds both.
Our tests with MySQL showed that if we're using a non-binary collation (e.g. latin1), then both "=" and "LIKE" compare case-insensitively. However, to simulate OB's behavior, we need to get only "=" to be case-sensitive.
We're now trying to figure out how to deal with this in MySQL without having to add a lot of LOWER() function calls to all our queries (there are a lot!).
We have full control over the MySQL DB, meaning we can choose its collation mode as we like (our table names and unique indexes are not affected by the case sensitivity issues, fortunately).
Any suggestions how to simulate the OpenBase behaviour on MySQL with the least amount of code changes?
(I realize that a few smart regex replacements in our source code to add the LOWER calls might do the trick, but we'd rather find a different way)

Another idea: does MySQL offer something like user-defined functions? You could then write a UDF version of LIKE that is case-insensitive (ci_like or so) and change all LIKEs to ci_like. That is probably easier than using a regex to insert calls to LOWER.

These two articles talk about case sensitivity in mysql:
Case Sensitive mysql
mySql docs "Case Sensitivity"
Both were early hits in this Google search:
case sensitive mysql

I know that this is not the answer you are looking for .. but given that you want to keep this behaviour, shouldn't you explicitly code it (rather than changing some magic 'config' somewhere)?
It's probably quite some work, but at least you'd know which areas of your code are affected.

A quick look at the MySQL docs seems to indicate that this is exactly how MySQL does it:
This means that if you search with col_name LIKE 'a%', you get all column values that start with A or a.

Regular expression to match common SQL syntax?

I was writing some Unit tests last week for a piece of code that generated some SQL statements.
I was trying to figure out a regex to match SELECT, INSERT and UPDATE syntax so I could verify that my methods were generating valid SQL, and after 3-4 hours of searching and messing around with various regex editors I gave up.
I managed to get partial matches but because a section in quotes can contain any characters it quickly expands to match the whole statement.
Any help would be appreciated, I'm not very good with regular expressions but I'd like to learn more about them.
By the way it's C# RegEx that I'm after.
Clarification
I don't want to need access to a database, as this is part of a unit test and I don't want to have to maintain a database to test my code, which may live longer than the project.

Regular expressions can only match languages that a finite state automaton can recognize, which is a very limited class, whereas SQL has a full (context-free) grammar. It can be demonstrated that you can't validate SQL with a regex. So, you can stop trying.

SQL is a type-2 grammar; it is too powerful to be described by regular expressions. It's the same as if you decided to generate C# code and then validate it without invoking a compiler. A database engine is in general too complex to be easily stubbed.
That said, you may try ANTLR's SQL grammars.

As far as I know, this is beyond regex, and you're getting close to the dark arts of BNF and compilers.
http://savage.net.au/SQL/
The same thing happens to people who want to do correct syntax highlighting. You start cramming things into a regex and then you end up writing a compiler...

I had the same problem - an approach that would work for all the more standard sql statements would be to spin up an in-memory Sqlite database and issue the query against it, if you get back a "table does not exist" error, then your query parsed properly.
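A minimal sketch of that approach, written in Python with the standard sqlite3 module rather than C#. The function name check_sql_syntax and the result labels are hypothetical, and the classification relies on the wording of SQLite's error messages, which is an assumption about current SQLite behaviour. SQLite's dialect also differs from other engines, so this only approximates a syntax check for fairly standard statements.

```python
import sqlite3


def check_sql_syntax(sql: str) -> str:
    """Classify sql as 'ok', 'syntax_error', 'unresolved_name' or 'other_error'."""
    conn = sqlite3.connect(":memory:")  # empty throwaway database, no schema
    try:
        # EXPLAIN makes SQLite compile the statement without running it.
        conn.execute("EXPLAIN " + sql)
        return "ok"
    except sqlite3.OperationalError as exc:
        message = str(exc)
        if "syntax error" in message or "incomplete input" in message:
            return "syntax_error"
        if message.startswith(("no such table", "no such column")):
            # The parser accepted the statement; only name lookup failed.
            return "unresolved_name"
        return "other_error"
    finally:
        conn.close()


print(check_sql_syntax("SELECT * FROM missing_table"))
print(check_sql_syntax("SELEC * FROM missing_table"))
```

In a unit test, both "ok" and "unresolved_name" would count as "the generated SQL parsed", while "syntax_error" would fail the test.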

Off the top of my head: Couldn't you pass the generated SQL to a database and use EXPLAIN on them and catch any exceptions which would indicate poorly formed SQL?

Have you tried lazy quantifiers? Rather than matching as much as possible, they match as little as possible, which is probably what you need for quotes.
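The difference is easy to see on a statement with two quoted literals. This sketch uses Python's re module for brevity; the *? quantifier has the same meaning in .NET's Regex. The sample statement is made up for illustration.

```python
import re

sql = "SELECT * FROM t WHERE a = 'x' AND b = 'y'"

# Greedy: .* runs to the last quote, swallowing everything in between.
greedy = re.findall(r"'.*'", sql)

# Lazy: .*? stops at the first closing quote it can find.
lazy = re.findall(r"'.*?'", sql)

print(greedy)
print(lazy)
```

The greedy pattern returns one match spanning both literals, while the lazy pattern returns each literal separately. Neither handles escaped quotes inside a literal.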

To validate the queries, just run them with SET NOEXEC ON; that is how Enterprise Manager does it when you parse a query without executing it.
Besides, if you are using a regex to validate SQL queries, you can be almost certain that you will miss some corner cases, or that the query is invalid for other reasons even if it's syntactically correct.

I suggest creating a database with the same schema, possibly using an embedded sql engine, and passing the sql to that.

I don't think that you even need to have the schema created to be able to validate the statement, because the system will not try to resolve object_name etc until it has successfully parsed the statement.
With Oracle as an example, you would certainly get an error if you did:
select * from non_existant_table;
In this case, "ORA-00942: table or view does not exist".
However if you execute:
select * frm non_existant_table;
Then you'll get a syntax error, "ORA-00923: FROM keyword not found where expected".
It ought to be possible to classify errors into syntax parsing errors, which indicate incorrect syntax, and errors relating to table names, permissions, etc.
Add to that the problem of different RDBMSs and even different versions allowing different syntaxes and I think you really have to go to the db engine for this task.

There are ANTLR grammars to parse SQL. It's really a better idea to use an in memory database or a very lightweight database such as sqlite. It seems wasteful to me to test whether the SQL is valid from a parsing standpoint, and much more useful to check the table and column names and the specifics of your query.

The best way is to validate the parameters used to create the query, rather than the query itself. A function that receives the variables can check the length of the strings, valid numbers, valid emails or whatever. You can use regular expressions to do these validations.

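using System.Text.RegularExpressions;

// Rough shape check only: a match does not prove the SQL is valid.
public bool IsValid(string sql)
{
    string pattern = @"SELECT\s.*FROM\s.*WHERE\s.*";
    Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
    return rgx.IsMatch(sql);
}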

I am assuming you did something like .* for the quoted section; try [^"]* instead, which will keep you from eating the whole line. It will still give false positives in cases where you have \ inside your strings.

Is there some way to inject SQL even if the ' character is deleted?

If I remove all the ' characters from a SQL query, is there some other way to do a SQL injection attack on the database?
How can it be done? Can anyone give me examples?

Yes, there is. An excerpt from Wikipedia
"SELECT * FROM data WHERE id = " + a_variable + ";"
It is clear from this statement that the author intended a_variable to be a number correlating to the "id" field. However, if it is in fact a string then the end user may manipulate the statement as they choose, thereby bypassing the need for escape characters. For example, setting a_variable to
1;DROP TABLE users
will drop (delete) the "users" table from the database, since the SQL would be rendered as follows:
SELECT * FROM DATA WHERE id=1;DROP TABLE users;
SQL injection is not a simple attack to fight. I would do very careful research if I were you.
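The excerpt's attack can be reproduced with a short, deliberately vulnerable Python sketch against an in-memory SQLite database. The build_query helper and the two-table schema are made up for the demo, and executescript is used because, unlike execute, it accepts several statements in one string (many database APIs behave that way by default). Note that the input contains no apostrophe at all, so stripping apostrophes changes nothing.

```python
import sqlite3


def build_query(a_variable: str) -> str:
    """Vulnerable on purpose: strips apostrophes, then concatenates."""
    return "SELECT * FROM data WHERE id = " + a_variable.replace("'", "") + ";"


conn = sqlite3.connect(":memory:")
conn.executescript("CREATE TABLE data (id INTEGER); CREATE TABLE users (name TEXT);")

query = build_query("1;DROP TABLE users")
conn.executescript(query)  # runs the SELECT and then the injected DROP

tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(query)
print(tables)
```

After the call, only the data table is left in the database.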

Yes, depending on the statement you are using. You are better off protecting yourself either by using Stored Procedures, or at least parameterised queries.
See Wikipedia for prevention samples.

I suggest you pass the variables as parameters, and not build your own SQL. Otherwise there will always be a way to do a SQL injection, in ways that we are currently unaware of.
The code you create is then something like:
// Not tested
var sql = "SELECT * FROM data WHERE id = @id";
var cmd = new SqlCommand(sql, myConnection);
// request.getParameter("id") stands for wherever the untrusted value comes from
cmd.Parameters.AddWithValue("@id", request.getParameter("id"));
If you have a name like mine, with an ' in it, it is very annoying when all '-characters are removed or marked as invalid.
You also might want to look at this Stackoverflow question about SQL Injections.
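The same idea as a runnable Python sketch, using sqlite3's ? placeholders instead of SqlCommand parameters. The schema, the sample row and the find_by_name helper are invented for the example. It shows both points made above: a name containing an apostrophe works unchanged, and hostile input is treated as plain data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO data VALUES (?, ?)", (1, "O'Brien"))


def find_by_name(conn: sqlite3.Connection, name: str) -> list:
    """Look up ids by name; the value is bound, never spliced into the SQL."""
    return conn.execute("SELECT id FROM data WHERE name = ?", (name,)).fetchall()


print(find_by_name(conn, "O'Brien"))                   # apostrophe needs no escaping
print(find_by_name(conn, "x'; DROP TABLE users; --"))  # just a name that matches nothing

remaining = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(remaining)
```

Because the driver sends the value separately from the statement text, there is nothing to filter or escape in the application code.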

Yes, it is definitely possible.
If you have a form where you expect an integer to make your next SELECT statement, then you can enter anything similar:
SELECT * FROM thingy WHERE attributeID=
5 (good answer, no problem)
5; DROP table users; (bad, bad, bad...)
The following website details further classical SQL injection techniques: SQL Injection cheat sheet.
Using parametrized queries or stored procedures is not automatically any better. If they are just pre-made queries that concatenate the passed parameters into SQL, they can be a source of injection just as well. This is also described on this page: Attacking Stored Procedures in SQL.
Now, if you suppress the single quote, you prevent only a given set of attacks, but not all of them.
As always, do not trust data coming from the outside. Filter them at these 3 levels:
Interface level for obvious stuff (a drop down select list is better than a free text field)
Logical level for checks related to data nature (int, string, length), permissions (can this type of data be used by this user at this page)...
Database access level (escape simple quote...).
Have fun and don't forget to check Wikipedia for answers.

Parameterized inline SQL or parameterized stored procedures is the best way to protect yourself. As others have pointed out, simply stripping/escaping the single quote character is not enough.
You will notice that I specifically talk about "parameterized" stored procedures. Simply using a stored procedure is not enough either if you revert to concatenating the procedure's passed parameters together. In other words, wrapping the exact same vulnerable SQL statement in a stored procedure does not make it any safer. You need to use parameters in your stored procedure just like you would with inline SQL.

Also, even if you do just look for the apostrophe, you don't want to remove it. You want to escape it. You do that by replacing every apostrophe with two apostrophes.
But parameterized queries/stored procedures are so much better.

Since this is a relatively old question, I won't bother writing up a complete and comprehensive answer, since most aspects of that answer have been mentioned here by one poster or another.
I do find it necessary, however, to bring up another issue that was not touched on by anyone here - SQL Smuggling. In certain situations, it is possible to "smuggle" the quote character ' into your query even if you tried to remove it. In fact, this may be possible even if you used proper commands, parameters, Stored Procedures, etc.
Check out the full research paper at http://www.comsecglobal.com/FrameWork/Upload/SQL_Smuggling.pdf (disclosure, I was the primary researcher on this) or just google "SQL Smuggling".

. . . uh about 50000000 other ways
maybe something like 5; drop table employees; --
resulting sql may be something like:
select * from somewhere where number = 5; drop table employees; -- and sadfsf
(-- starts a comment)

Yes, absolutely: depending on your SQL dialect and such, there are many ways to achieve injection that do not use the apostrophe.
The only reliable defense against SQL injection attacks is using the parameterized SQL statement support offered by your database interface.

Rather than trying to figure out which characters to filter out, I'd stick to parametrized queries instead, and remove the problem entirely.

It depends on how you put together the query, but in essence yes.
For example, in Java if you were to do this (deliberately egregious example):
String query = "SELECT name_ from Customer WHERE ID = " + request.getParameter("id");
then there's a good chance you are opening yourself up to an injection attack.
Java has some useful tools to protect against these, such as PreparedStatements (where you pass in a string like "SELECT name_ from Customer WHERE ID = ?" and the JDBC layer handles escapes while replacing the ? tokens for you), but some other languages are not so helpful for this.

The thing is, apostrophes may be genuine input, and you have to escape them by doubling them up when you are using inline SQL in your code. What you are looking for is a regex pattern like:
\;.*--\
A semi colon used to prematurely end the genuine statement, some injected SQL followed by a double hyphen to comment out the trailing SQL from the original genuine statement. The hyphens may be omitted in the attack.
Therefore the answer is: No, simply removing apostrophes does not guarantee you safety from SQL injection.

I can only repeat what others have said. Parametrized SQL is the way to go. Sure, it is a bit of a pain in the butt to code, but once you have done it once, it isn't difficult to cut and paste that code and make the modifications you need. We have a lot of .NET applications that allow web site visitors to specify a whole range of search criteria, and the code builds the SQL SELECT statement on the fly, but everything that could have been entered by a user goes into a parameter.

When you are expecting a numeric parameter, you should always be validating the input to make sure it's numeric. Beyond helping to protect against injection, the validation step will make the app more user friendly.
If you ever receive id = "hello" when you expected id = 1044, it's always better to return a useful error to the user instead of letting the database return an error.
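A minimal sketch of that validation step in Python. The parse_id name and the error wording are hypothetical, and the rule (ASCII digits only) is one reasonable choice rather than the only one. The validated integer should still be passed to the database as a parameter.

```python
def parse_id(raw: str) -> int:
    """Return raw as an int, or raise ValueError with a user-friendly message."""
    text = raw.strip()
    # isdigit() alone also accepts some non-ASCII digit characters, so check both.
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"id must be a whole number, got {raw!r}")
    return int(text)


print(parse_id("1044"))
for bad in ("hello", "1;DROP TABLE users"):
    try:
        parse_id(bad)
    except ValueError as exc:
        print(exc)
```

The application can catch the ValueError and show its message to the user instead of surfacing a database error.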
