I'm quite new to SQL Databases, but I'm trying to add a Conditional Split in my Data Flow between my Flat File Source and OLE DB Database to exclude records containing some special characters such as ø and ¿ and ¡ on the [title] column. Those are causing errors when creating a table and therefore I want those records to be split from my table. How can I create a conditional split for this?
As a bonus: Is there a way to only filter in a conditional split the rows that contain numbers from 0-9 and letters from a-zA-Z so that all rows with "special" symbols are filtered out automatically?
A conditional split works by determining whether a condition is true or false. So, if you can write a rule that evaluates to true or false, and you can write multiple rules to address assorted business needs, then you can properly shunt rows into different pathways.
How do I do that?
I always advocate that people add new columns to their data flows to handle this stuff. It's the only way you're going to be able to debug when a condition comes up that you think should have been handled but wasn't.
Whether you create a column called IsTitleOnlyAlphaNumeric or IsTitleInternational is really up to you. The general programming rule is to go for the common/probable case. Since the ASCII universe is only 128 characters, 256 for extended ASCII, I'd advocate the former. Otherwise, you're going to play whack-a-mole as the next file has umlauts or a thorn in it.
Typically, we would add a new column through a Derived Column Transformation which means you're working with the SSIS expression language. However, in this case the expression does not have the ability to gracefully* identify whether the string is good or not. Instead, you'll want to use the .NET library for this heavy lifting. That's the Script Component and you'll have it operate in the Transformation mode (default).
Add a new column of type boolean, IsTitleOnlyAlphaNumeric, and crib the regular expression from "check alphanumeric characters in string in c#".
The relevant bit of the per-row method (Input0_ProcessInputRow, assuming the default input name) would look like
Row.IsTitleOnlyAlphaNumeric = isAlphaNumeric(Row.Title);
As rows flow through, that will be evaluated for each one and you'll see whether it meets the criteria or not. Depending on your data, you might need a check for NULL before you call that method.
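For a feel of the logic, here is a minimal sketch of the same check in Python rather than the Script Component's C#. The function name, and the decision to treat NULL (None) and empty titles as failing, are my assumptions and not anything SSIS dictates. Add a space to the character class if your titles legitimately contain spaces.

```python
import re
from typing import Optional

# Assumption: only ASCII letters and digits count as "good" characters.
_ALNUM_ONLY = re.compile(r"[A-Za-z0-9]+")


def is_alphanumeric(title: Optional[str]) -> bool:
    """Return True only when every character is an ASCII letter or digit."""
    if title is None:  # the NULL check mentioned above
        return False
    # fullmatch requires the whole string to match, so "" also fails.
    return _ALNUM_ONLY.fullmatch(title) is not None


print(is_alphanumeric("Title123"))
print(is_alphanumeric("Bj\u00f8rk"))
```

Rows where the flag is False are the ones the Conditional Split would route away from the destination.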
How I shouldn't do that
*You could abuse the daylights out of the REPLACE function and test the allowable length of an expression by doing something like this: create a new column called StrippedTitle in which every allowable character is replaced with an empty string. If the length of the trimmed final string is not zero, then there's something bad in there.
REPLACE(REPLACE(REPLACE([Title], "A", ""), "B", ""), "C", "") ..., "a", ""), "b", "") ..., "9", "")
where ... implies you've continued the pattern. Yes, you'll have to replace upper and lower cased characters. ASCIITable.com or similar will be your friend.
That will be a new column. So add a second Derived Column component to calculate whether it's empty (again, easier to debug) and call the result IsTitleOnlyAlphaNumeric:
LEN(RTRIM(StrippedTitle)) == 0
Terrible approach but the number of questions I answer where people later clarify "I cannot use script" is decidedly non-zero.
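If it helps to see what that nested REPLACE chain computes, the sketch below mirrors it in Python: delete every allowed character, then test whether anything besides spaces is left. The function names are hypothetical and this only illustrates the logic; it is not SSIS expression syntax.

```python
import string

# Every character the REPLACE chain would blank out: A-Z, a-z, 0-9.
ALLOWED = string.ascii_letters + string.digits
_DELETE_ALLOWED = str.maketrans("", "", ALLOWED)


def stripped_title(title: str) -> str:
    """Equivalent of the StrippedTitle column: allowed characters removed."""
    return title.translate(_DELETE_ALLOWED)


def is_title_only_alphanumeric(title: str) -> bool:
    """Equivalent of LEN(RTRIM(StrippedTitle)) == 0."""
    # Leftover spaces are trimmed away, so titles containing spaces still
    # pass, just as they do with RTRIM in the expression.
    return len(stripped_title(title).rstrip(" ")) == 0
```

Looking at the stripped value for a failing row shows exactly which characters tripped the check, which is the debugging benefit of keeping it as its own column.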
Related
When writing applications that manage data, it is often useful to allow the end user to create or remove classes of data that are best represented as columns. For example, I'm working on a dictionary building application; a user might decide they want to add, say, an "alternate spelling" field or something to data, which could be very easily represented as another column.
Usually, I just name the column based on whatever the user called it ("alternate_spelling" in this case); however, a user-defined string that isn't explicitly sanitized as a database identifier bothers me. Since column names can't be bound like values, I'm trying to figure out how to sanitize the column names.
So my question is: what should I be doing? Can I get away with just quoting things? There's lots of questions asking how to bind column names in SQL, and many responses saying one should never need to, but never explaining the correct approach to handling variable columns. I'm working in Python specifically, but I think this question is more general.
It depends on which database you are using...
According to PostgreSQL:
"SQL identifiers and key words must begin with a letter (a-z, but also letters with diacritical marks and non-Latin letters) or an underscore (_). Subsequent characters in an identifier or key word can be letters, underscores, digits (0-9), or dollar signs ($). Note that dollar signs are not allowed in identifiers according to the letter of the SQL standard, so their use might render applications less portable"
(Keep also in mind: maximum length allowed for the name)
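On the "can I get away with just quoting things" part: in PostgreSQL a quoted identifier is wrapped in double quotes, with any embedded double quote doubled. A minimal sketch of that rule follows. The function name is hypothetical, and the 63-byte limit assumes a stock build (NAMEDATALEN of 64) with UTF-8 encoding. If you use psycopg2, its psycopg2.sql.Identifier does this composition for you.

```python
def quote_identifier(name: str, max_bytes: int = 63) -> str:
    """Return `name` as a double-quoted PostgreSQL identifier."""
    if not name:
        raise ValueError("identifier must not be empty")
    if "\x00" in name:
        raise ValueError("identifier must not contain a NUL character")
    # Assumption: default NAMEDATALEN; longer names would be truncated.
    if len(name.encode("utf-8")) > max_bytes:
        raise ValueError(f"identifier longer than {max_bytes} bytes")
    # Doubling embedded quotes keeps the name from closing the quoting early.
    return '"' + name.replace('"', '""') + '"'


print(quote_identifier("alternate_spelling"))
```

Quoting makes the name syntactically safe, but quoted identifiers are case-sensitive, so the same spelling has to be used everywhere afterwards.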
I was looking for something like this. I still wouldn't trust it with user-supplied names - I'd look those up from the database catalog instead, but I think it is robust enough to check data that is provided from your backend.
i.e. Just because something comes from your internal data tables or yaml config files doesn't 100% mean that an attacker couldn't have hacked into those sources, so why not add another layer right before composing sql queries?
This is for postgresql but mostly should work on something else. No, it doesn't cover ALL possible characters for naming columns and tables, only those used in my databases.
import re

class SecurityException(Exception):
    """Concerns security."""

class UnsafeSqlException(SecurityException):
    """SQL fragment looks unsafe."""

def is_safe_sql_name(sql: str, error_on_empty: bool = False, raise_on_false: bool = True) -> bool:
    """Check that something looks like an object name."""
    # A letter first, then letters, digits, or underscores (255 characters at most).
    # Explicit A-Za-z instead of re.IGNORECASE, which also lets a few non-ASCII letters through.
    patre = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,254}")
    if not isinstance(sql, str):
        raise TypeError(f"sql should be a string {sql=}")
    if not sql:
        if error_on_empty:
            raise ValueError(f"empty sql {sql=}")
        return False
    # fullmatch, because match() with a "$" anchor would also accept a trailing newline.
    res = bool(patre.fullmatch(sql))
    if not res and raise_on_false:
        raise UnsafeSqlException(f"{sql=}")
    return res
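To sketch the catalog-lookup idea mentioned above: rather than trusting the user's string, match it against the column names the database itself reports and use the catalog's copy. Here known_columns is a hypothetical stand-in for whatever you read from information_schema.columns, and resolve_column is a made-up name.

```python
from typing import Iterable


def resolve_column(requested: str, known_columns: Iterable[str]) -> str:
    """Return the catalog's spelling of `requested`, or raise ValueError."""
    for column in known_columns:
        if column == requested:
            # Hand back the catalog's own string, never the user's input.
            return column
    raise ValueError(f"unknown column: {requested!r}")


# Hypothetical column list, e.g. fetched once from information_schema.columns.
columns = ["id", "word", "alternate_spelling"]
print(resolve_column("alternate_spelling", columns))
```

Anything that is not an exact match for an existing column never reaches the SQL string at all.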
I have a database that has around 10k records and some of them contain HTML characters which I would like to replace.
For example I can find all occurrences:
SELECT * FROM TABLE
WHERE TEXTFIELD LIKE '%&#47;%'
An example of the original string:
this is the cool mega string that contains &#47;
How do I replace every &#47; with / ?
The end result should be:
this is the cool mega string that contains /
If you want to replace a specific string with another string or a transformation of that string, you can use the replace function in PostgreSQL. For instance, to replace all occurrences of 'cat' with 'dog' in the column myfield, you would do:
UPDATE tablename
SET myfield = replace(myfield, 'cat', 'dog')
You could add a WHERE clause or any other logic as you see fit.
Alternatively, if you are trying to convert HTML entities, ASCII characters, or between various encoding schemes, PostgreSQL has functions for that as well; see the PostgreSQL String Functions documentation.
The answer given by @davesnitty will work, but you need to think very carefully about whether the text pattern you're replacing could appear embedded in a longer pattern you don't want to modify. Otherwise you'll find someone's nooking a fire, and that's just weird.
If possible, use a suitable dedicated tool for what you're un-escaping. Got URL-encoded text? Use a URL decoder. Got XML entities? Process them through an XSLT stylesheet in text output mode, and so on. These are usually safer for your data than hacking it with find-and-replace, which often has unfortunate side effects if not applied very carefully, as noted above.
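As an example of a dedicated tool, assuming Python is available on the client side: the standard library's html.unescape decodes HTML character references and urllib.parse.unquote decodes URL-encoded text. The wrapper names and sample strings are made up for illustration.

```python
import html
from urllib.parse import unquote


def decode_html_entities(text: str) -> str:
    """Decode numeric and named HTML character references."""
    return html.unescape(text)


def decode_url_encoding(text: str) -> str:
    """Decode percent-encoded (URL-encoded) text."""
    return unquote(text)


print(decode_html_entities("this is the cool mega string that contains &#47;"))
print(decode_url_encoding("path%2Fto%2Ffile"))
```

Because the decoder parses each reference as a whole, a longer reference such as &#471; is decoded as its own character instead of being partly rewritten.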
It's possible you may want to use a regular expression. They are not a universal solution to all problems but are really handy for some jobs.
If you want to unconditionally replace all instances of "&#47;" with "/", you don't need a regexp.
If you want to replace "&#47;" but not "&#471;", you might need a regexp, because you can do things like match only whole words, match various patterns, specify min/max runs of digits, etc.
In the PostgreSQL string functions and operators documentation you'll find the regexp_replace function, which will let you apply a regexp during an UPDATE statement.
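To illustrate the whole-word point with the earlier cat/dog example, here is a small Python sketch. PostgreSQL's regexp_replace uses its own regex flavor (for instance \y rather than \b for a word boundary, and the 'g' flag to replace every match), so treat this as an illustration of the idea and not as SQL. The function name is hypothetical.

```python
import re


def replace_whole_word(text: str, old: str, new: str) -> str:
    """Replace `old` with `new` only where `old` stands alone as a word."""
    # re.escape keeps any regex metacharacters in `old` literal.
    pattern = re.compile(r"\b" + re.escape(old) + r"\b")
    # A function replacement avoids backslash processing in `new`.
    return pattern.sub(lambda match: new, text)


sentence = "the cat will concatenate"
print(sentence.replace("cat", "dog"))            # plain find-and-replace
print(replace_whole_word(sentence, "cat", "dog"))  # whole words only
```

The plain replace rewrites the "cat" inside "concatenate" as well, which is exactly the side effect warned about above.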
To be able to say much more I'd need to know what your real data is and what you're really trying to do.
If you don't have Postgres, you can export the whole database to a SQL file, replace the string in a text editor, drop the database on your host, and re-import the edited dump.
PS: be careful
I have a cell that contains a list of characters separated by commas, e.g. "1,2,3,4,5". Is it possible to remove a particular element of the list? For example, removing "1" would leave "2,3,4,5", and removing "3" would leave "1,2,4,5". I want to perform this task within SQL, either as a function or a stored procedure. Any help is much appreciated.
Sure, it'd just be some basic string REPLACE() calls: http://msdn.microsoft.com/en-us/library/ms186862.aspx
However, since you have to manipulate individual bits of this data field separately from the rest of the field, it's a good candidate for getting normalized into its own child table.
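One thing to watch with a bare REPLACE is that removing "1" also hits the "1" inside "11" or "21". The sketch below shows the token-based logic in Python (split on commas, drop the matching element, join again). It is only an illustration of what a SQL function would have to do, and the function name is hypothetical.

```python
def remove_item(csv_value: str, item: str) -> str:
    """Remove every element equal to `item` from a comma-separated string."""
    parts = csv_value.split(",")
    # Compare whole elements, so "1" never matches inside "11".
    return ",".join(part for part in parts if part.strip() != item)


print(remove_item("1,2,3,4,5", "1"))
print(remove_item("1,2,3,4,5", "3"))
```

A child table avoids the problem altogether, since removing an element becomes a plain DELETE of one row.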
For some time I've been debating whether columns that may or may not receive data should default to an empty string ('') or just allow NULL.
I would like to hear what the recommended practice is here.
If it makes a difference, I'm using C# as the consuming application.
I'm afraid that...
it depends!
There is no single answer to this question.
As indicated in other responses, at the level of SQL, NULL and empty string have very different semantics: the former indicates that the value is unknown, while the latter indicates that the value is this "invisible thing" (in displays and reports) but is nonetheless a "known value". An example commonly given in this context is that of the middle name. A NULL value in the "middle_name" column would indicate that we do not know whether the underlying person has a middle name or not, and if so what this name is; an empty string would indicate that we "know" that this person does not have a middle name.
This said, two other kinds of factors may help you choose between these options, for a given column.
The very semantics of the underlying data, at the level of the application.
Some considerations in the way SQL works with null values
Data semantics
For example, it is important to know whether the empty string is a valid value for the underlying data. If that is the case, we may lose information if we also use the empty string for "unknown info". Another consideration is whether some alternate value may be used when we do not have info for the column; maybe 'n/a', 'unspecified', or 'tbd' are better values.
SQL behavior and utilities
Considering SQL behavior, the choice of whether to use NULL may be driven by space considerations, by the desire to create a filtered index, or by the convenience of the COALESCE() function (which can be emulated with CASE expressions, but in a more verbose fashion). Another consideration is whether any query may attempt to concatenate multiple columns (as in SELECT name + ', ' + middle_name AS LongName etc.), since a NULL operand typically turns the whole concatenated result into NULL.
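The concatenation and COALESCE() points can be tried out quickly. The sketch below uses Python's built-in sqlite3 module with an in-memory database purely as a stand-in: SQLite concatenates with || where SQL Server uses +, and the table and column names are made up for the example.

```python
import sqlite3


def concat_demo():
    """Show how NULL and '' behave when middle_name is concatenated."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE person (name TEXT, middle_name TEXT)")
    conn.executemany(
        "INSERT INTO person VALUES (?, ?)",
        [("Ann", None), ("Bob", ""), ("Cy", "Lee")],  # None is stored as NULL
    )
    rows = conn.execute(
        """
        SELECT name,
               name || ', ' || middle_name               AS plain,
               name || ', ' || COALESCE(middle_name, '') AS coalesced
        FROM person
        ORDER BY name
        """
    ).fetchall()
    conn.close()
    return rows


for row in concat_demo():
    print(row)
```

For the row with a NULL middle name the plain concatenation comes back as NULL (None in Python), while the COALESCE() version still returns the name.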
Beyond the validity of choosing NULL vs. empty string in a given situation, a general consideration is to try to be as consistent as possible, i.e. to stick to ONE particular way, and to depart from it only purposely and explicitly, for good reasons and in few cases.
Don't use empty string if there is no value. If you need to know if a value is unknown, have a flag for it. But 9 times out of 10, if the information is not provided, it's unknown, and that's fine.
NULL means unknown value. An empty string means a known value - a string with length zero. These are totally different things.
empty when I want a valid default value that may or may not be changed, for example, a user's middle name.
NULL when it is an error if the ensuing code does not set the value explicitly.
However, by initializing strings with the Empty value instead of null, you can reduce the chances of a NullReferenceException occurring.
Theory aside, I tend to view:
Empty string as a known value
NULL as unknown
In this case, I'd probably use NULL.
One important thing is to be consistent: mixing NULLs and empty strings will end in tears.
On a practical implementation level, an empty string takes 2 bytes in SQL Server, whereas NULLs are bitmapped. In some conditions and for wide/larger tables it makes a difference in performance because there's more data to shift around.
I have a command class that abstracts almost all database-specific functions (we have exactly the same application running on MSSQL 2005, using ODBC and the native mssql library, as well as MySQL and Oracle). Sometimes, though, we have problems with our prepare method, which, when executed, replaces all placeholders with their respective values. The problem is that I am using the following:
if (is_array($Parameter['Value']))
{
    $Statement = str_ireplace(':'.$Name, implode(', ', $this->Adapter->QuoteValue($Parameter['Value'])), $Statement);
}
else
{
    $Statement = str_ireplace(':'.$Name, $this->Adapter->QuoteValue($Parameter['Value']), $Statement);
}
The problem arises when we have two or more similar parameter names, for example session_browser and session_browser_version... The first one will partially replace the last one.
Of course, we learned to work around this by specifying the parameters in a specific order, but now that I have some "free" time I want to make it better, so I am thinking of switching to preg_replace... and I am not good at regular expressions. Can anyone help with a regex to replace a string like ':parameter_name'?
Best Regards,
Bruno B B Magalhaes
You should use the \b metacharacter to match the word boundary, so you don't accidentally match a short parameter name within a longer parameter name.
Also, you don't need to special-case arrays if you coerce a scalar Value to an array of one entry:
$Statement = preg_replace(
    '/:' . preg_quote($Name, '/') . '\b/',  // preg_quote keeps $Name literal inside the pattern
    implode(",", $this->Adapter->QuoteValue( (array) $Parameter['Value'] )),
    $Statement);
Note, however, that this can make false positive matches when an identifier or a string literal contains a pattern that looks like a parameter placeholder:
SELECT * FROM ":Name";
SELECT * FROM Table WHERE column = ':Name';
This gets even more complicated when quoted identifiers and string literals can contain escaped quotes.
SELECT * FROM Table WHERE column = 'word\':Name';
You might want to reconsider interpolating variables into SQL strings during prepare, because you're defeating any benefits of prepared statements with respect to security or performance.
I understand why you're doing what you're doing, because not every RDBMS back end supports named parameters, and SQL parameters can't be used for lists of values in an IN() predicate. But you're creating an awfully leaky abstraction.
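If you do keep the interpolation, one way to sidestep the ordering problem is to make a single pass that matches each whole placeholder name and looks it up, instead of running one replacement per parameter. That is a different technique from the per-parameter preg_replace above, sketched here in Python rather than PHP. Both function names are hypothetical, quote_literal is only a toy stand-in for your adapter's QuoteValue, and the false positives inside string literals and quoted identifiers described above still apply.

```python
import re


def quote_literal(value):
    """Toy stand-in for the adapter's QuoteValue; NOT safe for production."""
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


# \w+ is greedy, so :session_browser_version is consumed as one name.
_PLACEHOLDER = re.compile(r":(\w+)")


def substitute_named(statement, params, quote=quote_literal):
    """Replace each :name placeholder with its quoted value in one pass."""
    def replace(match):
        name = match.group(1)
        if name not in params:
            return match.group(0)  # unknown names are left untouched
        value = params[name]
        # Scalars are treated as a one-element list, as in the PHP version.
        values = value if isinstance(value, (list, tuple)) else [value]
        return ", ".join(quote(v) for v in values)
    # A function replacement means values are never parsed for backreferences.
    return _PLACEHOLDER.sub(replace, statement)


sql = "SELECT * FROM s WHERE b = :session_browser AND v = :session_browser_version"
print(substitute_named(sql, {"session_browser": "Firefox",
                             "session_browser_version": "3.5"}))
```

Because every placeholder is resolved by exact name, the order in which parameters were registered no longer matters.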