For the data I'm processing, the field separation character is ':'. I was wondering if there's an FS value that can be set to allow allow fields to contain : by escaping it (':').
From where I've looked, I doesn't think it is possible with AWK's regular expressions set, but I figured I'd ask anyways.
Luckily it isn't necessary for my current project, just curious for future reference.
Related
I know that I should use it when I deal with data of TEXT type (and I guess the ones that fall back to TEXT), but is it the only case?
Example:
UPDATE names SET name='Mike' WHERE id=3
I'm writing an SQL query auto generation in C++, so I want to make sure I don't miss cases, when I have to add quotes.
Single quotes (') denote textual data, as you noted (e.g., 'Mike' in your example). Numeric data (e.g., 3 in your example), object (table, column, etc) names and syntactic elements (e.g., update, set, where) should not be wrapped in quotes.
The single quote is the delimiter for the string. It lets the parser know where the string starts and where it ends as well as that is is a string. You will find that sometimes you get away with a double quote too.
The only way to be certain you don't miss any cases would be to escape the input, otherwise this will be vulnerable to abuse when somehow a single quote ends up in in the text.
I've got a Prolog program where I'm doing some brute force search on all strings up to a certain length. I'm checking which strings match a certain pattern, keeping adding patterns until hopefully I find a set of patterns that covers all strings. I would like to store which ones to a file which don't match any of my patterns, so that when I add a new pattern, I only need to check the leftovers, instead of doing the entire brute force search again.
If I were writing this in python, I would just pickle the list of strings, and load it from the file. Does anybody know how to do something similar in Prolog?
I have a good amount of Prolog programming experience, but very little with Prolog IO. I could probably write a predicate to read a file and parse it into a term, but I figured there might be a way to do it more easily.
If you want to write out a term and be able to read it back later at any time barring variables names, use the ISO built-in write_canonical/1 or write_canonical/2. It is quite well supported by current systems. writeq/1 and write/1 work often too, but not always. writeq/1 uses operator syntax (so you need to read it back with the very same operators present) and write/1 does not use quotes. So they work "most of the time" — until they break.
Alternatively, you may use the ISO write-options [quoted(true), ignore_ops(true), numbervars(false)] in write_term/2 or write_term/3. This might be interesting to you if you want to use further options like variable_names/1 to retain also the names of the variables.
Also note that the term written does not include a period at the end. So you have to write a space and a period manually at the end. The space is needed to ensure that an atom consisting of graphic characters does not clobber with the period at the end. Think of writing the atom '---' which must be written as --- . and not as ---. You might write the space only in case of an atom. Or an atom that does not "glue" with .
writeq and read make a similar job, but read the note on writeq about operators, if you declare any.
Consider using read/1 to read a Prolog term. For more complex or different kinds of parsing, consider using DCGs and then phrase_from_file/2 with SWI's library(pio).
I have generated the Create statement for a SQL Server view.
Pretty standard, although there is a some replacing happening on a varchar column, such as:
select Replace(txt, '�', '-')
What the heck is '�'?
When I run that against a row that contains that character, I am seeing the literal '?' being replaced.
Any ideas? Do I need some special encoding in my editor?
Edit
If it helps the end point is a Google feed.
You need to read the script in the same encoding as that in which it was written. Even then, if your editor's font doesn't include a glyph for the character, it may still not display correctly.
When the script was created, did you choose an encoding, or accept the default? If the later, you need to find out which encoding was used. UTF-8 is likely.
However, in this case, the character may not be a mis-representation. Unicode replacement character explains that this character is used as a replacement for some other character that cannot be represented. It's possible in your case that the code you are looking at is simply saying, if we have some data that could not be represented, treat it as a hyphen instead. In other words, this may be nothing to do with the script generation/viewing process, but rather a deliberate piece of code.
I'm thinking about writing a templating tool for generating T-SQL code, which will include delimited sections like below;
SELECT
~~idcolumn~~
FROM
~~table~~
WHERE
~~table~~.flag = 1
Notice the double-tildes delimiting bits? This is an idea for an escape sequence in my templating language. But I want to be certain that the escape sequence is valid -- that it will never occur in a valid T-SQL statement. Problem is, I can't find any official microsoft description of the T-SQL language.
Does anyone know of an official specification for the T-SQL language, or at least the lexing rules? So I can make an informed decision about the escape sequence.
UPDATES:
Thanks for the suggestions so far, but I'm not looking for confirmation of the '~~' escape sequence per se. What I need is a document I can reference I can point to and say 'microsoft says this character sequence is totally impossible in T-SQL.' For instance, microsoft publish the language specification for C# here which includes a description of what characters can go into valid C# programs. (see page 67 of the pdf.) I'm looking for a similar reference.
The double-tilde: "~~" is actually perfectly good T-SQL. For instance; "(SELECT ~~1)" returns '1'.
There are several well known and often used formats for template parameters, one example being $(paramname) (also used in other scripts as well as T-SQL scripts)
Why not use an existing format?
It doesn't matter if ~~ is legal TSQL or not, if you provide an escape for producing ~~ in actual TSQL when you need it.
Since template parameters have to have a nonzero-length identifier, you have a peculiar case where the identifier length is ridiculously "zero", e.g., ~~~~. This kind of thing makes an ideal escape sequence, since it is useless for anything else. Simply process your template text; whenever you find ~~~~ replace it by the named parameter string, and whenever you find ~~~~ replace it by ~~. Now, if ~~ is needed in the final TSQL, just write ~~~~ in your template.
I suspect that even if you do this, that the number of times you'll actually write ~~~~ in practice will be close to zero, so the reason for doing it is theoretical completeness and giving you a warm fuzzy feeling that you can write anything in a template.
Well, I'm not sure about a complete description of the language, but it appears that ~~ could occur in an identifier provided that it is quoted (in brackets, typically).
You may have more luck with a convention saying you don't support identifiers with ~~ in them. Or, just reserve your own lexical symbols and don't worry about ~~ occurring elsewhere.
You could treat quoted literals and strings as content, regardless if they contain your escape-sequence. It would make it more robust.
Run the text trough a lexer, to separate each token. If the token is a string or a quoted literal, treat it as such. But if it is a literal that begins and ends with ~~, you can safely assume it is a template placeholder.
I'm not sure you'll find something that will never occur in a valid statement. Consider:
DECLARE #TemplateBreakingString varchar(100) = '~~I hope this works~~'
or
CREATE TABLE [~~TemplateBreakingTable~~] (IDField INT Identity)
Your escape sequence can occur in string literals, but that is all. That said, Microsoft owns t-sql, and they are free to do anything they want with it moving forward for future versions of sql server. Still, I think ~~ is safe enough.
Lately I have been doing a security pass on a PHP application and I've already found and fixed one XSS vulnerability (both in validating input and encoding the output).
How can I query the database to make sure there isn't any malicious data still residing in it? The fields in question should be text with allowable symbols (-, #, spaces) but shouldn't have any special html characters (<, ", ', >, etc).
I assume I should use regular expressions in the query; does anyone have prebuilt regexes especially for this purpose?
If you only care about non-alphanumerics and it's SQL Server you can use:
SELECT *
FROM MyTable
WHERE MyField LIKE '%[^a-z0-9]%'
This will show you any row where MyField has anything except a-z and 0-9.
EDIT:
Updated pattern would be: LIKE '%[^a-z0-9!-# ]%' ESCAPE '!'
I had to add the ESCAPE char since you want to allow dashes -.
For the same reason that you shouldn't be validating input against a black-list (i.e. list of illegal characters), I'd try to avoid doing the same in your search. I'm commenting without knowing the intent of the fields holding the data (i.e. name, address, "about me", etc.), but my suggestion would be to construct your query to identify what you do want in your database then identify the exceptions.
Reason being there are just simply so many different character patterns used in XSS. Take a look at the XSS Cheat Sheet and you'll start to get an idea. Particularly when you get into character encoding, just looking for things like angle brackets and quotes is not going to get you too far.