Conceptual / Rule Implementation / String Manipulation - vb.net

So I'm working on a piece of software in VB.Net where I need to scrape information and process it according to rules. For example, simple string-replace rules like turning "Det" into "Detached" for a specific field, or a split/join rule; basic string operations. All my scraping rules are RegEx, and I store them in a database in rows of rule sets for different situations.
What is the best approach to creating/storing rules to manipulate text? I do not want to hardcode the rules into the software, but rather be able to add more as the need arises. I want to store them in a database, but then how do I interpret them? I'm assuming I would have to create a whole system to interpret them, like a rules engine? Maybe you can give me a different outlook on this problem.

I've written rules engines before. They are usually a Bad Idea (tm).
I would consider writing the rules in your application code and leaving the database and rules engine out of it. First, a rules engine often obscures intent: it's hard to see exactly what is going on when you come back in a couple of months for a maintenance patch. Second, VB (or C#, or any other language you choose) contains a more appropriate vocabulary for defining rules than anything you will likely have time to implement. Trust me, XML is a poor representation of rules. Lastly, non-programmers won't be able to write regex anyhow, so you aren't gaining anything for all your added complexity.
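To make that concrete, here's a minimal sketch of what rules-in-code could look like in VB.Net; the field names and the rules themselves are invented for illustration:
Imports System.Text.RegularExpressions

Module FieldRules
    ' Each rule is just a String -> String function keyed by field name.
    Private ReadOnly Rules As New Dictionary(Of String, Func(Of String, String)) From {
        {"PropertyType", Function(s) s.Replace("Det", "Detached")},
        {"Postcode", Function(s) Regex.Replace(s.Trim().ToUpper(), "\s+", " ")},
        {"Keywords", Function(s) String.Join(", ", s.Split(";"c))}
    }

    Public Function Apply(fieldName As String, value As String) As String
        Dim rule As Func(Of String, String) = Nothing
        ' Fields with no registered rule pass through unchanged.
        If Rules.TryGetValue(fieldName, rule) Then Return rule(value)
        Return value
    End Function
End Module
Adding or changing a rule is a one-line code change, and the intent stays obvious when you come back for that maintenance patch.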
You can mitigate most of the deployment headaches by using ClickOnce deployment.
Hope that helps.


General strategy for designing a Flexible Language application using ANTLR4

Requirement:
I am trying to develop a language application using antlr4. The language in question is not important. The important thing is that the grammar is very vast (easily >2000 rules!!!). I want to perform a number of operations:
Extract a bunch of information: call graphs, variable names, constant expressions, etc.
Any number of transformations:
If a loop can be expanded, we go ahead and expand it.
If we can eliminate dead code, we might choose to do that.
We might choose to rename all variable names to conform to some norms.
Each of these operations can be applied independently of the others, and after applying these steps I want to rewrite the input as close as possible to the original input.
e.g. So we might want to eliminate loops and rename the variable and then output the result in the original language format.
Questions:
I see a need to build a custom tree (read: AST) for this, so that I can modify the tree with each of the transformations. However, when I want to generate the output, I lose the nice abilities of the TokenStreamRewriter: I have to specify how to write each of the nodes of the tree, and I lose the original input formatting for the places where I didn't do any transformations. Does antlr4 provide a good way to get around this problem? (A sketch of the rewriter pattern I mean appears after these questions.)
Is an AST the best way to go, or do I build my own object representation? If so, how do I create that object efficiently? Creating an object representation is a very big pain for such a vast language, but it may be better in the long run. Again, how do I get back the original formatting?
Is it possible to work just on the parse tree?
Are there similar language applications which do the same thing? If so what strategy do they use?
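For concreteness, the TokenStreamRewriter usage I'd hate to lose looks roughly like this (VB.NET against the ANTLR4 runtime; the grammar classes and start rule are placeholders, not a real grammar):
Imports Antlr4.Runtime

Sub RewriteInPlace(source As String)
    ' MyLangLexer/MyLangParser stand in for whatever your grammar generates.
    Dim lexer As New MyLangLexer(New AntlrInputStream(source))
    Dim tokens As New CommonTokenStream(lexer)
    Dim parser As New MyLangParser(tokens)
    Dim tree = parser.compilationUnit()   ' placeholder start rule

    Dim rewriter As New TokenStreamRewriter(tokens)
    ' A visitor/listener walking the tree issues token-level edits, e.g.:
    ' rewriter.Replace(idToken, idToken, "newName")
    ' Every token not touched by an edit keeps its original text and spacing:
    Console.WriteLine(rewriter.GetText())
End Sub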
Any input is welcome.
Thanks in advance.
In general, what you want is called a Program Transformation System (PTS).
PTSs generally have parsers, build ASTs, can prettyprint the ASTs to recover compilable source text. More importantly, they have standard ways to navigate/inspect/modify the ASTs so that you can change them programmatically.
Many offer these capabilities in the form of pattern-matching code fragments written in the surface syntax of the language being transformed; this avoids forever having to know excruciatingly fine details about which nodes are in your AST and how they relate to their children. This is incredibly useful when you have big, complex grammars, as most of our modern (and legacy) languages all seem to have.
More sophisticated PTSs (very few) provide additional facilities for teasing out the semantics of the source code. It is pretty hard to analyze/transform most code without knowing what scopes individual symbols belong to, or their type, and many other details such as data flow. Full disclosure: I build one of these.

Optimal set of characters to disallow/allow that prevents SQL injection perfectly?

I know that SQL injection must rely on a certain set of special characters used in SQL. So here's the question:
What would be the minimal set of characters to disallow (or the largest possible set to allow) that would prevent SQL injection perfectly?
I understand this is neither the best nor the easiest approach to preventing SQL injection, but I just wonder about questions like "can you do it even without xxx".
Answering a comment:
Just for the purpose of curiosity: some languages other than English can indeed be written normally without some special characters. For example:
Chinese:“”‘’,。!?:;——
Japanese:””’’、。!?:;ーー
English:""'',.!?:;--
You may want to look up some examples of the ways a SQL injection attack can be leveraged against a web application with weak security architecture. As for the "why", you'll need to analyze the kinds of user hacks that can get through what's supposed to be a locked-down (but public-facing) system.
I turned up this reference for further analysis; you can see more at the source:
http://www.unixwiz.net/techtips/sql-injection.html
I know that SQL injection must rely on a certain set of special characters used in SQL. So here's the question:
There isn't any one character that is absolutely banned, nor are there others that can be trusted as 100% safe.
EDIT: Response to a revised OP...
A brief history of a type of SQL injection vulnerability: without developing data access APIs for your database inputs, the most common approach is "dynamically" constructing SQL statements, then executing the raw SQL (a dangerous mix of developer-prepared code and outside user input).
Consider in pseudocode:
#SQL_QUERY = "SELECT NAME, SOCIAL_SECURITY_NUMBER, PASSWD, BANK_ACCOUNT_NUMBER" +
"FROM SECURE_TABLE " + "WHERE NAME = " + #USER_INPUT_NAME;
$CONNECT_TO SQL_DB;
EXECUTE #SQL_QUERY;
...
The expected behavior, when the input is well-behaved, is a lookup of specific user information based on a single name value entered on some user form.
While it isn't my point to accidentally teach a how-to for a new generation of would-be SQL injection hackers, you can guess that a long concatenated string of code isn't really secure if users can tack anything they like onto it from a web form, including more code of their own.
The utility or danger of dynamic SQL coding is a discussion by itself. You can put a few barriers in front of a dynamic query such as this:
An "input handler" procedure that all inputs from app forms go through before the query code module is executed:
run_query( parse_and_clean( #input_value ) )
This is where you look for things like string inputs too long for expected sizes (a 500-character name? non-alphanumeric name values?).
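A rough VB.NET sketch of such a handler; the length limits and the allowed-character pattern are purely illustrative, not a complete defense:
Imports System.Text.RegularExpressions

Function ParseAndClean(inputValue As String) As String
    If inputValue Is Nothing Then Throw New ArgumentNullException(NameOf(inputValue))
    Dim trimmed = inputValue.Trim()
    ' A 500-character "name" is almost certainly not a name.
    If trimmed.Length = 0 OrElse trimmed.Length > 100 Then
        Throw New ArgumentException("Name length out of expected range.")
    End If
    ' Allow letters, spaces, hyphens, and apostrophes only (illustrative).
    If Not Regex.IsMatch(trimmed, "^[\p{L} '\-]+$") Then
        Throw New ArgumentException("Name contains unexpected characters.")
    End If
    Return trimmed
End Function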
Web forms can also have their own scripted input handlers, usually intended for validating specific inputs (not empty, longer than three characters, no punctuation, format masks like "1-XXX-XXX-XXXX", etc.). They can also scrub for input hacks such as HTML tags, JavaScript, URLs with remote calls, etc. Sometimes this need is best served here, because the offending input is caught before it gets anywhere near the database.
False positives: more often than not it really is just user error sending malformed input.
These examples date to the really early days of the Internet. There are lots of source code modules/libraries, add-ons, open-source and paid variants of attempts to make SQL injection much harder to break through into your host systems. Dig around; there is even a huge thread on Stack Overflow that discusses this exact topic.
For the time I have been stumbling around and developing for the Internet, I share a common observation: "perfectly" is a description of a system that has already been hacked. You won't get there, and that isn't the point. You should look towards not being the "easy, quick target" (i.e., the Benz in the gas station with the windows down, keys in the ignition, and the owner in the restroom).
When you say "can you do it even without...?" The answer is "sure" if it's a "Hello World" type project for some classroom assignment... hosted on your local machine... behind your router/firewall. After all, you gotta learn somewhere. Just be smart where you deploy stuff that isn't so secured.
It's just when you put things out open to the "world" on the Internet, you should always be thinking how the "same task" has to be done differently with a security mindset. Same outputs as your test/dev workspace but with more precautions such as locking down obvious and commonly exploited weaknesses.
Enjoy your additional research, and hopefully this post (devoid of code examples) can encourage your mindset in helpful directions.
Onward.
I know that SQL injection must rely on a certain set of special characters used in SQL. So here's the question:
What would be the minimal set of characters to disallow (or the largest possible set to allow) that would prevent SQL injection perfectly?
This is the wrong approach to this topic. Instead you need to understand why and how SQL injections happen. Then you can work out measures which prevent them.
The cause of SQL injections is the incorporation of user-supplied data into a SQL command. If the user-supplied fragments are not processed properly, they can modify the intention of the SQL command.
Here the aspect of proper processing is crucial, and it depends on the developer's intention of how the user-provided data is supposed to be interpreted. In most cases it's a literal value (e.g., string, number, date, etc.), sometimes a SQL keyword (e.g., sort order ASC or DESC, operators, etc.), and occasionally even an identifier (e.g., a column name).
These all follow certain syntax rules, against which the user-provided data can be validated and/or which can be enforced:
Prepared statements are the first choice for literal values. They separate the SQL statement and the parameter values so that the latter can't be mistakenly interpreted as something other than literal values (a sketch follows below).
Whitelisting can be used whenever the set of valid values is limited, which is the case for SQL keywords and identifiers.
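As a minimal ADO.NET sketch of the prepared-statement option (the table and column names are assumptions):
Imports System.Data.SqlClient

Sub PrintMatchingNames(connectionString As String, userInputName As String)
    Using conn As New SqlConnection(connectionString)
        ' @name travels as data; it can never change the command's structure.
        Using cmd As New SqlCommand("SELECT NAME FROM SECURE_TABLE WHERE NAME = @name", conn)
            cmd.Parameters.AddWithValue("@name", userInputName)
            conn.Open()
            Using reader = cmd.ExecuteReader()
                While reader.Read()
                    Console.WriteLine(reader.GetString(0))
                End While
            End Using
        End Using
    End Using
End Sub
And whitelisting a keyword can be as simple as Dim direction = If(userInput = "DESC", "DESC", "ASC").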
So looking at the use cases for user provided data and the corresponding options that prevent SQL injections, there is no need to restrict the set of allowed characters globally. Better stick to the methods which are best practice and have proved effective.
And if you happen to have a use case that doesn't fit the ones mentioned, e.g., allowing users to provide something more complex than atomic language elements, like an arbitrary filter expression, you should break it down into its atoms to be able to ensure their integrity.

Dynamic Methods/Rules

I need to create a product configurator, but according to the requirements, literally every product has a set of rules to validate it. The rules refer to the quantity of underlying components the configuration is made of.
At the moment the way this is being handled is just storing the "formula" string in the DB, and since the UI is in Excel, when you call up a configuration it comes with the rules as well, and you just prepend an "=" to them. Thus the final product keeps working when quantities or components change.
So I've seen a few similar types of questions being asked, and the answer always seemed to be UJS; however, that is stored in the app itself, correct? The challenge for me is to find a way to replicate these rules depending on the product, and different products are being added and changed all the time, so keeping the rules in the app and redeploying each time you want to change something seems a bit extreme!
Can anyone think of a good solution? Help!
These are business rules so they'll need to be stored somewhere server-side (they could be mirrored in client-side code but storing them there exclusively is highly inadvisable ;). They could be represented as code, or configuration files, or in a database.
As you suggested, modeling frequently changing rules in source code makes your app brittle. I think the best storage option is your database.
If you need to implement a client-side behavior on a per-product basis, you could use AJAX to send a product ID out to a service which returns a configuration "package" to your (dumb) client.
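If the rules stay simple arithmetic/comparison formulas, one hedged option on the .NET side is to evaluate the stored string with DataColumn.Expression, so adding a rule never requires a redeploy. The formula syntax and component names below are assumptions:
Imports System.Data

' Validate a configuration against a rule string loaded from the DB.
Function RuleHolds(formula As String, quantities As Dictionary(Of String, Double)) As Boolean
    Dim t As New DataTable()
    For Each kv In quantities
        t.Columns.Add(kv.Key, GetType(Double))
    Next
    Dim ok As DataColumn = t.Columns.Add("RuleOk", GetType(Boolean))
    ok.Expression = formula          ' e.g. "Legs + 4 * Shelves >= 8"
    Dim row = t.NewRow()
    For Each kv In quantities
        row(kv.Key) = kv.Value
    Next
    t.Rows.Add(row)
    Return CBool(row("RuleOk"))
End Function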
Would that work? Sounds good to me, anyway. ;)

There has to be a better way to do localized database fields

So far there've been several questions regarding this, and they've all come down to the same answer: one table for the language-neutral data, 1-* to a table with the translations and an indexed language ID field.
This has several problems:
Twice as much CRUD.
Need for Ajax CRUD if you want a decently friendly web UI.
More than twice the validation -- you need to ensure that the relationship is 1-* rather than 0-*.
Collation differences between languages aren't accommodated.
Queries require joins.
If you want slugs in multiple languages, oh boy.
A lot of database people have worked on all sorts of theoretical and practical problems, but surprisingly few people work on this one.
I think what we need ultimately is:
A field type that'll store multiple versions of strings
Multiple indices for each such field, one for each language or variation, with the option to specify the correct collation mode
A standard ORM object for this crazy thing
UI elements
Overkill? Sure, maybe, but the whole problem is a real nightmare as it is. And it's not exactly an uncommon scenario.
We gotta try to convince server vendors to work on this.
Edit: By the way, this is my first time using the community wiki; hopefully I'm doing it right.
Edit 2: Something about my wording seems to have made people think that I'm attacking the very concept of DBMS. I'm not; I'm simply saying that built-in support for localization is a much-needed feature.
I probably shouldn't have mentioned performance; it's of course completely negligible most of the time. The focus of my concern is on the fact that this really stifles productivity.
I'll provide an example. Suppose I have a very trivial table for a decidedly trivial store:
Products (id, price, description, name, slug)
In EF/MVC, I'd throw this in the ORM designer, maybe encapsulate it in a repository, build a Products controller, and have actions for Index, Details, Create, Update, Edit and Delete. To identify a product in any of the items, I'd simply do a WHERE(slug = #slug). I'd make a view model for the create/edit actions, design the form control, and wire it up straight to the repository. Done and done. To access the details for a product, the user would go to /products/details/product-slug.
But then since the rest of the website is bilingual, I decide to change the products table accordingly.
Products (id, price)
ProductsText (productId, language, description, name, slug)
Hey, that's not so bad. Yeah, not yet. Then you write your relationships and your constraints, and then you write out all your properties in the view-model, and then you make a complete CRUD controller for the ProductsText data or use jQuery/Ajax to add create/update/edit buttons on your Products controller, and then you add validation logic to make sure the user enters at least the primary language, and then when you want to read data for the end-user pages you write another query to join ProductsText.slug and ProductsText.language with Products... I probably missed something, but you get the idea.
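In entity terms, the split model looks roughly like this (class and property names assumed):
Public Class Product
    Public Property Id As Integer
    Public Property Price As Decimal
    Public Property Texts As ICollection(Of ProductText)
End Class

Public Class ProductText
    Public Property ProductId As Integer
    Public Property Language As String    ' e.g. "en", "nl"
    Public Property Name As String
    Public Property Description As String
    Public Property Slug As String
End Class
And every one of those ProductText properties then needs its slot in the view-model, the validators, and the queries.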
The complexity of the program just explodes with boilerplate code once you have localization involved.
Of course, I don't expect the problem to be solved completely, and it's obviously just as much a UI problem as it is a database problem. But there's just so much that could be done to make all this easier. A "multistring" field type might be a really good start.
Edit 3: Anyone ever hear of SQL Server Modeling Services? It has some localization tools in it that could be a step in the right direction. Still CTP though.
-- Simulate the French locale with the SET LANGUAGE statement.
SET LANGUAGE French
SELECT Id, CountryName,
       [System.Globalization].[SessionsString](CountryName, 1) AS CountryNameString
FROM [Location].[CountriesTable]
What is a localized database field?
Typically, in applications we've worked on, it is the UI that is localized. This is accomplished using a database: we put all the translations (and potentially the master phrases) in a table with a locale code and phrase ID as the primary key. This is fairly straightforward, requires a single reusable set of stored procs, has good performance, and the usage is well understood. We often allow translation on the fly, so that the app interface includes a translation feature where corrections can be made and other users will see them live; this works for either rich-forms or web-forms applications (depending on caching, which is another key feature of UI localization).
As far as querying requiring joins - that's just a fact of life in a normalized relational database, and performance there is usually managed with a good normalized design and proper indexing.
In other "data", it has made little sense to localize except under direction of the application requirements. For instance, even though you may offer a product in multiple countries, the SKU and distributor may be different. This level of localization is very application specific and we often dealt with it as a separate database and there really isn't anything tying those individually country database together - many products were not available although there may have been equivalent products in the other countries.
If you are selling the same products around the world, then you kind of fall into the original scenario in a kind of multi-lingual CMS. This requires significant work besides the low-level database. For instance, if someone corrects the default product description, what flags the translators that the translations need to also be corrected? These questions are non-trivial. Although I can see where database vendors could assist with features, these are intrinsic difficulties of application requirements and design and not necessarily something the database can add features which will universally solve.
The collation issue is indeed a little awkward. Typically data is stored in nvarchar, and you would not know the collation you want for retrieval at the time you write the stored proc, since the locale is a parameter. This only affects result sets which need to be ordered by content, not usually by natural key and certainly not retrieval by key, so it's not a large problem, but it is one which cannot easily be handled without dynamic SQL (casting using the preferred collation depending on the locale passed in). And if you mix data from different locales in the same result set, you have to decide whether to sort by locale first, and it may be difficult to pick a single collation which works properly across all locales in that result set. You are probably going to want to use a Windows collation with such a wide variety of data.
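For what it's worth, the dynamic-SQL workaround can at least be kept safe by whitelisting the collation name, since T-SQL requires it to be a literal. A sketch, with an assumed requestedCollation variable and assumed table names:
' The collation must be a literal in T-SQL, so validate it against a
' whitelist before splicing it into the statement.
Dim allowed As New HashSet(Of String) From {"French_CI_AS", "Latin1_General_CI_AS"}
If Not allowed.Contains(requestedCollation) Then
    Throw New ArgumentException("Unknown collation.")
End If
Dim sql = "SELECT Name FROM ProductsText WHERE Language = @loc " &
          "ORDER BY Name COLLATE " & requestedCollation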
Similarly with ORMs, we typically treated the composite unique key of locale/phraseid as the key to retrieve objects (we typically also had a surrogate identity primary key) - I know that traditional ORMs don't necessarily like this departure from retrieval by a meaningless surrogate key.
I've encountered all of these issues for localized CRM-style web sites. Not fun to design and optimize, but it can be done. My 2¢ worth:
1. Twice as much CRUD.
This depends on how your CRUD is designed. Any of my stored procedures or functions that can retrieve a possibly-localized field take a locale/culture code parameter. All of these fields are also NVARCHAR to avoid encoding issues.
2. Need for Ajax CRUD if you want a decently friendly web UI.
I suppose so, but this is application-dependent. Should defer to the "internal" CRUD (DRY principle).
3. More than twice the validation -- you need to ensure that the relationship is 1-* rather than 0-*.
This also assumes that all content is required in all supported locales, instead of using a fallback mechanism. For example, Microsoft's MSDN content is available in multiple locales, but some is in only one (generally this is US English, the "neutral" locale for Microsoft).
For a CRM-style system, any locale can be used for the initial content as long as the fallback uses that if the neutral content is not available.
4. Collation differences between languages isn't accommodated.
I find that it is easier to put all collation support at the UI/reporting layer. Multilingual-aware tables with collation/locale specified on a row-by-row basis would be a very nice-to-have feature but I wouldn't like to wait for it to become available...
5. Queries require joins.
Yes, this definitely makes the query a bit more complicated :-) but there's no real way around that. It can get even more complicated if locale fallback is included (a "locale specificity" ranking field helps here; see the sketch below).
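A sketch of what such a fallback read might look like in ADO.NET (table, column, and locale names are assumptions):
Imports System.Data.SqlClient

' Prefer the requested locale; fall back to the neutral one.
Sub LoadProductText(conn As SqlConnection, productId As Integer, locale As String)
    Dim sql As String =
        "SELECT TOP (1) Name, Description, Slug " &
        "FROM ProductsText " &
        "WHERE ProductId = @id AND Language IN (@locale, @neutral) " &
        "ORDER BY CASE WHEN Language = @locale THEN 0 ELSE 1 END"
    Using cmd As New SqlCommand(sql, conn)
        cmd.Parameters.AddWithValue("@id", productId)
        cmd.Parameters.AddWithValue("@locale", locale)
        cmd.Parameters.AddWithValue("@neutral", "en")   ' assumed neutral locale
        ' Execute and map as usual...
    End Using
End Sub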
6. If you want slugs in multiple languages, oh boy.
This is the reason that the .NET replacement parameters in the format string were designed to be indexed, not positional (printf(), etc. are positional). An English format may need replacements in 1, 2, 3 order, while the German equivalent uses 3, 1, 2.
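A quick VB.NET illustration (the format strings are invented for the example):
Dim en = "Deleted {0} files from {1} in {2} ms."
Dim de = "Aus {1} wurden in {2} ms {0} Dateien gelöscht."
' The same arguments, reordered freely by each translation:
Console.WriteLine(String.Format(en, 3, "C:\temp", 42))
Console.WriteLine(String.Format(de, 3, "C:\temp", 42))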
To make life easier for localizers, whenever I create a .NET resource bundle I document the parameters including index, data type (including minimum and/or maximum string lengths), and a contextual description - context is important for determining text gender in some locales.
Plurality may also require multiple related resources as some locales need more than just "single" and "plural" (e.g. "0 files", "1 file", "2 files").
The same rules must apply to any localizable column in the database.
Well, the answers are not that helpful so far. I had the same problem on various projects I worked on in the past, and there was never a shortcut or out-of-the-box solution that helped me solve this problem in an easy way. But your approach is going in the right direction, and with a little work on your data access layer you can actually abstract away all the burden caused by this requirement.
So for metadata like types, categories, countries, etc., performance is not an issue, since the whole lot can be cached. For free-text entries it is a different story: you most probably can't cache them, and they tend to be quite long.
You might already know those pages:
http://www.codeproject.com/KB/aspnet/LocalizedSamplePart2.aspx
http://www.sisulizer.com/online-help/DatabaseLocalization.shtml
Best-practices for localizing a SQL Server (2005/2008) database
In my experience I haven't commonly run into the problem where the data stored in the database has many language-dependent versions of the same text. Typically a developed application will have many language files for all the text that's more or less statically built into the application; the database then holds the text users enter. While an application may be used by users speaking many different languages, the situation where users type the same text in multiple languages is not so common. Typically, users of an application will see the UI in their language and then enter and view data in their language.
For example, users of our application in the US vs in Netherlands or Saudi Arabia would see the UI in the language of their choice, but for any given installation, the data they enter will consistently be in their native language.
Obviously this doesn't apply to all cases. CRMs are an example where you would have the same text with multiple translations, like Wikipedia, but I think what I described above is the more common scenario.
"A lot of database people have worked on all sorts of theoretical and practical problems, but surprisingly few people work on this one."
That's because there is nothing to work on, from a theoretical perspective, in your example. The so-called "problems" you mention are, all of them, nothing more than a direct consequence of the fact that you are managing more data.
"Twice as much CRUD."
And why is that a problem? I know of at least a few systems I built that had a lot more of that than your example.
"Need for Ajax CRUD if you want a decently friendly web UI."
Is that really so? I don't know, but at any rate, how data is handled in the presentation layer is no concern of the DBMS, and if the programmer thinks it is too difficult/cumbersome, then don't blame the DBMS for that.
"More than twice the validation -- you need to ensure that the relationship is 1-* rather than 0-*."
And why is that a problem? If more business rules are stated, more validation is required.
"Collation differences between languages isn't accommodated."
How so? What is the sense of collating English text with French? Of English text with Ukrainian or Russian or Chinese? Or did you mean something else?
"Queries require joins."
And why is that a problem?
"If you want slugs in multiple languages, oh boy."
In what context? For what purpose?
SELECT language, nllabel FROM ...
NATURAL JOIN (SELECT 'EN' AS language UNION SELECT 'FR' AS language)
Oh but wait, I forgot ... JOINs are also a problem.
"and it's obviously just as much a UI problem as it is a database problem."
I disagree that it is. When looking at your problem from a database angle, there are two things that might possibly be a small beginning of a solution:
the possibility to do full view updating (both through JOIN and through GROUP, for your case).
the possibility to have attributes of type 'table' inside database tables. You could then have the entire set of applicable localized names as a single attribute in a single row for your product/...
As for full view updating: don't hold your breath. You'll suffocate long before it arrives.
As for nested tables: they might already exist; if anyone has them, Oracle will. I don't really know, but I'm not really confident that they will make life easier on the UI side of things.
Oh, and BTW: SQL is nowhere near "theoretically pure".

Catching SQL Injection and other Malicious Web Requests

I am looking for a tool that can detect malicious requests (such as obvious SQL injection GETs or POSTs) and will immediately ban the IP address of the requester / add it to a blacklist. I am aware that in an ideal world our code should be able to handle such requests and treat them accordingly, but there is a lot of value in such a tool even when the site is safe from these kinds of attacks, as it can lead to saving bandwidth, preventing analytics bloat, etc.
Ideally, I'm looking for a cross-platform (LAMP/.NET) solution that sits at a higher level than the technology stack; perhaps at the web-server or hardware level. I'm not sure if this exists, though.
Either way, I'd like to hear the community's feedback so that I can see what my options might be with regard to implementation and approach.
You're almost looking at it the wrong way; no third-party tool that isn't aware of your application's methods/naming/data/domain is going to be able to protect you perfectly.
Something like SQL injection prevention has to be in the code, and is best written by the people who wrote the SQL, because they are the ones who know what should and shouldn't be in those fields (unless your project has very good docs).
You're right, this has all been done before. You don't quite have to reinvent the wheel, but you do have to carve a new one because of the differences in everyone's axle diameters.
This is not a drop-in-and-run problem; you really do have to be familiar with what exactly SQL injection is before you can prevent it. It is a sneaky problem, so it takes equally sneaky protections.
These two links taught me far more than the basics on the subject and helped me better phrase my future lookups on specific questions that weren't answered:
SQL injection
SQL Injection Attacks by Example
And while this one isn't quite a 100% finder, it will "show you the light" on existing problems in your existing code. But as with web standards, don't stop coding once you pass this test.
Exploit-Me
The problem with a generic tool is that it is very difficult to come up with a set of rules that will only match against a genuine attack.
SQL keywords are all English words, and don't forget that the string
DROP TABLE users;
is perfectly valid in a form field that, for example, contains an answer to a programming question.
The only sensible option is to sanitise the input before ever passing it to your database but pass it on nonetheless. Otherwise lots of perfectly normal, non-malicious users are going to get banned from your site.
One method that might work for some cases would be to take the SQL string that would run if you naively used the form data and pass it to some code that counts the number of statements that would actually be executed. If it is greater than the number expected, then there is a decent chance that an injection was attempted, especially for fields that are unlikely to include control characters, such as username.
Something like a normal text box would be a bit harder since this method would be a lot more likely to return false positives, but this would be a start, at least.
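A rough sketch of that counting idea; it's a tripwire, not a parser, and certainly not a defense on its own:
' Count the statements the naively concatenated SQL would contain by
' counting semicolons outside string literals. Crude by design.
Function LooksLikeInjection(naiveSql As String, expectedStatements As Integer) As Boolean
    Dim inString As Boolean = False
    Dim statements As Integer = 1
    For Each ch As Char In naiveSql
        If ch = "'"c Then
            inString = Not inString
        ElseIf ch = ";"c AndAlso Not inString Then
            statements += 1
        End If
    Next
    ' A trailing semicolon inflates the count; tune for your dialect.
    Return statements > expectedStatements
End Function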
One little thing to keep in mind: in some countries (e.g., much of Europe), people do not have static IP addresses, so blacklisting should not be permanent.
Oracle has got an online tutorial about SQL Injection. Even though you want a ready-made solution, this might give you some hints on how to use it better to defend yourself.
Now that I think about it, a Bayesian filter similar to the ones used to block spam might work decently here too. If you put together a set of normal text for each field and a set of SQL injections, you might be able to train it to flag injection attacks.
One of my sites was recently hacked through SQL injection. It added a link to a virus for every text field in the DB! The fix was to add some code looking for SQL keywords. Fortunately, I develop in ColdFusion, so the code sits in my Application.cfm file, which runs at the beginning of every page and looks at all the URL variables. Wikipedia has some good links to help, too.
Interesting how this was implemented years later by Google, removing the URL altogether in order to prevent XSS attacks and other malicious activities.