What's the difference between a command and a statement? - language-design

Often when reading about Tcl (e.g. http://antirez.com/articoli/tclmisunderstood.html) you read that "everything is a command". Sometimes you also hear how other languages are, like Tcl, "command languages."
To me with my background in other languages, I just view these "commands" as statements. What precisely is the difference between a command and a statement?

Traditionally, in Tcl, the phrase "everything is a command" means that there's no such thing as a "reserved word" command, or one that is defined by the system that you can't change. Every single executable piece of Tcl code is of the format:
command ?arg1? ... ?argN?
There's no such thing as a command that's part of the syntax and can't be overwritten (like "if" and other control structures in most languages). It's entirely possible to redefine the "if" command to do something slightly different.
For example, you could redefine "while" as:
proc while {expression body} {
    # Evaluate both the condition and the body in the caller's scope.
    for {} {[uplevel 1 [list expr $expression]]} {} {
        uplevel 1 $body
    }
}
The above code is untested, but it shows the general idea.

A command is what other languages call a function, routine or reserved word, and can be defined by the "proc" command or in C or whatever. A statement is an invocation of a command. Using traditional definitions, a statement in Tcl is a command followed by zero or more arguments.
Consider the following code:
1 proc hello {string} {
2 puts "hello, $string"
3 }
4 hello "world"
Line 1 defines a command named "hello", line 4 is a statement that calls the "hello" command.
True enough, some articles on Tcl use the term "command" to mean both the name of a command and the invocation of the command. I wouldn't get too hung up on that. Think of statements as the invocation of a command and you'll be fine.
When you see the phrase "everything is a command", it means that there are no reserved words. Things that look like language syntax or keywords -- if, return, exit, break, while and so on -- are all commands. They can all be treated alike: they can be renamed, removed, re-implemented, etc.
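To make the "no reserved words" idea concrete outside Tcl, here is a minimal Python sketch of a toy command language. It is only an analogy, not how the Tcl interpreter is implemented, and every name in it (commands, run, cmd_if and so on) is hypothetical. Control flow is just another entry in a table of callables, so it can be replaced at run time like any other command.

```python
# Toy "command language": every construct, including control flow,
# is an ordinary entry in a table of callables (all names are hypothetical).
commands = {}


def run(statement):
    """Execute one statement: a command name followed by its arguments."""
    name, *args = statement
    return commands[name](*args)


def cmd_puts(text):
    # Returns the text instead of printing it, to keep the sketch easy to check.
    return text


def cmd_if(condition, then_body, else_body=None):
    if condition:
        return run(then_body)
    return run(else_body) if else_body is not None else None


commands["puts"] = cmd_puts
commands["if"] = cmd_if

before = run(["if", True, ["puts", "yes"], ["puts", "no"]])


# Because "if" is just a table entry, it can be redefined like any command.
def inverted_if(condition, then_body, else_body=None):
    return cmd_if(not condition, then_body, else_body)


commands["if"] = inverted_if
after = run(["if", True, ["puts", "yes"], ["puts", "no"]])
```

Here the same statement selects a different branch before and after the redefinition, which is the property a language with a reserved "if" keyword cannot offer.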

I'd say that a command is what you execute in a statement; different statements may execute the same command (with different arguments, typically). The command is the operation; a statement is a specific invocation of the operation.

I guess this is mainly a question of semantics, so there may be some variation in the understanding of these concepts. That said, this Wikipedia article provides some guidance that is in keeping with my intuition on the topic.
A statement is a unit of an imperative program. Specifically, it is the unit that is terminated by the statement terminator. In C that's a semi-colon. Or, in Pascal and its descendants, it's the unit that's separated by the statement separator. I think in most flavours of Pascal that's also a semi-colon.
A command is an action, such as a function call or a keyword that performs an action. The Wikipedia article likens them to verbs, and I think that's a good description.
So, a variable declaration is a statement, but probably not a command. And a variable assignment via an assignment operator might be considered a command by some and not by others. One sometimes hears it referred to as an assignment command, but not often. If it looks like a function call, as in Tcl, then it's more 'command-like', since there's an explicit verb set.
Also, statements may consist of several commands. For example, think about several function calls in C joined with commas. I would consider that one statement, since it has one semi-colon and returns one value. However, each of the separate calls could be considered a command.
The other thing to bear in mind when considering the statement/command terminology is that statements typically refer to imperative programs, while commands typically refer to shells and scripts. So, when using a command shell like bash, one speaks of commands. If I write a bash script, then it is usually thought of as a script of commands, rather than a program of statements, even though the difference is largely academic.
Tcl, as one of the early scripting languages, probably wanted to draw a distinction between itself as an interpreted scripting language running in a shell (wish), versus compiled languages like C and Pascal. Perl, for example, having come to popularity somewhat later, typically doesn't harp on the distinction.
However, you still often hear people refer to Perl scripts, rather than Perl programs, and likewise for Tcl.
I fear that my answer may have done nothing to clarify the distinction, but I hope it helps.

I often wondered why the term 'statement' is used in programming since it does not seem to match with the meaning of the word in natural language, where a command is an imperative and a statement is not. Intuitively I would prefer the word command.
In Pascal, a variable declaration is not considered a statement, as Jeremy Bourque suggests (it may be true for other languages), since each programming block is divided in a declaration section with all declarations and the statement section with all statements.
Statements are separated by semicolons, as Jeremy Bourque says above. I don't think it is possible to have more than one command in one statement (as in C, apparently). Except for compound statements of course, which have one or more sub-statements. So I guess that one could consider command and statement synonyms in Pascal.
However, in implementations of Pascal, the word command is often used for commands given to the compiler and development environment (IDE). It might be useful to have a different word for commands (statements) in the source code. Perhaps the people who developed Pascal and other early languages, considered a command as something which was to be executed immediately. Therefore Bourque's remarks about the difference between compiled and scripted languages sound logical to me.

I would think that a command is an instruction in code, and a statement may run several commands but at the end evaluates to true or false.

Related

Does Laminas Framework DB Adapter automatically strip tags etc. like mysqli_real_escape_string?

I'm using the Laminas framework (formerly Zend Framework) and would like to know whether the DB Adapter automatically strips tags etc. in INSERT and SELECT statements / execution ($statement->prepare() / $statement->execute()->current()) to avoid SQL injection.
If not, what is the best way to implement it while working with the Laminas DB Adapter? A wrapper function to get a clean SQL statement?
Previously, I was using a wrapper function that cleaned the input with mysqli_real_escape_string.
You'd need to provide some sample code in order to get a more accurate answer to this question, however the topic you're talking of here is parameterisation. If you use laminas-db properly, you won't have to call mysqli_real_escape_string or similar to protect against SQL injection. However, you should still be very wary of any potential cross site scripting (XSS) issues if any data written to your database eventually gets echoed back to the browser. See the section below on that.
How to NOT do it
If you do anything like this, laminas-db won't parameterise your query for you, and you'll be susceptible to SQL injection:
// DO NOT DO THIS
$idToSelect = $_POST['id'];
$myQuery = "select * from myTable where id = $idToSelect";
$statement->prepare($myQuery);
$results = $statement->execute();
How to do it properly
Personally, I make use of the Laminas TableGatewayInterface abstraction, because it generally makes life a lot easier. If you want to operate at the lower level, something like this is a much safer way of doing it; it'll ensure that you don't accidentally add a whole heap of SQL injection vulnerabilities to your code.
Named Parameters
$idToSelect = $_POST['id'];
$myQuery = "select * from myTable where id = :id";
$statement->prepare($myQuery);
$results = $statement->execute(['id' => $idToSelect]);
Anonymous Parameters
$idToSelect = $_POST['id'];
$myQuery = "select * from myTable where id = ?";
$statement->prepare($myQuery);
$results = $statement->execute([$idToSelect]);
If you do it in a way similar to this, you won't need any additional wrapper functions; the underlying components will take care of it for you. You can use either the named or anonymous parameter version. I personally prefer named parameters, as I find they aid code readability and generally make for easier debugging.
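The difference between concatenation and parameterisation is easy to see in a self-contained form. The sketch below uses Python's standard sqlite3 module rather than laminas-db, purely as an analogy; the table contents and the hostile input are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO myTable (id, name) VALUES (?, ?)",
    [(1, "alice"), (2, "bob")],
)

hostile_input = "1 OR 1=1"  # what an attacker might submit as the id

# Unsafe: the input becomes part of the SQL text, so the OR clause is executed.
unsafe_rows = conn.execute(
    "SELECT * FROM myTable WHERE id = " + hostile_input
).fetchall()

# Safe, named placeholder: the input is bound as one value, never parsed as SQL.
safe_rows = conn.execute(
    "SELECT * FROM myTable WHERE id = :id", {"id": hostile_input}
).fetchall()

# Safe, anonymous placeholder with a legitimate value.
positional_rows = conn.execute(
    "SELECT * FROM myTable WHERE id = ?", (2,)
).fetchall()
```

With the concatenated query every row comes back, while the bound version treats the whole string as a single (non-matching) value.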
Sanitising Input/Escaping Output
Be very careful with sanitising input. Not only can it lead to a false sense of security, it can also introduce subtle and very frustrating behaviour that the end user will perceive as a bug.
Take this example: Someone tries to sign up to a site under the name "Bob & Jane Smith", however the input sanitisation routine spots the ampersand character and strips it out. All of a sudden, Bob has gained a middle name they never knew they had.
One of the biggest problems with input sanitisation is context. Some characters can be unsafe in one given context, but entirely benign in another. Furthermore, if you need to support the entire Unicode character set (as we generally should unless we have a very good reason not to), trying to sanitise out control characters in data that will potentially pass through several different systems (HTML, JavaScript, PHP, SQL, etc.) becomes a very hard problem due to the way multibyte characters are handled.
When it comes to MySQL, parameterisation pretty much takes care of all of these issues, so we don't need to worry about the "Little Bobby Tables" scenario.
It's much easier to flip all of this on its head and escape output. In theory, the code that's rendering the output data knows which context it's destined for, and can therefore escape the output for that context, replacing control characters with safe alternatives.
For example, let's say someone's put the following in their bio on a fictional social networking site:
My name is Bob, I like to write javascript
<script>window.postStatusUpdate("Bob is my hero ♥️");</script>
This would need to be escaped in different ways, depending upon the context in which this data is to be displayed. If it's to be displayed in a browser, the escaper needs to replace the angle brackets with an escaped version. If it's going out as JSON output, the double quotes need to be backslash escaped, etc.
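As a rough illustration of context-dependent escaping, the following Python sketch escapes the same stored bio once for HTML and once for JSON using only the standard library. It is not PHP or Laminas code, and the bio string is a shortened, made-up version of the example above.

```python
import html
import json

# The stored value is kept exactly as the user typed it.
bio = 'My name is Bob <script>window.postStatusUpdate("Bob is my hero");</script>'

# HTML context: angle brackets, ampersands and quotes become entities.
as_html = html.escape(bio)

# JSON context: double quotes are backslash-escaped; angle brackets are left alone.
as_json = json.dumps({"bio": bio})

# Escaping is reversible by the consumer, so no information is lost.
round_tripped = json.loads(as_json)["bio"]
```

The stored data is never altered; each output path applies the escaping that its own context requires.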
I appreciate this is a rather long-winded answer to a seemingly simple question, but these things should certainly be considered early on in the application design; it can save you heaps of time and frustration further down the line.

Stata: Can I use "by" for a block of code?

I know and love by. But I wonder if it is somehow possible to "by" an entire block of code instead of just a single command.
This would be useful if one would first like to use a command that stores values (e.g. count) and then do something with those results.
Yes, in the sense that you can write a program and make it byable and Stata programmers do this routinely.
http://www.stata.com/manuals14/pbyable.pdf
No, in the sense that by: is a prefix to a single command (so, as above, whatever you specify must be specified by one line).

How to implement SEPARATE island grammar in ANTLR4 with correct line numbers and char index?

I've been developing a COBOL grammar with support of embedded SQL statements. For anyone who's not familiar with COBOL, here is an example.
MOVE A TO B.
EXEC SQL
SELECT C FROM T WHERE ID=1
INTO :E
END-EXEC
MOVE F TO G
The code between "EXEC SQL" and "END-EXEC" uses a (specially augmented) SQL syntax, which is a perfect example of island grammar.
I know this can be implemented with Lexer mode in ANTLR4. But I have another requirement that the SQL grammar should be separated from COBOL grammar so that the SQL grammar could be reused when embedded in other languages like PL1, without copy paste programming.
So what I did was use a simple lexer mode to capture anything between "EXEC SQL" and "END-EXEC", extract the SQL code as a String and give it to a separate SQL lexer (and parser).
This worked OK with one drawback: the line numbers and char indexes of tokens recognized by the SQL parser are counted from the start of the extracted SQL code string, instead of from the start of the original COBOL program. When it comes to tracking back to source code, e.g. in case there are parsing errors, the line numbers turn out to be misleading.
So the question is: is there a simpler way in ANTLR 4 to implement island grammars separately (both lexer and parser separated), yet still preserve correct line numbers and char indexes in the tokens generated for the island part?
Update: I found there is a grammar import feature in ANTLR 4, and my colleague told me we had tried it but failed. The problem is that lexer modes in imported grammars are not well supported, which leads to compile errors. This issue is being tracked here.
To expand on Bill's comment, when instantiating your SQL parser/lexer, pass it the line offset of the beginning of the EXEC block. Implement a custom SQL token that reports the line number as the offset plus the line number relative to the SQL text. Have your SQL TokenFactory inject the offset as a constant into each token generated.
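The offset arithmetic itself is small. The sketch below is plain Python, not the ANTLR runtime API; the Token class and the relocate function are hypothetical stand-ins for a custom token and token factory, and the host program is a shortened made-up sample.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Token:
    text: str
    line: int    # 1-based line number
    column: int  # 0-based column within the line
    start: int   # 0-based character index


def relocate(token, line_offset, column_offset, char_offset):
    """Map a token position from the extracted SQL string back to the host file.

    Only tokens on the first SQL line need the column shift, because later
    lines start at column 0 in both the snippet and the host file.
    """
    column = token.column + (column_offset if token.line == 1 else 0)
    return replace(
        token,
        line=token.line + line_offset,
        column=column,
        start=token.start + char_offset,
    )


host = "MOVE A TO B.\nEXEC SQL SELECT C\nFROM T\nEND-EXEC\n"

# Where the island starts in the host program.
char_offset = host.index("SELECT")
line_offset = host.count("\n", 0, char_offset)
column_offset = char_offset - (host.rfind("\n", 0, char_offset) + 1)

# Positions as a separate SQL lexer would report them for "SELECT C\nFROM T\n".
select_tok = Token("SELECT", line=1, column=0, start=0)
from_tok = Token("FROM", line=2, column=0, start=9)

select_host = relocate(select_tok, line_offset, column_offset, char_offset)
from_host = relocate(from_tok, line_offset, column_offset, char_offset)
```

After relocation, each token's start index points at the matching text in the original host program, which is what error reporting needs.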
Update
Using modes to implement an idiomatic island grammar, with or without using includes (which work quite well for me at least), is the most natural approach.
Barring that, an external SQL block parser process can be initiated from an Action in the lexer or parser, by an override of the lexer's token emit() method (or related methods), or from a visitor walking the base grammar's parse tree.
Only you can balance which is acceptable, desirable or necessary in any given circumstance.
For example, if the parse tree evaluation provides a value for use in the dynamic execution of an SQL exec block or, conversely, depends on the values returned by such an execution, you are essentially forced to use a symbol table and defer initiations of the SQL executions to a walker. Of course, you can then cache each of the different generated SQL parse trees and reinitialize their symbol tables with instance specific data for reuse without reparsing the SQL blocks.
Just depends on whatever your real requirements are.

Why are many languages case sensitive?

Is it simply a matter of inheritance? C++ is case-sensitive because C is, Java is case-sensitive because C++ is, etc.? Or is there a more pragmatic reason behind it?
I don't think you'll get a better answer than "because the author(s) of that language thought it was better that way". Personally, I think they're right. I'd hate to find these lines anywhere in the same source file (and refer to the same object+method)...
SomeObject.SomeMethod();
...
SOMEOBJECT.SOMEMETHOD();
...
someObject.someMethod();
...
sOmEoBjEcT.sOmEmEtHoD();
I don't think anyone would be happy to see this...
Unix.
Unix was case sensitive, and so many programming languages developed for use on Unix were case sensitive.
Computers are not forgiving: an uppercase character is not the same thing as a lowercase character; they're entirely different. And back when processing cycles, RAM and so forth were expensive, it wasn't seen as worth the effort to force compilers and computers to be "forgiving"; people were just trying to get the things to work.
Notice how case insensitivity didn't really become something useful until things like Visual Basic came along. Only once companies became invested in the idea that getting the masses to program was good for their bottom line (i.e., Microsoft makes more money if there are more programs on Windows) did languages start to be friendlier and more forgiving.
One interesting thing to consider is that English is also case-sensitive. (I suspect this is true for most natural languages, but it may well not be true for all.)
There's a big difference (where I live, anyway, near the town of Reading) between:
I like reading.
and
I like Reading.
Similarly, while many people do capitalise incorrectly, and you can usually understand what is meant, that doesn't mean such writing is considered correct. I'm a stickler when it comes to this kind of thing, which is not to say I get everything right myself, of course. I don't know whether that's part of the inheritance of programming language case sensitivity, but I suspect it may be.
One distinct advantage of case sensitivity for programming languages is that the text becomes culturally insensitive as well. It's bad enough having to occasionally spell out to a compiler which text encoding is used for a source file - having to specify which culture it's in would be even worse :(
It's actually extremely practical, both for the developer and for the language syntax specification: lower/upper case distinction adds a great deal of expressiveness to identifier naming.
From the point of view of the language syntax, you can force certain identifiers to start with a lower or upper case (for instance Java class name). That makes parsing easier, and hence helps keeping the syntax clean.
From a developer point of view, this allows for a vast number of convenient coding conventions, making your code clearer and easier to understand.
My guess would be that case sensitivity enlarges the name space. A nice trick such as
MyClass myClass;
would be impossible with case-insensitive compiler.
Case folding is only simple in English (and for all characters < 128). The German sz or "sharp s" (ß) doesn't have an upper case variant in the ISO 8859-1 charset. It only received one in Unicode after about a decade of discussion (and now, all fonts must be updated...). Kanji and Hiragana (Japanese alphabets) don't even know lower case.
To avoid this mess, even in this age of Unicode, it is not wise to allow case folding and unicode identifiers.
ExpertSexChange
I believe this is a competitor to Stack Overflow where you have to pay to read answers. Hmm... with case insensitivity, the meaning of the site's name is ambiguous.
This is a good reason for languages being case-sensitive. Less ambiguity! Ambiguity to programmers is considered yucky.
Back when parsing and compiling was real expensive and would take all night it was advantageous to the compiler if it didn't have to worry about case.
Once identifiers came into existence that were unique only by their case, it became very difficult to go back. Many developers liked it and there doesn't seem to be a big desire to undo it.
Case sensitivity adds to language readability by the use of naming conventions. You can't write
Person person = new Person("Bill");
if your language is case insensitive, because the compiler wouldn't be able to distinguish between the Class name and the variable name.
Also, having Person, person, PersoN, PeRsOn, and PERSON, all be equivalent tokens would give me a headache. :)
What is the capital form of i? I (U+0049) or İ (U+0130)?
Capitalization is locale dependent.
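A quick way to see why case mapping is not a simple one-to-one table is to try a few characters in Python, whose str methods apply the locale-independent Unicode default mappings. The variable names below are just for illustration.

```python
# Python's str methods use the Unicode default (locale-independent) mappings.
sharp_s_upper = "\u00df".upper()         # German sharp s expands to two letters
dotted_capital_lower = "\u0130".lower()  # I with dot above becomes "i" + combining dot
dotless_upper = "\u0131".upper()         # Turkish dotless i maps to plain "I"

# Round trip: upper-casing then lower-casing does not restore the dotless i.
round_trip = dotless_upper.lower()
```

A Turkish-aware mapping would instead pair "I" with the dotless "ı", so the "right" answer depends on the locale in which the text is interpreted.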
Because they're as dumb as a box of frogs, for precisely the reasons given for the opposite viewpoint in this thread (I'm not even gonna ask what that's about. Wood for the trees and all that).
When FOOBAR = FooBar = foobar, you get to choose your convention, and other coders can do the same whether they share your preference or not. No confusion.
They also can't get away with the stroke of genius that is having a constant, function and variable all with the same name in the same file, albeit with different caps. Again, no confusion.
You call your variable WebSite, they call theirs Website, and which system gets confused? Not an easy catch either, when you're scanning.
As for lookups, is it really that much more processing to convert the name to lowercase before looking it up? Doing your own premature optimisation is one thing, expecting it from the developer of your language of choice is a whole other level of missing the point.
...and yet, all these answers saying case-sensitivity reduces confusion. Sigh
Many (non-programming) languages (e.g. European using the Roman alphabet) are case-sensitive, so it's natural for native speakers of those languages to use upper- / lower-case distinctions.
The very idea that programming languages wouldn't be case-sensitive is a historical artifact arising from the limitations of early-generation hardware (including pre-computer teletype machines that used a 5-bit character code).
People who argue for case-blind languages must be unable to distinguish
IAmNowHere
from
IAmNowhere
(It's a joke! ;-)
There's also Common Lisp, which is a case-sensitive language that many people mistakenly believe is case-insensitive. When you type (car x) into the Listener, it turns into (CAR X) for processing. It is possible to define symbols with lower-case names, but they have to be quoted with something like |lower-case-symbol|. Therefore, typing in (car x) or (CAR X) or (Car X) all works the same.
(Franz Lisp was at one point introducing what they called "modern" capitalization, in which the Listener would not fold cases, and CL keywords would be in lowercase. I never followed it well enough to know what happened there.)
The upper-case of a letter isn't a universal concept. Java uses Unicode, so if you wanted case-insensitive Java, the meaning of your program could change depending on what locale it was compiled in.
Most languages don't let you put dots or commas (or apostrophes or spaces) in the middle of integer literals, probably because that's also locale-dependent.
From the .NET Framework Developer's Guide, Capitalization Conventions, Case-Sensitivity:
The capitalization guidelines exist solely to make identifiers easier to read and recognize. Casing cannot be used as a means of avoiding name collisions between library elements.
Do not assume that all programming languages are case-sensitive. They are not. Names cannot differ by case alone.
How do you yell if you don't HAVE CAPS?! AHHH!
You have to be expressive. But in all honesty, of all the people in the world, those who work with programming logic would be the first to insist that differences are in fact differences.
I have read this entire thread. I must believe that those who report having found value in case sensitivity have never programmed in a true high level language (which by definition is case insensitive). K&R admit that C is mid-level. After programming in Pascal, Delphi, Lazarus, Ada, etc., one learns that highly readable code is simple to write and to get running quickly without obsessing over terse case sensitive constructs. After all, readability is the first and last word on the subject. Code is written for the human, not the computer. There are no problems to debug with case insensitive code.
When one moves down to a mid-level language, one finds that there are NO advantages to case sensitivity. There are however, a considerable number of hours spent debugging case sensitivity caused problems. Especially when patching together modules from different coders.
It also appears that a large number of respondents do not understand what is meant by case insensitivity. Only the letters a-z are affected. These are a sequential subset of ASCII characters. Three or four bytes of machine code make the compiler indifferent to case in this range of characters. It does not alter the underscore, numerals, or anything else. The points about other languages and character sets simply do not apply to this discussion. The compiler or interpreter would be coded to temporarily convert or not convert the character for analysis at compile time, based on the character being ASCII or not.
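The ASCII-only folding described here can be sketched in a few lines of Python. This is an illustration of the idea rather than any particular compiler's code, and the function name fold_ascii is made up. In ASCII, the upper- and lower-case forms of a letter differ only in bit 0x20.

```python
def fold_ascii(identifier):
    """Lower-case only A-Z; every other character passes through untouched."""
    return "".join(
        chr(ord(ch) | 0x20) if "A" <= ch <= "Z" else ch
        for ch in identifier
    )


folded = fold_ascii("FooBar_1")
same_name = fold_ascii("PERSON") == fold_ascii("person")
non_ascii_untouched = fold_ascii("Stra\u00dfe")  # sharp s is outside A-Z
```

Under this scheme identifiers that differ only in non-ASCII letters stay distinct, so the locale questions raised elsewhere in the thread are sidestepped rather than solved.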
I am shocked at the new languages like Python that have come out repeating the mistake that K&R made. Yes, they saved half a dozen bytes in an environment where the total RAM for compiler, source, and object code was 1000 bytes. That was then. Now memory is not a problem. Now, for no sensible reason, even the reserved words in Python are case sensitive! I do not think I will need to use "For" or "Print" as a variable or function name. But that possibility has been preserved at the expense of the time spent contending with the interpreter over the exact case of each identifier. A bad deal, I think.
The closest thing I have read to date in support of case sensitivity is the comments on hashing. But these rare coding events, which can be handled with careful attention to detail, do not seem to me to be worth the pointless scrutiny a coder must use to write case sensitive code. Two views of the problem. One encourages bad coding, sets traps in the code, and requires extra attention to be diverted away from bigger concepts. The other has no downside, has worked flawlessly in high level languages, and allows flexibility where it does no harm. It looks to me like yet another case of VHS winning over Beta. It's just my two cents' worth here.
Lots of people here have said that it would be bad for several forms of capitalization to refer to the same thing, e.g.:
person
perSoN
PERSON
What would be really bad is if these all referred to different objects in code. If you've got variables person, perSoN and PERSON all referring to different things, you've got a problem.
Case sensitivity doesn't really help case consistency.
Foo.Bar
foo.Bar
fOO.bAR
In a case insensitive language that can be fixed automatically by the editor easily.
In a case sensitive language, fixing it is harder as it may be legal. The editor first has to check whether foo.Bar and fOO.bAR exist and also has to guess that you typed with the wrong case rather than forgetting to declare the variable (as Foo is different from fOO).
Every example I've seen supporting case sensitivity is based on a desire to write bad, undescriptive code. e.g. the "date" vs. "myDate" argument - these are both equally undescriptive and bad practice. Good practice is to name it what it actually is: birthDate, hireDate, invoiceDate, whatever. And who in their right mind would want to write code like:
Public Class Person
    Public Shared ReadOnly PERSON As Person
End Class
Public Class Employee
    Public person As Person = person.PERSON
End Class
Amazingly this is perfectly valid case insensitive VB.Net code. The thought that case sensitivity allows you to even more flagrantly disobey good programming practice is an argument against it, not for it.
I think having a case-sensitive language ENCOURAGES people to write poor code.
Const SHOESIZE = 9
Class ShoeSize
ShoeSize.shoesize = SHOESIZE
call shoeSize(ShoeSize);
function shoeSize(SHOEsize)
{
int ShoeSIZE = 10
return ShoeSize
}
Duh. You couldn't think of a better variable name than "ShoeSize" for the different purposes? There are a billion different words you could use, but you choose to just keep using ShoeSize instead?
Because many people find employeeSocialSecurityNumber just as readable as employee_social_security_number and it is shorter.
And you could also (foolishly) just use single-letters ("a" and "b" and "c") for all classes, variables, functions, and methods.
But WHY would you want to?
Use names that make sense, not:
function a(a)
{
int a = a.a;
return a
}
By typical coding standards, Person would be a class, person a variable name, and PERSON a constant. It's often useful to use the same word with different capitalization to mean something related but slightly different.
So, if you had three staff members in your business all called Robert, you'd refer to them as Robert, robert and ROBERT would you? And rely on people to know exactly which one you meant?
Would you give them email addresses such as Robert@widgets.com, robert@widgets.com, and ROBERT@widgets.com if your email system were case sensitive?
The potential for an unauthorised breach of personal data would be huge. Not to mention if you sent the database root password to the disgruntled employee about to be sacked.
Better to call them Bob, Robbie, and Robert. Better still to call them Robert A, Robert B and Robert C if their surnames were e.g. Arthur, Banks, and Clarke
Really - why on earth have a naming convention that invites mistakes or confusion, that relies on people being very alert? Are you so short of words in your vocabulary?
And as for the person who mentions the supposedly handy trick "MyClass myClass" - why, why, why? You deliberately make it difficult to see at a glance whether a method used is a class method or an instance method.
Plus you lost the chance to tell the next person reading your code more about the particular instance of the class.
For instance.
Customer PreviousCustomer
Customer NewCustomer
Customer CorporateCustomer
Your instance name needs to ideally tell your colleague more than just the class it's based on!
Learning is always easier by example so here it goes:
C#(case sensitive but usable from VB.NET which is case insensitive):
CONSTANT_NAME
IInterfaceName // Uses I prefix in all case sensitive and insensitive languages
ClassName // Readable in both case sensitive and insensitive languages
_classMember // sometimes m_classMember or just classMember
DoSomething(someParam) // Method with action name, params can be _someParam
PropertyName // Same style in case sensitive and insensitive languages
localVariable // Never using prefix
Java and JS use a style similar to C# but methods/functions/events are declared like variables doSomething, onEvent.
ObjectPascal (Delphi and Lazarus/FPC are case insensitive, like Ada and VB.NET)
CConstantName // One can use Def or no prefix, not a standard
IInterfaceName
TClassName // Non-atomic types/classes have T prefix e.g. TStructRecordName
PSomePointer // Pointers have types, safer low level stuff
FClassFieldMember // F means Field member similar to m
DoSomething(Parameter) // Older code uses prefix A for parameters instead
PropertyName
LLocalVariable // Older code uses prefix for parameters not local vars
Using only OneCase and prefixes for each type makes sense in all languages. Even languages that started without prefixes have newer constructs like Interfaces that don't rely on case but use a prefix instead.
So it's really not important if a language is case sensitive or not. Newer concepts were added to case sensitive languages that were too confusing to be expressed by case alone and required using a prefix.
Since case sensitive languages started using prefixes, it's only reasonable to stop using case with the same identifier name someIdentifier SomeIdentifier SOME_IDENTIFIER, ISomeIdentifier and just use prefixes where it makes sense.
Consider this problem:
You have a class member called something, a method/function parameter called something and a local variable called something, what case convention could be used to easily differentiate between these ?
Isn't it easier to just use the most ConsistentCaseStyle everywhere and add a prefix ?
Fans of case insensitive languages care about code quality, they just want one style. Sometimes they accept the fact that one library is poorly written and use a strict style while the library might have no style or poor code.
Both case sensitive and insensitive languages require strict discipline, it makes more sense to have only one style everywhere. It would be better if we had a language that used only StrictCase, one style everywhere and prefixes.
There is a lot of poor C code, case sensitivity doesn't make it readable and you can't do anything about it. In a case insensitive language you could enforce a good style in your code without rewriting the library.
In a StrictCase language, which doesn't exist yet, all code would have decent quality :)
MyClass myClass;
would be impossible with case-insensitive compiler.
Or you could be smart and actually use 2 different words... that better show what you are actually trying to do, like:
MyClass myCarDesign;
Duh.
There is another reason languages are case sensitive. IDs may be stored in a hash table and hash tables are dependent on hashing functions that will give different hashes for differing case. And it may not be convenient to convert all the IDs to all upper or all lower before running them through the hash function. I came across this issue when I was writing my own compiler. It was much simpler (lazier) to declare my language as case sensitive.
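The extra step this answer mentions can be shown with a minimal, hypothetical symbol table in Python (the class and its methods are illustrative, not taken from any real compiler). A case-insensitive language has to normalise every name before it reaches the hash table; a case-sensitive one can use the name as-is.

```python
class SymbolTable:
    """Dictionary-backed symbol table with optional case folding (hypothetical)."""

    def __init__(self, case_sensitive=True):
        self.case_sensitive = case_sensitive
        self._entries = {}

    def _key(self, name):
        # A case-insensitive language pays for this normalisation step
        # on every declaration and every lookup.
        return name if self.case_sensitive else name.casefold()

    def declare(self, name, value):
        self._entries[self._key(name)] = value

    def lookup(self, name):
        return self._entries[self._key(name)]

    def __len__(self):
        return len(self._entries)


sensitive = SymbolTable(case_sensitive=True)
sensitive.declare("Person", "class")
sensitive.declare("person", "variable")

insensitive = SymbolTable(case_sensitive=False)
insensitive.declare("Person", "class")
insensitive.declare("person", "variable")  # same key, so it replaces the class
```

The sensitive table ends up with two distinct symbols, while the insensitive one has a single entry that any spelling of the name will find.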
If word separation is not important, then why do we put spaces between words? I therefore think that underscores between words in a name do increase readability. Also, lower case with capitalization of the appropriate characters is easiest to read. Lastly, it is surely much easier if all names can be conveyed by word of mouth - "Corporate Underscore Customer" rather than "Capital C Lower Case o r p o r a t e Underscore Capital C Lower Case u s t o m e r"! The former can be spoken 'in one's head'; the latter cannot. I wonder how people who are happy with case sensitivity handle these case-sensitive names in their brains - I really struggle. So I feel that case sensitivity is not at all helpful - a retrograde step from COBOL, in my opinion.
Because people seriously overthink things.
Case insensitivity works best when it's also case-preserving and combined with a separation between type and variable namespaces. This means that:
If you declare a class as 'TextureImage' and then try to use it as 'textureImage', the IDE can autocorrect you. This gives you the advantage that you'll never have to hit the shift key unless you're declaring an identifier or using an underscore.
Just like in Java and several other languages, it's perfectly valid to type "MyClass myClass". The IDE and the compiler should have no problem differentiating between the use of a type and the use of a variable.
In addition, case insensitivity guarantees that 'o' and 'O' will never refer to different objects. Common arguments against it include:
"sOmEoNe wIlL tYpE cOdE lIkE tHiS"; => and that someone will _never_ be allowed to join a programming team, so this is a strawman argument. Even if they did manage to do so, case insensitivity is more the solution than the problem, because it means that you don't have to remember whatever crazy uppercase/lowercase combination they use.
"you can't internationalize case insensitivity easily!"; => over 95% of programming languages are written in English for a very good reason. There are no competing character encodings, and the vast majority of keyboards on earth are English-based (in part or in whole). Supporting Unicode identifiers is perhaps the dumbest idea anyone has come up with in the 21st century: a good portion of Unicode characters are frikkin invisible surrogates, reading code is hard enough without having to use Google Translate, and writing code is hard enough without having to copy-paste identifiers or use a character map.
"but case sensitive languages have more identifiers!"; => no, they have grammatically overloaded identifiers, which is substantially worse.
I don't use any case-insensitive languages, but the advantages are blatantly obvious if you think about this sort of thing seriously.
A reasonable answer might be that the designers of the language thought it would make the language easier to understand, thinking about the future :)

Regular expression to match common SQL syntax?

I was writing some Unit tests last week for a piece of code that generated some SQL statements.
I was trying to figure out a regex to match SELECT, INSERT and UPDATE syntax so I could verify that my methods were generating valid SQL, and after 3-4 hours of searching and messing around with various regex editors I gave up.
I managed to get partial matches but because a section in quotes can contain any characters it quickly expands to match the whole statement.
Any help would be appreciated, I'm not very good with regular expressions but I'd like to learn more about them.
By the way it's C# RegEx that I'm after.
Clarification
I don't want to need access to a database, as this is part of a unit test and I don't want to have to maintain a database to test my code, which may live longer than the project.
Regular expressions can only match languages that a finite state automaton can recognize, which is very limited, whereas SQL has a nested, context-free syntax. It can be demonstrated that you can't validate SQL with a regex. So, you can stop trying.
SQL has a type-2 (context-free) grammar, which is too powerful to be described by regular expressions. It's the same as if you decided to generate C# code and then validate it without invoking a compiler. A database engine in general is too complex to be easily stubbed.
That said, you may try ANTLR's SQL grammars.
As far as I know, this is beyond regex and you're getting close to the dark arts of BNF and compilers.
http://savage.net.au/SQL/
The same thing happens to people who want to do correct syntax highlighting. You start cramming things into regex and then you end up writing a compiler...
I had the same problem. An approach that would work for all the more standard SQL statements would be to spin up an in-memory SQLite database and issue the query against it. If you get back a "table does not exist" error, then your query parsed properly.
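A minimal sketch of that approach, written in Python with the standard `sqlite3` module rather than the C# the question asks for. The helper name `sql_parses` is hypothetical, and the check relies on an assumption about SQLite's wording: it reports an unknown table as "no such table: ...", which can only happen once the statement has got past the parser.

```python
import sqlite3


def sql_parses(sql):
    """Return True if SQLite's parser accepts `sql`, ignoring missing tables."""
    conn = sqlite3.connect(":memory:")  # empty throwaway database, no schema
    try:
        conn.execute(sql)
        return True
    except sqlite3.Error as exc:
        # An unresolved table name means the statement got past the parser.
        return str(exc).startswith("no such table")
    finally:
        conn.close()


print(sql_parses("SELECT id, name FROM customers WHERE id = 1"))
print(sql_parses("SELEC id, name FROM customers"))
```

This only checks the statement against SQLite's dialect, so syntax specific to another RDBMS may be rejected even though the target database would accept it.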
Off the top of my head: Couldn't you pass the generated SQL to a database and use EXPLAIN on them and catch any exceptions which would indicate poorly formed SQL?
Have you tried lazy quantifiers? Rather than matching as much as possible, they match as little as possible, which is probably what you need for quotes.
To validate the queries, just run them with SET NOEXEC ON. That is how Enterprise Manager does it when you parse a query without executing it.
Besides, if you are using regex to validate SQL queries, you can be almost certain that you will miss some corner cases, or that the query is invalid for other reasons even if it's syntactically correct.
I suggest creating a database with the same schema, possibly using an embedded sql engine, and passing the sql to that.
I don't think that you even need to have the schema created to be able to validate the statement, because the system will not try to resolve object_name etc until it has successfully parsed the statement.
With Oracle as an example, you would certainly get an error if you did:
select * from non_existant_table;
In this case, "ORA-00942: table or view does not exist".
However if you execute:
select * frm non_existant_table;
Then you'll get a syntax error, "ORA-00923: FROM keyword not found where expected".
It ought to be possible to classify errors into syntax parsing errors, which indicate incorrect syntax, and errors relating to table names, permissions, etc.
Add to that the problem of different RDBMSs and even different versions allowing different syntaxes and I think you really have to go to the db engine for this task.
There are ANTLR grammars to parse SQL. It's really a better idea to use an in-memory database or a very lightweight database such as SQLite. It seems wasteful to me to test whether the SQL is valid from a parsing standpoint, and much more useful to check the table and column names and the specifics of your query.
The best way is to validate the parameters used to create the query, rather than the query itself. A function that receives the variables can check the length of the strings, valid numbers, valid emails or whatever. You can use regular expressions to do these validations.
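A rough sketch of that idea in Python. Everything here is a hypothetical illustration: the function name, the three parameters, the length limit, and the deliberately loose email pattern are assumptions, not rules from the original code.

```python
import re

# Hypothetical limits and patterns, chosen only for illustration.
MAX_NAME_LENGTH = 50
INTEGER_PATTERN = re.compile(r"-?\d+")
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")  # loose shape check only


def validate_customer_params(name, age, email):
    """Return the names of the inputs that fail validation (empty list = all fine)."""
    problems = []
    if not name or len(name) > MAX_NAME_LENGTH:
        problems.append("name")
    if not INTEGER_PATTERN.fullmatch(age):
        problems.append("age")
    if not EMAIL_PATTERN.fullmatch(email):
        problems.append("email")
    return problems


print(validate_customer_params("Ann", "42", "ann@example.com"))
print(validate_customer_params("", "4x", "not-an-email"))
```

Checks like these say nothing about whether the generated statement itself is well formed; they only constrain what goes into it.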
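using System.Text.RegularExpressions;

// Rough shape check only: accepts any text containing SELECT ... FROM ... WHERE ...
public bool IsValid(string sql)
{
    // Verbatim string (@"...") so the backslashes reach the regex engine as written.
    string pattern = @"SELECT\s.*FROM\s.*WHERE\s.*";
    Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
    return rgx.IsMatch(sql);
}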
I am assuming you did something like .* for the quoted sections. Try [^"]* instead; that will keep you from eating the whole line. It will still give false positives in cases where you have \ (for example an escaped quote) inside your strings.
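The difference between these options can be seen on a small input. The sketch below uses Python's `re` module instead of C#, and it assumes double-quoted strings with backslash escapes, as in the answer above. The sample strings and pattern names are made up for illustration, and the last pattern is one common way to tolerate escaped quotes, not something taken from the thread.

```python
import re

GREEDY = re.compile(r'".*"')               # runs on to the last quote on the line
LAZY = re.compile(r'".*?"')                # stops at the first closing quote
NEGATED = re.compile(r'"[^"]*"')           # same effect, without backtracking
ESCAPED = re.compile(r'"(?:[^"\\]|\\.)*"')  # also steps over \" inside a string

plain = 'name = "alice" AND city = "paris"'
tricky = r'note = "say \"hi\"" AND x = "y"'

print(GREEDY.findall(plain))    # one match spanning both strings
print(LAZY.findall(plain))      # two separate strings
print(NEGATED.findall(plain))   # two separate strings
print(NEGATED.findall(tricky))  # split in the wrong places by the escaped quotes
print(ESCAPED.findall(tricky))  # two strings, escapes kept intact
```

None of these patterns validates SQL. They only keep a quoted section from swallowing the rest of the statement.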