When should Regex be used over String.IndexOf()? or String.Contains()?

When should Regex be used over String.IndexOf()? or String.Contains()? - .net-4.0

I'm, currently working my first project in .NET 4.0 and it requires several thousand string comparisons (I'm searching directories and sometimes entire drives for certain files). For the most part, the strings are quite short because I'm only looking at file paths so I have just made use of String.Contains() to see if the file path string contains my needle string.
I was wondering though, would Regex be a better idea? At what point will the Regex be faster than a standard string comparison? Is it based on the length of the strings being compared or the number of strings being compared?

It's variable. Comparison performance is a complex function of the input data, the culture being used for comparing, case sensitivity and CompareOptions. A Regex object is more expensive to instantiate (unless it's in the Regex cache), so if you're doing a lot of one off comparisons, it not that great to use and I've found it's typically slower than IndexOf(), but YMMV.
Keep in mind that when using Contains/IndexOf that the culture under which the user/thread is running will decide how the comparison is done. That can have a significant impact on performance. Not all cultures are as fast.
The Invariant culture is a very fast culture. If you use a CompareInfo directly, rather than doing String.IndexOf(), it will be somewhat faster still.
CultureInfo.InvariantCulture.CompareInfo.IndexOf(..)
The only way to have some confidence in making the right choice is to benchmark. That said, unless you're shifting through many megabytes of strings, it won't make a difference that matters to anyone. As ChrisF said earlier, focus on readable/maintainble code in that case.
Here's a good article on getting the most out of regex:
Optimizing Regular Expression Performance

If your search expression is simple then I don't think it's worth moving to a Regex - no matter how good you are at coding and reading them it will take you more time to understand the code when you (or more importantly, some one else) look at it again in 6 months time.
If the speed improvements are only marginal stay with the more readable, maintainable code.

I'm just guessing, but I suspect that for simple substring searches there will be little difference in performance between String.Contains(), String.IndexOf() and regex (if anything, I'd guess that regex would never be faster, but might be slower by a miniscule amount).
You shouldn't give any thought about moving to regex unless your requirements are (or become) such that you need to match on something more complex than a substring.

In .Net 4.0 there is an issue with the String.IndexOf call see Hotfix 2467309, it may help you decide your answer.

Related

Rewrite this exceedingly long query

I just stumbled across this gem in our code:
my $str_rep="lower(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(field,'-',''),'',''),'.',''),'_',''),'+',''),',',''),':',''),';',''),'/',''),'|',''),'\',''),'*',''),'~','')) like lower('%var%')";
I'm not really an expert in DB, but I have a hunch it can be rewritten in a more sane manner. Can it?

It depends on the DBMS you are using. I'll post some examples (feel free to edit this answer to add more).
MySQL
There is really not much to do; the only way to replace all the characters is nesting REPLACE functions as it has already been done in your code.
Oracle DB
Your clause can be rewritten by using the TRANSLATE function.
SQL Server
Like in MySQL there aren't any functions similar to Oracle's TRANSLATE. I have found some (much longer) alternatives in the answers to this question. In general, however, queries become very long. I don't see any real advantages of doing so, besides having a more structured query that can be easily extended.
Firebird
As suggested by Mark Rotteveel, you can use SIMILAR TO to rewrite the entire clause.
If you are allowed to build your query string via Perl you can also use a for loop against an array containing all the special characters.
EDIT: Sorry I did not see you indicated the DB in the tags. Consider only the last part of my answer.

Your flagged this as Perl, but it's probably not?
Here is a Perl solution anyway:
$var =~ s/[\-\.\_\+\,\:\;\/\|\\\*\~]+//g;

Sorry I don't know the languages concerned, but a couple of things come to mind.
Firstly you could look for a replace text function that does more that just a single character. Many languages have them. Some also do regular expression based find and replace.
Secondly the code looks like it is attempting to strip a specific list of characters. This list may not include all that is necessary which means a relatively high (pain in the butt) maintenance problem. A simpler solution might be to invert the problem and ask what characters do you want to keep? Inverting like this sometimes yields a simpler solution.

Faster string comparison in vb.net

I have this condition:
If (cmbStatusSearch.SelectedValue <> "-1") Then
How can I make it better in performance? I read String.CompareOrdinal is faster in comparing strings. So should I use:
If (String.CompareOrdinal(cmbStatusSearch.SelectedValue,"-1" <>0) Then
Or is there any other way to make if faster in performance?

I think you're being overly concerned about the wrong type of performance issues and prematurely optimizing a trivial piece of code.
The first example is much more readable than the second. If it serves your purpose then move on and be content with it. Your performance bottleneck will not be in that statement. If you feel some operation in your program is slow then use a profiler, such as the ANTS Performance Profiler (or similar), to discover where the bottleneck is. Until then, guessing about performance issues is futile.
To put this into perspective, consider that no one would use LINQ if they were so concerned over performance to the level presented in your question. Instead, they would stick to traditional code and for loops, which are known to be faster. However, for the sake of readability and expressiveness, LINQ is commonly used and acceptable.
Although String.CompareOrdinal might be more efficient, I would recommend using it when you need to benefit from its intended purpose, which is to perform a case-sensitive comparison using ordinal sort rules. Otherwise, for your posted example, a direct comparison is fine and more readable.

Lets think about this:
Any string equality check can be implemented viastring.CompareOrdinal.
A string comparison check cannot be implemented via an equality check.
So if CompareOrdinal was faster, why wouldn't they just implement Equals in terms of it? In fact it's slower (exact numbers depend on framework), but this is not surprising since it does strictly more work.

Array length question

This question concerns optimization. Suppose I need the array length of an array A at two places in my code. Should I use the function a.length() in the two places, or is it faster to assign a local variable the value of a.length() and use it at the two places.
By "faster" I mean in terms of running time. Moreover, i am talking asymptotically.

The asymptotic complexity of calling the function twice is the same - any constant number of calls to the same (pure) function on the same arguments has the same asymptotic complexity as a single call to that function, since you can just roll the constant number of calls into the big-O's hidden constant.
As for what will be faster, there's no guarantee which one will be faster. It depends on the language and compiler. I'd suggest just writing it both ways and timing the result to see if there's an appreciable difference. That said, if you are writing something that is so performance-critical that you can't afford to call .length() twice, you may need to reconsider your approach in general to see if there's a better global solution to the problem. Microoptimizations are rarely worth the effort unless you have a compelling reason to believe that your program is markedly slower in the unoptimized version.

If you have to ask the question, you're not at a point where it matters yet. If you were, you'd already have code that you've profiled, and you could just try it and see. This kind of thing depends heavily on your language and compiler, and the only results that matter are the ones you see.
Don't worry about micro-optimizations til you find you need to shave cycles, and even then the algorithm is the first thing to check.

What language? In many languages, such calls are optimized away (either at compile time or by a JIT compiler) into direct access to the length field of the array object.

What are the things should we consider while writing a Spell Checker?

I want to write a very simple Spell Checker. The spell checker will try to match the input word with equivalent words form the dictionary.
What can be done to find those 'equivalent words'? What analysis can be preformed on two words to mark them equivalent?

Before investing too much trying to unravel that i'd first look to already existing implementations like Aspell or netspell for two main reasons
Not much point in re-inventing the wheel. Spell checking is much trickier than it first appears and it makes sense to build on work that has already been done
If your interest is finding out how to do it, the source code and community will be a great benefit should you decide to implement your own anyway

Much depends on your use case. For example:
Is your dictionary very small (about twenty words)? In this case it probably is better to precompute all possible nearby mistaken words and use a table/hash lookup.
What is your error model? Aspell has at least two (one for spelling errors caused by nearby letters on the keyboard, and the other for spelling errors caused by the way a word sounds).
How dynamic is your dictionary? Can you afford to do a massive preparation in order to get an efficient retrieval?
You may need a "word equivalence" measure like Double Metaphone, in addition to edit distance.
You can get some feel by reading Peter Norvig's great description of spelling correction.
And, of course, whenever possible, steal code. Do not reinvent the wheel without a reason - a reason could be a very special domain, a special way your users make spelling mistakes, or just to learn how it's done.

Edit Distance is the theory you need to write a spell checker. You also need a dictionary. Most UNIX systems come with a dictionary already installed for your locale.

I just finished implementing a spell checker and used a combination of the following in getting a list of "suggested" words
Phonetic hashing of the "misspelled" word to lookup a hash of identical dictionary hashed real words (for java check out Apache Commons Codec for a suitable library). The phonetic hash of your dictionary file can be precomputed.
Edit distance between the input and the potentials (this is reasonably expensive so you need to reduce the list first with something like a phonetic hash, assuming a higher volume load - in my case, a server based spell check)
A known list of common misspellings, e.g. recieve vs. receive.
An ordered list of the most common words in the english language
Essentially I weighted each potential word primarily based on edit-distance and commonality. e.g. if word probability is a percentage, then
weight = edit-distance * 100 / probability
(lower weights are better)
But then I also also override any result with the known common misspellings (i.e. these always float to the top suggested result).
There may be better ways, but this worked pretty well.
You may also wish to ignore ALL CAPS words, initials etc, so choosing what to ignore is also something to think about.

Under linux/unix you have ispell. Why reinventing the whell.

Why is VB.net so verbose? Is it possible to trim down the fat?

Do languages become more verbose as they mature? It feels like each new version of VB.net gains more syntax. Is it possible to trim down some fat like the keyword "Dim"? C# also feels like it is getting more syntax since version 1.

C# has certainly gained more syntax, but in a way which makes it less verbose.
Virtually every feature in C# 3.0 allows you to do more with less code.

That's the VB idiom. All languages have an idiom, and there are plenty that go for verbose and spelled-out. Thank your lucky stars you're not WRITING IN COBOL.
C# evolved out of the C-like languages, and in the C tradition brevity and terseness are valued, hence braces, && and ||, int not integer, case-sensitive lower-cased code. In the VB idiom, long self-explanatory keywords are good, terse keywords or cryptic symbols are bad, hence MustInherit, Dim blah as Integer and case insensitivity but with a tendency to Capitalise Your Keywords. Basically, stick with the idiom of the language you're using. If you use (or have to use) VB, then get used to the verbosity - it's intentional.

There's nothing wrong with verbosity, in fact it often can be a very good thing.
I assume you have meaningful variable names? Descriptive method names? Then why is there a problem with typing 'end if' instead of '}'. Its really not an issue at all, if anything, the terseness of C# is more of a problem trying to fit as much as possible into as few characters as possible - that means it harder to read, not easier.

Okay... this feels embarrassing to admit, but I like the use of Dim in VB.NET. Yes, it's not serving any particular use that couldn't be inferred by the use of "As" later... but to me, there's something absolutely "beat you over the head" obvious about having declaration statements start with Dim. It means when someone's looking through the code, they don't even have to think about what those statements mean, even for a microsecond. Languages like C# have declarations that are obvious enough, but if you're just browsing by it you may have to consider it for a moment (even the most brief of moments).
There's something "extra obvious" about having a special keyword at the start of certain kinds of statements. In VB.NET, assignments start with "Dim" and calls to methods (can) start with "Call", giving them a kind of "left side uniformity" that If, For, and other constructs already have: you get the barest gist of what's going on with that line just by looking at the very start of it. Using these almost gives you the equivalent of a left-hand column that you can browse down extremely quickly and get the gist of what's going on on some kind of basic level ("Okay, we're declaring things here... we're calling to other things here...").
That might seem irrational or even silly to some people... but it does make the purpose of each statement explicit enough that it feels faster (at least to me) to browse through... especially when browsing through unfamiliar code written by others.
I guess, in the end, it comes down to "different strokes for different folks". I don't mind typing the extra three characters for that obviousness of purpose.

VB.NET is great as it is, it brings more clarity in code.
For instance I love the fact how you can describe what your ending.
End while vs }
End for vs }
End if vs }
And really the verbosity isn't an issue while typing because of intellisence.

Verbosity and readability often go hand-in-hand. Lately I've come to fear the word "Elegant", because it generally translates into "Fun but less immediately readable"
Of course the person writing the code always says "Well it's MORE readable to me because it's shorter/more elegant.
that's crap. It's always easier to read something more explicit unless you have so much trouble reading that it takes you two hours to get through a Dick and Jane novel.
Note that I'm not talking about redundancies, just being explicit and spelling out your desires.
As a programmer it's much MUCH more fun to write elegant expressions, but I've found myself looking at the "elegance" of others and even my own "elegance" after a while and I'll change it to something more explicit, readable and reliable when I realize that although it was fun to write, I just spent more time reading/debugging it than it took to write in the first place.
On the other hand, DIM is just stupid :)

I'm slightly out of my depth here.
I like what everyone calls verbose. Does no one ever wonder my we don't have multiple nested parenthesis in any natural languages (e.g. English)? In fact nested parenthesis really need the convention of indentation to make then legible at all. In natural languages we use verbose clauses and multiple sentences, to avoid, or at least explain layering parenthesis.
Also in a certain sense I wouldn't say vb has much more syntax than c# at all. It doesn't really have many more lexical tokens in a certain block of code than would c# (Does it?). It has about the same syntax as c# it just has longer syntactical tokens, 'End Sub' instead of '}' for example. For most pieces of syntactical plumbing, the VB version will just be more typing (if you are sworn off intelisense), Also '}' is ambiguous compared to 'End Sub' since it also means 'End If' and whole load of other things. This doesn't make it more concise in a certain sense, There are still the same number of tokens in the code, but from a smaller subset of more curt token markers. But different tokens in C# have different meanings depending on the context, which requires you still to have that nested level in your mind as you read the code, in order to get, for example what kind of code block you are reading if you lose your place, even though the end of the code block could be in view you might have to look up to see the beginning of the code block. Even where this is not the case is && really better than AndAlso?
Dim I suppose is quite useless, an extra lexical token compared to c#, but I suppose at least it's consistent with linq, vb doesn't need the var command. I can't think of any other candidates for the chop. Maybe I don't have the imagination.
I'm sure someone is going to come along and tell me how wrong I am :) My best guess this is to do with personal preference anyway.

Then C# should get rid of ";" at the end of every fricking line :)
Default should be easy way to go such "New Line" instead of a ";" if you want exceptions such 2 statements in the same line then you need use a separator such as ":" or ";"

Some people like that. I suspect things like Dim are left in for legacy reasons as much as anything else.
VB does have some nice short cuts though. For example the conversion routines. Eg CInt, CStr etc

OK I'm an old C programmer that came to .Net via C#, and Now I'm working in VB.Net
I have to say at first I was appalled by the verbosity, but having gone back to
C# for I bit, I have to admit I like VB.Net a bit more now.
I can write VB code that reads better and is less cryptic and it really doesn't take
up much more space (especially if you put all your C# curly braces on their own line)
Also the editor in Visual Studio takes care of most of the verbosity for me.
I like how Dim X As List(Of SomeType) reads as opposed to List<sometype> X
Semi-colons are a nuisance now. Curly braces are annoying.
I do miss the square brackets for array refs, a little bit though.

Ever since I saw this...
lowercase keywords?
...I've been hoping it would get included in the language. It's amazing how much more readable it is Without All The Capitalized Keywords.

C# is not a user-friendly language and takes many MORE lines of code to accomplish the same task as VB.Net. Optional parameters, with/end with, etc. are not supported within C#. Even db connections take more code in C#. Star Wars fans love the C# though, as you get to talk like Yoda while you're programming.
My $0.02: VB.Net's intellisense features, shortcuts, logical readability, etc. make it a superior language.

Yes I know what you mean, it can get a bit long winded, I find it becomes hard to read due to there being to too many keywords and making it hard to see the logic behind it all but hey some people like it.
I have just recently started writing all my stuff in C# and I must say I have come to like it a lot more since I just cut of VB and said right C# from now on, so if you are that worried your could always just switch.

Use a pre-processor... and loose any hope to be read by anybody else (or yourself in some years, perhaps...).
I will answer what is often answered to people complaining that Lua is verbose: if you want a concise language, you know where to find it. There are tons of languages around, even if you restrict yourself to the CLR, so why complain about the one you chose? (I know, you might work on a project you didn't started, etc.).
That's not a flame or something. There are many languages for a reason. Some like them super-concise, close of mathematical notations, others like them (more or less) verbose, finding that more readable. There are languages for all tastes!

for one thing that i don't like about VB is this
print("ByVal sender as object, ByVal e as EventArgs");
vs C#
object sender, Eventargs e

Doing so will only make your code harder to read.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas