The idea of text highlighting, code completion, etc. in programming IDEs

I want to understand how advanced text-editor features like syntax highlighting, code completion, automatic indentation, etc. are implemented.
To make my idea clear: I imagine syntax highlighting as reading the entire text into a string, doing a regular-expression replacement of keywords with keywords plus color codes, and replacing the text again. That looks logical, but it would be terribly inefficient to do on every keystroke when your file is 4000 lines, for example! So I want to know how such a thing is actually implemented, in C# for example (any other language would be fine too, but that's what I am experimenting with right now).

Syntax highlighting:
This comes to mind. I haven't actually tried the example, so I can't say anything about its performance, but it seems to be the simplest way of getting basic syntax highlighting up and running.
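As a rough illustration of the efficiency concern in the question, here is a minimal sketch (my own, not taken from the example above) that re-highlights only the line under the caret of a WinForms RichTextBox, so each keystroke touches one line instead of all 4000. The keyword set and colors are placeholder assumptions:

using System;
using System.Drawing;
using System.Text.RegularExpressions;
using System.Windows.Forms;

static class LineHighlighter
{
    // Illustrative keyword set -- a real highlighter would cover the full language.
    static readonly Regex Keywords = new Regex(@"\b(if|else|for|while|return)\b");

    public static void HighlightCurrentLine(RichTextBox box)
    {
        if (box.Lines.Length == 0) return;

        int line = box.GetLineFromCharIndex(box.SelectionStart);
        int start = box.GetFirstCharIndexFromLine(line);
        string text = box.Lines[line];

        int caret = box.SelectionStart;        // remember the caret
        box.Select(start, text.Length);
        box.SelectionColor = Color.Black;      // reset this line only
        foreach (Match m in Keywords.Matches(text))
        {
            box.Select(start + m.Index, m.Length);
            box.SelectionColor = Color.Blue;   // recolor just the keyword
        }
        box.Select(caret, 0);                  // restore the caret
    }
}

Hook HighlightCurrentLine up to the TextChanged event, and the per-keystroke cost stays proportional to the current line rather than to the whole file.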
Auto-completion:
Given a list of possible keywords (that could be filtered based on context) you could quickly discard anything that doesn't match what the user is currently typing. In most languages, you can safely limit yourself to one "word", since whitespace isn't usually legal in an identifier. For example, if I start typing "li", the auto-completion database can discard anything that doesn't start with the letters 'l' and 'i' (ignoring case). As the user continues to type, more and more options can be discarded until only one -- or at least a few -- remains. Since you're just looking at one word at a time, this would be very fast indeed.
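A minimal sketch of that filtering step (the keyword list is illustrative; a real completion database would be fed from the language and the current scope):

using System;
using System.Collections.Generic;
using System.Linq;

class CompletionSource
{
    readonly List<string> keywords = new List<string>
        { "line", "list", "lint", "length", "lambda", "while" };

    // Keep only entries that start with the word being typed, ignoring case.
    public IEnumerable<string> Candidates(string typedWord) =>
        keywords.Where(k =>
            k.StartsWith(typedWord, StringComparison.OrdinalIgnoreCase));
}

// Typing "li" leaves { line, list, lint }; typing "lin" leaves { line, lint }.

For large keyword sets, a sorted list or a trie lets you locate the prefix without scanning, but even the linear scan above is cheap for a few hundred keywords.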
Indentation:
A quick-and-dirty approach that would (kind of) work in C-like languages is to have a counter that you increment once for every '{', and decrement once for every '}'. When you hit enter to start a new line, the indentation level is then counter * indentWidth, where indentWidth is a constant number of spaces or tabs to indent by. This suffers from a serious drawback, though -- consider the following:
if(foo)
bar(); // This line should be indented, but how does the computer know?
To deal with this, you might also look at the previous line: if it ends with a ')' rather than a semicolon, indent one extra level (see the sketch below).
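A sketch of the counter in code (textBeforeCaret stands for everything above the new line; indentWidth is the constant mentioned above):

using System;

static class AutoIndent
{
    public static string IndentForNewLine(string textBeforeCaret, int indentWidth = 4)
    {
        int counter = 0;
        foreach (char c in textBeforeCaret)
        {
            if (c == '{') counter++;
            else if (c == '}') counter = Math.Max(0, counter - 1);
        }
        // The if(foo) heuristic would go here: if the last non-blank line
        // ends with ')' rather than ';' or '}', add one extra level.
        return new string(' ', counter * indentWidth);
    }
}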

An old but still applicable resource on editor internals is The Craft of Text Editing. Chapter 7 addresses the question of redisplay strategies directly.

In order to do "advanced" syntax highlighting -- that is, highlighting that requires contextual knowledge -- a parser is often needed. Most parsers are built on some sort of formal grammar, which comes in various varieties: LL, LALR, and LR are common.
However, most parsers operate on whole files, which is quite inefficient for text editing, so instead we turn to incremental parsers. Incremental parsers use knowledge of the language and of the structure of what has previously been processed in order to re-do the least amount of work possible.
Here are a few references on incremental parsing:
Language Design Patterns and incremental parsing
Incremental Parsing
Incremental Parsing in the Yi Editor and the presentation on Vimeo

Related

Displaying Korean Characters - iOS App

I am trying to display Korean text in my iPhone app. The app appends the Unicode of letters one by one to an NSMutableString and displays the string on the screen after each letter is appended.
I understand that there are some rules for conjoining letters (Jamo).
Is there a function for automatically applying all these rules to a string of letters or do I need to write code to make changes (e.g., changing a consonant to a tail consonant if there is a vowel before it)?
FCA, it's you who sent the email to me, right? Because the more detailed question is here, I will try (my best) to answer here instead of replying to your email.
By reading everything you and the others wrote here, I figured out that you are making Korean handwriting recognition software. So you would not enjoy the luxury of the Korean input method provided by Apple.
There are two things for me to explain, so let's go one by one. (I believe you are already aware of one of the two.)
How to compose Hangul text.
So, reading your inquiry, it should not be about Unicode composed/decomposed Korean strings (or just a series of Ja (consonants) and Mo (vowels)). Your question looks to be about how to determine whether a consonant a user writes (your term is "tail consonant", right?) is the last consonant of the current syllable or the beginning consonant of the next syllable.
The best thing would be to learn Korean, but let me briefly explain it.
Let's say you write 소방차 (a fire truck).
You are to write : ㅅㅗㅂㅏㅇㅊㅏ
(Again I'm not talking about the decomposed form of Unicode. It's about how people write Korean text.)
When you type ㅗ (the 2nd char), the display system temporarily shows 소 by attaching the ㅗ to the preceding ㅅ, and then looks the result up in a Korean syllable table. (Although Hangul is assembled in the JoHap style (조합형, "composite style"), Korean standards also define tables of allowed syllables in the WanSung style (완성형, "precomposed style"). So you test the "assembled" syllable against such a table to see whether it exists.) It finds 소 in the table, so it displays 소.
Now the next char, ㅂ, is written. Here it becomes a little complicated: because there is a syllable 솝 in the table, it first attaches the ㅂ to the preceding syllable and displays 솝. However, nothing is completely determined yet. The user writes the next char, ㅏ. Since no syllable can exist without a first/beginning consonant (Ja), looking up a bare ㅏ in the table fails.
So the system guesses that the ㅂ attached to the previous syllable actually belongs to the 2nd syllable, and it displays 소바. Now ㅇ is typed, and it attaches the ㅇ to the second syllable, displaying 소방. (At this moment it can also look up 방 in the table, and find it.)
Now ㅊ is typed. Internally it can probably test a shape where the ㅇ and the ㅊ would sit together under 바 -- I can't write it, because no such syllable exists, unlike, say, 밝, where two consonants do stack. Since there is no such syllable, it instantly determines that the ㅊ belongs to the next syllable.
Then ㅏ is typed. It assembles the ㅊ and the ㅏ to make 차. When you press the space key, the return key, or any other whitespace key, the Hangul composition is finished.
This was a simple case. In Korean there are more complicated syllables like 빨, 꼭, 헗, etc. For the first consonants, 복자음 (BokJaUm, double consonants) like the ㅃ in 빨 and the ㄲ in 꼭, people type ㅂ or ㄱ while holding the shift key, which produces ㅃ or ㄲ. So counting the consonants and deciding where each belongs (previous syllable or next syllable) is fairly easy when the user types on a keyboard. (However, some nice Korean input methods for Windows and xterm allow typing ㅂ twice to make ㅃ -- a kind of intelligent feature. Then testing text like 빱빠라빱 or 흙을 becomes complicated, because you end up testing 3 or 4 consonants grouped like {1,3}, {2,2}, {3,1}.)
The bad news is that because you are writing handwriting recognition software, you may need to handle such complicated cases if you feed recognized Hangul characters one by one into an existing Korean input method engine. However, if you write your own input method inside your app, you can maintain your own state machine, which can be easier. As you can see, it's a trade-off: depend on an existing input method engine and feed each char into it, or roll your own. (Hmmm... wait... maybe the input method engine can handle those complicated cases, too.)
FYI, I would like to introduce two open source projects: one is a Korean input method Finder module for Mac, and the other is an input method engine with which you can build a Korean input method. Also, there is a Korean input method for X-Windows hosted here. If you prefer a Windows project to look up, here is one.
The latter two were hosted at KLDP.net, a Korean open source project hosting site, but they were moved to Google Code. As far as I can remember, "SaeNaRu" and "Nabi" (butterfly) can support typing the same consonant twice to make a double consonant.
For more detailed information, you can look up libhangul and Nabi. (I remember that the input method part of the code was almost the same between libhangul and Nabi at one point, but they were then separated and expected to evolve independently, so I guess they are different by now.)
OK. The first thing is done.
Now let's move on to the second issue. (This is the part I said you may already know about, but let me explain it anyway to complete the picture.)
It's about which characters to feed into your prospective Korean input method state machine, or into an engine like libhangul. There are basically two representations of composed (on-display) Hangul characters: composed and decomposed. The composed form contains fully composed chars. For example, in 사랑합니다, each syllable 사, 랑, 합, 니, 다 is saved as such; they are not stored as ㅅ, ㅏ, ㄹ, ㅏ, ㅇ, ㅎ, ㅏ, ㅂ, ㄴ, ㅣ, ㄷ, ㅏ.
That is the composed representation in Unicode, and it is usually used by text editors and the like. The other representation is the decomposed one in Unicode. It's like ㅅ, ㅏ, ㄹ, ㅏ, ㅇ, ㅎ, ㅏ, ㅂ, ㄴ, ㅣ, ㄷ, ㅏ.
This representation is usually used by file systems. For example, if you create a file name in Hangul on Windows and access the containing folder from a Mac, it will be displayed like ㅅㅏㄹㅏㅇㅎㅏㅂㄴㅣㄷㅏ, although it is displayed as 사랑합니다 on Windows.
However, if memory serves, there is yet another set of characters: a plain list of Hangul consonants and vowels. Although they may look the same as or similar to decomposed syllables, they are actually different in that each is drawn in the middle of the space where a full character is drawn. Their purpose is to present Hangul characters in Korean alphabet tables and the like, for educational (or any other) purposes.
So I'm not sure which characters (the decomposed ones, or the characters from that list of Hangul consonants and vowels) to feed into the input method state machine or engine you choose or implement. If you implement it yourself, it's your choice; but if you use an external library for the engine, you need to figure out which it expects.
Also, as I mentioned in my blog post, there are two variants of each composed and decomposed representation, all defined in the Unicode standard. So, well... yeah, I agree: it's quite a bit of work.
As for me, I tried to make an input method for Mac (when Apple announced they would get rid of the Finder plugin architecture for security reasons), but at that time libhangul (yeah, I tried to use it) was changing a lot, so I decided to hold off until it stabilized. Then I became very busy at work and tired when I got home, so I didn't make progress on my own input method. I believe the state of the libhangul project is much better now than ever, so it's at least worth taking a look at.
Also, if you don't have Windows, it would be good to try hanterm or any xterm derivative that supports Hangul input natively. The source code should be available at their hosting sites.
Good luck with your project, and if there is anything more you would like to ask me, please do.
Check out these system-level text-input facilities. I've never used them, but they look promising:
http://developer.apple.com/library/ios/#documentation/StringsTextFonts/Conceptual/TextAndWebiPhoneOS/CustomTextProcessing/CustomTextProcessing.html#//apple_ref/doc/uid/TP40009542-CH4-SW8
http://developer.apple.com/library/ios/#documentation/UIKit/Reference/UITextInput_Protocol/Reference/Reference.html#//apple_ref/occ/intf/UITextInput
Because iOS doesn't support system-wide keyboard customization, everyone just uses the system-default input facility. And the handling of Hangul composition differs across operating systems and platforms (MS/Apple/Samsung/LG and others). So the best way is to use a system-supplied facility such as UITextField, for consistency for your users. Otherwise, you should accurately simulate how your platform's OS does it. Of course you can build it yourself, but users won't like it.
Though I'm not an expert on this topic (Korean Hangul composition), I don't think there is a simple algorithm without table lookup. Anyway, if you really want to implement it yourself, these are the core problems you have to handle:
Composing your visual symbols into the consonants and vowels defined in Unicode.
Determining initial vs. final consonants by the placement of vowels.
This wouldn't be so hard, but the ability to modify the preceding character sequence is required. You cannot implement Korean input as a one-way stream unless you have separate keys for initial and final consonants that look the same.
Unicode defines every valid set of Jamo components. Usually those components are too many to present on a device, and doing so would also be inefficient. Most Korean input systems therefore decompose those Jamo again and recompose them before composing the final letter. You can likewise identify and decompose them visually, just as Korean people do.
Once you have the initial/final consonants and vowels defined in the Unicode standard, the Unicode normalization feature (such as -[NSString precomposedStringWithCompatibilityMapping]) will do the rest of the job.
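For what it's worth, the same step in .NET is a one-liner too -- canonical composition (FormC) is what assembles conjoining Jamo into syllables, and FormKC is the compatibility analogue of precomposedStringWithCompatibilityMapping:

using System;
using System.Text;

// U+1109 (ᄉ) + U+1161 (ᅡ) compose to the precomposed syllable 사 (U+C0AC).
string jamo = "\u1109\u1161";
string syllable = jamo.Normalize(NormalizationForm.FormC);
Console.WriteLine(syllable == "\uC0AC");   // True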
libhangul (code.google.com/p/libhangul) does the conversion! It has several functions to handle different types of keyboards (i.e., keyboards with different layouts) and to convert the keys into Unicode Hangul code points.
It also has several functions that combine the Jamo into syllables (they basically implement the table lookups that Eonil mentioned in his response).
libhangul stores the Jamo in its buffer as it receives them (it does not output them immediately). After receiving enough Jamo to successfully form a syllable, it outputs the syllable. Unfortunately, this is quite confusing for the user. The way around it is to display the buffer content on the screen. After receiving a new Jamo, what has been displayed must be erased; if a syllable has been successfully formed, the syllable is displayed; otherwise, the buffer content is displayed again. Note that you can't just append the new Jamo on the screen: you must erase what you displayed before, read the previous Jamo plus the new one from the buffer, and display them all again.
The reason is that libhangul may change the code of the previous Jamo stored in its buffer to make them combinable with the new one. This way, you get the updated Hangul.
Also note that if the user changes the location of the cursor, the buffer must be emptied.
Additionally, if the user presses backspace, the last Jamo displayed on the screen must be erased and removed from the buffer.
libhangul also has some features for correcting typos. For example, if you type ᅡ and ᄉ, it converts them into 사.
Thank you JongAm Park and Eonil for your help and thoughtful comments! Since my reputation is less than 15 at this point, I can't upvote your answers, but I will when I can.

removing dead variables using antlr

I am currently maintaining an old VBA (Visual Basic for Applications) application. I've got a legacy tool that analyzes the application and prints out its dead variables. As there are more than 2000 of them, I do not want to remove them by hand.
Therefore I had the idea of transforming the individual code files that contain the dead variables (according to the aforementioned tool) into ASTs and removing them that way.
My question: Is there a recommended way to do this?
I do not want to use StringTemplate, as I would need to create templates for all rules, and if I had a comment on the hidden channel, it would be lost, right?
All I need is to remove parts of the code and print out the rest exactly as it was read in.
Does anyone have any recommendations?
Some theory
I assume that regular expressions are not enough to solve your task: the notion of a dead-code section can't be captured by a regular language, so you need the context-free power of a language described by some ANTLR grammar.
The algorithm
The following algorithm can be suggested:
Tokenize source code with a lexer.
Since you want to preserve all the correct code, don't skip or hide its tokens. Make sure to define separate tokens for the parts that may be removed or that will be used to determine the dead code; all other characters can be collected under a single token type. Here you can use the output of your auxiliary tool in predicates to reduce the number of tokens generated. ANTLR's tokenization (like any other tokenization) is expressed in a regular language, so you can't remove all the dead code in this step.
Construct AST with a parser.
Here all the power of a context-free language can be applied: define dead-code sections in the parser's rules and remove them from the AST being constructed.
Convert the AST back to source code. You can use a tree parser here, but I guess there is an easier way, which can be found by observing toString and similar methods of the tree type returned by the parser.
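As an alternative to rebuilding text from the AST, the C# ANTLR runtime's TokenStreamRewriter records deletions against the original token stream, so everything you don't touch (comments and hidden-channel tokens included) comes back out verbatim. A sketch: the VBALexer/VBAParser names, the module start rule, and the FindDeadDeclarations walk are hypothetical placeholders for your own grammar and tree traversal:

using System;
using Antlr4.Runtime;

var input  = new AntlrInputStream(sourceText);       // the code file's text
var lexer  = new VBALexer(input);                    // hypothetical generated lexer
var tokens = new CommonTokenStream(lexer);
var parser = new VBAParser(tokens);                  // hypothetical generated parser
var tree   = parser.module();                        // hypothetical start rule

var rewriter = new TokenStreamRewriter(tokens);
foreach (var decl in FindDeadDeclarations(tree, deadNames)) // hypothetical walk
{
    // Drop the declaration's token range; all untouched ranges are preserved.
    rewriter.Delete(decl.Start, decl.Stop);
}
Console.WriteLine(rewriter.GetText());               // the rest, exactly as read in

This avoids StringTemplate entirely and sidesteps the hidden-channel problem, since tokens you never delete are emitted unchanged.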

How do you test your app for Iñtërnâtiônàlizætiøn? (Internationalization?)

How do you test your app for Iñtërnâtiônàlizætiøn compliance? I tell people to store the Unicode string Iñtërnâtiônàlizætiøn into each field and then see if it is displayed correctly on output.
-- including output as a cell's content in Excel reports, in RTF format for docs, in XML files, etc.
What other tests should be done?
Added idea from @Paddy:
Also try a right-to-left language, e.g. שלום ירושלים ([The] Peace of Jerusalem). It should look like:
(source: kluger.com)
Note: Stack Overflow is implemented correctly. If the text does not match the image, then you have a problem with your browser, OS, or possibly a proxy.
Also note: you should not have to change or "set up" your already-running app to accept either the Western European characters or the Hebrew example. You should be able to just type those characters into your app and have them come back correctly in your output. In case you don't have a Hebrew keyboard lying around, copy and paste the examples from this question into your app.
Pick a culture where the text reads from right to left and set your system up for that - make sure that it reads properly (easier said than done...).
Use one of the three "pseudo-locales" available since Windows Vista:
The three pseudo-locales are for testing three kinds of localization:
Base: the qps-ploc locale is used for English-like pseudo-localizations. Its strings are longer versions of English strings, using non-Latin and accented characters instead of the normal script. Additionally, simple Latin strings should sort in reverse order with this locale.
Mirrored: qpa-mirr is used for right-to-left pseudo data, which is another area of interest for testing.
East Asian: qps-asia is intended to utilize the large CJK character repertoire, which is also useful for testing.
Windows will start formatting dates, times, numbers, and currencies in a made-up pseudo-locale that looks enough like English that you can work with it, but obvious enough that you can see when you're not respecting the locale:
[Шěđлеśđαỳ !!!], 8 ōf [Μäŕςћ !!] ōf 2006
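If you'd rather exercise the pseudo-locale from code than from the control panel, something along these lines should work (a sketch; pseudo-locales are a Windows feature, so guard for platforms where the name doesn't resolve):

using System;
using System.Globalization;

try
{
    var ploc = new CultureInfo("qps-ploc");            // Windows pseudo-locale
    Console.WriteLine(DateTime.Now.ToString("F", ploc));
    Console.WriteLine(1234567.89.ToString("C", ploc));
}
catch (CultureNotFoundException)
{
    Console.WriteLine("Pseudo-locales are not available here.");
}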
There is more to internationalization than unicode handling. You also need to make sure that dates show up localized to the user's timezone, if you know it (and make sure there's a way for people to tell you what their time zone is).
One handy fact for testing timezone handling is that there are two timezones (Pacific/Tongatapu and Pacific/Midway) that are actually 24 hours apart. So if timezones are being handled properly, the dates should never be the same for users in those two timezones for any timestamp. If you use any other timezones in your tests, results may vary depending on the time of day you run your test suite.
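A sketch of that check in C# (the IANA zone IDs used here resolve on .NET 6+ and on non-Windows runtimes; older Windows runtimes want the Windows zone names instead):

using System;

var instant = DateTimeOffset.UtcNow;
var tonga   = TimeZoneInfo.FindSystemTimeZoneById("Pacific/Tongatapu"); // UTC+13
var midway  = TimeZoneInfo.FindSystemTimeZoneById("Pacific/Midway");    // UTC-11

// The same instant falls on different calendar dates in the two zones --
// always, because they are 24 hours apart.
var dateInTonga  = TimeZoneInfo.ConvertTime(instant, tonga).Date;
var dateInMidway = TimeZoneInfo.ConvertTime(instant, midway).Date;
Console.WriteLine(dateInTonga != dateInMidway);   // should print True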
You also need to make sure dates and times are formatted in a way that makes sense for the user's locale, or failing that, that any potential ambiguity in the rendering of dates is explained (e.g. "05/11/2009 (dd/mm/yyyy)").
"Iñtërnâtiônàlizætiøn" is a really bad string to test with since all the characters in it also appear in ISO-8859-1, so the string can work completely without any Unicode support at all! I've no idea why it's so commonly used when it utterly fails at its primary function!
Even Chinese or Hebrew text isn't a good choice (though right-to-left is a whole can of worms by itself) because it doesn't necessarily contain anything outside 3-byte UTF-8, which curiously was a very large hole in MySQL's default UTF-8 implementation (which is limited to 3-byte chars), until it was fixed by the addition of the utf8mb4 charset in MySQL 5.5. These days one of the more common uses of >3-byte UTF-8 is Emojis like these: [💝🐹🌇⛔]. If you don't see some pretty little coloured pictures between those brackets, congratulations, you just found a hole in your Unicode stack!
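Both claims are easy to check mechanically: every character of the classic test string fits in one Latin-1 byte, while an emoji needs a surrogate pair (a code point above U+FFFF, hence 4-byte UTF-8). For instance:

using System;
using System.Linq;

string trap  = "Iñtërnâtiônàlizætiøn";
string emoji = "💝";

Console.WriteLine(trap.All(c => c < 256));                  // True: pure Latin-1
Console.WriteLine(char.IsHighSurrogate(emoji[0]));          // True: surrogate pair
Console.WriteLine(char.ConvertToUtf32(emoji, 0) > 0xFFFF);  // True: 4-byte UTF-8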
First, learn The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.
Make sure your application can handle Turkish. It has several quirks that break applications which assume English rules. Because there are four kinds of the letter "i" (dotted and dotless, upper and lower case), applications that assume uppercase(i) => I will break when using Turkish rules, where uppercase(i) => İ.
A common thing to do is to check whether the user typed the command "exit" by using lowercase(userInput) == "exit" or uppercase(userInput) == "EXIT". This works as expected under English rules but fails under Turkish rules, where "exıt" != "exit" and "EXİT" != "EXIT". To do this correctly, use the case-insensitive, culture-independent comparison routines that are built into all modern languages.
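The trap and its fix, in C# for instance:

using System;
using System.Globalization;

var tr = new CultureInfo("tr-TR");

Console.WriteLine("exit".ToUpper(tr));            // EXİT -- dotted capital İ
Console.WriteLine("EXIT".ToLower(tr));            // exıt -- dotless ı
Console.WriteLine("exit".ToUpper(tr) == "EXIT");  // False under Turkish rules

// Ordinal (culture-insensitive) comparison behaves the same everywhere:
Console.WriteLine(string.Equals("exit", "EXIT",
    StringComparison.OrdinalIgnoreCase));         // True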
I was thinking about this question from a completely different angle. I can't recall exactly what we did, but on a previous project I think we wound up changing the Regional Settings (in the Regional and Language Options control panel?) to help us ensure the localized strings were working.

What are the things we should consider while writing a spell checker?

I want to write a very simple spell checker. The spell checker will try to match the input word with equivalent words from the dictionary.
What can be done to find those 'equivalent words'? What analysis can be performed on two words to mark them as equivalent?
Before investing too much in trying to unravel that, I'd first look at already existing implementations like Aspell or NetSpell, for two main reasons:
Not much point in re-inventing the wheel. Spell checking is much trickier than it first appears and it makes sense to build on work that has already been done
If your interest is finding out how to do it, the source code and community will be a great benefit should you decide to implement your own anyway
Much depends on your use case. For example:
Is your dictionary very small (about twenty words)? In this case it probably is better to precompute all possible nearby mistaken words and use a table/hash lookup.
What is your error model? Aspell has at least two (one for spelling errors caused by nearby letters on the keyboard, and the other for spelling errors caused by the way a word sounds).
How dynamic is your dictionary? Can you afford to do a massive preparation in order to get an efficient retrieval?
You may need a "word equivalence" measure like Double Metaphone, in addition to edit distance.
You can get some feel by reading Peter Norvig's great description of spelling correction.
And, of course, whenever possible, steal code. Do not reinvent the wheel without a reason - a reason could be a very special domain, a special way your users make spelling mistakes, or just to learn how it's done.
Edit distance is the theory you need to write a spell checker. You also need a dictionary; most UNIX systems come with one already installed for your locale.
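The standard dynamic-programming formulation is compact enough to quote (a sketch counting single-character insertions, deletions, and substitutions):

using System;

static int EditDistance(string a, string b)
{
    var d = new int[a.Length + 1, b.Length + 1];
    for (int i = 0; i <= a.Length; i++) d[i, 0] = i;   // delete everything
    for (int j = 0; j <= b.Length; j++) d[0, j] = j;   // insert everything

    for (int i = 1; i <= a.Length; i++)
        for (int j = 1; j <= b.Length; j++)
        {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1;
            d[i, j] = Math.Min(Math.Min(
                d[i - 1, j] + 1,          // deletion
                d[i, j - 1] + 1),         // insertion
                d[i - 1, j - 1] + cost);  // substitution
        }
    return d[a.Length, b.Length];
}

// EditDistance("recieve", "receive") == 2: the swapped "ie" costs two substitutions.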
I just finished implementing a spell checker, and I used a combination of the following to get a list of "suggested" words:
Phonetic hashing of the "misspelled" word, to look up a hash of identically-hashed real dictionary words (for Java, check out Apache Commons Codec for a suitable library). The phonetic hashes of your dictionary file can be precomputed.
Edit distance between the input and the potentials. This is reasonably expensive, so you need to reduce the list first with something like a phonetic hash -- assuming a higher-volume load; in my case, a server-based spell check.
A known list of common misspellings, e.g. recieve vs. receive.
An ordered list of the most common words in the English language.
Essentially I weighted each potential word primarily on edit distance and commonality. E.g., if word probability is a percentage, then
weight = edit-distance * 100 / probability
(lower weights are better)
But then I also overrode any result with the known common misspellings (i.e., these always float to the top of the suggested results).
There may be better ways, but this worked pretty well.
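A sketch of that ranking, reusing an EditDistance function like the one earlier in this thread (the probability table and misspelling list are illustrative):

using System;
using System.Collections.Generic;
using System.Linq;

static IEnumerable<string> Suggest(
    string input,
    Dictionary<string, double> wordProbability,    // word -> frequency %
    Dictionary<string, string> knownMisspellings)  // e.g. "recieve" -> "receive"
{
    // Known misspellings always float to the top suggested result.
    if (knownMisspellings.TryGetValue(input, out var fix))
        return new[] { fix };

    return wordProbability
        .OrderBy(kv => EditDistance(input, kv.Key) * 100 / kv.Value) // lower = better
        .Select(kv => kv.Key);
}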
You may also wish to ignore ALL CAPS words, initials etc, so choosing what to ignore is also something to think about.
Under Linux/Unix you have ispell. Why reinvent the wheel?

Code formatting: is lining up similar lines ok? [closed]

Closed. This question is opinion-based. It is not currently accepting answers. Closed 6 years ago.
I recently discovered that our company has a set of coding guidelines (hidden away in a document management system where no one can find it). It generally seems pretty sensible, and keeps away from the usual religious wars about where to put '{'s and whether to use hard tabs. However, it does suggest that "lines SHOULD NOT contain embedded multiple spaces". By which it means don't do this sort of thing:
foo    = 1;
foobar = 2;
bar    = 3;
Or this:
if      ( test_one    ) return 1;
else if ( longer_test ) return 2;
else if ( shorter     ) return 3;
else                    return 4;
Or this:
thing foo_table[] =
{
    { "aaaaa", 0 },
    { "aa",    1 },
    // ...
};
The justification for this is that changes to one line often require every line to be edited. That makes it more effort to change, and harder to understand diffs.
I'm torn. On the one hand, lining up like this can make repetitive code much easier to read. On the other hand, it does make diffs harder to read.
What's your view on this?
2008: Since I supervise daily merges of source code... I can only recommend against it.
It is pretty, but if you do merges on a regular basis, the benefit of 'easier to read' quickly becomes far smaller than the effort involved in merging that code.
Since that format cannot be automated in an easy way, the first developer who does not follow it will trigger non-trivial merges.
Do not forget that in a source code merge, one cannot ask the diff tool to ignore spaces:
otherwise, "" and " " would look the same during the diff, meaning no merge necessary... but the compiler (and the coder who added the space between the string's double quotes) would not agree with that!
2020: As noted in the comments by Marco, most merge tools can now ignore whitespace, and aligning equals signs is an auto-format option in most IDEs.
I still prefer languages which come with their own formatting options, like Go and its gofmt command.
Even Rust has its rustfmt now.
I'm torn. On the one hand, lining up like this can make repetitive code much easier to read. On the other hand, it does make diffs harder to read.
Well, since making code understandable is more important than making diffs understandable, you should not be torn.
IMHO lining up similar lines does greatly improve readability. Moreover, it allows easier cut-n-pasting with editors that permit vertical selection.
I never do this, and I always recommend against it. I don't care about diffs being harder to read. I do care that it takes time to do this in the first place, and it takes additional time whenever the lines have to be realigned. Editing code that has this format style is infuriating, because it often turns into a huge time sink, and I end up spending more time formatting than making real changes.
I also dispute the readability benefit. This formatting style creates columns in the file. However, we do not read in column style, top to bottom. We read left to right. The columns distract from the standard reading style, and pull the eyes downward. The columns also become extremely ugly if they aren't all perfectly aligned. This applies to extraneous whitespace, but also to multiple (possibly unrelated) column groups which have different spacing, but fall one after the other in the file.
By the way, I find it really bizarre that your coding standard doesn't specify tabbing or brace placement. Mixing different tabbing styles and brace placements will damage readability far more than using (or not using) column-style formatting.
I never do this. As you said, it sometimes requires modifying every line to adjust spacing. In some cases (like your conditionals above) it would be perfectly readable and much easier to maintain if you did away with the spacing and put the blocks on separate lines from the conditionals.
Also, if you have decent syntax highlighting in your editor, this kind of spacing shouldn't really be necessary.
There is some discussion of this in the ever-useful Code Complete by Steve McConnell. If you don't own a copy of this seminal book, do yourself a favor and buy one. Anyway, the discussion is on pages 426 and 427 of the first edition, which is the edition I have at hand.
Edit:
McConnell suggests aligning the equal signs in a group of assignment statements to indicate that they're related. He also cautions against aligning all equal signs in a group of assignments because it can visually imply relationship where there is none. For example, this would be appropriate:
Employee.Name = "Andrew Nelson"
Employee.Bdate = "1/1/56"
Employee.Rank = "Senator"
CurrentEmployeeRecord = 0
For CurrentEmployeeRecord From LBound(EmployeeArray) To UBound(EmployeeArray)
. . .
While this would not
Employee.Name = "Andrew Nelson"
Employee.Bdate = "1/1/56"
Employee.Rank = "Senator"
CurrentEmployeeRecord = 0
For CurrentEmployeeRecord From LBound(EmployeeArray) To UBound(EmployeeArray)
. . .
I trust that the difference is apparent. There is also some discussion of aligning continuation lines.
Personally I prefer the greater code readability at the expense of slightly harder-to-read diffs. It seems to me that in the long run an improvement to code maintainability -- especially as developers come and go -- is worth the tradeoff.
With a good editor their point is just not true. :)
(See "visual block" mode for vim.)
P.S.: OK, you still have to change every line, but it's fast and simple.
I try to follow two guidelines:
Use tabs instead of spaces whenever possible to minimize the need to reformat.
If you're concerned about the effect on revision control, make your functional changes first, check them in, then make only cosmetic changes.
Public flogging is permissible if bugs are introduced in the "cosmetic" change. :-)
2020-04-19 Update: My, how things change in a dozen years! If I were to answer this question today, it would probably be something like: "Ask your editor to format your code for you and/or tell your diff tool to ignore whitespace when you're making cosmetic changes."
Today, when I review code for readability and think the clarity would be improved by formatting it differently, I always end the suggestion with, "...unless the editor does it this way automatically. Don't fight your tools. They always win."
My stance is that this is an editor problem: While we use fancy tools to look at web pages and when writing texts in a word processor, our code editors are still stuck in the ASCII ages. They are as dumb as we can make them and then, we try to overcome the limitations of the editor by writing fancy code formatters.
The root cause is that your compiler can't ignore formatting statements in the code which say "hey, this is a table" and that IDEs can't create a visually pleasing representation of the source code on the fly (i.e. without actually changing one byte of the code).
One solution would be to use tabs, but our editors can't automatically align tabs in consecutive rows (which would make so many things so much easier). And to add insult to injury, if you mess with the tab width (basically anything != 8), you can read your own source code but no code from anyone else -- say, the example code that comes with the libraries you use. And lastly, our diff tools have no option to "ignore whitespace except where it counts", and the compilers can't produce diffs, either.
Eclipse can at least format assignments in a tabular manner which will make big sets of global constants much more readable. But that's just a drop of water in the desert.
If you're planning to use automated code-standard validation (e.g. CheckStyle, ReSharper, or anything like that), those extra spaces will make it quite difficult to write and enforce the rules.
You can set your diff tool to ignore whitespace (GNU diff: -w).
This way, your diffs will skip those lines and only show the real changes. Very handy!
We had a similar issue with diffs on multiple contracts... We found that tabs worked best for everyone: set your editor to maintain tabs, and every developer can choose his own tab length as well.
Example: I like 2-space tabs, so code is very compact on the left, but the default is 4. So although it looks very different as far as indents etc. go on our various screens, the diffs are identical and don't cause issues with source control.
I like the first and last, but not the middle so much.
This is PRECISELY the reason the good Lord gave us tabs -- adding a character in the middle of the line doesn't screw up alignment.