Why is there a convention of 1-based line numbers but 0-based char numbers? - indexing

According to TkDocs:
The "1.0" here represents where to insert the text, and can be read as "line 1, character 0". This refers to the first character of the first line; for historical conventions related to how programmers normally refer to lines and characters, line numbers are 1-based, and character numbers are 0-based.
I hadn't heard of this convention before, and I can't find anything relevant on Google. Can anyone explain this to me please?

I think you're referring to Tk's text widget. The man page says:
Lines are numbered from 1 for consistency with other UNIX programs that use this numbering scheme.
Although, I'm not sure which Unix tools it's talking about.
Update:
As mentioned in the comments, it looks like a lot of Unix text manipulation tools start line numbering at 1. And with Tcl/Tk having a Unix origin, it makes sense to be as compatible as possible with the underlying OS environment.

It really is nothing more than convention, but here is a suggestion.
Character positions are generally thought of in the same way as a Java iterator, which is a "pointer" to a position between two characters. Thus the first character is the one after index position 0. Substrings are taken between two inter-character positions, for instance.
Line positions on the other hand are generally thought of more in the way of a .NET enumerator, which is a "pointer" to the item itself, not to a position in between. Thus the first line is the line at position 1.
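To make the two conventions concrete, here is a minimal Python sketch that converts a flat, 0-based character offset into a Tk-style "line.char" index. The function name offset_to_tk_index is made up for illustration and is not part of Tk or tkinter.

```python
def offset_to_tk_index(text: str, offset: int) -> str:
    """Convert a 0-based character offset into a Tk-style "line.char" index."""
    if not 0 <= offset <= len(text):
        raise ValueError("offset out of range")
    before = text[:offset]
    # Lines are counted like items: the first line is line 1.
    line = before.count("\n") + 1
    # Characters are counted like gaps: position 0 sits before the first character.
    char = offset - (before.rfind("\n") + 1)
    return f"{line}.{char}"


print(offset_to_tk_index("ab\ncd", 0))  # 1.0
print(offset_to_tk_index("ab\ncd", 4))  # 2.1
```

The character part behaves like a slice boundary (text[0:2] takes the characters between positions 0 and 2), while the line part names the line itself.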


VBA replace certain carriage

All.
I am used to programming VBA in Excel, but am new to the structures in Word.
I am working through a library of text files to update them. Many of them are either OCR documents, or were manually entered.
Each has a recurring pattern, the most common of which is unnecessary carriage returns.
For example, I am looking at several text files where there is a double return after each line. A search and replace of all double carriage returns removes all paragraph distinctions.
However, each line is approximately 30 characters long, and if I manually perform the following logic, it gives me a functional document.
If there is a double carriage return after 30+ characters, I replace them with a space.
If there were less than 30 characters prior to the double return, I replace them with a single return.
Can anyone help me with some rudimentary code that would help me get started on that? I could then modify it for each "pattern" of text documents I have.
e.g.
In this case, there are more than
thirty characters per line. And I
will keep going to illustrate this
example.
This would be a new paragraph, and
would be separated by another of
the single returns.
I want code that would return:
In this case, there are more than thirty characters per line. And I will keep going to illustrate this example.
This would be a new paragraph, and would be separated by another of the single returns.
Let me know if anyone can throw something out that I can play with!
You can do this without code (which RegEx requires), simply using Word's own wildcard Find/Replace tools, where:
Find = ([!^13]{30,})[^13]{1,}
Replace = \1^32
and, to clean up the residual multi-paragraph breaks:
Find = [^13]{2,}
Replace = ^p
You could, of course, record the above as a macro...
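If you would rather script the rule from the question outside Word, here is a rough Python sketch of the same logic. It assumes the file has already been read into a string; the function name unwrap_double_returns and the min_len parameter are invented for this example.

```python
import re


def unwrap_double_returns(text: str, min_len: int = 30) -> str:
    """Join lines separated by double returns, keeping real paragraph breaks."""
    # Normalise Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    lines = re.split(r"\n{2,}", text)
    out = []
    for i, line in enumerate(lines):
        out.append(line)
        if i < len(lines) - 1:
            # A long line was wrapped mid-paragraph: join with a space.
            # A short line ended the paragraph: keep a single return.
            out.append(" " if len(line) >= min_len else "\n")
    return "".join(out)
```

The threshold of 30 comes from the question; it would need adjusting for each "pattern" of document.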
Here is a RegEx that might work for you:
(\n\n)(?<!\.(\n\n))
The substitution is just a plain space, you can try it out (and modify / tweak it) here: https://regex101.com/r/zG9GPw/4
This 'pattern' tells the RegEx engine to look for the newline character \n occurring twice in a row, like this: \n\n (worth noting this is taken from your question and might be different in your files, e.g. it could be \r\n). It assumes that a valid line break will be preceded by a full stop: \..
In RegEx the full stop symbol is a single character wild card so it needs to be escaped with the '\' (n and r are normal characters, escaping them tells the RegEx engine they represent newline and return characters).
So... the expression is looking for a group of x2 newline characters but then uses a negative look-behind to exclude any matches where the previous character was a full stop.
Anyway, it's all explained on the site:
Here is how you could do a RegEx find and replace using Notepad++ (I'm not sure if it comes with RegEx or if a plugin is needed; either way it is easy). You can set a location, filters (to target specific file types), and other options (such as searching in sub-directories).
Other than that, as @MacroPod pointed out, you could also do this with MS Word, document by document, without using any code :)
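For comparison, the same look-behind pattern can be tried in a script. This is only a sketch in Python (not VBA); it works there because the look-behind has a fixed width. The helper name join_unless_after_full_stop is made up for the example.

```python
import re

# Two newlines, but only when the text just before them is not a full stop.
PATTERN = re.compile(r"(\n\n)(?<!\.(\n\n))")


def join_unless_after_full_stop(text: str) -> str:
    """Replace double newlines with a space unless they follow a full stop."""
    return PATTERN.sub(" ", text)


print(repr(join_unless_after_full_stop("more than\n\nthirty.\n\nNew")))
# 'more than thirty.\n\nNew'
```

Note that this treats every full stop before a break as the end of a paragraph, so a wrapped line that merely happens to end a sentence is left split.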

jEdit in hard word-wrap mode: insert comment character automatically?

Probably quite a niche question, but I believe in the power of a big community: Is it possible to set up jEdit in such a way that it automatically inserts a comment character (//, #, ... depending on the edit mode) at the beginning of a new line, if the line before the wrap was a comment?
Sample:
# This is a comment spanning multiple lines. If I continue to type here, it
# wraps around automatically, but I have to manually add a `#` to each line.
If I continue to type after the . the third line should start with the # automatically. I searched in the plugin repository but could not find anything related.
Background: jEdit has the concept of soft and hard wrap. While soft wrap only breaks lines visually at a character limit, it does not insert line breaks in the file. Hard wrap, on the other hand, inserts \n into the file at the desired character count.
This is not exactly what you want, but I use the macro Enter_with_Prefix.bsh to automatically insert the prefix (e.g., #, //) at the beginning of the new line.
Description copied from Enter_with_Prefix.bsh:
Enter_with_Prefix.bsh - a Beanshell macro for jEdit
that starts a new line continuing any recognized
sequence that started the previous. For example,
if the previous line begins with "1." the next will
be prefixed with "2.". It supports alpha lists (a., b., etc...),
bullet lists (+, =, *, etc..), comments, Javadocs,
Java import statements, e-mail replies (>, |, :),
and is easy to extend with new sequence types. Suggested
shortcut for this macro is S+ENTER (SHIFT+ENTER).
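To show the result the question is after, here is a small Python sketch that hard-wraps a comment and repeats the prefix on every line. It is not a jEdit macro and does not use any jEdit API; it only illustrates the target behaviour using the standard textwrap module. The function name wrap_comment is invented for this example.

```python
import textwrap


def wrap_comment(text: str, prefix: str = "# ", width: int = 78) -> str:
    """Hard-wrap text at `width` columns, starting every line with `prefix`."""
    return textwrap.fill(
        text,
        width=width,
        initial_indent=prefix,     # prefix for the first line
        subsequent_indent=prefix,  # same prefix after each inserted line break
    )


print(wrap_comment("This is a comment spanning multiple lines. " * 3, width=60))
```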

How to identify binary and text files using Smalltalk

I want to verify that a given file in a path is a text file, i.e. not binary, i.e. readable by a human. I guess I could read the first characters and check each one with:
isAlphaNumeric
isSpecial
isSeparator
isOctetCharacter ???
but joining all those testing methods with and: [ ... and: [ ... and: [ ] ] ] seems not to be very smalltalkish. Any suggestion for a more elegant way?
(There is a Python version here How to identify binary and text files using Python? which could be useful but syntax and implementation looks like C.)
only heuristics; you can never be really certain...
For ascii, the following may do:
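| isPlausibleAscii numChecked isPossiblyText |
"A character is plausible if it is printable ASCII (32..126) or whitespace."
isPlausibleAscii :=
    [:char |
        (char codePoint between: 32 and: 126)
            or: [ char isSeparator ] ].
"Only inspect the first 1024 characters."
numChecked := text size min: 1024.
isPossiblyText := text from: 1 to: numChecked conform: isPlausibleAscii.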
For unicode (UTF8 ?) things become more difficult; you could then try to convert. If there is a conversion error, assume binary.
PS: if you don't have from:to:conform:, replace by (copyFrom:to:) conform:
PPS: if you don't have conform: , try allSatisfy:
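The "try to convert, and assume binary on a conversion error" idea for UTF-8 can be sketched as follows. This is in Python rather than Smalltalk, purely as an illustration; the function name is_plausible_utf8_text and the 1024-byte limit are assumptions carried over from the ASCII snippet above.

```python
import codecs


def is_plausible_utf8_text(data: bytes, limit: int = 1024) -> bool:
    """Heuristic: the first `limit` bytes decode as UTF-8 and look printable."""
    sample = data[:limit]
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False when the sample was cut short, so a multi-byte
        # character split at the cut does not count as an error.
        decoded = decoder.decode(sample, final=len(data) <= limit)
    except UnicodeDecodeError:
        return False  # conversion error: assume binary
    return all(ch.isprintable() or ch.isspace() for ch in decoded)
```

As with the ASCII check, this is only a heuristic: some binary files will pass and some unusual text files will fail.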
All text contains more space than you'd expect to see in a binary file, and some encodings (UTF16/32) will contain lots of 0's for common languages.
A smalltalky solution would be to hide the gory details in a method on Standard/MultiByte-FileStream; #isProbablyText would probably be a good choice.
It would essentially do the following:
- Store the current state if you intend to use it later, and reset to the start (set a Latin1 converter if you use a MultiByteStream).
- Iterate over the next N characters (where N is an appropriate number).
- Encounter a non-printable ASCII char? It's probably binary, so return false. (There is no special selector for this; use a map, implement a new method on Character, or something similar.)
- Increase two counters if appropriate: one for space characters, and another for zero characters.
- If the loop finishes, return whether either counter has reached a statistically significant share of the characters read.
TLDR; Use a method to hide the gory details, otherwise it's pretty much the same.
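The steps above can be sketched like this. It is written in Python instead of Smalltalk, only to show the control flow. The name is_probably_text, the thresholds (5% spaces, 25% zero bytes), and the decision to let zero bytes through the non-printable check so they can be counted are all assumptions, not part of the answer.

```python
# Control bytes tolerated in text: NUL (counted below), tab, LF, FF, CR.
ALLOWED_CONTROLS = {0x00, 0x09, 0x0A, 0x0C, 0x0D}


def is_probably_text(data: bytes, n: int = 1024,
                     min_space_ratio: float = 0.05,
                     min_zero_ratio: float = 0.25) -> bool:
    """Heuristic text check over the first `n` bytes (thresholds are guesses)."""
    sample = data[:n]
    if not sample:
        return False
    spaces = zeros = 0
    for byte in sample:
        if (byte < 0x20 or byte == 0x7F) and byte not in ALLOWED_CONTROLS:
            return False  # non-printable ASCII: probably binary
        if byte == 0x20:
            spaces += 1
        elif byte == 0x00:
            zeros += 1
    # Text has many spaces; UTF-16/32 text has many zero bytes.
    return (spaces / len(sample) >= min_space_ratio
            or zeros / len(sample) >= min_zero_ratio)
```

Hiding this behind one well-named method is the point; the thresholds themselves would need tuning against real files.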

Preserve "long" spaces in PDFBox text extraction

I am using PDFBox to extract text from PDF.
The PDF has a tabular structure, which is quite simple, and the columns are also very widely spaced from each other.
This works really well, except that all kinds of horizontal space gets converted into a single space character, so that I cannot tell columns apart anymore (space within words in a column looks just like space between columns).
I appreciate that a general solution is very hard, but in this case the columns are really far apart so that having a simple differentiation between "long spaces" and "space between words" would be enough.
Is there a way to tell PDFBox to turn horizontal whitespace of more than x inches into something other than a single space? A proportional approach (x inches become y spaces) would also work.
The pdftotext C library/tool has a '-layout' switch that tries to preserve the layout. Basically, if I can emulate that with PDFBox, that would be perfect.
There does not seem to be a setting for this, but I was able to modify the source for the PDFTextStripper tool to output a column separator (|) when a "long" space was encountered. In the code where it was building the output line it is possible to look at the x positions of the current and previous letter, and if it is large enough, do something special. PDFTextStripper has lots of protected methods, but turned out to be not really all that extensible. I ended up having to copy the whole class to change a private method.
Looking at the code in there, I call myself lucky that with the particular PDF, this simple approach was successful. A more general solution seems very tricky.
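The gap check itself is simple once the positions are available. Here is a standalone sketch in Python (not PDFBox code), assuming each glyph has already been extracted as a hypothetical (x_start, x_end, char) tuple, sorted left to right on one line; the function name and the threshold value are made up for illustration.

```python
def join_with_separators(glyphs, gap_threshold=20.0, sep="|"):
    """Build a line of text, inserting `sep` wherever a "long" gap occurs.

    glyphs: iterable of (x_start, x_end, char), sorted by x_start.
    """
    out = []
    prev_end = None
    for x_start, x_end, ch in glyphs:
        # Compare this glyph's left edge with the previous glyph's right edge.
        if prev_end is not None and x_start - prev_end > gap_threshold:
            out.append(sep)
        out.append(ch)
        prev_end = x_end
    return "".join(out)
```

For example, glyphs "a" and "b" at x 0..10 followed by "c" and "d" at x 100..110 would give "ab|cd" with the default threshold.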
PDF text extraction is difficult.
If the text was output as one big string separated by spaces such as :-
PDFTextOut(" Column 1 Column 2 Column 3");
and you are using a fixed-width font such as Courier, then you could theoretically calculate the number of spaces between items of text, because each character is the same width. If the font is proportional, such as Arial, then the calculation is harder.
In reality, most PDFs are generated by individually placing each piece of text directly into its position. Therefore, there is technically no space character or any other character between columns. The text is just placed at an absolute position on the page.
PDFMoveTo(100,100);
PDFTextOut("Column 1");
PDFMoveTo(250,100);
PDFTextOut("Column 2");
In order to perform data extraction on PDF documents you have to do a little bit more work to find and match column data by using pixel locations as you have mentioned and by making some assumptions and having a little bit of luck.
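One way to picture the proportional approach (x units of horizontal distance become y spaces) is to map each positioned fragment onto a character grid. The Python sketch below assumes the fragments of one line are available as hypothetical (x, text) pairs and that a single average character width is good enough; both are simplifying assumptions, and this is not how pdftotext -layout is actually implemented.

```python
def layout_line(fragments, char_width=10.0):
    """Place (x, text) fragments on a character grid to mimic the page layout."""
    line = ""
    for x, text in sorted(fragments):
        column = int(round(x / char_width))
        if column > len(line):
            # Pad with spaces up to the column where this fragment starts.
            line += " " * (column - len(line))
        elif line:
            # Fragments overlap on the grid: keep at least one space between them.
            line += " "
        line += text
    return line


print(repr(layout_line([(100, "Column 1"), (250, "Column 2")])))
```

With the positions from the PDFMoveTo example above and a character width of 10, "Column 1" starts at column 10 and "Column 2" at column 25, so the wide gap survives as a run of spaces.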

RegexKitLite: Match Expression --> Match anything except ] --> Match ]

I am essentially attempting to replace all of the footnotes in a large text. There are various reasons I am doing this in Objective-C, so please assume that constraint.
Every footnote begins with this: [Footnote
Every footnote ends with this: ]
There can be absolutely anything between those two markers, including line breaks. However, there will never be ] between them.
So, essentially I want to match [Footnote, then match anything except ], until ] is matched.
This is the closest I have been able to get to identifying all of the footnotes:
NSString *regexString = @"[\\[][F][o][o][t][n][o][t][e][^\\]\n]*[\\]]";
Using this regular expression manages to identify 780/889 footnotes. It also appears that none of those 780 are false alarms. The only ones it appears to miss are those footnotes that have line breaks in them.
I have spent a lengthy amount of time on www.regular-expressions.info, specifically on the page about dots (http://www.regular-expressions.info/dot.html). This has helped me to create the above regular expressions, but I have not truly figured out how to include any character or line break, except a right bracket.
Using the following regular expression instead manages to capture all of the footnotes, but it captures way too much text, because * is greedy: (?s)[\\[][F][o][o][t][n][o][t][e].*[\\]]
Here is some sample text that the regular expression is run on:
<p id="id00082">[Footnote 1: In the history of Florence in the early part of the XVIth century <i>Piero di Braccio Martelli</i> is frequently mentioned as <i>Commissario della Signoria</i>. He was famous for his learning and at his death left four books on Mathematics ready for the press; comp. LITTA, <i>Famiglie celebri Italiane</i>, <i>Famiglia Martelli di Firenze</i>.—In the Official Catalogue of MSS. in the Brit. Mus., New Series Vol. I., where this passage is printed, <i>Barto</i> has been wrongly given for Braccio.</p>
<p id="id00083">2. <i>addi 22 di marzo 1508</i>. The Christian era was computed in Florence at that time from the Incarnation (Lady day, March 25th). Hence this should be 1509 by our reckoning.</p>
<p id="id00084">3. <i>racolto tratto di molte carte le quali io ho qui copiate</i>. We must suppose that Leonardo means that he has copied out his own MSS. and not those of others. The first thirteen leaves of the MS. in the Brit. Mus. are a fair copy of some notes on physics.]</p>
<p id="id00085">Suggestions for the arrangement of MSS treating of particular subjects.(5-8).</p>
When you put together the science of the motions of water, remember to include under each proposition its application and use, in order that this science may not be useless.--
[Footnote 2: A comparatively small portion of Leonardo's notes on water-power was published at Bologna in 1828, under the title: "_Del moto e misura dell'Acqua, di L. da Vinci_".]
In this example there are two footnotes and some non-footnote text. The first footnote, as you can see, contains two line breaks inside it. The second one contains no line breaks.
The first regular expression I mentioned above will manage to capture Footnote 2 in this example text, but it will not capture Footnote 1 because it contains line breaks.
Any improvements on my regular expression would be most appreciated.
Try
@"\\[Footnote[^\\]]*\\]";
This should match across newlines. No need to put a single character into a character class, either.
As a commented, multiline regex (without string escapes):
\[ # match a literal [
Footnote # match literal "Footnote"
[^\]]* # match zero or more characters except ]
\] # match ]
Inside a character class ([...]), the caret ^ takes on a different meaning; it negates the contents of the class. So [ab] matches a or b, whereas [^ab] matches any character except a or b.
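To check the behaviour quickly outside Objective-C, here is the same pattern in a short Python sketch (Python's re module is used only for illustration; the question itself uses RegexKitLite). The helper name find_footnotes is made up.

```python
import re

# Literal "[Footnote", then any characters except "]", then the closing "]".
FOOTNOTE = re.compile(r"\[Footnote[^\]]*\]")


def find_footnotes(text: str) -> list:
    """Return every footnote, including ones that span several lines."""
    return FOOTNOTE.findall(text)


sample = "a [Footnote 1: line one\nline two] b [Footnote 2: x] c"
print(len(find_footnotes(sample)))  # 2
```

The negated class [^\]] matches a newline like any other character, which is why no (?s) flag is needed.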
Of course, if you have nested footnotes, this will malfunction. A text like [Footnote foo [footnote bar] foo] will match from the beginning until bar]. To avoid this, change the regex to
@"\\[Footnote[^\\]\\[]*\\]";
so neither opening nor closing brackets are allowed. Then of course, you only match the innermost Footnotes and will have to apply the same regex twice (or more, depending on the maximum level of nesting) to the entire text, "peeling back" layer by layer.
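The "peeling back" can be written as a loop that reapplies the regex until nothing matches any more. This is a Python sketch under the assumption that footnotes should simply be removed and that every nested footnote starts with a capitalised "[Footnote"; the function name strip_footnotes is invented for the example.

```python
import re

# Innermost footnotes only: no "[" or "]" allowed between the markers.
INNERMOST = re.compile(r"\[Footnote[^\]\[]*\]")


def strip_footnotes(text: str) -> str:
    """Remove footnotes layer by layer, innermost first."""
    while True:
        text, count = INNERMOST.subn("", text)
        if count == 0:  # nothing left to peel
            return text


print(repr(strip_footnotes("x [Footnote a [Footnote b] c] y")))  # 'x  y'
```

The loop runs once per nesting level plus one final pass that finds nothing, so the maximum depth does not need to be known in advance.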