What does "hyphenation vector" mean? - hyphenation

The Hyphen library seems to be a very popular and free way to have hyphenation in your app.
What does hyphenation vector mean?
I am running the example attached to the library source code.
Example output:
hibernate // input word
030412000 // output hyphenation vector
hi=ber=nate // hyphen points
- hi=bernate
- hiber=nate
Odd numbers in the vector indicate hyphenation points. But what do all of those values mean?

László Németh describes the algorithm in OpenOffice's documentation in full detail.
The library uses the algorithm developed by Frank M. Liang ("Word Hy-phen-a-tion by Com-pu-ter"): all letters in digrams, trigrams, and longer patterns are assigned numerical values to indicate it's a 'usual' place (an odd number) or an 'unusual' place (an even number) for a hyphen to occur. The higher the number is, the greater importance -- a pattern will almost never be broken on a larger even number, and almost always on a larger odd number. The number sequences are statistically determined on a corpus of pre-hyphenated words.
Note that the numbers are for positions between two characters. A better notation would have been
h i b e r n a t e
0 3 0 4 1 2 0 0 (0)
(where the last 0 is obsolete).

Related

what's best way for awk to check arbitrary integer precision

from GNU gawk's page
https://www.gnu.org/software/gawk/manual/html_node/Checking-for-MPFR.html
they have a formula to check arbitrary precision
function adequate_math_precision(n) { return (1 != (1+(1/(2^(n-1))))) }
My question is : wouldn't it be more efficient by staying within integer math domain with a formula such as
( 2^abs(n) - 1 ) % 2 # note 2^(n-1) vs. 2^|n| - 1
Since any power of 2 must also be even, then subtracting 1 must always be odd, then its modulo (%) over 2 becomes indicator function for is_odd() for n >= 0, while the abs(n) handles the cases where it's negative.
Or does the modulo necessitate a casting to float point, thus nullifying any gains ?
Good question. Let's tackle it.
The proposed snippet aims at checking wether gawk was invoked with the -M option.
I'll attach some digression on that option at the bottom.
The argument n of the function is the floating point precision needed for whatever operation you'll have to perform. So, say your script is in a library somewhere and will get called but you have no control over it. You'll run that function at the beginning of the script to promptly throw exception and bail out, suggesting that the end result will be wrong due to lack of bits to store numbers.
Your code stays in the integer realm: a power of two of an integer is an integer. There is no need to use abs(n) here, because there is no point in specifying how many bits you'll need as a negative number in the first place.
Then you subtract one from an even, integer number. Now, unless n=0, in which case 2^0=1 and then your code reads (1 - 1) % 2 = 0, your snippet shall always return 1, because the quotient (%) of an odd number divided by two is 1.
Problem is: you are trying to calculate a potentially stupidly large number in a function that should check if you are able to do so in the first place.
Since any power of 2 must also be even, then subtracting 1 must always
be odd, then its modulo (%) over 2 becomes indicator function for
is_odd() for n >= 0, while the abs(n) handles the cases where it's
negative.
Except when n=0 as we discussed above, you are right. The snippet will tell that any power of 2 is even, and any power of 2, minus 1, is odd. We were discussing another subject entirely thought.
Let's analyze the other function instead:
return (1 != (1+(1/(2^(n-1)))))
Remember that booleans in awk runs like this: 0=false and non zero equal true. So, if 1+x where x is a very small number, typically a large power of two (2^122 in the example page) is mathematically guaranteed to be !=1, in the digital world that's not the case. At one point, floating computation will reach a precision rock bottom, will be rounded down, and x=0 will be suddenly declared. At that point, the arbitrary precision function will return 0: false: 1 is equal 1.
A larger discussion on types and data representation
The page you link explains precision for gawk invoked with the -M option. This sounds like technoblahblah, let's decipher it.
At one point, your OS architecture has to decide how to store data, how to represent it in memory so that it can be accessed again and displayed. Terms like Integer, Float, Double, Unsigned Integer are examples of data representation. We here are addressing Integer representation: how is an integer stored in memory?
A 32-bit system will use 4 bytes to represent and integer, which in turn determines how larger the integer will be. The 32 bits are read from most significative (MSB) to less significative (LSB) and if signed, one bit will represent the sign (the MSB typically, drastically reducing the max size of the integer).
If asked to compute a large number, a machine will try to fit in in the max number available. If the end result is larger than that, you have overflow and end up with a wrong result or an error. Many online challenges typically ask you to write code for arbitrary long loops or large sums, then test it with inputs that will break the 64bit barrier, to see if you master proper types for indexes.
AWK is not a strongly typed language. Meaning, any variable can store data, regardless of the type. The data type can change and it is determined at runtime by the interpreter, so that the developer doesn't need to care. For instance:
$awk '{a="this is text"; print a; a=2; print a; print a+3.0*2}'
-| this is text
-| 2
-| 8
In the example, a is text, then is an integer and can be summed to a floating point number and printed as integer without any special type handling.
The Arbitrary Precision Page presents the following snippet:
$ gawk -M 'BEGIN {
> s = 2.0
> for (i = 1; i <= 7; i++)
> s = s * (s - 1) + 1
> print s
> }'
-| 113423713055421845118910464
There is some math voodoo behind, we will skip that. Since s is interpreted as a floating point number, the end result is computed as floating point.
Try to input that number on Windows calculator as decimal, and it will fail. Although you can compute it as a binary. You'll need the programmer setting and to add up to 53 bits to be able to fit it as unsigned integer.
53 is a magic number here: with the -M option, gawk uses arbitrary precision for numbers. In other words, it commandeers how many bits are necessary, track them and breaks free of the native OS architecture. The default option says that gawk will allocate 53 bits for any given arbitrary number. Fun fact, the actual result of that snippet is wrong, and it would take up to 100 bits to compute correctly.
To implement arbitrary large numbers handling, gawk relies on an external library called MPFR. Provided with an arbitrary large number, MPFR will handle the memory allocation and bit requisition to store it. However, the interface between gawk and MPFR is not perfect, and gawk can't always control the type that MPFR will use. In case of integers, that's not an issue. For floating point numbers, that will result in rounding errors.
This brings us back to the snippet at the beginning: if gawk was called with the -M option, numbers up to 2^53 can be stored as integers. Floating points will be smaller than that (you'll need to make the comma disappear somehow, or rather represent it spending some of the bits allocated for that number, just like the sign). Following the example of the page, and asking an arbitrary precision larger than 32, the snippet will return TRUE only if the -M option was passed, otherwise 1/2^(n-1) will be rounded down to be 0.

What is the difference in presentation between hexadecimal ASCII And hexadecimal number

I have two questions:
What is the difference in presentation between hexadecimal ASCII And hexadecimal number?
I mean that when we say
var db 31H
How we can find out if we want to say Character a or we want to say number 31H.
Why this application goes like this?
1- a db 4 dup(41h)
2- b dw 2 dup(4141h)
I thought that this two lines will be run in the same way but in the second line when I want to see the variables they will be 8 8bits and in each one is number 41h.
But it must something wrong because dw is 2 8 bits and we are saying make 2 of 2 of 8 bits and it must be 4 8 bits not 8 8 bits.
The answer to the first question is simple: in a computer's memory, there is no ASCII, no numbers, no images ... there is just bits. 31H represents the string of bits 00110001; nothing more, nothing less.
It's only when you do something with those bits (display them to a screen, use them in a mathematical operation, etc) that you interpret it as meaning 1 (which it would in ASCII), or a (in some other character encoding), or 49 (as a decimal number), or a particular shade of blue in your colour palette.

maximum 'string' length in fortran

does fortran have a maximum 'string' length?
i am going to be reading lines from a file which could have very long lines. the one i am looking at now has around 1.3k characters per line, but it is possible that they may have much more. i am reading each line from the file to a character*5000 variable, but if i get more in the future, is it bad to make it a character*5000000 variable? is there a max? is there a better way to solve this problem than making a very large character variables?
Since the usual Fortran IO is record based, reading lines into strings implies knowing the maximum string length. Another possible design: use stream IO and Fortran will ignore the record boundaries. Read the file in fixed-length chunks that are shorter than the longest lines. The complication is handling items split across chunk boundaries. The practicality depends on details not given in the question.
P.S. From "The Fortran 2003 Handbook" by Adams et al.: "The maximum length permitted for character strings is processor-dependent." -- meaning compiler dependent.
Maximum wil be implementation dependant. For your case, I can think of something along these lines:
character(:),allocatable :: ch
l = 5
do
allocate(character(l) :: ch)
read(unit,'(a)',iostat=io) ch
if (ch(l-4:l) = ' ' .or. io/=0) exit
deallocate(ch)
l = l * 2
end do
Obviously will not work for pad='no' and if you expect long regions with spacec in your records.

How input string is represented in magnetic tapes?

I know that in turing machines, the (different) tapes are used for both input and output and for stack too. In a problem of adding 2 numbers using turing machine, the input is dealing with many symbols like 1,0,B(blank),+.
(Tough this questions is related to physics, I asked here since I thought they mayn't know about turing machines and their inputs.)
And my doubt is ,
If the input is BBBBB1111+111111BB,
then in magnetic tape,
1->represented by North polarity(say).
0->represented by south polarity(say).
B->represented by No polarity.
Then,
How '+' will be represented?
I doesn't think that there will be some codes(like ASCII) for special symbols.
Since the number and type of special symbols will be implementation dependent. Also special codes will make the algorithm more tedious.
or
Is the input symbol representation in tapes is entirely different from the above mentioned method?If yes, please explain.
You would probably do this by having each character encoded with multiple bits. For example:
B: 00
0: 01
1: 10
+: 11
Your read head would then have size two and would always move two steps to the left or the right when making a move.
Symbol: Representation
0:1 ; 1:11 ; 2:111 ; n:n+1 ; Blank:B

Test cases for numeric input

What are some common (or worthwhile) tests, test questions, weaknesses, or misunderstandings dealing with numeric inputs?
This is a community wiki. Please add to it.
For example, here are a couple sample ideas:
I commonly see users enter text into number fields (eg, ">4" or "4 days", etc).
Fields left blank (null)
Very long numeric strings
Multiple decimals and commas (eg, "4..4" and "4,,434.4.4")
Boundary Value Analysis:
Lower Boundary
Lower Boundary - 1 (for decimal/float, use smaller amounts)
Upper Boundary
Upper Boundary + 1
Far below the lower boundary (eg, beyond the hardware boundary value)
Far above the upper boundary
middle of the range
0
0.0
White space, nothing else " "
String input & other incorrect data types.
Number with text in front or back, eg "$5.00", "4 lbs", "about 60", "50+"
Negative numbers
+ sign with positive numbers, "+4"
Both plus and minus sign, eg "+-4" and "-4e+30"
Exponents of 10, both uppercase and lowercase, positive and negative eg "4e10", "-5E-10", "+6e+60", etc
Too many "e" characters, eg "4e4e4" "4EE4"
Impossibly large/small exponents or inappropriate ones
Decimal values that cannot be represented in a computer
eg, .3 + .6 == 1.0? This bug affects most hardware, so outputs that compare decimal values should allow for a degree of error.
Integer/hardware overflow. Eg, for 32-bit integers, what happens when adding 4 billion to 4 billion?
wrong use of decimal sign and thousands seperator ("," vs. ".") (MikeD)
internationalisation i18n issues: In english aplications you write "12,345.67" meaning "12345.67" in german you write "12345,67" – (k3b)
leading 0's don't make number octal (common javascript bug)