Output formatting: too much whitespace in gfortran - formatting

Using gfortran 4.6. This code:
PROGRAM f1
IMPLICIT NONE
INTEGER :: i=1, j=3
WRITE(*,*) "integer i is ", i, ", and j is ", j, "."
END PROGRAM f1
produces this console output, which has way too much whitespace:
integer i is 1 , and j is 3 .
Is there some setting I can set so that there is no space before the first token ("integer"), and so the whitespace between tokens is just one space? I know one fix is
WRITE(*,'(A,I1,A,I1,A)') "integer i is ", i, ", and j is ", j, "."
but this seems very cumbersome to have to do every time I have a print statement - would rather it be somewhat more like C++ where you explicitly write any whitespace in the output.

List-directed IO, i.e, write (*, *) is meant as a convenience. There are no settings to change its behavior. Different compilers will produce different output. Instead you can, as you have identified, use formatted IO. In this case you can use I0 as the format, which will produce the required number of digits, while I1 will only output single-digit integers. Which is OK if those are the largest values that will be output.
WRITE(*, '( "integer i is ", I0, ", and j is ", I0, "." )' ) i, j

You may try some more universal format and re-use it
fmt = "(*(1x,g0))"
write(*,fmt) whatever1, whatever2

Related

Is it possible to optimize these fortran loops?

Here is my problem: I have a fortran code with a certain amount of nested loops and first I wanted to know if it's possible to optimize (rearranging) them in order to get a time gain? Second I wonder if I could use OpenMP to optimize them?
I have seen a lot of posts about nested do loops in fortran and how to optimize them but I didn't find one example that is suited to mine. I have also searched about OpenMP for nested do loops in fortran but I'm level 0 in OpenMP and it's difficult for me to know how to use it in my case.
Here are two very similar examples of loops that I have, first:
do p=1,N
do q=1,N
do ab=1,nVV
cd = 0
do c=nO+1,N
do d=c+1,N
cd = cd + 1
A(p,q,ab) = A(p,q,ab) + (B(p,q,c,d) - B(p,q,d,c))*C(cd,ab)
end do
end do
kl = 0
do k=1,nO
do l=k+1,nO
kl = kl + 1
A(p,q,ab) = A(p,q,ab) + (B(p,q,k,l) - B(p,q,l,k))*D(kl,ab)
end do
end do
end do
do ij=1,nOO
cd = 0
do c=nO+1,N
do d=c+1,N
cd = cd + 1
E(p,q,ij) = E(p,q,ij) + (B(p,q,c,d) - B(p,q,d,c))*F(cd,ij)
end do
end do
kl = 0
do k=1,nO
do l=k+1,nO
kl = kl + 1
E(p,q,ij) = E(p,q,ij) + (B(p,q,k,l) - B(p,q,l,k))*G(kl,ij)
end do
end do
end do
end do
end do
and the other one is:
do p=1,N
do q=1,N
do ab=1,nVV
cd = 0
do c=nO+1,N
do d=nO+1,N
cd = cd + 1
A(p,q,ab) = A(p,q,ab) + B(p,q,c,d)*C(cd,ab)
end do
end do
kl = 0
do k=1,nO
do l=1,nO
kl = kl + 1
A(p,q,ab) = A(p,q,ab) + B(p,q,k,l)*D(kl,ab)
end do
end do
end do
do ij=1,nOO
cd = 0
do c=nO+1,N
do d=nO+1,N
cd = cd + 1
E(p,q,ij) = E(p,q,ij) + B(p,q,c,d)*F(cd,ij)
end do
end do
kl = 0
do k=1,nO
do l=1,nO
kl = kl + 1
E(p,q,ij) = E(p,q,ij) + B(p,q,k,l)*G(kl,ij)
end do
end do
end do
end do
end do
The very small difference between the two examples is mainly in the indices of the loops. I don't know if you need more info about the different integers in the loops but you have in general: nO < nOO < N < nVV. So I don't know if it's possible to optimize these loops and/or possibly put them in a way that will facilitate the use of OpenMP (I don't know yet if I will use OpenMP, it will depend on how much I can gain by optimizing the loops without it).
I already tried to rearrange the loops in different ways without any success (no time gain) and I also tried a little bit of OpenMP but I don't know much about it, so again no success.
From the initial comments it may appear that at least in some cases you may be using more memory than the available RAM, which means you may be using the swap file, with all the bad consequences on the performances. To fix this, you have to either install more RAM if possible, or deeply reorganize your code to not store the full B array (by far the largest one) at once (again, if possible).
Now, let's assume that you have enough RAM. As I wrote in the comments, the access pattern on the B array is far from optimal, as the inner loops correspond to the last indeces of B, which can result in many cache misses (all the more given the the size of B). Changing the loop order if possible is a way to go.
Just looking at your first example, I am focusing on the computation of the array A (the computation of the array E looks completely independent of A, so it can be processed separately):
!! test it at first without OpenMP
!!$OMP PARALLEL DO PRIVATE(cd,c,d,kl,k,l)
do ab=1,nVV
cd = 0
do c=nO+1,N
do d=c+1,N
cd = cd + 1
A(:,:,ab) = A(:,:,ab) + (B(:,:,c,d) - B(:,:,d,c))*C(cd,ab)
end do
end do
kl = 0
do k=1,nO
do l=k+1,nO
kl = kl + 1
A(:,:,ab) = A(:,:,ab) + (B(:,:,k,l) - B(:,:,l,k))*D(kl,ab)
end do
end do
end do
!!$OMP END PARALLEL DO
What I did:
moved the loops on p and q from outer to inner positions (it's not always as easy than it is here)
replaced them with array syntax (no performance gain to expect, just a code easier to read)
Now the inner loops (abstracted by the array syntax) tackle contiguous elements in memory, which is much better for the performances. The code is even ready for OpenMP multithreading on the (now) outer loop.
EDIT/Hint
Fortran stores the arrays in "column-major order", that is when incrementing the first index one accesses contiguous elements in memory. In C the arrays are stored in "row-major order", that is when incrementing the last index one accesses contiguous elements in memory. So a general rule is to have the inner loops on the first indeces (and the opposite in C).
It would be helpful if you could describe the operations you'd like to perform using tensor notation and the Einstein summation rule. I have the feeling the code could be written much more succinctly using something like np.einsum in NumPy.
For the second block of loop nests (the ones where you iterate across a square sub-section of B as opposed to a triangle) you could try to introduce some sub-programs or primitives from which the full solution is built.
Working from the bottom up, you start with a simple sum of two matrices.
!
! a_ij := a_ij + beta * b_ij
!
pure subroutine apb(A,B,beta)
real(dp), intent(inout) :: A(:,:)
real(dp), intent(in) :: B(:,:)
real(dp), intent(in) :: beta
A = A + beta*B
end subroutine
(for first code block in the original post, you would substitute this primitive with one that only updates the upper/lower triangle of the matrix)
One step higher is a tensor contraction
!
! a_ij := a_ij + b_ijkl c_kl
!
pure subroutine reduce_b(A,B,C)
real(dp), intent(inout) :: A(:,:)
real(dp), intent(in) :: B(:,:,:,:)
real(dp), intent(in) :: C(:,:)
integer :: k, l
do l = 1, size(B,4)
do k = 1, size(B,3)
call apb( A, B(:,:,k,l), C(k,l) )
end do
end do
end subroutine
Note the dimensions of C must match the last two dimensions of B. (In the original loop nest above, the storage order of C is swapped (i.e. c_lk instead of c_kl.)
Working our way upward, we have the contractions with two different sub-blocks of B, moreover A, C, and D have an additional outer dimension:
!
! A_n := A_n + B1_cd C_cdn + B2_kl D_kln
!
! The elements of A_n are a_ijn
! The elements of B1_cd are B1_ijcd
! The elements of B2_kl are B2_ijkl
!
subroutine abcd(A,B1,C,B2,D)
real(dp), intent(inout), contiguous :: A(:,:,:)
real(dp), intent(in) :: B1(:,:,:,:)
real(dp), intent(in) :: B2(:,:,:,:)
real(dp), intent(in), contiguous, target :: C(:,:), D(:,:)
real(dp), pointer :: p_C(:,:,:) => null()
real(dp), pointer :: p_D(:,:,:) => null()
integer :: k
integer :: nc, nd
nc = size(B1,3)*size(B1,4)
nd = size(B2,3)*size(B2,4)
if (nc /= size(C,1)) then
error stop "FATAL ERROR: Dimension mismatch between B1 and C"
end if
if (nd /= size(D,1)) then
error stop "FATAL ERROR: Dimension mismatch between B2 and D"
end if
! Pointer remapping of arrays C and D to rank-3
p_C(1:size(B1,3),1:size(B1,4),1:size(C,2)) => C
p_D(1:size(B2,3),1:size(B2,4),1:size(D,2)) => D
!$omp parallel do default(private) shared(A,B1,p_C,B2,p_D)
do k = 1, size(A,3)
call reduce_b( A(:,:,k), B1, p_C(:,:,k))
call reduce_b( A(:,:,k), B2, p_D(:,:,k))
end do
!$omp end parallel do
end subroutine
Finally, we reach the main level where we select the subblocks of B
program doit
use transform, only: abcd, dp
implicit none
! n0 [2,10]
!
integer, parameter :: n0 = 6
integer, parameter :: n00 = n0*n0
integer, parameter :: N, nVV
real(dp), allocatable :: A(:,:,:), B(:,:,:,:), C(:,:), D(:,:)
! N [100,200]
!
read(*,*) N
nVV = (N - n0)**2
allocate(A(N,N,nVV))
allocate(B(N,N,N,N))
allocate(C(nVV,nVV))
allocate(D(n00,nVV))
print *, "Memory occupied (MB): ", &
real(sizeof(A) + sizeof(B) + sizeof(C) + sizeof(D),dp) / 1024._dp**2
A = 0
call random_number(B)
call random_number(C)
call random_number(D)
call abcd(A=A, &
B1=B(:,:,n0+1:N,n0+1:N), &
B2=B(:,:,1:n0,1:n0), &
C=C, &
D=D)
deallocate(A,B,C,D)
end program
Similar to the answer by PierU, parallelization is on the outermost loop. On my PC, for N = 50, this re-engineered routine is about 8 times faster when executed serially. With OpenMP on 4 threads the factor is 20. For N = 100 and I got tired of waiting for the original code; the re-engineered version on 4 threads took about 3 minutes.
The full code I used for testing, configurable via environment variables (ORIG=<0|1> N=100 ./abcd), is available here: https://gist.github.com/ivan-pi/385b3ae241e517381eb5cf84f119423d
With more fine-tuning it should be possible to bring the numbers down even further. Even better performance could be sought with a specialized library like cuTENSOR (also used under the hood of Fortran intrinsics as explained in Bringing Tensor Cores to Standard Fortran or a tool like the Tensor Contraction Engine.
One last thing I found odd was that large parts of B are un-unused. The sub sections B(:,:,1:n0,n0+1:N) and B(:,:,n0+1:N,1:n0) appear to be wasted space.

How do I read numbers from a file in Fortran and immediately do calculations with each numbers?

I'm totally new to Fortran, and I'm trying to learn the language here:
http://www.fortrantutorial.com/files-precision/index.php. I have some basic experience with C and Python, but not much, like introduction class and such.
So in exercises 4.1, they ask me to input some numbers from a file, and check if these numbers are even or odd. Here is the code to input the numbers:
program readdata
implicit none
!reads data from a file called mydata.txt
real :: x,y,z
open(10,file='mydata.txt')
read(10,*) x,y,z
print *,x,y,z
end program readdata
The file mydata.txt contains some random numbers. And they can check if the number is even or odd by:
if (mod(num,2)>0) then……
My question is that: if this file have like 10, or 1000 numbers, do I have to manually assign every single one of them? Is there any other way for me to do quick calculation with mass numbers situation like that?
Every read also moves the read pointer forwards. So with every new read, a new line is read in from the file.
The easiest thing to do is to keep reading until the READ statement returns an error. Of course, you have to pass a variable for the READ to write its error into. Something like this:
program readdata
implicit none
real :: x, y, z
integer :: iounit, ios
open(newunit=iounit, file='mydata.txt', iostat=ios, action='READ')
if (ios /= 0) STOP 1
do
read(iounit, *, iostat=ios) x, y, z
if (ios /= 0) exit
print *, x, y, z
end do
close(iounit)
end program readdata
Update: If you're limited to Fortran 95, as OP suggested in his comments, here's what to change: Instead of
integer :: iounit, ios
open(newunit=iounit, ...)
you use
integer :: ios
integer, paramter :: iounit = 100
open(unit=iounit, ...)
All that's important is that iounit is a number, greater than 10, which is not used as a unit for any other read/write operation.
well a mass of numbers means that X, Y, and Z need to be more than a single number of each... so something like this should get you vectored towards a solution:
program readdata
implicit none
!reads data from a file called mydata.txt
INTEGER, PARAMETER :: max2process = 10000
INTEGER :: I
INTEGER :: K = 0
real , DIMENSION(max2process) :: x,y,z
LOGICAL, DIMENSION(max2process) :: ODD = .FALSE.
open(10,file='mydata.txt')
10 CONTINUE
X(:) = 0
Y(:) = 0
K(:) = 0
DO I = 1, max2process
K = K + I
read(10,*, EOF=99) x(I), y(I), z(I)
print *,x(I), y(I), z(I)
ENDDO
WRITE(*,22) I, MINVAL(X(1:I)), MAXVAL(X(1:I)
22 FORMAT(' MIN(X(1:',I5,')=',0PE12.5,' Max=',0PE12.5)
odd = .FALSE.
WHERE (MODULO(X, 2) /= 0)
ODD = .TRUE.
ENDWHERE
DO I = 1, 10
WRITE(*,88) I, X(I), odd(I)
88 FORMAT('x(', I2,')=',0PE12.5,' odd="',L1,'"')
EBDDO
WRITE(*,*) ' we did not hit the end of file... Go back and read some more'
GOTO 10
99 WRITE(*,*) 'END of file reached at ', (K-1)
end program readdata
Basically something like that, but I probably have a typo (LTR). If MODULO is ELEMENTAL then you can do it easily... And MODULO is ELEMENTAL, so you are in luck (which is common).
There is a https://rosettacode.org/wiki.Even_or_Odd#Fortran that looks nice. the functions could be further enhanced using ELEMENTAL FUNCTION.
that should give you enough to work it out.
or you could read the file to find the size, close and reopen it, and then allocate X, Y, Z and Odd... the example i gave was more of a streaming example. one should generally use a DO WHILE rather than a GOTO, but starting out a GOTO can be conceptually easier. (You need an EXIT or some DO WHILE (IOSTAT /= ) in order to break out)

Not sure what this pseudo-code is saying

I saw this pseudo-code on another stackoverflow question found here Split a string to a string of valid words using Dynamic Programming.
The problem is a dynamic programming question to see if an input string can be split into words from a dictionary.
The third line, means to set an array b of size [N+1] to all false values? I'm pretty sure about that. But what I am really not sure about is the fifth line. Is that a for-loop or what? I feel like pseudo-code saying 'for i in range' would only have 2 values. What is that line saying?
def try_to_split(doc):
N = len(doc)
b = [False] * (N + 1)
b[N] = True
for i in range(N - 1, -1, -1):
for word starting at position i:
if b[i + len(word)]:
b[i] = True
break
return b
It's confusing syntax, and I'm pretty sure there's a mistake. It should be:
for i in range(N - 1, 0, -1) //0, not -1
which I believe means
for i from (N - 1) downto 0 //-1 was the step, like i-- or i -= 1
This makes sense with the algorithm, as it simply starts at the end of the string, and solves each trailing substring until it gets to the beginning. If b[0] is true at the end, then the input string can be split into words from the dictionary. for word starting at position i just checks all words in the dictionary to see if they start at that position.
If one wants to be able to reconstruct a solution, they can change b to an int array, initialize to 0s, and change the if to this:
if b[i + len(word)] != 0
b[i] = i + len(word) //len(word) works too
break

Reading a known number of variable from a file when one of the variables are missing in input file

I already checked similar posting. The solution is given by M. S. B. here Reading data file in Fortran with known number of lines but unknown number of entries in each line
So, the problem I am having is that from text file I am trying to read inputs. In one line there is supposed to be 3 variables. But sometimes the input file may have 2 variables. In that case I need to make the last variable zero. I tried using READ statement with IOSTAT but if there is only two values it goes to the next line and reads the next available value. I need to make it stop in the 1st line after reading 2 values when there is no 3rd value.
I found one way to do that is to have a comment/other than the type I am trying to read (in this case I am reading float while a comment is a char) which makes a IOSTAT>0 and I can use that as a check. But if in some cases I may not have that comment. I want to make sure it works even than.
Part of the code
read(15,*) x
read(15,*,IOSTAT=ioerr) y,z,w
if (ioerr.gt.0) then
write(*,*)'No value was found'
w=0.0;
goto 409
elseif (ioerr.eq.0) then
write(*,*)'Value found', w
endif
409 read(15,*) a,b
read(15,*) c,d
INPUT FILE is of the form
-1.000 abcd
12.460 28.000 8.00 efg
5.000 5.000 hijk
20.000 21.000 lmno
I need to make it work even when there is no "8.00 efg"
for this case
-1.000 abcd
12.460 28.000
5.000 5.000 hijk
20.000 21.000 lmno
I can not use the string method suggested by MSB. Is there any other way?
I seem to remember trying to do something similar in the past. If you know that the size of a line of the file won't exceed a certain number, you might be able to try something like:
...
character*(128) A
read(15,'(A128)') A !This now holds 1 line of text, padded on the right with spaces
read(A,*,IOSTAT=ioerror) x,y,z
if(IOSTAT.gt.0)then
!handle error here
endif
I'm not completely sure how portable this solution is from one compiler to the next and I don't have time right now to read up on it in the f77 standard...
I have a routine that counts the number of reals on a line. You could adapt this to your purpose fairly easily I think.
subroutine line_num_columns(iu,N,count)
implicit none
integer(4),intent(in)::iu,N
character(len=N)::line
real(8),allocatable::r(:)
integer(4)::r_size,count,i,j
count=0 !Set to zero in case of premature return
r_size=N/5 !Initially try out this max number of reals
allocate(r(r_size))
read(iu,'(a)') line
50 continue
do i=1,r_size
read(line,*,end=99) (r(j),j=1,i) !Try reading i reals
count=i
!write(*,*) count
enddo
r_size=r_size*2 !Need more reals
deallocate(r)
allocate(r(r_size))
goto 50
return
99 continue
write(*,*) 'I conclude that there are ',count,' reals on the first line'
end subroutine line_num_columns
If a Fortran 90 solution is fine, you can use the following procedure to parse a line with multiple real values:
subroutine readnext_r1(string, pos, value)
implicit none
character(len=*), intent(in) :: string
integer, intent(inout) :: pos
real, intent(out) :: value
integer :: i1, i2
i2 = len_trim(string)
! initial values:
if (pos > i2) then
pos = 0
value = 0.0
return
end if
! skip blanks:
i1 = pos
do
if (string(i1:i1) /= ' ') exit
i1 = i1 + 1
end do
! read real value and set pos:
read(string(i1:i2), *) value
pos = scan(string(i1:i2), ' ')
if (pos == 0) then
pos = i2 + 1
else
pos = pos + i1 - 1
end if
end subroutine readnext_r1
The subroutine reads the next real number from a string 'string' starting at character number 'pos' and returns the value in 'value'. If the end of the string has been reached, 'pos' is set to zero (and a value of 0.0 is returned), otherwise 'pos' is incremented to the character position behind the real number that was read.
So, for your case you would first read the line to a character string:
character(len=1024) :: line
...
read(15,'(A)') line
...
and then parse this string
real :: y, z, w
integer :: pos
...
pos = 1
call readnext_r1(line, pos, y)
call readnext_r1(line, pos, z)
call readnext_r1(line, pos, w)
if (pos == 0) w = 0.0
where the final 'if' is not even necessary (but this way it is more transparent imho).
Note, that this technique will fail if there is a third entry on the line that is not a real number.
I know the following simple solution:
w = 0.0
read(15,*,err=600)y, z, w
goto 610
600 read(15,*)y, z
610 do other stuff
But it contains "goto" operators
You might be able to use the wonderfully named colon edit descriptor. This allows you to skip the rest of a format if there are no further items in the I/O list:
Program test
Implicit None
Real :: a, b, c
Character( Len = 10 ) :: comment
Do
c = 0.0
comment = 'No comment'
Read( *, '( 2( f7.3, 1x ), :, f7.3, a )' ) a, b, c, comment
Write( *, * ) 'I read ', a, b, c, comment
End Do
End Program test
For instance with gfortran I get:
Wot now? gfortran -W -Wall -pedantic -std=f95 col.f90
Wot now? ./a.out
12.460 28.000 8.00 efg
I read 12.460000 28.000000 8.0000000 efg
12.460 28.000
I read 12.460000 28.000000 0.00000000E+00
^C
This works with gfortran, g95, the NAG compiler, Intel's compiler and the Sun/Oracle compiler. However I should say I'm not totally convinced I understand this - if c or comment are NOT read are they guaranteed to be 0 and all spaces respectively? Not sure, need to ask elsewhere.

Translating Morse code with no spaces

I have some Morse code that has lost the spaces in between the letters, my challenge is to find out what the message says. So far I have been kinda lost because of the sheer amount of combinations there might be.
Here is all the info on the messages I have.
The output will be English
There will always be a translation that make sense
Here is and example message -..-...-...-...-..-.-.-.-.-..-.-.-.-.-.-.-.-.-.-..-...-.
The messages should be no longer then 70 characters
The morse code was taken from a longer stream so it is possible that the first or last groups may be cut off and hence have no valid translation
Does anyone have a clever solution?
This is not an easy problem, because as ruakh suggested there are many viable sentences to a given message. For example 'JACK AND JILL WENT UP THE HILL' has the same encoding as 'JACK AND JILL WALK CHISELED'. Since these are both grammatical sentences and the words in each are common, it's not obvious how to pick one or the other (or any other of the 40141055989476564163599 different sequences of English words that have the same encoding as this message) without delving into natural language processing.
Anyway, here's a dynamic programming solution to the problem of finding the shortest sentence (with the fewest characters if there's a tie). It can also count the total number of sentences that have the same encoding as the given message. It needs a dictionary of English words in a file.
The next enhancements should be a better measure of how likely a sentence is: perhaps word frequencies, false-positive rates in morse (eg, "I" is a common word, but it appears often as part of other sequences of morse code sequences). The tricky part will be formulating a good score function that can be expressed in a way that it can be computed using dynamic programming.
MORSE = dict(zip('ABCDEFGHIJKLMNOPQRSTUVWXYZ', [
'.-', '-...', '-.-.', '-..', '.', '..-.', '--.', '....',
'..', '.---', '-.-', '.-..', '--', '-.', '---', '.--.',
'--.-', '.-.', '...', '-', '..-', '...-', '.--', '-..-',
'-.--', '--..'
]))
# Read a file containing A-Z only English words, one per line.
WORDS = set(word.strip().upper() for word in open('dict.en').readlines())
# A set of all possible prefixes of English words.
PREFIXES = set(word[:j+1] for word in WORDS for j in xrange(len(word)))
def translate(msg, c_sep=' ', w_sep=' / '):
"""Turn a message (all-caps space-separated words) into morse code."""
return w_sep.join(c_sep.join(MORSE[c] for c in word)
for word in msg.split(' '))
def encode(msg):
"""Turn a message into timing-less morse code."""
return translate(msg, '', '')
def c_trans(morse):
"""Construct a map of char transitions.
The return value is a dict, mapping indexes into the morse code stream
to a dict of possible characters at that location to where they would go
in the stream. Transitions that lead to dead-ends are omitted.
"""
result = [{} for i in xrange(len(morse))]
for i_ in xrange(len(morse)):
i = len(morse) - i_ - 1
for c, m in MORSE.iteritems():
if i + len(m) < len(morse) and not result[i + len(m)]:
continue
if morse[i:i+len(m)] != m: continue
result[i][c] = i + len(m)
return result
def find_words(ctr, i, prefix=''):
"""Find all legal words starting from position i.
We generate all possible words starting from position i in the
morse code stream, assuming we already have the given prefix.
ctr is a char transition dict, as produced by c_trans.
"""
if prefix in WORDS:
yield prefix, i
if i == len(ctr): return
for c, j in ctr[i].iteritems():
if prefix + c in PREFIXES:
for w, j2 in find_words(ctr, j, prefix + c):
yield w, j2
def w_trans(ctr):
"""Like c_trans, but produce a word transition map."""
result = [{} for i in xrange(len(ctr))]
for i_ in xrange(len(ctr)):
i = len(ctr) - i_ - 1
for w, j in find_words(ctr, i):
if j < len(result) and not result[j]:
continue
result[i][w] = j
return result
def shortest_sentence(wt):
"""Given a word transition map, find the shortest possible sentence.
We find the sentence that uses the entire morse code stream, and has
the fewest number of words. If there are multiple sentences that
satisfy this, we return the one that uses the smallest number of
characters.
"""
result = [-1 for _ in xrange(len(wt))] + [0]
words = [None] * len(wt)
for i_ in xrange(len(wt)):
i = len(wt) - i_ - 1
for w, j in wt[i].iteritems():
if result[j] == -1: continue
if result[i] == -1 or result[j] + 1 + len(w) / 30.0 < result[i]:
result[i] = result[j] + 1 + len(w) / 30.0
words[i] = w
i = 0
result = []
while i < len(wt):
result.append(words[i])
i = wt[i][words[i]]
return result
def sentence_count(wt):
result = [0] * len(wt) + [1]
for i_ in xrange(len(wt)):
i = len(wt) - i_ - 1
for j in wt[i].itervalues():
result[i] += result[j]
return result[0]
msg = 'JACK AND JILL WENT UP THE HILL'
print sentence_count(w_trans(c_trans(encode(msg))))
print shortest_sentence(w_trans(c_trans(encode(msg))))
I don't know if this is "clever", but I would try a breadth-first search (as opposed to the depth-first search implicit in BRPocock's regex idea). Suppose your string looks like this:
.---.--.-.-.-.--.-...---...-...-..
J A C K A N D J I L L
You start out in state ('', 0) ('' being what you've decoded so far; 0 being your position in the Morse-code string). Starting from position zero, possible initial characters are . E, .- A, .-- W, .--- J, and .---- 1. So, push states ('E', 1), ('A', 2), ('W', 3), ('J', 4), and ('1', 5) onto your queue. After dequeuing state ('E', 1), you would enqueue states ('ET', 2), ('EM', 3), and ('EO', 4).
Now, your queue of possible states will grow quite quickly — both of { ., - } are letters, as are all of { .., .-, -., -- } and all of { ..., ..-, .-., .--, -.., -.-, --., --- }, so in each pass your number of states will increase by a factor of at least three — so you need to have some mechanism for user feedback. In particular, you need some way to ask your user "Is it plausible that this string starts with EOS3AIOSF?", and if the user says "no", you will need to discard state ("EOS3AIOSF", 26) from your queue. The ideal would be to present the user with a GUI that, every so often, shows all current states and lets him/her select which ones are worth proceeding with. ("The user" will also be you, of course. English has a shortage of pronouns: if "you" refers to the program, then what pronoun refers to the user-programmer?!)
Maintain 3 things: a list of words so far S, the current word so far W, and the current symbol C.
S should be only good words, eg. 'THE QUICK'
W should be a valid prefix of a word, eg. ['BRO']
C should be a valid prefix of some letter, eg. '.-'
Now, given a new symbol, let's say '-', we extend C with it (in this case we get '.--').
If C is a complete letter (in this case it is, the letter 'W'), we have a choice to add it to W, or to continue extending the letter further by adding more symbols.
If we extend W, we have a choice to add it to S (if it's a valid word), or to continue extending it further.
This is a search, but most paths terminate quickly (as soon as you have W not being a valid prefix of any word you can stop, and as soon as C isn't a prefix of any letter you can stop).
To get it more efficient, you could use dynamic programming to avoid redundant work and use tries to efficiently test prefixes.
What might the code look like? Omitting the functions 'is_word' which tests if a string is an English word, and 'is_word_prefix' which tests if a string is the start of any valid word, something like this:
morse = {
'.-': 'A',
'-...': 'B',
etc.
}
def is_morse_prefix(C):
return any(k.startswith(C) for k in morse)
def break_words(input, S, W, C):
while True:
if not input:
if W == C == '':
yield S
return
i, input = input[0], input[1:]
C += i
if not is_morse_prefix(C):
return
ch = morse.get(C, None)
if ch is None or not is_word_prefix(W + ch):
continue
for result in break_words(input, S, W + ch, ''):
yield result
if is_word(W + ch):
for result in break_words(input, S + ' ' + W + ch, '', ''):
yield result
for S in break_words('....--', [], '', ''):
print S