Can Raku's range operator on strings mimic Perl's behaviour?

In Perl, the expression "aa" .. "bb" creates a list with the strings:
aa ab ac ad ae af ag ah ai aj ak al am an ao ap aq ar as at au av aw ax ay az ba bb
In Raku, however, (at least with Rakudo v2021.08), the same expression creates:
aa ab ba bb
Even worse, while "12" .. "23" in Perl creates a list of strings with the numbers 12, 13, 14, 15, ..., 23, in Raku the same expression creates the list ("12", "13", "22", "23").
The docs seem to be quite silent about this behaviour; at least, I could not find an explanation there. Is there any way to get Perl's behaviour for Raku ranges?
(I know that the second problem can be solved via typecast to Int. This does not apply to the first problem, though.)

It's possible to get the Perl behavior by using a sequence with a custom generator:
say 'aa', *.succ … 'bb';
# OUTPUT: «aa ab ac ad ae af ag ah ai aj ak al am an ao ap aq ar as at au av aw ax ay az ba bb»
say '12', *.succ … '23';
# OUTPUT: «12 13 14 15 16 17 18 19 20 21 22 23»
(Oh, and a half solution for the '12'..'23' case: you already noted that you can cast the endpoints to a Numeric type to get the output you want. But you don't actually need to cast both endpoints – just the bottom. So 12..'23' still produces the full output. As a corollary, because ^'23' is sugar for 0..^'23', any Range built with &prefix:<^> will be numeric.)
For the "why" behind this behavior, please refer to my other answer to this question.

TL;DR Add one or more extra characters to the endpoint string. It doesn't matter what the character(s) is/are.
10 years after the current doc corpus was kick-started by Moritz Lenz++, Raku's doc is, as ever, a work in progress.
There's a goldmine of more than 16 years' worth of chat logs that I sometimes spelunk, looking for answers. A search for range "as words" with nick: TimToady netted me this in a few minutes:
TimToady beginning and ending of the same length now do the specced semantics
considering each position as a separate character range
My instant reaction:
Here's why it does what it does. The guy who designed how Perl's range works not only deliberately specced it to work how it now does in Raku but implemented it in Rakudo himself in 2015.
It does that iff "beginning and ending of the same length". Hmm. 💡
A few seconds later:
say flat "aa" .. "bb (like perl)";
say flat "12" .. "23 (like perl)";
displays:
(aa ab ac ad ae af ag ah ai aj ak al am an ao ap aq ar as at au av aw ax ay az ba bb)
(12 13 14 15 16 17 18 19 20 21 22 23)
😊

[I'm splitting this into a separate answer because it addresses the "why" instead of the "how"]
I did a bit of digging, and learned that:
For Sequences, having "aa"…"bb" produce "aa", "ab", "ba", "bb" is specified in Roast
The original use case provided for this behavior was generating sequences of octal numbers (as Strs) (discussed again in 2018)
For Ranges, the behavior of "aa".."bb" is currently unspecified and there does not appear to be consensus about what it should be.
(As you already know), Rakudo's implementation has "aa".."bb" behave the same as "aa"…"bb".
In 2018, lizmat ([Elizabeth Mattijsen](https://stackoverflow.com/users/7424470/elizabeth-mattijsen) on StackOverflow) changed .. to make "aa".."bb" behave the way it does in Perl but reverted that change pending consensus on the correct behavior.
So I suppose we (as a community) are still thinking about it? Personally, I'm inclined to agree with lizmat that having "aa".."bb" provide the longer range (like Perl) makes sense: if users want the shorter one, they can use a sequence. (Or, for an octal range, something like (0..0o377).map: *.fmt('%03o'))
But, either way, I definitely agree with that 2018 commit that we should pin this down in Roast – and then get it noted in the docs.

Related

Fill and sign for PDF file not working using Acrobat Reader DC

I'm asking this here because, given the searches I've done, Adobe's support appears to be next to non-existent. I have, according to this online validation tool:
https://www.pdf-online.com/osa/validate.aspx
a perfectly valid PDF, which is generated from code. However, when using Acrobat Reader DC I am unable to use Fill and Sign: when attempting to sign, it throws this error:
The operation failed because Adobe Acrobat encountered an unknown error
This is the offending PDF:
https://github.com/DelphiWorlds/MiscStuff/blob/master/Test/PDF/SigningNoWork.pdf
This is one which is very similar, where Fill and Sign works:
https://github.com/DelphiWorlds/MiscStuff/blob/master/Test/PDF/SigningWorks.pdf
Foxit Reader has no issue with either of them - Fill and Sign works without fail.
I would post the source of the files; however, because they contain binary data, I figure links to them are better.
The question is: why does the first one fail to work, but not the second?
In your non-working file all the fonts are defined with
/FirstChar 30
/LastChar 255
i.e. having 226 glyphs. Their respective Widths arrays only have 224 entries, though, so they are incomplete.
After adding two entries to each Widths array, Adobe Reader here does not run into that unknown error anymore during Fill And Sign.
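The consistency rule being applied above can be sketched as a small check (a hypothetical helper for illustration; the function name and the sample width values are mine, not from the actual file):

```python
def widths_complete(first_char: int, last_char: int, widths: list) -> bool:
    """A font's /Widths array must have exactly LastChar - FirstChar + 1
    entries: one advance width per glyph in the declared range."""
    return len(widths) == last_char - first_char + 1

# /FirstChar 30 /LastChar 255 declares 226 glyphs; the broken file
# supplied only 224 width entries.
print(widths_complete(30, 255, [500] * 224))  # False
print(widths_complete(30, 255, [500] * 226))  # True
```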
As the OP inquired how exactly I changed those widths arrays:
I wanted the change to have as few side effects as possible, so I was glad to see that there was some empty space in the font dictionaries in question; trivial hex editing therefore sufficed, with no need to shift indirect objects or update cross references:
In each of those font definitions in the objects 5, 7, 9, and 11 the Widths array is the last dictionary entry value and ends with some white space, after the last width we have these bytes:
20 0D 0A 5D 0D 0A 3E 3E --- space CR NL ']' CR NL '>' '>'
I added two 0 values using the white space:
20 30 20 30 20 5D 3E 3E --- space '0' space '0' space ']' '>' '>'
Acrobat Reader DC (the free version) no longer allows you to use Fill and Sign if your document has metadata attached to it.
You need to purchase the Pro DC version, which is about $14.99, in order to continue using Fill and Sign.
I just got done with a 4-month support exchange of emails with Adobe, and that was their final answer.

Extra "hidden" characters messing with equals test in SQL

I am doing a database (Oracle) migration validation and I am writing scripts to make sure the target matches the source. My script is returning values that, when you look at them, look equal. However, they are not.
For instance, the target has PREAPPLICANT and the source has PREAPPLICANT. When you look at them in text, they look fine. But when I converted them to hex, it shows 50 52 45 41 50 50 4c 49 43 41 4e 54 for the target and 50 52 45 96 41 50 50 4c 49 43 41 4e 54 for the source. So there is an extra 96 in the hex.
So, my questions are:
What is the 96 char?
Would you say that the target has incorrect data because it did not bring the char over? I realize this question may be a little subjective, but I'm asking it from the standpoint of "what is this character and how did it get here?"
Is there a way to ignore this character in the SQL script so that the equality check passes? (do I want the equality to pass or fail here?)
It looks like you have Windows-1252 character set here.
https://en.wikipedia.org/wiki/Windows-1252
Character 96 is an en dash. This makes sense: the source data was evidently PRE–APPLICANT, with an en dash.
One user provided "PREAPPLICANT" and another provided "PRE-APPLICANT" and Windows helpfully converted their proper dash into an en dash.
As such, this doesn't appear to be an error in data, more an error in character sets. You should be able to filter these out without too much effort but then you are changing data. It's kind of like when one person enters "Mr Jones" and another enters "Mr. Jones"--you have to decide how much data massaging you want to do.
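The diagnosis can be confirmed by decoding the question's byte sequence in Python (a small sketch; `cp1252` is Python's name for the Windows-1252 codec):

```python
# Bytes of the source value as shown in the question's hex dump.
source_bytes = bytes([0x50, 0x52, 0x45, 0x96, 0x41, 0x50, 0x50,
                      0x4C, 0x49, 0x43, 0x41, 0x4E, 0x54])
decoded = source_bytes.decode('cp1252')
print(decoded)                     # PRE–APPLICANT
print(f'U+{ord(decoded[3]):04X}')  # U+2013, EN DASH
```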
As you probably already have done, use the DUMP function to get the byte representation of the data if you wish to inspect it for weirdness.
Here's some text with plain ASCII:
select dump('Dashes-and "smart quotes"') from dual;
Typ=96 Len=25: 68,97,115,104,101,115,45,97,110,100,32,34,115,109,97,114,116,32,113,117,111,116,101,115,34
Now introduce funny characters:
select dump('Dashes—and “smart quotes”') from dual;
Typ=96 Len=31: 68,97,115,104,101,115,226,128,148,97,110,100,32,226,128,156,115,109,97,114,116,32,113,117,111,116,101,115,226,128,157
In this case, the number of bytes increased because my DB is using UTF8. Numbers outside of the valid range for ASCII stand out and can be inspected further.
Here's another way to see the special characters:
select asciistr('Dashes—and “smart quotes”') from dual;
Dashes\2014and \201Csmart quotes\201D
This one converts non-ASCII characters into backslashed Unicode hex.
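For comparison outside the database, here is a rough Python analogue of that last step (a sketch only; Oracle's ASCIISTR actually escapes per UTF-16 code unit, so characters outside the Basic Multilingual Plane would come out differently):

```python
def asciistr(s: str) -> str:
    """Keep ASCII characters; replace anything else with a
    backslashed 4-digit uppercase Unicode hex escape."""
    return ''.join(c if ord(c) < 128 else '\\%04X' % ord(c) for c in s)

print(asciistr('Dashes\u2014and \u201Csmart quotes\u201D'))
# Dashes\2014and \201Csmart quotes\201D
```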

Validate UK postcode using regular expression in oracle

Below is the list of valid postcodes:
A1 1AA
A11 1AA
AA1 1AA
AA11 1AA
A1A 1AA
BFPO 1
BFPO 11
BFPO 111
I tried (([A-Z]{1,2}[0-9]{1,2})\ ([0-9][A-Z]{2}))|(GIR\ 0AA)$ but it is not working. Could you please help me with a proper pattern to validate all the postcode formats?
First, rather than guessing based on the set of data at hand, let's look at what UK postcodes are.
EC1V 9HQ
The first one or two letters is the postcode area and it identifies the main Royal Mail sorting office which will process the mail. In this case EC would go to the Mount Pleasant sorting office in London.
The second part is usually just one or two numbers but for some parts of London it can be a number and a letter. This is the postcode district and tells the sorting office which delivery office the mail should go to.
This third part is the sector and is usually just one number. This tells the delivery office which local area or neighbourhood the mail should go to.
The final part of the postcode is the unit code which is always two letters. This identifies a group of up to 80 addresses and tells the delivery office which postal route (or walk) will deliver the item.
Digesting that...
1 or 2 letters.
A number and maybe an alphanumeric.
A space.
"Usually" a number, but I can't find any instances otherwise.
2 letters.
\A[[:alpha:]]{1,2}\d[[:alnum:]]? \d[[:alpha:]]{2}\z
We can't use \w because that contains an underscore.
I used the more exact \A and \z over ^ and $ because \A and \z match the exact beginning and end of the string, whereas ^ and $ match the beginning and end of a line. $ in particular is tolerant of a trailing newline.
Of course, there are special cases: XXXX 1ZZ for various overseas territories, where XXXX is enumerated.
\A(ASCN|STHL|TDCU|BBND|BIQQ|FIQQ|PCRN|SIQQ|TKCA) 1ZZ\z
Then a couple of really special cases.
GIR 0AA for Girobank.
AI-2640 for Anguilla.
\A(AI-2640|GIR 0AA)\z
Put them all together into one big (...|...|...) mess. It's good to build the query in three pieces and put it together with the x modifier to ignore whitespace.
REGEXP_LIKE(
postcode,
'\A
(
[[:alpha:]]{1,2}\d[[:alnum:]]?\ \d[[:alpha:]]{2}\z |
(ASCN|STHL|TDCU|BBND|BIQQ|FIQQ|PCRN|SIQQ|TKCA)\ 1ZZ |
(AI-2640|GIR\ 0AA)
)
\z',
'x'
)
Or you can make the basic regex less strict and accept 2-4 alphanumerics for the first part. Then there's only the special case for Anguilla to worry about.
\A([[:alnum:]]{2,4} \d[[:alpha:]]{2}|AI-2640)\z
On the downside, this will let in post codes that don't exist. On the up side, you don't have to keep tweaking for additional special cases. That's probably fine for this level of filtering.
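For anyone who wants to test the combined pattern outside Oracle, here is a translation to Python's `re` (an illustrative sketch: `\z` becomes `\Z`, POSIX classes become explicit ranges, and `re.X` plays the role of Oracle's 'x' modifier):

```python
import re

# The three-piece pattern from the answer, in Python regex syntax.
POSTCODE = re.compile(r'''\A
    (
        [A-Za-z]{1,2}\d[A-Za-z0-9]?\ \d[A-Za-z]{2} |
        (ASCN|STHL|TDCU|BBND|BIQQ|FIQQ|PCRN|SIQQ|TKCA)\ 1ZZ |
        (AI-2640|GIR\ 0AA)
    )
    \Z''', re.X)

for code in ['EC1V 9HQ', 'A1 1AA', 'AA11 1AA', 'GIR 0AA', 'AI-2640', 'BFPO 1']:
    print(code, bool(POSTCODE.fullmatch(code)))
```

Note that the BFPO codes from the question's sample list are not covered by this pattern; they would need a further alternative.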

How input string is represented in magnetic tapes?

I know that in Turing machines, the (different) tapes are used for input, output, and for the stack too. In the problem of adding 2 numbers using a Turing machine, the input involves several symbols like 1, 0, B (blank), and +.
(Though this question is related to physics, I asked it here since I thought physicists may not know about Turing machines and their inputs.)
My question is:
If the input is BBBBB1111+111111BB,
then in magnetic tape,
1->represented by North polarity(say).
0->represented by south polarity(say).
B->represented by No polarity.
Then,
how will '+' be represented?
I don't think there would be codes (like ASCII) for special symbols, since the number and type of special symbols are implementation-dependent. Also, special codes would make the algorithm more tedious.
Or is the input symbol representation on tapes entirely different from the method mentioned above? If yes, please explain.
You would probably do this by having each character encoded with multiple bits. For example:
B: 00
0: 01
1: 10
+: 11
Your read head would then have size two and would always move two steps to the left or the right when making a move.
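The fixed-width scheme above can be sketched in a few lines (the symbol-to-bits mapping is the one proposed in the answer; the function names are illustrative):

```python
# Two bits per tape symbol, as proposed above.
ENCODE = {'B': '00', '0': '01', '1': '10', '+': '11'}
DECODE = {bits: sym for sym, bits in ENCODE.items()}

def encode_tape(tape: str) -> str:
    """Encode each tape symbol as a fixed-width pair of bits."""
    return ''.join(ENCODE[s] for s in tape)

def decode_tape(bits: str) -> str:
    """Read two bits at a time, mirroring a read head of size two."""
    return ''.join(DECODE[bits[i:i + 2]] for i in range(0, len(bits), 2))

print(encode_tape('B1+1B'))       # 0010111000
print(decode_tape('0010111000'))  # B1+1B
```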
Another option is a unary representation:
Symbol: Representation
0: 1
1: 11
2: 111
n: n+1 ones
Blank: B

PDF Cross Reference Streams

I'm developing a PDF parser/writer, but I'm stuck at generating cross reference streams.
My program reads this file and then removes its linearization, and decompresses all objects in object streams. Finally it builds the PDF file and saves it.
This works really well when I use the normal cross reference & trailer, as you can see in this file.
When I try to generate a cross reference stream object instead (which results in this file), Adobe Reader can't view it.
Does anyone have experience with PDFs who can help me figure out what the problem is?
Note that the cross reference is the ONLY difference between file 2 and file 3. The first 34127 bytes are the same.
If someone needs the content of the decoded reference stream, download this file and open it in a hex editor. I've checked this reference table again and again, but I could not find anything wrong. The dictionary seems to be OK, too.
Thanks so much for your help!!!
Update
I've now completely solved the problem. You can find the new PDF here.
Two problems I see (without looking at the stream data itself):
"Size integer (Required) The number one greater than the highest object number used in this section or in any section for which this shall be an update. It shall be equivalent to the Size entry in a trailer dictionary."
your size should be... 14.
"Index array (Optional) An array containing a pair of integers for each subsection in this section. The first integer shall be the first object number in the subsection; the second integer shall be the number of entries in the subsection
The array shall be sorted in ascending order by object number. Subsections cannot overlap; an object number may have at most one entry in a section.
Default value: [0 Size]."
Your index should probably skip around a bit. You have no objects 2-4 or 7. The index array needs to reflect that.
Your data Ain't Right either (and I just learned how to read an xref stream. Yay me.)
00 00 00
01 00 0a
01 00 47
01 01 01
01 01 70
01 02 fd
01 76 f1
01 84 6b
01 84 a1
01 85 4f
According to this data, which because of your "no index" is interpreted as object numbers 0 through 9, the objects have the following offsets:
0 is unused. Fine.
1 is at 0x0a. Yep, sure is
2 is at 0x47. Nope. That lands near the beginning of "1 0"'s stream. This probably isn't a coincidence.
3 is at 0x101. Nope. 0x101 is still within "1 0"'s stream.
4 is at 0x170. Ditto
5 is at 0x2fd. Ditto
6 is at 0x76f1. Nope, and this time buried inside that image's stream.
I think you get the idea. So even if you had a correct /Index, your offsets are all wrong (and completely different from what's in resultNormal.pdf, even allowing for dec-hex confusion).
What you want can be found in resultNormal's xref:
xref
0 2
0000000000 65535 f
0000000010 00000 n
5 2
0000003460 00000 n
0000003514 00000 n
8 5
0000003688 00000 n
0000003749 00000 n
0000003935 00000 n
0000004046 00000 n
0000004443 00000 n
So your index should be (if I'm reading this right): /Index [0 2 5 2 8 5]. And the data:
0 0 0
1 0 a
1 3460 (that's decimal)
1 3514 (ditto)
1 3688
etc
Interestingly, the PDF spec says that the size must be BOTH the number of entries in this and all previous XRefs AND the number one greater than the highest object number in use.
I don't think the latter part is ever enforced, but I wouldn't be surprised to find that xref streams are validated more strictly than the normal cross reference tables. Might be the same code handling both, might not.
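The row layout being described can be sketched as a tiny parser (a sketch only: it assumes the stream bytes have already been FlateDecode'd and any PNG predictor undone, and the function name is mine):

```python
def parse_xref_rows(data: bytes, w=(1, 2, 0)):
    """Split decoded xref-stream data into (type, field2, field3) rows.

    `w` is the /W array: the width in bytes of each field per row.
    A width of 0 means the field is absent and takes its default value,
    represented here as None.
    """
    row_len = sum(w)
    rows = []
    for off in range(0, len(data), row_len):
        row, pos = [], off
        for width in w:
            if width == 0:
                row.append(None)  # field omitted; default applies
            else:
                row.append(int.from_bytes(data[pos:pos + width], 'big'))
                pos += width
        rows.append(tuple(row))
    return rows

# The first two rows quoted above: a free entry, then object 1 at 0x0a.
sample = bytes([0x00, 0x00, 0x00, 0x01, 0x00, 0x0a])
print(parse_xref_rows(sample))  # [(0, 0, None), (1, 10, None)]
```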
#mtraut:
Here's what I see:
13 0 obj <</Size 10/Length 44/Filter /FlateDecode/DecodeParms <</Columns 3/Predictor 12>>/W [1 2 0]/Type /XRef/Root 8 0 R>>
stream
...
endstream
endobj
The "resultstream.pdf" does not have a valid cross ref stream.
If I open it in my viewer, it tries to read object "13 0" as a cross ref stream, but it's a plain dictionary (the stream tags and data are missing).
A little off topic: what language are you developing in? At least in Java I know of three valuable choices (PDFBox, iText, and jPod; I personally, as one of the developers, opt for jPod, a very clean implementation :-). If this does not fit your platform, maybe you can at least have a look at the algorithms and data structures.
EDIT
Well, if "resultstream.pdf" is the document in question, then this is what my editor (SciTE) sees:
...
13 0 obj
<</Size 0/W [1 2 0]/Type /XRef/Root 8 0 R>>
endobj
startxref
34127
%%EOF
There is no stream.