How can I make an ANTLR rule that matches a line of characters containing a specific character only once, at any position in the line?

I am new to writing grammar in ANTLR, and I am not sure how to make the rule I want here.
Suppose a line of characters could possibly be something like:
a b c 1 e : d g e 3 s 5 b 7 d r : f : 2 v : 2
But could also look like:
g j : f 1 h k 6 u s h u 5 r c b 0 u k = x v
which is what I am looking for.
So, a line can be made up of any number of letters, numbers, and : symbols in any order, and any white space is ignored. I need an ANTLR rule that matches a line like these but containing exactly one : symbol, which can be located at any position in the line. How can I do this?

Depending on the rest of your lexer rules, you might try something like this:
WholeLineWithOneColon
: ~[:\r\n]* ':' ~[:\r\n]*
;
Note that this need not match an entire line (or even start at the beginning of the line). Which lexer rule ANTLR applies depends on the other rules in the grammar: at each position the lexer picks the rule that matches the most characters, so another rule that matches more input takes precedence over the one shown here.
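To make the shape of that pattern concrete outside ANTLR, here is a minimal sketch in Python using the standard re module. It is not ANTLR code; it only mirrors the character classes of the rule above, and the function name has_exactly_one_colon is made up for illustration.

```python
import re

# Same shape as the lexer rule: any non-colon characters, one ':',
# then any non-colon characters (line breaks excluded).
ONE_COLON = re.compile(r"[^:\r\n]*:[^:\r\n]*")


def has_exactly_one_colon(line: str) -> bool:
    """Return True if the whole line contains exactly one ':'."""
    return ONE_COLON.fullmatch(line) is not None
```

Because fullmatch is used, the check applies to the whole line, which sidesteps the rule-priority caveat that applies inside an ANTLR lexer.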

Related

How to process mainframe numbers where "{" is the last character

I have data from a mainframe file like the following:
000000720000{
I need to parse the data and load it into a Hive table like this:
72000
The field above is the income column, and the "{" sign denotes a positive amount.
The datatype used when creating the table is income decimal(11,2).
In the layout.cob copybook the field is declared as INCOME PIC S9(11)V99.
Could someone help?
The number you want is 7200000 which would be 72000.00.
The conversion you are looking for is:
Positive numbers
{ = 0
A = 1
B = 2
C = 3
D = 4
E = 5
F = 6
G = 7
H = 8
I = 9
Negative numbers (this makes the whole value negative)
} = 0
J = 1
K = 2
L = 3
M = 4
N = 5
O = 6
P = 7
Q = 8
R = 9
Let's explain why.
Based on your question, the issue you are having arises when packed decimal data is unpacked (UNPK) into character data. Basically, the PIC S9(11)V99 actually takes up 7 bytes of storage and looks like the picture below.
You'll see three lines. The top is the character representation (missing in the first picture because the hex values do not map to displayable characters) and the lines below are the hexadecimal values, with the most significant digit of each byte on top and the least significant below.
Note that in the rightmost byte the sign is stored as C, which is positive. To represent a negative value you would see a D.
When it is converted to character data it will look like this
Notice the C0 which is a consequence of the unpacking to preserve the sign. Be aware that this display is on z/OS which is EBCDIC. If the file has been transferred and converted to another code-page you will see the correct character but the hex values will be different.
Here are all the combinations you will likely see for positive numbers
and here for negative numbers
To make your life easy, if you see one of the first set of characters then you can replace it with the corresponding number. If you see something from the second set then it is a negative number.
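The replacement described above can be sketched in Python as follows. This is a minimal illustration, not a full copybook parser: the function name decode_overpunch and the scale parameter are assumptions made for this example, and the default of 2 decimal places comes from the V99 in the question's picture clause. It expects the field to have already been converted to ASCII characters.

```python
from decimal import Decimal

# Sign characters for the last digit, exactly as listed in the tables above.
POSITIVE = dict(zip("{ABCDEFGHI", "0123456789"))
NEGATIVE = dict(zip("}JKLMNOPQR", "0123456789"))


def decode_overpunch(field: str, scale: int = 2) -> Decimal:
    """Decode a signed zoned-decimal string such as '000000720000{'."""
    body, last = field[:-1], field[-1]
    if last in POSITIVE:
        sign, digit = 1, POSITIVE[last]
    elif last in NEGATIVE:
        sign, digit = -1, NEGATIVE[last]
    elif last.isdigit():
        sign, digit = 1, last  # no sign character: treat as positive
    else:
        raise ValueError(f"unexpected sign character: {last!r}")
    digits = body + digit
    if not digits.isdigit():
        raise ValueError(f"non-numeric field: {field!r}")
    # Shift the implied decimal point 'scale' places to the left.
    return Decimal(sign * int(digits)).scaleb(-scale)
```

For the value in the question, decode_overpunch("000000720000{") yields a Decimal equal to 72000.00, which fits the decimal(11,2) column.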

generating palindromes with John the Ripper

How can I configure John the Ripper to generate only mangled (Jumbo) palindromes from a word-list to crack a password hash?
(I've googled it but only found "how to avoid palindromes")
In john/john.conf, append the following rules at the end (this example is for 9- and 10-letter palindromes):
# End of john.conf file.
# Keep this comment, and blank line above it, to make sure a john-local.conf
# that does not end with \n is properly loaded.
[List.Rules:palindromes]
f
f D5
then run john with your wordlist plus the newly created "palindromes" rules:
$ john --wordlist=wordlist.lst --rules:palindromes hashfile.hash
Rule f simply appends a reflection of the current word from the wordlist to itself, e.g. P4ss! -> P4ss!!ss4P
Rule f D5 not only reflects the word but then deletes the character at position 5 (positions are counted from 0), e.g. P4ss! -> P4ss!ss4P
I haven't found a way to "delete the middle character", so as of now the rule has to be adjusted to the required length of the palindromes, e.g. f D4 for a length of 7, f D6 for a length of 11, etc.
Edit: Possible solution for variable length (not tested yet):
f
Mr[6
M = Memorize current word, r = Reverse the entire word , [ = Delete first character, 6 = Prepend the word saved to memory to current word
With this approach the palindromes could additionally be "turned inside out" (word from wordlist at the end of the resulting palindrome instead of at beginning)
f
Mr[6
Mr]4
M = Memorize current word, r = Reverse the entire word , ] = Delete last character, 4 = Append the word saved to memory to current word
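To check what these rules produce without running John, the string operations can be imitated in Python. This is only a simulation of the rule semantics as described above, not John the Ripper code, and the function names are made up for this sketch.

```python
def reflect(word: str) -> str:
    """Rule 'f': append the reversed word to itself."""
    return word + word[::-1]


def reflect_delete(word: str, pos: int) -> str:
    """Rule 'f DN': reflect, then delete the character at 0-based position N."""
    reflected = reflect(word)
    return reflected[:pos] + reflected[pos + 1:]


def odd_palindrome(word: str) -> str:
    """Rule 'Mr[6': reverse, drop the first character, prepend the saved word."""
    return word + word[::-1][1:]


def inside_out(word: str) -> str:
    """Rule 'Mr]4': reverse, drop the last character, append the saved word."""
    return word[::-1][:-1] + word
```

With the word P4ss!, reflect gives P4ss!!ss4P, reflect_delete with position 5 and odd_palindrome both give P4ss!ss4P, and inside_out gives !ss4P4ss!. Unlike f DN, the last two do not depend on the word length.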

How to denote at least one repetition in EBNF?

https://en.wikipedia.org/wiki/Extended_Backus–Naur_form
The above article mentions that curly braces denote an arbitrary number of repetitions (including zero), while square brackets denote at most one occurrence.
What I want, however, is at least one repetition - that is, a terminal or a nonterminal must appear at least once.
Well, I can describe it like this:
production = nonterminal, { nonterminal };
But I thought the point of EBNF over BNF was to avoid the need for this kind of "hack".
The Wikipedia article also mentions:
EBNF also provides, among other things, the syntax to describe repetitions (of a specified number of times), to exclude some part of a production, and to insert comments in an EBNF grammar.
But does EBNF provide the syntax to describe at least one repetition?
Place a minus (except-symbol) after the final brace.
production = { nonterminal }-;
ISO/IEC 14977 : 1996(E)
5.8 Syntactic-term
When a syntactic-term is a single syntactic-factor it represents any
sequence of symbols represented by that syntactic-factor.
When a syntactic-term is a syntactic-factor followed by an
except-symbol followed by a syntactic-exception it represents any
sequence of symbols that satisfies both of the conditions:
a) it is a sequence of symbols represented by the syntactic-factor,
b) it is not a sequence of symbols represented by the
syntactic-exception.
As examples the following syntax-rules illustrate the facilities
provided by the except-symbol.
letter = "A" | "B" | "C" | "D" | "E" | "F"
| "G" | "H" | "I" | "J" | "K" | "L" | "M"
| "N" | "O" | "P" | "Q" | "R" | "S" | "T"
| "U" | "V" | "W" | "X" | "Y" | "Z";
vowel = "A" | "E" | "I" | "O" |"U";
consonant = letter - vowel;
ee = {"A"}-, "E";
Terminal-strings defined by these rules are as follows:
letter: A B C D E F G H I J etc.
vowel: A E I O U
consonant: B C D F G H J K L M etc.
ee: AE AAE AAAE AAAAE AAAAAE etc.
NOTE — {"A"}- represents a sequence of one or more A’s because it is a
syntactic-term with an empty syntactic-exception.
Note that the second paragraph says the sequence must satisfy both of the conditions: it must be a sequence represented by the syntactic-factor, and it must not be a sequence represented by the syntactic-exception. The braces still mean repetition, and an empty syntactic-exception represents only the empty sequence. Excluding the empty sequence rules out zero repetitions, which leaves one or more.
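As a small illustration of what the ee rule from the standard accepts, here is a hand-written matcher in Python. It is a sketch of that single rule only, not an EBNF parser, and the function name matches_ee is invented for this example.

```python
def matches_ee(text: str) -> bool:
    """Check text against the rule: ee = {"A"}-, "E";"""
    # Consume the leading run of "A" symbols; this is the { "A" } part.
    count = 0
    while count < len(text) and text[count] == "A":
        count += 1
    # The trailing "-" excludes the empty sequence, so at least one "A"
    # is required before the final "E".
    return count >= 1 and text[count:] == "E"
```

Strings such as AE and AAAE are accepted, while a lone E is rejected because the repetition matched zero times.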

Pig script to count the number of letters in a file

I want to extend the hello world program of hadoop word count to be able to count the number of letters in the input file.
I have written this so far and I'm unable to figure out what is wrong with this code. Any help identifying the issue will be appreciated.
A = load '/tmp/alice.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '\\w+';
D = foreach C generate flatten(REGEX_EXTRACT_ALL(word, '([a-zA-Z])')) as letter;
E = group D by letter;
F = foreach E generate COUNT(D), group;
store F into '/tmp/alice_wordcount';
Let me say that I am a PIG newbie, but somehow this query got me interested. I diverged into all kinds of complex stuff like nested foreach, UDFs etc. But in the end, the answer is pretty simple. It's just a correction in one of your pig latin lines as below:
D = foreach C generate flatten(TOKENIZE(REPLACE(word,'','|'), '|')) as letter;
Instead of using REGEX_EXTRACT_ALL, I REPLACE each letter boundary with a special character ('|' here, though you can use an uncommon sequence if you like) and then TOKENIZE around that delimiter.
Try the following code:
-- Load the data
A = load '/tmp/alice.txt';
-- Split each line into words
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
-- Split words into characters
C = foreach B generate flatten(TOKENIZE(REPLACE($0,'','|'),'|')) as letter;
-- Group the letters
D = group C by letter;
-- Generate the count of each letter
E = foreach D generate COUNT(C), group;
-- Store the results
store E into '/tmp/alice_wordcount';
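For comparison, the same pipeline (split lines into words, split words into characters, group and count) can be sketched in plain Python. This only approximates the Pig script: str.split() breaks on whitespace, whereas Pig's TOKENIZE also splits on a few other characters, and like the second script it counts every character in a word, not only letters. The function name count_letters is an assumption of this sketch.

```python
from collections import Counter
from typing import Iterable


def count_letters(lines: Iterable[str]) -> Counter:
    """Count how often each character occurs across all words."""
    counts: Counter = Counter()
    for line in lines:
        for word in line.split():   # split the line into words
            counts.update(word)     # a str is iterated character by character
    return counts
```

Passing an open file object, for example count_letters(open(path)) with a path of your choosing, counts the characters of a local copy of the input.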

How to load array of strings with tab delimiter in pig

I have a tab-delimited text file, and I am trying to print the first column as id and the remaining strings as an array in a second column, names.
Consider the file below:
cat file.txt;
1 A B
2 C D E F
3 G
4 H I J K L M
In the above file, first column is an id and the remaining are names.
I should get the output like:
id names
1 A,B
2 C,D,E,F
3 G
4 H,I,J,K,L,M
If the names are separated by commas instead, I get the expected output with the commands below:
test = load '/tmp/arr' using PigStorage('\t') as (id:int,names:chararray);
btest = FOREACH test GENERATE id, FLATTEN(TOBAG(STRSPLIT(names,','))) as value:tuple(name:CHARARRAY);
But when the names are separated by tabs ('\t'), this does not work, because only the first value ends up in column 2 (i.e., names).
Is there a solution for this?
I have a solution for this:
When you use PigStorage('\t') in the load, every tab in a line starts a new column, so a line with N tabs is split into N+1 columns. That is why names only receives the first value.
But there is a trick:
You can put a different delimiter, such as a comma, between the id and the names, and load the file with that delimiter instead of the default. The names then stay together in a single field.
This should work.
Input file sample
1,A B
2,C D E F
3,G
4,H I J K L M
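To make the intended result concrete, here is a minimal Python sketch that turns one line of the original tab-delimited file into the id and the comma-joined names shown in the question. It is not Pig code, and the function name parse_row is made up for illustration.

```python
from typing import Tuple


def parse_row(line: str) -> Tuple[int, str]:
    """Split a tab-delimited line into (id, 'name1,name2,...')."""
    fields = line.rstrip("\n").split("\t")
    # The first field is the id; everything after it is a name.
    return int(fields[0]), ",".join(fields[1:])
```

For a line containing 2, C, D, E and F separated by tabs, this returns the id 2 and the string C,D,E,F.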
Hope this helps