ANTLR4 no viable alternative at input 'do { return' error? - antlr

This ANTLR4 parser grammar raises a 'no viable alternative' error when I try to parse an input. The only rules I know of that match the part of the input with the error are 'retblock_expr' and 'block_expr'. I have put 'retblock_expr' in front of 'block_expr' and 'non_assign_expr' in front of 'retblock_expr', but it still throws the error.
input:
print(do { return a[3] })
full error:
line 1:11 no viable alternative at input '(do { return'
parser grammar:
parser grammar TestP;
options { tokenVocab=Test; }
program: ( ( block_expr | retblock_expr ) ( wsp ( block_expr | retblock_expr ) wsp )* wsp )? EOF;
retblock_expr
: isglobal DEF wsp fcreatable wsp fcall wsp ( DO wsp )? ( LBC wsp block_expr wsp RBC | block_expr wsp END ) #FuncBlockA
| FUNC wsp fcall wsp ( DO wsp )? ( LBC wsp block_expr wsp RBC | block_expr wsp END ) #CFuncBlockA
| LAMBDA wsp fcall wsp ( DO wsp )? ( LBC wsp block_expr wsp RBC | block_expr wsp END ) #LambdaBlockA
| SWITCH wsp atompar_option wsp ( DO wsp )? ( LBC wsp (CASE wsp atompar_option wsp ( block_expr wsp END | LBC wsp block_expr wsp RBC ) )* ( DEFAULT wsp atompar_option wsp ( block_expr wsp END | LBC wsp block_expr wsp RBC ) )? wsp RBC | (CASE wsp atompar_option wsp ( block_expr wsp END | LBC wsp block_expr wsp RBC ) )* ( DEFAULT wsp atompar_option wsp ( block_expr wsp END | LBC wsp block_expr wsp RBC ) )? wsp END ) #SwitchBlockA
| DO wsp ( LBC wsp block_expr wsp RBC | block_expr wsp END ) #DoBlockA
;
non_assign_expr
: ( iterable ( ( DOT | SUP | SIB ) iterable )+ | index ) #AccessExpr
| ( call | datat | LPR non_assign_expr RPR | LBC non_assign_expr RBC | LBR non_assign_expr RBR ) #BracketsExpr
| ( STR | KUN )+ indexable #UnpackExpr
| <assoc=right> non_assign_expr ( wsp POW wsp non_assign_expr )+ #PowExpr
| non_assign_expr ( wsp ( INC | DEC ) wsp non_assign_expr | INC | DEC )+ #CrementExpr
| ( PLS | MNS | BNT | EXC | LEN | NOT )+ non_assign_expr #UnaryExpr
| non_assign_expr EXC+ #FactExpr
| non_assign_expr ( wsp ( STR | DIV | PER | FDV | CDV ) wsp non_assign_expr | PER )+ # AdvExpr
| non_assign_expr ( wsp ( PLS | MNS ) wsp non_assign_expr )+ #BasicExpr
| non_assign_expr ( wsp CON wsp non_assign_expr )+ #ConcatExpr
| non_assign_expr ( wsp ( BLS | BRS ) wsp non_assign_expr )+ #ShiftExpr
| non_assign_expr ( wsp ( LET | LTE | GRT | GTE ) wsp non_assign_expr )+ #CompareExpr
| non_assign_expr ( wsp ( EQL | IS | NEQ | IS wsp NOT ) wsp non_assign_expr )+ #EqualExpr
| non_assign_expr ( wsp BND wsp non_assign_expr )+ #BitAnd
| non_assign_expr ( wsp BXR wsp non_assign_expr )+ #BitXor
| non_assign_expr ( wsp BOR wsp non_assign_expr )+ #BitOr
| <assoc=right> non_assign_expr ( wsp ( AND | TND ) wsp non_assign_expr wsp ( OR | TOR ) wsp non_assign_expr )+ #Ternary
| non_assign_expr ( wsp ( NND | AND ) wsp non_assign_expr )+ #AndExpr
| non_assign_expr ( wsp ( NXR | XOR ) wsp non_assign_expr )+ #XorExpr
| non_assign_expr ( wsp ( NOR | OR ) wsp non_assign_expr )+ #OrExpr
| retblock_expr #RBlockA
| typet LPR non_assign_expr RPR #TypeCastA
| atom #AtomNAE
;
block_expr
: IF wsp non_assign_expr wsp ( ( THEN wsp block_expr | LBC wsp block_expr wsp RBC ) wsp ( ( ELIF wsp non_assign_expr wsp THEN wsp block_expr | ELIF wsp non_assign_expr wsp LBC wsp block_expr wsp RBC )* ELSE wsp ( block_expr wsp END | LBC wsp block_expr wsp RBC ) | ( ELIF wsp non_assign_expr wsp THEN wsp block_expr | ELIF wsp non_assign_expr wsp LBC wsp block_expr wsp RBC )*? ( ELIF wsp non_assign_expr wsp THEN wsp block_expr wsp END | ELIF wsp non_assign_expr wsp LBC wsp block_expr wsp RBC ) ) | ( THEN wsp block_expr wsp END | LBC wsp block_expr wsp RBC ) ) #IfBlock
| TRY wsp ( block_expr wsp ( EXCEPT (LPR wsp IDN wsp RPR)? wsp ( ( (DO wsp)? LBC wsp block_expr wsp RBC | DO wsp block_expr wsp END | block_expr ) wsp )? FINALLY wsp ( block_expr wsp END | LBC wsp block_expr wsp RBC ) | EXCEPT (LPR wsp IDN wsp RPR)? wsp ( block_expr wsp END | LBC wsp block_expr wsp RBC ) ) | ( block_expr wsp END | LBC wsp block_expr wsp RBC ) ) #DebugBlock
| FOR wsp av_var wsp TOR wsp av_inc wsp CMA wsp av_inc wsp ( CMA wsp av_inc wsp )? ( DO wsp )? ( LBC wsp block_expr wsp RBC | block_expr wsp END ) #RangeBlock
| FOR wsp av_var wsp CMA wsp non_assign_expr wsp CMA wsp non_assign_expr wsp CMA wsp non_assign_expr wsp ( DO wsp )? ( LBC wsp block_expr wsp RBC | block_expr wsp END ) #ActionBlock
| FOR wsp IDN wsp ( TOR wsp non_assign_expr )? ( wsp CMA wsp ( IDN wsp ( TOR wsp non_assign_expr )? )? ( wsp CMA wsp IDN wsp TOR wsp non_assign_expr )* )? IN wsp iterable wsp ( DO wsp )? ( LBC wsp block_expr wsp RBC | block_expr wsp END ) #IterationBlock
| ( WHILE wsp non_assign_expr wsp ( DO wsp )? ( LBC wsp block_expr wsp RBC | block_expr wsp END ) | DO wsp ( LBC wsp block_expr wsp RBC | block_expr ) wsp WHILE wsp non_assign_expr ) #WhileBlock
| ( ( DO | REPEAT ) wsp ( LBC wsp block_expr wsp RBC | block_expr ) wsp UNTIL wsp non_assign_expr | UNTIL wsp non_assign_expr wsp ( DO | REPEAT ) ( LBC wsp block_expr wsp RBC | block_expr wsp END ) ) #RepeatBlock
| isglobal DEF wsp fcreatable wsp fcall wsp ( DO wsp )? ( LBC wsp block_expr wsp RBC | block_expr wsp END ) #FuncBlock
| FUNC wsp fcall wsp ( DO wsp )? ( LBC wsp block_expr wsp RBC | block_expr wsp END ) #CFuncBlock
| LAMBDA wsp fcall wsp ( DO wsp )? ( LBC wsp block_expr wsp RBC | block_expr wsp END ) #LambdaBlock
| SWITCH wsp atompar_option wsp ( DO wsp )? ( LBC wsp (CASE wsp atompar_option wsp ( block_expr wsp END | LBC wsp block_expr wsp RBC ) )* ( DEFAULT wsp atompar_option wsp ( block_expr wsp END | LBC wsp block_expr wsp RBC ) )? wsp RBC | (CASE wsp atompar_option wsp ( block_expr wsp END | LBC wsp block_expr wsp RBC ) )* ( DEFAULT wsp atompar_option wsp ( block_expr wsp END | LBC wsp block_expr wsp RBC ) )? wsp END ) #SwitchBlock
| DO wsp ( LBC wsp block_expr wsp RBC | block_expr wsp END ) #DoBlock
| LPR block_expr RPR #EnclosedBlockA
| LBC block_expr RBC #EnclosedBlockB
| block #OpenBlock
;
atompar_option
: LPR wsp atom wsp RPR
| atom
;
isglobal: ( GLOBAL wsp )?;
block: ( stat+ wsp (PASS | retstat)* )+ | PASS;
stat
: expression+ wsp SMC*
| expression* wsp SMC+
;
retstat: RETURN wsp non_assign_expr;
expression
: <assoc=right> isglobal var_list ( wsp aop wsp expression )+ #AssignExpr
| exp_list #ExpListA
| non_assign_expr #NonAssign
| atom #AtomEXPR
| IVC #InvalidCharacter
;
literal
: strt
| num
;
datat
: listd
| dictd
| setd
| tupled
;
wsp: WSP*;
listd
: LBR wsp exp_list wsp RBR
| EML
;
dictd
: LBC wsp kvpair wsp
(
CMA
wsp
kvpair
wsp
)*
RBC
;
setd
: LBC wsp exp_list wsp RBC
| EMS
;
indexable
: ( dictd | IDN | ( ( datat | IDN | strt ) ) LBR non_assign_expr RBR | ( datat | IDN | strt ) ( ( DOT | SUP | SIB ) ( datat | IDN | strt ) )+ ) fcall
| ( ( datat | IDN | strt ) ) LBR non_assign_expr RBR
| ( datat | IDN | strt ) ( ( DOT | SUP | SIB ) ( datat | IDN | strt ) )+
| IDN
| datat
;
iterable
: indexable
| strt
;
numidn
: num
| IDN
;
av_numidn
: numidn
| av_var
;
av_inc
: av_numidn
| call
;
tupled: LPR wsp (exp_list | CMA) wsp RPR;
kvpair: non_assign_expr wsp TOR wsp non_assign_expr;
index
: ( iterable ) LBR non_assign_expr RBR
| iterable ( ( DOT | SUP | SIB ) iterable )+
;
var_list: ( typet wsp )? av_var ( wsp CMA wsp ( typet wsp )? var_list)*;
av_var
: IDN
| index
;
exp_list: non_assign_expr (wsp CMA wsp non_assign_expr)*;
atom
: num
| av_var
| strt
| typet
| ckw
| val
| datat
;
aop
: A_FDV // '//='
| A_CDV // '*/='
| A_NOR // '||='
| A_FAC // '=!='
| A_LTE // '=<='
| A_GTE // '=>='
| A_EQL // '==='
| A_NEQ // '!=='
| A_CON // '..='
| A_NXR // '$$='
| A_BRS // '>>='
| A_NND // '&&='
| A_BLS // '<<='
| A_DCL // '::='
| A_CLD // ':.='
| A_KUN // '=**'
| A_VUN // '=*'
| A_DOT // '.='
| A_POW // '^='
| A_NOT // '=!'
| A_BNT // '=~'
| A_LEN // '=#'
| A_PER // '=%'
| A_MUL // '*='
| A_DIV // '/='
| A_MOD // '%='
| A_ADD // '+='
| A_SUB // '-='
| A_LET // '=<'
| A_GRT // '=>'
| A_BND // '&='
| A_BXR // '$='
| A_BOR // '|='
| A_TND // '?='
| A_TOR // ':='
| A_NML // '='
;
num
: exponential
| non_exponential
;
exponential
: PXI
| DXI
| PXF
| DXF
| PXB
| DXB
| PXD
| DXD
| PXP
| DXP
| PRX
| DEX
;
non_exponential
: IMG
| FLT
| DBL
| DCM
| PRC
| INT
;
fcreatable
: dictd
| av_var
;
callablets
: fcreatable
| retblock_expr
| ckw
;
fcall
: CLP
| LPR arg_list RPR
;
call: callablets fcall;
arg_list: arg_type ( wsp CMA wsp arg_type )*;
arg_type
: u_var_list wsp aop wsp u_exp_list
| unkeyed_var
;
unkeyed_var
: LPR var_list RPR
| LBR var_list RBR
| LBC var_list RBC
| var_list
;
u_var_list: unkeyed_var ( wsp aop wsp u_var_list )*;
u_exp_list: unkeyed_exp ( wsp CMA wsp unkeyed_exp )*;
unkeyed_exp
: tupled
| listd
| setd
| non_assign_expr
;
litidn
: literal
| IDN
;
typecast: typet LPR non_assign_expr RPR;
strt
: multi_line
| single_line
| char_string
;
multi_line
: SMT
| USM
| NMT
| UNM
;
single_line
: SST
| USS
| NST
| UNS
| NAS
;
char_string
: SCH
| USC
| NCH
| UNC
| NAC
;
typet
: STRT
| INTT
| NUMT
| DECIMALT
| FLOATT
| DOUBLET
| PRECISET
| EXPNT
| CHART
| IMAGT
| REALT
| HEXTY
| BINTY
| OCTTY
| LISTD
| SETD
| DICTD
| TUPLED
| TYPET
| BOOLT
;
bks_or_WSP
: WSP
| BKS
| SPC
;
emd
: EML
| EMS
;
sep
: SMC
| CMA
| TOR
;
kwr
: WHILE
| FOR
| DO
| DEL
| NEW
| IMPORT
| EXPORT
| DEF
| END
| GLOBAL
| BREAK
| CONTINUE
| NOT
| AND
| OR
| IN
| CASE
| DEFAULT
| RETURN
| TRY
| EXCEPT
| FINALLY
| ELIF
| IF
| ELSE
| AS
| CONST
| REPEAT
| UNTIL
| THEN
| GOTO
| LABEL
| USING
| PUBLIC
| PROTECTED
| PRIVATE
| SELF
| FROM
| XOR
| IMAGT
| REALT
| WHERE
| PASS
| G_G
| L_L
| MAP
| IS
;
ckw
: OPN
| OUT
| OUTF
| PRINT
| PRINTF
| LAMBDA
| FUNC
| ERR
| ERRF
| ASSERT
| ASSERTF
| FORMAT
| SWITCH
| ABS
| ASCII
| CALLABLE
| CHR
| DIR
| EVAL
| EXEC
| FILTER
| GET
| HASH
| ID
| INST
| SUB
| SUPER
| MAX
| MIN
| OBJ
| ORD
| POWF
| REV
| REPR
| ROUND
| FLOOR
| CEIL
| MUL
| SORT
| ADD
| ZIP
| WAIT
| SECS
| MILS
| BENCHMARK
;
val
: RMH // 'inf'
| IMH // 'infi'
| NAN // 'nan'
| IND // 'ind'
| UND // 'und'
| NIL // 'nil'
| NON // 'none'
| TRU // 'true'
| FLS // 'false'
;
opr
: NND // '&&'
| NXR // '$$'
| NOR // '||'
| CLP // '()'
| SUP // '::'
| SIB // ':.'
| KUN // '**'
| INC // '++'
| DEC // '+-'
| FDV // '//'
| CDV // '* /'
| CON // '..'
| BLS // '<<'
| BRS // '>>'
| LTE // '<='
| GTE // '>='
| EQL // '=='
| NEQ // '!='
| LPR // '('
| RPR // ')'
| LBR // '['
| RBR // ']'
| LBC // '{'
| RBC // '}'
| STR // '*'
| POW // '^'
| PLS // '+'
| MNS // '-'
| BNT // '~'
| EXC // '!'
| LEN // '#'
| PER // '%'
| DIV // '/'
| LET // '<'
| GRT // '>'
| BND // '&'
| BXR // '$'
| BOR // '|'
| TND // '?'
| TOR // ':'
| DOT // '.'
;
inl
: strt
| num
| ckw
| kwr
| val
| IDN
| bks_or_WSP
| sep
| emd
| aop
| opr
| typet
| IVC
;

Your PRINT token can only be matched by the block_expr rule through this path:
There is no path for retblock_expr to recognize anything that begins with the PRINT token.
As a result, it will not matter which order you put block_expr and retblock_expr in.
There is no parser rule in your grammar that will match a PRINT token followed by an LPR token. The program rule only matches (ignoring wsp) block_expr or retblock_expr, and neither of these has an alternative that accepts an LPR token at this point, so ANTLR can't match that token.
print(...) would normally be matched as a function call expression that accepts 0 or more comma-separated parameters. You have no such rule/alternative defined. (I'd guess that it should be an alternative on either retblock_expr or block_expr.)
That's the immediate cause of this error. ANTLR really does not have any rule/alternative that can accept an LPR token in this position.

Related

ANTLR || Not able to validate nested boolean condition in If block

I'm facing an issue while validating the below formula with the given grammar rules.
if(2>3?ceil(loopup(12)):floor(matrix(2,3)))
However, I am able to validate the formulas below:
if(2>3?loopup(12):matrix(2,3))
if(2>3?ceil(12.2):floor(2.3))
ast
: expr+ EOF
;
expr: nestedexpr
| LOOKUP_FIELD '(' idrule ')'
| TIER_FIELD '(' idrule ',' idrule ')'
| MATRIX_FIELD '(' idrule ',' idrule ')'
| IF '(' conditionalrule '?' expr ':' expr')'
| ROUND '(' idrule ',' roundnumberrule ')'
;
nestedexpr:
nestedexpr ('*'|'/'|'+'|'-') nestedexpr
| '(' '-' nestedexpr ')'
| '(' nestedexpr ')'
| ROUND '(' expr ',' roundnumberrule ')'
| MATH_FUNCTION_FIELD '(' expr ')'
| DYNAMIC_FIELD_ID
;
arithematicexpr:
arithematicexpr ('*'|'/'|'+'|'-') arithematicexpr
| '(' '-' arithematicexpr ')'
| '(' arithematicexpr ')'
| DYNAMIC_FIELD_ID
;
orrule: OR '(' conditionalrule (',' conditionalrule)+ ')';
andrule: AND '(' conditionalrule (',' conditionalrule)+ ')';
conditionalrule: orrule | andrule | relationalrule;
relationalrule: DYNAMIC_FIELD_ID RELATIONAL_OPERATOR DYNAMIC_FIELD_ID;
idrule :
DYNAMIC_FIELD_ID
;
LOOKUP_FIELD: L O O K U P ;
TIER_FIELD: T I E R;
MATRIX_FIELD: M A T R I X;
IF: I F;
AND: A N D;
OR: O R;
ROUND : R O U N D;
MATH_FUNCTION_FIELD : C E I L | F L O O R;
RELATIONAL_OPERATOR: '<' | '>' | '<=' | '>=' | '<>' | '=';
BOOL_FIELD : T R U E | F A L S E;
DYNAMIC_FIELD_ID: {isDynamicFieldId()}? . ;
roundnumberrule: ROUND_NUMBER;
ROUND_NUMBER: [0-7];
WS : [ \t\r\n]+ -> skip ;
if(2>3?ceil(loopup(12)):floor(matrix(2,3)))
The above should get parsed by the mentioned grammar rule.

Telegram User Adder

Hi, I recently made a Telegram scraper that scrapes users from Telegram groups.
Now I am trying to add a user adder to it.
#!/bin/env python3
from telethon.sync import TelegramClient
from telethon.tl.functions.messages import GetDialogsRequest
from telethon.tl.types import InputPeerEmpty, InputPeerChannel, InputPeerUser
from telethon.errors.rpcerrorlist import FloodWaitError, PeerFloodError, UserPrivacyRestrictedError
from telethon.tl.functions.channels import InviteToChannelRequest
import configparser
import os, sys
import csv
import traceback
import time
import random
re="\033[1;31m"
gr="\033[1;32m"
cy="\033[1;36m"
def banner():
    print(f"""
_____ __ ____ ____ ____ ___ ____ _____ __ ____ ____ ____ ___ ____
.----------------. .----------------. .----------------. .----------------. .----------------.
| .--------------. || .--------------. || .--------------. || .--------------. || .--------------. |
| | __ | || | ________ | || | ________ | || | _________ | || | _______ | |
| | / \ | || | |_ ___ `. | || | |_ ___ `. | || | |_ ___ | | || | |_ __ \ | |
| | / /\ \ | || | | | `. \ | || | | | `. \ | || | | |_ \_| | || | | |__) | | |
| | / ____ \ | || | | | | | | || | | | | | | || | | _| _ | || | | __ / | |
| | _/ / \ \_ | || | _| |___.' / | || | _| |___.' / | || | _| |___/ | | || | _| | \ \_ | |
| ||____| |____|| || | |________.' | || | |________.' | || | |_________| | || | |____| |___| | |
| | | || | | || | | || | | || | | |
| '--------------' || '--------------' || '--------------' || '--------------' || '--------------' |
'----------------' '----------------' '----------------' '----------------' '----------------'
_____ __ ____ ____ ____ ___ ____ _____ __ ____ ____ ____ ___ ____
version : 2.0
""")
cpass = configparser.RawConfigParser()
cpass.read('config.data')
try:
    api_id = cpass['cred']['id']
    api_hash = cpass['cred']['hash']
    phone = cpass['cred']['phone']
    client = TelegramClient(phone, api_id, api_hash)
except KeyError:
    os.system('clear')
    banner()
    print(re+"[!] run python3 setup.py first !!\n")
    sys.exit(1)
client.connect()
if not client.is_user_authorized():
    client.send_code_request(phone)
    os.system('clear')
    banner()
    client.sign_in(phone, input(gr+'[+] Enter the code: '+re))
os.system('clear')
banner()
input_file = sys.argv[1]
users = []
with open(input_file, encoding='UTF-8') as f:
    rows = csv.reader(f,delimiter=",",lineterminator="\n")
    next(rows, None)
    for row in rows:
        user = {}
        user['username'] = row[0]
        user['id'] = int(row[1])
        user['access_hash'] = int(row[2])
        user['name'] = row[3]
        users.append(user)
chats = []
last_date = None
chunk_size = 200
groups=[]
result = client(GetDialogsRequest(
    offset_date=last_date,
    offset_id=0,
    offset_peer=InputPeerEmpty(),
    limit=chunk_size,
    hash = 0
))
chats.extend(result.chats)
for chat in chats:
    try:
        if chat.megagroup== False:
            groups.append(chat)
    except:
        continue
i=0
for group in groups:
    print(gr+'['+cy+str(i)+gr+']'+cy+' - '+group.title)
    i+=1
print(gr+'[+] Choose a group to add members')
g_index = input(gr+"[+] Enter a Number : "+re)
target_group=groups[int(g_index)]
target_group_entity = InputPeerChannel(target_group.id,target_group.access_hash)
print(gr+"[1] add member by user ID\n[2] add member by username ")
mode = int(input(gr+"Input : "+re))
n = 0
for user in users:
    n += 1
    if n % 50 == 0:
        time.sleep(1)
    try:
        print ("Adding {}".format(user['id']))
        if mode == 1:
            if user['username'] == "":
                continue
            user_to_add = client.get_input_entity(user['username'])
        elif mode == 2:
            user_to_add = InputPeerUser(user['id'], user['access_hash'])
        else:
            sys.exit(re+"[!] Invalid Mode Selected. Please Try Again.")
        client(InviteToChannelRequest(target_group_entity,[user_to_add]))
        print(gr+"[+] Waiting for 2-10 Seconds...")
        time.sleep(random.randrange(2, 10))
    except FloodWaitError:
        print(re+"[!] Getting Flood Error from telegram. \n[!] Script is stopping now. \n[!] Please try again after some time.")
    except UserPrivacyRestrictedError:
        print(re+"[!] The user's privacy settings do not allow you to do this. Skipping.")
    except:
        traceback.print_exc()
        print(re+"[!] Unexpected Error")
        continue
It works, but only partly: I can hardly add 1-10 users at a time, and it shows errors for some of the additions.
I have tried most things. The error message says it needs more time, but the timer doesn't seem to have any effect even when I increase it. Any suggestions or help?
Adding 1456428294
[!] Getting FloodWaitError from telegram.
[!] Script is stopping now.
[!] Please try again after some time.
FloodWaitError (420)
The same request was repeated too many times. You must wait the number of seconds given by the error's .seconds attribute before retrying. For example:
import time
from telethon import errors
try:
    # assumes this runs inside an async function with a connected client
    messages = await client.get_messages(chat)
    print(messages[0].text)
except errors.FloodWaitError as e:
    print('Have to sleep', e.seconds, 'seconds')
    time.sleep(e.seconds)
Read the documentation:
https://docs.telethon.dev/en/latest/concepts/errors.html
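Applied to the adder loop in the question, the same idea could look like the sketch below. It is only an illustration under assumptions: `invite_with_flood_wait` is a hypothetical helper, and `FakeFloodWait` is a stand-in for Telethon's FloodWaitError so that the snippet runs without a Telegram session. The only thing taken from the library is that the error exposes a `.seconds` attribute.

```python
import time


class FakeFloodWait(Exception):
    """Stand-in for telethon.errors.FloodWaitError (assumption: only .seconds is used)."""

    def __init__(self, seconds):
        super().__init__(f"A wait of {seconds} seconds is required")
        self.seconds = seconds


def invite_with_flood_wait(invite, user, flood_error=FakeFloodWait,
                           max_retries=3, sleep=time.sleep):
    """Call invite(user); on a flood error, sleep for e.seconds and retry.

    Returns True once the invite succeeds, False if it still fails after
    max_retries retries. All names here are hypothetical.
    """
    for attempt in range(max_retries + 1):
        try:
            invite(user)
            return True
        except flood_error as e:
            # Wait as long as the server asked for, unless we are out of retries.
            if attempt < max_retries:
                sleep(e.seconds)
    return False


# Demo with a fake invite function that is rate-limited on its first call.
calls = []
waits = []


def fake_invite(user):
    calls.append(user)
    if len(calls) == 1:
        raise FakeFloodWait(5)


ok = invite_with_flood_wait(fake_invite, "alice", sleep=waits.append)
print(ok, calls, waits)  # True ['alice', 'alice'] [5]
```

With Telethon you would presumably pass `flood_error=errors.FloodWaitError` and put the `client(InviteToChannelRequest(...))` call inside `invite`. The fixed `time.sleep(random.randrange(2, 10))` in the question is unrelated to the wait the server requests, which is why changing it appears to have no effect.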

Controlling parser rule alternatives precedence

I have an expression IF 1 THEN 2 ELSE 3 * 4. I want this parsed as IF 1 THEN 2 ELSE (3 * 4), however using my grammar (extract) below, it parses it as (IF 1 THEN 2 ELSE 3) * 4.
formula: expression EOF;
expression
: LPAREN expression RPAREN #parenthesisExp
| IF condition=expression THEN thenExpression=expression ELSE elseExpression=expression #ifExp
| left=expression BINARYOPERATOR right=expression #binaryoperationExp
| left=expression op=(TIMES|DIV) right=expression #muldivExp
| left=expression op=(PLUS|MINUS) right=expression #addsubtractExp
| left=expression op=(EQUALS|NOTEQUALS|LT|GT) right=expression #comparisonExp
| left=expression AMPERSAND right=expression #concatenateExp
| NOT expression #notExp
| STRINGLITERAL #stringliteralExp
| signedAtom #atomExp
;
My understanding is that because I have the ifExp alternative appearing before the muldivExp it should use that first, then because I have the muldivExp before atomExp (which handles numbers) it should do 3 * 4 to end the ELSE, rather than using just the 3. In which case I can't see why it's making the IF..THEN..ELSE a child of the multiplication.
I don't think the rest of the grammar is relevant here, but in case it is see below for the whole thing.
grammar AnaplanFormula;
formula: expression EOF;
expression
: LPAREN expression RPAREN #parenthesisExp
| IF condition=expression THEN thenExpression=expression ELSE elseExpression=expression #ifExp
| left=expression BINARYOPERATOR right=expression #binaryoperationExp
| left=expression op=(TIMES|DIV) right=expression #muldivExp
| left=expression op=(PLUS|MINUS) right=expression #addsubtractExp
| left=expression op=(EQUALS|NOTEQUALS|LT|GT) right=expression #comparisonExp
| left=expression AMPERSAND right=expression #concatenateExp
| NOT expression #notExp
| STRINGLITERAL #stringliteralExp
| signedAtom #atomExp
;
signedAtom
: PLUS signedAtom #plusSignedAtom
| MINUS signedAtom #minusSignedAtom
| func_ #funcAtom
| atom #atomAtom
;
atom
: SCIENTIFIC_NUMBER #numberAtom
| LPAREN expression RPAREN #expressionAtom // Do we need this?
| entity #entityAtom
;
func_: functionname LPAREN (expression (',' expression)*)? RPAREN #funcParameterised
| entity LSQUARE dimensionmapping (',' dimensionmapping)* RSQUARE #funcSquareBrackets
;
dimensionmapping: WORD COLON entity; // Could make WORD more specific here
functionname: WORD; // Could make WORD more specific here
entity: QUOTELITERAL #quotedEntity
| WORD+ #wordsEntity
| left=entity DOT right=entity #dotQualifiedEntity
;
WS: [ \r\n\t]+ -> skip;
/////////////////
// Fragments //
/////////////////
fragment NUMBER: DIGIT+ (DOT DIGIT+)?;
fragment DIGIT: [0-9];
fragment LOWERCASE: [a-z];
fragment UPPERCASE: [A-Z];
fragment WORDSYMBOL: [#?_£%];
//////////////////
// Tokens //
//////////////////
IF: 'IF' | 'if';
THEN: 'THEN' | 'then';
ELSE: 'ELSE' | 'else';
BINARYOPERATOR: 'AND' | 'and' | 'OR' | 'or';
NOT: 'NOT' | 'not';
WORD: (DIGIT* (LOWERCASE | UPPERCASE | WORDSYMBOL)) (LOWERCASE | UPPERCASE | DIGIT | WORDSYMBOL)*;
STRINGLITERAL: DOUBLEQUOTES (~'"' | ('""'))* DOUBLEQUOTES;
QUOTELITERAL: '\'' (~'\'' | ('\'\''))* '\'';
LSQUARE: '[';
RSQUARE: ']';
LPAREN: '(';
RPAREN: ')';
PLUS: '+';
MINUS: '-';
TIMES: '*';
DIV: '/';
COLON: ':';
EQUALS: '=';
NOTEQUALS: LT GT;
LT: '<';
GT: '>';
AMPERSAND: '&';
DOUBLEQUOTES: '"';
UNDERSCORE: '_';
QUESTIONMARK: '?';
HASH: '#';
POUND: '£';
PERCENT: '%';
DOT: '.';
PIPE: '|';
SCIENTIFIC_NUMBER: NUMBER (('e' | 'E') (PLUS | MINUS)? NUMBER)?;
Move your ifExp down near the end of your alternatives. (In particular, below any alternative that you would wish to match your elseExpression.)
Your “if ... then ... else ...” is below the muldivExp in the parse tree precisely because you've made it a higher priority. Items lower in the tree are evaluated before items higher in the tree, so higher priority items belong lower in the tree.
With:
expression:
LPAREN expression RPAREN # parenthesisExp
| left = expression BINARYOPERATOR right = expression # binaryoperationExp
| left = expression op = (TIMES | DIV) right = expression # muldivExp
| left = expression op = (PLUS | MINUS) right = expression # addsubtractExp
| left = expression op = (EQUALS | NOTEQUALS | LT | GT) right = expression # comparisonExp
| left = expression AMPERSAND right = expression # concatenateExp
| NOT expression # notExp
| STRINGLITERAL # stringliteralExp
| signedAtom # atomExp
| IF condition = expression THEN thenExpression = expression ELSE elseExpression = expression # ifExp
;
I get the parse you want: IF 1 THEN 2 ELSE (3 * 4).
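The effect of giving if-then-else the lowest priority can be illustrated with a small hand-written precedence-climbing parser. This is a sketch, not ANTLR-generated code: the token set, the `BINARY_PRECEDENCE` table and the tuple-based tree are assumptions made for the example. Because the ELSE branch is parsed with the lowest minimum precedence, it absorbs the `* 4` that follows.

```python
import re

# Binding power of each binary operator; higher binds tighter.
BINARY_PRECEDENCE = {"+": 1, "*": 2}


def tokenize(text):
    """Split the input into keywords, integers and operator symbols."""
    return re.findall(r"[A-Za-z]+|\d+|[+*()]", text)


def parse(text):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        pos += 1
        return tok

    def expression(min_prec=0):
        left = primary()
        # Absorb binary operators that bind at least as tightly as min_prec.
        while peek() in BINARY_PRECEDENCE and BINARY_PRECEDENCE[peek()] >= min_prec:
            op = take()
            right = expression(BINARY_PRECEDENCE[op] + 1)
            left = (op, left, right)
        return left

    def primary():
        tok = take()
        if tok == "IF":
            cond = expression()
            take("THEN")
            then_branch = expression()
            take("ELSE")
            # Lowest precedence: the else branch takes every operator that follows.
            else_branch = expression()
            return ("if", cond, then_branch, else_branch)
        if tok == "(":
            inner = expression()
            take(")")
            return inner
        return int(tok)

    tree = expression()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


print(parse("IF 1 THEN 2 ELSE 3 * 4"))    # ('if', 1, 2, ('*', 3, 4))
print(parse("(IF 1 THEN 2 ELSE 3) * 4"))  # ('*', ('if', 1, 2, 3), 4)
```

The second call shows that the other grouping is still reachable with explicit parentheses.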

How to split a column by using length split and MaxSplit in Pyspark dataframe?

For example, if I have a column as given below, obtained by loading and showing the CSV in PySpark:
+--------+
| Names|
+--------+
|Rahul |
|Ravi |
|Raghu |
|Romeo |
+--------+
If I specify the following in my function:
Length = 2
Maxsplit = 3
Then I have to get the results as
+----------+-----------+----------+
|Col_1 |Col_2 |Col_3 |
+----------+-----------+----------+
| Ra | hu | l |
| Ra | vi | Null |
| Ra | gh | u |
| Ro | me | o |
+----------+-----------+----------+
Similarly, in PySpark, with
Length = 3
Maxsplit = 2
it should give me output such as
+----------+-----------+
|Col_1 |Col_2 |
+----------+-----------+
| Rah | ul |
| Rav | i |
| Rag | hu |
| Rom | eo |
+----------+-----------+
This is how it should look like, Thank you
Another way to go about this. Should be faster than any looping or udf solution.
from pyspark.sql import functions as F
def split(df,length,maxsplit):
    return df.withColumn('Names',F.split("Names","(?<=\\G{})".format('.'*length)))\
             .select(*((F.col("Names")[x]).alias("Col_"+str(x+1)) for x in range(0,maxsplit)))
split(df,3,2).show()
#+-----+-----+
#|Col_1|Col_2|
#+-----+-----+
#| Rah| ul|
#| Rav| i|
#| Rag| hu|
#| Rom| eo|
#+-----+-----+
split(df,2,3).show()
#+-----+-----+-----+
#|col_1|col_2|col_3|
#+-----+-----+-----+
#| Ra| hu| l|
#| Ra| vi| |
#| Ra| gh| u|
#| Ro| me| o|
#+-----+-----+-----+
Try this:
import pyspark.sql.functions as F
tst = sqlContext.createDataFrame([("Raghu",1),("Ravi",2),("Rahul",3)],schema=["Name","val"])
def fn (split,max_n,tst):
    for i in range(max_n):
        tst_loop=tst.withColumn("coln"+str(i),F.substring(F.col("Name"),(i*split)+1,split))
        tst=tst_loop
    return(tst)
tst_res = fn(3,2,tst)
The for loop can also be replaced by a list comprehension or reduce, but I felt that in your case a for loop looked neater. They have the same physical plan anyway.
The results
+-----+---+-----+-----+
| Name|val|coln0|coln1|
+-----+---+-----+-----+
|Raghu| 1| Rag| hu|
| Ravi| 2| Rav| i|
|Rahul| 3| Rah| ul|
+-----+---+-----+-----+
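The offset arithmetic in that loop can be checked in plain Python without a Spark session. The sketch below is an assumption-laden illustration: `fixed_width_columns` is a hypothetical helper name, and it returns None for missing pieces to match the Null shown in the question, whereas the substring-based answers produce an empty string there.

```python
def fixed_width_columns(name, length, maxsplit):
    """Cut name into maxsplit pieces of `length` characters each.

    Mirrors substring(col, i*length + 1, length): Spark's substring is
    1-based, while Python slices are 0-based. Pieces that fall past the
    end of the string come back as None here.
    """
    pieces = []
    for i in range(maxsplit):
        start = i * length  # 0-based equivalent of (i*length)+1
        piece = name[start:start + length]
        pieces.append(piece if piece else None)
    return pieces


for name in ["Rahul", "Ravi", "Raghu", "Romeo"]:
    print(name, fixed_width_columns(name, 2, 3))
# Rahul ['Ra', 'hu', 'l']
# Ravi ['Ra', 'vi', None]
# Raghu ['Ra', 'gh', 'u']
# Romeo ['Ro', 'me', 'o']
```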
Try this:
from pyspark.sql import functions as f
def split(data,length,maxSplit):
    start=1
    for i in range(0,maxSplit):
        data = data.withColumn(f'col_{start}-{start+length-1}',f.substring('channel',start,length))
        start += length  # advance to the next chunk (start=length+1 only works for two columns)
    return data
df = split(data,3,2)
df.show()
+--------+----+-------+-------+
| channel|type|col_1-3|col_4-6|
+--------+----+-------+-------+
| web| 0| web| |
| web| 1| web| |
| web| 2| web| |
| twitter| 0| twi| tte|
| twitter| 1| twi| tte|
|facebook| 0| fac| ebo|
|facebook| 1| fac| ebo|
|facebook| 2| fac| ebo|
+--------+----+-------+-------+
Perhaps this is useful-
Load the test data
Note: written in scala
val Length = 2
val Maxsplit = 3
val df = Seq("Rahul", "Ravi", "Raghu", "Romeo").toDF("Names")
df.show(false)
/**
* +-----+
* |Names|
* +-----+
* |Rahul|
* |Ravi |
* |Raghu|
* |Romeo|
* +-----+
*/
split the string col as per the length and offset
val schema = StructType(Range(1, Maxsplit + 1).map(f => StructField(s"Col_$f", StringType)))
val split = udf((str:String, length: Int, maxSplit: Int) =>{
val splits = str.toCharArray.grouped(length).map(_.mkString).toArray
RowFactory.create(splits ++ Array.fill(maxSplit-splits.length)(null): _*)
}, schema)
val p = df
.withColumn("x", split($"Names", lit(Length), lit(Maxsplit)))
.selectExpr("x.*")
p.show(false)
p.printSchema()
/**
* +-----+-----+-----+
* |Col_1|Col_2|Col_3|
* +-----+-----+-----+
* |Ra |hu |l |
* |Ra |vi |null |
* |Ra |gh |u |
* |Ro |me |o |
* +-----+-----+-----+
*
* root
* |-- Col_1: string (nullable = true)
* |-- Col_2: string (nullable = true)
* |-- Col_3: string (nullable = true)
*/
Dataset[Row] -> Dataset[Array[String]]
val x = df.map(r => {
val splits = r.getString(0).toCharArray.grouped(Length).map(_.mkString).toArray
splits ++ Array.fill(Maxsplit-splits.length)(null)
})
x.show(false)
x.printSchema()
/**
* +-----------+
* |value |
* +-----------+
* |[Ra, hu, l]|
* |[Ra, vi,] |
* |[Ra, gh, u]|
* |[Ro, me, o]|
* +-----------+
*
* root
* |-- value: array (nullable = true)
* | |-- element: string (containsNull = true)
*/

ANTLR4 Grammar Performance Very Poor

Given the grammar below, I'm seeing very poor performance when parsing longer strings, on the order of seconds. (this on both Python and Go implementations) Is there something in this grammar that is causing that?
Example output:
0.000061s LEXING "hello world"
0.014349s PARSING "hello world"
0.000052s LEXING 5 + 10
0.015384s PARSING 5 + 10
0.000061s LEXING FIRST_WORD(WORD_SLICE(contact.blerg, 2, 4))
0.634113s PARSING FIRST_WORD(WORD_SLICE(contact.blerg, 2, 4))
0.000095s LEXING (DATEDIF(DATEVALUE("01-01-1970"), date.now, "D") * 24 * 60 * 60) + ((((HOUR(date.now)+7) * 60) + MINUTE(date.now)) * 60))
1.552758s PARSING (DATEDIF(DATEVALUE("01-01-1970"), date.now, "D") * 24 * 60 * 60) + ((((HOUR(date.now)+7) * 60) + MINUTE(date.now)) * 60))
This is on Python.. though I don't expect blazing performance I would expect sub-second for any input. What am I doing wrong?
grammar Excellent;
parse
: expr EOF
;
expr
: atom # expAtom
| concatenationExpr # expConcatenation
| equalityExpr # expEquality
| comparisonExpr # expComparison
| additionExpr # expAddition
| multiplicationExpr # expMultiplication
| exponentExpr # expExponent
| unaryExpr # expUnary
;
path
: NAME (step)*
;
step
: LBRAC expr RBRAC
| PATHSEP NAME
| PATHSEP NUMBER
;
parameters
: expr (COMMA expr)* # functionParameters
;
concatenationExpr
: atom (AMP concatenationExpr)? # concatenation
;
equalityExpr
: comparisonExpr op=(EQ|NE) comparisonExpr # equality
;
comparisonExpr
: additionExpr (op=(LT|GT|LTE|GTE) additionExpr)? # comparison
;
additionExpr
: multiplicationExpr (op=(ADD|SUB) multiplicationExpr)* # addition
;
multiplicationExpr
: exponentExpr (op=(MUL|DIV) exponentExpr)* # multiplication
;
exponentExpr
: unaryExpr (EXP exponentExpr)? # exponent
;
unaryExpr
: SUB? atom # negation
;
funcCall
: function=NAME LPAR parameters? RPAR # functionCall
;
funcPath
: function=funcCall (step)* # functionPath
;
atom
: path # contextReference
| funcCall # atomFuncCall
| funcPath # atomFuncPath
| LITERAL # stringLiteral
| NUMBER # decimalLiteral
| LPAR expr RPAR # parentheses
| TRUE # true
| FALSE # false
;
NUMBER
: DIGITS ('.' DIGITS?)?
;
fragment
DIGITS
: ('0'..'9')+
;
TRUE
: [Tt][Rr][Uu][Ee]
;
FALSE
: [Ff][Aa][Ll][Ss][Ee]
;
PATHSEP
:'.';
LPAR
:'(';
RPAR
:')';
LBRAC
:'[';
RBRAC
:']';
SUB
:'-';
ADD
:'+';
MUL
:'*';
DIV
:'/';
COMMA
:',';
LT
:'<';
GT
:'>';
EQ
:'=';
NE
:'!=';
LTE
:'<=';
GTE
:'>=';
QUOT
:'"';
EXP
: '^';
AMP
: '&';
LITERAL
: '"' ~'"'* '"'
;
Whitespace
: (' '|'\t'|'\n'|'\r')+ ->skip
;
NAME
: NAME_START_CHARS NAME_CHARS*
;
fragment
NAME_START_CHARS
: 'A'..'Z'
| '_'
| 'a'..'z'
| '\u00C0'..'\u00D6'
| '\u00D8'..'\u00F6'
| '\u00F8'..'\u02FF'
| '\u0370'..'\u037D'
| '\u037F'..'\u1FFF'
| '\u200C'..'\u200D'
| '\u2070'..'\u218F'
| '\u2C00'..'\u2FEF'
| '\u3001'..'\uD7FF'
| '\uF900'..'\uFDCF'
| '\uFDF0'..'\uFFFD'
;
fragment
NAME_CHARS
: NAME_START_CHARS
| '0'..'9'
| '\u00B7' | '\u0300'..'\u036F'
| '\u203F'..'\u2040'
;
ERRROR_CHAR
: .
;
You can always try to parse with SLL(*) first and only if that fails you need to parse it with LL(*) (which is the default).
See this ticket on ANTLR's GitHub for further explanation, and here is an implementation that uses this strategy.
This method will save you (a lot of) time when parsing syntactically correct input.
It seems this poor performance is due to how the addition, multiplication, etc. operators were split into chained rules in the original grammar. Rewriting them as left-recursive binary alternatives of a single expression rule yields performance that is instant (see below).
grammar Excellent;
COMMA : ',';
LPAREN : '(';
RPAREN : ')';
LBRACK : '[';
RBRACK : ']';
DOT : '.';
PLUS : '+';
MINUS : '-';
TIMES : '*';
DIVIDE : '/';
EXPONENT : '^';
EQ : '=';
NEQ : '!=';
LTE : '<=';
LT : '<';
GTE : '>=';
GT : '>';
AMPERSAND : '&';
DECIMAL : [0-9]+('.'[0-9]+)?;
STRING : '"' (~["] | '""')* '"';
TRUE : [Tt][Rr][Uu][Ee];
FALSE : [Ff][Aa][Ll][Ss][Ee];
NAME : [a-zA-Z][a-zA-Z0-9_.]*; // variable names, e.g. contact.name or function names, e.g. SUM
WS : [ \t\n\r]+ -> skip; // ignore whitespace
ERROR : . ;
parse : expression EOF;
atom : fnname LPAREN parameters? RPAREN # functionCall
| atom DOT atom # dotLookup
| atom LBRACK expression RBRACK # arrayLookup
| NAME # contextReference
| STRING # stringLiteral
| DECIMAL # decimalLiteral
| TRUE # true
| FALSE # false
;
expression : atom # atomReference
| MINUS expression # negation
| expression EXPONENT expression # exponentExpression
| expression (TIMES | DIVIDE) expression # multiplicationOrDivisionExpression
| expression (PLUS | MINUS) expression # additionOrSubtractionExpression
| expression (LTE | LT | GTE | GT) expression # comparisonExpression
| expression (EQ | NEQ) expression # equalityExpression
| expression AMPERSAND expression # concatenation
| LPAREN expression RPAREN # parentheses
;
fnname : NAME
| TRUE
| FALSE
;
parameters : expression (COMMA expression)* # functionParameters
;