Grammar regexes match independently but not together

This is my attempt to solve the weekly challenge, "implement brace expansion".
I wrote the grammar below, which should work, but doesn't.
grammar BraceExpansion
{
regex TOP { <start-txt> <to-expand> <end-txt> }
regex start-txt { <save-char>* }
regex end-txt { <save-char>* }
token save-char { <-[ \" \& \( \) \` \' \; \< \> \| ]> }
token list-element { <-[ \" \! \$ \& \( \) \` \' \; \< \> \| ]> }
token alphanum { <[ a..z A..Z 0..9 ]> }
token alpha { <[ a..z A..Z ]> }
regex num { \-? <[ 0..9 ]>+ }
regex to-expand { <range> | <list> }
regex range { <alpha-range> | <num-range> }
regex num-range { \{ <num> \. \. <num> [ \. \. <num> ]? \} }
regex alpha-range { \{ <alpha> \. \. <alpha> [ \. \.<num> ]? \} }
regex list { \{ <list-element>+ % ',' \} }
}
say brace-expand( 'A{1..3}B{a..g..3}C{1,2}D' );
sub num-range( $match )
{
say "-NUM-";
my @num = |$match<range><num-range><num>.list>>.Int;
my @range = @num[0] ... @num[1];
my $steps = ( @num[2] // 1 ).abs;
@range.batch( $steps )>>.[0];
}
sub alpha-range( $match )
{
say "-ALPHA-";
my @num = |$match<range><alpha-range><alpha>.list>>.Str;
my @range = @num[0] ... @num[1];
my $steps = ( $match<range><alpha-range><num> // 1 ).abs;
@range.batch($steps)>>.[0];
}
sub list( $match )
{
say "-LIST-";
$match<list><list-element>.list>>.Str;
}
sub brace-expand( $str )
{
say "brace-expand( $str )";
my $match = BraceExpansion.parse( $str );
my @alternatives =
$match<range><num-range> ?? num-range( $match ) !!
$match<range><alpha-range> ?? alpha-range( $match ) !!
$match<list> ?? list( $match ) !!
();
say "A", #alternatives;
return $str
unless @alternatives;
@alternatives
.map( -> $element { $match<start-txt>.Str ~ $element ~ $match<end-txt>.Str } )
.map( -> $result { brace-expand( $result ) } )
;
}
However, if I change the TOP rule to
regex TOP { <start-txt> <range> <end-txt> }
or
regex TOP { <start-txt> <list> <end-txt> }
the range and the list tokens work independently and I get the output I expect. But when I use the to-expand rule, the whole grammar doesn't match and I cannot figure out why. Is the alternation wrong? But if so, why then does <alpha-range> | <num-range> work?
.... (time passes) ....
In the meantime I found out that
regex TOP { <start-txt> [ <range> | <list> ] <end-txt> }
does what I want. But my question still stands, why doesn't it work as above?

The match is happening fine; you are not indexing into the match correctly. The structure of the resulting match is:
start-txt => 「A{1..3}B{a..g..3}C」
save-char => 「A」
...
to-expand => 「{1,2}」
list => 「{1,2}」 # (or range)
...
end-txt => 「D」
save-char => 「D」
...
But you are indexing into this like $match<list>. You should first index into the <to-expand>, as in $match<to-expand><list>. Try it online!
The reason the modified version works is that it removes the intermediate to-expand step, so your indexing lines up with the match structure.
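For example, keeping the original TOP rule, you could route everything through the <to-expand> submatch and pass that down to the helper subs. A sketch of just the changed part of brace-expand; the helpers are assumed to stay as they are, since their own lookups (<range><num-range><num>, <list><list-element>, and so on) start below to-expand:
my $expand = $match<to-expand>;
my @alternatives =
    $expand<range><num-range>   ?? num-range(   $expand ) !!
    $expand<range><alpha-range> ?? alpha-range( $expand ) !!
    $expand<list>               ?? list(        $expand ) !!
    ();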

Related

What is the difference between using Raku's Code.assuming method and using an anonymous Block or Sub?

The Raku docs say that Code.assuming
Returns a Callable that implements the same behavior as the original, but has the values passed to .assuming already bound to the corresponding parameters.
What is the difference between using .assuming and wrapping the Code in an anonymous Block (or Sub) that calls the inner function with some parameters already bound?
For instance, in the code below, what is the difference between &surname-public (an example the docs provide for .assuming) and &surname-block;
sub longer-names ( $first, $middle, $last, $suffix ) {
say "Name is $first $middle $last $suffix";
}
my &surname-public = &longer-names.assuming( *, *, 'Public', * );
my &surname-block = -> $a,$b,$c { longer-names($a, $b, 'Public', $c) }
surname-public( 'Joe', 'Q.', 'Jr.'); # OUTPUT: «Name is Joe Q. Public Jr.»
surname-block( 'Joe', 'Q.', 'Jr.'); # OUTPUT: «Name is Joe Q. Public Jr.»
I see that .assuming saves a bit of length and could, in some contexts, be a bit clearer. But I strongly suspect that I'm missing some other difference.
There really isn't a difference.
While the code to implement .assuming() is almost 300 lines, the important bit is only about ten lines of code.
$f = EVAL sprintf(
'{ my $res = (my proto __PRIMED_ANON (%s) { {*} });
my multi __PRIMED_ANON (|%s(%s)) {
my %%chash := %s.hash;
$self(%s%s |{ %%ahash, %%chash });
};
$res }()',
$primed_sig, $capwrap, $primed_sig, $capwrap,
(flat @clist).join(", "),
(@clist ?? ',' !! '')
);
The rest of the code in .assuming is mostly about pulling information out of the signatures.
Let's take your code and insert it into that sprintf.
(Not exactly the same, but close enough for our purposes.)
{
my $res = (
# $primed_sig v----------------------v
my proto __PRIMED_ANON ($first, $middle, $suffix) { {*} }
);
# $capwrap vv
# $primed_sig v----------------------v
my multi __PRIMED_ANON (|__ ($first, $middle, $suffix)) {
# $capwrap vv
my %chash := __.hash;
# v---------------------------v @clist
$self(__[0], __[1], 'Public', __[2], |{ %ahash, %chash });
};
# return the proto
$res
}()
If we simplify it, and tailor it to your code
my &surname-public = {
my $res = (
my proto __PRIMED_ANON ($first, $middle, $suffix) { {*} }
);
my multi __PRIMED_ANON ( $first, $middle, $suffix ) {
longer-names( $first, $middle, 'Public', $suffix )
};
$res
}()
We can simplify it further by just using a pointy block.
my &surname-public = -> $first, $middle, $suffix {
longer-names( $first, $middle, 'Public', $suffix )
};
Also by just using single letter parameter names.
my &surname-public = -> $a,$b,$c { longer-names($a, $b, 'Public', $c) }
Like I said, there really isn't a difference.
In the future, it may be more beneficial to use .assuming(), after it gets rewritten to use RakuAST.

Grammar to parse module specification

Raku modules can be specified in different ways, for example:
MyModule
MyModule:ver<1.0.3>
MyModule:ver<1.0.3>:auth<Name (email@example.com)>;
MyModule:ver<1.0.3>:auth<Name <email@example.com>>;
I wrote the grammar below to parse the module spec. It works fine for most specs, but it fails if the auth field contains < or >. How can I fix the grammar to match in this case as well?
I can't figure out how to express "match everything between < and >, including any nested < and >".
#!/usr/bin/env perl6
grammar Spec {
token TOP { <spec> }
token spec { <name> <keyval>* }
token name { [<-[./:<>()\h]>+]+ % '::' }
token keyval { ':' <key> <value> }
proto token key { * }
token key:sym<ver> { <sym> }
token key:sym<version> { <sym> }
token key:sym<auth> { <sym> }
token key:sym<api> { <sym> }
token key:sym<from> { <sym> }
# BUG: fix specs that contains '<>' inside value;
token value { '<' ~ '>' $<val>=<-[<>]>* | '(' ~ ')' $<val>=<-[()]>* }
}
my \tests = (
'MyModule:ver<1.0.3>:auth<Name (email@example.com)>',
'MyModule:ver<1.0.3>:auth<Name <email@example.com>>',
);
for tests -> \spec {
say so Spec.parse: spec;
}
# Output:
True
False
If you know that the inner field will basically be in the same format as the value token, you can recursively match for value with $<val>=[.*? <value>?]. This even lets you capture the contents of the inner field separately:
token value { '<' ~ '>' $<val>=[.*? <value>?] | '(' ~ ')' $<val>=<-[()]>* }
If you don't want the inner contents, then you can use the recursive <~~> in place of <value>:
token value { '<' ~ '>' $<val>=[.*? <~~>?] | '(' ~ ')' $<val>=<-[()]>* }
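With either recursive version of the value token in place, re-running the original tests should report a successful parse for both specs (a quick check, assuming the rest of the grammar stays the same):
my \tests = (
    'MyModule:ver<1.0.3>:auth<Name (email@example.com)>',
    'MyModule:ver<1.0.3>:auth<Name <email@example.com>>',
);
for tests -> \spec {
    say so Spec.parse: spec;   # expected: True for both
}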

error: cannot infer an appropriate lifetime for autoref due to conflicting requirements [E0495]

First of all: I am fully aware of this post: Cannot infer appropriate lifetime for autoref in Iterator impl, and that the problem is probably similar to mine.
However, I can't get it working with the knowledge from that thread.
The code:
use std::str::Chars;
use super::token::*;
use super::token_stream::TokenStream;
pub struct Lexer<'a> {
input: Chars<'a>,
buffer: String,
cur_char: char
}
impl<'a> Lexer<'a> {
pub fn new(iterator: Chars<'a>) -> Lexer {
let mut lexer = Lexer {
input: iterator,
buffer: String::new(),
cur_char: '\0' };
lexer.consume_next();
lexer
}
pub fn new_from_str(content : &str) -> Lexer {
Lexer::new(content.chars())
}
fn consume_next(&mut self) -> char {
let next = self.input.next();
if let Some(c) = next {
self.buffer.push(c);
self.cur_char = c;
}
else {
self.cur_char = '\0';
}
self.current_char()
}
fn clear_buffer(&mut self) {
self.buffer.clear();
}
fn current_char(&self) -> char {
self.cur_char
}
fn scan_line_comment(&self) -> Token { Token::EndOfFile }
fn scan_multi_line_comment(&self) -> Token { Token::EndOfFile }
fn scan_identifier(&self) -> Token { Token::EndOfFile }
fn scan_char_literal(&self) -> Token { Token::EndOfFile }
fn scan_string_literal(&self) -> Token { Token::EndOfFile }
fn scan_number_literal(&self) -> Token { Token::EndOfFile }
fn consume_and_return<'b>(&mut self, token: Token<'b>) -> Token<'b> {
self.consume_next();
token
}
}
impl<'a> TokenStream for Lexer<'a> {
fn next_token(&mut self) -> Token {
match self.current_char() {
/* Skip whitespace */
' ' |
'\r' |
'\n' |
'\t' => {
self.clear_buffer();
self.consume_and_return(Token::Whitespace)
}
/* Opening delimiters */
'(' => self.consume_and_return(Token::OpenDelim(DelimitToken::Paren)),
'[' => self.consume_and_return(Token::OpenDelim(DelimitToken::Bracket)),
'{' => self.consume_and_return(Token::OpenDelim(DelimitToken::Brace)),
/* Opening delimiters */
')' => self.consume_and_return(Token::CloseDelim(DelimitToken::Paren)),
']' => self.consume_and_return(Token::CloseDelim(DelimitToken::Bracket)),
'}' => self.consume_and_return(Token::CloseDelim(DelimitToken::Brace)),
/* Special tokens which aren't the beginning
of any other token */
'?' => self.consume_and_return(Token::Question),
';' => self.consume_and_return(Token::SemiColon),
',' => self.consume_and_return(Token::Comma),
/* Dot, DotDot and DotDotDot tokens */
'.' => match self.consume_next() {
'.' => match self.consume_next() {
'.' => self.consume_and_return(Token::DotDotDot),
_ => Token::DotDot
},
_ => Token::Dot
},
/* Tokens starting with '+' */
'+' => match self.consume_next() {
'=' => self.consume_and_return(Token::BinOpEq(BinOpToken::Plus)),
_ => Token::BinOp(BinOpToken::Plus)
},
/* Tokens starting with '-' */
'-' => match self.consume_next() {
'=' => self.consume_and_return(Token::BinOpEq(BinOpToken::Minus)),
'>' => self.consume_and_return(Token::Arrow),
_ => Token::BinOp(BinOpToken::Minus)
},
/* Tokens starting with '*' */
'*' => match self.consume_next() {
'=' => self.consume_and_return(Token::BinOpEq(BinOpToken::Star)),
_ => return Token::BinOp(BinOpToken::Star)
},
/* Tokens starting with '/' */
'/' => match self.consume_next() {
'=' => self.consume_and_return(Token::BinOpEq(BinOpToken::Slash)),
'/' => self.scan_line_comment(),
'*' => self.scan_multi_line_comment(),
_ => Token::BinOp(BinOpToken::Slash)
},
/* Tokens starting with '%' */
'%' => match self.consume_next() {
'=' => self.consume_and_return(Token::BinOpEq(BinOpToken::Percent)),
_ => Token::BinOp(BinOpToken::Percent)
},
/* Tokens starting with '^' */
'^' => match self.consume_next() {
'=' => self.consume_and_return(Token::BinOpEq(BinOpToken::Caret)),
_ => return Token::BinOp(BinOpToken::Caret)
},
/* Tokens starting with '!' */
'!' => match self.consume_next() {
'=' => self.consume_and_return(Token::RelOp(RelOpToken::NotEq)),
_ => Token::Exclamation
},
/* Tokens starting with '=' */
'=' => match self.consume_next() {
'=' => self.consume_and_return(Token::RelOp(RelOpToken::EqEq)),
_ => Token::Eq
},
/* Tokens starting with '&' */
'&' => match self.consume_next() {
'&' => self.consume_and_return(Token::LogicalOp(LogicalOpToken::AndAnd)),
'=' => self.consume_and_return(Token::BinOpEq(BinOpToken::And)),
_ => Token::BinOp(BinOpToken::And)
},
/* Tokens starting with '|' */
'|' => match self.consume_next() {
'|' => self.consume_and_return(Token::LogicalOp(LogicalOpToken::OrOr)),
'=' => self.consume_and_return(Token::BinOpEq(BinOpToken::Or)),
_ => Token::BinOp(BinOpToken::Or)
},
/* Tokens starting with '<' */
'<' => match self.consume_next() {
'<' => match self.consume_next() {
'=' => self.consume_and_return(Token::BinOpEq(BinOpToken::Shl)),
_ => Token::BinOp(BinOpToken::Shl)
},
'=' => self.consume_and_return(Token::RelOp(RelOpToken::LessEq)),
_ => Token::RelOp(RelOpToken::LessThan)
},
/* Tokens starting with '>' */
'>' => match self.consume_next() {
'>' => match self.consume_next() {
'=' => self.consume_and_return(Token::BinOpEq(BinOpToken::Shr)),
_ => Token::BinOp(BinOpToken::Shr)
},
'=' => self.consume_and_return(Token::RelOp(RelOpToken::GreaterEq)),
_ => Token::RelOp(RelOpToken::GreaterThan)
},
/* Char and string literals */
'\'' => self.scan_char_literal(),
'\"' => self.scan_string_literal(),
/* Integer- and float literals and identifiers */
'0' ... '9' => self.scan_number_literal(),
'a' ... 'z' |
'A' ... 'Z' => self.scan_identifier(),
/* When end of iterator has been reached */
_ => Token::EndOfFile
}
}
}
impl<'a> Iterator for Lexer<'a> {
type Item = Token<'a>;
fn next(&mut self) -> Option<Self::Item> {
let token = self.next_token();
match token {
Token::EndOfFile => None,
_ => Some(token)
}
}
}
#[cfg(test)]
mod tests {
use super::*;
use super::super::token::*;
use super::super::token_stream::TokenStream;
#[test]
fn simple_tokens() {
let solution = [
Token::OpenDelim(DelimitToken::Paren),
Token::CloseDelim(DelimitToken::Paren),
Token::OpenDelim(DelimitToken::Bracket),
Token::CloseDelim(DelimitToken::Bracket),
Token::OpenDelim(DelimitToken::Brace),
Token::CloseDelim(DelimitToken::Brace),
Token::Question,
Token::SemiColon,
Token::Comma,
Token::EndOfFile
];
let mut lexer = Lexer::new_from_str("()[]{}?;,");
for expected in &solution {
assert_eq!(lexer.next_token(), *expected);
}
}
}
Playground
And its dependent module 'Token':
#[derive(Clone, PartialEq, Eq, Hash, Debug, Copy)]
pub enum BinOpToken {
Plus, // +
Minus, // -
Star, // *
Slash, // /
Percent, // %
Caret, // ^
And, // &
Or, // |
Shl, // <<
Shr // >>
}
#[derive(Clone, PartialEq, Eq, Hash, Debug, Copy)]
pub enum RelOpToken {
EqEq, // ==
NotEq, // !=
LessThan, // <
LessEq, // <=
GreaterThan, // >
GreaterEq // >=
}
#[derive(Clone, PartialEq, Eq, Hash, Debug, Copy)]
pub enum LogicalOpToken {
AndAnd, // &&
OrOr // ||
}
#[derive(Clone, PartialEq, Eq, Hash, Debug, Copy)]
pub enum DelimitToken {
Paren, // ( or )
Bracket, // [ or ]
Brace, // { or }
}
#[derive(Clone, PartialEq, Eq, Hash, Debug, Copy)]
pub enum LiteralToken<'a> {
Char(&'a str), // e.g. 'a'
Integer(&'a str), // e.g. 5, 42, 1337, 0
Float(&'a str), // e.g. 0.1, 5.0, 13.37, 0.0
String(&'a str) // e.g. "Hello, World!"
}
#[derive(Clone, PartialEq, Eq, Hash, Debug, Copy)]
pub enum Token<'a> {
/* Logical operators, e.g. && or || */
LogicalOp(LogicalOpToken),
/* Binary operators compatible with assignment, e.g. +, - */
BinOp(BinOpToken),
/* Binary assignment operators, e.g. +=, -= */
BinOpEq(BinOpToken),
/* Relational operators, e.g. <, <=, >, >=, ==, != */
RelOp(RelOpToken),
/* An opening delimiter, e.g. { or ( or [ */
OpenDelim(DelimitToken),
/* A closing delimiter, e.g. } or ) or ] */
CloseDelim(DelimitToken),
/* Identifiers with their given name */
Identifier(&'a str),
/* Literal token, e.g. an integer, float or string literal */
Literal(LiteralToken<'a>),
/* Special tokens */
Eq, // =
Colon, // :
SemiColon, // ;
ColonColon, // ::
Dot, // .
DotDot, // ..
DotDotDot, // ...
Comma, // ,
Exclamation, // !
Question, // ?
Arrow, // ->
FatArrow, // =>
/* Junk tokens which the parser doesn't require in order to parse the program. */
Whitespace,
Comment,
/* End of file (EOF) token indicating the end of stream for parsing */
EndOfFile
}
Playground
As well as the trait 'TokenStream':
pub use super::token::Token;
pub trait TokenStream {
fn next_token(&mut self) -> Token;
}
I am getting the following error:
src/parser/lexer.rs:202:20: 202:32 error: cannot infer an appropriate lifetime for autoref due to conflicting requirements [E0495]
src/parser/lexer.rs:202 let token = self.next_token();
^~~~~~~~~~~~
I guess that it is a lifetime problem. My next_token() method returns a Token whose lifetime is independent of Self; however, I am not sure if I did the annotation right.
I also tried to do some more annotation for the next() method in Iterator but it all failed ...
I get this error when I add a lifetime to the &mut self parameter of the next() method in the implementation of the Iterator trait:
src/parser/lexer.rs:201:2: 207:3 error: method `next` has an incompatible type for trait:
expected bound lifetime parameter ,
found concrete lifetime [E0053]
I found a solution to my problems and now everything compiles fine.
The problem was in fact a lifetime problem but not only within the TokenStream trait. I had lifetime issues in several places across the entire code.
Some notable places from the long code in the initial post:
lexer.rs: line 46 - 58
fn scan_line_comment<'b>(&self) -> Token<'b> { Token::EndOfFile }
fn scan_multi_line_comment<'b>(&self) -> Token<'b> { Token::EndOfFile }
fn scan_identifier<'b>(&self) -> Token<'b> { Token::EndOfFile }
fn scan_char_literal<'b>(&self) -> Token<'b> { Token::EndOfFile }
fn scan_string_literal<'b>(&self) -> Token<'b> { Token::EndOfFile }
fn scan_number_literal<'b>(&self) -> Token<'b> { Token::EndOfFile }
fn consume_and_return<'b>(&mut self, token: Token<'b>) -> Token<'b> {
self.consume_next();
token
}
I had to insert the lifetime 'b to specify that the Token may outlive the Lexer instance.
The TokenStream required a new lifetime parameter so that it can specify that extended lifetime as well:
pub trait TokenStream<'a> {
fn next_token(&mut self) -> Token<'a>;
}
The TokenStream implementation for Lexer had to be adjusted for this change:
impl<'a, 'b> TokenStream<'b> for Lexer<'a> {
fn next_token(&mut self) -> Token<'b> {
...
}
...
}
As well as the Iterator implementation for Lexer
impl<'a> Iterator for Lexer<'a> {
type Item = Token<'a>;
fn next(&mut self) -> Option<Self::Item> {
let token = self.next_token();
match token {
Token::EndOfFile => None,
_ => Some(token)
}
}
}
That's it!

Generate JSON from nested sets (perl, sql, jquery)

I have content pages in the database (using nested sets) and I need to show them with the jQuery jsTree plugin. It needs to return JSON with data like this:
[
{
data: 'node1Title',
children: [
{
data: 'subNode1Title',
children: [...]
},
{
data: 'subNode2Title',
children: [...]
}
]
},
{
data: 'node2Title',
children: [...]
}
]
What do I need to do this? I can transform an array of hashes to JSON, but I don't understand how to generate the nested array.
Sample data:
'pages' table
id parent_id level lkey rkey name
1 0 1 1 14 index
2 1 2 2 7 info
3 1 2 8 13 test
4 2 3 3 4 about
5 2 3 5 6 help
6 3 3 9 10 test1
7 3 3 11 12 test2
I need to get:
[
{
data: 'index',
children: [
{
data: 'info',
children: [
{
data: 'about'
},
{
data: 'help',
}
]
},
{
data: 'test',
children: [
{
data: 'test1'
},
{
data: 'test2'
}
]
}
]
}
]
I had exactly the same problem and here is what I wrote in Perl to convert my nested set tree into a JSON object for jsTree plugin (I'm using DBIx::Tree::NestedSet to access the MySQL database tree). I know my code is ugly from a Perl perspective, but it works for me.
sub get_json_tree {
my $json = '[';
my $first = 1;
my $last_level = 1;
my $level = 1;
my $tree = DBIx::Tree::NestedSet->new(dbh => $dbh);
my $ancestors = $tree->get_self_and_children_flat(id => $tree->get_root);
foreach (@{$ancestors}) {
my $name = $_->{'name'};
$last_level = $level;
$level = $_->{'level'};
if ($level > $last_level) {
$json .= ',' if ($json =~ /}$/);
} elsif ($level < $last_level) {
$json .= ']}';
for (my $i = 0; $i < $last_level - $level; $i++) {
$json .= ']}';
}
$json .= ',';
} elsif ($level == $last_level && !$first) {
$json .= ']},';
}
$json .= '{"attr":{"id":'.$_->{'id'}.',"rel":"folder"},"data":"'.$name.'","children":[';
$first = 0;
}
$json .= ']}';
for (my $i = 1; $i < $level; $i++) {
$json .= ']}';
}
$json .= ']';
return $json;
}
I was looking for this too. Perhaps the DataTables plugin examples offer a solution; look at /examples/server_side/scripts/ssp.class.php in the plugin directory. You can download it here.
For the simplest way of using it, take a look at the "Server-side script" section of the documentation.
This is very simple: you need to write a recursive function. I wrote it in Perl. $list is your array sorted by 'left_key'.
sub make_tree {
my $list = shift;
my @nodes;
while (my $node = shift @$list) {
if (@$list and $node->{level} < $list->[0]{level}) {
$node->{data} = make_tree($list);
push @nodes, $node;
}
last if @$list and $node->{level} > $list->[0]{level};
}
return \@nodes;
}
my $hash = make_tree($list);
Recently I was looking for a similar solution; I didn't find this question until after posting my own. I think the final code I posted on my question answers yours nicely.
I am using the following code with a modified version of DBIx::Tree::NestedSet. I use this code to create a JSON output of the nested sets tree.
Convert a flat datastructure into a tree
sub get_jsonTree {
my ($array_of_hashes_ref) = @_;
my $roots;
my %recs_by_name;
my %children_by_parent_name;
my %count;
for my $row (@$array_of_hashes_ref) {
my $name = $row->{position_id};
my $parent_name = $row->{placement_id};
my $rec = {
name => $name,
};
## Added to loop through all key,value pairs and add them to $rec
while ( my ($key, $value) = each(%$row) ) {
$rec->{$key} = $value;
}
##Added To Count Child Nodes
$count{$parent_name} = 0 if (!$count{$parent_name});
$rec->{'child_count'} = $count{$parent_name};
$count{$parent_name}++;
push @{ $children_by_parent_name{$parent_name // 'root'} }, $rec;
$recs_by_name{$name} = $rec;
}
$roots = delete($children_by_parent_name{root}) || [];
for my $name (keys(%children_by_parent_name)) {
my $children = $children_by_parent_name{$name};
if ( my $rec = $recs_by_name{$name} ) {
$rec->{children} = $children;
} else {
$util{'test'} .= "Parent $name doesn't exist.\n<BR>";
push @$roots, @$children;
}
}
use JSON;
my $json_str = encode_json(@{$roots}[0]);
return $json_str;
}
my $array_of_hashes_ref = [
{ position_id => 123, placement_id => undef },
{ position_id => 456, placement_id => 123 },
{ position_id => 789, placement_id => 123 },
# ...
];
my $json_str = &get_jsonTree($array_of_hashes_ref);

Parsing Data in Silverlight [duplicate]

Where could I find some JavaScript code to parse CSV data?
You can use the CSVToArray() function mentioned in this blog entry.
<script type="text/javascript">
// ref: http://stackoverflow.com/a/1293163/2343
// This will parse a delimited string into an array of
// arrays. The default delimiter is the comma, but this
// can be overridden in the second argument.
function CSVToArray( strData, strDelimiter ){
// Check to see if the delimiter is defined. If not,
// then default to comma.
strDelimiter = (strDelimiter || ",");
// Create a regular expression to parse the CSV values.
var objPattern = new RegExp(
(
// Delimiters.
"(\\" + strDelimiter + "|\\r?\\n|\\r|^)" +
// Quoted fields.
"(?:\"([^\"]*(?:\"\"[^\"]*)*)\"|" +
// Standard fields.
"([^\"\\" + strDelimiter + "\\r\\n]*))"
),
"gi"
);
// Create an array to hold our data. Give the array
// a default empty first row.
var arrData = [[]];
// Create an array to hold our individual pattern
// matching groups.
var arrMatches = null;
// Keep looping over the regular expression matches
// until we can no longer find a match.
while (arrMatches = objPattern.exec( strData )){
// Get the delimiter that was found.
var strMatchedDelimiter = arrMatches[ 1 ];
// Check to see if the given delimiter has a length
// (is not the start of string) and if it matches
// field delimiter. If it does not, then we know
// that this delimiter is a row delimiter.
if (
strMatchedDelimiter.length &&
strMatchedDelimiter !== strDelimiter
){
// Since we have reached a new row of data,
// add an empty row to our data array.
arrData.push( [] );
}
var strMatchedValue;
// Now that we have our delimiter out of the way,
// let's check to see which kind of value we
// captured (quoted or unquoted).
if (arrMatches[ 2 ]){
// We found a quoted value. When we capture
// this value, unescape any double quotes.
strMatchedValue = arrMatches[ 2 ].replace(
new RegExp( "\"\"", "g" ),
"\""
);
} else {
// We found a non-quoted value.
strMatchedValue = arrMatches[ 3 ];
}
// Now that we have our value string, let's add
// it to the data array.
arrData[ arrData.length - 1 ].push( strMatchedValue );
}
// Return the parsed data.
return( arrData );
}
</script>
jQuery-CSV
It's a jQuery plugin designed to work as an end-to-end solution for parsing CSV into JavaScript data. It handles every single edge case presented in RFC 4180, as well as some that pop up for Excel/Google spreadsheet exports (i.e., mostly involving null values) that the specification is missing.
Example:
track,artist,album,year
Dangerous,'Busta Rhymes','When Disaster Strikes',1997
// Calling this
music = $.csv.toArrays(csv)
// Outputs...
[
["track", "artist", "album", "year"],
["Dangerous", "Busta Rhymes", "When Disaster Strikes", "1997"]
]
console.log(music[1][2]) // Outputs: 'When Disaster Strikes'
Update:
Oh yeah, I should also probably mention that it's completely configurable.
music = $.csv.toArrays(csv, {
delimiter: "'", // Sets a custom value delimiter character
separator: ';', // Sets a custom field separator character
});
Update 2:
It now works with jQuery on Node.js too. So you have the option of doing either client-side or server-side parsing with the same library.
Update 3:
Since the Google Code shutdown, jquery-csv has been migrated to GitHub.
Disclaimer: I am also the author of jQuery-CSV.
Here's an extremely simple CSV parser that handles quoted fields with commas, new lines, and escaped double quotation marks. There's no splitting or regular expression. It scans the input string 1-2 characters at a time and builds an array.
Test it at http://jsfiddle.net/vHKYH/.
function parseCSV(str) {
var arr = [];
var quote = false; // 'true' means we're inside a quoted field
// Iterate over each character, keep track of current row and column (of the returned array)
for (var row = 0, col = 0, c = 0; c < str.length; c++) {
var cc = str[c], nc = str[c+1]; // Current character, next character
arr[row] = arr[row] || []; // Create a new row if necessary
arr[row][col] = arr[row][col] || ''; // Create a new column (start with empty string) if necessary
// If the current character is a quotation mark, and we're inside a
// quoted field, and the next character is also a quotation mark,
// add a quotation mark to the current column and skip the next character
if (cc == '"' && quote && nc == '"') { arr[row][col] += cc; ++c; continue; }
// If it's just one quotation mark, begin/end quoted field
if (cc == '"') { quote = !quote; continue; }
// If it's a comma and we're not in a quoted field, move on to the next column
if (cc == ',' && !quote) { ++col; continue; }
// If it's a newline (CRLF) and we're not in a quoted field, skip the next character
// and move on to the next row and move to column 0 of that new row
if (cc == '\r' && nc == '\n' && !quote) { ++row; col = 0; ++c; continue; }
// If it's a newline (LF or CR) and we're not in a quoted field,
// move on to the next row and move to column 0 of that new row
if (cc == '\n' && !quote) { ++row; col = 0; continue; }
if (cc == '\r' && !quote) { ++row; col = 0; continue; }
// Otherwise, append the current character to the current column
arr[row][col] += cc;
}
return arr;
}
I have an implementation as part of a spreadsheet project.
This code is not yet tested thoroughly, but anyone is welcome to use it.
As some of the answers noted, though, your implementation can be much simpler if you actually have a DSV or TSV file, as they disallow the use of the record and field separators in the values. CSV, on the other hand, can actually have commas and newlines inside a field, which breaks most regular-expression and split-based approaches.
var CSV = {
parse: function(csv, reviver) {
reviver = reviver || function(r, c, v) { return v; };
var chars = csv.split(''), c = 0, cc = chars.length, start, end, table = [], row;
while (c < cc) {
table.push(row = []);
while (c < cc && '\r' !== chars[c] && '\n' !== chars[c]) {
start = end = c;
if ('"' === chars[c]){
start = end = ++c;
while (c < cc) {
if ('"' === chars[c]) {
if ('"' !== chars[c+1]) {
break;
}
else {
chars[++c] = ''; // unescape ""
}
}
end = ++c;
}
if ('"' === chars[c]) {
++c;
}
while (c < cc && '\r' !== chars[c] && '\n' !== chars[c] && ',' !== chars[c]) {
++c;
}
} else {
while (c < cc && '\r' !== chars[c] && '\n' !== chars[c] && ',' !== chars[c]) {
end = ++c;
}
}
row.push(reviver(table.length-1, row.length, chars.slice(start, end).join('')));
if (',' === chars[c]) {
++c;
}
}
if ('\r' === chars[c]) {
++c;
}
if ('\n' === chars[c]) {
++c;
}
}
return table;
},
stringify: function(table, replacer) {
replacer = replacer || function(r, c, v) { return v; };
var csv = '', c, cc, r, rr = table.length, cell;
for (r = 0; r < rr; ++r) {
if (r) {
csv += '\r\n';
}
for (c = 0, cc = table[r].length; c < cc; ++c) {
if (c) {
csv += ',';
}
cell = replacer(r, c, table[r][c]);
if (/[,\r\n"]/.test(cell)) {
cell = '"' + cell.replace(/"/g, '""') + '"';
}
csv += (cell || 0 === cell) ? cell : '';
}
}
return csv;
}
};
csvToArray v1.3
A compact (645 bytes) but compliant function to convert a CSV string into a 2D array, conforming to the RFC 4180 standard.
https://code.google.com/archive/p/csv-to-array/downloads
Common Usage: jQuery
$.ajax({
url: "test.csv",
dataType: 'text',
cache: false
}).done(function(csvAsString){
csvAsArray=csvAsString.csvToArray();
});
Common usage: JavaScript
csvAsArray = csvAsString.csvToArray();
Override field separator
csvAsArray = csvAsString.csvToArray("|");
Override record separator
csvAsArray = csvAsString.csvToArray("", "#");
Override Skip Header
csvAsArray = csvAsString.csvToArray("", "", 1);
Override all
csvAsArray = csvAsString.csvToArray("|", "#", 1);
Here's my PEG(.js) grammar that seems to do ok at RFC 4180 (i.e. it handles the examples at http://en.wikipedia.org/wiki/Comma-separated_values):
start
= [\n\r]* first:line rest:([\n\r]+ data:line { return data; })* [\n\r]* { rest.unshift(first); return rest; }
line
= first:field rest:("," text:field { return text; })*
& { return !!first || rest.length; } // ignore blank lines
{ rest.unshift(first); return rest; }
field
= '"' text:char* '"' { return text.join(''); }
/ text:[^\n\r,]* { return text.join(''); }
char
= '"' '"' { return '"'; }
/ [^"]
Try it out at http://jsfiddle.net/knvzk/10 or http://pegjs.majda.cz/online. Download the generated parser at https://gist.github.com/3362830.
Here's another solution. This uses:
a coarse global regular expression for splitting the CSV string (which includes surrounding quotes and trailing commas)
a fine-grained regular expression for cleaning up the surrounding quotes and trailing commas
type correction that differentiates strings, numbers, boolean values, and null values
For the following input string:
"This is\, a value",Hello,4,-123,3.1415,'This is also\, possible',true,
The code outputs:
[
"This is, a value",
"Hello",
4,
-123,
3.1415,
"This is also, possible",
true,
null
]
Here's my implementation of parseCSVLine() in a runnable code snippet:
function parseCSVLine(text) {
return text.match( /\s*(\"[^"]*\"|'[^']*'|[^,]*)\s*(,|$)/g ).map( function (text) {
let m;
if (m = text.match(/^\s*,?$/)) return null; // null value
if (m = text.match(/^\s*\"([^"]*)\"\s*,?$/)) return m[1]; // Double Quoted Text
if (m = text.match(/^\s*'([^']*)'\s*,?$/)) return m[1]; // Single Quoted Text
if (m = text.match(/^\s*(true|false)\s*,?$/)) return m[1] === "true"; // Boolean
if (m = text.match(/^\s*((?:\+|\-)?\d+)\s*,?$/)) return parseInt(m[1]); // Integer Number
if (m = text.match(/^\s*((?:\+|\-)?\d*\.\d*)\s*,?$/)) return parseFloat(m[1]); // Floating Number
if (m = text.match(/^\s*(.*?)\s*,?$/)) return m[1]; // Unquoted Text
return text;
} );
}
let data = `"This is\, a value",Hello,4,-123,3.1415,'This is also\, possible',true,`;
let obj = parseCSVLine(data);
console.log( JSON.stringify( obj, undefined, 2 ) );
Here's my simple vanilla JavaScript code:
let a = 'one,two,"three, but with a comma",four,"five, with ""quotes"" in it.."'
console.log(splitQuotes(a))
function splitQuotes(line) {
if(line.indexOf('"') < 0)
return line.split(',')
let result = [], cell = '', quote = false;
for(let i = 0; i < line.length; i++) {
char = line[i]
if(char == '"' && line[i+1] == '"') {
cell += char
i++
} else if(char == '"') {
quote = !quote;
} else if(!quote && char == ',') {
result.push(cell)
cell = ''
} else {
cell += char
}
if ( i == line.length-1 && cell) {
result.push(cell)
}
}
return result
}
I'm not sure why I couldn't get Kirtan's example to work for me. It seemed to be failing on empty fields or maybe fields with trailing commas...
This one seems to handle both.
I did not write the parser code, just a wrapper around the parser function to make this work for a file. See attribution.
var Strings = {
/**
* Wrapped CSV line parser
* @param s String delimited CSV string
* @param sep Separator override
* @attribution: http://www.greywyvern.com/?post=258 (comments closed on blog :( )
*/
parseCSV : function(s,sep) {
// http://stackoverflow.com/questions/1155678/javascript-string-newline-character
var universalNewline = /\r\n|\r|\n/g;
var a = s.split(universalNewline);
for(var i in a){
for (var f = a[i].split(sep = sep || ","), x = f.length - 1, tl; x >= 0; x--) {
if (f[x].replace(/"\s+$/, '"').charAt(f[x].length - 1) == '"') {
if ((tl = f[x].replace(/^\s+"/, '"')).length > 1 && tl.charAt(0) == '"') {
f[x] = f[x].replace(/^\s*"|"\s*$/g, '').replace(/""/g, '"');
} else if (x) {
f.splice(x - 1, 2, [f[x - 1], f[x]].join(sep));
} else f = f.shift().split(sep).concat(f);
} else f[x].replace(/""/g, '"');
} a[i] = f;
}
return a;
}
}
Regular expressions to the rescue! These few lines of code handle properly quoted fields with embedded commas, quotes, and newlines based on the RFC 4180 standard.
function parseCsv(data, fieldSep, newLine) {
fieldSep = fieldSep || ',';
newLine = newLine || '\n';
var nSep = '\x1D';
var qSep = '\x1E';
var cSep = '\x1F';
var nSepRe = new RegExp(nSep, 'g');
var qSepRe = new RegExp(qSep, 'g');
var cSepRe = new RegExp(cSep, 'g');
var fieldRe = new RegExp('(?<=(^|[' + fieldSep + '\\n]))"(|[\\s\\S]+?(?<![^"]"))"(?=($|[' + fieldSep + '\\n]))', 'g');
var grid = [];
data.replace(/\r/g, '').replace(/\n+$/, '').replace(fieldRe, function(match, p1, p2) {
return p2.replace(/\n/g, nSep).replace(/""/g, qSep).replace(/,/g, cSep);
}).split(/\n/).forEach(function(line) {
var row = line.split(fieldSep).map(function(cell) {
return cell.replace(nSepRe, newLine).replace(qSepRe, '"').replace(cSepRe, ',');
});
grid.push(row);
});
return grid;
}
const csv = 'A1,B1,C1\n"A ""2""","B, 2","C\n2"';
const separator = ','; // field separator, default: ','
const newline = ' <br /> '; // newline representation in case a field contains newlines, default: '\n'
var grid = parseCsv(csv, separator, newline);
// expected: [ [ 'A1', 'B1', 'C1' ], [ 'A "2"', 'B, 2', 'C <br /> 2' ] ]
You don't need a parser-generator such as lex/yacc. The regular expression handles RFC 4180 properly thanks to positive lookbehind, negative lookbehind, and positive lookahead.
Clone/download code at https://github.com/peterthoeny/parse-csv-js
Just throwing this out there.. I recently ran into the need to parse CSV columns with Javascript, and I opted for my own simple solution. It works for my needs, and may help someone else.
const csvString = '"Some text, some text",,"",true,false,"more text","more,text, more, text ",true';
const parseCSV = text => {
const lines = text.split('\n');
const output = [];
lines.forEach(line => {
line = line.trim();
if (line.length === 0) return;
const skipIndexes = {};
const columns = line.split(',');
output.push(columns.reduce((result, item, index) => {
if (skipIndexes[index]) return result;
if (item.startsWith('"') && !item.endsWith('"')) {
while (!columns[index + 1].endsWith('"')) {
index++;
item += `,${columns[index]}`;
skipIndexes[index] = true;
}
index++;
skipIndexes[index] = true;
item += `,${columns[index]}`;
}
result.push(item);
return result;
}, []));
});
return output;
};
console.log(parseCSV(csvString));
Personally, I like to use the Deno std library, since most modules are officially compatible with the browser.
The problem is that the std library is written in TypeScript, but an official solution might happen in the future: https://github.com/denoland/deno_std/issues/641 and https://github.com/denoland/dotland/issues/1728
For now there is an actively maintained on-the-fly transpiler, https://bundle.deno.dev/,
so you can use it simply like this:
<script type="module">
import { parse } from "https://bundle.deno.dev/https://deno.land/std@0.126.0/encoding/csv.ts"
console.log(await parse("a,b,c\n1,2,3"))
</script>
I have constructed this JavaScript script to parse a CSV string into an array object. I find it better to break the whole CSV down into lines and fields and process them accordingly. I think that will make it easy for you to change the code to suit your needs.
//
//
// CSV to object
//
//
const new_line_char = '\n';
const field_separator_char = ',';
function parse_csv(csv_str) {
var result = [];
let line_end_index_moved = false;
let line_start_index = 0;
let line_end_index = 0;
let csr_index = 0;
let cursor_val = csv_str[csr_index];
let found_new_line_char = get_new_line_char(csv_str);
let in_quote = false;
// Handle \r\n
if (found_new_line_char == '\r\n') {
csv_str = csv_str.split(found_new_line_char).join(new_line_char);
}
// Handle the last character is not \n
if (csv_str[csv_str.length - 1] !== new_line_char) {
csv_str += new_line_char;
}
while (csr_index < csv_str.length) {
if (cursor_val === '"') {
in_quote = !in_quote;
} else if (cursor_val === new_line_char) {
if (in_quote === false) {
if (line_end_index_moved && (line_start_index <= line_end_index)) {
result.push(parse_csv_line(csv_str.substring(line_start_index, line_end_index)));
line_start_index = csr_index + 1;
} // Else: just ignore line_end_index has not moved or line has not been sliced for parsing the line
} // Else: just ignore because we are in a quote
}
csr_index++;
cursor_val = csv_str[csr_index];
line_end_index = csr_index;
line_end_index_moved = true;
}
// Handle \r\n
if (found_new_line_char == '\r\n') {
let new_result = [];
let curr_row;
for (var i = 0; i < result.length; i++) {
curr_row = [];
for (var j = 0; j < result[i].length; j++) {
curr_row.push(result[i][j].split(new_line_char).join('\r\n'));
}
new_result.push(curr_row);
}
result = new_result;
}
return result;
}
function parse_csv_line(csv_line_str) {
var result = [];
//let field_end_index_moved = false;
let field_start_index = 0;
let field_end_index = 0;
let csr_index = 0;
let cursor_val = csv_line_str[csr_index];
let in_quote = false;
// Pretend that the last char is the separator_char to complete the loop
csv_line_str += field_separator_char;
while (csr_index < csv_line_str.length) {
if (cursor_val === '"') {
in_quote = !in_quote;
} else if (cursor_val === field_separator_char) {
if (in_quote === false) {
if (field_start_index <= field_end_index) {
result.push(parse_csv_field(csv_line_str.substring(field_start_index, field_end_index)));
field_start_index = csr_index + 1;
} // Else: just ignore field_end_index has not moved or field has not been sliced for parsing the field
} // Else: just ignore because we are in quote
}
csr_index++;
cursor_val = csv_line_str[csr_index];
field_end_index = csr_index;
field_end_index_moved = true;
}
return result;
}
function parse_csv_field(csv_field_str) {
with_quote = (csv_field_str[0] === '"');
if (with_quote) {
csv_field_str = csv_field_str.substring(1, csv_field_str.length - 1); // remove the start and end quotes
csv_field_str = csv_field_str.split('""').join('"'); // handle double quotes
}
return csv_field_str;
}
// Initial method: check the first newline character only
function get_new_line_char(csv_str) {
if (csv_str.indexOf('\r\n') > -1) {
return '\r\n';
} else {
return '\n'
}
}
Just use .split(','):
var str = "a,b,c";
var n = str.split(",");