I'm new to yacc/lex and I'm working on a parser that was written by someone else. I notice that when an undefined token is found, the parser returns an error and stops. Is there a simple way to make it completely ignore lines that it cannot parse and move on to the next one?
Just add a rule that looks like
. {
// do nothing
}
at the bottom of all of your rules, and the lexer will silently discard any character (other than a newline, which . does not match) that doesn't fit any of the previous rules.
Edit: if you have multiple states, then a catch-all that works in any state would then look like:
<*>. {
}
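To see the effect of a catch-all rule outside of lex, here is a minimal Python sketch of the same idea. The token names (NUMBER, NAME, SKIP, UNKNOWN) are made up for illustration and are not taken from the asker's grammar; the only point is that the last alternative matches any single character and that match is thrown away.

```python
import re

# Hypothetical token set; a real lexer would have its own rules.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER>\d+)
  | (?P<NAME>[A-Za-z_]\w*)
  | (?P<SKIP>\s+)
  | (?P<UNKNOWN>.)        # catch-all, listed last like the `.` rule
""", re.VERBOSE)

def tokenize(text):
    """Return (kind, text) pairs, silently dropping unknown characters."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup
        if kind in ("SKIP", "UNKNOWN"):
            continue  # "do nothing", same as the empty action
        tokens.append((kind, match.group()))
    return tokens

print(tokenize("foo 42 $# bar"))
# [('NAME', 'foo'), ('NUMBER', '42'), ('NAME', 'bar')]
```

Because the catch-all comes last, it only fires when nothing more specific matches, which is why the answer says to put it at the bottom.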
I have an issue when I try to parse my JSON. I build my JSON by hand in PHP, like this:
$outp ='{"records":['.$outp.']}';
I build it so that I can take fields from my database and show them on the page. The thing is, my database has a "description" field where people can describe something, and some people put line breaks in it, like this for example:
Interphone
Equipe:
Canape-lit
Autre:
Local
When I try to parse my JSON, there is an error because of these line breaks: "SyntaxError: Unexpected token".
Here's an example of my JSON:
{"records":[{"Parking":"Aucun","Description":"Interphone
Equipé :
Canapé-lit
","Chauffage":"Fioul"}]}
Can someone help me, please?
You've really dug yourself into a very bad hole here.
The problem
The problem you're running into is that raw newlines (line feed and carriage return characters) are not valid inside a JSON string. They must be escaped as \n and \r. You can see the full JSON standard here.
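The difference is easy to reproduce. This is a small Python sketch (Python's json module is used only as a convenient strict parser; the sample strings are invented): the first string contains a real line feed inside the value, the second contains the two characters backslash and n.

```python
import json

raw = '{"Description":"Interphone\nLocal"}'       # real line feed inside the string
escaped = '{"Description":"Interphone\\nLocal"}'  # backslash followed by n

def parses(text):
    """Return True if text is accepted by a strict JSON parser."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

print(parses(raw))      # False
print(parses(escaped))  # True
print(json.loads(escaped)["Description"] == "Interphone\nLocal")  # True
```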
You need to do two things.
Fix your code
In spite of the fact that the JSON standard is comparatively simple, you should not create your JSON by hand. You already know why. You have to handle several edge cases and the like. Your users could enter anything on the page, and you need to make sure that it gets properly encoded no matter what.
You need to use a JSON serialization tool. json_encode is built in as of PHP 5.2. If you can't use this for any reason, find an existing, widely used (and therefore heavily tested) third party library with a JSON serializer.
If you're asking, "Why can't I create my own serializer?", you could, in theory. Realistically, there is no point. Yours won't be better than existing ones. It will be much more likely to have bugs and to perform worse than something a lot of people have used in production. It will also take much longer to create and test than using an existing one.
If you need this data in code after you pull it back out of the database, then you need a JSON deserializer. json_decode should also be fine, but again, if you can't use it, look for a widely used third party library.
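For comparison, this is roughly what the round trip looks like with a real serializer. The sketch is in Python rather than PHP, and the rows list is a hypothetical stand-in for whatever comes back from the database; json.dumps and json.loads play the roles that json_encode and json_decode play in PHP.

```python
import json

# Hypothetical rows, standing in for what would be read from the database.
rows = [
    {"Parking": "Aucun",
     "Description": "Interphone\nEquipé :\nCanapé-lit\n",
     "Chauffage": "Fioul"},
]

payload = json.dumps({"records": rows})  # the serializer escapes the newlines
restored = json.loads(payload)           # the deserializer reverses it exactly

print("\n" in payload)                # False: no raw line feed in the output
print(restored == {"records": rows})  # True: the description survives intact
```

Nothing in this code knows about newlines, quotes, or backslashes; the serializer handles every such case on its own.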
Fix your data
If you haven't hit production yet, you have really dodged a bullet here, and you can skip this whole section. If you have gone to production and you have data from users, you've got a major problem.
Even after you fix your code, you still have bad data in your production database that won't parse correctly. You have to do something to make this data usable. Unfortunately, it is impossible to automatically recover the original data for every possible case. This is because users might have entered the characters/substrings you added to the data to turn it into "JSON"; for example, they might have entered a comma separated list of quoted words: "dog","cat","pig", and "cow". That is an intractable problem, since you know for a fact you didn't properly serialize all your incoming input. There's no way to tell the difference between text your code generated and text the user entered. You're going to have to settle for a best effort and try to throw errors when you can't figure it out in code, and it might mess up a user's data in some special cases. You might have to fix some things manually.
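To make the ambiguity concrete, here is a small Python sketch. naive_record is a hypothetical stand-in for the string concatenation in the question, reduced to two fields. Two different sets of user input produce exactly the same stored string, so no amount of clever parsing can tell them apart afterwards.

```python
def naive_record(description, chauffage):
    """Build "JSON" by concatenation, with no escaping at all."""
    return ('{"Description":"' + description
            + '","Chauffage":"' + chauffage + '"}')

# The user typed the delimiter text into the second field...
first = naive_record('a', 'b","Chauffage":"c')
# ...or into the first field. The stored text is identical.
second = naive_record('a","Chauffage":"b', 'c')

print(first)
print(first == second)  # True
```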
Start by discussing this with your manager, team lead, whoever you answer to. Assuming that you can't lose the data, this is the most sound process to follow for creating a fix for your data:
1. Create a database dump of your production data.
2. Import that dump into a development database.
3. Develop and test your method of repairing this data against the development database from the last step.
4. Ensure you have a recovery plan for deployments gone wrong. Test this plan in your testing environment.
5. Once you've gone through your typical release process, it's time to release the fixed code and the data update together.
   1. Take the website offline.
   2. Back up the database.
   3. Update the website with the new code.
   4. Implement your data fix.
   5. Verify that it worked.
   6. Bring the site online.
If your data fix doesn't work (possibly because you didn't think of an edge case or something), then you have a nice backup you can restore and you can cancel the release. Then go back to step 1.
As for how you can fix the data, I don't recommend queries here. I recommend a little script tool. It would have to load the data from the database, pull the string apart, try to identify all the pieces, build up an object from those pieces, and finally serialize them to JSON correctly, and put them back into the database.
Here's an example function of how you might go about pulling the string apart:
const ELEMENT_SEPARATOR = '","';
const PAIR_SEPARATOR = '":"';

function recover_object_from_malformed_json($malformed_json, $known_keys) {
    $tempData = substr($malformed_json, 14); // Removes {"records":[{" prefix
    $tempData = substr($tempData, 0, -4); // Removes "}]} suffix
    $tempData = explode(ELEMENT_SEPARATOR, $tempData); // Split into what we think are pairs
    $data = array();
    $lastKey = NULL;
    foreach ($tempData as $i) {
        $explodedI = explode(PAIR_SEPARATOR, $i, 2); // Split what we think is a key/value into key and value
        if (count($explodedI) === 2 && in_array($explodedI[0], $known_keys)) { // Check if it's actually a key
            // It's a key
            $lastKey = $explodedI[0];
            if (array_key_exists($lastKey, $data)) {
                throw new RuntimeException('Duplicate key: ' . $lastKey);
            }
            // Assign the value to the key
            $data[$lastKey] = $explodedI[1];
        }
        else {
            // This isn't a key/value pair, near as we can tell.
            // So it must actually be part of the last value,
            // and the user actually entered the delimiter as part of the value.
            if (is_null($lastKey)) {
                // This one is REALLY messed up
                throw new RuntimeException('Does not begin with a known key');
            }
            // Put the delimiter back and append the rest (PHP concatenates with ., not +)
            $data[$lastKey] .= ELEMENT_SEPARATOR;
            $data[$lastKey] .= $i;
        }
    }
    return $data;
}
Note that I'm assuming that your "list" is a single element. This gets much harder and much messier if you have more than one. You'll also need to know ahead of time what keys you expect to have. The bottom line is that you have to undo whatever your code did to create the "JSON", and you have to do everything you can to try to not mess up a user's data.
You would use it something like this:
$knownKeys = ["Parking", "Description", "Chauffage"];
// Fetch your rows and loop over them
foreach ($dbRows as $row) {
    try {
        $dataFromDb = $row['myData']; // or however you would pull out this string.
        $recoveredData = recover_object_from_malformed_json($dataFromDb, $knownKeys);
        // Save it back to the DB
        $row['myData'] = json_encode($recoveredData);
        // Make sure to commit here.
    }
    catch (Exception $e) {
        // Log the row's ID, the content that couldn't be fixed, and the exception
        // Make sure to roll back here
    }
}
(Forgive me if the database stuff looks really wonky. I don't do PHP, so I have no idea how that code should look. Hopefully, you can at least get the concept.)
Why I don't recommend trying to parse your data as JSON to recover it.
The bottom line is that your data in the database is not JSON. If you try to parse it as such, all the other edge cases you didn't handle properly will get screwed up in the process. You'll see bad things like
\\ becomes \
\j becomes j
\t becomes a tab character
In the end, it will just mess up your data even more.
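A quick way to convince yourself is to run some unescaped text through a JSON parser. This Python sketch uses an invented Windows-style path as the "user input". Note that behavior for an invalid escape such as \j depends on the parser: Python's json module rejects it outright, while a more lenient parser may drop the backslash as described above.

```python
import json

# What the user typed, character for character: C:\\temp\tools
stored = r'C:\\temp\tools'

# The hand-built "JSON" just wraps it in quotes, with no escaping.
text = '{"Description":"' + stored + '"}'
decoded = json.loads(text)["Description"]

print(decoded == stored)           # False: the text was altered
print(decoded == "C:\\temp\tools")  # True: \\ became \ and \t became a tab

def parses(candidate):
    try:
        json.loads(candidate)
        return True
    except json.JSONDecodeError:
        return False

print(parses(r'{"Description":"\j"}'))  # False with Python's strict parser
```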
Conclusion
This is a huge mess, and you should never try to convert something into a standard format without using a properly built, well tested serializer. Fixing the data is going to be hard, and it's going to take time. I also seriously doubt you have a lot of background in text processing techniques, and lacking that knowledge is going to make this harder. You can get some good info on text processing by studying how compilers are made. Good luck.
Short: I am looking for a way to get the text of the script that was evaluated and caused a syntax error from within the context of window.onerror.
Long:
The full scenario involves a PhoneGap application and the PushNotifications plugin.
When a push message is sent to the device, a JavaScript error is caught using window.onerror
with the text "SyntaxError: Expected token '}'".
The reported line number is 1 (as it usually is when dealing with eval'd code).
The way the plugin executes its code is by using:
NSString * jsCallBack = [NSString stringWithFormat:@"%@(%@);", self.callback, jsonStr];
[self.webView stringByEvaluatingJavaScriptFromString:jsCallBack];
I believe, but am not 100% sure, that this is the code PhoneGap Build is pushing;
more code can be seen here: https://github.com/phonegap-build/PushPlugin/blob/master/src/ios/PushPlugin.m#L177
self.callback is a string that I pass to the plugin, and jsonStr is (supposed to be) an object describing the push message.
When I passed the string alert('a');// as the parameter that ends up being self.callback, I did get the alert and no syntax error. Now I am trying to understand what jsonStr evaluates to, so that I can maybe find a way around it or figure out whether it is somehow my fault (maybe because of the content I am sending in the push notification).
I also tried looking at the last item of the $('script') collection of the document, hoping that stringByEvaluatingJavaScriptFromString might generate a new script block, but that does not seem to be the case.
Furthermore, in window.onerror I also tried to get the caller
using var c=window.onerror.caller||window.onerror.arguments.caller; but this returns undefined.
As I stated before, I am looking for ideas on how to determine what exactly is causing the syntax error, possibly by getting hold of the entire block of script being evaluated when the syntax error happened.
I have the following lines in many places in my code. I want to find all of them at once and replace each such block with a new comment. However, I am only able to search for a single line at a time, and I can't work out how to include a newline in my regular expression. Please help.
// Block Solver
// We develop a block solver that includes the joint limit.
// when the mass has poor distribution (leading to large torques about..
//
Thanks in advance
Search for:
^(?://.*\n?)+
and replace all with nothing.
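This matches every run of consecutive lines that start with //, so each comment block is found as a single match. It relies on ^ matching at the start of every line, which regex libraries typically enable with a multiline flag.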
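Here is how that pattern behaves in a script, as a rough Python sketch. The surrounding int a; and int b; lines and the replacement text are invented for the example; re.MULTILINE is what makes ^ match at the start of every line.

```python
import re

source = """int a;
// Block Solver
// We develop a block solver that includes the joint limit.
// when the mass has poor distribution (leading to large torques about..
//
int b;
"""

# One match per run of consecutive lines that start with //
comment_block = re.compile(r"^(?://.*\n?)+", re.MULTILINE)

result = comment_block.sub("// New comment\n", source)
print(result)
```

The whole four-line block is replaced by the single new comment line, and the code lines on either side are left alone. Replacing with an empty string instead deletes the block, as the answer suggests.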
I've set myself a somewhat ambitious first task in learning regular expressions (and one which relates to a problem I'm trying to solve). I need to find any instance of a url that ends in .m4v, in a big html string.
My first attempt was this for jpg files
http.*jpg
Which seems correct at first glance, but of course returns stuff like this:
http://domain.com/page.html" title="Misc"><img src="http://domain.com/image.jpg
Which does match the expression in theory. So really, I need to put something in http.*m4v that says 'only the closest instance between http and m4v'. Any ideas?
As you've noticed, an expression such as the following is greedy:
http:.*\.jpg
That means it reads as much input as possible while satisfying the expression.
It's the "*" operator that makes it greedy. There's a well-defined regex technique for making this non-greedy: use the "?" modifier after the "*".
http:.*?\.jpg
Now it will match as little as possible while still satisfying the expression (i.e. it will stop searching at the first occurrence of ".jpg").
Of course, if you have a .jpg in the middle of a URL, like:
http://mydomain.com/some.jpg-folder/foo.jpg
It will not match the full URL.
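The cases above can be checked side by side. This is a small Python sketch with made-up sample strings; the last case is the HTML fragment from the question. It shows that the lazy version only changes where a match ends, not where it starts, so on that fragment it still begins at the first "http:".

```python
import re

GREEDY = r"http:.*\.jpg"
LAZY = r"http:.*?\.jpg"

two_urls = "see http://a.com/1.jpg and http://b.com/2.jpg"
print(re.findall(GREEDY, two_urls))
# ['http://a.com/1.jpg and http://b.com/2.jpg']
print(re.findall(LAZY, two_urls))
# ['http://a.com/1.jpg', 'http://b.com/2.jpg']

# A .jpg in the middle of the URL cuts the lazy match short.
mid_url = "http://mydomain.com/some.jpg-folder/foo.jpg"
print(re.findall(LAZY, mid_url))
# ['http://mydomain.com/some.jpg']

# The fragment from the question: the lazy match still starts at the first "http:".
fragment = 'http://domain.com/page.html" title="Misc"><img src="http://domain.com/image.jpg'
print(re.findall(LAZY, fragment) == [fragment])  # True
```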
You'll want to define the end of the URL as something that can't be considered part of the URL, such as a space, a new line, or (if the URL is nested inside parentheses) a closing parenthesis. However, this can't be solved with just one little regex if the URL is embedded in written language, since URLs there are often ambiguous.
Take for example:
At this page, http://mysite.com/puppy.html, there's a cute little puppy dog.
The comma could technically be a part of a URL. You have to deal with a lot of ambiguities like this when looking for URLs in written text, and it's hard not to have bugs due to the ambiguities.
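For the original case of URLs inside HTML, the "what can't be part of the URL" idea can be sketched like this in Python. The character class is an assumption made for this example (a URL is taken to end at whitespace, a quote, or an angle bracket), and the sample markup is invented; it is not a general URL matcher and does nothing about the prose ambiguity described above.

```python
import re

# Assumption: inside HTML, a URL ends at whitespace, a quote, or an angle bracket.
M4V_URL = re.compile(r"""https?://[^\s"'<>]+\.m4v""")

html = ('<a href="http://domain.com/page.html" title="Misc">'
        '<source src="http://domain.com/clip.m4v"></a>')

print(M4V_URL.findall(html))
# ['http://domain.com/clip.m4v']
```

Because the character class cannot cross the closing quote of the first attribute, the match can no longer start at one URL and end inside another.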
EDIT | Here's an example of a regex in PHP that is a quick and dirty solution, being greedy only where needed and trying to deal with the English language:
<?php
$str = "Checkout http://www.foo.com/test?items=bat,ball, for info about bats and balls";
preg_match('/https?:\/\/([a-zA-Z0-9][a-zA-Z0-9-]*)(\.[a-zA-Z0-9-]+)*((\/[^\s]*)(?=[\s\.,;!\?]))\b/i', $str, $matches);
var_dump($matches);
It outputs:
array(5) {
[0]=>
string(38) "http://www.foo.com/test?items=bat,ball"
[1]=>
string(3) "www"
[2]=>
string(4) ".com"
[3]=>
string(20) "/test?items=bat,ball"
[4]=>
string(20) "/test?items=bat,ball"
}
The explanation is in the comments.
Perl, Ruby, PHP and JavaScript should all work with this:
/(http:\/\/(?:(?:(?!http:\/\/).))+\.jpg)/
The URLs will be stored in the matched groups. Tested this out against "http://a.com/b.jpg-folder/c.jpg http://mydomain.com/some.jpg-folder/foo.jpg" and it worked correctly without being too greedy.
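For anyone who wants to try it, here is a Python version of the same idea, with the lookahead written as (?!http://) and without the outer capturing group so that findall returns the whole match. The first test string is the one quoted above and the second is the fragment from the question.

```python
import re

# Each character is consumed only if "http://" does not start at that position,
# so a match can never run from one URL into the next.
JPG_URL = re.compile(r"http://(?:(?!http://).)+\.jpg")

text = "http://a.com/b.jpg-folder/c.jpg http://mydomain.com/some.jpg-folder/foo.jpg"
print(JPG_URL.findall(text))
# ['http://a.com/b.jpg-folder/c.jpg', 'http://mydomain.com/some.jpg-folder/foo.jpg']

fragment = 'http://domain.com/page.html" title="Misc"><img src="http://domain.com/image.jpg'
print(JPG_URL.findall(fragment))
# ['http://domain.com/image.jpg']
```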
So, I am parsing Hayes modem AT commands. Not read from a file, but passed as char * (I am using C).
1) what happens if I get something that I totally don't recognize? How do I handle that?
2) what if I have something like
my_token: "cmd param=" ("value_1" | "value_2");
and receive an invalid value for "param"?
I see some advice to let the back-end program (in C) handle it, but that goes against the grain for me. Catch the problem as early as you can, is my motto.
Is there any way to catch "else" conditions in lexer/parser rules?
Thanks in advance ...
That's the thing: the whole point of your parser and lexer is to blow up if you get bad input, then you catch the blow up and present a pretty error message to the user.
I think you're looking for Custom Syntax Error Recovery to embed in your grammar.
EDIT
I've no experience with ANTLR and C (or C alone for that matter), so follow this advice with caution! :)
Looking at the page: http://www.antlr.org/api/C/using.html, perhaps the part at the bottom, Implementing Customized Methods is what you're after.
HTH