JSON issue with SQL lines return - sql

I have an issue when I try to parse my JSON. I create my JSON "by my hand" like this in PHP :
$outp ='{"records":['.$outp.']}'; and I create it so I can take field from my database to show them in the page. The thing is, in my database I have a field "description" where people can give a description about something. So some people make return to line like this for example :
Interphone
Equipe:
Canape-lit
Autre:
Local
And when I try to parse my JSON there is an error because of these line's return. "SyntaxError: Unexpected token".
Here's an example of my JSON :
{"records":[{"Parking":"Aucun","Description":"Interphone
Equipé :
Canapé-lit
","Chauffage":"Fioul"}]}
Can someone help me please ?

You've really dug yourself into a very bad hole here.
The problem
The problem you're running into is that a newline (line feed and carriage return characters) are not valid JSON. They must be escaped as \n and \r. You can see the full JSON standard here here.
You need to do two things.
Fix your code
In spite of the fact that the JSON standard is comparatively simple, you should not create your JSON by hand. You already know why. You have to handle several edge cases and the like. Your users could enter anything on the page, and you need to make sure that it gets properly encoded no matter what.
You need to use a JSON serialization tool. json_encode is built in as of 5.2. If you can't use this for any reason, find an existing, widely used (and therefore heavily tested) third party library with a JSON serializer.
If you're asking, "Why can't I create my own serializer?", you could, in theory. Realistically, there is no point. Yours won't be better than existing ones. It will be much more likely to have bugs and to perform worse than something a lot of people have used in production. It will also take much longer to create and test than using an existing one.
If you need this data in code after you pull it back out of the database, then you need a JSON deserializer. json_decode should also be fine, but again, if you can't use it, look for a widely used third party library.
Fix your data
If you haven't hit production yet, you have really dodged a bullet here, and you can skip this whole section. If you have gone to production and you have data from users, you've got a major problem.
Even after you fix your code, you still have bad data in your production database that won't parse correctly. You have to do something to make this data usable. Unfortunately, it is impossible to automatically recover the original data for every possible case. This is because users might have entered the characters/substrings you added to the data to turn it into "JSON"; for example, they might have entered a comma separated list of quoted words: "dog","cat","pig", and "cow". That is an intractable problem, since you know for a fact you didn't properly serialize all your incoming input. There's no way to tell the difference between text your code generated and text the user entered. You're going to have to settle for a best effort and try to throw errors when you can't figure it out in code, and it might mess up a user's data in some special cases. You might have to fix some things manually.
Start by discussing this with your manager, team lead, whoever you answer to. Assuming that you can't lose the data, this is the most sound process to follow for creating a fix for your data:
Create a database dump of your production data.
Import that dump into a development database.
Develop and test your method of repairing this data against the development database from the last step.
Ensure you have a recovery plan for deployments gone wrong. Test this plan in your testing environment.
Once you've gone through your typical release process, it's time to release the fixed code and the data update together.
Take the website offline.
Back up the database.
Update the website with the new code.
Implement your data fix.
Verify that it worked.
Bring the site online.
If your data fix doesn't work (possibly because you didn't think of an edge case or something), then you have a nice back up you can restore and you can cancel the release. Then go back to step 1.
As for how you can fix the data, I don't recommend queries here. I recommend a little script tool. It would have to load the data from the database, pull the string apart, try to identify all the pieces, build up an object from those pieces, and finally serialize them to JSON correctly, and put them back into the database.
Here's an example function of how you might go about pulling the string apart:
const ELEMENT_SEPARATOR = '","';
const PAIR_SEPARATOR = '":"';
function recover_object_from_malformed_json($malformed_json, $known_keys) {
$tempData = substr($malformed_json, 14); // Removes {"records":[{" prefix
$tempData = substr($tempData, 0, -4); // Removes "}]} suffix
$tempData = explode(ELEMENT_SEPARATOR, $tempData); // Split into what we think are pairs
$data = array();
$lastKey = NULL;
foreach ($tempData as $i) {
$explodedI = explode(KEY_VALUE_SEPARATOR, $i, 2); // Split what we think is a key/value into key and value
if (in_array($explodedI[0], $known_keys)) { // Check if it's actually a key
// It's a key
$lastKey = $explodedI[0];
if (array_key_exists($lastKey, $data)) {
throw new RuntimeException('Duplicate key: ' + $lastKey);
}
// Assign the value to the key
$data[$lastKey] = $explodedI[1];
}
else {
// This isn't a key vlue pair, near as we can tell
// So it must actually be part of the last value,
// and the user actually entered the delimiter as part of the value.
if (is_null($lastKey)) {
// This one is REALLY messed up
throw new RuntimeException('Does not begin with a known key');
}
$data[$lastKey] += ELEMENT_SEPARATOR;
$data[$lastKey] += $i;
}
}
return $data;
}
Note that I'm assuming that your "list" is a single element. This gets much harder and much messier if you have more than one. You'll also need to know ahead of time what keys you expect to have. The bottom line is that you have to undo whatever your code did to create the "JSON", and you have to do everything you can to try to not mess up a user's data.
You would use it something like this:
$knownKeys = ["Parking", "Description", "Chauffage"];
// Fetch your rows and loop over them
foreach ($dbRows as $row) {
try {
$dataFromDb = $row.myData // or however you would pull out this string.
$recoveredData = recover_object_from_malformed_json($dataFromDb);
// Save it back to the DB
$row.myData = json_encode($recoveredData);
// Make sure to commit here.
}
catch (Exception $e) {
// Log the row's ID, the content that couldn't be fixed, and the exception
// Make sure to roll back here
}
}
(Forgive me if the database stuff looks really wonky. I don't do PHP, so I have no idea how that code should look. Hopefully, you can at least get the concept.)
Why I don't recommend trying to parse your data as JSON to recover it.
The bottom line is that your data in the database is not JSON. IF you try to parse it as such, all the other edge cases you didn't handle properly will get screwed up in the process. You'll see bad things like
\\ becomes \
\j becomes j
\t becomes a tab character
In the end, it will just mess up your data even more.
Conclusion
This is a huge mess, and you should never try to convert something into a standard format without using a properly built, well tested serializer. Fixing the data is going to be hard, and it's going to take time. I also seriously doubt you have a lot of background in text processing techniques, and lacking that knowledge is going to make this harder. You can get some good info on text processing by studying how compilers are made. Good luck.

Related

SHOW KEYS in Aerospike?

I'm new to Aerospike and am probably missing something fundamental, but I'm trying to see an enumeration of the Keys in a Set (I'm purposefully avoiding the word "list" because it's a datatype).
For example,
To see all the Namespaces, the docs say to use SHOW NAMESPACES
To see all the Sets, we can use SHOW SETS
If I want to see all the unique Keys in a Set ... what command can I use?
It seems like one can use client.scan() ... but that seems like a super heavy way to get just the key (since it fetches all the bin data as well).
Any recommendations are appreciated! As of right now, I'm thinking of inserting (deleting) into (from) a meta-record.
Thank you #pgupta for pointing me in the right direction.
This actually has two parts:
In order to retrieve original keys from the server, one must -- during put() calls -- set policy to save the key value server-side (otherwise, it seems only a digest/hash is stored?).
Here's an example in Python:
aerospike_client.put(key, {'bin': 'value'}, policy={'key': aerospike.POLICY_KEY_SEND})
Then (modified Aerospike's own documentation), you perform a scan and set the policy to not return the bin data. From this, you can extract the keys:
Example:
keys = []
scan = client.scan('namespace', 'set')
scan_opts = { 'concurrent': True, 'nobins': True, 'priority': aerospike.SCAN_PRIORITY_MEDIUM }
for x in (scan.results(policy=scan_opts)): keys.append(x[0][2])
The need to iterate over the result still seems a little clunky to me; I still think that using a 'master-key' Record to store a list of all the other keys will be more performant, in my case -- in this way, I can simply make one get() call to the Aerospike server to retrieve the list.
You can choose not bring the data back by setting includeBinData in ScanPolicy to false.

why this idea in using while loop. is it trivial?

I have met this a lot recently reading other people's scripts. A short example is below:
Say we need input and store them in var A and B, the scheme is below:
int ok;
ok = false;
while(!ok){
//ask input for A
//ask input for B
ok = true;
}
I understand what it wants, but why is this scheme necessary? can I only have "ask input for A and B".
but why is this scheme necessary?
It is not necessary.
can I only have "ask input for A and B".
You sure can.
However, if user gives you input that is not useful (for example: you ask for the users age, and they type "horse"), then you might want to ask again. Allowing re-trying of input is generally a useful feature. The canonical control structure for repeating a piece of program is a loop.
Your example program however, sets ok unconditionally, so in that case there is really no use for the loop. The loop makes sense only if there is some form of validation that must be passed before the input is OK.
When there are no checks in the code you omitted, but you see this same construct all over the place, then it's a copy&paste artifact.
Someone had a piece of code that was reading input and validating it, then copied the code somewhere else, removed the validation bits, and left the rest as-is. Then they copy&pasted that code all over the place.
In my experience, this happens very often.

calling script_execute with a variable

I'm using GameMaker:Studio Pro and trying to execute a script stored in a variable as below:
script = close_dialog;
script_execute(script);
It doesn't work. It's obviously looking for a script named "script". Anyone know how I can accomplish this?
This question's quite old now, but in case anyone else ends up here via google (as I did), here's something I found that worked quite well and avoids the need for any extra data structures as reference:
scriptToCall = asset_get_index(scr_scriptName);
script_execute(scriptToCall);
The first line here creates the variable scriptToCall and then assigns to it Game Maker's internal ID number for the script you want to call. This allows script_execute to correctly find the script from the ID, which doesn't work if you try to pass it a string containing the script name.
I'm using this to define which scripts should be called in a particular situation from an included txt file, hence the need to convert a string into an addressable script ID!
You seem to have some confusion over how Game Maker works, so I will try to address this before I get around to the actual question.
GML is a rather simple-minded beast, it only knows two data types: strings and numbers. Everything else (objects, sprites, scripts, data structures, instances and so on) is represented with a number in your GML code.
For example, you might have an object called "Player" which has all kinds of fancy events, but to the code Player is just a constant number which you can (e.g.) print out with show_message(string(Player));
Now, the function script_execute(script) takes as argument the ID of the script that should be executed. That ID is just a normal number. script_execute will find the script with that ID in some internal table and then run the script.
In other words, instead of calling script_execute(close_dialog) you could just as well call script_execute(14) if you happened to know that the ID of close_dialog is 14 (although that is bad practice, since it make the code difficult to understand and brittle against ID changes).
Now it should be obvious that assigning the numeric value of close_dialog to a variable first and then calling script_execute on that variable is perfectly OK. In the end, script_execute only cares about the number that is passed, not about the name of the variable that this number comes from.
If you are thinking ahead a bit, you might wonder whether you need script_execute at all then, or if you could instead just do this:
script = close_dialog;
script();
In my opinion, it would be perfectly fine to allow this in the language, but it does not work - the function call operator actually does care about the name of the thing you try to call.
Now with that background out of the way, on to your actual question. If close_dialog is actually a script, your suggested code will work fine. If it is an extension function (or a built-in function -- I don't own Studio so what do I know) then it does not actually have an ID, and you can't call it with script_execute. In fact, you can't even assign close_dialog to a variable then because it does not have any value in GML -- all you can do with it then is call it. To work around this though, you could create a script (say, close_dialog_script which only calls close_dialog, which you can then use just as above.
Edit: Since it does not seem to work anyway, check whether you have a different resource by the name of close_dialog (perhaps a button sprite). This kind of conflict could mean that close_dialog gives you the ID of the sprite, not of the script, while calling the script directly would still work.
After much discussion on the forums, I ended up going with this method.
I wrote a script called script_id()
var sid;
sid = 6; //6 = scriptnotfound script :)
switch (argument0) {
case "load_room":
sid = 0;
break;
case "show_dialog":
sid = 1;
break;
case "close_dialog":
sid = 3;
break;
case "scrExample":
sid = 4;
break;
}
return sid;
So now I can call script_execute(script_id("close_dialog"));
I hate it, but it's better than keeping a spreadsheet... in my opinion.
There's also another way, with execute_string();
Should look like this:
execute_string(string(scriptName) + "();");

Compare 2 datasets with dbunit?

Currently I need to create tests for my application. I used "dbunit" to achieve that and now need to compare 2 datasets:
1) The records from the database I get with QueryDataSet
2) The expected results are written in the appropriate FlatXML in a file which I read in as a dataset as well
Basically 2 datasets can be compared this way.
Now the problem are columns with a Timestamp. They will never fit together with the expected dataset. I really would like to ignore them when comparing them, but it doesn't work the way I want it.
It does work, when I compare each table for its own with adding a column filter and ignoreColumns. However, this approch is very cumbersome, as many tables are used in that comparison, and forces one to add so much code, it eventually gets bloated.
The same applies for fields which have null-values
A probable solution would also be, if I had the chance to only compare the very first column of all tables - and not by naming it with its column name, but only with its column index. But there's nothing I can find.
Maybe I am missing something, or maybe it just doesn't work any other way than comparing each table for its own?
For the sake of completion some additional information must be posted. Actually my previously posted solution will not work at all as the process reading data from the database got me trapped.
The process using "QueryDataset" did read the data from the database and save it as a dataset, but the data couldn't be accessed from this dataset anymore (although I could see the data in debug mode)!
Instead the whole operation failed with an UnsupportedOperationException at org.dbunit.database.ForwardOnlyResultSetTable.getRowCount(ForwardOnlyResultSetTable.java:73)
Example code to produce failure:
QueryDataSet qds = new QueryDataSet(connection);
qds.addTable(“specificTable”);
qds.getTable(„specificTable“).getRowCount();
Even if you try it this way it fails:
IDataSet tmpDataset = connection.createDataSet(tablenames);
tmpDataset.getTable("specificTable").getRowCount();
In order to make extraction work you need to add this line (the second one):
IDataSet tmpDataset = connection.createDataSet(tablenames);
IDataSet actualDataset = new CachedDataSet(tmpDataset);
Great, that this was nowhere documented...
But that is not all: now you'd certainly think that one could add this line after doing a "QueryDataSet" as well... but no! This still doesn't work! It will still throw the same Exception! It doesn't make any sense to me and I wasted so much time with it...
It should be noted that extracting data from a dataset which was read in from an xml file does work without any problem. This annoyance just happens when trying to get a dataset directly from the database.
If you have done the above you can then continue as below which compares only the columns you got in the expected xml file:
// put in here some code to read in the dataset from the xml file...
// and name it "expectedDataset"
// then get the tablenames from it...
String[] tablenames = expectedDataset.getTableNames();
// read dataset from database table using the same tables as from the xml
IDataSet tmpDataset = connection.createDataSet(tablenames);
IDataSet actualDataset = new CachedDataSet(tmpDataset);
for(int i=0;i<tablenames.length;i++)
{
ITable expectedTable = expectedDataset.getTable(tablenames[i]);
ITable actualTable = actualDataset.getTable(tablenames[i]);
ITable filteredActualTable = DefaultColumnFilter.includedColumnsTable(actualTable, expectedTable.getTableMetaData().getColumns());
Assertion.assertEquals(expectedTable,filteredActualTable);
}
You can also use this format:
// Assert actual database table match expected table
String[] columnsToIgnore = {"CONTACT_TITLE","POSTAL_CODE"};
Assertion.assertEqualsIgnoreCols(expectedTable, actualTable, columnsToIgnore);

TSearch2 - dots explosion

Following conversion
SELECT to_tsvector('english', 'Google.com');
returns this:
'google.com':1
Why does TSearch2 engine didn't return something like this?
'google':2, 'com':1
Or how can i make the engine to return the exploded string as i wrote above?
I just need "Google.com" to be foundable by "google".
Unfortunately, there is no quick and easy solution.
Denis is correct in that the parser is recognizing it as a hostname, which is why it doesn't break it up.
There are 3 other things you can do, off the top of my head.
You can disable the host parsing in the database. See postgres documentation for details. E.g. something like ALTER TEXT SEARCH CONFIGURATION your_parser_config
DROP MAPPING FOR url, url_path
You can write your own custom dictionary.
You can pre-parse your data before it's inserted into the database in some manner (maybe splitting all domains before going into the database).
I had a similar issue to you last year and opted for solution (2), above.
My solution was to write a custom dictionary that splits words up on non-word characters. A custom dictionary is a lot easier & quicker to write than a new parser. You still have to write C tho :)
The dictionary I wrote would return something like 'www.facebook.com':4, 'com':3, 'facebook':2, 'www':1' for the 'www.facebook.com' domain (we had a unique-ish scenario, hence the 4 results instead of 3).
The trouble with a custom dictionary is that you will no longer get stemming (ie: www.books.com will come out as www, books and com). I believe there is some work (which may have been completed) to allow chaining of dictionaries which would solve this problem.
First off in case you're not aware, tsearch2 is deprecated in favor of the built-in functionality:
http://www.postgresql.org/docs/9/static/textsearch.html
As for your actual question, google.com gets recognized as a host by the parser:
http://www.postgresql.org/docs/9.0/static/textsearch-parsers.html
If you don't want this to occur, you'll need to pre-process your text accordingly (or use a custom parser).