Why is my resource pack saying "Unable to parse pack manifest with stack: * Line 9, Column 5 Missing '}' or object member name" [duplicate] - minecraft

When manually generating a JSON object or array, it's often easier to leave a trailing comma on the last item in the object or array. For example, code to output an array of strings might look like this (in C++-like pseudocode):
s.append("[");
for (i = 0; i < 5; ++i) {
s.appendF("\"%d\",", i);
}
s.append("]");
giving you a string like
["0","1","2","3","4",]
Is this allowed?

Unfortunately the JSON specification does not allow a trailing comma. There are a few browsers that will allow it, but generally you need to worry about all browsers.
In general I try to turn the problem around, and add the comma before the actual value, so you end up with code that looks like this:
s.append("[");
for (i = 0; i < 5; ++i) {
if (i) s.append(","); // add the comma only if this isn't the first entry
s.appendF("\"%d\"", i);
}
s.append("]");
That extra one line of code in your for loop is hardly expensive...
Another alternative I've used when outputting a structure to JSON from a dictionary of some form is to always append a comma after each entry (as you are doing above) and then add a dummy entry at the end that has no trailing comma (but that is just lazy ;->).
Doesn't work well with an array, unfortunately.
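For what it's worth, here's a minimal TypeScript sketch of that dummy-entry trick for an object (the "_" member name is just a placeholder I made up):
const record = { a: 1, b: 2 };
let s = "{";
for (const [key, value] of Object.entries(record)) {
    s += `"${key}":${JSON.stringify(value)},`;  // always append the comma
}
s += '"_":null}';  // dummy last member with no trailing comma
console.log(s);    // {"a":1,"b":2,"_":null}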

No. The JSON spec, as maintained at http://json.org, does not allow trailing commas. From what I've seen, some parsers may silently allow them when reading a JSON string, while others will throw errors. For interoperability, you shouldn't include them.
The code above could be restructured, either to remove the trailing comma when adding the array terminator or to add the comma before items, skipping that for the first one.

Simple, cheap, easy to read, and always works regardless of the specs.
$delimiter = '';
for .... {
print $delimiter.$whatever;
$delimiter = ',';
}
The redundant assignment to $delimiter is a very small price to pay.
Also works just as well if there is no explicit loop but separate code fragments.

Trailing commas are allowed in JavaScript object and array literals, but don't work in IE. Douglas Crockford's versionless JSON spec didn't allow them, and because it was versionless this wasn't supposed to change. Some JSON parsers (Firefox's among them) accepted them as an extension, but Crockford's RFC 4627 didn't, and Firefox reverted to disallowing them. Internet Explorer is why we can't have nice things.

As has already been said, the JSON spec (based on ECMAScript 3) doesn't allow a trailing comma. ES >= 5 allows it in JavaScript itself, so you can actually use that notation in pure JS. It's been argued about, and some parsers did support it (http://bolinfest.com/essays/json.html, http://whereswalden.com/2010/09/08/spidermonkey-json-change-trailing-commas-no-longer-accepted/), but it's a spec fact (as shown on http://json.org/) that it shouldn't work in JSON. That said...
... I'm wondering why nobody pointed out that you can actually split the loop at the 0th iteration and use a leading comma instead of a trailing one. That gets rid of the comparison code smell and of any actual performance overhead in the loop, resulting in code that's actually shorter, simpler and faster (due to no branching/conditionals in the loop) than the other solutions proposed.
E.g. (in a C-style pseudocode similar to OP's proposed code):
s.append("[");
// MAX == 5 here. if it's constant, you can inline it below and get rid of the comparison
if ( MAX > 0 ) {
s.appendF("\"%d\"", 0); // 0-th iteration
for( int i = 1; i < MAX; ++i ) {
s.appendF(",\"%d\"", i); // i-th iteration
}
}
s.append("]");

PHP coders may want to check out implode(). This takes an array and joins it up using a separator string.
From the docs...
$array = array('lastname', 'email', 'phone');
echo implode(",", $array); // lastname,email,phone
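For comparison, the same idea as a one-line TypeScript sketch:
console.log(["lastname", "email", "phone"].join(","));  // lastname,email,phone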

Interestingly, both C and C++ (and I think C#, but I'm not sure) specifically allow the trailing comma in initializer lists -- for exactly the reason given: it makes programmatically generating lists much easier. Not sure why JSON didn't follow their lead.

Rather than engage in a debating club, I would adhere to the principle of Defensive Programming by combining both simple techniques in order to simplify interfacing with others:
As a developer of an app that receives json data, I'd be relaxed and allow the trailing comma.
When developing an app that writes json, I'd be strict and use one of the clever techniques of the other answers to only add commas between items and avoid the trailing comma.
There are bigger problems to be solved...
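For illustration only, a TypeScript sketch of that policy; note the comma-stripping regex here is naive and assumes the payload never contains ",]" or ",}" inside string values:
// Liberal reader: accept a trailing comma by stripping it before parsing.
function parseLenient(text: string): unknown {
    try {
        return JSON.parse(text);                               // strict first
    } catch {
        return JSON.parse(text.replace(/,\s*([\]}])/g, "$1")); // retry without trailing commas
    }
}

// Strict writer: let the serializer place the commas for you.
const payload = JSON.stringify([0, 1, 2, 3, 4, 5]);   // "[0,1,2,3,4,5]"

console.log(parseLenient("[0,1,2,3,4,5,]"));          // [ 0, 1, 2, 3, 4, 5 ]
console.log(payload);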

Use JSON5. Don't use JSON.
Objects and arrays can have trailing commas
Object keys can be unquoted if they're valid identifiers
Strings can be single-quoted
Strings can be split across multiple lines
Numbers can be hexadecimal (base 16)
Numbers can begin or end with a (leading or trailing) decimal point.
Numbers can include Infinity and -Infinity.
Numbers can begin with an explicit plus (+) sign.
Both inline (single-line) and block (multi-line) comments are allowed.
http://json5.org/
https://github.com/aseemk/json5
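For what it's worth, a quick sketch of what usage could look like with the json5 npm package from the links above (its parse/stringify API mirrors the built-in JSON object):
import JSON5 from "json5";

// Several of the extensions listed above in one document:
const doc = `{
    unquoted: 'single-quoted string',
    hex: 0xC0FFEE,
    half: .5,
    // comments are allowed, and so is the trailing comma:
    list: [0, 1, 2, 3, 4, 5,],
}`;

console.log(JSON5.parse(doc));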

No. The "railroad diagrams" in https://json.org are an exact translation of the spec and make it clear a , always comes before a value, never directly before ]:
or }:

There is a possible way to avoid the if-branch in the loop.
s.append("[ "); // there is a space after the left bracket
for (i = 0; i < 5; ++i) {
s.appendF("\"%d\",", i); // always add comma
}
s.back() = ']'; // modify last comma (or the space) to right bracket

According to the Class JSONArray specification:
An extra , (comma) may appear just before the closing bracket.
The null value will be inserted when there is , (comma) elision.
So, as I understand it, it should be allowed to write:
[0,1,2,3,4,5,]
But it could happen that some parsers will return 7 as the item count (like IE8, as Daniel Earwicker pointed out) instead of the expected 6.
Edited:
I found this JSON Validator that validates a JSON string against RFC 4627 (The application/json media type for JavaScript Object Notation) and against the JavaScript language specification. There, an array with a trailing comma is considered valid for JavaScript but not for the RFC 4627 specification.
However, in the RFC 4627 specification is stated that:
2.3. Arrays
An array structure is represented as square brackets surrounding zero
or more values (or elements). Elements are separated by commas.
array = begin-array [ value *( value-separator value ) ] end-array
To me this is again an interpretation problem. If you write that "Elements are separated by commas" (without stating anything about special cases, like the last element), it could be understood in both ways.
P.S. RFC 4627 isn't a standard (as explicitly stated), and has already been obsoleted by RFC 7159 (which is a proposed standard).

It is not recommended (eval will execute any code embedded in the string), but you can still do something like this to parse it:
let jsonStr = '[0,1,2,3,4,5,]';
let data;
eval('data = ' + jsonStr);
console.log(data);
A safer route is to strip the trailing comma first and feed the result to JSON.parse, as other answers here show.

With Relaxed JSON, you can have trailing commas, or just leave the commas out. They are optional.
There is no reason at all commas need to be present to parse a JSON-like document.
Take a look at the Relaxed JSON spec and you will see how 'noisy' the original JSON spec is. Way too many commas and quotes...
http://www.relaxedjson.org
You can also try out your example using this online RJSON parser and see it get parsed correctly.
http://www.relaxedjson.org/docs/converter.html?source=%5B0%2C1%2C2%2C3%2C4%2C5%2C%5D

As stated, it is not allowed. But in JavaScript this works:
var a = Array()
for(let i=1; i<=5; i++) {
a.push(i)
}
var s = "[" + a.join(",") + "]"
(works fine in Firefox, Chrome, Edge, IE11, and without the let in IE9, 8, 7, 5)

From my past experience, I found that different browsers deal with trailing commas in JSON differently.
Both Firefox and Chrome handle it just fine. But IE (all versions) seems to break. I mean really break, and stop reading the rest of the script.
Keeping that in mind, and also the fact that it's always nice to write compliant code, I suggest spending the extra effort of making sure that there's no trailing comma.
:)

I keep a current count and compare it to a total count. If the current count is less than the total count, I display the comma.
May not work if you don't have a total count prior to executing the JSON generation.
Then again, if you're using PHP 5.2.0 or better, you can just format your response using the built-in JSON API.
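Going back to the counting idea, a minimal TypeScript sketch, assuming the total is known up front:
const items = [0, 1, 2, 3, 4, 5];
let s = "[";
for (let i = 0; i < items.length; i++) {
    s += JSON.stringify(items[i]);
    if (i < items.length - 1) s += ",";  // current count less than total: display the comma
}
s += "]";
console.log(s);  // [0,1,2,3,4,5]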

Since a for loop is used to iterate over an array, or a similar iterable data structure, we can use the length of the array to print the last field without a comma, as in this awk example that converts CSV rows to JSON. Note that the comma between output objects is printed when the next record begins, which also avoids a trailing comma after the last object:
awk -v header="FirstName,LastName,DOB" '
BEGIN {
FS = ",";
print("[");
columns = split(header, column_names, ",");
}
{
if (FNR > 1) print(" },");  # close the previous object; this puts the comma between objects
print(" {");
for (i = 1; i < columns; i++) {
printf(" \"%s\":\"%s\",\n", column_names[i], $(i));
}
printf(" \"%s\":\"%s\"\n", column_names[i], $(i));
}
END {
if (NR > 0) print(" }");  # close the last object, if any
print("]");
} ' datafile.txt
With datafile.txt containing:
Angela,Baker,2010-05-23
Betty,Crockett,1990-12-07
David,Done,2003-10-31

In Dart, join() places the separators for you:
String l = "[" + List<int>.generate(5, (i) => i + 1).join(",") + "]";

Using a trailing comma is not allowed for json. A solution I like, which you could do if you're not writing for an external recipient but for your own project, is to just strip (or replace by whitespace) the trailing comma on the receiving end before feeding it to the json parser. I do this for the trailing comma in the outermost json object. The convenient thing is then if you add an object at the end, you don't have to add a comma to the now second last object. This also makes for cleaner diffs if your config file is in a version control system, since it will only show the lines of the stuff you actually added.
char* str = readFile("myConfig.json");  // assumes a helper returning a writable, NUL-terminated buffer
char* chr = strrchr(str, '}');  // find the last closing brace
if (chr != NULL) {
int i = -1;  // start just before the brace
while (chr[i] == ' ' || chr[i] == '\n') {
i--;  // walk back over whitespace
}
if (chr[i] == ',') chr[i] = ' ';  // blank out the trailing comma, if any
}
JsonParser parser;  // whatever JSON parser you use
parser.parse(str);

I usually loop over the array and attach a comma after every entry in the string. After the loop I delete the last comma again.
Maybe not the best way, but less expensive than checking on every iteration whether it's the last object in the loop, I guess.
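A minimal TypeScript sketch of that append-then-delete approach:
const items = [0, 1, 2, 3, 4, 5];
let s = "";
for (const item of items) {
    s += JSON.stringify(item) + ",";  // comma after every entry
}
s = "[" + s.slice(0, -1) + "]";       // delete the last comma again (a no-op when s is empty)
console.log(s);  // [0,1,2,3,4,5]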

Related

Finding best delimiter by size of resulting array after split Kotlin

I am trying to determine the best delimiter for my CSV file, and I've seen answers that pick the delimiter which splits the header row into the most fields. Instead of the standard method, which would look something like this:
val supportedDelimiters: Array<Char> = arrayOf(',', ';', '|', '\t')
fun determineDelimiter(headerRow: String): Char {
var headerLength = 0
var chosenDelimiter = ' '
supportedDelimiters.forEach {
if (headerRow.split(it).size > headerLength) {
headerLength = headerRow.split(it).size
chosenDelimiter = it
}
}
return chosenDelimiter
}
I've been trying to do it with some in-built Kotlin collections methods like filter or maxOf, but to no avail (the code below does not work).
fun determineDelimiter(headerRow: String): Char {
return supportedDelimiters.filter({a,b -> headerRow.split(a).size < headerRow.split(b)})
}
Is there any way I could do it without forEach?
Edit: The header row could look something like this:
val headerRow = "I;am;delimited;with;'semi,colon'"
I put the quotes around an entry that could contain another potential delimiter.
You're mostly there, but this seems simpler than you think!
Here's one answer:
fun determineDelimiter(headerRow: String) =
supportedDelimiters.maxByOrNull { headerRow.split(it).size } ?: ' '
maxByOrNull() does all the hard work: you just tell it the number of headers that a delimiter would give, and it searches through each delimiter to find which one gives the largest number.
It returns null if the list is empty, so the method above returns a space character, like your standard method. (In this case we know that the list isn't empty, so you could replace the ?: ' ' with !! if you wanted that impossible case to give an error, or you could drop it entirely if you wanted it to give a null which would be handled elsewhere.)
As mentioned in a comment, there's no foolproof way to guess the CSV delimiter in general, and so you should be prepared for it to pick the wrong delimiter occasionally. For example, if the intended delimiter was a semicolon but several headers included commas, it could wrongly pick the comma. Without knowing any more about the data, there's no way around that.
With the code as it stands, there could be multiple delimiters which give the same number of headers; it would simply pick the first. You might want to give an error in that case, and require that there's a unique best delimiter. That would give you a little more confidence that you've picked the right one — though there's still no guarantee. (That's not so easy to code, though…)
Just like gidds said in the comment above, I would advise against choosing the delimiter based on how many times each delimiter appears. You would get the wrong answer for a header row like this:
Type of shoe, regardless of colour, even if black;Size of shoe, regardless of shape
In the above header row, the delimiter is obviously ; but your method would erroneously pick ,.
Another problem is that a header column may itself contain a delimiter, if it is enclosed in quotes. Your method doesn't take any notice of possible quoted columns. For this reason, I would recommend that you give up trying to parse CSV files yourself, and instead use one of the many available Open Source CSV parsers.
Nevertheless, if you still want to know how to pick the delimiter based on its frequency, there are a few optimizations to readability that you can make.
First, note that Kotlin strings are iterable; therefore you don't have to use an Array of Char. Use a String instead.
Secondly, all you're doing is counting the number of times a character appears in the string, so there's no need to break the string up into pieces just to do that. Instead, count the number of characters directly.
Third, instead of finding the maximum value by hand, take advantage of what the standard library already offers you.
const val supportedDelimiters = ",;|\t"
fun determineDelimiter(headerRow: String): Char =
supportedDelimiters.maxBy { delimiter -> headerRow.count { it == delimiter } }
fun main() {
val headerRow = "one,two,three;four,five|six|seven"
val chosenDelimiter = determineDelimiter(headerRow)
println(chosenDelimiter) // prints ',' as expected
}

What is the difference between ', ` and |, and when should they be used?

I've seen strings written in these three ways:
lv_str = 'test'
lv_str2 = `test`
lv_str3 = |test|
The only thing I've noticed so far is that ' sometimes trims whitespace, while ` preserves it.
I just recently found |, and don't know much about it yet.
Can someone explain, or post a good link here when which of these ways is used best and if there are even more ways?
|...| denotes ABAP string templates.
With string templates we can create a character string using texts, embedded expressions and control characters.
ABAP Docu
Examples
Use ' to define character-typed literals and non-integer numbers:
CONSTANTS some_chars TYPE char30 VALUE 'ABC'.
CONSTANTS some_number TYPE fltp VALUE '0.78'.
Use ` to define string-typed literals:
CONSTANTS some_constant TYPE string VALUE `ABC`.
Use | to assemble text:
DATA(message) = |Received HTTP code { status_code } with message { text }|.
This is an exhaustive list of the ways ABAP lets you define character sequences.
To answer the "when should they be used" part of the question:
` and | are useful if trailing spaces are needed (they are ignored with '; cf. this blog post for more information, but be careful: SCN nowadays renders the quotes badly, so the post is confusing):
DATA(arrival) = `Hello ` && `world`.
DATA(departure) = |Good | && |bye|.
Use string templates (|) rather than the combination of ` and && for easier reading (it remains very subjective; I tend to prefer |, and with my keyboard | is also easier to type):
DATA(arrival) = `Dear ` && mother_name && `, thank you!`.
DATA(departure) = |Bye { mother_name }, thank you!|.
Sometimes you don't have a choice: if a String data object is expected at a given position, then you must use ` or |. There are many other such cases.
In all other cases, I prefer to use ' (probably because it's even easier to type on my keyboard than |).
Although the other answers are helpful, they do not mention the most important difference between ' and `.
A character chain defined with single quotes is typed as C, with exactly the length of the chain, even including white space at the beginning and end of the character sequence.
So 'TEST' gets exactly the type c LENGTH 4,
whereas `TEST` will always evaluate to type string.
This is very important in a case like the following.
REPORT zutest3.
DATA i TYPE i VALUE 2.
DATA(l_test1) = COND #( WHEN i = 1 THEN 'ACT3' ELSE 'ACTA4').
DATA(l_test2) = COND #( WHEN i = 1 THEN `ACT3` ELSE `ACTA4`).
WRITE l_test1.
WRITE l_test2.
Since i is 2, both expressions take the ELSE branch, but l_test1 gets a fixed-length character type inferred from the literals, so its value can end up truncated to 'ACTA', while l_test2 is a string and keeps the full `ACTA4`.

Referencing nested arrays in awk

I'm creating a bunch of mappings that can be indexed into using 3 keys such as below:
mappings["foo"]["bar"]["blah"][1]=0
split( "10,13,19,49", mappings["foo"]["bar"]["blah"] )
I can then index into the nested array using for example
mappings[product][format][version][i]
But this is a bit long-winded when I need to refer to the same nested array several times, so in other languages I'd create a reference to the inner array:
map=mappings[product][format][version]
map[i]
However, I can't seem to get this to work in awk (gawk 4.1.3).
I can only find one link via Google, which suggests this was impossible in previous versions of awk and that a loop setting the keys and values one by one is the only solution. Is this still the case, or does anyone have a suggestion for a better solution?
https://developer.apple.com/library/archive/documentation/OpenSource/Conceptual/ShellScripting/Howawk-ward/Howawk-ward.html
EDIT
In response to comments a bit more background on what I'm trying to do. If there is a better approach, I'm all for using it!
I have set of CSV files that I'm feeding into AWK. The idea is to calculate a checksum based on specific columns after applying filtering to the rows.
The columns to checksum on, and the filtering to apply, are derived from runtime parameters sent into the script.
The runtime parameters are a triple of (product,format,version), hence my use of a 3-nested assoicative array.
Another approach would be to use the triple as a single key, rather than nesting, but gawk doesn't seem to natively support this, so I'd end up concatenating the values into a string. This felt a bit less structured to me, but if I'm wrong, I'm happy to change my mind on this approach.
Anyway, it is these parameters that are used to index into the array to structure to retrieve the column numbers, etc.
You can then build up a tree-like structure; for example, the below shows 2 formats for product foo on version blah, and so on:
mappings["product-foo"]["format-bar"]["version-blah"][1]=0
split( "10,13,19,49", mappings["product-foo"]["format-bar"]["version-blah"] )
mappings["product-foo"]["format-moo"]["version-blah"][1]=0
split( "55,23,14,6", mappings["product-foo"]["format-moo"]["version-blah"] )
The magic happens like this; you can see how long-winded the mappings indexing becomes without a reference:
(FNR>1 && (format!="some-format" ||
(version=="some-version" && $1=="some-filter") ||
(version=="some-other-version" && $8=="some-other-filter"))) {
# Loop over each supplied field summing an absolute tally for each
for (i=1; i <= length(mappings[product][format][version]); i++) {
sumarr[i] += ( $mappings[product][format][version][i] < 0 ? -$mappings[product][format][version][i]:$mappings[product][format][version][i] )
}
}
The comment from @ed-morton simplifies this as originally requested, but I'm interested if there is a simpler approach.
The right answer is from @ed-morton above (thanks!).
Ed - if you write it out as an answer I'll accept it; otherwise I'll accept this quote in a few days for good housekeeping.
Right, there is no array copy functionality in awk and there are no pointers/references, so you can't create a pointer to an array. You can of course create a function:
function map(i) { return mappings[product][format][version][i] }

Generating Random String of Numbers and Letters Using Go's "testing/quick" Package

I've been banging my head against this for a few days now and can't seem to figure it out. Perhaps it's glaringly obvious, but I don't seem to be able to spot it. I've read up on all the basics of Unicode, UTF-8, UTF-16, normalisation, etc., but to no avail. Hopefully somebody's able to help me out here...
I'm using Go's Value function from the testing/quick package to generate random values for the fields in my data structs, in order to implement the Generator interface for the structs in question. Specifically, given a Metadata struct, I've defined the implementation as follows:
func (m *Metadata) Generate(r *rand.Rand, size int) (value reflect.Value) {
value = reflect.ValueOf(m).Elem()
for i := 0; i < value.NumField(); i++ {
if t, ok := quick.Value(value.Field(i).Type(), r); ok {
value.Field(i).Set(t)
}
}
return
}
Now, in doing so, I'll end up with both the receiver and the return value being set with randomly generated values of the appropriate type (strings, ints, etc. in the receiver, wrapped in the returned reflect.Value).
Now, the implementation for the Value function states that it will return something of type []rune converted to type string. As far as I know, this should allow me to then use the functions in the runes, unicode and norm packages to define a filter which filters out everything which is not part of 'Latin', 'Letter' or 'Number'. I defined the following filter which uses a transform to filter out letters which are not in those character rangetables (as defined in the unicode package):
func runefilter(in reflect.Value) (out reflect.Value) {
out = in // Make sure you return something
if in.Kind() == reflect.String {
instr := in.String()
t := transform.Chain(norm.NFD, runes.Remove(runes.NotIn(rangetable.Merge(unicode.Letter, unicode.Latin, unicode.Number))), norm.NFC)
outstr, _, _ := transform.String(t, instr)
out = reflect.ValueOf(outstr)
}
return
}
Now, I think I've tried just about anything, but I keep ending up with a series of strings which are far from the Latin range, e.g.:
𥗉똿穊
𢷽嚶
秓䝏小𪖹䮋
𪿝ท솲
𡉪䂾
ʋ𥅮ᦸ
堮𡹯憨𥗼𧵕ꥆ
𢝌𐑮𧍛併怃𥊇
鯮
𣏲𝐒
⓿ꐠ槹𬠂黟
𢼭踁퓺𪇖
俇𣄃𔘧
𢝶
𝖸쩈𤫐𢬿詢𬄙
𫱘𨆟𑊙
欓
So, can anybody explain what I'm overlooking here and how I could instead define a transformer which removes/replaces non-letter/number/latin characters so that I can use the Value function as intended (but with a smaller subset of 'random' characters)?
Thanks!
Confusingly, the Generator interface needs the Generate method defined on the value type, not on a pointer to the type. You want your type signature to look like
func (m Metadata) Generate(r *rand.Rand, size int) (value reflect.Value)
You can play with this here. Note: the most important thing to do in that playground is to switch the type of the generate function from m Metadata to m *Metadata and see that Hi Mom! never prints.
In addition, I think you would be better served using your own type and writing a generate method for that type using a list of all of the characters you want to use. For example:
type LatinString string
const latin = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
and then use the generator
func (l LatinString) Generate(rand *rand.Rand, size int) reflect.Value {
var buffer bytes.Buffer
for i := 0; i < size; i++ {
buffer.WriteString(string(latin[rand.Intn(len(latin))]))
}
s := LatinString(buffer.String())
return reflect.ValueOf(s)
}
playground
Edit: also this library is pretty cool, thanks for showing it to me
The answer to my own question is, it seems, a combination of the answers provided in the comments by @nj_ and @jimb and the answer provided by @benjaminkadish.
In short, the answer boils down to:
"Not such a great idea as you thought it was", or "Bit of an ill-posed question"
"You were using the union of 'Letter', 'Latin' and 'Number' (Letter || Number || Latin), instead of the intersection of 'Latin' with the union of 'Letter' and 'Number' ((Letter || Number) && Latin)"
Now for the longer version...
The idea behind me using the testing/quick package is that I wanted random data for (fuzzy) testing of my code. In the past, I've always written the code for doing things like that myself, again and again. This meant a lot of the same code across different projects. Now, I could of course have written my own package for it, but it turns out that, even better, there's actually a standard package which does just about exactly what I want.
It does it very well, too: the codepoints in the strings it generates are actually random and not just restricted to what we're accustomed to using in everyday life. This is of course exactly what you want in fuzzy testing, in order to test the code with values outside the usual assumptions.
In practice, that means I'm running into two problems:
There are some limits on what I would consider reasonable input for a string. Meaning that, in testing the processing of a Name field or a URL field, I can reasonably assume there's not going to be a value like 'James Mc⌢' (let alone 'James Mc🙁') or 'www.🕸site.com', but just 'James McFrown' and 'www.website.com'. Hence, I can't expect a reasonable system to be able to support them. Of course, things shouldn't completely break down, but they also can't be expected to handle the former examples without any problems.
When I filter the generated strings down to values one might consider reasonable, the chance of ending up with a valid string is very small. The set of possible characters used by testing/quick is just so large (0x10FFFF codepoints) and the set of reasonable characters so small that you end up with empty strings most of the time.
So, what do we need to take away from this?
So, whilst I hoped to use the standard testing/quick package to replace my often-repeated code for generating random data for fuzzy testing, it does this so well that it provides data outside the range of what I would consider reasonable for the code to be able to handle. It seems that the choice, in the end, is to:
Either be able to actually handle all fuzzy options, meaning that if somebody's name is 'Arnold 💰💰' ('Arnold Moneybags'), it shouldn't go arse over end. Or...
Use custom/derived types with their own Generator. This means you're going to have to use the derived type instead of the basic type throughout the code. (Comparable to defining a string as wchar_t instead of char in C++ and working with those by default.). Or...
Don't use testing/quick for fuzzy testing, because as soon as you run into a generated string value, you can (and should) expect a very random string.
As always, further comments are of course welcome, as it's quite possible I overlooked something.

How to Parse Some Wiki Markup

Hey guys, given a data set in plain text such as the following:
==Events==
* [[312]] – [[Constantine the Great]] is said to have received his famous [[Battle of Milvian Bridge#Vision of Constantine|Vision of the Cross]].
* [[710]] – [[Saracen]] invasion of [[Sardinia]].
* [[939]] – [[Edmund I of England|Edmund I]] succeeds [[Athelstan of England|Athelstan]] as [[King of England]].
*[[1275]] – Traditional founding of the city of [[Amsterdam]].
*[[1524]] – [[Italian Wars]]: The French troops lay siege to [[Pavia]].
*[[1553]] – Condemned as a [[Heresy|heretic]], [[Michael Servetus]] is [[burned at the stake]] just outside [[Geneva]].
*[[1644]] – [[Second Battle of Newbury]] in the [[English Civil War]].
*[[1682]] – [[Philadelphia]], [[Pennsylvania]] is founded.
I would like to end up with an NSDictionary or other form of collection so that I can have the year (The Number on the left) mapping to the excerpt (The text on the right). So this is what the 'template' is like:
*[[YEAR]] – THE_TEXT
Though I would like the excerpt to be plain text, that is, no wiki markup so no [[ sets. Actually, this could prove difficult with alias links such as [[Edmund I of England|Edmund I]].
I am not all that experienced with regular expressions so I have a few questions. Should I first try to 'beautify' the data? For example, removing the first line which will always be ==Events==, and removing the [[ and ]] occurrences?
Or perhaps a better solution: should I do this in passes? So, for example, in the first pass I can separate each line into * [[710]] and [[Saracen]] invasion of [[Sardinia]]. and store them in different NSArrays.
Then go through the first NSArray of years and only get the text within the [[]] (I say text and not number because it can be 530 BC), so * [[710]] becomes 710.
And then for the excerpt NSArray, go through and if an [[some_article|alias]] is found, make it only be [[alias]] somehow, and then remove all of the [[ and ]] sets?
Is this possible? Should I use regular expressions? Are there any ideas you can come up with for regular expressions that might help?
Thanks! I really appreciate it.
EDIT: Sorry for the confusion, but I only want to parse the above data. Assume that that's the only type of markup that I will encounter. I'm not necessarily looking forward to parsing wiki markup in general, unless there is already a pre-existing library which does this. Thanks again!
This code assumes you are using RegexKitLite:
NSString *data = #"* [[312]] – [[Constantine the Great]] is said to have received his famous [[Battle of Milvian Bridge#Vision of Constantine|Vision of the Cross]].\n\
* [[710]] – [[Saracen]] invasion of [[Sardinia]].\n\
* [[939]] – [[Edmund I of England|Edmund I]] succeeds [[Athelstan of England|Athelstan]] as [[King of England]].\n\
*[[1275]] – Traditional founding of the city of [[Amsterdam]].";
NSString *captureRegex = @"(?i)(?:\\* *\\[\\[)([0-9]*)(?:\\]\\] \\– )(.*)";
NSRange captureRange;
NSRange stringRange;
stringRange.location = 0;
stringRange.length = data.length;
do
{
captureRange = [data rangeOfRegex:captureRegex inRange:stringRange];
if ( captureRange.location != NSNotFound )
{
NSString *year = [data stringByMatching:captureRegex options:RKLNoOptions inRange:stringRange capture:1 error:NULL];
NSString *textStuff = [data stringByMatching:captureRegex options:RKLNoOptions inRange:stringRange capture:2 error:NULL];
stringRange.location = captureRange.location + captureRange.length;
stringRange.length = data.length - stringRange.location;
NSLog(#"Year:%#, Stuff:%#", year, textStuff);
}
}
while ( captureRange.location != NSNotFound );
Note that you really need to study up on regexes to build these well, but here's what the one I have is saying:
(?i)
Ignore case, I could have left that out since I'm not matching letters.
(?:\* *\[\[)
?: means don't capture this block; I escape * to match it literally, then there are zero or more spaces (" *"), then I escape two brackets (since brackets are also special characters in a regex).
([0-9]*)
Grab anything that is a number.
(?:\]\] \– )
Here's where we ignore stuff again, basically matching " – ". Note that for any "\" in the regex, I have to add another one in the Objective-C string above, since "\" is a special character in a string... and yes, that means matching a regex-escaped single "\" ends up as "\\" in an Obj-C string.
(.*)
Just grab anything else; by default the regex engine will stop matching at the end of a line, which is why it doesn't simply match everything to the end of the file. You'll have to add code to strip out the [[LINK]] stuff from the text.
The NSRange variables are used to keep matching through the file without re-matching original matches. So to speak.
Don't forget after you add the RegExKitLite class files, you also need to add the special linker flag or you'll get lots of link errors (the RegexKitLite site has installation instructions).
I'm no good with regular expressions, but this sounds like a job for them. I imagine a regex would sort this out for you quite easily.
Have a look at the RegexKitLite library.
If you want to be able to parse Wikitext in general, you have a lot of work to do. Just one complicating factor is templates. How much effort do you want to put into coping with these?
If you're serious about this, you probably should be looking for an existing library which parses Wikitext. A brief look round finds this CPAN library, but I have not used it, so I can't cite it as a personal recommendation.
Alternatively, you might want to take a simpler approach and decide which particular parts of Wikitext you're going to cope with. This might be, for example, links and headings, but not lists. Then you have to focus on each of these and turn the Wikitext into whatever you want that to look like. Yes, regular expressions will help a lot with this bit, so read up on them, and if you have specific problems, come back and ask.
Good luck!