How to detect a section in a wikimedia page dump - wikimedia

I have looked around quite a bit to try and answer this question, but to no avail. I am parsing wikimedia page dumps to process certain pages (yes, I am aware of several tools to parse wikimedia page dumps, but they don't work for me as well as my parser).
The question is simple. I know how to detect the start of a section (e.g. "==External References=="). That's easy. What's not well defined is how to detect where a section ends. For most sections I can scan until the start of the next section header, but that isn't reliable. I looked at Wikimedia's help page on sections, but it doesn't say how to detect the end of a section.

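There is no "section end" marker in MediaWiki syntax. A section extends until the next section header of the same or a lower level number (that is, a header with the same number of = signs or fewer), or to the end of the page if there is no such header. (There is also a "section 0" containing all the text before the first section header.)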
Yes, this implies that sections at different levels can overlap, as in this example:
This text is in section 0.
== Section 1 begins here ==
This text is in section 1.
=== Section 2 begins here ===
This text is in sections 1 and 2.
=== Section 3 begins here ===
This text is in sections 1 and 3.
== Section 4 begins here ==
This text is in section 4.
Note that headings created using the HTML <h1>, <h2>, etc. tags don't begin or end sections, and won't have section edit links, even though they look otherwise identical to section headings.
Section headings inside templates do get section edit links, which let you edit the corresponding section of the template, but they're treated specially and are not considered part of the normal section structure of the containing page. There are also some weird special cases here involving section headers inside template parameters which I don't fully remember off the top of my head.
The automatically generated first level heading at the top of every page also doesn't count as a section heading, although any extra first level headings created with = Heading = do.
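The rule above (a section runs until the next header with the same number of = signs or fewer) can be sketched in Python as follows. This is a minimal sketch, not the MediaWiki parser: the function name find_sections and the heading regex are assumptions made for illustration, and the special cases mentioned above (HTML heading tags, templates, template parameters) are ignored.

```python
import re

# Assumption: a heading is a line like "== Title ==" with 1-6 '=' signs on
# each side. The real MediaWiki parser handles more cases (comments, nowiki,
# templates), which this sketch deliberately ignores.
HEADING_RE = re.compile(r"^(={1,6})(.+?)(={1,6})[ \t]*$")


def find_sections(wikitext):
    """Return (level, title, start_line, end_line) tuples; end_line is exclusive.

    Section 0 (the text before the first heading) is reported with
    level 0 and an empty title.
    """
    lines = wikitext.split("\n")

    # Collect every heading line with its level and position.
    headings = []
    for i, line in enumerate(lines):
        m = HEADING_RE.match(line)
        if m:
            level = min(len(m.group(1)), len(m.group(3)))
            headings.append((level, m.group(2).strip(), i))

    first_heading = headings[0][2] if headings else len(lines)
    sections = [(0, "", 0, first_heading)]

    for idx, (level, title, start) in enumerate(headings):
        end = len(lines)  # default: section runs to the end of the page
        for next_level, _, next_start in headings[idx + 1:]:
            if next_level <= level:  # same number of '=' signs or fewer
                end = next_start
                break
        sections.append((level, title, start, end))
    return sections
```

Run on the example above, the "Section 1" entry spans the lines of sections 2 and 3 as well, which is the overlap the answer describes.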

Related

Extract portion of HTML from website?

I'm trying to use VBA in Excel, to navigate a site with Internet explorer, to download an Excel file for each day.
After looking through the HTML code of the site, it looks like each day's page has a similar structure, but there's a portion of the website link that seems completely random. But this completely random part stays constant and does not change each time you want to load the page.
The following portion of the HTML code contains the unique string:
<a href="#" onClick="showZoomIn('222698519','b1a9134c02c5db3c79e649b7adf8982d', event);return false;
The part starting with "b1a" is what is used in the website link. Is there any way to extract this part of the page and assign it as a variable that I then can use to build my website link?
Since you don't show your code, I will also speak in general terms:
1) You get all the elements of type link (<a>) with Set allLinks = ie.document.getElementsByTagName("a"). This returns a collection containing all the links in the document.
2) You detect the precise link containing the information you want. Let's imagine it's the 4th one (you can parse the properties to check which one it is, in case it's dynamic):
Set myLink = allLinks(3) '<- 4th : index = 3 (starts from zero)
3) You get your token with a simple split function:
myToken = Split(myLink.onClick, "'")(3)
Of course, you can be more concise if the position of the link containing the token is always the same, for example always the 4th link:
myToken = Split(ie.document.getElementsByTagName("a")(3).onClick,"'")(3)
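For comparison, the same split-on-quote idea can be sketched outside VBA. The Python below is only an illustration under stated assumptions: it uses the standard library html.parser instead of Internet Explorer automation, the class and function names (OnclickCollector, extract_token) are hypothetical, and it picks the link by looking for showZoomIn in the onclick attribute rather than by position.

```python
from html.parser import HTMLParser


class OnclickCollector(HTMLParser):
    """Collect the onclick attribute of every <a> tag that calls showZoomIn."""

    def __init__(self):
        super().__init__()
        self.onclicks = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # html.parser lowercases attribute names, so onClick becomes onclick.
        for name, value in attrs:
            if name == "onclick" and value and "showZoomIn" in value:
                self.onclicks.append(value)


def extract_token(html):
    """Return the second quoted argument of the first showZoomIn call, or None."""
    collector = OnclickCollector()
    collector.feed(html)
    collector.close()
    if not collector.onclicks:
        return None
    # Splitting on the single quote puts the arguments at the odd indexes:
    # parts[1] is the first argument, parts[3] the second (the token).
    parts = collector.onclicks[0].split("'")
    return parts[3] if len(parts) > 3 else None
```

Selecting the link by its content instead of a fixed index avoids breaking when the page adds or removes other links.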

SSRS Loop through data, adding a new title if the chapter/section changes

I'm building an SSRS report in Report Builder 3.0 (2014). I have five fields of data I'm working with: inspection number, chapter, section, code, and description. The report is only going to show data for one inspection, so all data is filtered on InspNo first. There are often multiple codes associated with an InspNo.
What I have: for every code associated with an inspection, the code is listed along with its description.
What I need: I need to add the chapter and section info, but only when it changes. For example, let's say the codes associated with an inspection are 302.7, 304.10, 304.12, and 505.1. I would like the result to be as follows:
Chapter 3
Section 302
302.7 - Description
Section 304
304.10 - Description
304.12 - Description
Chapter 5
Section 505
505.1 - Description
I have tried using Lists, but the chapter and section get repeated for every code. Any ideas how to make it work?
****UPDATE****
I'm getting closer to a solution. Right now I'm using a combination of textboxes and lists. The chapters and sections are text boxes, and the codes/descriptions are lists. All of the elements have a visibility expression using InStr. The lists are working perfectly. However the text boxes are giving me issues.
It seems my visibility expressions on elements outside of lists are only looking at the first piece of the pulled data. In the example above, Section 302's visibility expression is =IIF(InStr(Fields!FAILEDCODE.Value, "302") > 0, False, True). This is working great because the first code is 302.7. Section 304's visibility expression is =IIF(InStr(Fields!FAILEDCODE.Value, "304") > 0, False, True). This text box is always hidden. It seems like Report Builder is only checking this InStr value against the first line of data, not the entire set. Does anyone know if this is accurate or if there's a workaround?
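To make the desired "print the header only when it changes" behaviour concrete, here is a minimal Python sketch of the grouping logic. It is not SSRS code and says nothing about how Report Builder evaluates visibility expressions; the function name render_report and the row layout (chapter, section, code, description) are assumptions for illustration, and the rows are assumed to be already sorted by chapter and section.

```python
from itertools import groupby


def render_report(rows):
    """Return report lines, emitting chapter/section headers only on change.

    rows: iterable of (chapter, section, code, description) tuples,
    assumed to be sorted by chapter and then section (groupby needs this).
    """
    lines = []
    for chapter, chapter_rows in groupby(rows, key=lambda r: r[0]):
        lines.append(f"Chapter {chapter}")
        for section, section_rows in groupby(chapter_rows, key=lambda r: r[1]):
            lines.append(f"Section {section}")
            for _, _, code, description in section_rows:
                lines.append(f"{code} - {description}")
    return lines
```

The nesting mirrors the wanted layout: one level per header that should appear once, with the detail rows at the innermost level.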

Netsuite PDF Templating: get number of pages as attribute

I am templating PDFs in Netsuite using Freemarker and I want to display the footer only on the last page. I have been doing some research, but couldn't find a solution (it looks like the environment does not allow me to include or import libs), so I thought that comparing the current page number with the total number of pages in an if tag would be a nice and easy workaround. I already know how to display the numbers by using the <pagenumber/> and <totalpages/> tags, but I still cannot get them as values so I can use them like this:
<#if (pagenumber == totalpages) >
... footer html...
</#if>
Any ideas how or where I can get those values from?
The approach you are trying won't work, because you are mixing BFO and Freemarker syntax. Netsuite uses two different "engines" to process PDF Templates. The first step is Freemarker, which merges the record fields with your template and produces an XML file, which is then converted by BFO into a PDF file. The <totalpages/> element is meaningless to Freemarker, as it is only converted into a number by BFO later.
Unfortunately, adding a footer to only the last page of a document is currently not supported by BFO, as per the BFO FAQ:
At the moment we do not have a facility for explicitly assigning a footer or header to the last page in a document when the number of pages is unknown.
You CAN add it after a page break, with the page break placed at the end of the body:
<pbr footer="nlfooter" footer-height="25%"></pbr>
</body>
The issue here is that even a one-page output will produce at least two pages, because the page break always adds a page for the disclaimer/footer.

Section content using MediaWiki API

I'm using the MediaWiki API to get the content of a Wikipedia page like this in JSON.
http://en.wikipedia.org/w/api.php?format=json&action=query&titles=New_York&prop=extracts
I'd like each section to be separated out instead of having the entire content of the page as one value. I know you can get each section like this but I want it to also include the content with each section.
http://en.wikipedia.org/w/api.php?format=json&action=parse&prop=sections&page=New_York
Is this possible to do with the API?
If you know the number of the section which you want you can get the contents through action=parse with the section parameter. E.g. the "19th century" section of the New_York article would be:
https://en.wikipedia.org/w/api.php?action=parse&page=New_York&format=json&prop=wikitext&section=4
To get the section number you can use
http://en.wikipedia.org/w/api.php?format=json&action=parse&prop=sections&page=New_York
and then find the index corresponding to your section title (line). In this case "line":"19th century","index":"4".
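The two-step lookup described above can be sketched in Python using only the standard library. This is a sketch under assumptions: the helper names (build_url, find_section_index, fetch_json) and the User-Agent string are hypothetical, and the code relies only on the "line" and "index" fields quoted above being present in each entry of the sections list.

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"


def build_url(**params):
    """Build an API URL from keyword parameters, keeping their order."""
    return API + "?" + urllib.parse.urlencode(params)


def find_section_index(parse_response, title):
    """Return the "index" of the section whose "line" equals title, or None."""
    for section in parse_response.get("parse", {}).get("sections", []):
        if section.get("line") == title:
            return section.get("index")
    return None


def fetch_json(url):
    """Fetch a URL and decode the JSON body (performs a network request)."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "section-sketch/0.1 (example)"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Step 1: list the sections of the page.
sections_url = build_url(action="parse", page="New_York", format="json", prop="sections")
# Step 2 (after looking up the index): request that section's wikitext.
section_url = build_url(action="parse", page="New_York", format="json", prop="wikitext", section="4")
```

In practice you would call fetch_json(sections_url), pass the result to find_section_index, and use the returned index as the section parameter of the second request.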

PDF - why is there no standard structure element for a page?

The PDF Spec defines standard structure types, used to define a structure tree for the document. As far as I can see, there is no element related to pages. Here are the standard structure types for grouping elements:
Document
Part
Art
Sect
Div
...and so on...
Why is there no Page item in this list?
If you want your structure to use pages, what should be used? Part? Sect? Div?
PDF tags exist so that the content type / meaning of elements can be identified. They should be considered a kind of "meta" information for the PDF, simply providing context for the content in a file (so that content can be easily extracted, converted, processed, made accessible, etc.). Think of it as the table of contents of a book. Just because the book has x pages doesn't mean that the content structure would change if the book's page height were cut in half and it now had 2x pages.
A Page Object in the PDF Document Structure already groups elements (by nature of each element being on a given page), so doing so in this structure would be a little redundant.
Also, consider this case:
Document
  Table of Contents (Page 1)
  Section 1 (starts on page 2, ends mid page 3)
    Sub Section (page 2)
    Sub Section (half of page 3)
  Section 2 (starts mid page 3)
  etc...
In this example, Section 1 and Section 2 couldn't both be direct parents of page 3 (not to mention that Section 1 spans two different pages). Additionally, trying to solve this problem really isn't necessary, because each of the elements being grouped here is already a child of its respective Page node in the Document Structure of the actual file format.
Appendix G of the PDF Specification gives examples that demonstrate use of the Page object.
The PDF has a tree structure (which is what allows it to load any page so fast). The content does not have any structure unless you choose to use the marked content feature of the format, which then allows metadata to be included in the data.