Wikipedia API - How to get rid of Wikipedia hyperlinks/junk

I'm currently using the Wikipedia API to get some content that I can use on my website. At the moment, the content I get back is all in HTML or wikitext (both containing Wikipedia hyperlinks and a lot of junk in the text). Is there a way around this to get just plain text without all this junk?
I have tried requesting the HTML and converting it into plain text, but that still contains all of the wiki junk. I want to create a universal method that removes all the junk, since I want to be able to call multiple different Wikipedia pages and get plain text for all of them.
HTML:
https://en.wikipedia.org/w/api.php?action=parse&format=json&page=Game_of_Thrones_(season_3)&prop=text&section=1&disabletoc=1
Wikitext:
https://en.wikipedia.org/w/api.php?action=parse&format=json&page=Game_of_Thrones_(season_3)&prop=wikitext&section=1&disabletoc=1
I hope this makes sense; any advice/guidance is greatly appreciated.
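For what it's worth, here is a minimal sketch of the "fetch HTML, then convert to plain text" approach in PHP, using only built-in functions and the HTML endpoint above. The final preg_replace lines are my own rough cleanup for leftover reference markers and blank lines, not a complete fix for the junk problem:

<?php
// Sketch only: fetch the HTML output of the parse API call above and reduce it to plain text.
$url = 'https://en.wikipedia.org/w/api.php?action=parse&format=json'
     . '&page=Game_of_Thrones_(season_3)&prop=text&section=1&disabletoc=1';
$response = json_decode( file_get_contents( $url ), true );
$html = $response['parse']['text']['*'];            // HTML of the requested section
$text = html_entity_decode( strip_tags( $html ) );  // drop tags, decode entities
$text = preg_replace( '/\[\d+\]/', '', $text );     // drop reference markers like [1]
$text = trim( preg_replace( '/\n{2,}/', "\n", $text ) ); // collapse blank lines
echo $text;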

Related

Get text from a section on some page

I know how to make an API call to get the text of the whole page, like this, but is there a way (without having to parse through the wiki markup) to get only the text from a certain section?
If you look at the documentation for the revisions module, you'll notice that it has a parameter, rvsection, which is exactly what you want. So, for example, to retrieve the lead section, use
http://en.wikipedia.org/w/api.php?format=xml&action=query&titles=Stack%20Overflow&prop=revisions&rvprop=content&rvsection=0
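The same call also works with format=json if that is easier to consume. A rough PHP sketch of that variant (the array path below assumes the default, older response layout where the wikitext sits under the "*" key):

<?php
// Sketch: fetch only section 0 (the lead) of a page as wikitext via rvsection.
$url = 'https://en.wikipedia.org/w/api.php?format=json&action=query'
     . '&titles=Stack%20Overflow&prop=revisions&rvprop=content&rvsection=0';
$data = json_decode( file_get_contents( $url ), true );
$pages = $data['query']['pages'];            // single page, keyed by page id
$page  = reset( $pages );
$leadWikitext = $page['revisions'][0]['*'];  // content of section 0 only
echo $leadWikitext;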

SEO - META Tags and Google

I just found out that Google recently decided to start using its own "title" when displaying search results. Also, after checking Yahoo and Bing, I saw that they both display their results in the same way as each other, but in a completely different way than Google.
I guess my question is whether there is an actual "correct" way of adding titles to my pages so that Google displays what I want it to, and so I get the same result as with Yahoo/Bing, which currently use the page's title as the search result heading (sometimes they pick up the first heading tag and use it as the title).
Any recommendations or links to follow for more studying would be appreciated.
There's nothing you can really do about it. Google will choose what title to display based on criteria they have not made public. This is usually the page's title as found in the <title> tag, but if Google feels a different title better summarizes the page's content, they may choose to display something else.
You can try to change your page titles to better reflect the page's content and see if that helps.
Use optimal keyword prominence in meta tags according to the guidelines... and Google will pick up your meta tags. See our news portal's source and metas (keywords: hírek ("news"), választás 2014 ("election 2014"), etc.): http://valasztas2014.hir24.hu/
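For reference, a minimal sketch of a page head in plain PHP templating; $pageTitle and $pageSummary are placeholders of my own, not anything site-specific. This gives Google, Yahoo and Bing a clear title and description to pick up, though Google may still rewrite the displayed title if it thinks something else summarizes the page better:

<?php
// Placeholder values: in a real site these would come from your CMS or database.
$pageTitle   = 'Election 2014 news';
$pageSummary = 'Latest election 2014 coverage, results and analysis.';
?>
<title><?php echo htmlspecialchars( $pageTitle ); ?></title>
<meta name="description" content="<?php echo htmlspecialchars( $pageSummary ); ?>">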

SEO: Can dynamically generated links be crawled?

I have a page containing <div> tags with onclick="" code that makes an AJAX request to get JSON data, then iterates through the results to build links (<a />) that are appended to the page. These links do not exist anywhere else on my website. How can I make these dynamically generated links crawlable?
My initial thought was to turn the <div> tags into <a> tags with href="#", but with my limited knowledge of how typical crawlers work, I don't think this would solve my problem, since the "#" would be what's recognized by the crawler, not necessarily the dynamically generated output. Beyond that, I don't want the scroll position to be altered at all, which would also rule out giving the <a> tag an id and having it reference itself.
Do I have any options aside from making a new page containing all of the links I need to be crawled? Thanks.
As a general rule, content that is created or made available through JavaScript cannot be found or indexed by search engines. Google does support crawlable AJAX, but using it as the only means of accessing your content is bad for accessibility. Also, other search engines can't get to that content, which is also not a good thing. Basically, crawlable AJAX is a bad idea.
You should always make your content available without requiring JavaScript to get it. Then you can improve your site by adding JavaScript to make getting the content faster or easier. This is called Progressive Enhancement and is how good websites are built.
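A rough sketch of what that can look like in practice, here in PHP, with links.json standing in for whatever data source your AJAX endpoint reads (the names are placeholders, not your actual setup). The links are rendered into the HTML on the server, so crawlers and users without JavaScript see them, and your existing script can still enhance the list afterwards:

<?php
// Render the same links server-side that the onclick/AJAX code builds client-side.
// 'links.json' is a stand-in for the data source the AJAX endpoint reads.
$items = json_decode( file_get_contents( 'links.json' ), true );
?>
<ul id="related-links">
<?php foreach ( $items as $item ): ?>
    <li><a href="<?php echo htmlspecialchars( $item['url'] ); ?>">
        <?php echo htmlspecialchars( $item['title'] ); ?></a></li>
<?php endforeach; ?>
</ul>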

How does Google retrieve the website description?

Do you know how Google retrieves the description of a website shown in its search results? Is it the meta description? The first paragraph?
Their algorithms aren't officially released to the public, but if there is a meta description tag, it takes that. Otherwise it generally depends on where the keywords lie within the body of the webpage. If someone is searching for "foo", a paragraph with foo in it will likely appear, with foo highlighted in bold.
Search engines (including Google) crawl the first introductory paragraph of a page or post and take that excerpt to use as the description when search results are shown. But there's a precaution you should take to stay SEO friendly: if you start your page/post with an image, it can negatively affect the SEO of that page, because search results are text and search engines can't derive a text description from an image. In the case of WordPress, use the All in One SEO Pack plugin to control the description if you are starting your post/page with an image.

Mediawiki + Lucene: How To Strip Markup?

I have the Lucene search extension (http://www.mediawiki.org/wiki/Extension_talk:Lucene-search) integrated with my MediaWiki installation. It's all working really well; however, Lucene seems to have indexed all the MediaWiki/HTML markup as well, and it is showing up in the results.
E.g. searching for "green" will return results with markup such as style="background:green; color:white
Is there a way to strip the search results of all the markup? I believe Wikipedia uses the same search plugin; how are they doing it?
You will probably have to transform the raw wiki markup before indexing it with Lucene. When dealing with pure XML content, it's possible to just use an XSL transform with <xsl:value-of select="text()"/> to extract the text content.
I'm afraid that won't work for wiki markup, but maybe you can capture the page post-HTML transformation?
I found a solution to part of the problem. The following change will remove HTML markup from the search results. I have not been able to remove Wikitext markup yet. Any tips on that would be appreciated. Note that I do not use the Lucene search extension.
Open /includes/search/SearchEngine.php
In that file, there is a class defined - SearchResult
SearchResult.getTextSnippet() contains the code to format the search results
SearchResult->mText contains the text blurb from the search results
To fix the problem, simply go into SearchEngine.php and find the method called getTextSnippet(), then add the following line before the "if":
$this->mText = strip_tags( $this->mText );
I found this solution on this random Wiki: http://www.myrandomwiki.com/wiki/MediaWiki_Notes#Strip_HTML_From_Search_Results
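Building on that, here is a hedged sketch of the same idea extended to knock out some common wikitext tokens as well. The regexes are my own rough guesses at the markup left in the blurb, not something from MediaWiki or the Lucene extension, so test them against your own snippets:

<?php
// Rough cleanup helper: strip HTML tags, then remove a few common wikitext tokens.
// The patterns are approximate and will not handle nested templates correctly.
function stripSnippetMarkup( $text ) {
    $text = strip_tags( $text );                                             // HTML tags
    $text = preg_replace( '/\{\{[^{}]*\}\}/', '', $text );                   // simple {{templates}}
    $text = preg_replace( '/\[\[(?:[^|\]]*\|)?([^\]]*)\]\]/', '$1', $text ); // [[links|label]] -> label
    $text = preg_replace( "/'{2,}/", '', $text );                            // ''italic'' / '''bold'''
    return trim( $text );
}
// e.g. inside getTextSnippet(): $this->mText = stripSnippetMarkup( $this->mText );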