Sitemap structure for local directory website (for search engine bots) - seo

I am working on a local listings website. I have to finalize the sitemap/link directory for the site so that search engine bots can crawl it effectively (the end user is supposed to search using text boxes).
I have come up with the following schema: Keywords (alphabetical order) => City (alphabetical order) => Locality (within the city above, in alphabetical order).
Please note that at the three levels described above there are no actual profiles, just more links. For example: keyword = xyz; city = New York, Detroit, California, Minnesota;
locality = list of localities for the city chosen above.
This can end in one of two ways:
1) At the third level I provide a link to the search results page (the same one an end user would see).
2) I simply list links to the listed entities in the sitemap itself, along with relevant info (which would give the bot additional content to determine each link's relevance).
Also, is there a penalty for going three levels deep here? Should I consider going with two?
Note: the search results page (as seen by the end user) uses pagination with links, not AJAX.

To finalize your sitemap, you can also put links on your keywords, cities and localities. It will help users navigate your website easily.
If you add a third level, you risk having too many links on a single page. Google may then ignore the page and not index it, so you would lose the chance of any SEO benefit.
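For illustration only, the three-level link directory could map onto URLs roughly like this (the paths and the "midtown" locality are hypothetical, not from the question; "xyz" and "Detroit" are the question's own examples):

/directory/                      -> keyword index, A-Z
/directory/xyz/                  -> keyword page: cities, A-Z
/directory/xyz/detroit/          -> city page: localities, A-Z
/directory/xyz/detroit/midtown/  -> locality page: either a link to the paginated
                                    search results page (option 1) or the listings
                                    themselves with short descriptions (option 2)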

How to get all Wikipedia page links with their pageIDs?

Issuing a request like this:
https://en.wikipedia.org/w/api.php?action=query&format=json&titles=Title&prop=links&pllimit=500
provides a list of the links that the page contains, where every link consists of the title and the ns (namespace).
Is there a way to also get the pageID together with title & ns? (The less work it is for the server the better, of course.)
You need to use the generator parameter. Here is an example for the Cobra Wikipedia page.
https://en.wikipedia.org/w/api.php?action=query&generator=links&titles=Cobra&prop=info&gpllimit=500
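If it helps, here is a minimal Python sketch of the same request (the requests library and the JSON handling are my additions, not part of the original answer):

import requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "generator": "links",   # treat every link on the page as a result page
    "titles": "Cobra",
    "prop": "info",         # prop=info adds pageid, ns and title for each link
    "gpllimit": "500",
    "format": "json",
}
data = requests.get(API, params=params).json()
for page in data["query"]["pages"].values():
    # Links to pages that do not exist come back flagged as "missing"
    # with no real pageid, hence .get() here.
    print(page.get("pageid"), page["ns"], page["title"])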

How to get all images in a category which are not in another one with the MediaWiki API?

I am entirely new to APIs, so sorry if the question is silly.
I would like to get all images in a category on Commons, let's say X, but exclude those which are also in another one (Y). I do not understand whether I can actually do this.
https://commons.wikimedia.org/w/api.php?action=query&list=categorymembers&cmtype=file&cmtitle=Category:X
will get all of them; how do I exclude some?
Moreover, I would like the result to include the description of the images, not just the name of the file. Is that possible?
MediaWiki has, by default, no built-in support for building and querying category intersections. To accomplish this task, extensions, external tools, or multiple API queries with result processing are required.
CirrusSearch API
On Wikimedia Commons, as on the whole Wikimedia wiki farm, CirrusSearch powers filtered search, including search for category intersections, and it is also available through the API (action=query&list=search&srsearch=incategory:A+-incategory:B, i.e. Category:A minus Category:B).
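For example, a concrete request against Commons could look roughly like this (the category names X and Y are placeholders; srnamespace=6 restricts the results to files):
https://commons.wikimedia.org/w/api.php?action=query&list=search&srsearch=incategory:X%20-incategory:Y&srnamespace=6&format=json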
FastCCI
One of the tools I can recommend (because it's a dedicated high-performance solution and actually running) is fastcci, developed by Daniel Schwen. Specifically for Wikimedia Commons, there is already a database maintained and a webservice running, but it's possible to set it up for any wiki, provided the tool set has a host to run on and database access.
Query
Consider the following query URL:
https://fastcci.wmflabs.org/?c1=3302993&c2=15516712&d1=0&d2=0&s=200&a=not&t=js
https://fastcci.wmflabs.org/ - the host that the Wikimedia Commons fastcci instance runs on
c1 - ID of category 1
c2 - ID of category 2
d1 - depth of category 1 to search in (fastcci by default considers sub-categories)
d2 - depth of category 2 to search in (fastcci by default considers sub-categories)
s - Number of results to return
o - Offset
a - conjunction (the example uses a=not, i.e. results in c1 but not in c2)
t - connection type (t=js for a JSONP response; otherwise assumes being used as websocket)
Response
fastcciCallback( [ 'RESULT 27572680,0,0|1675043,0,0|27577015,0,0|27577043,0,0|27577106,0,0|27576896,0,0|27576790,0,0|23481936,0,0|17560964,0,0|11009066,0,0', 'OUTOF 10', 'DBAGE 378310', 'DONE'] );
RESULT is followed by a |-separated list of up to 50 integer triplets of the form pageId,depth,tag. Each triplet stands for one image or category.
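As a rough sketch (only the wrapper name and the RESULT format shown in the sample response above are assumed), the JSONP payload could be unpacked in Python like this:

import re

body = "fastcciCallback( [ 'RESULT 27572680,0,0|1675043,0,0', 'OUTOF 10', 'DBAGE 378310', 'DONE'] );"

# Pull the quoted strings out of the callback's argument list.
parts = re.findall(r"'([^']*)'", body)
page_ids = []
for part in parts:
    if part.startswith("RESULT "):
        for triplet in part[len("RESULT "):].split("|"):
            page_id, depth, tag = (int(x) for x in triplet.split(","))
            page_ids.append(page_id)
print(page_ids)  # [27572680, 1675043]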
Resources
Sample client side implementation - to see it in action, just visit any category page and look next to the Good pictures button.
The example above is FilesOf('Category:Saaleck') minus FilesOf('Category:Rapeseed fields in Saxony-Anhalt').
Server application
Presentation on YouTube
Slides
A note on pageIDs
page IDs → page titles: GET /w/api.php?action=query&pageids=page_IDs_separated_by_pipe
page titles → page IDs: GET /w/api.php?action=query&titles=Titles_separated_by_pipe
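For example, reusing the category IDs and a category title from the query above (format=json is my addition, for readability):
https://commons.wikimedia.org/w/api.php?action=query&pageids=3302993|15516712&format=json
https://commons.wikimedia.org/w/api.php?action=query&titles=Category:Saaleck&format=json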
AFAIK, there is no way to get that directly using the API. But, assuming both categories are reasonably small, you could get all images from both of them and then compute the complement in your code.
To retrieve the description, you can use prop=imageinfo&iiprop=extmetadata&iiextmetadatafilter=ImageDescription.
In the context of your example query, it would look like this:
https://commons.wikimedia.org/w/api.php?action=query&generator=categorymembers&gcmtype=file&gcmtitle=Category:X&prop=imageinfo&iiprop=extmetadata&iiextmetadatafilter=ImageDescription
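A minimal Python sketch of that "compute the complement yourself" approach, assuming the requests library and the placeholder category names X and Y from the question (a real client would also follow the continue parameter instead of stopping at the first 500 members):

import requests

API = "https://commons.wikimedia.org/w/api.php"

def category_files(category):
    # Return the set of file titles in the given category (first 500 only).
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtype": "file",
        "cmtitle": category,
        "cmlimit": "500",
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    return {m["title"] for m in data["query"]["categorymembers"]}

# Complement: files in X that are not also in Y.
only_in_x = category_files("Category:X") - category_files("Category:Y")
print(sorted(only_in_x))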

SharePoint crawl rule to exclude AllItems.aspx, but get an item/document in search results if queried in the search box

I followed this blog (Tips 1) and created a crawl rule http://.*forms/allitems.aspx and ran a full crawl. I no longer get results with AllItems.aspx. However, if there is any document named Something.doc in a Document Library, it no longer gets pulled into the search results.
I think what I want is basic functionality: the user should not see AllItems.aspx in the search results, but should still get the item/document whose name is entered in the search box.
Please let me know if I am missing anything. I have already put in 24 hours and googled as much as I could.
It seems that an Index Reset is required. Here are the steps I did:
1. Add the following crawl rule to exclude: *://*allitems.aspx.
2. Index Reset.
3. Full Crawl.
I could not find a good way to do this using crawl rules. Instead, I opted to set up a restriction on the search results web part.
In the search results web part properties, select "Change Query"
Add a property filter to exclude anything with "AllItems" (and any other exclusions you want in place).
I used Steve Mann's blog as a reference and for the images: http://stevemannspath.blogspot.com/2013/04/sharepoint-2013-search-removing-junk.html

Query Wikipedia pages with properties

I need to use the Wikipedia API query module, or any other API such as OpenSearch, to query for a simple list of pages with some properties.
Input: a list of page (article) titles or ids.
Output: a list of pages that contain the following properties each:
page id
title
snippet/description (like in opensearch api)
page url
image url (like in opensearch api)
A result similar to this:
http://en.wikipedia.org/w/api.php?action=opensearch&search=miles%20davis&limit=20&format=xml
Only with page IDs, and not for a search, but rather for an exact list of pages specified by either titles or page IDs.
This should be a fairly simple thing, but I have been stuck on it for quite some time, trying all kinds of URL combinations from the MW API manual without success.
I don't think there is another way than the OpenSearch API to fetch OpenSearch data, but depending on which Wikipedia you are interested in, there might be other extensions installed to help you. Taking English Wikipedia as an example, we can make use of the MobileFrontend and PageImages extensions, which happen to be installed there.
Title and url are available from the native MediaWiki API. To get the url, you can use prop=info, and specify with inprop=url that it is the url you are interested in.
The prominent image of a page is returned by prop=pageimages, thanks to PageImages.
MobileFrontend adds a property called extracts, which you can use with the directive exintro to get the first paragraph. Note however that MediaWiki markup is complex, and the result might not always be perfect. If we put it all together in one single query, it would be something like this:
http://en.wikipedia.org/w/api.php?action=query&pageids=21482&prop=pageimages|info|extracts&inprop=url&exintro
giving this:
<api>
  <query>
    <pages>
      <page pageid="21482" ns="0" title="Nairobi" pageimage="Nairobi_Montage.jpg" contentmodel="wikitext" pagelanguage="en" touched="2014-02-06T06:10:01Z" lastrevid="594161616" counter="" length="89157" fullurl="http://en.wikipedia.org/wiki/Nairobi" editurl="http://en.wikipedia.org/w/index.php?title=Nairobi&action=edit">
        <thumbnail source="http://upload.wikimedia.org/wikipedia/commons/thumb/6/66/Nairobi_Montage.jpg/45px-Nairobi_Montage.jpg" width="45" height="50" />
        <extract xml:space="preserve">
          <p><b>Nairobi</b> /naɪˈroʊbi/ is the [...]
        </extract>
      </page>
    </pages>
  </query>
</api>
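A small Python sketch of the same combined query, pulling out the requested fields (the requests library and format=json are my additions; the API parameters are the ones shown above):

import requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "pageids": "21482",                  # or any pipe-separated list of IDs; "titles" works too
    "prop": "pageimages|info|extracts",
    "inprop": "url",                     # adds fullurl to each page
    "exintro": "1",                      # only the intro paragraph of the extract
    "format": "json",
}
pages = requests.get(API, params=params).json()["query"]["pages"]
for page in pages.values():
    print(page["pageid"], page["title"])
    print(page["fullurl"])                           # page url
    print(page.get("thumbnail", {}).get("source"))   # image url, if any
    print(page.get("extract", "")[:200])             # snippet/description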
Here is a multistep process to get a list of Wikipedia article titles and properties, and then get the page IDs and URLs for those articles.
Please note: It does use a portion of a previous answer: "Title and url are available from the native MediaWiki API. To get the url, you can use prop=info, and specify with inprop=url that it is the url you are interested in."
If you would like to use the Wikipedia API for your own applications and search Wikipedia for a list of articles about a certain topic, and you want the answer in JSON format, then you could use the following URL:
https://en.wikipedia.org/w/api.php?action=query&list=search&srsearch=REPLACE_ME_WITH_SEARCH_TOPIC&format=json&callback=?
If your eyes are having trouble parsing the results from that, then replace "format=json&callback=?" with "formatversion=2", as in the following example, to make the output easier to read:
https://en.wikipedia.org/w/api.php?action=query&list=search&srsearch=REPLACE_ME_WITH_SEARCH_TOPIC&formatversion=2
The following example will give me a batch list of article titles and properties about "Thailand" in JSON format, and after that I will use the resulting titles to find the page IDs and URLs of those articles.
URL step 1:
https://en.wikipedia.org/w/api.php?action=query&list=search&srsearch=thailand&format=json&callback=?
From step 1, I get the list of titles I need from the resulting JSON. In step 2, I use those titles in another API query to get the page IDs and URLs of those articles in the resulting JSON.
Here are the Wikipedia article titles from the resulting JSON of step 1:
Thailand
Outline of Thailand
Geography of Thailand
Economy of Thailand
Football in Thailand
Southern Thailand
Government of Thailand
Northern Thailand
Culture of Thailand
Cinema of Thailand
URL step 2:
https://en.wikipedia.org/w/api.php?action=query&titles=Thailand|Outline%20of%20Thailand|Geography%20of%20Thailand|Economy%20of%20Thailand|Football%20in%20Thailand|Southern%20Thailand|Government%20of%20Thailand|Northern%20Thailand|Culture%20of%20Thailand|Cinema%20of%20Thailand&prop=info&inprop=url&format=json&callback=?
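The same two steps, sketched in Python (the requests library is my addition; the search topic is the "thailand" example from above):

import requests

API = "https://en.wikipedia.org/w/api.php"

# Step 1: search for articles about the topic and collect their titles.
search = requests.get(API, params={
    "action": "query",
    "list": "search",
    "srsearch": "thailand",
    "format": "json",
}).json()
titles = [hit["title"] for hit in search["query"]["search"]]

# Step 2: feed those titles back in to get the page IDs and URLs.
info = requests.get(API, params={
    "action": "query",
    "titles": "|".join(titles),
    "prop": "info",
    "inprop": "url",
    "format": "json",
}).json()
for page in info["query"]["pages"].values():
    print(page["pageid"], page["fullurl"])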

Understanding RESTful: URIs for complex actions

I'm trying to build a RESTful service, and I've run into some problems. I'll describe these problems (questions) using an example of an imaginary RESTful service.
For example, I need a "News" service on my site. News can be of different types: local news and global news. News is added by an administrator. Users can view both local and global news (separately or all together). News is shown in pages. A user can view an individual news item.
So, I've built the following verb-noun table for this task:
GET /news - Get all news
POST /news - Create news
GET /news/{id} - Show the news with id={id}
PUT /news/{id} - Edit the news with id={id}
GET /news/{type}/{page}/{per_page} - Get news page #{page} of type {type}
GET /news/{page} - Get news page #{page} of both types
So, there are problems:
1) How to distinguish {page} and {id}? Maybe {id} can be only a number, while {page} is a string starting with 'p' (for example 'p1')?
2) The user can change the value of "per_page" (how many news items are shown on a page). Isn't /news/{type}/{page}/{per_page} too complicated? How can it be simplified?
3) How should the URLs in the browser look for this service? Won't the URLs differ from the URIs in the table above?
For example:
/news - Viewing news (1st page with default 'per_page' and default 'type')
/news/{type} - Viewing news (1st page with default 'per_page' and type={type})
/news/{id} - Viewing the specific news item with id={id}
/news/{type}/{page}/{per_page} - Viewing a specific page of news of a specific type.
4) Additional functionality, for example filtered search (getting news by date, author or title).
How to realize this with REST? How should the filter object (XML or JSON) be transmitted? What should the URL of a page with the filter results look like? /news/{date:12.12.2012,author:'admin'} or something better?
Sorry for my rough English; if you see grammar or other mistakes, feel free to correct them.
Thanks in advance.
I'd say you should use regular query params for the type, page and per_page. Type, page and per_page do not represent unique resources, but are rather filters on the collection of news resources. So I'd do:
/news
/news/{id}
/news?type={type}&page={page}&per_page={per_page}
Same for additional filtering.
Make sure to check out http://www.ics.uci.edu/~fielding/pubs/dissertation/evaluation.htm#sec_6_2
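As a rough illustration of that layout (Flask is just my choice for the sketch; the routes mirror the answer, and the default values are invented):

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/news")
def list_news():
    # type, page and per_page are filters on the collection,
    # so they arrive as query-string parameters.
    news_type = request.args.get("type")                   # e.g. "local" or "global"
    page = request.args.get("page", default=1, type=int)
    per_page = request.args.get("per_page", default=10, type=int)
    return jsonify({"type": news_type, "page": page, "per_page": per_page})

@app.route("/news/<int:news_id>")
def show_news(news_id):
    # A single news item is a unique resource, so its id stays in the path.
    return jsonify({"id": news_id})

# GET /news?type=local&page=2&per_page=20  -> the filtered collection
# GET /news/42                             -> one specific news item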
As Gordon wrote, you can use request params as normal. Remember that REST doesn't mean only clean and nice URLs.
So, leave the id and type parameters in the URI, but add the pagination params via the query string.
Also, to distinguish different URI parts, you could use the pattern used in Google's GData, i.e. params are preceded by their name:
/news
/news/id/{id}
/news/type/{type}
With some parsing on the server side, you could add many parameters, including optional ones, without enforcing an exact ordering.