list=alllinks confusion - api

I'm doing a research project for the summer and I've got to use get some data from Wikipedia, store it and then do some analysis on it. I'm using the Wikipedia API to gather the data and I've got that down pretty well.
What my questions is in regards to the links-alllinks option in the API doc here
After reading the description, both there and in the API itself (it's down and bit and I can't link directly to the section), I think I understand what it's supposed to return. However when I ran a query it gave me back something I didn't expect.
Here's the query I ran:
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=google&rvprop=ids|timestamp|user|comment|content&rvlimit=1&list=alllinks&alunique&allimit=40&format=xml
Which in essence says: Get the last revision of the Google page, include the id, timestamp, user, comment and content of each revision, and return it in XML format.
The allinks (I thought) should give me back a list of wikipedia pages which point to the google page (In this case the first 40 unique ones).
I'm not sure what the policy is on swears, but this is the result I got back exactly:
<?xml version="1.0"?>
<api>
<query><normalized>
<n from="google" to="Google" />
</normalized>
<pages>
<page pageid="1092923" ns="0" title="Google">
<revisions>
<rev revid="366826294" parentid="366673948" user="Citation bot" timestamp="2010-06-08T17:18:31Z" comment="Citations: [161]Tweaked: url. [[User:Mono|Mono]]" xml:space="preserve">
<!-- The page content, I've replaced this cos its not of interest -->
</rev>
</revisions>
</page>
</pages>
<alllinks>
<!-- offensive content removed -->
</alllinks>
</query>
<query-continue>
<revisions rvstartid="366673948" />
<alllinks alfrom="!2009" />
</query-continue>
</api>
The <alllinks> part, its just a load of random gobbledy-gook and offensive comments. No nearly what I thought I'd get. I've done a fair bit of searching but I can't seem to find a direct answer to my question.
What should the list=alllinks option return?
Why am I getting this crap in there?

You don't want a list; a list is something that iterates over all pages. In your case you simply "enumerate all links that point to a given namespace".
You want a property associated with the Google page, so you need prop=links instead of the alllinks crap.
So your query becomes:
http://en.wikipedia.org/w/api.php?action=query&prop=revisions|links&titles=google&rvprop=ids|timestamp|user|comment|content&rvlimit=1&format=xml

Related

How can I use pagemap with google custom search refinements?

I'm trying to add refinements to my google custom search.
I have meta tags on just about every page of the site, such as
<meta name="type-id" content="241" />
Where there are many different types, and I want to have one refinement for each type.
In the docs, it says
You can also use these more:pagemap: operators with refinement labels
But I have been unable to do that.
Note that I have had success using more:pagemap:metatags-type-id:241 in the search input, or as a webSearchQueryAddition - but despite googles docs, I haven't been able to get it to work with a refinement.
Here's a sample from my cse.xml (removing some attributes from the CustomSearchEngine tag):
<?xml version="1.0" encoding="UTF-8"?>
<CustomSearchEngine>
<Title>Test</Title>
<Context>
<Facet>
<FacetItem>
<Label name="videos" mode="FILTER">
<Rewrite>more:p:metatags-article-keyword:121</Rewrite>
</Label>
<Title>Videos</Title>
</FacetItem>
</Facet>
</Context>
</CustomSearchEngine>
Is this supposed to work? Am I using wrong syntax in the rewrite rule? Has anyone else done something like this?
Your label in the facet should be mode="BOOST" if you want to restrict to a structured data field within the scope of your engine.
<Facet>
<FacetItem>
<Label name="videos" mode="BOOST">
<Rewrite>more:p:metatags-article-keyword:121</Rewrite>
</Label>
<Title>Videos</Title>
</FacetItem>
</Facet>

How to get more info within only one geosearch call via Wikipedia API?

I am using an API call similar to http://en.wikipedia.org/w/api.php?action=query&list=geosearch&gsradius=10000&gscoord=41.426140|26.099319.
I returns something like this
<?xml version="1.0"?>
<api>
<query>
<geosearch>
<gs pageid="27460829" ns="0" title="Kostilkovo" lat="41.416666666667" lon="26.05" dist="4245.1" primary="" />
<gs pageid="27460781" ns="0" title="Belopolyane" lat="41.45" lon="26.15" dist="4988.7" primary="" />
<gs pageid="27460862" ns="0" title="Siv Kladenets" lat="41.416666666667" lon="26.166666666667" dist="5713.5" primary="" />
<gs pageid="13811116" ns="0" title="Svirachi" lat="41.483333333333" lon="26.116666666667" dist="6521.9" primary="" />
<gs pageid="27460810" ns="0" title="Gorno Lukovo" lat="41.366666666667" lon="26.1" dist="6613.4" primary="" />
<gs pageid="27460799" ns="0" title="Dolno Lukovo" lat="41.366666666667" lon="26.083333333333" dist="6746.2" primary="" />
<gs pageid="27460827" ns="0" title="Kondovo" lat="41.433333333333" lon="26.016666666667" dist="6937" primary="" />
<gs pageid="27460848" ns="0" title="Plevun" lat="41.45" lon="26.016666666667" dist="7383.1" primary="" />
<gs pageid="24179704" ns="0" title="Villa Armira" lat="41.499069444444" lon="26.106263888889" dist="8130" primary="" />
<gs pageid="27460871" ns="0" title="Zhelezari" lat="41.413333333333" lon="25.998333333333" dist="8540.1" primary="" />
</geosearch>
</query>
</api>
But while I am actually trying to get some pictures of those pages, subsequent calls are needed, like
to get some page images
http://en.wikipedia.org/w/api.php?action=query&prop=images&pageids=13843906
then, to get image info
http://en.wikipedia.org/w/api.php?action=query&titles=File:Alexandru_Ioan_Cuza_Dealul_Patriarhiei.jpg&prop=imageinfo&iiprop=url
Well, even if this gets me what I ultimately need, it is not efficient at all.
I would like to know if there are some parameters for this calls, or maybe completely other call(s) that would bring all this info in maximum 2 steps/calls. It would be great, though, if it would be only one.
Wow, I had no idea that such a feature exists nowadays! But to answer your question, since it's a list query, you can probably use it as a generator.
Let's try it:
Original geosearch query: http://en.wikipedia.org/w/api.php?action=query&list=geosearch&gsradius=10000&gscoord=41.426140|26.099319
Generator query to get images on matching pages: http://en.wikipedia.org/w/api.php?action=query&prop=images&imlimit=max&generator=geosearch&ggsradius=10000&ggscoord=41.426140|26.099319
The prop=images query can also be used as a generator, so you can also do this:
Get URLs for all images on a list of pages: http://en.wikipedia.org/w/api.php?action=query&prop=imageinfo&iiprop=url&generator=images&gimlimit=max&pageids=13811116|24179704|27460781|27460799|27460810|27460827|27460829|27460848|27460862|27460871
Alas, AFAIK you can't nest generators, so you can't do both steps in one query. You can either:
get the list of images in one query, and then use another query to get the URLs, or
start with the basic geosearch query to get the page IDs, and then get the images and their URLs in another query.
Alas, it turns out that both of these options fail to give you some information that you may want. If you use list=geosearch as a generator, you don't get the coordinate information that you may need if you e.g. wish to display the results on a map. On the other hand, using prop=images as a generator makes you miss out on something even more important: the knowledge of which images are used on which pages!
Thus, unfortunately, it seems that, if your goal is to place images on a map, you'll probably have to do it with three separate queries. At least you can still query multiple pages / images in one request, so you shouldn't need more than three (until you hit the query limits and need to use continuations, that is).
(Also, doing it in three steps lets you apply some filtering to the images before the third step. For example, most of the pages returned by your example query only have the same three images — Flag of Bulgaria.svg, Ivaylovgrad Reservoir.jpg and Oblast Khaskovo.png — all of which are used via templates, and none of which really look like good choices to represent the specific location.)
Ps. If you're just interested in finding images near a particular location, even if they're not used on any specific Wikipedia article, you might want to try using geosearch directly on Wikimedia Commons. It doesn't seem to return any results for your Bulgarian example coordinates, but it works just fine in a more crowded location.
Here is an alternative to build on the previous answer. If you start with this query as a partial answer:
https://en.wikipedia.org/w/api.php?action=query&prop=images&imlimit=max&generator=geosearch&ggsradius=10000&ggscoord=41.426140|26.099319
Then you can build on this to get the information in a single query. The pageimages property can work with the generator. You cannot nest generators but you can chain properties. A query can use pageimages to get the page's main image url for each of the geosearch results. It looks like this:
https://en.wikipedia.org/w/api.php?action=query&prop=images|pageimages&pilimit=max&piprop=thumbnail&iwurl=&imlimit=max&generator=geosearch&ggsradius=10000&ggscoord=41.426140|26.099319
This query returns the image "File" names (images property) and a single URL for the main image (pageimages property). The main image of the page is all I need. You might be able to extrapolate the "file" urls by matching the changes from the file to the url that is output with the query but I cannot recommend such a hack.
The images property has a setting that is supposed to return urls for interwiki links, iwurl. I see the "file" as an interwiki link. This parameter is not working and images does not return a url. Playing on the sandbox might lead you to a better answer.
Intuitively it seems like you should be able to chain the images and imageinfo properties together. Doing so does not give the expected results.
If a single url for the main image of the page is not enough I can encourage you to play in the API sandbox to try and get what you need with some combination of properties. I am using the geosearch generator and get the page image, text description, and lat/long coordinates so that I can get the address. Good luck!

Shopify: Is it possible to find a product by sku number?

The data that I'm getting only contains the SKU numbers. I am trying to figure out how I can link these SKU numbers to the product variants in Shopify without the actual product id number.
Example data:
<Inventory ItemNumber="100B3001-B-01">
<ItemStatus Status="Avail" Quantity="0" />
</Inventory>
<Inventory ItemNumber="100B3001-B-02">
<ItemStatus Status="Avail" Quantity="0" />
</Inventory>
<Inventory ItemNumber="100B3001-B-03">
<ItemStatus Status="Avail" Quantity="-1" />
<ItemStatus Status="Alloc" Quantity="1" />
</Inventory>
<Inventory ItemNumber="100B3001-B-04">
<ItemStatus Status="Avail" Quantity="-1" />
<ItemStatus Status="Alloc" Quantity="1" />
</Inventory>
Here's a delightful, condescending discussion from Shopify employees in 2011 asking why you can't just store the Shopify ID everywhere. The "stock-keeping unit" is a universal systems integration point and, in every system I've seen, each SKU uniquely maps to a product because words have meaning, but apparently not at Shopify.
Three years later, you seem to have two options.
One is to create a Fulfillment Service and provide a URL where Shopify will call you asking for stock levels on a SKU; this is probably the simplest solution, provided you have a Web server sitting somewhere where you can expose such a callback.
The second is to periodically page through all of the Products and store a mapping of the Shopify ID to a SKU somewhere, consulting your map when you need to do an update. Because most of our integrations are cron jobs and I'd like to keep them that way, I periodically ask for the products that have changed since the last run, and then update my mapping.
As David Lazar points out in his comment, the ability to find a product based on its SKU is not currently supported in the Shopify API.
EDIT: This is an unreliable option that I once used as a last resort. I wouldn't recommend using this but I will leave it here in case as someone may find it helpful.
You can use this API endpoint:
/admin/products/search.json?query=sku:abc123
I have only used it in the browser though, and I can't find any documentation for it. And it may stop working at any time.
You can use
*.myshopify.com/search?view=json&q=sku:SKUID
Graph query on Shopify works for me:

REST wrapping a single resource in a collection

I have a small dilemma.
If you have the following URI endpoints:
/item
/item/{id}
If I make a GET request to /item I expect something like this:
<Items>
<Item>...</Item>
<Item>...</Item>
...
</Items>
If I make a GET request to /item/{id} I expect something like this:
<Item>
...
</Item>
Some of my fellow team members argue we should design the API so when someone does a GET for /item/{id} it should be returned as a collection of a single element. Like this:
<Items>
<Item>...</Item>
</Items>
This seems wrong to me. Does it seem wrong to you too? Please explain why, so I might convince either myself to go with the always wrapped version of the resource or my fellow devs to go with the non-wrapped single resource.
Most importantly, there is no right and wrong answer to this question.
However, here is what I think.
If you want to return a single item, I would tend to do this:
GET /Item/{Id}
=>
<Item>
...
</Item>
If the {Id} does not exist then the server should return a 404.
If I want to return a collection of items, I would do
GET /Items
=>
<Items>
<Item>...</Item>
<Item>...</Item>
</Items>
If there are no items, then it should return a 200 with an empty <Items/> element.
If it really makes it easier for the client to deal with a collection that has just one element, then you could do something like this.
GET /Items?Id={Id}
=>
<Items>
<Item> ... </Item>
</Items>
The difference here is that if the {Id} did not exist then I would tend to return 200 not a 404.
Seems counterintuitive to me. You are potentially saving code effort on the client side by having one way of reading data from your two GET methods. This is of course countered by having extra code to wrap your single GET method in a collection.
If you want real world examples,
twitter returns an individual
representation of a resource not
wrapped in a collection
basecamp, an
early proponent of REST based API,
also follows this model
EDIT: Our API uses this HTTP status code structure
I think your colleagues are right if you think about the consuming side of your REST service, which then can handle every response as a collection. And there's one more thing: If the {id} did not exist, what does your service return? Nothing? Then the consumer has to check for either a null result or error response, a single element or a collection. According to my experience getting a collection in any case (which may be empty) is the most convenient way to be served by a REST service.

Does the Wikipedia API support searches for a specific template?

Is it possible to query the Wikipedia API for articles that contain a specific template? The documentation does not describe any action that would filter search results to pages that contain a template. Specifically, I am after pages that contain Template:Persondata. After that, I am hoping to be able to retrieve just that specific template in order to populate genealogy data for the openancestry.org project.
The query below shows that the Albert Einstein page contains the Persondata Template, but it doesn't return the contents of the template, and I don't know how to get a list of page titles that contain the template.
http://en.wikipedia.org/w/api.php?action=query&prop=templates&titles=Albert%20Einstein&tlcontinue=736|10|ParmPart
Returns:
<api>
<query>
<pages>
<page pageid="736" ns="0" title="Albert Einstein">
<templates>
...
<tl ns="10" title="Template:Persondata"/>
...
</templates>
</page>
</pages>
</query>
<query-continue>
<templates tlcontinue="736|10|Reflist"/>
</query-continue>
</api>
I suspect that I can't get what I need from the API, but I'm hoping I'm wrong and that someone has already blazed a trail down this path.
You can use the embeddedin query to find all pages that include the template:
curl 'http://en.wikipedia.org/w/api.php?action=query&list=embeddedin&eititle=Template:Persondata&eilimit=5&format=xml'
Which gets you:
<?xml version="1.0"?>
<api>
<query>
<embeddedin>
<ei pageid="307" ns="0" title="Abraham Lincoln" />
<ei pageid="308" ns="0" title="Aristotle" />
<ei pageid="339" ns="0" title="Ayn Rand" />
<ei pageid="340" ns="0" title="Alain Connes" />
<ei pageid="344" ns="0" title="Allan Dwan" />
</embeddedin>
</query>
<query-continue>
<embeddedin eicontinue="10|Persondata|595" />
</query-continue>
</api>
See full docs at mediawiki.org.
Edit Use embeddedin query instead of backlinks (which doesn't cover template inclusions)
Using embeddedin does not allow you to search for a specific person, the search string becomes the Template:Persondata.
The best way I've found to get only people from Wikipedia is to use list=search and filter the search using AND"Born"AND"Occupation":
http://en.wikipedia.org/w/api.php?action=query&list=search&srsearch="Tom Cruise"AND"Born"AND"Occupation"&format=jsonfm&srprop=snippet&srlimit=50`
Remember that Wikipedia is using a search engine that doesn't yet allow us to search only the title, it will search the full text. You can take advantage of that to get more precise results.
The accepted answer explains how to list pages using a certain template, but if you need to search for pages using the template, you can with the hastemplate: search keyword: https://en.wikipedia.org/w/api.php?action=query&list=search&srsearch=hastemplate:NPOV%20physics