Fetching all pictures of a certain artist on MediaWiki

Currently I'm looking for a way to fetch the URLs of paintings on MediaWiki that are authored by Albrecht Dürer.
Can you point me to some explanation? Is there any API like "give me all images where the artist is Albrecht Dürer"?
I have found imageinfo (http://www.mediawiki.org/wiki/API:Properties#imageinfo_.2F_ii), but I didn't find a way to filter by artist.

There isn't a great way to do that. The structured media data project aims to provide exactly this kind of capability, but it is still a ways off.
Right now, your best bet is the category system. Category:Paintings by Albrecht Dürer and its subcategories contain the images you are looking for, and you can use the categorymembers API as a generator for imageinfo to fetch the URLs. There is no way to get a recursive list, though, so you will have to recurse into subcategories manually. To make it worse, the category graph is not guaranteed to be a tree, so you will have to implement things like duplicate filtering and cycle detection yourself.
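A minimal sketch of that approach in Python, assuming Wikimedia Commons as the target wiki (adjust the API URL otherwise); it only fetches direct members of one category, leaving the recursion, duplicate filtering, and cycle detection described above to the caller:

    import requests

    API = "https://commons.wikimedia.org/w/api.php"  # assumption: Wikimedia Commons

    def file_urls_in_category(category):
        """Yield URLs of files that are direct members of one category."""
        params = {
            "action": "query",
            "format": "json",
            "generator": "categorymembers",
            "gcmtitle": category,
            "gcmtype": "file",
            "gcmlimit": "max",
            "prop": "imageinfo",
            "iiprop": "url",
        }
        while True:
            data = requests.get(API, params=params).json()
            for page in data.get("query", {}).get("pages", {}).values():
                for info in page.get("imageinfo", []):
                    yield info["url"]
            if "continue" not in data:       # no more result batches
                break
            params.update(data["continue"])  # standard API continuation

    for url in file_urls_in_category("Category:Paintings by Albrecht Dürer"):
        print(url)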
If the wiki in question is Wikimedia Commons, there are various external tools which can help, such as CatScan or catgraph.

Related

Tree structure of data in REST - URL always from root?

Problem
When the data has a tree structure of parent/child/grandchild entities, we often duplicate information in the URL by specifying parent IDs, even when that's not necessary. What's the best way to design a RESTful API in such a case? Can the URLs be shortened and the parent IDs omitted?
Example
The tree is as follows: the top-most entity is a product. Each product has 0-N reviews. Each review can have 0-M comments attached. In theory, this tree can be arbitrarily deep.
The naive RESTful API would look like this (assuming only GET endpoints):
/products ... list of products
/products/123 ... specific product 123
/products/123/reviews ... list of reviews for product '123'
/products/123/reviews/abc ... specific review 'abc'
/products/123/reviews/abc/comments ... list comments for review 'abc'
Hang on, wait a minute... The last two URLs I have written do not say anything about product '123'. Yes, review 'abc' belongs to that product, but as a human, I don't need to know that. And if the review ID 'abc' is unique among all reviews, neither does the computer.
So, for example, when we send an update (PATCH) request for review 'abc', we don't need to know the whole hierarchy of parent objects up to the tree root (products), e.g. that it belongs to product '123' in this case. Of course, we assume each object has a unique ID among all objects of that entity - but that's natural behavior, for example, in relational databases, so many people (well, their APIs) are in this situation.
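For instance, the update could then be sent directly to the review, with no product ID in sight (a hypothetical endpoint, sketched with Python's requests library):

    import requests

    # PATCH review 'abc' directly; the server can locate it by its
    # globally unique ID, so no product ID is needed in the URL.
    resp = requests.patch("https://api.example.com/reviews/abc",
                          json={"text": "Updated review text"})
    print(resp.status_code)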
Questions
If the IDs of "child entities" are unique among all entities of that type, would it be best practice to design the API like this?
/reviews/abc ... specific review 'abc'
/reviews/abc/comments ... list comments for review 'abc'
/comments/xyz ... specific comment 'xyz'
If the answer to (1) is yes, should an endpoint like this be valid as well? Why or why not?
/products/123/reviews/abc/comments/xyz ... specific comment 'xyz'
If short URLs are allowed (or even preferred), isn't this a bit inconsistent then?
/products/123/reviews ... list reviews for product '123'
/reviews/abc ... specific review 'abc'
/reviews ... what should be here? all reviews?
1. Yes.
2. It depends - I wouldn't recommend it, but if you find a use case for it, why not?
3. I see no inconsistency - yes, in this situation /reviews should be a list of all reviews in the system, but if that makes no sense for your application, then /reviews can just yield a 404 and everything's fine.
Ideally, the design of your URLs should be decoupled from the rest of the REST API. That means that as long as your URLs uniquely identify your resources, they are (from a purely theoretical point of view) "well designed".
But an API is an interface and should be treated as such. An API is consumed by machines, but those machines are written by people, so in fact, design matters. It's the same reason to have nice URLs on your blog: there is no technical reason for them, but they improve the experience of users who want to read, share, remember, or understand your URLs. (You may say that Google searches for keywords in URLs and so there is a technical reason - but no, there isn't. Google's bot is just one of your users, one of your website's consumers, and optimizing for the bot is just like any other optimization for your users; in other words, it's interface design.)
If the design of your URLs matters (for any reason), then in my opinion the best approach is to keep them simple. As simple as you can. Your observation is spot on: you don't need to mimic the hierarchy of your resources or the way you store data in the database. Eventually it would only get in your way, and in the way of the people who want to consume your API.
If a resource is uniquely identified within a collection by an ID, then design your URLs as just /collection/{id}. Look how Facebook does it - the majority of its API works exactly this way. The structure of their URLs is pretty flat.
There doesn't even need to be a /collection resource for listing all existing objects. You can have them linked only from the places where it makes sense, like /products/123/reviews, which can list links pointing to /reviews/{id}.
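A small sketch of what that flat design could look like, using Flask with a toy in-memory store (the routes follow the question; the field names are illustrative assumptions):

    from flask import Flask, abort, jsonify

    app = Flask(__name__)

    REVIEWS = {"abc": {"id": "abc", "product_id": "123", "text": "Great!"}}

    @app.route("/reviews/<review_id>")
    def get_review(review_id):
        review = REVIEWS.get(review_id)
        if review is None:
            abort(404)
        # The parent relation lives in a link, not in the URL.
        return jsonify({**review,
                        "links": {"self": f"/reviews/{review_id}",
                                  "product": f"/products/{review['product_id']}"}})

    @app.route("/products/<product_id>/reviews")
    def list_reviews(product_id):
        # The nested URL only lists links that point at the flat review URLs.
        return jsonify({"links": [f"/reviews/{r['id']}"
                                  for r in REVIEWS.values()
                                  if r["product_id"] == product_id]})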
Why do I think complicated URLs are bad?
Relations between resources are graphs, and you can't put graphs into URLs
Putting other IDs and hierarchies into URLs complicates things for no reason. Hierarchies in APIs are usually not so simple - relations between resources are more often complicated graphs, not simple trees. So don't put the linking between resources into your URLs. There are better places for information about relations (hypermedia formats, link headers, or at least linking by ID references), and those are not limited to a single string the way URLs are, so with them you can describe relations better.
You're torturing your consumers by requiring too many parameters
By requiring more information in the URL, you force consumers to remember all this context and all those IDs, or to know those values in advance. You require more (unnecessary) input, when in reality there is no reason for a consumer to remember a product's ID just to check out one of its reviews.
Evolvability
If your URLs are not decoupled well, you should really think about what happens when the structure of your data changes over time. With simple URLs, nothing much happens. With complicated URLs, every time you change the way your API resources are related, you'll need to change the URLs too so they keep up with your structure. And as everyone knows, changing URLs is hard - whether we are talking about the web or APIs. Hypermedia solves this to some extent, but even without hypermedia you can at least keep your URLs lightweight and as resistant to change as possible.
Your design could look like this
/products/{id} - specific product, links to an endpoint with list of its reviews
/products/{id}/reviews - lists links to endpoints of reviews of the product
/reviews/{id} - specific review; should link to the reviewed product, and it could even link to the list above, if that seems useful for an API consumer
In fact, any of those resources can also link to any other thing in the system, if it's useful or if there is a logical connection. Some linking systems (such as hypermedia) make those links easier to understand, because you can specify a rel attribute that tells the consumer what the link points to (self points to the resource itself, next could point to another page, etc.).
Of course, as always, it depends on your specific case. But generally, I'd recommend keeping URLs decoupled and simple. Also, I wouldn't recommend trying to mirror any complicated relations or hierarchies in URLs.
As long as the URL can uniquely identify the resource, it is correct.
So the approaches in both Q-1) and Q-2) are fine to use and can be mixed. It is like providing different entry points to the same resource.
The answer to the question comes back to your business use case. If there is no need for more than one entry point, just stick with one; it will simplify the code.
To Q-3: '/reviews' would mean all reviews. You don't need to support that if there is no business use case for getting all reviews in your system.
Hope this helps.

Use the Apple Search API to search by genre?

Is it possible to use the Apple Search API to search by genre? I'm thinking specifically of games in the App Store. Using Obj-C.
As has been pointed out in the comments of this question...
Search Apple App store by genre with iOS/Obj-c
There seems to be a problem with trying to search by genre, so I'm looking for answers with examples of that actually working, not just links to the docs.
It's not actually documented in the Search API documentation, but you can add a genreId parameter to the search URL and it restricts the search to a particular genre.
If you look at the JSON returned from a search for "Yelp", there are 4 interesting things:
"genreIds":["6005", "6001"]
"genres":["Social Networking", "Weather"]
"primaryGenreName":"Social Networking"
"primaryGenreId":6005
Adding &genreId=6001 to a URL will find apps in the US in the "Weather" category. I'm using the search term "Check" in the URL below.
https://itunes.apple.com/search?term=Check&country=us&entity=software&genreId=6001
Because it's not documented, you can't rely on it working forever. You may also be able to use primaryGenreName as a parameter; I didn't try that. You'll also have to figure out which numbers correspond to which categories.
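For reference, a quick Python sketch of the same request (genreId is the undocumented parameter discussed above; trackName and primaryGenreName appear in the JSON results shown earlier):

    import requests

    # Same query as the URL above, expressed as parameters.
    resp = requests.get("https://itunes.apple.com/search",
                        params={"term": "Check", "country": "us",
                                "entity": "software", "genreId": "6001"})
    for app in resp.json().get("results", []):
        print(app["trackName"], "-", app["primaryGenreName"])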
The Search API is documented here: http://www.apple.com/itunes/affiliates/resources/documentation/itunes-store-web-service-search-api.html
You can use this link to generate an RSS feed of your liking. Without knowing too much about how you intend to use it, I would suggest looking at these two solutions and using the one that best suits your needs.

large collection of photos for IR software testing

I'm looking for a large (preferably 50K+) collection of photos that I could use for testing image recognition software. So preferably photos of objects. I'm fine with album covers, movie posters, or anything like that.
Any suggestions?
The ImageNet database (http://www.image-net.org/ - it seems to be down at the time I'm writing this, but I think that's temporary) is something you could look into, if it comes back up. It has literally millions of labeled images, separated into a hierarchy of classes (you don't have to download the complete set).
What about a Google search like this?
Another Google search led me to this one. Lots of objects there.

How to create SEO-friendly paging for a grid?

I've got this grid (a list of products in an internet shop), and I have no idea how big it can get. But I suppose a couple hundred items is quite realistic, especially for search results. Maybe even thousands, if we get a big client. :)
Naturally, I should use paging for such a grid. But how do I do it so that search engine bots can crawl all the items too? I very much like this idea, but it only has first/last/prev/next links. If a search engine bot has to follow links 200 levels deep to get to the last page, I think it might give up pretty soon and not enumerate all the items.
What is the common(best?) practice for this?
Is it really the grid you want to have indexed by the search engine, or are you after the product detail pages? If it's the latter, you can have a dynamic sitemap (XML) and the search engines will take it from there.
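A minimal sketch of such a dynamic sitemap generator in Python (the base URL and product IDs are placeholders):

    from xml.etree import ElementTree as ET

    def build_sitemap(product_ids, base="https://example.com/products/"):
        """Build a sitemap.xml string with one <url> per product page."""
        urlset = ET.Element(
            "urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
        for pid in product_ids:
            url = ET.SubElement(urlset, "url")
            ET.SubElement(url, "loc").text = f"{base}{pid}"
        return ET.tostring(urlset, encoding="unicode")

    # Serve the result at /sitemap.xml and regenerate it as products change.
    print(build_sitemap(["123", "124", "125"]))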
I run a number of price comparison sites, and as such I've had the same issue as you before. I don't really have a concrete answer; I doubt anyone will, tbh.
The trick is to try to make each page as unique as possible. The more unique pages, the better. Think of each page in Google as a lottery ticket: the more tickets, the more chances you have of winning.
So, back to your question. We tend to display 20 products per page and then have pagination at the bottom. AFAIK, Google and other bots will crawl all the links on your site; they won't give up. What we have noticed, though, is that if your subsequent pages have the same SEO titles and H tags - basically the same page but with different result sets - then Google will NOT add those pages to the index.
Likewise, I've looked at the site you suggested and would suggest changing the layout to be text and not images; an example of what I mean is on this site: http://www.shopexplorer.com/lcd-tv/index.html
Another point to remember: the more images and the like on the page, the longer the page takes to load, and the worse your user experience will be. I've also heard that it affects SEO ranking algorithms.
Not sure if I've given you enough to go on, but to recap:
I would limit the results to 20-30 per page
I would use pagination, but with text and not images
I would make sure the paginated pages have distinct enough 'SEO markers' (title, h1, etc.) to count as unique pages.
i.e.
'LCD TV results page 2' is bad
'LCD TV results from Sony to Samsung' is better
Hopefully I've helped a little.
EDIT:
Vlix, I've also seen your question about sitemaps. If you're concerned about that, don't be - just split the feed into multiple separate feeds, maybe at the category level, brand level, etc. I'm not sure, but I think Google wants as many pages as possible. It will ignore the ones it doesn't like and just add the unique ones.
That, at least, is how I understand it.
SEO is a dark art - nobody will be able to tell you exactly what to do and how to do it. However, I do have some general pointers.
Pleun is right - your objective should be to get the robots to your product detail page - that's likely to be the most keyword-rich, so optimize this page as much as you can! Semantic HTML, don't use images to show text, the usual.
Construct meaningful navigation schemes to lead the robots (and your visitors!) to your product detail pages. So, if you have 150K products, let's hope they are grouped into some kind of hierarchy, and that each (sub)category in that hierarchy has a manageable (<50 or so) number of products. If your users have to go through lots and lots of pages in a single category to find the product they're interested in, they're likely to get bored and leave. Turn this categorization into a navigation scheme, and make it SEO-friendly - e.g. by using friendly URLs.
Create a sitemap - robots will crawl the entire sitemap, though they may not decide to pay much attention to pages that are hard to reach through "normal" navigation, even if they are in the sitemap.xml.
Most robots don't parse more than the first 50-100 KB of HTML. If your navigation scheme (with a data grid) is too big, the robot won't necessarily pick up or follow links near the end.
Hope this helps!

Tool or methods for automatically creating contextual links within a large corpus of content?

Here's the basic scenario: I have a corpus of, say, 100,000 newspaper-like articles. Minimally, they will all have a well-defined title and some amount of body content.
What I want to do is find runs of text in articles that ought to link to other articles.
So, if article Foo has a run of text like "Students in 8th grade are being encouraged to read works by Jean-Paul Sartre" and article Bar is titled (and about) "The important works of Jean-Paul Sartre", I'd like to automagically create that HTML link from Foo to Bar within the text of Foo.
You should ask yourself something before adding the links: what benefit to users do you want to achieve by doing this? You probably want to increase the navigability of your site. Maybe it is better to create an easier way to add links to older articles in the form used to submit new ones. Maybe it is possible to add a "one-click search for selected text" feature. Maybe you can add wiki-like functionality that lets users propose links for selected text. You probably also want to add links to related articles (generated through a tagging system or text mining) below the articles.
Some potential problems with a fully automated link adder:
You may need to implement a good word-sense disambiguation algorithm to avoid confusing or even irritating the user with bad automatic links placed by regex (or simple substring) matching.
As the number of articles is large, you do not want to generate the HTML for the extra links on every request; cache it instead.
You need to make a decision on duplicate titles, or titles that contain another title as a substring (either take the longest title, link to the most recent article, or prefer an article from the same category).
TL;DR version: consider alternative solutions that provide the desired functionality to the users.
What you are looking for are text mining tools. You can find more info and links at http://en.wikipedia.org/wiki/Text_mining. You might also want to check out Lucene and its ports at http://lucene.apache.org. Using these tools, the basic idea would be to find a set of similar articles based on the article (or title) in question. You could search various properties of the article including titles and content or both. A tagging system a la Delicious (or Stackoverflow) might also be helpful. Rather than pre-creating the links between articles, you'd present the relevant articles in an interface much like the Related questions interface on the right-hand side of this page.
If you wanted to find and link specific text in each article, I think you'd need to do some preprocessing to select pertinent phrases to key on. Even then, I think it would be very hard not to miss things due to punctuation or misspellings, or to avoid including irrelevant links for the same reasons.
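To make the pitfalls concrete, here is a deliberately naive sketch of such a linking pass in Python, preferring the longest matching title as suggested above (the title-to-URL mapping is hypothetical, and the disambiguation, caching, and HTML-awareness concerns raised earlier are intentionally left out):

    import re
    from html import escape

    def link_titles(body, title_to_url):
        """Wrap the first occurrence of each known title in an HTML link."""
        # Try longer titles first, so a title that contains another title
        # as a substring wins (one of the decisions mentioned above).
        for title in sorted(title_to_url, key=len, reverse=True):
            url = title_to_url[title]
            pattern = re.compile(re.escape(title), re.IGNORECASE)
            body = pattern.sub(
                lambda m: f'<a href="{escape(url)}">{m.group(0)}</a>',
                body, count=1)
        return body

    titles = {"Jean-Paul Sartre": "/articles/bar"}
    print(link_titles("Students in 8th grade are being encouraged to read "
                      "works by Jean-Paul Sartre.", titles))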