How to get all Wikipedia article titles?

How can I get all Wikipedia article titles in one place, without extra characters or page IDs? Just the article titles.
When I download a Wikipedia dump, I get more than just the titles.
I may know a workaround that could get me all the pages, but I wanted to get them in one take.

You'll find it on https://dumps.wikimedia.org
The latest list of page titles in the main namespace for the English Wikipedia is available there as a database dump (69 MB).
If you'd rather get it through the API, use action=query with list=allpages, but that only gives you a maximum of 500 titles per request (5,000 for bots), so you would have to make more than 10,000 API calls for the English Wikipedia.
Example: https://en.wikipedia.org/w/api.php?action=query&format=xml&list=allpages&aplimit=max
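If you go the API route, a minimal Python sketch of that pagination could look like the following; it uses the requests library and follows the continuation token the API returns until the list is exhausted.
```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def all_titles():
    """Yield every page title in the main (article) namespace."""
    params = {
        "action": "query",
        "format": "json",
        "list": "allpages",
        "aplimit": "max",    # 500 per request, 5,000 for bot accounts
        "apnamespace": 0,    # main namespace only
    }
    while True:
        data = requests.get(API, params=params).json()
        for page in data["query"]["allpages"]:
            yield page["title"]
        if "continue" not in data:
            break
        # Pass the continuation values back to get the next batch.
        params.update(data["continue"])

for title in all_titles():
    print(title)
```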

Related

Creating a SOLR index for activity stream or newsfeed

I am trying to index the activity feed of a social portal I am building. The portal allows users to follow each other and get updates from the people they follow as an activity feed sorted by date.
For example, user A will be following users B, C, D, E & F. So user A should see all the posts from B, C, D, E & F on his/her activity feed.
Let's assume a post consists of just two fields:
1. The text of the post. (text_field)
2. The name/UID of the user who posted it. (user_field)
Currently, I am creating an index for all the posts and indexing the text_field and user_field. At scale, there can be 1,000,000+ posts, and a user may follow hundreds if not thousands of other users. What is the best way to create an index for this scenario?
Should I also index each person's followers, so that they can be looked up quickly and then passed to a second query that gets the posts of all those users sorted by date?
What is the best way to query the index of all these posts by passing in the UIDs of all the followed users, considering that this may be hundreds or more?
Update:
The motivation for using Solr for the news feed was mainly inspired by this detailed slide deck and my brief discussion with the OpenSocial team.
When starting off with a social portal, fan-out on write seems like overkill and more expensive, whereas fan-out on read looks better. Both the slide deck and the OpenSocial team suggested using a search backend for fan-out on read, and the slide deck also has data on how it helped them.
At present, the feed is going to be flat, and the only sort criterion will be the date (recency). We won't be considering relevance or giving more weight to posts from closer groups.
It's kind of abstract, but I will do my best here. Based on what you mentioned, I am not sure Solr is really the right tool for the job. You can still use Solr for full-text search, but I am not sure about generating a news feed from it in this scenario. Remember that although Solr is pretty impressive, it is a search engine. I will assume you stick with Solr for the rest of this post, but keep in mind that we are trying to put a square peg through a round hole here.
Here are a few additional questions you should think about.
You will probably want to add a timestamp of the post to the data element.
You need to figure out how to properly sort the results. Is it in order of recency? Or based on posts that the user is more likely to interact with?
If a user has 1000+ connections, would he want to see an update from every one of them in the main feed? Or should posts from a closer group of friends show up higher?
Here are some comments about your questions:
1) If you index a person's followers, it may be hard to keep up. I am assuming followers are going to change often, and re-indexing in this scenario would not really be practical.
2) That sounds more on par, but again, you need to figure out the sorting. You can get the list of connections for the user, then run a search for the top posts from all of them, as sketched below.
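As a rough illustration of that second approach, a Python sketch of the query might look like the following; the core name (posts), the post_date field, and the use of Solr's terms query parser are assumptions for the example, not something from the question.
```python
import requests

# Assumed Solr core named "posts"; user_field and text_field come from the
# question, post_date is an assumed date field holding the post's timestamp.
SOLR_SELECT = "http://localhost:8983/solr/posts/select"

def feed_for(followed_uids, rows=50):
    """Fetch the newest posts from the followed users, newest first."""
    params = {
        # The terms query parser matches any of the listed values in
        # user_field without building a huge boolean OR query.
        "q": "{!terms f=user_field}" + ",".join(followed_uids),
        "sort": "post_date desc",
        "rows": rows,
        "fl": "user_field,text_field,post_date",
        "wt": "json",
    }
    response = requests.get(SOLR_SELECT, params=params)
    return response.json()["response"]["docs"]

# Example: user A follows B, C, D, E and F.
for doc in feed_for(["B", "C", "D", "E", "F"]):
    print(doc)
```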

How to get list of wikipedia articles that are in specified category?

If I use this URL to get the Category page
https://en.wikipedia.org/w/api.php?&callback=jQuery111206430303168017417_1453394474227&action=query&prop=revisions&rvprop=content&format=json&titles=Category%3AHacker+(subculture)&_=1453394474245
I only get a header and other categories. How do I get the same page as on Wikipedia, with the list of articles in the category?
The Wikimedia API does not return the HTML page as it appears when you browse Wikipedia. If you want that page, you need to call it by its common URL, e.g. https://en.wikipedia.org/wiki/Category:Hacker_%28subculture%29
If you want to use the API to get at the page titles or page ids listed in a certain category, you need to query for category members.
For your query, you would do something like: https://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category%3AHacker+%28subculture%29
Set cmlimit to get more than the default of ten pages; the maximum is 500.
You can then parse the JSON to get at the listed page titles or page ids, e.g. to create links to those pages.
Look at the documentation for an explanation of these and other parameters you may use in your query.
The query uses format=jsonfm (for a readable rendering of the data) as a default. Use format=json for your data query.
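As a small illustration, here is a Python sketch that fetches the category members and prints the page titles; the continuation handling is only needed for categories with more than 500 members.
```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def category_members(category):
    """Yield the titles of all pages listed in the given category."""
    params = {
        "action": "query",
        "format": "json",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": 500,    # default is 10, maximum is 500
    }
    while True:
        data = requests.get(API, params=params).json()
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])    # fetch the next batch, if any

for title in category_members("Category:Hacker (subculture)"):
    print(title)
```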

How to get random page of specific "Portal:" using WikiMedia API

Is there any way to get a random article from a specific Wikimedia portal using the Wikimedia API?
For example, I need a random page from Portal:Science.
Does anyone know how to do this?
What you're asking for doesn't make much sense, because a portal doesn't have a list of pages associated with it.
The closest thing you can do is to get a random page from e.g. Category:Science or one of its subcategories. There is no way to do that directly using the API; you would need to traverse all the subcategories and choose a random page from them yourself.
There is a tool that already does this (with a limit on the depth of the category tree), erwin85's random article, and there is also a template for it on the English Wikipedia.
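If you want to do the traversal yourself, a rough Python sketch (assuming Category:Science as the starting point, an arbitrary depth limit, and ignoring continuation for categories with more than 500 members) could look like this:
```python
import random
import requests

API = "https://en.wikipedia.org/w/api.php"

def members(category, cmtype):
    """Return members of a category: articles ('page') or subcategories ('subcat')."""
    params = {
        "action": "query",
        "format": "json",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": cmtype,
        "cmlimit": 500,    # continuation beyond 500 members is omitted here
    }
    data = requests.get(API, params=params).json()
    return [m["title"] for m in data["query"]["categorymembers"]]

def collect_pages(category, depth=1):
    """Collect article titles from a category and its subcategories, depth-limited."""
    pages = members(category, "page")
    if depth > 0:
        for subcat in members(category, "subcat"):
            pages.extend(collect_pages(subcat, depth - 1))
    return pages

print(random.choice(collect_pages("Category:Science")))
```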

Make search engines distinguish website chronological updates over time (like in forums)

I see that search engines are able to find pages chronologically for forum websites and the like, offering the option to show results from the last 24 hours, last week, last month, last year, etc.
I understand that these sites need to be crawled continuously to provide those updates, but I have technical doubts about what structure, tags, or anything else I need to add to achieve this for my website.
I see that on the client side (which is also the side search engines are on), content appears basically as static data, already processed by the server, so the question is:
If I have a website whose index page I constantly update with new content to make it easily visible, and where I even add links, times, and dates as text for the new pages, why don't these updates show up at all in search engines?
Do I need to add XML/RSS feeds, or what else?
How do forums and other heavily updated sites with chronological marks allow search engines to list results separated by hours, days, etc.?
What specific set of tags and overall structure do I need to add for this feature?
I also see that search engines, mainly Googlebot, usually take at least three days to crawl those new pages, and even then the results aren't organized persistently (or at all) in a chronological way.
I am not using any forum, blog, or other web publishing software, just raw HTML and PHP written by hand, plus the minimum I mentioned above: pointing to the new documents from the index page of the website along with a description.
Do I need to add XML/RSS feeds, or what else?
Yes. Atom or one of the RSS formats (or several formats at the same time, so you could offer Atom and RSS).
Search engines will know about new blog posts, microblog posts, forum threads, forum thread answers, etc., because they subscribe to the feed. So sometimes you'll notice that a page is indexed by a search engine only minutes after it was published. For smaller sites, though, search engines probably don't check for updates every few minutes; it might even take days until a new page is indexed.
A sitemap might help, too.
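As a sketch of what such a feed might contain, here is a small Python snippet that writes a minimal Atom feed for newly added pages; the site URL, author, and entry data are placeholders, and the <updated> timestamps are what convey recency.
```python
from datetime import datetime, timezone

# Placeholder data: each newly added page on the site becomes one feed entry.
entries = [
    {
        "title": "New article",
        "url": "https://example.com/new-article.html",
        "updated": datetime(2016, 1, 20, 12, 0, tzinfo=timezone.utc),
    },
]

def atom_feed(entries):
    """Build a minimal Atom feed string; the <updated> elements convey recency."""
    entry_xml = "".join(
        "  <entry>\n"
        f"    <title>{e['title']}</title>\n"
        f"    <link href=\"{e['url']}\"/>\n"
        f"    <id>{e['url']}</id>\n"
        f"    <updated>{e['updated'].isoformat()}</updated>\n"
        "  </entry>\n"
        for e in entries
    )
    now = datetime.now(timezone.utc).isoformat()
    return (
        '<?xml version="1.0" encoding="utf-8"?>\n'
        '<feed xmlns="http://www.w3.org/2005/Atom">\n'
        "  <title>example.com updates</title>\n"
        '  <link href="https://example.com/"/>\n'
        "  <id>https://example.com/</id>\n"
        f"  <updated>{now}</updated>\n"
        "  <author><name>Site author</name></author>\n"
        f"{entry_xml}"
        "</feed>\n"
    )

# Write the feed next to the site's pages and reference it from the HTML with
# <link rel="alternate" type="application/atom+xml" href="/feed.atom">.
with open("feed.atom", "w", encoding="utf-8") as f:
    f.write(atom_feed(entries))
```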

Is it possible to retrieve arbitrary number of items from reddit using API?

I am writing a small assistant app to read (well, filter/rank) /r/programming/ for me, because it has so many damn posts, and because a certain area of my coding skills was getting rusty and this sounded like a good exercise.
I am getting items from the "new" page of the subreddit using the JSON API; however, it only returns 25 items per request (which is the page size), so to retrieve items for, say, the last week, I need to make dozens of requests. As the mandated request interval is 2 seconds, this is painful.
I wonder if there's some way to retrieve more items per request? Query string parameters that work for the standard HTML pages also work for the JSON requests, but I cannot find one for the page size.
EDIT: For posterity, the parameter name is "limit", although that too is capped at 100.
For posterity, the parameter name is "limit", although that too is capped at 100.
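For reference, a minimal Python sketch of paging through the listing with limit=100 and the after token, while keeping to the 2-second request interval, might look like this (the User-Agent string is a placeholder):
```python
import time
import requests

URL = "https://www.reddit.com/r/programming/new.json"
HEADERS = {"User-Agent": "my-feed-reader/0.1 (placeholder user agent)"}

def new_posts(pages=5):
    """Yield posts from /r/programming/new, up to 100 per request."""
    after = None
    for _ in range(pages):
        params = {"limit": 100}          # capped at 100 by the API
        if after:
            params["after"] = after      # fullname of the last item seen
        data = requests.get(URL, headers=HEADERS, params=params).json()["data"]
        for child in data["children"]:
            yield child["data"]
        after = data.get("after")
        if not after:                    # no more pages available
            break
        time.sleep(2)                    # mandated request interval

for post in new_posts():
    print(post["created_utc"], post["title"])
```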