So I'm currently working on a research paper on media bias (or lack thereof) towards 2020 presidential candidates.
For this, I'm looking for a way to make a huge database of sentences that mention these politicians by name or (if possible) with a pronoun. Right now I'd like to only focus on 5-7 of the biggest American news outlets (WaPo, NYT, FOX, etc.).
I want to collect all of these sentences into an Excel sheet, including a timestamp of when the article was released and a link to the article itself. I actually don't know if that's feasible or whether such a program or script already exists.
Do you think there's a way to solve this, does it already exist, and if not, can a rookie programmer write a script for this?
Thank you for all your help in advance!
You'd probably just need to create your own web scraper. You could have a Set of names that you're looking for, and if the name exists on the page then you can have some heuristics to get the sentence it's in. You'll probably have to have some specific stuff for getting the timestamp from the article. I'd say it wouldn't be too bad since you're targeting only a few news outlets, but probably a bit challenging for a rookie programmer.
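To make the "set of names plus a sentence heuristic" idea concrete, here is a minimal sketch. It assumes you already have the article text as a plain string (fetching the page and parsing its HTML is a separate step); the names in `NAMES`, the function `sentences_mentioning`, and the sample article are all made up for illustration. The sentence splitter is deliberately naive and will mis-split on abbreviations like "Sen." or "U.S.".

```python
import re

# Hypothetical names to look for; extend as needed.
NAMES = {"Biden", "Trump", "Sanders"}

# Naive sentence splitter: break after ., ! or ? followed by whitespace.
# It will mis-split on abbreviations such as "Sen." or "U.S.".
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_mentioning(text, names=NAMES):
    """Return (name, sentence) pairs for every sentence that contains a name."""
    hits = []
    for sentence in SENTENCE_END.split(text.strip()):
        for name in sorted(names):
            # \b keeps "Trump" from matching inside "Trumpet".
            if re.search(r"\b" + re.escape(name) + r"\b", sentence):
                hits.append((name, sentence))
    return hits


# Made-up article text, only for demonstration.
article = (
    "Biden spoke in Iowa on Monday. The crowd was small. "
    "Later, Trump and Sanders both responded!"
)
rows = sentences_mentioning(article)
for name, sentence in rows:
    print(name, "|", sentence)
```

Each row could then be written out with the standard `csv` module, together with the article URL and timestamp, and opened in Excel. Pronoun mentions are not handled here; that would need coreference resolution, which is a much harder problem.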
Also, I recommend checking out something like https://www.webscraper.io/
I was reading some interesting questions about the topic "Can we make a program that, given a particular sequence, produces the next terms", like this one, and I really like the detailed answer of this one. I understand that the answer is "That's impossible without more restrictions", and that given some restrictions (polynomials, rational function or boolean map) we know some good algorithms, as the second answer I linked explains.
Now, a natural question is how much we can solve, trying our best even if we can't always solve it, to answer the original, general question. What I usually do when facing a hard sequence is try to see if it's in OEIS, and if it seems to be there, see if there is any formula or algorithm to produce it. You can download a small version of OEIS with the first terms of each sequence, and you can make queries to find formulas or Maple algorithms for a particular sequence. My question is, do you think it's feasible to download a small version of OEIS that includes, along with the first terms, a little algorithm to produce each sequence?
The natural problem here is that I haven't seen any link to download the entire database of OEIS with all the details, which maybe deserves its own question. Even if we had this, you need to read the formulas/algorithms (that can be written in different languages, from what I've seen) and interpret them correctly. But I thought maybe someone here knows how to solve this, in any case thanks in advance.
You could, as you note, download the sequences and their A-numbers from the link mentioned here: https://oeis.org/wiki/Welcome#Compressed_Versions
After searching that and finding one sequence (or a small number of sequences) of interest, you could scrape the respective page(s) for formulas. There are specific fields for Maple and Mathematica, which may be helpful, and otherwise, an entry in the PROGRAM field should include identifying information when it is not one of the standard languages with its own field in the database. See: http://oeis.org/wiki/Style_Sheet
Unofficially, but with the interests of the OEIS in mind, I would not recommend trying to download or scrape the OEIS in its entirety. Whether it's one person, or a whole host of people, we would certainly recommend using the compressed version of the database to identify sequences of interest by A-number first, then pulling their entire entry by scraping the site or querying the OEIS using methods that you have already mentioned: Programmatic access to On-Line Encyclopedia of Integer Sequences
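As a rough sketch of the "identify by A-number first" step: the code below assumes the compressed file has one sequence per line in the form `A000045 ,0,1,1,2,3,5,` with `#` comment lines at the top. That layout is my assumption about the file, so check it against the actual download. The function names `parse_stripped` and `find_by_prefix` are invented for this example, and the sample lines are a hand-typed excerpt rather than real file contents.

```python
def parse_stripped(lines):
    """Map A-number -> list of terms, assuming 'A000045 ,0,1,1,2,' style lines."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        a_number, _, terms = line.partition(" ")
        table[a_number] = [int(t) for t in terms.split(",") if t.strip()]
    return table


def find_by_prefix(table, prefix):
    """Return the A-numbers whose listed terms start with the given prefix."""
    n = len(prefix)
    return sorted(a for a, terms in table.items() if terms[:n] == prefix)


# Hand-typed sample in the assumed format (not copied from the real file).
sample = [
    "# assumed header comment",
    "A000045 ,0,1,1,2,3,5,8,13,",
    "A000079 ,1,2,4,8,16,32,",
    "A000290 ,0,1,4,9,16,25,",
]
table = parse_stripped(sample)
matches = find_by_prefix(table, [0, 1, 4])
print(matches)
```

With the candidate A-numbers in hand, you would only need to fetch those few entries to look at their formula and program fields, which keeps the load on the site small.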
If this sounds laborious, perhaps an alternative is the Wolfram Cloud, which achieves this through other means. For example, you can navigate to the cloud (you may have to register just to get access) at: https://www.wolframcloud.com/
Typing in something like FindSequenceFunction[{1, 2, 3, 5, 17, 305, 34865}] will give you a formula, if Wolfram/Mathematica can find one. The documentation for FindSequenceFunction can be found here: https://reference.wolfram.com/language/ref/FindSequenceFunction.html
Wolfram/Mathematica can also invoke the OEIS using packages like the one described here: https://mathematica.stackexchange.com/questions/40/is-it-possible-to-invoke-the-oeis-from-mathematica
For a toy project, I want to implement an automated question answering system with Lucene and I'm trying to figure out a reasonable way to implement it. The basic operation is as follows:
1) The user will enter a question.
2) The system will identify the keywords in the question.
3) The keywords will be searched in a large knowledgebase and matching sentences will be shown as answers.
My knowledgebase (i.e., corpus) is not structured. It is just a large, continuous text (say, a user manual without any chapters). I mean that the only structure is that sentences and paragraphs are identified.
I plan to treat each sentence or paragraph as a separate document. To present the answer in context, I may consider keeping one sentence/paragraph before/after the indexed one as payload. I would like to know if that makes sense. Also, I'm wondering if there are other tried and well-known approaches for this kind of system. As an example, another approach that comes to mind is to index large chunks of the corpus as documents with the token positions, then process the vicinity of found keywords to construct my answers.
I would appreciate direct recommendations based on experience or intuition, but also tutorials or introductory materials to question-answering systems with Lucene in mind.
Thanks.
It's not an unreasonable approach to take.
One enhancement you might consider is incorporating learning feedback, so that you can continually improve the scoring of content vs search terms. To do this you would ask users to rate the answers that come back ('helpful' vs 'unhelpful'); that way you can start to rank documents against keywords based on the historical data. You could classify potential documents as helpful/unhelpful for given keywords by using a simple Bayesian classifier.
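Here is one way the feedback idea could look, as a sketch only. It is a plain naive Bayes model with add-one smoothing, written from scratch in Python rather than against Lucene, and it assumes one classifier instance per indexed document. The class name `FeedbackClassifier`, its methods, and the sample votes are all invented for illustration.

```python
import math
from collections import defaultdict


class FeedbackClassifier:
    """Naive Bayes over query keywords, trained on helpful/unhelpful votes."""

    def __init__(self):
        # counts[label][keyword] = times the keyword appeared with that label
        self.counts = {"helpful": defaultdict(int), "unhelpful": defaultdict(int)}
        self.votes = {"helpful": 0, "unhelpful": 0}
        self.vocab = set()

    def record(self, keywords, helpful):
        """Store one user rating for a query made of these keywords."""
        label = "helpful" if helpful else "unhelpful"
        self.votes[label] += 1
        for word in keywords:
            self.counts[label][word] += 1
            self.vocab.add(word)

    def p_helpful(self, keywords):
        """Estimated probability that this document is helpful for the keywords."""
        total_votes = sum(self.votes.values())
        log_scores = {}
        for label in ("helpful", "unhelpful"):
            # Add-one smoothing avoids zero probabilities for unseen data.
            score = math.log((self.votes[label] + 1) / (total_votes + 2))
            denom = sum(self.counts[label].values()) + len(self.vocab) + 1
            for word in keywords:
                score += math.log((self.counts[label].get(word, 0) + 1) / denom)
            log_scores[label] = score
        # Normalise in log space to avoid underflow on long queries.
        top = max(log_scores.values())
        exp = {k: math.exp(v - top) for k, v in log_scores.items()}
        return exp["helpful"] / (exp["helpful"] + exp["unhelpful"])


doc = FeedbackClassifier()  # one instance per indexed document (assumption)
doc.record(["reset", "password"], helpful=True)
doc.record(["reset", "password"], helpful=True)
doc.record(["install", "driver"], helpful=False)
p_good = doc.p_helpful(["reset", "password"])
p_bad = doc.p_helpful(["install", "driver"])
```

The resulting probability could be blended with the search engine's own relevance score when ordering results; how to weight the two is something you would have to tune.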
Indexing each sentence as a document will give you some problems. You've pointed out one: you would need to store the surrounding text as payloads. That means you'll need to store each sentence three times (before, during and after), and you'll have to manually get into the payload.
If you want to go the route of each sentence being a document, I would recommend coming up with an ID for each sentence and storing that as a separate field. Then you can display [ID-1, ID, ID+1] in each result.
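A small in-memory sketch of the sentence-ID idea, in plain Python rather than Lucene: the ID is simply the sentence's position in the corpus, so the neighbours are ID-1 and ID+1 and nothing has to be stored three times. The names `build_index` and `answer_with_context` and the sample manual text are made up for this example.

```python
import re
from collections import defaultdict


def build_index(sentences):
    """Inverted index: lowercase token -> set of sentence IDs (list positions)."""
    index = defaultdict(set)
    for sid, sentence in enumerate(sentences):
        for token in re.findall(r"\w+", sentence.lower()):
            index[token].add(sid)
    return index


def answer_with_context(sentences, index, keywords):
    """Return [ID-1, ID, ID+1] windows for sentences containing every keyword."""
    id_sets = [index.get(k.lower(), set()) for k in keywords]
    hits = set.intersection(*id_sets) if id_sets else set()
    windows = []
    for sid in sorted(hits):
        # Clamp the window at the start and end of the corpus.
        lo, hi = max(sid - 1, 0), min(sid + 1, len(sentences) - 1)
        windows.append(" ".join(sentences[lo:hi + 1]))
    return windows


# Made-up corpus, one sentence per entry.
manual = [
    "Unplug the device first.",
    "Hold the reset button for five seconds.",
    "The light will blink twice.",
    "Plug the device back in.",
]
index = build_index(manual)
result = answer_with_context(manual, index, ["reset", "button"])
print(result[0])
```

In Lucene the same thing would mean storing the ID as its own field and fetching the neighbouring documents by ID at display time.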
The bigger question though is: how should you break up the text into documents? Identifying semantically related areas seems difficult, so doing it by sentence/paragraph might be the only way to go. A better way would be if you could find which text is the header of a section, and then put everything in that section as a document.
You might also want to use the index (if your corpus has one). The terms there could be boosted, as they are presumably more important.
Instead of Lucene, which does text indexing, search and retrieval, I think using something like Apache Mahout would help with this. Mahout treats text as knowledge, and doing that makes answering the question better than just text matching. Mahout is a machine learning and data mining framework, which fits this domain better. Just a very high-level thought.
--Sai
What is the best way to implement a constructor for a record? It seems like a function should be able to return a record object in the instantiation of the record in some later model higher up the tree, but I can't get that to work. For now I just use a bunch of parameters at the top of the record that populate the variables stored in the record, but it seems like that will only work in simple cases.
Can anyone shed a little light? Perhaps I shouldn't be using a record but a model. Also does anyone know how the PDE functionality is coming? The book only says that it is coming, but I have seen some other things around.
I don't seem to have the clout to add tags (which makes sense, since my "reputation" is lower than yours) so sorry about that. I thought I had actually added one at one point, but perhaps I am mistaken.
I think you need to be clear what you mean by constructor since it has a very specific meaning in Modelica. If I understand your question correctly, it sounds like what you want to do is create an instance of a record that has some fields that are specified in the constructor arguments and from those arguments a bunch of other fields in the record are computed. Is that correct?
If so, there is a mechanism to do this. You mention "the book" but it isn't clear which one you mean. If it is mine, it definitely has no mention of these so called "record constructors" because it is too old. I do not know if Peter Fritzson's book mentions them either. However, they do exist and are documented in Section 12.6 of the Modelica 3.2 specification.
As for PDEs, there has been work on this kind of thing but nothing has really been done within the design group on this topic. I would add that if you want to solve either elliptic or parabolic PDEs on regular grids, this isn't too hard even with the current language. The only real drawback is that most tools probably don't handle sparsity very efficiently. Irregular grids would also be possible, but then you get into complicated basis functions. Finally, hyperbolic PDEs are, in my opinion, quite tricky (in any environment) due to the implicit physical constraints between time and space which are difficult to express (i.e. the CFL condition).
I hope that answers your questions so far.
I can only comment on your question regarding the book of Peter Fritzson. He confirmed that he's working on an update and he hopes to get it ready 'in the course of 2011'.
Original post here:
http://openmodelica.org/index.php/forum/topic?id=50
And thanks for initiating the modelica tag, it might be useful for me too in the near future... :-)
regards,
Roel
Is it necessary to use a period for single sentence notification boxes? Even though its considered proper grammar to do so, it just looks ugly and feels too formal.
Here are two screenies for comparison (first includes period, second doesn't).
Screenshot 1 (with period): http://wordofjohn.com/files/stack_alert_1.png
Screenshot 2 (without period): http://wordofjohn.com/files/stack_alert_2.png
Can't go wrong with correct grammar
Good grammar shows your customers that you took the time to make good software, even where others might not have.
This way they can expect the best out of you and your company.
If you are using a full sentence to tell the user what to do, then I think proper grammar is important, although I always stay away from exclamation points; I find them annoying.
It is more preference than anything, but I like to maintain the best grammar possible in any situation.
In both instances you capitalized the first word in the sentence, so I would say go with proper grammar,
but it really is a preference.
I'd vote No.
These alerts are like signposts or road signs: they need to present a brief but important message as succinctly as possible.
My reasoning extended - I think it's subjective, and so I doubt anyone's going to have a bad user experience because of the presence or absence of a full stop (period). A question mark might be confusing if it was left out, but a full stop is kind of implicit.
If you use periods at the end of your sentences, then users will know that the string hasn't been truncated (well OK, they won't know that it hasn't been truncated, but it's a good indicator). Plus, as others have said, it shows you went to the trouble to get it right.
I can't remember - what do MS/Apple do?
Let me explain my preference with an analogy.
I used to work at a bookstore where they sold Bibles. Some of them were Cambridge calfskin leather bound deluxe editions that came in special boxes for over US$100.00 each. Some of them were mass market paperback throw-away versions for US$1.99 each. The cheap ones often had glaring grammatical and spelling errors. I don't think this was a coincidence.
Regardless of where my software is going to be used or what it is for, I try to do my best to make sure it gets put (metaphorically) on the high-quality, expensive rack. Every time. Even at the risk of sounding "too formal".
If you are using the string as a normal resource, you (or someone else in your project) could use the text in another context, which would mean you need to keep track of which resources contain a period or not.
Is there a way to search the web which does NOT remove punctuation? For example, I want to search for window.window->window (Yes, I actually do, this is a structure in mozilla plugins). I figure that this HAS to be a fairly rare string.
Unfortunately, Google, Bing, AltaVista, Yahoo, and Excite all strip the punctuation and just show anything with the word "window" in it. And according to Google, on their site, at least, there is NO WAY AROUND IT.
In general, searching for chunks of code must be hard for this reason... anyone have any hints?
Try Google Code Search (I searched for "window.window->window", but it doesn't seem to return any relevant results for this query).
There are similar tools all over the internet, like Codase or Koders, but I'm not sure they let you search for exactly this string. Anyway, they might be useful to you, so I think they're worth mentioning.
edit: It is very unlikely you'll find a general purpose search engine which will allow you to search for something like "window.window->window" because most search engines will do some processing on the document before storing it. For instance they might represent it internally as vectors of words (a vector space model) and use that to do the search, not the actual original string. And creating such a vector involves first cutting the document according to punctuation and other critters. This is a very complex and interesting subject which I can't tell you much more about. My bad memory did a pretty good job since I studied it at school!
BTW they might do the same kind of processing on your query too. You might want to read about tf-idf, which is probably light years from what Google and its friends are doing but can give you a hint about what happens to your query.
There is no way to do that in the main Google engine by itself, as you discovered. However, if you are looking for information about Mozilla, then the best bet would be to structure your query more like this:
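To illustrate why the punctuation disappears, here is a toy sketch of word-level tokenization followed by textbook tf-idf. It is not what any particular search engine actually does; the tokenizer, the `tf_idf` helper, and the three sample documents are assumptions made up for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split on anything that is not a letter, digit or underscore."""
    return re.findall(r"\w+", text.lower())


def tf_idf(term, doc_tokens, all_docs_tokens):
    """Term frequency in one document times log inverse document frequency."""
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    containing = sum(1 for tokens in all_docs_tokens if term in tokens)
    idf = math.log(len(all_docs_tokens) / containing) if containing else 0.0
    return tf * idf


# Made-up mini corpus.
docs = [
    "window.window->window is a plugin structure",
    "open the window and close the door",
    "the plugin loads at startup",
]
tokens = [tokenize(d) for d in docs]

# The query is tokenized the same way, so "." and "->" are gone
# before any matching happens.
query = tokenize("window.window->window")
print(query)

scores = [tf_idf("window", t, tokens) for t in tokens]
print(scores)
```

After tokenization the query is just the word "window" three times, so any page mentioning windows is a candidate. The first document still scores highest here only because the word is more frequent in it, not because the exact string matched.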
"window.window->window" +Mozilla
OR +XUL
+ Another search string related to what you are
trying to do.
SymbolHound is a web search that does not remove punctuation from the queries. There is an option to search source code repositories (like the now-discontinued Google Code Search), but it also has the option to search the Internet for special characters (primarily programming-related sites such as StackOverflow).
Try it here: http://www.symbolhound.com
-Tom (co-founder)