How do I find whether a document on the web is semantically related to some other document? - semantic-web

My question is this: given a document d1 on the web and another document d2, how do I tell whether d1 and d2 are semantically related? Are there APIs that can do some amount of natural language processing and give me a hint as to whether d1 is probably connected to d2?
I need this badly and urgently. Please help!

You can use special microformats. See more at http://microformats.org/
Simple example:
<a href="http://creativecommons.org/licenses/by/2.0/" rel="license">cc by 2.0</a>
Rel-License is one of several microformats. By adding rel="license" to a hyperlink, a page indicates that the destination of that hyperlink is a license for the current page.
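To consume these annotations programmatically, here is a rough sketch (assuming the Python requests and BeautifulSoup packages, and a made-up page URL) of collecting the rel="license" links from a fetched page:

    import requests
    from bs4 import BeautifulSoup

    # Fetch a page and look for hyperlinks marked up with rel="license".
    html = requests.get("http://example.org/some-page").text  # hypothetical URL
    soup = BeautifulSoup(html, "html.parser")

    for link in soup.find_all("a"):
        rel = link.get("rel") or []   # BeautifulSoup returns rel as a list
        if "license" in rel:
            print(link.get("href"))   # the license the page declares for itself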

For semantically relating documents you can use special vocabularies like SKOS and relate them in an ontology. Or you can use - as silex mentioned - microformats directly in your documents.
For natural language processing there are different tools, such as GATE, that can extract information. But this is not a trivial task.
Perhaps you can refine what you want to do? Do you want to define which documents are related? Or do you want a software to find out which documents may be related?
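If you do want to define the relations yourself with a vocabulary like SKOS, here is a minimal sketch (assuming the rdflib package and made-up document URIs; the documents are treated as SKOS concepts purely for illustration):

    from rdflib import Graph, Namespace, URIRef

    SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")

    g = Graph()
    d1 = URIRef("http://example.org/doc1")  # hypothetical URIs for d1 and d2
    d2 = URIRef("http://example.org/doc2")
    g.add((d1, SKOS.related, d2))           # assert that d1 and d2 are related

    print(g.serialize(format="turtle"))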

You need to look into "named entity extraction", i.e. natural language processing that extracts the likely entities common to both documents. These are generally people, places, events, times, and organisations.
Take a look at OpenCalais http://www.opencalais.com/ for some real-world applications of this type of technology.
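OpenCalais is a hosted service; to show the idea locally, here is a rough sketch using spaCy (a library not mentioned above, chosen only for illustration) that extracts named entities from two documents and checks their overlap:

    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

    def entities(text):
        # People, places, organisations, dates, etc. recognised in the text
        return {ent.text.lower() for ent in nlp(text).ents}

    d1 = "Angela Merkel met Barack Obama in Berlin in 2016."
    d2 = "During his 2016 visit to Berlin, Obama praised Merkel's policies."

    common = entities(d1) & entities(d2)
    print(common)  # shared entities are a hint that d1 and d2 are related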

Related

Is it possible to rewrite a description so it has the same meaning but different words, using data manipulation?

I want to copy data from a website that sells courses like ITIL, PRINCE2, PMP, and many other IT-sector courses; there are about 20,000 different course descriptions there.
I want to use Selenium to scrape all of the data, but the descriptions are still subject to copyright.
Please let me know how I can rewrite all of those descriptions so that they keep the same meaning but use different words.
Is there any API I can use to build code that rewrites these descriptions using synonyms, or that changes their grammar into completely new sentences with the same meaning?
Please let me know where to start.
Thanks,
The task you are referring to is called paraphrasing.
There is a lot of research in this field. On arXiv you will find research papers on the topic. However, since you are asking for an API, I am assuming you don't want to implement these models yourself. Luckily, some authors have published their models online on GitHub. (Note: some are re-implementations by someone else.)
When you use some of these implementations, note that most offer a pre-trained model. Do read which data set was used for training and try to pick the one that is the most similar to the data that you are facing. By doing so, more words in the domain of your descriptions will be available and more synonyms can be used.
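Once you have picked a model, usage typically looks like the following rough sketch (assuming the Hugging Face transformers package; the model name is only a placeholder, substitute a real paraphrasing model from the Hub):

    from transformers import pipeline

    # Placeholder model name; pick a published paraphraser whose training data
    # resembles your course descriptions.
    paraphraser = pipeline("text2text-generation", model="some-author/paraphrase-model")

    description = "This course introduces the fundamentals of IT service management."
    print(paraphraser("paraphrase: " + description, max_length=60))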

cTAKES indication that category > 0 sources are about to be used?

In appendix 1 of the UMLS license agreement, there is a listing of all sources within the current version of the UMLS Metathesaurus with an indication of any additional restrictions and notices that apply. Loosely speaking, it seems like you can generally have your way with the Metathesaurus data sources that fall within category-0 of the license, but things get more restrictive at categories above that.
For example (likely a bad example as I am not a lawyer), looking at section 12.2 of the main license section:
LICENSEE is prohibited from using the vocabulary source in operational applications that create records or information containing data from the vocabulary source. Use for data creation research or product development is allowed.
My question then is: since cTAKES already has my UMLS credentials, is there any way to tell when a certain action in cTAKES is going to instruct it to use or access Metathesaurus data from a category > 0 source (e.g. some popup warning or a header comment in the binary files)? Thanks
** The reason I'm interested: suppose a certain cTAKES process uses a category-2 data source to do something with some input and populates data into some XMI output (I don't know much about cTAKES' full implementation, but for the sake of argument let's assume this is true), and that output gets post-processed and stored as a report for an organization. It would seem that the organization has inadvertently violated the category-2 restriction (since they were never warned about the underlying data being used to generate the outputs). I may be grossly misunderstanding something here, so please let me know if this is the case.

Wiki Database, is there one?

I was searching the net for something like a wiki database: just like Wikipedia, but storing structured content that is editable by users. What I was looking for was an online database accessible by everyone, where people can design the schema and data with proper versioning of both schema and data. I couldn't find any such site. I am not sure if it is my search skills or if there really is no wiki database as of now. Does anyone out there know of anything like this?
I think there is great potential for something like this. A possible example would be a website with a GUI for querying a MySQL DB, where any website visitor can create DB objects and populate data.
UPDATE: I had registered the domain wikidatabase.org to get started on a tool but I didn't find enough time yet. If anyone is interested in spending some time and coding on this, please let me know at wikidatabase.org
It's not quite what you're looking for, but Semantic Mediawiki adds database-like features to MediaWiki:
http://semantic-mediawiki.org/wiki/Semantic_MediaWiki
It's still fundamentally a Wiki, but you can add semantic tags to pages ([[foo::bar]] [[baz::1000]]) and then do database-type queries across them: SELECT baz FROM pages WHERE foo=bar would be {{#ask: [[foo::bar]] | ?baz}}. There is even an embryonic SPARQL implementation for pseudo-SQL queries.
OK this question is old, but Google led me here, so for anyone else out there looking for a wiki for structured data: Take a look at Foswiki.
This might be like what you're looking for: dbpedia.org. They're working on extracting data from Wikipedia, and encoding it in a structured format using RDF, so that it can be queried using SPARQL.
Linkeddata.org has a big list of RDF data sets.
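As a taste of what querying DBpedia looks like, a minimal sketch (assuming the SPARQLWrapper package) that asks the public endpoint for the English abstract of one resource:

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        SELECT ?abstract WHERE {
          <http://dbpedia.org/resource/Berlin>
              <http://dbpedia.org/ontology/abstract> ?abstract .
          FILTER (lang(?abstract) = "en")
        }
    """)
    sparql.setReturnFormat(JSON)

    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["abstract"]["value"][:200])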
Do you mean something like http://www.freebase.com?
You should check out https://www.wikidata.org/wiki/Wikidata:Main_Page which is a bit different but still may be of interest.
Something that might come close to your requirements is Google Docs.
What's offered is document editing roughly similar to MS Word, and spreadsheets roughly similar to Excel. I'm thinking of the latter, of course.
In Google Docs, you can create spreadsheets for free; being spreadsheets, they naturally have a row-and-column structure similar to a database, which you can define flexibly. You can also share these sheets with other people. This seems to be a by-invite-only process rather than open-to-all, but there may be other possibilities I'm not aware of, or that level of sharing might be enough for you in any case.
MindTouch should be able to do it. It's rather easy to get data in and out (for example, it's trivial to aggregate all the IPs for servers into one table).
I pretty much use it as a DB in the wiki itself (pages have tables, key/value pairs, inheritance, templates, etc.), but you can also interface with the API, write DekiScript, or grab the XML...
I like this idea. I have heard of some sites that are trying to pull together large datasets for various things for open consumption, but none that would allow a wiki feel.
You could start with something as simple as an installation of phpMyAdmin with a known password that would allow people to log in, create a database, edit data and query from any other site on the web.
It might suffer from more accuracy problems than Wikipedia, though.
OpenRecord, development of which seems to have halted in 2008, seems to approach this. It is a structured wiki in which pages are views on the data. Unlike RDBMSes it is loosely typed - the system tries to make a best guess about what data you entered, but defaults to text when it cannot guess. Schemas appear to have been implied.
http://openrecord.org
An example of the typing that is given is that of a date. If you enter '2008' in a record, the system interprets this as a date. If you enter 'unknown' however, the system allows that as well.
Perhaps you might be interested in Couch DB:
Apache CouchDB is a document-oriented database that can be queried and indexed in a MapReduce fashion using JavaScript. CouchDB also offers incremental replication with bi-directional conflict detection and resolution.
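For a sense of what that looks like in practice, a rough sketch driving CouchDB's HTTP API with the Python requests package (the local URL and admin credentials are made up):

    import requests

    base = "http://admin:secret@localhost:5984"  # hypothetical local instance

    requests.put(f"{base}/articles")             # create the database
    requests.put(f"{base}/articles/doc1",
                 json={"title": "Sample article", "author": "alice"})

    # A design document holding a JavaScript map function that indexes articles by author.
    requests.put(f"{base}/articles/_design/links", json={
        "views": {"by_author": {"map": "function (doc) { emit(doc.author, doc.title); }"}}
    })

    print(requests.get(f"{base}/articles/_design/links/_view/by_author").json())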
I'm working on an Open Source PHP / Symfony / PostgreSQL app that does this.
It allows multiple projects, each project can have multiple directories, each directory has a defined field structure. Admins set all this up.
Then members of the public can suggest new records, edit or report existing ones. All this is moderated and versioned.
It's early days yet but it basically works and is already in real world use in several projects.
Future plans already in progress include tools to help keep the data up to date, better searching/querying and field types that allow translations of content between languages.
There is more at http://www.directoki.org/
I'm surprised that nobody has mentioned Wikibase yet, which is the software that powers Wikidata.

optical character recognition of PDFs of parliamentary debates

For contract work, I need to digitize a lot of old, scanned, graphics-only plenary debate protocol PDFs from the Federal Parliament of Germany.
The problem is that most of these files have a two-column format:
Sample Protocol http://sert.homedns.org/img/btp12001.png
I would love to read your answers to the following questions:
How can I split the two columns before feeding them into OCR?
Which commercial or open-source OCR software or framework do you recommend, and why?
Please note that any tool, programming language, framework, etc. is fine. Don't hesitate to recommend esoteric products or libraries if you think they are cut out for the job ^__^!!
UPDATE: These documents have already been scanned by the parliament o_O: sample (same as the image above). There are lots of them, and I want to deliver on the contract ASAP, so I can't go and fetch print copies of the same documents, cut them up, and scan them myself. There are just too many of them.
Best Regards,
Cetin Sert
Cut the pages down the middle before you scan.
It depends on what OCR software you are using. A few years ago I did some work with an OCR API; I can't quite remember the name, but I think there are lots of alternatives. Anyway, this API allowed me to define regions on the page to OCR. If you always know roughly where the columns are, you could use an SDK to map out parts of the page.
I use OmniPage 17 for such things. It has a batch mode too, where you can put the documents into one folder, from which they are grabbed, and have the results put into another.
It automatically recognizes the layout, including columns, or you can set the default layout to columns.
You can set many options for how the output should look.
But try a demo to see whether it works correctly. At the moment I have problems with ligatures in some of my documents, so words like "fliegen" come out as "fl iegen" and have to be corrected by hand.
Take a look at http://www.wisetrend.com/wisetrend_ocr_cloud.shtml (an online, REST API for OCR). It is based on the powerful ABBYY OCR engine. You can get a free account and try it with a few of your images to see if it handles the 2-column format (it should be able to do it). Also, there are a bunch of settings you can play with (see API documentation) - you may have to tweak some of them before it will work with 2 columns. Finally, as a solution of last resort, if the 2-column split is always in the same place, you can first create a program that splits the input image into two images (shouldn't be very difficult to write this using some standard image processing library), and then feed the resulting images to the OCR process.
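If you do end up splitting the images yourself, here is a rough sketch of that last-resort approach (assuming the Pillow and pytesseract packages; Tesseract is just a stand-in OCR engine here, and the split is assumed to sit exactly at the page's horizontal midpoint):

    from PIL import Image
    import pytesseract

    page = Image.open("btp12001.png")  # one scanned two-column page
    width, height = page.size
    left = page.crop((0, 0, width // 2, height))
    right = page.crop((width // 2, 0, width, height))

    # OCR each column separately so the reading order is preserved.
    text = pytesseract.image_to_string(left, lang="deu") + \
           pytesseract.image_to_string(right, lang="deu")
    print(text[:500])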

Tool or methods for automatically creating contextual links within a large corpus of content?

Here's the basic scenario - I have a corpus of say 100,000 newspaper-like articles. Minimally they will all have a well-defined title, and some amount of body content.
What I want to do is find runs of text in articles that ought to link to other articles.
So, if article Foo has a run of text like "Students in 8th grade are being encouraged to read works by Jean-Paul Sartre" and article Bar is titled (and about) "The important works of Jean-Paul Sartre", I'd like to automagically create that HTML link from Foo to Bar within the text of Foo.
You should ask yourself something before adding the links: what benefit for users do you want to achieve by doing this? You probably want to increase the navigability of your site. Maybe it is better to create an easier way to add links to older articles in the form used to submit new ones. Maybe it is possible to add a "one-click search for selected text" feature. Maybe you can add wiki-like functionality that lets users propose a link for selected text. You probably want to add links to related articles (generated through a tagging system or text mining) below the articles.
Some potential problems with a fully automated link adder:
You may need to implement a good word sense disambiguation algorithm to avoid confusing or even irritating the user by placing bad automatic links with regex (or simple substring matching).
As the number of articles is large, you do not want to generate the HTML for the extra links on every request; cache it instead.
You need to make a decision on duplicate titles, or titles that contain another title as a substring (either take the longest title, link to the most recent article, or prefer an article from the same category).
TLDR version: find alternative solutions that provide desired functionality to the users.
What you are looking for are text mining tools. You can find more info and links at http://en.wikipedia.org/wiki/Text_mining. You might also want to check out Lucene and its ports at http://lucene.apache.org. Using these tools, the basic idea would be to find a set of similar articles based on the article (or title) in question. You could search various properties of the article including titles and content or both. A tagging system a la Delicious (or Stackoverflow) might also be helpful. Rather than pre-creating the links between articles, you'd present the relevant articles in an interface much like the Related questions interface on the right-hand side of this page.
If you wanted to find and link specific text in each article, I think you'd need to do some preprocessing to select pertinent phrases to key on. Even then I think it would be very hard not to miss things due to punctuation/misspellings or to not include irrelevant links for the same reasons.
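To make the naive end of the spectrum concrete, here is a minimal sketch (with a made-up key-phrase heuristic) that scans each article's body for phrases taken from other articles' titles and records candidate links:

    articles = {
        "Foo": "Students in 8th grade are being encouraged to read works by Jean-Paul Sartre.",
        "The important works of Jean-Paul Sartre": "Sartre's major works include Being and Nothingness.",
    }

    def key_phrase(title):
        # Hypothetical heuristic: use the last two words of the title.
        return " ".join(title.split()[-2:])

    candidate_links = []
    for source, body in articles.items():
        for target in articles:
            if target != source and key_phrase(target).lower() in body.lower():
                candidate_links.append((source, target))

    print(candidate_links)  # [('Foo', 'The important works of Jean-Paul Sartre')]

As the answers above note, anything beyond this toy version needs disambiguation, caching, and handling of duplicate or nested titles.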