Is it possible to rewrite a description in different words with the same meaning through data manipulation? - selenium

I want to copy data from a website that sells courses such as ITIL, Prince2, PMP and many other IT-sector courses. There are 20,000 different course descriptions.
I want to use Selenium to scrape all the data, but the descriptions are still subject to copyright.
Kindly let me know how I can rewrite all of those descriptions so that they keep the same meaning but use different words.
Is there any API that would let me build code to rewrite these descriptions, either by substituting synonyms or by changing the grammar to produce completely new sentences with the same meaning?
Kindly let me know where to start.
Thanks,

The task you are referring to is called paraphrasing.
There is a lot of research in this field. On arXiv you will find research papers on the topic. However, since you are asking for an API, I am assuming you don't want to implement these models yourself. Luckily, some authors have published their models online on GitHub. (Note: some are re-implementations by someone else.)
When you use some of these implementations, note that most offer a pre-trained model. Do read which data set was used for training and try to pick the one that is the most similar to the data that you are facing. By doing so, more words in the domain of your descriptions will be available and more synonyms can be used.

Related

Looking for a program/script to collect sentences from news articles

So I'm currently working on a research paper on media bias (or lack thereof) towards 2020 presidential candidates.
For this, I'm looking for a way to make a huge database of sentences that mention these politicians by name or (if possible) with a pronoun. Right now I'd like to only focus on 5-7 of the biggest American news outlets (WaPo, NYT, FOX, etc.).
I want to collect all of these sentences into an Excel sheet, including a timestamp of when the article was released and a link to the article itself. I actually don't know if that's feasible or whether such program/script exists or not.
Do you think there's a way to solve this, does it already exist, and if not, can a rookie programmer write a script for this?
Thank you for all your help in advance!
You'd probably just need to create your own web scraper. You could have a Set of names that you're looking for, and if the name exists on the page then you can have some heuristics to get the sentence it's in. You'll probably have to have some specific stuff for getting the timestamp from the article. I'd say it wouldn't be too bad since you're targeting only a few news outlets, but probably a bit challenging for a rookie programmer.
Also, I recommend checking out something like https://www.webscraper.io/
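The name-matching heuristic described above can be sketched roughly as follows. This is only a minimal sketch, assuming the article text has already been downloaded and stripped of HTML; the names in `NAMES`, the helper names and the CSV column layout are all hypothetical, and the sentence splitter is deliberately naive.

```python
import csv
import io
import re

# Hypothetical names to track; swap in the candidates you care about.
NAMES = {"Joe Biden", "Donald Trump", "Bernie Sanders"}

# Naive boundary: ., ! or ? followed by whitespace and a capital letter.
# Abbreviations such as "Mr. Biden" will be split wrongly by this heuristic.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def sentences_mentioning(text, names=NAMES):
    """Return (name, sentence) pairs for each sentence containing a tracked name."""
    hits = []
    for sentence in SENTENCE_END.split(text):
        sentence = sentence.strip()
        for name in sorted(names):
            if name in sentence:
                hits.append((name, sentence))
    return hits


def to_csv(url, published, text):
    """Build CSV text (which Excel can open) with one row per matching sentence."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["name", "sentence", "published", "url"])
    for name, sentence in sentences_mentioning(text):
        writer.writerow([name, sentence, published, url])
    return buffer.getvalue()
```

Getting the timestamp and the article body out of each outlet's pages is the site-specific part and is not covered here. Pronoun mentions would need coreference resolution, which this sketch does not attempt.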

Custom model in Apache Open NLP

I am currently working with custom models that I am training for my own use case. My use case is to classify emails based on whether they are address change requests. If the address change request can be understood from a single sentence, it works fine. But if the request has to be understood from multiple sentences, it does not work.
A few examples are given below:
Example 1: THIS IS WORKING
a) Training file:
Guys I wish to <START:contactupdate> change my address <END> .
My new address is 68 Dorset Road, Coventry, West Midlands, CV1 4ED.
Please confirm once you are done.
Thanks.
b) Testing the model with the sentence below:
String input = "Guys I wish to change my address.My new address is 68 Dorset Road, Coventry, West Midlands, CV1 4ED.Please confirm once you are done. Thanks."; //Working
Example 2: THIS IS NOT WORKING
Let's say the address change request can only be deduced from multiple lines:
"My old address is no longer valid. Need to update it."
How do I train my model in this scenario? How do I specify the custom tags for the example above?
Can you please help? I am stuck.
Many Thanks
What do you mean by "not working"? That the thing you want to retrieve is not retrieved? Or that the training crashes somewhere when the tags are spread out over multiple lines?
In general, the model that you are training in this procedure (MaxEnt by default) tries to detect common features of the thing you are training for. Typically, these are named entities like persons, organisations and locations. In many languages, these have typical features (the prefix Mr./Mrs., the suffix corp., and the morpheme "street", respectively). The model can pick these up and apply them to new data, which leads to the recognition of whatever it is you want to recognise.
What you are trying to do, however, is already fairly advanced NLP. The longer the phrase, the larger the possible variation, so it becomes more difficult to pick up commonalities. I'd say that for your use case people typically use parsing (either constituency or dependency parsing) or other tools more sophisticated than this relatively flat pattern recognition, so you may want to look into those instead.
I don't know how much data you have at your disposal from which you can infer the different ways of expressing the desire to change an address in a customer database. If it is a reasonable amount (i.e. not just a couple of sentences), you may want to annotate it manually, parse the corpus, and use machine learning on the parse trees/graphs of the sentences of interest. As mentioned, this is quite advanced NLP in my opinion, and not something that has an out-of-the-box solution.
If I understand your question correctly, you are trying to categorize emails to find out whether they are address change requests. But the model in your example looks like one for named entity recognition. In my opinion, it might be better to use the "Document Categorizer" feature of Apache OpenNLP.
You can provide different samples of possible sentences that should be categorized as an address change. "Address_change", "general_inquiry" etc. can be the categories. This way you can add as many different samples as you want, with many variations of sentences. Here is an easy, basic tutorial on document categorization training and usage.
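As a rough illustration of what such training samples could look like: the OpenNLP Document Categorizer expects one document per line, with the category first and the text after a whitespace. The sketch below only prepares that file content; the helper name `to_doccat_line` and the "general_inquiry" sample text are made up for illustration, and the training itself would still be done with OpenNLP.

```python
def to_doccat_line(category, text):
    """Format one sample as '<category> <text>' on a single line."""
    if not category or any(ch.isspace() for ch in category):
        raise ValueError("category must be a single token without whitespace")
    # Collapse newlines and repeated spaces so a multi-sentence email stays on one line.
    return f"{category} {' '.join(text.split())}"


# Multi-sentence requests become a single sample each.
samples = [
    ("Address_change", "My old address is no longer valid.\nNeed to update it."),
    ("Address_change", "Guys I wish to change my address. Please confirm once you are done."),
    ("general_inquiry", "Could you tell me your opening hours?"),  # hypothetical sample
]

training_data = "\n".join(to_doccat_line(c, t) for c, t in samples) + "\n"
```

Because the whole email is one sample, the request no longer has to be tagged inside a single sentence.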

How to use the Alignment API to generate an Alignment Format file?

I am going to take part in the Instance Matching track of OAEI, and I need to produce my results in the Alignment Format. To achieve this, I have studied the official tutorials (link: http://alignapi.gforge.inria.fr/tutorial/tutorial1/index.html).
But there are many differences between the method taught there and the method I need. In other words, I can't understand the API.
This is my situation:
I have 2 RDF files (person11.rdf and person12.rdf; the data is at http://oaei.ontologymatching.org/2010/im/index.html, the PR dataset), and each file has information about many persons. I want to find the coreferent entities, and the results must be printed in the Alignment Format. I find the results by using SPARQL, but I don't know how to print them in the Alignment Format.
So, I have three questions:
First, if I want to generate an Alignment Format file, is the method taught in the tutorial the only way?
Second, can you give me your method (preferably code) to generate the Alignment Format file? Maybe I have been wrong from the beginning; can you give me some suggestions?
Third, if you have taken part in OAEI or know something about Instance Matching, can you give me some advice? I want to find the coreferent entities.
Thank you!
First question: I guess that the "method taught" is the one in tutorial1. It is not the appropriate one, since you have to write a program to output the alignment format and this is a command line interface tutorial. In this case, you'd better look at http://alignapi.gforge.inria.fr/tutorial/tutorial2/index.html
Then, there are basically two ways to do it:
The advised one (for several reasons, and for participating in OAEI) is to follow these tutorials: create an empty alignment, create the correspondences from the results of your SPARQL query, and render it. Everything is covered by the tutorials except the part concerning your SPARQL queries. This assumes that you are programming in Java.
The non-advised solution (primarily non-advised because you will have to debug your own renderer) is to write, in any programming language you want, a program that outputs the format (which corresponds to what you cite).
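To give an idea of the non-advised route, here is a minimal Python sketch of a hand-written renderer. The element names, the namespace and the header values (`xml`, `level`, `type`) are written from memory of the Alignment Format and should be checked against the format documentation before submitting anything; the function name `render_alignment` and the shape of `matches` (tuples of entity URI, entity URI, confidence, e.g. built from your SPARQL results) are assumptions.

```python
from xml.sax.saxutils import escape, quoteattr

ALIGN_NS = "http://knowledgeweb.semanticweb.org/heterogeneity/alignment#"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XSD_NS = "http://www.w3.org/2001/XMLSchema#"


def render_alignment(onto1, onto2, matches):
    """Render (entity1_uri, entity2_uri, confidence) tuples as Alignment Format RDF/XML."""
    lines = [
        "<?xml version='1.0' encoding='utf-8'?>",
        f"<rdf:RDF xmlns='{ALIGN_NS}'",
        f"         xmlns:rdf='{RDF_NS}'",
        f"         xmlns:xsd='{XSD_NS}'>",
        "<Alignment>",
        "  <xml>yes</xml>",
        "  <level>0</level>",
        "  <type>**</type>",
        f"  <onto1>{escape(onto1)}</onto1>",
        f"  <onto2>{escape(onto2)}</onto2>",
    ]
    for entity1, entity2, confidence in matches:
        # One <map><Cell> per correspondence; "=" states that the entities are equivalent.
        lines += [
            "  <map>",
            "    <Cell>",
            f"      <entity1 rdf:resource={quoteattr(entity1)}/>",
            f"      <entity2 rdf:resource={quoteattr(entity2)}/>",
            "      <relation>=</relation>",
            f"      <measure rdf:datatype='{XSD_NS}float'>{confidence:.2f}</measure>",
            "    </Cell>",
            "  </map>",
        ]
    lines += ["</Alignment>", "</rdf:RDF>"]
    return "\n".join(lines) + "\n"
```

Escaping the URIs matters because a stray `&` or quote would make the file unparseable, which is exactly the kind of renderer bug you would otherwise have to debug yourself.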
Think about it: how would you expect that the Alignment API knows the results of your SPARQL query? If you come up with a nice solution, contact the API developers, they may integrate it and others could benefit.
Second question: I cannot do better than what is above.
Third question: too general. Read the OAEI results (http://oaei.ontologymatching.org) and look at the code of others.
Good luck!

Are there any open source resources for SQL schema design patterns?

I can barely count the number of times I've created a "users" table, similar for "computers" and "customers". I've tried looking around, but haven't ever seen a resource for modeling these schema that we see over and over again. It seems like some of these objects should be some-kind-of-solved by now. Is there anything like this?
I have never seen anything like this either and I'm not sure it's necessary. Yes, there are a lot of similarities but every application is different. At one point I had built an internal library of some of my more "standard" tables (user is a good example) to use as a jumping point, but I have yet to create two identical tables for different systems.
Thus, I have yet to ever use the library I built because I can write the new table quicker and more error free than I can modify another existing example to work for the current project.
You could look at the source code of some popular open-source CRM/ERPs, such as OpenERP, though some of them are not great.
These are the top books on data modelling patterns:
Analysis Patterns, Fowler
Data Model Resource Book, vol. 1,2,3, Silverston
Enterprise Model Patterns, Hay
Patterns of Data Modeling, Blaha

Wiki Database, is there one?

I was searching the net for something like a wiki database, just like wikipedia but instead stores structured content, editable by users. What I was looking for was an online database accessible by everyone where people can design the schema and data with proper versioning of both schema and data. I couldn't find any such site. I am not sure if it is my search skills or if there really is no wiki database as of now. Does anyone out there know anything like this?
I think there is great potential for something like this. A possible example would be a website with a GUI for querying a MySQL DB, where any website visitor can create DB objects and populate data.
UPDATE: I had registered the domain wikidatabase.org to get started on a tool but I didn't find enough time yet. If anyone is interested in spending some time and coding on this, please let me know at wikidatabase.org
It's not quite what you're looking for, but Semantic Mediawiki adds database-like features to MediaWiki:
http://semantic-mediawiki.org/wiki/Semantic_MediaWiki
It's still fundamentally a Wiki, but you can add semantic tags to pages ([[foo::bar]] [[baz::1000]]) and then do database-type queries across them: SELECT baz FROM pages WHERE foo=bar would be {{#ask: [[foo::bar]] | ?baz}}. There is even an embryonic SPARQL implementation for pseudo-SQL queries.
OK this question is old, but Google led me here, so for anyone else out there looking for a wiki for structured data: Take a look at Foswiki.
This might be like what you're looking for: dbpedia.org. They're working on extracting data from Wikipedia, and encoding it in a structured format using RDF, so that it can be queried using SPARQL.
Linkeddata.org has a big list of RDF data sets.
Do you mean something like http://www.freebase.com?
You should check out https://www.wikidata.org/wiki/Wikidata:Main_Page which is a bit different but still may be of interest.
Something that might come close to your requirements is Google Docs.
What's offered is document editing roughly similar to MS Word, and spreadsheets roughly similar to Excel. I'm thinking of the latter, of course.
In Google Docs, you can create spreadsheets for free; being spreadsheets, they naturally have a row-and-column structure similar to a database, which you can define flexibly. You can also share these sheets with other people. This seems to be a by-invite-only process rather than open-to-all, but there may be other possibilities I'm not aware of, or that level of sharing might be enough for you in any case.
mindtouch should be able to do it. It's rather easy to get data in / out. (for example: it's trivial to aggregate all the IP's for servers into one table).
I pretty much use it as a DB in the wiki itself (pages have tables, key/value..inheritance, templates, etc...) but you can also interface with the API, write dekiscript, grab the XML...
I like this idea. I have heard of some sites that are trying to pull together large datasets for various things for open consumption, but none that would allow a wiki feel.
You could start with something as simple as an installation of phpMyAdmin with a known password that would allow people to log in, create a database, edit data and query from any other site on the web.
It might suffer from more accuracy problems than wikipedia though.
OpenRecord, development of which seems to have halted in 2008, seems to approach this. It is a structured wiki in which pages are views on the data. Unlike RDBMSes it is loosely typed - the system tries to make a best guess about what data you entered, but defaults to text when it cannot guess. Schemas appear to have been implied.
http://openrecord.org
An example of the typing that is given is that of a date. If you enter '2008' in a record, the system interprets this as a date. If you enter 'unknown' however, the system allows that as well.
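The "best guess, else text" behaviour described above can be illustrated with a small sketch. This is not OpenRecord's actual code; the function name `guess_value` and the specific rules (a four-digit year or an ISO date counts as a date, anything numeric counts as a number) are assumptions chosen only to show the idea of loose typing.

```python
from datetime import date


def guess_value(raw):
    """Return a (type_name, value) pair, falling back to plain text when no guess fits."""
    text = raw.strip()
    # A bare four-digit year is treated as a date (1 January of that year).
    if len(text) == 4 and text.isdigit():
        try:
            return ("date", date(int(text), 1, 1))
        except ValueError:
            pass  # e.g. "0000" is not a valid year
    try:
        return ("date", date.fromisoformat(text))
    except ValueError:
        pass
    try:
        return ("number", float(text))
    except ValueError:
        # Nothing matched, so the entry is kept as-is instead of being rejected.
        return ("text", text)
```

With these rules, `guess_value("2008")` is classified as a date, while `guess_value("unknown")` is accepted as text rather than raising an error.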
Perhaps you might be interested in Couch DB:
Apache CouchDB is a document-oriented database that can be queried and indexed in a MapReduce fashion using JavaScript. CouchDB also offers incremental replication with bi-directional conflict detection and resolution.
I'm working on an Open Source PHP / Symfony / PostgreSQL app that does this.
It allows multiple projects, each project can have multiple directories, each directory has a defined field structure. Admins set all this up.
Then members of the public can suggest new records, edit or report existing ones. All this is moderated and versioned.
It's early days yet but it basically works and is already in real world use in several projects.
Future plans already in progress include tools to help keep the data up to date, better searching/querying and field types that allow translations of content between languages.
There is more at http://www.directoki.org/
I'm surprised that nobody has mentioned Wikibase yet, which is the software that powers Wikidata.