Retrieving Freebase quad dump type names from id - semantics

I'm currently working on a project using the Freebase dumps, whose assertions I insert into a LevelDB ordered by mid. My goal is, for a given name like Bob Dylan, to retrieve every type linked to that name.
For example, "Bob Dylan" would correspond to "Musician", "Film Producer" and so on, each corresponding themselves to the types "/music/artist", "/film/producer" etc...
It's rather easy to find the Bob Dylan mid and its types in the quad dump:
/m/bobdylanmid /common/topic/notable_types /music/artist
/m/bobdylanmid /common/topic/notable_types /film/producer
Unfortunately, I'd now like to find the names of those types in various languages, and I can't find a logical way to retrieve them from the dump.
Any clues, please?

I'm not 100% certain, but I don't think the schema is actually in the quad dump. I know it never used to be.
You'll need to look up the names using a query like this. Unfortunately, the human readable names exist only in English, so you'd need to jump through some more hoops to get other languages. For that you could try something along the lines of this slightly more complicated query
[{
  "id": "/music/artist",
  "/freebase/type_profile/equivalent_topic": {
    "name": {
      "lang": null,
      "value": null
    }
  },
  "name": null
}]
It depends on the "equivalent topic" property being filled in, which may not be the case for all types. If you only want a few languages, you could modify the query to return those explicitly ("Musician" has 45 different language variants).

If you are mainly interested in cases like your example (a person is/was a ...), using properties rather than types may do the job. In your case these would be the following (the latter via a CVT):
/people/person/profession
/people/person/employment_history /business/employment_tenure/title
This might be more what you want to have anyways, unless you also want to display that e.g. Alan Turing is a "Literature Subject".
For the corresponding instances (with types /business/job_title, /people/profession) you can get the names in different languages, where they exist.


Clean unstructured place name to a structured format

I have around 300k unstructured records like those in the screenshot below. I'm trying to use Google Refine or OpenRefine to clean them up. However, I'm unable to find a proper way to do this, and I'm new to the tool. Any help would be greatly appreciated. Also, the tool is quite slow on 300k records: whenever I try something out, it takes a long time to process and produce output.
Alternatively, could you suggest any other open-source tools and techniques to do this?
As Owen said in the comments, your question is probably too broad to receive an acceptable answer. We can only provide you with a general procedure to follow.
In Open Refine, you'll need to create a column based on the messy column and apply transformations to delete unwanted characters. You'll have to use regular expressions. But for that, it's necessary to be able to identify patterns. It's not clear to me why the "ST" of "Nat.secu ST." is important, but not the "US" in "Massy Intertech US", nor the "36" in "Plowk 36" (Google doesn't know this word, so I'm not sure it is an organisation name).
On the basis of your fifteen lines, however, we can distinguish some clear patterns. For example, it looks like you'll have to remove the tokens (character sequences without spaces) at the end of the string that contain a #. For that, the GREL formula in Open Refine could look like this:
value.trim().replace(/\b\w+#\w+\b$/,'')
Here is a screencast if it's not clear to you.
But sometimes a company name may contain a #, in which case you will need to create more complex rules. For example, remove the token only if the string contains more than two words.
if(value.split(' ').length() > 2, value.replace(/\b\w+#\w+\b$/, ''), value)
And so on for the other patterns that you'll find (for example, any number sequence at the end that contains more than 4 digits and one - between them).
Feel free to check out the Open Refine documentation in case of doubt.
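For anyone who would rather script this outside Open Refine, the two GREL rules above can be sketched in Python. This is only an illustration under assumptions: the function name, the min_words parameter and the sample strings are made up for the example, and the regular expression is the same one used in the GREL formulas.

```python
import re

# A token of word characters around a '#', anchored at the end of the string
# (the same pattern as in the GREL expressions above).
TRAILING_HASH_TOKEN = re.compile(r"\b\w+#\w+\b$")


def strip_trailing_hash_token(value: str, min_words: int = 3) -> str:
    """Drop a trailing '#' token, but only when the name has at least min_words words.

    min_words=3 mirrors the GREL condition `length() > 2`.
    """
    value = value.strip()
    if len(value.split(" ")) < min_words:
        return value  # short names are left alone: the '#' may belong to the name
    return TRAILING_HASH_TOKEN.sub("", value).strip()


# Hypothetical sample values, not taken from the asker's data.
print(strip_trailing_hash_token("Massy Intertech AB#12"))
print(strip_trailing_hash_token("Acme A#1"))
```

Running a function like this over a CSV export is one way to sidestep interactive slowness on 300k rows, although whether it is actually faster depends on the data and the rules.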

A way to store array type data into single database table?

I have an input JSON which I need to feed into a database. We are exploring whether or not to normalize our database tables.
Following is the structure for the input data (json):
"attachments": [
  {
    "filename": "abc.pdf",
    "url": "https://www.web.com/abc.pdf",
    "type": "done"
  },
  {
    "filename": "pqr.pdf",
    "url": "https://www.web.com/pqr.pdf",
    "type": "done"
  }
],
In the above example, attachments could have multiple values (more than 2, up to 8).
We were thinking of creating a separate table called DB_ATTACHMENT and keeping all the attachments for a worker there. But the issue is that we have some 30+ different attachment-type arrays (phone, address, previous_emp, visas, etc.).
Is there a way to store everything in ONE table (employee)? One option I can think of is using a single column (ATTACHMENT), adding all the data in a delimited format, and having the logic in the target system parse and extract everything.
Any other better solution?
Thanks..
You can store the data in a single VARCHAR column as JSON, then recover the information in the client by decoding that JSON.
Also, some SQL implementations already offer native JSON datatypes. For example:
MariaDB: https://mariadb.com/kb/en/mariadb/column_json/
MySQL: https://dev.mysql.com/doc/refman/5.7/en/json.html
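As a minimal sketch of the single-column idea, here is the round trip with Python's sqlite3 and json modules. SQLite stands in for whatever engine you actually use, and the table layout and the employee name are assumptions made for the example.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical layout: one TEXT column holds the whole attachments array as JSON.
conn.execute(
    "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, attachments TEXT)"
)

attachments = [
    {"filename": "abc.pdf", "url": "https://www.web.com/abc.pdf", "type": "done"},
    {"filename": "pqr.pdf", "url": "https://www.web.com/pqr.pdf", "type": "done"},
]

# Encode on the way in ...
conn.execute(
    "INSERT INTO employee (id, name, attachments) VALUES (?, ?, ?)",
    (1, "Alice", json.dumps(attachments)),
)

# ... and decode on the way out, in the client.
row = conn.execute("SELECT attachments FROM employee WHERE id = ?", (1,)).fetchone()
decoded = json.loads(row[0])
filenames = [a["filename"] for a in decoded]
print(filenames)
```

Note that with a plain text column the database cannot filter on anything inside the JSON; every such question has to be answered in client code after decoding.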
Database systems store your data and offer you SQL to simplify your search requests when your data is structured.
It is up to you to decide whether to store the data in structured form and benefit from SQL, or to leave whoever queries it with the burden of parsing it.
It very much depends on how you intend to use the data. I'm not totally sure I understand your question, so I am going to rephrase the business domain I think you're working with - please comment if this is not correct.
The system manages 0..n employees.
One employee may have 0..8 attachments.
An attachment belongs to exactly 1 employee.
An attachment may be one of 30 different types.
Each attachment type may have its own schema.
If attachments aren't important in the business domain - they're basically notes, and you don't need to query or reason about them - you could store them as a column on the "employee" table, and parse them when you show them to the end user.
This solution may seem easier - but don't underestimate the conversion logic - you have to support Create, Read, Update and Delete for each attachment.
If attachments are meaningful in the business domain, this very quickly breaks down. If you need to answer questions like "find all employees who have attached abc.pdf", "find employees who do not have a telephone_number attachment", unpacking each employee_attachment makes your query very difficult.
In this case, you almost certainly need to store attachments in one or more separate tables. If the schema for each attachment is, indeed, different, you need to work out how to deal with inheritance in relational database models.
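To illustrate why a separate table helps with the questions quoted above, here is a small sketch using Python's sqlite3. The table and column names (attachment, kind, and so on) and the two employees are assumptions for the example, and it deliberately ignores the per-type schema problem by using one generic attachment table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE attachment (
        id INTEGER PRIMARY KEY,
        employee_id INTEGER NOT NULL REFERENCES employee(id),
        kind TEXT NOT NULL,      -- e.g. 'document', 'telephone_number'
        filename TEXT,
        url TEXT
    );
    INSERT INTO employee VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO attachment (employee_id, kind, filename, url) VALUES
        (1, 'document', 'abc.pdf', 'https://www.web.com/abc.pdf'),
        (1, 'telephone_number', NULL, NULL),
        (2, 'document', 'pqr.pdf', 'https://www.web.com/pqr.pdf');
    """
)

# "Find all employees who have attached abc.pdf"
with_abc = [
    r[0]
    for r in conn.execute(
        "SELECT DISTINCT e.name FROM employee e "
        "JOIN attachment a ON a.employee_id = e.id "
        "WHERE a.filename = ? ORDER BY e.name",
        ("abc.pdf",),
    )
]

# "Find employees who do not have a telephone_number attachment"
without_phone = [
    r[0]
    for r in conn.execute(
        "SELECT e.name FROM employee e WHERE NOT EXISTS ("
        "  SELECT 1 FROM attachment a"
        "  WHERE a.employee_id = e.id AND a.kind = ?"
        ") ORDER BY e.name",
        ("telephone_number",),
    )
]
print(with_abc, without_phone)
```

Both questions become one-line queries here, whereas with packed attachments each would require decoding every employee row first.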
Finally - some database engines support formats like JSON and XML natively. Yours may offer this as a compromise solution.

Category Implementation in a database

I'm building a system that involves users and teachers. In this particular system I would like to categorize the teachers, but the tricky part is that the categories are dynamic, so they can change at any time.
Since I'm developing the backend, I need a few functions:
The first one is showAllCategories(), that shows all the main categories.
The second is showSubcategories(), which shows the subcategories of a category.
Third is the showContent(), which in this case shows the teacher's information.
Before asking the mighty Stack Overflowers how this could be implemented efficiently, I thought I could use a doubly linked list approach: a categories table with the columns CategoryName, Before, After and Content, where a category with no After would have its Content point to the teachers table. This is my classic SQL approach; however, I'm using MongoDB, and since I'm a beginner I wonder whether I could take advantage of NoSQL in this particular situation?
MongoDB natively supports the Array type, which actually behaves more like a list. With $push and $pull you can add and remove elements from such an array field. $addToSet even makes sure there are no duplicates.
Now comes the question of how the categories are stored. You can make a categories collection holding the main categories, each with a field containing the array of its sub-categories:
{"_id": "science", "sub": ["chemist", "physicist", "biology"]}
{"_id": "languages", "sub": ["english", "german", "spanish"]}
Your teacher collection on the other hand would then have an array of embedded documents, the categories of the teacher. They are duplicates of those found in the categories collection, minus the fields that you won't need in the teacher view. This way you avoid joins, since they don't exist in MongoDB.
{
"_id": ObjectId(...),
"name": {"first": "Foo", "last": "Bar"},
"categories": ["chemist", "biology"]
}
The rest I am sure you can think up.
Addition: In short, use the flexible types that MongoDB offers, and don't worry about data redundancy. Embed documents often and don't forget the indexes.
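To make the document shapes concrete, here is a sketch of the three functions from the question written against plain Python lists that stand in for the two collections. Nothing here talks to a real MongoDB server; the snake_case function names and the integer _id are assumptions for the example.

```python
# In-memory stand-ins for the "categories" and "teachers" collections.
categories = [
    {"_id": "science", "sub": ["chemist", "physicist", "biology"]},
    {"_id": "languages", "sub": ["english", "german", "spanish"]},
]
teachers = [
    {"_id": 1, "name": {"first": "Foo", "last": "Bar"},
     "categories": ["chemist", "biology"]},
]


def show_all_categories():
    """Return the ids of all main categories."""
    return [c["_id"] for c in categories]


def show_subcategories(category_id):
    """Return the sub-categories of one main category (empty if unknown)."""
    for c in categories:
        if c["_id"] == category_id:
            return list(c["sub"])
    return []


def show_content(subcategory):
    """Return the teachers tagged with the given sub-category.

    Against MongoDB, the corresponding filter would be {"categories": subcategory},
    which matches documents whose array field contains that value.
    """
    return [t for t in teachers if subcategory in t["categories"]]


print(show_all_categories())
print(show_subcategories("science"))
print([t["name"]["last"] for t in show_content("chemist")])
```

Because the categories live in ordinary documents, adding or renaming one at runtime is just an update to the categories collection plus an update to the affected teachers.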

Modeling products with vastly different sets of needed-to-know information and linking them to lineitems?

I'm currently working on a site that sells products of varying types that are custom manufactured. I've got your general, standard cart schema: Order has many LineItems, LineItems have one Product, but I've run into a bit of a sticking point:
Let's say one of our products is a ball, and one of our products is a box of crayons. While people are creating their order, we end up creating items that could be represented by some pseudocode:
Ball:
  attributes:
    diameter: decimal
    color: foreign_ref_to Colors.id
    material: foreign_ref_to Materials.id
CrayonBox:
  attributes:
    width: decimal
    height: decimal
    front_text: string
    crayons: many_to_many with Crayon
...
Now, these are created and stored in our db before an order is made. I can pretty easily make it so that when an item is added to a cart, we get a product name and price by doing the linking from Ball or CrayonBox in my controller and generating the LineItem, but it would be nice if we could provide a full set of info for every line item.
I've thought of a few possible solutions, but none that seem ideal:
One: use an intermediary "product info" linking table, and represent different products in terms of that, so we'd have something like:
LineItem:
  information: many_to_many with product_information
  ...
ProductInformation:
  lineitem: many_to_many with line_item
  name: string
  value: string
ProductInformation(name='color', value=$SOMECOLOR)
ProductInformation(name='color', value=$SOMEOTHERCOLOR)
...
The problem with this is that the types of data needed to represent each attribute of a product do not all fall under the same column type. I could represent everything with strings, but $DEITY knows I don't even come close to thinking that's a good solution.
The other solution I've thought of is having the LineItem table have a foreign key to each table that represents a Product type. Unfortunately, this means I would have to check for the existence of each foreign key in my controller. I don't like this very much at all, but I like it marginally better than stuffing every piece of data into one datatype and then dealing with all the conversion stuff outside of the DB.
One other possible solution would be to store the tablename of the product data in a column, but that can't possibly be a good thing to do, can it? I lose the capability of the db to link stuff together, and it strikes me as akin to using eval() where it's not needed -- and we all know that eval() isn't really needed very often.
I want to be able to say "give me the line item, and then the extended info for that line item", and have the correct set of information for various product types.
So, people who actually know what they're doing with database schema, what should I be doing? How should I be representing this? This seems like it would be a fairly common use case, but I haven't been able to find much info with googling -- is there a common pattern for things like this? Need more info? This can't possibly be outside of the realm of "you can use a RDBMS for this", can it?
Edit: I'm now fairly certain that what I want here is Class Table Inheritance, with an alias in my individual models to "normalize" the link followed to the "info" table for each product type. Unfortunately, the ORM I'm kinda stuck using for this (Doctrine 1.2) doesn't support Class Table Inheritance. I may be able to accomplish something similar with Doctrine's "column aggregation" inheritance, but egh. Anyone think I'm barking way up the wrong tree? I looked over EAV, and I don't think it quite fits the problem -- each set of information about different products is known, although they might be very different from product type A to product type B. The flexibility of EAV may be nice, but it seems like an abuse of the db for a problem like this.
It strikes me that this is a perfect fit for the likes of CouchDB / MongoDB which allow every 'row' to contain different attributes, yet permits indexed lookups. It should be fairly straightforward to build a hybrid structure using MySQL for the rigid relational parts and 'nosql' for the parts of varying shape.
Take a look at this discussion.
Assumptions:
You have some specific products you're selling. I.e., you know you're selling crayons, but not spatulas. The customer doesn't come to your site and try to order a product you've never heard of.
The products you're selling have a pre-existing set of attributes. I.e., crayons have color; crayon_boxes have width, height, crayons... The customer doesn't come to your site and try to specify the value for an attribute you've never heard of.
One way to do this (if you're an RDBMS purist, please close your eyes now until I tell you to open them again) is to use an attribute string. So the table would be like this:
Products
+ ProductName
+ ProductAttribute
And then a sample record would be like this:
Product Name = "Crayon Box"
Product Attribute = "Height:5 inches;Width:7 inches"
With something like this, parse the name/value pairs in or out as necessary.
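A minimal sketch of that parsing step in Python, assuming the "Name:value;Name:value" layout of the sample record. The function names are made up for the example, and values containing a ";" would break this simple format.

```python
def parse_attributes(attribute_string: str) -> dict[str, str]:
    """Split 'Height:5 inches;Width:7 inches' into a name -> value mapping."""
    attributes = {}
    for pair in attribute_string.split(";"):
        if not pair.strip():
            continue  # tolerate a trailing ';' or empty segments
        name, _, value = pair.partition(":")  # split on the first ':' only
        attributes[name.strip()] = value.strip()
    return attributes


def format_attributes(attributes: dict[str, str]) -> str:
    """Inverse of parse_attributes: join the pairs back into one string."""
    return ";".join(f"{name}:{value}" for name, value in attributes.items())


parsed = parse_attributes("Height:5 inches;Width:7 inches")
print(parsed)
print(format_attributes(parsed))
```

Every value comes back as a string, so any typing (decimal, foreign key, and so on) has to be re-imposed by the application, which is exactly the trade-off the question was worried about.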

First Name Variations in a Database

I am trying to determine what the best way is to find variations of a first name in a database. For example, I search for Bill Smith. I would like it to return "Bill Smith", obviously, but I would also like it to return "William Smith", or "Billy Smith", or even "Willy Smith". My initial thought was to build a first name hierarchy, but I do not know where I could obtain such data, if it even exists.
Since users can search the directory, I thought this would be a key feature. For example, people I went to school with called me Joe, but I always go by Joseph now. So, I was looking at doing a phonetic search on the last name, either with NYSIIS or Double Metaphone, and then searching on the first name using this name hierarchy. Is there a better way to do this - maybe some sort of graded relevance using a full text search on the full name instead of a two part search on the first and last name? Part of me thinks that if I stored a name as a single value instead of multiple values, it might facilitate more search options at the expense of being able to address a user by the first name.
As far as platform, I am using SQL Server 2005 - however, I don't have a problem shifting some of the matching into the code; for example, pre-seeding the phonetic keys for a user, since they wouldn't change.
Any thoughts or guidance would be appreciated. Countless searches have pretty much turned up empty. Thanks!
Edit: It seems that there are two very distinct camps on the functionality and I am definitely sitting in the middle right now. I could see the argument of a full-text search - most likely done with a lack of data normalization, and a multi-part approach that uses different criteria for different parts of the name.
The problem ultimately comes down to user intent. The Bill / William example is a good one, because it shows the mutation of a first name based upon the formality of the usage. I think that building a name hierarchy is the more accurate (and extensible) solution, but is going to be far more complex. The fuzzy search approach is easier to implement at the expense of accuracy. Is this a fair comparison?
Resolution: Upon doing some tests, I have decided to go with an approach where the initial registration will take a full name and I will split it out into multiple fields (forename, surname, middle, suffix, etc.). Since I am sure that it won't be perfect, I will allow the user to edit the "parts", including adding a maiden or alternate name. As far as searching goes, with either solution I am going to need to maintain what variations exist, either in a database table, or as a thesaurus. Neither has an advantage over the other in this case. I think it is going to come down to performance, and I will have to actually run some benchmarks to determine which is best. Thank you, everyone, for your input!
In my opinion you should either do a feature right and make it complete, or you should leave it off to avoid building a half-assed intelligence into a computer program that still gets it wrong most of the time ("Looks like you're writing a letter", anyone?).
In the case of human names, a computer will get it wrong most of the time; doing it right and completely is impossible, IMHO. Maybe you can hack something that handles the most common English names. But actually, the intelligence to look for both "Bill" and "William" is built into almost any English speaking person - I would leave it to them to connect the dots.
The term you are looking for is Hypocorism:
http://en.wikipedia.org/wiki/Hypocorism
And Wikipedia lists many of them. You could bang out some Python or Perl to scrape that page and put it in a db.
I would go with a structure like this:
create table given_names (
  id int primary key,
  name text not null unique
);
create table hypocorisms (
  id int references given_names(id),
  name text not null,
  primary key (id, name)
);
insert into given_names values (1, 'William');
insert into hypocorisms values (1, 'Bill');
insert into hypocorisms values (1, 'Billy');
Then you could write a function/sproc to normalize a name:
normalize_given_name('Bill'); --returns William
One issue you will face is that different names can have the same hypocorism (Albert -> Al, Alan -> Al), so normalizing a name may yield more than one result.
I think your basic approach is solid. I don't think fulltext is going to help you. For seeding, behindthename.com seems to have a large amount of the data you want.
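One way to cope with that ambiguity is to have the lookup return every candidate rather than a single name. The sketch below runs the schema above in SQLite through Python's sqlite3; the Albert and Alan rows and the list-returning behaviour of normalize_given_name are assumptions added for the example, and matching is case-sensitive as written.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table given_names (id int primary key, name text not null unique);
    create table hypocorisms (
        id int references given_names(id),
        name text not null,
        primary key (id, name)
    );
    insert into given_names values (1, 'William'), (2, 'Albert'), (3, 'Alan');
    insert into hypocorisms values (1, 'Bill'), (1, 'Billy'), (2, 'Al'), (3, 'Al');
    """
)


def normalize_given_name(name):
    """Return every formal name the input may stand for, sorted alphabetically."""
    rows = conn.execute(
        """
        select g.name from given_names g where g.name = ?
        union
        select g.name from hypocorisms h
        join given_names g on g.id = h.id
        where h.name = ?
        """,
        (name, name),
    ).fetchall()
    return sorted(r[0] for r in rows)


print(normalize_given_name("Bill"))
print(normalize_given_name("Al"))
```

A search for "Al Smith" would then expand to both Albert and Alan, leaving the ranking of the results to the caller.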
Are you using SQL Server 2005 Express with Advanced Services? It sounds to me as though you would benefit from Full Text indexing, and more specifically from CONTAINS and CONTAINSTABLE. Here is a link on the uses of CONTAINSTABLE:
http://msdn.microsoft.com/en-us/library/ms189760.aspx
and here is the download link for SQL Server 2005 With Advanced Services:
http://www.microsoft.com/downloads/details.aspx?familyid=4C6BA9FD-319A-4887-BC75-3B02B5E48A40&displaylang=en
Hope this helps,
Andrew
You can use the SQL Server Full Text Search and do an inflectional search.
Basically like:
SELECT ProductId, ProductName
FROM ProductModel
WHERE CONTAINS(CatalogDescription, ' FORMSOF(THESAURUS, metal) ')
Check out:
http://en.wikipedia.org/wiki/SQL_Server_Full_Text_Search#Inflectional_Searches
http://msdn.microsoft.com/en-us/library/ms345119.aspx
http://www.mssqltips.com/tip.asp?tip=1491
Not sure what your application is, but if your users know at the time of sign up that people from their past might be searching the database for them, you could offer them the chance in the user profile to define other names they might be known as and that they want people to be able to search on (including last names: women change these all the time, which makes finding them much harder!). Store these in a separate related table. Then search on that. Just make the structure such that you can define one name as the main name (the one you use for everything except the search).
You'll find that you're dabbling in an area known as "Natural Language Processing" and you'll need to do several things, most of which can be found under the topic of stemming.
Simplistic stemming simply breaks the word apart, but more advanced algorithms associate words that mean the same thing - for instance Google might use stemming to convert "cat" and "kitten" to "feline" and search for all three, weighing the actual word provided by the user as slightly heavier so exact matches return before stemmed matches.
It's a known problem, and there are open source stemmers available.
-Adam
No, Full Text searches will not help to solve your problem.
I think you might want to take a look at some of the following links: (Funny, no one mentioned SoundEx till now)
SoundEx - MSDN
SoundEx - Google results
InformIT - Tolerant Search algorithms
Basically SoundEx allows you to evaluate the level of similarity in similar sounding words. The function is also available on SQL 2005.
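To show what such a phonetic key looks like, here is a small Python sketch of the commonly described American Soundex rules. It is an illustration only: the output of SQL Server's own SOUNDEX function may differ in edge cases, and the function name is an assumption.

```python
# Consonant groups used by American Soundex; vowels, h, w and y get no digit.
CODES = {}
for digit, letters in {
    "1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r",
}.items():
    for letter in letters:
        CODES[letter] = digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex key of a name ('' if it has no ASCII letters)."""
    letters = [ch for ch in name.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate two letters with the same code
        code = CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, so a repeated code counts again
        if len(result) == 4:
            break
    return result.ljust(4, "0")


for sample in ("Smith", "Smyth", "Robert", "Rupert", "Bill", "William"):
    print(sample, soundex(sample))
```

Smith and Smyth share a key, which is the kind of match Soundex is good at. Bill and William do not (the key always starts with the first letter), so a phonetic key on its own does not cover nickname variations like the one in the question.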
As a side issue, instead of returning similar results, it might prove more intuitive to the user to use an AJAX based script to deliver similar sounding names before the user initiates his/her search. That way you can show the user "similar names" or "did you mean..." kind of data.
Here's an idea for automatically finding "name synonyms" like Bill/William. That problem has been studied in the broader context of synonyms in general: inducing them from statistics of which words commonly appear in the same contexts in a large text corpus like the Web. You could try combining that approach with a list of names like Moby Names; I don't know if it's been done before.
Here are some pointers.