How to map subjective data in the semantic web? - semantics

I've been looking at the freebase project for storing data. It seems to be a great place to store concrete, objective data like names, locations and dates. Is it a good place to store subjective data like opinions or ratings? Is there another/better open data, semantic data store or strategy for storing and querying this kind of information?
Additionally, since it is subjective I can be sure that others will not agree with my opinion. How would I store the opinions of others inline so the crowd opinion could be represented better?
Is freebase the right place to store this type of data?
For example: a restaurant rating or a movie rating. The movie rating would probably be less time sensitive than the restaurant rating. Any non-identifying information about the person who entered the data would be interesting for determining other factors and relationships.

The Semantic Web is more or less a variant of first-order logic, for the most part, so the important part is to have a clear understanding of what each of your predicates means. This idea is very simple but applicable to a wide variety of meaning representations; for example, it is behind the entity model of databases.
There should be no problem representing the information you mentioned in a semantic web representation. Just be sure to have a clear definition of what each of your predicates denotes, so that the meaning doesn't shift over time and leave you with an inconsistent representation.
Genesereth's book is old but a good one if you are interested in reading about this in further detail. I think a lot of people who worked on the Semantic Web were involved in Douglas Lenat's Cyc project which gradually shifted to a logic-based meaning representation over time.
http://www.amazon.com/Logical-Foundations-Artificial-Intelligence-Genesereth/dp/0934613311
The site for Cyc:
http://www.cyc.com/

I find designing/selecting data formats is very hard without an understanding of the questions I will be asking using that data. What purpose do you expect the data to be used for? Come up with some use cases and that may guide your search.
Storing attributed data is an open research topic, with development in (among other places) the Intelligence community: these users obviously need to keep track of where information came from, and who has added to it along the way, both to verify its reliability and to do things like track whether Secret information has been included by accident. That may be a good place to look.

Data is data, what you want to do is label the data as what it is, an opinion or a rating. A "fact" I suppose which could be inferred from such data would be that most people had x subjective opinion about said topic.
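As a rough sketch of that labeling idea (not tied to Freebase or any particular triple store), each rating can be kept as an explicitly attributed opinion, and the "fact" about the crowd derived from the set. The Opinion fields, the 1-5 scale and the identifiers below are assumptions made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Opinion:
    """One person's rating of one subject, labeled as an opinion."""
    subject: str   # what is being rated (hypothetical identifier)
    rater: str     # opaque, non-identifying key for the rater
    rating: int    # hypothetical 1-5 scale
    year: int      # when the opinion was recorded


def crowd_opinion(opinions):
    """Derive one aggregate statement per subject from many opinions."""
    by_subject = defaultdict(list)
    for op in opinions:
        by_subject[op.subject].append(op.rating)
    return {s: {"count": len(r), "mean": mean(r)} for s, r in by_subject.items()}


opinions = [
    Opinion("movie:example", "rater-1", 4, 2009),
    Opinion("movie:example", "rater-2", 2, 2009),
    Opinion("restaurant:example", "rater-1", 5, 2010),
]
summary = crowd_opinion(opinions)
```

Keeping the year on each opinion leaves room to weight time-sensitive ratings (restaurants) differently from stable ones (movies) later on.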

from twitter:
jimpick #the_real_kevinw Each user and app/base has their own namespace, but I'd ask on the developers mailing list. A mashup might fit better.

Related

What business folks have to understand about database design

I have a business team asking me to set up a meeting to explain database design considerations to them. Since they do not know much about RDBMSs, I'm thinking of explaining the following:
What is RDBMS
What is a table and what are constraints / why we need them
What is a transaction and what are ACID Properties
Things to consider before/while developing a dbms
a. Decide how much detail you need and how much you may need in the future
b. Identify fields with unique values
c. Select the appropriate data types for your fields
d. Normalization and Index design
Also, most of the time this team's data comes in from flat files, which we need to load into the DB and present in the format they need. Can anybody suggest what else I could explain, or a better way to explain it? Their data is kind of all over the place. I just want to put more emphasis on thinking it through, because we couldn't set up a stable process to do the import. Any suggestions for me are welcome as well :)
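To make items b and c and the flat-file import concrete for such an audience, a small demo along these lines might help. It is only a sketch using Python's built-in sqlite3 and csv modules; the product table, its columns and the sample rows are invented for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file content; the second and third data rows are deliberately bad.
flat_file = io.StringIO(
    "sku,name,price\n"
    "A1,Widget,9.50\n"
    "A1,Duplicate widget,3.00\n"
    "B2,Gadget,-4.00\n"
)

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE product (
           sku   TEXT PRIMARY KEY,                 -- field with unique values
           name  TEXT NOT NULL,
           price REAL NOT NULL CHECK (price >= 0)  -- data type plus a business rule
       )"""
)

loaded, rejected = 0, []
for row in csv.DictReader(flat_file):
    try:
        conn.execute(
            "INSERT INTO product (sku, name, price) VALUES (?, ?, ?)",
            (row["sku"], row["name"], float(row["price"])),
        )
        loaded += 1
    except sqlite3.IntegrityError as exc:
        # The constraint, not the import script, decides what counts as bad data.
        rejected.append((row["sku"], str(exc)))
```

The list of rejected rows is something business people can read directly: it shows what the constraints protect them from.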
Appreciate your help!
You haven't said what your audience expects to take away from your presentation. So I'll have to guess, based on my dealings with business people in the past. Your mileage may vary.
Business people typically don't care about the skills and knowledge you put into doing a good job with database design, even when they say they do. They want to understand database design in terms of costs and benefits. That is how business people think.
So if you must cover some technical topic like indexing, do so from a cost benefit point of view. There is a cost to adding an index to a table, and there is a benefit to adding an index to a table. Figuring out in advance whether the benefit is worth the cost is the really tricky part, and they will be interested in this.
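If a live demonstration helps, the benefit side of an index can be shown with SQLite's query planner. This is a minimal sketch using Python's built-in sqlite3 module; the orders table and the index name are made up, and the exact wording of the plan text differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer, total) VALUES (?, ?)",
    [(f"customer-{i % 100}", float(i)) for i in range(1000)],
)


def plan(sql):
    """Return SQLite's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[-1] for row in rows)


query = "SELECT total FROM orders WHERE customer = 'customer-7'"
before = plan(query)   # without an index the whole table is scanned
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
after = plan(query)    # with the index the planner can look rows up directly
```

The cost side is that every insert and update on orders now has to maintain idx_orders_customer as well, and the index takes storage.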
On a larger scale, data is a business asset. There is a cost to managing that asset well, and there is a benefit to managing that asset well. If you can connect your talk to these two concepts, they will be interested.
If they are really good business people, they will have a good understanding of the subject matter that the database covers, provided it's a part of the enterprise data that affects their business. If you have a good ER model of the data in the database, this model will connect every value in every table to an attribute, and every attribute will describe some aspect of the subject matter. This is a very different use of an ER model than just using it as a preliminary to creating a relational model.
Technical people tend to think of ER modeling as "relational modeling light". It's really much deeper than that. It's an analytical handle on the question "what does the data really mean?" And this is a handle on "what is the data really worth?". And this is where the technical world meets the business world.
How about starting from the basics of CRUD operations, then moving on to normalization (with scenarios that show why normalization is needed) and the concept of keys in an RDBMS? Then you can talk about ER modeling.
Considering the fact that you are presenting to business folks, I think there would be 2 approaches best suited to your needs.
a) WHEN YOU HAVE LESS TIME:
Only cover topics which need minimal or no prior knowledge. Cover RDBMS & things to consider.
Keep it simple and easy to understand. Tell them how your solution works and why it is an effective one.
Cover only topics which are relevant and make it layman friendly. Provide them the pros & cons of your DB design. Connect it to business needs.
In all cases, provide contextual examples which they may relate to with ease.
b) WHEN YOU HAVE MORE TIME
You may cover topics in detail as suggested in the previous comments. (#SQL_Underworld & #Ramya)

Organizing interconnected objects

This is a generic question, I don't know if it belongs to Programming or StackOverflow.
I'm writing a little simulation. Without going very deep into its details, consider that many kinds of entities are involved. They correspond to objects, since I'm using an OOP language.
There are Guys that inhabit the world simulated
There are Maps
A map has many Lots, that are pieces of land with some characteristics
There are Tribes (guys belong to tribes)
There is a generic class called Position to locate the elements
There are Bots in control of tribes that move guys around
There is a World that represents the world simulated
and so on.
If the simulated world were laid out as a database, the objects would be tables with lots of references, but in memory I have to use a different strategy. So, for example, a Tribe has an array of Guys as a property. The World has an array of Bots, of Tribes, of Maps. A Map has a Dictionary whose key is a Position and whose value is a Lot. A Guy has a Position, which is where he stands.
The way I lay down such connections is pretty much arbitrary. For example, I could have an array of Guys in the World, or an Array of guys per Lot (the guys standing on a piece of land), or an array of Guys per Bot (with the Guys controlled by the bot).
Doing so, I also have to pass around a lot of objects. For example, a Bot must have information about the Map and the opponent Guys to decide how to move its own Guys.
As said, in a database I'd have a Guys table connected to the Lots table (indicating its position), to the Tribe table (indicating which Tribe it belongs to) and so it would also be easy to query "All the guys in Position [1, 5]". "All the Guys of Tribe 123". "All the Guys controlled by Bot B standing on the Lot b34 not belonging to the Tribe 456" and so on.
I've worked with APIs where to get the simplest information you had to make an instance of the CustomerContextCollection and pass it to CustomerQueryFactory to get back a CustomerInPlaceQuery to... When people criticize OOP and cite verbose abstractions that soon smell ridiculous, that's what I mean. I want to avoid such things, and avoid having to rely on deep abstractions and (anti-pattern) abstract contexts.
The question is: what is the preferred, clean way to manage entities and collections of entities that are deeply linked in multiple ways?
It depends on your definition of "clean". In my case, I define clean as: I can implement desired behavior in an obvious, efficient manner.
Building OOP software is not a data modeling exercise. I'd suggest stepping back a little. What does each one of those objects actually do? What methods are you going to implement?
Just because "guys are in a lot" doesn't mean that the lot object needs a collection of guys; it only needs one if there are operations on a lot that affect all the guys in it. And even then, it doesn't necessarily need a collection of guys - it needs a way to get the guys in the lot. This may be an internally stored collection, but it could also be a simple method that calls back into the world to find guys matching a criteria. The implementation of that lookup should be transparent to anyone.
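A minimal sketch of that idea, with invented class and method names: the Lot keeps no collection of its own and simply asks the World when someone needs the guys standing on it.

```python
class Guy:
    def __init__(self, name, position):
        self.name = name
        self.position = position   # an (x, y) tuple, for brevity


class World:
    def __init__(self):
        self.guys = []

    def guys_at(self, position):
        """Linear scan today; an index could replace it without callers noticing."""
        return [g for g in self.guys if g.position == position]


class Lot:
    def __init__(self, world, position):
        self._world = world
        self.position = position

    def guys(self):
        # No stored collection: the lot calls back into the world on demand.
        return self._world.guys_at(self.position)


world = World()
world.guys += [Guy("Ann", (1, 5)), Guy("Bob", (2, 2)), Guy("Cy", (1, 5))]
lot = Lot(world, (1, 5))
names = [g.name for g in lot.guys()]
```

Callers only ever see lot.guys(), so switching to a stored collection later would not change them.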
From the tenor of your questions, it seems like you're thinking of this from a "how do I generate reports" perspective. Step back and think of the behaviors you're trying to implement first.
Another thing I find extremely valuable is to differentiate between Entities and Values. Entities are objects where identity matters - you may have two guys, both named "Chris", but they are two different objects and remain distinct despite having the same "key". Values, on the other hand, act like ints. From your above list, Position sounds a lot like a value - Position(0,0) is Position(0,0) regardless of which chunk of memory (identity) those bits are stored in. The distinction has a big effect on how you compare and store values vs. entities. For example, your Guy objects (entities) would store their Position as a simple member variable.
I've found a great reference for how to think about such things is Eric Evans's "Domain-Driven Design" book. He's focused on business systems, but I've found the discussions very valuable for how you think about building OO systems in general.
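In Python terms, the distinction could look like the sketch below. The use of dataclasses is just one way to express it, and the field names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    """Value: compared by content, immutable, freely shareable."""
    x: int
    y: int


@dataclass(eq=False)
class Guy:
    """Entity: compared by identity, even if every field matches."""
    name: str
    position: Position


chris_a = Guy("Chris", Position(0, 0))
chris_b = Guy("Chris", Position(0, 0))

same_place = chris_a.position == chris_b.position   # equal values
same_guy = chris_a == chris_b                       # distinct entities
occupied = {chris_a.position, chris_b.position}     # values hash by content
```

Because Position is a value, it can serve directly as a dictionary key (for example in the Map's Position-to-Lot dictionary), while two Guys never collapse into one.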
I would say that no 'true' answer exists to your core question -- a best way to manage collections of entities that are linked in multiple ways. It really depends on the kind of application (simulation) - here are some thoughts:
Is execution time important?
If this is the case, there is really no way around analyzing in which way your simulator will iterate over (query) the objects from the pool: sketch out the basic simulation loop and check what kind of events will require to iterate over what kind of model entities (I assume you are developing a discrete-event simulation?). Then you should organize the data structures in a way that optimizes the most frequent/time-consuming events (as opposed to "laying down the connections arbitrarily"). Additionally, you may want to use special data structures (such as k-d trees) to organize entities with properties that you need to query often (e.g., position data). For some typical problems, e.g. collision detection, there is also a whole lot of approaches to solve them efficiently (so look for suitable libraries/frameworks, e.g. for multi-agent simulation).
How flexible do you want to make it?
If you really want to make it super-flexible and really don't want to decide on the hierarchy of the model entities, why not just use an in-memory database? As you already said, databases are easily applicable to your problem (and you can easily save the model state, which may also be useful).
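For instance, with Python's built-in sqlite3 module an in-memory database takes only a few lines. This is a sketch, and the table layout is an assumption loosely based on the entities in the question.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE guy (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        tribe INTEGER NOT NULL,
        x     INTEGER NOT NULL,
        y     INTEGER NOT NULL
    );
    CREATE INDEX guy_by_position ON guy (x, y);
    """
)
db.executemany(
    "INSERT INTO guy (name, tribe, x, y) VALUES (?, ?, ?, ?)",
    [("Ann", 123, 1, 5), ("Bob", 456, 1, 5), ("Cy", 123, 3, 3)],
)


def names(sql, params=()):
    """Run a query and return the first column of every row."""
    return [row[0] for row in db.execute(sql, params)]


# "All the guys in Position [1, 5]"
at_1_5 = names("SELECT name FROM guy WHERE x = ? AND y = ? ORDER BY name", (1, 5))
# "All the Guys of Tribe 123"
tribe_123 = names("SELECT name FROM guy WHERE tribe = ? ORDER BY name", (123,))
```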
How clean is clean enough?
If you want to be absolutely sure that the rest of your simulator is not affected by the design choices you make in regards of your model representation, hide it behind an interface (say, ModelWorld), which defines methods for all the types of queries your simulator may invoke (this is orthogonal to the second point and may help with the first point, i.e. figuring out what kind of access pattern your simulator exhibits). This allows you to change implementations easily, without affecting any other parts of the simulator code.
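A sketch of such an interface follows. ModelWorld is the name used above; the method names and the dictionary-backed IndexedWorld implementation are hypothetical.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class ModelWorld(ABC):
    """The queries the simulator may ask; method names are hypothetical."""

    @abstractmethod
    def add_guy(self, name, tribe, position): ...

    @abstractmethod
    def guys_at(self, position): ...

    @abstractmethod
    def guys_of_tribe(self, tribe): ...


class IndexedWorld(ModelWorld):
    """One possible implementation: dictionaries keyed by the queried property."""

    def __init__(self):
        self._by_position = defaultdict(list)
        self._by_tribe = defaultdict(list)

    def add_guy(self, name, tribe, position):
        self._by_position[position].append(name)
        self._by_tribe[tribe].append(name)

    def guys_at(self, position):
        return list(self._by_position.get(position, []))

    def guys_of_tribe(self, tribe):
        return list(self._by_tribe.get(tribe, []))


def count_at(world: ModelWorld, position):
    # Simulator code depends only on the interface, never on the storage.
    return len(world.guys_at(position))


world = IndexedWorld()
world.add_guy("Ann", 123, (1, 5))
world.add_guy("Bob", 456, (1, 5))
world.add_guy("Cy", 123, (3, 3))
```

Swapping IndexedWorld for a k-d tree or an in-memory database would only mean writing another subclass; count_at and the rest of the simulator stay as they are.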

OOD: order.fill(warehouse) -or- warehouse.fill(order)

which form is a correct OO design?
"Matter of taste" is a mediocre's easy way out.
Any good reads on the subject?
I want a conclusive proof one way or the other.
EDIT: I know which answer is correct (wink!). What I really want is to see any arguments in support of the former form (order.fill(warehouse)).
There is no conclusive proof and to a certain extent it is a matter of taste. OO is not science - it is art. It also depends on the domain, overall software structure, etc. and so your small example cannot be extrapolated to any OO problem.
However, here is my take based on your information:
Warehouses store things. They don't fill orders. Orders request things. They don't know which warehouse (or warehouses) the things come from. So a dependency in either direction between the two does not feel right.
In the real world, and the software, something would be a mediator between the two. #themel indicated the same in the comment to your question, though I prefer something less programming pattern sounding. Perhaps something like:
ShippingPlan plan = shippingPlanner.fill(order).from(warehouses).ship();
However, it is a matter of taste :-)
In its simplest form warehouse is an inventory storage place.
But it would also be correct to view a warehouse as a facility comprising storage space, personnel, shipping docks, etc. If you take that view of a warehouse, then it is appropriate to say that a warehouse (as a facility) can be charged with filling orders, or in expanded form:
a warehouse facility is capable of assembling a shipment according to a given specification (an order)
The above is a justification (if not a proof) for the warehouse.fill(order) form. Notice that this form is substantially equivalent to SingleShot's and themel's suggestions. The trick is to consolidate shippingPlanner (an order fulfillment authority) and a warehouse (an inventory storage space). Simply put, in my example a warehouse is a composition of an order fulfillment authority and an inventory storage space, whereas in SingleShot's those two are presented separately. This means that if such consolidation is (or becomes) unacceptable (for example due to the complexity of the parts), the warehouse can be decomposed into these two subcomponents.
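The composition described here could be sketched as follows. The class names Inventory and FulfillmentPlanner and the dictionary-shaped order are assumptions for illustration, not anything from the question.

```python
class Inventory:
    """Storage only: knows what is on hand."""

    def __init__(self, stock):
        self._stock = dict(stock)

    def take(self, item, quantity):
        available = self._stock.get(item, 0)
        taken = min(available, quantity)
        self._stock[item] = available - taken
        return taken


class FulfillmentPlanner:
    """Order fulfillment authority: decides what to pull for an order."""

    def plan(self, order, inventory):
        return {item: inventory.take(item, qty) for item, qty in order.items()}


class Warehouse:
    """Facility = planner + inventory; exposes warehouse.fill(order)."""

    def __init__(self, stock):
        self._inventory = Inventory(stock)
        self._planner = FulfillmentPlanner()

    def fill(self, order):
        return self._planner.plan(order, self._inventory)


warehouse = Warehouse({"widget": 5, "gadget": 1})
shipment = warehouse.fill({"widget": 2, "gadget": 3})
```

If the consolidation stops being acceptable, FulfillmentPlanner and Inventory are already separate pieces and can be exposed on their own.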
I can not come up with a justification for assigning fill operation to an order object.
hello? warehouse? yes, please take this order and fill it. thank you. -- that I can understand.
hey, order! the warehouse is over there. do your thing and fulfill yourself. -- makes no sense to me.

what are "Meta-Data design principles"?

I'm looking at a job description that I'm considering applying for, and one of the requirements listed is "Familiar with Meta-Data design principles".
Can someone give a brief explanation? I'm probably familiar with the concept, but I've never heard that terminology before.
I did Google to find more info, but didn't get good results. Except for this white paper titled Metadata Principles and Practicalities. It was a little heavy, and I was hoping to find a quick explanation.
Additional Note: Thanks for all the answers so far. They've been very good. I wanted to clarify that I'm familiar with what metadata is, but I've just never heard of "metadata design principles". What sort of design principles are there for metadata? Is this a large enough topic for a book? For a pamphlet? As Robert Harvey points out, it sounds like a nebulous term invented by someone in HR.
I'll bet it means "design principles include being driven by meta-data".
There aren't many design principles for meta-data -- it's usually given by your tools.
However, some organizations want to use meta-data as a key part of application software specification, construction and operation.
If they want someone whose design principles include using meta-data heavily, then it might come out as a phrase like "meta-data design principles".
But, before I said anything, I'd ask them what they think they meant by this.
Essentially, that would be the design of data about data; that is, characterizing data with additional data. Metadata is data about data: where the data is the orders that you get for a given item, the metadata about it can be things like how MANY orders you got, etc. Proper metadata design involves understanding what types of information are likely to be useful and interesting about whatever data you're analyzing, and recognizing how best to track and capture them.
For example, the number of sales of a given book in a particular day may be useful; not necessarily so the number of sales of the same book in a given minute. Likewise, the number of sales in a given year may be less useful than sales by month, etc. In this example, it's granularity, but metadata design can involve many other things; perhaps geographic distribution of sales is important, as another example.
The phrase, "Familiar with metadata design principles," sounds suspiciously like one of those nebulous phrases invented by an HR department that has no clue what they are talking about. However, I'll take a stab at it.
Metadata is data that enhances other data by describing the properties or characteristics of that other data.
Examples:
In the following tag:
<a href="...">Link to Google</a>
the href descriptor is metadata because it "decorates," or further describes, the link. It is a property of the link. In general all HTML attributes are metadata.
A C# attribute is metadata. Microsoft calls attributes "a way to associate declarative information with a class."
[System.Serializable]
public class SampleClass
{
// Objects of this type can be serialized.
}
In a database table, the value contained in the Address field of a record:
12345 Main Street
is just data, but the field's definition in the database:
Type: Text
Length: 50
is metadata.
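The same split can be seen from code. Below is a small sketch with Python's built-in sqlite3 module; the contact table is invented, and VARCHAR(50) stands in for the "Text, length 50" definition above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contact (id INTEGER PRIMARY KEY, address VARCHAR(50))")
conn.execute("INSERT INTO contact (address) VALUES ('12345 Main Street')")

# Data: the stored value.
data = conn.execute("SELECT address FROM contact").fetchone()[0]

# Metadata: the column definitions, as reported by SQLite itself.
# Each row is (cid, name, type, notnull, dflt_value, pk).
metadata = {row[1]: row[2] for row in conn.execute("PRAGMA table_info(contact)")}
```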
In an MP3 file, the audio is just data, but the MP3 tags such as Author, Title, and Bitrate are metadata.
XML is data, XSD is metadata. XSD can be used to express a set of rules to which an XML document must conform in order to be considered 'valid'.
The number of sales of a particular book in a given period is not metadata for the book, because it does not further describe the book itself, only its sales. However, the Author, Title, and number of pages of a book are metadata for that book (as is the ISBN).
There. Now you know all about "Metadata Design Principles."
Here is an excerpt from "Applying UML and Patterns" by C. Larman:
Reflective or Meta-Level Designs
An example of this approach is using the java.beans.Introspector to obtain a BeanInfo object, asking for the getter Method object for bean property X, and calling Method.invoke. The system is protected from the impact of logic or external code variations by reflective algorithms that use introspection and meta-language services. It may be considered a special case of data-driven designs.
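A rough Python analogue of that reflective approach (not the java.beans API itself) is sketched below: the code discovers an object's properties at run time and calls each getter by name. The Customer class is made up.

```python
class Customer:
    def __init__(self, name, city):
        self._name = name
        self._city = city

    @property
    def name(self):
        return self._name

    @property
    def city(self):
        return self._city


def readable_properties(obj):
    """Discover property names at run time instead of hard-coding them."""
    cls = type(obj)
    return sorted(
        name for name, attr in vars(cls).items() if isinstance(attr, property)
    )


def to_record(obj):
    # Look up each getter by name and invoke it reflectively.
    return {name: getattr(obj, name) for name in readable_properties(obj)}


record = to_record(Customer("Ada", "London"))
```

Adding a property to Customer changes what to_record returns without any edit to to_record, which is the kind of protection from variation the excerpt describes.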

What are the principles behind, and benefits of, the "party model"?

The "party model" is a "pattern" for relational database design. At least part of it involves finding commonality between many entities, such as Customer, Employee, Partner, etc., and factoring that into some more "abstract" database tables.
I'd like to find out your thoughts on the following:
What are the core principles and motivating forces behind the party model?
What does it prescribe you do to your data model? (My bit above is pretty high level and quite possibly incorrect in some ways. I've been on a project that used it, but I was working with a separate team focused on other issues).
What has your experience led you to feel about it? Did you use it, and if so, would you do so again? What were the pros and cons?
Did the party model limit your choice of ORMs? For example, did you have to eliminate certain ORMs because they didn't allow for enough of an "abstraction layer" between your domain objects and your physical data model?
I'm sure every response won't address every one of those questions ... but anything touching on one or more of them is going to help me make some decisions I'm facing.
Thanks.
What are the core principles and motivating forces behind the party model?
To the extent that I've used it, it's mostly about code reuse and flexibility. We've used it before in the guest / user / admin model and it certainly proves its value when you need to move a user from one group to another. Extend this to having organizations and companies represented with users under them, and it's really providing a form of abstraction that isn't particularly inherent in SQL.
What does it prescribe you do to your data model? (My bit above is pretty high level and quite possibly incorrect in some ways. I've been on a project that used it, but I was working with a separate team focused on other issues).
You're pretty correct in your bit above, though it needs some more detail. You can imagine a situation where an entity in the database (call it a Party) contracts out to another Party, which may in turn subcontract work out. A party might be an Employee, a Contractor, or a Company, all subclasses of Party. From my understanding, you would have a Party table and then more specific tables for each subclass, which could then be further subclassed (Party -> Person -> Contractor).
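As a sketch of that table layout (one supertype table plus one table per subclass), using Python's built-in sqlite3 module. All table names, column names and sample rows are assumptions made up for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE party (
        party_id INTEGER PRIMARY KEY,
        name     TEXT NOT NULL
    );
    CREATE TABLE person (
        party_id   INTEGER PRIMARY KEY REFERENCES party (party_id),
        birth_year INTEGER
    );
    CREATE TABLE company (
        party_id   INTEGER PRIMARY KEY REFERENCES party (party_id),
        tax_number TEXT
    );
    -- Any party may contract out to any other party.
    CREATE TABLE contract (
        client_id     INTEGER NOT NULL REFERENCES party (party_id),
        contractor_id INTEGER NOT NULL REFERENCES party (party_id)
    );

    INSERT INTO party VALUES (1, 'Acme Ltd'), (2, 'Jo Smith'), (3, 'Subco Inc');
    INSERT INTO company VALUES (1, 'TX-1'), (3, 'TX-3');
    INSERT INTO person VALUES (2, 1980);
    INSERT INTO contract VALUES (1, 2), (2, 3);
    """
)

# One query spans every kind of party.
all_names = [r[0] for r in db.execute("SELECT name FROM party ORDER BY party_id")]

# A type-specific query needs a join back to the supertype.
people = db.execute(
    "SELECT p.name, pe.birth_year FROM person pe JOIN party p USING (party_id)"
).fetchall()
```

The contract table shows the flexibility: a company hiring a person and a person subcontracting to a company are both just rows pointing at party.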
What has your experience led you to feel about it? Did you use it, and if so, would you do so again? What were the pros and cons?
It has its benefits if you need the flexibility to add new types to your system and to create relationships between types that you didn't expect and architect in at the beginning (users moving to a new level, companies hiring other companies, etc). It also gives you the benefit of running a single query and retrieving data for multiple types of parties (Companies, Employees, Contractors). On the flip side, you're adding additional layers of abstraction to get to the data you actually need and are increasing load (or at least the number of joins) on the database when you're querying for a specific type. If your abstraction goes too far, you'll likely need to run multiple queries to retrieve the data, as the complexity would start to become detrimental to readability and database load.
Did the party model limit your choice of ORMs? For example, did you have to eliminate certain ORMs because they didn't allow for enough of an "abstraction layer" between your domain objects and your physical data model?
This is an area that I'm admittedly a bit weak in, but I've found that using views and a mirrored abstraction in the application layer has kept this from being much of a problem. The real problem for me has always been "where does piece of data X live?" when I want to read the data source directly (it's not always intuitive for new developers on the system either).
The idea behind the party models (aka entity schema) is to define a database that leverages some of the scalability benefits of schema-free databases. The party model does that by defining its entities as party type records, as opposed to one table per entity. The result is an extremely normalized database with very few tables and very little knowledge about the semantic meaning of the data it stores. All that knowledge is pushed to the data access in code. Database upgrades using the party model are minimal to none, since the schema never changes. It’s essentially a glorified key-value pair data model structure with some fancy names and a couple of extra attributes.
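To make the "glorified key-value pair" description concrete, here is a minimal sketch of such a schema with Python's built-in sqlite3 module. The two-table layout and all names are assumptions, and a real entity schema would carry more than this.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE entity (
        entity_id   INTEGER PRIMARY KEY,
        entity_type TEXT NOT NULL
    );
    CREATE TABLE attribute (
        entity_id INTEGER NOT NULL REFERENCES entity (entity_id),
        attr      TEXT NOT NULL,
        val       TEXT,
        PRIMARY KEY (entity_id, attr)
    );
    """
)


def save(entity_type, **fields):
    """Store a record of any shape without changing the schema."""
    cur = db.execute("INSERT INTO entity (entity_type) VALUES (?)", (entity_type,))
    db.executemany(
        "INSERT INTO attribute (entity_id, attr, val) VALUES (?, ?, ?)",
        [(cur.lastrowid, k, str(v)) for k, v in fields.items()],
    )
    return cur.lastrowid


def load(entity_id):
    # The meaning of the keys lives in application code, not in the database.
    rows = db.execute(
        "SELECT attr, val FROM attribute WHERE entity_id = ?", (entity_id,)
    )
    return dict(rows)


customer = save("customer", name="Ada", city="London")
invoice = save("invoice", total=120, currency="EUR")
```

Note that total comes back as the string "120": typing, validation and relationships all become the application's job.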
Pros:
Kick-ass horizontal scalability. Once your 5-6 tables are defined in your entity model, you can go to the beach and sip margaritas. You can virtually scale this database out as much as you want with minimum efforts.
The database supports any data structure you throw at it. You can also change data structures and party/entities definitions on the fly without affecting your application. This is very very powerful.
You can model any arbitrary data entity by adding records, not changing the schema. Meaning you can say goodbye to schema migration scripts.
This is programmers’ paradise, since the code they write will define the actual entities they use in code, and there are no mappings from Objects to Tables or anything like that. You can think of the Party table as the base object of your framework of choice (System.Object for .NET)
Cons:
Party/Entity models never play well with ORMs, so forget about using EF or NHibernate to get semantically meaningful entities out of your entity database.
Lots of joins. Performance tuning challenges. This ‘con’ is relative to the practices you use to define your entities, but it is safe to say that you’ll be doing a lot more of those mind-bending queries that will give you nightmares.
Harder to consume. Developers and DB pros unfamiliar with your business will have a harder time getting used to the entities exposed by these models. Since everything is abstract, there is no diagram or visualization you can build on top of your database to explain what is stored to someone else.
Heavy data access models or business rules engines will be needed. Basically you have to do the work of understanding what the heck you want out of your database at some point, and your database model is not going to help you this time around.
If you are considering a party or entity schema in a relational database, you should probably take a look at other solutions like a NoSql data store, BigTable or KV Stores. There are some great products out there with massive deployments and traction such as MongoDB, DynamoDB, and Cassandra that pioneered this movement.
This is a vast topic, I would recommend reading The Data Model Resource Book Volume 3 - Universal Patterns for Data Modeling by Len Silverston and Paul Agnew.
I've just received my copy and it's pretty good - it provides you with an overview of many approaches to data modeling, including hybrid contextual role patterns and so on. It has detailed pros and cons for every approach.
There is a plethora of ways to model party relationships and roles, all with their benefits and disadvantages. The answer that was accepted covers just one instance of a 'party model'.
For instance, in many approaches, notions like "Employee", "Project Manager" etc. are roles that a party can play within a certain context. I will try to give you a better breakdown once I get home.
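As a rough illustration of the role idea (not taken from the book), a role can be modeled as a link between a party and a context rather than as a type of party. All names below are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartyRole:
    party: str     # who plays the role
    role: str      # e.g. "Employee", "Project Manager"
    context: str   # where the role applies, e.g. a company or a project


roles = [
    PartyRole("Jo Smith", "Employee", "Acme Ltd"),
    PartyRole("Jo Smith", "Project Manager", "Project X"),
    PartyRole("Subco Inc", "Contractor", "Project X"),
]


def roles_of(party):
    """All roles one party plays, across contexts."""
    return sorted(r.role for r in roles if r.party == party)


def parties_in(context):
    """All parties involved in one context, whatever their role."""
    return sorted(r.party for r in roles if r.context == context)
```

The same party can hold several roles at once, and gaining or losing a role is a row change rather than a change of type.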
When I was part of a team implementing these ideas in the early 1980's, it did not limit our choice of ORM's because those hadn't been invented yet.
I'd fall back on those ideas any time, as that particular project was one of the most convincing proofs-of-concept I have ever seen of a "revolutionary" idea (which it certainly was at the time).
It forces you to nothing. And it doesn't stop you from anything (from any mistake, I mean). The one defining your own information model is you.
All parties have lots of properties in common. The fact that they have a name and such (we called those "signaletics"). The fact that they have principal/primary locations called "addresses". The fact that they all are involved, in some sense, in the business' contracts.
Put simply, as I understand it: party modeling gives you flexibility but takes more effort to implement (T-SQL joins and so on).
I also want to point out that using party modeling (specialization/generalization) lets you have foreign-key relations to other tables. For example, think of different types of users (admin, user, ...) generalized into a User table; you can then have a UserID column in your Authorization table.
I'm not sure, but the party model sounds like a particular case of the generalization-specialization pattern. A search on "generalization specialization relational modeling" finds some interesting articles.