How to implement weighted data property in protege 4 - properties

I am implementing an ontology to check for semantic similarity between individuals of different classes of animals. Say a cow is exactly similar to a cow and nearly similar to a buffalo or bull, but a cow is not similar to a dog. I have different data properties with which I want to associate a weight or degree. E.g., for Cow, the isDomestic property has weight 90 (i.e., in 90% of cases a cow would be kept as a domestic animal), but isSecurity has weight 0, i.e., in no case is a cow kept for security. For Dog, isDomestic is again 90, but isSecurity has a value of, say, 70 or more. That is how a cow and a dog would not be similar.
I have come across RDF reification, but I need to implement it in Protege and then use the OWL API to reason and check for semantic similarity of individuals.
Any hint or guidance will be appreciated. Thanks in advance.

Related

Can RDF/SPARQL be used for sub-graph matching?

I would like to build a knowledge graph of a set of instances, where each instance is itself a collection of ordered sub-instances. As a simple example, let's assume my instances are chains of marbles {CHAIN1, CHAIN2, CHAIN3, ...} and the sub-instances are colored marbles {CHAIN1: YELLOW-RED-BLUE-RED; CHAIN2: BLUE-YELLOW-GREEN; CHAIN3: GREEN-RED-BLUE-RED}.
Just to clarify, an incorrect approach would define CHAIN1 something like this:
:CHAIN1 :has_marble :YELLOW, :RED, :BLUE, :RED
but querying this would clearly only yield a "bag of marbles" situation.
I would like to be able to:
Query the knowledge graph such that I can get back the marbles for each chain in the correct order.
Match sequences of marbles between different chains. For example, I might want to get all the chains that have the sequence :RED-:BLUE-:RED as a sub-sequence (i.e., CHAIN1 and CHAIN3).
Questions:
What would be the best way of building this knowledge graph? Should I store the marbles as RDF sequences using rdf:first/rdf:rest? Or is there a better, more flexible option? If possible, I would like to be able to define the type of relation between the marbles, say :RED :is_followed_by :BLUE.
Is the type of graph matching I'm after possible? And how about if I'd like to match the sequences using some properties that describe each marble? Say, :BLUE :has_shape :SQUARE, and match the sequence of marbles by their shape?
Note: What I really want to model are chains of DNA and protein sequences, so if anyone has specific recommendations for such applications, that would be even more helpful.
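One possible modelling, sketched in plain Python rather than in a triple store: give every marble occurrence its own node, link the nodes with :is_followed_by, and attach the color as a property of the node. The node names (CHAIN1_m0, ...) and the first_marble and has_color properties are assumptions made up for this illustration. With this shape, a fixed-length sequence such as RED-BLUE-RED becomes a chain of :is_followed_by links plus a color constraint on each node, which is the kind of pattern a SPARQL basic graph pattern can express.

```python
# Hypothetical modelling: each marble *occurrence* is its own node, so the same
# color can appear twice in a chain without collapsing into a "bag of marbles".
CHAINS = {
    "CHAIN1": ["YELLOW", "RED", "BLUE", "RED"],
    "CHAIN2": ["BLUE", "YELLOW", "GREEN"],
    "CHAIN3": ["GREEN", "RED", "BLUE", "RED"],
}


def build_triples(chains):
    """Turn ordered color lists into (subject, predicate, object) triples."""
    triples = set()
    for chain, colors in chains.items():
        nodes = [f"{chain}_m{i}" for i in range(len(colors))]
        triples.add((chain, "first_marble", nodes[0]))
        for node, color in zip(nodes, colors):
            triples.add((node, "has_color", color))
        for current, following in zip(nodes, nodes[1:]):
            triples.add((current, "is_followed_by", following))
    return triples


def ordered_colors(triples, chain):
    """Walk the is_followed_by links to recover the marbles in order."""
    first = {s: o for s, p, o in triples if p == "first_marble"}
    following = {s: o for s, p, o in triples if p == "is_followed_by"}
    color = {s: o for s, p, o in triples if p == "has_color"}
    node, result = first[chain], []
    while node is not None:
        result.append(color[node])
        node = following.get(node)
    return result


def chains_containing(triples, pattern):
    """Return the chains that contain `pattern` as a contiguous sub-sequence."""
    hits = []
    for chain in sorted(s for s, p, o in triples if p == "first_marble"):
        seq = ordered_colors(triples, chain)
        windows = range(len(seq) - len(pattern) + 1)
        if any(seq[i:i + len(pattern)] == pattern for i in windows):
            hits.append(chain)
    return hits


triples = build_triples(CHAINS)
matches = chains_containing(triples, ["RED", "BLUE", "RED"])
```

Matching on another property (say, shape) would only need a different lookup in place of has_color, because the order lives in the links between occurrence nodes and not in the colors themselves.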

Understanding Stratified sampling in numpy

I am currently working through an exercise book on machine learning to get my feet wet, so to speak, in the discipline. Right now I am working on a real estate data set: each instance is a district of California and has several attributes, including the district's median income, which has been scaled and capped at 15. The median income histogram reveals that most median income values are clustered around 2 to 5, but some values go far beyond 6. The author wants to use stratified sampling, basing the strata on the median income value. He offers the following piece of code to create an income category attribute.
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)
He explains that he divides the median_income by 1.5 to limit the number of categories and that he then keeps only those categories lower than 5 and merges all other categories into category 5.
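To make the arithmetic concrete, here is a minimal sketch in plain Python (not the author's code) that applies the same two steps to a handful of made-up income values. The function name income_category and the sample values are assumptions for illustration only. Dividing by 1.5 and rounding up means category k covers incomes in the interval (1.5(k-1), 1.5k], and the cap folds everything above 6 into category 5. With incomes capped at 15, the uncapped version could produce up to 10 categories.

```python
import math


def income_category(median_income, divisor=1.5, cap=5):
    """Mirror the two pandas lines: ceil(income / divisor), then cap at `cap`."""
    return min(math.ceil(median_income / divisor), cap)


# Hypothetical median-income values, chosen to hit every category.
samples = [0.5, 1.5, 2.0, 3.1, 4.5, 6.0, 7.5, 15.0]
categories = [income_category(x) for x in samples]

for income, category in zip(samples, categories):
    print(f"median_income={income:5.1f} -> income_cat={category}")
```

The pandas version stores the categories as floats (1.0 to 5.0) while this sketch returns integers; the bin boundaries are the same.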
What I don't understand is
Why is it mathematically sound to divide the median_income of each instance to create the strata? What exactly does the result of this division mean? Are there other ways to calculate/limit the number of strata?
How does the division restrict the number of categories and why did he choose 1.5 as the divisor instead of a different value? How did he know which value to pick?
Why does he only want 5 categories and how did he know beforehand that there would be at least 5 categories?
Any help understanding these decisions would be greatly appreciated.
I'm also not sure whether Stack Overflow is the right place to post this question, so if I made a mistake by doing so, please let me know which forum would be more appropriate.
Thank you!
You are probably in the best position to analyze this further, since you know your data set. But I can help you understand stratified sampling, so that you get the idea.
STRATIFIED SAMPLING: Suppose you have a data set of consumers who eat different fruits. One feature is 'fruit type', and this feature has 10 different categories (apple, orange, grapes, etc.). If you simply sample at random from the data set, there is a possibility that the sample will not cover all the categories, which is very bad when you train on the data. To avoid this scenario, we use a method called stratified sampling: each category (stratum) is sampled separately, typically in proportion to its share of the data, so that no category is left out and no useful data is missed.
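A minimal sketch of the idea in plain Python, assuming proportional allocation (every category contributes the same fraction of its members, with at least one member each). The function stratified_sample and the fruit counts are made up for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, fraction, seed=0):
    """Sample `fraction` of each stratum, keeping at least one item per stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    sample = []
    for label in sorted(strata):
        members = strata[label]
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical data: 80 apple eaters, 15 orange eaters, 5 grape eaters.
consumers = (
    [("apple", i) for i in range(80)]
    + [("orange", i) for i in range(15)]
    + [("grape", i) for i in range(5)]
)
sample = stratified_sample(consumers, key=lambda c: c[0], fraction=0.2)

counts = defaultdict(int)
for fruit, _ in sample:
    counts[fruit] += 1
```

A plain random draw of 20 consumers from this data could easily contain no grape eaters at all; the stratified draw always contains one.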
Please let me know if you still have any questions, I would be very happy to help you.

How to represent multichannel event sequences

I'm trying to use TraMineR but am open to feedback/references/links to more info as to how to represent multi-channel or hierarchical event sequences and algorithms that deal with it.
I have a complex event structure that I'm trying to figure out how to represent as a sequence. There are different types of events. Each event type may have a different set of fields (and different numbers of fields). For instance, age might be a field in one event type whereas height might be a field in another event type. My first instinct (and I believe a common approach) was to “flatten” everything, e.g. every possible combination of values for an event constitutes a unique event type. However, this may miss patterns in the generic event types.
For example, let's say I'm a dog breeder and drink a lot of coffee and I want to see if there are patterns in my coffee/dog buying habits (yes, silly example). I might have events like:
- Bought dog
  - Breed: hound
  - Sex: female
- Bought coffee
  - Store: Starbucks
  - Roast: dark
- Bought dog
  - Breed: hound
  - Sex: female
- Bought coffee
  - Store: Starbucks
  - Roast: light
- Bought dog
  - Breed: Doberman pinscher
  - Sex: male
To flatten the data I may say that every unique combination of store and roast is a unique coffee buying event. Also, every unique combination of breed and sex is a unique dog buying event. This approach would turn the example above into 4 different event types (rather than 2 event types with fields), since the female hound purchase occurs twice. This representation could detect patterns such as the following: if I drink 2 dark roast coffees from Starbucks, then I am more likely to buy a male Doberman pinscher.
However, this representation may miss more general patterns that don't depend on field values in the events. For instance, it may be the case that I simply buy a dog after having two coffees in general.
I'd like to be able to detect patterns at both "levels" and am unsure of how to represent the events to do so. Of course one approach would be to use both representations and then just combine the results of the two.
So, questions are:
1. Any links/citations to papers that deal with this?
2. Is this a common issue?
3. Any recommendations on how to represent these events?
4. Any recommendations on how to work with them in TraMineR?
5. Any recommendations / links / references to algorithms that deal with this sort of thing?
6. Any ideas at all?
Thanks!!!
This is actually similar to the question asked here (although they did not know to reference "multi-channel" and the title was vague): Multiple events in traminer
TraMineR has support for dealing with multichannel sequences with functions like:
seqdistmc
The general approach, I believe, is to do exactly what I outlined as the "flatten" solution: you combine the values for each channel into one event type. For example, dog.hound.female would be one event with a single channel/field, replacing the first event in my example, which has 3 separate fields/channels. You then use the usual functions for finding distances, subsequences, etc. You also have options for setting up substitution costs and computing distances, so there is some extra flexibility for this multi-channel approach. It also deals with missing values in case you have channels of different lengths or with gaps.
This is also similar to what's suggested in the answer to the topic linked above, using the native R function interaction.
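As a rough illustration of the two "levels", here is a minimal sketch in plain Python (not TraMineR and not R's interaction) that derives both a generic sequence and a flattened sequence from the same events. The dictionary layout and the helper names generic_label and flattened_label are assumptions made up for this example.

```python
# The example events from the question, with the event type kept separate
# from its fields. "Doberman" is shortened to keep the labels compact.
events = [
    {"type": "dog", "breed": "hound", "sex": "female"},
    {"type": "coffee", "store": "Starbucks", "roast": "dark"},
    {"type": "dog", "breed": "hound", "sex": "female"},
    {"type": "coffee", "store": "Starbucks", "roast": "light"},
    {"type": "dog", "breed": "Doberman", "sex": "male"},
]


def generic_label(event):
    """Coarse level: only the event type, ignoring all field values."""
    return event["type"]


def flattened_label(event):
    """Fine level: type plus every field value, joined like dog.hound.female."""
    fields = [str(value) for name, value in event.items() if name != "type"]
    return ".".join([event["type"], *fields])


generic_seq = [generic_label(e) for e in events]
flat_seq = [flattened_label(e) for e in events]

print(generic_seq)
print(flat_seq)
```

Keeping both sequences side by side is one way to look for patterns at each level separately, at the cost of running the analysis twice.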

How to *really* write UML cardinalities?

I would like to know once and for all how to write UML cardinalities, since I have very often had to debate them (so proofs and sources are very welcome :)
If I want to explain that a Mother can have several Children but a Child has one and only one Mother, should I write:
Mother * ---------- 1 Child
Or
Mother 1 ---------- * Child
?
The second one:
Mother 1 ----------------- 1..* Child
You will find many examples in the UML specification, in all the figures related to the Abstract Syntax...
Of course Red Beard is right, the correct answer is the second one.
As a tip for remembering this, I advise thinking in English: you say "A child has ONE mother", and in this sentence, as in UML, ONE is written next to Mother. Fairly simple.
Many people have this question when they start using UML, especially when they come from another notation where the names are always read clockwise, regardless of which end of the line they're on. That's really confusing!
Red Beard is correct. Although the UML spec does not explicitly state where association-end information (i.e., name and multiplicity) is written, it implies it in several places. For example, Figures 7.11 (showing attributes) and 7.12 (showing unidirectional associations with association ends next to the arrowheads) are equivalent property notations; thus, the multiplicity does indeed go next to the property's type.
One way I learned to remember which end has which multiplicity is to imagine a unidirectional graph of instances and write the number next to the arrowheads that point at the target.
BTW, you should use descriptive association end names. These often turn into attribute names in Java, element names in XSD, and so on. For example, in Java, the Mother class might have a "children" attribute of type "Set<Child>". If you don't name them, you'll often get undesirable default names.
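A minimal sketch of how the named association ends could map to attributes, written in Python rather than Java. The class layout and the add_child helper are assumptions for illustration; the point is only that the "*" end becomes a collection attribute on Mother and the "1" end becomes a single required reference on Child.

```python
class Mother:
    def __init__(self, name):
        self.name = name
        # The "*" end of the association: zero or more Child objects.
        self.children = []

    def add_child(self, name):
        """Create a Child that points back at this Mother and record it."""
        child = Child(name, mother=self)
        self.children.append(child)
        return child


class Child:
    def __init__(self, name, mother):
        self.name = name
        # The "1" end of the association: exactly one Mother, always required.
        self.mother = mother


ann = Mother("Ann")
bob = ann.add_child("Bob")
eve = ann.add_child("Eve")
```

Reading the attributes back mirrors the diagram: ann.children holds many objects, while bob.mother holds exactly one.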

Semantic web : Degree of relationship between 2 entities in a single ontology

Suppose that we have concepts in an ontology like: grand_mother, mother, and son.
grand_mother concept has some entities like: Mrs. Brown, Mrs. Linda...
mother concept has some entities like: Mrs. Jennifer, Mrs. King..
son concept has: Mike, Bill..
The degree of relationship between entities that belong to two neighboring concepts (like mother and son, or grand_mother and mother) is 0.6. The degree between entities that belong to two distant concepts (like son and grand_mother) is 0.6 * 0.6.
The user can type some keywords into the search box, and I must measure the degree of relationship between them.
For example, 1st keyword is Mrs. Brown and 2nd is Mike.
I have no idea how to do it. (I could use reasoners, but I don't know how to measure the degree with them.) Are there technologies for this?
Can anyone help me?
Thanks in advance.
I think the keyword that you're looking for is weighted graphs, and a shortest path algorithm, e.g. http://en.wikipedia.org/wiki/Dijkstra's_algorithm
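A minimal sketch of that suggestion, assuming the concepts form a graph whose edges carry the 0.6 degree and that the degree along a path is the product of its edge degrees. Because Dijkstra's algorithm adds costs, each degree d is turned into the cost -log(d); the smallest total cost then corresponds to the largest product. The function name relationship_degree and the concept_of lookup table are made up for this example.

```python
import heapq
import math


def relationship_degree(edges, source, target):
    """Largest product of edge degrees over any path from source to target."""
    graph = {}
    for a, b, degree in edges:
        cost = -math.log(degree)  # degrees in (0, 1] give non-negative costs
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))

    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            return math.exp(-cost)
        if cost > best.get(node, math.inf):
            continue  # stale queue entry
        for neighbor, step in graph.get(node, []):
            new_cost = cost + step
            if new_cost < best.get(neighbor, math.inf):
                best[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor))
    return 0.0  # no path: treat the entities as unrelated


# Concept-level graph from the question; neighboring concepts have degree 0.6.
concept_edges = [("grand_mother", "mother", 0.6), ("mother", "son", 0.6)]

# Hypothetical lookup from a typed keyword to the concept it belongs to.
concept_of = {"Mrs. Brown": "grand_mother", "Mrs. Jennifer": "mother", "Mike": "son"}

degree = relationship_degree(
    concept_edges, concept_of["Mrs. Brown"], concept_of["Mike"]
)
print(f"degree(Mrs. Brown, Mike) = {degree:.2f}")
```

Two keywords that map to the same concept get degree 1.0 in this sketch, and keywords whose concepts are not connected get 0.0; both are conventions chosen here, not something the question specifies.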