How to represent multichannel event sequences

I'm trying to use TraMineR, but I'm open to feedback, references, or links to more info on how to represent multi-channel or hierarchical event sequences, and on algorithms that deal with them.
I have a complex event structure that I'm trying to figure out how to represent as a sequence. There are different types of events, and each event type may have a different set of fields (and different numbers of fields). For instance, age might be a field in one event type whereas height might be a field in another. My first instinct (and I believe a common approach) was to "flatten" everything, i.e. treat every possible combination of field values as a unique event type. However, this may miss patterns in the generic event types.
For example, let's say I'm a dog breeder and drink a lot of coffee and I want to see if there are patterns in my coffee/dog buying habits (yes, silly example). I might have events like:
- Bought dog
- Breed: hound
- Sex: female
- Bought coffee
- Store: Starbucks
- Roast: dark
- Bought dog
- Breed: hound
- Sex: female
- Bought coffee
- Store: Starbucks
- Roast: light
- Bought dog
- Breed: Doberman Pinscher
- Sex: male
To flatten the data, I may say that every unique combination of store and roast is a unique coffee-buying event, and every unique combination of breed and sex is a unique dog-buying event. This approach would turn the example above into 5 different event types (rather than 2 event types with fields). This representation could detect patterns such as: if I drink 2 dark roast coffees from Starbucks, then I am more likely to buy a male Doberman Pinscher.
However, this representation may miss more general patterns that don't depend on field values in the events. For instance, it may be the case that I simply buy a dog after having two coffees in general.
I'd like to be able to detect patterns at both "levels" and am unsure of how to represent the events to do so. Of course one approach would be to use both representations and then just combine the results of the two.
So, questions are:
1. Any links/citations to papers that deal with this?
2. Is this a common issue?
3. Any recommendations on how to represent these events?
4. Any recommendations on how to work with them in TraMineR?
5. Any recommendations / links / references to algorithms that deal with this sort of thing?
6. Any ideas at all?
Thanks!!!

This is actually similar to the question asked here (although they did not know to reference "multi-channel" and the title was vague): Multiple events in traminer
TraMineR has support for dealing with multichannel sequences with functions like:
seqdistmc
The general approach, I believe, is to do exactly what I outlined as the "flatten" solution: you combine the values for each channel into one event type. E.g., in my example, dog.hound.female would be one event with a single channel/field, replacing the first event in my example that has 3 separate fields/channels. You then use the typical functions for finding distances, subsequences, etc. You do get options for setting up substitution costs and computing distances, though, so it has some extra support for the multi-channel approach. It also deals with missing values, in case your channels have different lengths or gaps.
This is also similar to what's suggested in the answer to the topic linked above, using the native R function interaction.
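To make the flattening concrete outside of R (where interaction() builds the combined factor), here is a minimal, hypothetical Python sketch of keeping two parallel representations of the same events, one per "level", so both kinds of patterns stay detectable:

    # Hypothetical sketch: one fully flattened alphabet plus one generic
    # alphabet derived from the same events, mined separately.
    events = [
        ("dog",    "hound",     "female"),
        ("coffee", "Starbucks", "dark"),
        ("dog",    "hound",     "female"),
        ("coffee", "Starbucks", "light"),
        ("dog",    "Doberman",  "male"),
    ]

    # Level 1: every combination of field values is its own state,
    # e.g. "dog.hound.female" (what interaction() produces in R).
    flattened = [".".join(fields) for fields in events]

    # Level 2: the generic event type only, e.g. "dog" or "coffee".
    generic = [fields[0] for fields in events]

    print(flattened)  # ['dog.hound.female', 'coffee.Starbucks.dark', ...]
    print(generic)    # ['dog', 'coffee', 'dog', 'coffee', 'dog']

Running the same sequence-mining functions on both sequences and combining the results is exactly the two-representation approach discussed above.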


Object condition in multiple places/repeated code (DRY)

This is a fundamental application design question I've struggled with and flip-flopped on for years. We have a legacy webapp that doesn't really have a solid ORM, if that tidbit might influence your answer. To abstract my question, let's say we have a class Car and a corresponding table in our database named car. Car has a few properties: color, weight, year, maxspeed. These properties directly correspond to columns in the db table.
In our application, we define the car as “classic old” if year is < 1960 and color = black. And in many places within our app knowing whether the car is "classic old" is extremely important (maybe we’re running a very illogical insurance agency which gives steep discounts and other perks to cars which are “classic old”).
All over our application, we do things like:
--list all classic old cars
--give the current user a discount if their car is classic old
--list all classic old cars with max speed > 100 miles per hour
--email the current user if their car is classic old and weighs more than 1000 pounds
What is the best way to go about this? We have a legacy application that does this in some places:
getOldClassicCars()
    SELECT * FROM car WHERE year < 1960 AND color = 'black'
and in other places:
cararray = getAllCars();
for each car in cararray
    if car.year < 1960 and car.color == "black"
        oldcararray.add(car)
The point being that this very important, fundamental piece of our application – is the car classic old – is "hardcoded" as year < 1960 and color = black in many places: sometimes in SQL, sometimes in application code, etc. Obviously that is not good, but as we've refactored, I'm not sure we've been refactoring the best way we can.
Well, you are stuck with the fundamental problem that:
- you can't run your code on the database,
- you want to be able to use the database's selection functionality on this criterion,
- you want the calculation of "classic old" to be defined in a single place (preferably code).
Let's enumerate the solutions:
1: Put the calculation in a sproc and always use the sproc to retrieve cars.
The problem here is that if you create a new car in code, its classic status is undefined, so you haven't really solved the 'not in two places' problem.
2: Get the DB to run your calculation via an assembly. For example, you can get MS SQL to run functions from a .NET assembly, which you can also use in your code base to perform the same calculation.
Problem: it's hard work. Plus, essentially the logic is still in two places; you have to keep the DB up to date and ensure that the table is accessed correctly.
3: Persist the calculated value in the DB, but perform the calculation in the code.
Problem: if the calculation changes, the DB values will be incorrect and need updating.
Option 3 seems the best, as we will know when the calculation changes and will be able to take some action to resolve the situation.
However, it might be best, given the fundamental nature of this calculation, to make that 'out of dateness' implicit in the way we structure the code.
Instead of simply persisting car.IsClassic we could add a CarStatusReport object with a datetime property. We then generate a CarStatusReport(2017) which evaluates all the cars at that point in time and saves that data in a separate table.
Our business logic is then no longer, "Is this car a classic?" but "What does the latest CarStatusReport say the status of this car is?"
Your business logic will then reside in a single CarStatusReportGenerator service, and any other logic accessing the IsClassic calculation will be forced to acknowledge the ephemeral nature of the stored information.
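A minimal Python sketch of that idea, with illustrative names (the rule lives in exactly one function, and everything else consumes dated report rows):

    from datetime import datetime, timezone

    def is_classic_old(car):
        # The single place where the business rule is defined.
        return car["year"] < 1960 and car["color"] == "black"

    def generate_car_status_report(cars):
        # Evaluate every car once; consumers then ask "what does the
        # latest report say?" rather than re-implementing the rule.
        as_of = datetime.now(timezone.utc)
        return [{"as_of": as_of, "car_id": c["id"], "is_classic": is_classic_old(c)}
                for c in cars]

    cars = [{"id": 1, "color": "black", "year": 1955},
            {"id": 2, "color": "red",   "year": 1999}]
    for row in generate_car_status_report(cars):
        print(row)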
There is no optimal solution here, but one good step is to move all the business logic into one place. If you can't, then hide the inconsistencies under the hood (e.g. behind methods or functions that calculate the property, such as isOld()), so that users of the implementation will (conceptually) never notice the DRY violation from outside.

In UML/ER diagramming, how to notate & make transaction requirements with entity that has different values depending on a certain attribute?

I am diagramming an art museum system, where there are Permanent_Art_Objects. Each Permanent_Art_Object has many attributes, and can also be either a 1) Sculpture/Statue, 2) Painting, or 3) Other. Depending on whether it's a sculpture/statue, painting, or other, it has sub-attributes unique to itself.
Here is an example of these sub-attributes.
What is the proper notation for showing these 'sub-attributes'?
For example, if Permanent_Art_Object is Other, it has as sub-attributes Type and Style.
Also, how would I write a query (INSERT INTO Permanent_Art_Object VALUES(...)) for a new art object, if there's so much variety?
It all depends on what you are making. If this is purely for a database, I think ERDs are the cleanest way of modeling it, though as a side note there are at least 4 types of notation. Below is how I would do it in UML and ERD, with the limited context I have.
More info about ERD's:
Basics: http://web.cse.ohio-state.edu/~gurari/course/cse670/cse670Ch2.xht
Specialisations: http://web.cse.ohio-state.edu/~gurari/course/cse670/cse670Ch16.xht
Overview of different types: http://en.wikipedia.org/wiki/Entity%E2%80%93relationship_model#Cardinalities
My example:
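Separately from the diagram: for the INSERT question, one common relational mapping for such a specialization is a supertype table plus one table per subtype, so inserting touches one row per level. A minimal sketch in Python with sqlite3 (all table and column names are my assumptions):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE permanent_art_object (
            id INTEGER PRIMARY KEY,
            title TEXT,
            kind TEXT CHECK (kind IN ('sculpture', 'painting', 'other'))
        );
        -- One subtype table per specialization; 'other' carries its
        -- sub-attributes Type and Style here.
        CREATE TABLE art_other (
            id INTEGER PRIMARY KEY REFERENCES permanent_art_object(id),
            type TEXT,
            style TEXT
        );
    """)

    # Inserting a new "other" object: one row in the supertype table,
    # one row in the matching subtype table.
    cur = con.execute(
        "INSERT INTO permanent_art_object (title, kind) VALUES (?, ?)",
        ("Untitled", "other"))
    con.execute(
        "INSERT INTO art_other (id, type, style) VALUES (?, ?, ?)",
        (cur.lastrowid, "tapestry", "baroque"))
    con.commit()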

Designing a solution to retrieve and classify content based on given attributes

This is a design problem I am facing. Let's say I have a cars website. Cars have the following attributes with different possible values.
Color: red, green, blue
Size: small, big
Based on those attributes I want to classify between cars for young people, cars for middle aged people and cars for elder people, with the following criteria:
Cars_young: red or green
Cars_middle_age: blue and big
Cars_elder: blue and small
I'll call these criteria the target.
I have a table cars with columns: id, color and size.
I need to be able to:
a) when retrieving a car by id, tell its target (if it's young, middle age or elder people)
b) be able to query the database to know how many views had cars belonging to each target
Also, as a developer, I must implement it in a way that those criteria are easily changed.
Which is the best way to implement this? Is there a design pattern for it? I can describe the possible solutions I thought of, but I don't really like them:
1) Create a new column in the database table called target, which makes both a) and b) easy.
Drawbacks: each time the criteria change I have to update the target column for all cars, and I also have to change the insertNewCar() function.
2) Implement it in the 'Cars' class.
Drawback: each time the criteria change I have to change the query in b) as well as the code in getCarById in a).
3) Use TRIGGERS in SQL, but I would like to avoid this solution if possible.
I would like to have this criteria definition somewhere in the code where it can be changed easily, and would also, hopefully, have it used by the 'Cars' class. I'm thinking about some singleton or global object for the target criteria which can be injected into some Cars methods.
Can anyone explain a nice solution, point to documentation or a post that faces this problem, or name a design pattern that solves it?
On first sight, the specification pattern might meet your expectations. Wikipedia gives a nice explanation of how it works; a small teaser below:
OverDueSpecification OverDue = new OverDueSpecification();
NoticeSentSpecification NoticeSent = new NoticeSentSpecification();
InCollectionSpecification InCollection = new InCollectionSpecification();

// Compose the atomic specifications into one business rule.
ISpecification SendToCollection = OverDue.And(NoticeSent).And(InCollection.Not());

var InvoiceCollection = Service.GetInvoices();
foreach (Invoice currentInvoice in InvoiceCollection) {
    if (SendToCollection.IsSatisfiedBy(currentInvoice)) {
        currentInvoice.SendToCollection();
    }
}
You could consider combining the specification pattern with observers.
Also, a few other ideas:
- extension of the specification pattern to SQL generation, WHERE clauses in particular
- storing the criteria configuration in the database
- criteria versioning: storing, along with the assigned category, the version of the rules used to assign it
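As a rough Python sketch of the first idea: specifications that can both test an object in memory (for a) and emit a WHERE fragment (for b). Every name here is illustrative, and the SQL generation is deliberately naive:

    class Spec:
        def __init__(self, predicate, sql):
            self.predicate = predicate   # in-memory test, for a)
            self.sql = sql               # WHERE fragment, for b)

        def is_satisfied_by(self, car):
            return self.predicate(car)

        def __and__(self, other):
            return Spec(lambda c: self.predicate(c) and other.predicate(c),
                        "(%s AND %s)" % (self.sql, other.sql))

        def __or__(self, other):
            return Spec(lambda c: self.predicate(c) or other.predicate(c),
                        "(%s OR %s)" % (self.sql, other.sql))

    red   = Spec(lambda c: c["color"] == "red",   "color = 'red'")
    green = Spec(lambda c: c["color"] == "green", "color = 'green'")
    blue  = Spec(lambda c: c["color"] == "blue",  "color = 'blue'")
    big   = Spec(lambda c: c["size"] == "big",    "size = 'big'")
    small = Spec(lambda c: c["size"] == "small",  "size = 'small'")

    # The criteria live in one easily changed place:
    targets = {
        "young":      red | green,
        "middle_age": blue & big,
        "elder":      blue & small,
    }

    car = {"color": "blue", "size": "big"}
    print([name for name, spec in targets.items() if spec.is_satisfied_by(car)])
    print(targets["young"].sql)   # (color = 'red' OR color = 'green')

Changing the criteria then means editing only the targets dictionary, while both the per-car check and the aggregate query reuse it.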

Access Query - Compare Multiple User Selections Against Each Other

I'm running into a conceptual problem that I cannot seem to conquer in my mind.
Let's say I want a user to enter what they're currently wearing into a database via a form. Throwing 'T-Shirt' and 'Blue' into a new row is incredibly easy. However, let's say I want to compare one user's entries against other users', and rank them in order from most similar to least.
This becomes a huge nightmare when you consider the amount of options available.
Undershirt
Overshirt
Jacket
Scarf/Necklaces
Headwear
Pants
Underwear
Leggings
Socks
Footwear
Accessories
As I see it, I could hard-code the 11 categories above and let a user make selections from drop-down boxes tailored to each category. Now, let's use the example of 'Undershirt' and 'Overshirt'. Depending on the person, a long-sleeved shirt could be used as either; they're still wearing one. If I make users put values into categories, User A might put it in one and User B might put it in another. And because they're in separate categories, they wouldn't get compared.
Now, instead of hard-coding categories (and thus limiting how much a user can enter), I could put each item into its own row and search by User ID. But let's say a person enters shorts one day, and the next throws in jeans and a shirt. How can I make sure that they're compared separately (e.g., dress compared to shorts, and dress compared to jeans+shirt) and not all together (dress compared to shorts+jeans+shirt)?
As to actually comparing, each item could be scored against each other item via a 2D lookup table (row Dress vs. column Jeans would net a zero; row Dress vs. column Dress would net a one).
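For concreteness, a tiny Python sketch of that lookup-table idea (items and scores are made up), comparing each of one user's items against each of the other's and keeping the best match per item:

    # Scores for known item pairs; identical items score 1, unknown pairs 0.
    SIMILARITY = {
        ("long-sleeved shirt", "undershirt"): 0.8,
        ("long-sleeved shirt", "overshirt"): 0.8,
    }

    def item_score(a, b):
        if a == b:
            return 1.0
        return SIMILARITY.get((a, b), SIMILARITY.get((b, a), 0.0))

    def outfit_score(items_a, items_b):
        # Best match per item, so it doesn't matter which category
        # each user happened to file a garment under.
        return sum(max(item_score(a, b) for b in items_b) for a in items_a)

    print(outfit_score(["long-sleeved shirt", "jeans"],
                       ["undershirt", "jeans"]))   # 1.8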
The appropriate design for this depends on the acceptable margin of error. If there is zero acceptable error, then you must present the users with the categories, and they specify yes/no for each one or select from a limited set of possible answers.
HANDS:
gloves
mittens
brass knuckles
[Caveat: the user could be wearing brass knuckles inside the mittens. You have to take into account whether values are mutually exclusive or not. Barefoot <> no socks: someone who is barefoot is not wearing socks, but someone not wearing socks may be wearing docksiders.]
FEET1:
anklet socks
sheer stockings
fishnet stockings
ragg wool hiking socks
kneesocks
gym socks
no socks
FEET2:
moccasins
running shoes
sandals
wing-tips
uggs
spike heels
...
HEAD:
sombrero
beret
baseball hat
pirate's hat
beanie
knitted cap
NECK:
scarf
mock turtleneck aka dickie
Et cetera et cetera ad nauseam.
Or, if the margin of error is very generous, you could allow simple freeform text entry and match/partial-match on words. For slightly less error, you could set up a synonyms table and match on the synonyms of the supplied words.
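A small Python sketch of that synonyms approach (the synonym table here is invented):

    SYNONYMS = {"tee": "t-shirt", "sneakers": "running shoes",
                "trainers": "running shoes"}

    def normalize(entry):
        # Lowercase, split into words, and map each word through the
        # synonyms table before comparing.
        words = entry.lower().replace(",", " ").split()
        return {SYNONYMS.get(w, w) for w in words}

    def word_match(entry_a, entry_b):
        # Jaccard overlap: 1.0 means the same words, 0.0 means none shared.
        a, b = normalize(entry_a), normalize(entry_b)
        return len(a & b) / len(a | b) if (a | b) else 0.0

    print(word_match("blue tee", "blue t-shirt"))   # 1.0 after normalization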
As a general rule, get the database design right and worry about reporting later. If this is not just a thought exercise, you may like to say what you are actually comparing, because with the above a person is quite likely to say "tuxedo" or "evening dress" and let the details be inferred, whereas in some other area this may not be possible. Even so, it seems you would need a minimum of three columns (fields) for each item:
Timestamp
Major category (jeans, trousers, skirt)
Item (Levi's, tweeds, mini)
If accuracy is particularly important, you will need a trained interviewer :)
I have just noticed underwear in that list, which is even more complicated, because what would qualify as full underwear for a lady of a certain age is by no means the same as that for a gentleman of ten years.

Algorithm for almost similar values search

I have Persons table in SQL Server 2008.
My goal is to find Persons who have almost similar addresses.
The address is described with columns state, town, street, house, apartment, postcode and phone.
Due to some specific differences in some states (not US) and human factor (mistakes in addresses etc.), address is not filled in the same pattern.
Most common mistakes in addresses:
- Case sensitivity
- Someone wrote "apt.", another one "apartment" or "ap." (although addresses aren't written in English)
- Spaces, dots, commas
- Differences in writing street names, like "Dr. Jones str." or "Doctor Jones street" or "D. Jon. st." or "Dr Jones st", etc.
The main problem is that data isn't in the same pattern, so it's really difficult to find similar addresses.
Is there any algorithm for this kind of issue?
Thanks in advance.
UPDATE
As I mentioned, the address is separated into different columns. Should I generate a string by concatenating the columns, or apply your steps to each column?
I assume I shouldn't concatenate the columns, but if I compare columns separately, how should I organize it? Should I find similarities for each column and then union them, intersect them, or something else?
Should I collect some statistics, or use some kind of learning algorithm?
Suggest approaching it thus:
- Create word-level n-grams (a trigram/4-gram might do it) from the various entries.
- Do a many-by-many string comparison and cluster the entries by string distance. Someone suggested Levenshtein, but there are better algorithms for this kind of task: Jaro-Winkler distance and Smith-Waterman work better. A library such as SimMetrics would make life a lot easier.
- Once you have clusters of n-grams, you can resolve the whole string using the constituent subgrams, i.e. D.Jones St => Davy Jones St. => DJones St.
It should not be too hard; this is an all-too-common problem.
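A rough Python sketch of the n-gram and many-by-many steps; difflib's ratio stands in here for Jaro-Winkler or Smith-Waterman, which a library like SimMetrics (or Python's jellyfish) would provide:

    import difflib
    from itertools import combinations

    def word_ngrams(s, n=3):
        # Word-level n-grams, e.g. "1 jones street", "jones street sometown".
        words = s.lower().split()
        return [" ".join(words[i:i + n])
                for i in range(max(1, len(words) - n + 1))]

    addresses = ["1 Jones Street Sometown", "1 Jnes St Sometown",
                 "9 High Road Othertown"]

    # Many-by-many comparison; pairs scoring above a threshold become
    # candidates for the same cluster.
    for a, b in combinations(addresses, 2):
        sim = max(difflib.SequenceMatcher(None, ga, gb).ratio()
                  for ga in word_ngrams(a) for gb in word_ngrams(b))
        print(f"{a!r} vs {b!r}: {sim:.2f}")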
Update: Based on your update above, here are the suggested steps
Concatenate your columns into a single string, perhaps by creating a db view. For example:
create view vwAddress
as
select top 10000
    state, town, street, house, apartment, postcode,
    state + town + street + house + apartment + postcode as Address
from ...
Write a separate application (say in Java or C#/VB.NET) and use an algorithm like Jaro-Winkler to estimate the string distance for the combined address, doing a many-by-many comparison, and write the results into a separate table:
address1 | addressN | similarity
You can use SimMetrics to get the similarity thus:
JaroWinkler objJw = new JaroWinkler();
double sim = objJw.GetSimilarity(address1, addressN);
You could also trigram it, so that an address such as "1 Jones Street, Sometown, SomeCountry" becomes "1 Jones Street", "Jones Street Sometown", and so on, and compare the trigrams (or even 4-grams) for higher accuracy.
Finally, you can order by similarity to get a cluster of the most similar addresses and decide on an appropriate threshold. Not sure why you are stuck.
I would try to do the following:
- split up the address into multiple words, getting rid of punctuation at the same time
- check all the words for patterns that are typically written differently and replace them with a common form (e.g. replace "apartment", "ap.", ... with "apt"; replace "Doctor" with "Dr.", ...)
- put all the words back into one string, alphabetically sorted
- compare all the addresses using a fuzzy string-comparison algorithm, e.g. Levenshtein
- tweak the parameters of the Levenshtein algorithm (e.g. you want to allow more differences on longer strings)
- finally, do a manual check of the strings
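A minimal Python sketch of these steps (the abbreviation map is illustrative and would be much larger in practice):

    import re

    ABBREV = {"apartment": "apt", "ap": "apt", "doctor": "dr", "street": "st"}

    def canonicalize(address):
        words = re.sub(r"[.,]", " ", address.lower()).split()  # split, drop punctuation
        words = [ABBREV.get(w, w) for w in words]               # common forms
        return " ".join(sorted(words))                          # alphabetical order

    def levenshtein(a, b):
        # Classic dynamic-programming edit distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                               prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    a = canonicalize("Dr. Jones str. 5, apartment 3")
    b = canonicalize("Doctor Jones street 5 apt 3")
    print(a, "|", b, "|", levenshtein(a, b))   # distance 1 ("str" vs "st")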
Of course, the solution to keep your data 'in shape' is to have explicit fields for each of your characteristics in your database. Otherwise, you will end up doing this exercise every few months.
The main problem I see here is to exactly define equality.
Even if someone writes Jon. and another writes Jone., you will never be able to say whether they are the same (Jon → Jonathan, Joneson, Jonedoe, whatever ;)).
I work in a firm where we have to handle exactly this problem. I'm afraid I have to tell you that this kind of checking of address lists for navigation systems is done "by hand" most of the time. Abbreviations are sometimes context-dependent, and there are other things that make this difficult. Of course, replacing strings etc. is done with Python, but telling you the MEANING of such an abbreviation can only be done by script in a few cases. ("St." can be "Saint" or "Street". How to decide? Impossible; this is human work.)
Another big problem is, as you said: is there a street "DJones", or a person? Or both? Which one is meant here? Is this DJones the same as Dr Jones, or the same as Don Jones? It's impossible to decide!
You can do some work with lists, as presented in another answer here, but it will still give you plenty of "false positives".
You have a postcode field! So why don't you just buy a postcode table for your country and use it to clean up your street/town/region/province information?
I did a project like this in the last century. Basically, it was a consolidation of two customer files after a merger, and involved names and addresses from three different sources.
Firstly, as many posters have suggested, convert all the common words, abbreviations and spelling mistakes to a common form: "Apt.", "Apartment" etc. to "Apt".
Then look through the name and identify the first letter of the first name plus the first surname. (Not that easy; consider "Dr. Med. Sir Henry de Baskerville Smythe".) But don't worry: where there are ambiguities, just take both! So if you're lucky you get HBASKERVILLE and HSMYTHE. Now get rid of all the vowels, as that's where most spelling variations occur, and you have HBSKRVLL HSMTH.
You would also get these strings from "H. Baskerville" and "Sir Henry Baskerville Smith", and unfortunately from "Harold Smith", but we are talking fuzzy matching here!
Perform a similar exercise on the street, apartment and postcode fields. But do not throw away the original data!
Now you come to the interesting bit. First you compare each of the original strings and give, say, 50 points for each string that matches exactly. Then go through your "normalised" strings and give, say, 20 points for each one that matches exactly. Then go through all the strings and give, say, 5 points for each four-character-or-more substring they have in common. For each pair compared, you will end up with some scores above 150, which you can consider a certain match; some below 50, which you can consider not matched; and some in between, which have some probability of matching.
You need some more tweaking to improve this, adding various rules like "subtract 20 points for a surname of Smith". You really have to keep running and tweaking until you are happy with the resulting matches; once you look at the results, you get a pretty good feel for which score to consider a match and which are the false positives you need to get rid of.
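A condensed Python sketch of that scoring scheme, using the example weights above (50/20/5); the field tuples and the substring rule are simplified for illustration:

    def substring_points(a, b, min_len=4, points=5):
        # Points for each maximal common substring of four or more characters.
        found = {a[i:i + j]
                 for i in range(len(a) - min_len + 1)
                 for j in range(min_len, len(a) - i + 1)
                 if a[i:i + j] in b}
        maximal = {s for s in found if not any(s != t and s in t for t in found)}
        return points * len(maximal)

    def pair_score(orig_a, orig_b, norm_a, norm_b):
        score = 50 * sum(x == y for x, y in zip(orig_a, orig_b))   # exact fields
        score += 20 * sum(x == y for x, y in zip(norm_a, norm_b))  # normalised fields
        score += substring_points(" ".join(norm_a), " ".join(norm_b))
        return score   # > 150: certain match; < 50: no match; else: maybe

    print(pair_score(("Sir Henry Baskerville Smythe",), ("H. Baskerville Smythe",),
                     ("HBSKRVLL HSMTH",), ("HBSKRVLL HSMTH",)))   # 25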
I think the amount of data could affect what approach works best for you.
I had a similar problem when indexing music from compilation albums with various artists. Sometimes the artist came first, sometimes the song name, with various separator styles.
What I did was count the number of occurrences in other entries with the same value, to make an educated guess whether it was the song name or the artist.
Perhaps you can use Soundex or a similar algorithm to find things that are similar.
EDIT: (Maybe I should clarify that I assumed artist names were more likely to reoccur than song names.)
One important thing that you mention in the comments is that you are going to do this interactively.
This allows you to parse user input and, at the same time, validate guesses about any abbreviations and correct a lot of mistakes (the way, for example, phone-number entry works in some contact-management systems: the system makes a best effort to parse and correct the country code, area code and number, but ultimately the user is presented with the guess and has the chance to correct the input).
If you want to do it really well, then keeping databases/dictionaries of postcodes, towns, streets, abbreviations and their variations can improve data validation and pre-processing.
So at least you would have a fully qualified address. If you can do this for all the input, you will have all the data categorized, and matches can then be strict on certain fields and less strict on others, with the matching score calculated according to the weights you assign.
After you have consistently pre-processed the input, n-grams should be able to find similar addresses.
Have you looked at SQL Server Integration Services for this? The Fuzzy Lookup component allows you to find 'Near matches': http://msdn.microsoft.com/en-us/library/ms137786.aspx
For new input, you could call the package from .NET code, passing the row to be checked as a set of parameters; you'd probably need to persist the token index for this to be fast enough for user interaction, though.
There's an example of address matching here: http://msdn.microsoft.com/en-us/magazine/cc163731.aspx
I'm assuming that response time is not critical and that the problem is finding an existing address in a database, not merging duplicates. I'm also assuming the database contains a large number of addresses (say 3 million), rather than a number that could be cleaned up economically by hand or by Amazon's Mechanical Turk.
Pre-computation: identify address fragments with high information content.
- Identify all the unique words used in each database field and count their occurrences.
- Eliminate very common words and abbreviations. (Street, st., appt, apt, etc.)
When presented with an input address:
- Identify the most unique word and search (Street LIKE '%Jones%') for existing addresses containing that word.
- Use the pre-computed statistics to estimate how many addresses will be in the result set.
- If the estimated result set is too large, select the second-most unique word and combine it in the search (Street LIKE '%Jones%' AND Town LIKE '%Anytown%').
- If the estimated result set is too small, select the second-most unique word and combine it in the search (Street LIKE '%Aardvark%' OR Town LIKE '%Anytown%').
- If the actual result set is too large/small, repeat the query, adding further terms as before.
The idea is to find enough high-information fragments in the address that can be searched for to give a reasonable number of alternatives, rather than to find the single best match. For more tolerance to misspelling, trigrams, tetra-grams or soundex codes could be used instead of words.
Obviously, if you have lists of actual states/towns/streets, then some data clean-up could take place both in the database and in the search address. (I'm very surprised the Armenian postal service does not make such a list available, but I know that some postal services charge excessive amounts for this information.)
As a practical matter, most systems I see in use try to look up people's accounts by their phone number if possible: obviously whether that is a practical solution depends upon the nature of the data and its accuracy.
(Also consider the lateral-thinking approach: could you find a mail-order mail-list broker company which will clean up your database for you? They might even be willing to pay you for use of the addresses.)
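A minimal Python sketch of the pre-computation and the first narrowing step (the SQL is built as a plain string purely for illustration; a real version would use parameters, and the stopword list is invented):

    from collections import Counter

    STOPWORDS = {"street", "st", "appt", "apt"}

    def word_counts(addresses):
        # Pre-computation: occurrence counts for every word in the database.
        counts = Counter()
        for addr in addresses:
            counts.update(w for w in addr.lower().split() if w not in STOPWORDS)
        return counts

    def build_query(input_address, counts, max_terms=2):
        # Start from the rarest (highest-information) words and AND them
        # together until the estimated result set is small enough.
        words = [w for w in input_address.lower().split() if w not in STOPWORDS]
        rare = sorted(words, key=lambda w: counts.get(w, 0))[:max_terms]
        where = " AND ".join(f"address LIKE '%{w}%'" for w in rare)
        return "SELECT * FROM persons WHERE " + where

    db = ["12 Jones st Anytown", "9 Jones st Anytown", "4 Aardvark st Anytown"]
    print(build_query("Jones st 12 Anytown", word_counts(db)))
    # SELECT * FROM persons WHERE address LIKE '%12%' AND address LIKE '%jones%'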
I've found a great article. By adding some DLLs as SQL user-defined functions, we can use string-comparison algorithms from the SimMetrics library. Check it out:
http://anastasiosyal.com/archive/2009/01/11/18.aspx
The possibilities for such variations are countless, and even if such an algorithm existed it could never be foolproof; you can't have a spell checker for nouns, after all.
What you can do is provide a drop-down list of previously entered field values, so that users can select one if a particular name already exists.
It's better to have separate fields for each value, like apartment and so on.
You could throw all addresses at a web service like Google Maps (I don't know whether this one is suitable, though) and see whether they come up with identical GPS coordinates.
One method could be to apply the Levenshtein distance algorithm to the address fields. This will allow you to compare the strings for similarity.
Edit
After looking at the kinds of address differences you are dealing with, this may not be helpful after all.
Another idea is to use learning. For example you could learn, for each abbreviation and its place in the sentence, what the abbreviation means.
3 Jane Dr. -> Dr (in 3rd position (or last)) means Drive
Dr. Jones St -> Dr (in 1st position) means Doctor
You could, for example, use decision trees and have a user train the system. Probably a few examples of each use would be enough. You wouldn't classify single-letter abbreviations like the "D." in "D.Jones" (which could be David Jones or Dr. Jones) as likely. But after a first level of translation, you could look up a street index of the town and see whether you can expand the "D." into a street name.
Again, you would run each address through the decision tree before storing it.
It feels like there should be some commercial products doing this out there.
A possibility is to have a dictionary table in the database that maps all the variants to the 'proper' version of the word:
*Value* | *Meaning*
Apt. | Apartment
Ap. | Apartment
St. | Street
Then you run each word through the dictionary before you compare.
Edit: this alone is too naive to be practical (see comment).