Modeling Geographic Locations in an Relational Database - sql

I am designing a contact management system and have come across an interesting issue regarding modeling geographic locations in a consistent way. I would like to be able to record locations associated with a particular person (mailing address(es) for work, school, home, etc.) My thought is to create a table of locales such as the following:
Locales (ID, LocationName, ParentID) where autonomous locations (such as countries, e.g. USA) are parents of themselves. This way I can have an arbitrarily deep nesting of 'political units' (COUNTRY > STATE > CITY or COUNTRY > STATE > CITY > UNIVERSITY). Some queries will necessarily involve recursion.
I would appreciate any other recommendations or perhaps advice regarding predictable issues that I am likely to encounter with such a scheme.

You might want to have a look at Freebase.com as a site that's had some open discussion about what a "location" means and what it means when a location is included in another. These sorts of questions can generate a lot of discussion.
For example, there is the obvious "geographic nesting", but there are less obvious logical nestings. For example, in a strictly geographic sense, Vatican City is nested within Italy. But it's not nested politically. Similarly, if your user is located in a research center that belongs to a university, but isn't located on the University's property, do you model that relationship or not?

Sounds like a good approach to me. The one thing that I'm not clear on when reading you post is what "parents of themselves" means - if this is to indicate that the locale does not have a parent, you're better off using null than the ID of itself.

I think you might be overthinking this. There's a reason most systems just store addresses and maybe a table of countries. Here are some things to look out for:
Would an address in the Bronx include the borough as a level in the hierarchy? Would an address in an unincorporated area eliminate the "city" level of the hierarchy? How do you model an address within a university vs an address that's not within one? You'll end up with a ragged hierarchy which will force you to traverse the tree every time you need to display an address in your application. If you have an "address book" page the performance hit could be significant.
I'm not sure that you even have just one hierarchy. Brown University has facilities in Providence, RI and Bristol, RI. The only clean solution would be to have a double hierarchy with two campuses that each belong to their respective cities in one hierarchy but that both belong to Brown University on the other hierarchy. (A university is fundamentally unlike a political region. You shouldn't really mix them.)
What about zip codes? Some zip codes encompass multiple towns, other times a city is broken into multiple zip codes. And (rarely) some zip codes even cross state lines. (According to Wikipedia, at least...)
How will you enter the data? Building out the database by parsing conventionally-formatted addresses can be difficult when you take into account vanity addresses, alternate names for certain streets, different international formats, etc. And I think that entering every address hierarchically would be a PITA.
It sounds like you're trying to model the entire world in your application. Do you really want or need to maintain a table that could conceivable contain every city, state, province, postal code, and country in the world? (Or at least every one where you know somebody?) The only thing I can think of that this scheme would buy you is proximity, but if that's what you want I'd just store state and country separately (and maybe the zip code) and add latitude and longitude data from Google.
Sorry for the extreme pessimism, but I've gone down that road myself. It's logically beautiful and elegant, but it doesn't work so well in practice.

Here's a suggestion for a pretty flexible schema. An immediate warning: it could be too flexible/complex for what you actually need
Location
(LocationID, LocationName)
-- Basic building block
LocationGroup
(LocationGroupID, LocationGroupName, ParentLocationGroupID)
-- This can effective encapsulate multiple hierarchies. You have one root node and then you can create multiple independent branches. E.g. you can split by state first and then create several sub-hierarchies e.g. ZIP/city/xxxx
LocationGroupLocation
(LocationID, LocationGroupID)
-- Here's how you link Location with one or more hierarchies. E.g. you can link your house to a ZIP, as well as a City... What you need to implement is a constraint that you should not be able to link up a location with any two hierarchies where one of them is a parent of the other (as the relationship is already implicit).

I would think carefully about this since it may not be a necessary feature.
Why not just use a text field and let users type in an address?
Remember the KISS principle (Keep It Simple, Stupid).

I agree with the other posts that you need to be very careful here about your requirements. Location can become a tricky issue and this is why GIS systems are so complicted.
If you are sure you just need a basic heirarchy structure, I have the following suggestions:
I support the previous comment that root level items should not have themselves as the parent. Root level items should have a null value for the parent. Always be careful about putting data into a field that has no meaning (i.e. "special" value to represent no data). This practice is rarely necessarily and way overused in the devleoper community.
Consider XPath / XML. This is Something to consider for bother recording the heirarchy structure, and for processing / parsing the data at retrieval. If you are using MSSQL Server, the XPath expressions in select statements are perfect for tasks such as returning the full location/heirarchy path of a record as the code is simple and the results are fast.

For Geographic locations you may wish to resolve an address to a Latitude, Longitude array (perhaps using Google maps etc.) to calculate proximities etc.. For Geopolitical nesting ... I'd go with the KISS response.
If you really want to model it, perhaps you need the types to be more generic ... Country -> State -> County -> Borough -> Locality -> City -> Suburb -> Street or PO Box -> Number -> -> Appartment etc. -> Institution (University or Employer) -> Division -> Subdivision-1 -> subdivision-n ... Are you sure you can't do KISS?

I'm modeling an apps for global users and I have the same problems, but I think that this approach could already be in use in many enterprise. But why this problem don't have an universal solution? Or, has this problem one best solution that can be the start point or anybody in the world need think in a solution for it since beginnig?
In IT, we are making the same things any times and in many places, unfortunately. For exemplo, who are not have made more than one user, customer or product's database? And the worst, all enterprise in the world has made it. I think that could have universal solutions for universal problems.

Related

Query String vs Resource Path for Filtering Criteria

Background
I have 2 resources: courses and professors.
A course has the following attributes:
id
topic
semester_id
year
section
professor_id
A professor has the the following attributes:
id
faculty
super_user
first_name
last_name
So, you can say that a course has one professor and a professor may have many courses.
If I want to get all courses or all professors I can: GET /api/courses or GET /api/professors respectively.
Quandary
My quandary comes when I want to get all courses that a certain professor teaches.
I could use either of the following:
GET /api/professors/:prof_id/courses
GET /api/courses?professor_id=:prof_id
I'm not sure which to use though.
Current solution
Currently, I'm using an augmented form of the latter. My reasoning is that it is more scale-able if I want to add in filtering/sorting criteria.
I'm actually encoding/embedding JSON strings into the query parameters. So, a (decoded) example might be:
GET /api/courses?where={professor_id: "teacher45", year: 2016}&order={attr: "topic", sort: "asc"}
The request above would retrieve all courses that were (or are currently being) taught by the professor with the provided professor_id in the year 2016, sorted according to topic title in ascending ASCII order.
I've never seen anyone do it this way though, so I wonder if I'm doing something stupid.
Closing Questions
Is there a standard practice for using the query string vs the resource path for filtering criteria? What have some larger API's done in the past? Is it acceptable, or encouraged to use use both paradigms at the same time (make both endpoints available)? If I should indeed be using the second paradigm, is there a better organization method I could use besides encoding JSON? Has anyone seen another public API using JSON in their query strings?
Edited to be less opinion based. (See comments)
As already explained in a previous comment, REST doesn't care much about the actual form of the link that identifies a unique resource unless either the RESTful constraints or the hypertext transfer protocol (HTTP) itself is violated.
Regarding the use of query or path (or even matrix) parameters is completely up to you. There is no fixed rule when to use what but just individual preferences.
I like to use query parameters especially when the value is optional and not required as plenty of frameworks like JAX-RS i.e. allow to define default values therefore. Query parameters are often said to avoid caching of responses which however is more an urban legend then the truth, though certain implementations might still omit responses from being cached for an URI containing query strings.
If the parameter defines something like a specific flavor property (i.e. car color) I prefer to put them into a matrix parameter. They can also appear within the middle of the URI i.e. /api/professors;hair=grey/courses could return all cources which are held by professors whose hair color is grey.
Path parameters are compulsory arguments that the application requires to fulfill the request in my sense of understanding otherwise the respective method handler will not be invoked on the service side in first place. Usually this are some resource identifiers like table-row IDs ore UUIDs assigned to a specific entity.
In regards to depicting relationships I usually start with the 1 part of a 1:n relationship. If I face a m:n relationship, like in your case with professors - cources, I usually start with the entity that may exist without the other more easily. A professor is still a professor even though he does not hold any lectures (in a specific term). As a course wont be a course if no professor is available I'd put professors before cources, though in regards to REST cources are fine top-level resources nonetheless.
I therefore would change your query
GET /api/courses?where={professor_id: "teacher45", year: 2016}&order={attr: "topic", sort: "asc"}
to something like:
GET /api/professors/teacher45/courses;year=2016?sort=asc&onField=topic
I changed the semantics of your fields slightly as the year property is probably better suited on the courses rather then the professors resource as the professor is already reduced to a single resource via the professors id. The courses however should be limited to only include those that where held in 2016. As the sorting is rather optional and may have a default value specified, this is a perfect candidate for me to put into the query parameter section. The field to sort on is related to the sorting itself and therefore also belongs to the query parameters. I've put the year into a matrix parameter as this is a certain property of the course itself, like the color of a car or the year the car was manufactured.
But as already explained previously, this is rather opinionated and may not match with your or an other folks perspective.
I could use either of the following:
GET /api/professors/:prof_id/courses
GET /api/courses?professor_id=:prof_id
You could. Here are some things to consider:
Machines (in particular, REST clients) should be treating the URI as an opaque thing; about the closest they ever come to considering its value is during resolution.
But human beings, staring that a log of HTTP traffic, do not treat the URI opaquely -- we are actually trying to figure out the context of what is going on. Staying out of the way of the poor bastard that is trying to track down a bug is a good property for a URI Design to have.
It's also a useful property for your URI design to be guessable. A URI designed from a few simple consistent principles will be a lot easier to work with than one which is arbitrary.
There is a great overview of path segment vs query over at Programmers
https://softwareengineering.stackexchange.com/questions/270898/designing-a-rest-api-by-uri-vs-query-string/285724#285724
Of course, if you have two different URI, that both "follow the rules", then the rules aren't much help in making a choice.
Supporting multiple identifiers is a valid option. It's completely reasonable that there can be more than one way to obtain a specific representation. For instance, these resources
/questions/38470258/answers/first
/questions/38470258/answers/accepted
/questions/38470258/answers/top
could all return representations of the same "answer".
On the /other hand, choice adds complexity. It may or may not be a good idea to offer your clients more than one way to do a thing. "Don't make me think!"
On the /other/other hand, an api with a bunch of "general" principles that carry with them a bunch of arbitrary exceptions is not nearly as easy to use as one with consistent principles and some duplication (citation needed).
The notion of a "canonical" URI, which is important in SEO, has an analog in the API world. Mark Seemann has an article about self links that covers the basics.
You may also want to consider which methods a resource supports, and whether or not the design suggests those affordances. For example, POST to modify a collection is a commonly understood idiom. So if your URI looks like a collection
POST /api/professors/:prof_id/courses
Then clients are more likely to make the associate between the resource and its supported methods.
POST /api/courses?professor_id=:prof_id
There's nothing "wrong" with this, but it isn't nearly so common a convention.
GET /api/courses?where={professor_id: "teacher45", year: 2016}&order={attr: "topic", sort: "asc"}
I've never seen anyone do it this way though, so I wonder if I'm doing something stupid.
I haven't either, but syntactically it looks a little bit like GraphQL. I don't see any reason why you couldn't represent a query that way. It would make more sense to me as a single query description, rather than breaking it into multiple parts. And of course it would need to be URL encoded, etc.
But I would not want to crazy with that right unless you really need to give to your clients that sort of flexibility. There are simpler designs (see Roman's answer)

A tree, where each node could have multiple parents

Here's a theoretical/pedantic question: imagine properties where each one could be owned by multiple others. Furthermore, from one iteration of ownership to the next, two neighboring owners could decide to partly combine ownership. For example:
territory 1, t=0: a,b,c,d
territory 2, t=0: e,f,g,h
territory 1, t=1: a,b,g,h
territory 2, t=1: g,h
That is to say, c and d no longer own property; and g and h became fat cats, so to speak.
I'm currently representing this data structure as a tree where each child could have multiple parents. My goal is to cram this into the Composite design pattern; but I'm having issues getting a conceptual footing on how the client might go back and update previous ownership without mucking up the whole structure.
My question is twofold.
Easy: What is a convenient name for this data structure such that I can google it myself?
Hard: What am I doing wrong? When I code I try to keep the mantra, "Keep it simple, Stupid," in my head, and I feel I am breaking this credo.
My question is two fold: Easy: What is a convenient name for this data
structure such that I can google it myself?
What you have here is not a tree, it is a graph. A multimap will help you here.
But any adjacency list or adjacency matrix will give you a good start.
Here is a video on adjacency matrix and list: Youtube on adjacency matrix and list
Hard: What am I doing wrong?
This is really hard to tell. Perhaps you did not model the relationship
in a proper way. It is not that hard, given a good datastructure to start with.
And, as you asked for design patterns (but you probably found out yourself),
the Composite pattern will let you model such an setting with ease.
You have a many-to-many relationship between your owners and your territories (properties). I'm not sure what language you're working in, but this sort of thing can be easily represented and tracked in a relational database. (You'd probably want a table for each entity, and the relationship would probably require a third "junction" table. If it's necessary to be able to query "back in time", this could have some sort of "time index" column as well.)
If you are working in an object-oriented language, you might create two classes, Territory and Owner, where the Territory class has a property/member/field which is a collection of references/pointers to Owners and the Owner class has a similar collection of Territories. (One of these two collections may need to contain "weak" references depending on the language.)
In this case, some difficulty may arise if you want to be able to go back and look at the network state at some particular point earlier in time. (If this is what you need, say so and I (or someone else) can post a solution that works for that.)
I'm not sure what level of simplicity you are striving for, but in neither of these cases is updating the ownership relationships really that "hard". Maybe if you posted some code it might be easier to give you more concrete advice.
Hard to tell without more information regarding the business rules. Though I've plenty of experience designing graphs where each node could potentially have numerous parents.
A common structure is the Directed Acyclic Graph. Essential rules here are that no path through the graph can cycle back onto itself. For example take the path "A/B/C/B", this would not be valid as B repeats twice.
Valid:- "A/B/C", "D/E/C", node C has two parents E and B.
Invalid:- "A/B/C/B", node B repeats in the same path causing a cycle.

What would be the best way to index and search my data using Lucene?

I’ve found multiple questions on SO and elsewhere that ask questions along the lines of “How can I index and then search relational data in Lucene”. Quite rightly these questions are met with the standard response that Lucene is not designed to model data like this. This quote I found sums it up…
A Lucene Index is a Document Store. In a Document Store, a single
document represents a single concept with all necessary data stored to
represent that concept (compared to that same concept being spread
across multiple tables in an RDBMS requiring several joins to
re-create).
So I will not ask that question and instead provide my high level requirements and see if any Lucene gurus out there can help me.
We have data on People (Name, Gender, DOB, Nationality, etc)
And data on Companies (Name, Country, City, etc).
We also have data about how these two types of entity relate to each other where a person worked at the company (Person, Company, Role, Date Started, Date Ended, etc).
We have two entities – Person and Company – that have their own properties and then properties exist for the many-to-many link between them.
Some example searches could be as follows…
Find all Companies in Australia
Find all People born between two dates
Find all People who have worked as a .Net Developer
Find all males who have worked as a.Net Developer in London.
Find all People who have worked as a .Net Developer between 2008 and 2010
The criteria span all the three sets of data. Our requirement is to provide a Faceted Search over the data that accepts any combination of the various properties, of which I have given some examples.
I would like to use Lucene.Net for this. We are a .Net software house and so feel slightly intimidated by java. However, all suggestions are welcome.
I am aware of the idea that the Index should be constructed with the search in mind. But I can’t seem to come up with a sensible index that would meet all the combinations of search criteria
What classes native to Lucene or what extension points can we make use of.
Are there are established techniques for doing this kind of thing?
Are there any third open source contributions that I have missed that will help us here?
For now I won’t describe the scenarios we have considered because I don’t want to bloat out this question and make it too intimidating. Please ask me to elaborate where necessary.
To store both companies and people in a single index, you could create documents with a type field that identifies the type of entities they describe.
Birthdays can be stored as date fields.
You could give each person a simple text field containing the names of companies that they worked for. Note that you won't get an error if you enter a company that is not represented by a document in your index. Lucene is not a relational DB tool, but you knew that.
(Sorry that I've not posted any links to the API; I'm familiar with Lucene Core but not Lucene.NET.)

Cities, Municipalities, or Settlements?

I'm trying to build a form that lets a user enter an address as easily as possible.
The way I had it before is you would choose a country, and then a list of provinces/states is populated, you choose one of those, and then type the name of the city.
However, I think this could be made easier. A lot of apps (ex. Facebook events) allow you to just type the name of the city, and then the country/province can be inferred, so I'm trying to build something like that.
I live in BC, Canada, so I'm starting with that. Looking on Wikipedia, there are a number of "cities" that are left out, such as Delta, which is a fairly big region over here. It should probably be included in the list, because in colloquial conversations, people usually say I live in "Delta" or something, even though it's not really a city. So maybe I should actually use municipalities which seem to be a superset of the cities.
But then we've left out places like Ladner, which is actually a sub-region of Delta, but now I'm worried I'm getting too specific?
Thoughts?
Also, this is just Canada. I have to include the United States as well. I'm not sure if the states are divided up quite the same way, how fine-grained should I go there, fellow Americans?
If you were trying to get the weather, or time, or trying to look up a location, and were presented with a form, how fine-grained would you expect it to recognize?
As an American, all I usually care about is City and State. Maybe an option to omit the city and fill in County/Region like you were talking about above would be appropriate, as not everyone lives near enough to a city to really claim to be from there. The only issue is you'd have to keep track of all of these in your database, and keep them updated. I'm not really sure how to go about that.
For the United States, check out the USPS City State File. This will never get all the common place names, but should cover most incorporated cities, townships, municipalities, etc (and a few unincorporated places as well).

DDD: Modeling M:N relation between two roots where the relation itself carries semantic meaning

Update Edited to reflect clarifications requested by Chris Holmes below. Originally I was referring to a Terminal as a Site but changed it to better reflect my actual domain.
At heart, I think this is a question about modeling a many to many relationship between two root entities where the relationship itself carries some semantic meaning.
In my domain
You can think of a Terminal as a branch location of our company
A Terminal can have a relationship with any number of customers
A customer can have a relationship with any number of terminals (standard many to many)
A customer\terminal relationship means that a customer can potentially store products at the Terminal
This relationship can be enabled\disabled. To be disabled merely means you are temporarily not allowed to store product, so a disabled relationship is different from no relationship at all.
A customer can have many Offices
A Terminal that has a relationship with a customer (enabled or not) must have a default office for that customer that they communicate with
There are some default settings that are applied to all transactions between a Customer and a Terminal, these are set up on a Terminal-Customer Relationship level
I think my objects here are pretty clear, Terminal, Customer, Office, and TerminalCustomerRelationship (since there is information being stored specifically about the relationship such as whether it is enabled, default office, ad default settings). Through multiple refactorings I have determined that both Terminal and Customer should probably be aggregate roots. This leaves me with the question of how to design my TerminalCustomerRelationship object to relate the two.
I suppose I could make the traversal from Terminal to TerminalCustomerRelationship unidirectional toward the relationship but I don't know how to break the relation from the relationship to the customer, especially since it needs to contain a reference to an Office which has a relationship to a Customer.
I'm new to this stuff and while most of DDD makes perfect sense I'm getting confused and need a fresh outlook. Can someone give me their opinion on how to deal with this situation?
Please note that I say Relationship not relation. In my current view it deserves to be an object in the same way that a Marriage would be an object in an application for a wedding chapel. Its most visible purpose is that it relates two objects, but it has other properties that rightfully belong to it as well.
By your description, you definitely need a "TerminalCustomerRelationship" entity to track the associated information. I would also convert the 'IsEnabled' flag into a first class 'Event' entity with a timestamp - this gives you the ability to save a history of the state changes (a more realistic view of what's happening in the domain.)
Here's a sample application (in VS2008) that refects your problem. You can tweak/test the code until the relationships make sense. Run "bin/debug/TerminalSampleApp.exe" and right-click "Terminal->Create Example" to get started.
Let me know if you find it useful.
Names can often clarify an object's responsibilities and bring a domain model into focus.
I am unclear what a Site is and that makes the entire model confusing, which makes it difficult for me to offer better advice. If a Site were a Vendor, for instance, then it would be easy to rename SiteCustomerRelationship as a Contract. In that context it makes perfect sense for Contract to be its own entity, and have the the model look like Vendor-Contract-Customer-Office.
There are other ways to look at this as well. Udi has a decent post on this sort of many-to-many relationship here.
You should not have a object Like SiteCustomerRelationship, its DB specific.
If its truly DDD you should have a Relation like:
Aggregate<Site> Customer.Site
IEnumerable<Aggregate<Office>> Customer.Offices
and perhaps
Aggregate<Office> Customer.DefaultOffice