With the reconcile service I often come across this problem: the top candidate isn't the correct one; the correct one is the second or the third candidate (and it also has a better score), like this:
How can I select the correct one in bulk? I've got thousands of records, and I'm stumbling across lots of cases like this. I think there should be some way that doesn't involve doing it one by one.
For instance, something that says "take the candidate with the best score, no matter what its position is".
Edit: as pintoch says, it could be a bug. In the meantime it's possible to create two numeric facets, one with cell.recon.candidates[1].score and the other with cell.recon.candidates[2].score. Playing with them, it's possible to select the scores of the second and the third candidates to make sure you get the candidate with the best score. Then it has to be reconciled one by one, but it's just a question of clicking.
I would say that this behaviour is a bug in the first place: the candidates should be sorted by decreasing score. The reconciliation service API does not specify that services should return their candidates in any particular order, but that is probably unintended.
The quickest solution would be to contact the person running the reconciliation service that you are using and ask them to sort the candidates by decreasing score on their side.
This also suggests improvements in OpenRefine itself: OpenRefine could always sort the results of a reconciliation service by decreasing score. I have opened a ticket about this.
More broadly, I agree that the current ways to match candidates based on specific criteria could be improved (but this might require redesigning important parts of the reconciliation system, which will take time).
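To illustrate the "take the best score regardless of position" idea outside OpenRefine, here is a minimal Python sketch. The candidate dictionaries with id, name and score keys are an assumption modelled on what a reconciliation service typically returns, and the sample values are made up.

```python
from typing import Optional

def best_candidate(candidates: list[dict]) -> Optional[dict]:
    """Return the candidate with the highest score, wherever it sits in the list."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: c["score"])

def sort_by_score(candidates: list[dict]) -> list[dict]:
    """Return the candidates ordered by decreasing score."""
    return sorted(candidates, key=lambda c: c["score"], reverse=True)

# Hypothetical response in which the first candidate is not the best one.
candidates = [
    {"id": "Q1", "name": "first", "score": 71.0},
    {"id": "Q2", "name": "second", "score": 98.5},
    {"id": "Q3", "name": "third", "score": 64.2},
]
print(best_candidate(candidates)["id"])
print([c["id"] for c in sort_by_score(candidates)])
```

A service (or a client) that applies something like sort_by_score before returning results would make the first candidate the best-scoring one again.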
Background
I have 2 resources: courses and professors.
A course has the following attributes:
id
topic
semester_id
year
section
professor_id
A professor has the following attributes:
id
faculty
super_user
first_name
last_name
So, you can say that a course has one professor and a professor may have many courses.
If I want to get all courses or all professors I can: GET /api/courses or GET /api/professors respectively.
Quandary
My quandary comes when I want to get all courses that a certain professor teaches.
I could use either of the following:
GET /api/professors/:prof_id/courses
GET /api/courses?professor_id=:prof_id
I'm not sure which to use though.
Current solution
Currently, I'm using an augmented form of the latter. My reasoning is that it is more scalable if I want to add in filtering/sorting criteria.
I'm actually encoding/embedding JSON strings into the query parameters. So, a (decoded) example might be:
GET /api/courses?where={professor_id: "teacher45", year: 2016}&order={attr: "topic", sort: "asc"}
The request above would retrieve all courses that were (or are currently being) taught by the professor with the provided professor_id in the year 2016, sorted according to topic title in ascending ASCII order.
I've never seen anyone do it this way though, so I wonder if I'm doing something stupid.
Closing Questions
Is there a standard practice for using the query string vs the resource path for filtering criteria? What have some larger APIs done in the past? Is it acceptable, or encouraged, to use both paradigms at the same time (make both endpoints available)? If I should indeed be using the second paradigm, is there a better organization method I could use besides encoding JSON? Has anyone seen another public API using JSON in their query strings?
Edited to be less opinion based. (See comments)
As already explained in a previous comment, REST doesn't care much about the actual form of the link that identifies a unique resource unless either the RESTful constraints or the hypertext transfer protocol (HTTP) itself is violated.
Whether to use query or path (or even matrix) parameters is completely up to you. There is no fixed rule for when to use which, just individual preferences.
I like to use query parameters especially when the value is optional, as plenty of frameworks, e.g. JAX-RS, allow you to define default values for them. Query parameters are often said to prevent caching of responses, which is more urban legend than truth, though certain implementations might still refrain from caching responses for a URI containing a query string.
If the parameter defines something like a specific flavor property (e.g. car color) I prefer to put it into a matrix parameter. Matrix parameters can also appear in the middle of the URI, e.g. /api/professors;hair=grey/courses could return all courses which are held by professors whose hair color is grey.
Path parameters are, as I understand them, compulsory arguments that the application requires to fulfill the request; otherwise the respective method handler will not be invoked on the service side in the first place. Usually these are resource identifiers like table-row IDs or UUIDs assigned to a specific entity.
In regards to depicting relationships I usually start with the 1 part of a 1:n relationship. If I face an m:n relationship, like in your case with professors and courses, I usually start with the entity that may exist without the other more easily. A professor is still a professor even though he does not hold any lectures (in a specific term). As a course won't be a course if no professor is available, I'd put professors before courses, though in regards to REST courses are fine top-level resources nonetheless.
I therefore would change your query
GET /api/courses?where={professor_id: "teacher45", year: 2016}&order={attr: "topic", sort: "asc"}
to something like:
GET /api/professors/teacher45/courses;year=2016?sort=asc&onField=topic
I changed the semantics of your fields slightly, as the year property is probably better suited to the courses than to the professors resource, since the professor is already reduced to a single resource via the professor's ID. The courses, however, should be limited to only those that were held in 2016. As the sorting is rather optional and may have a default value specified, this is a perfect candidate for me to put into the query parameter section. The field to sort on is related to the sorting itself and therefore also belongs to the query parameters. I've put the year into a matrix parameter as this is a certain property of the course itself, like the color of a car or the year the car was manufactured.
But as already explained previously, this is rather opinionated and may not match your or other folks' perspectives.
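For readers unfamiliar with matrix parameters, here is a rough Python sketch of how a server might pull the three kinds of parameters apart. The parse_uri helper is hypothetical and written for this illustration only; frameworks such as JAX-RS do this for you.

```python
from urllib.parse import urlsplit, parse_qsl

def parse_uri(uri: str) -> dict:
    """Split a URI into path segments, per-segment matrix parameters and query parameters."""
    parts = urlsplit(uri)
    segments = []
    for raw in parts.path.strip("/").split("/"):
        # Everything after the first ';' in a segment is a matrix parameter.
        name, *pairs = raw.split(";")
        matrix = dict(pair.split("=", 1) for pair in pairs if "=" in pair)
        segments.append({"segment": name, "matrix": matrix})
    return {"segments": segments, "query": dict(parse_qsl(parts.query))}

parsed = parse_uri("/api/professors/teacher45/courses;year=2016?sort=asc&onField=topic")
print(parsed["segments"][-1])
print(parsed["query"])
```

The year ends up attached to the courses segment, while sort and onField stay in the query part, which mirrors the split described above.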
I could use either of the following:
GET /api/professors/:prof_id/courses
GET /api/courses?professor_id=:prof_id
You could. Here are some things to consider:
Machines (in particular, REST clients) should be treating the URI as an opaque thing; about the closest they ever come to considering its value is during resolution.
But human beings, staring at a log of HTTP traffic, do not treat the URI opaquely -- we are actually trying to figure out the context of what is going on. Staying out of the way of the poor bastard that is trying to track down a bug is a good property for a URI design to have.
It's also a useful property for your URI design to be guessable. A URI designed from a few simple consistent principles will be a lot easier to work with than one which is arbitrary.
There is a great overview of path segment vs query over at Programmers
https://softwareengineering.stackexchange.com/questions/270898/designing-a-rest-api-by-uri-vs-query-string/285724#285724
Of course, if you have two different URI, that both "follow the rules", then the rules aren't much help in making a choice.
Supporting multiple identifiers is a valid option. It's completely reasonable that there can be more than one way to obtain a specific representation. For instance, these resources
/questions/38470258/answers/first
/questions/38470258/answers/accepted
/questions/38470258/answers/top
could all return representations of the same "answer".
On the /other hand, choice adds complexity. It may or may not be a good idea to offer your clients more than one way to do a thing. "Don't make me think!"
On the /other/other hand, an api with a bunch of "general" principles that carry with them a bunch of arbitrary exceptions is not nearly as easy to use as one with consistent principles and some duplication (citation needed).
The notion of a "canonical" URI, which is important in SEO, has an analog in the API world. Mark Seemann has an article about self links that covers the basics.
You may also want to consider which methods a resource supports, and whether or not the design suggests those affordances. For example, POST to modify a collection is a commonly understood idiom. So if your URI looks like a collection
POST /api/professors/:prof_id/courses
Then clients are more likely to make the association between the resource and its supported methods.
POST /api/courses?professor_id=:prof_id
There's nothing "wrong" with this, but it isn't nearly so common a convention.
GET /api/courses?where={professor_id: "teacher45", year: 2016}&order={attr: "topic", sort: "asc"}
I've never seen anyone do it this way though, so I wonder if I'm doing something stupid.
I haven't either, but syntactically it looks a little bit like GraphQL. I don't see any reason why you couldn't represent a query that way. It would make more sense to me as a single query description, rather than breaking it into multiple parts. And of course it would need to be URL encoded, etc.
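As a sketch of what the URL encoding step could look like, here is a small Python round trip using only the standard library. The helper names are made up for this example. Note that strict JSON needs quoted keys, unlike the decoded example in the question.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs

def build_course_query(where: dict, order: dict) -> str:
    """Serialise each criterion as JSON, then percent-encode it into the query string."""
    params = {
        "where": json.dumps(where, separators=(",", ":")),
        "order": json.dumps(order, separators=(",", ":")),
    }
    return "/api/courses?" + urlencode(params)

def parse_course_query(url: str) -> tuple[dict, dict]:
    """Reverse of build_course_query: decode the query string, then parse the JSON."""
    query = parse_qs(urlsplit(url).query)
    return json.loads(query["where"][0]), json.loads(query["order"][0])

url = build_course_query(
    {"professor_id": "teacher45", "year": 2016},
    {"attr": "topic", "sort": "asc"},
)
print(url)
print(parse_course_query(url))
```

The encoded URL is considerably harder to read in a log than the decoded form, which is one practical cost of this design.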
But I would not want to go crazy with that unless you really need to give your clients that sort of flexibility. There are simpler designs (see Roman's answer).
I've been trying to define rules in my ontology to infer that if a person has friends who are friends amongst each other then all are friends, but if one or more of them are not friends with each other then my ontology will infer that they are not all friends.
Thank you
You probably need to get your intended semantics straightened out a little bit more.
From what I gather, you want isFriendWith at least to be symmetric, i.e. when isFriendWith(bob, alice) then also isFriendWith(alice, bob).
Also, if you want to have friendsAll to have any meaning, isFriendWith cannot be transitive. This would also capture the natural meaning, as a friend of my friend is not necessarily my friend.
To elaborate: if isFriendWith were symmetric and transitive, every friend of bob would automatically also be a friend of all of bob's friends, because isFriendWith(bob, alice) implies isFriendWith(alice, bob). From there on, with any isFriendWith(bob, carol), transitivity implies isFriendWith(alice, carol). So if isFriendWith is symmetric and transitive, you get the clique automatically.
But as stated, this is probably not what you want.
As for formulating this in SWRL, let's give it a try, shall we?
friendsAll is most likely reflexive, i.e. let's just assume everybody is his/her own friend. Now, we need a recursive rule that extends this set while still fulfilling the condition: "In this set, everybody is everybody's friend".
To include bob's friends, you would need to be able to quantify over isFriendWith and check that any candidate friend of bob is also a friend of all other friends of bob. Since you cannot nest quantifiers in SWRL, I'm more or less sure you cannot express that algorithm in the rule language alone. However, I may be wrong here, and there may be a neat little trick hidden inside the semantics. It is not one that I know of, however, and the need for quantifier nesting in the direct formulation leaves me believing that it is not possible.
It basically boils down to a well-known graph-theoretic problem: given a starting point bob, friendsAll is the largest subset of bob's friends such that everybody in the group is friends with everyone else, i.e. bob's Maximal Clique.
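Outside the ontology, the clique condition is easy to state procedurally. The following Python sketch is only an illustration of the graph problem, not a SWRL solution; the function names and the sample friendships are invented, and the search is brute force (exponential in the number of friends).

```python
from itertools import combinations

def is_clique(people: set[str], friends: set[frozenset[str]]) -> bool:
    """True if every pair in `people` is connected by a (symmetric) friendship."""
    return all(frozenset(pair) in friends for pair in combinations(sorted(people), 2))

def largest_clique_with(person: str, friends: set[frozenset[str]]) -> set[str]:
    """Brute-force the largest group containing `person` in which everyone is friends."""
    neighbours = sorted({p for pair in friends if person in pair for p in pair} - {person})
    for size in range(len(neighbours), 0, -1):
        for group in combinations(neighbours, size):
            candidate = {person, *group}
            if is_clique(candidate, friends):
                return candidate
    return {person}

# Storing each friendship as a frozenset makes isFriendWith symmetric by construction.
friends = {
    frozenset(pair)
    for pair in [("bob", "alice"), ("bob", "carol"), ("alice", "carol"), ("bob", "dave")]
}
print(sorted(largest_clique_with("bob", friends)))
```

Here dave is bob's friend but not alice's or carol's, so he is left out of the group.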
I need to store information for staff. Each database instance is per parent company with multiple outlets underneath. Some of the staff that work at an outlet can potentially also work at other outlets, however as each outlet is for the most part autonomous, each outlet will not want other outlets to see their staff list.
I wanted to create unique staff instances and just relate them to the outlets that employ them, keeping their details uniform across the database. However, my colleague wishes to allow each outlet to create their own staff members. The consequence of this approach is that John Smith might be a staff member at outlet A, Jonathan Smith at outlet B and J Smith at outlet C (as each outlet could enter pretty much whatever they want). Also, each staff member has a set of skills and services associated with them, which also will not be uniform between outlets.
Will this approach cause problems down the line? At the outlet level it probably won't make any difference, but I am concerned that if the parent group asks for reports, the results may be misleading, as perhaps 5 staff members might be returned which in reality are the same person but with different details.
What you are describing, if I understand you correctly, is choosing between the prospect of denormalizing a person record per outlet (giving each outlet a completely independent copy of John Smith) vs. defining a single John Smith and then defining related tables that could belong to the outlet level of access.
If you have a choice (the freedom to design the system either way) the normalized way with only 1 John Smith + auxiliary tables with outlet-specific details when necessary is the correct way. I hesitate to say 'correct', but in the absence of very large numbers of users I would say denormalization here would only lead to avoidable integrity errors.
If you choose to denormalize now and not relate John Smith at outlet A to John Smith at outlet B even though they are in reality the same person you are both opening the door to illogical data (updates to one John and not both) and losing the ability to do simple things like count the distinct number of people in the database.
Failing to identify unique people now will prevent you from being able to properly relate other entities to a person in the future. This will complicate your queries at the very least and give rise to logically incorrect but technically correct information in many cases.
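A minimal sketch of the normalized layout, using Python's built-in sqlite3 so it can be run as is. The table and column names (staff, outlet, outlet_staff, display_name) are assumptions made for this example and not taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE staff  (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL);
CREATE TABLE outlet (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- Outlet-specific details live on the link row, not on the person.
CREATE TABLE outlet_staff (
    outlet_id    INTEGER NOT NULL REFERENCES outlet(id),
    staff_id     INTEGER NOT NULL REFERENCES staff(id),
    display_name TEXT,
    PRIMARY KEY (outlet_id, staff_id)
);
""")
conn.execute("INSERT INTO staff VALUES (1, 'John Smith')")
conn.executemany("INSERT INTO outlet VALUES (?, ?)", [(1, "A"), (2, "B"), (3, "C")])
conn.executemany(
    "INSERT INTO outlet_staff VALUES (?, ?, ?)",
    [(1, 1, "John Smith"), (2, 1, "Jonathan Smith"), (3, 1, "J Smith")],
)

# Group-level report: one real person, however many outlets employ them.
distinct_people = conn.execute(
    "SELECT COUNT(DISTINCT staff_id) FROM outlet_staff"
).fetchone()[0]
# Outlet-level view: each outlet only ever queries its own rows.
outlet_b = conn.execute(
    "SELECT display_name FROM outlet_staff WHERE outlet_id = ?", (2,)
).fetchall()
print(distinct_people, outlet_b)
```

Each outlet still sees only its own rows and its own spelling of the name, while the parent group can count distinct people.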
In addition to the helpful answers (that actually answer the question), here is some useful info:
If these people are staff members (i.e. they have jobs), then why not use their social security number/national insurance number as the unique identifier? That's guaranteed to be unique and they are each guaranteed to have one.
EDIT: US Social Security numbers are guaranteed to be unique and are not reused. See Q20 here: http://www.ssa.gov/history/hfaq.html
EDIT: No they're not! D'oh! Thanks to Mitch Wheat for this link: http://www.dailyfinance.com/2010/08/12/your-social-security-number-may-not-be-unique-to-you/ I guess the task is an impossible one then as there is no real way of solving the problem...
If outlets can't know the stafflist of other outlets, and outlets can add staff, there doesn't seem to be any way of preventing 2 different outlets from adding the same person, but their being recorded as different people in your system. So unless there is some central clearinghouse, or a way for an outlet to see if a person is already listed as a staffmember (w/o getting any details about the outlet(s) she is associated with), I think you're stuck.
Alice & Bob are both secret quadruple agents who could be working for the US, Russia or China. They want to come up with a scheme that would:
1. If they are both working for the same side, prove this to each other so they can talk freely.
2. If they are working for different sides, not expose any additional information about which side they are on.
Oh, and because of the sensitive nature of what they do, there is no trusted third party who can do the comparison for both of them.
What protocol would be able to satisfy both of these needs?
Ideally, any protocol would also be able to generalize to multiple participants and multiples states but that's not essential.
I've puzzled over it for a while and I can't find a satisfactory solution, mainly owing to condition 2.
edit: Here's the original problem that motivated me to look for a solution. "Charlie" had some personal photos that he shared with me and I later discovered that he had also shared them with "Bob". We both wanted to know if we had the same set of photos but, at the same time, if Charlie hadn't shared a certain photo with either of us, he probably had a good reason not to and we didn't want to leak information.
My first thought would be for each of us to concatenate all the photos and provide the MD5 sum. If they matched, then we had the same photos but if they didn't, neither party would know which photos the other had. However, I realized soon after that this scheme would still leak information because Bob could generate an MD5 for each subset of photos he had and if any of them matched my sum, he would know which photos I didn't have. I've yet to find a satisfactory solution to this particular problem but I thought I would generalize it to avoid people focusing on the particulars of my situation.
For both problems, you could use a Secure two-party computation equality-algorithm. There are many schemes, for example this by Damgard, Fitzi, Kiltz, Nielsen and Toft: Unconditionally Secure Constant Round Multi-Party Computation for Equality, Comparison, Bits and Exponentiation.
Of course an agent could try to pose as an agent from another side to get a 1/3 chance to discover the true side of another agent, but that seems unavoidable.
A much simpler scheme for the photo-problem, which should be almost as good as the secure multiparty computation, is the following:
Alice and Bob each sort their pictures and generate a SHA-512 hash.
Alice sends the first bit of her hash to Bob.
Bob compares the bit to the first bit of his hash. If it is different, they know that they have received different photos. Otherwise they continue.
Bob sends the second bit of his hash to Alice.
Alice checks this bit and decides whether to continue.
Continue until the protocol aborts or all bits have been checked.
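The steps above can be simulated in a few lines of Python. This is only a local sketch of the exchange, with both hashes available in one process; the helper names and the way photos are concatenated before hashing are assumptions of the example.

```python
import hashlib

def digest_of(photos: list[bytes]) -> bytes:
    """SHA-512 over the sorted photos, so both sides hash the same byte stream."""
    return hashlib.sha512(b"".join(sorted(photos))).digest()

def bit_at(digest: bytes, i: int) -> int:
    """Return bit i of the digest, counting from the most significant bit."""
    return (digest[i // 8] >> (7 - i % 8)) & 1

def compare_bit_by_bit(alice: bytes, bob: bytes) -> tuple[bool, int]:
    """Simulate the alternating exchange; returns (match, number of bits revealed)."""
    total_bits = len(alice) * 8
    for i in range(total_bits):
        # Even rounds: Alice sends her bit and Bob checks it; odd rounds: the reverse.
        if bit_at(alice, i) != bit_at(bob, i):
            return False, i + 1  # the protocol aborts at the first differing bit
    return True, total_bits

same = compare_bit_by_bit(
    digest_of([b"photo-1", b"photo-2"]), digest_of([b"photo-2", b"photo-1"])
)
different = compare_bit_by_bit(
    digest_of([b"photo-1", b"photo-2"]), digest_of([b"photo-1"])
)
print(same)       # (True, 512)
print(different)
```

For unrelated hashes each bit differs with probability of about 1/2, so a mismatch usually ends the exchange after only a couple of bits, which is what limits how much of the hash the other side learns.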
So they are guaranteed to be quadruple agents? That is they are guaranteed to be secretly working for one faction while pretending to work for a second while pretending to work for a third while pretending to work for a fourth? They are limited to just the US, Russia or China? If so then that means that there will always be at least one faction they are both pretending to work for and are simultaneously actually working for. That seems to negate their ability to be quadruple agents, because surely one of them can't be working for the Americans while secretly working for the Americans, while secretly working for the Americans, while secretly working for the Americans.
You say that the ideal solution would generalize to arbitrary numbers of states and spy-stacks. Can the degree of secret agent-ness be either higher, equal or lower than the number of states? This might be important. Also, is Alice always guaranteed to have the same degree of agent-ness as Bob? i.e. They will ALWAYS both be triple agents, or ALWAYS both be quintuple agents? The modulo operator springs to mind...
More details please.
As a potential answer, you can enumerate the states into a bitfield. US=1 Russia=2, China=4, Madagascar=8, Tuva=16 etc. Construct a device that is essentially an AND gate. Alice builds and brings one half and Bob builds and brings the other. Separated by a cloth, they each press the button of the state they're really working for. If the output of the AND gate is high, then they're on the same side. If not, then they quietly take down the cloth, and depart with the respective halves of their machine so that the button can't be determined by fingerprint.
This is not theoretical or rigorous, but practical.
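In software terms, the device computes nothing more than a bitwise AND over the two button choices. A small Python sketch, with the Side enum and same_side function invented for illustration:

```python
from enum import IntFlag

class Side(IntFlag):
    US = 1
    RUSSIA = 2
    CHINA = 4
    MADAGASCAR = 8
    TUVA = 16

def same_side(alice_button: Side, bob_button: Side) -> bool:
    """The only output of the shared device: high when both pressed the same button."""
    return (alice_button & bob_button) != 0

print(same_side(Side.CHINA, Side.CHINA))  # True
print(same_side(Side.US, Side.RUSSIA))    # False
```

Note that this only works if each party presses exactly one button: pressing several at once would make the output high for any of them.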
For your photos problem, create hashes for all subsets of your photos; randomly select a subset of these, and shuffle in an agreed quantity of randomly generated hash values. Bob does the same, and you exchange these sets. If the proportion of hashes in what Bob has sent you that matches ones you can generate by hashing subsets of your photos significantly differs from what you expect, it is likely you have a significantly different corpus of photos from him. If the proportion of random hashes you agree on is high, you risk being unable to detect small differences in your collections of photos; if the proportion is low, you risk exposing information about missing photos; you will have to select a suitable point for the tradeoff.
Interesting.
I think, no matter what the scheme, it'll need to involve a component of random failure. This is because of the conflicting requirements. You would need a scheme that, occasionally, even when they are on the same side, doesn't work. Because if it always worked, they would immediately be able to determine they aren't on the same side.
Your point 'B' is also vague. You say you don't want to expose what side they are on. Does that mean that the info can't point to specifically one of the sides? Is it okay if Alice thinks Bob is from either one of the others?
Also, have you tried emailing this to the cryptography mailing list? May get a better response there. It's an interesting one to think about :)
Here's the closest I've come to a solution:
Assume there is a function doubleHash such that
doubleHash(a+doubleHash(b)) == doubleHash(b+doubleHash(a))
Alice generates a 62 bit secret and appends the 2 bit country code to the end of it, hashes it and gives Bob doubleHash(a).
Bob does the same thing and gives Alice doubleHash(b).
Alice appends the original secret to the hash that Bob gave her, hashes it and publishes it as doubleHash(a+doubleHash(b)).
Bob does the same thing and publishes doubleHash(b+doubleHash(a)).
If both the hashes match, then they are from the same country. On the other hand, if they don't match, then Bob can't decipher the hash because he doesn't know Alice's secret and vice versa.
However, such a scheme relies on the existence of a doubleHash function and I'm not sure if such a thing is possible.
The simplest thing I can think of for the photos that could possibly work is this:
Hash all the photos with a 4096 bit hash.
Sort the photos by hash value. (Hashes are, after all, just a string representation of a large number.)
Using that sort order, use a streaming system to pipe, and hash, those photos as if they were a single file.
Share your hashes.
If the hashes match, you have the same files. (There is a very low risk of a false positive match, but with 4K hashing it's quite unlikely.)
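A rough Python sketch of these steps follows. The answer does not name a 4096-bit hash function, so this example assumes SHAKE-256 with a 512-byte output to get that length; the function name combined_hash is also just for illustration.

```python
import hashlib

DIGEST_BYTES = 512  # 512 bytes = 4096 bits of output

def photo_hash(data: bytes) -> bytes:
    """Per-photo hash used only to put the photos in a canonical order."""
    return hashlib.shake_256(data).digest(DIGEST_BYTES)

def combined_hash(photos: list[bytes]) -> str:
    """Stream the photos, ordered by their individual hashes, through one hash."""
    stream = hashlib.shake_256()
    for data in sorted(photos, key=photo_hash):
        stream.update(data)  # as if the photos were one long file
    return stream.hexdigest(DIGEST_BYTES)

mine = combined_hash([b"beach", b"party", b"dinner"])
yours = combined_hash([b"dinner", b"beach", b"party"])
print(mine == yours)
```

Because the order is derived from the photo contents, both parties arrive at the same stream without having to agree on file names.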
There are of course, a few weaknesses here:
Don't share how many photos you have. Doing so could permit the party with the greater number of photos to do intelligent permutation of the data and remove photos from the hash set that they suspect you don't have, using the number as a guide, and find (at great computational expense, mind) a set of images that matches your hash.
They can do 1 without the number, but it's harder, and they're out of luck if they actually have fewer photos.
They could create a fake hash, simply with a random number generator, and send it to you, giving you the impression you had different datasets when you really had the same.
The above weaknesses are also present in your country code identification system, except of course that you have far less entropy to get in the way, and it's far easier to defraud the system (and thus far, far easier to work out who they are by sheer brute force, or have yourself worked out by brute force, regardless of how fancy your hash algorithm is).
If this were not the case, you would have already been found out by the very agencies you work for, because something that reliable and secure would be a sure fire way to do a secure background check.
The Photo Scenario is Impossible to Achieve:
Your scheme is impossible for the reasons that you name.
Consider a function f, which takes two sets of photos, s1 and s2.
f(s1, s2) returns true if s1=s2 and false if s1!=s2.
That is, this function implements the scheme you want.
Bob can always supply a subset of the photos he has, and learn which photos Charlie doesn't have.
There is no way around this: any function which has the property you want cannot have the security you want.
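To make the subset argument concrete, here is a short Python sketch. It treats f as a perfect black box that can be queried repeatedly; the photo names and the subset_attack helper are invented for this example, and it assumes the other party's set is a subset of Bob's.

```python
from itertools import combinations
from typing import Callable, Optional

def f(s1: frozenset, s2: frozenset) -> bool:
    """The ideal comparison: reveals only whether the two sets are equal."""
    return s1 == s2

def subset_attack(bobs_photos: frozenset, oracle: Callable[[frozenset], bool]) -> Optional[frozenset]:
    """Query the oracle with every subset of Bob's photos until one matches."""
    for size in range(len(bobs_photos), -1, -1):
        for subset in combinations(sorted(bobs_photos), size):
            if oracle(frozenset(subset)):
                return frozenset(subset)
    return None

my_photos = frozenset({"beach.jpg", "party.jpg"})
bobs_photos = frozenset({"beach.jpg", "party.jpg", "secret.jpg"})

# Bob never sees my_photos directly; he only gets yes/no answers from f.
recovered = subset_attack(bobs_photos, lambda guess: f(guess, my_photos))
print(sorted(bobs_photos - recovered))  # ['secret.jpg']
```

The attack depends on being able to ask more than once, and the number of queries grows exponentially with the size of Bob's collection.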
The Spy Scenario is Even More Impossible:
As Kent Fredric pointed out, the spy scenario has even greater inherent weaknesses.
It has all the problems of the photo scenario, plus the additional weakness of having only four secrets.
In the photo scenario it would be highly unlikely that Bob would randomly guess one of Charlie's photographs.
It is trivial in the spy scenario for Bob to guess Alice's choice (1/4).
The spies only have four countries they can belong to, and as they are both quadruple agents they both know all the secret code words for each country.
Thus, Bob could pretend to be working for the Chinese to test Alice.
A Different Type of Solution:
As some posters have noted, the security can be increased if you weaken the accuracy of f.
Of course, if it is not accurate, what is the point? I propose a different type of solution.
Do not let them compare the same photographs more than one time.
The party which wishes to initiate the comparison must first show that this is a new comparison and does not use any of the pictures from before.
EDIT: Problems with Double Hash
I am making some assumptions about the doubleHash protocol, but...
For the photograph scheme, the doubleHash protocol is no better than f, because the 62 bit secret must be constructed from a set of photographs for the comparison to be meaningful. The subset attack mentioned in the original question still applies here. Try all subsets of photographs to brute force the secrets you can generate, and thus Bob can see if he can generate the same secret as Alice.
Using the doublehash property Bob can still brute force the secret.
doubleHash(s1+doubleHash(b)) != doubleHash(aliceSecret+doubleHash(a))
doubleHash(s2+doubleHash(b)) != doubleHash(aliceSecret+doubleHash(a))
doubleHash(s3+doubleHash(b)) == doubleHash(aliceSecret+doubleHash(a))
Bingo, aliceSecret == s3.
DoubleHash is only as strong as it is hard to bruteforce either a or b
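A small Python sketch of this brute force. It assumes the md5/XOR formulation of doubleHash (written double_hash here) and a deliberately tiny 16-bit secret so that the search finishes instantly; both choices, and the sample secrets, are assumptions of the example.

```python
import hashlib
from typing import Optional

def md5(data: bytes) -> bytes:
    return hashlib.md5(data).digest()

def double_hash(secret: bytes, other_hash: bytes) -> bytes:
    """Commutative combiner: md5(md5(secret) XOR other_hash)."""
    own = md5(secret)
    return md5(bytes(x ^ y for x, y in zip(own, other_hash)))

def encode(secret: int) -> bytes:
    return secret.to_bytes(8, "big")

def brute_force(published: bytes, bobs_secret: int, secret_bits: int) -> Optional[int]:
    """Bob tries every possible secret until his result equals what Alice published."""
    bobs_hash = md5(encode(bobs_secret))
    for guess in range(2 ** secret_bits):
        if double_hash(encode(guess), bobs_hash) == published:
            return guess
    return None

alice_secret, bob_secret = 0xBEEF, 0x1234
published_by_alice = double_hash(encode(alice_secret), md5(encode(bob_secret)))
print(hex(brute_force(published_by_alice, bob_secret, secret_bits=16)))
```

With a realistic 62-bit random secret the same loop becomes expensive, but a secret derived from a small set of photographs or country codes is only as large as the number of candidates Bob has to try.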
Implementing doubleHash
Instead of doubleHash(a + doubleHash(b)), try doubleHash(a, md5(b)).
DoubleHash(a + doubleHash(b)) is bad because Bob could generate colliding hashes like so:
doubleHash((12 + doubleHash(34)) + doubleHash(5678))
= doubleHash((34 + doubleHash(12)) + doubleHash(5678))
= doubleHash(5678 + doubleHash(12 + doubleHash(34)))
= doubleHash(5678 + doubleHash(34 + doubleHash(12)))
Here is an implementation of doubleHash using the new formulation,
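import hashlib

def doubleHash(a, hashOfB):
    # a is this party's secret (bytes); hashOfB is the MD5 digest received from the other party.
    hashOfA = hashlib.md5(a).digest()
    # XOR the two 16-byte digests byte by byte; XOR is commutative, so the order of the parties does not matter.
    combinedHash = bytes(x ^ y for x, y in zip(hashOfA, hashOfB))
    return hashlib.md5(combinedHash).digest()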
One could also use the math behind blind signatures to implement a version of doubleHash.
Wouldn't RSA work here? Each nation knows its private key, you publish your public key, and only nations that are the same can decrypt the info. I guess the second person would know that the first isn't on the same side as they are, however.
Hmm.
How about Public Key Cryptography?