Verifying equivalence of a secret - cryptography

Alice & Bob are both secret quadruple agents who could be working for the US, Russia, or China. They want to come up with a scheme that would:
1. If they are both working for the same side, prove this to each other so they can talk freely.
2. If they are working for different sides, not expose any additional information about which side they are on.
Oh, and because of the sensitive nature of what they do, there is no trusted third party who can do the comparison for both of them.
What protocol would be able to satisfy both of these needs?
Ideally, any protocol would also be able to generalize to multiple participants and multiple states, but that's not essential.
I've puzzled over it for a while and I can't find a satisfactory solution, mainly owing to condition 2.
edit: Here's the original problem that motivated me to look for a solution. "Charlie" had some personal photos that he shared with me and I later discovered that he had also shared them with "Bob". We both wanted to know if we had the same set of photos but, at the same time, if Charlie hadn't shared a certain photo with either of us, he probably had a good reason not to and we didn't want to leak information.
My first thought would be for each of us to concatenate all the photos and provide the MD5 sum. If they matched, then we had the same photos but if they didn't, neither party would know which photos the other had. However, I realized soon after that this scheme would still leak information because Bob could generate an MD5 for each subset of photos he had and if any of them matched my sum, he would know which photos I didn't have. I've yet to find a satisfactory solution to this particular problem but I thought I would generalize it to avoid people focusing on the particulars of my situation.

For both problems, you could use a secure two-party computation equality algorithm. There are many schemes, for example the one by Damgård, Fitzi, Kiltz, Nielsen and Toft: Unconditionally Secure Constant-Rounds Multi-Party Computation for Equality, Comparison, Bits and Exponentiation.
Of course, an agent could try to pose as an agent from another side to get a 1/3 chance of discovering the true side of another agent, but that seems unavoidable.
A much simpler scheme for the photo problem, which should be almost as good as the secure multiparty computation, is the following (a sketch in code follows the list):
1. Alice and Bob each sort their pictures and generate a SHA-512 hash of the result.
2. Alice sends the first bit of her hash to Bob.
3. Bob compares that bit to the first bit of his hash. If it is different, they know that they have received different photos. Otherwise they continue.
4. Bob sends the second bit of his hash to Alice.
5. Alice checks this bit and decides whether to continue.
6. Continue until the protocol aborts or all bits have been checked.
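A minimal sketch of that exchange, collapsed into a single process for illustration (in a real run the two parties alternate bit reveals over a channel; the function name and inputs are made up):

import hashlib

def same_collection(alice_blob: bytes, bob_blob: bytes) -> bool:
    # Each party hashes their sorted, concatenated photos.
    a = hashlib.sha512(alice_blob).digest()
    b = hashlib.sha512(bob_blob).digest()
    a_bits = "".join(f"{byte:08b}" for byte in a)
    b_bits = "".join(f"{byte:08b}" for byte in b)
    # Alternating one-bit reveals: abort on the first mismatch, so a
    # mismatch leaks only as many bits as were already exchanged.
    for x, y in zip(a_bits, b_bits):
        if x != y:
            return False
    return True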

So they are guaranteed to be quadruple agents? That is, they are guaranteed to be secretly working for one faction while pretending to work for a second, while pretending to work for a third, while pretending to work for a fourth? And they are limited to just the US, Russia, or China? If so, then there will always be at least one faction they are both pretending to work for and are simultaneously actually working for. That seems to negate their ability to be quadruple agents, because surely one of them can't be working for the Americans while secretly working for the Americans, while secretly working for the Americans, while secretly working for the Americans.
You say that the ideal solution would generalize to arbitrary numbers of states and spy-stacks. Can the degree of secret-agent-ness be higher than, equal to, or lower than the number of states? This might be important. Also, is Alice always guaranteed to have the same degree of agent-ness as Bob? i.e., will they ALWAYS both be triple agents, or ALWAYS both be quintuple agents? The modulo operator springs to mind...
More details please.
As a potential answer, you can enumerate the states into a bitfield: US=1, Russia=2, China=4, Madagascar=8, Tuva=16, etc. Construct a device that is essentially an AND gate. Alice builds and brings one half and Bob builds and brings the other. Separated by a cloth, they each press the button of the state they're really working for. If the output of the AND gate is high, then they're on the same side. If not, then they quietly take down the cloth and depart with the respective halves of their machine, so that the pressed button can't be determined by fingerprint.
This is not theoretical or rigorous, but practical.
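In software terms, the device computes nothing more than a bitwise AND; a toy rendering of the logic (note that the privacy comes entirely from the physical construction, since in code each input is visible to whoever runs it):

US, RUSSIA, CHINA, MADAGASCAR, TUVA = 1, 2, 4, 8, 16

alice_button = CHINA                  # pressed in secret behind the cloth
bob_button = CHINA                    # pressed in secret behind the cloth
same_side = (alice_button & bob_button) != 0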

For your photos problem: create hashes for all subsets of your photos, randomly select a subset of these hashes, and shuffle in an agreed quantity of randomly generated decoy hash values. Bob does the same, and you exchange these sets. If the proportion of hashes in what Bob sent you that match ones you can generate by hashing subsets of your own photos differs significantly from what you expect, it is likely you have a significantly different corpus of photos from his. If the proportion of random decoy hashes you agree on is high, you risk being unable to detect small differences in your collections of photos; if the proportion is low, you risk exposing information about missing photos. You will have to select a suitable point for the tradeoff.
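A minimal sketch of the generation step, with made-up subset and decoy counts (and assuming the collection is large enough to yield that many distinct subsets):

import hashlib
import os
import random

def decoy_hash_set(photos, n_subsets=50, n_decoys=50):
    # Hash some random subsets of the collection, then mix in purely
    # random values so a non-match does not pinpoint missing photos.
    real = set()
    while len(real) < n_subsets:
        k = random.randint(1, len(photos))
        subset = sorted(random.sample(photos, k))
        real.add(hashlib.sha256(b"".join(subset)).hexdigest())
    decoys = {os.urandom(32).hex() for _ in range(n_decoys)}
    return real | decoys

Each side then checks what proportion of the received values match hashes it can generate from subsets of its own photos.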

Interesting.
I think, no matter what the scheme, it'll need to involve a component of random failure, because of the conflicting requirements. You would need a scheme that occasionally doesn't work even when they are on the same side, because if it always worked, a failure would immediately tell them they aren't on the same side.
Your second condition is also vague. You say you don't want to expose which side they are on. Does that mean the info can't point to one specific side? Is it okay if Alice thinks Bob is from either one of the others?
Also, have you tried emailing this to the cryptography mailing list? You may get a better response there. It's an interesting one to think about :)

Here's the closest I've come to a solution:
Assume there is a function doubleHash such that
doubleHash(a+doubleHash(b)) == doubleHash(b+doubleHash(a))
Alice generates a 62-bit secret, appends the 2-bit country code to the end of it, hashes the result, and gives Bob doubleHash(a).
Bob does the same thing and gives Alice doubleHash(b).
Alice appends the hash that Bob gave her to her original secret, hashes that, and publishes it as doubleHash(a+doubleHash(b)).
Bob does the same thing and publishes doubleHash(b+doubleHash(a)).
If both hashes match, then they are from the same country. On the other hand, if they don't match, then Bob can't decipher the hash because he doesn't know Alice's secret, and vice versa.
However, such a scheme relies on the existence of a doubleHash function and I'm not sure if such a thing is possible.
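For what it's worth, a function with this commutative flavor does exist if "+" is read as "combine" rather than literal string concatenation: modular exponentiation commutes, which is the same trick behind Diffie-Hellman. A toy sketch (illustrative parameters and secrets, not a vetted group):

# doubleHash(x) is g^x mod p; "appending" becomes exponentiation.
p = 2**255 - 19                       # a large prime (toy choice)
g = 2

def commitment(secret: int) -> int:   # plays the role of doubleHash(x)
    return pow(g, secret, p)

def double_hash(secret: int, other: int) -> int:  # doubleHash(x, commit(y))
    return pow(other, secret, p)

a = 0x2AF39C01DEADBEE                 # made-up secrets with the
b = 0x71B455D2CAFEF00                 # country code in the low bits
assert double_hash(a, commitment(b)) == double_hash(b, commitment(a))

Note, though, that with this construction the two published values always match, for any pair of secrets; making equality depend on the country codes would need more structure than the scheme above provides.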

The simplest thing I can think of for the photos that might work is this:
1. Hash all the photos with a 4096-bit hash.
2. Sort the photos by hash value (hashes are, after all, just a string representation of a large number).
3. Using that sort order, use a streaming system to pipe those photos through a single hash, as if they were one file.
4. Share your hashes.
If the hashes match, you have the same files. (There is a small risk of a false positive match, but with a 4096-bit hash it's a bit unlikely.)
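A minimal sketch of that pipeline, using SHA-512 since common libraries don't ship a 4096-bit hash (the structure is the same):

import hashlib

def collection_fingerprint(paths):
    # Hash each file, sort the files by their hash value, then hash the
    # file contents in that order as one continuous stream.
    per_file = {}
    for path in paths:
        with open(path, "rb") as f:
            per_file[path] = hashlib.sha512(f.read()).hexdigest()
    combined = hashlib.sha512()
    for path in sorted(paths, key=per_file.get):
        with open(path, "rb") as f:
            combined.update(f.read())
    return combined.hexdigest()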
There are, of course, a few weaknesses here:
1. Don't share how many photos you have. Doing so could permit the party with the greater number of photos to permute the data intelligently: remove photos from the hash set that they suspect you don't have, using the number as a guide, and find (at great computational expense, mind) a set of images that matches your hash.
2. They can do 1 without the number, but it's harder, and they're out of luck if they actually have fewer photos.
3. They could create a fake hash, simply with a random number generator, and send it to you, giving you the impression you had different datasets when you really had the same.
The above weaknesses are also present in your country-code identification system, except of course there you have far less entropy to get in the way, and it's far easier to defraud the system (and thus far, far, far easier to work out who they are by sheer brute force, or to have yourself worked out by brute force, regardless of how fancy your hash algorithm is). If this were not the case, you would have already been found out by the very agencies you work for, because something that reliable and secure would be a surefire way to do a secure background check.

The Photo Scenario is Impossible to Achieve:
Your scheme is impossible for the reasons that you name.
Consider a function f, which takes two sets of photos, s1 and s2.
f(s1, s2) returns true if s1=s2 and false if s1!=s2.
That is, this function implements the scheme you want.
Bob can always supply a subset of the photos he has, and learn which photos Charlie doesn't have.
There is no way around this: any function which has the property you want cannot have the security you want.
The Spy Scenario is Even More Impossible:
As Kent Fredric pointed out, the spy scenario has even greater inherent weaknesses.
It has all the problems of the photo scenario, plus the additional weakness of having only three possible secrets.
In the photo scenario it would be highly unlikely that Bob would randomly guess one of Charlie's photographs.
In the spy scenario it is trivial for Bob to guess Alice's choice (1/3).
The spies only have three countries they can belong to, and as they are both quadruple agents they both know all the secret code words for each country.
Thus, Bob could pretend to be working for the Chinese to test Alice.
A Different Type of Solution:
As some posters have noted, the security can be increased if you weaken the accuracy of f.
Of course, if it is not accurate, what is the point? I propose a different type of solution:
do not let them compare the same photographs more than once.
The party which wishes to initiate a comparison must first show that it is a new comparison and does not use any of the pictures from before.
EDIT: Problems with Double Hash
I am making some assumptions about the doubleHash protocol, but...
For the photograph scheme, the doubleHash protocol is no better than f, because the 62-bit secret must be constructed from a set of photographs for the comparison to be meaningful. The subset attack mentioned in the original question still applies here: try all subsets of photographs to brute-force the secrets they can generate, and Bob can check whether he can generate the same secret as Alice.
Using the doubleHash property, Bob can still brute-force the secret by comparing against Alice's published value:
doubleHash(s1+doubleHash(b)) != doubleHash(aliceSecret+doubleHash(b))
doubleHash(s2+doubleHash(b)) != doubleHash(aliceSecret+doubleHash(b))
doubleHash(s3+doubleHash(b)) == doubleHash(aliceSecret+doubleHash(b))
Bingo: aliceSecret == s3.
DoubleHash is only as strong as it is hard to brute-force either a or b.
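Spelled out as a sketch (double_hash, the commitment value, and all names here are stand-ins for whatever instantiation the parties agreed on):

from itertools import combinations

def recover_secret(bobs_photos, bob_commitment, alice_published, double_hash):
    # Try every subset of Bob's photos as a candidate for Alice's
    # photo-derived secret, comparing against her published value.
    for r in range(len(bobs_photos) + 1):
        for subset in combinations(sorted(bobs_photos), r):
            candidate = b"".join(subset)
            if double_hash(candidate, bob_commitment) == alice_published:
                return subset          # Bingo: aliceSecret recovered
    return None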
Implementing DoubleHash
Instead of doubleHash(a + doubleHash(b)), try doubleHash(a, md5(b)).
doubleHash(a + doubleHash(b)) is bad because Bob could generate colliding hashes like so:
doubleHash((12 + doubleHash(34)) + doubleHash(5678))
= doubleHash((34 + doubleHash(12)) + doubleHash(5678))
= doubleHash(5678 + doubleHash(12 + doubleHash(34)))
= doubleHash(5678 + doubleHash(34 + doubleHash(12)))
Here is an implementation of doubleHash using the new formulation:

import hashlib

def double_hash(a: bytes, hash_of_b: bytes) -> bytes:
    # XOR makes the order of the two parties irrelevant, so
    # double_hash(a, md5(b)) == double_hash(b, md5(a)).
    hash_of_a = hashlib.md5(a).digest()
    combined = bytes(x ^ y for x, y in zip(hash_of_a, hash_of_b))
    return hashlib.md5(combined).digest()
One could also use the math behind blind signatures to implement a version of doubleHash.

Wouldn't RSA work here? Each nation knows its private key; you publish your public key, and only nations that are the same can decrypt the info. I guess the second person would know that the first isn't on the same side as they are, however.
Hmm.

How about Public Key Cryptography?


Length extension attack doubts

So I've been studying the concept of length extension attacks, and there are a few things I noticed during my study that are not very clear to me.
1. Research papers explain how you can append some type of data to the end and forge a hash for the newly formed data. For example:
Desired New Data: count=10&lat=37.351&user_id=1&long=-119.827&waffle=eggo&waffle=liege
(notice the 2 waffles). My question is: if a parser function on the server side can track duplicate attributes, couldn't that make the entire length extension attack pointless? The server would notice the duplicate attributes. Is a proper parser that checks for duplicates a good defense against length extension attacks? I'm aware of the HMAC approach and other protections, but I'm asking specifically about parsers here.
2. Research says that only H(key|message) is vulnerable. They claim that H(message|key) won't work for the attacker because we would have to append a new key (which we obviously don't know). My question is: why would we have to append a new key? We don't do that when we attack H(key|message). Why can't we rely on the fact that we will pass the verification test (we would create the correct hash), and that if the parser tries to extract the key, it would take the only key in the block we send and resume from there? Why would we have to send 2 keys? Why doesn't the attack against H(message|key) work?
My question is: if a parser function on the server side can track duplicate attributes, couldn't that make the entire length extension attack pointless?
You are talking about a well-written parser. Writing software is hard, and writing correct software is very hard.
In that example, you have seen an overwritten attribute. Can you say whether a good parser must take the last occurrence or the first one? What is the rule? There are systems where the last one must be taken! So the attack may or may not apply; it depends on the system. If you consider that knowledge of the length extension attack goes back to the 1990s, finding a place where it still applies should amaze someone. And yet it was applied in the wild against the Flickr API in 2009, after almost 20 years:
Flickr's API Signature Forgery, by Thai Duong and Juliano Rizzo, published on Sep. 28, 2009.
My question is: why would we have to append a new key? We don't do that when we attack H(key|message). Why can't we rely on the fact that we will pass the verification test (we would create the correct hash), and that if the parser tries to extract the key, it would take the only key in the block we send and resume from there? Why would we have to send 2 keys? Why doesn't the attack against H(message|key) work?
The attack is a signature forgery: the key is not known to the attacker, but they can still forge new signatures. The new message and signature (the extended hash) are sent to the server; the server then prepends the key to the message and recomputes the hash as its canonical verification. If the result matches, the signature is considered valid.
The parser doesn't extract the key; it already knows the key. The point is whether you can tell that the data has really been extended or not. The padding rule is simple: append a 1 bit, then fill with zeroes so that the last 64 (or 128) bits encode the message length (very simplified; for SHA-256, the final padded length must be a multiple of 512 bits). To see that there is another padding inside, you would have to check every block, and only then could you claim there is an extension attack. Yes, you can do this; however, one of the aims of cryptography is to reduce dependencies, too. If we can create a better signature scheme that eliminates such checking, we prefer it over the others. This enables software developers to write secure implementations more easily.
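For concreteness, here is a small sketch of the padding an attacker reproduces when splicing an extension into a SHA-256-style hash (simplified to byte-aligned messages):

import struct

def sha256_padding(message_len: int) -> bytes:
    # 0x80, then zeroes, then the message length in bits as a 64-bit
    # big-endian integer, so the total is a multiple of 64 bytes.
    zero_count = (55 - message_len) % 64
    return b"\x80" + b"\x00" * zero_count + struct.pack(">Q", message_len * 8)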
Why doesn't the attack against H(message|key) work?
Simple: you take the extended message message|extended and send the extended hash H(message|key|extended) to the server. The server then takes the message message|extended, appends the key to get message|extended|key, and hashes it as H(message|extended|key), which is clearly not equal to the extended hash H(message|key|extended).
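The mismatch is easy to see with ordinary hash calls (illustrative values, and the internal padding is ignored for clarity):

import hashlib

key = b"server-secret"                # known only to the server
msg = b"count=10&user_id=1"
ext = b"&waffle=liege"

# What extending the original H(message|key) tag would correspond to:
forged = hashlib.sha256(msg + key + ext).hexdigest()
# What the server actually computes for the extended message:
checked = hashlib.sha256(msg + ext + key).hexdigest()
assert forged != checked              # the key is no longer at the end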
Note that the truncated versions of the SHA-2 series, like SHA-512/256, are resistant to length extension attacks. SHA-3 is immune to it by design, and that enables the simple KMAC signature scheme. BLAKE2 is also immune, since it is designed with the HAIFA construction.

Query String vs Resource Path for Filtering Criteria

Background
I have 2 resources: courses and professors.
A course has the following attributes:
id
topic
semester_id
year
section
professor_id
A professor has the following attributes:
id
faculty
super_user
first_name
last_name
So, you can say that a course has one professor and a professor may have many courses.
If I want to get all courses or all professors I can: GET /api/courses or GET /api/professors respectively.
Quandary
My quandary comes when I want to get all courses that a certain professor teaches.
I could use either of the following:
GET /api/professors/:prof_id/courses
GET /api/courses?professor_id=:prof_id
I'm not sure which to use though.
Current solution
Currently, I'm using an augmented form of the latter. My reasoning is that it is more scalable if I want to add filtering/sorting criteria.
I'm actually encoding/embedding JSON strings into the query parameters. So, a (decoded) example might be:
GET /api/courses?where={professor_id: "teacher45", year: 2016}&order={attr: "topic", sort: "asc"}
The request above would retrieve all courses that were (or are currently being) taught by the professor with the provided professor_id in the year 2016, sorted according to topic title in ascending ASCII order.
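For concreteness, building that query string in code looks something like this (Python purely for illustration):

import json
from urllib.parse import urlencode

# The "where" and "order" values are JSON strings, URL-encoded as a whole.
params = {
    "where": json.dumps({"professor_id": "teacher45", "year": 2016}),
    "order": json.dumps({"attr": "topic", "sort": "asc"}),
}
url = "/api/courses?" + urlencode(params)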
I've never seen anyone do it this way though, so I wonder if I'm doing something stupid.
Closing Questions
Is there a standard practice for using the query string vs the resource path for filtering criteria? What have some larger APIs done in the past? Is it acceptable, or even encouraged, to use both paradigms at the same time (make both endpoints available)? If I should indeed be using the second paradigm, is there a better organization method I could use besides encoding JSON? Has anyone seen another public API using JSON in their query strings?
Edited to be less opinion based. (See comments)
As already explained in a previous comment, REST doesn't care much about the actual form of the link that identifies a unique resource unless either the RESTful constraints or the hypertext transfer protocol (HTTP) itself is violated.
Whether to use query or path (or even matrix) parameters is completely up to you. There is no fixed rule for when to use what, just individual preferences.
I like to use query parameters especially when the value is optional and not required, as plenty of frameworks like JAX-RS allow default values to be defined for them. Query parameters are often said to prevent caching of responses, which is more an urban legend than the truth, though certain implementations might still exclude URIs containing query strings from their caches.
If the parameter defines something like a specific flavor property (e.g. car color), I prefer to put it into a matrix parameter. Matrix parameters can also appear in the middle of the URI, e.g. /api/professors;hair=grey/courses could return all courses held by professors whose hair color is grey.
Path parameters, as I understand them, are compulsory arguments that the application requires to fulfill the request; otherwise the respective method handler will not be invoked on the service side in the first place. Usually these are resource identifiers like table-row IDs or UUIDs assigned to a specific entity.
In regard to depicting relationships, I usually start with the 1 side of a 1:n relationship. If I face an m:n relationship, like in your case with professors and courses, I usually start with the entity that may exist without the other more easily. A professor is still a professor even if he does not hold any lectures (in a specific term), while a course won't be a course without a professor, so I'd put professors before courses; though in regard to REST, courses are fine as top-level resources nonetheless.
I therefore would change your query
GET /api/courses?where={professor_id: "teacher45", year: 2016}&order={attr: "topic", sort: "asc"}
to something like:
GET /api/professors/teacher45/courses;year=2016?sort=asc&onField=topic
I changed the semantics of your fields slightly, as the year property is probably better suited to the courses resource rather than the professors resource: the professor is already reduced to a single resource via the professor's id, while the courses should be limited to only those held in 2016. As the sorting is optional and may have a default value, it is a perfect candidate for the query parameter section; the field to sort on is related to the sorting itself and therefore also belongs in the query parameters. I've put the year into a matrix parameter, as it is a property of the course itself, like the color of a car or the year the car was manufactured.
But as explained previously, this is rather opinionated and may not match your or other folks' perspectives.
I could use either of the following:
GET /api/professors/:prof_id/courses
GET /api/courses?professor_id=:prof_id
You could. Here are some things to consider:
Machines (in particular, REST clients) should treat the URI as an opaque thing; about the closest they ever come to considering its value is during resolution.
But human beings, staring at a log of HTTP traffic, do not treat the URI opaquely; we are actually trying to figure out the context of what is going on. Staying out of the way of the poor bastard who is trying to track down a bug is a good property for a URI design to have.
It's also a useful property for your URI design to be guessable. A URI designed from a few simple consistent principles will be a lot easier to work with than one which is arbitrary.
There is a great overview of path segment vs query over at Programmers
https://softwareengineering.stackexchange.com/questions/270898/designing-a-rest-api-by-uri-vs-query-string/285724#285724
Of course, if you have two different URI, that both "follow the rules", then the rules aren't much help in making a choice.
Supporting multiple identifiers is a valid option. It's completely reasonable that there can be more than one way to obtain a specific representation. For instance, these resources
/questions/38470258/answers/first
/questions/38470258/answers/accepted
/questions/38470258/answers/top
could all return representations of the same "answer".
On the other hand, choice adds complexity. It may or may not be a good idea to offer your clients more than one way to do a thing. "Don't make me think!"
On the other other hand, an API with a bunch of "general" principles that carry a bunch of arbitrary exceptions is not nearly as easy to use as one with consistent principles and some duplication (citation needed).
The notion of a "canonical" URI, which is important in SEO, has an analog in the API world. Mark Seemann has an article about self links that covers the basics.
You may also want to consider which methods a resource supports, and whether or not the design suggests those affordances. For example, POST to modify a collection is a commonly understood idiom. So if your URI looks like a collection
POST /api/professors/:prof_id/courses
Then clients are more likely to make the association between the resource and its supported methods.
POST /api/courses?professor_id=:prof_id
There's nothing "wrong" with this, but it isn't nearly so common a convention.
GET /api/courses?where={professor_id: "teacher45", year: 2016}&order={attr: "topic", sort: "asc"}
I've never seen anyone do it this way though, so I wonder if I'm doing something stupid.
I haven't either, but syntactically it looks a little bit like GraphQL. I don't see any reason why you couldn't represent a query that way. It would make more sense to me as a single query description rather than broken into multiple parts. And of course it would need to be URL-encoded, etc.
But I would not go crazy with that unless you really need to give your clients that sort of flexibility. There are simpler designs (see Roman's answer).

Protege SWRL rules

I've been trying to define rules in my ontology to infer that if a person has friends who are all friends amongst each other, then all of them are friends as a group; but if one or more of them are not friends with each other, then my ontology should infer that they are not all friends.
Thank you
You probably need to get your intended semantics straightened out a little bit more.
From what I gather, you want isFriendWith at least to be symmetric, i.e. when isFriendWith(bob, alice) then also isFriendWith(alice, bob).
Also, if you want friendsAll to have any meaning, isFriendWith cannot be transitive. This also captures the natural meaning, as a friend of my friend is not necessarily my friend.
To elaborate: if isFriendWith were symmetric and transitive, every friend of bob would automatically also be a friend of all of bob's friends (because isFriendWith(bob, alice) implies isFriendWith(alice, bob), and from there, for any isFriendWith(bob, carol), transitivity implies isFriendWith(alice, carol)). So if isFriendWith is symmetric and transitive, you get the clique automatically.
But, as stated, this is probably not what you want.
As for formulating this in SWRL, let's give it a try, shall we?
friendsAll is most likely reflexive, i.e. let's just assume everybody is his/her own friend. Now we need a recursive rule that extends this set while still fulfilling the condition: "In this set, everybody is everybody's friend".
To include bob's friends, you would need to be able to quantify over isFriendWith and check that any candidate friend of bob is also a friend of all other friends of bob. Since you cannot nest quantifiers in SWRL, I'm more or less sure you cannot express that algorithm in the rule language alone. However, I may be wrong here and there may be a neat little trick hidden inside the semantics. It is not one that I know of, though, and the need for quantifier nesting in the direct formulation leaves me believing that it is not possible.
It basically boils down to a well-known graph-theoretic problem: given a starting point bob, friendsAll is the largest set of bob's friends such that everybody in the group is friends with everybody else, i.e. a maximal clique containing bob.
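Outside of SWRL, computing that set is straightforward; a quick sketch with networkx (names illustrative):

import networkx as nx

# Friendship as an undirected graph gives symmetry for free.
g = nx.Graph()
g.add_edges_from([("bob", "alice"), ("bob", "carol"),
                  ("alice", "carol"), ("bob", "dave")])

# friendsAll(bob): the largest maximal clique containing bob.
cliques = [c for c in nx.find_cliques(g) if "bob" in c]
friends_all = max(cliques, key=len)   # contains bob, alice, carol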

Password strength check: comparing to previous passwords

Every now and then I come across applications that force you to change passwords once in a while. Almost universally, they have this strange requirement for the new password: it has to be "significantly" different from your previous password(s).
While at first this sounds logical, the next thing I think is: how do they do that? Do they store my passwords in plain text? I would have accepted that they do, if it weren't for the fact that these are the kinds of applications that pretend to care about security so much that they force you to change your password when it expires! Microsoft Exchange is one example of this.
I'm not very good at cryptography and hash functions, so my question is this: Is it possible to enforce this kind of policy without storing passwords in plain text?
Do you know how this policy is implemented in real world applications?
UPDATE: An Example.
I was recently changing my Microsoft Exchange password. I only use Web Access, so it might be a little different -- I have no idea.
So, it forces me to change my password. What I sometimes do is change it to something new and then change it back almost immediately. The freaky part is that it did not allow me to even change it back, because of this policy. I tried changing it a little, by adding a letter in front of it or changing one symbol -- no luck, it kept complaining.
With a typical hash, the best you can do is check whether the new password is exactly equal to previous ones. You can break the password into multiple hashes in order to get more flexibility in the comparison, for example 3 hashes:
Alpha characters only
Numeric characters only
All other characters
You could, for example, require all three hashes to change for the new password to be accepted, to prevent users from just changing their password from SecretPassword01 to SecretPassword02.
A cryptographic expert may weigh in here on whether this could be made as secure as a single hash.
NOTE that this is not as secure as a single hash over the whole password, so before you go implementing this, make sure you have really done your research.
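For concreteness, a sketch of the per-character-class hashing described above (illustrative only; a real system would use a slow, salted KDF such as bcrypt or Argon2 rather than bare SHA-256):

import hashlib

def class_hashes(password: str, salt: bytes) -> list:
    # One hash per character class, so "how much changed" can be judged
    # class by class without storing the password itself.
    alpha = "".join(c for c in password if c.isalpha())
    digits = "".join(c for c in password if c.isdigit())
    other = "".join(c for c in password if not c.isalnum())
    return [hashlib.sha256(salt + part.encode()).hexdigest()
            for part in (alpha, digits, other)]

# SecretPassword01 -> SecretPassword02 changes only the digits hash, so a
# policy requiring all three hashes to change would reject it.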
When changing a password you're usually asked for the old one to confirm your identity. It's then trivial to compare the old one with the new one to see how much they differ. To be honest, I don't know how to compare against several previous passwords without storing them, but that's getting into the territory of ridiculous policies anyway.
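That change-time comparison needs nothing fancy, since both plaintexts are briefly in hand; for instance (the threshold is an arbitrary policy knob):

from difflib import SequenceMatcher

def too_similar(old: str, new: str, threshold: float = 0.8) -> bool:
    # Both plaintexts are available only during the change request.
    return SequenceMatcher(None, old, new).ratio() >= threshold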

Double salt for hashing passwords?

I'm thinking of hashing user passwords with two different salt strings, one stored in the code which is the same for all users and another stored in the database for which each user has their own unique value.
Would this be more effective than simply storing the values in the database?
Any advice or opinions appreciated.
Thanks
The effect is minuscule, if anything at all. Consider that a static, hard-coded salt can be viewed as nothing more than an alteration to the hashing algorithm: it happens exactly the same way every time, so it may as well be considered part of the algorithm.
But the purpose of the salt is to create some randomness, similar in effect to extending the (minimum) strength of the password, in order to make offline cracking (including rainbow tables) more resource-intensive: non-rainbow-table cracking will require more CPU time, and rainbow tables would have to cover all salts for all strings.
The only way you'd get any value from this is while the static salt is unknown -- the equivalent of the algorithm being unknown. If your binary or your source is available to the attacker, then reverse engineering will reveal the algorithm and the hard-coded salt.
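For reference, the scheme under discussion looks roughly like this (a sketch only; in practice use a slow KDF such as bcrypt, scrypt, or Argon2 instead of bare SHA-256):

import hashlib
import os

PEPPER = b"hard-coded-app-wide-salt"   # the static "salt" in question

def hash_password(password: str, salt: bytes = None):
    salt = salt or os.urandom(16)      # per-user random salt
    digest = hashlib.sha256(PEPPER + salt + password.encode()).hexdigest()
    return salt, digest                # store both in the database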
And if this issue goes public, you will probably have to deal with flak from many security enthusiasts who believe that anything not perfect is completely broken, even though your product already does the right thing and the additional step is merely useless.
And, of course, you'll have to deal with the maintenance issues of having a static salt: backwards compatibility and bug fixes around the hashing code can be a pain.
The very small benefit of static keys (or salts) is simply not worth the cost. Always make keys and salts dynamic.