Could someone please explain what Random Oracle is all about?
I am trying to get into the nitty gritty of cryptography when I just came across the term "Random Oracle". A loose understanding of the term tells me that its an ideal system that accepts many inputs and gives out a single output, continues to throw the same output every time the same input is fed. Just that, the very first time, the output is randomized. I am still not clear about the understanding of this term and googling it out hasn't helped. Could someone please explain what it is?
Understanding of Random Oracle in Cryptography
Random Oracle based on https://en.bitcoinwiki.org/wiki/Random_oracle
In cryptography, a random oracle is an oracle (a theoretical black box) that responds to every unique query with a (truly) random response chosen uniformly from its output domain. If a query is repeated it responds the same way every time that query is submitted.
Stated differently, a random oracle is a mathematical function chosen uniformly at random, that is, a function mapping each possible query to a (fixed) random response from its output domain.
Related
What is the difference between np.random.seed(0), np.random.seed(42), and np.random.seed(..any number). what is the function of the number in parentheses?
python uses the iterative Mersenne Twister algorithm to generate pseudo-random numbers [1]. The seed is simply where we start iterating.
To be clear, most computers do not have a "true" source of randomness. It is kind of an interesting thing that "randomness" is so valuable to so many applications, and is quite hard to come by (you can buy a specialized device devoted to this purpose). Since it is difficult to make random numbers, but they are nevertheless necessary, many, many, many, many algorithms have been developed to generate numbers that are not random, but nevertheless look as though they are. Algorithms that generate numbers that "look randomish" are called pseudo-random number generators (PRNGs). Since PRNGs are actually deterministic, they can't simply create a number from the aether and have it look randomish. They need an input. It turns out that using some complex operations and modular arithmetic, we can take in an input, and get another number that seems to have little or no relation to the input. Using this intuition, we can simply use the previous output of the PRNG as the next input. We then get a sequence of numbers which, if our PRNG is good, will seem to have no relation to each other.
In order to get our iterative PRNG started, we need an initial input. This initial input is called a "seed". Since the PRNG is deterministic, for a given seed, it will generate an identical sequence of numbers. Usually, there is a default seed that is, itself, sort of randomish. The most common one is the current time. However, the current time isn't a very good random number, so this behavior is known to cause problems sometimes. If you want your program to run in an identical manner each time you run it, you can provide a seed (0 is a popular option, but is entirely arbitrary). Then, you get a sequence of randomish numbers, but if you give your code to someone they can actually entirely recreate the runtime of the program as you witnessed it when you ran it.
That would be the starting key of the generator. Typically if you want to get reproducible results you'll use the same seed over and over again throughout your simulations.
You are setting the seed of the random number generator so you can get reproducible results. Example.
np.random.seed(0)
np.random.randint(0,100,10)
Output:
array([44, 47, 64, 67, 67, 9, 83, 21, 36, 87])
Now, if you ran the same code your computer, you should get the same 10 number output from the random integers from 0 to 100.
I'm working with a lot of name data where the following events are happening:
In one stream the data is submitted as "Sung" and in the other stream "Snug" my initial thought to this was to convert Sung and Snug to where each character equals a number then the sums would be the same, so even if they transverse a character, I'd be able to bucket these appropriately.
The other is where in one stream it comes in as "Lillly" as opposed to "Lilly" in the other stream. I'd like to figure out how to fuzzy match these such that I can identify them. I'm not sure if this is possible in Oracle.
I'm working with many millions of data points and trying to figure out how to write these classification buckets such that I can stop having so much noise in my primary task of finding where people are truly different people as opposed to a clerical error.
Any thoughts would be very appreciated.
A common measure for such distance is called Levenshtein distance (Wikipedia here). This measures the "edit" distance between two strings -- number of edit operations needed to convert one into the other.
That's the good news. More good news is that Oracle even has an implementation in the UTL_MATCH library.
The bad news is that it is really, really expensive on millions of data points. Unfortunately, I cannot help you there so much. One idea is to determine which names are "close enough" because they already share a certain minimum number of characters.
Another method is to convert the strings to what they sound like. That is called soundex. You may be able to use the two together -- assuming your names are predominantly English (the soundex algorithm was invented by the US Census Bureau, so it would work best on names in America).
I'm trying to understand continuous optimization algorithms applied on some test functions.
Here are the results obtaind by some algorithms used for this issue on some of test functions :
enter image description here
I didn't understand the difference between the two underlined phrases. would you please help me in this?
P.S. sometimes they use the term (median number) instead of (mean number ) what's the difference between the two??
This question lacks some context. It would have been better to link to th text too to get a grasp of what is going on.
But i read it as this (and i think that's how someone with some experience in optimization-algorithms would read it; you have to check it though with your knowledge of the context):
The bold 1.0s are the normalized number of function-evaluations on different functions to optimize (each row is a different function)
The values in the brackets are unnormalized numbers explaining the same
When ACO used 820 evaluations (unnormalized), normalized to 1.0, CACO used 8.3 * 820 evaluations
The mean and median are two different measures of central tendency. Check wikipedia out to understand the differences.
Anyone know if MD5, Whirlpool, SHA[n], etc., have any "special" input that might get a hexdigest output to align into:
All numeric characters
All alpha characters
All of the same character/pattern repeated consistently or entirely
Example in python:
>>> from hashlib import sha1
>>> hash = sha1('magic_word').hexdigest()
>>> hash
4040404040404040404040404040404040404040
>>> hash = sha1('^3&#b d *#"').hexdigest()
aedefeebadcdccebefadcedddcbeadaedcbdeadc
Is this even possible? My knowledge of hashing functions is limited to the scope of applying them in databases for storing passwords, which is essentially none.
But sometimes I wonder, when testing for collisions, that these sorts of cases might arise...
A hash function models a random oracle: for each input, if it was not yet queried before, we throw some dice to find an output, then note it to some book. If an input is queried again, simply give back this old value.
By throwing a 16-sided dice 40 times (for each input), we get enough output for an SHA-1 like oracle. (For MD5, we only need 32 times.)
So, we can calculate the probability of "40 times only letters" as (6/16)^40 ≈ 9.15·10^-18, "40 times only digits" has probability (10/16)^40 ≈ 6.8·10^-9.
As "number of tries needed until the first success" is geometrically distributed, we need 1/p tries in average, i.e. around 10^17 tries for "only letters", and 1.5 ·10^8 tries for "only digits".
(Now, SHA-1 is not a real random oracle, but there is no weakness known which would say that SHA-1 would have better or worse probabilities for one of these. And for now, brute-force really seems to be the best way to do this.)
I'm sure with the right input, those sorts of outputs are possible. Why does it matter? Just curious?
Yes, it is possible. Given the right input, any desired bit pattern can be output. It might take a few million years to find the right input though.
For a reasonably wide target, like all hex 0-9 or all hex a-f it should be relatively easy. Calculating the proportion of acceptable outputs, in all possible outputs will help you get an estimate of the running time. Brute force or random searching will eventually find something that hits the target. For a broken hash, like MD4, you might be able to shave something off the expected time.
I found this on an "interview questions" site and have been pondering it for a couple of days. I will keep churning, but am interested what you guys think
"10 Gbytes of 32-bit numbers on a magnetic tape, all there from 0 to 10G in random order. You have 64 32 bit words of memory available: design an algorithm to check that each number from 0 to 10G occurs once and only once on the tape, with minimum passes of the tape by a read head connected to your algorithm."
32-bit numbers can take 4G = 2^32 different values. There are 2.5*2^32 numbers on tape total. So after 2^32 count one of numbers will repeat 100%. If there were <= 2^32 numbers on tape then it was possible that there are two different cases – when all numbers are different or when at least one repeats.
It's a trick question, as Michael Anderson and I have figured out. You can't store 10G 32b numbers on a 10G tape. The interviewer (a) is messing with you and (b) is trying to find out how much you think about a problem before you start solving it.
The utterly naive algorithm, which takes as many passes as there are numbers to check, would be to walk through and verify that the lowest number is there. Then do it again checking that the next lowest is there. And so on.
This requires one word of storage to keep track of where you are - you could cut down the number of passes by a factor of 64 by using all 64 words to keep track of where you're up to in several different locations in the search space - checking all of your current ones on each pass. Still O(n) passes, of course.
You could probably cut it down even more by using portions of the words - given that your search space for each segment is smaller, you won't need to keep track of the full 32-bit range.
Perform an in-place mergesort or quicksort, using tape for storage? Then iterate through the numbers in sequence, tracking to see that each number = previous+1.
Requires cleverly implemented sort, and is fairly slow, but achieves the goal I believe.
Edit: oh bugger, it's never specified you can write.
Here's a second approach: scan through trying to build up to 30-ish ranges of contiginous numbers. IE 1,2,3,4,5 would be one range, 8,9,10,11,12 would be another, etc. If ranges overlap with existing, then they are merged. I think you only need to make a limited number of passes to either get the complete range or prove there are gaps... much less than just scanning through in blocks of a couple thousand to see if all digits are present.
It'll take me a bit to prove or disprove the limits for this though.
Do 2 reduces on the numbers, a sum and a bitwise XOR.
The sum should be (10G + 1) * 10G / 2
The XOR should be ... something
It looks like there is a catch in the question that no one has talked about so far; the interviewer has only asked the interviewee to write a program that CHECKS
(i) if each number that makes up the 10G is present once and only once--- what should the interviewee do if the numbers in the given list are present multple times? should he assume that he should stop execting the programme and throw exception or should he assume that he should correct the mistake by removing the repeating number and replace it with another (this may actually be a costly excercise as this involves complete reshuffle of the number set)? correcting this is required to perform the second step in the question, i.e. to verify that the data is stored in the best possible way that it requires least possible passes.
(ii) When the interviewee was asked to only check if the 10G weight data set of numbers are stored in such a way that they require least paases to access any of those numbers;
what should the interviewee do? should he stop and throw exception the moment he finds an issue in the algorithm they were stored in, or correct the mistake and continue till all the elements are sorted in the order of least possible passes?
If the intension of the interviewer is to ask the interviewee to write an algorithm that finds the best combinaton of numbers that can be stored in 10GB, given 64 32 Bit registers; and also to write an algorithm to save these chosen set of numbers in the best possible way that require least number of passes to access each; he should have asked this directly, woudn't he?
I suppose the intension of the interviewer may be to only see how the interviewee is approaching the problem rather than to actually extract a working solution from the interviewee; wold any buy this notion?
Regards,
Samba