How does SHA-3 work with mappings in Solidity?

I'm having some trouble understanding how mappings work in Solidity. As it was explained to me, the value passed into the mapping is hashed using SHA-3. Once it's hashed, the new hashed value is used as the location where the value is stored. I'm confused because I don't understand how SHA-3 can produce a real location in memory. Doesn't SHA-3 produce random values of a fixed length? How could the hashed value possibly be used to tell the computer where a value is in memory if SHA-3 is supposed to produce arbitrary values? Thanks.

the value passed into the mapping is hashed using SHA-3.
The value is not hashed. However, the storage slot location is determined by a hash of the mapping's declared position within the contract and a few other factors, such as the key and its datatype.
See the docs for specifics. You might be interested in the section starting
The value corresponding to a mapping key k is located at keccak256(h(k) . p)
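For a value-type key, h left-pads the key to 32 bytes and p is the slot at which the mapping was declared. As a rough sketch in Python (assuming the pycryptodome Keccak implementation, since Solidity's keccak256 is Keccak-256 rather than NIST SHA3-256), the slot for a hypothetical myMapping[42] with the mapping declared at slot 0 would be:

from Crypto.Hash import keccak

def keccak256(data: bytes) -> bytes:
    h = keccak.new(digest_bits=256)
    h.update(data)
    return h.digest()

def mapping_value_slot(key: int, p: int) -> int:
    # keccak256(h(k) . p): the key and the mapping's own slot number are
    # each left-padded to 32 bytes, concatenated, then hashed
    encoded = key.to_bytes(32, "big") + p.to_bytes(32, "big")
    return int.from_bytes(keccak256(encoded), "big")

print(hex(mapping_value_slot(42, 0)))   # the storage slot holding myMapping[42]

The result is a 256-bit number, and contract storage is a sparse address space of 2**256 slots, so any 32-byte hash is a valid slot index; it does not need to point into a small, contiguous region of memory.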

Related

Why do Solidity proxies subtract one from the storage position's hash?

I'm learning the proxy pattern in Solidity, and I have noticed that in the ERC1967 contract the storage slot offsets are always decreased by one: https://github.com/OpenZeppelin/openzeppelin-contracts/blob/master/contracts/proxy/ERC1967/ERC1967Upgrade.sol#L18-L19
What is the purpose of this subtraction?
Assigning the value to a storage slot whose ID is calculated from the hash of a string (and not from the hash of a number, or even just a slot ID <some low number>) decreases the risk of a clash with another storage variable.
And subtracting 1 from the hash decreases the probability even further, as there is no known input that would result in this value if it were hashed.
See the Rationale section of ERC-1967:
Furthermore, a -1 offset is added so the preimage of the hash cannot be known, further reducing the chances of a possible attack.
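For illustration, a small Python sketch (assuming the pycryptodome Keccak implementation) that reproduces how the implementation slot constant is derived:

from Crypto.Hash import keccak

h = keccak.new(digest_bits=256)
h.update(b"eip1967.proxy.implementation")      # hash the human-readable label...
slot = int.from_bytes(h.digest(), "big") - 1   # ...then subtract 1 so the slot has no known preimage

print(hex(slot))
# 0x360894a13ba1a3210667c828492db98dca3e2076cc3735a920a3ca505d382bbc,
# which should match the _IMPLEMENTATION_SLOT constant in ERC1967Upgrade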

Fixed length data structures in redis

I need to match tens of thousands of 4-byte strings with one or more boolean values each. I don't mind using up a whole word for the booleans if it means faster retrieval. However, I have such tight constraints on my data that I imagine there are some, albeit minor, optimizations that can be made if these constraints are reported to the storage engine in advance. Does Redis have any way to take advantage of this?
Here is a sample of my data:
"DENL": false
"NLES": false
"NLUS": true
"USNL": true
"AEGB": true
"ITAE": true
"ITFR": false
The keys are the concatenation of two ISO 3166-1 alpha-2 codes. As such, they are guaranteed to be 4 uppercase English letters.
The data structures I have considered using are:
Hashes to map the 4 byte keys to a string representing the booleans
A separate set for each boolean value
And since my data only contains uppercase English letters, there are only 456976 possible combinations of those (which comes out to about 56 KB per bitmap, at one bit per possible key):
One or more strings that are accessed with bitwise operations (GETBIT, BITFIELD) using a function to convert the key string to a bit index.
I think that sets are probably the most elegant solution and a binary string over all possible combinations would be the most efficient. I would like to know whether there is some kind of middle ground, like a set with fixed-length strings as members. I would expect a datatype optimized for fixed-length strings to be able to provide faster searching than one optimized for variable-length strings.
It is slightly better to use the 4-letter country-code-combination as a simple key, with an empty value.
The set data type is really a hash map where the keys are the elements, added to the hash map with a NULL value. I wouldn't use a set, as this implies two hashes and two lookups into a hash map: the first for the set key in the database and the second, internal to the set, for the element.
Use the existence of the key to mean either "needs a customs declaration" or "does not need a customs declaration", as Tomasz says.
Using simple keys allows you to use the SET command with NX/XX conditions, which may be handy in your logic:
NX -- Only set the key if it does not already exist.
XX -- Only set the key if it already exists.
Use EXISTS command instead of GET as it is slightly faster (no type checking, no value fetching).
Another advantage of simple keys vs sets is to get the value of multiple keys at once using MGET:
> MGET DENL NLES NLUS
1) ""
2) ""
3) (nil)
To be able to do complex queries, assuming these are rare and not optimized for performance, you can use SSCAN (if you go with sets) or KEYS (if you go with simple keys). However, if you go with simple keys, you had better use a dedicated database; see SELECT.
To query for those with NL on the left side you would use:
KEYS NL??
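Putting the simple-key approach together, here is a rough sketch with the Python client redis-py (the key names come from the question; the local connection is an assumption):

import redis

r = redis.Redis()                        # assumes a locally running Redis

# NX: only set the key if it does not already exist
r.set("DENL", "", nx=True)
r.set("NLES", "", nx=True)

# EXISTS is enough to answer "is this pair flagged?"
print(r.exists("DENL"))                  # 1 if present, 0 otherwise

# fetch several flags at once
print(r.mget(["DENL", "NLES", "NLUS"]))  # [b'', b'', None]

# rare, non-performance-critical queries
print(r.keys("NL??"))                    # all keys with NL on the left side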
There are a couple of optimizations you could try:
use a set and treat all values as either "needs customs declaration" or "does not need a customs declaration" - depending on which one has fewer values; then with SISMEMBER you can check whether your key is in that set, which gives you the correct answer,
have a look at the introduction to Redis data types, chapter "Bitmaps" - if you pre-define all of your keys in some array, you can use the SETBIT and GETBIT operations to store the flag "needs customs declaration" for a given bit number (the index in the array); a rough sketch of this follows below.
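A rough sketch of the bitmap idea from the last bullet, again with redis-py; the base-26 index function and the key name customs:needed are illustrative assumptions:

import redis

r = redis.Redis()

def bit_index(code: str) -> int:
    # map a 4-letter uppercase code to a number in 0..456975 (26**4 - 1)
    idx = 0
    for ch in code:
        idx = idx * 26 + (ord(ch) - ord("A"))
    return idx

# one bitmap of roughly 56 KB covers every possible 4-letter combination
r.setbit("customs:needed", bit_index("USNL"), 1)
print(r.getbit("customs:needed", bit_index("USNL")))  # 1
print(r.getbit("customs:needed", bit_index("DENL")))  # 0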

One-way cryptographic hash functions

Assume you can use a strong one-way cryptographic hash function that takes a string of variable length as an input, producing a fixed length output that is unique for each input string. It is computationally infeasible to reverse this process.
If you use a document as the input of the hashing function, what can you do with the output hash? This question has more than one answer.
What are the correct choices?
a: Compare whether you have the same document as someone else without comparing the documents themselves
b: Send the hash to another party for them to decrypt the hash to retrieve the document
c: Create a table where anyone who has the hash can find the document
d: Compress the document content so it can be extracted later
Since this is not an answer for a current question:
a: is a correct answer. "Compare whether you have the same document as someone else without comparing the documents themselves."
This is a common use of cryptographically secure hashes; Git is one use case.
c: is a potential answer, as a hash can be used in this manner. "Create a table where anyone who has the hash can find the document."
The other answers are all incorrect because the original cannot be recovered from the hash.
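To illustrate answer a, a small Python example (SHA-256 is an arbitrary choice of hash function here):

import hashlib

def fingerprint(document: bytes) -> str:
    # fixed-length digest that identifies the document's exact contents
    return hashlib.sha256(document).hexdigest()

mine = fingerprint(b"contract text, version 3")
theirs = fingerprint(b"contract text, version 3")
print(mine == theirs)   # True: the digests match, so the documents are identical

Both parties exchange only the short digests, never the documents themselves.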

ROR - Generate an alpha-numeric string for a DB ID

In our DB, every Person has an ID, which is the DB generated, auto-incremented integer. Now, we want to generate a more user-friendly alpha-numeric ID which can be publicly exposed. Something like the Passport number. We obviously don't want to expose the DB ID to the users. For the purpose of this question, I will call what we need to generate, the UID.
Note: The UID is not meant to replace the DB ID. You can think of the UID as a prettier version of the DB ID, which we can give out to the users.
I was wondering if this UID can be a function of the DB ID. That is, we should be able to re-generate the same UID for a given DB ID.
Obviously, the function will take a 'salt' or key, in addition to the DB ID.
The UID should not be sequential. That is, two neighboring DB IDs should generate visually different-looking UIDs.
It is not strictly required for the UID to be irreversible. That is, it is okay if somebody studies the UID for a few days and is able to reverse-engineer and find the DB ID. I don't think it will do us any harm.
The UID should contain only A-Z (uppercase only) and 0-9. Nothing else. And it should not contain characters which can be confused with other letters or digits, like 0 and O, or l and 1, and so on. I guess Crockford's Base32 encoding takes care of this.
The UID should be of a fixed length (10 characters), regardless of the size of the DB ID. We could pad the UID with some constant string, to bring it to the required fixed length. The DB ID could grow to any size. So, the algorithm should not have any such input limitations.
I think the way to go about this is:
Step 1: Hashing.
I have read about the following hash functions:
SHA-1
MD5
Jenkin's
The hash returns a long string. I read here about something called XOR folding to bring the string down to a shorter length, but I couldn't find much info about that.
Step 2: Encoding.
I read about the following encoding methods:
Crockford Base 32 Encoding
Z-Base32
Base36
I am guessing that the output of the encoding will be the UID string that I am looking for (see the sketch after step 3 below).
Step 3: Working around collisions.
To work around collisions, I was wondering if I could generate a random key at the time of UID generation and use this random key in the function.
I can store this random key in a column, so that we know what key was used to generate that particular UID.
Before inserting a newly generated UID into the table, I would check for uniqueness and, if the check fails, generate a new random key and use it to produce a new UID. This step can be repeated until a unique UID is found for a particular DB ID.
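To make this concrete, here is a rough Python sketch of what steps 1-3 could look like; SHA-1, the 50-bit XOR fold, Crockford's alphabet and the salt format are all illustrative assumptions, not settled decisions:

import hashlib

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"    # no I, L, O or U

def uid_for(db_id, salt):
    digest = hashlib.sha1("{}:{}".format(salt, db_id).encode()).digest()   # 160-bit hash
    value = int.from_bytes(digest, "big")
    folded = 0
    while value:                                   # XOR-fold 160 bits down to 50 bits
        folded ^= value & ((1 << 50) - 1)
        value >>= 50
    chars = []
    for _ in range(10):                            # 10 characters x 5 bits = 50 bits
        chars.append(CROCKFORD[folded & 0b11111])
        folded >>= 5
    return "".join(reversed(chars))

print(uid_for(123456, "some-random-key"))          # fixed-length, non-sequential UID

A collision would be handled as in step 3: store the salt alongside the UID and retry with a new salt until the UID is unique.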
I would love to get some expert advice on whether I am going along the correct lines and how I go about actually implementing this.
I am going to be implementing this in a Ruby On Rails app. So, please take that into consideration in your suggestions.
Thanks.
Update
The comments and answer made me re-think and question one of the requirements I had: the need for us to be able to regenerate the UID for a user after assigning it once. I guess I was just trying to be safe for the case where we lose a user's UID; if it is a function of an existing property of the user, we would be able to get it back. But we can get around that problem just by using backups, I guess.
So, if I remove that requirement, the UID then essentially becomes a totally random 10-character alphanumeric string. I am adding an answer containing my proposed plan of implementation. If somebody else comes up with a better plan, I'll mark that as the answer.
As I mentioned in the update to the question, I think what we are going to do is:
Pre-generate a sufficiently large number of random and unique ten character alphanumeric strings. No hashing or encoding.
Store them in a table in a random order.
When creating a user, pick the first of these strings and assign it to the user.
Delete this picked ID from the pool of IDs after assigning it to a user.
When the pool reduces to a low number, replenish the pool with new strings, with uniqueness checks, obviously. This can be done in a Delayed Job, initiated by an observer.
The reason for pre-generating is that we are offloading all the expensive uniqueness checking to a one-time pre-generation operation.
When picking an ID from this pool for a new user, uniqueness is guaranteed. So, the operation of creating a user (which is very frequent) becomes fast.
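A minimal sketch of the generation step, shown in Python for brevity (the Rails version would persist the pool in a table and pop rows inside a transaction, which is an assumption about the setup rather than settled code):

import secrets

ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"     # Crockford-style: no I, L, O or U

def generate_pool(size):
    pool = set()                                   # the set guarantees uniqueness within the batch
    while len(pool) < size:
        pool.add("".join(secrets.choice(ALPHABET) for _ in range(10)))
    return list(pool)

pool = generate_pool(10000)
new_user_uid = pool.pop()                          # hand one out when a user is created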
Would db_id.chr work for you? It would take the integers and generate a character string from them. You could then append their initials or last name or whatever to it. Example:
user = {:id => 123456, :f_name => "Scott", :l_name => "Shea"}
(user[:id].to_s.split(//).map {|x| (x.to_i + 64).chr}).join.downcase + user[:l_name].downcase
# => "abcdefshea"

How do I get Average field length and Document length in Lucene?

I am trying to implement the BM25F scoring system on Lucene. I need to make a few minor changes to the original implementation given here for my needs, but I got lost at the part where the author gets the average field length and the document length. Could someone guide me as to how or where I can get them?
You can get field length from TermVector instances associated with documents' fields, but that will increase your index size. This is probably the way to go unless you cannot afford a larger index. Of course you will still need to calculate the average yourself, and store it elsewhere (or perhaps in a special document with a well-known external id that you just update when the statistics change).
If you can store the data outside of the index, one thing you can do is count the tokens when documents are tokenized, and store the counts for averaging. If your document collection is static, just dump the values for each field into a file & process after indexing. If the index needs to get updated with additions only, you can store the number of documents and the average length per field, and recompute the average. If documents are going to be removed, and you need an accurate count, you will need to re-parse the document being removed to know how many terms each field contained, or get the length from the TermVector if you are using that.
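For the "store the data outside of the index" route, a tiny sketch of the bookkeeping in plain Python (the field names are made up) that keeps a running document count and per-field token totals so the average field length can be recomputed at any time:

from collections import defaultdict

doc_count = 0
token_totals = defaultdict(int)            # field name -> total number of tokens seen

def record_document(field_lengths):
    # call once per indexed document with {field: token count}
    global doc_count
    doc_count += 1
    for field, length in field_lengths.items():
        token_totals[field] += length

def average_field_length(field):
    return token_totals[field] / doc_count if doc_count else 0.0

record_document({"title": 7, "body": 412})
record_document({"title": 5, "body": 298})
print(average_field_length("body"))        # 355.0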