How to encode string to unique long? - kotlin

The server sends alphanumeric IDs for a list of items. At the same time, RecyclerView's getItemId (required for stable IDs) must return a Long. How can I encode a string to a unique Long?

Short answer: you probably can't.  Not unless the IDs are guaranteed to be short.
A Long uses 8 bytes, so it can hold 2⁶⁴ (about 1.8×10¹⁹) different values.  So it could only represent that number of strings.  (A result of the pigeonhole principle.)
However, if the IDs contain only basic ASCII letters (let's assume upper case) and digits — 36 possibilities — and are 13 characters long, then there are 36¹³ (about 1.7×10²⁰) different strings.  That's an order of magnitude more than 2⁶⁴, so some of them will have to map to the same Long value.
(In fact, each Long would map to about 10 IDs on average — and even more if you include strings with fewer characters, and/or a greater range of characters.)
So unless the range of IDs is limited, you'll have to find another approach.
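To make the counting argument concrete, here is a minimal Python sketch (the thread itself is about Kotlin, but the arithmetic is the same). It assumes the IDs use only 0-9 and A-Z and are at most 12 characters long; the names ALPHABET, MAX_LEN, encode_id and decode_id are made up for illustration. Under those assumptions every ID gets its own non-negative value below 2⁶³, so it would fit in a signed Long.

```python
# Assumption: IDs contain only these 36 characters and are at most 12 long.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
MAX_LEN = 12


def encode_id(s):
    """Map a short ID to a unique non-negative integer below 2**63."""
    if len(s) > MAX_LEN:
        raise ValueError("ID too long to fit in 64 bits without collisions")
    value = 0
    for ch in s:
        # Digits run from 1 to 36 in base 37, so "0" and "00" stay distinct.
        value = value * 37 + ALPHABET.index(ch) + 1
    return value


def decode_id(value):
    """Inverse of encode_id."""
    chars = []
    while value > 0:
        value, digit = divmod(value, 37)
        chars.append(ALPHABET[digit - 1])
    return "".join(reversed(chars))


# The pigeonhole check from the answer: 13 characters no longer fit.
print(36 ** 13 > 2 ** 64)
print(encode_id("A1"), decode_id(encode_id("A1")))
```

If the real IDs can be longer, or use more characters, this approach stops working, which is exactly the limit described above.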

Related

Map words to numbers

I am indexing data in my IRE (Information Retrieval and Extraction) course. Instead of storing terms in the index, I am storing a termID, which is a mapping corresponding to the term. If the length of a term is 15, its size would be 15 bytes, i.e. 120 bits, while if I use a termID instead of the term then I can definitely store it in less than 120 bits. One possible way is to maintain a dictionary of (term, termID) pairs where termID runs from 1..n and n is the number of terms. The problems with this method are:
I have to keep this dictionary in the ram and the dictionary size can be in GBs.
To find termID corresponding to a term, it will take O(log(n)) where n is the number of terms in the dictionary.
Can I make some function which takes a term as input and returns the mapping (encryption) in O(1)? It is okay if there are a few collisions (just guessing that a few collisions in exchange for speed and memory is a good trade-off; BTW I don't know how much it will affect my search results).
Is there any other better way to do this?
I think you more or less gave the answer already by saying "it is ok if there are a few collisions". The trick is hashing. You can first reduce the number of distinct characters in your search terms, e.g. drop digits and special characters. Afterwards you can merge upper- and lower-case characters. Finally you could apply some simple replacements, e.g. replacing the German ü by ue (which is actually its origin). After doing so you probably have something like 32 distinct symbols, i.e. 5 bits per character, so six characters fit in a 4-byte (32-bit) word. If you reserve 4 bytes for each word you need to deal with the longer words. There you can basically resort to XORing the 4-byte blocks together.
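A rough Python sketch of that idea, under my own assumptions: only the 26 letters a-z survive normalization (so each fits in 5 bits), only the ü to ue replacement is applied, and words are cut into blocks of six letters that are XORed together. The names normalize and term_hash are hypothetical.

```python
import string

# Assumption: after normalization only a-z remain, coded as 1..26 (5 bits).
SYMBOLS = string.ascii_lowercase
BLOCK_CHARS = 6  # 6 characters * 5 bits = 30 bits, fits in a 32-bit word


def normalize(term):
    """Lower-case, replace ü by ue, drop everything that is not a-z."""
    term = term.lower().replace("ü", "ue")
    return "".join(ch for ch in term if ch in SYMBOLS)


def term_hash(term):
    """Pack each 6-letter block into 30 bits and XOR the blocks together."""
    letters = normalize(term)
    result = 0
    for start in range(0, len(letters), BLOCK_CHARS):
        block = 0
        for ch in letters[start:start + BLOCK_CHARS]:
            block = (block << 5) | (ord(ch) - ord("a") + 1)
        result ^= block
    return result


print(term_hash("abc"))
print(term_hash("Über") == term_hash("ueber"))
```

Terms of up to six letters get distinct values; longer ones can collide (a word made of the same six letters twice hashes to 0, for instance), which is the trade-off the question already accepts.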
An alternative approach would be a hybrid dictionary. If you build a dictionary for only the 10k most frequent words, you most likely already cover most of the text. Hence, you only need to keep part of your dictionary in memory, while for the remaining words you can use a dictionary on the hard disk or maybe even ignore them.

Random string generation using arc4random

I'm trying to create a method that creates a random string consisting of 32 characters. This method will generate a random number using arc4random_uniform(62) to choose a number between 0 and 61 and then choose a character from a string that holds the digits 0 to 9 and the alphabet letters, both small and capital, respectively. (For instance, if arc4random_uniform(62) returns 10, the chosen character will be a; if it returns 61, the chosen character will be Z.) The method will do this 32 times to create the final generated string.
I was wondering when this approach will fail to generate a unique string and result in a repeated one. I searched about this topic and didn't find a satisfying answer. I hope that you will help me with this since I am trying to use this method to generate unique IDs for use in my app.
This method will generate a random number using arc4random_uniform(62) to choose a number between 0 and 61 and then choose a character from a string that holds the digits 0 to 9 and the alphabet letters, both small and capital, respectively.
You could create an array with a string for all the characters you want to include, and randomly pick values. Or, alternatively, you could take advantage of the fact that the ASCII encoding has mostly sequential character positions, so you can fairly easily convert an ASCII number to an NSString.
In ASCII, the integers 48 to 57 are the digits 0-9, 65 to 90 are A-Z and 97 to 122 are a-z: https://en.wikipedia.org/wiki/Ascii_table#ASCII_printable_code_chart
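As an illustration only (in Python rather than Objective-C), the character table can be built from those ASCII ranges and sampled 32 times. Here secrets.choice stands in for arc4random_uniform(62), and ALPHABET and random_id are names I made up.

```python
import secrets

# Build "0-9", then "a-z", then "A-Z" from their ASCII code ranges,
# in the same order the question describes (index 10 is a, index 61 is Z).
ALPHABET = "".join(
    chr(code)
    for codes in (range(48, 58), range(97, 123), range(65, 91))
    for code in codes
)


def random_id(length=32):
    """Pick `length` characters uniformly at random from ALPHABET."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


print(random_id())
```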
I was wondering when this approach will fail to generate a unique String and result in a repeated one. I searched about this topic and didn't find a satisfying answer.
It's often referred to as the "birthday problem". As long as your value is reasonably long (say, 20 characters), it is effectively impossible to have a collision. The world is more likely to be destroyed in the next 2 seconds than your app ever creating a collision.
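For a feel of the numbers, the usual birthday approximation p ≈ 1 − exp(−n(n−1)/(2N)) can be evaluated directly, where n is the number of IDs generated and N the number of possible strings. This is a sketch that assumes the generator is uniform; collision_probability is a hypothetical helper name.

```python
import math


def collision_probability(num_ids, length, alphabet_size=62):
    """Approximate chance that at least two of num_ids random strings match."""
    space = alphabet_size ** length  # number of possible strings, N
    exponent = num_ids * (num_ids - 1) / (2 * space)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-exponent)


# A billion 32-character IDs over a 62-character alphabet.
print(collision_probability(10 ** 9, 32))
```

The exponent is so small in that case that the probability is, for practical purposes, zero.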
I hope that you will help me with this since I am trying to use this method to generate unique IDs for use in my app.
Apple provides an API for generating unique IDs. You should use that instead of inventing your own system:
NSString *id = [NSUUID UUID].UUIDString;
That will give you a value like D19B40AA-322C-4ADF-BEF6-2EC4D4CE7BA8. It conforms to "Version 4" of the UUID standard — according to Wikipedia if you generate 1 billion UUIDs every second for the next 100 years, there is a 50% chance of getting two IDs that are the same.
If the UUID is longer than you want, you could grab a smaller part of the string. Beware that the 4 at the start of the third block means this is a "version 4" UUID and is not a random value. Also, the first character of the fourth block has only four possible values, so avoid or strip off those two characters if you want to grab a smaller part of the string for use as your random ID. See the Wikipedia page on UUIDs for more detail.
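Sketched in Python (the Objective-C version would slice UUIDString in the same way), stripping those two characters before shortening could look like this. random_hex_from_uuid4 is a made-up name, and the positions assume the dashes have already been removed.

```python
import uuid


def random_hex_from_uuid4(length=16):
    """Return up to 30 random hex digits taken from a version 4 UUID."""
    digits = uuid.uuid4().hex  # 32 hex digits, no dashes
    # Index 12 is the fixed version digit "4"; index 16 is the variant
    # digit, which can only be 8, 9, a or b. Skip both.
    random_digits = digits[:12] + digits[13:16] + digits[17:]
    return random_digits[:length]


print(random_hex_from_uuid4())
```

Keep in mind that a shorter ID has a smaller value space, so collisions become more likely than with the full UUID.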

Make unique readable string out of a long integer

I have long integer numbers like this: 5291658276538691055
How could I programmatically convert such a number to a unique combination of only 4-6 capital letters that can also be reversed to get back to the number?
For example using OBJ-C.
There are 26 capital letters;
6 of them could represent 26 ^ 6 numbers (308915776);
So, no. You are trying to map a much larger range of numbers into a much smaller range, it cannot be reversible.
Also, log 5291658276538691055 / log 26 is about 13.2, which is less than 14, so if 14 letters are acceptable, just convert the number to base 26 and map the digits to letters.
And one more thing: if the range of numbers is small enough, you could do some manipulation on the numbers (e.g., just subtract the minimum) and encode the result, which will cost you fewer digits.
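A minimal sketch of that base-26 conversion, written in Python for brevity (the question asks for Objective-C, where the same loop applies). It assumes A stands for digit 0 through Z for digit 25; to_letters and from_letters are hypothetical names.

```python
import string

LETTERS = string.ascii_uppercase  # assumption: A = 0, B = 1, ..., Z = 25


def to_letters(number):
    """Write a non-negative integer in base 26 using capital letters."""
    if number < 0:
        raise ValueError("only non-negative numbers are supported")
    if number == 0:
        return LETTERS[0]
    digits = []
    while number > 0:
        number, digit = divmod(number, 26)
        digits.append(LETTERS[digit])
    return "".join(reversed(digits))


def from_letters(text):
    """Inverse of to_letters."""
    number = 0
    for ch in text:
        number = number * 26 + LETTERS.index(ch)
    return number


encoded = to_letters(5291658276538691055)
print(len(encoded), from_letters(encoded))
```

With this digit assignment leading A's act like leading zeros, so the round trip is exact as long as you decode what to_letters produced.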
You will need to convert the numbers to Base 26 (Hexavigesimal - snappy name!)
The Wikipedia article on Hexavigesimal gives example code in Java - you should be able to adapt this pretty easily.
NB: You cannot get the long number you mentioned down to 4-6 capital letters only using a conversion algorithm (your example in Base 26 is BCKSATKEBRYBXJ). If you need conversion that short, you only have two options:
Lookup tables (store mappings, e.g. 5291658276538691055 = ABCDEF). Obviously only useful if you have a discrete set of numbers.
Including additional characters (e.g. lower case + numbers).

Parallelizable hashing algorithm where size and order of sub-strings is irrelevant

EDIT
Here is the problem I am trying to solve:
I have a string broken up into multiple parts. These parts are not of equal, or predictable length. Each part will have a hash value. When I concatenate parts I want to be able to use the hash values from each part to quickly get the hash value for the parts together. In addition the hash generated by putting the parts together must match the hash generated if the string were hashed as a whole.
Basically I want a hashing algorithm where the parts of the data being hashed can be hashed in parallel, and I do not want the order or length of the pieces to matter. I am not breaking up the string, but rather receiving it in unpredictable chunks in an unpredictable order.
I am willing to accept an elevated collision rate, so long as it is not too elevated. I am also ok with a slightly slower algorithm, as the difference is hardly noticeable on small strings and the work is done in parallel for large strings.
I am familiar with a few hashing algorithms, however I currently have a use-case for a hash algorithm with the property that the sum of two hashes is equal to a hash of the sum of the two items.
Requirements/givens
This algorithm will be hashing byte-strings with length of at least 1 byte
hash("ab") = hash('a') + hash('b')
Collisions between strings with the same characters in different order is ok
Generated hash should be an integer of native size (usually 32/64 bits)
String may contain any byte value from 0-255 (length is known, not \0 terminated)
The ascii alpha-numeric characters will be by far the most used
A disproportionate number of strings will be 1-8 ASCII characters
A very tiny percentage of the strings will actually contain bytes with values at or above 127
If this is a type of algorithm that has terminology associated with it, I would love to know that terminology. If I knew what a proper term/name for this type of hashing algorithm was it would be much easier to google.
I am thinking the simplest way to achieve this is:
Any byte's hash should be its value, normalized to <128 (if >=128 subtract 128)
To get the hash of a string you normalize each byte to <128 and add it to the key
Depending on key size I may need to limit how many characters are used to hash to avoid overflow
I don't see anything wrong with just adding each (unsigned) byte value to create a hash which is just the sum of all the characters. There is nothing wrong with having an overflow: even if you reach the 32/64 bit limit (and it would have to be a VERY/EXTREMELY long string to do this) the overflow into a negative number won't matter in 2's complement arithmetic. As this is a linear process it doesn't matter how you split your string.
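A small Python sketch of that additive hash. Python integers do not overflow, so the 64-bit wraparound is simulated with a mask; additive_hash and combine are names invented for this example.

```python
MASK64 = (1 << 64) - 1  # simulate wraparound of a native 64-bit integer


def additive_hash(data):
    """Hash a byte string as the sum of its unsigned byte values, mod 2**64."""
    return sum(data) & MASK64


def combine(hash_a, hash_b):
    """Hash of a concatenation, computed from the hashes of its parts."""
    return (hash_a + hash_b) & MASK64


whole = additive_hash(b"hello world")
# The parts can arrive in any order and with any lengths.
parts = combine(additive_hash(b"world"), additive_hash(b"hello "))
print(whole == parts)
```

Because addition is commutative and associative, the chunks can be hashed in parallel and combined in whatever order they show up; the price is that any two strings with the same bytes in a different order collide.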

In VB.NET, what data structure to use so that, given two characters, return an integer for optimum performance?

I am finishing up a program that does a large number of calculations and am trying to optimize the innermost loop.
The calculation I am currently looking at iterates over a large number of pairs of words and makes a table of the counts of corresponding pairs of characters. For example, one pair of words might be:
voice
louse
and the character pairs would then be (v,l), (o,o), (i,u), (c,s), and (e,e), and these pairs would all then have a count of 1. If the combination (v,l) is ever encountered again in another word, it would increment that count to two.
What data structure should I use for highest performance? Given the two characters, I need to retrieve the count for that pair. Currently I am using a nested hash table whose declaration looks like:
Dim data As New Dictionary(of String, Dictionary(of String, Integer))
Using this data structure, the program must hash two strings for every integer it retrieves. For every character pair, it must first check to see if the pair is in the hash table, and if not add it, requiring two more hashes. I have also considered a one level hash table with the key being the two characters concatenated together, so key = "vl" and value = 1, but I have read that string concatenation is relatively slow in VB.
So then, my questions are:
How speedy are the Dictionaries in VB? Would four hashes be quicker than one hash and a string concatenation (two level vs one level hash table)?
Can you think of a better structure to store this kind of data that allows fast additions and retrieval?
One option is to use a Dictionary(Of Integer, Integer). You can convert from any .NET Unicode character to an unsigned 16 bit integer, as they're UTF-16 code units.
You can then combine two unsigned 16 bit integers into a 32 bit integer very easily. Alternatively, you can just convert each code unit into a 32 bit unsigned integer to start with, and shift and combine in the same way :)
In C# I'd just use:
int combination = (((int) char1) << 16) | ((int) char2);
EDIT: According to jeroenh's comment, the VB equivalent is:
Dim combination As Integer = (AscW(char1) << 16) Or AscW(char2)
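The same packing trick, sketched in Python just to show how the counting loop from the question would use it. It assumes both characters are single UTF-16 code units (below 0x10000), as they are in .NET; pair_key and count_pairs are hypothetical names.

```python
from collections import Counter


def pair_key(char1, char2):
    """Pack two characters into one integer key: char1 high, char2 low."""
    # Assumption: both code points fit in 16 bits.
    return (ord(char1) << 16) | ord(char2)


def count_pairs(word_pairs):
    """Count corresponding character pairs across all pairs of words."""
    counts = Counter()
    for word1, word2 in word_pairs:
        for char1, char2 in zip(word1, word2):
            counts[pair_key(char1, char2)] += 1
    return counts


counts = count_pairs([("voice", "louse"), ("vat", "lot")])
print(counts[pair_key("v", "l")])
```

Each lookup hashes one integer and needs no string concatenation, which is the point of the Dictionary(Of Integer, Integer) suggestion.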