Is there a way to optimize reducing many numbers to the same modulus?

I am writing a program to do integer factorization and have to reduce a series of numbers to a given modulus. Both the number and the modulus are bigints, say 50 to 100 digits. The number changes but the modulus is always the same. Is there some way to optimize the repeated modulus calculations, perhaps by pre-computing some partial results and storing them in a table?

Let your bigint library worry about optimizing operations like that.
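For what it's worth, the kind of precomputation you're describing does exist; the classic example is Barrett reduction, which trades the repeated division for one constant computed once per modulus. Below is a minimal sketch in Python (whose built-in ints are arbitrary precision), with a made-up ~80-digit modulus. A good bigint library typically applies something like this (or Montgomery reduction) internally, which is why the advice above is usually enough.

import random

def barrett_setup(m):
    # Precompute the Barrett constant mu = floor(2^(2k) / m) once per modulus.
    k = m.bit_length()
    return k, (1 << (2 * k)) // m

def barrett_reduce(x, m, k, mu):
    # Reduce x mod m for 0 <= x < m*m using only multiplications and shifts.
    q = (x * mu) >> (2 * k)      # q approximates floor(x / m), low by at most 2
    r = x - q * m
    while r >= m:                # at most two corrective subtractions
        r -= m
    return r

m = 10**79 + 63                  # made-up ~80-digit modulus, reused many times
k, mu = barrett_setup(m)
for _ in range(5):
    x = random.randrange(m * m)
    assert barrett_reduce(x, m, k, mu) == x % m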


Likelihood of Collision

I want to hash an internal account number and use the result as a unique public identifier for an account record. The identifier is limited to 40 characters. I have approximately 250 records with unique account numbers.
Which is less likely to result in a collision?
1. Taking the SHA-1 of the SHA-256 hash of the account number.
2. Taking the SHA-256 of the account number and picking out 40 characters.
These approaches are identical (*), so you should use the second one. There is no reason to inject SHA-1 into the system. Any selection of bits out of SHA-256 is independent and "effectively random."
An alternate solution that may be convenient is to turn these into v5 UUIDs. If you keep your namespace secret (which is allowed), this may be a very nice way to do what you're describing.
(*) There are some subtleties around the fact that you're using "characters" rather than bytes here, and you could get a larger space in 40 "characters" by using a better encoding than you're likely using. It's possible the spaces are a little different based on how you're actually encoding. But it deeply doesn't matter. These spaces are enormous, and the two approaches will be the same in practice, so use the one that only needs one algorithm.
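For concreteness, a minimal sketch of the recommended option in Python (the account number and the 40-hex-character choice are just placeholders for illustration):

import hashlib

def public_id(account_number: str) -> str:
    # First 40 hex characters (160 bits) of the SHA-256 digest.
    return hashlib.sha256(account_number.encode()).hexdigest()[:40]

print(public_id("ACCT-0000123"))   # hypothetical account number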
Another approach that may meet your needs better is to stretch the identifiers. If the space is sufficiently sparse (i.e., if the number of possible identifiers is dramatically larger than the number of actually used identifiers), then stretching algorithms like PBKDF2 are designed to handle exactly that. They are expensive to compute, but you can tune their cost to match your security requirements.
The general problem with just hashing is that hashes are very fast, and if your space of possible identifiers is very small, then it's easy to brute force. Stretching algorithms make each guess arbitrarily expensive, so that even fairly small spaces become impractical to brute force. They do this without requiring any secrets, which is nice. The general approach is:
Select a "salt" value. This can be publicly known. It does not matter. For this particular use case, because every account number is different, you can select a single global salt. (If the protected data could be the same, then it's important to have different salts for each record.)
Compute PBKDF2(salt, iterations, length, payload)
The number of iterations tunes how slow this operation is. The output is "effectively random" (just like a hash) and can be used in the same ways.
A common target for iterations is a value that delivers around 80-100ms. This is fairly fast on the server, but is extremely slow for brute-forcing large spaces, even if the attacker has better hardware than yours. Ideally your space should take at least millions of years to brute force (seriously; this is the kind of headroom we typically like in security; I personally target trillions of years). If it's smaller than a few years, then it probably can be brute forced quickly by throwing more hardware at it.
(Of course all of these choices can be tuned based on your attack model. It depends on how dedicated and well-funded you expect attackers to be.)
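As a sketch of those steps in Python; the salt, iteration count, and output length below are placeholder values you would tune yourself:

import hashlib

SALT = b"public-global-salt"   # can be public; one global salt is fine here
ITERATIONS = 600_000           # tune until one call takes roughly 80-100 ms
LENGTH = 20                    # 20 bytes -> 40 hex characters

def stretched_id(account_number: str) -> str:
    digest = hashlib.pbkdf2_hmac(
        "sha256", account_number.encode(), SALT, ITERATIONS, dklen=LENGTH
    )
    return digest.hex()

print(stretched_id("ACCT-0000123"))   # hypothetical account number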
A 40-character ID is 320 bits, which gives you plenty of space. With only 250 records, you can easily fit a unique counter into that: three digits take only 24 bits (three characters), and you have the range 000 to 999 to play with. Fill up the rest of the ID with, say, the hex representation of part of the SHA-256 hash. With a 3-digit ID, that leaves 37 places for hex, which covers 37*4 = 148 bits of the SHA-256 output.
You may want to put the counter in the middle of the hex string in a fixed position instead of at the beginning or end to make it less obvious.
<11 hex chars><3 digit ID><26 hex chars>
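A sketch of that layout in Python; the field widths follow the pattern above, and the split position, counter, and account number are arbitrary examples:

import hashlib

def make_id(counter: int, account_number: str) -> str:
    digest = hashlib.sha256(account_number.encode()).hexdigest()   # 64 hex chars
    # 11 hex chars + 3-digit counter + 26 hex chars = 40 characters total.
    return digest[:11] + f"{counter:03d}" + digest[11:37]

print(make_id(42, "ACCT-0000123"))   # hypothetical counter and account number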

Search performance of non-clustered index on binary and integer columns

I store fixed-size binary hashes of at most 64 bits in a Microsoft SQL Server 2012 database. The hash size may also be 48 or 32 bits. Each hash has an identifier Id. The table structure is like this:
Id int NOT NULL PRIMARY KEY,
Hash binary(8) NOT NULL
I created a non-clustered index on the Hash column for performance and to look up hashes quickly. I also tried creating integer columns instead of binary(n), matching the byte count n. For example, I changed the column type from binary(4) to int.
Are there differences between indices on column types binary(8) and bigint or between binary(4) and int and so on?
Is it reasonable to store hashes as integers to improve search performance?
Under the covers, an index key is limited to a certain byte length, and the smaller it is, the better for I/O. It's easy enough to convert between the data types using convert(varbinary(25), Hash) syntax once you have the value of interest; what you don't want is to invoke a ton of converts while you're looking up the records.
If there is a difference, it might be due to the collation or the statistics being used. Collation just determines, for two values, whether one is greater than, less than, or equal to the other. Statistics enable the query to skip past lots of values because the optimizer "knows" the data distribution.
When you have large strings and you attempt LIKE '%value' lookups, there's not much benefit from the index. Hashes should be random, which means the focus is on the number of bytes that get compared to make a query decision. The fewer the better.
The unhelpful but accurate CYA answer that every database engineer will give you: it depends, and you should test it.
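In that spirit, here is a rough test harness sketched in Python against SQLite, purely as a stand-in: SQL Server 2012's storage engine and index internals differ, so treat this only as a template for "test it yourself" and rerun the idea against your real server.

import os
import random
import sqlite3
import time

# SQLite stands in for SQL Server here; results on your real engine will differ.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t_blob (Id INTEGER PRIMARY KEY, Hash BLOB NOT NULL)")
con.execute("CREATE TABLE t_int  (Id INTEGER PRIMARY KEY, Hash INTEGER NOT NULL)")

rows = [(i, os.urandom(8)) for i in range(200_000)]      # random 64-bit hashes
con.executemany("INSERT INTO t_blob VALUES (?, ?)", rows)
con.executemany("INSERT INTO t_int  VALUES (?, ?)",
                [(i, int.from_bytes(h, "big", signed=True)) for i, h in rows])
con.execute("CREATE INDEX ix_blob ON t_blob(Hash)")
con.execute("CREATE INDEX ix_int  ON t_int(Hash)")
con.commit()

def bench(sql, params):
    start = time.perf_counter()
    for p in params:
        con.execute(sql, (p,)).fetchone()
    return time.perf_counter() - start

sample = [h for _, h in random.sample(rows, 5_000)]
print("blob index lookups:", bench("SELECT Id FROM t_blob WHERE Hash = ?", sample))
print("int  index lookups:", bench("SELECT Id FROM t_int  WHERE Hash = ?",
                                   [int.from_bytes(h, "big", signed=True) for h in sample]))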

How to improve performance of a Pig job that uses DataFu's HyperLogLog for estimating cardinality?

I am using DataFu's HyperLogLog UDF to estimate a count of unique ids in my dataset. In this case I have 320 million unique ids that may appear multiple times in my dataset.
Dataset: Country, ID.
Here is my code:
REGISTER datafu-1.2.0.jar;
DEFINE HyperLogLogPlusPlus datafu.pig.stats.HyperLogLogPlusPlus();
-- id is a UUID, for example : de305d54-75b4-431b-adb2-eb6b9e546014
all_ids =
    LOAD '$data'
    USING PigStorage(';') AS (country:chararray, id:chararray);

estimate_unique_ids =
    FOREACH (GROUP all_ids BY country)
    GENERATE
        'Total Ids' AS label,
        HyperLogLogPlusPlus(all_ids) AS reach;
STORE estimate_unique_ids INTO '$output' USING PigStorage();
Using 120 reducers, I noticed that a majority of them completed within minutes. However, a handful of the reducers were overloaded with data and ran forever; I killed them after 24 hours.
I thought HyperLogLog was supposed to be more efficient than exact counting. What is going wrong here?
In DataFu 1.3.0, an Algebraic implementation of HyperLogLog was added. This allows the UDF to use the combiner and will probably improve performance in skewed situations.
However, in the comments in the Jira issue there is a discussion of some other performance problems that can arise when using HyperLogLog. The relevant quote is below:
The thing to keep in mind is that each instance of HyperLogLogPlus allocates a pretty large byte array. I can't remember the exact numbers, but I think for the default precision of 20 it is hundreds of KB. So in your example if the cardinality of "a" is large you are going to allocate a lot of large byte arrays that will need to be transmitted from combiner to reducer. So I would avoid using it in "group by" situations unless you know the key cardinality is quite small. This UDF is better suited for "group all" scenarios where you have a lot of input data. Also if the input data is much smaller than the byte array then you could be worse off using this UDF. If you can accept worse precision then the byte array could be made smaller.
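A quick back-of-the-envelope for that memory cost, assuming the dense HyperLogLog++ representation of roughly 2^p registers at about 6 bits each (the exact layout in the library may differ):

# Approximate dense-sketch size per group key at precision p.
for p in (14, 16, 20):
    registers = 2 ** p
    approx_kib = registers * 6 / 8 / 1024
    print(f"p={p}: ~{approx_kib:.0f} KiB per sketch")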

Is there any SQL database not creating an index for a unique constraint?

I have seen this question, but really, it's only about MySQL. Is there any SQL database out there that does not create an index for a unique constraint?
In one sense, no one can give you a definitive answer. As we speak, someone could be creating that very thing. But it's a fair bet that any DBMS you've heard of or are likely to hear of will use indexes to enforce uniqueness, because that's what the science dictates.
DBMSs use indexes for this because searching them is quick. The index uses some kind of structure that supports a binary search, providing O(log N) time complexity.
Consider what the system would have to do without such a structure.
for each row to be inserted
    scan all rows in table
    error if found
In the best case -- when there's no error -- each inserted row would cause a scan of the entire table. That's O(n * m) complexity, i.e., quadratic time.
Suppose for example you're inserting 10,000 rows into a 10,000-row table. You're looking at 100,000,000 = 10,000 * 10,000 comparisons! A binary search, by contrast, requires ~14 comparisons for 10,000 rows, and ~15 for 20,000. Because we're inserting into the same table we're comparing against, the number of comparisons on average will be about 15, so the total number of comparisons is 150,000 = 15 * 10,000, or 0.15% of the work.
Databases are all about scale, and quadratic time even at modest scale is infeasible.
On an ordinary machine I have handy, a simple program to compare two unsorted arrays of 10,000 integers takes 0.1 seconds. As we might expect, 100,000 integers takes 10 seconds, 100 times longer. At 1,000,000 integers, we could expect 1000 seconds, or about 15 minutes. A cool billion would take a million times longer, until sometime in the year 2042.
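A sketch of that experiment in Python (absolute timings will differ by machine and language -- pure Python is far slower than a compiled program, so the sizes are scaled down -- but the quadratic growth is the point: doubling n roughly quadruples the time):

import random
import time

def count_common(a, b):
    # Naive O(n*m) comparison of two unsorted arrays.
    hits = 0
    for x in a:
        for y in b:
            if x == y:
                hits += 1
    return hits

for n in (1_000, 2_000, 4_000):
    a = [random.randrange(10**9) for _ in range(n)]
    b = [random.randrange(10**9) for _ in range(n)]
    start = time.perf_counter()
    count_common(a, b)
    print(f"n={n}: {time.perf_counter() - start:.2f}s")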
Rob Pike likes to say, "Fancy algorithms are slow when n is small, and n is usually small." It's true. But rule #5 is just as important: "Data dominates."

What's the database performance improvement from storing as numbers rather than text?

Suppose I have text such as "Win", "Lose", "Incomplete", "Forfeit", etc. I can directly store the text in the database. Instead, if I use numbers such as 0 = Win, 1 = Lose, etc., would I get a material improvement in database performance? Specifically, on queries where the field is part of my WHERE clause.
At the CPU level, comparing two fixed-size integers takes just one instruction, whereas comparing variable-length strings usually involves looping through each character. So for a very large dataset there should be a significant performance gain from using integers.
Moreover, a fixed-size integer will generally take less space, and its fixed width lets the database engine lay out and seek through rows more efficiently.
Many database systems also have an enum type, which is meant for cases like yours: in the query you compare the field value against a fixed set of literals while it is internally stored as an integer.
There might be significant performance gains if the column is used in an index.
It could range anywhere from negligible to extremely beneficial depending on the table size, the number of possible values being enumerated and the database engine / configuration.
That said, it almost certainly will never perform worse to use a number to represent an enumerated type.
Don't guess. Measure.
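In that spirit, a small measurement harness sketched in Python with SQLite as a stand-in (your engine, schema, and data distribution will behave differently; the point is to measure rather than guess):

import random
import sqlite3
import time

# SQLite stands in for whatever database you actually use.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE games_text (id INTEGER PRIMARY KEY, outcome TEXT)")
con.execute("CREATE TABLE games_int  (id INTEGER PRIMARY KEY, outcome INTEGER)")

labels = ["Win", "Lose", "Incomplete", "Forfeit"]
rows = [(i, random.randrange(4)) for i in range(500_000)]
con.executemany("INSERT INTO games_text VALUES (?, ?)",
                [(i, labels[v]) for i, v in rows])
con.executemany("INSERT INTO games_int VALUES (?, ?)", rows)
con.execute("CREATE INDEX ix_text ON games_text(outcome)")
con.execute("CREATE INDEX ix_int  ON games_int(outcome)")
con.commit()

def bench(sql, value):
    start = time.perf_counter()
    for _ in range(20):
        con.execute(sql, (value,)).fetchall()
    return time.perf_counter() - start

print("text WHERE:", bench("SELECT COUNT(*) FROM games_text WHERE outcome = ?", "Incomplete"))
print("int  WHERE:", bench("SELECT COUNT(*) FROM games_int  WHERE outcome = ?", 2))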
Performance depends on how selective the index is (how many distinct values are in it), whether critical information is available in the natural key, how long the natural key is, and so on. You really need to test with representative data.
When I was designing the database for my employer's operational data store, I built a testbed with tables designed around natural keys and with tables designed around id numbers. Both those schemas have more than 13 million rows of computer-generated sample data. In a few cases, queries on the id number schema outperformed the natural key schema by 50%. (So a complex query that took 20 seconds with id numbers took 30 seconds with natural keys.) But 80% of the test queries had faster SELECT performance against the natural key schema. And sometimes it was staggeringly faster--a difference of 30 to 1.
The reason, of course, is that lots of the queries on the natural key schema need no joins at all--the most commonly needed information is naturally carried in the natural key. (I know that sounds odd, but it happens surprisingly often. How often is probably application-dependent.) But zero joins is often going to be faster than three joins, even if you join on integers.
Clearly, if your data structures are shorter, they are faster to compare AND faster to store and retrieve.
How much faster? 1x, 2x, 1000x? It all depends on the size of the table and so on.
For example: say you have a table with a productId and a varchar text column.
Each row will take roughly 4 bytes for the int and then another 3 to 24 bytes for the text in your example (depending on whether the column is nullable or Unicode).
Compare that to 5 bytes per row for the same data with a byte status column.
This huge space saving means more rows fit in a page, more data fits in the cache, fewer writes happen when you load or store data, and so on.
Also, comparing strings is, in the best case, as fast as comparing bytes, and in the worst case much slower.
There is a second huge issue with storing text where you intended to have an enum: what happens when people start storing "Incompete" instead of "Incomplete"?
Having a skinnier column means that you can fit more rows per page.
There is a HUGE difference between a varchar(20) and an integer.