How to improve the performance of a Pig job that uses DataFu's HyperLogLog for estimating cardinality?

I am using DataFu's HyperLogLog UDF to estimate the number of unique ids in my dataset. In this case I have 320 million unique ids, each of which may appear multiple times.
Dataset : Country, ID.
Here is my code :
REGISTER datafu-1.2.0.jar;
DEFINE HyperLogLogPlusPlus datafu.pig.stats.HyperLogLogPlusPlus();
-- id is a UUID, for example : de305d54-75b4-431b-adb2-eb6b9e546014
all_ids =
    LOAD '$data'
    USING PigStorage(';') AS (country:chararray, id:chararray);
estimate_unique_ids =
    FOREACH (GROUP all_ids BY country)
    GENERATE
        'Total Ids' as label,
        HyperLogLogPlusPlus(all_ids) as reach;
STORE estimate_unique_ids INTO '$output' USING PigStorage();
Using 120 reducers I noticed that a majority of them completed within minutes. However a handful of the reducers were overloaded with data and ran forever. I killed them after 24 hours.
I thought HyperLogLog was more efficient than computing an exact count. What is going wrong here?

In DataFu 1.3.0, an Algebraic implementation of HyperLogLog was added. This allows the UDF to use the combiner and will probably improve performance in skewed situations.
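To illustrate why an Algebraic implementation can use the combiner, here is a toy sketch of the general HyperLogLog idea: partial sketches built on separate chunks of data can be merged with an element-wise maximum, and the result is identical to a sketch built over all the data at once. This is not DataFu's code; the precision `P`, the SHA-1 based hash, and the function names are assumptions chosen for illustration.

```python
import hashlib

P = 12          # precision: the sketch has 2**P registers (illustrative value)
M = 1 << P


def hash64(item: str) -> int:
    """Stable 64-bit hash of a string (first 8 bytes of SHA-1)."""
    return int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")


def new_sketch() -> list:
    return [0] * M


def add(registers: list, item: str) -> None:
    h = hash64(item)
    index = h >> (64 - P)                     # top P bits select a register
    rest = h & ((1 << (64 - P)) - 1)          # remaining 64-P bits
    rank = (64 - P) - rest.bit_length() + 1   # leading zeros in `rest`, plus one
    registers[index] = max(registers[index], rank)


def merge(a: list, b: list) -> list:
    """Combine two partial sketches with an element-wise maximum."""
    return [max(x, y) for x, y in zip(a, b)]


ids = [f"id-{i}" for i in range(10_000)]
part1, part2, whole = new_sketch(), new_sketch(), new_sketch()
for i in ids[:6_000]:
    add(part1, i)
for i in ids[4_000:]:      # overlaps with part1 on purpose
    add(part2, i)
for i in ids:
    add(whole, i)

merged = merge(part1, part2)
print(merged == whole)
```

Because the merge is associative and commutative, each mapper can pre-aggregate its own chunk and ship one sketch per key instead of every raw tuple, which is what helps with skewed keys.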
However, in the comments in the Jira issue there is a discussion of some other performance problems that can arise when using HyperLogLog. The relevant quote is below:
The thing to keep in mind is that each instance of HyperLogLogPlus allocates a pretty large byte array. I can't remember the exact numbers, but I think for the default precision of 20 it is hundreds of KB. So in your example if the cardinality of "a" is large you are going to allocate a lot of large byte arrays that will need to be transmitted from combiner to reducer. So I would avoid using it in "group by" situations unless you know the key cardinality is quite small. This UDF is better suited for "group all" scenarios where you have a lot of input data. Also if the input data is much smaller than the byte array then you could be worse off using this UDF. If you can accept worse precision then the byte array could be made smaller.
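To put rough numbers on the quote above, here is a back-of-the-envelope estimate. It assumes a dense sketch with 2^p registers of 6 bits each (the layout described for HyperLogLog++) and the usual 1.04/sqrt(m) standard error of HyperLogLog; the actual library may pack registers differently, and the 200 group keys are a made-up figure.

```python
import math


def hll_register_bytes(precision: int, bits_per_register: int = 6) -> int:
    """Approximate size of a dense sketch with 2**precision registers."""
    registers = 1 << precision
    return (registers * bits_per_register + 7) // 8


def relative_error(precision: int) -> float:
    """Standard error of plain HyperLogLog: 1.04 / sqrt(number of registers)."""
    return 1.04 / math.sqrt(1 << precision)


for p in (14, 20):
    kib = hll_register_bytes(p) / 1024
    print(f"p={p}: {kib:.0f} KiB per sketch, ~{relative_error(p):.2%} error")

# One sketch per group key: the cost scales with the number of distinct keys.
groups = 200   # hypothetical number of countries
total_mib = groups * hll_register_bytes(20) / 1024 ** 2
print(f"{groups} groups at p=20: {total_mib:.0f} MiB of sketches")
```

Under these assumptions a precision of 20 costs 768 KiB per sketch, which matches the "hundreds of KB" in the quote, while a precision of 14 costs 12 KiB at the price of a larger error.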

Related

Redis bitmap split key division strategy

I'm grabbing and archiving A LOT of data from the Federal Election Commission public data source API, which has a unique record identifier called "sub_id" that is a 19-digit integer.
I'd like to find a memory-efficient way to catalog which line items I've already archived, and Redis bitmaps immediately come to mind.
The documentation on Redis bitmaps indicates a maximum length of 2^32 (4294967296) bits.
A 19-digit integer could theoretically range anywhere from 0000000000000000001 to 9999999999999999999. Now I know that the data source in question does not actually have 10 quintillion records, so the ids are clearly sparsely populated and not sequential. Of the data I currently have on file, the maximum id is 4123120171499720404 and the minimum is 1010320180036112531. (I can tell the ids are date-based because the 2017 and 2018 in the keys correspond to the dates of the records they refer to, but I can't suss out the rest of the pattern.)
If I wanted to store which line items I've already downloaded, would I need 2328306436 different Redis bitmaps? (9999999999999999999 / 4294967296 = 2328306436.54). I could probably work up a tiny algorithm that, given a 19-digit id, divides by some constant to determine which split bitmap index to check.
This strategy does not seem tenable, so I'm thinking I must be fundamentally misunderstanding some aspect of this. Am I?
A Bloom Filter such as RedisBloom will be an optimal solution (RedisBloom can even grow if you miscalculated your desired capacity).
After you create your filter with BF.RESERVE, you pass the 'item' to be inserted to BF.ADD. The item can be as long as you want: the filter uses hash functions and a modulus to map it onto the filter's size. When you want to check whether an item was already added, call BF.EXISTS with that 'item'.
In short, what you describe here is a classic example for when a Bloom Filter is a great fit.
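To make the "hash functions and modulus" part concrete, here is a minimal in-memory Bloom filter sketch. It is not RedisBloom's implementation; the class name, the SHA-256 based double hashing, and the sizes are assumptions picked for illustration.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: k bit positions per item in a fixed-size bit array."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1   # odd step for double hashing
        for i in range(self.num_hashes):
            # The modulus folds an arbitrarily large hash into the filter size.
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely not added"; True means "probably added".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


seen = BloomFilter(num_bits=1_000_000, num_hashes=7)
seen.add("4123120171499720404")
seen.add("1010320180036112531")
print(seen.might_contain("4123120171499720404"))
```

The memory used depends only on `num_bits`, not on how large the ids are, which is why a 19-digit id space is not a problem. The tradeoff is a small false-positive rate: an id that was never added may occasionally be reported as present.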
How many "items" are there? What is "A LOT"?
Anyway. A linear approach that uses a single bit to track each of the 10^19 potential items requires 1250 petabytes at least. This makes it impractical (atm) to store it in memory.
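The arithmetic behind that figure, and behind the split-bitmap idea from the question, can be checked in a few lines. The helper `locate` is a hypothetical name for the "divide by a constant" step; it is shown only to make the key and offset calculation explicit, not as a recommendation.

```python
POTENTIAL_IDS = 10 ** 19               # every possible 19-digit value
MAX_BITS_PER_REDIS_BITMAP = 2 ** 32    # limit quoted in the question

bytes_needed = POTENTIAL_IDS // 8                 # one bit per potential id
petabytes = bytes_needed / 10 ** 15
bitmaps_needed = -(-POTENTIAL_IDS // MAX_BITS_PER_REDIS_BITMAP)   # ceiling division


def locate(sub_id: int):
    """Return (bitmap index, bit offset inside that bitmap) for an id."""
    return divmod(sub_id, MAX_BITS_PER_REDIS_BITMAP)


print(petabytes)        # 1250.0
print(bitmaps_needed)   # 2328306437
print(locate(4123120171499720404))
```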
I would recommend that you teach yourself about probabilistic data structures in general, and after having grokked the tradeoffs look into using something from the RedisBloom toolbox.
If the ids are not sequential and are very spread out, keeping track of which ones you have processed using a bitmap is not the best option, since it would waste a lot of memory.
However, it is hard to point to the best solution without knowing how many distinct sub_ids your data set has. If you are talking about a few tens of millions, a simple set in Redis may be enough.

Efficiently perform COUNT DISTINCT with spark, on csvs?

I have a large volume of data, and I'm looking to efficiently (i.e., using a relatively small Spark cluster) perform COUNT and DISTINCT operations on one of the columns.
If I do what seems obvious, ie load the data into a dataframe:
df = spark.read.format("csv").load("s3://somebucket/loadsofcsvdata/*")
df.createOrReplaceTempView("someview")
and then attempt to run a query:
domains = spark.sql("""SELECT domain, COUNT(id) FROM someview GROUP BY domain""")
domains.show(1000)
my cluster just crashes and burns - throwing out of memory exceptions or otherwise hanging/crashing/not completing the operation.
I'm guessing that somewhere along the way there's some sort of join that blows one of the executors' memory?
What's the ideal method for performing an operation like this when the source data is at massive scale and the target data isn't? (The list of domains in the above query is relatively short and should easily fit in memory.)
related info available at this question: What should be the optimal value for spark.sql.shuffle.partitions or how do we increase partitions when using Spark SQL?
I would suggest tuning your executor settings. In particular, setting the following parameters correctly can provide a dramatic improvement in performance:
spark.executor.instances
spark.executor.memory
spark.yarn.executor.memoryOverhead
spark.executor.cores
In your case, I would also suggest tuning the number of partitions, in particular raising the following parameter from its default of 200 to a higher value, as required:
spark.sql.shuffle.partitions
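One way to pass these settings is as `--conf key=value` pairs on the `spark-submit` command line. The small helper below only assembles such a command as a string. The values are placeholders to show the shape, not recommendations, and `count_domains.py` is a hypothetical script name; the right numbers depend on your cluster and data.

```python
import shlex

# Placeholder values: size these to your own cluster.
settings = {
    "spark.executor.instances": "10",
    "spark.executor.memory": "8g",
    "spark.yarn.executor.memoryOverhead": "1024",
    "spark.executor.cores": "4",
    "spark.sql.shuffle.partitions": "1000",
}


def spark_submit_command(app: str, conf: dict) -> str:
    """Build a spark-submit command line with one --conf per setting."""
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return shlex.join(args)


command = spark_submit_command("count_domains.py", settings)
print(command)
```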

Java EE/SQL: Is there a significant performance lag between primary key types?

Currently I am involved in learning some basics of the Java EE technology. I encountered a particular project and took a deeper look into the underlying database structure.
On server-side I investigated a Java function that creates a primary key with a length of 32 characters (based on concatenating the time, a random hash, and an additional cryptographic nonce).
I am interested in an estimate of the performance loss caused by using such a primary key. If there is no security reason to create this kind of unique ID, wouldn't it be much better to let the underlying database create new, increasing primary keys, starting at 0?
Wouldn't a SQL/JQL search be much faster when using numbers instead of strings?
Using numbers will probably be faster, but you should measure it with a test case if you need the performance ratio between both options.
I don't think number comparison vs string comparison will give a big performance advantage by itself. However:
larger fields typically mean less data per table block, so you have to read more blocks from the DB in case of a full scan (it will be slower)
accordingly, larger keys typically mean fewer keys per index block, so you have to read more index blocks in case of index scans (it will be slower)
larger fields are, well, larger, so by definition they are less space-efficient.
Note that we are talking about data size and not data type: most likely an 8-byte integer will not be significantly more efficient than an 8-byte string.
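To put rough numbers on the block-count argument, here is a small estimate of how many keys fit in one index block and how many index levels a lookup has to traverse. The 8 KB block, the 8-byte pointer per entry, the one-byte-per-character key, and the 100 million rows are all assumptions; real engines add per-entry and per-page overhead.

```python
def keys_per_block(key_bytes: int, block_bytes: int = 8192, pointer_bytes: int = 8) -> int:
    """How many (key, pointer) entries fit in one index block."""
    return block_bytes // (key_bytes + pointer_bytes)


def index_levels(rows: int, fanout: int) -> int:
    """Number of tree levels needed so that fanout**levels >= rows."""
    levels, reachable = 1, fanout
    while reachable < rows:
        reachable *= fanout
        levels += 1
    return levels


rows = 100_000_000
for label, key_bytes in (("32-char key", 32), ("8-byte key", 8)):
    fanout = keys_per_block(key_bytes)
    print(f"{label}: {fanout} keys per block, {index_levels(rows, fanout)} levels")
```

Under these assumptions the 32-byte key needs one more index level than the 8-byte key, which is the kind of difference that only a measurement on real data can confirm.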
Note also that random IDs are usually more "clusterable" than sequence numbers, as sequences / autonumerics need to be administered centrally (although this can be mitigated using techniques such as the Hi-Lo algorithm, which most current persistence frameworks support).
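For readers who have not met Hi-Lo, here is a minimal sketch of the idea: each node asks the central counter for a "hi" value only once per block and then hands out ids locally. The class names and the block size of 100 are assumptions, and `CentralCounter` merely stands in for a database sequence.

```python
import itertools


class CentralCounter:
    """Stand-in for a database sequence; each call represents one round trip."""

    def __init__(self):
        self._next_hi = itertools.count()
        self.round_trips = 0

    def next_hi(self) -> int:
        self.round_trips += 1
        return next(self._next_hi)


class HiLoGenerator:
    def __init__(self, counter: CentralCounter, block_size: int = 100):
        self.counter = counter
        self.block_size = block_size
        self.hi = None
        self.lo = block_size          # forces a fetch on first use

    def next_id(self) -> int:
        if self.lo >= self.block_size:
            self.hi = self.counter.next_hi()   # reserve a whole block at once
            self.lo = 0
        value = self.hi * self.block_size + self.lo
        self.lo += 1
        return value


counter = CentralCounter()
node_a = HiLoGenerator(counter)
node_b = HiLoGenerator(counter)
ids = [node_a.next_id() for _ in range(150)] + [node_b.next_id() for _ in range(150)]
print(len(set(ids)), counter.round_trips)
```

Two nodes generate 300 distinct numeric ids with only four trips to the central counter, which is how the technique keeps integer keys without making the sequence a bottleneck.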

mysql - Creating rows vs. columns performance

I built an analytics engine that pulls 50-100 rows of raw data from my database (let's call it raw_table), runs a bunch of statistical measurements on it in PHP and then comes up with exactly 140 datapoints that I then need to store in another table (let's call it results_table). All of these data points are very small ints ("40","2.23","-1024" are good examples of the types of data).
I know the maximum # of columns for mysql is quite high (4000+) but there appears to be a lot of grey area as far as when performance really starts to degrade.
So a few questions here on best performance practices:
1) The 140 datapoints could be, if it is better, broken up into 20 rows of 7 data points all with the same 'experiment_id' if fewer columns is better. HOWEVER I would always need to pull ALL 20 rows (with 7 columns each, plus id, etc) so I wouldn't think this would be better performance than pulling 1 row of 140 columns. So the question: is it better to store 20 rows of 7-9 columns (that would all need to be pulled at once) or 1 row of 140-143 columns?
2) Given my data examples ("40","2.23","-1024" are good examples of what will be stored) I'm thinking smallint for the structure type. Any feedback there, performance-wise or otherwise?
3) Any other feedback on mysql performance issues or tips is welcome.
Thanks in advance for your input.
I think the advantage to storing as more rows (i.e. normalized) depends on design and maintenance considerations in the face of change.
It also depends on whether the 140 columns have the same meaning for every experiment or differ per experiment, i.e. on modeling the data properly according to normalization rules (how each value relates to a candidate key).
As far as performance, if all the columns are used it makes very little difference. Sometimes a pivot/unpivot operation can be expensive over a large amount of data, but it makes little difference on a single key access pattern. Sometimes a pivot in the database can make your frontend code a lot simpler and backend code more flexible in the face of change.
If you have a lot of NULLs, it might be possible to eliminate rows in a normalized design and this would save space. I don't know if MySQL has support for a sparse table concept, which could come into play there.
You have 140 data items to return every time, each of type double.
It makes no practical difference whether this is 1x140 or 20x7 or 7x20 or 4x35 etc. It could be infinitesimally quicker for one shape, of course, but have you considered the extra complexity in the PHP code needed to deal with a different shape?
Do you have a verified bottleneck, or is this just random premature optimisation?
You've made no suggestion that you intend to store big data in the database, but for the purposes of this argument, I will assume that you have 1 billion (10^9) data points.
If you store them in 140 columns, you'll have a mere 7 million rows. However, if you want to retrieve a single data point from lots of experiments, it will have to fetch a large number of very wide rows.
These very wide rows will take up more space in your innodb_buffer_pool, hence you won't be able to cache so many; this will potentially slow you down when you access them again.
If you store one datapoint per row, in a table with very few columns (experiment_id, datapoint_id, value) then you'll need to pull out the same number of smaller rows.
However, the size of rows makes little difference to the number of IO operations required. If we assume that your 1 billion datapoints don't fit in RAM (which is NOT a safe assumption nowadays), maybe the resulting performance will be approximately the same.
It is probably better database design to use few columns; but it will use less disc space and perhaps be faster to populate, if you use lots of columns.
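As a small illustration of the two shapes discussed above, the sketch below converts one wide result (140 values) into narrow (experiment_id, datapoint_id, value) rows and back, and works out the row counts for the hypothetical 1 billion data points. The function names and the sample values are made up for the example.

```python
def to_narrow(experiment_id: int, values: list) -> list:
    """One wide row -> one (experiment_id, datapoint_id, value) row per value."""
    return [(experiment_id, i, v) for i, v in enumerate(values)]


def to_wide(rows: list) -> list:
    """Narrow rows for a single experiment -> list of values in datapoint order."""
    ordered = sorted(rows, key=lambda r: r[1])
    return [v for _, _, v in ordered]


wide = [40, 2.23, -1024] + [0] * 137     # 140 values for one experiment
narrow = to_narrow(7, wide)
print(len(narrow), to_wide(narrow) == wide)

DATA_POINTS = 10 ** 9
wide_rows = -(-DATA_POINTS // 140)       # ceiling division
narrow_rows = DATA_POINTS
print(wide_rows, narrow_rows)
```

Either shape carries the same information; the choice mostly affects how many rows you touch for a given access pattern and how much reshaping the application code has to do.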

What's the database performance improvement from storing as numbers rather than text?

Suppose I have text such as "Win", "Lose", "Incomplete", "Forfeit" etc. I can store the text directly in the database. If I instead use numbers such as 0 = Win, 1 = Lose etc., would I get a material improvement in database performance? Specifically on queries where the field is part of my WHERE clause?
At the CPU level, comparing two fixed-size integers takes just one instruction, whereas comparing variable-length strings usually involves looping through each character. So for a very large dataset there should be a significant performance gain with using integers.
Moreover, a fixed-size integer will generally take less space and can allow the database engine to perform faster algorithms based on random seeking.
Most database systems however have an enum type which is meant for cases like yours - in the query you can compare the field value against a fixed set of literals while it is internally stored as an integer.
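The same mapping can also be kept on the application side. Here is a minimal sketch using Python's `IntEnum`; the codes 0 and 1 come from the question, while 2 and 3 and the helper names are assumptions made for the example.

```python
from enum import IntEnum


class Outcome(IntEnum):
    WIN = 0
    LOSE = 1
    INCOMPLETE = 2   # assumed code, not given in the question
    FORFEIT = 3      # assumed code, not given in the question


def to_db(label: str) -> int:
    """Text shown to users -> small integer stored in the column."""
    return int(Outcome[label.upper()])


def from_db(code: int) -> str:
    """Integer read from the column -> display text."""
    return Outcome(code).name.capitalize()


print(to_db("Win"), from_db(1))
```

A misspelled label such as "Incompete" raises a `KeyError` here instead of being stored silently, which is one practical benefit of a fixed set of values over free text.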
There might be significant performance gains if the column is used in an index.
It could range anywhere from negligible to extremely beneficial depending on the table size, the number of possible values being enumerated and the database engine / configuration.
That said, it almost certainly will never perform worse to use a number to represent an enumerated type.
Don't guess. Measure.
Performance depends on how selective the index is (how many distinct values are in it), whether critical information is available in the natural key, how long the natural key is, and so on. You really need to test with representative data.
When I was designing the database for my employer's operational data store, I built a testbed with tables designed around natural keys and with tables designed around id numbers. Both those schemas have more than 13 million rows of computer-generated sample data. In a few cases, queries on the id number schema outperformed the natural key schema by 50%. (So a complex query that took 20 seconds with id numbers took 30 seconds with natural keys.) But 80% of the test queries had faster SELECT performance against the natural key schema. And sometimes it was staggeringly faster--a difference of 30 to 1.
The reason, of course, is that lots of the queries on the natural key schema need no joins at all--the most commonly needed information is naturally carried in the natural key. (I know that sounds odd, but it happens surprisingly often. How often is probably application-dependent.) But zero joins is often going to be faster than three joins, even if you join on integers.
Clearly if your data structures are shorter, they are faster to compare AND faster to store and retrieve.
How much faster? 1x, 2x, 1000x: it all depends on the size of the table and so on.
For example: say you have a table with a productId and a varchar text column.
Each row will take roughly 4 bytes for the int and then another 3 to 24 bytes for the text in your example (depending on whether the column is nullable or Unicode).
Compare that to 5 bytes per row for the same data with a byte status column.
This huge space saving means more rows fit in a page, more data fits in the cache, fewer writes happen when you load or store data, and so on.
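Using the byte counts above, a quick estimate of rows per page looks like this. The 8 KB page size is an assumption, and per-row and per-page overhead is ignored, so treat the results as relative rather than exact.

```python
PAGE_BYTES = 8192   # assumed page size; varies by database engine


def rows_per_page(row_bytes: int) -> int:
    """Rows that fit in one page, ignoring row and page overhead."""
    return PAGE_BYTES // row_bytes


layouts = {
    "int + byte status": 4 + 1,
    "int + shortest text": 4 + 3,
    "int + longest text": 4 + 24,
}
for name, row_bytes in layouts.items():
    print(f"{name}: {rows_per_page(row_bytes)} rows per page")
```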
Also, comparing strings at the best case is as fast as comparing bytes and worst case much slower.
There is a second huge issue with storing text where you intended to have an enum. What happens when people start storing Incompete as opposed to Incomplete?
Having a skinnier column means that you can fit more rows per page.
There is a HUGE difference between a varchar(20) and an integer.