What are the best practices for Hadoop benchmarking?

I am using TestDFSIO to benchmark Hadoop I/O performance.
My test rig is a small virtual cluster of 3 data nodes and one name node.
Each VM has 6-8 GB RAM and a 100-250 GB HDD.
I want to know about two things:
What values should the number of files (nrFiles) and per-file size (fileSize) parameters take for my setup, such that the results from my small cluster can be related to clusters of standard size, e.g. 8-12 x 2-TB hard disks, 64 GB of RAM, and higher processing speeds per node? Is it even correct to do so?
In general, what are the best practices for benchmarking Hadoop? For example:
what is the recommended cluster specification (specs of data nodes and name nodes), what is the recommended test data size, and what configuration/specs should the test bed have so that the results conform to real-life Hadoop applications?
Simply put, I want to know the correct Hadoop test rig setup and the correct test methods, so that my results are relatable to production clusters.
References to proven work would be helpful.
Another question:
Suppose I run with -nrFiles 15 -fileSize 1GB.
I found that the number of map tasks equals the value given for nrFiles.
But how are they distributed among the 3 data nodes? The 15 map tasks are not clear to me: is it that each of the 15 files has one mapper working on it?
I have not found any document or description of how exactly TestDFSIO works.

You cannot directly compare results between two clusters; the results vary with the number of mappers per node, the replication factor, the network, etc.
The right cluster specification depends on what you are trying to use the cluster for.
If you run with -nrFiles 15 -fileSize 1000, 15 files of 1 GB each are created. Each mapper works on a single file, so there are 15 map tasks. On your 3-node cluster, assuming just 1 mapper per node, it takes 5 waves to write the full data set.
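The wave arithmetic above can be checked with a quick sketch (the one-mapper-per-node figure is the same assumption the answer makes; one map task per file is the behaviour described in the question):

```python
import math

# TestDFSIO launches one map task per file (-nrFiles), each writing -fileSize.
nr_files = 15         # -nrFiles 15
file_size_mb = 1000   # -fileSize 1000 (MB)
data_nodes = 3
map_slots_per_node = 1  # assumption: only one mapper runs per node at a time

concurrent_mappers = data_nodes * map_slots_per_node
waves = math.ceil(nr_files / concurrent_mappers)    # 15 tasks / 3 slots = 5 waves
total_written_gb = nr_files * file_size_mb / 1000   # 15 GB before replication

print(waves, total_written_gb)  # 5 15.0
```

With more map slots per node the number of waves shrinks accordingly, which is one reason results are not directly comparable across clusters.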
Refer to the link below for TestDFSIO and other benchmarking tools: http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/

Related

Impala concurrent query delay

My cluster configuration is as follows:
3 Node cluster
128GB RAM per cluster node.
Processor: 16 core HyperThreaded per cluster node.
All 3 nodes run a Kudu master, a Kudu tablet server (T-Server), and an Impala server; one of the nodes also runs the Impala Catalog and StateStore services.
My issues are as follows:
1) I'm having a hard time figuring out dynamic resource pooling in Impala when running concurrent queries. I've tried setting mem_limit, with no luck. I've also tried static service pools, but couldn't achieve the required concurrency with them either. Even with admission control, the required concurrency was not achieved.
I) Time taken for 1 query: 500-800 ms.
II) With 10 concurrent queries, the time grows to 3-6 s per query.
III) With more than 20 concurrent queries, the time exceeds 10 s per query.
2) One of my cluster nodes takes no load after a query is submitted; I verified this from the query summary. I've tried setting NUM_NODES to 0 and 1 on the node that takes no load, but the summary still shows that node taking no load.
What is the table size? How many rows are there in the tables? Are the tables partitioned? It would also be useful to compare your configuration against the published Impala benchmarks.
As mentioned above, Impala is designed to run on massively parallel processing infrastructure. When we had a setup of 10 nodes with 80 cores (160 virtual cores) and 12 TB of SAN storage, we got a computation time of 60 seconds with 5 concurrent users.

Database for 5-dimensional data?

I commonly deal with data sets that have well over 5 billion data points in a 3D grid over time. Each data point has a value that needs to be visualised, so it is a 5-dimensional data set. Let's say the data for each point looks like (x, y, z, time, value).
I need to run arbitrary queries against these data sets, for example where the value lies within a certain range, or below a certain threshold.
I also need to run queries that return all data for a specific z value.
These are the most common queries I need to run against this data set. I have tried the likes of MySQL and MongoDB, creating indices for those values, but the resource requirements are quite extreme and the query runtimes long. I ended up writing my own file format just to store the data for relatively easy retrieval, but this approach makes it difficult to find data without reading/scanning the entire data set.
I have looked at the likes of Hadoop and Hive, but the queries are not designed to be run in real-time. In terms of data size it seems a better fit though.
What would be the best method to index such large amounts of data efficiently? Is a custom indexing system the best approach, or should I slice the data into smaller chunks and devise a specific way of indexing them (which way, though)? The goal is to run queries against the data and have the results returned in under 0.5 seconds. My best so far was 5 seconds, running the entire DB on a huge RAM drive.
Any comments and suggestions are welcome.
EDIT:
x, y, z, time and value are all FLOATs.
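To make the z-slice idea concrete, here is a minimal sketch (NumPy, synthetic data; the array size and the 16 discrete z levels are just illustrative) of partitioning the points by z so that a per-z query reads one chunk instead of scanning everything:

```python
import numpy as np

# Synthetic stand-in for the real data set: rows of (x, y, z, time, value),
# all float32, with z quantised to a small number of grid levels.
rng = np.random.default_rng(0)
points = rng.random((100_000, 5), dtype=np.float32)
points[:, 2] = rng.integers(0, 16, 100_000)  # 16 z levels

# Partition rows into per-z chunks once, up front. A "all data for z == k"
# query then touches a single contiguous array instead of the whole set.
chunks = {z: points[points[:, 2] == z] for z in np.unique(points[:, 2])}

def query_z(z):
    # Dictionary lookup instead of a full scan.
    return chunks.get(np.float32(z), np.empty((0, 5), dtype=np.float32))

rows = query_z(3)
assert (rows[:, 2] == 3).all()
```

The same idea extends to chunking on (z, time) or coarse (x, y) tiles, which is essentially what a clustered store does with its partition key.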
It really depends on the hardware you have available, but regardless of that, and considering the type and amount of data you are dealing with, I definitely suggest a clustered solution.
As you already mentioned, Hadoop is not a good fit because it is primarily a batch processing tool.
Have a look at Cassandra and see if it solves your problem. A column-store RDBMS such as CitusDB (free up to 6 nodes) or Vertica (free up to 3 nodes) may also prove useful.

Select Query output in parallel streams

I need to spool over 20 million records into a flat file. A single direct SELECT would be too time-consuming, so I want to generate the output in parallel over portions of the data, i.e. 10 SELECT queries each covering 10% of the data, running in parallel, followed by a sort and merge on UNIX.
I could use rownum to do this, but that would be tedious and static, and would need to be updated every time the row counts change.
Is there a better alternative?
If the source data is well spread out over multiple spindles rather than all on one disk, and the I/O and network channels are not already saturated, splitting the work into separate streams may reduce your elapsed time. It may also introduce random access on one or more source drives, which will cripple your throughput: reading in anything other than cluster sequence induces disk contention.
The optimal scenario is for your source table to be partitioned, with each partition on separate storage (or very well striped), and each reader process aligned with a partition boundary.
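As a sketch of the idea, here is a minimal Python example of parallel slice readers. This is an illustration only: sqlite3 and a modulo on a numeric `id` column stand in for the real database and its partitioning scheme, and each worker builds its own toy copy of the table where it would normally connect to the shared source:

```python
import sqlite3
import concurrent.futures

N_STREAMS = 4

def setup():
    # Toy stand-in for the real source table: 1000 rows keyed by a numeric id.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)",
                     [(i, f"row-{i}") for i in range(1000)])
    return conn

def read_slice(k):
    # Each worker reads only its deterministic slice (id % N_STREAMS == k),
    # so no rownum bookkeeping is needed as the table grows.
    conn = setup()  # real system: open a connection to the shared database
    rows = conn.execute(
        "SELECT id, payload FROM t WHERE id % ? = ? ORDER BY id",
        (N_STREAMS, k)).fetchall()
    conn.close()
    return rows

with concurrent.futures.ThreadPoolExecutor(N_STREAMS) as pool:
    parts = list(pool.map(read_slice, range(N_STREAMS)))

# Final sort/merge, mirroring the sort-and-merge step done on UNIX.
merged = sorted(r for part in parts for r in part)
```

A hash or modulo on the key avoids the static-rownum problem, but note it gives random access unless the slices happen to align with the physical partitions, which is why partition-aligned readers are the ideal.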

Memory utilization in redis for each database

Redis allows storing data in 16 different 'databases' (0 to 15). Is there a way to get the memory and disk space used per database? The INFO command only lists the number of keys per database.
No, you cannot control each database individually; these "databases" are just a logical partitioning of your data.
What you can do (depending on your specific requirements and setup) is spin up multiple Redis instances, each doing a different task, each with its own redis.conf file and its own memory cap. Disk space can't be capped, though, at least not at the Redis level.
Side note: bear in mind that the number of databases (16) is not hardcoded; you can set it in redis.conf.
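To make the per-instance memory cap concrete, a minimal redis.conf sketch (the directives are standard Redis configuration; the values are just examples):

```
# redis.conf for one purpose-specific instance
port 6380                     # give each instance its own port
databases 16                  # the 0-15 "databases"; adjustable, not hardcoded
maxmemory 2gb                 # hard memory cap for this instance
maxmemory-policy noeviction   # fail writes once the cap is hit (or pick an eviction policy)
```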
I did it by calling DUMP on all the keys in a Redis DB and measuring the total number of bytes used. This will slow down your server and take a while. The size DUMP returns seems to be about 4 times smaller than the actual memory use, but the numbers will give you an idea of which DB is using the most space.
Here's my code:
https://gist.github.com/mathieulongtin/fa2efceb7b546cbb6626ee899e2cfa0b

SQL: how does parallelism of a join operation exactly work in a "shared nothing" architecture?

I'm currently reading a book about parallelism in a DBMS, and I find it difficult to understand exactly how parallelism works in the join operation.
Suppose we have 10 systems; each system has its own disk space and main memory. There is a network over which the systems can communicate with each other, for example to share data.
Now suppose we have the following operation: A(X,Y) JOIN B(Y,Z)
Tables A and B are too big, so we want to use parallelism to gain better overall computing speed.
What we do is apply a hash function to the Y attribute of each record of A and B, and send each record to a different system accordingly. Each system can then use a local algorithm to join the records it received.
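The redistribution step described above can be sketched as follows (toy tables and a trivial hash; each dictionary bucket stands in for one system's local storage):

```python
from collections import defaultdict

# Records: A(X, Y) as (x, y) tuples, B(Y, Z) as (y, z) tuples.
A = [(1, "a"), (2, "b"), (3, "a"), (4, "c")]
B = [("a", 10), ("b", 20), ("a", 30), ("d", 40)]

N_SYSTEMS = 3

def system_for(y):
    # The hash function on the join attribute Y decides which system a
    # record is shipped to; equal Y values always land on the same system.
    return hash(y) % N_SYSTEMS

# Phase 1: redistribute both tables by hash(Y).
a_parts = defaultdict(list)
b_parts = defaultdict(list)
for x, y in A:
    a_parts[system_for(y)].append((x, y))
for y, z in B:
    b_parts[system_for(y)].append((y, z))

# Phase 2: each system joins its local fragments, with no further
# communication, because all matching Y values are co-located.
result = []
for s in range(N_SYSTEMS):
    local_b = defaultdict(list)
    for y, z in b_parts[s]:
        local_b[y].append(z)
    for x, y in a_parts[s]:
        for z in local_b[y]:
            result.append((x, y, z))
```

The result is the same as a serial join of A and B, but phase 2 runs independently on every system.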
What I don't understand is, where exactly is the initial hash function being applied and where exactly are the initial tables A and B being stored?
While reading, I imagined we had another "main" system with its own disk space, holding all the initial information, i.e. tables A and B with all their records. This system would use its own main memory to apply the initial hash function, which determines, for each record, which of the 10 systems it will eventually go to for processing.
However, while reading, I got stuck on the following example (translated from Greek):
Let's say we have R(X,Y) JOIN S(Y,Z), where R has 1000 pages and S has 500 pages. Suppose we have 10 systems that can be used in parallel. We start by using a hash function to determine where to send each record. The total amount of I/O needed to read tables R and S is 1500 pages, which is 150 per system. Each system will hold 15 pages of data needed by each of the remaining systems, so it sends 135 pages to the other nine systems. Hence the total communication is 1350 pages.
I don't really understand the bold part. Why would a system have to send any data to the other systems? Doesn't the "main" system I described above do this job?
I imagine something like this:
main_system
||
\/
apply_hash(record)
||
\/
send record to the appropriate system
/ / / / / / / / / /
s1 s2 s3 s4 s5 s6 s7 s8 s9 s10
Now all systems have their own records; they apply the local algorithm and emit the result. There is no communication between the systems. What am I missing here? Does the book use a different approach, and if so, what kind? I have read the same unit 3 times and I still don't get it (maybe a bad translation, though I'm not sure).
Thanks in advance.
In a shared-nothing system, the data is typically partitioned across the processors when the data is created. Although databases can be shared-nothing, probably the best documentation is for Hadoop and HDFS, the Hadoop distributed file system.
A function assigns rows to a partition. Some examples are: round-robin, where new rows are assigned to the processors one after the other; range-based, where rows are assigned to a processor based on the value of a column; hash-based, where rows are assigned to a processor based on a hash of the value. The process of partitioning the data is very similar to "partitioning" in databases such as SQL Server and Oracle which are not in a shared-nothing environment.
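The three assignment functions can be sketched like this (the processor count, the range boundaries, and the keys are arbitrary illustrations):

```python
import itertools

N = 4  # number of processors

_rr = itertools.count()
def round_robin():
    # New rows go to processors one after the other: 0, 1, 2, 3, 0, 1, ...
    return next(_rr) % N

def range_based(value, boundaries=(100, 200, 300)):
    # Processor chosen by which range the column value falls into.
    for p, upper in enumerate(boundaries):
        if value < upper:
            return p
    return len(boundaries)

def hash_based(value):
    # Processor chosen by a hash of the value.
    return hash(value) % N

assert [round_robin() for _ in range(5)] == [0, 1, 2, 3, 0]
assert range_based(150) == 1
assert 0 <= hash_based("some-y-value") < N
```

Only hash-based (and, with matching boundaries, range-based) partitioning on the join key guarantees that matching rows are co-located, which is why it is the natural choice for joins.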
If your join uses the partition key for both tables, and the partitioning method is the same, then the data is already local. Otherwise, one or both tables need to be redistributed to continue the processing.
In the section you quote, you are probably confused by the arithmetic. Remember that if you have 1500 pages across 10 processors, each holds 150 pages on average. These pages need to be redistributed. Say you are processor 3: about 15 pages will go to processor 1, another 15 to processor 2, and another 15 to processor 3. Wait! You don't have to send those; they are already in the right place. You only have to send 9*15 = 135 pages to the other processors.
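The arithmetic works out as follows (a direct restatement of the numbers above):

```python
# The book's example: R has 1000 pages, S has 500, over 10 systems.
pages_r, pages_s = 1000, 500
n_systems = 10

total_pages = pages_r + pages_s           # 1500 I/Os to read both tables
per_system = total_pages // n_systems     # 150 pages already stored per system

# Hashing spreads each system's 150 pages evenly over all 10 destinations...
to_each_destination = per_system // n_systems            # 15 pages each
# ...but the 15 pages destined for the system itself never hit the network.
sent_per_system = to_each_destination * (n_systems - 1)  # 135 pages
total_communication = sent_per_system * n_systems        # 1350 pages

print(per_system, sent_per_system, total_communication)  # 150 135 1350
```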
The key idea is that, in a shared-nothing environment, the same processors that store the data also do the processing.
I'd take a wild guess that your "main system" is your local client, since it has a connection to all the machines.