About the Apache Hive map-side join

I know that hive map side join uses memory.
Can I use an SSD instead of memory?
I want to do a map-side join by putting the dimension table on the SSD.
Is it possible?

I will try to answer your question by explaining the Hadoop distributed cache:
DistributedCache is a facility provided by the MapReduce framework to cache files needed by applications (in your case, the Hive table you want to join).
The DistributedCache assumes that the files specified via URLs are already present on the file system (that is, your SSD or HDD) at the path specified by the URL and are accessible by every machine in the cluster.
So, somewhat ironically, it is the Hadoop framework that decides whether to keep the map-join table in memory (RAM, managed by YARN) or on SSD/HDD, depending on the table's size.
By default, the maximum size of a table to be used in a map join (as the small table) is 1,000,000,000 bytes (about 1 GB), but you can increase this manually with Hive set properties, for example:
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=2000000000;
The framework will copy the necessary files to the slave nodes before any tasks for the job are executed on those nodes. Its efficiency stems from the fact that the files are only copied once per job, and from its ability to cache archives, which are un-archived on the slaves.
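To make this concrete, here is a minimal sketch (not from the original answer, and using the plain MapReduce Java API rather than Hive itself) of how a small dimension file is registered in the distributed cache; the class name and the HDFS path are made up for illustration:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapJoinDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "map-side join");
        // The framework copies this file to every node before the map tasks start,
        // so each mapper can load the small table locally and join in memory.
        job.addCacheFile(new URI("hdfs:///warehouse/dim_table/part-00000"));  // hypothetical path
        // ... configure mapper, input and output as usual, then:
        // job.waitForCompletion(true);
    }
}

In the mapper you would then read the cached copy via context.getCacheFiles() and build the in-memory lookup table.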
You can find more about the distributed cache at these links:
https://hadoop.apache.org/docs/r2.6.3/api/org/apache/hadoop/filecache/DistributedCache.html
https://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/filecache/DistributedCache.html

Foundry Data Storage Optimization

Hi, I have a general question about pipeline optimization in order to lower storage space.
Does deleting trashed datasets help alleviate disk storage? For example, removing obsolete datasets a) based on business knowledge and utilization, and b) datasets in the trash.
Also, we'd like to manage the copies of datasets that are stored when a schedule runs. We believe that if we ever had to fall back to a previous version, we only need to reference the latest one, as opposed to keeping multiple copies.
Does this affect storage? And is there a way to manage configuration on this?
Deleting trashed datasets (in typical setups) will result in their underlying files being deleted, but typically a larger driver of storage consumption is the set of previous dataset views kept by default.
You can control the length of time these files and views are kept using the Foundry Retention service. I'd recommend you consult with platform documentation and your support team for configuration of this service.
Retention will compute and mark files matching your configuration for deletion and periodically delete them, thus reducing your storage consumption.

Controlling HDFS replication, number of mappers, and reducer placement

I am trying to run Apache Hadoop 2.65 in a distributed way (with a cluster of 3 computers) and I want to decide the number of mappers and reducers.
I am using HDFS with number of replication 1 and my input is 3 files (tables).
I want to adjust the way data flows in the system, and for that I would like some help with the following matters: is each of them possible, and how and where can I change it?
Replication in HDFS: Can I influence the way HDFS replication is done? For example, make sure that each file is stored on a different computer? And if so, can I choose on which computer it will be stored?
Number of mappers: Can I change the number of mappers or input splits? I know that it is decided by the number of input splits and the block size. It says on the web that I can do that by changing the following parameters, but I don't know where:
-D mapred.map.tasks=5
mapred.min.split.size property
Reducer placement: How can I suggest or force the ResourceManager to start the reduce containers (reduce tasks) on specific computers? And if so, can I select how many run on each computer (i.e. divide the map output differently across the cluster)? More specifically, can I add another parameter to the ContainerLaunchContext (we have Mem, CPU, Disk, and Locality)?
Replication of HDFS: Can I interfere with the way the replication of HDFS has been done?
Ans: Yes, the replication factor can be changed. The cluster-wide default is the dfs.replication property in hdfs-site.xml, and you can also change it for an existing file or directory with hdfs dfs -setrep. Picking exactly which machine each replica lands on is not something you normally control directly.
Number of mappers: Can I change the number of mappers or input splits?
Ans: Yes, indirectly. The number of mappers follows the number of input splits, so you change it by adjusting the split size, e.g. mapreduce.input.fileinputformat.split.minsize / .maxsize (the newer names of the mapred.min.split.size property you quoted), set in mapred-site.xml or on the job configuration; -D mapred.map.tasks=5 is treated only as a hint.
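For illustration only, here is a rough driver-side sketch (not part of the original answer) of where these knobs live in the Java API; the paths and sizes are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class ReplicationAndSplits {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "1");  // default replication for files this job writes
        FileSystem fs = FileSystem.get(conf);
        fs.setReplication(new Path("/data/table1"), (short) 1);  // change an existing file's replication

        Job job = Job.getInstance(conf, "split tuning");
        FileInputFormat.addInputPath(job, new Path("/data/table1"));
        // Larger minimum split size -> fewer mappers; smaller maximum -> more mappers.
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
        job.setNumReduceTasks(2);  // the number of reducers, unlike mappers, is set directly
        // job.waitForCompletion(true);
    }
}

As far as I know, the stock MapReduce API does not let you pin reduce containers to specific hosts, so the reducer-placement part of the question is not covered by this sketch.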

Hive queries of external tables stored on Google Cloud Storage extremely slow

I have begun testing the Google Cloud Storage connector for Hadoop. I am finding it incredibly slow for Hive queries run against it.
It seems a single client must scan the entire file system before starting the job; with tens of thousands of files this takes tens of minutes. Once the job is actually running it performs well.
Is this a configuration issue or the nature of Hive/GCS? Can something be done to improve performance?
Running CDH 5.3.0-1 in GCE
I wouldn't say it's necessarily a MapReduce vs Hive difference, though there are possible reasons it could be more common to run into this type of slowness using Hive.
It's true that metadata operations like "stat/getFileStatus" have a slower round-trip latency on GCS than local HDFS, on the order of 30-70ms instead of single-digit milliseconds.
However, this doesn't mean it should take more than 10 minutes to start a job on 10,000 files. Best practice is to allow the connector to "batch" requests as much as possible, allowing retrieval of up to 1000 fileInfos in a single round-trip.
The key is that if I have a single directory:
gs://foobar/allmydata/foo-0000.txt
....<lots of files following this pattern>...
gs://foobar/allmydata/foo-9998.txt
gs://foobar/allmydata/foo-9999.txt
If I have my Hive "location" = gs://foobar/allmydata it should actually be very quick, because it will be fetching 1000 files at a time. If I did hadoop fs -ls gs://foobar/allmydata it should come back in <5 seconds.
However, if I have lots of small subdirectories:
gs://foobar/allmydata/dir-0000/foo-0000.txt
....<lots of files following this pattern>...
gs://foobar/allmydata/dir-9998/foo-9998.txt
gs://foobar/allmydata/dir-9999/foo-9999.txt
Then this could go awry. The Hadoop subsystem is a bit naive, so that if you just do hadoop fs -ls -R gs://foobar/allmydata in this case, it will indeed first find the 10000 directories of the form gs://foobar/allmydata/dir-####, and then run a for-loop over them, one-by-one listing the single file under each directory. This for-loop could easily take > 1000 seconds.
This was why we implemented a hook to intercept at least fully-specified glob expressions, released back in May of last year:
https://groups.google.com/forum/#!topic/gcp-hadoop-announce/MbWx1KqY2Q4
7. Implemented new version of globStatus which initially performs a flat
listing before performing the recursive glob logic in-memory to
dramatically speed up globs with lots of directories; the new behavior is
default, but can be disabled by setting fs.gs.glob.flatlist.enable = false.
In this case, if the subdirectory layout was present, the user can opt instead to do hadoop fs -ls gs://foobar/allmydata/dir-*/foo*.txt. Hadoop lets us override a "globStatus", so by using this glob expression, we can correctly intercept the entire listing without letting Hadoop do its naive for-loop. We then batch it up efficiently, such that we'll retrieve all 10,000 fileInfos again in <5 seconds.
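As a rough illustration (not from the original answer), the same batched listing can be triggered programmatically; this sketch assumes the GCS connector is on the classpath and reuses the made-up bucket and paths from the example above:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GcsGlobListing {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("fs.gs.glob.flatlist.enable", true);  // the flat-listing behaviour described above (the default)
        FileSystem fs = FileSystem.get(URI.create("gs://foobar/"), conf);
        // One glob call instead of a per-directory for-loop over 10,000 subdirectories.
        FileStatus[] files = fs.globStatus(new Path("gs://foobar/allmydata/dir-*/foo*.txt"));
        System.out.println("Matched " + files.length + " files");
    }
}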
This could be a bit more complicated in the case of Hive if it doesn't allow as free usage of glob expressions.
Worst case, if you can move those files into a flat directory structure then Hive should be able to use that flat directory efficiently.
Here's a related JIRA from a couple years ago describing the similar problem for how Hive deals with files in S3, still officially unresolved: https://issues.apache.org/jira/browse/HIVE-951
If it's unclear how/why the Hive client is performing the slow for-loop, you can add log4j.logger.com.google=DEBUG to your log4j.properties and re-run the Hive client to see detailed info about what the GCS connector is doing under the hood.

How to limit the RAM used by multiple embedded HSQLDB DB instances as a whole?

Given:
HSQLDB embedded
50 distinct databases (I have 50 different data sources)
All the databases are of the file:/ kind
All the tables are CACHED
The amount of RAM that all the embedded DB instances combined are allowed to use is limited and given upon startup of the Java process.
The LOG file is disabled (no need to recover upon crash)
My understanding is that the RAM used by a single DB instance is comprised of the following pieces:
The cache of all the tables (all my tables are CACHED)
The DB instance internal state
Also, as far as I can see I have these two properties to control the total size of the cache of a single DB instance:
SET FILES CACHE SIZE
SET FILES CACHE ROWS
However, they control only the cache part of the RAM used by a DB instance. Plus, they are per DB instance, whereas I would like to limit all the instances as a whole.
So, I wonder whether it is possible to instruct HSQLDB to stay within the specified amount of RAM in total including all the DB instances?
You can only limit the CACHE memory use per database instance. Each instance is independent of the other.
You can reduce the CACHE SIZE and CACHE ROWS per database to suit your application.
HSQLDB does not use a lot of other memory, but when it does, it uses the memory of the JVM, which is shared among the different database instances.
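For what it's worth, here is a minimal sketch of applying those two statements to one embedded instance over JDBC; the file path and the numbers are placeholders, not recommendations, and you would repeat this for each of the 50 databases:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HsqldbCacheLimits {
    public static void main(String[] args) throws Exception {
        // One embedded file: database; the same settings have to be issued per instance.
        try (Connection c = DriverManager.getConnection("jdbc:hsqldb:file:/data/db01/mydb", "SA", "");
             Statement st = c.createStatement()) {
            st.execute("SET FILES LOG FALSE");         // the disabled .log file mentioned in the question
            st.execute("SET FILES CACHE SIZE 10000");  // upper bound on the row cache for CACHED tables
            st.execute("SET FILES CACHE ROWS 50000");  // upper bound on the number of cached rows
        }
    }
}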

Node-Local Map reduce job

I am currently attempting to write a map-reduce job where the input data is not in HDFS and cannot be loaded into HDFS basically because the programs using the data cannot use data from HDFS and there is too much to copy it into HDFS, at least 1TB per node.
So I have 4 directories on each of the 4 nodes in my cluster. Ideally I would like my mappers to just receive the paths for these 4 local directories and read them, using something like file:///var/mydata/... and then 1 mapper can work with each directory. i.e. 16 Mappers in total.
However, to be able to do this I need to ensure that I get exactly 4 mappers per node, and exactly the 4 mappers which have been assigned the paths local to that machine. These paths are static and so can be hard-coded into my FileInputFormat and RecordReader, but how do I guarantee that given splits end up on a given node with a known hostname? If it were in HDFS I could use a variant of FileInputFormat, setting isSplittable to false, and Hadoop would take care of it, but as all the data is local this causes issues.
Basically all I want is to be able to crawl local directory structures on every node in my cluster exactly once, process a collection of SSTables in these directories and emit rows (on the mapper), and reduce the results (in the reduce step) into HDFS for further bulk processing.
I noticed that the InputSplits provide a getLocations function, but I believe that this does not guarantee locality of execution, only optimizes for it, and clearly if I try to use file:///some_path in each mapper I need to ensure exact locality, otherwise I may end up reading some directories repeatedly and others not at all.
Any help would be greatly appreciated.
I see there are three ways you can do it.
1.) Simply load the data into HDFS, which you do not want to do. But it is worth trying, as it will be useful for future processing.
2.) You can make use of NLineInputFormat. Create four different files with the URLs of the input files on each of your nodes.
file://192.168.2.3/usr/rags/data/DFile1.xyz
.......
Load these files into HDFS and write your program against them; the mappers then use these URLs to access and process your data. If you use NLineInputFormat with one line per split, you will get 16 mappers, each map processing an exclusive file (a rough driver sketch is shown after option 3 below). The only issue here is that there is a high possibility that the data on one node will be processed by a mapper on another node; however, there will not be any duplicate processing.
3.) You can further optimize the above method by loading the four URL files separately. While loading any one of them, you can take the other three nodes out of service so that the file goes exactly to the node where its data files are locally present. While loading, choose a replication factor of 1 so that the blocks are not replicated. This greatly increases the probability that the maps that are launched process local files.
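A rough driver sketch of option 2, assuming the 16 file:// URLs have been written (one per line) to small files under an HDFS directory such as /urls; all names here are illustrative, not part of the original answer:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LocalDirDriver {

    // Each call receives one line of a URL file, i.e. one local directory to crawl.
    public static class DirMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String localUrl = value.toString();  // e.g. file:///var/mydata/dir1
            // ... open the local directory, parse the SSTables, emit one record per row ...
            ctx.write(new Text(localUrl), new Text("processed"));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "local directory crawl");
        job.setJarByClass(LocalDirDriver.class);
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.addInputPath(job, new Path("/urls"));  // the small files of URLs
        NLineInputFormat.setNumLinesPerSplit(job, 1);           // one URL -> one mapper
        job.setMapperClass(DirMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path("/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that this alone does not pin each mapper to the node where its directory lives, which is the caveat mentioned in option 2; option 3's loading trick is what raises the odds of locality.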
Cheers
Rags