I have a hive table to which data gets added every day.
So, around 5 files get added each day.
Now we ended up having 800 part files under this table.
The issue I have is that joining or otherwise using this table triggers 800 mappers, since the number of mappers is proportional to the number of files.
But I do have to use the entire table for my jobs.
Is there a way to use the entire table without triggering so many mappers?
Files look like below
-rw-rw-r-- 3 XXXX hdfs 106610 2015-12-15 05:39 /apps/hive/warehouse/prod.db/TABLE1/000000_0_copy_1.deflate
-rw-rw-r-- 3 XXXX hdfs 106602 2015-12-23 12:31 /apps/hive/warehouse/prod.db/TABLE1/000000_0_copy_10.deflate
-rw-rw-r-- 3 XXXX hdfs 157686 2016-03-06 05:20 /apps/hive/warehouse/prod.db/TABLE1/000000_0_copy_100.deflate
-rw-rw-r-- 3 XXXX hdfs 163580 2016-03-07 05:22 /apps/hive/warehouse/prod.db/TABLE1/000000_0_copy_101.deflate
I would prefer to partition the table so that the data is stored in partition directories; then, whenever a query filters on the partition columns, only the files under the matching partitions are read, and only that many mappers are triggered.
The other option is to bucket the table (CLUSTERED BY in the DDL) to distribute the data into a fixed number of bucket files, reducing the number of files that are accessed while querying.
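A rough HiveQL sketch of both approaches; the table and column names here are made up, since the real schema isn't shown in the question:

-- Hypothetical daily-partitioned version of the table.
CREATE TABLE table1_partitioned (
  id  STRING,
  val STRING
)
PARTITIONED BY (load_date STRING);

-- Each day's load goes into its own partition directory, so a query that filters
-- on load_date only reads (and maps over) the files under the matching partitions.
INSERT INTO TABLE table1_partitioned PARTITION (load_date = '2016-03-07')
SELECT id, val FROM table1;

-- Alternatively, bucket the table into a fixed number of files.
CREATE TABLE table1_bucketed (
  id  STRING,
  val STRING
)
CLUSTERED BY (id) INTO 32 BUCKETS;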
Related
As per my understanding of Hive concepts, if we load a dataset into a Hive table, the data file is moved from the source path to the Hive warehouse within HDFS, and HDFS is set to keep three replicas of the data.
These questions might look silly, but as I am a beginner I want to clear my doubts.
My questions are:
1) If I delete the Hive table, will it delete the data file from the Hive warehouse only, or the other two replicas in HDFS as well?
2) If we run a query on the Hive table, will that query be executed as distributed processing?
Say one data file is 1 GB in size (in terms of 8 blocks x 128 MB); with a replication factor of three, there would be 24 blocks in total for this file.
Will our Hive query be distributed among all 24 blocks, or processed only on the blocks of the Hive warehouse copy?
Thanks in advance.
If you do "load data inpath" from a HDFS path the data will be moved from source to destination HDFS path,
If you do "load data local inpath", it doesn't move data from local to HDFS path, instead it copies
For your questions:
1) If you delete a file in HDFS, all of its replicas are deleted; so when dropping a managed table removes its file from the warehouse, all the replicas go with it.
2) If you have a 1 GB file (8 blocks) with a replication factor of 3, then when you run the query in the Hive CLI it is converted to a MapReduce job. That job processes only the 8 blocks, not all 24; if a datanode involved in the job fails, the affected task reads another replica of the block on a different node and processes that data.
So I need to create an external table for some data stored on S3 and add partitions explicitly (unfortunately, the directory hierarchy does not fit the dynamic partition functionality because the directory names don't match the partition column names).
For example:
add a partition for region: euwest1, year: 2018, month: 01, day: 18, hour: 18 at s3://mybucket/mydata/euwest1/YYYY=2018/MM=01/dd=18/HH=18/
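A sketch of the corresponding statement, assuming a hypothetical table name my_table with partition columns (region, year, month, day, hour); several partitions can also be added in one statement, which may be faster than issuing one ALTER TABLE per partition:

-- Add a single partition, pointing at the mismatched S3 directory.
ALTER TABLE my_table ADD IF NOT EXISTS
  PARTITION (region='euwest1', year='2018', month='01', day='18', hour='18')
  LOCATION 's3://mybucket/mydata/euwest1/YYYY=2018/MM=01/dd=18/HH=18/';

-- Add several partitions in one statement.
ALTER TABLE my_table ADD IF NOT EXISTS
  PARTITION (region='euwest1', year='2018', month='01', day='18', hour='18')
  LOCATION 's3://mybucket/mydata/euwest1/YYYY=2018/MM=01/dd=18/HH=18/'
  PARTITION (region='euwest1', year='2018', month='01', day='18', hour='19')
  LOCATION 's3://mybucket/mydata/euwest1/YYYY=2018/MM=01/dd=18/HH=19/';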
I ran this on an EMR cluster with Hive 2.3.2 and instance type r4.2xlarge, which has 8 vCPUs and 61 GB of RAM.
It takes about 4 seconds to add one partition, which isn't too bad, but if we need to process multiple days of data then adding partitions one at a time takes a long time.
Is there any way to make this process faster?
Thanks
In one of my use cases, several tables were created out of a bunch of CSV files. Each CSV file was about 50-80 MB. Each table is configured to contain 2 buckets, and the tables are stored in ORC format. However, when I look in the Hive warehouse directory in HDFS, each table is only about 4-5 MB. I have already brought the Hive block size down from its default to 64 MB. My concern is that small files in HDFS put pressure on the NameNode. Is a similar issue caused by small Hive tables? Can I bring the Hive block size down further?
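For reference, a hypothetical version of such a table definition; the names and property values are illustrative only and not taken from the question:

CREATE TABLE events (
  id      STRING,
  payload STRING
)
CLUSTERED BY (id) INTO 2 BUCKETS
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB',
               'orc.stripe.size' = '67108864');  -- 64 MB stripes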
I have created an external table in Hive that uses data from a Parquet store in HDFS.
When the data in HDFS is deleted, there is no data in the table. When the data is inserted again in the same spot in HDFS, the table does not get updated to contain the new data. If I insert new records into the existing table that contains data, no new data is shown when I run my Hive queries.
How I create the table in Hive:
CREATE EXTERNAL TABLE nodes (id string) STORED AS PARQUET LOCATION "/hdfs/nodes";
The relevant error:
Error: java.io.FileNotFoundException: File does not exist: /hdfs/nodes/part-r-00038-2149d17d-f890-48bc-a9dd-5ea07b0ec590.gz.parquet
I have seen several posts that explain that external tables should have the most up to date data in them, such as here. However, this is not the case for me, and I don't know what is happening.
I inserted the same data into the database again, and queried the table. It contained the same amount of data as before. I then created an identical table with a different name. It had twice as much data in it, which was the right amount.
The issue might be with the metastore database. I am using PostgreSQL instead of Derby for the database.
Relevant information:
Hive 0.13.0
Spark Streaming 1.4.1
PostgreSQL 9.3
CentOS 7
EDIT:
After examining the Parquet files, I found that the part files have seemingly incompatible file names.
-rw-r--r-- 3 hdfs hdfs 18702811 2015-08-27 08:22 /hdfs/nodes/part-r-00000-1670f7a9-9d7c-4206-84b5-e812d1d8fd9a.gz.parquet
-rw-r--r-- 3 hdfs hdfs 18703029 2015-08-26 15:43 /hdfs/nodes/part-r-00000-7251c663-f76e-4903-8c5d-e0c6f61e0192.gz.parquet
-rw-r--r-- 3 hdfs hdfs 18724320 2015-08-27 08:22 /hdfs/nodes/part-r-00001-1670f7a9-9d7c-4206-84b5-e812d1d8fd9a.gz.parquet
-rw-r--r-- 3 hdfs hdfs 18723575 2015-08-26 15:43 /hdfs/nodes/part-r-00001-7251c663-f76e-4903-8c5d-e0c6f61e0192.gz.parquet
These are the files that cause the FileNotFoundException described above when Hive can't find them. This means that the external table is not acting dynamically, accepting any files in the directory (if you call it that in HDFS), but is probably just keeping track of the list of Parquet files that were inside the directory when it was created.
Sample Spark code:
import org.apache.spark.sql.{Row, SaveMode}

nodes.foreachRDD(rdd => {
  if (!rdd.isEmpty()) {
    // Build a DataFrame from the RDD and append it as Parquet files under node_name.
    sqlContext.createDataFrame(rdd.map(n => Row(n.stuff)), ParquetStore.nodeSchema)
      .write.mode(SaveMode.Append).parquet(node_name)
  }
})
Where the nodeSchema is the schema and node_name is "/hdfs/nodes"
See my other question about getting Hive external tables to detect new files.
In order to get Hive to update its tables, I had to resort to using the partitioning feature of Hive. By creating a new partition during each Spark run, I create a series of directories internal to the /hdfs/nodes directory like this:
/hdfs/nodes/timestamp=<a-timestamp>/<parquet-files>
/hdfs/nodes/timestamp=<a-different-timestamp>/<parquet-files>
Then, after each Spark job completes, I run the Hive command MSCK REPAIR TABLE nodes using a HiveContext in my Spark job, which finds new partitions and updates the table.
I realize this isn't automatic, but it at least works.
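A rough HiveQL sketch of that setup, reusing the nodes table and /hdfs/nodes location from the question; the timestamp partition column matches the directory names above:

-- External table partitioned by the timestamp directory component.
CREATE EXTERNAL TABLE nodes (id string)
PARTITIONED BY (`timestamp` string)
STORED AS PARQUET
LOCATION '/hdfs/nodes';

-- After each Spark run writes /hdfs/nodes/timestamp=<a-timestamp>/..., register the
-- new partition directories (e.g. via hiveContext.sql in the Spark job):
MSCK REPAIR TABLE nodes;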
OK, so you probably need to encapsulate the file in a folder. A Hive external table must be mapped to a folder, which can contain more than one file.
Try writing the file to /path/to/hdfs/nodes/file
and then mapping the external table to /path/to/hdfs/nodes.
That way the nodes folder will contain only the Parquet files, and it should work.
I am using HSQLDB 2.3.0. I have a database with the following schema:
CREATE TABLE MEASUREMENT (ID INTEGER NOT NULL PRIMARY KEY IDENTITY, OBJ CLOB);
When I fill this table with test data, the LOBS file in my database grows:
ls -lath
-rw-rw-r-- 1 hsqldb hsqldb 35 May 6 16:37 msdb.log
-rw-rw-r-- 1 hsqldb hsqldb 85 May 6 16:37 msdb.properties
-rw-rw-r-- 1 hsqldb hsqldb 16 May 6 16:37 msdb.lck
drwxrwxr-x 2 hsqldb hsqldb 4.0K May 6 16:37 msdb.tmp
-rw-rw-r-- 1 hsqldb hsqldb 1.6M May 6 16:37 msdb.script
-rw-rw-r-- 1 hsqldb hsqldb 625M May 6 16:35 msdb.lobs
After running the following commands:
TRUNCATE SCHEMA public AND COMMIT;
CHECKPOINT DEFRAG;
SHUTDOWN COMPACT;
The lobs file is still the same size:
-rw-rw-r-- 1 hsqldb hsqldb 84 May 6 16:44 msdb.properties
-rw-rw-r-- 1 hsqldb hsqldb 1.6M May 6 16:44 msdb.script
-rw-rw-r-- 1 hsqldb hsqldb 625M May 6 16:35 msdb.lobs
What is the best way to truncate the schema and get all the disk space back?
I have an application with the same problem using HSQLDB 2.3.3. The .lobs file seems to be growing indefinitely, even after calling CHECKPOINT DEFRAG. My scenario is that I'm inserting 1000 blobs of 300 bytes each. I'm periodically deleting them all and inserting 1000 new blobs of about the same size. After a number of rounds of this, my .lobs file is now 1.3 GB in size, but it is really just storing around 300 kB of data. In spite of calling CHECKPOINT DEFRAG, the .lobs file just grows and grows. Is this behaviour a bug?
The database engine is designed for continuous use in real applications. If you have an application that uses lobs and deletes some of them, the space will be reused for future lobs after each checkpoint.
In normal application use, the DELETE statement is used to delete rows. This statement deallocates the lob space for reuse after each checkpoint.
You can design your tests so that they recreate the database, rather than reusing the old database after removing the data.
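A minimal sketch of the delete-and-reuse pattern described above, using the MEASUREMENT table from the question:

-- DELETE marks the lob space as reusable; the space is reclaimed for future lobs
-- after the next checkpoint, so the .lobs file stops growing across rounds.
DELETE FROM MEASUREMENT;
CHECKPOINT;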