I have 2 environments namely Dev and stage. Both has hive installed (same version 2.1). On Dev I have external hive tables pointing to hbase table. I have to export this hive table to stage. No compulsion that hbase table also be migrated. If managed hive table is created with data in it, will be sufficient. Can anyone suggest me how to do this? Below is diagrammatic representation of scenario. Solution to any of the expected scenario will be useful.
I tried:
Dump hive table's data into CSV file and load it into managed hive table on stage. But data have Japanese characters (non-utf8) causing higher row count on stage w.r.t. row count on Dev.
I guess, this is completely theoretical problem so not adding queries. Please let me know if you wish to see queries.
Dev Hive table -> Dev HDFS location -> Distcp -> Stage HDFS location -> Import -> Stage Hive table
You can export the hive table data to an HDFS location using the command below.
INSERT OVERWRITE DIRECTORY 'hdfs_exports_location/department' SELECT * FROM department;
Copy the HDFS data to the stage environment HDFS location using distcp
hadoop distcp <hdfs_export_location>/department hdfs://<stage name node>/<import location>
Import the table from the copied HDFS files
import from '<import location>';
Reference:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ImportExport
Related
I have created an external table that in Hive that uses data from a Parquet store in HDFS.
When the data in HDFS is deleted, there is no data in the table. When the data is inserted again in the same spot in HDFS, the table does not get updated to contain the new data. If I insert new records into the existing table that contains data, no new data is shown when I run my Hive queries.
How I create the table in Hive:
CREATE EXTERNAL TABLE nodes (id string) STORED AS PARQUET LOCATION "/hdfs/nodes";
The relevant error:
Error: java.io.FileNotFoundException: File does not exist: /hdfs/nodes/part-r-00038-2149d17d-f890-48bc-a9dd-5ea07b0ec590.gz.parquet
I have seen several posts that explain that external tables should have the most up to date data in them, such as here. However, this is not the case for me, and I don't know what is happening.
I inserted the same data into the database again, and queried the table. It contained the same amount of data as before. I then created an identical table with a different name. It had twice as much data in it, which was the right amount.
The issue might be with the metastore database. I am using PostgreSQL instead of Derby for the the database.
Relevant information:
Hive 0.13.0
Spark Streaming 1.4.1
PostgreSQL 9.3
CentOS 7
EDIT:
After examining the Parquet files, I found that the part files have seemingly incompatible file names.
-rw-r--r-- 3 hdfs hdfs 18702811 2015-08-27 08:22 /hdfs/nodes/part-r-00000-1670f7a9-9d7c-4206-84b5-e812d1d8fd9a.gz.parquet
-rw-r--r-- 3 hdfs hdfs 18703029 2015-08-26 15:43 /hdfs/nodes/part-r-00000-7251c663-f76e-4903-8c5d-e0c6f61e0192.gz.parquet
-rw-r--r-- 3 hdfs hdfs 18724320 2015-08-27 08:22 /hdfs/nodes/part-r-00001-1670f7a9-9d7c-4206-84b5-e812d1d8fd9a.gz.parquet
-rw-r--r-- 3 hdfs hdfs 18723575 2015-08-26 15:43 /hdfs/nodes/part-r-00001-7251c663-f76e-4903-8c5d-e0c6f61e0192.gz.parquet
These files are the files that causes Hive to error when it can't find it in the error described above. This means that the external table is not acting dynamically, accepting any files in the directory (if you call it that in HDFS), but instead is probably just keeping track of the list of parquet files inside the directory when it was created.
Sample Spark code:
nodes.foreachRDD(rdd => {
if (!rdd.isEmpty())
sqlContext.createDataFrame(rdd.map(
n => Row(n.stuff), ParquetStore.nodeSchema)
.write.mode(SaveMode.Append).parquet(node_name)
})
Where the nodeSchema is the schema and node_name is "/hdfs/nodes"
See my other question about getting Hive external tables to detect new files.
In order to get Hive to update its tables, I had to resort to using the partitioning feature of Hive. By creating a new partition during each Spark run, I create a series of directories internal to the /hdfs/nodes directory like this:
/hdfs/nodes/timestamp=<a-timestamp>/<parquet-files>
/hdfs/nodes/timestamp=<a-different-timestamp>/<parquet-files>
Then, after each Spark job completes, I run the Hive command MSCK REPAIR TABLE nodes using a HiveContext in my Spark job, which finds new partitions and updates the table.
I realize this isn't automatic, but it at least works.
Ok, so probably you need to encapsulate the file in a folder. Hive external table must be mapped on a folder where there could be more than one file.
try to write the file to: /path/to/hdfs/nodes/file
and then map the external table to /path/to/hdfs/nodes
so in the folder nodes you will have only the parquet file and it should works
I have 1 TB data in my HDFS in .csv format. When I load it in my Hive table what will be the total size of data. I mean will there be 2 copies of same data i.e 1 Copy in HDFS and other in Hive table ? Plz clarify. Thanks in advance.
If you create a hive external table, you provide a HDFS location for the table and you store that data into that particular location.
When you create a hive internal table hive create a directory into /apps/hive/warehouse/ directory.
Say, your table name is table1 then your directory will be /apps/hive/warehouse/table1
This directory is also a HDFS directory and when you load data into the table into internal table it goes into its directory.
Hive creates a mapping between table and their corresponding HDFS location and hence when you read the data its reading from the corresponding mapped directory.
Hence there wont be duplicate copy of data corresponding to table and their HDFS location.
But if in your Hadoop cluster Data Replication factor is set to 3(default replication) then it will take 3TB cluster disk space(as you have 1TB data) but there wont be any effect of your hive table data.
Please see below link to know more about Data replication.
http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Data+Replication
It depends whether you are creating an internal or external table in Hive.
If you create an external table in Hive, it will create a mapping on where your data is stored in HDFS and there won't be any duplication at all. Hive will automatically pick the data where ever it is stored in HDFS.
Read more about external tables here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ExternalTables
I am new to Hadoop and I just started working on Hive, I my understanding it provides a query language to process data in HDFS. With HiveQl we can create tables and load data into it from HDFS.
So my question is: where are those tables stored? Specifically if we have 100 GB file in our HDFS and we want to make a hive table out of that data what will be the size of that table and where is it stored?
If my understanding about this concept is wrong please correct me ..
If the table is 100GB you should consider an Hive External Table (as opposed to a "managed table", for the difference, see this).
With an external table the data itself will be still stored on the HDFS in the file path that you specify (note that you may specify a directory of files as long as they all have the same structure), but Hive will create a map of it in the meta-store whereas the managed table will store the data "in Hive".
When you drop a managed table, it drops the underlying data as opposed to dropping a hive external table which only drops the meta-data from the meta-store referencing that data.
Either way you are using only 100GB as viewed by the user and are taking advantage of the HDFS' robustness though duplication of the data.
Hive will create a directory on HDFS. If you didn't specify any location it will create a directory at /user/hive/warehouse on HDFS. After load command the files are moved to the /warehouse/tablename. You can also point to the HDFS directory if it contains partitions (if the files are partitioned), or use external table concept.
I am using hive and hbase to do some analysis on data.
I accidently removed the following file in hdfs
(hdfs) /user/hive/mydata.db
Although all table are existing in hive but when I retrieve data from hive through thrift server to plot it , it shows nothing. How can I rebuild my data. Any guide etc.
Is there a way to export data from Hortonworks Hive to Apache Cassandra without using ETL tools?
You can create a database "export" and a table "myview" inside of it.
create database export;
use export;
create table myview as select <put your query here>
Then use "desc" to get the full directory path:
describe myview.
From the directory path, you can point cassandra to that hdfs location and import it from hdfs.
Disclaimer: this process is only for smaller tables since your "myview" is not partitioned. It is not suitable for large ones that should have partitions defined on them.