Is there a way to export data from Hortonworks Hive to Apache Cassandra without using ETL tools?
You can create a database "export" and a table "myview" inside of it.
create database export;
use export;
create table myview as select <put your query here>;
Then use "describe formatted" to get the table's HDFS location:
describe formatted myview;
From that location, you can point Cassandra at the HDFS directory and import the data from HDFS.
Disclaimer: this process is only for smaller tables since your "myview" is not partitioned. It is not suitable for large ones that should have partitions defined on them.
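For the actual import into Cassandra, one possible sketch (assuming the export is written as comma-delimited text and that a target keyspace and table, here ks.myview, already exist; all paths below are only illustrative) is to pull the files out of HDFS and load them with cqlsh's COPY command:
-- write the view out as CSV on HDFS (Hive 0.11+ syntax)
INSERT OVERWRITE DIRECTORY '/tmp/myview_export'
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  SELECT * FROM export.myview;
# pull the export to the local file system, then load it into Cassandra
hdfs dfs -getmerge /tmp/myview_export /tmp/myview.csv
cqlsh -e "COPY ks.myview FROM '/tmp/myview.csv' WITH DELIMITER = ','"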
I'm trying to read data partitions in S3 from Trino.
What I did exactly:
I uploaded my data with all its partitions into S3. I have an Avro schema, which I put on the local file system.
Then I created an external Hive table pointing to the data location in S3 and to the Avro schema on the local file system.
The table is created.
Then, normally, I can query my data and partitions in S3 from Trino.
Trino>select * from hive.default.my_table;
It returns only the column names.
trino>select * from hive.default."my_table$partitions";
It returns only the names of the partitions.
Could you please suggest a solution so I can read the data partitions in S3 from Trino?
Note that I'm using Apache Hive 2; even when I query the table in Hive to return the table partitions, it returns OK but displays nothing. I think that because of Hive 2 we should use the MSCK command.
In Hive, uploading partition folders and files into S3 and creating the table is not enough; the partition metadata must also be created. Normally you can have folders that are not mounted as partitions. To mount all existing sub-folders in the table location as partitions:
Use the MSCK REPAIR TABLE command:
MSCK [REPAIR] TABLE tablename;
or Amazon EMR version:
ALTER TABLE tablename RECOVER PARTITIONS;
It will create partition metadata in Hive metastore and partitions will become available.
Read more details about both commands here: RECOVER PARTITIONS
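If you only need a few specific partitions mounted, a manual alternative (a sketch, assuming the table is partitioned by a hypothetical dt column and using an illustrative bucket path) is to add them one by one:
-- register a single existing S3 folder as a partition
ALTER TABLE my_table ADD IF NOT EXISTS
  PARTITION (dt='2021-01-01') LOCATION 's3://my-bucket/my_table/dt=2021-01-01/';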
I faced the same issue. Once the table is created, we need to manually sync the partition metadata into the metastore using the Trino procedure below.
CALL system.sync_partition_metadata('<schema>', '<table>', 'ADD');
Ref.: https://trino.io/episodes/5.html
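For example, assuming the Hive catalog is registered as hive and the table lives in the default schema, the call looks roughly like this:
-- mode 'ADD' registers new partition folders; 'FULL' also drops partitions whose folders are gone
CALL hive.system.sync_partition_metadata('default', 'my_table', 'ADD');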
I have 2 environments, namely Dev and Stage. Both have Hive installed (same version, 2.1). On Dev I have external Hive tables pointing to an HBase table. I have to export this Hive table to Stage. There is no compulsion that the HBase table also be migrated; if a managed Hive table is created with the data in it, that will be sufficient. Can anyone suggest how to do this? Below is a diagrammatic representation of the scenario. A solution to any of the expected scenarios will be useful.
I tried:
Dumping the Hive table's data into a CSV file and loading it into a managed Hive table on Stage. But the data has Japanese characters (non-UTF-8), causing a higher row count on Stage compared to the row count on Dev.
I guess this is a completely theoretical problem, so I am not adding the queries. Please let me know if you wish to see them.
Dev Hive table -> Dev HDFS location -> Distcp -> Stage HDFS location -> Import -> Stage Hive table
You can export the Hive table (data plus metadata) to an HDFS location using the EXPORT command below.
EXPORT TABLE department TO 'hdfs_exports_location/department';
Copy the HDFS data to the stage environment HDFS location using distcp
hadoop distcp <hdfs_export_location>/department hdfs://<stage name node>/<import location>
Import the table from the copied HDFS files
import from '<import location>';
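If you want the table to land under a different name on Stage, the IMPORT statement also accepts an explicit target table (department_stage below is just an illustrative name):
-- import the exported dump into a new table name
IMPORT TABLE department_stage FROM '<import location>';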
Reference:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ImportExport
I have just started reading about Hive and I have a doubt. When I create a database called 'xyz' in Hive, it creates a folder 'xyz.db'. However, Hive uses metastore_db to store the table schema. So what is the use of this 'xyz.db' folder?
It is the default directory where the data files for the tables are stored on HDFS.
metastore_db is an external database (MySQL, Postgres, Derby, etc.) which stores the table schema used to read the files in xyz.db.
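If you want to see exactly where the xyz.db folder lives, you can ask Hive directly, for example:
-- prints the database's HDFS location (typically .../user/hive/warehouse/xyz.db)
DESCRIBE DATABASE xyz;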
I am new to Hadoop and I have just started working on Hive. In my understanding, it provides a query language to process data in HDFS. With HiveQL we can create tables and load data into them from HDFS.
So my question is: where are those tables stored? Specifically, if we have a 100 GB file in our HDFS and we want to make a Hive table out of that data, what will be the size of that table and where will it be stored?
If my understanding of this concept is wrong, please correct me.
If the table is 100 GB you should consider a Hive external table (as opposed to a "managed table"; for the difference, see this).
With an external table the data itself will still be stored on HDFS in the file path that you specify (note that you may specify a directory of files as long as they all have the same structure), but Hive will create a mapping of it in the metastore, whereas the managed table will store the data "in Hive".
When you drop a managed table, it drops the underlying data, as opposed to dropping a Hive external table, which only drops the metadata in the metastore referencing that data.
Either way you are using only 100 GB as viewed by the user, and you are taking advantage of HDFS's robustness through replication of the data.
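A minimal sketch of such an external table over an existing HDFS directory (the column names, types and the /data/mytable path are only illustrative):
-- Hive records only the schema and location; dropping this table leaves the files in place
CREATE EXTERNAL TABLE mytable (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/mytable';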
Hive will create a directory on HDFS. If you did not specify any location, it will create the directory under /user/hive/warehouse on HDFS. After the LOAD command, the files are moved to /user/hive/warehouse/tablename. You can also point to an existing HDFS directory if it contains partitions (if the files are partitioned), or use the external table concept.
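For example, loading a file into a managed table moves it under the warehouse directory (the paths below are only illustrative):
-- the source file is moved, not copied; it disappears from its original location
LOAD DATA INPATH '/user/me/input/data.csv' INTO TABLE mytable;
-- afterwards the data sits under /user/hive/warehouse/mytable/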
Can I import CSV or other flat files into Hive without first creating and defining the table structure in Hive? Say my CSV file has 200 columns and needs to be imported into a Hive table. Currently I have to first create a table in Hive, define all the column names and data types within that Hive table, and then import. Is there any way in which I can directly import into Hive and have it automatically create the table structure from the first line, say, similar to a Sqoop import?
Use Sqoop with the "--hive-import" switch and it will create your table for you: http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html#_importing_data_into_hive
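A rough sketch of such a Sqoop invocation (the connection details are hypothetical); Sqoop reads the column names and types from the source database and creates the Hive table itself:
# import a table from MySQL and create the matching Hive table automatically
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --table mytable \
  --hive-import \
  --create-hive-table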
Check your hive-site.xml for the value of the property javax.jdo.option.ConnectionURL. If you do not define this explicitly, the default value will use a relative path for creation of the Hive metastore (jdbc:derby:;databaseName=metastore_db;create=true), which will be different depending upon where you launch the process from. This would explain why you cannot see the table via show tables. The way to overcome it is to define this property value in your hive-site.xml using an absolute path.
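For example (the /home/hiveuser path below is only an illustration of an absolute path), the property can be set in hive-site.xml like this:
<property>
  <!-- point the embedded Derby metastore at a fixed, absolute location -->
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/home/hiveuser/metastore_db;create=true</value>
</property>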