Impala: create parquet from mysql dump

I have MySQL dumps; how can I convert them to the Parquet file format with Impala?
I know that I can create Parquet files from CSV, but I would like to create the Parquet files directly, without this intermediate step.

I usually use a two-step process, but I'm sure there are better ways. We do it this way to keep the Parquet table online, so that service is interrupted as little as possible during the update.
sqoop import --table <mysql_table> --hive-import --hive-table <hive_text_table>
impala-shell -i <impala_ip_addr> -q 'use <db>; INVALIDATE METADATA <hive_text_table>; CREATE TABLE <parquet_table> LIKE <hive_text_table> STORED AS PARQUET; INSERT OVERWRITE <parquet_table> SELECT * FROM <hive_text_table>;'
A little long-winded, but just in case you don't get any other answers.
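For what it's worth, more recent Sqoop releases (1.4.6 and later) also have an --as-parquetfile option that can write Parquet directly during the import, which may let you skip the text staging table entirely. Whether it behaves well together with --hive-import depends on your Sqoop/CDH version, so treat the following as an untested sketch rather than a recipe; the host, database, and table names are placeholders:
sqoop import --connect jdbc:mysql://<mysql_host>/<mysql_db> --username <user> -P \
  --table <mysql_table> --hive-import --hive-table <parquet_table> --as-parquetfile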

Related

sqoop export from hive partitioned parquet table to oracle

Is it possible to do a Sqoop export from a partitioned Parquet Hive table to an Oracle database?
Our requirement is to push processed data to a legacy system that cannot support a Hadoop/Hive connection. Thank you.
I tried:
sqoop export -Dmapreduce.job.queuename=root.hsi_sqm \
--connect jdbc:oracle:thin:#host:1521:sid \
--username abc \
--password cde \
--export-dir '/user/hive/warehouse/stg.db/tb_parquet_w_partition/' \
--table UNIQSUBS_DAY
got error:
ERROR sqoop.Sqoop: Got exception running Sqoop: org.kitesdk.data.DatasetNotFoundException: Descriptor location does not exist: hdfs://nameservice1/user/hive/warehouse/stg.db/tb_parquet_w_partition/.metadata
org.kitesdk.data.DatasetNotFoundException: Descriptor location does not exist: hdfs://nameservice1/user/hive/warehouse/stg.db/tb_parquet_w_partition/.metadata
at org.kitesdk.data.spi.filesystem.FileSystemMetadataProvider.checkExists(FileSystemMetadataProvider.java:562)
at org.kitesdk.data.spi.filesystem.FileSystemMetadataProvider.find(FileSystemMetadataProvider.java:605)
at org.kitesdk.data.spi.filesystem.FileSystemMetadataProvider.load(FileSystemMetadataProvider.java:114)
at org.kitesdk.data.spi.filesystem.FileSystemDatasetRepository.load(FileSystemDatasetRepository.java:197)
at org.kitesdk.data.Datasets.load(Datasets.java:108)
at org.kitesdk.data.Datasets.load(Datasets.java:140)
at org.kitesdk.data.mapreduce.DatasetKeyInputFormat$ConfigBuilder.readFrom(DatasetKeyInputFormat.java:92)
at org.kitesdk.data.mapreduce.DatasetKeyInputFormat$ConfigBuilder.readFrom(DatasetKeyInputFormat.java:139)
at org.apache.sqoop.mapreduce.JdbcExportJob.configureInputFormat(JdbcExportJob.java:84)
at org.apache.sqoop.mapreduce.ExportJobBase.runExport(ExportJobBase.java:432)
at org.apache.sqoop.manager.OracleManager.exportTable(OracleManager.java:465)
at org.apache.sqoop.tool.ExportTool.exportTable(ExportTool.java:80)
at org.apache.sqoop.tool.ExportTool.run(ExportTool.java:99)
at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
Is there any correct approach for this?
We were facing similar issues.
Sqoop's Parquet export path (Kite) expects a .metadata folder inside the dataset directory. If you created the Parquet data with some other process, it might have produced something like .metadata-00000 (or similar) instead.
You can try renaming that folder to .metadata and running the export again.
Otherwise, if that does not work, you can try a Sqoop export through HCatalog; a sketch follows.
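A minimal sketch of the HCatalog route, reusing the connection details from the question above and assuming the table lives in the Hive database stg. HCatalog reads the table through the metastore, so the .metadata descriptor that Kite looks for is not needed:
sqoop export -Dmapreduce.job.queuename=root.hsi_sqm \
  --connect jdbc:oracle:thin:#host:1521:sid \
  --username abc \
  --password cde \
  --table UNIQSUBS_DAY \
  --hcatalog-database stg \
  --hcatalog-table tb_parquet_w_partition
Note that --export-dir is dropped here; with HCatalog, Sqoop locates the data through the Hive metastore instead.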
For those encountering the same problem as me, here is my own solution (this can vary depending on your environment):
1. Write the Hive data to an HDFS directory; you can use the INSERT OVERWRITE DIRECTORY command in Hive.
2. If the data generated by the Hive query in the designated HDFS path is deflate-compressed, inflate it first:
hdfs dfs -text <hdfs_path_file>/000000_0.deflate | hdfs dfs -put - <hdfs_target_path>/<target_file_name>
3. Sqoop export the inflated files with the sqoop export command; don't forget to map your columns according to the data types in the target table (see the sketch below).
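A hedged sketch of that last step, reusing the Oracle details from the question; the field terminator and column list are assumptions, so adjust them to whatever your INSERT OVERWRITE DIRECTORY wrote and to the target table's column order:
sqoop export \
  --connect jdbc:oracle:thin:#host:1521:sid \
  --username abc \
  --password cde \
  --table UNIQSUBS_DAY \
  --export-dir <hdfs_target_path> \
  --input-fields-terminated-by '\001' \
  --columns "<col1>,<col2>,<col3>"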

Impala import data from an SQL file

I have .sql files in HDFS which contain MySQL INSERT queries. I created an external table in Impala with the appropriate schema. Now I want to run all the INSERT commands stored in HDFS against the table. Is there a way to do it? The data size is 900 GB!
Is there any other way to import the .sql files from HDFS? We have tried with Hive, but it requires all the insert fields to be in lowercase, which ours are not.

Sqoop import hive ORC

All,
I have a question about sqooping: I am sqooping around 2 TB of data for one table and then need to write an ORC table with it. What's the best way to achieve this?
1) Sqoop all the data into dir1 as text and write HQL to load it into an ORC table (the script fails with a vertex issue)
2) Sqoop the data in chunks, then process and append it into the Hive table (have you done this?)
3) Sqoop hive import to write all the data straight into a Hive ORC table
Which is the best way?
Option three will be better, because you don't need to create a Hive text table, load the data into it, and then rewrite it in ORC format; that is a long process for 2 TB of data. It's better to hand it to Sqoop so it can push the data directly into a Hive table stored as ORC (a sketch follows). Note that when you move data from the Hive table back to an RDBMS, you have to use the Sqoop serde.
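A hedged sketch of option 3 using Sqoop's HCatalog integration, which can create the Hive table as ORC and load it in a single pass; the connection string, table names, and mapper count are placeholders, and the storage stanza assumes HCatalog is available on the cluster:
sqoop import --connect <jdbc_connection_string> --username <user> -P \
  --table <source_table> \
  --hcatalog-database <hive_db> \
  --hcatalog-table <orc_table> \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as orcfile" \
  -m <num_mappers>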

Creating text table from Impala partitioned parquet table

I have a parquet table formatted as follows:
.impala_insert_staging
yearmonth=2013-04
yearmonth=2013-05
yearmonth=2013-06
...
yearmonth=2016-04
Underneath each of these directories are my parquet files. I need to get them into another table which just has a
.impala_insert_staging
file.
Please help.
The best approach I found is to pull the files down locally and sqoop them back up into a text table.
To pull the parquet table down I performed the following:
impala-shell -i <ip-addr> -B -q "use default; select * from <table>" -o filename '--output_delimiter=\x1A'
Unfortunately this adds the yearmonth value as another column on my table. So I either go into my 750GB file and sed/awk out that last column (a rough sketch of that follows), or use mysqlimport (since I'm using MySQL as well) to import only the columns I'm interested in.
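A rough awk sketch for that, assuming the \x1A output delimiter used above and a hypothetical output file name; it simply drops the last (yearmonth) field from every row:
awk -F $'\x1a' '{ line = $1; for (i = 2; i < NF; i++) line = line FS $i; print line }' filename > filename_trimmed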
Finally I'll sqoop up the data to a new text table.
sqoop import --connect jdbc:mysql://<mysqlip> --table <mysql_table> -uroot -p<pass> --hive-import --hive-table <new_db_text>

How to save the results of an impala query

I've loaded a large set of data from S3 into hdfs, and then inserted the data to a table in impala.
I then ran a query against this data, and I'm looking to get these results back into S3.
I'm using Amazon EMR, with Impala 1.2.4. If it's not possible to get the results of the query back to S3 directly, are there options to get the data back to HDFS and then somehow send it back to S3 from there?
I have messed around with the impala-shell -o filename option, but that appears to only work on the local Linux file system.
I thought this would have been a common scenario, but I'm having trouble finding any information about saving the results of a query anywhere.
Any pointers appreciated.
To add to the knowledge above, here is the command that writes the query results to a file with the delimiter declared via the --output_delimiter option, together with --delimited, which turns on delimited output (the default tab delimiter is overridden by --output_delimiter):
impala-shell -q "<query>" --delimited --output_delimiter='\001' --print_header -o filename
What I usually do if it's a smallish result set is run the script from the command line then upload to s3 using the AWS command line tool:
impala-shell -e "select ble from bla" -o filename
aws s3 cp filename s3://mybucket/filename
An alternative is to use Hive as the last step in your data pipeline after you've run your query in Impala:
1. Impala step:
create table processed_data
as
select blah
--do whatever else you need to do in here
from raw_data1
join raw_data2 on a=b
2. Hive step:
create external table export
like processed_data
location 's3://mybucket/export/';
insert into table export
select * from processed_data;
If you have the AWS CLI installed, you can use a Unix pipe and stream the standard output of impala-shell straight to S3: aws s3 cp accepts - to read from stdin, so no intermediate file is needed. A sketch follows.
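A minimal sketch of that pipe, assuming the AWS CLI is already configured and that the bucket and output file name below are placeholders; -B puts impala-shell into plain delimited output mode and the trailing - tells aws s3 cp to read from stdin:
impala-shell -B -q "select blah from processed_data" --output_delimiter=',' | aws s3 cp - s3://mybucket/export/results.csv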