Creating text table from Impala partitioned parquet table - hive

I have a parquet table formatted as follows:
.impala_insert_staging
yearmonth=2013-04
yearmonth=2013-05
yearmonth=2013-06
...
yearmonth=2016-04
Underneath each of these directories are my parquet files. I need to get them into another table which just has a
.impala_insert_staging
file.
Please help.

The best approach I found is to pull the files down locally and then Sqoop them back up into a text table.
To pull the parquet table down I performed the following:
impala-shell -i <ip-addr> -B -q "use default; select * from <table>" -o filename '--output_delimiter=\x1A'
Unfortunately this adds the yearmonth partition value as another column in the output. So I either have to go through my 750GB file and sed/awk out that last column, or use mysqlimport (since I'm using MySQL as well) to import only the columns I'm interested in.
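For stripping that last column, one possible sketch (assuming GNU awk, the \x1A delimiter used above, and a placeholder output file name; worth testing on a small sample before running it over the full 750GB extract):
# drop the trailing yearmonth field, keeping the \x1A delimiter (GNU awk)
awk -F $'\x1A' -v OFS=$'\x1A' '{ NF--; print }' filename > filename_without_yearmonth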
Finally I'll sqoop up the data to a new text table.
sqoop import --connect jdbc:mysql://<mysqlip> --table <mysql_table> -uroot -p<pass> --hive-import --hive-table <new_db_text>

Related

Error when creating a new BigQuery external table with Parquet files on GCS

I was trying to create a BigQuery external table with Parquet files on GCS. It shows a wrong-format error.
But using the same files to create a native table works fine. Why must it be a native table?
If I use a native table, how can I import more data into it? I don't want to delete and recreate the table every time I get new data.
Any help will be appreciated.
This appears to be supported now, at least in beta. This only works in us-central1 as far as I can tell.
Simply select 'External Table' and set 'Parquet' as your file type.
The current Google documentation can be a bit tricky to follow. It is a two-step process: first create a definition file, then use that as the input to create the table.
Creating the definition file, if you are dealing with unpartitioned folders:
bq mkdef \
--source_format=PARQUET \
"<path/to/parquet/folder>/*.parquet" > "<definition/file/path>"
Otherwise, if you are dealing with a Hive-partitioned table:
bq mkdef \
--autodetect \
--source_format=PARQUET \
--hive_partitioning_mode=AUTO \
--hive_partitioning_source_uri_prefix="<path/to/hive/table/folder>" \
"<path/to/hive/table/folder>/*.parquet" > "<definition/file/path>"
Note: path/to/hive/table/folder should not include the partition folder.
E.g., if your table is laid out as
gs://project-name/tablename/year=2009/part-000.parquet
bq mkdef \
--autodetect \
--source_format=PARQUET \
--hive_partitioning_mode=AUTO \
--hive_partitioning_source_uri_prefix="gs://project-name/tablename" \
"gs://project-name/tablename/*.parquet" > "def_file_name"
Finally, the table can be created from the definition file with:
bq mk --external_table_definition="<definition/file/path>" "<project_id>:<dataset>.<table_name>"
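As a quick sanity check, you could then query the new external table; a sketch with placeholder names (standard SQL):
bq query --nouse_legacy_sql 'SELECT COUNT(*) FROM `<project_id>.<dataset>.<table_name>`'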
Parquet is not currently a supported data format for federated tables. You can repeatedly load more data into the same table as long as you append (instead of overwriting) the current contents.
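If you stay with a native table, appending more Parquet data is just a load job; a rough sketch (bq load appends to the existing table unless you pass --replace; names and paths are placeholders):
bq load --source_format=PARQUET "<dataset>.<table_name>" "gs://<bucket>/<path>/*.parquet"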

sqoop export from hive partitioned parquet table to oracle

Is it possible to do a Sqoop export from a partitioned Parquet Hive table to an Oracle database?
Our requirement is to feed processed data to a legacy system that cannot support a Hadoop/Hive connection. Thank you.
I tried:
sqoop export -Dmapreduce.job.queuename=root.hsi_sqm \
--connect jdbc:oracle:thin:@host:1521:sid \
--username abc \
--password cde \
--export-dir '/user/hive/warehouse/stg.db/tb_parquet_w_partition/' \
--table UNIQSUBS_DAY
got error:
ERROR sqoop.Sqoop: Got exception running Sqoop: org.kitesdk.data.DatasetNotFoundException: Descriptor location does not exist: hdfs://nameservice1/user/hive/warehouse/stg.db/tb_parquet_w_partition/.metadata
org.kitesdk.data.DatasetNotFoundException: Descriptor location does not exist: hdfs://nameservice1/user/hive/warehouse/stg.db/tb_parquet_w_partition/.metadata
at org.kitesdk.data.spi.filesystem.FileSystemMetadataProvider.checkExists(FileSystemMetadataProvider.java:562)
at org.kitesdk.data.spi.filesystem.FileSystemMetadataProvider.find(FileSystemMetadataProvider.java:605)
at org.kitesdk.data.spi.filesystem.FileSystemMetadataProvider.load(FileSystemMetadataProvider.java:114)
at org.kitesdk.data.spi.filesystem.FileSystemDatasetRepository.load(FileSystemDatasetRepository.java:197)
at org.kitesdk.data.Datasets.load(Datasets.java:108)
at org.kitesdk.data.Datasets.load(Datasets.java:140)
at org.kitesdk.data.mapreduce.DatasetKeyInputFormat$ConfigBuilder.readFrom(DatasetKeyInputFormat.java:92)
at org.kitesdk.data.mapreduce.DatasetKeyInputFormat$ConfigBuilder.readFrom(DatasetKeyInputFormat.java:139)
at org.apache.sqoop.mapreduce.JdbcExportJob.configureInputFormat(JdbcExportJob.java:84)
at org.apache.sqoop.mapreduce.ExportJobBase.runExport(ExportJobBase.java:432)
at org.apache.sqoop.manager.OracleManager.exportTable(OracleManager.java:465)
at org.apache.sqoop.tool.ExportTool.exportTable(ExportTool.java:80)
at org.apache.sqoop.tool.ExportTool.run(ExportTool.java:99)
at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
Is there any correct approach for this?
We were facing similar issues.
Parquet (via the Kite SDK) creates a .metadata folder. If you created the Parquet data using some other process, it might have created something like .metadata-00000 (or similar).
You can try renaming that folder to .metadata and retrying.
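A minimal sketch of the rename, assuming the warehouse path from the question and that the stray folder really is called .metadata-00000:
hdfs dfs -mv /user/hive/warehouse/stg.db/tb_parquet_w_partition/.metadata-00000 \
/user/hive/warehouse/stg.db/tb_parquet_w_partition/.metadata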
Otherwise, if that does not work, you can try an HCatalog-based Sqoop export.
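A sketch of the HCatalog variant, reusing the connection details from the question and assuming the Hive database is stg (--hcatalog-database/--hcatalog-table take the place of --export-dir):
sqoop export -Dmapreduce.job.queuename=root.hsi_sqm \
--connect jdbc:oracle:thin:@host:1521:sid \
--username abc \
--password cde \
--hcatalog-database stg \
--hcatalog-table tb_parquet_w_partition \
--table UNIQSUBS_DAY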
For those encountering the same problem as me, here is my own solution (this could vary depending on your environment):
Write the Hive data to an HDFS directory; you can use the INSERT OVERWRITE DIRECTORY command in Hive.
If the data generated by the Hive query in the designated HDFS path comes out deflate-compressed, inflate it first:
hdfs dfs -text <hdfs_path_file>/000000_0.deflate | hdfs dfs -put - <hdfs_target_path>/<target_file_name>
Then sqoop export the inflated files using the sqoop export command; don't forget to map your columns according to the data types in the target table. A sketch of the whole flow follows below.
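A rough end-to-end sketch under those assumptions (export directory, delimiter, and database names are placeholders; the Hive step writes plain delimited text, so the deflate step above is only needed if hive.exec.compress.output is enabled):
# 1) dump the partitioned Parquet table to a delimited text directory on HDFS
hive -e "INSERT OVERWRITE DIRECTORY '/tmp/uniqsubs_day_export'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM stg.tb_parquet_w_partition;"
# 2) export the text files to Oracle, mapping columns explicitly if the order differs
sqoop export \
--connect jdbc:oracle:thin:@host:1521:sid \
--username abc \
--password cde \
--table UNIQSUBS_DAY \
--export-dir '/tmp/uniqsubs_day_export' \
--input-fields-terminated-by ','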

issues with sqoop and hive

We are facing the following issues; details are below. Please share your inputs.
1) Issue with the --validate option in Sqoop
If we run the sqoop command without creating a job for it, validate works. But if we create a job first with the validate option, the validation doesn't seem to run.
works with
sqoop import --connect "DB connection" --username $USER --password-file $File_Path --warehouse-dir $TGT_DIR --as-textfile --fields-terminated-by '|' --lines-terminated-by '\n' --table emp_table -m 1 --outdir $HOME/javafiles --validate
Does not work with
sqoop job --create Job_import_emp -- import --connect "DB connection" --username $USER --password-file $File_Path --warehouse-dir $TGT_DIR --as-textfile --fields-terminated-by '|' --lines-terminated-by '\n' --table emp_table -m 1 --outdir $HOME/javafiles --validate
2) Issue with Hive import
If we are importing data into Hive for the first time, we have to create the Hive table (Hive-managed), so we keep "--create-hive-table" in the sqoop command.
Even though I keep the "--create-hive-table" option, is there any way to skip the create-table step during the import if the table already exists?
Thanks
Sheik
Sqoop allows the --validate option only for the sqoop import and sqoop export commands.
From the official Sqoop User Guide, validation has these limitations:
all-tables option
free-form query option
Data imported into Hive or HBase table
import with --where argument
No, the table check cannot be skipped if the --create-hive-table option is set; the job will fail if the target table already exists.
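If the Hive table already exists, one option (a sketch reusing the table name from your command; the connection details are placeholders) is to drop --create-hive-table entirely, so the import loads into the existing table instead of insisting on creating it:
# no --create-hive-table: rows are loaded into the already-existing Hive table
sqoop import --connect "DB connection" --username $USER --password-file $File_Path \
--table emp_table -m 1 \
--hive-import --hive-table emp_table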

Impala: create parquet from mysql dump

I have MySQL dumps; how can I convert them into Parquet file format with Impala?
I know that I can create Parquet files from CSV, but I would like to create the Parquet files directly, without this intermediate step.
I usually use a two-step process, but I'm sure there are better ways. We do it this way to keep the Parquet table online, so the service is interrupted as little as possible during the update.
sqoop import --table <mysql_table> --hive-import --hive-table <hive_text_table>
impala-shell -i <impala_ip_addr> -q 'use <db>; INVALIDATE METADATA <hive_text_table>; CREATE TABLE <parquet_table> LIKE <hive_text_table> STORED AS PARQUET; INSERT OVERWRITE <parquet_table> SELECT * FROM <hive_text_table>;'
A little long-winded, but just in case you don't get any other answers.

Incremental updates in HIVE using SQOOP append data into the middle of the table

I am trying to append the new data from SQLServer to Hive using the following command
sqoop import --connect 'jdbc:sqlserver://10.1.1.12;database=testdb' --username uname --password passwd --table testable --where "ID > 11854" --hive-import --hive-table hivedb.hivetesttable --fields-terminated-by ',' -m 1
This command appends the data.
But when I run
select * from hivetesttable;
it does not show the new data at the end.
This is because the sqoop import statement appending the new data writes the mapper output as part-m-00000-copy.
So my data in the hive table directory looks like
part-m-00000
part-m-00000-copy
part-m-00001
part-m-00002
Is there any way to append the data at the end by changing the name of the mapper output?
Hive, like any other relational database, doesn't guarantee any order unless you explicitly use an ORDER BY clause.
You're correct in your analysis: the reason the data appears in the "middle" is that Hive reads one file after another in lexicographical order, and Sqoop simply names the new files so that they end up somewhere in the middle of that list.
However, the operation is fully valid: Sqoop appended data to the Hive table, and because your query doesn't have an explicit ORDER BY clause, the result carries no guarantees with regard to order. In fact, Hive itself could change this behavior and read files based on creation time without breaking any compatibility.
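If you do need the rows back in a deterministic order, just sort explicitly; a minimal sketch, assuming the ID column from your --where clause is the ordering key:
select * from hivedb.hivetesttable order by id;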
I'm also interested to see how this is affecting your use case. I'm assuming that the query listing all the rows is just a test. Do you have any issues with actual production queries?