In my daily operation I import data from PostgreSQL to HDFS (sqoop import) and then copy it from HDFS to S3 (distcp).
I wanted to remove the intermediate HDFS step and import directly into an S3 bucket using Sqoop.
However, the same Sqoop command fails at the end of the import operation:
sqoop import \
-Dmapreduce.map.memory.mb="8192" \
-Dmapreduce.map.java.opts="-Xmx7200m" \
-Dmapreduce.task.timeout=0 \
-Dmapreduce.task.io.sort.mb="2400" \
--connect $conn_string$ \
--fetch-size=20000 \
--username $user_name$ \
--p $password$ \
--num-mappers 20 \
--query "SELECT * FROM table1 WHERE table1.id > 10000000 and table1.id < 20000000 and \$CONDITIONS" \
--hcatalog-database $schema_name$ \
--hcatalog-table $table_name$ \
--hcatalog-storage-stanza "STORED AS PARQUET LOCATION s3a://path/to/destination" \
--split-by table1.id
I also tried --target-dir s3a://path/to/destination instead of the ... LOCATION s3a://path/to/destination storage stanza.
After "mapping: %100 completed" it throws error message below:
Error: java.io.IOException: Could not clean up TaskAttemptID:attempt_1571557098082_15536_m_000004_0#s3a://path/to/destination_DYN0.6894861001907555/ingest_day=__HIVE_DEFAULT_PARTITION__
at org.apache.hive.hcatalog.mapreduce.TaskCommitContextRegistry.commitTask(TaskCommitContextRegistry.java:83)
at org.apache.hive.hcatalog.mapreduce.FileOutputCommitterContainer.commitTask(FileOutputCommitterContainer.java:145)
at org.apache.hadoop.mapred.Task.commit(Task.java:1200)
at org.apache.hadoop.mapred.Task.done(Task.java:1062)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:345)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
Caused by: java.io.IOException: Could not rename
s3a://path/to/destination/_DYN0.6894861001907555/ingest_day=20180522/_temporary/1/_temporary/attempt_1571557098082_15536_m_000004_0
to
s3a://path/to/destination/_DYN0.6894861001907555/ingest_day=20180522/_temporary/1/task_1571557098082_15536_m_000004
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:579)
at org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:172)
at org.apache.hadoop.mapred.OutputCommitter.commitTask(OutputCommitter.java:343)
at org.apache.hive.hcatalog.mapreduce.DynamicPartitionFileRecordWriterContainer$1.commitTask(DynamicPartitionFileRecordWriterContainer.java:125)
at org.apache.hive.hcatalog.mapreduce.TaskCommitContextRegistry.commitTask(TaskCommitContextRegistry.java:80)
... 9 more
Sqoop tries to rename the S3 destination after the import job completes, which is not possible given the architecture of S3: it is object storage and has no real directory rename.
I found out that this issue has been resolved in Sqoop 1.4.7.
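For reference, here is a minimal sketch of a direct-to-S3 import that skips the HCatalog dynamic-partition commit path (the part that performs the failing rename). Treat it as a sketch only: the S3A credential properties and the --as-parquetfile/--target-dir combination are assumptions about your setup, not a verified recipe, and it needs Sqoop 1.4.7+ on a Hadoop build with S3A support.
sqoop import \
-Dfs.s3a.access.key=<access_key> \
-Dfs.s3a.secret.key=<secret_key> \
--connect $conn_string$ \
--username $user_name$ \
-P \
--query "SELECT * FROM table1 WHERE table1.id > 10000000 and table1.id < 20000000 and \$CONDITIONS" \
--split-by table1.id \
--num-mappers 20 \
--as-parquetfile \
--target-dir s3a://path/to/destination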
I am trying to import data from Teradata into an HDFS location.
I only have access to a view in that database, so I created a staging table in another database. But when I try to run the command it throws this error:
Error: Running Sqoop version: 1.4.6.2.6.5.0-292
18/12/23 21:49:41 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
18/12/23 21:49:41 ERROR tool.BaseSqoopTool: Error parsing arguments for import:staging-table, t_hit_data_01_staging, –clear-staging-table, --query, select * from table1 where cast(date1 as Date) <= date '2017-09-02' and $CONDITIONS, --target-dir, <>, --split-by, date1, -m, 25
I have given the staging table details in the command and run it, but it throws the error above (the arguments fail to parse for import, and the staging-table options are reported as unrecognized arguments).
sqoop import \
--connect jdbc:teradata://<server_link>/Database=db01 \
--connection-manager org.apache.sqoop.teradata.TeradataConnManager \
--username <UN> \
--password <PWD> \
–-staging-table db02.table1_staging –clear-staging-table \
--query "select * from table1 where cast(date1 as Date) <= date '2017-09-02' and \$CONDITIONS " \
--target-dir '<hdfs location>' \
--split-by date1 -m 25
The data should be loaded into the HDFS location using the staging table in the other Teradata database. Then, when the WHERE clause is changed later, Sqoop should create another file under the same HDFS folder, e.g. part-0000, then part-0001, and so on.
I don't think there is a staging option available for the import command; --staging-table and --clear-staging-table are export options (see the sketch after the link below).
https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html
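For what it's worth, here is a sketch of the same import with the two staging options removed; everything else is unchanged from the command in the question:
sqoop import \
--connect jdbc:teradata://<server_link>/Database=db01 \
--connection-manager org.apache.sqoop.teradata.TeradataConnManager \
--username <UN> \
--password <PWD> \
--query "select * from table1 where cast(date1 as Date) <= date '2017-09-02' and \$CONDITIONS" \
--target-dir '<hdfs location>' \
--split-by date1 \
-m 25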
I am new to Hive and am attempting to export the results of a Hive query to a local file on my computer so that I can import the results into Excel.
When I run this from inside Hive:
hive -e select * from TABLE limit 10'>output.txt;
I get "FAILED: ParseException line 1:0 cannot recognize input near 'hive' '-' 'e'"
when I do
hive -S -e "USE DATABASE; select * from TABLE limit 10" > /tmp/test/test.csv;
from shell OR
insert overwrite local directory '/tmp/hello'
select * from TABLE limit 10;
It goes to the HDFS system in Hive -- how do I get this onto my local machine?
You can export the query results to a CSV file like this:
hive -e 'select * from your_Table' > /home/yourfile.csv
To get this file onto your local machine, copy it out of HDFS:
hdfs dfs -get /tmp/hello /PATHinLocalMachine
You are seeing the error because you are running the hive -e command inside the Hive REPL, as shown below:
hive (venkat)> hive -e 'select * from a';
NoViableAltException(26@[])
at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1084)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:202)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:437)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:320)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1219)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1260)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1156)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1146)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:216)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:168)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:379)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:739)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:684)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:624)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:233)
at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
FAILED: ParseException line 1:0 cannot recognize input near 'hive' '-' 'e'
You have to do it in the OS shell, as shown below:
[venkata_udamala@gw02 ~]$ hive -e 'use database_name;select * from table_name;' > temp.txt
I am trying to read data stored in Kudu using PySpark 2.1.0:
>>> from os.path import expanduser, join, abspath
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql import Row
>>> spark = SparkSession.builder \
.master("local") \
.appName("HivePyspark") \
.config("hive.metastore.warehouse.dir", "hdfs:///user/hive/warehouse") \
.enableHiveSupport() \
.getOrCreate()
>>> spark.sql("select count(*) from mySchema.myTable").show()
I have Kudu 1.2.0 installed on the cluster. These are Hive/Impala tables.
When I execute the last line, I get the following error:
.
.
.
: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Error in loading storage handler.com.cloudera.kudu.hive.KuduStorageHandler
.
.
.
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Error in loading storage handler.com.cloudera.kudu.hive.KuduStorageHandler
at org.apache.hadoop.hive.ql.metadata.HiveUtils.getStorageHandler(HiveUtils.java:315)
at org.apache.hadoop.hive.ql.metadata.Table.getStorageHandler(Table.java:284)
... 61 more
Caused by: java.lang.ClassNotFoundException: com.cloudera.kudu.hive.KuduStorageHandler
I am referring to the following resources:
https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
https://issues.apache.org/jira/browse/KUDU-1603
https://github.com/bkvarda/iot_demo/blob/master/total_data_count.py
https://kudu.apache.org/docs/developing.html#_kudu_python_client
I would like to know how to include the Kudu-related dependencies in my PySpark program so that I can get past this error.
The way I solved this issue was to pass the respective kudu-spark jar (or package) to the pyspark2 shell or to the spark2-submit command. This was with Apache Spark 2.3.
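For example, launching the shell or submitting the job with the kudu-spark package (a sketch only: the artifact coordinates, the 1.6.0 version, the jar path and the my_job.py script name are assumptions and must match your Kudu, Spark and Scala versions):
# interactive shell, pulling the package from Maven (version is an assumption)
pyspark2 --packages org.apache.kudu:kudu-spark2_2.11:1.6.0
# or point spark2-submit at a local jar instead (jar path and script name are placeholders)
spark2-submit --jars /path/to/kudu-spark2_2.11.jar my_job.py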
Below is the code for your reference:
Read a Kudu table from PySpark with the code below:
kuduDF = spark.read.format('org.apache.kudu.spark.kudu') \
    .option('kudu.master', "IP of master") \
    .option('kudu.table', "impala::TABLE name") \
    .load()
kuduDF.show(5)
Write to a Kudu table with the code below:
DF.write.format('org.apache.kudu.spark.kudu') \
    .option('kudu.master', "IP of master") \
    .option('kudu.table', "impala::TABLE name") \
    .mode("append") \
    .save()
Reference link: https://medium.com/@sciencecommitter/how-to-read-from-and-write-to-kudu-tables-in-pyspark-via-impala-c4334b98cf05
If you want to use Scala instead, below is the reference link:
https://kudu.apache.org/docs/developing.html
I am trying to import a CSV file into a Redshift cluster. I have successfully completed the example in the Redshift documentation. Now I am trying to COPY from my own CSV file.
This is my command:
copy frontend_chemical from 's3://awssampledb/mybucket/myfile.CSV'
credentials 'aws_access_key_id=xxxxx;aws_secret_access_key=xxxxx'
delimiter ',';
This is the error I see:
An error occurred when executing the SQL command:
copy frontend_chemical from 's3://awssampledb/mybucket/myfile.CSV'
credentials 'aws_access_key_id=XXXX...'
[Amazon](500310) Invalid operation: The specified S3 prefix 'mybucket/myfile.CSV' does not exist
Details:
-----------------------------------------------
error: The specified S3 prefix 'mybucket/myfile.CSV' does not exist
code: 8001
context:
query: 3573
location: s3_utility.cpp:539
process: padbmaster [pid=2432]
-----------------------------------------------;
Execution time: 0.7s
1 statement failed.
I think I'm constructing the S3 URL wrong, but how should I do it?
My Redshift cluster is in the US East (N Virginia) region.
The Amazon Redshift COPY command can load multiple files in parallel. Note that the first path component after s3:// must be your bucket name; in your command the bucket is awssampledb (the sample bucket from the tutorial), so Redshift looks for the prefix 'mybucket/myfile.CSV' inside that bucket, which does not exist.
For example, if your bucket is mybucket and the files are in the bucket under the path data, then refer to the contents as:
s3://mybucket/data
The COPY command would then be:
COPY frontend_chemical
FROM 's3://mybucket/data'
CREDENTIALS 'aws_access_key_id=xxxxx;aws_secret_access_key=xxxxx'
DELIMITER ',';
This will load all files within the data directory. You can also refer to a specific file by including it in the path, eg s3://mybucket/data/file.csv
I am getting the error below when I run my sqoop export command.
This is the content to be exported by the sqoop command:
00001|Content|1|Content-article|\N|2015-02-1815:16:04|2015-02-1815:16:04|1 |\N|\N|\N|\N|\N|\N|\N|\N|\N
00002|Content|1|Content-article|\N|2015-02-1815:16:04|2015-02-1815:16:04|1 |\N|\N|\N|\N|\N|\N|\N|\N|\N
sqoop command:
sqoop export --connect jdbc:postgresql://10.11.12.13:1234/db --table table1 --username user1 --password pass1 --export-dir /hivetables/table/ --fields-terminated-by '|' --lines-terminated-by '\n' -- --schema schema
15/06/09 08:05:16 INFO mapreduce.Job: Task Id : attempt_1431442954745_1210_m_000001_0, Status : FAILED
Error: java.io.IOException: Can't export data, please check failed map task logs
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.RuntimeException: Can't parse input data: '\N'
at duser.__loadFromFields(duser.java:690)
at duser.parse(duser.java:558)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:83)
... 10 more
Caused by: java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
at java.sql.Timestamp.valueOf(Timestamp.java:202)
at duser.__loadFromFields(duser.java:627)
Can you help me resolve it?
Try adding these arguments to the export statement:
--input-null-string "\\\\N" --input-null-non-string "\\\\N"
From the documentation:
If --input-null-string is not specified, then the string "null" will be interpreted as null for string-type columns. If --input-null-non-string is not specified, then both the string "null" and the empty string will be interpreted as null for non-string columns.
If you don't add those arguments, Sqoop won't be able to understand that the \N in your data actually means null.
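For reference, a sketch of the export command from the question with those two arguments added; everything else is unchanged:
sqoop export \
--connect jdbc:postgresql://10.11.12.13:1234/db \
--table table1 \
--username user1 \
--password pass1 \
--export-dir /hivetables/table/ \
--fields-terminated-by '|' \
--lines-terminated-by '\n' \
--input-null-string "\\\\N" \
--input-null-non-string "\\\\N" \
-- --schema schema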
The problem seems to be the order in which the columns are being exported: Sqoop doesn't automatically understand the column mapping. Try using the --columns argument to specify the order in which the columns appear. Here's how to use it:
sqoop export --connect jdbc:postgresql://10.11.12.13:5432/reports ... --columns col1,col2,col3,...
See http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_purpose_4 for documentation on how to use --columns.