Hadoop version: Hadoop 2.6.0-cdh5.12.2
Hive version: Hive 1.1.0-cdh5.12.2
Consider two tables:
products - stores the product id and other details about the product.
activity - stores user_id and product_id (i.e. which user purchased which product) along with other transaction details.
Before creating these tables, I added the SerDe JAR using the command below:
add jar /home/ManojKumarM_R/json-serde-1.3-jar-with-dependencies.jar;
CREATE EXTERNAL TABLE IF NOT EXISTS products (
  id string, name string, reseller string, category string,
  price Double, discount Double, profit_percent Double)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION "/user/ManojKumarM_R/ProductsMergeEnrichOut";
Sample data in /user/ManojKumarM_R/ProductsMergeEnrichOut:
{"Id":"P101", "Name":"Round Tee", "Reseller":"Nike", "Category":"Top Wear", "Price":2195.03, "Discount":21.09, "Profit_percent":23.47}
{"Id":"P102", "Name":"Half Shift", "Reseller":"Nike", "Category":"Top Wear", "Price":1563.84, "Discount":23.83, "Profit_percent":17.12}
CREATE EXTERNAL TABLE IF NOT EXISTS activity (
  product_id string, user_id string, cancellation boolean, return boolean,
  cancellation_reason string, return_reason string, order_date timestamp,
  shipment_date timestamp, delivery_date timestamp, cancellation_date timestamp,
  return_date timestamp)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION "/user/ManojKumarM_R/ActivityMergeEnrichOut/";
Sample data in /user/ManojKumarM_R/ActivityMergeEnrichOut/:
{"Product_id":"P117", "User_id":"U148", "Cancellation":"TRUE", "Return":"NA", "Cancellation_reason":"Duplicate Product", "Return_reason":"NA", "Order_date":"2016-02-12", "Shipment_date":"NA", "Delivery_date":"NA", "Cancellation_date":"2018-05-20", "Return_date":"NA"}
{"Product_id":null, "User_id":"U189", "Cancellation":"FALSE", "Return":"FALSE", "Cancellation_reason":"NA", "Return_reason":"NA", "Order_date":"2017-04-22", "Shipment_date":"2017-05-05", "Delivery_date":"2017-09-09", "Cancellation_date":"NA", "Return_date":"NA"}
Table creation was successful, and both
select * from products;
and
select * from activity;
work absolutely fine, which shows that the SerDe JAR is picked up for SELECT queries.
However, when I run the join query below, which joins the two tables on the common column (product id):
SELECT a.user_id, p.category FROM activity a JOIN products p
ON(a.product_id = p.Id);
it fails with the message below:
Execution log at: /tmp/ManojKumarM_R/ManojKumarM_R_20181010124747_690490ae-e59f-4e9d-9159-5c6a6e28b951.log
2018-10-10 12:47:43 Starting to launch local task to process map join; maximum memory = 2058354688
Execution failed with exit status: 2
Obtaining error information
Task failed!
Task ID:
Stage-5
Log in /tmp/ManojKumarM_R/ManojKumarM_R_20181010124747_690490ae-e59f-4e9d-9159-5c6a6e28b951.log
2018-10-10 12:47:43,984 ERROR [main]: mr.MapredLocalTask (MapredLocalTask.java:executeInProcess(398)) - Hive Runtime Error: Map local work failed
org.apache.hadoop.hive.ql.metadata.HiveException: Failed with exception java.lang.ClassNotFoundException: org.openx.data.jsonserde.JsonSerDejava.lang.RuntimeException: java.lang.ClassNotFoundException: org.openx.data.jsonserde.JsonSerDe
at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializerClass(TableDesc.java:73)
This indicates that Hive is not able to find the JsonSerDe JAR, even though I added the JAR during that Hive session and SELECT queries were working fine.
If anyone has resolved a similar issue, please let me know; I am not sure whether Hive looks in different directories for JARs during a JOIN operation.
Hive doesn't invoke MR jobs for all "SELECT *" queries. In your case, the JAR file is not propagated across the cluster when an actual MR job (the JOIN query) is invoked. So I would recommend re-checking the JAR folder/file permissions, or moving the file to the Hive library path and updating hive-site.xml. There are a couple of previous posts on how to add a Hive JAR file that you can check as well.
Previous post: how to add a jar file in hive
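To make that concrete, here is a rough sketch of the two usual fixes (the settings and paths below are assumptions based on the question, not something stated in the answer):
-- Option 1: keep using ADD JAR, but disable the automatic map-join conversion so the
-- join runs as a regular MR join that receives the session-added JAR (assumption: the
-- failing Stage-5 is the local map-join task shown in the log).
set hive.auto.convert.join=false;
SELECT a.user_id, p.category FROM activity a JOIN products p ON (a.product_id = p.Id);
-- Option 2: register the SerDe for every session via hive-site.xml
-- (the local path is the one from the question; adjust as needed):
-- <property>
--   <name>hive.aux.jars.path</name>
--   <value>file:///home/ManojKumarM_R/json-serde-1.3-jar-with-dependencies.jar</value>
-- </property>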
Related
I'm relatively new to Flink, and today I ran into a problem while using Flink SQL on a Flink 1.11.3 session cluster.
Problem
I registered a source table that uses the JDBC Postgres driver, and I am trying to move some data from this online DB to AWS S3 in Parquet format. The table is huge (~43 GB). The job failed after around a minute and the task manager crashed without any warning; my best guess is that the task manager ran out of memory.
My Observation
I found that when I run tableEnv.executeSql("select ... from huge_table limit 1000"), Flink attempts to scan the entire source table into memory and only applies the limit afterwards.
Question
Since I only care about the most recent several days of data, is there any way to limit how many rows a job would scan by timestamp?
Appendix
Here is a minimal setup that can reproduce the issue (lots of noise removed)
Env setup code
var blinkSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build();
var tableEnv = TableEnvironment.create(blinkSettings);
Source table DDL in Flink SQL
CREATE TABLE source_transactions (
txid STRING,
username STRING,
amount BIGINT,
ts TIMESTAMP,
PRIMARY KEY (txid) NOT ENFORCED
) WITH (
'connector'='jdbc',
'url'='jdbc:postgresql://my.bank',
'table-name'='transactions',
'driver'='org.postgresql.Driver',
'username'='username',
'password'='password',
'scan.fetch-size'='2000'
)
Sink table DDL in Flink SQL
CREATE TABLE sink_transactions (
create_time TIMESTAMP,
username STRING,
delta_amount DOUBLE,
dt STRING
) PARTITIONED BY (dt) WITH (
'connector'='filesystem',
'path'='s3a://s3/path/to/transactions',
'format'='parquet'
)
Insert query in Flink SQL
INSERT INTO sink_transactions
SELECT ts, username, CAST(amount AS DOUBLE) / 100, DATE_FORMAT(ts, 'yyyy-MM-dd')
FROM source_transactions
Your observation is right: Flink doesn't support limit push-down optimization for the JDBC connector. There is a nearly merged PR to support this feature; it will land in Flink 1.13, and you can cherry-pick the patch into your code if you need it urgently.
1. JIRA: FLINK-19650 Support the limit push down for the Jdbc
2. PR: https://github.com/apache/flink/pull/13800
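Until then, one possible workaround for the timestamp question (a sketch only, not from the answer above; the view name recent_transactions and the 7-day window are made up) is to push the time filter into Postgres by pointing the Flink table at a filtered view, so only recent rows ever leave the database:
-- In Postgres: expose only the recent rows.
CREATE VIEW recent_transactions AS
SELECT txid, username, amount, ts
FROM transactions
WHERE ts >= now() - interval '7 days';
-- In Flink SQL: same connector options as before, but 'table-name' points at the view.
CREATE TABLE source_recent_transactions (
  txid STRING,
  username STRING,
  amount BIGINT,
  ts TIMESTAMP
) WITH (
  'connector'='jdbc',
  'url'='jdbc:postgresql://my.bank',
  'table-name'='recent_transactions',
  'driver'='org.postgresql.Driver',
  'username'='username',
  'password'='password',
  'scan.fetch-size'='2000'
);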
I have a Presto, Hive, and HDFS setup, and I have a table customer which has data in it (the data is stored in the HDFS location /presto/customer.avro).
The Hive table also has the schema and metadata info.
Executing a select * query in the Presto CLI returns all 3 records that were inserted.
After executing delete from customer; in the Presto CLI, all data is deleted.
When I persist data again, it gets reflected in the HDFS customer file, but the Presto select * query shows no records.
If you have created an external table, then the delete command will just remove the table from the metastore and will not delete the underlying files.
Check the properties related to external tables here: https://trino.io/docs/current/connector/hive.html
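As a quick check (not part of the answer above), you can inspect the table definition from the Presto CLI to see whether it was created as an external table and where its data lives:
-- The output includes the table properties for Hive-connector tables,
-- e.g. external_location and format.
SHOW CREATE TABLE customer;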
I have a table in Hive:
CREATE EXTERNAL TABLE sr2015(
creation_date STRING,
status STRING,
first_3_chars_of_postal_code STRING,
intersection_street_1 STRING,
intersection_street_2 STRING,
ward STRING,
service_request_type STRING,
division STRING,
section STRING )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES (
'colelction.delim'='\u0002',
'field.delim'=',',
'mapkey.delim'='\u0003',
'serialization.format'=',', 'skip.header.line.count'='1',
'quoteChar'= "\"")
The table is loaded this way:
LOAD DATA INPATH "hdfs:///user/rxie/SR2015.csv" INTO TABLE sr2015;
Why is the table only accessible in Hive? When I attempt to access it in the Hue/Impala editor, I get the following error:
AnalysisException: Could not resolve table reference: 'sr2015'
which seems to say there is no such table, even though the table does show up in the left panel.
In impala-shell, the error is different:
ERROR: AnalysisException: Failed to load metadata for table: 'sr2015'
CAUSED BY: TableLoadingException: Failed to load metadata for table:
sr2015 CAUSED BY: InvalidStorageDescriptorException: Impala does not
support tables of this type. REASON: SerDe library
'org.apache.hadoop.hive.serde2.OpenCSVSerde' is not supported.
I have always thought that a Hive table and an Impala table are essentially the same thing, the difference being that Impala is a more efficient query engine.
Can anyone help sort this out? Thank you very much.
Assuming that sr2015 is located in DB called db, in order to make the table visible in Impala, you need to either issue
invalidate metadata db;
or
invalidate metadata db.sr2015;
in impala-shell.
However, in your case the reason is probably the version of Impala you're using, since it doesn't support this table format at all.
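If the SerDe is indeed the blocker, a common workaround (a sketch only, not from the answer above; the table name sr2015_parquet is made up) is to copy the data from Hive into a format Impala does support and then refresh Impala's metadata:
-- In Hive: materialize the CSV-backed table into Parquet.
CREATE TABLE sr2015_parquet STORED AS PARQUET AS SELECT * FROM sr2015;
-- In impala-shell: make the new table visible.
INVALIDATE METADATA sr2015_parquet;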
I have created a table in Hive from an existing S3 file as follows:
create table reconTable (
entryid string,
run_date string
)
LOCATION 's3://abhishek_data/dump1';
Now I would like to update one entry as follows:
update reconTable set entryid='7.24E-13' where entryid='7.24E-14';
But I am getting the following error:
FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support these operations.
I have gone through a few posts here, but I have no idea how to fix this.
I think you should create an external table when reading data from a source like S3.
Also, you should declare the table in ORC format and set the table property 'transactional'='true'.
Please refer to this for more info: attempt-to-do-update-or-delete-using-transaction-manager
You can refer to this Cloudera Community Thread:
https://community.cloudera.com/t5/Support-Questions/Hive-update-delete-and-insert-ERROR-in-cdh-5-4-2/td-p/29485
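To illustrate the transactional-table part of that advice (a sketch only; the table name reconTable_acid, the bucket count, and the exact settings are assumptions based on the Hive ACID documentation, not something from the answers above):
-- Session settings required for ACID operations on Hive 1.x:
set hive.support.concurrency=true;
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on=true;
set hive.compactor.worker.threads=1;
-- A managed, bucketed ORC table marked transactional supports UPDATE/DELETE:
CREATE TABLE reconTable_acid (
  entryid string,
  run_date string
)
CLUSTERED BY (entryid) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
UPDATE reconTable_acid SET entryid='7.24E-13' WHERE entryid='7.24E-14';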
I have an Impala partitioned table stored as Parquet. Can I use Pig to load data from this table, with the partitions added as columns?
The Parquet table is defined as:
create table test.test_pig (
  name string,
  id bigint
)
partitioned by (gender string, age int)
stored as parquet;
And the Pig script is like:
A = LOAD '/test/test_pig' USING parquet.pig.ParquetLoader AS (name: bytearray, id: long);
However, gender and age are missing when I DUMP A; only name and id are displayed.
I have tried with:
A = LOAD '/test/test_pig' USING parquet.pig.ParquetLoader AS (name: bytearray, id: long, gender: chararray, age: int);
But then I receive an error like:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1031: Incompatable
schema: left is "name:bytearray,id:long,gender:bytearray,age:int",
right is "name:bytearray,id:long"
Hope to get some advice here. Thank you!
You should test with the org.apache.hcatalog.pig.HCatLoader library.
Normally, Pig supports reading from and writing to partitioned tables:
read:
This load statement will load all partitions of the specified table.
/* myscript.pig */
A = LOAD 'tablename' USING org.apache.hcatalog.pig.HCatLoader();
...
...
If only some partitions of the specified table are needed, include a partition filter statement immediately following the load statement in the data flow. (In the script, however, a filter statement might not immediately follow its load statement.) The filter statement can include conditions on partition as well as non-partition columns.
https://cwiki.apache.org/confluence/display/Hive/HCatalog+LoadStore#HCatalogLoadStore-RunningPigwithHCatalog
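As a sketch of what that could look like for the table in the question (assumptions: HCatalog is configured for the cluster, and the filter values are made up), the partition columns come back as ordinary fields that you can filter on right after the load:
/* myscript.pig -- run with: pig -useHCatalog myscript.pig */
A = LOAD 'test.test_pig' USING org.apache.hcatalog.pig.HCatLoader();
-- gender and age are the partition columns; filtering on them immediately after
-- the load lets HCatLoader prune partitions instead of reading the whole table.
B = FILTER A BY gender == 'M' AND age >= 18;
DUMP B;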
write:
HCatOutputFormat will trigger on dynamic partitioning usage if necessary (if a key value is not specified) and will inspect the data to write it out appropriately.
https://cwiki.apache.org/confluence/display/Hive/HCatalog+DynamicPartitions
However, I don't think this has been properly tested with Parquet files yet (at least not by the Cloudera folks):
Parquet has not been tested with HCatalog. Without HCatalog, Pig cannot correctly read dynamically partitioned tables; that is true for all file formats.
http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_parquet.html