Error while converting CSV to Parquet using an EMR cluster - Hive
I have created an EMR cluster with a Hive script added as one of the execution steps.
Here's what my Hive script looks like:
--Create Hive external Table for existing data
CREATE EXTERNAL TABLE calls_csv
(
id int,
campaign_id int,
campaign_name string,
offer_id int,
offer_name string,
is_offer_not_found int,
ivr_key string,
call_uuid string,
a_leg_uuid string,
a_leg_request_uuid string,
to_number string,
promo_id int,
description string,
call_type string,
answer_type string,
agent_id int,
from_number string,
from_caller_name string,
from_line_type string,
from_state string,
from_city string,
from_country string,
from_zip string,
from_latitude string,
from_longitude string,
b_leg_uuid string,
b_leg_number string,
b_leg_duration int,
b_leg_bill_rate double,
b_leg_bill_duration int,
b_leg_total_cost double,
b_leg_hangup_cause string,
b_leg_start_time string,
b_leg_answer_time string,
b_leg_end_time string,
b_leg_active tinyint,
bill_rate double,
bill_duration int,
hangup_cause string,
start_time string,
answer_time string,
end_time string,
status string,
selected_ivr_keys string,
processed_ivr_keys string,
filter_id int,
filter_name string,
ivr_action string,
selected_zip_code string,
processed_zip_code string,
duration int,
payout double,
min_duration int,
connected_duration int,
provider_cost double,
caller_id_cost double,
total_revenue double,
total_cost double,
total_profit double,
publisher_id int,
publisher_name string,
publisher_revenue double,
publisher_cost double,
publisher_profit double,
advertiser_id int,
advertiser_name string,
advertiser_cost double,
is_test tinyint,
is_sale tinyint,
is_repeat tinyint,
is_machine_detection tinyint,
no_of_call_transfer int,
offer_ivr_status tinyint,
file_url string,
algo string,
callback_service_status tinyint,
hangup_service_status tinyint,
sms_uuid string,
number_name string,
keyword string,
keywordmatchtype string,
created_at string,
updated_at string,
ymdhm bigint
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"')
LOCATION 's3://calls-csv/'
TBLPROPERTIES ('has_encrypted_data'='false',
'serialization.null.format'='');
MSCK REPAIR TABLE calls_csv;
-- Let's now create an external table in Parquet format
CREATE EXTERNAL TABLE calls_parquet (
id int,
campaign_id int,
campaign_name string,
offer_id int,
offer_name string,
is_offer_not_found int,
ivr_key string,
call_uuid string,
a_leg_uuid string,
a_leg_request_uuid string,
to_number string,
promo_id int,
description string,
call_type string,
answer_type string,
agent_id int,
from_number string,
from_caller_name string,
from_line_type string,
from_state string,
from_city string,
from_country string,
from_zip string,
from_latitude string,
from_longitude string,
b_leg_uuid string,
b_leg_number string,
b_leg_duration int,
b_leg_bill_rate double,
b_leg_bill_duration int,
b_leg_total_cost double,
b_leg_hangup_cause string,
b_leg_start_time string,
b_leg_answer_time string,
b_leg_end_time string,
b_leg_active tinyint,
bill_rate double,
bill_duration int,
hangup_cause string,
start_time string,
answer_time string,
end_time string,
status string,
selected_ivr_keys string,
processed_ivr_keys string,
filter_id int,
filter_name string,
ivr_action string,
selected_zip_code string,
processed_zip_code string,
duration int,
payout double,
min_duration int,
connected_duration int,
provider_cost double,
caller_id_cost double,
total_revenue double,
total_cost double,
total_profit double,
publisher_id int,
publisher_name string,
publisher_revenue double,
publisher_cost double,
publisher_profit double,
advertiser_id int,
advertiser_name string,
advertiser_cost double,
is_test tinyint,
is_sale tinyint,
is_repeat tinyint,
is_machine_detection tinyint,
no_of_call_transfer int,
offer_ivr_status tinyint,
file_url string,
algo string,
callback_service_status tinyint,
hangup_service_status tinyint,
sms_uuid string,
number_name string,
keyword string,
keywordmatchtype string,
created_at string,
updated_at string,
ymdhm bigint)
STORED AS PARQUET
LOCATION 's3://calls-parquet/';
--Time to convert and export. This step will run for a long time, depending on your data size and cluster size.
INSERT OVERWRITE TABLE calls_parquet SELECT * FROM calls_csv;
Below is the error I get when I run this step on the EMR cluster:
Status :FAILED
Details : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask. Error moving: s3://calls-parquet/.hive-staging_hive_2018-03-20_07-09-28_592_6773618098932115163-1/-ext-10000 into: s3://calls-parquet/
JAR location : command-runner.jar
Main class : None
Arguments : hive-script --run-hive-script --args -f s3://calls-scripts/converToParquetHive.sql -d INPUT=s3://calls-csv -d OUTPUT=s3://calls-parquet
Action on failure: Continue
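As context for anyone hitting the same step failure: one workaround that is sometimes suggested for MoveTask errors when writing to S3 (an assumption on my part, not something confirmed in this post) is to give the Parquet table a prefix under the bucket rather than the bucket root, so the .hive-staging directory is moved into an ordinary key prefix instead of onto s3://calls-parquet/ itself. A minimal sketch:

-- Hypothetical variant: point the table at a prefix instead of the bucket root
ALTER TABLE calls_parquet SET LOCATION 's3://calls-parquet/data/';
INSERT OVERWRITE TABLE calls_parquet SELECT * FROM calls_csv;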
Related
Found unparsable section: 'EXECUTE IMMEDIATE\n ...' in sqlfluff?
I'm using sqlfluff to check SQL code formatting. The sqlfluff version and configuration are:

sqlfluff = "^0.13.2"

[tool.sqlfluff]
sql_file_exts = [
    ".sql",
]

[tool.sqlfluff.core]
dialect = "bigquery"
rules = "L001,L002,L003,L004,L005,L006,L008,L009,L010,L012,L013,L015,L016,L018,L019,L020,L022,L023,L024,L025,L035,L036,L037,L038,L039,L040,L041,L045,L046,L048,L049,L050,L051,L052,L053,L061,L063"

I have a SQL query that uses EXECUTE IMMEDIATE (dynamic BigQuery SQL):

DECLARE c_c string;
DECLARE ps ARRAY<string>;
DECLARE p_cc string;
DECLARE d_cc string;

CREATE TABLE IF NOT EXISTS `dataset.table` (
    A STRING,
    B STRING,
    C STRING,
    D STRING,
    F STRING,
    E STRING,
    G STRING,
    H STRING,
    E STRING,
    J timestamp,
    K INT64
)
PARTITION BY RANGE_BUCKET(K, GENERATE_ARRAY(0, 100, 1));

SET c_c = "code1";
SET ps = ["project-code1", "project-code2", "project-code3", "project-code4", "project-code5", "project-code6"];
SET p_cc = (
    WITH ps_cte AS (
        SELECT * FROM UNNEST(ps) ps WHERE ps LIKE CONCAT('%-', c_c))
    SELECT ps FROM ps_cte);
SET d_cc = CONCAT(p_cc, ".sales_", c_c, ".");

EXECUTE IMMEDIATE FORMAT("""
    CREATE OR REPLACE TEMP TABLE temp_table AS
    SELECT cols FROM `%ssat_table` sc
""", d_cc);

When I run sqlfluff lint:

poetry run sqlfluff lint query.sql

it gives:

L: 47 | P: 1 | PRS | Line 47, Position 1: Found unparsable section: 'EXECUTE IMMEDIATE\n FORMAT( """ \n CREAT...'

Why doesn't sqlfluff know EXECUTE IMMEDIATE? Is there any solution? Thanks
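If the goal is just to keep linting files that contain statements the BigQuery dialect can't parse yet, one workaround (a sketch, not a fix for the parser itself) is sqlfluff's `ignore` setting, which downgrades PRS violations instead of reporting them. In the same pyproject.toml layout that would look roughly like:

[tool.sqlfluff.core]
dialect = "bigquery"
ignore = "parsing"

With that in place, `poetry run sqlfluff lint query.sql` should still apply the configured rules to the statements that do parse.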
Why is implicit CASTing happening during a SQL join?
I have two tables, T1 and T2. T1 has 3 columns (t1c1, t1c2, t1c3), all of String type. T2 has 4 columns (t2c1, t2c2, t2c3, t2c4), all of String type. I'm trying to perform a join as:

SELECT T1.t1c1, T1.t1c2, T1.t1c3, T2.t2c1, T2.t2c2, T2.t2c3, T2.t2c4
FROM T1, T2
WHERE T1.t1c1 = T2.t2c1 & T1.t1c2 = T2.t2c2;

But this throws an error like this:

AnalysisException: cannot resolve '(CAST(T2.t2c1 AS DOUBLE) & CAST(T1.t1c2 AS DOUBLE))' due to data type mismatch: '(CAST(T2.t2c1 AS DOUBLE) & CAST(T1.t1c2 AS DOUBLE))' requires integral type, not double;

Then I tried this one:

SELECT T1.t1c1, T1.t1c2, T1.t1c3, T2.t2c1, T2.t2c2, T2.t2c3, T2.t2c4
FROM T1, T2
WHERE CAST(T1.t1c1 AS STRING) = CAST(T2.t2c1 AS STRING) & CAST(T1.t1c2 AS STRING) = CAST(T2.t2c2 AS STRING);

Then this error comes:

AnalysisException: cannot resolve '(CAST(CAST(T2.t2c1 AS STRING) AS DOUBLE) & CAST(CAST(T1.t1c2 AS STRING) AS DOUBLE))' due to data type mismatch: '(CAST(CAST(T2.t2c1 AS STRING) AS DOUBLE) & CAST(CAST(T1.t1c2 AS STRING) AS DOUBLE))' requires integral type, not double;

I want to work with StringType. How can I resolve this problem? And why is this implicit casting happening?
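For what it's worth, both error messages show `&` being resolved as the bitwise AND operator, which only accepts integral operands; that is what triggers the implicit CAST ... AS DOUBLE on the string columns. A sketch of the same join using the logical AND keyword instead (no casts should be needed, since both sides stay strings):

SELECT T1.t1c1, T1.t1c2, T1.t1c3,
       T2.t2c1, T2.t2c2, T2.t2c3, T2.t2c4
FROM T1
JOIN T2
  ON T1.t1c1 = T2.t2c1
 AND T1.t1c2 = T2.t2c2;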
Procedure to update selected columns in a table
PostgreSQL. Case background: there is a table named MasterSIM to which details are to be added using a .csv file via Node-RED.

CREATE OR REPLACE PROCEDURE "master_SIMs".simiccidlink1(IMEI bigint, iccid text, phonenumber bigint, apn text, "Operator" text, isesim boolean, ism2m boolean, mdi bigint, imsi integer)
LANGUAGE 'plpgsql'
AS $BODY$
declare
    -- retval text
begin
    if (length(IMEI) = 15) then
        if (master_packs.luhn_verify(IMEI) is True) then
            update "master_SIMs".list
            SET "ICCID" = iccid,
                "phoneNumber" = phonenumber,
                "APN" = apn,
                "operator" = "Operator",
                "isESIM" = isesim,
                "isM2M" = ism2m,
                "MDI" = mdi,
                "IMSI" = imsi
            where IMEI = "serialNumber";
        end if;
    end if;
end;
$BODY$;

CALL "master_SIMs".simiccidlink1(352277818466409, 'iccid1', '8018466080', 'apn1', 'operator1', '0', '0', 123456789987653, 12354);

select * from "master_SIMs".list;

The error I am getting is:

ERROR: procedure master_SIMs.simiccidlink1(unknown, unknown, unknown, unknown, integer, integer, integer, integer, bigint) does not exist
LINE 2: CALL "master_SIMs".simiccidlink1('iccid1', '8018466080', 'ap...
^
HINT: No procedure matches the given name and argument types. You might need to add explicit type casts.
SQL state: 42883
Character: 7

I am new to this - please guide. I have tried all possible ways to fix it.
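The hint points at an argument-type mismatch: the procedure is declared with bigint and boolean parameters, while the failing CALL (which, per the error text, also appears to start with 'iccid1' rather than the IMEI) passes quoted strings and integers. A minimal sketch of a call whose arguments line up with the declared signature, using booleans and unquoted numbers (the concrete values are just the ones from the post):

CALL "master_SIMs".simiccidlink1(
    352277818466409,        -- IMEI        bigint
    'iccid1',               -- iccid       text
    8018466080,             -- phonenumber bigint (not a quoted string)
    'apn1',                 -- apn         text
    'operator1',            -- "Operator"  text
    false,                  -- isesim      boolean (not '0')
    false,                  -- ism2m       boolean
    123456789987653,        -- mdi         bigint
    12354                   -- imsi        integer
);

Note that even once the call resolves, length(IMEI) inside the body will likely fail because length() expects text; a cast such as length(IMEI::text) may be needed - again an assumption, since the post doesn't get that far.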
Converting a ParArray to an Array in Scala
How can we convert a ParArray[(Double, Double, Double, Double, Double)] to an Array[(Double, Double, Double, Double, Double)]? I have to do this as part of creating a DataFrame using sc.parallelize(Array[(Double, ...)]). Apart from hard-coding it (like below), is there any other way?

for (x1 <- 0 until a.length) {
    new_a(x1)(0) = a(x1)._1
    new_a(x1)(1) = a(x1)._2
    new_a(x1)(2) = a(x1)._3
    new_a(x1)(3) = a(x1)._4
    new_a(x1)(4) = a(x1)._5
}
See http://docs.scala-lang.org/overviews/parallel-collections/conversions.html#converting-between-sequential-and-parallel-collections

val a: ParArray[(Double, Double, Double, Double, Double)] = ???
val rdd = sc.parallelize(a.seq)
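A slightly fuller sketch of the same idea (variable names are mine), showing that either .seq or .toArray yields a sequential collection that sc.parallelize accepts, assuming an existing SparkContext named sc:

import scala.collection.parallel.mutable.ParArray

val par: ParArray[(Double, Double, Double, Double, Double)] =
  ParArray((1.0, 2.0, 3.0, 4.0, 5.0), (6.0, 7.0, 8.0, 9.0, 10.0))

// .seq gives a sequential Seq view; .toArray copies into a plain Array
val asArray: Array[(Double, Double, Double, Double, Double)] = par.toArray

val rdd = sc.parallelize(asArray)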
How to return a unary function from a multi-variable function in VB.NET
I have a function:

Public Function F(ByVal a As Double, ByVal b As Double, ByVal c As Double, ByVal d As Double, ByVal x As Double) As Double
    Dim y As Double = a * x ^ 3 + b * x ^ 2 + c * x + d
    Return y
End Function

How could I create a function that lets me read parameters a, b, c, d and returns a unary function? For example, with a = 1, b = 2, c = 3, d = 4:

Public Function F(ByVal x As Double) As Double
    Dim y As Double = 1 * x ^ 3 + 2 * x ^ 2 + 3 * x + 4
    Return y
End Function

Or, in other words, how could I create a function that returns a function of type Func(Of Double, Double)?
Use a lambda function.

Public Function F(ByVal a As Double, ByVal b As Double, ByVal c As Double, ByVal d As Double) As Func(Of Double, Double)
    Return Function(ByVal x As Double) As Double
               Return a * x ^ 3 + b * x ^ 2 + c * x + d
           End Function
End Function
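A short usage sketch (the variable name is mine): the returned Func(Of Double, Double) can be stored and invoked like any other delegate.

Dim cubic As Func(Of Double, Double) = F(1, 2, 3, 4)
Console.WriteLine(cubic(2)) ' 1*2^3 + 2*2^2 + 3*2 + 4 = 26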