Error while converting CSV to Parquet using EMR cluster - Hive

I have created an EMR cluster with a Hive script added as one of the execution steps.
Here is what my Hive script looks like:
--Create Hive external Table for existing data
CREATE EXTERNAL TABLE calls_csv
(
id int,
campaign_id int,
campaign_name string,
offer_id int,
offer_name string,
is_offer_not_found int,
ivr_key string,
call_uuid string,
a_leg_uuid string,
a_leg_request_uuid string,
to_number string,
promo_id int,
description string,
call_type string,
answer_type string,
agent_id int,
from_number string,
from_caller_name string,
from_line_type string,
from_state string,
from_city string,
from_country string,
from_zip string,
from_latitude string,
from_longitude string,
b_leg_uuid string,
b_leg_number string,
b_leg_duration int,
b_leg_bill_rate double,
b_leg_bill_duration int,
b_leg_total_cost double,
b_leg_hangup_cause string,
b_leg_start_time string,
b_leg_answer_time string,
b_leg_end_time string,
b_leg_active tinyint,
bill_rate double,
bill_duration int,
hangup_cause string,
start_time string,
answer_time string,
end_time string,
status string,
selected_ivr_keys string,
processed_ivr_keys string,
filter_id int,
filter_name string,
ivr_action string,
selected_zip_code string,
processed_zip_code string,
duration int,
payout double,
min_duration int,
connected_duration int,
provider_cost double,
caller_id_cost double,
total_revenue double,
total_cost double,
total_profit double,
publisher_id int,
publisher_name string,
publisher_revenue double,
publisher_cost double,
publisher_profit double,
advertiser_id int,
advertiser_name string,
advertiser_cost double,
is_test tinyint,
is_sale tinyint,
is_repeat tinyint,
is_machine_detection tinyint,
no_of_call_transfer int,
offer_ivr_status tinyint,
file_url string,
algo string,
callback_service_status tinyint,
hangup_service_status tinyint,
sms_uuid string,
number_name string,
keyword string,
keywordmatchtype string,
created_at string,
updated_at string,
ymdhm bigint
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"')
LOCATION 's3://calls-csv/'
TBLPROPERTIES ('has_encrypted_data'='false',
'serialization.null.format'='');
msck repair table calls_csv;
-- Now create an external table stored in Parquet format
CREATE EXTERNAL TABLE calls_parquet (
id int,
campaign_id int,
campaign_name string,
offer_id int,
offer_name string,
is_offer_not_found int,
ivr_key string,
call_uuid string,
a_leg_uuid string,
a_leg_request_uuid string,
to_number string,
promo_id int,
description string,
call_type string,
answer_type string,
agent_id int,
from_number string,
from_caller_name string,
from_line_type string,
from_state string,
from_city string,
from_country string,
from_zip string,
from_latitude string,
from_longitude string,
b_leg_uuid string,
b_leg_number string,
b_leg_duration int,
b_leg_bill_rate double,
b_leg_bill_duration int,
b_leg_total_cost double,
b_leg_hangup_cause string,
b_leg_start_time string,
b_leg_answer_time string,
b_leg_end_time string,
b_leg_active tinyint,
bill_rate double,
bill_duration int,
hangup_cause string,
start_time string,
answer_time string,
end_time string,
status string,
selected_ivr_keys string,
processed_ivr_keys string,
filter_id int,
filter_name string,
ivr_action string,
selected_zip_code string,
processed_zip_code string,
duration int,
payout double,
min_duration int,
connected_duration int,
provider_cost double,
caller_id_cost double,
total_revenue double,
total_cost double,
total_profit double,
publisher_id int,
publisher_name string,
publisher_revenue double,
publisher_cost double,
publisher_profit double,
advertiser_id int,
advertiser_name string,
advertiser_cost double,
is_test tinyint,
is_sale tinyint,
is_repeat tinyint,
is_machine_detection tinyint,
no_of_call_transfer int,
offer_ivr_status tinyint,
file_url string,
algo string,
callback_service_status tinyint,
hangup_service_status tinyint,
sms_uuid string,
number_name string,
keyword string,
keywordmatchtype string,
created_at string,
updated_at string,
ymdhm bigint)
STORED AS PARQUET
LOCATION 's3://calls-parquet/';
--Time to convert and export. This step will run for a long time, depending on your data size and cluster size.
INSERT OVERWRITE TABLE calls_parquet SELECT * FROM calls_csv;
Below is the error I get when I run this step on the EMR cluster:
Status :FAILED
Details : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask. Error moving: s3://calls-parquet/.hive-staging_hive_2018-03-20_07-09-28_592_6773618098932115163-1/-ext-10000 into: s3://calls-parquet/
JAR location : command-runner.jar
Main class : None
Arguments : hive-script --run-hive-script --args -f s3://calls-scripts/converToParquetHive.sql -d INPUT=s3://calls-csv -d OUTPUT=s3://calls-parquet
Action on failure: Continue

Found unparsable section: 'EXECUTE IMMEDIATE\n ...' sqlfluff?

I'm using sqlfluff to check SQL code formatting.
My sqlfluff configuration is:
sqlfluff = "^0.13.2"
[tool.sqlfluff]
sql_file_exts = [
".sql",
]
[tool.sqlfluff.core]
dialect = "bigquery"
rules = "L001,L002,L003,L004,L005,L006,L008,L009,L010,L012,L013,L015,L016,L018,L019,L020,L022,L023,L024,L025,L035,L036,L037,L038,L039,L040,L041,L045,L046,L048,L049,L050,L051,L052,L053,L061,L063"
I have a SQL query that uses EXECUTE IMMEDIATE (dynamic BigQuery SQL):
DECLARE
c_c string;
DECLARE
ps ARRAY<string>;
DECLARE
p_cc string;
DECLARE
d_cc string;
CREATE TABLE IF NOT EXISTS
`dataset.table` ( A STRING,
B STRING,
C STRING,
D STRING,
F STRING,
E STRING,
G STRING,
H STRING,
E STRING,
J timestamp,
K INT64 ) PARTITION BY
RANGE_BUCKET(K, GENERATE_ARRAY(0,100,1));
SET
c_c="code1";
SET
ps = ["project-code1",
"project-code2",
"project-code3",
"project-code4",
"project-code5",
"project-code6"];
SET
p_cc = (
WITH
ps_cte AS (
SELECT
*
FROM
UNNEST(ps) ps
WHERE
ps LIKE CONCAT('%-', c_c))
SELECT
ps
FROM
ps_cte);
SET
d_cc = CONCAT(p_cc,".sales_",c_c,".");
EXECUTE IMMEDIATE
FORMAT( """
CREATE OR REPLACE TEMP TABLE temp_table AS
SELECT
cols
FROM
`%ssat_table` sc
""",d_cc)
:
When I run sqlfluff lint:
poetry run sqlfluff lint query.sql
it gives:
L: 47 | P: 1 | PRS | Line 47, Position 1: Found unparsable section: 'EXECUTE
| IMMEDIATE\n FORMAT( """ \n CREAT...'
Why doesn't sqlfluff recognize EXECUTE IMMEDIATE? Is there a solution?
Thanks

Why is implicit casting happening during a SQL join?

I have two tables T1 and T2.
T1 has 3 columns, t1c1, t1c2 and t1c3, all of String type.
T2 has 4 columns, t2c1, t2c2, t2c3 and t2c4, all of String type.
I'm trying to perform a join as follows:
SELECT T1.t1c1, T1.t1c2, T1.t1c3, T2.t2c1, T2.t2c2, T2.t2c3, T2.t2c4
FROM T1, T2
WHERE T1.t1c1 = T2.t2c1 & T1.t1c2 = T2.t2c2;
But this throws the following error:
AnalysisException: cannot resolve '(CAST(T2.t2c1 AS DOUBLE) & CAST(T1.t1c2 AS DOUBLE))' due to data type mismatch: '(CAST(T2.t2c1 AS DOUBLE) & CAST(T1.t1c2 AS DOUBLE))' requires integral type, not double;
Then I tried this one:
SELECT T1.t1c1, T1.t1c2, T1.t1c3, T2.t2c1, T2.t2c2, T2.t2c3, T2.t2c4
FROM T1, T2
WHERE CAST(T1.t1c1 AS STRING) = CAST(T2.t2c1 AS STRING) & CAST(T1.t1c2 AS STRING) = CAST(T2.t2c2 AS STRING);
That produces this error:
AnalysisException: cannot resolve '(CAST(CAST(T2.t2c1 AS STRING) AS DOUBLE) & CAST(CAST(T1.t1c2 AS STRING) AS DOUBLE))' due to data type mismatch: '(CAST(CAST(T2.t2c1 AS STRING) AS DOUBLE) & CAST(CAST(T1.t1c2 AS STRING) AS DOUBLE))' requires integral type, not double;
I want to work with StringType. How can I resolve this problem? And why is this implicit casting happening?
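One possible reading of the error, offered as a sketch rather than a confirmed answer: `&` is the bitwise AND operator, not the logical one, and it binds more tightly than `=`. The WHERE clause is therefore grouped as T1.t1c1 = (T2.t2c1 & T1.t1c2) = T2.t2c2, which matches the (CAST(T2.t2c1 AS DOUBLE) & CAST(T1.t1c2 AS DOUBLE)) expression quoted in the message. Writing AND instead of & avoids the grouping. The Python snippet below illustrates the same grouping with the standard-library sqlite3 module standing in for Spark. That substitution is an assumption made only to keep the example self-contained: SQLite silently coerces the strings to integers where Spark raises an AnalysisException, and the rows are made up.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Spark SQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T1 (t1c1 TEXT, t1c2 TEXT, t1c3 TEXT);
    CREATE TABLE T2 (t2c1 TEXT, t2c2 TEXT, t2c3 TEXT, t2c4 TEXT);
    -- Hypothetical sample rows whose first two columns match.
    INSERT INTO T1 VALUES ('k1', 'k2', 'left');
    INSERT INTO T2 VALUES ('k1', 'k2', 'right', 'extra');
""")

# Same query as in the question; only the operator between the
# two comparisons changes.
select = """
    SELECT T1.t1c1, T1.t1c2, T1.t1c3, T2.t2c1, T2.t2c2, T2.t2c3, T2.t2c4
    FROM T1, T2
    WHERE T1.t1c1 = T2.t2c1 {op} T1.t1c2 = T2.t2c2
"""

# '&' groups as t1c1 = (t2c1 & t1c2) = t2c2, so the matching row is lost.
rows_bitwise = conn.execute(select.format(op="&")).fetchall()

# 'AND' combines the two comparisons as intended.
rows_and = conn.execute(select.format(op="AND")).fetchall()

print("with & :", rows_bitwise)
print("with AND:", rows_and)
```

With AND, no arithmetic is applied to the columns, so they stay strings and no explicit CAST is needed.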

Procedure to update selected columns in a table

PostgreSQL
Background: there is a table named MasterSIM to which details are to be added using a .csv file via Node-RED.
CREATE OR REPLACE PROCEDURE "master_SIMs".simiccidlink1(IMEI bigint, iccid text, phonenumber bigint, apn text, "Operator" text, isesim boolean, ism2m boolean, mdi bigint, imsi integer)
LANGUAGE 'plpgsql' AS
$BODY$
declare
-- retval text
begin
if (length(IMEI) = 15) then
if (master_packs.luhn_verify(IMEI) is True) then
update "master_SIMs".list SET "ICCID" = iccid, "phoneNumber" = phonenumber, "APN" = apn, "operator" = "Operator", "isESIM" = isesim, "isM2M" = ism2m, "MDI" = mdi, "IMSI" = imsi where IMEI = "serialNumber";
end if;
end if;
end;
$BODY$;
CALL "master_SIMs".simiccidlink1(352277818466409, 'iccid1', '8018466080', 'apn1', 'operator1', '0', '0', 123456789987653, 12354);
select * from "master_SIMs".list;
The error I am getting is:
ERROR: procedure master_SIMs.simiccidlink1(unknown, unknown, unknown, unknown, integer, integer, integer, integer, bigint) does not exist
LINE 2: CALL "master_SIMs".simiccidlink1('iccid1', '8018466080', 'ap...
^
HINT: No procedure matches the given name and argument types. You might need to add explicit type casts.
SQL state: 42883
Character: 7
I am new to this, so please guide me.
I have tried everything I could think of to fix it.

Converting ParArray to Array in Scala

How can we convert ParArray[(Double, Double, Double, Double, Double)] to Array[(Double, Double, Double, Double, Double)]?
I need this in order to create a DataFrame using sc.parallelize(Array[(Double,...)]).
Apart from copying element by element (as below), is there another way?
for(x1 <- 0 until a.length){
new_a(x1)(0) = a(x1)._1
new_a(x1)(1) = a(x1)._2
new_a(x1)(2) = a(x1)._3
new_a(x1)(3) = a(x1)._4
new_a(x1)(4) = a(x1)._5
}

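See http://docs.scala-lang.org/overviews/parallel-collections/conversions.html#converting-between-sequential-and-parallel-collections
Calling .seq on a parallel collection returns its sequential counterpart, which sc.parallelize accepts directly:
val a: ParArray[(Double, Double, Double, Double, Double)] = ???
// use a.toArray instead of a.seq if an Array is specifically required
val rdd = sc.parallelize(a.seq)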

How to return a unary function from a multi-variable function in VB.NET

I have a function:
Public Function F(ByVal a As Double, ByVal b As Double,
ByVal c As Double, ByVal d As Double, ByVal x As Double) As Double
Dim y As Double = a * x ^ 3 + b * x ^ 2 + c * x + d
Return y
End Function
How could I create a function that takes the parameters a, b, c, d and returns a unary function? For example, with a=1, b=2, c=3, d=4:
Public Function F(ByVal x As Double) As Double
Dim y As Double = 1 * x ^ 3 + 2 * x ^ 2 + 3 * x + 4
Return y
End Function
In other words, how could I create a function that returns a function of type
Func(Of Double, Double)

Use a lambda function. The lambda captures a, b, c and d from the enclosing call:
Public Function F(ByVal a As Double, ByVal b As Double, ByVal c As Double, ByVal d As Double) As Func(Of Double, Double)
Return Function(ByVal x As Double) As Double
Return a * x ^ 3 + b * x ^ 2 + c * x + d
End Function
End Function
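For comparison, the same closure idea can be sketched in Python. This is only an illustration of the technique, not part of the VB.NET answer; the name make_cubic is hypothetical.

```python
from typing import Callable


def make_cubic(a: float, b: float, c: float, d: float) -> Callable[[float], float]:
    """Return the unary function x -> a*x^3 + b*x^2 + c*x + d."""

    def f(x: float) -> float:
        # a, b, c and d are captured from the enclosing call (a closure).
        return a * x**3 + b * x**2 + c * x + d

    return f


# Fix the coefficients once, then use the result as a one-argument function.
f = make_cubic(1, 2, 3, 4)
print(f(2.0))  # 1*8 + 2*4 + 3*2 + 4 = 26.0
```

Each call to make_cubic returns an independent function, so several polynomials with different coefficients can coexist.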
