Pig Load with Schema giving error - apache-pig

I have a file called data_tuple_bag.txt on hdfs with the following content:
10,{(1,2),(2,3)}
11,{(4,5),(6,7)}
I am creating a relation as below:
D = LOAD '/user/pig_demo/data_tuple_bag.txt' AS (f1:int,B:{T:(t1:int,t2:int)});
When I DUMP it, I get ACCESSING_NON_EXISTENT_FIELD 2 time(s) as well as FIELD_DISCARDED_TYPE_CONVERSION_FAILED 2 time(s), and an empty output.
I changed the relation to:
D = LOAD '/user/pig_demo/data_tuple_bag.txt' USING PigStorage(',') AS (f1:int,B:{T:(t1:int,t2:int)});
Now it is only giving FIELD_DISCARDED_TYPE_CONVERSION_FAILED 2 time(s), and the output is:
(10,)
(11,)
I have another file, data_only_bag.txt, with the following content:
{(1,2),(2,3)}
{(4,5),(6,7)}
The relation is defined as:
A = LOAD '/user/pig_demo/data_only_bag.txt' AS (B:{T:(t1:int,t2:int)});
And it works.
Now I am updating data_only_bag.txt as below:
10,{(1,2),(2,3)}
11,{(4,5),(6,7)}
And the relation is:
A = LOAD '/user/pig_demo/data_only_bag.txt' AS (f1:int,B:{T:(t1:int,t2:int)});
When I DUMP it, I am again getting ACCESSING_NON_EXISTENT_FIELD 2 time(s) as well as FIELD_DISCARDED_TYPE_CONVERSION_FAILED 2 time(s), and the output is:
(,)
(,)
Now I am updating the relation to:
A = LOAD '/user/pig_demo/data_only_bag.txt' USING PigStorage(',') AS (f1:int,B:{T:(t1:int,t2:int)});
Again it is only giving FIELD_DISCARDED_TYPE_CONVERSION_FAILED 2 time(s), and the output is:
(10,)
(11,)
Same as before.
Can anybody tell me what I am doing wrong here?
Thanks in advance.

Pig failed to parse the input with the provided schema.
Try this:
D = LOAD '/user/pig_demo/data_tuple_bag.txt' USING PigStorage(',')
AS (f1:int, B: {T1: (t1:int, t2:int),T2: (t1:int, t2:int)});
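Note that PigStorage(',') splits the line at every comma, including the commas inside {(1,2),(2,3)}, so the bag text reaches the loader in fragments and can never be converted (hence FIELD_DISCARDED_TYPE_CONVERSION_FAILED); without USING, the default delimiter is tab, so the whole line lands in f1 and B does not exist at all (hence ACCESSING_NON_EXISTENT_FIELD). A minimal sketch of a workaround, assuming the data can be rewritten with a tab between the scalar and the bag (hypothetical file data_tuple_bag_tab.txt):
-- data_tuple_bag_tab.txt, with a tab (shown as <TAB>) between the scalar and the bag:
-- 10<TAB>{(1,2),(2,3)}
-- 11<TAB>{(4,5),(6,7)}
D = LOAD '/user/pig_demo/data_tuple_bag_tab.txt' USING PigStorage('\t')
    AS (f1:int, B:{T:(t1:int, t2:int)});
DUMP D;
-- expected: (10,{(1,2),(2,3)}) and (11,{(4,5),(6,7)})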

Related

"Overriding" SQL Errors During R Uploads

I have this data set:
var_1 = rnorm(1000,1000,1000)
var_2 = rnorm(1000,1000,1000)
var_3 = rnorm(1000,1000,1000)
sample_data = data.frame(var_1, var_2, var_3)
I broke this data set into groups of 100 - thus creating 10 mini datasets:
list_of_dfs <- split(
  sample_data, (seq(nrow(sample_data)) - 1) %/% 100
)
table_names <- paste0("sample_", 1:10)
I now want to upload these 10 mini datasets to an SQL server:
library(odbc)
library(DBI)
library(RODBC)
library(purrr)
#establish connection
map2(
  table_names,
  list_of_dfs,
  function(x, y) dbWriteTable(connection, x, y)
)
The problem is that one of these mini datasets (e.g. sample_6) is not being accepted by the SQL server and gives this error:
Error in result_insert_dataframe(rs@ptr, values): nanodbc/nanodbc.cpp:1587: HY008: Operation canceled
This means that "sample_1", "sample_2", "sample_3", "sample_4" and "sample_5" were all successfully uploaded - but since "sample_6" was rejected, "sample_7", "sample_8", "sample_9" and "sample_10" were never attempted.
Is there a way to "override" this error and ensure that if one of these "sample_i" datasets is rejected, the computer will skip it and attempt to upload the remaining datasets?
If I were to do this manually, I could just "force" R to skip over the problem data set. For example, imagine if "sample_2" was causing the problem:
dbWriteTable(my_connection, SQL("sample_1"), sample_1)
dbWriteTable(my_connection, SQL("sample_2"), sample_2)
Error in result_insert_dataframe(rs@ptr, values): nanodbc/nanodbc.cpp:1587: HY008: Operation canceled
dbWriteTable(my_connection, SQL("sample_3"), sample_3)
In the above code, "sample_1" and "sample_3" are successfully uploaded even though "sample_2" was causing a problem.
Is it possible to override these errors when "bulk-uploading" the datasets?
Thank you!
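One way is to wrap each dbWriteTable call in tryCatch, so a failing upload is logged and skipped instead of aborting the whole loop. A minimal sketch, assuming `connection` is a valid DBI connection handle:
results <- map2(
  table_names,
  list_of_dfs,
  function(x, y) {
    tryCatch(
      dbWriteTable(connection, x, y),
      error = function(e) {
        # log the failure and return NULL so map2 moves on to the next dataset
        message("Upload failed for ", x, ": ", conditionMessage(e))
        NULL
      }
    )
  }
)
Afterwards, results should hold one entry per table, with NULL marking the uploads that were skipped, so the failures can be retried later.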

Apache Pig reading name value pairs in data file

I have a sample Pig script with data that will read a CSV file and dump it to screen; however, my data has name-value pairs. How can I read in a line of name-value pairs and split each pair, using the name for the field and the value for the value?
data:
1,Smith,Bob,Business Development
2,Doe,John,Developer
3,Jane,Sally,Tester
script:
data = LOAD 'example-data.txt' USING PigStorage(',')
AS (id:chararray, last_name:chararray,
first_name:chararray, role:chararray);
DESCRIBE data;
DUMP data;
output:
data: {id: chararray,last_name: chararray,first_name: chararray,role: chararray}
(1,Smith,Bob,Business Development)
(2,Doe,John,Developer)
(3,Jane,Sally,Tester)
However, given the following input (as name-value pairs), how could I process the data to get the same "data object"?
id=1,last_name=Smith,first_name=Bob,role=Business Development
id=2,last_name=Doe,first_name=John,role=Developer
id=3,last_name=Jane,first_name=Sally,role=Tester
Refer to STRSPLIT
A = LOAD 'example-data.txt' USING PigStorage(',') AS (f1:chararray,f2:chararray,f3:chararray, f4:chararray);
B = FOREACH A GENERATE
FLATTEN(STRSPLIT(f1,'=',2)) as (n1:chararray,v1:chararray),
FLATTEN(STRSPLIT(f2,'=',2)) as (n2:chararray,v2:chararray),
FLATTEN(STRSPLIT(f3,'=',2)) as (n3:chararray,v3:chararray),
FLATTEN(STRSPLIT(f4,'=',2)) as (n4:chararray,v4:chararray);
C = FOREACH B GENERATE v1,v2,v3,v4;
DUMP C;
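Since each STRSPLIT(fN,'=',2) call yields a (name, value) tuple, projecting only the values in C should reproduce the original relation; DUMP C ought to print:
(1,Smith,Bob,Business Development)
(2,Doe,John,Developer)
(3,Jane,Sally,Tester)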

Apache Pig floating number SUM error in precision

I have rows with double values.
The sum of them, however, has additional floating-point digits which I don't want in the output. Any idea how to avoid this problem?
A = LOAD 'test.csv' Using PigStorage(',') AS (
ID: chararray,
COST:double
);
B = GROUP A BY (ID);
C = FOREACH B GENERATE SUM(A.COST);
STORE C INTO 'out.txt' USING PigStorage(',');
INPUT FILE
A,0.51
A,0.51
B,4.81
B,4.81
EXPECTED OUTPUT FILE
A,1.02
B,9.62
ACTUAL INVALID OUTPUT FILE
10.020000457763672
9.619999885559082
Try:
C = FOREACH B GENERATE ROUND(SUM(A.COST) * 100.0) / 100.0;
EDIT
It works.
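Note that this still emits only the rounded sum; to reproduce the expected ID,COST rows you would also project the group key. A sketch:
C = FOREACH B GENERATE group AS ID, ROUND(SUM(A.COST) * 100.0) / 100.0 AS COST;
STORE C INTO 'out.txt' USING PigStorage(',');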

In Pig Latin, I am not able to load data as multiple tuples, please advise

I am not able to load the data as multiple tuples, and I am not sure what mistake I am making; please advise.
data.txt
vineet 1 pass Govt
hisham 2 pass Prvt
raj 3 fail Prvt
I want to load them as 2 tuples.
A = LOAD 'data.txt' USING PigStorage('\t') AS (T1:tuple(name:bytearray, no:int), T2:tuple(result:chararray, school:chararray));
OR
A = LOAD 'data.txt' USING PigStorage('\t') AS (T1:(name:bytearray, no:int), T2:(result:chararray, school:chararray));
dump A;
The below data is displayed instead; I don't know why I am not able to read the actual data from data.txt.
(,)
(,)
(,)
As the input data is not stored as tuples, we won't be able to read it directly into a tuple.
One feasible approach is to read the data and then form a tuple from the required fields.
Pig Script:
A = LOAD 'a.csv' USING PigStorage('\t') AS (name:chararray,no:int,result:chararray,school:chararray);
B = FOREACH A GENERATE (name,no) AS T1:tuple(name:chararray, no:int), (result,school) AS T2:tuple(result:chararray, school:chararray);
DUMP B;
Input: a.csv
vineet 1 pass Govt
hisham 2 pass Prvt
raj 3 fail Prvt
Output of DUMP B:
((vineet,1),(pass,Govt))
((hisham,2),(pass,Prvt))
((raj,3),(fail,Prvt))
Output of DESCRIBE B:
B: {T1: (name: chararray,no: int),T2: (result: chararray,school: chararray)}
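An alternative that should give the same result is the builtin TOTUPLE, which packs fields into a tuple without spelling out the inline schema; a sketch against the same relation A:
B = FOREACH A GENERATE TOTUPLE(name, no) AS T1, TOTUPLE(result, school) AS T2;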

How to get the number of words per line in pig?

I'm trying to figure out how many words there are per line in a file in Pig. I've gotten as far as loading and splitting:
raw = LOAD 'file' AS (line:chararray);
words = FOREACH raw GENERATE TOKENIZE(*);
which gets me a bag of tuples, each containing a word. Then I go to count these items:
counts = FOREACH words GENERATE COUNT(*);
I get an error:
org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing count in COUNT
...
Caused by: java.lang.NullPointerException
Is that because some of the lines have an empty bag? Or is there something else I'm doing wrong?
If it is a problem with an empty bag, then you can try something like this (not tested):
raw = LOAD 'file' AS (line:chararray);
words = FOREACH raw GENERATE TOKENIZE(line) AS tokenized_words;
counts = FOREACH words GENERATE ((tokenized_words IS NULL OR IsEmpty(tokenized_words)) ? 0L : COUNT(tokenized_words)) AS total_count;
Here we write an if-else (bincond) to check whether tokenized_words is null or empty; if yes, we assign zero, else the total count.
Can you try it like this?
Input:
Hi hello how are you
this is apache pig
works

like a charm
Pigscript:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE TOKENIZE(line);
C = FOREACH B GENERATE COUNT($0);
DUMP C;
Output:
(5)
(4)
(1)
()
(3)