SUBSTRING operation does not work in JOIN operation - apache-pig

I have a column col1 in file 1:
00SPY58KHT5
00SPXB2BD0J
00SPXB2DXH6
00SPXDQ02S1
00SPXDY91JI
00SPXFG88L6
00SPXF1AQ4Z
00SPXF5UKS3
00SPXGL9IV6
I have column col2 in file2:
0SPY58KHT5
0SPXB2BD0J
0SPXB2DXH6
0SPXDQ02S1
0SPXDY91JI
0SPXFG88L6
0SPXF1AQ4Z
0SPXF5UKS3
0SPXGL9IV6
As you can see, the values in the first file have an extra leading 0.
I need to JOIN the two files on these columns, so I tried to use SUBSTRING like this:
JOIN_FILE1_FILE2 = JOIN FILE1 BY TRIM(SUBSTRING(col1,1,10)), FILE1 BY TRIM(col2);
DUMP JOIN_FILE1_FILE2;
But I get an empty result.
Input(s):
Successfully read 914493 records from: "/hdfs/data/adhoc/PR/02/RDO0/GUIDES/GUIDE_CONTRAT_USINE.csv"
Successfully read 102851809 records from: "/hdfs/data/adhoc/PR/02/RDO0/BB0/MGM7X007-2019-09-11.csv"
Output(s):
Successfully stored 0 records in: "hdfs://ha-manny/hdfs/hadoop/pig/tmp/temp964914764/tmp1220183619"
How can I do this join, please?

As a solution, I first generated a relation that applies the SUBSTRING function to col1.
Then I did the filtering using TRIM, and finally used CONCAT('0',col1) in another generation step to put the leading 0 back.
In other words:
DATA1 = FOREACH DATA_SOURCE GENERATE
SUBSTRING(col1,1,10) AS col1;
JOINED_DATA = JOIN DATA1 BY col1, ...
FINAL_DATA = FOREACH JOINED_DATA GENERATE
CONCAT('0',col1) AS col1,
...
And this works without problem.
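The same normalize-then-join idea can be sketched outside Pig. The Python snippet below is only an illustration, assuming the keys look exactly like the samples in the question; the function name `join_on_trimmed_key` and the in-memory lists are hypothetical. Python slices follow the same convention as Pig's SUBSTRING (zero-based start, exclusive stop), so the last two lines also show that on an 11-character value a stop index of 10 leaves out the final character, which is worth checking whenever such a join comes back empty.

```python
def join_on_trimmed_key(col1_values, col2_values):
    """Join two lists of keys after dropping the extra leading character of col1."""
    # Index the second file's keys by their trimmed value for fast lookups.
    lookup = {value.strip(): value for value in col2_values}
    joined = []
    for raw in col1_values:
        key = raw.strip()[1:]  # drop only the first character
        if key in lookup:
            joined.append((raw, lookup[key]))
    return joined


# Hypothetical in-memory stand-ins for the two files.
file1 = ["00SPY58KHT5", "00SPXB2BD0J"]
file2 = ["0SPY58KHT5", "0SPXB2BD0J", "0SPXGL9IV6"]

print(join_on_trimmed_key(file1, file2))

# Slice comparison on an 11-character key:
print("00SPY58KHT5"[1:10])  # 9 characters, the trailing one is lost
print("00SPY58KHT5"[1:11])  # 10 characters, matches the col2 format
```

Normalizing the key in its own step before the join, as the answer does, also makes it easy to inspect the derived values before joining.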


mismatched input 'from'. Expecting: ',', <expression>

I have a query that I am running on AWS Athena that should return all the filenames that are not contained in the second table. I am basically trying to find all the filenames that are not in the ejpos landing table.
The one table looks like this (item sales):
| origin_file | run_id |
| /datarite/ejpos/8023/20220706/filename1 | 8035 |
| /datarite/ejpos/8023/20220706/filename2 | 8035 |
| /datarite/ejpos/8023/20220706/filename3 | 8035 |
The other table looks like this (ejpos_files_landing):
filename
filename1
filename2
filename3
filename4
They don't have the same number of rows, hence I am trying to find the file names that are in ejpos_pos_landing but not in item sales table.
I get this error when I run:
mismatched input 'from'. Expecting: ',', <expression>
The query is here:
SELECT trim("/datarite/ejpos/8023/20220706/" from "validated"."datarite_ejpos_itemsale" where
run_id = '8035') as origin_file,
FROM "validated"."datarite_ejpos_itemsale"
LEFT JOIN "landing"."ejpos_landing_files" ON "landing"."ejpos_landing_files".filename =
"validated"."datarite_ejpos_itemsale".origin_file
WHERE "landing"."ejpos_landing_files".filename IS NULL;
The expected result would be:
|filename4|
Because it is not in the other table
Can anyone assist?
There are several problems in your query, based on the example data and the stated goal.
trim("/datarite/ejpos/8023/20220706/" from "validated"."datarite_ejpos_itemsale" where run_id = '8035') as origin_file is not valid SQL.
ON "landing"."ejpos_landing_files".filename = "validated"."datarite_ejpos_itemsale".origin_file will not work because origin_file is prefixed with the path. You can use strpos if the filename should appear only once in origin_file.
Your join and filtering conditions are built to find items present in datarite_ejpos_itemsale and missing in ejpos_landing_files, while you state that the opposite is needed.
There is also the extra comma mentioned in the comments.
Try the following:
-- sample data
WITH item_sales(origin_file, run_id) AS (
VALUES ('/datarite/ejpos/8023/20220706/filename1', 8035),
('/datarite/ejpos/8023/20220706/filename2', 8035),
('/datarite/ejpos/8023/20220706/filename3', 8035),
('/datarite/ejpos/8023/20220706/filename4', 8036)
),
ejpos_files_landing(filename) as(
VALUES ('filename1'),
('filename2'),
('filename3'),
('filename4')
)
-- query
select filename
from ejpos_files_landing l
left outer join item_sales s -- reverse the join
on strpos(s.origin_file, l.filename) >= 1 -- assuming that filename should be present only one time in the string
and s.run_id = 8035 -- if you need to filter out run id
where s.origin_file is null
Output:
filename
filename4
An alternative approach you can try:
-- query
select filename
from ejpos_files_landing l
where filename not in (
select element_at(split(origin_file, '/'), -1) -- split by '/' and get last
from item_sales
where run_id = 8035
)
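The logic of both queries (take the last path segment of origin_file, then keep the landing filenames that have no match) can be sketched in plain Python. This is only an illustration of the anti-join, not Athena code; the function name `missing_filenames` and the in-memory lists are hypothetical stand-ins for the two tables.

```python
def missing_filenames(landing, item_sales, run_id):
    """Return landing filenames that have no matching origin_file for run_id."""
    # Last path segment of every origin_file belonging to the requested run.
    seen = {path.rsplit("/", 1)[-1] for path, rid in item_sales if rid == run_id}
    # Anti-join: keep only the names that were never seen.
    return [name for name in landing if name not in seen]


item_sales = [
    ("/datarite/ejpos/8023/20220706/filename1", 8035),
    ("/datarite/ejpos/8023/20220706/filename2", 8035),
    ("/datarite/ejpos/8023/20220706/filename3", 8035),
    ("/datarite/ejpos/8023/20220706/filename4", 8036),
]
ejpos_files_landing = ["filename1", "filename2", "filename3", "filename4"]

print(missing_filenames(ejpos_files_landing, item_sales, 8035))
```

Comparing on the exact last segment, as the split-based query does, avoids the substring false positives that a strpos-style match can produce when one filename is contained in another.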

How can I count all NULL values, without column names, using SQL?

I'm reading and executing sql queries from file and I need to inspect the result sets to count all the null values across all columns. Because the SQL is read from file, I don't know the column names and thus can't call the columns by name when trying to find the null values.
I think using CTE is the best way to do this, but how can I call the columns when I don't know what the column names are?
WITH query_results AS
(
<sql_read_from_file_here>
)
select count_if(<column_name> is not null) FROM query_results
If you are using Python to read the file of SQL statements, you can do something like this, which uses pglast to parse the SQL query and get the column names for you:
import pglast
sql_read_from_file_here = "SELECT 1 foo, 1 bar"
# parse the statement; the dict-style access below assumes a pglast version whose parse_sql returns plain dicts (newer releases may return AST node objects instead)
ast = pglast.parse_sql(sql_read_from_file_here)
cols = ast[0]['RawStmt']['stmt']['SelectStmt']['targetList']
# one null-counting expression per output column
sum_stmt = "sum(iff({col} is null,1,0))"
sums = [sum_stmt.format(col = col['ResTarget']['name']) for col in cols]
print(f"select {' + '.join(sums)} total_null_count from query_results")
# outputs: select sum(iff(foo is null,1,0)) + sum(iff(bar is null,1,0)) total_null_count from query_results
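If parsing the SQL is not an option, another possibility is to count the nulls on the client side while iterating over the result set, which needs no column names at all. The sketch below is a different technique from the one above, shown under stated assumptions: it uses the standard-library `sqlite3` module purely as a stand-in for whatever DB-API driver you actually use, and `count_nulls` is a hypothetical helper name.

```python
import sqlite3


def count_nulls(conn, sql):
    """Run sql and count NULL values across every column of every row."""
    cursor = conn.execute(sql)
    # DB-API drivers return SQL NULL as Python None, whatever the column is called.
    return sum(value is None for row in cursor for value in row)


# In-memory database used only to make the example self-contained.
conn = sqlite3.connect(":memory:")
sql_read_from_file_here = "SELECT 1 AS foo, NULL AS bar UNION ALL SELECT NULL, NULL"
total = count_nulls(conn, sql_read_from_file_here)
print(total)
```

This pulls every row to the client, so for large result sets generating a SQL-side aggregate is likely the better fit.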

data processing in pig , with tab separate

I am very new to Pig, so I am facing some issues while trying to perform very basic processing.
1- Load the file using Pig
2- Write processing logic to filter records based on date. For example, the lines have 2 columns, col_1 and col_2 (assume the columns are chararray), and I need to get only the records that have a 1-day difference between col_1 and col_2.
3- Finally, store the filtered records in a Hive table.
Input file (tab separated):
2016-01-01T16:31:40.000+01:00 2016-01-02T16:31:40.000+01:00
2017-01-01T16:31:40.000+01:00 2017-01-02T16:31:40.000+01:00
When I try
A = LOAD '/user/inp.txt' USING PigStorage('\t') as (col_1:chararray,col_2:chararray);
The result I am getting is below:
DUMP A;
(,2016-01-03T19:28:58.000+01:00,2016-01-02T16:31:40.000+01:00)
(,2017-01-03T19:28:58.000+01:00,2017-01-02T16:31:40.000+01:00)
I am not sure why.
Can someone please help me with how to parse a tab-separated file, how to convert the chararray to a date, and how to filter based on the day difference?
Thanks
Convert the columns to datetime objects using ToDate and use DaysBetween. This gives the difference in days; keep the records where the difference == 1. Finally, store the result in Hive.
A = LOAD '/user/inp.txt' USING PigStorage('\t') as (col_1:chararray,col_2:chararray);
-- the sample values are ISO 8601 timestamps, so the single-argument ToDate is used;
-- DaysBetween subtracts its second argument from its first, so the later date goes first
B = FOREACH A GENERATE col_1, col_2, DaysBetween(ToDate(col_2),ToDate(col_1)) as day_diff;
C = FILTER B BY (day_diff == 1);
STORE C INTO 'your_hive_partition' USING org.apache.hive.hcatalog.pig.HCatStorer();
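To check the expected result on a small sample before running the Pig job, the same parse-and-filter logic can be sketched in Python. This is only an illustration, assuming every line holds exactly two tab-separated ISO 8601 timestamps like the ones in the question; `one_day_apart` is a hypothetical helper name.

```python
from datetime import datetime


def one_day_apart(lines):
    """Keep (col_1, col_2) pairs whose timestamps differ by one whole day."""
    kept = []
    for line in lines:
        col_1, col_2 = line.rstrip("\n").split("\t")
        # fromisoformat understands the offset-aware format used in the sample data.
        delta = datetime.fromisoformat(col_2) - datetime.fromisoformat(col_1)
        if delta.days == 1:
            kept.append((col_1, col_2))
    return kept


sample = [
    "2016-01-01T16:31:40.000+01:00\t2016-01-02T16:31:40.000+01:00",
    "2017-01-01T16:31:40.000+01:00\t2017-01-03T16:31:40.000+01:00",
]
print(one_day_apart(sample))
```

Only the first sample line is kept, because the second pair is two days apart.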

Pig: efficient filtering by loaded list

In Apache Pig (version 0.16.x), what are some of the most efficient methods to filter a dataset by an existing list of values for one of the dataset's fields?
For example,
(Updated per #inquisitive_mind's tip)
Input: a line-separated file with one value per line
my_codes.txt
'110'
'100'
'000'
sample_data.txt
'110', 2
'110', 3
'001', 3
'000', 1
Desired Output
'110', 2
'110', 3
'000', 1
Sample script
%default my_codes_file 'my_codes.txt'
%default sample_data_file 'sample_data.txt'
my_codes = LOAD '$my_codes_file' as (code:chararray);
sample_data = LOAD '$sample_data_file' as (code: chararray, point: float);
desired_data = FILTER sample_data BY code IN (my_codes.code);
Error:
Scalar has more than one row in the output. 1st : ('110'), 2nd :('100')
(common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar" )
I had also tried FILTER sample_data BY code IN my_codes; but the "IN" clause seems to require parenthesis.
I also tried FILTER sample_data BY code IN (my_codes); but got the error:
A column needs to be projected from a relation for it to be used as a scalar
The my_codes.txt file has the codes as a row instead of a column. Since you are loading it into a single field, the codes should be laid out like this:
'110'
'100'
'000'
Alternatively, you can use JOIN:
joined_data = JOIN sample_data BY code, my_codes BY code;
desired_data = FOREACH joined_data GENERATE $0,$1;
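The effect of that JOIN is a semi-join: keep the rows of sample_data whose code appears in the loaded list. A minimal Python sketch of the same filter is shown here for illustration only; `filter_by_codes` and the in-memory lists are hypothetical, and the quotes from the sample files are omitted for brevity.

```python
def filter_by_codes(rows, codes):
    """Keep the rows whose first field is one of the allowed codes."""
    allowed = set(codes)  # a set gives constant-time membership tests
    return [row for row in rows if row[0] in allowed]


my_codes = ["110", "100", "000"]
sample_data = [("110", 2.0), ("110", 3.0), ("001", 3.0), ("000", 1.0)]

print(filter_by_codes(sample_data, my_codes))
```

One difference worth keeping in mind: a set ignores duplicate codes, whereas an inner JOIN would repeat a data row once per duplicate in my_codes, so the list may need a DISTINCT before joining.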

Apache Pig - Removing the pseudo-column added by -tagFile

I have files of the format test_YYYYMM.txt. I am using '-tagFile' and SUBSTRING() to extract the year and month for use in my pig script.
The file name gets added as a pseudo-column at the beginning of the tuple.
Before I do a DUMP I would like to remove that column. Doing a FOREACH ... GENERATE with only the columns I need does not work; it still retains the pseudo-column.
Is there a way to remove this column?
My sample script is as follows
raw_data = LOAD 'test_201501.txt' using PigStorage('|', '-tagFile') as
col1: chararray, col2: chararray;
data_with_yearmonth = FOREACH raw_data GENERATE
SUBSTRING($0,5,11) as yearmonth,
'TEST_DATA' as test,
col1,
col2;
DUMP data_with_yearmonth;
Expected Output:
201501, TEST_DATA, col1, col2
Current Output:
201501, TEST_DATA, test_YYYYMM.txt, col1, col2
First of all, if col1 and col2 are strings, then you should declare them as CHARARRAY in Pig.
Also, I guess your current output is actually: 201501, TEST_DATA, test_YYYYMM.txt, col1.
Correct me if I'm wrong, but because you used '-tagFile', the first column is the file name, which is why you access it with $0 in your SUBSTRING.
You can try this code:
raw_data = LOAD 'text_201505.txt'
USING PigStorage('|', '-tagFile')
AS (title: CHARARRAY, col1: CHARARRAY, col2: CHARARRAY);
data_with_yearmonth = FOREACH raw_data
GENERATE
SUBSTRING($0,5,11) AS yearmonth,
'TEST_DATA' AS test,
col1,
col2;
DUMP data_with_yearmonth;
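The index arithmetic behind SUBSTRING($0,5,11) can be checked with a Python slice, which uses the same zero-based start and exclusive stop. The sketch below is only an illustration of the projection, assuming file names of the form test_YYYYMM.txt; `add_yearmonth` is a hypothetical helper and the tuple stands in for one loaded row, with the file name as an explicit first field.

```python
def add_yearmonth(row):
    """Replace the leading file-name field with the year and month taken from it."""
    file_name, col1, col2 = row
    # Characters 5..10 of 'test_YYYYMM.txt' are the six digits after 'test_'.
    yearmonth = file_name[5:11]
    return (yearmonth, "TEST_DATA", col1, col2)


print(add_yearmonth(("test_201501.txt", "a", "b")))
```

Naming the file-name field in the schema, as the answer does, is what keeps col1 and col2 from shifting by one position.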