Pig: efficient filtering by loaded list - apache-pig

In Apache Pig (version 0.16.x), what are some of the most efficient methods to filter a dataset by an existing list of values for one of the dataset's fields?
For example,
(Updated per #inquisitive_mind's tip)
Input: a line-separated file with one value per line
my_codes.txt
'110'
'100'
'000'
sample_data.txt
'110', 2
'110', 3
'001', 3
'000', 1
Desired Output
'110', 2
'110', 3
'000', 1
Sample script
%default my_codes_file 'my_codes.txt'
%default sample_data_file 'sample_data.txt'
my_codes = LOAD '$my_codes_file' as (code:chararray);
sample_data = LOAD '$sample_data_file' as (code: chararray, point: float);
desired_data = FILTER sample_data BY code IN (my_codes.code);
Error:
Scalar has more than one row in the output. 1st : ('110'), 2nd :('100')
(common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar" )
I had also tried FILTER sample_data BY code IN my_codes; but the "IN" clause seems to require parentheses.
I also tried FILTER sample_data BY code IN (my_codes); but got the error:
A column needs to be projected from a relation for it to be used as a scalar

The my_codes.txt file has the codes as a row instead of a column. Since you are loading it into a single field, the codes should be listed one per line, like this:
'110'
'100'
'000'
Alternatively, you can use JOIN:
joined_data = JOIN sample_data BY code, my_codes BY code;
desired_data = FOREACH joined_data GENERATE $0,$1;
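To make the effect of the JOIN concrete, here is a minimal Python sketch (not Pig) run against the sample rows from the question. The function name filter_by_codes is hypothetical; it keeps the rows whose code appears in the loaded list, which matches the inner join as long as my_codes contains no duplicate codes.

```python
def filter_by_codes(rows, codes):
    """Keep rows whose first field appears in `codes` (a semi-join)."""
    allowed = set(codes)  # build the lookup set once
    return [row for row in rows if row[0] in allowed]


# Sample values taken from the question, with the quotes stripped.
my_codes = ["110", "100", "000"]
sample_data = [("110", 2.0), ("110", 3.0), ("001", 3.0), ("000", 1.0)]

desired_data = filter_by_codes(sample_data, my_codes)
print(desired_data)
```

If my_codes could contain the same code twice, an inner join would emit each matching row twice, whereas the set lookup here would not. In Pig, applying DISTINCT to my_codes before the JOIN avoids that.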


mismatched input 'from'. Expecting: ',', <expression>

I have a query that I am running on AWS Athena that should return all the filenames that are not contained in the second table. I am basically trying to find all the filenames that are not in the ejpos landing table.
The one table looks like this (item sales):
origin_file
run_id
/datarite/ejpos/8023/20220706/filename1
8035
/datarite/ejpos/8023/20220706/filename2
8035
/datarite/ejpos/8023/20220706/filename3
8035
The other table looks like this (ejpos_files_landing):
filename
filename1
filename2
filename3
filename4
They don't have the same number of rows, hence I am trying to find the file names that are in ejpos_pos_landing but not in item sales table.
I get this error when I run:
mismatched input 'from'. Expecting: ',', <expression>
The query is here:
SELECT trim("/datarite/ejpos/8023/20220706/" from "validated"."datarite_ejpos_itemsale" where
run_id = '8035') as origin_file,
FROM "validated"."datarite_ejpos_itemsale"
LEFT JOIN "landing"."ejpos_landing_files" ON "landing"."ejpos_landing_files".filename =
"validated"."datarite_ejpos_itemsale".origin_file
WHERE "landing"."ejpos_landing_files".filename IS NULL;
The expected result would be:
|filename4|
Because it is not in the other table
Can anyone assist?
There are several problems in your query, given the example data and the stated goal:
trim("/datarite/ejpos/8023/20220706/" from "validated"."datarite_ejpos_itemsale" where run_id = '8035') as origin_file is not valid SQL.
ON "landing"."ejpos_landing_files".filename = "validated"."datarite_ejpos_itemsale".origin_file will not work because origin_file carries a path prefix. You can use strpos if the filename should appear only once in origin_file.
Your join and filter conditions are built to find items present in datarite_ejpos_itemsale and missing from ejpos_landing_files, while you state that the reverse is needed.
There is an extra comma before FROM, as mentioned in the comments.
Try the following:
-- sample data
WITH item_sales(origin_file, run_id) AS (
VALUES ('/datarite/ejpos/8023/20220706/filename1', 8035),
('/datarite/ejpos/8023/20220706/filename2', 8035),
('/datarite/ejpos/8023/20220706/filename3', 8035),
('/datarite/ejpos/8023/20220706/filename4', 8036)
),
ejpos_files_landing(filename) as(
VALUES ('filename1'),
('filename2'),
('filename3'),
('filename4')
)
-- query
select filename
from ejpos_files_landing l
left outer join item_sales s -- reverse the join
on strpos(s.origin_file, l.filename) >= 1 -- assuming that filename should be present only one time in the string
and s.run_id = 8035 -- if you need to filter out run id
where s.origin_file is null
Output:
filename
filename4
An alternative approach you can try:
-- query
select filename
from ejpos_files_landing l
where filename not in (
select element_at(split(origin_file, '/'), -1) -- split by '/' and get last
from item_sales
where run_id = 8035
)
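The logic of the NOT IN query above can be sketched in Python on the same sample data. This is only an illustration of the anti-join, not Athena code, and the function name missing_filenames is hypothetical. It takes the last path component of each origin_file for the chosen run_id and reports the landing filenames that never appear.

```python
def missing_filenames(landing_files, item_sales, run_id):
    """Return landing filenames with no matching origin_file for `run_id`."""
    present = {
        origin_file.rsplit("/", 1)[-1]  # last path component, like split + element_at(-1)
        for origin_file, row_run_id in item_sales
        if row_run_id == run_id
    }
    return [name for name in landing_files if name not in present]


# Sample data copied from the answer's WITH clause.
item_sales = [
    ("/datarite/ejpos/8023/20220706/filename1", 8035),
    ("/datarite/ejpos/8023/20220706/filename2", 8035),
    ("/datarite/ejpos/8023/20220706/filename3", 8035),
    ("/datarite/ejpos/8023/20220706/filename4", 8036),
]
ejpos_files_landing = ["filename1", "filename2", "filename3", "filename4"]

result = missing_filenames(ejpos_files_landing, item_sales, 8035)
print(result)
```

One difference worth keeping in mind: in SQL, NOT IN returns no rows at all if the subquery produces a NULL, while the set lookup in this sketch has no such behaviour.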

How can I count all NULL values, without column names, using SQL?

I'm reading and executing sql queries from file and I need to inspect the result sets to count all the null values across all columns. Because the SQL is read from file, I don't know the column names and thus can't call the columns by name when trying to find the null values.
I think using CTE is the best way to do this, but how can I call the columns when I don't know what the column names are?
WITH query_results AS
(
<sql_read_from_file_here>
)
select count_if(<column_name> is not null) FROM query_results
If you are using Python to read the file of SQL statements, you can do something like this which uses pglast to parse the SQL query to get the columns for you:
import pglast
sql_read_from_file_here = "SELECT 1 foo, 1 bar"
ast = pglast.parse_sql(sql_read_from_file_here)
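# Dict-style access assumes an older pglast release whose parse_sql returns a JSON-like tree.
cols = ast[0]['RawStmt']['stmt']['SelectStmt']['targetList']
sum_sql = "sum(iff({col} is null,1,0))"  # template for one column's null count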
sums = [sum_sql.format(col = col['ResTarget']['name']) for col in cols]
print(f"select {' + '.join(sums)} total_null_count from query_results")
# outputs: select sum(iff(foo is null,1,0)) + sum(iff(bar is null,1,0)) total_null_count from query_results
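A different route, if the result set is small enough to fetch, is to count the nulls on the client side, where column names are not needed at all. The sketch below is an assumption-laden illustration rather than the approach from the answer above. It uses the standard-library sqlite3 module as a stand-in for whatever database is actually in use, and the helper name count_nulls_per_column is hypothetical. Column names are read from cursor.description only for reporting.

```python
import sqlite3


def count_nulls_per_column(conn, sql):
    """Run `sql` and return [(column_name, null_count), ...] in column order."""
    cursor = conn.execute(sql)
    names = [desc[0] for desc in cursor.description]  # names come from the result set
    counts = [0] * len(names)
    for row in cursor:
        for i, value in enumerate(row):
            if value is None:  # SQL NULL arrives as None
                counts[i] += 1
    return list(zip(names, counts))


conn = sqlite3.connect(":memory:")  # stand-in database for the sketch
sql_read_from_file_here = "SELECT 1 AS foo, NULL AS bar UNION ALL SELECT NULL, NULL"

per_column = count_nulls_per_column(conn, sql_read_from_file_here)
total_null_count = sum(count for _, count in per_column)
print(per_column, total_null_count)
```

Counting by position rather than by name also copes with queries that return two columns with the same name.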

data processing in pig , with tab separate

I am very new to Pig, so I am facing some issues while trying to perform very basic processing. I need to:
1- Load that file using Pig
2- Write a processing logic to filter records based on Date , for example the lines have 2 columns col_1 and col_2 ( assume the columns are chararray ) and I need to get only the records which are having 1 day difference between col_1 and col_2.
3- Finally store that filtered record in Hive table .
Input file ( tab separated ) :-
2016-01-01T16:31:40.000+01:00 2016-01-02T16:31:40.000+01:00
2017-01-01T16:31:40.000+01:00 2017-01-02T16:31:40.000+01:00
When I try
A = LOAD '/user/inp.txt' USING PigStorage('\t') as (col_1:chararray,col_2:chararray);
The result I am getting like below :-
DUMP A;
(,2016-01-03T19:28:58.000+01:00,2016-01-02T16:31:40.000+01:00)
(,2017-01-03T19:28:58.000+01:00,2017-01-02T16:31:40.000+01:00)
Not sure Why ?
Can someone please help me with how to parse a tab-separated file, convert the chararray values to dates, and filter based on the day difference?
Thanks
Convert the columns to datetime objects using ToDate and compute the difference with DaysBetween. Keep only the records where the difference is 1, and finally store the result in Hive.
A = LOAD '/user/inp.txt' USING PigStorage('\t') as (col_1:chararray,col_2:chararray);
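-- The sample timestamps are ISO 8601 strings, so the single-argument ToDate parses them directly.
-- DaysBetween is computed as first argument minus second, so the later column goes first.
B = FOREACH A GENERATE DaysBetween(ToDate(col_2),ToDate(col_1)) as day_diff;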
C = FILTER B BY (day_diff == 1);
STORE C INTO 'your_hive_partition' USING org.apache.hive.hcatalog.pig.HCatStorer();
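The date arithmetic can be checked outside Pig with a small Python sketch on the two sample lines from the question. This is only an illustration of the "one day apart" filter. The helper name days_between is hypothetical, and it assumes the columns are ISO 8601 strings exactly as shown in the input file.

```python
from datetime import datetime


def days_between(later, earlier):
    """Whole days from `earlier` to `later`, both ISO 8601 strings."""
    delta = datetime.fromisoformat(later) - datetime.fromisoformat(earlier)
    return delta.days


# The two tab-separated sample lines from the question.
lines = [
    "2016-01-01T16:31:40.000+01:00\t2016-01-02T16:31:40.000+01:00",
    "2017-01-01T16:31:40.000+01:00\t2017-01-02T16:31:40.000+01:00",
]

rows = [tuple(line.split("\t")) for line in lines]
one_day_apart = [(col_1, col_2) for col_1, col_2 in rows if days_between(col_2, col_1) == 1]
print(one_day_apart)
```

Both sample rows pass the filter, because col_2 is exactly 24 hours after col_1 in each.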

Pig latin script treats different columns in csv file as one single column

I am just pasting a line from the file for example
The following line is from the file 'airlines_new.txt' which I am loading into a relation
2008,1,3,4,617,615,652,650,WN,11,N689SW,95,95,70,2,2,IND,MCI,451,6,19,0,,0,NA,NA ,NA,NA,NA
====================================================
I am using the following query :
Airlines_data_schema = LOAD '/user/Jig13517/airlines_new.txt'
USING PigStorage(' ') AS
(Year, Month, DayofMonth, DayofWeek, DepTime_actual:chararray, CRSDeptime:chararray, Arrtime_actual:chararray, CRSArrtime:chararray, UniqueCarrier, FlightNum, TailNum_Plane ,ActualElapsedTime, CRSElapsedTime, Airtime, Arrdelay, Depdelay, Origin,Dest, Distance, Taxiin, Taxiout, Cancelled, CancellationCode, Diverted, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay);
==========================================================
B = FOREACH Airlines_data_schema generate $0 ;
dump B ;
=========================================================
Result :
(Year, Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCar rier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDela y,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,Carrie rDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay )
(2008,1,3,4,617,615,652,650,WN,11,N689SW,95,95,70,2,2,IND,MCI,451,6,19,0,,0,NA,N A,NA,NA,NA )
It is returning all the columns as a single column, but the intention is to break them into separate columns. According to my script, it should return only the column "Year".
The fields are separated by commas, but in the script you are using ' ' as the delimiter. Modify your script to use ',' as the delimiter in PigStorage.
Airlines_data_schema = LOAD '/user/Jig13517/airlines_new.txt' USING PigStorage(',') AS (Year,Month,DayofMonth,DayofWeek,DepTime_actual:chararray,CRSDeptime:chararray,Arrtime_actual:chararray,CRSArrtime:chararray,UniqueCarrier,FlightNum,TailNum_Plane,ActualElapsedTime,CRSElapsedTime,Airtime,Arrdelay,Depdelay,Origin,Dest,Distance,Taxiin,Taxiout,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay);
The appropriate delimiter needs to be used in this scenario to ensure that the fields are separated.
Airlines_data_schema = LOAD '/user/Jig13517/airlines_new.txt'
USING PigStorage(',') AS
(Year, Month, DayofMonth, DayofWeek, DepTime_actual:chararray, CRSDeptime:chararray, Arrtime_actual:chararray, CRSArrtime:chararray, UniqueCarrier, FlightNum, TailNum_Plane ,ActualElapsedTime, CRSElapsedTime, Airtime, Arrdelay, Depdelay, Origin,Dest, Distance, Taxiin, Taxiout, Cancelled, CancellationCode, Diverted, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay);
This lets you access each of the fields that are separated by ',' in the CSV.
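The effect of the wrong delimiter is easy to reproduce outside Pig. The Python sketch below is an illustration only, using the first eleven fields of the sample line from the question. Splitting on a space finds no separator and returns the whole line as one field, which is why $0 printed everything, while splitting on a comma yields the individual fields.

```python
# First eleven fields of the sample record from the question.
line = "2008,1,3,4,617,615,652,650,WN,11,N689SW"

fields_space = line.split(" ")  # wrong delimiter: no spaces, so one field
fields_comma = line.split(",")  # correct delimiter for this file

print(len(fields_space), fields_space[0])
print(len(fields_comma), fields_comma[0])
```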

Pig: Cast error while grouping data

This is the code that I am trying to run. Steps:
Take an input (there is a .pig_schema file in the input folder)
Take only two fields (chararray) from it and remove duplicates
Group on one of those fields
The code is as follows:
x = LOAD '$input' USING PigStorage('\t'); --The input is tab separated
x = LIMIT x 25;
DESCRIBE x;
-- Output of DESCRIBE x:
-- x: {id: chararray,keywords: chararray,score: chararray,time: long}
distinctCounts = FOREACH x GENERATE keywords, id; -- generate two fields
distinctCounts = DISTINCT distinctCounts; -- remove duplicates
DESCRIBE distinctCounts;
-- Output of DESCRIBE distinctCounts;
-- distinctCounts: {keywords: chararray,id: chararray}
grouped = GROUP distinctCounts BY keywords; --group by keywords
DESCRIBE grouped; --THIS IS WHERE IT GIVES AN ERROR
DUMP grouped;
When I do the grouped, it gives the following error:
ERROR org.apache.pig.tools.pigstats.SimplePigStats -
ERROR: org.apache.pig.data.DataByteArray cannot be cast to java.lang.String
keywords is a chararray and Pig should be able to group on a chararray. Any ideas?
EDIT:
Input file:
0000010000014743 call for midwife 23 1425761139
0000010000062069 naruto 1 56 1425780386
0000010000079919 the following 98 1425788874
0000010000081650 planes 2 76 1425721945
0000010000118785 law and order 21 1425763899
0000010000136965 family guy 12 1425766338
0000010000136100 american dad 19 1425766702
.pig_schema file
{"fields":[{"name":"id","type":55},{"name":"keywords","type":55},{"name":"score","type":55},{"name":"time","type":15}]}
Pig is not able to identify the value of keywords as a chararray. It is better to name and type the fields in the initial LOAD, so that the field types are stated explicitly.
x = LOAD '$input' USING PigStorage('\t') AS (id:chararray,keywords:chararray,score: chararray,time: long);
UPDATE :
I tried the snippet below with the .pig_schema updated to include score, with '\t' as the separator, on the input you shared.
x = LOAD 'a.csv' USING PigStorage('\t');
distinctCounts = FOREACH x GENERATE keywords, id;
distinctCounts = DISTINCT distinctCounts;
grouped = GROUP distinctCounts BY keywords;
DUMP grouped;
I would suggest using unique alias names for better readability and maintainability.
Output :
(naruto 1,{(naruto 1,0000010000062069)})
(planes 2,{(planes 2,0000010000081650)})
(family guy,{(family guy,0000010000136965)})
(american dad,{(american dad,0000010000136100)})
(law and order,{(law and order,0000010000118785)})
(the following,{(the following,0000010000079919)})
(call for midwife,{(call for midwife,0000010000014743)})
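For readers who want to check what the DISTINCT and GROUP steps compute, here is a rough Python equivalent on two of the input records. It is a sketch only: the variable names are hypothetical, the values are typed up front (mirroring the advice to declare types in the LOAD), and the repeated record is an assumption added so that the de-duplication has something to remove.

```python
from collections import defaultdict

# (id, keywords, score, time) records; the second entry is a deliberate duplicate.
records = [
    ("0000010000062069", "naruto 1", "56", 1425780386),
    ("0000010000062069", "naruto 1", "56", 1425780386),
    ("0000010000081650", "planes 2", "76", 1425721945),
]

# FOREACH ... GENERATE keywords, id followed by DISTINCT.
distinct_pairs = {(keywords, id_) for id_, keywords, _score, _time in records}

# GROUP ... BY keywords: each key maps to the list of (keywords, id) tuples.
grouped = defaultdict(list)
for keywords, id_ in sorted(distinct_pairs):
    grouped[keywords].append((keywords, id_))

print(dict(grouped))
```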