Filter a null column out of tuple in pig - apache-pig

I am working on a use case where we have to eliminate the nulls from a tuple:
A =
(7,Ron,ron#abc.com)
(8,,rina#xyz.com )
(9,Don,)
(9,Don,dmes#xyz.com)
(10,Maya,maya#cnn.com)
B = FILTER A BY col2 != '';
Output :-
(7,Ron,ron#abc.com)
(9,Don,dmes#xyz.com)
(10,Maya,maya#cnn.com)
Here the FILTER operator removes the entire second row, but we want to remove only the null column, not the row.
The expected output should be something like:
(7,Ron,ron#abc.com)
(8,rina#xyz.com)
(9,Don,dmes#xyz.com)
(9,Don)
(10,Maya,maya#cnn.com)

We can split the relation into sub-relations, project the desired columns, and then UNION the results back together.
Assuming the first column is never null, and the second and third columns are nullable but at least one of them is always present:
SPLIT A INTO col1null IF $1 is null, col2null IF $2 is null, allnotnull IF ($1 is not null AND $2 is not null);
col1reject = FOREACH col1null GENERATE $0, $2; -- remove column $1
col2reject = FOREACH col2null GENERATE $0, $1; -- remove column $2
OUT = UNION allnotnull, col1reject, col2reject;
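The per-row effect of the SPLIT/project/UNION above can be sanity-checked with a plain-Python sketch (data taken from the example; `None` and the empty string stand in for Pig nulls):

```python
# Plain-Python sketch of the SPLIT/project/UNION idea above:
# drop any null (None or empty) field from each tuple.
rows = [
    (7, "Ron", "ron#abc.com"),
    (8, None, "rina#xyz.com"),
    (9, "Don", None),
    (9, "Don", "dmes#xyz.com"),
    (10, "Maya", "maya#cnn.com"),
]
cleaned = [tuple(f for f in row if f not in (None, "")) for row in rows]
for row in cleaned:
    print(row)
```

This reproduces the expected output, including the shorter tuples `(8,rina#xyz.com)` and `(9,Don)`.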

Related

How to filter on enum and include all rows if no filter value provided

I'm working on a project resource management application and my resource table has several fields, one of which is an enum as below:
CREATE TYPE "clearance" AS ENUM (
'None',
'Baseline',
'NV1',
'NV2',
'TSPV'
);
Then, my resource table includes that enum:
CREATE TABLE "resource" (
"employee_id" integer PRIMARY KEY,
"name" varchar NOT NULL,
"email" varchar NOT NULL,
"job_title_id" integer NOT NULL,
"manager_id" integer NOT NULL,
"workgroup_id" integer NOT NULL,
"clearance_level" clearance,
"specialties" text[],
"certifications" text[],
"active" boolean DEFAULT 't'
);
When querying the data, I want to be able to provide query string parameters in the URL that then apply filters to the database query.
For example (using a local dev machine):
curl localhost:6543/v1/resources # returns all resources in a paginated query
curl localhost:6543/v1/resources?specialties=NSX # returns all resources with NSX as a specialty
curl localhost:6543/v1/resources?manager=John+Smith # returns resources that report to John Smith
curl localhost:6543/v1/resources?jobTitle=Senior+Consultant # returns all Senior Consultants
etc.
Where I'm running into an issue though is that I also want to be able to filter on the security clearance level like this:
curl localhost:6543/v1/resources?clearance=NV2
When I provide a clearance filter I can get the query to work fine:
query := fmt.Sprintf(`
SELECT count(*) OVER(), r.employee_id, r.name, r.email, job_title.title, m.name AS manager, workgroup.workgroup_name, r.clearance_level, r.specialties, r.certifications, r.active
FROM (((resource r
INNER JOIN job_title ON r.job_title_id=job_title.title_id)
INNER JOIN resource m ON r.manager_id=m.employee_id)
INNER JOIN workgroup ON workgroup.workgroup_id=r.workgroup_id)
WHERE (workgroup.workgroup_name = ANY($1) OR $1 = '{}')
AND (r.clearance_level = $2::clearance)
AND (r.specialties #> $3 OR $3 = '{}')
AND (r.certifications #> $4 OR $4 = '{}')
AND (m.name = $5 OR $5 = '')
AND (r.active = $6)
AND (r.name = $7 OR $7 = '')
ORDER BY %s %s, r.employee_id ASC
LIMIT $8 OFFSET $9`, fmt.Sprintf("r.%s", filters.sortColumn()), filters.sortDirection())
However, I can't figure out a reasonable way to implement the filtering so that all results are returned when no clearance filter is provided.
The poor workaround I have is to apply an empty-string filter on another field when no clearance is filtered for, and to substitute in the correct filter when a clearance argument is provided.
It works, but it smells really bad:
func (m *ResourceModel) GetAll(name string, workgroups []string, clearance string, specialties []string,
certifications []string, manager string, active bool, filters Filters) ([]*Resource, Metadata, error) {
// THIS IS A SMELL
// Needed to provide a blank filter parameter if all clearance levels should be returned.
// Have not found a good way to filter on enums to include all values when no filter argument is provided
var clearance_filter = `AND (r.name = $2 OR $2 = '')`
if clearance != "" {
clearance_filter = `AND (r.clearance_level = $2::clearance)`
}
query := fmt.Sprintf(`
SELECT count(*) OVER(), r.employee_id, r.name, r.email, job_title.title, m.name AS manager, workgroup.workgroup_name, r.clearance_level, r.specialties, r.certifications, r.active
FROM (((resource r
INNER JOIN job_title ON r.job_title_id=job_title.title_id)
INNER JOIN resource m ON r.manager_id=m.employee_id)
INNER JOIN workgroup ON workgroup.workgroup_id=r.workgroup_id)
WHERE (workgroup.workgroup_name = ANY($1) OR $1 = '{}')
%s
AND (r.specialties #> $3 OR $3 = '{}')
AND (r.certifications #> $4 OR $4 = '{}')
AND (m.name = $5 OR $5 = '')
AND (r.active = $6)
AND (r.name = $7 OR $7 = '')
ORDER BY %s %s, r.employee_id ASC
LIMIT $8 OFFSET $9`, clearance_filter, fmt.Sprintf("r.%s", filters.sortColumn()), filters.sortDirection())
...
...
}
Is there a better way to approach this?
It feels like a really poor solution, to the point that I'm thinking of dropping the enum and making it another table that just establishes a domain of values:
CREATE TABLE clearance (
"level" varchar NOT NULL
);
For anyone that needs this very niche use case in the future, the answer was built on the initial hint from @mkopriva.
The approach was to cast the clearance_level to text, so the filter is:
...
AND(r.clearance_level::text = $2 OR $2 = '')
...
This returns all results, regardless of clearance, when no clearance filter is provided, and returns only the results that match the provided clearance_level when a filter is provided.
Much appreciated, @mkopriva, for the assistance.
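The resulting optional-filter pattern can be shown with a runnable sketch. Here sqlite3 stands in for Postgres and the table is trimmed to two columns purely for illustration; in the real query the `::text` cast is what lets the enum column compare against a plain text parameter:

```python
import sqlite3

# Trimmed-down sketch of the "empty parameter disables the filter" pattern.
# sqlite3 stands in for Postgres; in Postgres the column is the clearance
# enum, hence the r.clearance_level::text cast in the real query.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE resource (name TEXT, clearance_level TEXT)")
con.executemany("INSERT INTO resource VALUES (?, ?)",
                [("Ann", "NV1"), ("Bob", "NV2"), ("Cat", None)])

def get_resources(clearance=""):
    # When clearance is '', the second disjunct is true for every row,
    # so the filter is effectively skipped.
    return con.execute(
        "SELECT name FROM resource"
        " WHERE (clearance_level = ? OR ? = '')",
        (clearance, clearance),
    ).fetchall()

print(len(get_resources()))    # no filter: all rows come back
print(get_resources("NV2"))    # filter applied: only the NV2 row
```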

SQLITE printf only for non-null SELECT values

value    | NameOfField
---------+------------
1        | SlNo
10/3/21  | ManfDate
2        | SlNo
NULL     | ManfDate
3        | SlNo
11/3/21  | ManfDate
My current SQLite query structure:
SELECT printf("Item_No: %s Manufacturing_Date:%s ", (SELECT val FROM Orders WHERE name='Slno'), (SELECT val FROM Orders WHERE name='ManfDate'))
This gives output-
Item_No:1 Manufacturing_Date:10/3/21
**Item_No:2 Manufacturing_Date:**
Item_no:3 Manufacturing_Date:11/3/21
I want to modify the SQLite printf statement so that the field name isn't printed when the corresponding value is null. The output should look like:
Item_No:1 Manufacturing_Date:10/3/21
**Item_No:2**
Item_no:3 Manufacturing_Date:11/3/21
I have been trying this using IFNULL, IS NOT NULL, EXISTS, CASE, etc., but nothing seems to work for me. I will be thankful for any help.
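One approach that does work is to build the string with `||` and a `CASE` expression, so the whole ` Manufacturing_Date:` label disappears when the value is NULL. A runnable sketch follows; note the `item` column pairing each SlNo with its ManfDate is an assumption, since the original table only shows name/value pairs:

```python
import sqlite3

# Sketch of the CASE approach. The "item" column is assumed here just to
# pair each SlNo row with its ManfDate row; the original table only shows
# name/val pairs.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Orders (item INTEGER, name TEXT, val TEXT)")
con.executemany("INSERT INTO Orders VALUES (?, ?, ?)", [
    (1, "SlNo", "1"), (1, "ManfDate", "10/3/21"),
    (2, "SlNo", "2"), (2, "ManfDate", None),
    (3, "SlNo", "3"), (3, "ManfDate", "11/3/21"),
])
# CASE emits an empty string instead of the label when the value is NULL.
rows = con.execute("""
    SELECT printf('Item_No:%s', s.val)
           || CASE WHEN m.val IS NULL THEN ''
                   ELSE printf(' Manufacturing_Date:%s', m.val)
              END
    FROM Orders s
    JOIN Orders m ON m.item = s.item AND m.name = 'ManfDate'
    WHERE s.name = 'SlNo'
    ORDER BY s.item
""").fetchall()
for (line,) in rows:
    print(line)
```

This prints `Item_No:2` on its own for the row whose ManfDate is NULL, matching the desired output.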

PostgreSQL checking for empty array changes behavior in Golang

I'm trying to implement the behavior of selecting data based on an array of inputs, or getting all data if the array is null or empty.
SELECT * FROM table_name
WHERE
('{}' = $1 OR col = ANY($1))
This returns the error pq: op ANY/ALL (array) requires array on right side.
If I run
SELECT * FROM table_name
WHERE
(col = ANY($1))
This works just fine and I get the contents I expected.
I can also use array_length, but it requires me to assert what type of data is in $1. If I do (array_length($1::string[],1) < 1 OR col = ANY($1)), the array_length check always seems to return false, and the query falls through to col = ANY($1).
How can I return either JUST the values from $1 OR all if $1 is '{}' or NULL?
Got it:
($1::string[] IS NULL OR event_id = ANY($1))

Exclude "grouped" data from query

I have a table that looks like this (simplified):
CREATE TABLE IF NOT EXISTS records (
    user_id uuid NOT NULL,
    ts timestamptz NOT NULL,
    op_type text NOT NULL,
    PRIMARY KEY (user_id, ts, op_type)
);
I cannot for practical purposes change the PRIMARY KEY.
I'm trying to write a query that gets all records for a given user_id where a record's (ts, op_type) pair doesn't match any entry in an array of exclusions.
I'm not exactly sure of the right postgres terminology so let me see if this example makes my constraint clearer:
This array looks something like this (in JavaScript):
var excludes = [
[DATE1, 'OP1'],
[DATE2, 'OP2']
]
If, for a given user id, there are rows that look like this in the database:
ts | op_type
----------------------------+-------------
DATE1 | OP1
DATE2 | OP2
DATE1 | OP3
DATE2 | OP1
OTHER DATE | OP1
OTHER DATE | OP2
Then, with the excludes from above, I'd like to run a query that returns everything EXCEPT the first two rows, since they match exactly.
My attempt was to do this:
client.query(`
SELECT * FROM records
WHERE
user_id = $1
AND (ts, op_type) NOT IN ($2)
`, [userId, excluding])
But I get the error "input of anonymous composite types is not implemented". I'm not sure how to properly type excluding, or whether this is even the right way to do this.
The query may look like this
SELECT *
FROM records
WHERE user_id = 'a0eebc999c0b4ef8bb6d6bb9bd380a11'
AND (ts, op_type) NOT IN (('2016-01-01', 'OP1'), ('2016-01-02', 'OP2'));
so if you want to pass the conditions as a single parameter then excluding should be a string in the format:
('2016-01-01', 'OP1'), ('2016-01-02', 'OP2')
It seems that there is no simple way to pass the condition string into query() as a parameter. You can try to write a function to get the string in the correct format (I'm not a JS developer but this piece of code seems to work well):
excluding = function(exc) {
    var s = '(';
    for (var i = 0; i < exc.length; i++)
        s = s + '(\'' + exc[i][0] + '\',\'' + exc[i][1] + '\'),';
    return s.slice(0, -1) + ')';
};
var excludes = [
['2016-01-01', 'OP1'],
['2016-01-02', 'OP2']
];
// ...
client.query(
'SELECT * FROM records '+
'WHERE user_id = $1 '+
'AND (ts, op_type) NOT IN ' + excluding(excludes),
[userId])
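An alternative to splicing the pairs into the SQL string is to generate one `(?, ?)` placeholder per exclusion pair and pass the values as query parameters. A runnable sketch of that idea (sqlite3 stands in here; SQLite 3.15+ accepts the same row-value `NOT IN (VALUES ...)` shape, and the table is trimmed to the relevant columns):

```python
import sqlite3

# Sketch: one parameterized (?, ?) per exclusion pair, so no values are
# ever spliced into the SQL string. sqlite3 (3.15+) supports the same
# row-value NOT IN (VALUES ...) shape used here.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE records (user_id TEXT, ts TEXT, op_type TEXT)")
con.executemany("INSERT INTO records VALUES (?, ?, ?)", [
    ("u1", "2016-01-01", "OP1"),
    ("u1", "2016-01-02", "OP2"),
    ("u1", "2016-01-01", "OP3"),
    ("u1", "2016-01-03", "OP1"),
])

excludes = [("2016-01-01", "OP1"), ("2016-01-02", "OP2")]
placeholders = ", ".join("(?, ?)" for _ in excludes)
params = ["u1"] + [v for pair in excludes for v in pair]
rows = con.execute(
    "SELECT ts, op_type FROM records"
    " WHERE user_id = ?"
    f" AND (ts, op_type) NOT IN (VALUES {placeholders})",
    params,
).fetchall()
```

Only the placeholder text is built dynamically; every actual value travels as a bound parameter.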

Pig Latin - Extracting fields meeting two different filter criteria from chararray line and grouping in a bag

I am new to Pig Latin.
I want to extract all lines that match a filter criterion (contain the word "line_token") from log files, and then from these matching lines extract two different fields, each meeting its own match criterion. Since the lines aren't structured well, I am loading them as a chararray.
When I try to run the following code - I get an error
"Invalid resource schema: bag schema must have tuple as its field"
I have tried to perform an explicit cast to a tuple but that does not work
input_lines = LOAD '/inputdir/' AS ( line:chararray);
filtered_lines = FILTER input_lines BY (line MATCHES '.*line_token1.*' );
tokenized_lines = FOREACH filtered_lines GENERATE FLATTEN(TOKENIZE(line)) AS tok_line;
my_wordbag = FOREACH tokenized_lines {
word1 = FILTER tok_line BY ( $0 MATCHES '.*word_token1.*' ) ;
word2 = FILTER tok_line BY ( $0 MATCHES '.*word_token2.*' ) ;
GENERATE word1 , word2 as my_tuple ;
-- I also tried --> GENERATE (word1 , word2) as my_tuple ;
}
dump my_wordbag;
I suppose I am taking a very wrong approach.
Please note - my logs aren't structured well, so I can't mend the way I load them.
After loading and the initial filtering for lines of interest (which is straightforward), I guess I need to do something different rather than tokenizing the line and iterating through fields trying to find the ones I want.
Or maybe I should use joins ?
Also if I know the structure of line beforehand well as all text fields, then will loading it differently ( not as a chararray) make it an easier problem ?
For now I made a compromise - I added an extra filter clause to my original line filter and settled for picking just one field from the line. When I get back to it I will try joins and post that code. Here's my working code that gets me a useful output - but not all that I want.
-- read input lines from poorly structured log
input_lines = LOAD '/log-in-dir-in-hdfs' AS ( line:chararray) ;
-- Filter for line filter criteria and date interested in passed as arg
filtered_lines = FILTER input_lines BY (
( line MATCHES '.*line_filter1.*' )
AND ( line MATCHES '.*line_filter2.*' )
AND ( line MATCHES '.*$forDate.*' )
) ;
-- Tokenize every line
tok_lines = FOREACH filtered_lines
GENERATE TOKENIZE(line) AS tok_line;
-- Pick up specific field frm tokenized line based on column filter criteria
fnames = FOREACH tok_lines {
fname = FILTER tok_line BY ( $0 MATCHES '.*field_selection.*' ) ;
GENERATE FLATTEN(fname) as nnfname;
}
-- Count occurrences of that field and store it with the field name
-- My original intent is to store another field name as well
-- I will do that once I figure how to put both of them in a tuple
flgroup = FOREACH fnames
GENERATE FLATTEN(TOKENIZE((chararray)$0)) as cfname;
grpfnames = group flgroup by cfname;
readcounts = FOREACH grpfnames GENERATE COUNT(flgroup), group ;
STORE readcounts INTO '/out-dir-in-hdfs';
As I understand it, after the FLATTEN operation you have a single line (tok_line) in each row, and you want to extract 2 words from each line. REGEX_EXTRACT will help you achieve this. I'm not a REGEX expert, so I'll leave writing the REGEX part up to you.
data = FOREACH tokenized_lines
GENERATE
REGEX_EXTRACT(tok_line, <first word regex goes here>) as firstWord,
REGEX_EXTRACT(tok_line, <second word regex goes here>) as secondWord;
I hope this helps.
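For a sense of what REGEX_EXTRACT does, here is a plain-Python analogue; the sample line and the two patterns below are made up purely for illustration (the real patterns depend on your log format):

```python
import re

# Illustrative analogue of Pig's REGEX_EXTRACT: return capture group 1 of
# the first match, or None when the pattern doesn't match. The line and
# patterns below are made up for the demo.
def regex_extract(text, pattern):
    m = re.search(pattern, text)
    return m.group(1) if m else None

line = "ts=123 word_token1=alpha word_token2=beta"
first_word = regex_extract(line, r"word_token1=(\w+)")
second_word = regex_extract(line, r"word_token2=(\w+)")
missing = regex_extract(line, r"word_token3=(\w+)")
```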
You must refer to the alias, not the column.
So:
word1 = FILTER tokenized_lines BY ( $0 MATCHES '.*word_token1.*' ) ;
word1 and word2 are going to be aliases as well, not columns.
How do you need the output to look?