Is there a way to create a small constant relation (table) in Pig?
I need to create a relation with only one tuple that contains constant values.
Something along the lines of:
A = LOAD using ConstantLoader('{(1,2,3)}');
thanks, Ido
I'm not sure why you would need that, but here's an ugly solution:
A = LOAD 'some/sample/file' ;
B = FOREACH A GENERATE '' ;
C = LIMIT B 1 ;
Now you can use 'C' as the 'empty relation' that has one empty tuple.
DEFINE GenerateRelationFromString(string) RETURNS relation {
temp = LOAD 'somefile';
tempLimit1 = LIMIT temp 1;
$relation = FOREACH tempLimit1 GENERATE FLATTEN(TOKENIZE('$string', ','));
};
usage:
fourRows = GenerateRelationFromString('1,2,3,4');
myConstantRelation = FOREACH fourRows GENERATE (
CASE $0
WHEN '1' THEN (1, 'Ivan')
WHEN '2' THEN (2, 'Boris')
WHEN '3' THEN (3, 'Vladimir')
WHEN '4' THEN (4, 'Olga')
END
) as myTuple;
This for sure is hacky, and the right way, in my mind, would be to implement a StringLoader() that would work like this:
fourRows = LOAD '1,2,3,4' USING StringLoader(',');
The argument typically used for the file location would just be used as literal string input.
Fast answer: no.
I asked about it on the pig-dev mailing list.
I have a JSON column in my PostgreSQL database.
I need to search on the key localId.
I tried both of these queries:
SELECT *
FROM public.translations
where datas->>'localId' = 7;
and
SELECT *
FROM public.translations
where datas->>'localId'::text = '7';
Both return no results.
How can I do this, please?
When I run these queries I get no values:
SELECT datas->>'localId' as local
FROM public.translations
SELECT datas::json->>'localId' as local
FROM public.translations
Your screenshot is a bit hard to read, but it seems your JSON is in fact a JSON array, so you need to pick the first element from there:
where (datas -> 0 ->> 'localId')::int = 7
or a bit shorter:
where (datas #>> '{0,localId}')::int = 7
Alternatively, you can use the contains operator @> to check whether there is at least one element with localId = 7. The @> operator requires jsonb, not json, so you will need to cast your column:
where datas::jsonb @> '[{"localId": 7}]'
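As a rough illustration of why the top-level lookup comes back empty while the array-aware variants work, here is a small Python model of the three access patterns. The sample value and the helper names are hypothetical, and this only mimics the behaviour described above; it is not how PostgreSQL evaluates the operators.

```python
import json

# Hypothetical column value: a JSON array of objects, as the answer suspects.
datas = json.loads('[{"localId": 7, "text": "hello"}, {"localId": 8, "text": "bye"}]')


def top_level_key(value, key):
    """Look up a key on the top-level value; an array has no keys, so this yields None."""
    return value.get(key) if isinstance(value, dict) else None


def first_element_key(value, key):
    """Pick the first array element, then read the key from it."""
    if isinstance(value, list) and value and isinstance(value[0], dict):
        return value[0].get(key)
    return None


def any_element_has(value, key, wanted):
    """Containment-style check: at least one array element has key == wanted."""
    return isinstance(value, list) and any(
        isinstance(elem, dict) and elem.get(key) == wanted for elem in value
    )


print(top_level_key(datas, "localId"))       # nothing to find at the top level
print(first_element_key(datas, "localId"))   # value from the first element
print(any_element_has(datas, "localId", 8))  # matches the second element
```

The containment-style check is the only one of the three that also finds a match in elements other than the first.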
What you need is ::json right after datas:
SELECT *
FROM public.translations
where datas::json->>'localId' = '7';
See also the related question on Stack Overflow.
If you need more information, see the latest PostgreSQL documentation.
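You need to cast both values to the same type:
where datas->>'localId' = '7'::text;
or
where (datas->>'localId')::integer = 7;
Or add parentheses to your example. The :: cast binds more tightly than ->>, so without them the cast applies to the key 'localId' rather than to the extracted value:
where (datas->>'localId')::text = '7';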
I want to define an array of user IDs in Pig and then filter out data if the userId from the input is NOT in that array.
How do I do this in Pig Latin? Below is an example of what I intend to do.
Thanks
inputData = load '$INPUT' USING PigStorage('|') AS (useriD:chararray,controllerAction:chararray,url:chararray,browserName:chararray,IsMobile:chararray,exceptionDetails:chararray,renderTime:int,serviceHostId:int,auditEventTime:chararray);
filteredInput = filter inputData by controllerAction is not null and auditEventTime is not null and serviceHostId is not null and renderTime is not null and useriD in ('2be2df06-f4ba-4d87-8938-09d867d3f2fe','ac1ac6bf-d151-49fc-8c7c-2b52d2efbb58','f00aec16-36e5-46ae-b7cb-a0f1eeefe609','258890f9-102a-4f8e-a001-ae24d2e25269','cf221779-a077-472c-b377-cca4a9230e1b');
Thanks Murali. I tried the approach of declaring a variable and then using FLATTEN and STRSPLIT to join. However, I get the following error:
Syntax error, unexpected symbol at or near 'flatteneduserids'
%declare REQUIRED_USER_IDS 'xxxxx,yyyyy,sssss' ;
inputData = load '$INPUT' USING PigStorage('|') AS (useriD:chararray,controllerAction:chararray,url:chararray,browserName:chararray,IsMobile:chararray,exceptionDetails:chararray,renderTime:int,serviceHostId:int,auditEventTime:chararray);
filteredInput = filter inputData by controllerAction is not null and auditEventTime is not null and serviceHostId is not null and renderTime is not null;
flatteneduserids = FLATTEN(STRSPLIT('$REQUIRED_USER_IDS',',')) AS (uid:chararray);
useridfilter = JOIN filteredInput BY useriD, flatteneduserids BY uid USING 'replicated';
So now I tried another way of declaring flatteneduserids, which results in the error: Undefined alias: IN
flatteneduserids = FOREACH IN GENERATE FLATTEN(STRSPLIT('$REQUIRED_USER_IDS',',')) AS (uid:chararray);
I had a similar use case. I tried declaring the constant value with %define and referencing it inside the IN clause, but was not able to achieve the objective. (Refer: Declare a comma seperated string constant)
A thought worth contemplating:
If the values inside the IN clause are static, reference, or metadata-like, I would suggest declaring them in a static file. We can then read that file at run time and do an inner join with the input data to retrieve the matching records.
input_data = LOAD '$INPUT' USING PigStorage('|') AS (user_id:chararray ...);
static_data = LOAD ... AS (req_user_id:chararray);
required_data = JOIN input_data BY user_id, static_data BY req_user_id USING 'replicated';
required_data_fmt = -- project required fields.
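To make the idea concrete, the join against a small static list can be thought of as a lookup against an in-memory set. The following Python sketch models that with made-up sample rows; `replicated_join` and `anti_join` are hypothetical helper names, and the sketch simplifies a real join (it does not append the joined column or duplicate rows for repeated keys).

```python
def replicated_join(rows, key_index, small_keys):
    """Keep rows whose key appears in the small key list (inner-join semantics)."""
    lookup = set(small_keys)  # the small side is held in memory
    return [row for row in rows if row[key_index] in lookup]


def anti_join(rows, key_index, small_keys):
    """Keep rows whose key does NOT appear in the small key list."""
    lookup = set(small_keys)
    return [row for row in rows if row[key_index] not in lookup]


# Hypothetical sample data: (user_id, controller_action)
input_data = [("u1", "home"), ("u2", "search"), ("u3", "home")]
static_data = ["u1", "u3"]  # contents of the static user-id file

required_data = replicated_join(input_data, 0, static_data)
excluded_data = anti_join(input_data, 0, static_data)
print(required_data)
print(excluded_data)
```

The second helper corresponds to the "NOT in that array" variant from the question; the inner join in the answer corresponds to the first.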
I was not able to figure out how to do this in memory.
So, as per Murali's suggestion, I added the user IDs to a file, loaded the file, and then did a join. That worked as expected for mr
How would you implement the following query against JSONStore
In SQL format it is
select * from table where (A or B) and (C or D)
I'm assuming we would use an advancedFind operation with an array of QueryParts. However, in the samples I can see how you can use QueryParts to form an AND, but not how to form an OR query.
Any guidance appreciated.
Taking your example SQL and giving it values it would look like this:
select * from people where (name = 'carlos' or name = 'mike') AND (rank = 'king' or rank = 'pawn')
Which is the same as:
select * from people where (name = 'mike' AND rank = 'king') or (name = 'carlos' AND rank = 'pawn') or (name = 'carlos' AND rank = 'king') or (name = 'mike' and rank = 'pawn')
That can be expressed by JSONStore pseudocode like this:
var queryPart1 = WL.JSONStore.QueryPart()
.equal('name', 'mike') //and
.equal('rank', 'king');
//or
var queryPart2 = WL.JSONStore.QueryPart()
.equal('name', 'carlos') //and
.equal('rank', 'pawn');
//or
var queryPart3 = WL.JSONStore.QueryPart()
.equal('name', 'carlos') //and
.equal('rank', 'king');
//or
var queryPart4 = WL.JSONStore.QueryPart()
.equal('name', 'mike') //and
.equal('rank', 'pawn');
WL.JSONStore.get('people').advancedFind([queryPart1, queryPart2, queryPart3, queryPart4])
.then(...);
Everything inside a query part must match (i.e. it's like an and) and as long as one query part matches (i.e. it's like an or) results will be returned. Remember to work with top-level search fields.
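The rewrite from (A or B) and (C or D) into four and-groups is a Cartesian product of the alternatives. As a minimal sketch of that expansion, here it is in Python rather than against the JSONStore API; the data layout and function names are hypothetical and only model the matching rule described above.

```python
from itertools import product

# Each inner list is one OR-group of (field, value) alternatives;
# the outer list ANDs the groups together: (A or B) and (C or D).
or_groups = [
    [("name", "carlos"), ("name", "mike")],
    [("rank", "king"), ("rank", "pawn")],
]

# One AND-group per combination of alternatives (Cartesian product).
# Each AND-group plays the role of one query part.
and_groups = [list(combo) for combo in product(*or_groups)]


def matches_original(doc):
    """Evaluate (A or B) and (C or D) directly."""
    return all(any(doc.get(f) == v for f, v in group) for group in or_groups)


def matches_expanded(doc):
    """Evaluate the expanded form: some AND-group matches completely."""
    return any(all(doc.get(f) == v for f, v in group) for group in and_groups)


docs = [
    {"name": "mike", "rank": "king"},
    {"name": "carlos", "rank": "pawn"},
    {"name": "mike", "rank": "rook"},
    {"name": "dave", "rank": "king"},
]
for doc in docs:
    print(doc, matches_original(doc), matches_expanded(doc))
```

The number of and-groups is the product of the group sizes, so this expansion grows quickly when there are many alternatives.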
Sometimes these fairly complex searches are required, but more often than not I would recommend re-thinking the offline experience. I wrote about that here.
I am trying to convert the below Hive statement to Pig:
max(substr(case when url like 'http:%' then '' else url end,1,50))
My pig statement for the above is:
url_group = GROUP data by (uid);
max_substr_url= FOREACH url_group generate SUBSTRING(MAX(((Coalesce(data.url) matches '.*http:%.*') ? '' : Coalesce(data.url))), 0, 49);
For some of the data, the url can be null. So I have written a pig UDF called Coalesce(String) which returns an empty string if the data is either null or empty. If the data is not null or not empty it returns the string back.
The above Pig statement is giving me a lot of trouble. I have tried many different options, but nothing worked. Does anyone have any ideas on how to implement this? Please help me.
Thanks in advance
You are going to want to use a nested FOREACH so that you can do the substring transformation on each tuple in the data bag then take the MAX of the transformed bag.
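A = GROUP data BY (uid);
B = FOREACH A {
    -- MAX needs a one-column bag
    -- 'http:.*' mirrors Hive's LIKE 'http:%' (matches must cover the whole string)
    -- SUBSTRING's stop index is exclusive, so 0, 50 keeps the first 50 characters
    transformed = FOREACH data
        GENERATE SUBSTRING((Coalesce(url) matches 'http:.*' ? '' : Coalesce(url)), 0, 50);
    GENERATE group AS uid, MAX(transformed);
}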
The following pig latin script:
data = load 'access_log_Jul95' using PigStorage(' ') as (ip:chararray, dash1:chararray, dash2:chararray, date:chararray, date1:chararray, getRequset:chararray, location:chararray, http:chararray, code:int, size:int);
splitDate = foreach data generate size as size:int , ip as ip, FLATTEN(STRSPLIT(date, ':')) as h;
groupedIp = group splitDate by h.$1;
a = foreach groupedIp{
added = foreach splitDate generate SUM(size); --
generate added;
};
describe a;
gives me the error:
ERROR 1045:
<file 3.pig, line 10, column 39> Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast.
This error makes me think I need to cast size as an int, but if I describe my groupedIp field, I get the following schema:
groupedIp: {group: bytearray,splitDate: {(size: int,ip: chararray,h: bytearray)}}
This indicates that size is an int and should be usable by the SUM function.
Am I calling the sum function incorrectly? Let me know if you would like to see any thing else, such as the input file.
SUM operates on a bag as input, but you pass it the field 'size'.
Try to eliminate the nested foreach and use:
a = foreach groupedIp generate SUM(splitDate.size);
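To picture what "SUM operates on a bag" means, here is a small Python model with made-up rows: grouping produces one bag of rows per key, and the sum is taken over the size column projected out of each bag. The sample data and variable names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows shaped like splitDate: (size, ip, hour)
rows = [
    (100, "10.0.0.1", "00"),
    (250, "10.0.0.2", "00"),
    (40, "10.0.0.1", "01"),
]

# GROUP ... BY hour: each key maps to a "bag" holding the original rows.
grouped = defaultdict(list)
for row in rows:
    grouped[row[2]].append(row)

# SUM(splitDate.size): project the size column out of each bag, then sum it.
totals = {hour: sum(size for size, _ip, _hour in bag) for hour, bag in grouped.items()}
print(totals)
```

Summing inside a per-row loop, as the nested foreach tried to do, would hand the function a single number rather than a bag of numbers.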
Do some dumps of your data. I'll bet some of the stuff in the size column is non-integer, and Pig runs into that and dies. You could also code up your own isInteger udf to check this before the rest of your processing, and throw out any that aren't integers.
SUM, AVG and COUNT are functions that always work on a bag, therefore group the data and then join with the original set like below:
A = load 'nyse_data.txt' as (exchange:chararray, symbol:chararray, date:chararray, open:float, high:float, low:float, close:float, volume:int, adj_close:float);
G = group A by symbol;
C = foreach G generate group, SUM(A.open);