PIG: How to print only certain attributes from a Grouped Bag - apache-pig

I have a grouped result which looks exactly like this:
| grouped | group:chararray | log:bag{:tuple(driverId:chararray,truckId:chararray,eventTime:chararray,eventType:chararray,longitude:chararray,latitude:chararray,eventKey:chararray,CorrelationId:chararray,driverName:chararray,routeId:chararray,routeName:chararray,eventDate:chararray)}
When I perform the following:
x = FOREACH grouped GENERATE {log.driverId, log.truckId, log.driverName};
illustrate x;
The output I am getting is:
| x | :bag{:tuple(:bag{:tuple(driverId:chararray)})} |
------------------------------------------------------------------------------------
| | {({(11), (11)}), ({(74), (39)}), ({(Jamie Engesser), (Jamie Engesser)})} |
------------------------------------------------------------------------------------
Whereas my expectation is:
{({(11, 74, Jamie Engesser), (11, 39, Jamie Engesser)})}

Got the solution
Since group was a tuple and the adjacent result was a bag, I had to use a nested FOREACH like below:
x = FOREACH grouped {
    val1 = group;
    vals = FOREACH log GENERATE driverId, truckId, driverName;
    GENERATE val1, vals;
};
So this selected only the required attributes from the given result.
Please comment if someone knows a better/optimal/easier way of doing it.
Thanks
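For what it's worth, a possibly simpler alternative (a sketch, not verified against this data): Pig can project several fields of a bag in a single expression, which avoids the nested FOREACH:
x = FOREACH grouped GENERATE group, log.(driverId, truckId, driverName);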

Related

Athena query get the index of any element in a list

I need to access the elements in a column whose type is list, based on the locations of other elements in another list-like column. Say my dataset is like:
WITH dataset AS (
SELECT ARRAY ['hello', 'amazon', 'athena'] AS words,
ARRAY ['john', 'tom', 'dave'] AS names
)
SELECT * FROM dataset
And I want to achieve:
SELECT element_at(words, index(names, 'john')) AS john_word
FROM dataset
Is there a way to have a function in Athena like "index"? Or how can I write one like this? The desired result should look like:
| john_word |
| --------- |
| hello     |
array_position:
array_position(x, element) → bigint
Returns the position of the first occurrence of the element in array x (or 0 if not found).
Note that in Presto array indexes start from 1.
SELECT element_at(words, array_position(names, 'john')) AS john_word
FROM dataset
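Note that if the name is not found, array_position returns 0, and element_at will reject an index of 0. One way to guard against that (a sketch using Presto's if function):
SELECT if(array_position(names, 'john') > 0,
          element_at(words, array_position(names, 'john'))) AS john_word
FROM dataset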

workspace does not return the correct value for me

I have a problem with some code.
If I write Recenzes select: [:a | a komponenta nazev = 'Hitachi P21'] I get the right records. But if I use something like this:
| brzdy |
brzdy := ((((Sekces select: [:b | b nazev = 'Brzdy'])
    collect: [:b | b komponenty]) flatten)
    select: [:c | c vyrobce nazev = 'Hitachi'])
    collect: [:d | d nazev].
I can get 'Hitachi P21' with the ^ command. But if I use the variable 'brzdy' here: Recenzes select: [:a | a komponenta nazev = brzdy] I won't get anything.
In a nutshell: I want to show 'Recenzes' for 'Komponenty' which are in 'Sekces' with value 'Brzdy'; they are saved in the column 'Komponenty' (a Set) on 'Recenzes' and 'Sekces'.
Does anyone know why?
Since brzdy is the result of a #collect: message, it is a collection of strings, not a single string. Therefore no element a would satisfy the condition a komponenta nazev = brzdy, because you would be comparing objects of different classes. Try something along the lines of
Recenzes select: [:a | brzdy includes: a komponenta nazev]
As a side note, remember that you may eliminate some parentheses by using select:thenCollect: rather than a (select: ...) collect: ... chain. For instance
brzdy := (Sekces select: [:b | b nazev = 'Brzdy'] thenCollect: [:b | b komponenty]) flatten
select: [:c | c vyrobce nazev = 'Hitachi']
thenCollect: [:d | d nazev]
(I'm not familiar with the #flatten message, so I can't tell whether it is necessary or superfluous).

How to loop through JSON array of JSON objects to see if it contains a value that I am looking for in postgres?

Here is an example of the JSON array of objects:
rawJSON = [
{"a":0, "b":7},
{"a":1, "b":8},
{"a":2, "b":9}
]
And I have a table that essentially looks like this.
demo Table
id | ...(other columns) | rawJSON
------------------------------------
0 | ...(other columns info) | [{"a":0, "b":7},{"a":1, "b":8}, {"a":2, "b":9}]
1 | ...(other columns info) | [{"a":0, "b":17},{"a":11, "b":5}, {"a":12, "b":5}]
What I want is to return the rows where rawJSON contains an object whose "a" value is less than 2 AND whose "b" value is less than 8. THEY MUST BE FROM THE SAME JSON OBJECT.
Essentially the query would look something like this:
SELECT *
FROM demo
WHERE FOR ANY JSON OBJECT in rawJSON column -> "a" < 2 AND -> "b" < 8
And therefore it will return
id | ...(other columns) | rawJSON
------------------------------------
0 | ...(other columns info) | [{"a":0, "b":7},{"a":1, "b":8}, {"a":2, "b":9}]
I have searched several posts here but was not able to figure it out:
https://dba.stackexchange.com/questions/229069/extract-json-array-of-numbers-from-json-array-of-objects
https://dba.stackexchange.com/questions/54283/how-to-turn-json-array-into-postgres-array
I was thinking of creating a PL/pgSQL function but wasn't able to figure it out.
I would greatly appreciate any advice!
Thank you!!
I would like to avoid CROSS JOIN LATERAL because it will slow things down a lot.
You can use a subquery that searches through the array elements together with EXISTS.
SELECT *
FROM demo d
WHERE EXISTS (SELECT *
              FROM jsonb_array_elements(d.rawjson) a(e)
              WHERE (a.e->>'a')::integer < 2
                AND (a.e->>'b')::integer < 8);
db<>fiddle
If the datatype for rawjson is json rather than jsonb, use json_array_elements() instead of jsonb_array_elements().
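On PostgreSQL 12 or later, a JSON path predicate can express the same check in one call (a sketch, assuming the column is jsonb):
SELECT *
FROM demo
WHERE jsonb_path_exists(rawjson, '$[*] ? (@.a < 2 && @.b < 8)');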

KQL: mv-expand OR bag_unpack equivalent command to convert a list to multiple columns

According to mv-expand documentation:
Expands multi-value array or property bag.
mv-expand is applied on a dynamic-typed column so that each value in the collection gets a separate row. All the other columns in an expanded row are duplicated.
Just as the mv-expand operator creates a row for each element in the list, is there an equivalent operator/way to make each element in a list an additional column?
I checked the documentation and found bag_unpack:
The bag_unpack plugin unpacks a single column of type dynamic by treating each property bag top-level slot as a column.
However, it doesn't seem to work on a list; rather, it works on top-level JSON properties.
Using bag_unpack (like the below query):
datatable(d:dynamic)
[
dynamic({"Name": "John", "Age":20}),
dynamic({"Name": "Dave", "Age":40}),
dynamic({"Name": "Smitha", "Age":30}),
]
| evaluate bag_unpack(d)
It will do the following:
Name Age
John 20
Dave 40
Smitha 30
Is there a command/way (see some_command_which_helps) with which I can achieve the following (convert a list to columns)?
datatable(d:dynamic)
[
dynamic(["John", "Dave"])
]
| evaluate some_command_which_helps(d)
That translates to something like:
Col1 Col2
John Dave
Is there an equivalent where I can convert a list/array to multiple columns?
For reference: We can run the above queries online on Log Analytics in the demo section if needed (however, it may require login).
You could try something along the following lines
(that said, from an efficiency standpoint, you may want to check your options of restructuring the data set to begin with, using a schema that matches how you plan to actually consume/query it)
datatable(d:dynamic)
[
dynamic(["John", "Dave"]),
dynamic(["Janice", "Helen", "Amber"]),
dynamic(["Jane"]),
dynamic(["Jake", "Abraham", "Gunther", "Gabriel"]),
]
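// rand() gives each row a (practically) unique key, so that after mv-expand
// the expanded values can be regrouped into a single row by summarize ... by r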
| extend r = rand()
| mv-expand with_itemindex = i d
| summarize b = make_bag(pack(strcat("Col", i + 1), d)) by r
| project-away r
| evaluate bag_unpack(b)
which will output:
|Col1 |Col2 |Col3 |Col4 |
|------|-------|-------|-------|
|John |Dave | | |
|Janice|Helen |Amber | |
|Jane | | | |
|Jake |Abraham|Gunther|Gabriel|
To extract key-value pairs from text and convert them to columns without hardcoding the key names in the query:
print message="2020-10-15T15:47:09 Metrics: duration=2280, function=WorkerFunction, count=0, operation=copy_into, invocationId=e562f012-a994-4fc9-b585-436f5b2489de, tid=lct_b62e6k59_prd_02, table=SALES_ORDER_SCHEDULE, status=success"
| extend Properties = extract_all(#"(?P<key>\w+)=(?P<value>[^, ]*),?", dynamic(["key","value"]), message)
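// mv-apply folds each extracted [key, value] pair into one property bag per row;
// bag_unpack then expands the bag's keys into separate columns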
| mv-apply Properties on (summarize make_bag(pack(tostring(Properties[0]), Properties[1])))
| evaluate bag_unpack(bag_)
| project-away message

In Postgres: Select columns from a set of arrays of columns and check a condition on all of them

I have a table with an id column and 0/1 indicator columns x1, x2, x3, x4, y1, y2, y3, y4.
I want to perform a count over different sets of columns (all subsets where there is at least one element from X and one element from Y). How can I do that in Postgres?
For example, I may have {x1,x2,y3}, {x4,y1,y2,y3}, etc. I want to count the number of "id"s having 1 in every column of a set. So for the first set:
SELECT COUNT(id) FROM table WHERE x1=1 AND x2=1 AND y3=1;
and similarly for the second set:
SELECT COUNT(id) FROM table WHERE x4=1 AND y1=1 AND y2=1 AND y3=1;
Is it possible to write a loop that goes over all these sets and queries the table accordingly? The array will have more than 10000 sets, so it cannot be done manually.
You should be able to convert the table columns to an array using ARRAY[col1, col2, ...], then use the array_positions function, setting the second parameter to the value you're checking for. So, given your example above, this query:
SELECT id, array_positions(array[x1,x2,x3,x4,y1,y2,y3,y4], 1)
FROM tbl
ORDER BY id;
will yield this result:
+----+-------------------+
| id | array_positions |
+----+-------------------+
| a | {1,4,5} |
| b | {1,2,4,7} |
| c | {1,2,3,4,6,7,8} |
+----+-------------------+
Here's a SQL Fiddle.
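To then count ids per set without hand-writing thousands of queries, one option (a sketch, assuming the sets are stored as arrays of column positions in a hypothetical table sets(set_id, positions)) is to join on the array containment operator @>:
SELECT s.set_id, COUNT(t.id) AS matching_ids
FROM sets s
LEFT JOIN (SELECT id,
                  array_positions(array[x1,x2,x3,x4,y1,y2,y3,y4], 1) AS pos
           FROM tbl) t
       ON t.pos @> s.positions
GROUP BY s.set_id;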