Apache Pig Group by and Filter if multiple values exist?

I am trying to group multiple rows with the same ID, and then check for each group whether it contains both of two values. For example:
(10461 , 55 )
(10435 , 17 )
(10435 , 11 )
(10435 , 72 )
(10437 , 11 )
(10830 , 72 )
After I group it via: groupedData = group dataPoints by data_id;
I get:
(10461 ,{(10461 , 55)})
(10435 ,{(10435 , 17),(10435 , 11),(10435 , 72)})
I want to filter and get the value of 10435 if it contains 17 and 11.

You can use a nested FOREACH to filter the bags, and then check for empty bags. Note: I'm not sure what the field holding the numbers (55, 17, 11, etc.) is called, so it appears as value in the code below; replace as needed!
filteredBags = FOREACH groupedData {
    seventeen = FILTER dataPoints BY value == 17;
    eleven = FILTER dataPoints BY value == 11;
    GENERATE
        group AS data_id,
        seventeen,
        eleven;
}
nonNullBags = FILTER filteredBags BY NOT IsEmpty(seventeen) AND NOT IsEmpty(eleven);
finalIds = FOREACH nonNullBags GENERATE data_id;
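On the sample input, only group 10435 survives both IsEmpty checks, since it is the only data_id whose bag contains both 17 and 11. Assuming the field really is named value, DUMP finalIds; should print:
(10435)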

Related

How to update each column one at time for each row in snowflake

Suppose I have 10 columns in my table and I want to update each column, but one at a time for each row, up to 10 rows.
If the table is like
1,2,3
4,5,6
7,8,9
I want to update it like
x,2,3
4,y,6
7,8,z
Columns can be of any count, so a dynamic approach is needed; sometimes some columns also need to be excluded.
I tried to see if I could update a row based on a row id, but no such option as a row id is available. I don't want to change the design of the table to include a counter column.
You can use a window function to assign a row id and update based on that:
-- Snowflake does not allow a CTE on UPDATE, so compute the row numbers
-- in the FROM clause; this assumes a stable key column "id" to order by
-- and join on
update tablename
set col1 = case when cte.rn = 1 then <updatevalue> else col1 end
  , col2 = case when cte.rn = 2 then <updatevalue> else col2 end
  , col3 = case when cte.rn = 3 then <updatevalue> else col3 end
  , ...
from (
    select id, row_number() over (order by id) as rn
    from tablename
) cte
where cte.id = tablename.id;
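A self-contained sketch on the 3x3 sample (the id key column and the col1..col3 names are assumptions for illustration; 0 stands in for the x/y/z placeholders since the columns are numeric):
CREATE OR REPLACE TABLE tablename (id INT, col1 INT, col2 INT, col3 INT);
INSERT INTO tablename VALUES (1, 1, 2, 3), (2, 4, 5, 6), (3, 7, 8, 9);
UPDATE tablename
SET col1 = CASE WHEN cte.rn = 1 THEN 0 ELSE col1 END
  , col2 = CASE WHEN cte.rn = 2 THEN 0 ELSE col2 END
  , col3 = CASE WHEN cte.rn = 3 THEN 0 ELSE col3 END
FROM (SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS rn FROM tablename) cte
WHERE cte.id = tablename.id;
SELECT col1, col2, col3 FROM tablename ORDER BY id;
/*
COL1 COL2 COL3
0 2 3
4 0 6
7 8 0
*/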
The requirement "Columns can be of any count so need dynamic approach" looks like as a try to implement matrix as a table.
Alternative approach could be usage of ARRAY type and storing entire structure as single "cell" in the table.
CREATE OR REPLACE TABLE t
AS
SELECT ARRAY_CONSTRUCT(ARRAY_CONSTRUCT(1,2,3),
ARRAY_CONSTRUCT(4,5,6),
ARRAY_CONSTRUCT(7,8,9)) c
UNION ALL
SELECT ARRAY_CONSTRUCT(ARRAY_CONSTRUCT(10,20,30),
ARRAY_CONSTRUCT(40,50,60),
ARRAY_CONSTRUCT(70,80,90)) c;
SELECT *
FROM t;
/*
C
[[1,2,3],[4,5,6],[7,8,9]]
[[10,20,30],[40,50,60],[70,80,90]]
*/
Accessing elements:
SELECT c[0][0], c[0][1], c[0][2],
c[1][0], c[1][1], c[1][2],
c[2][0], c[2][1], c[2][2]
FROM t;
/*
C[0][0] C[0][1] C[0][2] C[1][0] C[1][1] C[1][2] C[2][0] C[2][1] C[2][2]
1 2 3 4 5 6 7 8 9
10 20 30 40 50 60 70 80 90
*/
Update:
UPDATE t
SET c = ARRAY_CONSTRUCT(ARRAY_CONSTRUCT('x' , c[0][1], c[0][2])
,ARRAY_CONSTRUCT(c[1][0], 'y' ,c[1][2])
,ARRAY_CONSTRUCT(c[2][0], c[2][1] , 'z' )
);
SELECT * FROM t;
/*
C
[["x",2,3],[4,"y",6],[7,8,"z"]]
[["x",20,30],[40,"y",60],[70,80,"z"]]
*/
More robust transformations could be performed via user-defined functions.
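For instance, a JavaScript UDF that overwrites the main diagonal of such a nested array (a sketch; the name set_diagonal is made up for illustration):
CREATE OR REPLACE FUNCTION set_diagonal(m ARRAY, v VARIANT)
RETURNS ARRAY
LANGUAGE JAVASCRIPT
AS
$$
// Snowflake uppercases JavaScript UDF argument names, hence M and V
var out = [];
for (var i = 0; i < M.length; i++) {
    var row = M[i].slice();  // copy the row so the input is not mutated
    row[i] = V;              // replace the i-th element of the i-th row
    out.push(row);
}
return out;
$$;
UPDATE t SET c = set_diagonal(c, 'x'::VARIANT);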

Is there a way to get this solution in Spark SQL? We need to apply a filter after the grouping of records.

Whenever there are two different values in the approval_ind column, e.g. [Y, X] or [Y, N], for the same Doc_Act_Checklist_Item_ID and Assigned_To_Person_ID, then:
1. Pick the record with entered_by_name = assigned_to_name.
2. If the names match for both records, pick the one with the minimum Entered_date.
3. In the case above the names do not match, so we pick the minimum Entered_date, which is the record with Doc_Act_Checklist_Item_Person_ID = 101.
I want to do this in Spark SQL; please help me out.
I have tried:
SELECT *
FROM STG_PUB_CHKLST_ITEM_PERSON
WHERE Doc_Act_Checklist_Item_Person_ID = (
    SELECT Doc_Act_Checklist_Item_Person_ID
    FROM (
        SELECT *,
               CASE
                   WHEN ASSIGNED_TO_NAME = ENTERED_BY_NAME THEN 'MATCH'
                   ELSE 'NO MATCH'
               END AS MATCHING_STATUS
        FROM STG_PUB_CHKLST_ITEM_PERSON
        WHERE doc_act_checklist_item_id = 55
          AND assigned_to_person_id = 33
    )
    WHERE MATCHING_STATUS = 'MATCH'
);
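One way to express all three rules in Spark SQL is a single ROW_NUMBER() that sorts matching names first and breaks ties by the earliest Entered_date. This is a hedged sketch built only from the column names in the question; the COLLECT_SET window approximates the "two different approval_ind values" condition:
SELECT *
FROM (
    SELECT *,
           -- more than one distinct approval_ind in the group?
           SIZE(COLLECT_SET(approval_ind) OVER (
               PARTITION BY Doc_Act_Checklist_Item_ID, Assigned_To_Person_ID)) AS ind_cnt,
           -- matching names sort first, then the earliest Entered_date
           ROW_NUMBER() OVER (
               PARTITION BY Doc_Act_Checklist_Item_ID, Assigned_To_Person_ID
               ORDER BY CASE WHEN entered_by_name = assigned_to_name THEN 0 ELSE 1 END,
                        Entered_date) AS rn
    FROM STG_PUB_CHKLST_ITEM_PERSON
) ranked
WHERE ind_cnt > 1 AND rn = 1;
If the names match for exactly one record, rule 1 picks it; if both or neither match, the Entered_date tiebreaker implements rules 2 and 3.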

Exclude values if they share the same ID

I have this statement in my Access database. It lists Magazzino.Codice and the related quantities; the UNION ALL joins the two result sets, allowing duplicates:
SELECT Magazzino.Codice, Magazzino.Qnt
FROM Magazzino
WHERE (((Magazzino.[Prossimo_arrivo]) Is Null) And ((Magazzino.Qnt)<30) And ((Magazzino.[Fascia_I])=True))
UNION ALL
SELECT Magazzino.Codice, Magazzino.Qnt
FROM Magazzino
WHERE (((Magazzino.[Prossimo_arrivo]) Is Null) And ((Magazzino.Qnt)<10) And ((Magazzino.[Fascia_II])=True));
I wish to add a condition that avoids listing Magazzino.Codice if the same ID is present in a third table, Magazzino Grezzi.
How can I achieve this?
First, simplify the logic, assuming you don't want duplicates from this table:
SELECT m.Codice, m.Qnt
FROM Magazzino AS m
WHERE m.[Prossimo_arrivo] Is Null AND
      ( (m.Qnt < 30 AND m.[Fascia_I] = True) OR
        (m.Qnt < 10 AND m.[Fascia_II] = True)
      );
Then use NOT IN or NOT EXISTS:
SELECT m.Codice, m.Qnt
FROM Magazzino AS m
WHERE m.[Prossimo_arrivo] Is Null AND
      ( (m.Qnt < 30 AND m.[Fascia_I] = True) OR
        (m.Qnt < 10 AND m.[Fascia_II] = True)
      ) AND
      NOT EXISTS (SELECT 1
                  FROM [Magazzino Grezzi] AS mg
                  WHERE mg.Codice = m.Codice
                 );
If you really want some rows to be duplicated (those that meet both conditions), then you can add the NOT EXISTS clause to both of your subqueries.
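That variant would look like this (a sketch keeping your original UNION ALL structure):
SELECT Magazzino.Codice, Magazzino.Qnt
FROM Magazzino
WHERE Magazzino.[Prossimo_arrivo] Is Null AND Magazzino.Qnt < 30 AND Magazzino.[Fascia_I] = True
  AND NOT EXISTS (SELECT 1 FROM [Magazzino Grezzi] AS mg WHERE mg.Codice = Magazzino.Codice)
UNION ALL
SELECT Magazzino.Codice, Magazzino.Qnt
FROM Magazzino
WHERE Magazzino.[Prossimo_arrivo] Is Null AND Magazzino.Qnt < 10 AND Magazzino.[Fascia_II] = True
  AND NOT EXISTS (SELECT 1 FROM [Magazzino Grezzi] AS mg WHERE mg.Codice = Magazzino.Codice);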

Does DolphinDB support list query?

I would like to run the following group of SQL statements at once and union the results, to get the most recent record before mt=52355979 for various stocks (identified by symbol) from different trade places and market types (identified by c1, c2, c3, c4).
select * from t where symbol=`A,c1=25,c2=814,c3=11,c4=2, date=2020.02.05, mt<52355979 order by mt desc limit 1
select * from t where symbol=`B,c1=25,c2=814,c3=12,c4=2, date=2020.02.05, mt<52355979 order by mt desc limit 1
select * from t where symbol=`C,c1=25,c2=814,c3=12,c4=2, date=2020.02.05, mt<52354979 order by mt desc limit 1
select * from t where symbol=`A,c1=1180,c2=333,c3=3,c4=116, date=2020.02.05, mt<52355979 order by mt desc limit 1
The filter columns in the where condition will not change, but the filter values may change each time. Does DolphinDB offer a querying method that allows running such a list query with varying input parameters?
You can define a function as follows:
def bundleQuery(tbl, dt, dtColName, mt, mtColName, filterColValues, filterColNames){
    cnt = filterColValues[0].size()
    filterColCnt = filterColValues.size()
    orderByCol = sqlCol(mtColName)
    selCol = sqlCol("*")
    // the last two filters (date and mt) are shared by every query
    filters = array(ANY, filterColCnt + 2)
    filters[filterColCnt] = expr(sqlCol(dtColName), ==, dt)
    filters[filterColCnt+1] = expr(sqlCol(mtColName), <, mt)
    queries = array(ANY, 0, cnt)    // start empty so append! does not leave NULL slots
    for(i in 0:cnt) {
        for(j in 0:filterColCnt){
            filters[j] = expr(sqlCol(filterColNames[j]), ==, filterColValues[j][i])
        }
        // build one SQL statement per row of filter values via metaprogramming
        queries.append!(sql(select=selCol, from=tbl, where=filters, orderBy=orderByCol, ascOrder=false, limit=1))
    }
    // evaluate all generated queries and concatenate the single-row results
    return loop(eval, queries).unionAll(false)
}
and then use the following script:
dt = 2020.02.05
dtColName = "dsl"
mt = 52355979
mtColName = "mt"
colNames = `symbol`c1`c2`c3`c4
colValues = [`A`B`C`A, 25 25 25 1180, 814 814 814 333, 11 12 12 3, 2 2 2 116]
bundleQuery(t, dt, dtColName, mt, mtColName, colValues, colNames)
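Each generated statement is equivalent to one of the hand-written queries; for example, iteration i = 0 evaluates the same query as (using the script's assumed date column name dsl):
select * from t where symbol=`A, c1=25, c2=814, c3=11, c4=2, dsl=2020.02.05, mt<52355979 order by mt desc limit 1
unionAll(false) then concatenates the four single-row results into one table.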

MAX NOT WORKING IN SQL QUERY

I want the latest record to be retrieved by the following query, but MAX is not working in it: all the rows are retrieved instead of just the latest one.
SELECT SV.SEGMENT1 TARGETED_INCENTIVE,
SIT.ANALYSIS_CRITERIA_ID,
SIT.OBJECT_VERSION_NUMBER OBJECT_VERSION_NUMBER,
ST.ID_FLEX_NUM,
SIT.DATE_FROM,
SIT.DATE_TO,
MAX (SIT.PERSON_ANALYSIS_ID)
FROM FND_ID_FLEX_STRUCTURES_TL STTL,
FND_ID_FLEX_STRUCTURES ST,
PER_PERSON_ANALYSES SIT,
PER_ANALYSIS_CRITERIA SV
WHERE 1 = 1
AND (STTL.ID_FLEX_STRUCTURE_NAME) LIKE
('%%Tare%')
AND STTL.LANGUAGE = USERENV ('LANG')
AND ST.ID_FLEX_CODE = STTL.ID_FLEX_CODE
AND ST.ID_FLEX_NUM = STTL.ID_FLEX_NUM
AND ST.ID_FLEX_NUM = SIT.ID_FLEX_NUM
AND ST.ID_FLEX_NUM = SV.ID_FLEX_NUM
AND TO_DATE (SIT.DATE_TO) IS NULL
AND SIT.ANALYSIS_CRITERIA_ID = SV.ANALYSIS_CRITERIA_ID
AND SIT.PERSON_ID = (SELECT PERSON_ID
FROM abc
WHERE ID = :AIN)
GROUP BY SV.SEGMENT1,
SIT.ANALYSIS_CRITERIA_ID,
STTL.ID_FLEX_STRUCTURE_NAME,
SIT.OBJECT_VERSION_NUMBER,
ST.ID_FLEX_NUM,
SIT.DATE_FROM,
SIT.DATE_TO;
Can anyone guide?
I'm afraid that's not what MAX() does. MAX() is an aggregate function (though it can also be used as a window [analytic] function), so when you take the MAX() of a particular column grouped by other columns, you get one row for each distinct combination of values of those other columns.
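A minimal illustration with hypothetical data: given rows (a=1, v=5), (a=1, v=9), (a=2, v=3),
SELECT a, MAX(v) FROM t GROUP BY a;
returns (1, 9) and (2, 3) -- one row per distinct a, not one row overall. In your query the GROUP BY list is effectively unique per row, so every "group" is a single row and MAX() filters nothing.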
I think you might want something like this:
SELECT targeted_incentive, analysis_criteria_id
, object_version_number, id_flex_num, date_from
, date_to, person_analysis_id
FROM (
SELECT sv.segment1 AS targeted_incentive
, sit.analysis_criteria_id
, sit.object_version_number
, st.id_flex_num
, sit.date_from
, sit.date_to
, sit.person_analysis_id
, RANK() OVER ( ORDER BY sit.person_analysis_id DESC ) rn
FROM fnd_id_flex_structures_tl sttl
, fnd_id_flex_structures st
, per_person_analyses sit
, per_analysis_criteria sv
WHERE sttl.id_flex_structure_name LIKE '%Tare%'
AND sttl.language = USERENV('LANG')
AND st.id_flex_code = sttl.id_flex_code
AND st.id_flex_num = sttl.id_flex_num
AND st.id_flex_num = sit.id_flex_num
AND st.id_flex_num = sv.id_flex_num
AND sit.date_to IS NULL
AND sit.analysis_criteria_id = sv.analysis_criteria_id
AND sit.person_id = ( SELECT person_id FROM abc
WHERE id = :AIN )
) WHERE rn = 1;
The RANK() window function will return the rank of each row ordered by the value of person_analysis_id in descending order. To get the maximum value, simply filter for rank = 1. Note that this will return more than one row in case of ties. If you want only one row, use ROW_NUMBER() in place of RANK().
Also note that I cleaned up the query a bit. You certainly don't need to use two % wildcards in a row in a LIKE, for example. You also definitely don't need the WHERE 1=1 condition.