Using Pig, how do I parse and compare a grouped item? - apache-pig

I have
A B
a, d
a, e
a, y
z, v
z, k
z, o
and so on.
Column B is of type chararray and contains key-value pairs separated by &.
For example - d = 'abc=1&c=1&p=success'
What I want to figure out --
Suppose -
d = 'abc=1&c=1&xyz=23423423'
e = 'xyz=1&it=ssd'
y = 'abc=1&c=1&p=success'
For every 'a', I want to figure out whether it has values in column B that share the same value of abc and have c=1 and p=success. I also want to extract the values of abc and c from d and y.
For instance, let's take the above example -
d contains abc=1 and c=1
y contains abc=1 and p=success
So this satisfies what I am looking for, i.e. for a given 'a' I have the same value of abc together with c=1 and p=success.
I started with grouping my data :
grouped = GROUP data BY A;
which gives me
a, (a,d)(a,e)(a,y)
z, (z,v)(z,k)(z,o)
But after this I am clueless on how to compare data within each group so that the above condition is satisfied.
Any help on this is appreciated.
Please let me know if you want me to clarify further on my question.

Since you are only concerned with some of the fields in the query string (I assume that's what it is), split the data with a FOREACH and STRSPLIT, then flatten it so that you have something that looks like this:
(a, b), where b is a single key/value pair from the query string, e.g. abc=1
Filter out the key/value pairs you don't care about, join the remaining ones back together, and then group by the combined key/value pairs. That gives you a list of every a with the same b, where b contains only abc=X, c=1 and p=success.
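The comparison the question asks for can also be sketched outside Pig. The Python below is only an illustration of the logic, not a Pig script. It assumes one reading of the requirement: for a given A, some row has abc=X with c=1 and some row (possibly the same one) has abc=X with p=success. The function name matching_abc_values and the extra 'z' row are hypothetical.

```python
from collections import defaultdict
from urllib.parse import parse_qs


def matching_abc_values(rows):
    """Map each A to the abc values seen with both c=1 and p=success in its group."""
    with_c = defaultdict(set)        # A -> abc values from rows that have c=1
    with_success = defaultdict(set)  # A -> abc values from rows that have p=success
    for a, b in rows:
        # parse_qs turns 'abc=1&c=1' into {'abc': ['1'], 'c': ['1']}
        fields = {key: values[0] for key, values in parse_qs(b).items()}
        abc = fields.get("abc")
        if abc is None:
            continue  # rows without abc cannot take part in a match
        if fields.get("c") == "1":
            with_c[a].add(abc)
        if fields.get("p") == "success":
            with_success[a].add(abc)
    # Keep only the abc values that appear on both sides for the same A.
    result = {}
    for a in list(with_c):
        common = with_c[a] & with_success[a]
        if common:
            result[a] = common
    return result


rows = [
    ("a", "abc=1&c=1&xyz=23423423"),  # d from the question
    ("a", "xyz=1&it=ssd"),            # e from the question
    ("a", "abc=1&c=1&p=success"),     # y from the question
    ("z", "abc=2&c=1"),               # hypothetical row with no p=success partner
]
print(matching_abc_values(rows))
```

Only 'a' is reported here, because the hypothetical 'z' row never sees p=success for its abc value.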

Related

How to count the number of elements in one array (with duplicates) that match with elements of another array in SQL(Presto)?

I have two arrays, X and Y: X=[a,b,c] and Y=[a,a,b,b,b,c,d,d,e,e,e]. I want to write a query that returns the elements of Y that match elements of X, keeping duplicates. In this case the output should be [a,a,b,b,b,c], and I need the length of this array, which is 6. I know array_intersect returns the result without duplicates.
SELECT array_intersect([a,b,c],[a,a,b,b,b,c,d,d,e,e,e])
the result is
[a,b,c]
but my desired output is
[a,a,b,b,b,c]
This can be achieved with filter and contains:
SELECT filter(array['a','a','b','b','b','c','d','d'], el -> contains(array['a','b','c'], el))
Output:
_col0
[a, a, b, b, b, c]
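For readers who want to check the semantics outside Presto, here is a minimal Python sketch of the same filter-and-contains idea. The name keep_matching is hypothetical; the final len call yields the length the question asks for, which in Presto would presumably come from wrapping the filter expression in cardinality().

```python
def keep_matching(values, allowed):
    """Return the elements of values that occur in allowed, duplicates included."""
    allowed_set = set(allowed)  # set lookup plays the role of contains()
    return [el for el in values if el in allowed_set]


x = ["a", "b", "c"]
y = ["a", "a", "b", "b", "b", "c", "d", "d", "e", "e", "e"]

matched = keep_matching(y, x)
print(matched)       # the matching elements, order and duplicates preserved
print(len(matched))  # how many elements of y matched
```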

PostgreSQL data transformation - Turn rows into columns

I have a table whose structure looks like the following:
k | i | p | v
Notice that the key (k) is not unique, there are no keys, nothing. Each key can have multiple attributes (i = 0, 1, 2, ...) which can be of different types (p) and have different values (v). One attribute type may also appear multiple times (p(i-1) = p(i)).
What I want to do is pick certain attribute types and their corresponding values and place them in the same row. For example I want to have:
k | attr_name1 | attr_name2
I have managed to make a query that does this and works for all keys (k) for which attr_name1 and attr_name2 appear in the column p of the initial table:
SELECT DISTINCT ON (key) fn.k AS key, fn.v AS attr_name1, a.v AS attr_name2
FROM Table fn
LEFT JOIN Table a ON fn.k = a.k
AND a.p = 'attr_name2'
WHERE fn.p = 'attr_name1'
I would like, however, to take into account the case where a certain key has no attribute named attr_name1 and insert a NULL value into the corresponding column of the new table. I am not sure how to achieve that. I have no issue using multiple queries or intermediate tables etc, but there are quite a lot of rows in the table and I need something that scales to millions of rows.
Any help would be appreciated.
Example:
k i p v
1 0 a 10
1 1 b 12
1 2 c 34
1 3 d 44
1 4 e 09
2 0 a 11
2 1 b 13
2 2 d 22
2 3 f 34
Would turn into (assuming I am only interested in columns a, b, c):
k a b c
1 10 12 34
2 11 13 NULL
I would use conditional aggregation. That is, an aggregate function around a CASE expression.
SELECT
k,
MAX(CASE WHEN p='a' THEN v END) AS a,
MAX(CASE WHEN p='b' THEN v END) AS b,
MAX(CASE WHEN p='c' THEN v END) AS c
FROM
your_table
GROUP BY
k
This assumes that (k, p) is unique. If a (k, p) pair occurs more than once, MAX returns the highest v for that pair.
As a general rule this kind of pivoting makes the data harder to process in SQL. This is often done for display purposes because humans find this easier to read. However, from a software engineering perspective, such formatting should not be done in the data layer; be careful that by doing this you don't actually make your future life harder.
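The conditional aggregation query above can be tried on the question's sample rows without a PostgreSQL server. The sketch below uses Python's built-in sqlite3 module as a stand-in, which is an assumption made for convenience; the SQL is the same apart from an added ORDER BY, and the function name pivot is hypothetical.

```python
import sqlite3

# Sample rows (k, i, p, v) from the question; '09' is stored as the integer 9.
rows = [
    (1, 0, "a", 10), (1, 1, "b", 12), (1, 2, "c", 34), (1, 3, "d", 44), (1, 4, "e", 9),
    (2, 0, "a", 11), (2, 1, "b", 13), (2, 2, "d", 22), (2, 3, "f", 34),
]


def pivot(rows):
    """Pivot attribute types a, b and c into columns, one output row per k."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE your_table (k INTEGER, i INTEGER, p TEXT, v INTEGER)")
    conn.executemany("INSERT INTO your_table VALUES (?, ?, ?, ?)", rows)
    # CASE without ELSE yields NULL, and MAX ignores NULLs, so a key that
    # lacks an attribute ends up with NULL in that column.
    result = conn.execute(
        """
        SELECT k,
               MAX(CASE WHEN p = 'a' THEN v END) AS a,
               MAX(CASE WHEN p = 'b' THEN v END) AS b,
               MAX(CASE WHEN p = 'c' THEN v END) AS c
        FROM your_table
        GROUP BY k
        ORDER BY k
        """
    ).fetchall()
    conn.close()
    return result


for row in pivot(rows):
    print(row)
```

Key 2 has no attribute c, so its last column comes back as None, Python's representation of NULL.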

How to read excel two dimensional parameter in Gams?

I have a GAMS model and I want to read sets and parameters from Excel into GAMS, as shown below:
How can I read this parameter in GAMS?
Thanks
For that table you need two indexes (i.e. sets), e.g. set i for the column containing a, b and c, and set j for the row containing d, e and f. Try this:
parameter d(i,j) "Data with column of a, b and c and row of d, e and f";
$Call GDXXRW.exe i=C:\Input.xlsx par=d rng=Sheet1!C1:F4 Rdim=1 Cdim=1 o=C:\Input.gdx
$GDXIN C:\Input.gdx
$LOAD d
$GDXIN
Display d;
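To make the Rdim=1 Cdim=1 layout concrete, here is a small Python sketch of how such a range maps onto d(i,j). It assumes that Rdim=1 means the first column of the range holds the row labels and Cdim=1 means the first row holds the column labels. The numbers in the grid are made up because the original spreadsheet is not shown, and table_to_parameter is a hypothetical name.

```python
def table_to_parameter(grid):
    """Turn a 2D range with one header row and one label column into {(i, j): value}."""
    header = grid[0][1:]  # column labels (set j); the top-left corner cell is ignored
    param = {}
    for row in grid[1:]:
        i = row[0]  # row label (set i)
        for j, value in zip(header, row[1:]):
            if value is not None:  # skip empty cells
                param[(i, j)] = value
    return param


# Hypothetical contents of Sheet1!C1:F4: 4 rows by 4 columns.
grid = [
    [None, "d", "e", "f"],
    ["a", 1, 2, 3],
    ["b", 4, 5, 6],
    ["c", 7, 8, 9],
]

d = table_to_parameter(grid)
print(d[("a", "d")], d[("c", "f")])
```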

Find continuity of elements in Pig

How can I find the continuity of a field and its starting position?
The input is like
A-1
B-2
B-3
B-4
C-5
C-6
The output I want is
A,1,1
B,3,2
C,2,5
Thanks.
Assuming you do not have discontinuous data with respect to a value, you can get the desired results by first grouping on value and using COUNT and MIN to get continuous_counts and start_index respectively.
A = LOAD 'data' USING PigStorage('-') AS (value:chararray, index:int);
B = FOREACH (GROUP A BY value) GENERATE
group AS value,
COUNT(A) AS continuous_counts,
MIN(A.index) AS start_index; -- smallest index in the group, not MIN of value
STORE B INTO 'output' USING PigStorage(',');
If your data can be discontinuous, the solution is no longer trivial in native Pig and you might need to write a UDF for that purpose.
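As a rough idea of what such a UDF would have to compute, here is a Python sketch of the run detection. It is not a Pig UDF, only the logic, and it assumes the records are already sorted by index. The name runs is hypothetical.

```python
from itertools import groupby


def runs(records):
    """Return (value, count, start_index) for each consecutive run of equal values.

    records is a list of (value, index) pairs sorted by index.
    """
    result = []
    # groupby only merges adjacent equal keys, so a value that reappears
    # later starts a new run instead of being folded into the earlier one.
    for value, group in groupby(records, key=lambda record: record[0]):
        items = list(group)
        result.append((value, len(items), items[0][1]))
    return result


records = [("A", 1), ("B", 2), ("B", 3), ("B", 4), ("C", 5), ("C", 6)]
for run in runs(records):
    print(run)
```

With discontinuous input such as A-1, B-2, A-3, this reports three separate runs, which a plain GROUP BY on value would not.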
Group by value and count the rows in each group to get continuous_counts, i.e.
A,1
B,3
C,2
Get the top row for each value. i.e.
A,1
B,2
C,5
Join the above two relations and get the desired output.
A = LOAD 'data.txt' USING PigStorage('-') AS (value:chararray, index:int);
B = GROUP A BY value;
C = FOREACH B GENERATE group AS value, COUNT(A.value) AS continuous_counts;
D = FOREACH B {
ordered = ORDER A BY index; -- sort the bag A inside each group, not the relation B
first = LIMIT ordered 1;
GENERATE FLATTEN(first) AS (value, index); -- unnest the one-tuple bag into plain fields
}
E = JOIN C BY value, D BY value;
F = FOREACH E GENERATE C::value, C::continuous_counts, D::index;
DUMP F;

Odd Even Sorting in VBA

I am trying to sort rows of data so that the integer value of an alpha-numerical address is in order of odd values then even values given they are of the same type.
The only way I have got it to (semi)work was this:
-Find if the integer of the address is even or odd
-Add EVEN or ODD to a cell in that address's corresponding row
-Run the macro
-Filter the data by EVEN or ODD designation
This approach isn't ideal. I am interested in rearranging the rows without having to use filtering.
Below is an example of how the sorting would go.
UNSORTED SORTED
Address Type Address Type
1.1p A 1.1p A
1.2p A 1.2p A
1.3p A 1.3p A
1.4p A 1.4p A
2.1p A 3.1p A
2.2p A 3.2p A
2.3p A 3.3p A
2.4p A 3.4p A
3.1p A 5.1p A
3.2p A 5.2p A
3.3p A 5.3p A
3.4p A 5.4p A
4.1p A 2.1p A
4.2p A 2.2p A
4.3p A 2.3p A
4.4p A 2.4p A
5.1p A 4.1p A
5.2p A 4.2p A
5.3p A 4.3p A
5.4p A 4.4p A
6.1p B 7.1p B
6.2p B 7.2p B
6.3p B 7.3p B
6.4p B 7.4p B
7.1p B 9.1p B
7.2p B 9.2p B
7.3p B 9.3p B
7.4p B 9.4p B
8.1p B 6.1p B
8.2p B 6.2p B
8.3p B 6.3p B
8.4p B 6.4p B
9.1p B 8.1p B
9.2p B 8.2p B
9.3p B 8.3p B
9.4p B 8.4p B
10.1p B 10.1p B
10.2p B 10.2p B
10.3p B 10.3p B
10.4p B 10.4p B
I am new to VBA. Thank you in advance for any suggestions.
I think you need to create a helper column that stores a value you can sort on.
The basic idea is to extract the numeric value from your "Address" column, check whether it is even and, if so, multiply it by a high value (e.g. 1000) so that it is guaranteed to be higher than the highest possible odd value.
You can either use a formula for this cell, although it looks a little complicated to me. Assuming that your data starts in cell A2:
=VALUE(LEFT(A2, SEARCH("p", A2, 1)-1))*IF(ISODD(VALUE(LEFT(A2, SEARCH("p", A2, 1)-1))),1,1000)
or write a small UDF
Function SortVal(s As String) As Double
SortVal = Val(s)
If Int(SortVal) Mod 2 = 0 Then SortVal = SortVal * 1000
End Function
and put a call to it in your helper column
=SortVal(A2)
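The effect of the helper value can be checked quickly outside Excel. The Python sketch below mirrors the SortVal idea and then sorts by Type first and helper value second. Sorting by Type first is an assumption read off the example table, and the names sort_val and odd_even_sort are hypothetical.

```python
import re


def sort_val(address):
    """Helper value: the leading number, times 1000 when its integer part is even."""
    # Take the leading number, e.g. '10.3p' -> 10.3 (similar to what VBA's Val returns).
    number = float(re.match(r"\d+(\.\d+)?", address).group())
    return number * 1000 if int(number) % 2 == 0 else number


def odd_even_sort(rows):
    """Sort (address, type) rows by type, then odd addresses before even ones."""
    return sorted(rows, key=lambda row: (row[1], sort_val(row[0])))


# Rebuild the unsorted sample: 1.1p to 5.4p are type A, 6.1p to 10.4p are type B.
rows = [(f"{n}.{m}p", "A" if n <= 5 else "B") for n in range(1, 11) for m in range(1, 5)]

ordered = odd_even_sort(rows)
for address, kind in ordered:
    print(address, kind)
```

The multiplier 1000 only works while every odd address stays below the smallest scaled even address, which holds for the sample data.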