How to display all created variables in Pig Latin? - apache-pig

After I run these commands in Pig Latin:
abc = group xyz by n;
def = group pqr by m;
the variable abc will be created, which will store the values present in xyz grouped by column n, and similarly def will be created. Now, if I want to see all the variables that have been created, is there any command to display them?

Related

How can I use the df.query method in a function?

I am trying to create a function that will use the query method to filter through a data frame based on multiple conditions and create a new series with the values applicable.
Here is an example of the df that I am working with:
Week  Treatment  Value
0     ABC        100
1     ABC        150
2     ABC        149
0     XYZ        350
1     XYZ        500
2     XYZ        600
0     ABC        101
1     ABC        130
2     ABC        147
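For reference, the sample frame above can be reconstructed like this (a sketch; the column names and values are taken from the table):
import pandas as pd

df = pd.DataFrame({
    'Week': [0, 1, 2, 0, 1, 2, 0, 1, 2],
    'Treatment': ['ABC', 'ABC', 'ABC', 'XYZ', 'XYZ', 'XYZ', 'ABC', 'ABC', 'ABC'],
    'Value': [100, 150, 149, 350, 500, 600, 101, 130, 147],
})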
I have been able to successfully filter my data for a given 'Week' and 'Treatment' using the .query method like this:
test = df.query('Week == 1 & Treatment.str.startswith("A").values')['Value']
I am only interested in extracting the 'Value' column.
This is great, but I don't want to have to copy and paste this line of code for every 'week' and 'treatment' in my df. Therefore, I want to create a function that will let me identify the desired week and treatment (or first letter of treatment) as arguments and have it return an object with my desired values for further analysis.
Here is what I have so far:
def sorter(df, week):
    return df.query('Week == 0 & Treatment.str.startswith("A").values')['Value']
I know that as I have it, my function does not return my values as an object. I will work on that. Where I am stuck is how to have one of the arguments for my function be a week (like '0') when in the query method the week is written as part of a string.
I tried this:
def sorter(df, week):
    return df.query('Week == week & Treatment.str.startswith("A").values')['Value']
But I got an error saying week was undefined.
A possible solution, based on f-strings:
def sorter(df, week):
    return df.query(f'Week == {week} & Treatment.str.startswith("A").values')['Value']
As an alternative, you can reference local variables inside the query string with the @ prefix:
def sorter(df, week):
    return df.query('Week == @week & Treatment.str.startswith("A").values')['Value']
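A quick sanity check against the sample frame above (using either the f-string or the @ version):
test = sorter(df, 1)
print(test)
# 1    150
# 7    130
# Name: Value, dtype: int64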

Merge certain rows in a DataFrame based on startswith

I have a DataFrame in which I want to merge certain rows into a single one. It has the following structure (values repeat):
Index Value
1 date:xxxx
2 user:xxxx
3 time:xxxx
4 description:xxx1
5 xxx2
6 xxx3
7 billed:xxxx
...
Now the problem is that rows 5 and 6 still belong to the description and were just split incorrectly (the whole string was separated by ","). I want to merge the "description" row (4) with the values after it (5, 6). In my DF there can be 1-5 additional entries that have to be merged with the description row, but the structure lets me work with startswith, because no matter how many rows have to be merged, the end point is always the row that starts with "billed". Being very new to Python, I haven't written any code for this problem yet.
My thought is the following (if it is even possible):
Look for a row which starts with "description" → merge all the rows after it until reaching the row which starts with "billed", then stop (obviously keeping the "billed" row) → do the same for each row starting with "description"
New DF should look like:
Index Value
1 date:xxxx
2 user:xxxx
3 time:xxxx
4 description:xxx1, xxx2, xxx3
5 billed:xxxx
...
import pandas as pd

df = pd.DataFrame.from_dict({'Value': ('date:xxxx', 'user:xxxx', 'time:xxxx', 'description:xxx', 'xxx2', 'xxx3', 'billed:xxxx')})
records = []
description = description_val = None
for rec in df.to_dict('records'):  # type: dict
    # if a description is open and this record continues its value
    if description and rec['Value'].startswith(description_val):
        description['Value'] += ', ' + rec['Value']  # fold record into the previous description
        continue
    # record with a new description...
    if rec['Value'].startswith('description:'):
        description = rec
        _, description_val = rec['Value'].split(':', 1)
    elif rec['Value'].startswith('billed:'):
        # billed record - close the open description
        description = description_val = None
    records.append(rec)
print(pd.DataFrame(records))
# Value
# 0 date:xxxx
# 1 user:xxxx
# 2 time:xxxx
# 3 description:xxx, xxx2, xxx3
# 4 billed:xxxx
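For comparison, here is a more vectorized sketch of the same idea. It assumes, as in the data above, that every keyed row contains a ':' and continuation rows do not, so a cumulative sum over that flag groups each continuation row with the keyed row before it:
import pandas as pd

df = pd.DataFrame({'Value': ['date:xxxx', 'user:xxxx', 'time:xxxx',
                             'description:xxx', 'xxx2', 'xxx3', 'billed:xxxx']})
has_key = df['Value'].str.contains(':')   # True for rows like 'key:value'
group_id = has_key.cumsum()               # continuation rows share the id of the row above
merged = df.groupby(group_id)['Value'].agg(', '.join).reset_index(drop=True)
print(merged.to_frame())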

pandas/python Merge/concatenate related data of duplicate rows and add a new column to existing data frame

I am new to Pandas, and wanted your help with data slicing.
I have a dump of 10 million rows with duplicates. Please refer to this image for a sample of the rows with the steps I am looking to perform.
As you see in the image, the column for criteria "ABC" from Source 'UK' has 2 duplicate entries in the Trg column. I need help with:
Adding a concatenated new column "All Targets", as shown in the image
Removing duplicates from the above table so that only unique rows remain, as shown in step 2 in the image
Any help with this regard will be highly appreciated.
I would do it like this:
PART 1:
First define a function that does what you want, then use the apply method:
def my_func(grouped):
    all_target = grouped["Trg"].unique()
    grouped["target"] = ", ".join(all_target)
    return grouped

df1 = df.groupby("Criteria").apply(my_func)
# output: example with the first 4 rows
  Criteria Trg  target
0      ABC  DE  DE, FR
1      ABC  FR  DE, FR
2      DEF  UK  UK, FR
3      DEF  FR  UK, FR
PART 2:
df2 = df1.drop_duplicates(subset=["Criteria"])
I tried it only on the first 4 rows, so let me know if it works.
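A more concise variant of the same two steps, using transform instead of apply (a sketch, assuming the column names above):
df["All Targets"] = df.groupby("Criteria")["Trg"].transform(lambda s: ", ".join(s.unique()))
df2 = df.drop_duplicates(subset=["Criteria"])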

How to load array of strings with tab delimiter in pig

I have a tab-delimited text file, and I am trying to print the first column as an id and the remaining array of strings as a second column of names.
Below is the file to load:
cat file.txt;
1 A B
2 C D E F
3 G
4 H I J K L M
In the above file, the first column is an id and the remaining columns are names.
I should get output like:
id names
1 A,B
2 C,D,E,F
3 G
4 H,I,J,K,L,M
If the names were delimited by ',', I could get the output using the commands below:
test = load '/tmp/arr' using PigStorage('\t') as (id:int, names:chararray);
btest = FOREACH test GENERATE id, FLATTEN(TOBAG(STRSPLIT(names, ','))) as value:tuple(name:CHARARRAY);
But with the tab delimiter ('\t') this does not work, because only the first value after the id ends up in column 2 (names).
Any solution for this?
I have a solution for this:
When you load with PigStorage('\t'), each tab in a line starts a new column, so a line with n tabs produces n+1 columns. This is how it works.
But there is a trick: change the delimiter between the id and the names to something else, such as a comma, and load the file with PigStorage(','). The names then arrive together in one field and can be split afterwards.
It will work for sure.
Input file sample
1,A B
2,C D E F
3,G
4,H I J K L M
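One way to produce that comma-delimited file is a small pre-processing step. Here is a sketch in Python (the file names are hypothetical) that replaces the first tab with a comma and the remaining tabs with spaces:
# read the tab-delimited input and write a comma-delimited copy
with open('file.txt') as src, open('file_comma.txt', 'w') as dst:
    for line in src:
        fields = line.rstrip('\n').split('\t')
        dst.write(fields[0] + ',' + ' '.join(fields[1:]) + '\n')
After this, the file can be loaded with PigStorage(',') and the names split on ' ' instead of ','.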
Hope this helps

Convert data into a specific format in Apache Pig

I want to convert data into a specific format in Apache Pig so that I can use a reporting tool on top of it.
For example:
10:00,abc
10:00,cde
10:01,abc
10:01,abc
10:02,def
10:03,efg
The output should be in the following format:
       abc cde def efg
10:00    1   1   0   0
10:01    2   0   0   0
10:02    0   0   1   0
The main problem here is that a value can occur multiple times for the same timestamp, and, depending on the distinct values in the sample csv file, there can be up to 120 columns in total.
Any suggestions to tackle this are more than welcome.
Thanks
Gagan
Try something like the following:
A = load 'data' using PigStorage(',') as (key:chararray, value:chararray);
B = foreach A generate key, (value=='abc'?1:0) as abc, (value=='cde'?1:0) as cde, (value=='def'?1:0) as def, (value=='efg'?1:0) as efg;
C = group B by key;
D = foreach C generate group as key, SUM(B.abc) as abc, SUM(B.cde) as cde, SUM(B.def) as def, SUM(B.efg) as efg;
That should get you a count of the occurrences of a particular value for a particular key.
EDIT: I just noticed the "limit 120" part of the question. If the counts cannot go above 120, add the following:
E = foreach D generate key, (abc>120?'OVER 120':(chararray)abc) as abc, (cde>120?'OVER 120':(chararray)cde) as cde, (def>120?'OVER 120':(chararray)def) as def, (efg>120?'OVER 120':(chararray)efg) as efg;
(Both branches of a Pig bincond must have the same type, hence the cast to chararray.)
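For what it's worth, outside Pig the same pivot is a one-liner in pandas; a sketch, assuming the csv has the two columns shown in the question (the file name is hypothetical):
import pandas as pd

df = pd.read_csv('data.csv', names=['key', 'value'])
print(pd.crosstab(df['key'], df['value']))  # rows: timestamps, columns: values, cells: counts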