store data into a path based on a field's value in the tuple - apache-pig

{'aaa', 'bbb', 'ccc'}
....
Let's suppose the above tuple gets loaded with the following schema:
as (firstField:chararray, secondField:chararray, thirdField:chararray)
I want to store the tuple in HDFS with the path based on the 2nd field (which is 'bbb' in the above example). So the above tuple would be stored at the path
/SomeBaseDir/bbb/testoutput.txt
Any help would be appreciated.

To load the file, use the command below. Remember that the input data should be tab separated. If you are using any other separator, such as a comma, change the parameter passed to the PigStorage function, e.g. PigStorage(',').
A = load '/home/abhishek/Work/pigInput/data' using PigStorage('\t') as (firstField:chararray, secondField:chararray, thirdField:chararray);
Now, to get the second field, simply use:
result = foreach A generate secondField;
Result
dump result
('bbb')
You can store it using the command below:
store result into 'provide the path';

I think you want to use MultiStorage (https://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/MultiStorage.html). This should do pretty much what you want: specify the base path, then the field on which the subdirectories should be based.
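A minimal sketch of that approach (the load path and schema are taken from the question; the piggybank.jar location is an assumption). The '1' is the index of the split field, counting from 0, and MultiStorage picks its own part-file names inside each subdirectory, so you get files under /SomeBaseDir/bbb/ rather than a file literally named testoutput.txt:
REGISTER piggybank.jar;
A = LOAD '/home/abhishek/Work/pigInput/data' USING PigStorage('\t')
    AS (firstField:chararray, secondField:chararray, thirdField:chararray);
-- split the output into subdirectories named after the value of the second field
STORE A INTO '/SomeBaseDir' USING org.apache.pig.piggybank.storage.MultiStorage('/SomeBaseDir', '1');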

The SPLIT operator in Pig Latin would do the job here.
Once the input data is loaded, it can be split into different output relations as follows:
loadedData = LOAD '...' USING ... AS (..., somefield, ...);
SPLIT loadedData INTO
segmentA IF (somefield=='A'),
segmentB IF (somefield=='B'),
OtherSources OTHERWISE;
store segmentA into 'hdfs://<path for data segmentA >' using ....;
store segmentB into 'hdfs://<path for data segmentB >' using ....;
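Applied to the schema from the question, a sketch might look like the following (the relation names are just illustrative, and this only works when the possible values of secondField are known up front, since each STORE path is fixed):
A = LOAD '/home/abhishek/Work/pigInput/data' USING PigStorage('\t')
    AS (firstField:chararray, secondField:chararray, thirdField:chararray);
SPLIT A INTO
    bbbRecords IF (secondField == 'bbb'),
    otherRecords OTHERWISE;
STORE bbbRecords INTO '/SomeBaseDir/bbb' USING PigStorage('\t');
STORE otherRecords INTO '/SomeBaseDir/other' USING PigStorage('\t');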

Related

read specific file names in ADF pipeline

I have a requirement where blob storage has multiple files named file_1.csv, file_2.csv, file_3.csv, file_4.csv, file_5.csv, file_6.csv, file_7.csv. From these, I have to read only files 5 to 7.
How can we achieve this in an ADF/Synapse pipeline?
I have reproduced this in my lab; please see the repro steps below.
ADF:
Using the Get Metadata activity, get a list of all files.
(Parameterize the source file name in the source dataset to pass ‘*’ in the dataset parameters to get all files.)
Pass the Get Metadata output childItems to a ForEach activity:
@activity('Get Metadata1').output.childItems
Add an If Condition activity inside the ForEach and add the true-case expression to copy only the required files to the sink:
@and(greater(int(substring(item().name,4,1)),4),lessOrEquals(int(substring(item().name,4,1)),7))
When the If Condition is true, add a Copy data activity to copy the current item (file) to the sink.
I took a slightly different approach, using a Filter activity and the endsWith function:
The filter expression is:
@or(or(endsWith(item().name, '_5.csv'),endsWith(item().name, '_6.csv')),endsWith(item().name, '_7.csv'))
Slightly different approaches, similar results; it depends on what you need.
You can always do what @NiharikaMoola-MT suggested. But since you already know the range of the files (5-7), I suggest:
Declare two parameters as the lower and upper bounds of the range.
Create a ForEach loop and pass the parameters to create a range [lowerLimit, upperLimit].
Create a parameterized dataset for the source.
Use the file number from the ForEach loop to create a dynamic expression like
@concat('file',item(),'.csv')
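One way to build the ForEach items from those parameters is ADF's range(startIndex, count) function; the parameter names lowerLimit and upperLimit (assumed to be of type Int) are just illustrative:
@range(pipeline().parameters.lowerLimit, add(sub(pipeline().parameters.upperLimit, pipeline().parameters.lowerLimit), 1))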

how to remove a field from tuple in apache pig

I have code as follows:
B_t= LOAD 'test.csv' USING PigStorage('\t') as (id:chararray,usr_id:chararray,weed:chararray,ip:chararray);
In the above, I have a field named weed. I would like to remove this field from the record with a FILTER command, without using code like the following:
B_f = FOREACH B_t GENERATE id , usr_id, ip
or
B_t= LOAD 'test.csv' USING PigStorage('\t') $0 as id, ....;
Does anyone have an idea?
Why in the world would you want to use FILTER for this? FILTER emits records based on a condition; it does not remove a field. FOREACH ... GENERATE is the best way.

How to ignore "," in data fields

I am trying to generate the following ...
Input
396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift, but at the same it is a curse",Obey_Jony09
396124440112951296,"00:00 #MAW",WesleyBitton
A = LOAD '/user/root/data/tweets.csv' USING PigStorage(',') as (users:chararray, tweets:chararray);
B = FILTER A by users == '396124436476092416';
Output (truncated):
(396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift)
Expected output:
(396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift, but at the same it is a curse")
I do not want to read each row as a single unparsed line.
You can use CSVLoader for loading the data.
However, if you do not wish to do that, here is a workaround in Apache Pig itself:
--Load your Data
A = LOAD 'your/path/users.csv' USING TextLoader() AS (unparsed:chararray);
--Replace your " string with | so as to separate your tweets
B = FOREACH A GENERATE REPLACE(unparsed, '\\"', '|') AS parsed:chararray;
--store your temporary parsed data into your location
STORE B INTO 'your/path/parsed_users.csv' USING PigStorage('|');
--load your parsed data
C = LOAD 'your/path/parsed_users.csv' USING PigStorage('|') AS (users:chararray, tweets:chararray);
--Dump your data; the users field will still contain one extra comma (,) but you can strip that with the REPLACE function as well -- you get the point.
DUMP C;
That fits the CSV standard, so you just need to use CSVLoader, which supports double-quoted fields that contain commas and other double-quotes escaped with backslashes.
This is how to use it:
register file:/home/hadoop/lib/pig/piggybank.jar
DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
A = LOAD '/user/root/data/tweets.csv' USING CSVLoader AS (users:chararray, tweets:chararray);
B = FILTER A by users == '396124436476092416';

REGEX_EXTRACT error in PIG

I have a CSV file with 3 columns: tweetid, tweet, and Userid. However, within the tweet column there are comma-separated values.
Here is one row of data:
`396124437168537600`,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.",savava143
I want to extract all 3 fields individually, but REGEX_EXTRACT is giving me an error with this code:
a = LOAD 'tweets' USING PigStorage(',') AS (f1,f2,f3);
b = FILTER a BY REGEX_EXTRACT(f1,'(.*)\\"(.*)',1);
The error is:
error: Filter's condition must evaluate to boolean.
In the use case shared, reading the data using PigStorage(',') will result in the last field value (savava143) being dropped.
A = LOAD '/Users/muralirao/learning/pig/a.csv' USING PigStorage(',') AS (f1,f2,f3);
DUMP A;
Output : A : Observe that the last field value is missing.
(396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.")
For the use case shared, to extract all the values from a CSV file whose field values contain ',', we can use either CSVExcelStorage or CSVLoader.
Approach 1 : Using CSVExcelStorage
Ref : http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html
Input : a.csv
396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.",savava143
Pig Script :
REGISTER piggybank.jar;
A = LOAD 'a.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS (f1,f2,f3);
DUMP A;
Output : A
(396124437168537600,I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.,savava143)
Approach 2 : Using CSVLoader
Ref : http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/CSVLoader.html
The script below makes use of CSVLoader(); DUMP A will result in the same output seen earlier.
A = LOAD 'a.csv' USING org.apache.pig.piggybank.storage.CSVLoader() AS (f1,f2,f3);
The issue is that you do not want to FILTER based on a regex but to GENERATE new fields based on a regex. To filter, you need to know whether the line has to be kept, hence the boolean requirement.
Therefore, you have to use :
b = FOREACH a GENERATE REGEX_EXTRACT(FIELD, REGEX, HOW_MANY_GROUPS_TO_RETURN);
However, as @Murali Rao said, your values are not just comma separated but CSV (think about how you would handle a comma inside a tweet: it is not a field separator, just content).
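As an illustration of the difference (field names from the question; the regexes are only examples):
-- FILTER needs a boolean expression, e.g. MATCHES
b = FILTER a BY f2 MATCHES '".*';
-- REGEX_EXTRACT returns a value, so it belongs in a FOREACH ... GENERATE
c = FOREACH a GENERATE REGEX_EXTRACT(f2, '"(.*)', 1) AS tweet_start;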

How to remove characters from a field in Pig

Data:
someId,+1 5552221234
someId2,+1 3331114321
I want to remove the +1 from the second field.
I first load the data
A = LOAD 'Data' USING PigStorage(',') AS (Id:chararray, Phone:chararray);
Now I want to have the following data.
Desired Output:
someId, 5552221234
someId2, 3331114321
How would I go about doing this? I was using the following, but it doesn't work:
mss_demographic_data3= FOREACH mss_demographic_data2 GENERATE *, REGEX_EXTRACT_ALL(Phone, '[0-9]{9}$') as newPhone;
Use the SUBSTRING function (easiest way). Note that the stop index is exclusive, so for the 10 digits after the '+1 ' prefix it needs to be 13:
mss_demographic_data3 = FOREACH mss_demographic_data2 GENERATE Id, SUBSTRING(Phone,3,13);
NOTE: The SUBSTRING function is only available in Pig 0.8.0 or above. If you are using an older version of Pig, you may need to write a UDF.
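If the prefix is not always exactly three characters, a regex-based REPLACE (also a Pig built-in) is a possible alternative; the '^\\+1 ' pattern is just an assumption about the prefix format:
-- strip a leading "+1 " from the phone number
B = FOREACH A GENERATE Id, REPLACE(Phone, '^\\+1 ', '') AS newPhone;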