Data:
someId,+1 5552221234
someId2,+1 3331114321
I want to remove the +1 from the second field.
I first load the data:
A = LOAD 'Data' USING PigStorage(',') AS (Id:chararray, Phone:chararray);
Now I want to have the following data:
Desired Output:
someId, 5552221234
someId2, 3331114321
How would I go about doing this? I was using the following, but it doesn't work:
mss_demographic_data3= FOREACH mss_demographic_data2 GENERATE *, REGEX_EXTRACT_ALL(Phone, '[0-9]{9}$') as newPhone;
Use the SUBSTRING function (the easiest way):
-- The stop index is exclusive, so 13 is needed to keep all ten digits.
mss_demographic_data3 = FOREACH mss_demographic_data2 GENERATE Id, SUBSTRING(Phone, 3, 13);
Note: SUBSTRING is available only in Pig 0.8.0 or later. If you are using an older version of Pig, you may need to write a UDF.
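The index arithmetic is easy to get wrong, so here is a minimal sketch in plain Python (not Pig) for checking it. It assumes the start/exclusive-stop convention of Pig's SUBSTRING, which matches Python slicing. The helper name strip_country_code is hypothetical, and the regex variant is an alternative that does not depend on the number's length.

```python
import re

def strip_country_code(phone):
    """Drop a leading '+1' and any spaces after it from a phone string."""
    return re.sub(r'^\+1\s*', '', phone)

phone = '+1 5552221234'

# Characters 3..12 hold the ten digits, so the exclusive stop index is 13.
full_number = phone[3:13]
# A stop index of 12 silently loses the last digit.
short_number = phone[3:12]

print(full_number)
print(short_number)
print(strip_country_code(phone))
```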
Related
I have the following code:
B_t = LOAD 'test.csv' USING PigStorage('\t') as (id:chararray,usr_id:chararray,weed:chararray,ip:chararray);
Above, I have a field named weed. I would like to remove this field from the records using the FILTER command, without using code like the following:
B_f = FOREACH B_t GENERATE id, usr_id, ip;
or
B_t= LOAD 'test.csv' USING PigStorage('\t') $0 as id, ....;
Does anyone have an idea?
Why would you want to use only FILTER? FILTER keeps or drops whole records based on a condition; it does not remove a field from every record. FOREACH ... GENERATE is the best way.
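To make the distinction concrete, here is a small plain-Python analogy (not Pig code). The row values are made up for illustration; only the field layout (id, usr_id, weed, ip) comes from the question.

```python
# Rows shaped like (id, usr_id, weed, ip); the values are invented for this sketch.
rows = [
    ('1', 'u1', 'w1', '10.0.0.1'),
    ('2', 'u2', 'w2', '10.0.0.2'),
]

# Filtering keeps or drops whole rows; every surviving row still has all four fields.
filtered = [row for row in rows if row[1] == 'u1']

# Projection (the FOREACH ... GENERATE analogue) keeps every row but drops the weed column.
projected = [(id_, usr_id, ip) for (id_, usr_id, _weed, ip) in rows]

print(filtered)
print(projected)
```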
{'aaa', 'bbb', 'ccc'}
....
Let's suppose the above tuple gets loaded with the following schema:
as (firstField:chararray, secondField:chararray, thirdField:chararray)
I want to store the tuple in HDFS at a path based on the 2nd field (which is 'bbb' in the above example). So the above tuple would be stored at the path
/SomeBaseDir/bbb/testoutput.txt
Any help would be appreciated.
To load the file, use the command below. Remember that the input data should be tab separated. If you are using another separator, such as a comma, change the parameter passed to the PigStorage function; in that case it should be PigStorage(',').
A = load '/home/abhishek/Work/pigInput/data' using PigStorage('\t') as (firstField:chararray, secondField:chararray, thirdField:chararray);
Now, to get the second element, simply use:
result = foreach A generate secondField;
Result
dump result
('bbb')
You can store it using below command
store result into 'provide the path';
I think you want to use MultiStorage (https://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/MultiStorage.html). This should do pretty much what you want. Specify base path, then the field on which subdirectories should be based.
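The idea of "one subdirectory per value of a field" can be sketched in plain Python on a local filesystem. This is only an illustration of the partitioning logic, not how MultiStorage is implemented, and the file names MultiStorage actually produces may differ. The function store_by_field, the extra sample records, and the temporary base directory are assumptions made for the sketch.

```python
import os
import tempfile
from collections import defaultdict

def store_by_field(records, base_dir, field_index, filename='testoutput.txt'):
    """Write each record to base_dir/<value of the chosen field>/<filename>."""
    groups = defaultdict(list)
    for record in records:
        groups[record[field_index]].append(record)

    written = []
    for key, group in groups.items():
        out_dir = os.path.join(base_dir, key)
        os.makedirs(out_dir, exist_ok=True)
        path = os.path.join(out_dir, filename)
        with open(path, 'w') as handle:
            for record in group:
                handle.write('\t'.join(record) + '\n')
        written.append(path)
    return sorted(written)

# The first record is from the question; the other two are invented.
records = [('aaa', 'bbb', 'ccc'), ('ddd', 'bbb', 'eee'), ('fff', 'ggg', 'hhh')]
base_dir = tempfile.mkdtemp()  # stands in for /SomeBaseDir
paths = store_by_field(records, base_dir, 1)

for path in paths:
    print(os.path.relpath(path, base_dir))
```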
The SPLIT operator in PigLatin would do the job here.
Once the input data is loaded, it can be split into different output relations as follows:
loadedData = load ' ' using ... as (.. ,somefield, );
SPLIT loadedData INTO
segmentA IF (somefield=='A'),
segmentB IF (somefield=='B'),
OtherSources OTHERWISE;
store segmentA into 'hdfs://<path for data segmentA >' using ....;
store segmentB into 'hdfs://<path for data segmentB >' using ....;
I have a CSV file with 3 columns: tweetid, tweet, and Userid. However, the tweet column itself contains commas.
For example, one row of data:
`396124437168537600`,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.",savava143
I want to extract all 3 fields individually, but REGEX_EXTRACT is giving me an error with this code:
a = LOAD tweets USING PigStorage(',') AS (f1,f2,f3);
b = FILTER a BY REGEX_EXTRACT(f1,'(.*)\\"(.*)',1);
The error is:
error: Filter's condition must evaluate to boolean.
In the use case shared, reading the data using PigStorage(',') loses savava143 (the last field value), because the comma inside the quoted tweet is also treated as a field separator:
A = LOAD '/Users/muralirao/learning/pig/a.csv' USING PigStorage(',') AS (f1,f2,f3);
DUMP A;
Output : A : Observe that the last field value is missing.
(396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.")
For the use case shared, to extract all the values from CSV file with field values having ',' we can use either CSVExcelStorage or CSVLoader.
Approach 1 : Using CSVExcelStorage
Ref : http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html
Input : a.csv
396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.",savava143
Pig Script :
REGISTER piggybank.jar;
A = LOAD 'a.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS (f1,f2,f3);
DUMP A;
Output : A
(396124437168537600,I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.,savava143)
Approach 2 : Using CSVLoader
Ref : http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/CSVLoader.html
Below script makes use of CSVLoader(), DUMP A will result in the same output seen earlier.
A = LOAD 'a.csv' USING org.apache.pig.piggybank.storage.CSVLoader() AS (f1,f2,f3);
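The difference between splitting on every comma and parsing real CSV can be checked quickly in plain Python. This sketch is not Pig code; it only uses the sample row from the question and Python's standard csv module to show why a quote-aware loader is needed.

```python
import csv
import io

line = ('396124437168537600,"I really wish I didn\'t give up everything I did for you, '
        'I\'m so mad at my self for even letting it get as far as it did.",savava143\n')

# Naive split: the comma inside the quoted tweet becomes a field boundary too.
naive = line.rstrip('\n').split(',')

# Quote-aware parsing keeps the tweet together as a single field.
parsed = next(csv.reader(io.StringIO(line)))

print(len(naive))
print(len(parsed))
print(parsed[2])
```

With a three-field schema, the naive split leaves no room for the fourth piece, which is why the user id disappears.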
The error is that you do not want to FILTER based on a regex but to GENERATE new fields based on a regex. A FILTER has to decide whether each line is kept or dropped, hence the boolean requirement.
Therefore, you have to use:
b = FOREACH a GENERATE REGEX_EXTRACT(FIELD, REGEX, INDEX_OF_GROUP_TO_RETURN);
However, as @Murali Rao said, your values are not just comma separated but CSV (think about how you would handle a comma inside a tweet: it is not a field separator, just some content).
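The difference between extracting a value and testing a condition can be sketched in plain Python. This is an illustration only: regex_extract is a hypothetical helper that loosely imitates the idea of returning one capture group (or nothing), and the sample value is the fragment of the tweet that ends up in one field after a naive comma split.

```python
import re

def regex_extract(text, pattern, group_index):
    """Return the requested capture group, or None when the pattern is not found."""
    match = re.search(pattern, text)
    return match.group(group_index) if match else None

value = '"I really wish I didn\'t give up everything I did for you'

# Extraction yields a string, which is what a generated field needs.
extracted = regex_extract(value, r'(.*)"(.*)', 2)

# A filter condition needs a yes/no answer instead.
is_match = re.fullmatch(r'.*".*', value) is not None

print(repr(extracted))
print(is_match)
```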
I have data that has some rows that look like this:
(1655,var0,var1,NaN)
The first column is an ID, the second and third come from the correlation. The fourth column is the correlation value (from using the COR function).
I would like to filter these rows.
From the Apache Pig documentation, I was under the impression that NaN is equivalent to a null. Therefore I added this to my code:
filter_corr = filter correlation by (corr IS NOT NULL);
This obviously did not work since apparently Pig does not treat null and NaN in the same way.
I would like to know what is the correct way to filter NaN since it is not clear from the Pig documentation.
Thanks!
Alternatively, you could declare the column as chararray in your schema and FILTER with not matches 'NaN'.
Or, if you want to replace your NaNs with something else, declare the column as chararray in your schema as before and then:
Data = FOREACH Data GENERATE ..., (correlation matches 'NaN' ? 0.0 : (double) correlation), ...;
I hope this helps, good luck ;)
You could read in the data as one chararray line and then use a UDF to parse the rows. I made a dataset that looks like this:
1665,var0,var1,NaN
1453,var2,var3,5.432
3452,var4,var5,7.654
8765,var6,var7,NaN
Create UDF
#!/usr/bin/env python
# -*- coding: utf-8 -*-
### name of file: udf.py ###
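@outputSchema("id:int, col2:chararray, col3:chararray, corr:float")
def format_input(line):
    # Split the raw line into its comma-separated fields.
    parsed = line.split(',')
    # Replace a trailing 'NaN' with None so that Pig sees a null.
    if parsed[len(parsed) - 1] == 'NaN':
        parsed.pop()
        parsed.append(None)
    return tuple(parsed)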
Then in the pig shell
$ pig -x local
grunt>
/* register udf */
register 'udf.py' using jython as udf;
data = load 'file' as (line:chararray);
A = foreach data generate FLATTEN(udf.format_input(line));
filtered = filter A by corr is not null;
dump filtered;
output
(1453,var2,var3,5.432)
(3452,var4,var5,7.654)
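To try the parsing logic outside Pig, the function can be run in plain Python. The outputSchema decorator is normally supplied by Pig's Jython engine, so the stub below is an assumption needed only for standalone testing. The fields stay as strings here because no schema conversion happens outside Pig.

```python
def outputSchema(schema):
    """Stand-in for the decorator Pig provides; it only records the schema string."""
    def wrap(func):
        func.outputSchema = schema
        return func
    return wrap

@outputSchema("id:int, col2:chararray, col3:chararray, corr:float")
def format_input(line):
    parsed = line.split(',')
    # Replace a trailing 'NaN' with None, the Python counterpart of a null.
    if parsed[-1] == 'NaN':
        parsed[-1] = None
    return tuple(parsed)

lines = [
    '1665,var0,var1,NaN',
    '1453,var2,var3,5.432',
    '3452,var4,var5,7.654',
    '8765,var6,var7,NaN',
]

rows = [format_input(line) for line in lines]
filtered = [row for row in rows if row[3] is not None]

for row in filtered:
    print(row)
```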
I've gone with this solution:
filter_corr = filter data by (corr != 'NaN');
data1 = foreach filter_corr generate ID, (double)corr as double_corr;
I renamed the column and reassigned the data type from chararray to double.
I appreciate the responses, but I cannot use UDFs during prototyping due to a limitation in the UI that I am using (Cloudera).
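The reason a null check does not catch these rows can be illustrated in plain Python: NaN is an ordinary floating-point value, not a missing one, and the same holds for doubles on the JVM that Pig runs on. The sketch below is not Pig code; it mirrors the "compare the string, then cast" approach on the sample rows from this thread, with variable names chosen for illustration.

```python
import math

rows = [
    ('1665', 'var0', 'var1', 'NaN'),
    ('1453', 'var2', 'var3', '5.432'),
    ('3452', 'var4', 'var5', '7.654'),
    ('8765', 'var6', 'var7', 'NaN'),
]

# Compare the raw string first, then cast the survivors to a float.
kept = [(id_, float(corr)) for (id_, _a, _b, corr) in rows if corr != 'NaN']

# 'NaN' parses to a real float value, so an "is not null" style check lets it through.
nan = float('NaN')
nan_is_missing = nan is None
nan_equals_itself = nan == nan

print(kept)
print(nan_is_missing, nan_equals_itself, math.isnan(nan))
```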
I have some data that contains a URL string, all of which have some variety of substring embedded.
My goal is to get a set of results with that substring removed from the string:
e.g.
rawdata: {
id Long,
url String
}
here's some sample rawdata:
1,/213112341_v1.html
2,43524254243_v2.html
5,/000000_v3.html
5,/000000_v4.html
the result I want is:
1,/213112341.html
2,43524254243.html
5,/000000.html
So basically, remove the subversion number (_v1|_v2|_v3|_v4) from the URL and create unique results.
How do I do that in Pig?
Thanks,
Your best bet would be to do something like the following:
FOREACH data GENERATE id, CONCAT(REGEX_EXTRACT(url, '(/?[0-9]*)_.*', 1), '.html');
EDIT:
How about trying the following if the data is more complicated:
FOREACH data GENERATE id, CONCAT(STRSPLIT(url, '_v[0-9]+', 2).$0, '.html');
That should get everything before the version number, with the CONCAT adding the .html back in. If both the part before the version number and the part after it are more complicated, you could do something like:
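parts = FOREACH data GENERATE id, FLATTEN(STRSPLIT(url, '_v[0-9]+', 2)) AS (prefix:chararray, suffix:chararray);
result = FOREACH parts GENERATE id, CONCAT(prefix, suffix);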
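The expected result, including the de-duplication the question asks for, can be checked with a short plain-Python sketch. It is not Pig code: strip_version is a hypothetical helper, and the regex replacement plus the duplicate removal are stand-ins for what a regex-based replace followed by DISTINCT would do in a Pig script.

```python
import re

def strip_version(url):
    """Remove a '_v<digits>' version marker wherever it appears in the URL."""
    return re.sub(r'_v[0-9]+', '', url)

rawdata = [
    (1, '/213112341_v1.html'),
    (2, '43524254243_v2.html'),
    (5, '/000000_v3.html'),
    (5, '/000000_v4.html'),
]

# dict.fromkeys drops duplicate (id, url) pairs while keeping first-seen order.
result = list(dict.fromkeys((id_, strip_version(url)) for id_, url in rawdata))

for id_, url in result:
    print(f'{id_},{url}')
```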