filter result by taking out a matching regex in pig latin - apache-pig

I have some data that contains a url string, which all have some variety substring embeded.
my goal to to get a set of results which have the substring removed from the string:
e.g.
rawdata: {
id Long,
url String
}
here's some sample rawdata:
1,/213112341_v1.html
2,43524254243_v2.html
5,/000000_v3.html
5,/000000_v4.html
the result I want is:
1,/213112341.html
2,43524254243.html
5,/000000.html
so basically remove teh subversion number( _v1|_v2|v3|_v4) from the url and create unique results.
How do I do that in pig?
Thanks,

Your best bet would be to do something like the following:
FOREACH data GENERATE id, CONCAT(REGEX_EXTRACT(url, '(/?[0-9]*)_,',1),'.html');
EDIT:
How about trying the following if the data is more complicated
FOREACH data GENERATE id, CONCAT(STRSPLIT(url, '_v[0-9]',1),'.html')
That should get everything before the version #, with the concat adding the .html back in. If both the before verson number and after verison number sections are more comlicated you could do something like:
FOREACH data GENERATE id, CONCAT(FLATTEN(STRSPLIT(url, '_v[0-9]',2)))

Related

I want to extract the string from the string and use it under a field

I want to extract a string from a string...and use it under a field named source.
I tried writing like this bu no good.
index = cba_nemis Status: J source = *AAP_ENC_UX_B.* |eval plan=upper
(substr(source,57,2)) |regex source = "AAP_ENC_UX_B.\w+\d+rp"|stats
count by plan,source
for example..
source=/p4products/nemis2/filehandlerU/encpr1/log/AAP_ENC_UX_B.az_in_aza_277U_ rp-20190722-054802.log
source=/p4products/nemis2/filehandlerU/encpr2/log/AAP_ENC_UX_B.oh_in_ohf_ed_ph_ld-20190723-034121.log
I want to extract the string \
AAP_ENC_UX_B.az_in_aza_277U_ rp from 1st
and
AAP_ENC_UX_B.oh_in_ohf_ed_ph_ld from 2nd.
and put it under the column source along with the counts..
I want results like...
source counts
AAP_ENC_UX_B.az_in_aza_277U_ rp 1
AAP_ENC_UX_B.oh_in_ohf_ed_ph_ld 1
You can use the [rex][1] command that extracts a new field from an existing field by applying a regular expression.
...search...
| rex field=source ".+\/(?<source_v2>[\.\w\s]+)-.+"
| stats count by plan, source_v2
Be careful, though: I called the new field source_v2, what you were asking would rewrite the existing source field without you explicitly requesting this. Just change source_v2 to source in my code in case this is what you want.
The search takes this new source_v2 field into account. Try and see if this is what you need. You can tweak it easily to get your expected results.

REGEX_EXTRACT error in PIG

I have a CSV file with 3 columns: tweetid , tweet, and Userid. However within the tweet column there are comma separated values.
i.e. of 1 row of data:
`396124437168537600`,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.",savava143
I want to extract all 3 fields individually, but REGEX_EXTRACT is giving me an error with this code:
a = LOAD tweets USING PigStorage(',') AS (f1,f2,f3);
b = FILTER a BY REGEX_EXTRACT(f1,'(.*)\\"(.*)',1);
The error is:
error: Filter's condition must evaluate to boolean.
In the use case shared, reading the data using PigStrorage(',') will result in missing savava143 (last field value)
A = LOAD '/Users/muralirao/learning/pig/a.csv' USING PigStorage(',') AS (f1,f2,f3);
DUMP A;
Output : A : Observe that the last field value is missing.
(396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.")
For the use case shared, to extract all the values from CSV file with field values having ',' we can use either CSVExcelStorage or CSVLoader.
Approach 1 : Using CSVExcelStorage
Ref : http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html
Input : a.csv
396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.",savava143
Pig Script :
REGISTER piggybank.jar;
A = LOAD 'a.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS (f1,f2,f3);
DUMP A;
Output : A
(396124437168537600,I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.,savava143)
Approach 2 : Using CSVLoader
Ref : http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/CSVLoader.html
Below script makes use of CSVLoader(), DUMP A will result in the same output seen earlier.
A = LOAD 'a.csv' USING org.apache.pig.piggybank.storage.CSVLoader() AS (f1,f2,f3);
The error is that you do not want to FILTER based on a regex but GENERATE new fields based on a regex. To filter, you need to know if the line have to be filtered, hence the boolean requirement.
Therefore, you have to use :
b = FOREACH a GENERATE REGEX_EXTRACT(FIELD, REGEX, HOW_MANY_GROUPS_TO_RETURN);
However, as #Murali Rao said, your values are not just coma separated but CSV (think how you will handle a coma in tweet : it is not a field separator, just some content).

How to use Substring in SSIS

I want to export data from SharePoint list to SQL using SSIS.
In SharePoint list, i have a column as multi select, So i am getting below value in my column
1;#control 1;#3;#control 3
I want to use substring in derived column in such a way that i should get below result
1,3
I want only ID from the given column.
I have tried below code
SUBSTRING(ColumnName,1,FINDSTRING(ColumnName,";#",1) - 1)
But it only gives me answer as
1
Can anyone please help me out.?
Because there is an unknown number of controls selected in your SharePoint Multi-Select, a Derived Column transformation is not going to work for you. You'll have to use a script.
One way to parse your string is with regular expressions. You'll have to add an output to the script transformation and assign your parsed string to that output.
Regex controlExpression = new Regex(#"control ([0-9]+)");
MatchCollection controlMatches = controlExpression.Matches(--YOUR INPUT HERE--);
String output = string.Join(",",
(controlMatches.Cast<Match>().Select(n => n.Groups[1].ToString())).ToArray());

How to remove characters from a field in Pig

Data:
someId,+1 5552221234
someId2,+1 3331114321
I want to remove the +1 from the second field below
I first load the data
A= LOAD 'Data' USING PigStroage(,) as (Id:chararray, Phone:chararray)
Now i want to have the following Data
Desired Output:
someId, 5552221234
someId2, 3331114321
How would i go about doing this. I was using the following but it doesn't work:
mss_demographic_data3= FOREACH mss_demographic_data2 GENERATE *, REGEX_EXTRACT_ALL(Phone, '[0-9]{9}$') as newPhone;
Use the substring function. (easiest way)
mss_demographic_data3= FOREACH mss_demographic_data2 GENERATE Id,SUBSTRING(Phone,3,12);
NOTE- You have this (substring function) only if you are using pig 0.8.0 or above. If you are using older version of pig, you may need to write an udf.

SSIS filename - file count

I'm currently creating a flat file export for one of our clients, i've managed to get the file in the format they want, i'm trying to get the easiest way of creating a dynamic file name. I've got the date in as a variable and the path ect but they want a count in the file name. For example
File name 1 : TDY_11-02-2013_{1}_T1.txt. The {} being the count. So next weeks file would be TDY_17-02-2013_{2}_T1.txt
I cant see an easy way of doing this!! any idea's??
EDIT:
on my first answer, I thought you meant count of values returned on a query. My bad!
two ways to achieve this, you could loop into the destination folder, select the last file by date, get its value and increase 1, which sound like a lot of trouble. Why not a simple log table on the DB with last execution date and ID and then you compose your file name base on the last row of this table?
where exactly is your problem?
you can make a dynamic file name using expressions:
the count, you can use a "row count" component inside your data flow to assign the result to a variable and use the variable on your expression:
Use Script task and get the number inside the curly braces of the file name and store it in a variable.
Create a variable(FileNo of type int) which stores the number for the file
Pseudo code
string name = string.Empty;
string loction = #"D:\";
/* Get the path from the connection manager like the code below
instead of hard coding like D: above
string flatFileConn =
(string(Dts.Connections["Yourfile"].AcquireConnection(null) as String);
*/
string pattern = string.Empty;
int number = 0;
string pattern = #"{([0-9])}"; // Not sure about the correct regular expression to retrieve the number inside braces
foreach (string s in Directory.GetFiles(loction,"*.txt"))
{
name = Path.GetFileNameWithoutExtension(s);
Match match = Regex.Match(name, pattern );
if (match.Success)
{
dts.Variables["User::FileNo"].Value = int.Parse(match.Value)+1;
}
}
Now once you get the value use it in your file expression in the connection manager
#[User::FilePath] +#[User::FileName]
+"_{"+ (DT_STR,10,1252) #[User::FileNo] + "}T1.txt"