How to extract particular word from list of string in column - apache-pig

I have the following data in a table:
Archer late
Patrick late
Marie Walter late
Michael-d'souza late
I want to remove the word "late" from each of these values using Pig. Can I use a regex to remove it? Can someone help me sort this out?
Edited:
I tried the command below, but it failed:
EXTRACT(surname,'(\b[Dd]+[Ee]+[Cc]+[Ee]+[Aa]+[Ss]+[Ee]+[Dd]+\b)'))

How about calling REPLACE?
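-- Load each line as a single chararray field.
A = LOAD 'input.txt' AS (a0:chararray);
-- REPLACE takes a Java regex; ' late$' removes the trailing word and the space before it.
B = FOREACH A GENERATE REPLACE(a0, ' late$', '');
DUMP B;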
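Because Pig's REPLACE treats its second argument as a Java regular expression, the pattern can be anchored so that it only removes a trailing "late" (and the whitespace before it) rather than every occurrence of those four letters. Here is a minimal sketch of that idea in Python, not Pig; the helper name `strip_late` and the exact pattern are assumptions for illustration:

```python
import re

# Hypothetical pattern: one or more whitespace characters followed by
# the word "late" at the very end of the value.
TRAILING_LATE = re.compile(r"\s+late$")

def strip_late(value: str) -> str:
    """Remove a trailing 'late' word and the whitespace before it."""
    return TRAILING_LATE.sub("", value)

names = ["Archer late", "Patrick late", "Marie Walter late", "Michael-d'souza late"]
cleaned = [strip_late(n) for n in names]
print(cleaned)
```

An unanchored pattern such as 'late' would also alter a value like "Slater", which is why the sketch ties the match to the end of the string.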

How should I perform data masking with pentaho PDI (spoon)?

I need to mask data in more than 10 tables, each of which has more than 100 columns.
I tried to mask the data with the Pentaho PDI tool, but I couldn't figure out how to set it up.
How should I perform data masking with Pentaho?
I think one way is to use the step named "Replace in string", but I couldn't change any string when I tried it.
My questions are:
Is "Replace in string" the correct way to do data masking?
If so, how should I fill in the values in the respective fields?
I want to replace some characters with *. For example, the value "this is sample value" should become "txxx xx xxxxx xxxxe", or something like this.
Please help.
This is less about Kettle than about regular expressions.
I can confirm that the "Replace in string" step behaves strangely and unpredictably when a regex is used inside it. The official docs do not explain the step in any detail either.
Anyway, you can use the Regex Evaluation step to capture the part you need and replace it inside the original string.
But there is a workaround that makes this easier:
JavaScript step with str.replace
This can be done with a JavaScript step, like:
// value of the input field to mask
var str = data_to_mask;
// remember the first and last characters
// (match() returns an array, or null when nothing matches)
var first = str.match(/^[A-Za-z0-9]/);
var last = str.match(/[A-Za-z0-9]$/);
// replace every letter and digit with "x"
str = str.replace(/[A-Za-z0-9]/g, "x");
// put the first and last characters back
if (first) str = str.replace(/^x/, first[0]);
if (last) str = str.replace(/x$/, last[0]);
(Simar's answer works as well I think and maybe it's a bit more elegant :)
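The masking rule described above (keep the first and last character, replace every other letter or digit with a mask character) can also be sketched outside PDI. The following is an illustration in Python rather than the PDI JavaScript step; the function name `mask_value` and its `mask_char` parameter are hypothetical:

```python
import re

def mask_value(value: str, mask_char: str = "x") -> str:
    """Mask letters and digits, keeping the first and last character as-is."""
    # Values of one or two characters have nothing between first and last.
    if len(value) <= 2:
        return value
    # Only letters and digits in the middle are masked; spaces and
    # punctuation stay visible, as in the question's example.
    middle = re.sub(r"[A-Za-z0-9]", mask_char, value[1:-1])
    return value[0] + middle + value[-1]

print(mask_value("this is sample value"))  # txxx xx xxxxxx xxxxe
```

Slicing off the first and last characters before masking avoids the separate "restore" step, at the cost of treating a leading or trailing non-alphanumeric character as the one to keep.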

how to type convert inside a databag in pig

I have the following schema
x = foreach a generate ids as ids:bag{(mid: long)};
This works fine. But I actually need to do the following:
x = foreach a generate ids as ids:bag{((int)mid)};
This will give an error. And I found
x = foreach a generate ids as ids:bag{(mid:int)};
is not good enough. Can anybody please help me?
Thank you.
There is a bug in pig about casting after a colon:
https://issues.apache.org/jira/browse/PIG-2315
What you need is to issue another FOREACH statement.
As Ruslan mentioned, this is a bug. You can get around it with an "explicit" cast using parentheses:
x = foreach a generate ids as (bag{(mid:int)}) ids;

Pig - How to cast datetime to chararray

I'm using CurrentTime(), which is a datetime data type. However, I need it as a chararray. I have the following:
A = LOAD ...
B = FOREACH A GENERATE CurrentTime() AS todaysDate;
I've tried various approaches, such as the following:
B = FOREACH A GENERATE (chararray)CurrentTime() AS todaysDate;
However, I always get ERROR 1052: Cannot cast datetime to chararray.
Anyone know how I can do this? By the way, I'm very new to pig. Thanks in advance!
I had a similar issue and didn't want to use a custom UDF as described in the other answer. I am pretty new to Pig, but this seems too basic an operation to justify a UDF. This command works great for me:
B = FOREACH A GENERATE ToString(yourdatetimeobject, 'yyyy-MM-dd\'T\'HH:mm:ssz') AS yourfieldname;
You can select the format you want by looking at the SimpleDateFormat javadoc.
You need to create a custom UDF that does the conversion (for example, see the CurrentTime() implementation). Alternatively, you may check out my answer on a similar topic for workarounds.
If you are on AWS, then use their DATE_TIME UDF.

How to reverse values in a string in T-SQL

Using T-SQL, I'm trying to find the easiest way to make:
"abc.def.ghi/jkl" become "abc/def/ghi.jkl"?
Basically switch the . and /
Thank you
One way:
select replace(replace(replace('abc.def.ghi/jkl','/','-'),'.','/'),'-','.')
You need an intermediate step. I chose the - symbol; choose something that won't exist in your string.
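The nested REPLACE calls only work if the placeholder character never occurs in the data. A single-pass character mapping avoids that requirement; SQL Server 2017 and later provide a TRANSLATE function that works this way. As an illustration of the idea in Python rather than T-SQL (the helper name `swap_chars` is hypothetical):

```python
def swap_chars(text: str, a: str, b: str) -> str:
    """Swap every occurrence of character a with b, and b with a, in one pass."""
    # str.maketrans builds a per-character mapping, so no placeholder is needed.
    table = str.maketrans({a: b, b: a})
    return text.translate(table)

print(swap_chars("abc.def.ghi/jkl", ".", "/"))  # abc/def/ghi.jkl
```

Because each character is looked up once against the mapping, a replacement is never re-replaced by a later step.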
SELECT REVERSE(@myvar) AS Reversed,
RIGHT(@myvar, CHARINDEX(' ', REVERSE(@myvar))) AS Lastname;
I took this answer from a blog post that was the first Google result. You will need to modify it for your needs.

What is the correct name for this data format?

I am a perfectionist and need a good name for a function that parses data that has this type of format:
userID:12,year:2010,active:1
Perhaps:
parse_meta_data()
I'm not sure what the correct name for this type of data format is. Please advise! Thanks for your time.
parse_dict or parse_map
Except for the missing braces and the missing quotes around the keys, it looks like either JSON or a Python dict.
parse_tagged_csv()
parse_csv()
parse_structured_csv()
parse_csv_with_attributes()
parse_csvattr()
If it’s a proprietary data format, you can name it whatever you want. But it would be good to use a common term like serialized data or mapping list.
If it's just a list of simple items, each of which has a name and a value, then "key-value pairs" is probably the right term.
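To make the "key-value pairs" reading concrete, here is a minimal sketch of a parser for the sample line in Python. The function name `parse_key_value_pairs` and the separator parameters are assumptions, and values are left as strings because the question does not say how they should be typed:

```python
def parse_key_value_pairs(text: str, pair_sep: str = ",", kv_sep: str = ":") -> dict:
    """Parse 'key:value,key:value' text into a dict of strings."""
    result = {}
    for pair in text.split(pair_sep):
        # Skip empty fragments, e.g. from a trailing separator.
        if not pair.strip():
            continue
        # partition splits on the first kv_sep only, so values may contain it.
        key, _, value = pair.partition(kv_sep)
        result[key.strip()] = value.strip()
    return result

print(parse_key_value_pairs("userID:12,year:2010,active:1"))
```

This simple form assumes that neither keys nor values contain the pair separator; a format that allows that would need quoting or escaping rules.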
I would go with:
parse_named_records()
ParseCommaSeparatedNameValuePairs()
ParseDelimitedNameValuePairs()
ParseCommaSeparatedKeyValuePairs()
ParseDelimitedKeyValuePairs()
From the one line you gave, ParseJson() seems appropriate