How should I perform data masking with Pentaho PDI (Spoon)?

I need to mask data in more than 10 tables, and each table has more than 100 columns.
I tried to mask the data using the Pentaho PDI tool, but I couldn't work out how to write the masking logic with it.
How should I perform data masking with Pentaho?
I think one way is to use the step named "Replace in string", but I couldn't get it to change any string when I tried it.
My questions are:
Is "Replace in string" the correct way to do data masking?
If it is, how should I fill in the values in the respective fields?
I want to replace part of a value with a mask character such as *. For example, the value "this is sample value" should become "txxx xx xxxxxx xxxxe", or something like that.
Please help.

This is less about Kettle than about regular expressions.
I can confirm that the "Replace in string" step behaves strangely and unpredictably when a regex is used inside it. The official docs offer little or no explanation of this step either.
Anyway, you can use the Regex Evaluation step to capture the part you need and replace it inside the original string.
But there is a workaround that makes this easier.

JavaScript-Step with str.replace
This can be done with a JavaScript step (Modified Java Script Value), like:
//variable
var str = data_to_mask;
//first letter
var first = str.match(/^[A-Za-z0-9]/);
//last letter
var last = str.match(/[A-Za-z0-9]$/);
//replace all with "x"
str = str.replace(/[A-Za-z0-9]/gi, "x");
//get the first and the last letter back
str = str.replace(/^[A-Za-z0-9]/, first);
str = str.replace(/[A-Za-z0-9]$/, last);
(Simar's answer works as well I think and maybe it's a bit more elegant :)
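As a variation on the snippet above, the same masking can probably be done in a single pass with a replace callback. This is only a sketch: the function name maskKeepEnds and the maskChar parameter are hypothetical names chosen for illustration, and the ES5-style syntax is used on the assumption that the JavaScript engine behind the PDI step may not support newer syntax.

```javascript
// Hypothetical helper (not part of PDI): masks letters and digits with a
// mask character, leaving the first and last positions of the string as-is.
function maskKeepEnds(value, maskChar) {
  var text = String(value);
  var fill = maskChar || "x"; // default mask character is "x"
  return text.replace(/[A-Za-z0-9]/g, function (ch, offset) {
    // Keep the character if it sits at the very start or very end.
    var isEdge = offset === 0 || offset === text.length - 1;
    return isEdge ? ch : fill;
  });
}
```

For example, maskKeepEnds("this is sample value") gives "txxx xx xxxxxx xxxxe", and passing "*" as the second argument masks with asterisks instead. Spaces and punctuation are left untouched, as in the original snippet.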

Related

How to copy one string's n number of characters to another string in Kotlin?

Let's take a string var str = "Hello Kotlin". I want to copy the first 5 characters of str to another variable, strHello. Is there a function for doing this, or do I have to write a loop and copy the characters one by one?
As Tim commented, there's a substring() method which does exactly this, so you can simply do:
val strHello = str.substring(0, 5)
(The first parameter is the 0-based index of the first character to take; and the second is the index of the character to stop before.)
There are many, many methods available on most of the common types. If you're using an IDE such as IDEA or Eclipse, you should see a list of them pop up after you type "str." (with the trailing dot). (That's one of many good reasons for using an IDE.) Or check the official documentation.
Please use the string.take(n) utility.
More details at
https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.text/take.html
I was using substring in my project, but it gave an exception when the length of the string was smaller than the second index of substring.
val name1 = "This is a very very long name"
// To copy to another string
val name2 = name1.take(5)
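println(name1.substring(0..5)) // inclusive range: prints 6 characters, "This i"
println(name1.substring(0..50)) // Gives EXCEPTION: end index 51 exceeds the string length (29)
println(name1.take(5)) // prints "This "
println(name1.take(50)) // No EXCEPTION: returns the whole string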

PostgreSQL full text search doesn't work in some case (Django)

I notice that in Django, when a sentence contains PLAZA/MASTERPIECE and I search for masterpiece, the sentence is not found. Is this a limitation of PostgreSQL full text search? If so, how can I solve it?
finalquery = SearchQuery("keyword")
vector = SearchVector('thefieldIwanttosearch')
self.search_results = self.search_results.annotate(search=vector).filter(search=finalquery).annotate(rank=SearchRank(vector, finalquery))
Is there any documentation about this? Thanks!
Yes, this is all documented.
When you write filter(search=finalquery) you're not specifying a lookup type.
As a convenience when no lookup type is provided (like in Entry.objects.get(id=14)) the lookup type is assumed to be exact.
So you're filtering on an exact match for "masterpiece". What you probably want is contains or icontains.

SSRS if field value in list

I've looked through a number of tutorials and questions, and haven't found a working solution to my problem.
Suppose my dataset has two columns: sort_order and field_value. sort_order is an integer and field_value is a numeric(10,2).
I want to format some rows as #,#0 and others as #,#0.00.
Normally I would just do
iif( fields!sort_order.value = 1 or fields!sort_order.value = 23 or .....
unfortunately, the list is fairly long.
I'd like to do the equivalent of if fields!sort_order.value in (1,2,21,63,78,...) then...)
As recommended in another post, I tried the following (if sort in list, then just output a 0, else a 1. this is just to test the functionality of the IN operator):
=iif( fields!sort_order.Value IN split("1,2,3,4,5,6,8,10,11,15,16,17,18,19,20,21,26,30,31,33,34,36,37,38,41,42,44,45,46,49,50,52,53,54,57,58,59,62,63,64,67,68,70,71,75,76,77,80,81,82,92,98,99,113,115,116,120,122,123,127,130,134,136,137,143,144,146,147,148,149,154,155,156,157,162,163,164,165,170,171,172,173,183,184,185,186,192,193,194,195,201,202,203,204,210,211,212,213,263",","),0,1)
However, it doesn't look like the SSRS expression editor wants to accept the "IN" operator. Which is strange, because all the examples I've found that solve this problem use the IN operator.
Any advice?
Try using IndexOf function:
=IIF(Array.IndexOf(split("1,2,3,4,...",","),fields!sort_order.Value)>-1,0,1)
Note all values must be inside quotations.
Also consider the recommendation from @Jakub. I recommend this solution if you are feeding your report from a stored procedure that you can't change.
Let me know if this helps.

Storing a Value of a Set analysis expression in a Variable

I am struggling to store the value of a set analysis expression in a variable.
I want to store the value of the expression below in a variable so that I can use it in further calculations.
Min({< Data_Period = {'Weekly'},Formatted_Date = {'>$(=$(vSelectedWeek))'}>} Date,2)
The above expression works fine if I use it in a text box on a sheet tab. However, it does not work if I try to store its value in a variable and use that variable.
Set vW1 = Min({< Data_Period = {'Weekly'},Formatted_Date = {'>$(=$(vSelectedWeek))'}>} Date,2);
Here vSelectedWeek is being calculated as follows:
Set vSelectedWeek = Date(Weekstart(Only(BaseData_Date)),'dd/MM/YYYY');
Please advise whether I am doing anything wrong, or whether there is another way to achieve the same result.
Thanks in advance.
If your var is truly working with that expression then try creating an input box object, define your var there and add the expression in the right column.
That should work.
If you find my answer to be pretty simple or not the way you want it, checking this link might help: https://community.qlik.com/thread/198307

Regex match SQL values string with multiple rows and same number of columns

I tried to match the sql values string (0),(5),(12),... or (0,11),(122,33),(4,51),... or (0,121,12),(31,4,5),(26,227,38),... and so on with the regular expression
\(\s*\d+\s*(\s*,\s*\d+\s*)*\)(\s*,\s*\(\s*\d+\s*(\s*,\s*\d+\s*)*\))*
and it works. But...
How can I ensure that the regex does not match a values string like (0,12),(1,2,3),(56,7) with different number of columns?
Thanks in advance...
As I mentioned in a comment on the question, the best way to check whether the input string is valid (that is, every pair of brackets contains the same count of numbers) is to use a client-side program rather than pure SQL.
Implementation:
// Requires: using System; using System.Collections.Generic; using System.Linq;
List<string> s = new List<string>(){
    "(0),(5),(12)", "(0,11),(122,33),(4,51)",
    "(0,121,12),(31,4,5),(26,227,38)","(0,12),(1,2,3),(56,7)"};
var qry = s.Select(a=>new
    {
        orig = a,
        // one string per bracketed group, e.g. "0,11"
        newst = a.Split(new string[]{"),(", "(", ")"},
            StringSplitOptions.RemoveEmptyEntries)
    })
    .Select(a=>new
    {
        orig = a.orig,
        // valid only when every group holds the same number of values
        isValid = a.newst
            .Select(b=>b.Split(new char[]{','},
                StringSplitOptions.RemoveEmptyEntries).Length)
            .Distinct().Count() == 1
    });
Result:
orig isValid
(0),(5),(12) True
(0,11),(122,33),(4,51) True
(0,121,12),(31,4,5),(26,227,38) True
(0,12),(1,2,3),(56,7) False
Note: The second Select counts the values in each bracketed group returned by Split and checks that all groups have the same count. If more than one distinct count appears, the input string is invalid. (Taking the total number of values modulo the number of groups is not enough: "(1),(1,2,3)" has 4 values in 2 groups and would wrongly pass.)
I strongly believe there's a simpler way to achieve that, but at the moment I don't know how ;)
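One possibly simpler route, sketched here in JavaScript rather than C#, is to work in two passes: read the first group to learn how many columns to expect, then build a regex that demands exactly that many numbers in every group. The function name hasUniformColumns is hypothetical, and the sketch assumes the values are unsigned integers as in the question's examples.

```javascript
// Hypothetical helper: true when every "(...)" group in the values string
// holds the same number of integers as the first group.
function hasUniformColumns(values) {
  var text = String(values).trim();
  // Pass 1: match the first group to find the expected column count.
  var first = text.match(/^\(\s*\d+\s*(?:,\s*\d+\s*)*\)/);
  if (!first) return false;
  var columns = first[0].split(",").length;
  // Pass 2: build a pattern with an exact repetition count per group.
  var group = "\\(\\s*\\d+\\s*(?:,\\s*\\d+\\s*){" + (columns - 1) + "}\\)";
  var whole = new RegExp("^" + group + "(?:\\s*,\\s*" + group + ")*$");
  return whole.test(text);
}
```

With this sketch, "(0,11),(122,33),(4,51)" passes and "(0,12),(1,2,3),(56,7)" is rejected. A single fixed regex still can't do this for an arbitrary column count; the count has to be measured first and baked into the pattern.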
:(
Unless you add some more constraints, I don't think you can solve this problem only with regular expressions.
Regular expressions can't solve every string problem. For example, they cannot check whether the opening and closing brackets in a string (like "((())()(()(())))") are balanced. That's a more complicated issue.
That's what I learnt in class :P If someone knows a way then that'd be sweet!
I'm sorry, I spent a bit of time looking into how we could turn this string into an array and do more work on it in SQL, but the built-in functionality is lacking and the solution would end up being very hacky.
I'd recommend handling this situation differently, as large-scale string computation isn't the best way to go if your database is going to fill up gradually.
A combination of client and serverside validation can be used to help prevent bad data (like the ones with more numbers) from getting into the database.
If you need to keep those numbers then you could rework your schema to include some metadata which you can use in your queries, like how many numbers there are and whether it all matches nicely. This information can be computed inexpensively from your server and provided to the database.
Good luck!