How to replace mail addresses in a table input column with Pentaho

I'm quite new to PDI and am currently facing a challenge where I have to replace mail addresses read from the email column of an incoming table (extracted by the Table Input step in Kettle) with other mail addresses.
e.g. user.test@example.com should become abc[seq. number]@example.com.
The goal is to "anonymize" the incoming addresses for further work with the data.
I currently have no solution for this and am hoping you guys have one. :-)
Thank you!

You can implement a Java class, or you can do the following: after the Table Input step, create a sequence with an Add sequence step. Then, with a Split Fields step, split the mail address using @ as the delimiter; in the step's configuration you create two fields, one that will contain the initial part of the email and the other the domain (gmail.com, for example). Then take the sequence field you created earlier, concatenate it with a constant @ (the split step drops the symbol), and concatenate that with the domain field. In the end you will get 1@gmail.com, 2@hotmail.com, etc. It's only 4 steps. I hope it helps you, greetings.
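If the source database supports window functions, a similar result can also be produced directly in the Table Input query instead of with extra steps. This is not what the answer above describes, just a sketch of the same idea, assuming a PostgreSQL source and a hypothetical customers table with an email column:

SELECT
    'abc' || ROW_NUMBER() OVER (ORDER BY email) || '@'
          || SPLIT_PART(email, '@', 2) AS email  -- becomes abc1@example.com, abc2@example.com, ...
FROM customers;

Doing it with the Add sequence / Split Fields / concatenate chain inside the transformation, as described above, keeps the logic independent of the source database.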

There exists a "Replace in String" step under the "Transform" section exactly for your case.
Nevertheless, I recommend you read some documentation first.

I solved it. I just took the long way, adding constants and sequences and concatenating them in the end.

Error "not mutually convertible in Unicode program" when adding line to table

I'm trying to add data from an internal table to a custom one.
DATA: BEGIN OF TMP_CTRYGRP_T OCCURS 1000,
        CTYGR TYPE /SAPSLL/CTYGR,
        TEXT1 TYPE /SAPSLL/TEXT60,
      END OF TMP_CTRYGRP_T.
SELECT ctygr, text1 FROM /SAPSLL/CTYGPT INTO TABLE @DATA(lt_countryGroupsTable).
LOOP AT lt_countryGroupsTable ASSIGNING FIELD-SYMBOL(<ls_countryGroups>).
  APPEND <ls_countryGroups> TO TMP_CTRYGRP_T.
ENDLOOP.
Then I want to add the lines to a custom table type ZZ_T_TAB.
So I've tried to create a field symbol for this table and to create an internal table from it, but none of the solutions I've tried allowed me to add lines to that custom table (even though the one in the program had the lines).
The problem I mainly encountered was:
are not mutually convertible in a Unicode program.
So my questions are:
Why does that error happen? Googling it didn't give me an understandable answer.
For the moment I'm using an internal table limited to 1000 rows, but I don't really know in advance the number of lines the search could return. Is there any way to improve that?
How can I then add the lines to my ZZ_T_TAB? And afterwards, how could I add other fields in the same table for the rows already existing?
As some of you may have understood, I'm quite a rookie in ABAP.
So if there's any useful link to understand all of that, I would be happy if you could share it with me.
Why don't you directly select into the table?
Don't use OCCURS as it is declared obsolete and already forbidden in classes.
Declare your own structure as a type and declare your custom internal table as TYPE STANDARD TABLE OF struct_type. This way there will be no upper bound.
TYPES:
  BEGIN OF struct_type,
    CTYGR TYPE /SAPSLL/CTYGR,
    TEXT1 TYPE /SAPSLL/TEXT60,
  END OF struct_type.
DATA tmp_ctrygrp_t TYPE STANDARD TABLE OF struct_type WITH EMPTY KEY.
Why does that error happen? Googling it didn't give me an understandable answer.
You cannot use APPEND with non-identical structures. You have to "convert" it before. Look up the command MOVE-CORRESPONDING in the ABAP help (F1 on the command in the editor).
For the moment I'm using an internal table limited to 1000 rows, but I don't really know in advance the number of lines the search could return. Is there any way to improve that?
Do not use the OCCURS extension; it is deprecated, old syntax (as lausek wrote).
How can I then add the lines to my ZZ_T_TAB? And afterwards, how could I add other fields in the same table for the rows already existing?
You can modify a DB table in various ways:
1. Use the UPDATE statement to directly update a field value.
2. Use the MODIFY statement to modify field values from a (for example) pre-selected structure.
Look up the UPDATE and MODIFY commands in the ABAP help; there are really helpful code examples.

Remove all rows from a table where an invalid "URL" has been entered

A somewhat odd Postgresql question for our highly specific use case. We have a table which accepts URLs as a part of a comment input from our users. This is on a highly trafficked site. We had some PHP code that was validating that users only entered correctly-formed URLs, if they included one in their comment (usually comment text does not include any URLs).
However, sadly, our PHP is old on an old server. So at some point the ereg logic we had became dysfunctional. Which means miscreant users have had a field day entering comments with badly formed URLs like the following:
l%20are%20generally%20included%20almost%20anyplace--even%20if%20your%20"yard"%20is%20bound%20to%20an%20outdoor%20patio%20or%20balcony.Adding%20water%20to%20your%20patio%20could%20be%20as%20simple%20as%20aiming%20a%20low%20dish%20of%20water%20designed%20for%20use%20in%20the%20form%20of%20birdbath.Any%20cursory%20container%20around%206%20in%20.wide%20and%20a%20half-inch%20deep%20will%20attempt%20to%20work.Pie%20pans,%20garbage%20can%20lids,%20or%20flo
Note that it's not a URL at all. Hence, our question: is there a PostgreSQL-only way, perhaps through some PL/pgSQL function or some stored function or something, that we can use to delete all these rubbish records from our database? We'd ideally not want to use a PHP program that went through the entire database and checked it against the valid URL pattern.
We'd like to execute this within PG itself. We can take the database offline to perform this task for as long as it takes.
Thank you!
SELECT * FROM table WHERE url_column !~* '(https?|ftp)://(-\.)?([^\s/?\.#-]+\.?)+(/[^\s]*)?'
Try this query, validate the output, and then you could create a DELETE query based on this example.
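Once you have checked the rows that the SELECT returns, the same pattern can be reused in a DELETE (keeping the placeholder table and url_column names from the query above):

DELETE FROM table
WHERE url_column !~* '(https?|ftp)://(-\.)?([^\s/?\.#-]+\.?)+(/[^\s]*)?';
-- note: rows where url_column is an empty string also fail the pattern and would be deleted;
-- add AND url_column <> '' if those rows should be kept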

How to handle two input streams in Pentaho with script steps?

How many different kinds of steps in Pentaho can accept more than one input stream, such as "Merge Join" and "Stream Lookup"?
What are their typical usage scenarios?
Can any script-related steps accept more than one stream as input, like JavaScript or UDJC? E.g. using one stream as the data source and another as a filter condition?
Thank you all.
All the steps under "Joins" and "Lookup". Joins work just like table joins; lookup uses one stream as the source dataset and another as a "translation" dictionary. This is what I know.
Answers to the 3 questions are below:
All the steps available in the "Joins" and "Lookup" sections will accept two streams (I haven't tried with 3 streams). Some filter steps like Java Filter will also accept more than one stream.
The typical usage scenario is to get data from one or more streams and apply your business logic to it. There is no specific example I can explain at the moment.
As far as I know, you cannot use more than one stream in the JavaScript step. You might get an error if, say, you try to stream two columns with different names: Input 1 has column "a" and Input 2 has column "b".
You can avoid this error by making the columns of both input streams have the same name.
Hope this helps :)

Contains() function falters with strings of numbers?

For some background information, I'm creating an application that searches against a couple of indexed tables to retrieve some records. It isn't overly complex to the point of, say, Google, but it's good enough for the purpose it serves, barring this strange issue.
I'm using the Contains() function, and it's going very well, except when the search contains strings of numbers. Now, I'm only passing in a string -- no numerical datatypes are being passed in, only characters. We're searching against a collection of emails, each appended with a custom ID when shot off from a workflow. So while testing, we decided to search via number strings.
In our test, we isolated the number 0042600006, which belongs to one and only one email subject. However, when using our query we are getting results for 0042600001, 0042600002, etc. The query is as follows (with some generic columns standing in):
SELECT description, subject FROM tableA WHERE CONTAINS((subject), '0042600006')
We've tried every possible combination: '0042600006*', '"0042600006"' and '"0042600006*"'.
I think it's just a limitation of the function, but I thought this would probably be the best place for answers. Thanks in advance.
I asked this same question recently. Please see the insightful answer someone left me here.
Essentially what this user says to do is to turn off the noise words (Microsoft has included integers 0-9 as noise in the Full Text Search). Hope you can use this awesome tool with integers as I now am!
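The linked answer isn't reproduced here, but on SQL Server 2008 and later one way to switch that filtering off is to remove the stoplist from the full-text index; a sketch, reusing the tableA name from the question:

ALTER FULLTEXT INDEX ON tableA SET STOPLIST = OFF;
-- by default this triggers a repopulation of the index,
-- so existing rows are re-indexed without the stopword filter

On SQL Server 2005 and earlier, the equivalent setting lives in the noise-word files on the server instead.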
Try adding LANGUAGE 1033 as an additional parameter; that worked in my solution.
SELECT description, subject FROM tableA WHERE CONTAINS((subject), '0042600006', LANGUAGE 1033)
Try using
SELECT description, subject FROM tableA WHERE CONTAINS((subject), '%0042600006%')

Removing privacy data from a database?

Say that I needed to share a database with a partner. Obviously I have customer information in that database. Short of going through and identifying every column that contains private information and writing a custom script to 'scrub' the data, is there any tool or script which can scrub the data but keep the format intact (for example, if a string is 5 characters, it would stay 5 characters, only scrubbed)?
If not, how would you accomplish something like this, preferably in TSQL?
You may consider only sharing VIEWs; create VIEWs to hide the data that you don't want to share.
Example:
CREATE VIEW v_customer
AS
SELECT
    NAME,
    LEFT(CreditCard, 5) + '****' AS CreditCard -- or don't show this column at all
    ....
FROM customer
Firstly I need to state a professional interest: I work for IBM, which has tools that do exactly this.
Step 1. Ensure you identify all the PII (Personally Identifiable Information). When sharing database information it is typical that the obvious column names like "name" are found, but you also need to find the "hidden" data, where either the data is embedded in a standard format (e.g. string-name-string) in a column named something like "reference code", or it is in free-format text fields. As you have seen, this is not going to be an easy job unless you automate it. The tool for this is InfoSphere Discovery.
Step 2. What context does the "scrubbed" data need to be in? Changing name fields to random characters causes problems when testing, as users focus on text errors rather than functional failures, so change names to real but fictitious ones. Credit card information often needs to be "valid": by that I mean it needs to have a valid prefix, say 49XX, but the rest an invalid sequence. Finally, you need to ensure that every instance of the change is propagated through the database to maintain consistency. The tool for this is Optim Test Data Management with the Data Privacy option.
The two tools integrate to give a full data privacy solution.
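If you want to hand-roll something along the lines of step 2 in TSQL rather than use a tool, here is a minimal sketch; the customers table and CreditCard column are just illustrative names:

-- keep a plausible prefix but overwrite the rest, preserving the length
UPDATE customers
SET CreditCard = LEFT(CreditCard, 4) + REPLICATE('9', LEN(CreditCard) - 4)
WHERE LEN(CreditCard) > 4;

This of course does not keep values consistent across tables the way the tools described above do.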
Based on the original question, it seems you need the fields to be the same length, but not in a "valid" format? How about:
UPDATE customers
SET email = REPLICATE('z', LEN(email))
-- additional fields as needed
Copy/paste and rename tables/fields as appropriate. I think you're going to have a hard time finding a tool that's less work, unless your schema is very complicated, or my formatting assumptions are incorrect.
I don't have an MSSQL database in front of me right now, but you can also find all of the string-like columns by something like:
SELECT *
FROM INFORMATION_SCHEMA.COLUMNS
WHERE DATA_TYPE IN ('...', '...')
I don't remember the exact values you need to compare for, but if you run the query and see what's there, they should be pretty self-explanatory.
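For reference, on SQL Server the string-like types reported by INFORMATION_SCHEMA.COLUMNS are char, nchar, varchar, nvarchar, text and ntext, so the query would typically be:

SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE DATA_TYPE IN ('char', 'nchar', 'varchar', 'nvarchar', 'text', 'ntext');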