Fuzzy matching: Informatica vs SQL

We are currently debating whether to implement pairwise matching functions in SQL to perform fuzzy matching on invoice reference numbers, or to go down the route of using Informatica.
Informatica is a great solution (so I've heard), but I'm not familiar with the software.
Does anybody have experience of its fuzzy match capabilities and the advantages it may offer over building the logic in SQL?
Thanks
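For context, a rough sketch of the pure-SQL route might look like the following (assuming SQL Server; the table and column names are illustrative, and DIFFERENCE(), a SOUNDEX-based score from 0 to 4, is only a crude stand-in for a proper string-distance function such as a Levenshtein UDF):
SELECT s.ref_no                        AS source_ref,
       r.ref_no                        AS candidate_ref,
       DIFFERENCE(s.ref_no, r.ref_no)  AS similarity
FROM   invoices_src s
JOIN   invoices_ref r
  ON   DIFFERENCE(s.ref_no, r.ref_no) >= 3   -- keep only close pairs
WHERE  s.ref_no <> r.ref_no;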

The Parser transformation in Informatica can do the job. Reference Data objects can be created in Informatica and used to search your given string. Reference data objects are of the following types: Pattern Sets, Probabilistic Models, Reference Tables, Regex, Token Sets.
Pattern Sets - A pattern set contains the logic to identify data patterns, for example separating out initials from a name.
Probabilistic Models - A probabilistic model identifies tokens by the types of information they contain and by their positions in an input string.
A probabilistic model contains the following columns:
An input column that represents the data on the input port. You populate the column with sample data from the input port. The model uses the sample data as reference data in parsing and labeling operations.
One or more label columns that identify the types of information in each input string.
You add the columns to the model, and you assign labels to the tokens in each string. Use the label columns to indicate the correct position of the tokens in the string.
When you use a probabilistic model in a Parser transformation, the Parser writes each input value to an output port based on the label that matches the value. For example, the Parser writes the string "Franklin Delano Roosevelt" to FIRSTNAME, MIDDLENAME, and LASTNAME output ports.
The Parser transformation can infer a match between the input port data values and the model data values even if the port data is not listed in the model. This means that a probabilistic model does not need to list every token in a data set to correctly label or parse the tokens in the data set.
The transformation uses probabilistic or fuzzy logic to identify tokens that match tokens in the probabilistic model. You update the fuzzy logic rules when you compile the probabilistic model.
Reference Tables - A reference table is a database table used for searching.

Here it seems that your data is unstructured and you want to extract meaningful data from it. The Informatica Data Transformation (DT) tool is good if your data follows some pattern. It is used with a UDT transformation inside Informatica PowerCenter. With DT you can create a parser to parse your data and, using a serializer, write it out in any form you want; later you can do aggregation and other transformations on that data using Informatica PowerCenter's ETL capabilities.
DT is well known for its ability to parse PDFs, forms and invoices. I hope it serves the purpose.

Related

Use TOXI solution in a database with JSON data

We want to design a new database schema for our company's application.
The program is developed in C# with Nancy on the server side and React/Redux/GraphQL on the client side.
Our company often has to implement sudden changes to handle new business data, so we want to build a solid core for the fundamental data that is not subject to change, e.g. Article (Code, Description, Qty, Value, Price, CategoryId).
But often we need to add a particular category to an article, or a special implementation only for a limited period of time. We are thinking of implementing a TOXI-like solution to handle those situations.
In the TOXI pattern implementation we want to add a third table to define each tag's data type and definition.
Here is a simple explanatory image:
In the Metadata table we have two columns with JSON data: DataType and DefinedValue.
DataType defines how the program (or possibly a function in the db) must cast the varchar data in articoli_meta.value.
DefinedValue, when not null, defines whether the type must have a set of predefined values, e.g. High, Medium, Low, etc.
Those two columns are varchar and contain JSON following a standard defined by our programming team (possibly with an SQL function to validate those two columns).
I understand that this kind of approach is not a 'pure' relational approach, but we must consider that we often pass data to the client in JSON format, so the DefinedValue column can easily be queried as a string and passed to the interface as data for a dropdown list.
Any ideas, experience or design tips are appreciated.
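A rough sketch of the three-table layout described above might look like this (assuming SQL Server 2016+ for the ISJSON checks; all table and column names are illustrative and only loosely follow the names mentioned in the question):
CREATE TABLE Articoli (
    Id          INT IDENTITY PRIMARY KEY,
    Code        NVARCHAR(50)  NOT NULL,
    Description NVARCHAR(255) NULL,
    Qty         DECIMAL(18,3) NULL,
    Value       DECIMAL(18,2) NULL,
    Price       DECIMAL(18,2) NULL,
    CategoryId  INT           NULL
);

-- The "third table": one row per tag, with JSON describing its type and allowed values.
CREATE TABLE Metadata (
    Id           INT IDENTITY PRIMARY KEY,
    TagName      NVARCHAR(100) NOT NULL,
    DataType     NVARCHAR(MAX) NOT NULL,  -- JSON: how to cast Articoli_Meta.Value
    DefinedValue NVARCHAR(MAX) NULL,      -- JSON: optional list of allowed values (High, Medium, Low, ...)
    CONSTRAINT CK_Metadata_DataType     CHECK (ISJSON(DataType) = 1),
    CONSTRAINT CK_Metadata_DefinedValue CHECK (DefinedValue IS NULL OR ISJSON(DefinedValue) = 1)
);

-- Tag values attached to an article.
CREATE TABLE Articoli_Meta (
    ArticleId  INT NOT NULL REFERENCES Articoli(Id),
    MetadataId INT NOT NULL REFERENCES Metadata(Id),
    Value      NVARCHAR(MAX) NULL,
    PRIMARY KEY (ArticleId, MetadataId)
);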

How to validate data types in Pig?

I have been trying to validate the data type of the data that I got from a flat file through Pig.
A simple CAT can do the trick, but the flat files are huge and they sometimes contain special characters.
I need to filter out the records containing special characters from the file, and also the records where the data type is not int.
Is there any way to do this in pig?
I am trying to find a substitute for the getType().getName() kind of usage in Java here.
Enforcing a schema and using DESCRIBE is what we do while loading the data, and then we remove the mismatches, but is there any way to do it without enforcing the schema?
Any suggestions will be helpful.
Load the data into a line:chararray and use a regular expression to filter out the records that contain characters other than numbers:
A = LOAD 'data.txt' AS (line:chararray);
B = FILTER A BY (line matches '\\d+$'); -- Change according to your needs.
DUMP B;

Data sanitization/clean-up

Just wondering…
We have a table where the data in certain fields is alphanumeric, comprising a 1-2 letter prefix followed by a 1-2 digit number, e.g. x2, x53, yz1, yz95.
The letters added before the number are determined by the field, so that certain fields always have the same single letter before the number while others always have the same two letters.
For each field, the actual letters and the number of letters added (1 or 2) are always the same; thus, we can always tell which letters appear before the numbers just from the field name.
For the purposes of all downstream data analysis, it is only ever the numeric value from the string which is important.
SQL queries are constructed dynamically behind a user form, where the final SQL can take many forms depending on which selections and switches the user has chosen. Because of this, the VBA generating the SQL constructs is fairly involved, containing many conditional/variable pathways to the final SQL construct.
Given that, it would make the VBA and SQL much easier to write, read and debug, and perhaps increase SQL execution speed, if we were only dealing with a numeric data type; e.g. I wouldn't need to accommodate the many apostrophes within the numerous lines of "strSQL = strSQL & …".
Given that the data being analysed is a copy that is imported via regular .csv extracts from a live source, would it be acceptable to pre-sanitize/clean up these fields around the import stage by converting the data to numeric values and the fields to numeric data types?
- perhaps either by modifying the SQL used to generate the extract, or by modifying the schema/VBA process used to import the extract into the analysis table, e.g. using something like a Replace function such as Replace(OriginalField, "yz", "") to strip out the yz characters.
Yes, link the csv "as is", and for each linked table create a straight select query that does the sanitization (start Mid at 2 for fields with a one-letter prefix, at 3 for fields with a two-letter prefix), like:
Select
Val(Mid([Field1], 2)) As NumField1,
Val(Mid([Field2], 3)) As NumField2,
etc.
Val(Mid([FieldN], 2)) As NumFieldN
From
YourLinkedCsvTable
then use this query throughout your application when you need the data.
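Once that select query is saved (say as qryCsvNumeric, an assumed name), the dynamically built SQL only ever deals with numbers, so there are no apostrophes to escape, e.g.:
SELECT NumField1, NumField2
FROM qryCsvNumeric
WHERE NumField1 > 50 AND NumField2 = 7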

Regular expression search of Oracle BLOB field

I have a table with a BLOB field containing SOAP-serialised .NET objects (XML).
I want to search for records representing objects with specific values against known properties. I have a working .NET client that pulls back the objects and deserialises them one at a time to check the properties; this is user-friendly but creates a huge amount of network traffic and is very slow.
Now I would like to implement a server-side search by sending a regular expression to a stored procedure that will search the text inside the BLOB. Is this possible?
I have tried casting the column to varchar2 using utl_raw.cast_to_varchar2, but the length of the text is too long (in some cases 100KB).
dbms_lob.instr allows me to search the text field for a substring, but with such a complex XML structure I would like the additional flexibility offered by regular expressions.
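One possible server-side approach, sketched below rather than tested against your schema, is to convert the BLOB to a temporary CLOB and apply REGEXP_LIKE to it, since Oracle's regular-expression functions accept CLOB input. The function name blob_to_clob and the table/column names soap_objects and payload are illustrative, and the conversion assumes the XML was written in the database character set:
CREATE OR REPLACE FUNCTION blob_to_clob(p_blob IN BLOB) RETURN CLOB IS
  v_clob        CLOB;
  v_dest_offset INTEGER := 1;
  v_src_offset  INTEGER := 1;
  v_lang_ctx    INTEGER := DBMS_LOB.DEFAULT_LANG_CTX;
  v_warning     INTEGER;
BEGIN
  DBMS_LOB.CREATETEMPORARY(v_clob, TRUE);
  DBMS_LOB.CONVERTTOCLOB(v_clob, p_blob, DBMS_LOB.LOBMAXSIZE,
                         v_dest_offset, v_src_offset,
                         DBMS_LOB.DEFAULT_CSID, v_lang_ctx, v_warning);
  RETURN v_clob;
END;
/

-- The search can then run entirely on the server:
SELECT id
FROM   soap_objects
WHERE  REGEXP_LIKE(blob_to_clob(payload), '<CustomerId>12345</CustomerId>');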

How can I get SSIS Lookup transformation to ignore alphabetical case?

Hopefully this is easy to explain, but I have a lookup transformation in one of my SSIS packages. I am using it to look up the id for an employee record in a dimension table. However, my problem is that some of the source data has employee names in all capitals (ex: CHERRERA) while the comparison data I'm using is all lower case (ex: cherrera).
The lookup is failing for the records that are not 100% case-identical (ex: cherrera vs cherrera works fine; cherrera vs CHERRERA fails). Is there a way to make the lookup transformation ignore case on a string/varchar data type?
There isn't a way, I believe, to make the transformation case-insensitive; however, you could modify the SQL statement for your transformation to ensure that the source data matches the case of your comparison data by using the LOWER() string function.
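For example, the reference query behind the Lookup could normalise the key as it is read in (DimEmployee and EmployeeLogin are illustrative names); the incoming source column then needs the same treatment, either with LOWER() in the source query or in a Derived Column:
SELECT EmployeeKey,
       LOWER(EmployeeLogin) AS EmployeeLogin
FROM   dbo.DimEmployee;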
Set the CacheType property of the lookup transformation to Partial or None.
The lookup comparisons will now be done by SQL Server and not by the SSIS lookup component, and will no longer be case sensitive.
You have to change the source data as well as the lookup data; both should be in the same case.
Based on this Microsoft Article:
The lookups performed by the Lookup transformation are case sensitive. To avoid lookup failures that are caused by case differences in data, first use the Character Map transformation to convert the data to uppercase or lowercase. Then, include the UPPER or LOWER functions in the SQL statement that generates the reference table.
To read more about the Character Map transformation, follow this link:
Character Map Transformation