Pandas - match a column of string with a column of regular expressions


The problem: I have two dataframes - one with a bunch of product titles that are not normalized, and one with a bunch of regular expressions that are tied to normalized product titles. I need to match the non-normalized titles to some regular expressions which are tied to normalized titles.
It should make more sense with the sample data below.
First dataframe (raw_titles):
| | Title | Release Date |
|---|------------------------------------------------|--------------|
| 1 | Apple iPad Air (3rd generation) - 64GB | 01/01/20 |
| 2 | Philips Hue White Ambiance A19 LED Smart Bulbs | 08/12/20 |
| 3 | Powerbeats Pro Totally Wireless Earphones | 06/20/19 |
Second dataframe (regex_titles):
| | Regex | Manufacturer | Model |
|---|-------------------------------------------------------|--------------|-------------------------|
| 1 | /ipad\s?air(?=.*(\b3\b|3rd\s?gen|2019))|\bair\s?3\b/i | Apple | iPad Air (2019) |
| 2 | /hue(?=.*cher)/i | Philips | Hue White Ambiance Cher |
| 3 | /powerbeats\s?pro/i | Beats | Powerbeats Pro |
The idea is to take each title in raw_titles and run it through all the values in regex_titles to see if there's a match. Once that's done, raw_titles should have two additional columns, Manufacturer and Model, taken from the regex_titles row that matched (if there was no match, they would just stay empty).
Then the final table would look like this:
| | Title | Release Date | Manufacturer | Model |
|---|------------------------------------------------|--------------|--------------|-----------------|
| 1 | Apple iPad Air (3rd generation) - 64GB | 01/01/20 | Apple | iPad Air (2019) |
| 2 | Philips Hue White Ambiance A19 LED Smart Bulbs | 08/12/20 | | |
| 3 | Powerbeats Pro Totally Wireless Earphones | 06/20/19 | Beats | Powerbeats Pro |

There are many ways to do this, but the simplest is to test each regex against each title and return the first match you find. First, we'll define a function that returns two values: the manufacturer and model of the first regex row that matches, or two Nones if nothing matches:
    import re

    def find_match(title_row):
        # Try every regex in order and stop at the first one that matches the title.
        for _, regex_row in regex_titles.iterrows():
            if re.search(regex_row['Regex'], title_row['Title']):
                return [regex_row['Manufacturer'], regex_row['Model']]
        return [None, None]
Then we'll apply the function to the titles dataframe and save the output to two new columns, Manufacturer and Model:
    # result_type='expand' turns each returned list into two columns,
    # regardless of how many columns raw_titles already has.
    raw_titles[['Manufacturer', 'Model']] = raw_titles.apply(find_match, axis=1, result_type='expand')
   Title                                           Release Date  Manufacturer  Model
0  Apple iPad Air (3rd generation) - 64GB          01/01/20      Apple         iPad Air (2019)
1  Philips Hue White Ambiance A19 LED Smart Bulbs  08/12/20      None          None
2  Powerbeats Pro Totally Wireless Earphones       06/20/19      Beats         Powerbeats Pro
One complication is that you'll have to translate your Perl regexes into Python's regex format:
    Perl: /powerbeats\s?pro/i -> Python: (?i)powerbeats\s?pro
They're mostly the same, with a few small differences. The documentation for Python's re module is the reference.
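The translation can also be automated. What follows is only a minimal sketch, assuming every pattern is written as /pattern/flags and uses only the i, m, s and x flags. The helper name perl_to_python is made up for illustration and is not part of pandas or re:

```python
import re

# Perl-style single-letter flags mapped to Python's re flags (common ones only).
PERL_FLAGS = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL, "x": re.VERBOSE}

def perl_to_python(regex_literal):
    """Compile a '/pattern/flags' literal into a Python regex object."""
    # The greedy (.*) runs up to the last slash, so slashes inside the pattern survive.
    literal = re.fullmatch(r"/(.*)/([a-z]*)", regex_literal, flags=re.DOTALL)
    if literal is None:
        raise ValueError(f"not a /pattern/flags literal: {regex_literal!r}")
    pattern, flag_chars = literal.groups()
    flags = 0
    for char in flag_chars:
        if char not in PERL_FLAGS:
            raise ValueError(f"unsupported flag: {char!r}")
        flags |= PERL_FLAGS[char]
    return re.compile(pattern, flags)

# The three patterns from the regex_titles sample data.
ipad = perl_to_python(r"/ipad\s?air(?=.*(\b3\b|3rd\s?gen|2019))|\bair\s?3\b/i")
hue = perl_to_python(r"/hue(?=.*cher)/i")
powerbeats = perl_to_python(r"/powerbeats\s?pro/i")
```

The compiled patterns could then replace the raw strings before matching, for example with regex_titles['Regex'].map(perl_to_python), since re.search accepts compiled patterns as well as strings.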

Related

Fuzzy match a substring within a larger string in Postgres

Is it possible to fuzzy match a substring within a larger string in Postgres?
Example:
For a search of colour (ou), return all records where the string includes color, colors or colour.
select *
from things
where fuzzy(color) in description;
id | description
----------------
1 | A red coloured car
2 | The garden
3 | Painting colors
=> return records 1 and 3
I was wondering if it's possible to combine both fuzzystrmatch and tsvector so that the fuzzy matching could be applied to each vectorized term?
Or if there is another approach?

You can do it of course, but I doubt it will be very useful:
select *,levenshtein(lexeme,'color') from things, unnest(to_tsvector('english',description))
order by levenshtein;
 id |    description     | lexeme | positions | weights | levenshtein
----+--------------------+--------+-----------+---------+-------------
  3 | Painting colors    | color  | {2}       | {D}     |           0
  1 | A red coloured car | colour | {3}       | {D}     |           1
  1 | A red coloured car | car    | {4}       | {D}     |           3
  1 | A red coloured car | red    | {2}       | {D}     |           5
  3 | Painting colors    | paint  | {1}       | {D}     |           5
  2 | The garden         | garden | {2}       | {D}     |           6
Presumably you would want to embellish the query to apply some cutoff, probably where the cutoff depends on the lengths, and return only the best result for each description assuming it met that cutoff. Doing that should be just routine SQL manipulations.
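To make the cutoff idea concrete, here is a rough sketch in Python rather than SQL. It rests on several assumptions: words are plain lowercase tokens instead of the stemmed lexemes that to_tsvector produces, the cutoff is the edit distance divided by the longer word's length, and the 0.4 threshold is an arbitrary choice. The function names are made up for illustration:

```python
import re

def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]

def best_matches(rows, term, max_ratio=0.4):
    """Keep the closest word per description if it passes a length-dependent cutoff."""
    def ratio(word):
        return levenshtein(word, term) / max(len(word), len(term))

    results = []
    for row_id, description in rows:
        words = re.findall(r"[a-z]+", description.lower())
        if not words:
            continue
        word = min(words, key=ratio)  # best candidate for this description
        if ratio(word) <= max_ratio:
            results.append((row_id, word, levenshtein(word, term)))
    return sorted(results, key=lambda result: result[2])

things = [(1, "A red coloured car"), (2, "The garden"), (3, "Painting colors")]
matches = best_matches(things, "color")
```

With the sample rows this keeps records 3 and 1 and drops "The garden". A different threshold, or stemming the words first, would change which rows survive.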
Perhaps better would be the word similarity operators recently added to pg_trgm.
select *, description <->> 'color' as distance from things order by description <->> 'color';
 id |    description     | distance
----+--------------------+----------
  3 | Painting colors    | 0.166667
  1 | A red coloured car | 0.333333
  2 | The garden         |        1
Another option would be to find a stemmer or thesaurus which standardizes British/American spellings (I am not aware of one readily available), and then not use fuzzy matching at all. I think this would be best, if you can do it.

Julia: Drawing of a table besides a plot

I have the following requirement and am wondering whether I can solve it with Julia: draw an SVG plot with up to 4 curves on it. Each curve has to be constructed from 6 data points.
The plot title and the legend should be positioned in the top right corner, outside of the plot area.
So far so easy, I'm pretty certain it's solvable with one of the great Julia metapackages. But this one drives me crazy:
In the bottom right corner outside of the plot area a table should be drawn. It should contain the values of 6 data points of each curve.
Example:
| pressure | 1 | 2 | 3 | 4 | 5 | 6 |
|----------------|-----|-----|-----|-----|-----|------|
| colored line 1 | 3.8 | 3.9 | 4.1 | 5.0 | 9.1 | 10.0 |
| colored line 2 | 4.0 | 4.1 | 5.0 | 7.1 | 8.0 | 11.0 |
| gal/min | | | | | | |
The plot and the data table should both be placed inside the SVG.
After studying the documentation for hours I'm still clueless how to draw this table. Is this even possible with a plotting library?

Query to compare values across different tables?

I have a pair of models in my Rails app that I'm having trouble bridging.
These are the tables I'm working with:
states
+----+--------+------------+
| id | fips | name |
+----+--------+------------+
| 1 | 06 | California |
| 2 | 36 | New York |
| 3 | 48 | Texas |
| 4 | 12 | Florida |
| 5 | 17 | Illinois |
| … | … | … |
+----+--------+------------+
places
+----+--------+
| id | place |
+----+--------+
| 1 | Fl |
| 2 | Calif. |
| 3 | Texas |
| … | … |
+----+--------+
Not all places are represented in the states model, but I'm trying to perform a query where I can compare a place's place value against all state names, find the closest match, and return the corresponding fips.
So if my input is Calif., I want my output to be 06
I'm still very new to writing SQL queries, so if there's a way to do this using Ruby within my Rails (4.1.5) app, that would be ideal.
My other plan of attack was to add a fips column to the "places" table, and write something that would run the above comparison and then populate fips so my app doesn't have to run this query every time the page loads. But I'm very much a beginner, so that sounds... ambitious.

This is not an easy query in SQL. Your best bet is one of the fuzzy string matching routines, which PostgreSQL documents under its fuzzystrmatch module.
For instance, soundex() or levenshtein() may be sufficient for what you want. Here is an example that returns the closest state for each place:
    select distinct on (p.place) p.place, s.name, s.fips, levenshtein(p.place, s.name) as dist
    from places p cross join
         states s
    order by p.place, dist asc;
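The same "compare against every state and keep the closest" logic can be sketched outside the database. Note that this swaps in a different technique: it uses the similarity ratio from Python's standard difflib module rather than levenshtein(), and it compares lowercase strings. The in-memory states list and the closest_fips name are assumptions made for illustration only:

```python
from difflib import SequenceMatcher

# Hypothetical in-memory copy of the states table: (fips, name).
states = [
    ("06", "California"),
    ("36", "New York"),
    ("48", "Texas"),
    ("12", "Florida"),
    ("17", "Illinois"),
]

def closest_fips(place, states):
    """Return the fips code of the state whose name is most similar to place."""
    def similarity(name):
        # ratio() is 1.0 for identical strings and 0.0 when nothing matches.
        return SequenceMatcher(None, place.lower(), name.lower()).ratio()

    fips, _name = max(states, key=lambda state: similarity(state[1]))
    return fips
```

Because there is no minimum similarity here, a place that is not a state at all still gets some fips code back; a threshold would be needed to leave those unmatched.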

Database functional dependency for Nullable Columns

I have 4 columns in my non-decomposed, non-normalized Job Application table which are all nullable. For example, my table is:
Name  | SSN | Education   | City | Job Applied   | Post       | Job Obtained | Post Obtained
John. | 123 | High School | LA   | USPS          | MailMan    | USPS         | MailMan
John. | 123 | High School | LA   | Dept. of Agri | Assistant  | *null*       | *null*
Sam.  | 123 | BS          | NY   | Intel         | QA Analyst | Intel        | QA Analyst
The first 4 Columns are non-nullable so I can easily determine Functional Dependencies between them.
The last 4 columns may or may not have values, depending on whether a person has applied for a job and whether he/she has got one.
My question is: can I have functional dependencies on nullable columns, with them being on either the LHS or the RHS?

The answer should be yes, please see:
http://en.wikipedia.org/wiki/Functional_dependency

design for vehicle identification number (VIN)

I've designed a few Vehicle Identification Number (VIN) decoders for different OEMs. The thing about VIN numbers...despite being somewhat standardized, each OEM can overload the character position codes and treat them differently, add "extra" metadata (i.e. asterisks pointing to more data outside the VIN number), etc., etc. Despite all that, I've been able to build several different OEM VIN decoders, and now I'm trying to build a GM VIN decoder, and it is giving me a headache.
The gist of the problem is that GM treats the vehicle attributes section (position 4,5,6,7) differently depending on whether it is a truck or a car. Here is the breakdown:
GM Passenger Car VIN breakdown
GM Truck VIN breakdown
Normally what I do is design my own crude ETL process to import the data into an RDBMS - each table roughly correlates with the major VIN breakdown. For example, there will be a WMI table, EngineType table, ModelYear table, AssemblyPlant table, etc. Then I construct a View that joins on some contextual data that may or may not be gleaned directly from the character codes in the VIN number itself (e.g. some vehicle types only have certain vehicle engines).
To look up a VIN is simply a matter of querying the VIEW with each major character code position breakdown of the VIN string. For example, an example VIN of 1FAFP53UX4A162757 breaks down like this in a different OEM's VIN structure:
| WMI | Restraint | LineSeriesBody | Engine | CheckDigit | Year | Plant | Seq   |
| 123 | 4         | 567            | 8      | 9          | 10   | 11    | 12-17 |
|-----|-----------|----------------|--------|------------|------|-------|-------|
| 1FA | F         | P53            | U      | X          | 4    | A     | ...   |
GM has thrown a wrench into this...depending on whether it is a car or truck, the character code positions mean different things.
Example of what I mean - each ASCII table below correlates somewhat to a SQL table. etc.. means there is a whole lot of other columnar data
Passenger Car
Here's an example of position 4,5 (corresponds to vehicle line/series). These really go together, the VIN source data doesn't really differentiate between position 4 and 5 despite the breakdown illustrated above.
| Code (45)| Line | Series | etc..
--------------------------------------
| GA | Buick | Lacrosse | etc..
..and position 6 corresponds to body style
| Code (6) | Style | etc..
--------------------------------------
| 1 | Coupe, 2-Door | etc..
Trucks
..but for trucks, the structure is completely different. Position 4 stands on its own as the Gross Vehicle Weight Rating (GVWR).
| Code (4) | GVWR | etc..
-------------------------------
| L | 6000 lbs | etc..
..and positions 5,6 (Chassis/Series) now mean something similar to position 4,5 of passenger car:
| Code (56) | Line | Series | etc..
---------------------------------------
| RV | Buick | Enclave | etc..
I'm looking for a crafty way to resolve this in the relational design. I would like to return a common structure when a VIN is decoded -- if possible (i.e. not returning a different structure for cars vs. trucks)

Based on your answer to my comment regarding if you can identify the type of vehicle by using other values, a possible approach could be to have a master table with the common fields and 2 detail tables, each one with the appropriate fields for either cars or trucks.
Approximately something like the following (here I am guessing WMI is the PK):
Master table
| WMI | Restraint | Engine | CheckDigit | Year | Plant | Seq |
| 123 | 4 | 8 | 9 | 10 | 11 | 12-17 |
Car detail table
| WMI | Veh Line | Series | Body Type |
| 123 | 2 | 3 | 4 |
Truck detail table
| WMI | GVWR | Chassis | Body Type |
| 123 | 7    | 8       | 9         |
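Having this, you could use a single select to retrieve the needed data, like the following:
    Select *
    From
    (
        -- Cars: truck-only columns are returned as Null
        Select M.*,
               C.Veh_Line,
               C.Series,
               C.Body_Type As Car_Body_Type,
               Null As GVWR,
               Null As Chassis,
               Null As Truck_Body_Type
        From Master_Table M
        Join Car_Table C
            on M.WMI = C.WMI
        Union All
        -- Trucks: car-only columns are returned as Null
        Select M.*,
               Null As Veh_Line,
               Null As Series,
               Null As Car_Body_Type,
               T.GVWR,
               T.Chassis,
               T.Body_Type As Truck_Body_Type
        From Master_Table M
        Join Truck_Table T
            on M.WMI = T.WMI
    ) V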
As for DML, you would only need to check, before any insert or update statement, whether you have a car or a truck model.
Of course you would need to make sure that only one detail exists for each master row, either on the car detail table or on the truck detail table.
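The "one common structure, with the fields that do not apply left empty" idea can also be sketched in application code. This is only an illustration under stated assumptions: the caller already knows whether the VIN belongs to a truck (is_truck), only positions 4 to 6 as described in the question are split, and the class, function and sample VIN strings are made up (the sample VINs are padded placeholders, not real ones):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VehicleAttributes:
    """Common result shape for cars and trucks; unused fields stay None."""
    line_series_code: str
    body_style_code: Optional[str]  # cars only (position 6)
    gvwr_code: Optional[str]        # trucks only (position 4)

def split_attributes(vin: str, is_truck: bool) -> VehicleAttributes:
    """Split VIN positions 4-6 according to the car or truck layout."""
    if len(vin) != 17:
        raise ValueError("a VIN has 17 characters")
    if is_truck:
        # Trucks: position 4 is GVWR, positions 5-6 are chassis/series.
        return VehicleAttributes(line_series_code=vin[4:6],
                                 body_style_code=None,
                                 gvwr_code=vin[3])
    # Cars: positions 4-5 are line/series, position 6 is body style.
    return VehicleAttributes(line_series_code=vin[3:5],
                             body_style_code=vin[5],
                             gvwr_code=None)

# Placeholder VINs built from the codes in the question's sample tables.
car = split_attributes("1G4GA1" + "X" * 11, is_truck=False)
truck = split_attributes("1GCLRV" + "X" * 11, is_truck=True)
```

The codes returned this way would still have to be looked up in the corresponding car or truck detail tables.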
HTH

Why not define both of these rules for the decoding? Only one of them will resolve to a valid result.
