Pentaho Data Integration (Spoon) Value Mapper Wildcard

Is there a wildcard character for the Value Mapper transformation in Pentaho Spoon? I've done some digging and have only found wildcard solutions for uploading files and documents. I need to map any and all values that contain a specific word, yet I have no way of identifying every possible variation of the phrase containing that word.
Example: Map website values to a category.
Value -> Mapped Category
facebook.com -> Facebook
m.facebook.com -> Facebook
google.com -> Google
google.ca -> Google
I'd prefer to use a wildcard character (let's call it % for example) so that one mapping captures all cases for a given category (e.g. %facebook% -> Facebook) in my Value Mapper. Another benefit is that the wildcard would correctly map any future site traffic value that comes along. (e.g. A hypothetical l.facebook.com would be correctly mapped if it ever entered my data)
I've tried various characters as wildcards (+ \ * %) and none have worked.
Please and thank you!

You can use the Replace in String step with regular expressions to do this.
If you still need the original field, create a copy first using the Calculator step. Then you can put a number of mappings into the Replace in String step. They run in sequence, and if a regex matches, the contents of the field are replaced with your chosen mapping.
The performance may not be great, but it gives you the full flexibility of regexes. Do keep in mind that this approach gives you the first match; see the sketch below for what can go wrong.
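If it helps to see that first-match behaviour concretely, here is a minimal sketch of the same sequential regex mapping in plain Java (not Kettle code; the patterns and categories are just the ones from the question):

import java.util.LinkedHashMap;
import java.util.Map;

public class CategoryMapper {
    // Ordered regex -> category mappings, checked top to bottom,
    // mirroring a chain of Replace in String steps.
    private static final Map<String, String> MAPPINGS = new LinkedHashMap<>();
    static {
        MAPPINGS.put(".*facebook.*", "Facebook");
        MAPPINGS.put(".*google.*", "Google");
    }

    static String map(String value) {
        for (Map.Entry<String, String> e : MAPPINGS.entrySet()) {
            if (value.matches(e.getKey())) {
                return e.getValue(); // first match wins
            }
        }
        return value; // no mapping: keep the original value
    }

    public static void main(String[] args) {
        System.out.println(map("m.facebook.com"));      // Facebook
        System.out.println(map("google.ca"));           // Google
        // The pitfall: a value matching several patterns takes the first one.
        System.out.println(map("google.facebook.com")); // Facebook, not Google
    }
}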

Related

How to treat the whole sentence as one token in Azure Search

We are trying to perform exact matching that ignores case using Azure Search.
For example, we have a field called Description, which can be a short or a long sentence (for example: Welcome to Azure Search). We want to treat the whole sentence as one token, so that when a user searches "Welcome to" it won't return the result; one would have to search "Welcome to Azure Search" for an exact match. Another requirement is case-insensitive search, so that searching "welcome TO Azure SEARCH" will return the result.
I have used the Keyword Analyzer to treat the whole field as a single token, but this prevents case-insensitive search from working.
I have also tried defining a custom analyzer with the keyword_v2 tokenizer and the lowercase token filter. It looks like this would solve my problem; however, there is a maximum token length of 300 characters, and in some cases the Description field is a sentence longer than that.
I also thought about duplicating an index field in lowercase and using the OData syntax $filter=Description eq 'welcome to azure search'. For example, there would be two fields, "Description" and "DescriptionLowerCase": search on "DescriptionLowerCase", and when returning the result, return "Description". But this would double the size of the index storage.
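For what it's worth, that duplicated-field variant is straightforward to drive from client code. A minimal Java sketch (the service and index names, api-version, and the DescriptionLowerCase field are assumptions, not an official recipe):

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class ExactMatchFilter {
    // Builds a docs URL that filters on a pre-lowercased copy of the field
    // and returns the original-cased Description.
    static String buildSearchUrl(String userInput) {
        String filter = "DescriptionLowerCase eq '"
                + userInput.toLowerCase().replace("'", "''") // escape single quotes for OData
                + "'";
        return "https://myservice.search.windows.net/indexes/myindex/docs"
                + "?api-version=2017-11-11"
                + "&$select=Description"
                + "&$filter=" + URLEncoder.encode(filter, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Matches a document whose DescriptionLowerCase is exactly
        // "welcome to azure search", regardless of the input's casing.
        System.out.println(buildSearchUrl("welcome TO Azure SEARCH"));
    }
}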
Is there a better way to solve my problem?
You have pretty much covered all the available options. At the moment there is no workaround for the size limitation; without it, search performance would suffer. Exactly why you would need an exact match on a string longer than 300 characters is beyond me, though. Have you tried using quotes around your search?

Extracting data using U-SQL file set pattern when silent switch is true

I want to extract data from multiple files, so I am using a file set pattern, which requires one virtual column. Because of some issues in my data, I also require the silent switch, as otherwise I am not able to process my data. It looks like, when I use a virtual column together with the silent switch, it does not extract any rows.
@drivers =
    EXTRACT name string,
            age string,
            origin string
    FROM "/input/{origin:*}file.csv"
    USING Extractors.Csv(silent: true);
Note that I can extract data from a single file by removing the virtual column. Is there any solution for this problem?
First, you do not need to name the wildcard (and expose a virtual column) if you do not plan on referring to the value. That said, we recommend making sure you are not processing too many files with this pattern, so for now it may be best to use the virtual column as a filter (for example, a WHERE origin == "2016" predicate on the extracted rowset) to restrict the number of files to a few thousand, until we improve the implementation to work on more files.
I assume that at least one file contains some rows with two columns? If that is the case I think you found a bug. Could you please send me a simple repro (one file that works, and an additional file where it stops working and the script) to my email address so I can file it and we can investigate it?
Thanks!

Apache Pig: Extracting url query parameters that appear in arbitrary order

I have a logfile with urls that are tagged with custom Google Analytics campaign parameters (utm_source, utm_medium, utm_campaign). I need to extract the parameters from the urls and create a csv file where source, medium and campaign appear each in their own column (plus several other fields from the logfile).
This is how I started (url is the field that contains the url obviously):
extracted = foreach mydata GENERATE date, time,
FLATTEN(REGEX_EXTRACT_ALL(url, '.*utm_source=(.*)&utm_medium=(.*)&utm_campaign=(.*)&.*?'))
AS (source:CHARARRAY, medium:CHARARRAY, campaign:CHARARRAY);
This works, but only as long as the parameters appear in a fixed order (and are not preceded by another parameter in the URL).
So this will e.g. extract data from https://www.example.com/page.html?&utm_source=publisher&utm_medium=display&utm_campaign=standard&someotherparam but not from https://www.example.com/page.html?&utm_medium=display&utm_source=publisher&utm_campaign=standard&someotherparam. Since the parameter order is not consistent that doesn't work for me.
I have tried multiple alternatives for the regexp separated by or (|), but that only ever gave me the first match. I have also tried to extract each parameter with its own extract command and then join the data, but that took ages and ended up duplicating the data.
So what would be the best (or at least a working) way to rewrite my Pig command so that it extracts all three utm parameters from the urls independently of the order in which they appear?
I would simply use three REGEX_EXTRACT calls:
... FOREACH mydata GENERATE
        REGEX_EXTRACT(url, '.*utm_source=([^&]*)', 1) AS source:CHARARRAY,
        REGEX_EXTRACT(url, '.*utm_medium=([^&]*)', 1) AS medium:CHARARRAY,
        REGEX_EXTRACT(url, '.*utm_campaign=([^&]*)', 1) AS campaign:CHARARRAY;
You could probably do it with just one regex, but I find this simpler and more readable.
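If you want to sanity-check the per-parameter regexes outside Pig, the same order-independent idea looks like this in plain Java (java.util.regex; the URL is the second one from the question):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UtmExtractor {
    // One pattern per parameter, so the order of parameters
    // in the URL does not matter.
    static String extract(String url, String param) {
        Matcher m = Pattern.compile("[?&]" + param + "=([^&]*)").matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String url = "https://www.example.com/page.html"
                + "?&utm_medium=display&utm_source=publisher&utm_campaign=standard&someotherparam";
        System.out.println(extract(url, "utm_source"));   // publisher
        System.out.println(extract(url, "utm_medium"));   // display
        System.out.println(extract(url, "utm_campaign")); // standard
    }
}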

Google Places API - RadarSearch results are confusing

I'm running a query against the Google Places RadarSearch API and don't entirely understand the results. I'm trying to find nearby Tesco supermarkets. My query is structured like this:
https://maps.googleapis.com/maps/api/place/radarsearch/xml?location=51.503186,-0.126446&types=store&keyword=tesco&name=tesco&radius=5000&key=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
I've tried a bunch of variations of the types, keyword and name fields. None of the results are Tesco stores. Am I missing something?
The Google docs show the fields as:
keyword — A term to be matched against all content that Google has indexed for this place, including but not limited to name, type, and address, as well as customer reviews and other third-party content.
name — One or more terms to be matched against the names of places, separated by a space character. Results will be restricted to those containing the passed name values. Note that a place may have additional names associated with it, beyond its listed name. The API will try to match the passed name value against all of these names. As a result, places may be returned in the results whose listed names do not match the search term, but whose associated names do.
I always get the maximum of 200 results, which includes maybe 1 or 2 Tescos. When I check on Google Maps, there are 10 Tescos in the radius I am searching. It's as if the API is ignoring the name field; it doesn't matter what I populate in the name field, I still get the same results.
UPDATE: Seems this is a known bug https://code.google.com/p/gmaps-api-issues/issues/detail?id=7082
Maybe I am wrong, but I believe it is a commercial issue: Google will show all businesses, filtering them with particular criteria, and they do not publish the rules. For example, in your search the type you used was "store", so they return all stores to you, using the name or keyword in their own way; who knows which criteria they are internally using. There is something else: in the API description, the sample they provide for radar search shows the name of the place in the result, but in the tests I am doing they are not even sending the name, so you cannot iterate those results and filter on your own. To get the name, you have to make another call using:
https://maps.googleapis.com/maps/api/place/details/json?placeid=ChIJq4lX1doEdkgR5JXPstgQjc0&key=YOUR_KEY
Maybe there is another way but I don't see it.
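A rough sketch of that extra details call and the client-side name check (Java 11 HttpClient; the place_id is the one from the URL above, YOUR_KEY is a placeholder, and the substring test stands in for real JSON parsing):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PlaceNameFilter {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    // Fetches the details of a single place, since the radar search
    // response does not include the place name.
    static String fetchDetailsJson(String placeId, String apiKey) throws Exception {
        String url = "https://maps.googleapis.com/maps/api/place/details/json"
                + "?placeid=" + placeId + "&key=" + apiKey;
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
        return CLIENT.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        String json = fetchDetailsJson("ChIJq4lX1doEdkgR5JXPstgQjc0", "YOUR_KEY");
        // Crude stand-in for a real JSON parser: keep the place only if
        // its details mention the brand we are looking for.
        if (json.toLowerCase().contains("tesco")) {
            System.out.println("This place looks like a Tesco store");
        }
    }
}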
I find the radar search is returning strange results today; it worked differently a couple of days ago.
The keyword parameter has no effect at the moment, and I have failing integration tests that were working before. I hope this is a temporary issue.
I filed a bug report for it: https://code.google.com/p/gmaps-api-issues/issues/detail?id=7086

Lucene - Which field contains search term?

I have developed a search application with Lucene. I have created the basic search. Basically, my app works as follows:
My index has many fields (around 40).
Users can enter queries against multiple fields, e.g. +NAME:John +SURNAME:Doe
Queries can contain wildcards such as ? and *, e.g. +NAME:J?hn +SURNAME:Do*
Queries can also contain fuzzy terms, e.g. +NAME:Jahn~0.5
Now I want to find which field(s) contain my search term(s). As I am using wildcards and fuzzy matching, I cannot just do a string comparison. How can I do it?
If you need it for debugging purposes, you could use IndexSearcher.explain.
Otherwise, this problem looks like highlighting, so you should be able to find out the fields that matched by:
re-analyzing your document,
or using its term vectors.
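For the debugging route, a minimal sketch using IndexSearcher.explain (the index path and query are placeholders; API names follow recent Lucene versions):

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;

public class ExplainMatches {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("/path/to/index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("NAME", new StandardAnalyzer())
                    .parse("+NAME:J?hn +SURNAME:Do*");
            TopDocs hits = searcher.search(query, 10);
            for (ScoreDoc hit : hits.scoreDocs) {
                // The explanation tree names the fields and terms that
                // contributed to the match, even for wildcard queries.
                Explanation explanation = searcher.explain(query, hit.doc);
                System.out.println(explanation);
            }
        }
    }
}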