Lucene Tokenizer - Include Spaces - lucene

We have an application that tokenises certain data. The problem I have is that I have a comma delimited field I need to tokenize but not on spaces. For Example:
"Age 6, Age 7, Age 8"
Becomes
Age
6
Age
7
Age
8
I need
Age 6
Age 7
Age 8
Is there a way for me to change the default behaviour on certain fields only?
The config setting I have at present:
<field fieldName="SizeGroup" storageType="YES" indexType="TOKENIZED" vectorType="NO"
boost="1f" type="System.String"
settingType="Sitecore.ContentSearch.LuceneProvider.LuceneSearchFieldConfiguration,
Sitecore.ContentSearch.LuceneProvider" />

Unfortunately, I do not know C#, but I know Lucene. So for needed behavior you need to use PatternAnalyzer, which allows you to specify a regexp, that will be used for tokenizing. In your case, pattern like \\, should work for splitting on commas.

Related

Merging xfdf into template pdf without losing some special characters (eg. ő,Ű,č)

I have an xfdf file, which is utf8 and may contain non ASCII characters. I would like to merge it with the pdf that contains the form. I tried with pdftk, and although merging happens correctly - in terms of all fields are being populated - some characters are not appearing in the flattened pdf.
Taking the xfdf:
<?xml version="1.0" encoding="utf-8"?>
<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
<fields>
<field name="some_data">
<value>Űző</value>
</field>
<field name="some_other_data">
<value>ùûüÿ€’“”«»àâæçéèêëïôœÙÛÜŸÀÂÆÇÉÈÊËÏÎÔ</value>
</field>
</fields>
</xfdf>
The result pdf's fields have the following values (excluded the quotation marks):
some_data: " z "
some_other_data: "ùûüÿ€’“”«»àâæçéèêëïôœÙÛÜŸÀÂÆÇÉÈÊËÏÎÔ"
So all the characters in some_other_data are stored correctly, but ő and Ű are stored as 00.
I also realized that if I uncompress the pdf with pdftk, I can find the original characters stored in the pdf as
/DA (/Helv 8.64 Tf 0 g)
/Subtype /Widget
/V (ţ˙ Q z\r )
/T (some_data)
The fact that the correct characters are there is also clear if I open the unflattened form with Adobe Reader. After opening, initially the form field some_data contains only the letter z surrounded with spaces, BUT if I click on the form field, the special characters are revealed, and any changes made to the field value will result in the correct characters to stay visible. On the other hand if I unfocus the form field without any modification, they disappear again..
I also tried to use numeric entities in the xfdf, but it did not help either.
I have 2 questions:
Why won't these characters appear in the value of the field when the pdf clearly contains the correct character information, and is also capable of rendering them?
Most importantly what can I do in order to have the correct characters appearing in the pdf after flattening the form? I'd prefer a solution which does not require any postprocessing once the xfdf is merged into the pdf form, but any solutions or ideas are welcome.
Thank you all!

Ignoring special characters during query time in SOLR

I want to ignore special characters during query time in SOLR .
For example :
Lets assume we have a doc in SOLR with content:My name is A-B-C .
content:A-B-C retunrs documents
but content:ABC doesnt return any document .
My requirement is that content:ABC should return that one document .
So basically i want to ignore that - during query time .
To get the tokens concatenated when they have a special character between them (i.e. A-B-C should match ABC and not just A), you can use a PatternReplaceCharFilter. This will allow you to replace all those characters with an empty string, effectively giving ABC to the next step of the analysis process instead.
<analyzer>
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="[^a-zA-Z0-9 ]" replacement=""/>
<tokenizer ...>
[...]
</analyzer>
This will keep all regular ascii letters, numbers and spaces, while replacing any other character with the empty string. You'll probably have to tweak that character group to include more, but that will depend on your raw content and how it should be processed.
This should be done both when indexing and when querying (as long as you want the user to be able to query for A-B-C as well). If you want to score these matches differently, use multiple fields with different analysis chains - for example keeping one field to only tokenize on whitespace, and then boosting it higher (with qf=text_ws^5 other_field) if you have a match on A-B-C.
This does not change what content is actually stored for the field, so the data returned will still be the same - only how a match is performed.
Here you must be having a field Type for your field content.
The fields type can have 2 separate analyzer. One for index and one for query.
Here you can either create indexes of content "A-B-C" like ABC, A-B-C by using the "Word Delimiter Token Filter" .
Use catenateWords. add as catenateWords = 1.
It will work as follow :
"hot-spot-sensor’s" → "hotspotsensor". In your case "A-B-C". it will generate "ABC"
Here is the example of it Word Delimiter Filter
Usage :
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterGraphFilterFactory" preserveOriginal="true" catenateWords="1"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
This will create the multiple indexes and you will be able search with ABC and A-B-C

auto delete of documents is not working in Apache Solr

I am using Apache Solr 7 and trying to automate deletion of documents. I've done following steps as per Lucene's documentation.
step 1. in solrschema.xml
<updateRequestProcessorChain default="true">
<processor class="solr.processor.DocExpirationUpdateProcessorFactory">
<int name="autoDeletePeriodSeconds">30</int>
<str name="ttlFieldName">time_to_live_s</str>
<str name="expirationFieldName">expire_at_dt</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
step 2. in managed-schema.xml
<field name="id" type="string" indexed="true" stored="true" multiValued="false" />
<field name="time_to_live_s" type="string" stored="true" multiValued="false" />
<field name="expire_at_dt" type="date" stored="true" multiValued="false" />
step 3. I created a core by name sample1 and add the following document
curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/sample1/update?commit=true' -d '[{ "id":"sample_doc_1","expire_at_dt":"NOW+10SECONDS"}]'
after 10 Seconds, the document is still there. Am i missing any of the step, here or am I doing something wrong ?
I think in the indexing you should set the field time_to_live_s not the expire_at_dt, and value +10SECONDS or whatever you want will be fine.
As a reference:
ttlFieldName - Name of a field this process should look for in each
document processed, defaulting to ttl. If the specified field name
exists in a document, the document field value will be parsed as a
Date Math Expression relative to NOW and the result will be added to
the document using the expirationFieldName. Use to disable this feature.
If you want to directly set the expiration date - you should set the proper date string, not the Date Math Expression.
I have full working example of the auto delete code here

Solr Query not parsing forward slash

Is the forward slash "/" a reserved character in solr field names?
I'm having trouble writing a solr sort query which will parse for fields containing a forward slash "/"
When making an http query to my solr server:
q=*&sort=normal+desc
Will work but
q=*&sort=with/slash+desc
q=*&sort=with%2Fslash+desc
Both fail say "can not use FieldCache on multivalued field: with"
Each solr document contains two int fields "normal", and "with/slash". With my solr schema indexing the fields as so
...
<field name="normal" type="int" indexed="true" stored="true" required="false" />
<field name="with/slash" type="int" indexed="true" stored="true" required="false" />
...
Is there any special way I need to encode forward slashes in solr? Or are there any other delimiter characters I can use? I'm already using '-' and "." for other purposes.
I just came across the same problem, and after some experimentation found that if you have a forward-slash in the field name, you must escape it with a backslash in the Solr query (but note that you do not have to do this in the field list parameter, so a search looking for /my/field/name containing my_value is entered in the "q" field as:
\/my\/field\/name:my_value
I haven't tried the sort field, but try this and let us know :)
This is on Solr 4.0.0 alpha.
From the solr wiki at https://wiki.apache.org/solr/SolrQuerySyntax :
Solr 4.0 added regular expression support, which means that '/' is now
a special character and must be escaped if searching for literal
forward slash.
In my case I needed to search for forward slash / with wild card *, e.g.:
+(*/*)
+(*2016/17*)
I Tried to escape it like so:
+(*2016\/*)
+(*2016\/17*)
but that didn't work also.
the solution was to wrap the text with double quote " like do:
+("*\/*")
+("*/*")
+("*2016\/17*")
+("*2016/17*")
both returned the same result with and without escaping the forward slash

ASP Regular Expression for UK Telephone format in VB.net

I want regular expression validator for my telephone field in VB.net. Please see the requirement below:
Telephone format should be (+)xx-(0)xxxx-xxxxxx ext xxxx (Optional) example my number would appear as 44-7966-591739 Screen would be formatted to show +44-(0)7966-591739 ext
Please suggest.
Best Regards,
Yuv
+44-(0)7966-591739
The (0) is not valid in phone number display. Remove it.
It's +44 7966 591739 or 07966 591739.
The RegEx pattern is inefficient in multiple ways:
(\d{4}|\d{3})
The above simplifies to:
\d{3,4}
There are bigger problems:
^(((+44\s?\d{4}|(?0\d{4})?)\s?\d{3}\s?\d{3})|((+44\s?\d{3}|(?0\d{3})?)\s?\d{3}\s?\d{4})|((+44\s?\d{2}|(?0\d{2})?)\s?\d{4}\s?\d{4}))(\s?#(\d{4}|\d{3}))?$
Having found the leading +44 or leading 0 once, why keep on searching for it again and again?
^((+44\s?..|0..).....|(+44\s?..|0..).....|(+44\s?..|0..).....)
simplifies to
^(+44\s?|0)(.. .....|.. .....|.. .....)
However, the above pattern caters only for UK 4+6, 3+7 and 2+8 format numbers and not for 3+6, 4+5, 5+5 and 5+4 format numbers.
The pattern is inadequate.
Phone number validation and formatting needs to be broken down into separate steps. Allow a wide range of input formats, extract the vital digits and throw away the various dial prefixes, then strictly format the remaining number in international or national format.
For London numbers, the correct format with spaces is:
+44 20 3555 7890 or 020 3555 7890 or (020) 3555 7890
and without spaces:
+442035557890 or 02035557890.
(0) in parentheses is NEVER valid. Do not use it.
UK phone numbers use a variety of formats: 2+8, 3+7, 3+6, 4+6, 4+5, 5+5, 5+4. Some users don't know which format goes with which number range and might use the wrong one on input. Let them do that; you're interested in the DIGITS.
Step 1: Check the input format looks valid
Make sure that the input looks like a UK phone number. Accept various dial prefixes, +44, 011 44, 00 44 with or without parentheses, hyphens or spaces; or national format with a leading 0. Let the user use any format they want for the remainder of the number: (020) 3555 7788 or 00 (44) 203 555 7788 or 02035-557-788 even if it is the wrong format for that particular number. Don't worry about unbalanced parentheses. The important part of the input is making sure it's the correct number of digits. Punctuation and spaces don't matter.
^\(?(?:(?:0(?:0|11)\)?[\s-]?\(?|\+)44\)?[\s-]?\(?(?:0\)?[\s-]?\(?)?|0)(?:\d{5}\)?[\s-]?\d{4,5}|\d{4}\)?[\s-]?(?:\d{5}|\d{3}[\s-]?\d{3})|\d{3}\)?[\s-]?\d{3}[\s-]?\d{3,4}|\d{2}\)?[\s-]?\d{4}[\s-]?\d{4}|8(?:00[\s-]?11[\s-]?11|45[\s-]?46[\s-]?4\d))(?:(?:[\s-]?(?:x|ext\.?\s?|\#)\d+)?)$
The above pattern matches optional opening parentheses, followed by 00 or 011 and optional closing parentheses, followed by an optional space or hyphen, followed by optional opening parentheses. Alternatively, the initial opening parentheses are followed by a literal + without a following space or hyphen. Any of the previous two options are then followed by 44 with optional closing parentheses, followed by optional space or hyphen, followed by optional 0 in optional parentheses, followed by optional space or hyphen, followed by optional opening parentheses (international format). Alternatively, the pattern matches optional initial opening parentheses followed by the 0 trunk code (national format).
The previous part is then followed by the NDC (area code) and the subscriber phone number in 2+8, 3+7, 3+6, 4+6, 4+5, 5+5 or 5+4 format with or without spaces and/or hyphens. This also includes provision for optional closing parentheses and/or optional space or hyphen after where the user thinks the area code ends and the local subscriber number begins. The pattern allows any format to be used with any GB number. The display format must be corrected by later logic if the wrong format for this number has been used by the user on input.
The pattern ends with an optional extension number arranged as an optional space or hyphen followed by x, ext and optional period, or #, followed by the extension number digits. The entire pattern does not bother to check for balanced parentheses as these will be removed from the number in the next step.
At this point you don't care whether the number begins 01 or 07 or something else. You don't care whether it's a valid area code. Later steps will deal with those issues.
Step 2: Extract the NSN so it can be checked in more detail for length and range
After checking the input looks like a GB telephone number using the pattern above, the next step is to extract the NSN part so that it can be checked in greater detail for validity and then formatted in the right way for the applicable number range.
^\(?(?:(?:0(?:0|11)\)?[\s-]?\(?|\+)(44)\)?[\s-]?\(?(?:0\)?[\s-]?\(?)?|0)([1-9]\d{1,4}\)?[\s\d-]+)(?:((?:x|ext\.?\s?|\#)\d+)?)$
Use the above pattern to extract the '44' from $1 to know that international format was used, otherwise assume national format if $1 is null.
Extract the optional extension number details from $3 and store them for later use.
Extract the NSN (including spaces, hyphens and parentheses) from $2.
Step 3: Validate the NSN
Remove the spaces, hyphens and parentheses from $2 and use further RegEx patterns to check the length and range and identify the number type.
These patterns will be much simpler, since they will not have to deal with various dial prefixes or country codes.
The pattern to match valid mobile numbers is therefore as simple as
^7([45789]\d{2}|624)\d{6}$
Premium rate is
^9[018]\d{8}$
There will be a number of other patterns for each number type: landlines, business rate, non-geographic, VoIP, etc.
By breaking the problem into several steps, a very wide range of input formats can be allowed, and the number range and length for the NSN checked in very great detail.
Step 4: Store the number
Once the NSN has been extracted and validated, store the number with country code and all the other digits with no spaces or punctuation, e.g. 442035557788.
Step 5: Format the number for display
Another set of simple rules can be used to format the number with the requisite +44 or 0 added at the beginning.
The rule for numbers beginning 03 is
^44(3\d{2})(\d{3])(\d{4})$
formatted as
0$1 $2 $3 or as +44 $1 $2 $3
and for numbers beginning 02 is
^44(2\d)(\d{4})(\d{4})$
formatted as
(0$1) $2 $3 or as +44 $1 $2 $3
The full list is quite long. I could copy and paste it all into this thread, but it would be hard to maintain that information in multiple places over time. For the present the complete list can be found at: http://aa-asterisk.org.uk/index.php/Regular_Expressions_for_Validating_and_Formatting_GB_Telephone_Numbers
For validation:
As bobince points out, you should be flexible with phone numbers because there are so many ways to enter them.
One simple yet effective way to validate the value is first strip all non-numeric values, then make sure it is at least 11 digits long, and - if you're limiting to UK numbers - then check it starts with either 0 or 44.
I can't be bothered looking up vb.net syntax, but something along the lines of this:
if Phone.replaceAll('\D','').length < 11
// Invalid Number
endif;
(The \D is regex for anything not 0-9.)
To format a number as requested, assuming you've got a relatively fixed input that you want to display to a page, something like this might work:
replace:
(\d{2,3})\D*0?\D*(\d{4})\D*(\d{5})\D*(\d*)
with:
+$1-(0)$2-$3 ext $4
That's fairly flexible but wont accept any old phone number. It currently required an international code at the start, and I'm not quite sure on the rules of them to know if it's going to work perfectly, but it might be good enough for what you need.
An explanation of that regex, in regex comment mode (so it can be used directly as a regex if necessary):
(?x) # enable regex comment mode (whitespace ignored, hashes start comments)
# international code:
(\d{2,3}) # matches 3 or 2 digits; captured to group 1.
# optional 0 with potental spaces dashes or parens:
\D* # matches as many non-digits as possible, none required.
0? # optionally match a zero
\D* # matches as many non-digits as possible, none required.
# main part of number:
(\d{4}) # match 4 digits; captured to group 2
\D* # matches as many non-digits as possible, none required.
(\d{5}) # match 5 digits; captured to group 3.
# optional prefix:
\D* # matches as many non-digits as possible, none required.
(\d*) # match as many digits as possible, none required; captured to group 4.
Never include a (0) in parentheses in the international format.
ITU E.123 warns against it: http://www.itu.int/rec/T-REC-E.123-200102-I/en
as does: http://revk.www.me.uk/2009/09/it-is-not-44-0207-123-4567.html