Sitecore 7 Lucene: strip HTML from computed field - lucene

I am concatenating all "paragraph" child nodes of an "article" node in a computed field, so that an article can be searched and found by its paragraph contents.
To achieve this, I did the following, under the <fields hint="raw:AddComputedIndexField"> node:
<field fieldName="Paragraphs" storageType="YES" indexType="TOKENIZED">
MyWebsite.ComputedFields.Paragraphs,MyWebsite
</field>
In this computed field, I concatenate the paragraph HTML bodies.
I was assuming Sitecore would strip the HTML for me (like it does for rich text fields), but it does not.
For "rich text" fields, it is probably the RichTextFieldReader that strips the HTML tags out. Decompiling the code confirms this.
The RichTextFieldReader is configured in the FieldReaders section. Adding a raw:AddFieldReaderByFieldName section below it does not seem to do anything.
The full section looks as follows, but does not work in this setup:
<FieldReaders type="Sitecore.ContentSearch.FieldReaders.FieldReaderMap, Sitecore.ContentSearch">
<mapFieldByTypeName hint="raw:AddFieldReaderByFieldTypeName">
....default stuff here...
</mapFieldByTypeName>
<mapFieldByFieldName hint="raw:AddFieldReaderByFieldName">
<fieldReader fieldName="Paragraphs" fieldReaderType="Sitecore.ContentSearch.FieldReaders.RichTextFieldReader, Sitecore.ContentSearch"></fieldReader>
</mapFieldByFieldName>
</FieldReaders>
Any other clues on how to achieve this (by config, not by using HTML Agility Pack etc.)?

The problem is that mapFieldByFieldName expects to match a field with that name on the Sitecore item, not a custom computed field in your index, so the field reader is never called.
I don't know how to achieve this from config. If you do not want to use HAP directly but are willing to write some code, then after you concatenate your fields in your computed field class, do what Sitecore does in the GetPlainText() method:
string input = "concatenated string";
return HttpUtility.HtmlDecode(Regex.Replace(input, "<[^>]*>", string.Empty));
Alternatively, use the utility method Sitecore.StringUtil.RemoveTags(text).
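The strip-then-decode approach from the C# snippet above can also be sketched outside Sitecore. This is a minimal Python illustration, not Sitecore code: the helper name strip_html and the sample paragraphs are made up for the example, and only the regex is taken from the snippet. It is a naive tag matcher, so a literal ">" inside an attribute value or the body of a script/style element would not be handled correctly.

```python
import html
import re

# Same pattern as in the C# snippet: anything between "<" and the next ">".
TAG_PATTERN = re.compile(r"<[^>]*>")


def strip_html(markup: str) -> str:
    """Remove tags first, then decode entities such as &amp; (hypothetical helper)."""
    without_tags = TAG_PATTERN.sub("", markup)
    return html.unescape(without_tags)


# Hypothetical paragraph bodies, concatenated the way a computed field might do it.
paragraphs = ["<p>Fish &amp; chips</p>", "<p>Served <b>hot</b></p>"]
plain_text = strip_html(" ".join(paragraphs))
print(plain_text)
```

Joining the paragraphs with a space before stripping keeps the last word of one paragraph from running into the first word of the next once the tags are gone.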

Related

How can I Map from XML HREF+Text to Word Document Hyperlink?

I have a simple XML file, e.g.:
<root>
<some_link_href>http://www.microsoft.com/</some_link_href>
<some_link_text>goto microsoft site</some_link_text>
</root>
I have set up XmlMapping, but I can't seem to get a hyperlink.
I tried using richTextControl and providing RTF text
(e.g. '{\rtf1\pc some \b BOLD \b0 text}'), but it just showed the raw RTF.
Note: even though I'd like to have text with an HREF, I can settle for a clickable URI.
Is there any other control, or another good method to use?
Also posted at https://learn.microsoft.com/en-us/answers/questions/720964/place-link-on-winword-from-xmlmapping.html

How do I get empty fields in SOLR indexed for a schemaless collection?

How do I get empty fields in SOLR indexed? I am using Solr 7.2.0.
I am using schemaless SOLR to try to index everything as string, but for files with empty fields, those fields do not get indexed. Is there a way to get them to show up?
col1,col2,col3
a,,1
d,e,
g,h,3
For example, the first data row shows up as
{
"col1":"a",
"col3":"1",
}
I'm trying to also get col2 to show up.
In my solrconfig.xml I have this:
<dynamicField name="*" type="text_general" indexed="true" stored="true" required="true" default="" />
I have also removed all traces of the remove-blank processor from my config. I've reloaded and deleted/recreated my collection multiple times. Is there a solution for this?
The CSV import module has its own option to keep empty fields - f.<field name>.keepEmpty=true.
If you don't give that option, the CSV handler will never give the empty field value to the next step in your indexing process.
Passing f.col2.keepEmpty=true as a URL argument should at least give you a better starting point.
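As a rough sketch of how that URL argument could be assembled for several columns at once, the Python below builds an update URL with one keepEmpty parameter per field. The host, port, and collection name are assumptions for illustration, and the helper name csv_update_url is hypothetical. Only the f.<field name>.keepEmpty=true option comes from the answer above.

```python
from urllib.parse import urlencode

# Assumed values for illustration; adjust to your own Solr host and collection.
SOLR_BASE = "http://localhost:8983/solr"
COLLECTION = "mycollection"


def csv_update_url(fields_to_keep_empty):
    """Build an update URL asking the CSV loader to keep empty values (hypothetical helper)."""
    params = [("commit", "true")]
    for name in fields_to_keep_empty:
        # Per-field CSV option from the answer: f.<field name>.keepEmpty=true
        params.append((f"f.{name}.keepEmpty", "true"))
    return f"{SOLR_BASE}/{COLLECTION}/update?{urlencode(params)}"


url = csv_update_url(["col1", "col2", "col3"])
print(url)
```

The CSV file would then be posted to this URL with a CSV content type. The sketch deliberately stops at building the URL, so it can be checked without a running Solr instance.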
Maybe preprocess your CSV file like this:
s/,,/, ,/g
That is, add a space between the two commas (you will have to handle an empty last value separately, though; there is a regex for that).
Then try again. Right now Solr reads the value as nonexistent. Making it a space gives it a better chance of getting through, and it should not change search results (unless you have some unusual analysis chains).
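A plain s/,,/, ,/g pass misses an empty last value and, because matches do not overlap, every second cell in a run of consecutive empties. As an alternative to the regex (a swapped-in technique, not what the answer proposes), the sketch below uses Python's csv module to replace every empty cell with a single space. The helper name fill_empty_cells is hypothetical, and the sample data is the CSV from the question.

```python
import csv
import io


def fill_empty_cells(csv_text: str, filler: str = " ") -> str:
    """Return csv_text with every empty cell replaced by filler (hypothetical helper)."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # Keep non-empty cells as they are; substitute the filler for empty ones.
        writer.writerow([cell if cell != "" else filler for cell in row])
    return out.getvalue()


# Sample data from the question: col2 is empty in row 1, col3 in row 2.
sample = "col1,col2,col3\na,,1\nd,e,\ng,h,3\n"
print(fill_empty_cells(sample))
```

Working cell by cell also avoids touching commas that appear inside quoted values, which a text-level substitution cannot distinguish.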

How to define link text different in Odoo widget=url

I have made a read-only computed URL field based on the invoice number. It works nicely, but I would like the link text to be just the number, like 400:
400
Right now it shows the whole link as text, which is quite ugly:
https://external_site_invoice?num=400
My Odoo fields are defined this way...
ext_invoice_number = fields.Integer(string="Ext number")

def _showlink(self):
    for rec in self:
        # Assign a value on every path so the computed field is always set.
        if rec.ext_invoice_number and rec.ext_invoice_number > 0:
            rec.ext_link = "https://external site/invoice?num=%d" % (rec.ext_invoice_number,)
        else:
            rec.ext_link = False

ext_link = fields.Char(string="Link", compute=_showlink)
How can I make the text part of the URL in Odoo different from the link itself? Is this just poorly documented, or is it not possible?
You can define the text attribute in the widget definition like this:
<field name="field_with_url" widget="url" readonly="1" text="My own text"/>

Sitecore 8.1: Custom Search Index not searching through PDF

I have a custom search index that should index PDF file content. The master index seems to index PDF files fine, and Sitecore's built-in search functionality searches through the PDF files without problems. However, I am having trouble indexing the PDF field in my custom index and then searching its contents.
In my indexConfiguration I add the field by name:
<fieldNames hint="raw:AddFieldByFieldName">
<field fieldName="publication pdf" storageType="YES" indexType="TOKENIZED" vectorType="NO" boost="1f" type="System.String" settingType="Sitecore.ContentSearch.LuceneProvider.LuceneSearchFieldConfiguration, Sitecore.ContentSearch.LuceneProvider" />
...
</fieldNames>
My result item class contains the index field definition:
[IndexField("publication pdf")]
public virtual string PDF { get; set; }
However, when I create a search context and try to find something inside the PDF, I get 0 results.
var query = context.GetQueryable<ResultItem>();
query = query.Where(p => p.PDF.Equals(SearchString));
Any help is greatly appreciated.
I'm guessing your "Publication PDF" field is some kind of reference field to a media library item. Content of the PDF is in fact not content of your current item. This means that you would need to write a custom computed field that would extract that media library item and crawl its content.
If you want to crawl content of a media item, you might want to use some reflector to check the code of Sitecore.ContentSearch.ComputedFields.MediaItemContentExtractor class. It's used by Sitecore to get the content of media items, as defined in Sitecore.ContentSearch.Lucene.DefaultIndexConfiguration.config:
<field fieldName="_content" type="Sitecore.ContentSearch.ComputedFields.MediaItemContentExtractor,Sitecore.ContentSearch">
<mediaIndexing ref="contentSearch/indexConfigurations/defaultLuceneIndexConfiguration/mediaIndexing"/>
</field>
You would need to first get the media item and then use code copied from this class to get the content of the PDF.
BUT
Yeah, there is always a but. If the media library item has changed and your item has not, your item will not be reindexed automatically. So if you plan to change PDFs (uploading a new item and selecting it should be fine), you would need to either write custom code that triggers reindexing of the item that references your PDF file, or reindex that item manually.

Using Polymer conditional attributes with HAML

According to the documentation for Polymer expressions, you can bind data to conditionally assign an attribute using the conditional attribute syntax:
Conditional attributes
For boolean attributes, you can control whether or not the attribute
appears using the special conditional attribute syntax:
attribute?={{boolean-expression}}
That's great, but I'm using HAML where attributes are assigned to elements like this:
%element{attribute: "value"}
I can't add a question mark before that colon without HAML giving me a syntax error.
So how can I use Polymer's conditional attributes (or a functional equivalent) when I'm using HAML to generate my HTML?
One potential solution is to use the :plain filter to insert raw HTML into your HAML file:
:plain
  <element attribute?={{boolean-expression}}></element>
A bit ugly, but it seems to work.
If you need to enclose some HAML-generated tags in one of these plain HTML tags, you'll need to use the :plain filter twice; once for the opening tag, and once for the closing tag.
:plain
  <element attribute?={{boolean-expression}}>
-# HAML Content Here
:plain
  </element>
Be sure not to indent your HAML code after the opening tag, otherwise it will become part of the "raw HTML" output and get sent as plain text to the browser instead of being processed as HAML.
The current version of HAML (4.0.6) supports conditional attributes:
%core-menu{hidden?: '{{!globals.current_series_detail}}'}
Make sure you're not putting a space before the question mark.