Exclude menu from content extraction with tika - lucene

I generate HTML documents that contain a menu and a content part. I then want to extract the content of these documents to feed it to a Lucene index. However, I would like to exclude the menu from the extraction so that only the content is indexed.
<div class="menu">my menu goes here</div>
<div class="content">my content goes here</div>
What is the simplest way to achieve this with Apache Tika?

As a more general solution (not just for your specific menu), I would advise looking at boilerpipe, which deals with removing uninteresting parts from pages (menus, navigation, etc.).
I know it can be integrated with Solr/Tika, so have a look; you can probably integrate it in your scenario.

Have a look at this post, which shows how to control the handling of DIVs during the HTML parse by declaring whether an element is safe to parse; an element that is not safe is ignored. For your problem, the overridden methods could contain logic that ignores only DIV elements whose class attribute is "menu" (i.e. they tell the Tika parser that this DIV is unsafe to parse).

You can parse the HTML with a parser into an XHTML DOM object and remove the div tag containing the attribute class="menu".
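To make the "drop the menu div before extracting text" idea concrete, here is a minimal sketch. It is written in Python with the standard library's html.parser rather than against Tika itself, so it only illustrates the technique; the names ContentOnlyExtractor and extract_content are made up for this example, and it assumes reasonably well-formed markup (every non-void element inside the menu has a closing tag).

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ContentOnlyExtractor(HTMLParser):
    """Collects text, skipping everything inside <div class="menu"> (hypothetical helper)."""

    def __init__(self, skip_class="menu"):
        super().__init__()
        self.skip_class = skip_class
        self.skip_depth = 0   # > 0 while we are inside the excluded div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth += 1   # nested element inside the menu
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and self.skip_class in classes:
            self.skip_depth = 1    # entering the menu div

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1   # leaving a nested element or the menu itself

    def handle_data(self, data):
        # Keep only non-empty text that is outside the excluded div.
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_content(html):
    parser = ContentOnlyExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Calling extract_content on the two divs from the question should give back only the text of the content div, which is the string you would then hand to the indexer.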

Related

main tag in html / CSS

I have trouble understanding the tag </main>.
The project I'm looking at only has one </main> closing tag at the end; it has no opening <main> tag, which confuses me in the first place. It also has main {} in CSS. In order to try to understand what it does, I have played around with it a little.
When I comment out the </main> in the HTML, absolutely nothing changes, which I understand if it is purely informative. The part where I get confused is what the main {....} in the CSS refers to, since commenting it out messes up the styling of the whole page.
Edit: Since it seems to be unclear what my problem is: the problem is that main {.....} in CSS does (!) influence the styling of the site, even without an opening <main> tag in the HTML, and even without ANY <main> tag in the HTML (for example if I remove the </main> tag).
How can main {....} in CSS have any influence on the styling if there isn't any <main> tag in the HTML whatsoever?
main {...} in CSS refers to the styling of the <main> tag.
More about the element css selector
https://www.w3schools.com/cssref/sel_element.asp
You can learn more about the main tag here
https://www.w3schools.com/TAgs/tag_main.asp
And more about css selectors here
https://www.w3schools.com/cssref/css_selectors.asp
Also note that the main tag must have both an opening and a closing tag.
You have to either add a <main> tag, or remove main {...}.
The <main> tag specifies the main content of a document.
The content inside the <main> element should be unique to the document. It should not contain any content that is repeated across documents such as sidebars, navigation links, copyright information, site logos, and search forms.
Note: There must not be more than one <main> element in a document. The <main> element must NOT be a descendant of an <article>, <aside>, <footer>, <header>, or <nav> element.
<main> just refers to the main content of the page; it does nothing by itself. main {...} is the styling for it in the CSS file.

What HTML5 Tag Should be Used for a "Call to Action" Div?

I am new to HTML5 and am wondering which HTML5 tag should be used on a Call to Action div that sits in a column next to the main content on the home page.
Option 1:
<aside>
//call to action
</aside>
Option 2:
<article>
<section>
//call to action
</section>
</article>
The reason I ask is because I don't see either option as being a perfect fit. Perhaps I am missing something. Thanks!
My HTML for the Call to Action:
<section class="box">
<hgroup>
<h1 class="side">Call Now</h1>
<h2 class="side">To Schedule a Free Pick-Up!</h2>
<ul class="center">
<li>Cleaning</li>
<li>Repair</li>
<li>Appraisals</li>
</ul>
<h3 class="side no-bottom">(781) 729-2213</h3>
<h4 class="side no-top no-bottom">Ask for Bob!</h4>
</hgroup>
<img class="responsive" src="img/satisfaction-guarantee.png" alt="100% Satisfaction Guarantee">
<p class="side">We guarantee you will be thrilled with our services or your money back!</p>
</section>
This is a box on the right column of a three column layout. The content in the large middle column gives a summary of the company's services. If you wanted to use those services, you would have to schedule a pick-up, hence the call to action.
Does anyone object to this use of HTML5, or have a better way?
My take is that the best practices for the new HTML5 structural elements are still being worked out, and the forgiving nature of the new HTML5 economy means that you can establish the conventions that make the most sense for your application.
In my applications, I have separate considerations for markup that reflects the layout of the view (that is, the template that creates the overall consistency from page to page) versus the content itself (usually any function or query results that receive additional markup before being inserted into the various regions in the layout). The distinction matters because the layout element semantics (like header, footer, and aside) don't really help with differentiation of the content during search since that markup is usually repeated from page to page. I particularly favor using the semantic distinctions in HTML5 to describe the content the user is actually searching on. For example I generally use article to wrap the primary content and nav to wrap any associated list of links. Widget wrappers are usually tied to the page layout, so I'd go with the convention of the template for that guideline.
Whenever I have to decide on semantic vs generic names, my general heuristic is:
If there is a possible precedent already in the page template, follow that precedent;
If the element in question is a new part of the page layout (vs a content query that is rendered into a region in the layout) and there is no guiding pattern in the template already, div is fine for associating that page layout behavior to;
If the content is created dynamically (that is, anything that gets instanced into the layout at request time: posts, navigation, most widgets), wrap it in a semantic wrapper that best describes what that item is (vs how you think it should appear);
Whenever authoring or generating content, use semantic HTML5 markup as appropriate within that content (hgroup to bracket hierarchical headings, section to organize chunks within the article, etc.). This is future-proof enrichment for search.
According to all this, div would be fine as a wrapper for your widget unless your page template already establishes a different widget wrapper. Also, your use of heading elements for creating large, bold appearance within the widget is using markup for appearance rather than for semantics. Since your particular usage is appearance-motivated, it would be better to use divs or spans with CSS classes that can let you specify sizes, spacing, and other adornments as needed for that non-specific text rather than having to override the browser defaults for the heading elements. I'd save the heading elements for the page heading, for widget headings, and for headings within the primary content region of the page. There can be SEO ranking issues for misuse of headings that are not part of the main content.
I hope these ideas help in your consideration of HTML5 markup usage.
So far as the semantics of the markup go, Don's advice makes sense. (As you said your CTA was visually beside the main content and secondary to that content, I would favor aside, but there's no single correct answer.)
However, you've tagged your question with "seo," so I take it you're interested in the SEO benefits of using the right markup. At this time, Google doesn't give special weight to having nice, semantic markup; they don't care about the difference between things like aside, section, and div. This may be partly because the meaning of these tags is still being defined (by the practice of Web devs), but they even seem to ignore tags that are clearly relevant to search results (like nav, whose content will almost always be irrelevant to a page's description in the search results).
Instead, they heavily favor using microdata for marking up rich semantics. In the short term, marking your page up using the Schema.org WebPage microdata will likely provide greater benefit. You can mark your CTA as a relatedLink or significantLink, and keep it outside the mainContent of the page. If you're looking to optimize your page for search, this is a great way to do it---in my experience, Google very rarely shows text outside of your mainContent block in the search results description.
Proper markup depends on the actual content, which you have not provided.
That said, wrapping everything in a div is fine (although perhaps unnecessary) no matter what your content is as the <div> tag has no semantic value. Your two examples are probably not correct, unless your "call to action" is literally an entire article (which I doubt is the case).
The call to action might occur within an <aside>, but it's not likely that the call to action is the aside itself. Once again, that depends on the content (what it is) and context (where it is in relation to other content).
Typically "call to action" is a link somewhere, so the obvious answer to me is using an anchor, <a>.
It's just a link to another page. Use a div.

RDFa Snippet Generator from GoodRelations

I've created an RDFa snippet to use on a client's website using the GoodRelations tool. The generated code creates the tags as expected, but there's no text between the divs, for instance:
<div typeof="vcard:Address">
<div property="vcard:locality" content="Yorba Linda"></div>
</div>
I'm assuming that this is OK, and that I am expected to put descriptive text for humans between the 'locality' divs without any adverse effects (in relation to SEO.) Correct?
As William says: In most cases, it is impractical to reuse visible content for publishing meta-data, because they differ in sequence or structure. In that case, it is better to put all meta-data in a single block of <div> elements without visible content. This is called "RDFa in Snippet Style", see
http://www.ebusiness-unibw.org/tools/rdf2rdfa/
Hepp, Martin; García, Roberto; Radinger, Andreas: RDF2RDFa: Turning RDF into Snippets for Copy-and-Paste, Technical Report TR-2009-01, 2009. PDF at http://www.heppnetz.de/files/RDF2RDFa-TR.pdf
Google is consuming such markup, despite a general preference for marking up visible content. Many big shops are using this approach with good results, e.g. http://www.rachaelraystore.com/Product/detail/Rachael-Ray-Stoneware-2-pc-Bubble-Brown-Baker-Set-Eggplant/316398
So if you can integrate the visible content and the RDFa constructs, then use
<div typeof="vcard:Address">
<div property="vcard:locality">Yorba Linda</div>
</div>
If you cannot, then use
<div typeof="vcard:Address">
<div property="vcard:locality" content="Yorba Linda"></div>
</div>
...
<div>
<div>Yorba Linda</div>
</div>
But the divs with invisible content must be close to the visible content, and should preferably be placed before rather than after the visible markup.
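To illustrate why both variants carry the same data, here is a rough sketch of how a consumer might read the value of a property: from the content attribute when it is present, otherwise from the element's visible text. It is a simplification written with Python's standard html.parser, not a real RDFa processor; PropertyExtractor and extract_properties are names invented for this example, and it assumes well-formed markup without void elements inside the property element.

```python
from html.parser import HTMLParser


class PropertyExtractor(HTMLParser):
    """Collects (property, value) pairs from RDFa-style markup (simplified, hypothetical)."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self.current = None   # property whose text content is being collected
        self.depth = 0        # nesting depth inside that property element
        self.text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.current is not None:
            self.depth += 1   # child element inside the property element
            return
        prop = attrs.get("property")
        if prop is None:
            return
        if attrs.get("content") is not None:
            # Snippet style: the value lives in the content attribute.
            self.pairs.append((prop, attrs["content"]))
        else:
            # Inline style: the value is the element's visible text.
            self.current, self.depth, self.text = prop, 1, []

    def handle_endtag(self, tag):
        if self.current is None:
            return
        self.depth -= 1
        if self.depth == 0:
            self.pairs.append((self.current, "".join(self.text).strip()))
            self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.text.append(data)


def extract_properties(html):
    parser = PropertyExtractor()
    parser.feed(html)
    parser.close()
    return parser.pairs
```

Under these assumptions, the visible-content markup and the snippet-style markup above both yield the pair vcard:locality / Yorba Linda; the only difference is whether a human reader sees the text.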
From an RDFa point of view, it is fine (I am assuming you are using braces because you don't know how to escape greater than / less than characters).
The only thing you need to think about is how adding this fragment of HTML to your HTML document will affect the rendering. Based on the fact that you are using the content attribute, this fragment is destined to remain hidden, so you should think about this in relation to your CSS architecture. My advice would be to create a specific CSS class for annotations.
Having spoken to the author of Good Relations, his advice would be to put this fragment before any other HTML element in the body of your document. Generally, the Rich Snippets team indicate that they ignore hidden RDFa, but it doesn't actually matter and really in the long run it enables the publishing of RDF to anyone (not only Google) who wants to consume it.

Safe hidden text in HTML?

I need to have some hidden text in HTML that I can parse as text when I read the actual HTML file.
I used to include my text in a hidden div using a style, but I learned that this may get us flagged as spammers for SEO purposes:
.hideme {
position : absolute;
left : -1000px;
}
Can I have this content as commented text in the HTML?
Will that be safe, given that, as far as I know, SEO crawlers ignore comments in HTML?
<!-- my hidden text -->
Please advise.
The search engines only care about hidden text when it is used to manipulate a page's rankings. Typically this is defined as content that is presented to the search engines that is not presented to users. So if you hide text so users can't see it but crawlers can you will find yourself having issues with Google. An example of when hiding text is good is when you use display:none to hide dynamic content and then use JavaScript or CSS to display the content when an action is performed (i.e. mouseover, etc).
If you place this extra content within comments as you suggest in your question you will be fine as this content is not available to users and search engines ignore comments.
Try to avoid "hide" in naming your CSS class.
But the best way is to avoid hidden text by finding easy and creative ways to add the text to the content of a web-page without seeming like spam.
You can't parse HTML comments, so use a hidden field instead:
<input type="hidden" value="my text" name="my_hidden_field"/>
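If you go the hidden-field route, reading the value back when you parse the file is straightforward. The following is only a sketch using Python's standard html.parser; HiddenFieldReader and read_hidden_fields are names invented for this example, and your own parsing stack may differ.

```python
from html.parser import HTMLParser


class HiddenFieldReader(HTMLParser):
    """Collects name -> value for every <input type="hidden"> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />.
        attrs = dict(attrs)
        if tag == "input" and (attrs.get("type") or "").lower() == "hidden":
            self.fields[attrs.get("name")] = attrs.get("value")


def read_hidden_fields(html):
    reader = HiddenFieldReader()
    reader.feed(html)
    reader.close()
    return reader.fields
```

For the field above, read_hidden_fields should return a dictionary that maps my_hidden_field to "my text".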
Some people do this for SEO:
.hideme {
width:0px;
overflow:hidden;
text-indent:-99999px
}
Why not use <meta> tags?

Alter Rendered Page in Webbrowser Control

Is there a way to alter the rendered HTML page in a WebBrowser control? What I need is to alter the rendered HTML page in my WebBrowser control to highlight selected text.
What I did is use a WebClient and call WebClient.DownloadString() to get the source code of the page, highlight the specific text, then write it back into the WebBrowser. The problem is that the images on that page do not appear, since they are referenced by relative paths.
Is there a way to solve this problem? Is there a way to detect images in a WebBrowser control?
Not sure why you need to change the HTML to highlight text; why not use IHighlightRenderingServices?
To specify a base url when loading HTML string you need to use the document's IPersistMoniker interface and specify a url in your IMoniker implementation.
I suggest you do it a different way: download the page and replace the text using the WebBrowser control itself, so that your links will work. All you do is replace whatever is in the search TextBox with the following. Say the search term is "hello"; then you replace all occurrences of hello with the following:
<font color="yellow">hello</font>
Of course, this HTML can be replaced with the SPAN tag (which is an inline version of the DIV tag, so your lines wont break using SPAN, but will using DIV). But in either case, both these tags have a style attribute, where you can use CSS to change its color or a zillion other properties that are CSS compatible, like follows:
<SPAN style="background-color: yellow;">hello</SPAN>
Of course, there are a zillion other ways to change color using HTML, feel free to search the web for more if you want.
Now, you can use the .Replace() function in .NET to replace the searched text; it's very easy. You get the whole document as a string using .DocumentText, and once all occurrences are replaced (using .Replace()), you set the result back to .DocumentText (so you're using .DocumentText to get the original string, and setting .DocumentText with the replaced string). Of course, you probably don't want to do this to text inside the actual HTML markup, so you can instead loop through all the elements on the page with a For Each loop like below:
For Each someElement as HTMLElement in WebBrowser1.Document.All
And each element will have a .InnerText/.InnerHTML and .OuterText/.OuterHTML that you can Get (read from) and Set (overwrite with replaced text).
Of course, for your needs, you'd probably just want to be replacing and overwriting the .InnerText and/or the .OuterText.
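The point about not touching the markup itself is easier to see in code. Below is a language-neutral sketch of the idea in Python (standard library only, not the WebBrowser control's API): it re-emits every tag exactly as it was, so relative image paths and attributes stay untouched, and wraps the search term only where it occurs in text. The names Highlighter and highlight are invented for this example.

```python
import html
import re
from html.parser import HTMLParser


class Highlighter(HTMLParser):
    """Re-emits a document, wrapping matches of `term` found in text nodes (hypothetical)."""

    def __init__(self, term):
        super().__init__(convert_charrefs=True)
        # Text is re-escaped before matching, so escape the term the same way.
        self.pattern = re.compile(re.escape(html.escape(term, quote=False)))
        self.raw_depth = 0   # > 0 while inside <script> or <style>
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.raw_depth += 1
        self.out.append(self.get_starttag_text())   # original tag text, untouched

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.raw_depth:
            self.raw_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.raw_depth:
            self.out.append(data)                   # never alter scripts or styles
            return
        text = html.escape(data, quote=False)
        self.out.append(self.pattern.sub(
            r'<span style="background-color: yellow;">\g<0></span>', text))

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def highlight(markup, term):
    parser = Highlighter(term)
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

Note that this is only a sketch: a term that happens to match part of an escaped entity (for example "amp" inside &amp;) would be wrapped incorrectly, and matching is case-sensitive.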
If you need more help, let me know. In either case, i'd like to know how it worked out for you anyway, or if there is anything more any of us can do to add value to your problem. Cheers.