In my database I have a field which contains an HTML document. I need to be able to search within this document, but the HTML tags themselves must not be matched. So when I have something like this:
<html>
<head>
<title>Bar</title>
</head>
<body>
<p>
this content may be found
</p>
</body>
</html>
It is possible that the document stored in the database is not XHTML. Can you tell me what the best way is to search the content? Should I use regular expressions? If so, what would that look like? And if not, what else should I use?
You could try turning on Full-Text Search or use something like Lucene.Net to index the content for you.
What volume of records are we talking about? I expect you might have to use full-text search and an IFilter to do this efficiently. HTML does not lend itself well to regex - it can quickly become very hard to do something very simple.
If the volume isn't huge, you could iterate over the records with an external parsing application, using something like the HTML Agility Pack (for .NET) - or any other DOM parser of your choice; see the sketch below.
But the FTS/IFilter would be my first choice.
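To illustrate the "any other DOM parser of your choice" route, here is a minimal sketch of stripping the markup before searching, using Jsoup in Java (the class name and sample document are invented; Jsoup is lenient, so non-XHTML input is fine):

import org.jsoup.Jsoup;

public class HtmlSearchSketch {

    // Parse the markup and keep only the readable text; tags are dropped.
    static String visibleText(String html) {
        return Jsoup.parse(html).body().text();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Bar</title></head>"
                    + "<body><p>this content may be found</p></body></html>";
        String text = visibleText(html);

        // A naive substring search; a real solution would index this text.
        System.out.println(text.contains("content may be found")); // true
        System.out.println(text.contains("<p>"));                  // false - tags are gone
    }
}

The same extract-text-first step is what an IFilter or a full-text indexer would do for you at index time.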
I'm new to i18n, and when I typed it into the search bar, i18next was among the top results.
I have already done my research on i18n and how to use it, but it is still not clear to me. All I know is that to make your web app available in other languages, you need to create a JSON file that contains the keys and values for your app's text, and you need to add a script for the i18n.
The rest is still confusing to me. This might sound like a stupid question, but I just can't understand how it works.
1) I'm not sure, but based on my observation, you only create a JSON translation for elements whose value or text will be shown on the page. Correct? Suppose that in the HTML file I have text that is not inside a label or set via innerHTML, for example:
<html>
<body>
**How are we going to translate this text? What key am I going to use?**
</body>
</html>
What do I need to do to translate this text?
2) What should we use as the key? An id? A class? A tag? I've seen different examples, and each uses a different one of these. When is the right time to use each?
3) Regarding the key-value pairs, what if they come from the server? What is the syntax for this?
4) When do we need a multi-line JSON file?
i18n is a big topic, with a lot of solutions depending on what kind of web app you are trying to internationalize / localize. Unfortunately, i18next's documentation is not very good, and it has next to nothing in the way of tutorials.
That said, you might be best off taking a look at the sample app in i18next's GitHub repository here: https://github.com/jamuhl/i18next/tree/master/sample/static. It does give some examples of how i18next can be used to replace HTML text with localized versions of the same. To answer some of your questions:
1) There are a few ways of doing this. The sample script replaces much of the data by using the jQuery .text call -- something like this: $('#MyHTMLID').text($.t('ns.common:MyLocalizedTextForMyHTMLID'));. Any HTML inside the id "MyHTMLID" is replaced by the localized data for the key "MyLocalizedTextForMyHTMLID" by the i18next .t call.
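So for the bare text in your example, the usual move is to wrap it in an element you can address and localize that element. A sketch using the same .text pattern as above (the id and key names here are invented):

<html>
<body>
<span id="question1">How are we going to translate this text?</span>
</body>
</html>

$('#question1').text($.t('ns.common:question1'));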
2) A lot of these decisions are just convention. Keep it simple, be consistent.
3) Normally in a web app the JSON file is on the server, in a locales subdirectory of the directory where your HTML resides. Take a look at that i18next example for how it's laid out.
4) When you're first building your web app, use a multi-line JSON file so you are able to troubleshoot. You can compress it later using something like http://jsonformatter.curiousconcept.com/.
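For instance, a readable multi-line locale file might look like this while you develop (the keys and values are made up):

{
  "greeting": "How are you?",
  "farewell": "See you soon"
}

The same content collapses to a single line, {"greeting":"How are you?","farewell":"See you soon"}, once you compress it.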
Hope this helps get you started!
I want to index text from HTML in Lucene. What is the best way to achieve this?
Is there any good contrib module for Lucene that can do this?
EDIT: I finally ended up using the Jericho Parser. It doesn't create a DOM and is easy to use.
I'm assuming that you don't actually want to index the HTML tags. If that's the case, you can first extract text from HTML using Apache Tika. Then you can index the text in Lucene.
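A minimal sketch of that extraction step with Tika's facade class (the file name is made up):

import java.io.File;
import org.apache.tika.Tika;

public class ExtractTextWithTika {
    public static void main(String[] args) throws Exception {
        // Tika detects the content type and returns only the text, tags stripped.
        String text = new Tika().parseToString(new File("page.html"));
        System.out.println(text); // this is what you hand to Lucene
    }
}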
I would recommend using the Jsoup HTML parser to extract the text and then using Lucene. It worked well for me.
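A sketch of that route, assuming a modern Lucene (5+) API and an on-disk index directory (the field name, paths and sample document are illustrative):

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.jsoup.Jsoup;

public class IndexHtml {
    public static void main(String[] args) throws Exception {
        String html = "<html><body><p>some text to index</p></body></html>";
        String text = Jsoup.parse(html).text(); // tags stripped, text only

        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            // TextField is tokenized, so the content becomes searchable.
            doc.add(new TextField("content", text, Field.Store.YES));
            writer.addDocument(doc);
        }
    }
}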
You might also want to take a look at /Lucene-3.0.3/src/demo which has an HTML parser example.
I need to download a page whose URL I have to extract from another page's source code. For example:
<span id="businessNumOnMap" class="resultNumberOnMap" style="display:none;"></span><span>Cellini's Italian Restaurant
I want to download the "/len/aaproximat...php" page. I didn't find a suitable regex for it, and I need to download that page. Can anyone help?
I'm using VB.NET.
Normally it's not recommended to parse HTML with a regex, except perhaps when it's a simple page whose format you know. The Html Agility Pack is often recommended for this purpose instead.
Be aware, though, that if you're parsing a page that's on the internet, the site in question might have T&Cs covering the usage of their data that you need to follow to stay legal.
Do you want to download the PHP file itself, with all its source code, rather than only the HTML it outputs? If that's the case, it's not possible.
Use the WebClient.DownloadString method for downloading. If you haven't found a suitable expression to extract that span from the source, then build your own.
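There's no VB.NET sample here, but as a sketch of the parse-then-download idea (shown in Java with Jsoup; the URL and selector are invented, and the Html Agility Pack offers the same pattern in .NET):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DownloadLinkedPage {
    public static void main(String[] args) throws Exception {
        // Fetch the page that contains the link (URL is made up).
        Document page = Jsoup.connect("http://example.com/results.html").get();
        // Pick the anchor out of the DOM instead of regexing the raw source.
        Element link = page.selectFirst("a[href$=.php]");
        if (link != null) {
            String target = link.absUrl("href"); // resolves relative URLs
            String body = Jsoup.connect(target).get().html();
            System.out.println(body);
        }
    }
}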
I have a large amount of non-compliant HTML stored in database tables that I need to make valid.
I thought of pulling it into an inline editor like X-Standard that would do a conversion, but is there an easier way to do this via VB.NET?
I would look into HTML Tidy.
From Tidy's documentation:
Tidy reads HTML, XHTML and XML files and writes cleaned up markup. For HTML variants, it detects and corrects many common coding errors and strives to produce visually equivalent markup that is both W3C compliant and works on most browsers. A common use of Tidy is to convert plain HTML to XHTML.
HTML Tidy is probably the best option.
If it's for a one-off conversion it might be easier to use a PHP script (where Tidy is built in) to do the work; otherwise you'll have to wrap a COM object instead to use it with VB.NET (more info here if you want to do that).
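Another option, if wrapping COM is unappealing: Tidy has library ports, such as JTidy for Java. A minimal sketch of cleaning one file with it (the file names are made up):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.w3c.tidy.Tidy;

public class TidyUp {
    public static void main(String[] args) throws Exception {
        Tidy tidy = new Tidy();
        tidy.setXHTML(true); // emit XHTML rather than plain HTML
        try (FileInputStream in = new FileInputStream("dirty.html");
             FileOutputStream out = new FileOutputStream("clean.html")) {
            tidy.parse(in, out); // detects and corrects common markup errors
        }
    }
}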
By embedding a WYSIWYG editor (TinyMCE) on a detail page, I was able to load the bad HTML and let the editor do the work of producing code that is very close to valid.
I observed that StackOverflow uses two types of links to the same question: a short URL containing just the question id, and a longer one with descriptive keywords appended. Both display the same link text, e.g. "Should I list PDFs in my sitemap file?".
The idea is clear: add keywords into the URL and have search engines pick up the page faster.
But shouldn't Google penalize the duplicate content in this case?
I'm trying to understand what is more helpful since we have a similar situation on our site.
The source code has the answer:
<link rel="canonical" href="http://stackoverflow.com/questions/1072880/sitemap-xml">
<link rel="alternate" type="application/atom+xml" title="Feed for question 'Sitemap xml'" href="/feeds/question/1072880">
rel = alternate/canonical
The idea is clear: add keywords into the URL and have search engines pick up the page faster.
It actually has nothing to do with the speed of indexing; it does, however, affect ranking.
As pointed out above, the best approach when dealing with duplicated content within the same domain is to put a canonical element pointing to the preferred URL.