Select specific elements from a website in VB.net (web scraping) - vb.net

I found a website where I can look up vehicle inspections in Denmark. I need to extract some information from the page and loop through a series of license plates. Let's take this car as an example: http://selvbetjening.trafikstyrelsen.dk/Sider/resultater.aspx?Reg=as87640
In the table on the left you can see some basic information about the vehicle. On the right, you can see a list of the inspections for this specific car. I need a script that checks whether the car has any inspections and then grabs the link to each of the inspection reports. Let's take the first inspection from the example. I would like to extract the onclick text from each of the inspections.
The first inspection link would be:
location.href="/Sider/synsrapport.aspx?Inspection=18014439&Vin=VF7X1REVF72378327"
or, if possible, extract the Inspection ID and Vin variables from the URL directly:
Inspection ID: 18014439
Vin: VF7X1REVF72378327
Here is an example of a car which doesn't have any inspections yet, if you want to see what that looks like: http://selvbetjening.trafikstyrelsen.dk/Sider/resultater.aspx?Reg=as87400
Current solution plan:
Download the HTML source code as a String in VB.net.
Search the string and extract the specific parts.
Store the results in a StringBuilder and upload them to my SQL server.
Is this the most efficient way, or do you know of any libraries that are made specifically for extracting elements from a website in VB.net? Thanks!

You could use the Java libraries HtmlUnit or Jsoup to scrape the page.
Here's an example using HtmlUnit:
import java.util.HashMap;
import java.util.Map;
import java.util.logging.Level;

import org.apache.commons.logging.LogFactory;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlTable;
import com.gargoylesoftware.htmlunit.html.HtmlTableRow;

// Silence HtmlUnit's very verbose logging
LogFactory.getFactory().setAttribute("org.apache.commons.logging.Log", "org.apache.commons.logging.impl.NoOpLog");
java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("org.apache.commons.httpclient").setLevel(Level.OFF);

WebClient client = new WebClient(BrowserVersion.CHROME);
client.getOptions().setJavaScriptEnabled(true);
client.getOptions().setThrowExceptionOnScriptError(false);
client.getOptions().setThrowExceptionOnFailingStatusCode(false);

HtmlPage page = client.getPage("http://selvbetjening.trafikstyrelsen.dk/Sider/resultater.aspx?Reg=as87640");
HtmlTable inspectionsTable = (HtmlTable) page.getElementById("tblInspections");

Map<String, String> inspections = new HashMap<String, String>();
for (HtmlTableRow row : inspectionsTable.getRows()) {
    // onclick looks like: location.href="/Sider/synsrapport.aspx?Inspection=18014439&Vin=VF7X1REVF72378327"
    String[] splitRow = row.getAttribute("onclick").split("=");
    if (splitRow.length >= 4) {
        String id = splitRow[2].split("&")[0];
        String vin = splitRow[3].replace("\"", "");
        inspections.put(id, vin);
        System.out.println(id + " " + vin);
    }
}
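Since the question asks about VB.net/.NET specifically: Html Agility Pack is a commonly used .NET library for exactly this kind of element extraction. Below is a rough C# sketch of the same idea (it translates directly to VB.net). It assumes the inspections table keeps the id tblInspections from the example above and that the page renders without JavaScript, both of which you would need to verify:
using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

class InspectionScraper
{
    static void Main()
    {
        var web = new HtmlWeb();
        var doc = web.Load("http://selvbetjening.trafikstyrelsen.dk/Sider/resultater.aspx?Reg=as87640");

        // Rows of the inspections table that carry an onclick attribute
        var rows = doc.DocumentNode.SelectNodes("//table[@id='tblInspections']//tr[@onclick]");
        if (rows == null)
        {
            Console.WriteLine("No inspections found for this car.");
            return;
        }
        foreach (var row in rows)
        {
            var onclick = row.GetAttributeValue("onclick", "");
            // onclick: location.href="/Sider/synsrapport.aspx?Inspection=18014439&Vin=VF7X1REVF72378327"
            var m = Regex.Match(onclick, @"Inspection=(\d+)&Vin=([A-Za-z0-9]+)");
            if (m.Success)
            {
                Console.WriteLine("Inspection ID: " + m.Groups[1].Value + ", Vin: " + m.Groups[2].Value);
            }
        }
    }
}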

Related

Lucene calculate term vectors for existing index

With Lucene.net I would like to get the term vectors as described in this stackoverflow question.
The problem is, the index is already generated with the field indexed and stored, but without term vectors.
FieldType type = new FieldType();
type.IsIndexed = true;
type.IsStored = true;
type.StoreTermVectors = false;
Theoretically, it should be possible to re-calculate the term vectors for each document and then store them in the index.
Do you know how this could be done without rebuilding the complete Lucene index?
As mentioned in my comments in the question, you can generate term vector data on-the-fly, which may help you to avoid a complete rebuild of your indexed data.
In my scenario, I want to find the offset positions of my search term in the matched document.
I don't want to oversell this approach - it's absolutely not a substitute for re-indexing - but if your queries are basic, it may help.
Step 1: Perform whatever query you are currently performing.
For each document in the list of hits, you will then need to re-process the relevant field from that document - so, either you already have the field data stored in your existing index, or you will need to retrieve it from its original source.
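For instance, retrieving the stored field from each hit might look like this (a minimal sketch, assuming an existing Lucene.NET 4.8 IndexSearcher named searcher and that the field was stored under the hypothetical name "content"):
var hits = searcher.Search(query, 100).ScoreDocs;
foreach (var hit in hits)
{
    var storedDoc = searcher.Doc(hit.Doc);
    // Works only if the field was stored; otherwise fetch it from the original source
    string fieldContent = storedDoc.Get("content");
    // ... feed fieldContent into the token-stream loop shown in step 2 below
}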
Step 2: For each such field, you can re-use the same analyzer to build a token stream on-the-fly. The token stream can be configured with different attributes, such as:
char term attributes (the text of each token)
offset attributes (the start/end character positions of each token)
and others (see here)
Example:
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

const LuceneVersion AppLuceneVersion = LuceneVersion.LUCENE_48;

String? fieldName = null;
String fieldContent = "Foo Bar Baz Bar Bat";
String searchTerm = "bar";

var analyzer = new StandardAnalyzer(AppLuceneVersion);
var ts = analyzer.GetTokenStream(fieldName, fieldContent);

// The attributes are populated as the stream advances token by token
var charTermAttr = ts.AddAttribute<ICharTermAttribute>();
var offsetAttr = ts.AddAttribute<IOffsetAttribute>();

try
{
    ts.Reset();
    Console.WriteLine("");
    Console.WriteLine("Token: " + searchTerm);
    while (ts.IncrementToken())
    {
        // StandardAnalyzer lowercases tokens, so the search term is compared in lowercase
        if (searchTerm.Equals(charTermAttr.ToString()))
        {
            var start = offsetAttr.StartOffset;
            var end = offsetAttr.EndOffset;
            Console.WriteLine(String.Format(" > offset: {0}-{1}", start, end));
        }
    }
    ts.End();
}
finally
{
    ts.Dispose();
}
The above example assumes one of the hits from step 1 was a field containing "Foo Bar Baz Bar Bat" - with a search term of bar.
The output generated is:
Token: bar
> offset: 4-7
> offset: 12-15
So, as you can see, you are not re-executing a query - you are just re-processing a token stream. The more complex the original search term is, the harder it will be to make this approach work the way you probably need it to.

Generate proper report when data is not in database [Telerik]

I'm generating a Telerik report for
"Number of students by level of education, field and sex"
Here is the SQL query that I'm using to create this report:
SELECT
[tbl_hec_ISCED].[ISCED_ID],
[tbl_hec_ISCED].[ISCED_Level],
[tbl_hec_Programme].[ISCED_ID] AS 'tbl_hec_ProgrammeISCED_ID',
[tbl_hec_Programme].[Programme_ID],
[tbl_hec_Programme].[Specialisation_ID_Number],
[tbl_hec_specialisation].[Rank_ID_Number],
[tbl_hec_specialisation].[Rank_Title],
[tbl_HEI_student].[Programme_ID] AS 'tbl_HEI_studentProgramme_ID',
[tbl_HEI_student].[Gender]
FROM ((([tbl_HEI_student]
FULL OUTER JOIN [tbl_hec_Programme]
ON [tbl_HEI_student].[Programme_ID] = [tbl_hec_Programme].[Programme_ID])
FULL OUTER JOIN [tbl_hec_specialisation]
ON [tbl_hec_Programme].[Specialisation_ID_Number] = [tbl_hec_specialisation].[Rank_ID_Number])
FULL OUTER JOIN [tbl_hec_ISCED]
ON [tbl_hec_Programme].[ISCED_ID] = [tbl_hec_ISCED].[ISCED_ID])
WHERE [tbl_HEI_student].[Gender] IN ('Male', 'Female')
  AND [tbl_hec_ISCED].[ISCED_Level] IN ('5', '6', '7', '8')
I'm getting a null report since some values are not in the database (see the attached picture of the current view).
I want to generate the report even when there is no data in the database, with zero values for the null rows (see the attached picture of the expected report output).
How can I overcome this challenge?
My suggestion is to use document-merging functionality: a template with the report layout already set up, plus value placeholders (fields) that are replaced by calculated values.
I'm using the Aspose.Words library in my projects, but I'm sure there are others out there, maybe even Telerik Reporting. This is very simple functionality, so any reporting tool that can do complex things should be able to do this as well.
Here's some example code for Aspose. Other libraries will have different implementations.
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Aspose.Words;

void GenerateDocument(string templateFilePath, Dictionary<string, object> fieldNamesAndValues)
{
    // Obtain the template file
    if (!File.Exists(templateFilePath))
    {
        throw new FileNotFoundException("Template not found", templateFilePath);
    }
    Document output = new Document(templateFilePath);
    // Merge the provided values into the matching fields of the template
    output.MailMerge.Execute(fieldNamesAndValues.Keys.ToArray(), fieldNamesAndValues.Values.ToArray());
    // Save the document into a stream as PDF
    MemoryStream stream = new MemoryStream();
    output.Save(stream, SaveFormat.Pdf);
    // You can then do whatever you want with the stream:
    // save it to disk or push it to the browser for download
}
Using your expected result as an example, let's assume these are the names of your placeholders (fields) for the first row:
MALE_EDUCATION_ISCED5, MALE_EDUCATION_ISCED6, MALE_EDUCATION_ISCED7
You can then generate your report like so:
Dictionary<string, object> fieldsAndValues = new Dictionary<string, object>();
fieldsAndValues.Add("MALE_EDUCATION_ISCED5", calculatedValue1);
fieldsAndValues.Add("MALE_EDUCATION_ISCED6", calculatedValue2);
fieldsAndValues.Add("MALE_EDUCATION_ISCED7", calculatedValue3);
// and so on for other fields
GenerateDocument("~/Templates/Report.docx", fieldsAndValues);
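To get the zeros the question asks for, one option is to pre-fill every placeholder with 0 and only overwrite the ones your query actually returned data for. A sketch, reusing the hypothetical field names from above:
var fieldsAndValues = new Dictionary<string, object>();
// Default every placeholder to 0 so rows with no matching data render as zeros
foreach (var field in new[] { "MALE_EDUCATION_ISCED5", "MALE_EDUCATION_ISCED6", "MALE_EDUCATION_ISCED7" /* ... */ })
{
    fieldsAndValues[field] = 0;
}
// Then overwrite with the real counts wherever the SQL query returned rows
fieldsAndValues["MALE_EDUCATION_ISCED5"] = calculatedValue1;
GenerateDocument("~/Templates/Report.docx", fieldsAndValues);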
More info on how to add fields in Microsoft Word:
https://support.office.com/en-us/article/7e9ea3b4-83ec-4203-9e66-4efc027f2cf3
More info on Aspose MailMerge:
http://www.aspose.com/docs/display/wordsnet/How+to++Execute+Simple+Mail+Merge

Read all documents with a particular category name using Alfresco search.luceneSearch or search.lib.js

Category name
Geography (8)
Study Db (18)
I am implementing my own advanced search in Alfresco. I need to read all files related to a particular category.
Example:
If there are 20 files under Geography, the Lucene query should find the documents under it that match the search keyword "banana".
Further explanation:
I am using search.lib.js to search. I would like to analyze the result to find out which category each document belongs to. For example, I would like to know how many documents belong to the category Languages and its subcategories. I experimented with the Classification API but I don't get the result I want. Any idea how to go through the result to get the category name of each document?
Is there any simple method like node.properties["cm:creator"]?
Thanks
Janaka
I think you should clarify your question:
Are you using cm:content or a customized content type?
Are you going to search for the keyword inside the content of the file, or in specific metadata field(s)?
Do you want to create a webscript (Java or JavaScript)?
One thing to take into consideration:
if you use +PATH:"cm:generalclassifiable/...." for the categorization in your Lucene queries, performance will be slow (in my experience).
You can use, for example, the following query to find all nodes at any depth below cm:Languages:
var results = search.luceneSearch("+PATH:\"cm:generalclassifiable/cm:Languages//*\"");
Take a look to this url: https://wiki.alfresco.com/wiki/Search#Path_Queries
Once you have all the elements, you can loop over them and find out which category each one belongs to. You will need to keep a counter per category/subcategory, for example:
var counts = {};
for (var i = 0; i < results.length; i++) {
    var node = results[i];
    // cm:categories is multi-valued, so it returns an array of category nodes
    var categories = node.properties["cm:categories"];
    for (var j = 0; categories != null && j < categories.length; j++) {
        var categoryName = categories[j].properties["cm:name"];
        counts[categoryName] = (counts[categoryName] || 0) + 1;
    }
}
This is not exactly the solution, but can be a useful idea to start.
Sorry if it's not what you're asking for, I have just arrived from my holidays.

Lucene.net PerFieldAnalyzerWrapper

I've read up on how to use the per-field analyzer wrapper, but I can't get it to work with a custom analyzer of mine. I can't even get the analyzer to run its constructor, which makes me believe I'm actually calling the per-field analyzer incorrectly.
Here's what I'm doing:
Create the per field analyzer:
PerFieldAnalyzerWrapper perFieldAnalyzer = new PerFieldAnalyzerWrapper(srchInfo.GetAnalyzer(true));
perFieldAnalyzer.AddAnalyzer("<special field>", dta);
Add all the fields to the document as usual, including a special field that we analyze differently.
And add document using the analyzer like this:
iw.AddDocument(doc, perFieldAnalyzer);
Am I on the right track?
The problem was related to my reliance on the CMS's (Kentico) built-in Lucene helper classes. Basically, using those classes you need to specify the custom analyzer at index level through the CMS, and I did not wish to do that. So I ended up using Lucene.net directly almost everywhere, gaining the flexibility of using any custom analyzer I want.
I also made some changes to how I structure the data and ended up using the tried-and-true KeywordAnalyzer to analyze document tags. Previously I was trying to do some custom tokenization magic on comma-separated values like [tag1, tag2, tag with many parts] and could not get it working reliably with multi-part tags. I still kept that field, but started adding multiple "tag" fields to the document, each storing one tag. So now I have N "tag" fields for N tags, each analyzed as a keyword, meaning each tag (one word or many) is a single token.
I think I overthought it with my initial approach.
Here is what I ended up with.
On Indexing:
KeywordAnalyzer ka = new KeywordAnalyzer();
PerFieldAnalyzerWrapper perFieldAnalyzer = new PerFieldAnalyzerWrapper(srchInfo.GetAnalyzer(true));
perFieldAnalyzer.AddAnalyzer("documenttags_t", ka);
// Some procedure to compile all documents by reading from the DB and putting them into Lucene docs
foreach(var doc in docs)
{
iw.AddDocument(doc, perFieldAnalyzer);
}
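For reference, the per-tag fields themselves might be added like this: a minimal sketch of the "one field instance per tag" idea described above, using the Lucene.NET 3.x-style API and hypothetical tag values:
using Lucene.Net.Documents;

var doc = new Document();
foreach (var tag in new[] { "tag1", "tag2", "tag with many parts" })
{
    // One field instance per tag; with KeywordAnalyzer each value stays a single token
    doc.Add(new Field("documenttags_t", tag, Field.Store.YES, Field.Index.ANALYZED));
}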
On Searching:
KeywordAnalyzer ka = new KeywordAnalyzer();
PerFieldAnalyzerWrapper perFieldAnalyzer = new PerFieldAnalyzerWrapper(srchInfo.GetAnalyzer(true));
perFieldAnalyzer.AddAnalyzer("documenttags_t", ka);
string baseQuery = "documenttags_t:\"" + tagName + "\"";
Query query = _parser.Parse(baseQuery);
var results = _searcher.Search(query, sortBy);
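One detail not shown above: for the keyword field to be handled consistently at search time, _parser should be constructed with the same wrapper. A minimal sketch (Lucene.NET 3.x-style; the version constant is an assumption):
var _parser = new Lucene.Net.QueryParsers.QueryParser(Lucene.Net.Util.Version.LUCENE_30, "documenttags_t", perFieldAnalyzer);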

How do you get Endeca to search on a particular target field rather than across all indexed fields?

We have an Endeca index configured across multiple fields of email content - subject and body. But we only want searches to be performed on the subject lines. Endeca is returning matches within the bodies too. How do you limit the search to the subject?
You can search a specific field or fields by specifying it (or them) with the Ntk parameter.
Or, if you wish to search a specific group of fields frequently, you can set up a search interface (also specified with the Ntk parameter) that includes that group of fields.
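In URL terms the field search is an Ntk/Ntt pair. The key name "subject" below is an assumption; use whatever property or search interface name your index actually defines:
N=0&Ntk=subject&Ntt=banana
Multiple keys can be piped together (e.g. Ntk=subject|body&Ntt=banana|banana) if you ever need to widen the search again.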
This is how you can do it using the presentation API:
final ENEQuery query = new ENEQuery();
final DimValIdList dimValIdList = new DimValIdList("0");
query.setNavDescriptors(dimValIdList);
final ERecSearchList searches = new ERecSearchList();
final StringBuilder builder = new StringBuilder();
for (final String productId : productIds) {
    builder.append(productId);
    builder.append(" ");
}
// "product.id" is the search key in this example; substitute the key for your subject field
final ERecSearch eRecSearch = new ERecSearch("product.id", builder.toString().trim(), "mode matchany");
searches.add(eRecSearch);
query.setNavERecSearches(searches);
Please see this post for a complete example.
Use Search Interfaces in Developer Studio.
Refer - http://docs.oracle.com/cd/E28912_01/DeveloperStudio.612/pdf/DevStudioHelp.pdf#page=209