How to get a webgraph in Apache Nutch?

I have generated the webgraph db in Apache Nutch using the command 'bin/nutch webgraph -segmentDir crawl/segments -webgraphdb crawl/webgraphdb'. It generated three folders in crawl/webgraphdb: inlinks, outlinks and nodes. Each of those folders contains two binary files, data and index. How do I get a visual web graph in Apache Nutch? What is the web graph used for?

The webgraph is intended to be one step in a score calculation based on the link structure:
webgraph generates the link data structures for the specified segment(s)
linkrank calculates a score for each node based on those structures
scoreupdater writes the scores from the webgraph back into the crawldb
Be aware that this program is very CPU/IO intensive and that it ignores a website's internal links by default.
You can use the nodedumper command to extract useful data from the webgraph, including the actual score of each node and the highest-scored inlinks/outlinks. This output is not intended to be visualized, but you could parse it and generate any visualization you need.
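Put together, a typical run looks something like this (a sketch only; the crawl/ paths come from the question above, and exact flags can vary between Nutch versions, so run bin/nutch <command> with no arguments to check the usage):

    # 1. Build the webgraph from the crawled segments
    bin/nutch webgraph -segmentDir crawl/segments -webgraphdb crawl/webgraphdb
    # 2. Calculate LinkRank scores from the graph
    bin/nutch linkrank -webgraphdb crawl/webgraphdb
    # 3. Write the scores back into the crawldb
    bin/nutch scoreupdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb
    # 4. Dump the top 100 nodes by score into a plain-text file you can parse
    bin/nutch nodedumper -webgraphdb crawl/webgraphdb -scores -topn 100 -output crawl/webgraphdb/dump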
That being said, since Nutch 1.11 the index-links plugin has been available, which allows you to index the inlinks and outlinks of each URL into Solr/ES. I've used this plugin to index into Solr and, together with the sigmajs library, generated graph visualizations of the link structure of my crawls; perhaps this could suit your needs.

Related

How to compare content between two web pages in different environments?

We are in the process of rebuilding an existing website from scratch. The new site is meant to be an identical copy, and since it contains many pages we need a way to compare content between the two sites. Doing this manually is of course possible, but it takes a lot of time and carries a risk of human error.
I have seen services that offer this: you input two URLs, they are analyzed, and discrepancies are presented. However, these cannot be used because our test environment is local (built in Sitecore).
Is there a way to solve this without making our test environment available online (which is not possible)? For example, does software exist for this, or alternatively some service where you can compare a web page that is online with one that is local?
Note that we're only looking for content comparison (not visual).
(Un)fortunately there are many ways to do this, but fortunately some of them are simple.
What I would do is:
Get a list of URLs for each site. If the sitemap is exhaustive you could use that; if it isn't, you might want to run some Sitecore PowerShell to build the lists.
Given the lists (from files, the Sitecore API, or similar), write a program that visits each URL, grabs the text of the page after it has finished rendering, and saves it to disk; something like Selenium is good for this, and you can use any language (see the sketch below). You'll want a folder structure like host/urlpart/urlpart/pagename.txt, basically mirroring your content tree.
Use a filesystem diff program like WinMerge to compare the two folders
This is quick and dirty, but a good place to start.
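As a rough sketch of steps 2 and 3, here is one way to dump the rendered pages with headless Chrome instead of Selenium (urls.txt, the two hostnames, and the output folders are placeholders for your own values):

    #!/bin/sh
    # For every relative URL in urls.txt, save the rendered DOM of the live
    # and test versions into mirrored folder trees, then diff the trees.
    while read -r path; do
      mkdir -p "live/$(dirname "$path")" "test/$(dirname "$path")"
      google-chrome --headless --dump-dom "https://www.example.com/$path" > "live/$path.html"
      google-chrome --headless --dump-dom "http://test.sitecore.local/$path" > "test/$path.html"
    done < urls.txt
    diff -r live test   # or open the two folders in WinMerge

This dumps the full markup; since you only want content comparison, you could pipe each dump through a text extractor such as lynx -dump -stdin before saving.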

AEM (Adobe Experience Manager) Indexed PDF Search Results

My employer has recently switched its CMS to AEM (Adobe Experience Manager).
We store a large amount of documentation, and our site users need to be able to find the information contained within those documents, some of which are hundreds of pages long.
Adobe, disappointingly, says its search tool will not search PDFs. Is there any format for producing or saving PDFs that allows the content to be indexed?
I think you need to configure an external index/search tool like Apache Solr and use a REST endpoint to sync DAM data and fetch results for queries.
Out of the box, AEM supports most binary formats without needing Solr. You only need it in advanced scenarios, like exposing search outside of Authoring or handling millions of assets.
When an asset is uploaded to the AEM DAM, it goes through the DAM Asset workflow, which includes a Metadata Processor step. That step extracts content from the asset, so "binary" assets like Word documents, Excel files, and PDFs become searchable. As long as you have the DAM Asset Update workflow enabled, you will be fine.
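If you want to check that the extracted PDF text is actually searchable, a QueryBuilder full-text query against the DAM is a quick test (a sketch against a local author instance; the host, credentials, and search term are placeholders):

    # Full-text search for "invoice" inside assets under /content/dam
    curl -u admin:admin \
      "http://localhost:4502/bin/querybuilder.json?path=/content/dam&type=dam:Asset&fulltext=invoice&p.limit=10"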

Can Apache Solr store the actual files that are uploaded to it?

This is my first time on Stack Overflow. Thanks to all for providing valuable information and helping one another.
I am currently working on Apache Solr 7. There is a POC I need to complete on a short deadline, so I'm putting the question here. I have set up Solr on my Windows machine, created a core, and uploaded a PDF document using /update/extract from the Admin UI. After uploading, I can see the file's metadata when I query from the Admin UI using the query button. I was wondering if I can get the actual content of the PDF as well. I can see one tlog file generated under /data/tlog/tlog000... with raw PDF data, but not the actual file.
So the questions are:
1. Can I get the PDF content?
2. Does Solr store the actual file somewhere?
a. If it does, where does it store it?
b. If it does not, is there a way to store the file itself?
Regards,
Munish Arora
Solr will not store the actual file anywhere.
Depending on your config it can store the binary content though.
Using the extract request handler, Apache Solr relies on Apache Tika [1] to extract the content from the document [2].
So you can search and return the content of the PDF, plus a lot of other metadata if you like.
[1] https://tika.apache.org/
[2] https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html
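For example, to keep the extracted text retrievable, you can map Tika's content to a stored field when posting the file (a sketch; mycore, doc1, content_txt, and report.pdf are placeholders, and it assumes your schema has a stored content_txt field plus an attr_* dynamic field for unmapped metadata):

    # Index a PDF; map the extracted body to a stored field and prefix
    # unknown Tika metadata fields with attr_ so they don't cause errors
    curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&fmap.content=content_txt&uprefix=attr_&commit=true" \
      -F "file=@report.pdf"
    # The extracted text now comes back with the document
    curl "http://localhost:8983/solr/mycore/select?q=content_txt:pdf"

Note that even then Solr holds only the extracted text, not the PDF binary; if you need the file itself, keep it on a filesystem or object store and index its path as a field.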

Creating dynamic facets using Apache Solr

I'm new to Apache Solr.
I have uploaded a few log files using Solr Cell, and I want to create facets based on the content of the log files.
For example: inside my log file there is a record for a transaction. I would like to make transactionid a facet, and clicking it should search the uploaded log files and return results for that particular id.
Note: I need to facet the field according to the content of the log.
As long as the field is indexed, you can facet on it. So you can use either a schemaless configuration or dynamicField definitions to match and automatically create fields for your log records.
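For example, once each log record is indexed with a transactionid field, a plain facet query gives you the distinct ids, and clicking one is just a filtered search (a sketch; the logs core, the field name, and TX12345 are placeholders for your setup):

    # Top 20 transaction ids across all indexed log records
    curl "http://localhost:8983/solr/logs/select?q=*:*&rows=0&facet=true&facet.field=transactionid&facet.limit=20"
    # "Clicking" a facet value maps to a filter query on it
    curl "http://localhost:8983/solr/logs/select?q=*:*&fq=transactionid:TX12345"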
Go through Solr examples first, there should be enough information there.
(updated based on the comments)
If the text needs to be pre-processed and split, there are two basic avenues:
Using DataImportHandler (DIH), probably with LineEntityProcessor and RegexTransformer, to split the field into multiple fields
Using UpdateRequestProcessor chains (in solrconfig.xml), probably cloning the field multiple times and then using RegexReplaceProcessorFactory to extract the relevant parts. That's even uglier than DIH, though, as there is no easy way to split one field into many.
Still, specifically for logs, it is better to use something like Logstash with a Solr output plugin.
+1 to Alex's answer.
Another alternative is to write a custom update processor in which you figure out which field you want to facet on and explicitly add that field to your document.
This makes sense only if you know what kind of fields to expect, based on some pattern. If that is not the case, then using dynamic fields or a schemaless config is your best bet.

How to merge two content sources in SharePoint 2010?

In my SharePoint 2010 website, I added two content sources:
file system (shared folder)
BDC data (Line of Business Data)
I added the managed properties to map the metadata of the BDC data.
My search results come out like this.
I would like to link the two content sources; my second content source has the file-related information (tab, category, fileno, case name).
I added the column and also altered the XSLT in the search results web part. The results come out as below.
From the results, the third one (120) comes from the database, so all the properties are mapped (caseid, casename, fileno, doctab, description).
But it's not mapped to the file system. The file system is related to the table by the file name, and the path of the files carries some information:
file://192.168.25.231/FolderName/CaseID/documenttab/filename
CaseId is the primary key of the table that I added as the second content source.
How can I achieve this?
Hmm, it's difficult to add much more without seeing the environment, but here's plan B.
Given that you're using the BCS and want to display both unstructured content (the files) and application data that shares metadata with the files, you could try the following. It will require some coding knowledge: you can make connections between web parts in SharePoint Designer, but this will need Visual Studio.
create a custom search results page, and use the standard core search results web part along with a separate data web part for displaying the application data
create a custom query box for entering the search query, probably best done with separate fields for the metadata (case ID, case name, etc.). You'd normally use a data filter web part, but that won't pass results through to the normal search results; you need code to run the two queries
format and pass the query to both the core search results web part and the BCS data web part, to display items that match the query
That's probably as much as I can help with. The SharePoint section on MSDN should be the next port of call. Good luck!
This may be an overly simplistic explanation to keep the response as short as possible.
For your search results page, the best approach when also retrieving application data is not to present that information in the core search results web part. Exclude it from the default scope. Instead, add a federated search results web part to the results page. You'll also need to create the corresponding federation location for the scope (easy to do), and you can then use XSLT to style the display of the results; application data needs to be presented differently from links to files and web pages.
Then a search for, say, the case ID will display all files containing that information in the core search results web part, and any matching application data in the federated results web part, with the different formatting applied. Note: there will be no connection between the two; the only relationship is that they both match the search query. It is possible to connect web parts so that one is filtered by the selected value in another, but that's an entirely different approach and not easily done using search results.