Make Lucene into a full-fledged search engine like Google

I want to build a search engine like Google: I enter a search term and it retrieves the URLs of matching websites.
I used Lucene with Tomcat, but it only searches the files residing on my system.
I want to search throughout the web. Please tell me how to do this using Lucene.
If this can't be done with Lucene, please suggest alternatives.

Use Nutch. Lucene by itself is only an indexing and search library; it has no crawler, which is why it only sees the files on your machine. Nutch adds the web-crawling layer on top of Lucene.
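Older Nutch releases write out a plain Lucene index that you can query directly (newer releases push results into Solr instead). As a hedged sketch of the search side, assuming such an index under crawl/index and Nutch's classic "content" and "url" field names, here is a query using the plain Lucene API; exact classes vary by Lucene version (this assumes roughly 5.x):

    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.FSDirectory;

    public class WebSearch {
        public static void main(String[] args) throws Exception {
            // Open the index produced by the Nutch crawl (path is an assumption).
            try (DirectoryReader reader = DirectoryReader.open(
                    FSDirectory.open(Paths.get("crawl/index")))) {
                IndexSearcher searcher = new IndexSearcher(reader);

                // Parse the user's search term against the page-body field.
                Query query = new QueryParser("content", new StandardAnalyzer())
                        .parse(args[0]);

                // Print the URLs of the top 10 matching pages.
                for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                    Document doc = searcher.doc(hit.doc);
                    System.out.println(doc.get("url"));
                }
            }
        }
    }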

Related

How to compare content between two web pages in different environments?

We are rebuilding a website from scratch based on an existing website. The new site is meant to be an identical copy, and since it contains many pages we need a way to compare content between the two sites. Doing this manually is of course possible, but it takes a lot of time and carries a risk of human error.
I have seen services that offer this: you input two URLs, they are analyzed, and discrepancies are presented. However, these cannot be used because our test environment is local (built in Sitecore).
Is there a way to solve this without making our test environment available online (that is not an option for us)? For example, does software exist for this, or alternatively some service that can compare a web page that is online with one that is local?
Note that we're only looking for content comparison (not visual).
(Un)fortunately there are many ways to do this, but fortunately some of them are simple.
What I would do is:
1. Get a list of URLs for each site. If the sitemap is exhaustive you could use that; if it's not, you might want to run some Sitecore PowerShell to build the lists.
2. Given the lists (from files, the Sitecore API, or similar), write a program that visits each URL, grabs the text of the page after it has finished rendering, and saves it to disk; something like Selenium is good for this, and you can use any language (see the sketch below). You'll want a folder structure like host/urlpart/urlpart/pagename.txt, basically mirroring your content tree.
3. Use a filesystem diff program like WinMerge to compare the two folders.
This is quick and dirty, but a good place to start.
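Here is a hedged sketch of step 2 in Java with Selenium; the URL list file, the output folder layout, and ChromeDriver are assumptions you would adapt:

    import java.net.URI;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;

    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.chrome.ChromeDriver;

    public class PageTextDumper {
        public static void main(String[] args) throws Exception {
            // args[0]: file with one URL per line; args[1]: output root folder.
            List<String> urls = Files.readAllLines(Paths.get(args[0]));
            Path outRoot = Paths.get(args[1]);

            WebDriver driver = new ChromeDriver();
            try {
                for (String url : urls) {
                    driver.get(url); // blocks until the page has loaded

                    // Rendered text only, so markup differences won't cause noise.
                    String text = driver.findElement(By.tagName("body")).getText();

                    // Mirror the URL as host/urlpart/urlpart/pagename.txt.
                    URI uri = URI.create(url);
                    String pagePath = uri.getPath().replaceAll("/+$", "");
                    if (pagePath.isEmpty()) {
                        pagePath = "/index";
                    }
                    Path file = outRoot.resolve(uri.getHost() + pagePath + ".txt");
                    Files.createDirectories(file.getParent());
                    Files.write(file, text.getBytes(StandardCharsets.UTF_8));
                }
            } finally {
                driver.quit();
            }
        }
    }

Run it once against production and once against the local test site, writing into two sibling folders, then point WinMerge at those two folders.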

AEM (Adobe Experience Manager) Indexed PDF Search Results

My employer has recently switched its CMS to AEM (Adobe Experience Manager).
We store a large amount of documentation, and our site users need to be able to find the information contained within those documents, some of which are hundreds of pages long.
Disappointingly, Adobe says its search tool will not search PDFs. Is there any format for producing or saving PDFs that allows the content to be indexed?
I think you need to configure an external index/search tool like Apache Solr and use a REST endpoint to sync DAM data and fetch results for queries.
Out of the box, AEM supports most binary formats without needing Solr. You only need Solr in advanced scenarios, such as exposing search outside of Authoring or handling millions of assets.
When an asset is uploaded to the AEM DAM, it goes through the DAM Asset workflow, which includes a metadata-processor step. That step extracts content from the asset, so "binary" assets like Word documents, Excel files, and PDFs become searchable. As long as you have the DAM Asset Update workflow enabled, you will be fine.
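To check that the extracted text really is searchable, you can run a full-text query against the repository yourself. A minimal sketch using the standard JCR API, assuming assets live under /content/dam (in AEM you would normally obtain the Session from a ResourceResolver rather than construct it directly):

    import javax.jcr.Session;
    import javax.jcr.query.Query;
    import javax.jcr.query.QueryManager;
    import javax.jcr.query.QueryResult;
    import javax.jcr.query.RowIterator;

    public class DamFulltextCheck {
        // Prints the paths of DAM assets whose extracted text contains 'term'.
        public static void search(Session session, String term) throws Exception {
            QueryManager qm = session.getWorkspace().getQueryManager();

            // JCR-SQL2 full-text query; CONTAINS matches the text that the
            // DAM Asset Update workflow extracted from PDFs, Word files, etc.
            String sql2 = "SELECT * FROM [dam:Asset] AS a "
                    + "WHERE ISDESCENDANTNODE(a, '/content/dam') "
                    + "AND CONTAINS(a.*, $term)";

            Query query = qm.createQuery(sql2, Query.JCR_SQL2);
            query.bindValue("term", session.getValueFactory().createValue(term));

            QueryResult result = query.execute();
            for (RowIterator rows = result.getRows(); rows.hasNext();) {
                System.out.println(rows.nextRow().getPath());
            }
        }
    }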

Visio Hyperlinks query

The users have documents in a SharePoint document library. I need to be able to get the URLs of these documents easily from within Visio. Would I need to write something in VBA that gets the URLs of these documents from SharePoint, or is there an easier option?
I have looked at the net use command so they can map a drive, but that does not give the URL.
Has anyone done anything like this before?
SharePoint exposes XML web-service interfaces that you can use to enumerate the elements in a list or the items in a library (which sits at a known, static location in SharePoint).
What you can use and how you access it will depend on your version of SharePoint and how it is configured on your network. However, there are plenty of examples on the internet, findable with your favourite search engine (AltaVista, anyone?), that will get you started with some code. I expect you will have some specific questions once you start coding.
It has been a while since I did this.
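As a concrete starting point, here is a hedged Java sketch of calling the classic Lists.asmx SOAP service to pull document URLs from a library. The server address and library name are placeholders, authentication (NTLM, etc.) is omitted because it depends on your setup, and the ows_EncodedAbsUrl attribute may need to be requested explicitly via ViewFields in some configurations:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class SharePointListUrls {
        public static void main(String[] args) throws Exception {
            // Placeholder server/site; adjust to your environment.
            URL endpoint = new URL(
                    "http://sharepoint.example.com/sites/docs/_vti_bin/Lists.asmx");

            // Minimal GetListItems request; "Shared Documents" is a placeholder.
            String soap = "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
                    + "<soap:Envelope xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\">"
                    + "<soap:Body>"
                    + "<GetListItems xmlns=\"http://schemas.microsoft.com/sharepoint/soap/\">"
                    + "<listName>Shared Documents</listName>"
                    + "</GetListItems>"
                    + "</soap:Body></soap:Envelope>";

            HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
            conn.setRequestProperty("SOAPAction",
                    "\"http://schemas.microsoft.com/sharepoint/soap/GetListItems\"");
            conn.setDoOutput(true);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(soap.getBytes(StandardCharsets.UTF_8));
            }

            // Each row in the response carries the document URL in its
            // ows_EncodedAbsUrl attribute; a real client would parse the XML.
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    conn.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }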

How to store different files that need to be searched in an ASP.NET MVC 4 website

My requirement is like job sites, where a user can upload a document (PDF, text, or Word) such as a resume/CV. All these documents then have to be searchable by a specific keyword or a combination of keywords, and the results have to be ranked on those keywords. I need to know which technology performs well when the number of files is huge and there is also a good number of requests for searching and indexing.
The website is built on SQL Server, so can I store those files in SQL Server? Would that be good in terms of performance?
Or can it be done with Lucene.NET alone, storing those files in a single folder?
I think the best suggestion is to use Lucene.
You can save your documents as they are, under some unique path/file name, and use that as the identifier when you index the documents. I am sure you can find a lot of similar examples if you search for Lucene.
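A minimal sketch of that index-by-path approach, written against the Java Lucene API (Lucene.NET mirrors it almost class for class, so the C# version looks nearly identical). Extracting plain text from the PDF/Word file is assumed to happen beforehand, e.g. with a library like Apache Tika; the index path and field names are placeholders:

    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class ResumeIndexer {
        // Adds one uploaded document to the index. 'extractedText' is the
        // plain text pulled out of the PDF/Word file beforehand.
        public static void index(String filePath, String extractedText) throws Exception {
            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("resume-index")), config)) {
                Document doc = new Document();
                // Stored, not tokenized: the unique path is the identifier
                // used to fetch the original file after a search hit.
                doc.add(new StringField("path", filePath, Field.Store.YES));
                // Tokenized full text; Lucene's scoring provides the
                // keyword-based ranking the question asks about.
                doc.add(new TextField("content", extractedText, Field.Store.NO));
                writer.addDocument(doc);
            }
        }
    }

At search time, Lucene's relevance scoring gives you the ranking, and the stored path field tells you which uploaded file each hit refers to.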

Program to scrape a webpage into an index

I've been looking for a program to create an index from static web pages. I'm not looking for a program like Solr or Elasticsearch, because both assume I will be creating the index programmatically and interactively. I need something that can basically go to a URL and create a search index from the pages that it pulls. It can create the index in whatever form it likes (DB, XML, etc.). I just don't need the programs that are deeply involved with backend database access and application code, as this search will be very light and mostly for internal purposes, on a site that does not use any of those.
Thanks for any tips that may get me started or answers that will solve my problem!
Investigate Nutch. Nutch can crawl and index a site starting from a URL, and what it indexes is very configurable.
Once you finish crawling/indexing, that index is searchable. There is no programming involved.
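To give a sense of how little setup is involved: a Nutch 1.x crawl is driven by a seed file and a URL filter rather than code. A hedged sketch, assuming the standard Nutch 1.x layout (the example.com patterns are placeholders):

    # urls/seed.txt -- one start URL per line
    http://www.example.com/

    # conf/regex-urlfilter.txt -- keep the crawl on your own site
    +^https?://(www\.)?example\.com/

Consult the Nutch tutorial for the crawl command itself, as the exact invocation has changed between releases.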