Program to scrape a webpage into an index

I've been looking for a program that creates a search index from static webpages. I'm not looking for something like Solr or Elasticsearch, because both assume I will be building the index interactively. I need something that can simply go to a URL and create a search index from the pages it pulls. It can store the index in whatever way is necessary (db, XML, etc.). I just don't need programs that are heavily involved with backend database access and code, as this search will be very light and mostly for internal purposes, on a site that does not use any of those.
Thanks for any tips that may get me started or answers that will solve my problem!

Look into Nutch. It can crawl and index a URL, and what it indexes is very configurable.
Once you finish crawling/indexing, that index is searchable. There is no programming involved.
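
To make the "pull pages, build an index, search it" idea concrete, here is a minimal sketch in Python. It is not how Nutch works internally; it only illustrates the concept with the standard library. The names `TextExtractor`, `build_index` and `search` are made up for this example, and the pages are passed in as a dict of URL to HTML (in practice you might fill that dict with something like `urllib.request.urlopen`).

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def page_text(html):
    """Return the visible text of one HTML page as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def build_index(pages):
    """pages: dict of url -> html. Returns an inverted index: term -> set of urls."""
    index = defaultdict(set)
    for url, html in pages.items():
        for term in re.findall(r"[a-z0-9]+", page_text(html).lower()):
            index[term].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every term in the query."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results
```

A real crawler also has to follow links, respect robots.txt and rank results, which is exactly the work a tool like Nutch takes off your hands.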

Related

Automatic alias creation on indexing request in elasticsearch

Is it possible to automatically create an alias on an indexing request for an absent index in elasticsearch instead of creating the index?
For example, let's say an indexing request comes in for the uk_london index, with type document and the document contents. The uk_london index does not exist, so it will be created. I want to avoid this creation, to avoid having a kagillion shards. What I would like is for ES to create an alias uk_london (with routing, filtering etc.) and have it point to the uk index (which already exists).
I know this problem is solvable by changing the way the indexing requests come in. The cluster I am running receives documents from multiple systems. I do not have full control over how the documents are sent, and I would like to avoid having to touch everything that talks to it unless I really have to.
I am also aware of index templates and the fact that you can automatically set up aliases on index creation through them. However, as far as I understand, this does not solve my problem, as you can't tell it to create an alias instead of an index.

How to store different files that need to be searched in an ASP.NET MVC 4 website

My requirement is similar to job sites, where a user can upload a document (PDF, text or Word) such as a resume/CV. All these documents then need to be searchable by a single keyword or a combination of keywords, and the results have to be ranked based on those keywords. I need to know which technology performs well when the number of files is huge and there are also a good number of search and indexing requests.
The website is built using SQL Server. So can I store those files in SQL Server? Will that be good in terms of performance?
Or can it be done using Lucene.NET alone, with the files stored in a single folder?

I think the best suggestion is to use Lucene.
You can save your documents as they are, under some unique path/file name, and use that as the identifier when you index the documents. I am sure you can find a lot of similar examples if you search for Lucene.

Sitecore System Lucene Index for custom queries

I have been using Sitecore query and FAST query for some sections of the website. But with growing content these queries have gotten slow and I'd like to implement Lucene querying for content to speed up things.
I am wondering if I can just use the System index instead of having to setup a separate index. Does Sitecore by default index all content in the content editor? Is this a good approach or should I just create my own index?

(I'm going to assume you're using Sitecore 6.4->6.6)
As with everything, it depends. Sitecore keeps an index of all the Sitecore items in its system index, and you are welcome to use that. Sometimes, though, you may want a more specialised or restricted set of items, for example only items based on a certain template, or you may need a checkbox field indexed (the system index by default only indexes text fields).
Setting up your own search index is pretty easy. It does require some fiddling with the web.config though (and I'd recommend adding it as a .include file).
Create a new <index> node with its own id, which defines the name of the collection and the folder it will go into. (You can check it's working by looking for the directory in the /data/indexes directory of your installation.)
Next, you can tell the crawler which database to look at (most likely master if you want unpublished content to be indexed, or web for published content) and where to start the crawl from (in this example I am indexing only the news section). You can also set tags and boost, and specify whether to IndexAllFields (otherwise it will only index fields it understands as text: rich text, multi-line text, text, etc.).
Finally, you can tell the indexer which template types to include or exclude.
The indexer works by subscribing to item events within Sitecore, so every time an item is changed, moved or deleted the index is updated automatically. Obviously, if you are indexing the web db, the items will need to have been published.
More in-depth info on the query syntax & indexing can be found here on SDN.
The search syntax and API is much improved in 6.4/6.5 but if you want to add extra kick then my colleague Alex Shyba's Advanced Database Crawler is worth checking out too.
Hope this helps :D

You will want to implement your own index. For the same reason that you are seeing queries slow down when there is a lot of content, an index slows down when it holds a lot of content as well.
I prefer targeted indexes that are meant specifically to drive the functionality I need and contain only the data that is required. This allows for smaller and more efficient index usage in your components.
Additionally, you probably want to look into the AdvancedDatabaseCrawler put together by Alex Shyba. There are a few blogs out there with some great posts on implementing this Lucene indexing module.

A separate index is always a wise decision, because you can keep it light. In big environments the system index can grow to gigabytes.
You can leave the content itself out of the index, as you will only be using the index for performing lookups, not for showing content from it.
Finally, the system index is for the master database, whereas you'll be querying the web database, possibly on a content delivery server.

Updating Lucene index from two different threads in a web application

I have a .NET web application which uses Lucene.Net for company search functionality.
When registered users add a new company, it is saved to the database and also gets indexed in the Lucene-based company search index in real time.
When adding a company to the Lucene index, how do I handle the case of two or more logged-in users posting a new company at the same time? Also, will both of these companies get indexed without any issues related to file locks, lock timeouts, etc.?
I would appreciate help with code as well.
Thanks.

By default, Lucene.Net has built-in index locking that uses a lock file. If the default locking mode isn't good enough, there are others that you can use instead (they are included in the Lucene.Net source code).
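
On the application side, one common way to avoid two requests competing for the index lock is to share a single writer across all request threads and let writes go through it one at a time. The sketch below is a toy model of that pattern in Python, not Lucene.Net code: `CompanyIndex`, `add_company` and `add_concurrently` are hypothetical names, and the dict stands in for the real index.

```python
import threading


class CompanyIndex:
    """Toy stand-in for an index that has one shared writer (hypothetical)."""

    def __init__(self):
        self._write_lock = threading.Lock()
        self._docs = {}

    def add_company(self, company_id, name):
        # Only one thread at a time may modify the index.
        with self._write_lock:
            self._docs[company_id] = name

    def count(self):
        with self._write_lock:
            return len(self._docs)


def add_concurrently(index, companies):
    """Simulate several users posting companies at the same time."""
    threads = [
        threading.Thread(target=index.add_company, args=(company_id, name))
        for company_id, name in companies
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

With this shape, simultaneous posts simply queue up briefly behind the lock, and every company ends up indexed.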

Lucene index updating and performance

I am working on a job portal site and have been using Lucene for job search functionality.
Users will be posting a number of jobs on our site on a daily basis. We need to make sure that a newly posted job is searchable on the site as soon as possible.
In this context, how do I update Lucene index when a new job is posted or when an existing job is edited?
Can Lucene index updating and searching work in parallel?
Also, are there any tips/best practices with respect to Lucene indexing, optimizing, performance, etc.?
I appreciate your help!
Thanks!

Yes, Lucene can search and write to an index at the same time, as long as no more than one IndexWriter writes to it. If you want the new records visible ASAP, have the IndexWriter call commit() often (see IndexWriter's JavaDoc for details).
These Wiki pages might also help:
ImproveIndexingSpeed
ImproveSearchingSpeed

I have used Lucene.Net on a web site similar to what you are doing. Yes, you can do live indexing, updating the index to keep everything up to date. What platform are you using Lucene on: .NET or Java?

Make sure you create a new IndexSearcher, as any additions made after an IndexSearcher has been created are not visible to that instance.
A better approach may be to reopen the IndexReader if you want to reuse the same index searcher.
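
The answers in this thread describe two behaviours: changes only become visible after a commit, and a searcher that was opened earlier keeps seeing the index as it was at that moment. The following is a toy Python model of those semantics, not the Lucene API; `ToyIndex` and `ToySearcher` are names invented for this sketch.

```python
class ToySearcher:
    """Searches a fixed snapshot taken when the searcher was opened."""

    def __init__(self, snapshot):
        self._snapshot = snapshot

    def search(self, word):
        word = word.lower()
        return [doc for doc in self._snapshot if word in doc.lower().split()]


class ToyIndex:
    """Single-writer index: adds are buffered until commit() publishes them."""

    def __init__(self):
        self._pending = []
        self._committed = ()  # immutable snapshot shared with searchers

    def add(self, doc):
        self._pending.append(doc)

    def commit(self):
        # Publish buffered documents as a new snapshot; old searchers
        # keep the tuple they were given and do not see the change.
        self._committed = self._committed + tuple(self._pending)
        self._pending.clear()

    def open_searcher(self):
        return ToySearcher(self._committed)
```

In this model a job posted after a searcher was opened only shows up once the writer has committed and a new searcher has been opened, which is why committing often and refreshing the searcher matter for "searchable as soon as possible".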
