RAILS3: Full-text Search Word Docs? - ruby-on-rails-3

My company has a collection of about 3500 highly-structured Word docs (and growing) that contain multiple choice questions from one of our products. I've been tasked with writing a front-end that will let people find and use these in other products. There is some metadata on them that would go in a database, but we'd also like full-text search.
I've been given the option of using for the front-end either MS Access (because I know it well) or Rails (because I'm supposed to be learning it). I've done one Rails app and prefer to continue with it.
Rather than load the documents into the database, I thought it made more sense to just have them on the file system and store paths to them in the database.
I know I can use Ferret to search database fields but what's the best way to add full-text searching to a Rails app for a pile of files on the filesystem?

Not sure if there are any gems that would search the word files for you. Although you have mentioned that you do not want to load the entire documents into the database, you might look into just copying the text contents of each file in your db. You can use win32ol library for doing this (http://ruby-doc.org/stdlib/libdoc/win32ole/rdoc/classes/WIN32OLE.html) .. If I had to implement this, I would run a cron job every night (or whatever frequency seems fit) that would refresh the database content with the changes in the word files.

Related

How to compare content between two web pages in different environments?

We are in the process of building a website from scratch from an existing website. The web page is an identical copy, and as the web page contains many pages we need a way to compare content between the sites. It is of course possible to do manually, but it takes both a lot of time and entails a risk of human errors.
I have seen that there are services that offer this by inputting two URLs which are then analyzed and where discrepancies are presented. However, these cannot be used as our test environment is local (built in Sitecore).
Is there a way to solve this without making our test environment available online (which is not possible)? For example, does software exist for this, or alternatively some service where you can compare a web page that is online with one that is local?
Note that we're only looking for content comparison (not visual).
(Un)fortunately there's many ways to do this, but fortunately there are some simple ones.
What I would do is:
Get a list of URLs for each site. If the Sitemap is exhaustive, then you could use that, if it's not you might want to run some Sitecore Powershell to get the lists.
Given the lists (from files, or Sitecore API or something), write a program to visit each URL, get the text of the page after it's done rendering, and save it to disk (something like Selenium is good for this and you can use any language). You'll want some folder structure like host/urlpart/urlpart/pagename.txt, basically the same as your content tree.
Use some filesystem diff program like WinMerge to compare the two folders
This is quick and dirty, but a good place to start.

SQL Database in GitHub

I am building a Java app that uses an SQLite database to hold most of its data. For the end-user, the database would be almost entirely read-only, with very occasional edits. I'll (theoretically) be displaying/distributing it through my GitHub page, so my question is:
What's the best way to load the database into GitHub? (I'm using IntelliJ with DataGrip.)
I'd prefer to be able to update the database when I commit/push, instead of having to overwrite the whole file. The closest question I can find is How to include MySQL database schema on GitHub? but there could potentially be hundreds or thousands of entries, so I can't just rebuild the tables when the user installs the app.
I'm applying for entry-level developer jobs, and this project is going to be my main portfolio piece during job-hunting. I'm trying to make sure it is not only functional but also makes a good impression. Any help is (very) greatly appreciated.
EDIT:
After moving my .db file into the folder connected to GitHub (same level as my src folder) apparently I can now commit/push it with the rest of my files. How do I make sure that the connection from my Java code to the database stays valid once it is loaded onto another user's system? Can I just stick with
connection = DriverManager.getConnection("jdbc:sqlite:mydatabase.db");
or do I need to rework the path?
Upon starting, if your application can't find a corresponding sqlite database file, have it create one. Then do initial load of your tables from either CSV, JSON or XML files.
You can upload these files to Git, as they are text formats.

how to store different files that need to searched in ASP.NET MVC 4 website

My requirement is like job sites where a user can upload a document(can be PDF,Text or word document) like Resume/CV. Then all these documents can be searched for a specific or a combination of keyword and they also have to be ranked based on those key words. I need to know which technology can be good from performance point of view when the number of files are huge and also there are good number of request for searching and indexing.
The website is built using SQL Server. So can I store those files in SQL Server? Will be good in terms of performance.
Or can it be done alone using Lucene.NET and i can store those files in single folder?
I think, the best suggestion is to use Lucene ....
you can save your documents as they are with some unique path name/file_name , and use that as identifier when you index the documents ... I am sure you can find a lot of similar examples if you search Lucene ..

Sitecore System Lucene Index for custom queries

I have been using Sitecore query and FAST query for some sections of the website. But with growing content these queries have gotten slow and I'd like to implement Lucene querying for content to speed up things.
I am wondering if I can just use the System index instead of having to setup a separate index. Does Sitecore by default index all content in the content editor? Is this a good approach or should I just create my own index?
(I'm going to assume your using Sitecore 6.4->6.6)
As with everything .. it depends .. Sitecore keeps an index of all the Sitecore items in its system index, you are welcome to use that. Sometimes you may want a more specialised or restricted list of items, like being based on a certain template, being indexed or need a checkbox field indexed (as the system one by default only indexes text fields).
Setting up your own search index is pretty easy.. It does require some fiddling with the web.config though (and I'd recommend adding as a .include file).
Create an new <index> node with its own id that will define the name of the collection and the folder it will go into. (You can check its working by looking for the dir in the /data/indexes directory of your installation.
.. next you can tell the crawler which database to look at (most likely master if you want unpublished content to be indexed or web for published stuff) and where to start the search from (in this example I am indexing only the news section). You can tag,boostand tell if whether to IndexAllFields (otherwise it will only index fields it understands as text .. rich-text / multi-line text / text etc).
.. Finally, you can tell the indexer which template types to include or exclude.
How the indexer works is that it will subscribed to item events within sitecore .. so every time an item is changed or moved or deleted the index will be updated automatically. Obviously if you are indexing the web db the items will need to have been published.
More in-depth info on the query syntax & indexing can be found here on SDN.
The search syntax and API is much improved in 6.4/6.5 but if you want to add extra kick then my colleague Alex Shyba's Advanced Database Crawler is worth checking out too.
Hope this helps :D
You will want to implement your own index. For the same reason that you are seeing things slow down when there is a lot of content, indexes slow down when there is a lot of content in it as well.
I prefer targeted indexes meant specifically to drive the functionality I need and only has the data in it that is required. This allows for smaller and more efficient index usage on your components.
Additionally, you probably want to look into the AdvancedDatabaseCrawler put together by Alex Shyba. There are a few blogs out there with some great posts on implementing this lucene indexing module.
A separate index is always a wise decision, you can keep it light. In big environments the system index can grow up to gigabytes.
You can exclude the content from the index, as you will only be using it for performing lookups, not showing content from the index.
Finally: the system index is for the master database, you'll be querying the web database, possibly on a content delivery server.

CMS Settings, .INI file or MySQL?

I'm looking for the best solution to store the ettings for a website, like the limit of posts for users, limit of users online, ranks, min. number of posts to be able to do something.
Like here, if you're new you can't thumbs up/down a post, or whatever, so how would you store all of these?
I thought of creating a table with constants in mysql but i think it's not the best solution to add a new mysql query on every page refresh.
Why not? After all, MySQL can handle a large number of requests and I've seen much more complex queries than checking access rights. A MySQL query is a query to a file just like checking an INI file, but optimized. I'm guessing that if you don't expect a huge amount of traffic, you'll be fine with a database.
Here it's a matter of preference. I prefer to do this in MySQL because I don't like to parse files and find querying a database easier. Also, editing rows is easier than changing values in a text file.
I'd say your first thought was spot-on. Put constants into a database.