Getting Search Server to ignore SharePoint document content and speed up crawl times (SharePoint 2010)

Background:
I have a SharePoint Foundation 2010 installation that is being used to store scanned images of paper documents, creating an electronic version of the paper file folders we keep for each of our company's clients. All of the documents are stored as PDF files.
The configuration includes a web server hosting SharePoint and the Search Server 2010 Express service, plus a separate database server housing both the content databases and the search crawl store. The SharePoint/Search box and the SQL box are both VMware VMs running on shared hosts (with a shared SAN) alongside our other production servers.
Each file must be added to SharePoint through a custom interface that captures metadata tags for client information (a site content type with a set of site columns defines this extra metadata). We then expose this client-identifying data to the search server by setting up managed properties, so that we can run queries against the search web service specifying WHERE CustomClientID = X.
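For context, a stripped-down version of the kind of query we run against the Search query web service looks roughly like the sketch below. The site URL, credentials and client ID are placeholders, and the SOAP plumbing is written from memory, so treat it as an illustration rather than our production code:

    # Rough sketch of how we pull a client's documents by managed property via
    # the Search query web service. Site URL, credentials and the client id are
    # placeholders, and the SOAP plumbing is from memory.
    import requests
    from requests_ntlm import HttpNtlmAuth   # the SharePoint box uses Windows auth

    SITE = "http://sharepoint/clientdocs"    # placeholder site URL
    CLIENT_ID = "12345"                      # the CustomClientID value to filter on

    # SQL-syntax (MSSQLFT) query against our managed property.
    sql = ("SELECT Title, Path, CustomClientID FROM Scope() "
           "WHERE CustomClientID = '%s'" % CLIENT_ID)

    query_packet = (
        '<QueryPacket xmlns="urn:Microsoft.Search.Query">'
        '<Query><Context>'
        '<QueryText language="en-US" type="MSSQLFT">%s</QueryText>'
        '</Context></Query></QueryPacket>' % sql)

    soap = (
        '<?xml version="1.0" encoding="utf-8"?>'
        '<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">'
        '<soap:Body><QueryEx xmlns="urn:Microsoft.Search">'
        '<queryXml>%s</queryXml>'
        '</QueryEx></soap:Body></soap:Envelope>'
        % query_packet.replace("<", "&lt;").replace(">", "&gt;"))

    resp = requests.post(
        SITE + "/_vti_bin/search.asmx", data=soap,
        headers={"Content-Type": "text/xml; charset=utf-8",
                 "SOAPAction": "urn:Microsoft.Search/QueryEx"},
        auth=HttpNtlmAuth("DOMAIN\\serviceaccount", "password"))
    print(resp.text)                         # DataSet XML with the matching items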
Our data currently resides in two large document libraries, one for each arm of the company.
After a few years of operation the server now holds some 250,000 documents, and we are having issues with crawls: full crawls (run weekly, off hours) sometimes crash partway through, and incremental crawls (run every 5 minutes during work hours) take 7-8 minutes to pick up 2-3 new files.
Question:
I was wondering if there is a way to get the search server crawler to pick up only the metadata we are supplying and ignore the document contents entirely, which I assume would speed up the crawl process by orders of magnitude. I believe the content indexing I want to avoid is what is described as full-text search, but I have not been able to find anything that explains whether it can be turned off.
If not, is there an alternative option for speeding up crawl times that anyone would advise?

Related

How to compare content between two web pages in different environments?

We are in the process of rebuilding a website from scratch, based on an existing website. The new site is meant to be an identical copy, and since it contains many pages we need a way to compare content between the two sites. It is of course possible to do this manually, but that takes a lot of time and carries a risk of human error.
I have seen services that offer this: you input two URLs, the pages are analyzed, and discrepancies are presented. However, these cannot be used because our test environment is local (the site is built in Sitecore).
Is there a way to solve this without making our test environment available online (which is not possible)? For example, does software exist for this, or alternatively some service where you can compare a web page that is online with one that is local?
Note that we're only looking for content comparison (not visual).
(Un)fortunately there are many ways to do this, but fortunately some of them are simple.
What I would do is:
Get a list of URLs for each site. If the sitemap is exhaustive you could use that; if it's not, you might want to run some Sitecore PowerShell to generate the lists.
Given the lists (from files, the Sitecore API, or something similar), write a program that visits each URL, gets the text of the page after it has finished rendering, and saves it to disk (something like Selenium is good for this, and you can use any language; see the sketch below). You'll want a folder structure like host/urlpart/urlpart/pagename.txt, basically mirroring your content tree.
Use a filesystem diff tool like WinMerge to compare the two folders.
This is quick and dirty, but a good place to start.
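To make step 2 concrete, here is a minimal sketch using Python and Selenium. The URL list files, output folders and browser choice are assumptions; adjust them to whatever your sites and environment need:

    # Rough sketch of step 2: fetch the rendered text of every URL in a list and
    # save it in a folder structure that mirrors the content tree. Assumes a plain
    # text file of URLs (one per line), Chrome installed, and the selenium package.
    import os
    from urllib.parse import urlparse

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    def dump_site(url_list_file, output_root):
        driver = webdriver.Chrome()  # or webdriver.Firefox()
        try:
            with open(url_list_file) as f:
                urls = [line.strip() for line in f if line.strip()]
            for url in urls:
                driver.get(url)
                # Visible text of the page after the browser has rendered it.
                text = driver.find_element(By.TAG_NAME, "body").text
                parsed = urlparse(url)
                host = parsed.netloc.replace(":", "_")   # ports aren't valid in folder names
                rel = parsed.path.strip("/").replace("/", os.sep) or "index"
                out_path = os.path.join(output_root, host, rel + ".txt")
                os.makedirs(os.path.dirname(out_path), exist_ok=True)
                with open(out_path, "w", encoding="utf-8") as out:
                    out.write(text)
        finally:
            driver.quit()

    # Run once per environment, then point WinMerge at the two folders:
    # dump_site("prod_urls.txt", r"C:\compare\prod")
    # dump_site("local_urls.txt", r"C:\compare\local")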

When to use the Registry

One of my VB.NET applications started off fairly simple and (surprise) has grown significantly. I was using the Registry to save my limited number of settings; however, that has now grown to the point where I felt I was abusing the Registry. I have converted most of my saved settings to XML files, which have been working well. I would appreciate thoughts on saving the following to the Registry; I have been looking through threads on this and still am not sure whether I should use files or the Registry:
Paths to the user settings files. Currently the application looks in a specific sub-folder of the user's Documents folder for an XML file containing the paths.
Window positions and sizes for forms (over 50 forms).
Licensing data (licensing per machine).
Application version.
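To make the comparison concrete, here is a rough sketch of the two approaches for one of the items above (a window position). It's in Python rather than VB.NET purely for illustration, and the Registry key path, file location and value names are invented:

    # Illustrative only: storing a window position in the Registry vs. in a file.
    # The Registry key path, file location and value names are invented.
    import json
    import os
    import winreg

    APP_KEY = r"Software\MyCompany\MyApp"        # hypothetical Registry key
    SETTINGS_FILE = os.path.join(
        os.path.expanduser("~"), "Documents", "MyApp", "window_positions.json")

    def save_position_registry(form_name, left, top, width, height):
        # One value per form under HKEY_CURRENT_USER.
        with winreg.CreateKey(winreg.HKEY_CURRENT_USER, APP_KEY) as key:
            winreg.SetValueEx(key, form_name, 0, winreg.REG_SZ,
                              json.dumps([left, top, width, height]))

    def save_position_file(form_name, left, top, width, height):
        # All form positions in a single per-user settings file.
        os.makedirs(os.path.dirname(SETTINGS_FILE), exist_ok=True)
        data = {}
        if os.path.exists(SETTINGS_FILE):
            with open(SETTINGS_FILE) as f:
                data = json.load(f)
        data[form_name] = [left, top, width, height]
        with open(SETTINGS_FILE, "w") as f:
            json.dump(data, f, indent=2)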

Merging 2 WordPress databases with both corresponding and unique content + minor tweaks

A little intro: I've been developing a WordPress website at my company for a big, international client. Since its launch some time ago, we've obviously had some improvements, tweaks, bugs, etc. We have a couple of developers here, working with a clean local - staging - production workflow and syncing our local environments with git. When changes have to be made, we pull a version from the live site, make changes locally, upload it to a staging site, and after approval and final testing, deploy to the live website.
Our last round of changes was a big one, recoding a lot of our theme files as well as tweaking the database a bit. After uploading it to the staging site for approval, however, things went awry. Our contact person got confused about the versions and went on to publish new posts and edit pages on the live site, as well as create new content on the staging site. I don't mean 'do the same thing on both': they did some work on each.
Now I somehow have to sync the two so I can deploy the site and move on to further development (we also need the site live because we wrote an API that communicates between the client's large corporate site and its smaller subsidiaries' sites).
I checked out these posts already:
How can I merge two MySQL tables?
Merge 2 SQL Server databases
Merging wordpress Databases
And I tried multiple built-in and plugin solutions.
Having a very large website with a lot of custom fields (ACF), all translated into 4 languages, makes this even harder, I think. Exporting and importing has broken something on every try (often the translations), and plugins like Database Sync only offer a complete replacement of the databases, which would lose the unique content each site has. I have some knowledge of SQL queries and could brush that up a bit, but I don't really see how I could manually merge the two SQL files.
In short: I need to sync 2 databases from 2 slightly different versions of mostly the same website. There is unique and duplicate content, as well as minor tweaks in WP.
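For what it's worth, the closest I've come to a manual approach is something along the lines of the sketch below: take the posts created on the staging site after the point where the content diverged, shift their IDs past the highest ID in the live database, and copy them (plus their postmeta) across. The table prefix, credentials and cut-off date are assumptions, and it deliberately ignores taxonomies, comments and the serialized ACF/translation data that keep breaking for me:

    # Sketch only: copy posts created on the staging site after the content fork
    # into the live database, offsetting IDs to avoid collisions. Assumes the
    # default wp_ table prefix and a known fork date; terms, comments and
    # serialized meta (ACF / translations) are deliberately ignored here.
    import pymysql

    FORK_DATE = "2014-05-01 00:00:00"   # assumed date the two sites diverged

    live = pymysql.connect(host="localhost", user="wp", password="...", db="wp_live")
    stage = pymysql.connect(host="localhost", user="wp", password="...", db="wp_stage")

    with live.cursor() as c:
        c.execute("SELECT MAX(ID) FROM wp_posts")
        offset = c.fetchone()[0] + 1000            # leave some headroom

    with stage.cursor(pymysql.cursors.DictCursor) as c:
        c.execute("SELECT * FROM wp_posts WHERE post_date > %s", (FORK_DATE,))
        new_posts = c.fetchall()
        c.execute(
            "SELECT pm.* FROM wp_postmeta pm "
            "JOIN wp_posts p ON p.ID = pm.post_id WHERE p.post_date > %s",
            (FORK_DATE,))
        new_meta = c.fetchall()

    with live.cursor() as c:
        for post in new_posts:
            post["ID"] += offset
            cols = ", ".join(post.keys())
            marks = ", ".join(["%s"] * len(post))
            c.execute("INSERT INTO wp_posts (%s) VALUES (%s)" % (cols, marks),
                      list(post.values()))
        for meta in new_meta:
            meta.pop("meta_id", None)              # let the live DB assign new ids
            meta["post_id"] += offset
            cols = ", ".join(meta.keys())
            marks = ", ".join(["%s"] * len(meta))
            c.execute("INSERT INTO wp_postmeta (%s) VALUES (%s)" % (cols, marks),
                      list(meta.values()))
    live.commit()

Even if something like this works for the new posts, I'd still have to repeat the exercise for the pages edited on the live site, the taxonomy tables and the serialized ACF fields, which is why I'm hoping there's a saner way.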

In OpenERP 7.x how to give a customer read-only access through the portal to a small set of documents

I have been trying for a few days to figure out how to allow a set of customers to view a specific set of documents in OpenERP's Knowledge Management module. The goal is to easily give existing customers access to various sets of documents. My particular use case is that I deliver three different types of training sessions, each of which has a set of materials in PDF format. I would like to offer all attendees access to those materials through OpenERP (since they are already in the system as customers). I am not using the Events module and am not particularly interested in exploring it at this time.
Setup that I have tried:
Have some existing customers with at least a name and email address
Create a "Directory" in Knowledge->Document Management->Directories
Add a few pdf files in Knowledge->Documents each with the directory just created
Create a "Group" in Settings->Groups
... ???
I've tried various combinations of access rights, rules, users, etc. but nothing seems simple and nothing works exactly as I'm hoping: namely that a customer receives access through the web client to a clearly labeled menu that then shows them exactly the set of documents that they are allowed to access.
I have also tried the various "Share" features available for documents, but again, they don't seem to work well for existing customers, nor for groups of related documents.
I have been able to give a user (not a customer) restricted access to a small set of the documents in the Knowledge Management system, but even there I'm having a hard time restricting that user to only the documents in the specified directory.
I've taken a look at a number of sites (including ZestyBeanz) that describe various means of getting users to access the portal / limited features of OpenERP.
My OpenERP installation is self-hosted on Amazon so I have full control. I have written sophisticated modules for OpenERP and I am a reasonably capable Python programmer so please feel free to get seriously technical if that would help. I'm willing to consider writing a custom module to enable what I feel should be an obvious and easy feature, but that really seems like overkill!
To be clear: either a configuration or programming solution would be fine by me.
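For reference, the closest I've come on the programming side looks roughly like the sketch below: creating a group and an ir.rule over ir.attachment via XML-RPC so that members of the group only see attachments in one Knowledge directory. The directory name, group name and credentials are placeholders, and I'm not at all sure this is what the portal actually needs, which is partly why I'm asking:

    # What I've been experimenting with: a group plus an ir.rule that limits
    # ir.attachment (documents are stored as attachments) to one Knowledge
    # directory. URL, database, credentials and names are placeholders.
    import xmlrpc.client

    URL, DB, ADMIN, PWD = "http://localhost:8069", "mydb", "admin", "admin"

    uid = xmlrpc.client.ServerProxy(URL + "/xmlrpc/common").login(DB, ADMIN, PWD)
    models = xmlrpc.client.ServerProxy(URL + "/xmlrpc/object")

    def call(model, method, *args):
        return models.execute(DB, uid, PWD, model, method, *args)

    # Group that the training attendees would be added to.
    group_id = call("res.groups", "create", {"name": "Training Materials Readers"})

    # The Knowledge directory holding the PDFs (already created through the UI).
    dir_id = call("document.directory", "search",
                  [("name", "=", "Training Session A")])[0]

    # Model id for ir.attachment.
    model_id = call("ir.model", "search", [("model", "=", "ir.attachment")])[0]

    # Read-only record rule: members of the group only see attachments that sit
    # in that directory.
    call("ir.rule", "create", {
        "name": "Training materials: read own directory only",
        "model_id": model_id,
        "domain_force": "[('parent_id', '=', %d)]" % dir_id,
        "groups": [(4, group_id)],
        "perm_read": True,
        "perm_write": False,
        "perm_create": False,
        "perm_unlink": False,
    })

This restricts what the group can read, but it still doesn't give portal users a clearly labeled menu that lands them on those documents, which is the part I can't crack.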

SharePoint as an alternative to large volumes of file shares? Dreaming?

We've been on SharePoint 2007 for close to 2 years now. We find it's a great CMS that helps us centralize project documents and collaborate with less duplication and confusion. The custom list feature sometimes offers a quick and dirty alternative to custom forms and bespoke SQL solutions.
That said, we still have over 100 terabytes of file shares outside of SharePoint, with documents dating back to the beginning of time.
As we look forward to smarter, faster and bigger Network FileSystems...
(1) What realistic role should SharePoint play?
(2) How reliable can SharePoint be as a seamless place to store documents? I mean, will we be able to save/retrieve documents to SharePoint from all popular clients without having to open SharePoint?
Part of why I ask: a few months ago, as part of another project, we attempted to use SharePoint 2007 like a file share, setting up Windows drive mappings, UNC paths and/or WebDAV. Our finding was that not every client (XP, Vista, 7, Mac OS X, etc.) plays nicely with UNC, WebDAV and drive mappings. Not surprised. Does this change significantly in future SharePoint releases?
(3) Are there documents that have no business in SharePoint? Databases? Executables? Proprietary logs? What about documents where we expect lots of row-level I/O from potentially multiple users?
(4) How many customers would you say are seriously looking at SharePoint as a significant alternative to file shares? I understand SharePoint content databases should not exceed 100 GB, so we have a database for every site collection, but we have over 100 TB of potential content. If there are customers seriously looking to go this way, what might their architecture look like? BLOB storage outside of SQL? EBS vs. RBS? Who are the major players offering this, and will SharePoint ever offer it natively? EMC? StoragePoint? Who else?
(5) What about performance and content indexing concerns?
Thanks in Advance.
If you're seriously considering BLOB storage and SharePoint, then you should look into Remote BLOB Storage (RBS). See Overview of Remote BLOB Storage. Besides the free FILESTREAM-based provider, there are third-party RBS providers that can place the BLOBs on SANs such as EMC's.
Metalogix's StoragePoint has an offering called FileShare Librarian that may be the answer you are looking for: it quickly recreates the file structure and permissions in SharePoint while leaving the BLOBs externalized. There is also a FileShare Migration Manager for a full-fidelity migration, and you can still externalize the BLOBs to EMC with StoragePoint.