How to remove old Hibernate Search index

I use Hibernate Search for full-text search in my web application. There is a button in the admin panel for index creation, which runs this code:
fullTextSession.createIndexer()
.purgeAllOnStart(true)
.optimizeAfterPurge(true)
.optimizeOnFinish(true)
.batchSizeToLoadObjects( 25 )
.threadsToLoadObjects( 5 )
.threadsForSubsequentFetching( 20 )
.startAndWait();
The index builds correctly, but if I push the button again, the old index files remain on disk while the program creates a new index, and so on. Can you help me remove the old index files before creating the new ones?

I do a similar thing, but I only remove the indexes when I shut my app down, and then I set them up again on startup.
Have you tried calling purgeAll() instead of purgeAllOnStart()? That is what I call on app shutdown, and it works. To be safe, after I purge the indexes I also delete my index directory and all the files and folders it contains from disk.
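For reference, a minimal sketch of that purge-then-optimize approach (the entity class and session variable are assumptions; optimizing merges Lucene segments, which lets the obsolete index files be deleted):
fullTextSession.purgeAll(MyEntity.class);      // drop all index entries for this type (MyEntity is hypothetical)
fullTextSession.flushToIndexes();              // apply the purge to the index immediately
fullTextSession.getSearchFactory().optimize(); // merge segments; stale index files become removable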

Related

File create time doesn't change even after the file is deleted and recreated

I am using the following code:
from datetime import datetime
import time, pandas as pd, numpy as np, os, pickle
df = pd.DataFrame(np.arange(1,200))
fn = r'C:\z1.p'
pickle.dump(df, open(fn, 'wb'))
print(datetime.fromtimestamp(os.stat(fn).st_ctime))
os.remove(fn)
time.sleep(5)
pickle.dump(df, open(fn, 'wb'))
print(datetime.fromtimestamp(os.stat(fn).st_ctime))
But both print statements output the same create time:
2022-03-16 08:43:30.885011
2022-03-16 08:43:30.885011
How do I make sure that new time gets printed for second print statement?
This is a Windows feature, called "file system tunnelling".
The apocryphal history of file system tunnelling
One of the file system features you may find yourself surprised by is
tunneling, wherein the creation timestamp and short/long names of a
file are taken from a file that existed in the directory previously.
In other words, if you delete some file “File with long name.txt” and
then create a new file with the same name, that new file will have the
same short name and the same creation time as the original file. You
can read this KB article for details on what operations are sensitive
to tunnelling.
Why does tunneling exist at all?
When you use a program to edit an existing file, then save it, you
expect the original creation timestamp to be preserved, since you’re
editing a file, not creating a new one. But internally, many programs
save a file by performing a combination of save, delete, and rename
operations (such as the ones listed in the linked article), and
without tunneling, the creation time of the file would seem to change
even though from the end user’s point of view, no file got created.
...
See this archived copy of Windows NT Contains File System Tunneling Capabilities:
When a name is removed from a directory (rename or delete), its
short/long name pair and creation time are saved in a cache, keyed by
the name that was removed. When a name is added to a directory (rename
or create), the cache is searched to see if there is information to
restore. The cache is effective per instance of a directory. If a
directory is deleted, the cache for it is removed.
These paired operations can cause tunneling on "name."
delete(name)/create(name)
delete(name)/rename(source, name)
rename(name, newname)/create(name)
rename(name, newname)/rename(source, name)
The idea is to mimic the behavior MS-DOS programs expect when they use
the safe save method. They copy the modified data to a temporary file,
delete the original and rename the temporary to the original. This
should seem to be the original file when complete. Windows performs
tunneling on both FAT and NTFS file systems to ensure long/short file
names are retained when 16-bit applications perform this safe save
operation.
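In code, that safe-save dance looks roughly like this (a sketch in Java; the file names are assumptions):
import java.nio.file.*;

Path original = Paths.get("data.txt");        // hypothetical existing file
Path temp = Paths.get("data.txt.tmp");        // hypothetical temp file
Files.write(temp, "new contents".getBytes()); // copy the modified data to a temp file
Files.delete(original);                       // delete the original
Files.move(temp, original);                   // rename temp to original; tunneling keeps the old creation time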
One Windows function related to file tunneling is FltGetTunneledName():
The FltGetTunneledName routine retrieves the tunneled name for a file, given the normalized name returned for the file by a previous call to FltGetFileNameInformation, FltGetFileNameInformationUnsafe, or FltGetDestinationFileNameInformation.
...
To disable tunnelling:
Open regedit
Navigate here:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem
On the Edit menu, point to New and then click DWORD Value
Type MaximumTunnelEntries and then press Enter
On the Edit menu, click Modify
Type 0 and then click OK
Restart your computer
Done
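To check whether the registry change took effect, here is a quick sketch using java.nio (the test path is an assumption; while tunneling is active, both printed creation times match, because the file is recreated within the cache's roughly 15-second window):
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

public class TunnelCheck {
    public static void main(String[] args) throws Exception {
        Path p = Paths.get("C:\\z1.txt"); // hypothetical test path
        Files.write(p, "first".getBytes());
        printCreationTime(p);
        Files.delete(p);
        Thread.sleep(5000);               // recreate within the tunneling window
        Files.write(p, "second".getBytes());
        printCreationTime(p);             // differs from the first only if tunneling is disabled
    }

    static void printCreationTime(Path p) throws Exception {
        System.out.println(Files.readAttributes(p, BasicFileAttributes.class).creationTime());
    }
}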

libgit2: Merge with conflicts

I'm working on implementing merge using libgit2, and I'm having trouble getting it to deal with conflicts (changes to the same line in a file): the merge just aborts, with nothing written to the index or to the worktree. Resolvable conflicts (changes to different lines) work fine.
It exits with GIT_ECONFLICT, which apparently indicates that the worktree and/or index aren't clean, but I checked with git status just before calling git_merge() and it's clean.
I'm using default merge options, and checkout options set to GIT_CHECKOUT_SAFE | GIT_CHECKOUT_ALLOW_CONFLICTS. I tried using FORCE instead of SAFE but it didn't help. What else do I need to do so the conflicts are recorded?
Code is here (in Swift):
https://github.com/Uncommon/Xit/blob/ff1bf6312bb1250b1db432035947a282a2cdd362/Xit/XTRepository%2BMergePushPull.swift#L154
It turned out that the problem was that my unit test had just made a git commit using the command-line tool, so libgit2's in-memory copy of the index was out of date, and based on that stale index it thought there was a conflict. Reloading the index with git_index_read() before calling git_merge() solved the problem.
This is actually a bug in libgit2; git_merge should be reloading the index itself: https://github.com/libgit2/libgit2/issues/4203

Solr: import Lucene index while server is up and running

As described in "Can a raw Lucene index be loaded by Solr?", Lucene indexes can be imported into Solr. This works well when the Solr server is not running (by creating a Solr core folder structure in the data folder with all the needed configuration files), but it does not work while the Solr server is up and running.
Is there any call (via rest endpoint or java api) to tell Solr to re-scan the data folder?
You want to generate an index with Lucene (outside Solr) and bring it into Solr without a restart.
You must not change the index folder directly. But you can create a new core that points to the already-built index folder and switch/swap that core with the (outdated) old one, or you can merge the new index folder into the old core (see the merge sketch after the swap example below).
All of this can be done via the SolrJ admin API.
e.g. create:
CoreAdminRequest.Create req = new CoreAdminRequest.Create();
req.setConfigName(configName);
req.setSchemaName(schemaName);
req.setDataDir(dataDir);
req.setCoreName(coreName);
req.setInstanceDir(instanceDir);
req.setIsTransient(true);
req.setIsLoadOnStartup(false); // <= unless it's the production core
return req.process(adminServer);
e.g. the swap:
CoreAdminRequest request = new CoreAdminRequest();
request.setAction(CoreAdminAction.SWAP);
request.setCoreName(coreName1);
request.setOtherCoreName(coreName2);
request.process(solrClient);
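e.g. the merge route mentioned above (a sketch; the core name and index directory are assumptions, and this wraps the CoreAdmin MERGEINDEXES action):
// merge a freshly built Lucene index directory into an existing core
CoreAdminRequest.mergeIndexes(
    "existingCore",                        // target core (hypothetical name)
    new String[] { "/path/to/new/index" }, // index dir(s) to merge in (hypothetical path)
    new String[0],                         // no source cores; merging from directories only
    solrClient);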
For SolrCloud, use the first "create" approach with the Collections API, and use an alias instead of a swap.
e.g. the alias:
CollectionAdminRequest.CreateAlias req = new CollectionAdminRequest.CreateAlias();
req.setAliasedCollections(coreName);
req.setAliasName(aliasName);
return req.process(solrClient);

Symfony2 performance tweaking

Symfony2 looked so promising, powerful, and flexible, so we decided to use Symfony2 + MongoDB for one of our projects. But it turned out to be too slow (Apache/2.2.25 + PHP/5.4.20). The app is currently pretty simple, but I have noticed that httpd.exe loads the CPU up to 28% when even a simple page is requested. The page is quite light: just user profile info and the list of the user's posts. I can't even imagine how hundreds of users could be served (not even talking about numbers like 100k users) if performance doesn't get much better.
For comparison, the CPU load is 2% when opening the heavy 'products' page of an ActivationCloud account, which fetches a good amount of data (PHP + Smarty + SQL).
After looking at the Xdebug output, I found that a great deal of time (20%) is spent in ClassLoader->loadClass(...): 265 calls.
After performing the following steps:
* generated a class map:
php composer.phar dump-autoload --optimize
* installed and enabled APC with this configuration:
[APC]
extension=php_apc.dll
apc.enabled=1
apc.shm_segments=1
;32M per WordPress install
apc.shm_size=128M
;Relative to the number of cached files (you may need to
;watch your stats for a day or two to find out a good number)
apc.num_files_hint=7000
;Relative to the size of WordPress
apc.user_entries_hint=4096
;The number of seconds a cache entry is allowed to idle
;in a slot before APC dumps the cache
apc.ttl=7200
apc.user_ttl=7200
apc.gc_ttl=3600
;Setting this to 0 will give you the best performance, as APC will
;not have to check the IO for changes. However, you must clear
;the APC cache to recompile already cached files. If you are still
;developing, updating your site daily in WP-ADMIN, and running W3TC
;set this to 1
apc.stat=1
;This MUST be 0, WP can have errors otherwise!
apc.include_once_override=0
;Only set to 1 while debugging
apc.enable_cli=0
;Allow 2 seconds after a file is created before
;it is cached to prevent users from seeing half-written/weird pages
apc.file_update_protection=2
;Leave at 2M or lower. WordPress doesn't have any file sizes close to 2M
apc.max_file_size=2M
;Ignore files
apc.filters = "/var/www/apc.php"
apc.cache_by_default=1
apc.use_request_time=1
apc.slam_defense=0
apc.mmap_file_mask=/var/www/temp/apc.XXXXXX
apc.stat_ctime=0
apc.canonicalize=1
apc.write_lock=1
apc.report_autofilter=0
apc.rfc1867=0
apc.rfc1867_prefix =upload_
apc.rfc1867_name=APC_UPLOAD_PROGRESS
apc.rfc1867_freq=0
apc.rfc1867_ttl=3600
apc.lazy_classes=0
apc.lazy_functions=0
I expected a miracle after that, but it did not happen.
* enabled the APC class loader by uncommenting the following in Symfony\web\app.php:
$loader = new ApcClassLoader('sf2', $loader);
$loader->register(true);
ClassLoader->loadClass(...) got better: its 'Self' value is now 11 instead of 21.
Frankly speaking, I was shocked by what I saw in Xdebug :( a lot of repetitive calls like Container->get(...): 317 calls, and DocumentManager->getClassMetadata(...): 301 calls. In total, more than 2k function calls. Hard to believe.
These bundles are installed:
class AppKernel extends Kernel
{
    public function registerBundles()
    {
        $bundles = array(
            new Symfony\Bundle\FrameworkBundle\FrameworkBundle(),
            new Symfony\Bundle\SecurityBundle\SecurityBundle(),
            new Symfony\Bundle\TwigBundle\TwigBundle(),
            new Symfony\Bundle\MonologBundle\MonologBundle(),
            new Symfony\Bundle\SwiftmailerBundle\SwiftmailerBundle(),
            new Symfony\Bundle\AsseticBundle\AsseticBundle(),
            new Doctrine\Bundle\DoctrineBundle\DoctrineBundle(),
            new Doctrine\Bundle\MongoDBBundle\DoctrineMongoDBBundle(),
            new Sensio\Bundle\FrameworkExtraBundle\SensioFrameworkExtraBundle(),
            new HWI\Bundle\OAuthBundle\HWIOAuthBundle(),
            new Knp\Bundle\MenuBundle\KnpMenuBundle(),
            ... our bundles ...
        );
        if (in_array($this->getEnvironment(), array('dev', 'test'))) {
            $bundles[] = new Symfony\Bundle\WebProfilerBundle\WebProfilerBundle();
            $bundles[] = new Sensio\Bundle\DistributionBundle\SensioDistributionBundle();
            $bundles[] = new Sensio\Bundle\GeneratorBundle\SensioGeneratorBundle();
        }
        return $bundles;
    }
}
It was sad to find that Symfony2 got one of the worst benchmark results among PHP frameworks: http://www.techempower.com/benchmarks/#section=data-r8&hw=i7&test=json&l=sg
At the same time, Francois Zaninotto says in his blog post http://symfony.com/blog/who-really-uses-symfony that Yahoo uses Symfony2 for its bookmarks service. I tried some apps from the list at http://trac.symfony-project.org/wiki/ApplicationsDevelopedWithSymfony and they do not look slow. And on Quora (http://www.quora.com/Who-is-using-Symfony2-in-production) it is said that Dailymotion uses it as well.
How can I make the performance acceptable?
Got Symfony working 10x faster after adding the following to php.ini:
realpath_cache_size = 4096k
First, you should use Linux (you mentioned httpd.exe, so I think you are on Windows). Then use nginx instead of Apache, PHP 5.5 with FPM instead of mod_php, and OPcache instead of APC (by the way, apc.stat should be turned off). Doctrine caches should be turned on, and then you should use HTTP caching wherever you can. (You can look at Packagist's code for some hints.)

Difficulties using sub-repos with hgwebdir in Mercurial

Alright, I got myself into a deadlock with Mercurial and sub-repos... Here's what happened:
I had a large Mercurial repo that I served via Apache and hgweb.cgi.
Due to the size of the repo I decided to move to sub-repositories and share them with hgwebdir.cgi.
Using the convert tool with the filemap option I created several sub-repositories:
/main/foo
/main/bar
Then I created an entry for each sub-repository in .hgsub:
foo = foo
bar = bar
And set hgwebdir.cgi up to show $/** as the root folder.
Now when I went to my site (foo.com/hg) I saw my sub-repositories with one empty repository among them (no name, no content), but I could not download it (archive location unknown):
(screenshot: http://img707.imageshack.us/img707/8237/emptysubrepo.png)
That was all right until I added a new sub-repository.
I could not push the new .hgsub file to foo.com/hg, since that page is served by hgwebdir.
The only method that works for me currently is to switch from hgwebdir to hgweb, commit .hgsubstate, and switch back to hgwebdir.
Does someone have a good setup for such a mess?
On the webserver your main and its subrepos should appear as siblings -- not with the subrepos inside main.
Main
ASCII
AlignDistribute
And the URLs in your .hgsub should look like:
ASCII = ../ASCII
AlignDistribute = ../AlignDistribute
Then you'll be able to push/pull to http://foo.com/hg/Main and when you clone it the clone/update will automatically attach and clone down the separate subrepos.
From what I've read on https://www.mercurial-scm.org/wiki/PublishingRepositories#multiple
The keys (on the left) and the values (on the right) are both filesystem paths
The keys should be prefixes of the values and are "subtracted" from the values in order to generate the URL paths to each repository
What I'm guessing happened is that in your hgweb(dir) configuration you're specifying the same value for a collection possibly as the key, so during subtraction it ends up with a blank name and no way to get to it.
When I used [collections] to set /a/full/path = /a/full/path pointing directly at a repo, it ended up blank too: that folder is read as a single repo (because it is a repo), instead of each sub-directory being an individual repo. After I removed the .hg folder, .hgsub, and everything else from the root of my collection entry, all the subfolders started showing up properly.
In [paths] I originally used /path/to/my/project = /path/to/my/project, and since that references a single repository, the value is subtracted from the key, once again leaving ''. Instead I used project = /path/to/my/project, and it came out as 'project' (see the sketch below).
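For illustration, a minimal [paths] entry along those lines (the paths are assumptions):
[paths]
# the key becomes the URL name; the value is the repository's filesystem path
project = /path/to/my/project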
Hopefully that URL or these descriptions will get you out of your pickle!