How to run Nearest Neighbour Search with Lucene HnswGraph

I would like to use Lucene to run a nearest neighbour search. I'm using Lucene 9.0.0 on JVM 11. I did not find much documentation and mainly tried to piece things together using the existing tests.
I wrote a small test which prepares a HnswGraph, but so far the search does not yield the expected result. I set up a set of random vectors and add a final vector (0.99f, 0.01f) which is very close to my search target.
The search unfortunately never returns the expected value. I'm not sure where my error is. I assume it may be related to the insert and document id order.
Maybe someone who is more familiar with Lucene might be able to provide some feedback. Is my approach correct? I'm using Documents only for persistence.
HnswGraphBuilder builder = new HnswGraphBuilder(vectors, similarityFunction, maxConn, beamWidth, seed);
HnswGraph hnsw = builder.build(vectors);
// Run a search
NeighborQueue nn = HnswGraph.search(
    new float[] { 1, 0 },
    10,
    10,
    vectors.randomAccess(), // ? Why do I need to specify the graph values again?
    similarityFunction,     // ? Why can I specify a different similarityFunction for the search? Should that not be the same one that was used for graph creation?
    hnsw,
    null,
    new SplittableRandom(RandomUtils.nextLong()));
The whole test source can be found here:
https://gist.github.com/Jotschi/cea21a72412bcba80c46b967e9c52b0f

I managed to get this working.
Instead of using the HnswGraph API directly I now use LeafReader#searchNearestVectors. While debugging I noticed that the Lucene90HnswVectorsWriter, for example, invokes extra steps using the HnswGraph API. I assume this is done to create a correlation between inserted vectors and document ids. The node ids I retrieved using HnswGraph#search never matched up with the vector ids. I don't know whether extra steps are needed to set up the graph or whether the correlation needs to be created afterwards somehow.
The good news is that the LeafReader#searchNearestVectors method works. I have updated the example which now also makes use of the Lucene documents.
@Test
public void testWriteAndQueryIndex() throws IOException {
  // Persist and read the data
  try (MMapDirectory dir = new MMapDirectory(indexPath)) {
    // Write index
    int indexedDoc = writeIndex(dir, vectors);
    // Read index
    readAndQuery(dir, vectors, indexedDoc);
  }
}
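For context, the writeIndex helper referenced above boils down to adding one document per vector: the vector goes into a KnnVectorField and the original vector id is kept alongside it in a StoredField so that hits can be mapped back later. The following is only a sketch assuming the Lucene 9.0.x API; the field names "field" and "id" mirror the example, while DOT_PRODUCT stands in for whichever similarity function you actually want to search with.
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnVectorField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.store.MMapDirectory;

public class VectorIndexSketch {

  // Adds one document per vector; the HNSW graph is built by the codec when the segment is flushed.
  static int writeIndex(MMapDirectory dir, float[][] vectors) throws IOException {
    int indexedDocs = 0;
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
      for (int i = 0; i < vectors.length; i++) {
        Document doc = new Document();
        // The vector itself, indexed for kNN search.
        doc.add(new KnnVectorField("field", vectors[i], VectorSimilarityFunction.DOT_PRODUCT));
        // The original vector id, stored so a search hit can be related back to the input.
        doc.add(new StoredField("id", i));
        writer.addDocument(doc);
        indexedDocs++;
      }
      writer.commit();
    }
    return indexedDocs;
  }
}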
Vector 7 with [0.97|0.02] is very close to the search query target [0.98|0.01].
Test vectors:
0 => [0.13|0.37]
1 => [0.99|0.49]
2 => [0.98|0.57]
3 => [0.23|0.64]
4 => [0.72|0.92]
5 => [0.08|0.74]
6 => [0.50|0.27]
7 => [0.97|0.02]
8 => [0.90|0.21]
9 => [0.89|0.09]
10 => [0.11|0.95]
Doc Based Search:
Searching for NN of [0.98 | 0.01]
TotalHits: 11
7 => [0.97|0.02]
9 => [0.89|0.09]
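The readAndQuery side that produced the hits above is essentially a per-leaf call to LeafReader#searchNearestVectors. Again, this is only a sketch assuming the Lucene 9.0.0 signature (field name, target vector, k, accept-docs bits); later 9.x releases changed this method, so check the Javadoc of the exact version in use.
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.MMapDirectory;

public class VectorQuerySketch {

  static void readAndQuery(MMapDirectory dir, float[][] vectors, int indexedDoc) throws IOException {
    float[] queryVector = new float[] { 0.98f, 0.01f };
    try (IndexReader reader = DirectoryReader.open(dir)) {
      // A small, freshly written index has a single segment, but iterate the leaves to be safe.
      for (LeafReaderContext ctx : reader.leaves()) {
        TopDocs results = ctx.reader().searchNearestVectors("field", queryVector, 2, null);
        for (ScoreDoc sdoc : results.scoreDocs) {
          // searchNearestVectors returns leaf-local doc ids, so resolve them against the same leaf reader.
          Document doc = ctx.reader().document(sdoc.doc);
          int vectorId = doc.getField("id").numericValue().intValue();
          System.out.println(vectorId + " => [" + vectors[vectorId][0] + "|" + vectors[vectorId][1] + "]");
        }
      }
    }
  }
}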
Full example:
https://gist.github.com/Jotschi/d8a91758c84203d172f818c8be4964e4

Another way to solve this is to use the KnnVectorQuery.
try (IndexReader reader = DirectoryReader.open(dir)) {
  IndexSearcher searcher = new IndexSearcher(reader);
  System.out.println("Query: [" + String.format("%.2f", queryVector[0]) + ", " + String.format("%.2f", queryVector[1]) + "]");
  TopDocs results = searcher.search(new KnnVectorQuery("field", queryVector, 3), 10);
  System.out.println("Hits: " + results.totalHits);
  for (ScoreDoc sdoc : results.scoreDocs) {
    Document doc = reader.document(sdoc.doc);
    StoredField idField = (StoredField) doc.getField("id");
    System.out.println("Found: " + idField.numericValue() + " = " + String.format("%.1f", sdoc.score));
  }
}
Full example:
https://gist.github.com/Jotschi/7d599dff331d75a3bdd02e62f65abfba

Related

Scalding Unit Test - How to Write A Local File?

I work at a place where Scalding writes are augmented with a specific API to track dataset metadata. When converting from normal writes to these special writes, there are some intricacies with respect to Key/Value, TSV/CSV, Thrift ... datasets. I would like to compare that the binary file is the same prior to conversion and after conversion to the special API.
Given that I cannot provide the specific API for the metadata-inclusive writes, I only ask: how can I write a unit test for the .write method on a TypedPipe?
implicit val timeZone: TimeZone = DateOps.UTC
implicit val dateParser: DateParser = DateParser.default
implicit def flowDef: FlowDef = new FlowDef()
implicit def mode: Mode = Local(true)
val fileStrPath = root + "/test"
println("writing data to " + fileStrPath)
TypedPipe
  .from(Seq[Long](1, 2, 3, 4, 5))
  // .map((x: Long) => { println(x.toString); System.out.flush(); x })
  .write(TypedTsv[Long](fileStrPath))
  .forceToDisk
The above doesn't seem to write anything to local (OSX) disk.
So I wonder if I need to use a MiniDFSCluster something like this:
def setUpTempFolder: String = {
  val tempFolder = new TemporaryFolder
  tempFolder.create()
  tempFolder.getRoot.getAbsolutePath
}
val root: String = setUpTempFolder
println(s"root = $root")
val tempDir = Files.createTempDirectory(setUpTempFolder).toFile
val hdfsCluster: MiniDFSCluster = {
  val configuration = new Configuration()
  configuration.set(MiniDFSCluster.HDFS_MINIDFS_BASEDIR, tempDir.getAbsolutePath)
  configuration.set("io.compression.codecs", classOf[LzopCodec].getName)
  new MiniDFSCluster.Builder(configuration)
    .manageNameDfsDirs(true)
    .manageDataDfsDirs(true)
    .format(true)
    .build()
}
hdfsCluster.waitClusterUp()
val fs: DistributedFileSystem = hdfsCluster.getFileSystem
val rootPath = new Path(root)
fs.mkdirs(rootPath)
However, my attempts to get this MiniCluster to work haven't panned out either - somehow I need to link the MiniCluster with the Scalding write.
Note: The Scalding JobTest framework for unit testing isn't going to work because the actual data written is sometimes wrapped in a bijection codec or set up with case class wrappers prior to the writes made by the metadata-inclusive write APIs.
Any ideas how I can write a local file (without using the Scalding REPL) with either Scalding alone or a MiniCluster? (If using the latter, I need a hint on how to read the file.)
Answering my own question: there is an example of how to use a mini cluster for exactly this, reading and writing to HDFS. I will be able to cross-read with my different writes and examine them. It is in the tests for Scalding's TypedParquet type.
HadoopPlatformJobTest is an extension of JobTest that uses a MiniCluster.
With some hand-waving over the details in the link, the bulk of the code is this:
"TypedParquetTuple" should {
"read and write correctly" in {
import com.twitter.scalding.parquet.tuple.TestValues._
def toMap[T](i: Iterable[T]): Map[T, Int] = i.groupBy(identity).mapValues(_.size)
HadoopPlatformJobTest(new WriteToTypedParquetTupleJob(_), cluster)
.arg("output", "output1")
.sink[SampleClassB](TypedParquet[SampleClassB](Seq("output1"))) {
toMap(_) shouldBe toMap(values)
}
.run()
HadoopPlatformJobTest(new ReadWithFilterPredicateJob(_), cluster)
.arg("input", "output1")
.arg("output", "output2")
.sink[Boolean]("output2")(toMap(_) shouldBe toMap(values.filter(_.string == "B1").map(_.a.bool)))
.run()
}
}

Property Photo Files with PHRETS v2

My PHP code, below, attempts to download all the photos for a property listing. It successfully queries the RETS server and creates a file for each photo, but the files do not seem to be functional images. (MATRIX requires files to be downloaded, instead of URLs.)
The list of photos below suggests that it successfully queries one listing id (47030752) for all photos that exist (20 photos in this case). In a web browser, the files appear only as a small white square on a black background: e.g. (https://photos.atlantarealestate-homes.com/photos/PHOTO-47030752-9.jpg). The file size (4) also seems to be very low compared to that of a real photo.
du -s PHOTO*
4 PHOTO-47030752-10.jpg
4 PHOTO-47030752-11.jpg
4 PHOTO-47030752-12.jpg
4 PHOTO-47030752-13.jpg
4 PHOTO-47030752-14.jpg
4 PHOTO-47030752-15.jpg
4 PHOTO-47030752-16.jpg
4 PHOTO-47030752-17.jpg
4 PHOTO-47030752-18.jpg
4 PHOTO-47030752-19.jpg
4 PHOTO-47030752-1.jpg
4 PHOTO-47030752-20.jpg
4 PHOTO-47030752-2.jpg
4 PHOTO-47030752-3.jpg
4 PHOTO-47030752-4.jpg
4 PHOTO-47030752-5.jpg
4 PHOTO-47030752-6.jpg
4 PHOTO-47030752-7.jpg
4 PHOTO-47030752-8.jpg
4 PHOTO-47030752-9.jpg
The script I'm using:
#!/usr/bin/php
<?php
date_default_timezone_set('this/area');
require_once("composer/vendor/autoload.php");
$config = new \PHRETS\Configuration;
$config->setLoginUrl('https://myurl/login.ashx')
->setUsername('myser')
->setPassword('mypass')
->setRetsVersion('1.7.2');
$rets = new \PHRETS\Session($config);
$connect = $rets->Login();
$system = $rets->GetSystemMetadata();
$resources = $system->getResources();
$classes = $resources->first()->getClasses();
$classes = $rets->GetClassesMetadata('Property');
$host="localhost";
$user="db_user";
$password="db_pass";
$dbname="db_name";
$tablename="db_table";
$link=mysqli_connect ($host, $user, $password, $dbname);
$query="select mlsno, matrix_unique_id, photomodificationtimestamp from fmls_homes left join fmls_images on (matrix_unique_id=mls_no and photonum='1') where photomodificationtimestamp <> last_update or last_update is null limit 1";
print ("$query\n");
$result= mysqli_query ($link, $query);
$num_rows = mysqli_num_rows($result);
print "Fetching Images for $num_rows Homes\n";
while ($Row = mysqli_fetch_array($result)) {
    $matrix_unique_id = "$Row[matrix_unique_id]";
    $objects = $rets->GetObject('Property', 'LargePhoto', $matrix_unique_id);
    foreach ($objects as $object) {
        // does this represent some kind of error
        $object->isError();
        $object->getError(); // returns a \PHRETS\Models\RETSError
        // get the record ID associated with this object
        $object->getContentId();
        // get the sequence number of this object relative to the others with the same ContentId
        $object->getObjectId();
        // get the object's Content-Type value
        $object->getContentType();
        // get the description of the object
        $object->getContentDescription();
        // get the sub-description of the object
        $object->getContentSubDescription();
        // get the object's binary data
        $object->getContent();
        // get the size of the object's data
        $object->getSize();
        // does this object represent the primary object in the set
        $object->isPreferred();
        // when requesting URLs, access the URL given back
        $object->getLocation();
        // use the given URL and make it look like the RETS server gave the object directly
        $object->setContent(file_get_contents($object->getLocation()));
        $listing = $object->getContentId();
        $number = $object->getObjectId();
        $url = $object->getLocation();
        //$photo = $object->getContent();
        $size = $object->getSize();
        $desc = $object->getContentDescription();
        if ($number >= '1') {
            file_put_contents("/bigdirs/fmls_pics/PHOTO-{$listing}-{$number}.jpg", "$object->getContent();");
            print "$listing - $number - $size $desc\n";
        } //end if
    } //end foreach
} //end while
mysqli_close($link);
fclose($f);
?>
Are there any suggested changes to capture photos into the created files? This command creates the photo files:
file_put_contents("/bigdirs/fmls_pics/PHOTO-{$listing}-{$number}.jpg", "$object->getContent();");
There may be some parts of this script that wouldn't work in live production, but they are sufficient for testing. This script seems to successfully query the RETS server for the information needed. The problem is simply that the actual files created do not seem to be functional photos.
Thanks in Advance! :)
Your code sample is a mix of the official documentation and a usable implementation. The problem is with this line:
$object->setContent(file_get_contents($object->getLocation()));
You should completely take that out. That's actually overriding the image you downloaded with nothing before you get a chance to save the contents to a file. With that removed, it should work as expected.

Crunchbase Data API v3.1 to Google Sheets

I'm trying to pull data from the Crunchbase Open Data Map to a Google Spreadsheet. I'm following Ben Collins's script but it no longer works since the upgrade from v3 to v3.1. Anyone had any luck modifying the script for success?
var USER_KEY = 'insert your API key in here';

// function to retrieve organizations data
function getCrunchbaseOrgs() {
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sheet = ss.getSheetByName('Organizations');
  var query = sheet.getRange(3,2).getValue();

  // URL and params for the Crunchbase API
  var url = 'https://api.crunchbase.com/v/3/odm-organizations?query=' + encodeURI(query) + '&user_key=' + USER_KEY;

  var json = getCrunchbaseData(url,query);

  if (json[0] === "Error:") {
    // deal with error with fetch operation
    sheet.getRange(5,1,sheet.getLastRow(),2).clearContent();
    sheet.getRange(6,1,1,2).setValues([json]);
  }
  else {
    if (json[0] !== 200) {
      // deal with error from api
      sheet.getRange(5,1,sheet.getLastRow(),2).clearContent();
      sheet.getRange(6,1,1,2).setValues([["Error, server returned code:",json[0]]]);
    }
    else {
      // correct data comes back, filter down to match the name of the entity
      var data = json[1].data.items.filter(function(item) {
        return item.properties.name == query;
      })[0].properties;

      // parse into array for Google Sheet
      var outputData = [
        ["Name",data.name],
        ["Homepage",data.homepage_url],
        ["Type",data.primary_role],
        ["Short description",data.short_description],
        ["Country",data.country_code],
        ["Region",data.region_name],
        ["City name",data.city_name],
        ["Blog url",data.blog_url],
        ["Facebook",data.facebook_url],
        ["Linkedin",data.linkedin_url],
        ["Twitter",data.twitter_url],
        ["Crunchbase URL","https://www.crunchbase.com/" + data.web_path]
      ];

      // clear any old data
      sheet.getRange(5,1,sheet.getLastRow(),2).clearContent();

      // insert new data
      sheet.getRange(6,1,12,2).setValues(outputData);

      // add image with formula and format that row
      sheet.getRange(5,2).setFormula('=image("' + data.profile_image_url + '",4,50,50)').setHorizontalAlignment("center");
      sheet.setRowHeight(5,60);
    }
  }
}
This code no longer pulls data as expected.
I couldn't confirm the error messages from when you ran the script, so I would like to point out the clear difference. It seems that the endpoint was changed from https://api.crunchbase.com/v/3/ to https://api.crunchbase.com/v3.1/. So how about this modification?
From :
var url = 'https://api.crunchbase.com/v/3/odm-organizations?query=' + encodeURI(query) + '&user_key=' + USER_KEY;
To :
var url = 'https://api.crunchbase.com/v3.1/odm-organizations?query=' + encodeURI(query) + '&user_key=' + USER_KEY;
Note :
From your script, I also couldn't find query. So if the script doesn't work even after you modify the endpoint, please confirm that as well. You can see the details in "API v3 Compared to API v3.1" below.
References :
API v3 Compared to API v3.1
Using the API
If this was not useful for you, I'm sorry.

Selecting multiple files in a dart based chrome app

I was playing around with Google Dart and chrome apps. I tried to select a single file: No problem here!
The code looks like this and prints the filename.
Future<ChooseEntryResult> res = chrome.fileSystem.chooseEntry(new ChooseEntryOptions());
res.then((ChooseEntryResult entry) {
  print("entries: " + entry.entry.name);
});
But selecting multiple files does not work. In a native chrome app I can do:
chrome.fileSystem.chooseEntry({"acceptsMultiple":true}, function(entries) {
  console.log(entries);
});
Even this code fails (I only added acceptsMultiple: false):
Future<ChooseEntryResult> res = chrome.fileSystem.chooseEntry(new ChooseEntryOptions(acceptsMultiple: false));
res.then((ChooseEntryResult entry) {
  print("entries: " + entry.entry.name);
});
I would expect this to work:
Future<ChooseEntryResult> res = chrome.fileSystem.chooseEntry(new ChooseEntryOptions(acceptsMultiple: true));
res.then((ChooseEntryResult entry) {
  print("entries: " + entry.entries);
});
But whenever I select multiple files the "entry" and "entries" fields of ChooseEntryResult give me null. Has anyone managed to get this working?

Identify and extract or delete pages of a PDF based on a search string / text (action / javascript)

Good Evening (UK)
I'm trying to filter down a 1500+ page PDF file to only the pages which include a certain text string (typically one or two words). My laptop is locked down with respect to installing more software, BUT I have used action (script)s quite a bit.
I get the error below when I try to install this action into Adobe Acrobat X Pro (Win 7):
[screen dump of error]
The action is called "Extract Commented Pages"... it is supposed to be OK for X and XI, and it looks like what I want.
I wondered if there was something simple causing the problem, but the actionscript file is rather... busy, to say the least.
I used to have an action that I think was based on a legal redaction script, but it is filed away somewhere!
If you already have an action that does this, or a version of the above that doesn't give the error I get ("unable to import the Action.... The file is either invalid or corrupt"), I will be forever indebted to you.
Many thanks, have a good weekend!
I recently came across a script found at the following link: http://forums.adobe.com/thread/1077118
I'm having some issues getting the script to run in Acrobat, despite everything looking alright in the script itself. I'll update if I find any errors.
Here is a copy of the script:
// Set the word to search for here
var sWord = "forms";

// Source document = current document
var sd = this;

var nWords, currWord, fp, fpa = [], nd;
var fn = sd.documentFileName.replace(/\.pdf$/i, "");

// Loop through the pages
for (var i = 0; i < sd.numPages; i += 1) {
    // Get the number of words on the page
    nWords = sd.getPageNumWords(i);

    // Loop through the words on the page
    for (var j = 0; j < nWords; j += 1) {
        // Get the current word
        currWord = sd.getPageNthWord(i, j);
        if (currWord === sWord) {
            // Extract the current page to a new file
            fp = fn + "_" + i + ".pdf";
            fpa.push(fp);
            sd.extractPages({nStart: i, nEnd: i, cPath: fp});
            // Stop searching this page
            break;
        }
    }
}

// Combine the individual pages into one PDF
if (fpa.length) {
    // Open the document that's the first extracted page
    nd = app.openDoc({cPath: fpa[0], oDoc: sd});

    // Append any other pages that were extracted
    if (fpa.length > 1) {
        for (var i = 1; i < fpa.length; i += 1) {
            nd.insertPages({nPage: i - 1, cPath: fpa[i], nStart: 0, nEnd: 0});
        }
    }

    // Save to a new document and close this one
    nd.saveAs({cPath: fn + "_searched.pdf"});
    nd.closeDoc({bNoSave: true});
}