What do I put for average value when creating a ChronicleMap if the value is a class? [Chronicle-Map]

I have the following class I want to use as a value for the map I'm implementing:
import java.util.ArrayList;
import java.util.List;
import net.openhft.chronicle.bytes.BytesMarshallable;
import org.apache.commons.lang3.tuple.Pair;
public class GlossesLexicalizations implements BytesMarshallable {
    List<String> glosses = new ArrayList<String>();
    List<Pair<String, POS.Tag>> lexicalizations = new ArrayList<Pair<String, POS.Tag>>();

    public GlossesLexicalizations() {
    }

    public GlossesLexicalizations(List<String> g, List<Pair<String, POS.Tag>> l) {
        glosses = g;
        lexicalizations = l;
    }

    public void setGlosses(List<String> l) {
        glosses = l;
    }

    public void setLexicalizations(List<Pair<String, POS.Tag>> l) {
        lexicalizations = l;
    }

    public List<String> getGlosses() {
        return glosses;
    }

    public List<Pair<String, POS.Tag>> getLexicalizations() {
        return lexicalizations;
    }
}
I implement BytesMarshallable so that ChronicleMap can use it.
Then I create the database:
File file = new File("/home/sandor/Desktop/lexicalizations-and-glosses-map.bin");
String key = "bn:14232961n"; // example
List<String> glosses = bn.getGlosses(key, ULocale.US);
List<Pair<String, POS.Tag>> lexicalizations = bn.getLexicalizations(key, ULocale.US);
ChronicleMap<String, GlossesLexicalizations> lexicalizationGraph = ChronicleMap
        .of(String.class, GlossesLexicalizations.class)
        .name("lexicalizations-and-glosses")
        .constantKeySizeBySample("bn:14271053n")
        .entries(100) // number of entries + ~15% extra
        .createOrRecoverPersistedTo(file);
GlossesLexicalizations gl = new GlossesLexicalizations(glosses, lexicalizations);
lexicalizationGraph.put(key, gl);
lexicalizationGraph.close();
This gives me the following error:
Value size in serialized form must be configured in ChronicleMap, at least approximately. Use builder.averageValue()/.constantValueSizeBySample()/.averageValueSize() methods to configure the size
I can understand the average value size for a string, but what do you do for a class?

You can either create an object of your value class GlossesLexicalizations, fill it in the way you would in your real application, and pass it as a sample to averageValue(), or you can estimate roughly how much data each object will hold and pass that number to averageValueSize(). Don't worry if the number is not exact; it just needs to be in the right ballpark. This is required so that Chronicle Map can allocate space for the objects (as you probably know, Chronicle Map stores objects in off-heap memory, hence it has to do the memory allocation itself).
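For example, a minimal sketch of the first option could look like the following; the sample gloss strings and POS.Tag.NOUN below are placeholders for whatever a typical entry holds in your data:
// Hypothetical sample object, filled roughly like a real entry
GlossesLexicalizations sampleValue = new GlossesLexicalizations(
        Arrays.asList("an example gloss of roughly typical length", "another gloss"),
        Arrays.asList(Pair.of("example lexicalization", POS.Tag.NOUN)));

ChronicleMap<String, GlossesLexicalizations> lexicalizationGraph = ChronicleMap
        .of(String.class, GlossesLexicalizations.class)
        .name("lexicalizations-and-glosses")
        .constantKeySizeBySample("bn:14271053n")
        .averageValue(sampleValue)   // or: .averageValueSize(512) as a rough byte estimate
        .entries(100)
        .createOrRecoverPersistedTo(file);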

Related

Graph traversal name to graph name mapping

Is there any API using which I can get graphTraversalName to graphName mapping defined in the script?
I am using the messy code below, but it's error-prone if both graphs use the same underlying storage.
Map<String, String> graphTraversalToNameMap = new ConcurrentHashMap<String, String>();
Map<String, String> graphNameTraversalMap = new HashMap<String, String>();
Iterator<String> traversalSourceIterator = graphManager.getTraversalSourceNames().iterator();
while (traversalSourceIterator.hasNext()) {
    String traversalSource = traversalSourceIterator.next();
    String currentGraphString = ((GraphTraversalSource) graphManager.getAsBindings().get(traversalSource)).getGraph().toString();
    graphNameTraversalMap.put(currentGraphString, traversalSource);
}
Iterator<String> graphNamesIterator = graphManager.getGraphNames().iterator();
while (graphNamesIterator.hasNext()) {
    String graphName = graphNamesIterator.next();
    String currentGraphString = graphManager.getGraph(graphName).toString();
    String traversalSource = graphNameTraversalMap.get(currentGraphString);
    graphTraversalToNameMap.put(traversalSource, graphName);
}
Does gremlinExecutor.getScriptEngineManager().getBindings().entrySet() provide order guarantee? I can iterate over this and populate my map
Is there any API using which I can get graphTraversalName to graphName mapping defined in the script?
No. They share the same namespace in Gremlin Server so the relationship gets lost programmatically. You would need to do something like what you are doing but I wouldn't rely on toString() of a Graph for equality. Perhaps use the Graph instance itself? Although that might not work either depending on your situation and what you want for equality as you could have two different Graph configurations pointed at the same data and want to resolve those as the same graph. I'm also not sure that any approach will work generally for all graph systems. Anyway, I think I'd experiment with using Map<Graph, String> graphTraversalToNameMap for your case and see how that goes.
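For instance, a rough sketch of the Map<Graph, String> idea, keyed on the Graph instance rather than its toString(), might look like this; it assumes the same GraphManager methods used elsewhere in this question and that both lookups return the same Graph instance:
// Map each Graph instance to its traversal source name
Map<Graph, String> graphToTraversalSource = new HashMap<>();
for (String traversalSource : graphManager.getTraversalSourceNames()) {
    Graph graph = graphManager.getTraversalSource(traversalSource).getGraph();
    graphToTraversalSource.put(graph, traversalSource);
}
// Then resolve each graph name to its traversal source via the Graph instance
Map<String, String> graphTraversalToNameMap = new HashMap<>();
for (String graphName : graphManager.getGraphNames()) {
    String traversalSource = graphToTraversalSource.get(graphManager.getGraph(graphName));
    if (traversalSource != null) {
        graphTraversalToNameMap.put(traversalSource, graphName);
    }
}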
Does gremlinExecutor.getScriptEngineManager().getBindings().entrySet() provide order guarantee?
No as it is backed by a ConcurrentHashMap. You would have to provide your own order.
Underlying storage details can be obtained from the configuration object and used for the mapping; sample code:
public class GraphTraversalMappingUtil {

    private static final Map<String, String> graphTraversalToNameMap = new ConcurrentHashMap<String, String>();

    public static void populateGraphTraversalToNameMapping(GraphManager graphManager) {
        if (graphTraversalToNameMap.size() != 0) {
            return;
        }
        Iterator<String> traversalSourceIterator = graphManager.getTraversalSourceNames().iterator();
        Map<StorageBackendKey, String> storageKeyToTraversalMap = new HashMap<StorageBackendKey, String>();
        while (traversalSourceIterator.hasNext()) {
            String traversalSource = traversalSourceIterator.next();
            StorageBackendKey key = new StorageBackendKey(
                    graphManager.getTraversalSource(traversalSource).getGraph().configuration());
            storageKeyToTraversalMap.put(key, traversalSource);
        }
        Iterator<String> graphNamesIterator = graphManager.getGraphNames().iterator();
        while (graphNamesIterator.hasNext()) {
            String graphName = graphNamesIterator.next();
            StorageBackendKey key = new StorageBackendKey(
                    graphManager.getGraph(graphName).configuration());
            graphTraversalToNameMap.put(storageKeyToTraversalMap.get(key), graphName);
        }
    }
}
For full code, refer: https://pastebin.com/7m8hi53p
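In case it helps to visualize it, here is a rough sketch of what such a key class might look like (the real one is in the pastebin above). The property names "storage.backend", "storage.hostname" and "storage.directory" are JanusGraph-style assumptions; use whatever properties identify the underlying storage for your graph provider:
// Depending on your TinkerPop version, the import may be
// org.apache.commons.configuration2.Configuration instead.
import org.apache.commons.configuration.Configuration;
import java.util.Objects;

public class StorageBackendKey {
    private final String backend;
    private final String hostname;
    private final String directory;

    public StorageBackendKey(Configuration conf) {
        // Pull whichever properties identify the underlying storage
        this.backend = conf.getString("storage.backend", "");
        this.hostname = conf.getString("storage.hostname", "");
        this.directory = conf.getString("storage.directory", "");
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof StorageBackendKey)) return false;
        StorageBackendKey other = (StorageBackendKey) o;
        return backend.equals(other.backend)
                && hostname.equals(other.hostname)
                && directory.equals(other.directory);
    }

    @Override
    public int hashCode() {
        return Objects.hash(backend, hostname, directory);
    }
}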

Use Java8 Stream on JDBCTemplate Results from HIVE

I am using JdbcTemplate to query Hive and then writing the results to a .csv file. I basically just generate a list of objects and then stream the list to write each record to the file.
I would like to stream the results as they come back from Hive and write them to the file, instead of waiting for the whole result set and then processing it. Can anyone point me in the right direction? Thanks!
private List<Avs> queryAvsData(String asSql) {
    List<Avs> llistAvs = new ArrayList<Avs>();
    List<Map<String, Object>> rows = hiveJdbcTemplate.queryForList(asSql);
    Iterator<Map<String, Object>> it = rows.iterator();
    while (it.hasNext()) {
        Map<String, Object> row = it.next();
        Avs laAvs = Avs.builder()
                .make((String) row.get("make"))
                .model((String) row.get("model"))
                .build();
        llistAvs.add(laAvs);
    }
    return llistAvs;
}
It doesn't look like there's a built-in solution, but you can do it. Basically, you wrap the existing functionality in an iterator and use a spliterator to turn it into a stream. There's a blog post on the subject; in short:
The code implements Spring’s ResultSetExtractor interface, which is a Single Abstract Method (SAM) interface, allowing the use of a lambda expression to implement it.
The implementation wraps the SQL ResultSet in an iterator, constructs a stream using the Spliterators and StreamSupport utility classes, and applies that to a Function taking a stream of row sets and returning a generic result.
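A minimal sketch of that approach, assuming the same Avs query and hiveJdbcTemplate as in the question and simply writing CSV lines as rows arrive (the column names and CSV formatting are illustrative, not from the blog post):
import java.io.PrintWriter;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;
import org.springframework.jdbc.core.ResultSetExtractor;

private void streamAvsDataToCsv(String asSql, PrintWriter out) {
    hiveJdbcTemplate.query(asSql, (ResultSetExtractor<Void>) rs -> {
        // Wrap the open ResultSet in an Iterator; the lookahead cache keeps
        // repeated hasNext() calls from advancing the cursor twice.
        Iterator<String> rows = new Iterator<String>() {
            private Boolean hasNext;
            @Override public boolean hasNext() {
                if (hasNext == null) {
                    try { hasNext = rs.next(); } catch (SQLException e) { throw new RuntimeException(e); }
                }
                return hasNext;
            }
            @Override public String next() {
                if (!hasNext()) throw new NoSuchElementException();
                hasNext = null; // force a fresh rs.next() on the following hasNext()
                try { return rs.getString("make") + "," + rs.getString("model"); }
                catch (SQLException e) { throw new RuntimeException(e); }
            }
        };
        // Turn the iterator into a Stream and write each record as it arrives,
        // while the ResultSet is still open - nothing is collected into a List.
        Stream<String> lines = StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(rows, Spliterator.ORDERED), false);
        lines.forEach(out::println);
        return null; // ResultSetExtractor requires a return value
    });
}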
It's possible to stream values from JdbcTemplate. The following example is a service based on Spring Boot 2.4.8.
Since I ran into problems (a connection leak) using queryForStream, I am putting demo code here to make clear that the stream must be closed after use.
import lombok.RequiredArgsConstructor;
import org.springframework.jdbc.core.SingleColumnRowMapper;
import org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate;
import org.springframework.stereotype.Service;
import java.util.Map;
import java.util.stream.Stream;

@Service
@RequiredArgsConstructor
public class DataCleaningService {

    private final NamedParameterJdbcTemplate jdbcTemplate;

    public void doSomeStreaming() {
        String nativeQuery = "SELECT string_value FROM my_table WHERE column = :valueToFilter";
        Map<String, Object> queryParameters = Map.of("valueToFilter", "my value");
        SingleColumnRowMapper<String> stringRowMapper = SingleColumnRowMapper.newInstance(String.class);
        try (Stream<String> stringValueStream = jdbcTemplate.queryForStream(nativeQuery, queryParameters, stringRowMapper)) {
            stringValueStream.forEach(stringValue -> {
                // do the needed action with the value
                System.out.printf("My cool value: %s", stringValue);
            });
        }
    }
}

Does this saving/loading pattern have a name?

There's a variable persistence concept I have integrated multiple times:
// Standard initialization
boolean save = true;
Map<String, Object> dataHolder;

// Variables to persist
int number = 10;
String text = "I'm saved";

// Use the variables in various ways in the project
void useVariables() { ... number ... text ... }

// Save the variables into a data structure (e.g. to write them to a file)
public Map<String, Object> getVariables()
{
    Map<String, Object> data = new LinkedHashMap<String, Object>();
    persist(data);
    return(data);
}

// Load the variables from the data structure
public void setVariables(Map<String, Object> data)
{
    persist(data);
}

void persist(Map<String, Object> data)
{
    // If the given data structure is empty, it means data should be saved
    save = data.isEmpty();
    dataHolder = data;
    number = handleVariable("theNumber", number);
    text = handleVariable("theText", text);
    ...
}

@SuppressWarnings("unchecked")
private <T> T handleVariable(String name, T value)
{
    // If currently saving
    if (save)
        dataHolder.put(name, value);      // Just add to the data structure
    else // If currently loading
        return((T) dataHolder.get(name)); // Read and return from the data structure
    return(value); // Return the given variable (no change)
}
The main benefit of this principle is that there is only a single place where you have to mention new variables added during development, and it's one simple line per variable.
Of course, you can move the handleVariable() function to a different class that also contains the save and dataHolder variables, so they won't live in the main application class.
Additionally, you could attach meta-information needed for persisting the data structure to a file (or similar) by storing a small wrapper class containing that information plus the variable, instead of the value itself.
Performance could be improved by keeping track of the order (in another data structure, the first time persist() runs) and using a dataHolder backed by an array instead of a search-based map (i.e. an index instead of a name string).
However, now I have to document this for the first time, and I wondered whether this function-reuse principle has a name.
Does someone recognize this idea?
Thank you very much!

Sorting an ArrayList of NotesDocuments using a CustomComparator

I'm trying to sort a Documents Collection using a java.util.ArrayList.
var myarraylist:java.util.ArrayList = new java.util.ArrayList();
var doc:NotesDocument = docs.getFirstDocument();
while (doc != null) {
    myarraylist.add(doc);
    doc = docs.getNextDocument(doc);
}
The reason I'm trying an ArrayList and not a TreeMap or HashMap is that the field I need for sorting is not unique, which is a limitation of those two structures (I can't use it as my own key).
The problem I'm facing is calling CustomComparator:
Here is how I'm trying to sort my ArrayList:
java.util.Collections.sort(myarraylist, new CustomComparator());
Here is my class:
import java.util.Comparator;
import lotus.notes.NotesException;

public class CustomComparator implements Comparator<lotus.notes.Document> {
    public int compare(lotus.notes.Document doc1, lotus.notes.Document doc2) {
        try {
            System.out.println("Here");
            System.out.println(doc1.getItemValueString("Form"));
            return doc1.getItemValueString("Ranking").compareTo(doc2.getItemValueString("Ranking"));
        } catch (NotesException e) {
            e.printStackTrace();
        }
        return 0;
    }
}
Error:
Script interpreter error, line=44, col=23: Error calling method
'sort(java.util.ArrayList, com.myjavacode.CustomComparator)' on java
class 'java.util.Collections'
Any help will be appreciated.
I tried to run your SSJS code in a try-catch block, printing the exception in the catch block, and I got the following message: java.lang.ClassCastException: lotus.domino.local.Document incompatible with lotus.notes.Document
I think you have got incorrect fully qualified class names of Document and NotesException. They should be lotus.domino.Document and lotus.domino.NotesException respectively.
Here is the SSJS from the repeat control:
var docs:NotesDocumentCollection = database.search(query, null, 0);
var myarraylist:java.util.ArrayList = new java.util.ArrayList();
var doc:NotesDocument = docs.getFirstDocument();
while (doc != null) {
    myarraylist.add(doc);
    doc = docs.getNextDocument(doc);
}
java.util.Collections.sort(myarraylist, new com.mycode.CustomComparator());
return myarraylist;
Here is my class:
package com.mycode;
import java.util.Comparator;
public class CustomComparator implements Comparator<lotus.domino.Document> {
    public int compare(lotus.domino.Document doc1, lotus.domino.Document doc2) {
        try {
            // Numeric comparison
            Double num1 = doc1.getItemValueDouble("Ranking");
            Double num2 = doc2.getItemValueDouble("Ranking");
            return num1.compareTo(num2);
            // String comparison
            // return doc1.getItemValueString("Description").compareTo(doc2.getItemValueString("Description"));
        } catch (lotus.domino.NotesException e) {
            e.printStackTrace();
        }
        return 0;
    }
}
Not that this answer is necessarily the best practice for you, but the last time I tried to do the same thing, I realized I could instead grab the documents as a NotesViewEntryCollection, via SSJS:
var col:NotesViewEntryCollection = database.getView("myView").getAllEntriesByKey(mtgUnidVal)
instead of a NotesDocumentCollection. I just ran through each entry, grabbed the UNIDs for those that met my criteria, added them to a java.util.ArrayList(), and then sent that onward to its destination. I was already sorting the documents for display elsewhere, using a column categorized by parent UNID, so this is probably what I should have done first; I'm still on the leading edge of the XPages/Notes learning curve, so every day brings something new.
Again, if your collection can't be expressed as part of a Notes view, sorry, this won't apply, but where a simple approach is available, KISS. I remind myself of that frequently.

How to add options for Analyzers in Apache Lucene?

Lucene has Analyzers that basically tokenize and filter the corpus when indexing. Operations include converting tokens to lowercase, stemming, removing stopwords, etc.
I'm running an experiment where I want to try all possible combinations of analysis operations: stemming only, stopping only, stemming and stopping, ...
In total, there are 36 combinations that I want to try.
How can I easily and gracefully do this?
I know that I can extend the Analyzer class and implement the tokenStream() function to create my own Analyzer:
public class MyAnalyzer extends Analyzer
{
    public TokenStream tokenStream(String field, final Reader reader) {
        return new NameFilter(
                new CaseNumberFilter(
                        new StopFilter(
                                new LowerCaseFilter(
                                        new StandardFilter(
                                                new StandardTokenizer(reader))),
                                StopAnalyzer.ENGLISH_STOP_WORDS)));
    }
}
What I'd like to do is write one such class, which can somehow take boolean values for each of the possible operations (doStopping, doStemming, etc.). I don't want to have to write 36 different Analyzer classes that each perform one of the 36 combinations. What makes it difficult is the way the filters are all combined together in their constructors.
Any ideas on how to do this gracefully?
EDIT: By "gracefully", I mean that I can easily create a new Analyzer in some sort of loop:
analyzer = new MyAnalyzer(doStemming, doStopping, ...)
where doStemming and doStopping change with each loop iteration.
Solr solves this problem by using Tokenizer and TokenFilter factories. You could do the same, for example:
public interface TokenizerFactory {
    Tokenizer newTokenizer(Reader reader);
}

public interface TokenFilterFactory {
    TokenFilter newTokenFilter(TokenStream source);
}

public class ConfigurableAnalyzer {

    private final TokenizerFactory tokenizerFactory;
    private final List<TokenFilterFactory> tokenFilterFactories;

    public ConfigurableAnalyzer(TokenizerFactory tokenizerFactory, TokenFilterFactory... tokenFilterFactories) {
        this.tokenizerFactory = tokenizerFactory;
        this.tokenFilterFactories = Arrays.asList(tokenFilterFactories);
    }

    public TokenStream tokenStream(String field, Reader source) {
        TokenStream sink = tokenizerFactory.newTokenizer(source);
        for (TokenFilterFactory tokenFilterFactory : tokenFilterFactories) {
            sink = tokenFilterFactory.newTokenFilter(sink);
        }
        return sink;
    }
}
This way, you can configure your analyzer by passing a factory for one tokenizer and 0 to n filters as constructor arguments.
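For instance, a hypothetical driver loop over two of the flags might look like this; the concrete tokenizer/filter constructors mirror the ones used elsewhere in this question, and their exact signatures vary between Lucene versions, so treat them as placeholders:
List<ConfigurableAnalyzer> analyzers = new ArrayList<>();
for (boolean doStopping : new boolean[] { false, true }) {
    for (boolean doStemming : new boolean[] { false, true }) {
        List<TokenFilterFactory> filters = new ArrayList<>();
        filters.add(source -> new LowerCaseFilter(source));
        if (doStopping) {
            filters.add(source -> new StopFilter(source, StopAnalyzer.ENGLISH_STOP_WORDS));
        }
        if (doStemming) {
            filters.add(source -> new PorterStemFilter(source));
        }
        analyzers.add(new ConfigurableAnalyzer(
                reader -> new StandardTokenizer(reader),
                filters.toArray(new TokenFilterFactory[0])));
    }
}
// Each ConfigurableAnalyzer in the list now represents one combination of options.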
Add some class variables to the custom Analyzer class which can be easily set and unset on the fly. Then, in the tokenStream() function, use those variables to determine which filters to apply.
public class MyAnalyzer extends Analyzer {

    private Set customStopSet;
    public static final String[] STOP_WORDS = ...;

    private boolean doStemming = false;
    private boolean doStopping = false;

    public MyAnalyzer() {
        super();
        customStopSet = StopFilter.makeStopSet(STOP_WORDS);
    }

    public void setDoStemming(boolean val) {
        this.doStemming = val;
    }

    public void setDoStopping(boolean val) {
        this.doStopping = val;
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        // First, convert to lower case
        TokenStream out = new LowerCaseTokenizer(reader);
        if (this.doStopping) {
            out = new StopFilter(true, out, customStopSet);
        }
        if (this.doStemming) {
            out = new PorterStemFilter(out);
        }
        return out;
    }
}
There is one gotcha: LowerCaseTokenizer takes as input the reader variable, and returns a TokenStream. This is fine for the following filters (StopFilter, PorterStemFilter), because they take TokenStreams as input and return them as output, and so we can chain them together nicely. However, this means you can't have a filter before the LowerCaseTokenizer that returns a TokenStream. In my case, I wanted to split camelCase words into parts, and this has to be done before converting to lower case. My solution was to perform the splitting manually in the custom Indexer class, so by the time MyAnalyzer sees the text, it has already been split.
(I have also added a boolean flag to my custom Indexer class, so now both can work based solely on flags.)
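For what it's worth, a rough illustration of that kind of manual camelCase splitting (the regex is an assumption about how you might want to split, not code from the original indexer):
String original = "parseXmlDocument";
// Insert a space wherever a lowercase letter or digit is followed by an uppercase letter
String split = original.replaceAll("(?<=[a-z0-9])(?=[A-Z])", " ");
System.out.println(split); // prints: parse Xml Document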
Is there a better answer?