Is there an algorithm for encoding two texts so that the result is the same even if their positions are swapped? - sql

Maybe the question is hard to understand, so here is what I mean.
Given two sample texts
Text1 = "abc" and Text2 = "def"
which algorithm can do something like
encoding(Text1, Text2) == encoding(Text2, Text1)
I also want the result of the function to be unique (not colliding with encoding(Text3, Text1)), as with other checksum algorithms.
Actually, the root of this is that I want to search my database for rows that answer "Who is a friend of B" or "B is a friend of whom" by searching only one column, like
SELECT * FROM relationship WHERE hash = "a039813"
not
SELECT *
FROM relationship
WHERE (personColumn1 = "B" and verb = "friend") OR
(personColumn2 = "B" and verb = "friend")

You can adapt any encoding to ensure encoding(Text1, Text2) == encoding(Text2, Text1) by simply enforcing a particular ordering of the arguments. Since you're dealing with text, maybe use a basic lexical order:
encoding_adapter(t1, t2)
{
if (t1 < t2)
return encoding(t1, t2)
else
return encoding(t2, t1)
}
If you use a simple single-input hash function you're probably tempted to write:
encoding(t1, t2)
{
return hash(t1 + t2)
}
But this can cause collisions: encoding("AA", "B") == encoding("A", "AB"). There are a couple easy solutions:
if you have a character or string that never appears in your input strings then use it as a delimiter:
return hash(t1 + delimiter + t2)
hash the hashes:
return hash(hash(t1) + hash(t2))
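As a rough sketch of the two ideas above combined (sort the arguments, then hash the hashes), here is one way it could look in Python. The name pair_key and the choice of SHA-256 are my assumptions, not something from the question, and uniqueness is only as good as the underlying hash: collisions are extremely unlikely but not impossible.

```python
import hashlib


def pair_key(t1: str, t2: str) -> str:
    """Order-independent key for a pair of strings (hypothetical helper)."""
    # Enforce a fixed argument order so (a, b) and (b, a) hash the same.
    first, second = sorted((t1, t2))
    # Hash each input separately; the digests have a fixed length, so
    # concatenating them cannot blur the boundary between the two inputs.
    d1 = hashlib.sha256(first.encode("utf-8")).digest()
    d2 = hashlib.sha256(second.encode("utf-8")).digest()
    return hashlib.sha256(d1 + d2).hexdigest()


print(pair_key("abc", "def") == pair_key("def", "abc"))  # True
print(pair_key("AA", "B") == pair_key("A", "AB"))        # False
```

The returned hex string could then be stored in the single hash column and looked up directly.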

Building string from list of list of strings

I have this rather ugly way of building a string from a list:
val input = listOf("[A,B]", "[C,D]")
val builder = StringBuilder()
builder.append("Serialized('IDs((")
for (pt in input) {
builder.append(pt[0] + " " + pt[1])
builder.append(", ")
}
builder.append("))')")
The problem is that it adds a comma after the last element, and to avoid that I would need another if check in the loop for the last element.
I wonder if there is a more concise way of doing this in Kotlin?
EDIT
End result should be something like:
Serialized('IDs((A B,C D))')
In Kotlin you can use joinToString for this kind of use case (it inserts the separator only between elements).
It is very versatile because it lets you specify a transform function for each element (in addition to the more classic separator, prefix, and postfix). This makes it equivalent to mapping all elements to strings and then joining them together, but in a single call.
If input really is a List<List<String>> like you mention in the title and you assume in your loop, you can use:
input.joinToString(
prefix = "Serialized('IDs((",
postfix = "))')",
separator = ", ",
) { (x, y) -> "$x $y" }
Note that the syntax with (x, y) is a destructuring syntax that automatically gets the first and second element of the lists inside your list (parentheses are important).
If your input is in fact a List<String> as in listOf("[A,B]", "[C,D]") that you wrote at the top of your code, you can instead use:
input.joinToString(
prefix = "Serialized('IDs((",
postfix = "))')",
separator = ", ",
) { it.removeSurrounding("[", "]").replace(",", " ") }
val input = listOf("[A,B]", "[C,D]")
val result =
"Serialized('IDs((" +
input.joinToString(",") { it.removeSurrounding("[", "]").replace(",", " ") } +
"))')"
println(result) // Output: Serialized('IDs((A B,C D))')
Kotlin provides an extension function joinToString (on Iterable) for this type of purpose.
input.joinToString(",", "Serialized('IDs((", "))')")
This will correctly add the separator.

Smart search and replace [closed]

I had a few thousand lines of code containing pieces like this
opencanmanager.GetObjectDict()->ReadDataFrom(0x1234, 1).toInt()
that I needed to convert to another library that uses syntax like this
ReadFromOD<int>(0x1234, 1)
Basically I need to search for
[whatever1]opencanmanager.GetObjectDict()->ReadDataFrom([whatever2]).toInt()[whatever3]
across all the lines of a text file and to replace every occurrence of it with
[whatever1]ReadFromOD<int>([whatever2])[whatever3]
and then do the same for a few other data types.
Doing that manually would have been a few days of terribly dull work, and none of the editors I know of offer automatic refactoring tools smart enough for this.
Now I have solved the problem using GNU AWK with the script below
#!/usr/bin/awk -f
BEGIN {
spl1 = "opencanmanager.GetObjectDict()->ReadDataFrom("
spl2 = ").to"
spl2_1 = ").toString()"
spl2_2 = ").toUInt()"
spl2_3 = ").toInt()"
min_spl2_len = length(spl2_3)
repl_start = "ReadFromOD<"
repl_mid1 = "QString"
repl_mid2 = "uint"
repl_mid3 = "int"
repl_end = ">("
repl_after = ")"
}
function replacer(str)
{
pos1 = index(str, spl1)
pos2 = index(str, spl2)
if (!pos1 || !pos2) {
return str
}
strbegin = substr(str, 0, pos1-1)
mid_start_pos = pos1+length(spl1)
strkey = substr(str, pos2, min_spl2_len)
key1 = substr(spl2_1, 0, min_spl2_len)
key2 = substr(spl2_2, 0, min_spl2_len)
key3 = substr(spl2_3, 0, min_spl2_len)
strmid = substr(str, mid_start_pos, pos2-mid_start_pos)
if (strkey == key1) {
repl_mid = repl_mid1; spl2_fact = spl2_1;
} else if (strkey == key2) {
repl_mid = repl_mid2; spl2_fact = spl2_2;
} else if (strkey == key3) {
repl_mid = repl_mid3; spl2_fact = spl2_3;
} else {
print "ERROR!!! Found", spl1, "but not any of", spl2_1, spl2_2, spl2_3 "!" > "/dev/stderr"
exit 1
}
str_remainder = substr(str, pos2+length(spl2_fact))
return strbegin repl_start repl_mid repl_end strmid repl_after str_remainder
}
{
resultstr = $0
do {
resultstr = replacer(resultstr)
more_spl = index(resultstr, spl1) && index(resultstr, spl2)
} while (more_spl)
print(resultstr)
}
and everything works fine, but the thing still bugs me somewhat. My solution still feels a bit too complicated for a job that must be very common and must have an easy standard solution that I just don't know about for some reason.
I am prepared to just let it go, but if you know a more elegant and quick one-liner solution or some specific tool for this kind of smart code modification, I would definitely like to know.
If sed is an option, you can try this solution which should match both output examples from input such as this.
$ cat input_file
opencanmanager.GetObjectDict()->ReadDataFrom(0x1234, 1).toInt()
power1 = opencanmanager.GetObjectDict()->ReadDataFrom(0x1234, 1).toInt() * opencanmanager.GetObjectDict()->ReadDataFrom(0x5678, 1).toUInt() * FACTOR1;
power2 = opencanmanager.GetObjectDict()->ReadDataFrom(0x5678, 1).toUInt() / 2;
$ sed -E 's/ReadDataFrom/ReadFromOD<int>/g;s/int/uint/2;s/(.*= )?[^>]*>([^\.]*)[^\*|/]*?(\*|\/.{2,})?[^\.]*?[^>]*?>?([^\.]*)?[^\*]*?(.*)?/\1\2 \3 \4 \5/' input_file
ReadFromOD<int>(0x1234, 1)
power1 = ReadFromOD<int>(0x1234, 1) * ReadFromOD<uint>(0x5678, 1) * FACTOR1;
power2 = ReadFromOD<int>(0x5678, 1) / 2;
Explanation
s/ReadDataFrom/ReadFromOD<int>/g - The first part of the command does a simple global substitution, replacing all occurrences of ReadDataFrom with ReadFromOD<int>
s/int/uint/2 - The second part substitutes only the second occurrence of int with uint, if there is one
s/(.*= )?[^>]*>([^\.]*)[^\*|/]*?(\*|\/.{2,})?[^\.]*?[^>]*?>?([^\.]*)?[^\*]*?(.*)?/\1\2 \3 \4 \5/ - The third part utilizes sed grouping and back referencing.
(.*= )? - Group one returned with back reference \1 captures everything up to an = character, ? makes it conditional meaning it does not have to exist for the remaining grouping to match.
[^>]*> - This is an excluded match as it is not within parenthesis (). It matches everything continuing from the space after the = character up to the >, a literal > is then included to exclude that also. This is not conditional and must match.
([^\.]*) - Continuing from the excluded match, this will continue to match everything up to the first . and can be returned with back reference \2. This is not conditional and must match.
[^\*|/]*? - This is an excluded match and will match everything up to the literal * or | to /. It is conditional ? so does not have to match.
(\*|\/.{2,})? - Continuing from the excluded match, this will continue to match everything up to and including * or | / followed by at least 2 or more{2,} characters. It can be returned with back reference \3 and is conditional ?
[^\.]*?[^>]*?>? - Conditional excluded matches. Match everything up to a literal ., then everything up to > and include >
([^\.]*)? - Conditional group matching up to a full stop .. It can be returned with back reference \4.
[^\*]*? - Excluded. Continue matching up to *
(.*)? - Everything else after the final * should be grouped and returned with back reference \5 if it exists ?
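If a scripting language is an option, the same replacement can be sketched with a single regex that captures the arguments and the conversion suffix. This is only a sketch in Python: the names TYPE_MAP and convert_line are made up for illustration, the type mapping is taken from the AWK script in the question, and the pattern assumes the arguments of ReadDataFrom contain no nested parentheses.

```python
import re

# Mapping from the .toXxx() suffix to the template argument,
# as in the question's AWK script.
TYPE_MAP = {"Int": "int", "UInt": "uint", "String": "QString"}

# Group 1: the arguments (assumed to contain no parentheses).
# Group 2: the conversion suffix after ".to".
PATTERN = re.compile(
    r"opencanmanager\.GetObjectDict\(\)->ReadDataFrom\(([^()]*)\)"
    r"\.to(Int|UInt|String)\(\)"
)


def convert_line(line: str) -> str:
    """Rewrite every matching call on the line; other text is untouched."""
    return PATTERN.sub(
        lambda m: f"ReadFromOD<{TYPE_MAP[m.group(2)]}>({m.group(1)})", line
    )


src = ("power1 = opencanmanager.GetObjectDict()->ReadDataFrom(0x1234, 1).toInt()"
       " * opencanmanager.GetObjectDict()->ReadDataFrom(0x5678, 1).toUInt() * FACTOR1;")
print(convert_line(src))
# power1 = ReadFromOD<int>(0x1234, 1) * ReadFromOD<uint>(0x5678, 1) * FACTOR1;
```

Because each match is rewritten independently, several calls on one line with different suffixes are handled in one pass.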

Lucene Highlighter class: highlight different words in different colors

Probably most people reading the title who know a bit about Lucene won't need much further explanation. NB I use Jython but I think most Java users will understand the Java equivalent...
It's a classic thing to want to do: you have more than one term in your search string... in Lucene terms this returns a BooleanQuery. Then you use something like this code to highlight (NB I am a Lucene newbie, this is all closely tweaked from Net examples):
yellow_highlight = SimpleHTMLFormatter( '<b style="background-color:yellow">', '</b>' )
green_highlight = SimpleHTMLFormatter( '<b style="background-color:green">', '</b>' )
...
stream = FrenchAnalyzer( Version.LUCENE_46 ).tokenStream( "both", StringReader( both ) )
scorer = QueryScorer( fr_query, "both" )
fragmenter = SimpleSpanFragmenter(scorer)
highlighter = Highlighter( yellow_highlight, scorer )
highlighter.setTextFragmenter(fragmenter)
best_fragments = highlighter.getBestTextFragments( stream, both, True, 5 )
if best_fragments:
for best_frag in best_fragments:
print "=== best frag: %s, type %s" % ( best_frag, type( best_frag ))
html_text += "&bull %s<br>\n" % unicode( best_frag )
... and then the html_text is put in a JTextPane for example.
But how would you make the first word in your query highlight with a yellow background and the second word highlight with a green background? I have tried to understand the various classes in org.apache.lucene.search... to no avail. So my only way of learning was googling. I couldn't find any clues...
I asked this question four years ago... At the time I did manage to implement a solution using javax.swing.text.html.HTMLDocument. There's also the interface org.w3c.dom.html.HTMLDocument in the standard Java library. This way is hard work.
But for anyone interested there's a far simpler solution. It takes advantage of the fact that Lucene's SimpleHTMLFormatter returns about the simplest imaginable "marked up" piece of text: chosen words are highlighted with the HTML B tag. That's it. It's not even a "proper" HTML fragment, just a String with <B>s and </B>s in it.
A multi-word query generates a BooleanQuery... from which you can extract multiple TermQuerys by going booleanQuery.clauses() ... getQuery()
I'm working in Groovy. The colouring I want to apply is console codes, as per BASH (or Cygwin). Other types of colouring can be worked out on this model.
So you set up a map before to hold your "markup details":
def markupDetails = [:]
Then for each TermQuery, you call this, with the same text param each time, stipulating a different colour param for each term. NB I'm using Lucene 6.
def createHighlightAndAnalyseMarkup( TermQuery tq, String text, String colour ) {
def termQueryScorer = new QueryScorer( tq )
def termQueryHighlighter = new Highlighter( formatter, termQueryScorer )
TokenStream stream = TokenSources.getTokenStream( fieldName, null, text, analyser, -1 )
String[] frags = termQueryHighlighter.getBestFragments( stream, text, 999999 )
// not sure under what circs you get > 1 fragment...
assert frags.size() <= 1
// NB you don't always get all terms in all returned LDocuments...
if( frags.size() ) {
String highlightedFrag = frags[ 0 ]
Matcher boldTagMatcher = highlightedFrag =~ /<\/?B>/
def pos = 0
def previousEnd = 0
while( boldTagMatcher.find()) {
pos += boldTagMatcher.start() - previousEnd
previousEnd = boldTagMatcher.end()
markupDetails[ pos ] = boldTagMatcher.group() == '<B>'? colour : ConsoleColors.RESET
}
}
}
As I said, I wanted to colourise console output. The colour parameter in the method here is one of the standard console colour codes (ANSI escape sequences). E.g. yellow is \033[033m. ConsoleColors.RESET is \033[0m and marks the place where each coloured bit of text stops.
... after you've finished doing this with all TermQuerys you will have a nice map telling you where individual colours begin and end. You work backwards from the end of the text so as to insert the "markup" at the right position in the String. NB here text is your original unmarked-up String:
markupDetails.sort().reverseEach{ pos, markup ->
String firstPart = text.substring( 0, pos )
String secondPart = text.substring( pos )
text = firstPart + markup + secondPart
}
... at the end of which text contains your marked-up String: print to console. Lovely.
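For readers who don't use Groovy, the position bookkeeping and the backwards insertion can be sketched without Lucene at all. This Python sketch assumes you already have one highlighted String per term (with <B> and </B> tags, as described above); the function names and the sample sentence are invented for illustration.

```python
import re

RESET = "\033[0m"


def collect_markup(highlighted: str, colour: str, markup: dict) -> None:
    """Record where each <B>/</B> tag falls in the tag-free text."""
    pos = 0
    previous_end = 0
    for m in re.finditer(r"</?B>", highlighted):
        # Only the text between tags advances the position.
        pos += m.start() - previous_end
        previous_end = m.end()
        markup[pos] = colour if m.group() == "<B>" else RESET


def apply_markup(text: str, markup: dict) -> str:
    """Insert codes from the end backwards so earlier offsets stay valid."""
    for pos in sorted(markup, reverse=True):
        text = text[:pos] + markup[pos] + text[pos:]
    return text


text = "the cat sat on the mat"
markup = {}
collect_markup("the <B>cat</B> sat on the mat", "\033[33m", markup)  # yellow
collect_markup("the cat sat on the <B>mat</B>", "\033[32m", markup)  # green
coloured = apply_markup(text, markup)
print(coloured)
```

One limitation of keying the map by position, in this sketch as in the Groovy version: if one term ends exactly where another begins, the two entries share a key and one overwrites the other.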

Find all available values for a field in lucene .net

If I have a field x, that can contain a value of y, or z etc, is there a way I can query so that I can return only the values that have been indexed?
Example
x available settable values = test1, test2, test3, test4
Item 1 : Field x = test1
Item 2 : Field x = test2
Item 3 : Field x = test4
Item 4 : Field x = test1
Performing required query would return a list of:
test1, test2, test4
I've implemented this before as an extension method:
public static class ReaderExtentions
{
public static IEnumerable<string> UniqueTermsFromField(
this IndexReader reader, string field)
{
var termEnum = reader.Terms(new Term(field));
do
{
var currentTerm = termEnum.Term();
if (currentTerm.Field() != field)
yield break;
yield return currentTerm.Text();
} while (termEnum.Next());
}
}
You can use it very easily like this:
var allPossibleTermsForField = reader.UniqueTermsFromField("FieldName");
That will return you what you want.
EDIT: I was skipping the first term above, due to some absent-mindedness. I've updated the code accordingly to work properly.
TermEnum te = indexReader.Terms(new Term("fieldx"));
do
{
Term t = te.Term();
if (t==null || t.Field() != "fieldx") break;
Console.WriteLine(t.Text());
} while (te.Next());
You can use facets to return the first N values of a field if the field is indexed as a string or is indexed using KeywordTokenizer and no filters. This means that the field is not tokenized but just saved as it is.
Just set the following properties on a query:
facet=true
facet.field=fieldname
facet.limit=N //the number of values you want to retrieve
I think a WildcardQuery searching on field 'x' and value of '*' would do the trick.
I once used Lucene 2.9.2 and there I used the approach with the FieldCache as described in the book "Lucene in Action" by Manning:
String[] fieldValues = FieldCache.DEFAULT.getStrings(indexReader, fieldname);
The array fieldValues contains all values in the index for field fieldname (Example: ["NY", "NY", "NY", "SF"]), so it is up to you now how to process the array. Usually you create a HashMap<String,Integer> that sums up the occurrences of each possible value, in this case NY=3, SF=1.
Maybe this helps. It is quite slow and memory consuming for very large indexes (1.000.000 documents in index) but it works.
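The tallying step described above is independent of Lucene. As a minimal sketch in Python (rather than Java's HashMap), assuming the values have already been pulled out of the index into a plain list:

```python
from collections import Counter

# Stand-in for the array returned from the field cache (sample data from above).
field_values = ["NY", "NY", "NY", "SF"]

# Count how often each value occurs.
counts = Counter(field_values)

# The distinct values are simply the keys.
distinct_values = sorted(counts)

print(dict(counts))       # {'NY': 3, 'SF': 1}
print(distinct_values)    # ['NY', 'SF']
```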

Verify string does NOT contain a value other than known values [VB.NET]

I am trying to verify that a string contains nothing but known values. In this case, I need to make sure it contains only "Shift", "Control", or "Alt", but not necessarily all of those. For example, these should be true: "Shift + P", "Shift + Control + H", "Alt + U"; but these should not: "Other + P", "Shift + Fake + Y", "Unknown + Shift + E" etc.
This is the code I tried to use:
If Not shortcut.Contains("Shift") Or Not shortcut.Contains("Control") Or Not shortcut.Contains("Alt") Then
MessageBox.Show("Invalid")
End If
I'm having difficulty wrapping my head around the needed logic to do this. I'm assuming there's a logic operator that can do this?
I believe you should not use strings for this purpose. You need a data type that can represent a hotkey combination, which is comprised of a "normal" key and a set of modifiers (Alt, Control, Shift), each of which can either be on or off. The on/off modifiers can be represented by an enum with flags, and the "normal" key can be represented by a separate enum. Both of the enums can be contained within a class.
The System.Windows.Forms.Keys enumeration can be used as both enums. You can store two numeric values (one for the modifiers, one for the "normal" key) - the underlying enum values - and they will represent the combination. No need to store strings.
If you do use strings for this purpose, you need to define your constraints better. Your rules do not specify how "Shift + Other" is invalid, but "Shift + F" is. A way to go about this, anyway, is to separate the string by " + " (assuming this is always the separator) and then compare each part to the list of valid values, which apparently contains "Shift", "Alt", "Control" and all single letters.
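The split-and-compare approach could be sketched like this. It is in Python rather than VB.NET, the function name is made up, and it assumes exactly the rules guessed above: the separator is always " + " and a valid part is either one of the three modifiers or a single letter. It does not check the order or the number of parts.

```python
import string

ALLOWED_MODIFIERS = {"Shift", "Control", "Alt"}


def is_valid_shortcut(shortcut: str, separator: str = " + ") -> bool:
    """True if every part is a known modifier or a single ASCII letter."""
    parts = shortcut.split(separator)
    return all(
        part in ALLOWED_MODIFIERS
        or (len(part) == 1 and part in string.ascii_letters)
        for part in parts
    )


for s in ["Shift + P", "Shift + Control + H", "Alt + U",
          "Other + P", "Shift + Fake + Y", "Unknown + Shift + E"]:
    print(s, "->", is_valid_shortcut(s))
```

The key point is that the test is "every part is allowed", which is different from the original condition that asked whether any one of the modifiers is missing.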
I think it would be easier to split the string into segments, iterate through the segments, and compare each one against the words that are allowed. This is in C#, but I guess you can figure it out.
C#:
string text = "Shift+Control+Alt+Other";
string[] textSegments = text.Split('+');
string[] allowedWords = { "Alt", "Shift", "Control" };
foreach (string t in textSegments)
{
bool temp = false;
foreach (string t2 in allowedWords)
{
if (t == t2)
{
temp = true;
break;
}
}
if (!temp)
{
MessageBox.Show("Wrong!");
}
else
{
MessageBox.Show("Right!");
}
}
The output will be 4 MessageBoxes displaying: "Right!" "Right!" "Right!" "Wrong!"