Android custom keyboard suggestions - optimization

I am building a custom keyboard for Android, one that at least supports autocomplete suggestions. To achieve this, I store every word the user types (excluding password fields) in a Room database table with a simple model: the word and its frequency. For showing suggestions, I use a Trie populated with words from this table. My query orders the table by word frequency and limits the results to 5K (I do not want to overpopulate the Trie; these 5K words can be considered the user's favourite words, the ones he uses often and needs suggestions for). Now my actual problem is the ORDER BY clause: this is a rapidly growing data set, and sorting, say, 0.1M words just to get 5K seems like overkill. How can I rework this approach to improve the efficiency of the entire suggestions logic?

If not already implemented, add an index on the frequency column: @ColumnInfo(index = true).
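With that index in place, the sort itself becomes an index walk rather than a full sort. As a minimal sketch (assuming the Word entity shown in the demonstration below), the DAO query feeding the Trie could be:

    @Query("SELECT * FROM word ORDER BY frequency DESC LIMIT 5000")
    abstract List<Word> getTopWords();

SQLite should be able to satisfy the ORDER BY by scanning the frequency index backwards, so it never has to sort the full 0.1M rows.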
Another option is to add a table that maintains the highest 5K. Support it with yet another table (the support table) that has one row, with columns for: the highest frequency (not strictly required), the lowest frequency in the current 5K, and a third column for the number of rows currently held. After adding or updating a word, you can then determine whether the new/updated word should be added to the 5K table (perhaps a fourth column for the primary key of the lowest-frequency row, to facilitate efficient deletion).
So, as implemented in the demonstration below:
if the number currently held is less than 5K, insert or update the 5K table and increment the number currently held in the support table;
otherwise, if the word's frequency is lower than the lowest, skip it;
otherwise, update the row if it already exists;
otherwise, delete the lowest, insert the replacement, and then update the lowest accordingly in the support table.
Note that the 5K table would probably only need to store the rowid as a pointer/reference/map to the core table.
rowid is a column that virtually all tables have in Room (virtual tables are an exception, as are tables with the WITHOUT ROWID attribute, but as far as I am aware Room does not facilitate WITHOUT ROWID tables).
The rowid can be up to twice as fast as other indexes. I would suggest using @PrimaryKey Long id = null; (Java) or @PrimaryKey var id: Long? = null (Kotlin) and NOT using @PrimaryKey(autoGenerate = true).
autoGenerate = true equates to SQLite's AUTOINCREMENT, about which the SQLite documentation says "The AUTOINCREMENT keyword imposes extra CPU, memory, disk space, and disk I/O overhead and should be avoided if not strictly needed. It is usually not needed."
See https://www.sqlite.org/rowidtable.html, and also https://sqlite.org/autoinc.html
Curiously/funnily, the support table mentioned above isn't that far away from what AUTOINCREMENT does under the covers:
with AUTOINCREMENT, a table (sqlite_sequence) with a row per AUTOINCREMENT table is used, storing the table name and the highest ever allocated rowid;
without AUTOINCREMENT, but with <column_name> INTEGER PRIMARY KEY and no value (or null) for the primary key column, SQLite generates a value that is 1 greater than max(rowid);
with AUTOINCREMENT/autoGenerate = true, the generated value is the greater of max(rowid) and the value stored for that table in the sqlite_sequence table (hence the overheads).
Of course those overheads will very likely be insignificant in comparison to sorting 0.1M rows.
Demonstration
The following is a demonstration, albeit just using a basic Word table as the source.
First, the two tables (@Entity annotated classes).
Word
@Entity(
        indices = {@Index(value = {"word"}, unique = true)}
)
class Word {
    @PrimaryKey
    Long wordId = null;
    @NonNull
    String word;
    @ColumnInfo(index = true)
    long frequency;

    Word() {}

    @Ignore
    Word(String word, long frequency) {
        this.word = word;
        this.frequency = frequency;
    }
}
WordSubset, aka the table holding the 5000 highest-occurring frequencies; it simply has a reference/map/link to the underlying/actual word :-
@Entity(
        foreignKeys = {
                @ForeignKey(
                        entity = Word.class,
                        parentColumns = {"wordId"},
                        childColumns = {"wordIdMap"},
                        onDelete = ForeignKey.CASCADE,
                        onUpdate = ForeignKey.CASCADE
                )
        }
)
class WordSubset {
    public static final long SUBSET_MAX_SIZE = 5000;
    @PrimaryKey
    long wordIdMap;

    WordSubset() {}

    @Ignore
    WordSubset(long wordIdMap) {
        this.wordIdMap = wordIdMap;
    }
}
Note the constant SUBSET_MAX_SIZE, hard-coded just the once, so a single simple change adjusts it (lowering it after rows have been added may cause issues).
WordSubsetSupport: this will be a single-row table that contains the highest and lowest frequencies (the highest is not really needed), the number of rows in the WordSubset table, and a reference/map to the word with the lowest frequency.
@Entity(
        foreignKeys = {
                @ForeignKey(
                        entity = Word.class,
                        parentColumns = {"wordId"},
                        childColumns = {"lowestWordIdMap"}
                )
        }
)
class WordSubsetSupport {
    @PrimaryKey
    Long wordSubsetSupportId = null;
    long highestFrequency;
    long lowestFrequency;
    long countOfRowsInSubsetTable;
    @ColumnInfo(index = true)
    long lowestWordIdMap;

    WordSubsetSupport() {}

    @Ignore
    WordSubsetSupport(long highestFrequency, long lowestFrequency, long countOfRowsInSubsetTable, long lowestWordIdMap) {
        this.highestFrequency = highestFrequency;
        this.lowestFrequency = lowestFrequency;
        this.countOfRowsInSubsetTable = countOfRowsInSubsetTable;
        this.lowestWordIdMap = lowestWordIdMap;
        this.wordSubsetSupportId = 1L;
    }
}
For access, an abstract class (rather than an interface, as in Java this allows methods/functions with a body; a Kotlin interface allows these too), CombinedDao :-
@Dao
abstract class CombinedDao {

    @Insert(onConflict = OnConflictStrategy.IGNORE)
    abstract long insert(Word word);

    @Insert(onConflict = OnConflictStrategy.IGNORE)
    abstract long insert(WordSubset wordSubset);

    @Insert(onConflict = OnConflictStrategy.IGNORE)
    abstract long insert(WordSubsetSupport wordSubsetSupport);

    @Query("SELECT * FROM wordsubsetsupport LIMIT 1")
    abstract WordSubsetSupport getWordSubsetSupport();

    @Query("SELECT count(*) FROM wordsubsetsupport")
    abstract long getWordSubsetSupportCount();

    @Query("SELECT countOfRowsInSubsetTable FROM wordsubsetsupport")
    abstract long getCountOfRowsInSubsetTable();

    @Query("UPDATE wordsubsetsupport SET countOfRowsInSubsetTable=:updatedCount")
    abstract void updateCountOfRowsInSubsetTable(long updatedCount);

    @Query("UPDATE wordsubsetsupport " +
            "SET countOfRowsInSubsetTable = (SELECT count(*) FROM wordsubset), " +
            "lowestWordIdMap = (SELECT word.wordId FROM wordsubset JOIN word ON wordsubset.wordIdMap = word.wordId ORDER BY frequency ASC LIMIT 1)," +
            "lowestFrequency = (SELECT coalesce(min(frequency),0) FROM wordsubset JOIN word ON wordsubset.wordIdMap = word.wordId)," +
            "highestFrequency = (SELECT coalesce(max(frequency),0) FROM wordsubset JOIN word ON wordsubset.wordIdMap = word.wordId)")
    abstract void autoUpdateWordSupportTable();

    @Query("DELETE FROM wordsubset WHERE wordIdMap = (SELECT wordsubset.wordIdMap FROM wordsubset JOIN word ON wordsubset.wordIdMap = word.wordId ORDER BY frequency ASC LIMIT 1)")
    abstract void deleteLowestFrequency();

    @Transaction
    int addWord(Word word) {
        /* Try to add the word, setting the wordId value according to the result.
           The result will be the generated wordId (1 or greater), or -1 if the word already exists. */
        word.wordId = insert(word);
        /* If the word was added and not rejected as a duplicate, then it may need to be added to the WordSubset table */
        if (word.wordId > 0) {
            /* Are there any rows in the support table? If not, then add the very first entry/row */
            if (getWordSubsetSupportCount() < 1) {
                /* Need to add the word to the subset */
                insert(new WordSubset(word.wordId));
                /* Can now add the first (and only) row to the support table */
                insert(new WordSubsetSupport(word.frequency, word.frequency, 1, word.wordId));
                autoUpdateWordSupportTable();
                return 1;
            }
            /* If there are fewer than the maximum number of rows in the subset table then
               1) insert the new subset row, and
               2) update the support table accordingly */
            if (getCountOfRowsInSubsetTable() < WordSubset.SUBSET_MAX_SIZE) {
                insert(new WordSubset(word.wordId));
                autoUpdateWordSupportTable();
                return 2;
            }
            /* Last case: the subset table is at the maximum number of rows and
               the frequency of the added word is greater than the lowest frequency in the subset, so
               1) the row with the lowest frequency is removed from the subset table,
               2) the added word is added to the subset, and
               3) the support table is updated accordingly */
            if (getCountOfRowsInSubsetTable() >= WordSubset.SUBSET_MAX_SIZE) {
                WordSubsetSupport currentWordSubsetSupport = getWordSubsetSupport();
                if (word.frequency > currentWordSubsetSupport.lowestFrequency) {
                    deleteLowestFrequency();
                    insert(new WordSubset(word.wordId));
                    autoUpdateWordSupportTable();
                    return 3;
                }
            }
            return 4; /* indicates the word was added but does not qualify for addition to the subset */
        }
        return -1;
    }
}
The addWord method/function is the only method that need be used, as it automatically maintains the WordSubset and WordSubsetSupport tables.
TheDatabase is a pretty standard @Database annotated class, other than that it allows use of the main thread for the sake of convenience and brevity of the demo :-
@Database(entities = {Word.class, WordSubset.class, WordSubsetSupport.class}, version = TheDatabase.DATABASE_VERSION, exportSchema = false)
abstract class TheDatabase extends RoomDatabase {
    abstract CombinedDao getCombinedDao();

    private static volatile TheDatabase instance = null;

    public static TheDatabase getInstance(Context context) {
        if (instance == null) {
            instance = Room.databaseBuilder(context, TheDatabase.class, DATABASE_NAME)
                    .addCallback(cb)
                    .allowMainThreadQueries()
                    .build();
        }
        return instance;
    }

    private static Callback cb = new Callback() {
        @Override
        public void onCreate(@NonNull SupportSQLiteDatabase db) {
            super.onCreate(db);
        }

        @Override
        public void onOpen(@NonNull SupportSQLiteDatabase db) {
            super.onOpen(db);
        }
    };

    public static final String DATABASE_NAME = "the_database.db";
    public static final int DATABASE_VERSION = 1;
}
Finally, activity code that randomly generates and adds 10,000 words (or thereabouts, as some could be duplicate words), each word having a frequency that is also randomly generated (0 to 9999) :-
public class MainActivity extends AppCompatActivity {

    TheDatabase db;
    CombinedDao dao;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);
        db = TheDatabase.getInstance(this);
        dao = db.getCombinedDao();
        for (int i = 0; i < 10000; i++) {
            Word currentWord = generateRandomWord();
            Log.d("ADDINGWORD", "Adding word " + currentWord.word + " frequency is " + currentWord.frequency);
            dao.addWord(currentWord); // add the word that was generated and logged above
        }
    }

    public static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";

    /* abs is a static import of java.lang.Math.abs */
    private Word generateRandomWord() {
        Random r = new Random();
        int wordLength = (abs(r.nextInt()) % 24) + 1;
        int frequency = abs(r.nextInt()) % 10000;
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < wordLength; i++) {
            int letter = abs(r.nextInt()) % (ALPHABET.length());
            sb.append(ALPHABET.substring(letter, letter + 1));
        }
        return new Word(sb.toString(), frequency);
    }
}
Obviously the results will differ per run; also, the demo is only really designed to be run once (although it could be run more).
After running, inspecting the database via App Inspection (screenshots omitted here) showed, for the support table in this instance:
countOfRowsInSubsetTable is 5000, so the subset table has been filled to its capacity/limit;
the highest frequency encountered was 9999 (as could well be expected);
the lowest frequency in the subset is 4690, and that is for the word whose wordId is 7412.
The subset table on its own means little, as it just contains a map to the actual word, so it is more informative to use a query to look at what it contains. Such a query confirmed that the word whose wordId is 7412 is the one with the lowest frequency of 4690 (as expected according to the support table), with the last page of results matching the support table's values.
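For reference, a query along those lines (my addition for inspection purposes, consistent with the entities above; java.util.List would need to be imported in the DAO) could be:

    @Query("SELECT word.* FROM wordsubset JOIN word ON wordsubset.wordIdMap = word.wordId ORDER BY frequency ASC")
    abstract List<Word> getSubsetWords();

The first row returned should match the support table's lowestFrequency and lowestWordIdMap values.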

Related

Can't remove items from Kotlin HashMap after the items have been modified [duplicate]

Is it bad practice to use mutable objects as HashMap keys? What happens when you try to retrieve a value from a HashMap using a key that has been modified enough to change its hash code?
For example, given
class Key
{
    int a; // mutable field
    int b; // mutable field

    Key(int a, int b) { this.a = a; this.b = b; }

    public int hashCode()
    {
        return foo(a, b);
    }
    // setters setA and setB omitted for brevity
}
with code
HashMap<Key, Value> map = new HashMap<Key, Value>();
Key key1 = new Key(0, 0);
map.put(key1, value1); // value1 is an instance of Value
key1.setA(5);
key1.setB(10);
What happens if we now call map.get(key1)? Is this safe or advisable? Or is the behavior dependent on the language?
It has been noted by many well-respected developers, such as Brian Goetz and Josh Bloch, that:
If an object’s hashCode() value can change based on its state, then we
must be careful when using such objects as keys in hash-based
collections to ensure that we don’t allow their state to change when
they are being used as hash keys. All hash-based collections assume
that an object’s hash value does not change while it is in use as a
key in the collection. If a key’s hash code were to change while it
was in a collection, some unpredictable and confusing consequences
could follow. This is usually not a problem in practice — it is not
common practice to use a mutable object like a List as a key in a
HashMap.
This is not safe or advisable. The value mapped to by key1 can never be retrieved. When doing a retrieval, most hash maps will do something like

Object get(Object key) {
    int hash = key.hashCode();
    // simplified, ignores hash collisions
    Entry entry = getEntry(hash);
    if (entry != null && entry.getKey().equals(key)) {
        return entry.getValue();
    }
    return null;
}

In this example, key1.hashCode() now points to the wrong bucket of the hash table, and you will not be able to retrieve value1 with key1.
If you had done something like,
Key key1 = new Key(0, 0);
map.put(key1, value1);
key1.setA(5);
Key key2 = new Key(0, 0);
map.get(key2);
This will also not retrieve value1, as key1 and key2 are no longer equal, so this check
if(entry != null && entry.getKey().equals(key))
will fail.
Hash maps use hash code and equality comparisons to identify a certain key-value pair with a given key. If the hash map keeps the key as a reference to the mutable object, it would work in the cases where the same instance is used to retrieve the value. Consider, however, the following case:

T keyOne = ...;
T keyTwo = ...;
// At this point keyOne and keyTwo are different instances and
// keyOne.equals(keyTwo) is true.
HashMap myMap = new HashMap();
myMap.put(keyOne, "Hello");
String s1 = (String) myMap.get(keyOne); // s1 is "Hello"
String s2 = (String) myMap.get(keyTwo); // s2 is "Hello"
// because keyOne equals keyTwo
mutate(keyOne);
s1 = (String) myMap.get(keyOne); // found only if the mutation did not change the hash
s2 = (String) myMap.get(keyTwo); // not found
The above holds where the key is stored as a reference, which is usually the case in Java (though note that java.util.HashMap also caches the hash computed at insertion time, so retrieval even via the same mutated instance typically fails once the hash changes). In .NET, for instance, if the key is a value type (always passed by value), the result will be different:
T keyOne = ...;
T keyTwo = ...;
// At this point keyOne and keyTwo are different instances
// and keyOne.equals(keyTwo) is true.
Dictionary myMap = new Dictionary();
myMap.Add(keyOne, "Hello");
String s1 = (String) myMap[keyOne]; // s1 is "Hello"
String s2 = (String) myMap[keyTwo]; // s2 is "Hello"
// because keyOne equals keyTwo
mutate(keyOne);
s1 = myMap[keyOne]; // not found
s2 = myMap[keyTwo]; // returns "Hello"
Other technologies might have other, different behaviors. However, almost all of them would come to a situation where the result of using mutable keys is not deterministic, which is a very, very bad situation in an application - hard to debug and even harder to understand.
If a key's hash code changes after the key-value pair (Entry) is stored in a HashMap, the map will not be able to retrieve the Entry.
A key's hash code can change if the key object is mutable. Mutable keys in a HashMap can result in data loss.
This will not work. You are changing the key value, so you are basically throwing it away. It's like creating a real-life key and lock, and then changing the key and trying to put it back in the lock.
As others explained, it is dangerous.
A way to avoid that is to have a constant field giving the hash explicitly in your mutable objects (so you would hash on their "identity", not their "state"). You might even initialize that hash field more or less randomly.
Another trick would be to use the address, e.g. (intptr_t) reinterpret_cast<void*>(this) in C++, as a basis for the hash.
In all cases, you have to give up hashing the changing state of the object.
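In Java terms, a minimal sketch of that idea (my illustration, not the answerer's code) could precompute the hash once at construction, so later mutation cannot affect it:

class StableKey {
    private final int hash;   // fixed at construction: "identity", not "state"
    private int state;        // mutable, deliberately excluded from hashing

    StableKey(int initialState) {
        this.state = initialState;
        this.hash = System.identityHashCode(this); // or a random/counter value
    }

    void setState(int s) { this.state = s; }

    @Override
    public int hashCode() { return hash; }
    // equals() left as reference equality (inherited), consistent with hashCode()
}

Note the trade-off: two distinct instances with equal state no longer look up the same entry, which is exactly what giving up state-based hashing means.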
There are two very different issues that can arise with a mutable key depending on your expectation of behavior.
First Problem (probably the most trivial -- but hell, it gave me problems that I didn't think about!):
You are attempting to place key-value pairs into a map by updating and modifying the same key object. You might do something like Map<Integer, String> and simply say:

int key = 0;
while (hasMoreStrings()) { // pseudocode loop
    map.put(key++, nextString());
}
I'm reusing the "object" key to create a map. This works fine in Java because of autoboxing, where each new value of key gets autoboxed to a new Integer object. What would not work is if I created my own (mutable) integer object:
class MyInteger {
    int value;

    MyInteger(int value) { this.value = value; }

    MyInteger plusOne() { // increments and returns this, so it can be used as below
        value++;
        return this;
    }
}
Then tried the same approach:

MyInteger key = new MyInteger(0);
while (hasMoreStrings()) { // pseudocode loop
    map.put(key.plusOne(), nextString());
}
My expectation is that, for instance, I map 0 -> "a" and 1 -> "b". In the first example, if I change int key = 0, the map will (correctly) give me "a". For simplicity, let's assume MyInteger always returns the same hashCode() (if you can somehow manage to create unique hashCode values for all possible states of an object, this will not be an issue, and you deserve an award).
In this case, I call 0 -> "a", so now the map holds my key and maps it to "a". I then modify key = 1 and try to put 1 -> "b". We have a problem! The hashCode() is the same, and the only key in the HashMap is my MyInteger key object, which has just been modified to equal 1, so it overwrites that key's value: now, instead of a map with 0 -> "a" and 1 -> "b", I have 1 -> "b" only! Even worse, if I change back to key = 0, the hashCode points to 1 -> "b", but since the HashMap's only key is my key object, it satisfies the equality check and returns "b", not "a" as expected.
If, like me, you fall prey to this type of issue, it's incredibly difficult to diagnose. Why? Because if you have a decent hashCode() function it will generate (mostly) unique values. The hash value will largely take care of the inequality problem when structuring the map, but if you have enough values, eventually you'll get a collision on the hash value, and then you get unexpected and largely inexplicable results. The resultant behavior is that it works for small runs but fails for larger ones.
Advice:
To find this type of issue, modify the hashCode() method, even trivially (i.e. return 0 -- obviously, when doing this, keep in mind that the hash values should be the same for two equal objects*), and see if you get the same results -- because you should, and if you don't, there's likely a semantic error in your implementation that's using the hash table.
*There should be no danger (if there is, you have a semantic problem) in always returning 0 from hashCode() (although it would defeat the purpose of a hash table). But that's sort of the point: hashCode is a "quick and easy" equality measure that's not exact. Two very different objects can have the same hashCode() yet not be equal. On the other hand, two equal objects must always have the same hashCode() value.
p.s. In Java, from my understanding, if you do such a terrible thing (as in, have many hashCode() collisions), HashMap will start using a red-black tree for the offending bucket instead of a linked list. So where you expect O(1) lookup, you'll get O(log(n)) -- which is better than the linked list, which would give O(n).
Second Problem:
This is the one that most others seem to be focusing on, so I'll try to be brief. In this use case, I try to map a key-value pair and then I do some work on the key and then want to come back and get my value.
Expectation: key -> value is mapped, I then modify key and try to get(key). I expect that will give me value.
It seems kind of obvious to me that this wouldn't work, but I'm not above having tried to use things like Collections as keys before (and quite quickly realizing it doesn't work). It doesn't work because it's quite likely that the hash value of key has changed, so you won't even be looking in the correct bucket.
This is why it's very inadvisable to use collections as keys. I would assume, if you were doing this, you're trying to establish a many-to-one relationship. So I have a class (as in teaching) and I want two groups to do two different projects. What I want is: given a group, what is their project? Simple, I divide the class in two, and I have group1 -> project1 and group2 -> project2. But wait! A new student arrives, so I place them in group1. The problem is that group1 has now been modified and its hash value has likely changed; therefore trying to do get(group1) is likely to fail because it will look in the wrong or a non-existent bucket of the HashMap.
The obvious solution to the above is to chain things: instead of using the groups as keys, give them labels (that don't change) that point to the group and therefore the project: g1 -> group1 and g1 -> project1, etc., as sketched below.
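A minimal sketch of that label indirection (my illustration, with hypothetical Group/Project types), keyed by immutable strings so that mutating a group never touches a key:

Map<String, Group> groups = new HashMap<>();
Map<String, Project> projects = new HashMap<>();
groups.put("g1", group1);
projects.put("g1", project1);
group1.add(newStudent);            // mutating the group is now harmless
Project p = projects.get("g1");    // still found: the String key never changed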
p.s.
Please make sure to define a hashCode() and equals(...) method for any object you expect to use as a key (Eclipse and, I'm assuming, most IDEs can do this for you).
Code Example:
Here is a class which exhibits the two different "problem" behaviors. In this case, I attempt to map 0 -> "a", 1 -> "b", and 2 -> "c" (in each case). In the first problem, I do that by modifying the same object, in the second problem, I use unique objects, and in the second problem "fixed" I clone those unique objects. After that I take one of the "unique" keys (k0) and modify it to attempt to access the map. I expect this will give me a, b, c and null when the key is 3.
However, what happens is the following:
map.get(0) map1: 0 -> null, map2: 0 -> a, map3: 0 -> a
map.get(1) map1: 1 -> null, map2: 1 -> b, map3: 1 -> b
map.get(2) map1: 2 -> c, map2: 2 -> a, map3: 2 -> c
map.get(3) map1: 3 -> null, map2: 3 -> null, map3: 3 -> null
The first map ("first problem") fails because it only holds a single key, which was last updated and placed to equal 2, hence why it correctly returns "c" when k0 = 2 but returns null for the other two (the single key doesn't equal 0 or 1).
The second map fails twice: most obviously, it returns "b" when I ask for k0 (because it's been modified -- that's the "second problem", which seems kind of obvious when you do something like this). It fails a second time when it returns "a" after modifying k0 = 2 (where I would expect "c"). This is more due to the "first problem": there's a hash code collision and the tiebreaker is an equality check -- but the map holds k0, which it (apparently for me -- it could theoretically be different for someone else) checked first and thus returned the first value, "a", even though had it kept checking, "c" would also have been a match.
Finally, the third map works perfectly because I'm enforcing that the map holds unique keys no matter what else I do (by cloning the object during insertion).
I want to make clear that I agree, cloning is not a solution! I simply added that as an example of why a map needs unique keys and how enforcing unique keys "fixes" the issue.
import java.util.HashMap;
import java.util.stream.IntStream;

public class HashMapProblems {
    private int value = 0;

    public HashMapProblems() {
        this(0);
    }

    public HashMapProblems(final int value) {
        super();
        this.value = value;
    }

    public void setValue(final int i) {
        this.value = i;
    }

    @Override
    public int hashCode() {
        return value % 2;
    }

    @Override
    public boolean equals(final Object o) {
        return o instanceof HashMapProblems
                && value == ((HashMapProblems) o).value;
    }

    @Override
    public Object clone() {
        return new HashMapProblems(value);
    }

    public void reset() {
        this.value = 0;
    }

    public static void main(String[] args) {
        final HashMapProblems k0 = new HashMapProblems(0);
        final HashMapProblems k1 = new HashMapProblems(1);
        final HashMapProblems k2 = new HashMapProblems(2);
        final HashMapProblems k = new HashMapProblems();
        final HashMap<HashMapProblems, String> map1 = firstProblem(k);
        final HashMap<HashMapProblems, String> map2 = secondProblem(k0, k1, k2);
        final HashMap<HashMapProblems, String> map3 = secondProblemFixed(k0, k1, k2);
        for (int i = 0; i < 4; ++i) {
            k0.setValue(i);
            System.out.printf(
                    "map.get(%d) map1: %d -> %s, map2: %d -> %s, map3: %d -> %s",
                    i, i, map1.get(k0), i, map2.get(k0), i, map3.get(k0));
            System.out.println();
        }
    }

    private static HashMap<HashMapProblems, String> firstProblem(
            final HashMapProblems start) {
        start.reset();
        final HashMap<HashMapProblems, String> map = new HashMap<>();
        map.put(start, "a");
        start.setValue(1);
        map.put(start, "b");
        start.setValue(2);
        map.put(start, "c");
        return map;
    }

    private static HashMap<HashMapProblems, String> secondProblem(
            final HashMapProblems... keys) {
        final HashMap<HashMapProblems, String> map = new HashMap<>();
        IntStream.range(0, keys.length).forEach(
                index -> map.put(keys[index], "" + (char) ('a' + index)));
        return map;
    }

    private static HashMap<HashMapProblems, String> secondProblemFixed(
            final HashMapProblems... keys) {
        final HashMap<HashMapProblems, String> map = new HashMap<>();
        IntStream.range(0, keys.length)
                .forEach(index -> map.put((HashMapProblems) keys[index].clone(),
                        "" + (char) ('a' + index)));
        return map;
    }
}
Some Notes:
In the above it should be noted that map1 only holds two values because of the way I set up the hashCode() function to split odds and evens. k = 0 and k = 2 therefore have the same hashCode of 0. So when I modify k = 2 and attempt to map k -> "c", the mapping k -> "a" gets overwritten -- k -> "b" is still there because it exists in a different bucket.
Also, there are a lot of different ways to examine the maps in the above code, and I would encourage people who are curious to do things like print out the values of the map and then the key-to-value mappings (you may be surprised by the results you get). Play with changing the different "unique" keys (i.e. k0, k1, and k2), and try changing the single key k. You could also see how even secondProblemFixed isn't actually fixed, because you could still gain access to the keys (for instance via Map::keySet) and modify them.
I won't repeat what others have said. Yes, it's inadvisable. But in my opinion, it's not overly obvious where the documentation states this.
You can find it on the JavaDoc for the Map interface:
Note: great care must be exercised if mutable objects are used as map
keys. The behavior of a map is not specified if the value of an object
is changed in a manner that affects equals comparisons while the
object is a key in the map
The behaviour of a Map is not specified if the value of an object is changed in a manner that affects equals comparisons while the (mutable) object is a key. Even for a Set, using a mutable object as an element is not a good idea.
Let's see an example here:
import java.util.HashMap;
import java.util.Map;

public class MapKeyShouldntBeMutable {
    /**
     * @param args
     */
    public static void main(String[] args) {
        Map<Employee, Integer> map = new HashMap<Employee, Integer>();
        Employee e = new Employee();
        Employee e1 = new Employee();
        Employee e2 = new Employee();
        Employee e3 = new Employee();
        Employee e4 = new Employee();
        e.setName("one");
        e1.setName("one");
        e2.setName("three");
        e3.setName("four");
        e4.setName("five");
        map.put(e, 24);
        map.put(e1, 25);
        map.put(e2, 26);
        map.put(e3, 27);
        map.put(e4, 28);
        e2.setName("one");
        System.out.println(" is e equals e1 " + e.equals(e1));
        System.out.println(map);
        for (Employee s : map.keySet()) {
            System.out.println("key : " + s.getName() + ":value : " + map.get(s));
        }
    }
}

class Employee {
    String name;

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    @Override
    public boolean equals(Object o) {
        Employee e = (Employee) o;
        return this.name.equalsIgnoreCase(e.getName());
    }

    @Override
    public int hashCode() {
        int sum = 0;
        if (this.name != null) {
            for (int i = 0; i < this.name.toCharArray().length; i++) {
                sum = sum + (int) this.name.toCharArray()[i];
            }
        }
        return sum;
    }
}
Here we are trying to add the mutable object "Employee" to a map. It will work well if all the keys added are distinct. I have overridden equals and hashCode for the Employee class.
See: first I added "e" and then "e1". For both of them equals() is true and the hash code is the same, so the map sees the same key being added and replaces the old value with e1's value. Then we add e2, e3, e4, and we are fine so far.
But when we change the value of an already-added key, i.e. set "e2" to "one", it becomes a key equal to one added earlier. Now the map behaves weirdly. Ideally e2 should replace the existing equal key, i.e. e1, but now the map keeps both. And you will get this in the output:
is e equals e1 true
{Employee@1aa=28, Employee@1bc=27, Employee@142=25, Employee@142=26}
key : five:value : 28
key : four:value : 27
key : one:value : 25
key : one:value : 25
See, here both keys having "one" show the same value too. So it's unexpected. Now run the same programme again, replacing e2.setName("one") with e2.setName("diffnt"); now the output will be this:
is e equals e1 true
{Employee@1aa=28, Employee@1bc=27, Employee@142=25, Employee@27b=26}
key : five:value : 28
key : four:value : 27
key : one:value : 25
key : diffnt:value : null
So changing a mutable key after adding it to a map is not encouraged.
To make the answer compact: the root cause is that HashMap calculates an internal hash from the user's key object's hashCode only once, at insertion, and stores it internally for its own needs.
All other data-navigation operations inside the map are done using this pre-calculated internal hash.
So if you change the key object's hashCode (mutate it), the entry will still be stored nicely inside the map under the old hash (you can even observe the key via HashMap.keySet() and see its altered hashCode).
But the HashMap's internal hash will of course not be recalculated; it remains the old stored one, and the map won't be able to locate your data by the mutated key object's new hashCode (e.g. via HashMap.get() or HashMap.containsKey()).
Your key-value pairs are still inside the map, but to get them back you would need the hash code value your key had when you put the data into the map.
Notice that you will also be unable to retrieve data using the mutated key object taken straight from HashMap.keySet().
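A tiny self-contained demonstration of that cached-hash behaviour (my sketch, not from any of the answers above):

import java.util.HashMap;
import java.util.Map;

class CachedHashDemo {
    static class MutableKey {
        int v;
        MutableKey(int v) { this.v = v; }
        @Override public int hashCode() { return v; }
        @Override public boolean equals(Object o) {
            return o instanceof MutableKey && v == ((MutableKey) o).v;
        }
    }

    public static void main(String[] args) {
        Map<MutableKey, String> map = new HashMap<>();
        MutableKey k = new MutableKey(1);
        map.put(k, "payload");
        k.v = 2;                                // mutate: hashCode changes from 1 to 2
        System.out.println(map.containsKey(k)); // false: lookup hashes to the wrong bucket
        System.out.println(map.size());         // 1: the entry is still in the map, orphaned
    }
}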

Lucene: Iterate all entries

I have a Lucene index which I would like to iterate (for one-time evaluation at the current stage in development).
I have 4 documents, each with a few hundred thousand up to a million entries, which I want to iterate to count the number of words for each entry (~2-10) and calculate the frequency distribution.
What I am doing at the moment is this:
for (int i = 0; i < reader.maxDoc(); i++) {
    if (reader.isDeleted(i))
        continue;
    Document doc = reader.document(i);
    Field text = doc.getField("myDocName#1");
    String content = text.stringValue();
    int wordLen = countNumberOfWords(content);
    // store
}
So far, it is iterating something. The debugger confirms that it is at least operating on the terms stored in the document, but for some reason it only processes a small part of the stored terms. I wonder what I am doing wrong? I simply want to iterate over all documents and everything that is stored in them.
Firstly, you need to ensure you index with term vectors enabled:
doc.add(new Field(TITLE, page.getTitle(), Field.Store.YES, Field.Index.ANALYZED, TermVector.WITH_POSITIONS_OFFSETS));
Then you can use IndexReader.getTermFreqVector to count terms
TopDocs res = indexSearcher.search(YOUR_QUERY, null, 1000);
// iterate over documents in res, omitted for brevity
reader.getTermFreqVector(res.scoreDocs[i].doc, YOUR_FIELD, new TermVectorMapper() {
    public void map(String termval, int freq, TermVectorOffsetInfo[] offsets, int[] positions) {
        // increment frequency count of termval by freq
        freqs.increment(termval, freq);
    }
    public void setExpectations(String arg0, int arg1, boolean arg2, boolean arg3) {}
});
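Note that search(..., 1000) only visits the top 1000 hits, which may be why only part of the index appears to be processed. If the goal is every document, the question's own maxDoc() loop can be combined with term vectors instead; a sketch along those lines (Lucene 3.x API assumed, using TermFreqVector rather than the mapper callback):

for (int i = 0; i < reader.maxDoc(); i++) {
    if (reader.isDeleted(i))
        continue;
    TermFreqVector tfv = reader.getTermFreqVector(i, "myDocName#1");
    if (tfv == null)
        continue; // field was not indexed with term vectors for this doc
    String[] terms = tfv.getTerms();
    int[] freqs = tfv.getTermFrequencies();
    for (int t = 0; t < terms.length; t++) {
        // tally terms[t] with freqs[t] here
    }
}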

generating 9 digit ids without database sequence

I'd like to create 9-digit numeric ids that are unique across machines. I'm currently using a database sequence for this, but am wondering if it could be done without one. The sequences will be used for X12 EDI transactions, so they don't have to be unique forever. Maybe even only unique for 24 hours.
My only idea:
Each server has a 2-digit server identifier.
Each server maintains a file that essentially keeps track of a local sequence.
id = <2 digit server identifier> + <7 digit sequence which wraps>
My biggest problem with this is what to do if the hard drive fails. I wouldn't know where the sequence left off.
All of my other ideas essentially end up re-creating a centralized database sequence.
Any thoughts?
The following:
{XX}{dd}{HHmm}{N}
where {XX} is the machine number, {dd} is the day of the month, {HHmm} the current time (24hr), and {N} a sequential number.
An HD crash will take more than a minute, so starting {N} at 0 again is not a problem.
You can also replace {dd} with {ss} for seconds, depending on requirements: uniqueness period vs. requests per minute.
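A minimal sketch of composing such an ID in Java (my illustration of the scheme, assuming machine numbers 0-99 and a single-digit {N} so the total stays at 9 digits):

import java.text.SimpleDateFormat;
import java.util.Date;

class TimeBasedId {
    private static int n = 0; // per-process sequential number, wraps 0-9

    static synchronized String next(int machineNo) {
        String ddHHmm = new SimpleDateFormat("ddHHmm").format(new Date());
        String id = String.format("%02d%s%d", machineNo, ddHHmm, n);
        n = (n + 1) % 10;
        return id; // e.g. machine 7, 23rd at 14:05, seq 3 -> "072314053"
    }
}

With only one digit for {N} this caps you at 10 IDs per machine per minute; widening {N} and dropping {dd} (per the {ss} variant) trades the uniqueness window for throughput.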
If the HD fails you can just set a new and unused 2-digit server identifier and be sure that the number is unique (for 24 hours at least).
How about generating GUIDs (ensures uniqueness) and then using some sort of hash function to turn the GUID into a 9-digit number?
Just off the top of my head...
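As a rough sketch of that idea in Java (my illustration; note that hashing a 128-bit GUID down to 9 digits gives up the GUID's uniqueness guarantee, so collisions become possible):

import java.util.UUID;

class GuidDerivedId {
    static String next() {
        long bits = UUID.randomUUID().getLeastSignificantBits();
        long nineDigits = Math.floorMod(bits, 1_000_000_000L); // 0..999999999
        return String.format("%09d", nineDigits);
    }
}

By the birthday bound, with a billion possible values a collision becomes likely after a few tens of thousands of IDs, so this only fits if duplicates are acceptable or checked for.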
Use a variation on:
md5(uniqid(rand(), true));
Just a thought.
In my recent project I also came across this requirement: generate an N-digit-long sequence number without any database.
This is actually a good interview question, because there are considerations on performance and software crash recovery. Further reading if interested.
The following code has these features:
Prefix each sequence with a prefix.
Sequence caching, like an Oracle sequence.
Most importantly, there is recovery logic to resume the sequence after a software crash.
Complete implementation attached:
import java.util.concurrent.atomic.AtomicLong;
import org.apache.commons.lang.StringUtils;
/**
* This is a customized Sequence Generator which simulates Oracle DB Sequence Generator. However the master sequence
* is stored locally in the file as there is no access to Oracle database. The output format is "prefix" + number.
* <p>
* <u><b>Sample output:</u></b><br>
* 1. FixLengthIDSequence(null,null,15,0,99,0) will generate 15, 16, ... 99, 00<br>
* 2. FixLengthIDSequence(null,"K",1,1,99,0) will generate K01, K02, ... K99, K01<br>
* 3. FixLengthIDSequence(null,"SG",100,2,9999,100) will generate SG0100, SG0101, ... SG8057, (in case server crashes, the new init value will start from last cache value+1) SG8101, ... SG9999, SG0002<br>
*/
public final class FixLengthIDSequence {
private static String FNAME;
private static String PREFIX;
private static AtomicLong SEQ_ID;
private static long MINVALUE;
private static long MAXVALUE;
private static long CACHEVALUE;
// some internal working values.
private int iMaxLength; // max numeric length excluding prefix, for left padding zeros.
private long lNextSnapshot; // to keep track of when to update sequence value to file.
private static boolean bInit = false; // to enable ShutdownHook routine after program has properly initialized
static {
// Inspiration from http://stackoverflow.com/questions/22416826/sequence-generator-in-java-for-unique-id#35697336.
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
if (bInit) { // Without this, saveToLocal may hit NullPointerException.
saveToLocal(SEQ_ID.longValue());
}
}));
}
/**
* This POJO style constructor should be initialized via Spring Singleton. Otherwise, rewrite this constructor into Singleton design pattern.
*
* @param sFilename This is the absolute file path to store the sequence number. To reset the sequence, this file needs to be removed manually.
* @param prefix The hard-coded identifier.
* @param initvalue
* @param minvalue
* @param maxvalue
* @param cache
* @throws Exception
*/
public FixLengthIDSequence(String sFilename, String prefix, long initvalue, long minvalue, long maxvalue, int cache) throws Exception {
bInit = false;
FNAME = (sFilename==null)?"C:\\Temp\\sequence.txt":sFilename;
PREFIX = (prefix==null)?"":prefix;
SEQ_ID = new AtomicLong(initvalue);
MINVALUE = minvalue;
MAXVALUE = maxvalue; iMaxLength = Long.toString(MAXVALUE).length();
CACHEVALUE = (cache <= 0)?1:cache; lNextSnapshot = roundUpNumberByMultipleValue(initvalue, cache); // Internal cache is always 1, equals no cache.
// If sequence file exists and valid, restore the saved sequence.
java.io.File f = new java.io.File(FNAME);
if (f.exists()) {
String[] saSavedSequence = loadToString().split(",");
if (saSavedSequence.length != 6) {
throw new Exception("Local Sequence file is not valid");
}
PREFIX = saSavedSequence[0];
//SEQ_ID = new AtomicLong(Long.parseLong(saSavedSequence[1])); // savedInitValue
MINVALUE = Long.parseLong(saSavedSequence[2]);
MAXVALUE = Long.parseLong(saSavedSequence[3]); iMaxLength = Long.toString(MAXVALUE).length();
CACHEVALUE = Long.parseLong(saSavedSequence[4]);
lNextSnapshot = Long.parseLong(saSavedSequence[5]);
// For sequence number recovery
// The rule to determine to continue using SEQ_ID or lNextSnapshot as subsequent sequence number:
// If savedInitValue = savedSnapshot, it was saved by ShutdownHook -> use SEQ_ID.
// Else if saveInitValue < savedSnapshot, it was saved by periodic Snapshot -> use lNextSnapshot+1.
if (saSavedSequence[1].equals(saSavedSequence[5])) {
long previousSEQ = Long.parseLong(saSavedSequence[1]);
SEQ_ID = new AtomicLong(previousSEQ);
lNextSnapshot = roundUpNumberByMultipleValue(previousSEQ,CACHEVALUE);
} else {
SEQ_ID = new AtomicLong(lNextSnapshot+1); // SEQ_ID starts fresh from lNextSnapshot+1.
lNextSnapshot = roundUpNumberByMultipleValue(SEQ_ID.longValue(),CACHEVALUE);
}
}
// Catch invalid values.
if (minvalue < 0) {
throw new Exception("MINVALUE cannot be less than 0");
}
if (maxvalue < 0) {
throw new Exception("MAXVALUE cannot be less than 0");
}
if (minvalue >= maxvalue) {
throw new Exception("MINVALUE cannot be greater than MAXVALUE");
}
if (cache >= maxvalue) {
throw new Exception("CACHE value cannot be greater than MAXVALUE");
}
// Save the next Snapshot.
saveToLocal(lNextSnapshot);
bInit = true;
}
/**
* Equivalent to Oracle Sequence nextval.
* @return String because Next Value is usually left padded with zeros, e.g. "00001".
*/
public String nextVal() {
if (SEQ_ID.longValue() > MAXVALUE) {
SEQ_ID.set(MINVALUE);
lNextSnapshot = roundUpNumberByMultipleValue(MINVALUE,CACHEVALUE);
}
if (SEQ_ID.longValue() > lNextSnapshot) {
lNextSnapshot = roundUpNumberByMultipleValue(lNextSnapshot,CACHEVALUE);
saveToLocal(lNextSnapshot);
}
return PREFIX.concat(StringUtils.leftPad(Long.toString(SEQ_ID.getAndIncrement()),iMaxLength,"0"));
}
/**
* Store sequence value into the local file. This routine is called either by Snapshot or ShutdownHook routines.<br>
* If called by Snapshot, currentCount == Snapshot.<br>
* If called by ShutdownHook, currentCount == current SEQ_ID.
* @param currentCount - This value is inserted by either Snapshot or ShutdownHook routines.
*/
private static void saveToLocal (long currentCount) {
try (java.io.Writer w = new java.io.BufferedWriter(new java.io.OutputStreamWriter(new java.io.FileOutputStream(FNAME), "utf-8"))) {
w.write(PREFIX + "," + SEQ_ID.longValue() + "," + MINVALUE + "," + MAXVALUE + "," + CACHEVALUE + "," + currentCount);
w.flush();
} catch (Exception e) {
e.printStackTrace();
}
}
/**
* Load the sequence file content into String.
* @return
*/
private String loadToString() {
try {
return new String(java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(FNAME)));
} catch (Exception e) {
e.printStackTrace();
}
return "";
}
/**
* Utility method to round up num to next multiple value. This method is used to calculate the next cache value.
* <p>
* (Reference: http://stackoverflow.com/questions/18407634/rounding-up-to-the-nearest-hundred)
* <p>
* <u><b>Sample output:</b></u>
* <pre>
* System.out.println(roundUpNumberByMultipleValue(9,10)); = 10
* System.out.println(roundUpNumberByMultipleValue(10,10)); = 20
* System.out.println(roundUpNumberByMultipleValue(19,10)); = 20
* System.out.println(roundUpNumberByMultipleValue(100,10)); = 110
* System.out.println(roundUpNumberByMultipleValue(109,10)); = 110
* System.out.println(roundUpNumberByMultipleValue(110,10)); = 120
* System.out.println(roundUpNumberByMultipleValue(119,10)); = 120
* </pre>
*
* @param num Value must be greater than or equal to positive integer 1.
* @param multiple Value must be greater than or equal to positive integer 1.
* @return
*/
private long roundUpNumberByMultipleValue(long num, long multiple) {
if (num<=0) num=1;
if (multiple<=0) multiple=1;
if (num % multiple != 0) {
long division = (long) ((num / multiple) + 1);
return division * multiple;
} else {
return num + multiple;
}
}
/**
* Main method for testing purpose.
* @param args
*/
public static void main(String[] args) throws Exception {
//FixLengthIDSequence(Filename, prefix, initvalue, minvalue, maxvalue, cache)
FixLengthIDSequence seq = new FixLengthIDSequence(null,"H",50,1,999,10);
for (int i=0; i<12; i++) {
System.out.println(seq.nextVal());
Thread.sleep(1000);
//if (i==8) { System.exit(0); }
}
}
}
To test the code, let the sequence run normally. You can press Ctrl+C to simulate a server crash. The next sequence number will continue from NextSnapshot+1.
Could you use the first 9 digits of some other source of unique data, like:
a random number
system time
uptime
Having thought about it for two seconds, none of those are unique on their own, but you could use them as seed values for hash functions, as was suggested in another answer.

Generate combinations ordered by an attribute

I'm looking for a way to generate combinations of objects ordered by a single attribute. I don't think lexicographical order is what I'm looking for... I'll try to give an example. Let's say I have a list of objects A, B, C, D with the attribute values I want to order by being 3, 3, 2, 1. This gives the objects A3, B3, C2, D1. Now I want to generate combinations of 2 objects, but they need to be ordered in a descending way:
A3 B3
A3 C2
B3 C2
A3 D1
B3 D1
C2 D1
Generating all combinations and sorting them is not acceptable, because the real-world scenario involves large sets and millions of combinations (a set of 40, order of 8), and I need only combinations above a certain threshold.
Actually I need the count of combinations above a threshold, grouped by the sum of the given attribute, but I think that is far more difficult to do - so I'd settle for generating all combinations above a threshold and counting them, if that's possible at all.
EDIT - My original question wasn't very precise... I don't actually need these combinations ordered, just thought it would help to isolate combinations above a threshold. To be more precise, in the above example, giving a threshold of 5, I'm looking for an information that the given set produces 1 combination with a sum of 6 ( A3 B3 ) and 2 with a sum of 5 ( A3 C2, B3 C2). I don't actually need the combinations themselves.
I was looking into the subset-sum problem, but if I understood the given dynamic-programming solution correctly, it only tells you whether a given sum exists, not the count of subsets reaching each sum.
Thanks
Actually, I think you do want lexicographic order, but descending rather than ascending. In addition:
It's not clear to me from your description that A, B, ... D play any role in your answer (except possibly as the container for the values).
I think your question example is simply "For each integer at least 5, up to the maximum possible total of two values, how many distinct pairs from the set {3, 3, 2, 1} have sums of that integer?"
The interesting part is the early bailout, once no possible solution can be reached (remaining achievable sums are too small).
I'll post sample code later.
Here's the sample code I promised, with a few remarks following:
public class Combos {
    /* permanent state for instance */
    private int values[];
    private int length;

    /* transient state during single "count" computation */
    private int n;
    private int limit;
    private Tally<Integer> tally;
    private int best[][]; // used for early bail-out

    private void initializeForCount(int n, int limit) {
        this.n = n;
        this.limit = limit;
        best = new int[n + 1][length + 1];
        for (int i = 1; i <= n; ++i) {
            for (int j = 0; j <= length - i; ++j) {
                best[i][j] = values[j] + best[i - 1][j + 1];
            }
        }
    }

    private void countAt(int left, int start, int sum) {
        if (left == 0) {
            tally.inc(sum);
        } else {
            for (
                    int i = start;
                    i <= length - left
                            && limit <= sum + best[left][i]; // bail-out check
                    ++i
            ) {
                countAt(left - 1, i + 1, sum + values[i]);
            }
        }
    }

    public Tally<Integer> count(int n, int limit) {
        tally = new Tally<Integer>();
        if (n <= length) {
            initializeForCount(n, limit);
            countAt(n, 0, 0);
        }
        return tally;
    }

    public Combos(int[] values) {
        this.values = values;
        this.length = values.length;
    }
}
Preface remarks:
This uses a little helper class called Tally, that just isolates the tabulation (including initialization for never-before-seen keys). I'll put it at the end.
To keep this concise, I've taken some shortcuts that aren't good practice for "real" code:
This doesn't check for a null value array, etc.
I assume that the value array is already sorted into descending order, required for the early-bail-out technique. (Good production code would include the sorting.)
I put transient data into instance variables instead of passing them as arguments among the private methods that support count. That makes this class non-thread-safe.
Explanation:
An instance of Combos is created with the (descending ordered) array of integers to combine. The value array is set up once per instance, but multiple calls to count can be made with varying population sizes and limits.
The count method triggers a (mostly) standard recursive traversal of unique combinations of n integers from values. The limit argument gives the lower bound on sums of interest.
The countAt method examines combinations of integers from values. The left argument is how many integers remain to make up n integers in a sum, start is the position in values from which to search, and sum is the partial sum.
The early-bail-out mechanism is based on computing best, a two-dimensional array that specifies the "best" sum reachable from a given state. The value in best[n][p] is the largest sum of n values beginning in position p of the original values.
The recursion of countAt bottoms out when the correct population has been accumulated; this adds the current sum (of n values) to the tally. If countAt has not bottomed out, it sweeps the values from the starting position to increase the current partial sum, as long as:
enough positions remain in values to achieve the specified population, and
the best (largest) subtotal remaining is big enough to make the limit.
A sample run with your question's data:
int[] values = {3, 3, 2, 1};
Combos mine = new Combos(values);
Tally<Integer> tally = mine.count(2, 5);
for (int i = 5; i < 9; ++i) {
    int n = tally.get(i);
    if (0 < n) {
        System.out.println("found " + n + " sums of " + i);
    }
}
produces the results you specified:
found 2 sums of 5
found 1 sums of 6
Here's the Tally code:
public static class Tally<T> {
    private Map<T, Integer> tally = new HashMap<T, Integer>();

    public Tally() { /* nothing */ }

    public void inc(T key) {
        Integer value = tally.get(key);
        if (value == null) {
            value = Integer.valueOf(0);
        }
        tally.put(key, (value + 1));
    }

    public int get(T key) {
        Integer result = tally.get(key);
        return result == null ? 0 : result;
    }

    public Collection<T> keys() {
        return tally.keySet();
    }
}
I have written a class to handle common functions for working with the binomial coefficient, which is the type of problem that your problem falls under. It performs the following tasks:
Outputs all the K-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters. This method makes solving this type of problem quite trivial.
Converts the K-indexes to the proper index of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration. It does this by using a mathematical property inherent in Pascal's Triangle. My paper talks about this. I believe I am the first to discover and publish this technique, but I could be wrong.
Converts the index in a sorted binomial coefficient table to the corresponding K-indexes.
Uses Mark Dominus method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to perform the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with 2 cases and there are no known bugs.
To read about this class and download the code, see Tablizing The Binomial Coeffieicent.
Check out this question in stackoverflow: Algorithm to return all combinations
I also just used the Java code below to generate all permutations, but it could easily be used to generate unique combinations given an index.
public static <E> E[] permutation(E[] s, int num) { // s is the input elements array and num is the number which represents the permutation
    int factorial = 1;
    for (int i = 2; i < s.length; i++)
        factorial *= i; // calculates the factorial of (s.length - 1)
    if (num / s.length >= factorial) // optional: check the number is in the range [0, s.length! - 1]
        return null;
    for (int i = 0; i < s.length - 1; i++) { // go over the array
        int tempi = (num / factorial) % (s.length - i); // calculates the next cell from the cells left (the cells in the range [i, s.length - 1])
        E temp = s[i + tempi]; // temporarily saves the value of the cell needed to add to the permutation this time
        for (int j = i + tempi; j > i; j--) // shift all elements to "cover" the "missing" cell
            s[j] = s[j - 1];
        s[i] = temp; // put the chosen cell in the correct spot
        factorial /= (s.length - (i + 1)); // updates the factorial
    }
    return s;
}
I am extremely sorry (after all those clarifications in the comments) to say that I could not find an efficient solution to this problem. I tried for the past hour with no results.
The reason (I think) is that this problem is very similar to problems like the travelling salesman problem. Unless you try all the combinations, there is no way to know which attributes will add up to the threshold.
There seems to be no clever trick that can solve this class of problems.
Still, there are many optimizations you can make to the actual code.
Try sorting the data by the attribute. You may be able to avoid processing some values from the list when you find that a higher value cannot satisfy the threshold (so all lower values can be eliminated).
If you're using C#, there is a fairly good generics library here. Note, though, that the generation of some permutations is not in lexicographic order.
Here's a recursive approach to count the number of these subsets: We define a function count(minIndex,numElements,minSum) that returns the number of subsets of size numElements whose sum is at least minSum, containing elements with indices minIndex or greater.
As in the problem statement, we sort our elements in descending order, e.g. [3,3,2,1], and call the first index zero, and the total number of elements N. We assume all elements are nonnegative. To find all 2-subsets whose sum is at least 5, we call count(0,2,5).
Sample Code (Java):
int count(int minIndex, int numElements, int minSum)
{
    int total = 0;
    if (numElements == 1)
    {
        // just count the number of elements >= minSum
        for (int i = minIndex; i <= N - 1; i++)
            if (a[i] >= minSum) total++; else break;
    }
    else
    {
        if (minSum <= 0)
        {
            // any subset will do (n-choose-k of them)
            if (numElements <= (N - minIndex))
                total = nchoosek(N - minIndex, numElements);
        }
        else
        {
            // add element a[i] to the set, and then consider the count
            // for all elements to its right
            for (int i = minIndex; i <= (N - numElements); i++)
                total += count(i + 1, numElements - 1, minSum - a[i]);
        }
    }
    return total;
}
Btw, I've run the above with an array of 40 elements and size-8 subsets, and consistently got results back in less than a second.
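For completeness, here is a sketch of the scaffolding the snippet assumes (the names N, a, and nchoosek are taken from the snippet; the implementation below is my own guess, using the multiplicative formula so intermediate values stay exact):

int N;    // the number of elements
int[] a;  // the values, sorted in descending order

int nchoosek(int n, int k) {
    long result = 1;
    for (int i = 1; i <= k; i++) {
        result = result * (n - k + i) / i; // exact: a product of i consecutive integers is divisible by i!
    }
    return (int) result;
}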

SQL set-based range

How can I have SQL repeat some set-based operation an arbitrary number of times without looping? How can I have SQL perform an operation against a range of numbers? I'm basically looking for a way to do a set-based for loop.
I know I can just create a small table with integers in it, say from 1 to 1000 and then use it for range operations that are within that range.
For example, if I had that table I could make a select to find the sum of numbers 100-200 like this:
select sum(n) from numbers where n between 100 and 200
Any ideas? I'm kinda looking for something that works for T-SQL but any platform would be okay.
[Edit] I have my own solution for this using SQL CLR which works great for MS SQL 2005 or 2008. See below.
I think the very short answer to your question is to use WITH clauses to generate your own.
Unfortunately, the big names in databases don't have built-in queryable number-range pseudo-tables. Or, more generally, easy pure-SQL data generation features. Personally, I think this is a huge failing, because if they did it would be possible to move a lot of code that is currently locked up in procedural scripts (T-SQL, PL/SQL, etc.) into pure-SQL, which has a number of benefits to performance and code complexity.
So anyway, it sounds like what you need in a general sense is the ability to generate data on the fly.
Oracle and T-SQL both support a WITH clause that can be used to do this. They work a little differently in the different DBMS's, and MS calls them "common table expressions", but they are very similar in form. Using these with recursion, you can generate a sequence of numbers or text values fairly easily. Here is what it might look like...
In Oracle SQL:
WITH
digits AS -- Limit recursion by just using it for digits.
(SELECT
LEVEL - 1 AS num
FROM
DUAL
WHERE
LEVEL < 10
CONNECT BY
num = (PRIOR num) + 1),
numrange AS
(SELECT
ones.num
+ (tens.num * 10)
+ (hundreds.num * 100)
AS num
FROM
digits ones
CROSS JOIN
digits tens
CROSS JOIN
digits hundreds
WHERE
hundreds.num in (1, 2)) -- Use the WHERE clause to restrict each digit as needed.
SELECT
-- Some columns and operations
FROM
numrange
-- Join to other data if needed
This is admittedly quite verbose. Oracle's recursion functionality is limited. The syntax is clunky, it's not performant, and it is limited to 500 (I think) nested levels. This is why I chose to use recursion only for the first 10 digits, and then cross (cartesian) joins to combine them into actual numbers.
I haven't used SQL Server's Common Table Expressions myself, but since they allow self-reference, recursion is MUCH simpler than it is in Oracle. Whether performance is comparable, and what the nesting limits are, I don't know.
At any rate, recursion and the WITH clause are very useful tools in creating queries that require on-the-fly generated data sets. Then by querying this data set, doing operations on the values, you can get all sorts of different types of generated data. Aggregations, duplications, combinations, permutations, and so on. You can even use such generated data to aid in rolling up or drilling down into other data.
UPDATE: I just want to add that, once you start working with data in this way, it opens your mind to new ways of thinking about SQL. It's not just a scripting language. It's a fairly robust data-driven declarative language. Sometimes it's a pain to use because for years it has suffered a dearth of enhancements to aid in reducing the redundancy needed for complex operations. But nonetheless it is very powerful, and a fairly intuitive way to work with data sets as both the target and the driver of your algorithms.
I created a SQL CLR table-valued function that works great for this purpose.
SELECT n FROM dbo.Range(1, 11, 2) -- returns odd integers 1 to 11
SELECT n FROM dbo.RangeF(3.1, 3.5, 0.1) -- returns 3.1, 3.2, 3.3 and 3.4, but not 3.5, because of float imprecision
Here's the code:
using System;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Collections;

[assembly: CLSCompliant(true)]

namespace Range {
    public static partial class UserDefinedFunctions {
        [Microsoft.SqlServer.Server.SqlFunction(DataAccess = DataAccessKind.None, IsDeterministic = true, SystemDataAccess = SystemDataAccessKind.None, IsPrecise = true, FillRowMethodName = "FillRow", TableDefinition = "n bigint")]
        public static IEnumerable Range(SqlInt64 start, SqlInt64 end, SqlInt64 incr) {
            return new Ranger(start.Value, end.Value, incr.Value);
        }

        [Microsoft.SqlServer.Server.SqlFunction(DataAccess = DataAccessKind.None, IsDeterministic = true, SystemDataAccess = SystemDataAccessKind.None, IsPrecise = true, FillRowMethodName = "FillRowF", TableDefinition = "n float")]
        public static IEnumerable RangeF(SqlDouble start, SqlDouble end, SqlDouble incr) {
            return new RangerF(start.Value, end.Value, incr.Value);
        }

        public static void FillRow(object row, out SqlInt64 n) {
            n = new SqlInt64((long)row);
        }

        public static void FillRowF(object row, out SqlDouble n) {
            n = new SqlDouble((double)row);
        }
    }

    internal class Ranger : IEnumerable {
        Int64 _start, _end, _incr;
        public Ranger(Int64 start, Int64 end, Int64 incr) {
            _start = start; _end = end; _incr = incr;
        }
        public IEnumerator GetEnumerator() {
            return new RangerEnum(_start, _end, _incr);
        }
    }

    internal class RangerF : IEnumerable {
        double _start, _end, _incr;
        public RangerF(double start, double end, double incr) {
            _start = start; _end = end; _incr = incr;
        }
        public IEnumerator GetEnumerator() {
            return new RangerFEnum(_start, _end, _incr);
        }
    }

    internal class RangerEnum : IEnumerator {
        Int64 _cur, _start, _end, _incr;
        bool hasFetched = false;

        public RangerEnum(Int64 start, Int64 end, Int64 incr) {
            _start = _cur = start; _end = end; _incr = incr;
            if ((_start < _end ^ _incr > 0) || _incr == 0)
                throw new ArgumentException("Will never reach end!");
        }

        public long Current {
            get { hasFetched = true; return _cur; }
        }
        object IEnumerator.Current {
            get { hasFetched = true; return _cur; }
        }
        public bool MoveNext() {
            if (hasFetched) _cur += _incr;
            return (_cur > _end ^ _incr > 0);
        }
        public void Reset() {
            _cur = _start; hasFetched = false;
        }
    }

    internal class RangerFEnum : IEnumerator {
        double _cur, _start, _end, _incr;
        bool hasFetched = false;

        public RangerFEnum(double start, double end, double incr) {
            _start = _cur = start; _end = end; _incr = incr;
            if ((_start < _end ^ _incr > 0) || _incr == 0)
                throw new ArgumentException("Will never reach end!");
        }

        public double Current {
            get { hasFetched = true; return _cur; }
        }
        object IEnumerator.Current {
            get { hasFetched = true; return _cur; }
        }
        public bool MoveNext() {
            if (hasFetched) _cur += _incr;
            return (_cur > _end ^ _incr > 0);
        }
        public void Reset() {
            _cur = _start; hasFetched = false;
        }
    }
}
and I deployed it like this:
create assembly Range from 'Range.dll' with permission_set=safe -- mod path to point to actual dll location on disk
go
create function dbo.Range(@start bigint, @end bigint, @incr bigint)
returns table(n bigint)
as external name [Range].[Range.UserDefinedFunctions].[Range]
go
create function dbo.RangeF(@start float, @end float, @incr float)
returns table(n float)
as external name [Range].[Range.UserDefinedFunctions].[RangeF]
go
This is basically one of those things that reveal SQL to be less than ideal. I'm thinking maybe the right way to do this is to build a function that creates the range (or a generator).
I believe the correct answer to your question is basically, "you can't".
(Sorry.)
You can use a common table expression to do this in SQL2005+.
WITH CTE AS
(
SELECT 100 AS n
UNION ALL
SELECT n + 1 AS n FROM CTE WHERE n + 1 <= 200
)
SELECT n FROM CTE
If using SQL Server 2000 or greater, you could use the table datatype to avoid creating a normal or temporary table, then use the normal table operations on it.
With this solution you essentially have a table structure in memory that you can use almost like a real table, but much more performantly.
I found a good discussion here: Temporary tables vs the table data type.
Here's a hack you should never use:
select sum(numberGenerator.rank)
from
(
    select
        rank = ( select count(*)
                 from reallyLargeTable t1
                 where t1.uniqueValue > t2.uniqueValue ),
        t2.uniqueValue id1,
        t2.uniqueValue id2
    from reallyLargeTable t2
) numberGenerator
where rank between 1 and 10
You can simplify this using the Rank() or Row_Number() functions in SQL 2005.