Lucene: Iterate all entries

I have a Lucene index which I would like to iterate over (for a one-time evaluation at the current stage of development).
I have 4 documents, each with a few hundred thousand up to a million entries, which I want to iterate over to count the number of words in each entry (~2-10) and calculate the frequency distribution.
What I am doing at the moment is this:
for (int i = 0; i < reader.maxDoc(); i++) {
if (reader.isDeleted(i))
continue;
Document doc = reader.document(i);
Field text = doc.getField("myDocName#1");
String content = text.stringValue();
int wordLen = countNumberOfWords(content);
//store
}
So far, it is iterating over something. Debugging confirms that it's at least operating on the terms stored in the document, but for some reason it only processes a small part of the stored terms. What am I doing wrong? I simply want to iterate over all documents and everything that is stored in them.

Firstly you need to ensure you index with TermVectors enabled
doc.add(new Field(TITLE, page.getTitle(), Field.Store.YES, Field.Index.ANALYZED, TermVector.WITH_POSITIONS_OFFSETS));
Then you can use IndexReader.getTermFreqVector to count terms
TopDocs res = indexSearcher.search(YOUR_QUERY, null, 1000);
// iterate over documents in res, omitted for brevity
reader.getTermFreqVector(res.scoreDocs[i].doc, YOUR_FIELD, new TermVectorMapper() {
public void map(String termval, int freq, TermVectorOffsetInfo[] offsets, int[] positions) {
// increment frequency count of termval by freq
freqs.increment(termval, freq);
}
public void setExpectations(String arg0, int arg1,boolean arg2, boolean arg3) {}
});
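The freqs object used in the snippet above is not defined in the answer. A minimal sketch of what it might look like, assuming it is nothing more than a map-backed counter (the class name TermFrequencyCounter and its methods are hypothetical, not part of Lucene):

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the freqs object; not a Lucene class. */
class TermFrequencyCounter {
    private final Map<String, Long> counts = new HashMap<>();

    /** Adds freq to the running total kept for term. */
    void increment(String term, int freq) {
        counts.merge(term, (long) freq, Long::sum);
    }

    /** Returns the total recorded for term, or 0 if the term was never seen. */
    long get(String term) {
        return counts.getOrDefault(term, 0L);
    }

    public static void main(String[] args) {
        TermFrequencyCounter freqs = new TermFrequencyCounter();
        freqs.increment("lucene", 2);
        freqs.increment("index", 1);
        freqs.increment("lucene", 3);
        System.out.println(freqs.get("lucene")); // prints 5
    }
}
```

Because the mapper callback is invoked once per term per document, summing into a map like this gives the total frequency per term across all visited documents.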

Related

Android custom keyboard suggestions

I am building a custom keyboard for Android, one that at least supports autocomplete suggestions. To achieve this, I am storing every word that the user types (excluding password fields) in a Room database table with a simple model: the word and its frequency. To show the suggestions, I am using a Trie which is populated with words from this database table. My query basically orders the table by word frequency and limits the results to 5K (I do not want to overpopulate the Trie; these 5K words can be considered the user's favourite words, the ones they use often and need suggestions for). My actual problem is the ORDER BY clause: this is a rapidly growing data set, and sorting, let's say, 0.1M words to get 5K words seems like overkill. How can I rework this approach to improve the efficiency of the entire suggestions logic?

If not already implemented, add an index on the frequency column: @ColumnInfo(index = true).
Another option could be to add a table that maintains the highest 5K, supported by yet another table (the support table) that has 1 row, with columns for the highest frequency (not really required), the lowest frequency in the current 5K, and the number of rows currently held. After adding or updating a word you can then determine whether the new/updated word should be added to the 5K table (perhaps with a 4th column for the primary key of the lowest, to facilitate efficient deletion).
So
if the number currently held is less than 5k insert or update the 5k table and increment the number currently held in the support table.
otherwise if the frequency is lower than the lowest, skip it.
otherwise update if it already exists.
otherwise delete the lowest, insert the replacement and then update the lowest accordingly in the support table.
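The same rules can be tried out in memory before committing to a table design. The following is only a sketch: it swaps the two tables for a java.util.PriorityQueue used as a min-heap plus a HashMap, does not touch Room at all, and checks the "already exists" case first for simplicity. The class and method names (TopWords, offer, lowest, contains) are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

/** In-memory sketch of the "keep only the top N by frequency" rules. */
class TopWords {
    private final int capacity;
    private final Map<String, Long> held = new HashMap<>();
    // Min-heap ordered by frequency: the head is the lowest-frequency word held.
    private final PriorityQueue<String> byFrequency =
            new PriorityQueue<String>((a, b) -> Long.compare(held.get(a), held.get(b)));

    TopWords(int capacity) {
        if (capacity < 1) throw new IllegalArgumentException("capacity must be at least 1");
        this.capacity = capacity;
    }

    /** Applies the rules; returns true if the word is in the subset afterwards. */
    boolean offer(String word, long frequency) {
        if (held.containsKey(word)) {          // already held: update it
            byFrequency.remove(word);          // remove before its frequency changes
            held.put(word, frequency);
            byFrequency.add(word);
            return true;
        }
        if (held.size() < capacity) {          // room left: just insert
            held.put(word, frequency);
            byFrequency.add(word);
            return true;
        }
        String lowest = byFrequency.peek();
        if (frequency <= held.get(lowest)) {   // not above the current lowest: skip
            return false;
        }
        byFrequency.poll();                    // delete the lowest ...
        held.remove(lowest);
        held.put(word, frequency);             // ... and insert the replacement
        byFrequency.add(word);
        return true;
    }

    String lowest() { return byFrequency.peek(); }

    boolean contains(String word) { return held.containsKey(word); }

    public static void main(String[] args) {
        TopWords top = new TopWords(2);
        top.offer("hello", 5);
        top.offer("world", 3);
        System.out.println(top.offer("rare", 1));  // prints false (below the lowest)
        System.out.println(top.offer("often", 4)); // prints true (evicts "world")
        System.out.println(top.lowest());          // prints often
    }
}
```

The head of the heap plays the role of the "lowest" columns in the support table, so each new word costs a comparison and at most one heap update rather than a sort.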
Note that the 5K table would probably only need to store the rowid as a pointer/reference/map to the core table.
rowid is a column that virtually all tables have in Room (virtual tables are an exception, as are tables with the WITHOUT ROWID attribute, but as far as I am aware Room does not support WITHOUT ROWID tables).
Access via the rowid can be up to twice as fast as via other indexes. I would suggest using @PrimaryKey Long id=null; (Java) or @PrimaryKey var id: Long?=null (Kotlin) and NOT using @PrimaryKey(autoGenerate = true).
autoGenerate = true equates to SQLite's AUTOINCREMENT, about which the SQLite documentation says "The AUTOINCREMENT keyword imposes extra CPU, memory, disk space, and disk I/O overhead and should be avoided if not strictly needed. It is usually not needed."
see https://www.sqlite.org/rowidtable.html, and also https://sqlite.org/autoinc.html
Curiously, the support table mentioned isn't that far away from what coding AUTOINCREMENT does.
With AUTOINCREMENT, a table (sqlite_sequence) is used that has a row per table that has AUTOINCREMENT, storing the table name and the highest ever allocated rowid.
Without AUTOINCREMENT but with <column_name> INTEGER PRIMARY KEY and no value or null for the primary key column's value, SQLite generates a value that is 1 greater than max(rowid).
With AUTOINCREMENT/autoGenerate = true, the generated value is 1 greater than the greater of max(rowid) and the value stored, for that table, in the sqlite_sequence table (hence the overheads).
Of course those overheads will very likely be insignificant in comparison to sorting 0.1M rows.
Demonstration
The following is a demonstration albeit just using a basic Word table as the source.
First the 2 tables (#Entity annotated classes)
Word
@Entity (
indices = {@Index(value = {"word"},unique = true)}
)
class Word {
@PrimaryKey Long wordId=null;
@NonNull
String word;
@ColumnInfo(index = true)
long frequency;
Word(){}
@Ignore
Word(String word, long frequency) {
this.word = word;
this.frequency = frequency;
}
}
WordSubset, aka the table with the highest occurring 5000 frequencies; it simply has a reference/map/link to the underlying/actual word :-
@Entity(
foreignKeys = {
@ForeignKey(
entity = Word.class,
parentColumns = {"wordId"},
childColumns = {"wordIdMap"},
onDelete = ForeignKey.CASCADE,
onUpdate = ForeignKey.CASCADE
)
}
)
class WordSubset {
public static final long SUBSET_MAX_SIZE = 5000;
@PrimaryKey
long wordIdMap;
WordSubset(){};
@Ignore
WordSubset(long wordIdMap) {
this.wordIdMap = wordIdMap;
}
}
note the constant SUBSET_MAX_SIZE, hard coded just the once so a simple single change to adjust (lowering it after rows have been added may cause issues)
WordSubsetSupport this will be a single row table that contains the highest and lowest frequencies (highest is not really needed), the number of rows in the WordSubset table and a reference/map to the word with the lowest frequency.
@Entity(
foreignKeys = {
@ForeignKey(
entity = Word.class,
parentColumns = {"wordId"},
childColumns = {"lowestWordIdMap"}
)
}
)
class WordSubsetSupport {
@PrimaryKey
Long wordSubsetSupportId=null;
long highestFrequency;
long lowestFrequency;
long countOfRowsInSubsetTable;
@ColumnInfo(index = true)
long lowestWordIdMap;
WordSubsetSupport(){}
@Ignore
WordSubsetSupport(long highestFrequency, long lowestFrequency, long countOfRowsInSubsetTable, long lowestWordIdMap) {
this.highestFrequency = highestFrequency;
this.lowestFrequency = lowestFrequency;
this.countOfRowsInSubsetTable = countOfRowsInSubsetTable;
this.lowestWordIdMap = lowestWordIdMap;
this.wordSubsetSupportId = 1L;
}
}
For access, an abstract class CombinedDao is used rather than an interface, because in Java an abstract class allows methods/functions with a body (a Kotlin interface allows these too) :-
@Dao
abstract class CombinedDao {
@Insert(onConflict = OnConflictStrategy.IGNORE)
abstract long insert(Word word);
@Insert(onConflict = OnConflictStrategy.IGNORE)
abstract long insert(WordSubset wordSubset);
@Insert(onConflict = OnConflictStrategy.IGNORE)
abstract long insert(WordSubsetSupport wordSubsetSupport);
@Query("SELECT * FROM wordsubsetsupport LIMIT 1")
abstract WordSubsetSupport getWordSubsetSupport();
@Query("SELECT count() FROM wordsubsetsupport")
abstract long getWordSubsetSupportCount();
@Query("SELECT countOfRowsInSubsetTable FROM wordsubsetsupport")
abstract long getCountOfRowsInSubsetTable();
@Query("UPDATE wordsubsetsupport SET countOfRowsInSubsetTable=:updatedCount")
abstract void updateCountOfRowsInSubsetTable(long updatedCount);
@Query("UPDATE wordsubsetsupport " +
"SET countOfRowsInSubsetTable = (SELECT count(*) FROM wordsubset), " +
"lowestWordIdMap = (SELECT word.wordId FROM wordsubset JOIN word ON wordsubset.wordIdMap = word.wordId ORDER BY frequency ASC LIMIT 1)," +
"lowestFrequency = (SELECT coalesce(min(frequency),0) FROM wordsubset JOIN word ON wordsubset.wordIdMap = word.wordId)," +
"highestFrequency = (SELECT coalesce(max(frequency),0) FROM wordsubset JOIN word ON wordsubset.wordIdMap = word.wordId)")
abstract void autoUpdateWordSupportTable();
@Query("DELETE FROM wordsubset WHERE wordIdMap= (SELECT wordsubset.wordIdMap FROM wordsubset JOIN word ON wordsubset.wordIdMap = word.wordId ORDER BY frequency ASC LIMIT 1)")
abstract void deleteLowestFrequency();
@Transaction
@Query("")
int addWord(Word word) {
/* try to add the add word, setting the wordId value according to the result.
The result will be the wordId generated (1 or greater) or if the word already exists -1
*/
word.wordId = insert(word);
/* If the word was added and not rejected as a duplicate, then it may need to be added to the WordSubset table */
if (word.wordId > 0) {
/* Are there any rows in the support table? if not then add the very first entry/row */
if (getWordSubsetSupportCount() < 1) {
/* Need to add the word to the subset */
insert(new WordSubset(word.wordId));
/* Can now add the first (and only) row to the support table */
insert(new WordSubsetSupport(word.frequency,word.frequency,1,word.wordId));
autoUpdateWordSupportTable();
return 1;
}
/* If there are less than the maximum number of rows in the subset table then
1) insert the new subset row, and
2) update the support table accordingly
*/
if (getCountOfRowsInSubsetTable() < WordSubset.SUBSET_MAX_SIZE) {
insert(new WordSubset(word.wordId));
autoUpdateWordSupportTable();
return 2;
}
/*
Last case is that the subset table is at the maximum number of rows and
the frequency of the added word is greater than the lowest frequency in the
subset, so
1) the row with the lowest frequency is removed from the subset table and
2) the added word is added to the subset
3) the support table is updated accordingly
*/
if (getCountOfRowsInSubsetTable() >= WordSubset.SUBSET_MAX_SIZE) {
WordSubsetSupport currentWordSubsetSupport = getWordSubsetSupport();
if (word.frequency > currentWordSubsetSupport.lowestFrequency) {
deleteLowestFrequency();
insert(new WordSubset(word.wordId));
autoUpdateWordSupportTable();
return 3;
}
}
return 4; /* indicates word added but does not qualify for addition to the subset */
}
return -1;
}
}
The addWord method/function is the only method that is used as this automatically maintains the WordSubset and the WordSubsetSupport tables.
TheDatabase is a pretty standard @Database annotated class, other than that it allows use of the main thread for the sake of convenience and brevity of the demo :-
@Database( entities = {Word.class,WordSubset.class,WordSubsetSupport.class}, version = TheDatabase.DATABASE_VERSION, exportSchema = false)
abstract class TheDatabase extends RoomDatabase {
abstract CombinedDao getCombinedDao();
private static volatile TheDatabase instance = null;
public static TheDatabase getInstance(Context context) {
if (instance == null) {
instance = Room.databaseBuilder(context,TheDatabase.class,DATABASE_NAME)
.addCallback(cb)
.allowMainThreadQueries()
.build();
}
return instance;
}
private static Callback cb = new Callback() {
@Override
public void onCreate(@NonNull SupportSQLiteDatabase db) {
super.onCreate(db);
}
@Override
public void onOpen(@NonNull SupportSQLiteDatabase db) {
super.onOpen(db);
}
};
public static final String DATABASE_NAME = "the_database.db";
public static final int DATABASE_VERSION = 1;
}
Finally, activity code that randomly generates and adds 10,000 words (or thereabouts, as some could be duplicate words), each word having a frequency that is also randomly generated (between 0 and 9999) :-
public class MainActivity extends AppCompatActivity {
TheDatabase db;
CombinedDao dao;
@Override
protected void onCreate(Bundle savedInstanceState) {
super.onCreate(savedInstanceState);
setContentView(R.layout.activity_main);
db = TheDatabase.getInstance(this);
dao = db.getCombinedDao();
for (int i=0; i < 10000; i++) {
Word currentWord = generateRandomWord();
Log.d("ADDINGWORD","Adding word " + currentWord.word + " frequency is " + currentWord.frequency);
dao.addWord(currentWord); /* add the word that was just logged, not a second random word */
}
}
public static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";
private Word generateRandomWord() {
Random r = new Random();
int wordLength = r.nextInt(24) + 1; /* 1 to 24 letters */
int frequency = r.nextInt(10000); /* 0 to 9999 */
StringBuilder sb = new StringBuilder();
for (int i=0; i < wordLength; i++) {
/* nextInt(bound) is never negative, so no abs() is needed */
sb.append(ALPHABET.charAt(r.nextInt(ALPHABET.length())));
}
return new Word(sb.toString(),frequency);
}
}
Obviously the results would differ per run, also the demo is only really designed to be run once (although it could be run more).
After running, the tables can be examined using App Inspection.
The support table (in this instance) shows the following:
As countOfRowsInSubsetTable is 5000, the subset table has been filled to its capacity/limit.
The highest frequency encountered is 9999 (as could well be expected).
The lowest frequency in the subset is 4690, and that is for the word whose wordId is 7412.
The subset table on its own means little, as it just contains a map to the actual word, so it is more informative to use a query that joins it to the word table.
The query result shows that the word whose wordId is 7412 is the one with the lowest frequency of 4690 (as expected according to the support table).

SparkSQL - Error in Schema [duplicate]

What does ArrayIndexOutOfBoundsException mean and how do I get rid of it?
Here is a code sample that triggers the exception:
String[] names = { "tom", "bob", "harry" };
for (int i = 0; i <= names.length; i++) {
System.out.println(names[i]);
}

Your first port of call should be the documentation which explains it reasonably clearly:
Thrown to indicate that an array has been accessed with an illegal index. The index is either negative or greater than or equal to the size of the array.
So for example:
int[] array = new int[5];
int boom = array[10]; // Throws the exception
As for how to avoid it... um, don't do that. Be careful with your array indexes.
One problem people sometimes run into is thinking that arrays are 1-indexed, e.g.
int[] array = new int[5];
// ... populate the array here ...
for (int index = 1; index <= array.length; index++)
{
System.out.println(array[index]);
}
That will miss out the first element (index 0) and throw an exception when index is 5. The valid indexes here are 0-4 inclusive. The correct, idiomatic for statement here would be:
for (int index = 0; index < array.length; index++)
(That's assuming you need the index, of course. If you can use the enhanced for loop instead, do so.)
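As an illustration of that last point, here is a small sketch of the enhanced for loop (the class and method names are made up for the example). There is no index variable at all, so there is nothing to get wrong at the boundaries:

```java
class EnhancedForDemo {
    /** Sums the array without ever touching an index. */
    static int sum(int[] values) {
        int total = 0;
        for (int value : values) {
            total += value;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] array = {3, 1, 4, 1, 5};
        System.out.println(sum(array)); // prints 14
    }
}
```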

if (index < 0 || index >= array.length) {
// Don't use this index. This is out of bounds (borders, limits, whatever).
} else {
// Yes, you can safely use this index. The index is present in the array.
Object element = array[index];
}
See also:
The Java Tutorials - Language Basics - Arrays
Update: as per your code snippet,
for (int i = 0; i<=name.length; i++) {
The index range includes the array's length, which is out of bounds. You need to replace <= with <.
for (int i = 0; i < name.length; i++) {

From this excellent article: ArrayIndexOutOfBoundsException in for loop
To put it briefly:
In the last iteration of
for (int i = 0; i <= name.length; i++) {
i will equal name.length which is an illegal index, since array indices are zero-based.
Your code should read
for (int i = 0; i < name.length; i++)
^

It means that you are trying to access an index of an array which is not valid as it is not in between the bounds.
For example this would initialize a primitive integer array with the upper bound 4.
int intArray[] = new int[5];
Programmers count from zero. So this for example would throw an ArrayIndexOutOfBoundsException as the upper bound is 4 and not 5.
intArray[5];

What causes ArrayIndexOutOfBoundsException?
If you think of a variable as a "box" where you can place a value, then an array is a series of boxes placed next to each other, where the number of boxes is a finite and explicit integer.
Creating an array like this:
final int[] myArray = new int[5];
creates a row of 5 boxes, each holding an int. Each of the boxes has an index, a position in the series of boxes. This index starts at 0 and ends at N-1, where N is the size of the array (the number of boxes).
To retrieve one of the values from this series of boxes, you can refer to it through its index, like this:
myArray[3]
Which will give you the value of the 4th box in the series (since the first box has an index of 0).
An ArrayIndexOutOfBoundsException is caused by trying to retrieve a "box" that does not exist, by passing an index that is higher than the index of the last "box", or negative.
With my running example, these code snippets would produce such an exception:
myArray[5] //tries to retrieve the 6th "box" when there are only 5
myArray[-1] //just makes no sense
myArray[1337] //way too high
How to avoid ArrayIndexOutOfBoundsException
In order to prevent ArrayIndexOutOfBoundsException, there are some key points to consider:
Looping
When looping through an array, always make sure that the index you are retrieving is strictly smaller than the length of the array (the number of boxes). For instance:
for (int i = 0; i < myArray.length; i++) {
Notice the <; never mix a = in there.
You might be tempted to do something like this:
for (int i = 1; i <= myArray.length; i++) {
final int someint = myArray[i - 1];
Just don't. Stick to the one above (if you need to use the index) and it will save you a lot of pain.
Where possible, use foreach:
for (int value : myArray) {
This way you won't have to think about indexes at all.
When looping, whatever you do, NEVER change the value of the loop iterator (here: i). The only place this should change value is to keep the loop going. Changing it otherwise is just risking an exception, and is in most cases not necessary.
Retrieval/update
When retrieving an arbitrary element of the array, always check that it is a valid index against the length of the array:
public Integer getArrayElement(final int index) {
if (index < 0 || index >= myArray.length) {
return null; //although I would much prefer an actual exception being thrown when this happens.
}
return myArray[index];
}
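If you would rather have the exception than a null, one possible sketch (assuming Java 9 or later, where java.util.Objects.checkIndex is available; the CheckedArray wrapper is just an illustrative name) is to validate the index explicitly so the failure is raised by the check itself:

```java
import java.util.Objects;

class CheckedArray {
    private final int[] values;

    CheckedArray(int[] values) {
        this.values = values.clone();
    }

    /** Returns the element at index, or throws IndexOutOfBoundsException if index is invalid. */
    int get(int index) {
        Objects.checkIndex(index, values.length); // Java 9+: rejects index < 0 or index >= length
        return values[index];
    }

    public static void main(String[] args) {
        CheckedArray array = new CheckedArray(new int[] {10, 20, 30});
        System.out.println(array.get(2)); // prints 30
        try {
            array.get(3);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```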

To avoid an array index out-of-bounds exception, one should use the enhanced-for statement where and when they can.
The primary motivation (and use case) is when you are iterating and you do not require any complicated iteration steps. You would not be able to use an enhanced-for to move backwards in an array or only iterate on every other element.
You're guaranteed not to run out of elements to iterate over when doing this, and your [corrected] example is easily converted over.
The code below:
String[] name = {"tom", "dick", "harry"};
for(int i = 0; i< name.length; i++) {
System.out.print(name[i] + "\n");
}
...is equivalent to this:
String[] name = {"tom", "dick", "harry"};
for(String firstName : name) {
System.out.print(firstName + "\n");
}

In your code you access the elements from index 0 up to the length of the string array. name.length gives the number of String objects in your array, i.e. 3, but you can only access up to index 2 (name[2]),
because an array can be accessed from index 0 to name.length - 1, which gives you name.length objects.
Even when using a for loop, you start at index zero and should end at name.length - 1. In an array a[n] you can access from a[0] to a[n-1].
For example:
String[] a={"str1", "str2", "str3" ..., "strn"};
for(int i=0; i<a.length; i++)
System.out.println(a[i]);
In your case:
String[] name = {"tom", "dick", "harry"};
for(int i = 0; i<=name.length; i++) {
System.out.print(name[i] +'\n');
}

For your given array, the length of the array is 3 (i.e. name.length = 3). But as it stores elements starting from index 0, its maximum index is 2.
So, instead of 'i<=name.length' you should write 'i<name.length' to avoid 'ArrayIndexOutOfBoundsException'.

So much for this simple question, but I just wanted to highlight a feature in Java that avoids the confusion around array indexing, even for beginners. Java 8 has abstracted the task of iterating for you.
int[] array = new int[5];
//If you need just the items
Arrays.stream(array).forEach(item -> { System.out.println(item); });
//If you need the index as well
IntStream.range(0, array.length).forEach(index -> { System.out.println(array[index]); });
What's the benefit? Well, one thing is that it reads like English. Second, you need not worry about the ArrayIndexOutOfBoundsException.

The most common case I've seen for seemingly mysterious ArrayIndexOutOfBoundsExceptions, i.e. apparently not caused by your own array handling code, is the concurrent use of SimpleDateFormat. Particularly in a servlet or controller:
public class MyController {
SimpleDateFormat dateFormat = new SimpleDateFormat("MM/dd/yyyy");
public void handleRequest(ServletRequest req, ServletResponse res) {
Date date = dateFormat.parse(req.getParameter("date"));
}
}
If two threads enter the SimpleDateFormat.parse() method together, you will likely see an ArrayIndexOutOfBoundsException. Note the synchronization section of the class javadoc for SimpleDateFormat.
Make sure there is no place in your code that accesses thread-unsafe classes like SimpleDateFormat concurrently, as happens in a servlet or controller. Check all instance variables of your servlets and controllers for likely suspects.
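One way to sidestep the problem, if you are on Java 8 or later, is to switch to java.time, whose DateTimeFormatter is immutable and documented as thread-safe. This is a swapped-in alternative rather than a fix to SimpleDateFormat itself, and the DateParsing class below is only an illustrative name:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class DateParsing {
    // Immutable and thread-safe, so a single shared instance is fine.
    private static final DateTimeFormatter DATE_FORMAT =
            DateTimeFormatter.ofPattern("MM/dd/yyyy");

    /** Parses text such as 12/31/2020; throws DateTimeParseException on bad input. */
    static LocalDate parse(String text) {
        return LocalDate.parse(text, DATE_FORMAT);
    }

    public static void main(String[] args) {
        System.out.println(parse("12/31/2020")); // prints 2020-12-31
    }
}
```

If you have to stay with SimpleDateFormat, creating a new instance per request (or per thread) avoids the shared mutable state.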

You are getting ArrayIndexOutOfBoundsException due to the i<=name.length part. name.length returns the length of the array name, which is 3. Hence when you try to access name[3], it's illegal and throws an exception.
Resolved code:
String[] name = {"tom", "dick", "harry"};
for(int i = 0; i < name.length; i++) { //use < instead of <=
System.out.print(name[i] +'\n');
}
It's defined in the Java language specification:
The public final field length, which contains the number of components
of the array. length may be positive or zero.

That's how this type of exception looks when thrown in Eclipse. The number in red signifies the index you tried to access. So the code would look like this:
myArray[5]
The error is thrown when you try to access an index which doesn't exist in that array. If an array has a length of 3,
int[] intArray = new int[3];
then the only valid indexes are:
intArray[0]
intArray[1]
intArray[2]
If an array has a length of 1,
int[] intArray = new int[1];
then the only valid index is:
intArray[0]
Any integer equal to the length of the array, or bigger than it: is out of bounds.
Any integer less than 0: is out of bounds;
P.S.: If you look to have a better understanding of arrays and do some practical exercises, there's a video here: tutorial on arrays in Java

For multidimensional arrays, it can be tricky to make sure you access the length property of the right dimension. Take the following code for example:
int [][][] a = new int [2][3][4];
for(int i = 0; i < a.length; i++){
for(int j = 0; j < a[i].length; j++){
for(int k = 0; k < a[j].length; k++){
System.out.print(a[i][j][k]);
}
System.out.println();
}
System.out.println();
}
Each dimension has a different length, so the subtle bug is that the middle and inner loops use the length property of the same dimension (because a[i].length is the same as a[j].length).
Instead, the inner loop should use a[i][j].length (or a[0][0].length, for simplicity).
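A corrected version might look like the sketch below, where each loop takes its bound from the dimension it actually indexes (the class and method names are illustrative):

```java
class MultiDimDemo {
    /** Visits every cell, taking each loop bound from the matching dimension. */
    static int countCells(int[][][] a) {
        int visited = 0;
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[i].length; j++) {
                for (int k = 0; k < a[i][j].length; k++) { // a[i][j], not a[j]
                    visited++;
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(countCells(new int[2][3][4])); // prints 24
    }
}
```

Using a[i][j].length also keeps the loop correct for jagged arrays, where the rows do not all have the same length.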

For any array of length n, elements of the array will have an index from 0 to n-1.
If your program is trying to access any element (or memory) having array index greater than n-1, then Java will throw ArrayIndexOutOfBoundsException
So here are two solutions that we can use in a program
Maintaining count:
for(int count = 0; count < array.length; count++) {
System.out.println(array[count]);
}
Or some other looping statement like
int count = 0;
while(count < array.length) {
System.out.println(array[count]);
count++;
}
A better way is to use a for-each loop; with this approach the programmer does not need to bother about the number of elements in the array.
for(String str : array) {
System.out.println(str);
}

Whenever an ArrayIndexOutOfBoundsException occurs, it means you are trying to use an array index which is out of the array's bounds or, in layman's terms, you are requesting more than you have initialised.
To prevent this, always make sure that you are not requesting an index which is not present in the array, i.e. if the array length is 10 then your index must be in the range 0 to 9.

ArrayIndexOutOfBounds means you are trying to index a position within an array that is not allocated.
In this case:
String[] name = { "tom", "dick", "harry" };
for (int i = 0; i <= name.length; i++) {
System.out.println(name[i]);
}
name.length is 3 since the array has been defined with 3 String objects.
When accessing the contents of an array, positions start from 0. Since there are 3 items, that means name[0]="tom", name[1]="dick" and name[2]="harry".
When you loop, since i can be less than or equal to name.length, you are trying to access name[3] which is not available.
To get around this...
In your for loop, you can do i < name.length. This would prevent looping to name[3] and would instead stop at name[2]
for(int i = 0; i<name.length; i++)
Use a for each loop
String[] name = { "tom", "dick", "harry" };
for(String n : name) {
System.out.println(n);
}
Use list.forEach(Consumer action) (requires Java8)
String[] name = { "tom", "dick", "harry" };
Arrays.asList(name).forEach(System.out::println);
Convert array to stream - this is a good option if you want to perform additional 'operations' to your array e.g. filter, transform the text, convert to a map etc (requires Java8)
String[] name = { "tom", "dick", "harry" };
Arrays.asList(name).stream().forEach(System.out::println);
Stream.of(name).forEach(System.out::println);

ArrayIndexOutOfBoundsException means that you are trying to access an index of the array that does not exist or out of the bound of this array. Array indexes start from 0 and end at length - 1.
In your case
for(int i = 0; i<=name.length; i++) {
System.out.print(name[i] +'\n'); // i goes from 0 to length, Not correct
}
ArrayIndexOutOfBoundsException happens when you are trying to access
the name.length indexed element which does not exist (array index ends at length -1). just replacing <= with < would solve this problem.
for(int i = 0; i < name.length; i++) {
System.out.print(name[i] +'\n'); // i goes from 0 to length - 1, Correct
}

According to your Code :
String[] name = {"tom", "dick", "harry"};
for(int i = 0; i<=name.length; i++) {
System.out.print(name[i] +'\n');
}
If you check
System.out.print(name.length);
you will get 3,
which means the length of name is 3.
Your loop is running from 0 to 3,
when it should be running either from 0 to 2 or, if you access name[i - 1], from 1 to 3.
Answer
String[] name = {"tom", "dick", "harry"};
for(int i = 0; i<name.length; i++) {
System.out.print(name[i] +'\n');
}

Each item in an array is called an element, and each element is accessed by its numerical index. As shown in the preceding illustration, numbering begins with 0. The 9th element, for example, would therefore be accessed at index 8.
IndexOutOfBoundsException is thrown to indicate that an index of some sort (such as to an array, to a string, or to a vector) is out of range.
Any array X, can be accessed from [0 to (X.length - 1)]

I see all the answers here explaining how to work with arrays and how to avoid the index out of bounds exceptions. I personally avoid arrays at all costs. I use the Collections classes, which avoids all the silliness of having to deal with array indices entirely. The looping constructs work beautifully with collections supporting code that is both easier to write, understand and maintain.
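For illustration only, a small sketch of what that can look like with java.util.List (assuming Java 9 or later for List.of; the class and method names are invented for the example). The list grows as needed and the loop never mentions an index:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class CollectionsDemo {
    /** Returns a new list holding the upper-cased form of every name. */
    static List<String> upperCased(List<String> names) {
        List<String> result = new ArrayList<>();
        for (String name : names) {
            result.add(name.toUpperCase(Locale.ROOT));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>(List.of("tom", "dick", "harry"));
        names.add("sally"); // no fixed length to overrun
        System.out.println(upperCased(names)); // prints [TOM, DICK, HARRY, SALLY]
    }
}
```

Note that calling names.get(i) with a bad index still throws an IndexOutOfBoundsException, so the benefit comes from iterating rather than indexing.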

If you use an array's length to control iteration of a for loop, always remember that the index of the first item in an array is 0. So the index of the last element in an array is one less than the array's length.

The name ArrayIndexOutOfBoundsException itself explains that if you try to access a value at an index which is outside the bounds of the array, this kind of exception occurs.
In your case, you can just remove the equals sign from your for loop.
for(int i = 0; i<name.length; i++)
The better option is to iterate over the array directly:
for(String i : name )
System.out.println(i);

This error occurs when a loop runs past the limit of the array. Let's consider a simple example like this:
class demo{
public static void main(String a[]){
int[] numberArray={4,8,2,3,89,5};
int i;
for(i=0;i<numberArray.length;i++){
System.out.print(numberArray[i+1]+" ");
}
}
}
First, I initialized an array named numberArray. Then its elements are printed using a for loop. On each pass the loop prints numberArray[i+1] (for example, when i is 1, numberArray[2] is printed). When i is numberArray.length-2, the last element of the array is printed. When i reaches numberArray.length-1, there is no element at index i+1 to print, and at that point an ArrayIndexOutOfBoundsException occurs.

You can use Optional in functional style to avoid NullPointerException and ArrayIndexOutOfBoundsException :
String[] array = new String[]{"aaa", null, "ccc"};
for (int i = 0; i < 4; i++) {
String result = Optional.ofNullable(array.length > i ? array[i] : null)
.map(x -> x.toUpperCase()) //some operation here
.orElse("NO_DATA");
System.out.println(result);
}
Output:
AAA
NO_DATA
CCC
NO_DATA

In most programming languages, indexes start from 0. So you have to write i<names.length or i<=names.length-1 instead of i<=names.length.

You cannot iterate over or store more data than the length of your array. In this case you could do this:
for (int i = 0; i <= name.length - 1; i++) {
// ....
}
Or this:
for (int i = 0; i < name.length; i++) {
// ...
}

Sage: Iterate over increasing sequences

I have a problem that I am unwilling to believe hasn't been solved before in Sage.
Given a pair of integers (d,n) as input, I'd like to receive a list (or set, or whatever) of all nondecreasing sequences of length d all of whose entries are no greater than n.
Similarly, I'd like another function which returns all strictly increasing sequences of length d whose entries are no greater than n.
For example, for d = 2 n=3, I'd receive the output:
[[1,2], [1,3], [2,3]]
or
[[1,1], [1,2], [1,3], [2,2], [2,3], [3,3]]
depending on whether I'm using increasing or nondecreasing.
Does anyone know of such a function?
Edit: Of course, if there is such a method for nonincreasing or decreasing sequences, I can modify that to fit my purposes. I just need something to iterate over sequences.

I needed this algorithm too and I finally managed to write one today. I will share the code here, but I only started to learn coding last week, so it is not pretty.
Idea
Input=(r,d).
Step 1) Create a class "ListAndPosition" that has a list L of arrays of type Integer[r+1], and an integer q between 0 and r.
Step 2) Create a method that receives a ListAndPosition (L,q) and sequentially screens the arrays in L, checking whether the integer at position q is less than the one at position q+1. If so, it adds a new array at the bottom of the list with that entry incremented. When done, the method calls itself again with the new list and q-1 as input.
The code for Step 1)
import java.util.ArrayList;
public class ListAndPosition {
public static Integer r=5;
public final ArrayList<Integer[]> L;
public int q;
public ListAndPosition(ArrayList<Integer[]> L, int q) {
this.L = L;
this.q = q;
}
public ArrayList<Integer[]> getList(){
return L;
}
public int getPosition() {
return q;
}
public void decreasePosition() {
q--;
}
public void showList() {
for(int i=0;i<L.size();i++){
for(int j=0; j<r+1 ; j++){
System.out.print(""+L.get(i)[j]);
}
System.out.println("");
}
}
}
The code for Step 2)
import java.util.ArrayList;
public class NonDecreasingSeqs {
public static Integer r=5;
public static Integer d=3;
public static void main(String[] args) {
//Creating the first array
Integer[] firstArray;
firstArray = new Integer[r+1];
for(int i=0;i<r;i++){
firstArray[i] = 0;
}
firstArray[r] = d;
//Creating the starting listAndDim
ArrayList<Integer[]> L = new ArrayList<Integer[]>();
L.add(firstArray);
ListAndPosition Lq = new ListAndPosition(L,r-1);
System.out.println(""+nonDecSeqs(Lq).size());
}
public static ArrayList<Integer[]> nonDecSeqs(ListAndPosition Lq){
int iterations = r-1-Lq.getPosition();
System.out.println("How many arrays in the list after "+iterations+" iterations? "+Lq.getList().size());
System.out.print("Should we stop the iteration?");
if(0<Lq.getPosition()){
System.out.println(" No, position = "+Lq.getPosition());
for(int i=0;i<Lq.getList().size();i++){
//Showing particular array
System.out.println("Array of L #"+i+":");
for(int j=0;j<r+1;j++){
System.out.print(""+Lq.getList().get(i)[j]);
}
System.out.print("\nCan it be modified at position "+Lq.getPosition()+"?");
if(Lq.getList().get(i)[Lq.getPosition()]<Lq.getList().get(i)[Lq.getPosition()+1]){
System.out.println(" Yes, "+Lq.getList().get(i)[Lq.getPosition()]+"<"+Lq.getList().get(i)[Lq.getPosition()+1]);
{
Integer[] tempArray = new Integer[r+1];
for(int j=0;j<r+1;j++){
if(j==Lq.getPosition()){
tempArray[j] = Lq.getList().get(i)[j] + 1; // autoboxing replaces the deprecated new Integer(...)
}
else{
tempArray[j] = Lq.getList().get(i)[j];
}
}
Lq.getList().add(tempArray);
}
System.out.println("New list");Lq.showList();
}
else{
System.out.println(" No, "+Lq.getList().get(i)[Lq.getPosition()]+"="+Lq.getList().get(i)[Lq.getPosition()+1]);
}
}
System.out.print("Old position = "+Lq.getPosition());
Lq.decreasePosition();
System.out.println(", new position = "+Lq.getPosition());
nonDecSeqs(Lq);
}
else{
System.out.println(" Yes, position = "+Lq.getPosition());
}
return Lq.getList();
}
}
Remark: I needed my sequences to start at 0 and end at d.
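For comparison, the same idea fits in a few lines of Python. This is only a sketch of the procedure described above, not a line-by-line translation: the function name nondecreasing_from_zero is made up for illustration, and it assumes the same convention as the Java code, namely arrays of length r+1 that start at 0 and end at d.

```python
def nondecreasing_from_zero(r, d):
    """Return all nondecreasing lists of length r+1 that start at 0 and end at d.

    Hypothetical helper mirroring the Java idea: sweep positions r-1 down to 1
    and, for every list whose entry at q is still below its right neighbour,
    append a copy with that entry incremented.
    """
    seqs = [[0] * r + [d]]          # the single starting array 0,0,...,0,d
    for q in range(r - 1, 0, -1):   # position 0 is never touched, so it stays 0
        i = 0
        while i < len(seqs):        # the list grows while it is being scanned
            current = seqs[i]
            if current[q] < current[q + 1]:
                bumped = list(current)
                bumped[q] += 1
                seqs.append(bumped)
            i += 1
    return seqs
```

For instance, nondecreasing_from_zero(2, 1) returns [[0, 0, 1], [0, 1, 1]]. The while loop re-reads the list length on every pass, which is what lets a freshly appended list be incremented again until it reaches its right neighbour.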

This is probably not a very good answer to your question. But you could, in principle, use Partitions and the max_slope=-1 argument. Messing around with filtering lists of IntegerVectors sounds equally inefficient and depressing for other reasons.
If this has a canonical name, it might be in the list of sage-combinat functionality, and there is even a base class you could perhaps use for integer lists, which is basically what you are asking about. Maybe you could actually get what you want using IntegerListsLex? Hope this proves helpful.

This question can be solved by using the class "UnorderedTuples" described here:
http://doc.sagemath.org/html/en/reference/combinat/sage/combinat/tuple.html
To return all nondecreasing sequences of length d with entries between 0 and n-1, you may type:
UnorderedTuples(range(n),d)
This returns each nondecreasing sequence as a list. I needed immutable objects (because the sequences would become keys of a dictionary), so I used "tuple" to turn the lists into tuples:
def nondecreasingTuples(n, d):  # name chosen for illustration
    immutables = []
    for s in UnorderedTuples(range(n),d):
        immutables.append(tuple(s))
    return immutables
I also wrote a method that tests whether a sequence is strictly increasing:
def isIncreasing(seq):
    for i in range(len(seq) - 1):
        if seq[i] >= seq[i+1]:
            return False
    return True
The method that returns only strictly increasing sequences would look like:
def increasingTuples(n, d):  # name chosen for illustration
    immutables = []
    for s in UnorderedTuples(range(n),d):
        if isIncreasing(s):
            immutables.append(tuple(s))
    return immutables
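If Sage's combinatorial classes are not required, Python's standard itertools module produces both kinds of sequences directly, without generating and then filtering. The sketch below is an alternative to the UnorderedTuples approach rather than part of it; the function names are illustrative, and it uses entries 1..n to match the example in the question.

```python
from itertools import combinations, combinations_with_replacement


def increasing_sequences(d, n):
    """All strictly increasing length-d tuples with entries in 1..n."""
    return list(combinations(range(1, n + 1), d))


def nondecreasing_sequences(d, n):
    """All nondecreasing length-d tuples with entries in 1..n."""
    return list(combinations_with_replacement(range(1, n + 1), d))
```

Both functions return tuples, so the results can be used as dictionary keys without conversion. Dropping the list(...) call gives a lazy iterator, which may be preferable when d and n are large.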

How to get document ID in CustomScoreProvider?

In short, I am trying to determine a document's true document ID in method CustomScoreProvider.CustomScore which only provides a document "ID" relative to a sub-IndexReader.
More info: I am trying to boost my documents' scores by precomputed boost factors (imagine an in-memory structure that maps Lucene's document ids to boost factors). Unfortunately I cannot store the boosts in the index for a couple of reasons: boosting will not be used for all queries, plus the boost factors can change regularly and that would trigger a lot of reindexing.
Instead I'd like to boost the score at query time and thus I've been working with CustomScoreQuery/CustomScoreProvider. The boosting takes place in method CustomScoreProvider.CustomScore:
public override float CustomScore(int doc, float subQueryScore, float valSrcScore) {
float baseScore = subQueryScore * valSrcScore; // the default computation
// boost -- THIS IS WHERE THE PROBLEM IS
float boostedScore = baseScore * MyBoostCache.GetBoostForDocId(doc);
return boostedScore;
}
My problem is with the doc parameter passed to CustomScore. It is not the true document id -- it is relative to the subreader used for that index segment. (The MyBoostCache class is my in-memory structure mapping Lucene's doc ids to boost factors.) If I knew the reader's docBase I could figure out the true id (id = doc + docBase).
Any thoughts on how I can determine the true id, or perhaps there's a better way to accomplish what I'm doing?
(I am aware that the id I'm trying to get is subject to change and I've already taken steps to make sure the MyBoostCache is always up to date with the latest ids.)

I was able to achieve this by passing the IndexSearcher to my CustomScoreProvider, using it to find which of the searcher's subreaders the provider is scoring, and then summing MaxDoc over the preceding subreaders to get the docBase.
private int DocBase { get; set; }
public MyScoreProvider(IndexReader reader, IndexSearcher searcher) {
DocBase = GetDocBaseForIndexReader(reader, searcher);
}
private static int GetDocBaseForIndexReader(IndexReader reader, IndexSearcher searcher) {
// get all segment readers for the searcher
IndexReader rootReader = searcher.GetIndexReader();
var subReaders = new List<IndexReader>();
ReaderUtil.GatherSubReaders(subReaders, rootReader);
// sequentially loop through the subreaders until we find the specified reader, adjusting our offset along the way
int docBase = 0;
for (int i = 0; i < subReaders.Count; i++)
{
if (subReaders[i] == reader)
break;
docBase += subReaders[i].MaxDoc();
}
return docBase;
}
public override float CustomScore(int doc, float subQueryScore, float valSrcScore) {
float baseScore = subQueryScore * valSrcScore;
float boostedScore = baseScore * MyBoostCache.GetBoostForDocId(doc + DocBase);
return boostedScore;
}
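The arithmetic behind that loop is a running sum: the docBase of a segment is the total MaxDoc of all segments before it. The following Python sketch illustrates only that calculation, with made-up segment sizes; doc_bases and global_doc_id are hypothetical names, not Lucene API.

```python
def doc_bases(max_docs):
    """Return the starting index-wide id (docBase) for each segment.

    max_docs holds MaxDoc for every segment, in reader order.
    """
    bases = []
    total = 0
    for max_doc in max_docs:
        bases.append(total)   # this segment starts where the previous ones end
        total += max_doc
    return bases


def global_doc_id(segment_index, local_doc, max_docs):
    """Translate a segment-relative doc number into an index-wide id."""
    return doc_bases(max_docs)[segment_index] + local_doc


# Made-up example: three segments holding 100, 250 and 40 documents.
sizes = [100, 250, 40]
bases = doc_bases(sizes)              # [0, 100, 350]
doc_id = global_doc_id(2, 5, sizes)   # 350 + 5 = 355
```

Because the bases depend only on the segment layout, they can be computed once per searcher and reused until the index is reopened.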

What is the fastest way to compare two byte arrays?

I am trying to compare two long byte arrays in VB.NET and have run into a snag. Comparing two 50 megabyte files takes almost two minutes, so I'm clearly doing something wrong. I'm on an x64 machine with tons of memory so there are no issues there. Here is the code that I'm using at the moment and would like to change.
_Bytes and item.Bytes are the two different arrays to compare and are already the same length.
For Each B In item.Bytes
If B <> _Bytes(I) Then
Mismatch = True
Exit For
End If
I += 1
Next
I need to be able to compare, as fast as possible, files that are potentially hundreds of megabytes and possibly even a gigabyte or two. Any suggestions or algorithms that would do this faster?
Item.bytes is an object taken from the database/filesystem that is returned to compare, because its byte length matches the item that the user wants to add. By comparing the two arrays I can then determine if the user has added something new to the DB and if not then I can just map them to the other file and not waste hard disk drive space.
[Update]
I converted the arrays to local variables of Byte() and then did the same comparison, same code and it ran in like one second (I have to benchmark it still and compare it to others), but if you do the same thing with local variables and use a generic array it becomes massively slower. I’m not sure why, but it raises a lot more questions for me about the use of arrays.

What is the _Bytes(I) call doing? It's not loading the file each time, is it? Even with buffering, that would be bad news!
There will be plenty of ways to micro-optimise this in terms of looking at longs at a time, potentially using unsafe code etc - but I'd just concentrate on getting reasonable performance first. Clearly there's something very odd going on.
I suggest you extract the comparison code into a separate function which takes two byte arrays. That way you know you won't be doing anything odd. I'd also use a simple For loop rather than For Each in this case - it'll be simpler. Oh, and check whether the lengths are correct first :)
EDIT: Here's the code (untested, but simple enough) that I'd use. It's in C# for the minute - I'll convert it in a sec:
public static bool Equals(byte[] first, byte[] second)
{
if (first == second)
{
return true;
}
if (first == null || second == null)
{
return false;
}
if (first.Length != second.Length)
{
return false;
}
for (int i=0; i < first.Length; i++)
{
if (first[i] != second[i])
{
return false;
}
}
return true;
}
EDIT: And here's the VB:
Public Shared Function ArraysEqual(ByVal first As Byte(), _
ByVal second As Byte()) As Boolean
If (first Is second) Then
Return True
End If
If (first Is Nothing OrElse second Is Nothing) Then
Return False
End If
If (first.Length <> second.Length) Then
Return False
End If
For i as Integer = 0 To first.Length - 1
If (first(i) <> second(i)) Then
Return False
End If
Next i
Return True
End Function

The fastest way to compare two byte arrays of equal size is to use interop. Run the following code in a console application:
using System;
using System.Runtime.InteropServices;
using System.Security;
namespace CompareByteArray
{
class Program
{
static void Main(string[] args)
{
const int SIZE = 100000;
const int TEST_COUNT = 100;
byte[] arrayA = new byte[SIZE];
byte[] arrayB = new byte[SIZE];
for (int i = 0; i < SIZE; i++)
{
arrayA[i] = 0x22;
arrayB[i] = 0x22;
}
{
DateTime before = DateTime.Now;
for (int i = 0; i < TEST_COUNT; i++)
{
int result = MemCmp_Safe(arrayA, arrayB, (UIntPtr)SIZE);
if (result != 0) throw new Exception();
}
DateTime after = DateTime.Now;
Console.WriteLine("MemCmp_Safe: {0}", after - before);
}
{
DateTime before = DateTime.Now;
for (int i = 0; i < TEST_COUNT; i++)
{
int result = MemCmp_Unsafe(arrayA, arrayB, (UIntPtr)SIZE);
if (result != 0) throw new Exception();
}
DateTime after = DateTime.Now;
Console.WriteLine("MemCmp_Unsafe: {0}", after - before);
}
{
DateTime before = DateTime.Now;
for (int i = 0; i < TEST_COUNT; i++)
{
int result = MemCmp_Pure(arrayA, arrayB, SIZE);
if (result != 0) throw new Exception();
}
DateTime after = DateTime.Now;
Console.WriteLine("MemCmp_Pure: {0}", after - before);
}
return;
}
[DllImport("msvcrt.dll", CallingConvention = CallingConvention.Cdecl, EntryPoint="memcmp", ExactSpelling=true)]
[SuppressUnmanagedCodeSecurity]
static extern int memcmp_1(byte[] b1, byte[] b2, UIntPtr count);
[DllImport("msvcrt.dll", CallingConvention = CallingConvention.Cdecl, EntryPoint = "memcmp", ExactSpelling = true)]
[SuppressUnmanagedCodeSecurity]
static extern unsafe int memcmp_2(byte* b1, byte* b2, UIntPtr count);
public static int MemCmp_Safe(byte[] a, byte[] b, UIntPtr count)
{
return memcmp_1(a, b, count);
}
public unsafe static int MemCmp_Unsafe(byte[] a, byte[] b, UIntPtr count)
{
fixed(byte* p_a = a)
{
fixed (byte* p_b = b)
{
return memcmp_2(p_a, p_b, count);
}
}
}
public static int MemCmp_Pure(byte[] a, byte[] b, int count)
{
int result = 0;
for (int i = 0; i < count && result == 0; i += 1)
{
result = a[i] - b[i]; // compare the current element, not always element 0
}
return result;
}
}
}

If you don't need to know which byte differs, compare 64-bit integers, which checks 8 bytes at once. Actually, you can still find the differing byte once you've isolated the mismatch to a group of 8.
Use BinaryReader:
saveTime = binReader.ReadInt32()
Or to read a block into an array:
Dim count As Integer = binReader.Read(testArray, 0, 3)
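To make the "eight at a time" idea concrete, here is a sketch in Python rather than VB.NET. It compares 8-byte words as integers and only drops down to single bytes inside the first word that differs. The function name is illustrative, and in Python a plain a == b on bytes objects is the practical choice, so this only demonstrates the technique.

```python
def first_mismatch(a, b):
    """Return the index of the first differing byte, or -1 if a and b match.

    Assumes a and b are bytes-like objects of equal length.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    for start in range(0, len(a), 8):
        # Interpret up to 8 bytes as one unsigned integer and compare once.
        word_a = int.from_bytes(a[start:start + 8], "little")
        word_b = int.from_bytes(b[start:start + 8], "little")
        if word_a != word_b:
            # The mismatch is somewhere in this word; locate the exact byte.
            for i in range(start, min(start + 8, len(a))):
                if a[i] != b[i]:
                    return i
    return -1
```

The inner loop runs at most once per call, so the byte-level search adds a constant cost on top of the word-level scan.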

Better approach: if you only need to know whether the two arrays differ, generate a hash of each byte array and compare the hash strings. Computing a hash still reads every byte, so the saving comes from storing the hash alongside each file and hashing only the new item; after that, candidates can be ruled out without reading their contents. MD5 should work fine and is pretty efficient.
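Hashing fits the deduplication scenario in the question when each stored file keeps its digest, so a new upload is hashed once and looked up instead of being compared with every candidate. A minimal Python sketch, assuming an in-memory dictionary as a stand-in for the database; ContentStore and its method are hypothetical, and SHA-256 is used here in place of MD5.

```python
import hashlib


class ContentStore:
    """Toy stand-in for the database: maps content digests to stored bytes."""

    def __init__(self):
        self._by_digest = {}

    def add(self, data):
        """Store data unless identical content exists; return (digest, is_new)."""
        digest = hashlib.sha256(data).hexdigest()
        bucket = self._by_digest.setdefault(digest, [])
        # A matching digest is confirmed byte for byte before reusing the entry.
        if any(stored == data for stored in bucket):
            return digest, False
        bucket.append(data)
        return digest, True
```

Calling add twice with the same bytes returns is_new as True the first time and False the second, so the caller can map the duplicate to the existing entry instead of storing it again.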

I see two things that might help:
First, rather than always accessing the second array as item.Bytes, use a local variable to point directly at the array. That is, before starting the loop, do something like this:
array2 = item.Bytes
That will save the overhead of dereferencing from the object each time you want a byte. That could be expensive in Visual Basic, especially if there's a Getter method on that property.
Also, use a "definite loop" instead of "for each". You already know the length of the arrays, so just code the loop using that value. This will avoid the overhead of treating the array as a collection. The loop would look something like this:
For i = 0 To max - 1
If array1(i) <> array2(i) Then
Exit For
End If
Next

Not strictly related to the comparison algorithm:
Are you sure your bottleneck is not related to the memory available and the time used to load the byte arrays? Loading two 2 GB byte arrays just to compare them could bring most machines to their knees. If the program design allows, try using streams to read smaller chunks instead.
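Reading in chunks could look like the sketch below, written in Python for brevity. It assumes both inputs are buffered binary streams whose read(n) returns fewer than n bytes only at the end; streams_equal is an illustrative name, and the 1 MB default chunk size is an arbitrary choice.

```python
import io


def streams_equal(stream_a, stream_b, chunk_size=1024 * 1024):
    """Compare two binary streams chunk by chunk without loading them whole.

    Assumes buffered binary readers (for example files opened with
    open(path, "rb") or io.BytesIO objects).
    """
    while True:
        chunk_a = stream_a.read(chunk_size)
        chunk_b = stream_b.read(chunk_size)
        if chunk_a != chunk_b:
            return False
        if not chunk_a:          # both streams ended at the same point
            return True


# Small in-memory demonstration; real use would pass two open files.
same = streams_equal(io.BytesIO(b"x" * 5000), io.BytesIO(b"x" * 5000), chunk_size=1024)
different = streams_equal(io.BytesIO(b"x" * 5000), io.BytesIO(b"x" * 4999 + b"y"), chunk_size=1024)
```

Memory use stays at two chunks regardless of file size, and the comparison stops at the first differing chunk. Checking the file sizes first avoids reading anything when the lengths differ.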
