Check for array index to avoid outofbounds exception - arraylist

I'm still very new to Java, so I have a feeling that I'm doing more than I need to here and would appreciate any advise as to whether there is a more proficient way to go about this. Here is what I'm trying to do:
Output the last value in the Arraylist.
Intentionally insert an out of bounds index value with system.out (index (4) in this case)
Bypass the incorrect value and provide the last valid Arraylist value (I hope this makes sense).
My program is running fine (I'm adding more later, so userInput will eventually be used), but I'd like to do this without using a try/catch/finally block (i.e. check the index length) if possible. Thank you all in advance!
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
public class Ex02 {
public static void main(String[] args) throws IOException {
BufferedReader userInput = new BufferedReader(new InputStreamReader(
System.in));
try {
ArrayList<String> myArr = new ArrayList<String>();
myArr.add("Zero");
myArr.add("One");
myArr.add("Two");
myArr.add("Three");
System.out.println(myArr.get(4));
System.out.print("This program is not currently setup to accept user input. The last printed string in this array is: ");
} catch (Exception e) {
System.out.print("This program is not currently setup to accept user input. The requested array index which has been programmed is out of range. \nThe last valid string in this array is: ");
} finally {
ArrayList<String> myArr = new ArrayList<String>();
myArr.add("Zero");
myArr.add("One");
myArr.add("Two");
myArr.add("Three");
System.out.print(myArr.get(myArr.size() - 1));
}
}
}

Check for array index to avoid outofbounds exception:
In a given ArrayList, you can always get the length of it. By doing a simple comparison, you can check the condition you want. I haven't gone through your code, below is what i was talking-
public static void main(String[] args) {
List<String> list = new ArrayList<String>();
list.add("stringA");
list.add("stringB");
list.add("stringC");
int index = 20;
if (isIndexOutOfBounds(list, index)) {
System.out.println("Index is out of bounds. Last valid index is "+getLastValidIndex(list));
}
}
private static boolean isIndexOutOfBounds(final List<String> list, int index) {
return index < 0 || index >= list.size();
}
private static int getLastValidIndex(final List<String> list) {
return list.size() - 1;
}

Related

Lucene Custom Analyzer - TokenStream Contract Violation

I was trying to create my own custom analyzer and tokenizer classes in Lucene. I followed mostly the instructions here:
http://www.citrine.io/blog/2015/2/14/building-a-custom-analyzer-in-lucene
And I updated as needed (in Lucene's newer versions the Reader is stored in "input")
However I get an exception:
TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.
What could be the reason for this? I gather calling reset\close is not my job at all, but should be done by the analyzer.
Here's my custom analyzer class:
public class MyAnalyzer extends Analyzer {
protected TokenStreamComponents createComponents(String FieldName){
// TODO Auto-generated method stub
return new TokenStreamComponents(new MyTokenizer());
}
}
And my custom tokenizer class:
public class MyTokenizer extends Tokenizer {
protected CharTermAttribute charTermAttribute =
addAttribute(CharTermAttribute.class);
public MyTokenizer() {
char[] buffer = new char[1024];
int numChars;
StringBuilder stringBuilder = new StringBuilder();
try {
while ((numChars =
this.input.read(buffer, 0, buffer.length)) != -1) {
stringBuilder.append(buffer, 0, numChars);
}
}
catch (IOException e) {
throw new RuntimeException(e);
}
String StringToTokenize = stringBuilder.toString();
Terms=Tokenize(StringToTokenize);
}
public boolean incrementToken() throws IOException {
if(CurrentTerm>=Terms.length)
return false;
this.charTermAttribute.setEmpty();
this.charTermAttribute.append(Terms[CurrentTerm]);
CurrentTerm++;
return true;
}
static String[] Tokenize(String StringToTokenize){
//Here I process the string and create an array of terms.
//I tested this method and it works ok
//In case it's relevant, I parse the string into terms in the //constructor. Then in IncrementToken I simply iterate over the Terms array and //submit them each at a time.
return Processed;
}
public void reset() throws IOException {
super.reset();
Terms=null;
CurrentTerm=0;
};
String[] Terms;
int CurrentTerm;
}
When I traced the Exception, I saw that the problem was with input.read - it seems that there is nothing inside input (or rather there is a ILLEGAL_STATE_READER in it) I don't understand it.
You are reading from the input stream in your Tokenizer constructor, before it is reset.
The problem here, I think, is that you are handling the input as a String, instead of as a Stream. The intent is for you to efficiently read from the stream in the incrementToken method, rather than to load the whole stream into a String and pre-process a big ol' list of tokens at the beginning.
It is possible to go this route, though. Just move all the logic currently in the constructor into your reset method instead (after the super.reset() call).

Pig - passing Databag to UDF constructor

I have a script which is loading some data about venues:
venues = LOAD 'venues_extended_2.csv' USING org.apache.pig.piggybank.storage.CSVLoader() AS (Name:chararray, Type:chararray, Latitude:double, Longitude:double, City:chararray, Country:chararray);
Then I want to create UDF which has a constructor that is accepting venues type.
So I tried to define this UDF like that:
DEFINE GenerateVenues org.gla.anton.udf.main.GenerateVenues(venues);
And here is the actual UDF:
public class GenerateVenues extends EvalFunc<Tuple> {
TupleFactory mTupleFactory = TupleFactory.getInstance();
BagFactory mBagFactory = BagFactory.getInstance();
private static final String ALLCHARS = "(.*)";
private ArrayList<String> venues;
private String regex;
public GenerateVenues(DataBag venuesBag) {
Iterator<Tuple> it = venuesBag.iterator();
venues = new ArrayList<String>((int) (venuesBag.size() + 1)); // possible fails!!!
String current = "";
regex = "";
while (it.hasNext()){
Tuple t = it.next();
try {
current = "(" + ALLCHARS + t.get(0) + ALLCHARS + ")";
venues.add((String) t.get(0));
} catch (ExecException e) {
throw new IllegalArgumentException("VenuesRegex: requires tuple with at least one value");
}
regex += current + (it.hasNext() ? "|" : "");
}
}
#Override
public Tuple exec(Tuple tuple) throws IOException {
// expect one string
if (tuple == null || tuple.size() != 2) {
throw new IllegalArgumentException(
"BagTupleExampleUDF: requires two input parameters.");
}
try {
String tweet = (String) tuple.get(0);
for (String venue: venues)
{
if (tweet.matches(ALLCHARS + venue + ALLCHARS))
{
Tuple output = mTupleFactory.newTuple(Collections.singletonList(venue));
return output;
}
}
return null;
} catch (Exception e) {
throw new IOException(
"BagTupleExampleUDF: caught exception processing input.", e);
}
}
}
When executed the script is firing error at the DEFINE part just before (venues);:
2013-12-19 04:28:06,072 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file script.pig, line 6, column 60> mismatched input 'venues' expecting RIGHT_PAREN
Obviously I'm doing something wrong, can you help me out figuring out what's wrong.
Is it the UDF that cannot accept the venues relation as a parameter. Or the relation is not represented by DataBag like this public GenerateVenues(DataBag venuesBag)?
Thanks!
PS I'm using Pig version 0.11.1.1.3.0.0-107.
As #WinnieNicklaus already said, you can only pass strings to UDF constructors.
Having said that, the solution to your problem is using distributed cache, you need to override public List<String> getCacheFiles() to return a list of filenames that will be made available via distributed cache. With that, you can read the file as a local file and build your table.
The downside is that Pig has no initialization function, so you have to implement something like
private void init() {
if (!this.initialized) {
// read table
}
}
and then call that as the first thing from exec.
You can't use a relation as a parameter in a UDF constructor. Only strings can be passed as arguments, and if they are really of another type, you will have to parse them out in the constructor.

Text extraction from table cells

I have a pdf. The pdf contains a table. The table contains many cells (>100). I know the exact position (x,y) and dimension (w,h) of every cell of the table.
I need to extract text from cells using itextsharp. Using PdfReaderContentParser + FilteredTextRenderListener (using a code like this http://itextpdf.com/examples/iia.php?id=279 ) I can extract text but I need to run the whole procedure for each cell. My pdf have many cells and the program needs too much time to run. Is there a way to extract text from a list of "rectangle"? I need to know the text of each rectangle. I'm looking for something like PDFTextStripperByArea by PdfBox (you can define as many regions as you need and the get text using .getTextForRegion("region-name") ).
This option is not immediately included in the iTextSharp distribution but it is easy to realize. In the following I use the iText (Java) class, interface, and method names because I am more at home with Java. They should easily be translatable into iTextSharp (C#) names.
If you use the LocationTextExtractionStrategy, you can can use its a posteriori TextChunkFilter mechanism instead of the a priori FilteredRenderListener mechanism used in the sample you linked to. This mechanism has been introduced in version 5.3.3.
For this you first parse the whole page content using the LocationTextExtractionStrategy without any FilteredRenderListener filtering applied. This makes the strategy object collect TextChunk objects for all PDF text objects on the page containing the associated base line segment.
Then you call the strategy's getResultantText overload with a TextChunkFilter argument (instead of the regular no-argument overload):
public String getResultantText(TextChunkFilter chunkFilter)
You call it with a different TextChunkFilter instance for each table cell. You have to implement this filter interface which is not too difficult as it only defines one method:
public static interface TextChunkFilter
{
/**
* #param textChunk the chunk to check
* #return true if the chunk should be allowed
*/
public boolean accept(TextChunk textChunk);
}
So the accept method of the filter for a given cell must test whether the text chunk in question is inside your cell.
(Instead of separate instances for each cell you can of course also create one instance whose parameters, i.e. cell coordinates, can be changed between getResultantText calls.)
PS: As mentioned by the OP, this TextChunkFilter has not yet been ported to iTextSharp. It should not be hard to do so, though, only one small interface and one method to add to the strategy.
PPS: In a comment sschuberth asked
Do you then still call PdfTextExtractor.getTextFromPage() when using getResultantText(), or does it somehow replace that call? If so, how to you then specify the page to extract to?
Actually PdfTextExtractor.getTextFromPage() internally already uses the no-argument getResultantText() overload:
public static String getTextFromPage(PdfReader reader, int pageNumber, TextExtractionStrategy strategy, Map<String, ContentOperator> additionalContentOperators) throws IOException
{
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
return parser.processContent(pageNumber, strategy, additionalContentOperators).getResultantText();
}
To make use of a TextChunkFilter you could simply build a similar convenience method, e.g.
public static String getTextFromPage(PdfReader reader, int pageNumber, LocationTextExtractionStrategy strategy, Map<String, ContentOperator> additionalContentOperators, TextChunkFilter chunkFilter) throws IOException
{
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
return parser.processContent(pageNumber, strategy, additionalContentOperators).getResultantText(chunkFilter);
}
In the context at hand, though, in which we want to parse the page content only once and apply multiple filters, one for each cell, we might generalize this to:
public static List<String> getTextFromPage(PdfReader reader, int pageNumber, LocationTextExtractionStrategy strategy, Map<String, ContentOperator> additionalContentOperators, Iterable<TextChunkFilter> chunkFilters) throws IOException
{
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
parser.processContent(pageNumber, strategy, additionalContentOperators)
List<String> result = new ArrayList<>();
for (TextChunkFilter chunkFilter : chunkFilters)
{
result.add(strategy).getResultantText(chunkFilter);
}
return result;
}
(You can make this look fancier by using Java 8 collection streaming instead of the old'fashioned for loop.)
Here's my take on how to extract text from a table-like structure in a PDF using itextsharp. It returns a collection of rows and each row contains a collection of interpreted columns. This may work for you on the premise that there is a gap between one column and the next which is greater than the average width of a single character. I also added an option to check for wrapped text within a virtual column. Your mileage may vary.
using (PdfReader pdfReader = new PdfReader(stream))
{
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
TableExtractionStrategy tableExtractionStrategy = new TableExtractionStrategy();
string pageText = PdfTextExtractor.GetTextFromPage(pdfReader, page, tableExtractionStrategy);
var table = tableExtractionStrategy.GetTable();
}
}
public class TableExtractionStrategy : LocationTextExtractionStrategy
{
public float NextCharacterThreshold { get; set; } = 1;
public int NextLineLookAheadDepth { get; set; } = 500;
public bool AccomodateWordWrapping { get; set; } = true;
private List<TableTextChunk> Chunks { get; set; } = new List<TableTextChunk>();
public override void RenderText(TextRenderInfo renderInfo)
{
base.RenderText(renderInfo);
string text = renderInfo.GetText();
Vector bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
Rectangle rectangle = new Rectangle(bottomLeft[Vector.I1], bottomLeft[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);
Chunks.Add(new TableTextChunk(rectangle, text));
}
public List<List<string>> GetTable()
{
List<List<string>> lines = new List<List<string>>();
List<string> currentLine = new List<string>();
float? previousBottom = null;
float? previousRight = null;
StringBuilder currentString = new StringBuilder();
// iterate through all chunks and evaluate
for (int i = 0; i < Chunks.Count; i++)
{
TableTextChunk chunk = Chunks[i];
// determine if we are processing the same row based on defined space between subsequent chunks
if (previousBottom.HasValue && previousBottom == chunk.Rectangle.Bottom)
{
if (chunk.Rectangle.Left - previousRight > 1)
{
currentLine.Add(currentString.ToString());
currentString.Clear();
}
currentString.Append(chunk.Text);
previousRight = chunk.Rectangle.Right;
}
else
{
// if we are processing a new line let's check to see if this could be word wrapping behavior
bool isNewLine = true;
if (AccomodateWordWrapping)
{
int readAheadDepth = Math.Min(i + NextLineLookAheadDepth, Chunks.Count);
if (previousBottom.HasValue)
for (int j = i; j < readAheadDepth; j++)
{
if (previousBottom == Chunks[j].Rectangle.Bottom)
{
isNewLine = false;
break;
}
}
}
// if the text was not word wrapped let's treat this as a new table row
if (isNewLine)
{
if (currentString.Length > 0)
currentLine.Add(currentString.ToString());
currentString.Clear();
previousBottom = chunk.Rectangle.Bottom;
previousRight = chunk.Rectangle.Right;
currentString.Append(chunk.Text);
if (currentLine.Count > 0)
lines.Add(currentLine);
currentLine = new List<string>();
}
else
{
if (chunk.Rectangle.Left - previousRight > 1)
{
currentLine.Add(currentString.ToString());
currentString.Clear();
}
currentString.Append(chunk.Text);
previousRight = chunk.Rectangle.Right;
}
}
}
return lines;
}
private struct TableTextChunk
{
public Rectangle Rectangle;
public string Text;
public TableTextChunk(Rectangle rect, string text)
{
Rectangle = rect;
Text = text;
}
public override string ToString()
{
return Text + " (" + Rectangle.Left + ", " + Rectangle.Bottom + ")";
}
}
}

Store and retrieve string arrays in HBase

I've read this answer (How to store complex objects into hadoop Hbase?) regarding the storing of string arrays with HBase.
There it is said to use the ArrayWritable Class to serialize the array. With WritableUtils.toByteArray(Writable ... writable) I'll get a byte[] which I can store in HBase.
When I now try to retrieve the rows again, I get a byte[] which I have somehow to transform back again into an ArrayWritable.
But I don't find a way to do this. Maybe you know an answer or am I doing fundamentally wrong serializing my String[]?
You may apply the following method to get back the ArrayWritable (taken from my earlier answer, see here) .
public static <T extends Writable> T asWritable(byte[] bytes, Class<T> clazz)
throws IOException {
T result = null;
DataInputStream dataIn = null;
try {
result = clazz.newInstance();
ByteArrayInputStream in = new ByteArrayInputStream(bytes);
dataIn = new DataInputStream(in);
result.readFields(dataIn);
}
catch (InstantiationException e) {
// should not happen
assert false;
}
catch (IllegalAccessException e) {
// should not happen
assert false;
}
finally {
IOUtils.closeQuietly(dataIn);
}
return result;
}
This method just deserializes the byte array to the correct object type, based on the provided class type token.
E.g:
Let's assume you have a custom ArrayWritable:
public class TextArrayWritable extends ArrayWritable {
public TextArrayWritable() {
super(Text.class);
}
}
Now you issue a single HBase get:
...
Get get = new Get(row);
Result result = htable.get(get);
byte[] value = result.getValue(family, qualifier);
TextArrayWritable tawReturned = asWritable(value, TextArrayWritable.class);
Text[] texts = (Text[]) tawReturned.toArray();
for (Text t : texts) {
System.out.print(t + " ");
}
...
Note:
You may have already found the readCompressedStringArray() and writeCompressedStringArray() methods in WritableUtils
which seem to be suitable if you have your own String array-backed Writable class.
Before using them, I'd warn you that these can cause serious performance hit due to
the overhead caused by the gzip compression/decompression.

Mono.CSharp: how do I inject a value/entity *into* a script?

Just came across the latest build of Mono.CSharp and love the promise it offers.
Was able to get the following all worked out:
namespace XAct.Spikes.Duo
{
class Program
{
static void Main(string[] args)
{
CompilerSettings compilerSettings = new CompilerSettings();
compilerSettings.LoadDefaultReferences = true;
Report report = new Report(new Mono.CSharp.ConsoleReportPrinter());
Mono.CSharp.Evaluator e;
e= new Evaluator(compilerSettings, report);
//IMPORTANT:This has to be put before you include references to any assemblies
//our you;ll get a stream of errors:
e.Run("using System;");
//IMPORTANT:You have to reference the assemblies your code references...
//...including this one:
e.Run("using XAct.Spikes.Duo;");
//Go crazy -- although that takes time:
//foreach (Assembly assembly in AppDomain.CurrentDomain.GetAssemblies())
//{
// e.ReferenceAssembly(assembly);
//}
//More appropriate in most cases:
e.ReferenceAssembly((typeof(A).Assembly));
//Exception due to no semicolon
//e.Run("var a = 1+3");
//Doesn't set anything:
//e.Run("a = 1+3;");
//Works:
//e.ReferenceAssembly(typeof(A).Assembly);
e.Run("var a = 1+3;");
e.Run("A x = new A{Name=\"Joe\"};");
var a = e.Evaluate("a;");
var x = e.Evaluate("x;");
//Not extremely useful:
string check = e.GetVars();
//Note that you have to type it:
Console.WriteLine(((A) x).Name);
e = new Evaluator(compilerSettings, report);
var b = e.Evaluate("a;");
}
}
public class A
{
public string Name { get; set; }
}
}
And that was fun...can create a variable in the script's scope, and export the value.
There's just one last thing to figure out... how can I get a value in (eg, a domain entity that I want to apply a Rule script on), without using a static (am thinking of using this in a web app)?
I've seen the use compiled delegates -- but that was for the previous version of Mono.CSharp, and it doesn't seem to work any longer.
Anybody have a suggestion on how to do this with the current version?
Thanks very much.
References:
* Injecting a variable into the Mono.CSharp.Evaluator (runtime compiling a LINQ query from string)
* http://naveensrinivasan.com/tag/mono/
I know it's almost 9 years later, but I think I found a viable solution to inject local variables. It is using a static variable but can still be used by multiple evaluators without collision.
You can use a static Dictionary<string, object> which holds the reference to be injected. Let's say we are doing all this from within our class CsharpConsole:
public class CsharpConsole {
public static Dictionary<string, object> InjectionRepository {get; set; } = new Dictionary<string, object>();
}
The idea is to temporarily place the value in there with a GUID as key so there won't be any conflict between multiple evaluator instances. To inject do this:
public void InjectLocal(string name, object value, string type=null) {
var id = Guid.NewGuid().ToString();
InjectionRepository[id] = value;
type = type ?? value.GetType().FullName;
// note for generic or nested types value.GetType().FullName won't return a compilable type string, so you have to set the type parameter manually
var success = _evaluator.Run($"var {name} = ({type})MyNamespace.CsharpConsole.InjectionRepository[\"{id}\"];");
// clean it up to avoid memory leak
InjectionRepository.Remove(id);
}
Also for accessing local variables there is a workaround using Reflection so you can have a nice [] accessor with get and set:
public object this[string variable]
{
get
{
FieldInfo fieldInfo = typeof(Evaluator).GetField("fields", BindingFlags.NonPublic | BindingFlags.Instance);
if (fieldInfo != null)
{
var fields = fieldInfo.GetValue(_evaluator) as Dictionary<string, Tuple<FieldSpec, FieldInfo>>;
if (fields != null)
{
if (fields.TryGetValue(variable, out var tuple) && tuple != null)
{
var value = tuple.Item2.GetValue(_evaluator);
return value;
}
}
}
return null;
}
set
{
InjectLocal(variable, value);
}
}
Using this trick, you can even inject delegates and functions that your evaluated code can call from within the script. For instance, I inject a print function which my code can call to ouput something to the gui console window:
public delegate void PrintFunc(params object[] o);
public void puts(params object[] o)
{
// call the OnPrint event to redirect the output to gui console
if (OnPrint!=null)
OnPrint(string.Join("", o.Select(x => (x ?? "null").ToString() + "\n").ToArray()));
}
This puts function can now be easily injected like this:
InjectLocal("puts", (PrintFunc)puts, "CsInterpreter2.PrintFunc");
And just be called from within your scripts:
puts(new object[] { "hello", "world!" });
Note, there is also a native function print but it directly writes to STDOUT and redirecting individual output from multiple console windows is not possible.