Blobs in HSQLDB with "res" connection string

I have an HSQLDB database packaged in a jar file that contains my database files (mydb.script and mydb.lobs).
When connecting to the database with a "res" URL (jdbc:hsqldb:res:mydb), all queries work fine except reading bytes from a BLOB column. This is the exception I get:
Caused by: org.hsqldb.HsqlException: file input/output error
at org.hsqldb.error.Error.error(Unknown Source)
at org.hsqldb.types.BlobDataID.getBytes(Unknown Source)
at org.hsqldb.types.BlobInputStream.readIntoBuffer(Unknown Source)
When connecting to the same database using a "file" URL, everything works. The code used to get the bytes from the BLOB column is this:
// rs is ResultSet
Blob blob = rs.getBlob(i + 1);
int blobSize = (int) blob.length();
byte[] bytes = new byte[blobSize];
InputStream is = blob.getBinaryStream();
try {
    is.read(bytes, 0, blobSize);
} catch (IOException e) {
    logger.error("Error reading bytes from blob: ", e);
}
Any ideas what could cause reading bytes from a BLOB column to fail with the "res" URL and succeed with the "file" URL?

With databases used as resources (on the classpath or in jars), LOBs are not supported in HSQLDB up to version 2.2.9. The next version should support them.
Initial support has just been added to the latest HSQLDB snapshot jar which can be downloaded from:
http://www.hsqldb.org/repos/org/hsqldb/hsqldb/SNAPSHOT/
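As a side note, independent of the res-vs-file issue: a single InputStream.read() call is not guaranteed to fill the whole buffer. Once LOB access works (for example with the "file" URL, or a build that supports LOBs in "res" databases), a simpler and safer pattern is to let the driver copy the bytes via Blob.getBytes. A minimal sketch reusing the ResultSet and column index from the question:
Blob blob = rs.getBlob(i + 1);
byte[] bytes = blob.getBytes(1, (int) blob.length()); // JDBC Blob positions are 1-based
blob.free(); // release driver resources when finished (JDBC 4.0+)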

Docx4j 6.1.2 - Convert Flat Open XML to Xlsx

I can create a Flat OPC XML file from an Xlsx using the following code:
SpreadsheetMLPackage spreadsheetMLPackage = SpreadsheetMLPackage.load(new File("test.xlsx"));
FlatOpcXmlCreator flatOpcXmlCreator = new FlatOpcXmlCreator(spreadsheetMLPackage);
String flatOpcXml = org.docx4j.XmlUtils.marshaltoString(flatOpcXmlCreator.get(), false, true, org.docx4j.jaxb.Context.jcXmlPackage);
Files.write(Path.of("testFlatOpc.xml"), flatOpcXml.getBytes(), StandardOpenOption.CREATE, StandardOpenOption.WRITE);
but if I now try to read the generated Flat OPC XML in order to convert it back to an Xlsx using the following code
FlatOpcXmlImporter flatOpcXmlImporter = new FlatOpcXmlImporter(new FileInputStream("testFlatOpc.xml"));
OpcPackage opcPackage = flatOpcXmlImporter.get();
the flatOpcXmlImporter.get() call throws the following exception:
org.docx4j.openpackaging.exceptions.Docx4JException: Failed to add parts from relationships
at org.docx4j.convert.in.FlatOpcXmlImporter.addPartsFromRelationships(FlatOpcXmlImporter.java:297)
at org.docx4j.convert.in.FlatOpcXmlImporter.get(FlatOpcXmlImporter.java:221)
at at.apa.psp.TestExcel.main(TestExcel.java:38)
Caused by: org.docx4j.openpackaging.exceptions.Docx4JException: Failed to getPart
at org.docx4j.convert.in.FlatOpcXmlImporter.getRawPart(FlatOpcXmlImporter.java:659)
at org.docx4j.convert.in.FlatOpcXmlImporter.getRawPart(FlatOpcXmlImporter.java:426)
at org.docx4j.convert.in.FlatOpcXmlImporter.getPart(FlatOpcXmlImporter.java:365)
at org.docx4j.convert.in.FlatOpcXmlImporter.addPartsFromRelationships(FlatOpcXmlImporter.java:295)
... 2 more
Caused by: javax.xml.bind.JAXBException: Preprocessing exception
- with linked exception:
[javax.xml.bind.UnmarshalException: unexpected element (URI:"http://schemas.openxmlformats.org/spreadsheetml/2006/main", local:"workbook"). Expected elements are <{http://schemas.openxmlformats.org/markup-compatibility/2006}AlternateContent>, ...
at org.docx4j.openpackaging.parts.JaxbXmlPartXPathAware.unmarshal(JaxbXmlPartXPathAware.java:707)
at org.docx4j.convert.in.FlatOpcXmlImporter.getRawPart(FlatOpcXmlImporter.java:515)
... 5 more
Caused by: javax.xml.bind.UnmarshalException: unexpected element (URI:"http://schemas.openxmlformats.org/spreadsheetml/2006/main", local:"workbook"). Expected elements are ...
at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallingContext.handleEvent(UnmarshallingContext.java:662)
at com.sun.xml.bind.v2.runtime.unmarshaller.Loader.reportError(Loader.java:258)
at com.sun.xml.bind.v2.runtime.unmarshaller.Loader.reportError(Loader.java:253)
at com.sun.xml.bind.v2.runtime.unmarshaller.Loader.reportUnexpectedChildElement(Loader.java:120)
at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallingContext$DefaultRootLoader.childElement(UnmarshallingContext.java:1063)
at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallingContext._startElement(UnmarshallingContext.java:498)
at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallingContext.startElement(UnmarshallingContext.java:480)
at com.sun.xml.bind.v2.runtime.unmarshaller.InterningXmlVisitor.startElement(InterningXmlVisitor.java:75)
at com.sun.xml.bind.v2.runtime.unmarshaller.SAXConnector.startElement(SAXConnector.java:150)
Should it be possible to convert Flat OPC XML to an Excel file using Docx4j 6.1.2?
Why does the FlatOpcXmlCreator write Namespaces the FlatOpcXmlImporter cannot read?
If it is not possible with docx4j, are there any alternatives for creating an Excel file from Flat OPC XML?
This is now fixed by https://github.com/plutext/docx4j/commit/f5f8b2c9caa9a3d8d339b74e7e878d19c56ad526
This will be in the next 8.1.x release, or you could patch 6.1.2 with that fix yourself.
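For completeness, with a build that contains that commit (or a patched 6.1.2), the round trip back to an .xlsx should look roughly like this (file names are placeholders following the question):
FlatOpcXmlImporter flatOpcXmlImporter = new FlatOpcXmlImporter(new FileInputStream("testFlatOpc.xml"));
OpcPackage opcPackage = flatOpcXmlImporter.get();
// OpcPackage.save(File) writes the package back out as a regular zip-based package
opcPackage.save(new File("testRoundTrip.xlsx"));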

Nifi CompressContent - Got this exception "IOException thrown from CompressContent: java.io.IOException: Input is not in the .gz format"

I am new to NiFi and am trying to do a POC of the flow below.
I get XML messages from a Kafka topic. I need to consume each XML message, extract a few attributes plus a data element that is in GZIP-compressed format, decompress that data (which is again XML), and then load it into a MySQL database. I got stuck at step (5) below.
(1) ConsumeKafka
(2) EvaluateXPath (flowfile-attribute: set a few XML elements as flow-file attributes that are useful downstream)
(3) EvaluateXPath (flowfile-content: get the gzip data using the XPath expression string(//ABC/data))
(4) UpdateAttribute (mime.type = application/gzip)
(5) CompressContent (Compression Format = use mime.type attribute, Mode = decompress)
My CompressContent is failing with the exception below.
org.apache.nifi.processor.exception.ProcessException: IOException thrown from CompressContent[id=be4b9583-016e-1000-7cce-b9d822334c4c]: java.io.IOException: java.io.IOException: Input is not in the .gz format
It could be because the data type of the flowfile content from (3) EvaluateXPath is String. Do I need to convert the String to bytes before feeding it to CompressContent? If yes, how can I do that in the same (3) EvaluateXPath, using some kind of toBytes() function?
Thanks in advance for your help!
Got the solution for this issue. The data is Base64-encoded, which is why the gzip decompression failed, so I added a "Base64EncodeContent" processor (set to decode) before "CompressContent" (gzip decompress), and that solved the issue.
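For intuition, the same decode-then-decompress order can be reproduced outside NiFi. A minimal Java sketch, where base64Payload is a hypothetical stand-in for the flowfile content extracted by EvaluateXPath:
static byte[] gunzipBase64(String base64Payload) throws java.io.IOException {
    // Undo the Base64 layer first; gunzipping the Base64 text directly fails
    // because the gzip magic bytes (1f 8b) are missing.
    byte[] gzipBytes = java.util.Base64.getDecoder().decode(base64Payload);
    try (java.io.InputStream in = new java.util.zip.GZIPInputStream(
            new java.io.ByteArrayInputStream(gzipBytes))) {
        return in.readAllBytes(); // the decompressed XML (Java 9+)
    }
}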

Exception "Data source name not found" while connecting to an existing DSN

I have created ODBC connections (both 32-bit and 64-bit) with the configuration given below:
Microsoft SQL Server ODBC Driver Version 10.00.14393
Data Source Name: ODBCMSSQL
Data Source Description:
Server: .\SQLEXPRESS
Database: MedicalMarketting
Language: (Default)
Translate Character Data: Yes
Log Long Running Queries: No
Log Driver Statistics: No
Use Regional Settings: No
Prepared Statements Option: Drop temporary procedures on disconnect
Use Failover Server: No
Use ANSI Quoted Identifiers: Yes
Use ANSI Null, Paddings and Warnings: Yes
Data Encryption: No
I want to connect to the local MS SQL Server instance as shown in the code snippet below:
string connectionString = "Data Source=ODBCMSSQL;Initial Catalog=MedicalMarketting;Integrated Security=True";
con = new OdbcConnection(connectionString);
cmd = new OdbcCommand();
cmd.Connection = con;
cmd.CommandType = CommandType.Text;
try
{
    this.con.Open();
    this.tr = con.BeginTransaction();
    this.cmd.Transaction = tr;
}
catch (Exception ex)
{
    this.RollBack();
}
This throws an exception with the following error message:
ERROR [IM002] [Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified
Sorry if this is too basic, but I had to post for a clue, because the same configuration works perfectly for other ODBC connections.
I found a quick fix for the problem by changing the connection string to:
string connectionString = "DSN=ODBCMSSQL"; // best practice is to store this in a separate config file
In fact, the other attributes specified (Initial Catalog, Integrated Security) are not ODBC connection string attributes, and are therefore ignored. A full list of ODBC connection attributes can be found below:
https://msdn.microsoft.com/en-us/library/ee275047(v=bts.10).aspx
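If you still want the DSN connection to target a specific database with Windows authentication, the ODBC-style equivalents of those attributes are Database and Trusted_Connection, for example (illustrative only, not tested against this DSN):
DSN=ODBCMSSQL;Database=MedicalMarketting;Trusted_Connection=Yes;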

Cascading S3 Sink Tap not being deleted with SinkMode.REPLACE

We are running Cascading with a sink Tap configured to store to Amazon S3 and were facing some FileAlreadyExistsExceptions (see [1] below).
This happened only from time to time (roughly 1 run in 100) and was not reproducible.
Digging into the Cascading code, we discovered that Hfs.deleteResource() is called (among others) by BaseFlow.deleteSinksIfNotUpdate().
By the way, we were quite intrigued by the silent NPE (with the comment "hack to get around npe thrown when fs reaches root directory").
From there, we extended the Hfs tap with our own Tap to add more action in the deleteResource() method (see [2] below), with a retry mechanism that calls getFileSystem(conf).delete directly.
The retry mechanism seemed to bring an improvement, but we still sometimes face failures (see the example in [3] below): HDFS returns isDeleted=true, but when we check right afterwards whether the folder exists, we get exists=true, which should not happen. The logs also show isDeleted randomly true or false when the flow succeeds, which suggests the returned value is irrelevant or cannot be trusted.
Can anybody share their own S3 experience with such behavior: "the folder should be deleted, but it is not"? We suspect an S3 issue, but could it also be in Cascading or HDFS?
We run on Hadoop Cloudera-cdh3u5 and Cascading 2.0.1-wip-dev.
[1]
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3n://... already exists
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
at com.twitter.elephantbird.mapred.output.DeprecatedOutputFormatWrapper.checkOutputSpecs(DeprecatedOutputFormatWrapper.java:75)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:923)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:882)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:882)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:856)
at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:104)
at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:174)
at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:137)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:122)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:42)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.j
[2]
@Override
public boolean deleteResource(JobConf conf) throws IOException {
    LOGGER.info("Deleting resource {}", getIdentifier());
    boolean isDeleted = super.deleteResource(conf);
    LOGGER.info("Hfs Sink Tap isDeleted is {} for {}", isDeleted, getIdentifier());
    Path path = new Path(getIdentifier());
    int retryCount = 0;
    int cumulativeSleepTime = 0;
    int sleepTime = 1000;
    while (getFileSystem(conf).exists(path)) {
        LOGGER.info("Resource {} still exists, it should not... - I will continue to wait patiently...",
                getIdentifier());
        try {
            LOGGER.info("Now I will sleep " + sleepTime / 1000
                    + " seconds while trying to delete {} - attempt: {}",
                    getIdentifier(), retryCount + 1);
            Thread.sleep(sleepTime);
            cumulativeSleepTime += sleepTime;
            sleepTime *= 2;
        } catch (InterruptedException e) {
            e.printStackTrace();
            LOGGER.error("Interrupted while sleeping trying to delete {} with message {}...",
                    getIdentifier(), e.getMessage());
            throw new RuntimeException(e);
        }
        if (retryCount == 0) {
            getFileSystem(conf).delete(getPath(), true);
        }
        retryCount++;
        if (cumulativeSleepTime > MAXIMUM_TIME_TO_WAIT_TO_DELETE_MS) {
            break;
        }
    }
    if (getFileSystem(conf).exists(path)) {
        LOGGER.error("We didn't succeed to delete the resource {}. Throwing now a runtime exception.",
                getIdentifier());
        throw new RuntimeException("Although we waited to delete the resource for "
                + getIdentifier() + ' ' + retryCount
                + " iterations, it still exists - This must be an issue in the underlying storage system.");
    }
    return isDeleted;
}
[3]
INFO [pool-2-thread-15] (BaseFlow.java:1287) - [...] at least one sink is marked for delete
INFO [pool-2-thread-15] (BaseFlow.java:1287) - [...] sink oldest modified date: Wed Dec 31 23:59:59 UTC 1969
INFO [pool-2-thread-15] (HiveSinkTap.java:148) - Now I will sleep 1 seconds while trying to delete s3n://... - attempt: 1
INFO [pool-2-thread-15] (HiveSinkTap.java:130) - Deleting resource s3n://...
INFO [pool-2-thread-15] (HiveSinkTap.java:133) - Hfs Sink Tap isDeleted is true for s3n://...
ERROR [pool-2-thread-15] (HiveSinkTap.java:175) - We didn't succeed to delete the resource s3n://... Throwing now a runtime exception.
WARN [pool-2-thread-15] (Cascade.java:706) - [...] flow failed: ...
java.lang.RuntimeException: Although we waited to delete the resource for s3n://... 0 iterations, it still exists - This must be an issue in the underlying storage system.
at com.qubit.hive.tap.HiveSinkTap.deleteResource(HiveSinkTap.java:179)
at com.qubit.hive.tap.HiveSinkTap.deleteResource(HiveSinkTap.java:40)
at cascading.flow.BaseFlow.deleteSinksIfNotUpdate(BaseFlow.java:971)
at cascading.flow.BaseFlow.prepare(BaseFlow.java:733)
at cascading.cascade.Cascade$CascadeJob.call(Cascade.java:761)
at cascading.cascade.Cascade$CascadeJob.call(Cascade.java:710)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
First, double check the Cascading compatibility page for supported distributions.
http://www.cascading.org/support/compatibility/
Note Amazon EMR is listed as they periodically run the compatibility tests and report the results back.
Second, S3 is an eventually consistent filesystem; HDFS is not. So assumptions about the behavior of HDFS don't carry over to storing data in S3. For example, a rename is really a copy followed by a delete, and the copy can take hours. Amazon has patched their internal distribution to accommodate many of the differences.
Third, there are no real directories in S3; they are a hack and are supported differently by different S3 interfaces (jets3t vs s3cmd vs ...). This is bound to be problematic considering the prior point.
Fourth, network latency and reliability are critical, especially when communicating with S3. Historically I've found the Amazon network to be better behaved when manipulating massive datasets on S3 from EMR than from standard EC2 instances. I also believe there is a patch in EMR that improves matters here as well.
So I'd suggest trying the EMR Apache Hadoop distribution to see if your issues clear up.
When running any jobs on Hadoop that use files in S3, the nuances of eventual consistency must be kept in mind.
I've helped troubleshoot many apps that turned out to have a similar delete race condition as their root issue, whether they were written in Cascading, Hadoop Streaming, or directly in Java.
There was discussion at one point of having notifications from S3 after a given key/value pair had been fully deleted. I haven't kept up on where that feature stands. Otherwise, it's probably best to design systems (again, whether in Cascading or any other app that uses S3) such that data consumed or produced by a batch workflow gets managed in HDFS, HBase, or a key/value framework (e.g., I have used Redis for this). S3 is then used for durable storage, but not for intermediate data.
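To make that last point concrete, one way to keep S3 out of the hot path is to have the workflow read and write HDFS only, and copy the finished output to S3 afterwards for durable storage. A rough sketch with the plain Hadoop FileSystem API (paths are placeholders, error handling omitted):
// uses org.apache.hadoop.conf.Configuration and org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
Configuration conf = new Configuration();
Path hdfsOut = new Path("hdfs:///jobs/output/run-0001");   // placeholder: output written by the flow
Path s3Dest = new Path("s3n://my-bucket/output/run-0001"); // placeholder: durable copy
FileSystem hdfs = hdfsOut.getFileSystem(conf);
FileSystem s3 = s3Dest.getFileSystem(conf);
// The job itself never lists or deletes S3 keys, so eventual consistency
// cannot race with deleteSinksIfNotUpdate(); only this final copy touches S3.
FileUtil.copy(hdfs, hdfsOut, s3, s3Dest, false /* deleteSource */, conf);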

SqlServerCE version 3.5 SP2 - Database file is larger than configured maximum

Here is some sample code showing how I create, connect to, and work with my database:
string connection = @"Data Source='C:\test.sdf';Max Database Size=4000;"
    + "Max Buffer Size=4096;";
File.Delete(@"C:\test.sdf");
using (var engine = new SqlCeEngine(connection))
{
    engine.CreateDatabase();
    engine.Compact("Data Source=; Case Sensitive=True; Max Database Size=4000;");
}
using (var dbConn = new SqlCeConnection(connection))
{
    // Create tables, indexes, etc., and insert loads of data here.
    // Somewhere in the loading of data I get
    // the "Database file is larger..." exception.
}
Here is my question: the database file size at the point of the exception is a mere 368 MB (386,879,488 bytes to be exact, according to the file properties). Do I need to add the Max Database Size setting to the Compact statement as well?
Any other ideas on what could be wrong?
The default value for Max Database Size is 256 MB, so yes, you would need to add this to the connection string if the file grows larger than that.
As ErikEJ said, this is how the connection string has to be:
"Data Source=MyData.sdf;Max Database Size=256;Persist Security Info=False;"
where you can replace 256 with the size you need.