EofException when doing a deployment using the Tooltwist Controller - tooltwist

I'm deploying a ToolTwist application to a production server using FIP, and I'm getting this error during the Transfer phase:
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
And in the fipserver console:
org.eclipse.jetty.io.EofException
at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:892)
at org.eclipse.jetty.http.AbstractGenerator.blockForOutput(AbstractGenerator.java:486)
at org.eclipse.jetty.http.AbstractGenerator.flush(AbstractGenerator.java:424)
at org.eclipse.jetty.server.HttpOutput.flush(HttpOutput.java:78)
at org.eclipse.jetty.server.HttpConnection$Output.flush(HttpConnection.java:1094)
at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:159)
at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:98)
at tooltwist.fip.jetty.GetFileListServlet.doGet(GetFileListServlet.java:82)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
Caused by: java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:94)
What is the solution for this?

This error occurs in the first stage of the FIP file transfer, where the fipserver creates an index of the existing files on the destination server. This is done in GetFileListServlet.doGet(), which can be seen in the stack trace. It is also indicated on the client side by the message:
Indexing source...
Indexing destination...
ERROR: java.net.SocketTimeoutException Read timed out
Exception: tooltwist.fip.FipException: java.net.SocketTimeoutException: Read timed out
This indexing process involves creating a hash for each file on the destination server, which the fip client then compares with the hashes of files on the source machine. It does this to determine which files are different, and so need to be installed.
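The comparison described above can be sketched roughly as follows. This is only an illustration, not FIP's actual code: the class and method names are hypothetical, and SHA-256 is an assumed hash (the algorithm FIP really uses is not stated here). It does show why the cost of indexing grows with the total number of bytes under the directory.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch; not part of FIP.
class FileIndexSketch {

    // Hash one file by streaming its contents, so the cost is proportional to file size.
    static String hashFile(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256"); // assumed algorithm
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                digest.update(buffer, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Build a map of relative path -> hash for every regular file under root.
    static Map<String, String> index(Path root) throws IOException, NoSuchAlgorithmException {
        List<Path> files;
        try (Stream<Path> paths = Files.walk(root)) {
            files = paths.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        Map<String, String> result = new TreeMap<>();
        for (Path p : files) {
            result.put(root.relativize(p).toString(), hashFile(p));
        }
        return result;
    }

    // Source paths that are missing on the destination or have a different hash there.
    static Set<String> changed(Map<String, String> source, Map<String, String> destination) {
        Set<String> out = new TreeSet<>();
        for (Map.Entry<String, String> e : source.entrySet()) {
            if (!e.getValue().equals(destination.get(e.getKey()))) {
                out.add(e.getKey());
            }
        }
        return out;
    }
}
```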
A read timeout occurs when the client waits too long for the FIP server to index the files on the destination machine. Indexing is normally a fairly quick process, but it does involve reading all the files beneath the destination directory (e.g. in ~/server). If monstrously huge files exist within that destination directory, the scan will take a proportionately long time to complete. If that time is too long, the client times out and drops the connection, and the server then sees that the connection was dropped and stops indexing.
The most common cause of this error is excessively large log files in ~/server/tomcat/logs. If you clean those up, the problem should go away.
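To check whether oversized files are the culprit before deleting anything, a small scan like the one below can list them. It is a minimal sketch: the class name, method name and size threshold are hypothetical, and the directory to scan is whatever your destination directory is (~/server in the example above).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper; not part of FIP.
class LargeFileFinder {

    // Return every regular file under root whose size is at least minBytes.
    static List<Path> findLargeFiles(Path root, long minBytes) throws IOException {
        List<Path> files;
        try (Stream<Path> paths = Files.walk(root)) {
            files = paths.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        List<Path> large = new ArrayList<>();
        for (Path p : files) {
            if (Files.size(p) >= minBytes) {
                large.add(p);
            }
        }
        return large;
    }
}
```

For example, calling findLargeFiles with the path of ~/server/tomcat/logs and a threshold of 100L * 1024 * 1024 would return the log files of 100 MB or more.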

Related

Kafka Connect S3 source throws java.io.IOException

Kafka Connect S3 source connector throws the following exception around 20 seconds into reading an S3 bucket:
Caused by: java.io.IOException: Attempted read on closed stream.
at org.apache.http.conn.EofSensorInputStream.isReadAllowed(EofSensorInputStream.java:107)
at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:133)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90)
at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:180)
The error is preceded by the following warning:
WARN Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use. (com.amazonaws.services.s3.internal.S3AbortableInputStream:178)
I am running Kafka connect out of this image: confluentinc/cp-kafka-connect-base:6.2.0. Using the confluentinc-kafka-connect-s3-source-2.1.1 jar.
My source connector configuration looks like so:
{
"connector.class":"io.confluent.connect.s3.source.S3SourceConnector",
"tasks.max":"1",
"s3.region":"eu-central-1",
"s3.bucket.name":"test-bucket-yordan",
"topics.dir":"test-bucket/topics",
"format.class": "io.confluent.connect.s3.format.json.JsonFormat",
"partitioner.class":"io.confluent.connect.storage.partitioner.DefaultPartitioner",
"schema.compatibility":"NONE",
"confluent.topic.bootstrap.servers": "blockchain-kafka-kafka-0.blockchain-kafka-kafka-headless.default.svc.cluster.local:9092",
"transforms":"AddPrefix",
"transforms.AddPrefix.type":"org.apache.kafka.connect.transforms.RegexRouter",
"transforms.AddPrefix.regex":".*",
"transforms.AddPrefix.replacement":"$0_copy"
}
Any ideas on what might be the issue? Also, I was unable to find the repository of the Kafka Connect S3 source connector. Is it open source?
Edit: I don't see the problem if gzip compression on the kafka-connect sink is disabled.
The warning means that close() was called before the stream was fully read. S3 had not finished sending the data, and the connection was left hanging.
There are two options:
1. Validate that the input stream contains no more data. That way the connection can be reused.
2. Call s3ObjectInputStream.abort(). (NOTE: the connection cannot be reused if you abort the input stream, and a new one will need to be created, which has a performance impact.) In some cases this might make sense, e.g. when the read is getting too slow.
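The first option amounts to reading the stream to its end before closing it. A minimal sketch using only java.io is shown below; the class and method names are hypothetical. Since S3ObjectInputStream is an InputStream, the same helper could be applied to it, but whether you can hook this in depends on owning the code that reads the stream, which is not the case inside a packaged connector.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper illustrating "drain the input stream after use".
class StreamDrain {

    // Read and discard whatever is left in the stream; returns the number of bytes discarded.
    static long drain(InputStream in) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            total += n;
        }
        return total;
    }
}
```

Draining only pays off when little data remains. If most of a large object is still unread, aborting the stream is usually cheaper than downloading the rest just to discard it.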

Unable to execute HTTP request: Timeout waiting for connection from pool in Flink

I'm working on an app that uploads some files to an S3 bucket and, at a later point, reads files from the S3 bucket and pushes them to my database.
I'm using Flink 1.4.2 and the fs.s3a API for reading and writing files in the S3 bucket.
Uploading files to the S3 bucket works without any problem, but when the second phase of my app starts (reading those uploaded files from S3), it throws the following error:
Caused by: java.io.InterruptedIOException: Reopen at position 0 on s3a://myfilepath/a/b/d/4: org.apache.flink.fs.s3hadoop.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
at org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:125)
at org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:155)
at org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.s3a.S3AInputStream.lazySeek(S3AInputStream.java:281)
at org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:364)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.flink.fs.s3hadoop.shaded.org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream.read(HadoopDataInputStream.java:94)
at org.apache.flink.api.common.io.DelimitedInputFormat.fillBuffer(DelimitedInputFormat.java:702)
at org.apache.flink.api.common.io.DelimitedInputFormat.open(DelimitedInputFormat.java:490)
at org.apache.flink.api.common.io.GenericCsvInputFormat.open(GenericCsvInputFormat.java:301)
at org.apache.flink.api.java.io.CsvInputFormat.open(CsvInputFormat.java:53)
at org.apache.flink.api.java.io.PojoCsvInputFormat.open(PojoCsvInputFormat.java:160)
at org.apache.flink.api.java.io.PojoCsvInputFormat.open(PojoCsvInputFormat.java:37)
at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:145)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:748)
I was able to control this error by increasing the max connection parameter for s3a API.
As of now, I have around 1000 files in the S3 bucket, which my app pushes and pulls, and my max connection setting is 3000. I'm using Flink's parallelism to upload/download these files from the S3 bucket. My task manager count is 14.
This is an intermittent failure; the same scenario also succeeds at times.
My questions are:
Why am I getting an intermittent failure? If the max connection I set was too low, then my app should be throwing this error every time I run it.
Is there any way to calculate the optimal number of max connections required for my app to work without hitting the connection pool timeout error? Or is this error related to something else that I'm not aware of?
Thanks in advance.
Some comments, based on my experience with processing lots of files from S3 via Flink (batch) workflows:
When you are reading the files, Flink will calculate "splits" based on the number of files, and each file's size. Each split is read separately, so the theoretical max # of simultaneous connections isn't based on the # of files, but a combination of files and file sizes.
The connection pool used by the HTTP client releases connections after some amount of time, as being able to reuse an existing connection is a win (server/client handshake doesn't have to happen). So that introduces a degree of randomness into how many available connections are in the pool.
The size of the connection pool doesn't impact memory much, so I typically set it pretty high (e.g. 4096 for a recent workflow).
When using AWS connection code, the setting to bump is fs.s3.maxConnections, which isn't the same as a pure Hadoop configuration.
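To make the intermittent nature of the failure more concrete, here is a toy model of a bounded connection pool with an acquisition timeout. It is not the pool used by the AWS SDK or its HTTP client; the class and its methods are hypothetical and only illustrate the general mechanism: a request fails when no connection is released within the timeout, which depends on timing rather than on the pool size alone.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Toy model only; not the AWS SDK's connection pool.
class BoundedPoolSketch {
    private final Semaphore permits;

    BoundedPoolSketch(int maxConnections) {
        this.permits = new Semaphore(maxConnections);
    }

    // Try to lease a connection slot. A false result corresponds to
    // "Timeout waiting for connection from pool".
    boolean acquire(long timeoutMillis) throws InterruptedException {
        return permits.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    // Hand the slot back so that a waiting caller can proceed.
    void release() {
        permits.release();
    }
}
```

In this model, the same pool size can succeed on one run and fail on the next, depending on how many splits are being read at once and how quickly their connections are handed back.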

Apache Flink error checkpointing to S3

We have Apache Flink (1.4.2) running on an EMR cluster. We are checkpointing to an S3 bucket, and are pushing about 5,000 records per second through the flows. We recently saw the following error in our logs:
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@ip-XXX-XXX-XXX-XXX:XXXXXX/user/taskmanager#-XXXXXXX]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.messages.TaskManagerMessages$RequestTaskManagerLog".
at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:442)
at akka.dispatch.OnComplete.internal(Future.scala:258)
at akka.dispatch.OnComplete.internal(Future.scala:256)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
Immediately after this we got the following in our logs:
2018-07-30 15:08:32,177 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 831 @ 1532963312177
2018-07-30 15:09:46,750 ERROR org.apache.flink.runtime.blob.BlobServerConnection - PUT operation failed
java.io.EOFException: Read an incomplete length
at org.apache.flink.runtime.blob.BlobUtils.readLength(BlobUtils.java:366)
at org.apache.flink.runtime.blob.BlobServerConnection.readFileFully(BlobServerConnection.java:403)
at org.apache.flink.runtime.blob.BlobServerConnection.put(BlobServerConnection.java:349)
at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:114)
At this point, the flow crashed and was not able to recover automatically; however, we were able to restart the flow manually without needing to change the location of the S3 bucket. The fact that the crash occurred while pushing to S3 makes me think that is the crux of the problem.
Any ideas?
FYI, this was caused by too much cross-talk between nodes flooding the NICs on each server. The solution was more intelligent partitioning.

Amazon S3 File Read Timeout. Trying to download a file using JAVA

I'm new to Amazon S3. I get the following error when trying to access a file from Amazon S3 using a simple Java method.
2016-08-23 09:46:48 INFO request:450 - Received successful response:200, AWS Request ID: F5EA01DB74D0D0F5
Caught an AmazonClientException, which means the client encountered an
internal error while trying to communicate with S3, such as not being
able to access the network.
Error Message: Unable to store object contents to disk: Read timed out
The exact same lines of code worked yesterday. I was able to download 100% of a 5GB file in 12 min. Today I'm in a better-connected environment, but only 2% or 3% of the file is downloaded before the program fails.
This is the code I'm using to download:
s3Client.getObject(new GetObjectRequest("mybucket", file.getKey()), localFile);
You need to set the connection timeout and the socket timeout in your client configuration.
Here is an excerpt from a reference article:
Several HTTP transport options can be configured through the com.amazonaws.ClientConfiguration object. Default values will suffice for the majority of users, but users who want more control can configure:
Socket timeout
Connection timeout
Maximum retry attempts for retry-able errors
Maximum open HTTP connections
For an example of how to do it, see the question: Downloading files >3Gb from S3 fails with "SocketTimeoutException: Read timed out"
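As a rough sketch of what such a configuration might look like with the AWS SDK for Java 1.x (which must be on the classpath), the options listed above map to setters on ClientConfiguration. The class name and all numeric values below are illustrative assumptions, not recommendations. AmazonS3ClientBuilder is only available in newer 1.11.x releases; with older versions the configuration would be passed to the AmazonS3Client constructor instead.

```java
import com.amazonaws.ClientConfiguration;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

// Hypothetical wrapper class; the timeout and retry values are placeholders.
class S3ClientTimeouts {

    static AmazonS3 buildClient() {
        ClientConfiguration config = new ClientConfiguration();
        config.setConnectionTimeout(10_000);  // ms allowed to establish a connection
        config.setSocketTimeout(5 * 60_000);  // ms to wait for data on an open connection
        config.setMaxErrorRetry(5);           // retries for retry-able errors
        config.setMaxConnections(50);         // maximum open HTTP connections

        // Credentials and region come from the SDK's default provider chains.
        return AmazonS3ClientBuilder.standard()
                .withClientConfiguration(config)
                .build();
    }
}
```

The client returned by buildClient() can then be used in place of s3Client in the getObject call from the question.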

Liferay stopped at database shutdown caused a crash

I was stopping the Liferay portal, but a few seconds later I stopped the database (db2 quiesce, which means that the connections are closed), and apparently Liferay did not stop its execution correctly.
After that, I restarted the database and Liferay, but the portal does not work now. It shows this message in the browser:
HTTP Status 500 -
type Exception report
message
description The server encountered an internal error () that prevented it from fulfilling this request.
exception
javax.servlet.ServletException: Servlet execution threw an exception
com.liferay.portal.kernel.servlet.filters.invoker.InvokerFilterChain.doFilter(InvokerFilterChain.java:72)
...
root cause
java.lang.NoSuchMethodError: com.liferay.portal.util.PortalUtil.getCDNHostHttp()Ljava/lang/String;
com.liferay.portal.events.ServicePreActionExt.servicePre(ServicePreActionExt.java:937)
After looking in the logs, I found the following messages (they are edited):
SEVERE: Error waiting for multi-thread deployment of directories to completehostConfig.deployWar=Deploying web application archive {0}
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1000)
WARN [DefaultConnectionTester:203] SQL State '08001' of Exception which occurred during a Connection test (fallback DatabaseMetaData test) implies that the database is invalid, and the pool should refill itself with fresh Connections.
com.ibm.db2.jcc.am.DisconnectNonTransientConnectionException: [jcc][t4][2030][11211][3.63.75] A communication error occurred during operations on the connection's underlying socket, socket input stream, or socket output stream. Error location: Reply.fill() - insufficient data (-1). Message: Insufficient data. ERRORCODE=-4499, SQLSTATE=08001
at com.ibm.db2.jcc.am.fd.a(fd.java:321)
WARN [DefaultConnectionTester:136] SQL State '08001' of Exception tested by statusOnException() implies that the database is invalid, and the pool should refill itself with fresh Connections.
WARN [C3P0PooledConnectionPool:708] A ConnectionTest has failed, reporting that all previously acquired Connections are likely invalid. The pool will be reset.
WARN [NewPooledConnection:486] [c3p0] A PooledConnection that has already signalled a Connection error is still in use!
WARN [NewPooledConnection:487] [c3p0] Another error has occurred [ com.ibm.db2.jcc.am.SqlNonTransientConnectionException: [jcc][t4][10335][10366][3.63.75] Invalid operation: Connection is closed. ERRORCODE=-4470, SQLSTATE=08003 ] which will not be reported to listeners!
com.ibm.db2.jcc.am.SqlNonTransientConnectionException: [jcc][t4][10335][10366][3.63.75] Invalid operation: Connection is closed. ERRORCODE=-4470, SQLSTATE=08003
WARN [BasicResourcePool:1841] com.mchange.v2.resourcepool.BasicResourcePool$AcquireTask#4fad5112 -- Acquisition Attempt Failed!!! Clearing pending acquires. While trying to acquire a needed new resource, we failed to succeed more than the maximum number of allowed acquisition attempts (3). Last acquisition attempt exception:
com.ibm.db2.jcc.am.SqlNonTransientConnectionException: DB2 SQL Error: SQLCODE=-20157, SQLSTATE=08004, SQLERRMC=FUT5MAN;QUIESCE DATABASE;;, DRIVER=3.63.75
ERROR [PortalJobStore:109] MisfireHandler: Error handling misfires: Unexpected runtime exception: null
org.quartz.JobPersistenceException: Unexpected runtime exception: null [See nested exception: java.lang.reflect.UndeclaredThrowableException]
Caused by: java.lang.reflect.UndeclaredThrowableException
at $Proxy279.prepareStatement(Unknown Source)
at org.quartz.impl.jdbcjobstore.StdJDBCDelegate.countMisfiredTriggersInState(StdJDBCDelegate.java:413)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.GeneratedMethodAccessor65.invoke(Unknown Source)
Caused by: java.sql.SQLException: Connections could not be acquired from the underlying database!
at com.mchange.v2.sql.SqlUtils.toSQLException(SqlUtils.java:106)
Caused by: com.mchange.v2.resourcepool.CannotAcquireResourceException: A ResourcePool could not acquire a resource from its primary factory or source.
at com.mchange.v2.resourcepool.BasicResourcePool.awaitAvailable(BasicResourcePool.java:1319)
Now I see that it is almost impossible to start the current Liferay installation. However, I have the database (I made a full backup) and Lucene's data directory. How can I recreate a Liferay installation from these two things? I would like to recover some of this data in a new installation, but I do not know how.
This is not the best solution, but I installed Liferay with a new database. Once it was configured, I changed the database configuration to use the other one.
Probably, it was a problem with the ROOT deployment, but this is very weird.
I could recover all the data from the Lucene and the database.
The database is still quiesced and the Liferay user doesn't have the QUIESCE_CONNECT privilege.
Unquiesce the database and restart Liferay.
Using DB2 instance owner (if you're on Windows, any administrator):
db2 connect to DBNAME
db2 unquiesce database
db2 connect reset
Regards.