Large file downloads fail with scrapy media_pipeline - scrapy

Some files are small, but some exceed 300 MB, and those large ones fail to download. I'm using media_pipeline to download the files. Output:
2016-07-08 18:11:22 [scrapy] WARNING: Received (208954047) bytes larger than download warn size (200000000).
This repeats many times, and then:
Gave up retrying <GET http://pmd.foxsports.com.au/free/nogeoblock/2016/07/06/DVU_20160607_AFL_TONIGHT_201607061809/DVU_20160607_AFL_TONIGHT_201607061809_1596.mp4> (failed 3 times): User timeout caused connection failure: Getting http://pmd.foxsports.com.au/free/nogeoblock/2016/07/06/DVU_20160607_AFL_TONIGHT_201607061809/DVU_20160607_AFL_TONIGHT_201607061809_1596.mp4 took longer than 1800.0 seconds..
The timeout of 1800 s is extremely generous: downloading a 300 MB file takes far less time on my connection, yet it still fails. I know I could use some external library/downloader, but if possible I want to do this with the framework's own means.
UPD: The exact scenario is the following:
0) The timeout has to be adjusted in settings.py (the timeout in the request's "meta" field doesn't work); a sketch appears after this list.
1) A number of requests (6) are passed through MediaPipeline.get_media_requests; each request corresponds to a large file.
2) Some of the files download fine.
3) At some point the remaining files stop downloading and all of them fail with the timeout. The framework retries the downloads, but to no avail; the timeout recurs once per retry.
4) The same files download fine with wget.
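For reference, a minimal settings.py sketch along the lines of point 0; the values are illustrative assumptions, not a confirmed fix:

# settings.py
DOWNLOAD_TIMEOUT = 3600    # seconds per request; raised project-wide since the per-request meta key had no effect here
DOWNLOAD_WARNSIZE = 0      # 0 disables the "larger than download warn size" warning
DOWNLOAD_MAXSIZE = 0       # 0 disables the hard response-size limit (default is 1 GiB)

The media pipeline's file requests go through the same downloader, so these project-wide settings apply to them as well.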

Related

Unable to execute HTTP request: Timeout waiting for connection from pool in Flink

I'm working on an app that uploads some files to an S3 bucket and, at a later point, reads files from the S3 bucket and pushes them to my database.
I'm using Flink 1.4.2 and the fs.s3a API for reading and writing files from the S3 bucket.
Uploading files to the S3 bucket works fine, but when the second phase of my app starts, reading those uploaded files from S3, my app throws the following error:
Caused by: java.io.InterruptedIOException: Reopen at position 0 on s3a://myfilepath/a/b/d/4: org.apache.flink.fs.s3hadoop.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
at org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:125)
at org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:155)
at org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.s3a.S3AInputStream.lazySeek(S3AInputStream.java:281)
at org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:364)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.flink.fs.s3hadoop.shaded.org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream.read(HadoopDataInputStream.java:94)
at org.apache.flink.api.common.io.DelimitedInputFormat.fillBuffer(DelimitedInputFormat.java:702)
at org.apache.flink.api.common.io.DelimitedInputFormat.open(DelimitedInputFormat.java:490)
at org.apache.flink.api.common.io.GenericCsvInputFormat.open(GenericCsvInputFormat.java:301)
at org.apache.flink.api.java.io.CsvInputFormat.open(CsvInputFormat.java:53)
at org.apache.flink.api.java.io.PojoCsvInputFormat.open(PojoCsvInputFormat.java:160)
at org.apache.flink.api.java.io.PojoCsvInputFormat.open(PojoCsvInputFormat.java:37)
at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:145)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:748)
I was able to contain this error by increasing the max-connections parameter for the s3a API.
As of now, I have around 1000 files in the S3 bucket that are pushed and pulled by my app, and my max connections setting is 3000. I'm using Flink's parallelism to upload/download these files from the S3 bucket. My task manager count is 14.
This is an intermittent failure; I also have successful runs for this scenario.
My questions are:
Why am I getting an intermittent failure? If the max connections value I set were too low, then my app should throw this error on every run.
Is there any way to calculate the optimal number of max connections required for my app to work without hitting the connection pool timeout error? Or is this error related to something else that I'm not aware of?
Thanks in advance.
Some comments, based on my experience with processing lots of files from S3 via Flink (batch) workflows:
When you are reading the files, Flink will calculate "splits" based on the number of files, and each file's size. Each split is read separately, so the theoretical max # of simultaneous connections isn't based on the # of files, but a combination of files and file sizes.
The connection pool used by the HTTP client releases connections after some amount of time, as being able to reuse an existing connection is a win (server/client handshake doesn't have to happen). So that introduces a degree of randomness into how many available connections are in the pool.
The size of the connection pool doesn't impact memory much, so I typically set it pretty high (e.g. 4096 for a recent workflow).
When using the AWS connector code, the setting to bump is fs.s3.maxConnections, which isn't the same key as in a pure Hadoop configuration (see the sketch below).
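As a concrete illustration of that last point, purely as a sketch: with the plain Hadoop s3a connector the equivalent property is fs.s3a.connection.maximum, while the EMRFS connector uses fs.s3.maxConnections as noted above. Assuming your deployment picks up a Hadoop core-site.xml (how the property reaches the filesystem depends on your setup, so treat the wiring as an assumption):

<configuration>
  <property>
    <name>fs.s3a.connection.maximum</name>
    <value>4096</value>  <!-- the high value suggested above; tune to your parallelism -->
  </property>
</configuration>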

Large commit stalls halfway through

I have a problem with our Subversion server. Small commits work fine, but as soon as someone tries to commit a large collection of sizeable files, the commit stalls halfway through and the client finally times out. My test set consists of roughly 2000 files, and the total size of the commit is about 1 GB. When I commit the files, the upload starts, but about halfway through the transfer rate drops to 0 kB/s and the commit just stalls and never recovers. If I split the commit into smaller pieces (<150 MB), everything works just fine, but that breaks the atomicity of the commit and is something I really want to avoid.
When I look at the logs generated by Apache, there are no error messages.
When I bumped the log level from debug to trace6 on the Apache server, some errors appear at the moment the upload stalls:
...
OpenSSL: I/O error, 2229 bytes expected to read on BIO
OpenSSL: read 1460/2229 bytes from BIO
...
Versions used:
We serve the Subversion repository through Apache with mod_dav, mod_dav_svn, mod_authz_svn and mod_auth_digest. The client connects via https.
Server:
OpenSuse 42.3
svnserve: 1.9.7
Apache: 2.4.23
Client:
Windows 10 enterprise
svn client: 1.10.0-dev.
What I tried so far:
I have tried increasing the Timeout value in the Apache configuration. The only difference is that the client stays stalled for longer before reporting the timeout.
I have tried increasing the MaxKeepAliveRequests from 100 to 1000. No change.
I have tried adding SVNAllowBulkUpdates Prefer to the svn settings. No change. (The sketch below shows how these directives fit together.)
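For reference, a sketch of the directives tried above as they might appear in the Apache configuration; the values and the repository path are illustrative, not a known fix:

# httpd.conf / vhost (illustrative values)
Timeout 1200
KeepAlive On
MaxKeepAliveRequests 1000

<Location /svn>
    DAV svn
    SVNParentPath /srv/svn        # path is an assumption
    SVNAllowBulkUpdates Prefer
</Location>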
Has anyone got any hints on how to debug these types of errors?

Is there a way to download this Homestead VirtualBox file?

Short Question
I am currently trying to install Homestead. The following URL is choking me to death:
https://atlas.hashicorp.com/laravel/boxes/homestead/versions/2.1.0/providers/virtualbox.box
Is there any other way to download the above file?
Details:
The above file download is redirected to a signed S3 URL. Out of great wisdom, there is a timeout of 60 (presumably seconds), so after downloading 10% or so, the download fails.
Have a look at the following:
vagrant init laravel/homestead; vagrant up --provider virtualbox
...
==> default: Adding box 'laravel/homestead' (v2.1.0) for provider: virtualbox
default: Downloading: https://atlas.hashicorp.com/laravel/boxes/homestead/versions/2.1.0/providers/virtualbox.box
==> default: Box download is resuming from prior download progress
default:
An error occurred while downloading the remote file.
The error message, if any, is reproduced below.
Please fix this error and try again.
I have tried downloading the file by other means, e.g. through the browser or through curl. The call to the above URL results in a signed S3 link, which in all its glory has a timeout, as you can see in the URL I get:
https://hc-prod-storagelocker.s3.amazonaws.com/boxes/5b64bd3b-eb87-4af4-9b2d-1c1560efca67?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJBKZ6DNPERBCPYKQ%2F20170613%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20170613T022000Z&X-Amz-Expires=60&X-Amz-SignedHeaders=host&X-Amz-Signature=0d13db989138a93b4ab82d18c4768b141e837d0c834654f73f5394e1cd04ce0e
I am not exactly sitting under a rock, but they clearly think I should have better speed, hence such a short expiry on the signed URL.
Due to the slow connection, the download kept breaking. Even though Vagrant says:
Box download is resuming from prior download progress
there was no indication from the progress percentage that it was actually resuming.
I kept trying, running it over and over again, and eventually it accounted for the previous download progress.
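One other way that may work around the short-lived signed URL, sketched here with an arbitrary output filename: download the box with a resumable client (re-running curl follows the redirect to a fresh signed URL and resumes from the partial file), then register it with Vagrant so vagrant up skips the flaky download:

curl -L -C - -o homestead-2.1.0.box "https://atlas.hashicorp.com/laravel/boxes/homestead/versions/2.1.0/providers/virtualbox.box"
vagrant box add --name laravel/homestead homestead-2.1.0.box

Note that a box added from a local file has no version metadata attached, so a Vagrantfile that pins a box version may need that constraint relaxed.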

Amazon S3 file read timeout. Trying to download a file using Java

I'm new to Amazon S3. I get the following error when trying to access a file from Amazon S3 using a simple Java method.
2016-08-23 09:46:48 INFO request:450 - Received successful response:200, AWS Request ID: F5EA01DB74D0D0F5
Caught an AmazonClientException, which means the client encountered an
internal error while trying to communicate with S3, such as not being
able to access the network.
Error Message: Unable to store object contents to disk: Read timed out
The exact same lines of code worked yesterday: I was able to download 100% of a 5 GB file in 12 minutes. Today I'm in a better-connected environment, but only 2% or 3% of the file is downloaded before the program fails.
Code that I'm using to download:
s3Client.getObject(new GetObjectRequest("mybucket", file.getKey()), localFile);
You need to set the connection timeout and the socket timeout in your client configuration.
Click here for a reference article
Here is an excerpt from the article:
Several HTTP transport options can be configured through the com.amazonaws.ClientConfiguration object. Default values will suffice for the majority of users, but users who want more control can configure:
Socket timeout
Connection timeout
Maximum retry attempts for retry-able errors
Maximum open HTTP connections
Here is an example on how to do it:
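(The original snippet is not reproduced above, so below is a minimal sketch of the same idea with the AWS SDK for Java v1; the timeout values, region, bucket, key and local path are illustrative assumptions.)

import java.io.File;

import com.amazonaws.ClientConfiguration;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;

public class S3DownloadWithTimeouts {
    public static void main(String[] args) {
        // Tune the HTTP transport before building the client.
        ClientConfiguration clientConfig = new ClientConfiguration()
                .withConnectionTimeout(10_000)     // ms allowed to establish the connection
                .withSocketTimeout(5 * 60_000)     // ms of socket inactivity before "Read timed out"
                .withMaxErrorRetry(5)              // retry attempts for retry-able errors
                .withMaxConnections(50);           // maximum open HTTP connections

        AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
                .withClientConfiguration(clientConfig)
                .withRegion(Regions.US_EAST_1)     // region is an assumption
                .build();

        // Same call shape as in the question, now backed by the tuned client.
        s3Client.getObject(new GetObjectRequest("mybucket", "path/to/large-file.bin"),
                new File("/tmp/large-file.bin"));
    }
}

A longer socket timeout gives slow reads more room before the client gives up, while the retry setting covers transient failures.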
Downloading files >3Gb from S3 fails with "SocketTimeoutException: Read timed out"

Apache upload failed when file size is over 100k

Below is some information about my problem.
Our Apache 2.2 runs on a Windows 2008 server.
Basically, the problem is that users fail to upload any file bigger than 100 kB to our server.
The error in Apache log is: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. : Error reading request entity data, referer: ......
A few times (not always) I could upload larger files (100 kB-800 kB; 20 MB still failed) in Chrome. In FF4 uploading any file over 100 kB always fails, and IE8 behaves similarly to FF4.
It seems that the server fails to read the request from the client, so I reset Timeout in the Apache configuration to the default value (300), which did not help at all.
I do not have the LimitRequestBody directive set and I am not using PHP. Has anyone seen a similar error before? I'm not sure what to try next; any advice would be appreciated!
Edit:
I just tried using Remote Desktop to upload files on the server itself, and it worked fine. My first thought was the firewall, which however is off all the time; an HTTP proxy is in place, though.