How to download files with Scrapy?

I wonder what techniques you would use when, say, a page contains links to 6 videos of 300 MB each and you want to download them all. Should I write my own custom downloader?
I'm used to using MediaPipeline, but it utilizes the framework scheduler, which has the following issues:
You never know which file is currently being downloaded
You have no idea of the download progress/state until it fails
Strange timeout behaviour:
a) The timeout seems to apply to the whole download operation, not only to pauses in the download. So, say, with a timeout of 5 minutes I will never be able to download a file that takes 6 minutes. b) If you make 5 concurrent long requests and one of them takes too long, all of them (not yet complete) will time out. You have to limit the number of concurrent requests to 1 in the settings (which affects the whole spider).
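For reference, the two knobs these complaints revolve around are real Scrapy settings; a minimal settings.py sketch (the values here are illustrative, not recommendations):

```python
# settings.py (Scrapy) -- illustrative values
DOWNLOAD_TIMEOUT = 600     # seconds; applies to the whole download, not just stalls
CONCURRENT_REQUESTS = 1    # global limit, so it throttles the entire spider
```

Because both settings are global, raising the timeout or lowering concurrency for big files also changes behaviour for every other request the spider makes.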

You can make use of the YouTube downloader (youtube-dl) after having retrieved the links to the videos.
youtube-dl will try to resume if a video has not finished downloading, and you can also force it to continue. Write a wrapper around it for concurrency if single downloads take too long.
Disclaimer: I am not in any way affiliated with the maintainers of this package.
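A minimal sketch of the concurrency wrapper suggested above, assuming the `youtube-dl` binary is on the PATH (the `runner` parameter is only there so the download step can be swapped out or stubbed):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def download_all(urls, max_workers=3, runner=subprocess.run):
    """Download each URL with youtube-dl, a few at a time."""
    def fetch(url):
        # --continue tells youtube-dl to resume an interrupted download
        return runner(["youtube-dl", "--continue", url]).returncode

    # A small pool bounds how many downloads run at once
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

Since each download runs as its own subprocess, a stalled video only ties up one worker rather than timing out its neighbours.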

Related

Is it possible to stream the output of an ffmpeg command to a client with dot net core?

I'm trying to take two videos and transform them with ffmpeg into a single video. It works great if you take the two videos, run them through ffmpeg, and then serve the resulting file up via an API. Unfortunately, the upper range for these videos is ~20 minutes, and this method takes too long to create the full video (~30 seconds with the ultrafast preset).
I had an idea to stream the output of the ffmpeg command to the client, which would eliminate the need to wait for ffmpeg to create the whole video. I've tried to prove this out myself and haven't had much success. It could be my inexperience with streams, or this could be impossible.
Does anyone know if my idea to stream the in-progress output of ffmpeg is possible / feasible?
You should check out Hangfire. I used it for running the process in the background, and if you need notifications, SignalR will help you.
What do you mean by "streaming"? Serving the result of your command to an HTTP client on the fly? Or is your client some video player that plays the video (like a VLC player receiving a TCP stream from 4 IP cameras)?
Dealing with video isn't a simple task, and you need to choose your protocols, tools, and even hardware carefully.
Based on the command that you sent as an example, you probably need some jobs that convert your videos.
Here's a complete article on how to use Azure Batch to do the processing with ffmpeg. You can use any batching solution if you want (another answer suggests Hangfire, and that's OK too).

Why do I get many SSE request that slow down my webpage in my NUXT.js project?

I have a project implemented with Nuxt.js (SSR mode). Every time I refresh a page, I get three or four SSE requests (like _loading/sse) in the network console. Those SSE requests are slow and eventually fail, and they make page loading slow on my computer (I run the whole project locally).
Does anyone know what those SSE requests are and how to get rid of them?
What you're referring to is the loading screen, and it looks like there must be something in your app which is firing many requests and takes quite a while to render or fail.
You need to check through your app code to find what generates those requests and where they might fail.

RestKit network limit blocks other calls when parallel requests are running

We are facing a problem.
We have background requests that are constantly downloading files (up to 5 MB each). Meanwhile, we have a UI where most navigations require REST calls.
We limited the number of background downloads so they won't saturate the operation queue that RestKit uses.
When several files are downloading in the background, we see network usage of 1-2 MB (which is understandable).
The problem: the user navigates through the app, and each navigation makes a quick REST call that should return very little data, but because of the background downloads, the UI call takes forever (~10 seconds).
Priority did not help; I saw that the UI call I make is instantly handled by the operation queue (because we limited the download count, the NSOperationQueue had room to fulfill other requests).
When we limited the concurrent background download calls to 5, the REST calls from the UI took 10 seconds.
When we limited them to 2, everything worked fine.
The issue is that with only 2 downloads running in the background, the whole background file-download operation takes forever.
The best scenario would be for every UI call to be treated as the most important network-wise, even pausing the background operations so only the UI call is handled, then resuming the background work, but I'm not sure that's possible.
Any other ideas to address this issue?
You could use two RKObjectManagers so that you have two separate queues, then use one for 'UI' and the other for 'background'. On top of that, you can set different concurrency limits for each queue, and you can suspend the background queue. Note that suspending the queue doesn't mean already-running operations are paused; it just stops new operations from being started.
By doing this you can gain some control, but the better options really are to limit the data flow, particularly when running on a mobile data network, and to inform the user what is happening so they can accept the situation or pause it till later.

Best way to start reliable background downloads immediately on iOS 7

I am a bit confused about the different options to handle file downloads on iOS.
I want to be able to handle more than 2,000 downloads at a time, so some kind of parallelism would be nice
I want a download to be started immediately when fired
I want a download not to be paused or stopped when sending the app to the background
The concrete scenario is a bunch of downloads triggered right after the user logs into the application. Here I have to download that many files, which are mainly small images.
Currently I am using NSURLSessionConfiguration's defaultSessionConfiguration, but this way the downloads get paused when the user suspends the app (which is likely, as the full process takes a few minutes).
NSURLSessionConfiguration's backgroundSessionConfiguration seems to be the better way to go, but I am seeing delays of up to 20-30 seconds after initialization before anything happens. That's probably okay for some scenarios, but not for mine.
So is there a workaround for that delay? Otherwise I will probably go the "old way" and just download the files in background threads on my own, with the 10-minute limitation.

Long polling blocking multiple windows?

Long polling has solved 99% of my problems. There is now just one other problem. Imagine a penny-auction site where people bid. On the front page, there are several auctions.
If the user opens three of these auctions, and since JavaScript is not multithreaded, how would you get the other pages to ever load? Won't they always get bogged down waiting for the long poll to end? In practice I've experienced this, and I can't think of a way around it. Any ideas?
There are two ways that JavaScript gets around some of this.
While JavaScript is conceptually single-threaded, it does its I/O in separate threads using completion handlers. This means other pieces of JavaScript can run while you are waiting for your network request to complete.
JavaScript on each page (or even each frame in each page) is isolated from JavaScript on other pages/frames. This means each copy of JavaScript can run in its own thread.
A bigger issue for you is likely that browsers often limit the number of concurrent connections to a given site, and it sounds like you want to make many concurrent connections to the same site. In that case you will get a lock-up.
If you control both the server and the client, you will need to combine the multiple long-poll requests from the client into a single long-poll request to the server.
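The combining idea can be sketched in Python: instead of one pending request per auction, a single long-poll covers all open auctions. The `/poll` endpoint and its `auctions` parameter are hypothetical; any server that blocks until one of the listed auctions changes would do.

```python
import json
import urllib.request

def build_poll_url(base_url, auction_ids):
    """One URL covering every open auction, so only one connection is held open."""
    return base_url + "/poll?auctions=" + ",".join(str(i) for i in auction_ids)

def poll_auctions(base_url, auction_ids, timeout=60):
    """Single long-poll; the server answers when any listed auction changes."""
    url = build_poll_url(base_url, auction_ids)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read())
```

With one multiplexed request per client, the browser's per-site connection limit is no longer exhausted by several open auction pages.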