Screen Scraping with HTTP Headers Issue - I Think

I've been trying to figure this one out for about a week now and just
can't come up with a good solution. So, I figured I would see if anyone could help me out. Here's one of the links that I'm trying to scrape:
http://content.lib.washington.edu/cdm4/item_viewer.php?CISOROOT=/alaskawcanada&CISOPTR=491&CISOBOX=1&REC=4
I right-clicked to copy image location.
This is the link that is copied:
http://content.lib.washington.edu/cgi-bin/getimage.exe?CISOROOT=/alaskawcanada&CISOPTR=491&DMSCALE=100.00000&DMWIDTH=802&DMHEIGHT=657.890625&DMX=0&DMY=0&DMTEXT=%20NA3050%20%09AWC0644%20AWC0388%20AWC0074%20AWC0575&REC=4&DMTHUMB=0&DMROTATE=0
There is no clear image URL being displayed; evidently the image is generated by some type of server-side script. Through trial and error I found that I can append ".jpg" after "CISOPTR=491" and the link becomes an image URL, but it is not the high-resolution version of the image. To get the high-resolution version I have to change the URL even more. I found a lot of posts on Stack Overflow that mention building a script using cURL and PHP, and I have even tried a few of them, with no luck. "491" is the image number, and I can change that number to find other images in the same directory, so scraping a sequence of numbers should be pretty easy. But I'm still a noob at scraping and this one is kicking my butt. Here's what I've tried:
Get remote image using cURL then resample
I also tried this:
http://psung.blogspot.com/2008/06/using-wget-or-curl-to-download-web.html
I also have OutWit Hub and SiteSucker, but they don't recognize the URL as an image file, so they just pass right over it. I ran SiteSucker overnight and it downloaded 40,000 files, of which only 60 were JPEGs, none of them the ones I wanted.
The other thing I keep running into is that for the files I have been able to download manually, the filename is always getfile.exe or showfile.exe; if I manually add ".jpg" as the extension, I can view the image locally.
How can I reach the original high-res image file and automate the download process so that I can scrape a couple hundred of these images?

"I right-clicked to copy image location. This is the link that is copied:"
Notice the link has ".exe" in it. Now look at the parameters in the query string:
DMSCALE=100.00000
DMWIDTH=802
DMHEIGHT=657.890625
DMX=0
DMY=0
DMTEXT=%20NA3050%20%09AWC0644%20AWC0388%20AWC0074%20AWC0575
REC=4
DMTHUMB=0
DMROTATE=0
This strongly implies that the original source of the image sits in a database and is being passed through a server-side filter (presumably that is what you meant by "some type of script"). In other words, this is dynamically generated content, not static, and the same caveats apply as for dynamic text content: you have to figure out what instructions to give the server so it coughs up what you want. You pretty much have them in front of you. If SiteSucker or similar tools won't deal with it properly, scrape the address yourself using an HTML parser.
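For example, since "491" is just a sequence number, a short script can request getimage.exe directly for each CISOPTR value with the scale forced to 100%. Below is a minimal Python sketch using the requests library; it is untested against this server, and the DMWIDTH/DMHEIGHT values and the CISOPTR range are assumptions you would need to adjust:

    # Sketch: pull a sequence of CONTENTdm images at full scale.
    # Assumption: DMSCALE=100 with a generous DMWIDTH/DMHEIGHT returns the
    # high-resolution rendition -- verify against one known image first.
    import requests

    BASE = "http://content.lib.washington.edu/cgi-bin/getimage.exe"

    for num in range(480, 700):  # assumed range of CISOPTR image numbers
        params = {
            "CISOROOT": "/alaskawcanada",
            "CISOPTR": num,
            "DMSCALE": "100.00000",
            "DMWIDTH": "4000",   # assumption: large enough to avoid downscaling
            "DMHEIGHT": "4000",
            "DMX": "0",
            "DMY": "0",
        }
        resp = requests.get(BASE, params=params, timeout=60)
        # The server names the download getfile.exe/showfile.exe, but the
        # bytes are still a JPEG, so save them with a .jpg extension.
        if resp.ok and resp.content[:2] == b"\xff\xd8":  # JPEG magic bytes
            with open(f"awc_{num}.jpg", "wb") as f:
                f.write(resp.content)

If the direct-parameter approach fails, fall back to fetching each item_viewer.php page and extracting the getimage.exe URL with an HTML parser before rewriting its parameters.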

Related

How to save all images from a website using web scraping or a macro

In my e-commerce job I have to save images from manufacturers' sites to upload them to my client's webpage. Sometimes, when a product like the Apple iPhone 6 comes in, there are hundreds of images in the product description, so I want to save all the images from the manufacturer's URL with one click.
For example:
URL: http://www.gsmarena.com/samsung_galaxy_a9_%282016%29-pictures-7641.php
It has 7 images, and I want all of them downloaded in a single click.
Image links:
http://cdn.gsmarena.com/imgroot/reviews/15/apple-iphone-6s/-347x151/thumb.jpg
http://cdn2.gsmarena.com/vv/bigpic/samsung-galaxy-a9-2016-.jpg
http://cdn2.gsmarena.com/vv/pics/samsung/samsung-galaxy-a9-2016-6.jpg
http://cdn2.gsmarena.com/vv/pics/samsung/samsung-galaxy-a9-2016-7.jpg
http://cdn2.gsmarena.com/vv/pics/samsung/samsung-galaxy-a9-2016-5.jpg
http://cdn2.gsmarena.com/vv/bigpic/samsung-galaxy-a8-.jpg
Maybe this way?
Image Picker Plugin for Firefox
It seems to do the same job without any separate installation (beyond installing the plugin itself).
Not sure if this is exactly what you are looking for, but I tried it and found an easy option for you.
Use the tool: jdownloader2
Copy your desired URL and paste it into the link collector. It automatically gathers all the pictures on that page and downloads them for you. All images with a single click...
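If you'd rather script it than click through a GUI tool, a short Python sketch along these lines saves every <img> a page references. The User-Agent header is an assumption; some sites refuse requests that don't look like a browser:

    # Sketch: download every image referenced by <img> tags on a page.
    # Requires: pip install requests beautifulsoup4
    import os
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    page_url = "http://www.gsmarena.com/samsung_galaxy_a9_%282016%29-pictures-7641.php"
    out_dir = "images"
    os.makedirs(out_dir, exist_ok=True)

    headers = {"User-Agent": "Mozilla/5.0"}  # assumption: needed to avoid blocking

    html = requests.get(page_url, headers=headers, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    for img in soup.find_all("img"):
        src = img.get("src")
        if not src:
            continue
        img_url = urljoin(page_url, src)  # resolve relative paths
        name = os.path.basename(img_url.split("?")[0]) or "image.jpg"
        data = requests.get(img_url, headers=headers, timeout=30).content
        with open(os.path.join(out_dir, name), "wb") as f:
            f.write(data)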

How to get the video file for a movie currently playing in browser?

So I have a youtube page open where I can watch a video.
But this video was taken down by the user. My already-open page still has the video; if you go to the URL again (or refresh), the new page does not.
Since I have the video loaded in my browser tab (chrome), how can I go about finding the actual file and saving it?
In the old days of YouTube, it may have been possible to find the single video file on your hard drive and save it, but this is no longer the case. As explained in this Computerphile video, all YouTube videos are now split into tiny pieces and downloaded piece by piece.
You can observe this for yourself if you open up Chrome's (or Firefox's) Dev Tools and watch the Network tab: you'll see all of the pieces of the video loading bit by bit.
One additional thing you'll learn from the Network tab is that the videos are downloaded as octet streams, so you won't be able to find the links to the pieces hidden in the DOM.
One thing you might try: in the Network tab, clear the results and then move the cursor to the beginning of the video. You should see the streams come up again. Right-click on the path name, choose "Save as", and save each piece as 0000.mp4, 0001.mp4, and so on. You should be able to reassemble these pieces in any video editing software. I tested this by getting two pieces from a random YouTube video.
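If you do save the pieces with sequential names like that, reassembly can be as simple as binary concatenation, since the pieces are typically consecutive byte ranges of one stream. A minimal sketch (the 0000.mp4 naming follows the suggestion above; if the player switched quality mid-playback the pieces won't join cleanly):

    # Sketch: glue sequentially saved pieces (0000.mp4, 0001.mp4, ...) back
    # into a single file. Assumes the pieces are consecutive byte ranges of
    # the same stream -- worth verifying on a couple of pieces first.
    import glob

    pieces = sorted(glob.glob("[0-9][0-9][0-9][0-9].mp4"))

    with open("reassembled.mp4", "wb") as out:
        for piece in pieces:
            with open(piece, "rb") as f:
                out.write(f.read())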
I couldn't find anything that doesn't require a restart (and hence reload) of Chrome.
One (kludgy) hack if possible, though, would be to run a screen video capture and play the video.
I did this long ago using IE6: fetch the file from the temporary files location and rename it with the .flv extension.
The following links should point you in the right direction, but I can't say for sure that they will work, as recent Chrome versions seem to have a more defensive cache implementation.
Ubuntu Forum solutions
You might need to tweak the above for your use.
Run a screen recording/capture program such as:
Screenr
CamStudio
Then edit out the YouTube bar if it's visible.
The buffered video is cached at the following location:
C:\Users\<user name>\AppData\Local\Temp\flaxxxx.tmp
Note that you have to change <user name> to the user you are logged in as, and xxxx is a random number. Also, the .tmp file might be hidden, so make sure Windows Explorer is set to display hidden files.
While the tab is open you won't be able to copy the file, but if you close the tab, the file is automatically deleted. To copy the file while it is in use, download HoboCopy, extract it, and run cmd as administrator. Change the console directory to the one where you extracted HoboCopy and type the following command:
hobocopy C:\users\<user name>\Appdata\Local\Temp C:\videos fla1234.tmp
<user name> - replace with your windows username
C:\videos - the directory where you want the video to be copied to
fla1234.tmp - the name of the file to be copied.
Wait for the copy to be done and then you can rename the destination file, changing '.tmp' to '.flv'. This file can be played with any FLV supporting media player.
I found this piece of software, which gets the video from the temp files folder and plays it: http://www.nirsoft.net/utils/video_cache_view.html
The video file is cached, so the approaches suggested above can help you save it. But if you keep hitting the same problems, I suggest using IDM (Internet Download Manager). Once installed, for every online video stream (e.g., all FLV files on YouTube) IDM shows a small overlay you can click, and the download starts automatically with no configuration needed.
You have to install a browser extension to download YouTube videos. You won't find a simple URL for an mp4 file in the HTML source. Try googling "youtube downloader" + your browser name.
As far as I recall, YouTube videos are not served as a continuous HTTP resource, but instead divided into small chunks and assembled client-side by the Flash player. This is why you can jump into the middle of a video, without having to buffer the first half of the video.
Generally speaking, YouTube doesn't want you to rip their content, so they aren't exactly making it easy for downloaders.

Displaying PDF on website using pdf.js

I want to put a file sample.pdf on my website and have it displayed using pdf.js. What I want is to display my own file like the demo, with a toolbar, zooming in/out, etc. So far I haven't been able to do that.
I did check out the helloworld example, but it simply shows the file like an image, without a toolbar, zooming in/out, etc. When I put another file with many pages in place of helloworld.pdf, it only shows the first page.
I am not quite sure what you are looking for, but I was able to get this working exactly like the demo. Although you may not want to use that example viewer for your project, you can use its working code as a starting point for your own requirements.
For a simple test you can just clone the project somewhere under a web server into a directory like myproject and visit http://yourservername.com/myproject/web/viewer.html. You should see the pdf appear. This can be a starting point to working with this project. I did this running a very basic Apache server on Linux.
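If you don't have Apache handy, Python's built-in static server is enough for the same quick test. A minimal sketch; the pdf.js directory name and the sample.pdf location are assumptions, and the file query parameter is how the prebuilt viewer selects a document:

    # Sketch: serve the directory containing your pdf.js clone, then open
    # http://localhost:8000/pdf.js/web/viewer.html?file=/sample.pdf
    # (assumes sample.pdf sits in the directory you serve from).
    import http.server
    import socketserver

    PORT = 8000
    handler = http.server.SimpleHTTPRequestHandler
    with socketserver.TCPServer(("", PORT), handler) as httpd:
        print(f"Serving at http://localhost:{PORT}/")
        httpd.serve_forever()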
If you are not looking for an example styled like the demo above, you can also see this jsbin from the docs, which shows how to build something completely customized, with working next/previous buttons to move between pages (you mentioned you were only seeing the first page).
As a note, it seems that this library does not work properly with Safari. You can see an issue about it here. Unfortunately this makes it unusable for me now as I need to support all current browsers.
Also, remember to watch for the warnings concerning CORS.

In any web site, is the image always downloaded in the background?

Just to confirm: is an image always downloaded on another thread, different from the thread that loads the page text?
When I put a reference to an image on the internet in my page, all the text always shows up first.
What do you think?
The HTML file contains all the prose and merely refers to the pictures, so however the threads are arranged, the text is downloaded first. Whether it is rendered before the pictures finish downloading is up to the user agent, and different UAs may or may not behave the same in this respect.
It depends on the browser and the website. In most cases the browser loads the "main HTML" first, which contains the references to the pictures and other resources.
If the website loads most of its text content via AJAX, it could work somewhat the other way around.
But in most cases you are right.

UNC Linking to Network Share in Chrome

I have a UNC link, something like the one below, on an ASP.NET page; it points to a network share location. The link works perfectly in IE (surprisingly), and even in Chrome and Firefox if I copy/paste it into the address bar, but as rendered the link is completely broken: I can't even right-click it to copy the link. I know this is a known issue that was supposed to have been fixed several versions ago, but I still need a workaround.
I've been looking into adding a "Content-Disposition: attachment; filename=sample.pdf" header, but I don't know how to reference the actual file because the link still doesn't work relative to the server. It keeps trying to save the .aspx page rather than the PDF. Ideas? I would LOVE help on this. Thanks ;)
    <a href="\\Server\AppShares\Files\sample.pdf">Download</a>
I'm actually building the link in the VB.NET code-behind, but I can't even get it to work with a statically defined link. What gives?
The reason "\\Server\AppShares\Files\sample.pdf" is not accepted is that it is not a valid URI, so it cannot be used in an HTML link. The correct format is file://Server/AppShares/Files/sample.pdf. The guide at http://blogs.msdn.com/b/ie/archive/2006/12/06/file-uris-in-windows.aspx covers translating UNC paths to URIs and back again.
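If you are building these links in code, the translation is mechanical. As a quick illustration of the mapping (in Python here, though the code-behind above is VB.NET, so treat this purely as a demonstration), pathlib performs exactly this UNC-to-file-URI conversion:

    # Demonstration: translating a UNC path into a file:// URI.
    from pathlib import PureWindowsPath

    unc = r"\\Server\AppShares\Files\sample.pdf"
    print(PureWindowsPath(unc).as_uri())
    # prints: file://Server/AppShares/Files/sample.pdf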