How do you do real-time document tracking? - pdf

I was considering different document tracking options and came across DocTracking.com. DocTracking.com allows you to upload documents (PDF, Word, etc.), adds some kind of invisible tracking to them, and returns the document to you, which can then be used just as you would use it otherwise. This tracking tells you when your documents were opened, who opened them (IP address), the geo-location of the opening, whether they were re-opened or forwarded, what pages were read and for how long, and what was printed. Any leads on how this could be done would be appreciated.

It may be possible to embed an internet control/image into the document so that the client application sends an HTTP request to the server, which can use the request to identify the document and link it to the IP address (and, consequently, the location). Various degrees of logic can be embedded for more sophistication.
Of course, this won't work if someone's Word, let's say, is firewalled.
That's one of the ways I can think of to make something like doctracking.com's system work. I doubt that there is anything built into Acrobat or Word to let a document stream data from the host PC; that seems massively unsafe.
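As a rough illustration of that remote-image idea, here is a minimal sketch in Python of a server that logs whoever fetches a tracking image embedded in a document. Everything specific here is an assumption for the example: the /px endpoint, the doc query parameter, and port 8000 are invented, and a real service would add storage, IP geolocation, and HTTPS.

# Minimal "tracking pixel" sketch: the document embeds an image such as
# http://tracker.example.com:8000/px?doc=contract-42, and every time a viewer's
# application fetches it we log the document ID, client IP and User-Agent.
import datetime
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

# A 1x1 transparent GIF, so the embedded "image" renders as nothing visible.
PIXEL = (b"GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00\x00\x00\x00!"
         b"\xf9\x04\x01\x00\x00\x00\x00,\x00\x00\x00\x00\x01\x00\x01\x00"
         b"\x00\x02\x02D\x01\x00;")

class TrackerHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        doc_id = query.get("doc", ["unknown"])[0]
        # A real service would store this and resolve the IP to a geo-location.
        print(f"{datetime.datetime.utcnow().isoformat()} doc={doc_id} "
              f"ip={self.client_address[0]} ua={self.headers.get('User-Agent')}")
        self.send_response(200)
        self.send_header("Content-Type", "image/gif")
        self.send_header("Content-Length", str(len(PIXEL)))
        self.end_headers()
        self.wfile.write(PIXEL)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), TrackerHandler).serve_forever()

As noted above, this only works when the viewing application actually fetches remote content and is not blocked by a firewall or by a "don't load external content" setting in the reader.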

Related

Search keyword in PDF and check if exists

The idea was to be able to, as soon as I receive a mail with a PDF attached, find a way in which the PDF can be downloaded and searched for a specific keyword (for instance, to see if my name is in it), and if my name is on any of the pages of the PDF, then send another mail notifying the user that there is a PDF in which he has been named.
This is in order to avoid having to check dozens of mails and PDFs daily just to see whether my name is in them or not.
I managed to do this using Zapier but I relied on PDFco’s API for the search, and it is payware, so I’m taking a different approach.
My question is really about which library could perform that search inside the PDF and return a Boolean value indicating whether the keyword exists or not.
Thank you!
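One possible approach, assuming the PDFs contain extractable text (not scanned images) and that the pypdf library is acceptable, is sketched below; the file name and keyword are placeholders:

# Sketch: report whether a keyword appears anywhere in a PDF.
# Assumes pypdf is installed and the PDF has extractable text (no OCR is done).
from pypdf import PdfReader

def keyword_in_pdf(path: str, keyword: str) -> bool:
    reader = PdfReader(path)
    for page in reader.pages:
        text = page.extract_text() or ""
        if keyword.lower() in text.lower():
            return True
    return False

print(keyword_in_pdf("attachment.pdf", "John Doe"))  # True if the name appears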

Externally triggering Thunderbird into displaying a wanted message

I would like to have a way to trigger Thunderbird, from an external script, into displaying a particular message in a particular folder.
If it were Firefox, say, I would use firefox -new-tab http://some-URL, and an already running Firefox (or a new one if none is running) would nicely fetch and display the URL. But I have found no way to do something equivalent with Thunderbird, neither on the Thunderbird site nor through existing extensions, even after some furious Googling around, which I attempted more than once!
One problem, compared to a plain URL, is the need for some notation for selecting a message. Short of a better solution, I wrote a script which understands folder:SOME-FOLDER:ORDINAL and behaves like an extension of xdg-open. My tool inserts a proper prefix and a few .sbd components as needed within the SOME-FOLDER part to turn it into an absolute Thunderbird file reference, and ORDINAL picks a message in that folder. My tool then grabs the message, heuristically converts it into an HTML file, and then directs a Web browser to the resulting file (and if :ORDINAL is not given, it processes the whole folder instead, yielding an HTML index and many linked messages).
My current tool helps a bit at saving message references in other documents and efficiently retrieving them later, but I end up handling a copy of the Thunderbird message, not the original. So if I want to delete it, refile it in another Thunderbird folder, or do other similar operations, I still have to go to Thunderbird and interactively find my way again to the wanted message before I can handle it, and this is not efficient. What I'm dreaming of is a way to get rid of all my HTML conversion and browser trickery, while keeping the pseudo-URL paradigm and the pseudo xdg-open interface, to directly force Thunderbird into the correct folder, with the wanted message correctly displayed.
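For illustration, the kind of lookup I mean could be sketched roughly as follows, assuming the folder is a plain mbox file under the Thunderbird profile (the profile path, folder name and ordinal below are only placeholders):

# Sketch: resolve folder:SOME-FOLDER:ORDINAL to the ORDINAL-th message of a
# Thunderbird mbox folder.  Nested folders would need ".sbd" directory components
# inserted; the profile path here is a placeholder.
import mailbox
from pathlib import Path

PROFILE = Path.home() / ".thunderbird" / "xxxxxxxx.default" / "Mail" / "Local Folders"

def get_message(ref: str) -> mailbox.mboxMessage:
    _, folder, ordinal = ref.split(":")              # e.g. "folder:archive:12"
    box = mailbox.mbox(str(PROFILE / folder))        # e.g. .../Local Folders/archive
    return list(box)[int(ordinal)]

msg = get_message("folder:archive:12")
print(msg["Subject"], msg["Message-ID"])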
In previous email readers I used (Emacs RMAIL and then Gnus, and Mutt as well later), such things could be managed, and I heavily used such capabilities in scripts. I am astonished, surprised, even a bit dismayed, by the apparent weakness of Thunderbird as a scriptable mail reader. Am I missing something evident? Any avenue or suggestion?
François
P.S. Of course, I agree that using ORDINAL is not very clever. It might point to a different message if the folder gets some messages added or deleted. This is the lesser evil. A better but potentially heavier notation might use Message-ID values, but then an index would also be needed to find the Thunderbird folder containing each message.
There seems to be some way to do it, since Google Desktop supported it according to this thread: http://forums.mozillazine.org/viewtopic.php?f=39&t=584542. Perhaps try installing Google Desktop and see what kind of hyperlink it's using?
I'll add that Outlook supports external hyperlinks using the outlook: naming scheme, for example outlook:Inbox or outlook:0000000007A2379547B0624691F4FB2E5468A0D7642E2000. See http://www.davidtan.org/create-hyperlinks-to-outlook-messages-folders-contacts-events/ for more info.

Figure out if a website has restricted/password protected area

I have a big list of websites and I need to know if they have areas that are password protected.
I am thinking about doing this: downloading all of them with httrack and then writing a script that looks for keywords like "Log In" and "401 Forbidden". But the problem is that these websites are all different (some static, some dynamic: HTML, CGI, PHP, Java applets...), and most of them won't use the same keywords...
Do you have any better ideas?
Thanks a lot!
Looking for password fields will get you so far, but it won't help with sites that use HTTP authentication. Looking for 401s will help with HTTP authentication, but won't get you sites that don't use it, or ones that don't return a 401. Looking for "log in" links or "username" fields will get you some more.
I don't think that you'll be able to do this entirely automatically and be sure that you're actually detecting all the password-protected areas.
You'll probably want to take a library that is good at web automation and write a little program yourself that reads the list of target sites from a file, checks each one, and writes one file of "these are definitely passworded" and another of "these are not". Then you might want to manually check the ones that are not and modify your program to accommodate them. Using httrack is great for grabbing data, but it's not going to help with detection: if you write your own "check for password-protected area" program in a general-purpose high-level language, you can do more checks, and you can avoid generating more requests per site than necessary to determine that a password-protected area exists.
You may need to ignore robots.txt
I recommend using the Python port of Perl's Mechanize, or whatever nice web-automation library your preferred language has. Almost all modern languages have a nice library for opening and searching through web pages and for looking at HTTP headers.
If you are not capable of writing this yourself, you're going to have a rather difficult time using httrack or wget or similar and then searching through responses.
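A minimal sketch of such a checker, assuming the Python requests library and only two of the heuristics mentioned above (an HTTP 401 status or a password input field on the home page); the file names are placeholders:

# Sketch: classify sites as "passworded" or "not passworded" using two heuristics:
# an HTTP 401 response, or a password <input> field in the home page HTML.
# sites.txt holds one URL per line; results go to two output files.
import re
import requests

PASSWORD_FIELD = re.compile(r'<input[^>]+type=["\']?password', re.IGNORECASE)

def looks_protected(url: str) -> bool:
    resp = requests.get(url, timeout=10)
    if resp.status_code == 401:
        return True
    return bool(PASSWORD_FIELD.search(resp.text))

with open("sites.txt") as f, \
     open("passworded.txt", "w") as yes, open("not_passworded.txt", "w") as no:
    for url in (line.strip() for line in f if line.strip()):
        try:
            (yes if looks_protected(url) else no).write(url + "\n")
        except requests.RequestException:
            no.write(url + "  # unreachable\n")

A real run would add the extra heuristics discussed here (login links, following a couple of levels of internal links, and so on), but the read-check-write skeleton stays the same.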
Look for forms with password fields.
You may need to scrape the site to find the login page. Look for links with phrases like "log in", "login", "sign in", "signin", or scrape the whole site (needless to say, be careful here).
I would use httrack with several limits and then search the downloaded files for password fields.
Typically, a login form can be found within two links of the home page. Almost all e-commerce sites, web apps, etc. have login forms that are reached by clicking a single link on the home page, but going another layer or even two deep would almost guarantee that you don't miss any.
I would also limit the speed at which httrack downloads, tell it not to download any non-HTML files, and prevent it from downloading external links. I'd also limit the number of simultaneous connections to the site to 2 or even 1. This should work for just about all of the sites you are looking at, and it should keep you off the hosts.deny list.
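Once the mirror is on disk, the "search the downloaded files for password fields" step could be as simple as the following sketch (the mirror directory name is a placeholder):

# Sketch: walk an httrack mirror on disk and flag sites whose HTML contains a
# password input field.
import os
import re

PASSWORD_FIELD = re.compile(r'<input[^>]+type=["\']?password', re.IGNORECASE)

def mirror_has_login_form(mirror_dir: str) -> bool:
    for root, _dirs, files in os.walk(mirror_dir):
        for name in files:
            if name.lower().endswith((".html", ".htm", ".php")):
                with open(os.path.join(root, name), errors="ignore") as page:
                    if PASSWORD_FIELD.search(page.read()):
                        return True
    return False

print(mirror_has_login_form("mirrors/www.example.com"))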
You could just use wget and do something like:
wget -A html,php,jsp,htm -S -r http://www.yoursite.com 2> output_yoursite.txt
This will cause wget to download the site recursively, but only keep files with the extensions listed in the -A option; in this case we try to avoid heavy files. Note that wget writes the -S header output to its log on stderr, hence the 2> redirection.
The headers will be directed to the file output_yoursite.txt, which you can then parse for the status code 401, which means that part of the site requires authentication; also parse the files themselves according to Konrad's recommendation.
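A quick way to pull the 401s back out of that log might look like this (a sketch; wget prints server headers as indented "HTTP/1.1 401 ..." lines, though the exact format can vary between versions):

# Sketch: scan a "wget -S" log for 401 responses.
with open("output_yoursite.txt", errors="ignore") as log:
    for line in log:
        line = line.strip()
        if line.startswith("HTTP/") and " 401" in line:
            print("401 response seen:", line)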
Looking for 401 codes won't reliably catch everything, as sites might not produce links to anything you don't have privileges for. That is, until you are logged in, a site won't show you anything you need to log in for. OTOH some sites (ones with all static content, for example) manage to pop up a login dialog box for some pages, so looking only for password input tags would also miss stuff.
My advice: find a spider program that you can get the source for, add in whatever tests (plural) you plan on using, and make it stop at the first positive result. Look for a spider that can be throttled way back, can ignore non-HTML files (maybe by making HEAD requests and looking at the MIME type), and can work with more than one site independently and simultaneously.
You might try using cURL and just attempting to connect to each site in turn (possibly put them in a text file and read each line, try to connect, repeat).
You can set up one of the callbacks to check the HTTP response code and do whatever you need from there.
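If you go the cURL route from Python, a sketch with the pycurl bindings (assuming they are installed; sites.txt is a placeholder) could look like this:

# Sketch: read URLs from a file and record the HTTP response code for each using
# pycurl.  A 401 suggests HTTP authentication is in use.
import io
import pycurl

with open("sites.txt") as f:
    for url in (line.strip() for line in f if line.strip()):
        buf = io.BytesIO()
        c = pycurl.Curl()
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEFUNCTION, buf.write)   # collect the body; we mainly want the code
        c.setopt(pycurl.FOLLOWLOCATION, True)
        c.setopt(pycurl.TIMEOUT, 10)
        try:
            c.perform()
            print(url, c.getinfo(pycurl.RESPONSE_CODE))
        except pycurl.error as exc:
            print(url, "error:", exc)
        finally:
            c.close()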

Refresh browser via cron (or not) to a different page on remote request?

I need to display pages in a tutorial fashion. I looked into netsupport, beamyourscreen and other possibilities, but I do not want the viewers to download anything. I cannot use GD / send screenshots due to audio/video instructions embedded in some of the pages.
Basically, I need the ability to "refresh" a user's browser window to a different page via an interface on my end, whether via a form submission, JavaScript or any other type of "controller" that allows me to change the page in the viewer's browser. Perl is preferred, but PHP / JavaScript / whatever works and is cross-browser. I set up a simple JavaScript page-forward timer that "works", but page load times and conversation interruptions are a huge factor.
The entire tutorial website will be developed around this ability.
I was looking into curl / cron / wget methods but found little information.
I have seen forum and chat scripts that basically perform a similar task, but there must be a simple(ish) solution in lieu of hacking up another script to suit my needs.
I do not want others to control the pages either. The site really only needs to be accessible during the tutorial; however, it "could" remain web-accessible as long as user interaction stayed normal unless it is being controlled.
The initial site concept is based on instructing people how to properly introduce new pets into a home. It will be operated by a veterinarian that saved my pet's life. I wanted to give something back.
Possible? I really appreciate simple examples etc...
You have no other way but to keep polling the server for "instructions" using JavaScript. No, you can't push anything to the end user's browser, neither with curl nor with wget.
Mainly, you'll have to set up a simple request/response protocol between the browser and the server.
If you want to go deeper, you can use something like cometd/meteord/etc. If not, a hidden iframe that reloads itself and receives pages with JavaScript code for the needed actions can do the trick.
Another alternative.
With JavaScript polling and a single-character flat file: have a simple one-character flat file holding a single var. Write it in Perl (it is faster and uses fewer resources than PHP). The parent script reads a JavaScript variable from the flat file. It hits the flat file and goes wherever the var sends it. The flat file is written to by the controller. Done.
I guess you could also rename an empty flat file and use that as the controller. I am unsure which is faster: opening and reading a specific file, or hitting the directory and returning the file name. On the controller side, it is opening and writing to a file versus renaming a file. Maybe they counter each other in resources and time?
This way the site can act as a normal site. When you want remote users to see a "presentation" (automatically being shown the site pages at the controller's pace), the controller activates polling and tells the viewers to push a start button. This allows a remote instructor to load pages for the viewers at his leisure.
It is a simple solution with nothing really sophisticated going on. No frames are needed either. Viewers just need JavaScript enabled.
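A minimal sketch of the polling idea, combining the flat-file approach above with the iframe suggestion from the earlier answer, and using Python's standard library for the server side instead of Perl (the port, endpoint names and flat-file name are all illustrative):

# Sketch of flat-file polling: the controller writes a URL into current_page.txt,
# each viewer's browser polls /current once a second, and an iframe is pointed at
# whatever the controller last wrote.  Names and port are placeholders.
from http.server import BaseHTTPRequestHandler, HTTPServer

VIEWER_PAGE = b"""<!doctype html>
<iframe id="view" style="width:100%;height:90vh;border:0"></iframe>
<script>
let last = "";
setInterval(async () => {
  const url = (await (await fetch('/current')).text()).trim();
  if (url && url !== last) { last = url; document.getElementById('view').src = url; }
}, 1000);
</script>
"""

class TutorialHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/current":
            try:
                with open("current_page.txt") as flat:   # written by the controller
                    body, ctype = flat.read().encode(), "text/plain"
            except FileNotFoundError:
                body, ctype = b"", "text/plain"
        else:
            body, ctype = VIEWER_PAGE, "text/html"
        self.send_response(200)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), TutorialHandler).serve_forever()

The controller side is then just "write a URL into current_page.txt"; the pages loaded into the iframe must of course allow being framed.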
Any better suggestions are welcome!
It occurred to me that what you might want to use is HTML push technology. Check out the wiki; it has several links. I have never used it myself.

How to get a pdf to display in a web browser before it's fully downloaded

I have a client that's been struggling with slow loading pdf files on the web.
My client has some very large PDF files that are almost 10 MB. They take upwards of 3-4 minutes to download, and the files will not display until the whole file is loaded.
We and they have seen other sites where the PDFs load one page at a time, so the end user can start looking at the file while the rest of it is still loading in the background. This gives the illusion that the page has loaded faster.
According to the documentation they have seen, IIS 6 should do this automatically if the PDF file is created with "Optimized for fast web view" checked. It is checked, and the file still will not load a page at a time.
They have searched and found nothing other than that IIS will do this automatically if the file is saved correctly.
How can they "stream" the PDF? Is this because the PDFs were saved in a special way? Is it a piece of JavaScript that handles the download? Or is there a change that needs to happen in IIS?
Thanks
Update:
The file starts out like this:
%PDF-1.4
%âãÏÓ
171 0 obj << 0/Linearized 1
Linearized?
The PDF document isn't being served up from an aspx/asp page. (It's just posted directly to the site and linked to).
You need to linearize the PDF and not trust IIS to do this for you.
There are a number of apps that will do this for you. I have used CVision (their compression is second to none, but the licensing and SDK are a pain); there are also some cheaper alternatives here, but I don't know how well they work.
To clarify Tony's point... (I think)
If you have actually used these tools and your pdf is linearized, try converting the PDF to a byte array and Response.Write() the byte array (with content headers, etc) to the client (in a new browser window or frame)
Would it be possible to use a third party service, like Scribd? If you go this route you can embed their streaming viewer onto your client's website. Just a thought, although I know it's not really suitable for every type of business.
This might happen if you are serving the PDF from an aspx page; to get the byte-serving that linearized PDFs need, the file has to be served directly, or you need to implement the byte-serving yourself in the aspx code.
Save one of the files and open it up in a text editor. If you don't see something like
<< /Linearized 1.0 /L <number> /H [<number> <number>] /O <number> /E <number> ...
in the first couple hundred bytes or so, then you're not getting a linearized (i.e., fast web view) PDF.
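A quick way to run that check over a saved file (a sketch; the file name is a placeholder):

# Sketch: check whether a saved PDF looks linearized by scanning its first bytes
# for the /Linearized key, as described above.
with open("document.pdf", "rb") as f:
    head = f.read(1024)
print("looks linearized" if b"/Linearized" in head else "not linearized")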
First, the document needs to be "linearized", as others have explained; you can linearize it in Acrobat or using pdfopt from Ghostscript. Second, the web server must be able to serve byte ranges (i.e., support the Range header); I have no idea how to configure IIS for this, but even if the document is linearized, the client has to be able to request particular byte ranges and have the server honor them.
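One way to check that second condition from the outside is to request a small byte range and see whether the server answers with 206 Partial Content (a sketch, assuming the Python requests library; the URL is a placeholder):

# Sketch: test whether a server honors byte-range requests for a given PDF URL.
# A 206 (Partial Content) status and an "Accept-Ranges: bytes" header indicate that
# the byte-serving needed for page-at-a-time PDF display is available.
import requests

url = "http://www.example.com/largefile.pdf"
resp = requests.get(url, headers={"Range": "bytes=0-1023"}, stream=True)
print("status:", resp.status_code)                        # 206 means ranges are honored
print("Accept-Ranges:", resp.headers.get("Accept-Ranges"))
print("Content-Range:", resp.headers.get("Content-Range"))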