Tracking readership of a PDF document

Background: as some of you know, I publish a business deals listings magazine in PDF format. I am after Google Analytics/Matomo-style tracking of readership of the PDF. (I use Matomo for my websites and prefer it over Google.)
Where I want to get to, for example, is: User 1 spent 3 seconds on page 1, skipped page 2, read pages 3-10, then left.
User 2 read pages 20-30 and spent an average of 5 seconds on each page, and so on.
Is there a way to do this or is it even possible?
I already know whether the magazine has been downloaded, and where and when, and I have this data.
Many thanks
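For what it's worth, a PDF opened in a local reader typically can't report anything back, so this kind of per-page tracking usually means serving the magazine through a web viewer you control and logging what gets requested. Purely as an illustration (a minimal sketch, not a Matomo feature; the Flask route, the pre-split page files and the reader_id cookie are all assumptions), a server-side log of page requests could look like this, with time per page derived afterwards from the gaps between consecutive entries:

import time
from pathlib import Path
from flask import Flask, send_file, request

app = Flask(__name__)
PAGES_DIR = Path("magazine_pages")  # the PDF pre-split into one file per page, e.g. page_1.pdf

@app.route("/page/<int:number>")
def serve_page(number):
    # Log who asked for which page and when; dwell time per page can be
    # estimated later from consecutive timestamps for the same reader.
    reader = request.cookies.get("reader_id", "anonymous")
    with open("readership.log", "a") as log:
        log.write(f"{time.time()}\t{reader}\t{number}\n")
    return send_file(PAGES_DIR / f"page_{number}.pdf")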

Related

Web scrape a specific tag using Python BeautifulSoup

I am working on a personal project in which I am trying to analyze incidents caused by the unethical use of AI systems. I am trying to web scrape this website.
URL - https://incidentdatabase.ai/apps/discover?display=details&page=1
I want the URL of each of the 28 incidents listed on page 1, so that I can scrape information from those URLs. But I am not able to access the particular element and its contents, where the URL for each incident is given under each grid item; I only get an empty list when I try to scrape it. I am guessing that is because it is inside a grid. Any help would be appreciated.
I have attached an image of the page's inspect view, in which I have circled exactly what I want to scrape.
Thank you in advance for any help.
You don't need to scrape this! You can download all the content that is being displayed from the snapshot page:
https://incidentdatabase.ai/research/snapshots
I generated a new snapshot 2 minutes ago, which will be listed on the above page shortly; it is available at the following link.
https://s3.amazonaws.com/aiid-backups-public/backup-20220912170854.tar.bz2
This will give you the entire database, which is rendered to HTML from MongoDB (JSON) collections.
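If it helps, a minimal sketch of that workflow in Python (assuming the backup URL quoted above is still available) is simply to download the archive and unpack it locally:

import tarfile
import requests

url = "https://s3.amazonaws.com/aiid-backups-public/backup-20220912170854.tar.bz2"
archive = "backup.tar.bz2"

# Stream the archive to disk rather than holding it all in memory
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    with open(archive, "wb") as f:
        for chunk in r.iter_content(chunk_size=1 << 20):
            f.write(chunk)

# Unpack the MongoDB (JSON) collection exports
with tarfile.open(archive, "r:bz2") as tar:
    tar.extractall("aiid_backup")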
Please reach out via the contact page (or comment on this solution) if something does not suit your needs.
urls = [x.get_attribute('href') for x in driver.find_elements(By.XPATH, "//div[@class='h-100 card']/a")]
If you want the hrefs of the 28 or so elements, you can grab them like so. You can add WebDriverWait calls if the page takes extra time to load.
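As a rough, self-contained sketch of that (the Chrome driver and the 20-second timeout are arbitrary choices):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://incidentdatabase.ai/apps/discover?display=details&page=1")

# Wait until the incident cards have been rendered before collecting their links
cards = WebDriverWait(driver, 20).until(
    EC.presence_of_all_elements_located((By.XPATH, "//div[@class='h-100 card']/a"))
)
urls = [card.get_attribute("href") for card in cards]
print(urls)
driver.quit()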
This is a very interesting question, by its very nature an X-Y problem. Selenium is not the right tool for this job, imho. The page is (very) dynamic and, besides being hydrated from external APIs, it also tracks user interaction and loads data as you scroll. Of course it's possible to do it with Selenium as well, but there is a better way. There are 311 incidents, all of them extensively documented. The way forward here is to scrape the API endpoint for each one of them: the result will be a huge JSON object, very detailed.
For example, to scrape the first 19 incidents using requests and pandas:
import requests
import pandas as pd
from tqdm import tqdm

big_df = pd.DataFrame()
for counter in tqdm(range(1, 20)):
    # each incident's citation page is backed by a page-data.json endpoint
    r = requests.get(f'https://incidentdatabase.ai/page-data/cite/{counter}/page-data.json')
    # flatten the list of incident reports into a tabular structure
    df = pd.json_normalize(r.json()['result']['pageContext']['incidentReports'])
    big_df = pd.concat([big_df, df], axis=0, ignore_index=True)
print(big_df)
This will result in:
19/19 [01:00<00:00, 3.25s/it]
submitters date_published report_number title url image_url cloudinary_id source_domain mongodb_id text authors epoch_date_submitted language
0 [Roman Yampolskiy] 2015-05-19 1 Google’s YouTube Kids App Criticized for ‘Inappropriate Content’ https://blogs.wsj.com/digits/2015/05/19/googles-youtube-kids-app-criticized-for-inappropriate-content/ ... blogs.wsj.com 5d34b8c29ced494f010ed45a Child and consumer advocacy groups complained to the Federal Trade Commission Tuesday that Google’s new YouTube Kids app contains “inappropriate content”... [Alistair Barr] 1559347200 en
1 [Roman Yampolskiy] 2018-02-07 2 YouTube Kids app is STILL showing disturbing videos https://www.dailymail.co.uk/sciencetech/article-5358365/YouTube-Kids-app-showing-disturbing-videos.html ... dailymail.co.uk 5d34b8c29ced494f010ed45b Google-owned YouTube has apologised again after more disturbing videos surfaced on its YouTube Kids app... [Phoebe Weston] 1559347200
[...]
The JSON response(s) can be further dissected and analysed, and more useful information can be pulled from them (including the Euclidean distance between incidents, and much more).
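For instance, a quick way to see what else a single response contains (reusing the same endpoint as above):

import requests

# Fetch the page-data JSON for incident 1 and list the available keys
r = requests.get('https://incidentdatabase.ai/page-data/cite/1/page-data.json')
context = r.json()['result']['pageContext']
print(list(context.keys()))               # everything available besides incidentReports
print(len(context['incidentReports']))    # number of reports filed for this incident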
Requests docs: https://requests.readthedocs.io/en/latest/
Pandas docs: https://pandas.pydata.org/pandas-docs/stable/index.html
And for tqdm: https://tqdm.github.io/

Multipage Bootstrap and Google Analytics

I have sort of a problem with how to use Google Analytics properly with Bootstrap.
My page has subpages three levels deep, and the last subpage has its own subdomain. In GA I see I can use a maximum of 50 tracking codes within one service. What if I need more than that?
You are limited to 50 properties, not 50 pages. Each property can track many pages (up to 10 million hits a month for the free version) and events.
Typically you would use the same property code on all pages of the same site so you can see all that data together (with the option to drill down).
You would only use a new property code for a new site (though your subdomain might qualify for that if you want to track it separately).
So the two questions you want to ask yourself are:
Do you want to be able to report on the pages together? E.g. to see that your site gets 10,000 hits, of which 20% are for this page and 5% for that page, or that people start at this page, then go to that page, then on to another. If so, they should use the same analytics property.
Do different people need to see these page stats? And is it a problem if they can see all of them? If so, use separate properties so you can set permissions separately.
It sounds like these are part of the same site, so I'd be veering towards tracking them together in the same property.
On a different note, you should set one page as the main version (with a rel=canonical tag) and redirect the other versions to that page, to avoid confusing search engines into thinking you have duplicated content. Do you have a reason for having the same content at two different addresses? It can cause SEO and other problems.

How to Remove Submitted Content from Search Engines (Google)?

I have submitted 1000 pages to Google in my sitemap, but I did it by mistake. I didn't want to submit them all; I wanted to submit 800 and then release one page per day over 200 days, in order to add fresh content every day. That way Google would see it as a frequently updated website, which is good SEO practice.
I don't want Google to know about the existence of those 200 pages right now; I want Google to think it is fresh content when I release it each day.
Shall I resend the sitemap.xml with only 800 links and hide those pages on the website?
If Google has already indexed the pages, is there any chance of making Google "forget" them and not recognize them in the future when I release them again?
Any suggestion about what to do?
Thank you guys.
I wouldn't do that. You're trying to cheat, and Google doesn't like that. Leave your site as it is now, create new content, and submit it to Google's index as frequently as you want. If you exclude previously submitted pages from the index, there's a high probability they won't be indexed again.

Make search engines distinguish website chronological updates over time (like in forums)

I see that search engines are quite capable of finding pages chronologically for forum websites and the like, offering the option to show results from the last 24 hours, last week, last month, last year, etc.
I understand that these sites need to be continuously crawled to provide those updates, but I have technical doubts about what structure, tags, or anything else I need to add to achieve this for my website.
I see that on the client side (which is also the side search engines are on), content appears basically as static data, already processed by the server, so the question is:
If I constantly update my website and add content to its index page to make it easily visible, and I even add links, times and dates as text for the new pages, why don't these updates show up at all in search engines?
Do I need to add XML/RSS feeds, or what else?
How do forums and other frequently updated sites with chronological markers make it possible for search engines to list results separated by hours, days, etc.?
What specific set of tags and overall structure do I need to add for this feature?
I also see that search engines, mainly Googlebot, usually take a minimum of 3 days to crawl those new pages, and even then they aren't organized persistently (or at all) in a chronological way in search results.
I am not using any forum, blog or other kind of web publishing software, just raw HTML and PHP written by hand, plus the minimum I mentioned above: pointing to new documents from the index page of the website along with a description.
Do I need to add XML/RSS feeds, or what else?
Yes. Atom or one of the RSS formats (or several formats at the same time, so you could offer Atom and RSS).
Search engines will know about new blog posts, microblog posts, forum threads, forum thread answers, etc., because they subscribe to the feed. So sometimes you'll notice that a page is indexed by a search engine only minutes after it was published. But for smaller sites, search engines probably don't check for updates every few minutes; instead it might take days until a new page is indexed.
A sitemap might help, too.
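To make the feed idea concrete, here is a minimal hand-written RSS 2.0 file sketched in Python (only to show the structure; the site URL, titles and file name are placeholders, and the same XML could just as well be emitted by the PHP that already builds your index page):

from datetime import datetime, timezone
from email.utils import format_datetime

# One entry per new page, newest first; in practice this list would come from
# whatever data already drives the links on your index page.
items = [
    {"title": "New page added today",
     "link": "https://example.com/new-page.html",
     "date": datetime(2015, 6, 1, tzinfo=timezone.utc)},
]

entries = "".join(
    "  <item>\n"
    f"    <title>{i['title']}</title>\n"
    f"    <link>{i['link']}</link>\n"
    f"    <pubDate>{format_datetime(i['date'])}</pubDate>\n"
    "  </item>\n"
    for i in items
)

feed = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<rss version="2.0">\n<channel>\n'
    '  <title>Example site updates</title>\n'
    '  <link>https://example.com/</link>\n'
    '  <description>Latest pages</description>\n'
    f'{entries}'
    '</channel>\n</rss>\n'
)

with open("feed.xml", "w", encoding="utf-8") as f:
    f.write(feed)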

Getting data from Google Spreadsheets

I quickly made a little form in Google Docs that lets people submit the most current attraction wait times at Disneyland to a Google Spreadsheet. I want to make a web page that displays the bottom, most recent row from that spreadsheet, so the current wait time for each attraction is always shown when someone visits the page. Is there already a way to share and embed just the bottom row of data from the spreadsheet?
Hooray for Google's API documentation section, although it's sometimes hard to find the right part of it... I've never done this before, but it looks pretty straightforward.
For list-based feeds, see: http://code.google.com/apis/spreadsheets/data/3.0/developers_guide.html#ListFeeds
For cell-based feeds, see: http://code.google.com/apis/spreadsheets/data/3.0/developers_guide.html#CellFeeds
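A different, simpler route than the feeds above (sketched here as an assumption, not what the linked docs describe): if the spreadsheet is published to the web, its contents can be fetched as CSV and the last row picked out. The sheet ID and export URL pattern below are placeholders that depend on how the sheet is shared:

import csv
import io
import requests

SHEET_ID = "YOUR_SHEET_ID"  # placeholder for the spreadsheet's ID
url = f"https://docs.google.com/spreadsheets/d/{SHEET_ID}/export?format=csv"

rows = list(csv.reader(io.StringIO(requests.get(url).text)))
header, latest = rows[0], rows[-1]
print(dict(zip(header, latest)))  # the most recent submission, keyed by column name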