Scraping data from CrowdTangle using the API returns expired image links - Selenium

I wanted to download images from the CrowdTangle dashboard, so I wrote code to fetch data using its API. However, historical posts scraped through the API return expired media links: while downloading an image, I got a "URL expired" error. How can I generate new links?

After talking with people, I figured out that I needed to scroll through the CrowdTangle dashboard to make it regenerate the image links. However, scrolling manually through thousands of posts would be tedious, so I wrote a bot that scrolls for me. This solved my problem and I was able to generate new links.
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
browser = webdriver.Chrome(options=options)

link = {insert_link}  # paste the dashboard URL copied from your browser
browser.get(link)
browser.maximize_window()

# Follow the "click here." link to log in through Facebook
fb_button = browser.find_element(by=By.LINK_TEXT, value="click here.")
fb_button.click()
time.sleep(7)

phone = browser.find_element(by=By.ID, value="email")
password = browser.find_element(by=By.ID, value="pass")
submit = browser.find_element(by=By.ID, value="loginbutton")
phone.send_keys({phone number})
password.send_keys({password})
submit.click()
time.sleep(6)

# Scroll the posts container to the bottom over and over so the dashboard
# keeps loading posts and regenerating their image links
element = browser.find_element(by=By.XPATH, value="/html/body/div[1]/div/div/div[3]/div")
while True:
    browser.execute_script("arguments[0].scrollTop = arguments[0].scrollHeight", element)
    time.sleep(3)
Go to the CrowdTangle dashboard, enter your filters, and run the query. Copy the link from the browser into the code above. I would recommend running the scroll bot separately for each month. Sometimes more posts won't load; this is an issue with CrowdTangle, so just close the browser and move on to the next month.
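Once the dashboard has been scrolled, the API should hand back fresh media URLs for the same date range. Below is a minimal sketch of re-fetching the posts and saving the images; the endpoint, parameter names, and media-URL fields follow the public CrowdTangle API as I understand it, so treat them as assumptions and check them against your own API responses.
import requests

API_TOKEN = "{your_crowdtangle_api_token}"  # assumption: token-based auth as in the CrowdTangle API docs
resp = requests.get(
    "https://api.crowdtangle.com/posts",  # assumed endpoint for historical posts
    params={
        "token": API_TOKEN,
        "startDate": "2021-01-01",
        "endDate": "2021-01-31",
        "count": 100,
    },
)
resp.raise_for_status()

for i, post in enumerate(resp.json().get("result", {}).get("posts", [])):
    for j, media in enumerate(post.get("media", [])):
        url = media.get("full") or media.get("url")  # field names are assumptions
        if not url:
            continue
        img = requests.get(url)
        if img.ok:
            with open(f"post_{i}_{j}.jpg", "wb") as f:
                f.write(img.content)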

Related

How can I get the link to YouTube Channel from Video Page?

I've been trying to get the link to the YouTube channel from the video page, but I couldn't locate the link's element. With the Inspector it is obvious that the link is right there, as shown in the following picture.
Using the selector 'a.yt-simple-endpoint.style-scope.yt-formatted-string', I tried to get the link with the following code.
! pip install selenium
! pip install beautifulsoup4
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome(r'D:\chromedrive\chromedriver.exe')
driver.get('https://www.youtube.com/watch?v=P6Cc2R2jK6s')
soup = BeautifulSoup(driver.page_source, 'lxml')
links = soup.select('a.yt-simple-endpoint.style-scope.yt-formatted-string')
for link in links:
    print(link.get('href'))
However, no matter whether I used links = soup.select('a.yt-simple-endpoint.style-scope.yt-formatted-string') or links = soup.find('a', class_='yt-simple-endpoint style-scope ytd-video-owner-renderer'), it did not print anything. Could someone please help me solve this?
Instead of this:
links = soup.select('a.yt-simple-endpoint.style-scope.yt-formatted-string')
in Selenium I would do:
links = driver.find_elements_by_css_selector('a.yt-simple-endpoint.style-scope.yt-formatted-string')
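YouTube renders the owner link with JavaScript, so the element usually isn't in driver.page_source until the page has finished loading. Here is a minimal sketch that waits for the link and reads its href directly from Selenium; the selector is the one from the question, and the 20-second timeout is an arbitrary assumption:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.youtube.com/watch?v=P6Cc2R2jK6s')

# Wait until at least one channel link is attached to the DOM
wait = WebDriverWait(driver, 20)
links = wait.until(EC.presence_of_all_elements_located(
    (By.CSS_SELECTOR, 'a.yt-simple-endpoint.style-scope.yt-formatted-string')))

for link in links:
    print(link.get_attribute('href'))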

Splash not rendering a webpage completely

I am trying to use Scrapy + Splash to scrape this site: https://www.teammitsubishihartford.com/new-inventory/index.htm?compositeType=new, but I am unable to extract any data from it. When I try rendering the webpage using the Splash API (browser), I can see that the site is not fully loaded (the Splash rendering returns a partially loaded website image). How can I render the site completely?
@Vinu Abraham, if your requirement is not specific to Scrapy + Splash, you can use Selenium. This issue occurs when we try to scrape a dynamic site.
Below is a code snippet for reference.
from bs4 import BeautifulSoup
from selenium import webdriver
import time

# url of the page we want to scrape
url = 'https://www.*******/drugs-all-medicines'

driver = webdriver.Chrome('./chromedriver')
driver.get(url)
time.sleep(5)  # give the JavaScript time to render the listing

# Hand the fully rendered DOM over to BeautifulSoup
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
all_divs = soup.find('div', {'class': 'style__container___1i8GI'})
Also, let me know if you find a solution for the same using Scrapy.
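For the inventory site in the question, an explicit wait is usually more reliable than a fixed time.sleep. A minimal sketch follows; the .inventory-listing class name is only a guess at what the vehicle cards might use, so inspect the page and substitute the real selector:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome('./chromedriver')
driver.get('https://www.teammitsubishihartford.com/new-inventory/index.htm?compositeType=new')

# Wait until the inventory cards (hypothetical selector) are present instead of sleeping
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.inventory-listing')))

soup = BeautifulSoup(driver.page_source, 'html.parser')
cards = soup.select('.inventory-listing')
print(len(cards))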

Grafana - get values from a dashboard with MySQL as data source via API

I am trying to extract the values shown on a Grafana dashboard. I have MySQL as the data source and a query that pulls values from a particular table.
I am trying to get these dashboard values through some API.
For Prometheus, I came across the Instant Queries API and it works well. I want to do the same for the Grafana dashboard. I went through the Grafana HTTP APIs, but did not find one that returns the MySQL records displayed on the dashboard.
Are there any other APIs, or any other way to get these records?
The only way I found was to:
1. Open the dashboard and then inspect the panel.
2. Open the query tab and click on refresh.
3. Use that URL and its parameters in your own request to the API (see the sketch below).
Note: first you need to create an API key in the UI with the proper role and add the bearer token to the request headers.
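A minimal sketch of that last step in Python, assuming a recent Grafana where the query inspector shows a POST to /api/ds/query; the exact endpoint, payload shape, and datasource UID must be copied from your own inspector, so everything below except the Bearer header pattern is an assumption:
import requests

GRAFANA_URL = "http://<IP:PORT>"
API_KEY = "xxx"  # API key created in the Grafana UI

# Payload copied from the query inspector; the datasource uid and SQL are placeholders
payload = {
    "queries": [{
        "refId": "A",
        "datasource": {"type": "mysql", "uid": "<datasource_uid>"},
        "rawSql": "SELECT * FROM my_table LIMIT 10",
        "format": "table",
    }],
    "from": "now-1h",
    "to": "now",
}

resp = requests.post(
    f"{GRAFANA_URL}/api/ds/query",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())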
I found a way to extract the records: use the ChroPath utility to get the XPath for the row/column/cell you need, then use those XPaths in your Selenium code.
Hmm, I found that it is a Chrome problem: it can't add a custom header.
Replacing it with PhantomJS made it work, but PhantomJS can't see the full data (downloading a PNG shows that).
Oh, can you tell me how you did that?
I use Selenium but can't get the data. This is my code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--no-sandbox')

url = "http://<IP:PORT>/d/SsYvTg6Wk/rights?orgId=1&from=now-1h&to=now"
# url copied from the Chrome address bar

browser = webdriver.Chrome(executable_path="./chromedriver", chrome_options=chrome_options)
browser.header_overrides = {
    'Authorization': 'Bearer xxx'
}
browser.implicitly_wait(10)
browser.get(url)
# browser.execute_script("return document.body.innerHTML")
browser.find_element_by_id('flotGagueValue0')
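As far as I know, header_overrides is not part of plain Selenium at all; it comes from the selenium-wire package, which may be why the header never reaches Grafana. Here is a minimal sketch under that assumption, using selenium-wire's request interceptor (check the selenium-wire docs for your version):
from seleniumwire import webdriver  # pip install selenium-wire
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")

def add_auth_header(request):
    # Attach the Grafana API key to every outgoing request
    request.headers['Authorization'] = 'Bearer xxx'

browser = webdriver.Chrome(options=chrome_options)
browser.request_interceptor = add_auth_header
browser.get("http://<IP:PORT>/d/SsYvTg6Wk/rights?orgId=1&from=now-1h&to=now")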

retrieving ad urls using scrapy and selenium

I am trying to retrieve the ad URLs for this website:
http://www.appledaily.com
The ad URLs are loaded using JavaScript, so a standard CrawlSpider does not work. The ads also change as you refresh the page.
I found this question here, and what I gathered is that we need to first use Selenium to load the page in the browser and then use Scrapy to retrieve the URL. I have some experience with Scrapy but none at all with Selenium. Can anyone show me, or point me to a resource on, how to write a script to do that?
Thank you very much!
EDIT:
I tried the following, but neither attempt opens the ad banner. Can anyone help?
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://appledaily.com')
adBannerElement = driver.find_element_by_id('adHeaderTop')
adBannerElement.click()
2nd try:
adBannerElement =driver.find_element_by_css_selector("div[#id='adHeaderTop']")
adBannerElement.click()
A CSS attribute selector should not contain the # symbol inside the brackets: it should be "div[id='adHeaderTop']", or, as a shorter way of writing the same thing, div#adHeaderTop.
Actually, after observing and analyzing the site and the action you are trying to carry out, I find that the noscript tag is what should interest you. Just get the HTML source of this node, parse the href attribute, and fire that URL.
It will be equivalent to clicking the banner.
<noscript>
"<a href="http://adclick.g.doubleclick.net/aclk%253Fsa%...</a>"
</noscript>
(This is not the complete node information; just inspect the banner in Chrome and you will find this tag.)
EDIT: Here is a working snippet that gives you the URL without clicking on the ad banner, extracted from the tag mentioned above.
WebDriver driver = new FirefoxDriver();
driver.navigate().to("http://www.appledaily.com");

// The ad URL lives inside the <noscript> fallback of the banner container
WebElement objHidden = driver.findElement(By.cssSelector("div#adHeaderTop_ad_container noscript"));
if (objHidden != null) {
    String innerHTML = objHidden.getAttribute("innerHTML");
    String adURL = innerHTML.split("\"")[1];
    System.out.println("** " + adURL); // URL opened when you click on the ad
} else {
    System.out.println("<noscript> element not found...");
}
Though this is written in Java, the page source won't change.
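Since the rest of this thread uses Python, here is a rough Python translation of the same idea; it assumes the div#adHeaderTop_ad_container noscript node from the Java snippet still exists on the page:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://www.appledaily.com")

# The <noscript> fallback inside the banner container holds the click-through URL
hidden = driver.find_element(By.CSS_SELECTOR, "div#adHeaderTop_ad_container noscript")
inner_html = hidden.get_attribute("innerHTML")
ad_url = inner_html.split('"')[1]  # the href is the first quoted value
print("**", ad_url)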

How to autorefresh chromeDriver with Selenium?

Previously I have been using the Chrome Auto Refresh plug-in. However, my code now opens and closes multiple ChromeDriver instances and I cannot use Auto Refresh. Also, it is quite a hassle to install Auto Refresh on new computers.
Is there any way with Selenium to refresh the driver (simulate F5, say, every 15 seconds if the page does not change), similar to Chrome Auto Refresh?
refresh() is a built-in command:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.google.com")
driver.refresh()
If you don't have the chrome driver it can be found here:
https://code.google.com/p/chromedriver/downloads/list
Put the binary in the same folder as the python script you're writing. (Or add it to the path or whatever, more information here: https://code.google.com/p/selenium/wiki/ChromeDriver)
edit:
If you want to refresh every 10 seconds or so, just wrap the refresh line in a loop with a delay. For example:
import time

while True:
    driver.refresh()
    time.sleep(refresh_time_in_seconds)
If you only want to refresh when the page hasn't changed in the meantime, keep track of the page that you're on: driver.current_url is the URL of the current page. Putting it all together:
import time

refresh_time_in_seconds = 15
driver = webdriver.Chrome()
driver.get("http://www.google.com")
url = driver.current_url

while True:
    if url == driver.current_url:
        driver.refresh()
    url = driver.current_url
    time.sleep(refresh_time_in_seconds)
Well, there are two ways of doing this.
1. We can use the refresh method:
driver.get("some website url");
driver.navigate().refresh();
2. We can use the Actions class and mimic an F5 press:
Actions act = new Actions(driver);
act.sendKeys(Keys.F5).perform();
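The question uses the Python bindings, so for completeness here is the same F5 approach with ActionChains; a short sketch, assuming the driver is already on the page you want to refresh:
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

# Send F5 to the page, which triggers a browser refresh
ActionChains(driver).send_keys(Keys.F5).perform()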
If you write unit tests that must run as if you had opened or refreshed a new browser session each time, you can use a method with a @Before annotation:
@Before
public void refreshPage() {
    driver.navigate().refresh();
}
If all tests pass individually (green) but fail when run together, the reason might also be that you need to wait for some resources to become available on the page, so you need to handle that too, setting a timeout like this:
public WebElement getSaveButton() {
    return findDynamicElementByXPath(By.xpath("//*[@id=\"form:btnSave\"]"), 320);
}
320 is a long time, but you must make sure that you give enough time for everything the test needs to load.
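findDynamicElementByXPath above is the answerer's own helper; in the Python bindings the equivalent is an explicit wait. A short sketch, assuming the same XPath and a 320-second timeout:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_save_button(driver, timeout=320):
    # Block until the save button is present in the DOM, or raise TimeoutException
    return WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.XPATH, '//*[@id="form:btnSave"]')))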