Why getting only empty ResultSet for href from google search result? - selenium

I've been working in Google Colab on a script to scrape Google search results. It worked for a long time without any problem, but now it doesn't. It seems the page source is different and the CSS classes I used to rely on have changed.
I use Selenium and BeautifulSoup and the code is the following:
# Installing Selenium after new Ubuntu update
%%shell
cat > /etc/apt/sources.list.d/debian.list <<'EOF'
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster.gpg] http://deb.debian.org/debian buster main
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster-updates.gpg] http://deb.debian.org/debian buster-updates main
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-security-buster.gpg] http://deb.debian.org/debian-security buster/updates main
EOF
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys DCC9EFBF77E11517
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 648ACFD622F3D138
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 112695A0E562B32A
apt-key export 77E11517 | gpg --dearmour -o /usr/share/keyrings/debian-buster.gpg
apt-key export 22F3D138 | gpg --dearmour -o /usr/share/keyrings/debian-buster-updates.gpg
apt-key export E562B32A | gpg --dearmour -o /usr/share/keyrings/debian-security-buster.gpg
cat > /etc/apt/preferences.d/chromium.pref << 'EOF'
Package: *
Pin: release a=eoan
Pin-Priority: 500
Package: *
Pin: origin "deb.debian.org"
Pin-Priority: 300
Package: chromium*
Pin: origin "deb.debian.org"
Pin-Priority: 700
EOF
apt-get update
apt-get install chromium chromium-driver
pip install selenium
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
from fake_useragent import UserAgent  # UserAgent() below comes from the fake_useragent package
# Parameters to use Selenium and Chromedriver
ua = UserAgent()
userAgent = ua.random
options = Options()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--user-agent=' + userAgent)  # avoid embedding literal quotes in the UA string
#options.headless = True
driver = webdriver.Chrome('chromedriver', options=options)
# Trying to scrape Google Search Results
links = []
url = "https://www.google.es/search?q=alergia"
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
#This doesn't return anything
search = soup.find_all('div', class_='yuRUbf')
for h in search:
    links.append(h.a.get('href'))
print(links)
Why does the class yuRUbf no longer work for scraping search results? It always worked for me.

There can be several different issues here, since the question is not very specific on this point. So first of all, take a look at your soup to check whether all the expected ingredients are in place.
Check whether you run into a consent banner redirect and handle it with Selenium, either by clicking the banner or by sending the corresponding cookies/headers (see the sketch after the requests example below).
Classes are highly dynamic, so change your selection strategy and rely on more static features such as ids or the HTML structure. CSS selectors are used here:
soup.select('a:has(h3)')
Example:
Because Selenium is not really needed here, this is a lightweight version with requests:
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get('https://www.google.es/search?q=alergia',
                     headers={'User-Agent': 'Mozilla/5.0'}, cookies={'CONSENT': 'YES+'}).text, 'html.parser')
# str.strip() removes a set of characters rather than a prefix, so split off the /url?q= wrapper instead
links = [a.get('href').split('/url?q=')[-1].split('&')[0] for a in soup.select('a:has(h3)')]
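As mentioned above, here is a hedged sketch of handling the consent redirect with Selenium before parsing; it reuses driver and url from the question's code, and both the consent.google redirect host and the accept-button id 'L2AGLb' are assumptions about the current consent page rather than stable identifiers:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get(url)
# if Google redirected us to the consent page, click the accept button first
if 'consent.google' in driver.current_url:
    WebDriverWait(driver, 5).until(
        EC.element_to_be_clickable((By.ID, 'L2AGLb'))
    ).click()

soup = BeautifulSoup(driver.page_source, 'html.parser')
links = [a.get('href') for a in soup.select('a:has(h3)')]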

Related

How to use TreeTagger in Google Colab?

I want to use the TreeTagger module to tag POS information on a raw corpus.
Since it seems faster to use a GPU via Google Colab, I installed the TreeTagger module, but my Colab code cannot locate the TreeTagger directory.
The error looks like this:
TreeTaggerError: Can't locate TreeTagger directory (and no TAGDIR specified)
Please tell me where I should upload the treetagger folder.
You have to specify the directory:
treetaggerwrapper.TreeTagger(TAGLANG='en', TAGDIR='treetagger/') # treetagger is the installation dir
Installation in Colab.
Follow the instructions on the website.
In one Colab cell put the following (for languages other than English, use the link to the corresponding parameter file):
%%bash
mkdir treetagger
cd treetagger
# Download the tagger package for your system (PC-Linux, Mac OS-X, ARM64, ARMHF, ARM-Android, PPC64le-Linux).
wget https://cis.lmu.de/~schmid/tools/TreeTagger/data/tree-tagger-linux-3.2.4.tar.gz
tar -xzvf tree-tagger-linux-3.2.4.tar.gz
# Download the tagging scripts into the same directory.
wget https://cis.lmu.de/~schmid/tools/TreeTagger/data/tagger-scripts.tar.gz
gunzip tagger-scripts.tar.gz
# Download the installation script install-tagger.sh.
wget https://cis.lmu.de/~schmid/tools/TreeTagger/data/install-tagger.sh
# Download the parameter files for the languages you want to process.
# list of all files (parameter files) https://cis.lmu.de/~schmid/tools/TreeTagger/#parfiles
wget https://cis.lmu.de/~schmid/tools/TreeTagger/data/english.par.gz
sh install-tagger.sh
cd ..
sudo pip install treetaggerwrapper
And in the following cell you can check the installation:
>>> import pprint # For proper print of sequences.
>>> import treetaggerwrapper
>>> #1) build a TreeTagger wrapper:
>>> tagger = treetaggerwrapper.TreeTagger(TAGLANG='en', TAGDIR='treetagger/')
>>> #2) tag your text.
>>> tags = tagger.tag_text("This is a very short text to tag.")
>>> #3) use the tags list... (list of string output from TreeTagger).
>>> pprint.pprint(tags)
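If you prefer structured output, a small optional follow-up: treetaggerwrapper also ships a make_tags helper that turns the raw tab-separated strings into named tuples (word, pos, lemma). Continuing the session above:
>>> #4) optionally convert the raw strings into Tag named tuples:
>>> tags2 = treetaggerwrapper.make_tags(tags)
>>> pprint.pprint(tags2)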

How to render OpenAI gym in google Colab? [closed]

I'm trying to use OpenAI Gym in Google Colab. As the notebook runs on a remote server, I cannot render Gym's environment.
I found some solutions for Jupyter notebooks; however, they do not work with Colab as I don't have access to the remote server.
Does someone know a workaround for this that works with Google Colab?
Korakot's answer is not correct.
You can indeed render OpenAI Gym in Colaboratory, albeit somewhat slowly, using none other than matplotlib.
Here's how:
Install xvfb & other dependencies (Thanks to Peter for his comment)
!apt-get install x11-utils > /dev/null 2>&1
!pip install pyglet > /dev/null 2>&1
!apt-get install -y xvfb python-opengl > /dev/null 2>&1
As well as pyvirtualdisplay:
!pip install gym pyvirtualdisplay > /dev/null 2>&1
then import all your libraries, including matplotlib & ipythondisplay:
import gym
import numpy as np
import matplotlib.pyplot as plt
from IPython import display as ipythondisplay
Then import Display from pyvirtualdisplay and initialise your screen size, in this example 400x300:
from pyvirtualdisplay import Display
display = Display(visible=0, size=(400, 300))
display.start()
Last but not least, using Gym's "rgb_array" render functionality, render to a screen variable, then plot that variable with matplotlib (displayed indirectly via IPython display):
env = gym.make("CartPole-v0")
env.reset()
prev_screen = env.render(mode='rgb_array')
plt.imshow(prev_screen)
for i in range(50):
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    screen = env.render(mode='rgb_array')

    plt.imshow(screen)
    ipythondisplay.clear_output(wait=True)
    ipythondisplay.display(plt.gcf())

    if done:
        break
ipythondisplay.clear_output(wait=True)
env.close()
Link to my working Colaboratory notebook demoing cartpole:
https://colab.research.google.com/drive/16gZuQlwxmxR5ZWYLZvBeq3bTdFfb1r_6
Note: not all Gym Environments support "rgb_array" render mode, but most of the basic ones do.
Try this:
!apt-get install python-opengl -y
!apt install xvfb -y
!pip install pyvirtualdisplay
!pip install pyglet
from pyvirtualdisplay import Display
Display().start()
import gym
from IPython import display
import matplotlib.pyplot as plt
%matplotlib inline
env = gym.make('CartPole-v0')
env.reset()
img = plt.imshow(env.render('rgb_array')) # only call this once
for _ in range(40):
    img.set_data(env.render('rgb_array'))  # just update the data
    display.display(plt.gcf())
    display.clear_output(wait=True)
    action = env.action_space.sample()
    env.step(action)
This worked for me so I guess it should also work for you.
I recently had to solve the same problem and have written up a blog post with my solution. For ease of reference I am re-posting the TL;DR version here.
Paste this code into a cell in Colab and run it to install all of the dependencies.
%%bash
# install required system dependencies
apt-get install -y xvfb x11-utils
# install required python dependencies (might need to install additional gym extras depending)
pip install gym[box2d]==0.17.* pyvirtualdisplay==0.2.* PyOpenGL==3.1.* PyOpenGL-accelerate==3.1.*
And then start a virtual display in the background.
import pyvirtualdisplay
_display = pyvirtualdisplay.Display(visible=False,  # use False with Xvfb
                                    size=(1400, 900))
_ = _display.start()
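A quick sanity check, as a small sketch (it assumes pyvirtualdisplay exports the DISPLAY environment variable once the display is started, which is how the backend currently behaves):
import os
print(os.environ.get('DISPLAY'))  # should print a display name such as ':1001'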
In the blog post I also provide a sample simulation demo that demonstrates that the above actually works.
By far the best solution I found, after spending countless hours on this issue, is to record and play back video. UX-wise it comes very close to the real render function.
Here is a Google Colab notebook that records and renders video:
https://colab.research.google.com/drive/12osEZByXOlGy8J-MSpkl3faObhzPGIrB
enjoy :)
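For reference, here is a minimal sketch of the record-and-replay idea (assumptions: an older gym release around 0.17.x where gym.wrappers.Monitor still exists, and the xvfb virtual display from the answers above already running; newer gym/gymnasium versions use RecordVideo instead):
import base64, glob, io
import gym
from gym.wrappers import Monitor
from IPython.display import HTML, display

# record every episode as .mp4 files under ./video
env = Monitor(gym.make('CartPole-v0'), './video', force=True)
env.reset()
done = False
while not done:
    _, _, done, _ = env.step(env.action_space.sample())
env.close()

# embed the recorded video back into the notebook
mp4 = glob.glob('video/*.mp4')[0]
data = base64.b64encode(io.open(mp4, 'rb').read()).decode('ascii')
display(HTML('<video controls autoplay style="max-width: 400px;">'
             '<source src="data:video/mp4;base64,%s" type="video/mp4"></video>' % data))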

How do I upload a file to Google Colaboratory that is already on Google Drive?

https://colab.research.google.com/notebooks/io.ipynb#scrollTo=KHeruhacFpSU
This help notebook explains how to upload a file to Drive and then download it into Colaboratory, but my files are already in Drive.
Where can I find the file ID?
# Download the file we just uploaded.
#
# Replace the assignment below with your file ID
# to download a different file.
#
# A file ID looks like: 1uBtlaggVyWshwcyP6kEI-y_W3P8D26sz
file_id = 'target_file_id'
My advice would be to use pydrive for this (docs).
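A minimal PyDrive sketch along those lines (the file id is a placeholder; the authentication pattern uses google.colab's auth helper, which is the usual Colab route):
from google.colab import auth
from oauth2client.client import GoogleCredentials
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# 'YOUR_FILE_ID' is a placeholder; see the other answers for ways to find the id
downloaded = drive.CreateFile({'id': 'YOUR_FILE_ID'})
downloaded.GetContentFile('local_copy.csv')  # writes the file into the Colab filesystem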
You could also do this via the Drive UI -- I think the shortest path is to select the file, click "Get shareable link" -- it's the id parameter in the resulting URL. (If the file wasn't shared when you started, you'll want to then uncheck the green "link" button.)
Connect to Google Drive using the snippet below.
You will have to authenticate twice using the link from the cell output. But once this step is taken care of, you can load files from Drive and save to Drive directly, just as you would locally.
!apt-get install -y -qq software-properties-common python-software-properties module-init-tools
!add-apt-repository -y ppa:alessandro-strada/ppa 2>&1 > /dev/null
!apt-get update -qq 2>&1 > /dev/null
!apt-get -y install -qq google-drive-ocamlfuse fuse
from google.colab import auth
auth.authenticate_user()
from oauth2client.client import GoogleCredentials
creds = GoogleCredentials.get_application_default()
import getpass
!google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL
vcode = getpass.getpass()
!echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}
!mkdir -p drive
!google-drive-ocamlfuse drive
Read CSV using pandas
import pandas as pd
df = pd.read_csv('drive/path/file.csv')
Save CSV
Use index=False if you don't need the index as the first column in the CSV.
df.to_csv('drive/path/file.csv',index = False)
You can use the CurlWget extension in Chrome. If you want to download anything, just click download and, as soon as it starts, cancel the download. Then open CurlWget, which shows the whole command for the file, and copy it.
Go to Colab, add a cell and paste the command, putting a ! before what you copied from CurlWget.
It's better to use the Colab API:
from google.colab import drive
drive.mount('/content/drive')
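Once mounted, Drive files can be read and written directly by path, for example (a small sketch; 'MyDrive' is the current Colab folder name, while older runtimes used 'My Drive' with a space):
import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/path/file.csv')  # adjust the path to your Drive layout
df.to_csv('/content/drive/MyDrive/path/output.csv', index=False)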

Link to images within google drive from a colab notebook

I would like to store image files on Drive and link to them from a Colaboratory notebook. Is this possible? For example:
google-drive/
notebook.ipynb
images/
pic.jpg
Within notebook.ipynb markdown cell:
![Alternate Text](images/pic.jpg)
Google always loses me in the details. That's where the devil is. Based on the previous answer, these are my steps to make an image sharable so it can be used in a notebook.
In your Google Drive, create a public folder as follows:
a. Create a new folder and name it Image, for instance.
b. Right-click the folder just created and select Share from the drop-down menu.
c. In the popup dialog window, click the Advanced link in the bottom right.
d. In the section Who has access, select "Public on the web - Anyone on the Internet can find and view".
e. Click the Done button.
Store the image in the folder you just created.
Right-click on the image, and from the drop-down menu select Share.
Click the Copy link button. A link is copied in your clipboard.
From the link, copy the long image ID made of numbers and letters.
Append the ID you copied in the previous step to the following URL: https://docs.google.com/uc?export=download&id=
Use this URL in the markdown image tag, for example: ![test](https://docs.google.com/uc?export=download&id=mmXXDD123zDGV51twxSCGAAX23)
A possible alternative to the great answer by Arpit Gupta, for images that are publicly shared: get the ID of your file and prepend it with this URL:
https://docs.google.com/uc?export=download&id=
Grabbed it from this forum post.
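If it helps, here is a tiny sketch for pulling the ID out of a share link (it assumes the .../file/d/<ID>/view link format Drive currently produces; the link below is a placeholder):
import re

share_link = 'https://drive.google.com/file/d/YOUR_FILE_ID/view?usp=sharing'
file_id = re.search(r'/d/([\w-]+)', share_link).group(1)
print('https://docs.google.com/uc?export=download&id=' + file_id)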
Use the following code:
!apt-get install -y -qq software-properties-common python-software-properties module-init-tools
!add-apt-repository -y ppa:alessandro-strada/ppa 2>&1 > /dev/null
!apt-get update -qq 2>&1 > /dev/null
!apt-get -y install -qq google-drive-ocamlfuse fuse
from google.colab import auth
auth.authenticate_user()
from oauth2client.client import GoogleCredentials
creds = GoogleCredentials.get_application_default()
import getpass
!google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL
vcode = getpass.getpass()
!echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}
After running the above command it will ask for a verification code: click on the provided link, authorise access, and paste the code back.
Create a folder in drive using:
!mkdir -p colabData
!google-drive-ocamlfuse colabData
After this you can use the drive as if it's connected locally:
%%bash
echo "Hello World...!!!" > colabData/hello.txt
ls colabData

Disable Scrapyd item storing in .jl feed

Question
I want to know how to disable Item storing in scrapyd.
What I tried
I deploy a spider to the Scrapy daemon Scrapyd. The deployed spider stores the scraped data in a database, and this works fine.
However, Scrapyd logs each scraped Scrapy item. You can see this when examining the Scrapyd web interface.
This item data is stored in ..../items/<project name>/<spider name>/<job name>.jl
I have no clue how to disable this. I run scrapyd in a Docker container and it uses way too much storage.
I have tried the approach from "Suppress Scrapy Item printed in logs after pipeline", but this seems to do nothing for Scrapyd's logging. All spider logging settings seem to be ignored by Scrapyd.
Edit
I found this entry in the documentation about item storing. It seems that if you omit the items_dir setting, items will not be stored. This is said to be disabled by default. I do not have a scrapyd.conf file, so item storing should be disabled. It is not.
After writing my answer I re-read your question, and I see that what you want has nothing to do with logging but with not writing to the (default-ish) .jl feed (maybe update the title to "Disable Scrapyd item storing"). To override Scrapyd's default, just set FEED_URI to an empty string like this:
$ curl http://localhost:6800/schedule.json -d project=tutorial -d spider=example -d setting=FEED_URI=
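The items_dir route mentioned in the question's edit can also be made explicit by shipping your own scrapyd.conf (a sketch, assuming a fairly standard layout; per the Scrapyd configuration docs, an empty items_dir disables the item feed):
[scrapyd]
eggs_dir     = eggs
logs_dir     = logs
# leaving items_dir empty disables the .jl item feeds entirely
items_dir    =
jobs_to_keep = 5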
For other people who are looking into logging... Let's see an example. We do the usual:
$ scrapy startproject tutorial
$ cd tutorial
$ scrapy genspider example example.com
then edit tutorial/spiders/example.py to contain the following:
import scrapy


class TutorialItem(scrapy.Item):
    name = scrapy.Field()
    surname = scrapy.Field()


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = (
        'http://www.example.com/',
    )

    def parse(self, response):
        for i in range(100):
            t = TutorialItem()
            t['name'] = "foo"
            t['surname'] = "bar %d" % i
            yield t
Notice the difference between running:
$ scrapy crawl example
# or
$ scrapy crawl example -L DEBUG
# or
$ scrapy crawl example -s LOG_LEVEL=DEBUG
and
$ scrapy crawl example -s LOG_LEVEL=INFO
# or
$ scrapy crawl example -L INFO
By trying such combinations on your spider, confirm that it doesn't print item info for log levels above DEBUG.
It's now time, after you deploy to Scrapyd, to do exactly the same:
$ curl http://localhost:6800/schedule.json -d setting=LOG_LEVEL=INFO -d project=tutorial -d spider=example
Then confirm that the logs don't contain items when you check the job's log in the Scrapyd web interface.
Note that if your items are still printed at INFO level, it likely means that your code or some pipeline is printing them. You could raise the log level further and/or investigate, find the code that prints them, and remove it.
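If the culprit turns out to be the built-in "Scraped from ..." messages rather than your own prints, one known knob is a custom log formatter. A sketch (point the LOG_FORMATTER setting at this class so the per-item message drops to DEBUG and is hidden once you run with LOG_LEVEL=INFO):
import logging

from scrapy import logformatter


class QuietLogFormatter(logformatter.LogFormatter):
    """Demote the per-item 'Scraped from <response>' line to DEBUG."""

    def scraped(self, item, response, spider):
        entry = super().scraped(item, response, spider)
        entry['level'] = logging.DEBUG
        return entry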