BeautifulSoup: AttributeError: 'NoneType' object has no attribute 'text' and is not subscriptable

I'm using Beautiful Soup and I'm getting the error "AttributeError: 'NoneType' object has no attribute 'get_text'" and also "TypeError: 'NoneType' object is not subscriptable".
I know my code works when I use it to search for a single restaurant. However, when I loop over all restaurants, I get these errors.
Here is my screen recording showing the problem. https://streamable.com/pok13
The rest of the code can be found here: https://pastebin.com/wsv1kfNm
# AttributeError: 'NoneType' object has no attribute 'get_text'
restaurant_address = yelp_containers[yelp_container].find("address", {
    "class": 'lemon--address__373c0__2sPac'
}).get_text()
print("restaurant_address: ", restaurant_address)

# TypeError: 'NoneType' object is not subscriptable
restaurant_starCount = yelp_containers[yelp_container].find("div", {
    "class": "lemon--div__373c0__1mboc i-stars__373c0__30xVZ i-stars--regular-4__373c0__2R5IO border-color--default__373c0__2oFDT overflow--hidden__373c0__8Jq2I"
})['aria-label']
print("restaurant_starCount: ", restaurant_starCount)

# AttributeError: 'NoneType' object has no attribute 'text'
restaurant_district = yelp_containers[yelp_container].find("div", {
    "class": "lemon--div__373c0__1mboc display--inline-block__373c0__25zhW border-color--default__373c0__2xHhl"
}).text
print("restaurant_district: ", restaurant_district)

You are getting the errors because your selectors are too specific and you never check whether the tag was actually found: find() returns None when nothing matches, and calling .get_text() on None or indexing it with ['aria-label'] raises exactly these exceptions. A minimal guard, reusing the address lookup from the question (a sketch of the pattern, not the full fix), looks like this:
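address_tag = yelp_containers[yelp_container].find("address", {"class": "lemon--address__373c0__2sPac"})
# find() returns a Tag or None; only call get_text() on a real Tag
restaurant_address = address_tag.get_text() if address_tag is not None else None
The better solution, though, is to loosen the selectors (the lemon--div-XXX... selectors will probably change in the near future anyway):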
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
import csv
import re
my_url = 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=San%20Francisco%2C%20CA'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
bs = soup(page_html, "html.parser")
yelp_containers = bs.select('li:contains("All Results") ~ li:contains("read more")')
for idx, item in enumerate(yelp_containers, 1):
    print("--- Restaurant number #", idx)
    restaurant_title = item.h3.get_text(strip=True)
    restaurant_title = re.sub(r'^[\d.\s]+', '', restaurant_title)
    restaurant_address = item.select_one('[class*="secondaryAttributes"]').get_text(separator='|', strip=True).split('|')[1]
    restaurant_numReview = item.select_one('[class*="reviewCount"]').get_text(strip=True)
    restaurant_numReview = re.sub(r'[^\d.]', '', restaurant_numReview)
    restaurant_starCount = item.select_one('[class*="stars"][aria-label]')['aria-label']
    restaurant_starCount = re.sub(r'[^\d.]', '', restaurant_starCount)
    pr = item.select_one('[class*="priceRange"]')
    restaurant_price = pr.get_text(strip=True) if pr else '-'
    restaurant_category = [a.get_text(strip=True) for a in item.select('[class*="priceRange"] ~ span a')]
    restaurant_district = item.select_one('[class*="secondaryAttributes"]').get_text(separator='|', strip=True).split('|')[-1]
    print(restaurant_title)
    print(restaurant_address)
    print(restaurant_numReview)
    print(restaurant_price)
    print(restaurant_category)
    print(restaurant_district)
    print('-' * 80)
Prints:
--- Restaurant number # 1
Fog Harbor Fish House
Pier 39
5487
$$
['Seafood', 'Bars']
Fisherman's Wharf
--------------------------------------------------------------------------------
--- Restaurant number # 2
The House
1230 Grant Ave
4637
$$$
['Asian Fusion']
North Beach/Telegraph Hill
--------------------------------------------------------------------------------
...and so on.

Related

Saving datasets created in a for loop to multiple files

I have URLs (for web scraping) and municipality name stored in this list:
muni = [("https://openbilanci.it/armonizzati/bilanci/filettino-comune-fr/entrate/dettaglio?year=2021&type=preventivo", "filettino"), ("https://openbilanci.it/armonizzati/bilanci/partanna-comune-tp/entrate/dettaglio?year=2021&type=preventivo","partanna"), ("https://openbilanci.it/armonizzati/bilanci/fragneto-labate-comune-bn/entrate/dettaglio?year=2021&type=preventivo", "fragneto-labate") ]
I am trying to create a different dataset for each municipality. For example, data scraped from the first URL would be saved as filettinodak.csv.
I am using the following code right now:
import re
import json
import requests
import pandas as pd
import os
import random
os.chdir(r"/Users/aartimalik/Dropbox/data")
muni = [("https://openbilanci.it/armonizzati/bilanci/filettino-comune-fr/entrate/dettaglio?year=2021&type=preventivo", "filettino"),
("https://openbilanci.it/armonizzati/bilanci/partanna-comune-tp/entrate/dettaglio?year=2021&type=preventivo","partanna"),
("https://openbilanci.it/armonizzati/bilanci/fragneto-labate-comune-bn/entrate/dettaglio?year=2021&type=preventivo", "fragneto-labate")
]
for m in muni[1]:
    URL = m
    r = requests.get(URL)
    p = re.compile("var bilancio_tree = (.*?);")
    data = p.search(r.text).group(1)
    data = json.loads(data)
    all_data = []
    for d in data:
        for v in d["values"]:
            for kk, vv in v.items():
                all_data.append([d["label"], "-", kk, vv.get("abs"), vv.get("pc")])
        for c in d["children"]:
            for v in c["values"]:
                for kk, vv in v.items():
                    all_data.append(
                        [d["label"], c["label"], kk, vv.get("abs"), vv.get("pc")]
                    )
    df = pd.DataFrame(all_data, columns=["label 1", "label 2", "year", "abs", "pc"])
    df.to_csv(muni[2]+"dak.csv", index=False)
The error I am getting is:
Traceback (most recent call last):
  File "<stdin>", line 19, in <module>
TypeError: can only concatenate tuple (not "str") to tuple
I think I am doing something wrong with the muni indexing: muni[i]. Any suggestions? Thank you so much!
If you adjust your for loop a bit, it should solve your problem. The change below loops through all the entries in muni and, on each iteration, unpacks the first value of the tuple into URL and the second into label:
for URL, label in muni:
And with that change, the final line in your code can become:
df.to_csv(label+"dak.csv", index=False)
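Put together, the whole loop becomes (a sketch assembled from the question's own code; only the loop header and the last line change):
for URL, label in muni:
    r = requests.get(URL)
    p = re.compile("var bilancio_tree = (.*?);")
    data = json.loads(p.search(r.text).group(1))
    all_data = []
    # ... build all_data from data exactly as in the question ...
    df = pd.DataFrame(all_data, columns=["label 1", "label 2", "year", "abs", "pc"])
    df.to_csv(label + "dak.csv", index=False)  # e.g. filettinodak.csv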

AttributeError: 'list' object has no attribute 'sents'

How can I resolve this attribute error in spaCy?
from __future__ import unicode_literals, print_function
from spacy.lang.en import English
nlp = English()
sentencizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sentencizer)
assert len(list(doc.sents)) == 2
This is the traceback:
AttributeError Traceback (most recent call last)
<ipython-input-81-0459326012bf> in <module>
5 sentencizer = nlp.create_pipe("sentencizer")
6 nlp.add_pipe(sentencizer)
----> 7 assert len(list(doc.sents)) == 2
AttributeError: 'list' object has no attribute 'sents'
Note that your snippet never creates doc by running text through the pipeline (doc = nlp(...)), so the doc you call .sents on is presumably a leftover list from an earlier cell; .sents only exists on a Doc object. If your goal is to tokenize (split) sentences, below is a working sample using spaCy.
import spacy
nlp = spacy.load('en_core_web_lg')
raw_text = 'Hello, world. Here are two sentences.'
doc = nlp(raw_text)
sentences = [sent.text.strip() for sent in doc.sents]
assert len(sentences) == 2
print(sentences)
Output:
['Hello, world.', 'Here are two sentences.']
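If you want to keep the blank English() pipeline from the question rather than load a full model, the sentencizer route works too; here is a sketch assuming spaCy 3.x, where add_pipe takes the component name as a string (spaCy 2.x used create_pipe and add_pipe with an object, as in the question):
import spacy
nlp = spacy.blank("en")  # a blank pipeline, equivalent to English() in the question
nlp.add_pipe("sentencizer")  # rule-based sentence boundaries, no model needed
doc = nlp("Hello, world. Here are two sentences.")  # this call was missing in the question
assert len(list(doc.sents)) == 2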

Scrapy: TypeError: argument of type 'NoneType' is not iterable

With Scrapy, I receive this NoneType error when I launch my spider:
if 'Jockey' in tab_arrivee_th: TypeError: argument of type 'NoneType' is not iterable
The code works fine when I test it in the console with a hard-coded list, but not with the response.css.
I think the problem comes from tab_arrivee_th, and I don't understand why, because scrapy shell gives me a list in return for the same expression, and it's the same list that I use in the test.
def parse(self, response):
    tab_arrivee_th = response.css('.arrivees th::text').extract()
    # list obtained with the response.css above in scrapy shell:
    # tab_arrivee_th = ['Cl.', 'N°', 'Cheval', 'S/A', 'Œill.', 'Poids', 'Corde', 'Ecart', 'Jockey', 'Entraîneur', 'Tx', 'Récl.', 'Rapp. Ouv.']
    if 'Jockey' in tab_arrivee_th:
        col_jockey = tab_arrivee_th.index('Jockey') + 1
    elif 'Driver' in tab_arrivee_th:
        col_jockey = tab_arrivee_th.index('Driver') + 1
    else:
        col_jockey = 0
    jockey = partant.css('td:nth-child(' + str(col_jockey) + ') > a::text').extract()
Thanks for the help.
Solved: response.css('.arrivees th::text').extract() points to a table that is built by JavaScript, so it is empty in the raw HTML that Scrapy downloads. I used scrapy-splash with a 0.5 second delay, and it works fine.
The response for the line tab_arrivee_th = response.css('.arrivees th::text').extract() is empty; check the response again.
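For reference, a minimal scrapy-splash request with the 0.5 second wait mentioned above might look like this (a sketch; it assumes a running Splash instance and the scrapy-splash middleware enabled in settings.py):
from scrapy_splash import SplashRequest

def start_requests(self):
    for url in self.start_urls:
        # Let Splash render the JavaScript that builds the table, waiting
        # 0.5 s before snapshotting the HTML so '.arrivees th' exists.
        yield SplashRequest(url, callback=self.parse, args={"wait": 0.5})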

TypeError: POST data should be bytes or an iterable of bytes. It cannot be of type str

My code:
#!/usr/bin/env python
#coding: utf-8
userid="NicoNicoCreate#gmail.com"
passwd="********"
import sys, re, cgi, urllib, urllib.request, urllib.error, http.cookiejar, xml.dom.minidom, time, urllib.parse
import simplejson as json
def getToken():
    html = urllib.request.urlopen("http://www.nicovideo.jp/my/mylist").read()
    for line in html.splitlines():
        mo = re.match(r'^\s*NicoAPI\.token = "(?P<token>[\d\w-]+)";\s*', line)
        if mo:
            token = mo.group('token')
            break
    assert token
    return token

def mylist_create(name):
    cmdurl = "http://www.nicovideo.jp/api/mylistgroup/add"
    q = {}
    q['name'] = name.encode("utf-8")
    q['description'] = ""
    q['public'] = 0
    q['default_sort'] = 0
    q['icon_id'] = 0
    q['token'] = token
    cmdurl += "?" + urllib.parse.urlencode(q).encode("utf-8")
    j = json.load(urllib.request.urlopen(cmdurl), encoding='utf-8')
    return j['id']

def addvideo_tomylist(mid, smids):
    for smid in smids:
        cmdurl = "http://www.nicovideo.jp/api/mylist/add"
        q = {}
        q['group_id'] = mid
        q['item_type'] = 0
        q['item_id'] = smid
        q['description'] = u""
        q['token'] = token
        cmdurl += "?" + urllib.parse.urlencode(q).encode("utf-8")
        j = json.load(urllib.request.urlopen(cmdurl), encoding='utf-8')
        time.sleep(0.5)

#Login
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(http.cookiejar.CookieJar()))
urllib.request.install_opener(opener)
urllib.request.urlopen("https://secure.nicovideo.jp/secure/login",
    urllib.parse.urlencode( {"mail":userid, "password":passwd}) ).encode("utf-8")

#GetToken
token = getToken()

#MakeMylist&AddMylist
mid = mylist_create(u"Testlist")
addvideo_tomylist(mid, ["sm9", "sm1097445", "sm1715919"])
The error:
Traceback (most recent call last):
  File "Nico3.py", line 48, in <module>
    urllib.parse.urlencode( {"mail":userid, "password":passwd}) ).encode("utf-8")
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 162, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 463, in open
    req = meth(req)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 1170, in do_request_
    raise TypeError(msg)
TypeError: POST data should be bytes or an iterable of bytes. It cannot be of type str.
I've tried calling encode but it did not help. I'm a Japanese academic student and couldn't solve this with my own knowledge.
I am aware of the similar question "TypeError: POST data should be bytes or an iterable of bytes. It cannot be str", but am too new for that answer to be much help.
Your closing parenthesis is in the wrong place, so you are encoding the return value of urlopen rather than the POST data:
.urlencode({"mail":userid, "password":passwd}).encode("utf-8"))  # <- move .encode() inside the urlopen call
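With the parenthesis moved, the login call becomes:
urllib.request.urlopen(
    "https://secure.nicovideo.jp/secure/login",
    urllib.parse.urlencode({"mail": userid, "password": passwd}).encode("utf-8")  # encode the POST data, not the response
)
Note that mylist_create and addvideo_tomylist will then hit the mirror-image bug: cmdurl += "?" + urllib.parse.urlencode(q).encode("utf-8") concatenates str with bytes. A GET query string does not need encoding at all, so dropping .encode("utf-8") on those two lines should be enough.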

Python: TypeError: expected string or buffer when parsing JSON from a file

I realize this problem has been answered for other folks, but none of the threads are helping me solve it. I'm trying to parse a JSON structure and add up all the values in the sent_file whose keys match the tweet_file. The error I'm getting is shown below the code.
import sys
import json

def main():
    sent_file = open(sys.argv[1])
    tweet_file = open(sys.argv[2])
    scores = {}
    #tweet = {}
    #tweet_text = {}
    #hw()
    #lines(sent_file)
    #lines(tweet_file)
    for line in sent_file:
        term, score = line.split("\t")
        scores[term] = int(score)
        #print scores.items()
    for tweets in tweet_file:
        current_sent_value = 0
        tweet = {}  # this is a dict
        #print type(tweets) str
        tweet = json.loads(tweets)  #[0] #this assignment changes tweet to a list. Why?
        if 'text' in tweet:
            tweet_text = {}
            unicode_string = tweet['text']
            encoded_string = unicode_string.encode('utf-8')
            tweet_text = encoded_string.split()
            for key in tweet_text:
                for key in scores:
                    #print type(tweet_text) -- list
                    #print type(scores) -- dict
                    if tweet_text.get(key) == scores.get(key):  # get() does not work on a list. tweet_text is a list.
                        current_sent_value += scores(value)
            print current_sent_value

if __name__ == '__main__':
    main()
The error is here:
  File "\assignment1\tweet_sentiment2.py", line 42, in main
    if tweet_text.get(key) == scores.get(key): # get() does not work on a list. tweet_text is a list.
AttributeError: 'list' object has no attribute 'get'
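The comments in the code already point at the root cause: tweet_text is a list of words, while scores is a dict, and lists have no .get(). A sketch of the intended lookup, keeping the question's Python 2 style, would replace the two inner for loops with:
# scores maps term -> sentiment value; sum the score of every word in the
# tweet, treating words that are not in the dict as 0
current_sent_value = sum(scores.get(word, 0) for word in tweet_text)
print current_sent_value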