I've hit a wall with the way I would like to use the YouTube Data API. I have a user account that acts as an 'aggregator', adding videos from various other channels into one of about 15 playlists, based on categories. My problem is that I can't get all these videos into a single feed, because they belong to various YouTube users. I'd like to get them all into a single list, so I could sort that master list by most recent and most popular, to populate different views in my web app.
How can I get a list of all the videos that a user has added to any of their playlists?
YouTube must track this kind of thing, because if you go into the "Feed" section of any user's page at `http://www.youtube.com/`, it gives you a stream of activity that includes videos added to playlists.
To be clear, I don't want to fetch a list of videos uploaded by just this user, so http://gdata.../<user>/uploads won't work. Since there are a number of different playlists, http://gdata.../<user>/playlists won't work either, because I would need to make about 15 requests each time I wanted to check for new videos.
There seems to be no way to retrieve a list of all videos that a user has added to all of their playlists. Can somebody think of a way to do this that I might have overlooked?
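For what it's worth, the multi-request approach described above (one GData v2 playlist feed per category, merged client-side) would look roughly like the sketch below. This is only an illustration: the playlist IDs are placeholders, and the Atom `published` timestamp may reflect when the video was published rather than when it was added to the playlist.

```python
import urllib2
import xml.etree.ElementTree as et

ATOM = '{http://www.w3.org/2005/Atom}'

# Hypothetical placeholders for the ~15 playlist IDs the account maintains.
playlist_ids = ['PLAYLIST_ID_1', 'PLAYLIST_ID_2']

master = []
for pid in playlist_ids:
    feed = et.parse(urllib2.urlopen(
        'http://gdata.youtube.com/feeds/api/playlists/' + pid + '?v=2&max-results=50'))
    for entry in feed.findall(ATOM + 'entry'):
        published = entry.findtext(ATOM + 'published')
        title = entry.findtext(ATOM + 'title')
        link = entry.find(ATOM + "link[@rel='alternate']")
        master.append((published, title, link.get('href') if link is not None else None))

# ISO 8601 timestamps sort correctly as plain strings, newest first.
master.sort(reverse=True)
for published, title, href in master:
    print published, title, href
```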
Something like this works for retrieving YouTube links from a playlist. It still needs improvements.
import urllib2
import xml.etree.ElementTree as et
import re
import os

more = 1
id_playlist = raw_input("Enter youtube playlist id: ")
number_of_iteration = input("How many video links: ")
number = number_of_iteration / 50
number2 = number_of_iteration % 50
if (number2 != 0):
    number3 = number + 1
else:
    number3 = number
start_index = 1

while more <= number3:
    # read one page (up to 50 entries) of the playlist feed
    if (more != 1):
        start_index += 50
    str_start_index = str(start_index)
    req = urllib2.Request('http://gdata.youtube.com/feeds/api/playlists/' + id_playlist + '?v=2&start-index=' + str_start_index + '&max-results=50')
    response = urllib2.urlopen(req)
    the_page = response.read()

    # write the page to web_content.xml
    dat = open("web_content.xml", "w")
    dat.write(the_page)
    dat.close()

    # search the page for links
    tree = et.parse('web_content.xml')
    all_links = tree.findall('*/{http://www.w3.org/2005/Atom}link[@rel="alternate"]')

    # write links + attributes to .txt
    if (more == 1):
        till_links = 50
    else:
        till_links = start_index + 50
    str_till_links = str(till_links)
    dat2 = open("links-" + str_start_index + "to" + str_till_links + ".txt", "w")
    for links in all_links:
        str1 = (str(links.attrib) + "\n")
        dat2.write(str1)
    dat2.close()

    # keep only the links themselves
    f = open("links-" + str_start_index + "to" + str_till_links + ".txt", "r")
    link_all = f.read()
    new_string = link_all.replace("{'href': '", "")
    new_string2 = new_string.replace("', 'type': 'text/html', 'rel': 'alternate'}", "")
    f.close()

    # write the cleaned links back to .txt
    f = open("links-" + str_start_index + "to" + str_till_links + ".txt", "w")
    f.write(new_string2)
    f.close()

    more += 1

os.remove('web_content.xml')
print "Finished!"
I have managed to get the text I want, but I can't seem to send the entire list in a Telegram message. I only manage to send the first line.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

service = Service(executable_path=ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)
driver.get("")
Source = driver.page_source
soup = BeautifulSoup(Source, "html.parser")

for cars in soup.findAll(class_="car-title"):
    print(cars.text)

driver.close()

def telegram_bot_sendtext(bot_message):
    bot_token = ''
    bot_chatID = ''
    send_text = 'https://api.telegram.org/bot' + bot_token + '/sendMessage?chat_id=' + bot_chatID + '&parse_mode=Markdown&text=' + bot_message
    response = requests.get(send_text)
    return response.json()

test = telegram_bot_sendtext(cars.text)
The print function gives me this
AUDI E-TRON
MERCEDES-BENZ EQC
TESLA MODEL 3
NISSAN LEAF
MERCEDES-BENZ EQV
AUDI E-TRON
At some point I would like to add a function to check for updates and, if there are any changes, send a push message to Telegram. If someone could point me in the right direction I would be grateful.
What happens?
You're sending one line because you do not store the results anywhere; only the last result from the iteration is still in memory.
How to fix?
Assuming you want to send the text as in the question, you should store the results in a variable: iterate over the result set, extract the text and join() the results with a newline character:
cars = '\n'.join([cars.text for cars in soup.find_all(class_="car-title")])
Example
...
cars = '\n'.join([cars.text for cars in soup.find_all(class_="car-title")])
def telegram_bot_sendtext(bot_message):
    bot_token = ''
    bot_chatID = ''
    send_text = 'https://api.telegram.org/bot' + bot_token + '/sendMessage?chat_id=' + bot_chatID + '&parse_mode=Markdown&text=' + bot_message
    response = requests.get(send_text)
    return response.json()
test = telegram_bot_sendtext(cars)
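One further refinement worth considering (not part of the original answer): the message now contains newlines and may contain characters with special meaning in a URL, so letting requests encode the query parameters is safer than concatenating them into the URL by hand. A sketch, keeping the same placeholder token and chat id:

```python
import requests

def telegram_bot_sendtext(bot_message):
    bot_token = ''   # placeholder, as in the question
    bot_chatID = ''  # placeholder, as in the question
    url = 'https://api.telegram.org/bot' + bot_token + '/sendMessage'
    # requests URL-encodes the parameters, so newlines and '&' in the
    # message text survive the trip intact.
    response = requests.get(url, params={
        'chat_id': bot_chatID,
        'parse_mode': 'Markdown',
        'text': bot_message,
    })
    return response.json()

# test = telegram_bot_sendtext(cars)  # as in the example above
```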
I would like to extract the text data of the author affiliations on this page using Beautiful Soup.
I know of a workaround using Selenium to simply click the 'show more' link and scan the page again. I'm not sure what kind of elements these are (hidden?), as they only appear in the inspector after clicking the button.
Is there a way to extract this info using just Beautiful Soup, or do I need Selenium or something equivalent to reveal the elements in the HTML code?
from bs4 import BeautifulSoup
import requests
url = 'https://www.sciencedirect.com/science/article/abs/pii/S0920379621007596'
r = requests.get(url)
sp = BeautifulSoup(r.content, 'html.parser')
author_data = sp.find('div', id='author-group')
affiliations = author_data.find('dl', class_='affiliation').text
print(affiliations)
That info is within a script tag, though you need to map the letters for affiliations to the actual affiliations. The code below extracts the JavaScript object housing the info you want and handles it with the json library.
There is then a series of steps to dynamically determine which indices hold the info of interest and then use a constructed mapping of the letters to affiliations to assign the correct affiliation to each author.
The author first and last names are also dynamically ascertained and joined together with a space.
The intention was to avoid hardcoding indices which might change over time.
import re
import json
import requests
r = requests.get('https://www.sciencedirect.com/science/article/abs/pii/S0920379621007596',
                 headers={'User-Agent': 'Mozilla/5.0'})
data = json.loads(re.search(r'(\{"abstracts".*})', r.text).group(1))
base = [i for i in data['authors']['content']
        if i.get('#name') == 'author-group'][0]['$$']
affiliation_data = [i for i in base if i['#name'] == 'affiliation']
author_data = [i for i in base if i['#name'] == 'author']
name_info = [i['_'] for author in author_data for i in author['$$']
             if i['#name'] in ['given-name', 'surname']]
affiliations = dict(zip(
    [j['_'] for i in affiliation_data for j in i['$$'] if j['#name'] == 'label'],
    [j['_'] for i in affiliation_data for j in i['$$']
     if isinstance(j, dict) and '_' in j and j['_'][0].isupper()]))
# print(affiliations)
author_affiliations = dict(zip(
    [' '.join([i[0], i[1]]) for i in zip(name_info[0::2], name_info[1::2])],
    [affiliations[j['_']] for author in author_data for i in author['$$']
     if i['#name'] == 'cross-ref' for j in i['$$'] if j['_'] != '⁎']))
print(author_affiliations)
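Since the whole approach hinges on one regex pulling a JSON blob out of a script tag, a small guard makes failures easier to diagnose if ScienceDirect changes its page structure. A possible addition (not part of the original answer):

```python
import re
import json
import requests

r = requests.get('https://www.sciencedirect.com/science/article/abs/pii/S0920379621007596',
                 headers={'User-Agent': 'Mozilla/5.0'})
match = re.search(r'(\{"abstracts".*})', r.text)
if match is None:
    # Page layout changed, the request was blocked, or the script payload moved.
    raise SystemExit('Could not find the embedded JSON payload in the page.')
data = json.loads(match.group(1))
```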
I'm extracting NBA stats from my Yahoo fantasy account. Below is the code I wrote in a Jupyter notebook using Selenium. Each page shows 25 players, out of a total of 720 players. I wrote a for loop that scrapes players in increments of 25 instead of one by one.
for k in range(0, 725, 25):
    Players = driver.find_elements_by_xpath('//tbody/tr/td[2]/div/div/div/div/a')
    Team_Position = driver.find_elements_by_xpath('//span[@class="Fz-xxs"]')
    Games_Played = driver.find_elements_by_xpath('//tbody/tr/td[7]/div')
    Minutes_Played = driver.find_elements_by_xpath('//tbody/tr/td[11]/div')
    FGM_A = driver.find_elements_by_xpath('//tbody/tr/td[12]/div')
    FTM_A = driver.find_elements_by_xpath('//tbody/tr/td[14]/div')
    Three_Points = driver.find_elements_by_xpath('//tbody/tr/td[16]/div')
    PTS = driver.find_elements_by_xpath('//tbody/tr/td[17]/div')
    REB = driver.find_elements_by_xpath('//tbody/tr/td[18]/div')
    AST = driver.find_elements_by_xpath('//tbody/tr/td[19]/div')
    ST = driver.find_elements_by_xpath('//tbody/tr/td[20]/div')
    BLK = driver.find_elements_by_xpath('//tbody/tr/td[21]/div')
    TO = driver.find_elements_by_xpath('//tbody/tr/td[22]/div')
    NBA_Stats = []
    for i in range(len(Players)):
        players_stats = {'Name': Players[i].text,
                         'Position': Team_Position[i].text,
                         'GP': Games_Played[i].text,
                         'MP': Minutes_Played[i].text,
                         'FGM/A': FGM_A[i].text,
                         'FTM/A': FTM_A[i].text,
                         '3PTS': Three_Points[i].text,
                         'PTS': PTS[i].text,
                         'REB': REB[i].text,
                         'AST': AST[i].text,
                         'ST': ST[i].text,
                         'BLK': BLK[i].text,
                         'TO': TO[i].text}
    driver.get('https://basketball.fantasysports.yahoo.com/nba/28951/players?status=ALL&pos=P&cut_type=33&stat1=S_AS_2021&myteam=0&sort=AR&sdir=1&count=' + str(k))
The browser goes page by page after it's done. I print out the results, but it only scrapes 1 player. What did I do wrong?
It's hard to see what the issue is here without looking at the original page (can you provide a URL?). However, looking at this:
next = driver.find_element_by_xpath('//a[@id="yui_3_18_1_1_1636840807382_2187"]')
"1636840807382" looks like a Javascript timestamp, so I would guess that the reference you've got hardcoded there is dynamically generated, so the element "yui_3_18_1_1_1636840807382_2187" no longer exists.
How can I remove that? I tried so many things and I am exhausted from trying to defeat this error by myself. I spent the last 3 hours looking at this and trying to get through it, and I surrender to this code. Please help.
The first "for" statement grabs article titles from news.google.com.
The second "for" statement grabs the time of submission of each article on news.google.com.
This is on Django, by the way, and the page shows the article titles and their times of submission in a list, going down. The weird Unicode letters are popping up from the second "for" statement, which handles the submission times. Here is my views.py:
def articles(request):
    """ Grabs the most recent articles from the main news page """
    import bs4, requests
    list = []
    list2 = []
    url = 'https://news.google.com/'
    r = requests.get(url)
    try:
        r.raise_for_status() == True
    except ValueError:
        print('Something went wrong.')
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    for (listarticles) in soup.find_all('h2', 'esc-lead-article-title'):
        if listarticles is not None:
            a = listarticles.text
            list.append(a)
    for articles_times in soup.find_all('span', 'al-attribution-timestamp'):
        if articles_times is not None:
            b = articles_times.text
            list2.append(b)
    list = zip(list, list2)
    context = {'list': list}
    return render(request, 'newz/articles.html', context)
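Without seeing the rendered output this is only a guess, but Google News timestamps often carry invisible Unicode formatting characters (directional marks such as U+200E) that show up as stray letters once they reach a template. If that is what is happening here, they can be stripped as the text is collected; a sketch:

```python
import unicodedata

def strip_format_chars(text):
    # Drop "format" characters (Unicode category Cf), e.g. left-to-right marks,
    # which are invisible in the browser but render as odd symbols elsewhere.
    return ''.join(c for c in text if unicodedata.category(c) != 'Cf').strip()

# e.g. inside the second loop above:
# list2.append(strip_format_chars(articles_times.text))
```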
I am new to Twitter development. I am trying to download the tweets of important news agencies. I used the guidelines provided in http://www.karambelkar.info/2015/01/how-to-use-twitters-search-rest-api-most-effectively to download the tweets. I know that the Twitter API has limits on the number of requests (180 requests per 15 minutes) and that each request can fetch at most 100 tweets, so I expected the following code to get 18K tweets when I run it for the first time. However, I can only get around 3000 tweets for each news agency: for example, 3234 tweets for nytimes and 3207 for cnn.
I'd be thankful if you could take a look at my code and let me know the problem.
def get_tweets(api, username, sinceId):
    max_id = -1L
    maxTweets = 1000000  # Some arbitrary large number
    tweetsPerReq = 100   # the max the API permits
    tweetCount = 0
    print "writing to {0}_tweets.txt".format(username)
    with open("{0}_tweets.txt".format(username), 'w') as f:
        while tweetCount < maxTweets:
            try:
                if (max_id <= 0):
                    if (not sinceId):
                        new_tweets = api.user_timeline(screen_name=username, count=tweetsPerReq)
                    else:
                        new_tweets = api.user_timeline(screen_name=username, count=tweetsPerReq, since_id=sinceId)
                else:
                    if (not sinceId):
                        new_tweets = api.user_timeline(screen_name=username, count=tweetsPerReq, max_id=str(max_id - 1))
                    else:
                        new_tweets = api.search(screen_name=username, count=tweetsPerReq, max_id=str(max_id - 1), since_id=sinceId)
                if not new_tweets:
                    print "no new tweet"
                    break
                # create array of tweet information: username, tweet id, date/time, text
                for tweet in new_tweets:
                    f.write(jsonpickle.encode(tweet._json, unpicklable=False) + '\n')
                tweetCount += len(new_tweets)
                print("Downloaded {0} tweets".format(tweetCount))
                max_id = new_tweets[-1].id
            except tweepy.TweepError as e:
                # Just exit if any error
                print("some error : " + str(e))
                break
    print("Downloaded {0} tweets, Saved to {1}_tweets.txt".format(tweetCount, username))
Those are the limitations imposed by the API.
If you read the documentation, you will see that it says
This method can only return up to 3,200 of a user’s most recent Tweets.
So the answer is that normal API users cannot access that data beyond the most recent 3,200 tweets.
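For completeness (not part of the original answer): if the goal is simply to collect everything within that 3,200-tweet window, tweepy's Cursor helper handles the max_id bookkeeping automatically. A sketch, assuming an already-authenticated api object as in the question:

```python
import tweepy
import jsonpickle

def get_recent_tweets(api, username):
    """Page through the ~3,200 most recent tweets the API exposes for a user."""
    count = 0
    with open("{0}_tweets.txt".format(username), 'w') as f:
        # user_timeline allows up to 200 tweets per request; Cursor pages
        # until the API stops returning results (i.e. the 3,200 cap).
        for tweet in tweepy.Cursor(api.user_timeline,
                                   screen_name=username,
                                   count=200).items():
            f.write(jsonpickle.encode(tweet._json, unpicklable=False) + '\n')
            count += 1
    print("Downloaded {0} tweets for {1}".format(count, username))
```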