Pull imdbIDs for titles on search list - pandas

Would it be possible to get all the IMDb IDs for titles that meet a search criteria (such as number of votes, language, release year, etc)?
My priority is to compile a list of all the IMDb IDs for titles that are classified as a feature film and have over 25,000 votes (i.e., those eligible to appear on the Top 250 list), as it appears here. At the time of this posting, there are 4,296 films that meet those criteria.
(If you are unfamiliar with IMDb IDs: it is a unique 7-digit code associated with every film/person/character/etc. in the database. For instance, for the movie "Drive" (2011), the IMDb ID is "0780504".)
In the future, however, it would also be helpful to set the search criteria as I see fit, as I can when typing the URL directly (with &num_votes=##, &year=##, &title_type=##, ...).
I have been using IMDBpy with great success to pull information on individual movie titles and would love if this search feature I describe were accessible through that library.
Until now, I have been generating random 7-digit-strings and testing to see if they meet my criteria, but this will be inefficient moving forward because I waste processing time on superfluous IDs.
from imdb import IMDb, IMDbError
import random

i = IMDb(accessSystem='http')

# generate random 7-digit ID strings to test
movies = []
for _ in range(11000):
    randID = str(random.randint(0, 7221897)).zfill(7)
    movies.append(randID)

matches = []
for m in movies:
    try:
        movie = i.get_movie(m)
    except IMDbError as err:
        print(err)
        continue
    if str(movie) == '':
        continue
    kind = movie.get('kind')
    if kind != 'movie':
        continue
    votes = movie.get('votes')
    if votes is None:
        continue
    if votes >= 25000:
        matches.append(m)  # record the IDs that meet the criteria

Take a look at http://www.omdbapi.com/
You can use the API directly to search by title or ID.
In Python 3:
import urllib.request
urllib.request.urlopen("http://www.omdbapi.com/?apikey=27939b55&s=moana").read()
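A minimal follow-up sketch, reusing the demo key above, that decodes the JSON response and pulls the IMDb IDs out of the search results (OMDb returns matches under a "Search" key, each with an "imdbID" field):
import json
import urllib.request

with urllib.request.urlopen("http://www.omdbapi.com/?apikey=27939b55&s=moana") as response:
    data = json.loads(response.read().decode("utf-8"))

# each search hit carries an "imdbID" entry
ids = [entry["imdbID"] for entry in data.get("Search", [])]
print(ids)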

I found a solution using Beautiful Soup, based on a tutorial written by Alexandru Olteanu.
Here is my code:
from requests import get
from bs4 import BeautifulSoup
import re
import math
from time import time, sleep
from random import randint
from IPython.core.display import clear_output
from warnings import warn

url = "http://www.imdb.com/search/title?num_votes=25000,&title_type=feature&view=simple&sort=num_votes,desc&page=1&ref_=adv_nxt"
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)

num_films_text = html_soup.find_all('div', class_ = 'desc')
num_films = re.search(r'of (\d.+) titles', str(num_films_text[0])).group(1)
num_films = int(num_films.replace(',', ''))
print(num_films)

num_pages = math.ceil(num_films / 50)
print(num_pages)

ids = []
start_time = time()
requests = 0

# For every page in the interval
for page in range(1, num_pages + 1):
    # Make a get request
    url = "http://www.imdb.com/search/title?num_votes=25000,&title_type=feature&view=simple&sort=num_votes,desc&page=" + str(page) + "&ref_=adv_nxt"
    response = get(url)
    # Pause the loop
    sleep(randint(8, 15))
    # Monitor the requests
    requests += 1
    sleep(randint(1, 3))
    elapsed_time = time() - start_time
    print('Request: {}; Frequency: {} requests/s'.format(requests, requests / elapsed_time))
    clear_output(wait=True)
    # Throw a warning for non-200 status codes
    if response.status_code != 200:
        warn('Request: {}; Status code: {}'.format(requests, response.status_code))
    # Break the loop if the number of requests is greater than expected
    if requests > num_pages:
        warn('Number of requests was greater than expected.')
        break
    # Parse the content of the request with BeautifulSoup
    page_html = BeautifulSoup(response.text, 'html.parser')
    # Select all the 50 movie containers from a single page
    movie_containers = page_html.find_all('div', class_ = 'lister-item mode-simple')
    # Scrape the ID
    for i in range(len(movie_containers)):
        id = re.search(r'tt(\d+)/', str(movie_containers[i].a)).group(1)
        ids.append(id)

print(ids)
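Since the goal is a reusable list and the question mentions pandas, the scraped IDs can then be dropped into a one-column DataFrame and written out; a small sketch (the file name is just an example):
import pandas as pd

df = pd.DataFrame({'imdb_id': ids})
df.to_csv('imdb_ids.csv', index=False)  # persist the list for later lookups with IMDbPY
print(len(df))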

Related

Implementing a Flask and Dash application. Dash is running inside of Flask. How can I transfer data from Flask to Dash?

I have implemented a Dash and Flask application. Not sure if this is important, but I have configured it so that Dash runs inside of Flask. I am using Flask to query data from a MySQL database, and I am using Dash to create a frontend dashboard. However, I am at a loss on how to pass data from Flask to Dash (if there is a way to do so). I should also mention, because I know it is important to consider, that the data I would like to pass over is a DataFrame. I am hoping someone can shed some light on how to do this.
In terms of all other functionality, everything works. The dashboard portion works fine, and querying the database works fine as well. It is just a matter of seeing whether there is a way to pass data (the queried data) from Flask to the Dash application.
I have many files in this application, but I think the only important ones to show are the routes.py and dashboard.py files.
Below is the routes.py file. The "/dash" route leads to the dashboard app. The rest of the routes were just practice routes for me to see if I am able to query data from the database.
import pandas as pd
from flask import render_template, request, redirect
from dash_package.dashboard import app
from dash_package.functions import *
from dash_package.data import *
from dash_package.database import conn


@app.server.route('/dash')
def dashboard():
    return app.index()


@app.server.route('/')
def index():
    #query = 'SELECT * FROM test_data LIMIT 5;'
    df = pd.DataFrame()
    #df = pd.read_sql(query, conn)
    return render_template('index.html', dataSaved=False, dataFound=False, data=df)


@app.server.route('/save', methods=['GET', 'POST'])
def save():
    if request.method == 'POST':
        game = request.form['save']
        data = [game]
        df = pd.DataFrame(data, columns=['Games'])
        df.to_sql('test_data', conn, if_exists='append', index=False)
        return render_template('index.html', dataSaved=True, dataFound=False, data=df)


@app.server.route('/search', methods=['GET', 'POST'])
def search():
    if request.method == 'POST':
        game = request.form['search']
        print(game)
        query = 'SELECT * FROM test_data WHERE Games = \'' + game + '\''
        #query = 'SELECT * FROM test_data LIMIT 5;'
        print(query)
        df = pd.read_sql(query, conn)
        return render_template('index.html', dataSaved=False, dataFound=True, data=df)
Below is the dashboard.py file. Nothing much going on here; the point was just to see if I could put up a simple dashboard and run it without errors.
import dash
from dash import html, dcc, Input, Output
# import dash_core_components as dcc
# import dash_html_components as html
from dash_package import app
from dash_package.functions import *

app.layout = html.Div([
    html.H2(hello()),
    dcc.Dropdown(['LA', 'NYC', 'MTL'],
                 'LA',
                 id='dropdown'
                 ),
    html.Div(id='display-value')
])


@app.callback(Output('display-value', 'children'),
              [Input('dropdown', 'value')])
def display_value(value):
    return f'You have selected {value}'
I have basically done a lot of research on this, but to my surprise it was very difficult to find anything about it. I truly spent hours trying to find some kind of hint.
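For what it's worth, a minimal hedged sketch of one way the query result could reach Dash when it runs inside the same Flask process: let a Dash callback reuse the conn from routes.py and hand the DataFrame to the layout through a dcc.Store as JSON. The table name, the %s placeholder (MySQL-style drivers), and the component ids are assumptions for illustration only:
import pandas as pd
from dash import html, dcc, Input, Output
from dash_package import app
from dash_package.database import conn

app.layout = html.Div([
    dcc.Input(id='game-name', type='text', value=''),
    dcc.Store(id='query-result'),      # holds the queried DataFrame, serialized as JSON
    html.Div(id='table-output'),
])


@app.callback(Output('query-result', 'data'), [Input('game-name', 'value')])
def run_query(game):
    # parameterized query; assumes the same test_data table as routes.py
    df = pd.read_sql('SELECT * FROM test_data WHERE Games = %s', conn, params=[game])
    return df.to_json(orient='split')


@app.callback(Output('table-output', 'children'), [Input('query-result', 'data')])
def show_table(data):
    if not data:
        return ''
    df = pd.read_json(data, orient='split')
    return html.Pre(df.to_string(index=False))
Because Dash is mounted on the Flask server, both sides can import the same database connection; the dcc.Store is only needed to move the frame between callbacks in the browser session.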

How to feed an audio file from S3 bucket directly to Google speech-to-text

We are developing a speech application using Google's speech-to-text API. Our data (audio files) is stored in an S3 bucket on AWS. Is there a way to pass the S3 URI directly to Google's speech-to-text API?
From their documentation, it seems this is currently not possible with Google's speech-to-text API.
This is not the case for their vision and NLP APIs.
Any idea why this limitation exists for the speech APIs?
And what's a good workaround for it?
Currently, Google only allows audio files from either a local source or from Google's Cloud Storage. No reasonable explanation is given in the documentation about this.
Passing audio referenced by a URI
More typically, you will pass a uri parameter within the Speech request's audio field, pointing to an audio file (in binary format, not base64) located on Google Cloud Storage
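Since audio from a local source is accepted as raw content, one hedged workaround is to pull the object out of S3 into memory first and pass the bytes directly. A minimal sketch, where the bucket/key names and the LINEAR16 16 kHz format are assumptions, and the synchronous recognize() call only suits short clips:
import boto3
from google.cloud import speech

# pull the object out of S3 into memory (bucket and key are placeholders)
s3 = boto3.client("s3")
audio_bytes = s3.get_object(Bucket="my-bucket", Key="audio/sample.wav")["Body"].read()

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)
audio = speech.RecognitionAudio(content=audio_bytes)  # raw content instead of a gs:// uri

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)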
I suggest you move your files to Google Cloud Storage. If you don't want to, there is another good workaround:
Use the Google Cloud Speech API's streaming recognition. You are not required to store anything anywhere; your speech application takes its input from any microphone. And don't worry if you don't know how to handle microphone input.
Google provides sample code that does it all:
# [START speech_transcribe_streaming_mic]
from __future__ import division

import re
import sys

from google.cloud import speech

import pyaudio
from six.moves import queue

# Audio recording parameters
RATE = 16000
CHUNK = int(RATE / 10)  # 100ms


class MicrophoneStream(object):
    """Opens a recording stream as a generator yielding the audio chunks."""

    def __init__(self, rate, chunk):
        self._rate = rate
        self._chunk = chunk

        # Create a thread-safe buffer of audio data
        self._buff = queue.Queue()
        self.closed = True

    def __enter__(self):
        self._audio_interface = pyaudio.PyAudio()
        self._audio_stream = self._audio_interface.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=self._rate,
            input=True,
            frames_per_buffer=self._chunk,
            # Run the audio stream asynchronously to fill the buffer object.
            # This is necessary so that the input device's buffer doesn't
            # overflow while the calling thread makes network requests, etc.
            stream_callback=self._fill_buffer,
        )

        self.closed = False

        return self

    def __exit__(self, type, value, traceback):
        self._audio_stream.stop_stream()
        self._audio_stream.close()
        self.closed = True
        # Signal the generator to terminate so that the client's
        # streaming_recognize method will not block the process termination.
        self._buff.put(None)
        self._audio_interface.terminate()

    def _fill_buffer(self, in_data, frame_count, time_info, status_flags):
        """Continuously collect data from the audio stream, into the buffer."""
        self._buff.put(in_data)
        return None, pyaudio.paContinue

    def generator(self):
        while not self.closed:
            # Use a blocking get() to ensure there's at least one chunk of
            # data, and stop iteration if the chunk is None, indicating the
            # end of the audio stream.
            chunk = self._buff.get()
            if chunk is None:
                return
            data = [chunk]

            # Now consume whatever other data's still buffered.
            while True:
                try:
                    chunk = self._buff.get(block=False)
                    if chunk is None:
                        return
                    data.append(chunk)
                except queue.Empty:
                    break

            yield b"".join(data)


def listen_print_loop(responses):
    """Iterates through server responses and prints them.

    The responses passed is a generator that will block until a response
    is provided by the server.

    Each response may contain multiple results, and each result may contain
    multiple alternatives; for details, see the documentation. Here we
    print only the transcription for the top alternative of the top result.

    In this case, responses are provided for interim results as well. If the
    response is an interim one, print a line feed at the end of it, to allow
    the next result to overwrite it, until the response is a final one. For the
    final one, print a newline to preserve the finalized transcription.
    """
    num_chars_printed = 0
    for response in responses:
        if not response.results:
            continue

        # The `results` list is consecutive. For streaming, we only care about
        # the first result being considered, since once it's `is_final`, it
        # moves on to considering the next utterance.
        result = response.results[0]
        if not result.alternatives:
            continue

        # Display the transcription of the top alternative.
        transcript = result.alternatives[0].transcript

        # Display interim results, but with a carriage return at the end of the
        # line, so subsequent lines will overwrite them.
        #
        # If the previous result was longer than this one, we need to print
        # some extra spaces to overwrite the previous result
        overwrite_chars = " " * (num_chars_printed - len(transcript))

        if not result.is_final:
            sys.stdout.write(transcript + overwrite_chars + "\r")
            sys.stdout.flush()

            num_chars_printed = len(transcript)

        else:
            print(transcript + overwrite_chars)

            # Exit recognition if any of the transcribed phrases could be
            # one of our keywords.
            if re.search(r"\b(exit|quit)\b", transcript, re.I):
                print("Exiting..")
                break

            num_chars_printed = 0


def main():
    language_code = "en-US"  # a BCP-47 language tag

    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=RATE,
        language_code=language_code,
    )

    streaming_config = speech.StreamingRecognitionConfig(
        config=config, interim_results=True
    )

    with MicrophoneStream(RATE, CHUNK) as stream:
        audio_generator = stream.generator()
        requests = (
            speech.StreamingRecognizeRequest(audio_content=content)
            for content in audio_generator
        )

        responses = client.streaming_recognize(streaming_config, requests)

        # Now, put the transcription responses to use.
        listen_print_loop(responses)


if __name__ == "__main__":
    main()
# [END speech_transcribe_streaming_mic]
Dependencies are google-cloud-speech and pyaudio
For AWS S3, you can store your audio files there before/after you get the transcripts from Google Speech API.
Streaming is super fast as well.
And don't forget to include your credentials. You need to get authorized first by setting the GOOGLE_APPLICATION_CREDENTIALS environment variable.

Cleaning up data scraped from website and constructing clean Pandas Dataframe

I am practicing my web scraping and am having a tough time cleaning the data and putting it into a DataFrame to manipulate later. My code is something like:
import requests as re
import urllib.request as ure
import time
from bs4 import BeautifulSoup as soup
import pandas as pd
myURL = "http://naturalstattrick.com/games.php"
reURL = re.get(myURL)
mySoup = soup(reURL.content, 'html.parser')
print(mySoup)
From that, I want to isolate the date, teams, and score, which always begins with <b>, followed by space-hyphen-space, then the away team (which can be 1 of 31 teams), a space, awayTeamScore, a comma and a space, homeTeam, a space, homeTeamScore, and ends with </b>.
Then I want to isolate all of the numeric data that starts with <td> and ends with </td> into their own columns, but alongside the record of the game.
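A hedged sketch of that first parsing step, building on the mySoup object above and assuming the game headers inside <b> tags really do follow the "date - awayTeam awayScore, homeTeam homeScore" layout described (the regex and column names are guesses, not taken from the site):
import re  # standard regex module (note: the snippet above aliases requests as "re"; renaming that alias avoids a clash)
import pandas as pd

rows = []
for b in mySoup.find_all('b'):
    # expected shape: "<date> - <away team> <away score>, <home team> <home score>"
    m = re.match(r'\s*(.+?)\s+-\s+(.+?)\s+(\d+),\s+(.+?)\s+(\d+)\s*$', b.get_text())
    if m:
        date, away, away_score, home, home_score = m.groups()
        rows.append({'date': date, 'away': away, 'away_score': int(away_score),
                     'home': home, 'home_score': int(home_score)})

games = pd.DataFrame(rows)
print(games.head())
The <td> cells could then be collected per game in the same loop and concatenated column-wise onto this frame.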

Get info of exposed models in Tensorflow Serving

Once I have a TF server serving multiple models, is there a way to query such a server to know which models are served?
Would it then be possible to get information about each of these models, such as name, interface and, even more importantly, which versions of a model are present on the server and could potentially be served?
It is really hard to find information about this, but it is possible to get some model metadata.
import grpc
from tensorflow_serving.apis import get_model_metadata_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

channel = grpc.insecure_channel('localhost:8500')  # adjust host/port for your deployment
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
request = get_model_metadata_pb2.GetModelMetadataRequest()
request.model_spec.name = 'your_model_name'
request.metadata_field.append("signature_def")
response = stub.GetModelMetadata(request, 10)
print(response.model_spec.version.value)
print(response.metadata['signature_def'])
Hope it helps.
Update
It is also possible to get this information from the REST API. Just issue a GET request to:
http://{serving_url}:8501/v1/models/{your_model_name}/metadata
The result is JSON, where you can easily find the model specification and the signature definitions.
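For example, from Python (the host, port, and model name are placeholders for your own deployment):
import requests

resp = requests.get("http://localhost:8501/v1/models/your_model_name/metadata")
resp.raise_for_status()
metadata = resp.json()
print(metadata["model_spec"])                 # served model name and version
print(metadata["metadata"]["signature_def"])  # input/output signatures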
It is possible to get the model status as well as the model metadata. In the other answer only the metadata is requested, and the response's metadata['signature_def'] field still needs to be decoded.
I found that the solution is to use the built-in protobuf method MessageToJson() to convert the message to a JSON string, which can then be converted to a Python dictionary with json.loads().
import grpc
import json
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc
from tensorflow_serving.apis import model_service_pb2_grpc
from tensorflow_serving.apis import get_model_status_pb2
from tensorflow_serving.apis import get_model_metadata_pb2
from google.protobuf.json_format import MessageToJson

PORT = 8500
model = "your_model_name"

channel = grpc.insecure_channel('localhost:{}'.format(PORT))

# Model status is served by the ModelService stub
model_stub = model_service_pb2_grpc.ModelServiceStub(channel)
request = get_model_status_pb2.GetModelStatusRequest()
request.model_spec.name = model
result = model_stub.GetModelStatus(request, 5)  # 5 secs timeout
print("Model status:")
print(result)

# Model metadata is served by the PredictionService stub
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
request = get_model_metadata_pb2.GetModelMetadataRequest()
request.model_spec.name = model
request.metadata_field.append("signature_def")
result = stub.GetModelMetadata(request, 5)  # 5 secs timeout
result = json.loads(MessageToJson(result))
print("Model metadata:")
print(result)
To continue the decoding process, either follow Tyler's approach and convert the message to JSON, or, more natively, Unpack it into a SignatureDefMap and take it from there:
signature_def_map = get_model_metadata_pb2.SignatureDefMap()
response.metadata['signature_def'].Unpack(signature_def_map)
print(signature_def_map.signature_def.keys())
To request data using the REST API, for additional information about the particular model being served, you can issue (via curl, Postman, etc.):
GET http://host:port/v1/models/${MODEL_NAME}
GET http://host:port/v1/models/${MODEL_NAME}/metadata
For more information, please check https://www.tensorflow.org/tfx/serving/api_rest

Run scrapy on a set of hundred plus urls

I need to download the CPU and GPU data of a set of phones from gsmarena. As a first step, I downloaded the URLs of those phones by running scrapy and deleted the unnecessary items.
The code for that is below.
# -*- coding: utf-8 -*-
from scrapy.selector import Selector
from scrapy import Spider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from gsmarena_data.items import gsmArenaDataItem


class MobileInfoSpider(Spider):
    name = "mobile_info"
    allowed_domains = ["gsmarena.com"]
    start_urls = (
        # 'http://www.gsmarena.com/samsung-phones-f-9-10.php',
        # 'http://www.gsmarena.com/apple-phones-48.php',
        # 'http://www.gsmarena.com/microsoft-phones-64.php',
        # 'http://www.gsmarena.com/nokia-phones-1.php',
        # 'http://www.gsmarena.com/sony-phones-7.php',
        # 'http://www.gsmarena.com/lg-phones-20.php',
        # 'http://www.gsmarena.com/htc-phones-45.php',
        # 'http://www.gsmarena.com/motorola-phones-4.php',
        # 'http://www.gsmarena.com/huawei-phones-58.php',
        # 'http://www.gsmarena.com/lenovo-phones-73.php',
        # 'http://www.gsmarena.com/xiaomi-phones-80.php',
        # 'http://www.gsmarena.com/acer-phones-59.php',
        # 'http://www.gsmarena.com/asus-phones-46.php',
        # 'http://www.gsmarena.com/oppo-phones-82.php',
        # 'http://www.gsmarena.com/blackberry-phones-36.php',
        # 'http://www.gsmarena.com/alcatel-phones-5.php',
        # 'http://www.gsmarena.com/xolo-phones-85.php',
        # 'http://www.gsmarena.com/lava-phones-94.php',
        # 'http://www.gsmarena.com/micromax-phones-66.php',
        # 'http://www.gsmarena.com/spice-phones-68.php',
        'http://www.gsmarena.com/gionee-phones-92.php',
    )

    def parse(self, response):
        phone = gsmArenaDataItem()
        hxs = Selector(response)
        phone_listings = hxs.css('.makers')
        for phone_listing in phone_listings:
            phone['model'] = phone_listing.xpath("ul/li/a/strong/text()").extract()
            phone['link'] = phone_listing.xpath("ul/li/a/@href").extract()
            yield phone
Now, I need to run scrapy on that set of URLs to get the CPU and GPU data. All of that info comes under the CSS selector '.ttl'.
Kindly guide me on how to loop scrapy over the set of URLs and output the data into a single csv or json. I'm well aware of how to create items and use CSS selectors; I need help with how to loop over those hundred-plus pages.
I have a list of urls like:
www.gsmarena.com/samsung_galaxy_s5_cdma-6338.php
www.gsmarena.com/samsung_galaxy_s5-6033.php
www.gsmarena.com/samsung_galaxy_core_lte_g386w-6846.php
www.gsmarena.com/samsung_galaxy_core_lte-6099.php
www.gsmarena.com/acer_iconia_one_8_b1_820-7217.php
www.gsmarena.com/acer_iconia_tab_a3_a20-7136.php
www.gsmarena.com/microsoft_lumia_640_dual_sim-7082.php
www.gsmarena.com/microsoft_lumia_532_dual_sim-6951.php
These are the links to the phone description pages on gsmarena.
Now I need to download the CPU and GPU info of the 100 models I have.
I extracted the URLs of those 100 models for which the data is required.
The spider written for this is:
from scrapy.selector import Selector
from scrapy import Spider
from gsmarena_data.items import gsmArenaDataItem


class MobileInfoSpider(Spider):
    name = "cpu_gpu_info"
    allowed_domains = ["gsmarena.com"]
    start_urls = (
        "http://www.gsmarena.com/microsoft_lumia_435_dual_sim-6949.php",
        "http://www.gsmarena.com/microsoft_lumia_435-6942.php",
        "http://www.gsmarena.com/microsoft_lumia_535_dual_sim-6792.php",
        "http://www.gsmarena.com/microsoft_lumia_535-6791.php",
    )

    def parse(self, response):
        phone = gsmArenaDataItem()
        hxs = Selector(response)
        cpu_gpu = hxs.css('.ttl')
        for entry in cpu_gpu:
            phone['cpu'] = entry.xpath("ul/li/a/strong/text()").extract()
            phone['gpu'] = entry.xpath("ul/li/a/@href").extract()
            yield phone
If I could somehow run the spider on the URLs for which I want to extract this data, I could get the required data in a single csv file.
I think you need information for every vendor. If so, you don't have to put those hundreds of URLs in start_urls; instead, you can use this link as the start URL and then, in parse(), extract those URLs programmatically and process whatever you want.
This answer will help you do so.
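A hedged sketch of that pattern: start from one maker page, follow every phone link, and scrape the detail page. The '.makers' and '.ttl' selectors come from the question itself; the 'td.nfo' value cells and the exact label texts are assumptions about gsmarena's markup.
from scrapy import Spider, Request
from gsmarena_data.items import gsmArenaDataItem


class CpuGpuSpider(Spider):
    name = "cpu_gpu_info"
    allowed_domains = ["gsmarena.com"]
    start_urls = ("http://www.gsmarena.com/gionee-phones-92.php",)

    def parse(self, response):
        # follow every phone listed on the maker page
        for href in response.css(".makers ul li a::attr(href)").extract():
            yield Request(response.urljoin(href), callback=self.parse_phone)

    def parse_phone(self, response):
        phone = gsmArenaDataItem()
        # pair each spec label (.ttl) with its value cell (td.nfo is an assumption)
        labels = response.css("td.ttl a::text").extract()
        values = [v.strip() for v in response.css("td.nfo::text").extract()]
        specs = dict(zip(labels, values))
        phone['cpu'] = specs.get('CPU')
        phone['gpu'] = specs.get('GPU')
        yield phone
Running it with scrapy crawl cpu_gpu_info -o cpu_gpu.csv would then collect everything into a single CSV file.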