I am practicing my web scraping and am having a tough time cleaning data and putting it into a DataFrame to later manipulate. My code is something like:
import requests as re
import urllib.request as ure
import time
from bs4 import BeautifulSoup as soup
import pandas as pd
myURL = "http://naturalstattrick.com/games.php"
reURL = re.get(myURL)
mySoup = soup(reURL.content, 'html.parser')
print(mySoup)
From that, I want to isolate the date, teams, and score. That text always begins with <b>, followed by " - ", then the away team (which can be 1 of 31 teams), a space, the away team's score, ", ", the home team, a space, the home team's score, and ends with </b>.
Then I want to isolate all of the numeric data between <td> and </td> tags into its own columns, alongside the record of each game.
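A minimal sketch of one way to pull those pieces out with BeautifulSoup and load them into a DataFrame. The header pattern ("date - away team score, home team score" inside each <b> tag) and the flat list of <td> cells are assumptions about the page layout, so the grouping of cells per game will still need adjusting:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "http://naturalstattrick.com/games.php"
page = BeautifulSoup(requests.get(url).content, "html.parser")

# Assumed header format inside each <b> tag:
# "2023-04-01 - Away Team 3, Home Team 2"
pattern = re.compile(r"(.+) - (.+) (\d+), (.+) (\d+)")

games = []
for b in page.find_all("b"):
    m = pattern.match(b.get_text(strip=True))
    if m:
        date, away, away_score, home, home_score = m.groups()
        games.append({"date": date, "away": away, "away_score": int(away_score),
                      "home": home, "home_score": int(home_score)})

games_df = pd.DataFrame(games)
print(games_df.head())

# The numeric stats live in <td> cells; they still need to be grouped per game
cells = [td.get_text(strip=True) for td in page.find_all("td")]
print(cells[:20])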
I have created a list (df) that contains some DataFrames after importing CSV files. Instead of accessing these DataFrames with df[0], df[1], etc., I would like to access them in a much easier way, with something like df['20/04/22'] or df[date=='20/04/22'] or something similar. I am really new to Python and programming, so thank you very much in advance. I attach simplified code (containing only 2 items in the list) for simplicity.
I came up with two ways of achieving that, but I have trouble realizing each of them:
Through my file names: each CSV (DataFrame) file name includes the date, something like "5f05d5d83a442d4f78db0a19_2022-04-01.csv".
Each CSV (DataFrame) includes a date column (object type), which I have converted to datetime64 so I can work with plots. So I thought that maybe what I am asking would be possible through this column.
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import datetime
from datetime import date
from datetime import time
from pandas.tseries.offsets import DateOffset
import glob
import os
path = "C:/Users/dsdadsdsaa/"
all_files = glob.glob(path + '*.csv')
df = []
for filename in all_files:
    dataframe = pd.read_csv(filename, index_col=None, header=0)
    df.append(dataframe)

for i in range(0,2):
    df[i]['date'] = pd.to_datetime(df[i]['date'])
    df[i]['time'] = pd.to_datetime(df[i]['time'])
df[0]
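One way to get the df['2022-04-01'] style of access (a sketch, assuming the date always sits between the underscore and ".csv" in the file name): build a dict of DataFrames keyed by that date string instead of a list.
import os
import glob
import pandas as pd

path = "C:/Users/dsdadsdsaa/"
df = {}
for filename in glob.glob(path + '*.csv'):
    # "5f05d5d83a442d4f78db0a19_2022-04-01.csv" -> "2022-04-01"
    date_key = os.path.basename(filename).split('_')[1].replace('.csv', '')
    frame = pd.read_csv(filename, index_col=None, header=0)
    frame['date'] = pd.to_datetime(frame['date'])
    frame['time'] = pd.to_datetime(frame['time'])
    df[date_key] = frame

df['2022-04-01']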
In Jupyter, I am running a long-running computation.
I want to show a Pandas table with the top 25 rows. The top 25 may update each iteration.
However, I don't want to show many Pandas tables. I want to delete / update the existing displayed Pandas table.
How is this possible?
This approach seems to work for matplotlib figures, but not for pandas' pretty tables.
You can use clear_output and display the dataframe:
from IPython.display import display, clear_output
import pandas as pd
import numpy as np
# this is just to simulate the delay
import time

i = 1
while i < 10:
    time.sleep(1)
    df = pd.DataFrame(np.random.rand(4, 3))
    clear_output(wait=True)
    display(df)
    i += 1
My DataFrame originally comes from a text file in which the columns are separated by tabs.
I first changed these tabs to spaces by hand, loaded the data with sep=" ", and plotted it; the plot looked the way it should.
Since I have multiple files to plot, it's not really handy to change the separator of each file by hand. That's why I changed the separator to sep="\s+".
Suddenly the x-axis of my new plot shows every single Position value and they overlap.
Does anyone know why this is happening and how to prevent it?
My first code looked like:
import pandas as pd
import numpy as np
from functools import reduce
from matplotlib import pyplot as plt

data1500 = pd.read_csv('V026-15.000-0.1.txt', sep=" ", index_col='Position')
plt.plot(data_merged1.ts1500, label="ts 15.00")
and the second:
import pandas as pd
import numpy as np
from functools import reduce
from matplotlib import pyplot as plt
data1500 = pd.read_csv('V025-15.000-0.5.txt', sep=r"\s+", index_col='Position')
plt.plot(data_merged2.ts1500, label="ts 15.00")
You could do this to import a tab-delimited file:
import re

with open('V026-15.000-0.1.txt') as f:
    data = [re.split(r'\t', x) for x in f.read().split('\n')]
or do this:
import csv

with open('data.txt', newline='') as mytext:
    # materialize the rows before the file closes
    data = list(csv.reader(mytext, delimiter='\t'))
Then, to plot your data, do as follows:
Read each row of the data in a for loop.
Append the required columns into lists.
After reading the whole file, plot the required data,
something like this:
x, y = [], []
for row in data:
    x.append(float(row[0]))  # convert to numbers so the x-axis is numeric
    y.append(float(row[1]))
plt.plot(x, y)
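Alternatively, pandas can read the original tab-delimited files directly, which avoids any hand editing; a minimal sketch, assuming the files really are tab-separated and Position holds numeric values:
import pandas as pd
import matplotlib.pyplot as plt

# sep='\t' handles the tabs without replacing them by hand
data1500 = pd.read_csv('V026-15.000-0.1.txt', sep='\t', index_col='Position')
data1500.plot()  # Position becomes the numeric x-axis
plt.show()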
Below I describe the issue I have.
Description
I want to simple fetch all stocks from the URL: https://www.di.se/bors/large-cap/
I do this from a very slow computer with a small screen (15"); zoom is also set to 150% in Windows.
I want to do this in selenium headless mode by Java.
Problem
Not all stocks are visible, either on the screen or in the element inspector.
I try to fetch all stocks with the line:
driver.findElement(By.tagName("body")).getText();
This command doesn't return all stocks. If I go to the end of the page and Page Up to the end of the stock list, I can see "getting more data" (in my language, Swedish: "Hämtar mer data") at the end of the stock list; the complete list of all stocks should end with Wihlborgs Fastigheter.
Inspect of current element gives:
<p class="instrument-table__load-more-info">Hämtar mer data...</p>
To load more stocks into the page, I have to scroll it.
Question
How to fetch all stocks in headless mode in Java?
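The page lazy-loads rows, so even in headless mode the browser has to be scrolled until the "Hämtar mer data..." loader stops adding rows. Below is a minimal sketch of that scroll-until-stable approach using Selenium in Python; the same logic carries over to Java via JavascriptExecutor.executeScript. The window size, sleep times, and loop condition are assumptions to illustrate the idea, not a tested implementation:
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless=new")
options.add_argument("--window-size=1920,3000")  # a tall window helps lazy loading
driver = webdriver.Chrome(options=options)
driver.get("https://www.di.se/bors/large-cap/")

last_height = 0
while True:
    # Scroll to the bottom so the site requests the next chunk of rows
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give "Hämtar mer data..." time to finish
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # no more rows were added
    last_height = new_height

print(driver.find_element(By.TAG_NAME, "body").text)
driver.quit()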
Can you simply download your data from the Yahoo Finance API?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.optimize as sco
import datetime as dt
import math
from datetime import datetime, timedelta
from pandas_datareader import data as wb
from sklearn.cluster import KMeans
np.random.seed(777)
import yfinance as yf
start = '2018-06-30'
end = '2020-06-30'
tickers = ['MSFT','AAPL','GOOG']
thelen = len(tickers)
price_data = []
for ticker in tickers:
    data = yf.download(ticker, start, end)
    data = data.reset_index()
    price_data.append(data)
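To turn those per-ticker downloads into one table (for example, one column of closing prices per ticker), the frames can be combined afterwards; a quick sketch, assuming yfinance's usual Date/Close column names (which may vary by version):
# Tag each frame with its ticker, then stack and pivot
for ticker, frame in zip(tickers, price_data):
    frame['Ticker'] = ticker

all_prices = pd.concat(price_data, ignore_index=True)
closes = all_prices.pivot(index='Date', columns='Ticker', values='Close')
print(closes.head())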
Would it be possible to get all the IMDb IDs for titles that meet a search criteria (such as number of votes, language, release year, etc)?
My priority is to compile a list of all the IMDb IDs that are classified as feature films and have over 25,000 votes (i.e., those eligible to appear on the Top 250 list), as it appears here. At the time of this posting, there are 4,296 films that meet those criteria.
(If you are unfamiliar with IMDb IDs: it is a unique 7-digit code associated with every film/person/character/etc in the database. For instance, for the movie "Drive" (2011), the IMDb ID is "0780504".)
However, in the future it would be helpful to set the search criteria as I see fit, as I can when typing the URL (with &num_votes=##, &year=##, &title_type=##, ...).
I have been using IMDBpy with great success to pull information on individual movie titles and would love if this search feature I describe were accessible through that library.
Until now, I have been generating random 7-digit strings and testing whether they meet my criteria, but this is inefficient moving forward because I waste processing time on superfluous IDs.
from imdb import IMDb, IMDbError
import random

i = IMDb(accessSystem='http')

movies = []
for _ in range(11000):
    randID = str(random.randint(0, 7221897)).zfill(7)
    movies.append(randID)

for m in movies:
    try:
        movie = i.get_movie(m)
    except IMDbError as err:
        print(err)
        continue
    if str(movie) == '':
        continue
    kind = movie.get('kind')
    if kind != 'movie':
        continue
    votes = movie.get('votes')
    if votes is None:
        continue
    if votes >= 25000:
        print(m)  # this ID meets the criteria
Take a look at http://www.omdbapi.com/
You can use the API directly, to search by title or ID.
In Python 3:
import urllib.request
urllib.request.urlopen("http://www.omdbapi.com/?apikey=27939b55&s=moana").read()
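For example, to decode that response and pull out the IMDb IDs (assuming OMDb's usual JSON layout, where the search hits sit under a "Search" key):
import json
import urllib.request

url = "http://www.omdbapi.com/?apikey=27939b55&s=moana"
with urllib.request.urlopen(url) as resp:
    results = json.loads(resp.read().decode())

# Each search hit carries its IMDb ID under "imdbID"
for hit in results.get("Search", []):
    print(hit["imdbID"], hit["Title"])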
I found a solution using Beautiful Soup, based on a tutorial written by Alexandru Olteanu.
Here is my code:
from requests import get
from bs4 import BeautifulSoup
import re
import math
from time import time, sleep
from random import randint
from IPython.core.display import clear_output
from warnings import warn
url = "http://www.imdb.com/search/title?num_votes=25000,&title_type=feature&view=simple&sort=num_votes,desc&page=1&ref_=adv_nxt"
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
num_films_text = html_soup.find_all('div', class_ = 'desc')
num_films = re.search(r'of (\d.+) titles', str(num_films_text[0])).group(1)
num_films = int(num_films.replace(',', ''))
print(num_films)
num_pages = math.ceil(num_films/50)
print(num_pages)
ids = []
start_time = time()
requests = 0
# For every page in the interval
for page in range(1, num_pages+1):
    # Make a get request
    url = "http://www.imdb.com/search/title?num_votes=25000,&title_type=feature&view=simple&sort=num_votes,desc&page=" + str(page) + "&ref_=adv_nxt"
    response = get(url)
    # Pause the loop
    sleep(randint(8, 15))
    # Monitor the requests
    requests += 1
    sleep(randint(1, 3))
    elapsed_time = time() - start_time
    print('Request: {}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))
    clear_output(wait=True)
    # Throw a warning for non-200 status codes
    if response.status_code != 200:
        warn('Request: {}; Status code: {}'.format(requests, response.status_code))
    # Break the loop if the number of requests is greater than expected
    if requests > num_pages:
        warn('Number of requests was greater than expected.')
        break
    # Parse the content of the request with BeautifulSoup
    page_html = BeautifulSoup(response.text, 'html.parser')
    # Select all the 50 movie containers from a single page
    movie_containers = page_html.find_all('div', class_='lister-item mode-simple')
    # Scrape the ID
    for i in range(len(movie_containers)):
        id = re.search(r'tt(\d+)/', str(movie_containers[i].a)).group(1)
        ids.append(id)
print(ids)
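Those scraped IDs can then go straight back into IMDbPy instead of the random 7-digit guesses; a small sketch reusing the ids list built above (variable names here are just for illustration):
from imdb import IMDb

ia = IMDb(accessSystem='http')
# Look up full details for the first few scraped IDs
for movie_id in ids[:5]:
    movie = ia.get_movie(movie_id)
    print(movie_id, movie.get('title'), movie.get('votes'))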