How can I optimise this webscraping code for iterative loop? - pandas

This code scrapes www.oddsportal.com for all the URLs provided in the code and appends the results to a dataframe.
I am not very well versed in iterative logic, hence I am finding it difficult to improve on it.
Code:
import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup as bs

browser = webdriver.Chrome()


class GameData:
    def __init__(self):
        self.date = []
        self.time = []
        self.game = []
        self.score = []
        self.home_odds = []
        self.draw_odds = []
        self.away_odds = []
        self.country = []
        self.league = []


def parse_data(url):
    browser.get(url)
    df = pd.read_html(browser.page_source, header=0)[0]
    html = browser.page_source
    soup = bs(html, "lxml")
    cont = soup.find('div', {'id': 'wrap'})
    content = cont.find('div', {'id': 'col-content'})
    content = content.find('table', {'class': 'table-main'}, {'id': 'tournamentTable'})
    main = content.find('th', {'class': 'first2 tl'})
    if main is None:
        return None
    count = main.findAll('a')
    country = count[1].text
    league = count[2].text
    game_data = GameData()
    game_date = None
    for row in df.itertuples():
        if not isinstance(row[1], str):
            continue
        elif ':' not in row[1]:
            game_date = row[1].split('-')[0]
            continue
        game_data.date.append(game_date)
        game_data.time.append(row[1])
        game_data.game.append(row[2])
        game_data.score.append(row[3])
        game_data.home_odds.append(row[4])
        game_data.draw_odds.append(row[5])
        game_data.away_odds.append(row[6])
        game_data.country.append(country)
        game_data.league.append(league)
    return game_data
urls = {
"https://www.oddsportal.com/soccer/england/premier-league/results/#/page/1",
"https://www.oddsportal.com/soccer/england/premier-league/results/#/page/2",
"https://www.oddsportal.com/soccer/england/premier-league/results/#/page/3",
"https://www.oddsportal.com/soccer/england/premier-league/results/#/page/4",
"https://www.oddsportal.com/soccer/england/premier-league/results/#/page/5",
"https://www.oddsportal.com/soccer/england/premier-league/results/#/page/6",
"https://www.oddsportal.com/soccer/england/premier-league/results/#/page/7",
"https://www.oddsportal.com/soccer/england/premier-league/results/#/page/8",
"https://www.oddsportal.com/soccer/england/premier-league/results/#/page/9",
"https://www.oddsportal.com/soccer/england/premier-league-2019-2020/results/#/page/1",
"https://www.oddsportal.com/soccer/england/premier-league-2019-2020/results/#/page/2",
"https://www.oddsportal.com/soccer/england/premier-league-2019-2020/results/#/page/3",
"https://www.oddsportal.com/soccer/england/premier-league-2019-2020/results/#/page/4",
"https://www.oddsportal.com/soccer/england/premier-league-2019-2020/results/#/page/5",
"https://www.oddsportal.com/soccer/england/premier-league-2019-2020/results/#/page/6",
"https://www.oddsportal.com/soccer/england/premier-league-2019-2020/results/#/page/7",
"https://www.oddsportal.com/soccer/england/premier-league-2019-2020/results/#/page/8",
"https://www.oddsportal.com/soccer/england/premier-league-2019-2020/results/#/page/9",
"https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/1",
"https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/2",
"https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/3",
"https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/4",
"https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/5",
"https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/6",
"https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/7",
"https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/8",
"https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/9",
}
if __name__ == '__main__':
    results = None
    for url in urls:
        game_data = parse_data(url)
        if game_data is None:
            continue
        result = pd.DataFrame(game_data.__dict__)
        if results is None:
            results = result
        else:
            results = results.append(result, ignore_index=True)
    print(results)
| | date | time | game | score | home_odds | draw_odds | away_odds | country | league |
|-----|-------------|--------|----------------------------------|---------|-------------|-------------|-------------|-----------|--------------------------|
| 0 | 12 May 2019 | 14:00 | Brighton - Manchester City | 1:4 | 14.95 | 7.75 | 1.2 | England | Premier League 2018/2019 |
| 1 | 12 May 2019 | 14:00 | Burnley - Arsenal | 1:3 | 2.54 | 3.65 | 2.75 | England | Premier League 2018/2019 |
| 2 | 12 May 2019 | 14:00 | Crystal Palace - Bournemouth | 5:3 | 1.77 | 4.32 | 4.22 | England | Premier League 2018/2019 |
| 3 | 12 May 2019 | 14:00 | Fulham - Newcastle | 0:4 | 2.45 | 3.55 | 2.92 | England | Premier League 2018/2019 |
| 4 | 12 May 2019 | 14:00 | Leicester - Chelsea | 0:0 | 2.41 | 3.65 | 2.91 | England | Premier League 2018/2019 |
| 5 | 12 May 2019 | 14:00 | Liverpool - Wolves | 2:0 | 1.31 | 5.84 | 10.08 | England | Premier League 2018/2019 |
| 6 | 12 May 2019 | 14:00 | Manchester Utd - Cardiff | 0:2 | 1.3 | 6.09 | 9.78 | England | Premier League 2018/2019 |
As you can see, the URLs could be generated so the code runs through all pages in that league/branch.
[screenshot: inspect element]
If I could adapt this code to run iteratively for every page, as in the inspect-element screenshot below, that would be really useful and helpful.
[screenshot: inspect element]
How can I optimise this code to iteratively run for every page?
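A minimal sketch of what "iteratively" could look like for the URL list (not working site code; the season slugs and the page count of 9 are assumptions taken from the hard-coded URLs above):
# hedged sketch: build the page URLs in a loop instead of listing them by hand;
# the seasons and the maximum page number (9) are assumptions from the URLs above
seasons = ["premier-league", "premier-league-2019-2020", "premier-league-2018-2019"]
base = "https://www.oddsportal.com/soccer/england/{season}/results/#/page/{page}"
urls = {base.format(season=s, page=p) for s in seasons for p in range(1, 10)}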

Related

How to pivot columns so they turn into rows using PySpark or pandas?

I have a dataframe that looks like the one below, but with hundreds of rows. I need to pivot it so that each column after Region becomes a row, like the other table below.
+--------------+----------+---------------------+----------+------------------+------------------+-----------------+
|city |city_tier | city_classification | Region | Jan-2022-orders | Feb-2022-orders | Mar-2022-orders|
+--------------+----------+---------------------+----------+------------------+------------------+-----------------+
|new york | large | alpha | NE | 100000 |195000 | 237000 |
|los angeles | large | alpha | W | 330000 |400000 | 580000 |
I need to pivot it using PySpark, so I end up with something like this:
+--------------+----------+---------------------+----------+-----------+---------+
|city |city_tier | city_classification | Region | month | orders |
+--------------+----------+---------------------+----------+-----------+---------+
|new york | large | alpha | NE | Jan-2022 | 100000 |
|new york | large | alpha | NE | Fev-2022 | 195000 |
|new york | large | alpha | NE | Mar-2022 | 237000 |
|los angeles | large | alpha | W | Jan-2022 | 330000 |
|los angeles | large | alpha | W | Fev-2022 | 400000 |
|los angeles | large | alpha | W | Mar-2022 | 580000 |
P.S.: A solution using pandas would work too.
In pandas:
df.melt(df.columns[:4], var_name = 'month', value_name = 'orders')
      city city_tier city_classification Region            month  orders
0     york     large               alpha     NE  Jan-2022-orders  100000
1  angeles     large               alpha      W  Jan-2022-orders  330000
2     york     large               alpha     NE  Feb-2022-orders  195000
3  angeles     large               alpha      W  Feb-2022-orders  400000
4     york     large               alpha     NE  Mar-2022-orders  237000
5  angeles     large               alpha      W  Mar-2022-orders  580000
or even
df.melt(['city', 'city_tier', 'city_classification', 'Region'],
var_name = 'month', value_name = 'orders')
      city city_tier city_classification Region            month  orders
0     york     large               alpha     NE  Jan-2022-orders  100000
1  angeles     large               alpha      W  Jan-2022-orders  330000
2     york     large               alpha     NE  Feb-2022-orders  195000
3  angeles     large               alpha      W  Feb-2022-orders  400000
4     york     large               alpha     NE  Mar-2022-orders  237000
5  angeles     large               alpha      W  Mar-2022-orders  580000
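One small follow-up (not part of the original answer): the month values above keep the -orders suffix from the column names. If the Jan-2022 style from the desired output matters, it can be stripped after the melt:
out = df.melt(['city', 'city_tier', 'city_classification', 'Region'],
              var_name='month', value_name='orders')
# drop the trailing "-orders" so month reads "Jan-2022" instead of "Jan-2022-orders"
out['month'] = out['month'].str.replace('-orders', '', regex=False)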
In PySpark, using your current example:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('new york', 'large', 'alpha', 'NE', 100000, 195000, 237000),
('los angeles', 'large', 'alpha', 'W', 330000, 400000, 580000)],
['city', 'city_tier', 'city_classification', 'Region', 'Jan-2022-orders', 'Feb-2022-orders', 'Mar-2022-orders']
)
df2 = df.select(
'city', 'city_tier', 'city_classification', 'Region',
F.expr("stack(3, 'Jan-2022', `Jan-2022-orders`, 'Fev-2022', `Feb-2022-orders`, 'Mar-2022', `Mar-2022-orders`) as (month, orders)")
)
df2.show()
# +-----------+---------+-------------------+------+--------+------+
# | city|city_tier|city_classification|Region| month|orders|
# +-----------+---------+-------------------+------+--------+------+
# | new york| large| alpha| NE|Jan-2022|100000|
# | new york| large| alpha| NE|Fev-2022|195000|
# | new york| large| alpha| NE|Mar-2022|237000|
# |los angeles| large| alpha| W|Jan-2022|330000|
# |los angeles| large| alpha| W|Fev-2022|400000|
# |los angeles| large| alpha| W|Mar-2022|580000|
# +-----------+---------+-------------------+------+--------+------+
The function that enables this is stack. It does not have a DataFrame API, so you need to use expr to access it.
BTW, this is not pivoting; it's the opposite, unpivoting.
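As an aside (not part of the original answer): if you are on Spark 3.4 or newer, the same unpivot is, to my knowledge, also exposed directly on the DataFrame as melt/unpivot, so going through expr is no longer required. A sketch under that version assumption:
# hedged sketch: assumes Spark 3.4+; month values keep the "-orders" suffix here
df2 = df.melt(
    ids=['city', 'city_tier', 'city_classification', 'Region'],
    values=['Jan-2022-orders', 'Feb-2022-orders', 'Mar-2022-orders'],
    variableColumnName='month',
    valueColumnName='orders',
)
df2.show()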

Iterate through pandas data frame and replace some strings with numbers

I have a dataframe sample_df that looks like:
bar foo
0 rejected unidentified
1 clear caution
2 caution NaN
Note this is just a random made-up df; there are lots of other columns, say with different data types than just text. bar and foo might also have lots of empty cells/values, which are NaNs.
The actual df looks like this (the above is just a sample, btw):
| | Unnamed: 0 | user_id | result | face_comparison_result | created_at | facial_image_integrity_result | visual_authenticity_result | properties | attempt_id |
|-----:|-------------:|:---------------------------------|:---------|:-------------------------|:--------------------|:--------------------------------|:-----------------------------|:----------------|:---------------------------------|
| 0 | 58 | ecee468d4a124a8eafeec61271cd0da1 | clear | clear | 2017-06-20 17:50:43 | clear | clear | {} | 9e4277fc1ddf4a059da3dd2db35f6c76 |
| 1 | 76 | 1895d2b1782740bb8503b9bf3edf1ead | clear | clear | 2017-06-20 13:28:00 | clear | clear | {} | ab259d3cb33b4711b0a5174e4de1d72c |
| 2 | 217 | e71b27ea145249878b10f5b3f1fb4317 | clear | clear | 2017-06-18 21:18:31 | clear | clear | {} | 2b7f1c6f3fc5416286d9f1c97b15e8f9 |
| 3 | 221 | f512dc74bd1b4c109d9bd2981518a9f8 | clear | clear | 2017-06-18 22:17:29 | clear | clear | {} | ab5989375b514968b2ff2b21095ed1ef |
| 4 | 251 | 0685c7945d1349b7a954e1a0869bae4b | clear | clear | 2017-06-18 19:54:21 | caution | clear | {} | dd1b0b2dbe234f4cb747cc054de2fdd3 |
| 5 | 253 | 1a1a994f540147ab913fcd61b7a859d9 | clear | clear | 2017-06-18 20:05:05 | clear | clear | {} | 1475037353a848318a32324539a6947e |
| 6 | 334 | 26e89e4a60f1451285e70ca8dc5bc90e | clear | clear | 2017-06-17 20:21:54 | suspected | clear | {} | 244fa3e7cfdb48afb44844f064134fec |
| 7 | 340 | 41afdea02a9c42098a15d94a05e8452b | NaN | clear | 2017-06-17 20:42:53 | clear | clear | {} | b066a4043122437bafae3ddcf6c2ab07 |
| 8 | 424 | 6cf6eb05a3cc4aabb69c19956a055eb9 | rejected | NaN | 2017-06-16 20:00:26 |
I want to replace any strings I find with numbers, per the below mapping.
def no_strings(df):
    columns = list(df)
    for column in columns:
        df[column] = df[column].map(result_map)

# We will need a mapping of strings to numbers to be able to analyse later.
result_map = {'unidentified': 0, "clear": 1, 'suspected': 2, "caution": 3, 'rejected': 4}
So the output might look like:
bar foo
0 4 0
1 1 3
2 3 NaN
For some reason, when I run no_strings(sample_df) I get errors.
What am I doing wrong?
df['bar'] = df['bar'].map(result_map)
df['foo'] = df['foo'].map(result_map)
df
bar foo
0 4 0
1 1 3
2 3 2
However, if you wish to be on the safe side (assuming a key/value is not in your result_map and you don't want to see a NaN), do this:
df['foo'] = df['foo'].map(lambda x: result_map.get(x, 'not found'))
df['bar'] = df['bar'].map(lambda x: result_map.get(x, 'not found'))
So, for example, this df
bar foo
0 rejected unidentified
1 clear caution
2 caution suspected
3 sdgdg 0000
will result in:
bar foo
0 4 0
1 1 3
2 3 2
3 not found not found
To be extra efficient:
cols = ['foo', 'bar', 'other_columns']
for c in cols:
    df[c] = df[c].map(lambda x: result_map.get(x, 'not found'))
Let's try stack, map the dict, and then unstack:
df.stack().to_frame()[0].map(result_map).unstack()
bar foo
0 4 0
1 1 3
2 3 2
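As an aside (not one of the answers above): DataFrame.replace also accepts a dict and applies it to every column in a single call, leaving NaNs and unmapped values untouched:
# hedged sketch: replace() maps matching values in all columns at once;
# anything not in result_map (including NaN) is left as-is
result_map = {'unidentified': 0, 'clear': 1, 'suspected': 2, 'caution': 3, 'rejected': 4}
sample_df = sample_df.replace(result_map)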

pandas read IRS space-delimited txt data

I recently was working with IRS tax file data. It is space-delimited txt data like the following (full data are here):
There are some patterns in the way the data is stored, but to me it is not formatted in a standard way and it is not easy to read into pandas. I was wondering how to get a dataframe like the following from the above txt data:
+------------+-------------+--------------------------+-----+-----+-----+------+
| fips_state | fips_county | name | c1 | c2 | c3 | c4 |
+------------+-------------+--------------------------+-----+-----+-----+------+
| 02 | 013 | Aleutians East Borough T | 145 | 280 | 416 | 1002 |
| 02 | 016 | Aleutians West Total Mig | 304 | 535 | 991 | 2185 |
| ... | ... | ... | ... | ... | ... | ... |
+------------+-------------+--------------------------+-----+-----+-----+------+
This will split the lines into two lists and get the data into two separate dataframes in pandas. After parsing, merge the two dataframes.
import urllib.request  # the lib that handles the url stuff
import pandas as pd

target_url = 'https://raw.githubusercontent.com/shuai-zhou/DataRepo/master/data/C9091aki.txt'

list_a = []
list_b = []
for line in urllib.request.urlopen(target_url):
    if line.decode('utf-8')[0:2] != '  ':
        print(line.decode('utf-8').strip())
        list_a.append(line.decode('utf-8').strip())
    if line.decode('utf-8')[0:5] == '     ':
        print(line.decode('utf-8').strip())
        list_b.append(line.decode('utf-8').strip())

dfa = pd.DataFrame(list_a)
dfb = pd.DataFrame(list_b)
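As an aside (not from the original answer): depending on how regular the fixed-width layout actually is, pandas.read_fwf may get most of the way there in one call. A sketch under that assumption; the inferred column boundaries, and the two line formats that the answer above separates into list_a and list_b, would still need checking against the real file:
import pandas as pd

# hedged sketch: read_fwf infers column boundaries from the first rows;
# header=None because the raw file has no header line, so the resulting
# columns would still need renaming (e.g. to fips_state, fips_county, ...)
target_url = 'https://raw.githubusercontent.com/shuai-zhou/DataRepo/master/data/C9091aki.txt'
df = pd.read_fwf(target_url, header=None)
print(df.head())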

How to improve function using pd.groupby.transform in a dask environment

We need to create groups based on a time sequence.
We are working with dask, but for this function we need to move back to pandas since transform is not yet implemented in dask. Although the function works, is there any way to improve the performance? (We are running our code on a local Client and sometimes on a yarn-client.)
Below is our function and a minimal, complete, and verifiable example:
import pandas as pd
import numpy as np
import random
import dask
import dask.dataframe as dd
from datetime import timedelta
def create_groups_from_time_sequence(df, col_id: str = None, col_time: np.datetime64 = None,
                                     time_threshold: str = '120s', min_weight: int = 2) -> pd.DataFrame:
    """
    Function creates group of units for relationships
    :param df: dataframe pandas or dask
    :param col_id: column containing the index
    :param col_time: column containing datetime of query
    :param time_threshold: maximum threshold between queries to create
    :param min_weight: The threshold to filter the minimum relationship between 2 ids
    :return: pandas dataframe
    """
    partitions = None
    if isinstance(df, dd.DataFrame):
        partitions = df.npartitions
        df = df.compute()

    if np.issubdtype(df[col_time].dtype, np.datetime64):
        df[col_time] = pd.to_datetime(df[col_time])

    df = df.sort_values([col_id, col_time])
    df['cluster_bool'] = df.groupby(col_id)[col_time].transform(lambda x: x.diff() > time_threshold)
    df['EdgeID'] = df.groupby(col_id)['cluster_bool'].transform(lambda x: x.astype(int).cumsum())
    df['cluster_weight'] = df.groupby([col_id, 'EdgeID'])['EdgeID'].transform('count')

    mask_weight = df['cluster_weight'] > min_weight
    df = df[mask_weight]
    df = df.drop(['cluster_bool'], axis=1).reset_index(drop=True)

    if partitions:
        df = dd.from_pandas(df, npartitions=partitions)
        df = df.set_index('EdgeID')

    return df
Using the above function with the dask dataset example:
df_raw = dask.datasets.timeseries()
df = df_raw[['id', 'name']]
df = df.assign(timegroup=df.index)
df.timegroup = df.timegroup.apply(lambda s: s + timedelta(seconds=random.randint(0,60)) )
df.head()
| timestamp           | id   | name   | timegroup           |
|---------------------|------|--------|---------------------|
| 2000-01-01 00:00:00 | 968  | Alice  | 2000-01-01 00:00:46 |
| 2000-01-01 00:00:01 | 1030 | Xavier | 2000-01-01 00:00:22 |
| 2000-01-01 00:00:02 | 991  | George | 2000-01-01 00:00:59 |
| 2000-01-01 00:00:03 | 975  | Zelda  | 2000-01-01 00:00:26 |
| 2000-01-01 00:00:04 | 1028 | Zelda  | 2000-01-01 00:00:18 |
dfg = create_groups_from_time_sequence(df, col_id='id', col_time='timegroup', time_threshold='120s',min_weight=2)
dfg.head()
| EdgeID | id | name | timegroup | cluster_weight |
|-------- |------ |--------- |--------------------- |---------------- |
| 0 | 960 | Norbert | 2000-01-01 00:01:10 | 3 |
| 0 | 969 | Sarah | 2000-01-01 00:03:32 | 7 |
| 0 | 1013 | Michael | 2000-01-01 00:02:58 | 8 |
| 0 | 963 | Ray | 2000-01-01 00:05:58 | 5 |
| 0 | 996 | Ray | 2000-01-01 00:03:41 | 6 |
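One possible speed-up for the pandas part of the function (a sketch only, not tested on the real data): the two transform(lambda ...) calls run a Python function per group, whereas the equivalent groupby diff/cumsum calls below stay on pandas' compiled path. The variable names (df, col_id, col_time, time_threshold) are the ones from the function above:
# hedged sketch: same logic as the cluster_bool / EdgeID lines in the function,
# but without Python-level lambdas inside transform
df = df.sort_values([col_id, col_time])
gaps = df.groupby(col_id)[col_time].diff() > pd.Timedelta(time_threshold)
df['EdgeID'] = gaps.astype(int).groupby(df[col_id]).cumsum()
df['cluster_weight'] = df.groupby([col_id, 'EdgeID'])['EdgeID'].transform('count')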

Pandas: need to create dataframe for weekly search per event occurrence

If I have this events dataframe df_e below:
| group | event date | count |
|-------|------------|-------|
| x123  | 2016-01-06 | 1     |
|       | 2016-01-08 | 10    |
|       | 2016-02-15 | 9     |
|       | 2016-05-22 | 6     |
|       | 2016-05-29 | 2     |
|       | 2016-05-31 | 6     |
|       | 2016-12-29 | 1     |
| x124  | 2016-01-01 | 1     |
...
and I also know t0, which is the beginning of time (let's say for x123 it's 2016-01-01), and tN, which is the end of the experiment from another dataframe df_s (2017-05-25). How can I create the dataframe df_new, which should look like this:
| group | obs. weekly | lifetime, week | status |
|-------|-------------|----------------|--------|
| x123  | 2016-01-01  | 1              | 1      |
|       | 2016-01-08  | 0              | 0      |
|       | 2016-01-15  | 0              | 0      |
|       | 2016-01-22  | 1              | 1      |
|       | 2016-01-29  | 2              | 1      |
...
|       | 2017-05-18  | 1              | 1      |
|       | 2017-05-25  | 1              | 1      |
...
| x124  | 2017-05-18  | 1              | 1      |
| x124  | 2017-05-25  | 1              | 1      |
Explanation: take t0 and generate rows until tN, one per week. For each row R, check within that group whether an event date falls inside R's week; if it does, count how long (in weeks) it lives there and set status = 1 (alive); otherwise set the lifetime and status columns for this R to 0 (dead).
Questions:
1) How to generate dataframes per group given t0 and tN values, e.g. generate [group, obs. weekly, lifetime, status] columns for (tN - t0) / week rows?
2) How to accomplish the construction of the df_new dataframe explained above?
I can begin with this so far =)
import pandas as pd

# 1. generate dataframes per group to get the boundary within `t0` and `tN` from the df_s dataframe,
#    where each dataframe has "group, obs, lifetime, status" columns X (tN - t0) / week rows filled with 0 values.
df_all = pd.concat([df_group1, df_group2])

def do_that(R):
    found_event_row = df_e.iloc[[R.group]]
    # check if found_event_row['date'] falls into R['obs'] week
    # if True, then find how long it's there
    ...

df_new = df_all.apply(do_that)
I'm not really sure I get you, but group one is not related to group two, right? If that's the case, I think what you want is something like this:
import pandas as pd

df_group1 = df_group1.set_index('event date')
df_group1.index = pd.to_datetime(df_group1.index)  # convert the index to datetime so you can 'resample'
df_group1['lifetime, week'] = df_group1.resample('1W').apply(lambda x: your_function(x))
df_group1 = df_group1.reset_index()
df_group1['status'] = df_group1.apply(lambda x: 1 if x['lifetime, week'] > 0 else 0, axis=1)
# do the same with group2 and concat to create df_all
I'm not sure how you get 'lifetime, week', but all that's left is creating the function that generates it.
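For part 1 of the question (one row per week between t0 and tN per group), pd.date_range is a common building block. A minimal sketch, using the t0/tN values mentioned in the question and the column names from the desired df_new:
import pandas as pd

# hedged sketch: build the weekly skeleton for one group, initialised to 0;
# lifetime/status would then be filled by checking df_e for events in each week
t0, tN = pd.Timestamp('2016-01-01'), pd.Timestamp('2017-05-25')
weeks = pd.date_range(t0, tN, freq='7D')  # 2016-01-01, 2016-01-08, ...
df_group1 = pd.DataFrame({'group': 'x123',
                          'obs. weekly': weeks,
                          'lifetime, week': 0,
                          'status': 0})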