How to pivot columns so they turn into rows using PySpark or pandas?

I have a dataframe that looks like the one below, but with hundreds of rows. I need to pivot it so that each column after Region becomes a row, like in the other table below.
+-------------+-----------+---------------------+--------+-----------------+-----------------+-----------------+
| city        | city_tier | city_classification | Region | Jan-2022-orders | Feb-2022-orders | Mar-2022-orders |
+-------------+-----------+---------------------+--------+-----------------+-----------------+-----------------+
| new york    | large     | alpha               | NE     | 100000          | 195000          | 237000          |
| los angeles | large     | alpha               | W      | 330000          | 400000          | 580000          |
+-------------+-----------+---------------------+--------+-----------------+-----------------+-----------------+
I need to pivot it using PySpark, so I end up with something like this:
+-------------+-----------+---------------------+--------+----------+--------+
| city        | city_tier | city_classification | Region | month    | orders |
+-------------+-----------+---------------------+--------+----------+--------+
| new york    | large     | alpha               | NE     | Jan-2022 | 100000 |
| new york    | large     | alpha               | NE     | Feb-2022 | 195000 |
| new york    | large     | alpha               | NE     | Mar-2022 | 237000 |
| los angeles | large     | alpha               | W      | Jan-2022 | 330000 |
| los angeles | large     | alpha               | W      | Feb-2022 | 400000 |
| los angeles | large     | alpha               | W      | Mar-2022 | 580000 |
+-------------+-----------+---------------------+--------+----------+--------+
P.S.: A solution using pandas would work too.

In pandas:
df.melt(df.columns[:4], var_name='month', value_name='orders')
          city city_tier city_classification Region            month  orders
0     new york     large               alpha     NE  Jan-2022-orders  100000
1  los angeles     large               alpha      W  Jan-2022-orders  330000
2     new york     large               alpha     NE  Feb-2022-orders  195000
3  los angeles     large               alpha      W  Feb-2022-orders  400000
4     new york     large               alpha     NE  Mar-2022-orders  237000
5  los angeles     large               alpha      W  Mar-2022-orders  580000
or, listing the id columns explicitly:
df.melt(['city', 'city_tier', 'city_classification', 'Region'],
        var_name='month', value_name='orders')
          city city_tier city_classification Region            month  orders
0     new york     large               alpha     NE  Jan-2022-orders  100000
1  los angeles     large               alpha      W  Jan-2022-orders  330000
2     new york     large               alpha     NE  Feb-2022-orders  195000
3  los angeles     large               alpha      W  Feb-2022-orders  400000
4     new york     large               alpha     NE  Mar-2022-orders  237000
5  los angeles     large               alpha      W  Mar-2022-orders  580000
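If the month column should match the desired output exactly (Jan-2022 rather than Jan-2022-orders), a small follow-up can strip the suffix. A minimal sketch, assuming the suffix is literally '-orders':

out = df.melt(df.columns[:4], var_name='month', value_name='orders')
out['month'] = out['month'].str.replace('-orders', '', regex=False)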

In PySpark, using your example data:
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('new york', 'large', 'alpha', 'NE', 100000, 195000, 237000),
     ('los angeles', 'large', 'alpha', 'W', 330000, 400000, 580000)],
    ['city', 'city_tier', 'city_classification', 'Region',
     'Jan-2022-orders', 'Feb-2022-orders', 'Mar-2022-orders']
)
df2 = df.select(
    'city', 'city_tier', 'city_classification', 'Region',
    F.expr("stack(3, 'Jan-2022', `Jan-2022-orders`, 'Feb-2022', `Feb-2022-orders`, 'Mar-2022', `Mar-2022-orders`) as (month, orders)")
)
df2.show()
# +-----------+---------+-------------------+------+--------+------+
# |       city|city_tier|city_classification|Region|   month|orders|
# +-----------+---------+-------------------+------+--------+------+
# |   new york|    large|              alpha|    NE|Jan-2022|100000|
# |   new york|    large|              alpha|    NE|Feb-2022|195000|
# |   new york|    large|              alpha|    NE|Mar-2022|237000|
# |los angeles|    large|              alpha|     W|Jan-2022|330000|
# |los angeles|    large|              alpha|     W|Feb-2022|400000|
# |los angeles|    large|              alpha|     W|Mar-2022|580000|
# +-----------+---------+-------------------+------+--------+------+
The function that enables this is stack. It does not have a DataFrame API, so you need to use expr to access it.
By the way, this is not pivoting but its opposite: unpivoting (what pandas calls melting).
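On Spark 3.4 and newer there is also a built-in DataFrame.unpivot, so the same reshape can be done without expr. A sketch, assuming that Spark version is available (the month values keep the -orders suffix unless renamed afterwards):

df3 = df.unpivot(
    ids=['city', 'city_tier', 'city_classification', 'Region'],
    values=['Jan-2022-orders', 'Feb-2022-orders', 'Mar-2022-orders'],
    variableColumnName='month',
    valueColumnName='orders',
)
df3.show()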

Related

How can I optimise this web-scraping code with an iterative loop?

This code scrapes www.oddsportal.com for all the URLs provided in the code and appends the results to a dataframe.
I am not very well versed in iterative logic, so I am finding it difficult to improve on it.
Code:
import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup as bs

browser = webdriver.Chrome()


class GameData:
    def __init__(self):
        self.date = []
        self.time = []
        self.game = []
        self.score = []
        self.home_odds = []
        self.draw_odds = []
        self.away_odds = []
        self.country = []
        self.league = []


def parse_data(url):
    browser.get(url)
    df = pd.read_html(browser.page_source, header=0)[0]
    html = browser.page_source
    soup = bs(html, "lxml")
    cont = soup.find('div', {'id': 'wrap'})
    content = cont.find('div', {'id': 'col-content'})
    content = content.find('table', {'class': 'table-main'}, {'id': 'tournamentTable'})
    main = content.find('th', {'class': 'first2 tl'})
    if main is None:
        return None
    count = main.findAll('a')
    country = count[1].text
    league = count[2].text
    game_data = GameData()
    game_date = None
    for row in df.itertuples():
        if not isinstance(row[1], str):
            continue
        elif ':' not in row[1]:
            game_date = row[1].split('-')[0]
            continue
        game_data.date.append(game_date)
        game_data.time.append(row[1])
        game_data.game.append(row[2])
        game_data.score.append(row[3])
        game_data.home_odds.append(row[4])
        game_data.draw_odds.append(row[5])
        game_data.away_odds.append(row[6])
        game_data.country.append(country)
        game_data.league.append(league)
    return game_data


urls = {
    "https://www.oddsportal.com/soccer/england/premier-league/results/#/page/1",
    "https://www.oddsportal.com/soccer/england/premier-league/results/#/page/2",
    "https://www.oddsportal.com/soccer/england/premier-league/results/#/page/3",
    "https://www.oddsportal.com/soccer/england/premier-league/results/#/page/4",
    "https://www.oddsportal.com/soccer/england/premier-league/results/#/page/5",
    "https://www.oddsportal.com/soccer/england/premier-league/results/#/page/6",
    "https://www.oddsportal.com/soccer/england/premier-league/results/#/page/7",
    "https://www.oddsportal.com/soccer/england/premier-league/results/#/page/8",
    "https://www.oddsportal.com/soccer/england/premier-league/results/#/page/9",
    "https://www.oddsportal.com/soccer/england/premier-league-2019-2020/results/#/page/1",
    "https://www.oddsportal.com/soccer/england/premier-league-2019-2020/results/#/page/2",
    "https://www.oddsportal.com/soccer/england/premier-league-2019-2020/results/#/page/3",
    "https://www.oddsportal.com/soccer/england/premier-league-2019-2020/results/#/page/4",
    "https://www.oddsportal.com/soccer/england/premier-league-2019-2020/results/#/page/5",
    "https://www.oddsportal.com/soccer/england/premier-league-2019-2020/results/#/page/6",
    "https://www.oddsportal.com/soccer/england/premier-league-2019-2020/results/#/page/7",
    "https://www.oddsportal.com/soccer/england/premier-league-2019-2020/results/#/page/8",
    "https://www.oddsportal.com/soccer/england/premier-league-2019-2020/results/#/page/9",
    "https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/1",
    "https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/2",
    "https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/3",
    "https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/4",
    "https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/5",
    "https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/6",
    "https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/7",
    "https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/8",
    "https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/9",
}

if __name__ == '__main__':
    results = None
    for url in urls:
        game_data = parse_data(url)
        if game_data is None:
            continue
        result = pd.DataFrame(game_data.__dict__)
        if results is None:
            results = result
        else:
            results = results.append(result, ignore_index=True)
    print(results)
| | date | time | game | score | home_odds | draw_odds | away_odds | country | league |
|-----|-------------|--------|----------------------------------|---------|-------------|-------------|-------------|-----------|--------------------------|
| 0 | 12 May 2019 | 14:00 | Brighton - Manchester City | 1:4 | 14.95 | 7.75 | 1.2 | England | Premier League 2018/2019 |
| 1 | 12 May 2019 | 14:00 | Burnley - Arsenal | 1:3 | 2.54 | 3.65 | 2.75 | England | Premier League 2018/2019 |
| 2 | 12 May 2019 | 14:00 | Crystal Palace - Bournemouth | 5:3 | 1.77 | 4.32 | 4.22 | England | Premier League 2018/2019 |
| 3 | 12 May 2019 | 14:00 | Fulham - Newcastle | 0:4 | 2.45 | 3.55 | 2.92 | England | Premier League 2018/2019 |
| 4 | 12 May 2019 | 14:00 | Leicester - Chelsea | 0:0 | 2.41 | 3.65 | 2.91 | England | Premier League 2018/2019 |
| 5 | 12 May 2019 | 14:00 | Liverpool - Wolves | 2:0 | 1.31 | 5.84 | 10.08 | England | Premier League 2018/2019 |
| 6 | 12 May 2019 | 14:00 | Manchester Utd - Cardiff | 0:2 | 1.3 | 6.09 | 9.78 | England | Premier League 2018/2019 |
As you can see, the URLs could be generated to run through all pages in that league/branch.
[inspect-element screenshot of the pagination links]
If I could adapt this code to run iteratively for every page shown in that pagination, it would be really useful and helpful.
How can I optimise this code to run iteratively for every page?
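One way to avoid hard-coding every page is to build the URL set from the season slugs and page numbers. A minimal sketch, assuming each season keeps the nine result pages listed above (the season list and page range are taken from the hard-coded URLs, not from the site):

# build the result-page URLs for each season instead of listing them by hand
seasons = [
    "premier-league",
    "premier-league-2019-2020",
    "premier-league-2018-2019",
]
base = "https://www.oddsportal.com/soccer/england/{season}/results/#/page/{page}"

urls = {base.format(season=season, page=page)
        for season in seasons
        for page in range(1, 10)}

If the number of pages differs per season, it could instead be read from the pagination element visible in the screenshot, but the exact selector depends on the page markup.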

How do I clean this dataframe?

In row 2, I have a value "AVE" in the 'address' column that I would like to join with the 'address' value in row 1, so that row 1's 'address' reads "NEWPORT AVE / HIGHLAND AVE". How do I do this?
I also need to perform the same operation with rows 3 and 4, so that 'action_taken' in row 3 reads "SERVICE RENDERED", with "RENDERED" taken from row 4.
   | incident_no | date_reported | time_reported | address                | incident_type  | action_taken
---+-------------+---------------+---------------+------------------------+----------------+--------------
 1 | 2100030948  | 2021-05-16    | 23:21:00      | NEWPORT AVE / HIGHLAND | ERRATIC M/V    | UNFOUNDED
 2 | <NA>        | NaT           | NaT           | AVE                    | NaN            | NaN
 3 | 2100030947  | 2021-05-16    | 23:16:00      | FALMOUTH ST            | SECURITY CHECK | SERVICE
 4 | <NA>        | NaT           | NaT           | NaN                    | NaN            | RENDERED
 5 | 2100030946  | 2021-05-16    | 22:55:00      | PINE RD                | SECURITY CHECK | SERVICE
First forward fill the missing values in the key columns, then group by them and aggregate the remaining columns by joining the non-missing values:
cols = ['incident_no','date_reported','time_reported']
df[cols] = df[cols].ffill()
df = df.groupby(cols).agg(lambda x: ' '.join(x.dropna())).reset_index()
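As a self-contained check, here is a sketch that rebuilds a small frame like the one above and applies the same steps (the literal values are only illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'incident_no':   [2100030948, np.nan, 2100030947, np.nan, 2100030946],
    'date_reported': ['2021-05-16', None, '2021-05-16', None, '2021-05-16'],
    'time_reported': ['23:21:00', None, '23:16:00', None, '22:55:00'],
    'address':       ['NEWPORT AVE / HIGHLAND', 'AVE', 'FALMOUTH ST', None, 'PINE RD'],
    'incident_type': ['ERRATIC M/V', None, 'SECURITY CHECK', None, 'SECURITY CHECK'],
    'action_taken':  ['UNFOUNDED', None, 'SERVICE', 'RENDERED', 'SERVICE'],
})

cols = ['incident_no', 'date_reported', 'time_reported']
df[cols] = df[cols].ffill()
out = df.groupby(cols).agg(lambda x: ' '.join(x.dropna())).reset_index()
# incident 2100030948 now has address 'NEWPORT AVE / HIGHLAND AVE'
# and incident 2100030947 now has action_taken 'SERVICE RENDERED'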

.agg on a group inside a groupby object?

Sorry if this has been asked before, I couldn't find it.
I have a census population dataframe that contains the population of each county in the US.
The relevant part of df looks like:
+----+--------+---------+----------------------------+---------------+
| | REGION | STNAME | CTYNAME | CENSUS2010POP |
+----+--------+---------+----------------------------+---------------+
| 1 | 3 | Alabama | Autauga County | 54571 |
+----+--------+---------+----------------------------+---------------+
| 2 | 3 | Alabama | Baldwin County | 182265 |
+----+--------+---------+----------------------------+---------------+
| 69 | 4 | Alaska | Aleutians East Borough | 3141 |
+----+--------+---------+----------------------------+---------------+
| 70 | 4 | Alaska | Aleutians West Census Area | 5561 |
+----+--------+---------+----------------------------+---------------+
How can I get the np.std of the states' population (the sum of their counties' population) for each of the four regions in the US without modifying the df?
You can use transform:
df['std_col'] = df.groupby('STNAME')['CENSUS2010POP'].transform("std")
IIUC, if you want the number of counties per state and the spread of that count, you can do:
state_pop = df.groupby('STNAME')['CTYNAME'].nunique().agg(np.std)
You can also use the standard deviation method std() directly:
new_df = df.groupby(['REGION'])[['CENSUS2010POP']].std()
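If the goal is exactly what the question describes, the standard deviation of state populations within each region, where a state's population is the sum of its counties, a two-step groupby is enough. A sketch, assuming the column names shown above:

# sum county populations to get each state's total, keeping the region,
# then take the standard deviation of those state totals within each region
state_pop = df.groupby(['REGION', 'STNAME'])['CENSUS2010POP'].sum()
region_std = state_pop.groupby(level='REGION').std()
# note: Series.std() uses ddof=1, while np.std defaults to ddof=0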

Pandas - Pivot and Rearrange Table With Multiple Labels in Same Header

I have an xlsx file with tabs for multiple years of data. Each tab contains a table with many columns and the table is structured like this:
+-----------+-------+-------------------------+----------------------+
| City | State | Number of Drivers, 2019 | Number of Cars, 2019 |
+-----------+-------+-------------------------+----------------------+
| LA | CA | 123 | 10.0 |
| San Diego | CA | 456 | 2345 |
+-----------+-------+-------------------------+----------------------+
I would like to rearrange the table to look like this, and do it for each tab in the xlsx:
+-----------+-------+------+-------------------+---------------+
| City | State | Year | Measure Name | Measure Value |
+-----------+-------+------+-------------------+---------------+
| LA | CA | 2019 | Number of Drivers | 123 |
| San Diego | CA | 2019 | Number of Drivers | 456 |
| LA | CA | 2019 | Number of Cars | 10 |
| San Diego | CA | 2019 | Number of Cars | 2345 |
+-----------+-------+------+-------------------+---------------+
There are a lot of moving pieces to this, and it has been a little tricky to get the final formatting correct.
We do melt, then join on the columns produced by str.split:
s = df.melt(['City', 'State'])
s = s.join(s.variable.str.split(',', expand=True))
Out[120]:
        City State                 variable   value                  0      1
0         LA    CA  Number of Drivers, 2019   123.0  Number of Drivers   2019
1  San Diego    CA  Number of Drivers, 2019   456.0  Number of Drivers   2019
2         LA    CA     Number of Cars, 2019    10.0     Number of Cars   2019
3  San Diego    CA     Number of Cars, 2019  2345.0     Number of Cars   2019
# if you need to change the names, add .rename(columns={...}) at the end
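For example, to match the target layout from the question, the split columns could be renamed and the original variable column dropped. A sketch, where the new names are just the ones from the desired table:

out = (s.drop(columns='variable')
        .rename(columns={0: 'Measure Name', 1: 'Year', 'value': 'Measure Value'}))
out['Year'] = out['Year'].str.strip()  # the year comes back with a leading space from the split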
This is how I was able to apply Yoben's solution to every tab in the xlsx file, append the results together, and write the full table to a .csv:
sheets_dict = pd.read_excel(r'file.xlsx', sheet_name=None)
full_table = pd.DataFrame()

for name, sheet in sheets_dict.items():
    sheet['sheet'] = name
    sheet = sheet.melt(['City', 'State'])
    sheet = sheet.join(sheet.variable.str.split(',', expand=True))
    full_table = full_table.append(sheet)

full_table.reset_index(inplace=True, drop=True)
full_table.to_csv('Full Table.csv')

How to improve a function using pd.groupby.transform in a dask environment

We need to create groups based on a time sequence.
We are working with dask, but for this function we need to move back to pandas since transform is not yet implemented in dask. Although the function works, is there any way to improve its performance? (We are running our code on a local Client and sometimes on a yarn-client.)
Below is our function and a minimal, complete and verifiable example:
import pandas as pd
import numpy as np
import random
import dask
import dask.dataframe as dd
from datetime import timedelta


def create_groups_from_time_sequence(df, col_id: str = None, col_time: np.datetime64 = None,
                                     time_threshold: str = '120s', min_weight: int = 2) -> pd.DataFrame:
    """
    Function creates groups of units for relationships
    :param df: dataframe, pandas or dask
    :param col_id: column containing the index
    :param col_time: column containing the datetime of the query
    :param time_threshold: maximum gap between queries before a new group is started
    :param min_weight: threshold to filter the minimum relationship between 2 ids
    :return: pandas dataframe
    """
    partitions = None
    if isinstance(df, dd.DataFrame):
        partitions = df.npartitions
        df = df.compute()

    if np.issubdtype(df[col_time].dtype, np.datetime64):
        df[col_time] = pd.to_datetime(df[col_time])

    df = df.sort_values([col_id, col_time])
    df['cluster_bool'] = df.groupby(col_id)[col_time].transform(lambda x: x.diff() > time_threshold)
    df['EdgeID'] = df.groupby(col_id)['cluster_bool'].transform(lambda x: x.astype(int).cumsum())
    df['cluster_weight'] = df.groupby([col_id, 'EdgeID'])['EdgeID'].transform('count')

    mask_weight = df['cluster_weight'] > min_weight
    df = df[mask_weight]
    df = df.drop(['cluster_bool'], axis=1).reset_index(drop=True)

    if partitions:
        df = dd.from_pandas(df, npartitions=partitions)
        df = df.set_index('EdgeID')

    return df
Using the above function with the dask dataset example:
df_raw = dask.datasets.timeseries()
df = df_raw[['id', 'name']]
df = df.assign(timegroup=df.index)
df.timegroup = df.timegroup.apply(lambda s: s + timedelta(seconds=random.randint(0, 60)))
df.head()
| timestamp           | id   | name   | timegroup           |
|---------------------|------|--------|---------------------|
| 2000-01-01 00:00:00 | 968  | Alice  | 2000-01-01 00:00:46 |
| 2000-01-01 00:00:01 | 1030 | Xavier | 2000-01-01 00:00:22 |
| 2000-01-01 00:00:02 | 991  | George | 2000-01-01 00:00:59 |
| 2000-01-01 00:00:03 | 975  | Zelda  | 2000-01-01 00:00:26 |
| 2000-01-01 00:00:04 | 1028 | Zelda  | 2000-01-01 00:00:18 |
dfg = create_groups_from_time_sequence(df, col_id='id', col_time='timegroup', time_threshold='120s',min_weight=2)
dfg.head()
| EdgeID | id | name | timegroup | cluster_weight |
|-------- |------ |--------- |--------------------- |---------------- |
| 0 | 960 | Norbert | 2000-01-01 00:01:10 | 3 |
| 0 | 969 | Sarah | 2000-01-01 00:03:32 | 7 |
| 0 | 1013 | Michael | 2000-01-01 00:02:58 | 8 |
| 0 | 963 | Ray | 2000-01-01 00:05:58 | 5 |
| 0 | 996 | Ray | 2000-01-01 00:03:41 | 6 |
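One thing that tends to help here, independent of dask, is replacing the Python-level lambdas passed to transform with vectorized or built-in (cythonized) groupby operations, since those lambdas are usually the slowest part. Below is a sketch of the same grouping logic under that assumption, using the column names from the question; the dask round-trip (compute / from_pandas) is left out and the time column is assumed to already be datetime64:

import pandas as pd


def create_groups_fast(df: pd.DataFrame, col_id: str, col_time: str,
                       time_threshold: str = '120s', min_weight: int = 2) -> pd.DataFrame:
    df = df.sort_values([col_id, col_time])
    threshold = pd.Timedelta(time_threshold)

    # a row starts a new cluster if it belongs to the same id as the previous row
    # but the time gap to that row exceeds the threshold
    same_id = df[col_id].eq(df[col_id].shift())
    gap = df[col_time].diff().gt(threshold)
    new_cluster = (gap & same_id).astype(int)

    # cumulative sum per id numbers the clusters without a Python lambda
    df['EdgeID'] = new_cluster.groupby(df[col_id]).cumsum()
    df['cluster_weight'] = df.groupby([col_id, 'EdgeID'])['EdgeID'].transform('count')

    return df[df['cluster_weight'] > min_weight].reset_index(drop=True)

If the result has to go back to dask, dd.from_pandas can be applied to the returned frame exactly as in the original function.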