pandas df columns series - pandas

I have a dataframe and have done some operations on its columns as follows:
df1=sample_data.sort_values("Population")
df2=df1[(df1.Population > 500000) & (df1.Population < 1000000)]
df3=df2["Avg check"]*df2["Avg Daily Rides Last Week"]/df2["CAC"]
df4=df2["Avg check"]*df2["Avg Daily Rides Last Week"]
([[df3],[df4]])
If I understand correctly, df3 and df4 are now Series, not DataFrames. There should be a way to build a new DataFrame from these Series and draw a scatter plot. Please advise. Thanks.
I then wanted to annotate each point and ran into a problem:
df3=df2["Avg check"]*df2["Avg Daily Rides Last Week"]/df2["CAC"]
df4=df2["Avg check"]*df2["Avg Daily Rides Last Week"]
df5=df2["Population"]
df6=df2["city_id"]
sct=plt.scatter(df5,df4,c=df3, cmap="viridis")
plt.xlabel("Population")
plt.ylabel("Avg check x Avg Daily Rides")
for i, txt in enumerate(df6):
    plt.annotate(txt,(df4[i],df5[i]))
plt.colorbar()
plt.show()

I think you can pass both Series to matplotlib.pyplot.scatter:
import matplotlib.pyplot as plt
sc = plt.scatter(df3, df4)
EDIT: Swap df5 and df4 in the annotation so the coordinates match the scatter call, and select by position with Series.iat:
for i, txt in enumerate(df6):
    plt.annotate(txt,(df5.iat[i],df4.iat[i]))
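The reason df4[i] fails while df4.iat[i] works is that sort_values and the boolean filter keep the original row labels, so the index is no longer 0, 1, 2, ... A minimal sketch with made-up numbers (the labels 5, 2, 9 are only an assumption standing in for the filtered index):

```python
import pandas as pd

# Hypothetical Series whose index labels are not 0..n-1,
# as happens after sort_values() plus a boolean filter.
s = pd.Series([10, 20, 30], index=[5, 2, 9])

# s[0] looks up the *label* 0, which does not exist here.
try:
    s[0]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# .iat[0] takes the first element by *position*.
first_by_position = s.iat[0]

# reset_index(drop=True) is another option: it relabels the rows 0..n-1.
relabelled = s.reset_index(drop=True)
```

Either .iat or a reset index should work in the annotate loop; .iat leaves the original labels untouched.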

You can create a DataFrame from Series. Simply put both Series in a dictionary:
import pandas as pd
author = ['Jitender', 'Purnima', 'Arpit', 'Jyoti']
article = [210, 211, 114, 178]
auth_series = pd.Series(author)
article_series = pd.Series(article)
frame = { 'Author': auth_series, 'Article': article_series }
and then create a DataFrame from that dictionary:
result = pd.DataFrame(frame)
The code is from geeksforgeeks.org
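Applied to the question, the same idea might look like the sketch below. The numbers are invented, and the column names "ratio" and "revenue" are only illustrative labels; pd.concat with axis=1 is used here as an alternative to the dictionary.

```python
import pandas as pd

# Invented stand-in for the filtered frame df2 from the question.
df2 = pd.DataFrame(
    {
        "Avg check": [10.0, 12.0, 8.0],
        "Avg Daily Rides Last Week": [100, 150, 200],
        "CAC": [5.0, 6.0, 4.0],
    },
    index=[7, 3, 11],
)

df3 = df2["Avg check"] * df2["Avg Daily Rides Last Week"] / df2["CAC"]
df4 = df2["Avg check"] * df2["Avg Daily Rides Last Week"]

# Name each Series, then place them side by side; rows are aligned on the index.
combined = pd.concat([df3.rename("ratio"), df4.rename("revenue")], axis=1)
```

With that frame, combined.plot.scatter(x="ratio", y="revenue") should draw the scatter plot directly.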

Related

Working on multiple data frames with data for NBA players during the season, how can I modify all the dataframes at the same time?

I have a list of 16 dataframes that contain stats for each player in the NBA during the respective season. My end goal is to run unsupervised learning algorithms on the data frames. For example, I want to see if I can determine a player's position by their stats or if I can determine their total points during the season based on their stats.
What I would like to do, unless there's a better solution, is modify the list of these dataframes (df_list) instead of modifying each dataframe separately, in order to:
1. Change the datatype of the MP (minutes played) column from str to int.
2. Keep only players with 1000 or more MP, with no duplicate players (Rk). For instance, a player (Rk) can play for three teams in a season and have 200 MP, 300 MP, and 400 MP with each team. He'll have a row for each team and a row called TOT which gives his MP as 900 (200+300+400), for a total of four rows in the dataframe. I only need the TOT row.
3. Use simple algebra on individual columns, for example totalling the MP column and the PTS column and then dividing the sum of the PTS column by the sum of the MP column, or dividing the total of the PTS column by the len of the PTS column.
What I've done so far is this:
I imported my libraries and created 16 dataframes using pd.read_html(url).
The first dataframes were created using two lines of code:
url = "https://www.basketball-reference.com/leagues/NBA_1997_totals.html"
ninetysix = pd.read_html(url)[0]
HOWEVER, the next four dataframes had to be created using a few additional lines of code (I received an error that said "html5lib not found, please install it", so I downloaded both html5lib and requests). I mention this because the difference in how the dataframes were created may have to be considered in a solution.
The code I used:
import requests
import uuid
url = 'https://www.basketball-reference.com/leagues/NBA_1998_totals.html'
cookies = {'euConsentId': str(uuid.uuid4())}
html = requests.get(url, cookies=cookies).content
ninetyseven = pd.read_html(html)[0]
These four data frames look like this:
I tried this but it didn't do anything:
df_list = [
    eightyfour, eightyfive, eightysix, eightyseven,
    eightyeight, eightynine, ninety, ninetyone,
    ninetytwo, ninetyfour, ninetyfive,
    ninetysix, ninetyseven, ninetyeight, owe_one, owe_two
]
for df in df_list:
    df = df.loc[df['Tm'] == 'TOT']
    df = df.copy()
    df['MP'] = df['MP'].astype(int)
    df['Rk'] = df['Rk'].astype(int)
    df = list(df[df['MP'] >= 1000]['Rk'])
    df = df[df['Rk'].isin(df)]
owe_two
============================UPDATE===================================
This code solves part of problem #2:
url = 'https://www.basketball-reference.com/leagues/NBA_1997_totals.html'
dd = pd.read_html(url)[0]
dd = dd[dd['Rk'].ne('Rk')]
dd['MP'] = dd['MP'].astype(int)
players_1000_rk_list = list(dd[dd['MP'] >= 1000]['Rk'])
players_dd = dd[dd['Rk'].isin(players_1000_rk_list)]
But it doesn't remove the duplicates.
==================== UPDATE 10/11/22 ================================
Let's say I take the rows with the value "TOT" in the "Tm" column and create a new DF from them.
Could I then compare the new DF with the original data frame and remove the names from the original data if they match the names in the new data frame?

The problem is that the df you are working on in the loop is not the same object as the one in df_list: assigning to df only rebinds the loop variable. You could solve this by saving the new df back to the list, overwriting the old one:
for i,df in enumerate(df_list):
    df = df.loc[df['Tm'] == 'TOT']
    df = df.copy()
    df['MP'] = df['MP'].astype(int)
    df['Rk'] = df['Rk'].astype(int)
    df = list(df[df['MP'] >= 1000]['Rk'])
    df = df[df['Rk'].isin(df)]
    df_list[i] = df
These two lines are probably wrong as well, because the first rebinds df to a plain list and the second then tries to index that list with 'Rk':
df = list(df[df['MP'] >= 1000]['Rk'])
df = df[df['Rk'].isin(df)]
Perhaps you want this:
for i,df in enumerate(df_list):
    df = df.loc[df['Tm'] == 'TOT']
    df = df.copy()
    df['MP'] = df['MP'].astype(int)
    df['Rk'] = df['Rk'].astype(int)
    #df = list(df[df['MP'] >= 1000]['Rk'])
    #df = df[df['Rk'].isin(df)]
    # just the rows where MP >= 1000
    df_list[i] = df[df['MP'] >= 1000]
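Note that filtering on Tm == 'TOT' alone also drops every player who stayed with one team. A rough sketch that keeps those players, uses the TOT row only for traded players, and also covers the per-minute arithmetic from the question is below. The helper name clean_season and the sample rows are invented for illustration, and it assumes, as in the question's update, that the scraped table repeats its header row and that all rows of a traded player share one Rk.

```python
import pandas as pd

def clean_season(df):
    """Hypothetical helper: one row per player, at least 1000 minutes."""
    # Drop the header rows that are repeated inside the scraped table.
    df = df[df["Rk"].ne("Rk")].copy()
    df["MP"] = df["MP"].astype(int)
    df["PTS"] = df["PTS"].astype(int)
    # A traded player has several rows with the same Rk; keep only his TOT row.
    traded = df["Rk"].duplicated(keep=False)
    df = df[~traded | df["Tm"].eq("TOT")]
    return df[df["MP"] >= 1000].reset_index(drop=True)

# Invented sample in the shape of the scraped table (every value is a string).
raw = pd.DataFrame(
    [
        ["1", "A", "TOT", "1200", "500"],
        ["1", "A", "BOS", "700", "300"],
        ["1", "A", "LAL", "500", "200"],
        ["2", "B", "CHI", "900", "400"],
        ["Rk", "Player", "Tm", "MP", "PTS"],
        ["3", "C", "NYK", "1500", "600"],
    ],
    columns=["Rk", "Player", "Tm", "MP", "PTS"],
)

df_list = [raw]  # stand-in for the 16 seasons
df_list = [clean_season(df) for df in df_list]

cleaned = df_list[0]
points_per_minute = cleaned["PTS"].sum() / cleaned["MP"].sum()
```

Building a new list with a comprehension avoids the rebinding problem entirely, since nothing is assigned to the loop variable.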

Multiple Line plot from a dataframe

From a dataframe I need to plot the count of events by month, to know which ones are more likely to happen in a given month. I don't know how to use the column Count.
df["MONTH"]= pd.to_datetime(df["BEGIN_DATE_TIME"], format = "%m/%d/%Y").dt.month
montly_events =df.groupby(["EVENT_TYPE", "MONTH"]).size().astype(int)
montly_events2 = montly_events.to_frame(name = "Count").reset_index()
plt.figure(figsize =(15,3))
sns.lineplot(x="MONTH", y="EVENT_TYPE", palette="ch:.25", data=df)
IIUC, Count is a column of montly_events2 (the frame in your screenshot), not of df, so use it as y and pass that frame as data.
Try
sns.lineplot(x="MONTH", y="Count", hue="EVENT_TYPE", palette="ch:.25", data=montly_events2)
This was the solution I found using a barplot:
g = sns.FacetGrid(montly_ev2, row="EVENT_TYPE", hue="MONTH",palette="Set3", height=4, aspect=2)
g.map(sns.barplot, 'MONTH', 'Count', order=hours)
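For a line per event type, one option is to reshape the counts into a wide table first. This is only a sketch on invented rows; the column names follow the question, while the names events, monthly and wide are made up here.

```python
import pandas as pd

# Invented events in the question's layout.
events = pd.DataFrame(
    {
        "EVENT_TYPE": ["Hail", "Hail", "Flood", "Hail", "Flood"],
        "BEGIN_DATE_TIME": [
            "01/05/2020", "01/20/2020", "01/07/2020", "02/11/2020", "03/02/2020",
        ],
    }
)
events["MONTH"] = pd.to_datetime(events["BEGIN_DATE_TIME"], format="%m/%d/%Y").dt.month

# Long format: one row per (event type, month) with its count.
monthly = (
    events.groupby(["EVENT_TYPE", "MONTH"]).size().to_frame(name="Count").reset_index()
)

# Wide format: months as rows, one column per event type, 0 where nothing happened.
wide = (
    monthly.pivot(index="MONTH", columns="EVENT_TYPE", values="Count")
    .fillna(0)
    .astype(int)
)
```

Calling wide.plot() should then draw one line per event type with the month on the x-axis.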

select top n rows after resampling DatetimeIndex

I need to get top n rows by some value per week (and I have hourly data).
data:
import numpy as np
import pandas as pd
dates = pd.date_range(start='1/1/2020', end='11/1/2020', freq="1h")
values = np.random.randint(20, 100500, len(dates))
some_other_column = np.random.randint(0, 10000000, len(dates))
df = pd.DataFrame({"date": dates, "value": values, "another_column": some_other_column})
My attempt:
resampled = df.set_index("date").resample("W")["value"].nlargest(5).to_frame()
It does give top 5 rows but all other columns except for date and value are missing - and I want to keep them all (in my dataset I have lots of columns but here another_column just to show that it's missing).
The solution I came up with:
resampled.index.names = ["week", "date"]
result = pd.merge(
resampled.reset_index(),
df,
how="left",
on=["date", "value"]
)
But it all feels wrong, I know there should be a much simpler solution. Any help?
The output I was looking for. Thanks @wwnde.
df["week"] = df["date"].dt.isocalendar().week
df.loc[df.groupby("week")["value"].nlargest(5).index.get_level_values(1), :]
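Group by week and build a mask that keeps the nlargest rows of each group:
df.set_index('date', inplace=True)
# ties with the 5th largest value are kept too
df[df.groupby(df.index.isocalendar().week)['value'].transform(lambda x: x.isin(x.nlargest(5)))]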
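Another way to keep every column without a merge is to group on pd.Grouper, which uses the same weekly bins as resample("W"), and then select the original rows by label. A minimal sketch on a smaller, seeded version of the question's data (four weeks instead of ten months, which is an assumption made only to keep it short):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", periods=24 * 28, freq="h")
df = pd.DataFrame(
    {
        "date": dates,
        "value": rng.permutation(len(dates)),  # unique values, so no ties
        "another_column": rng.integers(0, 100, len(dates)),
    }
)

# nlargest on the grouped column returns a MultiIndex of
# (week bin, original row label); level 1 holds the row labels.
top_idx = (
    df.groupby(pd.Grouper(key="date", freq="W"))["value"]
    .nlargest(5)
    .index.get_level_values(1)
)

# Selecting by label brings back all columns of those rows.
top = df.loc[top_idx]
```

Unlike a week number, the weekly bins do not collide when the data spans more than one year.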

Use a for loop to plot the same period multiple times

I have a pandas data frame with a DateTime series:
And I would like to plot multiple subplots with the same x-axis (hours 0 to 23) to compare the number of users on different days.
So, in the end, I have the same number of plots as days instead of just one plot comprising all January.
I have created 2 new columns "Day" and "Hour" and tried to iterate through them as follows:
for d in high['Day'].unique():
    print('Day ' + str(d))
    plt.figure()
    plt.plot(high['Hour'], high['Usuarios'])
    plt.show()
Although I'm creating a plot per day it is not working as expected:
The main thing that is missing is restricting the hours plotted to only one day. One way to do so is creating a new dataframe like this: day_high = high[high['Day'] == d]. Pandas supports many other ways to do so, for example groupby.
Here is some sample code to show how it could work. I added a line to save the plot to a file.
import matplotlib.pyplot as plt
import pandas as pd
import random
data = [[d, h, random.randint(0, 15)] for h in range(0, 24) for d in range(1, 32)]
high = pd.DataFrame(data, columns=['Day', 'Hour', 'Usuarios' ])
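for d in high['Day'].unique():
    print('Day ' + str(d))
    # keep only the rows of the current day
    day_high = high[high['Day'] == d]
    plt.figure()  # start a fresh figure so days do not pile up in one plot
    plt.plot(day_high['Hour'], day_high['Usuarios'])
    plt.title(f'Día {d}')
    plt.savefig(f'Día {d}.png')
    plt.show()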
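The groupby route mentioned above could look roughly like this. It is a sketch on the same kind of random sample data; the names rows_per_day and by_hour are made up for illustration.

```python
import random

import pandas as pd

random.seed(0)
data = [[d, h, random.randint(0, 15)] for d in range(1, 32) for h in range(24)]
high = pd.DataFrame(data, columns=["Day", "Hour", "Usuarios"])

# groupby yields (day, sub-frame) pairs, replacing the manual boolean filter.
rows_per_day = {d: len(day_high) for d, day_high in high.groupby("Day")}

# Alternatively, reshape once: one row per hour, one column per day.
by_hour = high.pivot(index="Hour", columns="Day", values="Usuarios")
```

With the reshaped table, by_hour.plot(subplots=True, sharex=True) would be one way to get a panel per day on a shared 0 to 23 axis.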

colormap with pandas dataframe plot function

I have data from multiple sites that record a sharp change in the monitored parameter. How could I plot the data for all these sites using value-dependent colors to enhance the visualization?
import numpy as np
import pandas as pd
import string
# site names
cols = string.ascii_uppercase
# number of days
ndays = 3
# index
index = pd.date_range('2018-05-01', periods=3*24*60, freq='min')
# simulated daily data
d1 = np.random.randn(len(index)//ndays, len(cols))
d2 = np.random.randn(len(index)//ndays, len(cols))+2
d3 = np.random.randn(len(index)//ndays, len(cols))-2
data=np.concatenate([d1, d2, d3])
df = pd.DataFrame(data=data, index=index, columns=list(cols))
df.plot(legend=False)
Each site (column) gets assigned one color in the above code. Is there a way to map the parameter values to different colors instead?
I guess one alternative is using colormaps option from scatter plot function: How to use colormaps to color plots of Pandas DataFrames
fig, ax = plt.subplots(figsize=(12,6))
collection = [plt.scatter(range(len(df)), df[col], c=df[col], s=25, cmap=cmap, edgecolor='None') for col in df.columns]
However, if I plot over time (i.e., x=df.index) things appear not to work as expected.
Is there any other alternative? or suggestion how to better visualize the sudden change in the time series?
In what follows I will use only 3 columns and hourly data in order to make the plots look less messy. The examples work as well with the original data.
cols = string.ascii_uppercase[:3]
ndays = 3
index = pd.date_range('2018-05-01', periods=3*24, freq='h')
# simulated daily data
d1 = np.random.randn(len(index)//ndays, len(cols))
d2 = np.random.randn(len(index)//ndays, len(cols))+2
d3 = np.random.randn(len(index)//ndays, len(cols))-2
data=np.concatenate([d1, d2, d3])
df = pd.DataFrame(data=data, index=index, columns=list(cols))
df.plot(legend=False)
The pandas way
You are out of luck: DataFrame.plot.scatter does not work with datetime-like data due to a long-standing bug.
The matplotlib way
Matplotlib's scatter can handle datetime-like data but the x-axis does not scale as expected.
for col in df.columns:
    plt.scatter(df.index, df[col], c=df[col])
plt.gcf().autofmt_xdate()
This looks like a bug to me but I could not find any reports. You can work around this by manually adjusting the x-limits.
for col in df.columns:
    plt.scatter(df.index, df[col], c=df[col])
start, end = df.index[[0, -1]]
xmargin = (end - start) * plt.gca().margins()[0]
plt.xlim(start - xmargin, end + xmargin)
plt.gcf().autofmt_xdate()
Unfortunately the x-axis formatter is not as nice as the pandas one.
The pandas way, revisited
I discovered this trick by chance and I do not understand why it works. If you plot a pandas Series indexed by the same datetime data before calling matplotlib's scatter, the autoscaling issue disappears and you get the nice pandas formatting.
So I made an invisible plot of the first column and then the scatter plot.
df.iloc[:, 0].plot(lw=0) # invisible plot
for col in df.columns:
    plt.scatter(df.index, df[col], c=df[col])