Pandas Groupby: Groupby conditional statement

I am trying to identify the location of stops from GPS data but need to account for some GPS drift.
I have identified stops and isolated them into a new dataframe:
df['Stopped'] = (df.groupby('DAY')['LAT'].diff().abs() <= 0.0005) & (df.groupby('DAY')['LNG'].diff().abs() <= 0.0005)
df2 = df.loc[(df['Stopped'] == True)]
Now I can label groups that have the exact match in coordinates using:
df2['StoppedEvent'] = df2.groupby(['LAT','LNG']).ngroup()
But I want to group by the same conditions of Stopped. Something like this but that works:
df2['StoppedEvent'] = df2.groupby((['LAT','LNG']).diff().fillna(0).abs() <= 0.0005).ngroup()

I would do something like the following:
df['Stopped'] = (df.groupby('DAY')['LAT'].diff().abs() <= 0.0005)\
& (df.groupby('DAY')['LNG'].diff().abs() <= 0.0005)
df["Stopped_Group"] = (~df["Stopped"]).cumsum()
df2 = df.loc[df['Stopped']]
Now you'll have a column, "Stopped_Group", which is constant within a set of rows that are close to each other as determined by your logic. In the original dataframe, df, this column won't have any meaning for rows that correspond to motion.
To get your desired output (if I understand you correctly), do something like the following:
df2["Stopped_Duration"] = df2.groupby("Stopped_Group").transform("size")


Concatenate rows in Pandas

I have 12 months of sales data, one file per month, and I want to analyze the dataset as a whole.
I have tried using the concat function, but it produces NaN values in my dataframe fields.
In R, the cbind function solves this. How do I approach this differently in Python?
I tried using pd.concat to bind the rows because all the column names are the same across the datasets.
What other options can I explore?
sales_1 = pd.read_csv('Sales_January_2019.csv')
sales_2 = pd.read_csv('Sales_February_2019.csv')
sales_3 = pd.read_csv('Sales_March_2019.csv')
sales_4 = pd.read_csv('Sales_April_2019.csv')
sales_5 = pd.read_csv('Sales_May_2019.csv')
sales_6 = pd.read_csv('Sales_June_2019.csv')
sales_7 = pd.read_csv('Sales_July_2019.csv')
sales_8 = pd.read_csv('Sales_August_2019.csv')
sales_9 = pd.read_csv('Sales_September_2019.csv')
sales_10 = pd.read_csv('Sales_October_2019.csv')
sales_11 = pd.read_csv('Sales_November_2019.csv')
sales_12 = pd.read_csv('Sales_December_2019.csv')
I expect all data frames to be merged into one since the column names are the same for all.
Perhaps:
# use concat with the list of DataFrames you already read in to combine them into a single DataFrame
pd.concat([sales_1, sales_2, sales_3, sales_4, sales_5, sales_6,
           sales_7, sales_8, sales_9, sales_10, sales_11, sales_12],
          ignore_index=True)
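If the files all follow the same naming pattern, the twelve read_csv calls can be replaced by a glob; a minimal sketch, assuming the CSVs sit in the working directory and match the Sales_*_2019.csv names shown above:
import glob
import pandas as pd

# read every monthly file and stack the rows into one DataFrame
files = sorted(glob.glob('Sales_*_2019.csv'))
all_sales = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)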

Working on multiple data frames with data for NBA players during the season, how can I modify all the dataframes at the same time?

I have a list of 16 dataframes that contain stats for each player in the NBA during the respective season. My end goal is to run unsupervised learning algorithms on the data frames. For example, I want to see if I can determine a player's position by their stats or if I can determine their total points during the season based on their stats.
What I would like to do is modify the list (df_list) of these dataframes, unless there's a better solution, instead of modifying each dataframe individually, to:
Change the datatype of the MP (minutes played) column from str to int.
Filter the dataframe so there are only players with 1000 or more MP and no duplicate players (Rk).
(For instance, a player (Rk) can play for three teams in a season and have 200 MP, 300 MP, and 400 MP with each team. He'll have a row for each team and a row called TOT which renders his MP as 900 (200+300+400), for a total of four rows in the dataframe. I only need the TOT row.)
Use simple algebra with various individual columns, for example: totaling the MP column and the PTS column and then dividing the sum of the PTS column by the sum of the MP column.
Or dividing the total of the PTS column by the len of the PTS column.
What I've done so far is this:
Imported my libraries and created 16 dataframes using pd.read_html(url).
The first dataframes were created using two lines of code:
url = "https://www.basketball-reference.com/leagues/NBA_1997_totals.html"
ninetysix = pd.read_html(url)[0]
HOWEVER, the next four data frames had to be created using a few additional lines of code (I received an error that said "html5lib not found, please install it", so I installed both html5lib and requests). I say that to say that this distinction in creating the DFs may have to be considered in a solution.
The code I used:
import requests
import uuid
url = 'https://www.basketball-reference.com/leagues/NBA_1998_totals.html'
cookies = {'euConsentId': str(uuid.uuid4())}
html = requests.get(url, cookies=cookies).content
ninetyseven = pd.read_html(html)[0]
I tried this but it didn't do anything:
df_list = [
    eightyfour, eightyfive, eightysix, eightyseven,
    eightyeight, eightynine, ninety, ninetyone,
    ninetytwo, ninetyfour, ninetyfive,
    ninetysix, ninetyseven, ninetyeight, owe_one, owe_two
]
for df in df_list:
    df = df.loc[df['Tm'] == 'TOT']
    df = df.copy()
    df['MP'] = df['MP'].astype(int)
    df['Rk'] = df['Rk'].astype(int)
    df = list(df[df['MP'] >= 1000]['Rk'])
    df = df[df['Rk'].isin(df)]
owe_two
============================UPDATE===================================
This code solves a portion of problem #2:
url = 'https://www.basketball-reference.com/leagues/NBA_1997_totals.html'
dd = pd.read_html(url)[0]
dd = dd[dd['Rk'].ne('Rk')]
dd['MP'] = dd['MP'].astype(int)
players_1000_rk_list = list(dd[dd['MP'] >= 1000]['Rk'])
players_dd = dd[dd['Rk'].isin(players_1000_rk_list)]
But it doesn't remove the duplicates.
==================== UPDATE 10/11/22 ================================
Let's say I take the rows with the value "TOT" in the "Tm" column, create a new DF with them, and remove these rows from the original data frame...
Could I then compare the new DF with the original data frame and remove the names from the original data IF they match the names in the new data frame?
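A minimal sketch of that idea, assuming Rk identifies a player and reusing the dd frame from the update above:
# season-total rows for traded players
tot_rows = dd[dd['Tm'] == 'TOT']
# keep a player's per-team rows only when he has no TOT row
single_team = dd[~dd['Rk'].isin(tot_rows['Rk'])]
dd_dedup = pd.concat([tot_rows, single_team], ignore_index=True)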
The problem is that the df you are working on in the loop is not the same df that is in df_list. You could solve this by saving the new df back to the list, overwriting the old df:
for i, df in enumerate(df_list):
    df = df.loc[df['Tm'] == 'TOT']
    df = df.copy()
    df['MP'] = df['MP'].astype(int)
    df['Rk'] = df['Rk'].astype(int)
    df = list(df[df['MP'] >= 1000]['Rk'])
    df = df[df['Rk'].isin(df)]
    df_list[i] = df
These two lines are probably wrong as well:
df = list(df[df['MP'] >= 1000]['Rk'])
df = df[df['Rk'].isin(df)]
Perhaps you want this:
for i, df in enumerate(df_list):
    df = df.loc[df['Tm'] == 'TOT']
    df = df.copy()
    df['MP'] = df['MP'].astype(int)
    df['Rk'] = df['Rk'].astype(int)
    #df = list(df[df['MP'] >= 1000]['Rk'])
    #df = df[df['Rk'].isin(df)]
    # keep just the rows where MP >= 1000
    df_list[i] = df[df['MP'] >= 1000]
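The same cleanup can also be written as a list comprehension, which avoids the index bookkeeping; a sketch under the same assumptions as the loop above:
df_list = [
    df.loc[df['Tm'] == 'TOT']
      .astype({'MP': int, 'Rk': int})
      .query('MP >= 1000')
    for df in df_list
]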

How to select only rows containing specific values with multiple data frames in a for loop?

I'm new to Python. I have multiple data frames and want to filter each data frame based on a column which contains the value xxx.
Below is my code:
MasterFiles = [Master_Jun22, Master_May22, Master_Apr22, Master_Mar22, Master_Feb22, Master_Jan22,
               Master_Dec21, Master_Nov21, Master_Oct21, Master_Sep21, Master_Aug21, Master_Jul21,
               Master_Jun21, Master_May21, Master_Apr21]
ColumName = ['product_category']
for d in MasterFiles:
    for c in ColumName:
        d = d.loc[d[c] == 'XXX']
It is not working; please help with this.
You need to gather the output and append it to a new DataFrame:
MasterFiles = [Master_Jun22, Master_May22, Master_Apr22, Master_Mar22, Master_Feb22, Master_Jan22,
               Master_Dec21, Master_Nov21, Master_Oct21, Master_Sep21, Master_Aug21, Master_Jul21,
               Master_Jun21, Master_May21, Master_Apr21]
ColumName = ['product_category']
res_df = pd.DataFrame(columns=ColumName)
for d in MasterFiles:
    for c in ColumName:
        res_df = pd.concat([res_df, d.loc[d[c] == 'XXX']], ignore_index=True)
# the results
res_df.head()
I am not sure if I am understanding your question correctly, so let me rephrase it here.
You have 3 tasks:
first is to loop through each pandas data frame,
second is to loop through each column in your ColumName list, and
third is to return the data frame rows that contain the value Surabhi - DCL - Unsecured in the column named by the ColumName list.
If I am interpreting this correctly, this is how I would work on your issue:
MasterFiles = [Master_Jun22, Master_May22, Master_Apr22, Master_Mar22, Master_Feb22, Master_Jan22,
               Master_Dec21, Master_Nov21, Master_Oct21, Master_Sep21, Master_Aug21, Master_Jul21,
               Master_Jun21, Master_May21, Master_Apr21]
ColumName = ['product_category']

## list to store the row-filtered data frames
df_temp = []
for d in MasterFiles:
    for c in ColumName:
        df_temp.append(d.loc[d[c] == 'Surabhi - DCL - Unsecured'])

## assuming row-wise concatenation,
## i.e., using the same column names to join the data
df = pd.concat(df_temp, axis=0, ignore_index=True)
## df is the data frame you need
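As a side note, the two loops can be collapsed into a single comprehension; a sketch using the same names as above:
df = pd.concat(
    [d.loc[d[c] == 'Surabhi - DCL - Unsecured'] for d in MasterFiles for c in ColumName],
    ignore_index=True,
)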

pandas data frame columns - how to select a subset of columns based on multiple criteria

Let us say I have the following columns in a data frame:
title
year
actor1
actor2
cast_count
actor1_fb_likes
actor2_fb_likes
movie_fb_likes
I want to select the following columns from the data frame and ignore the rest of the columns :
the first 2 columns (title and year)
some columns based on name - cast_count
some columns which contain the string "actor1" - actor1 and actor1_fb_likes
I am new to pandas. For each of the above operations, I know what method to use. But I want to do all three operations together as all I want is a dataframe that contains the above columns that I need for further analysis. How do I do this?
Here is example code that I have written:
data = {
    "title": ['Hamlet', 'Avatar', 'Spectre'],
    "year": ['1979', '1985', '2007'],
    "actor1": ['Christoph Waltz', 'Tom Hardy', 'Doug Walker'],
    "actor2": ['Rob Walker', 'Christian Bale', 'Tom Hardy'],
    "cast_count": ['15', '24', '37'],
    "actor1_fb_likes": [545, 782, 100],
    "actor2_fb_likes": [50, 78, 35],
    "movie_fb_likes": [1200, 750, 475],
}
df_input = pd.DataFrame(data)
print(df_input)
df1 = df_input.iloc[:,0:2] # Select first 2 columns
df2 = df_input[['cast_count']] #select some columns by name - cast_count
df3 = df_input.filter(like='actor1') #select columns which contain the string "actor1" - actor1 and actor1_fb_likes
df_output = pd.concat(df1, df2, df3)  # This throws an error whose reason I can't understand
print(df_output)
Question 1:
df_1 = df[['title', 'year']]
Question 2:
# This is an example but you can put whatever criteria you'd like
df_2 = df[df['cast_count'] > 10]
Question 3:
# This is an example but you can put whatever criteria you'd like this way
df_2 = df[(df['actor1_fb_likes'] > 1000) & (df['actor1'] == 'actor1')]
Make sure each filter is contained within its own set of parentheses () before using the & or | operators. & acts as an and operator. | acts as an or operator.
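As for the error in the question's last line: pd.concat expects a list of frames and, to place the selections side by side, axis=1. A minimal fix reusing the df1, df2, df3 frames defined above:
# pass the frames as a list and concatenate column-wise
df_output = pd.concat([df1, df2, df3], axis=1)
print(df_output)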

I have a dataframe and I want to find the standard deviation for some specific cells

I'm trying to use pandas to find the standard deviation for the entries in some specific cells
I have tried using NumPy's std like so:
numpy.std(df[columnName][j:i])
I have also tried using this:
df.std(axis=0)[columnName][j:i]
Just pseudocode, because my actual code is more confusing than necessary for this question:
df = loadIris()
for feat in df.columns:
    i = 0
    j = 0
    flower = df['flower'][i]
    while i < df.index.max():
        if df['flower'][i] == flower:
            i += 1
        else:
            j = i
            stand = df.std(axis=0)[feat][j:i]
            flower = df['flower'][i]
I ended up just appending all of the values to a list and then calculating the standard deviation using statistics.stdev which you can get by importing statistics.
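For reference, pandas can do this grouping directly; a minimal sketch, assuming df holds one row per sample with a 'flower' label column as in the pseudocode above. Note that DataFrame.std and statistics.stdev both compute the sample standard deviation (ddof=1), while numpy.std defaults to the population version (ddof=0):
# per-species standard deviation of every numeric feature column
stds = df.groupby('flower').std(numeric_only=True)
print(stds)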