Julia DataFrame: create a new column counting :x values by :y

I have a DataFrame of x and y occurrences. I would like to count how often each occurrence happens in the DataFrame and what percentage of the :y occurrences that combination represents. I have the first part down now, thanks to a previous question.
using DataFrames
mydf = DataFrame(y = rand('a':'h', 1000), x = rand('i':'p', 1000))
mydfsum = by(mydf, [:x, :y], df -> DataFrame(n = length(df[:x])))
This successfully creates a column that counts how often each value of :x occurs with each value of :y. Now I need to be able to generate a new column that counts how often each value of :y occurs. I could next create a new DataFrame using:
mydfsumy = by(mydf, [:y], df -> DataFrame(ny = length(df[:x])))
Then join the DataFrames together:
mydfsum = join(mydfsum, mydfsumy, on = :y)
And create the percentage column :yp:
mydfsum[:yp] = mydfsum[:n] ./ mydfsum[:ny]
But this seems like a clunky workaround for a common data management problem. In R I would do all of this in one line using dplyr:
mydf %>% group_by(x, y) %>% summarize(n = n()) %>% group_by(y) %>% mutate(yp = n / sum(n))

You can do it in one line:
mydfsum = by(mydf, :y, df -> by(df, :x, dd -> DataFrame(n = size(dd,1), yp = size(dd,1)/size(df,1))))
or, if that becomes hard to read, you can use the do notation for anonymous functions:
mydfsum = by(mydf, :y) do df
    by(df, :x) do dd
        DataFrame(n = size(dd, 1), yp = size(dd, 1) / size(df, 1))
    end
end
What you are doing in R is actually a first by on both x and y, then mutating a column of the output. You can also do that, but you need to have created that column first. Here I first initialize the :yp column with zeros and then modify it in place with another by.
mydfsum = by(mydf,[:x,:y], df -> DataFrame(n = size(df,1), yp = 0.))
by(mydfsum, :y, df -> (df[:yp] = df[:n]/sum(df[:n])))
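As a side note, newer releases of DataFrames.jl deprecated by; a minimal sketch of the same computation, assuming DataFrames.jl 1.x and its groupby/combine/transform API, would be:
using DataFrames
# count each (x, y) combination
mydfsum = combine(groupby(mydf, [:x, :y]), nrow => :n)
# within each :y group, divide n by the group total to get the share
mydfsum = transform(groupby(mydfsum, :y), :n => (n -> n ./ sum(n)) => :yp)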
For more advanced data manipulation you may want to take a look at Query.jl.

Related

replacing df.append with pd.concat when building a new dataframe from file read

...
header = pd.DataFrame()
for x in {0,7,8,9,10,11,12,13,14,15,18,19,21,23}:
    header = header.append({'col1': data1[x].split(':')[0],
                            'col2': data1[x].split(':')[1][:-1],
                            'col3': data2[x].split(':')[1][:-1],
                            'col4': data2[x] == data1[x],
                            'col5': '---'},
                           ignore_index=True)
...
I have some Jupyter Notebook code which reads two text files into data1 and data2 and, using a list of indices, picks out specific matching lines from both files into a dataframe for easy display and comparison in the notebook.
Since df.append is now deprecated in favour of pd.concat, what's the tidiest way to do this?
Is it basically to replace the inner loop code with:
...
header = pd.concat(header, {all the column code from above })
...
Additional input in response to the comment below:
Yes, sorry. For example, the next block of code does this:
for x in {4, 2, 5}:
    header = header.append({'col1': SOMENEWROWNAME,
                            'col2': data1[x].split(':')[1][:-1],
                            'col3': data2[x].split(':')[1][:-1],
                            'col4': data2[x] == data1[x],
                            'col5': float(data2[x].split(':')[1][:-1]) - float(data1[x].split(':')[1][:-1])},
                           ignore_index=True)
This is repeated five times with different data indices in the loop, and then a different SOMENEWROWNAME.
I inherited this notebook, and I see now that it was done this way because they only wanted a numerical float difference on the columns where numbers appear.
But there are several such blocks, with different lines in the data, and where that first parameter SOMENEWROWNAME is the different text fields from the respective lines in the data.
So I was primarily just trying to fix these append-to-concat warnings, but of course if the code can be better written then all good!
Use a list comprehension and the DataFrame constructor; building the full list of rows first and constructing the frame once is also much faster than appending row by row:
data = [{'col1': data1[x].split(':')[0],
         'col2': data1[x].split(':')[1][:-1],
         'col3': data2[x].split(':')[1][:-1],
         'col4': data2[x] == data1[x],
         'col5': '---'} for x in {0,7,8,9,10,11,12,13,14,15,18,19,21,23}]
df = pd.DataFrame(data)
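For instance, with hypothetical data1/data2 lists of 'name:value' lines (stand-ins for the two files read in the notebook), the pattern runs like this:
import pandas as pd

# hypothetical stand-ins for the two text files, one 'name:value' line each
data1 = [f"field{i}:{i * 1.5}\n" for i in range(25)]
data2 = [f"field{i}:{i * 2.0}\n" for i in range(25)]

data = [{'col1': data1[x].split(':')[0],
         'col2': data1[x].split(':')[1][:-1],   # [:-1] strips the trailing newline
         'col3': data2[x].split(':')[1][:-1],
         'col4': data2[x] == data1[x],
         'col5': '---'} for x in {0, 7, 8, 9}]
df = pd.DataFrame(data)
print(df)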
EDIT:
out = []
# sample
for x in {1, 7, 30}:
    out.append({'col1': SOMENEWROWNAME,
                'col2': data1[x].split(':')[1][:-1],
                'col3': data2[x].split(':')[1][:-1],
                'col4': data2[x] == data1[x],
                'col5': float(data2[x].split(':')[1][:-1]) - float(data1[x].split(':')[1][:-1])})
df1 = pd.DataFrame(out)

out1 = []
# sample
for x in {1, 7, 30}:
    out1.append({another dict})
df2 = pd.DataFrame(out1)

df = pd.concat([df1, df2])
Or:
final = []
for x in {4, 2, 5}:
    final.append({'col1': SOMENEWROWNAME,
                  'col2': data1[x].split(':')[1][:-1],
                  'col3': data2[x].split(':')[1][:-1],
                  'col4': data2[x] == data1[x],
                  'col5': float(data2[x].split(':')[1][:-1]) - float(data1[x].split(':')[1][:-1])})
for x in {4, 2, 5}:
    final.append({another dict})
df = pd.DataFrame(final)

Working on multiple data frames with data for NBA players during the season, how can I modify all the dataframes at the same time?

I have a list of 16 dataframes that contain stats for each player in the NBA during the respective season. My end goal is to run unsupervised learning algorithms on the data frames. For example, I want to see if I can determine a player's position by their stats or if I can determine their total points during the season based on their stats.
What I would like to do is modify the list (df_list) of these dataframes, unless there's a better solution, instead of modifying each dataframe individually, to:
1. Change the datatype of the MP (minutes played) column from str to int.
2. Filter the dataframe so it only has players with 1000 or more MP and no duplicate players (Rk). (For instance, in a season a player (Rk) can play for three teams and have 200 MP, 300 MP, and 400 MP with each team. He'll have a row for each team plus a row called TOT which renders his MP as 900 (200+300+400), for a total of four rows in the dataframe. I only need the TOT row.)
3. Use simple arithmetic with various individual columns, for example: totaling the MP column and the PTS column and then dividing the sum of the PTS column by the sum of the MP column, or dividing the total of the PTS column by the len of the PTS column.
What I've done so far is this:
Import my libraries and create 16 dataframes using pd.read_html(url).
The first dataframes were created using two lines of code:
url = "https://www.basketball-reference.com/leagues/NBA_1997_totals.html"
ninetysix = pd.read_html(url)[0]
HOWEVER, the next four data frames had to be created using a few additional lines of code (I received an error that said "html5lib not found, please install it", so I installed both html5lib and requests). I mention that because this distinction in how the DFs are created may have to be considered in a solution.
The code I used:
import requests
import uuid
url = 'https://www.basketball-reference.com/leagues/NBA_1998_totals.html'
cookies = {'euConsentId': str(uuid.uuid4())}
html = requests.get(url, cookies=cookies).content
ninetyseven = pd.read_html(html)[0]
These four data frames look like this: [screenshot omitted]
I tried this but it didn't do anything:
df_list = [
    eightyfour, eightyfive, eightysix, eightyseven,
    eightyeight, eightynine, ninety, ninetyone,
    ninetytwo, ninetyfour, ninetyfive,
    ninetysix, ninetyseven, ninetyeight, owe_one, owe_two
]
for df in df_list:
    df = df.loc[df['Tm'] == 'TOT']
    df = df.copy()
    df['MP'] = df['MP'].astype(int)
    df['Rk'] = df['Rk'].astype(int)
    df = list(df[df['MP'] >= 1000]['Rk'])
    df = df[df['Rk'].isin(df)]
owe_two
==================== UPDATE ====================
This code solves a portion of problem #2:
url = 'https://www.basketball-reference.com/leagues/NBA_1997_totals.html'
dd = pd.read_html(url)[0]
dd = dd[dd['Rk'].ne('Rk')]
dd['MP'] = dd['MP'].astype(int)
players_1000_rk_list = list(dd[dd['MP'] >= 1000]['Rk'])
players_dd = dd[dd['Rk'].isin(players_1000_rk_list)]
But it doesn't remove the duplicates.
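(A hedged aside: one way to drop those per-team duplicates from players_dd, assuming the TOT row should win whenever a Rk appears more than once, would be:
# keep TOT rows, plus any row whose Rk is not duplicated at all
is_tot = players_dd['Tm'].eq('TOT')
dupes = players_dd['Rk'].duplicated(keep=False)
players_dd = players_dd[is_tot | ~dupes]
)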
==================== UPDATE 10/11/22 ================================
Let's say I take the rows with value "TOT" in the "Tm" column and create a new DF with them, and drop these rows from the original data frame...
Could I then compare the new DF with the original data frame and remove the names from the original data IF they match the names from the new data frame?
The problem is that the df you are working on in the loop is not the same df that is in df_list. You could solve this by saving the new df back to the list, overwriting the old df:
for i, df in enumerate(df_list):
    df = df.loc[df['Tm'] == 'TOT']
    df = df.copy()
    df['MP'] = df['MP'].astype(int)
    df['Rk'] = df['Rk'].astype(int)
    df = list(df[df['MP'] >= 1000]['Rk'])
    df = df[df['Rk'].isin(df)]
    df_list[i] = df
These two lines are probably wrong as well:
df = list(df[df['MP'] >= 1000]['Rk'])
df = df[df['Rk'].isin(df)]
Perhaps you want this:
for i, df in enumerate(df_list):
    df = df.loc[df['Tm'] == 'TOT']
    df = df.copy()
    df['MP'] = df['MP'].astype(int)
    df['Rk'] = df['Rk'].astype(int)
    #df = list(df[df['MP'] >= 1000]['Rk'])
    #df = df[df['Rk'].isin(df)]
    # just the rows where MP >= 1000
    df_list[i] = df[df['MP'] >= 1000]
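A minimal, self-contained sketch of that loop on a made-up one-season table (the values here are invented for illustration):
import pandas as pd

# hypothetical stand-in for one season's table
season = pd.DataFrame({'Rk': ['1', '1', '1', '2'],
                       'Tm': ['ABC', 'DEF', 'TOT', 'GHI'],
                       'MP': ['700', '800', '1500', '1200']})
df_list = [season]

for i, df in enumerate(df_list):
    df = df.loc[df['Tm'] == 'TOT'].copy()
    df['MP'] = df['MP'].astype(int)
    df['Rk'] = df['Rk'].astype(int)
    df_list[i] = df[df['MP'] >= 1000]   # only player 1's TOT row survives

print(df_list[0])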

Select cells in a pandas DataFrame by a Series of its column labels

Say we have a DataFrame and a Series of its column labels, both (almost) sharing a common index:
df = pd.DataFrame(...)
s = df.idxmax(axis=1).shift(1)
How can I obtain cells given a series of column labels, getting the value from every row using the corresponding column label from the joined series? I'd imagine it would be:
values = df[s] # either
values = df.loc[s] # or
In my example I'd like the values that sit under the previous row's biggest-in-their-row values (I'm doing a poor man's ML :) ).
However, I cannot find any interface for selecting cells by a series of column labels. Any ideas, folks?
Meanwhile I use this monstrous snippet:
def get_by_idxs(df: pd.DataFrame, idxs: pd.Series) -> pd.Series:
    ts_v_pairs = [
        (ts, row[row['idx']])
        for ts, row in df.join(idxs.rename('idx'), how='inner').iterrows()
        if isinstance(row['idx'], str)
    ]
    return pd.Series([v for ts, v in ts_v_pairs], index=[ts for ts, v in ts_v_pairs])
I think you need a dataframe lookup:
v = s.dropna()
# use the index positions of v's labels, so the rows dropped by dropna stay aligned
v[:] = df.to_numpy()[df.index.get_indexer_for(v.index), df.columns.get_indexer_for(v)]
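A quick self-contained check of that lookup, with random data standing in for the question's frame:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 3), columns=['a', 'b', 'c'])
s = df.idxmax(axis=1).shift(1)   # previous row's argmax column label

v = s.dropna()
v[:] = df.to_numpy()[df.index.get_indexer_for(v.index),
                     df.columns.get_indexer_for(v)]
print(v)   # for each row, the value under the previous row's max column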

Unique values in categorical variables in RStudio

How can I find how many unique values each categorical variable takes in a data frame and then represent it with a graph? All of this in RStudio.
We'll use the tidyverse here.
library(tidyverse)
You can apply the unique() function to a dataframe to remove any repeat rows.
df <- iris %>% unique()
The group_by(), summarise() and n() functions let you count the number of instances of a variable in a dataframe.
df2 <- df %>% group_by(Species) %>% summarise(n = n())
## alternatively use count() which does the same thing
df2 <- df %>% count(Species)
Finally, we can use the ggplot2 package to create a graph.
ggplot() + geom_col(data = df2, aes(x = Species, y = n))
If you're not interested in having a separate dataframe with the data in it and want to jump straight to the graph, you can ignore the step with group_by() and summarise() and just use geom_bar().
ggplot() + geom_bar(data = df, aes(Species))
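If the goal is the number of unique values for every categorical column at once, a short sketch (assuming dplyr >= 1.0 and that factor columns mark the categoricals) would be:
library(tidyverse)

# distinct-value count per factor column of the data frame
iris %>% summarise(across(where(is.factor), n_distinct))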

Conditional join using sqldf in R with time data

So I have a table (~2000 rows, call it df1) of when a particular subject received a medication on a particular date, and I have a large excel file (>1 million rows) of weight data for subjects on different dates (call it df2).
AIM: I want to group by subject and find the weight in df2 that was recorded closest to the medication admin time in df1, using sqldf (because the tables are too big to load into R). Or alternatively, I can set up a time frame of interest (e.g. +/- 1 week of medication given) and find a row that falls within that timeframe.
Example:
df1 <- data.frame(
  PtID = rep(c(1:5), each = 2),
  Dose = rep(seq(100, 200, 25), 2),
  ADMIN_TIME = seq.Date(as.Date("2016/01/01"), by = "month", length.out = 10)
)
df2 <- data.frame(
  PtID = rep(c(1:5), each = 10),
  Weight = rnorm(50, 50, 10),
  Wt_time = seq.Date(as.Date("2016/01/01"), as.Date("2016/10/31"), length.out = 50)
)
So I think I want to left_join df1 and df2, group by PtID, and set up some condition that identifies either the df2$Weight closest to df1$ADMIN_TIME or a df2$Weight within an acceptable range around df1$ADMIN_TIME, using SQL formatting.
So I tried creating a range and then querying the following:
library(dplyr)
library(lubridate)

df1 <- df1 %>%
  mutate(ADMIN_START = ADMIN_TIME - ddays(30),
         ADMIN_END = ADMIN_TIME + ddays(30))
# df2.csv is the large spreadsheet saved in my working directory
result <- read.csv.sql("df2.csv", sql = "select Weight from file
                        left join df1
                        on file.Wt_time between df1.ADMIN_START and df1.ADMIN_END")
This will run but it never returns anything and I have to escape out of it. Any thoughts are appreciated.
Thanks!
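(No answer was given in the thread. As a hedged sketch of the direction described above, and not a tested solution: sqldf uses SQLite, where the csv's date column is compared as text, so the Date bounds are converted to 'YYYY-MM-DD' strings first, and the join also matches on PtID so it doesn't degenerate into a near cross-product:
library(sqldf)

# sketch only: SQLite compares the file's dates as ISO-8601 text
df1$ADMIN_START <- format(df1$ADMIN_START, "%Y-%m-%d")
df1$ADMIN_END   <- format(df1$ADMIN_END, "%Y-%m-%d")

# also match on PtID, so each subject is only compared to their own weights
result <- read.csv.sql("df2.csv", sql =
  "select df1.PtID, df1.Dose, file.Weight, file.Wt_time
   from df1
   left join file
     on file.PtID = df1.PtID
    and file.Wt_time between df1.ADMIN_START and df1.ADMIN_END")
)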