I have the following dataset and I would like to remove the top and bottom 1% percentiles on the column "ROA" for each "PRIMARY_SIC_CODE", i.e., take all the different ROAs for each PRIMARY_SIC_CODE, drop the rows falling in those extreme quantiles, and keep the rest of the dataset.
Is there any easy way of doing it? Thanks!
If you want to exclude the top and bottom 1% by considering the column ROA in its entirety:
top_1perc = df['ROA'].quantile(q=0.99)
bottom_1perc = df['ROA'].quantile(q=0.01)
new_df = df[(df['ROA'] > bottom_1perc) & (df['ROA'] < top_1perc)]
If instead you want to exclude them within each PRIMARY_SIC_CODE group:
df[df.groupby('PRIMARY_SIC_CODE')['ROA'].transform(
    lambda x: (x > x.quantile(q=0.01)) & (x < x.quantile(q=0.99)))]
The transform returns a Boolean mask aligned with the original index, so it can be used directly for row selection.
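Here is a minimal, self-contained sketch of the per-group approach; the SIC codes and the normally distributed ROA values are made up for illustration:
import numpy as np
import pandas as pd

# Hypothetical data: two industry groups with 1,000 ROA observations each
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'PRIMARY_SIC_CODE': np.repeat([2834, 7372], 1000),
    'ROA': rng.normal(loc=0.05, scale=0.10, size=2000),
})

# Keep only rows strictly inside each group's 1st-99th percentile band
mask = df.groupby('PRIMARY_SIC_CODE')['ROA'].transform(
    lambda x: (x > x.quantile(q=0.01)) & (x < x.quantile(q=0.99)))
trimmed = df[mask]

print(df.groupby('PRIMARY_SIC_CODE').size())       # 1000 rows per group
print(trimmed.groupby('PRIMARY_SIC_CODE').size())  # ~980 rows per group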
Try along the lines of...
df.groupby('PRIMARY_SIC_CODE')['ROA'].quantile(q=[0.01, 0.99])
which gives the 1st and 99th percentile cut-offs for each group.
I am trying to calculate proportions with multiple subcategories. The series is grouped by ['budget_levels', 'revenue_levels'].
I would like to calculate the proportion for each.
For example,
count(budget_levels=='low' & revenue_levels=='low') / count(budget_levels=='low')
count(budget_levels=='low' & revenue_levels=='medium') / count(budget_levels=='low')
However, I am not getting the desired output.
Is there any way I could do this calculation for each with a simple one-line code such as .apply(lambda) function?
Use value_counts to get the number of occurrences of each combination. Then group by the first index level (budget_levels) and divide the counts in each group by their sum. sort_index makes it easier to compare the groups.
df.value_counts().groupby(level=0).transform(lambda x: x / x.sum()).sort_index()
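A quick sketch with made-up values for the two columns (the data itself is an assumption) shows the shape of the result:
import pandas as pd

# Hypothetical data with the two categorical columns from the question
df = pd.DataFrame({
    'budget_levels':  ['low', 'low', 'low', 'high', 'high'],
    'revenue_levels': ['low', 'medium', 'low', 'high', 'low'],
})

# Share of each revenue level within its budget level
props = (df.value_counts()
           .groupby(level=0)
           .transform(lambda x: x / x.sum())
           .sort_index())
print(props)
# budget_levels  revenue_levels
# high           high              0.500000
#                low               0.500000
# low            low               0.666667
#                medium            0.333333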
Could anyone help me?
I want to sum the values with the format:
print (...+....+)
for example:
a        b
France   2
Italie  15
Croatie  7
I want to make the sum of France and Croatie.
Thank you for your help !
One of possible solutions:
set column a as the index,
using loc, select the rows for the "wanted" values,
take column b,
sum the values found.
So the code can be:
result = df.set_index('a').loc[['France', 'Croatie']].b.sum()
Note the double square brackets: the outer pair is the indexing operator passed to loc, and the inner pair delimits the list of index values to select.
To subtract two sums (one for some set of countries and the second for another set),
you can run e.g.:
wrk = df.set_index('a').b
result = wrk.loc[['Italie']].sum() - wrk.loc[['France', 'Croatie']].sum()
Note that every label passed to loc must exist in the index; with the sample data above, a missing country such as 'USA' would raise a KeyError.
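For reference, a small end-to-end sketch built from the sample data in the question:
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'a': ['France', 'Italie', 'Croatie'],
                   'b': [2, 15, 7]})

# Sum of b for the selected countries
result = df.set_index('a').loc[['France', 'Croatie']].b.sum()
print(result)  # 2 + 7 = 9

# Difference between two sums
wrk = df.set_index('a').b
print(wrk.loc[['Italie']].sum() - wrk.loc[['France', 'Croatie']].sum())  # 15 - 9 = 6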
The command below shows some details about the dataframe.
df.describe()
It gives details about count, mean, std, min, 25%, ...
Is there any way to get the count of rows in a dataframe at or beyond the 75% or 25% quantile?
Thanks.
Use pandas.Series.quantile to determine the value for the given quantile of the selected column.
.quantile has the benefit of being able to specify any quantile value (e.g. 30%)
.describe() is limited to [25%, 50%, 75%] by default (though it accepts a percentiles argument), and it performs aggregations you may not need.
Select the specific data using Boolean selection, with .ge and .le
.ge is >=
.le is <=
.eq is ==
Once you have all the rows matching the criteria, use len(quartile_25) (or quartile_25.shape[0]) to determine how many rows meet the criteria; note that quartile_25.count() returns per-column non-null counts rather than a single number.
col should be some column name as a string
quartile_75 = df[df[col].ge(df[col].quantile(q=.75))]
quartile_25 = df[df[col].le(df[col].quantile(q=.25))]
max_ = df[df[col].eq(df[col].max())]
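A self-contained sketch, assuming a single hypothetical column of random values:
import numpy as np
import pandas as pd

# Hypothetical frame with 1,000 random values in one column
rng = np.random.default_rng(1)
df = pd.DataFrame({'value': rng.normal(size=1000)})
col = 'value'

quartile_75 = df[df[col].ge(df[col].quantile(q=.75))]
quartile_25 = df[df[col].le(df[col].quantile(q=.25))]

print(len(quartile_75))  # 250 rows at or above the 75th percentile
print(len(quartile_25))  # 250 rows at or below the 25th percentile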
Help with homework problem: "Let us define the "data science experience" of a given person as the person's largest score among Regression, Classification, and Clustering. Compute the average data science experience among all MSIS students."
Beginner to coding. I am trying to figure out how to compare several columns against each other row by row to find the largest value, and then take the average of those values.
I greatly appreciate your help in advance!
Picture of the sample data set: https://i.stack.imgur.com/9OSjz.png
Provided Code:
import pandas as pd
df = pd.read_csv("cleaned_survey.csv", index_col=0)
df.drop(['ProgSkills','Languages','Expert'],axis=1,inplace=True)
What I have tried so far:
df['data_science_experience'] = df[["Regression","Classification","Clustering"]].values.max()
df['z'] = df[['Regression','Classification','Clustering']].apply(np.max, axis=1)
df['data_science_experience'] = df[["Regression","Classification","Clustering"]].apply(np.max, axis=1)
If you want to get the highest score of column 'hw1' you can get it with:
df['hw1'].max()
df['hw1'] gives you a Series of all the values in that column, and max returns the maximum. For the average use mean:
df['hw1'].mean()
If you want to find the maximum of multiple columns, you can use:
maximum_list = list()
for col in df.columns:
    maximum_list.append(df[col].max())
max_value = max(maximum_list)
avg_value = sum(maximum_list) / len(maximum_list)
Hope this helps.
First, you want to get only the rows with MSIS in the Program column. That can be done in the following way:
df[df['Program'] == 'MSIS']
Next, you want to get only the Regression, Classification and Clustering columns. The previous query filtered only rows; we can add to that, like this:
df.loc[df['Program'] == 'MSIS', ['Regression', 'Classification', 'Clustering']]
Now, for each remaining row, we want to take the maximum. That can be done by appending .max(axis=1) to the previous line (axis=1 because we want the maximum of each row, not each column).
At this point, we have a Series where each element is the highest of the three scores for one student. Now, all that's left to do is take the mean, which can be done with .mean(). The full code should therefore look like this:
df.loc[df['Program'] == 'MSIS', ['Regression', 'Classification', 'Clustering']].max(axis=1).mean()
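To sanity-check the one-liner, here is a sketch with a small hypothetical stand-in for cleaned_survey.csv (the scores are invented):
import pandas as pd

# Hypothetical stand-in for the survey data
df = pd.DataFrame({
    'Program':        ['MSIS', 'MSIS', 'MBA'],
    'Regression':     [3, 5, 4],
    'Classification': [4, 2, 5],
    'Clustering':     [2, 4, 1],
})

avg_experience = (df.loc[df['Program'] == 'MSIS',
                         ['Regression', 'Classification', 'Clustering']]
                    .max(axis=1)
                    .mean())
print(avg_experience)  # row maxima are 4 and 5, so (4 + 5) / 2 = 4.5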
I'm a beginner to R from a SAS background trying to do a basic "case when" match on two tables to get a flag where I have and have not found a match. Please see the SAS code I have in mind below. I just need something analogous to this in R. Thanks in advance.
proc sql;
  create table x as
  select
    a.*,
    b.*,
    case when a.first_column = b.column_first
          and a.second_column = b.column_second
         then 1 else 0 end as matched_flag
  from table1 as a
  left join table2 as b
    on a.first_column = b.column_first
   and a.second_column = b.column_second;
quit;
I'm not familiar with SAS, but I think I understand what you are trying to do. To see how many rows/columns are similar between two tables, you can use %in% and the length function.
For example, initialize two matrices of different dimensions and give them overlapping row names and column names:
mat.a <- matrix(1, nrow=3, ncol = 2)
mat.b <- matrix(1, nrow=2, ncol = 3)
rownames(mat.a) <- c('a','b','c')
rownames(mat.b) <- c('a','d')
colnames(mat.a) <- c('g','h')
colnames(mat.b) <- c('h','i')
mat.a and mat.b now exist with different row and column names. To match the rows by names, you can use:
row.match <- rownames(mat.a)[rownames(mat.a) %in% rownames(mat.b)]
num.row.match <- length(row.match)
Note that row.match can now be used to index into both of the matrices. The %in% operator returns a logical of the same length of the first argument (in this case, rownames(mat.a)) that indicates if the ith element of the first argument was found anywhere in the elements of the second argument. This nature of %in% means that you have to be sensitive to how you order the arguments for your indexing.
If you simply want to quantify how many rows or columns are the same between the two matrices, then you can use the sum function with the %in% operator:
sum(rownames(mat.a) %in% rownames(mat.b))
With the sum function used like this, you do not need to be sensitive to how you order the arguments, because the number of row names of mat.a in row names of mat.b is equivalent to the number of row names of mat.b in row names of mat.a. That is to say that this usage of %in% is commutative.
I hope this helps!
You will want to use dataframe objects. These are like datasets in SAS. You can use cbind to put two dataframe objects together side by side. Then you can select rows based on conditions and set the flag accordingly. In the code below you will see that I did this twice: once to set the 1 flag and once to set the 0 flag.
To select the rows where all fields match you can do something similar, but instead of assigning a new column you can assign all the results back to the name of the table you are working on.
Here's the code:
# make up example a and b data frames
table1 <- data.frame(a.first_column = c(1, 2, 3), a.second_column = c(4, 5, 6))
table2 <- data.frame(b.first_column = c(1, 3, 6), b.second_column = c(4, 5, 9))
# Combine columns (horizontally)
x <- cbind(table1, table2)
print("Combined Data Frames")
print(x)
# create matched flag (1 when both column pairs match, mirroring the SQL)
x$matched_flag[x$a.first_column == x$b.first_column & x$a.second_column == x$b.second_column] <- 1
x$matched_flag[!(x$a.first_column == x$b.first_column & x$a.second_column == x$b.second_column)] <- 0
# only select records that match both data frames
x <- x[x$a.first_column==x$b.first_column & x$a.second_column==x$b.second_column,]
print("Matched Data Frames")
print(x)
BTW: since you are used to using SQL, you might want to try the sqldf package in R. It will let you use the same techniques that you are used to but in R and on data frames.