How do I print specific rows that meet a condition? - pandas

I'm trying to find out: is there a way I could print the rows that meet the condition I've set? I'm currently using iterrows(), though I know it is not ideal; I have over 1000 rows of data to sift through and have not found any other way to iterate through my data.
Here's some mock data: https://i.stack.imgur.com/C1TlT.png
For example, I'm trying to find out whether the ±3SD age ranges of two people overlap (I did not calculate the ±3SD in the mock data, but I hope the idea is clear). Here's how I've coded it:
for i, row in df.iterrows():
    if row['last_name_x'] > row['last_name_y'] or row['last_name_x'] < row['last_name_y']:
And then I'm stuck. I want to collect the id_x and id_y of the rows that meet the condition above into a dataframe. The ideal output would be as follows:
   id_x  id_y
0  Vyel  Vyel
3  Vyel  Jinda
^ this is just an example of what I would want the dataframe to look like.
Do let me know if it's possible and how I can improve, thank you!

Use boolean indexing; your two inequalities combined are simply "not equal":
df[df['last_name_x'] != df['last_name_y']][['id_x', 'id_y']]
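
As a minimal runnable sketch with made-up stand-in data (the real values live in the question's screenshot, so everything below is hypothetical):

import pandas as pd

# Made-up stand-in data for the question's screenshot.
df = pd.DataFrame({
    "id_x":        ["Vyel", "Doe", "Ruz",   "Vyel"],
    "id_y":        ["Vyel", "Doe", "Smith", "Jinda"],
    "last_name_x": [34.1,   29.0,  51.2,    34.1],
    "last_name_y": [34.1,   29.0,  48.9,    40.3],
})

# Boolean mask: True wherever the two columns differ (x > y or x < y).
mask = df["last_name_x"] != df["last_name_y"]

# Select the matching rows, keep only the id columns.
print(df.loc[mask, ["id_x", "id_y"]])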

Pandas run function only on subset of whole Dataframe

Let's say I have a dataframe with 200 values: prices for products. I want to run some operation on this dataframe, like calculating the average price over the last 10 prices.
The way I understand it, pandas will currently go through every single row and calculate the average for each row, i.e. the first 9 rows will be NaN, then from rows 10 to 200 it would calculate the average for each row.
My issue is that I need to do a lot of these calculations and performance is a concern. For that reason, I want to run the average only on, say, the last 10 values (I don't need more) out of all the values, while keeping those values in the dataframe, i.e. I don't want to drop them or create a new dataframe.
I essentially just want to do the calculation on less data, so it is faster.
Is something like that possible? Hopefully the question is clear.
Building off Chicodelarose's answer, you can achieve this with more "pandas-like" syntax.
Defining your df as follows, we get 200 prices in the range [0, 1000):
import numpy as np
import pandas as pd

df = pd.DataFrame((np.random.rand(200) * 1000.).round(decimals=2), columns=["price"])
The bit you're looking for, though, would be the following:
def add10(n: float) -> float:
    """An exceptionally simple function to demonstrate that you can
    set values, too.
    """
    return n + 10

# Assign through a single .loc call so the update is guaranteed to land
# on the original frame (chained .iloc assignment may hit a copy).
df.loc[df.index[-12:], "price"] = df.loc[df.index[-12:], "price"].apply(add10)
Of course, you can also use these selections to compute something else without setting values.
>>> df["price"].iloc[-12:].mean().round(decimals=2)
309.63  # this will, of course, differ, since we're using random numbers
The primary justification for this approach is that it stays within pandas tooling. Say you want to operate over a subset of your data with multiple columns; you simply need to adjust your .apply(...) call to include an axis parameter: .apply(fn, axis=1).
This becomes much more readable the longer you spend in pandas. 🙂
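
For instance, here is a sketch of the row-wise variant; the two-column frame, the column names, and the spread function are all made up for illustration:

import numpy as np
import pandas as pd

# Hypothetical two-column frame of prices.
df = pd.DataFrame({
    "price_a": (np.random.rand(200) * 1000.).round(2),
    "price_b": (np.random.rand(200) * 1000.).round(2),
})

def spread(row: pd.Series) -> float:
    """Difference between the two prices in a single row."""
    return row["price_a"] - row["price_b"]

# axis=1 hands .apply() one row at a time; .iloc[-10:] limits the work
# to the last 10 rows only.
last_10_spread = df.iloc[-10:].apply(spread, axis=1)
print(last_10_spread)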
Given a dataframe like the following:
Price
0 197.45
1 59.30
2 131.63
3 127.22
4 35.22
.. ...
195 73.05
196 47.73
197 107.58
198 162.31
199 195.02
[200 rows x 1 columns]
Call the following to obtain the mean over the last n rows of the dataframe:
def mean_over_n_last_rows(df, n, colname):
    return df.iloc[-n:][colname].mean().round(decimals=2)

print(mean_over_n_last_rows(df, 2, "Price"))
Output:
178.67
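
For what it's worth, Series.tail(n) slices the same rows, so an equivalent one-liner (same df and "Price" column as above) would be:

print(df["Price"].tail(2).mean().round(decimals=2))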

How do I calculate the discrepancy percentage between two columns with Pandas?

Date        GoogleAnalytics_PVS   AdobeAnalytics_PVS
6-3-2020    4802                  4922
6-4-2020    5939                  5932
6-5-2020    5122                  5298
I have a table structured like the one above, showing the number of page views reported by two sources. Ideally, I would like another column that returns a discrepancy percentage.
Am I overthinking it or could I just do something like
df['Discrep_%'] = ((df['GoogleAnalytics_PVS'] - df['AdobeAnalytics_PVS']) / df['GoogleAnalytics_PVS']) * 100
Is there a better method? Please let me know, thanks!
Complexity-wise it's the same, but here is another way, using pandas' element-wise arithmetic methods; there can be multiple ways, and the one you are applying is fine too.
df['Discrep_%'] = df['GoogleAnalytics_PVS'].sub(df['AdobeAnalytics_PVS']).div(df['GoogleAnalytics_PVS']).mul(100)
df
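
A runnable sketch on the sample table above, using the asker's formula with the parentheses in place:

import pandas as pd

# The sample table from the question.
df = pd.DataFrame({
    "Date": ["6-3-2020", "6-4-2020", "6-5-2020"],
    "GoogleAnalytics_PVS": [4802, 5939, 5122],
    "AdobeAnalytics_PVS": [4922, 5932, 5298],
})

# Plain vectorized arithmetic; the parentheses around the subtraction matter.
df["Discrep_%"] = (
    (df["GoogleAnalytics_PVS"] - df["AdobeAnalytics_PVS"])
    / df["GoogleAnalytics_PVS"]
) * 100

print(df)  # a negative percentage means Adobe counted more views that day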

Building a new dataset

I want to take data from one set and enter it into another empty set.
So, for example, I want to do something like:
if (data[i,x] > 9) {
  new_data$House[y,x] <- data[i,2]
}
but I want to do it over and over, creating new rows in new_data.
How do I keep adding data to new_data and overriding/saving the new row?
Essentially, I just want to know how to "grow" an empty data set.
Please ignore any errors in the code; it is just an example and I am still working on other details.
Thanks
If you are using the R language, I presume you are looking for rbind:
new_data = NULL                 # define your new, empty dataset
for (i in 1:nrow(data)) {       # loop over the rows of data
  if (data[i, x] > 9) {         # condition on row i
    new_data = rbind(new_data, data[i, 2:6])  # append columns 2 to 6 of row i
  }
}
At the end, new_data will contain as many rows as satisfy the if statement, and each row will hold the values from columns 2 to 6.
If that is what you are looking for, there are various ways to do it without a for loop, for example:
new_data = data[data[,x] > 9, 2:6]
If this answer is not satisfying, please provide more details in your question and include a reproducible example of your data and the expected output.

How to check the highest score among specific columns and compute the average in pandas?

Help with homework problem: "Let us define the "data science experience" of a given person as the person's largest score among Regression, Classification, and Clustering. Compute the average data science experience among all MSIS students."
Beginner to coding. I am trying to figure out how to compare these columns to each other, take the largest value in each row, and then average those values.
I greatly appreciate your help in advance!
Picture of the sample data set: https://i.stack.imgur.com/9OSjz.png
Provided Code:
import pandas as pd
df = pd.read_csv("cleaned_survey.csv", index_col=0)
df.drop(['ProgSkills','Languages','Expert'],axis=1,inplace=True)
What I have tried so far:
df["data_science_experience"] = df[["Regression", "Classification", "Clustering"]].values.max()
df['z'] = df[['Regression', 'Classification', 'Clustering']].apply(np.max, axis=1)
df["data_science_experience"] = df[["Regression", "Classification", "Clustering"]].apply(np.max, axis=1)
If you want to get the highest score in column 'hw1', you can get it with:
df['hw1'].max()  # df['hw1'] is a Series of all the values in that column; max() returns its maximum
For the average, use mean():
df['hw1'].mean()
If you want to find the maximum of multiple columns, you can use:
maximum_list = list()
for col in df.columns:
    maximum_list.append(df[col].max())
highest = max(maximum_list)                      # built-in max over the per-column maxima
average = sum(maximum_list) / len(maximum_list)  # plain-Python mean of the per-column maxima
Hope this helps.
First, you want to get only the rows with MSIS in the Program column. That can be done in the following way:
df[df['Program'] == 'MSIS']
Next, you want to get only the Regression, Classification and Clustering columns. The previous query filtered only rows; we can add to that, like this:
df.loc[df['Program'] == 'MSIS', ['Regression', 'Classification', 'Clustering']]
Now, for each row remaining, we want to take the maximum. That can be done by appending .max(axis=1) to the previous line (axis=1 because we want the maximum of each row, not each column).
At this point, we should have a DataFrame where each row represents the highest score of the three categories for each student. Now, all that's left to do is take the mean, which can be done with .mean(). The full code should therefore look like this:
df.loc[df['Program'] == 'MSIS', ['Regression', 'Classification', 'Clustering']].max(axis=1).mean()
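
As a quick check, here is the same line run on a tiny made-up stand-in for cleaned_survey.csv (all values below are invented):

import pandas as pd

# Tiny hypothetical stand-in for cleaned_survey.csv.
df = pd.DataFrame({
    "Program":        ["MSIS", "MSIS", "MBA"],
    "Regression":     [3, 5, 4],
    "Classification": [4, 2, 5],
    "Clustering":     [2, 4, 1],
})

per_student_max = df.loc[
    df["Program"] == "MSIS", ["Regression", "Classification", "Clustering"]
].max(axis=1)                  # best of the three scores per MSIS student

print(per_student_max.mean())  # (4 + 5) / 2 = 4.5 for this toy data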

Dataframe non-null values differ from value_counts() values

There is an inconsistency with dataframes that I can't explain. In the following, I'm not looking for a workaround (I already found one) but for an explanation of what is going on under the hood and how it explains the output.
One of my colleagues, whom I talked into using Python and pandas, has a dataframe "data" with 12,000 rows.
"data" has a column "length" that contains numbers from 0 to 20. She wants to divide the dataframe into groups by length range: 0 to 9 in group 1, 10 to 14 in group 2, 15 and more in group 3. Her solution was to add another column, "group", and fill it with the appropriate values. She wrote the following code:
data['group'] = np.nan
mask = data['length'] < 10
data['group'][mask] = 1
mask2 = (data['length'] > 9) & (data['length'] < 15)
data['group'][mask2] = 2
mask3 = data['length'] > 14
data['group'][mask3] = 3
This code is not good, of course. The reason is that you don't know at run time whether data['group'][mask3], for example, will be a view (and thus actually change the dataframe) or a copy (and thus leave the dataframe unchanged). It took me quite some time to explain this to her, since she argued, reasonably, that she was doing an assignment, not a selection, so the operation should always write through to the original dataframe.
But that was not the strange part. The part that even I couldn't understand is this:
After performing this set of operations, we verified that the assignment took place, in two different ways:
1. By typing data in the console and examining the dataframe summary. It told us we had a few thousand null values. The number of null values was the same as the size of mask3, so we assumed the last assignment was made on a copy and not on a view.
2. By typing data.group.value_counts(). That returned 3 values: 1, 2 and 3 (surprise). We then typed data.group.value_counts().sum() and it summed up to 12,000!
So by method 2, the group column contained no null values and held all the values we wanted. But by method 1, it didn't!
Can anyone explain this?
See the pandas docs on returning a view versus a copy.
You don't want to set values this way, for exactly the reason you pointed out: since you don't know whether it's a view, you don't know whether you are actually changing the underlying data. pandas 0.13 will raise/warn when you attempt to do this, but it's easiest/best to just access it like:
data.loc[mask3,'group'] = 3
which guarantees that the assignment is done in place on the original dataframe.
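
As a minimal sketch of the whole grouping done with .loc (the frame here is made up to match the question's description):

import numpy as np
import pandas as pd

# Made-up frame with the same shape as in the question.
data = pd.DataFrame({"length": np.random.randint(0, 21, size=12000)})

data["group"] = np.nan
data.loc[data["length"] < 10, "group"] = 1
data.loc[(data["length"] > 9) & (data["length"] < 15), "group"] = 2
data.loc[data["length"] > 14, "group"] = 3

# Both verification methods now agree: no NaNs remain and the counts cover every row.
print(data["group"].isna().sum())           # 0
print(data["group"].value_counts().sum())   # 12000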