How do I calculate the discrepancy percentage between two columns with Pandas?

Date       GoogleAnalytics_PVS  AdobeAnalytics_PVS
6-3-2020   4802                 4922
6-4-2020   5939                 5932
6-5-2020   5122                 5298
I have a table, structured like the one above, that returns the number of page views from two sources. Ideally, I would like another column that would return a discrepancy percentage.
Am I overthinking it or could I just do something like
df['Discrep_%'] = (df['GoogleAnalytics_PVS'] - df['AdobeAnalytics_PVS']) / df['GoogleAnalytics_PVS'] * 100
Is there a better method, please let me know, thanks!

Complexity-wise it's the same, but here is another way; there are multiple ways to do this, and the one you are applying is fine too.
df_new = df['GoogleAnalytics_PVS'].sub(df['AdobeAnalytics_PVS']).div(df['GoogleAnalytics_PVS']).mul(100)
df_new
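For the sample table above, either version should give roughly the following (values rounded to two decimal places here):
Date       GoogleAnalytics_PVS  AdobeAnalytics_PVS  Discrep_%
6-3-2020   4802                 4922                -2.50
6-4-2020   5939                 5932                 0.12
6-5-2020   5122                 5298                -3.44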

Related

How do I print specific rows that meet the conditions?

Is there a way I could print the rows that meet the condition I've set? I'm currently using iterrows() even though I know it is not ideal; I have over 1000 rows of data to sift through and have not found any other way to iterate through my data.
Here's a mock data screenshot: https://i.stack.imgur.com/C1TlT.png
For example, I'm trying to find out whether the ±3 SD age ranges of two people overlap (I did not calculate the ±3 SD in the mock data, but I hope the idea is clear). Here's how I coded it:
for i, row in df.iterrows():
    if row['last_name_x'] > row['last_name_y'] or row['last_name_x'] < row['last_name_y']:
And then I'm stuck. I want to collect the id_x and id_y of the rows that meet the condition above into a dataframe. The ideal output would be as follows:
   id_x  id_y
0  Vyel  Vyel
3  Vyel  Jinda
^ this is just an example of the dataframe I would want it to look like.
Do let me know if it's possible and how I can improve, thank you!
Use:
df[df['last_name_x'] != df['last_name_y']][['id_x', 'id_y']]
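Equivalently, a single .loc call avoids the chained indexing and reads a little cleaner:
df.loc[df['last_name_x'] != df['last_name_y'], ['id_x', 'id_y']]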

Pandas value_counts() with percentage

I was experimenting with the kaggle.com Titanic data set (data on every person on the Titanic) and came up with a gender breakdown like this:
df = pd.DataFrame({'sex': ['male'] * 577 + ['female'] * 314})
gender = df.sex.value_counts()
gender
male 577
female 314
I would like to find out the percentage of each gender on the Titanic.
My approach is slightly less than ideal:
from __future__ import division
pcts = gender / gender.sum()
pcts
male 0.647587
female 0.352413
Is there a better (more idiomatic) way?
This is already implemented in pandas, right in value_counts(). No need to calculate it yourself :)
just type:
df.sex.value_counts(normalize=True)
which gives exactly the desired output.
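For the Titanic data above, that returns:
male      0.647587
female    0.352413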
Please note that value_counts() excludes NA values, so numbers might not add up to 1.
See here: http://pandas-docs.github.io/pandas-docs-travis/generated/pandas.Series.value_counts.html
(A column of a DataFrame is a Series)
If you wish to show percentages, one option is to use value_counts(normalize=True), as answered by @fanfabbb.
That said, for many purposes you might want to show it as a percentage out of a hundred.
That can be achieved like so:
gender = df.sex.value_counts(normalize=True).mul(100).round(1).astype(str) + '%'
In this case, we multiply the results by hundred, round it to one decimal point and add the percentage sign.
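With the Titanic counts, that should produce something like:
male      64.8%
female    35.2%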
If you want to merge counts with percentages, you can use:
c = df.sex.value_counts(dropna=False)
p = df.sex.value_counts(dropna=False, normalize=True)
pd.concat([c,p], axis=1, keys=['counts', '%'])
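For the same data, this gives something like:
        counts         %
male       577  0.647587
female     314  0.352413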
I think I would probably do this in one go (without importing division):
1. * df.sex.value_counts() / len(df.sex)
or perhaps, remembering you want a percentage:
100. * df.sex.value_counts() / len(df.sex)
Much of a muchness really, your way looks fine too.

Pandas run function only on subset of whole Dataframe

Let's say I have a DataFrame with 200 values: prices for products. I want to run some operation on this DataFrame, like calculating the average price over the last 10 prices.
The way I understand it, right now pandas will go through every single row and calculate an average for each row, i.e. the first 9 rows will be NaN, and then from rows 10-200 it will calculate an average for each row.
My issue is that I need to do a lot of these calculations, and performance matters. For that reason, I want to run the average only on, say, the last 10 values (I don't need more), while keeping all the values in the DataFrame, i.e. I don't want to get rid of them or create a new DataFrame.
I essentially want to do the calculation on less data so it is faster.
Is something like that possible? Hopefully the question is clear.
Building off Chicodelarose's answer, you can achieve this in a more "pandas-like" syntax.
Defining your df as follows, we get 200 prices in the range [0, 1000).
import numpy as np
import pandas as pd

df = pd.DataFrame((np.random.rand(200) * 1000.).round(decimals=2), columns=["price"])
The bit you're looking for, though, would be the following:
def add10(n: float) -> float:
    """An exceptionally simple function to demonstrate you can set
    values, too.
    """
    return n + 10

df["price"].iloc[-12:] = df["price"].iloc[-12:].apply(add10)
Of course, you can also use these selections to return something else without setting values.
>>> df["price"].iloc[-12:].mean().round(decimals=2)
309.63 # this will, of course, be different as we're using random numbers
The primary justification for this approach is that it stays within pandas tooling. If you want to operate over a subset of your data across multiple columns, you simply add an axis parameter to your .apply(...) call: .apply(fn, axis=1).
This becomes much more readable the longer you spend in pandas. 🙂
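As a minimal sketch of that multi-column case (the "low"/"high" column names are made up for illustration, not from the question):
df2 = pd.DataFrame({"low": [1.0, 2.0, 3.0], "high": [2.0, 4.0, 6.0]})
# a row-wise function applied only to the last 2 rows
df2.iloc[-2:].apply(lambda row: row["high"] - row["low"], axis=1)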
Given a dataframe like the following:
Price
0 197.45
1 59.30
2 131.63
3 127.22
4 35.22
.. ...
195 73.05
196 47.73
197 107.58
198 162.31
199 195.02
[200 rows x 1 columns]
Call the following to obtain the mean over the last n rows of the dataframe:
def mean_over_n_last_rows(df, n, colname):
    return df.iloc[-n:][colname].mean().round(decimals=2)

print(mean_over_n_last_rows(df, 2, "Price"))
Output:
178.67
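An equivalent spelling, if you prefer it, uses tail():
df["Price"].tail(10).mean()
Either way, only the last 10 rows are touched, so the cost of the calculation does not grow with the size of the frame.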

groupby 2 columns and count into separate columns based on one column's values

I'm trying to group by 2 columns, of which the first has 5 distinct values and the second has 2.
My data looks like this:
and using
df_counted = (
    df_analysis
    .groupby(['TYPE', 'RESULT'])
    .size()
    .sort_values(ascending=False)
    .reset_index(name='COUNT')
)
I was able to transform it into the shape I want. However, I don't want a column for RESULT, just columns for the counts. It's supposed to look like:
          COUNT_TRUE  COUNT_FALSE
FORWARD           21          182
BACKWARD          34          170
RIGHT             24          298
LEFT              20          242
NEUTRAL           16           82
The best I could do there was this. How do I get there?
Pandas can build a pivot table directly from a DataFrame, so your task can also be done that way:
df_counted.pivot_table(index="TYPE", columns="RESULT", values="COUNT")
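If you also want the exact COUNT_TRUE/COUNT_FALSE column names from the desired output, a small extension of the same idea (a sketch, assuming RESULT holds booleans) is:
out = (
    df_counted
    .pivot_table(index="TYPE", columns="RESULT", values="COUNT", fill_value=0)
    .rename(columns={True: "COUNT_TRUE", False: "COUNT_FALSE"})
)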
Solved it by going kind of full SQL. It's not elegant, but it works (df_counted here is the last df from the question, with the NaN values):
# drop duplicates, keeping the first count for each TYPE
df_pos = df_counted.drop_duplicates(subset=['TYPE'], keep='first').drop(columns=['COUNT_POS'])
# drop duplicates, keeping the last count for each TYPE
df_neg = df_counted.drop_duplicates(subset=['TYPE'], keep='last').drop(columns=['COUNT_NEG'])
# join on TYPE
df = df_pos.set_index('TYPE').join(df_neg.set_index('TYPE'))
If someone has a more elegant way of doing this, I'd be super interested to see it.

Performing calculations on multiple columns in dataframe and create new columns

I'm trying to perform calculations based on the entries in a pandas dataframe. The dataframe looks something like this:
and it contains 1466 rows. I'll have to run similar calculations on other dfs with more rows later.
What I'm trying to do is calculate something like mag = (U-V)/(R-I) (but ignoring any values that are -999), put that in a new column, and then put z_pred = 10**((mag - c)/m) in another new column (mag, c and m are just hard-coded variables). I have other columns I need to add too, but I figure that'll just be an extension of the same method.
I started out by trying
for i in range(1):
    current = qso[:]
    mag = (U-V)/(R-I)
    name = current['NED']
    z_pred = 10**((mag - c)/m)
    z_meas = current['z']
but I got either a Series for z, which I couldn't operate on, or various type errors when I tried to print the values or write them to a file.
I found this question which gave me a start, but I can't see how to apply it to multiple calculations, as in my situation.
How can I achieve this?
Conditionally adding calculated columns row-wise is usually done with numpy's np.where:
import numpy as np

df['mag'] = np.where(~df[['U', 'V', 'R', 'I']].eq(-999).any(axis=1), (df.U - df.V) / (df.R - df.I), -999)
Note: this assumes that when any of the four columns contains -999, the row is not calculated and -999 is returned instead.
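The z_pred column then follows the same pattern (a sketch, assuming c and m are the question's hard-coded variables):
df['z_pred'] = np.where(df['mag'].ne(-999), 10**((df['mag'] - c) / m), -999)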