Pandas run function only on subset of whole Dataframe - pandas

Lets say i have Dataframe, which has 200 values, prices for products. I want to run some operation on this dataframe, like calculate average price for last 10 prices.
The way i understand it, right now pandas will go through every single row and calculate average for each row. Ie first 9 rows will be Nan, then from 10-200, it would calculate average for each row.
My issue is that i need to do a lot of these calculations and performance is an issue. For that reason, i would want to run the average only on say on last 10 values (dont need more) from all values, while i want to keep those values in the dataframe. Ie i dont want to get rid of those values or create new Dataframe.
I just essentially want to do calculation on less data, so it is faster.
Is something like that possible? Hopefully the question is clear.

Building off Chicodelarose's answer, you can achieve this in a more "pandas-like" syntax.
Defining your df as follows, we get 200 prices up to within [0, 1000).
df = pd.DataFrame((np.random.rand(200) * 1000.).round(decimals=2), columns=["price"])
The bit you're looking for, though, would the following:
def add10(n: float) -> float:
"""An exceptionally simple function to demonstrate you can set
values, too.
"""
return n + 10
df["price"].iloc[-12:] = df["price"].iloc[-12:].apply(add10)
Of course, you can also use these selections to return something else without setting values, too.
>>> df["price"].iloc[-12:].mean().round(decimals=2)
309.63 # this will, of course, be different as we're using random numbers
The primary justification for this approach lies in the use of pandas tooling. Say you want to operate over a subset of your data with multiple columns, you simply need to adjust your .apply(...) to contain an axis parameter, as follows: .apply(fn, axis=1).
This becomes much more readable the longer you spend in pandas. 🙂

Given a dataframe like the following:
Price
0 197.45
1 59.30
2 131.63
3 127.22
4 35.22
.. ...
195 73.05
196 47.73
197 107.58
198 162.31
199 195.02
[200 rows x 1 columns]
Call the following to obtain the mean over the last n rows of the dataframe:
def mean_over_n_last_rows(df, n, colname):
return df.iloc[-n:][colname].mean().round(decimals=2)
print(mean_over_n_last_rows(df, 2, "Price"))
Output:
178.67

Related

how to add a character to every value in a dataframe without losing the 2d structure

Today my problem is this: I have a dataframe of 300 X 41. Its encoded with numbers. I want to append an 'a' to each value in the dataframe so that another down stream program will not fuss about these being 'continuous variables' which they arent, they are factors. Simple right?
Every way I can think to do this though returns a dataframe or object that is not 300x 41...but just one long list of altered values:
Please end this headache for me. How can I do this in a way that returns a 400 X 31 altered output?
> dim(x)
[1] 300 41
>x2 <- sub("^","a",x)
>dim(x2)
[1] 12300 1

Finding the mean of a column; but excluding a singular value

Imagine I have a dataset that is like so:
ID birthyear weight
0 619040 1962 0.1231231
1 600161 1963 0.981742
2 25602033 1963 1.3123124
3 624870 1987 10,000
and I want to get the mean of the column weight, but the obvious 10,000 is hindering the actual mean. In this situation I cannot change the value but must work around it, this is what I've got so far, but obviously it's including that last value.
avg_num_items = df_cleaned['trans_quantity'].mean()
translist = df_cleaned['trans_quantity'].tolist()
my dataframe is df_cleaned and the column I'm actually working with is 'trans_quantity' so how do I go about the mean while working around that value?
Since you added SQL in your tags, In SQL you'd want to exclude it in the WHERE clause:
SELECT AVG(trans_quantity)
FROM your_data_base
WHERE trans_quantity <> 10,000
In Pandas:
avg_num_items = df_cleaned[df_cleaned["trans_quantity"] != 10000]["trans_quantity"].mean()
You can also replace your value with a NAN and skip it in the mean:
avg_num_items = df_cleaned["trans_quantity"].replace(10000, np.nan).mean(skipna=True)
With pandas, ensure you have numeric data (10,000 is a string), filter the values above threshold and use the mean:
(pd.to_numeric(df['weight'], errors='coerce')
.loc[lambda x: x<10000]
.mean()
)
output: 0.8057258333333334

Encode all data in one column and assign the same code if data has a same value

I have a dataframe which has appr. 100 columns and 20000 rows. Now I want to encode one categorical column so that it will have numerical code. After checking its value counts, the result shows something like this:
df['name'].value_counts()
aaa 650
baa 350
cad 50
dae 10
ef3 1
....
The total unique values are about 3300. So I might have a code range from 1 to 3300. I will
normalize the numerical code before train it. As I have already many columns in the dataset, I prefer not using one hot encoding method. So how can I do it? Thank you!
You can enumerate each group using ngroup(). It would look something like:
df.assign(num_code=lambda x: x.groupby(['name']).ngroup())
I don't know what kind of information the column contains, however I am not sure it makes sense to assign an incremental numerical code to a column that seems to be categorical for training models.

groupby 2 columns and count into separate columns based on one columns cases

I'm trying to group by 2 columns of which the first value has 5 different values and the second 2.
My data looks like this:
and using
df_counted = df_analysis
.groupby(['TYPE', 'RESULT'])
.size()
.sort_values(ascending=False)
.reset_index(name='COUNT')
I was able to transform it into the cases I want:
However I don't want a column for result, just for counts.
It's suppoed to be like
COUNT_TRUE COUNT_FALSE
FORWARD 21 182
BACKWARD 34 170
RIGHT 24 298
LEFT 20 242
NEUTRAL 16 82
The best I could do there was this. How do I get there?
Pandas has a feature of making a pivot table with dataframe. Your task can also be done by making pivot table.
df_counted.pivot_table(index="TYPE", columns="RESULT", values="COUNT")
Result:
Solved it and went a kind of full SQL there. It's not elegant, but it works:
df_counted is the last df from the question with the NaN values.
# drop duplicates for the first counts
df_pos = df_counted.drop_duplicates(subset=['TYPE'], keep='first').drop(columns=['COUNT_POS'])
# drop duplicates for the first counts
df_neg = df_counted.drop_duplicates(subset=['TYPE'], keep='last').drop(columns=['COUNT_NEG'])
# join on TYPE
df = df_pos.set_index('TYPE').join(df_neg.set_index('TYPE'))
If someone has a more elegant way of doing this, I'd be super interested to see it.

Need explanation on how pandas.drop is working here

I have a data frame, lets say xyz. I have written code to find out the % of null values each column possess in the dataframe. my code below:
round(100*(xyz.isnull().sum()/len(xyz.index)), 2)
let say i got following results:
abc 26.63
def 36.58
ghi 78.46
I want to drop column ghi because it has more than 70% of null values.
I achieved it using the following code:
xyz = xyz.drop(xyz.loc[:,round(100*(xyz.isnull().sum()/len(xyz.index)), 2)>70].columns, 1)
but , i did not understand how does this code works, can anyone please explain it?
the code is doing the following:
xyz.drop( [...], 1)
removes the specified elements for a given axis, either by row or by column. In this particular case, df.drop( ..., 1) means you're dropping by axis 1, i.e, column
xyz.loc[:, ... ].columns
will return a list with the column names resulting from your slicing condition
round(100*(xyz.isnull().sum()/len(xyz.index)), 2)>70
this instruction is counting the number of nulls, adding them up and normalizing by the number of rows, effectively computing the percentage of nan in each column. Then, the amount is rounded to have only 2 decimal positions and finally you return True is the number of nan is more than 70%. Hence, you get a mapping between columns and a True/False array.
Putting everything together: you're first producing a Boolean array that marks which columns have more than 70% nan, then, using .loc you use Boolean indexing to look only at the columns you want to drop ( nan % > 70%), then using .columns you recover the name of such columns, which then are used by the .drop instruction.
Hopefully this clear things up!
If you code is hard to understand , you can just check dropna with thresh, since pandas already cover this case.
df=df.dropna(axis=1,thresh=round(len(df)*0.3))