Compute moving average on a dynamic rolling window - pandas

OK, need some help here! I have the following dataframe.
df2 = {'Value': [123, 126, 120, 121, 123, 126, 120, 121, 123, 126],
'Look-back': [2, 3, 4, 5, 3, 6, 2, 4, 2, 1]}
df2 = pd.DataFrame(df2)
df2
I'd like to add a third row that shows the simple moving average of the 'Value' column with the rolling look-back period of the 'Look-back' column. My thought was to do this.
df2['Average'] = df2['Value'].rolling(df['Look-back']).mean()
Of course this doesn't work because the rolling() function needs an integer key value and I'm supplying a series.
How do I get what I'm after here?

Related

Pandas columns by given value in last row

Below my dataframe "df" made of 34 columns (pairs of stocks) and 530 rows (their respective cumulative returns). 'Date' is the index
Now, my target is to consider last row (Date=3 Febraury 2021). I want to plot ONLY those columns (pair stocks) that have a positive return on last Date.
I started with:
n=list()
for i in range(len(df.columns)):
if df.iloc[-1,i] >0:
n.append(i)
Output: [3, 11, 12, 22, 23, 25, 27, 28, 30]
Now, final step is to create a subset dataframe of 'df' containing only columns belonging to those numbers in this list. This is where I have problems. Have you any idea? Thanks
Does this solve your problem?
n = []
for i, col in enumerate(df.columns):
if df.iloc[-1,i] > 0:
n.append(col)
df[n]
Here you are ;)
sample df:
a b c
date
2017-04-01 0.5 -0.7 -0.6
2017-04-02 1.0 1.0 1.3
df1.loc[df1.index.astype(str) == '2017-04-02', df1.ge(1.2).any()]
c
date
2017-04-02 1.3
the logic will be same for your case also.
If I understand correctly, you want columns with IDs [3, 11, 12, 22, 23, 25, 27, 28, 30], am I right?
You should use DataFrame.iloc:
column_ids = [3, 11, 12, 22, 23, 25, 27, 28, 30]
df_subset = df.iloc[:, column_ids].copy()
The ":" on the left side of df.iloc means "all rows". I suggest using copy method in case you want to perform additional operations on df_subset without the risk to affect the original df, or raising Warnings.
If instead of a list of column IDs, you have a list of column names, you should just replace .iloc with .loc.

How eliminate rows in a dataframe with common values with other dataframe? Rstudio

I need to eliminate the rows in a dataframe that have at a column common values with the same column in a second dataframe
The columns the code have to take into account contain IDs of subjetcs, while the rest contain data refering to those subjects.
Example of dataframes (Rstudio)
df1<-data.frame(ID=c(13, 16, 25, 36, 25, 17, 50, 63, 61, 34, 65, 17), AnyData=round(runif(12, 1, 5)))
df2<-data.frame(ID=c(89, 57, 13, 17, 18, 21, 51, 50, 72, 84), AnyData=round(runif(10, 1, 5)))
I have tried two functions
df1<- filter(df1, ID!=df2[ID])
df1<- df1[-c(which(df1[ID]==df2[ID]))]
The result should be:
df1 <- data.frame(ID=c(16, 25, 36, 25, 63, 61, 34, 656), AnyData=(...)
AnyData depends on the values asigned with ruinf, so it will vary, but the value must be the same as in the original df1.
What you need is an anti_join():
library(dplyr)
df1 %>%
anti_join(df2, by = "ID")

SQL. BigQuery. Find values from an array corresponding to another column

I have a table with columns Date, UserID, EventID, Value, RangeOfValues.
The task is two determine a pair of values from the last columns between which is the value from the 4th columns. So, for example, if the user on the screenshot has value 326 in Value clmn it will be between 200 and 1000. I have a lot of users and need to extract such pairs for each of them. Can do this in python but have no ideas how to do this in bigquery (or even if it's possible).
Any advice would be appreciated!
The table looks like this
Yes, this is easily achievable using UNNEST() to turn the array into rows and then run simple sub-queries on them:
WITH test as (
SELECT * FROM UNNEST([
STRUCT(4 as value, [1, 3, 5, 7, 9, 100, 150, 40] as rangeOfValues)
,(15, [1, 3, 5, 7, 9, 100, 150, 40])
,(50, [1, 3, 5, 7, 9, 100, 150, 40])
,(160, [1, 3, 5, 7, 9, 100, 150, 40])
])
)
SELECT
value,
(SELECT MAX(r) FROM UNNEST(rangeOfValues) r WHERE r<value ) nextLowest,
(SELECT MIN(r) FROM UNNEST(rangeOfValues) r WHERE r>value ) nextBiggest
FROM test

Percentile of every column and row in a dataframe

I have a csv that looks like the image below. I want to calculate the percentile(10,50,90) of each row starting from B2 to X2 and adding that final percentile in a new column. Essentially, I want to find the 10th percetile of the average(std, cv, sp_tim.....) value over the entire period of record available.
I have created the following code line to read it in python as a dataframe format so far.
da = pd.read_csv('Project/11433300_annual_flow_matrix.csv', index_col=0, parse_dates=True)
If I have understood your question correctly then below code might be helpful for you:
I have Used some Dummy data, and given similar kind of treatment on it which you are looking for
aq = [1, 2, 2, 3, 3, 4, 4, 5, 7, 8, 10, 11]
aw = [91, 25, 13, 53, 95, 94, 75, 35, 57, 88, 111, 12]
df = pd.DataFrame({'aq': aq, 'aw': aw})
n = df.shape[0]
p = 0.1 #for 10th percentile
position = np.ceil(n*p)
position = int(position)
df.iloc[position,]
Kindly have a look and let me know if this is works for you.

matplotlib plot data with nans

I'm surprised how few are the posts relating to this problem. Anyway...
here it is:
I have csv data files containing X values in the first column, and several Y values columns thereafter. But for a given X value not all Y series have a corresponding value. Here is an example:
0, 16, 96, 99
10, 88, 45, 85
20, 85, 61, 10
30, 30, --, 45
40, 82, 28, 82
50, 23, 9, 61
60, 40, 77, 0
70, 26, 21, --
80, --, 58, 99
90, 1, 14, 30
when this csv data is loaded with numpy.genfromtxt, the '--' strings are taken as nan which is good. But when plotting, the plots are interrupted with blanks where there is a nan. Is there an option when a nan appears to make pyplot.plot() ignoring both the nan and the corresponding X value?
Not sure if matplotlib has such functionality built in, but you could home-brew it doing the following:
idx = ~numpy.isnan(Y)
pyplot.plot(X[idx], Y[idx])
Look at this post
As proposed in my answer there, I'd recommend using np.isfinite instead of np.isnan. There might be other reasons for your plot to have discontinuities, e.f., inf