I have a dataframe for a crypto coin in the following format:
time open high low close volume TM
0 1618617600000 61342.7 61730.9 61268.7 61648.8 82.523952 5
1 1618618500000 61648.9 61695.3 61188.4 61333.2 72.375605 5
2 1618619400000 61333.1 61396.4 61144.2 61200.0 52.882392 5
3 1618620300000 61200.0 61509.4 61199.9 61446.2 48.429485 5
4 1618621200000 61446.2 61764.7 61446.2 61647.4 83.822974 5
... ... ... ... ... ... ... ..
19213 1635909300000 63006.2 63087.2 62935.0 63081.9 35.265568 26
19214 1635910200000 63081.9 63214.5 62950.1 63084.0 41.213263 30
19215 1635911100000 63084.0 63236.0 63027.6 63213.9 32.429295 21
19216 1635912000000 63213.8 63213.8 63021.5 63024.1 47.032509 19
19217 1635912900000 63024.1 63091.4 62852.1 62970.7 84.098123 16
I want to calculate the moving average of the close price with a varying time period; the time period comes from the TM column. I will use the talib/ta library. Efficiency is necessary, so I tried apply and np.where:
dataframe['DMA'] = dataframe.apply(lambda x: ta.MA(dataframe['close'], timeperiod=dataframe['TM']), axis=0)
and
dataframe['DMA'] = np.where(dataframe['TM'].values , ta.MA(dataframe['close'], timeperiod=dataframe['TM'].values), )
both return error:
TypeError: only size-1 arrays can be converted to Python scalars
which I believe comes from the timeperiod=dataframe['TM'].values part. If I use dataframe['TM'].values[0], only the first value (5) is applied to every row. How can I access the scalar value of each TM cell in a vectorized way, without iterating over the index or using a for loop?
My desired output:
The output dataframe has one more column at the end, named DMA, and the last 3 rows should look like:
............... DMA
19215 ..... ta.MA(dataframe['close'], timeperiod = 21)
19216 ..... ta.MA(dataframe['close'], timeperiod = 19)
19217 ..... ta.MA(dataframe['close'], timeperiod = 16)
At index 19215 I want to calculate the moving average of the last 21 close prices,
at index 19216 the moving average of the last 19 close prices,
and at index 19217 the moving average of the last 16 close prices.
Appreciate your time.
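One way to get this result without looping over every row is to call ta.MA once per distinct TM value and keep each result only for the rows whose TM matches. This is just a sketch of that idea, assuming ta is TA-Lib and dataframe is the frame shown above:

import numpy as np
import talib as ta

close = dataframe['close'].astype(float).values
tm = dataframe['TM'].values
dma = np.full(len(dataframe), np.nan)

# one ta.MA call per distinct TM value instead of one call per row
for period in np.unique(tm):
    ma = ta.MA(close, timeperiod=int(period))  # MA over the last `period` closes
    mask = tm == period
    dma[mask] = ma[mask]                       # keep it only where TM matches

dataframe['DMA'] = dma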
Related
I have a dataframe with 4 variables DIVISION, QTR, MODEL_SCORE, MONTH with the sum of variable X aggregated by those 4.
I would like to effectively partition the data by DIVISION, QTR, and MODEL_SCORE and keep a running total ordered by the MONTH field from smallest to largest. The idea is that the running total resets whenever it reaches a new combination of the other 3 columns.
df = df.groupby(['DIVISION','MODEL','QTR','MONTHS'])['X'].sum()
I'm trying
df['cumsum'] = df.groupby(level=3)['X'].cumsum()
having tried every number I can think of for the level argument. It seems to work every way except the one I want.
EDIT: I know the below isn't formatted ideally, but basically, as long as the only variable changing is MONTH, the cumulative sum should continue; a change in any other variable should cause it to reset.
DIVISION QTR MODEL MONTHS X CUMSUM
A 1 1 1 10 10
A 1 1 2 20 30
A 1 2 1 5 5
I'm sorry for all the trouble; I believe the answer was far simpler than I was making it out to be.
After
df = df.groupby(['DIVISION','MODEL','QTR','MONTHS'])['X'].sum()
I was supposed to reset the index, since I did not want a MultiIndex, and this appears to have worked:
df = df.reset_index()
df['cumsum'] = df.groupby(['DIVISION','MODEL','QTR'])['X'].cumsum()
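For illustration, a minimal sketch of that flow on the three example rows above (the sample data is reconstructed from the table, so treat it as illustrative):

import pandas as pd

df = pd.DataFrame({'DIVISION': ['A', 'A', 'A'],
                   'QTR':      [1, 1, 1],
                   'MODEL':    [1, 1, 2],
                   'MONTHS':   [1, 2, 1],
                   'X':        [10, 20, 5]})

df = df.groupby(['DIVISION', 'MODEL', 'QTR', 'MONTHS'])['X'].sum()
df = df.reset_index()  # flatten the MultiIndex back into columns
df['cumsum'] = df.groupby(['DIVISION', 'MODEL', 'QTR'])['X'].cumsum()
# cumsum is 10, 30, 5 -- it resets when MODEL changes from 1 to 2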
Is there a way to use numpy to add numbers in a series up to a threshold and then restart the counter? The intention is to form groups for a groupby based on the categories created.
amount price
0 27 22.372505
1 17 126.562276
2 33 101.061767
3 78 152.076373
4 15 103.482099
5 96 41.662766
6 108 98.460743
7 143 126.125865
8 82 87.749286
9 70 56.065133
The only solutions I found iterate with .loc, which is slow. I tried building a solution based on this answer https://stackoverflow.com/a/56904899:
sumvals = np.frompyfunc(lambda a,b: a+b if a <= 100 else b,2,1)
df['cumvals'] = sumvals.accumulate(df['amount'], dtype=object)
The use-case is to find the average price of every 75 sold amounts of the thing.
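For reference, here is a plain-Python sketch of the "accumulate until a threshold, then reset" grouping described above. It is not the vectorized numpy approach the question asks for, only an illustration of the intended semantics, and the names are made up:

def make_groups(amounts, threshold=75):
    # Increment the group id each time the running total reaches the threshold,
    # then reset the running total for the next group.
    groups, total, gid = [], 0, 0
    for a in amounts:
        total += a
        groups.append(gid)
        if total >= threshold:
            gid += 1
            total = 0
    return groups

df['group'] = make_groups(df['amount'])
df.groupby('group')['price'].mean()  # average price per ~75 sold amounts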
Solution #1: Interpreting "The use-case is to find the average price of every 75 sold amounts of the thing" one way gives my solution below. If you are trying to do this calculation the "hard way" instead of using pd.cut, here is a solution that works well, but its speed and memory use depend on the cumulative sum of the amount column, which you can check with df['amount'].cumsum(). The output takes about 1 second per 10 million of that cumsum, because that is how many rows np.repeat creates. So this solution is fine if the cumsum is under ~10 million (about 1 second) or even 100 million (about 10 seconds):
i = 75
df = np.repeat(df['price'], df['amount']).to_frame().reset_index(drop=True)
g = df.index // i
df = df.groupby(g)['price'].mean()
df.index = (df.index * i).astype(str) + '-' + (df.index * i +75).astype(str)
df
Out[1]:
0-75 78.513748
75-150 150.715984
150-225 61.387540
225-300 67.411182
300-375 98.829611
375-450 126.125865
450-525 122.032363
525-600 87.326831
600-675 56.065133
Name: price, dtype: float64
Solution #2 (I believe this is wrong but keeping just in case)
I do not believe you are trying to do it this way, which was my initial solution, but I will keep it here just in case, as you haven't included expected output. You can create a new series with cumsum and then use pd.cut, passing bins=np.arange(0, df['Group'].max(), 75) to create groups of cumulative 75. Then group by those groups of cumulative 75 and take the mean. Finally, use pd.IntervalIndex to clean up the format and change it to a string:
df['Group'] = df['amount'].cumsum()
s = pd.cut(df['Group'], bins=np.arange(0, df['Group'].max(), 75))
df = df.groupby(s)['price'].mean().reset_index()
df['Group'] = pd.IntervalIndex(df['Group']).left.astype(str) + '-' + pd.IntervalIndex(df['Group']).right.astype(str)
df
Out[1]:
Group price
0 0-75 74.467390
1 75-150 101.061767
2 150-225 127.779236
3 225-300 41.662766
4 300-375 98.460743
5 375-450 NaN
6 450-525 126.125865
7 525-600 87.749286
In this code I have plotted pct_day. Since the value does not increase the way a stock price would, is it possible to plot this data so that each value is added to the previous value before being plotted? That way the line graph would increase over time, as opposed to the image below, where the chart oscillates around a zero line.
High Low Open Close Volume Adj Close year pct_day
month day
1 2 794.913004 779.509998 788.783002 789.163007 6.372860e+08 789.163007 1997.400000 0.002211
3 833.470005 818.124662 823.937345 828.889339 9.985193e+08 828.889339 1997.866667 0.004160
4 863.153573 849.154299 858.737861 853.571429 1.042729e+09 853.571429 1997.714286 -0.003345
5 900.455715 888.571429 895.716426 894.472137 1.022023e+09 894.472137 1998.357143 -0.001216
6 847.453076 837.161537 840.123847 844.383843 8.889831e+08 844.383843 1998.076923 0.003679
... ... ... ... ... ... ... ... ... ...
12 27 909.735997 900.942000 905.528664 904.734009 7.485793e+08 904.734009 1998.133333 -0.000308
28 946.635010 940.440016 942.995721 944.127147 7.552150e+08 944.127147 1998.071429 0.001251
29 950.723837 941.625390 944.760775 947.200773 6.830400e+08 947.200773 1998.076923 0.002899
30 891.501671 883.954989 887.031665 887.819181 6.010675e+08 887.819181 1997.833333 0.001844
31 918.943857 910.320763 916.251549 913.786154 6.879523e+08 913.786154 1997.923077 -0.002772
363 rows × 8 columns
plotted in a Jupyter notebook as shown below:
You need the cumulative sum of the column pct_day. First, create a new column where you compute that value by means of numpy's cumsum:
pct_day_cumsum = np.cumsum(df['pct_day'].tolist())
df['pct_day_cumsum'] = pct_day_cumsum
After that you can plot it with df.plot(y='pct_day_cumsum').
I'm rather new to pandas and recently ran into a problem. I have a pandas DataFrame that I need to process. I need to extract the parts of the DataFrame where specific conditions are met. However, I want these parts to be coherent blocks, not one big set.
Example:
Consider the following pandas DataFrame
col1 col2
0 3 11
1 7 15
2 9 1
3 11 2
4 13 2
5 16 16
6 19 17
7 23 13
8 27 4
9 32 3
I want to extract the subframes where the values of col2 >= 10, resulting, for example, in a list of DataFrames of the form (in this case):
col1 col2
0 3 11
1 7 15
col1 col2
5 16 16
6 19 17
7 23 13
Ultimately, I need to do further analysis on the values in col1 within the resulting parts. However, the start and end of each of these blocks are important to me, so simply creating a subset using pandas.DataFrame.loc isn't going to work for me, I think.
What I have tried:
Right now I have a workaround that gets the subset using pandas.DataFrame.loc and then extracts the start and end index of each coherent block afterwards, by iterating through the subset and checking whether there is a jump in the indices. However, it feels rather clumsy, and I feel that I'm missing a basic pandas function here that would make my code more efficient and clean.
This is the code for my current workaround, adapted to the above example:
# here the blocks will be collected for further computations
blocks = []
# get all the items where col2 > 10 using 'loc[]'
subset = df.loc[df['col2'] > 10]
block_start = 0
block_end = None
# loop through all items in subset
for i in range(1, len(subset)):
    # if the difference between the current index and the last is greater than 1 ...
    if subset.index[i] - subset.index[i-1] > 1:
        # ... this is the current block's end
        next_block_start = i
        # extract the corresponding block and add it to the list of all blocks
        block = subset[block_start:next_block_start]
        blocks.append(block)
        # the next_block_start index is now the new block's starting index
        block_start = next_block_start
# close and add the last block
blocks.append(subset[block_start:])
Edit: I previously referred, by mistake, to 'pandas.DataFrame.where' instead of 'pandas.DataFrame.loc'. I seem to have been a bit confused by my recent research.
You can split your problem into parts. First, check the condition:
df['mask'] = (df['col2']>10)
We use this to see where a new subset starts:
df['new'] = df['mask'].gt(df['mask'].shift(fill_value=False))
Now you can combine this information into a group number. The cumsum will generate a step function, which we set to zero (via the mask column) wherever the row is not part of a group we are interested in.
df['grp'] = (df.new + 0).cumsum() * df['mask']
EDIT
You don't have to do the group calculation in your df:
s = (df['col2']>10)
s = (s.gt(s.shift(fill_value=False)) + 0).cumsum() * s
After that you can split it into a dict of separate DataFrames:
grp = {}
for i in np.unique(s)[1:]:
    grp[i] = df.loc[s == i, ['col1', 'col2']]
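Putting it all together, a minimal end-to-end sketch of this approach on the question's example data (the sample frame is reconstructed here, so it is only illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [3, 7, 9, 11, 13, 16, 19, 23, 27, 32],
                   'col2': [11, 15, 1, 2, 2, 16, 17, 13, 4, 3]})

s = (df['col2'] > 10)
s = (s.gt(s.shift(fill_value=False)) + 0).cumsum() * s  # group number per row, 0 = not selected

blocks = [df.loc[s == i, ['col1', 'col2']] for i in np.unique(s)[1:]]
# blocks[0] holds rows 0-1, blocks[1] holds rows 5-7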
The rolling function in pandas can only calculate rolling statistics according to row counts or date/time columns. But I want to use a discrete time column for calculating a rolling sum, something like this:
key time value
A 1 10
A 2 20
A 4 30
A 7 10
B 1 15
B 2 30
B 3 15
I want to first group by key, then calculate the rolling sum on value over the nearest 3 time units:
key time value output
A 1 10 10
A 2 20 30(10+20)
A 4 30 60(10+20+30)
A 7 10 40(30+10)
B 1 15 15
B 2 30 45
B 3 15 60
I tried this:
grouped = input.groupby("key", as_index=False)
for name, group in grouped:
    group = group.sort_values("time")
    time = list(group["time"])
    value = list(group["value"])
    # calcRollingStat is a custom function that outputs a list of corresponding results
    out = calcRollingStat(time, value, mode="avg")
    group["output"] = out  # out is a list
But then I don't know how to convert grouped back into a DataFrame. Pandas tells me that grouped has no reset_index attribute.
Is my code the best method to do this? How would you tackle this problem?
Thank you!
I believe you can use GroupBy.apply with a custom function:
def f(group):
    group = group.sort_values("time")
    time = list(group["time"])
    value = list(group["value"])
    # calcRollingStat is a custom function that outputs a list of corresponding results
    group["output"] = calcRollingStat(time, value, mode="avg")
    return group

df = input.groupby("key", as_index=False).apply(f)
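calcRollingStat itself is not shown in either snippet. Purely as an illustration, here is a hypothetical sketch of what such a helper could look like if "nearest 3 time" means all rows whose time lies within 3 units of the current row's time; with mode="sum" this reproduces the expected output column, but the real function may well differ:

def calcRollingStat(time, value, mode="avg", window=3):
    # Hypothetical helper: for each point, aggregate every value whose time lies
    # within `window` time units of the current time (inclusive on both ends).
    out = []
    for t in time:
        vals = [v for tj, v in zip(time, value) if t - window <= tj <= t]
        out.append(sum(vals) / len(vals) if mode == "avg" else sum(vals))
    return out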