How to plot a chart so it adds to the value of previous value instead of plotting it over a zero line - pandas

In this code I have plotted pct_day. Since the value does not increase the way a stock price would, is it possible to plot this data so that each value to be plotted is first added to the previous value? That way the line would rise over time, as opposed to the image below, where the chart is plotted around a zero line.
                 High         Low        Open       Close        Volume   Adj Close         year    pct_day
month day
1     2    794.913004  779.509998  788.783002  789.163007  6.372860e+08  789.163007  1997.400000   0.002211
      3    833.470005  818.124662  823.937345  828.889339  9.985193e+08  828.889339  1997.866667   0.004160
      4    863.153573  849.154299  858.737861  853.571429  1.042729e+09  853.571429  1997.714286  -0.003345
      5    900.455715  888.571429  895.716426  894.472137  1.022023e+09  894.472137  1998.357143  -0.001216
      6    847.453076  837.161537  840.123847  844.383843  8.889831e+08  844.383843  1998.076923   0.003679
...   ...         ...         ...         ...         ...           ...         ...          ...        ...
12    27   909.735997  900.942000  905.528664  904.734009  7.485793e+08  904.734009  1998.133333  -0.000308
      28   946.635010  940.440016  942.995721  944.127147  7.552150e+08  944.127147  1998.071429   0.001251
      29   950.723837  941.625390  944.760775  947.200773  6.830400e+08  947.200773  1998.076923   0.002899
      30   891.501671  883.954989  887.031665  887.819181  6.010675e+08  887.819181  1997.833333   0.001844
      31   918.943857  910.320763  916.251549  913.786154  6.879523e+08  913.786154  1997.923077  -0.002772
363 rows × 8 columns
in Jupyter notebook as shows below:

You need the cumulative sum of the column pct_day. First, create a new column holding that value, computed with numpy's cumsum:
import numpy as np
df['pct_day_cumsum'] = np.cumsum(df['pct_day'].to_numpy())
After that you can plot it with df.plot(y='pct_day_cumsum').
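As a minimal sketch of the idea above, using the first five pct_day values from the question as sample data: pandas has its own Series.cumsum, so the detour through NumPy is optional. The pct_day_compound column is an addition that the question did not ask for; it assumes pct_day holds fractional daily changes and compounds them instead of adding them, which is closer to how a price evolves.

```python
import pandas as pd

# Sample data: the first five pct_day values shown in the question.
df = pd.DataFrame({"pct_day": [0.002211, 0.004160, -0.003345, -0.001216, 0.003679]})

# Running total: each row is the sum of all pct_day values up to and including it.
df["pct_day_cumsum"] = df["pct_day"].cumsum()

# Assumption (not in the question): compound the changes instead of adding them.
df["pct_day_compound"] = (1 + df["pct_day"]).cumprod() - 1

print(df)
# To draw the lines (requires matplotlib):
# df.plot(y=["pct_day_cumsum", "pct_day_compound"])
```

For small daily changes the two curves stay close together; they drift apart as the changes get larger or the series gets longer.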

Related

Changing column name and its values at the same time

Pandas help!
I have a specific column like this,
Mpg
0 18
1 17
2 19
3 21
4 16
5 15
Mpg is miles per gallon.
Now I need to rename that 'Mpg' column to 'litre per 100 km' and convert its values to litres per 100 km at the same time. Any help? Thanks in advance.
-Tom
I managed to change the name of the column, but I could not do both at the same time.
Use pop to return and delete the column at the same time and rdiv to perform the conversion (litres per 100 km = 235.214583 / mpg, for US gallons):
df['litre per 100 km'] = df.pop('Mpg').rdiv(235.214583)
If you want to insert the column in the same position:
df.insert(df.columns.get_loc('Mpg'), 'litre per 100 km',
          df.pop('Mpg').rdiv(235.214583))
Output:
   litre per 100 km
0         13.067477
1         13.836152
2         12.379715
3         11.200694
4         14.700911
5         15.680972
An alternative to pop would be to store the result in another dataframe. This way you can perform the two steps at the same time. In my code below, I first reproduce your dataframe, then store the constant for conversion and perform it on all entries using the apply method.
import pandas as pd
df = pd.DataFrame({'Mpg':[18,17,19,21,16,15]})
cc = 235.214583 # constant for conversion from mpg to L/100km
df2 = pd.DataFrame()
df2['litre per 100 km'] = df['Mpg'].apply(lambda x: cc/x)
print(df2)
The output of this code is:
litre per 100 km
0 13.067477
1 13.836152
2 12.379715
3 11.200694
4 14.700911
5 15.680972
as expected.
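Both steps can also be done without apply, because dividing a constant by a Series is already vectorized. A small sketch, assuming US gallons (hence the constant below) and the sample data from the question; the name MPG_TO_L_PER_100KM is only illustrative:

```python
import pandas as pd

# Assumption: US gallons. Imperial gallons would need a different constant.
MPG_TO_L_PER_100KM = 235.214583

df = pd.DataFrame({"Mpg": [18, 17, 19, 21, 16, 15]})

# Rename first so the column keeps its position, then convert the values.
df = df.rename(columns={"Mpg": "litre per 100 km"})
df["litre per 100 km"] = MPG_TO_L_PER_100KM / df["litre per 100 km"]

print(df)
```

Renaming before converting means any other columns in the frame stay where they are, so no insert call is needed.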

How do you iterate through a data frame based on the value in a row

I have a data frame that I am trying to iterate through, not based on time but on an increase in a value, for example an increase of 10:
Column A   Column B
12:05      1
13:05      6
14:05      11
15:05      16
So in this case it would return a new data frame containing the rows with 1 and 11. How can I do this? The methods I have tried, such as asfreq and resample, don't seem to work; they report an invalid frequency, which I think is because the selection is not time based. Is there a function that does this based on a numerical step such as 10 or 7 instead of a time frequency? I don't want every nth row, but a row every time the column value has changed by 10 from the last selected value. For example, from 1 to 11; then, if the next values were 12, 15, 17, 21, the next selected value would be 21.
Here is one way to do it:
# do a remainder division, and choose rows where the remainder is zero
# offset by the first value, to make the calculation simpler
first_val = df['Column B'].iloc[0]
df.loc[((df['Column B'] - first_val) % 10).eq(0)]
  Column A  Column B
0    12:05         1
2    14:05        11
Note that this only picks rows whose value is an exact multiple of 10 away from the first value.
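The question also asks for a row whenever the value has moved by the step since the last selected row, which the remainder test does not cover when values skip past a multiple (for 1, 12, 23 it would keep only the first row). A minimal sketch of that stateful reading follows. The helper name select_by_step is hypothetical, and it assumes "changes by 10" means "has increased by at least 10"; the loop is explicit because each decision depends on the previous selection.

```python
import pandas as pd

def select_by_step(values, step):
    """Return positions whose value is at least `step` above the last selected value."""
    positions = []
    last_selected = None
    for pos, value in enumerate(values):
        # Always keep the first row, then wait until the value has risen by `step`.
        if last_selected is None or value - last_selected >= step:
            positions.append(pos)
            last_selected = value
    return positions

df = pd.DataFrame({
    "Column A": ["12:05", "13:05", "14:05", "15:05"],
    "Column B": [1, 6, 11, 16],
})

result = df.iloc[select_by_step(df["Column B"], 10)]
print(result)

# The second example from the question: after 1 and 11, the next pick is 21.
print(select_by_step([1, 11, 12, 15, 17, 21], 10))
```

If decreases should count as changes too, the comparison could use abs(value - last_selected) instead.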

Pandas count rows before/after current row

I need to calculate some measures on a window of my dataframe, with the value of interest in the centre of the window. To make this clearer, here is an example: if I have a dataset of 10 rows and a window size of 2, then at the 5th row I need to compute, for example, the mean of the values in the 3rd, 4th, 5th, 6th and 7th rows. At the first row there are no previous rows, so I need to use only the following ones (in the example, the mean of the 1st, 2nd and 3rd rows); if there are some rows but not enough, I need to use all the rows that are present (for example, at the 2nd row I will use the 1st, 2nd, 3rd and 4th). How can I do that? As the title of my question suggests, my first idea was to count the number of rows preceding and following the current one, but I don't know how to do that. I am not tied to this method, so if you have a better one, feel free to share it.
What you want is a rolling mean with min_periods=1, center=True:
import pandas as pd
df = pd.DataFrame({'col': range(10)})
N = 2 # number of rows before/after to include
df['rolling_mean'] = df['col'].rolling(2*N+1, min_periods=1, center=True).mean()
output:
col rolling_mean
0 0 1.0
1 1 1.5
2 2 2.0
3 3 3.0
4 4 4.0
5 5 5.0
6 6 6.0
7 7 7.0
8 8 7.5
9 9 8.0
I assume that you have the target_row and window_size numbers as an input. You are trying to do an operation on a window_size of rows around the target_row in a dataframe df, and I gather from your question that you already know that you can't just grab +/- the window size, because it might exceed the size of the dataframe. Instead, just quickly define the resulting start and end rows based on the dataframe size, and then pull out the window you want:
start_row = max(target_row - window_size, 0)
end_row = min(target_row + window_size, len(df)-1)
window = df.iloc[start_row:end_row+1,:]
Then you can perform whatever operation you want on the window such as taking an average with window.mean().
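The clamping above can be wrapped in a small helper and checked against the rolling approach. This is only a sketch: the function name centered_window is hypothetical, and target_row is assumed to be a 0-based positional index.

```python
import pandas as pd

def centered_window(df, target_row, window_size):
    """Rows within `window_size` positions of `target_row`, clipped to the frame."""
    start_row = max(target_row - window_size, 0)
    end_row = min(target_row + window_size, len(df) - 1)
    return df.iloc[start_row:end_row + 1, :]

df = pd.DataFrame({"col": range(10)})

# Mean of each clipped window, computed row by row.
manual = [centered_window(df, i, 2)["col"].mean() for i in range(len(df))]

# The same quantity from the rolling approach, for comparison.
rolling = df["col"].rolling(2 * 2 + 1, min_periods=1, center=True).mean().tolist()

print(manual)
print(rolling)
```

The helper is handy when the operation on the window is not one that rolling supports directly; for plain means over every row, rolling is the shorter route.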

access scalar in dataframe in each iterate loop

I have a dataframe of a cryptoCoin in the format of:
time open high low close volume TM
0 1618617600000 61342.7 61730.9 61268.7 61648.8 82.523952 5
1 1618618500000 61648.9 61695.3 61188.4 61333.2 72.375605 5
2 1618619400000 61333.1 61396.4 61144.2 61200.0 52.882392 5
3 1618620300000 61200.0 61509.4 61199.9 61446.2 48.429485 5
4 1618621200000 61446.2 61764.7 61446.2 61647.4 83.822974 5
... ... ... ... ... ... ... ..
19213 1635909300000 63006.2 63087.2 62935.0 63081.9 35.265568 26
19214 1635910200000 63081.9 63214.5 62950.1 63084.0 41.213263 30
19215 1635911100000 63084.0 63236.0 63027.6 63213.9 32.429295 21
19216 1635912000000 63213.8 63213.8 63021.5 63024.1 47.032509 19
19217 1635912900000 63024.1 63091.4 62852.1 62970.7 84.098123 16
I want to calculate a moving average of the close price with a varying time period, where the period comes from the TM column. I will use the talib/ta library. Efficiency matters, so I tried apply and np.where:
dataframe['DMA'] = dataframe.apply(lambda x: ta.MA(dataframe['close'], timeperiod=dataframe['TM']), axis=0)
and
dataframe['DMA'] = np.where(dataframe['TM'].values , ta.MA(dataframe['close'], timeperiod=dataframe['TM'].values), )
Both return the error:
TypeError: only size-1 arrays can be converted to Python scalars
which I believe comes from the timeperiod=dataframe['TM'].values part. If I use dataframe['TM'].values[0], only the first value, which is 5, is applied to every row. How can I access the scalar in each TM cell in a vectorized way, without iterating over the index or using a for loop?
My desired output:
The output dataframe has another column at the end, named DMA, and the last 3 rows should look like
............... DMA
19215 ..... ta.MA(dataframe['close'], timeperiod = 21)
19216 ..... ta.MA(dataframe['close'], timeperiod = 19)
19217 ..... ta.MA(dataframe['close'], timeperiod = 16)
At index 19215 I want the moving average of the last 21 close prices.
At index 19216 I want the moving average of the last 19 close prices.
At index 19217 I want the moving average of the last 16 close prices.
Appreciate your time.
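One possible way to get a per-row window without a Python-level loop is sketched below. It does not use ta.MA: it assumes a simple (unweighted) moving average is what is wanted and computes it from a cumulative sum with NumPy. The function name variable_window_mean is hypothetical, every TM value is assumed to be at least 1, and close is assumed to contain no NaN values.

```python
import numpy as np
import pandas as pd

def variable_window_mean(close, window):
    """Mean of the last window[i] values of close ending at row i (NaN if too few rows)."""
    values = close.to_numpy(dtype=float)
    w = window.to_numpy(dtype=int)
    # csum[k] is the sum of the first k values, so a window sum is a difference of two entries.
    csum = np.concatenate(([0.0], np.cumsum(values)))
    end = np.arange(1, len(values) + 1)
    start = end - w
    valid = start >= 0  # rows with enough history for their own window
    out = np.full(len(values), np.nan)
    out[valid] = (csum[end[valid]] - csum[start[valid]]) / w[valid]
    return pd.Series(out, index=close.index)

# Small made-up frame, only to show the shape of the call.
demo = pd.DataFrame({"close": [1.0, 2.0, 3.0, 4.0, 5.0], "TM": [1, 2, 2, 3, 5]})
demo["DMA"] = variable_window_mean(demo["close"], demo["TM"])
print(demo)
```

On long series the cumulative sum can lose a little floating-point precision compared with summing each window separately, which may or may not matter for price data.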

Gnuplot: How to load and display single numeric value from data file

My data file has this content
# data file for use with gnuplot
# Report 001
# Data as of Tuesday 03-Sep-2013
total 1976
case1 522 278 146 65 26 7
case2 120 105 15 0 0 0
case3 660 288 202 106 63 1
I am making a histogram from the case... lines using the script below - and that works. My question is: how can I load the grand total value 1976 (next to the word 'total') from the data file and either (a) store it into a variable or (b) use it directly in the title of the plot?
This is my gnuplot script:
reset
set term png truecolor
set terminal pngcairo size 1024,768 enhanced font 'Segoe UI,10'
set output "output.png"
set style fill solid 1.00
set style histogram rowstacked
set style data histograms
set xlabel "Case"
set ylabel "Frequency"
set boxwidth 0.8
plot for [i=3:7] 'mydata.dat' every ::1 using i:xticlabels(1) with histogram \
notitle, '' every ::1 using 0:2:2 \
with labels \
title "My Title"
For the benefit of others trying to label histograms, in my data file, the column after the case label represents the total of the rest of the values on that row. Those total numbers are displayed at the top of each histogram bar. For example for case1, 522 is the total of (278 + 146 + 65 + 26 + 7).
I want to display the grand total somewhere on my chart, say as the second line of the title or in a label. I can get a variable into sprintf into the title, but I have not figured out syntax to load a "cell" value ("cell" meaning row column intersection) into a variable.
Alternatively, if someone can tell me how to use the sum function to total up 522+120+660 (read from the data file, not as constants!) and store that total in a variable, that would obviate the need to have the grand total in the data file, and that would also make me very happy.
Many thanks.
Let's start with extracting a single cell at (row,col). If it is a single value, you can use the stats command to extract it. The row and col are specified with every and using, like in a plot command. In your case, to extract the total value, use:
# extract the 'total' cell
stats 'mydata.dat' every ::::0 using 2 nooutput
total = int(STATS_min)
To sum up all values in the second column, use:
stats 'mydata.dat' every ::1 using 2 nooutput
total2 = int(STATS_sum)
And finally, to sum up all values in columns 3:7 in all rows (i.e. the same as the previous command, but without using the saved totals), use:
# sum all values from columns 3:7 from all rows
stats 'mydata.dat' every ::1 using (sum[i=3:7] column(i)) nooutput
total3 = int(STATS_sum)
These commands require gnuplot 4.6 or later.
So, your plotting script could look like the following:
reset
set terminal pngcairo size 1024,768 enhanced
set output "output.png"
set style fill solid 1.00
set style histogram rowstacked
set style data histograms
set xlabel "Case"
set ylabel "Frequency"
set boxwidth 0.8
# extract the 'total' cell
stats 'mydata.dat' every ::::0 using 2 nooutput
total = int(STATS_min)
plot for [i=3:7] 'mydata.dat' every ::1 using i:xtic(1) notitle, \
'' every ::1 using 0:(s = sum [i=3:7] column(i), s):(sprintf('%d', s)) \
with labels offset 0,1 title sprintf('total %d', total)
which gives the following output:
For Linux and similar systems:
If you don't know the row number where your data is located, but you know it is in the n-th column of a row where the value of the m-th column is x, you can define a function
get_data(m,x,n,filename)=system('awk "\$'.m.'==\"'.x.'\"{print \$'.n.'}" '.filename)
and then use it, for example, as
y = get_data(1,"case2",4,"datafile.txt")
Using the data provided by user424855,
print y
should return 15.
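For readers who want to check what the awk one-liner selects, the same lookup can be sketched in Python. This is an illustration of the logic only, not something gnuplot can call; the Python get_data below takes a list of lines instead of a file name, and fields are split on whitespace as awk does by default.

```python
def get_data(m, x, n, lines):
    """Return the n-th field of every line whose m-th field equals x (1-based fields)."""
    matches = []
    for line in lines:
        fields = line.split()
        # Skip lines that are too short to have both fields.
        if len(fields) >= max(m, n) and fields[m - 1] == x:
            matches.append(fields[n - 1])
    return matches

# The data file from the question.
DATA = """\
# data file for use with gnuplot
# Report 001
# Data as of Tuesday 03-Sep-2013
total 1976
case1 522 278 146 65 26 7
case2 120 105 15 0 0 0
case3 660 288 202 106 63 1
"""

print(get_data(1, "case2", 4, DATA.splitlines()))
print(get_data(1, "total", 2, DATA.splitlines()))
```

Like the awk version, this returns every matching row, so a label that appears more than once yields more than one value.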
It's not clear to me where your "grand total" of 1976 comes from. If I calculate 522+120+660 I get 1302 not 1976.
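The arithmetic can be checked against the data file from the question with a few lines of Python. This is a quick sketch; it assumes the second field of each case line is the stated row total and the remaining fields are the parts.

```python
# The data file from the question.
DATA = """\
# data file for use with gnuplot
# Report 001
# Data as of Tuesday 03-Sep-2013
total 1976
case1 522 278 146 65 26 7
case2 120 105 15 0 0 0
case3 660 288 202 106 63 1
"""

stated_totals = {}
computed_totals = {}
for line in DATA.splitlines():
    fields = line.split()
    # Skip blank lines, comments and the separate 'total' line.
    if not fields or fields[0].startswith("#") or fields[0] == "total":
        continue
    label, stated, *parts = fields
    stated_totals[label] = int(stated)
    computed_totals[label] = sum(int(p) for p in parts)

grand_total = sum(stated_totals.values())
print(stated_totals)
print(computed_totals)
print(grand_total)
```

The stated and computed row totals agree, and their sum is 1302, so the 1976 in the file must come from somewhere other than these three rows.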
Anyway, here is a solution which works even without stats and sum which were not available in gnuplot 4.4.0.
In the data you don't necessarily need the "grand total" or the sum of each row, because gnuplot can calculate these for you. This is done by (not) plotting the file as a matrix, and at the same time collecting the row sums in the string variable S0 and the total sum in the variable Total. There will be a warning "matrix contains missing or undefined values", which you can ignore. The labels are added by plotting '+' ... with labels, extracting the desired values from the S0 string.
Data: SO18583180.dat
So, the reduced input data looks like this:
# data file for use with gnuplot
# Report 001
# Data as of Tuesday 03-Sep-2013
case1 278 146 65 26 7
case2 105 15 0 0 0
case3 288 202 106 63 1
Script: (works for gnuplot>=4.4.0, March 2010 and gnuplot 5.x)
### histogram with sums and total sum
reset
FILE = "SO18583180.dat"
set style histogram rowstacked
set style data histograms
set style fill solid 0.8
set xlabel "Case"
set ylabel "Frequency"
set boxwidth 0.8
set key top left noautotitle
set grid y
set xrange [0:2]
set offsets 0.5,0.5,0,0
Total = 0
S0 = ''
addSums(v) = S0.sprintf(" %g",(M=$2,(N=$1+1)==1?S1=0:0,S1=S1+v))
plot for [i=2:6] FILE u i:xtic(1) notitle, \
'' matrix u (S0=addSums($3),Total=Total+$3,NaN) w p, \
'+' u 0:(real(S2=word(S0,int($0*N+N)))):(S2) every ::::M w labels offset 0,0.7 title sprintf("Total: %g",Total)
### end of script
Result: (created with gnuplot 4.4.0, Windows terminal)