Index (columns [0]) have duplicate values - pandas

I have the following data:
Start Time=2012-04-12 16:13:09
Finish Time=2012-11-30 13:31:08
Sample Period=01:00:00
As a CSV file:
Date(yyyy-mm-dd) Time(hh:mm:ss) Celsius (C)
2012-04-12 16:13:09 20.6
2012-04-12 17:13:09 20.6
2012-04-12 18:13:09 20.6
2012-04-12 19:13:09 20.6
2012-04-12 20:13:09 20.6
2012-04-12 21:13:09 20.6
2012-04-12 22:13:09 20.6
2012-04-12 23:13:09 20.6
and now I want to read the data in with pandas like this:
df=read_csv('mmf0401.txt',skiprows=(0,5),parse_dates=[[0,1]],index_col=0)
Unfortunately, it raises an exception: Index (columns [0]) have duplicate values ['2012-04-12']
I don't know why. How can I correct this?
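One possible way to load a file shaped like the one above is sketched here. It is not a confirmed diagnosis of the error. It assumes the file is whitespace-delimited and that the three metadata lines plus the header line come first; the column names date, time and celsius are made up for the example. Note that a list-like skiprows skips exactly the listed line numbers, so skiprows=(0,5) skips lines 0 and 5 rather than a range. The date and time columns are combined after reading with pd.to_datetime instead of through a nested parse_dates list.

```python
import io

import pandas as pd

# Stand-in for mmf0401.txt; the whitespace-delimited layout is an assumption.
raw = """Start Time=2012-04-12 16:13:09
Finish Time=2012-11-30 13:31:08
Sample Period=01:00:00
Date(yyyy-mm-dd) Time(hh:mm:ss) Celsius (C)
2012-04-12 16:13:09 20.6
2012-04-12 17:13:09 20.6
2012-04-12 18:13:09 20.6
"""

df = pd.read_csv(
    io.StringIO(raw),
    sep=r"\s+",                         # split on runs of whitespace
    skiprows=4,                         # an int skips the first 4 lines
    header=None,
    names=["date", "time", "celsius"],  # hypothetical column names
)

# Build one timestamp per row from the two text columns and use it as the index.
df.index = pd.to_datetime(df["date"] + " " + df["time"])
df = df.drop(columns=["date", "time"])
print(df)
```

With a real file, io.StringIO(raw) would be replaced by the file path.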

Related

pandas (multi) index wrong need to change it

I have a DataFrame multiData that looks like this:
print(multiData)
Date Open High Low Close Adj Close Volume
Ticker Date
AAPL 0 2010-01-04 7.62 7.66 7.59 7.64 6.51 493729600
1 2010-01-05 7.66 7.70 7.62 7.66 6.52 601904800
2 2010-01-06 7.66 7.69 7.53 7.53 6.41 552160000
3 2010-01-07 7.56 7.57 7.47 7.52 6.40 477131200
4 2010-01-08 7.51 7.57 7.47 7.57 6.44 447610800
... ... ... ... ... ... ... ...
META 2668 2022-12-23 116.03 118.18 115.54 118.04 118.04 17796600
2669 2022-12-27 117.93 118.60 116.05 116.88 116.88 21392300
2670 2022-12-28 116.25 118.15 115.51 115.62 115.62 19612500
2671 2022-12-29 116.40 121.03 115.77 120.26 120.26 22366200
2672 2022-12-30 118.16 120.42 117.74 120.34 120.34 19492100
I need to get rid of the "Date 0, 1, 2, ..." column and make the actual "Date" column part of the (multi) index.
How do I do this?
Use df.droplevel to delete level 1 and chain df.set_index to add column Date to the index by setting the append parameter to True.
df = df.droplevel(1).set_index('Date', append=True)
df
Open High Low Close Adj Close Volume
Ticker Date
AAPL 2010-01-04 7.62 7.66 7.59 7.64 6.51 493729600
2010-01-05 7.66 7.70 7.62 7.66 6.52 601904800
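A small self-contained sketch of the same idea follows. The frame multi_data is a hypothetical miniature of multiData with only a Close column, built so that the inner index level is a leftover integer counter and the real dates sit in a Date column.

```python
import pandas as pd

# Hypothetical miniature of multiData: outer level "Ticker", an unnamed
# integer counter as the inner level, and the real dates in a "Date" column.
multi_data = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2010-01-04", "2010-01-05", "2022-12-29", "2022-12-30"]
        ),
        "Close": [7.64, 7.66, 120.26, 120.34],
    },
    index=pd.MultiIndex.from_tuples(
        [("AAPL", 0), ("AAPL", 1), ("META", 2671), ("META", 2672)],
        names=["Ticker", None],
    ),
)

# Drop the counter level, then append the "Date" column to the index.
fixed = multi_data.droplevel(1).set_index("Date", append=True)
print(fixed.index.names)
```

After this, rows can be looked up by (ticker, date) pairs, for example fixed.loc[("AAPL", pd.Timestamp("2010-01-05"))].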

missing data in pandas web.DataReader yahoo

I am collecting data from Yahoo, and it worked fine for more than 3 months. Then last week I noticed that the data from 22.4.2021 up to today is missing.
When I print stockData, these are the last records:
...
...
2021-04-20 7.52 7.07 7.35 7.13 4119197.0 7.13
2021-04-21 7.36 7.13 7.16 7.25 3110870.0 7.25
2021-04-22 7.73 7.10 7.22 7.59 13178439.0 7.59
2021-05-06 10.08 9.48 9.52 9.93 2753885.0 9.93
As you can see, data is missing. Is this caused by Yahoo or by the pandas function? Could you help me fix it?
Alper

Resample a datetimeIndex start day wrong

Source:
import pandas as pd
import numpy as np
cols = ['Date', 'Time', 'Load', 'Battery', 'Panel',
'Wind', 'Temp', 'Humidity', 'Volt']
data = pd.read_csv('test.csv',delimiter=';',header=0,names=cols,
decimal=',',parse_dates=[[0,1]],
infer_datetime_format=True)
data.set_index('Date_Time',inplace=True)
I have this data frame:
In [126]: data.head()
Out[126]:
Load Battery Panel Wind Temp Humidity Volt
Date_Time
2018-07-31 13:07:15 13.3 326.3 353.1 0.98 33.93 21.92 3.89
2018-07-31 13:08:15 14.0 314.4 342.5 0.59 33.88 21.84 3.88
2018-07-31 13:09:16 13.4 309.6 335.5 0.39 33.84 22.14 3.88
2018-07-31 13:10:16 13.8 285.1 313.8 2.55 33.71 23.18 3.88
2018-07-31 13:11:16 13.6 292.9 314.7 2.03 33.62 23.25 3.88
......
with about 93000 more rows, from 2018-07-31 to 2018-04-10. I'd like to resample by taking the sum of the values in each 10-minute window. So I tried:
In [127]: data.resample('10min',closed='left',label='left').sum()
Out[127]:
Load Battery Panel Wind Temp Humidity Volt
Date_Time
2018-01-08 00:00:00 136.9 -140.6 -2.9 19.06 291.27 245.63 39.45
2018-01-08 00:10:00 137.3 -140.7 -3.1 15.14 290.62 244.88 39.42
2018-01-08 00:20:00 137.4 -140.4 -2.3 18.03 288.61 246.44 39.44
2018-01-08 00:30:00 137.5 -140.4 -2.2 12.61 286.97 246.83 39.43
That is close to what I expect, but resample removes all the data from the first day (I suspect because the series does not start at midnight). What is the proper way to do the resampling? There are two issues:
The first day is missing from the result, i.e. all of its data is removed and the resampled DataFrame starts on the first of August instead of 07/31.
It is fine for the intervals to be aligned to midnight, so that they fall on exact multiples of 10 minutes (00:00, 10:00, 20:00), but then I expect the first group to be:
2018-07-31 13:07:15 13.3 326.3 353.1 0.98 33.93 21.92 3.89
2018-07-31 13:08:15 14.0 314.4 342.5 0.59 33.88 21.84 3.88
2018-07-31 13:09:16 13.4 309.6 335.5 0.39 33.84 22.14 3.88
and then the next group starts at 13:10:16, on the first day of the dataset, of course, and not on the second.
Ok. I solved it using:
x = data.loc['2018-07-31'].resample('10min').sum()
y = data.resample('10min',closed='left',label='left').sum()
r = pd.concat([x,y])
but I think this must be some kind of bug in resample.
For output that starts at exactly 2018-07-31 13:07:15, you need to pass the base argument, which the documentation describes as "the origin of the aggregated intervals".
Example code:
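from datetime import timedelta
import numpy as np
import pandas as pd
start = pd.to_datetime('2018-07-31 13:07:15', format='%Y-%m-%d %H:%M:%S')
# timedelta(10) is 10 days, so this builds 10 days of one-minute timestamps
minutes = pd.date_range(start, start + timedelta(10), freq='min')
df = pd.DataFrame({'Date_Time': minutes, 'Load': np.random.randint(13, size=len(minutes))})
df.set_index('Date_Time', inplace=True)
# base=7.25 shifts the bin edges by 7 min 15 s so the first bin starts at 13:07:15
df.resample('10min', closed='left', label='left', base=7.25).sum()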
Result:
Date_Time Load
2018-07-31 13:07:15 11
2018-07-31 13:17:15 1
2018-07-31 13:27:15 6
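The base argument was deprecated in pandas 1.1 and removed in pandas 2.0. The sketch below shows what should be the equivalent call on newer versions, using the origin and offset arguments that replaced it. It assumes pandas 1.1 or later and uses a constant Load of 1.0 on 30 made-up one-minute rows so the sums are easy to check.

```python
import numpy as np
import pandas as pd

start = pd.Timestamp("2018-07-31 13:07:15")
minutes = pd.date_range(start, periods=30, freq="min")
df = pd.DataFrame({"Load": np.ones(len(minutes))}, index=minutes)
df.index.name = "Date_Time"

# origin="start" anchors the bin edges at the first timestamp in the data.
by_start = df.resample(
    "10min", closed="left", label="left", origin="start"
).sum()

# Alternatively, keep the default midnight origin and shift it by 7 min 15 s.
by_offset = df.resample(
    "10min", closed="left", label="left",
    offset=pd.Timedelta(minutes=7, seconds=15),
).sum()

print(by_start)
```

Both calls should produce bins labelled 13:07:15, 13:17:15 and 13:27:15 here. origin="start" follows whatever the first timestamp is, while offset hard-codes the shift.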

how do i access only specific entries of a dataframe having date as index

This is the tail of my DataFrame, which has around 1000 entries:
Open Close High Change mx_profitable
Date
2018-06-06 263.00 270.15 271.4 7.15 8.40
2018-06-08 268.95 273.00 273.9 4.05 4.95
2018-06-11 273.30 274.00 278.4 0.70 5.10
2018-06-12 274.00 282.85 284.4 8.85 10.40
I need to select only the entries for certain dates, for example the 25th of every month.
I think you need DatetimeIndex.day with boolean indexing:
df[df.index.day == 25]
Sample:
rng = pd.date_range('2017-04-03', periods=1000)
df = pd.DataFrame({'a': range(1000)}, index=rng)
print (df.head())
a
2017-04-03 0
2017-04-04 1
2017-04-05 2
2017-04-06 3
2017-04-07 4
df1 = df[df.index.day == 25]
print (df1.head())
a
2017-04-25 22
2017-05-25 52
2017-06-25 83
2017-07-25 113
2017-08-25 144

Select values based on another cell's values, then calculate a statistic from those cells and put it in a specific cell

I have a dataset of daily stream runoff values for the past 11 years. It looks like this:
ID Year DD Apr May Jun Jul Aug Sep Oct
08HF004 2000 1 26.5 37.6 18.3 12.3 8.35 5.19 7.98
08HF004 2000 2 28.8 25.8 19.3 10.4 6.86 4.61 5.86
08HF004 2000 3 34.7 22.8 25.9 9.32 5.82 4.07 4.71
08HF004 2000 4 29.7 19.4 33.8 9.16 5.5 3.61 4.01
08HF004 2000 5 19.9 17.5 38.6 9.01 5.39 3.32 3.53
08HF004 2000 6 15 14.6 33.1 9.04 5.22 3.32 3.2
08HF004 2000 7 11.6 14.1 27 10.3 4.83 4.55 2.96
...and so forth for 400+ more lines. What I want to do is use VBA to select all the values from each month (April 2000, May 2000, etc) and figure out the average and standard deviation from each month and send them to a cell in the worksheet, or a cell in another worksheet, or, ideally, a new workbook in the directory I can just call "results".
I suggest a PivotTable (one month per table) - Year for ROWS and say April for VALUES (once as Average of and once as StdDev of or StdDevp of).
Or you might 'flatten' the data (eg as shown here) and use different views of a single PivotTable: