pandas df.resample('D').sum() returns NaN

I've got a pandas data frame with electricity meter readings (cumulative). The df has a DatetimeIndex with dtype='datetime64[ns]'. When I load the .csv file the dataframe does not contain any NaN values. I need to calculate both the monthly and the daily energy generated.
To calculate monthly generation I use dfmonth = df.resample('M').sum(). This works fine.
To calculate daily generation I thought of using dfday = df.resample('D').sum(), which partially works, but for some index dates it returns NaN even though no data is missing in the raw file.
Please see the code below. Does anyone know why this happens? Any proposed solution?
df = pd.read_csv(file)
df = df.set_index(pd.DatetimeIndex(df['Reading Timestamp']))
df = df.rename(columns={'Energy kWh': 'meter', 'Instantaneous Power kW (approx)': 'kW'})
df.drop(df.columns[:10], axis=1, inplace=True) #Delete columns I don't need.
df['kWh'] = df['meter'].sub(df['meter'].shift())
dfmonth = df.resample('M').sum() #This works OK calculating kWh. dfmonth does not contain any NaN.
dfday = df.resample('D').sum() # This returns a total of 8 NaN out of 596 sampled points. Original df has 27929 DatetimeIndex rows
Thank you in advance.

A big apology to you all. The .csv I was given and the raw .csv I was checking against are not the same file. Data was somehow corrupted....
I've been banging my head against the wall until now; there is no problem with df.resample('D').sum().
Sorry again, consider thread sorted.
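For anyone who lands here with the same symptom: before blaming resample, it can help to list the days whose daily sum comes out NaN and eyeball the raw readings for one of them. A minimal sketch against the frames built above (on recent pandas versions empty bins sum to 0 rather than NaN, so this may come back empty there):
bad_days = dfday[dfday['kWh'].isna()].index      # days whose daily sum is NaN
print(bad_days)
print(df.loc[bad_days[0].strftime('%Y-%m-%d')])  # raw readings for the first offending day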

Related

How to work with multiple `item_id`'s of varying length for timeseries prediction in autogluon?

I'm currently building a model to predict daily stock prices based on daily data for thousands of stocks. In the data I've got the daily data for all stocks, however the series have different lengths. E.g. for some stocks I have daily data from 2000 to 2022, and for others I have data from 2010 to 2022.
Many dates are also obviously repeated for all stocks.
While I was learning autogluon, I used the following function to format timeseries data so it can work with .fit():
def forward_fill_missing(ts_dataframe: TimeSeriesDataFrame, freq="D") -> TimeSeriesDataFrame:
    original_index = ts_dataframe.index.get_level_values("timestamp")
    start = original_index[0]
    end = original_index[-1]
    filled_index = pd.date_range(start=start, end=end, freq=freq, name="timestamp")
    return ts_dataframe.droplevel("item_id").reindex(filled_index, method="ffill")

ts_dataframe = ts_dataframe.groupby("item_id").apply(forward_fill_missing)
This worked, however I was trying this for data for just one item_id and now I have thousands.
When I use this now, I get the following error: ValueError: cannot reindex from a duplicate axis
It's important to note that I have already forward-filled my data with pandas, and the ts_dataframe shouldn't have any missing dates or values, but when I try to use it with .fit() I get the following error:
ValueError: Frequency not provided and cannot be inferred. This is often due to the time index of the data being irregularly sampled. Please ensure that the data set used has a uniform time index, or create the TimeSeriesPredictor setting ignore_time_index=True.
I assume that this is because I have only filled in missing data and dates, but not taken into account the varying number of days available for every stock individually.
For reference, here's how I have formatted the data with pandas:
df = pd.read_csv(
    "/content/drive/MyDrive/stock_data/training_data.csv",
    parse_dates=["Date"],
)
df["Date"] = pd.to_datetime(df["Date"], errors="coerce", dayfirst=True)
df.fillna(method='ffill', inplace=True)
df = df.drop("Unnamed: 0", axis=1)
df[:11]
How can I format the data so I can use it with .fit()?
Thanks!
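Not a full answer, but since each stock spans a different date range, one pure-pandas way to regularize the index per stock without dropping item_id is to reindex each group against its own daily range. This is only a sketch: the item_id and Date column names are assumptions based on the description above.
import pandas as pd

def fill_daily(group: pd.DataFrame) -> pd.DataFrame:
    # Complete daily range spanning only this one stock's own history.
    full_range = pd.date_range(group.index.min(), group.index.max(), freq="D", name="timestamp")
    return group.drop(columns="item_id").reindex(full_range).ffill()

filled = (
    df.set_index("Date")                     # per-row timestamp index
      .groupby("item_id", group_keys=True)   # keep the stock id as an index level
      .apply(fill_daily)
)
# filled now has an (item_id, timestamp) MultiIndex with one row per calendar
# day per stock, which matches the index layout a TimeSeriesDataFrame uses.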

Pandas dataframe mixed dtypes when reading csv

I am reading in a large dataframe that throws a DtypeWarning (I understand this warning) but am struggling to prevent it. I don't want to set low_memory to False, because I would like to specify the correct dtypes instead.
For every column, the majority of rows are float values and the last 3 rows are strings (metadata, basically: information about each column). I understand that I can set the dtype per column when reading in the csv, however I do not know how to make rows 1:n float32, for example, and the last 3 rows strings. I would like to avoid reading in two separate CSVs. The resulting dtype of all columns after reading in the dataframe is 'object'. Below is a reproducible example. The dtype warning is not thrown when reading it in, I am guessing because of the size of the dataframe, however the result is exactly the same as the problem I am facing: I would like to make the first 3 rows float32 and the last 3 string so that they are the correct dtype. Thank you!
reproducible example:
df = pd.DataFrame([[0.1, 0.2, 0.3], [0.1, 0.2, 0.3], [0.1, 0.2, 0.3],
                   ['info1', 'info2', 'info3'], ['info1', 'info2', 'info3'], ['info1', 'info2', 'info3']],
                  index=['index1', 'index2', 'index3', 'info1', 'info2', 'info3'],
                  columns=['column1', 'column2', 'column3'])
df.to_csv('test.csv')
df1 = pd.read_csv('test.csv', index_col=0)
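One possible approach (a sketch, assuming the metadata block is always exactly the last three rows): read everything as strings so the chunked type inference never complains, then split and cast the two blocks separately.
df1 = pd.read_csv('test.csv', index_col=0, dtype=str)  # everything read as strings, no mixed-type guessing
data = df1.iloc[:-3].astype('float32')                 # numeric block as float32
meta = df1.iloc[-3:]                                   # last 3 rows kept as strings (metadata)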

Group DataFrame by binning a column::Float64, in Julia

Say I have a DataFrame with a column of Float64s; I'd like to group the dataframe by binning that column. I hear the cut function might help, but it's not defined over dataframes. Some work has been done (https://gist.github.com/tautologico/3925372), but I'd rather use a library function than copy-paste code from the Internet. Pointers?
EDIT Bonus karma for finding a way of doing this by month over UNIX timestamps :)
You could bin dataframes based on a column of Float64s like this. Here my bins are increments of 0.1 from 0.0 to 1.0, binning the dataframe based on a column of 100 random numbers between 0.0 and 1.0.
using DataFrames #load DataFrames
df = DataFrame(index = rand(Float64,100)) #Make a DataFrame with some random Float64 numbers
df_array = map(x -> df[(df[:index] .>= x[1]) .& (df[:index] .< x[2]), :], zip(0.0:0.1:0.9, 0.1:0.1:1.0)) # For each (lower, upper) tuple from zip, keep the rows whose :index value falls in [lower, upper)
This will produce an array of 10 dataframes, each one with a different 0.1-sized bin.
As for the UNIX timestamp question, I'm not as familiar with that side of things, but after playing around a bit maybe something like this could work:
using Dates
df = DataFrame(unixtime = rand(1E9:1:1.1E9,100)) #Make a dataframe with floats containing pretend unix time stamps
df[:date] = Dates.unix2datetime.(df[:unixtime]) #convert those timestamps to DateTime types
df[:year_month] = map(date->string(Dates.Year.(date))*" "*string(Dates.Month.(date)),df[:date]) #Make a string for every month in your time range
df_array = map(ym->df[df[:year_month] .== ym,:],unique(df[:year_month])) #Bin based on each unique year_month string

Realign Pandas Rolling Average series in a dataframe

Very new to Pandas. I'm importing stock data into a DataFrame and I want to calculate a 10-day rolling average. That I can figure out. The issue is that it gives 9 NaN values because of the 10-day moving average period.
I want to realign the data so the 10th piece of data becomes the first value of a new rolling average column at the top of the data frame. I tried moving the data by writing the following code:
small = pd.rolling_mean(df['Close'],10)
and then tried to add that to the df with the following code:
df['MA10D2'] = small[9:]
but it still gives 9 NaN values at the top. Can anyone help me out?
Assignment is done based on the index. small[9:] still carries its original index starting at position 9, so the assigned values keep their positions from index 9 onward.
The function you are searching for is called shift:
df['MA10D2'] = small.shift(-9)
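Note that pd.rolling_mean was removed in later pandas releases; on current versions the same idea would look roughly like this:
small = df['Close'].rolling(10).mean()  # replaces pd.rolling_mean(df['Close'], 10)
df['MA10D2'] = small.shift(-9)          # pull each average up 9 rows so the column starts at the top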

pandas / numpy arithmetic mean in csv file

I have a csv file which contains 3000 rows and 5 columns and constantly has more rows appended to it on a weekly basis.
What I'm trying to do is find the arithmetic mean of the last column for the last 1000 rows, every week. (So when new rows are added weekly, it will just take the average of the most recent 1000 rows.)
How should I construct the pandas or numpy array to achieve this?
df = pd.read_csv("fds.csv", index_col=False, header=0)
df_1 = df['Results']
# How should I write the next line of code to get the average of the most recent 1000 rows?
I'm on a different machine from the one my pandas is installed on, so I'm going from memory, but I think what you'll want to do is...
df = pd.read_csv("fds.csv", index_col=False, header=0)
# Let's pretend your 5th column has a name (header) of `Stuff`
last_thousand = df.tail(1000)
np.mean(last_thousand['Stuff'])
A little bit quicker using mean():
df = pd.read_csv("fds.csv", header = 0)
results = df.tail(1000).mean()
results will contain the mean for each column within the last 1000 rows. If you want more statistics, you can also use describe():
results = df.tail(1000).describe().unstack()
So basically I needed to use the pandas tail function. My code below works.
df = pd.read_csv("fds.csv", index_col=False, header=0)
df_1 = df['Results']
numpy.average(df_1.tail(1000))
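If the 'Results' column is always the last of the five, a positional variant of the same idea would be (just a small alternative, not required):
df = pd.read_csv("fds.csv", index_col=False, header=0)
last_col_mean = df.iloc[-1000:, -1].mean()  # mean of the last column over the most recent 1000 rows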