How to work with multiple `item_id`s of varying length for time series prediction in AutoGluon? - pandas

I'm currently building a model to predict daily stock prices from daily data for thousands of stocks. I have daily data for all stocks, but the series have different lengths: for some stocks I have daily data from 2000 to 2022, and for others only from 2010 to 2022.
Many dates are, of course, repeated across stocks.
While I was learning autogluon, I used the following function to format time series data so it works with .fit():
def forward_fill_missing(ts_dataframe: TimeSeriesDataFrame, freq="D") -> TimeSeriesDataFrame:
    original_index = ts_dataframe.index.get_level_values("timestamp")
    start = original_index[0]
    end = original_index[-1]
    filled_index = pd.date_range(start=start, end=end, freq=freq, name="timestamp")
    return ts_dataframe.droplevel("item_id").reindex(filled_index, method="ffill")

ts_dataframe = ts_dataframe.groupby("item_id").apply(forward_fill_missing)
This worked while I was trying it on data for a single item_id, but now I have thousands.
When I use it now, I get the following error: ValueError: cannot reindex from a duplicate axis
It's important to note that I have already forward-filled my data with pandas, and the ts_dataframe shouldn't have any missing dates or values, but when I try to use it with .fit() I get the following error:
ValueError: Frequency not provided and cannot be inferred. This is often due to the time index of the data being irregularly sampled. Please ensure that the data set used has a uniform time index, or create the TimeSeriesPredictor setting ignore_time_index=True.
I assume that this is because I have only filled in missing data and dates, but not taken into account the varying number of days available for every stock individually.
For reference, here's how I have formatted the data with pandas:
df = pd.read_csv(
    "/content/drive/MyDrive/stock_data/training_data.csv",
    parse_dates=["Date"],
)
df["Date"] = pd.to_datetime(df["Date"], errors="coerce", dayfirst=True)
df.fillna(method="ffill", inplace=True)
df = df.drop("Unnamed: 0", axis=1)
df[:11]
How can I format the data so I can use it with .fit()?
Thanks!
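For what it's worth, one way to regularize each stock's series is to do the filling in plain pandas, per item, and only then build the TimeSeriesDataFrame. This is a minimal sketch, assuming the raw frame df has item_id and Date columns as in the question; fill_item is a hypothetical helper, not an AutoGluon API:
import pandas as pd
from autogluon.timeseries import TimeSeriesDataFrame

def fill_item(group: pd.DataFrame) -> pd.DataFrame:
    # Regular daily index spanning only this stock's own first and last date.
    full_index = pd.date_range(group["Date"].min(), group["Date"].max(), freq="D")
    return (
        group.set_index("Date")
        .reindex(full_index)  # missing days become NaN rows
        .ffill()              # forward-fill values (including item_id) into the gaps
        .rename_axis("timestamp")
        .reset_index()
    )

# group_keys=False keeps item_id as an ordinary column, so the groupby/apply
# round trip does not create the duplicated index levels that trigger
# "cannot reindex from a duplicate axis".
df_filled = df.groupby("item_id", group_keys=False).apply(fill_item)

ts_dataframe = TimeSeriesDataFrame.from_data_frame(
    df_filled, id_column="item_id", timestamp_column="timestamp"
)
Because every item then has a gap-free daily index, the frequency can be inferred and .fit() should no longer complain about irregular sampling.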

Related

Plotting a graph of the top 15 highest values

I am working on a dataset which shows the budget spent on movies. I want to make a plot containing the top 15 highest-budget movies.
#sort the 'budget' column in descending order and store it in a new dataframe.
info = pd.DataFrame(dp['budget'].sort_values(ascending=False))
info['original_title'] = dp['original_title']
data = list(map(str, info['original_title']))
#extract the top 10 budget movies from the list and dataframe.
x = list(data[:10])
y = list(info['budget'][:10])
This was the output I got:
C:\Users\Phillip\AppData\Local\Temp\ipykernel_7692\1681814737.py:2: FutureWarning: The behavior of `series[i:j]` with an integer-dtype index is deprecated. In a future version, this will be treated as *label-based* indexing, consistent with e.g. `series[i]` lookups. To retain the old behavior, use `series.iloc[i:j]`. To get the future behavior, use `series.loc[i:j]`.
y = list(info['budget'][:5])
I'm new to the data analysis scene, so I'm confused about how else to go about the problem.
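One note on the FutureWarning itself: it only says that integer slicing on an integer-dtype index will change meaning in a future pandas; switching to positional .iloc slicing keeps the current behavior, e.g.:
x = list(data[:10])
y = list(info['budget'].iloc[:10])  # positional slice; silences the FutureWarning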
A simple example using a movie dataset I found online:
import pandas as pd
url = "https://raw.githubusercontent.com/erajabi/Python_examples/master/movie_sample_dataset.csv"
df = pd.read_csv(url)
# Bar plot of 15 highest budgets:
df.nlargest(n=15, columns="budget").plot.bar(x="movie_title", y="budget")
You can customize your plot in various ways by adding arguments to the .bar(...) call.

Adding The Results Of Two Queries In Pandas Dataframes

I am trying to do data analysis for the first time using Pandas in a Jupyter notebook and was wondering what I am doing wrong.
I have created a data frame for the results of a query to store a table that represents the total population I am comparing to.
ds          count
2022-28-9   100
2022-27-9   98
2022-26-9   99
2022-25-9   98
This data frame is called total_count
I have created a data frame for the results of a query to store a table that represents the count of items that are out of SLA to be divided by the total.
ds          oo_sla
2022-28-9   60
2022-27-9   38
2022-26-9   25
2022-25-9   24
This data frame is called out_of_sla
These two data sets are created by Presto queries from Hive tables if that matters.
I am now trying to divide those results to get a % out of SLA but I am getting errors.
data = {"total_count"[], "out_of_sla"[]}
df = pd.DataFrame(data)
df["result"] = [out_of_sla]/[total_count]
print(df)
I am getting an error for invalid syntax on line 3. My goal was to create a trend of in/out of sla status and a widget for the most recent datestamps sla. Any insight is appreciated.
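The first line is not valid Python ({"total_count"[], "out_of_sla"[]} is neither a dict nor a set literal), which is what raises the syntax error. A minimal sketch of one way to get the percentage, assuming total_count and out_of_sla are the two data frames shown above with ds as their shared column:
import pandas as pd

# Align the two query results on their shared date column.
merged = total_count.merge(out_of_sla, on="ds")

# Percent of items out of SLA per day.
merged["pct_out_of_sla"] = merged["oo_sla"] / merged["count"] * 100
print(merged)
Merging on ds keeps the dates aligned even if one result set is missing a day, which a plain positional division would silently get wrong.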

Generating Percentages from Pandas

I am working with a data set from SQL currently -
import pandas as pd
df = spark.sql("select * from donor_counts_2015")
df_info = df.toPandas()
print(df_info)
The output looks like this (I can't include the actual output for privacy reasons): [screenshot omitted]
As you can see, it's a data set that has the name of a fund and then the number of people who have donated to that fund. What I am trying to do now is calculate what percent of funds have only 1 donation, what percent have 2, 3, 4, etc. I am wondering if there is an easy way to do this with pandas? I would also appreciate being able to see the percentage for ranges of donations too, like what percentage of funds have between 50-100 donations, 500-1000, etc. Thanks!
You can make a histogram of the donations to visualize the distribution. np.histogram might help. Or you can also sort the data and count manually.
For the first task, to get the percentages for the column 'number_of_donations', you can do:
df['number_of_donations'].value_counts(normalize=True) * 100
For the second task, you need to create a new column with categories, and then do the same:
# Create a Series with categories
New_Serie = pd.cut(df.number_of_donations, bins=[0, 100, 200, 500, 99999999], labels=['Few', 'Medium', 'Many', 'Too Many'])
# Name the new column (the name must be a string)
New_Serie.name = 'Category'
# Concat df and New_Serie
df = pd.concat([df, New_Serie], axis=1)
# Get the percentage of each category
df['Category'].value_counts(normalize=True) * 100
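A quick self-contained check of both steps, with made-up donation counts that are purely illustrative:
import pandas as pd

# Toy data standing in for the real fund/donation table.
df = pd.DataFrame({"number_of_donations": [1, 1, 2, 150, 300, 800]})

# Percent of funds per exact donation count.
print(df["number_of_donations"].value_counts(normalize=True) * 100)

# Percent of funds per donation range.
category = pd.cut(
    df["number_of_donations"],
    bins=[0, 100, 200, 500, 99999999],
    labels=["Few", "Medium", "Many", "Too Many"],
)
print(category.value_counts(normalize=True) * 100)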

pandas df.resample('D').sum() returns NaN

I've got a pandas data frame with electricity meter readings (cumulative). The df has a DatetimeIndex with dtype='datetime64[ns]'. When I load the .csv file the data frame does not contain any NaN values. I need to calculate both the monthly and daily energy generated.
To calculate monthly generation I use dfmonth = df.resample('M').sum(). This works fine.
To calculate daily generation I thought of using dfday = df.resample('D').sum(), which partially works but returns NaN for some index dates (with no data missing in the raw file).
Please see the code below. Does anyone know why this happens? Any proposed solution?
df = pd.read_csv(file)
df = df.set_index(pd.DatetimeIndex(df['Reading Timestamp']))
df = df.rename(columns={'Energy kWh': 'meter', 'Instantaneous Power kW (approx)': 'kW'})
df.drop(df.columns[:10], axis=1, inplace=True)  # Delete columns I don't need.
df['kWh'] = df['meter'].sub(df['meter'].shift())

dfmonth = df.resample('M').sum()  # This works OK calculating kWh. dfmonth does not contain any NaN.
dfday = df.resample('D').sum()  # This returns a total of 8 NaN out of 596 sampled points. The original df has 27929 DatetimeIndex rows.
Thank you in advance.
A big apology to you all. The .csv I was given and the raw .csv I was checking against are not the same file. The data was somehow corrupted...
I've been banging my head against the wall till now; there is no problem with df.resample('D').sum().
Sorry again, consider this thread sorted.
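For anyone who lands here with genuinely missing days: how resample('D').sum() reports an empty day depends on the pandas version and the min_count argument. A small illustrative sketch:
import pandas as pd

s = pd.Series(
    [1.0, 2.0, 4.0],
    index=pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-04"]),
)

# 2022-01-03 has no samples: recent pandas reports 0 for the empty bin by
# default, while min_count=1 makes empty bins show up as NaN instead.
print(s.resample("D").sum())
print(s.resample("D").sum(min_count=1))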

matplotlib: value error x and y have different dimensions

Having some difficulty plotting out values grouped by a text/name field and along a range of dates. The issue is that while I can group by the name and generate plots for some of the date ranges, there are instances where the grouping contains missing date values (just the nature of the overall dataset).
That is to say, I may very well be able to plot a date_range('10/1/2013', '10/31/2013') for SOME of the grouped values, but there are instances where there is no '10/15/2013' within that range, which throws the error mentioned in the title of this post.
Thanks for any input!
plt.rcParams['legend.loc'] = 'best'
dtable = pd.io.parsers.read_table(str(datasource), sep=',')
unique_keys = np.unique(dtable['KEY'])
index = date_range(d1frmt, d2frmt)

for key in unique_keys:
    values = dtable[dtable['KEY'] == key]
    plt.figure()
    plt.plot(index, values['VAL'])  # <-- can fail if index is missing a date
    plt.xlim(xmin=d1frmt, xmax=d2frmt)
    plt.xticks(rotation=270)
    plt.xticks(size='small')
    plt.legend(('H20'))
    plt.ylabel('Head (ft)')
    plt.title('Well {0}'.format(key))
    fig = str('{0}.png'.format(key))
    out = str(outputloc) + "\\" + str(fig)
    plt.savefig(out)
    plt.close()
You must have a date column, or index, in your dtable. Otherwise you don't know which entries in values['VAL'] belong to which date.
If you do, there are two ways.
Since you make a subset based on a key, you can either use the index (if it's a datetime!) of that subset:
plt.plot(values.index.to_pydatetime(), values['VAL'])
or reindex the subset to your 'target' range:
values = values.reindex(index)
plt.plot(index.to_pydatetime(), values['VAL'])
By default, reindex inserts NaN values as missing data.
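To make the second option concrete, here is a minimal self-contained sketch; the KEY/VAL layout follows the question and the numbers are purely illustrative:
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative subset for one well, where 2013-10-15 is missing.
values = pd.DataFrame(
    {"VAL": [10.0, 11.0, 9.5]},
    index=pd.to_datetime(["2013-10-13", "2013-10-14", "2013-10-16"]),
)

index = pd.date_range("2013-10-01", "2013-10-31")

# Reindexing inserts NaN for the missing dates, so x and y end up the same
# length; matplotlib simply draws a gap where the data is NaN.
values = values.reindex(index)
plt.plot(index.to_pydatetime(), values["VAL"])
plt.show()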
It would be easier if you gave a working example; it's a bit hard to answer without knowing what your DataFrame looks like.