I am trying to do data analysis for the first time using Pandas in a Jupyter notebook and was wondering what I am doing wrong.
I have created a data frame for the results of a query to store a table that represents the total population I am comparing to.
ds          count
2022-28-9   100
2022-27-9    98
2022-26-9    99
2022-25-9    98
This data frame is called total_count
I have created a data frame for the results of a query to store a table that represents the count of items that are out of SLA to be divided by the total.
ds          oo_sla
2022-28-9   60
2022-27-9   38
2022-26-9   25
2022-25-9   24
This data frame is called out_of_sla
These two data sets are created by Presto queries from Hive tables if that matters.
I am now trying to divide those results to get a % out of SLA but I am getting errors.
data = {"total_count"[], "out_of_sla"[]}
df = pd.DataFrame(data)
df["result"] = [out_of_sla]/[total_count]
print(df)
I am getting an invalid-syntax error on line 3. My goal was to create a trend of in/out-of-SLA status and a widget for the most recent datestamp's SLA. Any insight is appreciated.
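The snippet above isn't valid Python (`"total_count"[]` is not a legal expression, and `total_count`/`out_of_sla` are DataFrames, not list elements). A minimal sketch of one way to do the division, reconstructing the two frames inline with the column names `ds`, `count`, and `oo_sla` from the tables above:

```python
import pandas as pd

# reconstructed from the tables in the question
total_count = pd.DataFrame({
    "ds": ["2022-28-9", "2022-27-9", "2022-26-9", "2022-25-9"],
    "count": [100, 98, 99, 98],
})
out_of_sla = pd.DataFrame({
    "ds": ["2022-28-9", "2022-27-9", "2022-26-9", "2022-25-9"],
    "oo_sla": [60, 38, 25, 24],
})

# align the rows on ds first, then divide the columns element-wise
df = total_count.merge(out_of_sla, on="ds")
df["result"] = df["oo_sla"] / df["count"]
print(df)
```

Merging on `ds` first means the division stays correct even if the two queries return rows in different orders.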
I'm currently building a model to predict daily stock price based on daily data for thousands of stocks. In the data, I've got the daily data for all stocks, however they are for different lengths. Eg: for some stocks I have daily data from 2000 to 2022, and for others I have data from 2010 to 2022.
Many dates are also obviously repeated for all stocks.
While I was learning autogluon, I used the following function to format timeseries data so it can work with .fit():
def forward_fill_missing(ts_dataframe: TimeSeriesDataFrame, freq="D") -> TimeSeriesDataFrame:
    original_index = ts_dataframe.index.get_level_values("timestamp")
    start = original_index[0]
    end = original_index[-1]
    filled_index = pd.date_range(start=start, end=end, freq=freq, name="timestamp")
    return ts_dataframe.droplevel("item_id").reindex(filled_index, method="ffill")

ts_dataframe = ts_dataframe.groupby("item_id").apply(forward_fill_missing)
This worked; however, I was originally using it on data for a single item_id, and now I have thousands.
When I use this now, I get the following error: ValueError: cannot reindex from a duplicate axis
It's important to note that I have already forward-filled my data with pandas, and the ts_dataframe shouldn't have any missing dates or values, but when I try to use it with .fit() I get the following error:
ValueError: Frequency not provided and cannot be inferred. This is often due to the time index of the data being irregularly sampled. Please ensure that the data set used has a uniform time index, or create the TimeSeriesPredictor setting ignore_time_index=True.
I assume that this is because I have only filled in missing data and dates, but not taken into account the varying number of days available for every stock individually.
For reference, here's how I have formatted the data with pandas:
df = pd.read_csv(
    "/content/drive/MyDrive/stock_data/training_data.csv",
    parse_dates=["Date"],
)
df["Date"] = pd.to_datetime(df["Date"], errors="coerce", dayfirst=True)
df.fillna(method='ffill', inplace=True)
df = df.drop("Unnamed: 0", axis=1)
df[:11]
How can I format the data so I can use it with .fit()?
Thanks!
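One possible sketch (not a confirmed fix): the duplicate-axis error comes from reindexing while timestamps repeat across different item_ids, so reindex each item's dates within its own group and keep item_id as the outer index level. The toy data below is invented for illustration:

```python
import pandas as pd

def fill_group(group: pd.DataFrame) -> pd.DataFrame:
    # reindex this one item's dates to a uniform daily range
    idx = group.index.get_level_values("timestamp")
    full = pd.date_range(idx.min(), idx.max(), freq="D", name="timestamp")
    return group.droplevel("item_id").reindex(full, method="ffill")

# toy data: two stocks with different date ranges; stock A is missing Jan 3
ts = pd.DataFrame(
    {"close": [1.0, 2.0, 3.0, 10.0, 11.0]},
    index=pd.MultiIndex.from_tuples(
        [
            ("A", pd.Timestamp("2022-01-01")),
            ("A", pd.Timestamp("2022-01-02")),
            ("A", pd.Timestamp("2022-01-04")),
            ("B", pd.Timestamp("2022-01-01")),
            ("B", pd.Timestamp("2022-01-02")),
        ],
        names=["item_id", "timestamp"],
    ),
)

# group_keys=True restores item_id as the outer level after apply
filled = ts.groupby("item_id", group_keys=True).apply(fill_group)
print(filled)
```

Because each group is reindexed over its own min/max dates, stocks with different history lengths keep their own uniform daily index instead of being forced onto one shared range.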
DataFrame 1:
created year  rec_counts
2016          50
2015          40
DataFrame 2:
created year  rec_counts
2016          1000
2015          47
There are 2 methods you can try.
Let's assume the names of two DataFrames are df1 and df2.
First, if you just want to compare row counts, call df1.count() and df2.count() and check whether both return the same number of rows.
Second, df2.except(df1) returns the rows of df2 that are not present in df1. If the result is empty, the two DataFrames are identical.
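For reference, a pandas analogue of the except comparison (assuming pandas rather than Spark DataFrames) can be sketched with a left merge and the indicator flag:

```python
import pandas as pd

df1 = pd.DataFrame({"created_year": [2016, 2015], "rec_counts": [50, 40]})
df2 = pd.DataFrame({"created_year": [2016, 2015], "rec_counts": [1000, 47]})

# rows of df2 that do not appear in df1 (analogue of df2.except(df1));
# merging on all shared columns compares whole rows
merged = df2.merge(df1, how="left", indicator=True)
only_in_df2 = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(only_in_df2)
```

An empty `only_in_df2` would mean every row of df2 also exists in df1, mirroring the empty-result check described above.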
I am having a problem mapping groupby mean statistics to a dataframe column in order to produce a new column.
The raw data is as follows:
I set about creating a new data frame which would display the average sales for 2018 by 'Brand Origin'.
I then proceeded to convert the new data frame to a dictionary in order to complete the mapping process.
I attempted to map the data to the original data frame but I get NaN values.
What have I done wrong?
I think you need transform:
df['new'] = df.groupby('Brand Origin')['2018'].transform('mean')
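A minimal illustration of why transform avoids the dict-mapping step, using made-up values since the original raw data isn't shown:

```python
import pandas as pd

# invented sample data for illustration
df = pd.DataFrame({
    "Brand Origin": ["UK", "UK", "Japan", "Japan"],
    "2018": [100, 200, 50, 150],
})

# transform returns one value per row, aligned to the original index,
# so no groupby -> dict -> map round trip is needed
df["new"] = df.groupby("Brand Origin")["2018"].transform("mean")
print(df)
```

The NaN values from the map attempt typically come from keys that don't match exactly; transform sidesteps that entirely because it never leaves the original index.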
I am working on data analysis with Python. I want to replace all values < 120 in one column with average_steam (average_steam = 123).
To access the column in the data frame I write data.steam, which returns all the values.
the code I tried is:
average_steam=data.steamin.mean()
print(average_steam)
data.steamin.replace(data.steamin<=120,average_steam, inplace=True)
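replace() substitutes exact values rather than evaluating a condition, which is likely why the attempt above fails. A sketch using boolean-mask assignment instead, on invented sample values:

```python
import pandas as pd

# invented sample values; the real data isn't shown in the question
data = pd.DataFrame({"steamin": [100.0, 119.0, 120.0, 125.0, 130.0]})

average_steam = data["steamin"].mean()  # 118.8 for this sample

# select the rows matching the condition, then assign in place
data.loc[data["steamin"] < 120, "steamin"] = average_steam
print(data)
```

Note the mean is computed before the assignment, so the replacement values don't feed back into the average.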
Very new to Pandas. I'm importing stock data into a DataFrame and want to calculate a 10-day rolling average. That part I can figure out. The issue is that it gives 9 NaN values because of the 10-day moving-average window.
I want to re-align the data so the 10th value becomes the first row of a new rolling-average column at the top of the data frame. I tried moving the data with the following code:
small = pd.rolling_mean(df['Close'],10)
and then trying to add that to the df with the following code
df['MA10D2'] = small[9:]
but it still shows 9 NaN values at the top. Can anyone help me out?
Assignment is done based on index. small[9:] keeps its index starting at position 9, so the assigned values stay aligned from index 9 onward and the first nine rows remain NaN.
The function you are searching for is called shift:
df['MA10D2'] = small.shift(-9)
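A runnable sketch of the same idea with the modern rolling API (pd.rolling_mean was removed in later pandas versions), using made-up prices:

```python
import pandas as pd

# invented prices 1..20 standing in for the Close column
df = pd.DataFrame({"Close": range(1, 21)})

# modern spelling of pd.rolling_mean(df['Close'], 10)
small = df["Close"].rolling(10).mean()

# shift(-9) moves each 10-day average up 9 rows, so the first
# complete average lands on the first row of the frame
df["MA10D2"] = small.shift(-9)
print(df.head())
```

The NaN values don't disappear; they move to the bottom of the column, where the last nine rows have no complete 10-day window ahead of them.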