When fitting a statsmodel, I'm receiving a warning about the date frequency.
First, I import a dataset:
import pandas as pd
import statsmodels.api as sm
df = sm.datasets.get_rdataset(package='datasets', dataname='airquality').data
df['Year'] = 1973
df['Date'] = pd.to_datetime(df[['Year', 'Month', 'Day']])
df.drop(columns=['Year', 'Month', 'Day'], inplace=True)
df.set_index('Date', inplace=True, drop=True)
Next I try to fit a SES model:
fit = sm.tsa.SimpleExpSmoothing(df['Wind']).fit()
Which returns this warning:
/anaconda3/lib/python3.6/site-packages/statsmodels/tsa/base/tsa_model.py:171: ValueWarning: No frequency information was provided, so inferred frequency D will be used.
% freq, ValueWarning)
My dataset is daily, so the inferred 'D' is fine, but I was wondering how I can set the frequency manually.
Note that the DatetimeIndex doesn't have freq set (see the last line below):
DatetimeIndex(['1973-05-01', '1973-05-02', '1973-05-03', '1973-05-04',
'1973-05-05', '1973-05-06', '1973-05-07', '1973-05-08',
'1973-05-09', '1973-05-10',
...
'1973-09-21', '1973-09-22', '1973-09-23', '1973-09-24',
'1973-09-25', '1973-09-26', '1973-09-27', '1973-09-28',
'1973-09-29', '1973-09-30'],
dtype='datetime64[ns]', name='Date', length=153, freq=None)
As per this answer, I've checked for missing dates, but there don't appear to be any:
pd.date_range(start = '1973-05-01', end = '1973-09-30').difference(df.index)
DatetimeIndex([], dtype='datetime64[ns]', freq='D')
How should I set the frequency for the index?
pd.to_datetime does not set a default frequency on the resulting index; you need DataFrame.asfreq:
df = df.set_index('Date').asfreq('d')
print (df.index)
DatetimeIndex(['1973-05-01', '1973-05-02', '1973-05-03', '1973-05-04',
'1973-05-05', '1973-05-06', '1973-05-07', '1973-05-08',
'1973-05-09', '1973-05-10',
...
'1973-09-21', '1973-09-22', '1973-09-23', '1973-09-24',
'1973-09-25', '1973-09-26', '1973-09-27', '1973-09-28',
'1973-09-29', '1973-09-30'],
dtype='datetime64[ns]', name='Date', length=153, freq='D')
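Alternatively, if the index is already complete and regular, you can assign the frequency in place (a minimal sketch; pandas validates that 'D' is actually consistent with the index values):
df.index.freq = 'D'  # equivalent to df = df.asfreq('d') when there are no gaps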
But if there are duplicated values in the index, you get an error:
df = pd.concat([df, df])
df = df.set_index('Date')
print (df.asfreq('d').index)
ValueError: cannot reindex from a duplicate axis
The solution is to use resample with some aggregate function:
print (df.resample('2D').mean().index)
DatetimeIndex(['1973-05-01', '1973-05-03', '1973-05-05', '1973-05-07',
'1973-05-09', '1973-05-11', '1973-05-13', '1973-05-15',
'1973-05-17', '1973-05-19', '1973-05-21', '1973-05-23',
'1973-05-25', '1973-05-27', '1973-05-29', '1973-05-31',
'1973-06-02', '1973-06-04', '1973-06-06', '1973-06-08',
'1973-06-10', '1973-06-12', '1973-06-14', '1973-06-16',
'1973-06-18', '1973-06-20', '1973-06-22', '1973-06-24',
'1973-06-26', '1973-06-28', '1973-06-30', '1973-07-02',
'1973-07-04', '1973-07-06', '1973-07-08', '1973-07-10',
'1973-07-12', '1973-07-14', '1973-07-16', '1973-07-18',
'1973-07-20', '1973-07-22', '1973-07-24', '1973-07-26',
'1973-07-28', '1973-07-30', '1973-08-01', '1973-08-03',
'1973-08-05', '1973-08-07', '1973-08-09', '1973-08-11',
'1973-08-13', '1973-08-15', '1973-08-17', '1973-08-19',
'1973-08-21', '1973-08-23', '1973-08-25', '1973-08-27',
'1973-08-29', '1973-08-31', '1973-09-02', '1973-09-04',
'1973-09-06', '1973-09-08', '1973-09-10', '1973-09-12',
'1973-09-14', '1973-09-16', '1973-09-18', '1973-09-20',
'1973-09-22', '1973-09-24', '1973-09-26', '1973-09-28',
'1973-09-30'],
dtype='datetime64[ns]', name='Date', freq='2D')
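If the duplicated timestamps are not meaningful, another option (a sketch on my part, not from the original answer) is to drop them before calling asfreq:
df = df[~df.index.duplicated(keep='last')].asfreq('d')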
The problem is caused by the frequency not being set explicitly. In most cases you can't be sure that your data has no gaps, so generate a date range with
rng = pd.date_range(start = '1973-05-01', end = '1973-09-30', freq='D')
then reindex your DataFrame with this rng and fill the resulting np.nan values with the method or value of your choice, as sketched below.
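A minimal sketch, continuing from the rng above (forward-fill is just one possible gap-filling choice):
df = df.reindex(rng).ffill()  # or .fillna(some_value) / .interpolate()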
Related question:
Index(['Apr-20', 'Apr-21', 'Apr-22', 'Aug-20', 'Aug-21', 'Aug-22', 'Dec-20',
'Dec-21', 'Dec-22', 'Feb-21', 'Feb-22', 'Jan-21', 'Jan-22', 'Jan-23',
'Jul-20', 'Jul-21', 'Jul-22', 'Jun-20', 'Jun-21', 'Jun-22', 'Mar-20',
'Mar-21', 'Mar-22', 'May-20', 'May-21', 'May-22', 'Nov-20', 'Nov-21',
'Nov-22', 'Oct-20', 'Oct-21', 'Oct-22', 'Sep-20', 'Sep-21', 'Sep-22'],
dtype='object', name='months')
How can I sort this month-year object dtype index as datetimes in the 'MMM-YY' format in pandas? Thanks in advance!
If you only need to sort the index values as datetimes, use DataFrame.sort_index with the key parameter:
df = df.sort_index(key=lambda x: pd.to_datetime(x, format='%b-%y'))
If you need a DatetimeIndex and then to sort, use:
df.index = pd.to_datetime(df.index, format='%b-%y')
df = df.sort_index()
Another idea is to create a PeriodIndex:
df.index = pd.to_datetime(df.index, format='%b-%y').to_period('m')
df = df.sort_index()
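A quick demonstration of the first approach on a hypothetical frame (the index values and data here are invented):
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3]},
                  index=pd.Index(['Feb-21', 'Dec-20', 'Jan-23'], name='months'))
df = df.sort_index(key=lambda x: pd.to_datetime(x, format='%b-%y'))
print(df.index)
Index(['Dec-20', 'Feb-21', 'Jan-23'], dtype='object', name='months')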
How do I drop columns in raw_clin if the same columns already exist in raw_clinical_sample? Using isin raised a "cannot compute isin with a duplicate axis" error.
Explanation of the code:
I want to merge raw_clinical_patient and raw_clinical_sample dataframes. However, the SAMPLE_ID column in raw_clinical_sample should be relabeled as PATIENT_ID before the merge (because it was wrongly labelled). I want the new PATIENT_ID to be the index of raw_clin.
import pandas as pd
# Clinical patient info
raw_clinical_patient = pd.read_csv("./gbm_tcga/data_clinical_patient.txt", sep="\t", header=4)
raw_clinical_patient["PATIENT_ID"] = raw_clinical_patient["PATIENT_ID"].replace()
raw_clinical_patient.set_index("PATIENT_ID", inplace=True)
raw_clinical_patient = raw_clinical_patient.sort_index()
# Clinical sample info
raw_clinical_sample = pd.read_csv("./gbm_tcga/data_clinical_sample.txt", sep="\t", header=4)
raw_clinical_sample.set_index("PATIENT_ID", inplace=True)
raw_clinical_sample = raw_clinical_sample[raw_clinical_sample.index.isin(raw_clinical_patient.index)]
# Get the actual patient ID from the `raw_clinical_sample` dataframe
# Drop "PATIENT_ID" and rename "SAMPLE_ID" as "PATIENT_ID" and set as index
raw_clin = raw_clinical_patient.merge(raw_clinical_sample, on="PATIENT_ID", how="left").reset_index().drop(["PATIENT_ID"], axis=1)
raw_clin.rename(columns={'SAMPLE_ID':'PATIENT_ID'}, inplace=True)
raw_clin.set_index('PATIENT_ID', inplace=True)
Now, I want to drop all the columns that came from raw_clinical_sample, since the only columns needed from it were PATIENT_ID and SAMPLE_ID.
# Drop columns that exist in `raw_clinical_sample`
raw_clin = raw_clin[~raw_clin.isin(raw_clinical_sample)]
Traceback:
ValueError Traceback (most recent call last)
<ipython-input-60-45e2e83ddc00> in <module>()
18
19 # Drop columns that exist in `raw_clinical_sample`
---> 20 raw_clin = raw_clin[~raw_clin.isin(raw_clinical_sample)]
/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py in isin(self, values)
10514 elif isinstance(values, DataFrame):
10515 if not (values.columns.is_unique and values.index.is_unique):
> 10516 raise ValueError("cannot compute isin with a duplicate axis.")
10517 return self.eq(values.reindex_like(self))
10518 else:
ValueError: cannot compute isin with a duplicate axis.
There are many ways to do this.
For example, using isin:
new_df1 = df1.loc[:, ~df1.columns.isin(df2.columns)]
or with drop:
new_df1 = df1.drop(columns=df1.columns.intersection(df2.columns))
example input:
df1 = pd.DataFrame(columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(columns=['B', 'E'])
output:
pd.DataFrame(columns=['A', 'C', 'D'])
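Applied to the question's frames, this would be something like (assuming raw_clin and raw_clinical_sample are built as in the question):
raw_clin = raw_clin.drop(columns=raw_clin.columns.intersection(raw_clinical_sample.columns))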
You can use set operations for your application like this:
df1 = pd.DataFrame()
df1['string'] = ['Hello', 'Hi', 'Hola']
df1['number'] = [1, 2, 3]
df2 = pd.DataFrame()
df2['string'] = ['Hello', 'Hola']
df2['number'] = [1, 5]
ds1 = set(map(tuple, df1.values))
ds2 = set(map(tuple, df2.values))
df_out = pd.DataFrame(list(ds1.difference(ds2)))
df_out.columns = df1.columns
print(df_out)
Output:
string number
0 Hola 3
1 Hi 2
Inspired by: https://stackoverflow.com/a/18184990/7509907
Edit:
Sorry, I didn't notice you need to drop the columns. For that, you can use the following (using mozway's dummy example):
df1 = pd.DataFrame(columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(columns=['B', 'E'])
ds1 = set(df1.columns)
ds2 = set(df2.columns)
cols = ds1.difference(ds2)
df = df1[cols]
print(df)
Output:
Empty DataFrame
Columns: [C, A, D]
Index: []
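Note that Python sets are unordered, which is why the columns come out as [C, A, D]. If you want to keep df1's original column order, filter instead:
cols = [c for c in df1.columns if c not in ds2]
df = df1[cols]  # keeps A, C, D in their original order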
For every city, I want to create a new column which is the min-max scaling of another column (age).
I tried this and got Input contains infinity or a value too large for dtype('float64').
from sklearn import preprocessing

cols = ['age']

def f(x):
    scaler1 = preprocessing.MinMaxScaler()
    x[['age_minmax']] = scaler1.fit_transform(x[cols])
    return x

df = df.groupby(['city']).apply(f)
From the comments:
df['age'].replace([np.inf, -np.inf], np.nan, inplace=True)
Or
df['age'] = df['age'].replace([np.inf, -np.inf], np.nan)
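Putting it together, a minimal end-to-end sketch with invented data (one inf value, which MinMaxScaler rejects even though it ignores NaN):
import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({'city': ['a', 'a', 'b', 'b'],
                   'age': [10.0, np.inf, 30.0, 50.0]})
df['age'] = df['age'].replace([np.inf, -np.inf], np.nan)

def f(x):
    scaler1 = preprocessing.MinMaxScaler()
    x[['age_minmax']] = scaler1.fit_transform(x[['age']])
    return x

df = df.groupby(['city']).apply(f)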
I have a df with date and price.
Given a datetime, I would like to find the price at the nearest date.
This works for one input datetime:
import requests, xlrd, openpyxl, datetime
import pandas as pd
file = "E:/prices.csv" #two columns: Timestamp (UNIX epoch), Price (int)
df = pd.read_csv(file, index_col=None, names=["Timestamp", "Price"])
df['Timestamp'] = pd.to_datetime(df['Timestamp'],unit='s')
df = df.drop_duplicates(subset=['Timestamp'], keep='last')
df = df.set_index('Timestamp')
file = "E:/input.csv" #two columns: ID (string), Date (dd-mm-yyy hh:ss:mm)
dfinput = pd.read_csv(file, index_col=None, names=["ID", "Date"])
dfinput['Date'] = pd.to_datetime(dfinput['Date'], dayfirst=True)
exampledate = pd.to_datetime("20-3-2020 21:37", dayfirst=True)
exampleprice = df.iloc[df.index.get_loc(exampledate, method='nearest')]["Price"]
print(exampleprice) #price as output
I have another dataframe with the datetimes ("dfinput") I want to lookup prices of and save in a new column "Price".
Something like this which is obviously not working:
dfinput['Date'] = pd.to_datetime(dfinput['Date'], dayfirst=True)
dfinput['Price'] = df.iloc[df.index.get_loc(dfinput['Date'], method='nearest')]["Price"]
dfinput.to_csv('output.csv', index=False, columns=["ID", "Date", "Price"])
Can I do this for a whole column or do I need to iterate over all rows?
I think you need merge_asof (I cannot test, because there is no sample data). Both sides must be sorted on the join key, and to add a Price column to dfinput, pass it as the left frame:
df = df.sort_index()
dfinput = dfinput.sort_values('Date')
dfinput = pd.merge_asof(dfinput, df, left_on='Date', right_index=True, direction='nearest')
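Since there is no sample data, here is a hypothetical demonstration of the pattern (timestamps and prices invented):
import pandas as pd

df = pd.DataFrame({'Price': [100, 110, 105]},
                  index=pd.to_datetime(['2020-03-20 21:00', '2020-03-20 22:00', '2020-03-20 23:00']))
df.index.name = 'Timestamp'
dfinput = pd.DataFrame({'ID': ['a', 'b'],
                        'Date': pd.to_datetime(['2020-03-20 21:37', '2020-03-20 22:55'])})

out = pd.merge_asof(dfinput.sort_values('Date'), df.sort_index(),
                    left_on='Date', right_index=True, direction='nearest')
print(out)
  ID                Date  Price
0  a 2020-03-20 21:37:00    110
1  b 2020-03-20 22:55:00    105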
I have a dictionary of multiple dataframes where the index is set to 'Date', but I am having trouble capturing the specific day for a search.
Dictionary created as per link:
Call a report from a dictionary of dataframes
Then I tried to add the following column to create specific days for each row:
df_dict[k]['Day'] = pd.DatetimeIndex(df['Date']).day
It's not working. The idea is to extract only the day of the month (from 1 to 31) for each row. When I call the report, it should give me the day of the month of that occurrence.
I can provide more details if needed.
Regards and thanks!
In the case of your code, there is no 'Date' column, because it's set as the index.
df_dict = {f.stem: pd.read_csv(f, parse_dates=['Date'], index_col='Date') for f in files}
To extract the day from the index, use the following code:
df_dict[k]['Day'] = df.index.day
Pulling the code from this question:
# here you can see the Date column is set as the index
df_dict = {f.stem: pd.read_csv(f, parse_dates=['Date'], index_col='Date') for f in files}
data_dict = dict() # create an empty dict here
for k, df in df_dict.items():
    df_dict[k]['Return %'] = df.iloc[:, 0].pct_change(-1)*100
    # create a day column; this may not be needed
    df_dict[k]['Day'] = df.index.day
    # aggregate the max and min of Return %
    mm = df_dict[k]['Return %'].agg(['max', 'min'])
    # get the day of the month for the max and min return
    date_max = df.Day[df['Return %'] == mm.max()].values[0]
    date_min = df.Day[df['Return %'] == mm.min()].values[0]
    # add it to the dict, with ticker as the key
    data_dict[k] = {'max': mm.max(), 'min': mm.min(), 'max_day': date_max, 'min_day': date_min}
# print(data_dict)
[out]:
{'aapl': {'max': 8.702843218147871,
'max_day': 2,
'min': -4.900700398891522,
'min_day': 20},
'msft': {'max': 6.603769278967109,
'max_day': 2,
'min': -4.084428935702855,
'min_day': 8}}