Pandas and timeseries

I have a dictionary of dataframes. I want to convert each dataframe in it to its respective time series. I can convert a single dataframe nicely, but doing the same thing while iterating over the dictionary fails.
This works:
df = dfDict[4]
df['start_date'] = pd.to_datetime(df['start_date'])
df.set_index('start_date', inplace = True)
df.sort_index(inplace = True)
and print df.head() shows the expected result.
But this doesn't work:
tsDict = {}
for id, df in dfDict.iteritems():
    df['start_date'] = pd.to_datetime(df['start_date'])
    df.set_index('start_date', inplace = True)
    df.sort_index(inplace = True)
    tsDict[id] = df
It gives the following error message:
Traceback (most recent call last):
File "tsa.py", line 105, in <module>
main()
File "tsa.py", line 84, in main
df['start_date'] = pd.to_datetime(df['start_date'])
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 1997, in __getitem__
return self._getitem_column(key)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 2004, in _getitem_column
return self._get_item_cache(key)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 1350, in _get_item_cache
values = self._data.get(item)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 3290, in get
loc = self.items.get_loc(item)
File "/usr/local/lib/python2.7/dist-packages/pandas/indexes/base.py", line 1947, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas/index.c:4154)
File "pandas/index.pyx", line 159, in pandas.index.IndexEngine.get_loc (pandas/index.c:4018)
File "pandas/hashtable.pyx", line 675, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12368)
File "pandas/hashtable.pyx", line 683, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12322)
KeyError: 'start_date'
I am unable to see the subtle problem here...
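For what it's worth, here is a sketch of the same loop written so that it never modifies the frames stored in dfDict in place (so re-running it, or running it after the single-frame snippet above has already moved start_date into the index of dfDict[4], cannot trip over a missing column). This is an illustration, not necessarily the fix; dfDict and pd are assumed to be the dictionary and pandas import from the question.

# Sketch (Python 2, matching the traceback): build tsDict from copies so the
# frames inside dfDict are left untouched.
tsDict = {}
for key, df in dfDict.iteritems():
    ts = df.copy()
    ts['start_date'] = pd.to_datetime(ts['start_date'])
    ts = ts.set_index('start_date').sort_index()
    tsDict[key] = ts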

Related

MissingDataError: exog contains inf or nans

Input:
import statsmodels.api as sm
import pandas as pd
# reading data from the csv
data = pd.read_csv('/Users/justkiddings/Desktop/Python/TM/TM.csv')
# defining the variables
x = data['FSP'].tolist()
y = data['RSP'].tolist()
# adding the constant term
x = sm.add_constant(x)
# performing the regression
# and fitting the model
result = sm.OLS(y, x).fit()
# printing the summary table
print(result.summary())
Output:
runfile('/Users/justkiddings/Desktop/Python/Code/untitled28.py', wdir='/Users/justkiddings/Desktop/Python/Code')
Traceback (most recent call last):
File "/Users/justkiddings/opt/anaconda3/lib/python3.9/site-packages/spyder_kernels/py3compat.py", line 356, in compat_exec
exec(code, globals, locals)
File "/Users/justkiddings/Desktop/Python/Code/untitled28.py", line 24, in <module>
result = sm.OLS(y, x).fit()
File "/Users/justkiddings/opt/anaconda3/lib/python3.9/site-packages/statsmodels/regression/linear_model.py", line 890, in __init__
super(OLS, self).__init__(endog, exog, missing=missing,
File "/Users/justkiddings/opt/anaconda3/lib/python3.9/site-packages/statsmodels/regression/linear_model.py", line 717, in __init__
super(WLS, self).__init__(endog, exog, missing=missing,
File "/Users/justkiddings/opt/anaconda3/lib/python3.9/site-packages/statsmodels/regression/linear_model.py", line 191, in __init__
super(RegressionModel, self).__init__(endog, exog, **kwargs)
File "/Users/justkiddings/opt/anaconda3/lib/python3.9/site-packages/statsmodels/base/model.py", line 267, in __init__
super().__init__(endog, exog, **kwargs)
File "/Users/justkiddings/opt/anaconda3/lib/python3.9/site-packages/statsmodels/base/model.py", line 92, in __init__
self.data = self._handle_data(endog, exog, missing, hasconst,
File "/Users/justkiddings/opt/anaconda3/lib/python3.9/site-packages/statsmodels/base/model.py", line 132, in _handle_data
data = handle_data(endog, exog, missing, hasconst, **kwargs)
File "/Users/justkiddings/opt/anaconda3/lib/python3.9/site-packages/statsmodels/base/data.py", line 673, in handle_data
return klass(endog, exog=exog, missing=missing, hasconst=hasconst,
File "/Users/justkiddings/opt/anaconda3/lib/python3.9/site-packages/statsmodels/base/data.py", line 86, in __init__
self._handle_constant(hasconst)
File "/Users/justkiddings/opt/anaconda3/lib/python3.9/site-packages/statsmodels/base/data.py", line 132, in _handle_constant
raise MissingDataError('exog contains inf or nans')
MissingDataError: exog contains inf or nans
Some of the Data:
DATE,HOUR,STATION,CO,FSP,NO2,NOX,O3,RSP,SO2
1/1/2022,1,TUEN MUN,75,38,39,40,83,59,2
1/1/2022,2,TUEN MUN,72,35,29,30,90,61,2
1/1/2022,3,TUEN MUN,74,38,28,30,91,66,2
1/1/2022,4,TUEN MUN,76,39,31,32,79,61,2
1/1/2022,5,TUEN MUN,72,38,25,26,83,65,2
1/1/2022,6,TUEN MUN,74,37,24,25,86,60,2
I have removed the N.A. values in my dataset and they have been converted into blanks (e.g. 3/1/2022,12,TUEN MUN,85,,53,70,59,,5). Why is there a MissingDataError? How do I fix it? Thanks.
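For context (not the thread's answer): pandas reads those blank fields as NaN, and statsmodels refuses to fit when the design matrix contains NaN or inf. A minimal sketch of two common ways to deal with that, reusing the column names from the data sample above:

import pandas as pd
import statsmodels.api as sm

data = pd.read_csv('/Users/justkiddings/Desktop/Python/TM/TM.csv')

# Option 1: drop rows where FSP or RSP is blank (NaN) before fitting.
clean = data.dropna(subset=['FSP', 'RSP'])
x = sm.add_constant(clean['FSP'])
y = clean['RSP']
result = sm.OLS(y, x).fit()
print(result.summary())

# Option 2: let statsmodels drop incomplete rows itself.
# result = sm.OLS(data['RSP'], sm.add_constant(data['FSP']), missing='drop').fit()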

Trouble using pandas df.rolling() with my own functions

I have a pandas dataframe raw_data with two columns: 'T' and 'BP':
T BP
0 -0.500 115.790
1 -0.499 115.441
2 -0.498 115.441
3 -0.497 115.441
4 -0.496 115.790
... ... ...
647163 646.663 105.675
647164 646.664 105.327
647165 646.665 105.327
647166 646.666 105.327
647167 646.667 104.978
[647168 rows x 2 columns]
I want to apply the Hodges-Lehmann mean (it's a robust average) over a rolling window and create a new column. Here's the function:
def hodgesLehmannMean(x):
    m = np.add.outer(x, x)
    ind = np.tril_indices(len(x), 0)
    return 0.5 * np.median(m[ind])
I therefore write:
raw_data[new_col] = raw_data['BP'].rolling(21, min_periods=1, center=True,
                                           win_type=None, axis=0, closed=None).agg(hodgesLehmannMean)
but I get a string of error messages:
Traceback (most recent call last):
File "C:\Users\tkpme\miniconda3\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\tkpme\miniconda3\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "c:\Users\tkpme\.vscode\extensions\ms-python.python-2020.8.101144\pythonFiles\lib\python\debugpy\__main__.py", line 45, in <module>
cli.main()
File "c:\Users\tkpme\.vscode\extensions\ms-python.python-2020.8.101144\pythonFiles\lib\python\debugpy/..\debugpy\server\cli.py", line 430, in main
run()
File "c:\Users\tkpme\.vscode\extensions\ms-python.python-2020.8.101144\pythonFiles\lib\python\debugpy/..\debugpy\server\cli.py", line 267, in run_file
runpy.run_path(options.target, run_name=compat.force_str("__main__"))
File "C:\Users\tkpme\miniconda3\lib\runpy.py", line 265, in run_path
return _run_module_code(code, init_globals, run_name,
File "C:\Users\tkpme\miniconda3\lib\runpy.py", line 97, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "C:\Users\tkpme\miniconda3\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "c:\Users\tkpme\OneDrive\Documents\Work\CMC\BP Satya and Suresh\Code\Naveen_peak_detect test.py", line 227, in <module>
main()
File "c:\Users\tkpme\OneDrive\Documents\Work\CMC\BP Satya and Suresh\Code\Naveen_peak_detect test.py", line 75, in main
raw_data[new_col] = raw_data['BP'].rolling(FILTER_WINDOW, min_periods=1, center=True, win_type=None,
File "C:\Users\tkpme\miniconda3\lib\site-packages\pandas\core\window\rolling.py", line 1961, in aggregate
return super().aggregate(func, *args, **kwargs)
File "C:\Users\tkpme\miniconda3\lib\site-packages\pandas\core\window\rolling.py", line 523, in aggregate
return self.apply(func, raw=False, args=args, kwargs=kwargs)
File "C:\Users\tkpme\miniconda3\lib\site-packages\pandas\core\window\rolling.py", line 1987, in apply
return super().apply(
File "C:\Users\tkpme\miniconda3\lib\site-packages\pandas\core\window\rolling.py", line 1300, in apply
return self._apply(
File "C:\Users\tkpme\miniconda3\lib\site-packages\pandas\core\window\rolling.py", line 507, in _apply
result = calc(values)
File "C:\Users\tkpme\miniconda3\lib\site-packages\pandas\core\window\rolling.py", line 495, in calc
return func(x, start, end, min_periods)
File "C:\Users\tkpme\miniconda3\lib\site-packages\pandas\core\window\rolling.py", line 1326, in apply_func
return window_func(values, begin, end, min_periods)
File "pandas\_libs\window\aggregations.pyx", line 1375, in pandas._libs.window.aggregations.roll_generic_fixed
File "c:\Users\tkpme\OneDrive\Documents\Work\CMC\BP Satya and Suresh\Code\Naveen_peak_detect test.py", line 222, in hodgesLehmannMean
m = np.add.outer(x, x)
File "C:\Users\tkpme\miniconda3\lib\site-packages\pandas\core\series.py", line 705, in __array_ufunc__
return construct_return(result)
File "C:\Users\tkpme\miniconda3\lib\site-packages\pandas\core\series.py", line 694, in construct_return
raise NotImplementedError
NotImplementedError
which appears to be driven by the line
m = np.add.outer(x, x)
and points to something not being implemented, or to numpy being missing. But I import numpy right at the beginning as follows:
import numpy as np
import pandas as pd
The function works perfectly well on its own if I feed it a list or a numpy array, so I'm not sure what the problem is. Interestingly, if I use the median instead of the Hodges-Lehmann mean, it runs like a charm:
raw_data[new_col] = raw_data['BP'].rolling(21, min_periods=1, center=True,
                                           win_type=None, axis=0, closed=None).median()
What is the cause of my problem, and how do I fix it?
Sincerely
Thomas Philips
I've tried your code with a small dataframe and it worked well, so maybe there is something in your dataframe that needs to be cleaned or transformed.
Solved it. It turns out that
m = np.add.outer(x, x)
requires x to be array-like. When I tested it using lists, numpy arrays, etc. it worked perfectly, just as it did for you. But the .rolling aggregation hands each window to the function as a pandas Series rather than a plain array, and np.add.outer is not implemented for a Series, so the function fails with a confusing error message. I modified the function to create a numpy array from the input and it now works as it should.
def hodgesLehmannMean(x):
    x_array = np.array(x)
    m = np.add.outer(x_array, x_array)
    ind = np.tril_indices(len(x_array), 0)
    return 0.5 * np.median(m[ind])
Thanks for looking at it!
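A related sketch, not from the thread: rolling(...).apply(..., raw=True) hands each window to the function as a plain numpy array, so the original hodgesLehmannMean works without the explicit np.array conversion (the new column name here is just an example):

# Sketch: raw=True makes pandas pass ndarrays, not Series, to the function.
raw_data['BP_hl'] = (
    raw_data['BP']
    .rolling(21, min_periods=1, center=True)
    .apply(hodgesLehmannMean, raw=True)
)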

groupby.rolling.count() yields `non-unique multi-index` exception

I'm using train_sample.csv here.
Consider the following:
import pandas as pd
df = pd.read_csv('train_sample.csv')
df = df.drop(['attributed_time'], axis=1)
df['click_time'] = pd.to_datetime(df['click_time'])
df = df.set_index('click_time')
df = df.sort_index()
df['clicks_last_hour'] = df.groupby(['ip']).rolling('1H').count()
Here I'm trying to create a new column that counts the number of times a certain ip clicked in the last hour.
I'm getting:
Traceback (most recent call last):
File "train_sample.py", line 11, in <module>
df['clicks_last_hour'] = df.groupby(['ip']).rolling('1H').count()
File "C:\Users\galah\Miniconda3\envs\venv\lib\site-packages\pandas\core\frame.py", line 3119, in __setitem__
self._set_item(key, value)
File "C:\Users\galah\Miniconda3\envs\venv\lib\site-packages\pandas\core\frame.py", line 3194, in _set_item
value = self._sanitize_column(key, value)
File "C:\Users\galah\Miniconda3\envs\venv\lib\site-packages\pandas\core\frame.py", line 3378, in _sanitize_column
value = reindexer(value).T
File "C:\Users\galah\Miniconda3\envs\venv\lib\site-packages\pandas\core\frame.py", line 3358, in reindexer
raise e
File "C:\Users\galah\Miniconda3\envs\venv\lib\site-packages\pandas\core\frame.py", line 3353, in reindexer
value = value.reindex(self.index)._values
File "C:\Users\galah\Miniconda3\envs\venv\lib\site-packages\pandas\util\_decorators.py", line 187, in wrapper
return func(*args, **kwargs)
File "C:\Users\galah\Miniconda3\envs\venv\lib\site-packages\pandas\core\frame.py", line 3566, in reindex
return super(DataFrame, self).reindex(**kwargs)
File "C:\Users\galah\Miniconda3\envs\venv\lib\site-packages\pandas\core\generic.py", line 3689, in reindex
fill_value, copy).__finalize__(self)
File "C:\Users\galah\Miniconda3\envs\venv\lib\site-packages\pandas\core\frame.py", line 3501, in _reindex_axes
fill_value, limit, tolerance)
File "C:\Users\galah\Miniconda3\envs\venv\lib\site-packages\pandas\core\frame.py", line 3509, in _reindex_index
tolerance=tolerance)
File "C:\Users\galah\Miniconda3\envs\venv\lib\site-packages\pandas\core\indexes\multi.py", line 2068, in reindex
raise Exception("cannot handle a non-unique multi-index!")
Exception: cannot handle a non-unique multi-index!
Though from what I checked, there are no duplicate rows with the same ip and click_time.
What am I doing wrong?
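Not an answer from the thread, but a tiny self-contained sketch (made-up data and column names) of what groupby(...).rolling(...) returns, which shows where the MultiIndex that pandas complains about comes from:

import pandas as pd

# Toy frame indexed by click_time, with one 'ip' grouping column (made-up data).
demo = pd.DataFrame(
    {'ip': [1, 1, 2], 'clicks': [1, 1, 1]},
    index=pd.to_datetime(['2022-01-01 00:10',
                          '2022-01-01 00:40',
                          '2022-01-01 00:20']),
)
demo.index.name = 'click_time'

out = demo.groupby('ip').rolling('1H').count()
print(out)
# The result is indexed by (ip, click_time), a MultiIndex, so assigning it
# straight back as a column forces pandas to reindex that MultiIndex
# against demo's plain click_time index.

# One common workaround, assuming the click_time values are unique:
demo['clicks_last_hour'] = out['clicks'].reset_index(level=0, drop=True)
print(demo)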

UnicodeDecodeError while concatenating

I am trying to train my model. I have a CSV file and one .gz file, which was generated earlier. I am getting the error below and am not sure what is wrong.
Traceback (most recent call last):
File "Model.py", line 87, in <module>
data = pd.concat([pd.read_csv(log)])
File "/usr/local/lib/python3.6/site-packages/pandas/io/parsers.py", line 678, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python3.6/site-packages/pandas/io/parsers.py", line 440, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/local/lib/python3.6/site-packages/pandas/io/parsers.py", line 787, in __init__
self._make_engine(self.engine)
File "/usr/local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1014, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/usr/local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1708, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 539, in pandas._libs.parsers.TextReader.__cinit__
File "pandas/_libs/parsers.pyx", line 767, in pandas._libs.parsers.TextReader._get_header
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
My code:
for foo in range(0,1):
    # Read dataframe
    #data = pd.concat([pd.read_csv(log.replace('0',str(idx),1)) for idx in range(5)])
    log = path + 'train_features/log_.csv'
    test_log = path + 'test_features/log_features.gz'
    data = pd.concat([pd.read_csv(log)])
Try:
data = pd.read_csv(log, encoding = "utf-8")
Although I don't understand why you need the for loop or pd.concat.
If you don't know your file's encoding, try this:
import chardet
with open(log, 'rb') as f:
    result = chardet.detect(f.read()) # or readline if the file is large
data = pd.read_csv(log, encoding=result['encoding'])
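One more observation grounded in the traceback and the code above: byte 0x8b at position 1 is the second byte of the gzip magic number (1f 8b), so it is worth checking whether the file being read is actually gzip-compressed. pandas can read gzip-compressed CSVs directly; a minimal sketch (path variables mirror the question, the placeholder value is an assumption):

import pandas as pd

path = './'  # placeholder for the question's path variable
test_log = path + 'test_features/log_features.gz'

# compression='infer' (the default) handles the .gz extension automatically;
# compression='gzip' forces gzip decoding regardless of the file name.
test_data = pd.read_csv(test_log, compression='gzip')

# If train_features/log_.csv is itself gzipped despite the .csv extension:
# data = pd.read_csv(path + 'train_features/log_.csv', compression='gzip')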

pandas.read_csv gives FileNotFound error inside a loop

pandas.read_csv works properly when used as a single statement, but it gives a FileNotFoundError when used inside a loop, even though the file exists.
for filename in os.listdir("./Datasets/pollution"):
    print(filename) # To check which file is under processing
    df = pd.read_csv(filename, sep=",").head(1)
The above lines give the following error.
pollutionData184866.csv <----- The name of the file is printed properly.
Traceback (most recent call last):
File "/home/parnab/PycharmProjects/FinalYearProject/locationExtractor.py", line 13, in <module>
df = pd.read_csv(i, sep=",").head(1)
File "/usr/lib/python3.6/site-packages/pandas/io/parsers.py", line 646, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/lib/python3.6/site-packages/pandas/io/parsers.py", line 389, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/lib/python3.6/site-packages/pandas/io/parsers.py", line 730, in __init__
self._make_engine(self.engine)
File "/usr/lib/python3.6/site-packages/pandas/io/parsers.py", line 923, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/usr/lib/python3.6/site-packages/pandas/io/parsers.py", line 1390, in __init__
self._reader = _parser.TextReader(src, **kwds)
File "pandas/parser.pyx", line 373, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:4184)
File "pandas/parser.pyx", line 667, in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:8449)
FileNotFoundError: File b'pollutionData184866.csv' does not exist
But when I do
filename = 'pollutionData184866.csv'
df = pd.read_csv(filename, sep=',')
it works fine.
What am I doing wrong?
os.listdir("./Datasets/pollution") returns a list of file names without a path, and judging by the path "./Datasets/pollution" you are parsing CSV files that are NOT in the current directory ".". Changing it to glob.glob('./Datasets/pollution/*.csv') should work, because glob.glob() returns a list of matching files/directories including the given path (see the sketch after the demo below).
Demo:
In [19]: os.listdir('d:/temp/.data/629509')
Out[19]:
['AAON_data.csv',
'AAON_data.png',
'AAPL_data.csv',
'AAPL_data.png',
'AAP_data.csv',
'AAP_data.png']
In [20]: glob.glob('d:/temp/.data/629509/*.csv')
Out[20]:
['d:/temp/.data/629509\\AAON_data.csv',
'd:/temp/.data/629509\\AAPL_data.csv',
'd:/temp/.data/629509\\AAP_data.csv']
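A minimal sketch of both fixes inside the loop, using the directory from the question:

import glob
import os

import pandas as pd

folder = "./Datasets/pollution"

# Option 1: keep os.listdir() and join the directory back onto each name.
for filename in os.listdir(folder):
    if not filename.endswith(".csv"):
        continue
    df = pd.read_csv(os.path.join(folder, filename), sep=",").head(1)
    print(filename, df.shape)

# Option 2: glob.glob() already returns paths that include the directory.
for filepath in glob.glob(os.path.join(folder, "*.csv")):
    df = pd.read_csv(filepath, sep=",").head(1)
    print(filepath, df.shape)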