How do I filter by date and count these dates using relational algebra (Reframe)? - pandas

I'm really stuck. I read the Reframe documentation (https://reframe.readthedocs.io/en/latest/), which is based on pandas, and I tried several things on my own, but it still doesn't work. I have a CSV file called weddings that looks like this:
Weddings, Date
Wedding1,20181107
Wedding2,20181107
And many more rows. As you can see, there are duplicates in the date column, but I don't think this matters. I want to count the number of weddings filtered by date, for example the number of weddings after 5 October 2016 (20161005). So first I tried this:
Weddings = Relation('weddings.csv')
Weddings.sort(['Date']).query('Date > 20161005').project(['Weddings', 'Date'])
This seems logical to me, but I get KeyError: 'Date' and I don't know why. So I tried something simpler:
Weddings = Relation('weddings.csv')
Weddings.groupby(['Date']).count()
And this doesn't work either; I still get KeyError: 'Date' and I don't know why. Can someone help me?
Traceback:
KeyError Traceback (most recent call last)
<ipython-input-44-b358cdf55fdb> in <module>()
1
2 Weddings = Relation('weddings.csv')
----> 3 weddings.sort(['Date']).query('Date > 20161005').project(['Weddings', 'Date'])
4
5
~\Documents\Reframe.py in sort(self, *args, **kwargs)
110 """
111
--> 112 return Relation(super().sort_values(*args, **kwargs))
113
114 def intersect(self, other):
~\Anaconda3\lib\site-packages\pandas\core\frame.py in sort_values(self, by, axis, ascending, inplace, kind, na_position)
4416 by = by[0]
4417 k = self._get_label_or_level_values(by, axis=axis,
-> 4418 stacklevel=stacklevel)
4419
4420 if isinstance(ascending, (tuple, list)):
~\Anaconda3\lib\site-packages\pandas\core\generic.py in _get_label_or_level_values(self, key, axis, stacklevel)
1377 values = self.axes[axis].get_level_values(key)._values
1378 else:
-> 1379 raise KeyError(key)
1380
1381 # Check for duplicates
KeyError: 'Date'
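One possible cause, assuming the file's header really is `Weddings, Date` with a space after the comma: pandas would then name the second column `' Date'` (with a leading space), so `'Date'` is not found. The sketch below uses plain pandas rather than Reframe's `Relation`, and the third row is made-up sample data; whether `Relation` needs the same treatment is an assumption.

```python
import io
import pandas as pd

# Sample data mirroring the question; the Wedding3 row is hypothetical.
csv_text = (
    "Weddings, Date\n"
    "Wedding1,20181107\n"
    "Wedding2,20181107\n"
    "Wedding3,20160101\n"
)

raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))  # the second name keeps its leading space

# Strip surrounding whitespace from every column name.
weddings = raw.rename(columns=str.strip)

# Count weddings after 5 October 2016 (dates are stored as integers here).
n_after = len(weddings.query("Date > 20161005"))

# Number of weddings per date.
per_date = weddings.groupby("Date").size()
```

Printing the column list of the real file is a quick way to check whether this is what is happening.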

Related

Iterating Rows in DataFrame and Applying difflib.ratio()

Context of Problem
I am working on a project where I would like to compare two columns from a dataframe to determine what percent of the strings are similar to each other. Specifically, I'm comparing whether bullets scraped from retailer websites match the bullets that I expect to see on those sites for a given product.
I know that I can simply use boolean logic to determine if the value from column ['X'] == column ['Y']. But I'd like to take it to another level and determine what percentage of X matches Y. I did some research and found that the ratio() method of difflib.SequenceMatcher can accomplish what I want.
Example of difflib.ratio()
from difflib import SequenceMatcher

a = 'preview'
b = 'previeu'
SequenceMatcher(a=a, b=b).ratio()
My Use Case
Where I'm having trouble is applying this logic to iterate through a DataFrame. This is what my DataFrame looks like.
DataFrame
The DataFrame has 5 "Bullets" and 5 "SEO Bullets". So I tried using a for loop to apply a lambda function to my DataFrame called test.
for x in range(1,6):
    test[f'Bullet {x} Ratio'] = test.apply(lambda row: SequenceMatcher(a=row[f'SeoBullet_{x}'], b=row[f'Bullet {x}']).ratio())
But I received the following error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-409-39a6ba3c8879> in <module>
1 for x in range(1,6):
----> 2 test[f'Bullet {x} Ratio'] = test.apply(lambda row: SequenceMatcher(a=row[f'SeoBullet_{x}'], b=row[f'Bullet {x}']).ratio())
~\AppData\Local\Programs\PythonCodingPack\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
7539 kwds=kwds,
7540 )
-> 7541 return op.get_result()
7542
7543 def applymap(self, func) -> "DataFrame":
~\AppData\Local\Programs\PythonCodingPack\lib\site-packages\pandas\core\apply.py in get_result(self)
178 return self.apply_raw()
179
--> 180 return self.apply_standard()
181
182 def apply_empty_result(self):
~\AppData\Local\Programs\PythonCodingPack\lib\site-packages\pandas\core\apply.py in apply_standard(self)
253
254 def apply_standard(self):
--> 255 results, res_index = self.apply_series_generator()
256
257 # wrap results
~\AppData\Local\Programs\PythonCodingPack\lib\site-packages\pandas\core\apply.py in apply_series_generator(self)
282 for i, v in enumerate(series_gen):
283 # ignore SettingWithCopy here in case the user mutates
--> 284 results[i] = self.f(v)
285 if isinstance(results[i], ABCSeries):
286 # If we have a view on v, we need to make a copy because
<ipython-input-409-39a6ba3c8879> in <lambda>(row)
1 for x in range(1,6):
----> 2 test[f'Bullet {x} Ratio'] = test.apply(lambda row: SequenceMatcher(a=row[f'SeoBullet_{x}'], b=row[f'Bullet {x}']).ratio())
~\AppData\Local\Programs\PythonCodingPack\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
880
881 elif key_is_scalar:
--> 882 return self._get_value(key)
883
884 if (
~\AppData\Local\Programs\PythonCodingPack\lib\site-packages\pandas\core\series.py in _get_value(self, label, takeable)
989
990 # Similar to Index.get_value, but we do not fall back to positional
--> 991 loc = self.index.get_loc(label)
992 return self.index._get_values_for_loc(self, loc, label)
993
~\AppData\Local\Programs\PythonCodingPack\lib\site-packages\pandas\core\indexes\range.py in get_loc(self, key, method, tolerance)
352 except ValueError as err:
353 raise KeyError(key) from err
--> 354 raise KeyError(key)
355 return super().get_loc(key, method=method, tolerance=tolerance)
356
KeyError: 'SeoBullet_1'
Desired Output
Ideally, the final output would be a dataframe that has 5 additional columns with the ratios for each Bullet comparison.
I'm still new-ish to Python, so I could just be naïve and missing something very obvious. I say this also to say that if there is another route I could take to accomplish the same thing (or something very similar), I am open to those suggestions.
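A likely explanation, judging from the traceback: `DataFrame.apply` defaults to `axis=0`, so the lambda receives whole columns (indexed by row number) rather than rows, and looking up `'SeoBullet_1'` in a column fails. A minimal sketch with `axis=1`, assuming the columns are named as in the question; the sample bullets are made up and only two bullet pairs are shown:

```python
from difflib import SequenceMatcher
import pandas as pd

# Hypothetical sample data using the column naming from the question.
test = pd.DataFrame({
    "SeoBullet_1": ["preview", "soft cotton"],
    "Bullet 1": ["previeu", "soft cotton"],
    "SeoBullet_2": ["machine washable", "fits true to size"],
    "Bullet 2": ["machine wash", "runs small"],
})

for x in range(1, 3):
    # axis=1 passes each row to the lambda, so row[...] looks up a column label.
    test[f"Bullet {x} Ratio"] = test.apply(
        lambda row: SequenceMatcher(
            a=row[f"SeoBullet_{x}"], b=row[f"Bullet {x}"]
        ).ratio(),
        axis=1,
    )
```

With five bullet pairs the loop would run over `range(1, 6)` as in the original code.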

Converting financial information to pandas data frame

I am trying to get stock data such as balance_sheet, income_statement and cash_flow for multiple stocks and convert it to a data frame for manipulation.
Here is the part of the code that gets the data:
import yahoo_fin.stock_info as yfs
tickers = ['AMZN','AAPL','MSFT','DIS','GOOG']
balance_sheet=[]
income_statement=[]
cash_flow=[]
balance_sheet.append({ticker : yfs.get_balance_sheet(ticker) for ticker in tickers})
income_statement.append({ticker : yfs.get_income_statement(ticker) for ticker in tickers })
cash_flow.append({ticker : yfs.get_cash_flow(ticker) for ticker in tickers})
This part works well and returns a dictionary for each category. I then do this:
my_dict=cash_flow+balance_sheet+income_statement
dff=pd.DataFrame.from_dict(my_dict, orient='columns', dtype=None, columns=None)
Note that when I try orient='index' I get the following error message :
AttributeError Traceback (most recent call last)
in
1 my_dict=cash_flow+balance_sheet+income_statement
----> 2 dff=pd.DataFrame.from_dict(my_dict, orient='index', dtype=None, columns=None)
3 # dff=dff.set_index('endDate')
4 dff
5 # cash_flow
/opt/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py in from_dict(cls, data, orient, dtype, columns)
1361 if len(data) > 0:
1362 # TODO speed up Series case
-> 1363 if isinstance(list(data.values())[0], (Series, dict)):
1364 data = _from_nested_dict(data)
1365 else:
AttributeError: 'list' object has no attribute 'values'
If someone could let me know what I'm doing wrong, that would be much appreciated! :)
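The error message itself points at the cause: each of the three variables is a list holding one dictionary, so `my_dict` is a list, while `from_dict` expects a dictionary. One way around it, sketched below, is to keep plain dictionaries and stack them with `pd.concat`. This assumes each `yfs` call returns a DataFrame; `fake_statement` is a hypothetical stand-in so the example runs without network access, and the item names and numbers are invented.

```python
import pandas as pd

def fake_statement(values):
    """Hypothetical stand-in for a yfs.get_* call returning a DataFrame."""
    return pd.DataFrame(
        {"2020-12-31": values, "2019-12-31": [v - 1 for v in values]},
        index=["itemA", "itemB"],
    )

tickers = ["AMZN", "AAPL"]

# Plain dicts keyed by ticker (no surrounding list).
balance_sheet = {t: fake_statement([10, 20]) for t in tickers}
cash_flow = {t: fake_statement([1, 2]) for t in tickers}

statements = {"balance_sheet": balance_sheet, "cash_flow": cash_flow}

# Inner concat stacks tickers; outer concat stacks statement types.
combined = pd.concat(
    {name: pd.concat(per_ticker) for name, per_ticker in statements.items()}
)
combined.index.names = ["statement", "ticker", "item"]

# Select one statement for one ticker.
aapl_cash_flow = combined.loc[("cash_flow", "AAPL")]
```

The result has one row per (statement, ticker, item) combination, which can then be filtered with `.loc` or reshaped as needed.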

How to make a dataframe from another dataframe, using the shape[] of the old dataframe

I have a dataframe with shape (18, 1); each cell of this dataframe has shape (27, 2000, 10). I want to make a new dataframe with shape (27, 18*10) in which each cell consists of 2000 values.
I tried to do it like this, but I got an error.
[This is what my dataframe looks like][1]
with shape (18,1).
I tried this code to make the new dataframe:
for i in range(18):
    a= pd.DataFrame(((HC[i,0]).shape[2]) ,(18* HC[i,0].shape[0]))
and I get this error:
KeyError Traceback (most recent call last)
<ipython-input-20-3138007415b4> in <module>()
19 HC
20 for i in range(18):
---> 21 a= pd.DataFrame(((HC[i,0]).shape[2]) ,(18* HC[i,0].shape[0]))
22
23 # print(a)
1 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
356 except ValueError as err:
357 raise KeyError(key) from err
--> 358 raise KeyError(key)
359 return super().get_loc(key, method=method, tolerance=tolerance)
360
KeyError: (0, 0)
Any idea how I can make the new dataframe?
[1]: https://i.stack.imgur.com/6cKqO.png
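The `KeyError: (0, 0)` arises because `HC[i, 0]` asks pandas for a column labelled `(i, 0)`; positional access needs `HC.iloc[i, 0]`. Below is a rough sketch of one way to build the target layout, assuming each cell holds a NumPy array. The sizes are shrunk for illustration (the question's sizes would be 18, 27, 2000 and 10), and the data and column names are made up.

```python
import numpy as np
import pandas as pd

# Toy sizes standing in for 18 cells of shape (27, 2000, 10).
n_cells, n_rows, n_samples, n_cols = 3, 4, 5, 2

rng = np.random.default_rng(0)

# Build an (n_cells, 1) dataframe whose cells are 3-D arrays.
cells = np.empty(n_cells, dtype=object)
for i in range(n_cells):
    cells[i] = rng.random((n_rows, n_samples, n_cols))
HC = pd.DataFrame({0: cells})

columns = {}
for i in range(n_cells):
    arr = HC.iloc[i, 0]  # positional access instead of HC[i, 0]
    for c in range(n_cols):
        col = np.empty(n_rows, dtype=object)
        for r in range(n_rows):
            col[r] = arr[r, :, c]  # the n_samples values for this row/column
        columns[f"cell{i}_ch{c}"] = col

# Shape is (n_rows, n_cells * n_cols); each cell holds n_samples values.
new_df = pd.DataFrame(columns)
```

Storing arrays inside dataframe cells works but is awkward to compute with; depending on the goal, a single 3-D NumPy array may be easier to handle.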

Facebook-Prophet: Overflow error when fitting

I wanted to practice with prophet so I decided to download the "Yearly mean total sunspot number [1700 - now]" data from this place
http://www.sidc.be/silso/datafiles#total.
This is my code so far
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from fbprophet import Prophet
from fbprophet.plot import plot_plotly
import plotly.offline as py
import datetime
py.init_notebook_mode()
plt.style.use('classic')
df = pd.read_csv('SN_y_tot_V2.0.csv',delimiter=';', names = ['ds', 'y','C3', 'C4', 'C5'])
df = df.drop(columns=['C3', 'C4', 'C5'])
df.plot(x="ds", style='-',figsize=(10,5))
plt.xlabel('year',fontsize=15);plt.ylabel('mean number of sunspots',fontsize=15)
plt.xticks(np.arange(1701.5, 2018.5,40))
plt.ylim(-2,300);plt.xlim(1700,2020)
plt.legend()
df['ds'] = pd.to_datetime(df.ds, format='%Y')
m = Prophet(yearly_seasonality=True)
Everything looks good so far and df['ds'] is in date time format.
However when I execute
m.fit(df)
I get the following error
---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
<ipython-input-57-a8e399fdfab2> in <module>()
----> 1 m.fit(df)
/anaconda2/envs/mde/lib/python3.7/site-packages/fbprophet/forecaster.py in fit(self, df, **kwargs)
1055 self.history_dates = pd.to_datetime(df['ds']).sort_values()
1056
-> 1057 history = self.setup_dataframe(history, initialize_scales=True)
1058 self.history = history
1059 self.set_auto_seasonalities()
/anaconda2/envs/mde/lib/python3.7/site-packages/fbprophet/forecaster.py in setup_dataframe(self, df, initialize_scales)
286 df['cap_scaled'] = (df['cap'] - df['floor']) / self.y_scale
287
--> 288 df['t'] = (df['ds'] - self.start) / self.t_scale
289 if 'y' in df:
290 df['y_scaled'] = (df['y'] - df['floor']) / self.y_scale
/anaconda2/envs/mde/lib/python3.7/site-packages/pandas/core/ops/__init__.py in wrapper(left, right)
990 # test_dt64_series_add_intlike, which the index dispatching handles
991 # specifically.
--> 992 result = dispatch_to_index_op(op, left, right, pd.DatetimeIndex)
993 return construct_result(
994 left, result, index=left.index, name=res_name, dtype=result.dtype
/anaconda2/envs/mde/lib/python3.7/site-packages/pandas/core/ops/__init__.py in dispatch_to_index_op(op, left, right, index_class)
628 left_idx = left_idx._shallow_copy(freq=None)
629 try:
--> 630 result = op(left_idx, right)
631 except NullFrequencyError:
632 # DatetimeIndex and TimedeltaIndex with freq == None raise ValueError
/anaconda2/envs/mde/lib/python3.7/site-packages/pandas/core/indexes/datetimelike.py in __sub__(self, other)
521 def __sub__(self, other):
522 # dispatch to ExtensionArray implementation
--> 523 result = self._data.__sub__(maybe_unwrap_index(other))
524 return wrap_arithmetic_op(self, other, result)
525
/anaconda2/envs/mde/lib/python3.7/site-packages/pandas/core/arrays/datetimelike.py in __sub__(self, other)
1278 result = self._add_offset(-other)
1279 elif isinstance(other, (datetime, np.datetime64)):
-> 1280 result = self._sub_datetimelike_scalar(other)
1281 elif lib.is_integer(other):
1282 # This check must come after the check for np.timedelta64
/anaconda2/envs/mde/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py in _sub_datetimelike_scalar(self, other)
856
857 i8 = self.asi8
--> 858 result = checked_add_with_arr(i8, -other.value, arr_mask=self._isnan)
859 result = self._maybe_mask_results(result)
860 return result.view("timedelta64[ns]")
/anaconda2/envs/mde/lib/python3.7/site-packages/pandas/core/algorithms.py in checked_add_with_arr(arr, b, arr_mask, b_mask)
1006
1007 if to_raise:
-> 1008 raise OverflowError("Overflow in int64 addition")
1009 return arr + b
1010
OverflowError: Overflow in int64 addition
I understand that there's an issue with 'ds', but I am not sure whether there is something wrong with the column's format or whether this is an open issue.
Does anyone have any idea how to fix this? I have checked some issues in github, but they haven't been of much help in this case.
Thanks
This does not fix the underlying issue, but it shows how to avoid the error.
I got the same error and managed to get rid of it by reducing the amount of incoming data or by shortening the forecast horizon.
For example, I limited my training data to start in 1825, even though my data goes back to the 1700s. I also tried shortening the forecast from 10 years to 1 year. Both changes got rid of the error.
My guess is that the problem has to do with how Prophet handles the dates internally: in some cases the numbers become too large for int64 and overflow.
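One reading of the traceback that fits these observations: the failure happens while subtracting dates (`df['ds'] - self.start`), and pandas stores such differences as nanosecond counts in an int64, which tops out at roughly 292 years. The small check below compares that limit with the span of the sunspot data. That this is the exact cause in the fbprophet version shown is an inference, not something confirmed by the source.

```python
from datetime import date
import pandas as pd

# Largest difference a nanosecond-resolution Timedelta can hold, in days.
max_days = pd.Timedelta.max.days

# Span of the full series (1700 to 2018) versus the truncated one (1825 to 2018).
full_span = (date(2018, 1, 1) - date(1700, 1, 1)).days
short_span = (date(2018, 1, 1) - date(1825, 1, 1)).days

print(f"limit: {max_days} days (about {max_days / 365.25:.0f} years)")
print(f"1700-2018: {full_span} days, fits: {full_span <= max_days}")
print(f"1825-2018: {short_span} days, fits: {short_span <= max_days}")
```

Under this reading, any training window plus forecast horizon shorter than about 292 years should avoid the overflow, which would explain why both trimming the history and shortening the forecast helped.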

to_dataframe() bug when query returns no results

If a valid BigQuery query returns 0 rows, to_dataframe() crashes. (btw, I am running this on Google Cloud Datalab)
for example:
q = bq.Query('SELECT * FROM [isb-cgc:tcga_201510_alpha.Somatic_Mutation_calls] WHERE ( Protein_Change="V600E" ) LIMIT 10')
r = q.results()
r.to_dataframe()
produces:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-17-de55245104c0> in <module>()
----> 1 r.to_dataframe()
/usr/local/lib/python2.7/dist-packages/gcp/bigquery/_table.pyc in to_dataframe(self, start_row, max_rows)
628 # Need to reorder the dataframe to preserve column ordering
629 ordered_fields = [field.name for field in self.schema]
--> 630 return df[ordered_fields]
631
632 def to_file(self, destination, format='csv', csv_delimiter=',', csv_header=True):
TypeError: 'NoneType' object has no attribute '__getitem__'
Is this a known bug?
Certainly not a known bug. Please do log a bug as mentioned by Felipe.
Contributions, both bug reports, and of course fixes, are welcome! :)
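Until a fix lands, a caller-side guard might look like the sketch below. It assumes, based only on the traceback, that the intermediate dataframe is `None` when the query returns no rows. `to_ordered_frame` is a hypothetical helper, not part of `gcp.bigquery`, and the field names are sample values.

```python
import pandas as pd

def to_ordered_frame(df, ordered_fields):
    """Return df with columns in schema order, or an empty frame if df is None.

    Hypothetical helper illustrating the guard; not library code.
    """
    if df is None:
        # No rows came back: keep the schema's columns but no data.
        return pd.DataFrame(columns=ordered_fields)
    return df[ordered_fields]

fields = ["Protein_Change", "Hugo_Symbol"]  # sample field names

empty = to_ordered_frame(None, fields)
filled = to_ordered_frame(
    pd.DataFrame({"Hugo_Symbol": ["BRAF"], "Protein_Change": ["V600E"]}),
    fields,
)
```

Returning an empty dataframe that still carries the schema's columns lets downstream code treat the zero-row case like any other result.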