pandas read_html error: unexpected keyword argument 'max_rows'

I'm trying to write a scraper that pulls option prices from Yahoo Finance. The code below works and even produces the correct output. The problem is that, right before the output, I get the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~/anaconda3/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj)
343 method = get_real_method(obj, self.print_method)
344 if method is not None:
--> 345 return method()
346 return None
347 else:
~/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in _repr_html_(self)
694 See Also
695 --------
--> 696 to_html : Convert DataFrame to HTML.
697
698 Examples
~/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in to_html(self, buf, columns, col_space, header, index, na_rep, formatters, float_format, sparsify, index_names, justify, bold_rows, classes, escape, max_rows, max_cols, show_dimensions, notebook, decimal, border, table_id)
2035 Dictionary mapping columns containing datetime types to stata
2036 internal format to use when writing the dates. Options are 'tc',
-> 2037 'td', 'tm', 'tw', 'th', 'tq', 'ty'. Column can be either an integer
2038 or a name. Datetime columns that do not have a conversion type
2039 specified will be converted to 'tc'. Raises NotImplementedError if
~/anaconda3/lib/python3.7/site-packages/pandas/io/formats/format.py in to_html(self, classes, notebook, border)
751 need_leadsp = dict(zip(fmt_columns, map(is_numeric_dtype, dtypes)))
752
--> 753 def space_format(x, y):
754 if (y not in self.formatters and
755 need_leadsp[x] and not restrict_formatting):
TypeError: __init__() got an unexpected keyword argument 'max_rows'
I tried researching the cause of the error in different Stack Overflow questions, as well as in the GitHub repo of the pandas library. The closest thing I found was in the "What's new" section for pandas 0.24.0: "max_rows and max_cols parameters removed from HTMLFormatter since truncation is handled by DataFrameFormatter (GH23818)".
My code is as follows:
import lxml.etree
import lxml.html
import pandas as pd
import requests
from time import sleep
ticker = 'AAPL'
url = "http://finance.yahoo.com/quote/%s/options?p=%s" % (ticker, ticker)
# verify=False disables TLS certificate verification
response = requests.get(url, verify=False)
print("Parsing %s" % url)
sleep(15)
parser = lxml.html.fromstring(response.text)
tables = parser.xpath('//table')
print(len(tables))
# Serialize the second table on the page back to HTML so pandas can parse it
puts = lxml.etree.tostring(tables[1], method='html')
df = pd.read_html(puts, flavor='bs4')[0]
df.tail()
df.tail() correctly shows the last 5 rows of the table, but I can't get rid of the error. Every time I use the DataFrame I get a correct result, but the error is shown again.
Thanks in advance for any help with this error.

For future reference:
The error was caused by the Anaconda install of the packages.
After installing the packages with pip, the error went away.
BR
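One way to investigate a problem like this is to print which package versions the running interpreter actually sees. The following is only a sketch: the helper name installed_version is hypothetical, the distribution names are assumptions based on the imports in the question, and importlib.metadata requires Python 3.8 or newer (the question used Python 3.7).

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Distribution names are assumptions based on the imports in the question.
for name in ("pandas", "lxml", "beautifulsoup4"):
    print(name, installed_version(name))
```

Running this in the same environment as the notebook kernel shows whether the versions match what conda or pip reports.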

Related

How to fix __init__() got an unexpected keyword argument 'location' error in scanpy.pl.umap?

I am trying to run single-cell analysis on scanpy 1.9.1. When I try to run scanpy.pl.umap(adata, color=["PDGFRB","RGS5"], s = 30), I get the following error:
TypeError Traceback (most recent call last)
in
----> 1 sc.pl.umap(adata, color=["PDGFRB","RGS5"], s = 30)
5 frames
/usr/local/lib/python3.8/dist-packages/matplotlib/colorbar.py in __init__(self, ax, mappable, **kw)
1228 """
1229 Return colorbar data coordinates for the boundaries of
-> 1230 a proportional colorbar, plus extension lengths if required:
1231 """
1232 if (isinstance(self.norm, colors.BoundaryNorm) or
TypeError: __init__() got an unexpected keyword argument 'location'
And I get a blank color heat legend.
I saw someone suggested using an older version of matplotlib for a similar problem. This error occurred in matplotlib 3.6.3, so I tried installing matplotlib 3.1.3 but it did not work either.
Any help will be appreciated!

When trying to create html report the program throws error in

When executing the below
profile = ProfileReport(df,title="Data Profile Report")
profile.to_file("data_profile_report.html")
Here is the exception thrown
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
c:\Projections 2022-08-16\Projections.py in <cell line: 4>()
102 # %%
103 #Creating EDA of data
104 profile = ProfileReport(df_cdap,title="CDAP Data Profile Report")
----> 105 profile.to_file("cdap_data_profile_report.html")
File c:\Users\fengq\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas_profiling\profile_report.py:257, in ProfileReport.to_file(self, output_file, silent)
254 self.config.html.assets_prefix = str(output_file.stem) + "_assets"
255 create_html_assets(self.config, output_file)
--> 257 data = self.to_html()
259 if output_file.suffix != ".html":
260 suffix = output_file.suffix
File c:\Users\fengq\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas_profiling\profile_report.py:368, in ProfileReport.to_html(self)
360 def to_html(self) -> str:
361 """Generate and return complete template as lengthy string
362 for using with frameworks.
363
(...)
366
367 """
--> 368 return self.html
...
--> 810 fig = manager.canvas.figure
811 if fig_label:
812 fig.set_label(fig_label)
AttributeError: 'NoneType' object has no attribute 'canvas'
I've tried reinstalling Python and the dependencies for pandas-profiling, but nothing has worked so far. I've also tried downgrading Python to 3.9 and matplotlib to an older version. Neither changed the error.
I notice that the error seems to come from "manager.canvas.figure", but I'm not sure how to resolve it from that point onwards. Any help is greatly appreciated!
The problem was resolved once I set matplotlib to inline, following some comments I found on another forum. I'm still very interested to learn what causes this! Please feel free to answer and suggest other solutions; I would love to try them.

pandas value_counts() with IntEnum raises RecursionError

The following code illustrates my problem. I'm using Python 3.6 with pandas==0.25.3.
import pandas as pd
from enum import Enum, IntEnum
class BookType(Enum):
    DRAMA = 5
    ROMAN = 3
class AuthorType(IntEnum):
    UNKNOWN = 0
    GROUP = 1
    MAN = 2
def print_num_type(df: pd.DataFrame, col_name: str, enum_type: Enum) -> int:
    counts = df[col_name].value_counts()
    val = counts[enum_type]
    print('value counts:', counts)
    print(f'Found "{val}" of type {enum_type}')
d = {
    'title': ['Charly Morry', 'James', 'Watson', 'Marry L.'],
    'isbn': [21412412, 334764712, 12471021, 124141111],
    'book_type': [BookType.DRAMA, BookType.ROMAN, BookType.ROMAN, BookType.ROMAN],
    'author_type': [AuthorType.UNKNOWN, AuthorType.UNKNOWN, AuthorType.MAN, AuthorType.UNKNOWN],
}
df = pd.DataFrame(data=d)
df.set_index(['title', 'isbn'], inplace=True)
df['book_type'] = df['book_type'].astype('category')
df['author_type'] = df['author_type'].astype('category')
print(df)
print(df.dtypes)
print_num_type(df, 'book_type', BookType.DRAMA)
print_num_type(df, 'author_type', AuthorType.UNKNOWN)
My pandas.DataFrame has two categorical columns (book_type and author_type).
The values in book_type are members of BookType, which inherits from Enum, and the values in author_type are members of AuthorType, which inherits from IntEnum. Calling print_num_type(df, 'book_type', BookType.DRAMA) works as expected and prints the number of books of this type, whereas print_num_type(df, 'author_type', AuthorType.UNKNOWN) raises the error:
Traceback (most recent call last):
File "C:\Users\User\AppData\Local\Programs\Python\Python36-32\lib\abc.py", line 182, in __instancecheck__
if subclass in cls._abc_cache:
File "C:\Users\User\AppData\Local\Programs\Python\Python36-32\lib\_weakrefset.py", line 72, in __contains__
wr = ref(item)
RecursionError: maximum recursion depth exceeded while calling a Python object
Exception ignored in: 'pandas._libs.lib.c_is_list_like'
Traceback (most recent call last):
File "C:\Users\User\AppData\Local\Programs\Python\Python36-32\lib\abc.py", line 182, in __instancecheck__
if subclass in cls._abc_cache:
File "C:\Users\User\AppData\Local\Programs\Python\Python36-32\lib\_weakrefset.py", line 72, in __contains__
wr = ref(item)
RecursionError: maximum recursion depth exceeded while calling a Python object
What am I doing wrong here?
Is there a workaround for this error? I can't change the IntEnum type of AuthorType, since it is provided by another library.
Thanks in advance!
See answer here
The main idea is that x.value_counts() (counts in your function) is itself a pandas Series, so it's best to use .iat or .iloc when indexing into it; see the iat docs, for example.
I think the easiest solution is to just use (x==0).sum(), or in your syntax:
val = (df[col_name]==enum_type).sum()
I put a minimal working example in the comments under your question so you can reproduce the problem/fix easily with the "x" notation.
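As a rough, self-contained sketch of the comparison-based count suggested above: the helper name count_matches is hypothetical, the data mirrors the author_type column from the question, and the behavior is assumed for a recent pandas release rather than 0.25.3.

```python
from enum import IntEnum

import pandas as pd


class AuthorType(IntEnum):
    UNKNOWN = 0
    GROUP = 1
    MAN = 2


def count_matches(series: pd.Series, member: IntEnum) -> int:
    """Count entries equal to `member` without indexing into value_counts()."""
    return int((series == member).sum())


authors = pd.Series(
    [AuthorType.UNKNOWN, AuthorType.UNKNOWN, AuthorType.MAN, AuthorType.UNKNOWN]
)
print(count_matches(authors, AuthorType.UNKNOWN))
```

Because the count comes from an element-wise comparison, the enum member is never used as an index label, which is the step that failed in the question.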
What version of pandas are you using? After reproducing the error, I realized that upgrading pandas (I am now on pandas 1.4.2) fixes it, and value_counts()[0] works as expected.
Run pip install --upgrade pandas

geopandas cannot read a geojson properly

I am trying the following:
After downloading http://eric.clst.org/assets/wiki/uploads/Stuff/gz_2010_us_050_00_20m.json
In [2]: import geopandas
In [3]: geopandas.read_file('./gz_2010_us_050_00_20m.json')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-3-83a1d4a0fc1f> in <module>
----> 1 geopandas.read_file('./gz_2010_us_050_00_20m.json')
~/miniconda3/envs/ml3/lib/python3.6/site-packages/geopandas/io/file.py in read_file(filename, **kwargs)
24 else:
25 f_filt = f
---> 26 gdf = GeoDataFrame.from_features(f_filt, crs=crs)
27
28 # re-order with column order from metadata, with geometry last
~/miniconda3/envs/ml3/lib/python3.6/site-packages/geopandas/geodataframe.py in from_features(cls, features, crs)
207
208 rows = []
--> 209 for f in features_lst:
210 if hasattr(f, "__geo_interface__"):
211 f = f.__geo_interface__
fiona/ogrext.pyx in fiona.ogrext.Iterator.__next__()
fiona/ogrext.pyx in fiona.ogrext.FeatureBuilder.build()
TypeError: startswith first arg must be bytes or a tuple of bytes, not str
The page http://eric.clst.org/tech/usgeojson/ lists 4 GeoJSON files under the 20m column. The file above corresponds to the US Counties row and is the only one of the 4 that cannot be read. The error message isn't very informative. What is the reason?
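If your error message looks anything like "Polygons and MultiPolygons should follow the right-hand rule", it means the ring coordinates in those GeoObjects are wound in the wrong direction. Under the right-hand rule in RFC 7946, exterior rings must be counterclockwise and interior rings (holes) clockwise.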
Here's an online tool to "fix" your objects, with a short explanation:
https://mapster.me/right-hand-rule-geojson-fixer/
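For reference, a ring's winding order can also be checked locally. RFC 7946 defines the right-hand rule as counterclockwise exterior rings and clockwise holes. The sketch below uses the shoelace formula on raw coordinates; the function names are hypothetical, rings are assumed to be closed (first point equals last point), and longitude/latitude are treated as planar x/y, which is only an approximation.

```python
from typing import List, Sequence

Ring = Sequence[Sequence[float]]


def signed_area(ring: Ring) -> float:
    """Shoelace formula: positive for counterclockwise rings, negative for clockwise."""
    total = 0.0
    for p, q in zip(ring, ring[1:]):
        total += p[0] * q[1] - q[0] * p[1]
    return total / 2.0


def enforce_right_hand_rule(polygon: Sequence[Ring]) -> List[list]:
    """Return polygon rings with the exterior counterclockwise and holes clockwise."""
    fixed = []
    for index, ring in enumerate(polygon):
        is_ccw = signed_area(ring) > 0
        want_ccw = index == 0  # the first ring is the exterior
        fixed.append(list(ring) if is_ccw == want_ccw else list(reversed(ring)))
    return fixed


# A clockwise unit square used as the exterior ring gets reversed.
clockwise_square = [[0, 0], [0, 1], [1, 1], [1, 0], [0, 0]]
print(signed_area(clockwise_square))
print(enforce_right_hand_rule([clockwise_square])[0])
```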
Possibly an answer for people arriving at this page: I received the same error, and it was caused by an encoding issue.
Try re-encoding the original file as UTF-8, or open the file with the encoding you think it actually uses. This fixed my error.
More info here
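A minimal sketch of the re-encoding step, using only the standard library. The helper name reencode_to_utf8 is hypothetical, and latin-1 as the source encoding is an assumption; substitute whatever encoding the file really uses.

```python
from pathlib import Path


def reencode_to_utf8(src: str, dst: str, source_encoding: str = "latin-1") -> None:
    """Read `src` with the assumed source encoding and write it to `dst` as UTF-8."""
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")


# Example usage (paths are placeholders):
# reencode_to_utf8("gz_2010_us_050_00_20m.json", "counties_utf8.json")
# geopandas.read_file("counties_utf8.json")
```

Writing to a new file rather than overwriting keeps the original available in case the guessed encoding turns out to be wrong.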

RuntimeError: Unsupported type in conversion to Arrow: VectorUDT

I want to convert a large Spark DataFrame with more than 1,000,000 rows to pandas. I tried the conversion using the following code:
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
result.toPandas()
But, I got the error:
TypeError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pyspark/sql/dataframe.py in toPandas(self)
1949 import pyarrow
-> 1950 to_arrow_schema(self.schema)
1951 tables = self._collectAsArrow()
/usr/local/lib/python3.6/dist-packages/pyspark/sql/types.py in to_arrow_schema(schema)
1650 fields = [pa.field(field.name, to_arrow_type(field.dataType), nullable=field.nullable)
-> 1651 for field in schema]
1652 return pa.schema(fields)
/usr/local/lib/python3.6/dist-packages/pyspark/sql/types.py in <listcomp>(.0)
1650 fields = [pa.field(field.name, to_arrow_type(field.dataType), nullable=field.nullable)
-> 1651 for field in schema]
1652 return pa.schema(fields)
/usr/local/lib/python3.6/dist-packages/pyspark/sql/types.py in to_arrow_type(dt)
1641 else:
-> 1642 raise TypeError("Unsupported type in conversion to Arrow: " + str(dt))
1643 return arrow_type
TypeError: Unsupported type in conversion to Arrow: VectorUDT
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last)
<ipython-input-138-4e12457ff4d5> in <module>()
1 spark.conf.set("spark.sql.execution.arrow.enabled", "true")
----> 2 result.toPandas()
/usr/local/lib/python3.6/dist-packages/pyspark/sql/dataframe.py in toPandas(self)
1962 "'spark.sql.execution.arrow.enabled' is set to true. Please set it to false "
1963 "to disable this.")
-> 1964 raise RuntimeError("%s\n%s" % (_exception_message(e), msg))
1965 else:
1966 pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
RuntimeError: Unsupported type in conversion to Arrow: VectorUDT
Note: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true. Please set it to false to disable this.
With Arrow enabled it fails. If I set Arrow to false it works, but it is very slow. Any ideas?
Arrow supports only a small set of types, and Spark UserDefinedTypes, including the ml and mllib VectorUDTs, are not among the supported ones.
If you really want to use Arrow, you'll have to convert your data to a format that is supported. One possible solution is to expand Vectors into columns (see "How to split Vector into columns - using PySpark").
You can also serialize the output using the to_json method:
from pyspark.sql.functions import to_json
df.withColumn("your_vector_column", to_json("your_vector_column"))
but if data is large enough for toPandas to be a serious bottleneck, then I would reconsider collecting data like this.