geopandas cannot read a geojson properly - pandas

I am trying the following:
After downloading http://eric.clst.org/assets/wiki/uploads/Stuff/gz_2010_us_050_00_20m.json
In [2]: import geopandas
In [3]: geopandas.read_file('./gz_2010_us_050_00_20m.json')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-3-83a1d4a0fc1f> in <module>
----> 1 geopandas.read_file('./gz_2010_us_050_00_20m.json')
~/miniconda3/envs/ml3/lib/python3.6/site-packages/geopandas/io/file.py in read_file(filename, **kwargs)
24 else:
25 f_filt = f
---> 26 gdf = GeoDataFrame.from_features(f_filt, crs=crs)
27
28 # re-order with column order from metadata, with geometry last
~/miniconda3/envs/ml3/lib/python3.6/site-packages/geopandas/geodataframe.py in from_features(cls, features, crs)
207
208 rows = []
--> 209 for f in features_lst:
210 if hasattr(f, "__geo_interface__"):
211 f = f.__geo_interface__
fiona/ogrext.pyx in fiona.ogrext.Iterator.__next__()
fiona/ogrext.pyx in fiona.ogrext.FeatureBuilder.build()
TypeError: startswith first arg must be bytes or a tuple of bytes, not str
On the page http://eric.clst.org/tech/usgeojson/ with 4 geojson files under the 20m column, the above file corresponds to the US Counties row, and is the only one that cannot be read out of the 4. The error message isn't very informative, I wonder what's the reason, please?

If your error message looks anything like "Polygons and MultiPolygons should follow the right-hand rule", it means the order of the coordinates in those GeoObjects should be clock-wise.
Here's an online tool to "fix" your objects, with a short explanation:
https://mapster.me/right-hand-rule-geojson-fixer/

Possibly an answer for people arriving at this page, I received the same error and the error was thrown due to encoding issues.
Try encoding the initial file with utf-8 or try opening the file with an encoding which you think is applied to the file. This fixed my error.
More info here

Related

When trying to create html report the program throws error in

When executing the below
profile = ProfileReport(df,title="Data Profile Report")
profile.to_file("data_profile_report.html")
Here is the exception thrown
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
c:\Projections 2022-08-16\Projections.py in <cell line: 4>()
102 # %%
103 #Creating EDA of data
104 profile = ProfileReport(df_cdap,title="CDAP Data Profile Report")
----> 105 profile.to_file("cdap_data_profile_report.html")
File c:\Users\fengq\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas_profiling\profile_report.py:257, in ProfileReport.to_file(self, output_file, silent)
254 self.config.html.assets_prefix = str(output_file.stem) + "_assets"
255 create_html_assets(self.config, output_file)
--> 257 data = self.to_html()
259 if output_file.suffix != ".html":
260 suffix = output_file.suffix
File c:\Users\fengq\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas_profiling\profile_report.py:368, in ProfileReport.to_html(self)
360 def to_html(self) -> str:
361 """Generate and return complete template as lengthy string
362 for using with frameworks.
363
(...)
366
367 """
--> 368 return self.html
...
--> 810 fig = manager.canvas.figure
811 if fig_label:
812 fig.set_label(fig_label)
AttributeError: 'NoneType' object has no attribute 'canvas'
I've tried to re-install python and reinstalling the dependencies for pandas-profiling but nothing seems to work so far. I've also tried downgrading python to python 3.9 and the matplotlib to an older version as well. It has not changed this error.
I notice that the error seems to be attributed to "manager.canvas.figure" but I'm not sure how to resolve it from that point onwards. Any help is greatly appreciated!
The problem resolved as I set the matplotlib to inline as per some comments that I was able to find on another forum. I'm still really interested to learn what causes this! Please feel free to answer and suggest other solutions! I would love to try them!

What is OSError: [Errno 95] Operation not supported for pandas to_csv on colab?

My input is:
test=pd.read_csv("/gdrive/My Drive/data-kaggle/sample_submission.csv")
test.head()
It ran as expected.
But, for
test.to_csv('submitV1.csv', header=False)
The full error message that I got was:
OSError Traceback (most recent call last)
<ipython-input-5-fde243a009c0> in <module>()
9 from google.colab import files
10 print(test)'''
---> 11 test.to_csv('submitV1.csv', header=False)
12 files.download('/gdrive/My Drive/data-
kaggle/submission/submitV1.csv')
2 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py in
to_csv(self, path_or_buf, sep, na_rep, float_format, columns,
header, index, index_label, mode, encoding, compression, quoting,
quotechar, line_terminator, chunksize, tupleize_cols, date_format,
doublequote, escapechar, decimal)
3018 doublequote=doublequote,
3019 escapechar=escapechar,
decimal=decimal)
-> 3020 formatter.save()
3021
3022 if path_or_buf is None:
/usr/local/lib/python3.6/dist-packages/pandas/io/formats/csvs.pyi
in save(self)
155 f, handles = _get_handle(self.path_or_buf,
self.mode,
156 encoding=self.encoding,
--> 157
compression=self.compression)
158 close = True
159
/usr/local/lib/python3.6/dist-packages/pandas/io/common.py in
_get_handle(path_or_buf, mode, encoding, compression, memory_map,
is_text)
422 elif encoding:
423 # Python 3 and encoding
--> 424 f = open(path_or_buf, mode,encoding=encoding,
newline="")
425 elif is_text:
426 # Python 3 and no explicit encoding
OSError: [Errno 95] Operation not supported: 'submitV1.csv'
Additional Information about the error:
Before running this command, if I run
df=pd.DataFrame()
df.to_csv("file.csv")
files.download("file.csv")
It is running properly, but the same code is producing the operation not supported error if I try to run it after trying to convert test data frame to a csv file.
I am also getting a message A Google Drive timeout has occurred (most recently at 13:02:43). More info. just before running the command.
You are currently in the directory in which you don't have write permissions.
Check your current directory with pwd. It might be gdrive of some directory inside it, that's why you are unable to save there.
Now change the current working directory to some other directory where you have permissions to write. cd ~ will work fine. It wil chage the directoy to /root.
Now you can use:
test.to_csv('submitV1.csv', header=False)
It will save 'submitV1.csv' to /root

pandas read_html error: unexpected keyword argument 'max_rows'

I'm trying to write a scraper to scrape Option prices from Yahoo Finance. The code below is working and even gives the correct output answer. The problem is that right before the answer, I get the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~/anaconda3/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj)
343 method = get_real_method(obj, self.print_method)
344 if method is not None:
--> 345 return method()
346 return None
347 else:
~/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in _repr_html_(self)
694 See Also
695 --------
--> 696 to_html : Convert DataFrame to HTML.
697
698 Examples
~/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in to_html(self, buf, columns, col_space, header, index, na_rep, formatters, float_format, sparsify, index_names, justify, bold_rows, classes, escape, max_rows, max_cols, show_dimensions, notebook, decimal, border, table_id)
2035 Dictionary mapping columns containing datetime types to stata
2036 internal format to use when writing the dates. Options are 'tc',
-> 2037 'td', 'tm', 'tw', 'th', 'tq', 'ty'. Column can be either an integer
2038 or a name. Datetime columns that do not have a conversion type
2039 specified will be converted to 'tc'. Raises NotImplementedError if
~/anaconda3/lib/python3.7/site-packages/pandas/io/formats/format.py in to_html(self, classes, notebook, border)
751 need_leadsp = dict(zip(fmt_columns, map(is_numeric_dtype, dtypes)))
752
--> 753 def space_format(x, y):
754 if (y not in self.formatters and
755 need_leadsp[x] and not restrict_formatting):
TypeError: __init__() got an unexpected keyword argument 'max_rows'
I tried researching the cause of the error in different stackoverflow questions, as well as the github repo of the pandas library. The closest thing I found was in the what's new section of the pandas 0.24.0 "max_rows and max_cols parameters removed from HTMLFormatter since truncation is handled by DataFrameFormatter GH23818"
My code is as follows:
import lxml
import requests
from time import sleep
ticker = 'AAPL'
url = "http://finance.yahoo.com/quote/%s/options?p=%s"%(ticker,ticker)
response = requests.get(url, verify=False)
print ("Parsing %s"%(url))
sleep(15)
parser = lxml.html.fromstring(response.text)
tables = parser.xpath('//table')
print(len(tables))
puts = lxml.etree.tostring(tables[1], method='html')
df = pd.read_html(puts, flavor='bs4')[0]
df.tail()
The df.tail() shows correctly the last 5 rows of the table, but I can't seem to remove the error. Also every time I use the dataframe, I get a correct result, but the error is shown again.
Thanks in advance in helping with my error.
For future reference:
The error was driven by the anaconda install of the packages.
By pip installing the packages, the error goes away.
BR

Concurrently read an HDF5 file in Pandas

I have a data.h5 file organised in multiple chunks, the entire file being several hundred gigabytes. I need to work with a filtered subset of the file in memory, in the form of a Pandas DataFrame.
The goal of the following routine is to distribute the filtering work across several processes, then concatenate the filtered results into the final DataFrame.
Since reading from the file takes a significant amount of time, I'm trying to make each process read its own chunk in a concurrent manner as well.
import multiprocessing as mp, pandas as pd
store = pd.HDFStore('data.h5')
min_dset, max_dset = 0, len(store.keys()) - 1
dset_list = list(range(min_dset, max_dset))
frames = []
def read_and_return_subset(dset):
# each process is intended to read its own chunk in a concurrent manner
chunk = store.select('batch_{:03}'.format(dset))
# and then process the chunk, do the filtering, and return the result
output = chunk[chunk.some_condition == True]
return output
with mp.Pool(processes=32) as pool:
for frame in pool.map(read_and_return_subset, dset_list):
frames.append(frame)
df = pd.concat(frames)
However, the above code triggers this error:
HDF5ExtError Traceback (most recent call last)
<ipython-input-174-867671c5a58f> in <module>()
53
54 with mp.Pool(processes=32) as pool:
---> 55 for frame in pool.map(read_and_return_subset, dset_list):
56 frames.append(frame)
57
/usr/lib/python3.5/multiprocessing/pool.py in map(self, func, iterable, chunksize)
258 in a list that is returned.
259 '''
--> 260 return self._map_async(func, iterable, mapstar, chunksize).get()
261
262 def starmap(self, func, iterable, chunksize=None):
/usr/lib/python3.5/multiprocessing/pool.py in get(self, timeout)
606 return self._value
607 else:
--> 608 raise self._value
609
610 def _set(self, i, obj):
HDF5ExtError: HDF5 error back trace
File "H5Dio.c", line 173, in H5Dread
can't read data
File "H5Dio.c", line 554, in H5D__read
can't read data
File "H5Dchunk.c", line 1856, in H5D__chunk_read
error looking up chunk address
File "H5Dchunk.c", line 2441, in H5D__chunk_lookup
can't query chunk address
File "H5Dbtree.c", line 998, in H5D__btree_idx_get_addr
can't get chunk info
File "H5B.c", line 340, in H5B_find
unable to load B-tree node
File "H5AC.c", line 1262, in H5AC_protect
H5C_protect() failed.
File "H5C.c", line 3574, in H5C_protect
can't load entry
File "H5C.c", line 7954, in H5C_load_entry
unable to load entry
File "H5Bcache.c", line 143, in H5B__load
wrong B-tree signature
End of HDF5 error back trace
Problems reading the array data.
It seems that Pandas/pyTables have troubles when trying to access the same file in a concurrent manner, even if it's only for reading.
Is there a way to be able to make each process read its own chunk concurrently?
IIUC you can index those columns that are used for filtering data (chunk.some_condition == True - in your sample code) and then read up only that subset of data that satisfies needed conditions.
In order to be able to do that you need to:
save HDF5 file in table format - use parameter: format='table'
index columns, that will be used for filtering - use parameter: data_columns=['col_name1', 'col_name2', etc.]
After that you should be able to filter your data just by reading:
store = pd.HDFStore(filename)
df = store.select('key_name', where="col1 in [11,13] & col2 == 'AAA'")

Getting a EOF error when calling pd.read_pickle

Had a quick question regarding a pandas DataFrame and the pd.read_pickle() function. Basically, I have a large but simple Dataframe (333 mb). When I run pd.read_pickle on the dataframe, I am getting and EOFError.
Is there any way around this issue? What might be causing this?
I saw the same EOFError when I created a pickle using:
pandas.DataFrame.to_pickle('path.pkl', compression='bz2')
and then tried to read with:
pandas.read_pickle('path.pkl')
I fixed the issue by supplying the compression on read:
pandas.read_pickle('path.pkl', compression='bz2')
According to the Pandas docs:
compression : {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}, default ‘infer’
string representing the compression to use in the output file. By default,
infers from the file extension in specified path.
Thus, simply changing the path from 'path.pkl' to 'path.bz2' also fixed the problem.
I can confirm the valuable comment of greg_data:
When I encountered this error I worked out that it was due to the
initial pickling not having completed correctly. The pickle file was
created, but not finished correctly. Seems to me this is the only
possible source of the EOFError in pickle, that the pickle is
malformed, i.e. not finished.
My error during pickling was:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-40-263240bbee7e> in <module>()
----> 1 main()
<ipython-input-38-9b3c6d782a2a> in main()
43 with open("/content/drive/MyDrive/{}.file".format(tm.id), "wb") as f:
---> 44 pickle.dump(tm, f, pickle.HIGHEST_PROTOCOL)
45
46 print('Coherence:', get_coherence(tm, token_lists, 'c_v'))
TypeError: can't pickle weakref objects
And when reading that pickle file that was obviously not finished during pickling, the reported error occured:
pd.read_pickle(r'/content/drive/MyDrive/TEST_2021_06_01_10_23_02.file')
Error:
---------------------------------------------------------------------------
EOFError Traceback (most recent call last)
<ipython-input-41-460bdd0a2779> in <module>()
----> 1 object = pd.read_pickle(r'/content/drive/MyDrive/TEST_2021_06_01_10_23_02.file')
/usr/local/lib/python3.7/dist-packages/pandas/io/pickle.py in read_pickle(filepath_or_buffer, compression)
180 # We want to silence any warnings about, e.g. moved modules.
181 warnings.simplefilter("ignore", Warning)
--> 182 return pickle.load(f)
183 except excs_to_catch:
184 # e.g.
EOFError: Ran out of input