How to identify and remove a key from an xarray Dataset in order to write a netCDF file?

I would like to write my xarray Dataset to a netCDF file. I derived it from a netCDF file that I masked using a shapefile. This is how it looks:
<xarray.Dataset>
Dimensions: (DATE: 14245, x: 214, y: 173)
Coordinates:
* DATE (DATE) datetime64[ns] 1980-01-01T12:00:00 ... 2018-1...
* x (x) float64 6.168e+05 6.171e+05 ... 6.698e+05 6.701e+05
* y (y) float64 5.113e+06 5.113e+06 ... 5.156e+06 5.156e+06
transverse_mercator int64 0
Data variables:
precipitation (DATE, y, x) float32 nan nan nan nan ... nan nan nan
Attributes:
CDI: Climate Data Interface version 1.9.9 (https://mpimet.mpg.de...
Conventions: CF-1.5
Title: Daily total precipitation Trentino-South Tyrol 250-meter re...
Created on: Fri Feb 26 21:30:51 2021
history: Fri Feb 26 23:31:30 2021: cdo -z zip -mergetime DAILYPCP_19...
CDO: Climate Data Operators version 1.9.9 (https://mpimet.mpg.de...
When I try to write it as:
ds_subset.to_netcdf('test.nc')
I get the following error:
ValueError: failed to prevent overwriting existing key grid_mapping in attrs. This is probably an encoding field used by xarray to describe how a variable is serialized. To proceed, remove this key from the variable's attributes manually.
Since I am not able to find the key that causes the issue, I also tried saving it with xarray as:
xr.save_mfdataset(ds_subset, 'test.xr')
but I get the following error:
ValueError: cannot use mode='w' when writing multiple datasets to the same path
I would prefer the first solution. Two things to point out: i) I have the feeling that the problem is due to the fact that the original file is a Dataset; ii) the original netCDF file seems to have some missing fields, which I filled with rioxarray.
However, I am not able to find out where the problem is.
Thanks for any kind of help.
Here is some other, hopefully useful, information:
This is the result of
ds_subset.precipitation.attrs["grid_mapping"]
Out[3]: 'transverse_mercator'
ds_subset.encoding
Out[4]: {}
and
ds_subset.precipitation.encoding
Out[5]:
{'zlib': True,
'szip': False,
'zstd': False,
'bzip2': False,
'blosc': False,
'shuffle': True,
'complevel': 1,
'fletcher32': False,
'contiguous': False,
'chunksizes': (1, 643, 641),
'source': 'read file path',
'original_shape': (14245, 643, 641),
'dtype': dtype('float32'),
'missing_value': -999.0,
'_FillValue': -999.0,
'grid_mapping': 'transverse_mercator'}
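Since the error message itself asks for the key to be removed from the variable's attributes, a minimal sketch of that fix (not part of the original post; depending on the xarray version the key may need to be dropped from .encoding instead) would be:
# Drop the conflicting key from the variable's attrs (and, if still conflicting,
# from its encoding), then write the file again.
ds_subset.precipitation.attrs.pop("grid_mapping", None)
# ds_subset.precipitation.encoding.pop("grid_mapping", None)  # if still conflicting
ds_subset.to_netcdf('test.nc')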

Related

ValueError: GeoDataFrame does not support multiple columns using the geometry column name 'geometry'

I am receiving this error when I try to load a CSV file as a GeoDataFrame. According to other questions' resolutions on this site, this method should do the trick.
Here is the code that I am using to load the file as a GeoDataFrame and then produce a subset dataframe with only some of the columns present.
cp_union = gpd.read_file(r'C:\Users\User\Desktop\CPAWS\terrestrial_outputs\cp_union.csv')
cp_union.crs = 'epsg:3005'
cp_trimmed = cp_union[['COSEWIC_status','reason_for_designation','cnm_eng','iucn_cat','mgmt_e','status_e','classification','sq_m']]
As stated in the title, the error that I am receiving is this: ValueError: GeoDataFrame does not support multiple columns using the geometry column name 'geometry'. Is there some part of the process of saving a GeoDataFrame as a CSV and then reloading it as a GeoDataFrame that would cause the creation of an additional geometry column?
EDIT
In another script, I loaded the same CSV file as a pandas DataFrame. Here is the first row of data within that DataFrame.
Unnamed: 0 0
fid_critic 0
scntfc_nm Castilleja victoriae
cnm_eng Victoria's Owl-clover
cnm_fren Castilléjie de Victoria
cswc_pop NaN
ch_stat Final
cb_site_nm Cattle Point
ch_detail Detailed Polygon
ch_variant NaN
ch_method NaN
hectares 0.8559
utm_zone 10
utm_east 478383
utm_north 5365043
latitude 48.438164
longitude -123.29226
shape_1 0.0
objectid 10251681.0
area_sqm 8478.6733
feat_len 326.5008
fid_protec -1
name_e NaN
name_f NaN
aichi_t11 NaN
iucn_cat NaN
oecm NaN
o_area 0.0
loc_e NaN
loc_f NaN
type_e NaN
mgmt_e NaN
gov_type NaN
legisl_e NaN
status_e NaN
protdate 0
delisdate 0
owner_e NaN
owner_f NaN
subs_right NaN
comments NaN
url NaN
shape_leng 0.0
protected 0
shape_le_1 320.859687
shape_area 6499.790343
geometry POLYGON ((1200735.4438 384059.0133999996, 1200...
COSEWIC_status Endangered
reason_for_designation This small annual herb is confined to a very s...
sq_m 6499.790343
classification c
Name: 0, dtype: object
So my only theory here is that, when you save a GeoDataFrame as a CSV, the CSV contains a column called geometry. Then, when you load that CSV as a GeoDataFrame, geopandas tries to create a new geometry column on top of the one that was already in the CSV. I could be completely wrong about this. Even if this is the case, I'm not sure how to go about resolving the issue.
Thanks for the help!
Using your sample data I created a CSV (I had to replace the geometry, as the sample is not a valid WKT string), reproduced your error, and solved it by loading with pandas and then converting to geopandas.
Solution:
df = pd.read_csv(f)
cp_union = gpd.GeoDataFrame(
    df.loc[:, [c for c in df.columns if c != "geometry"]],
    geometry=gpd.GeoSeries.from_wkt(df["geometry"]),
    crs="epsg:3005",
)
Full code:
import pandas as pd
import geopandas as gpd
import io
from pathlib import Path
# fmt: off
df_q = pd.read_csv(io.StringIO("""Unnamed: 0 0
fid_critic 0
scntfc_nm Castilleja victoriae
cnm_eng Victoria's Owl-clover
cnm_fren Castilléjie de Victoria
cswc_pop NaN
ch_stat Final
cb_site_nm Cattle Point
ch_detail Detailed Polygon
ch_variant NaN
ch_method NaN
hectares 0.8559
utm_zone 10
utm_east 478383
utm_north 5365043
latitude 48.438164
longitude -123.29226
shape_1 0.0
objectid 10251681.0
area_sqm 8478.6733
feat_len 326.5008
fid_protec -1
name_e NaN
name_f NaN
aichi_t11 NaN
iucn_cat NaN
oecm NaN
o_area 0.0
loc_e NaN
loc_f NaN
type_e NaN
mgmt_e NaN
gov_type NaN
legisl_e NaN
status_e NaN
protdate 0
delisdate 0
owner_e NaN
owner_f NaN
subs_right NaN
comments NaN
url NaN
shape_leng 0.0
protected 0
shape_le_1 320.859687
shape_area 6499.790343
geometry POLYGON ((5769135.557632876 7083849.386658552, 5843426.213336911 7098018.122146672, 5852821.812968816 7081377.7312996285, 5914814.478616157 7091734.620966213, 5883751.009067913 7017032.330573363, 5902031.719573214 6983898.953064103, 5864452.659165712 6922039.030140929, 5829585.402576889 6878872.269967912, 5835906.522449658 6846685.714836724, 5800391.382286092 6827305.509709548, 5765261.646424723 6876008.057438379, 5765261.402301509 6876010.894933639, 5765264.431247815 6876008.786040769, 5760553.056402712 6927522.42488809, 5720896.599172597 6983360.181762057, 5755349.303491102 7039380.015177476, 5769135.557632876 7083849.386658552))
COSEWIC_status Endangered
reason_for_designation This small annual herb is confined to a very s...
sq_m 6499.790343
classification c"""), sep="\s\s+", engine="python", header=None).set_index(0).T
# fmt: on
# generate a CSV file from sample data
f = Path.cwd().joinpath("SO_q.csv")
df_q.to_csv(f, index=False)
# replicate issue...
try:
    gpd.read_file(f)
except ValueError as e:
    print(e)
# now the actual solution
df = pd.read_csv(f)
cp_union = gpd.GeoDataFrame(
    df.loc[:, [c for c in df.columns if c != "geometry"]],
    geometry=gpd.GeoSeries.from_wkt(df["geometry"]),
    crs="epsg:3005",
)
GeoPandas automatically adds a geometry field when you read in your CSV.
As you already have a field called "geometry", GeoPandas raises an exception. GeoPandas doesn't read the WKT strings in the "geometry" field as geometry.
Some workarounds:
The GDAL/OGR CSV driver (which geopandas uses) supports the open option GEOM_POSSIBLE_NAMES, which lets you specify a different field name in which to look for the geometry:
gdf = gpd.read_file('/path/to/gdf.csv',
                    GEOM_POSSIBLE_NAMES="geometry",
                    KEEP_GEOM_COLUMNS="NO")
The KEEP_GEOM_COLUMNS option is also required otherwise GDAL/OGR will return a "geometry" column as well as the geometry and you still get the original ValueError: GeoDataFrame does not support multiple columns using the geometry column name 'geometry'.
Alternatively, if you have control over the original writing of the CSV, the GDAL/OGR CSV driver documentation notes that if a field is named "WKT" it will be read as geometry:
When reading a field named “WKT” is assumed to contain WKT geometry, but also is treated as a regular field.
So one option is to write out the CSV using "WKT" as the geometry column name and then it can be read straight back in:
gdf = gdf.rename_geometry('WKT')
gdf.to_csv('/path/to/gdf.csv', index=False)
# Then in later scripts you can just read it straight back in
gdf = gpd.read_file('/path/to/gdf.csv')
A final option is to read the CSV in as a pandas DataFrame (per RobRaymond's answer), then manually create the geometry from the WKT and convert it to a GeoDataFrame. Here is an alternative way of doing so:
import geopandas as gpd
df = gpd.read_file('/path/to/gdf.csv', ignore_geometry=True)
# Create geometry objects from WKT strings
df['geometry'] = gpd.GeoSeries.from_wkt(df['geometry'])
# Convert to GDF
gdf = gpd.GeoDataFrame(df)

OneHotEncoder gives ValueError : Input contains NaN ; even though my DataFrame doesn't contain any NaN as indicated by df.isna()

I am working on the Titanic dataset. When I try to apply OneHotEncoder to one of the columns, 'Embarked', which has 3 possible values ('S', 'Q' and 'C'), it gives me
ValueError: Input contains NaN
I checked the contents of the column using two methods: first a for-loop with value_counts, and second by writing the entire table to a CSV:
for col in X.columns:
    print(col)
    print(X[col].value_counts(dropna=False))
X.isna().to_csv("xisna.csv")
print("notna================== :", X.notna().shape)
X.dropna(axis=0, how='any', inplace=True)
print("X.shape ", X.shape)
return pd.DataFrame(X)
Which yielded
Embarked
S 518
C 139
Q 55
Name: Embarked, dtype: int64
I checked the contents of the CSV and, reading through the over 700 entries, I did not find any 'True' statement.
The pipeline, which fails at the ("cat", One...) step:
cat_attribs=["Sex","Embarked"]
special_attribs = {'drop_attribs' : ["Name","Cabin","Ticket","PassengerId"], k : [3]}
full_pipeline = ColumnTransformer([
("fill",fill_pipeline,list(strat_train_set)),
("emb_cat",OneHotEncoder(),['Sex']),
("cat",OneHotEncoder(),['Embarked']),
])
So where exactly is the NaN-value that I am missing?
I figured it out: a ColumnTransformer will concatenate the transformations instead of passing them along to the next transformer in line.
So any transformations done in fill_pipeline, won't be noticed by OneHotEncoder since it is still working with the untransformed dataset.
So I had to put the one hot encoding into the fill_pipeline instead of the ColumnTransformer.
full_pipeline = ColumnTransformer([
("fill",fill_pipeline,list(strat_train_set)),
("emb_cat",OneHotEncoder(),['Sex']),
("cat",OneHotEncoder(),['Embarked']),
])

Pandas Interpolation: {ValueError}Invalid fill method. Expecting pad (ffill) or backfill (bfill). Got linear

I am trying to interpolate time series data, df, which looks like:
id data lat notes analysis_date
0 17358709 NaN 26.125979 None 2019-09-20 12:00:00+00:00
1 17358709 NaN 26.125979 None 2019-09-20 12:00:00+00:00
2 17352742 -2.331365 26.125979 None 2019-09-20 12:00:00+00:00
3 17358709 -4.424366 26.125979 None 2019-09-20 12:00:00+00:00
I try: df.groupby(['lat', 'lon']).apply(lambda group: group.interpolate(method='linear')), and it throws {ValueError}Invalid fill method. Expecting pad (ffill) or backfill (bfill). Got linear
I suspect the issue is with the fact that I have None values, and I do not want to interpolate those. What is the solution?
df.dtypes gives me:
id int64
data float64
lat float64
notes object
analysis_date datetime64[ns, psycopg2.tz.FixedOffsetTimezone...
dtype: object
DataFrame.interpolate has issues with timezone-aware datetime64[ns] columns, which leads to that rather cryptic error message. E.g.
import pandas as pd
df = pd.DataFrame({'time': pd.to_datetime(['2010', '2011', 'foo', '2012', '2013'],
                                          errors='coerce')})
df['time'] = df.time.dt.tz_localize('UTC').dt.tz_convert('Asia/Kolkata')
df.interpolate()
ValueError: Invalid fill method. Expecting pad (ffill) or backfill
(bfill). Got linear
In this case interpolating that column is unnecessary, so only interpolate the column you need. We still want DataFrame.interpolate, so select with [[ ]] (Series.interpolate leads to some odd reshaping):
df['data'] = df.groupby(['lat', 'lon']).apply(lambda x: x[['data']].interpolate())
This error happens because one of the columns you are interpolating has the object data type. Interpolation works only for numerical data types such as integer or float.
If you need to interpolate an object or categorical column, first convert it to a numerical data type by encoding it. The following piece of code will resolve your problem:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
notes_encoder=LabelEncoder()
df['notes'] = notes_encoder.fit_transform(df['notes'])
After doing this, check the column's data type. It must be int. If it is categorical, then change its type to int using the following code:
df['notes']=df['notes'].astype('int32')

Why do sum and lambda sum differ in transform?

For the dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'key1': [1, 1, 1, 2, 3, np.nan],
    'key2': ['one', 'two', 'one', 'three', 'two', 'one'],
    'data1': [1, 2, 3, 3, 4, 5]
})
The following transform using the sum function does not produce an error:
df.groupby(['key1'])['key1'].transform(sum)
However, this transform, also using the sum function, produces an error:
df.groupby(['key1'])['key1'].transform(lambda x : sum(x))
ValueError: Length mismatch: Expected axis has 5 elements, new values have 6 elements
Why?
This is probably a bug, but the reason as to why the two behave differently is easily explained by the fact that pandas internally overrides the builtin sum, min, and max functions. When you pass any of these functions to pandas, they are internally replaced by the numpy equivalents.
Now, your grouper has NaNs, and NaNs are automatically excluded, as the docs mention. With any of the builtin pandas agg functions, this is handled by automatically inserting NaNs in the output, as you see with the first statement. The output is the same if you run df.groupby(['key1'])['key1'].transform('sum'). However, when you pass a lambda as in the second statement, for whatever reason this automatic replacement of missing outputs with NaN is not done.
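For reference, a small self-contained sketch of the difference described above, based only on the sample frame from the question (in newer pandas releases the lambda case may no longer raise):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'key1': [1, 1, 1, 2, 3, np.nan],
    'key2': ['one', 'two', 'one', 'three', 'two', 'one'],
    'data1': [1, 2, 3, 3, 4, 5],
})

# The builtin sum is swapped for the pandas/numpy version internally,
# so the row with the NaN key simply comes back as NaN.
print(df.groupby(['key1'])['key1'].transform(sum))
print(df.groupby(['key1'])['key1'].transform('sum'))

# The lambda is not swapped. In the pandas version used in the question this
# raises the length-mismatch ValueError; newer releases may return NaN instead.
try:
    print(df.groupby(['key1'])['key1'].transform(lambda x: sum(x)))
except ValueError as e:
    print(e)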
A possible workaround is to group on the strings:
df.groupby(df.key1.astype(str))['key1'].transform(lambda x : sum(x))
0 3.0
1 3.0
2 3.0
3 2.0
4 3.0
5 NaN
Name: key1, dtype: float64
This way, the NaNs are not dropped, and you get rid of the length mismatch.
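As a side note that is not part of the original answer: on pandas 1.1 and newer, the grouper can keep the NaN key directly with dropna=False, which avoids the detour through strings:
# Requires pandas >= 1.1: the NaN key forms its own group instead of being
# dropped, so the transform output keeps all six rows.
df.groupby('key1', dropna=False)['key1'].transform(lambda x: sum(x))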

Pandas not detecting the datatype of a Series properly

I'm running into something a bit frustrating with pandas Series. I have a DataFrame with several columns, with numeric and non-numeric data. For some reason, however, pandas thinks some of the numeric columns are non-numeric, and ignores them when I try to run aggregating functions like .describe(). This is a problem, since pandas raises errors when I try to run analyses on these columns.
I've copied some commands from the terminal as an example. When I slice the 'ND_Offset' column (the problematic column in question), pandas tags it with the dtype of object. Yet, when I call .describe(), pandas tags it with the dtype float64 (which is what it should be). The 'Dwell' column, on the other hand, works exactly as it should, with pandas giving float64 both times.
Does anyone know why I'm getting this behavior?
In [83]: subject.phrases['ND_Offset'][:3]
Out[83]:
SubmitTime
2014-06-02 22:44:44 0.3607049
2014-06-02 22:44:44 0.2145484
2014-06-02 22:44:44 0.4031347
Name: ND_Offset, dtype: object
In [84]: subject.phrases['ND_Offset'].describe()
Out[84]:
count 1255.000000
unique 432.000000
top 0.242308
freq 21.000000
dtype: float64
In [85]: subject.phrases['Dwell'][:3]
Out[85]:
SubmitTime
2014-06-02 22:44:44 111
2014-06-02 22:44:44 81
2014-06-02 22:44:44 101
Name: Dwell, dtype: float64
In [86]: subject.phrases['Dwell'].describe()
Out[86]:
count 1255.000000
mean 99.013546
std 30.109327
min 21.000000
25% 81.000000
50% 94.000000
75% 111.000000
max 291.000000
dtype: float64
And when I use the .groupby function to group the data by another attribute (when these Series are a part of a DataFrame), I get the DataError: No numeric types to aggregate error when I try to call .agg(np.mean) on the group. When I try to call .agg(np.sum) on the same data, on the other hand, things work fine.
It's a bit bizarre -- can anyone explain what's going on?
Thank you!
It might be because the ND_Offset column (what I call A below) contains a non-numeric value such as an empty string. For example,
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [0.36, ''], 'B': [111, 81]})
print(df['A'].describe())
# count 2.00
# unique 2.00
# top 0.36
# freq 1.00
# dtype: float64
try:
    print(df.groupby(['B']).agg(np.mean))
except Exception as err:
    print(err)
# No numeric types to aggregate
print(df.groupby(['B']).agg(np.sum))
# A
# B
# 81
# 111 0.36
Aggregation using np.sum works because
In [103]: np.sum(pd.Series(['']))
Out[103]: ''
whereas np.mean(pd.Series([''])) raises
TypeError: Could not convert to numeric
To debug the problem, you could try to find the non-numeric value(s) using:
for val in df['A']:
    if not isinstance(val, float):
        print('Error: val = {!r}'.format(val))
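Once the offending values have been located, a common follow-up (not from the original answer) is to coerce the column to a numeric dtype so that describe() and mean aggregations treat it as float64:
# Non-numeric entries such as empty strings become NaN after coercion.
df['A'] = pd.to_numeric(df['A'], errors='coerce')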