JsonSchema Validation Exception - jsonschema

I am trying to do JSON Schema validation in REST Assured by matching the response body against a JSON Schema on the classpath. I am getting the following error:
io.restassured.module.jsv.JsonSchemaValidationException: com.fasterxml.jackson.core.JsonParseException: Non-standard token 'NaN': enable JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS to allow
at [Source: (StringReader); line: 1, column: 3948]
The response body contains the following values: "20": NaN, "21": NaN, "22": NaN, "23": NaN, "24": NaN, "25": "25minutes", "26": NaN, "27": NaN, "28": NaN, "29": NaN, which cause the problem. How can I avoid this error?

NaN is not valid JSON - the JSON specification does not allow NaN or Infinity as number values. The cleanest fix is to change the service to return null (or a string) instead of NaN; alternatively, enable the Jackson feature named in the error message (ALLOW_NON_NUMERIC_NUMBERS) on the parser before validating.
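To see the same rule outside of Java: Python's json module treats NaN the same way Jackson does - it is only accepted as a non-standard extension. A minimal sketch:

```python
import json
import math

# Python's json module accepts NaN only as a non-standard extension,
# much like Jackson's ALLOW_NON_NUMERIC_NUMBERS feature.
data = json.loads('{"20": NaN}')  # parses, but the input is not valid JSON
assert math.isnan(data["20"])

# A strict serializer refuses to emit NaN at all:
try:
    json.dumps({"20": float("nan")}, allow_nan=False)
except ValueError as err:
    print(err)  # Out of range float values are not JSON compliant

# The usual fix is to emit null instead of NaN on the producing side:
print(json.dumps({"20": None}))  # {"20": null}
```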

Related

How to use scipy.interpolate.interpn function with xarray (3d), to fill nan gaps? Current Error [The points in dimension 0 must be strictly ascending]

I am a bit frustrated, as I could not find a solution to my problem, which seems easy to do in R with the gapfill package, but is more difficult here in Python.
Coming to my problem: I have an xarray (3D) with the dimensions latitude, longitude and time. What I want is to interpolate the NaN values in each raster/array (caused by clouds and other distortions). The NaN values form blocks (due to the clouds) and are sometimes relatively big. My idea is to interpolate not only with the neighbouring pixels of each timestep but also with the timesteps before and after (the assumption is that the pixels some days before and some days after have relatively similar values, as the land coverage does not change that fast). My aim is to do a linear interpolation over time at the same pixel position. (How many timesteps before and after to use is also something I am not sure how to define in the interpn function.)
I found different options to do this, but none has worked so far. The most promising method I found is scipy's interpolate.interpn function. This function uses NumPy arrays, not xarrays. My attempt:
# change from xarray to numpy
array_np = my_array.to_numpy()
# label dimensions (as is done when building a numpy grid with meshgrid)
x = array_np[0]
y = array_np[1]
z = array_np[2]
# get indices of nan values
nanIndex = np.isnan(array_np).nonzero()
nanIndex
# name dimensions of nan values
xc = nanIndex[0]
yc = nanIndex[1]
zc = nanIndex[2]
# For using the scipy interpolate.interpn function:
# points = the regular grid - in my case x, y, z
# values = the data on the regular grid - in my case my array (array_np)
# points_nan = the points that are evaluated in the 3D grid - in my case xc, yc, zc
points = (x, y, z)  # dimensions
points_nan = (xc, yc, zc)  # nan dimensions
print(interpolate.interpn(points, array_np, points_nan))
What I get now as an error is:
"The points in dimension 0 must be strictly ascending"
Where am I going wrong? Thanks for your help in advance! If you have other solutions besides scipy that solve my problem, I am also happy for the help!
This is how my array looks:
array([[[ nan, nan, nan, ..., 279.64 , 282.16998,
279.66998],
[277.62 , 277.52 , 277.88 , ..., 281.75998, 281.72 ,
281.66 ],
[277.38 , 277.75 , 277.88998, ..., 281.75998, 281.75998,
280.91998],
...,
[ nan, nan, nan, ..., 280.72998, 280.33 ,
280.94 ],
[ nan, nan, nan, ..., nan, nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan]],
[[ nan, nan, nan, ..., 272.22 , 271.54 ,
271.02 ],
[280.02 , 280.44998, 281.18 , ..., 271.47998, 271.88 ,
272.03 ],
[280.32 , 281. , 281.27 , ..., 270.83 , 271.58 ,
272.03 ],
...,
[ nan, nan, nan, ..., 290.34 , 290.25 ,
288.365 ],
[ nan, nan, nan, ..., nan, nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan]],
[[ nan, nan, nan, ..., nan, nan,
nan],
[276.44998, 276.19998, 276.19 , ..., nan, nan,
nan],
[276.50998, 276.79 , 276.58 , ..., nan, nan,
nan],
...,
[ nan, nan, nan, ..., nan, nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan]],
...,
[[ nan, nan, nan, ..., 276.38998, 276.44 ,
275.72998],
[ nan, nan, nan, ..., 276.55 , 276.81 ,
276.72998],
[ nan, nan, nan, ..., 279.74 , 277.11 ,
276.97 ],
...,
[ nan, nan, nan, ..., nan, nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan]],
[[ nan, nan, nan, ..., 277.38 , 278.08 ,
277.79 ],
[279.66998, 280.00998, 283.13 , ..., 277.34 , 277.41998,
277.62 ],
[ nan, 277.41 , 277.41 , ..., 277.825 , 277.31 ,
277.52 ],
...,
[ nan, nan, nan, ..., 276.52 , nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan]],
[[ nan, nan, nan, ..., nan, nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan],
...,
[ nan, nan, nan, ..., nan, nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan]]], dtype=float32)
interpn cannot be used to fill gaps in a regular grid - interpn is a fast method for interpolating a full regular grid (with no gaps) onto different coordinates.
To fill missing values with N-dimensional interpolation, use one of the scipy interpolation methods for unstructured N-dimensional data.
Since you're interpolating to a regular grid, I'll demo the use of scipy.interpolate.griddata:
import xarray as xr, pandas as pd, numpy as np, scipy.interpolate

# create dummy data
x = y = z = np.linspace(0, 1, 5)
da = xr.DataArray(
    np.sin(x).reshape(-1, 1, 1) * np.cos(y).reshape(1, -1, 1) + z.reshape(1, 1, -1),
    dims=['x', 'y', 'z'],
    coords=[x, y, z],
)

# randomly fill with NaNs
da = da.where(np.random.random(size=da.shape) > 0.1)
This looks like the following
In [11]: da
Out[11]:
<xarray.DataArray (x: 5, y: 5, z: 5)>
array([[[0. , 0.25 , 0.5 , 0.75 , 1. ],
[0. , 0.25 , 0.5 , 0.75 , 1. ],
[0. , 0.25 , 0.5 , 0.75 , 1. ],
[0. , 0.25 , 0.5 , 0.75 , 1. ],
[0. , 0.25 , 0.5 , 0.75 , 1. ]],
[[0.24740396, 0.49740396, 0.74740396, 0.99740396, nan],
[ nan, 0.48971277, nan, 0.98971277, 1.23971277],
[0.2171174 , 0.4671174 , 0.7171174 , 0.9671174 , 1.2171174 ],
[0.18102272, 0.43102272, 0.68102272, 0.93102272, 1.18102272],
[ nan, 0.38367293, 0.63367293, 0.88367293, 1.13367293]],
[[0.47942554, nan, 0.97942554, 1.22942554, 1.47942554],
[0.46452136, 0.71452136, 0.96452136, 1.21452136, 1.46452136],
[0.42073549, 0.67073549, 0.92073549, 1.17073549, nan],
[0.35079033, 0.60079033, 0.85079033, 1.10079033, 1.35079033],
[ nan, 0.50903472, 0.75903472, nan, 1.25903472]],
[[0.68163876, 0.93163876, 1.18163876, 1.43163876, 1.68163876],
[0.66044826, 0.91044826, nan, 1.41044826, 1.66044826],
[0.59819429, 0.84819429, 1.09819429, 1.34819429, nan],
[ nan, 0.74874749, 0.99874749, 1.24874749, nan],
[0.36829099, 0.61829099, 0.86829099, 1.11829099, 1.36829099]],
[[0.84147098, 1.09147098, nan, 1.59147098, 1.84147098],
[0.81531169, 1.06531169, 1.31531169, 1.56531169, 1.81531169],
[0.73846026, 0.98846026, 1.23846026, 1.48846026, nan],
[0.61569495, 0.86569495, 1.11569495, 1.36569495, 1.61569495],
[0.45464871, 0.70464871, 0.95464871, nan, 1.45464871]]])
Coordinates:
* x (x) float64 0.0 0.25 0.5 0.75 1.0
* y (y) float64 0.0 0.25 0.5 0.75 1.0
* z (z) float64 0.0 0.25 0.5 0.75 1.0
To use an unstructured scipy interpolator, you must convert the gridded data with missing values to vectors of 1D points with no missing values:
# ravel all points and find the valid ones
points = da.data.ravel()
valid = ~np.isnan(points)
points_valid = points[valid]
# construct arrays of (x, y, z) points, masked to only include the valid points
xx, yy, zz = np.meshgrid(x, y, z)
xx, yy, zz = xx.ravel(), yy.ravel(), zz.ravel()
xxv = xx[valid]
yyv = yy[valid]
zzv = zz[valid]
# feed these into the interpolator, and also provide the target grid
interpolated = scipy.interpolate.griddata(np.stack([xxv, yyv, zzv]).T, points_valid, (xx, yy, zz), method="linear")
# reshape to match the original array and replace the DataArray values with
# the interpolated data
da.values = interpolated.reshape(da.shape)
This results in the array being filled
In [32]: da
Out[32]:
<xarray.DataArray (x: 5, y: 5, z: 5)>
array([[[0. , 0.25 , 0.5 , 0.75 , 1. ],
[0. , 0.25 , 0.5 , 0.75 , 1. ],
[0. , 0.25 , 0.5 , 0.75 , 1. ],
[0. , 0.25 , 0.5 , 0.75 , 1. ],
[0. , 0.25 , 0.5 , 0.75 , 1. ]],
[[0.24740396, 0.49740396, 0.74740396, 0.99740396, 1.23971277],
[0.23226068, 0.48971277, 0.73226068, 0.98971277, 1.23971277],
[0.2171174 , 0.4671174 , 0.7171174 , 0.9671174 , 1.2171174 ],
[0.18102272, 0.43102272, 0.68102272, 0.93102272, 1.18102272],
[0.12276366, 0.38367293, 0.63367293, 0.88367293, 1.13367293]],
[[0.47942554, 0.71452136, 0.97942554, 1.22942554, 1.47942554],
[0.46452136, 0.71452136, 0.96452136, 1.21452136, 1.46452136],
[0.42073549, 0.67073549, 0.92073549, 1.17073549, 1.40765584],
[0.35079033, 0.60079033, 0.85079033, 1.10079033, 1.35079033],
[0.24552733, 0.50903472, 0.75903472, 1.00903472, 1.25903472]],
[[0.68163876, 0.93163876, 1.18163876, 1.43163876, 1.68163876],
[0.66044826, 0.91044826, 1.16044826, 1.41044826, 1.66044826],
[0.59819429, 0.84819429, 1.09819429, 1.34819429, 1.57184545],
[0.48324264, 0.74874749, 0.99874749, 1.24874749, 1.48324264],
[0.36829099, 0.61829099, 0.86829099, 1.11829099, 1.36829099]],
[[0.84147098, 1.09147098, 1.34147098, 1.59147098, 1.84147098],
[0.81531169, 1.06531169, 1.31531169, 1.56531169, 1.81531169],
[0.73846026, 0.98846026, 1.23846026, 1.48846026, 1.71550332],
[0.61569495, 0.86569495, 1.11569495, 1.36569495, 1.61569495],
[0.45464871, 0.70464871, 0.95464871, 1.20464871, 1.45464871]]])
Coordinates:
* x (x) float64 0.0 0.25 0.5 0.75 1.0
* y (y) float64 0.0 0.25 0.5 0.75 1.0
* z (z) float64 0.0 0.25 0.5 0.75 1.0
Note that this filled the complete array because the convex hull of available points covers the whole array. If this is not the case, you may need a second step using nearest neighbor or fitting a spline to the filled data.
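For that second step, here is a sketch (with dummy data made up for the illustration) of patching the points that linear interpolation leaves as NaN with a nearest-neighbour pass:

```python
import numpy as np
import scipy.interpolate

# Dummy data (made up for this sketch): a plane with the y = 0 edge missing,
# so those points fall outside the convex hull of the valid samples.
x = y = np.linspace(0, 1, 5)
xx, yy = np.meshgrid(x, y)
values = xx + yy
values[0, :] = np.nan

valid = ~np.isnan(values)
pts = np.stack([xx[valid], yy[valid]]).T

# First pass: linear interpolation fills interior gaps but leaves the
# out-of-hull edge as NaN.
filled = scipy.interpolate.griddata(pts, values[valid], (xx, yy), method="linear")

# Second pass: patch the remaining NaNs with the nearest valid value.
still_nan = np.isnan(filled)
filled[still_nan] = scipy.interpolate.griddata(
    pts, values[valid], (xx[still_nan], yy[still_nan]), method="nearest"
)
print(np.isnan(filled).any())  # False
```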

How to assign increment values to pandas column names?

For any columns without column names, I want to assign incrementing numbers as names. That is, if a column name is NaN, assign 1, 2, 3, and so on; if a column name exists, leave it as it is.
Here, column 28 onwards do not have column names.
My code below did not change the column names.
import pandas as pd
import numpy as np
# Arbitrarily assign the NaN column names with numbers (i.e., column 28 onwards)
df.iloc[:, 27:].columns = range(1, df.iloc[:, 27:].shape[1] + 1)
df.columns
Original column names
df.columns
Index([ 'strand', 'start',
'stop', 'total_probes',
'gene_assignment', 'mrna_assignment',
'swissprot', 'unigene',
'GO_biological_process', 'GO_cellular_component',
'GO_molecular_function', 'pathway',
'protein_domains', 'crosshyb_type',
'category', 'seqname',
'Gene Title', 'Cytoband',
'Entrez Gene', 'Swiss-Prot',
'UniGene', 'GO Biological Process',
'GO Cellular Component', 'GO Molecular Function',
'Pathway', 'Protein Domains',
'Probe ID', nan,
nan, nan,
nan, nan,
nan, nan,
nan, nan,
nan, nan,
nan, nan,
nan, nan,
nan, nan,
nan, nan,
nan, nan,
nan, nan,
nan, nan,
nan, nan,
nan, nan],
dtype='object', name=0)
Expected output:
Index([ 'strand', 'start',
'stop', 'total_probes',
'gene_assignment', 'mrna_assignment',
'swissprot', 'unigene',
'GO_biological_process', 'GO_cellular_component',
'GO_molecular_function', 'pathway',
'protein_domains', 'crosshyb_type',
'category', 'seqname',
'Gene Title', 'Cytoband',
'Entrez Gene', 'Swiss-Prot',
'UniGene', 'GO Biological Process',
'GO Cellular Component', 'GO Molecular Function',
'Pathway', 'Protein Domains',
'Probe ID', 1,
2, 3,
4, 5,
6, 7,
8, 9,
10, 11,
12, 13,
14, 15,
16, 17,
18, 19,
20, 21,
22, 23,
24, 25,
26, 27,
28, 29],
dtype='object', name=0)
This will do it. The reason your attempt had no effect is that df.iloc[:, 27:] returns a new object, so assigning to its columns attribute never touches the original DataFrame. Instead, build the full list of names and assign it to df.columns:
temp_columns_name = []
nan_count = 1
for i in df.columns:
    if pd.isnull(i):
        temp_columns_name.append(nan_count)
        nan_count += 1
    else:
        temp_columns_name.append(i)
df.columns = temp_columns_name
print(df.columns)
Output:
['strand',
'start',
'stop',
'total_probes',
'gene_assignment',
'mrna_assignment',
'swissprot',
'unigene',
'GO_biological_process',
'GO_cellular_component',
'GO_molecular_function',
'pathway',
'protein_domains',
'crosshyb_type',
'category',
'seqname',
'Gene Title',
'Cytoband',
'Entrez Gene',
'Swiss-Prot',
'UniGene',
'GO Biological Process',
'GO Cellular Component',
'GO Molecular Function',
'Pathway',
'Protein Domains',
'Probe ID',
1,
2,
3,
4,
5,
6,
7,
8,
9,
10,
11,
12,
13,
14,
15,
16,
17,
18,
19,
20,
21,
22,
23,
24,
25,
26,
27,
28,
29]
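As a variant, the same loop can be written as a list comprehension over the existing columns (the toy column names below are made up for the demo):

```python
import pandas as pd
import numpy as np

# Toy frame with some unnamed (NaN) columns - the names are hypothetical
df = pd.DataFrame([[1, 2, 3, 4]], columns=['strand', np.nan, 'stop', np.nan])

# Draw the next number only when a name is missing
counter = iter(range(1, len(df.columns) + 1))
df.columns = [next(counter) if pd.isnull(c) else c for c in df.columns]
print(list(df.columns))  # ['strand', 1, 'stop', 2]
```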

How to (correctly) merge 2 Pandas DataFrames and scatter-plot

Thank you for your answers, in advance.
My end goal is to produce a scatter-plot - corruption as an explanatory variable (x axis, from a DataFrame 'corr') and inequality as a dependent variable (y axis, from a DataFrame 'inq').
A hint on how to produce an informative table (DataFrame) by joining these two DataFrames would be much appreciated.
I have a DataFrame 'inq' for country inequality (GINI index) and another one, 'corr', for a country corruption index.
pd.DataFrame(
{
"country": {0: "Angola", 1: "Albania", 2: "United Arab Emirates"},
"1975": {0: nan, 1: nan, 2: nan},
"1976": {0: nan, 1: nan, 2: nan},
"2017": {0: nan, 1: 33.2, 2: nan},
"2018": {0: 51.3, 1: nan, 2: nan},
}
)
pd.DataFrame(
{
"country": {0: "Afghanistan", 1: "Angola", 2: "Albania"},
"1975": {0: 44.8, 1: 48.1, 2: 75.1},
"1976": {0: 44.8, 1: 48.1, 2: 75.1},
"2018": {0: 24.2, 1: 40.4, 2: 28.4},
"2019": {0: 40.5, 1: 37.6, 2: 35.9},
}
)
I concatenate and manipulate and get
cm = pd.concat([inq, corr], axis=0, keys=["Inequality", "Corruption"]).reset_index(
level=1, drop=True
)
a new DataFrame:
pd.DataFrame(
{
"indicator": {0: "Inequality", 1: "Inequality", 2: "Inequality"},
"country": {0: "Angola", 1: "Albania", 2: "United Arab Emirates"},
"1967": {0: nan, 1: nan, 2: nan},
"1969": {0: nan, 1: nan, 2: nan},
"2018": {0: 51.3, 1: nan, 2: nan},
"2019": {0: nan, 1: nan, 2: nan},
}
)
You should concatenate your DataFrames in a different way:
df = (pd.concat([inq.set_index('country'),
corr.set_index('country')],
axis=1,
keys=["Inequality", "Corruption"]
)
.stack(level=1)
)
Inequality Corruption
country
Angola 1975 NaN 48.1
1976 NaN 48.1
2018 51.3 40.4
2019 NaN 37.6
Albania 1975 NaN 75.1
1976 NaN 75.1
2017 33.2 NaN
2018 NaN 28.4
2019 NaN 35.9
Afghanistan 1975 NaN 44.8
1976 NaN 44.8
2018 NaN 24.2
2019 NaN 40.5
Then to plot:
df.plot.scatter(x='Corruption', y='Inequality')
NB: there is only one point, as most of your data is NaN.
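To keep only complete observations before plotting, a small sketch (with invented numbers) of dropping the rows where either variable is missing:

```python
import pandas as pd
import numpy as np

# Invented values mimicking the stacked two-column frame above
df = pd.DataFrame({
    "Inequality": [np.nan, 51.3, np.nan, 33.2],
    "Corruption": [48.1, 40.4, 75.1, np.nan],
})

plot_df = df.dropna()  # keep only rows where both values are present
print(plot_df)
# plot_df.plot.scatter(x="Corruption", y="Inequality")
```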

Efficient way to create a dictionary from a symmetric matrix, with column-row pairs as keys and the corresponding matrix entries as values

I want to create a dictionary in the form (row, column): value from a symmetric matrix (like a distance matrix), as depicted below, without taking into account the NaN values or zeros (the zeros are on the diagonal). The matrix is a pandas DataFrame.
Material 100051 100120 100138 100179 100253 100265 100281
100051 0.0 0.953488 0.959302 0.953488 0.959302 0.953488 0.953488
100120 NaN 0.000000 0.965116 0.953488 0.959302 0.959302 0.959302
100138 NaN NaN 0.000000 0.959302 0.970930 0.970930 0.970930
100179 NaN NaN NaN 0.000000 0.959302 0.953488 0.953488
100253 NaN NaN NaN NaN 0.000000 0.976744 0.976744
... ... ... ... ... ... ... ...
So a dictionary that looks like:
{(100120, 100051): 0.953488, (100138, 100051): 0.959302, ...}
For creating a dictionary, you can probably iterate over both rows and columns like:
jacsim_values = {}
for i in jacsim_matrix2:
    for j in jacsim_matrix2:
        if jacsim_matrix2[i][j] != 0:
            jacsim_values[i, j] = jacsim_matrix2[i][j]
But I am looking for something more efficient. This takes quite some time for the size of the matrix. However, I could not find how to do so. Is there somebody who can help me out?
IIUC, use DataFrame.stack (keys as (row, column)) or DataFrame.unstack (keys as (column, row)) together with Series.to_dict:
df.set_index('Material').rename(int, axis=1).unstack().to_dict()
{(100051, 100051): 0.0,
(100051, 100120): nan,
(100051, 100138): nan,
(100051, 100179): nan,
(100051, 100253): nan,
(100120, 100051): 0.9534879999999999,
(100120, 100120): 0.0,
(100120, 100138): nan,
(100120, 100179): nan,
(100120, 100253): nan,
(100138, 100051): 0.9593020000000001,
(100138, 100120): 0.965116,
(100138, 100138): 0.0,
(100138, 100179): nan,
(100138, 100253): nan,
(100179, 100051): 0.9534879999999999,
(100179, 100120): 0.9534879999999999,
(100179, 100138): 0.9593020000000001,
(100179, 100179): 0.0,
(100179, 100253): nan,
(100253, 100051): 0.9593020000000001,
(100253, 100120): 0.9593020000000001,
(100253, 100138): 0.97093,
(100253, 100179): 0.9593020000000001,
(100253, 100253): 0.0,
(100265, 100051): 0.9534879999999999,
(100265, 100120): 0.9593020000000001,
(100265, 100138): 0.97093,
(100265, 100179): 0.9534879999999999,
(100265, 100253): 0.9767440000000001,
(100281, 100051): 0.9534879999999999,
(100281, 100120): 0.9593020000000001,
(100281, 100138): 0.97093,
(100281, 100179): 0.9534879999999999,
(100281, 100253): 0.9767440000000001}
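Note that this keeps the NaN and diagonal-zero entries the question wanted to exclude. One way to filter them out before calling to_dict, sketched on a tiny made-up stand-in for the distance matrix:

```python
import pandas as pd
import numpy as np

# Tiny made-up stand-in for the distance matrix
df = pd.DataFrame({
    "Material": [100051, 100120],
    "100051": [0.0, np.nan],
    "100120": [0.953488, 0.0],
})

s = df.set_index("Material").rename(int, axis=1).unstack()
# keep only entries that are neither NaN nor zero
result = s[s.notna() & (s != 0)].to_dict()
print(result)  # {(100120, 100051): 0.953488}
```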

data Frame to dictionary

I can create a new DataFrame based on the list of dicts. But how do I get the same list back from the DataFrame?
mylist=[{'points': 50, 'time': '5:00', 'year': 2010},
{'points': 25, 'time': '6:00', 'month': "february"},
{'points':90, 'time': '9:00', 'month': 'january'},
{'points_h1':20, 'month': 'june'}]
import pandas as pd
df = pd.DataFrame(mylist)
The following returns a dictionary keyed by column rather than by row as shown in the example above:
In [18]: df.to_dict()
Out[18]:
{'month': {0: nan, 1: 'february', 2: 'january', 3: 'june'},
'points': {0: 50.0, 1: 25.0, 2: 90.0, 3: nan},
'points_h1': {0: nan, 1: nan, 2: nan, 3: 20.0},
'time': {0: '5:00', 1: '6:00', 2: '9:00', 3: nan},
'year': {0: 2010.0, 1: nan, 2: nan, 3: nan}}
df.to_dict(orient='records')
(In older pandas versions this parameter was called outtype.) Reference: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_dict.html
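A quick round-trip sketch showing what the records orientation returns (note that cells missing from a row come back as NaN floats rather than being dropped):

```python
import math
import pandas as pd

mylist = [{'points': 50, 'time': '5:00', 'year': 2010},
          {'points_h1': 20, 'month': 'june'}]
df = pd.DataFrame(mylist)

records = df.to_dict(orient='records')
print(records[0]['points'], records[0]['time'])   # 50.0 5:00
# Columns absent from a row come back as NaN, not dropped:
print(math.isnan(records[1]['points']))           # True
```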