Trying to subplot groups with date axis different methods

Trying to subplot groups with date axis different methods - pandas

I got a DataFrame:
date phone sensor pallet
126 2019-04-15 940203 C0382C391A4D 47
127 2019-04-15 940203 C0382D392A4D 47
133 2019-04-16 940203 C0382C391A4D 47
134 2019-04-16 940203 C0382D392A4D 47
138 2019-04-17 940203 C0382C391A4D 47
139 2019-04-17 940203 C0382D392A4D 47
144 2019-04-18 940203 C0382C391A4D 47
145 2019-04-18 940203 C0382D392A4D 47
156 2019-04-19 940203 C0382D392A4D 47
157 2019-04-19 940203 C0382C391A4D 47
277 2019-04-15 941557 C0392D362735 32
279 2019-04-15 941557 C03633364D50 32
286 2019-04-16 941557 C03633364D50 32
287 2019-04-16 941557 C0392D362735 32
296 2019-04-17 941557 C03633364D50 32
297 2019-04-17 941557 C0392D362735 32
305 2019-04-18 941557 C0392D362735 32
306 2019-04-18 941557 C03633364D50 32
317 2019-04-19 941557 C03633364D50 32
318 2019-04-19 941557 C0392D362735 32
561 2019-04-15 942316 C0384639224D 45
562 2019-04-15 942316 C03632364950 45
563 2019-04-15 942316 C03920363835 45
564 2019-04-15 942316 C0382939384D 45
573 2019-04-16 942316 C0382939384D 45
574 2019-04-16 942316 C0384639224D 45
575 2019-04-16 942316 C03632364950 45
i want to be able to make subplot for each pallet which contain the sensors arrived in each date.
example:
i have tried few methods:
ax.plot_date
looping through opened ax's and plotting through each 1
grouped = pallets_arrived.groupby('pallet')
nrows = 2
ncols = 2
fig, axs = plt.subplots(nrows, ncols)
targets = zip(grouped.groups.keys(), axs.flatten())
for i, (key, ax) in enumerate(targets):
ax.plot_date(grouped.get_group(key)['date'], grouped.get_group(key)['sensor'], 'o')
plt.show()
return pallets_arrived
which gives wierdly formatted repeating dates (index the Df with date isnt solving the prob)
Df plotting
grouped = pallets_arrived.groupby('pallet')
nrows = 2
ncols = 2
fig, axs = plt.subplots(nrows, ncols)
targets = zip(grouped.groups.keys(), axs.flatten())
for i, (key, ax) in enumerate(targets):
grouped.get_group(key).plot(x='date', y='sensor', ax=ax)
ax.legend()
plt.show()
or
grouped = pallets_arrived.set_index('date').groupby('pallet')
nrows = 2
ncols = 2
fig, axs = plt.subplots(nrows, ncols)
targets = zip(grouped.groups.keys(), axs.flatten())
for i, (key, ax) in enumerate(targets):
grouped.get_group(key).plot(grouped.get_group(key).index, y='sensor', ax=ax)
ax.legend()
plt.show()
pyplot
grouped = pallets_arrived.groupby('pallet')
nrows = 2
ncols = 2
fig, axs = plt.subplots(nrows, ncols)
targets = zip(grouped.groups.keys(), axs.flatten())
for i, (key, ax) in enumerate(targets):
plt.sca(ax)
plt.plot(grouped.get_group(key)['date'], grouped.get_group(key)['sensor'])
ax.legend()
plt.show()
which again
Pivot pallets to Plot() on columns(pallets)
which doesnt work because there are more than 1 sensor in each pallet in same date. so there is a duplicated value error...
I really dont know what method to use to make this 1 correct:
grouping similar dates in x axis.
being able to plot each pallet to different subplot.
i think i dont get the pandas wrapping of matplotlib correctly.
ill be glad for some explenation because im reading guides and cant understand the preferred method for those stuff.
Thanks alot for the helpers.

you can use matplotlib to plot categorical data:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
%matplotlib inline
fig, ax = plt.subplots()
ax.scatter(df['date'], df['sensor'])
plt.show()
or if you want to color the groups:
fig, ax = plt.subplots()
for _,g in df.groupby('pallet'):
ax.scatter(g['date'], g['sensor'])
plt.show()
you can also add a legend:
fig, ax = plt.subplots()
for _,g in df.groupby('pallet'):
ax.scatter(g['date'], g['sensor'], label='Pallet_'+str(_))
ax.legend()
plt.show()

Related

ValueError: Number of columns must be a positive integer, not 0

when i want to execute below code and plot figer
scatter_matrix(total_frame)
total_frame is a dataframe like this
the error like this:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_11336\1619863705.py in <module>
1 total_frame.dropna(how='any')
----> 2 scatter_matrix(total_frame)
3 plt.show()
~\.conda\envs\env2\lib\site-packages\pandas\plotting\_misc.py in scatter_matrix(frame, alpha, figsize, ax, grid, diagonal, marker, density_kwds, hist_kwds, range_padding, **kwargs)
137 hist_kwds=hist_kwds,
138 range_padding=range_padding,
--> 139 **kwargs,
140 )
141
~\.conda\envs\env2\lib\site-packages\pandas\plotting\_matplotlib\misc.py in scatter_matrix(frame, alpha, figsize, ax, grid, diagonal, marker, density_kwds, hist_kwds, range_padding, **kwds)
48 n = df.columns.size
49 naxes = n * n
---> 50 fig, axes = create_subplots(naxes=naxes, figsize=figsize, ax=ax, squeeze=False)
51
52 # no gaps between subplots
~\.conda\envs\env2\lib\site-packages\pandas\plotting\_matplotlib\tools.py in create_subplots(naxes, sharex, sharey, squeeze, subplot_kw, ax, layout, layout_type, **fig_kw)
265
266 # Create first subplot separately, so we can share it if requested
--> 267 ax0 = fig.add_subplot(nrows, ncols, 1, **subplot_kw)
268
269 if sharex:
~\.conda\envs\env2\lib\site-packages\matplotlib\figure.py in add_subplot(self, *args, **kwargs)
770 projection_class, pkw = self._process_projection_requirements(
771 *args, **kwargs)
--> 772 ax = subplot_class_factory(projection_class)(self, *args, **pkw)
773 key = (projection_class, pkw)
774 return self._add_axes_internal(ax, key)
~\.conda\envs\env2\lib\site-packages\matplotlib\axes\_subplots.py in __init__(self, fig, *args, **kwargs)
34 self._axes_class.__init__(self, fig, [0, 0, 1, 1], **kwargs)
35 # This will also update the axes position.
---> 36 self.set_subplotspec(SubplotSpec._from_subplot_args(fig, args))
37
38 #_api.deprecated(
~\.conda\envs\env2\lib\site-packages\matplotlib\gridspec.py in _from_subplot_args(figure, args)
595 f"{len(args)} were given")
596
--> 597 gs = GridSpec._check_gridspec_exists(figure, rows, cols)
598 if gs is None:
599 gs = GridSpec(rows, cols, figure=figure)
~\.conda\envs\env2\lib\site-packages\matplotlib\gridspec.py in _check_gridspec_exists(figure, nrows, ncols)
223 return gs
224 # else gridspec not found:
--> 225 return GridSpec(nrows, ncols, figure=figure)
226
227 def __getitem__(self, key):
~\.conda\envs\env2\lib\site-packages\matplotlib\gridspec.py in __init__(self, nrows, ncols, figure, left, bottom, right, top, wspace, hspace, width_ratios, height_ratios)
385 super().__init__(nrows, ncols,
386 width_ratios=width_ratios,
--> 387 height_ratios=height_ratios)
388
389 _AllowedKeys = ["left", "bottom", "right", "top", "wspace", "hspace"]
~\.conda\envs\env2\lib\site-packages\matplotlib\gridspec.py in __init__(self, nrows, ncols, height_ratios, width_ratios)
51 if not isinstance(ncols, Integral) or ncols <= 0:
52 raise ValueError(
---> 53 f"Number of columns must be a positive integer, not {ncols!r}")
54 self._nrows, self._ncols = nrows, ncols
55 self.set_height_ratios(height_ratios)
ValueError: Number of columns must be a positive integer, not 0
<Figure size 432x288 with 0 Axes>
i search such error and don't find anything,please help me!!!!!

i have solved it,
my data's class is object,the function need num,
so i use pd.convert_dtypes() and it works

How to fix 'X.toarray() to convert to a dense numpy array' error on sklearn fit_transform?

R&D Spend Administration Marketing Spend State Profit
0 165349.20 136897.80 471784.10 New York 192261.83
1 162597.70 151377.59 443898.53 California 191792.06
2 153441.51 101145.55 407934.54 Florida 191050.39
3 144372.41 118671.85 383199.62 New York 182901.99
4 142107.34 91391.77 366168.42 Florida 166187.94
5 131876.90 99814.71 362861.36 New York 156991.12
6 134615.46 147198.87 127716.82 California 156122.51
7 130298.13 145530.06 323876.68 Florida 155752.60
8 120542.52 148718.95 311613.29 New York 152211.77
9 123334.88 108679.17 304981.62 California 149759.96
10 101913.08 110594.11 229160.95 Florida 146121.95
11 100671.96 91790.61 249744.55 California 144259.40
12 93863.75 127320.38 249839.44 Florida 141585.52
13 91992.39 135495.07 252664.93 California 134307.35
14 119943.24 156547.42 256512.92 Florida 132602.65
15 114523.61 122616.84 261776.23 New York 129917.04
16 78013.11 121597.55 264346.06 California 126992.93
17 94657.16 145077.58 282574.31 New York 125370.37
18 91749.16 114175.79 294919.57 Florida 124266.90
19 86419.70 153514.11 0.00 New York 122776.86
20 76253.86 113867.30 298664.47 California 118474.03
21 78389.47 153773.43 299737.29 New York 111313.02
22 73994.56 122782.75 303319.26 Florida 110352.25
23 67532.53 105751.03 304768.73 Florida 108733.99
24 77044.01 99281.34 140574.81 New York 108552.04
25 64664.71 139553.16 137962.62 California 107404.34
26 75328.87 144135.98 134050.07 Florida 105733.54
27 72107.60 127864.55 353183.81 New York 105008.31
28 66051.52 182645.56 118148.20 Florida 103282.38
29 65605.48 153032.06 107138.38 New York 101004.64
30 61994.48 115641.28 91131.24 Florida 99937.59
31 61136.38 152701.92 88218.23 New York 97483.56
32 63408.86 129219.61 46085.25 California 97427.84
33 55493.95 103057.49 214634.81 Florida 96778.92
34 46426.07 157693.92 210797.67 California 96712.80
35 46014.02 85047.44 205517.64 New York 96479.51
36 28663.76 127056.21 201126.82 Florida 90708.19
37 44069.95 51283.14 197029.42 California 89949.14
38 20229.59 65947.93 185265.10 New York 81229.06
39 38558.51 82982.09 174999.30 California 81005.76
40 28754.33 118546.05 172795.67 California 78239.91
41 27892.92 84710.77 164470.71 Florida 77798.83
42 23640.93 96189.63 148001.11 California 71498.49
43 15505.73 127382.30 35534.17 New York 69758.98
44 22177.74 154806.14 28334.72 California 65200.33
45 1000.23 124153.04 1903.93 New York 64926.08
46 1315.46 115816.21 297114.46 Florida 49490.75
47 0.00 135426.92 0.00 California 42559.73
48 542.05 51743.15 0.00 New York 35673.41
49 0.00 116983.80 45173.06 California 14681.40
code
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('State', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X), dtype=np.float)
I keep getting an error like -
TypeError Traceback (most recent call last)
<ipython-input-36-17f64bed7e4c> in <module>
3
4 ct = ColumnTransformer([('State', OneHotEncoder(), [3])], remainder='passthrough')
----> 5 X = np.array(ct.fit_transform(X), dtype=object)
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\sklearn\compose\_column_transformer.py in fit_transform(self, X, y)
516 self._validate_remainder(X)
517
--> 518 result = self._fit_transform(X, y, _fit_transform_one)
519
520 if not result:
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\sklearn\compose\_column_transformer.py in _fit_transform(self, X, y, func, fitted)
446 self._iter(fitted=fitted, replace_strings=True))
447 try:
--> 448 return Parallel(n_jobs=self.n_jobs)(
449 delayed(func)(
450 transformer=clone(trans) if not fitted else trans,
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\joblib\parallel.py in __call__(self, iterable)
1002 # remaining jobs.
1003 self._iterating = False
-> 1004 if self.dispatch_one_batch(iterator):
1005 self._iterating = self._original_iterator is not None
1006
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\joblib\parallel.py in dispatch_one_batch(self, iterator)
833 return False
834 else:
--> 835 self._dispatch(tasks)
836 return True
837
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\joblib\parallel.py in _dispatch(self, batch)
752 with self._lock:
753 job_idx = len(self._jobs)
--> 754 job = self._backend.apply_async(batch, callback=cb)
755 # A job can complete so quickly than its callback is
756 # called before we get here, causing self._jobs to
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\joblib\_parallel_backends.py in apply_async(self, func, callback)
207 def apply_async(self, func, callback=None):
208 """Schedule a func to be run"""
--> 209 result = ImmediateResult(func)
210 if callback:
211 callback(result)
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\joblib\_parallel_backends.py in __init__(self, batch)
588 # Don't delay the application, to avoid keeping the input
589 # arguments in memory
--> 590 self.results = batch()
591
592 def get(self):
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\joblib\parallel.py in __call__(self)
253 # change the default number of processes to -1
254 with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 255 return [func(*args, **kwargs)
256 for func, args, kwargs in self.items]
257
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\joblib\parallel.py in <listcomp>(.0)
253 # change the default number of processes to -1
254 with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 255 return [func(*args, **kwargs)
256 for func, args, kwargs in self.items]
257
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\sklearn\pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
726 with _print_elapsed_time(message_clsname, message):
727 if hasattr(transformer, 'fit_transform'):
--> 728 res = transformer.fit_transform(X, y, **fit_params)
729 else:
730 res = transformer.fit(X, y, **fit_params).transform(X)
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\sklearn\preprocessing\_encoders.py in fit_transform(self, X, y)
370 """
371 self._validate_keywords()
--> 372 return super().fit_transform(X, y)
373
374 def transform(self, X):
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\sklearn\base.py in fit_transform(self, X, y, **fit_params)
569 if y is None:
570 # fit method of arity 1 (unsupervised transformation)
--> 571 return self.fit(X, **fit_params).transform(X)
572 else:
573 # fit method of arity 2 (supervised transformation)
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\sklearn\preprocessing\_encoders.py in fit(self, X, y)
345 """
346 self._validate_keywords()
--> 347 self._fit(X, handle_unknown=self.handle_unknown)
348 self.drop_idx_ = self._compute_drop_idx()
349 return self
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\sklearn\preprocessing\_encoders.py in _fit(self, X, handle_unknown)
72
73 def _fit(self, X, handle_unknown='error'):
---> 74 X_list, n_samples, n_features = self._check_X(X)
75
76 if self.categories != 'auto':
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\sklearn\preprocessing\_encoders.py in _check_X(self, X)
41 if not (hasattr(X, 'iloc') and getattr(X, 'ndim', 0) == 2):
42 # if not a dataframe, do normal check_array validation
---> 43 X_temp = check_array(X, dtype=None)
44 if (not hasattr(X, 'dtype')
45 and np.issubdtype(X_temp.dtype, np.str_)):
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
506 if sp.issparse(array):
507 _ensure_no_complex_data(array)
--> 508 array = _ensure_sparse_format(array, accept_sparse=accept_sparse,
509 dtype=dtype, copy=copy,
510 force_all_finite=force_all_finite,
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\sklearn\utils\validation.py in _ensure_sparse_format(spmatrix, accept_sparse, dtype, copy, force_all_finite, accept_large_sparse)
304
305 if accept_sparse is False:
--> 306 raise TypeError('A sparse matrix was passed, but dense '
307 'data is required. Use X.toarray() to '
308 'convert to a dense numpy array.')
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
Please help me to resolve this issue.....

You can set sparse_threshold=0, not very sure about the rest of your code, what X is etc:
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from sklearn.compose import ColumnTransformer
import pandas as pd
import numpy as np
X = pd.DataFrame({"R&D":[1,2,3,4],
"State":["New Tork","Florida","New York","California"]})
ct = ColumnTransformer([('State', OneHotEncoder(), [0])],
sparse_threshold=0,remainder='passthrough')
np.array(ct.fit_transform(X[['State']]), dtype=np.float)
array([[0., 0., 1., 0.],
[0., 1., 0., 0.],
[0., 0., 0., 1.],
[1., 0., 0., 0.]])

Why are the limits of the histogram data autodetected as [nan, nan] instead of discarding NaNs?

The following code generates an error
print(g['resp'])
par = {'hist': True, 'kde': False, 'fit': scipy.stats.norm, 'bins': 'auto'}
sns.distplot(g['resp'], color='blue', **par)
31 23.0
32 28.0
33 29.0
34 31.0
35 32.0
36 35.0
37 35.0
38 36.0
39 37.0
40 38.0
41 38.0
42 38.0
43 41.0
44 42.0
45 42.0
46 42.0
47 42.0
48 46.0
49 48.0
50 49.0
51 50.0
52 52.0
53 55.0
54 56.0
55 60.0
56 60.0
57 100.0
58 NaN
59 NaN
60 NaN
61 NaN
Name: resp, dtype: float64
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-23-42944bf1e405> in <module>
1 print(g['resp'])
2 par = {'hist': True, 'kde': False, 'fit': scipy.stats.norm, 'bins': 'auto'}
----> 3 sns.distplot(g['resp'], color='blue', **par)
C:\ProgramData\Anaconda3\lib\site-packages\seaborn\distributions.py in distplot(a, bins, hist, kde, rug, fit, hist_kws, kde_kws, rug_kws, fit_kws, color, vertical, norm_hist, axlabel, label, ax)
223 hist_color = hist_kws.pop("color", color)
224 ax.hist(a, bins, orientation=orientation,
--> 225 color=hist_color, **hist_kws)
226 if hist_color != color:
227 hist_kws["color"] = hist_color
C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\__init__.py in inner(ax, data, *args, **kwargs)
1808 "the Matplotlib list!)" % (label_namer, func.__name__),
1809 RuntimeWarning, stacklevel=2)
-> 1810 return func(ax, *args, **kwargs)
1811
1812 inner.__doc__ = _add_data_doc(inner.__doc__,
C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py in hist(self, x, bins, range, density, weights, cumulative, bottom, histtype, align, orientation, rwidth, log, color, label, stacked, normed, **kwargs)
6589 # this will automatically overwrite bins,
6590 # so that each histogram uses the same bins
-> 6591 m, bins = np.histogram(x[i], bins, weights=w[i], **hist_kwargs)
6592 m = m.astype(float) # causes problems later if it's an int
6593 if mlast is None:
C:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\histograms.py in histogram(a, bins, range, normed, weights, density)
708 a, weights = _ravel_and_check_weights(a, weights)
709
--> 710 bin_edges, uniform_bins = _get_bin_edges(a, bins, range, weights)
711
712 # Histogram is an integer or a float array depending on the weights.
C:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\histograms.py in _get_bin_edges(a, bins, range, weights)
331 "bins is not supported for weighted data")
332
--> 333 first_edge, last_edge = _get_outer_edges(a, range)
334
335 # truncate the range if needed
C:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\histograms.py in _get_outer_edges(a, range)
259 if not (np.isfinite(first_edge) and np.isfinite(last_edge)):
260 raise ValueError(
--> 261 "autodetected range of [{}, {}] is not finite".format(first_edge, last_edge))
262
263 # expand empty range to avoid divide by zero
ValueError: autodetected range of [nan, nan] is not finite
It looks like the NaN values are causing trouble - how to discard them?

I think not, so possible solution is Series.dropna for remove missing values:
sns.distplot(g['resp'].dropna(), color='blue', **par)

How to fix "Exception: Data must be 1-dimensional" error when running Kmeans

I have resolved all errors up till now. I am not quite sure I understand the problem except for I get the error "Exception: Data must be 1-dimensional".
Here is my code. Here is a link to the excel file im using.
import pandas as pd
import numpy as np
import warnings
from sklearn import preprocessing
from sklearn.preprocessing import LabelBinarizer
from sklearn.cluster import KMeans
df1 = pd.read_excel('PERM_Disclosure_Data_FY2018_EOYV2.xlsx', 'PERM_FY2018')
warnings.filterwarnings("ignore")
df1 = df1.dropna(subset=['PW_AMOUNT_9089'])
df1 = df1.dropna(subset=['CASE_STATUS'])
df1 = df1.dropna(subset=['PW_SOC_TITLE'])
df1.CASE_STATUS[df1['CASE_STATUS']=='Certified-Expired'] = 'Certified'
df1 = df1[df1.CASE_STATUS != 'Withdrawn']
df1 = df1.dropna()
df1 = df1[df1.PW_AMOUNT_9089 != '#############']
df1 = df1.dropna(subset=['PW_AMOUNT_9089'])
df1 = df1.dropna(subset=['CASE_STATUS'])
df1 = df1.dropna(subset=['PW_SOC_TITLE'])
df1.PW_AMOUNT_9089 = df1.PW_AMOUNT_9089.astype(float)
df1=df1.iloc[:, [2,4,5]]
enc = LabelBinarizer()
y = enc.fit_transform(df1.CASE_STATUS)[:, [0]]
at this point the output for y is an array:
array([[0],
[0],
[0],
...,
[1],
[1],
[0]])
then I define XZ
le = preprocessing.LabelEncoder()
X = df1.iloc[:, [1]]
Z = df1.iloc[:, [2]]
X2 = X.apply(le.fit_transform)
XZ = pd.concat([X2,Z], axis=1)
the output for XZ is:
PW_SOC_TITLE PW_AMOUNT_9089
12 176 60778.0
13 456 100901.0
14 134 134389.0
15 134 104936.0
16 134 95160.0
17 294 66976.0
18 73 38610.0
19 598 122533.0
20 220 109574.0
21 99 67850.0
22 399 132018.0
23 68 56118.0
24 139 136781.0
25 134 111405.0
26 598 58573.0
27 362 75067.0
28 598 85862.0
29 572 33301.0
30 598 112840.0
31 134 134971.0
32 176 100568.0
33 176 100568.0
34 626 19614.0
35 153 26354.0
36 405 79248.0
37 220 93350.0
38 139 153213.0
39 598 131997.0
40 598 131997.0
41 1 90438.0
... ... ...
119741 495 23005.0
119742 63 46030.0
119743 153 20301.0
119744 95 21965.0
119745 153 29890.0
119746 295 79680.0
119747 349 79498.0
119748 223 38930.0
119749 223 38930.0
119750 570 39160.0
119751 302 119392.0
119752 598 106001.0
119753 416 64230.0
119754 598 115482.0
119755 99 80205.0
119756 134 78329.0
119757 598 109325.0
119758 598 109325.0
119759 570 49770.0
119760 194 18117.0
119761 404 46987.0
119762 189 35131.0
119763 73 49900.0
119764 323 32240.0
119765 372 28122.0
119766 468 67974.0
119767 399 78520.0
119768 329 25875.0
119769 329 25875.0
119770 601 82098.0
I then continue:
from sklearn.model_selection import train_test_split
XZ_train, XZ_test, y_train, y_test = train_test_split(XZ, y,
test_size = .25,
random_state=20,
stratify=y )
# loading library
from pandas_ml import ConfusionMatrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# instantiate learning model loop(k = i)
for weights in ['uniform', 'distance']:
for i in range(1,11,2):
knn = KNeighborsClassifier(n_neighbors=i, weights=weights)
# fitting the model
knn.fit(XZ_train, y_train)
# predict the response
pred = knn.predict(XZ_test)
confusion = ConfusionMatrix(y_test, pred)
if i<11:
# evaluate accuracy
print('Weight Measure:', knn.weights)
print('n_neighbors=', knn.n_neighbors)
print('Accuracy=', accuracy_score(y_test, pred))
#print('')
#print('Confusion Matrix')
#print(confusion)
print('-----------------------------')
The error I get is as follows:
G:\Anaconda\lib\site-packages\ipykernel_launcher.py:11: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
# This is added back by InteractiveShellApp.init_path()
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-20-bf6054d911ba> in <module>
12 # predict the response
13 pred = knn.predict(XZ_test)
---> 14 confusion = ConfusionMatrix(y_test, pred)
15 if i<11:
16 # evaluate accuracy
G:\Anaconda\lib\site-packages\pandas_ml\confusion_matrix\cm.py in __new__(cls, y_true, y_pred, *args, **kwargs)
21 if len(set(uniq_true) - set(uniq_pred)) == 0:
22 from pandas_ml.confusion_matrix.bcm import BinaryConfusionMatrix
---> 23 return BinaryConfusionMatrix(y_true, y_pred, *args, **kwargs)
24 return LabeledConfusionMatrix(y_true, y_pred, *args, **kwargs)
25
G:\Anaconda\lib\site-packages\pandas_ml\confusion_matrix\bcm.py in __init__(self, *args, **kwargs)
19 def __init__(self, *args, **kwargs):
20 # super(BinaryConfusionMatrix, self).__init__(y_true, y_pred)
---> 21 super(BinaryConfusionMatrix, self).__init__(*args, **kwargs)
22 assert self.len() == 2, \
23 "Binary confusion matrix must have len=2 but \
G:\Anaconda\lib\site-packages\pandas_ml\confusion_matrix\abstract.py in __init__(self, y_true, y_pred, labels, display_sum, backend, true_name, pred_name)
31 self._y_true.name = self.true_name
32 else:
---> 33 self._y_true = pd.Series(y_true, name=self.true_name)
34
35 if isinstance(y_pred, pd.Series):
G:\Anaconda\lib\site-packages\pandas\core\series.py in __init__(self, data, index, dtype, name, copy, fastpath)
273 else:
274 data = _sanitize_array(data, index, dtype, copy,
--> 275 raise_cast_failure=True)
276
277 data = SingleBlockManager(data, index, fastpath=True)
G:\Anaconda\lib\site-packages\pandas\core\series.py in _sanitize_array(data, index, dtype, copy, raise_cast_failure)
4163 elif subarr.ndim > 1:
4164 if isinstance(data, np.ndarray):
-> 4165 raise Exception('Data must be 1-dimensional')
4166 else:
4167 subarr = com._asarray_tuplesafe(data, dtype=dtype)
Exception: Data must be 1-dimensional
Is the data I am passing through not the correct type? The datatypes match the datatypes I've used in a past project so I thought I could replicate it here. For those wondering X is Company names that I encoded, Y is binarized case status, and Z is a wage amount in the float dtype.

"...the output for y is an array..." The array that you show is two-dimensional, with shape (n, 1). (One of the dimensions is trivial, but it is still 2-d.) Do something like y[:, 0] or y.ravel() to get a 1-d version.

index value vs. flight (data range A row & E row )

I want to know the scatter plot of the sum of the flight fields per minute. My information is as follows
http://python2018.byethost10.com/flights.csv
My grammar is as follows
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib
matplotlib.rcParams['font.sans-serif'] = ['Noto Serif CJK TC']
matplotlib.rcParams['font.family']='sans-serif'
Df=pd.read_csv('flights.csv')
Df["time_hour"] = pd.to_datetime(df['time_hour'])
grp = df.groupby(by=[df.time_hour.map(lambda x : (x.hour, x.minute))])
a=grp.sum()
plt.scatter(a.index, a['flight'], c='b', marker='o')
plt.xlabel('index value', fontsize=16)
plt.ylabel('flight', fontsize=16)
plt.title('scatter plot - index value vs. flight (data range A row & E row )', fontsize=20)
plt.show()
Produced the following error：
Produced the following error
Traceback (most recent call last):
File "I:/PycharmProjects/1223/raise1/char3.py", line 10, in
Plt.scatter(a.index, a['flight'], c='b', marker='o')
File "C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\pyplot.py", line 3470, in scatter
Edgecolors=edgecolors, data=data, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\matplotlib__init__.py", line 1855, in inner
Return func(ax, *args, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\axes_axes.py", line 4320, in scatter
Alpha=alpha
File "C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\collections.py", line 927, in init
Collection.init(self, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\collections.py", line 159, in init
Offsets = np.asanyarray(offsets, float)
File "C:\ProgramData\Anaconda3\lib\site-packages\numpy\core\numeric.py", line 544, in asanyarray
Return array(a, dtype, copy=False, order=order, subok=True)
ValueError: setting an array element with a sequence.
How can I produce the following results? Thank you.
http://python2018.byethost10.com/image.png

Problem is in aggregation, in your code it return tuples in index.
Solution is convert time_dt column to strings HH:MM by Series.dt.strftime:
a = df.groupby(by=[df.time_hour.dt.strftime('%H:%M')]).sum()
All together:
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib
matplotlib.rcParams['font.sans-serif'] = ['Noto Serif CJK TC']
matplotlib.rcParams['font.family']='sans-serif'
#first column is index and second clumn is parsed to datetimes
df=pd.read_csv('flights.csv', index_col=[0], parse_dates=[1])
a = df.groupby(by=[df.time_hour.dt.strftime('%H:%M')]).sum()
print (a)
year sched_dep_time flight air_time distance hour minute
time_hour
05:00 122793 37856 87445 11282.0 72838 366 1256
05:01 120780 44810 82113 11115.0 71168 435 1310
05:02 122793 52989 99975 11165.0 72068 515 1489
05:03 120780 57653 98323 10366.0 65137 561 1553
05:04 122793 67706 110230 10026.0 63118 661 1606
05:05 122793 75807 126426 9161.0 55371 742 1607
05:06 120780 82010 120753 10804.0 67827 799 2110
05:07 122793 90684 130339 8408.0 52945 890 1684
05:08 120780 93687 114415 10299.0 63271 922 1487
05:09 122793 101571 99526 11525.0 72915 1002 1371
05:10 122793 107252 107961 10383.0 70137 1056 1652
05:11 120780 111351 120261 10949.0 73350 1098 1551
05:12 122793 120575 135930 8661.0 57406 1190 1575
05:13 120780 118272 104763 7784.0 55886 1166 1672
05:14 122793 37289 109300 9838.0 63582 364 889
05:15 122793 42374 67193 11480.0 78183 409 1474
05:16 58377 22321 53424 4271.0 27527 216 721
plt.scatter(a.index, a['flight'], c='b', marker='o')
#rotate labels of x axis
plt.xticks(rotation=90)
plt.xlabel('index value', fontsize=16)
plt.ylabel('flight', fontsize=16)
plt.title('scatter plot - index value vs. flight (data range A row & E row )', fontsize=20)
plt.show()
Another solution is convert datetimes to times:
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib
matplotlib.rcParams['font.sans-serif'] = 'Noto Serif CJK TC'
matplotlib.rcParams['font.family']='sans-serif'
df=pd.read_csv('flights.csv', index_col=[0], parse_dates=[1])
a = df.groupby(by=[df.time_hour.dt.time]).sum()
print (a)
year sched_dep_time flight air_time distance hour minute
time_hour
05:00:00 122793 37856 87445 11282.0 72838 366 1256
05:01:00 120780 44810 82113 11115.0 71168 435 1310
05:02:00 122793 52989 99975 11165.0 72068 515 1489
05:03:00 120780 57653 98323 10366.0 65137 561 1553
05:04:00 122793 67706 110230 10026.0 63118 661 1606
05:05:00 122793 75807 126426 9161.0 55371 742 1607
05:06:00 120780 82010 120753 10804.0 67827 799 2110
05:07:00 122793 90684 130339 8408.0 52945 890 1684
05:08:00 120780 93687 114415 10299.0 63271 922 1487
05:09:00 122793 101571 99526 11525.0 72915 1002 1371
05:10:00 122793 107252 107961 10383.0 70137 1056 1652
05:11:00 120780 111351 120261 10949.0 73350 1098 1551
05:12:00 122793 120575 135930 8661.0 57406 1190 1575
05:13:00 120780 118272 104763 7784.0 55886 1166 1672
05:14:00 122793 37289 109300 9838.0 63582 364 889
05:15:00 122793 42374 67193 11480.0 78183 409 1474
05:16:00 58377 22321 53424 4271.0 27527 216 721
plt.scatter(a.index, a['flight'], c='b', marker='o')
plt.xticks(rotation=90)
plt.xlabel('index value', fontsize=16)
plt.ylabel('flight', fontsize=16)
plt.title('scatter plot - index value vs. flight (data range A row & E row )', fontsize=20)
plt.show()

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Trying to subplot groups with date axis different methods - pandas

Related

ValueError: Number of columns must be a positive integer, not 0

How to fix 'X.toarray() to convert to a dense numpy array' error on sklearn fit_transform?

Why are the limits of the histogram data autodetected as [nan, nan] instead of discarding NaNs?

How to fix "Exception: Data must be 1-dimensional" error when running Kmeans

index value vs. flight (data range A row & E row )

Categories

Resources