I have a dataframe with the population by age in several cities:
City Age_25 Age_26 Age_27 Age_28 Age_29 Age_30
New York 11312 3646 4242 4344 4242 6464
London 6446 2534 3343 63475 34433 34434
Paris 5242 34343 6667 132 323 3434
Hong Kong 354 979 878 6776 7676 898
Buenos Aires 4244 7687 78 8676 786 9798
I want to create a new dataframe with the sum of the columns based on ranges of three years. That is, people from 25 to 27 and people from 28 to 30. Like this:
City Age_25_27 Age_28_30
New York 19200 15050
London 12323 132342
Paris 46252 3889
Hong Kong 2211 15350
Buenos Aires 12009 19260
In this example I used a range of three years, but in my real dataset the band has to be 5 years, across 100 ages.
How could I do that? I've seen some related answers, but none of them work well in my case.
Try this:
age_columns = df.filter(like='Age_').columns
n = age_columns.str.split('_').str[-1].astype(int)
df['Age_25-27'] = df[age_columns[(n >= 25) & (n <= 27)]].sum(axis=1)
df['Age_28-30'] = df[age_columns[(n >= 28) & (n <= 30)]].sum(axis=1)
Output:
>>> df
City Age_25 Age_26 Age_27 Age_28 Age_29 Age_30 Age_25-27 Age_28-30
New York 11312 3646 4242 4344 4242 6464 19200 15050
London 6446 2534 3343 63475 34433 34434 12323 132342
Paris 5242 34343 6667 132 323 3434 46252 3889
Hong Kong 354 979 878 6776 7676 898 2211 15350
Buenos Aires 4244 7687 78 8676 786 9798 12009 19260
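Since your real dataset needs 5-year bands across 100 ages, you can build the band columns in a loop instead of writing each one out. A minimal sketch, reusing age_columns and n from the snippet above and assuming the columns follow the Age_<n> naming pattern:
band = 5
for lo in range(n.min(), n.max() + 1, band):
    hi = lo + band - 1
    in_band = age_columns[(n >= lo) & (n <= hi)]    # columns whose age falls in [lo, hi]
    df[f'Age_{lo}-{hi}'] = df[in_band].sum(axis=1)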
You can use groupby:
In [1]: import pandas as pd
...: import numpy as np
In [2]: d = {
...: 'City': ['New York', 'London', 'Paris', 'Hong Kong', 'Buenos Aires'],
...: 'Age_25': [11312, 6446, 5242, 354, 4244],
...: 'Age_26': [3646, 2534, 34343, 979, 7687],
...: 'Age_27': [4242, 3343, 6667, 878, 78],
...: 'Age_28': [4344, 63475, 132, 6776, 8676],
...: 'Age_29': [4242, 34433, 323, 7676, 786],
...: 'Age_30': [6464, 34434, 3434, 898, 9798]
...: }
...:
...: df = pd.DataFrame(data=d)
...: df = df.set_index('City')
...: df
Out[2]:
Age_25 Age_26 Age_27 Age_28 Age_29 Age_30
City
New York 11312 3646 4242 4344 4242 6464
London 6446 2534 3343 63475 34433 34434
Paris 5242 34343 6667 132 323 3434
Hong Kong 354 979 878 6776 7676 898
Buenos Aires 4244 7687 78 8676 786 9798
In [3]: n_cols = 3 # change to 5 for actual dataset
...: sum_every_n_cols_df = df.groupby((np.arange(len(df.columns)) // n_cols) + 1, axis=1).sum()
...: sum_every_n_cols_df
Out[3]:
1 2
City
New York 19200 15050
London 12323 132342
Paris 46252 3889
Hong Kong 2211 15350
Buenos Aires 12009 19260
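Note that groupby(..., axis=1) is deprecated in recent pandas releases. One way to keep the same idea (a sketch, reusing the np import above) is to transpose, group along the rows, and transpose back:
n_cols = 3  # change to 5 for actual dataset
sum_every_n_cols_df = df.T.groupby(np.arange(len(df.columns)) // n_cols + 1).sum().T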
You can extract the columns of the dataframe and put them in a list:
col_list = df.columns
But ultimately, I think what you want is a while loop with your inputs (a band of 5, up to 100 ages) as static values that you iterate over:
band = 5
start = 20
max_age = 120
i = start
while i < max_age:
    age_start = i
    age_end = i + band - 1                     # last age in this band
    col_name = 'age_' + str(age_start) + '_to_' + str(age_end)
    df[col_name] = 0                           # initialise the band column
    for age in range(age_start, age_end + 1):  # sum the ages in the band, inclusive
        df[col_name] += df['age_' + str(age)]
    i += band
import pandas as pd
df = pd.DataFrame({"Employee_ID": [192, 561, 440, 264, 112, 374, 230, 251, 893, 562],
                   "Name": ["Jose", "Kent", "Carl", "Mary", "Michael", "Cindy", "Greg", "John", "Frank", "Angela"],
                   "Dept": ["Production", "Marketing", "Operations", "HR", "Finance", "Operations", "Marketing", "Production", "Finance", "HR"],
                   "Phone": [2725373, 3647364, 3184778, 1927472, 2394723, 874872, 1018374, 2127476, 2973973, 247462],
                   "Salary": [120000, 140000, 115000, 210000, 172000, 95000, 132000, 127000, 133000, 178000]})
df
I tried the following code to get the names and salaries of those IDs:
df[(df["Employee_ID"] == 264) & (df["Employee_ID"] == 374) & (df["Employee_ID"] == 893)][["Name", "Salary"]]
I was expecting to get their names and salaries, but the result is an empty DataFrame.
As stated by @abokey in the comments, you can get any desired subset of your data using masking. One way of doing that is .isin():
import pandas as pd
df = pd.DataFrame({"Employee_ID": [192, 561, 440, 264, 112, 374, 230, 251, 893, 562],
"Name": ["Jose", "Kent", "Carl", "Mary", "Michael", "Cindy", "Greg", "John", "Frank", "Angela"],
"Dept": ["Production", "Marketing", "Operations", "HR", "Finance", "Operations", "Marketing", "Production", "Finance", "HR"],
"Phone": [2725373, 3647364, 3184778, 1927472, 2394723, 874872, 1018374, 2127476, 2973973, 247462],
"Salary": [120000, 140000, 115000, 210000, 172000, 95000, 132000, 127000, 133000, 178000]})
df
Output:
Employee_ID Name Dept Phone Salary
0 192 Jose Production 2725373 120000
1 561 Kent Marketing 3647364 140000
2 440 Carl Operations 3184778 115000
3 264 Mary HR 1927472 210000
4 112 Michael Finance 2394723 172000
5 374 Cindy Operations 874872 95000
6 230 Greg Marketing 1018374 132000
7 251 John Production 2127476 127000
8 893 Frank Finance 2973973 133000
9 562 Angela HR 247462 178000
get_ids = [264, 374, 893]
df = df[df["Employee_ID"].isin(get_ids)]
df = df[["Name", "Salary"]]
df
Name Salary
3 Mary 210000
5 Cindy 95000
8 Frank 133000
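As for why your original attempt returned nothing: & is a logical AND, and a single Employee_ID can never equal 264, 374 and 893 at the same time, so the combined mask is False for every row. If you prefer explicit comparisons over .isin(), combine them with OR (|) instead:
df[(df["Employee_ID"] == 264) | (df["Employee_ID"] == 374) | (df["Employee_ID"] == 893)][["Name", "Salary"]]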
R&D Spend Administration Marketing Spend State Profit
0 165349.20 136897.80 471784.10 New York 192261.83
1 162597.70 151377.59 443898.53 California 191792.06
2 153441.51 101145.55 407934.54 Florida 191050.39
3 144372.41 118671.85 383199.62 New York 182901.99
4 142107.34 91391.77 366168.42 Florida 166187.94
5 131876.90 99814.71 362861.36 New York 156991.12
6 134615.46 147198.87 127716.82 California 156122.51
7 130298.13 145530.06 323876.68 Florida 155752.60
8 120542.52 148718.95 311613.29 New York 152211.77
9 123334.88 108679.17 304981.62 California 149759.96
10 101913.08 110594.11 229160.95 Florida 146121.95
11 100671.96 91790.61 249744.55 California 144259.40
12 93863.75 127320.38 249839.44 Florida 141585.52
13 91992.39 135495.07 252664.93 California 134307.35
14 119943.24 156547.42 256512.92 Florida 132602.65
15 114523.61 122616.84 261776.23 New York 129917.04
16 78013.11 121597.55 264346.06 California 126992.93
17 94657.16 145077.58 282574.31 New York 125370.37
18 91749.16 114175.79 294919.57 Florida 124266.90
19 86419.70 153514.11 0.00 New York 122776.86
20 76253.86 113867.30 298664.47 California 118474.03
21 78389.47 153773.43 299737.29 New York 111313.02
22 73994.56 122782.75 303319.26 Florida 110352.25
23 67532.53 105751.03 304768.73 Florida 108733.99
24 77044.01 99281.34 140574.81 New York 108552.04
25 64664.71 139553.16 137962.62 California 107404.34
26 75328.87 144135.98 134050.07 Florida 105733.54
27 72107.60 127864.55 353183.81 New York 105008.31
28 66051.52 182645.56 118148.20 Florida 103282.38
29 65605.48 153032.06 107138.38 New York 101004.64
30 61994.48 115641.28 91131.24 Florida 99937.59
31 61136.38 152701.92 88218.23 New York 97483.56
32 63408.86 129219.61 46085.25 California 97427.84
33 55493.95 103057.49 214634.81 Florida 96778.92
34 46426.07 157693.92 210797.67 California 96712.80
35 46014.02 85047.44 205517.64 New York 96479.51
36 28663.76 127056.21 201126.82 Florida 90708.19
37 44069.95 51283.14 197029.42 California 89949.14
38 20229.59 65947.93 185265.10 New York 81229.06
39 38558.51 82982.09 174999.30 California 81005.76
40 28754.33 118546.05 172795.67 California 78239.91
41 27892.92 84710.77 164470.71 Florida 77798.83
42 23640.93 96189.63 148001.11 California 71498.49
43 15505.73 127382.30 35534.17 New York 69758.98
44 22177.74 154806.14 28334.72 California 65200.33
45 1000.23 124153.04 1903.93 New York 64926.08
46 1315.46 115816.21 297114.46 Florida 49490.75
47 0.00 135426.92 0.00 California 42559.73
48 542.05 51743.15 0.00 New York 35673.41
49 0.00 116983.80 45173.06 California 14681.40
Code:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
import numpy as np
ct = ColumnTransformer([('State', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X), dtype=np.float)
I keep getting an error like this:
TypeError Traceback (most recent call last)
<ipython-input-36-17f64bed7e4c> in <module>
3
4 ct = ColumnTransformer([('State', OneHotEncoder(), [3])], remainder='passthrough')
----> 5 X = np.array(ct.fit_transform(X), dtype=object)
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\sklearn\compose\_column_transformer.py in fit_transform(self, X, y)
516 self._validate_remainder(X)
517
--> 518 result = self._fit_transform(X, y, _fit_transform_one)
519
520 if not result:
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\sklearn\compose\_column_transformer.py in _fit_transform(self, X, y, func, fitted)
446 self._iter(fitted=fitted, replace_strings=True))
447 try:
--> 448 return Parallel(n_jobs=self.n_jobs)(
449 delayed(func)(
450 transformer=clone(trans) if not fitted else trans,
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\joblib\parallel.py in __call__(self, iterable)
1002 # remaining jobs.
1003 self._iterating = False
-> 1004 if self.dispatch_one_batch(iterator):
1005 self._iterating = self._original_iterator is not None
1006
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\joblib\parallel.py in dispatch_one_batch(self, iterator)
833 return False
834 else:
--> 835 self._dispatch(tasks)
836 return True
837
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\joblib\parallel.py in _dispatch(self, batch)
752 with self._lock:
753 job_idx = len(self._jobs)
--> 754 job = self._backend.apply_async(batch, callback=cb)
755 # A job can complete so quickly than its callback is
756 # called before we get here, causing self._jobs to
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\joblib\_parallel_backends.py in apply_async(self, func, callback)
207 def apply_async(self, func, callback=None):
208 """Schedule a func to be run"""
--> 209 result = ImmediateResult(func)
210 if callback:
211 callback(result)
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\joblib\_parallel_backends.py in __init__(self, batch)
588 # Don't delay the application, to avoid keeping the input
589 # arguments in memory
--> 590 self.results = batch()
591
592 def get(self):
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\joblib\parallel.py in __call__(self)
253 # change the default number of processes to -1
254 with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 255 return [func(*args, **kwargs)
256 for func, args, kwargs in self.items]
257
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\joblib\parallel.py in <listcomp>(.0)
253 # change the default number of processes to -1
254 with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 255 return [func(*args, **kwargs)
256 for func, args, kwargs in self.items]
257
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\sklearn\pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
726 with _print_elapsed_time(message_clsname, message):
727 if hasattr(transformer, 'fit_transform'):
--> 728 res = transformer.fit_transform(X, y, **fit_params)
729 else:
730 res = transformer.fit(X, y, **fit_params).transform(X)
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\sklearn\preprocessing\_encoders.py in fit_transform(self, X, y)
370 """
371 self._validate_keywords()
--> 372 return super().fit_transform(X, y)
373
374 def transform(self, X):
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\sklearn\base.py in fit_transform(self, X, y, **fit_params)
569 if y is None:
570 # fit method of arity 1 (unsupervised transformation)
--> 571 return self.fit(X, **fit_params).transform(X)
572 else:
573 # fit method of arity 2 (supervised transformation)
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\sklearn\preprocessing\_encoders.py in fit(self, X, y)
345 """
346 self._validate_keywords()
--> 347 self._fit(X, handle_unknown=self.handle_unknown)
348 self.drop_idx_ = self._compute_drop_idx()
349 return self
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\sklearn\preprocessing\_encoders.py in _fit(self, X, handle_unknown)
72
73 def _fit(self, X, handle_unknown='error'):
---> 74 X_list, n_samples, n_features = self._check_X(X)
75
76 if self.categories != 'auto':
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\sklearn\preprocessing\_encoders.py in _check_X(self, X)
41 if not (hasattr(X, 'iloc') and getattr(X, 'ndim', 0) == 2):
42 # if not a dataframe, do normal check_array validation
---> 43 X_temp = check_array(X, dtype=None)
44 if (not hasattr(X, 'dtype')
45 and np.issubdtype(X_temp.dtype, np.str_)):
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
506 if sp.issparse(array):
507 _ensure_no_complex_data(array)
--> 508 array = _ensure_sparse_format(array, accept_sparse=accept_sparse,
509 dtype=dtype, copy=copy,
510 force_all_finite=force_all_finite,
c:\users\dell\appdata\local\programs\python\python38\lib\site-packages\sklearn\utils\validation.py in _ensure_sparse_format(spmatrix, accept_sparse, dtype, copy, force_all_finite, accept_large_sparse)
304
305 if accept_sparse is False:
--> 306 raise TypeError('A sparse matrix was passed, but dense '
307 'data is required. Use X.toarray() to '
308 'convert to a dense numpy array.')
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
How can I resolve this issue?
You can set sparse_threshold=0 so the ColumnTransformer always returns a dense array (I'm not sure about the rest of your code, or what X is):
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
import pandas as pd
import numpy as np
X = pd.DataFrame({"R&D": [1, 2, 3, 4],
                  "State": ["New Tork", "Florida", "New York", "California"]})
# sparse_threshold=0 forces dense output even when every transformed column is sparse
ct = ColumnTransformer([('State', OneHotEncoder(), [0])],
                       sparse_threshold=0, remainder='passthrough')
np.array(ct.fit_transform(X[['State']]), dtype=float)
array([[0., 0., 1., 0.],
[0., 1., 0., 0.],
[0., 0., 0., 1.],
[1., 0., 0., 0.]])
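Alternatively, as the error message itself suggests, you can keep the default ColumnTransformer settings and densify the result yourself. A sketch, assuming fit_transform returns a scipy sparse matrix here:
result = ct.fit_transform(X[['State']])
if hasattr(result, 'toarray'):  # densify only if the output is actually sparse
    result = result.toarray()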
I have a data frame as shown below
ID Name Address
1 Kohli Country: India; State: Delhi; Sector: SE25
2 Sachin Country: India; State: Mumbai; Sector: SE39
3 Ponting Country: Australia; State: Tasmania
4 Ponting State: Tasmania; Sector: SE27
From the above I would like to prepare below data frame
ID Name Country State Sector
1 Kohli India Delhi SE25
2 Sachin India Mumbai SE39
3 Ponting Australia Tasmania None
4 Ponting None Tasmania SE27
I tried the code below:
df[['Country', 'State', 'Sector']] = pd.DataFrame(df['ADDRESS'].str.split(';',2).tolist(),
columns = ['Country', 'State', 'Sector'])
But then I still have to clean the data by slicing each column. I would like to know if there is an easier way to do this.
Use a list comprehension with a dict comprehension to build a list of dictionaries, and pass it to the DataFrame constructor:
L = [{k:v for y in x.split('; ') for k, v in dict([y.split(': ')]).items()}
for x in df.pop('Address')]
df = df.join(pd.DataFrame(L, index=df.index))
print (df)
ID Name Country State Sector
0 1 Kohli India Delhi SE25
1 2 Sachin India Mumbai SE39
2 3 Ponting Australia Tasmania NaN
Or use split and reshape with stack:
df1 = (df.pop('Address')
.str.split('; ', expand=True)
.stack()
.reset_index(level=1, drop=True)
.str.split(': ', expand=True)
.set_index(0, append=True)[1]
.unstack()
)
print (df1)
0 Country Sector State
0 India SE25 Delhi
1 India SE39 Mumbai
2 Australia NaN Tasmania
df = df.join(df1)
print (df)
ID Name Country Sector State
0 1 Kohli India SE25 Delhi
1 2 Sachin India SE39 Mumbai
2 3 Ponting Australia NaN Tasmania
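Another option worth considering (a sketch, not from the original answer) is str.extract with one pattern per field; rows that lack a field simply get NaN. This assumes df still has the original Address column:
for field in ['Country', 'State', 'Sector']:
    # expand=False returns a Series; ([^;]+) captures the value up to the next ';'
    df[field] = df['Address'].str.extract(field + r': ([^;]+)', expand=False)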
You are almost there:
cols = ['ZONE', 'State', 'Sector']
df[cols] = pd.DataFrame(df['ADDRESS'].str.split('; ', 2).tolist(),
                        columns=cols)
for col in cols:
    # keep only the value after 'Key: ' (assumes every row has all three fields)
    df[col] = df[col].str.split(': ').str[1]
Original answer
This can also do the job:
import pandas as pd
df = pd.DataFrame(
[
{'ID': 1, 'Name': 'Kohli', 'Address': 'Country: India; State: Delhi; Sector: SE25'},
{'ID': 2, 'Name': 'Sachin','Address': 'Country: India; State: Mumbai; Sector: SE39'},
{'ID': 3,'Name': 'Ponting','Address': 'Country: Australia; State: Tasmania'}
]
)
cols_to_extract = ['ZONE', 'State', 'Sector']
list_of_rows = df['Address'].str.split(';', 2).tolist()
df[cols_to_extract] = pd.DataFrame(
[[item.split(': ')[1] for item in row] for row in list_of_rows],
columns=cols_to_extract)
Output would be the following:
>> df[['ID', 'Name', 'ZONE', 'State', 'Sector']]
ID Name ZONE State Sector
1 Kohli India Delhi SE25
2 Sachin India Mumbai SE39
3 Ponting Australia Tasmania None
Edited answer
As @jezrael pointed out very well in a question comment, my original answer was wrong, because it aligned values by position and could lead to wrong key-value pairs when some of the values were missing. The following code should work on the edited data set.
import pandas as pd
df = pd.DataFrame(
[
{'ID': 1, 'Name': 'Kohli', 'Address': 'Country: India; State: Delhi; Sector: SE25'},
{'ID': 2, 'Name': 'Sachin','Address': 'Country: India; State: Mumbai; Sector: SE39'},
{'ID': 3,'Name': 'Ponting','Address': 'Country: Australia; State: Tasmania'},
{'ID': 4, 'Name': 'Ponting','Address': 'State: Tasmania; Sector: SE27'}
]
)
cols_to_extract = ['Country', 'State', 'Sector']
list_of_rows = df['Address'].str.split(';', 2).tolist()
df[cols_to_extract] = pd.DataFrame(
[{item.split(': ')[0].strip(): item.split(': ')[1] for item in row} for row in list_of_rows],
columns=cols_to_extract)
df = df.rename(columns={'Country': 'ZONE'})
Output would be:
>> df[['ID', 'Name', 'ZONE', 'State', 'Sector']]
ID Name ZONE State Sector
1 Kohli India Delhi SE25
2 Sachin India Mumbai SE39
3 Ponting Australia Tasmania NaN
4 Ponting NaN Tasmania SE27
I got a DataFrame:
date phone sensor pallet
126 2019-04-15 940203 C0382C391A4D 47
127 2019-04-15 940203 C0382D392A4D 47
133 2019-04-16 940203 C0382C391A4D 47
134 2019-04-16 940203 C0382D392A4D 47
138 2019-04-17 940203 C0382C391A4D 47
139 2019-04-17 940203 C0382D392A4D 47
144 2019-04-18 940203 C0382C391A4D 47
145 2019-04-18 940203 C0382D392A4D 47
156 2019-04-19 940203 C0382D392A4D 47
157 2019-04-19 940203 C0382C391A4D 47
277 2019-04-15 941557 C0392D362735 32
279 2019-04-15 941557 C03633364D50 32
286 2019-04-16 941557 C03633364D50 32
287 2019-04-16 941557 C0392D362735 32
296 2019-04-17 941557 C03633364D50 32
297 2019-04-17 941557 C0392D362735 32
305 2019-04-18 941557 C0392D362735 32
306 2019-04-18 941557 C03633364D50 32
317 2019-04-19 941557 C03633364D50 32
318 2019-04-19 941557 C0392D362735 32
561 2019-04-15 942316 C0384639224D 45
562 2019-04-15 942316 C03632364950 45
563 2019-04-15 942316 C03920363835 45
564 2019-04-15 942316 C0382939384D 45
573 2019-04-16 942316 C0382939384D 45
574 2019-04-16 942316 C0384639224D 45
575 2019-04-16 942316 C03632364950 45
I want to be able to make a subplot for each pallet showing the sensors that arrived on each date.
I have tried a few methods:
1. ax.plot_date, looping through the opened axes and plotting on each one:
grouped = pallets_arrived.groupby('pallet')
nrows = 2
ncols = 2
fig, axs = plt.subplots(nrows, ncols)
targets = zip(grouped.groups.keys(), axs.flatten())
for i, (key, ax) in enumerate(targets):
ax.plot_date(grouped.get_group(key)['date'], grouped.get_group(key)['sensor'], 'o')
plt.show()
which gives weirdly formatted, repeating dates (indexing the DataFrame by date doesn't solve the problem).
2. DataFrame plotting:
grouped = pallets_arrived.groupby('pallet')
nrows = 2
ncols = 2
fig, axs = plt.subplots(nrows, ncols)
targets = zip(grouped.groups.keys(), axs.flatten())
for i, (key, ax) in enumerate(targets):
grouped.get_group(key).plot(x='date', y='sensor', ax=ax)
ax.legend()
plt.show()
or
grouped = pallets_arrived.set_index('date').groupby('pallet')
nrows = 2
ncols = 2
fig, axs = plt.subplots(nrows, ncols)
targets = zip(grouped.groups.keys(), axs.flatten())
for i, (key, ax) in enumerate(targets):
grouped.get_group(key).plot(grouped.get_group(key).index, y='sensor', ax=ax)
ax.legend()
plt.show()
3. pyplot:
grouped = pallets_arrived.groupby('pallet')
nrows = 2
ncols = 2
fig, axs = plt.subplots(nrows, ncols)
targets = zip(grouped.groups.keys(), axs.flatten())
for i, (key, ax) in enumerate(targets):
plt.sca(ax)
plt.plot(grouped.get_group(key)['date'], grouped.get_group(key)['sensor'])
ax.legend()
plt.show()
which again gives the same badly formatted dates.
4. Pivoting to plot() with pallets as columns, which doesn't work because there is more than one sensor per pallet on the same date, so there is a duplicated-value error.
I really don't know which method will get this right:
grouping similar dates on the x axis;
plotting each pallet on a different subplot.
I think I don't understand the pandas wrapping of matplotlib correctly. I'd be glad for some explanation, because I'm reading guides and can't work out the preferred method for this kind of thing.
Thanks a lot to the helpers.
You can use matplotlib to plot categorical data:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
%matplotlib inline
fig, ax = plt.subplots()
ax.scatter(df['date'], df['sensor'])
plt.show()
or if you want to color the groups:
fig, ax = plt.subplots()
for _,g in df.groupby('pallet'):
ax.scatter(g['date'], g['sensor'])
plt.show()
You can also add a legend:
fig, ax = plt.subplots()
for _,g in df.groupby('pallet'):
ax.scatter(g['date'], g['sensor'], label='Pallet_'+str(_))
ax.legend()
plt.show()
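To get each pallet on its own subplot, the same scatter idea can be combined with your subplots loop. A minimal sketch, assuming pallets_arrived is the DataFrame above (adjust nrows/ncols to the number of pallets):
import matplotlib.pyplot as plt
grouped = pallets_arrived.groupby('pallet')
fig, axs = plt.subplots(nrows=1, ncols=grouped.ngroups)
for ax, (key, g) in zip(axs.flatten(), grouped):
    ax.scatter(g['date'], g['sensor'])     # categorical scatter per pallet
    ax.set_title('Pallet ' + str(key))
    ax.tick_params(axis='x', rotation=90)  # keep the date labels readable
plt.tight_layout()
plt.show()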
I want a scatter plot of the sum of the flight fields per minute. My data is as follows:
http://python2018.byethost10.com/flights.csv
My code is as follows:
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib
matplotlib.rcParams['font.sans-serif'] = ['Noto Serif CJK TC']
matplotlib.rcParams['font.family']='sans-serif'
df = pd.read_csv('flights.csv')
df["time_hour"] = pd.to_datetime(df['time_hour'])
grp = df.groupby(by=[df.time_hour.map(lambda x: (x.hour, x.minute))])
a=grp.sum()
plt.scatter(a.index, a['flight'], c='b', marker='o')
plt.xlabel('index value', fontsize=16)
plt.ylabel('flight', fontsize=16)
plt.title('scatter plot - index value vs. flight (data range A row & E row )', fontsize=20)
plt.show()
Produced the following error:
Traceback (most recent call last):
  File "I:/PycharmProjects/1223/raise1/char3.py", line 10, in <module>
    plt.scatter(a.index, a['flight'], c='b', marker='o')
  File "C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\pyplot.py", line 3470, in scatter
    edgecolors=edgecolors, data=data, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\__init__.py", line 1855, in inner
    return func(ax, *args, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py", line 4320, in scatter
    alpha=alpha
  File "C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\collections.py", line 927, in __init__
    Collection.__init__(self, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\collections.py", line 159, in __init__
    offsets = np.asanyarray(offsets, float)
  File "C:\ProgramData\Anaconda3\lib\site-packages\numpy\core\numeric.py", line 544, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
ValueError: setting an array element with a sequence.
How can I produce the following results? Thank you.
http://python2018.byethost10.com/image.png
The problem is in the aggregation: in your code it returns tuples in the index.
The solution is to convert the time_hour column to HH:MM strings with Series.dt.strftime:
a = df.groupby(by=[df.time_hour.dt.strftime('%H:%M')]).sum()
All together:
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib
matplotlib.rcParams['font.sans-serif'] = ['Noto Serif CJK TC']
matplotlib.rcParams['font.family']='sans-serif'
#first column is the index and the second column is parsed to datetimes
df=pd.read_csv('flights.csv', index_col=[0], parse_dates=[1])
a = df.groupby(by=[df.time_hour.dt.strftime('%H:%M')]).sum()
print (a)
year sched_dep_time flight air_time distance hour minute
time_hour
05:00 122793 37856 87445 11282.0 72838 366 1256
05:01 120780 44810 82113 11115.0 71168 435 1310
05:02 122793 52989 99975 11165.0 72068 515 1489
05:03 120780 57653 98323 10366.0 65137 561 1553
05:04 122793 67706 110230 10026.0 63118 661 1606
05:05 122793 75807 126426 9161.0 55371 742 1607
05:06 120780 82010 120753 10804.0 67827 799 2110
05:07 122793 90684 130339 8408.0 52945 890 1684
05:08 120780 93687 114415 10299.0 63271 922 1487
05:09 122793 101571 99526 11525.0 72915 1002 1371
05:10 122793 107252 107961 10383.0 70137 1056 1652
05:11 120780 111351 120261 10949.0 73350 1098 1551
05:12 122793 120575 135930 8661.0 57406 1190 1575
05:13 120780 118272 104763 7784.0 55886 1166 1672
05:14 122793 37289 109300 9838.0 63582 364 889
05:15 122793 42374 67193 11480.0 78183 409 1474
05:16 58377 22321 53424 4271.0 27527 216 721
plt.scatter(a.index, a['flight'], c='b', marker='o')
#rotate labels of x axis
plt.xticks(rotation=90)
plt.xlabel('index value', fontsize=16)
plt.ylabel('flight', fontsize=16)
plt.title('scatter plot - index value vs. flight (data range A row & E row )', fontsize=20)
plt.show()
Another solution is to convert the datetimes to times:
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib
matplotlib.rcParams['font.sans-serif'] = 'Noto Serif CJK TC'
matplotlib.rcParams['font.family']='sans-serif'
df=pd.read_csv('flights.csv', index_col=[0], parse_dates=[1])
a = df.groupby(by=[df.time_hour.dt.time]).sum()
print (a)
year sched_dep_time flight air_time distance hour minute
time_hour
05:00:00 122793 37856 87445 11282.0 72838 366 1256
05:01:00 120780 44810 82113 11115.0 71168 435 1310
05:02:00 122793 52989 99975 11165.0 72068 515 1489
05:03:00 120780 57653 98323 10366.0 65137 561 1553
05:04:00 122793 67706 110230 10026.0 63118 661 1606
05:05:00 122793 75807 126426 9161.0 55371 742 1607
05:06:00 120780 82010 120753 10804.0 67827 799 2110
05:07:00 122793 90684 130339 8408.0 52945 890 1684
05:08:00 120780 93687 114415 10299.0 63271 922 1487
05:09:00 122793 101571 99526 11525.0 72915 1002 1371
05:10:00 122793 107252 107961 10383.0 70137 1056 1652
05:11:00 120780 111351 120261 10949.0 73350 1098 1551
05:12:00 122793 120575 135930 8661.0 57406 1190 1575
05:13:00 120780 118272 104763 7784.0 55886 1166 1672
05:14:00 122793 37289 109300 9838.0 63582 364 889
05:15:00 122793 42374 67193 11480.0 78183 409 1474
05:16:00 58377 22321 53424 4271.0 27527 216 721
plt.scatter(a.index, a['flight'], c='b', marker='o')
plt.xticks(rotation=90)
plt.xlabel('index value', fontsize=16)
plt.ylabel('flight', fontsize=16)
plt.title('scatter plot - index value vs. flight (data range A row & E row )', fontsize=20)
plt.show()
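A third option worth mentioning (not from the original answer) is to floor the timestamps to the minute, which keeps a real datetime index so matplotlib can format the axis itself. Note this keeps the date, so the same minute on different days stays separate, unlike the HH:MM grouping above:
a = df.groupby(df.time_hour.dt.floor('min')).sum()
plt.scatter(a.index, a['flight'], c='b', marker='o')
plt.xticks(rotation=90)
plt.show()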