reading into DataFrame instead of Panel - pandas

I'd like to read the quotes for several tickers at the same time. I am using:
import numpy as np
import pandas as pd
import pandas_datareader.data as web
import datetime
import matplotlib.pyplot as plt
%matplotlib inline
start = datetime.datetime(2017, 9, 20)
end = datetime.datetime(2017,9,22)
h = web.DataReader(["EWI", "EWG"], "yahoo", start, end)
... and it seems to work.
However, the data are read into a panel data structure. If I print variable "h" I get:
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 4 (major_axis) x 2 (minor_axis)
Items axis: Adj Close to Volume
Major_axis axis: 2017-09-22 00:00:00 to 2017-09-19 00:00:00
Minor_axis axis: EWG to EWI
I'd like:
to "see" the resulting panel values (I'm relatively new to pandas);
to know whether it is possible to flatten the panel into a DataFrame (IMO the DataFrame is better documented).
Reading just the "Adj Close" would be more than enough for me. Perhaps reading into a DataFrame directly would be easier?
Thank you

I think you need Panel.to_frame for a MultiIndex DataFrame:
#with random data
df = h.to_frame()
print (df)
Adj Close Close High Low Open Volume
major minor
2013-01-01 EWI 0.471435 0.471435 0.471435 0.471435 0.471435 0.471435
EWG -1.190976 -1.190976 -1.190976 -1.190976 -1.190976 -1.190976
2013-01-02 EWI 1.432707 1.432707 1.432707 1.432707 1.432707 1.432707
EWG -0.312652 -0.312652 -0.312652 -0.312652 -0.312652 -0.312652
2013-01-03 EWI -0.720589 -0.720589 -0.720589 -0.720589 -0.720589 -0.720589
EWG 0.887163 0.887163 0.887163 0.887163 0.887163 0.887163
2013-01-04 EWI 0.859588 0.859588 0.859588 0.859588 0.859588 0.859588
EWG -0.636524 -0.636524 -0.636524 -0.636524 -0.636524 -0.636524
And then select a column:
s = df['Adj Close']
print (s)
major minor
2013-01-01 EWI 0.471435
EWG -1.190976
2013-01-02 EWI 1.432707
EWG -0.312652
2013-01-03 EWI -0.720589
EWG 0.887163
2013-01-04 EWI 0.859588
EWG -0.636524
Name: Adj Close, dtype: float64
Or, for a one-column DataFrame, use double brackets:
df1 = df[['Adj Close']]
print (df1)
Adj Close
major minor
2013-01-01 EWI 0.471435
EWG -1.190976
2013-01-02 EWI 1.432707
EWG -0.312652
2013-01-03 EWI -0.720589
EWG 0.887163
2013-01-04 EWI 0.859588
EWG -0.636524
Notice:
Panel is deprecated (since pandas 0.20, removed in 0.25), so the MultiIndex DataFrame from to_frame() is the forward-compatible representation.
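Since only the adjusted close is needed, note that item selection on the panel already yields a plain DataFrame, indexed by date with one column per ticker; a minimal sketch, assuming the panel h from the question on a pandas version that still ships Panel:
# selecting a single item from the panel returns a DataFrame
# (rows = dates on the major axis, columns = tickers on the minor axis)
adj_close = h['Adj Close']
print(adj_close.head())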

Related

Finding ranges from a dataframe

I have a dataframe that looks like below,
Date 3tier1 3tier2
2013-01-01 08:00:00+08:00 20.97946282 20.97946282
2013-01-02 08:00:00+08:00 20.74539378 20.74539378
2013-01-03 08:00:00+08:00 20.51126054 20.51126054
2013-01-04 08:00:00+08:00 20.27707322 20.27707322
2013-01-05 08:00:00+08:00 20.04284112 20.04284112
2013-01-06 08:00:00+08:00 19.80857234 19.80857234
2013-01-07 08:00:00+08:00 19.57427331 19.57427331
2013-01-08 08:00:00+08:00 19.33994822 19.33994822
2013-01-09 08:00:00+08:00 19.10559849 19.10559849
2013-01-10 08:00:00+08:00 18.87122241 18.87122241
2013-01-11 08:00:00+08:00 18.63681507 18.63681507
2013-01-12 08:00:00+08:00 18.40236877 18.40236877
2013-01-13 08:00:00+08:00 18.16787383 18.16787383
2013-01-14 08:00:00+08:00 17.93331972 17.93331972
2013-01-15 08:00:00+08:00 17.69869612 17.69869612
2013-01-16 08:00:00+08:00 17.46399372 17.46399372
2013-01-17 08:00:00+08:00 17.22920466 17.22920466
2013-01-18 08:00:00+08:00 16.9943227 16.9943227
2013-01-19 08:00:00+08:00 17.27850867 16.7593431
2013-01-20 08:00:00+08:00 17.69762778 16.52426248
2013-01-21 08:00:00+08:00 18.12537837 16.28907864
2013-01-22 08:00:00+08:00 18.56180775 16.05379043
2013-01-23 08:00:00+08:00 19.00689471 15.81839767
2013-01-24 08:00:00+08:00 19.46053468 15.58290109
2013-01-25 08:00:00+08:00 19.92252218 15.3473024
2013-01-26 08:00:00+08:00 20.3925305 15.11160423
2013-01-27 08:00:00+08:00 20.87008788 14.87581016
2013-01-28 08:00:00+08:00 21.35454987 14.63992467
2013-01-29 08:00:00+08:00 21.84506726 14.40395298
2013-01-30 08:00:00+08:00 22.34054913 14.16790086
2013-01-31 08:00:00+08:00 22.83962058 13.93177434
2013-02-01 08:00:00+08:00 23.34057473 13.69557937
2013-02-02 08:00:00+08:00 23.84131896 13.45932144
2013-02-03 08:00:00+08:00 24.33931544 13.22300514
2013-02-04 08:00:00+08:00 24.8315166 12.98663374
2013-02-05 08:00:00+08:00 25.31429677 12.7502088
2013-02-06 08:00:00+08:00 25.78338191 12.51372976
2013-02-07 08:00:00+08:00 26.23378052 12.27719367
2013-02-08 08:00:00+08:00 26.65971992 12.04059517
2013-02-09 08:00:00+08:00 27.05459343 11.80392662
2013-02-10 08:00:00+08:00 27.41092527 11.56717871
2013-02-11 08:00:00+08:00 27.72036088 11.3303412
2013-02-12 08:00:00+08:00 27.97369094 11.09340384
2013-02-13 08:00:00+08:00 28.16091685 10.85635718
2013-02-14 08:00:00+08:00 28.27136466 10.61919323
2013-02-15 08:00:00+08:00 28.29385218 10.38190579
2013-02-16 08:00:00+08:00 28.21691143 10.14449064
2013-02-17 08:00:00+08:00 28.02906576 9.906945571
2013-02-18 08:00:00+08:00 27.71915819 9.669270289
2013-02-19 08:00:00+08:00 27.27672516 9.431466436
2013-02-20 08:00:00+08:00 26.69240919 9.193537583
2013-02-21 08:00:00+08:00 25.9584032 8.955489323
2013-02-22 08:00:00+08:00 25.06891975 8.717329426
2013-02-23 08:00:00+08:00 24.02067835 8.479068052
2013-02-24 08:00:00+08:00 22.81340411 8.240718006
2013-02-25 08:00:00+08:00 21.45033241 8.002294987
2013-02-26 08:00:00+08:00 19.93872048 7.763817801
2013-02-27 08:00:00+08:00 18.29038758 7.525308512
2013-02-28 08:00:00+08:00 16.5223583 7.286792516
2013-03-01 08:00:00+08:00 14.65781009 7.048298548
2013-03-02 08:00:00+08:00 12.72782154 6.809858708
2013-03-03 08:00:00+08:00 10.77512952 6.57150857
2013-03-04 08:00:00+08:00 8.862866684 6.333287469
2013-03-05 08:00:00+08:00 7.095368405 6.095239078
2013-03-06 08:00:00+08:00 5.857412338 5.857412338
2013-03-07 08:00:00+08:00 6.062085995 5.619862847
2013-03-08 08:00:00+08:00 7.707047277 5.382654808
2013-03-09 08:00:00+08:00 9.419192265 5.145863673
2013-03-10 08:00:00+08:00 11.12489254 4.909579657
2013-03-11 08:00:00+08:00 12.78439056 4.673912321
2013-03-12 08:00:00+08:00 14.37406958 4.438996486
2013-03-13 08:00:00+08:00 15.87932086 4.204999838
2013-03-14 08:00:00+08:00 17.29126015 3.97213278
2013-03-15 08:00:00+08:00 18.60496304 3.740661371
2013-03-16 08:00:00+08:00 19.81836754 3.510924673
2013-03-17 08:00:00+08:00 20.9315104 3.283358444
2013-03-18 08:00:00+08:00 21.94595693 3.058528064
2013-03-19 08:00:00+08:00 22.86436015 2.837174881
2013-03-20 08:00:00+08:00 23.69011593 2.620282024
2013-03-21 08:00:00+08:00 24.42709384 2.409168144
2013-03-22 08:00:00+08:00 25.07942941 2.205620134
2013-03-23 08:00:00+08:00 25.65136634 2.012076744
2013-03-24 08:00:00+08:00 26.14713926 1.831868652
2013-03-25 08:00:00+08:00 26.57088882 1.669492776
2013-03-26 08:00:00+08:00 26.92660259 1.53082259
2013-03-27 08:00:00+08:00 27.21807571 1.423006398
2013-03-28 08:00:00+08:00 27.44888683 1.353644799
2013-03-29 08:00:00+08:00 27.66626757 1.328979238
2013-03-30 08:00:00+08:00 28.03215155 1.351655979
2013-03-31 08:00:00+08:00 28.34758652 1.419589908
I would like to find the range of a column of my choice, and group the dates whenever the direction of the range changes. For example, 3tier1 in month 1 starts at 20, drops to 16, and then rises again to 22: from Jan 1 to Jan 18 it moves downward from 20 to 16, then from Jan 19 to Feb 15 upward from 17 to 28, and so on.
Expected output:
2013-01-01 to 2013-01-18 - 20 to 16
2013-01-19 to 2013-02-15 - 17 to 28
Is there a builtin pandas function that can do this with ease? Thanks for your help in advance.
I don't know of a built-in function that does what you are looking for, but it can be put together with a few lines of code; I would use .diff() and .shift().
This is what I came up with:
import pandas as pd
import numpy as np

file = 'C:/path_to_file/data.csv'
df = pd.read_csv(file, parse_dates=['Date'])
# Now I have your dataframe loaded. ** Your procedures are below.
df['trend'] = np.where(df['3tier1'].diff() > 0, 1, -1)  # trend is increasing or decreasing
df['again'] = df['trend'].diff()  # get the difference in trend
df['again'] = df['again'].shift(periods=-1) + df['again']
df['change'] = np.where(df['again'].isin([2, -2, np.nan]), 2, 0)
# get to the desired data; .copy() avoids a SettingWithCopyWarning
# on the assignments below
dfc = df[df['change'] == 2].copy()
dfc['to_date'] = dfc['Date'].shift(periods=-1)
dfc['to_End'] = dfc['3tier1'].shift(periods=-1)
dfc.drop(columns=['trend', 'again', 'change'], inplace=True)
# get the rows that show the trend
dfc = dfc.iloc[::2, :]
print(dfc)
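As an alternative sketch (my addition, under the same assumptions: df holds the Date and 3tier1 columns from the question), the direction runs can be labelled without helper columns by cumulatively counting sign flips, then aggregating each run:
import numpy as np
import pandas as pd

# sign of the day-over-day change: +1 rising, -1 falling
sign = np.sign(df['3tier1'].diff()).bfill()

# a new run id starts wherever the sign flips
run_id = (sign != sign.shift()).cumsum()

# one row per run: first/last date and first/last value
runs = df.groupby(run_id).agg(
    from_date=('Date', 'first'),
    to_date=('Date', 'last'),
    from_value=('3tier1', 'first'),
    to_value=('3tier1', 'last'),
)
print(runs)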

Sum columns in pandas based on the names of the columns

I have a dataframe with the population by age in several cities:
City Age_25 Age_26 Age_27 Age_28 Age_29 Age_30
New York 11312 3646 4242 4344 4242 6464
London 6446 2534 3343 63475 34433 34434
Paris 5242 34343 6667 132 323 3434
Hong Kong 354 979 878 6776 7676 898
Buenos Aires 4244 7687 78 8676 786 9798
I want to create a new dataframe with the sum of the columns based on ranges of three years. That is, people from 25 to 27 and people from 28 to 30. Like this:
City Age_25_27 Age_28_30
New York 19200 15050
London 12323 132342
Paris 46252 3889
Hong Kong 2211 15350
Buenos Aires 12009 19260
In this example I used a range of three years, but in my real database the bands have to be five years wide, over about 100 ages.
How could I do that? I've seen some related answers, but none of them works well in my case.
Try this:
age_columns = df.filter(like='Age_').columns
n = age_columns.str.split('_').str[-1].astype(int)
df['Age_25-27'] = df[age_columns[(n >= 25) & (n <= 27)]].sum(axis=1)
df['Age_28-30'] = df[age_columns[(n >= 28) & (n <= 30)]].sum(axis=1)
Output:
>>> df
City Age_25 Age_26 Age_27 Age_28 Age_29 Age_30 Age_25-27 Age_28-30
New York 11312 3646 4242 4344 4242 6464 19200 15050
London 6446 2534 3343 63475 34433 34434 12323 132342
Paris 5242 34343 6667 132 323 3434 46252 3889
Hong Kong 354 979 878 6776 7676 898 2211 15350
Buenos Aires 4244 7687 78 8676 786 9798 12009 19260
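For the real dataset (5-year bands over roughly 100 ages), the same filtering idea generalizes to a loop; a minimal sketch, assuming the age columns are named Age_0 through Age_99:
age_columns = df.filter(like='Age_').columns
ages = age_columns.str.split('_').str[-1].astype(int)

band = 5
for lo in range(0, 100, band):
    hi = lo + band - 1
    in_band = age_columns[(ages >= lo) & (ages <= hi)]
    if len(in_band):  # skip bands with no matching columns
        df[f'Age_{lo}-{hi}'] = df[in_band].sum(axis=1)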
You can use groupby:
In [1]: import pandas as pd
...: import numpy as np
In [2]: d = {
...: 'City': ['New York', 'London', 'Paris', 'Hong Kong', 'Buenos Aires'],
...: 'Age_25': [11312, 6446, 5242, 354, 4244],
...: 'Age_26': [3646, 2534, 34343, 979, 7687],
...: 'Age_27': [4242, 3343, 6667, 878, 78],
...: 'Age_28': [4344, 63475, 132, 6776, 8676],
...: 'Age_29': [4242, 34433, 323, 7676, 786],
...: 'Age_30': [6464, 34434, 3434, 898, 9798]
...: }
...:
...: df = pd.DataFrame(data=d)
...: df = df.set_index('City')
...: df
Out[2]:
Age_25 Age_26 Age_27 Age_28 Age_29 Age_30
City
New York 11312 3646 4242 4344 4242 6464
London 6446 2534 3343 63475 34433 34434
Paris 5242 34343 6667 132 323 3434
Hong Kong 354 979 878 6776 7676 898
Buenos Aires 4244 7687 78 8676 786 9798
In [3]: n_cols = 3 # change to 5 for actual dataset
...: sum_every_n_cols_df = df.groupby((np.arange(len(df.columns)) // n_cols) + 1, axis=1).sum()
...: sum_every_n_cols_df
Out[3]:
1 2
City
New York 19200 15050
London 12323 132342
Paris 46252 3889
Hong Kong 2211 15350
Buenos Aires 12009 19260
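Note that groupby(..., axis=1) is deprecated in recent pandas versions; an equivalent formulation (a sketch on the same data, reusing the np and df defined above) transposes, groups the rows, and transposes back:
n_cols = 3
keys = np.arange(len(df.columns)) // n_cols + 1  # same grouping keys as above
sum_every_n_cols_df = df.T.groupby(keys).sum().T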
You can extract the columns of the dataframe and put them in a list with
col_list = df.columns
But ultimately, I think what you want is more of a while loop, with your inputs (a band of 5, up to 100 ages) as static values that you iterate over.
band = 5
start = 20
max_age = 120

i = start
while i < max_age:
    age_start = i
    age_end = i + band - 1  # last age in this band
    col_name = 'age_' + str(age_start) + '_to_' + str(age_end)
    # collect the single-age columns that fall inside the band, then sum them
    sum_cols = ['age_' + str(a) for a in range(age_start, age_end + 1)]
    df[col_name] = df[sum_cols].sum(axis=1)
    i += band

Get stock Low of Day (LOD) price for incomplete daily bar using minute bar data (multiple stocks, multiple sessions in one df) SettingWithCopyWarning

I have a dataframe of minute data for multiple Stocks, each stock has multiple sessions. See sample below
Symbol Time Open High Low Close Volume LOD
2724312 AEHR 2019-09-23 09:31:00 1.42 1.42 1.42 1.42 200 NaN
2724313 AEHR 2019-09-23 09:43:00 1.35 1.35 1.34 1.34 6062 NaN
2724314 AEHR 2019-09-23 09:58:00 1.35 1.35 1.29 1.30 8665 NaN
2724315 AEHR 2019-09-23 09:59:00 1.32 1.32 1.32 1.32 100 NaN
2724316 AEHR 2019-09-23 10:00:00 1.35 1.35 1.35 1.35 400 NaN
... ... ... ... ... ... ... ... ...
4266341 ZI 2021-09-10 15:56:00 63.08 63.16 63.08 63.15 18205 NaN
4266342 ZI 2021-09-10 15:57:00 63.14 63.14 63.07 63.07 19355 NaN
4266343 ZI 2021-09-10 15:58:00 63.07 63.12 63.07 63.10 16650 NaN
4266344 ZI 2021-09-10 15:59:00 63.09 63.12 63.06 63.11 25775 NaN
4266345 ZI 2021-09-10 16:00:00 63.11 63.17 63.11 63.17 28578 NaN
I need the Low of Day (LOD) for the session (9:30am-4pm) up to and including the time in each row.
The completed df should look like this:
Symbol Time Open High Low Close Volume LOD
2724312 AEHR 2019-09-23 09:31:00 1.42 1.42 1.42 1.42 200 1.42
2724313 AEHR 2019-09-23 09:43:00 1.35 1.35 1.34 1.34 6062 1.34
2724314 AEHR 2019-09-23 09:58:00 1.35 1.35 1.29 1.30 8665 1.29
2724315 AEHR 2019-09-23 09:59:00 1.32 1.32 1.32 1.32 100 1.29
2724316 AEHR 2019-09-23 10:00:00 1.35 1.35 1.35 1.35 400 1.29
... ... ... ... ... ... ... ... ...
4266341 ZI 2021-09-10 15:56:00 63.08 63.16 63.08 63.15 18205 63.08
4266342 ZI 2021-09-10 15:57:00 63.14 63.14 63.07 63.07 19355 63.07
4266343 ZI 2021-09-10 15:58:00 63.07 63.12 63.07 63.10 16650 63.07
4266344 ZI 2021-09-10 15:59:00 63.09 63.12 63.06 63.11 25775 63.06
4266345 ZI 2021-09-10 16:00:00 63.11 63.17 63.11 63.17 28578 63.06
My current solution:
prev_symbol = "WXYZ"
prev_low = 10000000
prev_session = datetime.date(1920, 1, 1)
session_start = 1
for i, row in df.iterrows():
    current_session = df['Time'].iloc[i].date()
    current_symbol = df['Symbol'].iloc[i]
    if current_symbol == prev_symbol:
        if current_session == prev_session:
            sesh_low = df['Low'].iloc[session_start:i].min()
            df.at[i, 'LOD'] = sesh_low
        else:
            df.at[i, 'LOD'] = df.at[i, 'Low']
            prev_session = current_session
            session_start = i
    else:
        df.at[i, 'LOD'] = df.at[i, 'Low']
        prev_symbol = current_symbol
        prev_session = current_session
        session_start = i
This raises a SettingWithCopyWarning. Please help.
You can try .groupby() + .expanding():
# if you have values already converted/sorted, skip:
# df["Time"] = pd.to_datetime(df["Time"])
# df = df.sort_values(by=["Symbol", "Time"])
df["LOD"] = df.groupby("Symbol")["Low"].expanding().min().values
print(df)
Prints:
Symbol Time Open High Low Close Volume LOD
2724312 AEHR 2019-09-23 09:31:00 1.42 1.42 1.42 1.42 200 1.42
2724313 AEHR 2019-09-23 09:43:00 1.35 1.35 1.34 1.34 6062 1.34
2724314 AEHR 2019-09-23 09:58:00 1.35 1.35 1.29 1.30 8665 1.29
2724315 AEHR 2019-09-23 09:59:00 1.32 1.32 1.32 1.32 100 1.29
2724316 AEHR 2019-09-23 10:00:00 1.35 1.35 1.35 1.35 400 1.29
4266341 ZI 2021-09-10 15:56:00 63.08 63.16 63.08 63.15 18205 63.08
4266342 ZI 2021-09-10 15:57:00 63.14 63.14 63.07 63.07 19355 63.07
4266343 ZI 2021-09-10 15:58:00 63.07 63.12 63.07 63.10 16650 63.07
4266344 ZI 2021-09-10 15:59:00 63.09 63.12 63.06 63.11 25775 63.06
4266345 ZI 2021-09-10 16:00:00 63.11 63.17 63.11 63.17 28578 63.06
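A caveat on this approach (my addition, not part of the original answer): with multiple sessions per symbol, the per-symbol expanding minimum carries across days. Grouping on both the symbol and the calendar date restricts the running minimum to each session; a minimal sketch, assuming Time is already a datetime column:
# running per-session low: minimum tracking restarts each trading day
df["LOD"] = df.groupby(["Symbol", df["Time"].dt.date])["Low"].cummin()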

index value vs. flight (data range A row & E row )

I want a scatter plot of the sum of the flight field per minute. My data is as follows:
http://python2018.byethost10.com/flights.csv
My code is as follows:
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib
matplotlib.rcParams['font.sans-serif'] = ['Noto Serif CJK TC']
matplotlib.rcParams['font.family']='sans-serif'
df = pd.read_csv('flights.csv')
df["time_hour"] = pd.to_datetime(df['time_hour'])
grp = df.groupby(by=[df.time_hour.map(lambda x : (x.hour, x.minute))])
a=grp.sum()
plt.scatter(a.index, a['flight'], c='b', marker='o')
plt.xlabel('index value', fontsize=16)
plt.ylabel('flight', fontsize=16)
plt.title('scatter plot - index value vs. flight (data range A row & E row )', fontsize=20)
plt.show()
Produced the following error:
Traceback (most recent call last):
  File "I:/PycharmProjects/1223/raise1/char3.py", line 10, in <module>
    plt.scatter(a.index, a['flight'], c='b', marker='o')
  File "C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\pyplot.py", line 3470, in scatter
    edgecolors=edgecolors, data=data, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\__init__.py", line 1855, in inner
    return func(ax, *args, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py", line 4320, in scatter
    alpha=alpha
  File "C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\collections.py", line 927, in __init__
    Collection.__init__(self, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\collections.py", line 159, in __init__
    offsets = np.asanyarray(offsets, float)
  File "C:\ProgramData\Anaconda3\lib\site-packages\numpy\core\numeric.py", line 544, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
ValueError: setting an array element with a sequence.
How can I produce the following results? Thank you.
http://python2018.byethost10.com/image.png
The problem is in the aggregation: your code returns tuples in the index.
The solution is to convert the time_hour column to HH:MM strings with Series.dt.strftime:
a = df.groupby(by=[df.time_hour.dt.strftime('%H:%M')]).sum()
All together:
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib
matplotlib.rcParams['font.sans-serif'] = ['Noto Serif CJK TC']
matplotlib.rcParams['font.family']='sans-serif'
#first column is index and second column is parsed to datetimes
df=pd.read_csv('flights.csv', index_col=[0], parse_dates=[1])
a = df.groupby(by=[df.time_hour.dt.strftime('%H:%M')]).sum()
print (a)
year sched_dep_time flight air_time distance hour minute
time_hour
05:00 122793 37856 87445 11282.0 72838 366 1256
05:01 120780 44810 82113 11115.0 71168 435 1310
05:02 122793 52989 99975 11165.0 72068 515 1489
05:03 120780 57653 98323 10366.0 65137 561 1553
05:04 122793 67706 110230 10026.0 63118 661 1606
05:05 122793 75807 126426 9161.0 55371 742 1607
05:06 120780 82010 120753 10804.0 67827 799 2110
05:07 122793 90684 130339 8408.0 52945 890 1684
05:08 120780 93687 114415 10299.0 63271 922 1487
05:09 122793 101571 99526 11525.0 72915 1002 1371
05:10 122793 107252 107961 10383.0 70137 1056 1652
05:11 120780 111351 120261 10949.0 73350 1098 1551
05:12 122793 120575 135930 8661.0 57406 1190 1575
05:13 120780 118272 104763 7784.0 55886 1166 1672
05:14 122793 37289 109300 9838.0 63582 364 889
05:15 122793 42374 67193 11480.0 78183 409 1474
05:16 58377 22321 53424 4271.0 27527 216 721
plt.scatter(a.index, a['flight'], c='b', marker='o')
#rotate labels of x axis
plt.xticks(rotation=90)
plt.xlabel('index value', fontsize=16)
plt.ylabel('flight', fontsize=16)
plt.title('scatter plot - index value vs. flight (data range A row & E row )', fontsize=20)
plt.show()
Another solution is to convert the datetimes to times:
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib
matplotlib.rcParams['font.sans-serif'] = 'Noto Serif CJK TC'
matplotlib.rcParams['font.family']='sans-serif'
df=pd.read_csv('flights.csv', index_col=[0], parse_dates=[1])
a = df.groupby(by=[df.time_hour.dt.time]).sum()
print (a)
year sched_dep_time flight air_time distance hour minute
time_hour
05:00:00 122793 37856 87445 11282.0 72838 366 1256
05:01:00 120780 44810 82113 11115.0 71168 435 1310
05:02:00 122793 52989 99975 11165.0 72068 515 1489
05:03:00 120780 57653 98323 10366.0 65137 561 1553
05:04:00 122793 67706 110230 10026.0 63118 661 1606
05:05:00 122793 75807 126426 9161.0 55371 742 1607
05:06:00 120780 82010 120753 10804.0 67827 799 2110
05:07:00 122793 90684 130339 8408.0 52945 890 1684
05:08:00 120780 93687 114415 10299.0 63271 922 1487
05:09:00 122793 101571 99526 11525.0 72915 1002 1371
05:10:00 122793 107252 107961 10383.0 70137 1056 1652
05:11:00 120780 111351 120261 10949.0 73350 1098 1551
05:12:00 122793 120575 135930 8661.0 57406 1190 1575
05:13:00 120780 118272 104763 7784.0 55886 1166 1672
05:14:00 122793 37289 109300 9838.0 63582 364 889
05:15:00 122793 42374 67193 11480.0 78183 409 1474
05:16:00 58377 22321 53424 4271.0 27527 216 721
plt.scatter(a.index, a['flight'], c='b', marker='o')
plt.xticks(rotation=90)
plt.xlabel('index value', fontsize=16)
plt.ylabel('flight', fontsize=16)
plt.title('scatter plot - index value vs. flight (data range A row & E row )', fontsize=20)
plt.show()

Select subset by a conditional expression from a pandas DataFrame, but an error occurs

A sample like this:
In [39]: ts = pd.Series(np.random.randn(20),index=pd.date_range('1/1/2000',periods=20))
In [40]: t = pd.DataFrame(ts,columns=['base'],index=ts.index)
In [42]: t['shift_one'] = t.base - t.base.shift(1)
In [43]: t['shift_two'] = t.shift_one.shift(1)
In [44]: t
Out[44]:
base shift_one shift_two
2000-01-01 -1.239924 NaN NaN
2000-01-02 1.116260 2.356184 NaN
2000-01-03 0.401853 -0.714407 2.356184
2000-01-04 -0.823275 -1.225128 -0.714407
2000-01-05 -0.562103 0.261171 -1.225128
2000-01-06 0.347143 0.909246 0.261171
.............
2000-01-20 -0.062557 -0.467466 0.512293
Now, if we use t[t.shift_one > 0], it works OK, but when we use:
In [48]: t[t.shift_one > 0 and t.shift_two < 0]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-48-...> in <module>()
----> 1 t[t.shift_one > 0 and t.shift_two < 0]
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Suppose we want to get a subset that satisfies both conditions; how can we do that? Thanks a lot.
You need parentheses, and use &, not and.
See the docs here:
http://pandas.pydata.org/pandas-docs/dev/indexing.html#boolean-indexing
In [3]: ts = pd.Series(np.random.randn(20),index=pd.date_range('1/1/2000',periods=20))
In [4]: t = pd.DataFrame(ts,columns=['base'],index=ts.index)
In [5]: t['shift_one'] = t.base - t.base.shift(1)
In [6]: t['shift_two'] = t.shift_one.shift(1)
In [7]: t
Out[7]:
base shift_one shift_two
2000-01-01 -1.116040 NaN NaN
2000-01-02 1.592079 2.708118 NaN
2000-01-03 0.958642 -0.633436 2.708118
2000-01-04 0.431970 -0.526672 -0.633436
2000-01-05 1.275624 0.843654 -0.526672
2000-01-06 0.498401 -0.777223 0.843654
2000-01-07 -0.351465 -0.849865 -0.777223
2000-01-08 -0.458742 -0.107277 -0.849865
2000-01-09 -2.100404 -1.641662 -0.107277
2000-01-10 0.601658 2.702062 -1.641662
2000-01-11 -2.026495 -2.628153 2.702062
2000-01-12 0.391426 2.417921 -2.628153
2000-01-13 -1.177292 -1.568718 2.417921
2000-01-14 -0.374543 0.802749 -1.568718
2000-01-15 0.338649 0.713192 0.802749
2000-01-16 -1.124820 -1.463469 0.713192
2000-01-17 0.484175 1.608995 -1.463469
2000-01-18 -1.477772 -1.961947 1.608995
2000-01-19 0.481843 1.959615 -1.961947
2000-01-20 0.760168 0.278325 1.959615
In [8]: t[(t.shift_one>0) & (t.shift_two<0)]
Out[8]:
base shift_one shift_two
2000-01-05 1.275624 0.843654 -0.526672
2000-01-10 0.601658 2.702062 -1.641662
2000-01-12 0.391426 2.417921 -2.628153
2000-01-14 -0.374543 0.802749 -1.568718
2000-01-17 0.484175 1.608995 -1.463469
2000-01-19 0.481843 1.959615 -1.961947
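As a side note (my addition, not part of the original answer): DataFrame.query evaluates a string expression in which the plain and/or keywords are allowed, so the same selection can also be written as:
# equivalent selection; 'and' is legal inside the query expression string
t.query('shift_one > 0 and shift_two < 0')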