Using Pandas and Numpy to search for conditions within binned data in 2 data frames

Python newbie here. Here's a simplified example of my problem. I have 2 pandas dataframes.
One dataframe lightbulb_df has data on whether a light is on or off and looks something like this:
Light_Time  Light On?
5790.76     0
5790.76     0
5790.771    1
5790.779    1
5790.779    1
5790.782    0
5790.783    1
5790.783    1
5790.784    0
Here the time is in seconds since the start of the day; 1 means the lightbulb is on and 0 means it is off.
The second dataframe sensor_df shows whether or not a sensor detected the lightbulb and has different time values and rates.
Sensor_Time  Sensor Detect?
5790.8       0
5790.9       0
5791.0       1
5791.1       1
5791.2       1
5791.3       0
Both dataframes are very large with 100,000s of rows. The lightbulb will turn on for a few minutes and then turn off, then back on, etc.
Using the .diff() function, I was able to compare each row to its predecessor and, depending on whether the result was 1 or -1, build a simplified table of on and off times to append to lightbulb_df.
# use .diff() to compare each row to the last row
lightbulb_df['light_diff'] = lightbulb_df['Light On?'].diff()
# the light on start times are when
# .diff() is greater than 0 (1 - 0 = 1)
light_start = lightbulb_df.loc[lightbulb_df['light_diff'] > 0]
# the light off start times (first times when light turns off)
# are when .diff() is less than 0 (0 - 1 = -1)
light_off = lightbulb_df.loc[lightbulb_df['light_diff'] < 0]
# and then I can concatenate them to have
# a single changed state df that only captures when the lightbulb changes
lightbulb_changes = pd.concat((light_start, light_off)).sort_values(by=['Light_Time'])
So I end up with a dataframe of on start times, a dataframe of off start times, and a change state dataframe that looks like this.
Light_Time  Light On?  light_diff
5790.771    1           1
5790.782    0          -1
5790.783    1           1
5790.784    0          -1
Now my goal is to search sensor_df during each of the changed-state periods above (5790.771 to 5790.782, and 5790.783 to 5790.784) in 1 second intervals to see whether the sensor detected the lightbulb. For each of the many light-on periods in the change state dataframe, I want the number of seconds the lightbulb was on and the number of seconds the sensor detected it; I'm trying to get the % correctly detected.
Whenever I try to plan this out, I end up with lots of nested for or while loops, which I know will be really slow with 100,000s of rows of data. I thought about using pd.cut to divide the dataframe into 1 second intervals, and I made a for loop to cycle through the times in the changed state dataframe with a while loop nested inside to step through the 1 second intervals, but that seems like it would be really slow.
I know python has a lot of built in functions that could help but I'm having trouble knowing what to google to find the right one.
Any advice would be appreciated.
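One vectorized way to avoid the loops entirely (a sketch, using lightbulb_changes and sensor_df from above, and assuming strict on/off alternation so every on time has a matching off time, plus roughly evenly spaced sensor samples): assign each sensor sample to a light-on period with np.searchsorted, then average the 0/1 detect flags per period; the mean is the detected fraction, so no explicit 1 second binning is needed.

import numpy as np
import pandas as pd

# on/off boundaries from the change state dataframe
on_times = lightbulb_changes.loc[lightbulb_changes['light_diff'] > 0, 'Light_Time'].to_numpy()
off_times = lightbulb_changes.loc[lightbulb_changes['light_diff'] < 0, 'Light_Time'].to_numpy()

# index of the most recent "on" time at or before each sensor sample
t = sensor_df['Sensor_Time'].to_numpy()
period = np.searchsorted(on_times, t, side='right') - 1

# keep only samples that actually fall inside an on period
inside = (period >= 0) & (t <= off_times[period.clip(0)])
percent_detected = sensor_df.loc[inside, 'Sensor Detect?'].groupby(period[inside]).mean() * 100

Seconds on per period is just off_times - on_times, and seconds detected is that duration times the detected fraction.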

Related

Understanding Pandas Series Data Structure

I am trying to get my head around the Pandas module and started learning about the Series data structure.
I have created the following Series in Spyder :-
songs = pd.Series(data = [145,142,38,13], name = "Count")
I can obtain information about the Series index using the code:-
songs.index
The output of the above code is as follows:
RangeIndex(start=0, stop=4, step=1)
My question is: where it states start=0 and stop=4, what are these referring to?
I have interpreted start=0 as meaning the first element of the Series is in row 0.
But I am not sure what the stop value refers to, as there is no element in row 4 of the Series.
Can someone explain?
Thank you.
This concept, as already explained in the comments (the last valid index is one less than the count of items, because the stop value is exclusive), is prevalent in many places.
For instance, take the list data structure:

z = songs.to_list()
z        # [145, 142, 38, 13]
len(z)   # 4 -- length is four
# however, indexing stops at position i-1, 'i' being the length/count of items in the list
z[4]     # this will raise an IndexError
# you have to start at index 0 and go up to only index 3 (i.e. 4 items)
z[0], z[1], z[2], z[-1]  # notice how -1 can be used to directly access the last element
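Since a default Series index is a RangeIndex, which mirrors Python's built-in range, the exclusive stop is easy to demonstrate:

import pandas as pd

songs = pd.Series(data=[145, 142, 38, 13], name="Count")
print(songs.index)        # RangeIndex(start=0, stop=4, step=1)

# RangeIndex behaves like range(): start is inclusive, stop is exclusive
print(list(range(0, 4)))  # [0, 1, 2, 3] -- positions 0 through 3, nothing at 4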

Count past relative number of rows that meet criteria similar to excel's COUNTIF

I have a stock data csv with the following info
Open High Low Close
0 154.55 155.54 152.90 153.41
1 156.82 158.75 155.42 156.76
2 150.21 157.44 150.15 156.33
3 147.78 149.38 146.88 149.11
4 144.25 147.28 143.90 146.27
5 142.90 144.05 140.79 143.73
I want to count how many of the previous two highs were above the current open.
In Excel I can do this with a COUNTIF formula.
I would like to calculate this using pandas, but so far I have not been able to get anything working.
The closest I have gotten is with the following function and then using apply
def high_counter(day_open):
    count = 0
    for i in data['High'][:2]:
        if i > day_open:
            count += 1
    return count

data['Number of previous highs above'] = data['Open'].apply(high_counter)
However, this leads to the comparison always starting from the top down instead of down from the relative cell like in excel.
To summarize, I need to compare the Open with the previous N Highs and get a count of how many are above that Open, whether the Open is in the first row or the 50th. The Highs to be compared begin at the row prior to the Open and span N rows.
IIUC:
df["above"] = pd.DataFrame([df["High"].shift(-1)>df["Open"],
df["High"].shift(-2)>df["Open"]]).T.sum(axis=1)
# or [df["High"].shift(n)>df["Open"] for n in range(-1,-5,-1)] if you want to generalize the number n
print (df)
Open High Low Close above
0 154.55 155.54 152.90 153.41 2
1 156.82 158.75 155.42 156.76 1
2 150.21 157.44 150.15 156.33 0
3 147.78 149.38 146.88 149.11 0
4 144.25 147.28 143.90 146.27 0
5 142.90 144.05 140.79 143.73 0
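The same trick generalizes to any window size N. A sketch (note the rows are assumed to be sorted newest-first, which is why the "previous" days sit below the current row; the expected output above implies this ordering):

import pandas as pd

df = pd.DataFrame({
    "Open": [154.55, 156.82, 150.21, 147.78, 144.25, 142.90],
    "High": [155.54, 158.75, 157.44, 149.38, 147.28, 144.05],
})

# count how many of the N rows below (the previous N days) have a High
# above the current row's Open; NaNs from shift() compare as False
N = 2
df["above"] = sum(df["High"].shift(-n) > df["Open"] for n in range(1, N + 1))
print(df)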

Dendrograms with SciPy

I have a dataset that I shaped according to my needs; the dataframe is as follows:
Index      A     B  C  D     ...  Z
Date/Time  1     0  0  0.35  ...  1
Date/Time  0.75  1  1  1     ...  1
The total number of rows is 8878
What I am trying to do is create a time-series dendrogram (for example, the whole A column compared to the whole B column across all time).
I am expecting an output like this (example dendrogram image, source: rsc.org).
I tried to construct the linkage matrix with Z = hierarchy.linkage(X, 'ward')
However, when I print the dendrogram, it just shows an empty picture.
There is no problem if I compare every time point with every other and plot, but that way the dendrogram becomes far too complicated to read, even in truncated form.
Is there a way to handle the data as a whole time series and compare within columns in SciPy?
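One likely cause, offered as a guess rather than a confirmed fix: hierarchy.linkage clusters rows, so passing the 8878-row frame directly produces 8878 leaves, which renders as an unreadable (seemingly empty) dendrogram. To compare whole columns as time series, cluster the transpose so each leaf is one column. A minimal sketch, with random placeholder data standing in for the real frame:

import numpy as np
import pandas as pd
from scipy.cluster import hierarchy
import matplotlib.pyplot as plt

# placeholder data shaped like the question: 8878 time points, columns A..Z
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((8878, 26)),
                  columns=list("ABCDEFGHIJKLMNOPQRSTUVWXYZ"))

# cluster the transpose: each row of df.T is one complete time series
Z = hierarchy.linkage(df.T.values, 'ward')
hierarchy.dendrogram(Z, labels=df.columns.to_list())
plt.show()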

Graph to show departure and arrival times between stations

I have the start and end times of trips made by a bus, with the times in an Excel sheet. I want to make a graph like the one below:
I tried Matlab nodes and graphs but did not get the exact figure; below is the Matlab code I tried as an example:
A = [1 4]
B = [2 3]
weights = [5 5];
G = digraph(A,B,weights,4)
plot(G)
And the figure it generates:
I have got many more than 4 points in the Excel sheet, and I want them to all be displayed as in the first image.
Overview
You don't need any sort of complicated graph package for this, just use normal line plots! Here are methods in Excel and Matlab.
Excel
Give each bus stop a number, and list the bus stop number by the time it arrives/leaves there. I'll use stops number 0 and 1 for this example.
0 04:41
1 05:35
1 05:40
0 06:34
0 06:51
1 07:45
1 15:21
0 16:15
Then simply highlight the data and insert a "scatter with straight lines"
The rest is formatting. You can format the y-axis and tick "values in reverse order" to get the time increasing as in your desired plot. You can change the x-axis tick marks to just show integer stop numbers, get rid of the legend etc.
Final output:
Matlab
Here is the Matlab documentation for converting Excel formatted dates into Matlab datetime arrays: Convert Excel Date Number to Datetime.
Once you have the datetime objects, you can do this easily with the standard plot function.
% Set times up as a datetime array, could do this any number of ways
times = datetime(strcat({'1/1/2000 '}, {'04:41', '05:35', '05:40', '06:34', '06:51', '07:45', '15:21', '16:15'}, ':00'), 'format', 'dd/MM/yyyy HH:mm:ss');
% Set up the location of the bus at each of the above times
station = [0,1,1,0,0,1,1,0];
% Plot
plot(station, times) % Create plot
set(gca, 'xtick', [0,1]) % Limit to just ticks at the 2 stops
set(gca, 'ydir', 'reverse') % Reverse y axis to have earlier at top
set(gca,'XTickLabel',{'R', 'L'}) % Name the stops
Output:
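The same approach translates directly to Python if that is handier than Excel or Matlab. A sketch with matplotlib (not part of the original answer), using the example times from above:

import matplotlib.pyplot as plt
import pandas as pd

# same example times and stops as the Matlab answer
times = pd.to_datetime(['04:41', '05:35', '05:40', '06:34',
                        '06:51', '07:45', '15:21', '16:15'])
station = [0, 1, 1, 0, 0, 1, 1, 0]

plt.plot(station, times)
plt.xticks([0, 1], ['R', 'L'])  # name the two stops
plt.gca().invert_yaxis()        # earlier times at the top
plt.show()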

Pandas shifting uneven timeseries data

I have some irregularly stamped time series data in pandas, with a timestamp and an observation at each point. Irregular basically means that the timestamps are uneven; the gap between two successive timestamps is not constant.
For instance the data may look like
Timestamp Property
0 100
1 200
4 300
6 400
6 401
7 500
14 506
24 550
.....
59 700
61 750
64 800
Here the timestamp is, say, seconds elapsed since a chosen origin time. As you can see, we can have data at the same timestamp (6 secs in this case). The underlying times are actually strictly different; it is just that second resolution cannot capture the difference.
Now I need to shift the timeseries data ahead, say I want to shift the entire data by 60 secs, or a minute. So the target output is
Timestamp Property
0 750
1 800
So the 0 point got matched to the 61 point and the 1 point got matched to the 64 point.
Now I could do this by writing something dirty, but I am looking to use built-in pandas features as much as possible. If the timeseries were regular, i.e. evenly gapped, I could have just used the shift() function. But the fact that the series is uneven makes it a bit tricky. Any ideas from Pandas experts would be welcome. I feel that this is a commonly encountered problem. Many thanks!
Edit: added a second, more elegant way to do it. I don't know what will happen if you have a timestamp at 1 and two timestamps of 61; I think it will choose the first 61 timestamp, but I am not sure.
new_stamps = pd.Series(range(df['Timestamp'].max() + 1))
shifted = pd.DataFrame(new_stamps)
shifted.columns = ['Timestamp']
merged = pd.merge(df, shifted, on='Timestamp', how='outer')
merged['Timestamp'] = merged['Timestamp'] - 60
merged = merged.sort_values(by='Timestamp').bfill()
results = pd.merge(df, merged, on='Timestamp')
[Original Post]
I can't think of an inbuilt or elegant way to do this. Posting this in case it's more elegant than your "something dirty", which is I guess unlikely. How about:
lookup_dict = {}

def assigner(row):
    lookup_dict[row['Timestamp']] = row['Property']

df.apply(assigner, axis=1)
sorted_keys = sorted(lookup_dict.keys())  # keys() is a method, note the parentheses

df['Property_Shifted'] = None

def get_shifted_property(row, shift_amt):
    # walk the sorted timestamps and take the first one at or after the target
    for i in sorted_keys:
        if i >= row['Timestamp'] + shift_amt:
            row['Property_Shifted'] = lookup_dict[i]
            return row
    return row  # no later timestamp exists; Property_Shifted stays None

df = df.apply(get_shifted_property, shift_amt=60, axis=1)
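For completeness, newer pandas (0.19+) has a built-in for exactly this kind of uneven-timestamp matching: pd.merge_asof. A sketch assuming you want the first observation strictly after t + 60, which is what the example output implies (0 matches 61, while 1 skips the exact match at 61 and takes 64):

import pandas as pd

df = pd.DataFrame({
    'Timestamp': [0, 1, 4, 6, 6, 7, 14, 24, 59, 61, 64],
    'Property':  [100, 200, 300, 400, 401, 500, 506, 550, 700, 750, 800],
})

# shift the source timestamps back by 60, then match each original
# timestamp forward to the next strictly later observation
shifted = pd.merge_asof(
    df[['Timestamp']],
    df.assign(Timestamp=df['Timestamp'] - 60),
    on='Timestamp',
    direction='forward',
    allow_exact_matches=False,  # strictly after, per the example output
)
print(shifted.dropna())  # Timestamp 0 -> 750, Timestamp 1 -> 800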