Graph to show departure and arrival times between stations - vba

I have the start and end times of trips made by a bus, stored in an Excel sheet. I want to make a graph like the one below:
I tried Matlab nodes and graphs but did not get the exact figure. Below is the Matlab code I tried as an example:
A = [1 4]
B = [2 3]
weights = [5 5];
G = digraph(A,B,weights,4)
plot(G)
And the figure it generates:
I have got many more than 4 points in the Excel sheet, and I want them to all be displayed as in the first image.

Overview
You don't need any sort of complicated graph package for this, just use normal line plots! Here are methods in Excel and Matlab.
Excel
Give each bus stop a number, and list the bus stop number by the time it arrives/leaves there. I'll use stops number 0 and 1 for this example.
0 04:41
1 05:35
1 05:40
0 06:34
0 06:51
1 07:45
1 15:21
0 16:15
Then simply highlight the data and insert a "scatter with straight lines" chart.
The rest is formatting. You can format the y-axis and tick "values in reverse order" to get the time increasing as in your desired plot. You can change the x-axis tick marks to just show integer stop numbers, get rid of the legend etc.
Final output:
Matlab
Here is the Matlab documentation for converting Excel formatted dates into Matlab datetime arrays: Convert Excel Date Number to Datetime.
Once you have the datetime objects, you can do this easily with the standard plot function.
% Set times up as a datetime array, could do this any number of ways
times = datetime(strcat({'1/1/2000 '}, {'04:41', '05:35', '05:40', '06:34', '06:51', '07:45', '15:21', '16:15'}, ':00'), 'format', 'dd/MM/yyyy HH:mm:ss');
% Set up the location of the bus at each of the above times
station = [0,1,1,0,0,1,1,0];
% Plot
plot(station, times) % Create plot
set(gca, 'xtick', [0,1]) % Limit to just ticks at the 2 stops
set(gca, 'ydir', 'reverse') % Reverse y axis to have earlier at top
set(gca,'XTickLabel',{'R', 'L'}) % Name the stops
Output:

Related

Using Pandas and Numpy to search for conditions within binned data in 2 data frames

Python newbie here. Here's a simplified example of my problem. I have 2 pandas dataframes.
One dataframe lightbulb_df has data on whether a light is on or off and looks something like this:
Light_Time  Light On?
5790.76     0
5790.76     0
5790.771    1
5790.779    1
5790.779    1
5790.782    0
5790.783    1
5790.783    1
5790.784    0
Where the time is in seconds since start of day and 1 is the lightbulb is on, 0 means the lightbulb is off.
The second dataframe sensor_df shows whether or not a sensor detected the lightbulb and has different time values and rates.
Sensor_Time  Sensor Detect?
5790.8       0
5790.9       0
5791.0       1
5791.1       1
5791.2       1
5791.3       0
Both dataframes are very large with 100,000s of rows. The lightbulb will turn on for a few minutes and then turn off, then back on, etc.
Using the .diff function, I was able to compare each row to its predecessor and depending on whether the result was 1 or -1 create a truth table with simplified on and off times and append it to lightbulb_df.
# use .diff() to compare each row to the previous row
lightbulb_df['light_diff'] = lightbulb_df['Light On?'].diff()
# the light on start times are when
# .diff is greater than 0 (1 - 0 = 1)
light_start = lightbulb_df.loc[lightbulb_df['light_diff'] > 0]
# the light off start times (first times when light turns off)
# are when .diff is less than 0 (0 - 1 = -1)
light_off = lightbulb_df.loc[lightbulb_df['light_diff'] < 0]
# and then I can concatenate them to have
# a single changed state df that only captures when the lightbulb changes
lightbulb_changes = pd.concat((light_start, light_off)).sort_values(by=['Light_Time'])
So I end up with a dataframe of on start times, a dataframe of off start times, and a change state dataframe that looks like this.
Light_Time  Light On?  light_diff
5790.771    1          1
5790.782    0          -1
5790.783    1          1
5790.784    0          -1
Now my goal is to search the sensor_df dataframe during each of the changed state times (above 5790.771 to 5790.782 and 5790.783 to 5790.784) by 1 second intervals to see whether or not the sensor detected the lightbulb. So I want to end up with the number of seconds the lightbulb was on and the number of seconds the sensor detected the lightbulb for each of the many light on periods in the change state dataframe. I'm trying to get % correctly detected.
Whenever I try to plan this out, I end up using lots of nested for loops or while loops which I know will be really slow with 100,000s of rows of data. I thought about using the .cut function to divide up the dataframe into 1 second intervals. I made a for loop to cycle through each of the times in the changed state dataframe and then nested a while loop inside to loop through 1 second intervals but that seems like it would be really slow.
I know python has a lot of built in functions that could help but I'm having trouble knowing what to google to find the right one.
Any advice would be appreciated.
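One loop-light direction, sketched here with the standard library only and not taken from the original post: pair each on time with the following off time, keep the sensor times sorted, and use binary search to find the sensor samples that fall inside each interval. The function name `detection_rates` and the sample values are hypothetical. The sketch also assumes the sensor reports at a fixed rate, so the fraction of samples that detected the light stands in for the fraction of time.

```python
from bisect import bisect_left, bisect_right
from itertools import accumulate

def detection_rates(intervals, sensor_times, sensor_flags):
    """Return, per (on, off) interval, the fraction of sensor samples that detected the light.

    sensor_times must be sorted ascending; intervals with no samples yield None.
    """
    # prefix[k] = number of detections among the first k sensor samples
    prefix = [0] + list(accumulate(sensor_flags))
    rates = []
    for on, off in intervals:
        lo = bisect_left(sensor_times, on)    # first sample at or after `on`
        hi = bisect_right(sensor_times, off)  # one past the last sample at or before `off`
        count = hi - lo
        rates.append((prefix[hi] - prefix[lo]) / count if count else None)
    return rates

# Hypothetical values, not taken from the post
intervals = [(10.0, 10.5), (20.0, 20.3)]
sensor_times = [9.9, 10.0, 10.1, 10.2, 10.3, 10.4, 10.5, 10.6]
sensor_flags = [0, 0, 1, 1, 1, 1, 0, 0]
rates = detection_rates(intervals, sensor_times, sensor_flags)
print(rates)
```

Each interval costs two binary searches instead of a scan, so the work grows with the number of on periods rather than with the number of sensor rows. The same idea could probably be expressed in pandas or NumPy with a sorted search over the Sensor_Time column.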

Dendrograms with SciPy

I have a dataset that I shaped according to my needs, the dataframe is as follows:
Index      A     B  C  D     ...  Z
Date/Time  1     0  0  0,35  ...  1
Date/Time  0,75  1  1  1     ...  1
The total number of rows is 8878
What I am trying to do is create a time-series dendrogram (for example, the whole A column compared with the whole B column over the whole time range).
I am expecting an output like this:
(source: rsc.org)
I tried to construct the linkage matrix with Z = hierarchy.linkage(X, 'ward')
However, when I print the dendrogram, it just shows an empty picture.
There is no problem if I compare every time point with each other and plot, but that way the dendrogram becomes far too complicated to read, even in truncated form.
Is there a way to handle the data as a whole time series and compare within columns in SciPy?

Mapping column values to a combination of another csv file's information

I have a dataset that encodes date and time as a 5-digit daycode: ddd + hm.
The ddd part counts days starting from 1 January 2009. Since the data was collected over a 2-year period from that date, its [min, max] is [1, 365 x 2 = 730].
Data is recorded at 30-minute intervals, so each 24-hour day has up to 48 readings, and the [min, max] for hm is [1, 48].
Below is an excerpt of the daycode.csv file, which maps the ddd part of the daycode to a date and the hm part to a time.
I believe I agreed not to share the dataset, which is from ISSDA, so I will just say that a daycode in File1.txt looks like '63317'.
This link gave me a glimpse of how to approach the problem, and I was in the middle of putting this code together, which of course won't work at this point.
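import pandas as pd
import matplotlib.pyplot as plt
consume = pd.read_csv("data/File1.txt", sep=' ', encoding="utf-8", names=['meter', 'daycode', 'val'])
df1 = pd.read_csv("data/daycode.csv", encoding="cp1252", names=['code', 'print'])
test = consume[consume['meter'] == 1048].copy()  # copy to avoid SettingWithCopyWarning on the assignment below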
test['daycode'] = test['daycode'].map(df1.set_index('code')['print'])
plt.plot(test['daycode'], test['val'], '.')
plt.title('test of meter 1048')
plt.xlabel('daycode')
plt.ylabel('energy consumption [kWh]')
plt.show()
Not all units(thousands) have been observed at full length but 730 x 48 is a large combination to lay out on excel by hand. Tbh, not an elegant solution but I tried by dragging - it doesn't quite get it.
If I could read the first 3 digits of the column values and match with another file's column, 2 last digits with another column, then combine.. is there a way?
For what you describe in your last two lines, you can do something like this:
df['first_3_digits'] = df['col1'].map(lambda x: str(x)[:3])
df['last_2_digits'] = df['col1'].map(lambda x: str(x)[-2:])
Then, to join the two dataframes (the key columns in df2 must also hold strings for the keys to match):
df3 = df.merge(df2,left_on=['first_3_digits','last_2_digits'],right_on=['col1_df2','col2_df2'],how='left')
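If the lookup file is only needed to turn the code into a timestamp, the split can also feed a direct calculation. The sketch below is standard-library Python and rests on two assumptions that the question does not confirm: day 1 is 2009-01-01, and slot 1 is the half hour starting at 00:00. The name `decode_daycode` is made up for illustration.

```python
from datetime import datetime, timedelta

# Assumption: day 1 of the ddd counter is 2009-01-01.
DAY_ONE = datetime(2009, 1, 1)

def decode_daycode(code):
    """Split a 5-digit daycode into day and half-hour slot, then build a timestamp."""
    text = str(code).zfill(5)   # pad so that a day below 100 still gives three digits
    day = int(text[:3])         # first three digits: day counter, starting at 1
    slot = int(text[3:])        # last two digits: half-hour slot, starting at 1
    # Assumption: slot 1 starts at 00:00, slot 2 at 00:30, and so on.
    return DAY_ONE + timedelta(days=day - 1, minutes=30 * (slot - 1))

print(decode_daycode(63317))
```

A function like this can be applied to the daycode column with `.map`, in the same way as the lambdas above.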

Gnuplot: How to load and display single numeric value from data file

My data file has this content
# data file for use with gnuplot
# Report 001
# Data as of Tuesday 03-Sep-2013
total 1976
case1 522 278 146 65 26 7
case2 120 105 15 0 0 0
case3 660 288 202 106 63 1
I am making a histogram from the case... lines using the script below - and that works. My question is: how can I load the grand total value 1976 (next to the word 'total') from the data file and either (a) store it into a variable or (b) use it directly in the title of the plot?
This is my gnuplot script:
reset
set term png truecolor
set terminal pngcairo size 1024,768 enhanced font 'Segoe UI,10'
set output "output.png"
set style fill solid 1.00
set style histogram rowstacked
set style data histograms
set xlabel "Case"
set ylabel "Frequency"
set boxwidth 0.8
plot for [i=3:7] 'mydata.dat' every ::1 using i:xticlabels(1) with histogram \
notitle, '' every ::1 using 0:2:2 \
with labels \
title "My Title"
For the benefit of others trying to label histograms, in my data file, the column after the case label represents the total of the rest of the values on that row. Those total numbers are displayed at the top of each histogram bar. For example for case1, 522 is the total of (278 + 146 + 65 + 26 + 7).
I want to display the grand total somewhere on my chart, say as the second line of the title or in a label. I can get a variable into sprintf into the title, but I have not figured out syntax to load a "cell" value ("cell" meaning row column intersection) into a variable.
Alternatively, if someone can tell me how to use the sum function to total up 522+120+660 (read from the data file, not as constants!) and store that total in a variable, that would obviate the need to have the grand total in the data file, and that would also make me very happy.
Many thanks.
Let's start with extracting a single cell at (row,col). If it is a single value, you can use the stats command to extract it. The row and col are specified with every and using, as in a plot command. In your case, to extract the total value, use:
# extract the 'total' cell
stats 'mydata.dat' every ::::0 using 2 nooutput
total = int(STATS_min)
To sum up all values in the second column, use:
stats 'mydata.dat' every ::1 using 2 nooutput
total2 = int(STATS_sum)
And finally, to sum up all values in columns 3:7 in all rows (i.e. the same as the previous command, but without using the saved totals), use:
# sum all values from columns 3:7 from all rows
stats 'mydata.dat' every ::1 using (sum[i=3:7] column(i)) nooutput
total3 = int(STATS_sum)
These commands require gnuplot 4.6 or later.
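To make it easier to check which cells each stats call picks up, here is a rough Python mirror of the three selections, run on the data from the question. It is only an illustration of the row and column selection, not of how gnuplot works internally, and the variable names simply follow the ones used above.

```python
DATA = """\
# data file for use with gnuplot
# Report 001
# Data as of Tuesday 03-Sep-2013
total 1976
case1 522 278 146 65 26 7
case2 120 105 15 0 0 0
case3 660 288 202 106 63 1
"""

# Keep only data rows; lines starting with '#' are comments and are skipped.
rows = [line.split() for line in DATA.splitlines()
        if line.strip() and not line.startswith("#")]

# First data row, column 2 (the 'total' cell).
total = int(rows[0][1])
# Column 2 of every row after the first.
total2 = sum(int(r[1]) for r in rows[1:])
# Columns 3 to 7 of every row after the first.
total3 = sum(int(v) for r in rows[1:] for v in r[2:7])

print(total, total2, total3)
```

The second and third values agree with each other because column 2 of each case row already holds the sum of columns 3 to 7.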
So, your plotting script could look like the following:
reset
set terminal pngcairo size 1024,768 enhanced
set output "output.png"
set style fill solid 1.00
set style histogram rowstacked
set style data histograms
set xlabel "Case"
set ylabel "Frequency"
set boxwidth 0.8
# extract the 'total' cell
stats 'mydata.dat' every ::::0 using 2 nooutput
total = int(STATS_min)
plot for [i=3:7] 'mydata.dat' every ::1 using i:xtic(1) notitle, \
'' every ::1 using 0:(s = sum [i=3:7] column(i), s):(sprintf('%d', s)) \
with labels offset 0,1 title sprintf('total %d', total)
which gives the following output:
For Linux and similar systems:
If you don't know the row number where your data is located, but you know it is in the n-th column of a row where the value of the m-th column is x, you can define a function
get_data(m,x,n,filename)=system('awk "\$'.m.'==\"'.x.'\"{print \$'.n.'}" '.filename)
and then use it, for example, as
y = get_data(1,"case2",4,"datafile.txt")
With the data provided by user424855,
print y
should print 15.
It's not clear to me where your "grand total" of 1976 comes from. If I calculate 522+120+660 I get 1302 not 1976.
Anyway, here is a solution which works even without stats and sum which were not available in gnuplot 4.4.0.
In the data you don't necessarily need the "grand total" or the sum of each row, because gnuplot can calculate this for you. This is done by (not) plotting the file as a matrix, and at the same time summing up the rows in the string variable S0 and the total sum in variable Total. There will be a warning, "warning: matrix contains missing or undefined values", which you can ignore. The labels are added by plotting '+' ... with labels, extracting the desired values from the S0 string.
Data: SO18583180.dat
So, the reduced input data looks like this:
# data file for use with gnuplot
# Report 001
# Data as of Tuesday 03-Sep-2013
case1 278 146 65 26 7
case2 105 15 0 0 0
case3 288 202 106 63 1
Script: (works for gnuplot>=4.4.0, March 2010 and gnuplot 5.x)
### histogram with sums and total sum
reset
FILE = "SO18583180.dat"
set style histogram rowstacked
set style data histograms
set style fill solid 0.8
set xlabel "Case"
set ylabel "Frequency"
set boxwidth 0.8
set key top left noautotitle
set grid y
set xrange [0:2]
set offsets 0.5,0.5,0,0
Total = 0
S0 = ''
addSums(v) = S0.sprintf(" %g",(M=$2,(N=$1+1)==1?S1=0:0,S1=S1+v))
plot for [i=2:6] FILE u i:xtic(1) notitle, \
'' matrix u (S0=addSums($3),Total=Total+$3,NaN) w p, \
'+' u 0:(real(S2=word(S0,int($0*N+N)))):(S2) every ::::M w labels offset 0,0.7 title sprintf("Total: %g",Total)
### end of script
Result: (created with gnuplot 4.4.0, Windows terminal)

How to Resize using Lanczos

I can easily calculate the values for sinc(x) curve used in Lanczos, and I have read the previous explanations about Lanczos resize, but being new to this area I do not understand how to actually apply them.
"To resample with lanczos imagine you overlay the output and input over each other, with points signifying where the pixel locations are. For each output pixel location you take a box +- 3 output pixels from that point. For every input pixel that lies in that box, calculate the value of the lanczos function at that location with the distance from the output location in output pixel coordinates as the parameter. You then need to normalize the calculated values by scaling them so that they add up to 1. After that multiply each input pixel value with the corresponding scaling value and add the results together to get the value of the output pixel."
For example, what does "overlay the input and output" actually mean in programming terms?
In the equation given
lanczos(x) = {
0 if abs(x) > 3,
1 if x == 0,
else sin(x*pi)/x
}
what is x?
As a simple example, suppose I have an input image with 14 values (i.e. in addresses In0-In13):
20 25 30 35 40 45 50 45 40 35 30 25 20 15
and I want to scale this up by 2, i.e. to an image with 28 values (i.e. in addresses Out0-Out27).
Clearly, the value in address Out13 is going to be similar to the value in address In7, but which values do I actually multiply to calculate the correct value for Out13?
What is x in the algorithm?
If the values in your input data are at t coordinates [0 1 2 3 ...], then your output (which is scaled up by 2) has t coordinates [0 .5 1 1.5 2 2.5 3 ...]. So to get the first output value, you center your filter at 0 and multiply by all of the input values. To get the second output, you center your filter at 1/2 and multiply by all of the input values, and so on. In each case, x is the distance from the filter center to the coordinate of the input value being multiplied.
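As a rough illustration of that procedure for the 14-value example, here is a minimal 1-D upsampling sketch in standard-library Python. It makes several assumptions: it uses the usual Lanczos kernel sinc(x)·sinc(x/a) with a = 3 (which is not the same as the sin(x*pi)/x expression quoted in the question), it measures x in input-sample units (suitable for upsampling only), and it handles the edges by dropping out-of-range samples and renormalizing the remaining weights. The names `lanczos_kernel` and `resample_1d` are made up for this sketch.

```python
import math

def lanczos_kernel(x, a=3):
    """Usual Lanczos window: sinc(x) * sinc(x / a) for |x| < a, otherwise 0."""
    if x == 0:
        return 1.0
    if abs(x) >= a:
        return 0.0
    px = math.pi * x
    return a * math.sin(px) * math.sin(px / a) / (px * px)

def resample_1d(samples, scale, a=3):
    """Upsample a 1-D signal by `scale`; output i sits at input coordinate i / scale."""
    n_out = int(len(samples) * scale)
    out = []
    for i in range(n_out):
        t = i / scale                    # filter center, in input coordinates
        first = math.floor(t) - a + 1    # leftmost input index inside the window
        last = math.floor(t) + a         # rightmost input index inside the window
        total = 0.0
        weight_sum = 0.0
        for j in range(first, last + 1):
            if 0 <= j < len(samples):    # skip taps that fall outside the data
                w = lanczos_kernel(t - j, a)   # x = distance from center to input j
                total += w * samples[j]
                weight_sum += w
        out.append(total / weight_sum)   # normalize so the weights add up to 1
    return out

samples = [20, 25, 30, 35, 40, 45, 50, 45, 40, 35, 30, 25, 20, 15]
out = resample_1d(samples, 2)
print(len(out), round(out[13], 3))
```

With this coordinate choice, Out13 is centered at t = 6.5, so it is built from In4 through In9, with In6 and In7 carrying most of the weight. Even-numbered outputs land exactly on an input sample and reproduce it up to rounding error.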