Loading multiple GTFS.zip format files in r5r - r5r

I am trying to analyze trips in a long time period in r5r that requires more than one GTFS files. I am using a for loop since I want to study trips in various depature dates in the Excel file. Right now, I have placed all three GTFS.zip files with different names in the data path together, but I could only receive mode information by public transportation within one date range, while trips in the other two dates produced walk time only. Is there a way to let r5r include all of them?
options(java.parameters = "-Xmx16G")
library(r5r)
library(sf)
library(data.table)
File_Path = file.path("C:","Research", "Data Sep", fsep = .Platform$file.sep)
list.files(File_Path)
poi <- fread(file.path(File_Path, "OriginsDestinationsPugetSound.csv"))
r5r_core <- setup_r5(data_path = File_Path, verbose = TRUE)
mode <- c("WALK","TRANSIT")
max_walk_dist <- 1000 # in meters
max_trip_duration <- 300 # in minutes
LengthOfFile = length(poi[[1]])
ListOfDetailedItineries = (matrix(ncol = 15,nrow = 0))
start = 1
end = 25
for (i in start:end) {
OriginPoint = poi[i,2:4]
DestinationPoint = poi[i,5:7]
Time_of_Trip = poi[i,9]
departure_datetime = as.POSIXct(Time_of_Trip[[1]], format = "%m/%d/%Y %H:%M")
dit <- detailed_itineraries(r5r_core = r5r_core,
origins = OriginPoint,
destinations = DestinationPoint,
mode = mode,
departure_datetime = departure_datetime,
max_walk_dist = max_walk_dist,
max_trip_duration = max_trip_duration,
shortest_path = TRUE,
verbose = TRUE)
ListOfDetailedItineries = rbind(ListOfDetailedItineries, as.matrix(dit))
cat('On iteration ',i,'\n',dit[[9]],"\n")
flush.console()
}
dit1 = as.data.frame(ListOfDetailedItineries)

As far as I understood, you have 3 feeds inside your data directory and you tried to generate travel time estimates for 3 different departure times, but only one of them return transit public transport trips. Is that correct?
If that's the case, you have to make sure that the public transit services listed in your GTFS feeds run on your specified departure times. This information is usually listed in the calendar table, but can also be listed in the calendar_dates table in some feeds.
The best practice here would be to choose dates that fall inside the service intervals of all 3 of your feeds. Alternatively, you can edit the start_date/end_date columns of their calendar table to include the days you have already chosen.

Related

Is there any way to convert a portfolio class from portfolio analytics into a data frame

I'm trying to find the optimal weights for an especific target return using Portfolio Analytics library and ROI optimization; However, even that I know that that target return should be feasable and should be part of the efficient frontier, the ROI optimization does not find any solution.
The code that I'm using is the following:
for(i in 0:n){
target=minret+(i)*Del
p <- portfolio.spec(assets = colnames(t_EROAS)) #Specification of asset classes
p <- add.constraint(p, type = "full_investment") #An investment that has to sum 1
p <- add.constraint(portfolio=p, type="box", min=0, max=1) #No short position long-only
p <- add.constraint(p,
type="group",
groups=group_list,
group_min=VCONSMIN[,1],
group_max=VCONSMAX[,1])
p <- add.constraint(p, type = "return", name = "mean", return_target = target)
p <- add.objective(p, type="risk", name="var")
eff.opt <- optimize.portfolio(t_EROAS, p, optimize_method = "ROI",trace=TRUE)}
n=30 but is just finding 27 portfolios and the efficient frontier that I'm creating is looking empty from portfolio 27 to portfolio 30, the 28 and 29 seems to not have a solution but I'm not sure that this is correct.
What I want to have is an efficient frontier on a data frame format with a fixed number of portfolios, and it seems that the only way to achive this is by this method. Any help or any ideas that could help?

Workaround Google Sheets API does not accept range request without specifying desired final line

My spreadsheet has values in this model:
And I need to create a list to use in Python, including the empty fields that exist between values:
CLIENT_SECRET_FILE = 'client_secrets.json'
API_NAME = 'sheets'
API_VERSION = 'v4'
SCOPES = ['https://www.googleapis.com/auth/spreadsheets']
service = Create_Service(CLIENT_SECRET_FILE, API_NAME, API_VERSION, SCOPES)
spreadsheet_id = sheet_id
get_page_id = 'Winning_Margin'
range_score = 'O1:O10000'
spreadsheets_match_score = []
range_names2 = get_page_id + '!' + range_score
result2 = service.spreadsheets().values().get(
spreadsheetId=spreadsheet_id, range=range_names2, valueRenderOption='UNFORMATTED_VALUE').execute()
sheet_output_data2 = result2["values"]
for i, eventao2 in enumerate(sheet_output_data2):
try:
spreadsheets_match_score.append(sheet_output_data2[i][0])
except:
spreadsheets_match_score.append('')
In this case, this list (spreadsheets_match_score = []) would result in:
["0-0","0-0","4-0","0-1","6-0","","","","0-3","2-2","","","","","0-1","","","3-0","1-1","3-1","","","",""]
My spreadsheet currently has 24 rows, but it will grow without a fixed ending value.
So, I tried to use the range without putting the value of the last line (range_score = 'O1:O'), but it doesn't accept, the range needs to specify the final line (range_score = 'O1:O10000').
I put 10000 exactly so I don't have to change, but this is very wrong to do, because it does a search for a non-existent range, I'm very afraid that in the future it will generate an error.
Is there any method so that I can not need to specify the last row of the worksheet?
To be something like:
range_score = 'O1:O'
The problem is not in the range specification method for data collection, can use either range_score = 'O1:O' or range_score = 'O1:O100000000000' if looking for all the column rows.
In the case of the question, the problem was when line 1 of the desired column has no values, being null, the request failed but because of the empty ["values"] return.
In short, I was looking for the error in the wrong place.

combining CSV files from Covid-data

I want to combine the CSV files from the Johns Hopkins Covid Data (e.g. https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/05-10-2020.csv & https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/01-23-2020.csv).
I already managed to load the files into a DataFrame as well as sanitizing the header (_ vs. / in some names). Now I want to pick one column (e.g. Confirmed), rename it to the day of the file and then combine those CSV files to get a progress over time.
This merge needs to be done by state_province. In both frames, the key may not be present. How can I do this? I experimented with rightjoin and outerjoin, but didn't have any success. Can someone point me the right way please?
I initially didn't want to share the code that I have so far because I didn't want to guide to a specific solution - but here it is. It is copied together from several Jupyter cells.
using Dates
start = Dates.Date(2020,1,22) #begin of recording
now = Dates.Date(Dates.now())- Dates.Day(1) #today
date_range = collect(start:Dates.Day(1):now) #create a date range with 1 element per day
prefix = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/"
suffix = ".csv"
function create_url(date)
return prefix * Dates.format(date, "mm-dd-YYYY") * suffix
end
function cleanup_column_names(name)
if name == "Country/Region" || name == "Country_Region"
return "country"
elseif name == "Province/State" || name == "Province_State"
return "state"
else
return name
end
end
using CSV
using HTTP
using DataFrames
selected_data = "Confirmed"
date = date_range[1]
data = DataFrame(CSV.File(HTTP.get(create_url(date)).body))
DataFrames.rename!(cleanup_column_names, data)
DataFrames.select!(data,["state", "country", selected_data])
DataFrames.rename!(data, 3 => Dates.format(date, "YYYY-mm-dd"))
Regards
Tobias
I am relatively new to Julia, so take my answer with a bit of scepticism:
First, we wrap the DataFrame creation into a function:
function prepare_date_df(date)
data = DataFrame(CSV.File(HTTP.get(create_url(date)).body))
DataFrames.rename!(cleanup_column_names, data)
DataFrames.select!(data,["state", "country", selected_data])
DataFrames.rename!(data, 3 => Dates.format(date, "YYYY-mm-dd"))
return data
end
Let's create our first Dataframe:
df = prepare_date_df(date_range[1])
Now, let's iterate over all the other dates, create a dataframe for each date and merge this with our first dataframe:
for date in date_range[2:end]
df_new = prepare_date_df(date)
df = outerjoin(df, df_new, on = [:state, :country])
end
This works fine for the first two months, but with the growing Dataframes, it suddenly gets very slow (and even hangs?). So I would be very interested in a more performative answer!

How do I represent this data using Redis?

I want to be able to store data such as "store x is open between 9am and 5pm on Monday but it's only open during 9am and 12pm on Saturday"
What's the best way to store this using redis?
I would later like to query it using something like this. Show me all stores that are open on Saturday at 10:30am
In Redis, like most if not all other NoSQL databases, you want to store your data in the manner that's most suitable for answering the query. There are quite a few ways you can represent this data and answer the query, choosing between them requires knowledge about the other access patterns that you need to support.
However, in the context of this specific question alone, the simplest way of doing that IMO is to use two Sorted Sets per for each day of the week. Assuming that stores are open continuously and at most once each day (i.e. no siestas), the members of these Sorted Sets should be the store ids and the scores their opening hours - the first Sort Set's scores will denote the time that the store opens whereas the second's the time it closes. For example:
ZADD monday:open 9 store:x
ZADD monday:close 17 store:x
ZADD saturday:open 9 store:x
ZADD saturday:close 12 store:x
Once you have all the Sorted Sets in place, answering the query requires two calls to ZRANGEBYSCORE and intersecting the results. The snippet below demonstrates how to do it using Lua since doing using server scripts will be more efficient than moving the entire thing to the client in most cases.
Note: an alternative approach to doing the intersect in Lua is actually storing the temporary results in Redis' Sets and calling SINTER.
-- helper function to make a "set" out of a table
local function makeset(t)
local r = {}
for _, v in ipairs(t) do r[v] = true end
return(r)
end
-- get opening and closing hours for a given day
local ot = redis.call('ZRANGEBYSCORE', KEYS[1], '-inf', ARGV[1])
local ct = redis.call('ZRANGEBYSCORE', KEYS[2], '(' .. ARGV[1], '+inf')
-- convert to sets and choose the smaller set as s1
local s1 = {}
local s2 = {}
if #ot < #ct then
s1 = makeset(ot)
s2 = makeset(ct)
else
s1 = makeset(ct)
s2 = makeset(ot)
end
-- intersect s1 and s2
local t = {}
for k in pairs(s1) do
t[k] = s2[k]
end
-- prepare a response table
local r = {}
for k in pairs(t) do
r[#r+1] = k
end
return(r)
Run this script by passing to it the two keys and the hour, like so:
redis-cli --eval storehours.lua saturday:open saturday:close , 10.5

Error Handling with Live Data Matlab

I am using a live data API that is returning the next arriving trains. I plan on giving the user the next 5 trains arriving. If there are less than 5 trains arriving, how you handle that? I am having trouble thinking of a way, I was thinking a way with if statements but don't know how I would set them up.
time1Depart = dataReturnedFromLiveAPI{1,1}.orig_departure_time;
time2Depart = dataReturnedFromLiveAPI{1,2}.orig_departure_time;
time3Depart = dataReturnedFromLiveAPI{1,3}.orig_departure_time;
time4Depart = dataReturnedFromLiveAPI{1,4}.orig_departure_time;
time5Depart = dataReturnedFromLiveAPI{1,5}.orig_departure_time;
time1Arrival = dataReturnedFromLiveAPI{1,1}.orig_arrival_time;
time2Arrival = dataReturnedFromLiveAPI{1,2}.orig_arrival_time;
time3Arrival = dataReturnedFromLiveAPI{1,3}.orig_arrival_time;
time4Arrival = dataReturnedFromLiveAPI{1,4}.orig_arrival_time;
time5Arrival = dataReturnedFromLiveAPI{1,5}.orig_arrival_time;
The code right now uses a matrix that goes from 1:numoftrains but I am using just the first five.
It's a bad practice to assign individual value to a separate variable. Better if you pass all related values to a vector or cell array depending on class of orig_departure_time and orig_arrival_time.
It looks like dataReturnedFromLiveAPI is a cell array of structures. Then you can do:
timeDepart = cellfun(#(x), x.orig_departure_time, ...
dataReturnedFromLiveAPI(1,1:min(5,size(dataReturnedFromLiveAPI,2))), ...
'UniformOutput',0 );
timeArrival = cellfun(#(x), x.orig_arrival_time, ...
dataReturnedFromLiveAPI(1,1:min(5,size(dataReturnedFromLiveAPI,2))), ...
'UniformOutput',0 );
Then you how to access the values one by one as
time1Depart = timeDepart{1};
If orig_departure_time and orig_arrival_time are numeric scalars, you can use ...'UniformOutput',1.... You will get output as a vector and can get single values with timeDepart(1).