How to create a GTFS-RT feed from a bogus AVL feed in CSV format

I am trying to produce a GTFS Realtime feed from AVL data that a transit agency is providing. The problem is that there doesn't seem to be a link between the official GTFS data and this AVL data set.
The vehicle position and 'lateness' data is available at a URL (/gps_full.txt) and is updated every 5-7 seconds. The format of the data is as follows:
TransportType,RouteShortName,TransitId,VehicleNumber,Longitude,Latitude,Speed,Azimuth,TripStartTime,DeviationInSeconds,MeasurementTime,VehicleType,
Bus,20,9790770943,7031,25206880,54644738,0,232,583,0,39179,KZ,
Trolleybus,6,9733751022,1681,25279878,54687890,0,18,622,93,39175,KZ,
The only primary key I could think of is a composite key (RouteShortName, TripStartTime), but I am not sure whether that would cause collisions.
Is there any better way of doing this? Maybe someone had the same issue with data from other transit agencies and could point me to some resources?
I did try to search for resources myself but it seems that almost everyone had some sort of a link between the AVL feeds and static GTFS data.
All help will be greatly appreciated.

If there really are no identifiers that allow you to link to the GTFS, the best you can do is guess which route/trip each vehicle is on.
TheTransitClock (an open-source version of Transitime that has continued to be updated and maintained) is a project that attempts to do this: https://github.com/TheTransitClock/transitime. I have never used it myself.
However, if you know the route and the trip start time, you might be able to identify the scheduled trip (e.g., if the trip started at 09:00, look in the GTFS schedule for trips on that route starting at 09:00). The logic for this could be quite complex (and essentially probabilistic); it is a simplified version of what TheTransitClock does. I see that your data contains a TripStartTime field, although I don't understand the units.
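To make that matching step concrete, here is a rough sketch against the static GTFS files (routes.txt, trips.txt, stop_times.txt). It assumes, without confirmation, that TripStartTime is minutes after midnight (583 would then be 09:43), and it ignores calendar.txt/service-day filtering, which you would need in practice.

# Rough sketch: match (RouteShortName, TripStartTime) from the AVL feed to a GTFS trip_id.
# Assumptions (not confirmed by the data): TripStartTime is minutes after midnight,
# and the static GTFS lives in ./gtfs/. Service-day (calendar.txt) filtering is omitted.
import csv
from collections import defaultdict

def load_gtfs_trip_index(gtfs_dir="gtfs"):
    # route_short_name -> route_id
    route_ids = {}
    with open(f"{gtfs_dir}/routes.txt", newline="", encoding="utf-8-sig") as f:
        for row in csv.DictReader(f):
            route_ids[row["route_short_name"]] = row["route_id"]

    # trip_id -> route_id
    trip_route = {}
    with open(f"{gtfs_dir}/trips.txt", newline="", encoding="utf-8-sig") as f:
        for row in csv.DictReader(f):
            trip_route[row["trip_id"]] = row["route_id"]

    # trip_id -> departure_time of its first stop (the scheduled trip start)
    first_departure = {}
    with open(f"{gtfs_dir}/stop_times.txt", newline="", encoding="utf-8-sig") as f:
        for row in csv.DictReader(f):
            seq = int(row["stop_sequence"])
            trip_id = row["trip_id"]
            if trip_id not in first_departure or seq < first_departure[trip_id][0]:
                first_departure[trip_id] = (seq, row["departure_time"])

    # (route_id, "HH:MM:SS") -> [trip_id, ...]
    index = defaultdict(list)
    for trip_id, (_, dep) in first_departure.items():
        index[(trip_route[trip_id], dep)].append(trip_id)
    return route_ids, index

def match_trip(route_short_name, trip_start_minutes, route_ids, index):
    start = "%02d:%02d:00" % divmod(int(trip_start_minutes), 60)
    candidates = index.get((route_ids.get(route_short_name), start), [])
    # More than one candidate means you still need calendar.txt / direction logic.
    return candidates[0] if len(candidates) == 1 else None
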
Maybe the easiest approach would be to get in touch with the transit agency and ask if they could add the GTFS trip identifiers to the AVL data.
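Either way, once you have (or have guessed) a trip_id, producing the VehiclePosition feed itself is mechanical with the official gtfs-realtime-bindings package. Below is a minimal sketch; the 1e-6 coordinate scaling is only a guess based on your sample rows, and the function names are just for illustration.

# Minimal sketch of emitting a GTFS-Realtime VehiclePosition feed from AVL rows,
# using the official gtfs-realtime-bindings package (pip install gtfs-realtime-bindings).
import time
from google.transit import gtfs_realtime_pb2

def build_feed(avl_rows, trip_lookup):
    feed = gtfs_realtime_pb2.FeedMessage()
    feed.header.gtfs_realtime_version = "2.0"
    feed.header.timestamp = int(time.time())
    for row in avl_rows:  # each row: a dict keyed by the CSV column names
        entity = feed.entity.add()
        entity.id = row["VehicleNumber"]
        vp = entity.vehicle
        vp.vehicle.id = row["VehicleNumber"]
        vp.position.latitude = int(row["Latitude"]) / 1e6    # 54644738 -> 54.644738 (assumed)
        vp.position.longitude = int(row["Longitude"]) / 1e6  # 25206880 -> 25.206880 (assumed)
        vp.position.bearing = float(row["Azimuth"])
        trip_id = trip_lookup(row)  # e.g. the matcher sketched above; may return None
        if trip_id:
            vp.trip.trip_id = trip_id
        # If no trip could be matched, you could still set vp.trip.route_id,
        # but it must be the GTFS route_id, not the AVL RouteShortName.
    return feed.SerializeToString()
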

Related

YouTrack - Historical issue snapshots

The new YouTrack API is missing the old Issue history /rest/issue/{issue}/history endpoint, which our code heavily depends on. There is only the Issue activities /api/issues/{issueID}/activities endpoint, which returns only the deltas between changes, drawn from a never-ending list of diff/activity categories.
Is there some simple way to get a list of an issue's historical snapshots, or do I actually have to parse all these activity categories and somehow merge them together to reimplement this whole thing myself?
The /history endpoint didn't provide history snapshots either, and /activities does indeed output much more data. Still, that's the way to do it: traverse the activity data and build snapshots based on the provided timestamps.
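For what it's worth, the fold itself is not much code once you decide which categories you care about. A rough sketch follows; the exact item fields and category names should be checked against your YouTrack version's API docs, and the instance URL and token are placeholders.

# Sketch: rebuild issue snapshots by folding activity items in timestamp order.
import requests

BASE = "https://youtrack.example.com"   # hypothetical instance
TOKEN = "perm:..."                      # permanent token

def fetch_activities(issue_id):
    resp = requests.get(
        f"{BASE}/api/issues/{issue_id}/activities",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={
            "categories": "CustomFieldCategory",
            "fields": "timestamp,field(name),added(name),removed(name)",
        },
    )
    resp.raise_for_status()
    return resp.json()

def snapshots(issue_id, initial_state=None):
    """Yield (timestamp, state) pairs, one per change, oldest first."""
    state = dict(initial_state or {})
    for item in sorted(fetch_activities(issue_id), key=lambda a: a["timestamp"]):
        name = (item.get("field") or {}).get("name")
        if name is None:
            continue
        added = item.get("added") or []
        state[name] = [v.get("name") for v in added] if isinstance(added, list) else added
        yield item["timestamp"], dict(state)
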

What is the best way to structure this database?

So I am in the process of building a database from my client's data. Each month they create roughly 25 CSVs, which are unique in their topic and attributes, but they all have one thing in common: a registration number.
The registration number is the only common variable across all of these CSVs.
My task is to move all of this into a database, for which I am leaning towards Postgres (if anyone believes NoSQL would be best for this then please shout out!).
The big problem: structuring this within a database. Should I create one table per month that houses all the data, with column 1 being the registration and columns 2-200 being the attributes? Or should I put all the CSVs into Postgres as they are, and then join them later?
I'm struggling to get my head around how to structure this when there will be monthly updates to every registration, and we don't want to destroy historical data; we want to keep it for future benchmarks.
I hope this makes sense - I welcome all suggestions!
Thank you.
Parts of your question are too broad or ask for an opinion (SQL vs. NoSQL).
However, the gist of the question is whether you should load your data one month at a time or into a well-developed data model. Definitely the latter.
My recommendation is the following.
First, design the data model around how the data needs to be stored in the database, rather than how it is being provided. There may be one table per CSV file. I would be a bit surprised, though. Data often wants to be restructured.
Second, design the archive framework for the CSV files.
You should archive all the incoming files in a nice directory structure with files from each month. This structure should be able to accommodate multiple uploads per month, either for all the files or some of them. Mistakes happen and you want to be sure the input data is available.
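A small sketch of that archive step, with hypothetical paths and naming:

# Keep every incoming file, organized by month, and never overwrite earlier uploads.
import shutil
from datetime import datetime, timezone
from pathlib import Path

def archive_incoming(csv_path, archive_root="archive"):
    now = datetime.now(timezone.utc)
    dest_dir = Path(archive_root) / f"{now:%Y}" / f"{now:%m}"
    dest_dir.mkdir(parents=True, exist_ok=True)
    # The timestamp suffix allows multiple uploads of the same file within a month.
    dest = dest_dir / f"{Path(csv_path).stem}_{now:%Y%m%dT%H%M%S}.csv"
    shutil.copy2(csv_path, dest)
    return dest
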
Third, COPY (this is the Postgres command) the data into staging tables. This is the beginning of the monthly process.
Fourth, process the data, including validation checks, to load it into your data model.
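To make steps three and four concrete, here is a sketch of a monthly load with hypothetical table and column names (staging_vehicle_costs, registrations, vehicle_costs):

# COPY one CSV into a staging table, run a simple validation check, then append
# into the real data model, stamped with the load month to preserve history.
import psycopg2

def load_month(csv_path, conn):
    with conn, conn.cursor() as cur:
        cur.execute("TRUNCATE staging_vehicle_costs")
        with open(csv_path, newline="") as f:
            cur.copy_expert(
                "COPY staging_vehicle_costs FROM STDIN WITH (FORMAT csv, HEADER true)", f
            )
        # Validation: every row must reference a known registration number.
        cur.execute("""
            SELECT count(*) FROM staging_vehicle_costs s
            LEFT JOIN registrations r ON r.registration_no = s.registration_no
            WHERE r.registration_no IS NULL
        """)
        missing = cur.fetchone()[0]
        if missing:
            raise ValueError(f"{missing} rows reference unknown registrations")
        cur.execute("""
            INSERT INTO vehicle_costs (registration_no, cost, load_month)
            SELECT registration_no, cost, date_trunc('month', now())
            FROM staging_vehicle_costs
        """)

# conn = psycopg2.connect("dbname=fleet")   # then: load_month(archived_file, conn)
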
There may be tweaks to this process, based on questions such as:
Does the data need to be available 24/7 even during the upload process?
Does a validation failure in one part of the data prevent uploading any data?
Are SQL checks (referential integrity and CHECK constraints) sufficient for validating the data?
Do you need to be able to "rollback" the system to any particular update?
These are just questions that can guide your implementation. They are not intended to be answered here.

Amazon data warehouse architecture and design

I have some flight booking data in SQL Server tables, with bookings for passengers.
The query below shows all the tables involved, along with the joins:
SELECT DISTINCT *
FROM Booking B
JOIN BookingPassenger BP
    ON B.BookingId = BP.BookingId
JOIN PassengerJourneyLeg PJL
    ON PJL.PassengerId = BP.PassengerId
JOIN InventoryLeg IL
    ON IL.InventoryLegId = PJL.InventoryLegId
JOIN PassengerJourneySegment PS
    ON PS.PassengerId = BP.PassengerId
WHERE IL.DepartureDate = '2014/03/26'
    AND IL.FlightNumber = 123
    AND B.CreatedDate < '2014/03/22'
Now the Revenue department needs this data to be put into a data warehouse so that they can compute a booking curve for each flight on any day, for all flights on any day, or within specific date ranges. Currently they are doing this via Excel, which fetches the data through SQL, but it is very time consuming and does not give real-time data. Later they want to gather data from our corporate booking website and manage customer profiles in this data warehouse, which will be our main analytical platform. I am new to data warehousing and am learning and researching how to implement an effective data warehouse to meet their needs.
Can someone help me with how I should collect the data? Should I upload it into DynamoDB or S3, and what is the best way to do that both as a one-time job and as a recurring job?
The later aim of this data warehouse will be to plot all information related to the PNR: flight revenue by day, by class, by subclass, by event, etc.
In a later phase, every time a user interacts with our website, I want to store that in Redshift. So when should I write the files to S3 or DynamoDB, and how many? For example, if I write a file to S3 on each user event, I will end up with hundreds of files, which does not seem like a good solution. What about introducing RDS or DynamoDB to store each transaction? Or is it possible to have server log files capture the user interactions on the website, and record the events (booking, cancel, etc.) in RDS or DynamoDB?
What are the best practices? What would be the best design in my specific scenario? Could someone also clarify how that can be implemented?
What are the best practices for getting reports over 1-5 TB of data to come back in a few minutes or seconds, while avoiding duplication and latency?
Also, can someone suggest how to keep maintenance easy and costs reasonable, on par with the best solutions?
I will really appreciate any help, links, or suggestions on data warehousing and Amazon technologies (Redshift, S3, DynamoDB) specific to my requirements.
There are a lot of questions here, and I suspect some of them have been answered in the time that has elapsed. Anyway, let me explain my thoughts about this.
Later they want to gather data from our corporate booking website and manage customer profiles in this data warehouse, which will be our main analytical platform
Establishing a staging-area database is a good idea: a "database for drafts". You can create simple tables to deal with this data.
Can someone help me with how I should collect the data? Should I upload it into DynamoDB or S3, and what is the best way to do that both as a one-time job and as a recurring job?
A good path is to use an ETL tool to collect the data. I like Pentaho CE and its PDI (Pentaho Data Integration).
In a later phase, every time a user interacts with our website, I want to store that in Redshift. So when should I write the files to S3 or DynamoDB, and how many? For example, if I write a file to S3 on each user event, I will end up with hundreds of files, which does not seem like a good solution. What about introducing RDS or DynamoDB to store each transaction? Or is it possible to have server log files capture the user interactions on the website, and record the events (booking, cancel, etc.) in RDS or DynamoDB?
I prefer the last idea. Store your server logs and periodically copy them to the staging database, keeping as much log history in the staging area as you can for statistical purposes. In my opinion this is the most common practice.
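A sketch of that pipeline: batch the web events into one file per interval (which also avoids the hundreds-of-tiny-files problem you mention), push the batch to S3, and load it into a Redshift staging table with COPY. The bucket, table, role, and connection details below are placeholders.

# Batch events to S3, then COPY them into a Redshift staging table.
import json
import time
import boto3
import psycopg2

def ship_batch(events, bucket="example-weblogs"):
    key = f"weblogs/{time.strftime('%Y/%m/%d')}/events-{int(time.time())}.json"
    body = "\n".join(json.dumps(e) for e in events)   # one JSON object per line
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body.encode())
    return key

def load_into_redshift(key, bucket="example-weblogs"):
    conn = psycopg2.connect(host="example-cluster.redshift.amazonaws.com",
                            port=5439, dbname="dw", user="loader", password="...")
    with conn, conn.cursor() as cur:
        cur.execute(f"""
            COPY staging_web_events
            FROM 's3://{bucket}/{key}'
            IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
            FORMAT AS JSON 'auto'
        """)
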
What are the best practices? What would be the best design in my specific scenario? Could someone also clarify how that can be implemented?
Take a look at this answer in this discussion: https://stackoverflow.com/a/2015115/2249963 (I just didn't want to copy and paste it as if it were my own answer; I agree with it :) )
What are the best practices for getting reports over 1-5 TB of data to come back in a few minutes or seconds, while avoiding duplication and latency?
There are many ways to do this. Basically, duplication is used, and latency is the trade-off against performance. You really need to decide how fresh the data needs to be for analysis; the previous day, also called D-1, is usual.
Also, can someone suggest how to keep maintenance easy and costs reasonable, on par with the best solutions?
Do a good job designing the model and keep things simple, as with everything in IT.
I hope this helps.

Designing an audit trail for billing purposes

We're working on Rails apps with a pricing model similar to that of Amazon DynamoDB (i.e., flexibly provision what you'll need). For the sake of simplicity, let's say you can configure:
Number of users
Number of documents you're allowed to create
Requirements
Simply put, you pay based on what you configure. Our specific requirements are as follows:
You pay a monthly fee based on the maximum for the month. If you upgrade to 1,000 users on the 3rd and downgrade to 500 on the 10th, you pay for 1,000 users for the whole month.
You can upgrade anytime.
You can downgrade once a day.
(This may sound unfair at first glance, but we're allocating some serious resources here.)
I am looking for a way to design a datamodel that does what we want without getting in the way too much.
Things I've considered
Auditing gems
As far as I can see, I cannot use gems like simple_audit or paper_trail.
They store model changes serialized in the database. This is great for undo and versioning, but not for requirement #1, because you cannot get the changes within a date range and then find the MAX value (without calculating most of it in Ruby).
Home-made solution
I can imagine the following home-made solution: A database table that stores records like
(model, metric, value, time_of_change, user_who_made_the_change)
This makes it possible to:
have all changes in a single place
query the maximum within a date range (requirement #1)
query when the next change is allowed (requirement #3)
This table would be updated in an ActiveRecord callback (presumably after_save) that is wrapped in the transaction around save.
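To illustrate requirements #1 and #3 against such a table, here is a sketch using an in-memory SQLite database just to show the SQL (table and column names are hypothetical; account_id stands in for the "model" column, and the same statements work in Postgres with minor changes). Note that the billable maximum also has to consider the value carried over from before the month starts, not just changes made within the month.

# Change-log table plus the two queries the billing requirements need.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE provision_changes (
    account_id INTEGER, metric TEXT, value INTEGER,
    time_of_change TEXT, changed_by INTEGER)""")
db.executemany(
    "INSERT INTO provision_changes VALUES (?,?,?,?,?)",
    [(1, "users", 500, "2024-05-28", 7),   # value in force when the month starts
     (1, "users", 1000, "2024-06-03", 7),  # upgrade on the 3rd
     (1, "users", 500, "2024-06-10", 7)])  # downgrade on the 10th

# Requirement 1: billable maximum for June = max(value at month start, max set during June).
row = db.execute("""
    SELECT MAX(value) FROM (
        SELECT value FROM (
            SELECT value FROM provision_changes
            WHERE account_id = 1 AND metric = 'users'
                  AND time_of_change < '2024-06-01'
            ORDER BY time_of_change DESC LIMIT 1)
        UNION ALL
        SELECT value FROM provision_changes
        WHERE account_id = 1 AND metric = 'users'
              AND time_of_change >= '2024-06-01' AND time_of_change < '2024-07-01')
""").fetchone()
print(row[0])   # 1000

# Requirement 3: a downgrade is allowed only if the last change is more than a day old
# (restrict to downgrades specifically if upgrades should not reset the clock).
last_change = db.execute("""
    SELECT MAX(time_of_change) FROM provision_changes
    WHERE account_id = 1 AND metric = 'users'
""").fetchone()[0]
print(last_change)   # 2024-06-10
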
I have concerns about the home-made solution because of NIH Syndrome, and maybe my concerns about the auditing gems are completely unsubstantiated.
Or, just maybe, I am overlooking an aspect or a whole other solution. What do you think?

How to manage multiple versions of the same record

I am doing short-term contract work for a company that is trying to implement a check-in/check-out type of workflow for their database records.
Here's how it should work...
A user creates a new entity within the application. There are about 20 related tables that will be populated in addition to the main entity table.
Once the entity is created the user will mark it as the master.
Another user can make changes to the master only by "checking out" the entity. Multiple users can checkout the entity at the same time.
Once the user has made all the necessary changes to the entity, they put it in a "needs approval" status.
After an authorized user reviews the entity, they can promote it to master which will put the original record in a tombstoned status.
The way they are currently accomplishing the "check out" is by duplicating the entity records in all the tables. The primary keys include EntityID + EntityDate, so they duplicate the entity records in all related tables with the same EntityID and an updated EntityDate and give it a status of "checked out". When the record is put into the next state (needs approval), the duplication occurs again. Eventually it will be promoted to master at which time the final record is marked as master and the original master is marked as dead.
This design seems hideous to me, but I understand why they've done it. When someone looks up an entity from within the application, they need to see all current versions of that entity. This was a very straightforward way for making that happen. But the fact that they are representing the same entity multiple times within the same table(s) doesn't sit well with me, nor does the fact that they are duplicating EVERY piece of data rather than only storing deltas.
I would be interested in hearing your reaction to the design, whether positive or negative.
I would also be grateful for any resources you can point me to that might be useful for seeing how someone else has implemented such a mechanism.
Thanks!
Darvis
I've worked on a system like this which supported the static data for trading at a very large bank. The static data in this case is things like the details of counterparties, standard settlement instructions, currencies (not FX rates) etc. Every entity in the database was versioned, and changing an entity involved creating a new version, changing that version and getting the version approved. They did not however let multiple people create versions at the same time.
This led to a horribly complex database, with every join having to take version and approval state into account. In fact, the software I wrote for them was middleware that abstracted this complex, versioned data into something that end-user applications could actually use.
The only thing that could have made it any worse was to store deltas instead of complete versioned objects. So the point of this answer is - don't try to implement deltas!
This looks like an example of a temporal database schema. Often, in cases like this, a distinction is made between an entity's key (EntityID, in your case) and the row's primary key in the database (in your case {EntityID, date}, but often a simple integer). You have to accept that the same entity is represented multiple times in the database, at different points in its history. Every database row still has a unique ID; it's just that your database is tracking versions, rather than entities.
You can manage data like that, and it can be very good at tracking changes to data, and providing accountability, if that is required, but it makes all of your queries quite a bit more complex.
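To illustrate that complexity, here is a sketch of the kind of "current version per entity" filtering every query ends up needing, using an in-memory SQLite table standing in for one of the versioned entity tables (table, column, and status names are hypothetical).

# Versioned entity rows and the query that picks the current master per entity.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE entity_versions (
    entity_id INTEGER, entity_date TEXT, status TEXT, payload TEXT,
    PRIMARY KEY (entity_id, entity_date))""")
db.executemany("INSERT INTO entity_versions VALUES (?,?,?,?)", [
    (42, "2024-01-01", "dead",           "v1"),
    (42, "2024-02-01", "master",         "v2"),
    (42, "2024-03-01", "checked out",    "v2 + edits"),
    (42, "2024-03-05", "needs approval", "v2 + edits"),
])

# The current master per entity: the newest row whose status is 'master'.
current = db.execute("""
    SELECT v.entity_id, v.entity_date, v.payload
    FROM entity_versions v
    WHERE v.status = 'master'
      AND v.entity_date = (SELECT MAX(entity_date)
                           FROM entity_versions
                           WHERE entity_id = v.entity_id AND status = 'master')
""").fetchall()
print(current)   # [(42, '2024-02-01', 'v2')]

# Every join against another versioned table has to repeat this version/status
# filtering, which is where the complexity described above comes from.
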
You can read about the rationale behind, and the design of, temporal databases on Wikipedia.
You are describing a homebrew Content Management System which was probably hacked together over time, is (for the reasons you state) redundant and inefficient, and, given the nature of such systems in firms, is unlikely to be displaced without massive organizational effort.