1st iteration December 2015

Simple algorithm

Non-real-time, single pass

2nd iteration February 2016

Segments algorithm

Real-time with segments generation

3rd iteration April 2016 latest

Statistical algorithm

Real-time with incremental statistics

Prototype

In the first iteration of our prototype, we implemented a simple algorithm. It matched the Trust Reports to the GPS Reports by comparing the Tiploc codes and Event Types within a certain time limit. Next, the algorithm checked whether the Unit was supposed to run that service and gave preference to those Units. Finally, we calculated the percentage likelihood that a given Rolling Stock ran each service.
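The matching step can be sketched roughly as follows. This is only an illustration, not the prototype's actual code: the `Report` fields, the helper names and the five-minute default limit are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Report:
    """Hypothetical common shape for a Trust Report or a GPS Report."""
    tiploc: str        # location code
    event_type: str    # e.g. "A" for arrival, "D" for departure
    time: datetime


def matches(trust: Report, gps: Report,
            limit: timedelta = timedelta(minutes=5)) -> bool:
    """True if both reports share Tiploc and event type within the time limit."""
    return (trust.tiploc == gps.tiploc
            and trust.event_type == gps.event_type
            and abs(trust.time - gps.time) <= limit)


def match_percentage(trust_reports, gps_reports,
                     limit: timedelta = timedelta(minutes=5)) -> float:
    """Percentage of Trust Reports that have at least one matching GPS Report."""
    if not trust_reports:
        return 0.0
    hits = sum(any(matches(t, g, limit) for g in gps_reports)
               for t in trust_reports)
    return 100.0 * hits / len(trust_reports)
```

The preference for Units that were planned to run the service is left out here, since the text does not say how that preference was weighted.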

Algorithms

The first algorithm we used was a simple statistical one, and it was used in last year’s project. It calculated how likely a Service was run by a particular Rolling Stock by calculating how many Trust Reports match with a GPS Report. Next, we combined that algorithm with our visualisation to see how accurate it was. After careful evaluation, we concluded that it was not accurate enough and therefore we had to move on to a more complex version.

Statistical Algorithm (see demo)

Like the first visualisation, the statistical algorithm uses D3.js in order to display the results. The interface looks similar as well. In the top left corner, you can select the gps_car_id of the Rolling Stock you want to analyze. Then the algorithm lists all of the GPS Reports in the first column.

Next, for all services, we compare all of the Trust Reports to the GPS Reports. Using that data, we can calculate how many Trust Reports match with a GPS Report and calculate the probability of a Rolling Stock running the service.

Finally, we display all of the Services with a high enough probability. Every column, which represents a service, lists all of the Trust events and again, a green circle is shown if the Reports match and a red circle if no match is found.
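As a rough sketch of how the columns described above could be assembled, the function below marks each Trust event green or red and keeps only the Services whose match probability reaches a threshold. The data shapes, the function name and the 0.5 threshold are assumptions for illustration, and the time-limit check is omitted for brevity.

```python
def build_columns(trust_by_service, gps_keys, threshold=0.5):
    """Return display columns for Services that are likely enough to be shown.

    trust_by_service: dict mapping a service id to a list of
        (tiploc, event_type) keys taken from its Trust Reports
    gps_keys: set of (tiploc, event_type) keys seen in the selected
        Rolling Stock's GPS Reports
    threshold: hypothetical cut-off for "high enough probability"
    """
    columns = []
    for service, events in trust_by_service.items():
        if not events:
            continue  # nothing to compare for this Service
        # Green circle when a GPS Report matches the Trust event, red otherwise.
        marks = [(event, "green" if event in gps_keys else "red")
                 for event in events]
        probability = sum(colour == "green" for _, colour in marks) / len(marks)
        if probability >= threshold:
            columns.append({"service": service,
                            "probability": probability,
                            "events": marks})
    # Most likely Services first.
    return sorted(columns, key=lambda c: c["probability"], reverse=True)
```

In this sketch the probability is simply the fraction of matched Trust events, which is one plausible reading of the calculation described in the text.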

API Endpoints

Our prototype uses two main API endpoints to serve data, one for GPS data and the other for TRUST data. They feed the data visualisations and are located at /events/gps.json and /events/trust.json.

Infrastructure

At the beginning of the project, we researched various technologies and analyzed them to determine which ones were the most appropriate. Below is the full list of the technologies we have used and why:

PostgreSQL

Why a Database?

We receive data from various sources and in various formats (JSON, CSV, text files), so querying and analyzing the data in its original form is not always easy. The team therefore decided to store all of the data in a database.

Why an SQL Database?

Since some of the data is very inconsistent and is missing several values quite often, we first considered a NoSQL database since it would make storing the data very easy. However, we noticed some relations between several data sources that we could express very naturally in an SQL database. Additionally, our algorithm will be querying and joining the data very often. Therefore the advantages of using an SQL database outweighed the disadvantages.
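To illustrate the kind of join this makes possible, here is a small sketch. It uses Python's built-in `sqlite3` module purely as a stand-in for PostgreSQL so that it runs without a server, and the table and column names (apart from `gps_car_id`) are hypothetical rather than the project's real schema.

```python
import sqlite3


def matched_pairs(conn):
    """Join Trust and GPS reports that share a Tiploc and event type."""
    return conn.execute("""
        SELECT t.service_id, g.gps_car_id, t.tiploc
        FROM trust_report AS t
        JOIN gps_report AS g
          ON g.tiploc = t.tiploc AND g.event_type = t.event_type
        ORDER BY t.service_id, g.gps_car_id, t.tiploc
    """).fetchall()


# In-memory database with a handful of made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE trust_report (service_id TEXT, tiploc TEXT, event_type TEXT);
    CREATE TABLE gps_report (gps_car_id TEXT, tiploc TEXT, event_type TEXT);
    INSERT INTO trust_report VALUES ('1A01', 'EUSTON', 'D'), ('1A01', 'WATFDJ', 'A');
    INSERT INTO gps_report VALUES ('unit-7', 'EUSTON', 'D'), ('unit-9', 'CREWE', 'A');
""")

print(matched_pairs(conn))  # [('1A01', 'unit-7', 'EUSTON')]
```

Expressing the same relation in a document store would typically mean performing this join in application code.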

Why PostgreSQL?

Having decided to use an SQL database, we still had to choose from a wide range of options. After some research, we found that PostgreSQL would be the most appropriate for several reasons:

  • PostgreSQL is easily extensible. We learned that the PostGIS extension is very capable and popular when it comes to querying and storing location data in a database, whereas MySQL is much more limited.
  • PostgreSQL is widely used, which means that there is a strong community that can offer support.
  • It’s open-source.

Flask

We wanted to use an object-oriented language that everyone in the team was already comfortable with, which meant choosing between Python and Java. Python seemed more appropriate because it is dynamically typed and more concise, which makes it easier to read and debug, while still being a powerful language. We therefore decided on Python for the backend and a web application for the frontend. To build a web application in Python, however, we needed a web framework.

Everyone wanted a small, minimal framework with a very short learning curve so we could start developing immediately. Additionally, the framework had to be extensible so we could use it with PostgreSQL and any other technologies that we might decide on later. Since Flask met all of these requirements, we decided to use it, together with various dependencies.

SQLAlchemy and psycopg2

SQLAlchemy is one of the most widely used tools for working with SQL inside a Python application. Together with psycopg2, a Python adapter for PostgreSQL, we can use it to access our database. We mainly chose both tools because they are very easy to install through pip and to integrate with Flask. Additionally, there are a lot of examples that use the same setup.

Virtual Environment

A Virtual Environment keeps all of a project’s Python dependencies in one place. To do so, virtualenv creates a folder containing all of the executables needed to use the packages. Team members therefore do not need to worry about which version of a dependency they are using. This makes the development process a lot easier because we spend less time trying to configure each other’s machines.

D3.js

D3.js is a JavaScript library for manipulating HTML DOM elements based on data. There are a lot of JavaScript libraries for visualisations, but D3.js is one of the most flexible ones. Even though it might not be easy to learn, the library will allow us to customize everything in the visualisation. Additionally, there is a lot of documentation and examples available online that helped us to learn.

So the overall infrastructure of our application looks like:

[Infrastructure diagram]

Heroku

Heroku is a cloud-based application platform that aims to make building, deploying and scaling applications a lot easier. It abstracts away much of the SysAdmin work, so we do not have to worry about deployment and can focus on the actual algorithm and prototype. Heroku also lets us automate a large part of the deployment process: we simply push our code to the server and deployment is automated from there. In general, Heroku seemed to have a lot of advantages over traditional deployment on an Amazon EC2 instance.

In order to deploy, we only need to run 3 commands.

heroku login

In your git repo, add git remote to Heroku

heroku git:remote -a atos-service

Pushing to production

git push heroku master

Building the application

There are a couple of tools you need in order to build, run and deploy the application:

  • Python
  • pip - A Python Package Manager to share and reuse code
    • Flask - The Python web framework
    • psycopg2 - A Python adapter for PostgreSQL
    • SQLAlchemy - Tool to write SQL within a Python application
    • virtualenv - Manages all of your Python dependencies
  • Node Package Manager (npm) - A package manager that can be used by JavaScript developers to share and reuse code
  • JavaScript Package Manager (jspm) - A second JavaScript package manager, which is required to use ES6
  • d3 - JavaScript library for manipulating HTML DOM elements based on data
  • git - Version Control Software
  • PostgreSQL - SQL Database
  • Heroku - Cloud-Application Platform to deploy the application

A description of how to download, install and use these tools can be found on the project’s readme.

UML

Database Model

[Database model diagram]

Application Model

[Application model diagram]

User Interface

The user interface of our application contains a map that visualises all of the reported data points, with colours indicating whether the algorithm has found a match. The section to the right lists all of the units (of rolling stock) that have been identified as not matching their services. The algorithm parameter option allows the user to change the matching algorithm’s parameters, such as the time and distance tolerance between Tiplocs. The user can select a particular mismatched service, and the system will display details of the mismatched reports and its predictions.

A rough UI wireframe is shown below:

HCI Considerations

As the client did not have any explicit usability requirements (as the project itself is an algorithm rather than an application), we explored a few UI styles that may help make interpreting the output of the algorithm easier.

The map is a solution to a common problem that we have been experiencing with train services data. It helps the user see the whole picture: where the GPS or TRUST reports happened, and what the algorithm has decided. In odd situations where trains loop around, the user is also able to see it graphically, as opposed to a string of text that doesn’t tell you much.

The right section displays the most important information that the client needs in this project: which services aren’t being run by the planned rolling stock. The details pane below will initially be hidden, but once the user selects a particular mismatched service, the details of what the algorithm has found are displayed.

The button that is used to export the corrected diagrams is also easily accessible and the only button on the page to help emphasise the most important feature.

User Testing

We will provide a group of users with background and context and ask them to perform common tasks on the test application and then observe their behaviour. After this we will have a brief interview with them to gather feedback on specific points of the application.

Some open questions we would ask:

  1. How easy was it trying to find the new corrected diagrams?
  2. Do you feel like you could easily find mismatched services?
  3. What do you feel about the representation of data points on the map?
  4. Are you able to tell what is happening to the services as a whole at a particular time?

As the main target audience for this algorithm is people working in the train industry, our focus will be on testing with people from this audience, the most important being our client.

Testing Strategies

We have researched techniques and tools for testing which we plan to use once we start developing and experimenting with more complex algorithms. Our plan is to automate testing as much as possible, by connecting it to our version control system and testing on every change, so we detect errors early.

For continuous integration, we plan to use Travis, which is popular, free and integrates easily with GitHub.

Below are some of our test plans for Term 2:

Unit Tests

We will have a test suite written for the main algorithm itself which includes:

  • The matching of station names and the geographical locations
  • The matching of the time within x minutes
  • Matching of event type
  • Matching headcodes with specific train types and rough destinations
  • Matching services to rolling stock with Genius Allocations
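As an example of what one of these unit tests might look like, the sketch below covers the "time within x minutes" check using Python's standard `unittest` module. The `within_minutes` helper is a hypothetical stand-in for the corresponding part of the algorithm, and treating the window boundary as a match is an assumption.

```python
import unittest
from datetime import datetime, timedelta


def within_minutes(a: datetime, b: datetime, x: int) -> bool:
    """Hypothetical helper: True if the two times are at most x minutes apart."""
    return abs(a - b) <= timedelta(minutes=x)


class TimeMatchingTest(unittest.TestCase):
    def setUp(self):
        self.trust_time = datetime(2016, 2, 1, 9, 0)

    def test_inside_window(self):
        gps_time = self.trust_time + timedelta(minutes=3)
        self.assertTrue(within_minutes(self.trust_time, gps_time, 5))

    def test_order_does_not_matter(self):
        gps_time = self.trust_time - timedelta(minutes=3)
        self.assertTrue(within_minutes(self.trust_time, gps_time, 5))

    def test_boundary_counts_as_match(self):
        gps_time = self.trust_time + timedelta(minutes=5)
        self.assertTrue(within_minutes(self.trust_time, gps_time, 5))

    def test_outside_window(self):
        gps_time = self.trust_time + timedelta(minutes=6)
        self.assertFalse(within_minutes(self.trust_time, gps_time, 5))
```

Saved in a test module, these cases can be run with `python -m unittest`, which is also the kind of command a continuous integration service could run on every change.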

Functional Tests

Aside from the algorithm itself, we plan on using Selenium to automatically test the web application.

Once we get the dataset that outlines what actually happened, which was generated manually by the TOCs for the sample data that we were given, we can also write tests that take multiple data streams and results from historic data to test the accuracy of the algorithm.