An Open Data Architecture

Ben Morris · Published in STSI Point of View · 3 min read · Sep 27, 2019

At STSI, we support a number of “open data” missions, i.e., applications that share government data with the general public. We’re excited about the potential of open data, and we care about doing it right. What follows is an architecture we’ve used to deliver these solutions.

I’ve had the privilege of supporting the creation of data APIs for the FDIC, and I’ll cover the basics of the solution here. My hope is to share something useful for similar projects, and to get feedback on ways to improve this approach.

The Open Data Opportunity

Before we talk solution, let’s address the ‘why’. Government data can add a great deal of value, in more ways than I can articulate in this post. Banking data, for instance, can help markets function more effectively: more and better data could help mitigate events like the 2008 financial crisis. If economists and traders can better model what is going on, they can move to correct course earlier, avoiding a surprise bubble. The numbers (below) are so large that even a small dent represents tremendous value.

The financial crisis cost $19.2 trillion in household wealth and 8.8 million jobs, according to a US Treasury report.

Open Data Solution

The overall solution consists of three high-level parts: get data, serve data, and display data.
  1. Get data from upstream source systems, which sit inside the firewall.
  2. Serve data in bulk and via a queryable API.
  3. Display data to end users via a web browser.

The above solution follows an “API-first” philosophy, which fits well with the value proposition of government open data in general. The primary value is making the data available, in whatever form; once data is “in the wild,” someone can create value from it. Think of weather data from NOAA: it is great if the government provides a user interface for that data, but even if it doesn’t, many third parties will.

Technical Architecture

We settled on a simple architecture to make this happen, built on top of AWS services provisioned via cloud.gov.

Data flows through the tech stack as follows (minimal sketches of each service appear after this list):
  1. Data is uploaded as files (in our case, all JSON) to an S3 bucket. This keeps the interface simple: the upstream upload utility need only deal in files, not any API spec.
  2. An ingest microservice indexes the file contents into Elasticsearch indexes. Elasticsearch allows for flexible data structures, very fast read times, structured data queries, and text search queries.
  3. An API microservice selectively exposes Elasticsearch capabilities to the public: developers, data scientists, applications, etc.
  4. A UI app (in our case a React app) provides an in-browser interface to the data for “normal” users, such as consumers, journalists, and less technical researchers.
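To make steps 1 and 2 concrete, here is a minimal ingest sketch. It assumes the AWS SDK for JavaScript v3 and the Elasticsearch JS client v8; the bucket, key, and index names are illustrative, not the production values.

```typescript
// Ingest sketch: pull an uploaded JSON file from S3 and bulk-index its
// records into Elasticsearch. All names here are illustrative.
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import { Client } from "@elastic/elasticsearch";

const s3 = new S3Client({ region: "us-east-1" });
const es = new Client({ node: "http://localhost:9200" });

async function ingestFile(bucket: string, key: string): Promise<void> {
  // Fetch the uploaded file; we assume it holds a JSON array of records.
  const obj = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
  const records: Record<string, unknown>[] = JSON.parse(
    await obj.Body!.transformToString()
  );

  // Elasticsearch's bulk API alternates action lines and document lines.
  const operations = records.flatMap((doc) => [
    { index: { _index: "bank-data" } },
    doc,
  ]);
  const result = await es.bulk({ operations });
  if (result.errors) {
    throw new Error(`Some records from ${key} failed to index`);
  }
}

ingestFile("open-data-uploads", "filings/2019-09.json").catch(console.error);
```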
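Step 3 can be a very thin service. The sketch below uses Express (an assumption; the post doesn’t name a framework) and forwards only a plain full-text query to Elasticsearch, so the public never touches the raw query DSL.

```typescript
// API sketch: expose a narrow, safe slice of Elasticsearch to the public.
// The index name and query parameters are illustrative assumptions.
import express from "express";
import { Client } from "@elastic/elasticsearch";

const es = new Client({ node: "http://localhost:9200" });
const app = express();

// Example: GET /api/search?q=community+bank&limit=20
app.get("/api/search", async (req, res) => {
  const q = String(req.query.q ?? "");
  const limit = Math.min(Number(req.query.limit ?? 10), 100); // cap page size

  // Only a simple text query is forwarded; callers cannot send raw
  // Elasticsearch DSL, which keeps the exposed surface small.
  const result = await es.search({
    index: "bank-data",
    size: limit,
    query: { multi_match: { query: q } },
  });
  res.json(result.hits.hits.map((hit) => hit._source));
});

app.listen(3000);
```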
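And step 4, in the same spirit: a small React component that calls the API service. The endpoint path and response shape match the sketch above and are equally hypothetical.

```tsx
// UI sketch: a React component that fetches results from the API service.
import { useEffect, useState } from "react";

export function SearchResults({ query }: { query: string }) {
  const [rows, setRows] = useState<Record<string, unknown>[]>([]);

  // Re-query whenever the search term changes.
  useEffect(() => {
    fetch(`/api/search?q=${encodeURIComponent(query)}&limit=20`)
      .then((resp) => resp.json())
      .then(setRows)
      .catch(console.error);
  }, [query]);

  return (
    <ul>
      {rows.map((row, i) => (
        <li key={i}>{JSON.stringify(row)}</li>
      ))}
    </ul>
  );
}
```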

Conclusions

In general, this architecture has worked well: it scales and provides flexibility. The hardest part is always at the far-left of the diagram, making sure the correct data is coming in. I’d love to hear about alternatives and tradeoffs others have learned, or any thoughts on the API.
