Every Aspect of Your Project Is Design

In my personal projects, I’m not a world-changing software engineer. I tend toward developer tools, productivity tools, and personal scratch-an-itch projects (like, I suspect, many people). So it’s rare that I have an idea that is small enough in scope for me to finish v0.1 on my own, seems to fill a gap, and could be useful to a wider group of people.

In sofine, I hit on something I thought might have wider appeal. While still targeted at developers, it solves a general use case with several possible applications. The project gives you a simple way to create and manage data collection plugins, and to compose them as you wish to return unified data sets.

My original motivation was to be able to combine the data I could get from scraping my Fidelity portfolio page with calls on the same stock tickers to multiple other web APIs. But I realized the same approach could also be used to build pipelines for machine learning, where you are building wide feature sets of attributes from multiple data sources for a set of initial keys. Or it could be used by a web developer chaining calls to web APIs to build a single data set to display.

Because the project seemed to potentially have general utility, I wanted it to be as easy to start using, and as easy to continue using, as possible.

In pursuit of this noble goal, I decided to think hard about the design and to follow a few principles very strictly. These became apparent as I thought about the requirements, which themselves followed naturally from the use case described above.

I decided a user should be able to:

  1. Manage any number of data collection plugins in one way, but with no dependencies on each other and no possibility of collision
  2. Combine calls to them in one call, to support use in shell scripts, automation, piped expressions, and so on
  3. Retrieve one set of data from any number of data sources
  4. Begin and continue to use the library with the minimum possible number of install and configuration steps

It was obvious to me once I formulated these requirements that the first one required very careful design, and the second was in fact the API for usage of the library.

But I didn’t realize the third one would raise design issues until I ran into them, at which point I had the obvious-in-hindsight realization that data sets returned by sofine were also an API.

The fourth point clearly concerned the unsexy issues of package management and configuration management (among other things), which we don’t always look at as design. But here too I ran into novel issues because of the goals of the project, and in solving them came to see these aspects as integral to project design.

In short, I learned that design matters, even if your application’s only interfaces are the command line, URLs, configuration files, data output, and the file system.

Designing the Call APIs

This was the first area of design I thought about, because it was the most apparent at the start (which is probably a lesson in itself). I wanted users to be able to chain Unix-style piped calls to multiple plugins from the command line. I wanted as few arguments as possible.

The first iteration looked like this:

python runner.py '-s fidelity arg_1 arg_val_1 | -s ystockquotelib'

Some design basics were already in place. I decided to avoid making runner.py executable, to save users from relying on a hardcoded shebang path. And I knew I wanted the sofine arguments to come first, followed by any additional arguments required by the plugin call itself.

But I’d already hit my first design issue. Using traditional single-letter short-form arguments for sofine would break as soon as any plugin took an -s argument of its own: a namespacing problem. I decided the simplest solution was to prefix sofine arguments with --SF-:

python runner.py '--SF-s fidelity arg_1 arg_val_1 | --SF-s ystockquotelib'
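To make the namespacing idea concrete, here is a minimal Python sketch of how a piped call string could be split into sofine arguments and plugin arguments. It is an illustration only, not sofine's actual parser: the function name `parse_pipeline`, the constant `SF_PREFIX`, and the shape of the returned dictionaries are all assumptions made for this example.

```python
import shlex

SF_PREFIX = "--SF-"  # the prefix described above; everything else here is hypothetical


def parse_pipeline(call):
    """Split a piped call string into one dict per plugin call.

    Tokens that start with the prefix are treated as sofine arguments and
    consume the token that follows as their value. All other tokens are
    kept, in order, as arguments for the plugin itself.
    """
    stages = []
    # Naive split: a literal "|" inside a quoted plugin argument is not handled.
    for segment in call.split("|"):
        tokens = shlex.split(segment)
        sofine_args = {}
        plugin_args = []
        i = 0
        while i < len(tokens):
            token = tokens[i]
            if token.startswith(SF_PREFIX):
                if i + 1 >= len(tokens):
                    raise ValueError("missing value for " + token)
                sofine_args[token[len(SF_PREFIX):]] = tokens[i + 1]
                i += 2
            else:
                plugin_args.append(token)
                i += 1
        stages.append({"sofine": sofine_args, "plugin_args": plugin_args})
    return stages


# The first plugin takes its own -s argument without colliding with --SF-s.
for stage in parse_pipeline("--SF-s fidelity -s arg_val_1 | --SF-s ystockquotelib"):
    print(stage)
```

Because only prefixed tokens are claimed by the framework, a plugin remains free to define any short option it likes, including -s.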

Designing Plugin Management

I hit another namespacing issue with plugin management. Early in development I had a few test plugins in a directory. But I knew that system wouldn’t scale to fulfill my first requirement: “Manage any number of data collection plugins in one way, but with no dependencies on each other and no possibility of collision.”

The simplest solution seemed to be to use the natural namespacing provided by the file system. So a plugin directory might look like this:

- plugins
  - scrapers
    - fidelity.py
    - schwab.py
  - api_wrappers
    - ystockquotelib.py
    - all_finance.py
  - db_wrappers
    - ratings_data.py

or this:

- plugins
  - finance
    - fidelity.py
    - schwab.py
  - municipal
    - city_1_data.py
    - city_2_data.py
  - ratings
    - muni_ratings.py

This design had several other benefits that convinced me it was a good approach. Plugin groups let users logically organize plugins by whatever criteria worked for them, as in the above examples. Separate directories per group also meant users could manage all plugins in one code repository from the root plugin directory, or have one repo per group. And as long as plugin calls now supplied the plugin name and its group, configuration would still only require one value, the plugin root directory.

This last point showed that the CLI API design and plugin management design were tightly coupled. But here the coupling was intuitive and didn’t reduce user flexibility (or code flexibility, as I confirmed by implementing the feature). You could manage plugins as you wanted, and still call them in any combination with a very simple interface, which now looked like this:

python runner.py '--SF-s fidelity --SF-g example arg_1 arg_val_1 | --SF-s ystockquotelib --SF-g example'
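As a rough illustration of how a group and plugin name can map onto the file system, the following Python sketch resolves a plugin file from a root directory and imports it. This is an assumption-laden example rather than sofine's real loader: `plugin_path` and `load_plugin` are hypothetical names, and the demo writes throwaway plugin files into a temporary directory.

```python
import importlib.util
import os
import tempfile


def plugin_path(plugin_root, group, name):
    """Map a (group, name) pair onto the directory layout shown above."""
    return os.path.join(plugin_root, group, name + ".py")


def load_plugin(plugin_root, group, name):
    """Import one plugin module from its group directory."""
    path = plugin_path(plugin_root, group, name)
    if not os.path.isfile(path):
        raise FileNotFoundError("no plugin at " + path)
    # Qualify the module name with the group so that same-named plugins
    # in different groups stay distinct.
    spec = importlib.util.spec_from_file_location(group + "." + name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Demo: two groups that each contain a plugin file called fidelity.py.
with tempfile.TemporaryDirectory() as root:
    for group in ("scrapers", "finance"):
        os.makedirs(os.path.join(root, group))
        with open(plugin_path(root, group, "fidelity"), "w") as f:
            f.write("GROUP = {0!r}\n".format(group))
    a = load_plugin(root, "scrapers", "fidelity")
    b = load_plugin(root, "finance", "fidelity")
    print(a.GROUP, b.GROUP)  # prints: scrapers finance
```

The only configuration this approach needs is the plugin root directory; the group and name come from the call itself.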

Designing the Data Output Format

The data output format also went through two iterations, again related to namespacing. The purpose of the library is to collect data from multiple data sources for a set of keys, so I initially decided on a data format like so:

{"APPL": {"price": 102.45, "52-wk-high": 178.32}}

But what if the fidelity and ystockquotelib plugins both had a field named price?

My first solution was to use the same namespacing I’d used to solve the plugin issue, prepending the plugin group and name to each field:

{"APPL": 
    {"example::ystockquotelib::price": 102.45,
     "example::ystockquotelib::52-wk-high": 178.32,
     "example::fidelity::price": 102.45}
}

This solved the problem, and also let users filter or group data based on the plugin that had provided it. But it also added a lot of repetitive bloat to the return data set for users who just wanted a flat data set, and returning a flat data set was in fact the stated goal of the library.
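The namespaced format can be sketched in a few lines of Python. This is a hypothetical illustration, not code from sofine: `namespaced_merge` and the structure of its input (a dict keyed by group and plugin name) are assumptions, and the sample values are taken from the example above.

```python
def namespaced_merge(results):
    """Merge per-plugin results, prefixing each attribute with group::plugin::.

    `results` maps (group, plugin) tuples to {key: {attribute: value}} dicts.
    """
    merged = {}
    for (group, plugin), data in results.items():
        for key, attrs in data.items():
            row = merged.setdefault(key, {})
            for attr, value in attrs.items():
                row["{0}::{1}::{2}".format(group, plugin, attr)] = value
    return merged


results = {
    ("example", "ystockquotelib"): {"APPL": {"price": 102.45, "52-wk-high": 178.32}},
    ("example", "fidelity"): {"APPL": {"price": 102.45}},
}
merged = namespaced_merge(results)
for name, value in merged["APPL"].items():
    print(name, value)
```

Both price fields survive the merge because their full names differ, at the cost of longer keys on every attribute.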

Thinking more about it, I came up with the idea of using a JSON array of single-attribute objects for each key, rather than one object per key, which would allow attribute names to repeat:

{"APPL": 
    [{"price": 102.45}, {"52-wk-high": 178.32}, {"price": 102.45}]
}

Namespacing became an option that you could access by making a slightly different call. (Which led to other design considerations in the CLI API.)
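A short Python sketch shows how the array format lets repeated attribute names coexist. As before, this is only an illustration under stated assumptions: `merge_as_lists` is a hypothetical helper, and each plugin is assumed to return a dict of the form {key: {attribute: value}}.

```python
import json


def merge_as_lists(plugin_outputs):
    """Combine plugin outputs into {key: [{attribute: value}, ...]}.

    Each attribute becomes its own single-entry dict, so two plugins can
    both contribute an attribute called "price" without overwriting.
    """
    merged = {}
    for data in plugin_outputs:
        for key, attrs in data.items():
            merged.setdefault(key, []).extend({a: v} for a, v in attrs.items())
    return merged


outputs = [
    {"APPL": {"price": 102.45, "52-wk-high": 178.32}},  # e.g. from ystockquotelib
    {"APPL": {"price": 102.45}},                        # e.g. from fidelity
]
flat = merge_as_lists(outputs)
print(json.dumps(flat))
```

The trade-off is that a consumer can no longer tell which plugin supplied which price, which is why the namespaced form is still worth keeping as an option.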

Packaging, Installation, Configuration and Build

In keeping with my goal of optimal simplicity for the end user, I knew I wanted a package install for sofine. I did some research and chose pip, and then did some more to work through the issues of getting my package script working.

I got a nice rush the first time I went through the steps and they worked. My project was uploaded to PyPI! I could download it and install it!

But then I hit a snag. The import paths were broken. To make them work with the packaged install, I had to change them from the paths I had been using when running directly from the development directory. But with that change, the code no longer ran from the development directory. I needed to use the package install import paths all the time.

The solution I came up with was to always run the code from my Python package directory, which I referenced through $PYTHONPATH for portability. The design cost was that this required a new user to set one additional configuration value.

python $PYTHONPATH/runner.py '--SF-s fidelity --SF-g example arg_1 arg_val_1 | --SF-s ystockquotelib --SF-g example'

There was an additional cost to this decision. The development workflow now required a local build to create a new pip package and install it in my Python package directory. This seemed cumbersome, but automating it and running it locally added only a few seconds to each test run. I also learned how to use pip to make and install builds locally, and a couple of make tricks to suppress the build output.

My make targets look like this:

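# Recipe lines must be indented with a tab character.
.PHONY: deploy test

# Rebuild the package and reinstall it from the local dist directory.
deploy:
	@# Remove old build artifacts and the previously installed copy.
	@rm -rf dist > /dev/null
	@rm -rf sofine.egg-info > /dev/null
	@rm -rf $$PYTHONPATH/sofin* > /dev/null
	@# Build source distributions, suppressing the build output.
	@python setup.py sdist --formats=gztar,zip > /dev/null
	@# --allow-unverified comes from older pip releases and may need to be dropped on newer ones.
	@pip install --allow-unverified --no-index --find-links dist sofine > /dev/null

# Run the test suites against the freshly installed package.
test: deploy
	python ./sofine/tests/test_runner_from_cli.py
	python ./sofine/tests/test_runner_from_py.py
	python ./sofine/tests/test_runner_from_rest.py
	python ./sofine/tests/test_format_csv.py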

Conclusion

Open source software is made to be used. To encourage others to use your project, you should make it easy to get started with and easy to use. Achieving these broad goals means thinking about design across many aspects of your project: its call APIs of course, but also return formats, extension mechanisms, installation process, development workflow and any other user-facing aspect.