With a background in mathematics and physics, I came into data science looking for interesting work that, more or less, involved solving problems with computers. As a high school kid I built HTML-, ASP-, and then PHP-powered websites to, among other things, serve up guitar tabs. As a graduate student I wrote and ran Fortran code on thousands of supercomputer CPU cores at a time to better understand nuclear decay processes.

One of my favorite things about my role in graduate school was the freedom I was given to balance theoretical, literally pencil-and-paper work with heads-down software development. And it’s turned out, by grace and through the encouragement of my company, that data science offers a similar blend of theory and practice. I’ve applied mean field theory to the analysis of contagions on networks, and I’ve reimplemented deep learning algorithms from scratch for fun. It’s pretty neat.

Lately I’ve been especially interested in a few areas, though these change regularly:

  1. Bringing software development best practices into data science
  2. Modern deep-learning-powered text/NLP algorithms
  3. “Fairness” in machine learning/artificial intelligence/whatever buzzword you like

Projects

Most of these are learning projects or fun spare-time things. A couple (e.g., the Neo4j DB Manager) came about during work time, too.

arxivrss (Python)

Available on GitHub

I use the arxiv.org RSS feeds to (try to) stay current on a number of physics, deep learning, and machine learning fields, but new papers land in these feeds every day, and many of the items are either (1) updates to previously posted papers or (2) so-called “cross posts” that point back to another feed in my collection or are duplicated across multiple feeds (e.g., a nucl-th article cross posted to both the cs.LG and stat.ML machine-learning subject areas). The arxivrss package strips these unwanted items and slightly reformats the remaining RSS entries, cutting my feed volume by about 60%.
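The filtering logic can be sketched roughly as follows. This is a hedged illustration, not the package’s actual API: the item fields (`id`, `title`, `primary`) and the assumption that updated papers are flagged in the title are mine, made up for the example.

```python
def filter_feed(items, followed_feeds, this_feed):
    """Drop updated papers and cross-posts, keeping one copy per paper.

    items: list of dicts with "id", "title", and "primary" (primary
    subject area) keys -- a simplified stand-in for parsed RSS entries.
    """
    seen = set()
    kept = []
    for item in items:
        # Assumption: revisions of old papers are flagged in the title.
        if "UPDATED" in item["title"]:
            continue
        # Cross-post whose home feed is already in my collection.
        if item["primary"] != this_feed and item["primary"] in followed_feeds:
            continue
        # Duplicate of an item we've already kept.
        if item["id"] in seen:
            continue
        seen.add(item["id"])
        kept.append(item)
    return kept
```

In this sketch a paper survives only if it is new, belongs natively to the feed being read, and hasn’t been seen already.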


emojidome (R)

Available on GitHub

The emojidome R package bundles up the results of XKCD no. 2131 along with some additional data. No. 2131 was a multi-day, March Madness-style bracket tournament in which readers voted in a series of head-to-head emoji matchups. The package includes the final game results and commentary from XKCD, as well as (roughly) contemporaneous results from the emojitracker to give a sense of each emoji’s relative popularity.


Scrivo Blog Engine (Python)

Available on GitLab

Scrivo (“I write” in Italian) is the simple Python package I wrote to build this website. It generates static HTML from Markdown files using Markdown and Jinja2, with some additional logic for handling blog posts.
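The core of such a pipeline is small. Here is a minimal sketch of a Markdown-to-HTML page renderer in that style; the template string and the `render_page` helper are invented for illustration and are not Scrivo’s actual interface.

```python
import markdown        # Markdown -> HTML conversion
from jinja2 import Template

# A toy page template; a real site would load templates from files.
PAGE_TEMPLATE = Template(
    "<html><head><title>{{ title }}</title></head>"
    "<body>{{ body }}</body></html>"
)

def render_page(md_source: str, title: str) -> str:
    """Convert Markdown source to a full HTML page."""
    body = markdown.markdown(md_source)  # HTML fragment for the page body
    return PAGE_TEMPLATE.render(title=title, body=body)
```

A static site generator then just applies this to every Markdown file and writes the results out as `.html` files, with extra bookkeeping (dates, tags, index pages) for blog posts.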


bcf Bayesian Game Simulations (R)

Available on GitHub
Read the blog post

I wrote the bcf R package as a way to understand Approximate Bayesian Computation a bit better. The package simulates turn-based head-to-head competitions by approximating games as simple coin flips. Each player or team is modeled as a coin with some prior probability distribution for landing heads-up in a given flip, and the player or team that first obtains a Heads result wins. Ties are broken by running sub-games among the participating players as necessary.
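The simulation step can be sketched like this. This is a minimal Python illustration of the coin-flip game described above, not the bcf package itself; the function name and interface are made up.

```python
import random

def play(players, rng=None):
    """Simulate one coin-flip game; return the winning player's name.

    players: dict mapping player name -> probability of flipping heads.
    """
    rng = rng or random.Random()
    contenders = dict(players)
    while True:
        heads = [name for name, p in contenders.items() if rng.random() < p]
        if len(heads) == 1:
            return heads[0]           # a unique heads-flipper wins
        if len(heads) > 1:
            # Tie: run a sub-game among only the players who flipped heads.
            contenders = {name: contenders[name] for name in heads}
        # No heads at all: everyone flips again.
```

With the players’ heads probabilities drawn from prior distributions, repeated calls like this generate the simulated outcomes that Approximate Bayesian Computation compares against observed game results.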

The idea for this package came from multiplayer dart-throwing competitions in my office and the desire to roughly quantify players’ skill levels. I learned a ton from well-written articles by Rasmus Bååth and Darren Wilkinson.


Object Detector (Python)

Available on GitHub

The object-detector package is a straightforward implementation of YOLOv3, written as a way to better understand the PyTorch deep learning framework. I’d just wrapped up a project using Joseph Redmon’s original Darknet implementation, so the details were fresh in my mind, and Python is much easier for me to write than C.


Neo4j Database Manager (Python)

Available on GitHub

I wrote the neo4j-db-manager tool to make it easier to manage multiple Neo4j databases while I was working on graph database projects at work. It’s a pretty simple tool (and might be obviated by Neo4j 4.0?), but it was handy when I needed it and even got picked up in This Week in Neo4j, which was pretty gratifying.

Writing

Blog Posts

Other Places

The 42 V’s of Big Data and Data Science

For Elder Research, syndicated to KDnuggets, etc.

Encouraged by my boss at the time, I wrote a satirical post about the ever-increasing number of “V’s” of analytics/big data/machine learning (e.g., “velocity,” “value,” …) as a way to poke fun at the business world’s appetite for silly alliteration. Unfortunately, the satire was too well hidden, and more than one group thought we were serious.