I came to data science looking for interesting work that, more or less, lets me solve problems using computers like I’d done growing up. As a high-school kid I had learned how to build websites with HTML, ASP, and PHP to serve up guitar tabs among other things, and as a graduate student I learned how to write, compile, and run Fortran code on supercomputers to better understand nuclear decay processes.
I loved the freedom graduate school gave me to switch between pencil-and-paper theory and heads-down software development, and data science has so far offered a similar blend of theory and practice, largely thanks to my employer. Along the way I’ve been able to apply mean field theory to analyze how contagions spread across networks, implement deep learning algorithms from scratch for fun, and try out languages like Julia and Haskell.
My interests change pretty often, but most recently I’ve been interested in:
- Bringing software development best practices into data science
- Bayesian inference and Markov Chain Monte Carlo
- “Fairness” in machine learning and automated systems
Most of these are learning projects or fun spare-time things. A couple (e.g., the Neo4j DB Manager) are work projects, too.
I use the arxiv.org RSS feeds to (try to) stay current with several physics, deep learning, and machine learning fields, but there are typically papers put into these feeds every day. Many of these feed entries are either (1) updates to original papers or (2) so-called “cross posts” that point back to a feed already in my collection or are duplicated across multiple feeds (e.g., a nucl-th article is cross posted to both the machine-learning subject areas). The arxivrss package strips these unwanted items and slightly reformats the RSS entries, providing about a 60% reduction in
The r-emojiome R package bundles up the results of XKCD no. 2131 with some additional data. No. 2131 was a multi-day March Madness-style bracketed tournament where readers voted in a series of head-to-head emoji matches to decide which emoji would be victorious. The package bundles the final game results and commentary from XKCD as well as (roughly) contemporary results from the emojitracker to get a sense of the relative popularity of the emoji.
Scrivo Blog Engine (Python)
Scrivo (“I write” in Italian) is the simple Python package I wrote to build this website. It generates static HTML from Markdown files using the Markdown and Jinja2 packages, with some additional logic for handling blog posts.
bcf Bayesian Game Simulations (R)
I wrote the bcf R package as a way to understand Approximate Bayesian Computation a bit better. The package simulates turn-based head-to-head competitions by approximating games as simple coin flips. Each player or team is modeled as a coin with some prior probability distribution for landing heads-up in a given flip, and the player or team that first obtains a Heads result wins. Ties are broken by running sub-games among the participating players as necessary.
The idea for this package came from multiplayer dart-throwing competitions in my office and the desire to roughly quantify players’ skill levels. I learned a ton from well-written articles by Rasmus Bååth and Darren Wilkinson.
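The coin-flip idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the bcf package's API: the function names (`play_game`, `abc_posterior`), the round-based tie handling, the flat prior, and the fixed opponent probability are all hypothetical choices made for the example.

```python
import random


def play_game(probs, rng):
    """Return the name of the winning player.

    `probs` maps player name -> probability of Heads (assumed > 0 for at
    least one player, otherwise the game never ends). Every remaining
    player flips once per round: a lone Heads wins, several Heads start a
    sub-game among just those players, and a round with no Heads is replayed.
    """
    players = list(probs)
    while True:
        heads = [p for p in players if rng.random() < probs[p]]
        if len(heads) == 1:
            return heads[0]
        if heads:  # tie: only the players who flipped Heads continue
            players = heads


def abc_posterior(observed_wins, n_games, n_draws, rng, opponent_p=0.5):
    """Rejection-style Approximate Bayesian Computation for player A.

    Draw A's Heads probability from a flat prior, simulate `n_games`
    against a fixed opponent, and keep the draw only when the simulated
    win count matches the observed one exactly.
    """
    accepted = []
    for _ in range(n_draws):
        p_a = rng.uniform(0.01, 1.0)  # flat prior; avoid 0 so games terminate
        wins = sum(
            play_game({"A": p_a, "B": opponent_p}, rng) == "A"
            for _ in range(n_games)
        )
        if wins == observed_wins:
            accepted.append(p_a)
    return accepted


# Hypothetical data: player A won 8 of 10 games.
rng = random.Random(42)
samples = abc_posterior(observed_wins=8, n_games=10, n_draws=2000, rng=rng)
print(len(samples), sum(samples) / len(samples))
```

The accepted draws approximate the posterior for player A's Heads probability. Exact matching on the win count keeps the sketch simple, at the cost of discarding most draws as the number of games grows.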
Object Detector (Python)
The object-detector repository houses a straightforward implementation of YOLOv3 I built to better understand PyTorch. I’d just wrapped up a project that applied Joseph Redmon’s original Darknet YOLOv3 implementation, so the implementation details were fresh in my mind. And I know Python well; C not so much.
Neo4j Database Manager (Python)
A few years ago at work, I wrote the neo4j-db-manager tool to make it easier to manage multiple Neo4j databases. It’s a pretty simple tool and might be obviated by Neo4j 4, but it was handy when I needed it. The project got picked up in This Week in Neo4j, too, which was pretty neat.
- Empirical Bayes: UMBC Over Virginia
- Bayesian Coin Flips—The bcf Package
- Are NFL Teams Exceedingly Mediocre This Year?
- Dart Physics
- Installing R on EC2 with RHEL 7
The 42 V’s of Big Data and Data Science
Encouraged by my boss at the time, I wrote a satirical post about the ever-increasing number of “V’s” of analytics/big data/machine learning (e.g., “velocity,” “value,” …) as a way to poke fun at the business world’s appetite for silly alliteration. Unfortunately, the satire was too well hidden, and more than one group thought we were serious.