With a background in mathematics and physics, I came into data science looking for interesting work that, more or less, involved solving problems with computers. As a high school kid I built HTML-, ASP-, and then PHP-powered websites to, among other things, serve up guitar tabs. As a graduate student I wrote and ran Fortran code on thousands of supercomputer CPU cores at a time to better understand nuclear decay processes.
One of my favorite things about my role in graduate school was the freedom I was given to balance theoretical, literally pencil-and-paper work with heads-down software development. And it’s turned out, by grace and through the encouragement of my company, that data science offers a similar blend of theory and practice. I’ve applied mean field theory to the analysis of contagions on networks, and I’ve reimplemented deep learning algorithms from scratch for fun. It’s pretty neat.
Recently, I’m especially interested in a few areas, but these change regularly:
- Bringing software development best practices into data science
- Modern deep-learning-powered text/NLP algorithms
- “Fairness” in machine learning/artificial intelligence/whatever buzzword you like
Most of these are learning projects or fun spare-time things. A couple (e.g., the Neo4j DB Manager) came about during work time, too.
I use the arxiv.org RSS feeds to (try to) stay current on a number of physics, deep learning, and machine learning fields, but there are typically papers put into these feeds every day. Many of these feed items are either (1) updates to original papers or (2) so-called “cross posts” that either point back to a feed in my collection or are duplicated across multiple feeds (e.g., a nucl-th article is cross posted to both the stat.ML machine-learning subject areas). The arxivrss package strips these unwanted items and slightly reformats the RSS entries, resulting in about a 60% reduction in feed volume.
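The filtering rules above can be sketched in a few lines of Python. This is only an illustration, not the arxivrss code itself: the function name `filter_entries`, the `(feed_subject, title)` input shape, and the assumed title suffix (something like `(arXiv:1901.00001v2 [nucl-th] UPDATED)`) are all assumptions made for the example.

```python
import re

# ASSUMPTION: each item title ends with a suffix such as
#   "(arXiv:1901.00001v2 [nucl-th] UPDATED)"
# where the bracketed tag is the paper's primary subject area.
TITLE_SUFFIX = re.compile(
    r"\(arXiv:(?P<id>[\w./-]+?)(?:v\d+)?\s+"
    r"\[(?P<primary>[\w.-]+)\](?P<updated>\s+UPDATED)?\)\s*$"
)


def filter_entries(entries, subscribed):
    """Return the (feed_subject, title) pairs worth reading.

    entries: iterable of (feed_subject, title) pairs (hypothetical shape).
    subscribed: set of subject areas whose feeds are in the collection.
    """
    seen_ids = set()
    kept = []
    for feed_subject, title in entries:
        match = TITLE_SUFFIX.search(title)
        if match is None:
            # Unknown title format: keep it rather than silently drop it.
            kept.append((feed_subject, title))
            continue
        if match["updated"]:
            continue  # (1) update to an earlier paper
        primary = match["primary"]
        if primary != feed_subject and primary in subscribed:
            continue  # (2a) cross post pointing back to a subscribed feed
        if match["id"] in seen_ids:
            continue  # (2b) same paper already kept from another feed
        seen_ids.add(match["id"])
        kept.append((feed_subject, title))
    return kept
```

A cross post whose primary subject is outside the collection is kept the first time it shows up and dropped on any repeat, which is one reasonable reading of the two cross-post cases described above.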
The r-emojiome R package bundles up the results of XKCD no. 2131 with some additional data. No. 2131 was a multi-day March Madness-style bracketed tournament where readers voted in a series of head-to-head emoji matches to decide which was victorious. The package bundles the final game results and commentary from XKCD as well as (roughly) contemporary results from the emojitracker to get a sense of the relative popularity of the emoji.
Scrivo Blog Engine (Python)
Scrivo (“I write” in Italian) is the simple Python package I wrote to build this website. It generates static HTML from Markdown files using Markdown and Jinja2, with some additional logic for handling blog posts.
bcf Bayesian Game Simulations (R)
I wrote the bcf R package as a way to understand Approximate Bayesian Computation a bit better. The package simulates turn-based head-to-head competitions by approximating games as simple coin flips. Each player or team is modeled as a coin with some prior probability distribution for landing heads-up in a given flip, and the player or team that first obtains a Heads result wins. Ties are broken by running sub-games among the participating players as necessary.
The idea for this package came from multiplayer dart-throwing competitions in my office and the desire to roughly quantify players’ skill levels. I learned a ton from well-written articles by Rasmus Bååth and Darren Wilkinson.
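The coin-flip game model and the Approximate Bayesian Computation idea behind it can be sketched as follows. This is a Python illustration rather than the bcf package's R code. The function names, the Beta(2, 2) prior, and the exact round and tie-break rules (everyone flips each round, a lone Heads wins, several Heads trigger a sub-game among those players) are assumptions made for the example.

```python
import random


def simulate_game(probs, rng):
    """Play one game; return the index of the winning player.

    ASSUMED rules: every active player flips once per round. A single
    Heads wins; several Heads send only those players into a sub-game;
    no Heads means everyone flips again. Probabilities must be positive,
    otherwise the game may never end.
    """
    players = list(range(len(probs)))
    while True:
        heads = [i for i in players if rng.random() < probs[i]]
        if len(heads) == 1:
            return heads[0]
        if heads:
            players = heads  # tie: sub-game among the tied players


def abc_posterior(observed_wins, n_draws, rng):
    """Rejection ABC: keep prior draws whose simulated season matches the data.

    observed_wins: wins per player, e.g. [8, 2] for a 10-game series.
    Returns a list of accepted heads-probability vectors.
    """
    n_games = sum(observed_wins)
    accepted = []
    for _ in range(n_draws):
        # ASSUMPTION: independent Beta(2, 2) priors on each player's coin.
        probs = [rng.betavariate(2, 2) for _ in observed_wins]
        wins = [0] * len(probs)
        for _ in range(n_games):
            wins[simulate_game(probs, rng)] += 1
        if wins == list(observed_wins):
            accepted.append(probs)
    return accepted


if __name__ == "__main__":
    rng = random.Random(42)
    samples = abc_posterior([8, 2], n_draws=2000, rng=rng)
    for player in range(2):
        mean = sum(s[player] for s in samples) / len(samples)
        print(f"player {player}: posterior mean heads probability {mean:.2f}")
```

The accepted draws approximate the posterior over each player's heads probability. Requiring an exact match on win counts is the simplest acceptance rule, and it becomes wasteful as the number of games or players grows.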
Object Detector (Python)
The object-detector package is a straightforward implementation of YOLOv3 that I wrote as a way to better understand the PyTorch deep learning framework. I’d just wrapped up a project using Joseph Redmon’s original Darknet implementation, so the implementation details were fresh in my mind. And Python is much easier for me to write in than C.
Neo4j Database Manager (Python)
I wrote the neo4j-db-manager tool to make it easier to manage multiple Neo4j databases while I was working on graph database projects at work. It’s a pretty simple tool (and might be obviated by Neo4j 4.0?), but it was handy when I needed it and even got picked up in This Week in Neo4j, which was pretty gratifying.
- Empirical Bayes: UMBC Over Virginia
- Bayesian Coin Flips—The bcf Package
- Are NFL Teams Exceedingly Mediocre This Year?
- Dart Physics
- Installing R on EC2 with RHEL 7
The 42 V’s of Big Data and Data Science
Encouraged by my boss at the time, I wrote a satirical post about the ever-increasing number of “V’s” of analytics/big data/machine learning (e.g., “velocity,” “value,” …) as a way to poke fun at the business world’s appetite for silly alliteration. Unfortunately, the satire was too well hidden, and more than one group thought we were serious.