In high school I learned how to build websites and make them useful with server-side languages like PHP. As a graduate student I mixed physical and mathematical theory with programming and supercomputers to better understand nuclear decay processes. Now, I get to apply a similar blend of theory and practice to solve all kinds of problems in the analytics and data science space.
My interests don’t stay put for very long, but these days I’m mainly interested in Bayesian modeling, typically with Stan, as a means to consistently treat the limits of our knowledge. More practically, I’m also interested in reproducibility and “data science workflow,” trying to work out what aspects of the data science process can be codified and instrumented to better support the less rigid, more open-ended and artistic elements of model design.
Projects
Because most of my work is not public, here I’ve mainly collected a list of learning projects and fun spare-time things.
arxivrss (Python)
I use the arxiv.org RSS feeds to (try to) stay current with several physics, deep learning, and machine learning fields, but there are typically many papers put into these feeds every day. Many of these feed entries are either (1) updates to original papers or (2) so-called “cross posts” that point back to a feed already in my collection or are duplicated across multiple feeds (e.g., a nucl-th article is cross-posted to both the cs.LG and stat.ML machine-learning subject areas). The arxivrss package strips these unwanted items and slightly reformats the RSS entries, providing about a 60% reduction in feed volume.
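As a rough illustration of the filtering idea (not the arxivrss code itself), the sketch below drops updates, cross posts from followed feeds, and repeats across feeds. It assumes entry titles end with a suffix such as `(arXiv:2001.01234v2 [cs.LG] UPDATED)`; that title format, the regular expression, and the `filter_feeds` function name are assumptions made for the example.

```python
import re

# Assumed title suffix: "(arXiv:<id>v<n> [<primary category>] UPDATED)",
# where " UPDATED" is present only for revised papers.
TITLE_SUFFIX = re.compile(
    r"\(arXiv:(?P<id>\S+?)(?:v\d+)?\s+\[(?P<primary>[^\]]+)\]"
    r"(?P<updated>\s+UPDATED)?\)\s*$"
)


def filter_feeds(feeds):
    """Filter entry titles per feed.

    `feeds` maps a feed category (e.g. "cs.LG") to a list of entry titles.
    Returns a dict of the same shape with unwanted entries removed.
    """
    followed = set(feeds)
    seen_ids = set()  # shared across feeds to catch duplicated cross posts
    filtered = {}
    for category, titles in feeds.items():
        kept = []
        for title in titles:
            match = TITLE_SUFFIX.search(title)
            if match is None:
                kept.append(title)  # unrecognized format: keep it
                continue
            if match.group("updated"):
                continue  # (1) update to an earlier paper
            primary = match.group("primary")
            if primary != category and primary in followed:
                continue  # (2) cross post from a feed already followed
            if match.group("id") in seen_ids:
                continue  # (2) same paper already kept from another feed
            seen_ids.add(match.group("id"))
            kept.append(title)
        filtered[category] = kept
    return filtered
```

Sharing one set of seen identifiers across all feeds is what removes a paper that is cross-posted into several followed feeds from a category that is not followed.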
emojidome (R)
The r-emojidome R package bundles up the results of XKCD no. 2131 with some additional data. No. 2131 was a multi-day March Madness-style bracketed tournament where readers voted in a series of head-to-head emoji matches to decide which emoji would be victorious. The package bundles the final game results and commentary from XKCD as well as (roughly) contemporary results from the emojitracker to get a sense of the relative popularity of the emoji.
Scrivo Blog Engine (Python)
Scrivo (“I write” in Italian) is the simple Python package I wrote to build this website. It generates static HTML from Markdown files using Markdown and Jinja2, with some additional logic for handling blog posts.
bcf Bayesian Game Simulations (R)
Available on GitHub
Read the blog post
I wrote the bcf R package as a way to understand Approximate Bayesian Computation a bit better. The package simulates turn-based head-to-head competitions by approximating games as simple coin flips. Each player or team is modeled as a coin with some prior probability distribution for landing heads-up in a given flip, and the player or team that first obtains a Heads result wins. Ties are broken by running sub-games among the participating players as necessary.
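The game model described above can be sketched in a few lines of Python. This is only the forward simulation, not the bcf API and not Approximate Bayesian Computation itself. It assumes Beta priors on each player's heads probability and assumes that every remaining player flips once per round; the function names `play_game` and `simulate_win_rates` are hypothetical.

```python
import random


def play_game(p_heads, rng):
    """Return the index of the winning player.

    Each round, every remaining contender flips once. If exactly one gets
    heads, that player wins; if several do, they play a sub-game among
    themselves; if nobody does, the round is repeated.
    """
    if max(p_heads) <= 0:
        raise ValueError("at least one player needs a nonzero heads probability")
    contenders = list(range(len(p_heads)))
    while len(contenders) > 1:
        heads = [i for i in contenders if rng.random() < p_heads[i]]
        if heads:
            contenders = heads  # a single winner, or a tie to be broken
    return contenders[0]


def simulate_win_rates(priors, n_games=10_000, seed=0):
    """Estimate each player's win rate under Beta(a, b) priors (an assumption)."""
    rng = random.Random(seed)
    wins = [0] * len(priors)
    for _ in range(n_games):
        # Draw a heads probability per player from its prior, then play once.
        p_heads = [rng.betavariate(a, b) for a, b in priors]
        wins[play_game(p_heads, rng)] += 1
    return [w / n_games for w in wins]


if __name__ == "__main__":
    # Two players: one believed to be strong, one believed to be weak.
    print(simulate_win_rates([(8, 2), (2, 8)], n_games=2_000))
```

An ABC-style analysis would compare simulated outcomes like these against observed game results and keep the prior draws that reproduce them closely enough.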
The idea for this package came from multiplayer dart-throwing competitions in my office and the desire to roughly quantify players’ skill levels. I learned a ton from well-written articles by Rasmus Bååth and Darren Wilkinson.
Object Detector (Python)
The object-detector repository houses a straightforward implementation of YOLOv3 I built to better understand PyTorch. I’d just wrapped up a project that applied Joseph Redmon’s original Darknet YOLOv3 implementation, so the implementation details were fresh in my mind. And I know Python well; C not so much.
Neo4j Database Manager (Python)
A few years ago at work, I wrote the neo4j-db-manager tool to make it easier to manage multiple Neo4j databases. It’s a pretty simple tool and might be obviated by Neo4j 4, but it was handy when I needed it. The project got picked up in This Week in Neo4j, too, which was pretty neat.
Writing
Blog Posts
I occasionally write here, usually to share something I find interesting or a problem I eventually worked out that might be helpful to someone else. Posts involving the R programming language are also kindly republished by R-bloggers, an aggregator through which I’ve discovered quite a few interesting blogs to follow.
- Empirical Bayes: UMBC Over Virginia
- Bayesian Coin Flips—The bcf Package
- Are NFL Teams Exceedingly Mediocre This Year?
- Dart Physics
- Installing R on EC2 with RHEL 7
Other Places
The 42 V’s of Big Data and Data Science
For Elder Research, syndicated to KDnuggets, etc.
Encouraged by my boss at the time, I wrote a satirical post about the ever-increasing number of “V’s” of analytics/big data/machine learning (e.g., “velocity,” “value,” …) as a way to poke fun at the business world’s appetite for silly alliteration. Unfortunately, the satire was too well hidden, and more than one group thought we were serious.
Policy Impact on COVID-19 Spread
For Elder Research. Also presented to the Data Science Conference on COVID-19 as “Stay at Home, or at Least Tread Lightly: Using County-Level Data to Study the Effectiveness of COVID-19 Policy.”
This blog entry is a short write-up of some small COVID-19 work we did, grappling with the impact of U.S. governmental policy on the spread of COVID-19.