{ "version": "https://jsonfeed.org/version/1", "title": "Tom Shafer", "home_page_url": "https://tshafer.com/blog/", "feed_url": "https://tshafer.com/blog/feed.json", "description": "Tom Shafer's personal blog.", "author": { "name": "Tom Shafer", "url": "https://tshafer.com" }, "items": [{ "id": "https://tshafer.com/blog/2023/11/override-rstudio-desktop-font", "url": "https://tshafer.com/blog/2023/11/override-rstudio-desktop-font", "title": "Overriding RStudio Desktop's Font Picker", "content_html": "
I’ve been doing a decent amount of R programming lately and I’ve wanted to experiment with GitHub’s new Monaspace type family. Once installed, though, only the variable-font variants register as fixed-width fonts on macOS. This is a problem because RStudio Desktop (at least as of version 2023.06.1+524) only allows users to select fixed-width fonts in the interface.
RStudio manages most preferences in flat JSON files these days, so I figured I could pick whatever font I wanted in those files. After some searching, though, it turns out RStudio does not store font configuration in the usual places: ~/.local/share/rstudio/ or ~/.config/rstudio/.
Instead, the font setting is stored (on macOS Sonoma, at least) in ~/Library/Application Support/RStudio/config.json. I swapped in font["fixedWidthFont"] = "MonaspaceNeon-Regular", and everything works after an RStudio UI reload.
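\nFor reference, the relevant bit of config.json looks something like this (a sketch; I’ve kept only the font key, and the surrounding keys are omitted):
\n{\n  "font": {\n    "fixedWidthFont": "MonaspaceNeon-Regular"\n  }\n}\n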
Last year I benchmarked a few ways of shuffling columns\nin a data.table, but what about pandas? I didn’t\nknow, so let’s revisit those tests and add a few more operations!\npandas winds up being much more competitive than I expected.
\nFirst, dplyr is by far the slowest. Second, pandas is\n(more than) competitive with the R options. In the small-size\nregime (vector sizes up to about 1,000), the pandas option is\nsimilar to, but faster than, most of the slower R options, and\nthe numpy-backed solution is nearly as fast as base R\nassignment and data.table’s in-place option. I expected\npandas to be a lot slower.
\nMore surprising, in the large-vector regime both Python solutions\noutperform all R options, and the in-place Python option is\nmuch faster than everything else, starting with vector sizes of\nabout 10,000. I’m not sure how representative this benchmark is,\nbut it’s an interesting data point.
\nMore than most frameworks, pandas feels sensitive to the way we do something: calling .apply() isn’t just a little slower than .transform()—it’s miles slower. So the simple transformations we’re doing here are pretty easy to optimize, and numpy-backed operations should be fast anyway.
There also might be systematic differences between the R and Python tests: the R functions were timed with microbenchmark and the Python tests were run with timeit. The new code is below.
\nfrom timeit import Timer\nimport pandas as pd\n\n# Assumed setup template (not shown in the original): build a fresh\n# DataFrame with 2**n rows before each timing run\nsetup = "df = pd.DataFrame({'x': range(2 ** %d)})"\n\ndef scramble_naive(df: pd.DataFrame, colname: str) -> pd.DataFrame:\n    df[colname] = df[colname].sample(frac=1, ignore_index=True)\n    return df\n\ntest_naive = "scramble_naive(df, colname='x')"\n\nresults_naive = {\n    n: Timer(test_naive, setup % n, globals=globals()).repeat(repeat=100, number=1)\n    for n in range(21)\n}\n
import numpy as np\nimport pandas as pd\nfrom timeit import Timer\n\ndef scramble_inplace(df: pd.DataFrame, colname: str) -> pd.DataFrame:\n    # Shuffle the column's backing array in place\n    np.random.shuffle(df[colname].to_numpy())\n    return df\n\ntest_inplace = "scramble_inplace(df, colname='x')"\n\nresults_inplace = {\n    n: Timer(test_inplace, setup % n, globals=globals()).repeat(repeat=100, number=1)\n    for n in range(21)\n}\n
\n\n
scramble_base <- function(input_df, colname) {\n input_df[[colname]] <- sample(input_df[[colname]])\n input_df\n}\n
library(dplyr)\n\nscramble_dplyr <- function(input_tbl, colname) {\n input_tbl %>% mutate({{colname}} := sample(.data[[colname]]))\n}\n
I’ve been getting a lot of use recently from the Posit (n\u00e9e\nRStudio) Package Manager (PPM), because it offers freely\navailable R package binaries for quite a few Linux\ndistributions—including common ones I tend to see in\nDocker containers (rocker) and ‘the cloud’ (Amazon Linux\n2). Recent versions of renv seem to take advantage\nof these binaries, too.
\nBinary packages can cut package install times by an order of\nmagnitude, since they come precompiled. Many popular packages,\nincluding data.table and dplyr, are more or less\nR wrappers around C or C++ code at this point in an effort to\nmake things fast, and so installing/compiling a package like\ndplyr from source can take minutes. That’s fine once or\ntwice, but it’s not fine when I regularly rebuild containers or\nrun package checks with GitHub Actions.
\nIt’s pretty easy to get running with PPM once you know where to\nlook in the documentation, and it basically comes down to two\nsteps.
\nPosit offers different endpoints for the various Linux\ndistributions. To find yours:
\nNavigate to the PPM setup page.
\nChoose your Linux distribution. At the right-hand side of the\n header, click “Source” and choose from the drop-down menu.
\nMake sure the “Repository URL” setting is “Latest”. This is\n the default.
\nCopy the URL for your distribution. E.g., the URL for Ubuntu\n 22.04 is https://packagemanager.posit.co/cran/__linux__/jammy/latest.
\nIt’s easy to configure R to use the PPM endpoint by setting or changing a couple of values in .Rprofile:
Set a header to tell Posit your R configuration. The header passes along your version of R and a few other platform details.
\nSet the PPM URL as your CRAN source. CRAN is generally the default package source and is configured as part of the “repos” option.
Put together, you get an addition to .Rprofile like this:
options(HTTPUserAgent = sprintf(\n "R/%s R (%s)", \n getRversion(), \n paste(\n getRversion(), \n R.version["platform"], \n R.version["arch"], \n R.version["os"]\n )\n))\n\n.ppm <- "https://packagemanager.posit.co/cran/__linux__/jammy/latest"\noptions(repos = c(CRAN = .ppm))\n
Set these values in .Rprofile and they will generally propagate to all your R sessions; if the stuff I’m doing is any indication (Docker, GitHub Actions, Databricks,1 etc.), you might save a good bit of time.
Ugh. ↩︎
\nAt work, I recently had the opportunity to spend time\nthinking about the rise of generative AI from the perspective of\nour clients: businesses who hear the hype and wonder how to sort\nthrough all of this stuff. What are the early benefits? What are\nsome of the obvious risks?
\nOur marketing folks took some of that thinking and turned it\ninto a blog post. I’m hardly an expert in generative AI,\nbut that’s kind of the fun of it. These large language models\nwe’ve been watching and working with over the last few years are,\nat least for now, opening up a bunch of new applications. Maybe\nthis fire will keep burning or maybe it’ll cool, but it’s an\ninteresting time.
", "date_published": "2023-07-21T22:45:00-04:00"}, { "id": "https://tshafer.com/blog/2023/03/more-apple-music-usb", "url": "https://tshafer.com/blog/2023/03/more-apple-music-usb", "title": "More on Apple Music and USB Sticks", "content_html": "I have a script that cleans up a USB stick for playing music\nin my car, but that’s not quite enough. What I really need is a\nway to synchronize Apple Music with the volume, like you would do\nwith an iPhone\u2014or iPod, back in the day.
\nI started by dragging tracks from the Music app to the USB drive icon, but the app doesn’t offer a dialog asking whether to overwrite existing files; I wound up with multiple copies of every file rather than even one-way synchronization. So apparently I had to do this myself, and the result is a tiny little Python package named musicsync.
\nI tried a few different things, including rsync, but a\ncombination of AppleScript (to gather files from the Music app),\nPython (for comparing directory trees between selected songs and\nthe destination, and for handling the synchronization looping),\nand OS-level file operations works really nicely.
\n$ musicsync --playlist "Selected for Car" /Volumes/UNTITLED\n\nCollecting songs from playlist "Selected for Car"\n Collected 2514 songs\nPlaylist root directory is "~/Music/Music/Media.localized/Music/"\nBuilding playlist song tree\n Found 365 dirs and 2514 files\nRemoving extra files from "/Volumes/UNTITLED/"\nCopying new files from "Selected for Car" to "/Volumes/UNTITLED/"\n Aaron Keyes/In the Living Room: 17%|\u2588\u2588 | 2/12 [00:06<00:25, 2.60s/it]\n
A couple of points I found interesting while building this:
\nditto --nocache --noextattr --noqtn --norsrc was great for working with FAT32 and ignoring resource forks and other extended attributes.
\nSometimes when I’m running errands or whatever I just don’t want to take my phone out. It’s nice to have at least a few minutes without it, and if I’m headed to the grocery store it isn’t like I actually need it to get there. Sometimes quiet and stillness are really nice, but my car also has a couple of USB slots and purportedly supports MP3 and AAC audio.
\nBut it’s 2023 and apparently there are still pieces of tech that absolutely do not work out of the box with macOS-related stuff. I’m a reasonably technical person, but after all the Google searching and the approximately ten million times this didn’t work, here are the steps I needed to follow to play MP3s on my 2019 Accord from a USB stick plugged into a Mac laptop. These directions basically mirror a set I found, eventually, on a forum for Honda Ridgeline owners. Maybe this post can amplify those directions and offer a little more Google-fu.
\nRemove the hidden metadata directories: rm -r .Trashes .Spotlight-V100, etc. This requires the Terminal, iTerm 2, etc. to have Full Disk Access.
Run /usr/sbin/dot_clean, which is installed with macOS, to merge or delete the ._* AppleDouble files.
\nAnd then we probably need to do this every time we add new files to the drive. Sigh. I wrote a little Python script to do this, in case it’s helpful: https://github.com/tomshafer/cleanusb. Once installed, it offers a command named cleanusb that follows these steps.
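\nFor reference, the manual version of those steps looks something like this (a sketch; the volume name and the exact set of metadata directories will vary):
\ncd /Volumes/UNTITLED\n\n# Remove macOS metadata directories (the terminal app needs Full Disk Access)\nrm -rf .Trashes .Spotlight-V100 .fseventsd\n\n# Delete ._* AppleDouble files rather than merging them\n/usr/sbin/dot_clean -m .\n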
Here’s a trick I find useful during exploratory analysis and feature engineering; really, whenever I’m querying against database servers I don’t control: Log the connection PID at query time.
\nIt happens pretty often to me during exploratory analysis that I launch a query and then, either right away or after the query begins to drag on longer than expected, wish I could cancel the query to edit it and try again.\nMaybe I forgot to put a filter into the query, and it’s about to return way too much data, or maybe the database is underpowered and I’d rather extract a smaller result for analysis.
\nOften there isn’t a good way to stop a query from within RStudio/VS Code when the query is directed to a remote database server.\nUnless we want to wait for the query to finish and return control of the IDE to us, maybe the best we can do is to restart the R session or Jupyter kernel and start another query.\nIf we log the connection PID, though, we get another option: We can open another session, instead of killing this one, and ask the database server to cancel any running queries on that PID.
\nThis is really easy to do in both Python and R, and most drivers offer a way to get the PID. I’ve been using Postgres/Redshift a lot recently, so I’ll use it as an example. With Python and psycopg2, we can call get_backend_pid() to log the PID:
log.info(f"Query PID = {conn.get_backend_pid()}")\n
#> Query PID = 948\n
Or, with R and DBI, we can call dbGetInfo():
message("Query PID = ", DBI::dbGetInfo(conn)$pid)\n
#> Query PID = 950\n
Once we have our PID, it’s really easy to open up a new session and ask the database to cancel the running query:
\nSELECT pg_cancel_backend(<PID>)\n
Because this is just SQL, we can run this query anywhere and recover control of our IDE session.\nSimple and useful.
", "date_published": "2022-09-13T07:05:00-04:00"}, { "id": "https://tshafer.com/blog/2022/06/experimenting-with-quarto", "url": "https://tshafer.com/blog/2022/06/experimenting-with-quarto", "title": "Experimenting with Quarto", "content_html": "Quarto is the up-and-coming “next generation version of R Markdown” being developed by RStudio.\nIt’s more or less a superset of R Markdown/knitr that’s suited to programming languages besides R.\nQuarto’s heading towards a 1.0, and I’ve started experimenting for a few client projects.
\nSo far I like the system a lot, and at this point I really think Quarto’s worth a try, especially since it ships with recent versions of RStudio.
\nThis post lists a few of my favorite elements after a couple of weeks of using the tool off and on.
\nBecause Quarto uses knitr to execute R code, my usual workflows don’t change unless I want them to. It’s just about all upside so far, and I’ve been able to use the old-style knitr::opts_chunk$set() syntax in places where I haven’t yet been able to configure Quarto directly.
Quarto takes the YAML metadata styling used by R Markdown (and pandoc and many other tools) and extends it. In particular, Quarto introduces some special syntax (#| key: value) to specify chunk-level options. This is supposed to replace the old style, where the options are crammed into the language identifier ({r title, fig.asp=0.618, ...}), and it’s particularly nice when specifying a good number of options—like I might when building a figure:
```{r}\n #| label: my-figure\n #| fig-asp: 0.618\n #| fig-cap: |\n #| This is a caption for my figure, \n #| using YAML formatting, etc.\n\n ggplot() + ...\n ```\n
One of the nicest new features is the upgraded styling. First, the default theming feels much nicer than the old R Markdown style (even if the default font size is a little big for my taste). Second, the new styling makes margin notes easily possible and subfigures much easier to compose.
\nMargin notes are a big deal and can be enabled by setting reference-location: margin in the YAML front matter.
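\nE.g., at the top of the document (a minimal sketch; the title is just a placeholder):
\n---\ntitle: "My document"\nreference-location: margin\n---\n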
In the past I’ve relied on R4DS’s figures guidance for composing multiple figures in one chunk. But now Quarto can do this composition immediately using layout options, and, with the additional column: options, we can tell Quarto to expand the content width as we like. E.g., for two subfigures stretching across the page and outside of the normal body content block:
```{r}\n #| label: my-figure\n #| layout-ncol: 2\n #| column: page\n\n ggplot() + ...\n ggplot() + ...\n ```\n
It’s just so nice.
\nQuarto can be downloaded as a standalone application; it isn’t dependent on RStudio or any IDE. Using the command-line tool, we can call quarto render to compile a document or quarto preview to render a live preview that automatically updates when the source files are saved. The latter is exposed in RStudio via “Render on Save,” but it’s available in any editor using the command-line tool.
Finally, there are the IDE integrations.\nWhen I’m working with R I’m almost always using RStudio,\nand RStudio has the expected set of built-in niceties. \nMostly, these are all the usual knitr-powered switches, buttons, and keyboard shortcuts,\nbut there is also editor completion for the various YAML configurations at the cell and editor level.
\nJust about all the work I do outside of R is handled with VS Code and, often, its Remote SSH and Remote Containers extensions. I’ve so far used Quarto less with VS Code than with RStudio, but I’ve already experimented with one nice feature: Quarto’s ability to render .ipynb notebooks (for which VS Code has a very nice integrated experience) as Quarto documents. This seems to work just like one would expect: Simply put the relevant YAML metadata in the top notebook cell and add #| key: value comments to blocks as necessary. VS Code renders and executes the notebook just as expected, and Quarto takes the notebook—and any output that’s been generated—and renders it like I’d expect if I’d written the code in R rather than Python.
In a post last week I offered a couple of simple techniques for randomly shuffling a data.table column in place and benchmarked them as well. A comment on the original question, though, argued those timings aren’t useful because the benchmarked data set contains only five rows (the size of the table in the original post).
\nThat seemed plausible, so I’ve carried the test further. Often\nwe’re interested in vectors with hundreds, thousands, or millions\nof elements, not a handful. Do the timings change as the vector\nsize grows?
\nTo find out, I simply extended my computation from last time using microbenchmark and plotted the results below. I’m surprised to see just how much set() continues to outperform the other options even at fairly large vector sizes.
library(data.table)\nlibrary(microbenchmark)\n\nscramble_orig <- function(input_dt, colname) {\n new_col <- sample(input_dt[[colname]])\n input_dt[, c(colname) := new_col]\n}\n\nscramble_set <- function(input_dt, colname) {\n set(input_dt, j = colname, value = sample(input_dt[[colname]]))\n}\n\nscramble_sd <- function(input_dt, colname) {\n input_dt[, c(colname) := .SD[sample(.I, .N)], .SDcols = colname]\n}\n\ntimes <- rbindlist(\n lapply(\n setNames(nm = 2 ** seq(0, 20)),\n function(n) {\n message("n = ", n)\n setDT(microbenchmark(\n orig = scramble_orig(input_dt, "x"),\n set = scramble_set(input_dt, "x"),\n sd = scramble_sd(input_dt, "x"),\n setup = {\n input_dt <- data.table(x = seq_len(n))\n set.seed(1)\n },\n check = "identical"\n ))\n }\n ),\n idcol = "vector_size"\n)\n
Reading the chart from left to right, small vectors to large ones, the first regime is one where set() dominates the other methods, with a much shorter runtime. This is followed by a transition to a regime where the time required for sample() to shuffle large vectors dominates the total run time. (Notice both axes are on logarithmic scales.)
Does this matter? The differences here are so small that we\ncan’t even use profvis to benchmark a single run. But, what\nif we were calling this functionality repeatedly in a loop? The\ndifferences add up.
\nThis is a good example of where it’s nice to know the options available to us in the languages and packages being used: The data.table authors built set() for these kinds of reasons, as a way to programmatically assign to data.tables in place within loops.
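\nFor example, a minimal sketch of set() inside a loop (hypothetical data, not from the original post):
\nlibrary(data.table)\n\ndt <- data.table(a = 1:5, b = 6:10)\n\n# set() skips the overhead of `[.data.table`, so per-call cost stays tiny inside loops\nfor (j in names(dt)) {\n  set(dt, j = j, value = dt[[j]] * 2L)\n}\n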
In a one-off analysis, maybe it’s not worth the trouble to care too much about speed, and it’s likely not a good use of time to benchmark everything. But when writing packaged code, for example, we give up the ability to know how and where our code will be used. It pays to be aware of things like the difference between using .SD and set() and which is the better option; it makes our code more easily used in places we never thought about and can’t think about at the time.
Yesterday, in a post syndicated to R-bloggers, kjytay asked how to programmatically shuffle a data.table column in place, as the straightforward way didn’t work well.
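\nThe pitfall looks roughly like this (my sketch, not kjytay’s exact code): the unquoted name on the left-hand side is taken literally, so you end up writing a column called “colname”:
\nscramble_broken <- function(input_dt, colname) {\n  # Writes a column literally named "colname" instead of the column the caller asked for\n  input_dt[, colname := sample(colname)]\n}\n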
\nHere are two other ways to solve the same problem, one using data.table::set() and the other .SDcols:
scramble_set <- function(input_dt, colname) {\n set(input_dt, j = colname, value = sample(input_dt[[colname]]))\n}\n\nscramble_sd <- function(input_dt, colname) {\n input_dt[, c(colname) := .SD[sample(.I, .N)], .SDcols = colname]\n}\n
Each approach returns the correct result and avoids the strange evaluation problem of ending up with a column literally named “colname”.
\nIt’s good to check performance with these kinds of things, too, especially when .SD is involved. Here, set() handily outperforms the other two solutions (I named kjytay’s original solution “orig”):
microbenchmark(\n orig = scramble_orig(input_dt, "x"),\n set = scramble_set(input_dt, "x"),\n sd = scramble_sd(input_dt, "x"), \n setup = {\n input_dt <- data.table(x = 1:5)\n set.seed(1)\n }, \n check = "identical"\n)\n
Unit: microseconds\n expr min lq mean median uq max neval\n orig 291.970 315.4400 351.52132 319.474 327.5635 3248.663 100\n set 33.196 36.0965 61.62936 37.262 39.5380 2419.880 100\n sd 557.834 591.2370 636.88657 597.579 616.2675 3821.737 100\n
I wrote a thing! Well, I edited someone else’s thing, and then I\nadded a lot of words, and then someone else (multiple someones?)\nedited my words. And then they added fancy graphics and stuff.
\nBut here’s a very business-y blog post that talks about our\ncompany’s Chief Scientist Committee, of which I’m a member: Why\nDoes Elder Research Need a Chief Scientist Committee?
", "date_published": "2022-05-02T21:57:00-04:00"}, { "id": "https://tshafer.com/blog/2022/04/vscode-devcontainer-uid-gid", "url": "https://tshafer.com/blog/2022/04/vscode-devcontainer-uid-gid", "title": "Update VS Code Remote Container UID and GID", "content_html": "I’m increasingly relying on VS Code’s Remote Container\nfeatures for remote development in clients’ cloud computing\nsystems. It’s a little fiddly (I wouldn’t say I’m a Docker\nexpert, either) but it mostly works out of the box, and the\nability to encapsulate my environment makes a lot of other things\neasier.
\nI ran into a new problem recently, though, on a remote compute\ninstance with multiple Unix user accounts and VS Code’s default\nPython 3 image. VS Code’s containers are set up with a non-root\nuser named vscode, linked to the user and group IDs 1000:1000.\nMostly that’s fine, but this time (because I’d set up a few user\naccounts) my UID was, e.g., 1005, not 1000, and my primary GID\nwas totally different. The container needs to get this wiring\nright for permissions to be consistent inside and outside of the\ncontainer.
\nIt wasn’t super clear to me initially, but the answer seems to be\nmanually updating the container’s user (vscode) directly in the\ncontainer’s Dockerfile. Lifting directives from “Change the\nUID/GID of an existing container user” works like a charm,\nand I also appended another group:
\nARG USERNAME=vscode\nARG USER_UID=1005\nARG USER_GID=10\n\nRUN groupmod --gid $USER_GID $USERNAME \\\n && usermod --uid $USER_UID --gid $USER_GID $USERNAME \\\n && usermod -aG $USER_UID $USERNAME \\\n && chown -R $USER_UID:$USER_GID /home/$USERNAME\n\nUSER $USERNAME\n
The only thing left to do, I think, is figure out how to automate this for the others on my team so that at creation time the container picks up id -u and id -g and populates the relevant fields automatically.
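\nOne possible answer, assuming your extension version supports it: devcontainer.json has an updateRemoteUserUID option that remaps the container user’s UID/GID to the local user automatically. A sketch:
\n{\n  "remoteUser": "vscode",\n  // If supported, VS Code rewrites the container user's UID/GID to match the local user on Linux\n  "updateRemoteUserUID": true\n}\n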
I’ve been playing music since around sixth grade, mostly with\nschool and church groups and bands, and I’ve had amazing\nopportunities to play fun venues, but last week took the cake. I\nhad the opportunity to play with our church at the Walnut\nCreek Amphitheatre in Raleigh and it was legendary: nearly\nthree hours long, 16,000 in attendance, and 200 baptized.
\nGod’s doing some amazing stuff in us, our friends, and in\nRaleigh, and what a moment this was to remember. Many notes were\nplayed and the gospel was preached. I’ve included the stream\nbelow\u2014check it out!
\nFor work, I’ll be in Las Vegas in June to present on Bayesian\nmodeling and workflow to Predictive Analytics World’s\nHealthcare track. Over the last year I’ve worked on a pretty\nsimple problem that nicely illustrates the need for\u2014and\napplication of\u2014a modeling workflow along the lines of\nBetancourt, Gelman, and others, especially to guard\nagainst overly complex models.
\nI’m supposed to present for 45 minutes on June 21.
", "date_published": "2022-04-09T09:17:00-04:00"}, { "id": "https://tshafer.com/blog/2022/03/programming-with-lapply", "url": "https://tshafer.com/blog/2022/03/programming-with-lapply", "title": "Programming with lapply", "content_html": "TIL that lapply
accepts both functions and function names (as\ncharacter vectors). From right there in the documentation\n(emph. mine):
\n“FUN is found by a call to match.fun and typically is specified as a function or a symbol (e.g., a backquoted name) or a character string specifying a function to be searched for from the environment of the call to lapply.”
So, lapply
can use match.fun
to find and apply our functions\ndirectly; no need to hack around when we need custom function\napplication. That’s way simpler in cases like we’ve just had\nrecently at work, where we need to apply one of a variety of\nfunctions depending on the program’s state.
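\nFor instance (a small sketch; the state flag is made up for illustration):
\nuse_robust <- TRUE  # hypothetical bit of program state\nsummarizer <- if (use_robust) "median" else "mean"\n\n# lapply() resolves the character string via match.fun()\nlapply(list(a = 1:10, b = rnorm(5)), summarizer)\n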
Bonus picks:
\nTo this point, our family photo collection has grown to a little\nless than 28,000 photos and videos: not totally unwieldy, but\nlarge enough that tagging and facial recognition are important\ntools for finding specific photos or groups of photos. Facial\nrecognition also seems to drive a lot of Apple’s “Memories”\nfeatures, which seem to have improved a lot in iOS 15.
\nOur library is also big enough that it contains plenty of\nunlabeled “People”\u2014folks Photos can detect reliably enough to\nlabel, but with enough uncertainty that they aren’t given a\npositive identification. Sorting through this sub-collection ends\nup being really useful because about 10% of our total library\nfalls into this category, where faces are detected but not\nidentified.
\nI used to scroll through sections of our library and label photos as I found them, but it turns out that it’s possible to create an album that automatically collects all of these together: Just create a Smart Album where the “Person” is set to the empty string:1
\nBecause this approach relies on Smart Albums it’s only available on the Mac, but it makes library tagging much easier—especially when combined with another hard-to-discover Photos feature. I have thousands of photos with unlabeled people, but many of these photos are correlated: they picture the same person, sometimes in the same setting. Photos doesn’t make it easy to determine when this is the case, but, in some cases, you can double-click on a detected-but-not-identified face in the Info panel (⌘+I) and view a page collecting all the times this person appears. By renaming this unnamed individual, all of their photos are then merged into the correct identity. Super useful.
\nI just moved from a Windows 7 virtual machine to Windows 10 at work but, as soon as I moved over to the new VM, the Control key stopped registering in Pulse Secure remote desktop sessions. I’m working in RStudio, so Ctrl+Enter and Ctrl+Shift+Enter are massively important key commands.
After a lot of searching around, I found that this is a Parallels issue, not a\nWindows 10 or Pulse Secure bug. For one reason or another, instructing Parallels\nto optimize its keyboard handling for games fixes the issue. (See this forum\npost and KB article.)
\nMaybe this post will save someone else an hour of Googling.
", "date_published": "2021-01-18T06:31:00-05:00"}, { "id": "https://tshafer.com/blog/2020/10/russell-wilson-mvp", "url": "https://tshafer.com/blog/2020/10/russell-wilson-mvp", "title": "Russell Wilson and MVP Voting", "content_html": "I was listening to one of Bill Simmons’s podcasts recently while he was talking about Russell Wilson, a superb NFL quarterback who has apparently has never received an MVP vote despite performing excellently for many years. In the discussion, it was decided that arguing for Wilson receiving vote(s) at some point in the past is pointless because what should someone have done, voted for the wrong guy? Surely not!
\nThat idea stuck with me for a while, and finally it occurred to me that we probably shouldn’t be asking whether Wilson ought to have ‘taken’ votes from an MVP winner. Most years don’t have unanimous winners, so looking over the list of non-MVP players receiving votes is a better measure. I did some ‘research’ (OK, I googled) and came up with this figure, which tabulates NFL MVP voting over the past five seasons:
\nEach bar segment counts the number of MVP votes a player received in a year. I’ve highlighted in blue the votes received by players who did not win MVP, and I’ve broken those numbers out into the following table:
\nPlayer | ↓ Votes | Years | Notes |
---|---|---|---|
J.J. Watt | 13 | 1 | |
Tom Brady | 12 | 3 | Won NFL MVP in 2017 |
Drew Brees | 9 | 1 | |
Todd Gurley | 8 | 1 | |
Derek Carr | 6 | 1 | |
Ezekiel Elliott | 6 | 1 | |
DeMarco Murray | 2 | 1 | |
Aaron Rodgers | 2 | 1 | |
Tony Romo | 2 | 1 | |
Carson Wentz | 2 | 1 | |
Carson Palmer | 1 | 1 | |
Dak Prescott | 1 | 1 | |
Bobby Wagner | 1 | 1 |
If NFL MVP voting isn’t typically unanimous then, instead of asking whether Russell should have gotten votes destined for Patrick Mahomes or Lamar Jackson, maybe we could ask whether he should have received as many votes as, say, Carson Palmer. Yikes.
\nThe data and code used to write this post are available on GitHub.
", "date_published": "2020-10-03T09:48:00-04:00"}, { "id": "https://tshafer.com/blog/2020/09/data-science-covid19", "url": "https://tshafer.com/blog/2020/09/data-science-covid19", "title": "Some Recent COVID-19 Work", "content_html": "We don’t usually talk much about what we’re up to at work,\nbut in the last few weeks I’ve had the opportunity to share some\nresearch work from earlier this year. Back in June, I teamed with\na group to produce an analysis of how (or whether) U.S.\ngovernment policy had affected COVID-19 infection rates. We’ve\nwritten up the work in a post on the Elder Research\nblog, and I presented the same at the Data\nScience Conference on COVID-19 (DSCC19).
\nI’ll defer most details to the linked post, but we analyzed\nmonths of data at the U.S. county level to test for policy\nimpacts on the growth of COVID-19 cases. We also adjusted\nfor many other potential explanatory inputs including\ntesting, population size and density, and key demographic factors\nincluding income and minority representation. Even after these\nadjustments, stay-at-home orders were associated with a slight\ndecrease in cases on average and were linked to a continuing\nreduction in cases over time (i.e., these orders seemed to\ndecelerate the disease progression).
\nWe’ve published the source code for both the original\nproject and our article. Having code\navailable made the project a good fit for DSCC19. The article\nsource, particularly, contains data and an R Markdown document\nthat, when compiled, should produce the same model fits we show\nin the article.
", "date_published": "2020-09-26T13:15:00-04:00"}, { "id": "https://tshafer.com/blog/2020/08/influence-maximization", "url": "https://tshafer.com/blog/2020/08/influence-maximization", "title": "New Influence Maximization Article", "content_html": "Last week, Social Network Analysis and Mining published a new research article that I coauthored with a few colleagues\u2014my first published data science paper. We did the work itself quite a while ago as part of a high-performance computing project, and my gratitude goes to Hautahi for steering the paper to completion.
\nIn the article we studied the problem of Influence Maximization, a subfield of social network analytics and network science that searches for ways of identifying the most influential entities in a network (think Instagram ‘influencers’ \ud83d\ude44). This kind of problem turns out to be a computational nightmare, and so most approaches are either heuristic\u2014seeming to work well in practice but not mathematically guaranteed to return good solutions\u2014or are only proven to provide OK solutions. This second group of so-called “provable” algorithms typically guarantee solutions within % of the best possible answer. It’s still possible that they give a perfect solution in a given case, but it’s only assured that their solutions will be close to optimal.
\nIn the paper we took a really fast algorithm, tweaked it to provide brute-force calculations of the exact1 solution, and implemented it using HPC and GPUs. Then we compared various approximate approaches to the exact solution: do they only give 63%-optimal solutions, or do they perform better in practice? It turns out they perform extremely well on our test networks, representing common cases. There are likely pathological networks out there, but the techniques perform well with common graph constructions.
\nThe paper is available through SNAM, and Hautahi has provided source code, a workshop presentation, and a preprint through his website.
\nAs exact as you can get using a Monte Carlo approach. ↩︎
\nI’ve recently spent some time at work updating a few R packages we’ve built and deployed over the last several years, and during these updates I’ve run up against an old foe:
\nmy_package::run_model(...)\n#> Error in UseMethod("predict") : \n#> no applicable method for 'predict' applied to an object of class "ranger"\n
In case you’re unfamiliar, this error stems from R’s S3 method-dispatch system: model
is a “ranger” object, so R goes looking for a “ranger” version of the ‘predict’ method, but can’t find it\u2014even though I have the ranger package installed and Import
ed. I’ve seen this occasionally ever since I stated working with R in 2016, but until now I’ve treated the symptoms. I usually patch around this error by forcing R to use the (private) function ranger:::predict.ranger()
even though this is poor form, shouldn’t need to be done, and raises NOTE
s during the R CMD check
process.
I recently made some progress, though, by realizing that things ‘magically’ worked if I called, e.g., library(ranger)
at the top of whatever script uses my package to get things done. And then, finally, I realized something else. One embedded call would work A-OK:
result <- ranger::predictions(predict(model, data))\n
while a separate call in another function would not:
\nresult <- predict(model, data)\n
These two clues unlocked the solution.1 It turns out that, while Importing a package is necessary, it is not sufficient for its S3 methods to be made available at runtime. This includes methods like predict.<class>(). In fact, S3 methods aren’t registered at all unless you tell R to use some bit of the imported package earlier in your program.2
You can test this by calling methods(predict), which, in my example, will not list predict.ranger(). After calling any ranger function, however—say, ranger::predictions()—a subsequent call to methods(predict) does indeed list predict.ranger() as an available S3 method. This is why the ‘predict’ call wrapped in ranger::predictions worked for years, and I didn’t even notice; the :: call causes R to immediately load ranger along with its various S3 methods, so predict() dispatches just fine.
If the S3 method were exported from a package, I suppose one could simply import the predict.<class>() method directly, e.g., in roxygen2 documentation syntax:
#' @importFrom <package name> <method name>\n
But this doesn’t work if the S3 method isn’t exported\u2014and a ‘predict’ method shouldn’t really need to be.
\nWith the problem better understood, my solution is to do one of two things:
\nranger::predictions()
). This alone should cause the package to attach its \n namespace and register S3 methods when my package is loaded.requireNamespace(<package name>, quietly = TRUE)
call to the \n top of the function of interest, or to my package’s .onLoad()
function.\n Unlike library()
, this causes R to register the appropriate S3 methods, etc., but prevents the package from “attaching”, from adding itself to the search path so that all its functions are globally available.3 You can confirm this again by checking methods(predict)
before and after calling ‘requireNamespace’, including for non-exported S3 methods like predict.ranger()
.Live and learn, I guess. \ud83e\udd37\ud83c\udffb\u200d\u2642\ufe0f
\nI’ve also written up an answer on StackOverflow. ↩︎
\nThere’s probably a good reason for this, but I cant’t say that I like it. ↩︎
\nNote that once the S3 methods are registered there seems to be \n no good way to deregister them. ↩︎
\nUpdate (October 4, 2021): This trick seemed to work at the\ntime, but, when I returned to this work, multi-GPU training began\nto fail again. As always, your mileage may vary.
\nI’ve been working with Detectron 2 a lot recently, building\nobject-detection models for work using PyTorch and\nFaster R-CNN. Detectron 2 is a delightful and extensible\nframework for computer-vision tasks1 but it turns out not to\noffer a baked-in method for tracking evaluation losses during\ntraining—kind of a basic thing in machine-learning world. In\nML, evaluation losses and other tests against out-of-sample data\nare critical for estimating overfit and finding a suitable\nresting point on the bias-variance tradeoff curve, but I\nwonder if this isn’t a big concern for most computer-vision\nresearchers, who are trying to learn from millions of images and\nbillions of pixels.
\nI wasn’t the first to realize that this was missing, of course.\nDetectron 2’s GitHub repo contains a few issues like this\none discussing how to implement evaluation loss\ntracking, and there’s also a related Medium\npost that uses Detectron 2’s hook system to\nsolve the problem. In a nutshell, Detectron 2’s hook system\nworks like so:
\nwith EventStorage(start_iter) as self.storage:\n try:\n self.before_train()\n for self.iter in range(start_iter, max_iter):\n self.before_step()\n self.run_step()\n self.after_step()\n except Exception:\n logger.exception("Exception during training:")\n raise\n finally:\n self.after_train()\n
(Source code from Detectron 2 on GitHub.)
\nCustom training code code can cleanly register for\nthese hook methods, and this approach works\nwell for single-GPU training. But I figured out pretty quickly\nthat the hook-based system falls over when training with multiple\nGPUs (I’m often training this particular model with 4 V100s on\nAWS), probably from communication errors among the GPUs.\nI saw a post suggesting that different GPUs might be getting\nstuck in different parts of the code, since the hook system is\nimplemented across multiple functions, and this tracks with my\nexperience.
\nOne way around this multi-GPU issue is to bypass the hook system\nentirely, directly subclassing SimpleTrainer
’s run_step()
\nmethod since we use a custom trainer descended\nfrom SimpleTrainer
:
class OurFancyTrainer(DefaultTrainer):\n\n def run_step(self) -> None:\n super().run_step()\n\n # At a given number of iterations...\n self.calculate_test_losses()\n
This approach is similar in spirit to the hook-based system\u2014we’ve\nonly moved some code from, e.g., after_step()
into the end of\nrun_step()
\u2014but by subclassing the trainer we’re now able to\ndeliver code of similar complexity that works just fine for both\nsingle and multiple GPUs.
Detectron 2 is a great Python code base. It’s well\norganized, extensible, and uses type hints in many places. With\nonly a few thousand lines of code, I’ve been able to write data\nloaders, evaluators, etc., without writing any models from\nscratch. Detectron 2 also uses the YACS config system for\nspecifying and tracking experiments, which I really like. ↩︎
\nTwo good friends of ours were married this past weekend, tucked into the far corner of a backyard under a few trees, warm and breezy with light clouds and singing birds. It was lovely. They moved up their wedding day by several months because of the pandemic, hosting a small gathering for family instead of waiting for a larger celebration in the fall.1 That later gathering will hopefully happen, too, but it’s still another “normal” thing hurriedly rearranged.
\nIt’s natural to feel sad about all the big and small inconveniences that COVID-19 is causing. And to recognize that, for many people, “inconvenient” drastically undersells the situation: it’s life or death, whether because of the disease itself or from the economic consequences. But in view of everything moved around or put on hold over the last few months, this wedding brought some particular kind of joy. For the Christian, a wedding isn’t just a big party (though it is that), it’s a group celebration of two people choosing to team up for the rest of their lives. And to continue choosing, day after day. In these times, “choosing to choose” now seems more tangible than ever.
\n‘By extreme coincidence’ Martha and I happened to join the next-door neighbors for dinner around wedding time, too. After consulting with the bride and groom ahead of time, of course. \ud83d\ude01 ↩︎
\nDuring a recent visit to my in-laws, we hilariously realized that they were receiving 100 Mbps downlink internet service…but piping that through a super old Netgear 802.11b router I had installed a while back. Whoops.
\nWe remedied that situation via a Netgear R6900 (similar to the Wirecutter’s recommended R7000), and I wanted to add en emoji to the new SSID. Because that’s just indisputably better than not having an emoji. But, when I tried to enter a fun emoji-based SSID, I was greeted with a JavaScript alert dialog saying ‘nope’.
\nA quick search yielded two super helpful solutions for this problem: one for a TP-Link Archer C1900 and another for a Netgear R6300, which I figured should be very similar to my situation. After some fiddling, I was successful in adapting their work to the R6900 with just a bit more effort. The Netgear R6300 solution works by replacing the SSID validation JavaScript code with a function that always returns true
:
checkData = function() { return true; }\n
The R6900 is a little more complicated. The router supports both 2.4 GHz and 5 GHz broadcast, with two associated SSIDs. As a result, for the R6900 we have to update two checkData
functions: one associated with the 2.4 GHz band and one with the 5 GHz band. It turns out that you can update these functions in the same way as described in the above posts, being sure to choose the appropriate page to update in the lower right-hand corner of the Safari Developer Console. For the R6900, we have to replace checkData
in both the wl2gsetting and wl5gsetting pages.
The R6900 checkData functions are also a bit complicated, so I opted to just copy the function definition, comment out the SSID regular expression check, and reassign the function:
\ncheckData = function(save_only) {\n ...\n //if (cf.ssid.value.match(/[^(\\x20-\\x7E\\xA0)]/)) {\n // return alertR(getErrorMsgByVar("gsm_msg_inv_ssid"));\n //}\n ...\n}\n
After updating checkData
in both pages, setting the SSID, and hitting “Apply”, we’re good to go. \ud83d\ude04\ud83d\udc4d\ud83c\udffb\ud83d\ude4c\ud83c\udffb
I’ve recently had the opportunity to do a bit of remote advising in Physics\u2014debugging modern Fortran code, mostly\u2014and have been reminded:
\nI’m still pretty proud of the software Mika and I wrote while I was in school, and I’m happy the codes are still being used. And that they still work!
\nThis year’s NCAA tournament finally brought the mother of all upsets: 16-seeded UMBC (the Retrievers!) beat the Virginia Cavaliers by 21 points!1 Just how unlikely was UMBC’s upset?
\nTo get a more precise answer without delving deeply into basketball metrics, we can use Empirical Bayes as a framework for estimating the likelihood of a 16-over-1 upset. Our NCAA problem turns out to be especially suited for this approach because we’re asking a question about probabilities (this involves a Beta distribution) using win-loss data (“Bernoulli trials”). Under this setup, the probability of a 16-over-1 upset is described by a probability distribution $p(\theta)$. The function gives us the “probability of a probability”: it tells us what the probability $\theta$ of a 16-over-1 upset could be. Maybe it’s 1%, maybe 10%, etc.
\nWe need two pieces of information to apply Empirical Bayes: an initial guess at the probability of an upset (a “prior” belief) and some hard data. The data is easy to gather; in the 33 years since the NCAA tournament format expanded to include 64 teams, 16 seeds had won zero games in 132 attempts.
\nOur “initial guess” is subjective, but it won’t turn out to matter much. For fun, let’s pick two extreme guesses and see how they affect our estimates.
\nOur two guesses look like this:
\n\nThe first guess is completely flat; before seeing hard data, we do not claim to know if 16 seeds win 0% of the time, 55%, or 99%, etc. The second guess, though, is much more opinionated: we’re pretty sure that the 16 seed is going to lose.
\nGuess | \n99% CI Lower | \n99% CI Upper | \n
---|---|---|
Flat Prior | \n0 | \n1.000 | \n
Opinionated Prior | \n0 | \n0.197 | \n
With the “flat” prior, we don’t claim to know the probability of an upset: The 99% credible interval for the upset probability ranges from 0% to 100%. On the other hand, our 20-to-1 guess really does make an impact: The 99% CI ranges between 0% and 20%. That seems more reasonable.
\nNow, let’s add some data and answer the question: Going into this year, how unlikely was it that any individual 16 seed would beat a 1 seed?3 Since 1985 we’d observed 132 failures and zero successes, so we can update our initial guesses very easily: $\alpha \to \alpha + 0$ and $\beta \to \beta + 132$. This is possible because of the Beta distribution/Bernoulli trial problem type—they are conjugate priors, so the posterior is simply $\mathrm{Beta}(\alpha + \text{wins},\ \beta + \text{losses})$:
\nThe resulting probability distributions are below—we have to really zoom in to even see where the distribution is nonzero.
\n\nTwo takeaways:
\nJust how small is that probability?
\nGuess | \nMost Likely | \n99% CI Lower | \n99% CI Upper | \n
---|---|---|---|
Flat Prior | \n0 | \n0 | \n0.0340 | \n
Opinionated Prior | \n0 | \n0 | \n0.0298 | \n
Even for the flat prior, our 99% CI only ranges up to a 3.4% chance of upset. But, as promised, the opinionated prior hasn’t made much of a difference: incorporating our 20-to-1 guess only shrinks the probability to a 3% chance. In both cases, the most likely probability of an upset (the mode) is 0%!
\nSo—we could not have ruled out an upset being impossible, given these results. That’s good, given what happened next! But, also happily, the estimated probability of an upset is very small and matches our intuition. Pretty cool!
\nSo…what about after this year? To find out, we just follow the same procedure as before, adding three non-upsets and our one shiny new upset; the posterior distributions become $\mathrm{Beta}(\alpha + 1,\ \beta + 135)$:
\nAnd as a result, our updated probabilities are:
\nGuess | \nMost Likely | \n99% CI Lower | \n99% CI Upper | \n
---|---|---|---|
Flat Prior | \n0.0074 | \n1e-04 | \n0.0474 | \n
Opinionated Prior | \n0.0065 | \n1e-04 | \n0.0417 | \n
Still small, with the 99% CI ranging from about 0.01% to 4.7% and the most likely probability somewhere around 0.7%. But notice—now that we have finally observed an upset—that our posterior distribution no longer allows for a 0% chance of upset. As it shouldn’t!
\n\nEven worse: Virginia was the overall number one seed, the highest-ranked team in the nation. ↩︎
\nThis is the only constraint given for a two-parameter problem; it’s only enough to guarantee that $\alpha$ and $\beta$ are related through the prior mean. In keeping with the flat prior, we set $\alpha = 1$ so that the 20-to-1 guess pins down $\beta$. ↩︎
\nThe probability that any of the four 16 seeds would win is slightly different: $1 - (1 - \theta)^4$. ↩︎
\nIn an experiment with Approximate Bayesian Computation and R packages, I uploaded a new R package of my own to GitHub a few days ago named bcf for Bayesian Coin Flip. It simulates $n$-person games of skill, approximating these games as multiple players flipping coins with different “fairness parameters” $\theta_i$. The first player to obtain a “Heads” result wins, with ties handled in a sensible way.1
\nThe ABC concept is well explained in a pair of articles. First, Rasmus B\u00e5\u00e5th introduces ABC through an exercise involving mismatched socks in the laundry (thanks for pointing me to this, Kenny). And Darren Wilkinson also does a nice job explaining how ABC works.
\nAs far as bcf is concerned, the probability of a coin coming up Heads is picked from a distribution assigned to each player, $\theta_i \sim \mathrm{Beta}(\alpha_i, \beta_i)$. The package then simulates a set of games using these parameters and provides samples from the joint distribution of parameters and outcomes. Finally, by keeping only the draws that match an observed result, we end up with a distribution proportional to the posterior $p(\theta \mid \text{result})$.2
\nI built the package to better understand ABC and to humorously model our office’s dart-playing abilities. To keep things simple, the bcf package only provides a few functions: We can initialize a player, run a game, and use the results of the game to update a player’s statistics.
\nA basic game might go like so: We assume a pretty weak prior (Beta(1.2, 1.2)) for each player before running a few games. After each game, we update the involved players.
\nIn practice, I think it’s pretty easy to use. First, instantiate a few players:
\nlibrary(bcf)\n\ntom <- new_player("Tom", alpha = 1.2, beta = 1.2)\ndavid <- new_player("David", alpha = 1.2, beta = 1.2)\nkevin <- new_player("Kevin", alpha = 1.2, beta = 1.2)\n\nprint(tom)\n
## UUID: d8cfe17e-c81d-11e7-9f88-f45c899c4b7b\n## Name: Tom\n##\n## Games: 0\n## Wins: 0\n## Losses: 0\n##\n## Est. Distribution: Beta(1.200, 1.200)\n## MAP Win Percentage: 50.000\n
Then simulate three games for which we already have results:
\n# Tom wins, David places second, Kevin finishes third\ngame_1 <- abc_coin_flip_game(\n players = list(tom, david, kevin),\n result = c(1, 2, 3),\n iters = 5000, cores = 5L)\n
## No. players: 3\n## Assign result: 1, 2, 3\n## Iters: 5000\n## CPU cores: 5\n## Workloads: 1000, 1000, 1000, 1000, 1000\n
tom <- update_player(tom, game_1)\ndavid <- update_player(david, game_1)\nkevin <- update_player(kevin, game_1)\n\n# Tom wins, Kevin places second, David finishes third\ngame_2 <- abc_coin_flip_game(\n players = list(tom, david, kevin),\n result = c(1, 3, 2),\n iters = 5000, cores = 5L)\n\n\ntom <- update_player(tom, game_2)\ndavid <- update_player(david, game_2)\nkevin <- update_player(kevin, game_2)\n\n# Tom finishes second, David wins, Kevin finishes third\ngame_3 <- abc_coin_flip_game(\n players = list(tom, david, kevin),\n result = c(2, 1, 3),\n iters = 5000, cores = 5L)\n\n\ntom <- update_player(tom, game_3)\ndavid <- update_player(david, game_3)\nkevin <- update_player(kevin, game_3)\n
bcf then provides methods for examining both players and games:
\nprint(game_3)\n
## # A tibble: 6 x 6\n## Tom David Kevin outcome n pct\n## <dbl> <dbl> <dbl> <chr> <int> <dbl>\n## 1 1 2 3 1627 32.5\n## 2 1 3 2 1917 38.3\n## 3 2 1 3 *** 439 8.8\n## 4 2 3 1 590 11.8\n## 5 3 1 2 222 4.4\n## 6 3 2 1 205 4.1\n
plot(game_3)\n
plot(tom)\n
This has been a pretty fun experiment in ABC and in R packaging. I’ll update this post if I ever return to the project.
\nIf more than one player obtains the same result on a given flip, these players play one or more sub-games to break the tie. ↩︎
\nOne gotcha\u2014for now\u2014is that bcf imposes a beta distribution for each player’s win probability. After working out the likelihood on paper, I don’t think the posterior is actually a beta…just almost a beta. The possibility of ties and additional rounds adds complication. ↩︎
\nOn a recent podcast, Bill Simmons wondered aloud if the NFL as a whole is especially mediocre this year. I haven’t been watching all that much NFL football this season, but from what I have seen this observation rings true—the teams do seem pretty bad.
\nFortunately, the question “Are NFL teams exceptionally mediocre this year?” is pretty easy to answer precisely thanks to football-reference.com. First, I pulled week-by-week results for nine recent seasons,1 being mindful to be a good citizen and download the data very slowly:
\nwget -w 3 https://www.pro-football-reference.com/years/2016/week_{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17}.htm\n
Then, I parsed out the weekly winners and losers using rvest and purrr:
\n# Packages used below\nlibrary(rvest)\nlibrary(stringr)\nlibrary(purrr)\nlibrary(dplyr)\nlibrary(tibble)\n\n# Team name is the last word\n. %>% str_split(" ") %>% `[[`(1) %>% tail(1) -> get_team_name\n\n# Extract teams\n. %>%\n html_text() %>%\n Filter(function(.x) .x != "Final", .) %>%\n map_df(~ tibble(team = get_team_name(.x))) -> extract_teams\n\nget_winners_losers <- function(html_document, week) {\n doc <- read_html(html_document)\n\n bind_rows(\n html_nodes(doc, "tr.winner td a") %>%\n extract_teams() %>%\n mutate(week = week, win = 1, loss = 0),\n html_nodes(doc, "tr.loser td a") %>%\n extract_teams() %>%\n mutate(week = week, win = 0, loss = 1))\n}\n\n# 2008 wk 9 is bad\nyears <- setdiff(2007:2017, 2008)\n\nwon_loss_df <- map_df(years, ~ {\n year <- .x\n message(paste("Year", year))\n\n map_df(1:17, ~ {\n week <- .x\n message(paste("Week", week))\n\n path <- paste0("./pages/", year, "/week_", week, ".htm.xz")\n if (!file.exists(path)) {\n message(paste0("File does not exist: ", path))\n } else {\n paste0("./pages/", year, "/week_", week, ".htm.xz") %>%\n get_winners_losers(week) %>%\n mutate(year = year)\n }\n })\n})\n
The resulting data frame, won_loss_df, marks whether teams won, lost, or had a bye. From this point it isn’t hard to aggregate the results and take a look at historical results through Week Six, and I was a bit surprised by the following figure:
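\nFor reference, the aggregation behind the figure is only a few lines. A sketch (not necessarily the exact code used for the figure), counting teams with exactly three wins through Week Six:
\nwon_loss_df %>%\n  filter(week <= 6) %>%\n  group_by(year, team) %>%\n  summarize(wins = sum(win), .groups = "drop") %>%\n  count(year, wins) %>%\n  filter(wins == 3)\n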
2017 is the most mediocre season in the sample! (Even though it seemed possible, I wasn’t really expecting the data to back me up.) But…2017 doesn’t take the crown by all that much. The 2017 season brings 12 teams with three wins after Week Six, but the 2012 season had 11 teams and 2013 had 10. It just feels anomalous—maybe because so many other teams have 2 or 4 wins.2
\nSo, by this simple metric the NFL isn’t extraordinarily mediocre this season—just ho-hum mediocre.
\nCongratulations.
\nIt’s a little early for Christmas (okay, it’s absurdly early for Christmas), but I never linked the 2016 Christmas at DPAC show here, and Summit recently re-posted it with a nicely remixed audio track.
\nThis is one of the best things I’ve been able to do for the last four years, and I’m super grateful for the opportunity to play again this year. If you’ll be around the Triangle near Christmas Eve, pick up some tickets in December and come join us!
\nAlready putting 2024 on our calendar.
\nWe have a dart board at the office and have a good time lofting darts in nice, looping arcs. A recent project pushed me back into physics and led me to consider just how sensitive the dart-throwing motion is to small imperfections in angle; how precise do we need to be? Calculating those angular perturbations ($\delta\theta$ in the coordinates I’ll set up next) requires the kinematics of the problem and provides an opportunity to solve the equations numerically with R.
\nOur office dart board is fixed at a location $(x, y) = (L, 0)$, and a thrower (darter? player?) stands with their throwing elbow at a location $(0, -r)$. Here $r$ is the player’s forearm length and things are arranged so a “perfect” 90° release starts from a height $y = 0$. Recall that the kinematics derive from Newton’s second law. Assuming a constant gravitational acceleration $g$ near the earth’s surface, we want to solve the equations\n\n$$\ddot{x}(t) = 0, \qquad \ddot{y}(t) = -g.$$\n\nBecause there are no forces acting left to right (except for the neglected air drag), the $x$ equation has no acceleration term.
\nWith the above equations come initial conditions: $(x_0, y_0) = (r\cos\theta,\ r\sin\theta - r)$ and $(\dot{x}_0, \dot{y}_0)$, with the release angle $\theta$ – the angle of the player’s forearm at the moment of the throw. The angle is measured conventionally (and unintuitively here) from the forward direction, so a throw vertically upwards would have $\theta = 180°$. Computing the release velocity is a little trickier but possible with a bit of trigonometry and calculus; the magnitude is just $r\omega$, and one can show that the velocity points perpendicular to the forearm:\n\n$$(\dot{x}_0, \dot{y}_0) = r\omega\,(\sin\theta,\ -\cos\theta).$$\n
\nTo find a solution wherein we hit the bullseye, we fix $x(t^*) = L$ and $y(t^*) = 0$, with $t^*$ the time to target. These definitions specify the solution after some tedious algebra. First, the $x$ equation yields time as a simple function of release angle and angular velocity (or linear velocity $v = r\omega$):\n\n$$t^* = \frac{L - r\cos\theta}{r\omega\sin\theta}.$$\n\nIntuitively, the time of flight is the ratio of the distance traveled in the $x$ direction to $\dot{x}$. The $y$ equation generates a more complicated expression for the angular velocity:\n\n$$\omega^2 = \frac{g\,(L - r\cos\theta)^2}{2 r^2 \sin\theta \left[\, r\sin\theta\,(\sin\theta - 1) - (L - r\cos\theta)\cos\theta \,\right]}.$$\n\nThis one’s harder to interpret, but it has the correct units (s⁻²) and exhibits interesting divergences: $\omega^2 \to \infty$ as $\theta \to 180°$, as $\theta \to 90°$, etc. Note, too, that the $\theta \to 180°$ divergence differs from its $\theta \to 90°$ counterpart in that different parts of the denominator go to zero ($\sin\theta$ vs. the term in square brackets).
\nWith the solution in hand we can also find the maximum height of the dart during its flight. The path $y(t)$ is a function of $\theta$ and $\omega$, and the maximum height is an extremum of $y$: $\dot{y}(t_{\max}) = 0$. A check of the second derivative confirms this is, in fact, a maximum, and\n\n$$y_{\max} = r(\sin\theta - 1) + \frac{r^2\omega^2\cos^2\theta}{2g}.$$\n\nThe maximum depends on the starting height and also varies with the distance to be crossed by the dart.
\nFinally, we can work out an answer to my original question: how sensitive is the solution to perturbations in angle, $\delta\theta$? The answer comes from the accurate solution by computing the derivative $dy/d\theta$ at fixed $\omega$ – the angular frequency of the accurate solution. The derivative is\n\n$$\left.\frac{dy}{d\theta}\right|_{\omega} = \frac{L - r\cos\theta}{\sin^2\theta}\left[1 - \frac{g\left(r\sin^2\theta - (L - r\cos\theta)\cos\theta\right)}{r^2\omega^2\sin\theta}\right]$$\n\nand this quantity can be interpreted as the vertical distance by which the dart would miss for a small (e.g., one-degree) imperfection in the release angle. We could go on to ask similar questions about imperfections in velocity.
\nNow that we have the general solutions, let’s take a look at the results numerically. Start by making a few assignments:
\nG_EARTH <- 9.8         # m/s^2\n\nL <- 10*12 * 2.54/100  # 10 feet to the board, in meters\nr <- 14 * 2.54/100     # 14-inch forearm, in meters\nC <- 6*12 * 2.54/100   # 6 feet of ceiling clearance, in meters\n
The constant $C$ measures the ceiling height relative to the bullseye, $r$ is the forearm length, and $L$ is the horizontal distance between the player and the board. Here I’ve estimated $L = 10$ feet, $r = 14$ inches, and $C = 6$ feet of clearance. The gravitational constant is, as always, 9.8 meters per second squared. With the constants fixed, I’ve implemented the kinematic equations as R functions (sketched above) and computed them on a radian grid spanning $[0, \pi]$:
\nlibrary(tibble)\n\ncalc <- tibble(\n  theta = seq(pi, 0, -pi*1e-4),  # release angles from pi down to 0\n  theta_deg = theta * 180/pi,\n  omega = omega(L = L, r = r, theta = theta),\n  vel = r*omega,                 # linear release speed (m/s)\n  vel_mph = vel * 3600 * 100 / 2.54 / 12 / 5280,\n  time = tt(L = L, r = r, theta = theta),\n  y0 = r*(sin(theta) - 1),       # release height relative to the bullseye\n  ymax = ymax(L = L, r = r, theta = theta),\n  yratio = ymax/y0,\n  ymax_ft = ymax * 100 / 2.54 / 12,\n  hit_ceil = ymax >= C,          # does the dart graze the ceiling?\n  dy_dtheta = dydtheta(L = L, r = r, theta = theta)\n)\n
The results divide naturally into two main categories: “standard” vs. “extreme” solutions.
\nI’ve defined standard solutions as those for which (i) the dart doesn’t hit the ceiling and (ii) the required release velocity stays modest, to separate out the slightly crazier results.
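\nThe post doesn’t show how it splits the grid; here’s a sketch, where the 40 mph velocity cutoff is purely my assumption:\nlibrary(dplyr)\n\n# Split the solutions; rows with no physical solution (NaN) drop from both.\nstandard <- calc %>% filter(!hit_ceil, vel_mph <= 40)  # cutoff assumed\nextreme  <- calc %>% filter(hit_ceil | vel_mph > 40)\n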
\nWe find, intuitively, that the throw velocity must increase as the release angle approaches 90 degrees. The maximum height and air travel time always decrease as the trajectory flattens, but interestingly the required throw velocity has a minimum value. (In the figures I’m plotting vs. $180° - \theta$; $0°$ corresponds to a vertical upwards throw and $180°$ to a vertical downwards throw.)
\n\nFor more vertical throws (small $180° - \theta$) we have to throw the dart harder because much of its velocity is “wasted” traveling vertically. On the other hand, for flatter throws (larger $180° - \theta$) we need more velocity to make it to the target before gravity can pull the dart too far. We could take a derivative to find the minimum analytically, but since we’re already here we can just find it numerically:
\n# Minimize the squared angular velocity over the physical range (pi/2, pi).\nopt_value <- optimize(function(.x) omega(L, r, .x)^2, c(pi/2, pi))\n180 - opt_value$minimum * 180/pi  # report as 180 - theta, in degrees\n
## [1] 46.30904\n
The minimum is not $45°$, but slightly larger (a flatter trajectory).
\nThe first set of results not yet considered covers solutions approaching a perfectly horizontal throw. The travel time and maximum height again decrease (the maximum height decreases roughly linearly), but the throw velocity diverges as $180° - \theta \to 90°$: the dart needs to hit the bullseye before gravity pulls it off line.
\n\nFinally, there’s a slice of the solution space for which we need more vertical headroom:
\n\nOnce again we get crazy behavior with velocity, but now the time and maximum height are both extremely large – these plots are on a logarithmic scale. These solutions basically correspond to throwing a dart upwards and still managing to hit the bullseye. For some of these solutions our constant-$g$ approximation would be in big trouble!
\nFinally, what is our margin of error on dart throws? We can check by plotting $dy/d\theta$ and letting $\delta\theta$ be small, say one degree.
\n\nThis plot suggests two things: first, the throw is most forgiving – the miss distance per degree of error is smallest – near the same angle that minimizes the required velocity; and second, the sensitivity grows quickly as the release moves away from that sweet spot.
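\nTo put a number on that sweet spot (my own check, built on the sketched functions above):\n# Vertical miss, in cm, per degree of release-angle error, evaluated at\n# the minimum-velocity release angle found earlier.\ntheta_best <- opt_value$minimum\n100 * dydtheta(L, r, theta_best) * (pi/180)\n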
\nFor the setup we considered, it turns out that there is an angle near $180° - \theta \approx 46°$ for which the required velocity is a minimum. The dart throw is also most forgiving near that angle. There is also an interesting class of solutions that require fast dart throws, some of which would put the dart into orbit!
\nThis was a fun exercise – I was able to work the problem and do the computation in an evening.
\nThis week I’ve re-added an RSS feed and also created a new JSON feed. These changes are mostly motivated by a desire to try out the new JSON feed format – it’s pretty clever and was easy to implement.
", "date_published": "2017-06-25T23:30:00-04:00"}, { "id": "https://tshafer.com/blog/2017/06/installing-r-on-ec2", "url": "https://tshafer.com/blog/2017/06/installing-r-on-ec2", "title": "Installing R on EC2 with RHEL 7", "content_html": "I’ve been tasked a few times recently to stand up AWS EC2 instances as shared data science/development platforms, including the R and RStudio Server stack. (I prefer RHEL 7 for familiarity.) R depends on EPEL for installation on top of RHEL, and adding EPEL to yum
is pretty straightforward:
$ sudo yum install -y epel-release\n$ sudo yum update -y\n
Trying to sudo yum install R, however, still fails because yum cannot find the dependency texinfo-tex. It was surprisingly difficult to track down a clean solution, but a buried, not-accepted StackExchange answer has it right: texinfo-tex is listed in a disabled-by-default set of packages. Enable rhel-server-optional and we’re in business:
$ sudo yum-config-manager --enable rhui-REGION-rhel-server-optional\n$ sudo yum install -y texinfo-tex\n$ sudo yum install -y R\n
Hello – I’ve finally put in the work to resurrect this blog! I’ve been working at Elder Research for about 15 months, touching a variety of technologies, and I need a place to put odds and ends.
", "date_published": "2017-06-14T23:20:00-04:00"}, { "id": "https://tshafer.com/blog/2016/01/summit-dpac-2015", "url": "https://tshafer.com/blog/2016/01/summit-dpac-2015", "title": "Christmas at DPAC 2015", "content_html": "This was tremendously fun.\nSo fun that it feels unfair to be a part — the video is fun to watch and listen to, but it can’t capture the feeling of waiting in the wings for my cue while Branden and Molly break people’s hearts on Emmanuel.
\nIt’s also one of the precious few times each year we get to bring everyone together, playing music with friends from other campuses and making new ones, too.\nThis year was especially fun, being the only time we have played the four songs on the Carols EP live.\nSome musical highlights:
\nChristmas at DPAC 2015 from The Summit Church Sermons on Vimeo.
", "date_published": "2016-01-07T19:38:00-05:00"}, { "id": "https://tshafer.com/blog/2015/05/ten-thousand-fathers-album", "url": "https://tshafer.com/blog/2015/05/ten-thousand-fathers-album", "title": "10,000 Fathers – Invitation, Volume One", "content_html": "The 10,000 Fathers Worship School run by Aaron Keyes released a new record today, and I really like it (you could purchase it from iTunes or Amazon, listen on Spotify, or stream the first few tracks for free at their website). I think I found out about the worship school from my good friend Duane Mixon, who has a track on the record (and a good one at that), and another friend from Wilmington has been a part of the school as well.
\nKnowing a little of the heart behind the school and hearing the first track, Invitation Song, I was excited enough to preorder the record and as a result received it a week ahead of the release. For me, highlights (or at least probably my most-listened tracks) are Invitation Song (also see the accompanying video), Rend the Heavens, Love Lifted Me, and Never Ending Love. A couple of these have already racked up some impressive numbers on my iTunes play counter — they’re really good.
\nBut (surprisingly, at least to me) that’s all a bit secondary. I enjoy the record, and the music and melodies certainly do it for me, but this record has already made a mark where relatively few do. On their website announcing the launch is the line “May the deep places in your heart be awakened to His reality all around and within you.” This record has already gone some distance in making that a reality. As the first track says…
\n\n", "date_published": "2015-05-05T07:49:00-04:00"}, { "id": "https://tshafer.com/blog/2015/02/christmas-dpac", "url": "https://tshafer.com/blog/2015/02/christmas-dpac", "title": "Christmas at DPAC 2014", "content_html": "Open up our eyes to see You in the ordinary,
\n
\nWe don’t want to miss You anymore
\nOpen every eye to see every day
\nEverything is burning with the glory of the Lord.
The Summit Church, our church here in Raleigh, has held Christmas Eve services at the Durham Performing Arts Center each of the last three years, and I’ve had the great privilege of being a part of the most recent two with Summit Worship. I’ve posted below the recording from the 2014 services — if you want a good time but aren’t looking to invest an hour and a half, check out music director Branden’s piano piece (30:55), Hank Murphy and campus pastor Chris Green rapping (37:50), or pastor J.D.’s message (44:16).
\nIf the Christmas Eve program piques your interest, a few good places to poke around are Summit’s Sermons Vimeo channel, messages page, and podcast on iTunes.
\nWelcome to my blog! I really didn’t think I’d have one again, but I’ve been getting the itch recently and it was a fun programming exercise.
\nI was particularly inspired to write again by Brent Simmons and his blog Inessential — I love his short, undecorated style. It takes the pressure off. I’m not an expert in very much, and I hope the tone here reflects that. I probably won’t write often (I feel like I don’t have all that much to say), but I wanted a place that was my own on the internet again. If I do write anything here, expect posts about programming, physics, and music with some family stuff thrown in.
\nPart of the fun of blogging is coming up with a Rube Goldberg contraption for posting. To make this blog go, I wrote a blogging engine in Python. It is mostly a port of Marco Arment’s Second Crack static engine (hosted here). It uses Dropbox and inotify-tools in pretty much the same way as Second Crack to automate posting. I chose Python because I’m much more familiar with it these days (second only to Fortran) than PHP.
\nPutting the blog together was a fun weekend-long break from physics programming, and the engine is still rough around the edges. The styling needs some work, and I don’t have an RSS feed yet (Update 11/9/14: I now have an RSS feed here!). I don’t have archives yet or any kind of tagging, but they’ll come eventually. I don’t have any kind of commenting, either, and I doubt I will. These days Twitter (I’m @tomshafer) is mostly the rage.
This represents a (mostly) clean break for me. I’ve written before, but many of those posts represent the kind of writing I want to avoid. Martha and I have a lot of fun things coming in the next year or two and this seems like a good time to at least have the option of blogging. I might sneak a few old posts in at some point, but the future is the fun part.
", "date_published": "2014-09-14T23:20:00-04:00"}] }