Log Connection PIDs During Exploration

September 13, 2022 at 7 AM

Here’s a trick I find useful during exploratory analysis and feature engineering; really, whenever I’m querying against database servers I don’t control: Log the connection PID at query time.

It happens pretty often to me during exploratory analysis that I launch a query and then, either right away or after the query begins to drag on longer than expected, wish I could cancel the query to edit it and try again. Maybe I forgot to put a filter into the query, and it’s about to return way too much data, or maybe the database is underpowered and I’d rather extract a smaller result for analysis.

Often there isn’t a good way to stop a query from within RStudio/VS Code when the query is directed to a remote database server. Unless we want to wait for the query to finish and return control of the IDE to us, maybe the best we can do is to restart the R session or Jupyter kernel and start another query. If we log the connection PID, though, we get another option: We can open another session, instead of killing this one, and ask the database server to cancel any running queries on that PID.

This is really easy to do, both in Python and R, and most drivers have a way to get the PID pretty easily. I’ve been using Postgres/Redshift a lot recently, so I’ll use it as an example. With Python and psycopg2, we can call get_backend_pid() to log the PID:

log.info(f"Query PID = {conn.get_backend_pid()}")
#> Query PID = 948

Or, with R and DBI we can call dbGetInfo():

message("Query PID = ", DBI::dbGetInfo(conn)$pid)
#> Query PID = 950

Once we have our PID, it’s really easy to open up a new session and ask the database to cancel the running query:

SELECT pg_cancel_backend(<PID>)

Because this is just SQL, we can run this query anywhere and recover control of our IDE session. Simple and useful.