Datasette Powered Investigative Journalism

Tips for using Datasette for investigative projects

Mon 14 November 2022 ~ Brandon Roberts
Datasette logo, blurry but under a magnifying glass

I hope these tips help clarify some best practices. 😏

Datasette is a uniquely amazing tool. It's not Excel and it's not Django, but it has capabilities of both. I find it particularly well suited to collaborative investigative projects involving lots of data sources and journalists with varying technical abilities.

The tool is brilliant in two ways: it takes advantage of the power and portability of SQL, the lingua franca of computing, and it's extremely easy to extend through its extensive plugin system.

Thanks to the ecosystem surrounding it, here's a list of things Datasette can do out of the box:

  • mapping
  • charts
  • advanced full-text search
  • SQL analysis
  • indexing
  • import of most data formats (even super old/obscure ones)
  • export to common formats
  • authentication

I used Datasette in three different investigative projects in 2022 alone. In one, I really hit the limits of its design. Here are some tips and best practices to enable big investigative projects.

Tip #1: Have a well-organized data collection strategy 🔗

Pre-production is the key to successful large-scale, data-centric investigations, especially when they involve managing hundreds of records requests. Roughly, my process is the following:

  • I maintain a Gmail account specifically for records requests.
  • I use Gmail labels to keep track of agencies, investigations and request response type (extension, installment, completed, etc.).
  • I maintain a shared spreadsheet to keep track of my requests both before I send them and once they're out.
  • I use FOIAMail to automate the sending and tracking of requests.
  • I create a folder for each agency and save responsive records there. (FOIAMail also does this with attachments, automatically.)

Once requested data has been delivered, I convert each file to CSV. These also go into the respective agency folder. I use git to keep track of processed versions of files and a consistent naming format to mark files as fully processed. This makes it easy for me to scan the directory and pull clean, processed CSVs.
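
As a rough illustration (my sketch, not the post's actual layout), here's what that directory scan can look like in Python; the records folder structure and the -processed.csv suffix are hypothetical naming choices:

from pathlib import Path

# Hypothetical layout: records/<agency>/<file>-processed.csv
records_root = Path("records")

for agency_dir in sorted(p for p in records_root.iterdir() if p.is_dir()):
    processed = sorted(agency_dir.glob("*-processed.csv"))
    print(agency_dir.name, "->", [f.name for f in processed])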

Careful organization will prevent data from being lost, ease onboarding of new people and help you maintain sanity a year into the project.

Tip #2: Convert responsive records to SQL tables 🔗

Once I've converted my responsive records to CSV, I organize them in a specific way:

  1. I create a SQLite database for each agency I've requested files from.
  2. I merge any CSVs that have identical or mostly identical headers.
  3. I convert each CSV to a table within each agency's DB, using csvkit.

These DB files can all be copied to my Datasette folder, ready for browsing and analysis with Datasette.
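
For reference, here's roughly how that conversion step could be scripted in Python with sqlite-utils (the companion library used later in this post) instead of csvkit; the folder layout and per-file table naming are assumptions on my part:

import csv
from pathlib import Path
from sqlite_utils import Database

agency_dir = Path("records/some-agency")   # hypothetical agency folder
db = Database(agency_dir.name + ".db")     # one SQLite DB per agency

for csv_path in sorted(agency_dir.glob("*-processed.csv")):
    with csv_path.open(newline="") as f:
        rows = list(csv.DictReader(f))
    # one table per CSV, named after the file
    db[csv_path.stem].insert_all(rows, alter=True)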

My Datasette home page - shows my database organization strategy

This is the Datasette home page for one of my investigative projects. My top-level organization is at the agency level, since each agency provided data in starkly different formats. Using this organization strategy, I can search them using the "Search all tables" plugin and perform cross-joins using the _memory virtual table.

Tip #3: When your data is too big for Datasette, try my branch 🔗

Datasette currently has one glaring problem: when you have too many databases or too many tables, it chokes and the whole system becomes painfully slow. This is due to a limitation of its design and its intended use case of publishing single-dataset web interfaces. Using it for investigations involving hundreds of agencies or thousands of files stretches it well beyond those natural limits. Thankfully, I've come up with a solution.

I maintain a set of patches, called datasette-extras, on top of Datasette's main branch. Here are some of the features they provide:

  • significantly improved page load times, with database/table limits removed
  • configurable pagination of various pages
  • a new CLI setting, --cross-dbs, which lets users select which DBs will be available for cross-DB joins
  • privacy fixes

You can use it by cloning and installing the repo here.

Tip #4: Use cross-DB queries 🔗

You don't always need to create a new table to do an analysis. Datasette has an option (--crossdb) that lets you do cross-DB queries on up to 10 databases. It does this using a built-in SQLite ability called ATTACH.

You can interact with attached databases easily using Datasette's _memory virtual database. As its name suggests, this DB lives entirely in memory and is pretty fast. Using it, you can write queries like this:

SELECT db1.names.name, db2.employers.name
FROM db1.names
JOIN db2.employers ON db1.names.id = db2.employers.name_id
WHERE db2.employers.name LIKE '%Police%';

The above example does a join on two tables in two different databases, grabbing names from db1 and cross referencing them with police employers in db2. I use this ability to do quick exploratory analyses. This enables the common investigative recipe of cross-referencing people in one database with another (e.g., police roster against a list of criminal cases).

Tip #5: Use SQLite indexes and virtual tables 🔗

I did a whole workshop about this at NICAR 2022, but basically: indexes are your friend. In SQLite, things are simple because you don't get many options when it comes to index types. If you're filtering a field in a query or trying to facet its values, put an index on it.

CREATE INDEX table_name_column_name ON table_name ( column_name );

This will create a B-tree index on a column (column_name in the above example), significantly speeding up most accesses.
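
If you'd rather create indexes from Python, the sqlite-utils library (which also shows up below for FTS) can do the same thing; the table and column names here are placeholders:

from sqlite_utils import Database

db = Database("my-database.db")
# equivalent of the CREATE INDEX statement above
db["table_name"].create_index(["column_name"])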

SQLite also has another indexing feature that I use very often in my investigations: virtual tables. A virtual table looks and queries like a regular table, but its contents are produced by code (a virtual table module) rather than stored rows, which lets modules provide special domain-specific indexes. I use SQLite's full-text search (FTS) virtual table functionality in my reporting frequently. To create an FTS table, issue the following SQL statement:

CREATE VIRTUAL TABLE fts_index_name USING FTS5 (
  [col_1], [col_2], [col_N],
  tokenize='porter',
  content=[table_name]
)

The statement above builds an FTS index with Porter stemming, enabling efficient full-text search over the text in the col_1, col_2 and col_N columns of the table_name table. It's especially good for searching people and organization names. I have a script that runs through all my databases and, via a regex, automatically marks all columns that are likely to contain names as FTS searchable.

An easier way to set up FTS is to use the sqlite-utils companion library:

from sqlite_utils import Database
# Load our SQLite database
db = Database("my-database.db")
# enable FTS for two of my-table's columns
db["my-table"].enable_fts(["column_name", "other_col"], create_triggers=True)

Passing create_triggers=True also sets up triggers so that, when my-table is modified, the FTS table is kept up to date.
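
As a rough sketch of the kind of script mentioned above (my approximation, not the post's actual code), here's how it could look with sqlite-utils; the name-matching regex and the *.db file layout are assumptions:

import re
from pathlib import Path
from sqlite_utils import Database

# hypothetical heuristic for columns likely to contain names
NAME_RE = re.compile(r"name|officer|employee|subject", re.IGNORECASE)

for db_path in Path(".").glob("*.db"):
    db = Database(db_path)
    for table in db.tables:
        if "_fts" in table.name:          # skip FTS shadow tables
            continue
        name_cols = [c.name for c in table.columns if NAME_RE.search(c.name)]
        if name_cols and not table.detect_fts():
            table.enable_fts(name_cols, create_triggers=True)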

Tip #6: Make use of Datasette's plugin system 🔗

This one's a no-brainer. For most work, the built-in Datasette browsing, filtering and SQL query interfaces are good enough. But sometimes, I need extra abilities that are too complex for a query. That's where using Datasette's plugin system really shines.

The first example is Search all, which lets me do advanced searches across all the tables and databases that have an FTS index. This is based on Simon Willison's plugin, but adds numerous features. I built it so that I could look up individuals' names across all the data I had and construct keyword searches that might help identify certain incidents.

When pre-existing plugins don't work, you should roll your own. I've built quick plugins to do all kinds of things that I need in my investigations. Here are some examples:

  • adding context to table rows
  • viewing event and person histories
  • running complex cross-referenced searches

Writing a Datasette plugin is extremely simple and the documentation is good. If you can write a simple Python function, you can write a Datasette plugin.
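
To give a sense of scale, here's roughly what a minimal one-file plugin can look like, using Datasette's documented register_routes hook; the URL pattern and message are just placeholders of mine:

from datasette import hookimpl
from datasette.utils.asgi import Response

@hookimpl
def register_routes():
    # map a URL pattern to a view function
    return [(r"^/-/hello$", hello)]

async def hello(request):
    # any async function returning a Response can serve a page
    return Response.text("Hello from a one-file Datasette plugin")

Dropping a file like this into a directory passed via --plugins-dir is enough for Datasette to pick it up.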

Tip #7: Use custom SQLite3 extensions 🔗

Unlike other database systems, SQLite3 is actually fairly limited: there are only a few data types, and the set of built-in functions is small. But there's a bright side: it's surprisingly easy to write your own SQL functions for use in your queries.

Example: I often need to compare names across data from different agencies. This usually means they're in slightly different formats. Here are some of the name formats that are common in my data that refer to the same person:

  • Roberts, Brandon X.
  • Roberts, Brandon (DOB: 19910101)
  • Roberts, B.

Doing comparisons across such columns in SQL is complicated, especially since SQLite doesn't even have regex support built-in. Fortunately, we can write a new SQLite function to use in our queries. We'll call it SAMENAME and it will compare names in multiple formats. Here's an example (samename.c):

/// samename.c written by Brandon Roberts, July 2022
// USAGE: SAMENAME(column_A, column_B)
// This SQL function assumes that both column_A and B are non-zero,
// not null and are in Last, First Middle format (ignoring case)
#if !defined(SQLITE_CORE)
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>
#ifndef SQLITE_CORE
#include "sqlite3ext.h"
SQLITE_EXTENSION_INIT1
#else
#include "sqlite3.h"
#endif

static void samenamefunc(sqlite3_context *context, int argc,
                         sqlite3_value **argv) {
  // two arguments: name1, name2
  assert(argc == 2);
  int equal = 0;

  // Convert names to char* and calculate min length
  const char *name1 = (const char *)sqlite3_value_text(argv[0]);
  const char *name2 = (const char *)sqlite3_value_text(argv[1]);

  // NOTE: null and zero length checks removed for brevity

  // get both name lengths
  size_t name1len = strlen(name1); size_t name2len = strlen(name2);

  // get the length of the shortest name
  int minlen = (name2len < name1len) ? (int)name2len : (int)name1len;

  char *n = (char *)name1; char *m = (char *)name2;
  equal = 1;
  for (int k = 0; k < minlen; k++) {
    if (tolower(*n++) != tolower(*m++)) {
      equal = 0;
      break;
    }
  }

  // return equal, 0=no match, 1=match
  sqlite3_result_int(context, equal);
}

// init the SAMENAME function for SQLite
int sqlite3samenameInit(sqlite3 *db) {
  sqlite3_create_function(db, "samename", 2,
                          SQLITE_UTF8, 0, &samenamefunc, 0, 0);
  return 0;
}

#if !SQLITE_CORE
int sqlite3_extension_init(sqlite3 *db, char **pzErrMsg,
                           const sqlite3_api_routines *pApi) {
  SQLITE_EXTENSION_INIT2(pApi)
  return sqlite3samenameInit(db);
}
#endif
#endif

// Compile with: gcc -shared -fPIC samename.c -o libsamename.so
// Run Datasette with arg: --load-extension=libsamename.so

Using the new function this SQLite extension provides, we could write a query to cross-reference names across datasets:

SELECT *
FROM DB_A.table1 A
JOIN DB_B.table2 B
ON SAMENAME(A.name, B.name)

The query above runs significantly faster than an equivalent written in pure SQL. I've used this strategy for all kinds of text re-formatting and fuzzy search problems.
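
If compiling C isn't appealing, one alternative (my sketch, not something from the original post) is to register a pure-Python stand-in for SAMENAME from a Datasette plugin via the prepare_connection hook; it's slower than the compiled extension, but needs no build step:

from datasette import hookimpl

def samename(a, b):
    # rough Python equivalent of the C SAMENAME above:
    # case-insensitive comparison over the length of the shorter name
    if not a or not b:
        return 0
    n = min(len(a), len(b))
    return 1 if a[:n].lower() == b[:n].lower() else 0

@hookimpl
def prepare_connection(conn):
    # makes SAMENAME(a, b) available in every SQL query Datasette runs
    conn.create_function("samename", 2, samename)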

Tip #8: Try immutable mode and an inspect file 🔗

One of the first pain points you're likely to hit in Datasette is an extremely slow-loading home page. As mentioned above, this can happen if you have a lot of big DBs. Before trying forks or patches, try building an inspect file.

Run Datasette with the inspect command (as opposed to the serve command, which is the default). This creates a JSON file that has statistical information on all your databases. You can run it like this:

datasette inspect *.db --inspect-file=inspect-data.json

Then, when you run Datasette in serve mode, you can include the --inspect-file=inspect-data.json argument. This can be a significant performance boost for your home, DB and table pages.

Using an inspect file will also automatically enable the second performance feature: immutable mode. This is a setting that marks databases as read-only. There's a gotcha here: certain plugins that expect DBs to be writable will break. But if Datasette can assume a DB won't change, more optimizations become available.

Caching is another huge optimization available for DBs operating in immutable mode. Using the datasette-hashed-urls plugin, you can significantly reduce the CPU load required to serve the home, table and row browsing pages.

That's it! 🔗

I've been using Datasette so much these days that it's right up there with my spreadsheet editor in terms of frequency of use. It's hard to find anything comparable for situations where I want to run a lot of different analyses and quickly share them with team members.