Derek Sivers

Database triggers to clean text inputs

2022-03-01

Even smart people can accidently put bad data into a database.

We copy text from a web page, then paste it into a form, not realizing it also copied a space, tab, or newline character. Then your system thinks someone’s name is not “Jim” but “ Jim\n”.

You could use JavaScript to sanitize all form inputs, but what about when you import a CSV file, or get data from an API?

No, the best place for your data-cleaning functions is in the database itself. So no matter what code is inserting or updating, a database trigger will sanitize it before saving.

Here’s a PostgreSQL example. Let’s make two tables, people and emails, so you can see how one function can be used by many triggers.

create table people (
  id serial primary key,
  name text,
  code text
);

create table emails (
  id serial primary key,
  person_id integer not null references people(id),
  email text
);
download code

Let’s make two simple functions:

  1. Remove all whitespace, then lowercase.
  2. Remove unwanted whitespace characters like tab and newline. Replace them all with a single space. Then trim spaces from the front and end.
-- remove all whitespace, then lowercase it
create function lower_no_space(text) returns text as $$
  select lower(regexp_replace($1, '\s', '', 'g'));
$$ language sql;

-- replace all whitespace with single space, then trim start and end
create function no_extra_space(text) returns text as $$
  select btrim(regexp_replace($1, '\s+', ' ', 'g'));
$$ language sql;
download code

The function that removes all whitespace then lowercases, is good for email addresses and codes like checksums.

The function that removes extra whitespace then trims, is good for many things like names, addresses, email subjects, and anywhere that needs to retain inner spaces. So “ New \t Zealand \n” will be “New Zealand”.

Now make trigger functions that use your smaller re-usable cleaning functions. I find it best to make one trigger per table, sanitizing all fields on any insert or update.

create function clean_people() returns trigger as $$
begin
  new.name = no_extra_space(new.name);
  new.code = lower_no_space(new.code);
  return new;
end;
$$ language plpgsql;
create trigger clean_people
before insert or update on people
for each row execute procedure clean_people();

create function clean_emails() returns trigger as $$
begin
  new.email = lower_no_space(new.email);
  return new;
end;
$$ language plpgsql;
create trigger clean_emails
before insert or update on emails
for each row execute procedure clean_emails();
download code

The language is a little verbose, but oh well. The ugliness of the boilerplate code is made up for by the beautiful simplicity of having all of this handled in the database. It’s so nice to not have to sanitize form fields! Just toss the unwashed inputs at the database.

Here, let’s give it some dirty data, and watch it come out clean.

insert into people (name, code) values (e' \t \r \n Dr.  \n \r JM \t Lim \r\n', '   XX o Z ') returning *;
-- id │    name    │ code 
--────┼────────────┼──────
--  1 │ Dr. JM Lim │ xxoz

insert into emails (person_id, email) values (1, e' \r\n \t DR. L @ JM Lim . com \n') returning *;
-- id │ person_id │     email      
--────┼───────────┼────────────────
--  1 │         1 │ dr.l@jmlim.com
download code

This approach has made my code much simpler, and data much cleaner.

Download the full code example at /code/clean1.sql.

Also look at using triggers to ensure data integrity.