Even smart people can accidently put bad data into a database.
We copy text from a web page, then paste it into a form, not realizing it also copied a space, tab, or newline character. Then your system thinks someone’s name is not “Jim” but “ Jim\n”.
No, the best place for your data-cleaning functions is in the database itself. So no matter what code is inserting or updating, a database trigger will sanitize it before saving.
Here’s a PostgreSQL example. Let’s make two tables, people and emails, so you can see how one function can be used by many triggers.
create table people ( id serial primary key, name text, code text ); create table emails ( id serial primary key, person_id integer not null references people(id), email text );
Let’s make two simple functions:
- Remove all whitespace, then lowercase.
- Remove unwanted whitespace characters like tab and newline. Replace them all with a single space. Then trim spaces from the front and end.
-- remove all whitespace, then lowercase it create function lower_no_space(text) returns text as $$ select lower(regexp_replace($1, '\s', '', 'g')); $$ language sql; -- replace all whitespace with single space, then trim start and end create function no_extra_space(text) returns text as $$ select btrim(regexp_replace($1, '\s+', ' ', 'g')); $$ language sql;
The function that removes all whitespace then lowercases, is good for email addresses and codes like checksums.
The function that removes extra whitespace then trims, is good for many things like names, addresses, email subjects, and anywhere that needs to retain inner spaces. So “ New \t Zealand \n” will be “New Zealand”.
Now make trigger functions that use your smaller re-usable cleaning functions. I find it best to make one trigger per table, sanitizing all fields on any insert or update.
create function clean_people() returns trigger as $$ begin new.name = no_extra_space(new.name); new.code = lower_no_space(new.code); return new; end; $$ language plpgsql; create trigger clean_people before insert or update on people for each row execute procedure clean_people(); create function clean_emails() returns trigger as $$ begin new.email = lower_no_space(new.email); return new; end; $$ language plpgsql; create trigger clean_emails before insert or update on emails for each row execute procedure clean_emails();
The language is a little verbose, but oh well. The ugliness of the boilerplate code is made up for by the beautiful simplicity of having all of this handled in the database. It’s so nice to not have to sanitize form fields! Just toss the unwashed inputs at the database.
Here, let’s give it some dirty data, and watch it come out clean.
insert into people (name, code) values (e' \t \r \n Dr. \n \r JM \t Lim \r\n', ' XX o Z ') returning *; -- id │ name │ code --────┼────────────┼────── -- 1 │ Dr. JM Lim │ xxoz insert into emails (person_id, email) values (1, e' \r\n \t DR. L @ JM Lim . com \n') returning *; -- id │ person_id │ email --────┼───────────┼──────────────── -- 1 │ 1 │ email@example.com
This approach has made my code much simpler, and data much cleaner.
Download the full code example at /code/clean1.sql.
Also look at using triggers to ensure data integrity.