In the previous post I demonstrated how small typos in SQL become large logical errors, and how a “proper” type system could help in preventing these. This time I explore how PostgreSQL’s enum data type can be leveraged to gain some data safety and prevent logical errors.
PostgreSQL’s ENUM
In PostgreSQL the enum data type offers some type safety, for example:
CREATE TYPE d1 AS ENUM ('1', '2', '3'); CREATE TYPE d2 AS ENUM ('1', '2', '3');
Enumerated values are labels, hence the quotation marks.
Equality
-- compare as default type -- select '1' = '1' as bool; bool ------ t -- compare as two same Enum types -- select '1'::d1 = '1'::d1 as bool; bool ------ t -- compare as two different Enum types -- select '1'::d1 = '1'::d2 as bool; ERROR: operator does not exist: d1 = d2 LINE 1: select '1'::d1 = '1'::d2 as bool; ^
Again, this is a good error to get. But notice that the following still works:
select '1'::d1 = '1' as bool; bool Boolean -------- t
Why did this work? Take a look at this:
select '1' as x; x unknown -------- 1
A label out of the context of an expression is of an unknown type. An implicit type cast (conversion) rule — in the previous example — converted the unknown type to the type d1 based on what was expected in the expression. If I explicitly specify the type of the second label, the error is back.
select '1'::d1 = '1'::text as bool; ERROR: operator does not exist: d1 = text LINE 1: select '1'::d1 = '1'::text as bool; ^
The equals operator worked fine, as in the Haskell example from the previous post.
Functions and Operators
To make this a bit more realistic, I’ll use country codes for the next example; say cc2 is a domain of county codes.
CREATE TYPE cc2 AS ENUM ('AU', 'BE', 'CA', 'FR', 'US');
First, let’s see what happens if I treat country codes as text (characters) and use a string operator, say concatenation.
SELECT 'AU'::char(2) || 'BE'::char(2) as ctr ; ctr text ------ AUBE
No surprises here, all looks as expected. But, what does it mean to concatenate two country codes? Did these two countries merge? Did they sign some kind of political alliance? Maybe they signed a beer-drinking trade agreement; if so, how do we join?
The bottom line is that — logically — country code is not a string, but a type; hence concatenation does not make sense. In other words, this is a logical error.
However, if I treat country codes as the enum type:
SELECT ('AU'::cc2) || ('CA'::cc2) as ctr; ERROR: operator does not exist: cc2 || cc2 LINE 1: SELECT ('AU'::cc2) || ('CA'::cc2) as ctr; ^
In other words, there is no concatenation operator for this data type; and that is a good thing.
It is important to realise that in this case the type system prevented a logical error. Also, this error would be caught early — by a programmer or analyst while developing a query — and would not propagate into production nor reports.
More Goodies
It gets even better: consider two money-transaction tables, where one defines currency code as a character type, the other one as an enum; both have the same data. Check out the DDL with some data; I will just sketch it here:
currency = {'USD', 'CAD', 'EUR', 'GBP'} T1 { TX int PK , AMT decimal(19,2) , CUR char(3) -- <- focus here } T2 { TX int PK , AMT decimal(19,2) , CUR currency -- <- focus here } -- same data for both tables T_ (TX, AMT, CUR ) {( 1, 10.0, 'USD') ,( 2, 13.0, 'CAD') ,( 3, 9.0, 'EUR') }
Note that in the sketch, the currency type is represented as a set of all possible values: logically a domain.
Now, consider an analyst wanting to count a number of transactions in Euros, but making a typo in the currency name. Two examples: the first one for currency treated as text, and the second one for the enumerated type.
-- T1.CUR defined as a string SELECT count(1) as cnt FROM T1 WHERE CUR = 'ERU' ; -- <- typo cnt ----- 0 -- T2.CUR defined as an enum SELECT count(1) as cnt FROM T2 WHERE CUR = 'ERU' ; -- <- typo ERROR: invalid input value for enum currency: "ERU" LINE 3: WHERE CUR = 'ERU' ;
In the second example the type check results in an error, which is much better than getting a wrong result. You may argue that the first result is OK too, in that case consider this:
-- foolish SELECT sum(AMT) as amt_ , max(CUR) as cur_ FROM T1 WHERE CUR like '%D' ; amt_ cur_ ----------- 23.00 USD
So, it added amounts of US and Canadian dollars, and reported the total in USD. Although this is the result of the foolish query, consider what happens when I try it on the table T2:
-- foolish SELECT sum(AMT) as amt_ , max(CUR) as cur_ FROM T2 WHERE CUR like '%D' ; ERROR: operator does not exist: currency ~~ unknown LINE 4: WHERE CUR like '%D' ; ^
Much better, saved by the type system.
Summary
My argument is that some type safety is better than none. Hence, whenever there is a finite list of standardised codes, it is a good candidate for an enumerated type.
Unfortunately, all this reasoning does not apply to MySQL: although it does have data type enum, the scope is a singe-table column and its primary purpose is compact data storage.
Further Reading
By now it should be obvious that a type system has something to do with logic; after all, as seen in previous examples, it prevents logical errors.
Take a good look at this:
Γ ⊢ B → A Δ ⊢ B ----------------- (e1) Γ,Δ ⊢ A Γ ⊢ t:B → A Δ ⊢ u:B --------------------- (e2) Γ,Δ ⊢ t(u):A
The first rule (e1) is modus ponens in Gentzen's natural deduction; the second rule (e2) is function application in typed lambda calculus.
From this point on I can only "point a finger to the Moon" and recommend two -- nicely written and readable -- papers by P. Wadler: [Wad00] and [Wad14].
Have fun.