Contradiction in Database [3]

MySQL Type Horrors may have been a more appropriate title. Although it is possible to demonstrate the problem in another DB system, I have deliberately chosen MySQL for this article because it literally lends itself to this kind of errors.

Example

A small, but powerful example. Just by looking at this you may guess that this post may have something to do with case sensitivity, but as you will see the problems go much deeper. Here is the MySQL code for the example.

  T { rw integer
    , A  varchar(32)
    , B  blob
    }
KEY {rw}


 T (rw,   A  ,   B  )
 { ( 1, 'ABC', 'ABC')
 , ( 2, 'abc', 'abc')
 }  

For the example I have created a fresh instance of MySQL (v 5.6.27) on Amazon RDS, using default settings. Note that the collation is case and accent insensitive.

select @@collation_database as cdb;

cdb
------------------
latin1_swedish_ci

Trouble

Let’s see what equals to what in table T.

select rw from T where A = B;

rw
--
1
2

Therefore I conclude that:

A1 = B1 
A2 = B2

For simplicity, I am using “Excel notation” to refer to values at row-column intersections in the table.

To check for equality in column A, I simply count number of distinct values in the column.

select count(distinct A) as cnt_a from T;

cnt_a
-----
  1

There is only one distinct value in column A, therefore:

A1 = A2

Based on these two conclusions:

B1 = A1 = A2 = B2

How about column B?

select count(distinct B) as cnt_b from T;

cnt_b
-----
  2

Given that there are two distinct values in column B it is obvious that:

B1 <> B2

If I put this together, the following holds:

B1 = A1 = A2 = B2
B1 <> B2

Interestingly, both statements are true in table T. And we have the contradiction again. I can formalize this a bit. Given a proposition P stating that (B1 = B2), the following holds:

X,Y ∈ T
X = (B1 = A1 = A2 = B2)
Y = (B1 <> B2)

P = (B1 = B2)

⊢ X → P  T ⊢ X       ⊢ Y → ¬P  T ⊢ Y
-------------- (e1)  ---------------- (e2)
T ⊢ P                T ⊢ ¬P
------------------------------------- (i1)
T ⊢ (P ∧ ¬P)

A contradiction leads to nonsense — as previously demonstrated — so, a query that includes table T is likely to return a wrong result.

It Gets Worse

Take a good look at these three queries:

Query Result
rw
select rw from T
where A = B;
1
2
select rw from T
where upper(A) = upper(B);
1
select rw from T
where lower(A) = lower(B);
2

This is a head-scratcher. What happens is weird: upper() and lower() do not work on blob, binary, and varbinary types. Instead of returning an error, they silently return the input value of the function. Duh. To be fair, it is in the manual; but really?

Equality

Time to revisit the definition and properties of equality.

Seems that it was Leibniz who rigorously defined equality; these four properties actually became theorems:

Property For any Must be True
Reflexive a a = a
Symmetry a,b if a = b then b = a
Transitive a,b,c if a = b and b = c then a = c
Substitution a,b,f(x) if a = b then f(a) = f(b)

Given the example, how are we doing some 300 years later?

Property Note
Reflexive ok
Symmetry ok
Transitive broken, first example
Substitution broken, second example

Summary

Using the small example — one table with three columns and two rows — I have managed to break two out of four properties of equality. A contradiction ensued. Not good, but this is the state of the industry.

Avoid using blobs, binary, and varbinary types for strings. Even if developers and DBAs are certain that they can handle nuances of an implicit type conversion, it is not likely that analysts will be able to do so. As time goes by, databases grow, and people come and go: errors from this kind of problems are inevitable.

Contradiction in Database [2]

The previous post focused on a contradiction in general; in this one I present a specific example using PostgreSQL.

Example

Starting with a predicate:
[p_1] Country named (CNAME) is assigned country code (CC2) and number (CID).

Country {CID, CC2, CNAME}
    KEY {CID}

Country (CID, CC2,  CNAME   )
      { (1,  'CA', 'Canada' )
      , (2,  'FR', 'France' )
      , (3,  'DE', 'Germany')
      , (4,  'DE', 'Denmark') -- typo
      }

Take a look at the sketch and the DLL. This is a common find: the ID is used as a primary key and there are no other constraints.

Consider the tuple with CID = 4; it has a wrong country code for Denmark, which results in a contradiction. From tuples with CID = 3 and CID = 4 it follows:

  • Country code DE is assigned to country named Germany.
  • Country code DE is assigned to country named Denmark.

In this case the contradiction is a result of the typo and a (missing) constraint. The constraint is essentially a part of an ISO standard, a generally accepted fact.

[c1.1] Each country code is assigned to exactly one country.

The idea is to see how can this be represented as a query. Again, let’s name the database Γ, and show that it contains a contradiction.

A = Code DE is assigned to Germany.
B = Code DE is assigned to Denmark.


        B ⊢ ¬A   Γ ⊢ B
        ----------------- (e1)
Γ ⊢ A   Γ ⊢ ¬A
------------------------- (i1)
Γ ⊢ (A ∧ ¬A)

The premise B ⊢ ¬A is consequence of the constraint c1.1.
As per the rule i1, it is possible to conclude that both A and (¬A) are true.

Lilliput

As explained in the previous post: “From contradiction anything follows”. Let’s see if I can prove — from my database — something from Gulliver’s travels, e.g. that country Lilliput exists.

A ∈ Γ, N ∉ Γ
A = Code DE is assigned to Germany.
N = Code DE is assigned to Lilliput.


              Γ ⊢ A            
------ (e1)   ---------- (i2)
Γ ⊢ ¬A        Γ ⊢ (A ∨ N)       
------------------------ (e4)
Γ ⊢ N

The following query is self-explanatory; query labels match rule names. The output of the query lists all countries in the database.

-- [c1.1] Each country code 
-- is assigned to exactly one country.
--
with
i2 as (
    select CC2
    from Country
    where (CC2 = 'DE' and CNAME = 'Germany')
       or (CC2 = 'DE' and CNAME = 'Lilliput')
),
e1 as (
    select CC2
    from Country
    where (CC2 = 'DE' and CNAME != 'Germany')
),
e4 as (
    select CC2, 'Lilliput'::text as CNAME
    from i2
    join e1 using (CC2)
)
select CNAME
from Country
union
select CNAME
from e4 
order by CNAME
;

The result may be interpreted using the predicate:
[p_2] Country named (CNAME) exists.

CNAME
--------
Canada
Denmark
France
Germany
Lilliput

You can actually hear a VP of marketing: “Lilliput, almost forgot about it. Do they use Euros? We have to do a campaign over there ASAP.”

Foolish

“But this is just a result of a foolish query. Why would anyone write a query like that?”, you say. Let’s agree on that — for the time being — and see what happens if I fix the data.

-- fix the typo
update Country set CC2 = 'DK'
where CNAME = 'Denmark' ;

-- add constraints
alter table Country
  add constraint ak1_cn UNIQUE (CC2)
, add constraint ak2_cn UNIQUE (CNAME) ;

Let’s run that foolish query again:

CNAME
--------
Canada
Denmark
France
Germany

Not so Foolish

Interestingly, fixing the data returned the correct result. Time to revisit the concept of a sound deductive (logical) system.

A deductive system is sound if any sentence that is provable in that system is true.

In other words, if my database and a reasoning mechanism form a sound logical system, then I should not be able to conclude falsehood.

Not so Simple

It may be tempting to think that contradiction may be avoided by carefully crafting queries, but it is not that simple.

SQL is a declarative, fourth-generation programming language. The idea is to specify what to do, not how to do it.

A query is passed to the query processor. The processor takes a declarative query, validates the query, optimizes it, converts it to an executable data-flow plan, and executes the plan.

Parse --> Rewrite --> Optimize --> Execute

The query parser validates the query, resolves names, converts it to a format used by the rewrite and the optimizer, and checks user authorization.

The query rewrite simplifies queries without changing the semantics (meaning). It expands views, simplifies arithmetic expressions, logically rewrites predicates in the where clause, semantically optimizes joins, and flattens sub-queries.

The query optimizer produces an executable data-flow plan. It has to deal with selectivity estimates, data distribution histograms, search algorithms, parallelism, and auto-tuning.

The query executor takes a query plan data-flow and executes it.

Bottom line: it is a long way between the declarative query and the way it is actually executed; and that is a good thing.

Not Only SQL

Although this example was about SQL, note that the reasoning applies to any higher-level-of-abstraction language.

Imagine a long and complicated mathematical expression. First you check out if it makes sense (parse); then try to simplify it using some math rules (rewrite); figure out how to calculate it (optimize), e.g.: first calculate numerator, then denominator, then divide; and finally plug in the numbers (execute).

Consider Apache Spark: it started with RDDs only, then added data-frames and SQL; take a look at the optimizer and note the logical optimization section.

Reasoning Machine

There is a “reasoning machine” between your data and a query result based on that data. In general, that machine may be extended to include you and all the code you write; also, other developers, analysts and all the code they write.

If a database contains a contradiction and the reasoning machinery gets hold of it, what comes out is anybody’s guess.

A metaphor, but a good one.

Contradiction in Database [1]

Suppose you have a database, of any type: SQL, NoSQL, not-yet-SQL, a spreadsheet, a bunch of CSV files, or a shoebox full of post-it notes.

I would argue that the database represents a bag (multiset) of facts; even if it does not look like that, it can be verbalized as such.

Let’s name this bag of facts: Γ.

Consider a case when two of these facts contradict each other.
Suppose that in Γ we find a fact: today is Monday; then another fact that states: today is Friday. Let’s see if I can formalize this:

A = Today is Monday.
B = Today is Friday.


        B ⊢ ¬A   Γ ⊢ B
        ----------------- (e1)
Γ ⊢ A   Γ ⊢ ¬A
------------------------- (i1)
Γ ⊢ (A ∧ ¬A)

Rule e1 states that B entails (not A) and given that Γ entails B, from Γ we can conclude that (not A) is true. Rule i1 states that Γ entails A is true and (not A) is true, so we conclude that (A and not A) is true.

According to the database, both propositions — today is Monday, and today is not Monday — are true; hence the contradiction.

Nonsense

There is a problem with contradiction and reasoning. Let’s introduce a proposition, a pure nonsense: pigs can fly.

A,B ∈ Γ
A = Today is Monday.
B = Today is Friday.
N = Pigs can fly.


Γ ⊢ B      B ⊢ ¬A
---------------------- (e1)
Γ ⊢ ¬A

           Γ ⊢ A            
           ----------- (i2)
Γ ⊢ ¬A     Γ ⊢ (A ∨ N)       
---------------------- (e4)
Γ ⊢ N

Rule i2 states that given that Γ entails A is true, we can conclude that (A or N) is true; this is simple: (true or x) evaluates to true.
According to the rule e4, Γ entails A is false, and (A or N) is true; hence from Γ we conclude that N must be true.

Note that initially the proposition N did not come from Γ, but the conclusion that N is true does.

Summary

In general, for a database Γ having a proposition (fact) A that can be considered true and false at the same time, the following holds:

A ∈ Γ, Γ ⊢ (A ∧ ¬A)
A = Any fact.
N = Any nonsense.


                   Γ ⊢ (A ∧ ¬A)
                   ----------- (e2)
Γ ⊢ (A ∧ ¬A)       Γ ⊢ A            
----------- (e3)   ----------- (i2)
Γ ⊢ ¬A             Γ ⊢ (A ∨ N)       
------------------------------ (e4)
Γ ⊢ N

This is called the principle of explosion, and is usually stated as: “from contradiction anything follows”. I prefer a bit more dramatic version:

Contradiction leads to nonsense.

How does a contradiction enter a database? Is it possible to prevent it? What are the consequences, if any? About all that, in the remainder of this series.

Types and Typos in SQL [2]

In the previous post I demonstrated how small typos in SQL become large logical errors, and how a “proper” type system could help in preventing these. This time I explore how PostgreSQL’s enum data type can be leveraged to gain some data safety and prevent logical errors.

PostgreSQL’s ENUM

In PostgreSQL the enum data type offers some type safety, for example:

CREATE TYPE d1 AS ENUM
('1', '2', '3');

CREATE TYPE d2 AS ENUM
('1', '2', '3');

Enumerated values are labels, hence the quotation marks.

Equality

-- compare as default type
--
select '1' = '1' as bool;

 bool
------
  t


-- compare as two same Enum types
--
select '1'::d1 = '1'::d1 as bool;

 bool
------
  t


-- compare as two different Enum types
--
select '1'::d1 = '1'::d2 as bool;

ERROR:  operator does not exist: d1 = d2
LINE 1: select '1'::d1 = '1'::d2 as bool;
                       ^

Again, this is a good error to get. But notice that the following still works:

select '1'::d1 = '1' as bool;

 bool
Boolean
--------
   t

Why did this work? Take a look at this:

select '1' as x;

   x
unknown
--------
   1

A label out of the context of an expression is of an unknown type. An implicit type cast (conversion) rule — in the previous example — converted the unknown type to the type d1 based on what was expected in the expression. If I explicitly specify the type of the second label, the error is back.

select '1'::d1 = '1'::text as bool;

ERROR:  operator does not exist: d1 = text
LINE 1: select '1'::d1 = '1'::text as bool;
                       ^

The equals operator worked fine, as in the Haskell example from the previous post.

Functions and Operators

To make this a bit more realistic, I’ll use country codes for the next example; say cc2 is a domain of county codes.

CREATE TYPE cc2 AS ENUM
('AU', 'BE', 'CA', 'FR', 'US');

First, let’s see what happens if I treat country codes as text (characters) and use a string operator, say concatenation.

SELECT 'AU'::char(2) || 'BE'::char(2) as ctr ;

 ctr
 text
------
 AUBE

No surprises here, all looks as expected. But, what does it mean to concatenate two country codes? Did these two countries merge? Did they sign some kind of political alliance? Maybe they signed a beer-drinking trade agreement; if so, how do we join?

The bottom line is that — logically — country code is not a string, but a type; hence concatenation does not make sense. In other words, this is a logical error.

However, if I treat country codes as the enum type:

SELECT ('AU'::cc2) || ('CA'::cc2) as ctr;

ERROR:  operator does not exist: cc2 || cc2
LINE 1: SELECT ('AU'::cc2) || ('CA'::cc2) as ctr;
                           ^

In other words, there is no concatenation operator for this data type; and that is a good thing.
It is important to realise that in this case the type system prevented a logical error. Also, this error would be caught early — by a programmer or analyst while developing a query — and would not propagate into production nor reports.

More Goodies

It gets even better: consider two money-transaction tables, where one defines currency code as a character type, the other one as an enum; both have the same data. Check out the DDL with some data; I will just sketch it here:

currency = {'USD', 'CAD', 'EUR', 'GBP'}

T1 { TX  int  PK
   , AMT decimal(19,2)
   , CUR char(3)  -- <- focus here
   }

T2 { TX  int  PK
   , AMT decimal(19,2)
   , CUR currency -- <- focus here
   }

-- same data for both tables
T_  (TX,  AMT,  CUR )
   {( 1, 10.0, 'USD')
   ,( 2, 13.0, 'CAD')
   ,( 3,  9.0, 'EUR')
   }

Note that in the sketch, the currency type is represented as a set of all possible values: logically a domain.

Now, consider an analyst wanting to count a number of transactions in Euros, but making a typo in the currency name. Two examples: the first one for currency treated as text, and the second one for the enumerated type.

-- T1.CUR defined as a string
SELECT count(1) as cnt
FROM T1 
WHERE CUR = 'ERU' ; -- <- typo

 cnt
-----
  0


-- T2.CUR defined as an enum
SELECT count(1) as cnt
FROM T2 
WHERE CUR = 'ERU' ; -- <- typo

ERROR:  invalid input value for enum currency: "ERU"
LINE 3: WHERE CUR = 'ERU' ;

In the second example the type check results in an error, which is much better than getting a wrong result. You may argue that the first result is OK too, in that case consider this:

-- foolish
SELECT sum(AMT) as amt_
     , max(CUR) as cur_
FROM T1
WHERE CUR like '%D' ;

amt_   cur_
-----------
23.00  USD

So, it added amounts of US and Canadian dollars, and reported the total in USD. Although this is the result of the foolish query, consider what happens when I try it on the table T2:

-- foolish
SELECT sum(AMT) as amt_
     , max(CUR) as cur_
FROM T2
WHERE CUR like '%D' ;

ERROR:  operator does not exist: currency ~~ unknown
LINE 4: WHERE CUR like '%D' ;
                  ^

Much better, saved by the type system.

Summary

My argument is that some type safety is better than none. Hence, whenever there is a finite list of standardised codes, it is a good candidate for an enumerated type.

Unfortunately, all this reasoning does not apply to MySQL: although it does have data type enum, the scope is a singe-table column and its primary purpose is compact data storage.

Further Reading

By now it should be obvious that a type system has something to do with logic; after all, as seen in previous examples, it prevents logical errors.

Take a good look at this:

Γ ⊢ B → A   Δ ⊢ B 
----------------- (e1)
Γ,Δ ⊢ A


Γ ⊢ t:B → A   Δ ⊢ u:B 
--------------------- (e2)
Γ,Δ ⊢ t(u):A

The first rule (e1) is modus ponens in Gentzen's natural deduction; the second rule (e2) is function application in typed lambda calculus.

From this point on I can only "point a finger to the Moon" and recommend two -- nicely written and readable -- papers by P. Wadler: [Wad00] and [Wad14].

Have fun.