Contradiction in Database [1]

Suppose you have a database, of any type: SQL, NoSQL, not-yet-SQL, a spreadsheet, a bunch of CSV files, or a shoebox full of post-it notes.

I would argue that the database represents a bag (multiset) of facts; even if it does not look like that, it can be verbalized as such.

Let’s name this bag of facts: Γ.

Consider a case when two of these facts contradict each other.
Suppose that in Γ we find a fact: today is Monday; then another fact that states: today is Friday. Let’s see if I can formalize this:

A = Today is Monday.
B = Today is Friday.


        B ⊢ ¬A   Γ ⊢ B
        ----------------- (e1)
Γ ⊢ A   Γ ⊢ ¬A
------------------------- (i1)
Γ ⊢ (A ∧ ¬A)

Rule e1 states that B entails (not A) and given that Γ entails B, from Γ we can conclude that (not A) is true. Rule i1 states that Γ entails A is true and (not A) is true, so we conclude that (A and not A) is true.

According to the database, both propositions — today is Monday, and today is not Monday — are true; hence the contradiction.

Nonsense

There is a problem with contradiction and reasoning. Let’s introduce a proposition, a pure nonsense: pigs can fly.

A,B ∈ Γ
A = Today is Monday.
B = Today is Friday.
N = Pigs can fly.


Γ ⊢ B      B ⊢ ¬A
---------------------- (e1)
Γ ⊢ ¬A

           Γ ⊢ A            
           ----------- (i2)
Γ ⊢ ¬A     Γ ⊢ (A ∨ N)       
---------------------- (e4)
Γ ⊢ N

Rule i2 states that given that Γ entails A is true, we can conclude that (A or N) is true; this is simple: (true or x) evaluates to true.
According to the rule e4, Γ entails A is false, and (A or N) is true; hence from Γ we conclude that N must be true.

Note that initially the proposition N did not come from Γ, but the conclusion that N is true does.

Summary

In general, for a database Γ having a proposition (fact) A that can be considered true and false at the same time, the following holds:

A ∈ Γ, Γ ⊢ (A ∧ ¬A)
A = Any fact.
N = Any nonsense.


                   Γ ⊢ (A ∧ ¬A)
                   ----------- (e2)
Γ ⊢ (A ∧ ¬A)       Γ ⊢ A            
----------- (e3)   ----------- (i2)
Γ ⊢ ¬A             Γ ⊢ (A ∨ N)       
------------------------------ (e4)
Γ ⊢ N

This is called the principle of explosion, and is usually stated as: “from contradiction anything follows”. I prefer a bit more dramatic version:

Contradiction leads to nonsense.

How does a contradiction enter a database? Is it possible to prevent it? What are the consequences, if any? About all that, in the remainder of this series.

Types and Typos in SQL [2]

In the previous post I demonstrated how small typos in SQL become large logical errors, and how a “proper” type system could help in preventing these. This time I explore how PostgreSQL’s enum data type can be leveraged to gain some data safety and prevent logical errors.

PostgreSQL’s ENUM

In PostgreSQL the enum data type offers some type safety, for example:

CREATE TYPE d1 AS ENUM
('1', '2', '3');

CREATE TYPE d2 AS ENUM
('1', '2', '3');

Enumerated values are labels, hence the quotation marks.

Equality

-- compare as default type
--
select '1' = '1' as bool;

 bool
------
  t


-- compare as two same Enum types
--
select '1'::d1 = '1'::d1 as bool;

 bool
------
  t


-- compare as two different Enum types
--
select '1'::d1 = '1'::d2 as bool;

ERROR:  operator does not exist: d1 = d2
LINE 1: select '1'::d1 = '1'::d2 as bool;
                       ^

Again, this is a good error to get. But notice that the following still works:

select '1'::d1 = '1' as bool;

 bool
Boolean
--------
   t

Why did this work? Take a look at this:

select '1' as x;

   x
unknown
--------
   1

A label out of the context of an expression is of an unknown type. An implicit type cast (conversion) rule — in the previous example — converted the unknown type to the type d1 based on what was expected in the expression. If I explicitly specify the type of the second label, the error is back.

select '1'::d1 = '1'::text as bool;

ERROR:  operator does not exist: d1 = text
LINE 1: select '1'::d1 = '1'::text as bool;
                       ^

The equals operator worked fine, as in the Haskell example from the previous post.

Functions and Operators

To make this a bit more realistic, I’ll use country codes for the next example; say cc2 is a domain of county codes.

CREATE TYPE cc2 AS ENUM
('AU', 'BE', 'CA', 'FR', 'US');

First, let’s see what happens if I treat country codes as text (characters) and use a string operator, say concatenation.

SELECT 'AU'::char(2) || 'BE'::char(2) as ctr ;

 ctr
 text
------
 AUBE

No surprises here, all looks as expected. But, what does it mean to concatenate two country codes? Did these two countries merge? Did they sign some kind of political alliance? Maybe they signed a beer-drinking trade agreement; if so, how do we join?

The bottom line is that — logically — country code is not a string, but a type; hence concatenation does not make sense. In other words, this is a logical error.

However, if I treat country codes as the enum type:

SELECT ('AU'::cc2) || ('CA'::cc2) as ctr;

ERROR:  operator does not exist: cc2 || cc2
LINE 1: SELECT ('AU'::cc2) || ('CA'::cc2) as ctr;
                           ^

In other words, there is no concatenation operator for this data type; and that is a good thing.
It is important to realise that in this case the type system prevented a logical error. Also, this error would be caught early — by a programmer or analyst while developing a query — and would not propagate into production nor reports.

More Goodies

It gets even better: consider two money-transaction tables, where one defines currency code as a character type, the other one as an enum; both have the same data. Check out the DDL with some data; I will just sketch it here:

currency = {'USD', 'CAD', 'EUR', 'GBP'}

T1 { TX  int  PK
   , AMT decimal(19,2)
   , CUR char(3)  -- <- focus here
   }

T2 { TX  int  PK
   , AMT decimal(19,2)
   , CUR currency -- <- focus here
   }

-- same data for both tables
T_  (TX,  AMT,  CUR )
   {( 1, 10.0, 'USD')
   ,( 2, 13.0, 'CAD')
   ,( 3,  9.0, 'EUR')
   }

Note that in the sketch, the currency type is represented as a set of all possible values: logically a domain.

Now, consider an analyst wanting to count a number of transactions in Euros, but making a typo in the currency name. Two examples: the first one for currency treated as text, and the second one for the enumerated type.

-- T1.CUR defined as a string
SELECT count(1) as cnt
FROM T1 
WHERE CUR = 'ERU' ; -- <- typo

 cnt
-----
  0


-- T2.CUR defined as an enum
SELECT count(1) as cnt
FROM T2 
WHERE CUR = 'ERU' ; -- <- typo

ERROR:  invalid input value for enum currency: "ERU"
LINE 3: WHERE CUR = 'ERU' ;

In the second example the type check results in an error, which is much better than getting a wrong result. You may argue that the first result is OK too, in that case consider this:

-- foolish
SELECT sum(AMT) as amt_
     , max(CUR) as cur_
FROM T1
WHERE CUR like '%D' ;

amt_   cur_
-----------
23.00  USD

So, it added amounts of US and Canadian dollars, and reported the total in USD. Although this is the result of the foolish query, consider what happens when I try it on the table T2:

-- foolish
SELECT sum(AMT) as amt_
     , max(CUR) as cur_
FROM T2
WHERE CUR like '%D' ;

ERROR:  operator does not exist: currency ~~ unknown
LINE 4: WHERE CUR like '%D' ;
                  ^

Much better, saved by the type system.

Summary

My argument is that some type safety is better than none. Hence, whenever there is a finite list of standardised codes, it is a good candidate for an enumerated type.

Unfortunately, all this reasoning does not apply to MySQL: although it does have data type enum, the scope is a singe-table column and its primary purpose is compact data storage.

Further Reading

By now it should be obvious that a type system has something to do with logic; after all, as seen in previous examples, it prevents logical errors.

Take a good look at this:

Γ ⊢ B → A   Δ ⊢ B 
----------------- (e1)
Γ,Δ ⊢ A


Γ ⊢ t:B → A   Δ ⊢ u:B 
--------------------- (e2)
Γ,Δ ⊢ t(u):A

The first rule (e1) is modus ponens in Gentzen's natural deduction; the second rule (e2) is function application in typed lambda calculus.

From this point on I can only "point a finger to the Moon" and recommend two -- nicely written and readable -- papers by P. Wadler: [Wad00] and [Wad14].

Have fun.

Types and Typos in SQL [1]

Almost anyone who has spent time working with SQL made — or had to fix — this kind of bug:

-- two tables (sketch)
--
users {id, user_name}

trans {id, amt, cur, tx_time, user_id}
-- query with a typo
--
SELECT user_name
     , tx_time
     , amt
     , cur
 FROM users as u
 JOIN trans as t ON u.id = t.id -- <- typo !!
WHERE user_name = 'jack' ;

Try the query, here is the DDL with some data; note the wrong result.

The naming convention uses a generic id propagated from the application's OO model and the default data type is usually integer. The typo joins transaction table on id, instead of user_id; very easy mistake to make and not so easy to catch.

When it does happen that the bug propagates to a production system, the usual response is to first blame a developer, then argue that the pair programming (or peer review) process is not working, then blame QA for not catching it, and eventually discuss naming conventions.

Although it is true that naming attributes user_id, tran_id instead of the generic id would help -- because the predicate would read u.user_id = t.user_id -- the problem is deeper. As it is often with SQL, the problem is not in users nor developers, but in the SQL itself; namely u.id = t.id is both: a type error and a logical error.

In order to demonstrate the concept I have to use something with a "proper" type system.

Haskell to the Rescue

Haskell's type system is based on Hindley-Milner type system, and if you have never heard of it, don't worry. All you have to know about it right now are just two things: 1) it is a beautiful thing, and 2) you want one.

Let's start with the definition of two new types, one for users and one for transactions. I am typing into Haskell's interactive environment, so "λ" is just a terminal prompt.

New Type

λ  newtype User = MkUser Int deriving (Eq, Show)

λ  newtype Tran = MkTran Int deriving (Eq, Show)

The type system can infer types, hence I can ask for types of constructors: MkUser and MkTran.

λ  :type MkUser 
MkUser :: Int -> User

λ  :type MkTran
MkTran :: Int -> Tran

MkUser is a function which takes an Int and returns a User; MkTran is a function which takes an Int and returns a Tran (transaction).

The deriving (Eq, Show) -- in new type definitions -- simply means to derive rules for equality and string-display from the underlying integer type. And right here -- at the very moment of a new type definition -- a question arises:

What does it mean for two instances of a type to be equal?

The answer seems simple, but in general it requires more thought.

Equality

Create two variables: u for a user, and t for a transaction; note that for both of them the value of the underlying integer type is 1.

λ  let u = MkUser 1

λ  let t = MkTran 1

Let's test for equality, first between terms of the same type.

λ  1 == 1
True

λ  u == u
True

λ  t == t
True

λ  MkUser 1 == MkUser 1
True

λ  MkUser 1 == MkUser 2
False

λ  MkTran 4 == MkTran 4
True

λ  MkTran 5 == MkTran 4
False

However, when different types -- user and transaction -- are compared, an error is raised.

λ  u == t

Couldn't match expected type `User' with actual type `Tran'
...

This is a good error to get. A user and a transaction are two different things hence can not be compared, regardless of the fact that both have the underlying integer value of 1.

Contrast this thinking to the previous SQL example, which was happy to join a user on a transaction just because 1=1. An error would be much better.

Functions & Operators

What about the type of the equals operator (==) ?

λ :type (==)
(==) :: Eq a => a -> a -> Bool

Equals is a function which takes two arguments of a same type a and returns Boolean, given that definition of equality for that type exists. In Haskell speak, the type a must be a member of the Eq class, but this is not about Haskell: it is the principle that matters.

How about addition? What will happen if I try to sum two users, or a user and a transaction, since both have an integer as the underlying data type?

Let's see the type of the addition operator.

λ :type (+)
(+) :: Num a => a -> a -> a

The addition is a function which takes two arguments of a numeric type a and returns a result of the type a.

By now it should be easy to understand what happens here:

λ u + u

No instance for (Num User) arising from a use of `+'
...

λ t + t

No instance for (Num Tran) arising from a use of `+'
...

λ t + u

Couldn't match expected type `Tran' with actual type `User'
...

Note that the error in the third example is different from errors in the first two examples. The last one complains about different types, while the first two essentially state that types User and Tran are not numeric.

Summary

If all this type-reasoning looks too complicated, consider the query:

SELECT ((u.id + t.id)^3) - 22 as x
 FROM  users as u
 JOIN  trans as t ON t.id = (u.id + 2)
WHERE  u.id = 1 ;

 x
----
 42

Take user number one, add transaction number three, raise the result to the power of three, subtract 22; and the result is: the meaning of life.

Hmm, so there may be a logical mistake somewhere in that query. Wouldn't it be nice to have a type system that prevents mistakes like this? Although SQL dialects do not offer much in terms of type safety, PostgreSQL's enum data type does.

Next time: PostgreSQL's enum and type safety.

Eight Queens in SQL [4]

This is the fourth and final part of the series showing how to solve the “Eight Queens” chess puzzle, using a database design and SQL. The main idea is to design using natural language and concepts of domains, predicates, constraints, sets, set operations, and relations.

Links to all previous and this post’s code.
Part Main Topic Code (pgSQL)
1 Domains, Types part 1
2 Predicates, Constraints, Relations part 2
3 Propositions and Facts part 3
4 The Algorithm and the Solution part 4

Algorithm

Say we place a queen, Q1, on square c5, the queen will cover row 5, column c, and diagonals a3:f8 and a7:g1. Each one of these is a line (a set of squares), so the square c5 is an element of each of these four sets. Using the (anchor, movement direction) notation to identify a line, the following holds:

     row 5     = ('a5', 'RGHT')    = {'a5' ... 'h5'}
  column c     = ('c1', 'UP')      = {'c1' ... 'c8'}
diagonal a3:f8 = ('a3', 'RGHT-UP') = {'a3' ... 'f8'}
diagonal a7:g1 = ('a7', 'RGHT-DN') = {'a7' ... 'g1'}

'c5' ∈ ('a5', 'RGHT') 
'c5' ∈ ('c1', 'UP')
'c5' ∈ ('a3', 'RGHT-UP')
'c5' ∈ ('a7', 'RGHT-DN')

Let’s introduce a set, L1, which is the set of these four lines; note that L1 is a set of sets.

L1 = { ('a5', 'RGHT')   , ('c1', 'UP'),
       ('a3', 'RGHT-UP'), ('a7', 'RGHT-DN')
     }

'c5' ∈ ⋃ L1

⋂ L1 = {'c5'}

⋃ L1 is the union of all sets in L1, and represents all squares that the queen placed on c5 attacks. ⋂ L1 is the intersection of these sets and it has only one element, the c5 square itself.

Now let’s place another queen, Q2, on square f2; using the reasoning from the previous example, the following holds:

L2 = { ('a2', 'RGHT')   , ('f1', 'UP'),
       ('e1', 'RGHT-UP'), ('a7', 'RGHT-DN')
     }

'f2' ∈ ⋃ L2

⋂ L2 = {'f2'}

The rule that two queens attack each other if placed on the same line (row, column or diagonal) can be expressed as an intersection between sets L1 and L2. If the intersection is an empty set then they do not attack each other, otherwise they do; in this example they both have one diagonal in common.

L1 ⋂ L2 = {('a7', 'RGHT-DN')}

Suppose that eight queens Q1 … Q8 are placed on a chessboard one by one. As queen Qn is placed on a square we calculate the corresponding set Ln. The intersection of set Ln — for the queen being placed — and union of all previous L1 … Ln-1 sets — for queens already on the board — must be an empty set.

Queen  Rule for placement

Q1     L1 ⋂  {} = {}  -- always True
Q2     L2 ⋂  L1 = {}
Q3     L3 ⋂ (L1 ⋃ L2) = {}
Q4     L4 ⋂ (L1 ⋃ L2 ⋃ L3) = {}
Q5     L5 ⋂ (L1 ⋃ L2 ⋃ L3 ⋃ L4) = {}
Q6     L6 ⋂ (L1 ⋃ L2 ⋃ L3 ⋃ L4 ⋃ L5) = {}
Q7     L7 ⋂ (L1 ⋃ L2 ⋃ L3 ⋃ L4 ⋃ L5 ⋃ L6) = {}
Q8     L8 ⋂ (L1 ⋃ L2 ⋃ L3 ⋃ L4 ⋃ L5 ⋃ L6 ⋃ L7) = {}

In general, to place queen Qn on the board, the following must hold:

⋃{L1 ... Ln-1} ⋂ Ln = {}

This now looks like an algorithm; although it does appear recursive, I will avoid recursion in this example.

Reducing the Search Space

A solution to the problem is in the form of:

 
{s1, s2, s3, s4, s5, s6, s7, s8}

sn ∈ square_name

where sn is a square identified by a square name.

In general, there is more than one solution to the puzzle, hence the
final result is a set of all possible solutions:

 
{
  {s1.1, s1.2, s1.3, s1.4, s1.5, s1.6, s1.7, s1.8}
, {s2.1, s2.2, s2.3, s2.4, s2.5, s2.6, s2.7, s2.8}
, ...
}

si.n ∈ square_name

The idea is now to estimate the search space for which the placement rules have to be evaluated, and reduce it as much as possible by reasoning about the problem.

In the first estimate we can start with reasoning that the first queen can be placed on one out of 64 squares, second on one out of 63, third on one out of 62, etc. The number of permutations here is 64! / (64-8)! = 1.78E+14.

However, the order of queens within one solution does not matter, so we should use combinations instead and choose 8 out of 64 — (64! / (8! x (64-8)!) = 4.43E+9.

Given that two queens can not be placed in the same column, we can start placing them column by column, so each queen can be placed on a maximum of eight squares — one column. This brings down the number of combinations to 8^8 = 1.68E+7.

Given that two queens can not be in the same row, each subsequent placement has one square less available than the previous one; this now brings number of combinations to 8! = 40320.

Further improvement can be made avoiding rows +/-1 from the queen in the previous column. For example, if the first queen was placed in column ‘a’, row 4, then the second queen can be placed in column ‘b’, but not in rows 3, 4 or 5; this further reduces the number of possible combinations to 5424.

To summarize the reasoning about number of combinations involved:

Reasoning Calc Search Space
Permute 8 out of 64 64! / (64-8)! 1.78E+14
Choose 8 out of 64 64! / (8! x (64-8)!) 4.43E+9
By column 8^8 1.68E+7
By column, avoid previous rows 8! 40320
By column, avoid previous rows,
avoid row +/-1 from the queen
in the previous column.
Simply count 5424

This looks quite feasible; in total the placement rules argued in the algorithm section have to be evaluated for 5424 combinations.

Solution

Take a look at the code; the result is implemented as a single query in a view named solution.

There are 92 possible solutions to the puzzle:

select * from solution;


s1  s2  s3  s4  s5  s6  s7  s8
------------------------------
a1  b5  c8  d6  e3  f7  g2  h4
a1  b6  c8  d3  e7  f4  g2  h5
...
a8  b3  c1  d6  e2  f5  g7  h4
a8  b4  c1  d3  e6  f2  g7  h5

It should be easy to read the code and the reasoning section in parallel, which has been the main guiding principle of this exercise.

The first part of the query — CTE named w — reduces the search space as described in the previous section. Consider the step of placing queen Q3 on column c:

JOIN board1 as q3 
            ON q3.CL  = 'c' 
           and q3.RW not in (q1.RW, q2.rw) 
           and q3.RW_NO not BETWEEN (q2.RW_NO - 1)
                                and (q2.RW_NO + 1)

The predicate in the join condition is selecting the column c, avoiding the rows of the two previously placed queens, and avoiding rows +/- 1 from the queen next to it (Q2).

The CTE w returns a set of tuples — the reduced search space — then the second part of the query applies the rules (constraints) from the algorithm section to this set.

Using queen Q3 as an example, the rule for placement:

L3 ⋂ (L1 ⋃ L2) = {}

is implemented as:

and not exists ( -- placing queen 3

      SELECT distinct e.ANCHOR, e.MOV_DIR 
        FROM square_line as e 
        WHERE e.SQUARE in (w.s1, w.s2)

      INTERSECT

      SELECT distinct e.ANCHOR, e.MOV_DIR 
        FROM square_line as e 
       WHERE e.SQUARE = w.s3
    )

Evaluating the Solution

When evaluating a solution to a problem, I tend to focus on the following criteria:

  1. Correctness.
  2. Ability to reason about the code and the problem now.
  3. Ability to reason about the code and the problem in the future. Suppose that this happens to be a part of a much larger project, and ten years from now a “third generation” of developers has to understand the code. How hard will it be?
  4. Ability to change technology and scale in size. Instead of arguing that this approach can easily scale and change technologies, I will simply present it in a few future posts.
  5. Performance: although I did not even attempt to optimize queries — and have given priority to readability and reasoning — the solution runs reasonably well.

I encourage you to compare this to Rosetta’s implementation — and a few more — using all of the criteria.