Types and Typos in SQL [1]

Almost anyone who has spent time working with SQL made — or had to fix — this kind of bug:

-- two tables (sketch)
--
users {id, user_name}

trans {id, amt, cur, tx_time, user_id}
-- query with a typo
--
SELECT user_name
     , tx_time
     , amt
     , cur
 FROM users as u
 JOIN trans as t ON u.id = t.id -- <- typo !!
WHERE user_name = 'jack' ;

Try the query, here is the DDL with some data; note the wrong result.

The naming convention uses a generic id propagated from the application's OO model and the default data type is usually integer. The typo joins transaction table on id, instead of user_id; very easy mistake to make and not so easy to catch.

When it does happen that the bug propagates to a production system, the usual response is to first blame a developer, then argue that the pair programming (or peer review) process is not working, then blame QA for not catching it, and eventually discuss naming conventions.

Although it is true that naming attributes user_id, tran_id instead of the generic id would help -- because the predicate would read u.user_id = t.user_id -- the problem is deeper. As it is often with SQL, the problem is not in users nor developers, but in the SQL itself; namely u.id = t.id is both: a type error and a logical error.

In order to demonstrate the concept I have to use something with a "proper" type system.

Haskell to the Rescue

Haskell's type system is based on Hindley-Milner type system, and if you have never heard of it, don't worry. All you have to know about it right now are just two things: 1) it is a beautiful thing, and 2) you want one.

Let's start with the definition of two new types, one for users and one for transactions. I am typing into Haskell's interactive environment, so "λ" is just a terminal prompt.

New Type

λ  newtype User = MkUser Int deriving (Eq, Show)

λ  newtype Tran = MkTran Int deriving (Eq, Show)

The type system can infer types, hence I can ask for types of constructors: MkUser and MkTran.

λ  :type MkUser 
MkUser :: Int -> User

λ  :type MkTran
MkTran :: Int -> Tran

MkUser is a function which takes an Int and returns a User; MkTran is a function which takes an Int and returns a Tran (transaction).

The deriving (Eq, Show) -- in new type definitions -- simply means to derive rules for equality and string-display from the underlying integer type. And right here -- at the very moment of a new type definition -- a question arises:

What does it mean for two instances of a type to be equal?

The answer seems simple, but in general it requires more thought.

Equality

Create two variables: u for a user, and t for a transaction; note that for both of them the value of the underlying integer type is 1.

λ  let u = MkUser 1

λ  let t = MkTran 1

Let's test for equality, first between terms of the same type.

λ  1 == 1
True

λ  u == u
True

λ  t == t
True

λ  MkUser 1 == MkUser 1
True

λ  MkUser 1 == MkUser 2
False

λ  MkTran 4 == MkTran 4
True

λ  MkTran 5 == MkTran 4
False

However, when different types -- user and transaction -- are compared, an error is raised.

λ  u == t

Couldn't match expected type `User' with actual type `Tran'
...

This is a good error to get. A user and a transaction are two different things hence can not be compared, regardless of the fact that both have the underlying integer value of 1.

Contrast this thinking to the previous SQL example, which was happy to join a user on a transaction just because 1=1. An error would be much better.

Functions & Operators

What about the type of the equals operator (==) ?

λ :type (==)
(==) :: Eq a => a -> a -> Bool

Equals is a function which takes two arguments of a same type a and returns Boolean, given that definition of equality for that type exists. In Haskell speak, the type a must be a member of the Eq class, but this is not about Haskell: it is the principle that matters.

How about addition? What will happen if I try to sum two users, or a user and a transaction, since both have an integer as the underlying data type?

Let's see the type of the addition operator.

λ :type (+)
(+) :: Num a => a -> a -> a

The addition is a function which takes two arguments of a numeric type a and returns a result of the type a.

By now it should be easy to understand what happens here:

λ u + u

No instance for (Num User) arising from a use of `+'
...

λ t + t

No instance for (Num Tran) arising from a use of `+'
...

λ t + u

Couldn't match expected type `Tran' with actual type `User'
...

Note that the error in the third example is different from errors in the first two examples. The last one complains about different types, while the first two essentially state that types User and Tran are not numeric.

Summary

If all this type-reasoning looks too complicated, consider the query:

SELECT ((u.id + t.id)^3) - 22 as x
 FROM  users as u
 JOIN  trans as t ON t.id = (u.id + 2)
WHERE  u.id = 1 ;

 x
----
 42

Take user number one, add transaction number three, raise the result to the power of three, subtract 22; and the result is: the meaning of life.

Hmm, so there may be a logical mistake somewhere in that query. Wouldn't it be nice to have a type system that prevents mistakes like this? Although SQL dialects do not offer much in terms of type safety, PostgreSQL's enum data type does.

Next time: PostgreSQL's enum and type safety.

Eight Queens in SQL [4]

This is the fourth and final part of the series showing how to solve the “Eight Queens” chess puzzle, using a database design and SQL. The main idea is to design using natural language and concepts of domains, predicates, constraints, sets, set operations, and relations.

Links to all previous and this post’s code.
Part Main Topic Code (pgSQL)
1 Domains, Types part 1
2 Predicates, Constraints, Relations part 2
3 Propositions and Facts part 3
4 The Algorithm and the Solution part 4

Algorithm

Say we place a queen, Q1, on square c5, the queen will cover row 5, column c, and diagonals a3:f8 and a7:g1. Each one of these is a line (a set of squares), so the square c5 is an element of each of these four sets. Using the (anchor, movement direction) notation to identify a line, the following holds:

     row 5     = ('a5', 'RGHT')    = {'a5' ... 'h5'}
  column c     = ('c1', 'UP')      = {'c1' ... 'c8'}
diagonal a3:f8 = ('a3', 'RGHT-UP') = {'a3' ... 'f8'}
diagonal a7:g1 = ('a7', 'RGHT-DN') = {'a7' ... 'g1'}

'c5' ∈ ('a5', 'RGHT') 
'c5' ∈ ('c1', 'UP')
'c5' ∈ ('a3', 'RGHT-UP')
'c5' ∈ ('a7', 'RGHT-DN')

Let’s introduce a set, L1, which is the set of these four lines; note that L1 is a set of sets.

L1 = { ('a5', 'RGHT')   , ('c1', 'UP'),
       ('a3', 'RGHT-UP'), ('a7', 'RGHT-DN')
     }

'c5' ∈ ⋃ L1

⋂ L1 = {'c5'}

⋃ L1 is the union of all sets in L1, and represents all squares that the queen placed on c5 attacks. ⋂ L1 is the intersection of these sets and it has only one element, the c5 square itself.

Now let’s place another queen, Q2, on square f2; using the reasoning from the previous example, the following holds:

L2 = { ('a2', 'RGHT')   , ('f1', 'UP'),
       ('e1', 'RGHT-UP'), ('a7', 'RGHT-DN')
     }

'f2' ∈ ⋃ L2

⋂ L2 = {'f2'}

The rule that two queens attack each other if placed on the same line (row, column or diagonal) can be expressed as an intersection between sets L1 and L2. If the intersection is an empty set then they do not attack each other, otherwise they do; in this example they both have one diagonal in common.

L1 ⋂ L2 = {('a7', 'RGHT-DN')}

Suppose that eight queens Q1 … Q8 are placed on a chessboard one by one. As queen Qn is placed on a square we calculate the corresponding set Ln. The intersection of set Ln — for the queen being placed — and union of all previous L1 … Ln-1 sets — for queens already on the board — must be an empty set.

Queen  Rule for placement

Q1     L1 ⋂  {} = {}  -- always True
Q2     L2 ⋂  L1 = {}
Q3     L3 ⋂ (L1 ⋃ L2) = {}
Q4     L4 ⋂ (L1 ⋃ L2 ⋃ L3) = {}
Q5     L5 ⋂ (L1 ⋃ L2 ⋃ L3 ⋃ L4) = {}
Q6     L6 ⋂ (L1 ⋃ L2 ⋃ L3 ⋃ L4 ⋃ L5) = {}
Q7     L7 ⋂ (L1 ⋃ L2 ⋃ L3 ⋃ L4 ⋃ L5 ⋃ L6) = {}
Q8     L8 ⋂ (L1 ⋃ L2 ⋃ L3 ⋃ L4 ⋃ L5 ⋃ L6 ⋃ L7) = {}

In general, to place queen Qn on the board, the following must hold:

⋃{L1 ... Ln-1} ⋂ Ln = {}

This now looks like an algorithm; although it does appear recursive, I will avoid recursion in this example.

Reducing the Search Space

A solution to the problem is in the form of:

 
{s1, s2, s3, s4, s5, s6, s7, s8}

sn ∈ square_name

where sn is a square identified by a square name.

In general, there is more than one solution to the puzzle, hence the
final result is a set of all possible solutions:

 
{
  {s1.1, s1.2, s1.3, s1.4, s1.5, s1.6, s1.7, s1.8}
, {s2.1, s2.2, s2.3, s2.4, s2.5, s2.6, s2.7, s2.8}
, ...
}

si.n ∈ square_name

The idea is now to estimate the search space for which the placement rules have to be evaluated, and reduce it as much as possible by reasoning about the problem.

In the first estimate we can start with reasoning that the first queen can be placed on one out of 64 squares, second on one out of 63, third on one out of 62, etc. The number of permutations here is 64! / (64-8)! = 1.78E+14.

However, the order of queens within one solution does not matter, so we should use combinations instead and choose 8 out of 64 — (64! / (8! x (64-8)!) = 4.43E+9.

Given that two queens can not be placed in the same column, we can start placing them column by column, so each queen can be placed on a maximum of eight squares — one column. This brings down the number of combinations to 8^8 = 1.68E+7.

Given that two queens can not be in the same row, each subsequent placement has one square less available than the previous one; this now brings number of combinations to 8! = 40320.

Further improvement can be made avoiding rows +/-1 from the queen in the previous column. For example, if the first queen was placed in column ‘a’, row 4, then the second queen can be placed in column ‘b’, but not in rows 3, 4 or 5; this further reduces the number of possible combinations to 5424.

To summarize the reasoning about number of combinations involved:

Reasoning Calc Search Space
Permute 8 out of 64 64! / (64-8)! 1.78E+14
Choose 8 out of 64 64! / (8! x (64-8)!) 4.43E+9
By column 8^8 1.68E+7
By column, avoid previous rows 8! 40320
By column, avoid previous rows,
avoid row +/-1 from the queen
in the previous column.
Simply count 5424

This looks quite feasible; in total the placement rules argued in the algorithm section have to be evaluated for 5424 combinations.

Solution

Take a look at the code; the result is implemented as a single query in a view named solution.

There are 92 possible solutions to the puzzle:

select * from solution;


s1  s2  s3  s4  s5  s6  s7  s8
------------------------------
a1  b5  c8  d6  e3  f7  g2  h4
a1  b6  c8  d3  e7  f4  g2  h5
...
a8  b3  c1  d6  e2  f5  g7  h4
a8  b4  c1  d3  e6  f2  g7  h5

It should be easy to read the code and the reasoning section in parallel, which has been the main guiding principle of this exercise.

The first part of the query — CTE named w — reduces the search space as described in the previous section. Consider the step of placing queen Q3 on column c:

JOIN board1 as q3 
            ON q3.CL  = 'c' 
           and q3.RW not in (q1.RW, q2.rw) 
           and q3.RW_NO not BETWEEN (q2.RW_NO - 1)
                                and (q2.RW_NO + 1)

The predicate in the join condition is selecting the column c, avoiding the rows of the two previously placed queens, and avoiding rows +/- 1 from the queen next to it (Q2).

The CTE w returns a set of tuples — the reduced search space — then the second part of the query applies the rules (constraints) from the algorithm section to this set.

Using queen Q3 as an example, the rule for placement:

L3 ⋂ (L1 ⋃ L2) = {}

is implemented as:

and not exists ( -- placing queen 3

      SELECT distinct e.ANCHOR, e.MOV_DIR 
        FROM square_line as e 
        WHERE e.SQUARE in (w.s1, w.s2)

      INTERSECT

      SELECT distinct e.ANCHOR, e.MOV_DIR 
        FROM square_line as e 
       WHERE e.SQUARE = w.s3
    )

Evaluating the Solution

When evaluating a solution to a problem, I tend to focus on the following criteria:

  1. Correctness.
  2. Ability to reason about the code and the problem now.
  3. Ability to reason about the code and the problem in the future. Suppose that this happens to be a part of a much larger project, and ten years from now a “third generation” of developers has to understand the code. How hard will it be?
  4. Ability to change technology and scale in size. Instead of arguing that this approach can easily scale and change technologies, I will simply present it in a few future posts.
  5. Performance: although I did not even attempt to optimize queries — and have given priority to readability and reasoning — the solution runs reasonably well.

I encourage you to compare this to Rosetta’s implementation — and a few more — using all of the criteria.

Eight Queens in SQL [3]

The third article in the series.

Links to all previous and this post’s code.
Part Main Topic Code (pgSQL)
1 Domains, Types part 1
2 Predicates, Constraints, Relations part 2
3 Propositions and Facts part 3

Relation Values

To populate tables simply run the code; it is self explanatory. You may want to keep an eye on the model while reading the code.

Propositions and Facts

A proposition is a declarative sentence, which can be evaluated (flagged) as true or false. A fact is a proposition considered to be true.

Starting with an example:

SELECT * from board;

SQUARE  CL   RW
---------------
  a1    a    1
  a2    a    2
  ... 
  h8    h    8

consider the result in conjunction with the predicate: [p_03] the board square named (SQUARE) is located in the row labelled (RW) and the column labelled (CL).

Substituting predicate variables with values from the query results in a set of facts, a set of declarative sentences considered to be true.

The board square named 'a1' is located in the row labelled 'a' and the column labelled '1'.

The board square named 'a2' is located in the row labelled 'a' and the column labelled '2'.

...

The board square named 'h8' is located in the row labelled 'h' and the column labelled '8'.

Each of these sentences can be seen as as a term that evaluates to true, hence it is possible to look at the relation board and the matching predicate [p_03] as a function:

board::(SQUARE, RW, CL) → TRUE

-- or more general

board::(SQUARE, RW, CL) → Boolean

So, board is a Boolean function with three parameters: SQUARE, RW, and CL.

Staring with the function board and knowing keys of the relation ({SQUARE}, {RW, CL}) it is easy to derive few related functions.

board::(SQUARE, RW, CL) → TRUE


f1:: SQUARE → (RW, CL) -- returns pair

f2::(RW, CL) → SQUARE

f3:: RW → {SQUARE} -- returns set

f4:: CL → {SQUARE} -- returns set

It takes just a bit of mental exercise to see things this way, and I would encourage you to practice a bit. Once and for all, the exercise will liberate you from the association of the term “relational” with “something to do with a square-shaped data-storage“, which happens to be the popular opinion.

Towards the Solution

It it easy to see how this thinking can be applied to the relation square_line and the matching predicate [p_05].

square_line::(SQUARE, ANCHOR, MOV_DIR) → TRUE

g1:: SQUARE  → {(ANCHOR, MOV_DIR)}

g2::{SQUARE} → {(ANCHOR, MOV_DIR)}

g3:: (ANCHOR, MOV_DIR)  → {SQUARE}

g4::{(ANCHOR, MOV_DIR)} → {SQUARE}

Look at the board and consider the query:

select * 
from square_line
where square in ('c5', 'f2');

square   anchor    mov_dir
---------------------------
  c5       a3      RGHT-UP
  c5       a5      RGHT
  c5       a7      RGHT-DN
  c5       c1      UP
  f2       a2      RGHT
  f2       a7      RGHT-DN
  f2       e1      RGHT-UP
  f2       f1      UP

Given the predicate [p_05], the result is a set of facts:

The square named 'c5' is located on the line identified by the starting square 'a3' and the direction of movement 'RGHT-UP'.

The square named 'c5' is located on the line identified by the starting square 'a5' and the direction of movement 'RGHT'.

The square named 'c5' is located on the line identified by the starting square 'a7' and the direction of movement 'RGHT-DN'.

The square named 'c5' is located on the line identified by the starting square 'c1' and the direction of movement 'UP'.

The square named 'f2' is located on the line identified by the starting square 'a2' and the direction of movement 'RGHT'.

The square named 'f2' is located on the line identified by the starting square 'a7' and the direction of movement 'RGHT-DN'.

The square named 'f2' is located on the line identified by the starting square 'e1' and the direction of movement 'RGHT-UP'.

The square named 'f2' is located on the line identified by the starting square 'f1' and the direction of movement 'UP'.

Suppose two queens are placed on the board, on squares ‘c5’ and ‘f2’:

select anchor, mov_dir
from square_line
where square = 'c5'

intersect

select anchor, mov_dir
from square_line
where square = 'f2'
;

anchor  mov_dir
---------------------------
  a7    RGHT-DN

This fact may be verbalised as:

The queens are located on the line identified by the starting square 'a7' and the direction of movement 'RGHT-DN'.

Now change ‘f2’ to ‘g2’ and run the query.
Next time: the algorithm and the solution to the puzzle.

Eight Queens in SQL [2]

This is the second article in the series; the previous one introduced the problem and specified domains.

Links to all previous and this post’s code.
Part Main Topic Code (pgSQL)
1 Domains, Types part 1
2 Predicates, Constraints, Relations part 2

The Essence

The idea is to describe the problem using natural language and concepts of domains, predicates, constraints, relations and keys; here are a few simple rules to keep in mind.

From predicates and constraints to relations and keys
  • A predicate variable — from a specific domain — maps to an attribute of the relation.
  • Internal (to predicate) uniqueness constraints map to keys of the relation.
  • External (to predicate) inclusion constraints map to foreign keys between two relations.
  • A predicate and the matching relation represent — evaluate to — a set of facts about the universe of discourse (the problem).
  • A predicate and the matching relation should not be separated, otherwise information will be lost, for all practical purposes.
For SQL implementation
  • Relations map to tables.
  • Derived relations map to views.
  • Attributes map to columns.
  • Domains map to data types.
  • A key maps to a primary key or an alternate key (unique constraint).
  • A foreign key maps to a foreign key.
  • Predicates and constraints are added as comments on matching database objects (tables, keys).

Predicates, Constraints, Relations

[p_01] The column labelled (CL) with the assigned column number (CL_NO) exists.

(c1.1) Each column is assigned exactly one column number.
(c1.2) Each column number is assigned to exactly one column.

a_col { CL     col_lbl      -- p_01
      , CL_NO  ordinal_num
      }

  KEY {CL}                  -- c1.1

  KEY {CL_NO}               -- c1.2

 
[p_02] The row labelled (RW) with the assigned row number (RW_NO) exists.

(c2.1) Each row is assigned exactly one row number.
(c2.2) Each row number is assigned to exactly one row.

a_row { RW     row_lbl      -- p_02
      , RW_NO  ordinal_num
      }

  KEY {RW}                  -- c2.1

  KEY {RW_NO}               -- c2.2

 
[p_03] The board square named (SQUARE) is located in the row labelled (RW) and the column labelled (CL).

(c3.1) Each square is located in exactly one row; for each row it is possible that more than one square is located in that row.
(c3.2) Each square is located in exactly one column; for each column it is possible that more than one square is located in that column.
(c3.3) For each row and column, exactly one square is located in both that row and that column.
(c3.4) If a square is located in a row then that row must exist.
(c3.5) If a square is located in a column then that column must exist.

board { SQUARE  square_name   -- p_03
      , CL      col_lbl
      , RW      row_lbl
      }

  KEY {SQUARE}          -- c3.1, c3.2

  KEY {CL, RW}          -- c3.3

FOREIGN KEY {RW}        -- c3.4
 REFERENCES a_row {RW}

FOREIGN KEY {CL}        -- c3.5
 REFERENCES a_col {CL}

 
[p_04] The line identified by the starting square (ANCHOR) and the direction of movement (MOV_DIR) exists.

(c4.1) Each line is identified by exactly one starting square and direction of movement combination.
(c4.2) If a starting square is part of a line identifier, then that square must exist.

line { ANCHOR   square_name  -- p_04
     , MOV_DIR  mov_dir
     }

 KEY {ANCHOR, MOV_DIR}       -- c4.1

FOREIGN KEY {ANCHOR}         -- c4.2
 REFERENCES board {SQUARE}

 
[p_05] The square named (SQUARE) is located on the line identified by the starting square (ANCHOR) and the direction of movement (MOV_DIR).

(c5.1) For each square, that square may be located on more than one line.
(c5.2) For each line, that line may contain more than one square.
(c5.3) For each square and line, that square and line combination occurs at most once.
(c5.4) If a square is located on a line then that square must exist.
(c5.5) If a square is located on a line then that line must exist.

square_line { SQUARE   square_name    -- p_05
            , ANCHOR   square_name
            , MOV_DIR  mov_dir
            }

KEY {SQUARE, ANCHOR, MOV_DIR} -- c5.3, c5.2, c5.1

FOREIGN KEY       {SQUARE}            -- c5.4
 REFERENCES board {SQUARE}

FOREIGN KEY      {ANCHOR, MOV_DIR}    -- c5.5
 REFERENCES line {ANCHOR, MOV_DIR}

 

For convenience I will add a relation that allows me to access a square by name, row and column labels, or row and column numbers. This can be derived from the existing relations: board, a_col, a_row.

[p_06] The board square named (SQUARE) is located in the row labelled (RW), row number (RW_NO); and the column labelled (CL), column number (CL_NO).

-- derived relation
--
board1 { SQUARE  square_name   -- p_06
       , CL      col_lbl
       , RW      row_lbl
       , CL_NO   ordinal_num
       , RW_NO   ordinal_num
       }

  KEY {SQUARE}
  KEY {CL, RW}
  KEY {CL_NO, RW_NO}


-- derive using 
--

board1 =    {SQUARE, CL, RW} 
       JOIN {CL, CL_NO} 
       JOIN {RW, RW_NO}

board1 = JOIN {board, a_col, a_row}

 

There is one more predicate and relation to consider: the actual solution. At this point we do know that eight queens have to be placed on eight squares, and that there may be more than one solution. There will be some kind of calculation (algorithm) involved so the final relation will be derived from others.

[p_07] Placing queens on squares (S1), (S2), (S3), (S4), (S5), (S6), (S7), and (S8) is a solution to the puzzle.

(c7.1) A solution is identified by combination of squares S1 … S8.

-- derived relation
--
solution { S1 square_name             -- p_07
         , S2 square_name
         , S3 square_name
         , S4 square_name
         , S5 square_name
         , S6 square_name
         , S7 square_name
         , S8 square_name
         }

KEY {S1, S2, S3, S4, S5, S6, S7, S8}  -- c7.1

SQL Implementation

Creating an entity-relationship model is purely optional, but it does help.

Again, the translation from the reasoning to the SQL code is straightforward, here is the part 2 of the code.

What we have now are essentially empty relational variables. The relational variables a_col, a_row, board, line, and square_line need actual relational values to represent sets of facts about the chessboard.

Next time: relation values, propositions, and facts.