Forgotten Simplicity

Teaching normalization past 1NF in introductory database (DB) courses should be considered harmful. Obsessing about normal forms (past 1NF) should be left for advanced database courses.

Every basic course on relational databases includes one or more chapters on normal forms: 1NF, 2NF, 3NF, BCNF, 4NF, 5NF, 6NF; and sometimes a few more. These are somehow considered essential. I find that this leads to confusion and most people simply do not know what to do with it, other than to learn just enough to pass an exam. Take a look at your own (personal or business) SQL DBs, very likely they are a complete mess.

This post is geared towards developers who want to keep their databases free of logical errors, easily expandable, and easy to maintain. At the same time, to stay away from terms like: functional dependency, join dependency, multivalued dependency, Heath theorem, the chase algorithm, rigorous mathematical definitions of normal forms, and so on.

There is a simpler way, a better way, a beautiful way, once understood by most, forgotten now. It somehow got swept under the sands of time.

Q: The way?

Use natural language and logic. Describe your business problem using predicates and constraints. Match relations to predicates (they go together), everything else simply follows.

Q: What is a relation?

For a developer, it’s a data type. A higher-order data type. It has a precise definition, look it up. The same way there exist integer: data type, value and variable; there exist relation: data type, value and variable.

Q: Relation value, variable? What does it represent?

A set of facts, under a common predicate.

Q: Facts, predicate?

A fact is a proposition considered to be true. A proposition is a meaning of a declarative sentence. A predicate is a generalized proposition. So, a predicate with a matching relation represents a set of facts — sentences considered to be true.

Q: Store sets of facts?

Relational variables (SQL tables in 1NF) store sets of facts. Use relational operators to calculate on these facts.

Q: Relational operators?

Every data type comes with operators, otherwise it would not be very useful. The same way integers come with: addition, subtraction multiplication, …; lists with: head, tail, concatenation …; relations come with: join, union, project, restrict, (their sql equivalents) etc.

Q: What are these operators doing?

Operations on sets of facts: reasoning.

Q: What about joins, are they bad, should we avoid them?

Should we avoid multiplication of integers, concatenation of strings and lists; unions and intersects of sets, etc.?

Q: And keys: primary, alternate, foreign?

Every set of facts needs one or more logical constraints to make sense. These should be spelled out in a natural language, together with predicates. Table keys (PK, AK) are simply implementations of these common-sense logical constraints. Similar for FK, it is an implementation of an inclusion constraint in between two sets of facts.
It is important to understand keys and other constraints.

Q: What is this normalization thing about?

It is all about removing and preventing logical errors.
First, SQL does not work with relations out-of-the-box, but with tables. A table can represent a relational variable, but it takes rigor — the burden is on the programmer. When a table properly represents a relational variable it is said to be in 1NF, so this is important– look it up.
Second, and this is the big one: preventing contradiction. Contradiction in logic leads to so called "deductive explosion". Steps have to be taken to prevent contradiction. Most of the time this involves implementing proper constraints and eliminating (reducing) redundancy by decomposition into projections (vertically splitting tables).

Q: How far to go with decomposition, when is it enough?

Stop when one or more of the following holds true:

  • There is no redundancy.
  • There is redundancy, but it can not be removed by decomposition.
  • It is not possible to decompose table without losing information.
  • It is not possible to decompose table at all.

Just make sure that when reasoning abut redundancy, you reason about the predicate and constraints, not about a few rows of sample data.

Q: How does redundancy lead to contradiction?

It is possible to state something that is true and not true at the same time, when redundant data goes out of sync.

Q: What to do about it?

Remove redundancy from relations (tables in 1NF), mostly by decomposing into projections. Use predicates and constraints, write them down, start with basic building blocks, use logic.

Q: What basic building blocks?

Simple predicates, which lead to sets of elementary facts. Simple predicate is a predicate which can not be split without losing information, same for elementary facts.

Q: How to use this in design?

Start with simple predicates and combine them based on constraints (keys). Most of DB normalization story is about decomposing tables, the real question is how to compose simple predicates.

Q: What about normal forms?

Reasoning about relations yields 1NF, removing redundancy to 5NF, simple predicates to 6NF. In other words, you can get your tables into 5NF even if you do not know the precise definition of it. Btw, useful normal forms can be easily verbalized:

Form Meaning
1NF Table properly represents a relational variable.
BCNF Every fact depends on the key, the whole key and nothing but the key; for every key (there can be more than one).
5NF Decomposing table would not remove any redundancies.
6NF Matches simple predicate.

Take a look at this again, only meaning is important, the form name is irrelevant.

Q: What about 2NF, 3NF, 4NF, and few others; should developers study them?

Not very useful. If you need to pass an exam, then you do as the curriculum demands. You may decide to study them under "advanced db topics", or to learn some db-history. Maybe you are a DB aficionado, I am.
It is important to understand 1NF, the rest is optional as long as you understand the logic of it (see the table above).

Q: So not much rigorous math required?

No, just basic understanding of predicates and constraints in logic, and how do they map to DB concepts of: relations, relational variables, and constraints.

Q: Any other benefits?

  • Using logic and natural language is a huge advantage when one needs to communicate with analysts, product managers, and other business people.
  • It is a good method to elicit business requirements, makes it easy to reason with non-techies; and once done you actually end up with a usable DB model. Bonus documentation.
  • Easy to expand. As business evolves, new predicates (tables) are added, as opposed to constant refactoring.
  • Schema is stable over long period of time, new tables may be added, but old ones are rarely altered. There is no need to change analytics queries (code) that already work, even if application code changes.
  • And finally performance, aka speed. Prevailing mantra "joins are slow" is true only for toy examples, low data volumes, when CPU cycle count is the only resource that matters. Large DBs run faster with narrow, "normalized" tables — a natural consequence of this design approach.

Q: Any examples?

See the Eight Queens problem solved in SQL, using this design method.

Map Coloring

Another fun challenge from Decision Management Community. A canonical example of the constraint programming (over finite domains) paradigm leads to a simple and elegant SQL solution.

The task is to color a map of six European countries: Belgium, Denmark, France, Germany, Luxembourg, and the Netherlands in such a way that no neighboring countries use the same color. You need to use no more than 4 colors: blue, red, green, or yellow.

Starting with two sets: country and color; who borders whom, and the rule.

country {BE, DK, FR, DE, LU, NL}
color   {B, R, G, Y}

BE borders {FR, LU, DE, NL}
DK borders {DE}
FR borders {BE, LU, DE}
DE borders {FR, LU, BE, NL, DK}
LU borders {FR, DE, BE}
NL borders {BE, DE}

Neighboring countries can not use the same color.

Reasoning

There are six countries, each one can have one of four colors, so in total there is a maximum of 4^6 = 4096 combinations to examine. The strategy is simple:

  1. Create a set of all possible combinations.
  2. Apply constraints.
  3. Present all results that satisfy given constraints.

The SQL code is straightforward and easy to follow. Try it, there are 144 possible solutions.

Variation

In a variation of the problem rules are changed a bit. Only three colors are allowed and some neighboring countries are allowed to have the same color, but there is a penalty cost assigned. The task is to find solutions with minimum penalty cost.

country {BE, DK, FR, DE, LU, NL}
color   {R, G, B}

These neighboring countries can use the same color,
with assigned cost:

LU-FR  $257
LU-DE  $904
LU-BE  $568

Reasoning v2

This time there are 3^6 = 729 combinations to examine. The strategy:

  1. Create a set of all possible combinations.
  2. Apply mandatory constraints.
  3. Calculate the cost for all tuples satisfying mandatory constraints.
  4. Find the minimum cost.
  5. Present all results that have the minimum cost.

Again, SQL code is easy to follow, this time there are 12 possible solutions.

Home Network DNS Resolver

Raspberry Pi Kit
Scott Helme has a detailed procedure of how to configure Raspberry PI to act as a DNS resolver with DNS-level “content-blocking” for a network. It is a great weekend project, fun and useful. The hardware cost is around 100 CAD on Amazon.

Benefits

  • Content filter on DNS level, including ads and known nasty sites.
  • Reduced web traffic. Depends on sites you visit, but 40% is a reasonable expectation.
  • Upstream DNS queries are directed over https, so you get some extra privacy.
  • Pages load faster, take a look at the example.
  • Pi-Hole has a nice admin interface, so you get insight into DNS chatter on the network.

Here are network requests for a home page of a popular news site using default (no filtering) DNS resolver. In total it took 395 requests, 5.3 MB and six minutes to load.
Default DNS

Now if I switch to the DNS resolver on Raspberry PI:
DNS via pi-hole
Total of 143 request, 2.7MB to load, and 20.75 seconds. Take a look at all the lines in red with failed status, this is where the domain got blocked by the pi-hole on the Raspberry.

That’s ~ 50% less data and 17 times faster.

Continue reading “Home Network DNS Resolver”

Monkey Business

A challenge by Decision Management Community.

A zoo has four monkeys, during the monkey lunchtime each one ate a different fruit in their favourite resting place. Sam, who doesn’t like bananas, likes sitting on the grass. The monkey who sat on the rock ate the apple. The monkey who ate the pear didn’t sit on the tree branch. Anna sat by the stream but she didn’t eat the pear. Harriet didn’t sit on the tree branch. Mike doesn’t like oranges.

  1. What kind of fruit each monkey ate?
  2. Where their favourite resting place was?

So, how to solve this in SQL? It is easy to identify domains, predicates and rules; after that it is straightforward. Continue reading “Monkey Business”