Database tips

How to set up a small database

It happens too often in the business world: a business person sets up a small ad hoc database, say, a contacts list, without spending a lot of time worrying about perfecting the format; it gets added to over time and becomes more and more important to the business, and eventually someone has to work through the whole database and restructure it to make it work better.

If you've already done some databases, and maybe had to fix them later on, there may well be nothing here you don't already understand. But if you're a first-timer, read on; there are some tricks you can use when you first create a database that will tend to save time and expense later.

Let's say you were to set up a sales contact list with the following fields (each line describes a single field):

                    COMPANY
                    CONTACT NAME
                    ADDRESS
                    CITY/STATE/ZIP
                    AREA CODE/PHONE
                    EMAIL ADDRESS

Looks reasonable, doesn't it? The mistake people usually make is not breaking their data down far enough. If you use the fields above, sooner or later you're going to need to get at the "pieces" of some of them. You might just need to sort by contacts' last name, or by state; or you might want an address on a form letter to start with "John Doe" and the salutation to read "Dear Mr. Doe". Also, many people will have both a street address and a post office box.

The trick is to split up all the parts of the information you might need to access or sort by separately, like this:

                    COMPANY
                    FIRST NAME
                    MIDDLE NAME
                    LAST NAME
                    SUFFIX         (Jr/Sr/PhD etc.)
                    SALUTATION     ("Dear Mr. Smith" "Dear Fred" etc.)
                    ADDRESS 1      (Street address)
                    ADDRESS 2      (PO Box)
                    CITY
                    STATE          (incl. Canadian province codes)
                    ZIP            (US Zip or Canadian postal code)
                    COUNTRY        (forced to all caps when used)
                    AREA CODE
                    PHONE
                    FAX AREA CODE
                    FAX PHONE
                    EMAIL ADDRESS

If you do it this way, your data will be neatly pigeonholed for you from Day One. If you use a structure like the first one, sooner or later, somebody will have to set up new separated fields, and go through the data record by record and pick the data for the new fields out of the old fields. I originally wrote this section after a department of a company that really should have known better (I'll never tell who) had to pay me to sit down and do exactly that for them.


Database normalization

Normalization is a ten-cent word database programmers use when they want to intimidate you. The basic idea isn't really that hard to understand.

Suppose we're selling something, and we want to keep track of invoices. Okay, here's an invoice table:

Cust#  Customer              Phone     Invoice#     Date          Total
101    ABC Corporation       325-7289      1319     10/22/03     891.57
101    ABC Corporation       325-7289      2649     11/05/03     532.68
101    ABC Corporation       325-7289     26883     12/06/03     520.92
101    A.B.C. Corp.          325-7289     42926     02/10/04     362.82
101    A.B.C. Corp.          325-7289     75284     05/03/04     901.86
101    ABC Co.               325-7289     91958     05/24/04     740.60
102    Beta Partners Ltd.    902-4416
103    Custom Services       381-9408
104    Data Automation Inc.  452-9520     13999     11/30/03     842.51
104    Data Automation Inc.  452-9520     23413     12/12/03     469.36
104    Data Automation Inc.  452-9520     52056     12/14/03     222.11
104    Data Automation       452-9520     64976     01/16/04      36.36
104    Data Automation       452-9520     95954     03/24/04     566.60
105    Epsilon Enterprises   902-7767     29950     01/20/04     529.67
105    Epsilon Enterprises   902-7767     74628     04/15/04     602.86
105    Epsilon Enterprises   902-7767     79050     06/25/04     677.66
106    Financial Consulting  428-0131      3522     01/02/04     128.95
106    Financial Consulting  428-0131     20471     03/20/04     699.19
106    Financial Consulting  428-0113     49515     04/06/04     798.51
106    Financial Contracting 428-0113     55215     05/01/04     158.21
106    Financial Contracting 428-0131     79507     05/02/04     432.50
106    Financial Contracting 428-0131     79877     05/14/04     428.50

Hmm. Who dealt this mess? We don't seem to be too sure about what some of these companies' names are, or their right phone numbers. And there are two customers with no invoices; wasn't this supposed to be an invoice list?

The idea is, if you allow two or more places for the same information to exist, that creates the possibility that it can get recorded differently. You can't tell from the table what the correct information is after that. Try the same data, without the mistakes, in two separate tables like this:

                                        Cust# Invoice#   Date        Total
                                        101       1319   10/22/03   891.57
                                        101       2649   11/05/03   532.68
Cust# Customer              Phone       101      26883   12/06/03   520.92
101   ABC Corporation       325-7289    101      42926   02/10/04   362.82
102   Beta Partners Ltd.    902-4416    101      75284   05/03/04   901.86
103   Custom Services       381-9408    101      91958   05/24/04   740.60
104   Data Automation Inc.  452-9520    104      13999   11/30/03   842.51
105   Epsilon Enterprises   902-7767    104      23413   12/12/03   469.36
106   Financial Consulting  428-0131    104      52056   12/14/03   222.11
                                        104      64976   01/16/04    36.36
                                        104      95954   03/24/04   566.60
                                        105      74628   04/15/04   602.86
                                        105      79050   06/25/04   677.66
                                        106       3522   01/02/04   128.95
                                        106      20471   03/20/04   699.19
                                        106      49515   04/06/04   798.51
                                        106      55215   05/01/04   158.21
                                        106      79507   05/02/04   432.50
                                        106      79877   05/14/04     8.50

Better, no? Now there's no question about customer names and phone numbers; a phone number might still be wrong, but at least it can't be ambiguous. If we're printing an invoice report, we can just use the customer number entry from the invoice file to get the customer's name from the customer file.

See also Wikipedia's database normalization article.

The second example can be described as normalized. Normalization theory gets a lot more elaborate than this, including first, second, third, fourth, fifth, and sixth "normal forms" and so on, but this is the central idea of it. A database manager program that can use multiple tables linked together in this way is called a relational database manager.

One important point to note here: if you have the kind of data duplication problem demonstrated above, and you're using a non-relational system of some kind, such as a flat-file database system or a spreadsheet program like Excel or Calc, you probably have a case of the wrong tool for the job.


HTML checked
site feedback