It happens too often in the business world: a business person sets up a small ad hoc database, say, a contacts list, without spending a lot of time worrying about perfecting the format; it gets added to over time and becomes more and more important to the business, and eventually someone has to work through the whole database and restructure it to make it work better.
If you've already done some databases, and maybe had to fix them later on, there may well be nothing here you don't already understand. But if you're a first-timer, read on; there are some tricks you can use when you first create a database that will tend to save time and expense later.
Let's say you were to set up a sales contact list with the following fields (each line describes a single field):
COMPANY
CONTACT NAME
ADDRESS
CITY/STATE/ZIP
AREA CODE/PHONE
EMAIL ADDRESS
Looks reasonable, doesn't it? The mistake people usually make is not breaking their data down far enough. If you use the fields above, sooner or later you're going to need to get at the "pieces" of some of them. You might just need to sort by contacts' last name, or by state; or you might want an address on a form letter to start with "John Doe" and the salutation to read "Dear Mr. Doe". Also, many people will have both a street address and a post office box.
The trick is to split up all the parts of the information you might need to access or sort by separately, like this:
COMPANY
FIRST NAME
MIDDLE NAME
LAST NAME
SUFFIX (Jr/Sr/PhD etc.)
SALUTATION ("Dear Mr. Smith" "Dear Fred" etc.)
ADDRESS 1 (Street address)
ADDRESS 2 (PO Box)
CITY
STATE (incl. Canadian province codes)
ZIP (US Zip or Canadian postal code)
COUNTRY (forced to all caps when used)
AREA CODE
PHONE
FAX AREA CODE
FAX PHONE
EMAIL ADDRESS
If you do it this way, your data will be neatly pigeonholed for you from Day One. If you use a structure like the first one, sooner or later, somebody will have to set up new separated fields, and go through the data record by record and pick the data for the new fields out of the old fields. I originally wrote this section after a department of a company that really should have known better (I'll never tell who) had to pay me to sit down and do exactly that for them.
Normalization is a ten-cent word database programmers use when they want to intimidate you. The basic idea isn't really that hard to understand.
Suppose we're selling something, and we want to keep track of invoices. Okay, here's an invoice table:
Cust# Customer Phone Invoice# Date Total 101 ABC Corporation 325-7289 1319 10/22/03 891.57 101 ABC Corporation 325-7289 2649 11/05/03 532.68 101 ABC Corporation 325-7289 26883 12/06/03 520.92 101 A.B.C. Corp. 325-7289 42926 02/10/04 362.82 101 A.B.C. Corp. 325-7289 75284 05/03/04 901.86 101 ABC Co. 325-7289 91958 05/24/04 740.60 102 Beta Partners Ltd. 902-4416 103 Custom Services 381-9408 104 Data Automation Inc. 452-9520 13999 11/30/03 842.51 104 Data Automation Inc. 452-9520 23413 12/12/03 469.36 104 Data Automation Inc. 452-9520 52056 12/14/03 222.11 104 Data Automation 452-9520 64976 01/16/04 36.36 104 Data Automation 452-9520 95954 03/24/04 566.60 105 Epsilon Enterprises 902-7767 29950 01/20/04 529.67 105 Epsilon Enterprises 902-7767 74628 04/15/04 602.86 105 Epsilon Enterprises 902-7767 79050 06/25/04 677.66 106 Financial Consulting 428-0131 3522 01/02/04 128.95 106 Financial Consulting 428-0131 20471 03/20/04 699.19 106 Financial Consulting 428-0113 49515 04/06/04 798.51 106 Financial Contracting 428-0113 55215 05/01/04 158.21 106 Financial Contracting 428-0131 79507 05/02/04 432.50 106 Financial Contracting 428-0131 79877 05/14/04 428.50
Hmm. Who dealt this mess? We don't seem to be too sure about what some of these companies' names are, or their right phone numbers. And there are two customers with no invoices; wasn't this supposed to be an invoice list?
The idea is, if you allow two or more places for the same information to exist, that creates the possibility that it can get recorded differently. You can't tell from the table what the correct information is after that. Try the same data, without the mistakes, in two separate tables like this:
Cust# Invoice# Date Total
101 1319 10/22/03 891.57
101 2649 11/05/03 532.68
Cust# Customer Phone 101 26883 12/06/03 520.92
101 ABC Corporation 325-7289 101 42926 02/10/04 362.82
102 Beta Partners Ltd. 902-4416 101 75284 05/03/04 901.86
103 Custom Services 381-9408 101 91958 05/24/04 740.60
104 Data Automation Inc. 452-9520 104 13999 11/30/03 842.51
105 Epsilon Enterprises 902-7767 104 23413 12/12/03 469.36
106 Financial Consulting 428-0131 104 52056 12/14/03 222.11
104 64976 01/16/04 36.36
104 95954 03/24/04 566.60
105 74628 04/15/04 602.86
105 79050 06/25/04 677.66
106 3522 01/02/04 128.95
106 20471 03/20/04 699.19
106 49515 04/06/04 798.51
106 55215 05/01/04 158.21
106 79507 05/02/04 432.50
106 79877 05/14/04 8.50
Better, no? Now there's no question about customer names and phone numbers; a phone number might still be wrong, but at least it can't be ambiguous. If we're printing an invoice report, we can just use the customer number entry from the invoice file to get the customer's name from the customer file.
The second example can be described as normalized. Normalization theory gets a lot more elaborate than this, including first, second, third, fourth, fifth, and sixth "normal forms" and so on, but this is the central idea of it. A database manager program that can use multiple tables linked together in this way is called a relational database manager.
One important point to note here: if you have the kind of data duplication problem demonstrated above, and you're using a non-relational system of some kind, such as a flat-file database system or a spreadsheet program like Excel or Calc, you probably have a case of the wrong tool for the job.