| Home > About PCs > Excel > Imported data |
Lists or tables of various kinds of data are often imported into Excel from various source applications, including PC and mainframe database managers, personal information managers (PIMs, also known as contact managers) and email address books. The data is most often names and addresses, but can be inventory or transactions or anything.
There are some special problems that sometimes occur in Excel with imported data, more often with mainframe data than with data from PC sources.
Sometimes the big challenge is how to get the data into Excel in the first place. In other cases the data will appear to import into Excel in good order, but formulas will produce unexpected results, where they appear to be ignoring all or part of the data. This sort of thing is usually caused by either numbers stored as text or the presence of unexpected non-printing characters.
Sometimes Excel will be able to open the source file directly. My Excel KB articles page links to two articles that list all the file formats Excel 2000 and 2002/2003 can open directly.
When Excel can't open the source file, you'll need to find an intermediate file format the source application can output, that Excel will be able to read. Any version of Excel should be able to read any older Excel format. If the source application can output in any Excel format that's the same as or older than your current Excel version, that should be the first intermediate format you try.
Excel and other spreadsheets can import data from a wide variety of text file formats. Excel has a Text Import Wizard that helps you with this process. Delimited text files have a long history as data interchange formats, including between different mainframe systems before personal computers existed.
The last step of the Text Import Wizard gives you the opportunity to override the data type for each column. Two common issues where this is necessary are import of US zip codes and of credit card numbers. Imported East Coast zip codes with leading zeros by default get their leading zeros stripped, because Excel mistakes them for actual numbers.* Many credit card numbers consist of a string of sixteen digits; with Excel again following its default import behavior and trying to treat them as actual numbers, they get "rounded" to fifteen significant digits, in compliance with an IEEE spec that defines this precision limit for microprocessors. In either of these cases you can override by specifying the data type Text for that column in the last step of the wizard.
I also remember one support case in which imported data containing slashes was being mistakenly formatted as Excel dates. Here again, the fix was to override the data type for that column to Text.
After the older Excel formats, the next best import format is usually CSV text file ("comma separated values"). There are some detail variations in CSV formats, but typically they have each record on one line of text, with the fields separated by commas, and enclosed in quotes, either in all cases or only when the field text contains spaces or commas. Generally the text file output by the source application will have the extension CSV.
CSV is probably the most robust text interchange format; CSV files usually import into Excel and parse into columns without problems. In fact, when Excel sees the extension CSV, it bypasses the Text Import Wizard, because it assumes it can import that format correctly. If there are problems, you can force the Text Import Wizard to run by changing the extension of the text file from CSV to TXT. This works in all Excel versions from 97 through 2003. In Excel 2000 and later, you can force the wizard to run from Data, Get External Data, without changing the CSV extension. (In Excel 2003: "Data, Import External Data, Import Data.")
If your source application has a fixed width text file option (may be identified as "Lotus PRN") where each field starts a certain number of characters from the beginning of the line, regardless of the content of previous fields, that kind of text format will import into Excel pretty cleanly. You'll need to make sure Fixed Width is selected in the Text Import Wizard. and go through the data in the last step of the wizard and make sure it's parsing the columns correctly. That wizard step shows you divider lines representing Excel's best guess for where each field ends and the next begins, along with directions for making changes if necessary.
You may also be able to use Fixed Width import with the source application's reports and the output of a "print to text" feature, because Excel's text wizard has an option to exclude header rows from the import.
Sometimes your source application will let you output a text file and choose a delimiter character to put between the fields. Then you'll need to select Delimited in the Text Import Wizard and specify the delimiter you chose, either from the offered choices, or by selecting the "Other" choice and typing your delimiter character. Obviously you'll want to choose a delimiter character that doesn't appear anywhere in the data.
Occasionally you may see tab delimited text interchange formats. This format is a little harder to recognize; when a tab delimited text file is viewed in Notepad, some of the field contents will appear to line up vertically and some won't. Sometimes the best way to recognize tab delimited format is to try specifying the tab delimiter in the text import wizard and find out that it works.
Works is an inexpensive application suite, including word processor, spreadsheet, and database modules, with a greatly reduced feature set compared to Microsoft Office. It's fairly common for PC OEMs to bundle Works with their PCs because it's so cheap. When I was doing Excel support I used to keep Works installed on one of my two PCs, along with a couple of Office versions, for comparison and to help support spreadsheet transition issues.
The Works spreadsheet definitely feels like a "training wheels" version of Excel: limited feature set and supported functions, no array formulas, and only one worksheet per file, vs. Excel's multiple worksheet tabs. In many ways it resembles PC spreadsheet programs circa 1990.
If you have existing Works spreadsheet files and you're switching to Excel, I recommend you save your Works files to Excel format from Works, then open the resulting XLS files in Excel. See my Excel KB articles page for more information on Works and Excel interoperability.
For anyone who's been using bundled Works for everything, open-source OpenOffice.org (OOo) is a great freeware upgrade. I consider OOo 2.x to be about on a par with MS Office 2000, and it certainly beats Works for functionality every which way. For more see my OOo pages in this section. To transition your Works document files to OOo, save them from Works to MS Office formats such as DOC and XLS, and open those in OOo. You can leave the files in the Office formats if you like, or use OOo's document converter wizard to convert them to OpenDocument.
I would recommend you keep Works installed on your PC after you install either Excel/Office or OOo, at least for a while. Works and Office coexist just fine, better than multiple Office versions do, in fact. I haven't tried Works and OOo installed together, but I know of no reason why it shouldn't work.
Sometimes the imported data will not be in the desired format, in terms of capitalization, missing prefixes, or data will be in two columns where one column is desired or vice versa.
To convert imported data using formulas:
When you've entered a new formula in the first data row of a very long list, and you need to copy it to all the other rows—as in step 3 above—there's a way to do it that can save you a lot of time.
Whenever you click on a cell, it will always be surrounded by a dark selection border. The lower right corner of that border is a slightly wider dark square called the fill handle. When you point to the fill handle, your mouse cursor will change to a small black cross. Click on the cell with the new formula, and double-click the fill handle. Excel will find the extent of the list column to the left and copy the formula to all the other rows in the list, whether you can see them on the screen at the moment or not.
Capitalization problems can be fixed using the text functions LOWER() UPPER() and PROPER() to force all lowercase, all uppercase, and first-letter uppercase, respectively. More precisely, PROPER() will capitalize the first letter in the string and any other letters that follow any character other than a letter, including spaces.
Missing prefixes, or any constant data which needs to be added to all cells in a column, can be fixed using the concatenation operator (&) which combines two strings into one. To add the string "ABC" to the front of all entries in column A you can use a formula like ="ABC"&A2
You may be able to remove unwanted extra characters using Excel's text functions such as LEFT(), RIGHT(), MID(), FIND(), and SEARCH(), depending on how the data is formatted and whether the entries all have the same number of characters or not. Or Excel's search and replace function from the menus may be more effective.
Two imported columns can be combined into one with concatenation. To combine columns A and B, use =A2&B2 in inserted column C, or if a space between is desired =A2&" "&B2 with a space between the two quotes. In the latter case we're combining three strings: two from cells A2 and B2, plus a quoted constant string consisting of a single space.
Parsing one imported column into two or more new columns is trickier, and you'll probably have to tackle that situation on a case-by-case basis using the Excel text functions mentioned above. In some cases like this the Data, Text to Columns feature, which splits text strings at the spaces, may be useful.
Sometimes numbers in imported data will end up stored in Excel inappropriately as text strings rather than numbers. This is about real numbers that could be used in calculations, not things like part numbers, zip codes, or credit card numbers. You can use the ISNUMBER() function in an inserted temporary column to diagnose this; if it returns FALSE for data that appears to be numeric, then you may have numbers stored as text.
To convert numbers stored as text to real Excel numbers:
If you still have your inserted temporary column with the ISNUMBER() formulas, you should see them all change from FALSE to TRUE. Then you can get rid of the temporary cell with the 1 in it, and the temporary column containing the diagnostic ISNUMBER() formulas.
My Excel KB articles page links to four articles that cover converting numbers stored as text in various Excel versions.
In Excel 2002 and later, by default numbers stored as text will be flagged with SmartTags, and when you select the flagged cell or cells, the SmartTag menu will say "Number stored as text" and one of the menu choices will say "Convert to number," which is convenient.
Note that we've now talked about two different kinds of problems with numbers as numbers versus numbers as text. Sometimes Excel tries to treat what are really text strings which happen to contain only digits (zip codes, credit card numbers, part numbers) as if they were actual numbers that could be used in calculations. Sometimes Excel imports actual numbers and inappropriately stores them as text strings, because that's how they were stored in the source file.
(How to delete something that's not there.)
It's rather common for mainframe databases to add non-printing characters to field contents, often at the end of the text string, for use as undocumented internal flags. These non-printing characters can get imported into Excel. They cause problems in lookup and consolidation formulas, because the extra characters will prevent the test string that's part of your formula from matching the actual cell contents. The cause of this problem is usually not obvious to the eye, because the extra characters don't display or print, and by default Excel presents text data as left-justified.
To verify there are extra non-printing characters, you can insert a temporary extra column and use the LEN() function in all list rows, to tell you the lengths of the character strings in the relevant column. If you're seeing four characters displayed in a particular cell, and the LEN() formula in that row is giving you a string length of five or more, then you have some kind of extra characters.
How you remove these non-printing characters depends on which characters you're dealing with. If your problem data is in column A, a formula like =CODE(RIGHT(A2,1)) in another inserted column will tell you the numeric code for the last character in the string in cell A2, which will help determine which technique to use. Or you can just try TRIM(), if that doesn't work try CLEAN(), and if that doesn't work either, move on to the Find & Replace procedure below.
If the problem can be corrected using the TRIM() or CLEAN() functions in a temporary inserted column, then you'll need to use the Paste Special Values technique as described in the Data conversion section above, to replace the problem data with the sanitized version.
If the extra characters are just spaces (code 32) the TRIM() function will remove all spaces from text except for single spaces between words.
CLEAN() will remove many non-printing characters other than spaces. This should include codes 128, 129, 141-144, 157, and 158. When you're done, if you see four characters displayed and the LEN() function is telling you there's only four characters, and if your lookup or consolidation formulas start working, then you've been successful.
Excel documentation will tell you that CLEAN() removes "all non-printable characters," but it ain't necessarily so: one such is CHAR(160). I suspect there are also times when what's causing the problem is leftover fragments of machine code that don't match any character that's part of the code scheme.
To remove non-printing characters not removable using TRIM() or CLEAN() functions:
If you have more than one kind of mystery character, you may need to repeat this procedure.
If none of this works, you might want to try a different method for importing the data, or you might have to consult tech support for the source application.
In December 2005 I got email from a Brit; tabular data pasted from the Web was found to have trailing CHAR(160) which he resolved using the Replace procedure above. I know of no reason why this should happen with data from the Web as such. It sounded like the original source of the data may have been a bank mainframe, which implies sloppy work at the bank setting up the data connection from the mainframe to the Website. In April 2007 I got email about a similar case involving data pasted from a Web site, again resolved using the Replace procedure.