Zip it Up with File Compression

Tim McLellan

Dispatch, September / October 1995, Computing and Network Services, University of Alberta
ISB Connector, March 1996, Information Systems Branch, Ministry of Forests, Government of BC

Copyright 1995 by Tim McLellan. All Rights Reserved. Distribution or publication of this article in any form, electronic or otherwise, without the consent of the author is a violation of copyright.

Contents
- Introduction
- File Compression Programs
- PC File Storage
- File Slack
- PKZip
- PKZip Options
- WinZip
- Zip It Up

As an integral component of most computers, the hard drive makes it very easy to store and accumulate program and data files. As the files get older, we might not use them as often as when they were first stored. Within a couple of years, we can establish a large collection of historical or aged files. The files may still have some relevance and value so we cannot delete them. Depending on the computer's operating system and the type of hard drive, these older files can represent a significant portion of the space on the hard drive. If we could move them out of the mainstream until we need them again, it would free up some valuable hard drive space.

One way to sideline the files would be to move them off the hard drive entirely and onto floppy disks. That solution has two downsides. First, the floppy disks could get misplaced or damaged. Second, even when the floppy disks are found, we must copy the files back to the hard drive before we can use them. This is not as immediate or convenient as having the files already on the hard drive. Another way to sideline the files is to create a directory for older files and move the files to that directory. This removes them from the mainstream files, but the older files still take up the same amount of space as they did in the original directory.

File Compression Programs

A better solution is to use a file compression program. These programs work by creating a compressed file from any program or data file. The file compression program decides on the best compression method to use with the data in the file. Once it chooses the method, the program reduces the repetition and redundancy in the file. Then it stores the compressed data in a new file called an archive file. An archive file is a collection of one or more compressed files.

Although compressed, the data in the files remains intact. The file compression program simply reduces the disk space taken up by the files. To visualize the process, think of a wet string mop. When we wring it out, it gets compacted. In essence, though, it is still the same mop; it only looks different.

Usually, we compress only files that are not currently being used, or we compress when we desperately need more disk space. Once we compress or archive a file, most application programs cannot use it. We have to uncompress or extract it first. The file compression program usually has an option to uncompress the file. If not, it probably has a companion program that will do it. In either case, the uncompress process restores the file to its original condition. Then we can use the file as usual. Add water to the wrung-out mop and it comes back to its full recognizable shape.
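The round trip described above can be illustrated with a minimal Python sketch. It uses the zlib module from the Python standard library rather than any of the programs named in this article, and the sample text is invented for the illustration.

```python
import zlib

# Hypothetical sample data with plenty of repetition (an assumption for this sketch).
original = b"Budget summary for 1992. " * 40

# Compress, then uncompress again.
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

# The compressed copy is smaller, and the restored copy matches the original exactly.
print(len(original), len(compressed))
print(restored == original)
```

The final comparison is the point of the mop analogy: nothing is lost, the data only takes a different shape while it is stored.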

Archiving and extracting a file may take a few seconds depending on its size and the type of data. In day-to-day reading and writing of computer files, a constant delay of a few seconds would not be acceptable to most users. This is why operating systems do not usually compress a file when they save or write to it. Disk-doubling, memory-resident PC programs such as Disk Doubler and Stacker compress and uncompress on demand when DOS saves or retrieves a file. Some application programs even compress their own data files. Most file compression programs, by contrast, archive and extract only when the user requests it.

PC File Storage

Before we see how PC file compression programs reduce disk space, we will look at how DOS stores files. DOS creates and manages files almost the same way for both floppy disks and hard drives. While formatting a disk, DOS divides it into concentric rings called tracks. Then it divides the tracks into sectors. The number of tracks and sectors varies according to the type and size of disk. The size of each sector is always the same, 512 characters or bytes. DOS groups consecutive sectors together to make a cluster. The number of sectors in a cluster also depends on the type and size of disk. On a 1.44 Mb floppy disk, DOS makes a cluster with 1 sector and its size is 512 bytes. A 510 Mb hard drive has 16 sectors in a cluster, so the hard drive's cluster size is 8192 bytes (16 × 512).

When DOS creates or expands a file on a disk, it allocates disk space to the file by cluster. The file uses as little or as much of the cluster as necessary. The cluster belongs to that file and no other file can use any part of it.


When the file grows larger, DOS allocates another cluster, and so on. This continues until DOS has allocated enough clusters to store the file completely on the disk.


A file might not use all of the space in the last allocated cluster. Even so, no other file can use the remaining space in that cluster; it belongs only to the file that owns the cluster.
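The cluster-by-cluster allocation described above amounts to rounding the file size up to a whole number of clusters. The following Python sketch shows that arithmetic; the function name allocated_bytes is hypothetical, and the sector and cluster figures are the ones given in this article.

```python
SECTOR_BYTES = 512  # sector size given in the article


def allocated_bytes(file_size, sectors_per_cluster):
    """Return (clusters, bytes) reserved for a file of file_size bytes."""
    cluster_bytes = sectors_per_cluster * SECTOR_BYTES
    # Round up: a partly used cluster still counts as a whole cluster.
    clusters = (file_size + cluster_bytes - 1) // cluster_bytes
    return clusters, clusters * cluster_bytes


print(allocated_bytes(3000, 1))    # 1.44 Mb floppy: (6, 3072)
print(allocated_bytes(3000, 16))   # 510 Mb hard drive: (1, 8192)
print(allocated_bytes(10000, 16))  # a larger file needs a second cluster: (2, 16384)
```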

File Slack

We refer to the unused portion of a cluster as the slack. The file slack helps to show the efficiency of the operating system's file storage method. Too little slack means the operating system has to frequently search the disk to find an available cluster and then allocate it to the file that is getting larger. Too much slack means the disk may have a large amount of unusable disk space tied up in the slack.

Slack associated with a file is not too hard to figure out. The slack is the number of unused bytes in the last cluster. To get a percentage, divide the slack by the size of the cluster. For example, consider a 3000 byte file. On a floppy disk, DOS allocates a total of 6 clusters, or 6 sectors, or 3072 bytes to the file. The file uses 440 bytes in the last cluster, so 72 (512 minus 440) bytes are unused. It has a slack of 14 percent (72 ÷ 512). On the 510 Mb hard drive, DOS allocates 1 cluster, or 16 sectors, or 8192 bytes to the file. The file uses 3000 bytes in the one cluster, so 5192 (8192 minus 3000) bytes are unused. The slack is then 63 percent (5192 ÷ 8192).

Now consider two files of 3000 bytes each. They use 6144 bytes on the floppy disk and have a total slack of 144 bytes. On the hard drive, the two files use 16 384 bytes and have a total slack of 10 384 bytes. This amount is not too much to worry about on a 510 Mb drive. When the drive has 200 or 300 similar files then maybe it does become a consideration.
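The slack figures above can be checked with a few lines of Python. This is only a sketch of the arithmetic in the text; the function name slack is hypothetical, and the figure of 300 files is taken from the upper end of the range mentioned above.

```python
def slack(file_size, cluster_bytes):
    """Return (unused bytes in the last cluster, that amount as a percent of a cluster)."""
    used_in_last = file_size % cluster_bytes
    # A file that exactly fills its last cluster has no slack.
    unused = (cluster_bytes - used_in_last) % cluster_bytes
    return unused, 100 * unused / cluster_bytes


print(slack(3000, 512))    # floppy disk: (72, 14.0625)
print(slack(3000, 8192))   # hard drive: (5192, 63.37890625)

# Total slack for 300 similar files on the hard drive, in bytes.
print(300 * slack(3000, 8192)[0])  # 1557600
```

At 300 such files the slack alone comes to roughly 1.5 million bytes, which is where it starts to matter on a 510 Mb drive.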

Disk slack space is only one reason a hard drive fills up. Another is the way in which DOS stores the data in the files. For instance, if a file has many repeating characters, DOS simply stores all of them as they are. A file compression program, on the other hand, recognizes the repetition of characters. In its simplest form, the compression program counts the number of times a character repeats. Then it stores just the count and one occurrence of the character. Depending on how many characters repeat and how often, this process alone saves disk space. Every file compression program uses its own processes or commonly known algorithms for compressing files.
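The counting technique described above is commonly called run-length encoding. The Python sketch below illustrates only that simplest form; it is not the method any particular compression program uses, and the function names and sample string are invented for the example.

```python
from itertools import groupby


def rle_encode(text):
    """Collapse each run of a repeated character into a (count, character) pair."""
    return [(len(list(run)), char) for char, run in groupby(text)]


def rle_decode(pairs):
    """Rebuild the original text from the (count, character) pairs."""
    return "".join(char * count for count, char in pairs)


sample = "AAAAAABBBCDDDD"
encoded = rle_encode(sample)
print(encoded)                        # [(6, 'A'), (3, 'B'), (1, 'C'), (4, 'D')]
print(rle_decode(encoded) == sample)  # True
```

Fourteen characters become four pairs. Data with few repeats gains nothing from this scheme, which is one reason real programs rely on more elaborate methods.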

PKZip

File compression programs are available for both PC and Macintosh computers. For the PC we have PKZip, LHArc, ARJ, and ZOO among others. On the Macintosh we have StuffIt, CompactPro, and others. The most popular file compression program for the PC is PKZip from PKWare.

PKZip is a shareware program. Shareware means users can examine the program at no charge. If they try it and find it useful, they are required to pay a nominal fee to the author. Users can also purchase PKZip commercially, in which case they receive a preprinted manual as well. By registering and paying for the program, users motivate the author to improve it.

PKZip has gone through many enhancements over the years. The current version number is 2.04g. Users and sysops (system operators) have widely shared and distributed the program, and it is widely recognized as an industry standard. Copies of the program can be found on most FTP (file transfer protocol) sites and bulletin board services that deal with PC files.

When PKZip archives a file, it reads the file, analyzes it, and determines the best compression method. Then it writes the compressed data to an archive file, which is a separate file from the original file. The archive file usually has the .ZIP file extension. This file extension has become an industry standard. Anytime we see a .ZIP file we can be sure it is a PKZip file or an equivalent archive file.

Like other file compression programs, PKZip can store multiple files in one archive file. This eliminates the wasted slack of each individual file. The compress program stores the compressed files in the archive file one after the other, with no slack between them. The only slack belongs to the archive file itself.


A directory can have more than one archive file. For example, it could have a budget documents archive file called BUDGSUMM.ZIP and an archive file for correspondence called CORRESP.ZIP. These multiple archive files act like subdirectories in that they organize the files into groups with common purposes.

Depending on the data in the original file, PKZip might compress a file down to 20 percent or less of its original size. We say that this file has a compression rate of 80 percent. Many files have compression rates between 30 percent and 70 percent. At those rates, PKZip would compress a 3000 byte file down to somewhere between 900 and 2100 bytes. With only the one compressed file in it, the archive file will use only 1 cluster or 8192 bytes on the hard drive. The slack after the compressed file is somewhere between 6092 and 7292 bytes. That is enough room for at least two more similarly compressed files.


When we do compress and move additional files to the archive file, the original clusters held by those files will be free for use by other files. Three 3000 byte files each use 8192 bytes or a total of 24 576 bytes on the hard drive described earlier. When compressed into one archive file, together they use no more than 1 cluster or 8192 bytes.


Compressing the three files saves 16 384 bytes. Two-thirds of the original space taken up by the three files is now free for other files to use.
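The saving can be worked through in a short Python sketch. It assumes the least favourable compressed size mentioned earlier, 2100 bytes per file, and, like the text, it ignores the small amount of bookkeeping data an archive file keeps for each entry. All names in the sketch are hypothetical.

```python
CLUSTER_BYTES = 8192  # cluster size of the 510 Mb hard drive in the article


def disk_bytes(size):
    """Bytes actually tied up on disk by a file of the given size."""
    clusters = (size + CLUSTER_BYTES - 1) // CLUSTER_BYTES  # round up to whole clusters
    return clusters * CLUSTER_BYTES


original_sizes = [3000, 3000, 3000]
compressed_sizes = [2100, 2100, 2100]  # assumed: 30 percent compression rate

separate = sum(disk_bytes(size) for size in original_sizes)  # each file has its own cluster
archived = disk_bytes(sum(compressed_sizes))                 # one archive file holds all three
saved = separate - archived

print(separate, archived, saved)  # 24576 8192 16384
```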

PKZip Options

PKZip has many options to control its operation. In its simplest DOS command line form, PKZip compresses a file and adds it to an archive file. The following command archives the BUDGET92.LTR file and adds it to the OLDLETRS.ZIP archive file:

pkzip oldletrs budget92.ltr

When PKZip adds a compressed file to an archive file, the original file is left on the disk. The real disk savings come when PKZip removes the original file from the disk. Users can choose to have the original file erased after the compression by using the -m (move) option as in the following command:

pkzip -m oldletrs summry93.ltr

To uncompress, un-archive, extract, or unzip a file, we use the PKUnzip program. The following command tells PKUnzip to extract the BUDGET92.LTR file from the OLDLETRS.ZIP archive file, and to recreate the BUDGET92.LTR file in its original condition:

pkunzip oldletrs budget92.ltr

The compressed BUDGET92.LTR file still remains in the archive file. PKUnzip simply extracts the file and saves it in the current directory with the original file name and in the original state.

To delete a compressed file that is in an archive file, we use the PKZip program and the -d (delete) option. To delete the SUMMRY93.LTR file from the OLDLETRS.ZIP archive file, enter the following command:

pkzip -d oldletrs summry93.ltr

PKZip squeezes together the remaining compressed files in the archive to eliminate the space once used by the removed file.

Other PKZip command options include ones to view the contents of the archive file, encrypt an archived file with a password, freshen the contents of the archive with more up-to-date files, test the integrity of the archive file, do a recursive search through subdirectories, and span the archive file across multiple disks. PKZip is a very powerful and useful program.
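For readers who prefer to script such tasks, the add, view, test, and extract operations described above can be approximated with the zipfile module in the Python standard library. The sketch below is not PKZip and does not use its options. It borrows the file names from the earlier examples, and the letter's contents and the temporary working directory are invented for the illustration.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as work:
    letter = os.path.join(work, "BUDGET92.LTR")
    archive = os.path.join(work, "OLDLETRS.ZIP")

    # Create a hypothetical letter to archive.
    with open(letter, "w") as f:
        f.write("Dear committee, the 1992 budget is attached.\n" * 50)

    # Add the compressed file to the archive; the original stays on disk.
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(letter, arcname="BUDGET92.LTR")

    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()  # view the contents of the archive
        bad = zf.testzip()     # integrity test: None when every entry checks out
        out_dir = os.path.join(work, "restored")
        zf.extract("BUDGET92.LTR", path=out_dir)  # the entry also remains in the archive

    # Compare the extracted copy with the original.
    with open(os.path.join(out_dir, "BUDGET92.LTR")) as f:
        restored_text = f.read()
    with open(letter) as f:
        original_text = f.read()
    same = restored_text == original_text

print(names, bad, same)
```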

WinZip

WinZip is another shareware file compression program, produced by Nico Mak Computing. WinZip uses the same process as PKZip to compress and extract files from archive files. However, WinZip is a Windows program. It recognizes the drag-and-drop technique of Windows. We can drag file names from the File Manager into the WinZip window, and WinZip immediately adds those files to the currently opened archive file. WinZip saves Windows users from having to exit Windows to work with an archive file.

As a Windows program, WinZip has the menus, lists, and buttons familiar to the Windows environment. This makes it easier for a less-technical user to manipulate the available options without having to memorize an alphabet of cryptic command codes. WinZip has most of the functions of PKZip.

Zip It Up

A file compression program is a valuable tool in the computer arena. It helps to alleviate the headache of a full disk and helps minimize large directories of historical files. Anyone who manages or works with the files on a hard drive should consider having a file compression program as one utility in their toolbox.


e-mail: tim.mclellan@islandnet.com
www: www.islandnet.com/~tmc/