The format in which a scanned image is saved can have a significant effect on file size – and file size is an important consideration when scanning, since the high resolutions supported by many modern scanners can result in the creation of image files as large as 30MB for an A4 page.
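As a rough illustration of where figures of this order come from, the sketch below estimates the uncompressed size of a scan from its dimensions, resolution and colour depth. The function name and the A4 dimensions in inches (8.27 x 11.69) are assumptions made for the example, and it ignores file headers and row padding.

```python
def uncompressed_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Estimate raw image size: pixels across x pixels down x bytes per pixel."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed A4 page size in inches (210 x 297 mm).
A4_WIDTH_IN, A4_HEIGHT_IN = 8.27, 11.69

if __name__ == "__main__":
    for dpi in (150, 300, 600):
        size = uncompressed_size_bytes(A4_WIDTH_IN, A4_HEIGHT_IN, dpi, 24)
        print(f"{dpi} dpi, 24-bit colour: {size / 2**20:.1f} MB")
```

At 300 dpi in 24-bit colour this works out to roughly 25MB, and doubling the resolution quadruples the figure, which is why the choice of file format matters.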
Windows bitmap (BMP) files are the largest, since they store the image either in full colour without compression or in 256 colours with simple run-length encoding (RLE) compression. Images to be used as Windows wallpaper have to be saved in BMP format, but in most other cases the format is best avoided.
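The idea behind run-length encoding can be sketched in a few lines: a run of identical pixel values is replaced by a count and a single value. The code below is a minimal illustration of the principle only; it does not reproduce the exact byte layout or escape codes of the BMP RLE scheme, and the function names are chosen for this example.

```python
def rle_encode(data: bytes) -> list[tuple[int, int]]:
    """Collapse consecutive identical bytes into (count, value) pairs."""
    runs = []
    for value in data:
        # Extend the current run if the value repeats (runs capped at 255).
        if runs and runs[-1][1] == value and runs[-1][0] < 255:
            runs[-1] = (runs[-1][0] + 1, value)
        else:
            runs.append((1, value))
    return runs


def rle_decode(runs: list[tuple[int, int]]) -> bytes:
    """Expand (count, value) pairs back into the original bytes."""
    return bytes(value for count, value in runs for _ in range(count))


if __name__ == "__main__":
    row = bytes([0] * 12 + [7] * 3 + [0] * 9)  # one row of indexed pixels
    runs = rle_encode(row)
    print(runs)
    assert rle_decode(runs) == row
```

RLE works well on images with large flat areas, such as diagrams, and poorly on scanned photographs, where neighbouring pixels rarely share exactly the same value.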
Tagged image file format (TIFF) files are the most flexible, since they can store images in RGB mode for screen display, or CMYK for printing. TIFF also supports LZW compression, which can reduce the file size significantly without any loss of quality. LZW builds on two techniques introduced by Jacob Ziv and Abraham Lempel in 1977 and 1978, known as LZ77 and LZ78, and was published in 1984 by Terry Welch as a refinement of LZ78. LZ77 creates pointers back to repeating data, while LZ78 creates a dictionary of repeating phrases with pointers to those phrases.
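The dictionary approach can be sketched as follows. This is a minimal textbook LZW written for illustration, not the variant used inside TIFF files, which additionally packs codes into variable-width bit fields and uses special clear and end-of-information codes. The function names are assumptions for this example.

```python
def lzw_compress(data: bytes) -> list[int]:
    """Emit dictionary codes; new phrases are added as they are first seen."""
    dictionary = {bytes([i]): i for i in range(256)}  # all single bytes
    next_code = 256
    current = b""
    codes = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:
            current = candidate  # keep growing the matched phrase
        else:
            codes.append(dictionary[current])
            dictionary[candidate] = next_code
            next_code += 1
            current = bytes([byte])
    if current:
        codes.append(dictionary[current])
    return codes


def lzw_decompress(codes: list[int]) -> bytes:
    """Rebuild the same dictionary on the fly and expand the codes."""
    if not codes:
        return b""
    dictionary = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = dictionary[codes[0]]
    output = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:
            # The phrase being defined is used immediately.
            entry = previous + previous[:1]
        else:
            raise ValueError(f"invalid LZW code: {code}")
        output.append(entry)
        dictionary[next_code] = previous + entry[:1]
        next_code += 1
        previous = entry
    return b"".join(output)


if __name__ == "__main__":
    text = b"TOBEORNOTTOBEORTOBEORNOT"
    codes = lzw_compress(text)
    print(len(text), "bytes ->", len(codes), "codes")
    assert lzw_decompress(codes) == text
```

Because the decompressor reconstructs the dictionary from the code stream itself, no dictionary needs to be stored in the file, and the original data is recovered exactly.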
CompuServe’s graphics interchange format (GIF) stores images using indexed colour. Up to 256 colours are available in each image, although which colours these are can change from image to image. A table of RGB values for each indexed colour is stored at the start of the image file, and each pixel then holds only an index into that table. GIFs tend to be smaller than most other file formats because of this decreased colour depth, making them a good choice for use in WWW-published material.
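Indexed colour can be illustrated with a short sketch that builds a colour table from a list of RGB pixels and replaces each pixel with a one-byte index. It is a simplified illustration rather than a GIF encoder: it assumes the image already contains no more than 256 distinct colours, and the function name is hypothetical.

```python
def to_indexed(pixels):
    """Return (palette, indices) for a sequence of (r, g, b) tuples."""
    palette = []   # colour table: position in the list is the index
    lookup = {}    # maps an RGB tuple to its palette index
    indices = []
    for rgb in pixels:
        if rgb not in lookup:
            if len(palette) == 256:
                raise ValueError("more than 256 colours; reduce colours first")
            lookup[rgb] = len(palette)
            palette.append(rgb)
        indices.append(lookup[rgb])
    return palette, bytes(indices)


if __name__ == "__main__":
    red, blue = (255, 0, 0), (0, 0, 255)
    pixels = [red, red, blue, red] * 1000
    palette, indices = to_indexed(pixels)
    rgb_bytes = len(pixels) * 3
    indexed_bytes = len(indices) + len(palette) * 3
    print(f"RGB: {rgb_bytes} bytes, indexed: {indexed_bytes} bytes")
```

Each pixel shrinks from three bytes to one, at the cost of a small colour table, before any further compression is applied.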
The PC Paintbrush (PCX) format has fallen into disuse, but offers a compressed format at 24-bit colour depth.
The JPEG file format uses lossy compression and can achieve small file sizes at 24-bit colour depth. The level of compression, and hence the amount of data loss, can be selected, but even at the maximum quality setting JPEG loses some detail and is therefore only really suitable for viewing images on-line. The number of compression levels available depends on the image editing software being used.
Unless there is a need to preserve colour information from the original document, images stored for subsequent OCR processing are best scanned in greyscale. At 8 bits per pixel rather than 24, this uses a third of the space of an RGB colour scan. An alternative is to scan in Line art mode, which is black and white with no greyscales, but this often loses detail, reducing the accuracy of the subsequent OCR process.
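The difference between the three modes can be sketched per pixel. The weights below are the widely used ITU-R BT.601 luma coefficients and the threshold of 128 is an arbitrary assumption; scanner drivers may use other values, so this is an illustration rather than a description of any particular scanner.

```python
def rgb_to_grey(r, g, b):
    """Reduce a 24-bit RGB pixel to one 8-bit grey level (BT.601 weights)."""
    return (299 * r + 587 * g + 114 * b + 500) // 1000


def grey_to_line_art(grey, threshold=128):
    """Reduce an 8-bit grey level to a single bit: 1 = white, 0 = black."""
    return 1 if grey >= threshold else 0


if __name__ == "__main__":
    # Two hypothetical pixels from the faint edge of a printed character.
    for rgb in [(150, 150, 150), (120, 120, 120)]:
        grey = rgb_to_grey(*rgb)
        print(rgb, "-> grey", grey, "-> line art", grey_to_line_art(grey))
```

In greyscale the two edge pixels keep distinct values that OCR software can still interpret, whereas in line art each is forced to pure black or pure white, which is how fine detail is lost.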
The table below illustrates the relative file sizes that can be achieved by the different file formats in storing a native 1MB image, and also indicates the colour depth supported:
| File format | Image size | No. of colours |
| --- | --- | --- |
| BMP – RGB | 1MB | 16.7 million |
| JPEG – min. compression | 185KB | 16.7 million |
| JPEG – min. progressive compression | 150KB | 16.7 million |
| JPEG – max. compression | 20KB | 16.7 million |
| JPEG – max. progressive compression | 16KB | 16.7 million |
| TIFF – LZW compression | 83KB | 16.7 million |
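The figures in the table can be turned into compression ratios with a few lines of arithmetic. The sketch below assumes that 1MB means 1,024KB; the sizes are copied from the table, and the function name is chosen for this example.

```python
ORIGINAL_KB = 1024  # assumption: the 1MB native image is 1,024KB

# File sizes in KB, as listed in the table.
SIZES_KB = {
    "JPEG - min. compression": 185,
    "JPEG - min. progressive compression": 150,
    "JPEG - max. compression": 20,
    "JPEG - max. progressive compression": 16,
    "TIFF - LZW compression": 83,
}


def compression_ratio(original_kb, compressed_kb):
    """How many times smaller the compressed file is than the original."""
    return original_kb / compressed_kb


if __name__ == "__main__":
    for name, size_kb in SIZES_KB.items():
        ratio = compression_ratio(ORIGINAL_KB, size_kb)
        print(f"{name}: {ratio:.1f}:1")
```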