Knowing And Overcoming Windows Hazards
http://ftp.aset.psu.edu/pub/ger/documents/DataIntegrity.htm
Academic Services and Emerging Technologies
Graduate Education and Research Services
Outreach Services
28 February 2005
Introduction
There are hazards that can compromise file integrity when using features of microcomputer operating systems to transfer or copy data and programs. Here we give some details, examples, and practical recommendations that will help you improve the reliability of copying and moving data and program files. We also deal, in a practical sense, with reducing the risks of data loss associated with the "weakest links" of various devices. Please note that the related reliability statistics (Bit Error Rates) are not from manufacturers’ published tables or formal Bit Error Rate Test (BERT) sets, but rather from the experience of this author and others. A Bit Error Rate of 10**(-n) means that each bit transferred has a probability of 10**(-n) of being in error; that is, on average, about one bit error can be expected for every 10**n bits transmitted.
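To make the Bit Error Rate figures used in this article concrete, the following minimal Python sketch computes the expected number of bit errors and the probability of at least one bit error in a transfer. It assumes that bit errors are independent, which real devices only approximate; the function names are our own and purely illustrative.

```python
import math


def expected_bit_errors(ber: float, n_bits: int) -> float:
    """Average number of corrupted bits when n_bits are transferred."""
    return ber * n_bits


def prob_at_least_one_error(ber: float, n_bits: int) -> float:
    """Probability that at least one of n_bits is corrupted.

    Assumes independent bit errors, so the result is 1 - (1 - ber)**n_bits.
    log1p/expm1 keep the arithmetic accurate for very small ber values.
    """
    return -math.expm1(n_bits * math.log1p(-ber))


if __name__ == "__main__":
    one_megabyte_in_bits = 8 * 10**6
    for ber in (1e-8, 1e-9, 1e-10):
        p = prob_at_least_one_error(ber, 100 * one_megabyte_in_bits)
        print(f"BER {ber:g}: P(at least one bit error in 100 MB) = {p:.4f}")
```

Note that transferring 10**n bits at a Bit Error Rate of 10**(-n) gives one expected error, but the chance of seeing at least one error in that transfer is about 63 percent, not 100 percent.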
Reliability and Recommendations
Some microcomputer media we will deal with here are 1) Fixed Disks; 2) Floppy Diskettes, including Zip and Jaz diskettes; 3) CD-R/RW writing; 4) FTP; 5) Remote Network File Sharing; 6) Data Collection Devices Attached to Ports; 7) Encrypted Files; and Other Methods That Help To Increase Data Integrity. In each case below, realize that reliability of the remote location of files as well as their companion data channels (e.g., modems) may be unknown and may in fact be considerably less reliable than the file media you are using locally. While somewhat unrelated to this topic, it is assumed here that the computer systems involved are virus free. While these principles apply to any computer, here we deal more specifically with PCs running Windows(TM). More detailed hardware information on these and other devices may be found in the Winn L. Rosch Hardware Bible, http://www.hardwarebible.com/, the PC Guide at http://www.pcguide.com/, and Computer hardware tutorials at: http://www.hardwarecentral.com/hardwarecentral/tutorials/.
An interesting study on Corruption of TCP connections by Malformed Acknowledgements by Wietse Venema, IBM T.J. Watson Research Center, Hawthorne, NY may be found at: ftp://ftp.porcupine.org/pub/debugging/ack-corruption.README.
The topic of Error Detection and Correction is reviewed in various levels of detail at:
Error Control - Available Books: http://non.com/books/Error_Control_ca.html
Error Correction FAQ (by ECC Technologies): http://members.aol.com/mnecctek/faqs.html
Introduction to Error Control: http://www.rad.com/networks/1994/err_con/intro.htm
Fast MD5 Checksum Utility (for file integrity control): http://www.fastsum.com/
Understanding Cyclic Redundancy Checks: http://www.4d.com/ACIDOC/CMU/CMU79909.HTM
Cyclic Redundancy Check Calculator: http://www.efg2.com/Lab/Mathematics/CRC.htm
CRC32 code by David Powell et al (compatible with PKWARE CRC): http://www.iss.u-net.com/crc32.htm
Advanced CRC checker for files/folders by Irnis I. Haliullin: http://www.irnis.net/
1) Fixed Disks
Fixed disks have relatively sophisticated data path logic built into their disk controllers. They also have short, precise data channels. Because of these properties, and because their magnetic media are sealed air-tight and manufactured to be highly consistent, fixed disks are among the most reliable of PC media. Fixed disks also have faster access and data transfer times than floppy disks, Internet FTP, or network file spaces. Thus, when files are being transferred to or from remote data media, such as FTP or shared files on a network, we recommend that they be transferred from or to a fixed disk, and then copied or moved later (see 2 below) to less reliable media under controlled circumstances. For example, if you intend to FTP a file from floppy disk to a remote FTP server (or vice versa), transfer it (see 2 below) first to a fixed disk and FTP it from there.
2) Floppy Diskettes – Including Zip and Jaz Diskettes
Even though floppy diskette controllers do a cyclic redundancy check, experience has shown that the Bit Error Rate of a floppy disk is between 10**(-9) and 10**(-8). Therefore, when writing a relatively large file to floppy disk, a bit error could occur as often as once in every 10**8 bits (about 12.5 MB of data), depending on the condition of the floppy disk and floppy disk drive. Dirt and dust on the diskette or in the drive make this medium less reliable. Older floppies are generally less reliable than newer floppies, not only because of exposure to dust and dirt, but also because magnetic media wear (magnetic sensitivity can diminish) over time through physical use and unfavorable environments. Floppy diskettes should not be exposed to moisture, extremes of heat and cold, or magnetic sources (telephones, magnetic coils, or power supplies, for example) and should be at room temperature before they are used.
When possible, avoid installing applications, device drivers, Cab or Zip or other compressed files, or any other install files directly from floppy diskette.
When Windows drag/drop or Edit-copy/Edit-paste is used to copy file(s) to or from a floppy diskette, there is a relatively high probability of compromising data (about one bit error per 10**9 bits transferred, and possibly as many as one per 10**8 bits). No error message may be given if the data is compromised (corrupted). Thus, floppy diskettes are not very robust. If the file is a program and it is corrupted, it may further compromise your system. If any file is corrupted, research results may be compromised, perhaps without your knowing it.
Recommendations:
When copying or moving PC files, learn to use the COPY or XCOPY command with the /V (verify) option at a DOS Prompt:
Example: copy data files from C:\XDATA to floppy disk in drive A:
COPY C:\XDATA\*.DAT A:\ /V
Example: copy all files from folder (subdirectory) C:\XDATA to floppy disk in drive A:
XCOPY C:\XDATA A:\ /V
The /V option causes the copy command to compare source files and target files and to retry the copy if they are not identical. If retries fail, an error message is given;
AND ALSO
If relatively few (perhaps very large) files are involved in a transfer, get the free utility program called COMPARE via: http://ftp.aset.psu.edu/pub/ger/fortran/hdk/compare.exe, documentation of which is at: http://ftp.aset.psu.edu/pub/ger/fortran/hdk/compare.txt. This program will indicate with a brief message whether files are identical. Either CSDIFF or COMPARE may be used via drag/drop or as a DOS Prompt command. Issue CSDIFF /? or COMPARE /? for related details.
If the files are relatively small, the native system command FC (File Compare; issue FC /? for syntax) can be used to compare the source and target file(s). This command is rather slow for very large files; for very large files see, for example: http://ftp.aset.psu.edu/pub/ger/documents/handstat.html#Compare
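The kind of byte-for-byte check that a file comparison tool performs can be sketched in a few lines of Python. This is only an illustration of the idea, not the method used by COMPARE, CSDIFF, or FC; the function name first_difference and the 1 MB chunk size are our own assumptions.

```python
from typing import Optional

CHUNK_SIZE = 1 << 20  # read 1 MB at a time so very large files are not loaded whole


def first_difference(path_a: str, path_b: str,
                     chunk_size: int = CHUNK_SIZE) -> Optional[int]:
    """Return the byte offset of the first difference, or None if identical."""
    offset = 0
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            block_a = file_a.read(chunk_size)
            block_b = file_b.read(chunk_size)
            if block_a != block_b:
                # Find the first mismatching byte inside this pair of blocks.
                for index, (byte_a, byte_b) in enumerate(zip(block_a, block_b)):
                    if byte_a != byte_b:
                        return offset + index
                # One block is a prefix of the other: the files differ in length.
                return offset + min(len(block_a), len(block_b))
            if not block_a:
                return None  # both files ended together with no mismatch
            offset += len(block_a)
```

For example, first_difference(r"C:\XDATA\RUN1.DAT", r"A:\RUN1.DAT") (hypothetical file names) returns None when the copy on the diskette matches the original.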
3) CD-R/RW Writing
The reliability of CD-ROM is less than that of a fixed disk. Therefore, after writing files to a CD-R/RW, it is good practice to compare the source and target files. For more information see the CD FAQ: ftp://rtfm.mit.edu/pub/usenet-by-group/comp.publish.cdrom.hardware/
4) FTP
Using the Internet File Transfer Protocol (FTP) to transfer files to and from a PC can (and does) occasionally compromise the integrity of the files involved, even though a checksum is used for TCP/IP packets. Again, reliability is affected by several factors, including the data channels and Internet connection devices (for example, modems, phone cables, etc.). FTP to a fixed disk is more reliable than FTP directly to floppy diskettes.
Recommendations:
Do not FTP to a floppy disk. FTP to a fixed disk and use the recommendations in 2 above for floppy disks; that is, copy with Verify or copy and compare the files to or from floppy disks. When using FTP either with an FTP client or a Web browser, the Bit Error Rate is typically much better than that of floppy diskettes. However, even under good Internet conditions, the Bit Error Rate of transferring files across the Internet is about 10**(-10). Thus, for every 10**10 bits (about 1.25 gigabytes) transferred, about one bit error can be expected on average. That may not seem like much, but heavy use of FTP over a year can result in at least one such error. When using FTP to obtain install files and research data, consider downloading two copies of each file, storing them in different fixed disk folders, and then comparing the two copies. If they are not identical, FTP a third copy and repeat the comparison. If you regularly use the Internet to transfer files, especially through many "hops" and via modem channels, you may eventually see cases where a third transfer of a file is necessary to avoid compromising the data integrity of the transferred file. Please note that a corrupted install file often installs with no error messages, but subsequently often fails to work correctly.
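The "download twice, compare, and fetch a third copy if needed" procedure can be sketched as follows. This is a simplified illustration under stated assumptions: fetch is a hypothetical caller-supplied function that performs one complete transfer and returns the file contents as bytes, and copies are held in memory and compared by SHA-256 digest rather than stored in separate folders.

```python
import hashlib
from typing import Callable, Dict


def fetch_until_two_agree(fetch: Callable[[], bytes], max_attempts: int = 3) -> bytes:
    """Repeat a transfer until two copies are identical, then return that copy.

    `fetch` is a hypothetical callable supplied by the caller; each call
    should perform one full, independent transfer of the same file.
    """
    seen: Dict[str, bytes] = {}  # digest of each copy received so far
    for _ in range(max_attempts):
        data = fetch()
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            return data  # this copy matches an earlier one
        seen[digest] = data
    raise OSError(f"no two of {max_attempts} transfers were identical")
```

In practice fetch might wrap an FTP client library call; two agreeing copies make random transfer corruption unlikely, but they cannot reveal a file that was already corrupted on the server.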
5) Remote Network File Sharing
Experience has taught us that a bit error when copying files from fixed disk to a network drive is not all that unusual. Reliability can differ depending on the cables, devices, data channels, and network protocol through which the network is made accessible to your PC. This author has experienced several error bytes when copying one 25MB file to our local WINS network user space; in fact it took three repeated copy operations to get a correct image of that file onto this network. Even when a file has been copied to the network correctly, the integrity problem still exists each time it is retrieved by someone. Several other similar cases have been documented with other network protocols.
Recommendations:
We recommend either mapping network folders as a drive letter and using COPY with Verify, or comparing source and target files to confirm that they are identical (see 2 above). Likewise, we recommend NOT storing dynamic files (for example, bookmarks, temporary or program scratch files, buffers or caches, swap files, print spools) on network drives.
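A copy that is compared against its source and repeated on mismatch, as in the 25MB example above, might be sketched like this. The function names and the limit of three attempts are our own assumptions. Note also that reading the target back may be satisfied from a local cache rather than from the network drive itself, so this check is weaker than comparing after the file has been reopened from another machine.

```python
import hashlib
import shutil


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so that large files need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def copy_verified(source: str, target: str, max_attempts: int = 3) -> int:
    """Copy source to target, re-copying until the hashes match.

    Returns the number of attempts that were needed.
    """
    expected = sha256_of(source)
    for attempt in range(1, max_attempts + 1):
        shutil.copyfile(source, target)
        if sha256_of(target) == expected:
            return attempt
    raise OSError(f"{target} still differs from {source} after {max_attempts} copies")
```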
6) Data Collection Devices Attached to Ports
There are many data collection devices that attach to microcomputer ports. Depending on the specific application, data handled through these ports may or may not include error checking. In all cases we recommend the same thing that NOAA (National Oceanic and Atmospheric Administration) recommends (at http://www.ngs.noaa.gov/PROJECTS/GPSmanual/data.htm), Section B. Data Download, Reformatting, and Shipping Instructions: "Using your manufacturer's software, transfer the data from the GPS receiver to your computer two separate times. Place one copy of the data in a working directory and the second copy in a backup directory. Compare the files to ensure that uncorrupted data were successfully downloaded to your computer." That is, repeat data up/download through the companion Serial, USB, Parallel or custom port and then compare results. For more information on ports see: http://www.lvr.com/parport.htm.
7) Encrypted Files
Encrypted files, especially files encrypted with strong algorithms, can pose both security and integrity risks. For some strong encryption algorithms (e.g., the PGP CAST algorithm), a one-byte change anywhere in a plaintext file yields, when encrypted, a ciphertext file that has no ordered bytes in common with an encrypted version of the original plaintext file. This can be both good (because it is "strong") and disastrous, since the inverse could also be true: any change in a ciphertext file may render it undecryptable. From a security/integrity viewpoint, an intruder who accesses encrypted files and changes one or more bits of those files, or any bit error that occurs when copying or moving an encrypted file, might render the file unusable (useless); that is, it may not be possible to decrypt it to recover the plaintext, since the ciphertext has been changed. This may be a property of the encryption algorithms and/or of their implementations. Even if the implementation would allow decrypting, the result could be a plaintext file of "random" characters (not at all like the original plaintext file). This is easily verifiable by using implementations of "strong" algorithms (e.g., PGP).
A second integrity problem exists with many encryption/decryption implementations and their corresponding key creation schemes. Real-time events, like time-of-day and mouse-pointer coordinates, are used with the user's pass phrase to create "key ring files". If these key ring files are lost or corrupted (by copying or backing them up to unreliable media), even the owner of the file cannot re-create the key ring files (even if the owner knows the original pass phrase); this could result in ALL encrypted files being undecryptable (useless).
A third problem, with encrypted email, is that it cannot be scanned for viruses at the Internet level (or at any other level until it is decrypted). That is, with encrypted files it is not possible to trap viruses before these files reach the destination (your file system). Thus there is a conflict between encrypted security and virus security. This drives home the need for all computer users to be educated about virus prevention.
Not all vendors warn of these exposures; for example, see: http://www.pcguardian.com/, http://www.securecomputing.com/, http://www.F-Secure.com/ or http://www.rsasecurity.com/. Thus the above recommendations for copying/moving files apply even more to encrypted files. For more on this topic see The Risks Digest, Volume 21, Number 6 article by Camillo Sars at: http://catless.ncl.ac.uk/Risks/21.06.html#subj12.
Other Methods That Help To Increase Data Integrity
Using programming languages which are designed to diagnose illegal numeric input characters can provide a good check on the integrity of numeric data. Fortran provides such diagnostics. Being connected to research communities is another way to keep watch for better data handling methods. One Penn State Fortran Web page is: http://www.personal.psu.edu/faculty/h/d/hdk/fortran.html. Fortran compilers for the PC are reviewed at: http://ftp.aset.psu.edu/pub/ger/fortran/FortranPSUVM.html, http://www.polyhedron.com/ and http://ftp.aset.psu.edu/pub/ger/fortran/test/results.txt. The CAC Numerically Intensive Computing Group home page is: http://beatnic.cac.psu.edu/.
Use an Uninterruptible Power Supply (UPS) to avoid both surges and power failures that can cause your PC to lose data. When data is being written to fixed disk, write caches are used. Thus, if a PC is powered down abruptly (that is, not shut down normally), file(s) that were in the process of being written may not have been completely written and so may suffer integrity loss. When purchasing a UPS, make sure that its power backup specifications exceed the amount of power (Watts) your PC system complex (e.g., CPU & Monitor, Printer(s), Scanner) is using. A typical price for a 300 Watt, 110 Volt UPS is currently around $150. (Typical PC and monitor power consumption is 200 Watts; a typical ink jet printer uses 50 Watts or less.)
To increase the reliability of research computing with PCs, consider obtaining a PC with ECC (Error Correcting Code) RAM and making sure that ECC is turned on via the companion BIOS option. Not all PCs have this option available, so check with your vendor. More details on ECC RAM and error correction in general may be found at: http://www.whatis.com/ecc.htm and/or http://wombat.doc.ic.ac.uk/foldoc/foldoc.cgi?query=ECC&action=Search and/or http://www.techweb.com/encyclopedia/defineterm?term=ECCmemory. To determine whether your PC has ECC RAM and ECC mode is turned on, get the free WCPUID utility (follow Download link then WCPUID) from: http://www.h-oda.com/. For good information about memory and how to install memory chips see "The Ultimate Memory Guide": http://www.kingston.com/tools/umg/default.asp
When installing software from any source, exit or disable all running application programs, especially virus stoppers. Running programs, especially virus stoppers and schedulers, can and do have adverse effects on various install processes.
Some programs (PKZIP with its /! option, for example) use I/O program logic (e.g., 64-bit Cyclic Redundancy Checks) to detect integrity loss. Others use simple "sum checks", which help; but sum checks may not detect multiple-bit data loss.
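The weakness of simple sum checks compared with a Cyclic Redundancy Check can be demonstrated with a short Python sketch. The sum_check function here is a deliberately naive checksum of our own, and the CRC used is the 32-bit CRC from Python's standard zlib module rather than any particular program's check.

```python
import zlib


def sum_check(data: bytes) -> int:
    """A naive 8-bit checksum: the sum of all bytes modulo 256."""
    return sum(data) % 256


original = b"12345 67890"
transposed = b"12354 67890"    # two neighboring bytes swapped
compensated = b"22344 67890"   # one byte raised by 1, another lowered by 1

for damaged in (transposed, compensated):
    # The sum check is fooled because the byte values still add up the same.
    sum_fooled = sum_check(damaged) == sum_check(original)
    # The CRC changes, so the damage is detected.
    crc_detects = zlib.crc32(damaged) != zlib.crc32(original)
    print(damaged, "sum check fooled:", sum_fooled, "CRC detects:", crc_detects)
```

Both damaged strings contain multiple-bit changes that cancel out in the sum but alter the CRC.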
The skillful use of text editors on numeric data can also help to check the integrity of numeric data. One text editor for Windows 9x/NT is Kedit for Windows: http://www.kedit.com/. The Kedit for Windows Editor can not only edit files larger than one million lines (if your virtual storage swap file is large enough), but also has very powerful tools and subcommands to isolate on the screen lines matching (or not matching) possibly complex string and numeric patterns. For example, if an ASCII data file is supposed to contain only integers and your research programs fail to read the file correctly because of "illegal characters in an input format," you could Kedit this file and issue the Kedit subcommand: ALL REG /[~0-9 ]/ to display all lines that contain any character other than digits and blanks. If the data file is supposed to contain real numbers, the Kedit subcommand: ALL REG /[~0-9 .E\+\-]/ will display all lines that contain any character(s) that are not characters in formatted Real numbers. To display all lines containing non-keyboard characters issue the Kedit subcommand: ALL reg /[~ -\x7E]/ .
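For readers without Kedit, the same three checks can be approximated with Python regular expressions. This is a sketch of our own, not a Kedit feature: the patterns below are translations of the Kedit expressions above (Python writes a negated character class with ^ where Kedit uses ~), and the function name suspicious_lines is hypothetical.

```python
import re
from typing import Iterable, List, Pattern, Tuple

# Each pattern matches one character that should NOT appear in the data.
NOT_INTEGER = re.compile(r"[^0-9 ]")        # anything but digits and blanks
NOT_REAL = re.compile(r"[^0-9 .E+\-]")      # anything but formatted real numbers
NOT_KEYBOARD = re.compile(r"[^ -\x7E]")     # anything outside printable ASCII


def suspicious_lines(lines: Iterable[str],
                     pattern: Pattern[str]) -> List[Tuple[int, str]]:
    """Return (line number, line) for every line containing a disallowed character."""
    found = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")  # line endings are not data characters
        if pattern.search(text):
            found.append((number, text))
    return found
```

To scan a file, open it in text mode and pass the file object as lines, for example suspicious_lines(open("RUN1.DAT", encoding="latin-1"), NOT_INTEGER), where RUN1.DAT is a hypothetical file name and latin-1 is chosen so that every byte value can be read.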
Recognize the limitations of the various data types that a particular application supports. For example, Microsoft Excel, and many spreadsheet and statistical programs, normally represent numbers, including integers or whole numbers, using Double Precision floating-point format, which can represent every whole number exactly only up to 2**53 (about 9.007 x 10**15). Thus, to enter 16-digit credit card numbers in an Excel spreadsheet, for example, the cells would need to be marked as Text since many 16-digit whole numbers cannot be internally represented using Double Precision floating-point format. If 16-digit integers are entered as Numeric instead of Text, many of them will have a trailing zero in the 16th digit.
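The Double Precision limit can be checked directly in Python, whose float type uses the same 64-bit format on ordinary PCs. The function name survives_double is our own. This sketch shows only the limit of the number format itself; a spreadsheet program may apply further limits of its own (Excel, for example, is documented as keeping 15 significant digits).

```python
def survives_double(number: int) -> bool:
    """True if a whole number is unchanged by a round trip through Double Precision."""
    return int(float(number)) == number


LIMIT = 2**53  # 9007199254740992: every whole number up to here is exact

print(survives_double(LIMIT))          # True
print(survives_double(LIMIT + 1))      # False: stored as 9007199254740992
print(int(float(9999999999999999)))    # 10000000000000000
```

A 16-digit number below the limit, such as 1234567812345678, survives the round trip, while 16-digit numbers above it may not, which is why identifiers of this length are safer stored as text.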
Summary
Responsible use of the PC, especially as a research or business tool, carries with it responsibility for the integrity of your data. This article touched on a few practical areas that can help you increase data file integrity and thereby improve the reliability of computations involving your data.
Acknowledgments
The author gratefully acknowledges Susan K. Donner-Knoble, Peter M. Weiss, and the Academic Computing Newsletter Editors for careful review and suggestions that significantly improved this article.