http://ftp.aset.psu.edu/pub/ger/documents/handstat.html
Academic Services and Emerging Technologies
and Consulting and Support Services
29 June 2004
INTRODUCTION
To get an idea of what porting sizable data files to a microcomputer implies, we ran a short experiment "handling" a large sample file of integer data. The handling operations and their timings are reported here. The results show that, at least for a few simple kinds of data manipulation, relatively large files can be handled on microcomputers rather efficiently. They also give some idea of the minimal "power" a microcomputer needs in order to process this kind of data. Part of that "power" is having reasonably defragmented disks; on fragmented disks, I/O time increases dramatically, particularly for large files. We also show where to find out more about these topics and provide download sources for the various tools used here.
By the way, we do know that microcomputer versions of SAS®, MINITAB®, and SPSS® will handle large files similar to the one used here. We did not time these applications, but performance was quite good for these statistical products on the platforms described below. The Lahey Fortran maximum file size (Version 5.7b and above) is 2**64 bytes = 16 exabytes (binary), or roughly 17 billion GB. See MAXIMUM FILE SIZES BY OPERATING SYSTEM at the end of this paper.
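The 2**64 figure can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original tests; it simply expresses 2**64 bytes in binary (1024-based) and decimal (1000-based) units, and the variable names are illustrative.

```python
# Express 2**64 bytes in binary (1024-based) and decimal (1000-based) units.
max_bytes = 2 ** 64

binary_gb = max_bytes // 1024 ** 3   # gigabytes of 2**30 bytes each
binary_eb = max_bytes // 1024 ** 6   # exabytes of 2**60 bytes each
decimal_eb = max_bytes / 1000 ** 6   # exabytes of 10**18 bytes each

print(f"{max_bytes:,} bytes")
print(f"{binary_eb} binary exabytes = {binary_gb:,} binary gigabytes")
print(f"{decimal_eb:.2f} decimal exabytes")
```

In binary units the limit is exactly 16 exabytes; in decimal units it is somewhat more than 18.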
The differences in timings show how drastically faster CPUs and more RAM can affect performance. Timings were done on an otherwise idle (quiesced) system.
1) Dell Optiplex GX1p, 500MHz, 384MB RAM, 1GB swap file, two 8GB IDE fixed disks. Operating system: Windows NT 4, SP 6. File systems: FAT16 and NTFS Version 5. If the FAT and NTFS file systems are both defragmented, there is no substantive difference in file performance for the tests run here. If the file systems are not defragmented, I/O times increase noticeably (a factor of two was observed). Thus it pays to have a good defragmenter for both FAT and NTFS.
2) Dell Optiplex GX110, 800MHz, 500MB RAM, 2GB swap file, two 10GB IDE fixed disks. Operating system: Windows 2000 with FAT32.
3) Same platform as (1) above, but the drive used for timing is a PSU DCE-DFS shared network space. Timings with two files were done with both files on the DFS disk. Note carefully that network drives do not have the same data transfer integrity as fixed disks do; see: http://ftp.aset.psu.edu/pub/ger/documents/DataIntegrity.htm.
SAMPLE FILE USED FOR THESE TIMINGS
The file, women94a.data (women94a.zip when compressed), is an ASCII file of 145,648,282 bytes (139MB). Bill uploaded this file from the mainframe for a research project, so it is actual data. Its content is supposed to be integers, blanks, and minus (-) signs. It is a fixed-record-length file with 5083 records (lines), each of which is 28652 bytes wide.
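The stated record count, record width, and file size can be reconciled with simple arithmetic. The sketch below assumes, and this is an assumption not stated in the text, that each record ends with a two-byte CR/LF terminator, which is the usual convention for DOS/Windows text files.

```python
RECORDS = 5083             # records (lines) in the file, from the text
DATA_WIDTH = 28652         # data bytes per record, from the text
FILE_SIZE = 145_648_282    # total file size in bytes, from the text

# Assumption: every record is followed by a 2-byte CR/LF line terminator.
TERMINATOR = 2

record_length = DATA_WIDTH + TERMINATOR
computed_size = RECORDS * record_length
size_mb = FILE_SIZE / 2 ** 20          # megabytes of 2**20 bytes each

print(computed_size == FILE_SIZE)
print(f"{size_mb:.1f} MB")
```

Under that assumption the product matches the reported size exactly, and the size in units of 2**20 bytes rounds to the 139MB quoted above.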
KEDIT® is an excellent Windows 9x/NT text editor. See http://www.kedit.com/.
VEDIT® is another editor choice for large files; see: http://www.vedit.com/
Operations Timed: Input, edit, scan for validity,
output, subset.
Input
Command: Kedit women94a.data (width 29000
Time: 1) 32 seconds
2) 5 seconds
3) 167 seconds
Rate: 1) 4.3 MB per second
2) 27 MB per second
3) 0.83 MB per second
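The rates in this paper appear to be the file size divided by the elapsed time. A minimal sketch of that calculation follows; the function name is illustrative, and the 139 MB size and the three input times are taken from the text.

```python
def rate_mb_per_second(size_mb, seconds):
    """Throughput as file size divided by elapsed time."""
    return size_mb / seconds

FILE_MB = 139  # size of the sample file in MB, from the text

# Elapsed input times in seconds for platforms 1, 2 and 3, from the text.
for platform, seconds in ((1, 32), (2, 5), (3, 167)):
    rate = rate_mb_per_second(FILE_MB, seconds)
    print(f"{platform}) {rate:.2f} MB per second")
```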
Edit
Kedit Subcommands: add, delete, copy or modify, move lines; search
for string.
Time: virtually instant.
Scan
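Kedit subcommand to scan the file for valid character content, that is, to show every line containing a character other than a digit, a blank, or a minus sign:
all reg /[~0-9 \-]
Or, to display every line containing a character outside the printable ASCII range (blank through ~), issue the Kedit subcommand:
all reg /[~ -\x7E]/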
Time: 1) 45 seconds
2) 29 seconds
Rate: 1) 3.1 MB per second
2) 4.7 MB per second
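The same validity scan can be sketched outside the editor. The following is an illustrative Python equivalent, not the Kedit implementation: it reports each record that contains a byte other than a digit, a blank, or a minus sign. The function name and the sample records are hypothetical.

```python
import re

# Any byte that is not a digit, a blank, or a minus sign counts as invalid.
INVALID = re.compile(rb"[^0-9 \-]")

def invalid_records(lines):
    """Yield (record_number, offset) for each record with an invalid byte."""
    for number, line in enumerate(lines, start=1):
        # Ignore the line terminator so CR/LF is not reported as invalid.
        match = INVALID.search(line.rstrip(b"\r\n"))
        if match:
            yield number, match.start()

# Hypothetical sample records; the second one contains a stray letter.
sample = [b"  12 -7 003\r\n", b" 45 x9  -1\r\n", b"-1 -2 -3\r\n"]
print(list(invalid_records(sample)))
```

A file opened in binary mode can be passed in place of the sample list, since iterating over it yields one line at a time.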
Output
Kedit subcommand: FILE/SAVE w.dat
Time: 1) 32 seconds
2) 9 seconds
3) 122 seconds
Rate: 1) 4.3 MB per second
2) 15.4 MB per second
3) 1.14 MB per second
Subset
Kedit (width 29000 then issue Kedit subcommand: get women94a.data
101 100 to get and edit records 101, 102, ... 200.
Time: 1) 4 seconds
2) < 1 second
3) 25 seconds
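Because the file has a fixed record length, a range of records can also be fetched by seeking directly to a byte offset rather than reading from the top. The sketch below illustrates the idea on a small in-memory stand-in; the function name is hypothetical, and for the real file the record length would be 28652 data bytes plus the line terminator (assumed here to be two bytes, CR/LF).

```python
import io

def read_records(stream, first, count, record_length):
    """Return up to `count` records starting at 1-based record `first`."""
    stream.seek((first - 1) * record_length)
    block = stream.read(count * record_length)
    return [block[i:i + record_length]
            for i in range(0, len(block), record_length)]

# In-memory stand-in: 10 records of 6 data bytes plus CR/LF (8 bytes each).
demo = io.BytesIO(b"".join(b"%06d\r\n" % n for n in range(1, 11)))
print(read_records(demo, first=4, count=3, record_length=8))
```

With a file opened in binary mode, `read_records(f, 101, 100, 28654)` would correspond to the "get ... 101 100" subset above, under the CR/LF assumption.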
The following editors/viewers, which work on all Win32 systems, will also
edit or view this test file (and files up to 2GB):
The V Viewer, a fine piece of work, will view this file virtually
instantly. It is shareware, available at: http://www.fileviewer.com/
The XV Hex Editor/Viewer will edit this file. It is free and
available at: http://www.chmaas.handshake.de/
Use SAS for more sophisticated criteria for subsetting a large
file.
SYSTEM COPY and SORTING (on defragmented disks)
System COPY command: COPY women94a.data w.dat
Time: 1) 33 seconds
2) 8 seconds
3) 239 seconds
Rate: 1) 4.2 MB per second
2) 17.4 MB per second
3) 0.58 MB per second
The following Sorts yield identical results:
System command: SORT /+10 < women94a.data > wsort.dat
(for Windows 2000 we used: SORT /+10 /rec 29000 /O wsort.dat women94a.data)
Time: 1) 181 seconds
2) 17 seconds
3) 435 seconds
Rate: 1) 0.79 MB per second
2) 8.2 MB per second
3) 0.32 MB per second
Command: Kedit women94a.data, subcommands: SORT * A 1 10,
FILE wsort.dat
Time: 1) 65 seconds
2) 12 seconds
3) 359 seconds
Rate: 1) 2.1 MB per second
2) 11.6 MB per second
3) 0.39 MB per second
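For readers without Kedit, a sort on a column range can be sketched as follows. This is an illustrative Python analogue of "SORT * A 1 10" (ascending on columns 1 through 10), not the method used for the timings; it compares raw byte values, which is an assumption and may differ from the collating order an editor or the system SORT command applies.

```python
def sort_records(lines, first_col=1, last_col=10):
    """Sort records ascending on the 1-based columns first_col..last_col."""
    # Python's sort is stable: records with equal keys keep their order.
    return sorted(lines, key=lambda line: line[first_col - 1:last_col])

# Hypothetical records: a 10-column key followed by other data.
records = [b"0000000300 c\r\n", b"0000000100 a\r\n", b"0000000200 b\r\n"]
for record in sort_records(records):
    print(record)
```

This holds the whole file in memory, which is reasonable for a 139MB file on a machine with RAM comparable to the platforms described here.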
COMPARE LARGE FILES FOR CONTENT
Here we compare two large files using the native Windows (system) FC command and COMPARE.EXE, a Fortran implementation built as a 32-bit console command. The two files compare "same" or "identical" in this case. On systems with adequate RAM and a paging file greater than twice the test file size, repeating a compare command yields much faster results. E.g., an immediate second run of the COMPARE test below completes in an elapsed time of 10 seconds instead of 304 seconds.
System compare: FC w.dat women94a.data
Time: 1) 513 seconds
2) 305 seconds
3) 878 seconds
Rate: 1) 0.28 MB per second
2) 4.7 MB per second
3) 0.16 MB per second
COMPARE w.dat women94a.data
Time: 1) 34 seconds
2) 5 seconds
3) 304 seconds
Rate: 1) 4.1 MB per second
2) 27 MB per second
3) 0.46 MB per second
The COMPARE utility may be found at: http://ftp.aset.psu.edu/pub/ger/fortran/hdk/compare.exe. Documentation is in the file: http://ftp.aset.psu.edu/pub/ger/fortran/hdk/compare.txt. COMPARE.EXE compares 256000 characters per compare (cpc); native Windows FC compares 1 cpc. This is one reason COMPARE was written and is made available to the public.
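The idea of comparing in large blocks rather than one character at a time can be sketched as below. This is an illustration only and is not the source of COMPARE.EXE; the function name is hypothetical, and the default block size simply reuses the 256000 figure quoted above.

```python
import io

def same_content(a, b, chunk_size=256_000):
    """Compare two binary streams block by block; True if identical."""
    while True:
        block_a = a.read(chunk_size)
        block_b = b.read(chunk_size)
        if block_a != block_b:
            return False       # contents or lengths differ
        if not block_a:
            return True        # both streams ended at the same point

# In-memory stand-ins for two files of 600,000 bytes each.
left = io.BytesIO(b"7" * 600_000)
right = io.BytesIO(b"7" * 600_000)
print(same_content(left, right))
```

Each block comparison is a single bulk operation, so the per-character overhead is paid once per 256000 bytes instead of once per byte.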
Note: To binary compare many pairs of files for "same" or "different" in two subdirectories, and also optionally in two child subdirectories, use the program CSDIFF. Get the "Standalone" version from: http://www.ComponentSoftware.com/csdiff/. CSDIFF can also do an "intelligent" compare of two TEXT, HTML, or MS WORD files; it will display file differences in one of two easy-to-understand formats. CSDIFF is free for personal use.
COMPRESSING/UNCOMPRESSING PROGRAMS USED
INFOZIP ZIP and UNZIP are free Win32 Zip compress/uncompress programs that work under all Windows platforms. They are also available for other platforms. Here we compress and uncompress a large sample data file. InfoZip ZIP stores cyclic redundancy check (CRC) bytes in the zip file, and UNZIP checks the extracted data against them. INFOZIP ZIP/UNZIP for Windows 9x/NT/2000 are available on the Web: http://www.info-zip.org/pub/infozip/ ; see files Unzip=unznnn.exe and Zip=zipnnxN.zip, where nnn or nn is the current version number. Note that this ZIP/UNZIP will compress/decompress a single 8GB file. For more on the limits of InfoZip ZIP/UNZIP see: http://www.info-zip.org/FAQ.html#limits
zip -j women94a.zip women94a.data
compressed the file to: 12,006,112 bytes, including 92 bytes of CRC,
a size reduction of 92%.
Time: 1) 53 seconds
2) 27 seconds
Rate: 1) 2.6 MB per second
2) 5.1 MB per second
unzip women94a.zip
Time: 1) 37 seconds
2) 8 seconds
Rate: 1) 3.8 MB per second
2) 17.4 MB per second
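The CRC check and the size reduction can be illustrated with Python's standard zipfile module. This is a sketch on a small in-memory stand-in, not a substitute for the InfoZip tools timed above; the member name and the sample data are hypothetical.

```python
import io
import zipfile
import zlib

# Hypothetical stand-in data: repeated records of digits, blanks and minus.
data = b"  12 -7 003\r\n" * 1000

# Compress into an in-memory zip archive.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("sample.data", data)

archive.seek(0)
with zipfile.ZipFile(archive) as zf:
    info = zf.getinfo("sample.data")
    crc_matches = info.CRC == zlib.crc32(data)  # CRC stored in the archive
    first_bad = zf.testzip()                    # None when every CRC checks out
    reduction = 100 * (1 - info.compress_size / info.file_size)

print(crc_matches, first_bad, f"{reduction:.0f}% smaller")
```

The reduction is computed the same way as the percentage quoted for the sample file: one minus the ratio of compressed size to original size.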
unzip -p women94a.zip | readzip1.exe
Where readzip1.exe is the result of compiling the Fortran program:
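program readzip1
  ! Read 5083 fixed-length records from standard input and discard them.
  implicit none
  character(28652) :: line   ! one record of 28652 characters
  integer :: i
  do i = 1, 5083
    read(*,'(A)') line
  end do
end program readzip1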
Time: 1) 423 seconds
2) 367 seconds
Rate: 1) 0.33 MB per second
2) 0.38 MB per second
unzip -p women94a.zip > temp.dat
readzip2.exe
Where readzip2.exe is the result of compiling the Fortran program:
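program readzip2
  ! Read 5083 fixed-length records from the extracted file temp.dat.
  implicit none
  character(28652) :: line   ! one record of 28652 characters
  integer :: i
  open(unit=50, file='temp.dat', status='old', action='read')
  do i = 1, 5083
    read(50,'(A)') line
  end do
  close(50)
end program readzip2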
Time: 1) 45 seconds
2) 9 seconds
Rate: 1) 3.0 MB per second
2) 15.4 MB per second
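The two approaches above read the records either through a pipe or from an extracted temporary file. A third possibility, sketched here purely as an illustration and not timed in this paper, is to read the records straight from the archive member. The function name and the in-memory demo archive are hypothetical.

```python
import io
import zipfile

def count_records(archive, member):
    """Count the lines of one archive member without extracting it to disk."""
    with zipfile.ZipFile(archive) as zf, zf.open(member) as stream:
        return sum(1 for _line in stream)

# Build a small in-memory archive holding 50 short records.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("temp.dat", b"".join(b"%06d\r\n" % n for n in range(50)))

demo.seek(0)
print(count_records(demo, "temp.dat"))
```

No claim is made here about how this would compare in speed with either of the timed approaches.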
PKWARE® PKZIP and PKUNZIP are commercial versions of Zip compression tools. Here we use the DOS 32-bit versions. Other versions, including versions that run via Windows Explorer, are available at: http://www.pkware.com.
PKZIP -a -! women94a.zip WOMEN9~1.DAT
compressed the file to: 11,590,145 bytes, a size reduction of 92%. This
version of PKZIP/PKUNZIP recognizes only DOS 8.3 file names. The -!
option creates "authentication" check bytes, similar to the CRC of ZIP/UNZIP
above.
Time: 1) 38 seconds
2) 21 seconds
Rate: 1) 3.7 MB per second
2) 6.6 MB per second
PKUNZIP women94a.zip
Time: 1) 39 seconds
2) 8 seconds
Rate: 1) 3.6 MB per second
2) 17.4 MB per second
NOTE: At least three products will handle very large compressed
files in a variety of formats.
Power Archiver: http://www.powerarchiver.com/
WinZip: http://www.winzip.com/
V-Viewer: http://www.fileviewer.com/
MAXIMUM FILE SIZES BY OPERATING SYSTEM
For maximum file sizes for FAT/FAT32/NTFS see: http://www.microsoft.com/resources/documentation/Windows/XP/all/reskit/en-us/prkc_fil_tdrn.asp
Also see:
FILE SYSTEM SUPPORT BY OPERATING SYSTEM
---------------------------------------------------
OPERATING SYSTEM          FILE SYSTEMS SUPPORTED
---------------------------------------------------
MS-DOS, Windows 95        FAT16
Windows 95 OSR2, 98, Me   FAT16, FAT32
Windows NT, 2000, XP      NTFS, FAT16, FAT32
Linux                     Ext2, FAT32, FAT16
---------------------------------------------------
File System Specs
----------------------------------------------------
FILE     FILE NAME      MAXIMUM   MAXIMUM
SYSTEM   LENGTH         VOLUME    FILE
         (CHARACTERS)   SIZE      SIZE
----------------------------------------------------
FAT16    8              2GB*      2GB
FAT32    255            2TB       4GB
NTFS     255            16TB      16TB
Ext2     255            4TB       2GB
----------------------------------------------------
*4GB under Windows NT
Above tables courtesy of Computer World: http://www.computerworld.com/softwaretopics/os/story/0,10801,73872,00.html
Big Numbers of Bytes
1 Terabyte = 1000 Gigabytes
1 Petabyte = 1000 Terabytes
1 Exabyte = 1000 Petabytes
1 Zettabyte = 1000 Exabytes
1 Yottabyte = 1000 Zettabytes