Analysis of Costs and Benefits of compressing data on the Usenet
USENET traffic occupies a growing share of data communications capacity; several gigabits per day is common for the popular newsgroups. The news servers that store the messages and files devote a correspondingly large amount of disc space to them. The prevalent standard for newsgroup traffic is the Network News Transfer Protocol (NNTP), which carries uncompressed text. This lack of compression wastes significant communications capacity and data storage space on the Internet. To analyse the costs and benefits of compressing newsgroup traffic, we installed and ran compression and decompression software under controlled conditions on the Internet. Using real newsgroup traffic generated by the subscribers of a typical Intranet, we found significant increases in the efficiency of data communications between news clients and news servers. In some cases useful traffic on the data links increased by 30%. Storage on the news servers was also used more efficiently: generally 30%, and in some cases 40%, more useful data was stored in the same disc space. The improvements depended on the characteristics of the newsgroup traffic being compressed and decompressed, in particular the average message length: traffic with a higher average message length showed a larger increase in efficiency.
We compared two compression algorithms, Zlib and Bzip2, and found that their relative performance depended on the characteristics of the newsgroup traffic. Zlib is more effective than Bzip2 for large text files, whereas Bzip2 is more effective than Zlib for traffic comprising shorter messages. We investigated the effect of the Zlib dictionary on the effectiveness of our system and found that the performance of Zlib depends strongly on the dictionary used to initialise the compressor. With a simple dictionary, compression on sample data increased from 30% to 40%. The optimum dictionary varied between newsgroups because of differing word patterns in each newsgroup's traffic. An implementation comprising multiple dictionaries is therefore the key to optimum performance by Zlib, and further work is needed to investigate this approach.
For the current investigation we created an NNTP proxy server program that allows unmodified newsreaders to read and create compressed messages, and we used this system for the reported performance evaluations. A second objective of our architecture was to facilitate adoption of the system across the Internet as an ancillary to NNTP. We defined our classes in Java, making them available to writers of Java newsreaders. We have conceived, built and used an effective compression and decompression system for newsgroup traffic that is compatible with the NNTP standard and with the existing infrastructure of news servers and clients. The results of our analysis of data compression under controlled USENET usage indicate that the design approach is widely applicable. Similar implementations on local Intranets could lead to widespread efficiencies in data storage and data transmission.
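The dictionary effect described in the abstract can be illustrated with the standard java.util.zip classes. The sketch below is not the authors' code: the class name DictionaryDemo, the dictionary contents (a handful of common article header names) and the sample message are all assumptions made for illustration. It shows only the mechanism, namely that compressor and decompressor must be initialised with the same preset dictionary.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class DictionaryDemo {

    // Hypothetical preset dictionary: header names that recur in news articles.
    static final byte[] DICTIONARY =
        "Path: Lines: References: Message-ID: Date: From: Subject: Newsgroups: "
            .getBytes(StandardCharsets.US_ASCII);

    // Compress input with zlib; a null dictionary means plain zlib.
    static byte[] compress(byte[] input, byte[] dictionary) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        if (dictionary != null) {
            deflater.setDictionary(dictionary); // must precede the first deflate call
        }
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress data, supplying the dictionary only when the stream asks for it.
    static byte[] decompress(byte[] data, byte[] dictionary) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n > 0) {
                    out.write(buffer, 0, n);
                } else if (inflater.needsDictionary()) {
                    if (dictionary == null) {
                        throw new IllegalArgumentException("stream requires a preset dictionary");
                    }
                    inflater.setDictionary(dictionary);
                } else if (inflater.needsInput()) {
                    throw new IllegalArgumentException("truncated compressed data");
                }
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt compressed data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Hypothetical short article; real traffic would come from a news server.
        byte[] message = ("From: reader@example.org\nNewsgroups: comp.compression\n"
            + "Subject: dictionary test\nDate: Sat, 01 Sep 2001 12:00:00 GMT\n\nShort body.\n")
            .getBytes(StandardCharsets.US_ASCII);
        System.out.println("original:           " + message.length + " bytes");
        System.out.println("zlib, no dictionary: " + compress(message, null).length + " bytes");
        System.out.println("zlib, dictionary:    " + compress(message, DICTIONARY).length + " bytes");
    }
}
```

Any saving from the dictionary depends entirely on how well its contents match the traffic, which is consistent with the abstract's observation that the optimum dictionary differs between newsgroups. A multi-dictionary scheme would additionally need some way to tell the decompressor which dictionary was used; that design is left open here, as it is in the abstract.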
Garratt, Paul
949a7e95-1648-47e6-85a0-a526dfa98f8e
Warde, Simon
e104dc4c-5254-4bd4-8258-fdf8d872ae2b
2001
Garratt, Paul and Warde, Simon (2001) Analysis of Costs and Benefits of compressing data on the Usenet. MS2001 International Conference on Modelling and Simulation in Distributed Applications, Changsha, China.
Record type: Conference or Workshop Item (Paper)
Text
changfull.doc
- Other
More information
Published date: 2001
Additional Information:
Event Dates: September 2001
Venue - Dates:
MS2001 INTERNATIONAL CONFERENCE ON MODELLING AND SIMULATION IN DISTRIBUTED APPLICATIONS, Changsha, China, 2001-09-01
Organisations:
Electronic & Software Systems
Identifiers
Local EPrints ID: 258659
URI: http://eprints.soton.ac.uk/id/eprint/258659
PURE UUID: ddb15518-632b-4414-b3a3-2146e0cac2e5
Catalogue record
Date deposited: 04 Dec 2003
Last modified: 14 Mar 2024 06:11
Contributors
Author: Paul Garratt
Author: Simon Warde