SoK: Prudent Evaluation Practices for Fuzzing

Moritz Schloegel, Nils Bars, Nico Schiller, Lukas Bernhard, Tobias Scharnowski, Addison Crump, Arash Ale-Ebrahim, Nicolai Bissantz, Marius Muench, Thorsten Holz

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


Abstract

Fuzzing has proven to be a highly effective approach to uncovering software bugs over the past decade. After AFL popularized the groundbreaking concept of lightweight coverage feedback, the field of fuzzing has seen a vast amount of scientific work proposing new techniques, improving methodological aspects of existing strategies, or porting existing methods to new domains. All such work must demonstrate its merit by showing its applicability to a problem, measuring its performance, and often showing its superiority over existing works in a thorough, empirical evaluation. Yet, fuzzing is highly sensitive to its target, environment, and circumstances, e.g., randomness in the testing process. After all, relying on randomness is one of the core principles of fuzzing, governing many aspects of a fuzzer's behavior. Since the evaluation environment is often difficult to control, the reproducibility of experiments is a crucial concern and requires a prudent evaluation setup. To address these threats to validity, several works, most notably Evaluating Fuzz Testing by Klees et al., have outlined how a carefully designed evaluation setup should be implemented, but it remains unknown to what extent their recommendations have been adopted in practice. In this work, we systematically analyze the evaluation of 150 fuzzing papers published at the top venues between 2018 and 2023. We study how existing guidelines are implemented and observe potential shortcomings and pitfalls. We find a surprising disregard of the existing guidelines regarding statistical tests and systematic errors in fuzzing evaluations. For example, when investigating reported bugs, we find that the search for vulnerabilities in real-world software leads to authors requesting and receiving CVEs of questionable quality. Extending our literature analysis to the practical domain, we attempt to reproduce the claims of eight fuzzing papers.
These case studies allow us to assess the practical reproducibility of fuzzing research and identify archetypal pitfalls in the evaluation design. Unfortunately, our reproduced results reveal several deficiencies in the studied papers, and we are unable to fully support and reproduce the respective claims. To help the field of fuzzing move toward a scientifically reproducible evaluation strategy, we propose updated guidelines for conducting a fuzzing evaluation that future work should follow.
Original language: English
Title of host publication: 2024 IEEE Symposium on Security and Privacy (SP)
Place of publication: Los Alamitos, CA, USA
Publisher: IEEE
ISBN (electronic): 9798350331301
Publication status: Published - 23 May 2024
Event: 2024 IEEE Symposium on Security and Privacy (SP) - San Francisco, United States
Duration: 19 May 2024 to 23 May 2024

Publication series

Name: Proceedings of the IEEE Symposium on Security and Privacy
Publisher: IEEE
ISSN (electronic): 2375-1207

Conference

Conference: 2024 IEEE Symposium on Security and Privacy (SP)
Country/Territory: United States
City: San Francisco
Period: 19/05/24 to 23/05/24

Bibliographical note

Acknowledgment:
We thank our anonymous shepherd and reviewers for their valuable feedback. Further, we thank Dominik Maier, Johannes Willbold, Daniel Klischies, Merlin Chlosta, and Marcel Böhme (in no particular order) for their helpful comments on a draft of this work. We also thank the countless researchers with whom we have discussed fuzzing research and how to evaluate it, ultimately paving the way for this work. This work was funded by the European Research Council (ERC) under the consolidator grant RS3 (101045669) and the German Federal Ministry of Education and Research under the grants KMU-Fuzz (16KIS1898) and CPSec (16KIS1899). Additionally, this research was partially supported by the UK Engineering and Physical Sciences Research Council (EPSRC) under grant EP/V000454/1. The results feed into DsbDtech.

Keywords

  • fuzzing
  • fuzz testing
  • reproducibility

