Data auditing: Get started

Introduction

Welcome to the ‘Data auditing’ vignette of the jfa package. This page provides a straightforward guide to the functions in the package that are designed to facilitate data auditing. Specifically, these functions implement techniques to test the distribution of (leading or last) digits against a reference distribution (e.g., Benford’s law), and techniques to assess whether values are repeated more frequently than expected. The package enables users to specify a prior probability distribution to perform Bayesian data auditing with these functions.

Functions and intended usage

Below you can find an explanation of the available data auditing functions in jfa.

digit_test()
repeated_test()

Testing digit distributions

The digit_test() function accepts a vector of numeric values, extracts the requested digits, and compares the frequencies of these digits to a reference distribution. By default, the function performs a frequentist hypothesis test of the null hypothesis that the digits are distributed according to the reference distribution, and produces a p-value. When a prior is specified, the function performs a Bayesian hypothesis test of the null hypothesis that the digits are distributed according to the reference distribution against the alternative hypothesis that the digits are not distributed according to the reference distribution, and produces a Bayes factor (Kass & Raftery, 1995). The function returns an object that can be used with the associated summary() and plot() methods.

For additional details about this function, please refer to the function documentation on the package website.

Example usage:

# Compare first digits to Benford's law
digit_test(sinoForest[["value"]], check = "first", reference = "benford")

## 
##  Classical Digit Distribution Test
## 
## data:  sinoForest[["value"]]
## n = 772, MAD = 0.0065981, X-squared = 7.6517, df = 8, p-value = 0.4682
## alternative hypothesis: leading digit(s) are not distributed according to the benford distribution.

Testing for repeated values

The repeated_test() function analyzes the frequency with which values are repeated within a set of numbers. Unlike Benford’s law, and its generalizations, this approach examines the entire number at once, not only the first or last digit. For the technical details of this procedure, see (Simonsohn, 2019). The function returns an object that can be used with the associated summary() and plot() methods.

For additional details about this function, please refer to the function documentation on the package website.

Example usage:

# Inspect last two digits for repeated values
repeated_test(sanitizer[["value"]], check = "lasttwo", samples = 5000)

## 
##  Classical Repeated Values Test
## 
## data:  sanitizer[["value"]]
## n = 1600, AF = 1.5225, p-value = 0.0022
## alternative hypothesis: average frequency in data is greater than for random data.

Benchmarks

To ensure the accuracy of statistical results, jfa employs automated unit tests that regularly validate the output from the package against the following established benchmarks in the area of data auditing:

benford.analysis (R package version 0.1.5)
BenfordTests (R package version 1.2.0)
BeyondBenford (R package version 1.4)

Cheat sheet

The cheat sheet below will help you get started with jfa’s data audit functionality. A pdf version can be downloaded here.

cheatsheet-data

References

Cinelli, C. (2018). Benford.analysis: Benford analysis for data validation and forensic analytics. https://CRAN.R-project.org/package=benford.analysis

Joenssen, D. W. (2015). BenfordTests: Statistical tests for evaluating conformity to benford’s law. https://CRAN.R-project.org/package=BenfordTests

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430), 773–795. https://doi.org/10.1080/01621459.1995.10476572

Simonsohn, U. (2019). Number-bunching: A new tool for forensic data analysis. https://datacolada.org/77

Stephane, B. D. S. (2020). BeyondBenford: Compare the goodness of fit of benford’s and blondeau da silva’s digit distributions to a given dataset. https://CRAN.R-project.org/package=BeyondBenford

Koen Derks