`vignettes/data-auditing.Rmd`

Welcome to the ‘Data auditing’ vignette of the **jfa**
package. Here you can find a simple explanation of the functions in the
package that facilitate data auditing. For more detailed explanations of
each function, read the other vignettes on the package website.

Below you can find an explanation of the available data auditing
functions in **jfa**.

`digit_test()`

The function `digit_test()` takes a vector of numeric values, extracts
the requested digits, and compares the frequencies of these digits to a
reference distribution. By default, the function performs a frequentist
hypothesis test of the null hypothesis that the digits are distributed
according to the reference distribution and produces a *p* value. When a
prior is specified, the function instead performs a Bayesian hypothesis
test of this null hypothesis against the alternative hypothesis that the
digits are not distributed according to the reference distribution, and
produces a Bayes factor (Kass & Raftery, 1995).
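
To make the frequentist procedure concrete, the following base-R sketch extracts first digits and runs a chi-squared goodness-of-fit test against Benford's law. It is an illustration under stated assumptions, not the code that **jfa** uses internally: the helper `first_digit()` and the simulated vector `x` are hypothetical and exist only for this example.

```r
# Illustrative base-R sketch; NOT the jfa implementation.
# first_digit() is a hypothetical helper written for this example.
first_digit <- function(x) {
  x <- abs(x[!is.na(x) & x != 0])
  # Scale each value into [1, 10) and keep the integer part.
  as.integer(floor(x / 10^floor(log10(x))))
}

set.seed(1)
x <- rlnorm(500, meanlog = 5, sdlog = 2)     # simulated values (assumption)
d <- first_digit(x)

observed <- table(factor(d, levels = 1:9))   # counts for digits 1 to 9
benford <- log10(1 + 1 / (1:9))              # reference probabilities

# Chi-squared goodness-of-fit test with 9 - 1 = 8 degrees of freedom
res <- chisq.test(observed, p = benford)
res
```

The degrees of freedom (8) match the `df = 8` that a first-digit test reports, because nine digit categories are compared to fixed reference probabilities.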

*Full function with default arguments:*

```
digit_test(x,
           check = c("first", "last", "firsttwo"),
           reference = "benford",
           prior = FALSE)
```

*Supported options for the `check` argument:*

| `check` | Returns |
|---|---|
| `first` | First digit |
| `firsttwo` | First and second digit |
| `last` | Last digit |

*Supported options for the `reference` argument:*

| `reference` | Returns |
|---|---|
| `benford` | Benford’s law |
| `uniform` | Uniform distribution |
| Vector of probabilities | Custom distribution |

*Example usage:*

Benford’s law (Benford, 1938) is a principle that describes a pattern
in many naturally occurring numbers. According to Benford’s law, each
possible leading digit *d* in a naturally occurring, or
non-manipulated, set of numbers occurs with probability

$$p(d) = \log_{10}\left(1 + \frac{1}{d}\right), \quad d = 1, \ldots, 9.$$
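
As a quick numerical check, the probabilities implied by the standard form of Benford's law, `log10(1 + 1/d)`, can be tabulated in base R. The variable names are chosen for this example only.

```r
# Benford probabilities for leading digits 1 to 9,
# assuming the standard form p(d) = log10(1 + 1/d).
d <- 1:9
p_benford <- log10(1 + 1 / d)
names(p_benford) <- d

round(p_benford, 3)
sum(p_benford)  # the nine probabilities sum to 1
```

A leading 1 is expected in roughly 30% of values and a leading 9 in fewer than 5%, which is why a uniform spread of first digits can be a warning sign.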

The distribution of leading digits in a data set of financial
transaction values (e.g., the `sinoForest` data) can be extracted and
tested against the expected frequencies under Benford’s law using the
code below.

```
# Frequentist hypothesis test
digit_test(sinoForest$value, check = "first", reference = "benford")
```

```
##
## Digit distribution test
##
## data: sinoForest$value
## n = 772, X-squared = 7.6517, df = 8, p-value = 0.4682
## alternative hypothesis: leading digit(s) are not distributed according to the benford distribution.
```

```
# Bayesian hypothesis test using default prior
digit_test(sinoForest$value, check = "first", reference = "benford", prior = TRUE)
```

```
##
## Digit distribution test
##
## data: sinoForest$value
## n = 772, BF10 = 1.4493e-07
## alternative hypothesis: leading digit(s) are not distributed according to the benford distribution.
```

`repeated_test()`

The function `repeated_test()` analyzes the frequency with which values
are repeated within a set of numbers. Unlike Benford’s law and its
generalizations, this approach examines the entire number at once, not
only the first or last digit. For the technical details of this
procedure, see Simonsohn (2019).

*Full function with default arguments:*

```
repeated_test(x,
              check = "last",
              method = "af",
              samples = 2000)
```

*Supported options for the `check` argument:*

| `check` | Returns |
|---|---|
| `last` | Last decimal |
| `lasttwo` | Last two decimals |
| `all` | All decimals |

*Supported options for the `method` argument:*

| `method` | Returns |
|---|---|
| `af` | Average frequency |
| `entropy` | Entropy |
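
The two statistics can be sketched in base R. This example assumes the usual definitions, average frequency as the sum of squared frequencies divided by the sum of frequencies, and entropy as the Shannon entropy of the relative frequencies. It applies them to whole values in a made-up vector, so it illustrates the idea rather than reproducing how **jfa** computes them for the selected decimals.

```r
# Illustrative sketch with made-up data; NOT the jfa implementation.
x <- c(1.25, 3.10, 1.25, 7.40, 3.10, 1.25)

f <- as.numeric(table(x))      # frequency of each distinct value

# Average frequency (assumed definition): sum(f^2) / sum(f)
af <- sum(f^2) / sum(f)

# Shannon entropy of the relative frequencies (assumed definition)
p <- f / sum(f)
entropy <- -sum(p * log(p))

af
entropy
```

More repetition pushes the average frequency up and the entropy down, so the two methods look at the same pattern from opposite directions.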

*Example usage:*

In this example, we analyze a data set from a (retracted) paper that
describes three experiments run in Chinese factories, where workers were
nudged to use more hand sanitizer. These data were shown to exhibit two
classic markers of data tampering: impossibly similar means and an
uneven distribution of last digits (Yu, Nelson, & Simonsohn, 2018).
We can use the `repeated_test()` function to test whether these data
also contain more repeated values than expected if the data were not
tampered with.

`repeated_test(sanitizer$value, check = "lasttwo", samples = 5000)`

```
##
## Repeated values test
##
## data: sanitizer$value
## n = 1600, AF = 1.5225, p-value = 0.0028
## alternative hypothesis: average frequency in data is greater than for random data.
```

To validate the statistical results, **jfa**’s automated unit tests
regularly verify the main output from the package against the following
benchmarks:

- benford.analysis (R package version 0.1.5)
- BenfordTests (R package version 1.2.0)
- BeyondBenford (R package version 1.4)

- Benford, F. (1938). The law of anomalous numbers. *Proceedings of the American Philosophical Society*, 551-572.
- Kass, R. E., & Raftery, A. E. (1995). Bayes factors. *Journal of the American Statistical Association*, *90*(430), 773-795.
- Simonsohn, U. (2019, May 25). *Number-Bunching: A New Tool for Forensic Data Analysis*.
- Yu, F., Nelson, L., & Simonsohn, U. (2018, December 5). *In Press at Psychological Science: A New ‘Nudge’ Supported by Implausible Data*.