R: How can I count instances of probable (human) rounding in survey responses

33 views Asked by At

I have a largish Census/BLS data set with roughly ten million person records and about 250 variables per record, including missing values. Many of these values are dollar amounts. I suspect, but do not know, that the rightmost digit is disproportionately a zero. I want to count the number of consecutive zeros starting from the right and excluding anything after the decimal point. I want to produce a table showing the frequency with which different numbers of zeros occur, and another which shows the frequency with which different digits occur as the rightmost nonzero digit.

So if my numbers were

13,568,700
   449,000
    43,560
    20,010
    34,600
    32,620

The tables I want would be For zeros:

 3.   1
 2.   2
 1.   3

And for rightmost nonzero digit:

digit.      count
      1.    1
      2.    1
      6.    2
      7.    1
      9.    1

I have a function that does this for a single number and increments some counters appropriately, but it is not at all vectorized and it is unacceptably slow. Because there are many variables that might have been rounded for each person, if I just run the function once on each number, I need to run it on the order of 500,000,000 times. If I wanted the leftmost digit it would be easy to vectorize, but I have not been able to work out a vectorized algorithm for either the number of right-hand zeros or the rightmost nonzero digit.

I'd be grateful for help from some person smarter, or at least more knowledgeable, than I am.

1

There are 1 answers

1
Jon Spring On

Starting from something not-so-fast as a proof of concept:

n = 5E7
set.seed(42)
library(tidyverse)
df1 <- tibble(signif = rpois(n, lambda = 2),
              x = floor(runif(n, max = 1000)) * 10^signif)
# Here I'll start with 50M rows
# (note, my fake data is a little misleading because it will understate some
# significant digits, e.g. when runif produces a number ending with zeroes.


tictoc::tic()
df2 <- df1 |>
  mutate(v1 = format(x, trim = TRUE, scientific = FALSE),
         zeroes = nchar(x) - nchar(sub("0*$", "", format(x, trim = TRUE, scientific = FALSE))),
         right_dig = (x %/% (10^zeroes)) %% 10)
tictoc::toc()

This takes about 90 seconds on my machine, and you need 10x as much, and multiple columns, so there's loads of room for improvement, but in a pinch maybe getting to borderline workable.

Result

# A tibble: 50,000,000 × 5
   signif       x v1      zeroes right_dig
    <int>   <dbl> <chr>    <int>     <dbl>
 1      4 9020000 9020000      4         2
 2      4 3770000 3770000      4         7
 3      1    5370 5370         1         7
 4      3  540000 540000       4         4
 5      2   95900 95900        2         9
 6      2   33700 33700        2         7
 7      3  656000 656000       3         6
 8      0     152 152          0         2
 9      2   43100 43100        2         1
10      3  659000 659000       3         9
# ℹ 49,999,990 more rows
# ℹ Use `print(n = ...)` to see more rows

Then it's straightforward and fast to run these:

df2 %>% 
  count(zeroes)

# A tibble: 6 × 2
  zeroes        n
   <int>    <int>
1      0  6094178
2      1 12837105
3      2 13457359
4      3  9459420
5      4  5228366
6      5  2923572


df2 %>% 
  count(right_dig)

# A tibble: 10 × 2
   right_dig       n
       <dbl>   <int>
 1         0 1234711
 2         1 5419854
 3         2 5419492
 4         3 5418213
 5         4 5417654
 6         5 5414904
 7         6 5418602
 8         7 5417624
 9         8 5419975
10         9 5418971