R: How can I count instances of probable (human) rounding in survey responses

Question

R: How can I count instances of probable (human) rounding in survey responses

33 views Asked by andrewH At 04 August 2023 at 04:16

I have a largish Census/BLS data set with roughly ten million person records and about 250 variables per record, including missing values. Many of these values are dollar amounts. I suspect, but do not know, that the rightmost digit is disproportionately a zero. I want to count the number of consecutive zeros starting from the right and excluding anything after the decimal point. I want to produce a table showing the frequency with which different numbers of zeros occur, and another which shows the frequency with which different digits occur as the rightmost nonzero digit.

So if my numbers were

The tables I want would be For zeros:

 3.   1
 2.   2
 1.   3

And for rightmost nonzero digit:

digit.      count
      1.    1
      2.    1
      6.    2
      7.    1
      9.    1

I have a function that does this for a single number and increments some counters appropriately, but it is not at all vectorized and it is unacceptably slow. Because there are many variables that might have been rounded for each person, if I just run the function once on each number, I need to run it on the order of 500,000,000 times. If I wanted the leftmost digit it would be easy to vectorize, but I have not been able to work out a vectorized algorithm for either the number of right-hand zeros or the rightmost nonzero digit.

I'd be grateful for help from some person smarter, or at least more knowledgeable, than I am.

Original Q&A

There are 1 answers

**Jon Spring** · Answer 1 · 2023-08-04T04:55:21+00:00

Starting from something not-so-fast as a proof of concept:

n = 5E7
set.seed(42)
library(tidyverse)
df1 <- tibble(signif = rpois(n, lambda = 2),
              x = floor(runif(n, max = 1000)) * 10^signif)
# Here I'll start with 50M rows
# (note, my fake data is a little misleading because it will understate some
# significant digits, e.g. when runif produces a number ending with zeroes.


tictoc::tic()
df2 <- df1 |>
  mutate(v1 = format(x, trim = TRUE, scientific = FALSE),
         zeroes = nchar(x) - nchar(sub("0*$", "", format(x, trim = TRUE, scientific = FALSE))),
         right_dig = (x %/% (10^zeroes)) %% 10)
tictoc::toc()

This takes about 90 seconds on my machine, and you need 10x as much, and multiple columns, so there's loads of room for improvement, but in a pinch maybe getting to borderline workable.

Result

# A tibble: 50,000,000 × 5
   signif       x v1      zeroes right_dig
    <int>   <dbl> <chr>    <int>     <dbl>
 1      4 9020000 9020000      4         2
 2      4 3770000 3770000      4         7
 3      1    5370 5370         1         7
 4      3  540000 540000       4         4
 5      2   95900 95900        2         9
 6      2   33700 33700        2         7
 7      3  656000 656000       3         6
 8      0     152 152          0         2
 9      2   43100 43100        2         1
10      3  659000 659000       3         9
# ℹ 49,999,990 more rows
# ℹ Use `print(n = ...)` to see more rows

Then it's straightforward and fast to run these:

df2 %>% 
  count(zeroes)

# A tibble: 6 × 2
  zeroes        n
   <int>    <int>
1      0  6094178
2      1 12837105
3      2 13457359
4      3  9459420
5      4  5228366
6      5  2923572


df2 %>% 
  count(right_dig)

# A tibble: 10 × 2
   right_dig       n
       <dbl>   <int>
 1         0 1234711
 2         1 5419854
 3         2 5419492
 4         3 5418213
 5         4 5417654
 6         5 5414904
 7         6 5418602
 8         7 5417624
 9         8 5419975
10         9 5418971

TechQA.

R: How can I count instances of probable (human) rounding in survey responses

There are 1 answers

Related Questions in R

Related Questions in PERFORMANCE

Related Questions in ROUNDING

Related Questions in EXPLORATORY-DATA-ANALYSIS

Popular Questions

Trending Questions