Replacing NAs with the latest non-NA value

In a data.frame (or data.table), I would like to "fill forward" NAs with the closest previous non-NA value. A simple example, using vectors (instead of a data.frame), is the following:

> y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)

I would like a function fill.NAs() that allows me to construct yy such that:

> yy
[1] NA  2  2  2  2  3  3  4  4  4

I need to repeat this operation for many small data.frames (~30-50 MB each, ~1 TB in total), where a row is NA if all of its entries are NA. What is a good way to approach the problem?

The ugly solution I cooked up uses this function:

last <- function (x){
    x[length(x)]
}    

fill.NAs <- function(isNA){
if (isNA[1] == 1) {
    isNA[1:max({which(isNA==0)[1]-1},1)] <- 0 # first is NAs 
                                              # can't be forward filled
}
isNA.neg <- isNA.pos <- isNA.diff <- diff(isNA)
isNA.pos[isNA.diff < 0] <- 0
isNA.neg[isNA.diff > 0] <- 0
which.isNA.neg <- which(as.logical(isNA.neg))
if (length(which.isNA.neg)==0) return(NULL) # generates warnings later, but works
which.isNA.pos <- which(as.logical(isNA.pos))
which.isNA <- which(as.logical(isNA))
if (length(which.isNA.neg)==length(which.isNA.pos)){
    replacement <- rep(which.isNA.pos[2:length(which.isNA.neg)], 
                                which.isNA.neg[2:max(length(which.isNA.neg)-1,2)] - 
                                which.isNA.pos[1:max(length(which.isNA.neg)-1,1)])      
    replacement <- c(replacement, rep(last(which.isNA.pos), last(which.isNA) - last(which.isNA.pos)))
} else {
    replacement <- rep(which.isNA.pos[1:length(which.isNA.neg)], which.isNA.neg - which.isNA.pos[1:length(which.isNA.neg)])     
    replacement <- c(replacement, rep(last(which.isNA.pos), last(which.isNA) - last(which.isNA.pos)))
}
replacement
}

The function fill.NAs is used as follows:

y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)
isNA <- as.numeric(is.na(y))
replacement <- fill.NAs(isNA)
if (length(replacement)){
which.isNA <- which(as.logical(isNA))
to.replace <- which.isNA[which(isNA==0)[1]:length(which.isNA)]
y[to.replace] <- y[replacement]
}

Output

> y
[1] NA  2  2  2  2  3  3  4  4  4

... which seems to work. But, man, is it ugly! Any suggestions?

Jan 01 '70 08:01 Ryogi

You probably want to use the na.locf() function from the zoo package to carry the last observation forward and replace your NA values.

Here is the beginning of its usage example from the help page:

library(zoo)

az <- zoo(1:6)

bz <- zoo(c(2,NA,1,4,5,2))

na.locf(bz)
1 2 3 4 5 6 
2 2 1 4 5 2 

na.locf(bz, fromLast = TRUE)
1 2 3 4 5 6 
2 1 1 4 5 2 

cz <- zoo(c(NA,9,3,2,3,2))

na.locf(cz)
2 3 4 5 6 
9 3 2 3 2
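
Applied to the vector from the question, a minimal sketch might look like the following. It assumes the zoo package is installed. Passing na.rm = FALSE keeps the leading NA instead of dropping it, so the result has the same length as the input.

```r
library(zoo)

# Vector from the question
y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)

# na.rm = FALSE keeps the leading NA, so length(yy) equals length(y)
yy <- na.locf(y, na.rm = FALSE)
print(yy)
```

With the default na.rm = TRUE, the leading NA would be removed, as in the cz example above, which matters when the filled vector has to go back into a data.frame column.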

Oct 12 '2011 05:10 Dirk is no longer here

Sorry for digging up an old question. I couldn't look up the function to do this job on the train, so I wrote one myself.

I was proud to find out that it's a tiny bit faster.
It's less flexible, though.

But it plays nicely with ave, which is what I needed.

repeat.before = function(x) {   # repeats the last non NA value. Keeps leading NA
    ind = which(!is.na(x))      # get positions of nonmissing values
    if(is.na(x[1]))             # if it begins with a missing, add the 
          ind = c(1,ind)        # first position to the indices
    rep(x[ind], times = diff(   # repeat the values at these indices
       c(ind, length(x) + 1) )) # diffing the indices + length yields how often 
}                               # they need to be repeated

library(zoo)  # provides na.locf, used as the baseline

x = c(NA,NA,'a',NA,NA,NA,NA,NA,NA,NA,NA,'b','c','d',NA,NA,NA,NA,NA,'e')
xx = rep(x, 1000000)
system.time({ yzoo = na.locf(xx, na.rm = FALSE) })
## user  system elapsed
## 2.754   0.667   3.406
system.time({ yrep = repeat.before(xx) })
## user  system elapsed
## 0.597   0.199   0.793
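
As a rough sketch of the ave use case mentioned above: ave splits a vector by a grouping variable and applies a function to each piece, so the fill restarts at every group boundary. The data frame df and its columns g and v are made up for illustration, and repeat.before is repeated here only so the snippet runs on its own.

```r
# Copy of repeat.before from above, so this snippet is self-contained
repeat.before <- function(x) {
    ind <- which(!is.na(x))                 # positions of non-missing values
    if (is.na(x[1])) ind <- c(1, ind)       # keep a leading NA as its own run
    rep(x[ind], times = diff(c(ind, length(x) + 1)))
}

# Hypothetical example data: two groups, group "b" starts with NA
df <- data.frame(g = c("a", "a", "a", "b", "b", "b"),
                 v = c(1, NA, NA, NA, 5, NA))

# Fill within each group; the leading NA of "b" is not filled from "a"
df$v_filled <- ave(df$v, df$g, FUN = repeat.before)
print(df)
```

Because the function keeps leading NAs and always returns a vector of the same length as its input, it fits the contract ave expects from FUN.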

Edit

As this became my most upvoted answer, I was often reminded that I don't use my own function, because I often need zoo's maxgap argument. Because zoo has some weird problems in edge cases when I use dplyr + dates that I couldn't debug, I came back to this today to improve my old function.

I benchmarked my improved function and all the other entries here. For the basic set of features, tidyr::fill is fastest and also doesn't fail the edge cases. The Rcpp entry by @BrandonBertelsen is faster still, but it's inflexible regarding the input's type (he tested edge cases incorrectly due to a misunderstanding of all.equal).
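
For reference, a minimal sketch of the tidyr::fill call on the data from the question, assuming the tidyr package is installed. The data frame name df is made up for illustration. fill works on data frame columns rather than bare vectors, and its default direction is downward.

```r
library(tidyr)

# Vector from the question, wrapped in a data frame because
# tidyr::fill operates on columns of a data frame
df <- data.frame(y = c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA))

# Default .direction = "down" carries the last non-NA value forward;
# the leading NA stays NA because nothing precedes it
filled <- fill(df, y)
print(filled$y)
```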

If you need maxgap, my function below is faster than zoo (and doesn't have the weird problems with dates).

I put up the documentation of my tests.

new function

repeat_last = function(x, forward = TRUE, maxgap = Inf, na.rm = FALSE) {
    if (!forward) x = rev(x)           # reverse x twice if carrying backward
    ind = which(!is.na(x))             # get positions of nonmissing values
    if (is.na(x[1]) && !na.rm)         # if it begins with NA
        ind = c(1,ind)                 # add first pos
    rep_times = diff(                  # diffing the indices + length yields how often
        c(ind, length(x) + 1) )          # they need to be repeated
    if (maxgap < Inf) {
        exceed = rep_times - 1 > maxgap  # exceeding maxgap
        if (any(exceed)) {               # any exceed?
            ind = sort(c(ind[exceed] + 1, ind))      # add NA in gaps
            rep_times = diff(c(ind, length(x) + 1) ) # diff again
        }
    }
    x = rep(x[ind], times = rep_times) # repeat the values at these indices
    if (!forward) x = rev(x)           # second reversion
    x
}
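
To make the arguments concrete, here is a small usage sketch. The input vector x is made up for illustration, and the function body is repeated without comments only so the snippet runs on its own. As I read the code, a gap longer than maxgap is left entirely unfilled rather than partially filled.

```r
# Copy of repeat_last from above, so this snippet runs on its own
repeat_last <- function(x, forward = TRUE, maxgap = Inf, na.rm = FALSE) {
    if (!forward) x <- rev(x)
    ind <- which(!is.na(x))
    if (is.na(x[1]) && !na.rm) ind <- c(1, ind)
    rep_times <- diff(c(ind, length(x) + 1))
    if (maxgap < Inf) {
        exceed <- rep_times - 1 > maxgap
        if (any(exceed)) {
            ind <- sort(c(ind[exceed] + 1, ind))
            rep_times <- diff(c(ind, length(x) + 1))
        }
    }
    x <- rep(x[ind], times = rep_times)
    if (!forward) x <- rev(x)
    x
}

# Made-up input: a gap of three NAs and a gap of one NA
x <- c(NA, 1, NA, NA, NA, 2, NA, 3)

fill_all   <- repeat_last(x)                   # NA 1 1 1 1 2 2 3
fill_short <- repeat_last(x, maxgap = 2)       # NA 1 NA NA NA 2 2 3
fill_back  <- repeat_last(x, forward = FALSE)  # 1 1 2 2 2 2 3 3
fill_trim  <- repeat_last(x, na.rm = TRUE)     # 1 1 1 1 2 2 3 (leading NA dropped)
```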

I've also put the function in my formr package (GitHub only).

Dec 10 '2012 22:12 Ruben

A data.table solution:

library(data.table)

dt <- data.table(y = c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA))
dt[, y_forward_fill := y[1], .(cumsum(!is.na(y)))]
dt
     y y_forward_fill
 1: NA             NA
 2:  2              2
 3:  2              2
 4: NA              2
 5: NA              2
 6:  3              3
 7: NA              3
 8:  4              4
 9: NA              4
10: NA              4
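
To see why grouping by cumsum(!is.na(y)) works: the running count goes up by one at every non-NA value, so each non-NA value and the NAs that follow it share one group id, and the first element of each group is the value to carry forward. A minimal base R sketch of the same idea, using ave in place of data.table's by (the names run_id and filled are made up for illustration):

```r
y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)

# Running count of non-NA values: 0 1 2 2 2 3 3 4 4 4
# Group 0 holds only the leading NA, so it stays NA
run_id <- cumsum(!is.na(y))

# Within each run, repeat the first element (the non-NA value that opens the run)
filled <- ave(y, run_id, FUN = function(v) v[1])
print(filled)
```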

This approach also works for forward filling zeros:

dt <- data.table(y = c(0, 2, -2, 0, 0, 3, 0, -4, 0, 0))
dt[, y_forward_fill := y[1], .(cumsum(y != 0))]
dt
     y y_forward_fill
 1:  0              0
 2:  2              2
 3: -2             -2
 4:  0             -2
 5:  0             -2
 6:  3              3
 7:  0              3
 8: -4             -4
 9:  0             -4
10:  0             -4

This method becomes very useful on data at scale and where you want to perform a forward fill by group(s), which is trivial with data.table. Just add the group(s) to the by clause before the cumsum logic.

dt <- data.table(group = sample(c('a', 'b'), 20, replace = TRUE), y = sample(c(1:4, rep(NA, 4)), 20 , replace = TRUE))
dt <- dt[order(group)]
dt[, y_forward_fill := y[1], .(group, cumsum(!is.na(y)))]
dt
    group  y y_forward_fill
 1:     a NA             NA
 2:     a NA             NA
 3:     a NA             NA
 4:     a  2              2
 5:     a NA              2
 6:     a  1              1
 7:     a NA              1
 8:     a  3              3
 9:     a NA              3
10:     a NA              3
11:     a  4              4
12:     a NA              4
13:     a  1              1
14:     a  4              4
15:     a NA              4
16:     a  3              3
17:     b  4              4
18:     b NA              4
19:     b NA              4
20:     b  2              2

Aug 09 '2017 16:08 Tony DiFranco