R: Trim a very long string to whole words, keeping the beginning and the end.

Solved • Unai Vicente asked • 1 answer

Suppose I have this data frame:

df = data.frame(text = c("This is a very long sentence that I would like to trim because I might need to put it as a label somewhere",
                         "This is another very long sentence that I would also like to trim because I might need to put it as who knows what"),
                col2 = c("1234", "5678"))

Following this post, I was able to get a new column that shows the beginning of the sentence in whole words, which is fine.

df$short_txt = sapply(strsplit(df$text, ' '), function(i) paste(i[cumsum(nchar(i)) <= 20], collapse = ' '))

> df$short_txt
[1] "This is a very long"  "This is another very"

However, I would also like to append the end of the sentence, in whole words covering roughly the last 20 characters, to get something like this result.

> df$short_txt
[1] "This is a very long...it as a label somewhere"  "This is another very...it as who knows what"

I can't figure out how to extend the sapply call to get this result. I tried using the paste function and changing the cumsum condition:

df$short_txt = sapply(strsplit(df$text, ' '), function(i) paste(i[cumsum(nchar(i)) <= 20],"...",i[cumsum(nchar(i)) >= (nchar(i)-20)], collapse = ' '))

but it does not return what I want.

I appreciate the help.

Unai Vicente
Accepted

Perhaps we can do this with a regular expression?

gsub("^(.{20}\\S*)\\b.*\\b(\\S*.{20})$", "\\1...\\2", df$text)
# [1] "This is a very long sentence...as a label somewhere" "This is another very...it as who knows what"        

Regex explanation:

^(.{20}\\S*)\\b.*\\b(\\S*.{20})$
^                              $   beginning and end of string, respectively
 (.........)        (.........)    first and second saved groups
  .{20}                  .{20}     exactly 20 characters of any kind
       \\S*          \\S*          zero or more non-space characters
            \\b  \\b               word boundaries
               .*                  anything else (including nothing)
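To make the two saved groups concrete, here is a small sketch that extracts them with base R's regexec() and regmatches() instead of substituting. The variable names txt, pattern and parts are only illustrative and are not part of the answer.

```r
# Illustration only: inspect what each saved group captures for the second string.
txt <- "This is another very long sentence that I would also like to trim because I might need to put it as who knows what"
pattern <- "^(.{20}\\S*)\\b.*\\b(\\S*.{20})$"

# regmatches() on a regexec() result returns the full match followed by each group
parts <- regmatches(txt, regexec(pattern, txt))[[1]]

parts[2]  # first saved group: 20 characters, extended to the end of the word
# [1] "This is another very"
parts[3]  # second saved group: the last 20 characters, extended back to a word start
# [1] "it as who knows what"
```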

For the first string this does not include your leading "it" in the tail, because without it the trailing substring is already 20 characters long.

I'll run df$text[1] through several lengths for the leading/trailing whole-word substrings.

sapply(seq(10, 24, by = 2), function(len) gsub(sprintf("^(.{%d}\\S*)\\b.*\\b(\\S*.{%d})$", len, len), "\\1...\\2", df$text[1]))
# [1] "This is a very... somewhere"                            
# [2] "This is a very...label somewhere"                       
# [3] "This is a very...label somewhere"                       
# [4] "This is a very long... label somewhere"                 
# [5] "This is a very long... a label somewhere"               
# [6] "This is a very long sentence...as a label somewhere"    
# [7] "This is a very long sentence...it as a label somewhere" 
# [8] "This is a very long sentence... it as a label somewhere"

Offhand I don't know how to guard against the spaces before/after the added ... here, but they can be cleaned up after the substitution (this is safe as long as your strings do not themselves contain a literal "...").

sapply(seq(10, 24, by = 2), function(len) gsub(sprintf("^(.{%d}\\S*)\\b.*\\b(\\S*.{%d})$", len, len), "\\1...\\2", df$text[1])) |>
  sub(" *(\\.\\.\\.) *", "\\1", x = _)
# [1] "This is a very...somewhere"                            
# [2] "This is a very...label somewhere"                      
# [3] "This is a very...label somewhere"                      
# [4] "This is a very long...label somewhere"                 
# [5] "This is a very long...a label somewhere"               
# [6] "This is a very long sentence...as a label somewhere"   
# [7] "This is a very long sentence...it as a label somewhere"
# [8] "This is a very long sentence...it as a label somewhere"
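To apply this to a whole column, the substitution and the cleanup could be wrapped in one small helper. This is only a sketch: trim_middle and its len argument are hypothetical names, not something from the answer, and it assumes the input strings do not already contain "...".

```r
# Hypothetical helper (name and arguments are illustrative only).
trim_middle <- function(x, len = 20) {
  # Same pattern as in the answer, with the length filled in for both ends
  pattern <- sprintf("^(.{%d}\\S*)\\b.*\\b(\\S*.{%d})$", len, len)
  out <- gsub(pattern, "\\1...\\2", x)
  # Drop any spaces left directly around the inserted "..."
  sub(" *(\\.\\.\\.) *", "\\1", out)
}

x <- c("This is a very long sentence that I would like to trim because I might need to put it as a label somewhere",
       "This is another very long sentence that I would also like to trim because I might need to put it as who knows what")

trim_middle(x)
# [1] "This is a very long sentence...as a label somewhere"
# [2] "This is another very...it as who knows what"

trim_middle("short text")
# [1] "short text"
```

Strings too short for the pattern to match come back unchanged, because gsub() leaves non-matching input as it is.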
r2evans, Feb 15 2024 16:02