Prework
- Read and agree to the code of conduct and contributing guidelines.
 
Description
When I extract tables from a PDF file, some cells contain multiple lines of text. For example, the name of a chemical substance can be very long, or contain synonyms, annotations, or comments. The problem is that the extract_tables() function reads each line of the PDF as a distinct row in the table, even though some lines are just the continuation of a cell. This causes errors: a substance is sometimes split across multiple rows, or information ends up associated with the wrong substance.
Here is how the table looks after extraction:
Reproducible example
I can't provide the real data that I use.
library(tabulapdf)
# I can't provide the data for now
f <- system.file("examples", "data.pdf", package = "tabulapdf")
t1 <- extract_tables(f, pages = 2, guess = FALSE, method = "stream", output = "tibble")
Expected result
Here is what I expect to get when I use the function:
#########
I created two alternative functions to fix the multiline-cell problem and one function for artifacts. But I think it would be better to fix this directly in the tabulapdf package, or in the extract_tables() function itself.
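To make the kind of workaround concrete, here is a minimal sketch of a post-processing step in base R. It is not one of the functions mentioned above and not part of tabulapdf: merge_continuation_rows() and the toy data are hypothetical. It assumes that a continuation line leaves every "key" column (for example an ID or a numeric value) empty, and that it always comes after the row it belongs to.

```r
# Hypothetical helper (not a tabulapdf function): merge rows that only
# continue the text of the previous row.
# Assumption: a continuation row has all `key_cols` empty (NA or "").
merge_continuation_rows <- function(df, key_cols, sep = " ") {
  # Treat NA and whitespace-only strings as empty
  is_blank <- function(x) is.na(x) | trimws(as.character(x)) == ""

  # TRUE where every key column is empty, i.e. the row is a continuation
  all_keys_blank <- Reduce(`&`, lapply(df[key_cols], is_blank))
  starts_record <- !all_keys_blank
  starts_record[1] <- TRUE  # the first row always opens a record

  # Rows sharing a group id belong to the same logical table row
  group <- cumsum(starts_record)

  # Join the non-empty pieces of a cell; keep NA if nothing is left
  collapse <- function(x) {
    x <- as.character(x)
    x <- x[!is_blank(x)]
    if (length(x) == 0) NA_character_ else paste(x, collapse = sep)
  }

  # Note: every column comes back as character
  merged <- lapply(df, function(col) {
    vapply(split(col, group), collapse, character(1), USE.NAMES = FALSE)
  })
  data.frame(merged, check.names = FALSE, stringsAsFactors = FALSE)
}

# Toy data imitating the split rows (made up for illustration)
raw <- data.frame(
  id        = c("1", NA, "2", "3", NA, NA),
  substance = c("Long chemical", "name continued", "Short name",
                "Another", "very long", "name"),
  value     = c("0.5", NA, "1.2", "3.4", NA, NA),
  stringsAsFactors = FALSE
)

clean <- merge_continuation_rows(raw, key_cols = c("id", "value"))
print(clean)
```

This heuristic breaks when the PDF centers the key values vertically, because the continuation lines can then appear before the row that carries the ID. That is one reason a fix inside the extraction step itself would be more robust.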
Session info
sessionInfo()
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=French_France.utf8  LC_CTYPE=French_France.utf8    LC_MONETARY=French_France.utf8
[4] LC_NUMERIC=C                   LC_TIME=French_France.utf8
time zone: Europe/Paris
tzcode source: internal
attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base
other attached packages:
[1] shinycssloaders_1.1.0 DT_0.33               writexl_1.5.1         readxl_1.4.3          tabulapdf_1.0.5-5
[6] lubridate_1.9.4       forcats_1.0.0         stringr_1.5.1         dplyr_1.1.4           purrr_1.0.4
[11] readr_2.1.5           tidyr_1.3.1           tibble_3.2.1          ggplot2_3.5.1         tidyverse_2.0.0
[16] pdftools_3.4.1        shiny_1.10.0
loaded via a namespace (and not attached):
[1] shinyalert_3.1.0  sass_0.4.9        generics_0.1.3    renv_1.1.0        stringi_1.8.4     hms_1.1.3
[7] digest_0.6.37     magrittr_2.0.3    evaluate_1.0.3    timechange_0.3.0  grid_4.3.2        fastmap_1.2.0
[13] cellranger_1.1.0  jsonlite_1.8.9    promises_1.3.2    crosstalk_1.2.1   scales_1.3.0      jquerylib_0.1.4
[19] cli_3.6.4         rlang_1.1.5       shinythemes_1.2.0 munsell_0.5.1     yaml_2.3.10       withr_3.0.2
[25] cachem_1.1.0      tools_4.3.2       tzdb_0.4.0        memoise_2.0.1     colorspace_2.1-1  httpuv_1.6.15
[31] vctrs_0.6.5       R6_2.6.0          mime_0.12         png_0.1-8         lifecycle_1.0.4   htmlwidgets_1.6.4
[37] fontawesome_0.5.3 pkgconfig_2.0.3   rJava_1.0-11      pillar_1.10.1     bslib_0.9.0       later_1.4.1
[43] gtable_0.3.6      glue_1.8.0        Rcpp_1.0.14       xfun_0.50         tidyselect_1.2.1  knitr_1.49
[49] rstudioapi_0.17.1 xtable_1.8-4      htmltools_0.5.8.1 qpdf_1.3.4        compiler_4.3.2    askpass_1.2.1