extract_tables(): multiline cells and artifacts when parsing tables (stream) #170

## Prework

* [x] Read and agree to the [code of conduct](https://www.contributor-covenant.org/version/2/1/code_of_conduct.html) and [contributing guidelines](https://github.com/ropensci/freedomhouse/blob/main/.github/CONTRIBUTING.md).

## Description

When I extract tables from a PDF file, some cells contain multiple lines of text. For example, the name of a chemical substance can be very long, or include synonyms, annotations, or comments. The problem is that `extract_tables()` reads each line of the PDF as a distinct row of the table, even when a line is just the continuation of a cell. This causes errors: a substance is sometimes split across multiple rows, or information ends up associated with the wrong substance.

Here is how the table looks after extraction:
<img width="868" height="652" alt="Image" src="https://github.com/user-attachments/assets/e152e643-3df8-4d5f-a673-a711803ae782" />

## Reproducible example
I can't share the real data I use, so the file path below is a placeholder.
```r
library(tabulapdf)

# Placeholder file: I can't provide the real data for now
f <- system.file("examples", "data.pdf", package = "tabulapdf")

t1 <- extract_tables(f, pages = 2, guess = FALSE, method = "stream", output = "tibble")
```

## Expected result

Here is what I expect to get when I use the function.

<img width="901" height="545" alt="Image" src="https://github.com/user-attachments/assets/31ca83fc-9ee3-42dc-b093-ecff0d56a1cd" />

#########
I wrote two alternative functions to work around the multiline-cell problem and one function to remove artifacts, but I think it would be better to fix this directly in the tabulapdf package, in `extract_tables()` itself.
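
To illustrate the kind of post-processing involved, here is a minimal sketch of one possible workaround. It is not one of the functions mentioned above and not part of tabulapdf. It assumes that continuation rows can be recognised by an empty or `NA` key column (for example an identifier column that is only filled on the first line of each record). The helper name `merge_continuation_rows()`, the `key_col` argument, and the sample data are hypothetical.

```r
# Hypothetical helper (not part of tabulapdf): merge each row whose key cell
# is empty into the closest preceding row that has a key.
# Assumption: a non-empty value in `key_col` marks the start of a new record.
merge_continuation_rows <- function(df, key_col = 1) {
  key <- df[[key_col]]
  # A row starts a new record when its key cell is neither NA nor blank
  is_start <- !is.na(key) & nzchar(trimws(as.character(key)))
  # Treat the first row as a start so a leading fragment is not lost
  if (nrow(df) > 0) is_start[1] <- TRUE
  group <- cumsum(is_start)

  # Join the non-empty fragments of one column within one record
  collapse <- function(x) {
    x <- trimws(as.character(x))
    x <- x[!is.na(x) & nzchar(x)]
    if (length(x) == 0) NA_character_ else paste(x, collapse = " ")
  }

  # Note: every column is returned as character
  out <- lapply(df, function(col) {
    vapply(split(col, group), collapse, character(1), USE.NAMES = FALSE)
  })
  data.frame(out, check.names = FALSE, stringsAsFactors = FALSE)
}

# Made-up data imitating a stream extraction with split cells
raw <- data.frame(
  cas  = c("50-00-0", NA, "67-64-1", "", "71-43-2"),
  name = c("Formaldehyde", "(methanal)", "Acetone", "(propan-2-one)", "Benzene"),
  stringsAsFactors = FALSE
)

merged <- merge_continuation_rows(raw, key_col = "cas")
print(merged)
```

This heuristic only works when at least one column is reliably filled on the first line of every record. Tables where every column can wrap would need a different rule, which is one reason a fix inside the package would be preferable.
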
## Session info
> sessionInfo()
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=French_France.utf8  LC_CTYPE=French_France.utf8    LC_MONETARY=French_France.utf8
[4] LC_NUMERIC=C                   LC_TIME=French_France.utf8    

time zone: Europe/Paris
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] shinycssloaders_1.1.0 DT_0.33               writexl_1.5.1         readxl_1.4.3          tabulapdf_1.0.5-5    
 [6] lubridate_1.9.4       forcats_1.0.0         stringr_1.5.1         dplyr_1.1.4           purrr_1.0.4          
[11] readr_2.1.5           tidyr_1.3.1           tibble_3.2.1          ggplot2_3.5.1         tidyverse_2.0.0      
[16] pdftools_3.4.1        shiny_1.10.0         

loaded via a namespace (and not attached):
 [1] shinyalert_3.1.0  sass_0.4.9        generics_0.1.3    renv_1.1.0        stringi_1.8.4     hms_1.1.3        
 [7] digest_0.6.37     magrittr_2.0.3    evaluate_1.0.3    timechange_0.3.0  grid_4.3.2        fastmap_1.2.0    
[13] cellranger_1.1.0  jsonlite_1.8.9    promises_1.3.2    crosstalk_1.2.1   scales_1.3.0      jquerylib_0.1.4  
[19] cli_3.6.4         rlang_1.1.5       shinythemes_1.2.0 munsell_0.5.1     yaml_2.3.10       withr_3.0.2      
[25] cachem_1.1.0      tools_4.3.2       tzdb_0.4.0        memoise_2.0.1     colorspace_2.1-1  httpuv_1.6.15    
[31] vctrs_0.6.5       R6_2.6.0          mime_0.12         png_0.1-8         lifecycle_1.0.4   htmlwidgets_1.6.4
[37] fontawesome_0.5.3 pkgconfig_2.0.3   rJava_1.0-11      pillar_1.10.1     bslib_0.9.0       later_1.4.1      
[43] gtable_0.3.6      glue_1.8.0        Rcpp_1.0.14       xfun_0.50         tidyselect_1.2.1  knitr_1.49       
[49] rstudioapi_0.17.1 xtable_1.8-4      htmltools_0.5.8.1 qpdf_1.3.4        compiler_4.3.2    askpass_1.2.1  
