Memory leaking in page.widgets() #4751

@Simon-xx-yy

Description of the bug

I noticed memory building up in my application and traced it to page.widgets().
What is puzzling is that the leak appears as soon as I iterate over the generator it returns, even if each widget is only printed and never referenced afterwards. In the simplified script below, the analysis function makes memory grow continuously without ever being released, while analysis_test does not.
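
One possible reading of the difference between the two functions: if page.widgets() is a lazy generator (the PyMuPDF documentation describes it as returning a generator, but treat this as an assumption here), then analysis_test never runs the generator body at all, so no widget objects are ever created. The sketch below illustrates that behaviour with plain Python only. `fake_widgets` and `created` are hypothetical stand-ins and do not call PyMuPDF.

```python
created = []  # records every object the generator actually builds


def fake_widgets(count):
    """Hypothetical stand-in for page.widgets(): yields one item at a time."""
    for index in range(count):
        item = {"field": index}
        created.append(item)
        yield item


# Case 1: the generator is created but never iterated (like analysis_test).
unused = fake_widgets(3)
del unused
untouched_count = len(created)

# Case 2: the generator is consumed (like analysis).
for item in fake_widgets(3):
    pass
consumed_count = len(created)

print(untouched_count, consumed_count)
```

If this reading is right, the print call itself is not the trigger; any consumption of the generator, such as `list(page.widgets())`, would be expected to show the same growth.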

  • This occurs in both 1.24.7 and 1.26.0; after rolling back to 1.20.2, the memory leak disappears.
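
To compare versions or variants side by side, a smaller measurement helper may be easier to read than full traceback diffs. The following is a minimal sketch using only the standard library; `measure_growth`, `leaky`, and `clean` are hypothetical names for illustration and are not part of the original script. Note that tracemalloc only sees allocations made through Python's memory allocators, so growth inside a native library may not show up in these numbers.

```python
import gc
import tracemalloc


def measure_growth(func, iterations=5):
    """Return the net traced-memory change (bytes) after each call of func."""
    tracemalloc.start()
    try:
        gc.collect()
        baseline, _ = tracemalloc.get_traced_memory()
        growth = []
        for _ in range(iterations):
            func()
            gc.collect()
            current, _ = tracemalloc.get_traced_memory()
            growth.append(current - baseline)
        return growth
    finally:
        tracemalloc.stop()


_retained = []  # simulates objects that are never released


def leaky():
    """Keeps a reference to every buffer it allocates."""
    _retained.append(bytearray(100_000))


def clean():
    """Allocates a buffer and drops it before returning."""
    data = bytearray(100_000)
    del data


leaky_growth = measure_growth(leaky)
clean_growth = measure_growth(clean)
print("leaky:", leaky_growth)
print("clean:", clean_growth)
```

In the reproduction, one could pass `lambda: analysis(bytes_data)` and `lambda: analysis_test(bytes_data)` as `func`; a steadily rising list for the first and a flat one for the second would match the behaviour described above.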

How to reproduce the bug

```python
# coding = utf-8
"""
Created on 2025/10/15 17:08
@File : pdf_mem_leak
@Author: Y
Description : 
"""
import gc
import time

import fitz
import tracemalloc

def analysis(stream_data):
    pdf_info = fitz.Document(stream=stream_data, filetype='pdf')
    tmp_list = range(len(pdf_info))
    for page_num in tmp_list:
        page = pdf_info[page_num]
        raw_info = page.get_text('rawdict')['blocks']
        page_widgets_list = page.widgets() 
        for widget_info in page_widgets_list:
            print(widget_info)
        del page_widgets_list
    pdf_info.close()
    pdf_info =None
    Tools_ = fitz.TOOLS
    Tools_.store_shrink(100)
    gc.collect()


def analysis_test(stream_data):
    pdf_info = fitz.Document(stream=stream_data, filetype='pdf')
    tmp_list = range(len(pdf_info))
    for page_num in tmp_list:
        page = pdf_info[page_num]
        raw_info = page.get_text('rawdict')['blocks']
        page_widgets_list = page.widgets()
        # for widget_info in page_widgets_list:
        #     print(widget_info)
        del page_widgets_list
    pdf_info.close()
    pdf_info =None
    Tools_ = fitz.TOOLS
    Tools_.store_shrink(100)
    gc.collect()


if __name__ =='__main__':
    file_path = r'2407.10671v4.pdf'
    tracemalloc.start(30)
    snapshot1 = tracemalloc.take_snapshot()
    last_record = []
    for i in range(100):
        print('iter is :{}'.format(i))
        with open(file_path, 'rb') as pdf_file:  # close the file handle promptly
            bytes_data = pdf_file.read()
        analysis(bytes_data)  # leaks memory
        # analysis_test(bytes_data)  # does not leak memory
        gc.collect()
        snapshot2 = tracemalloc.take_snapshot()
        top_stats = snapshot2.compare_to(snapshot1, 'traceback')
        # top_stats = snapshot2.compare_to(snapshot1, 'lineno')
        snapshot1 = tracemalloc.take_snapshot()
        top_stats = sorted(top_stats, key=lambda x: -x.size_diff)
        print("-----begin comp-----")
        for nums,stat in enumerate(top_stats[0:10]):
            if stat.size_diff<=0 or stat in last_record or stat.size_diff == stat.size or "tracemalloc" in stat.traceback.format()[0]:
                continue
            else:
                print("index is :{}\nstat info:{}".format(nums,stat))
                print("\n".join(stat.traceback.format()))
        print("-----stop comp-----\n")
        last_record = top_stats
```


[2407.10671v4.pdf](https://github.com/user-attachments/files/22955884/2407.10671v4.pdf)

### PyMuPDF version

1.26.0

### Operating system

Linux

### Python version

3.9
