@dworthen
Contributor

Restructure codebase as a monorepo project.

Checklist

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have updated the documentation (if necessary).
  • I have added appropriate unit tests (if applicable).

@dworthen dworthen requested a review from a team as a code owner October 22, 2025 17:31
@AlonsoGuevara AlonsoGuevara requested a review from Copilot October 30, 2025 01:34
Contributor

Copilot AI left a comment

Pull Request Overview

This PR restructures the codebase into a monorepo by extracting the Factory pattern into a separate package (graphrag-factory) and updating all import references. The Factory class is enhanced with singleton support and improved error messages.

Key Changes:

  • Extracted Factory class into standalone graphrag-factory package with enhanced singleton/transient service scope support
  • Updated all Factory imports from graphrag.factory.factory to graphrag_factory
  • Added --all-packages flags to CI/CD workflows to support monorepo structure

Reviewed Changes

Copilot reviewed 26 out of 403 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| packages/graphrag-factory/pyproject.toml | New package configuration for extracted Factory module |
| packages/graphrag-factory/graphrag_factory/factory.py | Enhanced Factory class with singleton/transient service scopes |
| packages/graphrag-factory/graphrag_factory/__init__.py | Package initialization exposing Factory class |
| packages/graphrag-factory/README.md | Documentation for the new Factory package with usage examples |
| packages/graphrag/graphrag/logger/factory.py | Updated Factory import to use new package |
| packages/graphrag/graphrag/language_model/factory.py | Updated Factory import to use new package |
| packages/graphrag/graphrag/language_model/providers/litellm/services/retry/retry_factory.py | Updated Factory import to use new package |
| packages/graphrag/graphrag/language_model/providers/litellm/services/rate_limiter/rate_limiter_factory.py | Updated Factory import to use new package |
| packages/graphrag/graphrag/index/input/factory.py | Updated Factory import to use new package |
| packages/graphrag/graphrag/cache/factory.py | Updated Factory import to use new package |
| packages/graphrag/README.md | New README for graphrag package within monorepo |
| .vscode/launch.json | Enhanced debug configuration with user input prompts |
| .github/workflows/*.yml | Updated CI/CD workflows to use --all-packages flag |
| docs/examples_notebooks/*.ipynb | Formatting cleanup of import statements |

@AlonsoGuevara
Collaborator

Please include an architecture diagram in the documentation illustrating this change, along with a short explanation of what each submodule is responsible for. This will help establish clear guardrails for future development.

I might be misunderstanding, but from what I see, this change introduces two modules — Factories and GraphRAG. Could you clarify the role of the Factory module? Specifically, what’s the rationale for treating it as a standalone logical unit worth exposing independently?

@dworthen
Contributor Author

Please include an architecture diagram in the documentation illustrating this change, along with a short explanation of what each submodule is responsible for. This will help establish clear guardrails for future development.

Hey @AlonsoGuevara, the monorepo structure does not change the system architecture or the public API surface of GraphRAG. The workflows and all the pieces still fit together and work as they did before. The monorepo structure just pulls some code out into separate, independent PyPI packages so that they can be used in isolation in our other projects. Our team's GraphRAG Monorepo Loop page discusses the goals, principles, and modules to pull out. I think that might answer the questions about guardrails and future development plans, but let me know if I am misunderstanding that piece.

I might be misunderstanding, but from what I see, this change introduces two modules — Factories and GraphRAG. Could you clarify the role of the Factory module? Specifically, what’s the rationale for treating it as a standalone logical unit worth exposing independently?

So far there are only two packages but there will be more packages pulled out from graphrag core in future PRs that I am working on. I did two packages instead of one in this PR to give a better idea of the monorepo structure and how it will look as we add more packages. Factory was chosen as the first package to pull out from core because it is simple with minimal impact and will need to exist as a package as other packages we pull out (cache, vectorstore, etc) will rely on the base factory class. Let me know if you disagree with factory needing to be its own package and what alternate approach may be better suited. One such alternative approach may be to just copy the factory class code to packages that need it.

@andresmor-ms
Contributor

So far there are only two packages but there will be more packages pulled out from graphrag core in future PRs that I am working on. I did two packages instead of one in this PR to give a better idea of the monorepo structure and how it will look as we add more packages. Factory was chosen as the first package to pull out from core because it is simple with minimal impact and will need to exist as a package as other packages we pull out (cache, vectorstore, etc) will rely on the base factory class.

Just so that I understand correctly, what you are describing here is to have something like:

flowchart TD
    A[graphrag] -->|depends on| B[graphrag-vectorstore]
    B --> |depends on| C[graphrag-factory]
    A -->|depends on| C

Let me know if you disagree with factory needing to be its own package and what alternate approach may be better suited. One such alternative approach may be to just copy the factory class code to packages that need it.

I kind of don't like the idea of exposing a package that only has one file in it, and we would need to publish it to PyPI so that it can be used as a dependency in other packages.

Also, would this mean that if, for example, I had my own custom implementation of a vector store, I would need to first register that vector store in some factory in graphrag-vectorstore and then pass that to graphrag-core?

What do you think about not having a graphrag-factory and letting graphrag-core manage the factories, so that we don't have that dependency and only graphrag-core depends on the different packages? Since graphrag-core will depend on the different packages (vectorstore, cache, etc.), it will have access to the ABCs or Protocols defined there, so it would be able to create and manage all the factories it needs and register the default implementations, without the factories having to be copy-pasted into every module.

Let me know what you think :)

@dworthen
Contributor Author

What do you think about not having a graphrag-factory and letting graphrag-core manage the factories, so that we don't have that dependency and only graphrag-core depends on the different packages? Since graphrag-core will depend on the different packages (vectorstore, cache, etc.), it will have access to the ABCs or Protocols defined there, so it would be able to create and manage all the factories it needs and register the default implementations, without the factories having to be copy-pasted into every module.

Fair point. GraphRAG core can and will manage some of the factories. The one other package that I know will need a factory implementation is graphrag-llm. The language model config contains configuration for subservices such as retries, rate limiting, etc. That means graphrag-llm encapsulates service definitions (ABCs), service implementations, and the factories for managing those implementations. Even if the other packages don't contain factories, that still leaves at least two packages that do need a factory implementation: graphrag and graphrag-llm. In my early monorepo explorations, graphrag-llm was one of the first packages I started to pull out of graphrag, and I immediately ran into a situation where I needed to share the factory across packages, so I pulled it out into its own package. I included it here as the second package since it was simple and easy to grok, but perhaps I should have included graphrag-config as the second package.
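As a rough illustration of that shape, the sketch below puts a service definition (an ABC), two implementations, and a registry next to a language model config that selects the retry subservice by name. Every name here (`Retry`, `ExponentialRetry`, `NoRetry`, `RetryConfig`, `LanguageModelConfig`, `build_retry`, and the config fields) is hypothetical and is not taken from graphrag-llm.

```python
from abc import ABC, abstractmethod
from collections.abc import Callable
from dataclasses import dataclass, field


class Retry(ABC):
    """Service definition: what any retry strategy must provide."""

    @abstractmethod
    def delays(self) -> list[float]:
        """Seconds to wait before each retry attempt."""


class ExponentialRetry(Retry):
    def __init__(self, max_retries: int, base_delay: float) -> None:
        self.max_retries = max_retries
        self.base_delay = base_delay

    def delays(self) -> list[float]:
        # Doubles the delay on every attempt, starting from base_delay.
        return [self.base_delay * 2**i for i in range(self.max_retries)]


class NoRetry(Retry):
    def __init__(self, **_: object) -> None:
        # Accepts and ignores the shared constructor arguments.
        pass

    def delays(self) -> list[float]:
        return []


# Registry mapping strategy names to implementations (a plain dict here;
# a shared Factory class would play this role in the discussion above).
retry_factory: dict[str, Callable[..., Retry]] = {
    "exponential": ExponentialRetry,
    "none": NoRetry,
}


@dataclass
class RetryConfig:
    strategy: str = "exponential"
    max_retries: int = 3
    base_delay: float = 0.5


@dataclass
class LanguageModelConfig:
    model: str
    retry: RetryConfig = field(default_factory=RetryConfig)


def build_retry(config: LanguageModelConfig) -> Retry:
    """Resolve the retry subservice named in the language model config."""
    r = config.retry
    return retry_factory[r.strategy](max_retries=r.max_retries, base_delay=r.base_delay)


retry = build_retry(LanguageModelConfig(model="example-model"))
```

Because the config, the definitions, and the registry travel together, a package built this way needs some factory mechanism of its own or a shared one.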

So far there are only two packages but there will be more packages pulled out from graphrag core in future PRs that I am working on. I did two packages instead of one in this PR to give a better idea of the monorepo structure and how it will look as we add more packages. Factory was chosen as the first package to pull out from core because it is simple with minimal impact and will need to exist as a package as other packages we pull out (cache, vectorstore, etc) will rely on the base factory class.

Just so that I understand correctly, what you are describing here is to have something like:

flowchart TD
    A[graphrag] -->|depends on| B[graphrag-vectorstore]
    B --> |depends on| C[graphrag-factory]
    A -->|depends on| C

Let me know if you disagree with factory needing to be its own package and what alternate approach may be better suited. One such alternative approach may be to just copy the factory class code to packages that need it.

Not exactly. I did a poor job of listing out packages. My list was merely a hypothetical list of packages that may need a factory but I agree with your point that some of this management should be done by graphrag core. I should have listed out graphrag-llm.

I kind of don't like the idea of exposing a package that only has one file in it, and we would need to publish it to PyPI so that it can be used as a dependency in other packages.

Why not? GitHub Actions will manage publishing to PyPI, so that's not problematic. Another approach would be to not roll our own DI container logic and instead lean on an existing library like https://pypi.org/project/dependency-injector/, but that is a bigger lift and there has been hesitation to do this in the past.

Also, would this mean that if, for example, I had my own custom implementation of a vector store, I would need to first register that vector store in some factory in graphrag-vectorstore and then pass that to graphrag-core?

I may be misunderstanding this point, but this is true regardless of where the factory lives. Whether the factory is in graphrag core or in graphrag-vectorstore, users need to register custom vector stores with the factory under custom strategy names in order to use them in graphrag. I don't think the extensibility model changes based on where the factories are defined.
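A minimal sketch of that registration step from the user's side, assuming a registry with register/create functions. The names `VectorStore`, `register_vector_store`, `create_vector_store`, `KeywordVectorStore`, and the "keyword" strategy name are all made up for illustration and are not the GraphRAG API.

```python
from collections.abc import Callable
from typing import Any, Protocol


class VectorStore(Protocol):
    """Hypothetical interface that a vector store package might publish."""

    def search(self, query: str, k: int) -> list[str]: ...


# Stand-in for a registry owned by graphrag core or by graphrag-vectorstore.
# Which package owns it does not change the user code further down.
_registry: dict[str, Callable[..., VectorStore]] = {}


def register_vector_store(strategy: str, initializer: Callable[..., VectorStore]) -> None:
    _registry[strategy] = initializer


def create_vector_store(strategy: str, **kwargs: Any) -> VectorStore:
    return _registry[strategy](**kwargs)


# --- user code ---
class KeywordVectorStore:
    """Toy custom store that 'searches' by substring match."""

    def __init__(self, documents: list[str]) -> None:
        self.documents = documents

    def search(self, query: str, k: int) -> list[str]:
        return [doc for doc in self.documents if query in doc][:k]


# Register the custom implementation under a custom strategy name.
register_vector_store("keyword", KeywordVectorStore)

# Configuration then selects the implementation by that name.
strategy = "keyword"
store = create_vector_store(strategy, documents=["graph rag", "vector store", "graph db"])
hits = store.search("graph", k=5)
```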

If the concern is around what gets imported, graphrag core can and will still manage a public API surface, so users will not need to from graphrag_vectorstore import VectorStoreFactory even if that is where the factory is defined. As an example, we don't expect end users to directly import python-dotenv for managing environment variables. Instead, we encapsulate the functionality of third-party libraries in our own public API surface (load_config in this case).

To extend that example, I have pulled the config loading logic (based on your work in benchmark-qed) out into graphrag-config (not in this PR, as I am trying to keep these PRs small and manageable), but I did not update our docs or sample notebooks to from graphrag_config import load_config. Instead, the sample notebooks still show from graphrag.config.load_config import load_config, as that function still exists as part of the graphrag core API surface; it just now sits on top of the new graphrag-config package. The same approaches we take to encapsulate third-party dependencies can be used to encapsulate our own packages in order to maintain a public API that works. Please let me know if I completely misunderstood this last point.
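The encapsulation idea can be sketched as a thin wrapper: the public function keeps its name and location, and delegates to the extracted package. The block below simulates both sides in one file. `_package_load_config`, the JSON settings file, and the signatures are placeholders chosen so the example runs on its own; they are not the real graphrag or graphrag-config API.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


# Stand-in for the implementation that moved into the extracted package
# (graphrag-config in the discussion). Hypothetical name and signature.
def _package_load_config(path: Path) -> dict[str, Any]:
    return json.loads(path.read_text(encoding="utf-8"))


# Stand-in for the function that core keeps exposing to users.
# Callers import this one and never touch the extracted package directly.
def load_config(root_dir: Path, filename: str = "settings.json") -> dict[str, Any]:
    return _package_load_config(root_dir / filename)


# Demo: write a throwaway settings file and load it through the public wrapper.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "settings.json").write_text('{"model": "example-model"}', encoding="utf-8")
    loaded = load_config(root)
```

Swapping the body of the wrapper for a call into a different package leaves every caller's import line unchanged, which is the property described above.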

I hope I was able to address your concerns in a reasonable manner. In hindsight, I wish I had kept this PR more focused, with only one package in it: graphrag. If the blocker to merging is around graphrag-factory, then I am super-duper happy to take that out and revisit the need for that package. The primary goal of this PR was to establish the monorepo folder structure and the CI/CD processes around managing a monorepo.
