LLM prompt injection detection.
- User submits a potentially malicious message.
- The message is passed through a LLM prompted to format the message plus a unique key into a JSON. In the event the message is a malicious prompt, this output should be compromised. If the output is an invalid JSON, is missing a key, or a key-value doesn't match the expected values, then the integrity may be compromised.
- If the integrity check passes, the user message is forwarded to the guarded LLM (e.g.: the application chatbot, etc.).
- The API returns the result of the integrity test (boolean) and either the chatbot response (if integrity passes) or an error message (if integrity fails).
graph TD
    A[1: User Inputs Chat Message] --> B[2: Integrity Filter]
    B -->|Integrity check passes.| C[3: Generate Chatbot Response]
    B -->|Integrity check fails. Response is error message.| D
    C -->|Response is chatbot message.| D[4: Return Integrity and Response]
    What this solution can do:
- Detect inputs that override an LLMs initial / system prompt.
What this solution cannot do:
- Neutralise malicious prompts.
If using poetry:
poetry installIf using vanilla pip:
pip install .Set your OpenAI API key in .envrc.
To run the project locally, run
make startThis will launch a webserver on port 8001.
Or via docker compose (does not use hot reload by default):
docker compose upQuery the /chat endpoint, e.g.: using curl:
curl -X POST -H "Content-Type: application/json" -d '{"message": "Hi how are you?"}' http://127.0.0.1:8000/chatTo run unit tests:
make testFor information on how to set up your dev environment and contribute, see here.
MIT
