Onboard Project: Usage

Matthew Beech edited this page Nov 6, 2025 · 12 revisions

Onboarding Projects

This page details the silnlp.common.onboard_project script's usage and the configuration options available.

onboard_project

Cleans and uploads a Paratext project from a local machine to the MinIO bucket. Optionally performs other Onboarding tasks.

usage: python -m silnlp.common.onboard_project [--copy-from [local_dir]] [--config path_to_config]
[--extract-corpora] [--collect-verse-counts] [--clean-project] [--timestamp] [--wildebeest]
[--stats] projects ...

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| projects | List of Paratext project names (required). | These projects will be stored on the bucket at Paratext/projects. |
| --copy-from [local_dir] | Path to a directory with a Paratext project. | The local project(s) will be copied to the bucket. If the flag is given without a local_dir, the user's Downloads folder is used. |
| --config path_to_config | Path to a config.yml file. | Configures which optional onboarding tasks run and how. |
| --extract-corpora | Runs silnlp.common.extract_corpora. | Extracts corpora. See here for more information. |
| --collect-verse-counts | Runs silnlp.common.collect_verse_counts. | Collects verse counts. Stores results in MT/experiments/verse_counts/project_name by default. |
| --clean-project | Runs silnlp.common.clean_projects. | Cleans the Paratext project folder by removing unnecessary files and folders before copying. Only used if --copy-from is provided. |
| --timestamp | Appends the current timestamp to the project name. | Adds a timestamp to the project folder name when creating a new Paratext project folder. |
| --wildebeest | Runs a Wildebeest analysis on the extracted corpora. | Produces a Wildebeest report for the project. Stores results as project_name_wildebeest in the current working directory by default. |
| --stats | Computes tokenization statistics. | Stores results in stats/project_name in the current working directory by default. |
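
The optional-value behaviour of --copy-from can be easier to see in code. The following is a minimal sketch of a parser that mirrors the documented usage line; it is not the script's actual implementation. The function name build_parser and the use of Path.home() / "Downloads" as the Downloads folder are assumptions made for illustration.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented usage; not the real script."""
    parser = argparse.ArgumentParser(prog="onboard_project")
    parser.add_argument("projects", nargs="+", help="Paratext project names")
    # nargs="?" allows the flag with or without a value. `const` is used when
    # the flag is present but no directory follows it.
    parser.add_argument(
        "--copy-from",
        nargs="?",
        const=str(Path.home() / "Downloads"),  # assumed Downloads location
        default=None,
        metavar="local_dir",
    )
    parser.add_argument("--config", default=None, metavar="path_to_config")
    # The remaining documented options are plain on/off switches.
    for flag in (
        "--extract-corpora",
        "--collect-verse-counts",
        "--clean-project",
        "--timestamp",
        "--wildebeest",
        "--stats",
    ):
        parser.add_argument(flag, action="store_true")
    return parser


# --copy-from is followed by another flag, so it falls back to `const`.
args = build_parser().parse_args(["--copy-from", "--stats", "MyProject"])
print(args.copy_from, args.stats, args.projects)
```

In this sketch, writing `--copy-from MyProject` would treat MyProject as the local_dir rather than as a project name, so a bare --copy-from is safest when followed by another flag or placed after the project names.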

config file

The config file contains the parameters for all of the optional onboarding tasks this script can execute.

Below is an example of an onboarding config:

extract_corpora:
  include: NT
  exclude: OT
verse_counts:
  output_folder: verse_counts/test_onboard_project
  files: "*.txt"
  deutero: false
  recount: false
wildebeest:
  x: 500
  n: 500
  r: vref.txt
zip_password:
  project_name_1: password_1
  project_name_2: password_2
stats:
  use_default_model_dir: False
  data:
    corpus_pairs:
      type: train
      src: project_extract_file
      trg: project_extract_file
      lang_codes:
        iso_code: nllb_tag

Parameter Definitions

extract_corpora

  • include=[]: A list of books to include; e.g., 'NT', 'OT', 'GEN'.
  • exclude=[]: A list of books to exclude; e.g., 'NT', 'OT', 'GEN'.
  • markers=False: If true, include USFM markers in extraction.
  • lemmas=False: If true, extract lemmas.
  • project_vrefs=False: If true, extract project_vrefs.

collect_verse_counts

  • output_folder=path_to_output_folder: Folder to store the verse counts.
  • files=*.txt: Semicolon-delimited list of patterns of extract file names to count (e.g. 'en-*.txt;fr-NT.txt').
  • deutero=False: If true, include counts for Deuterocanon books.
  • recount=False: If true, force recount of verse counts.
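
As an illustration of the semicolon-delimited files setting, the sketch below splits the value and applies each part as a shell-style glob. Whether collect_verse_counts matches file names exactly this way is an assumption; the helper matches_any and the candidate file names are hypothetical.

```python
from fnmatch import fnmatchcase


def matches_any(filename: str, patterns: str) -> bool:
    """Return True if filename matches any of the semicolon-delimited glob patterns."""
    parts = [p.strip() for p in patterns.split(";") if p.strip()]
    return any(fnmatchcase(filename, part) for part in parts)


files_setting = "en-*.txt;fr-NT.txt"
# Hypothetical extract file names, for illustration only.
candidates = ["en-KJV.txt", "fr-NT.txt", "fr-OT.txt", "en-KJV.csv"]

selected = [name for name in candidates if matches_any(name, files_setting)]
print(selected)  # ['en-KJV.txt', 'fr-NT.txt']
```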

wildebeest

  • x=500: Maximum number of examples per line.
  • n=500: Maximum number of cases per group.
  • r=vref.txt: File with sentence reference IDs.
  • See the Wildebeest repository for more information.

zip_password

  • Stores the project names and the respective passwords for any encrypted zip files

stats

  • Configures the Tokenizer used for tokenization statistics.
  • The example shows the default used when the stats section is empty.
  • See Configure A Model for a full list of parameter definitions.
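
The defaults listed in the parameter definitions above can be read as a base configuration that a config file overrides section by section. The sketch below shows one way to express that with plain dictionaries. It is an assumption about how the values combine, not the script's actual logic; DOCUMENTED_DEFAULTS and with_defaults are hypothetical names, and the section keys follow the example config (verse_counts rather than collect_verse_counts). The output_folder, zip_password, and stats values are left out because the page gives no concrete defaults for them.

```python
from copy import deepcopy

# Defaults copied from the parameter definitions on this page.
DOCUMENTED_DEFAULTS = {
    "extract_corpora": {
        "include": [],
        "exclude": [],
        "markers": False,
        "lemmas": False,
        "project_vrefs": False,
    },
    "verse_counts": {"files": "*.txt", "deutero": False, "recount": False},
    "wildebeest": {"x": 500, "n": 500, "r": "vref.txt"},
}


def with_defaults(user_config: dict) -> dict:
    """Overlay user-provided sections on top of the documented defaults."""
    merged = deepcopy(DOCUMENTED_DEFAULTS)
    for section, values in user_config.items():
        # An empty section (None) keeps whatever defaults exist for it.
        merged.setdefault(section, {}).update(values or {})
    return merged


config = with_defaults({"verse_counts": {"deutero": True}, "stats": None})
print(config["verse_counts"])
```

Only the keys present in the user config change; everything else keeps its documented default, and the original defaults dictionary is left unmodified.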