Onboard Project: Usage
Matthew Beech edited this page Nov 6, 2025
·
12 revisions
This page details the silnlp.common.onboard_project script's usage and the configuration options available.
Cleans and uploads a Paratext project from a local machine to the MinIO bucket. Optionally performs other Onboarding tasks.
usage: python -m silnlp.common.onboard_project [--copy-from [local_dir]] [--config path_to_config]
[--extract-corpora] [--collect-verse-counts] [--clean-project] [--timestamp] [--wildebeest]
[--stats] projects ...
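As a minimal sketch of how the usage line above can be assembled from Python, the helper below builds an argument list in the documented order (options first, project names last). The function name `build_onboard_command`, the project name `MyProject`, and the paths are hypothetical and only serve the illustration; the flag names are the ones listed in the usage line.

```python
import shlex
import sys

# Flag names taken from the usage line on this page.
KNOWN_FLAGS = {
    "extract-corpora",
    "collect-verse-counts",
    "clean-project",
    "timestamp",
    "wildebeest",
    "stats",
}


def build_onboard_command(projects, copy_from=None, config=None, flags=()):
    """Return an argv list for `python -m silnlp.common.onboard_project`.

    Hypothetical helper: it only assembles arguments, it does not run anything.
    """
    if not projects:
        raise ValueError("at least one project name is required")
    unknown = set(flags) - KNOWN_FLAGS
    if unknown:
        raise ValueError(f"unknown flags: {sorted(unknown)}")

    cmd = [sys.executable, "-m", "silnlp.common.onboard_project"]
    if copy_from is not None:
        cmd += ["--copy-from", str(copy_from)]
    if config is not None:
        cmd += ["--config", str(config)]
    cmd += [f"--{flag}" for flag in flags]
    cmd += list(projects)  # positional project names go last, as in the usage line
    return cmd


# Example values below are placeholders, not real projects or paths.
cmd = build_onboard_command(
    ["MyProject"],
    copy_from="/data/paratext",
    config="onboard_config.yml",
    flags=["clean-project", "extract-corpora"],
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` in an environment where silnlp is installed.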
Arguments:
| Argument | Purpose | Description |
|---|---|---|
| `projects` | List of Paratext project names | (Required) These projects will be stored on the bucket at Paratext/projects. |
| `--copy-from [local_dir]` | Path to a directory with a Paratext project | The local project(s) will be copied to the bucket. If the flag is given without a `local_dir`, the user's Downloads folder is used. |
| `--config path_to_config` | Path to a config.yml file | Configures which optional onboarding tasks will run. |
| `--extract-corpora` | Runs silnlp.common.extract_corpora | Extracts corpora. See here for more information. |
| `--collect-verse-counts` | Runs silnlp.common.collect_verse_counts | Collects verse counts. Stores results in MT/experiments/verse_counts/project_name by default. |
| `--clean-project` | Runs silnlp.common.clean_projects | Cleans the Paratext project folder by removing unnecessary files and folders before copying. Only used if `--copy-from` is provided. |
| `--timestamp` | Appends the current timestamp to the project name | Adds a timestamp to the project folder name when creating a new Paratext project folder. |
| `--wildebeest` | Runs a Wildebeest analysis on the extracted corpora | Produces a Wildebeest report for the project. Stores results as project_name_wildebeest in the current working directory by default. |
| `--stats` | Computes tokenization statistics | Stores results in stats/project_name in the current working directory by default. |
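The optional value of `--copy-from` can be illustrated with a small `argparse` sketch. This is not the script's actual source: it is a stand-in parser, written under the assumption that the flags behave as described in the table, and the `prog` name and the example project name are hypothetical. It shows how `nargs="?"` with a `const` value yields the Downloads folder when the flag is given without a directory.

```python
import argparse
from pathlib import Path


def make_parser():
    """Build a stand-in parser that mirrors the documented arguments."""
    parser = argparse.ArgumentParser(prog="onboard_project_sketch")
    parser.add_argument("projects", nargs="+", help="Paratext project names")
    parser.add_argument(
        "--copy-from",
        nargs="?",                        # the directory after the flag is optional
        type=Path,
        default=None,                     # flag absent: nothing is copied
        const=Path.home() / "Downloads",  # flag present without a value
        metavar="local_dir",
    )
    parser.add_argument("--config", type=Path, default=None, metavar="path_to_config")
    for flag in (
        "extract-corpora",
        "collect-verse-counts",
        "clean-project",
        "timestamp",
        "wildebeest",
        "stats",
    ):
        parser.add_argument(f"--{flag}", action="store_true")
    return parser


# "MyProject" is a placeholder project name.
args = make_parser().parse_args(["MyProject", "--clean-project", "--copy-from"])
print(args.projects, args.copy_from, args.clean_project)
```

In this sketch, writing `--copy-from MyProject` would make `argparse` read `MyProject` as the directory, so project names are placed before a bare `--copy-from`. Whether the real script shares this behaviour depends on how it is implemented.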
The config file contains the parameters for all of the optional onboarding tasks this script can execute.
Below is an example of an onboarding config:

    extract_corpora:
      include: NT
      exclude: OT
    verse_counts:
      output_folder: verse_counts/test_onboard_project
      files: "*.txt"
      deutero: false
      recount: false
    wildebeest:
      x: 500
      n: 500
      r: vref.txt
    zip_password:
      project_name_1: password_1
      project_name_2: password_2
    stats:
      use_default_model_dir: False
      data:
        corpus_pairs:
          type: train
          src: project_extract_file
          trg: project_extract_file
        lang_codes:
          iso_code: nllb_tag

- `extract_corpora`
  - `include=[]`: A list of books to include; e.g., 'NT', 'OT', 'GEN'.
  - `exclude=[]`: A list of books to exclude; e.g., 'NT', 'OT', 'GEN'.
  - `markers=False`: If true, include USFM markers in extraction.
  - `lemmas=False`: If true, extract lemmas.
  - `project_vrefs=False`: If true, extract project_vrefs.
- `verse_counts`
  - `output_folder=path_to_output_folder`: Folder to store the verse counts.
  - `files=*.txt`: Semicolon-delimited list of patterns of extract file names to count (e.g. 'en-*.txt;fr-NT.txt').
  - `deutero=False`: If true, include counts for Deuterocanon books.
  - `recount=False`: If true, force recount of verse counts.
- `wildebeest`
  - `x=500`: Maximum number of examples per line.
  - `n=500`: Maximum number of cases per group.
  - `r=vref.txt`: File with sentence reference IDs.
  - See the Wildebeest Repo for more info.
- `zip_password`
  - Stores the project names and their respective passwords for any encrypted zip files.
- `stats`
  - Configures the tokenizer used for tokenization statistics.
  - The example shows the default used when the `stats` section is empty.
  - See Configure A Model for a full list of parameter definitions.
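The defaults listed above can be pictured as a dictionary that a partially filled config is merged into. The sketch below is an illustration only: the names `DOCUMENTED_DEFAULTS` and `with_defaults` are hypothetical, the merge logic is an assumption rather than the script's actual behaviour, and only options whose defaults are stated on this page are included.

```python
from copy import deepcopy

# Defaults as documented on this page. The dict layout is an assumption.
DOCUMENTED_DEFAULTS = {
    "extract_corpora": {
        "include": [],
        "exclude": [],
        "markers": False,
        "lemmas": False,
        "project_vrefs": False,
    },
    "verse_counts": {"files": "*.txt", "deutero": False, "recount": False},
    "wildebeest": {"x": 500, "n": 500, "r": "vref.txt"},
}


def with_defaults(user_config):
    """Fill in documented defaults for each section present in user_config.

    Sections without documented defaults (e.g. zip_password) pass through
    unchanged; a section given as None is treated as empty.
    """
    merged = {}
    for section, options in user_config.items():
        # deepcopy keeps the shared default lists from being mutated
        values = deepcopy(DOCUMENTED_DEFAULTS.get(section, {}))
        values.update(options or {})
        merged[section] = values
    return merged


# Example: only `x` is overridden; `n` and `r` come from the defaults.
example = with_defaults({"wildebeest": {"x": 100}, "extract_corpora": None})
print(example["wildebeest"])
```

User-supplied values always take precedence over the defaults in this sketch, and sections missing from the input are left out of the result.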