Onboard Project: Usage

Onboarding Projects

This page describes how to use the silnlp.common.onboard_project script and the configuration options it accepts.

onboard_project

Cleans and uploads a Paratext project from a local machine to the MinIO bucket. Optionally performs other Onboarding tasks.

usage: python -m silnlp.common.onboard_project [--copy-from [local_dir]] [--config path_to_config]
[--extract-corpora] [--collect-verse-counts] [--clean-project] [--timestamp] [--wildebeest]
[--stats] projects ...

Arguments:

Argument	Purpose	Description
`projects`	list of Paratext project names	(Required) These projects will be stored on the bucket at `Paratext/projects`.
`--copy-from [local_dir]`	Path to a directory with a Paratext project.	The local project(s) will be copied to the bucket. If the flag is given without a local_dir, the user's `Downloads` folder is used.
`--config path_to_config`	Path to a config.yml file	This is used to configure which optional onboarding tasks will run.
`--extract-corpora`	Runs silnlp.common.extract_corpora	Extracts corpora. See here for more information.
`--collect-verse-counts`	Runs silnlp.common.collect_verse_counts	Collects verse counts. Stores results in MT/experiments/verse_counts/project_name by default.
`--clean-project`	Runs silnlp.common.clean_projects	Cleans the Paratext project folder by removing unnecessary files and folders before copying. Only used if --copy-from is provided.
`--timestamp`	Appends a current timestamp to the project name	Adds a timestamp to the project folder name when creating a new Paratext project folder.
`--wildebeest`	Runs a Wildebeest analysis on the extracted corpora.	Produces a Wildebeest report for the project. Stores results as project_name_wildebeest in the current working directory by default.
`--stats`	Compute tokenization statistics	Computes tokenization statistics. Stores results in stats/project_name in the current working directory by default.
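
As an illustration of how the arguments in the table combine, the following Python sketch assembles a command line from the documented flags. The helper `build_onboard_command`, the project name `MyProject`, and the paths are hypothetical and are not part of silnlp; only the module name and the flag names come from this page.

```python
# Boolean switches documented in the arguments table.
SWITCHES = (
    "--extract-corpora",
    "--collect-verse-counts",
    "--clean-project",
    "--timestamp",
    "--wildebeest",
    "--stats",
)


def build_onboard_command(projects, copy_from=None, config=None, switches=()):
    """Build an argv list for silnlp.common.onboard_project (hypothetical helper)."""
    if not projects:
        raise ValueError("at least one project name is required")
    unknown = [s for s in switches if s not in SWITCHES]
    if unknown:
        raise ValueError(f"unknown switches: {unknown}")

    cmd = ["python", "-m", "silnlp.common.onboard_project"]
    if copy_from is not None:
        # Always pass an explicit directory so the value is unambiguous.
        cmd += ["--copy-from", str(copy_from)]
    if config is not None:
        cmd += ["--config", str(config)]
    cmd += list(switches)
    # Project names are positional and go last, as in the usage line.
    cmd += list(projects)
    return cmd


example = build_onboard_command(
    ["MyProject"],
    copy_from="/tmp/projects",
    config="config.yml",
    switches=["--clean-project", "--extract-corpora"],
)
print(" ".join(example))
```

The resulting list can be passed to `subprocess.run(example, check=True)` on a machine where silnlp is installed.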

config file

The config file contains the parameters for all of the optional onboarding tasks this script can execute.

Below is an example of an onboarding config:

extract_corpora:
  include: NT
  exclude: OT
verse_counts:
  output_folder: verse_counts/test_onboard_project
  files: "*.txt"
  deutero: false
  recount: false
wildebeest:
  x: 500
  n: 500
  r: vref.txt
zip_password:
  project_name_1: password_1
  project_name_2: password_2
stats:
  use_default_model_dir: False
  data:
     corpus_pairs:
       type: train
       src: project_extract_file
       trg: project_extract_file
       lang_codes:
          iso_code: nllb_tag

Parameter Definitions

extract_corpora

include=[]: A list of books to include; e.g., 'NT', 'OT', 'GEN'.
exclude=[]: A list of books to exclude; e.g., 'NT', 'OT', 'GEN'.
markers=False: If true, include USFM markers in extraction.
lemmas=False: If true, extract lemmas.
project_vrefs=False: If true, extract project_vrefs.
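
The defaults listed above can be pictured as a base dictionary that a config section overrides. The sketch below is a minimal illustration, not the script's actual implementation: the function `resolve_extract_corpora` is hypothetical, and treating a single string such as `NT` as a one-item list is an assumption made here because the example config writes `include: NT`.

```python
# Defaults as listed in the extract_corpora parameter definitions.
EXTRACT_CORPORA_DEFAULTS = {
    "include": [],
    "exclude": [],
    "markers": False,
    "lemmas": False,
    "project_vrefs": False,
}


def resolve_extract_corpora(section=None):
    """Overlay a user-supplied extract_corpora section on the defaults."""
    section = section or {}
    unknown = set(section) - set(EXTRACT_CORPORA_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown extract_corpora keys: {sorted(unknown)}")

    # Copy list defaults so callers never mutate the shared defaults.
    resolved = {
        key: (list(value) if isinstance(value, list) else value)
        for key, value in EXTRACT_CORPORA_DEFAULTS.items()
    }
    for key, value in section.items():
        # Assumption: a bare string like "NT" means a one-item list.
        if key in ("include", "exclude") and isinstance(value, str):
            value = [value]
        resolved[key] = value
    return resolved


print(resolve_extract_corpora({"include": "NT", "exclude": "OT"}))
```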

collect_verse_counts

output_folder=path_to_output_folder: Folder to store the verse counts.
files=*.txt: Semicolon-delimited list of patterns of extract file names to count (e.g. 'en-*.txt;fr-NT.txt').
deutero=False: If true, include counts for Deuterocanon books.
recount=False: If true, force recount of verse counts.
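
To make the semicolon-delimited `files` value concrete, the sketch below splits it into individual patterns and selects matching file names. It assumes the patterns are shell-style wildcards, which this page does not state explicitly; the function `select_extract_files` and the file names are hypothetical.

```python
from fnmatch import fnmatchcase


def select_extract_files(file_names, patterns="*.txt"):
    """Return the names matching any pattern in a semicolon-delimited list."""
    pattern_list = [p.strip() for p in patterns.split(";") if p.strip()]
    return [
        name
        for name in file_names
        if any(fnmatchcase(name, pattern) for pattern in pattern_list)
    ]


names = ["en-NT.txt", "en-OT.txt", "fr-NT.txt", "fr-OT.txt", "notes.md"]
print(select_extract_files(names, "en-*.txt;fr-NT.txt"))
# prints ['en-NT.txt', 'en-OT.txt', 'fr-NT.txt']
```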

wildebeest

x=500: Maximum number of examples per line.
n=500: Maximum number of cases per group.
r=vref.txt: File with sentence reference IDs.
See the Wildebeest Repo for more info.

zip_password

Maps each project name to the password for its encrypted zip file.
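
One way to picture how such a mapping could be consumed is sketched below with Python's standard `zipfile` module. This is not the script's actual code: the functions `password_for` and `extract_project_zip` are hypothetical, and the sketch assumes the archive is named after the project (for example `project_name_1.zip`). Note that `zipfile` can only decrypt archives that use the legacy ZIP encryption scheme.

```python
import zipfile
from pathlib import Path


def password_for(config, project_name):
    """Look up a project's zip password in the config; None if not listed."""
    passwords = config.get("zip_password") or {}
    password = passwords.get(project_name)
    # zipfile expects the password as bytes.
    return None if password is None else str(password).encode("utf-8")


def extract_project_zip(zip_path, dest_dir, config):
    """Extract <project>.zip into dest_dir, using the configured password if any."""
    zip_path = Path(zip_path)
    # Assumption: the zip file stem is the project name.
    pwd = password_for(config, zip_path.stem)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir, pwd=pwd)
    return Path(dest_dir)


config = {"zip_password": {"project_name_1": "password_1"}}
print(password_for(config, "project_name_1"))
```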

stats

Configures the Tokenizer used for tokenization statistics.
The example shows the default used when the stats section is empty.
See Configure A Model for a full list of parameter definitions.
