Quickstart

Training

python -u aispace/trainer.py \
    --schedule train_and_eval \
    --config_name CONFIG_NAME \
    --config_dir CONFIG_DIR \
    [--experiment_name EXPERIMENT_NAME] \
    [--model_name MODEL_NAME] \
    [--gpus GPUS] 

Training with resumed model

python -u aispace/trainer.py \
    --schedule train_and_eval \
    --config_name CONFIG_NAME \
    --config_dir CONFIG_DIR \
    --model_resume_path MODEL_RESUME_PATH \
    [--experiment_name EXPERIMENT_NAME] \
    [--model_name MODEL_NAME] \
    [--gpus GPUS] 

–model_resume_path is a path to initialization model.

Average checkpoints

python -u aispace/trainer.py \
    --schedule avg_checkpoints \
    --config_name CONFIG_NAME \
    --config_dir CONFIG_DIR \
    --prefix_or_checkpoints PREFIX_OR_CHECKPOINGS \
    [--ckpt_weights CKPT_WEIGHTS] \
    [--experiment_name EXPERIMENT_NAME] \
    [--model_name MODEL_NAME] \
    [--gpus GPUS] 

–prefix_or_checkpoints is paths to multiple checkpoints separated by comma.

–ckpt_weights is weights same order as the prefix_or_checkpoints.

Deployment

Generate deployment files before deployment, you need to specify the model path (–model_resume_path) to be deployed like following.

python -u aispace/trainer.py \
    --schedule deploy \
    --config_name CONFIG_NAME \
    --config_dir CONFIG_DIR \
    --model_resume_path MODEL_RESUME_PATH \
    [--experiment_name EXPERIMENT_NAME] \
    [--model_name MODEL_NAME] \
    [--gpus GPUS] 

We use BentoML as deploy tool, so your must implement the deploy function in your model class.

Output file structure

The default output path is save, which may has multiple output directories under name as:

{experiment_name}_{model_name}_{dataset_name}_{random_seed}_{id}

Where id indicates the sequence number of the experiment for the same task, increasing from 0.

Take the text classification task as an example, the output file structure is similar to the following:

test_bert_for_classification_119_0
├── checkpoint                  # 1. checkpoints
│   ├── checkpoint
│   ├── ckpt_1.data-00000-of-00002
│   ├── ckpt_1.data-00001-of-00002
│   ├── ckpt_1.index
|   ...
├── deploy                      # 2. Bentoml depolyment directory
│   └── BertTextClassificationService
│       └── 20191208180211_B6FC81
├── hparams.json                # 3. Json file of all hyperparameters
├── logs                        # 4. general or tensorboard log directory
│   ├── errors.log              # error log file
│   ├── info.log                # info log file
│   ├── train                
│   │   ├── events.out.tfevents.1574839601.jshd-60-31.179552.14276.v2
│   │   ├── events.out.tfevents.1574839753.jshd-60-31.profile-empty
│   └── validation
│       └── events.out.tfevents.1574839787.jshd-60-31.179552.151385.v2
├── model_saved                 # 5. last model saved
│   ├── checkpoint
│   ├── model.data-00000-of-00002
│   ├── model.data-00001-of-00002
│   └── model.index
└── reports                     # 6. Eval reports for every output or task
    └── output_1_classlabel     # For example, text classification task
        ├── confusion_matrix.txt
        ├── per_class_stats.json
        └── stats.json