Training GPT-2

Now we can proceed to training on your dataset. Run the following command:

$ python train.py --dataset lyric.npz
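Before starting a long run, it can be worth a quick sanity check that the encoded dataset looks right. Below is a minimal sketch using NumPy; it assumes lyric.npz was produced by the earlier encoding step and simply stores arrays of token IDs (the array names inside the file may differ in your case):

import numpy as np

# Open the encoded dataset and list the arrays of token IDs it contains
with np.load("lyric.npz") as data:
    for name in data.files:
        tokens = data[name]
        print(name, tokens.shape, tokens.dtype)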

If you would like to see more sample output during training, you can adjust the sampling options. For example, to generate 3 samples every 50 steps, run the following command instead:

$ python train.py --dataset lyric.npz --sample_every 50 --sample_num 3

There is also an option to increase the batch size and change the learning rate. Make sure you have enough memory to handle the larger batch size (the default is 1). The learning rate controls how strongly each training step updates the model's weights during fine-tuning.

$ python train.py --dataset lyric.npz --batch_size 2 --learning_rate 0.0001
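As a rough guide: if the loss fluctuates wildly or stops improving, try a smaller learning rate; if training is stable but slow and you have memory to spare, try a larger batch size.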

Training using Horovod

If you wish to train GPT-2 across multiple GPUs, you can use Horovod with the following command (entered as a single line):

$ mpirun -np 4 -H localhost:4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x PYTHONPATH=src -mca pml ob1 -mca btl ^openib python train-horovod.py --dataset lyric.npz
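Here the -np 4 and localhost:4 arguments tell MPI to launch four worker processes on the local machine, one per GPU. To give a rough idea of what a Horovod training script does differently from single-GPU training, below is a minimal sketch of the standard Horovod pattern for TensorFlow 1.x. This is not the actual contents of train-horovod.py, just the general technique it is built on:

import tensorflow as tf
import horovod.tensorflow as hvd

# Start one Horovod worker per process launched by mpirun
hvd.init()

# Pin each worker to its own GPU
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Wrap the optimizer so gradients are averaged across all workers,
# scaling the learning rate with the number of workers
opt = tf.train.AdamOptimizer(learning_rate=0.0001 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

# Broadcast the initial weights from rank 0 so every worker starts identically
hooks = [hvd.BroadcastGlobalVariablesHook(0)]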

How to stop training?

You can stop training at any time by pressing Ctrl+C. By default, the model is saved every 1000 steps and a sample is generated every 100 steps. After you interrupt the process, you will find a checkpoint folder and a samples folder. Inside each of them is another folder called run1.

The samples folder contains example output from the model; you can open it in any text editor to evaluate your model. The checkpoint folder contains the data needed to resume training in the future.
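If you want to resume that run later, the training script can usually be pointed back at the same checkpoint. For example (this assumes your copy of train.py supports the --run_name and --restore_from flags; check python train.py --help to confirm):

$ python train.py --dataset lyric.npz --run_name run1 --restore_from latest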
