Training Database and Model
Training Database
We need to create files that are basically "parent" and "reply" text files, where each line is a sample. So line 15 in the parent file is a parent comment, and line 15 in the reply file is the response to that parent comment. To create these files, we just need to grab pairs from the database and append them to the respective training files.
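For example, corresponding lines in the two files might look like this (made-up comments, just to show the alignment):

```
parent file, line 15:  what is the best way to learn python
reply file,  line 15:  just build small projects and read the docs
```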
We start by establishing our connection, then we define the cursor, then the limit. The limit is the size of the chunk that we're going to pull at a time from the database. We want to set the limit to 5,000 for now, so we can set aside some testing data. We'll use last_unix to help us page through the database, cur_length will tell us when we're done pulling, counter will let us show some debugging information, and test_done will flag when we're done building the testing data.
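A rough sketch of that setup, assuming the database from the earlier steps is a SQLite file (the timeframe value and database file name here are placeholders; point them at whatever you built):

```python
import sqlite3
import pandas as pd

# Placeholder names: adjust to match the database you built earlier.
timeframe = '2015-05'
connection = sqlite3.connect('{}.db'.format(timeframe))
c = connection.cursor()

limit = 5000        # rows to pull from the database per chunk
last_unix = 0       # where the previous pull left off
cur_length = limit  # once this drops below limit, we're out of rows
counter = 0         # lets us print some progress/debugging info
test_done = False   # flips to True once the test files are written
```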
So long as cur_length is the same as our limit, we've still got more pulling to do. Then, we'll pull the data from the database and slap it into a dataframe.
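A sketch of that outer loop. The parent_reply table and its unix, parent, comment, and score columns are assumptions about how the database was built earlier; adjust the query to your own schema:

```python
while cur_length == limit:
    # Pull the next chunk of parent/reply pairs we haven't written out yet.
    df = pd.read_sql(
        "SELECT * FROM parent_reply "
        "WHERE unix > {} AND parent IS NOT NULL AND score > 0 "
        "ORDER BY unix ASC LIMIT {}".format(last_unix, limit),
        connection)
    cur_length = len(df)
    if cur_length == 0:
        break  # nothing left to pull
    last_unix = df.tail(1)['unix'].values[0]
    # ...the testing and training writes shown next go inside this loop...
```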
We'll start with the testing:
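Continuing inside the while loop above, the testing branch might look something like this (the parent and comment column names are assumptions, and tst2012.from/tst2012.to are just example names for the test files referenced later in this section):

```python
    # (still inside the while loop above)
    if not test_done:
        # The first chunk becomes our test data.
        with open('tst2012.from', 'a', encoding='utf8') as f:
            for content in df['parent'].values:
                f.write(content + '\n')
        with open('tst2012.to', 'a', encoding='utf8') as f:
            for content in df['comment'].values:
                f.write(content + '\n')
        test_done = True
```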
Now, if you want, you could also raise the limit at this point: once test_done = True, you could re-define limit to something like 100K. Now, let's do the training code:
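A sketch of the training branch, pairing with the if above and still inside the same loop (train.from and train.to are the file names referenced below; the column names are again assumptions):

```python
    # (also inside the while loop, paired with the if above)
    else:
        # Every chunk after the test chunk becomes training data.
        with open('train.from', 'a', encoding='utf8') as f:
            for content in df['parent'].values:
                f.write(content + '\n')
        with open('train.to', 'a', encoding='utf8') as f:
            for content in df['comment'].values:
                f.write(content + '\n')

    counter += 1
    if counter % 20 == 0:
        # With a limit of 5,000, this prints every 100,000 pairs.
        print(counter * limit, 'rows completed so far')
```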
Here, we'll see output every 20 steps, so every 100K pairs if we keep the limit at 5,000.
Training Model
There are endless models that you could come up with and use. In this case, we're going to use a sequence-to-sequence (seq2seq) model, since a model that maps an input sequence (the parent comment) to an output sequence (the reply) is a natural fit for a chatbot.
The project is subject to change, so you should check the readme, which, at the time of my writing this, says:
Make sure you download the package recursively, or manually grab the nmt package, either from the fork in our repo or from the official TensorFlow source. Our fork just has one change to the version checking, which, at least at the time, demanded exactly TensorFlow 1.4.0 even though that wasn't actually necessary.
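A recursive download looks like this (the URL is a placeholder; use the repository address from the project readme):

```
git clone --recursive <repository-url>
```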
Once downloaded, edit setup/settings.py. If you don't really know what you're doing, that's okay; you don't need to modify anything. The preset settings will require ~4GB of VRAM, but should still produce a reasonably coherent model.
Once you've got your settings all set, go to the main directory (the one with the utils, tests, and setup directories) and put your train.to and train.from files, along with the matching tst2012 and tst2013 files, into the new_data directory. Now cd into setup and run the prepare_data.py file:
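For example, from inside the setup directory:

```
python prepare_data.py
```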
Finally, cd back up to the main directory and start training:
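Assuming the project's training script is train.py at the top level (check the readme in case this has changed):

```
cd ../
python train.py
```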