Training Database and Model
Training Database
We need to create files that are basically "parent" and "reply" text files, where each line is a sample. So line 15 in the parent file is a parent comment, and line 15 in the reply file is the response to that parent comment. To create these files, we just need to grab pairs from the database and append them to the respective training files.
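For example, here is a minimal sketch of that line-aligned format (the two pairs below are made up purely for illustration; the real script that follows pulls them from the database instead):

# Purely illustrative pairs -- line N of the parent file lines up with line N of the reply file.
pairs = [
    ("What editor do you use?", "vim, but I keep meaning to learn emacs."),
    ("Any tips for learning Python?", "Build small projects and read other people's code."),
]

with open('train.from', 'a', encoding='utf8') as parent_f, \
     open('train.to', 'a', encoding='utf8') as reply_f:
    for parent, reply in pairs:
        parent_f.write(parent + '\n')  # parent comment goes on line N
        reply_f.write(reply + '\n')    # its reply goes on the same line N of the other file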
import sqlite3
import pandas as pd

timeframes = ['2015-05']

for timeframe in timeframes:
    connection = sqlite3.connect('{}.db'.format(timeframe))
    c = connection.cursor()
    limit = 5000
    last_unix = 0
    cur_length = limit
    counter = 0
    test_done = False
The first line inside the loop just establishes our connection, then we define the cursor, then the limit. The limit is the size of the chunk that we're going to pull at a time from the database. We want to set the limit to 5000 for now, so we can have some testing data. We'll use last_unix to help us make pulls from the database, cur_length will tell us when we're done, counter will allow us to show some debugging information, and test_done will tell us when we're done building testing data.
    while cur_length == limit:
        df = pd.read_sql("SELECT * FROM parent_reply WHERE unix > {} and parent NOT NULL and score > 0 ORDER BY unix ASC LIMIT {}".format(last_unix, limit), connection)
        last_unix = df.tail(1)['unix'].values[0]
        cur_length = len(df)
So long as cur_length is the same as our limit, we've still got more pulling to do. Then, we'll pull the data from the database and slap it into a dataframe.
We'll start with the testing:
        if not test_done:
            with open('test.from', 'a', encoding='utf8') as f:
                for content in df['parent'].values:
                    f.write(content + '\n')

            with open('test.to', 'a', encoding='utf8') as f:
                for content in df['comment'].values:
                    f.write(str(content) + '\n')

            test_done = True
Now, if you want, you could also raise the limit at this point: after setting test_done = True, you could re-define limit to be something much larger, like 100K (the assembled sketch further down illustrates this). Now, let's do the training code:
        else:
            with open('train.from', 'a', encoding='utf8') as f:
                for content in df['parent'].values:
                    f.write(content + '\n')

            with open('train.to', 'a', encoding='utf8') as f:
                for content in df['comment'].values:
                    f.write(str(content) + '\n')

        counter += 1
        if counter % 20 == 0:
            print(counter * limit, 'rows completed so far')
Here, we'll see output for every 20 steps, so every 100K pairs if we keep the limit to 5,000.
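Since the code above is shown in pieces, here is the whole export script assembled in one place. It is the same logic as above; the optional bump of limit to 100,000 after the test split is left commented out as an illustration only.

import sqlite3
import pandas as pd

timeframes = ['2015-05']

for timeframe in timeframes:
    connection = sqlite3.connect('{}.db'.format(timeframe))
    c = connection.cursor()
    limit = 5000
    last_unix = 0
    cur_length = limit
    counter = 0
    test_done = False

    while cur_length == limit:
        df = pd.read_sql("SELECT * FROM parent_reply WHERE unix > {} and parent NOT NULL and score > 0 ORDER BY unix ASC LIMIT {}".format(last_unix, limit), connection)
        last_unix = df.tail(1)['unix'].values[0]
        cur_length = len(df)

        if not test_done:
            # the first chunk becomes the test split
            with open('test.from', 'a', encoding='utf8') as f:
                for content in df['parent'].values:
                    f.write(content + '\n')
            with open('test.to', 'a', encoding='utf8') as f:
                for content in df['comment'].values:
                    f.write(str(content) + '\n')
            test_done = True
            # optional: pull bigger chunks for the training split from here on
            # limit = 100000
        else:
            with open('train.from', 'a', encoding='utf8') as f:
                for content in df['parent'].values:
                    f.write(content + '\n')
            with open('train.to', 'a', encoding='utf8') as f:
                for content in df['comment'].values:
                    f.write(str(content) + '\n')

        counter += 1
        if counter % 20 == 0:
            print(counter * limit, 'rows completed so far')

When it finishes, a quick sanity check is to confirm that train.from and train.to (and test.from and test.to) have the same number of lines, since the model relies on line N of one file matching line N of the other.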
Training the Model
There are endless models that you could come up with and use. In this case, we are going to use a sequence-to-sequence (seq2seq) model, since mapping a parent comment to a reply has the same structure as machine translation, which is what these models were built for.
The latest NMT tutorial and code from TensorFlow can be found here: Neural Machine Translation (seq2seq) Tutorial
The project is subject to change, so you should check the readme, which, at the time of writing, says:
$ git clone --recursive https://github.com/daniel-kukiela/nmt-chatbot
$ cd nmt-chatbot
$ pip install -r requirements.txt
$ cd setup
(optional) Edit settings.py to your liking. These are a decent starting point for ~4GB of VRAM; you should first start by trying to raise the vocab size if you can.
(optional) Edit the text files containing rules in the setup directory.
Place training data inside the "new_data" folder (train.(from|to), tst2012.(from|to), tst2013.(from|to)). We have provided some sample data for those who just want to do a quick test drive.
$ python prepare_data.py (runs setup/prepare_data.py - a new folder called "data" will be created with the prepared training data)
$ cd ../
$ python train.py (begin training)
Make sure you clone the package recursively, or manually get the nmt package, either from the fork in our repo or from the official TensorFlow source. Our fork just has one change to the version checking, which, at least at the time, demanded the very specific TensorFlow version 1.4.0 even though that wasn't actually necessary.
Once downloaded, edit setup/settings.py. If you don't really know what you're doing, that's okay; you don't need to modify anything. The preset settings will require ~4GB of VRAM, but should still produce at least a coherent model.
Once you've got your settings all set, from inside the main directory (the one with the utils, tests, and setup directories), throw your train.to and train.from files, along with the matching tst2012 and tst2013 files, into the new_data directory. Now cd into setup and run the prepare_data.py file:
$ cp TensorFlow/Chatbot/scripts/train.to ~/nmt/new_data/
$ cp TensorFlow/Chatbot/scripts/train.from ~/nmt/new_data/
$ cd ~/nmt/setup/
$ python3 prepare_data.py
Finally, cd back up to the main directory and start training:
$ cd ..
$ python3 train.py