Training Database and Model
Training Database
We need to create files that are basically "parent" and "reply" text files, where each line is a sample. So line 15 in the parent file is a parent comment, and line 15 in the reply file is the response to that parent comment. To create these files, we just need to grab pairs from the database and append them to the respective training files.
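For example, here is a minimal sketch of that line-aligned format (the two pairs below are made up purely for illustration; the real script that follows pulls them from the database instead):

# Purely illustrative pairs -- line N of the parent file lines up with line N of the reply file.
pairs = [
    ("What editor do you use?", "vim, but I keep meaning to learn emacs."),
    ("Any tips for learning Python?", "Build small projects and read other people's code."),
]

with open('train.from', 'a', encoding='utf8') as parent_f, \
     open('train.to', 'a', encoding='utf8') as reply_f:
    for parent, reply in pairs:
        parent_f.write(parent + '\n')  # parent comment goes on line N
        reply_f.write(reply + '\n')    # its reply goes on the same line N of the other file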
import sqlite3
import pandas as pd

timeframes = ['2015-05']

for timeframe in timeframes:
    connection = sqlite3.connect('{}.db'.format(timeframe))
    c = connection.cursor()
    limit = 5000
    last_unix = 0
    cur_length = limit
    counter = 0
    test_done = False
The first line inside the loop just establishes our connection, then we define the cursor, then the limit. The limit is the size of the chunk that we're going to pull at a time from the database. We want to set the limit to 5000 for now, so we can have some testing data. We'll use last_unix to help us make pulls from the database, cur_length will tell us when we're done, counter will allow us to show some debugging information, and test_done will tell us when we're done building testing data.
    while cur_length == limit:
        df = pd.read_sql("SELECT * FROM parent_reply WHERE unix > {} and parent NOT NULL and score > 0 ORDER BY unix ASC LIMIT {}".format(last_unix, limit), connection)
        last_unix = df.tail(1)['unix'].values[0]
        cur_length = len(df)
So long as cur_length is the same as our limit, we've still got more pulling to do. Then, we'll pull the data from the database and slap it into a dataframe.
We'll start with the testing:
        if not test_done:
            with open('test.from', 'a', encoding='utf8') as f:
                for content in df['parent'].values:
                    f.write(content + '\n')

            with open('test.to', 'a', encoding='utf8') as f:
                for content in df['comment'].values:
                    f.write(str(content) + '\n')

            test_done = True
Now, if you want, you could also raise the limit at this point: after setting test_done = True, you could re-define limit to be something much larger, like 100K (the assembled sketch further down illustrates this). Now, let's do the training code:
        else:
            with open('train.from', 'a', encoding='utf8') as f:
                for content in df['parent'].values:
                    f.write(content + '\n')

            with open('train.to', 'a', encoding='utf8') as f:
                for content in df['comment'].values:
                    f.write(str(content) + '\n')

        counter += 1
        if counter % 20 == 0:
            print(counter * limit, 'rows completed so far')
Here, we'll see output for every 20 steps, so every 100K pairs if we keep the limit to 5,000.
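Since the code above is shown in pieces, here is the whole export script assembled in one place. It is the same logic as above; the optional bump of limit to 100,000 after the test split is left commented out as an illustration only.

import sqlite3
import pandas as pd

timeframes = ['2015-05']

for timeframe in timeframes:
    connection = sqlite3.connect('{}.db'.format(timeframe))
    c = connection.cursor()
    limit = 5000
    last_unix = 0
    cur_length = limit
    counter = 0
    test_done = False

    while cur_length == limit:
        df = pd.read_sql("SELECT * FROM parent_reply WHERE unix > {} and parent NOT NULL and score > 0 ORDER BY unix ASC LIMIT {}".format(last_unix, limit), connection)
        last_unix = df.tail(1)['unix'].values[0]
        cur_length = len(df)

        if not test_done:
            # the first chunk becomes the test split
            with open('test.from', 'a', encoding='utf8') as f:
                for content in df['parent'].values:
                    f.write(content + '\n')
            with open('test.to', 'a', encoding='utf8') as f:
                for content in df['comment'].values:
                    f.write(str(content) + '\n')
            test_done = True
            # optional: pull bigger chunks for the training split from here on
            # limit = 100000
        else:
            with open('train.from', 'a', encoding='utf8') as f:
                for content in df['parent'].values:
                    f.write(content + '\n')
            with open('train.to', 'a', encoding='utf8') as f:
                for content in df['comment'].values:
                    f.write(str(content) + '\n')

        counter += 1
        if counter % 20 == 0:
            print(counter * limit, 'rows completed so far')

When it finishes, a quick sanity check is to confirm that train.from and train.to (and test.from and test.to) have the same number of lines, since the model relies on line N of one file matching line N of the other.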
Training the Model
There are endless models that you could come up with and use. In this case, we are going to use a sequence-to-sequence (seq2seq) model, since mapping a parent comment to a reply has the same structure as machine translation, which is what these models were built for.
The latest NMT tutorial and code from TensorFlow can be found here: Neural Machine Translation (seq2seq) Tutorial
The project is subject to change, so you should check the readme, which, at the time of writing, says:
$ git clone --recursive https://github.com/daniel-kukiela/nmt-chatbot
$ cd nmt-chatbot
$ pip install -r requirements.txt
$ cd setup
(optional) Edit settings.py to your liking. These are a decent starting point for ~4GB of VRAM; you should first start by trying to raise the vocab size if you can.
(optional) Edit the text files containing rules in the setup directory.
Place training data inside the "new_data" folder (train.(from|to), tst2012.(from|to), tst2013.(from|to)). We have provided some sample data for those who just want to do a quick test drive.
$ python prepare_data.py (runs setup/prepare_data.py - a new folder called "data" will be created with the prepared training data)
$ cd ../
$ python train.py (begin training)
Make sure you clone the package recursively, or manually get the nmt package, either from the fork in our repo or from the official TensorFlow source. Our fork just has one change to the version checking, which, at least at the time, demanded the very specific TensorFlow version 1.4.0 even though that wasn't actually necessary.
Once downloaded, edit setup/settings.py. If you don't really know what you're doing, that's okay; you don't need to modify anything. The preset settings will require ~4GB of VRAM, but should still produce at least a coherent model.
Once you've got your settings all set, from inside the main directory (the one with the utils, tests, and setup directories), throw your train.to and train.from files, along with the matching tst2012 and tst2013 files, into the new_data directory. Now cd into setup and run the prepare_data.py file:
$ cp TensorFlow/Chatbot/scripts/train.to ~/nmt/new_data/
$ cp TensorFlow/Chatbot/scripts/train.from ~/nmt/new_data/
$ cd ~/nmt/setup/
$ python3 prepare_data.py
Finally, cd back up to the main directory and start training:
$ cd ..
$ python3 train.py