Concepts and parameters
For this project we need to be familiar with the concept of a neural network, and specifically with sequence-to-sequence tasks, where the input and output sequences aren't necessarily the same length.
First of all, what is sequence to sequence? For that, we're using the framework of Neural Machine Translation (NMT).
Sequence-to-sequence (seq2seq) models have enjoyed great success in a variety of tasks such as machine translation, speech recognition, and text summarization.
An NMT system first reads the source sentence using an _encoder_ to build a "thought" vector, a sequence of numbers that represents the sentence meaning; a _decoder_ then processes the sentence vector to emit a translation. This is often referred to as the encoder-decoder architecture. In this manner, NMT addresses the local translation problem in the traditional phrase-based approach: it can capture _long-range dependencies_ in languages, e.g., gender agreements and syntax structures, and produce much more fluent translations.
NMT models vary in terms of their exact architectures. A natural choice for sequential data is the recurrent neural network (RNN), used by most NMT models. Usually an RNN is used for both the encoder and decoder. The RNN models, however, differ in terms of: (a) directionality – unidirectional or bidirectional; (b) depth – single- or multi-layer; and (c) type – often either a vanilla RNN, a Long Short-term Memory (LSTM), or a gated recurrent unit (GRU).
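To make the encoder-decoder idea concrete, here is a minimal sketch in tf.keras. The real project uses the TensorFlow NMT codebase, so the layer names, vocabulary size, and hidden sizes below are illustrative assumptions, not values taken from it.

```python
# Minimal encoder-decoder sketch (illustrative sizes, not the project's settings).
import tensorflow as tf

VOCAB_SIZE = 10000   # assumed vocabulary size
EMBED_DIM = 128
HIDDEN_UNITS = 256

# Encoder: embeds source tokens and compresses them into a "thought" vector
# (the final LSTM states).
encoder_inputs = tf.keras.Input(shape=(None,), name="source_tokens")
enc_embed = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(encoder_inputs)
_, state_h, state_c = tf.keras.layers.LSTM(HIDDEN_UNITS, return_state=True)(enc_embed)

# Decoder: starts from the encoder's states and emits target tokens step by step
# during training (teacher forcing).
decoder_inputs = tf.keras.Input(shape=(None,), name="target_tokens")
dec_embed = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(decoder_inputs)
dec_outputs = tf.keras.layers.LSTM(HIDDEN_UNITS, return_sequences=True)(
    dec_embed, initial_state=[state_h, state_c])
logits = tf.keras.layers.Dense(VOCAB_SIZE)(dec_outputs)

model = tf.keras.Model([encoder_inputs, decoder_inputs], logits)
model.summary()
```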
In the case of a chatbot, one-word statements could yield 20-word responses, and long statements could return single-word responses, and each input is going to vary from the last in terms of characters, words, and more. Words themselves will be assigned either arbitrary or meaningful ids (via word vectors), but how do we handle the variable lengths? One answer is to just make all strings of words 50 words long (for example). Then, when a statement is 35 words long, we could just pad the other 15. Any data longer than 50 words we can either truncate or not use for training.
Unfortunately, this can make training hard, especially for shorter responses, which might be the most common, since most of the words/tokens will just be padding.
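Here's a quick sketch of that fixed-length padding/truncation approach (which, as noted next, the code we actually use avoids). The maximum length and pad id are illustrative choices, not values from the project.

```python
# Force every sequence of token ids to exactly MAX_LEN entries by padding or truncating.
MAX_LEN = 50
PAD_ID = 0

def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    if len(token_ids) >= max_len:
        return token_ids[:max_len]                         # truncate anything too long
    return token_ids + [pad_id] * (max_len - len(token_ids))  # pad the rest

print(pad_or_truncate([12, 7, 993]))   # 3 real tokens followed by 47 pads
```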
The code we're working with handles variable-length inputs, with no bucketing or padding required! Next, this code also contains support for attention mechanisms, which are an attempt at adding longer-term memory to recurrent neural networks. Finally, we'll also be making use of bidirectional recurrent neural networks (BRNNs).
An LSTM can remember sequences of tokens up to about 10-20 in length fairly well. After this point, however, performance drops and the network forgets the initial tokens to make room for new ones. In our case, tokens are words, so a basic LSTM should be capable of learning 10-20-word sentences, but as we go longer than this, chances are the output is not going to be as good. Attention mechanisms come in to give longer "attention spans," which help a network reach more like 30, 40, or even 80 words, for example.
In many sequence-to-sequence tasks, like language translation, we can do pretty well by converting words in place and learning simple patterns about grammar, since many languages are syntactically similar. With natural language and communication, along with some forms of translation like English to Japanese, context and flow matter much more.
The bidirectional recurrent neural network (BRNN) assumes that data from both the past and the future is important for the current point in an input sequence. The "bidirectional" part of the name is pretty well descriptive: the input sequence goes both ways. One pass goes forward, and the other goes in reverse. To illustrate this:
In a simple RNN, you have your input layer, your output layer, and one hidden layer. Connections go from the input layer to the hidden layer, and each node in the hidden layer also passes along to the next hidden-layer node. This is how we get the "temporal," not-so-static characteristics of recurrent neural networks, since previous inputs are allowed to carry forward along the hidden layer. In a BRNN, the hidden layer instead consists of nodes going in opposite directions: you still have your input and output layers, but the hidden layer(s) run both forward and in reverse.
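As a rough sketch of what that looks like in code (again assuming tf.keras rather than the project's own codebase), a bidirectional layer simply wraps two recurrent passes, one forward and one reversed, and combines their outputs:

```python
# One LSTM reads the sequence forward, a second reads it in reverse,
# and their outputs are concatenated. Sizes are illustrative.
import tensorflow as tf

inputs = tf.keras.Input(shape=(None, 128))           # (time steps, features)
brnn = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(256, return_sequences=True),
    merge_mode="concat")(inputs)                      # forward + backward outputs
print(brnn.shape)                                     # (None, None, 512)
```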
The next addition to our network is an attention mechanism, since, despite passing data forwards and backwards, our network is not good at remembering longer sequences (more like 3-10 tokens at a time at most). If you're tokenizing words, which we are, that means 3-10 words at a time maximum, and this is even more problematic for character-level models, where the most you can remember is more like 3-10 characters. With attention mechanisms, we can go out to 30, 40, 80+ tokens in sequence.
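The core idea of attention is simple enough to show in a few lines of numpy: score every encoder step against the current decoder state, turn the scores into weights with a softmax, and build a weighted "context" vector. This is a bare-bones dot-product (Luong-style) sketch with made-up shapes, not the project's implementation.

```python
import numpy as np

encoder_outputs = np.random.randn(20, 256)   # 20 source steps, 256-dim each
decoder_state = np.random.randn(256)         # current decoder hidden state

scores = encoder_outputs @ decoder_state     # one score per source step
weights = np.exp(scores - scores.max())
weights /= weights.sum()                     # softmax -> attention weights
context = weights @ encoder_outputs          # weighted sum of encoder outputs

print(weights.shape, context.shape)          # (20,) (256,)
```

The context vector lets the decoder "look back" at whichever source tokens matter for the word it is currently producing, which is what stretches the usable memory well past what the recurrent state alone can hold.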
BLEU stands for "bilingual evaluation understudy," and it's probably our best way to determine the overall effectiveness of a translation algorithm. It's important to note, however, that BLEU scores are always relative to the specific sequences we're translating.
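If you want to poke at BLEU yourself, NLTK ships a sentence-level implementation (assuming you have nltk installed; it's not part of the project's codebase, just a convenient check):

```python
from nltk.translate.bleu_score import sentence_bleu

reference = ["the", "cat", "sat", "on", "the", "mat"]
candidate = ["the", "cat", "sat", "on", "the", "rug"]

# sentence_bleu takes a list of reference token lists plus one candidate,
# and by default scores overlapping 1- to 4-grams.
score = sentence_bleu([reference], candidate)
print(score)
```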
With BLEU and PPL, for translation, you can usually just keep training a model so long as BLEU is rising and PPL is falling. With a chatbot, however, where there never actually is, or never should be, a single "correct" answer, I would warn against continuing to train just because BLEU keeps rising and PPL keeps falling, since this is likely to produce more robotic responses rather than highly varied ones.
Basically, loss is a measure of how far "off" the neural network's output layer was compared to the sample data. The lower the loss, the better.
The final concept is beam search. Using this, we can check out a collection of the top translations from the model, rather than just the single best one without even considering the others. Doing this causes longer translation times, but it's a definite must-have for a translation model in my opinion, since, as we'll find, our model is still highly likely to produce outputs that we don't desire, and training these outputs out might cause overfitting elsewhere. Allowing for multiple translations will help in both training and production.
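Here's a toy beam search sketch to show the idea of keeping the top-k partial sequences instead of greedily committing to the single best token at each step. The scoring function is a random stand-in for the decoder, purely for illustration.

```python
import numpy as np

VOCAB, STEPS, BEAM = 8, 4, 3
rng = np.random.default_rng(0)

def step_log_probs(prefix):
    """Stand-in for the decoder: log-probs over the vocab given a prefix."""
    logits = rng.normal(size=VOCAB)
    return logits - np.log(np.exp(logits).sum())

beams = [([], 0.0)]                      # (token sequence, cumulative log-prob)
for _ in range(STEPS):
    candidates = []
    for seq, score in beams:
        log_probs = step_log_probs(seq)
        for tok in range(VOCAB):
            candidates.append((seq + [tok], score + log_probs[tok]))
    # keep only the BEAM highest-scoring partial sequences
    candidates.sort(key=lambda c: c[1], reverse=True)
    beams = candidates[:BEAM]

for seq, score in beams:
    print(seq, round(score, 3))
```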
Aside from BLEU, you will also see perplexity, often in the shorthand "PPL." Perplexity is another decent measurement of a model's effectiveness. Unlike BLEU, the lower the better, as it measures how well the model's probability distribution predicts an output from a sample. Again, this is most meaningful for straight language translation.
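Perplexity is just the exponential of the average per-token cross-entropy loss, which is also why loss and PPL tend to move together during training. The probabilities below are made up for illustration:

```python
import math

# model's predicted probability for each correct target token in a sample
token_probs = [0.40, 0.25, 0.10, 0.60, 0.05]

cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(cross_entropy)
print(round(cross_entropy, 3), round(perplexity, 3))
```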