Data Structure
At this point you have downloaded the data that we are going to use for this chatbot. Now we need to separate it and understand what is what; in other words, we need to know what is an input and what is an output. With neural networks, these correspond to the input layer and output layer of the network, and for a chatbot, we need to know what is a comment and what is a reply.
In our Reddit data, not all comments are replies, and many comments will have many replies. There are also cases where one reply is better than another. So let's look at the cases we need to handle.
The structure that we are using is:
As we can see, there is some stuff that we don't need, so what we are going to consider is the body, comment_id and parent_id. We also need to buffer the data, because a single month can be more than 32GB, and this can't fit into RAM.
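As a minimal sketch of that idea, the snippet below streams the dump one line at a time and keeps only the three fields we care about. The function names `iter_comments` and `read_dump` are hypothetical, and the raw JSON key `id` for the comment id is an assumption about the dump format; adjust it to match your data.

```python
import json

def iter_comments(lines):
    """Yield only the fields we need from an iterable of JSON lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        yield {
            "comment_id": row["id"],        # assumed key name in the dump
            "parent_id": row["parent_id"],
            "body": row["body"],
        }

def read_dump(path):
    """Stream a dump file lazily so the whole file never sits in RAM."""
    # Iterating a file object reads one line at a time.
    with open(path, encoding="utf-8") as f:
        yield from iter_comments(f)
```

Because both functions are generators, only one comment is held in memory at a time, no matter how large the monthly file is.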
The idea is to create a data structure in an SQLite database and insert the comments. All comments will come in chronologically, so every comment will initially be a "parent" and have no parent of its own. Over time, though, there will be replies, and we can then store each "reply", which will have a parent in the database that we can also pull by id.
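The parent lookup described above could look roughly like this. It is a simplified sketch using an in-memory database; the table name `comments`, its three columns, and the helper names `store_comment` and `find_parent_body` are all made up for illustration and are not the final schema.

```python
import sqlite3

# In-memory database, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments (comment_id TEXT PRIMARY KEY, parent_id TEXT, body TEXT)"
)

def store_comment(conn, comment_id, parent_id, body):
    """Insert one comment as it arrives, in chronological order."""
    conn.execute(
        "INSERT INTO comments VALUES (?, ?, ?)", (comment_id, parent_id, body)
    )

def find_parent_body(conn, parent_id):
    """Return the body of the parent comment, or None if we never stored it."""
    row = conn.execute(
        "SELECT body FROM comments WHERE comment_id = ? LIMIT 1", (parent_id,)
    ).fetchone()
    return row[0] if row else None

# The first comment has no stored parent; the second one is a reply to it.
store_comment(conn, "c1", "post1", "What is SQLite?")
store_comment(conn, "c2", "c1", "An embedded database.")

first_parent = find_parent_body(conn, "post1")  # no such comment stored
reply_parent = find_parent_body(conn, "c1")     # body of comment c1
```

Since comments arrive in chronological order, a parent is always stored before any of its replies, so the lookup by id can succeed by the time a reply shows up.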
Let's start talking about the code and how it is structured.
We will be using sqlite3 for our database, json to load in the lines from the data dump, and datetime really just for logging. The torrent dump came with a bunch of directories by year, which contain the actual JSON data dumps, named by year and month (YYYY-MM). They are compressed in .bz2. Make sure you extract the ones you intend to use.
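If you would rather extract a file from Python than with an archive tool, a sketch along these lines should work with the standard library. The function name `extract_bz2` and both paths are hypothetical placeholders.

```python
import bz2
import shutil

def extract_bz2(src_path, dest_path):
    """Decompress a .bz2 file into dest_path, streaming in chunks."""
    # copyfileobj copies in blocks, so the full dump is never loaded into RAM.
    with bz2.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)
```

For example, `extract_bz2("RC_YYYY-MM.bz2", "RC_YYYY-MM")` would write the decompressed dump next to the archive, assuming that file naming.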
The timeframe value is going to be the year and month of data that we're going to use.
Next, we have sql_transaction. The "commit" in SQL is the more costly action, so the idea is to collect statements and commit them in groups rather than one at a time.
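One way to batch the work, shown here only as a sketch: queue each statement in the sql_transaction list and commit once the queue reaches a threshold. The helper names `transaction_bldr` and `flush`, and the batch size of 1000, are assumptions for illustration.

```python
import sqlite3

BATCH_SIZE = 1000      # assumed threshold; tune for your machine
sql_transaction = []   # queued (sql, params) pairs awaiting one commit

def flush(conn):
    """Run every queued statement, then commit them all at once."""
    with conn:  # commits on success, rolls back if a statement fails
        for sql, params in sql_transaction:
            conn.execute(sql, params)
    sql_transaction.clear()

def transaction_bldr(conn, sql, params=(), batch_size=BATCH_SIZE):
    """Queue a statement; flush the queue once it is large enough."""
    sql_transaction.append((sql, params))
    if len(sql_transaction) >= batch_size:
        flush(conn)
```

Remember to call `flush(conn)` once more after the last line of the dump, otherwise the final partial batch is never written.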
With SQLite, the database is created by connect if it doesn't already exist.
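A small sketch of how the timeframe value and connect might fit together. The helper `open_database` and the choice of naming the database file after the timeframe are assumptions made for this example.

```python
import os
import sqlite3

def open_database(timeframe, directory="."):
    """Open (and create if missing) a database file named after the timeframe."""
    path = os.path.join(directory, "{}.db".format(timeframe))
    # sqlite3.connect creates the file if it does not already exist.
    return sqlite3.connect(path)
```

Calling `open_database("2015-05")` would open or create `2015-05.db` in the current directory; the month shown is just a placeholder for whichever dump you extracted.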
Here, we're preparing to store the parent_id, comment_id, the parent comment, the reply (comment), subreddit, the time, and then finally the score (votes) for the comment.
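A table holding those seven values could be created roughly as follows. The table name `parent_reply`, the column names, and the column types are assumptions for this sketch; only the list of stored values comes from the text above.

```python
import sqlite3

def create_table(conn):
    """Create the comment/reply table if it is not there yet."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS parent_reply (
            parent_id  TEXT PRIMARY KEY,
            comment_id TEXT UNIQUE,
            parent     TEXT,   -- body of the parent comment
            comment    TEXT,   -- body of the reply
            subreddit  TEXT,
            unix       INT,    -- creation time as a Unix timestamp
            score      INT     -- votes for the comment
        )
    """)
    conn.commit()
```

Using `IF NOT EXISTS` makes the function safe to call on every run, whether the database file is new or already populated.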