Data Structure

At this point you have downloaded the data that we are going to use for this chatbot. Now we need to separate it and understand what is what; in other words, we need to know what is an input and what is an output. With neural networks, these are the input layer and the output layer of the network; for a chatbot, we need to know what is a comment and what is a reply.

In our Reddit data, not all comments are replies, and many comments have several replies. There are also cases where one reply is better than another. So let's look at the cases we need to handle.

The structure that we are going to use is:

{"author":"Arve","link_id":"t3_5yba3","score":0,"body":"Can we please deprecate the word \"Ajax\" now? \r\n\r\n(But yeah, this _is_ much nicer)","score_hidden":false,"author_flair_text":null,"gilded":0,"subreddit":"reddit.com","edited":false,"author_flair_css_class":null,"retrieved_on":1427426409,"name":"t1_c0299ap","created_utc":"1192450643","parent_id":"t1_c02999p","controversiality":0,"ups":0,"distinguished":null,"id":"c0299ap","subreddit_id":"t5_6","downs":0,"archived":true}

As we can see, there are some fields that we don't need, so we will only consider the body, comment_id, and parent_id. We also need to buffer the data, because a single month can be more than 32GB, which won't fit into RAM.
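As a minimal sketch of how those fields could be picked out of the dump one line at a time, the helper below reads any iterable of JSON lines and yields only what we keep. The function name iter_comments is hypothetical, and using the name field as the comment ID is an assumption, made because it carries the same t1_ prefix as parent_id.

```python
import json


def iter_comments(lines):
    """Yield (comment_id, parent_id, body) for each JSON line.

    `lines` can be any iterable of strings, such as an open file object,
    so only one line needs to be held in memory at a time.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        # Assumption: "name" is used as the comment ID because it has the
        # same "t1_" prefix style as "parent_id".
        yield row["name"], row["parent_id"], row["body"]


# A shortened version of the sample comment shown above.
sample = r'{"author":"Arve","body":"Can we please deprecate the word \"Ajax\" now?","name":"t1_c0299ap","parent_id":"t1_c02999p","score":0}'
parsed = list(iter_comments([sample]))
```

Because the helper is a generator, passing it an open file instead of a list would process the dump line by line rather than loading it all at once.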

The idea is to create a data structure in an SQLite database and insert the comments. All comments arrive chronologically, so every comment is initially a "parent" with no parent of its own. Over time, though, there will be replies, and we can then store each "reply", which will have a parent in the database that we can also pull by ID.
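The parent lookup described here can be sketched with a throwaway in-memory database. This is only an illustration under stated assumptions: the simplified comments table, the helper names store_comment and find_parent_body, and the IDs are all hypothetical, not the table used later in this tutorial.

```python
import sqlite3

connection = sqlite3.connect(":memory:")  # throwaway in-memory database
c = connection.cursor()
# Hypothetical, simplified table used only for this illustration.
c.execute("CREATE TABLE comments(comment_id TEXT PRIMARY KEY, parent_id TEXT, body TEXT)")


def store_comment(comment_id, parent_id, body):
    """Insert one comment as it arrives."""
    c.execute("INSERT INTO comments VALUES (?, ?, ?)", (comment_id, parent_id, body))


def find_parent_body(parent_id):
    """Return the stored body whose comment_id equals parent_id, or None."""
    c.execute("SELECT body FROM comments WHERE comment_id = ?", (parent_id,))
    result = c.fetchone()
    return result[0] if result else None


# Comments arrive chronologically: the first one has no stored parent.
store_comment("t1_aaa", "t3_post", "Original comment")
first_parent = find_parent_body("t3_post")  # nothing stored under this ID

# A later reply points back at the first comment through its parent_id.
store_comment("t1_bbb", "t1_aaa", "A reply")
reply_parent = find_parent_body("t1_aaa")
```

When the lookup finds a body, the pair (parent body, reply body) is exactly the input and output we are after.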

Let's start talking about the code and how it is structured.

import sqlite3
import json
from datetime import datetime

We will be using sqlite3 for our database, json to load in the lines from the data dump, and datetime really just for logging. The torrent dump came with a bunch of directories by year, which contain the actual JSON data dumps, named by year and month (YYYY-MM). They are compressed in .bz2. Make sure you extract the ones you intend to use.

timeframe = '2015-05'
sql_transaction = []

connection = sqlite3.connect('{}.db'.format(timeframe))
c = connection.cursor()

The timeframe value is going to be the year and month of data that we're going to use.

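Next, we have sql_transaction. The "commit" in SQL is the more costly action, so rather than committing rows one by one, the idea is to collect statements in this list and commit them in bulk.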
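To make the role of a list like sql_transaction concrete, here is a minimal sketch of batching: statements are queued and only committed once enough have piled up. The demo table, the BATCH_SIZE threshold, and the helper names queue_statement and flush are assumptions for illustration, not part of the tutorial's code.

```python
import sqlite3

connection = sqlite3.connect(":memory:")  # throwaway in-memory database
c = connection.cursor()
c.execute("CREATE TABLE demo(value INT)")  # hypothetical demo table

sql_transaction = []
BATCH_SIZE = 1000  # hypothetical threshold; tune for your machine
commit_count = 0


def flush():
    """Execute every queued statement, then commit once for the whole batch."""
    global commit_count
    if not sql_transaction:
        return
    for statement, params in sql_transaction:
        c.execute(statement, params)
    connection.commit()
    commit_count += 1
    sql_transaction.clear()


def queue_statement(statement, params=()):
    """Queue a statement and flush when the batch is full."""
    sql_transaction.append((statement, params))
    if len(sql_transaction) >= BATCH_SIZE:
        flush()


for i in range(2500):
    queue_statement("INSERT INTO demo(value) VALUES (?)", (i,))
flush()  # commit whatever is left over after the loop

c.execute("SELECT COUNT(*) FROM demo")
row_count = c.fetchone()[0]
```

With these numbers, 2500 inserts cost three commits instead of 2500.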

With SQLite, the database is created by the connect call if it doesn't already exist.

def create_table():
    c.execute("CREATE TABLE IF NOT EXISTS parent_reply(parent_id TEXT PRIMARY KEY, comment_id TEXT UNIQUE, parent TEXT, comment TEXT, subreddit TEXT, unix INT, score INT)")

Here, we're preparing to store the parent_id, comment_id, the parent comment, the reply (comment), subreddit, the time, and then finally the score (votes) for the comment.
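As a quick check of what that statement produces, the sketch below runs the same CREATE TABLE against an in-memory database (an assumption made so that no file is written) and lists the resulting columns with SQLite's table_info pragma.

```python
import sqlite3

connection = sqlite3.connect(":memory:")  # in-memory stand-in for the .db file
c = connection.cursor()


def create_table():
    c.execute("CREATE TABLE IF NOT EXISTS parent_reply(parent_id TEXT PRIMARY KEY, comment_id TEXT UNIQUE, parent TEXT, comment TEXT, subreddit TEXT, unix INT, score INT)")


create_table()
create_table()  # calling it again is harmless because of IF NOT EXISTS

# Each table_info row is (cid, name, type, notnull, dflt_value, pk).
c.execute("PRAGMA table_info(parent_reply)")
columns = [(row[1], row[2]) for row in c.fetchall()]
```

Here columns holds one (name, declared type) pair per field, in the order they appear in the statement.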


