Now, let's begin to buffer through the data. We'll also start a couple of counters for tracking progress over time:
if __name__ == '__main__':
    create_table()
    row_counter = 0
    paired_rows = 0

    with open('your/directory/path/{}/RC_{}'.format(timeframe.split('-')[0],timeframe), buffering=1000) as f:
        for row in f:
The row_counter will just output from time to time to let us know how far along we are in the file we're iterating through, and paired_rows will tell us how many rows of data we have that are paired.
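For example, we might print both counters every 100,000 rows so we can watch progress; a minimal sketch (the interval is arbitrary) that would hang at the bottom of the for loop:

# inside the for-row loop: periodic progress report
if row_counter % 100000 == 0:
    print('Total rows read: {}, Paired rows: {}'.format(row_counter, paired_rows))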
Next, because the file is too large for us to be dealing with in memory, we're going to use the buffering parameter, reading the file in small chunks as we iterate over it. Now, we need to read each row, which is in JSON format:
if __name__ == '__main__':
    create_table()
    row_counter = 0
    paired_rows = 0

    with open('J:/your/path/{}/RC_{}'.format(timeframe.split('-')[0],timeframe), buffering=1000) as f:
        for row in f:
            row_counter += 1
            row = json.loads(row)
            parent_id = row['parent_id']
            body = format_data(row['body'])
            created_utc = row['created_utc']
            score = row['score']
            comment_id = row['name']
            subreddit = row['subreddit']
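For reference, after json.loads() each row is a Python dict along these lines (the values here are made up, and real rows carry more fields than the ones we use):

{'parent_id': 't3_xxxxx',
 'body': 'Sounds good to me.',
 'created_utc': 1430438400,
 'score': 5,
 'name': 't1_yyyyy',
 'subreddit': 'AskReddit'}

Notice that the body goes through format_data, which we define next: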
def format_data(data):
    data = data.replace('\n',' newlinechar ').replace('\r',' newlinechar ').replace('"',"'")
    return data
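To see what this does, here is format_data run on a small made-up comment:

print(format_data('He said "hi"\nthen left'))
# -> He said 'hi' newlinechar then left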
We'll throw this in to normalize the comments and to convert the newline characters to a word, since literal newlines would otherwise break the one-comment-per-line training files we build later. We can read the data into a Python object by using json.loads(), which just takes a string formatted like a JSON object.

All comments will initially not have a parent, either because the comment is top level (and the parent is the Reddit post itself), or because the parent isn't in our document. As we go through the document, however, we will find comments that do have parents we've already got in our database. When this happens, we want to instead add this comment to the existing parent. Once we've gone through a file, or a list of files, we'll take the database, output our pairs as training data, and train our model. So, before we input our data to the database, we should see if we can find the parent first:
parent_data = find_parent(parent_id)
Now, we need to create the find_parent function:
def find_parent(pid):
    try:
        sql = "SELECT comment FROM parent_reply WHERE comment_id = '{}' LIMIT 1".format(pid)
        c.execute(sql)
        result = c.fetchone()
        if result is not None:
            return result[0]
        else:
            return False
    except Exception as e:
        # print(str(e))
        return False
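As an aside, building the query with str.format is workable here because Reddit IDs are simple strings like t1_c5fmjtk, but sqlite3's own parameter substitution is the safer habit; a sketch of the same lookup written that way:

def find_parent(pid):
    # same lookup, but let sqlite3 handle quoting via a ? placeholder
    try:
        c.execute("SELECT comment FROM parent_reply WHERE comment_id = ? LIMIT 1", (pid,))
        result = c.fetchone()
        if result is not None:
            return result[0]
        else:
            return False
    except Exception as e:
        # print(str(e))
        return False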
We also need to impose a restriction on *all* comments, regardless of any other conditions: we only want to deal with comments that other users found worthwhile, and our proxy for that is the comment's score. So let's require the score to be two or higher, and then let's also see if there's already an existing reply to the parent, and what its score is:
if __name__ == '__main__':
    create_table()
    row_counter = 0
    paired_rows = 0

    with open('J:/your/path/{}/RC_{}'.format(timeframe.split('-')[0],timeframe), buffering=1000) as f:
        for row in f:
            row_counter += 1
            row = json.loads(row)
            parent_id = row['parent_id']
            body = format_data(row['body'])
            created_utc = row['created_utc']
            score = row['score']
            comment_id = row['name']
            subreddit = row['subreddit']
            parent_data = find_parent(parent_id)

            # maybe check for a child, if child, is our new score superior? If so, replace. If not...
            if score >= 2:
                existing_comment_score = find_existing_score(parent_id)
Now, we need to create the find_existing_score function:
def find_existing_score(pid):
    try:
        sql = "SELECT score FROM parent_reply WHERE parent_id = '{}' LIMIT 1".format(pid)
        c.execute(sql)
        result = c.fetchone()
        if result is not None:
            return result[0]
        else:
            return False
    except Exception as e:
        # print(str(e))
        return False
If there is an existing reply to this parent, and our score is higher than that reply's score, we'd like to replace it. Since we only ever insert comments whose score is two or higher, any score that comes back will be truthy, so we can use the return value directly in an if check:
if score >= 2:
    existing_comment_score = find_existing_score(parent_id)
    if existing_comment_score:
        if score > existing_comment_score:
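We'll write the actual database operations later; just to preview where this is headed, the body of that innermost condition will validate the comment and then overwrite the lower-scored existing reply, roughly like this (acceptable is defined next, and sql_insert_replace_comment is a placeholder name for a helper we haven't written yet):

# sketch only: validate, then replace the lower-scored existing reply
if acceptable(body):
    sql_insert_replace_comment(comment_id, parent_id, parent_data,
                               body, subreddit, created_utc, score)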
Next, many comments have been deleted or removed, and some comments are very long or very short. We want to make sure comments are of an acceptable length for training, and that the comment wasn't deleted or removed:
def acceptable(data):
    if len(data.split(' ')) > 50 or len(data) < 1:
        return False
    elif len(data) > 1000:
        return False
    elif data == '[deleted]':
        return False
    elif data == '[removed]':
        return False
    else:
        return True
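A few quick sanity checks on made-up inputs show the intent:

print(acceptable('[deleted]'))           # False: the comment was deleted
print(acceptable(''))                    # False: empty body
print(acceptable('Sounds good to me.'))  # True: short, real comment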