# Buffering Data

Now, let's begin to buffer through the data. We'll also start a couple of counters for tracking progress over time:

```
if __name__ == '__main__':
    create_table()
    row_counter = 0
    paired_rows = 0


    with open('your/directory/path/{}/RC_{}'.format(timeframe.split('-')[0],timeframe), buffering=1000) as f:
        for row in f:
```

The `row_counter` will be printed from time to time to let us know how far we are through the file that we're iterating over, and `paired_rows` will tell us how many rows of paired data we have.

Next, because the file is too large for us to hold in memory, we pass the `buffering` parameter to `open()` and iterate over the file one line at a time rather than reading it all at once. Now, we need to read each row, which is in JSON format:

```
if __name__ == '__main__':
    create_table()
    row_counter = 0
    paired_rows = 0

    with open('J:/your/path/{}/RC_{}'.format(timeframe.split('-')[0],timeframe), buffering=1000) as f:
        for row in f:
            row_counter += 1
            row = json.loads(row)
            parent_id = row['parent_id']
            body = format_data(row['body'])
            created_utc = row['created_utc']
            score = row['score']
            comment_id = row['name']
            subreddit = row['subreddit']
```
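To try the line-by-line pattern without the full dump, here is a minimal sketch that writes two made-up comment records to a temporary file and streams them back. The `iter_comments` helper, the sample records, and the temporary file are illustrative assumptions, not part of the tutorial's script; only the field names come from the code above.

```python
import json
import os
import tempfile

def iter_comments(path, buffer_size=1000):
    """Yield one parsed comment per line without loading the whole file."""
    with open(path, buffering=buffer_size) as f:
        for line in f:
            yield json.loads(line)

# Two made-up records, one JSON object per line (hypothetical sample data).
sample = [
    {"name": "t1_a", "parent_id": "t3_x", "body": "hello", "score": 3},
    {"name": "t1_b", "parent_id": "t1_a", "body": "hi there", "score": 5},
]

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    for record in sample:
        tmp.write(json.dumps(record) + "\n")
    sample_path = tmp.name

row_counter = 0
comment_ids = []
for row in iter_comments(sample_path):
    row_counter += 1
    comment_ids.append(row["name"])

os.remove(sample_path)  # clean up the temporary file
print(row_counter, comment_ids)
```

Because the generator yields one row at a time, memory use stays roughly constant regardless of how large the input file is.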

Note the `format_data` function call. Let's create that:

```
def format_data(data):
    data = data.replace('\n',' newlinechar ').replace('\r',' newlinechar ').replace('"',"'")
    return data
```
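As a quick check of what this normalization does, the sketch below repeats the function and runs it on a made-up comment body. The sample strings are assumptions for illustration.

```python
def format_data(data):
    # Same replacements as the function above.
    data = data.replace('\n', ' newlinechar ').replace('\r', ' newlinechar ').replace('"', "'")
    return data

# Hypothetical comment body with double quotes and a line break.
cleaned = format_data('He said "hi"\nand left')
print(cleaned)

# A Windows-style line ending contains both \r and \n, so it yields two tokens.
windows_cleaned = format_data('a\r\nb')
print(windows_cleaned.count('newlinechar'))
```

If double tokens for `\r\n` turn out to matter for training, one option would be to replace `'\r\n'` first, but that is a design choice the tutorial does not make.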

We'll throw this in to normalize the comments and to convert the newline character to a word. We can read the data into a Python object by using `json.loads()`, which just takes a string formatted like a JSON object.

All comments will initially not have a parent, either because it's a top-level comment (and the parent is the Reddit post itself), or because the parent isn't in our document. As we go through the document, however, we will find comments that do have parents that we've got in our database. When this happens, we want to instead add this comment to the existing parent. Once we've gone through a file, or a list of files, we'll take the database, output our pairs as training data, and train our model. So, before we insert our data into the database, we should see if we can find the parent first:

```
  parent_data = find_parent(parent_id)
```

Now, we need to create the `find_parent` function:

```
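def find_parent(pid):
    # Look up the body of the stored comment whose ID matches this row's parent_id.
    try:
        sql = "SELECT comment FROM parent_reply WHERE comment_id = '{}' LIMIT 1".format(pid)
        c.execute(sql)
        result = c.fetchone()
        if result is not None:
            return result[0]
        else:
            return False
    except Exception as e:
        # Treat any database error as "parent not found".
        #print(str(e))
        return False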
```
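The lookup can be tried against a throwaway in-memory database. This is a sketch under assumptions: the reduced `parent_reply` schema below is guessed from the column names used in the queries on this page (the real table is built by `create_table()`), the sample row is made up, and the query uses a `?` placeholder instead of string formatting so that `sqlite3` handles the quoting.

```python
import sqlite3

connection = sqlite3.connect(':memory:')
c = connection.cursor()

# Assumed, reduced schema: only the columns referenced on this page.
c.execute("CREATE TABLE parent_reply "
          "(parent_id TEXT, comment_id TEXT, comment TEXT, score INT)")
# Hypothetical stored comment.
c.execute("INSERT INTO parent_reply VALUES (?, ?, ?, ?)",
          ('t3_x', 't1_a', 'hello', 3))

def find_parent(pid):
    """Return the stored comment body for pid, or False if none exists."""
    c.execute("SELECT comment FROM parent_reply WHERE comment_id = ? LIMIT 1",
              (pid,))
    result = c.fetchone()
    return result[0] if result is not None else False

found = find_parent('t1_a')
missing = find_parent('t1_missing')
print(found, missing)
```

Returning `False` for a miss mirrors the function above, so the calling code can test the result with a plain `if`.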

We need to impose one restriction on *all* comments, whether or not the parent already has other replies: we only want to deal with comments that aren't pointless.

Now let's require the score to be two or higher, and then let's also see if there's already an existing reply to the parent, and what its score is:

```
if __name__ == '__main__':
    create_table()
    row_counter = 0
    paired_rows = 0

    with open('J:/your/path/{}/RC_{}'.format(timeframe.split('-')[0],timeframe), buffering=1000) as f:
        for row in f:
            row_counter += 1
            row = json.loads(row)
            parent_id = row['parent_id']
            body = format_data(row['body'])
            created_utc = row['created_utc']
            score = row['score']
            comment_id = row['name']
            subreddit = row['subreddit']
            parent_data = find_parent(parent_id)
            # maybe check for a child, if child, is our new score superior? If so, replace. If not...

            if score >= 2:
                existing_comment_score = find_existing_score(parent_id)
```

Now, we need to create the `find_existing_score` function:

```
def find_existing_score(pid):
    # Look up the score of a reply already stored for this parent, if any.
    try:
        sql = "SELECT score FROM parent_reply WHERE parent_id = '{}' LIMIT 1".format(pid)
        c.execute(sql)
        result = c.fetchone()
        if result is not None:
            return result[0]
        else:
            return False
    except Exception as e:
        # Treat any database error as "no existing reply".
        #print(str(e))
        return False
```

If there is an existing comment, and if our score is higher than the existing comment's score, we'd like to replace it:

```
            if score >= 2:
                existing_comment_score = find_existing_score(parent_id)
                if existing_comment_score:
                    if score > existing_comment_score:
```
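The nested checks above can be read as a single yes/no decision. The sketch below expresses that decision as a standalone function so it can be tested without a database. The name `should_replace` and the `min_score` parameter are hypothetical; the threshold of 2 and the convention that a missing reply is reported as `False` come from the code on this page.

```python
def should_replace(score, existing_comment_score, min_score=2):
    """Return True when a new reply should replace the stored reply."""
    # Ignore low-scoring comments entirely.
    if score < min_score:
        return False
    # find_existing_score() reports False when no reply is stored yet,
    # so there is nothing to replace in that case.
    if not existing_comment_score:
        return False
    # Replace only when the new reply scores strictly higher.
    return score > existing_comment_score

print(should_replace(5, 3))      # higher-scoring reply
print(should_replace(3, 5))      # lower-scoring reply
print(should_replace(5, False))  # no stored reply
```

Keeping the rule in one pure function makes the threshold easy to change and to unit-test; what happens when no reply is stored yet is a separate branch that this section does not cover.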

Next, many comments have been deleted or removed, and some are very long or very short. We want to make sure comments are of an acceptable length for training, and that the comment wasn't removed or deleted:

```
def acceptable(data):
    if len(data.split(' ')) > 50 or len(data) < 1:
        return False
    elif len(data) > 1000:
        return False
    elif data == '[deleted]':
        return False
    elif data == '[removed]':
        return False
    else:
        return True
```
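To see where the boundaries fall, here is a small sketch that repeats the same rules in condensed form and runs them on made-up inputs. The sample strings are assumptions chosen to hit each rule once.

```python
def acceptable(data):
    # Same rules as above, condensed: at most 50 space-separated words,
    # between 1 and 1000 characters, and not a deletion placeholder.
    if len(data.split(' ')) > 50 or len(data) < 1:
        return False
    if len(data) > 1000:
        return False
    return data not in ('[deleted]', '[removed]')

# Hypothetical inputs, one per rule.
samples = {
    'normal': 'This is fine',
    'empty': '',
    'deleted': '[deleted]',
    'fifty_words': ' '.join(['word'] * 50),
    'fifty_one_words': ' '.join(['word'] * 51),
    'one_huge_word': 'x' * 1001,
}

results = {label: acceptable(text) for label, text in samples.items()}
print(results)
```

Note that the word count comes from splitting on single spaces, so a body with consecutive spaces counts the empty strings between them as words.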


