Data - Reddit
Reddit is structured as a tree, unlike a forum where everything is linear. Top-level comments are linear, but the replies to them branch out. This is an example of how a Reddit thread is structured:
Deep learning needs input-output data, so what we really want is comment and reply pairs. In the above example, we could use the following as comment-reply pairs:
-Top level reply 1 and --Reply to top level reply 1
--Reply to top level reply 1 and --Reply to reply...
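To make the pairing concrete, here is a minimal sketch that walks a small comment tree and emits one pair per comment/reply edge. The `Comment` class, the `comment_reply_pairs` function, and the sample thread are hypothetical and only illustrate the idea. They are not the format of the Reddit dump.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Comment:
    """Hypothetical in-memory comment node (not the Reddit dump format)."""
    text: str
    replies: List["Comment"] = field(default_factory=list)


def comment_reply_pairs(comment):
    """Yield (comment text, reply text) for every edge below `comment`."""
    for reply in comment.replies:
        yield (comment.text, reply.text)
        # Recurse so that replies to replies are paired as well.
        yield from comment_reply_pairs(reply)


# A tiny made-up thread: one top-level comment, one reply, one nested reply.
thread = Comment("Top level reply 1", [
    Comment("Reply to top level reply 1", [Comment("Reply to reply")]),
])

pairs = list(comment_reply_pairs(thread))
for parent_text, reply_text in pairs:
    print(parent_text, "->", reply_text)
```

Each pair's first element acts as the input and the second as the output, so a reply in the middle of the tree shows up once as an output and once as an input.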
So, what we need to do is take this Reddit dump and produce these pairs. The next thing to consider is that we should probably keep only one reply per comment. Although a single comment may have many replies, we should really just go with one. We can go with either the first one or the top-voted one.
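The two options for keeping a single reply can be sketched as below. This is only an illustration: the `pick_reply` function and the `text` and `score` keys are assumptions made for this example, and the replies are assumed to be listed in the order they were posted.

```python
def pick_reply(replies, strategy="top_voted"):
    """Return one reply dict from `replies`, or None if there are none.

    `replies` is assumed to be a list of dicts with hypothetical
    "text" and "score" keys, in posting order.
    """
    if not replies:
        return None
    if strategy == "first":
        return replies[0]
    if strategy == "top_voted":
        # max() keeps the earliest item on ties, so the older reply wins.
        return max(replies, key=lambda reply: reply["score"])
    raise ValueError(f"unknown strategy: {strategy!r}")


replies = [
    {"text": "first reply", "score": 2},
    {"text": "popular reply", "score": 5},
    {"text": "late popular reply", "score": 5},
]

print(pick_reply(replies, "first")["text"])
print(pick_reply(replies, "top_voted")["text"])
```

Taking the first reply is simpler and needs no score data, while taking the top-voted reply tends to favor answers that other readers found worthwhile.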
Now we need to obtain the data so we can use it in our chatbot. On the next page you can download Reddit data from 2005 until now.
For this tutorial I downloaded only 2015-RC1, but you can download more than one file, which should make the chatbot more robust. You will also need plenty of disk space, because each package is approximately 5 GB.
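Because each file is so large, it helps to read it one line at a time instead of loading it all into memory. The sketch below assumes the extracted dump is a text file with one JSON object per line. That is an assumption about the file format, not something stated above, and `iter_comments` and `count_comments` are hypothetical helper names.

```python
import json


def iter_comments(lines):
    """Yield one dict per non-empty line, assuming one JSON object per line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_comments(path):
    """Count the comments in a dump file without loading it all at once."""
    # Iterating over the open file reads it line by line.
    with open(path, encoding="utf-8") as dump_file:
        return sum(1 for _ in iter_comments(dump_file))


# Small in-memory stand-in for a dump file; the keys are made up.
sample_lines = [
    '{"text": "Top level reply 1", "score": 3}\n',
    "\n",
    '{"text": "Reply to top level reply 1", "score": 7}\n',
]
comments = list(iter_comments(sample_lines))
print(len(comments), "comments parsed")
```

The same generator works on an open file handle, so memory use stays roughly constant no matter how many files you download.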