Since, like most reasonable people, I read a lot of fantasy books,
I figured that applying NLP methods to the content of those books would help
improve my intuitive understanding of the techniques involved.
For this purpose, I needed as many fantasy books as possible in a format
that could be easily processed in Python or Lisp.
I started by scraping a list of fantasy authors from Wikipedia.
I proceeded to search for ebooks by each of these authors on The Pirate Bay.
Luckily, The Pirate Bay has a nice and easy-to-use API.
True ebooks, i.e. books that contained mainly text, were rarely larger than 2-3MB,
but collections of books could reach about 10-20MB. Most torrents larger than that
either contained a lot of images or were sorted into the ebook category by accident and were
actually audio books or something else. Thus, limiting the file size to 20MB made sense.
Since the first result when searching for an author was often a collection of books by
that author, I decided to download only the first result.
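The selection rule can be sketched in a few lines of Python. This is only an illustration of one plausible reading (take the first result that fits under the size cap); the `name` and `size` keys are hypothetical and do not reflect the actual response format of the search API.

```python
from typing import Any, Mapping, Optional, Sequence

MAX_SIZE_BYTES = 20 * 1024 * 1024  # the 20MB cap described above


def pick_first_small_result(
    results: Sequence[Mapping[str, Any]],
    max_size: int = MAX_SIZE_BYTES,
) -> Optional[Mapping[str, Any]]:
    """Return the first search result within the size limit, or None."""
    for result in results:
        # "size" (in bytes) is a hypothetical field name used for illustration.
        if int(result["size"]) <= max_size:
            return result
    return None
```

Because search results usually come back ordered by relevance or seeders, keeping the original order and only skipping oversized entries preserves the "first result" heuristic.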
To download the torrents, I enabled remote access in my torrent client transmission-gtk,
installed transmission-cli (apt install transmission-cli) and then simply used
subprocess to trigger adding the torrent file and starting the download.
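A minimal sketch of that subprocess call, assuming the `transmission-remote` binary from the transmission-cli package and a client listening on `localhost:9091` (the default remote-access port; adjust it if yours differs). The function names are made up for this example.

```python
import subprocess


def build_add_command(torrent: str, host: str = "localhost:9091") -> list:
    """Build the transmission-remote command that adds one torrent."""
    # --add accepts a .torrent file path or a URL.
    return ["transmission-remote", host, "--add", torrent]


def add_torrent(torrent: str, host: str = "localhost:9091") -> None:
    """Ask the running Transmission client to add the torrent."""
    # check=True raises CalledProcessError if transmission-remote fails.
    subprocess.run(build_add_command(torrent, host), check=True)
```

Calling `add_torrent("some-file.torrent")` hands the torrent to the running client, which starts downloading it unless it is configured to add torrents paused.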
The next morning, I found that most of the downloads were complete (282 in total).
Some didn’t complete, probably because the seeders disappeared.
Upon inspecting the downloaded data, I found that the books came in various formats
(some more exotic than others) and often more than one format was provided for a given book.
Quick overview of all file types.
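One way to get such an overview is to count file extensions across the download directory. The sketch below uses only the standard library; the function name and the `(none)` label for files without an extension are choices made for this example.

```python
from collections import Counter
from pathlib import Path


def count_extensions(root) -> Counter:
    """Count files under root by lowercased extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "(none)".
            counts[path.suffix.lower() or "(none)"] += 1
    return counts
```

`count_extensions("downloads").most_common()` then lists the formats from most to least frequent, where `downloads` stands in for wherever the torrents were saved.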
To simplify processing, I deleted all files that had no file extension and then
proceeded to rename all files with uppercase file extensions to their lowercase equivalent.
To accomplish this, I used the rename tool (apt install rename).
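For anyone who prefers to stay in Python, the same cleanup can be approximated with pathlib. This is an alternative sketch, not the rename invocation used here, and it deletes files, so try it on a copy first.

```python
from pathlib import Path


def normalize_extensions(root):
    """Delete extensionless files and lowercase all other extensions.

    Returns a (removed, renamed) tuple of counts.
    """
    removed, renamed = 0, 0
    # Materialize the listing first, since we modify the tree while iterating.
    for path in list(Path(root).rglob("*")):
        if not path.is_file():
            continue
        if path.suffix == "":
            path.unlink()
            removed += 1
        elif path.suffix != path.suffix.lower():
            target = path.with_suffix(path.suffix.lower())
            # Skip if a different file already occupies the lowercase name.
            if target.exists() and not target.samefile(path):
                continue
            path.rename(target)
            renamed += 1
    return removed, renamed
```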
After that, since calibre (the tool I was gonna use for conversion) couldn’t handle old-style
doc files, I converted doc files to html with LibreOffice’s CLI.
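The LibreOffice step can be scripted along these lines. The sketch assumes the binary is available as `soffice` (on some systems it is called `libreoffice`) and writes each html file next to its source; the helper names are invented for this example.

```python
import subprocess
from pathlib import Path


def build_doc_to_html_command(doc_path, out_dir) -> list:
    """Build a headless LibreOffice command converting one doc file to html."""
    return [
        "soffice", "--headless",
        "--convert-to", "html",
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_docs(root) -> None:
    """Convert every .doc file under root, writing output beside the source."""
    for doc in Path(root).rglob("*.doc"):
        subprocess.run(build_doc_to_html_command(doc, doc.parent), check=True)
```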
Then, I converted all books that were not already in the epub format to the epub format
using calibre’s ebook-convert command (apt install calibre).
At first I encountered an error from calibre: PyCapsule_GetPointer called with incorrect name.
Replacing the calibre version that came with Ubuntu (apt purge calibre) with the newest version from
https://calibre-ebook.com/download_linux solved that problem.
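The batch conversion itself boils down to calling `ebook-convert <input> <output>` once per file. A rough sketch follows; the set of source extensions is an assumption and should be matched to whatever formats actually show up in the downloads.

```python
import subprocess
from pathlib import Path

# Assumed list of input formats; adjust to the formats actually present.
SOURCE_EXTENSIONS = {".mobi", ".azw3", ".pdf", ".html", ".txt", ".rtf", ".lit", ".fb2"}


def build_convert_command(src) -> list:
    """Build the ebook-convert command turning src into an epub beside it."""
    target = Path(src).with_suffix(".epub")
    return ["ebook-convert", str(src), str(target)]


def convert_all(root) -> None:
    """Convert every supported non-epub file under root to epub."""
    for path in list(Path(root).rglob("*")):
        if path.is_file() and path.suffix in SOURCE_EXTENSIONS:
            # No check=True: one broken book should not abort the whole batch.
            subprocess.run(build_convert_command(path))
```

ebook-convert picks the output format from the target file extension, so no extra format flag is needed.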
A few hours later, I had a whopping 3419 epub files ready to be processed.
I deleted the downloaded files to save disk space.
I used Python’s ebooklib package to read the epub files and dumped the metadata and content
into JSON files.
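A sketch of what such a dump can look like with ebooklib. The JSON layout (`title`, `creator`, `chapters`) and the crude tag stripping are choices made for this example, not necessarily the exact structure used here.

```python
import json
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collect the non-empty text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Note: this also keeps text inside <title> or <style> elements.
        if data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Strip tags from an HTML string, one text node per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def epub_to_json(epub_path, json_path) -> None:
    """Dump metadata and chapter text of one epub into a JSON file."""
    # Imported here so the helpers above work without ebooklib installed.
    import ebooklib
    from ebooklib import epub

    book = epub.read_epub(str(epub_path))
    record = {
        # get_metadata returns a list of (value, attributes) tuples.
        "title": [value for value, _ in book.get_metadata("DC", "title")],
        "creator": [value for value, _ in book.get_metadata("DC", "creator")],
        "chapters": [
            html_to_text(item.get_content().decode("utf-8", errors="ignore"))
            for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT)
        ],
    }
    Path(json_path).write_text(
        json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8"
    )
```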
After that I had a bunch of json files that I could easily use to explore NLP techniques.
If you replicate this process and intend to read one of the downloaded books, I recommend
that you buy the book or find another way to support the author.