A friend asked on Facebook the other day for recommendations for good, short books to read in 2017.

I thought about books I could suggest, but quickly realized that the books I read aren’t short. After reflecting a little, I also realized that this creates problems for me: either I lose momentum and don’t finish, or I finish after a seemingly Sisyphean effort that doesn’t feel satisfying.

So I decided to search for the best, shortest books I could read in 2017. I started by pulling data from Goodreads.

Goodreads

Goodreads is a service that lets users rate books and network with other readers. Each book is rated on a 5-point scale. Books can be compiled into reading lists and shared with other users.

There are a lot of books on Goodreads, so to pull from only the best, I limited myself to books that appear on at least one of 30 of the most popular “Best Of” lists on the site [1]. That gave me about 34,000 books to choose from.

Ratings and Rankings

Next, I had to rank the books. I wanted to avoid one of my biggest annoyances with rating data: ignoring the trade-off between rating and sample size. If I’m looking for pizza on Yelp, why would a restaurant with a single 5-star review rank higher than a restaurant with a 4.9 rating based on 1,000 reviews?

My method for overcoming this was to assume a Dirichlet prior over a Multinomial likelihood [2]. The posterior rating then gives an estimate that penalizes obscure books until they accumulate enough evidence of their acclaim.
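
To make the update concrete, here is a sketch of the model in symbols. The post does not spell out how the posterior is collapsed into a single number, so taking the posterior mean is an assumption; the symbols (prior pseudo-counts \alpha_k and observed counts n_k of k-star ratings) are introduced here only for illustration.

```latex
\begin{aligned}
(p_1,\dots,p_5) &\sim \mathrm{Dirichlet}(\alpha_1,\dots,\alpha_5) \\
(n_1,\dots,n_5) \mid p &\sim \mathrm{Multinomial}(N;\, p_1,\dots,p_5), \qquad N = \sum_{k=1}^{5} n_k \\
(p_1,\dots,p_5) \mid n &\sim \mathrm{Dirichlet}(\alpha_1 + n_1,\dots,\alpha_5 + n_5) \\
\mathbb{E}[\text{rating} \mid n] &= \sum_{k=1}^{5} k \, \frac{\alpha_k + n_k}{\sum_{j=1}^{5} (\alpha_j + n_j)}
\end{aligned}
```

With few observed ratings the \alpha_k terms dominate and the estimate stays near the prior; with many ratings the n_k terms dominate and the estimate approaches the book's raw average.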

The prior I used was based on the aggregate rating of a typical book: for each star level, one-half of the median number of ratings a book received at that level, which resulted in Dirichlet(13.5, 42.5, 180.5, 293, 264) [3]. You can think of it as assuming that every book starts out with 13.5 one-star reviews, 42.5 two-star reviews, 180.5 three-star reviews, 293 four-star reviews, and 264 five-star reviews. Each rating observed in the data then builds on these prior counts, so each book has to prove its rating relative to all the other books.
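
As a rough illustration of how this prior behaves, the sketch below adds observed star counts to the prior pseudo-counts and takes the mean of the result. The function name and the example counts are made up for illustration, and using the posterior mean as the "posterior rating" is an assumption rather than something the post confirms.

```python
# Prior pseudo-counts for 1- to 5-star ratings, as quoted in the post.
PRIOR = (13.5, 42.5, 180.5, 293.0, 264.0)


def posterior_mean_rating(counts, prior=PRIOR):
    """Expected star rating under a Dirichlet prior and multinomial counts.

    counts[k] is the observed number of (k + 1)-star ratings.
    """
    if len(counts) != len(prior):
        raise ValueError("counts and prior need one entry per star level")
    # Posterior Dirichlet parameters: prior pseudo-count plus observed count.
    totals = [a + n for a, n in zip(prior, counts)]
    weighted = sum(stars * t for stars, t in enumerate(totals, start=1))
    return weighted / sum(totals)


# Hypothetical books (not taken from the Goodreads data):
no_ratings = posterior_mean_rating([0, 0, 0, 0, 0])       # prior only
one_five_star = posterior_mean_rating([0, 0, 0, 0, 1])    # raw mean 5.0
many_ratings = posterior_mean_rating([0, 0, 0, 10_000, 90_000])  # raw mean 4.9

print(f"{no_ratings:.3f} {one_five_star:.3f} {many_ratings:.3f}")
```

A book with a single 5-star rating barely moves off the prior, while the heavily rated book lands close to its raw 4.9 average, which is the ordering the pizza example asks for.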

The Final List

So, once I had 34,000 books, their lengths, and their ratings, I could finally generate my reading list.

I had only one constraint on the books in my list: they had to be under 200 pages. So I generated my reading list, and was surprised to see that more than half of it consisted of Calvin and Hobbes collections. (For whatever reason, Goodreads users really love Calvin and Hobbes: the reviews were both numerous, from hundreds of thousands of users, and very high.)

So after excluding “Sequential Art” as a genre (and subsequently a few more genres I definitely wasn’t interested in [4]), I got the list below.

| Title | Author | Year | Pages | Rating |
|---|---|---|---|---|
| The Last Question | Isaac Asimov | 1956 | 9 | 4.54 |
| We Should All Be Feminists | Chimamanda Ngozi Adichie | 2012 | 52 | 4.46 |
| The Compleat Works of Wllm Shkspr | Reduced Shakespeare Company | 1994 | 137 | 4.42 |
| Sister Outsider | Audre Lorde | 1984 | 190 | 4.41 |
| Between the World and Me | Ta-Nehisi Coates | 2015 | 152 | 4.39 |
| The Fire Next Time | James Baldwin | 1963 | 141 | 4.38 |
| The Essential Neruda | Pablo Neruda | 1979 | 200 | 4.38 |
| Illuminations | Arthur Rimbaud | 1875 | 182 | 4.35 |
| Four Quartets | T. S. Eliot | 1943 | 48 | 4.34 |
| The Pillowman | Martin McDonagh | 2003 | 104 | 4.34 |
| Letters to a Young Poet | Rainer Maria Rilke | 1929 | 80 | 4.33 |
| Man’s Search for Meaning | Viktor Frankl | 1946 | 184 | 4.32 |
| 100 Selected Poems | E. E. Cummings | 1954 | 128 | 4.31 |
| Tao Te Ching | Lao Tzu | 1989 | 184 | 4.31 |
| A Season in Hell/The Drunken Boat | Arthur Rimbaud | 1837 | 104 | 4.30 |
| The Love Song of J. Alfred Prufrock and Other Poems | T. S. Eliot | 1915 | 44 | 4.30 |
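
The filtering and ranking steps described above can be sketched roughly as follows. The `Book` class, the `reading_list` function, and the genre labels on the sample rows are all hypothetical. The page cutoff is written as `<=` because the table includes a 200-page title; that is an inference from the table, not something the post states.

```python
from dataclasses import dataclass

# Genres excluded in the post (footnote 4), spelled as listed there.
EXCLUDED_GENRES = frozenset({
    "Childrens", "Young Adult", "Religion", "Romance", "Art", "Music",
    "Reference", "Law", "Fantasy", "Sequential Art", "Audiobook",
    "Horror", "Vampires", "Dystopia",
})


@dataclass(frozen=True)
class Book:
    title: str
    pages: int
    rating: float      # adjusted (posterior) rating
    genres: frozenset  # genre labels; assumed to be available per book


def reading_list(books, max_pages=200, excluded=EXCLUDED_GENRES):
    """Keep short books outside the excluded genres, best-rated first."""
    keep = [
        b for b in books
        if b.pages <= max_pages and not (b.genres & excluded)
    ]
    return sorted(keep, key=lambda b: b.rating, reverse=True)


# Two rows from the table plus two made-up rows; all genre labels are assumed.
sample = [
    Book("The Essential Neruda", 200, 4.38, frozenset({"Poetry"})),
    Book("The Last Question", 9, 4.54, frozenset({"Science Fiction"})),
    Book("Hypothetical comic collection", 128, 4.70, frozenset({"Sequential Art"})),
    Book("Hypothetical long novel", 900, 4.60, frozenset({"Fiction"})),
]

for book in reading_list(sample):
    print(f"{book.rating:.2f}  {book.pages:>3}  {book.title}")
```

The made-up comic collection is dropped by genre and the made-up long novel by length, even though both carry higher ratings than the two books that remain.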

Conclusions

I was surprised by the list this approach produced; these are books I wouldn’t have read otherwise. I’m not sure if I’ll read them in order of ranking, but I’m excited to see what comes out of my reading this year.

Future Work

I’d like to make this data available interactively in a subsequent post so people can generate their own booklist with their own constraints. That’ll take more time than I have available tonight, but I hope to post it soon.

Footnotes

  1. The lists that I pulled from were Best Books Under 200 Pages, Great Short Short Books, Best Books of the 2010’s, Best Books of the 2000’s, Best Books of the 1990’s, Best Books of the 1980’s, Best Books of the 1970’s, Best Books of the 1960’s, Best Books of the 1950’s, Best Books of the 1940’s, Best Books of the 1930’s, Best Books of the 1920’s, Best Books of the 1910’s, Best Books of the 1900’s, Best Books of the 1890’s, Best Books of the 1880’s, Best Books of the 1870’s, Best Books of the 1860’s, Best Books of the 1850’s, Best Books of the 1840’s, Best Books of the 1830’s, Best Books of the 1820’s, Best Books of the 1810’s, Best Books of the 1800’s, Books That Everyone Should Read At Least Once, The BOOK was BETTER than the MOVIE, Books You Wish More People Knew About, World’s Greatest Novellas, Best 21st Century Non-Fiction, and Best Books Ever.

  2. This is a common approach in Bayesian statistics: we let prior information inform the estimate, then let the data update those prior assumptions, with stronger signals in the data justifying stronger departures from the prior.

  3. That’s actually a pretty high prior, right? I was surprised to see books being so highly rated overall. I mean, should we really assume that a typical book deserves 4 stars?

  4. The full list of exclusions was: Childrens, Young Adult, Religion, Romance, Art, Music, Reference, Law, Fantasy, Sequential Art, Audiobook, Horror, Vampires, and Dystopia.