2017 Reading List

I had a friend ask the other day on Facebook for some recommendations for good, short books to read in 2017.

I thought about books I could suggest, but quickly realized that the books I read aren’t short. After a little reflection, I also realized that this creates problems for me: either I lose momentum and don’t finish, or I finish after a seemingly Sisyphean effort that doesn’t feel satisfying.

So I decided to search for the best, shortest books I could read in 2017. I started by pulling data from Goodreads.

Goodreads

Goodreads is a service that lets users rate books and network with other readers. Each book is rated on a 5-point scale. Books can be compiled into reading lists and shared with other users.

There are a lot of books on Goodreads, so in an effort to pull from only the best ones, I kept only books that appear on any of 30 of the most popular “Best Of” lists on the site¹. That gave me about 34,000 books to choose from.

Ratings and Rankings

Next, I had to rank the books. I wanted to avoid one of my biggest annoyances with rating data: ignoring the trade-off between rating and sample size. If I’m looking for pizza on Yelp, why would a restaurant with a single 5-star review rank higher than a restaurant with a 4.9 rating based on 1,000 reviews?

My method for overcoming this was to assume a Dirichlet prior over a Multinomial likelihood². The posterior rating then gives an estimate that penalizes obscure books until they accumulate enough ratings to demonstrate their acclaim.
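For reference, the conjugate update behind this can be written out. The notation is my own shorthand rather than anything from the original analysis: α_k is the prior pseudo-count of k-star ratings, n_k is the observed count, and N is the total number of observed ratings. I am also assuming that the "posterior rating" means the posterior mean star value, which is one reasonable reading.

```latex
\begin{aligned}
p &\sim \mathrm{Dirichlet}(\alpha_1, \dots, \alpha_5), \qquad
(n_1, \dots, n_5) \mid p \sim \mathrm{Multinomial}(N, p), \\
p \mid n &\sim \mathrm{Dirichlet}(\alpha_1 + n_1, \dots, \alpha_5 + n_5), \\
\mathbb{E}[\text{rating} \mid n] &= \sum_{k=1}^{5} k \, \frac{\alpha_k + n_k}{\sum_{j=1}^{5} (\alpha_j + n_j)} .
\end{aligned}
```

When a book has few ratings, the α terms dominate and the score stays near the prior mean. As the counts grow, the score approaches the book's raw average.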

The prior I used was based on the aggregate rating of a typical book — one-half of the median of each possible rating, which resulted in Dirichlet(13.5, 42.5, 180.5, 293, 264)³. You can think of it as saying that we’ll assume a book will get 13.5 one-star reviews, 42.5 two-star reviews, 180.5 three-star reviews, 293 four-star reviews, and 264 five-star reviews. Then, each rating observed in the data builds upon the prior assumptions, so each book has to prove its rating, relative to all the other books.
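Here is a minimal sketch of how that prior could turn rating counts into a score. It assumes the score is the posterior mean star value. The function name and the example counts are made up for illustration and are not real Goodreads data.

```python
# Prior pseudo-counts for 1- to 5-star ratings, taken from the text above.
PRIOR = (13.5, 42.5, 180.5, 293.0, 264.0)


def posterior_mean_rating(counts, prior=PRIOR):
    """Posterior mean star rating under a Dirichlet prior and Multinomial likelihood.

    counts: observed numbers of 1-, 2-, 3-, 4- and 5-star ratings.
    """
    if len(counts) != len(prior):
        raise ValueError("counts and prior must have the same length")
    # Conjugate update: add the observed counts to the prior pseudo-counts.
    posterior = [a + n for a, n in zip(prior, counts)]
    total = sum(posterior)
    # Expected star value under the posterior mean probabilities.
    return sum(star * c / total for star, c in enumerate(posterior, start=1))


# Hypothetical books, mirroring the pizza example.
no_ratings = posterior_mean_rating([0, 0, 0, 0, 0])
one_five_star = posterior_mean_rating([0, 0, 0, 0, 1])        # raw average 5.0
many_ratings = posterior_mean_rating([0, 0, 0, 100, 900])     # raw average 4.9
```

With these made-up counts, the book with a single five-star rating barely moves from the prior mean of about 3.95, while the book with 1,000 ratings lands near 4.48 and so ranks higher.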

The Final List

So, once I had 34,000 books, their lengths, and their ratings, I could finally generate my reading list.

I had only one constraint on the books in my list: they had to be less than 200 pages. So I generated my reading list, and was surprised to see that more than half of it consisted of Calvin and Hobbes collections. (For whatever reason, Goodreads users really love Calvin and Hobbes: the ratings were both numerous, from hundreds of thousands of users, and very high.)

So after excluding “Sequential Art” as a genre (and, subsequently, a few more genres I definitely wasn’t interested in⁴), I got the list below.
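The filtering and ranking steps can be sketched roughly as follows. The `Book` record, its field names, and the sample books are hypothetical, and the sketch assumes each book already carries its smoothed rating and a set of genre labels. The excluded genres are the ones listed in the footnote on exclusions.

```python
from dataclasses import dataclass

# Genres excluded in the post (see its footnote on exclusions).
EXCLUDED_GENRES = frozenset({
    "Childrens", "Young Adult", "Religion", "Romance", "Art", "Music",
    "Reference", "Law", "Fantasy", "Sequential Art", "Audiobook",
    "Horror", "Vampires", "Dystopia",
})


@dataclass(frozen=True)
class Book:
    # Hypothetical record layout, not the Goodreads schema.
    title: str
    pages: int
    genres: frozenset
    score: float  # smoothed rating, assumed to be computed beforehand


def reading_list(books, max_pages=200, excluded=EXCLUDED_GENRES):
    """Keep short books outside the excluded genres, best score first."""
    keep = [
        b for b in books
        if b.pages < max_pages and not (b.genres & excluded)
    ]
    return sorted(keep, key=lambda b: b.score, reverse=True)


# Made-up books for illustration only.
sample = [
    Book("Short essay", 52, frozenset({"Nonfiction"}), 4.46),
    Book("Comic collection", 128, frozenset({"Sequential Art", "Humor"}), 4.70),
    Book("Long novel", 640, frozenset({"Fiction"}), 4.50),
    Book("Short poems", 48, frozenset({"Poetry"}), 4.34),
]
titles = [b.title for b in reading_list(sample)]
```

In this toy run the comic collection is dropped for its genre and the long novel for its length, leaving the short essay ahead of the short poems.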

Title	Author	Year	Pages	Rating
The Last Question	Isaac Asimov	1956	9	4.54
We Should All Be Feminists	Chimamanda Ngozi Adichie	2012	52	4.46
The Compleat Works of Wllm Shkspr	Reduced Shakespeare Company	1994	137	4.42
Sister Outsider	Audre Lorde	1984	190	4.41
Between the World and Me	Ta-Nehisi Coates	2015	152	4.39
The Fire Next Time	James Baldwin	1963	141	4.38
The Essential Neruda	Pablo Neruda	1979	200	4.38
Illuminations	Arthur Rimbaud	1875	182	4.35
Four Quartets	T. S. Eliot	1943	48	4.34
The Pillowman	Martin McDonagh	2003	104	4.34
Letters to a Young Poet	Rainer Maria Rilke	1929	80	4.33
Man’s Search for Meaning	Viktor Frankl	1946	184	4.32
100 Selected Poems	E. E. Cummings	1954	128	4.31
Tao Te Ching	Lao Tzu	1989	184	4.31
A Season in Hell/The Drunken Boat	Arthur Rimbaud	1837	104	4.30
The Love Song of J. Alfred Prufrock and Other Poems	T. S. Eliot	1915	44	4.30

Conclusions

I was surprised by the list that this approach came up with. It’s definitely a list of books that I wouldn’t have read otherwise. I’m not sure if I’ll read them in order of ranking, but I’m excited to see what comes out of my reading this year.

Future Work

I’d like to make this data available interactively in a subsequent post so people can generate their own booklist with their own constraints. That’ll take more time than I have available tonight, but I hope to post it soon.

Footnotes

¹ The lists that I pulled from were Best Books Under 200 Pages, Great Short Short Books, Best Books of the 2010’s, Best Books of the 2000’s, Best Books of the 1990’s, Best Books of the 1980’s, Best Books of the 1970’s, Best Books of the 1960’s, Best Books of the 1950’s, Best Books of the 1940’s, Best Books of the 1930’s, Best Books of the 1920’s, Best Books of the 1910’s, Best Books of the 1900’s, Best Books of the 1890’s, Best Books of the 1880’s, Best Books of the 1870’s, Best Books of the 1860’s, Best Books of the 1850’s, Best Books of the 1840’s, Best Books of the 1830’s, Best Books of the 1820’s, Best Books of the 1810’s, Best Books of the 1800’s, Books That Everyone Should Read At Least Once, The BOOK was BETTER than the MOVIE, Books You Wish More People Knew About, World’s Greatest Novellas, Best 21st Century Non-Fiction, and Best Books Ever.
² This is a common approach in Bayesian statistics. We start from some prior information, then let the data update those prior assumptions. The stronger the signal in the data, the further our conclusions can depart from the prior.
³ That’s actually a pretty high prior, right? I was surprised to see books being so highly rated overall. I mean, should we really assume that a typical book is a 4-star book?
⁴ The full list of exclusions was: Childrens, Young Adult, Religion, Romance, Art, Music, Reference, Law, Fantasy, Sequential Art, Audiobook, Horror, Vampires, and Dystopia.