drew.lanen.ga

2017 Reading List

Mon, 13 Feb 2017 00:00:00 +0000

I had a friend ask the other day on Facebook for some recommendations for good, short books to read in 2017.

I thought about books that I could suggest, but very quickly realized that the books I read aren’t short. After reflecting a little, I realized that this creates some problems for me: either I lose momentum and don’t finish, or I finish after a seemingly Sisyphean effort that doesn’t feel satisfying.

So I decided to search for the best, shortest books I could read in 2017. I started by pulling data from Goodreads.

Goodreads

Goodreads is a service that lets users rate books and network with other readers. Each book is rated on a 5-point scale. Books can be compiled into reading lists and shared with other users.

There are a lot of books on Goodreads, so in an effort to pull from only the best, I limited my search to books that appear on 30 of the most popular “Best Of” lists on the site¹. That gave me about 34,000 books to choose from.

Ratings and Rankings

Next, I had to rank the books. I wanted to avoid one of my biggest annoyances when using rating data — ignoring the trade-off between rating and sample size. If I’m looking for pizza on Yelp, why would a restaurant with a single review of 5-stars rank higher than a restaurant with a 4.9 rating based on 1,000 reviews?

My method for overcoming this was to assume a Dirichlet prior over a Multinomial likelihood². The posterior rating then gives us an estimate of the rating that helps penalize obscure books until they demonstrate enough evidence of their acclaim.

The prior I used was based on the aggregate rating of a typical book — one-half of the median of each possible rating, which resulted in Dirichlet(13.5, 42.5, 180.5, 293, 264)³. You can think of it as saying that we’ll assume a book will get 13.5 one-star reviews, 42.5 two-star reviews, 180.5 three-star reviews, 293 four-star reviews, and 264 five-star reviews. Then, each rating observed in the data builds upon the prior assumptions, so each book has to prove its rating, relative to all the other books.
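The update described above can be sketched in a few lines. This is a minimal illustration, assuming the ranking score is the posterior mean star rating; the function name and the example rating counts are made up for illustration, and only the prior pseudo-counts come from the post.

```python
# Prior pseudo-counts for 1..5 stars, as quoted in the post.
PRIOR = [13.5, 42.5, 180.5, 293.0, 264.0]


def posterior_mean_rating(counts, prior=PRIOR):
    """Expected star rating after updating the Dirichlet prior with counts.

    counts[i] is the number of observed (i + 1)-star ratings.
    """
    # Dirichlet-Multinomial conjugacy: the posterior parameters are the
    # prior pseudo-counts plus the observed counts.
    posterior = [p + c for p, c in zip(prior, counts)]
    total = sum(posterior)
    # Weight each star value by its posterior share of all (pseudo-)ratings.
    return sum(stars * n / total for stars, n in enumerate(posterior, start=1))


# Hypothetical books: no ratings, one 5-star rating, and 1,000 ratings
# averaging 4.9 stars.
prior_only = posterior_mean_rating([0, 0, 0, 0, 0])
one_rave = posterior_mean_rating([0, 0, 0, 0, 1])
well_loved = posterior_mean_rating([0, 0, 0, 100, 900])
print(round(prior_only, 2), round(one_rave, 2), round(well_loved, 2))
```

With these made-up counts, the single rave barely moves off the prior mean (about 3.95), while the heavily rated book climbs to about 4.48, which is the behavior the pizza example asks for.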

The Final List

So, once I had 34,000 books, their length and their rating, I could finally generate my reading list.

I only had one constraint on the books in my list: they had to be fewer than 200 pages. So I generated my reading list, and was surprised to see that more than half of the list consisted of Calvin and Hobbes collections. (For whatever reason, Goodreads users really love Calvin and Hobbes: reviews were both numerous (hundreds of thousands of users) and very high.)

So after excluding “Sequential Art” as a genre (and subsequently a few more genres I definitely wasn’t interested in⁴), I got the list below.

Title	Author	Year	Pages	Rating
The Last Question	Isaac Asimov	1956	9	4.54
We Should All Be Feminists	Chimamanda Ngozi Adichie	2012	52	4.46
The Compleat Works of Wllm Shkspr	Reduced Shakespeare Company	1994	137	4.42
Sister Outsider	Audre Lorde	1984	190	4.41
Between the World and Me	Ta-Nehisi Coates	2015	152	4.39
The Fire Next Time	James Baldwin	1963	141	4.38
The Essential Neruda	Pablo Neruda	1979	200	4.38
Illuminations	Arthur Rimbaud	1875	182	4.35
Four Quartets	T. S. Eliot	1943	48	4.34
The Pillowman	Martin McDonagh	2003	104	4.34
Letters to a Young Poet	Rainer Maria Rilke	1929	80	4.33
Man’s Search for Meaning	Viktor Frankl	1946	184	4.32
100 Selected Poems	E. E. Cummings	1954	128	4.31
Tao Te Ching	Lao Tzu	1989	184	4.31
A Season in Hell/The Drunken Boat	Arthur Rimbaud	1837	104	4.3
The Love Song of J. Alfred Prufrock and Other Poems	T. S. Eliot	1915	44	4.3
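The filtering and ranking behind a list like this one could look roughly like the sketch below. The `Book` record, the genre tags, and the two placeholder titles are assumptions for illustration; only the page limit, the excluded genres, and the two real rows come from the post.

```python
from dataclasses import dataclass


@dataclass
class Book:
    title: str
    pages: int
    rating: float       # posterior rating, as in the table
    genres: frozenset   # hypothetical genre tags


# Genres excluded in the post (see the footnotes).
EXCLUDED = frozenset({
    "Childrens", "Young Adult", "Religion", "Romance", "Art", "Music",
    "Reference", "Law", "Fantasy", "Sequential Art", "Audiobook",
    "Horror", "Vampires", "Dystopia",
})


def reading_list(books, max_pages=200, excluded=EXCLUDED):
    """Keep short books outside the excluded genres, best rated first."""
    keep = [b for b in books
            if b.pages < max_pages and not (b.genres & excluded)]
    return sorted(keep, key=lambda b: b.rating, reverse=True)


# Two rows from the table plus two made-up placeholders.
books = [
    Book("Four Quartets", 48, 4.34, frozenset({"Poetry"})),
    Book("A Placeholder Comic Collection", 176, 4.70, frozenset({"Sequential Art"})),
    Book("The Last Question", 9, 4.54, frozenset({"Science Fiction"})),
    Book("A Placeholder Long Novel", 650, 4.60, frozenset({"Fiction"})),
]
titles = [b.title for b in reading_list(books)]
print(titles)
```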

Conclusions

I was surprised by the list that this approach came up with. It’s definitely a list of books that I wouldn’t have read otherwise. I’m not sure if I’ll read them in order of ranking, but I’m excited to see what comes out of my reading this year.

Future Work

I’d like to make this data available interactively in a subsequent post so people can generate their own booklist with their own constraints. That’ll take more time than I have available tonight, but I hope to post it soon.

Footnotes

1. The lists that I pulled from were Best Books Under 200 Pages, Great Short Short Books, Best Books of the 2010’s, Best Books of the 2000’s, Best Books of the 1990’s, Best Books of the 1980’s, Best Books of the 1970’s, Best Books of the 1960’s, Best Books of the 1950’s, Best Books of the 1940’s, Best Books of the 1930’s, Best Books of the 1920’s, Best Books of the 1910’s, Best Books of the 1900’s, Best Books of the 1890’s, Best Books of the 1880’s, Best Books of the 1870’s, Best Books of the 1860’s, Best Books of the 1850’s, Best Books of the 1840’s, Best Books of the 1830’s, Best Books of the 1820’s, Best Books of the 1810’s, Best Books of the 1800’s, Books That Everyone Should Read At Least Once, The BOOK was BETTER than the MOVIE, Books You Wish More People Knew About, World’s Greatest Novellas, Best 21st Century Non-Fiction, and Best Books Ever.
2. This is a common approach in Bayesian statistics. In Bayesian statistics we allow some prior information to inform our approach, then let data update our prior assumptions, where stronger signals in the data help us make stronger departures in our conclusions.
3. That’s actually a pretty high prior, right? I was surprised to see books being so highly rated overall. I mean, should it be the assumption that a book is really 4-stars?
4. The full list of exclusions was: Childrens, Young Adult, Religion, Romance, Art, Music, Reference, Law, Fantasy, Sequential Art, Audiobook, Horror, Vampires, and Dystopia.

Buffer's Organizational Network

Tue, 08 Mar 2016 00:00:00 +0000

Introduction

Some teams perform better than other teams.

A lot of people research why that is, and what high performing teams have in common. Harvard Business Review does. MIT and CMU do. Google does. The New York Times does. Even venture capitalists do.

So what’s the consensus? High performing teams share two common characteristics.

Team members have higher than average levels of empathy and social awareness.
Team members communicate frequently and informally.

And it’s pretty much just those two things. Some researchers went as far as getting people to wear sociometers that measured all of their communication with and proximity to other team members. They ultimately found that “patterns of communication [are] the most important predictor of a team’s success. Not only that, but they are as significant as all the other factors — individual intelligence, personality, skill, and the substance of discussions — combined.”

For most teams, it’s hard to get a handle on one’s own “patterns of communication” and what that looks like, let alone whether it’s good or bad. Today, we’ll use chat data from Slack to look at Buffer’s own patterns of communication to identify what is or isn’t going well.

Building a Network

In order to analyze the patterns efficiently, we’re going to create an organizational network, which will provide us a framework for thinking about interpersonal interactions. An organizational network is a data structure that describes which team members are communicating with which other team members and allows us to identify large scale patterns from small scale interactions.

I’ve looked at a lot of different organizational networks. I used to work with a network science consultancy that specialized solely in creating these networks and answering these questions. For most companies, that process usually involved hiring a consultancy, creating surveys and bullying everyone into completing them, and then running the analysis. If you wanted the same results six months later, you had to repeat the whole process again.

Luckily for us, we can get the data we need directly from Slack, and we’ll start by pulling message metadata from public Slack channels. For this analysis, we’re going to focus solely on data from public channels. By excluding data from video chats (and arguably meetings), we can focus our exploration on informal communication. We’ll save the analysis of direct messages and private groups for another day.

Timeline

Here’s a timeline of all messages sent on Buffer’s public Slack channels. The black line shows the raw daily message counts, while the blue line applies a smoother over the raw counts to show a more discernible trend.

There are a couple of things that are immediately apparent from the timeline. First, Buffer employees chat a lot. During weekdays, Buffer employees send an average of over 2,000 messages per day, which doesn’t include messages sent over private channels, direct messages, reactions to messages, or video conferences.

The second important note on the timeline is the evidence of healthy work-life balance, at least in terms of weekend work. On weekdays, Buffer employees send an average of over 2,000 messages on public channels. On weekends, they send on average fewer than 100 messages. That is, message volume drops by a factor of more than 20 on the weekends.
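The weekday versus weekend comparison is simple arithmetic over daily message counts. Here is a small sketch of it; the daily counts are invented placeholders chosen to be in the same ballpark as the figures above, not Buffer’s actual data.

```python
from collections import Counter
from datetime import date


def average_daily_messages(daily_counts):
    """Average messages per day, split into weekdays and weekends.

    daily_counts maps a date to the number of messages sent that day.
    """
    totals, days = Counter(), Counter()
    for day, n in daily_counts.items():
        # date.weekday() returns 5 for Saturday and 6 for Sunday.
        kind = "weekend" if day.weekday() >= 5 else "weekday"
        totals[kind] += n
        days[kind] += 1
    return {kind: totals[kind] / days[kind] for kind in days}


# One made-up week, Monday 29 Feb 2016 through Sunday 6 Mar 2016.
week = {
    date(2016, 2, 29): 2100,
    date(2016, 3, 1): 2300,
    date(2016, 3, 2): 2000,
    date(2016, 3, 3): 2200,
    date(2016, 3, 4): 1900,
    date(2016, 3, 5): 90,
    date(2016, 3, 6): 70,
}
averages = average_daily_messages(week)
ratio = averages["weekday"] / averages["weekend"]
print(averages, ratio)
```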

A Distributed Team

It’s important to note that, by necessity, Buffer employees need chat. Because Buffer is a fully distributed team, all communication happens digitally unless they’re meeting face-to-face at one of their semi-annual retreats.

What that’s resulted in, in Buffer’s case, is a team that is much more connected than many teams of their size with a physical presence.

Let’s illustrate why by first defining a term in network science called centrality. We’ll take a subset of four Buffer employees (Åsa, Boris, Courtney and Darcy) and assume they have the following (fictitious) organizational network. Two members have a connection if there’s a 1 in their shared cell, and no connection if there’s a 0.

	Åsa	Boris	Courtney	Darcy
Åsa	0	0	1	0
Boris	0	0	1	1
Courtney	1	1	0	1
Darcy	0	1	1	0

Let’s say Åsa needs something from Darcy. She doesn’t usually communicate with Darcy, but she does communicate frequently with Courtney, so she relays the message through her: Åsa → Courtney → Darcy. Likewise, Boris shares information he knows with those with whom he works closely. If Åsa needed information from Boris, it’d likely go through Courtney first: Boris → Courtney → Åsa.

In this example we see that Courtney’s position in this network makes her very important, or central. Team members with high centrality help broker a lot of information to other team members, usually more so than is noticeable through interpersonal dynamics. In a lot of networks, like Hillary Clinton’s State Department, there are a couple of members who hold a large percentage of the centrality. If they were removed from the network, general connectivity would severely suffer.
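One way to put a number on Courtney’s position is betweenness: how many shortest paths between other pairs pass through each person. The sketch below is an assumption about how one might compute it for the four-person matrix above (the post doesn’t say which centrality measure was used for Buffer), written as a small breadth-first search rather than with a graph library.

```python
from collections import deque
from itertools import combinations

NAMES = ["Åsa", "Boris", "Courtney", "Darcy"]
MATRIX = [
    [0, 0, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
]


def neighbors(matrix, names):
    """Turn the adjacency matrix into a name -> list of contacts mapping."""
    return {names[i]: [names[j] for j, cell in enumerate(row) if cell]
            for i, row in enumerate(matrix)}


def bfs_counts(adj, source):
    """Distance to, and number of shortest paths to, every reachable node."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    return dist, sigma


def betweenness(adj):
    """Undirected betweenness: share of shortest s-t paths through each v."""
    info = {s: bfs_counts(adj, s) for s in adj}
    scores = {v: 0.0 for v in adj}
    for s, t in combinations(adj, 2):
        dist_s, sigma_s = info[s]
        dist_t, sigma_t = info[t]
        if t not in dist_s:
            continue  # s and t are not connected at all
        for v in adj:
            if v in (s, t) or v not in dist_s or v not in dist_t:
                continue
            # v lies on a shortest s-t path exactly when the distances add up.
            if dist_s[v] + dist_t[v] == dist_s[t]:
                scores[v] += sigma_s[v] * sigma_t[v] / sigma_s[t]
    return scores


scores = betweenness(neighbors(MATRIX, NAMES))
print(scores)
```

In this toy network only Courtney sits between other pairs (Åsa and Boris, Åsa and Darcy), so she is the only one with a non-zero score.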

In Buffer’s case, if you were to remove any given employee from the team, overall centrality would decrease by at most 1.5%. Each member of the Buffer team is contributing to the Buffer network, regardless of team, gender, race, hierarchy or tenure.

Collaboration

While it’s a great thing for teams to be communicating frequently and informally, it’s also important that communication be directed in ways that maintain collaboration.

There are two fundamental ways that multiple teams in an organization collaborate. When members on a given team chat amongst themselves, we’ll call that energy. When they chat with members of other teams, we’ll call that exploration. Usually, it’s important to have a healthy balance of both.

Below we can see how different teams balance their energy and exploration. The grey dot below each team reflects their internal energy, while the blue arcs show the degree of their exploration with other teams.

Note that the size of each dot is not a reflection of team size, but of team energy. We can see that the Happiness team communicates very frequently amongst themselves, and has a less active role in communicating with some of the other teams, like the Data team or the Marketing team.

On the other hand, the Caretaker team (which consists of founders and other top leaders) has average inner-team energy, but has strategic, above-average exploration with teams like Product and Engineering. Visualizing the balance between energy and exploration can help identify if certain teams could benefit by increasing or decreasing communication with other teams.
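The energy and exploration tallies described in this section could be computed along the following lines. This is only a sketch: it assumes each interaction has already been reduced to a (sender, recipient) pair, and the people, team assignments and messages are placeholders.

```python
from collections import Counter


def energy_and_exploration(messages, team_of):
    """Split interactions into within-team and cross-team counts.

    messages is an iterable of (sender, recipient) pairs;
    team_of maps each person to their team.
    """
    energy = Counter()        # team -> messages exchanged inside the team
    exploration = Counter()   # frozenset of two teams -> messages between them
    for sender, recipient in messages:
        a, b = team_of[sender], team_of[recipient]
        if a == b:
            energy[a] += 1
        else:
            exploration[frozenset((a, b))] += 1
    return energy, exploration


# Placeholder people and messages, for illustration only.
TEAM_OF = {"p1": "Happiness", "p2": "Happiness", "p3": "Data", "p4": "Marketing"}
MESSAGES = [
    ("p1", "p2"), ("p2", "p1"), ("p1", "p2"),  # Happiness talking to itself
    ("p1", "p3"),                              # Happiness <-> Data
    ("p3", "p4"), ("p4", "p3"),                # Data <-> Marketing
]
energy, exploration = energy_and_exploration(MESSAGES, TEAM_OF)
print(energy, exploration)
```

Using an unordered pair of teams as the key treats exploration as symmetric, which matches the arcs in the chart; a directed version would key on an ordered tuple instead.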

Conclusions

We’ve taken a high-level look at some of Buffer’s chat data to understand what patterns of communication emerge. We’ve taken a look at communication frequency, work-life balance, effects of a global and distributed team, centrality and collaboration. These are just some of the points to consider when evaluating how current patterns of communication enable success. Other points of consideration could include incorporating ticket or project management data to contextualize productivity data or even using sentiment analysis and theme extraction over public messages to identify topics that drive collaboration.

What are your thoughts? What would your organizational network reveal about your organization? Do the modes and patterns of your team’s communication enable or hinder your success?

Sanders Would Win Electoral College

Tue, 08 Mar 2016 00:00:00 +0000

Lately, most political commentary treats Secretary Clinton as the de facto nominee for the Democratic Party. Clinton is leading in the delegate count, and has a staggering number of superdelegates already “pledged”.

However, people don’t elect presidents in the United States — the Electoral College does.

What does that mean for Secretary Clinton? Most of the states that she is winning are likely to vote Republican anyway. So, in effect, her wins wouldn’t put her in office. However, Senator Sanders’ wins would.

If you consider current primary results in terms of the 2012 Electoral College, as of March 8th, Secretary Clinton would have 36 electoral votes, and Senator Sanders would have 46 electoral votes.

And that includes Clinton’s razor-thin victories in Iowa and Massachusetts.

Electoral Map

States in dark blue indicate a win for Clinton that would give her Electoral Votes, while states in light blue indicate a win that wouldn’t give her Electoral Votes.

States in dark green indicate a win for Sanders that would give him Electoral Votes, while states in light green indicate a win that wouldn’t give him Electoral Votes.

Network Science

Sat, 20 Feb 2016 00:00:00 +0000

A couple of jobs ago, I was lucky enough to get to work with some very smart network scientists.

We were building a product together that performed organizational network analysis (ONA) in large enterprise companies. ONA allows you to get strategic and identify how people collaborate. You can measure overall network connectedness, find silos and cliques, identify key contributors, quantify synergy, etc. We wanted to help companies identify and monitor key metrics and “actionable insights”. People loved it.

Unfortunately, getting there can be a difficult journey. You have to (1) realize that network science can improve the way people work together, then (2) convince leadership that it’s worthwhile because (3) $$$. People don’t have network scientists in-house, and they are definitely scarce and expensive.

Hopefully, by the end of this post you’ll be a network science convert with some tools in your belt to start improving your own organizations.

I’ll start with examples. There’s a glossary at the end for all the terms I just gloss over.

Modeling Relationships

Meet three co-workers: Alice, Bob and Charlie. Alice and Bob are good friends. Bob and Charlie are not. Bob talks to Charlie, but Charlie never reciprocates. He’s grumpy. That’s why he and Alice aren’t friends.

If we wanted to represent this structure mathematically, we’d make a matrix in which every person gets a row and a column. Then we put a 1 in a cell whenever the person in that row talks to the person in that column, and a 0 otherwise. Ours would look something like this:

	Alice	Bob	Charlie
Alice	0	1	0
Bob	1	0	1
Charlie	0	0	0

Expressing a network as a matrix allows us to do a lot of math behind the scenes.
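As a small taste of that math, here is a sketch that builds the matrix above from a list of who talks to whom and reads a few simple facts off it. The variable and function names are illustrative choices, not part of any particular library.

```python
PEOPLE = ["Alice", "Bob", "Charlie"]
# (speaker, listener) pairs from the story above.
TALKS_TO = [("Alice", "Bob"), ("Bob", "Alice"), ("Bob", "Charlie")]


def adjacency_matrix(people, edges):
    """Rows are speakers, columns are listeners; 1 means 'talks to'."""
    index = {person: i for i, person in enumerate(people)}
    matrix = [[0] * len(people) for _ in people]
    for speaker, listener in edges:
        matrix[index[speaker]][index[listener]] = 1
    return matrix


matrix = adjacency_matrix(PEOPLE, TALKS_TO)

# Row sums count how many people each person talks to;
# column sums count how many people talk to them.
out_degree = {p: sum(matrix[i]) for i, p in enumerate(PEOPLE)}
in_degree = {p: sum(row[i] for row in matrix) for i, p in enumerate(PEOPLE)}

# One-way relationships: a 1 whose mirror cell is 0.
unreciprocated = [
    (PEOPLE[i], PEOPLE[j])
    for i in range(len(PEOPLE))
    for j in range(len(PEOPLE))
    if matrix[i][j] and not matrix[j][i]
]
print(matrix, out_degree, in_degree, unreciprocated)
```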

Cliques

Cliques exist in every social context. Birds of a feather really do flock together. Organizations get pretty interesting when cliques don’t align with the org chart.

If they don’t align, does it mean that cliques form because the existing org chart failed? Or have they succeeded in opening up collaboration with other groups? If they do align, does it mean that teams are focused, or are they working in isolation? It’s always a good idea to have a core set of productivity metrics, so the answers to these questions can be validated with experimentation.

Key Actors

Sometimes there are actors who perform strategic roles in brokering relationships between cliques or structured areas of the organization. If there are two groups of people working very cohesively, but there’s no communication between teams, that’s usually a problem. There needs to be at least a couple of actors brokering communication between the two groups.

You can see why those actors are important to the network — we’d say that those brokers lie on many shortest paths between the two groups. That is, if you want to talk with someone from the other group, it’s probably through those brokers. Finding out how many shortest paths an actor lies on is a metric called betweenness.

Betweenness is one of a couple of metrics that can be used to determine how important (or central) an actor is to a network.

Centrality

Centrality scores are an effort by network scientists to determine how important different actors are. The idea is that more important members have higher scores of centrality.

Actor centrality can give insights into how stable an organization is. A stable network will have pretty similar centrality scores among most of its members. That is, each individual is equally important, so if anyone leaves the network, those connections can be easily absorbed by others in the network.

An unstable network will have most of the centrality in a network carried by a few strategic actors. If a highly central actor were to leave the network, it would leave a disproportionately high number of relationships unbrokered.
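A rough way to check which of these two situations you are in is to ask what share of the total centrality the top few actors hold. The sketch below is one simple, assumed way to do that, not a standard metric from the literature; the scores and actor names are invented.

```python
def top_share(scores, k=1):
    """Fraction of total centrality held by the k highest-scoring actors."""
    total = sum(scores.values())
    if total == 0:
        return 0.0
    top = sorted(scores.values(), reverse=True)[:k]
    return sum(top) / total


# Invented centrality scores for two four-person networks.
even = {"a": 5, "b": 5, "c": 5, "d": 5}      # centrality spread evenly
skewed = {"a": 17, "b": 1, "c": 1, "d": 1}   # one actor carries most of it

even_share = top_share(even)
skewed_share = top_share(skewed)
print(even_share, skewed_share)
```

Here the top actor holds a quarter of the centrality in the even network and 85% of it in the skewed one, which is the kind of gap that would flag the second network as fragile.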

Brokerage

When considering an actor’s position in a network, we can also consider their functional role in collaboration.

Brokerage roles let us explain how someone facilitates collaboration. We look at how actors from some groups broker relationships with actors from other groups.

We start by breaking up a network into all of its triads. (Triads are formally defined in the glossary.) For every triad, we classify the relationship according to the following:

Coordinator: The broker mediates contact between two individuals in the same group: A → A → A
Consultant: The broker mediates contact between two individuals in the same group, who are not members of the broker’s group: A → B → A
Representative: The broker mediates an incoming contact from an out-group member to an in-group member: A → B → B
Gatekeeper: The broker mediates an outgoing contact from an in-group member to an out-group member: A → A → B
Liaison: The broker mediates contact between two individuals from different groups: A → B → C

After we classify all of their triads, we’d say an actor’s brokerage role is whatever role they find themselves in most frequently.
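The classification above can be sketched as follows. The role labels and their two-path patterns follow the list in this post exactly; the function names, the people and the groups are made up for illustration, and ties between roles are broken arbitrarily.

```python
from collections import Counter, defaultdict


def classify(src_group, broker_group, dst_group):
    """Name the broker's role in src -> broker -> dst, per the list above."""
    if src_group == broker_group == dst_group:
        return "coordinator"       # A -> A -> A
    if src_group == dst_group:
        return "consultant"        # A -> B -> A
    if broker_group == dst_group:
        return "representative"    # A -> B -> B
    if src_group == broker_group:
        return "gatekeeper"        # A -> A -> B
    return "liaison"               # A -> B -> C


def brokerage_roles(edges, group_of):
    """Most frequent role of each actor across the triads they broker."""
    edge_set = set(edges)
    successors = defaultdict(set)
    for src, dst in edge_set:
        successors[src].add(dst)
    tallies = defaultdict(Counter)
    for src, broker in edge_set:
        for dst in successors[broker]:
            # A triad only counts when src has no direct edge to dst.
            if dst == src or (src, dst) in edge_set:
                continue
            role = classify(group_of[src], group_of[broker], group_of[dst])
            tallies[broker][role] += 1
    return {actor: counts.most_common(1)[0][0]
            for actor, counts in tallies.items()}


# Made-up actors and groups.
GROUP_OF = {"ann": "eng", "ben": "eng", "cat": "ops", "dan": "sales"}
EDGES = [("ann", "ben"), ("ben", "cat"), ("cat", "dan")]
roles = brokerage_roles(EDGES, GROUP_OF)
print(roles)
```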

When used in conjunction with centrality scores, brokerage roles can be used to identify actors who are important in terms of both centrality and function. For example, for succession planning, you might want to identify individuals who have high centrality and fall into a representative role to fill vacant leadership positions on their team.

Consensus

Some colleagues of the network scientists I worked with wrote a book on the power of network connectivity. They note some particularly interesting work that highlights the importance of a network’s structure in its ability to reach consensus.

Consensus deals with the homogeneity of a network’s state. If everyone in the network were painted red, it would be in consensus. If some were blue and some were red, it would not be in consensus.

Let me illustrate with an example we’re all too familiar with: employee attrition. When there’s unhappiness or dissatisfaction in our network, it’s contagious. Ideally, we’d love it if our network were painted “satisfied”, but that won’t always be the case.

When key actors in the network (those who have high centrality and broker strategic relationships) are dissatisfied, they have a disproportionately high influence on the consensus of the network, which can be fatal.

Why so fatal? Michael Kearns argues that a network whose core actors are not in consensus will never arrive at a consensus. If you’re trying to rally around Business Plan B, you need your company to rally with you. Doubt and negative energy are contagious, and a team that doesn’t have the same vision won’t build the same thing.

Glossary

Network: A collection of things that are connected. Here we’ve only talked about people. Some people call it a graph, or a social graph.
Matrix: A mathematical representation of a network. (Specifically, we’re talking about an adjacency matrix.)
Actor: A person in a network.
Edge: A connection between two actors. We typically infer edges/connections by email or chat history, or surveys.
Triad: A specific type of connection between three actors. For actors A, B, and C, a triad exists when A is connected to B and B is connected to C, but A is not connected to C (A → B → C).
Shortest Path: The shortest set of edges between two actors. You win if you can calculate these for Kevin Bacon. Mathematicians call this geodesic distance.
Connectedness: There are a couple of ways to evaluate connectedness, which were formulated by a guy named Krackhardt. He proposes four dimensions, which are connectedness, hierarchy, efficiency and least upper boundedness (or lubness, for short). Typically, you like it when your network has high connectedness and efficiency.
Centrality: How central, or important, an actor is to a network. Popular ways to calculate it are betweenness, closeness, using eigenvectors, or using PageRank. You can get a lot of mileage from looking at centrality in different ways — comparing among actors, looking at averages across teams, Pareto charts, etc.
Brokerage Role: The way an actor typically brokers relationships. Actors can be gatekeepers, liaisons, representatives, coordinators, or consultants.

Hillary's Damn Emails

Wed, 17 Feb 2016 00:00:00 +0000

[UPDATE: (March 3, 2016) Since I wrote this, there’s been a revival of interest — from the FBI — in the Clinton email spectacle. Who knows what it will mean for her, but maybe it means we’ll get more data!]

The American people are sick and tired of hearing about [Hillary Clinton’s] damn emails.

– Sen. Bernie Sanders

By now the dust has more than settled on Hillary Clinton’s email controversy.

I’m not interested in bringing the topic back from the dead to debate the concerns about the security of potentially confidential information, but more in the network it identifies, which is interesting for two reasons. First, it could be a nice, toy-ish data set to run analysis and benchmarks against, like the Enron corpus. Second, it could potentially help identify the type of leadership style and kind of organization that Hillary Clinton runs, which is naturally relevant seeing as how she might end up being the next American president.

But first, the data…

FOIA

First off, the data aren’t great. I got mine from Kaggle, which had some helpful additions/modifications.

The problem is that the data, which came from a FOIA request, consist of some semi-structured output from computer vision over a bunch of PDFs of scanned copies of printed emails. So needless to say, there are some inconsistencies in what was extracted¹. Some fuzzy matching helped a little, but I still spent a while manually resolving funky naming variations.
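To give a feel for the fuzzy matching step, here is a minimal sketch using Python’s standard `difflib`. It is not the matching I actually ran; the canonical name list, the normalization rule and the 0.6 threshold are all assumptions for illustration, and email-style variants would still need extra handling or manual review.

```python
import re
from difflib import SequenceMatcher

# Hypothetical canonical spellings to resolve variants against.
CANONICAL = ["sullivan jacob", "mills cheryl", "abedin huma"]


def normalize(raw):
    """Lowercase and collapse anything that isn't a letter into a space."""
    return re.sub(r"[^a-z]+", " ", raw.lower()).strip()


def best_match(raw, canonical=CANONICAL, threshold=0.6):
    """Return the closest canonical name, or None if nothing is close."""
    cleaned = normalize(raw)
    scored = [(SequenceMatcher(None, cleaned, name).ratio(), name)
              for name in canonical]
    score, name = max(scored)
    return name if score >= threshold else None


for variant in ["Sullivan, Jacob J", "Mills, Cheryl D", "xyz"]:
    print(variant, "->", best_match(variant))
```

Anything that falls below the threshold comes back as `None`, which is the pile that ends up being resolved by hand.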

The email contacts are the most important part of the data for me — they identify the structure of the network I care about, and I can now create a matrix that identifies how individuals are engaging with every other individual in the network. If I had been interested in sentiment analysis or topic modeling I probably wouldn’t have cared.

Caveat: The data are an incomplete record of the emails sent at the State Department during Secretary Clinton’s tenure. Since only emails sent from her private email server were the subject of the FOIA request, the data represent more of an egocentric network. But they can still be used to explore some interesting ideas.

The Network

That represents Secretary Clinton’s State Department email network. Isn’t that beautiful? Sometimes I wish I were a network scientist².

There are a lot of interesting ways to think about how important an individual is to a network of people. I tend to gravitate to betweenness, and the gist of it is this — I need to talk to Alice, and the only way for me to contact Alice is through our mutual connection, Bob. Therefore, Bob is important to that network.

In an unstable network, betweenness is distributed among a few key individuals. If someone leaves the network, the results in terms of connectivity and information flow are catastrophic. In a stable network, betweenness is distributed more evenly among its members.

In Hillary’s case, we can look at the betweenness centrality of everyone in her network to identify who key individuals are.

Here we see that Secretary Clinton’s State Department is supported by three key people: Jacob Sullivan, Cheryl Mills and Huma Abedin. It seems very reasonable that Secretary Clinton’s three most influential staffers would be her Deputy Chief of Staff, Chief of Staff, and semi-Deputy Chief of Staff, respectively.

Are those staffers over-leveraged, creating instability in the network? Probably.

Looking at network similarity (how similarly each individual engages with every other individual in the network, measured by cosine similarity), we see a bit of interesting asymmetry. In terms of incoming similarity (how similarly they receive email from people in the network), they’re not very similar, with similarities around 0.6 or below. But in terms of outgoing similarity (how similarly they send email in the network), they’re much more similar, reaching similarities as high as 0.85.
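For reference, the incoming and outgoing comparison can be sketched like this, assuming the network is stored as a matrix of email counts where row i, column j holds the number of emails person i sent to person j. The counts below are invented and do not come from the Clinton data.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def outgoing(counts, i):
    """Row i: how many emails person i sent to each person."""
    return counts[i]


def incoming(counts, j):
    """Column j: how many emails each person sent to person j."""
    return [row[j] for row in counts]


# Invented email counts for four people (row = sender, column = recipient).
COUNTS = [
    [0, 5, 1, 0],
    [4, 0, 0, 2],
    [8, 0, 0, 4],
    [0, 1, 1, 0],
]
out_sim = cosine(outgoing(COUNTS, 1), outgoing(COUNTS, 2))
in_sim = cosine(incoming(COUNTS, 1), incoming(COUNTS, 2))
print(round(out_sim, 2), round(in_sim, 2))
```

In this toy matrix, persons 1 and 2 send email in identical proportions, so their outgoing similarity is 1, but they receive email from the others in different proportions, so their incoming similarity is lower.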

Takeaway

What does it all mean? Secretary Clinton worked primarily with her Chief of Staff and Deputy Chiefs of Staff. They were extremely central to the network, and acted as gatekeepers for a lot of other email senders.

Is there any way to de-leverage her Chief of Staff? Probably not, as it’s pretty much the role of a Chief of Staff to broker communication between an executive and their supporting staff and relations. But it does tend to cause a lot of shakeup when they leave.

Future Work

This is a data set I’d love to spend more time with. If you’d like to use my modified data, it’s available on GitHub. If I were to explore any of this further, I might examine:

How this changes over time. Are there individuals whose importance increases or decreases over time, and how was the network affected by it?
Clique analysis. I might dive deeper on cliques/neighborhoods/clusters to see who was working with whom, which groups worked well together (or didn’t), and what differences might exist in terms of topics discussed and actions taken.
Backchanneling. Is there any evidence in the data? With whom? (How would we detect it?) What would the implications be?
Succession. Would there be any catastrophic changes to network connectivity if certain individuals were to leave? What about a changeup with the Chief of Staff? What would that mean to the work of the State Department? What kind of strategies could we surmise to mitigate against it?

At this point I’m just rambling, but I’m sure I’ll have more ideas I might want to address later.

Footnotes

1. For example, Jake Sullivan, who served as Secretary Clinton’s Deputy Chief of Staff, appears in over 2,500 email threads with over 40 variations of spelling: “Sullivan, Jacob J”, “sullivanjj@state.gov”, “Sullivan, Jacob J Sullivanil@state.gov”, “isullivanjj@state.govi sullivanjj@state.gov”, my personal favorite “lake.sullivar”, and more!
2. If you’re unfamiliar with network science and haven’t read my post on it, you probably should.

bash in R

Mon, 01 Feb 2016 00:00:00 +0000

There are times when putting reusable code in your own personal R package is useful — saving yourself time by structuring repetitive tooling, sharing difficult or specific implementations of hard problems to help others hate life less, etc.

This is kind of just a one off — and doesn’t fit in either of those two categories — but I thought I’d share it here, since I’ll probably eventually come back to it and I think it could be useful for others.

Sometimes you just want to load bash scripts in R — in my particular current case, mainly just to load environment variables from my .bash_profile — like for cron, running scripts in a cluster, etc. But I don’t want to pass them as arguments to some script, and it doesn’t seem like runr is doing what I want (yet).

So this is a snippet that will load environment variables.

It’d be nice if it would load other nice bash-y things, like functions, aliases, etc., and somehow made them available in the current session, but that might also be unnecessary. (That is, I haven’t yet found it necessary…)