# Proportional Venn diagram for more than 3 sets

### Question

I have a collection of documents in MongoDB where each has one or more categories in a list. Using map reduce, I can get the details of how many documents have each unique combination of categories:

``````
['cat1']               = 523
['cat2']               = 231
['cat3']               = 102
['cat4']               = 72
['cat1','cat2']        = 710
['cat1','cat3']        = 891
['cat1','cat3','cat4'] = 621 ...
``````

where each total is the number of documents with that exact combination of categories.

I'm looking for a sensible way to present this data, and I think a Venn diagram with proportional areas would be a good idea. Using the above example, the area of cat1 would be 523+710+891+621, the area of the overlap between cat1 and cat3 would be 891+621, the area of the overlap between cat1, cat3 and cat4 would be 621, etc.
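To make the area arithmetic concrete, here is a minimal sketch of how those totals could be derived from the exact-combination counts. The names `exact_counts` and `overlap_total` are made up for illustration, and only the seven combinations listed above are included (the real output has more).

```python
# Exact-combination counts from the listing above: each value is the
# number of documents whose category list is exactly that set.
exact_counts = {
    frozenset(['cat1']): 523,
    frozenset(['cat2']): 231,
    frozenset(['cat3']): 102,
    frozenset(['cat4']): 72,
    frozenset(['cat1', 'cat2']): 710,
    frozenset(['cat1', 'cat3']): 891,
    frozenset(['cat1', 'cat3', 'cat4']): 621,
}


def overlap_total(categories, counts):
    """Sum the counts of every exact combination containing all of `categories`."""
    wanted = frozenset(categories)
    return sum(n for combo, n in counts.items() if wanted <= combo)


print(overlap_total(['cat1'], exact_counts))                  # 523+710+891+621 = 2745
print(overlap_total(['cat1', 'cat3'], exact_counts))          # 891+621 = 1512
print(overlap_total(['cat1', 'cat3', 'cat4'], exact_counts))  # 621
```

These inclusive totals are the areas a proportional diagram would need to reproduce.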

Does anyone have any tips for how I might go about implementing this? I'd prefer to do it in Python (+NumPy/Matplotlib) or MATLAB.

5/29/2012 6:10:40 PM

## The Problem

We need to represent counts for multiple overlapping categories of objects, and a Venn diagram cannot represent more than a handful of categories and their overlaps.

## A Solution

Consider each of the categories and their combinations as a node in a graph. Draw the graph such that the size of the node represents the count in each category, and the edges connect the related categories. The advantage of this approach is: multiple categories can be accommodated with ease, and this becomes a type of connected bubble chart.

## The Code

The proposed solution uses NetworkX to create the data structure and matplotlib to draw it. If data is presented in the right format, this will scale to a large number of categories with multiple connections.

``````
import networkx as nx
import matplotlib.pyplot as plt


# The function names load_nodes/load_edges are placeholders; the
# original definitions were lost from the post.
def load_nodes():
    text = '''  Node    Size
        1        523
        2        231
        3        102
        4         72
        1+2      710
        1+3      891
        1+3+4    621'''
    # this may be replaced by some appropriate output
    data = text.split('\n')[1:]
    data = [d.split() for d in data]
    # (node name, attribute dict) pairs, as Graph.add_nodes_from accepts
    return [(d[0], dict(size=int(d[1]))) for d in data]


def load_edges():
    text = '''  From   To
        1+2    1
        1+2    2
        1+3    1
        1+3    3
        1+3+4    1
        1+3+4    3
        1+3+4    4'''
    # this may be replaced by some appropriate output
    data = text.split('\n')[1:]
    return [tuple(d.split()) for d in data]


if __name__ == '__main__':
    scale_factor = 5
    G = nx.Graph()
    G.add_nodes_from(load_nodes())
    G.add_edges_from(load_edges())
    # Read the sizes back in the graph's own node order so that each
    # size lines up with the node it belongs to when drawn.
    node_sizes = [G.nodes[n]['size'] * scale_factor for n in G.nodes]

    nx.draw_networkx(G,
                     pos=nx.spring_layout(G),
                     node_size=node_sizes)
    plt.axis('off')
    plt.show()
``````
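The Node/Size and From/To tables in the code above are typed in by hand. As a rough sketch of how they might be generated instead, the snippet below builds both from a single dictionary of counts, assuming combination names join their member categories with `+` as in the tables. The names `sizes` and `build_tables` are made up for illustration.

```python
# Counts keyed like the Node/Size table: member categories joined by '+'.
sizes = {'1': 523, '2': 231, '3': 102, '4': 72,
         '1+2': 710, '1+3': 891, '1+3+4': 621}


def build_tables(sizes):
    """Return (nodes, edges) in the shape NetworkX's add_*_from methods accept."""
    # Every entry becomes a node carrying its count as a 'size' attribute.
    nodes = [(name, {'size': count}) for name, count in sizes.items()]
    # Every combination node is linked to each of its member categories.
    edges = [(name, member)
             for name in sizes
             if '+' in name
             for member in name.split('+')]
    return nodes, edges


nodes, edges = build_tables(sizes)
print(edges)
```

The resulting lists can be passed to `G.add_nodes_from(nodes)` and `G.add_edges_from(edges)`, so adding a category combination only means adding one dictionary entry.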

## Other Solutions

Other solutions might include: bubble charts, Voronoi diagrams, chord diagrams, and hive plots among others. None of the linked examples use Python; they are just given for illustrative purposes.

5/30/2012 2:46:50 PM

I believe ninjagecko is correct and this cannot generally be represented as a diagram of intersections, unless you don't mind the diagram being in n dimensions. However, it can be represented in 2D if you have a diagram for each category showing all its intersections - and this itself can be a single diagram. So this may be a more appropriate way to represent your data. I've produced a stacked barchart to illustrate:

The code:

``````
import matplotlib.pyplot as plt
import numpy as np
from random import random

cats = ['cat1', 'cat2', 'cat3', 'cat4']
data = {('cat1',): 523, ('cat2',): 231, ('cat3',): 102, ('cat4',): 72,
        ('cat1', 'cat2'): 710, ('cat1', 'cat3'): 891,
        ('cat1', 'cat3', 'cat4'): 621}

# One random RGB colour per category combination.
colors = {k: (random(), random(), random()) for k in data}
print(colors)

# One bar per category, stacking every combination that contains it.
for i, cat in enumerate(sorted(cats)):
    y = 0
    for key, val in data.items():
        if cat in key:
            # align='center' puts each bar directly over its tick.
            plt.bar(i, val, bottom=y, color=colors[key], align='center')
            plt.text(i, y, ' '.join(key))
            y += val
plt.xticks(np.arange(len(cats)), cats)
plt.show()
``````