How can I plot the empirical CDF of an array of numbers in matplotlib in Python? I'm looking for the cdf analog of pylab's "hist" function.
One thing I can think of is:
from scipy.stats import cumfreq a = array([...]) # my array of numbers num_bins = 20 b = cumfreq(a, num_bins) plt.plot(b)
Is that correct though? Is there an easier/better way?
That looks to be (almost) exactly what you want. Two things:
First, the results are a tuple of four items. The third is the size of the bins. The second is the starting point of the smallest bin. The first is the number of points in the in or below each bin. (The last is the number of points outside the limits, but since you haven't set any, all points will be binned.)
Second, you'll want to rescale the results so the final value is 1, to follow the usual conventions of a CDF, but otherwise it's right.
Here's what it does under the hood:
def cumfreq(a, numbins=10, defaultreallimits=None): # docstring omitted h,l,b,e = histogram(a,numbins,defaultreallimits) cumhist = np.cumsum(h*1, axis=0) return cumhist,l,b,e
It does the histogramming, then produces a cumulative sum of the counts in each bin. So the ith value of the result is the number of array values less than or equal to the the maximum of the ith bin. So, the final value is just the size of the initial array.
Finally, to plot it, you'll need to use the initial value of the bin, and the bin size to determine what x-axis values you'll need.
Another option is to use
numpy.histogram which can do the normalization and returns the bin edges. You'll need to do the cumulative sum of the resulting counts yourself.
a = array([...]) # your array of numbers num_bins = 20 counts, bin_edges = numpy.histogram(a, bins=num_bins, normed=True) cdf = numpy.cumsum(counts) pylab.plot(bin_edges[1:], cdf)
bin_edges[1:] is the upper edge of each bin.)
If you like
linspace and prefer one-liners, you can do:
plt.plot(np.sort(a), np.linspace(0, 1, len(a), endpoint=False))
Given my tastes, I almost always do:
# a is the data array x = np.sort(a) y = np.arange(len(x))/float(len(x)) plt.plot(x, y)
Which works for me even if there are
>O(1e6) data values.
If you really need to down sample I'd set
x = np.sort(a)[::down_sampling_step]
Edit to respond to comment/edit on why I use
endpoint=False or the
y as defined above. The following are some technical details.
The empirical CDF is usually formally defined as
CDF(x) = "number of samples <= x"/"number of samples"
in order to exactly match this formal definition you would need to use
y = np.arange(1,len(x)+1)/float(len(x)) so that we get
y = [1/N, 2/N ... 1]. This estimator is an unbiased estimator that will converge to the true CDF in the limit of infinite samples Wikipedia ref..
I tend to use
y = [0, 1/N, 2/N ... (N-1)/N] since (a) it is easier to code/more idomatic, (b) but is still formally justified since one can always exchange
1-CDF(x) in the convergence proof, and (c) works with the (easy) downsampling method described above.
In some particular cases it is useful to define
y = (arange(len(x))+0.5)/len(x)
which is intermediate between these two conventions. Which, in effect, says "there is a
1/(2N) chance of a value less than the lowest one I've seen in my sample, and a
1/(2N) chance of a value greater than the largest one I've seen so far.
However, for large samples, and reasonable distributions, the convention given in the main body of the answer is easy to write, is an unbiased estimator of the true CDF, and works with the downsampling methodology.