Scientific Python: Anaconda
As I've mentioned before, I love to use Python for data processing and statistical analysis. In this entry, I'll describe my recent experience with the Anaconda Python distribution published by Continuum Analytics. Thus far, using Anaconda has been very straightforward and I'm sufficiently impressed to recommend it.
A recent round of computer upgrades at the lab means we're all experiencing the minor trauma of reinstalling software. Python is one of my most important tools, so getting it up and running was essential. Unfortunately, while Python's famous maxim about coming with “batteries included” is absolutely true for the base libraries, difficulties with installing and managing third-party libraries can waste a lot of time. That's because despite a terrific community and tons of great third-party packages, Python's built-in package management software remains rudimentary.[1]
Batteries Plutonium included
Enter Anaconda. This distribution provides a base installation of the Python language along with almost 200 widely used Python packages. That includes mainstays such as NumPy and SciPy, relative newcomers such as pandas and scikit-learn, and perhaps lesser-known but extremely useful entries such as bitarray and xlrd. If there's any downside, it's that you probably won't use all of these additional packages. However, sacrificing a bit of disk space seems a small price to pay for simple Python package management.[2]
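If you want to confirm which of the packages named above are actually available in a given Python environment, a minimal sketch along these lines may help. It uses only the standard library; the `PACKAGES` table and the `check_packages` helper are names made up for this illustration, and the import names are my assumption of the usual ones (scikit-learn, for instance, is imported as `sklearn`).

```python
from importlib.util import find_spec

# Display name -> import name for the packages mentioned in this post.
# These import names are assumed; scikit-learn is imported as "sklearn".
PACKAGES = {
    "NumPy": "numpy",
    "SciPy": "scipy",
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "bitarray": "bitarray",
    "xlrd": "xlrd",
}


def check_packages(packages=PACKAGES):
    """Return a dict mapping each display name to True if its module is found."""
    return {
        name: find_spec(module) is not None
        for name, module in packages.items()
    }


if __name__ == "__main__":
    for name, found in check_packages().items():
        print(f"{name:<14}{'installed' if found else 'missing'}")
```

Run inside a full Anaconda install, every entry should presumably report as installed; in a bare Python install, most will be missing.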
I've tried Anaconda on my Windows desktop and on our lab's SUSE Linux workstations. In both cases, installation was painless, even without root access on the Linux boxes. Compared to my previous exertions that got full-blown scientific Python working on those systems, this was a snap. Notably, Anaconda also handles non-Python libraries that some third-party packages require (e.g., scikit-learn uses LibSVM for support vector calculations), meaning still fewer installation hassles. Python's built-in tools, which are getting much better at managing native Python packages, don't yet handle these external dependencies.
Python's free numerical libraries are arguably as fast as those of the other high-level programming languages,[3] but more speed is always welcome when you're working with large datasets. On that note, another benefit of Anaconda is that it gives you access to proprietary speed-boosting extensions from Continuum Analytics (CA). For most users, access to these features costs money; academics can apply (painlessly) for a free license.
So how fast is it? As Barf the Dog might say, “They've gone to plaid!” I installed CA's accelerate library, and without changing a line of my scripts, NumPy suddenly recruited multiple cores for time-consuming calculations. I've found the speed increase to be very significant for calculations using scikit-learn, among other libraries.
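One rough way to check for a speedup like this on your own machine is to time a large matrix product before and after installing an accelerated build. The sketch below is only an illustration under my own assumptions: `time_matmul` is a hypothetical helper, not part of accelerate or NumPy, and the matrix size and repeat count are arbitrary.

```python
import time

import numpy as np


def time_matmul(n=500, repeats=3):
    """Return the best wall-clock time in seconds for an n x n matrix product."""
    rng = np.random.default_rng(0)  # fixed seed so every run uses the same data
    a = rng.standard_normal((n, n))
    b = rng.standard_normal((n, n))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        a @ b  # the product is delegated to whatever BLAS NumPy is linked against
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    print(f"best of 3 runs: {time_matmul():.4f} s")
```

Taking the best of several runs reduces noise from other processes. Watching a CPU monitor while it runs should also show whether more than one core is busy.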
Anaconda seems to be a great solution to a common problem — managing many potentially overlapping Python package requirements. I've been very pleased so far, and will post again with further impressions and tips in the future. If you have experience with Anaconda or other Python distributions that you'd like to share, please contact me or post below.
Python is an open source programming language, and the base Anaconda distribution is likewise free for use. However, the publisher (Continuum Analytics) appears to be a for-profit enterprise. I have no financial interest in or other relationship with CA, but as I mentioned earlier, I did take advantage of CA’s open offer of free academic access to their professional tier of products. I don't believe that this unduly influenced my opinion of CA or Anaconda, but I feel that full disclosure is always the best policy.
1 Tools such as pip definitely beat manual installation and may suffice for UNIX-like operating systems. Unfortunately, these tools often aren't enough for Windows users such as me, and it's all too easy to become mired in Python's own dependency hell. If you're on Windows but still want to try managing packages manually, I highly recommend Christoph Gohlke's amazing collection of pre-compiled Windows binaries for Python; I relied on these almost exclusively before trying Anaconda.
2 There's also a stripped-down Miniconda distribution if you want to install packages only as needed.
3 It's my understanding that NumPy, MATLAB, and the rest all rely on math libraries such as BLAS and LAPACK under the hood, making their effective speed very similar.
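To see which math libraries a particular NumPy build is linked against, NumPy can print its own build configuration. This is a minimal sketch; `print_numpy_build_info` is a hypothetical wrapper name, and the exact output differs between NumPy versions and builds.

```python
import numpy as np


def print_numpy_build_info():
    """Print the NumPy version and the build configuration it reports."""
    print("NumPy version:", np.__version__)
    # show_config() lists the BLAS/LAPACK libraries found when NumPy was built.
    np.show_config()


if __name__ == "__main__":
    print_numpy_build_info()
```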