Learning Python - suggestions and resources for business analytics students and professionals

This post is hilariously old and outdated but it’s fun to look back

python
Author

Mark Isken

Published

December 18, 2013

Note

Back when this post was written, it was the top search result if you Googled “business analytics python”. I like to revisit this post now and then to see how much has, or has not, changed.

As I’ve been learning and using Python more and more for business analytics’ey things that I previously would have done in Excel or Access with some VBA sprinkled in, I thought it would be a good idea to put together a list of some of the resources I’ve found to be useful. As you start to read this, you’ll notice that there are quite a few references to “scientific computing” and “scientists”. There are many similarities between scientific computing (as done by computing-inclined scientists) and the business analytics computing done by business analysts. Both are communities of folks whose full-time job is NOT to develop software but who do end up using sophisticated computing tools and doing a fair bit of computer programming. Both are dealing with increasingly large and varied data sets. Both need to worry about efficient and reproducible workflows. Many in both camps do end up developing software tools. Both are increasingly turning to tools like Python and R. Both need good resources for learning to effectively use such tools. Business analytics people are very comfortable creating crazy big Excel spreadsheets, dabbling in SQL, and maybe doing some VBA programming. The parallel world of tools like Python and R requires some getting used to but is well worth the investment in time (IMHO).

1. What is Python and why is it called Python?

Python (programming language)

2. Which Python?

As you get started learning Python, it’s unavoidable that you’ll run into the Python 2 vs Python 3 question. For now, I’m still using Python 2.7.3. Before downloading Python from python.org, read the next section on setting up your sci-computing Python environment.
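If you’re not sure which Python you’re actually running, you can ask the interpreter itself. Here’s a minimal sketch using only the standard library `sys` module; the helper name `describe_python` is made up for illustration.

```python
import sys


def describe_python(version_info=sys.version_info):
    """Return a short label such as 'Python 2.7' for a version tuple."""
    major, minor = version_info[0], version_info[1]
    return "Python {0}.{1}".format(major, minor)


# This runs unchanged under both Python 2 and Python 3.
print(describe_python())
```

Passing a tuple by hand, e.g. `describe_python((2, 7, 3))`, gives the label for that version without needing that interpreter installed.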

3. Setting up your sci-computing Python environment

While you can certainly install Python and the various modules needed for analytics or sci-computing one at a time, it’s easier to install one of the prepackaged Python distributions containing the SciPy Stack. The SciPy Stack specification contains the minimal set of Python packages one should have for sci-computing and includes things like Python itself, numpy, scipy, matplotlib, pandas, IPython, Sympy, and nose. Of course, there are a zillion more useful packages, some of which I’ll mention below. I have had good luck with the Enthought Canopy Python distribution for scientific computing. It includes not only the SciPy Stack but a ton of other useful packages. In addition, it’s got an integrated update manager that makes it easy to keep all your sci-computing packages up to date. Since Canopy uses virtualenv to create its own virtual Python environment, local users can handle all their own updates and you don’t mess around with your core Python system. Canopy also has a built-in code editor and embedded IPython console.
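Whichever distribution you pick, it’s easy to check which of the SciPy Stack packages are visible to your interpreter. This is a rough sketch, assuming Python 3.4 or later (it relies on `importlib.util.find_spec`, which Python 2.7 doesn’t have); the function name `missing_packages` is my own invention.

```python
import importlib.util

# Import names for the SciPy Stack packages listed above (Python itself excluded).
SCIPY_STACK = ["numpy", "scipy", "matplotlib", "pandas", "IPython", "sympy", "nose"]


def missing_packages(names):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Missing:", missing_packages(SCIPY_STACK))
```

An empty list means everything was found. Note that this only checks that a package can be located, not that it imports cleanly or is a recent version.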

Recently (Feb 2014) I installed another leading sci-computing Python distro - Anaconda from Continuum Analytics. It looks every bit as good as Canopy and I’m looking forward to working with it.

4. Writing Python programs

Now that you’ve got Python installed, what tools will you use to write Python programs? Since Python can be used in “interactive mode”, you’ll want some tools for doing that too. Let’s look at tools for interactive use first since that’s a natural way to start to learn to program in Python.

Using Python interactively

Python ships with a bare bones interactive console called IDLE. A much better interactive console is called IPython. It is an “architecture and toolset” for doing interactive computing with Python. I love IPython. It’s terrific for interactively hacking around and doing analysis with Python. The interface is reminiscent of tools like Matlab or Mathematica. There’s a browser-based tool called an IPython Notebook that, well, lets you use IPython right from a browser window. The ability to mix runnable Python code with markdown text makes IPython great for developing tutorials and documenting analysis workflows. It’s a must-have. You can find links to several resources for learning IPython further down in this document. Both Canopy and Anaconda have various IPython shells built in.

Text editors and IDEs for writing Python programs

While certainly you can just write Python code in a text editor, I like a good IDE. For Python on Windows, a nice lightweight IDE is PyScripter. I’ve also been using Spyder on both Windows and Linux (though it doesn’t have a visual debugger, just a command line debugger). An IDE that looks promising is the new community edition of PyCharm, which I just installed a few days ago. I’ve also been using the editor built into Canopy. If you search StackOverflow (and Google in general) for a discussion on Python IDEs, you get a good sense of the landscape. Lots of folks are quite happy with editors like vim and emacs. Many editors have syntax highlighting, including widely used tools like Notepad++ on Windows and gedit on Linux. For those used to Eclipse as their Java IDE, there is a plug-in called PyDev. However, Eclipse is a bit of overkill for cranking out beginner Python code. I’d start with something like PyScripter or PyCharm or Spyder. I’m not sure about the first two, but Spyder has both IDLE and IPython interactive consoles embedded in it. Also, get comfortable with IPython Notebooks as they are really useful for creating your own “tutorials” that contain a mixture of interactive code and text written in markdown. My Python tutorials on hselab.org were done this way and there are links to the actual notebooks hosted on GitHub.

6. General and data science focused Python tutorials and [e]Books

The Official Python Tutorial

Straight from the source.

Software Carpentry

This is a fantastic resource aimed at teaching scientists how to be better software developers. In their words:

We run bootcamps all over the world, and provide open access material for self-paced instruction.

Their Lessons section has a large number of scientific computing tutorials which include screencasts and slides. The Python section is just one of many. Other topics include things like version control, the Shell, regular expressions, databases, and many, many more.

Data Science in Python

Nice collection of IPython notebooks to provide intro to using Python for data science work. From y-hat.

Getting Started with Python for Data Science

This is from the folks at kaggle.com who have turned data mining into a combination of a sport and community activity.

Online Python Tutor

Simply amazing web-based tool that lets you visualize Python program execution. You can even use it from within an IPython notebook. And you can embed and/or share your program visualizations with others - the Online Python Tutor generates embed code or a URL. Wow.

learnpython.org

An interactive Python tutorial.

Wikibooks Non-programmers tutorial for Python

Aimed at folks with little to no programming experience.

Learn Python the Hard Way

Free HTML “book” with optional videos for purchase.

Think Python: How to think like a computer scientist

Free eBook.

Codecademy Python tutorial

I haven’t checked out this tutorial much myself but have had some students mention that they were using it to learn Python.

Python books

List of Python books compiled by the Python people.

Learning Python (Lutz)

Nice first book for learning Python.

Practical Computing for Biologists

Terrific book that covers all kinds of practical ground for becoming a more effective scientific programmer. They use Python. Someone needs to write “Practical Computing for Business Analysts”.

A Primer on Scientific Programming with Python

Hardcover and pricey, but quite good.

7. Open access courses

Udacity - Intro to Computer Science 101

Learn Python while building a search engine

Udacity - Intro to Data Science

Assumes basic knowledge of Python and statistics.

Coursera - Introduction to Data Science

Not so much a Python course as a data science course that uses Python among other tools. I really like the design of the course in terms of topical coverage and the balancing act required in terms of trying to serve both computing and analytics folks. Parallels between sci-computing and business analytics computing are nicely covered.

MIT - A Gentle Introduction to Programming with Python

Part of the MIT OpenCourseWare initiative.

Harvard - Intro to Data Science 109

Top notch data science course. Python used.

Quantitative Economics

A set of lectures on quant econ modeling using Python. Includes info on getting your sci-computing environment set up.

8. Toolbox

There are a huge number of Python packages and tools for doing analytical work. I’m just going to mention a few of the biggies. I want to give a shout out to the very nice compilation of Python data analysis related tutorial links at Python for Data Analysis: The Landscape of Tutorials as I’ve gotten a number of links from this post.

SciPy and numpy and friends

SciPy (pronounced “Sigh Pie”) is a Python-based ecosystem of open-source software for mathematics, science, and engineering. numpy and SciPy are the bedrock of scientific computing in Python, providing a powerful n-dimensional array construct as well as core scientific computing functionality.
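To give a feel for what the n-dimensional array buys you, here’s a small sketch using numpy. The sales figures and variable names are made up for illustration.

```python
import numpy as np

# A 2-D array: each row is a store, each column a quarter (made-up numbers).
sales = np.array([[10.0, 12.0, 9.0, 14.0],
                  [20.0, 18.0, 22.0, 25.0]])

print(sales.shape)                   # (number of rows, number of columns)

store_totals = sales.sum(axis=1)     # sum across the columns: one total per store
quarter_means = sales.mean(axis=0)   # average down the rows: one mean per quarter
with_tax = sales * 1.06              # arithmetic is applied element by element

print(store_totals)
print(quarter_means)
```

No loops are needed. Operations like these work on whole arrays at once, which is a big part of why numpy feels natural to people coming from spreadsheet ranges.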

pandas - Python Data Analysis Library

A must-have library for doing data crunching in Python. Developed by Wes McKinney. Pandas includes DataFrames and Series data structures and all kinds of methods for working with them. It includes tools for data IO, data prep, data transformation, data modeling, data analysis, data visualization, and data presentation. It uses numpy and matplotlib under the hood.
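Here’s a tiny sketch of the kind of thing you might otherwise do with an Excel pivot table. The data and column names are invented for illustration.

```python
import pandas as pd

# A small DataFrame of orders (made-up data).
orders = pd.DataFrame({
    "region": ["East", "West", "East", "West"],
    "units": [5, 3, 7, 4],
    "price": [2.0, 4.0, 2.0, 4.0],
})

# Add a computed column, much like filling a formula down a spreadsheet column.
orders["revenue"] = orders["units"] * orders["price"]

# Total revenue per region; the result is a Series indexed by region.
revenue_by_region = orders.groupby("region")["revenue"].sum()
print(revenue_by_region)
```

Each column of the DataFrame is a Series, and `groupby` followed by an aggregation such as `sum` is the workhorse pattern for summarizing data.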

IPython

An architecture and toolset for doing interactive computing with Python. It was developed by Fernando Perez and others. I love IPython. It’s terrific for interactively hacking around and doing analysis with Python. The interface is reminiscent of tools like Matlab or Mathematica. There’s a browser-based tool called an IPython Notebook that, well, lets you use IPython right from a browser window. The ability to mix runnable Python code with markdown text makes IPython great for developing tutorials and documenting analysis workflows. It’s a must-have.

matplotlib

Matplotlib is a widely used plotting library. It’s very powerful and has great documentation with tons of examples. It was developed by John Hunter who sadly passed away in 2012. Part of Python’s rise as a force in the sci-computing world is certainly due to matplotlib.

matplotlib is a python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.

matplotlib can be used in python scripts, the python and ipython shell (ala MATLAB® or Mathematica®), web application servers, and six graphical user interface toolkits.

You can use matplotlib as an interactive plotting tool (à la Matlab) using its pyplot interface or in more of an API-based programmatic way. Here are some good links to matplotlib tutorials that I pulled directly from A. Dasgupta’s “Python for Data Analysis: The Landscape of Tutorials”:

scikit-learn

Tools for all kinds of “machine learning” algorithms and tools for classification, clustering, regression, dimensionality reduction, model selection and data preprocessing. It’s built on top of tools like numpy, SciPy, and matplotlib.

statsmodels

This is a Python package that contains all kinds of statistical analysis functionality. It’s not as comprehensive as R but has the advantage of being part of a general programming language, Python, instead of a domain specific language like R. Regarding R on the data processing side, I find myself preferring Python and then using R for the stats. However, I do also end up doing a fair bit of data prep work in R during the course of an analysis (e.g. output dataframe needing some transforming or reshaping for next stage of analysis). Having a full-fledged programming language with strong support for data prep (including regex) is really nice. I also find myself doing more and more statistical work in Python, especially when I’m embedding some kind of statistical model in some sort of analytical application or tool. Bottom line, it’s worth knowing both for both data handling and stats. The R and Python+(various goodies) worlds are colliding:

Sphinx

Sphinx is a reStructuredText based documentation generation system for Python. It makes writing documentation fun.

Sphinx is a tool that makes it easy to create intelligent and beautiful documentation, …

reStructuredText is an easy-to-read, what-you-see-is-what-you-get plaintext markup syntax and parser system. reST is part of docutils, an open source text processing system that takes documentation written in plain text and converts it to useful formats like HTML, PDF, LaTeX, and others. The official Python documentation is written in reST and reST is used for docstrings in Python code. Docstrings are just literal string comments included in source code that can be read, parsed, and used to automatically generate documentation. When you write docstrings, you write them in reST. Once you get used to it, you’ll wonder how you ever did without it. While Sphinx and reST are ubiquitous in the Python doc world, you can use them for general documentation and writing. I wrote documentation for a Java based computer simulation model using Sphinx and reST. I generated HTML but could have generated other output formats as well. Best of all, you just write the documentation in a text editor (many of which are reST aware). It’s genius.
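To make the docstring idea concrete, here’s a minimal sketch of a function documented with reST field lists (`:param:` and `:returns:`), a style Sphinx understands. The function itself, `reorder_point`, is a made-up business example and not from any library.

```python
def reorder_point(daily_demand, lead_time_days, safety_stock=0.0):
    """Compute a simple inventory reorder point.

    :param daily_demand: average units sold per day
    :param lead_time_days: days between placing and receiving an order
    :param safety_stock: extra units held as a buffer
    :returns: stock level at which a new order should be placed
    """
    return daily_demand * lead_time_days + safety_stock


# The docstring is stored on the function itself; help() displays the same text.
print(reorder_point.__doc__)
```

Because the docstring lives with the code, the same text serves the person reading the source, the person calling `help(reorder_point)` at an interactive prompt, and a documentation tool that extracts it.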

PyTables

Built on top of HDF5 - Hierarchical Data Format, PyTables provides tools for working with large datasets in Python.

PyTables is a package for managing hierarchical datasets and designed to efficiently and easily cope with extremely large amounts of data.

9. Musings on Python, open source software and reproducible analysis

NumFocus: Open Code, Better Science

A non-profit foundation dedicated to supporting better sci-computing.

Mission: to promote accessible and reproducible computing in science and technology.

Fernando Perez’s blog on open and literate scientific programming

The developer of IPython writes eloquently about these topics.

Living in an Ivory Basement

C. Titus Brown is a computational biologist at Michigan State University. He’s got a great blog on scientific computing with lots of Python thrown in.

Wes McKinney

The pandas developer who has recently launched a new data analytics venture.

yhat (pronounced “why hat”)

This is a company focused on providing a cloud platform for analytics. They create interesting tools with and write interesting things about Python and R.

Software Carpentry Blog

Pythonic Perambulations

10 reasons Python rocks for research

Victoria Stodden’s bloggings on reproducible research

10. General Python resources

Official Python site

PyPI - the Python Package Index

The Python Package Index is a repository of software for the Python programming language. It’s the CRAN of the Python world.

pyvideo.org

Index for Python related videos

StackOverflow

StackOverflow is your friend. It’s a very good Q&A site for all things programmatical. You can learn a lot just by browsing threads of interest. It’s the first place I go to when trying to figure out some Pythonic thing. It uses tags to help you find things.

Stack Overflow is a question and answer site for professional and enthusiast programmers. It’s 100% free, no registration required.

11. The Python packaging ecosystem

This is not a beginner topic but you’ll have to deal with it eventually. The following two articles will lead you into the morass.

A non-magical introduction to Pip and Virtualenv for Python beginners

Python Packaging: Hate, hate, hate everywhere
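To get a taste of what an isolated environment is before wading into those articles, here’s a rough sketch. Note the swap: it uses the standard-library `venv` module (Python 3.3 and later) as a stand-in for the third-party virtualenv tool the articles discuss. The directory name `analytics-env` is made up.

```python
import os
import tempfile
import venv

# Create a throwaway isolated environment inside a temporary directory.
env_dir = os.path.join(tempfile.mkdtemp(), "analytics-env")
venv.create(env_dir, with_pip=False)  # skipping pip keeps this sketch quick

# Each environment gets a pyvenv.cfg file recording its base interpreter.
config_path = os.path.join(env_dir, "pyvenv.cfg")
print(os.path.exists(config_path))
```

For real work you’d want pip available inside the environment so you can install packages into it without touching your system Python.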

Citation

BibTeX citation:
@online{isken2013,
  author = {Mark Isken},
  title = {Learning {Python} - Suggestions and Resources for Business
    Analytics Students and Professionals},
  date = {2013-12-18},
  langid = {en}
}
For attribution, please cite this work as:
Mark Isken. 2013. “Learning Python - Suggestions and Resources for Business Analytics Students and Professionals.” December 18, 2013.