Learning Python - suggestions and resources for business analytics students and professionals
This post is hilariously old and outdated but it’s fun to look back
As I’ve been learning and using Python more and more for business analytics’ey things that I previously would have done in Excel or Access with some VBA sprinkled in, I thought it would be a good idea to put together a list of some of the resources I’ve found to be useful. As you start to read this, you’ll notice that there are quite a few references to “scientific computing” and “scientists”. There are many similarities between the scientific computing (as done by computing inclined scientists) and the business analytics computing done by business analysts. Both are communities of folks whose full time job is NOT to develop software but who do end up using sophisticated computing tools and doing a fair bit of computer programming. Both are dealing with increasingly large and varied data sets. Both need to worry about efficient and reproducible workflows. Many in both camps do end up developing software tools. Both are increasingly turning to tools like Python and R. Both need good resources for learning to effectively use such tools. Business analytics people are very comfortable creating crazy big Excel spreadsheets, dabbling in SQL, and maybe doing some VBA programming. The parallel world of tools like Python and R requires some getting used to but is well worth the investment in time (IMHO).
Contents
- What is Python and why is it called Python?
- Which Python?
- Setting up your sci-computing Python environment
- Writing Python programs
- Collections of links to tutorials
- General and data science focused Python tutorials and [e]Books
- Open access courses
- Toolbox
- Musings on Python, open source software and reproducible analysis
- General Python resources
- The Python packaging ecosystem
1. What is Python and why is it called Python?
2. Which Python?
As you get started learning Python, it’s unavoidable that you’ll run into the Python 2 vs Python 3 question. For now, I’m still using Python 2.7.3. Before downloading Python from python.org, read the next section on setting up your sci-computing Python environment.
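If you’re not sure which Python you already have, a quick check along these lines will tell you. This is just a minimal sketch; the function name `describe_python` is made up for illustration and the code is written to run on either the 2.x or 3.x line.

```python
import sys


def describe_python(version_info=sys.version_info):
    """Return a short label for the given (major, minor, ...) version tuple."""
    major, minor = version_info[0], version_info[1]
    return "Python {}.{} ({}.x line)".format(major, minor, major)


# Report the interpreter that is actually running this script.
print(describe_python())
```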
3. Setting up your sci-computing Python environment
While you can certainly install Python and then separately add the various modules needed to get yourself set up to do some analytics or sci-computing, it’s easier to install one of the prepackaged Python distributions containing the SciPy Stack. The SciPy Stack specification contains the minimal set of Python packages one should have for sci-computing and includes things like Python itself, numpy, scipy, matplotlib, pandas, IPython, SymPy, and nose. Of course, there are a zillion more useful packages, some of which I’ll mention below. I have had good luck with the Enthought Canopy Python distribution for scientific computing. It includes not only the SciPy Stack but a ton of other useful packages. In addition, it’s got an integrated update manager that makes it easy to keep all your sci-computing packages up to date. Since Canopy uses virtualenv to create its own virtual Python environment, local users can handle all their own updates and you don’t mess around with your core Python system. Canopy also has a built in code editor and embedded IPython console.
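Whichever distribution you pick, it’s handy to confirm that the SciPy Stack packages are actually importable. Here’s a rough sketch of one way to do that. It assumes Python 3.4 or later (for `importlib.util.find_spec`), and the helper name `check_packages` is hypothetical. Note that the list uses import names, which are case sensitive (`IPython`, `sympy`).

```python
import importlib.util

# Import names for the SciPy Stack packages listed above.
SCIPY_STACK = ["numpy", "scipy", "matplotlib", "pandas", "IPython", "sympy", "nose"]


def check_packages(names):
    """Map each top-level import name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Print a simple one-line-per-package report.
for name, found in check_packages(SCIPY_STACK).items():
    print("{:12s} {}".format(name, "ok" if found else "MISSING"))
```

Anything reported as MISSING can then be added through your distribution’s package manager.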
Recently (Feb 2014) I installed another leading sci-computing Python distro - Anaconda from Continuum Analytics. It looks every bit as good as Canopy and I’m looking forward to working with it.
4. Writing Python programs
Now that you’ve got Python installed, what tools will you use to write Python programs? Also, since Python can be used in “interactive mode”, you’ll also want some tools for doing that. Let’s look at tools for interactive use first since that’s a natural way to start to learn to program in Python.
Using Python interactively
Python ships with a bare bones interactive console called IDLE. A much better interactive console is called IPython. It is an “architecture and toolset” for doing interactive computing with Python. I love IPython. It’s terrific for interactively hacking around and doing analysis with Python. The interface is reminiscent of tools like Matlab or Mathematica. There’s a browser based tool called an IPython Notebook that, well, lets you use IPython right from a browser window. The ability to mix runnable Python code with markdown text makes IPython great for developing tutorials and documenting analysis workflows. It’s a must-have. You can find links to several resources for learning IPython further down in this document. Both Canopy and Anaconda have various IPython shells built in.
Text editors and IDEs for writing Python programs
While certainly you can just write Python code in a text editor, I like a good IDE. For Python on Windows, a nice lightweight IDE is PyScripter. I’ve also been using Spyder on both Windows and Linux (though it doesn’t have a visual debugger, just a command line debugger). An IDE that looks promising is the new community edition of PyCharm, which I just installed a few days ago. I’ve also been using the editor built into Canopy. If you search StackOverflow (and Google in general) for a discussion on Python IDEs, you get a good sense of the landscape. Lots of folks are quite happy with editors like vim and emacs. Many editors have syntax highlighting including widely used tools like Notepad++ on Windows and gedit on Linux. For those used to Eclipse as their Java IDE, there is a plug-in called PyDev. However, Eclipse is a bit of overkill for cranking out beginner Python code. I’d start with something like PyScripter or PyCharm or Spyder. I’m not sure about the first two, but Spyder has both IDLE and IPython interactive consoles embedded in it. Also, get comfortable with IPython Notebooks as they are really useful for creating your own “tutorials” that contain a mixture of interactive code and text written in markdown. My Python tutorials on hselab.org were done this way and there are links to the actual notebooks hosted on GitHub.
5. Collections of links to tutorials
Collection of tutorials links for new Python programmers From the official Python.org site.
The Best Way to Learn Python Kind of a meta-tutorial that guides you to a bunch of web based resources for learning Python.
Python for Data Analysis: The Landscape of Tutorials Another nice collection of links to tutorials and resources for learning Python for doing data analysis.
6. General and data science focused Python tutorials and [e]Books
The Official Python Tutorial Straight from the source.
This is a fantastic resource aimed at teaching scientists how to be better software developers. In their words:
We run bootcamps all over the world, and provide open access material for self-paced instruction.
Their Lessons section has a large number of scientific computing tutorials which include screencasts and slides. The Python section is just one of many. Other topics include things like version control, the Shell, regular expressions, databases, and many, many more.
Nice collection of IPython notebooks to provide intro to using Python for data science work. From y-hat.
Getting Started with Python for Data Science
This is from the folks at kaggle.com who have turned data mining into a combination of a sport and community activity.
Simply amazing web based tool that lets you visualize Python program execution. You can even use it from within an IPython notebook. And you can embed and/or share your program visualizations with others - the Online Python Tutor generates embed code or a URL. Wow.
learnpython.org An interactive Python tutorial.
Wikibooks Non-programmers tutorial for Python
Aimed at folks with little to no programming experience.
Free HTML “book” with optional videos for purchase.
Think Python: How to think like a computer scientist
Free eBook.
I haven’t checked out this tutorial much myself but have had some students mention that they were using it to learn Python.
Python books List of Python books compiled by the Python people.
Nice first book for learning Python.
Practical Computing for Biologists
Terrific book that covers all kinds of practical ground for becoming a more effective scientific programmer. They use Python. Someone needs to write “Practical Computing for Business Analysts”.
A Primer on Scientific Programming with Python
Hardcover and pricey, but quite good.
7. Open access courses
Udacity - Intro to Computer Science 101
Learn Python while building a search engine
Udacity - Intro to Data Science
Assumes basic knowledge of Python and statistics.
Coursera - Introduction to Data Science
Not so much a Python course as a data science course that uses Python among other tools. I really like the design of the course in terms of topical coverage and the balancing act required in terms of trying to serve both computing and analytics folks. Parallels between sci-computing and business analytics computing are nicely covered.
MIT - A Gentle Introduction to Programming with Python
Part of the MIT Open Courseware initiative
Harvard - Intro to Data Science 109
Top notch data science course. Python used.
A set of lectures on quant econ modeling using Python. Includes info on getting your sci-computing environment set up.
8. Toolbox
There are a huge number of Python packages and tools for doing analytical work. I’m just going to mention a few of the biggies. I want to give a shout out to the very nice compilation of Python data analysis related tutorial links at Python for Data Analysis: The Landscape of Tutorials as I’ve gotten a number of links from this post.
SciPy and numpy and friends
SciPy (pronounced “Sigh Pie”) is a Python-based ecosystem of open-source software for mathematics, science, and engineering. numpy and SciPy are the bedrock of scientific computing in Python providing a powerful n-dimensional array construct as well as core scientific computing functionality.
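To give a feel for that n-dimensional array construct, here’s a tiny sketch using a 2-D array. The sales figures are invented for illustration; think of it as a small spreadsheet range with products in rows and quarters in columns.

```python
import numpy as np

# Hypothetical data: units sold for 3 products (rows) over 4 quarters (columns).
sales = np.array([
    [120, 135, 150, 160],
    [80, 75, 90, 95],
    [200, 210, 190, 220],
])

print(sales.shape)         # (number of rows, number of columns)
print(sales.sum(axis=1))   # annual total for each product (sum across columns)
print(sales.mean(axis=0))  # average across products for each quarter
print(sales * 1.1)         # elementwise arithmetic: a 10% uplift on every cell
```

The `axis` argument plays the role of choosing whether you total across a row or down a column in Excel, and the last line shows that arithmetic applies to the whole array at once with no loop or fill-down.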
- SciPy2012 Conference - the conference for the SciPy world.
pandas - Python Data Analysis Library
A must-have library for doing data crunching in Python. Developed by Wes McKinney. Pandas includes DataFrames and Series data structures and all kinds of methods for working with them. It includes tools for data IO, data prep, data transformation, data modeling, data analysis, data visualization, and data presentation. It uses numpy and matplotlib under the hood.
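For those coming from Excel or SQL, a small sketch may help show what DataFrames and Series feel like in practice. The order data and column names below are made up for illustration.

```python
import pandas as pd

# Hypothetical order data, the kind of thing that might live in an Excel sheet.
orders = pd.DataFrame({
    "region": ["East", "West", "East", "South", "West"],
    "product": ["A", "A", "B", "B", "A"],
    "units": [10, 4, 7, 3, 8],
    "unit_price": [2.5, 2.5, 4.0, 4.0, 2.5],
})

# A new column computed from existing ones, much like a spreadsheet formula.
orders["revenue"] = orders["units"] * orders["unit_price"]

# Pulling out a single column gives you a Series.
units = orders["units"]

# Total revenue by region, similar to a pivot table or a SQL GROUP BY.
revenue_by_region = orders.groupby("region")["revenue"].sum()

print(orders)
print(revenue_by_region)
```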
pandas cookbook - “This is a repository for short and sweet examples and links for useful pandas recipes.”
Wes McKinney’s 10 minute pandas overview vid - get a quick demo from pandas founder. Longer tutorials from WM can be found here and here.
Intro to pandas data structures - nice introductory tutorial aimed at those coming from the SQL world.
Statistical analysis with SciPy and pandas - an IPython notebook based tutorial on doing stats using Python + SciPy + pandas
IPython
An architecture and toolset for doing interactive computing with Python. It was developed by Fernando Perez and others. I love IPython. It’s terrific for interactively hacking around and doing analysis with Python. The interface is reminiscent of tools like Matlab or Mathematica. There’s a browser based tool called an IPython Notebook that, well, lets you use IPython right from a browser window. The ability to mix runnable Python code with markdown text makes IPython great for developing tutorials and documenting analysis workflows. It’s a must-have.
- A very good place to start.
A collection of notebooks for using IPython effectively
- Become an effective IPython user.
IPython Notebook Viewer -“IPython Notebook Viewer (or nbviewer in short) is a free webservice that allows you to see static html versions of hosted notebook files. As long as a notebook is publicly available, by giving its url to nbviewer you should be able to view it.”
A gallery of interesting IPython notebooks - Hosted at github, this is a HUGE, curated collection of notable IPython notebooks on many sci-computing related topics. Some are companions to books, some are tutorials, some are demonstrations, some are...
Learning IPython for interactive computing and visualization - A physical book.
- You can find a bunch of great videos and slides on the development, evolution and how-to use of IPython at http://ipython.org/videos.html
matplotlib
Matplotlib is a widely used plotting library. It’s very powerful and has great documentation with tons of examples. It was developed by John Hunter who sadly passed away in 2012. Part of Python’s rise as a force in the sci-computing world is certainly due to matplotlib.
matplotlib is a python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.
matplotlib can be used in python scripts, the python and ipython shell (ala MATLAB® or Mathematica®), web application servers, and six graphical user interface toolkits.
You can use matplotlib as an interactive plotting tool (ala Matlab) using its pyplot command or in more of an API based programmatic way. Here are some good links to matplotlib tutorials that I pulled directly from A. Dasgupta’s “Python for Data Analysis: The Landscape of Tutorials”:
scikit-learn
Tools for all kinds of “machine learning” algorithms and tools for classification, clustering, regression, dimensionality reduction, model selection and data preprocessing. It’s built on top of tools like numpy, SciPy, and matplotlib.
statsmodels
This is a Python package that contains all kinds of statistical analysis functionality. It’s not as comprehensive as R but has the advantage of being part of a general programming language, Python, instead of a domain specific language like R. Regarding R on the data processing side, I find myself preferring Python and then using R for the stats. However, I do also end up doing a fair bit of data prep work in R during the course of an analysis (e.g. output dataframe needing some transforming or reshaping for next stage of analysis). Having a full-fledged programming language with strong support for data prep (including regex) is really nice. I also find myself doing more and more statistical work in Python, especially when I’m embedding some kind of statistical model in some sort of analytical application or tool. Bottom line, it’s worth knowing both for both data handling and stats. The R and Python+(various goodies) worlds are colliding:
- Using R from Python is pretty easy via Rpy2
- Go the other direction with rPython
- Closed but good discussion on StackExchange on R vs Python for data analysis
- ggplot2 (R) is usually considered somewhat better than matplotlib (mainly due to “grammar of graphics” foundation of ggplot2) but y-hat released a ggplot-like plotting engine for Python.
- pandas and statsmodels are Python packages that are terrific for stats and data analysis
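As a small illustration of the regex-flavored data prep mentioned above, here’s a sketch that cleans up messy currency strings using only the standard library `re` module. The sample values and the helper name `parse_amount` are invented for this example, and treating parentheses as negatives is an assumption borrowed from accounting-style formatting.

```python
import re

# Hypothetical messy values, as they might arrive in a spreadsheet export.
raw_amounts = ["$1,250.00", " 980 ", "USD 3,400.5", "n/a", "(75.25)"]

# One digit, then any run of digits or commas, then an optional decimal part.
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_amount(text):
    """Return the amount as a float, or None if the text contains no number.

    A value wrapped in parentheses, e.g. "(75.25)", is read as negative.
    """
    match = NUMBER.search(text)
    if match is None:
        return None
    value = float(match.group().replace(",", ""))
    return -value if text.strip().startswith("(") else value


cleaned = [parse_amount(item) for item in raw_amounts]
print(cleaned)
```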
Sphinx
Sphinx is a reStructuredText based documentation generation system for Python. It makes writing documentation fun.
Sphinx is a tool that makes it easy to create intelligent and beautiful documentation, …
reStructuredText is an easy-to-read, what-you-see-is-what-you-get plaintext markup syntax and parser system. reST is part of docutils, an open source text processing system that takes documentation written in plain text and converts it to useful formats like HTML, PDF, LaTeX, and others. The official Python documentation is written in reST and reST is used for docstrings in Python code. Docstrings are just literal string comments included in source code that can be read, parsed, and used to automatically generate documentation. When you write docstrings, you write them in reST. Once you get used to it, you’ll wonder how you ever did without it. While Sphinx and reST are ubiquitous in the Python doc world, you can use them for general documentation and writing. I wrote documentation for a Java based computer simulation model using Sphinx and reST. I generated HTML but could have generated other output formats as well. Best of all, you just write the documentation in a text editor (many of which are reST aware). It’s genius.
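To make the docstring idea concrete, here’s a minimal sketch of a function documented with reST field lists (`:param:`, `:returns:`, and so on), which is the style Sphinx understands. The function `average_wait` and its parameters are hypothetical, invented just for this example.

```python
def average_wait(total_wait_minutes, num_patients):
    """Compute the mean wait time per patient.

    :param total_wait_minutes: Sum of all wait times, in minutes.
    :type total_wait_minutes: float
    :param num_patients: Number of patients observed; must be positive.
    :type num_patients: int
    :returns: Mean wait time in minutes.
    :rtype: float
    :raises ValueError: If ``num_patients`` is not positive.
    """
    if num_patients <= 0:
        raise ValueError("num_patients must be positive")
    return total_wait_minutes / float(num_patients)


# The docstring is stored on the function itself and is what help() displays.
print(average_wait.__doc__)
```

Sphinx’s autodoc extension can read docstrings like this one and render them as formatted reference pages.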
PyTables
Built on top of HDF5 - Hierarchical Data Format, PyTables provides tools for working with large datasets in Python.
PyTables is a package for managing hierarchical datasets and designed to efficiently and easily cope with extremely large amounts of data.
9. Musings on Python, open source software and reproducible analysis
NumFocus: Open Code, Better Science
A non-profit foundation dedicated to supporting better sci-computing.
Mission: to promote accessible and reproducible computing in science and technology.
Fernando Perez’s blog on open and literate scientific programming
The developer of IPython writes eloquently about these topics.
C. Titus Brown is a computational biologist at Michigan State University. He’s got a great blog on scientific computing with lots of Python thrown in.
The pandas developer who has recently launched a new data analytics venture.
yhat (pronounced “why hat”)
This is a company focused on providing a cloud platform for analytics. They create interesting tools with and write interesting things about Python and R.
10. General Python resources
PyPI - the Python Package Index
The Python Package Index is a repository of software for the Python programming language. It’s the CRAN of the Python world.
Index for Python related videos
StackOverflow is your friend. It’s a very good Q&A site for all things programmatical. You can learn a lot just by browsing threads of interest. It’s the first place I go to when trying to figure out some Pythonic thing. It uses tags to help you find things.
Stack Overflow is a question and answer site for professional and enthusiast programmers. It’s 100% free, no registration required.
11. The Python packaging ecosystem
This is not a beginner topic but you’ll have to deal with it eventually. The following two articles will lead you into the morass.
A non-magical introduction to Pip and Virtualenv for Python beginners
Python Packaging: Hate, hate, hate everywhere
Citation
@online{isken2013,
author = {Mark Isken},
title = {Learning {Python} - Suggestions and Resources for Business
Analytics Students and Professionals},
date = {2013-12-18},
langid = {en}
}