data-science-across-disciplines

Main repository for the Data Science Across Disciplines module offered at the Centre for Interdisciplinary Methodologies at the University of Warwick

View the Project on GitHub cagatayTurkay/data-science-across-disciplines

On Python and Python-related resources

### Why Python?

Python (https://www.python.org/) is a widely used, general-purpose, high-level programming language. It supports multiple programming paradigms, including object-oriented, imperative (procedural) and functional styles. It is powerful yet simple, flexible and extensible.

We will be using a recent version of the language, 3.8. Python has an extensive standard library that provides a lot of functionality out of the box. What makes Python even more powerful is the collection of packages contributed by people around the world. PyPI (the Python Package Index) is the place to get access to many of these packages.
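As a small illustration of what the standard library offers without installing anything, here is a minimal sketch that checks the running Python version and summarises a short piece of text. The sentence and the variable names are invented for this example.

```python
import statistics
import sys
from collections import Counter

# Report the major and minor version of the interpreter that is running.
major, minor = sys.version_info[:2]
print(f"Running Python {major}.{minor}")

# A made-up sentence, split into words.
words = "the quick brown fox jumps over the lazy dog the end".split()

# Counter (from the standard library) tallies how often each word occurs.
word_counts = Counter(words)
most_common_word, count = word_counts.most_common(1)[0]

# statistics (also standard library) gives basic descriptive statistics.
lengths = [len(word) for word in words]
median_length = statistics.median(lengths)

print(f"Most common word: {most_common_word!r} ({count} times)")
print(f"Median word length: {median_length}")
```

Everything imported here ships with Python itself; the packages described below extend this with faster and richer tools.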

What makes Python suitable for data science applications is the great libraries it is equipped with. It has gained popularity in scientific computing and data analysis due to these libraries. A number of very important packages are as follows:

SciPy: SciPy contains modules for optimization, linear algebra, statistics, specialised mathematics, integration, interpolation, signal and image processing, and other tasks common in science and engineering. We will make use of a number of SciPy functions whenever we need this functionality.

NumPy: NumPy adds support for large, multi-dimensional arrays and matrices, along with a large library of high-level mathematical functions to operate on these arrays. Basic data operations are highly efficient with NumPy; we will mostly use it to subset data, make basic mathematical calculations, etc.

Pandas: Pandas is a Python library for data manipulation and analysis. It is great for data wrangling, merging and basic analytical tasks with tabular datasets and time series. It is often seen as Python's answer to the statistical computing language R.

Scikit-learn: Scikit-learn is a package that will provide us with basic machine learning capability. It supports tasks like regression, classification, clustering, dimension reduction, etc.

Statsmodels: Statsmodels will help us to estimate statistical models and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions and result statistics is available for different types of data and for each estimator.

Matplotlib: Matplotlib will provide us with basic plotting and visualisation functionality for a quick look at our data and results.

Seaborn: Seaborn has emerged as a highly popular plotting library that works reasonably well with Pandas data structures. It offers a range of plots and easy-to-adapt examples.

Altair: Altair is gaining popularity as a versatile visualisation library for Python. It is built on top of a visualisation specification language called Vega-Lite. With it you can build complex interactive plots and share them over the web.
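To give a feel for how some of these packages divide the work, here is a minimal sketch that assumes NumPy, Pandas and SciPy are installed (as they are in Anaconda). The toy data on study hours and exam scores, the group labels and all variable names are invented for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy data: hours studied and exam scores for six students.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
scores = np.array([52.0, 55.0, 61.0, 64.0, 70.0, 73.0])

# NumPy: subset an array with a boolean mask and do basic arithmetic.
long_sessions = hours[hours > 3]
mean_score = scores.mean()

# Pandas: hold the same data as a table and summarise it per group.
df = pd.DataFrame(
    {
        "hours": hours,
        "score": scores,
        "group": ["a", "a", "a", "b", "b", "b"],
    }
)
group_means = df.groupby("group")["score"].mean()

# SciPy: fit a simple straight line (score as a function of hours).
fit = stats.linregress(hours, scores)

print("Hours above 3:", long_sessions)
print("Mean score:", mean_score)
print(group_means)
print(f"Estimated slope: {fit.slope:.2f} points per hour")
```

The same table could then be passed to Scikit-learn or Statsmodels for modelling, or to Matplotlib, Seaborn or Altair for plotting.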

Normally, Python does not come with these packages, and you need to download and install them yourself. This process is usually straightforward but can be tricky at times. Luckily, there is a strong Python community that produces easy-to-install packages that are available for download; a very good repository is here. But remember that you need to be careful about which Python version you are using.
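One way to see which of these packages are already available, and for which Python version, is sketched below. It assumes Python 3.8 or later, where `importlib.metadata` is part of the standard library; the helper name `installed_version` and the list of package names are choices made for this example.

```python
import sys
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as they are published on PyPI.
packages = [
    "numpy",
    "scipy",
    "pandas",
    "scikit-learn",
    "statsmodels",
    "matplotlib",
    "seaborn",
    "altair",
]

report = {name: installed_version(name) for name in packages}

print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
for name, version in report.items():
    print(f"{name:>14}: {version or 'not installed'}")
```

Running this inside and outside a packaged environment is a quick way to confirm that you are using the Python installation you think you are.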

In this module, we get around the burden of all these installations by using a pre-packaged Python environment called Anaconda. What comes with Anaconda is the standard Python distribution and most of the necessary packages for data analysis. If you need to read more, the Anaconda documentation pages are a good start.

Although we try to cover the basics in Python programming in this tutorial, some of you, especially those who are new to Python, might benefit from some external tutorials which cover the basics. There are many resources online but here are some good links:

Textbooks

As textbooks for the coding side these two resources are really good:

So you can use these two books in a complementary way to support your learning within the coding labs.

And here are some further books that can help you with your learning.

Online APIs and documentation

You will hear us mention the importance of using online resources and library documentation. For instance, for scikit-learn, I start with the User Guide or the API reference. For Pandas, the user guide is the best place to begin, and the API reference is again a very useful starting point. For seaborn, the API reference is pretty good and comes with built-in examples. Of course, if you are using a different library, you have to rely on what its authors provide online, so the quality may vary. That is the joy of open-source, community-led development, but standards have risen substantially over recent years, so you will often find very good guidance.
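Much of this documentation can also be reached from inside Python, because most libraries ship docstrings with their functions. The sketch below uses a standard-library function as a stand-in so that it runs anywhere; the choice of `statistics.median` is only an example.

```python
import inspect
import statistics

# The signature shows which arguments the function expects.
signature = inspect.signature(statistics.median)

# getdoc returns the cleaned-up docstring written by the authors.
doc = inspect.getdoc(statistics.median)
first_line = doc.splitlines()[0]

print(f"statistics.median{signature}")
print(first_line)

# In an interactive session, help(statistics.median) prints the full text.
```

The same pattern should work for the libraries mentioned above, for example `help(pandas.DataFrame.merge)`, although how detailed the docstring is depends on the library.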