Introduction to Pandas (Python data analysis toolkit)

Pandas is one of the most useful data analysis libraries in Python. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.
pandas is well suited for many different kinds of data:
  • Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
  • Ordered and unordered (not necessarily fixed-frequency) time-series data
  • Arbitrary matrix data with row and column labels (homogeneously typed or heterogeneous)
  • Any other form of observational/statistical data sets
Let's look at the two key data structures in Pandas: Series and DataFrame.

Introduction to Series and Dataframes

A Series can be understood as a one-dimensional labeled (indexed) array. You can access individual elements of a Series through these labels.
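As a minimal sketch of label-based access, the snippet below builds a small Series and reads elements by label. The fruit names and prices are made up purely for illustration.

```python
import pandas as pd

# A Series with string labels as its index (illustrative values only).
prices = pd.Series([1.5, 2.0, 3.25], index=["apple", "banana", "cherry"])

# Access a single element through its label.
banana_price = prices["banana"]

# Access several elements at once by passing a list of labels.
subset = prices[["apple", "cherry"]]

print(banana_price)
print(subset)
```

If no index is given, pandas falls back to integer labels starting at 0, so the Series then behaves much like an ordinary array.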
A DataFrame is similar to an Excel worksheet: columns are referred to by column names, and rows can be accessed by row number. The essential difference is that in a DataFrame the column names and row numbers are known as the column index and the row index.
Series and DataFrames form the core data model for Pandas in Python. Data sets are first read into DataFrames, and then various operations (e.g. group by, aggregation) can be applied very easily to their columns.
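The workflow described above can be sketched roughly as follows. The table is built in memory with made-up city and sales figures so the example is self-contained; in practice the data would more likely come from a file through a reader such as pd.read_csv.

```python
import pandas as pd

# A small DataFrame with hypothetical data; each dict key becomes a column.
df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi", "Mumbai"],
    "year": [2015, 2015, 2016, 2016],
    "sales": [100, 150, 120, 180],
})

# The column index holds the column names, the row index holds the row labels.
column_index = list(df.columns)
row_index = list(df.index)

# Group the rows by city and aggregate the sales column.
total_by_city = df.groupby("city")["sales"].sum()

print(column_index)
print(row_index)
print(total_by_city)
```

Selecting a single column, as in df["sales"], returns a Series, which is how the two structures fit together.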
Here are just a few of the things that pandas does well:
  • Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
  • Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
  • Easy conversion of ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  • Intuitive merging and joining of data sets
  • Flexible reshaping and pivoting of data sets
  • Hierarchical labeling of axes (possible to have multiple labels per tick)
  • Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving/loading data from the ultrafast HDF5 format
  • Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.
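A few of the points in the list above (automatic alignment, missing data handling, and merging) can be illustrated with a short sketch. All labels, column names, and values here are hypothetical.

```python
import pandas as pd

# Two Series whose labels only partly overlap.
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Automatic alignment: values are matched by label, not by position.
# The label "x" has no partner in b, so the result there is NaN.
total = a + b

# Missing data handling: count the NaN entries, or treat a missing
# partner as 0 while adding.
n_missing = int(total.isna().sum())
filled = a.add(b, fill_value=0)

# Merging: an inner join of two tables on a shared key column.
left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [0.5, 0.7, 0.9]})
merged = left.merge(right, on="id", how="inner")

print(total)
print(filled)
print(merged)
```

Because alignment is by label, the order of the entries in the two Series does not matter. Only the labels present in both contribute a non-missing sum.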
