Pandas for Beginners: Getting Started¶
Pandas is a high-level data manipulation tool developed by Wes McKinney. It is built on the NumPy package, and its key data structure is called the DataFrame. DataFrames allow you to store and manipulate tabular data in rows of observations and columns of variables.
If you're new to this, first set up your environment by following our previous posts:
- Getting Started with Jupyter [Part -1] http://www.androidxu.com/2017/04/guide-On-Jupyter-Notebook.html
- Getting Started with Jupyter [Part -2] http://www.androidxu.com/2017/04/the-ultimate-guide-on-jupyter-ipython-mardown.html#.WPJOBYVOL4g
pandas is well suited for many different kinds of data:
- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
- Ordered and unordered (not necessarily fixed-frequency) time series data.
- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
- Any other form of observational / statistical data sets. The data need not be labeled at all to be placed into a pandas data structure
Here are a few of the things that pandas does well:
- Easy handling of missing data
- Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
- Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the data can be aligned automatically
- Powerful, flexible group by functionality to perform split-apply-combine operations on data sets
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
- Intuitive merging and joining data sets
- Flexible reshaping and pivoting of data sets
- Hierarchical labeling of axes
- Robust IO tools for loading data from flat files, Excel files, databases, and HDF5
- Time series functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging, etc.
documentation: http://pandas.pydata.org/pandas-docs/stable/10min.html
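A few of the features listed above can be sketched in a handful of lines. This is only an illustration: the `city` and `temp_c` columns and their values are made-up sample data, not part of the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one row per observation, one column per variable.
df = pd.DataFrame({
    'city': ['Oslo', 'Oslo', 'Lima', 'Lima'],
    'temp_c': [4.0, np.nan, 19.0, 21.0],
})

# Missing data: count the NaN entries, then fill them with the column mean.
n_missing = df['temp_c'].isna().sum()
filled = df['temp_c'].fillna(df['temp_c'].mean())

# Split-apply-combine: mean temperature per city (NaN is skipped by default).
mean_by_city = df.groupby('city')['temp_c'].mean()
print(mean_by_city)
```

The rest of this post concentrates on Series, the one-dimensional building block behind each DataFrame column.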
Series¶
A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.
documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html
In [38]:
# Import the pandas and NumPy libraries
import pandas as pd
import numpy as np
Create series from NumPy array¶
Create a basic series from a NumPy array. The number of labels in 'index' must be the same as the number of elements in the array.
In [39]:
my_simple_series = pd.Series(np.random.randn(7), index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
my_simple_series
Out[39]:
In [40]:
my_simple_series.index
Out[40]:
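To see why the number of labels has to match the number of elements, here is a small sketch with fixed values (the variable names are illustrative). With a mismatched index, pandas should raise a ValueError instead of guessing how to line things up.

```python
import numpy as np
import pandas as pd

values = np.array([1.5, 2.5, 3.5])

# Matching lengths: three values, three labels.
ok = pd.Series(values, index=['a', 'b', 'c'])

# Mismatched lengths: three values but only two labels.
try:
    pd.Series(values, index=['a', 'b'])
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)
```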
Create series from NumPy array, without explicit index¶
In [41]:
my_simple_series = pd.Series(np.random.randn(5))
my_simple_series
Out[41]:
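When no index is passed, pandas generates a default integer index starting at 0. A minimal sketch with fixed values instead of random ones:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.array([10, 20, 30]))

# The default index simply numbers the elements 0, 1, 2, ...
print(s.index)
labels = list(s.index)
```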
Access a series like a NumPy array
In [42]:
my_simple_series[:3]
Out[42]:
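Beyond plain slicing, a Series supports other NumPy-style access patterns. The sketch below uses fixed values and `.iloc`, which is the explicit way to ask for positional slicing; the variable names are only for illustration.

```python
import pandas as pd

s = pd.Series([3.0, 1.0, 4.0, 1.0, 5.0], index=['a', 'b', 'c', 'd', 'e'])

# Positional slice, like ndarray[:3]; the labels travel with the values.
first_three = s.iloc[:3]

# Boolean mask, like ndarray[ndarray > 2.0].
above_two = s[s > 2.0]

# Reductions work the same way as on an ndarray.
total = s.sum()

# The underlying data is available as a plain NumPy array.
as_array = s.to_numpy()
```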
Create series from Python dictionary¶
In [43]:
my_dictionary = {'a' : 45., 'b' : -19.5, 'c' : 4444}
my_second_series = pd.Series(my_dictionary)
my_second_series
Out[43]:
Access a series like a dictionary
In [44]:
my_second_series['b']
Out[44]:
Note the order in the display: it is the same as the order in "index".
Note also the NaN for the label 'd', which has no entry in the dictionary.
In [45]:
pd.Series(my_dictionary, index=['b', 'c', 'd', 'a'])
Out[45]:
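Labels that are requested in the index but missing from the dictionary end up as NaN. One way to find or drop them afterwards, sketched with the same dictionary:

```python
import pandas as pd

my_dictionary = {'a': 45., 'b': -19.5, 'c': 4444}
s = pd.Series(my_dictionary, index=['b', 'c', 'd', 'a'])

# True where the value is missing (here only for label 'd').
missing = s.isna()

# A copy of the series without the missing entries.
complete = s.dropna()
```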
In [46]:
my_second_series.get('a')
Out[46]:
In [47]:
unknown = my_second_series.get('f')
type(unknown)
Out[47]:
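The difference between `get` and square-bracket access shows up when a label does not exist. A short sketch (the fallback value 0.0 is an arbitrary choice for illustration):

```python
import pandas as pd

s = pd.Series({'a': 45., 'b': -19.5, 'c': 4444})

# Membership test against the index, like a dictionary.
has_f = 'f' in s

# get() returns None for an unknown label, or a fallback if one is given.
missing = s.get('f')
fallback = s.get('f', 0.0)

# Square brackets raise KeyError for an unknown label.
try:
    s['f']
    raised = False
except KeyError:
    raised = True
```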
Create series from scalar¶
If data is a scalar value, an index must be provided. The value will be repeated to match the length of the index.
In [48]:
pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
Out[48]:
Vectorized Operations¶
- It is not necessary to write loops for element-by-element operations
- pandas Series objects can be passed to most NumPy functions
In [49]:
my_dictionary = {'a' : 45., 'b' : -19.5, 'c' : 4444}
my_series = pd.Series(my_dictionary)
my_series
Out[49]:
Add Series without loop¶
In [50]:
my_series + my_series
Out[50]:
In [51]:
my_series
Out[51]:
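As a rough comparison, the addition above does the same work as an explicit loop over the labels, and it returns a new Series rather than modifying the original. The loop version is shown only to make the point; the vectorized form is the one to use.

```python
import pandas as pd

s = pd.Series({'a': 45., 'b': -19.5, 'c': 4444})

# Vectorized: one expression, no loop.
doubled = s + s

# Equivalent explicit loop over the labels, for comparison only.
looped = pd.Series({label: s[label] + s[label] for label in s.index})

print(doubled.equals(looped))

# The original series is left untouched.
print(s)
```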
Series within arithmetic expression¶
In [52]:
# Add a scalar to every element of the series
my_series + 5
Out[52]:
Series used as argument to NumPy function¶
In [53]:
np.exp(my_series)
Out[53]:
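A NumPy function applied to a Series gives back a Series with the same index, not a bare array. A small sketch with values chosen so the results are easy to check:

```python
import numpy as np
import pandas as pd

s = pd.Series({'a': 1.0, 'b': 4.0, 'c': 9.0})

# The result keeps the labels 'a', 'b', 'c'.
roots = np.sqrt(s)
print(roots)
```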
A key difference between Series and ndarray is that operations between Series automatically align the data based on label. Thus, you can write computations without giving consideration to whether the Series involved have the same labels. A label found in only one of the operands gets NaN in the result.
In [54]:
my_series[1:]
Out[54]:
In [55]:
my_series[:-1]
Out[55]:
In [56]:
my_series[1:] + my_series[:-1]
Out[56]:
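The alignment can be checked by hand with the same three values. In this sketch only 'b' appears in both slices, so it is the only label with a numeric sum. `Series.add` with `fill_value=0` is one option if you would rather treat a missing operand as zero.

```python
import pandas as pd

s = pd.Series({'a': 45., 'b': -19.5, 'c': 4444})

tail = s.iloc[1:]    # labels 'b', 'c'
head = s.iloc[:-1]   # labels 'a', 'b'

# Values are matched by label, not by position.
aligned = tail + head
print(aligned)

# Treat a label missing from one side as 0 instead of producing NaN.
filled = tail.add(head, fill_value=0)
print(filled)
```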
Apply Python functions on an element-by-element basis¶
In [57]:
def multiply_by_ten(input_element):
    return input_element * 10.0
In [58]:
my_series.map(multiply_by_ten)
Out[58]:
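For simple arithmetic like this, `map` and the vectorized expression give the same result. `map` is more useful when the function cannot be written as array arithmetic. A quick check:

```python
import pandas as pd

def multiply_by_ten(input_element):
    return input_element * 10.0

s = pd.Series({'a': 45., 'b': -19.5, 'c': 4444})

# Element-by-element call of a Python function.
mapped = s.map(multiply_by_ten)

# Same computation as a single vectorized expression.
vectorized = s * 10.0

print(mapped.equals(vectorized))
```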
Vectorized string methods¶
Series is equipped with a set of string processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods exclude missing/NA values automatically.
In [59]:
series_of_strings = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
In [60]:
series_of_strings.str.lower()
Out[60]:
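A few more `.str` methods on the same data show how the missing value is handled: it stays missing in the result instead of raising an error. The `na=False` argument is one way to get a clean boolean mask.

```python
import numpy as np
import pandas as pd

series_of_strings = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

# The missing entry is skipped and stays missing in the result.
lowered = series_of_strings.str.lower()

# Length of each string; the missing entry has no length.
lengths = series_of_strings.str.len()

# Boolean test; na=False makes the missing entry count as "no match".
starts_with_a = series_of_strings.str.startswith('A', na=False)

# The mask can then be used to select rows.
print(series_of_strings[starts_with_a])
```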