Pandas in Python - DataFrame Tutorial (With Examples)


Creating Data Frames

In the previous post we went through a tutorial on the basics of Series in pandas. In this part of the tutorial we will look at building one of the most important data structures in pandas, the DataFrame. Pandas has an abundance of functionality, far too much to cover in this introduction. We will cover more DataFrame functions in the examples section of the next post!
We hope you are enjoying learning. Our suggestion is to practice by writing and calling the functions yourself until you understand them.


Let's get started!
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it
like a spreadsheet or SQL table, or a dict of Series objects.
Importantly, you can create a DataFrame from any of the following:
  • Dict of 1D ndarrays, lists, dicts, or Series
  • 2-D numpy.ndarray
  • Structured or record ndarray
  • A Series
  • Another DataFrame
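
The dictionary-based inputs are covered by the cells later in this post. As a rough sketch of the remaining inputs in the list above, the cell below builds a DataFrame from a 2-D ndarray, a structured ndarray, a Series, and another DataFrame. The variable names, column labels, and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# 2-D ndarray: pass column labels, otherwise they default to 0, 1, ...
array_df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['x', 'y'])

# Structured ndarray: the field names become the column labels
records = np.array([(1, 2.5), (2, 3.5)], dtype=[('id', 'i4'), ('score', 'f8')])
records_df = pd.DataFrame(records)

# A single Series: its name becomes the column label
from_series_df = pd.DataFrame(pd.Series([10, 20, 30], name='value'))

# Another DataFrame: copy=True makes the new frame independent of the original
copy_df = pd.DataFrame(array_df, copy=True)

print(array_df)
print(records_df)
print(from_series_df)
print(copy_df)
```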

Data Frame attributes
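
As a small sketch of the attributes most often inspected on a DataFrame, the cell below prints the shape, column labels, row index, and column dtypes. The frame itself is a made-up example.

```python
import pandas as pd

# Hypothetical frame used only to show the attributes
df = pd.DataFrame({'AAA': [4, 5, 6, 7], 'BBB': [10, 20, 30, 40]})

print(df.shape)    # (number of rows, number of columns)
print(df.columns)  # column labels
print(df.index)    # row labels
print(df.dtypes)   # data type of each column
```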


Importing the NumPy and pandas libraries

In [61]:
import pandas as pd
import numpy as np


How to create a DataFrame from a Python dictionary

In [62]:
my_dictionary = {'a' : 45.1, 'b' : -19.52, 'c' : 4444}
print(my_dictionary.keys())
print(my_dictionary.values())




['a', 'c', 'b']
[45.1, 4444, -19.52]


We'll call the dictionary my_dictionary. It has three values (45.1, -19.52, and 4444), and its keys are 'a', 'b', and 'c'.
We can print both the keys and the values. Here the keys are printed as a, c, b, and the values follow the same order, so they are the values we entered but not in input order. This notebook was run on an older Python in which dictionaries did not preserve insertion order; on Python 3.7 and later the keys print as a, b, c. Now, let's use the dictionary in the DataFrame constructor. The column headers of the DataFrame are derived from the keys in the dictionary.

In [63]:
my_dictionary_df = pd.DataFrame(my_dictionary, index=['first', 'again'])
my_dictionary_df




Out[63]:

a b c
first 45.1 -19.52 4444
again 45.1 -19.52 4444



How to use the constructor without an explicit index

In the cell above, the values in the dictionary are replicated once, that is, as one row, for each label in the index: in this case, first and again. In the next example, from the pandas cookbook, the constructor receives three labels and three lists. Each list has four values. Since no index is passed to the constructor, the integers 0 through 3 are used as the index and displayed as row labels. In the example after that, we create a dictionary whose keys are 'one' and 'two', and the value associated with each key is a Series.

In [64]:
cookbook_df = pd.DataFrame({'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]})
cookbook_df




Out[64]:

AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
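
For contrast with the default integer index, here is a minimal sketch that passes an explicit index together with a dictionary of lists. The labels 'w' through 'z' are made up; the index must have exactly one label per row.

```python
import pandas as pd

# Same kind of data as the cookbook example, but with explicit row labels
labeled_df = pd.DataFrame(
    {'AAA': [4, 5, 6, 7], 'BBB': [10, 20, 30, 40]},
    index=['w', 'x', 'y', 'z'],  # hypothetical labels, one per row
)

print(labeled_df)
print(labeled_df.loc['y'])  # the row labeled 'y', returned as a Series
```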



How to use the constructor with a dictionary of Series as values

Each Series has a list of values and its own index. When we run the cell by pressing Shift+Enter, we see that the column headers are again derived from the keys, and the row labels are derived from the union of the Series indices. Note that since the Series under the key 'one' has the index a, b, c but does not include d, the value NaN (Not a Number) is displayed for that cell. In the example after this one, we create a dictionary whose keys are strings and whose values are lists of strings.

In [65]:
series_dict = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
series_df = pd.DataFrame(series_dict)
series_df




Out[65]:

one two
a 1.0 1.0
b 2.0 2.0
c 3.0 3.0
d NaN 4.0
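
As a sketch of how the row labels can be controlled, the constructor also accepts an index argument alongside a dictionary of Series. Only the listed labels are kept, in the order given, and missing cells are still filled with NaN. The name subset_df is made up for this example.

```python
import pandas as pd

series_dict = {
    'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd']),
}

# Keep only the rows labeled d, b, a, in that order
subset_df = pd.DataFrame(series_dict, index=['d', 'b', 'a'])
print(subset_df)

# isna() marks the cells that had no matching label in a Series
print(subset_df.isna())
```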



How to create a DataFrame using a dictionary of lists

In [66]:
produce_dict = {'veggies': ['potatoes', 'onions', 'peppers', 'carrots'],
'fruits': ['apples', 'bananas', 'pineapple', 'berries']}
produce_dict




Out[66]:


{'fruits': ['apples', 'bananas', 'pineapple', 'berries'],
'veggies': ['potatoes', 'onions', 'peppers', 'carrots']}



In [67]:
pd.DataFrame(produce_dict)




Out[67]:

fruits veggies
0 apples potatoes
1 bananas onions
2 pineapple peppers
3 berries carrots



list of dictionaries

In [68]:
data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
pd.DataFrame(data2)




Out[68]:

a b c
0 1 2 NaN
1 5 10 20.0
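
With a list of dictionaries, each dictionary becomes one row and a key missing from a row produces NaN. As a small sketch, the index and columns arguments can be used to label the rows or to keep only some of the keys. The names named_rows_df and two_cols_df are made up here.

```python
import pandas as pd

data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

# Label the rows instead of using the default 0, 1
named_rows_df = pd.DataFrame(data2, index=['first', 'second'])
print(named_rows_df)

# Keep only the columns 'a' and 'b'; the key 'c' is ignored
two_cols_df = pd.DataFrame(data2, columns=['a', 'b'])
print(two_cols_df)
```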



dictionary of tuples, with multi index

Here we use a dictionary of tuples to create a DataFrame that has a MultiIndex on both the rows and the columns. In the output you can see the multiple levels of column headers.

In [69]:
pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})




Out[69]:

        a              b
        a    b    c    a     b
A B   4.0  1.0  5.0  8.0  10.0
  C   3.0  2.0  6.0  7.0   NaN
  D   NaN  NaN  NaN  NaN   9.0
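
To show what the extra levels are good for, here is a rough sketch of selecting from a frame with a MultiIndex. It uses a smaller dictionary than the cell above, and the name multi_df is made up.

```python
import pandas as pd

multi_df = pd.DataFrame({
    ('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
    ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
    ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
})

# The outer column label returns all of its inner columns as a DataFrame
print(multi_df['a'])

# A full (outer, inner) tuple selects one column as a Series
print(multi_df[('a', 'b')])

# .loc with a row tuple and a column tuple reaches a single cell
print(multi_df.loc[('A', 'B'), ('a', 'b')])
```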



How to select, add, and delete columns in a DataFrame

We'll examine how to select, add, and delete columns from a DataFrame. The cells below assume pandas and NumPy have been imported as shown earlier. Execute each cell by pressing Shift+Enter. If we create a DataFrame from the pandas cookbook, we can reference its columns using a dictionary-like syntax. In the cell below, we reference the second column with the string 'BBB'.

dictionary like operations

dictionary selection with string index

In [70]:
cookbook_df = pd.DataFrame({'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]})
cookbook_df['BBB']




Out[70]:


0    10
1 20
2 30
3 40
Name: BBB, dtype: int64



arithmetic vectorized operation using string indices

In [71]:
cookbook_df['BBB'] * cookbook_df['CCC']




Out[71]:


0    1000
1 1000
2 -900
3 -2000
dtype: int64



column deletion

In [72]:
del cookbook_df['BBB']
cookbook_df




Out[72]:

AAA CCC
0 4 100
1 5 50
2 6 -30
3 7 -50



As shown above, we can use these string selections in vectorized arithmetic operations: column BBB was multiplied element by element with column CCC, giving 10 × 100, 20 × 50, 30 × -30, and 40 × -50. There are two ways to remove a column from a DataFrame in place: the del statement, used above, and the pop method, which removes the column and returns it as a Series.

In [73]:
last_column = cookbook_df.pop('CCC')
last_column




Out[73]:


0    100
1 50
2 -30
3 -50
Name: CCC, dtype: int64



In [24]:
cookbook_df




Out[24]:

AAA
0 4
1 5
2 6
3 7
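
Both del and pop modify the DataFrame in place. As a sketch of an alternative not used in this notebook, the drop method returns a new DataFrame and leaves the original untouched. The name trimmed_df is made up.

```python
import pandas as pd

cookbook_df = pd.DataFrame({'AAA': [4, 5, 6, 7],
                            'BBB': [10, 20, 30, 40],
                            'CCC': [100, 50, -30, -50]})

# drop returns a copy without BBB; cookbook_df still has all three columns
trimmed_df = cookbook_df.drop(columns=['BBB'])
print(trimmed_df)

# pop removes CCC from cookbook_df itself and hands the column back
popped = cookbook_df.pop('CCC')
print(cookbook_df)
print(popped)
```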



add a new column using a Python list

In [25]:
cookbook_df['DDD'] = [32, 21, 43, 'hike']
cookbook_df




Out[25]:

AAA DDD
0 4 32
1 5 21
2 6 43
3 7 hike



In [26]:
cookbook_df.insert(1, "new column", [3,4,5,6])
cookbook_df




Out[26]:

AAA new column DDD
0 4 3 32
1 5 4 21
2 6 5 43
3 7 6 hike
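
Besides a Python list and insert, a new column can be computed from existing columns or filled from a single value. The cell below is a minimal sketch; the column names total, flag, and ratio are made up, and assign is an alternative that returns a new DataFrame instead of changing the original.

```python
import pandas as pd

cookbook_df = pd.DataFrame({'AAA': [4, 5, 6, 7], 'CCC': [100, 50, -30, -50]})

# A column derived from two existing columns (element-wise sum)
cookbook_df['total'] = cookbook_df['AAA'] + cookbook_df['CCC']

# A scalar is repeated for every row
cookbook_df['flag'] = 'ok'

# assign leaves cookbook_df unchanged and returns a frame with the extra column
with_ratio_df = cookbook_df.assign(ratio=cookbook_df['CCC'] / cookbook_df['AAA'])

print(cookbook_df)
print(with_ratio_df)
```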



Indexing and Selection

Operation                       Syntax            Result
Select column                   df[col]           Series
Select row by label             df.loc[label]     Series
Select row by integer           df.iloc[loc]      Series
Select rows                     df[start:stop]    DataFrame
Select rows with boolean mask   df[mask]          DataFrame
documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html
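
The operations in the table can be sketched in one cell as follows. The frame and its string row labels 'w' through 'z' are made up so that selection by label and by integer position are easy to tell apart.

```python
import pandas as pd

df = pd.DataFrame(
    {'fruits': ['apples', 'mango', 'pineapple', 'banana'],
     'veggies': ['potatoes', 'carrot', 'beans', 'leafy']},
    index=['w', 'x', 'y', 'z'],  # hypothetical row labels
)

column = df['fruits']            # select column: Series
row_by_label = df.loc['y']       # select row by label: Series
row_by_position = df.iloc[2]     # select row by integer position: Series
row_slice = df[1:3]              # select rows at positions 1 and 2: DataFrame
masked = df[df['fruits'] > 'k']  # rows where the mask is True: DataFrame

for result in (column, row_by_label, row_by_position, row_slice, masked):
    print(type(result).__name__)
    print(result)
```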

Note that double square brackets, as in nutrient_df[['fruits']], select a list of columns and return a DataFrame rather than a Series. We can select rows from a DataFrame by integer position using the iloc indexer. In the examples below, nutrient_df.iloc[2:] selects the rows from position two onward. The fruit in row two is pineapple, and the veggie in that row is beans. We can select a range of rows with an integer slice: nutrient_df.iloc[3:4] returns the row at position three, up to but not including position four.
We can also use negative numbers within a slice to count backwards from the end. In the last example, we see that the plus symbol acts as a concatenation operator when the columns of a DataFrame hold strings. When we execute that cell, apples is appended to each value in the fruits column and potatoes to each value in the veggies column.

In [32]:
nutrient_dict = {'veggies': ['potatoes', 'carrot', 'beans', 'leafy'],
'fruits': ['apples', 'mango', 'pineapple', 'banana']}
# Build the DataFrame from nutrient_dict
nutrient_df = pd.DataFrame(nutrient_dict)
nutrient_df




Out[32]:

fruits veggies
0 apples potatoes
1 mango carrot
2 pineapple beans
3 banana leafy



How to select using a dictionary-like string key

In [33]:
nutrient_df['fruits']




Out[33]:


0       apples
1 mango
2 pineapple
3 banana
Name: fruits, dtype: object



How to Select row using integer index

In [74]:
nutrient_df.iloc[2:]




Out[74]:

fruits veggies
2 pineapple beans
3 banana leafy



Slicing the row

In [43]:
nutrient_df.iloc[3:4]




Out[43]:

fruits veggies
3 banana leafy
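
The text mentions that a slice can count backwards with negative numbers, but no cell shows it. A minimal sketch on the same data, with made-up variable names:

```python
import pandas as pd

nutrient_df = pd.DataFrame({
    'veggies': ['potatoes', 'carrot', 'beans', 'leafy'],
    'fruits': ['apples', 'mango', 'pineapple', 'banana'],
})

last_two = nutrient_df.iloc[-2:]      # the last two rows
all_but_last = nutrient_df.iloc[:-1]  # everything except the last row
reversed_df = nutrient_df.iloc[::-1]  # all rows in reverse order

print(last_two)
print(all_but_last)
print(reversed_df)
```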



+ is overloaded as a concatenation operator for string columns

In [46]:
nutrient_df + nutrient_df.iloc[0]




Out[46]:

fruits veggies
0 applesapples potatoespotatoes
1 mangoapples carrotpotatoes
2 pineappleapples beanspotatoes
3 bananaapples leafypotatoes



Data alignment and arithmetic

Arithmetic between DataFrame objects automatically aligns on both the columns and the index (row labels).
Note the locations of NaN: they appear wherever a row or column label exists in only one of the two frames.

In [75]:
df = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])
sum_df = df + df2
sum_df




Out[75]:

A B C D
0 2.796434 0.681719 1.249369 NaN
1 -1.920570 -0.748472 -0.455429 NaN
2 -0.335982 -2.323809 0.365608 NaN
3 -0.565566 0.885914 -1.261485 NaN
4 -0.315269 0.300453 0.582013 NaN
5 -0.076879 0.762971 -1.182593 NaN
6 0.460198 -0.533756 -1.903300 NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
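
If the NaN values produced by alignment are not wanted, one option is the add method with fill_value, which treats a cell missing from one frame as the given value. This is a sketch rather than part of the original notebook; it uses a seeded random generator so the run is repeatable, and the name filled_df is made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the numbers repeat between runs
df = pd.DataFrame(rng.standard_normal((10, 4)), columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(rng.standard_normal((7, 3)), columns=['A', 'B', 'C'])

# Plain + leaves NaN wherever a label is missing from either frame
sum_df = df + df2

# add with fill_value=0 uses 0 for a cell that is missing from one frame
filled_df = df.add(df2, fill_value=0)

print(sum_df.isna().sum())     # NaN count per column
print(filled_df.isna().sum())  # no NaN left, since df covers every label
```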



Boolean Indexing

In [76]:
sum_df>0




Out[76]:

A B C D
0 True True True False
1 False False False False
2 False False True False
3 False True False False
4 False True True False
5 False True False False
6 True False False False
7 False False False False
8 False False False False
9 False False False False



In [77]:
sum_df[sum_df>0]




Out[77]:

A B C D
0 2.796434 0.681719 1.249369 NaN
1 NaN NaN NaN NaN
2 NaN NaN 0.365608 NaN
3 NaN 0.885914 NaN NaN
4 NaN 0.300453 0.582013 NaN
5 NaN 0.762971 NaN NaN
6 0.460198 NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN



A boolean mask built from a single column can also filter whole rows:
first, select the rows in column B whose values are less than zero;
then, include all columns for those rows in the resulting data set.
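
The two steps just described can be sketched as follows. The small frame here uses made-up, rounded values in place of the random numbers above, and the names b_is_negative and negative_b_df are hypothetical.

```python
import pandas as pd

# Hypothetical values standing in for the random sum_df above
sum_df = pd.DataFrame({'A': [2.8, -1.9, -0.3],
                       'B': [0.7, -0.7, -2.3],
                       'C': [1.2, -0.5, 0.4]})

# Step 1: a boolean Series, True where column B is below zero
b_is_negative = sum_df['B'] < 0

# Step 2: use it as a row mask; every column of the matching rows is kept
negative_b_df = sum_df[b_is_negative]
print(negative_b_df)
```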

One more example, using the where method

In [51]:
nutrient_df.where(nutrient_df > 'k')




Out[51]:

fruits veggies
0 NaN potatoes
1 mango NaN
2 pineapple NaN
3 NaN leafy
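
By default where puts NaN in every cell that fails the condition. As a sketch, a replacement value can be passed as the second argument, and the mask method does the opposite of where by hiding the cells that pass the condition. The placeholder '-' and the variable names are made up.

```python
import pandas as pd

nutrient_df = pd.DataFrame({
    'veggies': ['potatoes', 'carrot', 'beans', 'leafy'],
    'fruits': ['apples', 'mango', 'pineapple', 'banana'],
})

# Keep strings that sort after 'k'; replace the rest with '-' instead of NaN
replaced_df = nutrient_df.where(nutrient_df > 'k', other='-')
print(replaced_df)

# mask is the inverse: cells where the condition is True become NaN
masked_df = nutrient_df.mask(nutrient_df > 'k')
print(masked_df)
```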



Great! We have learned about many pandas functions for working with DataFrames. Feel free to fork this notebook on GitHub and try it out, and to share it with other learners and on social media.
In the next post we will start with plotting, which is just as important, since visualisation helps us understand the data better!
See you soon in the next post! Happy coding.



Hey I'm Venkat
Developer, blogger, thinker, and data scientist. nintyzeros [at] gmail.com. I love data and problems. An Indian living in the US. If you have any questions, do reach out to me via social media.

7 comments

Great post!

Very well explained! subscribed for future..

Thanks for explaining it!

Pandas has lot more things and dataframe can do more things and the post covered it very well to get started..!

Top post.. Thanks for giving time and effort to produce it!
I got to know lots of new function which I forgot long tym back

Excellent post!

"nutrient_df = pd.DataFrame(produce_dict)" should be "nutrient_df = pd.DataFrame(nutrient_dict)"

