Creating Data Frames¶
In the previous post we went through a tutorial on the basics of Series in pandas. In this part of the tutorial we will build one of the most important data structures in pandas, the DataFrame. Pandas has an abundance of functionality, far too much for me to cover in this introduction. We will cover more DataFrame functions in the example sections coming in the next post. Hope you are enjoying learning; our suggestion is to practice by writing and calling the functions yourself until you understand them.
Let's get started!¶
A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects.
Importantly, you can create a data frame from:
- Dict of 1D ndarrays, lists, dicts, or Series
- 2-D numpy.ndarray
- Structured or record ndarray
- A Series
- Another DataFrame
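The dictionary-based inputs are covered in the cells that follow. As a minimal sketch of the other inputs in the list (a 2-D ndarray, a single Series, and another DataFrame), something like the following should work; the variable names and column labels here are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# 2-D ndarray: rows get default integer labels; column names are passed explicitly
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['x', 'y'])

# A single Series becomes a one-column frame; the Series name becomes the column label
from_series = pd.DataFrame(pd.Series([1.5, 2.5], index=['a', 'b'], name='val'))

# Another DataFrame: the constructor can keep just a subset of its columns
from_frame = pd.DataFrame(from_array, columns=['y'])

print(from_array.shape)
print(list(from_series.columns))
print(from_frame['y'].tolist())
```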
Importing the numpy and pandas libraries¶
In [61]:
import pandas as pd
import numpy as np
How to create a data frame from a Python dictionary?¶
In [62]:
my_dictionary = {'a' : 45.1, 'b' : -19.52, 'c' : 4444}
print(my_dictionary.keys())
print(my_dictionary.values())
We'll call the dictionary my_dictionary. It has three values, 45.1, -19.52 and 4444, and its keys are 'a', 'b' and 'c'.
We can print both the keys and the values. The keys are 'a', 'b' and 'c' (dictionaries preserve insertion order in Python 3.7 and later; older versions could print them in a different order), and the values are the same as those we used in the input. Now, let's use the dictionary in the DataFrame constructor. The column headers for the data frame are derived from the keys in the dictionary.
In [63]:
my_dictionary_df = pd.DataFrame(my_dictionary, index=['first', 'again'])
my_dictionary_df
Out[63]:
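To see why the index argument matters when every value is a scalar, here is a small sketch. It reuses the same values as the cell above under the assumed name `scalars`; without an index, pandas has no way to tell how many rows to build and raises a ValueError.

```python
import pandas as pd

scalars = {'a': 45.1, 'b': -19.52, 'c': 4444}

# With an index, each scalar is repeated once per row label
repeated = pd.DataFrame(scalars, index=['first', 'again'])
print(repeated)

# Without an index, pandas cannot infer the number of rows
try:
    pd.DataFrame(scalars)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```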
How to use the constructor without an explicit index¶
In the previous cell, the values in the dictionary are replicated once per row, for each of the labels in the index, in this case 'first' and 'again'. In the next example, from the pandas cookbook, the constructor receives three labels and three lists. Each of the lists has four values. Since no index is passed to the constructor, the integers 0 through 3 are used as the index and displayed as row labels. In the example after that, we create a dictionary whose keys are 'one' and 'two', and the value associated with each key is a Series.
In [64]:
cookbook_df = pd.DataFrame({'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]})
cookbook_df
Out[64]:
How to use the constructor with a dictionary of Series as values¶
Each Series has a list of values and its own index. When we run the cell by pressing Shift+Enter, we see that the column headers are again derived from the keys, and the row labels are the union of the Series indices. Note that since the Series under the key 'one' has the index 'a', 'b' and 'c' but does not include 'd', the value NaN (Not a Number) is displayed for that entry. In the example after this one, we create a dictionary whose keys are strings and whose values are lists of strings.
In [65]:
series_dict = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
series_df = pd.DataFrame(series_dict)
series_df
Out[65]:
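The alignment described above can be checked directly. This sketch rebuilds the same two Series (the names `one`, `two` and `aligned` are assumptions for illustration) and inspects where the missing value ends up.

```python
import pandas as pd

one = pd.Series([1., 2., 3.], index=['a', 'b', 'c'])
two = pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])

# The row labels of the frame are the union of both Series indices
aligned = pd.DataFrame({'one': one, 'two': two})

print(list(aligned.index))
# 'one' has no entry for label 'd', so that cell is NaN
print(aligned['one'].isna())
```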
How to create df using dictionary of lists¶
In [66]:
produce_dict = {'veggies': ['potatoes', 'onions', 'peppers', 'carrots'],
'fruits': ['apples', 'bananas', 'pineapple', 'berries']}
produce_dict
Out[66]:
In [67]:
pd.DataFrame(produce_dict)
Out[67]:
list of dictionaries¶
In [68]:
data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
pd.DataFrame(data2)
Out[68]:
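With a list of dictionaries, each dictionary becomes one row and the columns are the union of all keys. A short sketch, assuming the names `records` and `records_df`, shows that a key missing from a row produces NaN, and that row labels can still be supplied through `index`.

```python
import pandas as pd

records = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

# Each dict is a row; the first dict has no 'c', so that cell is NaN
records_df = pd.DataFrame(records)
print(records_df)

# Row labels can be passed explicitly instead of the default 0, 1
labelled = pd.DataFrame(records, index=['first', 'second'])
print(labelled.loc['second', 'c'])
```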
dictionary of tuples, with multi index¶
We use a dictionary of tuples to create a data frame that has a multi-index. In the output, you can see the multiple levels of column headers.
In [69]:
pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})
Out[69]:
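To make the multi-index structure easier to inspect, here is a reduced sketch of the same idea with three of the columns (the names `nested` and `multi_df` are assumptions). Both the columns and the rows end up with two levels, and a top-level column label selects everything beneath it.

```python
import pandas as pd

nested = {('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
          ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
          ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8}}
multi_df = pd.DataFrame(nested)

# Tuple keys turn into MultiIndex levels on both axes
print(type(multi_df.columns).__name__, multi_df.columns.nlevels)
print(type(multi_df.index).__name__, multi_df.index.nlevels)

# Selecting the top-level label 'a' returns the columns nested under it
under_a = multi_df['a']
print(under_a)

# A single cell is addressed with one tuple per axis
print(multi_df.loc[('A', 'B'), ('a', 'b')])
```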
How to Select, Add, and Delete Columns in a df¶
We'll examine how to select, add and delete columns from a data frame. The cells below assume pandas and NumPy have already been imported; execute each cell by pressing Shift+Enter. If we create the data frame from the pandas cookbook, we can reference columns in the data frame using a dictionary-like syntax. In the first cell we reference the second column with the string 'BBB'.
dictionary like operations¶
dictionary selection with string index¶
In [70]:
cookbook_df = pd.DataFrame({'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]})
cookbook_df['BBB']
Out[70]:
arithmetic vectorized operation using string indices¶
In [71]:
cookbook_df['BBB'] * cookbook_df['CCC']
Out[71]:
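The multiplication above is element-wise: each value in one column is paired with the value in the same row of the other. A minimal sketch with the same cookbook data (the name `product` is an assumption):

```python
import pandas as pd

frame = pd.DataFrame({'AAA': [4, 5, 6, 7],
                      'BBB': [10, 20, 30, 40],
                      'CCC': [100, 50, -30, -50]})

# Row-by-row product: 10*100, 20*50, 30*-30, 40*-50
product = frame['BBB'] * frame['CCC']
print(product.tolist())  # [1000, 1000, -900, -2000]
```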
column deletion¶
In [72]:
del cookbook_df['BBB']
cookbook_df
Out[72]:
We can use these string selections in vectorized arithmetic operations. Above, we took the column BBB and multiplied each of its values by the corresponding value in the column CCC: ten times a hundred, twenty times fifty, thirty times -30 and forty times -50. There are two ways to remove columns from a data frame: the del operator and the pop function.
In [73]:
last_column = cookbook_df.pop('CCC')
last_column
Out[73]:
In [24]:
cookbook_df
Out[24]:
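The difference between the two removal styles is what you get back. The sketch below repeats both on a fresh copy of the cookbook frame; it also shows `drop`, which is not used in this notebook but is a common alternative when the original frame should stay untouched.

```python
import pandas as pd

frame = pd.DataFrame({'AAA': [4, 5, 6, 7],
                      'BBB': [10, 20, 30, 40],
                      'CCC': [100, 50, -30, -50]})

# drop returns a new frame and leaves the original as it was
without_aaa = frame.drop(columns=['AAA'])

# del removes the column in place and returns nothing
del frame['BBB']

# pop removes the column in place and hands it back as a Series
popped = frame.pop('CCC')

print(list(frame.columns))
print(popped.tolist())
print(list(without_aaa.columns))
```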
add a new column using a Python list¶
In [25]:
cookbook_df['DDD'] = [32, 21, 43, 'hike']
cookbook_df
Out[25]:
insert function¶
documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.insert.html
In [26]:
cookbook_df.insert(1, "new column", [3,4,5,6])
cookbook_df
Out[26]:
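Unlike plain assignment, which appends a new column at the end, `insert` places it at a given position. A small sketch on an assumed two-column frame; it also shows that inserting a label that already exists raises a ValueError by default.

```python
import pandas as pd

frame = pd.DataFrame({'AAA': [4, 5, 6, 7],
                      'DDD': [32, 21, 43, 'hike']})

# Position 1 puts the new column between 'AAA' and 'DDD'
frame.insert(1, 'new column', [3, 4, 5, 6])
print(list(frame.columns))

# Reusing an existing label is rejected unless allow_duplicates=True is passed
try:
    frame.insert(0, 'AAA', [0, 0, 0, 0])
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
print(duplicate_rejected)
```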
Indexing and Selection¶
Operation | Syntax | Result |
---|---|---|
Select column | df[col] | Series |
Select row by label | df.loc[label] | Series |
Select row by integer | df.iloc[loc] | Series |
Select rows | df[start:stop] | DataFrame |
Select rows with boolean mask | df[mask] | DataFrame |
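The table can be tried out row by row. In this sketch the frame is given string row labels ('w' to 'z', an assumption made here so that label-based and position-based selection are easy to tell apart), and each result's type is checked against the table.

```python
import pandas as pd

df = pd.DataFrame({'veggies': ['potatoes', 'onions', 'peppers', 'carrots'],
                   'fruits': ['apples', 'bananas', 'pineapple', 'berries']},
                  index=['w', 'x', 'y', 'z'])

col = df['fruits']        # select column            -> Series
by_label = df.loc['y']    # select row by label      -> Series
by_position = df.iloc[2]  # select row by integer    -> Series
rows = df[1:3]            # slice rows by position   -> DataFrame
mask = df['fruits'].str.startswith('b')
masked = df[mask]         # select rows with a mask  -> DataFrame

for result in (col, by_label, by_position, rows, masked):
    print(type(result).__name__)
```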
We can select a row from a data frame by integer position using the iloc indexer. Passing a single integer, as in iloc[2], selects the row at position two, whose fruit is pineapple. We can select a range of rows by using an integer slice: a slice such as 0:2 returns rows zero and one, up to, but not including, two.
We can also use a slice to count backwards using negative numbers within the slice. In the last example of this section, we see that the plus symbol is overloaded as a concatenation operator when a data frame holds strings. When we execute that cell, the first row is added to every row, so apples is concatenated to each of the values in the fruits column.
In [32]:
nutrient_dict = {'veggies': ['potatoes', 'carrot', 'beans', 'leafy'],'fruits': ['apples', 'mango', 'pineapple', 'banana']}
nutrient_df = pd.DataFrame(nutrient_dict)
nutrient_df
Out[32]:
How to select using a dictionary-like string¶
In [33]:
nutrient_df['fruits']
Out[33]:
How to Select row using integer index¶
In [74]:
nutrient_df.iloc[2:]
Out[74]:
Slicing the row¶
In [43]:
nutrient_df.iloc[3:4]
Out[43]:
+ is over-loaded as concatenation operator¶
In [46]:
nutrient_df + nutrient_df.iloc[0]
Out[46]:
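What happens in the cell above is that the row returned by `iloc[0]` is a Series labeled by column name, so it is matched to the columns and added to every row. For strings, adding means concatenating. A two-row sketch (the name `produce` is an assumption) makes the result easy to verify by eye:

```python
import pandas as pd

produce = pd.DataFrame({'veggies': ['potatoes', 'onions'],
                        'fruits': ['apples', 'bananas']})

# The first row is appended, column by column, to every row
combined = produce + produce.iloc[0]
print(combined)
```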
Data alignment and arithmetic¶
DataFrame objects automatically align on both the columns and the index (row labels) during arithmetic. Note the locations of 'NaN' in the result.
In [75]:
df = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])
sum_df = df + df2
sum_df
Out[75]:
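The NaN pattern is predictable: any row or column label present in only one of the two frames gives NaN. The sketch below uses a seeded generator so it is repeatable (the seed and the names `left`, `right`, `total` are assumptions) and also shows `add` with `fill_value`, which treats a value missing on one side as 0.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
left = pd.DataFrame(rng.standard_normal((10, 4)), columns=['A', 'B', 'C', 'D'])
right = pd.DataFrame(rng.standard_normal((7, 3)), columns=['A', 'B', 'C'])

total = left + right
# Column D exists only in `left`; rows 7-9 exist only in `left`
print(total.isna().sum())

# add() with fill_value substitutes 0 where only one side has a value
filled = left.add(right, fill_value=0)
print(filled.isna().sum().sum())
```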
Boolean Indexing¶
In [76]:
sum_df>0
Out[76]:
In [77]:
sum_df[sum_df>0]
Out[77]:
First, select the rows whose values in column B are less than zero.
Then, include the information for all columns of those rows in the resulting data set.
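These two steps can be sketched on a small frame with fixed numbers (the frame `scores` and its values are assumptions, chosen so the result is easy to check). A mask built from one column filters whole rows, whereas a mask built from the whole frame keeps the shape and blanks out individual cells.

```python
import pandas as pd

scores = pd.DataFrame({'A': [1.0, -2.0, 3.0],
                       'B': [-0.5, 0.7, -1.2],
                       'C': [10.0, 20.0, 30.0]})

# Step 1: a boolean Series marking rows where B is below zero
mask = scores['B'] < 0

# Step 2: indexing with that Series keeps all columns of the matching rows
negative_b = scores[mask]
print(negative_b)

# A frame-shaped mask keeps every row and replaces failing cells with NaN
positive_cells = scores[scores > 0]
print(positive_cells)
```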
One more example, using the where function¶
In [51]:
nutrient_df.where(nutrient_df > 'k')
Out[51]:
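`where` keeps the cells for which the condition is true and replaces the rest, with NaN by default. Strings compare alphabetically, so `> 'k'` keeps words that sort after 'k'. A two-row sketch (the name `words` and the '-' placeholder are assumptions) also shows the `other` argument for choosing the replacement:

```python
import pandas as pd

words = pd.DataFrame({'veggies': ['potatoes', 'carrot'],
                      'fruits': ['apples', 'mango']})

# 'potatoes' and 'mango' sort after 'k'; the other cells become NaN
kept = words.where(words > 'k')
print(kept)

# `other` sets the replacement value instead of NaN
replaced = words.where(words > 'k', other='-')
print(replaced)
```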
Important links: data frames from various data types¶
documentation: http://pandas.pydata.org/pandas-docs/stable/dsintro.html
cookbook: http://pandas.pydata.org/pandas-docs/stable/cookbook.html
documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
Great! We have learnt about lots of functions in pandas for dealing with data frames. Feel free to fork this long notebook on GitHub and try it. Feel free to share it with other learners and on social media.
In the next post we are going to start with plotting, which is just as important, since visualisation helps us understand the data better!
See you soon with next post! Happy coding.