Welcome to the second part of our Basics of Pandas series. In the first part, we were introduced to Pandas and its basic concepts, such as creating Series and DataFrames in different ways, indexing, indexers, ufuncs, and handling missing data. In this blog, we are going to learn about some slightly more advanced, but widely used, concepts of Pandas. So, without any further ado, let’s begin!
Let’s start by importing Pandas and NumPy.
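A minimal sketch of the imports, assuming the conventional aliases pd and np that the rest of this post refers to:

```python
# Conventional aliases; pd.concat(), pd.merge() and np.concatenate()
# mentioned in this post rely on these names.
import numpy as np
import pandas as pd
```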
The concatenation of Pandas Series and DataFrame objects is similar to the concatenation of NumPy arrays, which can be done via the np.concatenate function. Let’s have a quick look at NumPy array concatenation.
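Please note that np.concatenate joins along the first axis (axis=0) by default when no axis argument is specified. For one-dimensional arrays this simply appends one array to the other; for two-dimensional arrays it stacks them row-wise.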
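The NumPy behaviour described above can be checked with a short sketch. The array names and values here are made up for illustration:

```python
import numpy as np

# One-dimensional arrays are simply joined end to end.
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
joined_1d = np.concatenate([x, y])
print(joined_1d)

# Two-dimensional arrays: the default (axis=0) stacks rows,
# while axis=1 places the arrays side by side.
grid = np.array([[1, 2], [3, 4]])
stacked_rows = np.concatenate([grid, grid])
side_by_side = np.concatenate([grid, grid], axis=1)
print(stacked_rows.shape)   # (4, 2)
print(side_by_side.shape)   # (2, 4)
```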
Pandas provides a function, pd.concat(), which works like np.concatenate() but offers more options. The function pd.concat() works for both Series and DataFrame objects.
Concatenation of Pandas Series and DataFrame:
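As a small sketch, here is pd.concat() applied to two Series. The names ser1 and ser2 and their values are illustrative assumptions:

```python
import pandas as pd

ser1 = pd.Series(["A", "B", "C"], index=[1, 2, 3])
ser2 = pd.Series(["D", "E", "F"], index=[4, 5, 6])

# pd.concat() takes a list of objects and joins them end to end,
# keeping each object's original index labels.
combined = pd.concat([ser1, ser2])
print(combined)
```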
Concatenation of higher-dimensional objects, such as DataFrames:
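The same call works for DataFrames. The sketch below uses two made-up DataFrames (df1 and df2) that share the same columns:

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A1", "A2"], "B": ["B1", "B2"]}, index=[1, 2])
df2 = pd.DataFrame({"A": ["A3", "A4"], "B": ["B3", "B4"]}, index=[3, 4])

# With no axis given, the rows of df2 are appended below the rows of df1.
stacked = pd.concat([df1, df2])
print(stacked)
```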
Pandas also provides the flexibility to concatenate along a specific axis. By default, the concatenation takes place row-wise (axis=0), so the DataFrames are stacked on top of each other.
Setting axis=1 concatenates column-wise, placing the DataFrames side by side. Setting axis=0, which is the default, concatenates row-wise.
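To see the difference between the two axes, here is a hedged sketch with two made-up DataFrames that have different columns but the same index:

```python
import pandas as pd

df_left = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
df_right = pd.DataFrame({"C": ["C0", "C1"], "D": ["D0", "D1"]})

# axis=1: columns are placed side by side, rows are aligned on the index.
side_by_side = pd.concat([df_left, df_right], axis=1)
print(side_by_side)

# axis=0 (the default): rows are stacked; columns missing from one
# input are filled with NaN.
stacked = pd.concat([df_left, df_right], axis=0)
print(stacked)
```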
Sometimes, one may want to discard the original index and use a continuous integer index instead. This can be done by passing ignore_index=True to the concat() function.
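A short sketch of ignore_index, assuming two made-up DataFrames whose index labels overlap:

```python
import pandas as pd

x = pd.DataFrame({"A": ["A0", "A1"]}, index=[0, 1])
y = pd.DataFrame({"A": ["A2", "A3"]}, index=[0, 1])

# Default behaviour keeps the original labels, so 0 and 1 appear twice.
with_duplicates = pd.concat([x, y])
print(with_duplicates)

# ignore_index=True discards them and builds a fresh 0..n-1 index.
renumbered = pd.concat([x, y], ignore_index=True)
print(renumbered)
```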
Another option is to use the keys parameter to specify a label for each data source. This creates a hierarchical index (MultiIndex), in which the key forms an outer index level above the original index.
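The keys option can be sketched as follows. The labels "source_x" and "source_y" are hypothetical names chosen for this example:

```python
import pandas as pd

x = pd.DataFrame({"A": ["A0", "A1"]}, index=[0, 1])
y = pd.DataFrame({"A": ["A2", "A3"]}, index=[0, 1])

# Each key becomes the outer level of a MultiIndex, so the rows stay
# traceable to the DataFrame they came from.
labelled = pd.concat([x, y], keys=["source_x", "source_y"])
print(labelled)

# Selecting on the outer level returns the rows of one source.
print(labelled.loc["source_y"])
```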
Concatenation with joins:
In general, data from different sources can have different column names. The function pd.concat() offers several options to deal with such data.
The join parameter of the pd.concat() function controls how the columns of different DataFrames are combined. By default, the result is a union of the input columns (join='outer'), but we can change this to an intersection of the columns using join='inner'.
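Here is a minimal sketch of both join modes, assuming two made-up DataFrames that share only columns B and C:

```python
import pandas as pd

df5 = pd.DataFrame({"A": ["A1", "A2"], "B": ["B1", "B2"], "C": ["C1", "C2"]})
df6 = pd.DataFrame({"B": ["B3", "B4"], "C": ["C3", "C4"], "D": ["D3", "D4"]})

# join='outer' (the default): union of columns, gaps filled with NaN.
outer = pd.concat([df5, df6])
print(outer)

# join='inner': only the columns present in every input are kept.
inner = pd.concat([df5, df6], join="inner")
print(inner)
```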
Pandas merge() function:
Joining and merging are essential database operations for combining data. The pd.merge() function implements several types of joins: one-to-one, many-to-one, and many-to-many.
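The three join types can be sketched with a few small tables. All table names and values below are made up for illustration; pd.merge() picks the shared column as the key when none is specified:

```python
import pandas as pd

employees = pd.DataFrame({
    "employee": ["Bob", "Jake", "Lisa", "Sue"],
    "group": ["Accounting", "Engineering", "Engineering", "HR"],
})
hire_dates = pd.DataFrame({
    "employee": ["Lisa", "Bob", "Jake", "Sue"],
    "hire_date": [2004, 2008, 2012, 2014],
})
supervisors = pd.DataFrame({
    "group": ["Accounting", "Engineering", "HR"],
    "supervisor": ["Carly", "Guido", "Steve"],
})
skills = pd.DataFrame({
    "group": ["Accounting", "Accounting", "Engineering", "Engineering", "HR", "HR"],
    "skill": ["math", "spreadsheets", "coding", "linux", "spreadsheets", "organization"],
})

# One-to-one: each employee appears once in both tables.
one_to_one = pd.merge(employees, hire_dates)

# Many-to-one: several employees can share one group's supervisor.
many_to_one = pd.merge(one_to_one, supervisors)

# Many-to-many: the key "group" repeats in both tables, so every
# matching pair of rows appears in the result.
many_to_many = pd.merge(employees, skills)

print(one_to_one)
print(many_to_one)
print(many_to_many)
```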
Merging with specific keys:
Often, DataFrame column names will not match. In such a case, the function pd.merge() is really helpful and provides a variety of options.
When merging two DataFrames, we can explicitly specify the name of the key column using the on keyword. The following example will make it easier to understand:
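A minimal sketch of merging with the on keyword, using two small made-up tables (the names employees and hire_dates are illustrative assumptions):

```python
import pandas as pd

employees = pd.DataFrame({
    "employee": ["Bob", "Jake", "Lisa", "Sue"],
    "group": ["Accounting", "Engineering", "Engineering", "HR"],
})
hire_dates = pd.DataFrame({
    "employee": ["Lisa", "Bob", "Jake", "Sue"],
    "hire_date": [2004, 2008, 2012, 2014],
})

# "employee" exists in both DataFrames, so it can be named as the key.
merged = pd.merge(employees, hire_dates, on="employee")
print(merged)
```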
Note: While merging DataFrames using the on keyword, one should make sure that both the left and right DataFrames contain a column with the specified name.
Let’s check what happens if we specify a column which is not common to both DataFrames:
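As a sketch of that failure case, the made-up tables below share no "salary" column on the left side, and pd.merge() raises a KeyError, which we catch here so the example runs to completion:

```python
import pandas as pd

employees = pd.DataFrame({
    "employee": ["Bob", "Jake"],
    "group": ["Accounting", "Engineering"],
})
salaries = pd.DataFrame({
    "employee": ["Bob", "Jake"],
    "salary": [70000, 80000],
})

# "salary" exists only in the right DataFrame, so it cannot be a key.
caught_error = None
try:
    pd.merge(employees, salaries, on="salary")
except KeyError as err:
    caught_error = err

print("Merge failed with KeyError:", caught_error)
```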
In some cases, one may wish to merge two datasets whose key columns have different names but hold related content. Here, one can use the left_on and right_on keywords to specify the two column names.
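A short sketch of left_on and right_on, assuming a made-up salaries table in which the employee's name is stored under "name" rather than "employee":

```python
import pandas as pd

employees = pd.DataFrame({
    "employee": ["Bob", "Jake", "Lisa", "Sue"],
    "group": ["Accounting", "Engineering", "Engineering", "HR"],
})
salaries = pd.DataFrame({
    "name": ["Bob", "Jake", "Lisa", "Sue"],
    "salary": [70000, 80000, 120000, 90000],
})

# Match employees["employee"] against salaries["name"].
merged = pd.merge(employees, salaries, left_on="employee", right_on="name")

# Both key columns are kept, so the redundant one can be dropped.
merged = merged.drop("name", axis=1)
print(merged)
```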
Sometimes, rather than merging on a column, one may like to merge on an index. In this case, the index can be used as the merge key by setting the left_index and/or right_index flags to True. For example, in such a case the data may look like this:
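A hedged sketch of index-based merging follows. The tables are made up, and set_index() is used only to build DataFrames that are indexed by employee name:

```python
import pandas as pd

employees = pd.DataFrame({
    "employee": ["Bob", "Jake", "Lisa", "Sue"],
    "group": ["Accounting", "Engineering", "Engineering", "HR"],
}).set_index("employee")
hire_dates = pd.DataFrame({
    "employee": ["Lisa", "Bob", "Jake", "Sue"],
    "hire_date": [2004, 2008, 2012, 2014],
}).set_index("employee")
salaries = pd.DataFrame({
    "name": ["Bob", "Jake", "Lisa", "Sue"],
    "salary": [70000, 80000, 120000, 90000],
})

# Both sides use their index as the merge key.
by_index = pd.merge(employees, hire_dates, left_index=True, right_index=True)
print(by_index)

# Index on the left, ordinary column on the right.
mixed = pd.merge(employees, salaries, left_index=True, right_on="name")
print(mixed)
```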
Simple Aggregation in Pandas:
Pandas Series and DataFrames include many common aggregates like:
- count() # Total count of items
- first() # First item (as a GroupBy aggregate)
- mean(), median() # Mean and Median
- min(), max() # Minimum and Maximum
- std(), var() # Standard deviation and variance
- prod() # Product of all items
- sum() # Sum of all items
Aggregation on Series:
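A small sketch of these aggregates on a Series, using made-up values:

```python
import pandas as pd

ser = pd.Series([10, 20, 30, 40, 50])

# Each aggregate reduces the whole Series to a single value.
total = ser.sum()        # 150
average = ser.mean()     # 30.0
middle = ser.median()    # 30.0
smallest = ser.min()     # 10
largest = ser.max()      # 50
n_items = ser.count()    # 5
variance = ser.var()     # 250.0 (sample variance)

print(total, average, middle, smallest, largest, n_items, variance)
```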
Aggregation on DataFrames:
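For DataFrames, aggregates are computed per column by default, and the axis argument switches to per-row results. A sketch with a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]})

# Default: one value per column.
column_means = df.mean()
print(column_means)

# axis="columns": aggregate across the columns, one value per row.
row_means = df.mean(axis="columns")
print(row_means)

# describe() reports several common aggregates for each column at once.
print(df.describe())
```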
Pandas GroupBy function:
A groupby operation involves the following steps on the data objects:
- The Split step involves splitting data into sets depending on the value of the specified key.
- The Apply step involves operations like aggregation (for example, sum or mean) or transformation within each group.
- The Combine step merges the results of these operations into an output array.
Instead of doing all the above steps manually, Pandas provides a function called groupby that can perform several operations like sum, mean, count, min, or other aggregates in a single step! Doesn’t that sound amazing?
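To make the three steps concrete, the sketch below first performs split, apply, and combine by hand on a made-up DataFrame, and then gets the same result with groupby in one line. The manual version is only an illustration, not how Pandas implements groupby internally:

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["A", "B", "C", "A", "B", "C"],
    "data": [1, 2, 3, 4, 5, 6],
})

# Split: one sub-DataFrame per distinct key value.
chunks = {key: df[df["key"] == key] for key in df["key"].unique()}

# Apply: compute an aggregate (here, the sum) within each chunk.
sums = {key: chunk["data"].sum() for key, chunk in chunks.items()}

# Combine: collect the per-group results into a single object.
manual = pd.Series(sums, name="data")
print(manual)

# The same computation in a single step with groupby.
grouped = df.groupby("key")["data"].sum()
print(grouped)
```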
Let’s load a Pandas DataFrame from the Seaborn datasets to see an example of groupby:
As we can see, no computation is done until we call some aggregate on the object. We got computed values only after specifying the aggregate function.
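This lazy behaviour can be sketched without downloading anything. The DataFrame below is a small hand-made stand-in (its column names and values are assumptions, not the real Seaborn data):

```python
import pandas as pd

# Hypothetical stand-in for a dataset loaded from Seaborn.
planets = pd.DataFrame({
    "method": ["Transit", "Transit", "Radial Velocity",
               "Radial Velocity", "Radial Velocity", "Imaging"],
    "orbital_period": [3.0, 5.0, 100.0, 300.0, 500.0, 4000.0],
})

# groupby alone returns a DataFrameGroupBy object; nothing is computed yet.
grouped = planets.groupby("method")
print(grouped)

# The actual computation happens once an aggregate is called.
medians = grouped["orbital_period"].median()
print(medians)
```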
Also, it is interesting to note that the groupby function gives one the flexibility to provide either a single aggregate or a list of aggregate functions.
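A brief sketch of both forms, using the agg() method on a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["A", "B", "C", "A", "B", "C"],
    "data": [1, 2, 3, 4, 5, 6],
})

# A single aggregate returns one value per group.
single = df.groupby("key")["data"].agg("sum")
print(single)

# A list of aggregates returns one column per aggregate.
several = df.groupby("key")["data"].agg(["min", "max", "mean"])
print(several)
```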
The transform function on a group or a column returns an object that has the same index and the same size as the object being grouped. Thus, the function passed to transform should return a result that is the same size as the group chunk.
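A common use of transform is centering each value around its group mean. The sketch below uses a made-up DataFrame and a lambda chosen purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["A", "B", "C", "A", "B", "C"],
    "data": [1, 2, 3, 4, 5, 6],
})

# Subtract each group's mean from its members. The result has one
# value per original row, not one value per group.
centered = df.groupby("key")["data"].transform(lambda x: x - x.mean())
print(centered)
```

Because the output lines up with the original index, it can be assigned straight back to the DataFrame as a new column.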
So, here we are at the end of the second part of the Basics of Pandas series! How does it feel to have learnt so much? How many of the functions above did you try? Do share with us! To sum up, we have learned several basic concepts: concatenation of Series and DataFrames, joins, the merge function and its parameters, aggregation, and the GroupBy function. This is not all; we have a final blog in this series where we will touch upon some advanced topics related to Pandas. Can you guess what we will be discussing? Get in touch and let us know.
- Complete Jupyter Notebook: https://jovian.ml/v-snehith999/basics-of-pandas-part-2
- Pandas official documentation: https://pandas.pydata.org/
- Python Data Science Handbook: https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html
- Python Pandas: https://www.tutorialspoint.com/python_pandas/python_pandas_groupby.htm
- Pandas Tutorial: https://www.geeksforgeeks.org/pandas-tutorial/
- Pandas – powerful Python data analysis toolkit: https://pandas.pydata.org/docs/pandas.pdf
– Snehit Vaddi
I am a Machine Learning enthusiast. I teach machines how to see, listen, and learn.