Data visualization using Seaborn – Part 2

The first part covered the basics of seaborn, its installation, relational plots and catplot(). This part discusses different types of graphs and their uses.

1. Count plot

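A count plot shows the count, or frequency, of each value of the variable passed to it. It is similar to a univariate distribution plot, but for a categorical variable.

To show each bar's percentage of the total on top of the bar, divide the bar's height by the number of observations and annotate the bar with the result.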
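One way to do this is sketched below. The DataFrame and its "day" column are hypothetical toy data, and the labels are computed from the bar heights rather than from any built-in seaborn option.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data: 10 observations of one categorical column.
df = pd.DataFrame(
    {"day": ["Thu", "Fri", "Fri", "Sat", "Sat", "Sat", "Sun", "Sun", "Sun", "Sun"]}
)

fig, ax = plt.subplots()
sns.countplot(data=df, x="day", ax=ax)

total = len(df)
labels = []
for bar in ax.patches:
    count = bar.get_height()
    if count == 0:
        continue  # nothing to label on an empty bar
    label = f"{100 * count / total:.1f}%"
    labels.append(label)
    # Center the text horizontally and place it just above the bar's top edge.
    ax.annotate(label, (bar.get_x() + bar.get_width() / 2, count),
                ha="center", va="bottom")

# plt.show()  # uncomment to display the figure when running as a script
```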

Note – The default representation in catplot() is a scatter plot.

Seaborn has two categorical scatter plots. They take different approaches to the main challenge of representing categorical data with a scatter plot: all of the points belonging to one category would fall on the same position along the axis corresponding to the categorical variable.

The approach used by stripplot(), which is the default “kind” in catplot(), is to adjust the positions of points on the categorical axis with a small amount of random “jitter”.

 

2. Strip plot

The strip plot draws a scatter plot in which one of the variables is categorical. For each category, the observations are plotted against the numeric variable, so you see one strip of points per category.
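A minimal sketch of a strip plot follows. The data are synthetic and the column names "group" and "value" are made up for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: three categories with 30 numeric observations each.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": np.repeat(["A", "B", "C"], 30),
    "value": rng.normal(size=90) + np.repeat([0.0, 1.0, 2.0], 30),
})

fig, ax = plt.subplots()
# jitter=True spreads the points slightly along the categorical axis.
sns.stripplot(data=df, x="group", y="value", jitter=True, ax=ax)

# Figure-level equivalent: sns.catplot(data=df, x="group", y="value")
# plt.show()  # uncomment to display the figure when running as a script
```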

3. Swarm plot

The second approach adjusts the positions of the points along the categorical axis with an algorithm that prevents them from overlapping. It gives a better representation of the distribution of observations, although it only works well for relatively small datasets. This kind of plot is sometimes called a “beeswarm” and is drawn in seaborn by swarmplot(), which is activated by setting kind="swarm" in catplot(). Scatter plots are generally based on numeric values, but much of data analysis involves categorical variables, and swarm plots are useful in those cases.

Example
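The sketch below draws a swarm plot on a small synthetic dataset. The column names are hypothetical, and the marker size is reduced only to leave the algorithm enough room.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical small dataset: swarm plots suit a few dozen points per category.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "group": np.repeat(["A", "B", "C"], 20),
    "value": rng.normal(size=60),
})

fig, ax = plt.subplots()
# Points are shifted along the categorical axis so that they do not overlap.
sns.swarmplot(data=df, x="group", y="value", size=4, ax=ax)

# Figure-level equivalent: sns.catplot(data=df, x="group", y="value", kind="swarm")
# plt.show()  # uncomment to display the figure when running as a script
```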

We can add another dimension to a categorical plot by using a hue semantic. (The categorical plots do not currently support size or style semantics.) Each categorical plotting function handles the hue semantic differently. For the scatter plots, it is only necessary to change the color of the points.
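As a small illustration of the hue semantic, the sketch below colors the points of a swarm plot by a second categorical column. All column names and values are made up.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: two categorical columns and one numeric column.
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "day": np.tile(["Thu", "Fri"], 20),
    "smoker": np.repeat(["yes", "no"], 20),
    "tip": rng.uniform(1, 6, size=40),
})

fig, ax = plt.subplots()
# hue assigns a color to each level of "smoker" and adds a legend.
sns.swarmplot(data=df, x="day", y="tip", hue="smoker", ax=ax)

# plt.show()  # uncomment to display the figure when running as a script
```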

Unlike with numerical data, it is not always obvious how to order the levels of the categorical variable along its axis.

If your data have a pandas Categorical datatype, then the default order of the categories can be set there. If the variable passed to the categorical axis looks numerical, the levels will be sorted. But the data are still treated as categorical and drawn at ordinal positions on the categorical axis (specifically, at 0, 1, …) even when numbers are used to label them.
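Both behaviours can be sketched on made-up data: the left panel takes its order from an ordered pandas Categorical, and the right panel uses numeric-looking levels (1, 5, 20) that end up at positions 0, 1 and 2. The column names are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "level": np.tile(["high", "low", "mid"], 10),
    "dose": np.tile([20, 1, 5], 10),
    "response": rng.normal(size=30),
})
# An ordered Categorical fixes the order of the levels along the axis.
df["level"] = pd.Categorical(
    df["level"], categories=["low", "mid", "high"], ordered=True
)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
sns.stripplot(data=df, x="level", y="response", ax=ax1)
# Numeric-looking levels are sorted but still drawn at ordinal positions.
sns.stripplot(data=df, x="dose", y="response", jitter=False, ax=ax2)

# Collect the distinct x positions actually used on the right-hand axes.
positions = sorted(
    {float(x) for c in ax2.collections for x in c.get_offsets()[:, 0]}
)
print(positions)
```

With jitter switched off, the printed positions are the ordinal slots of the three levels, not the values 1, 5 and 20.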

Distributions of observations within categories

As the size of the dataset grows, categorical scatter plots become limited in the information they can provide about the distribution of values within each category. There are several approaches for summarizing the distributional information in ways that facilitate easy comparisons across the category levels.

4. Boxplots

The box plot displays the distribution of a numeric variable within each category in the form of quartiles.

This kind of plot shows the three quartile values of the distribution along with extreme values. The “whiskers” extend to points that lie within 1.5 IQRs of the lower and upper quartile, and then observations that fall outside this range are displayed independently. This means that each value in the boxplot corresponds to an actual observation in the data.

When a hue semantic is added, the box for each level of the semantic variable is moved along the categorical axis so that the boxes don’t overlap.
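The sketch below first works through the 1.5 IQR rule by hand on a tiny made-up array, then draws a box plot with a hue variable. The data and column names are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# The 1.5 * IQR rule on ten made-up values.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
# Observations outside [low, high] are drawn as individual points.
outliers = values[(values < low) | (values > high)].tolist()
print(q1, q3, outliers)

# Box plot with a hue variable: the boxes are dodged within each category.
rng = np.random.default_rng(4)
df = pd.DataFrame({
    "group": np.tile(["A", "B"], 50),
    "treated": np.repeat(["yes", "no"], 50),
    "value": rng.normal(size=100),
})
fig, ax = plt.subplots()
sns.boxplot(data=df, x="group", y="value", hue="treated", ax=ax)

# plt.show()  # uncomment to display the figure when running as a script
```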

5. Boxen Plot

The boxen plot (also called a letter-value plot) is optimized for showing more information about the shape of the distribution and is best suited to larger datasets. It is very similar to a box plot, except that it plots a larger number of quantiles.

By plotting more quantiles, we can understand the shape of the distribution, particularly in the tails.

6. Violinplots

Box-and-whisker plots show medians, ranges and variability effectively, and they allow comparing groups of different sizes. But box plots can be misleading, because they do not reflect the shape of the data’s distribution. When the data “morph” but keep the same summary statistics (medians and ranges), their box plots stay the same.

The “violin” shape is a mirrored density plot of the data: turn the density plot sideways and draw it on both sides of the box plot. The result can be both intuitive and attractive.

Unlike bar graphs with means and error bars, violin plots show the whole distribution of the data, and the individual observations can be drawn inside the violin (in seaborn, with inner="stick" or inner="point"), which makes them useful for small samples as well. Violin plots are appropriate even if the data are not normally distributed. They show a quantitative variable split across the levels of a qualitative one. A violin plot combines a box plot with a kernel density estimate.
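To illustrate the difference, the sketch below compares a unimodal and a bimodal sample, both synthetic, as box plots and as violin plots. The box plots give no hint of the two modes, while the violins do. Column names are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(6)
unimodal = rng.normal(0.0, 1.5, size=500)
bimodal = np.concatenate([
    rng.normal(-1.5, 0.4, size=250),
    rng.normal(1.5, 0.4, size=250),
])
df = pd.DataFrame({
    "shape": np.repeat(["unimodal", "bimodal"], 500),
    "value": np.concatenate([unimodal, bimodal]),
})

fig, (ax_box, ax_violin) = plt.subplots(1, 2, figsize=(9, 4), sharey=True)
sns.boxplot(data=df, x="shape", y="value", ax=ax_box)
# inner="box" draws a miniature box plot inside each violin.
sns.violinplot(data=df, x="shape", y="value", inner="box", ax=ax_violin)

# plt.show()  # uncomment to display the figure when running as a script
```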

Statistical estimation within categories

In some cases, we want to show an estimate of the central tendency of the values rather than their full distribution. Seaborn has two main ways to show this information.

7. Bar plots

Bar graphs are used to compare values across different groups. In seaborn, the barplot() function operates on a full dataset and applies a function to obtain the estimate (taking the mean by default). If there are multiple observations in each category, bootstrapping is used to compute a confidence interval around the estimate, which is plotted using error bars.

A special case for the bar plot is when you want to show the number of observations in each category rather than computing a statistic for a second variable. This is similar to a histogram over a categorical, rather than quantitative, variable. In seaborn, it’s easy to do so with the countplot() function.
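The sketch below checks, on a tiny made-up dataset, that the bar heights are the group means computed by the default estimator. The column names are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data: several observations per category.
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B"],
    "value": [1, 2, 3, 4, 6],
})

fig, ax = plt.subplots()
# Default estimator is the mean; error bars come from bootstrapping.
sns.barplot(data=df, x="group", y="value", ax=ax)

# Compare the drawn bar heights with the group means from pandas.
group_means = sorted(df.groupby("group")["value"].mean().tolist())
bar_heights = sorted(
    round(float(bar.get_height()), 6)
    for bar in ax.patches
    if bar.get_height() > 0
)
print(group_means, bar_heights)

# plt.show()  # uncomment to display the figure when running as a script
```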

8. Point plots

Point plots can be more useful than bar plots for focusing comparisons between different levels of one or more categorical variables. The pointplot() function also encodes the value of the estimate with height on the other axis, but rather than showing a full bar, it plots the point estimate and confidence interval. Additionally, it connects points from the same hue category. This makes it easy to see how the main relationship changes as a function of the hue semantic, because your eyes are quite good at picking up on differences of slopes.
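A minimal sketch on made-up data: two groups measured at two time points, with the lines joining the estimates making the difference in slope visible. The column names and effect sizes are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(7)
n = 25  # observations per (time, group) cell
df = pd.DataFrame({
    "time": np.repeat(["pre", "post", "pre", "post"], n),
    "group": np.repeat(["control", "control", "treated", "treated"], n),
    "score": np.concatenate([
        rng.normal(10, 2, n),  # control, pre
        rng.normal(11, 2, n),  # control, post
        rng.normal(10, 2, n),  # treated, pre
        rng.normal(15, 2, n),  # treated, post
    ]),
})

fig, ax = plt.subplots()
# One point estimate with an error bar per cell, joined within each hue level.
sns.pointplot(data=df, x="time", y="score", hue="group", ax=ax)

# plt.show()  # uncomment to display the figure when running as a script
```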

Functions to draw linear regression models

Two main functions in seaborn are used to visualize a linear relationship as determined through regression.

These functions, regplot() and lmplot(), are closely related and share much of their core functionality. In the simplest invocation, both functions draw a scatter plot of two variables, x and y, then fit the regression model y ~ x and plot the resulting regression line and a 95% confidence interval for that regression.

The two functions produce the same plot apart from the figure shape, because regplot() draws onto a single axes while lmplot() creates its own figure through FacetGrid.

Note – regplot() accepts the x and y variables in a variety of formats, including simple numpy arrays and pandas Series objects. In contrast, lmplot() has data as a required parameter, and the x and y variables must be specified as strings. This data format is called “long-form” or “tidy” data.

Example
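The sketch below fits the same synthetic linear relationship with both functions: regplot() is given plain numpy arrays, while lmplot() is given a DataFrame and column names. The data and names are made up.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data with a linear trend plus noise.
rng = np.random.default_rng(8)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 2, size=100)

# regplot() is axes-level and accepts arrays directly.
fig, ax = plt.subplots()
sns.regplot(x=x, y=y, ax=ax)

# lmplot() is figure-level: it needs a DataFrame and column names,
# and returns a FacetGrid that owns a new figure.
df = pd.DataFrame({"x": x, "y": y})
g = sns.lmplot(data=df, x="x", y="y")

# plt.show()  # uncomment to display the figures when running as a script
```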

It’s possible to fit a linear regression when one of the variables takes discrete values; however, the simple scatter plot produced by this kind of dataset is often not optimal, because many points overlap at each discrete value.
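Two options that regplot() offers for this situation are sketched below on made-up data: adding jitter to the discrete variable with x_jitter, or collapsing each discrete value to an estimate with x_estimator. The column names are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: x takes only the integer values 1 to 5.
rng = np.random.default_rng(9)
df = pd.DataFrame({"party_size": rng.integers(1, 6, size=200)})
df["tip"] = 0.8 * df["party_size"] + rng.normal(0, 1, size=200)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4), sharey=True)
# Jitter affects only the drawn points, not the fitted regression.
sns.regplot(data=df, x="party_size", y="tip", x_jitter=0.1, ax=ax1)
# Plot the mean of y (with a confidence interval) at each discrete x value.
sns.regplot(data=df, x="party_size", y="tip", x_estimator=np.mean, ax=ax2)

# plt.show()  # uncomment to display the figure when running as a script
```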

Conditioning on other variables

lmplot() combines regplot() with FacetGrid to provide an easy interface for showing a linear regression on “faceted” plots, which allow you to explore interactions with up to three additional categorical variables. The best way to separate out a relationship is to plot both levels on the same axes and to use color to distinguish them.
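As a sketch of conditioning on other variables, the example below uses hue for one categorical variable and col for another. The dataset is synthetic and the column names are made up.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(10)
df = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, size=120),
    "smoker": np.tile(["yes", "no"], 60),
    "time": np.repeat(["lunch", "dinner"], 60),
})
df["tip"] = 0.15 * df["total_bill"] + rng.normal(0, 1, size=120)

# hue: one regression per "smoker" level, distinguished by color on shared axes.
# col: one facet (column of subplots) per "time" level.
g = sns.lmplot(data=df, x="total_bill", y="tip", hue="smoker", col="time")

# plt.show()  # uncomment to display the figure when running as a script
```

A third variable could be added with the row parameter in the same way.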

9. Mosaic Plot

A mosaic plot (also known as a Marimekko diagram) is a graphical method for visualizing data from two or more qualitative variables. It is the multidimensional extension of spineplots, which graphically display the same information for only one variable. Seaborn itself has no mosaic plot function; one option is the mosaic() function in statsmodels.graphics.mosaicplot.

Code for this plot: https://gist.github.com/Aswath98/4473bcd43312e6e580174c5c13a2ab78

10. Distplot

Seaborn’s distplot() shows a histogram with a density line on it, and it can be drawn in many variations. We use seaborn in combination with matplotlib, the Python plotting module.

A distplot plots a univariate distribution of observations. The distplot() function combines the matplotlib hist function with the seaborn kdeplot() and rugplot() functions. Note that distplot() has been deprecated since seaborn 0.11 in favor of histplot() and displot().

Conclusion

The seaborn module helps you visualize data with different plots, chosen according to the purpose of the visualization. You can also check out our previous post on the Seaborn framework. Thanks for reading this article.


Aswath Rao

Currently pursuing Msc in Data Science
