displots Video Lecture Transcript
This transcript was automatically generated by Zoom, so there may be discrepancies between the video and the text.
13:15:39 Hi, everybody, welcome back. In this video, we're gonna continue to learn a bunch of different seaborn plotting functions with distributional plots, or displots.
13:15:52 So in the previous notebook and video, we learned about relational plots, or relplots. In this video, we're gonna discuss distributional plots, which are plots that show the empirical distribution of a variable.
13:16:04 So what is the empirical distribution? Well, the empirical distribution shows the different values that a variable can take on, and then conveys the empirical probability,
13:16:17 so the probability we've observed, that the variable takes on those values.
13:16:22 So think of a very common image, sort of the bell curve for the normal distribution, which will come up in a little bit.
13:16:31 So there are several axes-level displot functions, or distributional plot functions.
13:16:38 We're going to go through them one by one, and then at the end we'll discuss the displot function, which is a figure-level function. So the first is histplot.
13:16:46 So histplot makes a histogram, which we learned how to do with
13:16:51 matplotlib. There, I believe, it's plt.hist.
13:16:55 The link to the documentation for histplot can be found here. For today's notebook,
13:16:59 we're gonna work on a data set called the peng data set, or that's how I'm storing it.
13:17:04 It's the penguins data set. So this gives different measurements for 3 different species of penguins,
13:17:12 so things like the length of their bill, the depth of their bill,
13:17:14 the length of their flipper, and their body mass. So why don't we go ahead and start just by making a histogram?
13:17:21 We already know the way to enter data into a seaborn function.
13:17:27 So we're gonna do data equals peng, p, e, n, g, and then we want to make a histogram of the bill length.
13:17:35 And so for histograms, and most distributional plots,
13:17:39 you only need to specify a single variable, the horizontal or the vertical. So x is going to be equal to bill
13:17:45 length. And now you see that we've created a histogram.
13:17:50 So just like with matplotlib's hist,
13:17:53 there are various ways to control the width and number of bins. You can control the width of the bins with binwidth.
13:18:00 So here's an example where I'm going to set the binwidth to be 2.
13:18:05 And this means it's going to go ahead and create a bunch of bins, all of which have width 2.
13:18:10 And then the number of bins is determined by
13:18:13 how many bins it takes to cover the range of the observed values.
13:18:17 Okay. So here's that. I just had to set binwidth equal to 2.
13:18:21 Another way you can do this is you can specify the number of bins with just the bins argument.
13:18:28 So I can do, let's say, bins is equal to 10, and then it will do its best to draw 10 bins.
13:18:36 You can see here that we get 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
13:18:40 But if we went too large, like, let's say we went to 100,
13:18:43 well, that's 1,000, let's go down one. We went to 100.
13:18:47 We can see it probably isn't able to draw 100 bins, but it tries to get as close as it can.
13:18:53 Okay, let's go back to 10. Finally, you can also use bins to set various bin endpoints, so we can use this example here
13:19:04 that I've already put in the comments. If I put in a list or a tuple for bins, I can specify the endpoints.
13:19:11 So I go from 30 to 40, 40 to 43, 43 to 47,
13:19:18 47 to 55, and then 55 to 70.
13:19:22 So this sets the various endpoints for your bins, which can sometimes be desirable as well.
13:19:28 So just like with the other seaborn plotting functions, we can change the appearance by doing things like color.
13:19:41 So I said I would leave it to you, but just as a quick example, I could put in, say, the hue. I may want the hue to be the sex of the penguin, and now you can see female and male. Or I could make it the species, and so we can
13:19:59 see the 3 different types that we have in our data set.
13:20:04 Okay, so that's just an example. A lot of the seaborn functions have this functionality, not all of them, but many of them do.
13:20:12 But because we have so many functions to cover today in this video, I'll leave it to you to check out the documentation, which is linked to here, for all the different aesthetic arguments.
13:20:31 Sorry, I forgot what I wrote. So the one last feature we will talk about, though, is this.
13:20:39 So here we made a histogram whose base sits on the horizontal axis.
13:20:44 It's also possible to make histograms whose bases sit on the vertical axis. Instead of setting x equal to bill length, we can set y equal to bill length.
13:20:55 And so then you see the histogram is now drawn with its base on the vertical axis, which can sometimes be desirable.
13:21:03 Later in this notebook we'll show something called a bivariate histogram, which breaks the figure up into squares, and then colors in the squares
13:21:14 based on the number of observations that fall within each square of the grid that's created, and we'll see that in a later example. The next plot type is something called the kdeplot.
13:21:27 So this makes an estimate of a probability density function.
13:21:32 So think of the probability density function as a way of describing how likely a random variable is to take on values near a certain value. With it you can figure out, like, what's the probability that a random variable falls within a certain range. This is done through something called Gaussian kernel
13:21:53 density estimation. If you're interested in knowing what that is, I've linked the Wikipedia article that goes over kernel density estimation here. Just know that we won't have to know it. All that you really need to understand about KDE plots is
13:22:07 that it's an estimate of a probability density function for a certain variable.
13:22:11 And I will point out that not every random variable actually has a probability density function. This is an estimate of what one might be.
13:22:20 So there are some caveats to using this that we'll touch on before moving on to the next function.
13:22:25 So maybe a useful way to imagine a density function is to remind ourselves of what the bell curve looks like.
13:22:33 And so we can use kdeplot to recreate the Gaussian bell curve.
13:22:38 So for my data, I just did like 100,000 random draws from a random normal, and then I've plotted it.
13:22:47 So this is the kernel density estimate curve of these 100,000 draws from a normal random variable.
13:22:55 Okay, so the bell curve is an example of a probability
13:22:59 density function, and then this is the estimate of that function.
13:23:04 So we can add additional arguments, just like we showed with histplot,
13:23:09 so something like species, and again, we're doing bill length.
13:23:13 Okay. So now, these are the KDE curves of bill length for the 3 different species, and essentially, the way to read this is that a majority of the data falls below the tallest parts of the density curve. So from like 35 to a little bit above 40 for
13:23:35 this blue curve is where most of the Adelie live, similarly for the Gentoos, and you can also read something like the Chinstrap.
13:23:43 So you see sort of a bump, and then it goes down, and then another bump. That's a suggestion that the distribution for the Chinstrap penguin is perhaps bimodal. But one of the caveats is that this bimodal nature could just be an artifact of
13:23:57 something called the bandwidth of the kernel density estimation.
13:24:03 So this is similar to the number of bins of a histogram, so it can be useful to try various bandwidths. And so to change that in kdeplot, you set bw_adjust, which stands for bandwidth adjust, the bandwidth adjust argument. So we can set bw_adjust to be 5,
13:24:23 which is quite large for this data set, and then we can see how the bill length has been sort of turned into the bell curve, whereas here's what it looks like with the default.
13:24:36 Looks bimodal, right? But if we make a larger bandwidth, it doesn't look bimodal anymore.
13:24:48 Alternatively, we could set a really small bandwidth, in which case you're kind of just seeing the existence of individual points, almost.
13:24:57 So this is another one of those things where there are rules of thumb that people tend to follow, and I believe seaborn has programmed those in in the background. Those rules of thumb don't always work, and so it maybe is useful to try varying levels of bandwidth size, and the correct
13:25:13 size for bw_adjust depends upon your data. So like a 5 is large for this data set,
13:25:18 but a 5 might be small for another data set. It just depends on the data you're trying to look at.
13:25:25 Okay, maybe it would be interesting. We could even put in the hue argument here, with the species of penguin, and compare the 2. Okay?
13:25:34 So you can see that now that my bandwidth is larger, the Chinstrap no longer looks like it's slightly bimodal. Okay.
13:25:43 So a quick aside about KDE plots before we move on to the next plot type.
13:25:49 So sometimes people prefer these. We can kind of compare and contrast with the histograms. So like when I put the species argument into hue,
13:26:01 when I change the hue to be species, this is a little bit difficult to read, because the different bins sort of overlap, particularly for the Chinstrap and the Gentoo, whereas we can compare this to the KDE plot where, while they do overlap, it can be
13:26:21 a little bit easier to read the curves than the heights of the bars.
13:26:26 So one reason people maybe prefer the KDE plot to the histogram is that the KDE plot is a little less cluttered. Because we don't have as many rectangles, we just have to follow the shape of the curves.
13:26:36 However, there are reasons to not like the KDE plot.
13:26:42 There's a couple of reasons. One is that KDE plots don't have any bounds, whereas a histogram stops at the observed points, right?
13:26:52 Whereas a KDE plot, we can't see it,
13:26:56 but in theory it goes all the way out to negative infinity and infinity.
13:27:00 This doesn't seem like much of a problem here, but it can give a false impression for certain variables. For instance, the length of a bill could not be negative.
13:27:10 But if we had some observations that were close to 0, the KDE curve might go into the negative values.
13:27:16 So KDE curves can give a warped understanding of the tails of the distribution, and can also make you think that the variable takes on values
13:27:24 that it doesn't always take on, or that are maybe physically impossible to take on. Another reason to be wary about the KDE plot is that it makes the distribution look more smooth than it actually is.
13:27:42 So as an example, this one makes the bill length look like it's basically following a normal curve.
13:27:50 And that may not be the case. So the KDE sometimes is too good at smoothing, and again you might be able to get around this by changing the bandwidth.
13:28:00 But sometimes it's nice to, you know, sort of complement
13:28:02 the KDE curve by looking at the histogram.
13:28:06 So looking at both can be helpful for getting an understanding of the distribution.
13:28:11 Okay. So the third type of distributional plot that we can make is something called the ecdfplot.
13:28:19 So this is the empirical cumulative distribution function, or the ECDF. If you've never heard of the cumulative distribution function, it's the probability that a random variable is less than or equal to a certain value. So for us this is going to be drawn using
13:28:41 what we've observed. So, for instance, we would go to the bill length and say, okay, what's the probability,
13:28:47 what fraction of my points fall at or below 40?
13:28:55 That would be an example. So that's what we're talking about with the empirical cumulative distribution function, or the ECDF.
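The "fraction of points at or below 40" idea can be written out directly. This is a plain-Python sketch with made-up bill lengths; `ecdf_at` is a hypothetical helper name, not a seaborn function.

```python
def ecdf_at(values, threshold):
    """Return the fraction of observations less than or equal to threshold."""
    values = list(values)
    return sum(v <= threshold for v in values) / len(values)


# Made-up bill lengths, for illustration only.
bill_lengths = [36.5, 38.1, 39.2, 40.0, 41.3, 45.2, 46.8, 49.5, 50.1, 52.0]

print(ecdf_at(bill_lengths, 40))  # 4 of the 10 values are <= 40, so 0.4
```

Evaluating this at every observed value and plotting the results as a step function gives the ECDF curve.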
13:29:05 So this is of great interest in probability theory and statistics. Oftentimes you might see studies that compare this for 2 different study groups. For instance, when they were looking at the effectiveness of vaccines,
13:29:22 I believe you might see something like this, right, or the effectiveness of medicine.
13:29:26 You'll see the ECDFs of the 2 groups compared for whatever random variable, like days of
13:29:32 infection duration, or something like that. And then, typically what you want to show is that one distribution tends to be to the left or to the right of another distribution.
13:29:44 So let's look at an example. So we'll make the ECDF for the bill length.
13:29:48 So for this, again, we just have to set a single variable, which is x, and then the bill length, and then we can again use the hue argument to look at this for the different species. And so we can see, for instance, that at a bill length of 40 for the Adelie, about 50 to
13:30:10 60% of the Adelie penguins have a bill length of at most 40. I maybe said at least at first, so hopefully you interpret it correctly with whatever the right modifier is. Whereas, in comparison, for the Gentoo and the Chinstrap,
13:30:30 all of their observations have a bill length of greater than 40.
13:30:38 So for us the empirical cumulative distribution function for those 2 at 40 would be 0.
13:30:46 Okay. So looking at this is one way to tell us that the distribution of bill lengths for the Adelie penguin seems to be distinctly different from the other 2 species, which can be useful to know.
13:31:01 So that's the ecdfplot. Again, it has a lot of different arguments that we've seen examples of,
13:31:09 for instance, the hue. You can also weight them, but
13:31:12 that's more of a statistical argument than a data visualization
13:31:15 one. I'll leave it to you to explore these more on your own.
13:31:22 The last plot type we're gonna look at is called a rug plot, which is kind of fun. I think it's interesting.
13:31:29 So rug plots make hash marks on the x or y, the horizontal or vertical, axis that are set at the values corresponding to the observed values plotted in the regular plot. So as an example, we're going to make a rug plot of the
13:31:47 bill length on the horizontal axis. And so it looks weird, because there's actually nothing in the plotting region, but if you go down to the horizontal axis,
13:31:56 you see all these little blue tick marks. So here you have these blue tick marks, where this tick represents one of the penguins having a bill length of about 60.
13:32:08 This tick mark represents one of the penguins having a bill length, probably of about 30, and so each of these tick marks corresponds to one of our observations.
13:32:15 So it's very rare that you have a rug plot that exists on its own. Oftentimes it's combined with another plot type.
13:32:22 So, for instance, we could plot the bill length against the bill depth, or the bill depth against the bill length, and then make a rug plot showing both. Maybe my words made that sound confusing,
13:32:35 so let's look at the plot. So we're gonna make a scatter plot of the bill depth on the vertical and the bill length on the horizontal.
13:32:42 So here's what that scatter plot looks like on its own.
13:32:45 Okay. And now what we're gonna do on top of that is, we're gonna add in the rug plot.
13:32:52 So we're gonna call rugplot, and uncomment rugplot, data equals
13:32:56 p, e, n, g, x is equal to bill length, y is equal to bill depth, and ax is equal to ax.
13:33:11 So setting the axes, so both of these are plotted on the same axes
13:33:16 object. Okay? And so now we can see this gives the distribution of the bill length values on the horizontal,
13:33:26 the bill depth values on the vertical, and then you can kind of combine to see.
13:33:31 So it's sort of a nice way to see the bivariate distribution through the scatter plot as well as the univariate distributions along the axes.
13:33:40 So rug plots are sort of, I remember in our matplotlib
13:33:44 talks, we talked a lot about, well, not a lot,
13:33:48 but we talked about how there's this contingent of people who work in data visualization who think our axes should not just be solid black lines the entire length of the figure, but rather should be used to convey information.
13:34:00 So one way we thought about that is, you can convey the range by only drawing the axis from the minimum to the maximum.
13:34:10 This rug plot is another way people are thinking that we could use the axes to convey information about the distribution of the variables being plotted.
13:34:20 So those are a group of people that believe
13:34:22 rug plots are useful, and it just depends. It might be useful to look at the rug plot.
13:34:28 Yeah, it just depends on what your use case is and what your personal data visualization
13:34:34 aesthetic is. So some people might look at this and think, oh, I don't like that at all. Other people might look at this and think, oh, that's really cool,
13:34:40 why did I never think of that? So I wanted to point out that rug plots, like the other ones, allow for hue arguments. So what changes with the hue argument is that the little tick marks that are drawn are then colored by the species.
13:34:57 So here's an example. And one reason this might be useful is, you could also tell through the scatter plot, but you can kind of tell through the rug plot that the Adelie penguins
13:35:07 tend to have shorter bills, whereas the Gentoo and Chinstraps have about the same.
13:35:13 You know, their distributions overlap. And then, similarly, the
13:35:18 Gentoo penguins have smaller, or less deep, bills, whereas the Adelie and Chinstraps tend to overlap on their bill depth distributions.
13:35:31 So one thing you might have noticed was that this was an example of a bivariate distribution plot.
13:35:40 So all you have to do to do
13:35:43 this is put in both an x and a y value, and I believe in this final example,
13:35:49 with the displot figure-level function, we're gonna see an example of a bivariate histogram, which might be fun.
13:35:56 Okay. So the displot function is the figure-level function for these distribution plots.
13:36:04 So with this you can make any of histplot, kdeplot, and ecdfplot.
13:36:11 So the histplot is made by setting kind equal to hist, the kdeplot is made by setting kind equal to kde, and the ecdfplot is made by setting kind equal to ecdf. I will point out that displot cannot make a
13:36:26 rug plot on its own, but one thing you can do is add the rug plot back in
13:36:32 if you desire. So in this example, we're gonna make a displot that is a bivariate histogram, and then we'll add a rug plot on top of it.
13:36:40 So we're gonna call displot,
13:36:43 sns.displot, and then I'm going to set my data equal to peng, p, e, n, g.
13:36:52 My x I'm going to set equal to bill length.
13:36:57 My y I'm going to set equal to bill depth.
13:37:02 Okay. And then my kind I'm gonna set equal to hist.
13:37:09 And before adding in the rug plot, we'll show what this looks like.
13:37:13 So the way this works is it's gonna break your coordinate plane into little squares.
13:37:22 And then within each square it counts up the number of observations that fall within that square. Lighter colors,
13:37:29 like this lighter shade of blue, tend to be the lower values, whereas darker colors, like this dark shade of blue, are the ones that have the most observations in them.
13:37:42 So now I can go ahead and add in my rug plot, and let's see, how did I do this?
13:37:51 I think I just do g is equal to, yup.
13:37:56 So I set g equal to this, and then we can remind ourselves, well, what is g?
13:38:02 It's a FacetGrid, and then a nice feature is within the FacetGrid,
13:38:07 you can access the individual axes object by doing g
13:38:11 dot ax, okay? So then to add my rug plot, I call sns
13:38:19 dot rugplot, and I do data equals peng, x equals bill length,
13:38:28 y equals bill depth, and then for the axes argument, remember, rugplot is an axes-level function, so I can set where it's drawn.
13:38:39 I put in my g.ax, okay? And so now you can see that my rug plot is drawn on top of the histogram that I drew with my displot.
13:38:51 Okay, so that's a nice example to show how you can access the individual axes of a figure-level plot,
13:39:00 and then on top of that, combine a specific axes-level plot and a figure-level plot in one.
13:39:05 Okay. So I hope you enjoyed learning about the various distribution plots that you can make using seaborn.
13:39:14 I enjoyed having you watch this video. In the next video, we continue our way through the 3 different function categories that seaborn gave in that little diagram to talk about categorical plots.