catplots Video Lecture Transcript

This transcript was automatically generated by Zoom, so there may be discrepancies between the video and the text.

13:44:16 Hi, everybody! Welcome back! We're continuing to learn about seaborn. 13:44:21 In this video, we're gonna learn how to make the various categorical plots that seaborn has to offer. 13:44:28 So let's go ahead and get started. So what do we mean when we say categorical plots? 13:44:35 So this is a plot that looks at the relationship between a categorical variable, so going back to that penguin example from the previous notebook, species or sex or island, and then some other variable. 13:44:54 So this could be another categorical variable, but more often than not these are looking at different continuous variables, like the bill length or the bill depth or the body mass, etc. 13:45:06 So this is kind of similar to when we looked at setting the hue argument for different displots, like with histograms and KDE 13:45:15 plots. But one of the main differences is that these plots are explicitly set up for comparing the values of variables across different categories. 13:45:24 So they're set up so that one of the axes is typically set to be the categorical variable, whereas the other axis is then set to be the continuous variable in some way. 13:45:36 So that's sort of the main difference: whereas in the displots all of the categories are sort of plotted on top of one another, 13:45:43 here they're separated and spaced out to make it easier for you to make comparisons. 13:45:48 So seaborn has a wide array of functions to make categorical plots. We've got strip plots, swarm plots, box plots, boxen plots, violin plots, bar plot, point plot, count plot, and then finally, the figure-level function that can do most of 13:46:04 these, catplot.
So I've linked the documentation for all of these up at the top. 13:46:08 So if at any point you're going through this video and you're like, oh, I wonder what I can do with this, 13:46:12 you can learn more by just going to the documentation that I have linked up here. 13:46:18 So we're gonna break these down into individual categories based on the flavor of categorical plot. 13:46:27 And we won't go as in depth, remember, because at this point we've built up a wide array of functions, 13:46:31 so we're used to working with seaborn functions. 13:46:34 A lot of the arguments are the same across all of these seaborn functions, like hue, 13:46:40 things like that. So when there are certain things that I think are useful for you to know right now, I may mention them, but a lot of the different arguments, like hue and other things, I'm going to leave to you to explore in the documentation 13:46:54 on your own time. So the first flavor of categorical plot is those that are based on scatter plots. 13:47:00 So the idea is that on one axis we're gonna plot 13:47:05 the different categories, and then on the other, basically just set a point where the different observations are. 13:47:11 So using this penguin example again, basically we'll say, here is for Adelie, for Gentoo, for Chinstrap. 13:47:19 We're gonna basically plot what the bill length is on a line. 13:47:23 So the first one that we're gonna see is called a strip plot. 13:47:27 And a strip plot basically does just plot everything as a dot on a line, almost like a rug plot, 13:47:34 but it adds in something called jittering, through the jitter argument, that allows the points to be randomly displaced 13:47:40 to the left or to the right, if we're plotting vertically, so that way they don't overlap with one another. So let's see an example where I'm going to call a strip plot.
13:47:49 And the only thing I'm gonna type, because we've seen this before, is just the word stripplot. 13:47:54 So we're gonna make a strip plot that looks at the flipper length for the various species of penguin. 13:48:02 So here we go, and we can see here that we have these different points. 13:48:05 Okay. So for instance, this particular Adelie penguin has a flipper length of a little bit above 170, whereas this Chinstrap penguin has a flipper length of a little bit above 210. 13:48:20 Okay, so basically, that's how this works. And the way that you can change the appearance, in terms of how spread out it is, is with the jitter argument. 13:48:30 So here's an example where I have very little jittering, so I set jitter equal to maybe 0.5. 13:48:38 Oh, I guess that was really big. 13:48:40 I thought it would be small. Let's make it 0.05. 13:48:43 Okay? So you can see how it's a little bit tighter. 13:48:46 And we could go even smaller, 0.01, where it's basically a line. We could even set it to 0, I believe. 13:48:53 And now it is just on a straight line. All right, and then, in contrast, I could do a much larger jittering, so I could set my jitter equal to 0.2, and you can see now the band is a little bit wider, and the points are less likely to overlap. What if I did 0.25? 13:49:11 It's even wider still. Okay? So that's sort of what a strip plot is. 13:49:15 Again, it has other arguments that you can customize, based on variable values, 13:49:21 but that's all I think you need to know in order to get started on strip plots. 13:49:26 The other type of scatter-based categorical plot is called a swarm plot. 13:49:32 So like a strip plot, it tries to plot the individual values along an axis. 13:49:37 The slight difference is, instead of doing jittering, what it's going to do is, let's say there's a bunch of Adelie penguins with a flipper length a little bit above 170.
13:49:47 It will then stack them up horizontally, so because we're making vertical plots it'll stack them up horizontally until it can't stack anymore, and then it will just keep putting them on top of one another. So why don't we see this example. So 13:50:01 swarmplot, and you can see here, there's one penguin at this level, one penguin 13:50:08 at this level, one penguin at this level, and then 3, and it does its best to stack them like 13:50:14 so. However, you'll notice that I had to make this figure a little bit wider than the previous one, which was a width of 6. So if I have a figure that is not wide enough, it's gonna start taking these points and just stacking them on top of one another, sort of 13:50:33 making it impossible to see just how many there are. So here's an example where I have a very narrow figure in comparison, and you'll notice that I get this warning basically 13:50:44 telling me that my figure is too narrow to be able to fit all the different observations. 13:50:53 And you can kind of see in these little regions here where they're getting stacked up on top of one another. 13:50:58 So basically it's just saying, I can't place any more points for these different categories. 13:51:04 And it's nice that it gives you a little bit of detail, like 26.5% of the Chinstraps. 13:51:07 I think these percentages correspond to the different categories. 13:51:16 Okay, so one way to change this, as they're saying, is you could decrease the size of the markers. 13:51:22 You could make the figure wider. So like 13:51:26 7 isn't wide enough, but 8 was wide enough. Alternatively, 13:51:32 let's say we go to 5. We could go through and change the marker size, I think that will work. 13:51:39 Let's see. So if I make them really small, then they'll fit. 13:51:42 But now, you know, it's a balancing act, like with a lot of data visualization. 13:51:45 It's a balancing act.
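The strip plot and swarm plot calls described above can be sketched roughly as follows. This is a minimal sketch, not the code from the lecture notebook: the small `penguins` DataFrame is made up here as a stand-in for the real penguins data, and the column names `species` and `flipper_length_mm` are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up stand-in for the penguins data (illustrative values, not real measurements)
rng = np.random.default_rng(0)
penguins = pd.DataFrame({
    "species": np.repeat(["Adelie", "Chinstrap", "Gentoo"], 40),
    "flipper_length_mm": np.round(
        np.repeat([190.0, 196.0, 217.0], 40) + rng.normal(0, 6, size=120)
    ),
})

fig, (ax_strip, ax_swarm) = plt.subplots(1, 2, figsize=(14, 5))

# Strip plot: one dot per observation, randomly jittered sideways within each category.
# A larger jitter value gives a wider band; jitter=0 puts every dot on a single line.
sns.stripplot(data=penguins, x="species", y="flipper_length_mm", jitter=0.2, ax=ax_strip)

# Swarm plot: dots are shifted sideways just enough to avoid overlapping.
# A smaller marker size (or a wider figure) helps when not all points can be placed.
sns.swarmplot(data=penguins, x="species", y="flipper_length_mm", size=3, ax=ax_swarm)
```

If the figure is made much narrower, or `size` much larger, the swarm plot may warn that some points cannot be placed, which is the balancing act mentioned above.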
So yeah, let's go back to where we were. 13:51:53 Okay, so those are the 2 scatter-based plots. 13:51:59 The next are distribution-based plots. And basically, what these do is, for the different categories, 13:52:03 they in some way try to plot the distribution of the other variable you're looking at. 13:52:09 So these have to be continuous variables. So the first we've seen before is the box plot, so we call boxplot, and let me change that alignment. There we go. 13:52:23 So we call boxplot. We introduced these in the Jupyter notebook on matplotlib. 13:52:29 So it's just gonna go ahead and plot the box plot for the various species. 13:52:35 And here again, I'm looking at flipper length. So here's our box plot, and remember, the box goes from the 25th percentile, through the median, up to the 75th percentile. 13:52:46 And then these lines, just like in matplotlib, are drawn out to at most 1.5 times the interquartile range beyond the box. Okay. 13:52:55 And again, just like that, we could alter that with the whis 13:53:01 argument, so we could change it to be whis equal to, say, 0.75, 13:53:07 as an example. And now you can see the bands are a little bit smaller, or we could change it to 2, and now they're a little bit wider. Okay. 13:53:17 An extension of the box plot is something called the boxen plot, and so I'm going to go ahead and give the example here, 13:53:24 so boxenplot, and then give you an idea of what the difference is. 13:53:31 So here's the boxen plot, and one of the main points 13:53:34 is it still has that rectangle from the box plot. 13:53:37 So these rectangles in the box plot are the exact same rectangles drawn here, but now it has these additional smaller rectangles drawn on top of it, instead of the whiskers. 13:53:48 So what these rectangles give: essentially this first one goes from 13:53:58 the lower quartile to the upper quartile.
13:54:03 So the 25th to the 75th percentile. Now the next rectangle is going to go from the lower eighth 13:54:12 at the bottom, and then the top of this rectangle is the upper eighth. 13:54:15 Then the third rectangle, the smaller one after that, goes from the lower sixteenth up to the upper sixteenth, and it keeps going in that way until it figures that there are not enough points to justify continuing to draw additional rectangles. So 13:54:34 why do people like this over the box plot? This one gives a little bit better feel for what the actual distribution looks like compared to the box plot, where whatever is not within the interquartile range 13:54:53 gets relegated to these whiskers, and the whiskers don't really give us a sense of, okay, what do the upper 25% and the lower 25% look like? So that's why people may prefer this. If you'd like to read more 13:55:10 about the idea of the boxen plot, also known as a letter-value plot, 13:55:15 you can click on this and read the original paper where it was introduced, 13:55:20 if you're interested. 13:55:24 The third distribution-based plot is the violin plot. 13:55:29 So the violin plot, and I think it's best 13:55:32 if I just draw it. So violinplot. The violin plot just makes the KDE plot. 13:55:39 So this little curve here that I'm tracing out is an example of the KDE plot for the Adelie, and we can actually even show this. 13:55:46 So, showing the KDE plot for flipper lengths: sns.kdeplot, data equals penguins, x equals, 13:56:00 let's make it y, so you can tell, y equals flipper length, 13:56:07 better make this a string, and then hue equals species. 13:56:15 Okay. And it's a little bit harder to tell.
13:56:18 But you can see this blue curve right here for the Adelie corresponds to the blue curve here, and then it turns it into a violin shape by reflecting the KDE plot along the middle. 13:56:33 So imagine reflecting this, and now it looks like a violin. 13:56:38 A nice feature of violin plots is that, in addition to the KDE plot, 13:56:43 you can plot other plot types at the same time. 13:56:46 So this bar in the middle here is the box-and-whisker plot. 13:56:51 So the little rectangle in the middle is the interquartile range, 13:56:55 the white dot is the median, and then these black lines are the whiskers. Okay? 13:57:00 So there are other plots that you can put in the middle of the violin plot. 13:57:05 So the default is box. If you set the argument inner equal to box, 13:57:11 it will draw that box plot. If you set it equal to quartile, 13:57:16 it will just draw lines representing the quartiles, so the 25th percentile, the median, and the 75th percentile. If you set it to point, it will draw each observation within the violin as a strip plot with 0 jitter. And if you choose inner equal to stick, it does the same thing, but 13:57:37 instead of points, it draws them as horizontal lines 13:57:42 in this case. So let's show some examples. So what if I set inner equal to quartile? 13:57:49 And now you can see the dotted line in the middle is the median, and these are the upper and lower quartiles. 13:57:56 If I set it equal to point, 13:58:01 then I've got these black points representing all the different observations, and if I set it equal to None, 13:58:10 then there's nothing drawn. Let's also show what happens when I set it equal to stick. 13:58:19 And now, instead of the strip plot with no jitter, you've got these horizontal lines, where each of these lines represents one of the observations. 13:58:28 Okay, all right.
13:58:33 So the box plot, the boxen plot, and the violin plot: 13:58:37 these are the 3 distribution-based categorical 13:58:39 plot types that you can make with seaborn. 13:58:44 The last grouping is the group statistic comparison. 13:58:47 So one of the classic things you might learn in a statistics class is, you ask group A and group B to vote for something, and then you want to compare the votes for both of the groups. 13:59:01 So that's a group statistic comparison. 13:59:04 So these types of plots look at some sort of statistic, like a proportion or a mean or a median, and then compare it between the categories. 13:59:16 So this is a very common task in statistics, and so these are some charts that maybe you'll want to look at if you're working on a problem like that. 13:59:25 So the first is barplot. barplot allows you to take in a category on one axis, and then a variable on the other axis, and it can plot the mean or the median or the proportion, depending on what you're looking at. So in this example 13:59:43 I'm going to use the bar plot to compare the average flipper length for the various species, and so the height of these bars represents the 13:59:57 average flipper length for the Adelie, Chinstrap, and Gentoo penguins. 14:00:01 Now you might be wondering what these little black lines are. 14:00:06 So these are the error bars on the estimate, and once again, I'm gonna say, for now don't worry about those; we'll talk about them in more depth in a later notebook. 14:00:19 We can control what statistic gets compared. So by default 14:00:22 that was the mean. But if I set my estimator argument equal to median, for example, now it's comparing the medians. 14:00:30 It's kind of hard to tell the difference, because they're pretty similar, but now it's comparing medians.
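The effect of the estimator argument described above can be sketched like this. It is an illustrative sketch only: the `penguins` DataFrame is made-up stand-in data with assumed column names, and `np.median` is passed as the estimator callable (how the median was supplied in the lecture notebook is not shown in the transcript).

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up stand-in for the penguins data (illustrative values, not real measurements)
rng = np.random.default_rng(2)
penguins = pd.DataFrame({
    "species": np.repeat(["Adelie", "Chinstrap", "Gentoo"], 40),
    "flipper_length_mm": np.round(
        np.repeat([190.0, 196.0, 217.0], 40) + rng.normal(0, 6, size=120)
    ),
})

fig, (ax_mean, ax_median) = plt.subplots(1, 2, figsize=(10, 4))

# Default estimator: bar height is the mean of y within each category
sns.barplot(data=penguins, x="species", y="flipper_length_mm", ax=ax_mean)

# Custom estimator: bar height is now the median; the error bars follow the same statistic
sns.barplot(data=penguins, x="species", y="flipper_length_mm",
            estimator=np.median, ax=ax_median)

ax_mean.set_title("mean")
ax_median.set_title("median")
```

Any function that reduces a group of values to a single number can be passed as the estimator in the same way.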
14:00:34 And these error bars are on the median estimate. 14:00:39 In addition to barplot, yeah, that's what it was, 14:00:43 right, barplot. In addition to barplot, 14:00:48 you can also make a count plot. So countplot compares the counts for different categories. 14:00:55 So, for instance, we can look at the numbers for each sex based on the species, and so here we can see for Adelie there are equal numbers of males and females at a little over 70. For Chinstraps 14:01:09 there are equal numbers of males and females at a little over 30, and then for Gentoo 14:01:13 there appear to be a few more males than females, at somewhere between 50 and 60. 14:01:19 Okay? So that's a count plot, kinda like a bar plot, 14:01:22 but instead of means, or proportions, or medians, it's comparing counts. 14:01:27 So this would be a way to compare one categorical variable across another. 14:01:35 The final plot type that we'll look at is called the point plot. 14:01:39 So a point plot is a nice alternative to, like, a bar plot or a pie chart. 14:01:45 So maybe not a pie chart, but it is a nice alternative to a bar plot. 14:01:50 So what it's gonna do is, instead of plotting the value of the average flipper length as the height of a rectangle, it plots it as a single point, and then puts the error bars behind that point. 14:02:03 So here's an example where we're gonna look at the average flipper length by species 14:02:10 and then also sex. 14:02:14 And so here, what we can see is a blue line, which represents the male flipper lengths, and then an orange line that represents the female flipper lengths. 14:02:25 So each of these points here gives the average flipper length for that species and that sex.
14:02:31 So, for instance, the average male flipper length for the Adelie penguins is about 192, whereas the average female flipper length for the Adelie penguins is about 187, and then the points are connected based upon the same sex 14:02:47 grouping. So this is the male for the Adelie, which is connected to the male for the Chinstrap, which is connected to the male for the Gentoo. 14:02:56 So some people might like this better than a bar plot, because it's a little cleaner. 14:03:00 It takes up a little less space than the bar, and maybe also allows for easier comparisons when you want to do subgroups of groups or subcategories of categories. 14:03:11 Again, these are the error bars, and we will be talking about those in their own notebook to give you an idea of what these are plotting. 14:03:20 Okay. So we've now gone over every categorical axes-level function 14:03:26 you can use. The figure-level categorical function is called catplot. 14:03:31 So I've provided a list here of all the different plots that can be made by calling catplot, and then the appropriate kind argument for each. 14:03:43 So, for example, you can make a strip plot with the catplot function if you just set kind equal to strip. 14:03:49 So since we've spent some time reviewing how relplot and displot work, 14:03:55 catplot is almost identical in terms of what arguments you have to put in to get what you would like. 14:04:01 So I'm not going to spend our time talking about that here. 14:04:04 I'm going to leave it to you to experiment on your own if you're interested, and, you know, at the top, remember, I provided documentation links for all of these functions, including catplot here at the bottom. 14:04:16 Okay. So now we've gone over the 3 main flavors of plotting function 14:04:24 in seaborn: the relational plots, the distribution plots, and the categorical plots. And in the next notebook
14:04:30 we're gonna review 2 plotting functions that don't neatly 14:04:34 fall into one of these 3 categories, but they still allow you to make pretty powerful plots. So I hope you enjoyed learning about categorical plots. 14:04:44 I enjoyed having you watch this video, and I hope to see you next time, when we learn about, what was it? 14:04:50 pairplot and jointplot.
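As a closing illustration of the figure-level catplot function discussed in this video, the sketch below makes a strip plot, a point plot, and a count plot by changing only the kind argument. It is a minimal sketch under stated assumptions: the `penguins` DataFrame is made-up stand-in data, and the column names `species`, `sex`, and `flipper_length_mm` are assumed rather than taken from the lecture notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line inside a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up stand-in for the penguins data (illustrative values, not real measurements)
rng = np.random.default_rng(3)
penguins = pd.DataFrame({
    "species": np.repeat(["Adelie", "Chinstrap", "Gentoo"], 40),
    "sex": np.tile(["Male", "Female"], 60),
    "flipper_length_mm": np.round(
        np.repeat([190.0, 196.0, 217.0], 40) + rng.normal(0, 6, size=120)
    ),
})

# kind="strip" gives the same kind of plot as stripplot, but on its own figure
g_strip = sns.catplot(data=penguins, x="species", y="flipper_length_mm", kind="strip")

# kind="point": mean flipper length per species, one connected line per sex
g_point = sns.catplot(data=penguins, x="species", y="flipper_length_mm",
                      hue="sex", kind="point")

# kind="count": number of rows per species, split by sex (no y variable needed)
g_count = sns.catplot(data=penguins, x="species", hue="sex", kind="count")
```

Because catplot is figure-level, each call creates its own figure and returns a FacetGrid; the figure size is controlled with the height and aspect arguments rather than a matplotlib figsize.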