bokeh
Data for bokeh
Video Lecture Transcript
This transcript was automatically generated by Zoom, so there may be discrepancies between the video and the text.
16:40:29 Welcome back. In this video, we continue to learn about the Bokeh package. In particular, we're gonna talk about the types of data that you can feed into Bokeh. 16:40:39 So let's go ahead and get started. 16:40:48 So just remember, since we're in a Jupyter notebook environment, you have to make sure to run this output_notebook function in order to get Bokeh to run properly. 16:41:00 So in the previous notebook we learned about how to make scatter plots and line plots, 16:41:05 basic non-interactive, static plots within Bokeh. 16:41:09 So in that notebook we would just take something like an array, 16:41:13 a list, or a tuple, provide it to the various arguments like x, y, size, and color, 16:41:19 and then that was it. But just like seaborn, Bokeh also accepts long-format data. 16:41:25 So this means that Bokeh has the ability to take in an object like a 16:41:29 data frame or a dictionary, and then use the column 16:41:33 names of those for some of those arguments like x, y, size, fill color, etc. So let's go ahead and see a couple of examples. Before we dive into those examples, though, I want to take a second. 16:41:48 So in the rest of these notebooks, we're going to use some of the pre-loaded Bokeh data sets. In order to use those you have to first run 16:41:56 bokeh.sampledata.download. So this is going to download the data on your computer 16:42:02 so you can call the data sets later. So go ahead and run this, and then you'll see something different, because I've already downloaded the data. 16:42:09 You'll see something like a download progress, so this might take a little bit of time. 16:42:14 So if you need to, pause the video while the download is happening and then come back. 16:42:18 Okay.
So for my first example, I'm just going to randomly generate some data. 16:42:24 So I'm gonna have an x, a y, which is a function of x, randomly selected sizes, and randomly selected colors from blue, pink, and orange. 16:42:36 So I'm gonna put all that data in a dictionary where the key is just basically the name of the variable, 16:42:42 and then the value is the variable itself. So we can check, 16:42:47 this is what my data dictionary looks like. Okay. 16:42:53 Alright. So now, when I have this, I can use something called the source argument 16:42:58 when adding a glyph, so that instead of feeding 16:43:03 in each of the individual tuples, lists, or arrays one by one, 16:43:07 I can just feed this data dictionary into the source 16:43:10 argument, and then use these column labels as my arguments instead. 16:43:17 So I'm going to go ahead and make a figure, and then add the circle 16:43:20 glyph. So I'm going to call first my source argument, which is source is equal to dict_data. 16:43:31 Then x is going to be equal to the string x. 16:43:39 Change that because it was backwards. y is going to be equal to the string 16:43:43 y, size is going to be equal to the string size, and fill_color is going to be equal to the string color. 16:43:52 Okay. 16:43:54 Now for my last two arguments, I'm just going to set my edge color to be the string black. Even though that's not a column of the data frame, 16:44:02 it will recognize that it's not a column of the data frame 16:44:05 and then just set it to the color black. And then finally, I'm gonna set my alpha to 0.7. 16:44:13 Oh, no! What did I do? 16:44:20 Not edge color, just line_color. There we go. 16:44:26 And then maybe I should change this as well, so my comment's correct. So here you go. 16:44:31 We've made this plot using this data dictionary instead of using the individual lists or tuples. So maybe it seems silly that we created this
16:44:42 in the first place, but remember, sometimes we're going to be loading data. 16:44:45 Oftentimes we'll be loading data that's been prepared for us and not creating data from scratch. 16:44:51 Similarly, I can go ahead and do the same thing with a source argument that's a data frame. 16:44:58 So I took that data dictionary and turned it into a data frame. 16:45:02 And now, instead of source equals the data dictionary, it can just be the data frame. 16:45:07 And then all the other arguments are still the same, so I won't go over entering those, and I can see I get the same exact plot. 16:45:16 Okay. 16:45:19 Alright! So what's going on when we put in a data frame or a data dictionary is that Bokeh is creating an object known as a ColumnDataSource. 16:45:29 So Bokeh recognizes that this argument to source is a columnar data type, and then after that happens, it makes its own internal data object called the ColumnDataSource. 16:45:41 So, instead of waiting for Bokeh to do that for us, we can make our own ColumnDataSources, and we'll see why that's useful in a little bit. 16:45:50 So what we can do is import the ColumnDataSource. 16:46:00 So from bokeh.models 16:46:03 we're gonna import ColumnDataSource. Okay, from bokeh.models import ColumnDataSource. 16:46:13 So now, for doing it this way, you call ColumnDataSource, and then you can either put in that data dictionary or you can put in that data frame. I'll 16:46:22 just put in the data frame. So ColumnDataSource(df), and then after that, we can see. 16:46:28 Oh, geez, I just forgot to see it. Now we can see that we've created a ColumnDataSource object stored in the variable source. 16:46:39 And so now, instead of the data frame or the data dictionary, I can just put in that ColumnDataSource object and get the same plot.
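As a rough sketch of the three options described so far (a dictionary, a data frame, or an explicit ColumnDataSource passed to source), the code below builds the same plot three ways. The data is randomly generated, and the names dict_data, df, and make_plot are illustrative assumptions rather than the notebook's own. It also uses scatter instead of the circle glyph from the video, on the assumption that recent Bokeh releases prefer scatter for size-based markers.

```python
import numpy as np
import pandas as pd
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

# Randomly generated stand-in data (not the notebook's actual values).
rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, n)
dict_data = {
    "x": x.tolist(),
    "y": (2 * x + rng.normal(0, 1, n)).tolist(),
    "size": rng.uniform(5, 20, n).tolist(),
    "color": rng.choice(["blue", "pink", "orange"], n).tolist(),
}

df = pd.DataFrame(dict_data)      # same data as a data frame
source = ColumnDataSource(df)     # same data as an explicit ColumnDataSource


def make_plot(src):
    """Build the scatter plot from any of the three source types."""
    p = figure(title="Columns referenced by name")
    p.scatter(
        source=src,
        x="x",                 # column names, given as strings
        y="y",
        size="size",
        fill_color="color",
        line_color="black",    # not a column, so it is treated as a plain color
        alpha=0.7,
    )
    return p


plots = [make_plot(src) for src in (dict_data, df, source)]
# In a notebook, after output_notebook(), show(plots[0]) would display one.
```

Whichever object is passed in, each renderer ends up holding a ColumnDataSource, which matches the conversion described in the lecture.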
16:46:47 So sometimes it's useful to create your own ColumnDataSource object and use that instead of using the data frame or the dictionary. 16:47:00 So one reason why this can be useful is because Bokeh has a number of functions that allow you to transform your columnar data. So this means it's going to take in arguments from your columns and then transform them in a way that will change the way 16:47:17 they look on your plots. So one example of this is to add jittering in some sort of categorical scatter plot. 16:47:23 So remember, in our seaborn content, we learned about the stripplot function, which would plot the values of a particular continuous variable, stratified by the value of a categorical variable. 16:47:37 So we can do the same thing in Bokeh, but we now have to add the jitter transform function, which does the jittering for us. 16:47:45 So Bokeh does not, by default, have a function like stripplot. 16:47:49 You have to add the jittering by hand, so we're gonna learn how to do this 16:47:55 using a data frame on various GitHub commits. So this data frame, called commits, 16:48:02 has the datetime as its index. So this DatetimeIndex gives the exact datetime of when the commit was created, and then it has the day and the time columns. 16:48:15 So the plot we're going to make in this example is, we're going to have on the vertical axis the days, you know, Saturday, 16:48:24 Sunday, Monday, Tuesday, etc. And then on the horizontal axis, we'll have the time of day when the commit occurred. 16:48:34 So basically we're going to have seven horizontal bands where the different dots each represent a single commit made to a GitHub repository, and it's basically just allowing us to see if there are any patterns in this GitHub user's commit history. 16:48:49 So there are 4,916 observations in this data frame. 16:48:55 And we can, you know, see
for ourselves, that's a lot of observations to try and plot one next to each other, particularly if the times are very close. 16:49:04 So that's why, if we did not add jittering, it might be almost impossible to distinguish individual points from one another. 16:49:12 So we're going to go ahead and use this jitter function in this example. 16:49:16 So we're gonna import jitter. So from bokeh.transform, we import jitter. 16:49:25 And so the way we use jitter, I guess, let me first go over the way we set up this particular plot. 16:49:31 So the first thing I do, and maybe I'll break this off as its own thing, 16:49:35 is I take my commits data frame and turn it into a ColumnDataSource. 16:49:40 So now that's been done. You'll see this list being made of the different days of the week, and you know what I just realized? 16:49:50 It's out of order. That's weird. So it should be... 16:49:55 I guess it's just in a weird order. But it's probably okay. 16:49:59 So I think it'll probably go Monday at the very top, all the way down, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday. 16:50:06 So I think that's okay. So you can see, we have this argument, 16:50:09 y_range is equal to that list, days. So this is taking in this list and setting the tick marks of the y variable to be equivalent to these strings. 16:50:21 Now, we've seen this before. We're setting our x-axis to be able to take in datetime data, which we want, because this is a time. 16:50:31 And then we put a title, and we'll talk a little bit more on the titles 16:50:35 in a later notebook. So now I'm going to call scatter, and I specify my source here. So last time I put it at the top, 16:50:43 but it doesn't matter with a named argument as long as you provide the name. 16:50:46 My horizontal axis is set to be time, my alpha is set to be 0.4, and my y is where I'm using the jitter, so I call the jitter function.
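The setup being walked through here can be sketched roughly as follows. This is only an illustration: it uses a small randomly generated stand-in for the commits data frame (with assumed columns day and time) instead of the Bokeh sample data, and the order of the days list and the title text are assumptions. The jitter call passes the categorical column, a width, and the figure's own y range.

```python
import numpy as np
import pandas as pd
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
from bokeh.transform import jitter

# Categories for the vertical axis (this ordering is an assumption).
days = ["Sun", "Sat", "Fri", "Thu", "Wed", "Tue", "Mon"]

# Synthetic stand-in for the commits data: a day label and a time of day.
rng = np.random.default_rng(1)
n = 300
seconds = rng.uniform(0, 24 * 60 * 60, n)
commits = pd.DataFrame({
    "day": rng.choice(days, n).tolist(),
    "time": pd.Timestamp("2024-01-01") + pd.to_timedelta(seconds, unit="s"),
})
source = ColumnDataSource(commits)

# y_range takes the list of category strings; the x-axis shows datetimes.
p = figure(y_range=days, x_axis_type="datetime", title="Commits by time of day")

# y is the categorical column, randomly shifted within a band 0.6 wide.
p.scatter(
    source=source,
    x="time",
    y=jitter("day", width=0.6, range=p.y_range),
    alpha=0.4,
)
```

A smaller width gives narrower bands and a larger width gives wider ones.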
16:50:57 I specify what column I would like to apply the jitter for, 16:51:02 okay, so the categorical column, which is the y. I specify the width of my jittering, 16:51:09 so this will be a band of 0.6. And then I specify a range, basically just saying take the range from the figure itself. 16:51:23 Okay. So this is the plot that gets made. 16:51:30 And we can see how, if we changed, 16:51:34 you know, the width, now we get a narrower one, narrower, narrower, and then, if we increase it, we'll have a wider one. 16:51:44 So why don't we go back to that 0.6, which I think was a nice, happy medium. 16:51:49 Now we see, you know, this jitter was made possible by us having a ColumnDataSource format. 16:51:57 Another thing that we might be interested in knowing how to do is to change the color by a column. 16:52:03 So just like with seaborn or with Matplotlib, where we would learn how to color plots based on a categorical or continuous variable of a data frame, we can do the same thing in Bokeh. 16:52:17 But we have to do it by hand with either the function linear_cmap or factor_cmap. 16:52:22 So we're going to use this auto MPG data set as an example. 16:52:25 So this data set provides information on automobiles, 16:52:30 so things like the miles per gallon, the number of cylinders that the engine has, the displacement that the vehicle provides, the horsepower of the vehicle, the weight of the vehicle, etc. 16:52:42 And this column year here, it's two digits. That's because it's assumed that the first two digits are 19. 16:52:48 So this instance in the first row was a car that was made in 1970. 16:52:54 Okay, so we're going to go ahead and demonstrate how to use linear_cmap, which takes in a minimum value through the low argument and a maximum value through the high argument, 16:53:05 and then produces a color map for a particular variable,
16:53:12 given a selected palette, and we'll talk a little bit more about palettes in a second. 16:53:17 So the first thing we have to do is import linear_cmap. 16:53:20 So from bokeh.transform we'll import linear_cmap. 16:53:27 Now I'm going to go ahead and create my source, which is the ColumnDataSource of the data frame, 16:53:34 create my figure, find the minimum. So what I'm gonna do is I'm gonna plot the miles per gallon, or rather the horsepower, on the horizontal axis and the weight on the vertical axis. 16:53:47 And then I'm going to color the points by the miles per gallon. So when I'm coloring the points, I need my low and my high value. So the low is going to be the minimum miles per gallon that we observed, and then high will be the maximum miles per gallon. So we set 16:54:02 the fill color. You call linear_cmap, you specify the column that's being used to create the color map, which is mpg, you specify the low, which is the minimum value, 16:54:17 you specify the high, which is the maximum value, and then you specify the palette. 16:54:22 So we'll look at this in a second, but the palette we're using is something called Magma, and it has 256 color units. 16:54:30 Then we do the same thing for the line color, just to make sure the edge of the circle is the same color as the fill of the circle. Okay, so here's that plot. 16:54:42 We can see that the cars with the highest... So in this particular color palette, darker colors are lower, lighter colors are higher. So we can see that the most fuel-efficient cars or automobiles tend to live down here, where the weight is low 16:55:03 and the horsepower is also low. I believe it was horsepower, right? 16:55:09 Yeah, the weight is low and the horsepower is low. Another function that you might want for color maps is factor_cmap, 16:55:17 which will take in a variable and then assign each possible value to a unique color.
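A minimal sketch of the linear_cmap step described above, under some assumptions: the six rows below are made-up stand-ins for the auto MPG data, and the column names mpg, hp, and weight are illustrative and may not match the notebook's data frame. The same mapping is handed to both the fill color and the line color.

```python
import pandas as pd
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
from bokeh.transform import linear_cmap

# Made-up stand-in rows; column names are assumptions for illustration.
cars = pd.DataFrame({
    "mpg": [18.0, 15.0, 36.0, 26.0, 44.0, 13.0],
    "hp": [130, 165, 58, 90, 48, 190],
    "weight": [3504, 3693, 1825, 2265, 2085, 4422],
})
source = ColumnDataSource(cars)

# low and high are the observed minimum and maximum of the mapped column.
low = float(cars["mpg"].min())
high = float(cars["mpg"].max())
mpg_colors = linear_cmap(field_name="mpg", palette="Magma256", low=low, high=high)

p = figure(title="Weight vs. horsepower, colored by mpg")
p.scatter(
    source=source,
    x="hp",
    y="weight",
    size=10,
    fill_color=mpg_colors,   # fill follows the mpg column
    line_color=mpg_colors,   # edge uses the same mapping as the fill
)

# The underlying color mapper; depending on the Bokeh version the returned
# value is a dict or an object with a .transform attribute (an assumption
# handled here for both cases).
color_mapper = (
    mpg_colors["transform"] if isinstance(mpg_colors, dict) else mpg_colors.transform
)
```

With Magma256, values near low land at the dark end of the palette and values near high at the light end, which is the reading used in the lecture.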
16:55:25 So, for instance, maybe we would want to color the points by the number of cylinders the car has. So we can go here and just show 16:55:34 auto_mpg.cylinder.value_counts(). 16:55:38 Okay, so we have 3, 5, 6, 8, and 4. 16:55:43 So we can go ahead. Now, one point is that factor_cmap can only take in string arguments, and these are integers. 16:55:52 So the first thing we have to do is just change the cylinder and create a string version of the column, which I do here, and then I have to remake my source, which I do here. Now that I've done that, I'm gonna import from bokeh.transform, I'm 16:56:11 gonna import factor_cmap. 16:56:15 Now, what this does is going to give me a sorted list of 16:56:23 the unique cylinder values. So we can go ahead and show that off here. 16:56:28 Okay. And then let's go ahead and demonstrate factor_cmap. 16:56:34 So I've got my source again. I'm plotting the weight against the horsepower. 16:56:38 I'm gonna go ahead, and you call factor_cmap for your color argument. 16:56:44 You put in the column that you want to be colored by, 16:56:48 so the cylinder number. You put in the palette, which for us is Category10 underscore 5, 16:56:54 and again, I'm gonna make a note about that right after this. 16:56:56 And then the factors are the unique values, so this 3, 4, 5, 6, 8. 16:57:01 So into factors I put those unique values that I've stored in this variable. 16:57:06 And now I have all of my points colored by the number of cylinders they have. 16:57:13 And now you'll notice, wait a minute, how can I tell, like, what does purple mean? 16:57:17 What does red mean? Etc. So we're gonna learn more about legends in a different notebook. 16:57:21 So legends and color bars, I mean, that's a natural question for both of these, like, wouldn't I want a color bar for this? 16:57:27 Wouldn't I want a legend for this?
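The factor_cmap steps above (make a string version of the cylinder column, rebuild the source, collect the sorted unique values, then pass them as factors) might look roughly like this. The rows are made up, and the names cyl, cyl_str, and cyl_values are assumptions for illustration, not the notebook's actual names.

```python
import pandas as pd
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
from bokeh.transform import factor_cmap

# Made-up stand-in rows; column names are assumptions for illustration.
cars = pd.DataFrame({
    "hp": [130, 165, 58, 90, 48, 190, 105, 97, 77, 110],
    "weight": [3504, 3693, 1825, 2265, 2085, 4422, 3250, 2330, 2530, 3000],
    "cyl": [8, 8, 4, 4, 4, 8, 6, 3, 5, 6],
})

# The factors are strings, so make a string version of the integer column.
cars["cyl_str"] = cars["cyl"].astype(str)

# Remake the source so it includes the new column.
source = ColumnDataSource(cars)

# Sorted list of the unique cylinder values, as strings.
cyl_values = sorted(cars["cyl_str"].unique())

p = figure(title="Weight vs. horsepower, colored by cylinders")
p.scatter(
    source=source,
    x="hp",
    y="weight",
    size=10,
    color=factor_cmap(field_name="cyl_str", palette="Category10_5", factors=cyl_values),
)
```

Each factor is paired with one palette color in order, so five unique values fit a five-color palette exactly.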
We have a separate notebook for learning those particular features, so for now we're just learning how to use the ColumnDataSource to allow us to color our points based on the value of a variable. 16:57:42 So I've been promising that we'll talk about palettes. 16:57:46 So for the previous functions we specified two palette arguments as strings, 16:57:50 the first being Magma256, the second being Category10 underscore 5. 16:57:56 So all of the palettes that are available from Bokeh are provided at this documentation 16:58:01 list. So for instance, here's the Magma palette, and you can see that these have got these different levels, 16:58:07 so like 1, 2, 3, 1, 2, 3, 4, where it basically breaks up the color map into that many segments 16:58:14 and then provides a transition accordingly. So when we called the string Magma256, it was the Magma color palette broken up into 256 blocks. 16:58:28 The same thing for Category10 underscore 5, so we can go down and look at Category10. 16:58:33 So here's Category10. And the reason it's Category10 is there are 10 maximum possible options, whereas with Category20, you can see that there are up to 20 different options for the color. 16:58:50 So the underscore 5 specifies that we only want these five: blue, orange, green, red, and purple. 16:58:57 So the reason we need to have an underscore is because the name already has a number in it. 16:59:02 So it would confuse Bokeh if we tried to do Category105, because it would read as Category 105. 16:59:08 Okay, so, if your palette does not have a number in it, you can just put the number immediately after the name. 16:59:15 If your palette does have a number in it, you need an... 16:59:20 you need an underscore. So another thing you can do is, instead of specifying these strings, you can directly import the palette you want. 16:59:28 So let's look at the example. What I'm going to use
16:59:31 is Accent, and so, when I use Accent, I can just go ahead and, remember again, I have five unique cylinder counts, or cylinder numbers, so I'm going to say I want to import Accent5. 16:59:43 And so I can say from bokeh.palettes import 16:59:48 Accent5. And now, when I call palette, I can use Accent5 directly. 16:59:53 Okay? So now my points are colored according to the 16:59:57 Accent5 palette. And I already mentioned that we would learn more about legends and color bars in a later notebook. 17:00:05 So there are a number of other transform functions that you might be interested in learning about from Bokeh. 17:00:12 So here is a link to the Bokeh transform documentation. 17:00:15 You can see over here on the right we have our factor_cmap, our linear_cmap, and our jitter, 17:00:20 but there are other ones you might be interested in as well. 17:00:24 So now we know a little bit more about the way that Bokeh handles columnar data, and how we can then take advantage of that data handling to add in transforms for things like adding color or jittering to our plots. In the next notebook we're going to show you how we can use 17:00:40 Bokeh to alter various non-graphical elements, including titles and labels, grids and tick marks, legends and color bars, and more. Alright, so I hope you enjoyed learning about how Bokeh handles its data.
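To illustrate the palette naming discussed in this video, here is a small sketch that imports a few palettes directly from bokeh.palettes instead of naming them as strings. It only inspects the palettes; the variable cyl_colors and the factor list are assumptions added for illustration.

```python
from bokeh.palettes import Accent5, Category10, Category10_5, Magma256
from bokeh.transform import factor_cmap

# No number in the palette name: the size follows the name directly.
print(len(Magma256))   # Magma split into 256 colors
print(len(Accent5))    # Accent with 5 colors

# A number already in the name: an underscore separates name and size.
print(list(Category10_5))

# The sized palettes are also reachable through a dict keyed by size.
same_palette = list(Category10[5]) == list(Category10_5)

# An imported palette can be passed wherever a palette string was used.
cyl_colors = factor_cmap(
    field_name="cyl_str",                 # assumed column name
    palette=Accent5,
    factors=["3", "4", "5", "6", "8"],    # the five cylinder values, as strings
)
```

Importing the palette makes it a plain sequence of hex color strings, so it can be inspected or sliced before being handed to a transform.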