Jessica Forde, Yuvi Panda and Chris Holdgraf join Melanie and Mark to discuss Project Jupyter from its interactive notebook origin story to the various open source modular projects it’s grown into supporting data research and applications. We dive specifically into JupyterHub using Kubernetes to enable a multi-user server. We also talk about Binder, an interactive development environment that makes work easily reproducible.
Jessica Forde is a Project Jupyter Maintainer with a background in reinforcement learning and Bayesian statistics. At Project Jupyter, she works primarily on JupyterHub, Binder, and JupyterLab to improve access to scientific computing and scientific research. Her previous open source projects include datamicroscopes, a DARPA-funded Bayesian nonparametrics library in Python, and density, a wireless device data tool at Columbia University. Jessica has also worked as a machine learning researcher and data scientist in a variety of applications including healthcare, energy, and human capital.
Yuvi Panda is the Project Jupyter Technical Operations Architect in the UC Berkeley Data Sciences Division. He works on making it easy for people who don’t traditionally consider themselves “programmers” to do things with code. He builds tools (e.g., Quarry, PAWS, etc.) to sidestep the list of historical accidents that constitute the “command line tax” that people have to pay before doing productive things with computing.
Chris Holdgraf is a Project Jupyter Maintainer and Data Science Fellow at the Berkeley Institute for Data Science and a Community Architect at the Data Science Education Program at UC Berkeley. His background is in cognitive and computational neuroscience, where he used predictive models to understand the auditory system in the human brain. He’s interested in the boundary between technology, open-source software, and scientific workflows, as well as creating new pathways for this kind of work in science and the academy. He’s a core member of Project Jupyter, specifically working with JupyterHub and Binder, two open-source projects that make it easier for researchers and educators to do their work in the cloud. He works on these core tools, along with research and educational projects that use these tools at Berkeley and in the broader open science community.
Cool things of the week
- Dragonball hosted on GC / powered by Spanner blog and GDC presentation at Developer Day
- Cloud Text-to-Speech API powered by DeepMind WaveNet blog and docs
- Now you can deploy to Kubernetes Engine from Gitlab blog
- Jupyter site
- JupyterHub github
- Binder site and docs
- JupyterLab site
- Kubernetes site github
- Jupyter Notebook github
- LIGO (Laser Interferometer Gravitational-Wave Observatory) site and binder
- Paul Romer, World Bank Chief Economist blog and jupyter notebook
- The Scientific Paper is Obsolete article
- Large Scale Teaching Infrastructure with Kubernetes - Yuvi Panda, Berkeley University video
- Data 8: The Foundations of Data Science site
- Zero to JupyterHub site
- JupyterHub Deploy Docker github
- Jupyter Gitter channels
- Jupyter Pop-Up, May 15th site
- JupyterCon, Aug 21-24 site
Question of the week
How did Google’s predictions do during March Madness?
- How to build a real-time prediction model: Architecting live NCAA predictions
- Final Four halftime - fed data from first half to create prediction on second half and created a 30 second spot that ran on CBS before game play sample prediction ad
- Kaggle Competition site
Where can you find us next?
- Melanie is speaking about AI at Techtonica today, and April 14th will be participating in a panel on Diversity and Inclusion at the Harker Research Symposium
MARK MANDEL: Hi, and welcome to a very special episode of the Google Cloud Platform Podcast. I am Mark Mandel, and I am here with my colleague, Melanie Warrick.
MELANIE WARRICK: I'm doing OK. How are you doing?
MARK MANDEL: I'm doing OK. Before we get started today, I know you and I wanted to just have a little personal note to talk about the recent incident at YouTube in San Bruno.
MELANIE WARRICK: We wanted to acknowledge that that happened. And just a side note on that, usually we record our interviews and our wrappers, as we like to call them, a couple of days out before we launch our podcast on Wednesdays. So we'd recorded last week's prior. But yeah, we wanted to acknowledge what happened, and also the fact that, obviously, these are some serious issues that are going on in the US as well as other locations. And we want to give our support to our YouTube colleagues and those who've been impacted.
MARK MANDEL: Absolutely. And just as a reminder, if you need any sort of mental health help, please reach out. There are plenty of resources. And we'll put some in the show notes as well.
MELANIE WARRICK: And we're also grateful for the support that the community has been showing to the YouTubers as well. There's a lot of great people--
MARK MANDEL: Companies.
MELANIE WARRICK: Images and things that have been shared. But yes. Mental health is a big issue. There's a significant amount of stigma that's still out there. But there's some great resources and support that's coming out. And we highly recommend people reach out if they need help.
OK. So Mark, this week we have, actually, a podcast interview that I'm excited about. Because we got a chance to bring in some of the folks from Project Jupyter.
MARK MANDEL: Yep.
MELANIE WARRICK: Jessica, Yuvi, and Chris, who are all here to talk to us about Project Jupyter. And specifically we dive into JupyterHub as well as Binder and just get into the mechanics and the support around that. And this is a tool, or tools, that are being used specifically around the research. But they're not just for researchers. So you'll hear more about that shortly.
MARK MANDEL: Then after that, as always, we do our question of the week, where we're talking about how did Google predictions do during March Madness? Which I believe is basketball, I'm pretty sure.
MELANIE WARRICK: Yeah. I love it. You're the gamer, but yet neither of us really know that much about sports outside of that.
MARK MANDEL: Nope.
MELANIE WARRICK: Anyway, OK. So cool things of the week?
MARK MANDEL: Yeah.
MELANIE WARRICK: Let's start with your favorite.
MARK MANDEL: OK, yeah. So this was actually a couple of weeks ago but is really cool. If you hadn't known, there's a new Dragon Ball Z game coming out that's hosted on Google Cloud Platform. I'm going to put a link in the show notes to the video they did at the Game Developer Conference presentation they did at the Google Developer Day, where they talk about how they use Spanner for global consistency as well as the Google network to enable player versus player action around the world. And they show a live demo of someone playing in San Francisco against someone in Japan, which is really, really cool, with really low latency and a really great real-time gameplay.
MELANIE WARRICK: And one of the other cool things that they're using is BigQuery to help do their analysis, which I think is great. But it is, it's pretty impressive in terms of being able to use Spanner as a way to connect all the different players out there.
Next cool thing of the week that we want to mention is Text-to-Speech, which is powered by DeepMind's WaveNet. So Text-to-Speech is a new API that has been developed and is out there now on Google Cloud console that you can use to basically convert text to speech. And they have 32 different voices from 12 languages. The nice thing is that it's running on [INAUDIBLE], so it takes one second of speech and can convert it in 50 milliseconds. It's got much higher fidelity, higher quality.
This WaveNet model that they're using is actually something that was initially developed and provided back in 2016. But it's been significantly improved since then, and used for things like Text-to-Speech. So check it out.
MARK MANDEL: Nice. And finally, a project that I'm actually a huge fan of, GitLab. If you've never used it, it's a great DevOps lifecycle tool, as well as a great place to host Git for, say, in-studio or in-company projects. I've used GitLab in the past, but I've never actually been exposed so much to their continuous integration and continuous delivery pipeline tooling. But it just got even better in that they have a new Auto DevOps feature which detects the language your app is written in and automatically builds your CI and CD pipelines for you.
And particularly of note is that it can now push really nicely up to Kubernetes, either a GKE cluster or an existing Kubernetes cluster in a great way that works for continuous delivery. So it's got some great tools in it. Super happy with what GitLab's doing.
MELANIE WARRICK: OK, Mark, I think it's time to go talk to the group from Project Jupyter.
MARK MANDEL: Sounds good. Let's do it.
MELANIE WARRICK: So this week we're excited to have with us several members of the team Project Jupyter. We've got Jessica Forde, Yuvi Panda, and Chris Holdgraf. Welcome!
CHRIS HOLDGRAF: Yeah, it's great to be here.
JESSICA FORDE: Hi.
YUVI PANDA: Hi.
MELANIE WARRICK: So I would like you all to take a minute and just explain a little bit more of what you do on Project Jupyter. So Jess, why don't you start?
JESSICA FORDE: So my name is Jessica Forde. I've been with Project Jupyter for a little less than a year. And I work on a number of projects. I'm actually a cross project, so I've been working a lot with the JupyterHub team, with Yuvi and Chris, and then I also do a lot of my work on the blog, the website, and I also work on JupyterLab, which is our new imagination of what communication can be like with Jupyter notebook-like interfaces. And so we just recently announced that as well. You guys can look at that. We're probably not going to be talking as much about that, but JupyterLab is also another project of ours. We have actually a number of projects, and so I end up working on a lot of them.
YUVI PANDA: Hi, my name is Yuvi. I work at UC Berkeley as-- [? I write ?] on the technical operations for the data science division. I'm also part of Project Jupyter. I've been contributing to them for about two years. I mostly work on scalability stuff, so that's JupyterHub, BinderHub. I run the big JupyterHubs that Berkeley uses for both their local classes and their online classes. And I also do operations for BinderHub on MyBinder.org. And then I also [? dole ?] up a lot of the Kubernetes integrations for both JupyterHub and related projects, like Dask.
CHRIS HOLDGRAF: Yeah, my name is Chris Holdgraf. I'm a fellow over at the Berkeley Institute for Data Science. And I've been in the sort of open source data analytic world in Python for several years now and have sort of bumped up against the Jupyter world over that time. Now I work on the Jupyter Project as a link between the technical developments and some of the specific use cases and organizations that use Jupyter technology. So just a little bit of this, a little bit of that, and making sure that the things that we're building in Jupyter are matching onto needs that people have in the scientific and education community.
MELANIE WARRICK: So you've all talked about what you work on. Can you talk to us a little bit about what is Project Jupyter and a little bit about its history. Where'd it come from? Because I know IPython Notebooks originated as one of the main projects. But can you tell us a little more about all of this?
JESSICA FORDE: Well, OK, I'll start. So back in, I think, Colorado, there was a physicist named Fernando Perez, who was working on his PhD. And he wanted to be able to share his research with his colleagues back home in Medellin, Colombia and was frustrated by the fact that the software he was working with was very expensive and was closed source. And he couldn't necessarily share it very easily with people back home. And so he, in his spare time, created IPython, which is a project that still exists today and is part of the Jupyter ecosystem.
And IPython is a command line tool for interactive Python. It is now considered what we call a kernel for people to actually write code in many other languages. In fact, you can even create your own kernel. We just showed a blog post with kernels in C++. So if you want to do interactive C++, we have interactive C++.
And from there we got more people involved. We have, as part of our team, also Brian Granger, who also leads the project. And this eventually became IPython Notebooks, which is a tool in the browser for interactive Python visualization data science, which now became renamed to Jupyter. Because, again, we are actually a polyglot open source project. And so we now from there have a number of projects related to interactive computing, open source, open standards, and data science and scientific communication, which relate to education and lots of other applications. And so some of our latest projects, Binder and JupyterHub come out of that sort of origin story.
MARK MANDEL: Jessica, you mentioned a couple of times the term "interactive notebook" as part of the Jupyter Project. What exactly does that mean?
JESSICA FORDE: So an interactive notebook is a browser interface in which a person can enter code. In this example, we'll take Python. And so let's say you are a data scientist writing Python. And you have your Pandas DataFrame, and you want to be able to visualize it. Now, if you wanted to be able to get the visualization of it, you probably would have to do a little bit of extra leg work to get the image out, to save it, to open it up, and see it. Whereas in an interactive notebook, things flow together relatively seamlessly, so that if you want to create the plot, the plot shows up in the browser. And so we really try to leverage the power of the browser to make scientific communication a part of scientific computing.
MARK MANDEL: Is this something that I would use standalone just myself? Or is this something I would do in partnership with, say, another developer, kind of Google Docs style? Or how does that flow work?
JESSICA FORDE: You could do it either way. And in fact, people do do it either way. Communication in a, I write something and then I send it to you or share it to you, is very, very popular. People also use notebooks to write scratch. Say you're trying to prototype an idea and you want to be able to figure out if it runs or if you like the way the output looks to you, you can do it that way.
But additionally, we actually are very interested in ideas of real-time collaboration as well. Today that's something that we're working on and thinking about in the JupyterLab project. Additionally, you can communicate to people within a certain ecosystem-- for example, within JupyterHub. So these kinds of relationships between communicating for yourself and communicating with other people are very important to us.
YUVI PANDA: So also people use it just with GitHub. GitHub has millions of notebooks, so people use it just like code. And GitHub even renders them automatically now. So that's another way to use it.
CHRIS HOLDGRAF: I think one of the interesting origin stories of the Jupyter notebook actually comes from scientific publishing. So one of the original ideas that the then IPython, now Jupyter team had was trying to find a way to package the sort of static representation of your work, which in science and academia is just basically a PDF. It's a snapshot of words that you write and static images that you generate.
But behind those static images is a lot of really interesting complexity. And in some sense, that's the real work. Like, the code and the operationalization of that code is where the rubber meets the road, so to speak. And so one of the goals of the original notebook format was to create an interface and a way of packaging your work so that you didn't have to separate out the code from the narrative. And that, hopefully, would create different ways that you might try to communicate your results to other people, in ways that are only possible when you can actively be interacting with whatever work it is that you're presenting to someone.
JESSICA FORDE: And we've been really lucky in that we've had very good uptake from the scientific community, from the academic community. I think there are in the millions of Jupyter notebooks available on GitHub. In fact, it's now considered a language on GitHub. Although, I don't know necessarily if--
--we think of it as a language, per se. Because a Jupyter notebook can be written in any kind of language that we support. And we support tens of languages, like over 100 at least. And so we have notebooks from the chief economist of the World Bank that are on GitHub. We have notebooks from the LIGO project, which recently won the physics Nobel Prize, that are available on GitHub that we also host on Binder. So there's a lot of different interesting stuff that's been happening.
I know media companies now are sharing their data journalism through Jupyter notebooks. And so we're really excited about the ways people have been able to use it today.
MELANIE WARRICK: Now, you've mentioned JupyterHub. Can you explain a little bit more about what that is?
CHRIS HOLDGRAF: Yeah, sure. So a Jupyter notebook, for example, assumes that you're running something locally on your own computer. That's the most common pattern of interaction when you're writing code and running it against some kernel. What JupyterHub does is it allows you to host, either on hardware that you own or somewhere in the cloud, a server that manages multiple Jupyter processes at the same time. So it will use a Jupyter server to have many users simultaneously sending code to kernels and getting the responses back from them.
It's useful partially because, as a deployer of a JupyterHub, you can specify the environment that you want everyone to have access to. You can specify the packages, the versions. You can specify data that needs to be in a particular place so that everyone has access to it.
And the goal there is that by standardizing these things you're creating a more consistent environment, so that people can share their work more easily and you don't run into strange conflicts that come from having the wrong package installed or having the incorrect version of data on some path on your computer. And the other hope is that you can use this, I should say, as a portal to shared computing infrastructure, so that people who don't have, say, fancy, sophisticated hardware themselves, or maybe who don't even have a laptop with a ton of RAM, can still do fairly complicated processing. But they're doing so via the cloud, or via whatever resources the JupyterHub is deployed on. And so I think the goal there is that it increases the accessibility towards doing modern-day data analytic interactive workflows in a way that's only possible when you have those kind of shared resources.
MELANIE WARRICK: And when we were talking earlier, you were saying how JupyterHub was sort of the impetus for why you started to engage with using Kubernetes.
YUVI PANDA: So when I started working at Berkeley, we had like a biggish course. We had, what, 900 students at that time. They were learning fundamentals of data science, and we wanted them to use a JupyterHub. Because we didn't want to spend our time teaching people how to install stuff. Like, they might have different kinds of computers. And then someone's going to come up with Windows XP, and we had to figure out how to install Python through that. So we wanted to eliminate that and then allow students to focus directly on learning data science.
This was also especially important because we were targeting groups that were not just in computer science, but people from other disciplines and whatnot. And it was like-- we were running into some scaling problems at that time. And we also wanted to be as cloud-agnostic as possible. We did not want to get locked into any specific cloud vendor. And we also wanted to be able to run on [INAUDIBLE] if we needed to.
And so at this point Kubernetes was a very good choice for us to make. Because it let us do all of this in one system. So we didn't need to have like Ansible or something else that set up the base system, and then have a clustering technology on top, which was a lot more complex to manage. This was just like, OK, we have a Kubernetes cluster. We can do everything we want on top of it. And the scalability is fairly elastic. You can go up and down without too many problems.
So we were like, OK, let's just-- I was already writing the Kubernetes integration for JupyterHub for a while as a volunteer. And then I just got hired at Berkeley to do that full time.
MARK MANDEL: So does that mean that you have to have a Kubernetes cluster to run JupyterHub? Or is it an optional thing?
YUVI PANDA: Not necessarily. So JupyterHub is very extensible. You can plug in any authenticator you want. And in the same way you can plug in any spawner you want. So it ships with the default spawner that just doesn't assume anything, except you have a *nix machine. But then there are spawners for lots of popular technology. There are spawners that just use a Docker container. There's the Kubernetes spawner that we have written. And then there's a systemd spawner. There's like lots of spawners. People write them.
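For readers who want to see what this plug-in design looks like in practice, here is a minimal sketch of a `jupyterhub_config.py`. It is not the Berkeley deployment's configuration. It assumes the separately installed `kubespawner` and `oauthenticator` packages, and the choice of GitHub login is purely illustrative.

```python
# jupyterhub_config.py: illustrative sketch only, not a configuration from the episode.
# JupyterHub loads this file itself and provides get_config(); it is not meant to be
# run directly with the Python interpreter.
c = get_config()  # noqa: F821

# Choose how each user's server is started.
# Assumption: the kubespawner package is installed next to JupyterHub.
c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"

# Other spawners named in the conversation (each lives in its own package):
# c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"
# c.JupyterHub.spawner_class = "systemdspawner.SystemdSpawner"

# Choose how users log in.
# Assumption: the oauthenticator package is installed; GitHub is just an example.
c.JupyterHub.authenticator_class = "oauthenticator.github.GitHubOAuthenticator"
```

Swapping the spawner is a one-line change, and the rest of the hub configuration can stay the same.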
CHRIS HOLDGRAF: I think one of the challenges that Data8 faced in particular is that when you're designing technology for something like science or education, you have to make assumptions that the organizations that are going to be using that technology don't have, oftentimes, as much resources and funds to hire people as you would if you were at a company or something like this. And I think one of the benefits of Kubernetes is that, because of the properties that people often talk about in terms of self-healing and scalability and stuff like this, you can often manage more complex deployments with fewer highly trained technical dev ops style people.
And Data8, as you mentioned now, it's over, I think, 1,400 students, or something like this. And it has, by some accounts, a relatively modest team of people who are actively developing it, and upgrading it, and maintaining it over time. And my intuition is that that's something that would be much more difficult to do if you are using other kinds of cloud deployment approaches.
MELANIE WARRICK: So I know when we first were talking about doing this podcast, Jessica, you were telling me about Binder. And I know Binder's built off of JupyterHub. So can you tell us a little more about Binder and what it does and why it was built?
JESSICA FORDE: Yeah, so I'll start with the story of Binder. So Binder originally started as a project that was out of Janelia Farm. And it was a project to share notebooks in a curated manner, so that you had notebooks that had a similar idea or a similar narrative, story to come together in one cohesive piece. And so the current setup we have now is the version 2.0 of Binder, which is a publicly available service for GitHub repos to share with the public in a way that has the entire environment set up. So that you could go to a specific URL and you are launched in a Docker container, and that ends up giving you the opportunity to work with the repo with everything already pre-installed.
It's a similar experience to the things that the Data8 students are getting. But in this case, you are getting specifically the repo that you are looking at from GitHub. And this is particularly interesting for applications in science and education, also, people who are creating new libraries and want to show it off.
In fact, that's what we used to show off JupyterLab. We ended up using Binder and said, if you want to try out JupyterLab, here, we have it right now. You just click on the URL, and you are taken to the user interface of JupyterLab. You don't need to install anything. Here it is. Play with it. See how you like it. So that's particularly interesting for us, because it's a different way for us to interact with the public, providing open source as a service for us to share scientific data science repositories through GitHub.
And in fact, Binder is actually a number of projects that are connected together. There are modular pieces. JupyterHub is actually one of them. And the other major pieces under it are repo2docker, the Binder server itself, and then we also have front ends such as Jupyter notebooks, RStudio, and JupyterLab.
CHRIS HOLDGRAF: It's worth highlighting that Binder was originally created by the group of a researcher named Jeremy Freeman. And it was also open source technology. It was running online in the cloud. And it performed what Jessica just described, being able to share an interactive data analytic environment with a single link. And because that project was open source, the Jupyter team started having conversations with the first incarnation of the Binder project, basically realizing that a lot of the complexity of the tooling in Binder 1.0, call it, was doing a lot of this stuff that JupyterHub handles very flexibly and very nicely already.
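The launch link mentioned above follows a simple pattern on the public mybinder.org service for GitHub repositories. The helper below is a hypothetical sketch: the function name and the example repository are illustrative and not part of Binder itself. It only builds the URL and does not contact the service.

```python
from urllib.parse import quote, urlencode


def binder_launch_url(org, repo, ref, urlpath=None, host="https://mybinder.org"):
    """Build a launch link for a GitHub repository on a Binder service.

    org, repo: the GitHub owner and repository name.
    ref: a branch, tag, or commit to launch.
    urlpath: optional interface to open, for example "lab" for JupyterLab.
    """
    # Percent-encode each path segment so a ref containing "/" stays one segment.
    url = f"{host}/v2/gh/{quote(org, safe='')}/{quote(repo, safe='')}/{quote(ref, safe='')}"
    if urlpath:
        url += "?" + urlencode({"urlpath": urlpath})
    return url


if __name__ == "__main__":
    # Example repository chosen for illustration only.
    print(binder_launch_url("binder-examples", "requirements", "master"))
    print(binder_launch_url("binder-examples", "requirements", "master", urlpath="lab"))
```

Anyone who follows a link built this way gets the repository opened in a prepared environment without installing anything locally.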
And so there was kind of a realization that, well, what if we could just make Binder a particular configuration of a deployment of a JupyterHub and add one extra component, which is the ability to automatically generate your computing environment on the fly so that people can specify their environment just via random text files in their GitHub repository, rather than requiring people to craft a Docker image on their own. So those projects kind of fused together in a collaboration that's been going on for about a year or so now. But a lot of that initial heavy lifting, we're all very thankful to the original Binder team for.
MELANIE WARRICK: And I know when we were talking about Binder, the reason behind driving out, building up Binder 2-- or part of the reason-- was you've got so much great content and research that's being done in the scientific community. And there was a desire to be able to make it easier for people to run that kind of content and run that without having to recreate it from scratch.
JESSICA FORDE: Yeah. I mean, I think that one of the points of friction in scientific research is being able to reproduce or being able to work with the scientific computation pipeline that is in a particular paper. So for example, the LIGO Project is probably our most famous research institute who has their research available on Binder. They have a demo that allows people to get immediate access to their data, which is gravitational wave data, and show people basic signal processing methods so that you can reproduce their study and basically find the gravitational waves that they got the Nobel Prize for.
If people did not necessarily have access to Binder, it would be less interactive. One would have to figure out how the dependencies were set up, install everything from scratch, and try to piece together all the different parts of the repo to understand how to reproduce this result. Whereas with Binder and also Jupyter notebooks, it becomes a seamless interactive notebook that has everything pre-installed. And so you can walk yourself through the thinking of these scientists and understand how they conduct their own research.
CHRIS HOLDGRAF: Yeah. The way that I like to think about this is there is, in my mind, a difference between making your work technically open and reproducible versus making it practically open and reproducible. And a lot of times you'll see people just throw a bunch of code and maybe some description of the environment needed to run that code onto the web somewhere.
And while it is technically possible that you could go through all of the different steps of trying to figure out what it was that was going through the author's head when they put all of this there, the vast majority of people aren't going to take the time to clone a repository, pull it onto their computer, install all the packages that are needed, get the data into the right place, spin up their own session, start stepping through it, debug problems when they come up. Most people just won't do that. But if you could do something like share the ability to interact with that repository just via a single link that someone clicks, and they can immediately start getting up and running, I think the barrier to entry there is low enough that people will actually start doing it. And then you can do different kinds of things in terms of designing your content when you're assuming that others are going to be interacting with it.
JESSICA FORDE: And we recently presented some work at the Fairness, Accountability, and Transparency conference in New York, with regards to accountability and transparency in machine learning research. Although we believe this extends to all different kinds of scientific research and data science in general. But the interesting distinction, I think, in when we think about what Binder does is that, historically when people were thinking about reproducibility, they wanted to be able to independently validate what happened in a scientific study, doing it in their own laboratory or their own environment. We're basically trying to replicate the environment of the scientist as much as possible with modern computational methods.
So it is as if the laboratory is opening the doors to the public and allowing people to walk in, and says, these are our tools. This is what we did. And you can use these tools as if they were your own. So it's kind of an interesting distinction. And I think that's part of the reason why we think about it even as an accountability and transparency project. Because the distinction is actually that we're not necessarily reproducing it, in that it's not a separate thing. It is the exact same environment as the scientist that is producing this research.
MARK MANDEL: Awesome. This sounds really great. So if I'm sitting down and I'm doing some research and I have this data set that I'm doing research on, and I'm thinking to myself, I want to get this out on Binder, where do I put my data? What does that development pipeline look like? How do I get that sort of thing set up?
CHRIS HOLDGRAF: So one of the goals of Binder is to try and piggyback on preexisting tools in the open source ecosystem as much as possible. And what that means is that the delta, the amount of energy that's needed to make a repository quote, unquote, "Binder-ready," is pretty small. What Binder does is, when you give it a URL to a Git repository, and that could be something on GitHub or GitLab or wherever you put your code, as long as it's publicly available, it's going to check out the repository. And it looks for what are called configuration files. And by that what I mean is if you're a Python developer, requirements.txt or environment.yml. If you're a Julia developer, a REQUIRE file. It also looks for things like apt.txt files to specify apt packages.
And the goal of this is to infer the environment that's needed from text files that are already part of the workflows for people from those various communities. And so from the author's perspective if you want to make your repository interactive via something like Binder, all you need to do is make sure that those text files are there in either the root of your repository or in a folder in the root of your repo called Binder. Binder's then just going to automatically look through those files. And whenever it finds them, it generates a Dockerfile that says, OK, I need this version of [INAUDIBLE] installed and this version of [INAUDIBLE] installed. It then generates a Docker image from that Dockerfile and registers it online. And then Binder knows how to ask for certain images based on the links that different users are clicking on.
So in many cases, you don't actually have to do anything. If you've already sort of followed best practices in scientific computing, you've already included an environment.yml file with your repository. In many cases, all you need to do is just give that URL to Binder, and it'll do the rest.
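To make the detection step concrete, here is a small sketch of how a tool might look for these configuration files in a checked-out repository. It is not Binder's actual implementation: the function name is hypothetical, the file list covers only the names mentioned in the conversation, and preferring a `binder/` folder over the repository root is a simplification assumed for this sketch.

```python
from pathlib import Path

# File names mentioned in the conversation; the real tooling recognizes more.
CONFIG_FILES = ("environment.yml", "requirements.txt", "REQUIRE", "apt.txt")


def find_config_files(repo_root):
    """Return the recognized configuration files found in a checked-out repo.

    Looks in a `binder/` folder if one exists, otherwise in the repo root.
    """
    root = Path(repo_root)
    binder_dir = root / "binder"
    # Assumption of this sketch: a binder/ folder, when present, takes precedence.
    search_dir = binder_dir if binder_dir.is_dir() else root
    return [
        search_dir / name
        for name in CONFIG_FILES
        if (search_dir / name).is_file()
    ]


if __name__ == "__main__":
    # Inspect the current directory as if it were a cloned repository.
    for path in find_config_files("."):
        print(path)
```

A repository that already ships one of these files needs no further changes for a scan like this to pick it up, which matches the "you don't actually have to do anything" point above.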
MARK MANDEL: And where do I put my data? Like especially if I have very large sets of data, where does that go? Does it go on my GitHub repository?
CHRIS HOLDGRAF: I think that if you asked 10 different people you would get 10 different responses to that.
JESSICA FORDE: The examples that we've seen so far generally use smaller sets of data, just because the amount of computational power we have is rather limited. So the work that we've done so far on showing how people can use Binder have been on projects where the size of the data is relatively limited and the amount of computation that a person is finally using at the tail end is limited. So we won't necessarily be able to reproduce on Binder a study that used an entire server farm, for example.
But if you do have these pieces and parts, the computation then, since it is interactive, can be modified. So for example, in the write-up we had for the Fairness, Accountability, and Transparency study, we took a [INAUDIBLE] paper that had relatively simple experiments computationally. They weren't expensive. And we said, OK, let's try modifying this experiment and seeing how this works. Which is actually very interesting, because in a lot of studies you are given, this is what we did. It's done. There it is.
But the way Binder works, because it uses interactive computing and it has the same environment, it says, now that you have this tool, this model, this idea, we give you the ability to interact with the research in a way that people haven't been able to do online in a publication-like experience. Which is particularly interesting, I think, because it allows people to be able to think beyond what the researchers are simply telling you and work with the repo in a way that you might want to be able to work with it on your own, that you didn't necessarily-- weren't able to do it, because it required so much leg work to get to that point.
MELANIE WARRICK: But is it built to be able to work with server farms? Is it built to be able to hook into that kind of thing?
YUVI PANDA: Yeah. It ultimately uses JupyterHub to actually run your code. And JupyterHub can hook into anything you want. So right now, because we run as a free public service, we put limits on how much RAM you can use and all of that. But you can set up one for your own institution or whatever, and then give it as many resources as you want, running on whatever kind of infrastructure you need.
So you could, for example, just configure your JupyterHub to use-- I think there is a Google Cloud VM spawner so then everyone basically gets one entire machine. Or you can just configure it to, like, OK, everyone gets like an instance group that can scale up and down as they want.
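As a rough illustration of what "configure your JupyterHub" can look like, here is a fragment of a `jupyterhub_config.py` file. It is a sketch under stated assumptions: it uses KubeSpawner (the Kubernetes spawner, since Kubernetes is the focus of this episode) rather than the Google Cloud VM spawner Yuvi mentions, and the limit values are placeholders, not the limits used on any real deployment.

```python
# jupyterhub_config.py (fragment). JupyterHub provides get_config() when it
# loads this file, so the fragment is not meant to be run on its own.
c = get_config()  # noqa: F821

# Run each user's server as a pod on a Kubernetes cluster
# (assumes the jupyterhub-kubespawner package is installed).
c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"

# Per-user resource caps. The numbers are placeholders for illustration.
c.Spawner.mem_limit = "2G"
c.Spawner.cpu_limit = 1.0
```

Swapping the spawner class is the main lever here: the hub stays the same, and the spawner decides what kind of infrastructure each user's server lands on.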
MELANIE WARRICK: And then to your question, Mark, you can integrate with any kind of data repositories that exist out there.
YUVI PANDA: Yeah. Just a matter of what you do in your code. I know, for example, there is Pangeo. They run as a funded project that tries to make a central platform for people to do earth sciences. And so they store all their data in, like, GCS, Google Cloud Storage. And so they have FUSE drivers that let them get data directly from GCS. And the UK Met Office has something similar, but they are in S3, so they get the data out of S3. And this is all like agnostic to JupyterHub. We don't actually care how you do this.
JESSICA FORDE: Actually, also, one of the earliest users of JupyterHub for scientific computing and research is actually the labs in the Department of Energy. They end up doing a lot of high-performance computing, and they have a lot of scientists who aren't DevOps people, who aren't necessarily the most sophisticated when it comes to running complex jobs. And they want to be able to have as lightweight of an experience as possible. And JupyterHub was a great way for them to have access to really powerful Jupyter notebooks.
MELANIE WARRICK: Is this only meant for scientific communities? Or have you seen use cases outside?
YUVI PANDA: So I don't think it's only been for scientific [? community. ?] So I came into this from Wikimedia. And so it's a very, like, oh, free knowledge, everyone should be able to participate. And so I was at that time helping run this thing called Tool Labs, where we provide free compute for people who want to do things with Wikimedia data. So for example, a lot of the anti-vandalism bots run here. A lot of the [? statistics ?] run here. But we required people to [INAUDIBLE] and then use Grid Engine. And that was complex. It was excluding a lot of people.
So from our perspective at the time, we were like-- we wanted more people to be able to access our data and do things with it. And so a good solution was JupyterHub. So that's where I actually started working on the Kubernetes stuff. Because we had a Kubernetes cluster. And we were like, OK, let's put JupyterHub on this so we can securely provide access to our data to people. So we provide access to our [INAUDIBLE].
But even to our direct MySQL database, which is live, we redacted all the sensitive information and gave people access to that. And that's something you cannot do on your own computer. Because that's a data store that's only available there. It's really large. It's updated in real time. And there's, I think, like 4 million edits or something that people have made from their JupyterHub. So that's, I think, a very good use case. And I think there's more people doing things like that.
CHRIS HOLDGRAF: I think it's worth highlighting again, because I think this is actually related to your previous point, too, the goal of the Jupyter ecosystem is to build open source building blocks that can be composed and used for whatever use case you might have in mind. And so sometimes that is doing scientific research. Sometimes that is teaching a class, either at the university level or even at the elementary school and middle school levels. Sometimes it means providing an interactive environment to connect with some resources or hardware that would otherwise be very difficult to connect to, like what Yuvi just described with the Wikipedia dataset.
And related, then, to your question about storage, our goal, the Jupyter Project's goal, is not to create completely tightly integrated full stack solutions. Our goal is to build pieces that can be put together in order to accomplish some goal. And so what I would say when people ask, how is JupyterHub going to handle large-scale storage? Part of my answer is, well, we don't really have to. We just need to make sure that when there's another open tool that exists that makes that possible, that it's easy to integrate that into the sort of JupyterHub Binder workflow.
MELANIE WARRICK: Do you have any features, functionality that are up and coming, that are in the future of JupyterHub, Binder, Project Jupyter in general?
CHRIS HOLDGRAF: We have lots.
MELANIE WARRICK: Anything that you want to talk about?
MARK MANDEL: Lots is good.
YUVI PANDA: So one thing that we have been working on is high scale. So we're trying to get up to 50,000 active users at a time, spanning multiple Kubernetes clusters and multiple hubs, but all [INAUDIBLE] the user as one. So that's one area we're putting a lot of work into at the moment.
We are also-- on the Binder side, I think something that we did like fairly recently, and it's not super public, is we added RStudio support. Because I think a lot of people in the R community, they prefer using RStudio rather than Jupyter notebooks. And so we wanted to be able to-- we don't want to force people to switch tools just because they want to use us. So we're like, OK, let's make this generic enough so that people can use RStudio or whatever else it is that they want to use.
CHRIS HOLDGRAF: I think that another interesting future direction for development, Yuvi just described what would often be described as scaling outward. So we're just trying to get more and more and more users for a given JupyterHub deployment. But especially in more sophisticated data science teams or in academic research, scientific research, you do need access to non-trivial sized data sets. Or you need access to high-performance clusters and things like this.
And so I think that there's going to be a push of development towards connecting JupyterHub or something like Binder with more sophisticated hardware or more sophisticated cloud infrastructure for doing computations and analyzing data on that hardware. And I'm excited to see the different kinds of uses that people come up with, as the ecosystem of tools that are kind of natively running in the cloud continues to develop further.
YUVI PANDA: I also want to say, one of our biggest new features is more documentation.
I think it's a lot better documented now than it was like six months to a year ago. Jessica, Chris, and [INAUDIBLE], who is not here, have been doing a great job of making sure that the information is not only in some chat somewhere or hidden up in someone's brain, but written out in ways that a diverse group of people can actually understand and reuse in their own contexts.
MELANIE WARRICK: What has it been like working on an open source project for all of you? And it sounds like multiple, in some cases.
YUVI PANDA: I've been working on open source projects-- I was working at GNOME when I was like 19. And I was working at Wikimedia. So I don't know what it is like to not work on an open source project.
So I'm kind of an outlier. But Jupyter is the first project that I'm on that's partially based in academia. So that's a little different. And it's also a much smaller project. When I joined-- when I started working with Wikimedia, it was already very big.
MELANIE WARRICK: But how is it different?
YUVI PANDA: I think in a smaller project you have more responsibilities to be kinder to everyone, than you have at a larger project. I'm not saying that it's OK to be mean to people in a larger project. But I think if you are person 40 in a project or person 16 in a project, then you have more responsibilities than if you're person 800 at a project. If you're person 800, the culture is already sort of set. Changing that is going to be an uphill battle. And it's going to be hard. While if you're at a smaller project, then you-- it's much easier to set. So if you are not careful, then you can set it in ways that you don't want it to be. So I think that's the biggest difference.
There's technical differences, of course. When I was in Wikimedia, I was like the 16th person to join the ops team. And I was the least experienced person doing that. While here, I think most people don't have that much ops knowledge, because they come through as grad students and whatnot. So that was also a big difference for me.
CHRIS HOLDGRAF: But I think related to that, from my perspective, one of the most exciting things about the Jupyter Project, and probably open source in general, is that because of the open nature of the project you get a lot more voices in the room. And you get a lot of representation from different kinds of backgrounds in the room. So I think it's amazing that I get to work on a team of people, some of whom are heavily ops-oriented and have an incredible background in Kubernetes. Some of whom care about things like documentation and community growth and the more social aspects of open source. Some of whom are more domain scientists who run analytics in Python or in R, but wouldn't be able to deploy a Kubernetes cluster by themselves.
Being able to coordinate that kind of chaotic, diverse group of people in a way that you can create tools that are truly community-driven, and also available for all kinds of use cases, is a really satisfying thing. And in some ways is, in my mind, the primary goal of the academic and education and scientific communities-- creating public goods that people can then build on for whatever purpose they have in mind.
MELANIE WARRICK: How do you coordinate? Or how does that get coordinated?
JESSICA FORDE: I think that each subproject has their own subculture and their own submethods. And so that largely determines how it works. So actually, like for example, for me, I belong to I think five different Jupyter Project organizations. A lot of people don't even know that we have multiple repos. I think we have over 100. We have at least five or six organizations. So each organization has their own norms and culture.
But we actually share the same code of conduct. So at the higher level, values, norms, things like that are shared. And we have a governance repo if you want to spy on how we work. It's all available to the public. In fact, all our meetings are available to the public. We've put them online on YouTube. So our big weekly meetings are every Tuesday morning, 9:00 AM Pacific. You can watch us online.
But each individual project then has their own subnorm. So for example, there's a monthly meeting that we have with the JupyterHub and Binder group. JupyterLab meets every week, and we have a one-hour meeting once a week. But some groups just mostly work through the mailing list and the issues. And so it's all a combination of all these things.
I guess one of the things that we probably haven't mentioned enough that we probably should say more is, again, we are an open source project. And so anybody can basically jump in at any time. And so we really try to do that. And I think that's also a lot of what I do, is I basically try to Tom Sawyer people into saying, do you like Jupyter? You can work for us too. And basically I get them to do my work for me.
But really, it actually is something that's very, very important to us. In fact, we're having a community day I think the 25th of August in New York, which is a free day for anyone who wants to show up. Or if you just want to show up on the internet, we will be actively working on issues that are community related. And we follow GitHub's norms of marking issues as first issue or help wanted. So those things are particularly standard. But again, the broader, the other kinds of implementations, are slightly different from team to team.
MELANIE WARRICK: And then were there any resources that you wanted to mention in regards to--?
CHRIS HOLDGRAF: In regards to--?
MELANIE WARRICK: JupyterHub, Jupyter Notebook. We were talking about some of the classes that are out there.
MARK MANDEL: Yeah, where would people go if they want to learn about these projects?
CHRIS HOLDGRAF: So I think that in terms of examples of particular deployments of JupyterHubs for particular use cases, if you go to Data8.org, just data, then the number 8 dot org. That is a public-facing version of Data8, which is this major course at UC Berkeley that Yuvi has been working on and that has driven a lot of the technical infrastructure, particularly on the Kubernetes side of JupyterHub.
There's also MyBinder.org, which is the public service version of Binder. So as Jessica mentioned, BinderHub is open source technology that anybody could use for whatever purpose they want to deploy a Binder server. MyBinder.org is a kind of technical demonstration and a free public service, where as a user, you can begin to interact with that and create shareable public links for your code repositories.
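To make the idea of a shareable link concrete, here is a small Python sketch that builds one for a public GitHub repository. The helper `binder_launch_url` is hypothetical, and the `/v2/gh/<org>/<repo>/<ref>` pattern reflects our understanding of how MyBinder.org links to GitHub repositories look rather than anything stated in the conversation. The repository in the usage note is only an example.

```python
from urllib.parse import quote


def binder_launch_url(org: str, repo: str, ref: str = "master",
                      base: str = "https://mybinder.org") -> str:
    """Build a Binder launch link for a public GitHub repository.

    Assumption: the service exposes GitHub repositories under
    /v2/gh/<org>/<repo>/<ref>, where <ref> is a branch, tag, or commit.
    """
    # Percent-encode each path segment so unusual characters stay in place.
    parts = [quote(part, safe="") for part in (org, repo, ref)]
    return f"{base}/v2/gh/" + "/".join(parts)
```

For example, `binder_launch_url("binder-examples", "requirements")` produces a link that can be pasted into a README or a paper. Passing a different `base` points the same link at a BinderHub you deploy yourself.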
And then from a developer's standpoint, or from like a deployer's standpoint, the best way to learn how to deploy JupyterHub on Kubernetes is that there's a guide called Zero to JupyterHub which you can find at Z2JH.jupyter.org. On the flip side, for BinderHub there is a guide to deploying BinderHub on Kubernetes at BinderHub.readthedocs.io.
JESSICA FORDE: Also I should probably mention another project on the stack. We have repo2docker, and that is also under the GitHub Organization JupyterHub. A lot of the projects we've been talking about have been under the GitHub Organization JupyterHub. So if you go to GitHub.com/jupyterhub you'll see a number of our projects, and repo2docker is one of them, which is also particularly useful if you want to take a GitHub repo and turn it into a Dockerfile or something that can be easily used and shared in that kind of format.
MELANIE WARRICK: This is great. Well, I really appreciate all of you coming and talking to us about Project Jupyter, Jupyter Notebooks, JupyterHub, and Binder. Anything else that you wanted to talk about before we go?
YUVI PANDA: I just want to say, if you know Kubernetes or want to learn more about it, then you should come and talk to us. Because there's a lot of things that we're doing on top of it. And having more people do that would be great. And you will have lots of impact, and lots of people will use your stuff.
CHRIS HOLDGRAF: Actually, one other thing, a follow-up to your previous question. If you just want to get involved in the community, of the Jupyter community more broadly, we have a Gitter channel for pretty much all of the major projects. So there's a Binder Gitter channel and a JupyterHub Gitter channel. There's also a Gitter channel for most of the other components that we've talked about in this interview. And we're also pretty closely monitoring issues and mailing lists, and also a lot of just community groups that are scattered across cities all over the world. So we try to be as welcoming and inviting a community as humanly possible. And we would love for whoever is listening to this to get involved.
MELANIE WARRICK: Well, great. Well, thank you again. I'm so glad you all were able to come to join us today. Thank you.
CHRIS HOLDGRAF: Yeah, thank you so much.
MARK MANDEL: Fantastic.
MELANIE WARRICK: Well, again, thank you, Jessica, Yuvi, and Chris. We really appreciate you coming onto the podcast and telling us all about Project Jupyter, telling us about JupyterHub and Binder and all that. And also just hearing about how Kubernetes is being used in the project as well. That was fun.
MARK MANDEL: Yeah. Thanks so much for joining us. Super interesting project. Really great to see that kind of collaborative development platform getting built and having people use it. It's pretty awesome.
MELANIE WARRICK: And all open source. Open source is the best. OK, so.
MARK MANDEL: Question of the week. I'm going to ask you. Yay!
MELANIE WARRICK: Yay.
MARK MANDEL: All right. [LAUGHS] How did Google's predictions do during March Madness? So actually, let's set some context. I know March Madness, big basketball. There we go.
MELANIE WARRICK: Yes.
MARK MANDEL: Basketball tournament. Google Cloud is a sponsor. And there is a bunch of machine learning stuff that was put in place to see if we could predict who was going to win and who was going to lose. Is that right?
MELANIE WARRICK: That's right. And there was also a [INAUDIBLE] competition that's been going on to allow others to run their own prediction models and see how effective they were in March Madness. But Google did do its own separate prediction model, and worked hard at trying to be able to come up with real-time predictions, too. Because during March Madness, apparently, during like the halftime they would take the data from the first half and feed it into the model and come back with predictions prior to the second half.
MARK MANDEL: Oh, cool.
MELANIE WARRICK: During the final four there was a couple of halftime ads that were even generated within, like, the halftime show to then show what they thought the predictions were. And apparently, they had a prediction where Loyola versus Michigan were playing and they predicted 29 rebounds. And they were right. And then they also predicted during Villanova and Michigan that at least 21 3-pointer attempts would be made. And the final count was 24. So anyways, it was really interesting.
MARK MANDEL: Nice.
MELANIE WARRICK: And there's a blog post that helps give some insights in terms of how the prediction model was built. And this is actually something you would do with typically any kind of data science project, in terms of figuring out what type of data you need, then building out the pipeline that you're going to need to feed the data, and build out the model and train it. And then deploying it into production.
So they had this model they trained off of 200,000 discrete files from a decade of NCAA basketball data.
MARK MANDEL: Nice.
MELANIE WARRICK: They apparently engineered over 800 potential features and then applied variance and univariate statistical tests. And they built out their models using these more standard regression and classification techniques. And when the games were actually playing, they were updating their entire game state every two seconds, including all play-by-play data. And they were using Cloud Spanner to help with this kind of processing that they needed to do, as well as BigQuery.
So like I said, there's a blog post. We're going to include that blog post so that you can see how specifically they actually engineered their prediction models, how it was deployed. And it will give you some great insights to think about when you're doing your own type of data analysis, data prediction modeling. And yeah, it's kind of cool.
MARK MANDEL: That is cool.
MELANIE WARRICK: It's kind of cool what the results look like. It's kind of cool how they built it. And congrats to the team.
MARK MANDEL: Nice. Awesome. All right, Melanie. Before we wrap up today, are you going anywhere, doing anything cool?
MELANIE WARRICK: Yes, I will be speaking at Techtonica, actually, this week on the 11th. And I'm going to be talking about AI. And then actually this weekend on Saturday I'm going to be speaking at the Harker Research Symposium on the 14th, participating in a panel to talk about diversity and inclusion.
MARK MANDEL: Nice.
MELANIE WARRICK: Mark, what about you?
MARK MANDEL: So I'm not going anywhere. But I did just start something really cool.
MELANIE WARRICK: Oh, that's sad.
MARK MANDEL: It's fine. I was very focused on the [INAUDIBLE] launch and Game Developers Conference, which means I need to ramp back up. But in the meantime, I've started doing something really cool that I'm really excited about.
MELANIE WARRICK: OK.
MARK MANDEL: I've started streaming the work that I'm doing on the [INAUDIBLE] development thing. So if you're into game development and you're interested in patterns and strategies for scaling multiplayer games, it could be interesting. But if you're into Kubernetes and you want to look at custom resource definitions or controllers or-- on Friday last week, I was doing some Helm integration, that kind of stuff, you might find it interesting as well. So we'll put the link in the show notes. [INAUDIBLE] /MarkMandel. I'm doing Tuesdays at 9:00 AM Pacific on a regular basis, but also trying to do a lot of little ad hoc sessions as well.
MELANIE WARRICK: You're doing all kinds of shows now.
MARK MANDEL: Yeah, it's fun. It's so much-- actually, streaming coding is delightfully wonderful.
MELANIE WARRICK: Well, that's great. And as far as I know it, Mark, you and I are working on trying to make sure that we do an actual interview on [INAUDIBLE]. So hopefully that'll be coming up soon.
MARK MANDEL: Yeah, I should do that.
MELANIE WARRICK: As soon as we can sort out some of the logistics.
MARK MANDEL: I should get on that. That would be fun.
MELANIE WARRICK: Because your schedule's pretty busy.
MARK MANDEL: Awesome. Well, Melanie, thank you so much for joining me for yet another podcast.
MELANIE WARRICK: Thank you, Mark.
MARK MANDEL: And thank you all for listening. And we'll see you all next week.