From e1d5046e8fcdb4055f8e00aaf2ac42da0d53235f Mon Sep 17 00:00:00 2001
From: Sergio Giro
Date: Wed, 3 Apr 2019 16:44:19 +0100
Subject: [PATCH 1/4] README: add design decisions

---
 README.md | 121 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 117 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index 8a2e1dbc..4af29f32 100644
--- a/README.md
+++ b/README.md
@@ -460,10 +460,10 @@ We built Plz following these principles:
   variables. You don't need to add extra dependencies to your code or learn
   how to read/write your data in specific ways.
 - The tool must be flexible enough so that no unnecessary restrictions are
-  imposed by the architecture. You should be able to do with Plz whatever you
-  can do by running a program manually. It was surprising to find out how many
-  issues, mostly around running jobs in the cloud, could be solved only by
-  tweaking the configuration, without requiring any changes to the code.
+  imposed by its architecture. You should be able to do with Plz whatever you
+  can do by running a program manually. It was surprising to find out how much
+  of the friction around running jobs in the cloud could be solved just by
+  tweaking the configuration, without requiring any changes to Plz code.
 
 Plz is routinely used at `prodo.ai` to train ML models on AWS, some of them
 taking days to run in the most powerful instances available. We trust it to
@@ -471,6 +471,119 @@ start and terminate these instances as needed, and to manage our spot instances,
 allowing us to get a much better price than if we were using on-demand
 instances all the time.
 
+## How does Plz help
+
+If you didn't have Plz, the steps you'd need to run your code on an AWS
+instance would be:
+
+- go to the AWS console and start an instance (or create a launch template
+  beforehand and use the CLI)
+- wait until the instance is up
+- get the IP address of the instance from the console
+- copy your code and data by ssh-ing to the instance
+- ssh to the instance and run your job. Preferably inside Docker, so that a
+  dropped connection doesn't kill your job (but if you want to use Docker you
+  have to take care of a docker file and build the image)
+- each time the connection drops or you turn off your computer, you need to ssh
+  in again. If you didn't use Docker, you've lost your terminal and it's very
+  likely your job died
+- watch your job until it finishes (and lose money if you don't check often
+  enough, as your instance keeps running after the job has finished)
+- copy your results back to your machine by ssh-ing to the instance, being
+  disciplined about where you store them and making sure you can link them to
+  the (version of the) code that produced them if you have several runs that
+  you want to compare. Or, if you started from a program that was running
+  locally, change it to write to a non-ephemeral location
+- if you care about your standard output/logs, gather and retrieve them somehow
+- make a note of your results (like stats or accuracy), or copy files with
+  results
+
+All of that gets simplified to `plz run`. If you stopped watching the output of
+`plz run` (by hitting Ctrl-C, or by turning off your computer) you can do
+`plz output` to get the output at any time.
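+
+For instance (a sketch; the exact flags and output are omitted):
+
+```sh
+plz run     # snapshots the code, starts an instance, runs the job, streams logs
+# Ctrl-C, turn the computer off, come back later...
+plz output  # retrieves the job's output whenever you ask for it
+```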
+
+If you want to rerun your job later (for instance, to try different
+parameters), you would need to have saved a copy of the code (or have been very
+disciplined with your git history, with tags or commits for every single
+one-line tweak you try -- more about that
+[below](#why-is-plz-the-way-it-is)), and possibly also the same data you used.
+You'd need to retrieve the code from wherever you have it (for instance, you
+may need to find the git branch and switch to it -- possibly after creating a
+different copy of the repo if you don't want to stop working on what you're
+doing).
+
+Another important factor is that Plz gives you a standard way to run your code.
+Just as seeing a Makefile tells you that you can type `make`, seeing a
+`plz.config.json` tells you that you can do `plz run`. Your code can then be
+launched from whatever machine your teammate happens to be sitting at
+(especially if the job runs in the cloud). Teammates need to install `plz`,
+sure, but your team will know how to do it after a couple of installs, and
+that's one setup per team member instead of one setup per project.
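+
+For illustration only, a `plz.config.json` might look roughly like this (the
+keys below are made up for this sketch and are not necessarily Plz's actual
+schema):
+
+```json
+{
+  "name": "train-model",
+  "command": ["python", "train.py"],
+  "instance_type": "p2.xlarge"
+}
+```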
+
+## Why is Plz the way it is
+
+This section is an attempt to describe the rationale behind the high-level
+architecture of Plz.
+
+- why Docker: simplifies log handling. We obtain a stream of logs from the
+  running jobs just by calling the Docker API, with facilities to filter by
+  time (see the sketch after this list). Running commands over ssh requires
+  either keeping the connection open so as to gather the output, or redirecting
+  the output and reading it later from a file. In general, Docker doesn't only
+  provide isolation, but also an environment where the job runs autonomously
+  with controlled inputs and outputs
+- why not using git to store code snapshots (and use git to transfer code to
+  the instance): because it's very common that users want to make changes that
+  they don't necessarily want in their commit history. For instance, when users
+  try to make their job run in the cloud, or run at a different scale than the
+  one they're used to (for instance, run the job with far more data than they
+  do locally), they might try several one-line tweaks. These commits (possibly
+  paired with messages that would be meaningless in a month, like "Change
+  foobar from 0 to 1") are hardly useful and pollute the repo history. Plz
+  could also create a different branch for each job, but then (in order to
+  allow for `plz rerun`) these branches would have to be kept, would be listed
+  in `git branch`, etc. _A good summary answer to the question would be:
+  because users want to commit stuff that "works" (commits you can revert to,
+  use for reference, etc.) and you don't know whether something works until
+  you've run it._ The solution for code storage we implemented, using Docker
+  images, is quite simple to implement and understand, as the Docker API allows
+  you to just send the files as a tarball in order to create an image, as in
+  the sketch after this list (if we were using git, then for private repos we
+  would need to implement usage of git credentials in the instance, which would
+  actually be more complicated). Docker images are given a name so that they
+  can be referenced later, making `plz rerun` easy to implement as well. The
+  code can be retrieved by looking inside the image, which is a reliable source
+  of truth, as it stores the code that was actually running
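+
+Roughly, these are the Docker facilities involved (a sketch using the Docker
+CLI; Plz calls the equivalent Docker API endpoints, and the names here are
+illustrative):
+
+```sh
+# Create an image by streaming the code to Docker as a tarball (no git
+# credentials involved), tagged so that the job can be rerun later:
+tar -czf - Dockerfile src/ | docker build --tag job-snapshot -
+
+# Stream the logs of a running job, filtered by time:
+docker logs --follow --since 2019-04-03T16:00:00 some-job-container
+```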
+
+### Could plz be smaller?
+
+- why do we need a controller/server: one reason is to manage locks (for
+  instance, to avoid two job requests using the same instance). It's true that
+  locking could be done just by using a redis server (so instead of a
+  controller/plz server, the CLI could maybe point to a redis server taking care
+  of locks). That would force the tool to assume that everyone uses it
+  collaboratively (one could engineer an altered CLI that locks every instance,
+  etc.). We make this assumption now, but we are not forced to keep it in
+  the future. Another reason for a controller is a feature that we have
+  considered for a while: to rerun jobs for spot instances that were terminated
+  because of being overbid. To that end, we need something to be running
+  permanently in the cloud, as there might not be a CLI running at the point in
+  time where the spot instance dies. In general, it seems that if you want to
+  do something serious with a bunch of instances that are running permanently,
+  eventually you'll need a coordinator/controller. Even if the current features
+  might not strictly require a controller, it's good that any features that do
+  require it won't need a major refactor. Needless to say, a controller-less plz
+  cannot be obtained by just erasing the controller: a major effort would be
+  needed so that the tasks done by the controller (setting inputs,
+  collecting outputs, etc.) are performed by, for instance, a wrapper around
+  the program being run by the user
+- why collecting information from the running program: while it would be
+  possible to leave programs the task of writing to whatever non-ephemeral
+  storage they choose, it would put a burden on the plz user to change their
+  program significantly with respect to the program they already run locally
+  (for instance, instead of writing files, to use the AWS API to write to S3).
+  With the current plz mechanism, as long as there is a point in your program
+  where you can set the output directory (and, if your program doesn't have
+  such a point, it's a good idea to implement it anyway) you can write files
+  and plz will make them non-ephemeral for you (see the sketch below)
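+
+As a sketch of what this means for a job (the `OUTPUT_DIRECTORY` variable name
+and the script below are illustrative, not necessarily what Plz actually
+provides):
+
+```sh
+# Pick up the output directory from the environment, falling back to a local
+# path so that the same script still runs outside Plz:
+OUTPUT_DIR="${OUTPUT_DIRECTORY:-./results}"
+mkdir -p "$OUTPUT_DIR"
+python train.py --save-to "$OUTPUT_DIR"  # files written there are kept
+```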
+
 ## Future work
 
 In the future, Plz is intended to:

From ca4c39d087e06bf11584ce1be36964407d11c06d Mon Sep 17 00:00:00 2001
From: Sergio Giro
Date: Wed, 3 Apr 2019 16:55:28 +0100
Subject: [PATCH 2/4] README: incorporate stuff from
 https://noodlesandwich.com/talks/plz-respect-ur-data/essay.html

---
 README.md | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 4af29f32..6f4e72e2 100644
--- a/README.md
+++ b/README.md
@@ -453,9 +453,13 @@ This tells Docker to use the
 We built Plz following these principles:
 
-- Code and data must be stored for future reference.
+- Data that isn't reproducible is worthless.
+- You don't know the value of your data at the time of creation.
 - Whatever part of the running environment can be captured by Plz, we capture
   it so as to make jobs repeatable.
+- Hardware is expensive.
+- Code is a means to an end. What matters is the outcome you obtain from running
+  your code.
 - Functionality is based on standard mechanisms like files and environment
   variables. You don't need to add extra dependencies to your code or learn
   how to read/write your data in specific ways.
 - The tool must be flexible enough so that no unnecessary restrictions are
   imposed by its architecture. You should be able to do with Plz whatever you
   can do by running a program manually. It was surprising to find out how much
   of the friction around running jobs in the cloud could be solved just by
   tweaking the configuration, without requiring any changes to Plz code.
-
 Plz is routinely used at `prodo.ai` to train ML models on AWS, some of them
 taking days to run in the most powerful instances available. We trust it to
 start and terminate these instances as needed, and to manage our spot instances,

From dec51f1257e0904e1ec8c6ee7b0bbb6f13677203 Mon Sep 17 00:00:00 2001
From: Sergio Giro
Date: Wed, 3 Apr 2019 16:58:44 +0100
Subject: [PATCH 3/4] README: apply prettier

---
 README.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index 6f4e72e2..086f6d77 100644
--- a/README.md
+++ b/README.md
@@ -467,12 +467,12 @@ We built Plz following these principles:
   imposed by its architecture. You should be able to do with Plz whatever you
   can do by running a program manually. It was surprising to find out how much
   of the friction around running jobs in the cloud could be solved just by
-  tweaking the configuration, without requiring any changes to Plz code.
-Plz is routinely used at `prodo.ai` to train ML models on AWS, some of them
-taking days to run in the most powerful instances available. We trust it to
-start and terminate these instances as needed, and to manage our spot instances,
-allowing us to get a much better price than if we were using on-demand instances
-all the time.
+  tweaking the configuration, without requiring any changes to Plz code. Plz is
+  routinely used at `prodo.ai` to train ML models on AWS, some of them taking
+  days to run in the most powerful instances available. We trust it to start and
+  terminate these instances as needed, and to manage our spot instances,
+  allowing us to get a much better price than if we were using on-demand
+  instances all the time.

From 6bfe091fcef25d35cdfac138f034bdd71f0ca965 Mon Sep 17 00:00:00 2001
From: Sergio Giro
Date: Sun, 14 Apr 2019 16:29:38 +0100
Subject: [PATCH 4/4] README: add explanation about why we aren't using
 Kubernetes

---
 README.md | 127 ++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 76 insertions(+), 51 deletions(-)

diff --git a/README.md b/README.md
index 086f6d77..ee3ecd01 100644
--- a/README.md
+++ b/README.md
@@ -467,12 +467,13 @@ We built Plz following these principles:
   imposed by its architecture. You should be able to do with Plz whatever you
   can do by running a program manually. It was surprising to find out how much
   of the friction around running jobs in the cloud could be solved just by
-  tweaking the configuration, without requiring any changes to Plz code. Plz is
-  routinely used at `prodo.ai` to train ML models on AWS, some of them taking
-  days to run in the most powerful instances available. We trust it to start and
-  terminate these instances as needed, and to manage our spot instances,
-  allowing us to get a much better price than if we were using on-demand
-  instances all the time.
+  tweaking the configuration, without requiring any changes to Plz code.
+
+Plz is routinely used at `prodo.ai` to train ML models on AWS, some of them
+taking days to run in the most powerful instances available. We trust it to
+start and terminate these instances as needed, and to manage our spot instances,
+allowing us to get a much better price than if we were using on-demand instances
+all the time.
 
 ## How does Plz help
 
@@ -486,7 +487,7 @@ instance would be:
 - copy your code and data by ssh-ing to the instance
 - ssh to the instance and run your job. Preferably inside Docker, so that a
   dropped connection doesn't kill your job (but if you want to use Docker you
-  have to take care of a docker file and build the image)
+  have to take care of a `Dockerfile` and build the image)
 - each time the connection drops or you turn off your computer, you need to ssh
   in again. If you didn't use Docker, you've lost your terminal and it's very
   likely your job died
@@ -527,40 +528,42 @@ This section is an attempt to describe the rationale behind the high-level
 architecture of Plz.
 
-- why Docker: simplifies log handling. We obtain a stream of logs from the
-  running jobs just by calling the Docker API, with facilities to filter by
-  time (see the sketch after this list). Running commands over ssh requires
-  either keeping the connection open so as to gather the output, or redirecting
-  the output and reading it later from a file. In general, Docker doesn't only
-  provide isolation, but also an environment where the job runs autonomously
-  with controlled inputs and outputs
+- why Docker: simplifies input and output, which results in concrete
+  simplifications like log handling: we obtain a stream of logs from the
+  running jobs just by calling the Docker API, with facilities to filter by
+  time (see the sketch after this list). Running commands over ssh requires
+  either keeping the connection open so as to gather the output, or redirecting
+  the output and reading it later from a file. In general, Docker doesn't only
+  provide isolation, but also an environment where the job runs autonomously
+  with controlled inputs and outputs
 - why not using git to store code snapshots (and use git to transfer code to
   the instance): because it's very common that users want to make changes that
   they don't necessarily want in their commit history. For instance, when users
-  try to make their job run in the cloud, or run at a different scale than the
-  one they're used to (for instance, run the job with far more data than they
-  do locally), they might try several one-line tweaks. These commits (possibly
-  paired with messages that would be meaningless in a month, like "Change
-  foobar from 0 to 1") are hardly useful and pollute the repo history. Plz
-  could also create a different branch for each job, but then (in order to
-  allow for `plz rerun`) these branches would have to be kept, would be listed
-  in `git branch`, etc. _A good summary answer to the question would be:
-  because users want to commit stuff that "works" (commits you can revert to,
-  use for reference, etc.) and you don't know whether something works until
-  you've run it._ The solution for code storage we implemented, using Docker
-  images, is quite simple to implement and understand, as the Docker API allows
-  you to just send the files as a tarball in order to create an image, as in
-  the sketch after this list (if we were using git, then for private repos we
-  would need to implement usage of git credentials in the instance, which would
-  actually be more complicated). Docker images are given a name so that they
-  can be referenced later, making `plz rerun` easy to implement as well. The
-  code can be retrieved by looking inside the image, which is a reliable source
-  of truth, as it stores the code that was actually running
+  try to make their job run in the cloud, or to run it at a different scale
+  than the one they're used to (for instance, to run the job with far more data
+  than they do locally), they might try several one-line tweaks. These commits
+  (possibly paired with messages that would be meaningless in a month, like
+  "Change foobar from 0 to 1") are hardly useful and pollute the repo history.
+  Plz could also create a different branch for each job, but then (in order to
+  allow for `plz rerun`) these branches would have to be kept, would be listed
+  in `git branch`, etc. _A good summary answer to the question would be:
+  because users want to commit stuff that "works" (commits you can revert to,
+  use for reference, etc.) and you don't know whether something works until
+  you've run it._ The solution for code storage we implemented, using Docker
+  images, is quite simple to implement and understand, as the Docker API allows
+  you to just send the files as a tarball in order to create an image, as in
+  the sketch after this list (if we were using git, then for private repos we
+  would need to implement usage of git credentials in the instance, which would
+  actually be more complicated than using Docker). Docker images are given a
+  name so that they can be referenced later, making `plz rerun` easy to
+  implement as well. The code can be retrieved by looking inside the image,
+  which is a reliable source of truth, as it stores the code that was actually
+  running
 
-### Could plz be smaller?
+ to make their job run in the cloud, or to run it at a different scale than + what they use to (for instance, to run the job with far more data than they do + locally), they might try several one-line tweaks. These commits (possibly + paired with messages that would be meaningless in a month, like ''Change + foobar from 0 to 1'') are hardly useful and pollute the repo history. Plz + could also create a different branch for each job but (in order to allow for + `plz rerun`) then these branches should be kept, would be listed in + `git branch`, etc. _A good summary answer to the question would be: because + users want to commit stuff that ''works'' (commits you can revert to, use for + reference, etc.) and you don't know whether something works until you've run + it._ The solution for code storage we implemented, using Docker images, is + quite simple to implement and understand as the docker API allows you to just + send the files as a tarball in order to create an image (if we were using git, + for the case of private repos, we would need to implement usage of git + credentials in the instance, which would actually be more complicated than + using Docker). Docker images are given a name so that they can be referenced + later, making `plz rerun` easy to implement as well. The code can be retrieved + by looking inside the image, which is a reliable source of truth, as it stores + the code that was actually running + +### Could Plz be smaller? - why do we need a controller/server: one reason is to manage locks (for instance, to avoid two jobs requests using the same instance). It's true that locking could be done just by using a redis server (so instead of a - controller/plz server, the CLI could maybe point to a redis server taking care + controller/Plz server, the CLI could maybe point to a redis server taking care of locks). That would force the tool to assume that everyone uses it collaboratively (one could engineer an altered CLI that locks every instance, etc.). We are having this assumption now, but we are not forced to keep it in @@ -568,24 +571,46 @@ architecture of Plz. considered for a while: to rerun jobs for spot instances that were terminated because of being overbid. To that end, we need something to be running permanently the cloud, as there might not be a CLI running at the point in - time where the spot instance dies. In general, it sounds like if you want to - do something serious about a bunch of instances that are running permanently, - eventually you'll need a coordinator/controller. Even if the current features - might not strictly require a controller, it's good that any features that do - require it won't need a major refactor. Needless to say, a controller-less plz - cannot be obtained by just erasing the controller: a major effort would be - needed so that the tasks being done by the controller (setting inputs, - collecting outputs, etc.) are done by, for instance, a wrapper of the program - being run by the user + time where the spot instance is terminated. In general, it sounds like if you + want to do something serious about a bunch of instances that are running + permanently, eventually you'll need a coordinator/controller. Even if the + current features might not strictly require a controller, it's good that any + features that do require it won't need a major refactor. 
+  Needless to say, a controller-less Plz cannot be obtained by just erasing
+  the controller: a major effort would be needed so that the tasks done by the
+  controller (setting inputs, collecting outputs, etc.) are performed by, for
+  instance, a wrapper around the program being run by the user
 - why collecting information from the running program: while it would be
-  possible to leave programs the task of writing to whatever non-ephemeral
-  storage they choose, it would put a burden on the plz user to change their
-  program significantly with respect to the program they already run locally
-  (for instance, instead of writing files, to use the AWS API to write to S3).
-  With the current plz mechanism, as long as there is a point in your program
-  where you can set the output directory (and, if your program doesn't have
-  such a point, it's a good idea to implement it anyway) you can write files
-  and plz will make them non-ephemeral for you (see the sketch below)
+  possible to leave to user programs the task of writing to whatever
+  non-ephemeral storage they choose, that would put a burden on the Plz user:
+  changing their program significantly with respect to the program they already
+  run locally (for instance, instead of writing local files, using the AWS API
+  to write to S3). With the current Plz mechanism, as long as there is a single
+  point in your program where you can set the output directory (and, if your
+  program doesn't have such a point, it's a good idea to implement it anyway)
+  you can write files and Plz will make them non-ephemeral for you (see the
+  sketch below). Also, with the current mechanism team members know how to
+  access the outcomes of your job even if they don't know the details
+  (`plz output` for "blobs" and `plz measures` for structured outputs), and
+  can read them using standard tools, as every computer setup can process JSON
+  and files (as opposed to, say, running SQL queries in the cloud)
+- why managing the instances ourselves/why not using Kubernetes: because
+  autoscaling mechanisms (either using Kubernetes or autoscaling groups) do not
+  cover the case of "interactive users" who want to run instances, see them
+  spawn when they launch a job and see them terminate when they stop it.
+  Autoscaling mechanisms have "cooling times" specified so that scaling changes
+  don't happen all the time (which would degrade performance), but these make
+  operations non-immediate and non-deterministic, and that can be really
+  annoying when working interactively. We discovered all of this because our
+  first attempt was to use AWS autoscaling groups, and that version of Plz was
+  a pain to use and also to test manually ("did AWS get that we want to
+  terminate this instance? Let's wait, sometimes it takes 5 minutes to take it
+  down"). With respect to Kubernetes specifically, when we started Plz the
+  Kubernetes implementation on AWS (EKS) wasn't available yet. There is a
+  feature for Kubernetes in the works: we plan that users will be able to
+  specify a Kubernetes cluster to which the execution will be sent (to support
+  the case of a non-interactive user), or, as we currently do, specify an
+  instance type, so that an instance will be started and managed by Plz (to
+  support the case of an interactive user)
 
 ## Future work