Jan 30 2009

Cool new technology for y’all

Category: Blog Maintenance, Programming, Projects | Jim Powers @ 7:08 pm

Still a work in progress, but the main part of the site (outside of the blog) is now running within a heavily hacked version of gitit, a wiki engine written in Haskell. Since I’ve been learning about Haskell recently, what better way to play than on the Web site? Seriously, John MacFarlane is brilliant! Not to mention the HAppS folks as well.

Haskell is a wondrous language, and definitely the hardest one I’ve decided to learn in quite a while. I played around with Haskell and SML/NJ many, many years ago (1991-1994 time-frame), and I really liked what I saw. Recently, though, I’ve been thinking hard about programming and languages again, especially about building large systems easily, and purely functional languages figure prominently in my thinking at the moment.



Jan 28 2009

Server Upgraded to Fedora 10

Category: Blog Maintenance | Jim Powers @ 8:45 am

Woo hoo!

Other than a couple of minor hiccups (all of them were dealt with by running restorecon -vR [dir]), things went very smoothly (although a bit slowly ;-) ). Thanks, Fedora folks! Also, thanks to all those FLOSS projects that make distros like Fedora possible!!



Jan 26 2009

Apologies to 37signals: you DO have a scaling problem…

Category: Computers, Programming, Projects | Jim Powers @ 12:45 pm

In response to Scale Later.

Problem statement

Building Internet applications is what I do (among other things, of course); I’ve been doing that since something like 1995. I’ve worked in numerous technologies and frameworks, from bare-bones C/C++ to Rails and various Java frameworks (with many things in between), and I can say without a doubt: I always have a scaling problem.

In every Internet application I have built I have continuously been hit with the “scaling problem”. Let me be more specific about what I mean by the “scaling problem”.

An application that can “scale” is one that can support the needs of 1 user up to the needs of a very large number of concurrent users by following a relatively easy to implement recipe to deploy machine resources (CPUs, networking, storage, etc.) to support the concurrent user load you are interested in. The “scaling problem” is then defined in terms of two parts:

  • How does one build an application that can intrinsically scale?
  • Given an application that can intrinsically scale, what are the desirable properties of the “relatively easy to implement recipe” for deploying machine resources to support your desired concurrent use levels?

Furthermore, by a very large number of concurrent users I do not mean tens or hundreds of concurrent users; I mean thousands or more. Please keep in mind that concurrent users is the important element here. Having an application with, say, a million signed-up users but only 50 active or concurrent users is far less interesting to me in terms of building applications than, say, an application that has 5000 signed-up users with 1000 or more using the application concurrently. If your expected number of concurrent users is probably going to be on the order of ~50, basically forever, regardless of the number of “signups”, then I would say that 37signals’ perspective is appropriate, and your long-term problems are going to be related to managing your data store. Such a scenario may be what happens with an internal application used within an organization. The interesting thing is that over time a lot of “internal” applications and data have been given hooks to the “outside” world (a.k.a. the general Internet), and many of these “internal” applications are not able to meet the challenge of higher concurrent user levels without significant and costly re-engineering.

What should the solution look like?

Firstly, I’m going to give a lot of credit where credit is due: Ruby on Rails succeeds completely in one area. Building “old-fashioned” monolithic applications with a single RDBMS behind them is, generally speaking, very easy. There are a number of other frameworks that have sprung up in the wake of Rails that refine or expand approaches made popular by Rails, which is a good thing. The problem with Rails (and basically ALL other frameworks) is that going beyond the limits of the assumed application model becomes progressively harder and more unwieldy. Many of the “niceties” the framework offered cease to be applicable.

So, to start off, what I would like to propose is that we need a “solution” to the “scaling problem” that is as easy to develop with as Rails but does not become significantly more difficult to work with as one tries to scale up. Furthermore, the solution needs to be run-time efficient and intrinsically fault-tolerant.

Yes, I know it’s a lot to swallow in one go, but I think that it is entirely possible to solve this problem.

But how?

Start at the language level

My thinking starts with designing a computer language that implicitly assumes a run-time environment that is not confined to a single process or even the same machine: a distributed run-time environment. Now, many of you are going to shout Erlang!, or maybe Distributed Haskell, or Distributed Ruby, or maybe even Swarm. The ideas behind Swarm are probably the closest thing to what I have in mind, but Erlang and Distributed Haskell, while fantastically useful in their own right, are not what I have in mind. Why? Erlang and Distributed Haskell/Ruby provide powerful tools for abstracting the notion of communicating processes (language-based tools in the case of Erlang and D-Haskell; DRb is a library), but they leave the notion of communicating processes as an explicit concept. I’m looking for a language/runtime that implicitly encapsulates the distributed run-time. Swarm starts to do that.

Literally, code that looks like (using Lisp-ish here):

(let ((x (some-func-1 ...))
      (y (some-func-2 ...)))
  ...)

could have the functions some-func-1 and some-func-2 run on the same or different machines or processes. The run-time decides what should be done; I should not have to care when I don’t want to care. The corollary to this is that when I do care where these functions run, I should be able to say so. My thinking right now is that the specification of where a function gets run is metadata that can be supplied with the function definition/declaration, and/or in a separate specification file. This specification needs to support a dynamic, declarative representation of the rules that govern how to decide where a piece of code runs (i.e. it is an extensible domain-specific language, or DSL).
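To make the "placement as metadata" idea a bit more concrete, here is a rough Python sketch. It is only an illustration under loud assumptions: the names `placement`, `ToyRuntime`, and the node labels are hypothetical, a thread pool stands in for real machines, and round-robin stands in for whatever policy a real run-time would use.

```python
from concurrent.futures import ThreadPoolExecutor

def placement(where=None):
    """Attach optional placement metadata to a function (hypothetical API)."""
    def mark(fn):
        fn.placement = where
        return fn
    return mark

class ToyRuntime:
    """Stand-in for a distributed run-time; threads play the role of nodes."""
    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.pool = ThreadPoolExecutor(max_workers=len(self.nodes))
        self.next_node = 0
        self.log = []  # (function name, chosen node) pairs, for inspection

    def choose_node(self, fn):
        # Honor explicit placement when the caller cares...
        where = getattr(fn, "placement", None)
        if where is not None:
            return where
        # ...otherwise the run-time decides (plain round-robin here).
        node = self.nodes[self.next_node % len(self.nodes)]
        self.next_node += 1
        return node

    def spawn(self, fn, *args):
        self.log.append((fn.__name__, self.choose_node(fn)))
        return self.pool.submit(fn, *args)

@placement(where="node-b")   # the caller cares where this one runs
def some_func_1(n):
    return n * 2

def some_func_2(n):          # no metadata: the run-time picks
    return n + 1

rt = ToyRuntime(["node-a", "node-b"])
# Rough equivalent of the let form: both bindings are evaluated concurrently.
fx, fy = rt.spawn(some_func_1, 10), rt.spawn(some_func_2, 10)
x, y = fx.result(), fy.result()
```

The calling code never mentions processes or messages; the only place a location appears is the optional decorator, which is the part a separate specification file could supply instead.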

From the above discussion it should be clear that the run-time will need to transparently ship both code and data across process and machine boundaries. Erlang has a particularly good implementation of data marshaling, and Lisp’s “code as data” mentality clearly fits in here as well, but of course the language need not be Lisp-ish. Another characteristic that should be clear is that the runtime needs to efficiently support dynamic code loading/compilation, because a freshly provisioned node or process will have no code loaded into it and will get code as the distributed run-time decides how to use compute nodes. To some extent Hadoop does a lot of this, but as with most Java projects there is simply waaaay too much feeding the framework for my tastes, and node provisioning is a chore. That said, Hadoop is rather amazing.

It is also clear from above that I want the ability to add nodes easily. Erlang’s approach to provisioning compute resources seems like a plausible way to start — it should be possible to start the language run-time on a machine (or multiple times on the same machine) and simply specify some key identification/security parameters, and how to join a run-time cluster.

Big problems to solve

From what has been stated so far, nothing has been said about how the run-time would utilize nodes. Clearly this is a non-trivial problem, and in general, there may be no “best way”. To some extent this is where the distribution rule DSL (DrDSL) mentioned above comes into play. The DrDSL must not be a toy; it needs to be able to extensively probe the run-time, both locally and elsewhere, to make decisions. For instance, perhaps the rule for some function says to run that function on the same machine where a resource is “local”, such as a file, or maybe a scanner or printer. Here’s an example (again, Lisp-ish):

[[ <process-file-1> requires <file> as local file ]]
(define (process-file-1 file ...)
   ...)

Where [[ and ]] indicate the DrDSL code that is used by the run-time to decide where to run the code instance of process-file-1. Furthermore, in a large deployment it may be that for any particular resource, such as a file, there may be more than one copy. This brings us to the issue of fault-tolerance.
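One way to picture how a run-time might evaluate such a rule is sketched below in Python. This is an assumption-laden toy, not a DrDSL design: the `Node` record, its `load` field, and the helpers `requires_local_file` and `candidates` are all invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    local_files: set = field(default_factory=set)
    load: float = 0.0  # hypothetical load metric a rule could probe

def requires_local_file(path):
    """Rule: the function may only run where `path` is a local file."""
    return lambda node: path in node.local_files

def candidates(nodes, rules):
    """Nodes that satisfy every rule, least-loaded first."""
    ok = [n for n in nodes if all(rule(n) for rule in rules)]
    return sorted(ok, key=lambda n: n.load)

cluster = [
    Node("N1", {"foo.dat"}, load=0.7),
    Node("N2", {"foo.dat"}, load=0.2),
    Node("N3", {"bar.dat"}, load=0.1),
]
# Only nodes holding a copy of foo.dat are eligible; N3 is filtered out.
eligible = candidates(cluster, [requires_local_file("foo.dat")])
names = [n.name for n in eligible]
```

Because rules are just predicates over what the run-time knows about a node, several of them can be combined, and the list of eligible nodes doubles as the list of fallbacks when there is more than one copy of the resource.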

In building big, it is useful to follow the example of Google with their approach to map/reduce. In particular, my preference is a lot of cheap commodity hardware that is prone to failure. And failure should be automatically dealt with. So, for example, let’s assume the process-file-1 code as listed above. Let’s say that we have three nodes, N1, N2, and N3, each with a copy of the file foo.dat. Node N1 is chosen by the run-time to execute an instance of process-file-1. If node N1 becomes unavailable mid-way through the computation, the run-time should simply select from nodes N2 and N3 and re-start the computation.
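The restart behavior in that scenario can be sketched in a few lines of Python. Everything here is a simulation under stated assumptions: `NodeUnavailable` is a hypothetical signal for a lost node, `run_with_failover` is an invented helper, and the "nodes" are just strings.

```python
class NodeUnavailable(Exception):
    """Raised when a node disappears mid-computation (hypothetical signal)."""

def run_with_failover(task, replicas):
    """Try `task` on each replica in turn, restarting from scratch on failure."""
    failed = []
    for node in replicas:
        try:
            return node, task(node), failed
        except NodeUnavailable:
            failed.append(node)  # give up on this node and try the next one
    raise RuntimeError(f"no replica could finish the task: {failed}")

def process_file_1(node):
    # Simulate N1 dying mid-way through; the other nodes succeed.
    if node == "N1":
        raise NodeUnavailable(node)
    return f"processed foo.dat on {node}"

node, result, failed = run_with_failover(process_file_1, ["N1", "N2", "N3"])
```

Note that this only works cleanly because the simulated task has no side effects; restarting is trivial when a failed attempt leaves nothing behind.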

Clearly, I’m glossing over huge issues such as transactional integrity (what if process-file-1 has some side-effect that was partially realized? How do we restart process-file-1 on another node cleanly?), but I’m fairly certain that such problems can be resolved as well. I’m not implying that these problems can be solved easily, mind you; there are a number of graduate theses that could be applied to addressing this problem.
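For the narrow case where the only side effect is an output file, one well-known trick is to stage the output and publish it with a single atomic rename, so a failed attempt leaves nothing visible. The Python sketch below shows that pattern only; it is not a general answer to transactional integrity, and the function name, file names, and `crash` flag are made up for the demonstration.

```python
import os
import tempfile

def process_file_atomically(src, dst, crash=False):
    """Write output to a temp file, then publish it with one atomic rename."""
    out_dir = os.path.dirname(os.path.abspath(dst))
    fd, tmp = tempfile.mkstemp(dir=out_dir)  # same directory as dst
    try:
        with os.fdopen(fd, "w") as out, open(src) as inp:
            for i, line in enumerate(inp):
                if crash and i == 1:
                    raise RuntimeError("node lost mid-computation")
                out.write(line.upper())
        os.replace(tmp, dst)  # the only externally visible side effect
    except BaseException:
        # A real crash would skip this cleanup, but dst would still be untouched.
        os.remove(tmp)
        raise

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "foo.dat")
dst = os.path.join(workdir, "foo.out")
with open(src, "w") as f:
    f.write("a\nb\nc\n")

try:
    process_file_atomically(src, dst, crash=True)   # first attempt dies
except RuntimeError:
    pass
visible_after_crash = os.path.exists(dst)           # nothing was published
process_file_atomically(src, dst)                   # clean restart
with open(dst) as f:
    final = f.read()
```

Side effects that touch more than one resource (a database plus a file, say) need real coordination, which is exactly the hard part being glossed over here.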

So, where are we?

From the above discussion it should begin to become clear that if we had such a system we could easily write the code for a distributed application more-or-less as if we were writing it for a single machine/process. Agreed, it seems a bit “pie-in-the-sky”, but I feel that something like the above is both needed and inevitable. Mayhaps I should start working on it?
