14

Looking for general tips, tricks and war stories on developing scalable web applications that are distributed over many nodes. What are the key issues to keep in mind? What issues must be solved in early releases of the application? Etc.

Summarizing Contributions:

I will update this section as answers to this question are created. A twin question has been created to discuss frameworks and tools separately.

User Driven Applications

  • How many simultaneous users/connections?
  • How often do users connect?
  • What is the average connection duration?
  • How much data is exchanged during a communication?
  • Does one need to ensure permanent application access/availability?
  • Is a load balancer necessary?
  • What about ergonomics? Application response time?
  • Is Search Engine Optimization necessary/relevant?

Process Driven Applications

  • Do processes run mostly in the foreground or in the background?
  • How many different types of processes are there?
  • How often does each process run? For how long?
  • How much data is processed per node?

Communication

  • How are nodes going to communicate with each other (internal<->internal)?
  • How are users going to communicate with the application (external<->internal)?
  • How many types of communication channels (internal<->internal<->external)?
  • Which devices are going to be used (PC, handhelds, iPhone...)?
  • Is group communication necessary? Which groups?
  • Is communication centralized or P2P-like?

Data

  • Is the amount of data stable or ever growing?
  • Can the application data be stored on one server? Should it?
  • Should the data be available online at any time/immediately?
  • Does one need to perform backup? Is it possible?
  • Does one need to archive data? If yes, how?
  • Does one need to shard data (i.e. distribute data) over many databases?
  • How flexible are the data structures? Can they integrate future needs?
  • What is the life cycle of each data structure? Do they have versions?

Project Management

  • Which project build tool will be used (Maven, Ant, ...)?
  • Will there be application upgrades?
  • Which kind of updates? Patches? Bundles? Service packs?
  • How are updates going to be applied?
  • Where is the application going to be deployed? Web server? Desktop?
  • How is the application going to be deployed?
  • Should one create system images and deploy them on nodes?
  • How is the application going to be maintained?
  • Should maintenance be outsourced? Can it be outsourced?
  • Should hardware & system administration be outsourced?

Functional & Technical Design

  • Is the application going to offer new/future functionalities?
  • What is the decision process to include/remove functionalities?
  • Who participates in the decision process?
  • Should feedback from users be taken into account?
  • Should feedback from engineers be taken into account?
  • How is feedback collected? How is feedback analyzed and documented?

Programming & Code Base

  • How many technologies/programming languages are used?
  • How and where is the code handled? How many repositories?
  • How and when is code released? What is the versioning strategy?
  • How is code modularized? Is there a separation between API and implementation?
  • Can an API be deprecated? Can it be modified over time? Should it have a version?
  • How is memory going to be managed? Shared, not shared?
  • Is thread-safety/concurrency relevant?

Testing

  • How is the application going to be tested?
  • Unit tests? Integration tests? Mocking tests?
  • Performance tests? Load tests? Regression tests?
  • What is the test coverage? How is it measured?
  • Can all aspects of the application be tested?
  • Is 3rd party code reliable/tested?

Documentation & User Manual

  • Should one define rules for documentation?
  • What should be documented? Which procedures?
  • Where is the documentation stored? Online? In the code?
  • How often should the documentation be reviewed? Is it confidential?
  • Who should access which documentation? From where?
  • Are user guides necessary? Should they be available online? As books?
  • Is a FAQ system required? Should one create user forums? Internal forums?

Security

  • Is an encryption system required? Which?
  • Is access control required? Which security model should be implemented?
  • Should stored data be encrypted?
  • What if security is broken? What if the system is hacked?

Monitoring

  • How is the system monitored?
  • What should be monitored? Traffic? Workload? Trends?
  • Should logs be produced? Should they be automatically/manually analyzed?
  • Can automatic monitoring tools be implemented/deployed?
  • Can/should users be able to signal dysfunctional behaviors?
  • What is the bug reporting strategy? How is it organized?

Disaster Planning & Migration

  • What is the desired level of availability?
  • What are the anticipated disaster scenarios?
  • How many redundant resources are available in case of failure?
  • What are the application/system's single points of failure?
  • Should a fail-over system be implemented? When is it triggered?
  • Can the application/system migrate? If yes, how will it be organized?
  • What if a 3rd party technology provider disappears?

Knowledge Management

  • Where is knowledge of the application located?
  • What if key engineers leave the company/become unavailable?
  • How is application/system knowledge transferred?
  • Are training courses required?

Technology & Standards

  • Should/does the application/system rely on 3rd party technology?
  • Should one rely on open source technology? Proprietary technology?
  • What are the license restrictions and requirements per 3rd party technology?
  • Should one develop in-house technologies?
  • Should one open source in-house technologies? If yes, with which license?
  • Will open sourcing reduce the development and testing burden?
  • Is there a risk of losing control? What about the brand?
  • Should the application be platform independent?
  • How can new/future technologies be integrated?
  • Should the application fit with standards? If yes, which?
  • How hard/easy will it be to integrate new versions of standards?
6

I think it really depends on the way in which the application needs to scale.

  • Does it need to process lots of data?
  • Does it need to handle lots of client requests?
  • Does it need to scale in database write accesses or in database read accesses or both?
  • ...

These questions lead to different requirements. For example, you can scale database read accesses with a simple database replication setup. But when it comes to scaling write accesses, you need a different approach, such as sharding.
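To make the difference concrete, here is a minimal sketch (not a production setup): writes go to one shard chosen by hashing the record key, while reads rotate over that shard's replicas. The `Router` class and all node names are made up for illustration.

```python
import hashlib
import itertools


class Router:
    """Hypothetical router: writes go to a shard's primary, reads to its replicas."""

    def __init__(self, shards):
        # shards: list of {"primary": name, "replicas": [names]} (illustrative layout)
        self._shards = shards
        # Fall back to the primary when a shard has no replicas.
        self._replica_cycles = [
            itertools.cycle(s["replicas"] or [s["primary"]]) for s in shards
        ]

    def _index(self, key):
        # A stable hash, so the same key always maps to the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self._shards)

    def node_for_write(self, key):
        return self._shards[self._index(key)]["primary"]

    def node_for_read(self, key):
        return next(self._replica_cycles[self._index(key)])


router = Router([
    {"primary": "db0", "replicas": ["db0-r1", "db0-r2"]},
    {"primary": "db1", "replicas": ["db1-r1", "db1-r2"]},
])
write_node = router.node_for_write("user:42")
read_node = router.node_for_read("user:42")
```

Note that plain modulo hashing remaps most keys when the number of shards changes, so real setups often use something like consistent hashing or a lookup table instead.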

The amount of data is also important. Do you expect gigabytes or (lots of) terabytes of data? The latter may call for a NoSQL-style approach, such as Hadoop or something similar.

If you want to scale the number of client requests your web server can handle, a load balancer is also a more or less simple solution.
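The core idea of a load balancer fits in a few lines. This is only a sketch of round-robin selection with a manual health flag; the `RoundRobinBalancer` class and the backend names are hypothetical, and a real deployment would normally use an existing balancer rather than hand-written code.

```python
import itertools


class RoundRobinBalancer:
    """Hypothetical round-robin picker that skips backends marked as down."""

    def __init__(self, backends):
        self._backends = list(backends)
        self._healthy = set(self._backends)
        self._cycle = itertools.cycle(self._backends)

    def mark_down(self, backend):
        self._healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self._backends:
            self._healthy.add(backend)

    def pick(self):
        if not self._healthy:
            raise RuntimeError("no healthy backends")
        # At most one full lap is needed to find a healthy backend.
        for _ in range(len(self._backends)):
            backend = next(self._cycle)
            if backend in self._healthy:
                return backend
        raise RuntimeError("no healthy backends")


lb = RoundRobinBalancer(["web1", "web2", "web3"])
first_lap = [lb.pick() for _ in range(4)]
lb.mark_down("web2")
after_failure = [lb.pick() for _ in range(2)]
```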

If you need to process lots of data in parallel over different nodes, there are several options, such as software agent frameworks (e.g. JADE), which are easy to distribute, or parallel processing frameworks like JPPF.

What I always look for when it comes to scalability are real-world scenarios. For example, in the last few weeks I stumbled across lots of articles about scalability and distributed computing at Twitter. I think they are one of the biggest players right now when it comes to scalability, so we can learn a lot from them.

3

From my experience, one of the most important things is the amount of load the nodes are going to put on the network (objects over the wire, basic messages etc).

Figuring out a proper group management mechanism (like JGroups) that suits your needs is important.

Understanding data growth, and having strategies that make it easy to grow the number of nodes, is also advisable I guess.

Emphasis and effort put into building a good core distribution layer go a long way toward making life simpler. These are my 2 cents.

2

Take a look at the CQRS architecture. Examples are mostly in C#, but the principles are cross-platform. I strongly recommend watching the video from Greg Young's class.

Also you could look at Udi Dahan's, Rinat Abdullin's, Jonathan Oliver's and Greg Young's blogs.

2

I'm by no means an expert, but here are my answers. I've worked on some web-based large-scale projects (including a game that had to handle around 200k daily users). I've also launched some modest startups and I've constantly had to re-evaluate my stance on some of the following.

User Driven Applications

How many simultaneous users/connections?

It depends on your application, but I would design it so it can theoretically and initially handle up to 10k concurrent users. That's just a standard I've done my best to abide by. 10k is a good strong number and more often than not reachable. Not only that, but by the time you have 10k concurrent users, you'll most likely have the capital to expand both vertically and horizontally. So it's a good number in my opinion.

How often do users connect?

If you can handle the concurrent connections, this question becomes moot. The connection frequency should be correlated with your expected load. There's not much overhead for creating a socket (or whatever mechanism you may be using).

What is the average connection duration?

I strongly suggest using keep-alive solutions when faced with this question. See some benefits of keep-alive here: http://www.webhostingtalk.com/showthread.php?t=687701. Even though that thread is about Apache, a custom solution follows the same principles.

How much data is exchanged during a communication?

Serialize serialize serialize. Everything. Avoid unnecessary AJAX and remember that XHR is a double-take protocol. Use web-sockets whenever available.

Does one need to ensure permanent application access/availability?

Yes. Or at least 99.9%. Your users will thank you.

Is a load balancer necessary?

Absolutely. It should be on the white board from day one. It may not be implemented until months later, but it should not be an afterthought and everything should be designed with a load balancer in mind.

What about ergonomics? Application response time?

http://www.useit.com/papers/responsetime.html has stood the test of time. There's also a special section on web-based applications.

Is Search Engine Optimization necessary/relevant?

Not as much as it used to be, especially if your application is social. But like with all things, some SEO wouldn't hurt.

Process Driven Applications

Do processes run mostly in the foreground or in the background?

Background. (Usually daemonized.)

How many different types of processes are there?

This is very application-specific and kind of tough to answer.

How often does each process run? For how long?

Usually the service runs indefinitely and new threads may be created for new connections (although this paradigm should be avoided).
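The answer does not name an alternative to thread-per-connection. One common option (my assumption, not the only one) is an event loop that multiplexes many connections on a single thread. A minimal sketch with Python's asyncio, where `handle_connection` and `serve` are hypothetical names and the sleep stands in for waiting on network I/O:

```python
import asyncio
import threading


async def handle_connection(conn_id, seen_threads):
    # Record which OS thread runs this handler.
    seen_threads.add(threading.get_ident())
    # Stand-in for waiting on network I/O; control returns to the event loop.
    await asyncio.sleep(0.01)
    return f"conn-{conn_id} done"


async def serve(connection_count):
    seen_threads = set()
    results = await asyncio.gather(
        *(handle_connection(i, seen_threads) for i in range(connection_count))
    )
    return results, seen_threads


results, seen_threads = asyncio.run(serve(100))
```

All 100 simulated connections are handled concurrently, yet only one thread is ever used.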

How much data is processed per node?

Little. Specific numbers are tough without an actual application, but again, if you serialize your data properly and use JSON (avoid XML) it shouldn't be a huge chunk of data.
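As a rough illustration of the JSON vs. XML point, here is the same small record encoded both ways using only the standard library. The record and the `encoded_sizes` helper are made up, and the actual ratio depends entirely on your data.

```python
import json
import xml.etree.ElementTree as ET


def encoded_sizes(record):
    """Return (json_size, xml_size) in bytes for a flat dict of values."""
    json_bytes = json.dumps(record, separators=(",", ":")).encode("utf-8")

    # One child element per field, with no XML declaration, to keep it compact.
    root = ET.Element("record")
    for key, value in record.items():
        child = ET.SubElement(root, key)
        child.text = str(value)
    xml_bytes = ET.tostring(root, encoding="unicode").encode("utf-8")

    return len(json_bytes), len(xml_bytes)


json_size, xml_size = encoded_sizes({"id": 42, "name": "alice", "score": 1337})
```

For this record the XML form is larger because every field name appears twice, once in the opening tag and once in the closing tag.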

Communication

How are nodes going to communicate with each other (internal<->internal)?

It depends on your network topology. There's no real right (or useful) answer here.

How are users going to communicate with the application (external<->internal)?

Usually a web-based interface. Almost always via some sort of API.

How many types of communication channels (internal<->internal<->external)?

It depends on your network topology. There's no real right (or useful) answer here.

Which devices are going to be used (PC, handhelds, iPhone...)?

With mobile computing becoming virtually ubiquitous, I would strongly suggest a focus on iPhones, iPads, laptops, etc, etc.

Is group communication necessary? Which groups?

This question is kind of confusing. User groups? Node groups?

Is communication centralized or P2P-like?

It depends on your network topology. There's no real right (or useful) answer here.

Data

Is the amount of data stable or ever growing?

In the world of Web 2.0, it usually grows indefinitely.

Can the application data be stored on one server? Should it?

It shouldn't. Kind of beats the point of distributed computing.

Should the data be available online at any time/immediately?

A question similar to this one was already answered above (see permanent application access/availability).

Does one need to perform backup? Is it possible?

It's possible and it should be done. Maybe not at first, but it's an aspect of distributed computing that is very important.

Does one need to archive data? If yes, how?

Yes, tar.gz.
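For what it's worth, producing a tar.gz archive takes only a few lines with Python's standard `tarfile` module. The `archive_directory` helper and the paths are hypothetical; where archives are kept and for how long is a separate decision.

```python
import tarfile
from pathlib import Path


def archive_directory(source_dir, archive_path):
    """Pack source_dir into a gzip-compressed tar archive and return its path."""
    source = Path(source_dir)
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the directory name.
        tar.add(str(source), arcname=source.name)
    return str(archive_path)
```

Usage would look like `archive_directory("/var/data/2010-06", "/backups/2010-06.tar.gz")`, with both paths being placeholders.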

Does one need to shard data (i.e. distribute data) over many databases?

Usually, yes. (This is one of the main benefits of distributed systems, after all.)

How flexible are the data structures? Can they integrate future needs?

The term "data structures" is ambiguous. But all protocols should be easily extensible.

What is the life cycle of each data structure? Do they have versions?

This question is a bit confusing, sorry.

Project Management

Which project build tool will be used (Maven, Ant, ...)?

It depends on your programming language.

Will there be application upgrades?

Hopefully :)

Which kind of updates? Patches? Bundles? Service packs?

Patches, bundles and service packs are basically all the same thing. An update.

How are updates going to be applied?

If your application is hosted, then it's all done internally, otherwise it should be done via some sort of auto-check mechanism.

Where is the application going to be deployed? Web server? Desktop?

Why are you asking me? ;) But I suggest web-based as it's easier to deploy and implement (and update!)

How is the application going to be deployed?

See above.

Should one create system images and deploy them on nodes?

This is how it's usually done, yes.

How is the application going to be maintained?

By using some sort of distributed application management system, such as Mainframe Express (http://www.microfocus.com/products/enterprise/MFEEE.aspx).

Should maintenance be outsourced? Can it be outsourced?

Often times, it can't be outsourced. But if it's possible, I say go for it.

Should hardware & system administration be outsourced?

System administration maybe yes, hardware not so much (you need someone to have physical access to the servers, after all).

Functional & Technical Design

Is the application going to offer new/future functionalities?

Hopefully :)

What is the decision process to include/remove functionalities?

That's really up to you.

Who participates in the decision process?

These are business questions, not programming ones :P

Should feedback from users be taken into account?

Common sense would say yes.

Should feedback from engineers be taken into account?

Common sense would say yes.

How is feedback collected? How is feedback analyzed and documented?

There are a zillion project management apps out there.

Programming & Code Base

How many technologies/programming languages are used?

Apart from the usual XHTML and JavaScript, usually a softer language like PHP or Python, and a stronger language like Java or Erlang for the server. Sometimes, C or C++.

How and where is the code handled? How many repositories?

Every individual aspect of your distributed app has to have a repository (be organized!).

How and when is code released? What is the versioning strategy?

This is up to you. As for the versioning strategy, it's hard to see what you're asking (app versioning or db versioning).

How is code modularized? Is there a separation between API and implementation?

There should always be a clear separation between API and implementation.

Can an API be deprecated? Can it be modified over time? Should it have a version?

Always, always make sure your API is both deprecatable and compatible with past and future versions of your backend.
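One way to keep an API deprecatable is to register handlers per version and flag the old ones instead of deleting them. This is just a sketch: the `api` decorator, `dispatch`, the version labels and the response fields are all hypothetical.

```python
HANDLERS = {}


def api(version, deprecated=False):
    """Register a handler under a version label (hypothetical registry)."""
    def register(func):
        HANDLERS[version] = (func, deprecated)
        return func
    return register


@api("v1", deprecated=True)
def get_user_v1(user_id):
    return {"id": user_id, "name": "alice smith"}


@api("v2")
def get_user_v2(user_id):
    return {"id": user_id, "first_name": "alice", "last_name": "smith"}


def dispatch(version, user_id):
    if version not in HANDLERS:
        raise ValueError(f"unsupported API version: {version}")
    handler, deprecated = HANDLERS[version]
    response = handler(user_id)
    if deprecated:
        # Old clients keep working but are told to migrate.
        response = dict(response, deprecated=True)
    return response
```

Clients on `v1` keep getting the old response shape plus a deprecation flag, while `v2` can evolve independently.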

How is memory going to be managed? Shared, not shared?

This is very dependent on your application.

Is thread-safety/concurrency relevant?

Yes. Very very very relevant.

Testing

How is the application going to be tested?

Unit testing, usually.

Unit tests? Integration tests? Mocking tests?

All three if time (and money) permits.

Performance tests? Load tests? Regression tests?

These too.

What is the test coverage? How is it measured?

It depends on your application.

Can all aspects of the application be tested?

In an ideal world yes, but time constraints say otherwise.

Is 3rd party code reliable/tested?

It depends on the 3rd party code.

Documentation & User Manual

Should one define rules for documentation?

Yes, it would probably be a good idea.

What should be documented? Which procedures?

Everything. (I'm not kidding.)

Where is the documentation stored? Online? In the code?

Online and in the code.

How often should the documentation be reviewed? Is it confidential?

At every iteration.

Who should access which documentation? From where?

Hard to see what you're trying to ask here.

Are user guides necessary? Should they be available online? As books?

Yes, they should be available online. The easier something is to use, the more people will use it :)

Is a FAQ system required? Should one create user forums? Internal forums?

Yes and yes.

Security

Is an encryption system required? Which?

It depends on your application.

Is access control required? Which security model should be implemented?

It depends on your application.

Should stored data be encrypted?

Most likely yes.

What if security is broken? What if the system is hacked?

The world ends? Make sure you have backups, failsafes, and stuff that goes off if the system does break.

Monitoring

How is the system monitored?

This is very platform-dependent.

What should be monitored? Traffic? Workload? Trends?

Everything.

Should logs be produced? Should they be automatically/manually analyzed?

Of course logs should be produced! Slow logs are also an absolute must.
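Databases such as MySQL ship with a slow query log, but the same idea can be applied to application code. A minimal sketch, assuming a decorator named `log_if_slow` and a logger named `slowlog` (both made up here):

```python
import functools
import logging
import time

slow_log = logging.getLogger("slowlog")


def log_if_slow(threshold_seconds):
    """Log a warning whenever the wrapped call takes at least threshold_seconds."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs whether the call returned or raised.
                elapsed = time.perf_counter() - start
                if elapsed >= threshold_seconds:
                    slow_log.warning(
                        "%s took %.3fs (threshold %.3fs)",
                        func.__name__, elapsed, threshold_seconds,
                    )
        return wrapper
    return decorator


@log_if_slow(0.5)
def load_dashboard(user_id):
    # Placeholder for real work such as database queries.
    return {"user": user_id}
```

The threshold of 0.5 seconds is arbitrary; pick one that matches your response time goals.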

Can automatic monitoring tools be implemented/deployed?

Yes.

Can/should users be able to signal dysfunctional behaviors?

Yes.

What is the bug reporting strategy? How is it organized?

What bug tracking software do you use?

Disaster Planning & Migration

What is the desired level of availability?

99.9%
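To put that figure in perspective, an availability target translates directly into a downtime budget. A small sketch of the arithmetic (the function name is made up, and a year is taken as 365 days):

```python
def downtime_budget_hours(availability_percent, period_hours=365 * 24):
    """Hours of allowed downtime over the period for a given availability target."""
    return period_hours * (1 - availability_percent / 100.0)


three_nines = downtime_budget_hours(99.9)   # roughly 8.76 hours per year
four_nines = downtime_budget_hours(99.99)   # roughly 0.88 hours per year
```

So 99.9% leaves a bit under nine hours of downtime per year, including planned maintenance.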

What are the anticipated disaster scenarios?

This question is very "what-iffy" and tough to answer.

How many redundant resources are available in case of failure?

Generally, three levels of redundancy is a good number.
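If "three levels" is read as three redundant copies of a component (my interpretation), and if the copies fail independently (a strong assumption that rarely holds exactly), the combined availability can be estimated like this:

```python
def combined_availability(single_copy_availability, copies):
    """Probability that at least one of several independent copies is up."""
    return 1 - (1 - single_copy_availability) ** copies


one_copy = combined_availability(0.99, 1)      # 0.99
three_copies = combined_availability(0.99, 3)  # about 0.999999
```

Correlated failures (same rack, same power supply, same bug) make the real number worse, which is why the copies should share as little as possible.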

What are the application/system's single points of failure?

This depends entirely on your topology.

Should a fail-over system be implemented? When is it triggered?

It's a good idea to have one in place.

Can the application/system migrate? If yes, how will it be organized?

It should be migratable, but I wouldn't make this an absolute priority.

What if a 3rd party technology provider disappears?

Try to rely on well-established 3rd party tech providers.

Knowledge Management

Where is knowledge of the application located?

On a wiki or FAQ. Up to you.

What if key engineers leave the company/become unavailable?

/shrug

How is application/system knowledge transferred?

If you use a wiki, this is trivial.

Are training courses required?

It depends on the complexity of your app.

Technology & Standards

Should/does the application/system rely on 3rd party technology?

It probably will. It's also a good idea to not re-invent the wheel. If you see a technology that is well-established (let's say, Apache), then yes, you can count on it.

Should one rely on open source technology? Proprietary technology?

Avoid GPL and LGPL, embrace BSD and MIT. If you've got the money, you can also go for proprietary stuff. But more often than not, the open source stuff is just as good.

What are the license restrictions and requirements per 3rd party technology?

It depends on the technology.

Should one develop in-house technologies?

Yes.

Should one open source in-house technologies? If yes, with which license?

It depends on your requirements and on whether you want to release it to the open source world or not.

Will open sourcing reduce the development and testing burden?

Sometimes, it 100% will. Sometimes it won't. This tends to be an ideological question more than a practical one.

Is there a risk of losing control? What about the brand?

Consult with a lawyer.

Should the application be platform independent?

Yes, yes, yes.

How can new/future technologies be integrated?

Hopefully :)

Should the application fit with standards? If yes, which?

Yes.

And the many more standards out there. Make sure people can easily use your technology if they already know SOAP (or similar) technologies.

How hard/easy will it be to integrate new versions of standards?

It depends how you write it in the first place! Clean code is easily changed!

0

I think in several cases, trying to achieve application server and database server independence is worth considering. Platform-independent solutions are harder to develop, but you are free to migrate later. Or choose and fix the platform very carefully at an early stage. :-)