My experiences of Google AppEngine usage

Warning! This article was written in the first months of Google AppEngine. Today it is completely obsolete.

Disclaimer: This article is not about "I am so clever, Google is so stupid". It is about some Google AppEngine problems (or peculiarities) which might not be obvious to newcomers.

You know, Google has done really nice things: great search and awesome mail. It collects a lot of valuable private information about our habits along the way, but we keep using these services because they are so good at solving their tasks...

There has been some hype about AppEngine lately, so I decided to give it a try in my new project.
I chose Python with Google's native libraries to ensure the best compatibility and performance.
I started with performance tests, and the results were... disappointing:
Test description                                                    | Hits per second
print 'Hello world'                                                 | 260
1 read from Datastore, 1 write to Datastore                         | 38
1 read from Datastore                                               | 60
10 reads from Datastore, 1 write                                    | 20
1 read from memcached, 1 write to memcached                         | 80
1 read from memcached                                               | 120
Non-Google complete PHP application, 6 SQL queries, http://3.14.by/ | 240
Tests were done with 20 concurrent requests from 2 different servers on the same continent, averaged over 7 seconds of execution.

Some might say: "Hey! Those numbers are not terrible, my [place url here] could handle just 2 hits per second, so even 38 is a nice upgrade." Well, first of all, this is a hello-world class application; it is as simple as possible. A real application would make 5-25 memcached/Datastore calls per request and would have more logic on top of that. I believe that web developers whose "classic" web applications cannot handle 100 hits per second must be executed. The only way for me to stay sane while working with AppEngine is to use one memcached call per request, and that's it.
Also, for the '10 reads, 1 write' test I was getting 'Error: Server Error' with more than 10 concurrent requests (the internal error was 'too much contention on these datastore entities. please try again').
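One way to soften that error (a sketch of the general approach, not something used in these tests; it assumes a counter entity with an integer count property, like the Counter model shown in the Sources section below) is to do the read-modify-write inside a transaction and retry when it still fails:

from google.appengine.ext import db

def increment(counter_key):
    # Read-modify-write of a single entity inside a transaction.
    counter = db.get(counter_key)
    counter.count += 1
    counter.put()

def increment_with_retries(counter_key, attempts=3):
    # run_in_transaction already retries internally on commit collisions;
    # the outer loop only covers the case where it still gives up.
    for _ in range(attempts):
        try:
            db.run_in_transaction(increment, counter_key)
            return True
        except db.TransactionFailedError:
            pass
    return False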

Scaling

I was expecting that at some point I would get more nodes. Unfortunately, after 10 minutes of stress testing and wasting 10% of my daily CPU quota, the speed was still the same. Apparently it does not react to load that fast.

Sources

The sources are really simple (like this one):
from google.appengine.ext import db

class Counter(db.Model):
    nick = db.StringProperty()
    count = db.IntegerProperty()

res = Counter.gql("WHERE nick = 'test3'")

print 'Content-Type: text/html'
print ''
print '<html><body><h1>This is datastore performance test</h1>'
print '<h2>It reads a counter, and increments its value in datastore</h2>'
for v in res:
    v.count = v.count + 1
    print 'New counter value : ', v.count
    # v.put()  # the Datastore write (commented out here)
print '</body></html>'
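The memcached tests had the same shape; here is a minimal sketch of what such a test could look like using google.appengine.api.memcache (illustrative, not the exact script I deployed):

from google.appengine.api import memcache

print 'Content-Type: text/html'
print ''
print '<html><body><h1>This is memcache performance test</h1>'

# 1 read + 1 write per request.
count = memcache.get('test_counter')
if count is None:
    count = 0
memcache.set('test_counter', count + 1)

print 'New counter value : ', count + 1
print '</body></html>'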

Samples are deployed here: http://mafiazone-dev.appspot.com/. So the Google guys are correct when they say that "performance is almost the same no matter what the scale of your application is". That's right: it is slow at small scale, and just as slow at large scale. You see, even a single request to anything that can hold data (memcached or Datastore) takes a huge amount of time. If you need to perform several such requests to render a page, you will sometimes end up with timeout exceptions. That was really disappointing.
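If you do hit those timeouts, about the best you can do is catch them and degrade gracefully. A sketch (db.Timeout is the exception the Python SDK raises on a Datastore timeout; the cache-fallback policy here is just illustrative, and Counter is the model from the sample above):

from google.appengine.api import memcache
from google.appengine.ext import db

def get_counter_value(nick):
    # Try the Datastore first and keep a cached copy;
    # fall back to the cache if the Datastore call times out.
    try:
        counter = Counter.gql("WHERE nick = :1", nick).get()
        if counter is not None:
            memcache.set('counter_' + nick, counter.count)
            return counter.count
        return None
    except db.Timeout:
        return memcache.get('counter_' + nick)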

Classic web applications (like my homepage) can easily serve 10,000,000 hits per day on a single server, and with further optimization could serve 30,000,000 (an average of 500 hits per second over the 8-10 rush hours). How many projects need even 10% of that? And what if 0.01% of those hits triggered an unrecoverable error caused by random timeouts (any error-handling code needs extra CPU time, which is strictly limited per call)?

Overall issues list

Here is what you should consider when thinking about using Google AppEngine for your project:
  • Any call to the Datastore might fail randomly. Google says the probability of this dropped from 0.4% down to 0.1%, but it will always be there. The Datastore is not designed to be rock solid, so you will have to write additional code to handle these exceptions (see the retry sketch after this list).
  • Memcached is not the memcached you are used to. This one is slow (a few hundred ops/s, while a real memcached instance can handle 10,000 and more).
  • You really need to find somewhere else to serve static data. You cannot store large files here, and again, it is slow.
  • Some reports say URLFetch is less reliable than what we are used to.
  • You cannot choose the datacenter. For example, if you live in Europe and AppEngine places your application in the US, your users will feel the latency. The application would eventually be "moved" to Europe, but you have no control over that.
  • Think twice: Google can serve an almost unlimited number of requests if waiting 100-200 ms on average is not a problem. But in exchange, you will have to invest a lot of effort in making your code proof against random timeouts.
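To illustrate the first point above, this is the kind of retry boilerplate you end up writing around every Datastore call (a sketch; call_with_retries is my own helper, not an AppEngine API):

import time
from google.appengine.ext import db

def call_with_retries(operation, attempts=3, delay=0.05):
    # Retry a Datastore operation a few times before giving up,
    # since any individual call may fail or time out.
    for attempt in range(attempts):
        try:
            return operation()
        except db.Error:
            # db.Error is the base class of the Datastore exceptions.
            if attempt == attempts - 1:
                raise
            time.sleep(delay)

# Usage: wrap reads and writes alike.
# value = call_with_retries(lambda: Counter.gql("WHERE nick = 'test3'").get())
# call_with_retries(lambda: counter.put())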

What I would like to see changed in AppEngine to make it as cool as GMail

  • Much more deterministic behavior. Fewer timeout exceptions. You may send me an e-mail warning that I need to optimize a script, but users should have zero chance of hitting issues caused by that. As I said, it is not always possible to handle every failure, since we might run out of the CPU time allowed per request.
  • Much higher Datastore & memcached performance. What if memcached lived on the same server and we talked to it via shared memory? I am sure the current approach is more reliable, but it is too slow (or rather, it is probably fast, but shared among many clients).
  • Datacenter selection
  • Cluster-aware applications API. Give us some small, server-local, ultra-fast storage, plus "initialize storage" and "release storage" events. That's it (a hypothetical sketch follows below).
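A purely hypothetical sketch of what I have in mind; nothing like this exists in AppEngine, and every name below is invented for illustration:

# Hypothetical node-local storage API (does not exist in AppEngine).
class NodeLocalStore(object):
    def __init__(self):
        self.data = {}

    def on_initialize_storage(self):
        # The platform would call this when a node starts serving the
        # application: warm up whatever this node needs.
        self.data.clear()

    def on_release_storage(self):
        # The platform would call this before the node is taken away:
        # flush anything worth keeping back to persistent storage.
        self.data.clear()

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value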

Some thoughts

Some time ago I worked with a really nice technology: it was fully redundant and reliable, "cloud"-like, with convenient APIs, but it took 4 seconds to render a forum page on a 4-CPU server. It sucked. It does not matter how cool a technology is: if it lacks performance (Google's case) or usability, it will continue to suck.

Where this can and can’t be useful

It's great for mostly read-only, simple applications (i.e. no complex DB logic, little data) without load peaks. It might be an awesome solution for a "homepage" with some photos of your cat: rare changes, zero maintenance and zero cost.

It's not that great for more complex sites which experience the digg/slashdot effect from time to time. Google AppEngine would not be able to scale them rapidly enough to handle a 100 hits/second peak.

Conclusion

Does that mean that Google devs and architects are stupid? Not at all. It is really hard to provide scalability for software that was not specifically optimized for scalability. They did their best, but the end result has limited applicability.

But if your task fits nicely within AppEngine's limitations and its storage performance/error rate, it might be the perfect solution for you.

Update: Yes, I know it scales if you do not write a lot. My goal was to look at the lower-level performance, the basics. No matter how many nodes you have, you will never get 20-80 ms response times (which are essential for a 'snappy' web application).
Update: Yes, I know that the proper counter implementation is a "sharded counter" (a sketch is included at the end of this article). In this article I was not benchmarking a "counter"; I was testing lower-level performance. Yes, we know that the Datastore is slow, and that it is even slower when you keep writing to the same record. If you don't like this test, look only at the read-only and the 10-reads/1-write tests.
Update: I haven't noticed any DDoS protection; my tests were probably too slow to get close to the 500 hits/second hard limit and the 7,200 requests per minute limit.
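For completeness, here is a minimal sketch of that sharded-counter approach (following the widely published AppEngine recipe; the class and function names are mine): writes go to one of N shard entities picked at random, and reads sum over all shards, so concurrent writes rarely touch the same entity.

import random
from google.appengine.ext import db

NUM_SHARDS = 20

class CounterShard(db.Model):
    name = db.StringProperty(required=True)
    count = db.IntegerProperty(default=0)

def increment(name):
    # Update a randomly chosen shard inside a transaction.
    def txn():
        index = random.randint(0, NUM_SHARDS - 1)
        key_name = name + '_' + str(index)
        shard = CounterShard.get_by_key_name(key_name)
        if shard is None:
            shard = CounterShard(key_name=key_name, name=name)
        shard.count += 1
        shard.put()
    db.run_in_transaction(txn)

def get_count(name):
    # Reads do not contend: just sum all shards for this counter.
    total = 0
    for shard in CounterShard.all().filter('name =', name):
        total += shard.count
    return total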

August 19, 2009
