The Potential RPG (working title) v0.8.16 series (BetaAlpha edition) has been steady as she goes. Beneath the surface calm, a current of design is building into a swell of development, which will crest in the coming weeks.
Running autonomous test clients against my game server, I kept generating a situation in which several client network connections (TCP) remained open, leaving those client sessions active long after the client had disconnected. This had the potential to retain game state in the server, preventing memory from being reclaimed (effectively a memory leak). A player might also be unable to log in later, because the server still holds an active (albeit bogus) session.
No matter how many safety checks I put into the networking logic, I could not detect these broken, lingering connections. As it turns out, a Java socket cannot report that it has closed after an unclean shutdown unless the application actively attempts to read or write data. My game server protocols are rather conservative, and do not chatter unnecessarily with clients. As a result, an abruptly disconnected client socket could idle indefinitely, so long as no data was destined to be sent its way.
To alleviate this problem, the server networking logic now takes note of client communication times. After a configurable timeout with no activity, the server sends a ping, to which the client is expected to reply with a pong. This has one of three results. If the pong is not returned within another probationary timeout, the client is considered lost, and the session is forcibly closed. If the client is, in fact, alive and well, the pong refreshes its most-recent-activity timestamp. The third result is what I've most commonly found: the attempt to send the ping over a broken connection triggers a network error (IOException), which is caught and handled more-or-less cleanly.
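The sweep logic can be sketched roughly as follows. The class and method names here are hypothetical, not the server's actual API, and timestamps are passed in explicitly to keep the sketch testable (in production they'd come from System.currentTimeMillis() on a scheduled task):

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical session handle; the real server's interface differs.
interface Session {
    void sendPing() throws IOException;   // throws if the connection is broken
    void close();
}

/** Reaps sessions that go silent and then miss the ping/pong window. */
class SessionReaper {
    private final long idleTimeoutMs;   // silence allowed before we ping
    private final long probationMs;     // extra grace period for the pong
    private final Map<Session, Long> lastActivity = new HashMap<>();
    private final Map<Session, Long> pingSentAt = new HashMap<>();

    SessionReaper(long idleTimeoutMs, long probationMs) {
        this.idleTimeoutMs = idleTimeoutMs;
        this.probationMs = probationMs;
    }

    /** Call on any inbound client data, including pongs. */
    void onActivity(Session s, long now) {
        lastActivity.put(s, now);
        pingSentAt.remove(s);
    }

    /** Run periodically (e.g. from a scheduled task). */
    void sweep(long now) {
        for (Iterator<Map.Entry<Session, Long>> it = lastActivity.entrySet().iterator(); it.hasNext(); ) {
            Map.Entry<Session, Long> entry = it.next();
            Session s = entry.getKey();
            Long pinged = pingSentAt.get(s);
            if (pinged != null) {
                if (now - pinged > probationMs) {   // no pong came back: client is lost
                    s.close();
                    pingSentAt.remove(s);
                    it.remove();
                }
            } else if (now - entry.getValue() > idleTimeoutMs) {
                try {
                    s.sendPing();                   // a broken pipe surfaces here...
                    pingSentAt.put(s, now);
                } catch (IOException ex) {          // ...the third (most common) case
                    s.close();
                    it.remove();
                }
            }
        }
    }
}
```

All three outcomes fall out of one sweep: the IOException on send, the missed probation window, and the pong that resets the clock via onActivity().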
In a perfect world, client software would cleanly tear down all TCP connections. In the real world, several factors can prevent this ideal behavior (network failure, software crash, killed application). In any case, server systems cannot rely on clients to behave ideally. While I could spend the rest of my days improving network communication logic (resume available upon request), the above technique is simple enough and appears to be working well in ongoing tests.
Remember those client/gameplay updates I've been promising... I'm almost ready to get going on that. This week I've been hunting down excessive server memory allocations and deflating them. The goal is to have a predictable (and low) memory profile, so I can gauge the RAM required to support N players (as N approaches infinity, if all goes to plan).
The particular improvements include content persistence, atlas activities, and server startup data indexing. These activities were each consuming large chunks of memory, which could spike and cause OutOfMemoryErrors. One technique I used to lower runtime RAM consumption was to stream content directly to the backing store during persistence, rather than constructing large intermediate data structures in memory. Atlas maps are now served directly from disk, when possible, rather than being redundantly inflated and deflated in memory. Startup data indexing now builds its lookup structures on disk, rather than in memory. Overall, this should make for a smoother ride.
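A minimal sketch of the streaming difference, with a hypothetical Entity record standing in for the server's actual content types:

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical content type for illustration only.
record Entity(int id, String name) {}

class Persister {
    // Memory-heavy approach: serialize the whole snapshot into one big buffer first.
    static byte[] toBuffer(List<Entity> entities) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            for (Entity e : entities) { out.writeInt(e.id()); out.writeUTF(e.name()); }
        }
        return buf.toByteArray();   // the entire snapshot is resident in RAM at once
    }

    // Streaming approach: write each entity straight to the backing store.
    static void streamTo(Path file, List<Entity> entities) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            for (Entity e : entities) { out.writeInt(e.id()); out.writeUTF(e.name()); }
        }   // only a small fixed buffer is resident, regardless of entity count
    }
}
```

Both paths produce identical bytes; the streaming path just never holds more than one buffered chunk in memory at a time.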
Although I could spend the rest of my days improving system software design and implementation (resume available upon request), I'm eager to re-focus on the player experience with client and gameplay enhancements. Why can't there be N of me?
... as I was saying, behavioral testing of the server under heavier load has spilled into this week, and consumed it. Several key systems now exhibit better behavior and are overall more resilient and scalable. At the end of this week, I find myself implementing another performance/scalability/resilience improvement in the server. Namely, persistence can cause wicked memory spikes, which can slam into the memory limit of the server process, causing instability. I have a design for improving the persistence routines for both speed and memory efficiency, which I'll be implementing just as soon as I publish this post. (Saturday night, 8pm, time to code!)
Then, client and gameplay improvements are (for real) just around the corner.
I spent this week running longer performance tests under heavier load, with the goal of preserving maximum service uptime. The game server runs exceptionally well, so long as no one logs in. Even then, it does really well, until I reach my server's memory capacity. Even at capacity, server CPU load remains low.
Nonetheless, I have found a few faults, which cause undesirable behavior in the client and server, under certain failure cases. For the last few days, I've been diagnosing these cases. Improvements have begun, which will spill into next week. Afterward, the client and server should be more resilient to failure cases.
Once these improvements are complete, the next group of tasking involves client and gameplay enhancements.
I had inadvertently placed a hex upon my avatar animation sequences when I chose to number the frames in hexadecimal. Several avatar animations, including all of the attack and death sequences, had been cut short.
The logic for loading animation frames counts in hex [00, 01, ..., 09, 0a, 0b, ...]. However, the external resources were numbered in decimal [00, 01, ..., 09, 10, 11, ...]. After frame [09], the loading logic didn't find frame [0a] (because it was numbered [10]) and stopped. Consequently, several frames of animation (up to 8 for some death sequences) were being ignored.
Now that everything is numbered in hex, the attack and death sequences are a bit more ... animated.
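The fix amounts to probing for the next frame with a hex-formatted name. The loader and naming scheme below are illustrative, not the actual client code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class FrameLoader {
    /** Collects frame names "prefix-00", "prefix-01", ..., counting in hex,
     *  stopping at the first missing frame (as the loading logic does).
     *  The 'available' set stands in for a check against resource files. */
    static List<String> loadFrames(String prefix, Set<String> available) {
        List<String> frames = new ArrayList<>();
        for (int i = 0; ; i++) {
            // %02x counts 08, 09, 0a, 0b ... where %02d would count 08, 09, 10, 11
            String name = prefix + "-" + String.format("%02x", i);
            if (!available.contains(name)) break;
            frames.add(name);
        }
        return frames;
    }
}
```

Against decimal-numbered resources, this loader silently stops at frame 10 of a longer sequence, exactly the truncation bug described above; against hex-numbered resources it finds every frame.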
As a respite from server improvements, I've spent most of this week on client-side graphic rendering improvements. The updates are not yet released, as there is one more little item to deal with. Since it's not quite finished, I haven't taken precise measurements yet, but preliminary results suggest about an 80% speedup in the world painting routine. I like to leave myself a lot of room for improvement.
Then again, the client ran rather well even with no pre-rendering optimizations. Previously, every frame of every character animation layered the base form, performed color masking, and added adornments; now the animations are pre-rendered for each avatar. Likewise, the world surface was painted on demand, including the complicated masking that produces the terrain edging patterns; now the static parts of the map are pre-rendered to an off-screen image, reducing painting time.
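The static-backdrop idea can be sketched with java.awt.image.BufferedImage. The tile size, flat-color terrain, and class names here are assumptions for illustration; the real renderer's masking and edging work is far more involved:

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Illustrative sketch: pre-render the static map once, then blit it each frame.
class MapRenderer {
    static final int TILE = 32;               // assumed tile size in pixels
    private final BufferedImage backdrop;     // off-screen image of static terrain

    MapRenderer(int tilesWide, int tilesHigh, Color[][] terrain) {
        backdrop = new BufferedImage(tilesWide * TILE, tilesHigh * TILE,
                                     BufferedImage.TYPE_INT_RGB);
        Graphics2D g = backdrop.createGraphics();
        for (int x = 0; x < tilesWide; x++) {
            for (int y = 0; y < tilesHigh; y++) {
                g.setColor(terrain[x][y]);    // the expensive masking/edging happens here, once
                g.fillRect(x * TILE, y * TILE, TILE, TILE);
            }
        }
        g.dispose();
    }

    /** Per-frame paint: one image copy instead of re-painting every tile. */
    void paint(Graphics2D g) {
        g.drawImage(backdrop, 0, 0, null);
    }
}
```

The tradeoff is exactly the one described below: the backdrop image holds pixels for the whole static map in client RAM, in exchange for a near-trivial per-frame paint.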
The purpose of these improvements is to support larger numbers of on-screen elements (which did cause some slowdown in tests and according to reports from playtesters).
Drawing speed improvements are, however, at the expense of a large amount of client-side RAM consumption. In theory, the game could run in either of two modes: low-memory/high-CPU, or high-memory/low-CPU. It's a textbook performance tradeoff. I'll be excited to hear how it performs for playtesters, just as soon as I can release it.
Today, the last of the Internet's available IPv4 address space has been depleted. Potential Games is pleased to announce that potentialgames.com is now reachable via IPv6, as well as legacy IPv4.
Improving server scalability has been the overarching goal of recent tasking. I have not yet definitively measured performance as client count increases, but I've addressed what I believe to be the major performance impediments. This week, I made two important changes.
First, I've reduced server memory demand further by consolidating two content types used to manage creatures. Creatures outnumber characters by the thousands. On top of that, each creature content object incurred a client-visible distributed avatar data model. This nearly doubled the RAM required for each creature instance. With minor augmentation, the avatar class now handles everything needed for creatures (as well as player characters, which is its other purpose).
Second, I improved protocol broadcast messages. Some protocols used to deliver messages to all connected sessions, rather than selectively to designated clients. This avoided duplicating the backing byte buffer (to save server memory), but was wasteful of bandwidth. Now the low-level network library provides a better mechanism for selective delivery. I always follow the philosophy of coding for correctness first, then efficiency, but this probably should have been improved earlier.
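One way to get selective delivery without per-client payload copies, assuming an NIO-style server, is ByteBuffer.duplicate(), which shares the backing bytes while giving each write its own position and limit. This is a sketch of the idea, not the actual network library code:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import java.util.Collection;

// Illustrative sketch: deliver one encoded message to selected recipients
// without duplicating the payload bytes per client.
class Broadcaster {
    /** Writes the same backing bytes to each selected channel. duplicate()
     *  shares the byte array but gives each write independent position/limit,
     *  so the original buffer is left untouched for reuse. */
    static void sendTo(ByteBuffer message,
                       Collection<? extends WritableByteChannel> recipients) throws IOException {
        for (WritableByteChannel ch : recipients) {
            ByteBuffer view = message.duplicate();   // no payload copy
            while (view.hasRemaining()) {
                ch.write(view);
            }
        }
    }
}
```

The recipient collection is where the selectivity lives: pass only the sessions that should see the message, instead of every connected session.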
The scant testing I perform on my development machine already shows improvement. Once these updates are released to the playtesting server, I'll run a number of tests to measure server performance and observe behavior.
Next on the docket are client performance improvements.
While profiling the runtime memory allocation in my game server, I found that my Coordinate class had far and away the most numerous instances (over 600,000 at one count). Even though each instance is small (a Coordinate holds latitude and longitude integers), the class was also the largest contributor to live memory in the Java Virtual Machine.
The JVM is very good at creating and disposing of small, short-lived objects. I suspect, however, that many of these Coordinate instances are being retained by server subsystems. For example, map logic, entity positions, and fog coverage all retain Coordinate objects. What is more, many of these Coordinate instances are likely to be redundant copies of the same [latitude, longitude] pair, especially because most maps in the world share the same coordinate space.
This is exactly the time to employ the flyweight design pattern. The flyweight design pattern avoids allocating redundant objects by sharing instances. Instead of hundreds of Coordinate[12,12] being instantiated, only one is created. This reduces the amount of overall memory allocation, but at some cost.
One cost is the bookkeeping (data structure and algorithm overhead) needed to store and look up existing objects. A Coordinate object is requested by its latitude and longitude, which I turn into a unique key. The simplest approach is to maintain a Map, but this has the downside of keeping every instance ever created in memory.
To avoid the cost of retaining references to rarely requested Coordinates, I employed a LinkedHashMap with the access-order constructor flag, overriding the removeEldestEntry() method to produce a least-recently-used (LRU) collection. The least recently requested Coordinate objects are dropped from the collection once an arbitrary maximum capacity is exceeded (a capacity I chose by profiling the server at different thresholds).
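A sketch of the mechanism (field names and the capacity constant are illustrative, not the server's actual values):

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class Coordinate {
    final int latitude, longitude;

    private Coordinate(int latitude, int longitude) {
        this.latitude = latitude;
        this.longitude = longitude;
    }

    private static final int MAX_CACHED = 10_000;   // tuned by profiling, per the post

    // accessOrder=true orders iteration least-recently-accessed first, so
    // removeEldestEntry() evicts the LRU entry once capacity is exceeded.
    private static final Map<Long, Coordinate> CACHE =
        new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, Coordinate> eldest) {
                return size() > MAX_CACHED;
            }
        };

    /** Flyweight factory: reuse an existing instance when one is cached. */
    static synchronized Coordinate of(int latitude, int longitude) {
        // Pack the pair into one unique long key.
        long key = ((long) latitude << 32) | (longitude & 0xFFFFFFFFL);
        return CACHE.computeIfAbsent(key, k -> new Coordinate(latitude, longitude));
    }
}
```

With a private constructor, all callers go through Coordinate.of(), so two requests for the same [latitude, longitude] pair return the same instance while it remains in the LRU window.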
Once the flyweight collection is full, it does incur higher baseline memory allocation, even when the server is idle, because the Coordinate instances cannot be garbage collected. As a further improvement (not yet implemented), the map could store weak references to the Coordinate objects, to avoid keeping them in memory when no longer needed.
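That not-yet-implemented variant might look something like the following generic sketch, in which the cache holds WeakReference values and therefore never prevents collection of objects nothing else uses (all names here are hypothetical):

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Flyweight cache whose entries do not keep their values alive: once no
 *  outside reference to a value remains, the GC may reclaim it, and the
 *  next lookup simply rebuilds it via the supplied factory. */
class WeakCache<K, V> {
    private final Map<K, WeakReference<V>> cache = new HashMap<>();

    synchronized V get(K key, Supplier<V> factory) {
        WeakReference<V> ref = cache.get(key);
        V value = (ref == null) ? null : ref.get();   // null if already collected
        if (value == null) {
            value = factory.get();
            cache.put(key, new WeakReference<>(value));
        }
        return value;
    }
}
```

The tradeoff versus the LRU approach is that eviction timing is left to the garbage collector rather than to an explicit capacity, so the baseline memory cost disappears at the price of occasional rebuilds.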
In practice, the LRU flyweight Coordinate mechanism appears to lower peak memory allocation and gives a more stable memory profile (not so much spiking and garbage collecting). Instead of hundreds of thousands of live Coordinate instances, there are at most tens of thousands held as flyweights and some stragglers that were dropped from the LRU collection.
The overall goal was to reduce and stabilize long-term server memory allocation, which appears to be happening.