I was working on basic tree generation last few days which allowed me to discover a sneaky bug in block texturing. While applying textures for blocks, textures of all but top faces were applied correctly. Though as I was mostly using homogeneous textures for testing purposes (rock, sand, grass) it was really not possible to see the problem.

It was the non-homogeneous tree-top face texture that revelead it. So I dig deeper in texture-UV mapping code and it wasn’t really easy to spot the actual bug. I further decided to roll back my changes through git to see
the commit that introduced — to my surprise it was a really really sneaky, old bug which even existed in very first git commit (that I moved the project from bitbucket to github). As I had deleted the bitbucker project and didn’t have earlier history backed up, it wasn’t possible to track it back.

So there was nothing left but to verify all uv-mapping code. Few hours later I spotted it at last; wrong uv-mapping indexes were applied for Y-increasing faces!

There we go! I had just fixed a major sneaky bug that was lurking in git repository for a long while and it really really feels good to fix one!

If you aren’t aware yet, those great guys over Mono project works on creating an OpenGL based implementation of XNA framework which can target MacOS, Linux, Android and IOS (those latter two of course will require MonoDroid and MonoTouch). So their website already mentions the ever first 3d game powered by MonoGame and I couldn’t resist giving it a try to get Voxeliq running over MonoGame! Here goes the story;

It took me a few hours to figure things out. At the very first I tried my chances on wrong branch – MonoGame/develop – which at the time being is the furthest branch. Though I got my answer on irc.gnome.org #monogame, which ‘develop’ did not have support for MojoShader – which allows HLSL shaders to run over GL (it converts them actually).

So I checked out the develop3d branch as mentioned in irc channel, which was the correct branch to go for 3d games. Within my initial try, I got lots of error messages from compiler complaining about missing HalfVector2.cs. So I put in the XNA implementation, fixed a few bits here and there and all the errors were gone! When I hit run button, things started up but I got an exception where MonoGame was complaining about threaded creation of GPU resources were not supported yet (which will be fixed later). Bingo! It was all okay and I was able to run Voxeliq over MonoGame using OpenGL. So Voxeliq is not the first 3d game/engine running over MonoGame but I guess it’s one of the very earliest ones

Oh and I had to disable few bits / components (bloom effect, my music manager implementation and console which uses Digital Rune libraries that was compiled against XNA4) of Voxeliq which may need fixes – though that’s for another story I’ll be updating you guys on the progress.

PS: If you need HalfVector2 fix, you can find my patch here.

So I continue my optimization on engine and recently have worked on vertex builder a bit. Results are again brilliant! Vertex builder is now 5x to 7x faster!

Test Parameters:
* View range: 5 chunks.
* Total chunks in view: 121 chunks.

              Old #1        |           Old #2          |           New Technique
 #1          4497 ms        -          3237 ms          -              728 ms
 #2          4419 ms        -          3066 ms          -              623 ms
 #3          4520 ms        -          3089 ms          -              627 ms
 #4          4576 ms        -          3340 ms          -              679 ms
 #5          4413 ms        -          2832 ms          -              613 ms
 #6          4215 ms        -          3022 ms          -              597 ms
 #7          4451 ms        -          3463 ms          -              631 ms
 #8          4790 ms        -          2989 ms          -              627 ms
 #9          5117 ms        -          3442 ms          -              599 ms
#10          4624 ms        -          3191 ms          -              640 ms
AVG.         4562 ms        |          3167 ms           |             636 ms

* Old #1 = Old vertex building technique with block arrays per chunk.
* Old #2 = Old vertex building technique with single huge block array.
* New Technique = New vertex building technique optimized for single huge block array.
It's ~7 times faster then old #1 and ~5 times faster then old #2 technique.

Even better there are still more rooms for optimization which I’ll be working on them until I get a perfect point. Here’s a speed demonstration video. Note that in video there exists nearly ~960 chunks and ~30 million blocks.

Bonus Screenshot

In one of my previous DevLog entries I had mentioned that I was switching to a single huge block array technique instead of array per chunk which eventually simplifies access to block data. Now I’m done with upgrading my lighting code to take advantage of new block array technique and results are great!

With the latest updates, ligthing is now 3.2 times faster! Wow, that’s a really great optimization which worth every single seconds I’ve worked for it! Here’s test results;

Test Parameters:
* View range: 5 chunks.
* Total chunks in view: 121 chunks.

     Old Lighting Technique - New Lighting Technique
 #1          1316 ms        -          498 ms
 #2          1291 ms        -          384 ms
 #3          1338 ms        -          384 ms
 #4          1293 ms        -          389 ms
 #5          1278 ms        -          387 ms
 #6          1324 ms        -          386 ms
 #7          1284 ms        -          377 ms
 #8          1281 ms        -          380 ms
 #9          1288 ms        -          389 ms
#10          1299 ms        -          381 ms
AVG.         1299 ms        -          395 ms ~ 3.2 faster!

So what changed?

First, the old technique was optimized for old block array per chunk storage. So within this to access a neighboring block, it had to access using chunks – as a neighbor block could be in another chunk. The below propagation functions first need to resolve the chunk that owns the asked block (as block arrays are per chunk) and then need to progress on.

PropagateSunLight(chunk, (byte) (x - 1), y, z, light);
PropagateSunLight(chunk, (byte) (x + 1), y, z, light);
PropagateSunLight(chunk, x, (byte) (y - 1), z, light);
PropagateSunLight(chunk, x, (byte) (y + 1), z, light);
PropagateSunLight(chunk, x, y, (byte) (z - 1), light);
PropagateSunLight(chunk, x, y, (byte) (z + 1), light);

Where as the new lighting technique really doesn’t care about which chunk does the block resides in and can directly access the block data from our single huge block array!

PropagateSunLight(blockIndex + BlockStorage.XStep, propagatedLight); // propagate light to block in east.
PropagateSunLight(blockIndex - BlockStorage.XStep, propagatedLight); // propagate light to block in west.
PropagateSunLight(blockIndex + BlockStorage.ZStep, propagatedLight); // propagate light to block in north.
PropagateSunLight(blockIndex - BlockStorage.ZStep, propagatedLight); // propagate light to block in south.
// DO NOT repropagete to upper block which we don't need to do so and may cause loops!
PropagateSunLight(blockIndex - 1, propagatedLight);   // propagate light to block down.

So What’s Next?
I’ll be also optimizing terrain generation & vertex-building code similarly.

Bonus Screenshot
Nothing new & fancy

So, in this kinda blog page, I’ll be also trying to comment on language & framework other than talking about my stuff. With the latest huge array technique with flatten array support, I do now use block pointer functions and here’s one;

        /// <summary>
        /// Returns block index by relative position of block in chunk.
        /// </summary>
        /// <param name="chunk">The chunk block belongs to.</param>
        /// <param name="x">Blocks relative x position in chunk.</param>
        /// <param name="y">Blocks y position in chunk.</param>
        /// <param name="z">Blocks relative x position in chunk.</param>
        /// <returns></returns>
        public static int BlockIndexByRelativePosition(Chunk chunk, byte x, byte y, byte z)
            var xIndex = chunk.WorldPosition.X + x;
            var zIndex = chunk.WorldPosition.Z + z;

            var wrapX = xIndex % CacheWidthInBlocks;
            var wrapZ = zIndex % CacheLenghtInBlocks;

            var flattenIndex = wrapX * FlattenOffset + wrapZ * Chunk.HeightInBlocks + y;
            return flattenIndex;

All code in the engine that access block data has to use one of these functions and it’s really not a good idea to repeat the code all over the source (given that a possible future update on functions will require lots of time and will make it harder to maintain and hunt for bugs). Although today’s modern compilers are quite intelligent, still I was eager for some forced inlining functionality given that those functions get calls millions of times. Statement lambdas are a possibility but they’re technically not what I’m looking for;

Conventional way:
L_0000: nop 
L_0001: ldarg.0 
L_0002: ldarg.1 
L_0003: ldarg.2 
L_0004: ldarg.3 
L_0005: call int32 VolumetricStudios.VoxeliqGame.Chunks.BlockCache::BlockIndexByRelativePosition(class VolumetricStudios.VoxeliqGame.Chunks.Chunk,
 uint8, uint8, uint8)

Statement Lambdas:
L_0000: nop 
L_0001: ldsfld class [mscorlib]System.Func5<class VolumetricStudios.VoxeliqGame.Chunks.Chunk, uint8, uint8, uint8,
 int32> VolumetricStudios.VoxeliqGame.Chunks.BlockCache::BlockIndexByRelativePosition3
L_0006: ldarg.0 
L_0007: ldarg.1 
L_0008: ldarg.2 
L_0009: ldarg.3 
L_000a: callvirt instance !4 [mscorlib]System.Func`5<class VolumetricStudios.VoxeliqGame.Chunks.Chunk, uint8, uint8, uint8, int32>::Invoke(!0, !1, !2, !3)

So at last .Net 4.5 will be coming with a new Method Implementation Option called AggressiveInlining. MSDN has the following explanation;

  • AggressiveInlining: The method should be inlined if possible. It’ll be nice to force regular functions to get inlined with 4.5.

Update: You can find a nice reading on it over here.

So initial re-factoring is done. Voxeliq engine now uses a single huge array of blocks instead of block arrays per chunk. I can say initially that this improved the performance to some extend though there’re still pieces of code that’s optimized for old technique (especially lighting one). As I cover them all I guess we’ll see more performance improvements over time.

Here is the initial tests;

  • View range: 5 chunks.
  • Total chunks in view: 121 chunks.
  Block Array/Chunk          Single Huge Block Array
  Gen  Light  Build   -  -        Gen  Light  Build
 #1  906  1545  4497  -  -   #1  1287  1558  3237
 #2  933  1524  4419  -  -   #2  1283  1520  3066
 #3  933  1567  4520  -  -   #3  1319  1448  3089
 #4  959  1593  4576  -  -   #4  1420  1803  3340
 #5  912  1538  4413  -  -   #5  1256  1470  2832
 #6  903  1512  4215  -  -   #6  1405  1534  3022
 #7  897  1552  4451  -  -   #7  1386  1517  3463
 #8  912  1573  4790  -  -   #8  1362  1524  2989
 #9  935  1580  5117  -  -   #9  1367  1455  3442
#10  920  1606  4624  -  -  #10  1403  1688  3391
ST:  9210  15590  45622      ST:  13488  15517  31871
GT:      70422               GT:      60876

* All values are in msec.
  •  Clearly as you can see, vertex building took advantage of new technique a lot.
  • Lighting code performs nearly the same though it’s not optimized for new technique yet.
  • Terrain generation got slowed a bit though didn’t have time to look for it yet.

I’ll be providing more in-detail info as I progress through.

Bonus Content
A screenshot from ingame-chunk debugger

So basically these days I’m optimizing the engine aiming the best performance available. Recently I’ve seen a great idea by Slaihne over his game BlokWorld’s forums. He basically suggests using a single huge array for blocks and wrapping the array. So I decided to give it a try – and although I’m not done with re-factoring completely, it seems to work great!

Basically until now Voxeliq was using a double-indexed dictionary to cache chunks within player’s region and then storing a single-dimension block array per each chunk. This works to some extend though there are a few problems. First current technique’s speed is not that bad as you can see from my previous videos, though Slaihne’s one seems to be faster. I’ll be explaining below in details;

  • Memory-wise; Voxeliq’s current technique loads new chunks / removes them as player moves – which basically allocs/deallocs memory continuously – given that .NET GC’s in-deterministic nature this is not really good. On the other side slaihne’s method always uses a pre-determined amount of memory for block/chunk caches. Even more hundreds of chunk instances is another memory sink in current method (it’s already known that in .net object instances have quite noticeable overhead).
  • Speed-wise; Especially in the case of lighting the current method needs each chunk to have pointers to neighboring chunks and have extensive checks. The new method completely simplifies the stuff.
  • Recaching; The current technique extensively re-caches chunks as player moves around (allocs/deallocs chunks). Within the new method yet again this will be really simplified a lot thanks to array wrapping.

So slaihne mentions he uses array wrapping but in one point he also mentions about his array being a single dimensional one. This was already a technique I was using in chunk’s block arrays, where I was flattening a 3 dimensional array to a single dimension one (as single dimensional arrays are lot faster in .net compared to multi-dimension ones).

I basically implemented a wrapping array with additional flattening support;

        public Block this[int x, int y, int z]
                var wrapX = x%CacheWidthInBlocks;
                var wrapZ = z%CacheLenghtInBlocks;
                var flattenIndex = wrapX * FlattenOffset + wrapZ * Chunk.HeightInBlocks + y;

                return this.Blocks[flattenIndex];
                var wrapX = x % CacheWidthInBlocks;
                var wrapZ = z % CacheLenghtInBlocks;
                var flattenIndex = wrapX * FlattenOffset + wrapZ * Chunk.HeightInBlocks + y;

                this.Blocks[flattenIndex] = value;

So initially it seemed all good but I’ve to re-factor more parts to let the engine take advantage of this completely. I’ll be posting another update once I’m done with a result video!

Array Wrapping

Oh and this shows how array wrapping works (kudos goes to Slaihne for the mockup!);

Flatten Arrays

For the interested ones here’s array tests for multidimensional, jagged and flattened arrays;

Test Environment: 1 physical cpus, 2 cores, 2 logical cpus.

Array size: 256*256*256

Itr.    Multi.  Jagged  Flatten (Sequental)

#1      00.187s 00.116s 00.093s
#2      00.186s 00.112s 00.095s
#3      00.189s 00.112s 00.094s
#4      00.187s 00.115s 00.098s
#5      00.187s 00.113s 00.094s
#6      00.186s 00.115s 00.094s
#7      00.188s 00.117s 00.094s
#8      00.187s 00.112s 00.094s
#9      00.187s 00.114s 00.095s
#10     00.191s 00.117s 00.097s
~Avg    00.188s 00.115s 00.095s

Itr.    Multi.  Jagged  Flatten (Random)

#1      00.238s 00.158s 00.126s
#2      00.226s 00.160s 00.123s
#3      00.226s 00.155s 00.122s
#4      00.225s 00.159s 00.123s
#5      00.225s 00.168s 00.136s
#6      00.233s 00.164s 00.149s
#7      00.237s 00.187s 00.126s
#8      00.237s 00.163s 00.128s
#9      00.239s 00.158s 00.125s
#10     00.227s 00.156s 00.125s
~Avg    00.232s 00.163s 00.129s

As you can see flatten arrays in .net 4.0 is 2x times faster then conventional multi-dimensional arrays.You can find my test code over here; https://github.com/raistlinthewiz/dotnet-array-perf-tests