Low Level Platform Optimisation

Overview

Durning my final year the modules shifted from teaching concepts and implementation to purely theory taking what I had learnt over the past two years and adapting them to make practical solutions to hypothetical problems. This project demonstrates this perfectly, Low Level involved taking a program that would instantiate a number of objects with simple collisions and using low level techniques to not only improve the speed of the program but understanding how the keywords like new and delete work bypassing them and creating a way for memory to be human readable.

This project shows Memory Trackers, overriding of global and class new and delete, Custom memory pool with three different sizes of slots starting at 50 bytes up to 300 bytes, octree partitioning, thread pool to handle the octree collisions.

Grade: First

Github Repo

Image of Memory tracker printing the total memory used to the console

Memory Tracking

Before creating complicated memory, storage or attempting to speed up the program it was important to understand how the memory was currently allocated, if the program developed a memory leak, they can be difficult to pinpoint without the use of memory tools. This led to the creation of a memory tracker for the objects in the scene, as this scene was simple three trackers were created one each for spheres and cubes and a third for everything not filtered into the object trackers. The image shows the very simple output whenever the ‘m’ key was pressed, simply by taking the byte size of the object created and increasing the value held in the tracker.

Walking the Heap

Understanding where the memory is being used is all well and good but sometimes its important for the entire memory to be looked at to check its integrity, to do this the heap has to be walked. To do this a header and footer were created whenever a new object was created and added onto the memory request, for instance is a cube was 50 bytes, and the header and footer totalled 50 bytes then 100 bytes were requested and the data stored in the header.
The data included the start of the entire memory block where the cube started and where the footer started as the draw function still needed to know where the cube started but when the data was freed it need the start of the header not the cube or the program would crash, this calculated used pointer maths to move the pointer forward or backwards depending on the location required. This also meant that during a walking of the heap the footer could be jumped to and checked against the value it should be stored in the header and confirmed correct or not. Then using a linked list the next node is checked and so on until there is no longer a next pointer.

Memory pool

Games, and many software packages, don’t request memory during the life of the program, one because its slow but also because there is no safeguard against over allocating memory, memory leaks are undetected can wreak havoc upon a user’s computer if the new key word is used as the computer will keep giving up memory slots until the program, or worse computer, crashes. To combat this I created a memory pool that requests all the memory I need at the start of the program and using pointer maths I split the block into different slots ‘parking spaces’ of varying sizes, this changed the request of memory to not just calling malloc but asking the memory pool if a slot is available, is one is available of the correct size then the object is placed inside, if there is no slot the object is not created.

Image of testing base program collision time

Octree

To speed up the program it was important to know where the most change can be achieved, there is not much point sinking time into optimising something that won’t give the returns another area might have, to ensure I was optimising the correct area some testing was done and the image shows the massive slow down when more object are added to the collision space, it was clear to see that optimising the collisions were the place to start. This led to the researching of special portioning, splitting the game space into segments and only calculating the potential collisions in that space, this greatly reduced the amount of checks the program need to do on a given object, as there is no point checking an object against an object on the other side of the screen as there is no chance of collision. However, I didn’t want to stop there as during my research it was stated that an Octree is better at splitting a 3D space than a simple spatial partition, and if reducing the space by 4 then taking it further can only help, well during testing that wasn’t always the case.

CPU break down of time spent in diffent functions

Thread Pool

By creating an Octree, it was clear how multi-threading the program would be beneficial as the image shows the amount of time spent in the testing collisions is still significant and a change can improve this, and as each object would be assigned to a partition that could not interact with objects in other partitions would prevent race conditions in the collisions. However, its not as simple as taking a thread for each segment and firing it off as for an octree of depth 3 means 512 threads required to check all the collisions. After some research I came across the thread pool where rather than creating a thread and firing it off with a given task the threads are created at the start of the program and when a task is added to the queue the next available thread will take it while the others wait for the next task.

Click here to see code Gist

Testing

To ensure what I did made a difference in the performance of the program extensive testing was conducted, broken down into the graph shown it was interesting that the solution presented was not as cut and dry as originally thought. Take spatial partitioning with 10000 objects the time taken to compute the collisions drastically lowers with the depth of 5 vs 3 but with 100, and 1000 objects a depth of 5 was slower. A very similar thing happened with the introduction of the thread pool where a different number of threads might perform better than others with 10000 objects but by the same token slower with less objects. In my opinion there is two major conclusions to my testing the first that the types of optimisations taken will depend on the outcome desired, a simpler game might not need the speed of lots of objects the same way that an open world might do. And second despite the slow downs every single optimisation improves the speed of the program over the default case, so are worth it anyway.