I’ve decided to take a short break from working on renderer issues, since pretty much every game that doesn’t hit some sort of bug lurking in machine/n64.c or some sort of MIPS CPU bug has largely correct graphics. The few games that do run into one of those bugs also have largely correct graphics. Barring a few exceptional cases, these games would be playable if not for the aforementioned bugs and/or performance.
Since I am not quite familiar enough with the N64’s non-graphical functions to be comfortable bug-hunting in those realms, for now I’m going to concentrate on performance.
Using MAME’s built-in profiler to determine CPU load distributions across the main CPU, RSP, and everything else (mainly the RDP), I can break the games down into four categories:
- Untestably broken: Games that don’t show a single thing in MESS before running off into the weeds. These games include Indiana Jones, Battle for Naboo, Conker’s Bad Fur Day, Banjo-Kazooie, Banjo-Tooie, Donkey Kong 64, Mario Party 3, Paper Mario, Perfect Dark, Goldeneye, Yoshi’s Story, Gauntlet Legends, Turok - Rage Wars, and I’m sure plenty of others.
- 2D Games: These games largely only use the RSP for audio processing, and limit their use of the RDP to things like Textured Rectangle commands. As a result, performance data indicates the RDP as being the main bottleneck for them. These games include Bust-A-Move 2: Arcade Edition and Bust-A-Move ‘99.
- 3D Games: These games use the RSP to do a whole bunch of vector calculations, and use the RDP as much as they want. These are the majority of games, and include Super Mario 64, Mario Kart 64, Army Men: Sarge’s Heroes, Tetrisphere, The Legend of Zelda: Ocarina of Time, Kirby 64: The Crystal Shards, Madden 64, and Aidyn Chronicles: The First Mage.
- Namco Museum 64: This game is Namco Museum 64. It does not use the RSP at all and does not use the RDP at all. It shoves PCM data out the stereo DAC by way of the main CPU, and it uses the N64’s entire video system for nothing other than a framebuffer. As a result, it runs at around 160% when unthrottled, compared with 10% unthrottled for most 3D games and 25% unthrottled for most 2D games. It is the only game of its kind that I know of.
In order to more accurately nail down the performance of 3D games, I’ve run a profile on three games: Castlevania, Tom & Jerry: Fists of Furry, and Super Mario 64. Unsurprisingly, due to the very small number of different microcodes that were ever used on the N64, the code profiles look largely the same. The percentages listed are the percentage of execution time spent in each function, not including children.
- Castlevania: RDP = 41.14%, RSP = 53.23%, Other = 5.63%
  - 12.04%: fill_span_buffer_2x2
  - 11.04%: FETCH_TEXEL
  - 8.05%: render_spans_16
  - 5.13%: read_dword_generic
  - 4.99%: handle_vmadn
  - 4.59%: cpu_execute_rsp
  - 3.60%: COLOR_COMBINER
  - 3.36%: write_dword_generic
  - 3.32%: BLENDER2_16
  - 3.11%: SATURATE_ACCUM
  - 3.08%: handle_vmadh
  - 2.01%: handle_vmadm
  - 1.91%: handle_vmulf
  - 1.56%: __divdi3
  - 1.56%: memory_decrypted_read_dword
  - 1.52%: handle_ldv
  - 1.39%: handle_vmudn
  - 1.25%: handle_vmudl
  - 1.23%: handle_vadd
  - 1.18%: handle_lqv
  - 1.05%: handle_vmacu
  - 1.02%: memory_read_byte_32be
  - 0.99%: handle_vector_ops
  - 0.96%: READ8
  - 0.93%: taddr_clamp
  - 0.91%: memory_write_byte_32be
  - 0.87%: handle_vge
  - 0.82%: handle_vmrg
  - 0.80%: WRITE8
  - 0.70%: handle_vmacf
  - 0.66%: handle_vsub
  - 0.62%: handle_sqv
  - 0.62%: debugger_instruction_hook
  - 0.62%: handle_lpv
  - 0.60%: handle_vmudm
  - 0.57%: handle_vmadl
  - 0.53%: calculate_coverage
  - 0.52%: handle_sdv
  - 0.50%: handle_vmudh
  - 0.46%: decompress_z
  - 0.45%: fill_rectangle_16bit
  - 0.43%: handle_luv
  - 0.41%: handle_vcl
  - 0.39%: handle_vmulu
  - 0.38%: handle_lwc2
  - 0.38%: handle_vrcph
  - 0.37%: video_update_n64
  - 0.35%: handle_vand
  - 0.34%: handle_vxnor
  - 0.33%: sp_dma
  - 0.32%: handle_vch
  - 0.31%: handle_vrcpl
  - 0.28%: handle_swc2
  - 0.26%: handle_vlt
  - 0.26%: handle_llv
  - 0.22%: handle_vsaw
  - 0.19%: handle_vor
  - 0.19%: fill_rectangle_32bit
  - 0.16%: rdp_load_block
- Tom & Jerry: Fists of Furry: RDP = 29.15%, RSP = 64.42%, Other = 6.43%
  - 7.41%: read_dword_generic
  - 7.22%: cpu_execute_rsp
  - 5.52%: handle_vmadn
  - 4.79%: texture_rectangle_16bit
  - 4.38%: write_dword_generic
  - 4.29%: fill_span_buffer_2x2
  - 3.54%: BLENDER1_16
  - 3.38%: FETCH_TEXEL
  - 3.27%: SATURATE_ACCUM
  - 3.04%: handle_vmadh
  - 2.75%: handle_vmadm
  - 2.66%: handle_lqv
  - 2.57%: COLOR_COMBINER
  - 2.40%: memory_decrypted_read_dword
  - 2.30%: handle_vmulf
  - 2.25%: handle_ldv
  - 1.87%: fill_rectangle_16bit
  - 1.79%: handle_vmudl
  - 1.79%: render_spans_16
  - 1.48%: READ8
  - 1.45%: handle_vadd
  - 1.40%: handle_vmudn
  - 1.38%: video_update_n64
  - 1.24%: memory_read_byte_32be
  - 1.19%: memory_write_byte_32be
  - 1.10%: debugger_instruction_hook
  - 0.96%: handle_vector_ops
  - 0.94%: WRITE8
  - 0.92%: handle_vsub
  - 0.88%: handle_vmacf
  - 0.75%: handle_sqv
  - 0.72%: handle_vmudm
  - 0.71%: handle_vsubc
  - 0.70%: handle_sdv
  - 0.68%: calculate_coverage
  - 0.66%: handle_vge
  - 0.65%: sp_dma
  - 0.60%: rdp_load_tile
  - 0.53%: __divdi3
  - 0.52%: mame_rand
  - 0.52%: copyline_rgb32
  - 0.52%: handle_vmudh
  - 0.51%: rand_memory
  - 0.49%: handle_vcl
  - 0.49%: driver_get_name
  - 0.48%: compress_z
  - 0.47%: handle_lwc2
  - 0.47%: handle_vmrg
  - 0.44%: handle_vrcpl
  - 0.37%: handle_vrcph
  - 0.35%: handle_vlt
  - 0.33%: taddr_clamp
  - 0.33%: handle_luv
  - 0.30%: handle_llv
  - 0.30%: region_post_process
  - 0.28%: handle_swc2
  - 0.28%: fill_random
  - 0.27%: handle_vsaw
  - 0.26%: handle_lsv
  - 0.24%: handle_vch
  - 0.23%: handle_vabs
  - 0.22%: handle_ssv
  - 0.19%: handle_vxor
- Super Mario 64: RDP = 27.33%, RSP = 61.21%, Other = 11.46%
  - 10.73%: fill_span_buffer_2x2
  - 6.56%: handle_vmadn
  - 6.16%: cpu_execute_rsp
  - 5.56%: read_dword_generic
  - 4.63%: render_spans_16
  - 3.61%: SATURATE_ACCUM
  - 3.38%: write_dword_generic
  - 3.20%: handle_vmadm
  - 3.19%: FETCH_TEXEL
  - 2.99%: handle_vmadh
  - 2.74%: BLENDER1_16
  - 2.27%: COLOR_COMBINER
  - 2.10%: memory_decrypted_read_dword
  - 1.97%: handle_vmudl
  - 1.88%: handle_ldv
  - 1.72%: handle_vadd
  - 1.66%: handle_vmudn
  - 1.51%: handle_vmulf
  - 1.26%: handle_vector_ops
  - 1.23%: handle_lqv
  - 1.13%: handle_vsub
  - 1.11%: handle_vge
  - 1.07%: debugger_instruction_hook
  - 1.04%: __divdi3
  - 1.00%: memory_write_byte_32be
  - 0.98%: handle_vsubc
  - 0.98%: READ8
  - 0.92%: memory_read_byte_32be
  - 0.82%: calculate_coverage
  - 0.78%: handle_sdv
  - 0.76%: WRITE8
  - 0.75%: mame_rand
  - 0.74%: handle_vmudm
  - 0.74%: driver_get_name
  - 0.72%: compress_z
  - 0.70%: video_update_n64
  - 0.67%: fill_rectangle_16bit
  - 0.65%: handle_vrcph
  - 0.65%: rand_memory
  - 0.60%: sp_dma
  - 0.59%: handle_vlt
  - 0.54%: decompress_z
  - 0.54%: handle_vmudh
  - 0.49%: handle_vrcpl
  - 0.45%: handle_lwc2
  - 0.43%: region_post_process
  - 0.42%: handle_vmacf
  - 0.40%: handle_vcl
  - 0.39%: handle_sqv
  - 0.38%: handle_vch
  - 0.38%: copyline_rgb32
  - 0.37%: handle_llv
  - 0.37%: handle_vxor
  - 0.36%: handle_vsaw
  - 0.36%: handle_vmrg
  - 0.35%: quark_tables_create
  - 0.35%: fill_random
  - 0.33%: handle_luv
  - 0.32%: taddr_clamp
  - 0.30%: handle_swc2
  - 0.28%: handle_ssv
  - 0.27%: handle_vmadl
  - 0.27%: handle_lsv
  - 0.27%: handle_lpv
  - 0.27%: handle_vaddc
  - 0.26%: handle_vor
As I see it, the first priority is to convert the RSP core over to use MAME’s DRC system. Unfortunately, I’m not quite sure how much of a performance increase DRC-ifying the RSP will bring. The VMAC* and VMUD* opcodes have a rather large amount of code associated with them, and on top of that, each one loops 8 times, once per vector element. The real RSP probably processed all 8 elements in parallel.
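To make the per-element cost concrete, here is a minimal sketch of what an interpreted vector multiply-accumulate looks like. It is deliberately generic: `vec_reg`, `vec_accum`, `vec_mac`, and `saturate_s16` are hypothetical names, the accumulator is simply held in 64 bits, and the arithmetic does not reproduce the exact semantics of any particular RSP opcode or of MAME's handlers.

```c
#include <assert.h>
#include <stdint.h>

#define RSP_LANES 8  /* the opcodes discussed above work on 8 elements */

/* Hypothetical, simplified vector state: eight 16-bit lanes per register
 * and one wide accumulator per lane. */
typedef struct {
    int16_t lane[RSP_LANES];
} vec_reg;

typedef struct {
    int64_t acc[RSP_LANES];
} vec_accum;

/* Clamp a wide value into the signed 16-bit range. */
static int16_t saturate_s16(int64_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Generic multiply-accumulate: for each lane, acc += vs * vt, then the
 * saturated accumulator is written to the destination register.
 * The hardware would do all lanes at once; an interpreter walks them. */
static void vec_mac(vec_reg *vd, vec_accum *a,
                    const vec_reg *vs, const vec_reg *vt)
{
    assert(vd && a && vs && vt);
    for (int i = 0; i < RSP_LANES; i++) {
        a->acc[i] += (int64_t)vs->lane[i] * (int64_t)vt->lane[i];
        vd->lane[i] = saturate_s16(a->acc[i]);
    }
}
```

Even in this stripped-down form, every emulated instruction costs eight multiplies, eight adds, and eight clamps, which is why a recompiler alone may not shrink these handlers by much.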
Another piece of low-hanging fruit is the fact that around 10% of the execution time is taken up by memory accessors thanks to the RSP’s less-than-optimal IMEM and DMEM implementation. The RSP has to hit the memory system for every single read and write that it does. However, in reality IMEM and DMEM are accessed far, far less often by the main CPU than they are by the RSP itself. It therefore makes more sense, performance-wise, to keep two 4 KB arrays inside the RSP core itself, which it will access directly rather than going through MAME’s core memory accessors. The main CPU will be able to access these memory spaces by querying the RSP core, and any RSP DMA accesses can be done by simply grabbing a pointer into the RSP’s IMEM or DMEM arrays, just like it works now.
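A rough sketch of that layout, assuming nothing about how the core is actually structured: the struct and function names below (`rsp_local_mem`, `rsp_dmem_read32`, `rsp_mem_ptr`, `rsp_dma_in`) are made up for illustration. The only hardware facts relied on are that each memory is 4 KB, that the N64 is big-endian, and that IMEM sits 0x1000 above DMEM in the address map.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RSP_MEM_SIZE 0x1000u               /* 4 KB each for DMEM and IMEM */
#define RSP_MEM_MASK (RSP_MEM_SIZE - 1u)

/* Hypothetical core-local storage, owned by the RSP core itself. */
typedef struct {
    uint8_t dmem[RSP_MEM_SIZE];
    uint8_t imem[RSP_MEM_SIZE];
} rsp_local_mem;

/* Fast path for the RSP: big-endian 32-bit DMEM read, wrapping at 4 KB. */
static uint32_t rsp_dmem_read32(const rsp_local_mem *m, uint32_t addr)
{
    uint32_t v = 0;
    for (uint32_t i = 0; i < 4; i++)
        v = (v << 8) | m->dmem[(addr + i) & RSP_MEM_MASK];
    return v;
}

/* Fast path for the RSP: big-endian 32-bit DMEM write, wrapping at 4 KB. */
static void rsp_dmem_write32(rsp_local_mem *m, uint32_t addr, uint32_t data)
{
    for (uint32_t i = 0; i < 4; i++)
        m->dmem[(addr + i) & RSP_MEM_MASK] = (uint8_t)(data >> (24 - 8 * i));
}

/* Slow path for everyone else (main CPU, DMA): offsets 0x0000-0x0FFF map
 * to DMEM and 0x1000-0x1FFF map to IMEM. */
static uint8_t *rsp_mem_ptr(rsp_local_mem *m, uint32_t offset)
{
    uint8_t *base = (offset & RSP_MEM_SIZE) ? m->imem : m->dmem;
    return &base[offset & RSP_MEM_MASK];
}

/* DMA into RSP memory by grabbing a pointer; must not cross the 4 KB end. */
static void rsp_dma_in(rsp_local_mem *m, uint32_t offset,
                       const uint8_t *src, uint32_t length)
{
    assert((offset & RSP_MEM_MASK) + length <= RSP_MEM_SIZE);
    memcpy(rsp_mem_ptr(m, offset), src, length);
}
```

The point of the split is that the hot accessors compile down to a mask and an array index, while the rarely used main-CPU path pays the cost of the extra lookup.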
Lastly, the plan is to wire the RDP emulation up to MAME’s “work unit” system, which will allow it to distribute drawing commands across multiple CPU cores when available. Unfortunately, with the RDP being as slow as it is, this will likely not make much of a difference on my laptop, but it might help on a quad-core CPU.
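One way this could be carved up, sketched under heavy assumptions: split each drawing command into bands of scanlines so that every work item owns a disjoint slice of the framebuffer. The types and functions here (`fill_work_item`, `fill_band`, `split_fill`) are hypothetical and are not MAME's work-unit API; the sketch only shows the partitioning, and a real implementation would hand each item to a worker thread instead of running them in a loop.

```c
#include <assert.h>
#include <stdint.h>

#define FB_WIDTH  320
#define FB_HEIGHT 240

/* Hypothetical work item: one fill command restricted to a band of rows. */
typedef struct {
    uint16_t *fb;      /* shared framebuffer, FB_WIDTH * FB_HEIGHT pixels */
    int x0, x1;        /* half-open horizontal range */
    int y0, y1;        /* half-open band of scanlines owned by this item */
    uint16_t color;
} fill_work_item;

/* The body a worker would run; it only touches its own scanlines, so
 * items from the same command never write the same pixel. */
static void fill_band(const fill_work_item *item)
{
    for (int y = item->y0; y < item->y1; y++)
        for (int x = item->x0; x < item->x1; x++)
            item->fb[y * FB_WIDTH + x] = item->color;
}

/* Split the rectangle [x0,x1) x [y0,y1) into at most `count` bands.
 * Returns the number of non-empty work items written to `items`. */
static int split_fill(uint16_t *fb, int x0, int y0, int x1, int y1,
                      uint16_t color, fill_work_item *items, int count)
{
    assert(count > 0);
    assert(0 <= x0 && x0 <= x1 && x1 <= FB_WIDTH);
    assert(0 <= y0 && y0 <= y1 && y1 <= FB_HEIGHT);

    int rows = y1 - y0;
    int n = 0;
    for (int i = 0; i < count; i++) {
        int start = y0 + (rows * i) / count;
        int end   = y0 + (rows * (i + 1)) / count;
        if (start == end)
            continue;  /* more workers than rows: skip empty bands */
        items[n++] = (fill_work_item){ fb, x0, x1, start, end, color };
    }
    return n;
}
```

Splitting by scanline rather than by command keeps the per-pixel ordering of blending and Z updates intact, since each pixel is only ever touched by one worker.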
Anyway, that’s the main plan. Here’s hoping I can stick to it.