The good news for everybody except me is that I lost my originally typed answer; I appreciate I have some problems with politeness at the best of times, but please forgive me if this retelling is unusually brusque.
The difference in approaches is between:
"Hey, VIAs and VIC, what's the earliest that one of you might autonomously change a signal line?"
"Uhh... 2000 cycles"
"Okay, I've just run for 2000 cycles; what's up?"
And:
"Hey, VIAs and VIC, I just ran for a cycle! Is anything happening now?"
"No"
"Hey, VIAs and VIC, I just ran for a cycle! What about now? Is anything happening now?"
"No"
"Hey, VIAs and VIC, I just ran for a cycle! What about now? Is anything happening now?"
"No"
... ad infinitum
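The first exchange, sketched in C++ (all the names here are mine, and the chips are stubs; this is just the shape of the query-and-run loop, not an implementation):

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical chip interface: each chip can report how many cycles
// until it would next autonomously change a signal line, and can be
// run for a whole batch of cycles in one call.
struct Chip {
    int64_t cycles_until_event = 2000;
    void run_for(int64_t cycles) { cycles_until_event -= cycles; }
};

// Query-and-run: ask everyone up front for the earliest event, then
// run the whole window in one step. Nothing observable can happen in
// between, by construction.
inline void run_window(Chip &via1, Chip &via2, Chip &vic, int64_t &clock) {
    const int64_t window = std::min({via1.cycles_until_event,
                                     via2.cycles_until_event,
                                     vic.cycles_until_event});
    via1.run_for(window);
    via2.run_for(window);
    vic.run_for(window);
    clock += window;
}
```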
Or — in the context of the assembly-level emulator that's the subject of this thread — between writing out your entire 6502 emulation in switch-style form and inserting at the end of every complete cycle:
inc cycle
ld r, [gfxAddresses, cycle]
st r, [gfxValues, cycle]
dec cyclesUntilNextEvent
jr.z checkForEvents
Versus having to insert something more like (assuming macros this time, let's not be silly):
via1Update()
via2Update()
vicUpdate()
Each of which is likely, at the very least, to thrash a few more registers. Otherwise, register reads and writes are of the form "It's now time Q; [what's the value of register R / please put N into register R]".
I believe it is less processing because (i) it is cheaper to check a single timer as a broad-phase test for whether the code for the VIAs and VIC needs to be entered than it is to run the direct tests; and (ii) it is cheaper to emulate N cycles at once in a VIC or VIA than it is to emulate 1 cycle at a time, N times over.
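To make point (ii) concrete, here's a toy down-counting timer of my own invention (deliberately simplified; not a faithful 6522 timer) done both ways. The batched version replaces N decrement-and-test iterations with a little arithmetic, yet leaves identical observable state:

```cpp
#include <cstdint>

// Cycle-at-a-time: n iterations, n decrements, n underflow tests.
struct SteppedTimer {
    int64_t counter = 0xffff;
    int underflows = 0;
    void run_for(int64_t n) {
        while (n--) {
            if (counter == 0) { ++underflows; counter = 0xffff; }
            else --counter;
        }
    }
};

// All n cycles at once: subtract, then account for any underflows.
struct BatchedTimer {
    int64_t counter = 0xffff;
    int underflows = 0;
    void run_for(int64_t n) {
        counter -= n;
        while (counter < 0) { ++underflows; counter += 0x10000; }
    }
};
```

Both end up with the same counter and the same number of underflows; the batched one just gets there without touching the chip every cycle.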
But, to be explicit, I am arguing with the proposition that "you'll have to run CPU, VIC and both VIAs (especially their timers) in single-cycle lock-step". I do not agree with that statement. I think that the query-and-run, evaluate registers as a function of time, approach is equally accurate.
I separately think it is usually a performance improvement for the same reason that broad-phase tests are frequently a performance improvement: the cost added on the occasions when the gating test fails to prevent a close inspection is less than the cost saved on the occasions when it succeeds.
However, as stated, it's not something I'd advocate for a normal computer-based emulator now, because all one does for one of those is: (i) make sure you can model what each chip would do each cycle. If you want the version of emulation sketched above, you: (i) make sure you can model what each chip would do each cycle; (ii) write some additional code to reverse the test and say how many cycles remain until the chip does something interesting; and (iii) write some more code to manage lazy updating for CPU-originated communications. More code = more to test = more to maintain = more to read = much more chance of error. Just use the fact that computers are fast nowadays to allow yourself to write the code directly. People are going to be happier if your emulator runs everything and occupies 2N% of your CPU than if it's sort of patchy but occupies only N%.
Mike wrote: In the other thread, I read about your idea to let the GPU decode the emulated composite video signal. *That* is indeed a nice approach - in VICE, the routine for the CRT emulation already hogs the majority of execution time in the emulator...
Yeah, the main problem being my general aptitude, and a lesser one being the costs of CPU/GPU synchronisation.
Mine's C++ for this stuff and I've adopted a form whereby the major chips are templates, which you use and subclass (it's a C++ thing; look up the curiously recurring template pattern) to provide the connection to that machine's bus. So e.g. the template part of the 6502 is the execution unit (in Intel-speak) and the inheritor of it simply gets a call every cycle saying "[read/read operation/write] address [to/from] byte". So that's cool in that there's no baked-in paging model, no rule for how to indicate addresses that are something other than 8-bit RAM, none of the other stuff that tends to make a 6502 emulation not usable for all machines that could possibly exist. But it's a function call (though not a virtual one or one via pointer, per the template pattern) every cycle, that often then incurs a few other function calls. It's a very easy flow to follow as a reader but it's suboptimal.
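A minimal sketch of that shape, with all names invented for illustration (the real thing is considerably more involved, and a real execution unit would sequence fetch/decode/execute rather than take the address as a parameter):

```cpp
#include <cstdint>

enum class BusOperation { Read, ReadOpcode, Write };

// Curiously recurring template pattern: the CPU template calls the
// concrete machine's perform_bus_operation once per cycle, statically
// dispatched — no virtual call, no function pointer.
template <typename Machine>
class CPU6502 {
public:
    void run_one_cycle(uint16_t address, uint8_t &value, BusOperation op) {
        static_cast<Machine *>(this)->perform_bus_operation(op, address, value);
    }
};

// A concrete machine supplies the memory map; no paging model is
// baked into the CPU template itself.
class FlatRAMMachine : public CPU6502<FlatRAMMachine> {
public:
    void perform_bus_operation(BusOperation op, uint16_t address, uint8_t &value) {
        if (op == BusOperation::Write) ram_[address] = value;
        else value = ram_[address];
    }
private:
    uint8_t ram_[65536] = {};
};
```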
That excuse being made, my emulator is currently at least 2.5 times slower than VICE on the same Mac. I just use normal nearest sampling mode, because what I hate is inauthenticity: I tend to find most emulators' post-processing attempt at "a television look"* to be fairly dreadful, and I have tarred VICE with that same brush without any further inspection. But I severely doubt that's saving me anything substantial.
So, costs are substantially greater. But I don't think that's the hardware-utilisation approach. I think that's me.
* in particular, guess what: on all but the most expensive televisions, shadow masks don't show gaps between scan lines. That bothers me so much. I've the photographs to prove it. I've only ever seen separated scan lines as a result of progressive scan in the arcade, on super-expensive Trinitron displays during the actual 1980s and 1990s, and on almost every final-days-of-CRTs computer monitor. A shadow mask already throws away some huge proportion of the light that would be generated by a black-and-white set by constantly blocking the electron guns; why would manufacturers then cut the beam size by half again? Because they think consumers love really dim images?
(EDIT: nowadays I actually tend to use an intermediate approach to the one described way above, which is simply lazy evaluation. It's not implemented yet but, once in place, any write to the built-in RAM or access to a VIC register will cause the VIC to be caught up to the current time prior to the write or access. Otherwise it's dormant. Even that small thing helps with caches and branch prediction, and lets you make a bunch of inner-loop pixel-plotting optimisations, since you'll mostly be drawing huge runs rather than small pieces.)
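A sketch of that catch-up pattern, with invented names and a simple count of "pixels drawn" standing in for real video work:

```cpp
#include <cstdint>

// Lazy evaluation: the VIC-like chip stays dormant and remembers only
// when it was last brought up to date. Any register access catches it
// up to "now" first, so it draws long runs in one go — which is
// friendly to caches and to the inner plotting loop.
class LazyVideoChip {
public:
    void catch_up_to(int64_t now) {
        // Stand-in for rendering everything between last_update_ and now.
        pixels_drawn_ += now - last_update_;
        last_update_ = now;
    }
    uint8_t read_register(int64_t now, int reg) {
        catch_up_to(now);       // must be current before any observation
        return registers_[reg & 15];
    }
    void write_register(int64_t now, int reg, uint8_t value) {
        catch_up_to(now);       // and before any state change
        registers_[reg & 15] = value;
    }
    int64_t pixels_drawn() const { return pixels_drawn_; }

private:
    int64_t last_update_ = 0;
    int64_t pixels_drawn_ = 0;
    uint8_t registers_[16] = {};
};
```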