The good news for everybody except me is that I lost my originally typed answer; I appreciate I have some problems with politeness at the best of times, but please forgive me if this retelling is unusually brusque.
The difference in approaches is between:
"Hey, VIAs and VIC, what's the earliest that one of you might autonomously change a signal line?"
"Uhh... 2000 cycles"
"Okay, I've just run for 2000 cycles; what's up?"
And:
"Hey, VIAs and VIC, I just ran for a cycle! Is anything happening now?"
"No"
"Hey, VIAs and VIC, I just ran for a cycle! What about now? Is anything happening now?"
"No"
"Hey, VIAs and VIC, I just ran for a cycle! What about now? Is anything happening now?"
"No"
... ad infinitum
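The first exchange, sketched in C++ (all the names here are mine, and the chips are stubs; this is just the shape of the query-and-run loop, not an implementation):

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical chip interface: each chip can report how many cycles
// until it would next autonomously change a signal line, and can be
// run for a whole batch of cycles in one call.
struct Chip {
    int64_t cycles_until_event = 2000;
    void run_for(int64_t cycles) { cycles_until_event -= cycles; }
};

// Query-and-run: ask everyone up front for the earliest event, then
// run the whole window in one step. Nothing observable can happen in
// between, by construction.
inline void run_window(Chip &via1, Chip &via2, Chip &vic, int64_t &clock) {
    const int64_t window = std::min({via1.cycles_until_event,
                                     via2.cycles_until_event,
                                     vic.cycles_until_event});
    via1.run_for(window);
    via2.run_for(window);
    vic.run_for(window);
    clock += window;
}
```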
Or — in the context of the assembly-level emulator that's the subject of this thread — between writing out your entire 6502 emulation in switch-style form and inserting at the end of every complete cycle:
inc cycle
ld r, [gfxAddresses, cycle]
st r, [gfxValues, cycle]
dec cyclesUntilNextEvent
jr.z checkForEvents
Versus having to insert something more like (assuming macros this time, let's not be silly):
via1Update()
via2Update()
vicUpdate()
Each of which is likely, at the very least, to thrash a few more registers. Otherwise, register reads and writes are of the form "It's now time Q; [what's the value of register R / please put N into register R]".
I believe it is less processing because (i) it is cheaper to check a single timer as a broad-phase test for whether the code for the VIAs and VIC needs to be entered than it is to run the direct tests; and (ii) it is cheaper to emulate N cycles at once in a VIC or VIA than it is to emulate 1 cycle at a time, N times over.
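To make point (ii) concrete, here's a toy down-counting timer of my own invention (deliberately simplified; not a faithful 6522 timer) done both ways. The batched version replaces N decrement-and-test iterations with a little arithmetic, yet leaves identical observable state:

```cpp
#include <cstdint>

// Cycle-at-a-time: n iterations, n decrements, n underflow tests.
struct SteppedTimer {
    int64_t counter = 0xffff;
    int underflows = 0;
    void run_for(int64_t n) {
        while (n--) {
            if (counter == 0) { ++underflows; counter = 0xffff; }
            else --counter;
        }
    }
};

// All n cycles at once: subtract, then account for any underflows.
struct BatchedTimer {
    int64_t counter = 0xffff;
    int underflows = 0;
    void run_for(int64_t n) {
        counter -= n;
        while (counter < 0) { ++underflows; counter += 0x10000; }
    }
};
```

Both end up with the same counter and the same number of underflows; the batched one just gets there without touching the chip every cycle.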
But, to be explicit, I am arguing with the proposition that "you'll have to run CPU, VIC and both VIAs (especially their timers) in single-cycle lock-step". I do not agree with that statement. I think that the query-and-run, evaluate registers as a function of time, approach is equally accurate.
I separately think it is usually a performance improvement for the same reason that broad-phase tests are frequently a performance improvement: the cost added on the occasions when the gating test fails to prevent a close inspection is less than the cost saved on the occasions when it succeeds.
However, as stated, it's not something I'd advocate for a normal computer-based emulator now, because all one does for one of those is: (i) make sure you can model what each chip would do each cycle. If you want the version of emulation sketched above, you: (i) make sure you can model what each chip would do each cycle; (ii) write some additional code to reverse the test and say how many cycles remain until the chip does something interesting; and (iii) write some more code to manage lazy updating for CPU-originated communications. More code = more to test = more to maintain = more to read = much more chance of error. Just use the fact that computers are fast nowadays to allow yourself to write the code directly. People are going to be happier if your emulator runs everything and occupies 2N% of your CPU than if it's sort of patchy but occupies only N%.
Mike wrote: In the other thread, I read about your idea to let the GPU decode the emulated composite video signal. *That* is indeed a nice approach - in VICE, the routine for the CRT emulation already hogs the majority of execution time in the emulator...
Yeah, the main problem being my general aptitude, and a lesser one being the costs of CPU/GPU synchronisation.
Mine's C++ for this stuff and I've adopted a form whereby the major chips are templates, which you use and subclass (it's a C++ thing; look up the curiously recurring template pattern) to provide the connection to that machine's bus. So e.g. the template part of the 6502 is the execution unit (in Intel-speak) and the inheritor of it simply gets a call every cycle saying "[read/read operation/write] address [to/from] byte". So that's cool in that there's no baked-in paging model, no rule for how to indicate addresses that are something other than 8-bit RAM, none of the other stuff that tends to make a 6502 emulation not usable for all machines that could possibly exist. But it's a function call (though not a virtual one or one via pointer, per the template pattern) every cycle, that often then incurs a few other function calls. It's a very easy flow to follow as a reader but it's suboptimal.
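A minimal sketch of that shape, with all names invented for illustration (the real thing is considerably more involved, and a real execution unit would sequence fetch/decode/execute rather than take the address as a parameter):

```cpp
#include <cstdint>

enum class BusOperation { Read, ReadOpcode, Write };

// Curiously recurring template pattern: the CPU template calls the
// concrete machine's perform_bus_operation once per cycle, statically
// dispatched — no virtual call, no function pointer.
template <typename Machine>
class CPU6502 {
public:
    void run_one_cycle(uint16_t address, uint8_t &value, BusOperation op) {
        static_cast<Machine *>(this)->perform_bus_operation(op, address, value);
    }
};

// A concrete machine supplies the memory map; no paging model is
// baked into the CPU template itself.
class FlatRAMMachine : public CPU6502<FlatRAMMachine> {
public:
    void perform_bus_operation(BusOperation op, uint16_t address, uint8_t &value) {
        if (op == BusOperation::Write) ram_[address] = value;
        else value = ram_[address];
    }
private:
    uint8_t ram_[65536] = {};
};
```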
That excuse being made, my emulator is currently at least 2.5 times slower than VICE on the same Mac. I just use normal nearest sampling mode, because what I hate is inauthenticity: I tend to find most emulators' post-processing attempt at "a television look"* to be fairly dreadful, and I have tarred VICE with that same brush without any further inspection. But I severely doubt that's saving me anything substantial.
So, costs are substantially greater. But I don't think that's the hardware-utilisation approach. I think that's me.
* in particular, guess what: on all but the most expensive televisions, shadow masks don't show gaps between scan lines. That bothers me so much. I've the photographs to prove it. I've only ever seen separated scan lines as a result of progressive scan in the arcade, on super-expensive Trinitron displays during the actual 1980s and 1990s, and on almost every final-days-of-CRTs computer monitor. A shadow mask already throws away some huge proportion of the light that would be generated by a black-and-white set by constantly blocking the electron guns; why would manufacturers then cut the beam size by half again? Because they think consumers love really dim images?
(EDIT: nowadays I actually tend to use an intermediate approach to the one described way above, which is simply lazy evaluation. It's not implemented yet but, once in place, any write to the built-in RAM or access to a VIC register will cause the VIC to be caught up to the current time prior to the write or access. Otherwise it's dormant. Even that small thing helps with caches and branch prediction, and lets you make a bunch of inner-loop pixel-plotting optimisations, since you'll mostly be drawing huge runs rather than small pieces.)
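A sketch of that catch-up pattern, with invented names and a simple count of "pixels drawn" standing in for real video work:

```cpp
#include <cstdint>

// Lazy evaluation: the VIC-like chip stays dormant and remembers only
// when it was last brought up to date. Any register access catches it
// up to "now" first, so it draws long runs in one go — which is
// friendly to caches and to the inner plotting loop.
class LazyVideoChip {
public:
    void catch_up_to(int64_t now) {
        // Stand-in for rendering everything between last_update_ and now.
        pixels_drawn_ += now - last_update_;
        last_update_ = now;
    }
    uint8_t read_register(int64_t now, int reg) {
        catch_up_to(now);       // must be current before any observation
        return registers_[reg & 15];
    }
    void write_register(int64_t now, int reg, uint8_t value) {
        catch_up_to(now);       // and before any state change
        registers_[reg & 15] = value;
    }
    int64_t pixels_drawn() const { return pixels_drawn_; }

private:
    int64_t last_update_ = 0;
    int64_t pixels_drawn_ = 0;
    uint8_t registers_[16] = {};
};
```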