WIP: FAST-40

FD22 · Post by **FD22** » Sat Feb 08, 2014 5:04 pm

Just a heads-up that I'm getting close to an Alpha release of FAST-40, something I've been working on for a couple of months as a derivative side-project of VIC++ and which has its own sub-page here.

I only get to work on this for maybe 3-4 hours a week, so I'm not making any promises on dates for either an Alpha, Beta or final release - but I've hit a milestone today so I've posted a couple of screenies on the blog page.

Comments welcome.

rhurst · Post by **rhurst** » Sat Feb 08, 2014 6:15 pm

Nice blog! Telengard anyone?

Boray · Post by **Boray** » Sun Feb 09, 2014 2:15 am

Cool!

Mike · Post by **Mike** » Sun Feb 09, 2014 5:36 am

Nice!

I had a quick look over your blog about this side-project, some questions:

- You're using a dirty-row scheme, and a 'big' interrupt-routine to redraw invalid rows. The underlying KERNAL API is single character only, though. Don't you think it might be more sensible just to draw the glyph directly into the bitmap? You'd still keep a PETSCII representation in the text buffer, of course.

- Regarding that text buffer, are you planning to make it POKEable, so that POKEs into it are also reflected on the bitmap? That could quite easily be done by installing a BASIC extension, which checks for POKEs into that range, and then carries out the extra functionality.

- How is the address generating text screen arranged? In rows or in columns? The arrangement in columns gives a much simpler address function for the bitmap, and scrolling is also much easier as it can be done in a loop unrolled over all bitmap columns - for both directions, and also over a restricted vertical range. The latter is especially important, when lines are inserted in the editor and an empty line needs to be opened (scrolling everything below it one line down).
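To illustrate the column arrangement from the last point, here is a minimal Python sketch. The geometry (20 byte-wide bitmap columns of 192 lines each, i.e. 24 text rows of 8 lines) and the names `cell_address` and `open_line` are assumptions made for the example, not FAST-40's actual layout or code.

```python
COLUMNS = 20   # byte-wide bitmap columns, two 4-pixel glyphs per byte (assumed)
LINES = 192    # bitmap lines per column: 24 text rows * 8 lines (assumed)
GLYPH_H = 8    # bitmap lines per text row


def cell_address(text_col, text_row):
    """Offset of the first bitmap byte of a text cell in a column-major bitmap."""
    return (text_col // 2) * LINES + text_row * GLYPH_H


def open_line(bitmap, text_row, last_row=23):
    """Scroll text rows text_row..last_row-1 down by one row and blank text_row."""
    top = text_row * GLYPH_H
    bottom = (last_row + 1) * GLYPH_H
    for x in range(COLUMNS):
        base = x * LINES
        # Each column is one contiguous run of bytes, so the move is a plain
        # block copy. The right-hand slice is copied before it is assigned.
        bitmap[base + top + GLYPH_H:base + bottom] = \
            bitmap[base + top:base + bottom - GLYPH_H]
        bitmap[base + top:base + top + GLYPH_H] = bytes(GLYPH_H)
```

Because every column is contiguous, the address function needs one multiply by a constant (or a small table) plus an add, and a scroll over a restricted vertical range is the same block move repeated once per column.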

FD22 · Post by **FD22** » Sun Feb 09, 2014 6:17 am

Mike wrote:Nice!

I had a quick look over your blog about this side-project, some questions:

- You're using a dirty-row scheme, and a 'big' interrupt-routine to redraw invalid rows. The underlying KERNAL API is single character only, though. Don't you think it might be more sensible just to draw the glyph directly into the bitmap? You'd still keep a PETSCII representation in the text buffer, of course.

- Regarding that text buffer, are you planning to make it POKEable, so that POKEs into it are also reflected on the bitmap? That could quite easily be done by installing a BASIC extension, which checks for POKEs into that range, and then carries out the extra functionality.

- How is the address generating text screen arranged? In rows or in columns? The arrangement in columns gives a much simpler address function for the bitmap, and scrolling is also much easier as it can be done in a loop unrolled over all bitmap columns - for both directions, and also over a restricted vertical range. The latter is especially important, when lines are inserted in the editor and an empty line needs to be opened (scrolling everything below it one line down).

Mike, to each of your points:

1. Not entirely sure what you mean here, unless you're suggesting drawing individual glyphs as they change in the text buffer rather than the entire line...? If so, there's a trade-off for that mechanism - you'd either need to have a double-buffered text buffer so changes could be detected on each IRQ, or modify _CHROUT to call the renderer as characters were sent to the screen. I decided early on in VIC++ (and therefore FAST-40) not to use a double-buffered renderer, just because it needed twice the memory and a long scan of the active buffer every IRQ to find changes. Also, in situations where a lot of characters changed (e.g. when the screen scrolls) the refresh performance was pretty awful. Modifying _CHROUT is a possibility, but I much prefer to keep the character-handling logic separated from the screen draw logic - this makes it easier to introduce features that control refresh speed or 'alternate inputs' to the renderer, since they are nothing to do with character I/O.

2. This is the 'alternate input' I allude to - it's entirely possible to hit the bitmap independently, and if one or more lines of the text buffer were to be 'locked out' of the render then you could POKE hi-res graphics straight into it. The text buffer just holds PETSCII codes, so you could also just POKE straight into that without going through the KERNAL. I hadn't thought of a BASIC extension for either of those (only for balancing refresh speed against CPU load) but it's a cool idea.

3. Yep, column-based; using DHCM makes this a virtual necessity anyway, although you can just create a huge lookup table if you want to stay in row-mode. I just have a much smaller 40-byte lookup table that yields the start address of each row to make computation quicker during the render. I'm doing an entire row in just under 6000 cycles, which means I can refresh 2 rows per frame in NTSC or 3 in PAL, and that means the whole 40x24 display can be completely redrawn (roughly) five times a second when the renderer is running at maximum speed.
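The refresh figures in point 3 can be checked with a little arithmetic. This is a rough sketch only: the 6000 cycles per row is taken from the post, while the raster timings (65 cycles x 261 lines for NTSC, 71 cycles x 312 lines for PAL, at roughly 60 and 50 frames per second) are the commonly quoted VIC-20 values and are assumptions here. `refresh_budget` is a hypothetical helper name.

```python
ROW_CYCLES = 6000   # from the post: "just under 6000 cycles" per text row
TEXT_ROWS = 24

# (cycles per raster line, raster lines per frame, approx. frames per second)
# Commonly quoted VIC-20 figures, assumed for this estimate.
TIMINGS = {"NTSC": (65, 261, 60), "PAL": (71, 312, 50)}


def refresh_budget(system):
    """Return (cycles per frame, whole rows per frame, full redraws per second)."""
    cycles_per_line, lines, fps = TIMINGS[system]
    frame_cycles = cycles_per_line * lines
    rows_per_frame = frame_cycles // ROW_CYCLES
    frames_per_screen = -(-TEXT_ROWS // rows_per_frame)  # ceiling division
    return frame_cycles, rows_per_frame, fps / frames_per_screen


for system in TIMINGS:
    print(system, refresh_budget(system))
```

Under those assumptions NTSC fits 2 rows per frame and PAL fits 3, which gives about 5 full redraws per second on NTSC and a little over 6 on PAL.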

darkatx · Post by **darkatx** » Sun Feb 09, 2014 6:29 am

Impressive!

Mike · Post by **Mike** » Sun Feb 09, 2014 6:46 am

FD22 wrote:1. Not entirely sure what you mean here, unless you're suggesting drawing individual glyphs as they change in the text buffer rather than the entire line...? [...] you'd [...] need to [...] modify _CHROUT to call the renderer as characters were sent to the screen. [...] in situations where a lot of characters changed (e.g. when the screen scrolls) the refresh performance was pretty awful. [...]

It's always a good idea to keep the character-handling logic separated from the screen draw logic. But in any case, there's no double buffer involved.

For single character output, you'd:

- put the character into the 'off-screen' text buffer, and
- draw that single glyph into the bitmap.

Job done. For the other actions, like opening a line, or scrolling parts of a line, you similarly can provide the necessary operations on the bitmap that 'mirror' all actions done on the text buffer. That way, all changes are made immediately visible, and don't need to be deferred to an action done in a heavy interrupt routine.

Especially for single character output, consider: there might only be 5 or 6 characters updated during each frame, but your character-line renderer would catch that dirty row, and always redraw 40 characters, doing much more work than necessary. More to that for point 3.
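The two steps for single character output could look roughly like this in Python. It is only a sketch under assumptions: a column-major bitmap with 192 lines per byte column, a font whose bytes hold each 4-bit glyph row in both nybbles, and the hypothetical names `put_char` and `DEMO_FONT`. None of this is taken from FAST-40 or MG Browse.

```python
TEXT_COLS, TEXT_ROWS = 40, 24
LINES = 192     # bitmap lines per byte-wide column (assumed layout)
GLYPH_H = 8

# Hypothetical two-glyph font: each byte carries the same 4-bit row twice.
DEMO_FONT = {1: bytes([0x66] * GLYPH_H), 2: bytes([0x99] * GLYPH_H)}


def put_char(text_buffer, bitmap, font, code, col, row):
    """Store the code in the text buffer, then draw only that glyph."""
    # Step 1: keep the off-screen text buffer up to date.
    text_buffer[row * TEXT_COLS + col] = code
    # Step 2: merge the glyph into the left or right half of the bitmap bytes.
    mask = 0xF0 if col % 2 == 0 else 0x0F
    base = (col // 2) * LINES + row * GLYPH_H
    for y, bits in enumerate(font[code]):
        bitmap[base + y] = (bitmap[base + y] & (mask ^ 0xFF)) | (bits & mask)


text_buffer = bytearray(TEXT_COLS * TEXT_ROWS)
bitmap = bytearray((TEXT_COLS // 2) * LINES)
put_char(text_buffer, bitmap, DEMO_FONT, 1, 0, 0)   # left half of the first byte
put_char(text_buffer, bitmap, DEMO_FONT, 2, 1, 0)   # right half of the same byte
```

Each call touches 8 bitmap bytes, compared with the 160 bytes a 40-column row redraw has to write.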

3. Yep, column-based; using DHCM makes this a virtual necessity anyway, although you can just create a huge lookup table if you want to stay in row-mode.

What is DHCM?

I'm doing an entire row in just under 6000 cycles, which means I can refresh 2 rows per frame in NTSC or 3 in PAL, and that means the whole 40x24 display can be completely redrawn (roughly) five times a second when the renderer is running at maximum speed.

It is wasteful to redraw the full screen from the text buffer in case it is scrolled. A full scroll of the bitmap can be done in ~35000 cycles. Of course, in your case the text screen again needs to keep in sync with that. That means it is entirely possible to scroll 30 times a second with that method - minus whatever is necessary to handle the text buffer and clear the newly opened space or redraw just the new line.

FD22 · Post by **FD22** » Sun Feb 09, 2014 6:58 am

Mike wrote:
FD22 wrote:Mike, to each of your points:

1. Not entirely sure what you mean here, unless you're suggesting drawing individual glyphs as they change in the text buffer rather than the entire line...? If so, there's a trade-off for that mechanism - you'd either need to have a double-buffered text buffer so changes could be detected on each IRQ, or modify _CHROUT to call the renderer as characters were sent to the screen. I decided early on in VIC++ (and therefore FAST-40) not to use a double-buffered renderer, just because it needed twice the memory and a long scan of the active buffer every IRQ to find changes. Also, in situations where a lot of characters changed (e.g. when the screen scrolls) the refresh performance was pretty awful. Modifying _CHROUT is a possibility, but I much prefer to keep the character-handling logic separated from the screen draw logic - this makes it easier to introduce features that control refresh speed or 'alternate inputs' to the renderer, since they are nothing to do with character I/O.
It's always a good idea to keep the character-handling logic separated from the screen draw logic. But in any case, there's no double buffer involved.

For single character output, you'd:

- put the character into the 'off-screen' text buffer, and
- draw that single glyph into the bitmap.

Job done. For the other actions, like opening a line, or scrolling parts of a line, you similarly can provide the necessary operations on the bitmap that 'mirror' all actions done on the text buffer. That way, all changes are made immediately visible, and don't need to be deferred to an action done in a heavy interrupt routine.

Especially for single character output, consider: there might only be 5 or 6 characters updated during each frame, but your character-line renderer would catch that dirty row, and always redraw 40 characters, doing much more work than necessary. More to that for point 3.

3. Yep, column-based; using DHCM makes this a virtual necessity anyway, although you can just create a huge lookup table if you want to stay in row-mode. I just have a much smaller 40-byte lookup table that yields the start address of each row to make computation quicker during the render. I'm doing an entire row in just under 6000 cycles, which means I can refresh 2 rows per frame in NTSC or 3 in PAL, and that means the whole 40x24 display can be completely redrawn (roughly) five times a second when the renderer is running at maximum speed.
It is wasteful to redraw the full screen from the text buffer in case it is scrolled. A full scroll of the bitmap can be done in ~35000 cycles. Of course, in your case the text screen again needs to keep in sync with that. That means it is entirely possible to scroll 30 times a second with that method - minus whatever is necessary to handle the text buffer and clear the open space or redraw just the new line.

Ah, I see what you're getting at. It's only the actual redraw that happens in the IRQ, not the amendments to the text buffer or stuff like opening lines, etc - those things happen on-demand when the screen editor requires it. Yes, the renderer redraws all 40 characters per line even if only one changes, but this is a constant - and the worst-case scenario.

Similarly, stuff like scrolling the screen actually happens much faster than 5 times a second - I was just using that to illustrate that the refresh can redraw the whole screen that quickly. Scrolling and other line-based operations happen asynchronously to the render, so in fact things move around very quickly.

Mike · Post by **Mike** » Sun Feb 09, 2014 7:13 am

FD22 wrote:Scrolling and other line-based operations happen asynchronously to the render, so in fact things move around very quickly.

Ah, ok, that means you won't use the renderer to scroll the screen. Good. Still, if you take the line renderer out of the interrupt, and make BSOUT draw single glyphs instead, you'd make the interrupt, as a precious resource, available to the programmer again.

That would also spare you the need to block these two parts of the display pipeline - character output vs. global actions - against each other. All rendering would take place synchronously.

FD22 · Post by **FD22** » Sun Feb 09, 2014 10:23 am

Mike wrote:
FD22 wrote:Scrolling and other line-based operations happen asynchronously to the render, so in fact things move around very quickly.
Ah, ok, that means you won't use the renderer to scroll the screen. Good. Still, if you take the line renderer out of the interrupt, and make BSOUT draw single glyphs instead, you'd make the interrupt, as a precious resource, available to the programmer again.

That would also spare you the need to block these two parts of the display pipeline - character output vs. global actions - against each other. All rendering would take place synchronously.

God no - having the renderer dictate system character I/O rate would be ... insane.

It's all a trade-off between blocking CPU and giving the user a smooth experience. In actuality the CPU load during the IRQ is, mostly, negligible - 99% of the time there's only a 23 cycle impact on the interrupt, which is a basic design intent. That precious IRQ time is almost always available to the user, except when there's stuff to render, and that's why I'm giving the user a BASIC hook to adjust the balance between screen update smoothness and CPU availability. So for program hotspots where code needs as much CPU as possible and the refresh isn't as critical, you can wind the IRQ overhead all the way down to zero if necessary (think ZX80 without the blanking).

The renderer is essentially an asynchronous snapshot - when it fires on the IRQ, it draws the text buffer as it exists at that moment (based on the dirty row markers) so lots and lots of changes can happen between snapshots, including line inserts, scrolling, etc. That means that apps doing lots of character I/O (or even just the LIST command, for example) can burn through and do their writes to the text buffer at full speed, and aren't slowed down except when a line redraw is needed - and if, in the interim, they've actually caused the entire screen to scroll one or more lines, the renderer doesn't care because it locates whatever dirty row markers are set and refreshes accordingly. And even if that means all 24 dirty row markers are set, it's still only a 5th of a second to refresh the whole screen, and the refresh looks smooth and responsive.
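The snapshot behaviour described above can be modelled in a few lines of Python. This is a toy model under assumptions, not FAST-40's code: the class name `Screen`, the `rows_per_tick` knob (standing in for the BASIC hook that balances refresh against CPU load) and the `rendered` log (standing in for the actual bitmap work) are all hypothetical.

```python
class Screen:
    """Toy model of a dirty-row renderer driven by a periodic interrupt."""

    def __init__(self, cols=40, rows=24, rows_per_tick=2):
        self.cols, self.rows = cols, rows
        self.rows_per_tick = rows_per_tick      # 0 switches rendering off
        self.text = [[32] * cols for _ in range(rows)]
        self.dirty = [False] * rows
        self.rendered = []                      # log of (row, contents) draws

    def write(self, col, row, code):
        """Character I/O path: touch only the text buffer and a flag."""
        self.text[row][col] = code
        self.dirty[row] = True

    def scroll_up(self):
        """Scroll the text buffer at full speed and invalidate every row."""
        self.text.pop(0)
        self.text.append([32] * self.cols)
        self.dirty = [True] * self.rows

    def irq_tick(self):
        """Interrupt path: draw at most rows_per_tick dirty rows as they are now."""
        drawn = 0
        for row in range(self.rows):
            if drawn == self.rows_per_tick:
                break
            if self.dirty[row]:
                self.rendered.append((row, bytes(self.text[row])))
                self.dirty[row] = False
                drawn += 1
        return drawn
```

Any number of writes and scrolls can happen between two ticks. The tick only sees the final state of each dirty row, which is why a full-screen invalidation costs 12 ticks at 2 rows per tick regardless of how many characters were written in the meantime.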

Per-character rendering is an alternative approach that could work, but I think there might be a trade-off there too; the render speed gain versus the complexity of the code that manages the text-to-bitmap conversion as character I/O occurs. The line-based approach I'm using does consume more CPU when an update is needed (even if only for one character) but has the advantage that CPU load is constant, and because it's logically separate from character I/O it can be easily tuned for responsiveness and for things like split-screen character-and-graphics stuff. What would be REALLY interesting is a side-by-side comparison - but my time budget precludes that possibility (unless someone else wants to create a character-based rather than line-based implementation).

EDIT: just spotted your 'What is DHCM' question which I overlooked before - Double Height Character Mode.

tokra · Post by **tokra** » Sun Feb 09, 2014 11:07 am

Can't wait for the finished product. 40-column-renderers for the VIC always fascinated me...

Analyzing other 40-column renderers and their manuals I found some interesting tricks. From the manual for Screen-40, for example (Compute's Gazette 1985-06):

"Screen-40 needs only 4-bit memory blocks for characters and therefore keeps most of them in the unused part of color RAM. When the program is initialized, the alphanumeric character shapes are transferred there from packed storage in the last 384 bytes of the program. Graphics characters are drawn from the character ROM within the VIC."
So this saves lots of precious space and makes good use of the unused color-RAM (784 x 4-bit).

From the manual of another 40-column renderer (FAT-40 = VIC-40 from Ahoy! 1984-10):

"Reverse-field characters are not stored, but are generated on the fly by simply reversing the bits of a stored character, then displaying it on the hi-res screen."
Yet another idea to save space.

Back to my own thoughts: I see you are going for a 40x24 mode - the drawback of this is that BASIC PET or C64-software will always display one line too many. On the plus side you have clear boundaries for color-RAM (one color for a square of 4 chars).

I have been wondering if - once you've got the program working - you could maybe add a 25th line in single-height-character-mode by adding a raster-interrupt after line 24, switching the char-RAM pointer to $0000 and putting the char-information for the needed extra 20 chars in the tape buffer at $033c for example (160 bytes are needed). Unless doing tape operations this area in the lower 1K is free for use. One small problem would be that you would need 20 bytes more of video-RAM as well and this would then go from $1000-$1103 - while character-RAM would still need to start at $1100 for the normal 240 chars. But updating 4 bytes from $1100-$1103 dynamically shouldn't be too much of a drain on CPU. It will probably be more trouble to adjust the renderer for this extra line, which does not follow the rules of the rest of the display. However, such a feature would make this a "true" 40x25 mode with just the small drawback of needing to deactivate 40-column mode while doing tape operations.
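The byte counts in that idea can be verified with a short calculation. This is a sketch under assumptions: the addresses are the ones named in the post, and the cassette buffer is taken to span $033C-$03FB (192 bytes), which is the usual figure for the VIC-20.

```python
SCREEN_BASE = 0x1000                  # video matrix start, from the post
CHAR_BASE = 0x1100                    # character RAM start, from the post
TAPE_BUFFER = range(0x033C, 0x03FC)   # cassette buffer $033C-$03FB (assumed)

double_height_cells = 240             # 20 x 12 cells for the 24 text rows
extra_cells = 20                      # one more row of single-height cells

extra_glyph_bytes = extra_cells * 8   # 8 bytes per single-height cell
matrix_last = SCREEN_BASE + double_height_cells + extra_cells - 1
overlap = max(0, matrix_last - CHAR_BASE + 1)

print(f"extra glyph data: {extra_glyph_bytes} bytes, "
      f"buffer holds {len(TAPE_BUFFER)}")
print(f"video matrix ends at ${matrix_last:04X}, "
      f"overlapping character RAM by {overlap} bytes")
```

The 160 bytes of glyph data fit into the 192-byte buffer, and the enlarged video matrix spills 4 bytes into the start of character RAM, matching the $1100-$1103 range mentioned above.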

The other available 40x25 solutions use a 160x176 display with 4x7 chars which leads to strange color-boundaries. Still the demand for true 40x25 on the VIC was probably why such products were released back then.

FD22 · Post by **FD22** » Sun Feb 09, 2014 3:58 pm

Hi Tokra.

FAST-40 also uses 4-bit characters, but I hold them doubled-up, 2-to-a-byte, for speed - I don't have to do shift operations during the render process to get a character into the left or right bitmap nybble, only a mask. VIC++ uses the 4-bit colour memory to hold rendering attributes that affect how characters are drawn, but none of that logic is present in FAST-40.
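A small Python sketch of why the doubled-up storage avoids shifts. The function names are hypothetical and the byte layout (left glyph in the high nybble, right glyph in the low nybble) is an assumption for illustration rather than FAST-40's documented format.

```python
def double_nybbles(rows4):
    """Store each 4-bit glyph row twice in one byte, e.g. 0x6 -> 0x66."""
    return bytes(((r & 0x0F) << 4) | (r & 0x0F) for r in rows4)


def place_masked(dest, glyph_byte, right_half):
    """Doubled storage: a mask alone selects the half that is needed."""
    mask = 0x0F if right_half else 0xF0
    return (dest & (mask ^ 0xFF)) | (glyph_byte & mask)


def place_shifted(dest, nybble, right_half):
    """Single-nybble storage: the left half costs an extra 4-bit shift."""
    if right_half:
        return (dest & 0xF0) | nybble
    return (dest & 0x0F) | (nybble << 4)


glyph = double_nybbles([0x6, 0x9, 0x9, 0xF, 0x9, 0x9, 0x9, 0x0])
```

Both routines produce the same bitmap byte. The doubled form spends twice the font memory to drop the shift from the inner loop, which on a 6502 would otherwise be four single-bit shifts per glyph row.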

I'm intrigued by the statement that Screen-40 draws graphics characters out of the ROM - I'm assuming it must render alternate bits, which could make some glyphs quite distorted. I have re-drawn everything at 4-bit resolution, so some are a bit squashed but they're all recognisable.

Re. FAT-40 rendering inverse characters on the fly - VIC++ used to do this, but I reverted to pre-drawn inverted characters, again for speed; FAST-40 inherits this version of the renderer. The speed-vs-memory trade-off is made a less bitter pill to swallow by having FAST-40 able to relocate the character data and the runtime into BLK5 if it contains RAM, which (as far as I know) no other 40-column package does. Also I compress that data, and the runtime code, so the binary isn't bloated when loading - at the moment I'm using a custom Huffman compressor, but I might switch to Exomizer in the final product since it can do in-code runtime decompression/relocation as a single operation.

I've toyed with other formats for the screen - 40x23, 40x25, raster splits in different places, different character sizes; in the end I've gone with a format that is optimised for speed and simplicity. I might do a 40x25 version and trade colour clash for that extra line, but I think that'll be only if there's a demand. Primary objective right now: finish the version I've got.

EDIT: I'll also be publishing the full sourcecode for FAST-40 on release too, so everyone can pick it apart and look for ways to improve it.

tokra · Post by **tokra** » Sun Feb 09, 2014 4:33 pm

FD22 wrote:FAST-40 also uses 4-bit characters, but I hold them doubled-up, 2-to-a-byte, for speed - I don't have to do shift operations during the render process to get a character into the left or right bitmap nybble, only a mask.

Ok, sounds economical as well. Just feels like a pity to let unused color-RAM go to waste.

I'm intrigued by the statement that Screen-40 draws graphics characters out of the ROM - I'm assuming it must render alternate bits, which could make some glyphs quite distorted. I have re-drawn everything at 4-bit resolution, so some are a bit squashed but they're all recognisable.

In the meantime I checked how those characters look: not really good. I guess re-drawing them to 4-bit-resolution was the right way to go after all.

Re. FAT-40 rendering inverse characters on the fly - VIC++ used to do this, but I reverted to pre-drawn inverted characters, again for speed

Would the few EORs necessary to invert the char really be noticeable speed-wise?

FAST-40 able to relocate the character data and the runtime into BLK5 if it contains RAM, which (as far as I know) no other 40-column package does.

I think PET Loader sits in BLK5 as this starts with reset. Is the 3K-RAM expansion area in BLK0 still possible? That looks like a good place to hide. Or even better: $9800-$a000 (IO2 and IO3 if you put RAM there) - granted it's only 2K but at this place interference with the rest of the system would be minimal. Maybe that's the reason why MegaCart and SJLOAD for MegaCart use this area.

Source-code should be interesting! Looking forward to it.

FD22 · Post by **FD22** » Sun Feb 09, 2014 5:43 pm

Re. On-the-fly inverted characters: it's not just a few EORs, it's 8 per character, up to 40 per line, plus the cycles needed on every character read to determine whether to invert or not. It all adds up, and I've elected to trade increased memory usage for the notable reduction in render time.
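To put rough numbers on that trade-off, here is a minimal sketch. The counts of 8 EORs per character and 40 characters per line come from the post. The 2 cycles per EOR is the 6502's immediate-mode timing and is a lower bound that ignores the per-character test for reverse video. The helper names are hypothetical.

```python
GLYPH_ROWS = 8
COLS = 40
EOR_IMMEDIATE_CYCLES = 2   # 6502 EOR #imm; other addressing modes cost more


def invert_on_the_fly(glyph):
    """Flip every bit of a glyph at draw time (the FAT-40 approach)."""
    return bytes(b ^ 0xFF for b in glyph)


def build_inverse_table(font):
    """Pre-draw the reversed glyphs once, doubling the font's memory use."""
    return {code: invert_on_the_fly(glyph) for code, glyph in font.items()}


eors_per_line = GLYPH_ROWS * COLS                      # worst case: all reversed
extra_cycles = eors_per_line * EOR_IMMEDIATE_CYCLES    # lower bound per line
print(eors_per_line, "EORs, at least", extra_cycles, "cycles per line")
```

Against a row budget of just under 6000 cycles, a worst-case line of reversed characters would add at least 640 cycles before counting the decision logic, which is the cost being traded for the larger pre-drawn table.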

FAST-40 looks for RAM in BLK0, BLK1 and BLK5; if the user selects BLK0 it relocates a big chunk of itself in the 3K area with the remainder at the top of BLK1/2/3 (you always need at least one 8K block, since the bitmap and matrix occupy all the onboard 5K and FAST-40 is bigger than 3K); if you select 'top of memory' it pushes everything to the top of whatever 8K block is highest; and if you select BLK5 then everything goes to $A000, leaving all other expansion RAM available. The objective is to leave as much space available to the user as possible, based on what memory is installed and their preference.

I've deliberately steered-clear of the $9800-$A000 area just because I know that Megacart, SJLOAD, etc. use it and I didn't want to allow the user to accidentally cause some horrible clash with those. I'm also not entirely sure what population of VIC users might actually have RAM installed there - it's probably tiny, and even XVIC doesn't let you configure it that way, so there's probably little point even trying to have FAST-40 put anything there.

Mike · Post by **Mike** » Sun Feb 09, 2014 6:13 pm

FD22 wrote:God no - having the renderer dictate system character I/O rate would be ... insane.

I'm quite sure, that under 'normal' circumstances the character rate won't ever exceed something like 1000 characters per second. If you take the LIST command as example, I took one arbitrary example program, where a LIST needed 5.5 seconds, and resulted in 3580 bytes when redirected to a file: ~650 bytes/second.

If we take that further, 650 characters/second correspond to 13 characters per frame (PAL, 50 Hz), so for single character output, a full-line renderer for 40 columns would only run at 33% efficiency.
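The estimate works out as follows. Only the figures given above are used (3580 bytes in 5.5 seconds, PAL at 50 Hz, 40 columns). The variable names are just for illustration.

```python
bytes_listed = 3580     # size of the LIST output redirected to a file
seconds = 5.5           # time the LIST took on screen
frame_rate = 50         # PAL frames per second
columns = 40            # characters redrawn by a full-line renderer

chars_per_second = bytes_listed / seconds
chars_per_frame = chars_per_second / frame_rate
efficiency = chars_per_frame / columns

print(f"{chars_per_second:.0f} chars/s, {chars_per_frame:.1f} chars/frame, "
      f"{efficiency:.1%} of a redrawn line is new")
```

That gives about 651 characters per second, 13 per frame, and an efficiency just under one third for the full-line renderer.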

Direct output doubles the CPU load for each character, as it presumably has to write the same bitmap bytes twice for the resulting double char in each 8x8 pixel cell, but besides that it only needs to write to the parts of the bitmap that really must be updated.

if, in the interim, they've actually caused the entire screen to scroll one or more lines, the renderer doesn't care because it locates whatever dirty row markers are set and refreshes accordingly. And even if that means all 24 dirty row markers are set, it's still only a 5th of a second to refresh the whole screen, and the refresh looks smooth and responsive.

So that again looks like you let your program perform the scroll by the renderer after all. If you can only scroll 5 times a second, that won't look smooth - it will look very jerky, with lines jumping around haplessly depending on which state of the scroll they have been 'caught' in by the renderer.

As you might know, I already had my hands into 40-column display routines. There's MG Browse, with some routines that descended from my paint program MINIPAINT. It only uses a renderer for complete lines, but then the exposed API is especially tuned for fast text display with scrolling in a bitmap. It is possible to modify the source text lines on-the-fly and re-paint them, but I just didn't include single character rendering because I was lazy.

Instead, I put the time into the formatting routine, which takes a file and puts it into RAM with word wrapping and TABs, so a BASIC program has no big hassle scrolling several KB of text around.

If you take a look into the source of MG Browse, you'll also see the nibbles of the character definitions doubled in each byte. And the inner render loop for each double char is also quite similar - but then, the need for maximum speed nearly automatically leads to that solution.

For MG Text Edit I actually derived a single character output routine from the line renderer. I did this mainly because the text display renders on a MINIGRAFIK bitmap, and normally there is neither an off-screen ASCII/PETSCII buffer available, nor was it intended to introduce one. It just 'paints' the glyphs as typed in. Speed is of no big importance here, as the limiting element is the user in front of the computer, and there's no scrolling.

That all being said, I'm looking forward to seeing how FAST-40 will look in practice.
