Further EDSK extensions

I’ve been involved with various disk preservation groups over the last few years. A large part of that has been for Spectrum +3 and Amstrad CPC disks, with SamDisk extended to support copy-protected disks. The +3/CPC disks are usually stored in the Extended DSK (EDSK) image file format, designed to hold (almost) any format compatible with the uPD765 floppy controller.

Many problem disks have been reverse-engineered to discover why they didn’t work. A few required emulator enhancements to improve hardware accuracy, but most were missing details from the original disks, due to some creative floppy controller use by the copy protection checks. Not all of these could be supported by the original EDSK specification.

Back in October 2005, I suggested a few EDSK enhancements, designed to address some known limitations of the format. The extensions didn’t involve anything too radical, to maintain as much backwards compatibility as possible.

It’s now three years later, and a number of new gap-related CPC protections have been identified, which are beyond the scope of even the extended Extended DSK format! I’ve made further changes to address the new requirements, as well as a correction to a previous one.

My development version of SamDisk includes support for all the new features, and will be released if the extensions are approved. Other programs will need similar enhancements to take advantage of them, particularly emulators wanting to run some of the difficult disks.

See the updated extensions document for further details. There are also sample disk images showing each extension.

FdInstall false-positive, again

Avira Antivir strikes again, with another false-positive in the fdrawcmd.sys installer. The current virus definitions report the FdInstall.dll installer plugin as infected with TR/Dropper.Gen (a “generic trojan detection routine”).

As before, avoiding UPX compression on the module is a magic fix. It’s particularly frustrating because the compression isn’t hiding anything, since the original module can be extracted using freely available code that they’re already using! Why should using a reversible executable packer be an instant black mark? Shouldn’t they be more worried about unknown or non-reversible packers? Grrr.

I’ve updated the driver installer with a UPX-less version. Hopefully the complete removal will mean an end to these virus scanner hassles.

Avira have since confirmed the issue as a false-positive, and will be fixing it in a future virus definition update. Thanks to zogzog for taking the time to report the original problem.

SAM/IP

The SAM port of uIP seems to be on hold at the moment, so I’ve been looking at other IP stacks to use until it’s ready. The most appealing is Mark Rison’s CPC/IP, not least because it’s written in Z80 and should work without extensive changes. It also comes with a number of built-in client (telnet, finger, host, ping) and server (web, tftp, dns) modules.

So far I’ve modified the source so it assembles with pyz80. A global search and replace made quick work of changing the label format from “.label” to “label:”, but I had to change many data statements manually. Strings were often combined with other single bytes in defb statements, but the Comet format used by pyz80 doesn’t allow that, requiring defm be used instead.
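The label conversion can be sketched as a single regular expression. This is only an illustration in Python, not the search and replace actually used: the helper name `convert_labels` is hypothetical, and it assumes labels are plain identifiers that start a line.

```python
import re

# Hypothetical helper: rewrite ".label" definitions at the start of a line
# into the "label:" form. Assumes labels are plain identifiers.
LABEL_RE = re.compile(r"^\.([A-Za-z_][A-Za-z0-9_]*)", re.MULTILINE)

def convert_labels(source: str) -> str:
    """Return the source text with leading-dot labels rewritten as 'label:'."""
    return LABEL_RE.sub(r"\1:", source)

sample = ".loop\n    ld a,(hl)\n    jr loop\n"
print(convert_labels(sample))
```

The defb/defm changes are harder to automate safely, which fits with them needing manual edits.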

The existing code is nice and modular, but there are CPC-specific ROM calls sprinkled throughout it. All of those need to be changed before a test run on SAM, to avoid unexpectedly jumping into the middle of nowhere! I changed the stdio.z module to use SAM’s ROM calls for character output, leaving the cursor control and keyboard input doing nothing for now. I also replaced the serial module with a dummy ethernet module, with no-op versions of the required interface functions. They will be fleshed out with calls to the Trinity driver when the rest of the code is ready.

Those changes are enough for a basic run on SAM, without being connected to a real network. Here’s what you see when it’s launched:
CPC/IP on SAM

The program continues polling for serial and keyboard input in a main loop. The CPC version uses a 300Hz timer to poll for new data from the serial link, which is buffered for later reading from the main loop. Each received byte is passed into either the SLIP or PPP module (whichever was configured during the build), which builds up complete datagrams. These are then passed into the ‘ip_handle’ function inside ip.z for processing.
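To illustrate the byte-at-a-time datagram assembly, here is a minimal SLIP decoder sketch in Python. It is not taken from CPC/IP; the class name and callback are hypothetical, and only the special byte values come from RFC 1055.

```python
# SLIP special byte values, as defined in RFC 1055.
END, ESC, ESC_END, ESC_ESC = 0xC0, 0xDB, 0xDC, 0xDD

class SlipDecoder:
    """Collects received bytes and hands over each complete datagram."""

    def __init__(self, on_datagram):
        self.on_datagram = on_datagram  # called with each complete datagram
        self.buf = bytearray()
        self.escaped = False

    def feed(self, byte):
        if self.escaped:
            # The byte after ESC stands for a literal END or ESC value.
            self.escaped = False
            if byte == ESC_END:
                byte = END
            elif byte == ESC_ESC:
                byte = ESC
            self.buf.append(byte)
        elif byte == ESC:
            self.escaped = True
        elif byte == END:
            # END terminates a datagram; empty ones are silently dropped.
            if self.buf:
                self.on_datagram(bytes(self.buf))
                self.buf.clear()
        else:
            self.buf.append(byte)

received = []
decoder = SlipDecoder(received.append)
for b in bytes([0xC0, 0x01, 0xDB, 0xDC, 0x02, 0xDB, 0xDD, 0xC0]):
    decoder.feed(b)
print(received)
```

In the setup described above, the callback would play the role of ‘ip_handle’.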

The SAM implementation will read complete datagrams from the Trinity driver, so they can be passed straight into ‘ip_handle’. This vastly simplifies the setup above, but introduces a new requirement: ARP. SLIP and PPP push datagrams back into the link and let the remote end deal with routing. With ethernet we need to determine the hardware addresses for delivery, for both local and routed traffic.

I’ll need to write a new arp.z module to sit between the Trinity driver and the IP module. Outgoing traffic for hosts already in the ARP cache can be sent immediately. Anything for as-yet-unknown targets must be buffered, and a who-has ARP request made for the address owner to reply. Once a reply is received, an entry for it is added to the local ARP cache and data buffered for that host is sent. If no reply is received (ideally after multiple attempts), data for the target will be discarded. We must also reply to incoming ARP requests for our own address so other hosts can talk to us.
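The cache-and-buffer behaviour can be modelled in a few lines of Python. This is a sketch of the logic only, not the planned arp.z: the class, its callbacks and the retry limit of 3 are all assumptions, and answering incoming ARP requests is left out.

```python
class ArpLayer:
    """Toy model of an ARP cache with buffering for unresolved targets."""

    def __init__(self, send_frame, send_arp_request, max_tries=3):
        self.send_frame = send_frame              # (mac, datagram) -> None
        self.send_arp_request = send_arp_request  # (ip) -> None, who-has broadcast
        self.max_tries = max_tries                # assumed retry limit
        self.cache = {}                           # ip -> mac
        self.pending = {}                         # ip -> [tries, [datagrams]]

    def send(self, ip, datagram):
        mac = self.cache.get(ip)
        if mac is not None:
            self.send_frame(mac, datagram)          # known host: send at once
        elif ip in self.pending:
            self.pending[ip][1].append(datagram)    # request already outstanding
        else:
            self.pending[ip] = [1, [datagram]]      # buffer and ask who-has
            self.send_arp_request(ip)

    def on_arp_reply(self, ip, mac):
        self.cache[ip] = mac
        _, queued = self.pending.pop(ip, (0, []))
        for datagram in queued:                     # flush data buffered for the host
            self.send_frame(mac, datagram)

    def on_timeout(self, ip):
        entry = self.pending.get(ip)
        if entry is None:
            return
        if entry[0] >= self.max_tries:
            del self.pending[ip]                    # give up and discard the data
        else:
            entry[0] += 1
            self.send_arp_request(ip)

sent, requests = [], []
arp = ArpLayer(lambda mac, d: sent.append((mac, d)), requests.append)
arp.send("10.0.0.1", b"first")
arp.on_arp_reply("10.0.0.1", "00:11:22:33:44:55")
print(sent, requests)
```

For routed traffic the `ip` passed to `send` would be the gateway address rather than the final destination.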

I’m still torn between using the SAM ROM routines for I/O and something based on the terminal code I wrote for the Apple 1 emulator. The ROM code would give the same output flexibility as in BASIC, but the general ROM code is a bit on the slow side. My own code could be tailored for a specific mode, either mode 2 for speed or mode 3 for hi-res. It might be easier to stick with the ROM code for now, and change it if it’s too slow.

Trinity Ethernet

After a break of a few months, I’m almost back on the development wagon. I did the odd project tweak during that time but haven’t spent any quality time working on new features.

Last month I picked up one of the first Quazar Trinity boards. Since then I’ve been working on the ethernet side, which is based around a Microchip ENC28J60 chip. The Trinity board also includes EEPROM and MMC/SD board features, but I’m leaving those for another time. My first task was to write a simple network driver, to allow sending and receiving raw packets from BASIC.

Trinity uses the SAM port range &DC to &DF. The first of these is the microcontroller, which acts as a central hub for all the board’s features. The other ports are used for the EEPROM, Ethernet and MMC/SD card, and each needs to be enabled through the microcontroller before it can be used. Port &DE is used for the ENC ethernet chip, and once enabled we can read and write to the chip directly. Well, almost directly, as the link uses the SPI bus.

If you’re as clueless about electronics as I am you probably won’t have come across the SPI (Serial Peripheral Interface) bus. It’s a full duplex link where each byte written is paired with a read back from the device. Since reads can’t be performed without a write, Trinity stores the value read for later. Reading from SAM reads only the stored value, without accessing the ENC.

SPI introduces a lag between writing a value and reading any result generated by the write, since the stored value is what was read before the write completed. An additional dummy (zero) write is needed for the actual result to be available for reading. The lag also means block reads require a dummy write before reading each byte. Fortunately, the latest Trinity firmware provides an auto-null-writing feature to simplify and optimise this.
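The read lag is easier to see in a toy model. The Python below is only a sketch under assumptions: the device class, its register map and the command value &12 are hypothetical, and it models just the “stored value” behaviour described above rather than the real Trinity or ENC protocol.

```python
class ToySpiDevice:
    """Hypothetical device that answers a command during the *next* transfer."""

    def __init__(self, registers):
        self.registers = registers
        self.pending = 0

    def transfer(self, mosi):
        miso = self.pending                       # reply to the previous byte
        self.pending = self.registers.get(mosi, 0)
        return miso

class LatchedPort:
    """Each write does one SPI transfer and stores the byte that came back."""

    def __init__(self, device):
        self.device = device
        self.latch = 0

    def write(self, value):
        self.latch = self.device.transfer(value)

    def read(self):
        return self.latch                         # no bus access on reads

port = LatchedPort(ToySpiDevice({0x12: 0xAB}))
port.write(0x12)        # send the command
stale = port.read()     # still the value from before the command completed
port.write(0)           # dummy write clocks the real result in
result = port.read()
print(hex(stale), hex(result))
```

A block read in this model needs one dummy write per byte, which is the work the auto-null-writing firmware feature takes over.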

The ENC itself has a banked register setup, arranged as 4 banks of 32 registers. The final 5 registers in each bank are common across all banks, and are used for status registers and bank selection. All ENC features are accessed through these registers, including reading and writing from the internal 8K data buffer. The buffer is used for both transmitting and receiving, with a user-defined portion of it configured as a circular receive buffer. The remaining space is unmanaged and available for transmission storage.

In its power-on state the ENC will see but not receive anything. It has no hardware address set, no space allocated for the receive buffer, and the packet filter is set to ignore everything. The driver initialisation is responsible for setting up all of those, and any other register where the defaults are not suitable. Before we do that it’s wise to ask the Trinity microcontroller to reset the ENC chip back to a known state.

I started my experimentation from BASIC as it was quicker to tweak the ENC registers and see results than launching the assembler for each change. Colin supplied a sample disk with macros to access the board, with most containing a couple of OUTs and maybe an IN. I added to them for higher level functions, such as setting the MAC address and writing blocks of data to the ENC buffer. Once I was happy this was working I was ready to port it to Z80.

I chose to use 6.5K of the 8K buffer for receiving, with 1.5K left for sending. That’s just enough space to send a single full-size ethernet frame. The packet filter was set to receive packets addressed to our MAC address, as well as anything broadcast to the whole subnet. Writing a zero to the packet filter register disables it, so all local network traffic is seen. Couple that with packet decoding and you have an easy network sniffer.

My driver development wasn’t all smooth sailing, with a few bumps along the way. The first was my early attempts to write and read the MAC address values, to ensure my new Z80 code was working. It turns out the subset of ENC registers starting with ‘M’ (which includes the MAC registers) have an extra lag on top of SPI, and require double-reading before they return the correct result. I was also stung by a documented ENC issue with the transmit logic getting stuck under certain conditions. A bug in my work-around meant I would still occasionally hang during transmits.

Even with the driver initialised and reception enabled, we’re still not quite ready to handle a test ping from another machine on the network. Responding to requests requires CRC calculations in the return packets, which involved more work than I wanted to do for a test setup. That will be the job of a full IP stack. It’s marginally easier to send a ping request from SAM, since the request can be pre-calculated and it’s only the remote host that needs to worry about dynamic responses.

Even pinging an IP address from SAM is surprisingly involved:

  1. Use local IP and netmask to determine whether target IP is on our subnet (if not, send to gateway machine for further routing)
  2. Check local ARP cache for the target IP (if found, goto 5)
  3. Broadcast who-has ARP request to find the MAC of the IP
  4. Wait for ARP reply, then add MAC to local ARP cache
  5. Construct ECHO REQUEST ICMP packet
  6. Send unicast packet to target MAC

Fortunately, we can strip this down for the sake of a simple test. We’re using a local target so step 1 is unnecessary. We can also hard-code the MAC of the target machine, to also skip steps 2 to 4. An ICMP ECHO request packet can then be constructed with fixed details and pre-calculated CRCs. I used Ethereal on my PC to sniff a request sent with a zero CRC, which was expected to fail, then completed the correct CRC with what it reported.
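For comparison, the parts skipped here are short when written in a high-level language. The Python sketch below shows the step 1 subnet test and the value an ICMP header needs in its check field, which is the 16-bit ones’ complement Internet checksum from RFC 1071. The function names, the netmask and the gateway address are assumptions for illustration; only 10.0.0.88 appears in this post.

```python
import ipaddress
import struct

def next_hop(target, local, netmask, gateway):
    """Step 1: deliver directly if the target is on our subnet, else via the gateway."""
    network = ipaddress.ip_network(f"{local}/{netmask}", strict=False)
    return target if ipaddress.ip_address(target) in network else gateway

def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement sum of the data, complemented (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"                           # pad odd lengths with a zero byte
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF

def icmp_echo_request(ident, seq, payload=b""):
    """Build an ICMP ECHO REQUEST (type 8, code 0) with its checksum filled in."""
    header = struct.pack("!BBHHH", 8, 0, 0, ident, seq)
    checksum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, checksum, ident, seq) + payload

packet = icmp_echo_request(ident=1, seq=1, payload=b"SAM")
print(packet.hex())
print(next_hop("10.0.0.5", "10.0.0.88", "255.255.255.0", "10.0.0.1"))
```

A packet with a correct checksum sums to zero when the same function is run over it, which is a handy self-check.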

To send a reply, the target machine will perform the same steps as above, with an ICMP ECHO REPLY packet. As SAM is currently unable to reply to ARP requests we must use the “arp” command on the target machine to add a static entry linking SAM’s IP with its MAC address. In my case that meant running the following command in Windows XP:

arp -s 10.0.0.88 02:A4:92:E4:D3:20

The test MAC address I used was formed from bits of the string “TRINITY”, with a few unused zero bits at the end. Bits 0 and 1 of the first byte are flags, but the rest can be pretty much anything. I’ve set flag bit 1 to mark the address as “locally administered”, to avoid the (rather unlikely!) clash with existing network devices. To avoid clashes with other Trinity boards, Colin will be assigning unique addresses to each one sold. For convenience, the MAC and other network settings will ultimately be stored on the EEPROM.
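One packing that reproduces the address used in the arp command is 5 bits per letter (A=1 to Z=26), padded with zero bits to a byte boundary and prefixed with the &02 flag byte. That encoding is inferred from the result rather than stated anywhere, so treat the Python below as a plausible reconstruction; the helper name is hypothetical.

```python
def pack_letters(name: str) -> bytes:
    """Pack upper-case letters as 5-bit codes (A=1), zero-padded to whole bytes."""
    bits = 0
    for ch in name:
        bits = (bits << 5) | (ord(ch) - ord("A") + 1)
    nbits = 5 * len(name)
    pad = -nbits % 8                      # unused zero bits at the end
    return (bits << pad).to_bytes((nbits + pad) // 8, "big")

# First byte: bit 1 set marks the address as locally administered.
mac = bytes([0x02]) + pack_letters("TRINITY")
print(":".join(f"{b:02X}" for b in mac))
```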

The rigid setup above was enough to show that I could ping my PC from SAM, and have the echo reply read from the receive buffer. What we needed now was a proper IP stack to plug my driver into…

While I was working on the driver, Adrian Brown was busy porting Adam Dunkels’ uIP stack from C to Z80. He’s made quick work of it too, with ARP and ICMP already working well enough to ping from PC to SAM without the need for any of my cheating (ping times are typically 7-8ms). Once TCP is ready we’ll have enough for some real applications! Web server anyone?

ATTRibute port

The attribute port (255) is part of SAM’s Spectrum compatibility, and implements a quirk of the original hardware. On the Spectrum it returns the last value on the ULA side of the bus: an attribute byte over the main screen or 255 during the border. A handful of Spectrum titles use it to synchronise with the top of the main screen, giving the maximum amount of time to draw sprites without raster shearing.

To my knowledge no SAM software uses it, so it’s remained near to the bottom of my SimCoupe ToDo list for many years. I made do with a dummy implementation, returning zero over the main screen and 255 during the border. Though a bug in the border test meant even that functionality was broken, so port reads have always returned zero!

Velesoft recently released a SAM-mouse enhanced version of the Spectrum title Galactic Gunners. It uses the ATTR port to synchronise drawing with the top of the main screen, and the broken SimCoupe implementation caused sprites in the upper 2/3 of the screen to flicker. In this case fixing the border test bug cured the flicker, but full ATTR support was needed to ensure other titles behaved correctly.

The SAM Technical Manual contains some details of SAM’s ATTR port behaviour:

This register enables the programmer to read the attributes of the currently displayed character cell in modes 1 and 2, and the third byte in every four displayed in modes 3 and 4.

There’s no mention of what happens in the border, though a quick test was enough to show it didn’t match the Spectrum’s behaviour. In fact it seemed to only return attribute bytes from the main screen. Time for a test program! My usual approach with these tests is to make whatever I’m probing as visible as possible, so the emulation will only match the real thing once everything is perfect. In this case I used a tight loop reading from the ATTR port and writing the result to CLUT entry 0:


ld hl,loop
ld bc,&00f8
loop:
in a,(255)
out (c),a
jp (hl)
 

This code must be run with interrupts disabled, and started from a fixed position in the frame to ensure it’s the same on each run. Both are most easily achieved by placing the code at the IM 1 handler address of &0038 and using a HALT to guarantee the current instruction is a fixed 4 tstates when the interrupt handler is invoked. The 4-cycle rounding from the HALT opcode fetch ensures the test begins on the same frame cycle each time.

To make the most of the test output I created a test pattern containing a range of colours, and interleaved with columns of palette colour 0 where the test colour would show through. Here’s what I came up with:

ATTR Pattern

And here’s what it looks like on SAM running in screen mode 1:

ATTR Test

The time between the port read and the palette write causes the output to be shifted a few screen blocks to the right of the main screen position, reaching into the right border area. The screens above were taken with the SimCoupe border area set to Complete, to show what would be seen if the ASIC generated the display over the full frame. This doesn’t happen on a real machine but is useful to see video changes outside the visible TV area.

The stripes on the main screen and the jagged edges in the lower border are caused by the loop timing not being an exact multiple of the 384 tstates per display line. The actual timing is complicated by the display memory fetches, mode 1 contention delays, and ASIC port I/O delays, but if you look closely you can see three repeating line end positions.

If SAM’s border behaviour matched the Spectrum, the border colour should be bright white (colour 127, since the top bit is not used) everywhere except to the right of the main screen where the colour bleeds from the main screen. To the left of the main screen the colour is actually off-white (colour 120), which matches the right-most attribute on the scanline — bright white paper with black ink is 01111000 binary, 120 decimal.

To test the right-most attribute observation with the lower border I added a bright white paper with white ink (01111111 binary, 127 decimal) block to the bottom right of the screen. As expected this caused the lower and upper border to be coloured bright white. So during the border areas SAM was returning the last attribute value fetched to draw the main screen area.
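The two attribute values follow directly from the Spectrum attribute layout (bit 7 flash, bit 6 bright, bits 5-3 paper, bits 2-0 ink). A small Python check, with a hypothetical helper name:

```python
def attr(ink: int, paper: int, bright: bool = False, flash: bool = False) -> int:
    """Build a Spectrum-style attribute byte from its fields."""
    return (int(flash) << 7) | (int(bright) << 6) | ((paper & 7) << 3) | (ink & 7)

WHITE, BLACK = 7, 0
off_white = attr(ink=BLACK, paper=WHITE, bright=True)     # black ink on bright white
bright_white = attr(ink=WHITE, paper=WHITE, bright=True)  # white ink on bright white
print(off_white, format(off_white, "08b"))
print(bright_white, format(bright_white, "08b"))
```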

As a further test I coloured the screen attributes red, set the screen-off bit to disable the display, coloured the screen attributes green, then read from the ATTR port. As expected the port returned the red colour, since that was the last screen byte the ASIC read when drawing the display. Here’s the BASIC code for the test:

10 PAPER 2 : BORDER 2 : MODE 4
20 BORDER 4 : OUT 254,132 : PAPER 4 : CLS
30 PAUSE 5 : PRINT IN 255 : REM should be 34 for red
40 BORDER 0
 

Once the port behaviour was understood the SimCoupe implementation could be enhanced. When the port is read it uses the current raster position to determine the last on-screen location that the ASIC would have read, and the memory address of the screen data (which depends on the current screen mode). The existing mode-change ASIC artefact implementation did a lot of this already so the same code could be re-used.

An additional complication is the value returned when the display is disabled, which may no longer be part of the current display. It requires the ATTR value to be determined when the screen goes from enabled to disabled, giving a value to return for as long as the screen remains disabled.

The test program and source code are available for download (9K). Pressing the NMI button returns you to BASIC, allowing the screen mode/contents to be changed to see different patterns. Don’t forget that it won’t work in SimCoupe until the next release!

Atom Lite CF support

With Edwin’s help, I’ve just finished adding Atom Lite 1.x support to both SimCoupe and SamDisk.

The new interface is a simplified version of the original Atom HDD interface, and is now primarily for Compact Flash use. The Atom Lite uses an ATA feature for 8-bit data accesses, rather than normal 16-bit IDE mode, avoiding the need for half the data to be latched inside the interface. The change simplifies the design and allows faster data transfers - the next data byte is now fetched with a single IN, rather than having to select the high or (latched) low address first. Streamed media playback anyone?

The new interface requires updated B-DOS and HD-BOOT ROM versions to select 8-bit mode, but once set it’s software compatible with the original interface. Data can be read from both &F6 and &F7 ports as before, despite no latching being involved this time. However, the change does mean the byte order of the Atom Lite media is reversed (or perhaps un-reversed!) compared to the Atom, which returned the high byte first. Fear not, existing Atom media can be converted to use Atom Lite byte-order using SamDisk!

The changes to SimCoupe were mainly to the ATA emulation, with enhancements to support 8-bit data mode and 28-bit LBA sector addressing. The latter allows support for devices beyond the 8GB CHS limit (16383 cylinders, 16 heads, 63 sectors), extending the maximum size to a whopping 137GB. Even an 8GB card would contain almost 10,000 B-DOS records, which could easily contain every SAM software title ever written! The Atom Lite implementation is just a cut-down version of the existing Atom C++ class, which has been further simplified as part of the same changes.
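The two size limits are simple products, shown here in Python. The 512-byte sector size is an assumption (the usual ATA value), and the results are in decimal gigabytes.

```python
SECTOR_BYTES = 512  # assumed standard ATA sector size

# CHS limit: 16383 cylinders x 16 heads x 63 sectors per track.
chs_bytes = 16383 * 16 * 63 * SECTOR_BYTES

# 28-bit LBA: 2^28 addressable sectors.
lba28_bytes = (1 << 28) * SECTOR_BYTES

print(f"CHS limit:   {chs_bytes / 1e9:.2f} GB")
print(f"LBA28 limit: {lba28_bytes / 1e9:.2f} GB")
```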

The SamDisk changes were also fairly trivial, especially as there’s no ATA emulation to worry about. The byte order of the media is determined by examining the BDOS signature at offset 0xe8 in the first record (which follows the boot sector and record list). With the original Atom (seen as “DBSO”) data accesses must be byte-swapped after reads and before writes. This allows all record-level commands to work transparently on both media. A new command-line option (/bs) forces byte-swapping of entire images, used for the Atom <-> Atom Lite conversion mentioned above. Simply read the device to an HDF image using the byte-swap option, then write the converted image back to the original device.
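The byte-order detection and conversion can be sketched as follows. This is illustrative Python rather than SamDisk’s code; the function names are hypothetical and the caller is assumed to have already located the first record.

```python
def swap_pairs(data: bytes) -> bytes:
    """Swap each pair of adjacent bytes (Atom <-> Atom Lite byte order)."""
    if len(data) % 2:
        raise ValueError("data length must be even")
    out = bytearray(len(data))
    out[0::2] = data[1::2]
    out[1::2] = data[0::2]
    return bytes(out)

def needs_swap(first_record: bytes, offset: int = 0xE8) -> bool:
    """Return True if the signature shows original Atom (byte-swapped) media."""
    signature = first_record[offset:offset + 4]
    if signature == b"BDOS":
        return False
    if signature == b"DBSO":
        return True
    raise ValueError("no BDOS signature found")

record = bytes(0xE8) + b"DBSO" + bytes(20)
if needs_swap(record):
    record = swap_pairs(record)
print(record[0xE8:0xEC])
```

Since swapping is its own inverse, the same routine serves for both reads and writes, and for whole-image conversion in either direction.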

The Atom Lite 2.x boards are expected to include a Dallas clock chip, with registers accessed through the same floppy 2 ports. SimCoupe support will be added once the details have been confirmed…

SID Player v1.1

I’ve updated SAM SID Player to version 1.1, addressing some issues with the original version:

  1. Updated 6502 core
    The recent core enhancements mean it’s now possible to trap SID writes from all instructions, without the need for hard-coded checks. Control register re-triggering now works correctly in all tunes rather than just the few previously supported cases.

    Unfortunately, limited program space prevents the full 65C02 core being used, so the extra instructions have been replaced by NOPs of the appropriate size. This is still better than the previous behaviour of failing if an undocumented instruction was encountered. The updated core also includes a bug fix to the indirect indexed addressing using X, which wasn’t performing the indirect lookup correctly.

  2. Additional playback rates
    The previous version supported only 50Hz playback using SAM’s frame interrupt. This worked well with most tunes (taken from PAL C64 titles), but it made 60Hz NTSC tunes (such as Fairlight) sound sluggish, and anything requiring 100Hz or above sounded awful.

    Generating 60Hz on a 50Hz machine is a bit of a challenge, requiring synchronisation to 6 different points across the frame, advancing to the previous point in the next frame to achieve the correct playback rate. In our case it also needs to work without stealing too much CPU time from the 6502 emulation running in the background. The 6 sync points divide the 312 raw display lines into 52-line segments. The first point is simply the frame interrupt, which is nice and easy. With a 1-line adjustment, the final 4 points fall on the main screen area, and can be synchronised to using line interrupts at screen lines 35, 87, 139 and 191.

    That just leaves the second point at display line 52, which is 68-52=16 lines above the main screen area. Busy-looping from the frame interrupt would waste 1/6 of the total frame time, and the point is too early for a line interrupt… but not for another technique. MIDI writes are output at a fixed 31.25Kbaud, and generate an interrupt to signal when the transfer has completed, even if there’s no device present to receive the write. Using a cascading sequence of MIDI writes starting from the frame interrupt, we can regain control at the required point without having to wait for it. There is some interrupt processing overhead, but any remaining time is free for the main 6502 emulation.

    A number of SID tunes also use the C64 programmable timer to generate custom speeds, which can be used to make the playback speed independent of PAL/NTSC model. 50/60Hz timers are supported the same way as PAL/NTSC tunes, as described above. 100Hz is used by a few tunes, and can be supported by adding a single line interrupt in the middle of the frame (312/2-68 = line 88).

    The tune playback speed is detected automatically, using the speed bit array in the SID tune header and the active C64 timer frequency, with 50Hz used for other cases. If the playback rate is close enough to one of the supported speeds then it will be used instead. You can also override the playback speed with the following keys: 1=100Hz, 5=50Hz and 6=60Hz.

  3. Large tune support
    To simplify relocating the SID tune, the previous version required the tune be loaded at 49152 with a maximum size of 16K. This could be expanded to 28K by allowing the tune to be loaded directly after the 4K player code at 36864. That still doesn’t give enough space to load the 49K Ghouls n Goblins SID, which fills most of the available C64 RAM.

    The new version now works with tunes up to the full 64K, including those that span the I/O area from &d000-dfff (which is where the SID player code runs). On the first playback the tune is relocated to the correct address, with subsequent plays using the existing player to save time. As with the previous version, a fresh copy of the SID player code is copied for each playback, to minimise the risk of tune players overwriting parts of it.

  4. Keyboard control tweaks
    The new version adds a mask for keys to ignore during playback, allowing the caller to limit the key selection causing the player to terminate. This allows the Next key to be ignored when there is no next tune to play, etc.
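The line numbers quoted for the 60Hz and 100Hz playback rates in item 2 can be reproduced with a little arithmetic. The Python below is a worked check, not player code. The 68-line offset to the main screen comes from the sums in the text; the 64µs line time (384 tstates at 6MHz) and the 10-bit MIDI byte framing are assumptions used for the MIDI estimate.

```python
LINES_PER_FRAME = 312
TOP_BORDER_LINES = 68     # display lines above the main screen

# Six evenly spaced sync points per frame, in raw display lines.
step = LINES_PER_FRAME // 6
points = [i * step for i in range(6)]

# The last four fall on the main screen once shifted up by one line.
screen_lines = [p - TOP_BORDER_LINES - 1 for p in points[2:]]

# Each 60Hz tick is 5/6 of a frame after the last, so the sync point used
# steps backwards through the list from one frame to the next.
tick_order = [(5 * k) % 6 for k in range(6)]

# 100Hz needs one extra interrupt half way down the frame.
line_100hz = LINES_PER_FRAME // 2 - TOP_BORDER_LINES

# Rough MIDI estimate: 10 bits per byte at 31250 baud is 320us per byte.
LINE_US = 64
MIDI_BYTE_US = 10 * 1_000_000 // 31250
midi_bytes_to_line_52 = (points[1] * LINE_US) // MIDI_BYTE_US

print(points, screen_lines, tick_order, line_100hz, midi_bytes_to_line_52)
```

Under those assumptions, around ten back-to-back MIDI byte times pass between the frame interrupt and display line 52.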

I’ve updated the sidplay page with the new source code, which can be assembled directly to a disk image using pyz80.

You can also download a preview disk (175K) containing 37 sample SID tunes. You’ll need a Quazar SID interface board for your SAM to hear anything, of course!

AVI recording

I’ve now done the bulk of the work needed for SimCoupe’s AVI recording feature, using 8-bit MS RLE encoding as planned. It gives lossless video and (optionally) audio recording at up to full framerate, and should be no problem for most systems to handle in parallel to the emulation. The encoder was written from scratch, meaning it’s available on all platforms without additional library dependencies, as well as being tuned for emulator use.

Video is recorded in either half size or full size modes, with the latter needed to preserve mode 3 hi-res pixels. Half size mode defaults to storing the odd pixels for the best quality in SAM BASIC, but even pixels can optionally be used instead. Full size mode also respects the scanline option and intensity, so it should look the same as when using the emulator. Only the 5:4 mode is ignored, as AVI doesn’t support aspect ratios, and proper smooth scaling of the image can’t be done due to the limited 8-bit video format.

Audio is stored in the same 44.1kHz 16-bit PCM format as the separate sound recording feature. Unfortunately there aren’t any widely supported lossless codecs I can rely on, so the sound data remains uncompressed. This requires 3528 bytes per frame, adding a steady 172K/s to the recording. Of course, post-processing the audio stream into MP3 format reduces the file size without impacting the video quality.

The recording framerate uses a simple 1:n method to decide which frames to store, giving framerates of 50fps, 25fps, 16.67fps, … down to 1fps. Key frames are stored every 5 seconds to give faster seeking during video playback, and allow catch-up if the decoder is struggling to keep up (possible with full-size 50fps videos on slower systems).
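The audio overhead and the available framerates can be checked with a few lines of Python. Two channels are assumed here, since that is what makes the 3528 bytes per frame figure work out.

```python
SAMPLE_RATE = 44100       # Hz
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHANNELS = 2              # assumed stereo, implied by 3528 bytes per frame
FRAMES_PER_SECOND = 50

bytes_per_second = SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS
bytes_per_frame = bytes_per_second // FRAMES_PER_SECOND
kib_per_second = bytes_per_second / 1024

# 1:n frame selection: store every nth emulated frame.
framerates = [round(FRAMES_PER_SECOND / n, 2) for n in range(1, 51)]

print(bytes_per_frame, round(kib_per_second, 1))
print(framerates[:3], framerates[-1])
```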

As with GIF, only the differences are stored between most frames. MS RLE includes a “00 02 xx yy” GOTO sequence to skip up to 255 pixels in both horizontal and vertical directions, to position at the next change point. Identical frames are stored as a 2-byte 00 01 sequence, adding 10 bytes to the recording including chunk headers. Not as efficient as GIF but much better than most traditional video codecs.
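As an illustration of difference encoding with the GOTO sequence, here is a simplified single-row encoder in Python, with a matching decoder to check it. It is a sketch, not SimCoupe’s encoder: it uses only run codes, horizontal skips and the 00 00 end-of-line escape, and ignores vertical skips and absolute mode.

```python
def encode_row(prev: bytes, cur: bytes) -> bytes:
    """Encode one row of 8-bit pixels as MS RLE, skipping unchanged pixels."""
    out = bytearray()
    x, width = 0, len(cur)
    while x < width:
        # Count unchanged pixels from the current position.
        skip = 0
        while x + skip < width and cur[x + skip] == prev[x + skip]:
            skip += 1
        if x + skip == width:
            break                                # rest of the row is unchanged
        while skip:
            dx = min(skip, 255)
            out += bytes([0, 2, dx, 0])          # GOTO: right dx, down 0
            x += dx
            skip -= dx
        # Emit a run of identical pixel values (count, value).
        value, run = cur[x], 1
        while x + run < width and run < 255 and cur[x + run] == value:
            run += 1
        out += bytes([run, value])
        x += run
    out += bytes([0, 0])                         # end of line
    return bytes(out)

def decode_row(prev: bytes, data: bytes) -> bytes:
    """Apply an encoded row on top of the previous frame's row."""
    row, x, i = bytearray(prev), 0, 0
    while True:
        count, value = data[i], data[i + 1]
        i += 2
        if count:
            row[x:x + count] = bytes([value]) * count
            x += count
        elif value == 0:
            return bytes(row)
        elif value == 2:
            x += data[i]                         # dx (dy is always 0 here)
            i += 2
        else:
            raise ValueError("escape not handled in this sketch")

prev_row = bytes(10)
cur_row = bytes([0, 0, 5, 5, 5, 0, 0, 7, 0, 0])
encoded = encode_row(prev_row, cur_row)
print(encoded.hex(" "))
```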

Here’s a 2.4MB sample video of the first level of Manic Miner. It was recorded in full resolution (576×480) at 25fps, with sound post-processed into 192Kbps MP3 using VirtualDub:

Sample Video

Apple 1 Emulator

This will probably be my last emulator for a while so I can return to normal projects. I’d wanted to emulate the Apple 1 for quite a while, and didn’t think it should take more than a couple of hours to make a usable emulator.

The Apple 1 is a surprisingly simple device, with 1MHz 6502 CPU, 4K RAM (expandable to 32K) and a tiny 256-byte monitor ROM. Slots on the main board allowed for add-on ROMs for BASIC, cassette functions, assembler, etc. The user manual includes comprehensive hardware details plus a fully commented disassembly of the monitor ROM.

There aren’t many original Apple 1 devices anymore, but there are a few modern replicas available. Most of them use the 65C02 CPU, so it’s probably just as well I added support for it recently! The original ROMs don’t use the extended instructions but a 3rd party assembler supports them, so it was worth having them covered.

Input and output is via a dumb terminal style interface, supporting only upper-case letters and a slightly cut-down symbol set. I/O speed is tied to the terminal display, giving a 60 characters/second maximum on the original. Both input and output have data and control ports, with the latter used to indicate whether the terminal is busy outputting a character or has a key available for input.

For the emulation, the limited terminal output speed means at most 1 character (plus cursor) needs to be drawn each interrupt. Until that is processed the terminal appears busy and the running program will wait before outputting more. SAM’s 50Hz interrupt frequency reduces the output speed slightly, but not by enough to worry about. Adding a line interrupt in the middle of the frame (312/2-68 = line 88) doubles this to 100Hz very easily, so Sym-1 and Sym-2 can be used to change the terminal speed.

When a character is written or a key is read, the terminal must update the control ports with the new status. This must happen as part of the read/write to prevent the running program doing anything further until it has been processed. The 6502 core was enhanced to trap memory writes for the Orao emulator, so the display could be updated immediately. The Apple 1 emulation also needs the same enhancement for memory reads, so it can update the input control port.

The output terminal is 40×24 characters, giving a maximum character set size of 6×8 pixels for a 256×192 mode 2 SAM screen mode. Mode 3 would have allowed up to 12×8 thin pixels, but there wasn’t really much benefit from the extra resolution, and the 24K display was 4 times slower to scroll. Perhaps the only drawback in using mode 2 is that Spectrum-style masking and rotating is needed to draw each character.

Input is entirely character based, and doesn’t need support for multiple simultaneous key presses (just Shift for symbols). For that reason I decided to use SAM’s ROM keyboard scanner rather than rolling my own version. All I had to do was page the ROM in and ask for the next key, with the ROM debouncing the input and buffering fast typing. The returned key symbols are then converted to Apple 1 keys, converting lower-case to upper-case, and mapping a few special keys including Delete, Tab and Escape.

Normal monitor use is 100% speed, as most of the time is spent waiting for key presses. To test the underlying speed I ran an empty loop in BASIC: FOR X=1 TO 1000 : NEXT X which takes 1.2 seconds on the original device and 10 seconds in my emulator. The 10-15% running speed matches the results for the Orao emulator, and probably applies to anything else that uses my 6502 core.

The completed emulator (plus source code) is now available on my website.

Orao Emulator

This emulator started as a quick test of my 6502 core, to see if it could run the Orao ROMs. I half expected it to fail due to lack of decimal mode or interrupt support, neither of which was implemented in the SID player core. It took just 20 minutes of hacking the SID player source code to reach the point where I could see the flashing input cursor, and it would have been a crime not to continue…

Keyboard input was transplanted from the matrix scanning in the Galaksija emulator, though due to the weird memory mapping layout, the Orao table needs 3-byte (address+value) entries for each key. The bulk of the addresses were taken from the Windows Orao emulator source, though there were a few minor errors that I’ve corrected (one of them stopping Up working in Boulder Dash).

I considered updating the display during interrupt processing, but the large (256×256 = 8K) display was too much to convert every frame. Splitting the frame into 8 or 16 strips would have reduced the impact on the CPU emulation, but made the updates too obvious and laggy. It seemed better to update the display live by catching writes to the display memory. Unlike emulators that run Z80 code natively, we have full control of the 6502 CPU and can filter the writes as they happen.

One approach was to modify any instruction that could write to the display, but that would require a lot of duplicate code. Fortunately, each of the writes formed the target address in HL, where it remained until the point it jumped back to main_loop (next instruction fetch). I simply had to define a new looping point, and use that instead of main_loop for any display write candidates. Zero-page writes (&0000-&00ff) couldn’t affect the display (&6000-&7fff), so they were ignored.

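Display writes were filtered using a simple check:
LD A,H                    ; high byte of the 6502 write address
CP &60                    ; display memory starts at &6000
JR NC,screen_or_up        ; &6000 and above needs a closer look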

Using JR instead of JP meant the fall through case was only 7 tstates instead of 10 tstates. The total display write checking overhead was 4+7+7 = 18 tstates (plus contention) for normal RAM writes, which didn’t seem too bad. Further address filtering could also be done for sound writes at &8800, without further slowing of the normal RAM write path. No other addresses were of interest to us, so they were ignored.
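
The filtering order can be modelled in a few lines of Python. This is a sketch of the decision logic only, using the display and sound addresses given in the text; the function name and the returned labels are invented for the example.

```python
DISPLAY_START = 0x6000   # Orao display memory, from the text
SOUND_PORT = 0x8800      # Orao sound address, from the text


def classify_write(addr):
    """Return which handler a 6502 write needs, testing only the high byte first."""
    high = addr >> 8
    if high < (DISPLAY_START >> 8):   # the common case: plain RAM, fall through
        return "ram"
    if high < 0x80:                   # &6000-&7fff
        return "display"
    if addr == SOUND_PORT:            # checked only once off the fast path
        return "sound"
    return "other"
```

Only the first comparison sits on the normal RAM write path, so the extra tests for display and sound cost nothing for ordinary writes.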

At first glance the Orao display seems perfectly suited to SAM’s mode 2 layout, with both using linear addressing, 32 bytes per line, and 8 pixels per byte. The biggest difference is Orao’s 256 lines vs SAM’s 192, where clipping or scaling of the display would be needed. Unfortunately, the bit order within each Orao display byte is also reversed compared to SAM, ruling out a simple memory copy to update the display.

Lookup tables to the rescue! Using a 256-byte table for the bit-reversing was a no-brainer, but the display mapping was more awkward. My first thought was to use a line mapping table, mapping from Orao line to SAM line, with &c0 entries for lines that weren’t visible. That still required too much arithmetic to look up an address, then add the line offset from the low 5 bits of the original address. Whatever I used would be done for every byte written to the display, so it had to be fast.
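
For reference, a 256-byte bit-reversing table of the kind described can be generated as below. This is a Python sketch of how such a table could be built, not how the emulator builds its own; the names are hypothetical.

```python
def reverse_bits(value):
    """Reverse the bit order of an 8-bit value."""
    result = 0
    for _ in range(8):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


# One entry per possible display byte, so a reversal is a single indexed read.
REV_TABLE = bytes(reverse_bits(i) for i in range(256))
```

Page-aligning such a table is what lets a Z80 lookup use the data byte directly as the low half of the table address.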

The 12K needed for the mode 2 display meant there wasn’t room in the normal 64K address space, so I was already paging to access it. That left over 16K of spare space in the 32K paging window. Rather than looking up display lines, I had enough space to map every byte on the Orao display to the final SAM address. This also gave the flexibility needed to pan any 192-line view of the original 256-line display, and even to scale the original display to fit, without any additional overhead.

As with the 6502 instruction handler addresses, the display table was ordered with address low bytes in the lower half and the address high bytes in the upper half. That allowed a SET/RES instruction to switch halves during the lookup, which is twice the speed of using add/sub on the high byte instead. Orao display bytes outside the visible area are mapped to SAM line 192, just beyond the visible display.
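
The split table layout can be sketched in Python as follows. It assumes the table occupies &4000-&7fff of the paging window and that SAM display addresses start at 0 with 32 bytes per line; the top_line parameter and function names are hypothetical.

```python
ORAO_BASE = 0x6000        # Orao display start, from the text
BYTES_PER_LINE = 32
ORAO_LINES, SAM_LINES = 256, 192


def build_display_map(top_line=32):
    """Return a 32K window holding the split address table at &4000-&7fff.

    top_line (hypothetical) selects which Orao line appears at the top of the view.
    """
    window = bytearray(0x8000)
    for offset in range(ORAO_LINES * BYTES_PER_LINE):
        orao_line, column = divmod(offset, BYTES_PER_LINE)
        sam_line = orao_line - top_line
        if not 0 <= sam_line < SAM_LINES:
            sam_line = SAM_LINES          # hidden bytes land on line 192
        sam_addr = sam_line * BYTES_PER_LINE + column
        high_slot = ORAO_BASE + offset    # upper half: same address as the write
        low_slot = high_slot & ~0x2000    # lower half: bit 5 of the high byte cleared
        window[high_slot] = sam_addr >> 8
        window[low_slot] = sam_addr & 0xFF
    return window


def lookup(window, orao_addr):
    """Two reads: high byte first, then flip one address bit for the low byte."""
    return (window[orao_addr] << 8) | window[orao_addr & ~0x2000]
```

Panning only means rebuilding the table with a different top_line, and a scaled view could be produced by changing how sam_line is derived, with the per-write lookup cost unchanged.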

Everything seemed perfect at this point, until I realised I needed to preserve the 6502 PC value in DE. The core also crammed 6502 registers into almost every other Z80 register, leaving little room to juggle paging, the original address and a new address lookup. The only register-based option to preserve DE was to use IX, at a cost of 16 tstates each way. That was still 4 tstates faster than pushing DE around the block, once stack memory contention was included.

Here’s the final screen write code:

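ld ixh,d                  ; save the 6502 PC (DE) in IX
ld ixl,e
ld e,(hl)                 ; E = byte just written to Orao display memory
ld d,rev_table/256        ; DE = bit-reverse table entry for that byte
ld a,screen_page+rom0_off
out (lmpr),a              ; page in the SAM screen and address table
ld a,(de)                 ; A = bit-reversed display byte
ld d,(hl)                 ; D = SAM address high byte (upper half of table)
res 5,h                   ; switch to the lower half of the table
ld e,(hl)                 ; E = SAM address low byte
ld (de),a                 ; write the byte to the SAM display
ld a,low_page+rom0_off
out (lmpr),a              ; page the normal low memory back in
ld d,ixh                  ; restore the 6502 PC
ld e,ixl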

The 6502 core got a few additional upgrades along the way, with the first being a boost to 65C02 support. This added a new addressing mode, and a handful of new instructions (many sorely lacking from the base 6502). A side-effect of this was that undocumented instructions were guaranteed NOPs (1 to 3 bytes in length), so I didn’t have to worry about the hybrid undocumented instructions in the original chip.

Decimal mode was finally added too, in just 20 bytes of extra code. I simply needed to optionally execute a DAA after the adc/sbc calculations to make the necessary adjustment. The DAA was patched with a NOP when in normal binary calculation mode, for the non-BCD behaviour.
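
The idea of a binary add followed by an optional adjustment can be modelled in Python. This sketch covers addition only and follows the documented behaviour of the Z80’s DAA after an add; it is not the emulator’s code, and the function names are hypothetical.

```python
def daa_after_add(result, half_carry, carry):
    """Decimal-adjust an 8-bit binary sum, as the Z80's DAA does after an addition."""
    adjust = 0
    if half_carry or (result & 0x0F) > 9:
        adjust |= 0x06
    if carry or result > 0x99:
        adjust |= 0x60
        carry = True
    return (result + adjust) & 0xFF, carry


def adc(a, operand, carry_in, decimal_mode):
    """Add with carry; the adjustment is skipped in binary mode (the patched-in NOP)."""
    total = a + operand + carry_in
    half_carry = ((a & 0x0F) + (operand & 0x0F) + carry_in) > 0x0F
    result, carry = total & 0xFF, total > 0xFF
    if decimal_mode:
        result, carry = daa_after_add(result, half_carry, carry)
    return result, carry
```

For example, adding &19 and &28 gives &41 in binary mode, and the adjustment turns that into &47 in decimal mode.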

There were no interrupts to handle for the Orao, but I completed the implementations of BRK (call maskable interrupt handler) and RTI (return from interrupt), so they could be used if anything tried. As they’re untested, I set the emulator border colour to green to show when BRK has been used. This actually happens when running Space Invaders, though that seems to be due to a suspected corrupt image. The BRK instruction is &00, so it’s quite likely to get called if execution jumps to a random memory location.
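
For readers unfamiliar with the two instructions, here is a Python sketch of the standard 6502/65C02 behaviour of BRK and RTI. It is a model based on general 6502 documentation, not the emulator’s Z80 implementation, and the class layout is hypothetical.

```python
IRQ_VECTOR = 0xFFFE
FLAG_B, FLAG_D, FLAG_I = 0x10, 0x08, 0x04


class Cpu:
    def __init__(self, memory):
        self.mem = memory                  # 64K bytearray
        self.pc, self.sp, self.p = 0, 0xFF, 0x20

    def push(self, value):
        self.mem[0x0100 + self.sp] = value & 0xFF
        self.sp = (self.sp - 1) & 0xFF

    def pull(self):
        self.sp = (self.sp + 1) & 0xFF
        return self.mem[0x0100 + self.sp]

    def brk(self):
        # pc is assumed to point at the BRK opcode itself.
        return_addr = (self.pc + 2) & 0xFFFF     # BRK skips a padding byte
        self.push(return_addr >> 8)
        self.push(return_addr)
        self.push(self.p | FLAG_B)               # B is set in the stacked copy only
        self.p = (self.p | FLAG_I) & ~FLAG_D     # the 65C02 also clears decimal mode
        self.pc = self.mem[IRQ_VECTOR] | (self.mem[IRQ_VECTOR + 1] << 8)

    def rti(self):
        self.p = (self.pull() & ~FLAG_B) | 0x20  # restore flags, then the return address
        low = self.pull()
        self.pc = low | (self.pull() << 8)
```

Because BRK shares the vector at &fffe with hardware interrupts, a stray &00 byte ends up in whatever handler the ROM installed there.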

As a result of the 65C02 change the emulator now runs Manic Miner correctly, a game which crashes under the Windows versions due to incorrect undocumented instruction handling. The decimal mode addition also fixes the score updating in Manic Miner and the timer count-down in Boulder Dash. Space Invaders will need redumping from the original tape for it to work correctly.

The final emulation speed is typically 10-15% of the original machine speed, dropping further during heavy display writes due to the screen write overhead described above. The mix of 6502 instructions also makes a difference, with heavy indexing requiring more calculations for the CPU emulation. It still runs surprisingly quickly considering everything it’s doing, and on a machine produced only a few years later.

The completed emulator is now available on my site, with the source code following soon.