The library was developed for projects that we did using LPC23xx chips.
While there is nothing LPC specific in the library per se, you can find a few useful things for the LPC in the package. In particular, there are a few tools oriented towards these chips; see LPC2xxx related tools running on the host. Also, you can find header files that describe the peripherals on these chips. There are some examples that were tested on an LPC2378 based development board, namely the Olimex LPC-2378-STK. You can find them in the examples subdirectory under the installation directory. The examples as well as the chip header files are provided under a BSD-style license: you can use and distribute them any way you like as long as you keep the copyright notice in them intact and you accept full liability (and in particular, you do not hold us liable) for any consequence of your using or distributing them, even if those consequences were due to the content of those files being erroneous or harmful in some way.
While working with these chips we found a couple of possible pitfalls, which we thought would be helpful to warn about. On this page you will find some of our observations.
When you install the library, in the include directory there will be a chips subdirectory. In that you can find headers for a few chips, including the LPC23xx series itself.
You will find the LPC specific header files under the LPC23xx subdirectory. There are other files as well as subdirectories in which some other chips are defined. It costs us nothing to put them there and they might be useful for others, so they're there. There are no docs for them apart from the comments in them. Generally either the files themselves or the subdirectory they live in is named after the chip(s) they describe.
Most definition files define only the registers, along these lines:
#define SOME_REG_NAME (*(volatile int *) 0xADDRESS)
That has two problems. One, C preprocessor symbols are not known to the debugger. Two, since the register fields are not defined, one tends to write code similar to this:
// Program the desired transfer mode, error correction method,
// data length and enable reception.
SOME_REG = 0x207104f;
which is not exactly easy to maintain. Especially if the comment is not there, as is often the case.
So, these header files define LPC modules as structures and define the structure base address as a C constant. That makes it possible to refer to registers by symbolic names in a debugger. I was contemplating using anonymous enums for register bit values for the same reason, but for the time being I use preprocessor symbols. It might change later, if I have a compelling reason to do so, but for the time being it is #define-s for register bits.
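As an illustration of the approach, here is a minimal sketch. The module, its register names, the field macros and the address 0xE0001000 are all made up for the example; they are not taken from the headers in the package. The function takes the module pointer as an argument only so that the sketch can be tried on a host against a fake register block.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical module with three 32-bit registers (names invented). */
typedef struct {
    volatile uint32_t ctrl;     /* offset 0x00: control register */
    volatile uint32_t stat;     /* offset 0x04: status register  */
    volatile uint32_t data;     /* offset 0x08: data register    */
} demo_mod_t;

/* Hypothetical register field definitions, kept as preprocessor symbols. */
#define DEMO_CTRL_ENABLE    ( 1u << 0 )
#define DEMO_CTRL_LEN( n )  ( ( (uint32_t)( n ) & 0x0fu ) << 4 )

/* On the target the base address would be a C constant rather than a
 * #define, for example (address invented):
 *
 *     static demo_mod_t * const lpc_demo = (demo_mod_t *) 0xE0001000;
 */

/* Program the transfer length and enable the module. */
static void demo_start( demo_mod_t *mod, unsigned len )
{
    mod->ctrl = DEMO_CTRL_LEN( len ) | DEMO_CTRL_ENABLE;
}
```

With such a definition the register write reads as lpc_demo->ctrl = DEMO_CTRL_LEN( 8 ) | DEMO_CTRL_ENABLE instead of a bare hexadecimal constant.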
The register and register field names match the ones used in the NXP documentation, to a degree. If you read the NXP manuals you will see that they are anything but consistent in their naming conventions. Sometimes they use all caps, other times they use camel case, other times just a textual name is given as 'symbol'. A bit that enables something might be called E or EN or Enab or Enbl or enable, depending on the module. A status register can be called Stat, ST or status. Sometimes within the same module they have inconsistencies. The most horrific example is the Ethernet controller, which has registers called 'MAC1' and 'RxDescriptorNumber' and has register bits with names like 'PASS ALL RECEIVE FRAMES' and 'RegReset' and 'Length out of range' and 'VLAN'. All in the same module. Since I don't think that definitions like
#define ETH_RSV_CARRIER_EVENT_PREVIOUSLY_SEEN value
#define ETH_IPGR_NON_BACK_TO_BACK_INTER_PACKET_GAP_PART2(x) value
are such a good idea (may look OK in your browser, but the second symbol name is 47 characters long), some changes had to be made.
So, I did the following:
lpc_sio3->rbr
while the Ethernet controller's RxProduceIndex is accessed as
lpc_eth->rxproduceindex
lpc_dma->chan[ 0 ].srcaddr
lpc_dma->chan[ 1 ].srcaddr
#define MOD_REG_FIELD(n)
#define MOD_REG_FIELD_CODE1
#define ...
#define MOD_REG_FIELD_CODEn
#define MOD_REG_FIELD mask_value
The name of the structure that describes a module is lpc_??? where ??? is the module name. The module names are all 3 letters. They are the following; the '?' at the end of the name indicates that there is more than one instance of that kind of module:
scb   System control block
vic   Vectored interrupt controller
wdt   Watchdog timer
pcb   Pin connect block
pio   Ports, APB access
fio   Ports, AHB access (defined in the pio.h header)
spi   SPI
rtc   Real-time clock
adc   A/D converter
dac   D/A converter
pwm   PWM controller
eth   Ethernet controller
usb   USB device controller
otg   USB On-The-Go controller
hci   USB host controller
i2s   I2S controller
mci   SD/MMC controller
can   CAN controller (2 controllers + common filtering in 1 module)
dma   General purpose DMA controller
emc   External memory controller
iap   Definitions for the IAP routines
sio?  UARTs (0-3)
ssp?  SSPs (0-1)
tim?  Timers (0-3)
i2c?  I2C controllers (0-2)
Note that the scb module contains some modules that are treated separately in the manual, but live in the same address block and provide system control functionality, so they are lumped together. Namely, scb contains the following manual chapters:
The LPC23xx related header files are located under the include/chips/LPC23xx subdirectory under your install location. Within it you will find the module headers under the 3 letter module name with a .h extension, that is, the SSP will be in ssp.h, the UART in sio.h and so on. The module headers describe modules to their fullest extent; it is up to you to know that some modules, combinations, registers etc. do not exist for a particular module or chip. For example, the UART definition describes a UART that has full modem control (even though those registers exist only in UART1) and IrDA mode register (even though that exists only in UART3). The USB controller contains all the OTG and host controller registers, even though not all chips have OTG and host controllers and some do not have a USB controller at all.
Furthermore, you will find files under the names of 2361.h, 2362.h and so on, up to 2388.h. Those files contain the definitions that are specific to the particular chip. The name identifies the chip: 2364.h is for the LPC2364. The things defined in these files include the presence or absence of various peripheral modules, the base address and size of all memory blocks, the chip ID, FLASH segment sizes and the like. They all have the exact same format, just the defined values are different.
The SSP controllers are very handy, but they have a couple of issues. One is their SPI compatibility. I only used them in that mode, so I can not comment on the other modes. Others are related to the FIFOs.
Motorola specified the SPI as a method of transferring one or more bytes through the bus in one transaction. Now the NXP interpretation is somewhat different. In modes where the CPHA is 1 (known as SPI modes 1 and 3), the NXP and the Motorola SPI work the same way. However, when the CPHA is 0 (modes 0 and 2), NXP deactivates the slave select between bytes. That, of course, makes SPI peripherals with multi-byte transfers rather confused. Fortunately, there's a fairly simple workaround when the LPC23xx is the master: you control the slave select yourself, which would be the case anyway if you have more than one SPI slave on the bus.
The real problem is when the LPC23xx is the slave. It is because it does not consider a transfer of a single byte complete until the slave select (which is now an input) is negated (i.e. goes high). I don't know any workarounds, apart from not using SPI modes 0 and 2 when the LPC23xx's SSP is the SPI slave.
Those wonderful SSP FIFOs are a real saviour. You don't have to jump in and out of an interrupt for every byte. When you are the SPI master, you put all the bytes you want to send into the FIFO (or the last few bytes of the transfer, if you have more bytes than the FIFO size) and you leave the SSP on its own.
When the last byte is transferred, it will request an interrupt, you unload the receive FIFO, deselect the slave and you're done. Or are you...
There's a little problem with that "it will request an interrupt" bit. There's no Tx FIFO empty interrupt. There's a Tx FIFO half empty interrupt, but that does not help you here. There is no way to know how many bytes are in the Tx FIFO and the only notification you get is that it holds fewer bytes than half the FIFO size. Of course one could arrange the transfer cleverly and use the receive FIFO not empty interrupt instead. Except that there is no such interrupt either.
There is a workaround, though. There are two cases to consider.
The SSP has FIFOs, both on the transmit and receive sides. That's very convenient, except in one particular case, namely when the SSP is a slave and you deal with variable length multibyte transfers.
In such cases the other side is the one that dictates the speed of the transfer. Therefore, to avoid transmit underrun, you really want to fill your Tx FIFO to the brink. The problem is that if the master finishes (or aborts) the transfer before all the data you stuffed into the FIFO goes out, then you are stuck with the leftover data still in your transmit FIFO. When the master selects you again, it will receive that leftover first instead of the new data you put into the FIFO when you detect that you became selected. In slave mode the negation of the slave select does not flush the transmit FIFO (possibly because in modes 0 and 2 the controller expects the slave select to become negated between consecutive bytes). What's even worse, there is no way to flush a FIFO. There's no FIFO reset, nothing. Even reconfiguring the SSP retains the FIFO content.
I have found a work-around. It is very ugly and time consuming (which means that you must ensure that the master leaves enough time between transactions so that you can do it), but at least it can save you from stray data being transferred.
When a transfer finishes (i.e. you detect that slave select became inactive), you check if the transmit FIFO is empty or not. If it is empty, then you're laughing. Otherwise, you do the following:
The timers are very good, except for one little thing that was overlooked when the timer was designed. It is only an issue if you want the output compare (match) to do something with a signal and you use more than one channel of the timer.
Let's assume that you have an input and when there was a change on that input, you have to generate an L microsec long pulse on an output exactly D microsecs after the input transition. You can connect the input to an input capture of a timer and the output to an output match of the same timer, then write two interrupt routines like this (all interrupt related details are omitted for brevity, just pseudo code is shown):
void CaptureIrq()
{
    int now;
    int action;

    // Get the time of the input change (i.e. the capture), add the delay
    // to it and put it into the match register of the output channel
    now = timer_capture;
    timer_match = now + D;

    // Set up the output channel's mode to force the output pin to
    // high when the match happens
    action = timer_match_action;
    action &= MASK_FOR_OTHER_CHANNELS;
    action |= FORCE_OUTPUT_TO_HIGH_ON_THIS_CHANNEL;
    timer_match_action = action;

    // Set up an interrupt for the match
    timer_interrupt_enable |= OUTPUT_MATCH_FOR_THE_OUTPUT_CHANNEL;
}

void MatchIrq()
{
    int action;

    // Add the pulse width to the match register, which now contains
    // the start time of the pulse
    timer_match += L;

    // Set up the output channel's mode to force the output pin to
    // low when the match happens
    action = timer_match_action;
    action &= MASK_FOR_OTHER_CHANNELS;
    action |= FORCE_OUTPUT_TO_LOW_ON_THIS_CHANNEL;
    timer_match_action = action;

    // With that we're done, no more interrupt.
    timer_interrupt_enable &= ~OUTPUT_MATCH_FOR_THE_OUTPUT_CHANNEL;
}
That works fine, as long as no other channel on this timer is used to control output pins by timer match. If, however, that's not the case, you are in trouble.
The problem is that the timer's EMR register, which controls what happens at a match, also contains the current state of the output bit. If you write to the EMR register, then the output pin will be forced to the state that you wrote to the register. Now this seems to be fine, because when you modify the action that will be taken when the match occurs (in the future) you do not want to change the current state of the output, so you read the EMR, modify the action (but not the current state of the pin) and write it back.
Except that there is a race condition here. Since the EMR contains the mode and current output state for all four channels, if another channel's hardware modifies its output state between your reading and writing the register, you will overwrite what the HW did:
As a result, on channel 1 instead of the output going high, there will be a short positive glitch (as wide as the time between the actual match and your writing the EMR back, probably in the hundred nanosec range) and after that the output will remain low.
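The race can be replayed on a host with a plain variable standing in for the EMR. The sketch below is only a simulation; the macro names are invented and the bit positions (output state bits at 0-3, two-bit action fields from bit 4 up, with the value 2 meaning "set on match") are an assumption that should be checked against the manual.

```c
#include <stdint.h>

#define EMR_EM( ch )        ( 1u << ( ch ) )                /* current output state of channel ch */
#define EMR_EMC_MASK( ch )  ( 3u << ( 4u + 2u * ( ch ) ) )  /* action-on-match field of channel ch */
#define EMR_EMC_SET( ch )   ( 2u << ( 4u + 2u * ( ch ) ) )  /* action: drive the output high */

/* Simulates software changing channel 0's action while channel 1's
 * match fires between the read and the write of the register. */
static uint32_t emr_rmw_race( uint32_t emr )
{
    uint32_t copy = emr;                    /* software reads the EMR */

    emr |= EMR_EM( 1 );                     /* hardware: channel 1 matches, output goes high */

    copy &= ~EMR_EMC_MASK( 0 );             /* software edits channel 0's action only ... */
    copy |= EMR_EMC_SET( 0 );

    emr = copy;                             /* ... but the write-back carries the stale channel 1 state */
    return emr;
}
```

Starting from a register in which channel 1 is programmed to go high on match and its output is low, the returned value has channel 1's output bit low again, even though the hardware had set it.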
There's not much you can do about it. If you can use separate timers for all your outputs, that eliminates this issue. If you can't, then you can try a workaround, assuming that you have enough time to actually do it. Before you modify the EMR, you check if there are any output match values within a few microsecs of the current timer value and if yes, you just wait until the timer's free running counter passes them (of course you only have to check channels that will modify their output on a match). Then you can be sure that there will be no match on other channels while you are modifying the register. It is indeed ugly and it means that you might burn CPU cycles inside an interrupt routine but that's the best you can do, unfortunately. Just make sure that you put a comment in your code explaining why that abomination was written by an otherwise respected professional...
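The "is there a match coming up" check of that workaround could look something like the sketch below. The function name, the guard window and the placeholder register names in the comment are all hypothetical; how wide the window needs to be depends on your timer clock and on how long the read-modify-write takes.

```c
#include <stdint.h>

/* Returns nonzero if 'match' lies within 'window' timer ticks ahead of
 * 'now' (or is happening right now). The unsigned subtraction keeps
 * the comparison correct when the free running counter wraps around. */
static int match_is_imminent( uint32_t now, uint32_t match, uint32_t window )
{
    return (uint32_t)( match - now ) <= window;
}

/* On the target, before touching the EMR, one would spin for each other
 * channel that changes its output on a match (placeholder names):
 *
 *     while ( match_is_imminent( timer_counter, other_match, GUARD_TICKS ) )
 *         ;   // wait until the counter has passed the other channel's match
 */
```

Once the counter has passed the match value, the difference wraps to a very large number, so the loop in the comment terminates on its own.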
The whole problem would not arise if the current state of the output associated with a channel lived in a separate single-bit register. Or if those bits in the EMR were read-only and separate set and clear registers were provided to change them. Or, maybe the simplest solution would have been if the EMR itself had had mask bits to select which output bits to modify and which not (costing 4 AND gates worth of extra hardware altogether). So there are a few simple HW solutions for the problem but the NXP engineers, unfortunately, used none of them - forcing you to resort to a software kludge.
The LPC2xxx series has an In-System Programming (ISP) module, which is a bit of firmware running on the chip and activated at boot time. With that you can interrogate the chip and program the FLASH using a simple RS-232 connection. It is indeed a great feature and a programming tool is provided in the package that uses that, see the lpc-prg, a serial programmer (needs Tcl/Tk) tool and the Serial programming C library. However, there are two small issues with the ISP that you should be aware of.
The manual states that the ISP uses XON/XOFF protocol for software flow control. Alas, the manual is wrong. The ISP does not use any flow control. That is usually not a problem, as you issue a command (a short string) then you wait for the chip to send a reply (another short string). There is one exception: when you upload code into the chip. In that case you can send up to twenty lines, each being up to 63 characters long, to the chip before it sends you anything. Unfortunately the chip can't keep up with the data at higher baudrates and it will lose characters. That, of course, puts you and the chip out of sync. It will not talk to you when you expect its response (it is still waiting for data from you) or even if it responds, it will just tell you that there's something wrong.
There are two solutions to that problem. One is that you use the echo feature of the ISP and you do not send a new line until the previous line was echoed back in full. That works, but can cause extreme slowdown especially if you access the chip through a USB to RS-232 adapter.
The other solution (and that is what the library uses) is to know how long it takes to transfer the line and for the chip to process it and just wait that much before sending the next line. The USB adapter can cause some trouble, but the library takes that into account as well (as much as it can).
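To give an idea of the numbers involved in the timed approach, here is a rough sketch. It is not the code the library uses; the function names are invented, the 10 bits per character assumes 8 data bits with one start and one stop bit, and the processing allowance is a value you would have to find by experiment.

```c
#include <stdint.h>

/* Time, in microseconds, needed to shift 'chars' characters over the
 * serial line at 'baud' bits per second, assuming 10 bits on the wire
 * per character. Rounded up. */
static uint32_t line_transfer_us( uint32_t chars, uint32_t baud )
{
    uint64_t bits = (uint64_t) chars * 10u;

    return (uint32_t)( ( bits * 1000000u + baud - 1u ) / baud );
}

/* Delay to observe before sending the next line: the wire time plus an
 * allowance for the chip to digest the line (a tuning parameter). */
static uint32_t line_delay_us( uint32_t chars, uint32_t baud, uint32_t processing_us )
{
    return line_transfer_us( chars, baud ) + processing_us;
}
```

For a 63 character line plus one terminator at 115200 baud, line_transfer_us( 64, 115200 ) gives 5556 microseconds of wire time alone.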
In any case, if you want to talk to the ISP through the serial port you should know that XON/XOFF is not implemented and that the ISP is slower than the serial line. I submitted a report on this to NXP in 2008, but they didn't even look at my report (it is still in 'new' state, not yet advanced to 'pending', the state indicating that the report was at least read by someone).
The following is a very strange bug. You activate the ISP by resetting the chip and holding the ISP pin of the chip low when the reset is negated. After reset it is always the ISP that starts up. It then checks the vector table checksum. If it is incorrect, it activates, that is, waits for a command on the serial port. If the table is correct, it checks the ISP pin. If it is high, then it jumps to your reset vector and the execution of the code in the FLASH commences. If the ISP pin is low, then the ISP starts up, except if the reset was caused by the watchdog. That makes sense, because the ISP pin is used for other purposes as well. The watchdog resets the LPC2xxx but it does not reset the external hardware. Therefore, an external device that is blissfully ignorant about the demise of the LPC2xxx can pull that pin low for very legitimate reasons that have nothing to do with the re-programming of the LPC chip. The LPC then would enter ISP mode and wait for serial commands, that would never arrive. So it was a very wise decision from the NXP engineers to ignore the ISP pin in case of watchdog reset.
However, the implementation of the above is somewhat lacking. In some cases it can happen that after the chip was reset by the watchdog, you can not reprogram the chip. Even though the external reset (which you must apply to activate the ISP) clears the watchdog bit in the reset source register, the ISP for some unknown reason decides that it does not activate, despite the ISP pin being pulled low. So you seem to be stuck, you can't reprogram the chip, as the ISP does not want to talk to you.
Fortunately, things get back to normal if you power cycle the device. If you do a power cycle and keep the reset low, the ISP will happily talk to you (until it gets another watchdog reset).
Not all watchdog resets cause the chip to go into this state. I have no idea why the ISP is doing it and what the specific circumstances are that make it go gaga. I have experienced it with LPC2103 and LPC2362 chips. I would assume that disassembling and understanding the ISP code would shed light on both problems, but I didn't have the time for that and, for that matter, it would not help as you cannot modify the ISP code.
The single-cycle memory access using the local RAM might make you a bit over-enthusiastic and care very little about bandwidth. If you have a system with lots of data transfers, especially peripheral transfers, you should curb your optimism. Things are not as fast as they seem to be.
The LPC23xx chips have 4 internal buses.
Accessing the RAM on the local bus is a single-clock affair. 'Nuff said.
Accessing the FLASH is a different thing altogether, due to two factors. First, the FLASH is slow and second, there is a Memory Accelerator Module that tries to rectify that. For linear program execution you can assume that the FLASH is accessed in a single clock. For a more detailed description see Accessing the FLASH.
Accessing anything on an AHB bus takes 2 clocks since an AHB bus cycle is 2 clocks. Naturally, if you go out to external memory, you have to add the external RAM access times to that. Nevertheless, accessing the Ethernet or USB RAM costs you just one extra clock compared to the local RAM.
Now you will see why the whole section was written at all. Accessing things on the APB is just slow. Really, really slow.
The clock of a peripheral (PCLK) is the CPU clock (CCLK) divided by 1, 2, 4 or 8. You can set the division factor on a peripheral by peripheral basis. Whenever you access a peripheral, the APB clock will be the PCLK of that particular peripheral. The access time, which was that cosy 1 CCLK for the local RAM suddenly jumps up to 5*PCLK + 6*CCLK. If the stars are aligned (the technical term for PCLK being in the correct phase) you might save half of a PCLK from that. In real number terms, the access times are 11, 16, 26 and 46 CCLK cycles, for 1, 2, 4 or 8 PCLK dividers, respectively. With some luck, the 26 and 46 clocks can go down to 24 and 42, but you can't count on that. The battery-backed RAM in the RTC also works that way; note that the clock divider for the RTC RAM is independent from that of the RTC itself.
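The access time figures above follow directly from the 5*PCLK + 6*CCLK rule; a small sketch that reproduces them (function names are made up for the example):

```c
/* APB access time in CCLK cycles for a PCLK divider of 1, 2, 4 or 8:
 * 5 PCLK periods plus 6 CCLK periods. */
static unsigned apb_access_cclk( unsigned pclk_div )
{
    return 5u * pclk_div + 6u;
}

/* Access time when PCLK happens to be in a favourable phase, which
 * saves half a PCLK period. Integer arithmetic, so the half CCLK for
 * a divider of 1 is not represented. */
static unsigned apb_access_cclk_best( unsigned pclk_div )
{
    return apb_access_cclk( pclk_div ) - pclk_div / 2u;
}
```

For dividers of 1, 2, 4 and 8 apb_access_cclk() yields 11, 16, 26 and 46; apb_access_cclk_best() yields 24 and 42 for dividers of 4 and 8.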
46 clocks per access, so what, you might ask. Well, it's not that simple. For that 46 clocks you also block the local bus and the AHB1. This means that during that time it is not possible:
Yeah, yeah, but after the 46 clocks the bus gets released and those DMAs can go on their way, you might say. Well, yes. Sort of. First, 46 clocks is 23 AHB bus cycles, that is, 92 bytes worth of data for the USB or the Ethernet. Not something you just sneeze at. Furthermore, let's do a quick calculation. Assume that you have a UART set up to PCLK=CCLK/8 and you have some DMA transfer and you are sitting in a loop, waiting for a UART character:
while ( ( UART_STATUS_REG & CHAR_AVAILABLE ) == 0 );
The above line will be compiled to something along these lines:
ldr r1,address_of_uart_status_reg
loop:
ldr r0,[r1]
tst r0,#CHAR_AVAILABLE
beq loop
How do we fare? The ldr instruction is 2 clocks plus the 46 while it gets the status from the UART. The tst is just 1 clock and the beq, if you happen to have a good alignment in the FLASH, is 3 clocks. All together 46+2+1+3 = 52 clocks. Of that, the local bus is locked for 49 clocks (94%), the AHB1 and APB are locked for 46 clocks (88%). Even if you are lucky and you have a good clock phase alignment and a bad FLASH alignment, the utilisation of the AHB1 and APB will be around 81%. You don't leave much to the DMA. If the DMA happens to be the general purpose DMA transferring data from or to a peripheral on the APB (so it too has a slow access and locks the AHB1 and APB for long periods), well, it will have a hard time at high data rates.
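The arithmetic of the polling loop can be written down as a small sketch, handy if you want to try other PCLK dividers. The function names are invented; the instruction costs are the ones quoted in the text for the good FLASH alignment case.

```c
/* Clocks per iteration of the polling loop: the APB access itself plus
 * 2 for the ldr, 1 for the tst and 3 for the taken branch. */
static unsigned poll_loop_cycles( unsigned apb_cycles )
{
    return apb_cycles + 2u + 1u + 3u;
}

/* Share of 'part' in 'whole' as a percentage, rounded to nearest. */
static unsigned percent( unsigned part, unsigned whole )
{
    return ( part * 100u + whole / 2u ) / whole;
}
```

With a 46 clock APB access, poll_loop_cycles( 46 ) is 52, percent( 49, 52 ) is 94 and percent( 46, 52 ) is 88. With PCLK=CCLK (an 11 clock access) the loop is 17 clocks and the AHB1 and APB are held for about 65% of the time, which is still a lot for a loop that does nothing.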
So while it is nearly impossible to saturate the AHB2 with Ethernet traffic, you can reasonably easily saturate the AHB1 or the APB if you are careless.
It does not mean that the buses are not adequate, it just means that you have to keep in mind that access through the APB is slow and that accessing an APB address also blocks the AHB1. So, my recommendations would be:
If you access the FLASH without the MAM (Memory Acceleration Module), you will have to wait for the FLASH to finish for every word you read from it. The FLASH access time is 50ns; the number of wait states should be calculated accordingly. The MAMTIM register in the System Control Block is used to set the number of clocks per FLASH access, between 1 and 7. For a given CCLK frequency you should calculate it like this:
MAMTIM = ceil( 50ns * fCCLK ).
The MAMTIM register determines the number of clocks per FLASH access. Leaving it at its default 7 will result in a very slow system, while setting it to a value resulting in a less than 50ns FLASH access time will render the system unstable.
The MAM has three modes. One is MAM off, that is, you wait MAMTIM clocks for each and every FLASH access. That's not very exciting, so nothing more is said about that, except noting that it is completely deterministic. Regardless of code alignment and access patterns, you know that a FLASH access is exactly MAMTIM clocks.
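The MAMTIM formula, ceil( 50ns * fCCLK ), can be evaluated in integer arithmetic as sketched below. The function name and the clamping to the 1 to 7 range of the register are additions made for the example.

```c
#include <stdint.h>

#define FLASH_ACCESS_NS 50u     /* FLASH access time quoted in the text */

/* Number of CCLK clocks per FLASH access for a given CCLK frequency:
 * ceil( 50ns * fCCLK ), clamped to the 1..7 range of MAMTIM. */
static unsigned mamtim_for( uint32_t cclk_hz )
{
    /* Hz times ns gives the clock count scaled by 1e9 */
    uint64_t scaled = (uint64_t) cclk_hz * FLASH_ACCESS_NS;
    unsigned clocks = (unsigned)( ( scaled + 999999999u ) / 1000000000u );

    if ( clocks < 1u ) clocks = 1u;
    if ( clocks > 7u ) clocks = 7u;
    return clocks;
}
```

For example, mamtim_for( 72000000 ) is 4 (3.6 rounded up), mamtim_for( 60000000 ) is 3 and mamtim_for( 12000000 ) is 1.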
The LPC2xxx FLASH is 128 bits wide. That is, a FLASH line contains 4 32-bit words. Since in ARM mode each instruction is 32 bits, a FLASH line can supply 4 instructions.
The partially enabled mode of the MAM basically prefetches the next 4-word line while the core is executing instructions from the current one. Once a line is fetched from the FLASH, it is transferred to a prefetch buffer and the next FLASH read cycle can commence. That means that if you have linear code and MAMTIM is set to 4 or less, then you will not have to wait for the FLASH: the processor needs at least 4 clocks to execute the 4 instructions from the prefetch buffer, during which time the next line with the next 4 instructions is being read from the FLASH. In THUMB mode the 4 words actually encode 8 instructions, so it is even better; the FLASH will be idle half the time, reducing power consumption.
In fully enabled mode the MAM adds a branch trail buffer and a data buffer to the prefetch buffer. In theory, the branch trail buffer makes loops run without delay, assuming that you have no forward jumps within your loop. The data buffer is used to minimise the latencies caused by accessing the FLASH for data.
The description of the MAM in the NXP documentation is very short and rather lacking in detail. I've spent some time trying to figure out what it does, counting cycles and the like but I have to admit, I failed. The block diagram of the MAM may or may not reflect reality, but the control logic, which is not even shown on the block diagram, is definitely more complex than one would imagine. If you want to rack your brains about its operation, here is a piece of test code, the simplest of all I used:
.align 4            // Align to a 4-word boundary
.rept M             // Repeat the next assembly line M times
nop                 // Uses 1 word
.endr               // End of the repetition
loop:
subs r0,r0,#1       // Decrement the loop counter (1 clock)
.rept N             // Repeat the next assembly line(s) N times
nop                 // Do nothing (1 clock)
.endr               // End of the repetition
bne loop            // If the loop count isn't 0, loop (3 clocks)
That is, we have a loop that consists of two real instructions plus N single cycle NOPs. The whole loop is aligned to a FLASH line boundary if M is 0, otherwise it is misaligned by M words.
I measured the run-time of the above code for N between 0 and 10, with MAMTIM set to 3, 4 and 5. The result for the M=0 (the loop is aligned at a FLASH line boundary) can be found below. You will also find a line for MAMTIM=1, that indicates the real cycle number, if the CPU never needs to wait for the FLASH.
MAMTIM  N=0   1     2     3     4     5     6     7     8     9     10
1       4     5     6     7     8     9     10    11    12    13    14
3       4.0   5.0   6.0   7.0   8.0   9.0   12.0  13.0  14.0  13.0  16.0
4       4.0   5.0   6.0   7.0   8.0   9.0   13.0  14.0  15.0  13.0  17.0
5       4.0   5.0   6.0   7.0   8.0   14.0  15.0  16.0  17.0  19.5  20.5
I'd point out the half clock execution time for the MAMTIM=5 case when the loop has 9 or 10 NOPs. That is actually not a half clock; it indicates that of every two consecutive executions of the loop, one takes one clock cycle longer than the other. I'd also draw your attention to the difference between the 8 and 9 NOP loops when MAMTIM is 3 or 4. As you can see, the loop is one instruction longer, yet it executes faster. That is a very interesting result, especially if you read the ARM7TDMI documentation. As it turns out, when the number of NOPs is 8, the bne instruction is at a 4-word boundary plus 1 word. The ARM7TDMI core prefetches two more words before actually taking the jump. That means that it will request the bne, then the next word and the one after that, all of which are contained within the same FLASH line. If you then have 9 NOPs, the bne is at a 4-word boundary plus two words, thus the second prefetch will request a read from the next FLASH line, needing an additional FLASH fetch. Yet the execution time is less than when the core does not ask for any instruction word beyond the FLASH line the bne is in. However, that phenomenon disappears as soon as MAMTIM is 5, that is, when the reading of a FLASH line becomes longer than the time needed to execute the words delivered by a line.
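If you want to work out where the bne lands for other values of M and N, the layout of the test code makes it a matter of simple arithmetic. The helper names below are invented for the sketch; it only describes the placement of the instructions in the 4-word FLASH lines and makes no attempt to predict the MAM's timing.

```c
/* Word offset (0..3) of the bne within its 4-word FLASH line. The code
 * starts at a line boundary with M NOPs, followed by the loop body:
 * one subs, N NOPs and the bne. */
static unsigned bne_line_offset( unsigned m, unsigned n )
{
    unsigned bne_word = m + 1u + n;     /* words preceding the bne */

    return bne_word % 4u;
}

/* Number of FLASH lines spanned from the subs to the bne, not counting
 * what the core prefetches beyond the bne. */
static unsigned loop_lines( unsigned m, unsigned n )
{
    unsigned first = m / 4u;                /* line holding the subs */
    unsigned last  = ( m + 1u + n ) / 4u;   /* line holding the bne  */

    return last - first + 1u;
}
```

For M=0 this gives an offset of 1 for N=8 and 2 for N=9, matching the description above.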
Things get even more interesting if you misalign the above code by 1, 2 or 3 instructions. The tables below give you the values for all cases, M=0 to M=3, the M=0 being just a repeat of the table above:
M  MAMTIM  N=0   1     2     3     4     5     6     7     8     9     10
-  1       4     5     6     7     8     9     10    11    12    13    14
0  3       4.0   5.0   6.0   7.0   8.0   9.0   12.0  13.0  14.0  13.0  16.0
0  4       4.0   5.0   6.0   7.0   8.0   9.0   13.0  14.0  15.0  13.0  17.0
0  5       4.0   5.0   6.0   7.0   8.0   14.0  15.0  16.0  17.0  19.5  20.5
1  3       4.0   5.0   6.0   7.0   8.0   11.0  12.0  13.0  12.0  15.0  16.0
1  4       4.0   5.0   6.0   7.0   8.0   12.5  13.5  14.5  12.0  16.5  17.5
1  5       4.0   5.0   6.0   7.0   13.5  14.5  15.5  16.5  19.0  20.0  21.0
2  3       4.0   5.0   6.0   7.0   10.5  11.5  12.5  11.0  14.5  15.5  16.5
2  4       4.0   5.0   6.0   8.0   12.0  13.0  14.0  12.0  16.0  17.0  18.0
2  5       4.0   5.0   6.0   13.0  14.0  15.0  16.0  18.5  19.5  20.5  21.5
3  3       8.0   9.0   10.0  11.0  12.0  13.0  14.0  15.0  16.0  17.0  18.0
3  4       10.0  11.0  12.0  13.0  14.0  15.0  16.0  17.0  18.0  19.0  20.0
3  5       12.0  13.0  14.0  15.0  16.0  17.0  20.0  21.0  22.0  23.0  25.0
So, using the MAM in fully enabled mode will give you a speed that is a bit unpredictable, but most certainly faster than what you could achieve without the MAM (which would be the clock cycles for the MAMTIM=1 case multiplied by the actual value of MAMTIM). A few general things to consider, if you want to do handcrafted assembly for speed reasons and you can't afford to copy it to RAM (where it executes at full speed):
continue in C) has no penalty, so you can also use that technique to re-organise your loop to avoid forward jumps. Of course, jumping out of the loop (like break in C) is also penalty free, because when a conditional jump is not executed, it is equivalent to a NOP, a single-cycle sequential instruction.