ESP Schachzwerg chess computer

Schachzwerg translates to "chess dwarf".

@author Andre Adrian
@date 2019-08-31

Introduction

The chess game was once the "Drosophila melanogaster" of the artificial intelligence (AI) research. The creator of an AI-program is intelligent, but is the creation intelligent, too? This is a statement from L. Torres y Quevedo, the creator of the first chess playing machine, to a Scientific America reporter in 1915:

There is, of course, no claim that it will think or accomplish things where thought is necessary, but its inventor claims that the limits within which thought is really necessary need to be better defined, and that the automaton can do many things that are popularly classed with thought. It will do certain things which depend upon certain conditions, and these according to arbitrary rules selected in advance.

One school of thought simply says: If it passes the Turing test, it is intelligent. The other school of thought says: We have no good definition of intelligence, how can we have a definition of artificial intelligence?
What is the purpose of this endeavour? First to have fun and learn something new. Second to verify or reject the "small is enough" idea. Third to create the philosophical question: How can a creator with a chess performance rating of 1000 Elo create a chess computer with a chess performance of 1500 Elo (yes, this is the design goal).
The predecessor of ESP Schachzwerg and Super ESP Schachzwerg is my AVR MAX Schachzwerg chess computer from 2009. The magazine Elektor provided a kit. A PCB, an ATMega88V SoC (System-on-a-Chip), some push-buttons, four 7-segment LEDs and two AA batteries . Performance is 1290 Elo out of 8MHz 8-bit CPU, 8KByte ROM and 1KByte RAM. Because of RAM limitations a non-recursive version of the search algorithm was used. The program is based on Micro-Max from H.G. Muller.

Small is enough

The program of ESP Schachzwerg should work like an insect brain or a hummingbird brain: a little animal can only afford a little brain. But because the little animal has to live in the same complex environment as bigger animals with larger brains it can not afford to be stupid. Scientific American MIND July/August 2017 magazine wrote: "How could such a tiny brain perform as well as a bigger one? Through vicious, cutthroat evolutionary efficiency". The same "small is enough" principle is used for ESP Schachzwerg. The program uses primitive evaluation for a deep search within the limitations of the hardware.

Computer chess program Tell, 1974

The typical computer for a chess program in 1974 was an IBM 360/168 mainframe. But Johann Joss used a HP2115A mini computer. This 16-bit computer was on the market since 1967. The maximum core memory amount was 8KWord, that is 16KByte.

Picture: The HP2115A cabinet. Another cabinet the same size housed the power supply.

With strong competition in Stockholm 1974, Tell got rank 11 out of 13. In Dortmund, 1975, competition was weak and Tell got first rank! In Toronto 1977 Tell only got last rank The strong competition at this time was Chess from U.S.A. and Kaissa from UdSSR. This was the time of cold war at the computer chessboard.

Rank	1974 Stockholm	1975 Dortmund	1977 Toronto
1	Kaissa	Tell	Chess 4.6
2	Chaos	Daja	Duchess
3	Chess 4.0	Schach MV 5,6	Kaissa
4	Ribbit	Orwell III	Belle
11	Tell	-	Black Knight
16	-	-	Tell

The chess computer Belle was special. It was a PDP-10 together with special hardware. Belle terminated the reign of the mainframe chess dinosaurs. Orwell III was written by Nitsche. Nitsche and Henne and the Mephisto chess computer made Hegener & Glaser an important chess computer company.

Tensilica Xtensa ISA (Instruction Set Architecture)

There are CISC and RISC CPUs. And there are RISC and modern RISC CPUs. A traditional 32-bit RISC has 32-bit opcodes. This is a waste. Modern RISC like ARM Thumb or Xtensa have 16-bit or 24-bit opcodes. These small size opcodes make opcode fetch a little more complicated, but bring RISC CPU code density quality to CISC CPU quality. Remember, the micro-code and nano-code Motorola 68000 CISC had 16-bit opcodes. One of the first publications about the RISC CPU that is used in the Espressif SoCs is the 2008 Tensilica Diamond white paper. The Xtensa ISA is documented in the 2004 Tensilica LX Microprocessor. There is a Xtensa Instruction Set Architecture (ISA) Reference Manual available, too.

Table: Tensilica Xtensa opcodes.

The ESP Schachzwerg hardware

The ESP8266, ESP8285 and ESP32 SoCs from Espressif were designed as WLAN-bridge or WLAN-modem, an IO-chip that gives another CPU WLAN (WiFi) or Bluetooth (BLE) capabilities. That is, it was the chinese low-cost answer to the Nordic nRF24 chips. The big difference is that the ESP SoCs can execute user programs next to the protocol stack firmware. The ESP8266/ESP8285 use a Tensilica Diamond 106Micro (L106) 32-bit 80MHz CPU. The program execution is somewhat special. The program is read from an external Flash memory into the 32KByte execution cache/32 KByte execution memory. A jump outside the 32KByte segment triggers a load from Flash. This is slower than having "real" memory for program memory, but cheap. The ESP8266/ESP8285 RAM is 80KByte. The ESP8266 has normally 4MByte of external Flash, the ESP8285 has 1MByte internal Flash. Price for a DOIT ESP8285 NodeMCU-M DevKit is US$ 2.50
The user interface is done very modern with a touchscreen LCD. A stylus is used as in the old Palm PDA handheld computers. The LCD is of good quality TN technology with 6-bit color depth per primary color, LED backlight and 240x320 pixel resolution.
The DevKit and the touchscreen LCD are available as ready-made breadboard modules. The breadboard has 2.54mm (1/10 inch) contacts. After hardware development on breadboard, a simple PCB can replace the breadboard. I choose the NodeMCU-M ESP8285 module that provides USB port, USB-to-UART bridge and 3.3V voltage regulator. The touchscreen LCD has SPI interface for LCD-controller and touch-controller. The controller chips ILI9341 for LCD and XPT2046 for touch are fully supported by the IDE.
The brightness of the LCD backlight is selected with a resistor. A 33Ω resistor is used. The ESP SoCs and other components are 3.3V devices. Therefore a 3.6V lithium battery in AA form factor is used. A 5V USB power supply works as well. An on/off switch, battery holder and insulated wire completes the hardware BOM (bill of materials).

Top picture: DOIT NodeMCU-M DevKit with ESP-M2 module using ESP8285 SoC.
Second picture: TJCTM24028-SPI 2.8" touchscreen LCD ILI9341, XPT2046.
Bottom picture: 400 pins breadboard.

Hardware connection

The ESP have the Flash SPI interface at GPIO 6 to 11. These GPIO pins are best left alone. The ESP8266/ESP8285 has an additional hardware SPI. The SPI signals Clock-Out, MISO and MOSI are connected to ESP8266/ESP8285 pins GPIO14 (D5), GPIO12 (D6) and GPIO13 (D7). Expressif uses the GPIO naming, DOIT NodeMCU DevKit uses the Arduino Dx naming. These signals go to the LCD and to the touchscreen in parallel, that is ESP Clock-Out is connected to LCD Clock-In and Touch Clock-In. To separate the two devices LCD and touch, every device has a Chip Select (CS) pin. The LCD CS is connected to ESP pin GPIO15 (D8), the touch CS is connected to GPIO4 (D2). The LCD can receive data or command codes. The code type is transmitted via the LCD Data Command (DC) signal which is connected to ESP GPIO5 (D1). The Reset or EN pin of the ESP is connected to the LCD reset pin. And last but not least, the LCD backlight pin is connented via a 33Ω resistor to 3.3V. The power supply pins 3.3V and GND are routed between ESP and touchscreen LCD module, too. The on/off switch is traditionally placed in the 3.3V wire between battery and circuit. But you can also place it in the GND wire. As an old telephone technician I could tell you why this is the correct way to do things, but our little system has no earth connection and therefore this information is not needed.

ESP8266	ESP8285	ESP32	LCD	Touch
D5	GPIO 14	GPIO 18	SPI CLK	SPI T_CLK
D7	GPIO 13	GPIO 23	SPI MOSI	SPI T_DIN
D6	GPIO 12	GPIO 19	SPI MISO	SPI T_DO
D8	GPIO 15	GPIO 15	LCD CS	n/a
D1	GPIO 5	GPIO 5	LCD DC	n/a
D2	GPIO 4	GPIO 4	n/a	T_CS
RST	EN	EN	RESET	n/a

The SPI bus uses serial data transfer. The ESP is the bus master. Only the master provides the clock signal. The "Master Out Slave In" MOSI pin is an output at the ESP and an input at the LCD or touch. The "Master In Slave Out" operates in the other direction. The SPI bus is really primitive, and best of all does not need royalty payment.
The current consuption is 48mA@3.6V for the ESP8285 and 88mA@3.6V for the ESP32 running the Mandelbrot demo program on a 2.4" touchscreen LCD. WLAN is switched off at ESP8285, but not at ESP32.

The Super ESP Schachzwerg hardware

The ESP8266/ESP8285 can be replaced by an ESP32 module like the DOIT ESP32 DevKit, the Espressif DevKitC V4 or clones. The ESP32-WROOM-32 module is the slightly more expensive "bigger and better" brother. DOIT ESP32 DevKit price is US$ 3.85. It has a dual core Tensilica Xtensa LX6 32-bit 240MHz CPU, 160KByte static allocated RAM, 160KByte dynamic allocated RAM and 4MByte Flash. This amount of RAM allows a transposition table. A Super ESP should have 1700 Elo. The Spracklens did 1755 Elo in 1983 with a 3.2MHz 8-bit 6502, 24KByte ROM and 3KByte RAM.

Top Picture: 30pin DOIT ESP32 DevKit with ESP-Wroom-32 module using ESP32 SoC.
Bottom Picture: 38pin Expressif DevKitC V4 with ESP-Wroom-32 module using ESP32 SoC.

@note Espressif DevKitC V4 and clones are Arduino board "ESP32 Dev Module"
@note press the BOOT key if "Connecting ..." is displayed. Release the key after upload starts.
The GPIO pin 0 is used for the BOOT key. The pins 1 and 3 are used for serial monitor. Pins 6 to 11 are used for Flash memory interface. These pins should not be used. The GPIO pins 34, 35, 36, 39 are input only. The default SPI pins are 18 for CLK, 19 for MISO and 23 for MOSI. The difference between 30 pin and 38 pin DevKits is small. The GPIO pins 6 to 11 are useless. The GPIO pin 0 is wired to the BOOT key. If used as a output it can create a short circuit if the pin output is high and the key forces it low. The last difference is one more GND pin. The 30 pin DevKit is cheaper and more practical.

Picture: ESP32 DevKit (Super ESP Schachzwerg) and touchscreen LCD SPI/TJCTM24028-SPI present touchpaint. Touchpaint is a Pixel draw program to protytype the user interface. The battery is of lithium 3.6V type.

Mandelbrot set animation

This demo program is output only, that is, it uses only the LCD. It was used to familiarize with hardware SPI, firmware yield() and RAM. The RAM size of the ESP32 shrank from 520KByte (DRAM+IRAM) to 320KByte (DRAM) to 160KByte static allocated + 160KByte dynamic allocated further to 110KByte available dynamic allocated DRAM. For me this is the difference between optimism and realism: the transposition table size is 110KByte, not more. The same source code is for ESP8266, ESP8285 and ESP32. Just define ESP8266 for the first two or ESP32 for the third. Define USE_BUFFER will draw the animation faster, but will not use the full LCD area. ESP32 has 110KByte RAM for the buffer and ESP8266/ESP8285 has 50KByte RAM for the buffer. The ESP32 program runs three times faster than the ESP8285 program. This suggests that 106Micro in ESP8285 runs at 80MHz and LX6 in ESP32 runs at 240MHz and there are no differences in both CPUs - at least not for this program.

/* @file mandelbrot_ili9341 @brief Mandelbrot set animation Hardware: ESP8266/ESP8285/ESP32 + TJCTM24024-SPI/TJCTM24028-SPI LCD resolution 320x240, LCD controller ILI9341 based on Adafruit ILI9341 mandelbrot the fractal pattern is unlimited, but the numeric precision is limited @author Andre Adrian @date 2019-08-30 */ #define ESP8266 // or #define ESP32 #define USE_BUFFER // or deactivate this #define #include <Adafruit_GFX.h> // version 1.5.6 #include <Adafruit_ILI9341.h> // version 1.5.1 #ifdef ESP8266 # include <ESP8266WiFi.h> # define BITRATE 74880 #endif #ifdef ESP32 # include <WiFi.h> # define BITRATE 115200 #endif // HW SPI CLK MISO MOSI // ESP8266 D5 D6 D7 // ESP8285 14 12 13 // ESP32 18 19 23 #define TFT_CS 15 #define TFT_DC 5 // Use hardware SPI and the above for CS/DC Adafruit_ILI9341 tft = Adafruit_ILI9341(TFT_CS, TFT_DC); const int16_t bits = 28, // Fractional resolution iterations = 96; // Fractal iteration limit or 'dwell' int16_t pixelWidth = 320, // TFT dimensions pixelHeight = 240 - 1; float centerReal0 = -0.6, // Image center point in complex plane centerImag0 = 0.0, rangeReal0 = 3.0, // Image coverage in complex plane rangeImag0 = 3.0, delta0 = -0.1; #if defined(USE_BUFFER) uint16_t *buffer = NULL; #endif void setup(void) { #ifdef ESP8266 WiFi.disconnect(); WiFi.mode(WIFI_OFF); WiFi.forceSleepBegin(); yield(); #endif #ifdef ESP32 // @todo change this Espressif software to Arduino software // https://docs.espressif.com/projects/esp-idf/en/latest/api-reference/system/sleep_modes.html // esp_bluedroid_disable(); // esp_bt_controller_disable(); // https://docs.espressif.com/projects/esp-idf/en/latest/api-reference/network/esp_wifi.html // esp_wifi_stop(); // esp_wifi_set_mode(WIFI_MODE_NULL); #endif Serial.begin(BITRATE); Serial.println(); // synchronize serial monitor (less garbage output) Serial.println("Mandelbrot set animation"); tft.begin(); tft.setRotation(1); tft.fillScreen(ILI9341_BLACK); #if defined(USE_BUFFER) // https://docs.espressif.com/projects/esp-idf/en/latest/api-reference/system/mem_alloc.html // ESP32: Due to a technical limitation, the maximum statically allocated DRAM usage is 160KB. // The remaining 160KB (for a total of 320KB of DRAM) can only be allocated at runtime as heap. uint32_t size; for (;;) { size = pixelWidth * pixelHeight * sizeof(uint16_t); buffer = (uint16_t *)malloc(size); if (buffer != NULL) break; pixelWidth -= 2; pixelHeight -= 2; } Serial.print("pixelWidth="); Serial.print(pixelWidth); Serial.print(" pixelHeight="); Serial.print(pixelHeight); Serial.print(" size="); Serial.println(size); #endif } void loop() { float centerReal = centerReal0, centerImag = centerImag0, rangeReal = rangeReal0, rangeImag = rangeImag0, delta = delta0; for (int counter = 0; counter < 67; ++counter) { int64_t startReal = (int64_t)((centerReal - rangeReal * 0.5) * (float)(1 << bits)), startImag = (int64_t)((centerImag + rangeImag * 0.5) * (float)(1 << bits)), incReal = (int64_t)((rangeReal / (float)pixelWidth) * (float)(1 << bits)), incImag = (int64_t)((rangeImag / (float)pixelHeight) * (float)(1 << bits)); uint32_t startTime = millis(); int64_t posImag = startImag; const int16_t pixelHeight2 = pixelHeight / 2 + 1; for (int y = 0; y < pixelHeight2; y++) { yield(); int64_t posReal = startReal; for (int x = 0; x < pixelWidth; x++) { int64_t a = posReal; int64_t b = posImag; int64_t n; for (n = iterations; n > 0 ; n--) { int64_t a2 = (a * a) >> bits; int64_t b2 = (b * b) >> bits; if ((a2 + b2) >= (4 << bits)) break; b = posImag + ((a * b) >> (bits - 1)); a = posReal + a2 - b2; } #if defined(USE_BUFFER) buffer[y * pixelWidth + x] = (n * 29) << 8 | (n * 67); buffer[(pixelHeight - 1 - y) * pixelWidth + x] = (n * 29) << 8 | (n * 67); #else tft.drawPixel(x, y, (n * 29) << 8 | (n * 67)); tft.drawPixel(x, pixelHeight - 1 - y, (n * 29) << 8 | (n * 67)); #endif posReal += incReal; } posImag -= incImag; } #if defined(USE_BUFFER) tft.drawRGBBitmap(0, 0, buffer, pixelWidth, pixelHeight); #endif uint32_t elapsedTime = millis() - startTime; Serial.print("time="); Serial.print(elapsedTime); Serial.print(" ms"); Serial.print(" counter="); Serial.println(counter); if (centerReal > -1.43) centerReal += delta; rangeReal *= 0.8; rangeImag *= 0.8; delta *= 0.9; } }

Picture: The ESP8285 NodeMCU-M DevKit (ESP Schachzwerg) and touchscreen LCD SPI/TJCTM24028-SPI present Mandelbrot set animation.

The ESP Schachzwerg IDE

The integrated development environment is as important as the hardware for the success of any project. I choose the Arduino IDE with the necessary enhancements for ESP8266, ESP32, ILI9341 and XPT2046. There is no ESP8285 enhancement because the ESP8285 is software compatible to the ESP8266. Expressif uses the same firmware - even the build date is 2013 as in the ESP8266. Being a C programmer since 1987 on Atari ST and a C++ programmer since 1990 on IBM clones, using C/C++ was a no-brainer for me.

The ESP8266 Wifi firmware

The Wifi protocol stack firmware runs always in the ESP. If you compile your user-program in Arduino-IDE, the firmware is linked in. At least every 10ms the firmware has to be called. If not, a watchdog timer interrupt happens. Placing a yield() call at a convinient position in the search algorithm is all that is needed to keep the firmware watchdog happy. The watchdog is active even after the WLAN stack is powered down to save battery power. The following C source code shows the WiFi switch off on ESP8266 or ESP8285:

WiFi.disconnect(); WiFi.mode(WIFI_OFF); WiFi.forceSleepBegin(); yield();

King Rook - King endgame algorithm

In 1915, the Scientific American magazine reported about the mechanical/electromechanical chess automaton from L. Torres y Quevedo. Development started in 1911, presented was the automaton 1914 in Paris. The best part of the article is the description of the algorithm. We can easily implement the algorithm as a software program instead of a hardware automaton. Here is the algorithm:

The algorithm is a simple state machine. The state is position of white_king, white_rook and black_king. The algorithm transforms the state into six different non-overlapping cases. These six cases have some sub-cases which get executed:

1 carries castle to column A,
l' carries the castle to column H,
2 carries the castle downward one square,
3 carries the king downward one square,
4 carries the king one square to the right,
4' carries the king one square to the left,
5 carries the castle one square to the right,
5' carries the castle one square to the left,
6 carries the castle downward one square.

This is really a simple algorithm and I can not imagine a more simple algorithm. No recursive search, no Megabytes of hash memory and no Gigahertz of CPU speed needed. For me this is strong indication for "small is enough". The today available 4-pieces and 5-pieces endgame databases work the same. A simple look-up in these large data files gives perfect chess play. No thinking needed, just recall. An opening book allows the best play according to current knowledge. But new chess knowledge can invalidate an old opening book. If we ignore this detail, we can say: playing an opening is recall, too.
The case 1 has a problem. If white_king and white_rock share the same row, the rock move may be impossible because the king can block the movement to A column or H column. Therefore we change the cases into:

1 carries castle to column A if king is not blocking, else next to the king,
l' carries the castle to column H if king is not blocking, else next to the king,

And here is the original hardware implementation:

The chessboard is a small square part below the weights in the left picture.

The inverse move generator (IMG)

The normal move generator starts with the from-field and advances to the to-field. The inverse move generator starts with the to-field and searches the to-field. Because the move-pattern of the piece on the from-field is not known, a super-piece move-pattern is used: the pattern of pawn, knight, bishop and rook combined. The king-pattern is a one-step bishop/rook pattern and the queen-pattern is a long-range bishop/rook pattern. The strongest capture move is the capture of the king. Therefore the IMG starts with the to-field of the king and goes down the material ranking (queen, rook, bishop, knight, pawn). The IMG can search the pieces on the chessboard. The better solution is to have a pieces-list. The pieces-list has 32 entries for the 32 chess-men. Every piece has its own location. The white knight that started the game on c1 will remain white_knight1 for the whole game in the pieces-list. As the game progresses, entries in the pieces-list will become nil, that is the piece is gone to chess-heaven. Here is the pieces-list:

wKg	wQn	wRk1	wRk2	wBp1	wBp2	wNt1	wNt2
wPn1	wPn2	wPn3	wPn4	wPn5	wPn6	wPn7	wPn8
bKg	bQn	bRk1	bRk2	bBp1	bBp2	bNt1	bNt2
bPn1	bPn2	bPn3	bPn4	bPn5	bPn6	bPn7	bPn8

This pieces-list can not handle two queens on the chessboard. If the opponent gets two queens, the program resigns.

Ply

A chess move consists of white piece movement and black piece movement. A ply is a half move. I use ply and bit for singular and plural. I am an engineer. I say 3 Volt and not 3 Volts, because I write 3V and not 3Vs. Because 3Vs means 3*Volt*second for me. But you can do as you like!

The move generator (MG)

The move generator is the oldest part. The Claude Shannon "random mover" was a move generator and a select a move by chance. Shannon wrote in 1949: "The writer played a few games against this random strategy and was able to checkmate generally in four or five moves (by fool's mate, etc.)". The move generator has the problem of illegal moves: your king is not allowed to move thru a field that is under enemy control while doing the castling. Even more simple: the king is not allowed to move himself into check. The solution of these problems are: we ignore the castling case. The king moves himself into check is handled by the search-tree. In the next ply the tree-search will recognize win, therefore it will avoid this part of the search-tree. The move generator produces pseudo-legal moves.
The most simple move generator starts with field a1 and walks column-first thru the chessboard: a1, a2, a3, ... h8. First variant is to use a1 ... h8 for white and h8 ... a1 for black. Second variation is to start in the middle of the chessboard and go d4, d5, d6, ... d3. Third variant is to go in a spiral: d4, e4, e5, d5, c5, c4, ... a8. Fourth variant is to use the pieces-list. You can start with the pawns or you can start with the queen. Starting with the queen should give more alpha-beta pruning.
Our MG does not generate capture moves. This is the job of the IMG.
The MG uses a chessboard representation. The most space efficient is the 8x8 board. A better space/time trade-off has the 10x12 array (mailbox) or the 128 fields 0x88 chessboard. The mirco-Max program used a 129 fields 0x88 chessboard. The last field is to "dump" garbage, thst is to always have a destination field. I think I stay with the 129 fields 0x88 chessboard.

Do move, undo move

There is one chessboard and there is one pieces-list. And there is a search-tree that is 20ply deep. How does this fit together? We need a move-stack. We have a "do move" operation that executes the move and pushes the undo-information on the move-stack. And we have a "undo move" operation that brings the chessboard and pieces-list back to the previous condition.
The do/undo-move part of the chess program is a nasty piece of software. It is executed of every search-node, that is hundred thousand times or even a million times for every move the computer calculated. Therefore execution shall be quick. But there are nasty details like undo a castling or undo an en passant capture or undo a promotion of a pawn to a queen (queening). Or even the simple question can the king still do a castling or did he move already e.g. e1-d1-e1? The good thing is that we can test the do/undo-move very easy. Even if we need hundreds of test-cases, we should not progress to advanced topics of our chess program before the do/undo-move is not rock-solid and fast.

Material balance

Normally the do/undo-move maintains the material-balance (MB). Depending of how much chess-knowledge is in the program, the material-value (MV) of a piece is a constant or a variable. The traditional values are Pn=1, Nt=Bp=3, Rk=5, Qn=9 and Kg very high, e.g. Kg=100. This allows 8-bit signed char (int8_t) for the material balance.
A pawn on seventh row can have the value of a rook. An end game king and two knights is a draw, an end game with king, knight, bishop is at least in theory a win. If we have two knights and one bishop it makes sense to increase the MV of the bishop.
The MB is calculated incrementally. If you loose a pawn, you subtract the actual pawn value. If the pawn moves to seventh row and gets the value of a rook, you add the difference of new_pawn_value and old_pawn_value to the MB.
Like the do/undo-move calculation, the material balance calculation should be rock-solid and fast.

Piece-Square Tables

There is a piece-square table for every kind of chessmen. The piece-square table "nudges" the balance to good piece positions on the chessboard. A pawn piece-square table can keep the king shield intact and suggest a center pawn development before the development of other pawns. David Kittinger started with "pre scan heuristics", that is use a chessboard evaluation, the oracle, to calculate the piece-square tables. In mid game the king should hide behind the king shield, but in end game the king should go to the center of the chessboard.
Because the piece-square tables do not change with search depth, a deep search can cross the mid game to end game "border" and will have the wrong piece-square table for the deeper search levels. One possibility is to have two different piece-square tables for every type of chessmen and "slide" between the tables, that is give a value of a*table1+(1-a)*table2 with the "sliding" factor a between 0 or 1.
ESP Schachzwerg will use another algorithm. First the "ESP oracle" defines the piece-square tables. Then it uses for the first M searches in the iterative deepening the material balance plus piece-square tables balance and for the deeper searches only the material balance. Real material gain is best for ESP Schachzwerg and can be calculated without piece-square tables balance. The danger of wrong piece-square tables balance is small for a shallow search.
The score for depth M search with piece-square tables balence and for depth N search without piece-square tables balance are stored. ESP Schachzwerg plays the best move of all. If there is a sure deep search material gain, e.g. a pawn fork, it will play this, else it will play the shallow search development move.

Material and piece-square tables balance

The piece-square tables have "nudge" values in centi-pawns. Typical chessmen values in centi-pawns are Pn=100, Nt=320, Bp=330, Rk=500, Qn=900, Kg=20000. A 16-bit signed integer (int16_t) is used for the material balance. Typical piece-pawn tables "nudge" values are -50, ..., -5, 0, 5, ... 50, that is between a 1/20 of a pawn and 1/2 of a pawn.

Position hash

Usually the do/undo-move maintains the position hash. See transposition table heuristic for details. The current plan is to use Zobrist hashing and no assembler programming.

Search-tree

The recursive NegaMax or NegaScout algorithm is for many programmers conplicated stuff. The good news is: We do not need to understand it! We set the piece values, start the search and get the best move as answer. Okay, for space-saving we convert the recursive algorithm into an iterative algorithm and maintain the local program stack-variables in an indexed struct. And we want to show the principal variation to the user, that is the chain of ply the computer is most afraid of.
The simple fact that if we have a search depth of 20ply, but we have a check-mate after 5ply is a special case for the search. But don't worry. I have seen other programmer's chess program crash, so why not allow our program to crash, too?
Okay, this is the ugly truth: our search-tree explodes. Even if we have a branching factor of only 6 because of NegaMax or NegaScout search, the result is 6^ply. 6^10 is 60 millions, 6^20 is 3656 trillions using the "short scale" trillion.

Transposition table heuristic

We get the same position if we start the game with 1. e4 e5, 2. d4 d5 or if we start with 1. d4 d5 2. e4 e5. This different order of moves but same position happens over and over in the search-tree. The transposition table (TT) helps to avoid duplicated work in the search. But there are ugly details. The NegaScout algorithm has a start-window that is not -infinite/+infinite. The TT results are only valid for "more narrow" start-windows. One solution to this TT problem is to use NegaMax together with TT.
The "good old" chess computers like the "Mephisto Risc 1MB" from 1992 had 1MByte of TT memory. Our ESP8266 has 80KByte, and we need 2KByte for the do/undo-move and search-tree and XKByte for the user interface. You want to see nice graphics, yes? The ESP32 is much better for TT.
The TT is not filled depth-first, but it is filled wide-first. We want to have all transpositions on depth-1, depth-2, depth-3 until the TT memory is exhausted. One possibility is that more shallow transpositions will overwrite deeper transpositions. Another method is to allow fixed sizes of TT memory for depth-1, depth-2 and so one. In this case there is no need for the individual transposition to carry the depth information.
The indentification of a position is done with a "hash". Hashing uniformly distributed data is usualiy done with division remainder or mod. The chess position is interpreted as a long number. This long number is divided by an 32-bit prime number. The remainder is a good hash.
For me this long division should be done in assembler. Normally a N bit CPU has a 2*N bit divide by N bit operation with a N bit quotient and N bit remainder result. Such an operation is normally not available in C language. It was avaiable in the Forth language.
Another, less powerful, hash is the Zobrist hashing. The good thing about this hash is that you can calculate it incrementally. Like the material-balance you only have to add or subtract the change on the chessboard.
The disadvantage of hashing are hash collisions. Because the hash is much smaller than the original data, several different chess positions have the same hash. Konrad Zuse did a little calculation in 1941. You need approximate 4 bit (exact 3.7005 bit) to express the 13 separate inhabitants of a field: wKg, wQn, wRk, wBp, wNt, wPn, bKg, bQn, bRk, bBp, bNt, bPn and empty. You have 64 fields, that is the information of a chess position is 64*4bit=256bit. A typical mod-hash is 32-bit long, a typical Zobrist-hash is 64-bit long. You can imagine how many hash-collisions are possible! Because of this I call it TT heuristic. Thank goodness, the chessboard is only sparsely populated. At least half of the fields are empty. The end game profits most of TT.

Picture: Konrad Zuse's 1941 calculation of the information in a chess position.

The TT memory helps in detection threefold repetition. If all played-out positions are "hashed" and we have two hash collisions for the same hash, the same position has occured probably three times.
If a piece moves away from "its" field and later moves to the same field, and the adversary does the same, we have the same position in two different depths of the search tree. The back-and-forth movements are don't care in most cases. But moving the king back-and-forth before castling makes the castling impossible. Just directly after the two-square first move of the pawn is an en passant capture legal. Again we can have a legal en passant capture or an illegal en passant capture, if we insert back-and-forth movements. Maybe I ignore these nasty details - every chess computer needs an "identifying" bug. Okay, thinking time is over. The pawn that can be captured in the next ply by en passant is not a wPn or bPn but a wPnep or bPnep. There are eight wPnep positions and eight bPnep positions. And the chess rules tell the location the adversary pawn has to hit to make the en passant capture. The Kg "inhabitant" gets differentiated into KgC, wKg, bKg. The KgC is a king before his first move. It has no color information, but the position e1 or e8 tells the color. After the first move the KgC becomes a wKg or bKg. A KgC has still the castling right, a wKg or bKg not. With these additions we need now exactly 256bit or 32Byte to describe the chess position. Konrad Zuse was right!

Iterative deepening

Chess is a game against your opponent and against time. You have this nasty chess-clock. The iterative deepening in the chess program helps the computer to stay in its time budget. If after the ply_N search is done there is time left, the ply_N+1 search starts. But the iterative deepening does more. It allows to start the ply_N+1 search with the best ply_N move. This gives better cut-off or a smaller branching factor. And there is another advantage. If we have a maximum search of 20 ply, but we have a mate-in-5, the iterative deepening stops after depth 5. The game is over - at least for the computer. Therefore iterative deepening makes it unnecessary to report win-in-N or loose-in-N. The terminate condition for the search is "if there is no more time or if there is a win". Last but not least does iterative deepening tell us the size for the transposition table (TT). After depth_1 search we have used this amount of memory, after depth_2 we have used that amount of memory and depth_3 TT can only use the free transposition table memory.

Maximum search depth

What is the maximum search depth? A chess game can last 80 moves or 160 ply. But the combinatorial explosion kills us long time before. We can learn a little from David Kittinger. Here are some results of some of his chess computers:

Year	Name	Elo	max. Ply
1988	Novag VIP	1731	20
1989	Novag Super VIP	1750	16
1995	Novag Jade II	2032	16
1998	Novag Turquoise	2006	18

Poor David had a handicap. All presented chess computers use the H8 SoC. The H8 SoC has only 1KByte RAM. He could use the RAM for search-depth or for chess-knowledge. In his career he changed first from search-depth to chess-knowledge and later back. Our conclusion: A maximum search-depth between 16 and 20 ply should be fine for ESP Schachzwerg. Lets give the ESP8266 18 ply and the ESP32 20 ply. In 1990 Ed Schröder gave the "Mephisto Polgar 6502 10 MHz" chess computer a maximum search depth of 30 ply. I think this was overkill. The Ed Schröder computer of the following year, "Mephisto Milano 6502 5 MHz", has 22 ply maximum search depth.

Permanent brain

Permanent brain describes the continous working of the chess computer. While the chess-clock ticks away your time, the chess computer uses the principal variation, that is the move the chess computer expects from you, to "think ahead". If you play the expected move, the chess computer has already perform a ply_N search and can continue with the ply_N+1 search, that little bastard! If you make another move than the "feared", the chess computer starts a new search at ply_0. But this is no problem. You did not play the strongest move, but a weaker one - at least in the opinion of the computer. Permanant brain is the holy grail for ESP Schachzwerg. Hopefully it is not too complicated for my little 1000 Elo natural brain!

Null move and other forward pruning heuristics

Forward pruning like null move does reduce the branching factor, but for a price! Even the forward pruning fans agree on this: "To reduce the chance that Zugzwang will cause a problem, null move forward pruning is not done in the end game." A scholar's mate is a checkmate in the opening. I do not want to forward prune away something like this. The old chess computer game comments are full of "why had the program not seen the checkmate-in-N?". Because the program had forward pruned away the relevant sub-tree!
I know that debugging do/undo-move, material-balance and Zobrist hash is hard. I assume that debugging forward pruning is impossible.

Bitboard

The old mainframe chess programs used 64-bit bitboards. I do not understand bitboards and I have not heard that the professional 8-bit chess computer programmers like the Spracklens, David Kittinger, Ed Schröder, Thomas Nitsche, Elmar Henne, Kaare Danielson, Richard Lang or Frans Morsch have used bitboards. The ESP8266 and ESP32 are 32-bit RISC CPUs. But I treat them as powerful 8-bit CPUs. One important reason for mainframe bitboard use was that the old-and-ugly beasts could not address space in chunks of 8-bit. Bitboards are a work-around for mainframe deficiencies.
For the same odd reason we have "packed char" in good old Wirth Pascal. On the Control Data Corporation CDC6000 60-bit word size mainframe, one word could hold one floating point number or 10 packed 6-bit char. This was the architecture of Wirth's first Pascal computer and the reason why Pascal is all capital letters. Or on a 16-bit word machine an unpacked char used a word (16-bit) and a packed char only 8-bit. By the way: knowing this makes me really feel old. There is no Wikipedia entry about "packed char", but the cited Pascal reference book is a newbie from 1983! I have seen core-memory computers in action like the Siemens 330 or Raytheon RDS-500 in the system DERD-MC. And the Telefunken TR-86 was even older. It had the same small scale integration technology as the Apollo guidance computer.

About the author

I work as Referent for the german air traffic control. We do our operational business with Linux operating systems and mostly C/C++ programming. The ops business uses redundant real-time systems with measured availability of 99.999%, running 24/7 every day of the year. This is 5 minutes not-planned down-time per year. I have a hardware background as Fernmeldeelektroniker and a software background as Diplom Informatiker(FH). I am a ham, too. Callsign is DL1ADR. This qualifies me as triple nerd™. But I paint with acrylic colors and I have a big interest in dancing and Kamasutra. These are my non-nerdy facets. I have three children, two are already adults. My e-mail address is: