Svarichevsky Mikhail - RSS feed en-us Tue, 10 Jun 2006 04:00:00 GMT Tue, 16 Jul 24 10:15:50 +0000 120 10 <![CDATA[Programming Intel 87C51 - first high-volume integrated microcontroller (1980)]]>
Recently I got my hands on D87C51FC-20, and decided to experience the old ways of embedded software.

I was curious to see the chip in detail, including the shape and thickness of the window, so I sacrificed an AMD AM27C64. The window turned out to be around 1mm thick, and it was surprisingly well glued into the ceramic case:

  We tend to think that flat optical windows do not affect the image and that we can easily observe things through them. But this is not the case for microscopy with high numerical aperture (i.e. with a converging beam), where a flat window introduces spherical aberration that severely degrades resolution with lenses not optimized for that specific cover glass thickness. A similar problem arises when observing data on CDs - as one increases NA beyond 0.3, resolution rapidly declines with standard lenses.
  Here is an example I got while observing the 87C51. On the left - image without glass thickness correction (low contrast and resolution due to spherical aberration), on the right - with 1.05mm correction:

Now we can see the chip itself through quartz window used to erase EPROM content:

UV Erasing

  Erasing the chip requires a relatively high dose of UV light. I tried a 245nm LED at first, but its light output was so tiny that even after 1 hour there was no effect. It seems that below 350nm mercury-based light sources are still unchallenged by semiconductors.
  Recently I got these weird 10V/300mA mercury lamps. They exist in an "ozone" version (quartz tube, so with 185nm output) and a "no-ozone" version (some glass that absorbs 185nm but is transparent to 254nm). It is better to use the "no-ozone" version, as 254nm became the industry standard for EPROM erasing, and 185nm might be too harsh and is far less safe. Notice that only the upper part of the filament is covered by a white oxide coating for enhanced electron emission.

The lamp works in a very unexpected way. The power supply has to be set to ~14-15V with a 300mA current limit. Initially the filament glows red (especially its lower part, not covered by oxide) and evaporates mercury from the amalgam plate.

  When the mercury vapor pressure is high enough, plasma ignites and shunts the filament. Electron emission from the warm oxide-coated filament is enough to sustain the plasma. This photo was made with protective glasses and a camera with a UV filter (as you don't need much 185nm light to damage optical glue/plastic). Needless to say, with 185nm light one needs to protect the whole body, and ozone is toxic. So it's really better to work with the no-ozone version.
  When quartz glows blue - it's time to drop and run!

Lamp was mounted in a pickle jar (literally), with majority of walls covered by aluminum foil. Glass was filtering most of <365nm light. This way emission point of the lamp was within ~2cm from the chip, and erasing it took just 10 minutes.


Programming the chip with MiniPro TL866 was straightforward (it can now work with open source software).

Writing demo program

  Enough photos, time to code! In addition to the standard blinking LED, I decided to calculate some prime numbers and print them via the serial port, as this is already a somewhat challenging task for a 44-year-old microcontroller. Initially I followed the path of Jay Carlson and tried to use a modern IDE for 8051-compatible chips from Silicon Labs (Simplicity Studio). Unfortunately, while it works for a blinking LED, the serial port peripheral is apparently not binary compatible (and it makes sense - the serial port in 8051 is very simple by today's standards). So I switched to the open source SDCC, which worked perfectly.
  I did not want to introduce the complexity of doing serial port IO with a ring buffer and interrupts, but did a slight optimization in the 1-byte putchar: it waits for the end of the previous character's transmission before sending the next character, not after. This way further compute and serial IO can be partially parallelized. To ensure that it works for the first character, I set TI = 1 at the beginning of the program (i.e. "transmission of previous character done").
#include <8051.h>
#include <stdio.h>
#include <stdbool.h>

int putchar(int c) {
    /* We never start when transmission is not complete */
    while (TI==0);      /* Wait until stop bit transmit */
    TI = 0;
    SBUF = c;
    return c;
}

void toggle_led(void)
{
    P1_0 = !P1_0;
}

int main (void)
{
    int i, j, loop_limit;
    bool is_prime;

    //Serial port speed. Perfect 9600 for 18.432MHz crystal
    TH1 = (unsigned char)(256-5);//No "overflow in implicit constant conversion"
    TMOD = 0x20;//Timer 1 in mode 2 (8-bit auto-reload)
    SCON = 0x50;//Serial mode 1 (8-bit UART), receiver enabled
    TR1  = 1;//Start timer 1
    TI = 1;//To continue work while we are transmitting - we are waiting before transmission, not after

    printf("Hello world!\r\n");
    printf("Let's calculate some primes: 2");
    while (1)
    {
      for (i = 3; i <= 32000; i+=2) {
          loop_limit = 180;
          if(i-1<loop_limit)loop_limit = i-1;//We will not calculate square root :-)
          is_prime = true;

          for (j = 2; j <= loop_limit; ++j) {
              if (i % j == 0) {
                  is_prime = false;
              }
          }

          if (is_prime) {
              printf(" %d", i);
          }
      }

      printf("\r\nHere we go again: 2");
    }
}

As the 8751 had no PLL (not surprisingly), getting the proper serial port speed turned out to be the most challenging part. Serial port speed is defined by the quartz frequency and the Timer 1 reload value, TH1. Initially I tried to run it on a 20MHz crystal with TH1=256-5 and was unable to receive correct data on the computer, while the oscilloscope was still able to decode it. It turned out that the baud rate had a ~9% error. While it is possible to configure the CP2102 for a custom baud rate (X = F/(12*32*Y), where F is the quartz frequency and Y = 256-TH1 is the number of timer counts per overflow, 5 in this case), another option is to use a different quartz crystal for lower error.
An 18.432MHz crystal with a reload value of 256-5 is perfect, as if it was specifically designed for this purpose... (Narrator: it was)
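The formula above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name `uart_baud` and the 9600 target constant are my own, and it assumes the plain F/(12*32*Y) case from the text, with Y = 256-TH1.

```python
def uart_baud(f_osc_hz, th1):
    """Baud rate from the formula in the text: F / (12 * 32 * Y), Y = 256 - TH1."""
    counts_per_overflow = 256 - th1
    return f_osc_hz / (12 * 32 * counts_per_overflow)

TARGET_BAUD = 9600  # what the PC side expects

for f_osc in (20_000_000, 18_432_000):
    baud = uart_baud(f_osc, th1=256 - 5)
    error_pct = (baud - TARGET_BAUD) / TARGET_BAUD * 100
    print(f"{f_osc / 1e6:.3f} MHz crystal: {baud:.1f} baud, {error_pct:+.1f}% off {TARGET_BAUD}")
```

The same function can be used the other way round: try a few crystal values and reload values and keep the pair with the smallest error.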

Prototype assembly

P1.0 (pin 1): LED.
Reset (pin 9): Active high. Pull down with resistor, connect to VCC via capacitor for short reset at power up.
TXD (pin 11): Serial port out.
XTAL1 and XTAL2 (pins 18 and 19): Crystal, 18.432Mhz in my case. Worked without external capacitors, even though it might have some slight frequency error / be less stable. Parasitic capacitance of breadboard is too low (~2pF).
GND (pin 20) and VCC (pin 40): 5V power supply.
EA (pin 31): External memory disable (to use internal EPROM), connect to VCC.


With correct serial port speed - it works!

If we look at the TX signal on oscilloscope during printing of different numbers - we can see something interesting:

When printing 3-digit numbers, the delay between the space and the first digit is ~1.25ms; it is 1.8ms for 4 digits and 2.5ms for 5 digits. Why is that?
  After printf(" %d", i) prints the space, it needs some time to convert the integer to a string, and this takes longer for longer numbers. As there is no hardware division of 16-bit integers in the 8051, a straightforward implementation of itoa is so slow that you can see the delay even at 9600 baud.
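To illustrate why the delay grows with the number of digits, here is a rough Python model. It is an assumption-laden sketch, not SDCC's actual library code: it supposes that itoa performs one 16-bit division by 10 per output digit, and that each division is done in software by shift-and-subtract (16 steps). The names `divmod_u16` and `itoa_u16` are hypothetical.

```python
def divmod_u16(n, d):
    """Shift-and-subtract division of a 16-bit value. Returns (quotient, remainder, steps)."""
    q, r, steps = 0, 0, 0
    for bit in range(15, -1, -1):
        r = (r << 1) | ((n >> bit) & 1)  # bring down the next bit of the dividend
        q <<= 1
        if r >= d:
            r -= d
            q |= 1
        steps += 1
    return q, r, steps

def itoa_u16(n):
    """Decimal conversion by repeated division by 10. Returns (text, total division steps)."""
    digits, total_steps = [], 0
    while True:
        n, rem, steps = divmod_u16(n, 10)
        total_steps += steps
        digits.append(chr(ord("0") + rem))
        if n == 0:
            break
    return "".join(reversed(digits)), total_steps

for value in (997, 9973, 31991):
    text, steps = itoa_u16(value)
    print(f"{text}: {steps} shift-and-subtract steps")
```

In this model the work is proportional to the digit count (3 : 4 : 5), which is in the same direction as the measured delays, although the real timings grow somewhat faster than that.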


  All this took around 6 hours, with ~8 erase-program iterations, which were quite slow without a devboard in the classical sense. Most necessary features of modern microcontrollers are already present in the 87C51 - it is surprising how far ahead of its time it was. Still, there are a few things it lacks compared to chips we use today, which could have helped in this test program:

1) Raw performance. 1 (one) 8-bit MOPS vs ~50-150 32-bit MOPS we can have now for $1-2. The gap in math is even wider: today we are spoiled by relatively fast hardware division and multiplication. It is not unheard of to even have hardware 32-bit floating point math in <$5 chips.
2) PLL. Today, with a common/cheapest 8MHz crystal, many useful frequencies can be synthesized, so there is no need to keep a large stock of crystals for different applications.
3) On chip debugging and programming allows for very fast iterations.
4) Serial ports with buffers + much faster baud rates are common.
5) Internal reset circuit, brown-out detection, watchdog timer.

  So, while 8051-compatible chips are cheaper (often cheapest), development requires more work and consideration. But for high-volume applications where task is relatively simple and does not require a lot of math - it can still be the solution with lowest total cost, and this is why derivatives of 8051 are still in use today (+ some legacy code reuse). ]]>
Sun, 12 May 24 13:03:48 +0000
<![CDATA[Sony F828 and infrared photography]]>
Here an electric cooktop emits a lot of infrared; the black glass is transparent to infrared and shows its internals:

Camera itself. With 15x15mm neodymium magnet - switch to infrared works as expected. Smaller 10x5mm magnets were too weak.
I've got 680nm+, 760nm+ and 950nm+ filters. So far 760nm one is most practical: shows world significantly different and lets decent amount of light through. At 950nm+ sensitivity of sensor drops too far, so it's only for tripod.

Next looking at my sweater in infrared:

Dyes have no spectral features in 760nm+ infrared, surprisingly even black dye. Only in 680nm+ features are barely visible.

Fri, 19 Jan 24 21:19:26 +0000
<![CDATA[Lichee Console 4A - RISC-V mini laptop : Review, benchmarks and early issues]]> small laptops and phones - but for some reason they fell out of favor of manufacturers ("bigger is more better"). Now if one wanted to get tiny laptop - one of the few opportunities would have been to fight for old Sony UMPC's on ebay which are somewhat expensive even today. Recently Raspberry Pi/CM4-based tiny laptops started to appear - especially clockwork products are neat, but they are not foldable like a laptop. When in summer of 2023 Sipeed announced Lichee Console 4A based on RISC-V SoC - I preordered it immediately and in early January I finally received it. The results of my testing and the issues uncovered so far are below.

Brief specs and internals

First of all, Lichee Console is tiny, 185 x 140 x 19 mm, 656g. Build is solid and high-quality, using mostly aluminum. Keyboard has typical laptop key travel and to my feel comparable to Lenovo's but it is of course quite a bit smaller. I cannot type on it blindly (yet), but it is possible. The only inconvenient part of keyboard (in context of Linux) is compressed keys [,][.][/], which are often used in console. Trackpoint mouse is ok for someone who had it all these years on Lenovo laptops.

Lichee Console 4A runs on the T-Head (Alibaba) TH1520 quad-core RISC-V SoC (4x C910 cores). While the TH1520 can clock up to 2.0/2.5GHz, in Lichee Console it is tamed down to 1.5GHz max, likely to help with thermal dissipation. The maximum configuration is quite serious for such a tiny thing: up to 16GB of DDR4 RAM (I got this version) and up to 128GB of eMMC. There is a slot for a 42mm SATA M.2 SSD, but it is connected via an ASM1153 USB3.0->SATA adapter. More on that later.

I would prefer working on M.2 SSD as you can keep the data if something else fails (eMMC is soldered on the board and will be expensive to recover). 42mm SATA SSD's are not very popular, and the best I was able to find was Transcend MTS400 256Gb (it is still in transit). There are many such SSD's from Chinese brands though.

The display has a resolution of 1280x800 and looks to be IPS: no color shift at high angles. There is a webcam on the left side of the monitor - it is an average quality full HD 30p one (requires good lighting), in landscape orientation. It is possible to connect an external display via mini-HDMI (cable included). It worked fine on a FullHD monitor, but was unstable or not working at all on a 4K one.

Battery is 2S 3000mAh. Charging can be done through USB-C (maximum 5V 2.2A, does not trigger 9/12V) or via 12V jack (which I personally will not use). 12V power brick came with Chinese/US plug, so if you want to use it - you will need an adapter. Jack external diameter is 3.45mm (central positive), if you would want to find a 12V PD trigger adapter for it (something like this but double check voltage). More on battery life below.

On the software side - Debian 12 with Xfce is preinstalled, built for 64-bit RISC-V. WiFi or Ethernet connection was straightforward, and Chrome-based browser was able to play YouTube video with no issues. apt update fetches packages from Chinese server.

Unboxing video below (no comments):

Build quality:
There was only 1 build quality issue on my sample: apparently aluminum bottom part of case was squeezing keyboard, or something was pressing and bending keyboard outwards slightly (~1mm), which was catching screen bottom when opening and making unhealthy snapping sound. After reassembly & pressing on keyboard in the middle - the issue was resolved.

Disassembly/assembly is relatively complicated due to tight fit of aluminum bottom cover, and I do not recommend to disassemble the unit unless absolutely necessary.

I do not like the metal clips holding the lithium battery in place. After enough vibration and abuse, or some swelling of the battery, it is not unthinkable that they might bite into the battery and destroy the unit. Even if only 0.1% of units burn down due to this potential issue, this will be very sad. A plastic bracket for the battery or gluing it in place (disliked by many and hard to service) are well tested by the industry and safe. What makes it hard to do well is the flex cables under the battery. On my unit I added kapton tape under and over the metal bracket to ensure it does not wiggle over the battery and has a harder time biting into it.

Unlike most laptops, Lichee Console uses 2 PCB's in addition to SOC module, and this will bite us later. IO board has microsd card slot, USB and analog audio.

SoC module is removable. Heat is transferred via a silicone pad to a heatpipe glued to the aluminum back cover.

Benchmarks & tests

CPU & Power
Metric | TH1520 @1.5GHz | Raspberry Pi 4 | Raspberry Pi 5
Idle power | 7.68W / 6W (with/without screen) | 1.93W | 2.42W
CoreMark 1 core | 6900 | 7938 | 17725
Power 1 core | 8.376W (with screen) | 2.70W | 4.47W
CoreMark 4 cores | 25689 | 31532 | 69860
Power 4 cores | 9.408W (with screen) | 4.85W | 7.35W
Here performance is slightly behind Raspberry Pi 4 due to clock speed being reduced from 2.0 to 1.5Ghz. Personally I find performance of Raspberry Pi 4 perfectly acceptable for console work, and I am satisfied with performance of TH1520 for my use. I have included Raspberry Pi 5 for comparison as it's already 2024, and later this year we'll (hopefully) see competing products using CM5.

What I don't like though is high static power consumption of Lichee Console. At idle system goes down to 300Mhz, and even with 3 cores manually parked - it still consumes ~6W (without screen). This static power consumption makes Lichee Console quite warm even at idle. Also, this gives us just ~2.5 hours of battery life without any heavy load. As USB charging is limited to 5V/2.2A - Lichee Console will charge extremely slowly when powered on (~3 hours to full charge when switched off and ~10 hours when switched on). Surely, 12V 3.45mm barrel charging is much faster.

Dynamic power consumption of C910 cores is rated at 200µW/MHz/core, which gives us 300mW of dynamic power for 1 core at 1.5GHz, and 1.2W for a 4-core load. Measurements confirm these numbers, so the only issue is high static power consumption. On the ratio of performance to dynamic power it is perfectly competitive with Raspberry Pi 4; it is only the static part that hurts it.
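The arithmetic behind these numbers is short enough to write down. The sketch below uses only the 200µW/MHz/core rating and the wall-power figures from the table above; the function and variable names are mine, and the "per extra busy core" figure is just one simple way to compare the measurement with the rating.

```python
RATED_UW_PER_MHZ_PER_CORE = 200  # rating quoted in the text

def dynamic_power_w(freq_mhz, cores):
    """Rated dynamic power in watts for `cores` busy cores at `freq_mhz`."""
    return RATED_UW_PER_MHZ_PER_CORE * 1e-6 * freq_mhz * cores

# Wall power from the benchmark table (with screen), in watts
measured_1_core = 8.376
measured_4_cores = 9.408
per_extra_core_w = (measured_4_cores - measured_1_core) / 3

print(f"rated, 1 core  @1500MHz: {dynamic_power_w(1500, 1):.2f} W")
print(f"rated, 4 cores @1500MHz: {dynamic_power_w(1500, 4):.2f} W")
print(f"measured, per extra busy core: {per_extra_core_w:.3f} W")
```

The measured increment per additional loaded core lands close to the rated 0.3W, while the idle figure of 6-7.7W is many times larger, which is the point of the paragraph above.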

To investigate high static power consumption I made thermal photo at idle:

Here we see that approximately half of the power is dissipated by the VIA VL817 - a USB 3.0 hub IC located right under the SoC module. Less, but still significant, power is dissipated by the ASM1153 USB->SATA adapter, despite no SATA drives being connected. This is quite disappointing. If no software fix is found to disable unused interfaces, I am personally considering de-soldering these ICs or disconnecting them from power. 5-6 vs 2.8 hours of battery life is more important for my use.

This high idle power consumption is probably why cooling fan is always on (thankfully it is quiet), even when I put Lichee Console inside the fridge :-)

WiFi & Ethernet
WiFi module is connected via SDIO. Practical speed via iperf3 is 122/115 Mbit/sec. "Not great, not terrible" - but good enough for regular use.
Wired Ethernet does 925/925 Mbit/sec without jumbo packets which is nearly as good as it gets. SoC has 2 Ethernet ports, only 1 is accessible on Lichee Console.

Disk performance
Random 4k: Writes 8102 IOPS, 31.6MiB/s. Reads 2502 IOPS, 9.77MiB/s
Random 1MB: Writes 202MB/s, Reads 130MB/s

Random access is slower than modern fast microsd cards, but sequential is acceptable (for eMMC).

Testing fast MicroSD cards (Samsung Pro Ultimate, Sandisk Extreme Pro), which can negotiate the fastest possible speed (up to 200Mb/s), uncovered that they are unstable: operations fail with IO errors. This is likely caused by the extremely long signal path: from the SoC to a flex connector, then a folded flex cable, then a path across the IO board. Old/slow MicroSD cards work reliably but at a snail's pace. Hopefully the maximum interface speed for MicroSD can be reduced in software without affecting eMMC speed.

Currently missing/broken features (mostly software):
1) Bluetooth was failing to pair devices out of the box using GUI tools.
2) No sleep function. You have to switch off / boot up every time you open Lichee Console.
3) Not sure if there is sensor detecting closed screen. Right now when closed it just continues working with the screen on.
4) Adjustment of screen brightness does not work (it is always at max brightness, or off). Update: "apt install pkexec" fixes adjustment via GUI. Keyboard bindings still need to be done.
5) Suboptimal power management leading to high static power consumption: Is it possible to disable VL817/ASM1153? Is SoC supply voltage scaling correctly at idle?

I will update the article here as software is improved.


My overall experience with Lichee Console is positive and I like it. It should be noted that at the moment it is more of a product for tinkering and not something that you can immediately use as-is for work with no changes. Substantial improvement will be required on software to fully utilize hardware capabilities (but this often happens with Linux on mobile platforms). Hardware has some flaws, they are unpleasant but not fatal (microsd stability at high speed, high idle power consumption). I am concerned about battery safety, and hopefully this is something that Sipeed can improve.

The 12nm TH1520 SoC offers competitive dynamic power consumption and sufficient performance, but is lacking in IO (for desktop use), which forced Sipeed to add extra interface ICs that happen to consume too much static power.

I hope that the current rapid pace of RISC-V infrastructure development will continue and in the coming years we'll see more RISC-V SoCs, this time with at least a few lanes of PCI-E - and we'll get even more exciting Linux-capable RISC-V devices. Update: Milk-V Oasis is a glimpse of this future, expected later in 2024. Looking forward to testing it.

PS. If you like microchips and their internals - you might like my blog about boiling microchips in acid :]]>
Tue, 16 Jan 24 06:34:52 +0000
<![CDATA[Ronald Reagan and Raspberry Pi]]> told a joke:

You know there’s a ten year delay in the Soviet Union of the delivery of an automobile, and only one out of seven families in the Soviet Union own automobiles. There’s a ten year wait. And you go through quite a process when you’re ready to buy, and then you put up the money in advance.

And this happened to a fella, and this is their story, that they tell, this joke, that this man, he laid down his money, and then the fella that was in charge, said to him, ‘Okay, come back in ten years and get your car.’ And he said, ‘Morning or afternoon?’ and the fella behind the counter said, ‘Well, ten years from now, what difference does it make?’ and he said, ‘Well, the plumber’s coming in the morning.'

On the 28th of September preorders for Raspberry Pi 5 were opened. I did not preorder it immediately, but slept on it and placed my preorder at 6am the next day. I surely did pay 100% in advance. What I did not know at the time is that every ~6 hours of delay was postponing delivery by ~1 month. So while the first preorders were delivered in early November (unless you are a celebrity), mine was fulfilled only in early January. Still, it is better than what was happening with Raspberry Pi 4 at the peak of the silicon shortage, when one easily had to wait 6 months. Those who really needed it surely could have paid scalpers 200% of the price (not sure why manufacturers hesitate to do it). Hopefully, queues for electronics will get shorter over time, not longer (although with the current Taiwan situation there could be surprises).

Now, having 2 precious Pi's in my hands I can feel the privilege. The hype is partially justified, my coremark benchmarks confirm 2.2x performance boost at 1.5x power consumption and PCI-E is real. There is still quite a lot of room for further improvement until Raspberry Pi reaches 100W peak power consumption :-)

Fri, 12 Jan 24 11:59:08 +0000
<![CDATA[Finishing 10 minute task in 2 hours using ChatGPT]]> Many of us have heard stories where one was able to complete days worth of work in minutes using AI, even being outside of one's area of expertise. Indeed, often LLM's do (almost) miracles, but today I had a different experience.

The task was almost trivial: generate look-up table (LUT) for per-channel image contrast enhancement using some S-curve function, and apply it to an image. Let's not waste any time: just fire up ChatGPT (even v3.5 should do, it's just a formula), get Python code for generic S-curve (code conveniently already had visualization through matplotlib) and tune parameters until you like it before plugging it into image processing chain. ChatGPT generated code for logistic function, which is a common choice as it is among simplest, but it cannot change curve shape from contrast enhancement to reduction simply by changing shape parameter.

The issue with generated code though was that graph was showing that it is reducing contrast instead of increasing it. When I asked ChatGPT to correct this error - it apologized and produced more and more broken code. Simply manually changing shape parameter was not possible due to math limitation - formula is not generic enough. Well, it is not the end of the world, LLM's do have limits especially on narrow-field tasks, so it's not really news. But the story does not end here.

For reference, this is ChatGPT's code:

import numpy as np
import matplotlib.pyplot as plt

def create_s_curve_lut():
    # Define parameters for the sigmoid curve
    a = 10.0  # Adjust this parameter to control the curve's shape
    b = 127.5  # Midpoint of the curve (127.5 for 8-bit grayscale)

    # Create the S-curve LUT using the sigmoid function
    lut = np.arange(256)
    lut = 255 / (1 + np.exp(-a * (lut - b) / 255))

    # Normalize the LUT to the 0-255 range
    lut = (lut - np.min(lut)) / (np.max(lut) - np.min(lut)) * 255

    return lut.astype(np.uint8)

# Create the S-curve LUT
s_curve_lut = create_s_curve_lut()

# Plot the S-curve for visualization
plt.plot(s_curve_lut, range(256))
plt.xlabel("Input Values (0-255)")
plt.ylabel("Output Values (0-255)")
plt.title("S-curve Contrast Enhancement LUT")

# You can access the S-curve LUT with s_curve_lut

At this point I gave up on ChatGPT LUT code and redid it using more universal regularized incomplete beta function. I adjusted a=b parameter to achieve curve shape that I like and applied LUT to image using OpenCV's LUT function. To my surprise and disbelief function was reducing contrast instead of increasing it. What?

After extensive head-scratching, to troubleshoot the problem I made a simplified linear contrast enhancement LUT and observed expected result. Only when I added linear contrast LUT to the graph issue became clear: When I abandoned ChatGPT's S-curve function, I kept graph code. In this code ChatGPT marked graph's axis labels and even added title. But then it threw a wrench by feeding x-data into Y axis and vice versa, effectively flipping the graph. As parameters of plt.plot are not named, it is very easy to miss this error for a human.

When I tuned the shape factor for the beta function with a flipped graph, I made it contrast-reducing while it looked like what I needed. When I told ChatGPT that its S-curve function was reducing contrast instead of increasing it, I misled it (and it unconditionally believed me), as the S-curve was correct but the error was in the graph piece. Surely, if you tell ChatGPT that the error is in the plt.plot parameters, it can correct it.
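One way not to depend on reading a plot correctly is to check the direction of the curve numerically. The sketch below is not the code from the original session: it rebuilds the same logistic LUT with the standard library only (same a and b as in ChatGPT's code, but with rounding), and the helper `midtone_slope` is a hypothetical name. A slope above 1 around mid-grey means midtone contrast is increased.

```python
import math

def logistic_lut(a=10.0, b=127.5):
    """8-bit S-curve LUT from a logistic function, rescaled to span 0..255."""
    raw = [1.0 / (1.0 + math.exp(-a * (x - b) / 255.0)) for x in range(256)]
    lo, hi = raw[0], raw[-1]
    return [round((v - lo) / (hi - lo) * 255.0) for v in raw]

def midtone_slope(lut, half_width=32):
    """Average LUT slope around mid-grey; > 1 means midtone contrast goes up."""
    lo, hi = 128 - half_width, 128 + half_width
    return (lut[hi] - lut[lo]) / (hi - lo)

lut = logistic_lut()
slope = midtone_slope(lut)
verdict = "enhances" if slope > 1 else "reduces"
print(f"midtone slope {slope:.2f}: curve {verdict} contrast")

# If plotting with matplotlib, input values go first and outputs second:
#   plt.plot(range(256), lut)
```

With a check like this next to the plot, a swapped argument order in plt.plot shows up as a disagreement between the printed verdict and the picture.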

I remember my analytic geometry teacher at the final exam: when I was proving my solution, he could unexpectedly disagree with one of the steps and claim that there was an error. To get the maximum mark one had to not panic and continue defending the correct solution. Hopefully we will see LLM's disagree with users more.


But that's not all: just when I thought we were done, there was one more bug in the code. One can notice a slight asymmetry of the GPT-TRAP curve at the high end. It's a rounding error: the calculated value is simply cast to uint8 (which discards the fractional part) instead of being rounded, so on average we are getting 0.5 unit / ~0.25% lower brightness of the image and significantly rarer full white values (255). What is interesting is that this error appeared to be systematic and present in all generated samples from all LLM's I've tested. I.e. apparently the error was very widespread in the training data of all LLM's, so they have all learned that "multiply by 255 and cast to uint8" is enough to fit values to the 0..255 range. Technically this is true, but the result is mathematically flawed.
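The effect of truncating instead of rounding is easy to reproduce without any image. This is a minimal standard-library sketch (in NumPy terms, truncation corresponds to a bare astype(np.uint8) on non-negative floats); the 10000 evenly spaced samples and the function names are arbitrary choices of mine.

```python
def to_u8_truncate(v):
    """Scale 0..1 to 0..255 and drop the fractional part, like a bare cast."""
    return int(v * 255.0)

def to_u8_round(v):
    """Scale 0..1 to 0..255 and round to the nearest integer."""
    return int(v * 255.0 + 0.5)

samples = [i / 9999 for i in range(10000)]  # evenly spaced values in 0..1
truncated = [to_u8_truncate(v) for v in samples]
rounded = [to_u8_round(v) for v in samples]

mean_shift = (sum(rounded) - sum(truncated)) / len(samples)
print(f"mean brightness lost by truncation: {mean_shift:.2f} units")
print(f"samples reaching 255: truncated={truncated.count(255)}, rounded={rounded.count(255)}")
```

With truncation only an input of exactly 1.0 reaches 255, while with rounding the top half-step of the input range does, which matches the missing full-white values described above.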


My conclusions are:
  • LLM's are like junior developers - they can and will make unexpected mistakes, and they need clear instructions and guidance. The difference though is that junior developers will learn over time, while LLM's will get better only in the next generation. Like junior developers, LLM's need to be "managed" with reasonable expectations.
  • All code from LLM's must be verified; the more niche the field, the more tests. LLM's generate code that looks correct, and when it's not, errors can be very subtle and expensive to debug/fix.
  • In case of unexpected or puzzling results it is often faster to simply ask multiple LLM's: now in addition to ChatGPT (3.5/4) we have Copilot, Bard, Replit and more. None of these gave perfect results on the first try, but some errors were different and often less subtle, making it easier to get it working in 20 minutes total.
  • Some of the errors are systematic across multiple LLM's, which apparently comes from training data (as LLM's currently unconditionally trust training data, unlike humans). I.e. currently LLM's cannot exceed the quality level of their training data, but can only approach it. It is unclear how much further work on LLM's will be needed to get a perfect result consistently; I am afraid it might be the case where the last 10% of the work requires 90% of the time.
Sun, 22 Oct 23 23:18:49 +0000
<![CDATA[Sirius and color twinkling ]]>
Why does it happen? Stars twinkle due to turbulence of the atmosphere acting as a random gradient refractive index "prism" (which is randomly shifting the image & splitting colors - yes, even air has dispersion and it's visible here!) - so more or less light of different colors randomly hits the lens aperture / eye. For stars, air turbulence is sampled (in this case) in a cylinder 62mm in diameter and ~50km in length, which makes the effect very visible. Jupiter, for example, will average turbulence over a cone which opens up to 7.2m at 50km due to the angular size of the planet, which dramatically reduces the contrast of twinkling. The same averaging (reduction of twinkling) can happen for large telescopes (300mm+) even for stars, simply due to averaging across a larger air volume.

One more:

Mon, 25 Sep 23 03:05:34 +0000
<![CDATA[EVE Online - it's getting crowded in space]]> should now get 1'000'000 SP on first login and that's the point of this post.

Sun, 24 Sep 23 23:24:51 +0000
<![CDATA[65B LLaMA on CPU]]>
16 years ago dog ate my AI book. At the time (and way before that) common argument on «Why we still don't have AI working and it is always 10 years away» was that we can't make AI work even at 1% or 0.1% human speed, even on supercomputers of the time – therefore it's not about GFLOPS.

This weekend I ran gpt4-alpaca-lora_mlp-65B language model in home lab on CPU (using llama.cpp, due to model size – there is 0 chance to run it on a consumer GPU). This model is arguably the best open LLM (this week), and 65 billion parameters is no joke: with single precision math it won't fit in 256Gb of RAM. If you let it spill into swap, even on NVMe drive – it will run at ~1 token per minute (limited by swap speed), which is about 0.5% of human speed. Even at this snail pace it can still show superhuman performance in memorization-related tasks. It is clear that it was not possible to get there 20 years ago – training time would have been prohibitive even with unlimited government funding.

And this is where the unexpected open approach of Meta proved to be superior to the closed, dystopian megacorp approach of OpenAI: in the 10 weeks since LLaMA was released into the wild, not only were derivative models trained, but 2-4-5 bit quantization enabled larger models on consumer hardware. In my case, with 5-bit quantization the model fits into 64GB of RAM and runs at ~2 tokens per second (on 64 cores), which is probably 70-90% of my human speed in best shape.
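The memory figures can be sanity-checked with back-of-the-envelope arithmetic. The sketch below counts the weights only, in decimal gigabytes, and assumes a nominal 65e9 parameters taken from the model name; it ignores context buffers and the per-block scale factors that real quantized formats store, so actual files and memory use are somewhat larger.

```python
PARAMS = 65e9  # nominal parameter count, taken from the model name (assumption)

def weights_gb(bits_per_param, params=PARAMS):
    """Size of the weights alone, in decimal gigabytes (10**9 bytes)."""
    return params * bits_per_param / 8 / 1e9

for label, bits in (("fp32", 32), ("fp16", 16), ("5-bit", 5), ("4-bit", 4)):
    print(f"{label:>5}: {weights_gb(bits):6.1f} GB")
```

Single precision comes out at about 260 GB for the weights alone, which leaves essentially no headroom on a 256GB machine, while 5 bits per weight is roughly 41 GB and fits comfortably into 64GB.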

For comparison, I tried Replit-code-v1.3b 2.7B model optimized for coding. After 65B monster – Replit feels like a breeze and shows very good performance despite its size. This is a good reminder that field-specific, smaller models should always be used where possible.

It feels like "1 Trillion parameters will be enough for everybody", but such models would probably not be practical for another 2 years. Meanwhile, key enablers of AI proliferation could be an increase of RAM in consumer GPUs beyond 24GB (which is sadly unlikely to happen due to commercial interests) and smaller field-specific models, which I would be looking into with much more interest.]]>
Mon, 22 May 23 07:57:14 +0000
<![CDATA[First tiny ASIC sent to manufacturing]]> 5 years ago making microchip from high-level HDL with your own hands required around 300k$ worth of software licenses, process was slow and learning curve steep.

Yesterday I submitted my first silicon for manufacturing and it was... different. In the evening my wife comes and asks "How much time until the deadline?". I reply: "2 hours left, but I still have to learn Verilog." (historically my digital designs were in VHDL or schematic).

All this became possible thanks to Google Skywater PDK and openlane synthesis flow - which allowed anyone to design a microchip with no paperwork to sign and licenses to buy. Then by Matt Venn lowered the barrier even further (idea to tapeout in ~4 hours, including learning curve).

As expected, all this allows many more people to contribute to the open source flow, with my favorite being the work of Teodor-Dumitru Ene on hardware adders, which now match and beat commercial tools. I think (and hope) that in 5 years open source tools will dominate the market on mature nodes (28nm and up), not because they are cheaper, but because they are better and easier to use.

My design fits in 100x100µm and contains 289 standard cells. There are 7 ring oscillators with frequency dividers to compare silicon performance to analog simulation across voltage/temperature. I expect to see chips in ~6-9 months, both working and under microscope :-)]]>
Sun, 04 Sep 22 15:22:58 +0000
<![CDATA[This cake is a lie.]]> Stable Diffusion model that was publicly released this week is a huge step forward in making AI widely accessible.

Yes, DALL-E 2 and Midjourney are impressive, but they are a blackbox. You can play with it, but can't touch the brain.

Stable Diffusion not only can be run locally on relatively inexpensive hardware (i.e. sized perfectly for wide availability, not just bigger=better), it is also easy to modify (starting from tweaking guidance scale, pipeline and noise schedulers). Access to latent space is what I was dreaming about, and Andrej Karpathy's work on latent space interpolation is just the glimpse into many abilities some consider to be unnatural.

The model is perfect with food and good with humans/popular animals (which are apparently well represented in the training set), but rarer llamas/alpacas often give you anatomically incorrect results which are almost NSFW.

On an RTX3080 the fp16 model completes 50 inference iterations in 6 seconds, and barely fits into 10GB of VRAM. Just out of curiosity I ran it on CPU (5800X3D) - it took 8 minutes, which is probably too painful for anything practical.

One more reason to buy 4090... for work, I promise!
Fri, 26 Aug 22 19:49:29 +0000