Finishing a 10-minute task in 2 hours using ChatGPT

Many of us have heard stories of someone completing days' worth of work in minutes using AI, even outside their area of expertise. Indeed, LLMs often do work (almost) miracles, but today I had a different experience.

The task was almost trivial: generate a look-up table (LUT) for per-channel image contrast enhancement using some S-curve function, and apply it to an image. Let's not waste any time: just fire up ChatGPT (even v3.5 should do, it's just a formula), get Python code for a generic S-curve (the code conveniently already included visualization through matplotlib), and tune the parameters until you like the result before plugging it into the image processing chain. ChatGPT generated code for the logistic function, which is a common choice as it is among the simplest, but it cannot switch the curve from contrast enhancement to contrast reduction simply by changing the shape parameter.

The issue with the generated code, though, was that the graph showed it reducing contrast instead of increasing it. When I asked ChatGPT to correct this error, it apologized and produced more and more broken code. Simply changing the shape parameter by hand was not possible due to a limitation of the math: the formula is not generic enough. Well, it is not the end of the world; LLMs do have limits, especially on narrow-field tasks, so it's not really news. But the story does not end here.

For reference, this is ChatGPT's code:

import numpy as np
import matplotlib.pyplot as plt

def create_s_curve_lut():
    # Define parameters for the sigmoid curve
    a = 10.0  # Adjust this parameter to control the curve's shape
    b = 127.5  # Midpoint of the curve (127.5 for 8-bit grayscale)

    # Create the S-curve LUT using the sigmoid function
    lut = np.arange(256)
    lut = 255 / (1 + np.exp(-a * (lut - b) / 255))

    # Normalize the LUT to the 0-255 range
    lut = (lut - np.min(lut)) / (np.max(lut) - np.min(lut)) * 255

    return lut.astype(np.uint8)

# Create the S-curve LUT
s_curve_lut = create_s_curve_lut()

# Plot the S-curve for visualization
plt.plot(s_curve_lut, range(256))
plt.xlabel("Input Values (0-255)")
plt.ylabel("Output Values (0-255)")
plt.title("S-curve Contrast Enhancement LUT")

# You can access the S-curve LUT with s_curve_lut

At this point I gave up on ChatGPT's LUT code and redid it using the more universal regularized incomplete beta function. I adjusted the a=b parameter to get a curve shape that I liked and applied the LUT to an image using OpenCV's LUT function. To my surprise and disbelief, the function was reducing contrast instead of increasing it. What?
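
For readers unfamiliar with this approach, a beta-function LUT might look roughly like the sketch below. It is not my exact code: the function names `beta_s_curve_lut` and `apply_lut` and the sample shape values are made up for illustration, and it assumes SciPy is available for `scipy.special.betainc`. With a=b above 1 the curve is steeper than the identity in the midtones (more contrast), and with a=b below 1 it is flatter (less contrast).

```python
import numpy as np
from scipy.special import betainc  # regularized incomplete beta function I_x(a, b)


def beta_s_curve_lut(shape: float) -> np.ndarray:
    """Build a 256-entry uint8 LUT from I_x(shape, shape) (illustrative helper)."""
    x = np.arange(256) / 255.0          # inputs normalized to 0..1
    y = betainc(shape, shape, x)        # symmetric S-curve because a == b
    return np.rint(y * 255.0).astype(np.uint8)  # round, then cast


def apply_lut(image: np.ndarray, lut: np.ndarray) -> np.ndarray:
    """Apply the LUT with NumPy indexing (illustrative helper).

    For uint8 images this should give the same result as cv2.LUT(image, lut).
    """
    return lut[image]


enhance = beta_s_curve_lut(2.0)   # a = b > 1: steeper midtones
reduce_ = beta_s_curve_lut(0.5)   # a = b < 1: flatter midtones

# How much of the output range the middle 64 input levels are mapped to
print("a=b=2.0:", int(enhance[160]) - int(enhance[96]))
print("a=b=0.5:", int(reduce_[160]) - int(reduce_[96]))
```

Comparing how many output levels a fixed band of midtone inputs spans is a quick numeric check that does not depend on any graph.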

After extensive head-scratching, I made a simplified linear contrast enhancement LUT to troubleshoot the problem, and observed the expected result. Only when I added the linear contrast LUT to the graph did the issue become clear: when I abandoned ChatGPT's S-curve function, I kept its graph code. In this code ChatGPT labeled the graph's axes and even added a title. But then it threw a wrench in the works by feeding the x data into the Y axis and vice versa, effectively flipping the graph. As the parameters of plt.plot are not named, it is very easy for a human to miss this error.
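
One way to make this kind of mistake harder to miss is to give the plotted arrays explicit names and to draw an identity line as a reference. The sketch below is only an illustration: the variable names and the stand-in linear LUT are my own choices here, and the `Agg` backend is selected just so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

inputs = np.arange(256)  # LUT index = input pixel value

# Stand-in LUT for illustration: linear contrast stretch around mid-gray
lut = np.clip(np.rint((inputs - 127.5) * 1.5 + 127.5), 0, 255).astype(np.uint8)

fig, ax = plt.subplots()
# plot(x, y): inputs go on the horizontal axis, LUT outputs on the vertical one
(curve,) = ax.plot(inputs, lut, label="LUT")
# Reference line: a LUT that changes nothing
(identity,) = ax.plot(inputs, inputs, "--", label="identity (no change)")
ax.set_xlabel("Input Values (0-255)")
ax.set_ylabel("Output Values (0-255)")
ax.set_title("Contrast LUT vs. identity")
ax.legend()
```

With the axes this way round, a contrast-enhancing LUT crosses the identity line in the midtones with a steeper slope; if it looks flatter there, either the LUT or the plot is wrong.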

When I tuned the shape factor of the beta function against a flipped graph, I made it contrast-reducing while it looked like exactly what I needed. When I told ChatGPT that its S-curve function was reducing contrast instead of increasing it, I misled it (and it unconditionally believed me): the S-curve was correct, and the error was in the graph code. Surely, if you tell ChatGPT that the error is in the plt.plot parameters, it can correct it.
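
That the logistic S-curve was fine all along can be checked with numbers instead of a graph. The sketch below reuses the formula from ChatGPT's code but keeps the values as floats; the helper names `logistic_lut` and `mid_slope` and the sample values of `a` are chosen for illustration only. A slope above 1 at mid-gray means the midtones are stretched, i.e. contrast is increased.

```python
import numpy as np


def logistic_lut(a: float) -> np.ndarray:
    """Float version of the logistic LUT from the code above (no uint8 cast)."""
    x = np.arange(256)
    y = 255 / (1 + np.exp(-a * (x - 127.5) / 255))
    # Same min/max normalization to the 0..255 range as in the original
    return (y - y.min()) / (y.max() - y.min()) * 255


def mid_slope(lut: np.ndarray) -> float:
    """Output change per input step at mid-gray (illustrative helper)."""
    return float(lut[128] - lut[127])


slopes = {a: mid_slope(logistic_lut(a)) for a in (0.5, 2.0, 10.0)}
for a, slope in slopes.items():
    print(f"a={a}: mid-gray slope = {slope:.4f}")
```

For every positive `a` tried here the slope stays above 1, which matches both points from the story: the curve was enhancing contrast, and no value of the shape parameter turns it into a contrast-reducing one.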

I remember my analytic geometry teacher at the final exam: while I was proving my solution, he could unexpectedly disagree with one of the steps and claim that there was an error. To get the maximum mark, one had to not panic and keep defending the correct solution. Hopefully we will see LLMs disagree with users more.

But that's not all: just when I thought we were done, there is one more bug in the code. One can notice a slight asymmetry of the GPT-TRAP curve at the high end. It's a rounding error: the calculated value is simply cast to uint8 (which discards the fractional part) instead of being rounded, so on average we get 0.5 units (~0.2%) lower image brightness and significantly rarer full-white values (255). Interestingly, this error appears to be systematic: it was present in all generated samples from all the LLMs I tested. Apparently the error was very widespread in the training data of all the LLMs, so they all learned that "multiply by 255 and cast to uint8" is enough to fit values to the 0..255 range. Technically this is true, but the result is mathematically flawed.
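
The size of this effect is easy to measure. The following sketch (variable names are illustrative) maps an evenly spaced ramp of values in 0..1 to 0..255 in two ways, by a plain cast and by rounding first, and compares the average error and the number of full-white results.

```python
import numpy as np

# Evenly spaced "ideal" values in 0..1, scaled to the 0..255 range
values = np.linspace(0.0, 1.0, 100_001) * 255.0

truncated = values.astype(np.uint8)          # plain cast: drops the fractional part
rounded = np.rint(values).astype(np.uint8)   # round to nearest, then cast

# Average brightness error relative to the ideal float values
bias_trunc = float(np.mean(truncated - values))
bias_round = float(np.mean(rounded - values))

# How many samples end up as full white (255)
white_trunc = int(np.count_nonzero(truncated == 255))
white_round = int(np.count_nonzero(rounded == 255))

print(f"mean error, cast only:  {bias_trunc:+.3f}")
print(f"mean error, rounded:    {bias_round:+.3f}")
print(f"full-white count, cast only: {white_trunc}, rounded: {white_round}")
```

With a plain cast, only an input of exactly 1.0 reaches 255, which is why full white becomes so rare.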

My conclusions are:
  • LLMs are like junior developers: they can and will make unexpected mistakes, and they need clear instructions and guidance. The difference, though, is that junior developers learn over time, while LLMs get better only in the next generation. Like junior developers, LLMs need to be "managed" with reasonable expectations.
  • All code from LLMs must be verified; the more niche the field, the more tests are needed. LLMs generate code that looks correct, and when it's not, the errors can be very subtle and expensive to debug and fix.
  • In case of unexpected or puzzling results it is often faster to simply ask multiple LLMs: in addition to ChatGPT (3.5/4) we now have Copilot, Bard, Replit and more. None of these gave perfect results on the first try, but some of the errors were different and often less subtle, making it easier to get things working in 20 minutes total.
  • Some of the errors are systematic across multiple LLMs and apparently come from the training data (as LLMs currently trust their training data unconditionally, unlike humans). That is, LLMs currently cannot exceed the quality of their training data, only approach it. It is unclear how much further work on LLMs will be needed to get perfect results consistently; I am afraid this might be a case where the last 10% of the work requires 90% of the time.

October 22, 2023