From Photons to Pixels: Image Formation and Color Spaces, with OpenCV in Python and C++

How a camera turns light into an array of numbers — projection, sampling, quantization, the Bayer sensor — and the color spaces (RGB/BGR, grayscale, HSV, YCrCb, Lab) you convert between every day. The equations, plus runnable OpenCV in both Python and C++.

Luis Condados · June 4, 2026

computer-vision · image-processing · opencv · color-spaces · fundamentals

Before any model, filter, or detector touches an image, that image has already been through a whole pipeline: light bounced off a scene, was focused by a lens, landed on a sensor, got sampled and quantized into integers, and was arranged into a grid you call an array. Understanding that pipeline — and the color spaces you reshuffle those integers into — is the foundation everything else sits on. Here it is end to end, with the math and runnable OpenCV in Python and C++.

The image formation pipeline. Every stage is lossy, and each one shows up later as noise, blur, aliasing, or banding.

1. What a digital image actually is

A scene in front of a camera is continuous: at every point and every wavelength there’s some amount of light. Mathematically we can write the image reaching the sensor as a continuous function

f(x, y) \in \mathbb{R}_{\ge 0}.

A computer can’t store a continuous function, so two things happen. Sampling reads $f$ only on a grid of points spaced $\Delta x, \Delta y$ apart, and quantization rounds each reading to one of a finite set of levels:

I[m, n] = Q\big(f(m\,\Delta x,\, n\,\Delta y)\big), \qquad Q : \mathbb{R} \to \{0, 1, \dots, L-1\}, \quad L = 2^{b}.

For a standard 8-bit image $b = 8$ , so $L = 256$ and every pixel is an integer in $[0, 255]$ . That’s the whole reason an image is a 2-D array of uint8 — sampling gives it width and height, quantization gives it the integer values.
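
To make the two steps concrete, here is a minimal NumPy sketch of sampling followed by quantization. The helper name `sample_and_quantize` and the ramp "scene" are hypothetical, chosen only for illustration; note that the array is indexed [row, col], so $n$ runs along the first axis and $m$ along the second.

```python
import numpy as np

def sample_and_quantize(f, width, height, dx, dy, bits=8, f_max=1.0):
    """Sample f(x, y) on a regular grid, then quantize to 2**bits levels."""
    levels = 2 ** bits                               # L = 2^b
    xs = np.arange(width) * dx                       # x = m * dx
    ys = np.arange(height) * dy                      # y = n * dy
    xx, yy = np.meshgrid(xs, ys)                     # both have shape (height, width)
    samples = f(xx, yy)                              # sampling step
    scaled = np.clip(samples / f_max, 0.0, 1.0) * (levels - 1)
    dtype = np.uint8 if bits <= 8 else np.uint16
    return np.round(scaled).astype(dtype)            # quantization step

# Hypothetical scene: a smooth horizontal ramp from 0 to 1.
ramp = lambda x, y: x

img8 = sample_and_quantize(ramp, width=256, height=4, dx=1 / 255, dy=1.0, bits=8)
img3 = sample_and_quantize(ramp, width=256, height=4, dx=1 / 255, dy=1.0, bits=3)

print(img8.shape, img8.dtype)      # (4, 256) uint8
print(len(np.unique(img8)))        # 256
print(len(np.unique(img3)))        # 8
```

With only 3 bits the same smooth ramp collapses into 8 flat steps, which is the banding that quantization introduces.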

2. Image formation inside the camera

Projection: from 3-D scene to 2-D plane

A lens (idealized as a pinhole) projects a 3-D point $(X, Y, Z)$ in camera coordinates onto the image plane at focal length $f$ (a scalar here, unrelated to the image function $f(x, y)$ from §1):

x = f\,\frac{X}{Z}, \qquad y = f\,\frac{Y}{Z}.

Converting those metric coordinates to pixel indices introduces the focal lengths in pixels $(f_x, f_y)$ and the principal point $(c_x, c_y)$, which together form the camera intrinsic matrix $K$ :

\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} \sim K \begin{bmatrix} X \\ Y \\ Z \end{bmatrix}, \qquad K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}.

This is the matrix you recover from camera calibration, and it’s what lets you go back and forth between pixels and rays.
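
As a rough sketch of that round trip, the snippet below projects a point with $K$ and then recovers the viewing ray through the resulting pixel. The intrinsics are made-up values for a 1920x1080 camera, not a real calibration, and `project` / `backproject` are hypothetical helper names.

```python
import numpy as np

# Hypothetical intrinsics (illustrative values, not from a real calibration).
fx, fy = 1000.0, 1000.0
cx, cy = 960.0, 540.0
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

def project(K, point_cam):
    """Project a 3-D point in camera coordinates to pixel coordinates (u, v)."""
    uvw = K @ np.asarray(point_cam, dtype=float)   # homogeneous pixel coordinates
    return uvw[:2] / uvw[2]                        # perspective divide by Z

def backproject(K, uv):
    """Return the unit-length viewing ray through pixel (u, v)."""
    ray = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])
    return ray / np.linalg.norm(ray)

P = np.array([0.5, -0.25, 2.0])    # a point 2 units in front of the camera
uv = project(K, P)                 # u = 1210, v = 415
ray = backproject(K, uv)           # same direction as P, but depth is lost

print(uv, ray)
```

The pixel only pins down a direction: every point along `ray` lands on the same $(u, v)$, which is why depth cannot be recovered from a single view.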

From light to numbers: the sensor

Each sensor pixel (“photosite”) collects photons over the exposure time and turns them into a charge. A simplified but useful model of the digital value is linear in the incident irradiance $E$ :

I \approx Q\big(g \cdot E \cdot t + n\big),

where $t$ is exposure time, $g$ is the analog/ISO gain, $n$ is noise, and $Q$ is the analog-to-digital quantizer from §1. Two practical consequences fall out immediately: more gain amplifies noise along with signal (much of $n$, such as photon shot noise, arises before the amplifier and is scaled by $g$ too), and the final rounding to $2^b$ levels is where banding in smooth gradients comes from.
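
A small simulation makes the gain/noise trade-off visible. This is a sketch under stated assumptions that go beyond the model above: shot noise is taken to be Poisson on $E \cdot t$, read noise Gaussian, and both are scaled by the gain before quantization. The function name `expose` and all numeric values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable

def expose(E, t, g, read_noise_std=2.0):
    """Simulate one 8-bit capture of irradiance map E (photons/s per photosite).

    Assumption (not from the article): shot noise is Poisson on E * t, read
    noise is Gaussian, and both are scaled by the gain g before quantization.
    """
    photons = rng.poisson(E * t).astype(np.float64)          # shot noise
    read = rng.normal(0.0, read_noise_std, size=E.shape)     # read noise
    signal = g * (photons + read)                            # analog gain
    return np.clip(np.round(signal), 0, 255).astype(np.uint8)

E = np.full((100, 100), 100.0)           # flat gray patch, hypothetical units

long_low = expose(E, t=1.0, g=1.0)       # long exposure, low gain
short_high = expose(E, t=0.125, g=8.0)   # 1/8 of the light, 8x the gain

print(long_low.mean(), long_low.std())
print(short_high.mean(), short_high.std())
```

Both captures come out at roughly the same brightness, but the short, high-gain one has a much larger standard deviation: the familiar grain of a high-ISO photo.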

Color from a gray sensor: the Bayer filter

A silicon sensor only measures intensity — it’s colorblind. To capture color, manufacturers overlay a color filter array (CFA), most commonly the Bayer pattern: a mosaic of red, green, and blue filters with twice as many greens (your eye is most sensitive to green). Each photosite therefore records only one of R, G, or B; the missing two channels at every pixel are interpolated in a step called demosaicing. OpenCV does it for you:

import cv2

# `raw` is a single-channel Bayer mosaic from the sensor.
# OpenCV's BayerBG code corresponds to an RGGB sensor layout; check your sensor's datasheet.
bgr = cv2.cvtColor(raw, cv2.COLOR_BayerBG2BGR)   # demosaic -> 3-channel BGR

#include <opencv2/opencv.hpp>

// `raw` is a single-channel Bayer mosaic from the sensor.
// OpenCV's BayerBG code corresponds to an RGGB sensor layout; check your sensor's datasheet.
cv::Mat bgr;
cv::cvtColor(raw, bgr, cv::COLOR_BayerBG2BGR);   // demosaic -> 3-channel BGR

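By the time you call imread, all of this has already happened, but it explains why every pixel has three channels even though each photosite measured only one, why greens look cleanest, and where demosaicing artifacts near sharp edges come from. (The BGR channel order itself is an OpenCV convention, not a property of the sensor.)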
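
To see what the sensor actually records, the sketch below goes the other way: it takes a full-color image and throws away two of the three channels at each photosite. It assumes a repeating 2x2 tile with R top-left and B bottom-right purely for illustration (real sensors vary), the input is in R, G, B order, and `mosaic_rggb` is a hypothetical helper name.

```python
import numpy as np

def mosaic_rggb(rgb):
    """Simulate a Bayer sensor (R G / G B tile): keep one channel per photosite."""
    h, w, _ = rgb.shape
    raw = np.zeros((h, w), dtype=rgb.dtype)
    raw[0::2, 0::2] = rgb[0::2, 0::2, 0]   # R at even rows, even cols
    raw[0::2, 1::2] = rgb[0::2, 1::2, 1]   # G at even rows, odd cols
    raw[1::2, 0::2] = rgb[1::2, 0::2, 1]   # G at odd rows, even cols
    raw[1::2, 1::2] = rgb[1::2, 1::2, 2]   # B at odd rows, odd cols
    return raw

# A flat test color so each photosite's channel is easy to read off.
rgb = np.zeros((4, 4, 3), dtype=np.uint8)
rgb[..., 0], rgb[..., 1], rgb[..., 2] = 200, 100, 50   # R, G, B

raw = mosaic_rggb(rgb)
print(raw.tolist())
# [[200, 100, 200, 100], [100, 50, 100, 50], [200, 100, 200, 100], [100, 50, 100, 50]]
```

Half of the 16 photosites hold green and a quarter each hold red and blue; demosaicing has to invent the other two thirds of the data.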

3. The image as an array

Loading an image hands you that grid of integers. The one detail that trips up everyone new to OpenCV: channels are ordered B, G, R, not R, G, B.

import cv2

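img = cv2.imread("street.jpg")     # BGR, dtype uint8
if img is None:                    # imread returns None instead of raising
    raise FileNotFoundError("could not read street.jpg")
print(img.shape, img.dtype)        # (1080, 1920, 3) uint8
h, w, c = img.shape                # rows, cols, channels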

#include <opencv2/opencv.hpp>
#include <iostream>

int main() {
    cv::Mat img = cv::imread("street.jpg");   // BGR, type CV_8UC3
    if (img.empty()) {                        // imread returns an empty Mat on failure
        std::cerr << "could not read street.jpg\n";
        return 1;
    }
    std::cout << img.rows << "x" << img.cols
              << " channels=" << img.channels() << "\n";  // 1080x1920 channels=3
    int h = img.rows, w = img.cols, c = img.channels();
}

Reading and writing a single pixel

Indexing is (row, col) — i.e. (y, x) — and each pixel is a 3-vector in BGR order:

b, g, r = img[100, 200]            # one pixel at row 100, col 200 (uint8 each)
print(int(b), int(g), int(r))

img[100, 200] = (0, 0, 255)        # paint it pure red (B=0, G=0, R=255)

cv::Vec3b px = img.at<cv::Vec3b>(100, 200);   // (row, col), BGR order
uchar b = px[0], g = px[1], r = px[2];

img.at<cv::Vec3b>(100, 200) = cv::Vec3b(0, 0, 255);  // pure red
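
Because the Python image is a plain NumPy array, the same indexing rules give you channel reordering and regions of interest for free. A minimal sketch on a synthetic array (a stand-in for an imread result, so it runs without an image file):

```python
import numpy as np

# Synthetic stand-in for an imread result: 4 rows x 6 cols, BGR, uint8.
img = np.zeros((4, 6, 3), dtype=np.uint8)
img[1, 2] = (0, 0, 255)            # row 1, col 2 -> pure red in BGR

x, y = 2, 1                        # the same pixel in (x, y) terms
print(img[y, x].tolist())          # [0, 0, 255]  (index as [y, x], not [x, y])

rgb = img[..., ::-1]               # reverse the channel axis: BGR -> RGB view
print(rgb[y, x].tolist())          # [255, 0, 0]

roi = img[0:2, 1:4]                # rows 0-1, cols 1-3: a view, not a copy
roi[:] = (255, 0, 0)               # paints that region pure blue in `img` too
print(img[0, 1].tolist())          # [255, 0, 0]
```

Slices are views into the original buffer, so writing to `roi` edits `img`; call `.copy()` when you want an independent crop.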

4. Color spaces

A color space is just a choice of axes for the same color information. You convert between them because some tasks are far easier in the right coordinate system. In OpenCV every conversion goes through one function, cvtColor.

Grayscale (luma)

Dropping color collapses three channels to one. It isn’t a plain average — the weights match human luminance sensitivity (Rec. 601):

Y = 0.299\,R + 0.587\,G + 0.114\,B.

gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)   # shape (H, W), single channel

cv::Mat gray;
cv::cvtColor(img, gray, cv::COLOR_BGR2GRAY);   // single-channel CV_8UC1
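
To see why the weights matter, here is the Rec. 601 sum written out in NumPy next to a plain channel average. This is a sketch of the formula above, not OpenCV's internal implementation, whose rounding may differ by a level; `bgr_to_gray` is a hypothetical name.

```python
import numpy as np

def bgr_to_gray(img_bgr):
    """Rec. 601 luma from a BGR uint8 image, rounded back to uint8."""
    weights = np.array([0.114, 0.587, 0.299])       # B, G, R order
    y = img_bgr.astype(np.float64) @ weights        # weighted sum over channels
    return np.clip(np.round(y), 0, 255).astype(np.uint8)

# Pure blue, green, and red pixels (BGR order) with the same digital value.
pixels = np.array([[[255, 0, 0], [0, 255, 0], [0, 0, 255]]], dtype=np.uint8)

luma = bgr_to_gray(pixels)
average = pixels.mean(axis=2).round().astype(int)

print(luma.tolist())       # [[29, 150, 76]]
print(average.tolist())    # [[85, 85, 85]]
```

A plain average calls all three primaries equally bright; the weighted sum ranks green well above red and red well above blue, which matches how they look.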

HSV — hue, saturation, value

RGB mixes color and brightness together, which makes “find the red things” hard when lighting changes. HSV separates what the color is (hue) from how vivid (saturation) and how bright (value). With $R,G,B \in [0,1]$ , let $M = \max(R,G,B)$ , $m = \min(R,G,B)$ , and chroma $C = M - m$ :

V = M, \qquad S = \begin{cases} 0 & M = 0 \\[2pt] C / M & \text{otherwise} \end{cases}

H = 60^\circ \times \begin{cases} 0 & C = 0 \\[2pt] \big((G - B)/C\big) \bmod 6 & M = R \\[2pt] (B - R)/C + 2 & M = G \\[2pt] (R - G)/C + 4 & M = B \end{cases}

A gotcha worth memorizing: in 8-bit OpenCV, hue is stored in $[0, 179]$ (degrees halved to fit a byte), while $S$ and $V$ use the full $[0, 255]$ .

hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)     # H in [0,179], S,V in [0,255]

cv::Mat hsv;
cv::cvtColor(img, hsv, cv::COLOR_BGR2HSV);     // H in [0,179], S,V in [0,255]
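
The piecewise formulas are easier to trust once you push a few pixels through them by hand. The pure-Python sketch below follows the equations above for one pixel and then applies the 8-bit scaling (hue halved, $S$ and $V$ times 255). The function name is hypothetical and the rounding is a simple choice that may not match OpenCV on every input.

```python
def bgr_to_hsv_8bit(b, g, r):
    """HSV of one 8-bit BGR pixel, scaled to H in [0, 179], S and V in [0, 255]."""
    rf, gf, bf = r / 255.0, g / 255.0, b / 255.0
    M, m = max(rf, gf, bf), min(rf, gf, bf)
    C = M - m                                   # chroma
    v = M
    s = 0.0 if M == 0 else C / M
    if C == 0:
        h = 0.0                                 # gray: hue is undefined, use 0
    elif M == rf:
        h = ((gf - bf) / C) % 6
    elif M == gf:
        h = (bf - rf) / C + 2
    else:
        h = (rf - gf) / C + 4
    h_deg = 60.0 * h                            # hue in degrees, [0, 360)
    return round(h_deg / 2) % 180, round(s * 255), round(v * 255)

print(bgr_to_hsv_8bit(0, 0, 255))     # (0, 255, 255)    pure red
print(bgr_to_hsv_8bit(0, 255, 0))     # (60, 255, 255)   pure green
print(bgr_to_hsv_8bit(0, 0, 128))     # (0, 255, 128)    darker red, same hue
print(bgr_to_hsv_8bit(10, 0, 255))    # (179, 255, 255)  red with a hint of blue
```

The last two lines show both properties that matter in practice: dimming a color changes only $V$, and a red that leans slightly toward blue jumps to the top of the hue range instead of staying near 0.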

YCrCb — luma plus chroma

This is the space behind JPEG and most video. It keeps the luma $Y$ and stores two color-difference channels (Rec. 601, 8-bit, with offset $\delta = 128$ ):

Y = 0.299R + 0.587G + 0.114B, \quad C_r = (R - Y)\cdot 0.713 + \delta, \quad C_b = (B - Y)\cdot 0.564 + \delta.

Because the eye is far more sensitive to luma than chroma, codecs subsample $C_r, C_b$ (4:2:0) and almost nobody notices — a direct, daily payoff of the camera→color-space chain.

ycrcb = cv2.cvtColor(img, cv2.COLOR_BGR2YCrCb)   # channels: Y, Cr, Cb

cv::Mat ycrcb;
cv::cvtColor(img, ycrcb, cv::COLOR_BGR2YCrCb);   // channels: Y, Cr, Cb
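
Here is a minimal NumPy sketch of the same formulas, plus a 2x2 block average standing in for 4:2:0 chroma subsampling. Both helper names are hypothetical, and real codecs choose their own filters and sample positions, so treat the subsampling as an illustration of the idea only.

```python
import numpy as np

def bgr_to_ycrcb(img_bgr, delta=128.0):
    """Y, Cr, Cb from a BGR uint8 image, following the Rec. 601 formulas above."""
    b, g, r = [img_bgr[..., i].astype(np.float64) for i in range(3)]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cr = (r - y) * 0.713 + delta
    cb = (b - y) * 0.564 + delta
    out = np.stack([y, cr, cb], axis=-1)
    return np.clip(np.round(out), 0, 255).astype(np.uint8)

def subsample_420(chroma):
    """Average each 2x2 block: one chroma sample per four pixels."""
    h, w = chroma.shape
    even = chroma[:h - h % 2, :w - w % 2].astype(np.float64)
    return even.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

img = np.zeros((4, 4, 3), dtype=np.uint8)
img[:] = (0, 0, 255)                       # pure red, BGR order

ycc = bgr_to_ycrcb(img)
cr_small = subsample_420(ycc[..., 1])
cb_small = subsample_420(ycc[..., 2])

print(ycc[0, 0].tolist())                  # [76, 255, 85]
print(cr_small.shape)                      # (2, 2)
```

For this 4x4 image the full-resolution version stores 48 samples, while luma plus two subsampled chroma planes stores 16 + 4 + 4 = 24, half the data before any actual compression starts.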

CIELAB — perceptually uniform

Lab is designed so that equal numerical distances look like roughly equal color differences to a human — handy for color comparison and matching. It’s a nonlinear transform through CIE XYZ, with $X_n, Y_n, Z_n$ the reference white:

L^* = 116\,f\!\left(\tfrac{Y}{Y_n}\right) - 16, \quad a^* = 500\left[f\!\left(\tfrac{X}{X_n}\right) - f\!\left(\tfrac{Y}{Y_n}\right)\right], \quad b^* = 200\left[f\!\left(\tfrac{Y}{Y_n}\right) - f\!\left(\tfrac{Z}{Z_n}\right)\right]

f(t) = \begin{cases} t^{1/3} & t > \delta^3 \\[2pt] \dfrac{t}{3\delta^2} + \dfrac{4}{29} & \text{otherwise} \end{cases}, \qquad \delta = \tfrac{6}{29}.

lab = cv2.cvtColor(img, cv2.COLOR_BGR2Lab)   # L in [0,255], a,b offset by 128

cv::Mat lab;
cv::cvtColor(img, lab, cv::COLOR_BGR2Lab);   // L in [0,255], a,b offset by 128
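
The Lab equations look heavier than they are. The sketch below implements them from XYZ onward, assuming a D65 reference white normalized to $Y_n = 1$ (the white-point values are an assumption of this example), and adds the simplest color difference, the Euclidean distance in Lab, often called CIE76. All function names are hypothetical.

```python
import numpy as np

# Assumption: D65 reference white, normalized so that Y_n = 1.
WHITE_D65 = np.array([0.950456, 1.0, 1.088754])
DELTA = 6.0 / 29.0

def f(t):
    """Piecewise cube-root function from the Lab definition."""
    t = np.asarray(t, dtype=np.float64)
    return np.where(t > DELTA ** 3, np.cbrt(t), t / (3 * DELTA ** 2) + 4.0 / 29.0)

def xyz_to_lab(xyz, white=WHITE_D65):
    """Float Lab (L in [0, 100]) from CIE XYZ."""
    fx, fy, fz = f(np.asarray(xyz, dtype=np.float64) / white)
    return np.array([116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)])

def lab_to_8bit(lab):
    """Scale float Lab to the 8-bit layout: L * 255 / 100, a and b offset by 128."""
    L, a, b = lab
    return np.clip(np.round([L * 255 / 100, a + 128, b + 128]), 0, 255).astype(np.uint8)

def delta_e76(lab1, lab2):
    """CIE76 color difference: Euclidean distance in Lab."""
    return float(np.linalg.norm(np.asarray(lab1) - np.asarray(lab2)))

white_lab = xyz_to_lab(WHITE_D65)          # approximately [100, 0, 0]
black_lab = xyz_to_lab([0.0, 0.0, 0.0])    # approximately [0, 0, 0]

print(np.round(white_lab, 6), lab_to_8bit(white_lab).tolist())
print(delta_e76(white_lab, black_lab))     # approximately 100
```

The two branches of $f$ meet at $t = \delta^3$, where both evaluate to $6/29$, so the transform has no jump near black.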

Splitting and merging channels

Whatever space you’re in, you can pull it apart and put it back:

b, g, r = cv2.split(img)        # three single-channel images
merged  = cv2.merge([b, g, r])  # back to one 3-channel image

std::vector<cv::Mat> ch;
cv::split(img, ch);             // ch[0]=B, ch[1]=G, ch[2]=R
cv::Mat merged;
cv::merge(ch, merged);
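
In Python you can often skip the split entirely, since NumPy slicing reaches a single channel directly. A small sketch on a synthetic array, with `np.dstack` standing in for the merge step:

```python
import numpy as np

img = np.zeros((2, 3, 3), dtype=np.uint8)         # synthetic BGR image
img[..., 2] = 200                                 # fill the R channel

b, g, r = img[..., 0], img[..., 1], img[..., 2]   # channel views, no copy
merged = np.dstack([b, g, r])                     # new (2, 3, 3) array

img[..., 1] = 50                                  # edit G in place, no split needed
print(int(g[0, 0]))                               # 50: `g` is a view into `img`
print(merged[0, 0].tolist())                      # [0, 0, 200]: the copy is unaffected
```

The views stay linked to `img`, while the stacked array is an independent copy; which one you want depends on whether later edits should propagate.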

5. A practical payoff: segmenting by color in HSV

Here’s why all of this matters. Picking out red objects in RGB is fiddly; in HSV it’s a hue window. Red is the awkward case because its hue wraps around 0, so we union two ranges:

import cv2
import numpy as np

hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

# Red wraps around hue = 0, so combine the low and high ends.
mask1 = cv2.inRange(hsv, np.array([0, 120, 70]),   np.array([10, 255, 255]))
mask2 = cv2.inRange(hsv, np.array([170, 120, 70]), np.array([179, 255, 255]))
mask  = mask1 | mask2

result = cv2.bitwise_and(img, img, mask=mask)   # keep only the red pixels

cv::Mat hsv, mask1, mask2, mask, result;
cv::cvtColor(img, hsv, cv::COLOR_BGR2HSV);

// Red wraps around hue = 0, so combine the low and high ends.
cv::inRange(hsv, cv::Scalar(0, 120, 70),   cv::Scalar(10, 255, 255),  mask1);
cv::inRange(hsv, cv::Scalar(170, 120, 70), cv::Scalar(179, 255, 255), mask2);
cv::bitwise_or(mask1, mask2, mask);

cv::bitwise_and(img, img, result, mask);        // keep only the red pixels

Thresholds that would be brittle in RGB hold up far better in HSV, because hue stays roughly constant as brightness changes. Nothing about the pixels changed; we simply chose better axes for the question.
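
If you threshold wrapped hues often, the two-range trick can be folded into one helper. The sketch below is a NumPy-only version with a hypothetical name, `hue_mask`: when the lower bound is greater than the upper bound, it treats the window as wrapping through 0. The saturation and value floors reuse the 120 and 70 from the example above.

```python
import numpy as np

def hue_mask(hsv, h_lo, h_hi, s_min=120, v_min=70):
    """Boolean mask for hue in [h_lo, h_hi] on the 8-bit 0..179 hue circle.

    If h_lo > h_hi, the window is taken to wrap through 0 (e.g. 170 -> 10).
    """
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    if h_lo <= h_hi:
        in_hue = (h >= h_lo) & (h <= h_hi)
    else:
        in_hue = (h >= h_lo) | (h <= h_hi)      # union of the two ends
    return in_hue & (s >= s_min) & (v >= v_min)

# Four hand-made HSV pixels (8-bit OpenCV ranges).
hsv = np.array([[[0, 255, 255],      # pure red
                 [175, 200, 150],    # red, just below the wrap
                 [60, 255, 255],     # green
                 [5, 40, 255]]],     # red hue but washed out
               dtype=np.uint8)

red = hue_mask(hsv, 170, 10)
print(red.tolist())                  # [[True, True, False, False]]

mask_u8 = red.astype(np.uint8) * 255   # 0/255 mask, the form inRange produces
```

The washed-out pixel is rejected by the saturation floor even though its hue is red, which is usually what you want: low-saturation pixels have unreliable hue.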

Takeaways

An image is sampling + quantization of continuous light — that’s why it’s a grid of integers in $[0, 255]$ , and where aliasing and banding originate.
The camera pipeline is lossy at every stage (projection, sensor noise, demosaicing, quantization); artifacts you fight later are born here.
OpenCV is BGR, indexed (row, col) — internalize this once and stop fighting it.
Color spaces are coordinate choices. Convert with cvtColor; reach for grayscale to drop color, HSV for color thresholding, YCrCb for compression, Lab for perceptual distance.
Watch the ranges: 8-bit hue lives in $[0, 179]$ , not $[0, 360)$ .

Once light is a clean array of numbers, the fun starts — like quantizing the models that consume those arrays and running them on an integrated GPU. That’s exactly what we do in YOLO26-seg vs RF-DETR-Seg: INT8 instance segmentation on an Intel iGPU.