C API Reference

This page documents the C implementation details for developers who want to understand or extend the underlying algorithm.

Note

The C extension is implemented in src/kmeans/_kmeans.c. This documentation describes the algorithm and function signatures.

Algorithm Overview

The k-means implementation uses the following approach:

Initialization: K-means++ algorithm for better initial centroids
Assignment: Assign each point to nearest centroid
Update: Recalculate centroids as mean of assigned points
Convergence: Repeat until centroids stop moving significantly

Core Functions

Distance Calculation

static double squared_distance(const double *a, const double *b, int dims)

Calculate squared Euclidean distance between two points.

param a:: First point coordinates
param b:: Second point coordinates
param dims:: Number of dimensions
return:: Squared distance

Find Nearest Centroid

static int find_nearest_centroid(const double *point,
                                  const double *centroids,
                                  int k, int dims)

Find the index of the nearest centroid for a given point.

param point:: Data point coordinates
param centroids:: Array of all centroid coordinates
param k:: Number of clusters
param dims:: Number of dimensions
return:: Index of nearest centroid (0 to k-1)

K-means++ Initialization

static void init_centroids_kmeanspp(const double *data,
                                     int n_samples, int dims,
                                     int k, double *centroids)

Initialize centroids using the k-means++ algorithm.

The k-means++ initialization chooses initial cluster centers that are far apart, leading to better and faster convergence.

param data:: Input data array
param n_samples:: Number of data points
param dims:: Number of dimensions
param k:: Number of clusters
param centroids:: Output array for centroid coordinates

Algorithm:

Choose first centroid uniformly at random
For each remaining centroid:
1. Calculate distance to nearest existing centroid for each point
2. Choose new centroid with probability proportional to squared distance

Main K-means Algorithm

static int kmeans_core(const double *data, int n_samples, int dims,
                       int k, int max_iter, double tol,
                       double *centroids, int *labels)

Main k-means clustering algorithm.

param data:: Input data array (row-major, n_samples × dims)
param n_samples:: Number of data points
param dims:: Number of dimensions
param k:: Number of clusters
param max_iter:: Maximum number of iterations
param tol:: Convergence tolerance
param centroids:: Output array for final centroids
param labels:: Output array for cluster assignments
return:: 0 on success, -1 on memory allocation failure

Convergence:

The algorithm stops when the maximum centroid shift is less than tol, or when max_iter iterations have been reached.

Python Binding Functions

static PyObject *py_fit(PyObject *self, PyObject *args)

Python wrapper for the fit operation. Accepts NumPy arrays and returns tuple of (centroids, labels).

static PyObject *py_predict(PyObject *self, PyObject *args)

Python wrapper for the predict operation. Assigns new data points to existing centroids.

Memory Management

The C implementation uses dynamic memory allocation for:

Temporary centroid calculations
Cluster point counts
Distance calculations in k-means++

All allocations are freed before function return. Error checking ensures proper cleanup on allocation failures.

Performance Considerations

Optimization Techniques

Memory Layout: Row-major array layout for cache efficiency
Squared Distance: Avoid expensive sqrt() in distance calculations
Early Termination: Stop when convergence tolerance is reached
K-means++: Better initialization reduces iterations needed

Complexity

Time: O(n × k × d × i) where:
- n = number of samples
- k = number of clusters
- d = number of dimensions
- i = number of iterations
Space: O(n + k × d)

Building the C Extension

The C extension is built using CMake:

# From CMakeLists.txt
python_add_library(_kmeans MODULE
    src/kmeans/_kmeans.c
    WITH_SOABI
)

target_include_directories(_kmeans PRIVATE
    ${Python_NumPy_INCLUDE_DIRS}
    ${CMAKE_SOURCE_DIR}/include
)

target_link_libraries(_kmeans PRIVATE Python::NumPy)

Source Code Location

The complete C implementation can be found in:

src/kmeans/_kmeans.c - Main implementation file
CMakeLists.txt - Build configuration

See the source files directly for the most up-to-date implementation details.