C API Reference
This page documents the C implementation details for developers who want to understand or extend the underlying algorithm.
Note
The C extension is implemented in src/kmeans/_kmeans.c.
This documentation describes the algorithm and function signatures.
Algorithm Overview
The k-means implementation uses the following approach:
Initialization: K-means++ algorithm for better initial centroids
Assignment: Assign each point to nearest centroid
Update: Recalculate centroids as mean of assigned points
Convergence: Repeat until centroids stop moving significantly
Core Functions
Distance Calculation
static double squared_distance(const double *a, const double *b, int dims)
Calculate squared Euclidean distance between two points.
- param a:
First point coordinates
- param b:
Second point coordinates
- param dims:
Number of dimensions
- return:
Squared distance
Find Nearest Centroid
static int find_nearest_centroid(const double *point,
const double *centroids,
int k, int dims)
Find the index of the nearest centroid for a given point.
- param point:
Data point coordinates
- param centroids:
Array of all centroid coordinates
- param k:
Number of clusters
- param dims:
Number of dimensions
- return:
Index of nearest centroid (0 to k-1)
K-means++ Initialization
static void init_centroids_kmeanspp(const double *data,
int n_samples, int dims,
int k, double *centroids)
Initialize centroids using the k-means++ algorithm.
The k-means++ initialization chooses initial cluster centers that are far apart, leading to better and faster convergence.
- param data:
Input data array
- param n_samples:
Number of data points
- param dims:
Number of dimensions
- param k:
Number of clusters
- param centroids:
Output array for centroid coordinates
Algorithm:
Choose first centroid uniformly at random
For each remaining centroid:
Calculate distance to nearest existing centroid for each point
Choose new centroid with probability proportional to squared distance
Main K-means Algorithm
static int kmeans_core(const double *data, int n_samples, int dims,
int k, int max_iter, double tol,
double *centroids, int *labels)
Main k-means clustering algorithm.
- param data:
Input data array (row-major, n_samples × dims)
- param n_samples:
Number of data points
- param dims:
Number of dimensions
- param k:
Number of clusters
- param max_iter:
Maximum number of iterations
- param tol:
Convergence tolerance
- param centroids:
Output array for final centroids
- param labels:
Output array for cluster assignments
- return:
0 on success, -1 on memory allocation failure
Convergence:
The algorithm stops when the maximum centroid shift is less than tol,
or when max_iter iterations have been reached.
Python Binding Functions
static PyObject *py_fit(PyObject *self, PyObject *args)
Python wrapper for the fit operation. Accepts NumPy arrays and returns tuple of (centroids, labels).
static PyObject *py_predict(PyObject *self, PyObject *args)
Python wrapper for the predict operation. Assigns new data points to existing centroids.
Memory Management
The C implementation uses dynamic memory allocation for:
Temporary centroid calculations
Cluster point counts
Distance calculations in k-means++
All allocations are freed before function return. Error checking ensures proper cleanup on allocation failures.
Performance Considerations
Optimization Techniques
Memory Layout: Row-major array layout for cache efficiency
Squared Distance: Avoid expensive sqrt() in distance calculations
Early Termination: Stop when convergence tolerance is reached
K-means++: Better initialization reduces iterations needed
Complexity
Time: O(n × k × d × i) where:
n = number of samples
k = number of clusters
d = number of dimensions
i = number of iterations
Space: O(n + k × d)
Building the C Extension
The C extension is built using CMake:
# From CMakeLists.txt
python_add_library(_kmeans MODULE
src/kmeans/_kmeans.c
WITH_SOABI
)
target_include_directories(_kmeans PRIVATE
${Python_NumPy_INCLUDE_DIRS}
${CMAKE_SOURCE_DIR}/include
)
target_link_libraries(_kmeans PRIVATE Python::NumPy)
Source Code Location
The complete C implementation can be found in:
src/kmeans/_kmeans.c- Main implementation fileCMakeLists.txt- Build configuration
See the source files directly for the most up-to-date implementation details.