Most platforms built in information technology generate huge amounts of data. This data is called Big Data, and it carries a lot of business intelligence. It crosses boundaries to serve different goals and opportunities, and Machine Learning offers a way to turn it into value for clients.

**Problems**

- We have Big Data-based platforms in the accounting and IoT domains that keep generating customer behavior and device monitoring data.
- Identifying a target customer base or deriving patterns across different dimensions is key and gives the platforms a real edge.

**Idea**

Imagine you have thousands of customers using your platform and a vast amount of data that keeps being generated; any insight into that data adds real value.

As part of the Machine Learning initiatives and innovations that the Patterns7 team keeps trying, we experimented with K-Means Clustering, and the value it brings to our clients is awesome.

**Solution**

Clustering is the process of partitioning a group of data points into a small number of clusters. In this part, you will learn how to implement K-Means Clustering.

**K-Means Clustering**

K-means clustering is a method commonly used to automatically partition a data set into k groups. It is an unsupervised learning algorithm.

**K-Means Objective**

- The objective of k-means is to minimize the total sum of squared distances from every point to its cluster centroid. Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, …, Sk} so as to minimize the within-cluster sum of squares, $\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$, where µi is the mean of the points in Si.
- The k-means algorithm is guaranteed to converge to a local optimum, which is not necessarily the global one.
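To make the objective concrete, here is a minimal sketch that computes the within-cluster sum of squares for a given assignment of points to clusters. The helper name `wcss` and the toy data are made up for illustration; they are not part of the original example.

```python
import numpy as np

def wcss(X, labels):
    """Within-cluster sum of squares for a given assignment (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        centroid = members.mean(axis=0)  # mu_i: mean of the points in S_i
        total += ((members - centroid) ** 2).sum()
    return total

# Toy data (assumed): two tight pairs of points, far apart from each other.
X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

good = wcss(X_toy, [0, 0, 1, 1])  # nearby points share a cluster
bad = wcss(X_toy, [0, 1, 0, 1])   # each pair is split across clusters
print(good, bad)
```

The assignment that keeps nearby points together yields the smaller value, which is exactly what k-means tries to achieve.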

**Business Uses**

This is a versatile algorithm that can be used for any type of grouping. Some examples of use cases are:

- Behavioral segmentation: Segment by purchase history; segment by activities on an application, website, or platform.
- Inventory categorization: Group inventory by sales activity.
- Sorting sensor measurements: Detect activity types in motion sensors; group images.
- Detecting bots or anomalies: Separate valid activity groups from bots.

**K-Means Clustering Algorithm**

- Step 1: Choose the number K of clusters.
- Step 2: Select K points at random as the initial centroids (not necessarily from your dataset).
- Step 3: Assign each data point to the closest centroid. This forms K clusters.
- Step 4: Compute and place the new centroid of each cluster.
- Step 5: Reassign each data point to the new closest centroid. If any reassignment took place, go to Step 4; otherwise, the algorithm is finished.
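The steps above can be sketched in plain NumPy as follows. This is a simplified illustration, not the scikit-learn implementation: the function name `kmeans_sketch` is hypothetical, and it assumes the initial centroids are drawn from the data points themselves.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Simplified K-means loop following Steps 1-5 (illustrative only)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Steps 1-2: K is given; pick K distinct data points as initial centroids (assumption).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Steps 3 and 5: assign every point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no reassignment took place: finished
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Toy data (assumed): two groups of points along a line.
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
labels, centroids = kmeans_sketch(X_toy, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 depends on the random initialization, so only the grouping itself is meaningful, not the label numbers.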

**Example: Applying K-Means Clustering to Customer Expenses and Invoices Data in python.**

For Python I am using the Spyder editor. As an example, we'll show how the K-means algorithm works with customer expenses and invoices data. We have data for 500 customers, and we'll look at two customer features: Customer Invoices and Customer Expenses. In general, this algorithm can be used for any number of features, so long as the number of data samples is much greater than the number of features.

**Step 1: Clean and Transform Your Data**

For this example, we've already cleaned the data and completed some simple transformations. A sample of the data as a pandas DataFrame is shown below. Import the following Python libraries:

- numpy, for the mathematical tools used in the code.
- matplotlib.pyplot, for plotting charts.
- pandas, for importing and managing the dataset.

**Step 2: We want to apply clustering to Total Expenses and Total Invoices, so select the required columns in X.**

The chart below shows the dataset for 500 customers, with the Total Invoices on the x-axis and Total Expenses on the y-axis.
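As a rough illustration of the selection step, the sketch below builds a tiny stand-in DataFrame and picks the two features by name. The column names and values are assumptions made up for this example; the real `Expense_Invoice.csv` may be laid out differently.

```python
import pandas as pd

# Hypothetical stand-in for Expense_Invoice.csv; names and values are assumed.
dataset = pd.DataFrame({
    "CustomerId": [1, 2, 3, 4],
    "CustomerName": ["A", "B", "C", "D"],
    "TotalExpenses": [120.0, 80.0, 950.0, 40.0],
    "TotalInvoices": [200.0, 150.0, 1200.0, 60.0],
})

# Select by name: invoices first (x-axis), expenses second (y-axis).
X = dataset[["TotalInvoices", "TotalExpenses"]].to_numpy()
print(X.shape)
```

Selecting by name gives the same array as selecting by position with `dataset.iloc[:, [3, 2]].values` on this layout, but it keeps working if the column order in the file changes.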

**Step 3: Choose K and Run the Algorithm**

**Choosing K**

The algorithm described above finds the clusters and data set labels for a particular pre-chosen K. To find the number of clusters in the data, the user needs to run the K-means clustering algorithm for a range of K values and compare the results. In general, there is no method for determining the exact value of K, but a reasonable estimate can be obtained using the following technique.

One of the metrics commonly used to compare results across different values of K is the mean distance between data points and their cluster centroid. Since increasing the number of clusters will always reduce the distance to data points, increasing K will always decrease this metric, to the extreme of reaching zero when K is the same as the number of data points. Thus, this metric cannot be used as the sole target. Instead, the mean distance to the centroid is plotted as a function of K, and the “elbow point,” where the rate of decrease sharply shifts, can be used to roughly determine K.
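The behavior of this metric can be checked with a small experiment. The sketch below uses scikit-learn's `inertia_` (the sum of squared distances to the nearest centroid, a close relative of the mean distance described above) on made-up random points; the data and the chosen K values are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(12, 2))  # 12 made-up points with two features

inertia_by_k = {}
for k in (1, 3, 6, 12):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    model.fit(X_demo)
    inertia_by_k[k] = model.inertia_  # sum of squared distances to nearest centroid

for k, value in inertia_by_k.items():
    print(f"K={k:2d}  inertia={value:.4f}")
```

With K equal to the number of points, every point can be its own centroid, so the metric drops to essentially zero. That is why the value alone cannot be used to pick K.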

Using the elbow method, we find the optimal number of clusters, i.e. K=3. For this example, use the Python package scikit-learn for the computations, as shown below:

```python
# K-Means Clustering

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans

# Importing the customer Expenses/Invoices dataset with pandas
dataset = pd.read_csv('Expense_Invoice.csv')
# Columns 3 and 2 are plotted below as Total Invoices (x) and Total Expenses (y)
X = dataset.iloc[:, [3, 2]].values

# Using the elbow method to find the optimal number of clusters
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)  # inertia_ is the within-cluster sum of squares
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters K')
plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
plt.show()

# Applying k-means to the customer dataset
kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(X)

# Visualizing the clusters
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s=100, c='red', label='Careful(c1)')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s=100, c='green', label='Standard(c2)')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s=100, c='blue', label='Target(c3)')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=250, c='yellow', label='Centroids')
plt.title('Clusters of customer Invoices & Expenses')
plt.xlabel('Total Invoices')
plt.ylabel('Total Expenses')
plt.legend()
plt.show()
```

**Step 4: Review the Results**

The chart below shows the results. Visually, you can see that the K-means algorithm splits the customers into three groups based on the invoice feature. Each cluster centroid is marked with a yellow circle. The customers are now divided into:

- “Careful”: customers whose income is lower and who spend less.
- “Standard”: customers whose income is average and who spend less.
- “Target”: customers whose income is higher and who spend more.
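K-means numbers its clusters arbitrarily, so the label that ends up meaning “Careful” or “Target” can differ between datasets and runs. One possible way to attach the names reliably, sketched below, is to rank the clusters by the invoice coordinate of their centroids. The data here is synthetic and made up for illustration; it is not the article's dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: three loose groups of (invoices, expenses) values.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=(100, 80), scale=10, size=(30, 2)),
    rng.normal(loc=(500, 120), scale=10, size=(30, 2)),
    rng.normal(loc=(900, 700), scale=10, size=(30, 2)),
])

kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(X_demo)

# Rank clusters by the invoice coordinate (column 0) of their centroid,
# then attach the segment names in that order.
order = [int(c) for c in np.argsort(kmeans.cluster_centers_[:, 0])]
names = dict(zip(order, ["Careful", "Standard", "Target"]))

sizes = {}
for cluster_id in order:
    cx, cy = kmeans.cluster_centers_[cluster_id]
    sizes[names[cluster_id]] = int((labels == cluster_id).sum())
    print(f"{names[cluster_id]:8s} size={sizes[names[cluster_id]]:3d} centroid=({cx:.1f}, {cy:.1f})")
```

Printing the size and centroid of each segment also gives a quick numeric check to go along with the chart.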
