Improve stock control using cluster analysis

Objectives

The goals of this project are to:

  • Categorize fabrics by their features
  • Identify the category with excess stock
  • Predict the sales

Dataset

Data was collected from the previous quarter’s sales results and covers a total of 3000 fabrics. Detailed descriptions of the extracted features and the target are listed in the table. Fig. 1 provides an example of a fabric and its features. Regarding the target, since the purchased quantity varies from fabric to fabric, we use the percentage of sales for each fabric instead of its total amount of sales.
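As a minimal sketch of why a percentage is used as the target, the snippet below assumes that "percentage of sales" means the sold quantity divided by the purchased quantity. The function name and the two fabrics are hypothetical and are not taken from the dataset.

```python
def sales_percentage(purchased, sold):
    """Return the share of the purchased quantity that was sold, in percent.

    Assumption: this is how the SALES target is defined.
    """
    if purchased <= 0:
        raise ValueError("purchased quantity must be positive")
    return 100.0 * sold / purchased


# Two hypothetical fabrics bought in very different quantities.
small_order = sales_percentage(purchased=50, sold=20)
large_order = sales_percentage(purchased=500, sold=200)
```

Both fabrics end up with the same percentage even though their absolute sales differ by a factor of ten, which is what makes fabrics with different purchase volumes comparable.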

Exploratory data analysis

Distribution

We can perform a descriptive analysis by examining the distribution of each feature and its correlation with sales, as shown in the figure.

Blue and red account for a high percentage of the main colors (HUE). However, the figure shows that HUE is uncorrelated with SALES, indicating that our customers have no significant preference for the main color of the fabric. Therefore, we drop this feature from the subsequent analyses.

The RANGE and SIZE of the fabrics are relatively small. This is reasonable, as it is difficult to design a fabric with a wide color range or a large pattern size. Regarding THEME, the percentage clearly decreases as the theme becomes more intricate. This is also reasonable: as with RANGE, designing becomes more challenging when there are more limitations on color usage to match the theme.

In terms of SALES performance, the mean is 40%, and most of the distribution lies below this mark.

Correlation

The figure shows the correlation between SALES and the features. Although the correlations are low, all below 0.5, we can still gain some insight into customer preferences. All features except PERCENT and the monochrome and non-specific color themes are positively correlated with sales.

The result is quite intuitive, and the figure provides an example. Fabric A has a wider range of colors, a larger pattern size, and higher contrast. It uses a triadic theme to balance red, blue, and green, resulting in higher visual appeal than B. Therefore, it is reasonable to assume that A has higher sales than B.
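The correlation check can be sketched as follows. This is only an illustration on synthetic data: the arrays, coefficients, and sample size are assumptions chosen to produce one positively and one negatively correlated feature, and they are not the project's measurements.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-ins for two standardized features (not real data).
contrast = rng.normal(size=n)
percent = rng.normal(size=n)

# Synthetic sales signal: helped by contrast, hurt by main-color percentage.
sales = 0.4 * contrast - 0.3 * percent + rng.normal(size=n)

features = {"CONTRAST": contrast, "PERCENT": percent}

# Pearson correlation of each feature with sales.
correlations = {
    name: float(np.corrcoef(values, sales)[0, 1])
    for name, values in features.items()
}

for name, value in correlations.items():
    print(f"{name}: {value:+.2f}")
```

The sign of each coefficient tells us the direction of the preference, even when its magnitude is too small for prediction.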

Models

Before conducting quantitative analysis, it’s important to split the data into training and test datasets. This is necessary for verification in the final step of the analysis. Additionally, data standardization is required to ensure that all features have a comparable range.
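A minimal sketch of this preparation step, using NumPy only, is shown here. The helper names, the 80/20 split, and the synthetic data are assumptions for illustration. The standardization statistics are computed on the training set alone so that no information from the test set leaks into the model.

```python
import numpy as np


def train_test_split(X, y, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them as test data."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


def standardize(X_train, X_test):
    """Scale both sets with the mean and std of the training set only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant features
    return (X_train - mean) / std, (X_test - mean) / std


# Synthetic features on very different scales (not real data).
rng = np.random.default_rng(1)
X = rng.normal(loc=[5.0, 50.0], scale=[1.0, 20.0], size=(100, 2))
y = rng.uniform(0, 100, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train_std, X_test_std = standardize(X_train, X_test)
```

In practice, scikit-learn's `train_test_split` and `StandardScaler` cover the same two steps.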

Dimensionality reduction

As there are both numerical and categorical features, the covariance matrix must be treated differently for each type. For numerical features, we can follow the original definition. For categorical features, however, the covariance matrix is constructed so that it preserves the $\chi^2$ distance between features. Next, we perform singular value decomposition on the covariance matrix to find the principal components. In summary, PCA (principal component analysis) and CA (correspondence analysis) are used to reduce the feature dimensionality.
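The categorical part can be sketched with a textbook correspondence analysis: the count table is converted to standardized residuals, and their SVD yields coordinates whose Euclidean distances equal the $\chi^2$ distances between row profiles. The count table below is hypothetical, and this is a generic formulation rather than necessarily the exact procedure used in this project.

```python
import numpy as np


def correspondence_analysis(counts):
    """Return row principal coordinates and singular values of a count table."""
    N = np.asarray(counts, dtype=float)
    P = N / N.sum()              # correspondence matrix
    r = P.sum(axis=1)            # row masses
    c = P.sum(axis=0)            # column masses
    expected = np.outer(r, c)
    # Standardized residuals from the independence model.
    S = (P - expected) / np.sqrt(expected)
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    # Row principal coordinates: D_r^{-1/2} U Sigma.
    row_coords = (U * sv) / np.sqrt(r)[:, None]
    return row_coords, sv


# Hypothetical 3x3 table of counts (rows and columns are categories).
counts = np.array([[20, 5, 2],
                   [6, 15, 4],
                   [3, 7, 18]])

row_coords, sv = correspondence_analysis(counts)
total_inertia = float(np.sum(sv ** 2))  # equals the chi-square statistic / n
```

The total inertia plays the role that total variance plays in PCA, so the squared singular values can be read as the share of inertia explained by each axis.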

The results are shown in the figure. We can use three principal components and still preserve 70% of the total variance. The loadings of each component are also shown in the figure.
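How the number of components follows from a variance threshold can be sketched as below. The example applies SVD to centered, standardized data, which gives the same components as decomposing the covariance matrix. The synthetic data and helper names are assumptions, so the resulting component count is not meant to reproduce the figure.

```python
import numpy as np


def pca(X):
    """Return scores, loadings (one row per component) and variance ratios."""
    Xc = X - X.mean(axis=0)
    U, sv, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = sv ** 2 / np.sum(sv ** 2)
    return Xc @ Vt.T, Vt, explained


def components_needed(explained, threshold=0.70):
    """Smallest number of leading components reaching the variance threshold."""
    cumulative = np.cumsum(explained)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(explained))


# Synthetic data: five correlated features driven by two latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.5 * rng.normal(size=(300, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize first

scores, loadings, explained = pca(X)
k = components_needed(explained, threshold=0.70)
print(f"{k} components preserve {np.cumsum(explained)[k - 1]:.0%} of the variance")
```

Each row of `loadings` shows how strongly the original features contribute to one component, which is what a loading plot visualizes.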

Cluster analysis

K-means

Based on the turning point in the elbow plot shown in the figure, we use 3 clusters for further analysis. For each cluster, we calculate the average of its features, as shown in the figure.

The first cluster, named “Colorful”, has higher values in RANGE, CONTRAST, and the multiple-color themes. The second cluster, named “Large pattern”, has the highest value in SIZE. In contrast to the first two clusters, the third cluster has large positive values only in the main color percentage (PERCENT) and the monochrome color theme, so it is named “Plain” and should be considered less visually appealing than the first two clusters. The figure provides examples of fabric for each cluster.
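The elbow procedure and the per-cluster averages can be sketched with a plain NumPy version of Lloyd's algorithm. The synthetic blobs, the helper names, and the single random initialization are assumptions for illustration. A production analysis would more likely rely on a library implementation with several restarts, such as scikit-learn's `KMeans`.

```python
import numpy as np


def assign(X, centers):
    """Index of the nearest center for every row of X."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns labels, centers and inertia."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(X, centers)
        # Move each center to the mean of its points (keep it if empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    labels = assign(X, centers)
    inertia = float(((X - centers[labels]) ** 2).sum())
    return labels, centers, inertia


# Synthetic data: three blobs in a two-dimensional component space.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X = np.vstack([c + 0.5 * rng.normal(size=(50, 2)) for c in blob_centers])

# Elbow plot data: within-cluster sum of squares for k = 1..6.
inertias = [kmeans(X, k)[2] for k in range(1, 7)]

# With the chosen k, the centers are the per-cluster feature averages.
labels, cluster_profiles, _ = kmeans(X, 3)
```

Plotting `inertias` against k gives the elbow plot, and the rows of `cluster_profiles` are the averages used to characterize and name each cluster.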

Sales and Stock

The histogram and cumulative curve for the percentage of sales are shown in the figure. The distribution of “Plain” fabric is clearly concentrated near zero. The figure also shows each cluster's share of sales and stock: “Plain” fabric takes up 60% of total stock but contributes only 31.9% of sales. In other words, “Plain” fabric is the root cause of the low sales. Reducing the share of “Plain” fabric in every purchase should therefore improve our stock control. The test data leads to the same conclusion (not shown).

Figure: (a) The histogram and cumulative curve, as well as (b) the sales and stock of each cluster, indicate that the “Plain” fabric is the root cause of low sales.
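The comparison of stock share and sales share can be sketched as follows. The quantities are made-up illustrative numbers, not the project's data, and the snippet assumes that a cluster's stock share is its purchased quantity over the total purchased and that its sales share is its sold quantity over the total sold.

```python
import numpy as np

cluster_names = ["Colorful", "Large pattern", "Plain"]

# Made-up example: cluster label, purchased and sold quantity per fabric.
labels = np.array([0, 0, 1, 1, 2, 2, 2, 2])
purchased = np.array([100, 80, 90, 70, 150, 160, 170, 180])
sold = np.array([60, 50, 50, 40, 20, 30, 25, 25])

# Share of total stock and of total sales contributed by each cluster.
stock_share = np.bincount(labels, weights=purchased) / purchased.sum()
sales_share = np.bincount(labels, weights=sold) / sold.sum()

# The cluster whose stock share exceeds its sales share the most.
gap = stock_share - sales_share
overstocked = cluster_names[int(np.argmax(gap))]

for name, stock, sales in zip(cluster_names, stock_share, sales_share):
    print(f"{name}: {stock:.1%} of stock, {sales:.1%} of sales")
```

A large positive gap flags the cluster that ties up stock without generating a matching share of sales.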

Regression

Another goal is to predict sales. To do this, we first examine the correlation between sales and the components obtained from dimensionality reduction. Unfortunately, only one component reaches a correlation of -0.5. In a linear fit on a single predictor, a correlation of this size explains only about 25% of the variance, which means that the current features are still insufficient to make precise predictions.
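To see what a correlation of about -0.5 implies for prediction, the sketch below fits a one-variable least-squares line and compares its $R^2$ with the squared correlation. The data is synthetic and constructed to have a correlation near -0.5; it does not represent the project's components or sales.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic component and sales with a true correlation of -0.5.
component = rng.normal(size=n)
sales = -0.5 * component + np.sqrt(0.75) * rng.normal(size=n)

# Pearson correlation between the component and sales.
r = float(np.corrcoef(component, sales)[0, 1])

# Ordinary least squares with an intercept term.
A = np.column_stack([component, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, sales, rcond=None)
pred = A @ coef

# Coefficient of determination of the fitted line.
ss_res = np.sum((sales - pred) ** 2)
ss_tot = np.sum((sales - sales.mean()) ** 2)
r_squared = float(1 - ss_res / ss_tot)

print(f"r = {r:+.2f}, R^2 = {r_squared:.2f}")
```

For a single predictor, $R^2$ equals the squared correlation, so most of the variation in sales remains unexplained at this correlation level.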

Alice Wu
Professor of Artificial Intelligence

My research interests include distributed robotics, mobile computing and programmable matter.