Improve stock control using cluster analysis
Objectives
The goals of this project are to:
- Categorize fabrics by their features
- Identify the category with the highest stock
- Predict sales
Dataset
Data were collected from the previous quarter’s sales results and cover a total of 3000 fabrics. The extracted features and the target are described in the table below. Fig. 1 shows an example of a fabric and its features. Because the purchased quantity varies from fabric to fabric, we use the percentage of sales as the target instead of the total sales amount.
Feature | Unit | Type | Range | Description |
---|---|---|---|---|
HUE | degree | Numeric | [0, 359] | Hue of the dominant color. |
PERCENT | - | Numeric | [0, 100] | Percentage of dominant color. |
RANGE | degree | Numeric | [0, 359] | Range of colors. |
THEME | - | Categorical | Monochrome, analogous, complementary, triadic, none | Color themes. |
SIZE | $cm^2$ | Numeric | [0, 1247.35] | Pattern size. |
CONTRAST | - | Numeric | [0, 127.5] | Standard deviation in grayscale. |
SALES | - | Numeric | [0, 100] | Percentage of sales. Total sales at the end of the month divided by total purchases at the beginning of the month. |
![](./figure/figure_01.webp)
Exploratory data analysis
Distribution
We can perform a descriptive analysis by examining the distribution of each feature and its correlation with sales, as shown in the figure.
Blue and red account for a large share of the dominant colors (HUE). However, HUE is uncorrelated with SALES in the figure, which indicates that our customers have no significant preference for the dominant color of the fabric. We therefore drop this feature from the later analyses.
The RANGE and SIZE of the fabrics are relatively small. This is reasonable, as it is difficult to design a fabric with a wide color range or a large pattern size. For THEME, the percentage clearly decreases as the theme becomes more intricate. This is also reasonable: as with RANGE, design becomes more challenging when the theme places more constraints on color usage.
In terms of SALES performance, the mean is 40%, and most of the distribution lies below this mark.
Figure: Distribution of each feature. Panels: HUE, PERCENT, RANGE, SIZE, CONTRAST, SALES, THEME.
Correlation
The figure shows the correlation between SALES and the features. Although the correlations are weak (all below 0.5), they still give some insight into customer preferences. All features except PERCENT, the monochrome theme, and the non-specific (none) theme are positively correlated with SALES.
The result is quite intuitive, and the figure provides an example. Fabric A has a wider range of colors, a larger pattern size, and higher contrast. It uses a triadic theme to balance red, blue, and green, which gives it higher visual appeal than fabric B. It is therefore reasonable to expect A to have higher sales than B.
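The correlation step can be sketched as follows. This is a minimal illustration, not the project's actual code: the data are random stand-ins for the real dataset, and encoding THEME as one 0/1 indicator column per theme is an assumption about how the categorical feature was handled.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in data (NOT the project's dataset).
themes = np.array(["monochrome", "analogous", "complementary", "triadic", "none"])
theme = rng.choice(themes, size=n)
numeric = {
    "PERCENT": rng.uniform(0, 100, n),
    "RANGE": rng.uniform(0, 359, n),
    "SIZE": rng.uniform(0, 1247.35, n),
    "CONTRAST": rng.uniform(0, 127.5, n),
}
sales = rng.uniform(0, 100, n)

# Assumption: one-hot encode THEME so each theme gets its own 0/1 column.
columns = dict(numeric)
for t in themes:
    columns[f"THEME={t}"] = (theme == t).astype(float)

# Pearson correlation of every column with SALES.
correlations = {
    name: float(np.corrcoef(values, sales)[0, 1])
    for name, values in columns.items()
}

# List the columns from most positively to most negatively correlated.
for name, r in sorted(correlations.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>20s}  r = {r:+.3f}")
```

Because the stand-in data are random, the printed correlations are all close to zero; with the real features, the sign of each value is what indicates customer preference.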
![](./figure/figure_02.webp)
Models
Before conducting quantitative analysis, it’s important to split the data into training and test datasets. This is necessary for verification in the final step of the analysis. Additionally, data standardization is required to ensure that all features have a comparable range.
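A minimal sketch of the split and standardization is given here. The 80/20 split ratio, the random seeds, and the helper names `train_test_split` and `standardize` are illustrative assumptions, and the data are synthetic. The scaling statistics are estimated on the training set only, so the test set stays unseen until verification.

```python
import numpy as np

def train_test_split(X, y, test_fraction=0.2, seed=0):
    """Shuffle row indices once and hold out `test_fraction` of the rows."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(len(X) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

def standardize(X_train, X_test):
    """Scale both sets with the mean and std estimated on the training set."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # guard against constant columns
    return (X_train - mean) / std, (X_test - mean) / std

# Synthetic stand-in for the numeric features (PERCENT, RANGE, SIZE, CONTRAST).
rng = np.random.default_rng(42)
X = rng.uniform(size=(3000, 4)) * np.array([100.0, 359.0, 1247.35, 127.5])
y = rng.uniform(0, 100, size=3000)  # placeholder SALES values

X_train, X_test, y_train, y_test = train_test_split(X, y)
Z_train, Z_test = standardize(X_train, X_test)

print(Z_train.shape, Z_test.shape)
print(Z_train.mean(axis=0).round(6), Z_train.std(axis=0).round(6))
```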
Dimensionality reduction
Because the data contain both numerical and categorical features, the covariance matrix has to be built differently for each type. For numerical features, we can use the standard definition of covariance. For categorical features, the matrix is constructed so that it preserves the $\chi^2$ distance between features. We then perform singular value decomposition on the covariance matrix to find the principal components. In summary, PCA (principal component analysis) and CA (correspondence analysis) are used to reduce the feature dimensionality.
The results are shown in the figure. Three principal components are enough to preserve 70% of the total variance. The loadings of each component are also shown in the figure.
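The numerical half of this step can be sketched as below. It covers only PCA on standardized numeric data and does not attempt the correspondence analysis used for the categorical feature. It applies SVD to the centered data matrix, which yields the same principal axes as decomposing the covariance matrix. The helper names `pca_svd` and `n_components_for` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def pca_svd(Z):
    """PCA of data Z (rows = samples) via SVD of the centered matrix.

    Returns the principal axes (one per row) and the fraction of total
    variance explained by each axis, in decreasing order.
    """
    Zc = Z - Z.mean(axis=0)
    _, s, Vt = np.linalg.svd(Zc, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return Vt, explained

def n_components_for(explained, target):
    """Smallest number of leading components whose variance reaches `target`."""
    cumulative = np.cumsum(explained)
    idx = int(np.searchsorted(cumulative, target))
    return min(idx + 1, len(explained))

# Synthetic correlated data (NOT the project's dataset), then standardized.
rng = np.random.default_rng(1)
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.3 * rng.normal(size=(500, 5))
Z = (X - X.mean(axis=0)) / X.std(axis=0)

components, explained = pca_svd(Z)
k = n_components_for(explained, 0.70)

# Project the samples onto the first k principal axes.
scores = (Z - Z.mean(axis=0)) @ components[:k].T

print("explained variance ratio:", explained.round(3))
print("components needed for 70%:", k)
```

Each row of `components` holds the weights of the original features on one principal axis, which is the information a loading plot displays (up to scaling).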
Cluster analysis
K-means
Based on the turning point in the elbow plot shown in the figure, we use 3 clusters for further analysis. For each cluster, we calculate the average of its features, as shown in the figure.
The first cluster, named “Colorful”, has higher values in RANGE, CONTRAST, and the multi-color themes. The second cluster, named “Large pattern”, has the highest value in SIZE. In contrast to the first two clusters, the third cluster has large positive values only in the dominant color percentage (PERCENT) and the monochrome theme, so it is named “Plain” and should be considered less visually appealing than the first two clusters. The figure provides example fabrics for each cluster.
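The elbow procedure can be sketched with a small NumPy implementation of Lloyd's k-means. This is an illustrative stand-in rather than the code used in the project: the three synthetic blobs, the single random initialization, and the `kmeans` helper are all assumptions.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm with one random initialization.

    Returns the labels, the cluster centers, and the inertia (sum of
    squared distances from each point to its assigned center).
    """
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each center to the mean of its points; keep it if it is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    inertia = float(d2[np.arange(len(X)), labels].sum())
    return labels, centers, inertia

# Synthetic points in a 3-component space (NOT the project's data).
rng = np.random.default_rng(7)
true_centers = np.array([[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 3)) for c in true_centers])

# Elbow plot data: inertia as a function of the number of clusters.
inertias = [kmeans(X, k)[2] for k in range(1, 7)]
for k, value in zip(range(1, 7), inertias):
    print(f"k={k}: inertia={value:.1f}")

# Fit the chosen model; each center is the average of its cluster's points.
labels, centers, _ = kmeans(X, 3)
print("cluster sizes:", np.bincount(labels, minlength=3))
print("cluster averages:\n", centers.round(2))
```

The elbow is the value of k after which the inertia stops dropping sharply. The same `labels` array can be used to average the original features per cluster in order to name the clusters.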
Colorful
![](./figure/figure_03.webp)
Large pattern
![](./figure/figure_04.webp)
Plain
![](./figure/figure_05.webp)
Sales and Stock
The histogram and cumulative curve of the percentage of sales are shown in the figure. The distribution for “Plain” fabric is clearly concentrated near zero. The figure also shows each cluster's share of sales and stock: “Plain” fabric accounts for 60% of total stock but contributes only 31.9% of sales. In other words, the root cause of the low sales is the “Plain” fabric. Therefore, if we reduce the share of “Plain” fabric in every purchase, we can improve our stock control. The test data lead to the same conclusion (not shown).
Figure: (a) The histogram and cumulative curve, as well as (b) the sales and stock shares of each cluster, indicate that the “Plain” fabric is the root cause of low sales.
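The stock and sales shares per cluster can be computed along the following lines. The six fabrics and their quantities are invented toy numbers chosen only to make the arithmetic visible; they are not taken from the dataset, and the variable names are hypothetical.

```python
import numpy as np

# Toy example (NOT the project's data): cluster label, units purchased,
# and percentage of sales (SALES) for six fabrics.
clusters = np.array(
    ["Colorful", "Large pattern", "Plain", "Plain", "Colorful", "Plain"]
)
purchased = np.array([100.0, 80.0, 140.0, 110.0, 60.0, 90.0])
sales_pct = np.array([70.0, 55.0, 10.0, 20.0, 80.0, 15.0])

# Units sold per fabric follow from the definition of SALES.
sold = purchased * sales_pct / 100.0

# Share of total stock and of total sales held by each cluster.
shares = {}
for name in np.unique(clusters):
    mask = clusters == name
    stock_share = purchased[mask].sum() / purchased.sum()
    sales_share = sold[mask].sum() / sold.sum()
    shares[str(name)] = (float(stock_share), float(sales_share))

for name, (stock_share, sales_share) in shares.items():
    print(f"{name:>14s}  stock {stock_share:6.1%}  sales {sales_share:6.1%}")
```

A cluster whose stock share is much larger than its sales share, as “Plain” is in this toy example, is a candidate for reduced purchasing.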
Regression
Another goal is to predict sales. To do this, we first examine the correlation between SALES and the components obtained from dimensionality reduction. Unfortunately, only one component reaches a correlation of -0.5, which means that the current features are still insufficient to make precise predictions.
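To see why a correlation of this size is limiting, recall that for a simple linear regression on a single predictor, the coefficient of determination equals the squared Pearson correlation. Taking $r \approx -0.5$ for the strongest component, and assuming a one-variable linear fit, that component alone would explain only about a quarter of the variance in SALES:

$$R^2 = r^2 \approx (-0.5)^2 = 0.25$$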