

*Scientific Reports* **volume 12**, Article number: 22286 (2022)



Recent progress in encoder–decoder neural network architecture design has led to significant performance improvements in a wide range of medical image segmentation tasks. However, state-of-the-art networks for a given task may be too computationally demanding to run on affordable hardware, and thus users often resort to practical workarounds by modifying various macro-level design aspects. Two common examples are downsampling of the input images and reducing the network depth or size to meet computer memory constraints. In this paper, we investigate the effects of these changes on segmentation performance and show that image complexity can be used as a guideline in choosing what is best for a given dataset. We consider four statistical measures to quantify image complexity and evaluate their suitability on ten different public datasets. For the purpose of our illustrative experiments, we use DeepLabV3+ (deep large-size), M2U-Net (deep lightweight), U-Net (shallow large-size), and U-Net Lite (shallow lightweight). Our results suggest that median frequency is the best complexity measure when deciding on an acceptable input downsampling factor and using a deep versus shallow, large-size versus lightweight network. For high-complexity datasets, a lightweight network running on the original images may yield better segmentation results than a large-size network running on downsampled images, whereas the opposite may be the case for low-complexity images.

Medical image segmentation aims to delineate organs or lesions in images from computed tomography (CT), magnetic resonance imaging (MRI), optical imaging, and other medical imaging modalities, and serves as a basis for subsequent quantitative image analysis in a wide range of clinical and research applications. It is one of the most difficult tasks in medical image analysis: the segmentations must capture critical information about organ shapes and volumes, while medical images can be quite complex^{1,2,3,4}. The challenges of obtaining a clinically applicable segmentation are multifaceted, including diverse segmentation tasks, different modalities, multiple resolutions, and varying anatomical characteristics such as shape, size, location, deformity, and texture. Recent progress in encoder-decoder architectures such as U-Net^{5,6,7,8} has improved segmentation performance on many benchmarks. However, designing such networks requires significant effort in choosing the right network configuration.

The size of medical imaging datasets is constantly increasing^{9} and often it is not possible to train deep neural network architectures on a single mid-range graphics processing unit (GPU) at the native image resolution. As a result, the images are typically downsampled before training, which may cause loss or alteration of fine details that are potentially important for diagnosis. Also, in benchmarking studies, downsampling is sometimes used for both training and testing of medical image segmentation methods^{10,11}, and thus the results may not be fully representative of performance on the native images. Alternatively, shallow networks are often proposed^{12,13,14}, in an attempt to trade off image size and network size to allow training on limited computing hardware. Another common practice is iterative downsampling until training of a deeper network of choice becomes feasible on given hardware. While these approaches are understandable from a practical standpoint, we argue that the optimal choice of input size and network depth is inherently dependent upon the characteristics of the data and the segmentation task.

Recent methods in medical image segmentation adopt neural architecture search (NAS)^{15,16,17,18,19,20} to determine the best suitable network architecture for the task at hand. However, a computationally expensive search has to be performed for each new dataset and task, and the resulting architecture may not generalize well to other datasets and tasks. Here again, the importance of the information content of the data is often ignored. We argue that we need to take a step back and base the macro-level design choices of neural networks, such as the amount of downsampling or the depth of the network, on the information complexity of the data.

Our objective in this work is to employ measures of image complexity to guide macro-level neural network design for medical image segmentation. We focus specifically on balancing input image downsampling and network depth/size for optimal segmentation results. To this end, we consider four statistical complexity measures: delentropy^{21}, mean frequency^{22}, median frequency^{22}, and perimetric complexity^{23}. Delentropy and perimetric complexity have been used previously as measures of data complexity in autonomous driving^{24} and binary pattern recognition^{23}, respectively, while mean and median frequency have been used in electromyography signal identification^{22}. In this paper, they are used for the first time as complexity measures for predicting a suitable input image downsampling factor and selecting a shallow versus deep, lightweight versus large-size neural network.

In general, the architectural design choices for semantic segmentation networks boil down to either model scaling^{25} (in the pursuit of performance) leading to deep networks, or model compression^{26} (for embedded and edge applications) resulting in shallow counterparts. The intended applications and corresponding hardware resources impose demands and limits on the number of trainable network parameters, and determine whether to use a computationally heavy or lightweight network. Based on model scaling and model compression, four design combinations, namely deep large-size, deep lightweight, shallow large-size, and shallow lightweight networks, are included in our experiments (Table 1). Here, networks with more versus less than 80 layers are categorized as deep versus shallow, and networks with more versus less than 3 million parameters are categorized as large-size versus lightweight. Based on these criteria, four existing state-of-the-art networks are selected for the comparative analysis. Specifically, DeepLabV3+^{27} is used as a deep large-size network, M2U-Net^{28} as a deep lightweight network, an adapted U-Net^{5} as a shallow large-size network, and U-Net Lite as a shallow lightweight network. To find the best complexity measure for selecting a suitable network, we fit several regression models (linear and higher-degree polynomial) and evaluate them using the coefficient of determination \(\text{R}^2\), the adjusted \(\text{R}^2\), the root mean square error (RMSE), the mean absolute error (MAE), the Akaike information criterion (AIC), and the corrected AIC (AICc).

The aim of this work is to take advantage of image complexity in the design of macro-level neural networks for medical image segmentation. To demonstrate the efficacy and wide applicability of image complexity analysis for neural network based medical image segmentation, we present experiments on 10 different datasets from public challenges. The results confirm that the proposed complexity measures can indeed aid in making the said macro-level design choices and that median frequency is the best measure for this purpose. More specifically, the results show that input image size is important for datasets with high complexity and downsampling negatively affects segmentation performance in such cases, whereas downsampling does not significantly affect performance for datasets having low complexity. Also, in the case of high-complexity datasets and computational constraints, a shallow network taking the original images as input is to be preferred, whereas for low-complexity cases competitive performance with the same computational constraints is achievable by using downsampling and a deep network topology.

It has long been known that data complexity measures can be used to determine the intrinsic difficulty of a classification task on a given dataset^{29}. In this study we consider four important complexity measures and investigate their suitability for medical image segmentation tasks.

The standard Shannon entropy of a gray-scale image is defined as^{21}:

$$H = -\sum_{i=1}^{N} p_i \log_2 p_i \qquad (1)$$

where *N* is the number of gray levels and \(p_i\) is the probability of a pixel having gray level *i*. Delentropy (DE) is computed similarly, but using a probability density function known as deldensity^{21}. DE is different from Shannon entropy, which looks only at individual pixel values. Instead, DE considers the underlying spatial image structure and pixel co-occurrence through the deldensity, which is based on gradient vectors in the image. Specifically, the two-dimensional probability density function (normalized joint histogram) \(p_{i,j}\) is computed as:

$$p_{i,j} = \frac{1}{WH} \sum_{w=1}^{W} \sum_{h=1}^{H} \delta_{i,\,d_x(w,h)}\,\delta_{j,\,d_y(w,h)} \qquad (2)$$

where \(d_x\) and \(d_y\) denote the derivative kernels in the *x* and *y* direction, \(\delta\) is the Kronecker delta describing the binning process in histogram generation, and *W* and *H* are the image width and height, respectively. From this, DE is computed as:

$$\mathrm{DE} = -\frac{1}{2} \sum_{i=1}^{I} \sum_{j=1}^{J} p_{i,j} \log_2 p_{i,j} \qquad (3)$$

where *I* and *J* are the number of bins (discrete cells) in the two dimensions of the probability density function. The \(\frac{1}{2}\) factor in (3) reflects the Papoulis generalized sampling, which halves the entropy rate^{21}. Discrete 2×2 kernels are used as \(d_x\) and \(d_y\) in our implementation to estimate the *x* and *y* derivatives by taking finite differences.
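The delentropy computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it uses simple first-order finite differences for the gradients, and the bin count `n_bins` is an arbitrary choice.

```python
import numpy as np

def delentropy(image, n_bins=256):
    """Estimate delentropy of a gray-scale image (illustrative sketch).

    The joint histogram of the x- and y-gradients approximates the
    deldensity; the 1/2 factor reflects the Papoulis sampling argument.
    """
    img = np.asarray(image, dtype=float)
    # Finite-difference gradients in x and y.
    dx = img[:, 1:] - img[:, :-1]
    dy = img[1:, :] - img[:-1, :]
    h = min(dx.shape[0], dy.shape[0])
    w = min(dx.shape[1], dy.shape[1])
    dx, dy = dx[:h, :w], dy[:h, :w]
    # Deldensity: normalized joint histogram of the gradient pair.
    hist, _, _ = np.histogram2d(dx.ravel(), dy.ravel(), bins=n_bins)
    p = hist / hist.sum()
    p = p[p > 0]  # avoid log(0); zero-probability bins contribute nothing
    return -0.5 * np.sum(p * np.log2(p))
```

A constant image has a single occupied gradient bin and therefore zero delentropy, whereas noisy or highly textured images score high.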

The mean frequency (MNF) of a signal is computed as the sum of the product of the power spectrum and frequency, divided by the total sum of the power spectrum^{22}:

$$\mathrm{MNF} = \frac{\sum_{i=1}^{M} f_i P_i}{\sum_{i=1}^{M} P_i} \qquad (4)$$

where \(P_i\) is the value of the power spectrum at frequency bin *i*, \(f_i\) is the actual frequency of that bin, and *M* is the total number of frequency bins. The power spectrum is computed as the squared amplitude of the Fourier transform. Prior to power spectrum estimation, the image is windowed with a rectangular window of length determined by the dimensions of the image. The MNF can be considered the frequency centroid or the spectral center of gravity, and is also called the mean power frequency or mean spectral frequency in several works^{22}. For the extension to the 2D image domain, the 1D formula (4) is first applied to each column of the image independently to obtain its mean frequency, and subsequently to the resulting vector of mean frequencies.
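The column-wise MNF computation can be sketched in NumPy as follows (a simplified illustration under our own conventions: the implicit full-length rectangular window, normalized frequencies in cycles per sample, and hypothetical function names):

```python
import numpy as np

def mean_frequency(signal):
    """Mean frequency of a 1D signal: the power-spectrum centroid."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2   # power spectrum
    freqs = np.fft.rfftfreq(len(signal))          # bins in cycles/sample
    return np.sum(freqs * spectrum) / np.sum(spectrum)

def mean_frequency_2d(image):
    """2D extension: MNF of each column, then MNF of the resulting vector."""
    cols = [mean_frequency(image[:, c]) for c in range(image.shape[1])]
    return mean_frequency(np.asarray(cols))
```

For a pure sinusoid at 8 cycles over 64 samples, the centroid sits at the normalized frequency 8/64 = 0.125, since all spectral power is concentrated in that one bin.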

The median frequency (MDF) of a signal is the frequency at which the power spectrum of the signal is divided into two regions with equal integrated power^{22}. In other words, at \(\text{MDF} = f_j\) the following equality holds:

$$\sum_{i=1}^{j} P_i = \sum_{i=j}^{M} P_i = \frac{1}{2} \sum_{i=1}^{M} P_i \qquad (5)$$

Similar to MNF, the MDF of a 2D image is computed by first applying the 1D procedure to each column independently, and then to the resulting vector. The power within each bin is computed by rectangular integration. Afterwards, the MDF is determined by searching for the bin *j* that satisfies condition (5).
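A minimal 1D sketch of the MDF search (our own illustration, assuming normalized frequencies and a cumulative-power search for the half-power bin):

```python
import numpy as np

def median_frequency(signal):
    """Frequency bin splitting the power spectrum into equal halves."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal))
    cum = np.cumsum(spectrum)  # rectangular integration of the power
    # First bin j at which the cumulative power reaches half the total.
    j = np.searchsorted(cum, cum[-1] / 2.0)
    return freqs[j]
```

For the same single-tone test signal as above, the MDF coincides with the MNF, since all power lies in one bin; for broadband images the two measures generally differ.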

The perimetric complexity (PC) is a measure of the complexity of binary images. The general concept goes back to the early days of vision research^{23}, where this measure, originally called dispersion, was used to describe the perceptual complexity of visual shapes. It is defined as:

$$\mathrm{PC} = \frac{P^2}{4\pi A} \qquad (6)$$

where *P* represents the perimeter of the foreground and *A* is the foreground area. In our study, this measure is computed from the annotation masks of the gray-scale images.
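A rough NumPy sketch of PC on a binary mask. This is our own approximation, not the paper's implementation: the perimeter is estimated by counting foreground/background transitions along rows and columns, and the \(4\pi\) normalization (under which a disk scores 1) is an assumption, as some definitions of dispersion omit it.

```python
import numpy as np

def perimetric_complexity(mask):
    """PC = P^2 / (4*pi*A) of a binary mask (hedged sketch)."""
    m = (np.asarray(mask) > 0).astype(int)
    area = m.sum()
    if area == 0:
        return 0.0
    # Perimeter: 0/1 transitions inside the mask plus border crossings.
    perim = (np.abs(np.diff(m, axis=0)).sum()
             + np.abs(np.diff(m, axis=1)).sum()
             + m[0, :].sum() + m[-1, :].sum()
             + m[:, 0].sum() + m[:, -1].sum())
    return perim ** 2 / (4.0 * np.pi * area)
```

With this transition-counting perimeter, a filled square of side *s* has \(P = 4s\) and \(A = s^2\), giving \(\mathrm{PC} = 4/\pi\) regardless of size, while elongated or fragmented shapes score higher.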

To investigate the interplay between image complexity, input downsampling, and network depth and size, we considered four possible network design options: deep large-size (DeepLabV3+), deep lightweight (M2U-Net), shallow large-size (U-Net), and shallow lightweight (U-Net Lite).

DeepLabV3+^{27} was used as a deep large-size network. Consisting of 100 layers and 20 million trainable parameters, it enhances DeepLabV3 by including a simple yet effective decoder module to refine segmentation results, particularly along object boundaries^{27}. We built a DeepLabV3+ network using ResNet-18 as the base network.

M2U-Net^{28} was employed as a representative deep lightweight network. It uses an encoder-decoder architecture based on the U-Net and consists of 155 layers and 0.55 million trainable parameters. Specifically, it incorporates MobileNetV2^{30} pretrained components in the encoder and novel contractive bottleneck blocks in the decoder, which, combined with bilinear upsampling, drastically reduce the parameter count to 0.55 million, compared to about 30 million in the original U-Net^{5}.

The U-Net^{5} architecture was adopted as a shallow large-size network. It is made up of two subnetworks, namely an encoder and a decoder, which are linked by a bridge section. The encoder and decoder subnetworks are divided into several stages, the number of which determines the depth of the subnetworks. In our experiments, the encoder depth was set to 4 stages to make the U-Net a shallow network, totalling 58 layers and about 30 million trainable parameters. The U-Net encoder stages consist of two sets of convolutional and rectified linear unit (ReLU) layers, followed by a 2-by-2 max pooling layer. The decoder stages consist of an upsampling transposed convolutional layer followed by two sets of convolutional and ReLU layers. For the convolutional layers, we used feature map depths of 64, 128, 256, 512 for the four stages, respectively, and 1024 for the bridge section.

For a shallow lightweight network we designed U-Net Lite based on the U-Net architecture. In U-Net Lite, we reduced the encoder depth of U-Net to 3 stages. We also used a reduced number of convolutional filters in each stage to, respectively, 8, 16, and 32. Together, these modifications reduced the number of layers to 46 and the number of trainable parameters to only 0.28 million.

Two experiments were performed to test the hypothesis that image complexity can and should be taken into account in making macro-level neural network design choices for medical image segmentation. In the following sections we present the network training approach, the used public datasets, segmentation performance metrics, regression analysis performance metrics, and the results of the two experiments.

All experiments were carried out on an Intel(R) Core(TM) i7-8700 CPU with 64 GB RAM and a relatively low/mid-range GeForce GTX1080Ti GPU. Network training was done with adaptive moment estimation (Adam) and a fixed learning rate of 1e-3. After initial experimentation, the maximum number of epochs was set to 15 with a batch size of 8 to match the hardware constraints. Gradient clipping was employed based on the global \(l_2\)-norm with a gradient threshold of 3^{31}. Weighted cross-entropy loss was used as the objective function for training all models in our experiments. To calculate the class association weights in the loss, we used median frequency balancing^{32}.
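The two training ingredients above, median frequency balancing of class weights and global-norm gradient clipping, can be sketched framework-independently in NumPy (an illustration with hypothetical helper names; the actual training loop used a deep learning framework):

```python
import numpy as np

def median_frequency_weights(class_pixel_counts):
    """Median frequency balancing: weight_c = median(freq) / freq_c,
    where freq_c is the fraction of pixels belonging to class c."""
    counts = np.asarray(class_pixel_counts, dtype=float)
    freq = counts / counts.sum()
    return np.median(freq) / freq

def clip_global_norm(grads, threshold=3.0):
    """Scale a list of gradient arrays so their global l2-norm
    does not exceed `threshold`."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > threshold:
        grads = [g * (threshold / total) for g in grads]
    return grads
```

For a 90/10 background/foreground split, the weights become [5/9, 5], so the rare foreground class contributes roughly nine times more per pixel to the loss.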

We used 10 publicly available datasets (Table 2) representing a range of image complexities (Table 3). We confirm that all experiments were performed in accordance with relevant guidelines and regulations.

The STARE (Structured Analysis of the Retina) dataset^{33} consists of 20 color retinal fundus images acquired with a field of view (FOV) of 35° and a size of 700×605 pixels. Various pathologies are present in 10 of the 20 images. For each of the 20 images, two expert manual segmentation maps of the retinal blood vessels are available, and we used the first of these as the ground truth. Following others^{34,35}, we used 10 images for training and 10 for testing.

The DRIVE (Digital Retinal Images for Vessel Extraction) dataset^{36} is from a diabetic retinopathy screening program. It contains 20 color images for training and 20 for testing, each of size 584×565 pixels, and covers a wide age range of diabetic patients. Seven of the 40 images show small signs of mild early diabetic retinopathy. For each of the 40 images, an expert manual segmentation mask is available for use as ground truth.

The CHASE-DB1 dataset^{37} (a subset of the Child Heart and Health Study in England) includes 28 color images of children. Each image is captured with a 30° FOV centered on the optic disc and has a size of 999×960 pixels. As ground truth, two different expert manual segmentation maps are available, of which we used the first for our experiments. Since there are no predefined training and testing subsets, following others^{11,38,39,40} we used the first 20 images for training and the remaining 8 for testing.

The Montgomery County (MC) chest X-ray dataset^{41} contains 138 frontal chest X-ray images obtained from a tuberculosis research program and is often used as a benchmark for lung segmentation. It includes 58 tuberculosis cases and 80 normal cases with a variety of abnormalities, for which expert manual segmentations are available. The images are relatively large, either 4020×4892 or 4892×4020 pixels. Following others^{42}, we selected 100 images for training and the remaining 38 for testing.

The PH2 dataset^{43} (named after its provider, the Hospital Pedro Hispano in Matosinhos, Portugal) includes 200 dermoscopic images, 768×560 pixels each, of melanocytic skin lesions, with expert annotations to be used as ground truth in evaluating both segmentation and classification methods. Following the experimental protocols of others^{44,45,46,47}, we used all images in this dataset for testing, while training was done on the ISIC-2016 training images.

The ISIC-2016 dataset^{48} (named after the International Skin Imaging Collaboration, who hosted the challenge at the 2016 IEEE International Symposium on Biomedical Imaging where this dataset was used) contains 900 dermoscopic training images of different sizes, from as small as 576×768 or 718×542 pixels to as large as 4288×2848 pixels, with expert manual annotations for benchmarking melanoma segmentation, pattern detection, and classification methods. For testing, we used the PH2 images.

The DRISHTI-GS1 dataset^{49} includes 101 retinal images for glaucoma assessment. The images were captured with a 30° FOV centered on the optic disc (OD) and are of size 2896×1944 pixels. Average boundaries of both the optic cup (OC) and the OD in all images were obtained from manual annotations by four experts. The dataset is divided into 50 images for training and 51 for testing. We refer to the OC boundaries as the DRISHTI-OC dataset.

The DRISHTI-OD dataset refers to average boundaries of the OD regions in the 101 retinal images of the DRISHTI-GS1 dataset^{49} described above.

The PROMISE12 (Prostate MR Image Segmentation 2012) dataset^{50} contains three-dimensional (3D) transversal T2-weighted magnetic resonance (MR) images of 50 patients scanned at various centers using various MRI scanners and imaging protocols. The size of the images varies from 256×256 to 320×320, 384×384, and 512×512 pixels. In our experiments we used only the images of patients 0–12, all of size 512×512 pixels, of which we used 200 for training and 74 for testing^{51}.

The BCSS (Breast Cancer Semantic Segmentation) dataset^{52} contains more than 20,000 manually segmented tissue regions in 151 whole-slide breast-cancer images from The Cancer Genome Atlas (TCGA). The images vary in size from 1500×2000 to 3000×4000 pixels and were annotated by 25 participants ranging in experience from senior pathologists to medical students. Following others^{53}, we used 100 images for training and the remaining 51 for testing.

To quantify segmentation performance, we used seven popular metrics^{54,55}. Denoting the segmented image by *S* and the corresponding ground-truth image by *G*, each having *N* pixels \(i = 1 \dots N\) with a value either 0 (negative = background) or 1 (positive = foreground), we first computed the numbers of true-positive (TP) pixels:

$$\mathrm{TP} = \sum_{i=1}^{N} S_i G_i$$

true-negative (TN) pixels:

$$\mathrm{TN} = \sum_{i=1}^{N} (1 - S_i)(1 - G_i)$$

false-positive (FP) pixels:

$$\mathrm{FP} = \sum_{i=1}^{N} S_i (1 - G_i)$$

and false-negative (FN) pixels:

$$\mathrm{FN} = \sum_{i=1}^{N} (1 - S_i) G_i$$

from which we obtained the sensitivity (Se), also known as the recall (R):

$$\mathrm{Se} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$$

the specificity (Sp):

$$\mathrm{Sp} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}$$

the accuracy (A):

$$\mathrm{A} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}$$

the balanced accuracy (BA):

$$\mathrm{BA} = \frac{\mathrm{Se} + \mathrm{Sp}}{2}$$

the Dice (D) coefficient, which is equivalent to the F1-score:

$$\mathrm{D} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$$

the Jaccard (J) coefficient:

$$\mathrm{J} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$$

and the overlap error (E):

$$\mathrm{E} = 1 - \mathrm{J}$$

The values of all metrics are in the range [0, 1], where 0 means worst and 1 means best performance, except for E, where 0 means best and 1 means worst performance.
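All seven metrics follow directly from the four pixel counts, as in this compact NumPy sketch (our own illustration; the function name and return structure are ours):

```python
import numpy as np

def segmentation_metrics(S, G):
    """Seven overlap metrics from binary masks S (prediction) and G
    (ground truth): Se, Sp, A, BA, Dice (D), Jaccard (J), and error E."""
    S = np.asarray(S).astype(bool)
    G = np.asarray(G).astype(bool)
    TP = np.sum(S & G)
    TN = np.sum(~S & ~G)
    FP = np.sum(S & ~G)
    FN = np.sum(~S & G)
    Se = TP / (TP + FN)                       # sensitivity / recall
    Sp = TN / (TN + FP)                       # specificity
    A = (TP + TN) / (TP + TN + FP + FN)       # accuracy
    BA = (Se + Sp) / 2                        # balanced accuracy
    D = 2 * TP / (2 * TP + FP + FN)           # Dice / F1
    J = TP / (TP + FP + FN)                   # Jaccard
    E = 1 - J                                 # overlap error
    return dict(Se=Se, Sp=Sp, A=A, BA=BA, D=D, J=J, E=E)
```

A perfect segmentation yields 1 for every metric except E, which is 0, consistent with the ranges stated above.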

To evaluate the performance of the linear regression models, we used the most common regression performance metrics, including the coefficient of determination \(\text{R}^2\), the adjusted \(\text{R}^2\), the RMSE, and the MAE, as well as two important unbiased metrics, namely the AIC and its corrected version, AICc^{56}.

The first is a statistical measure of the proportion of variance in the outcome that is explained by the independent variables^{57} and is computed as:

$$\text{R}^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$$

with the total sum of squares (TSS)

$$\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

and the residual sum of squares (RSS)

$$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - m_i)^2$$

computed from the observed values \(y_i\), their mean \(\bar{y}\), and the values \(m_i\) predicted by the model^{57}. The regression model having a higher \(\text{R}^2\) value is considered to be better. To account for the numbers of independent variables, *k*, and observations, *n*, the adjusted \(\text{R}^2\) (\(\text{AR}^2\)) is also employed^{58}:

$$\text{AR}^2 = 1 - (1 - \text{R}^2)\,\frac{n - 1}{n - k - 1}$$

To measure the average error of the models in predicting the observations, we computed the RMSE, defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - m_i)^2}$$

as well as the MAE, defined as:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - m_i|$$

Finally, to get an unbiased estimate of a model's performance, we computed the AIC metric:

$$\mathrm{AIC} = n \ln\!\left(\frac{\mathrm{RSS}}{n}\right) + 2k$$

and because our sample size is relatively small (\(n = 10\) datasets), we also employed the AICc metric:

$$\mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n - k - 1}$$
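These regression metrics can be bundled into one NumPy routine (an illustrative sketch; we assume the common least-squares form of the AIC, \(n \ln(\mathrm{RSS}/n) + 2k\), which matches the definitions above up to an additive constant that cancels in model comparisons):

```python
import numpy as np

def regression_metrics(y, m, k):
    """R^2, adjusted R^2, RMSE, MAE, AIC, and AICc for predictions m of
    observations y, with k independent variables."""
    y = np.asarray(y, dtype=float)
    m = np.asarray(m, dtype=float)
    n = len(y)
    rss = np.sum((y - m) ** 2)                     # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)              # total sum of squares
    r2 = 1 - rss / tss
    ar2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)     # adjusted R^2
    rmse = np.sqrt(rss / n)
    mae = np.mean(np.abs(y - m))
    aic = n * np.log(rss / n) + 2 * k              # least-squares AIC
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)     # small-sample correction
    return dict(R2=r2, AR2=ar2, RMSE=rmse, MAE=mae, AIC=aic, AICc=aicc)
```

Note that the AICc correction term grows as *k* approaches *n*, which is exactly why it matters for a sample of only ten datasets.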

This experiment was designed to investigate the effect of input downsampling on medical image segmentation performance and how well the proposed complexity measures predict the corresponding information loss. We considered three downsampling factors: 2, 3, and 4, which are typically sufficient to reduce the images to a workable size for most networks. For this experiment, we did not employ the networks, as the goal was to study the effect of input downsampling alone. To this end, the binary annotation masks of the images of all considered datasets were downsampled by a given factor, and then upsampled by the same factor to restore their size for comparison with the original masks using the segmentation performance metrics (Section “Segmentation performance metrics”). Bilinear interpolation was employed in our implementation for both downsampling and upsampling. To minimize aliasing artifacts in the reconstructions, we removed all frequency components above the resampling Nyquist frequency using a low-pass filter^{59} before downsampling, and after upsampling we applied optimal thresholding to obtain binary masks maximizing the Dice/F1-measure^{60}. From the results of this experiment (Table 3) we observe two important trends: (1) segmentation quality consistently decreases with increasing downsampling factor, and (2) this effect is less severe for datasets with relatively low image complexity. These trends clearly support our hypothesis that the proposed complexity measures are indicative of the information loss caused by downsampling and can therefore be employed as a guideline to determine the amount of acceptable downsampling.
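The down/up-sampling probe can be sketched as follows. This is a deliberately simplified stand-in for the pipeline described above: block averaging replaces the low-pass-filtered bilinear resampling, nearest-neighbour upsampling replaces bilinear upsampling, and a fixed 0.5 threshold replaces the Dice-optimal threshold search.

```python
import numpy as np

def down_up_mask(mask, factor):
    """Downsample a binary mask by `factor`, restore its size, and
    re-binarize (simplified sketch of the experiment's resampling probe)."""
    m = np.asarray(mask, dtype=float)
    h = (m.shape[0] // factor) * factor
    w = (m.shape[1] // factor) * factor
    m = m[:h, :w]
    # Block-average downsampling acts as a crude low-pass filter.
    small = m.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    # Nearest-neighbour upsampling back to the cropped size.
    restored = np.repeat(np.repeat(small, factor, axis=0), factor, axis=1)
    return (restored >= 0.5).astype(int)

def dice(S, G):
    S = np.asarray(S) > 0
    G = np.asarray(G) > 0
    return 2 * np.sum(S & G) / (S.sum() + G.sum())
```

Even this crude version reproduces the qualitative trend: a large solid blob (low complexity) survives a factor-2 round trip intact, while one-pixel-wide structures (high complexity) are corrupted.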

To compare the predictive power of the different complexity measures on segmentation performance, we performed linear regression for the two most common segmentation performance metrics: Dice (F1) and Jaccard (expressed via E). The results (Fig. 1) indicate that the MDF measure outperforms the other measures in predicting segmentation quality, as confirmed by its highest \(\text{R}^2\) values. As the \(\text{R}^2\) values for both MNF and MDF are higher than those for DE and PC, it can be concluded that frequency information is most predictive of segmentation performance in the datasets considered in our experiments. The other measures capture different types of complexity and may prove useful in other medical image segmentation tasks.

Comparison of complexity measures in terms of their predictive performance.

To evaluate the trade-off between goodness-of-fit and model complexity in terms of the number of independent variables (the degrees of freedom), we compared the regression performance of models by varying the degrees of freedom (DoF) and using the regression performance metrics (Section “Regression analysis performance metrics”). The metrics were computed for the three considered downsampling factors: 2, 3, and 4. The DoF is the degree of the polynomial function that best fits the data. In our experiments, models with \(\text{DoF} > 5\) did not improve the regression performance in general (Table 4). More specifically, while performance further improved in terms of the other metrics, according to the AICc metric optimal performance was reached for \(\text{DoF} = 4\) or 5 in most cases. Given our small sample size, we considered AICc to be decisive owing to its unbiased nature.
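The DoF sweep amounts to fitting polynomials of increasing degree and scoring each fit with AICc, for example (our own sketch, with `dof` playing the role of the number of independent variables *k* and the least-squares form of the AIC assumed):

```python
import numpy as np

def fit_aicc(x, y, dof):
    """Fit a degree-`dof` polynomial to (x, y) and return its AICc."""
    coeffs = np.polyfit(x, y, dof)
    rss = np.sum((np.polyval(coeffs, x) - y) ** 2)
    n, k = len(y), dof
    aic = n * np.log(rss / n) + 2 * k
    return aic + 2 * k * (k + 1) / (n - k - 1)
```

On data with a genuinely quadratic trend plus small perturbations, the AICc of the quadratic fit comes out below that of the linear fit, illustrating how the criterion balances residual error against the penalty on DoF.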

To reaffirm the predictive power of the proposed image complexity measures for segmentation performance, we trained U-Net (Section “Segmentation networks”) with the original images and separately with downsampled images (factors 2, 3, 4) from two relatively high-complexity datasets (DRIVE and CHASE-DB1) and two relatively low-complexity datasets (DRISHTI-OC and DRISHTI-OD). From the quantitative results (Table 5) we again observe that segmentation performance consistently decreases with increasing downsampling factor, and that the loss is more pronounced for the high-complexity datasets. For example, in this experiment the performance loss was 17% in J, with an increase of 41% in E, for a downsampling factor of 4 on the DRIVE dataset. Similarly, a decrease of 9% in J and an increase of 23% in E was seen on the CHASE-DB1 dataset for the same downsampling factor. By contrast, as expected, no noteworthy loss in segmentation performance was observed in either of the DRISHTI datasets, due to their low complexity. This is confirmed by visual inspection (Figs. 2 and 3). We also notice that with increasing downsampling, the number of false negatives increased more than the number of false positives in the DRIVE dataset. This was to be expected, as it is increasingly harder for the networks to capture the tiny vessels, which tend to get lost in the downsampling process. In the DRISHTI datasets, on the other hand, the loss due to downsampling is negligible. Further segmentation results for the DRIVE dataset (Fig. 4) and DRISHTI-OC dataset (Fig. 5) illustrate the performance of the four different networks. The percentages of foreground (FG) and background (BG) pixels (Table 5), which represent the class imbalance in the datasets, are not affected by image downsampling, as expected. Plotting the class imbalance of the datasets against the proposed complexity measures showed no direct relationship between these variables (Fig. 6).

Sample segmentation results with U-Net on the DRIVE dataset. Four examples are shown from top to bottom. From left to right: the input images, the ground truth manually annotated by an expert, and the results on 2×, 3×, and 4× downsampled input images. Correctly segmented foreground and background pixels are shown in, respectively, green and black. False positive and false negative pixels are shown in, respectively, red and blue.

Sample segmentation results with U-Net on the DRISHTI-OC dataset. Four examples are shown from top to bottom. From left to right: the input images, the ground truth manually annotated by an expert, and the results on 2×, 3×, and 4× downsampled input images. Correctly segmented foreground and background pixels are shown in, respectively, green and black. False positive and false negative pixels are shown in, respectively, red and blue (visible around the object edges only at very high magnification).

Sample segmentation results of the four different networks on the DRIVE dataset. Five examples are shown from top to bottom. From left to right: the input images, the ground truth manually annotated by an expert, and the results of DeepLabV3+, M2U-Net, U-Net, and U-Net Lite. Correctly segmented foreground and background pixels are shown in, respectively, green and black. False positive and false negative pixels are shown in, respectively, red and blue.

Sample segmentation results of the four different networks on the DRISHTI-OC dataset. Five examples are shown from top to bottom. From left to right: the input images, the ground truth manually annotated by an expert, and the results of DeepLabV3+, M2U-Net, U-Net, and U-Net Lite. Correctly segmented foreground and background pixels are shown in, respectively, green and black. False positive and false negative pixels are shown in, respectively, red and blue (visible around the object edges only at very high magnification).

Effect of class imbalance on complexity measures.

Framework for designing medical image segmentation networks. Macro-level network design choices are depicted in red. The ranges are indicative based on our experiments and are subject to the task at hand.

In this experiment, we investigated the suitability of image complexity as a guideline in choosing a deep large-size, deep lightweight, shallow large-size, or shallow lightweight network for segmentation. The assumption here was that training a deep network on moderate hardware would necessitate downsampling of the input images. To evaluate the impact of this, we used the DRIVE dataset, which has high image complexity, and a combination of datasets, ISIC-2016 (training set) and PH2 (test set), which have low complexity. Since we learned from the previous experiment (Table 5) that performance on the DRIVE dataset decreases as the amount of downsampling increases, in the second experiment we examined the impact of aggressive downsampling (factor 4) on the performance of the considered networks for both the high- and low-complexity sets.

The experimental results (Table 6) show that when image complexity is high, downsampling by 4 has a negative impact on the performance of all four networks. For example, for DeepLabV3+, the J for the downsampled data was about 18% lower than the original data, and E about 36% higher. We can see that on the high-complexity dataset DRIVE, the shallow large-size U-Net performed better than the other three networks. The shallow lightweight U-Net Lite, which has nearly 100 times fewer parameters than the U-Net, performed well too. Thus, we can conclude that shallow networks are best suited for high-complexity datasets in general. For high-resolution, high-complexity datasets, a shallow lightweight network is most practical, as it is computationally faster.

We also observe that when image complexity is low, each of the four networks performed comparably on the original and the downsampled images (Table 6). For example, for DeepLabV3+, the J for the downsampled data was only about 1% lower than the original data, and E only about 5% higher. Overall, this network performed better than the other three, and the deep lightweight M2U-Net performed better than the two shallow networks. The J for M2U-Net was only about 3% lower than for DeepLabV3+, and E around 15% higher, while the former network has 36 times fewer trainable parameters. Our results advocate the choice of deep networks for low-complexity datasets. Moreover, a deep lightweight alternative achieves competitive performance when dealing with high-resolution, low-complexity datasets, but at considerably lower computational cost.

Networks for medical image segmentation often have a large number of model parameters and require multi-GPU compute resources for training. Leaderboard methods in polyp, retinal vessel, and skin lesion segmentation benchmarks are a few representative examples^{45,61,62}. Image downsampling is common in applying these methods in order to offset the computational load during training^{20,61}. Lightweight approaches for medical and generic image segmentation targeted at embedded platforms either predetermine the architectural choices^{28} or iteratively search for topologies to minimize some objective^{13}. Common to all these approaches is dataset (task) independent network design. In this work, we recommend that the complexity of the dataset be an important factor in macro-level network design, specifically the depth of the network and the number of feature channels per layer.

Based on our experiments, we put forward a generic framework for designing neural networks for medical image segmentation (Fig. 7). The macro-level design choices include the number of layers in the network (deep versus shallow) and the representational power within each layer (large-size versus lightweight). Depending on the complexity and resolution of the dataset, one of the four macro-level design combinations can be adopted for network design. We note that image complexity guides the choice between deep and shallow networks, whereas the resolution is important in deciding between lightweight and large-size networks. Categorically, for high-complexity datasets, shallow architectures are a fitting choice, whereas deep networks are more appropriate for low-complexity datasets. We demonstrate the efficacy of the proposed framework by mapping ten benchmark medical datasets to network design choices based on their complexity and resolution. These mappings are supported by the quantitative and qualitative results of Experiment II (Section 4.6). Our complexity-based framework can be employed to guide network design for any new medical image segmentation benchmark or challenge.
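As a minimal illustration of this framework, the following sketch maps a dataset's complexity and resolution to one of the four macro-level design combinations. The threshold values are hypothetical placeholders, not values from our experiments; in practice they would be calibrated against benchmarks as in Experiment II.

```python
def recommend_macro_design(median_freq, resolution,
                           freq_threshold=0.15, res_threshold=1024):
    """Map dataset complexity and resolution to a macro-level design choice.

    median_freq: median frequency of the dataset (complexity measure).
    resolution: (height, width) of the images.
    The two thresholds are illustrative placeholders only.
    """
    high_complexity = median_freq >= freq_threshold
    high_resolution = max(resolution) >= res_threshold

    # Complexity decides deep vs shallow; resolution decides size.
    depth = "shallow" if high_complexity else "deep"
    size = "lightweight" if high_resolution else "large-size"
    return f"{depth} {size}"
```

For instance, a high-complexity, high-resolution dataset maps to a shallow lightweight network, while a low-complexity, moderate-resolution dataset maps to a deep large-size one, matching the recommendations above.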

Based on image complexity measures, we presented a framework to guide developers in making several critical macro-level neural network design choices for medical image segmentation. The proposed framework is independent of the segmentation task at hand and the image modalities used, because the design choices are based solely on the information contained in the dataset. Extensive experiments on ten different medical image segmentation benchmarks demonstrated the suitability of our framework. We conclude that the proposed image complexity measures help address the following critical issues in designing a neural network for medical image segmentation: (1) designing and training neural networks for high-resolution medical images using generally available moderate computing resources, (2) minimizing the effects on segmentation performance of downsampling the input images (usually done to aid training), and (3) deciding on the depth and size of the architecture (number of layers/parameters) for a given medical image segmentation task. We suggest that our framework complements NAS approaches and can be employed at the macro-level stage in conjunction with NAS for micro-level architectural optimization. In future work we aim to test this hypothesis and perform more extensive experiments on a wider range of neural network architectures for medical image segmentation as well as other applications.
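For reference, median frequency — the complexity measure our results favor — can be estimated from the radially ordered power spectrum of an image. The sketch below is one plausible implementation (the normalized radial frequency below which half of the total spectral power lies) and may differ from the exact formulation in the Methods in normalization details.

```python
import numpy as np

def median_frequency(image):
    """Normalized radial frequency below which half the spectral power lies.

    A plausible sketch of the median-frequency complexity measure;
    normalization details are an assumption.
    """
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    power = np.abs(spectrum) ** 2
    h, w = image.shape
    cy, cx = h // 2, w // 2
    y, x = np.indices((h, w))
    # Radial distance of each frequency bin from DC, normalized to [0, 1].
    r = np.hypot(y - cy, x - cx) / np.hypot(cy, cx)
    order = np.argsort(r.ravel())
    cumulative = np.cumsum(power.ravel()[order])
    idx = np.searchsorted(cumulative, cumulative[-1] / 2)
    return r.ravel()[order][idx]
```

A flat image concentrates all power at DC and yields a median frequency near zero (low complexity), whereas white noise spreads power across the spectrum and yields a much higher value.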

The datasets analyzed in this study are accessible via the URLs listed in Table 7.

Hesamian, M. H., Jia, W., He, X. & Kennedy, P. Deep learning techniques for medical image segmentation: Achievements and challenges. *J. Digit. Imag.* **32**, 582–596 (2019).

Tajbakhsh, N. *et al.* Embracing imperfect datasets: A review of deep learning solutions for medical image segmentation. *Med. Image Anal.* **63**, 101693 (2020).

Liu, X., Song, L., Liu, S. & Zhang, Y. A review of deep-learning-based medical image segmentation methods. *Sustainability* **13**, 1224 (2021).

Fu, Y. *et al.* A review of deep learning based methods for medical image multi-organ segmentation. *Phys. Med.* **85**, 107–122 (2021).

Ronneberger, O., Fischer, P. & Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In *International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI)*, 234–241 (2015).

Du, G., Cao, X., Liang, J., Chen, X. & Zhan, Y. Medical image segmentation based on U-Net: A review. *J. Imag. Sci. Technol.* **64**, 20508 (2020).

Siddique, N., Paheding, S., Elkin, C. P. & Devabhaktuni, V. U-Net and its variants for medical image segmentation: A review of theory and applications. *IEEE Access* **9**, 82031–82057 (2021).

Isensee, F., Jaeger, P. F., Kohl, S. A. A., Petersen, J. & Maier-Hein, K. H. nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation. *Nat. Meth.* **18**, 203–211 (2021).

García, J. D., Crosa, P. B., Álvaro, I. & Alcocer, P. Downsampling methods for medical datasets. In *IADIS International Conference on Computer Graphics, Visualization, Computer Vision and Image Processing*, 12–20 (2017).

Arsalan, M., Owais, M., Mahmood, T., Choi, J. & Park, K. R. Artificial intelligence-based diagnosis of cardiac and related diseases. *J. Clin. Med.* **9**, 871 (2020).

Arsalan, M., Owais, M., Mahmood, T., Cho, S. W. & Park, K. R. Aiding the diagnosis of diabetic and hypertensive retinopathy using artificial intelligence-based semantic segmentation. *J. Clin. Med.* **8**, 1446 (2019).

Khan, T. M., Abdullah, F., Naqvi, S. S., Arsalan, M. & Khan, M. A. Shallow vessel segmentation network for automatic retinal vessel segmentation. In *International Joint Conference on Neural Networks (IJCNN)*, 1–7 (2020).

Howard, A. *et al.* Searching for MobileNetV3. In *IEEE/CVF International Conference on Computer Vision (ICCV)*, 1314–1324 (2019).

Ma, N., Zhang, X., Zheng, H.-T. & Sun, J. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In *European Conference on Computer Vision (ECCV)*, 116–131 (2018).

Zhu, Z., Liu, C., Yang, D., Yuille, A. & Xu, D. V-NAS: Neural architecture search for volumetric medical image segmentation. In *International Conference on 3D Vision*, 240–248 (2019).

Kim, S. *et al.* Scalable neural architecture search for 3D medical image segmentation. In *International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI)*, 220–228 (2019).

Weng, Y., Zhou, T., Li, Y. & Qiu, X. NAS-Unet: Neural architecture search for medical image segmentation. *IEEE Access* **7**, 44247–44257 (2019).

Yang, D. *et al.* Searching learning strategy with reinforcement learning for 3D medical image segmentation. In *International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI)*, 3–11 (2020).

Yu, Q. *et al.* C2FNAS: Coarse-to-fine neural architecture search for 3D medical image segmentation. In *IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*, 4125–4134 (2020).

He, Y., Yang, D., Roth, H., Zhao, C. & Xu, D. DiNTS: Differentiable neural network topology search for 3D medical image segmentation. In *IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*, 5841–5850 (2021).

Larkin, K. G. Reflections on Shannon information: In search of a natural information-entropy for images. arXiv:1609.01117 (2016).

Thongpanja, S., Phinyomark, A., Limsakul, C. & Phukpattaranont, P. Application of mean and median frequency methods for identification of human joint angles using EMG signal. In *Information Science and Applications*, 689–696 (2015).

Attneave, F. & Arnoult, M. D. The quantitative study of shape and pattern perception. *Psychol. Bull.* **53**, 452–471 (1956).

Rahane, A. & Subramanian, A. Measures of complexity for large scale image datasets. In *Artificial Intelligence in Information and Communication*, 282–287 (2020).

Tan, M. & Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In *International Conference on Machine Learning (ICML)*, 6105–6114 (2019).

Buciluǎ, C., Caruana, R. & Niculescu-Mizil, A. Model compression. In *ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, 535–541 (2006).

Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F. & Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In *European Conference on Computer Vision (ECCV)*, 833–851 (2018).

Laibacher, T., Weyde, T. & Jalali, S. M2U-Net: Effective and efficient retinal vessel segmentation for real-world applications. In *IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)*, 115–124 (2019).

Basu, M. & Ho, T. K. *Data Complexity in Pattern Recognition* (Springer-Verlag, 2006).

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A. & Chen, L. MobileNetV2: Inverted residuals and linear bottlenecks. In *IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*, 4510–4520 (2018).

Pascanu, R., Mikolov, T. & Bengio, Y. On the difficulty of training recurrent neural networks. In *International Conference on Machine Learning (ICML)*, 1310–1318 (2013).

Badrinarayanan, V., Kendall, A. & Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. *IEEE Trans. Pattern Anal. Mach. Intell.* **39**, 2481–2495 (2017).

Hoover, A. D., Kouznetsova, V. & Goldbaum, M. Locating blood vessels in retinal images by piecewise threshold probing of a matched filter response. *IEEE Trans. Med. Imag.* **19**, 203–210 (2000).

Khan, T. M., Robles-Kelly, A. & Naqvi, S. S. A semantically flexible feature fusion network for retinal vessel segmentation. In *International Conference on Neural Information Processing (ICONIP)*, 159–167 (2020).

Khan, T. M., Robles-Kelly, A., Naqvi, S. S. & Muhammad, A. Residual multiscale full convolutional network (RM-FCN) for high resolution semantic segmentation of retinal vasculature. In *Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR)*, 324 (2021).

Staal, J., Abramoff, M. D., Niemeijer, M., Viergever, M. A. & van Ginneken, B. Ridge-based vessel segmentation in color images of the retina. *IEEE Trans. Med. Imag.* **23**, 501–509 (2004).

Fraz, M. M. *et al.* An ensemble classification-based approach applied to retinal blood vessel segmentation. *IEEE Trans. Biomed. Eng.* **59**, 2538–2548 (2012).

Khan, T. M., Robles-Kelly, A. & Naqvi, S. S. RC-Net: A convolutional neural network for retinal vessel segmentation. In *Digital Image Computing: Techniques and Applications (DICTA)*, 1–7 (2021).

Khan, T. M., Robles-Kelly, A. & Naqvi, S. S. T-Net: A resource-constrained tiny convolutional neural network for medical image segmentation. In *IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)*, 644–653 (2022).

Arsalan, M., Khan, T. M., Naqvi, S. S., Nawaz, M. & Razzak, I. Prompt deep light-weight vessel segmentation network (PLVS-Net). *IEEE/ACM Trans. Comput. Biol. Bioinform.* (2022) (**in press**).

Jaeger, S. *et al.* Two public chest X-ray datasets for computer-aided screening of pulmonary diseases. *Quant. Imag. Med. Surg.* **4**, 475–477 (2014).

Owais, M., Arsalan, M., Mahmood, T., Kim, Y. H. & Park, K. R. Comprehensive computer-aided decision support framework to diagnose tuberculosis from chest X-ray images: Data mining study. *JMIR Med. Inform.* **8**, e21790 (2020).

Mendonça, T., Ferreira, P., Marques, J., Marçal, A. & Rozeira, J. PH2: A dermoscopic image database for research and benchmarking. In *Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)*, 5437–5440 (2013).

Bi, L. *et al.* Dermoscopic image segmentation via multistage fully convolutional networks. *IEEE Trans. Biomed. Eng.* **64**, 2065–2074 (2017).

Lee, H. J., Kim, J. U., Lee, S., Kim, H. G. & Ro, Y. M. Structure boundary preserving segmentation for medical image with ambiguous boundary. In *IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*, 4816–4825 (2020).

Anandarup, R., Anabik, P. & Utpal, G. JCLMM: A finite mixture model for clustering of circular-linear data and its application to psoriatic plaque segmentation. *Patt. Recognit.* **66**, 160–173 (2017).

Bozorgtabar, B., Abedini, M. & Garnavi, R. Sparse coding based skin lesion segmentation using dynamic rule-based refinement. In *Machine Learning in Medical Imaging (MLMI)*, 254–261 (2016).

Codella, N. C. F. *et al.* Skin lesion analysis toward melanoma detection. In *IEEE International Symposium on Biomedical Imaging (ISBI)*, 168–172 (2018).

Sivaswamy, J. *et al.* A comprehensive retinal image dataset for the assessment of glaucoma from the optic nerve head analysis. *JSM Biomed. Imag. Data Papers* **2**, 1004 (2015).

Litjens, G. *et al.* Evaluation of prostate segmentation algorithms for MRI: The PROMISE12 challenge. *Med. Image Anal.* **18**, 359–373 (2014).

Milletari, F., Navab, N. & Ahmadi, S. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. arXiv:1606.04797 (2016).

Amgad, M. *et al.* Structured crowdsourcing enables convolutional segmentation of histology images. *Bioinformatics* **35**, 3461–3467 (2019).

Ortega-Ruiz, M. A., Roman-Rangel, E. & Reyes-Aldasoro, C. C. Multiclass semantic segmentation of immunostained breast cancer tissue with a deep-learning approach. medRxiv:2022.08.17.22278889 (2022).

Taha, A. A. & Hanbury, A. Metrics for evaluating 3D medical image segmentation: Analysis, selection, and tool. *BMC Med. Imag.* **15**, 29 (2015).

Yeghiazaryan, V. & Voiculescu, I. Family of boundary overlap metrics for the evaluation of medical image segmentation. *J. Med. Imag.* **5**, 015006 (2018).

Kassambara, A. *Machine Learning Essentials: Practical Guide in R* (STHDA, 2017).

Glantz, S. & Slinker, B. *Primer of Applied Regression & Analysis of Variance* (McGraw-Hill, New York, 2001).

Miles, J. R-squared, adjusted R-squared. In *Encyclopedia of Statistics in Behavioral Science* (Wiley Online Library, 2005).

Schumacher, D. General filtered image rescaling. In *Graphics Gems III*, 8–16 (Morgan Kaufmann, San Francisco, 1992).

Lipton, Z. C., Elkan, C. & Naryanaswamy, B. Optimal thresholding of classifiers to maximize F1 measure. In *Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD)*, 225–239 (2014).

Srivastava, A. *et al.* MSRF-Net: A multi-scale residual fusion network for biomedical image segmentation. *IEEE J. Biomed. Health Inform.* **26**, 2252 (2022).

Kamran, S. A. *et al.* RV-GAN: Segmenting retinal vascular structure in fundus photographs using a novel multi-scale generative adversarial network. In *International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI)*, 34–44 (2021).

School of Computer Science and Engineering, University of New South Wales, Sydney, NSW, Australia

Tariq M. Khan & Erik Meijering

Department of Electrical and Computer Engineering, COMSATS University, Islamabad, Pakistan

Syed S. Naqvi

T.M.K. and S.S.N. conducted experiments. T.M.K. and S.S.N. prepared figures. E.M. supervised this project. T.M.K., S.S.N., E.M. wrote the main manuscript text. All authors reviewed the manuscript.

Correspondence to Tariq M. Khan.

The authors declare no competing interests.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Open Access** This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Khan, T.M., Naqvi, S.S. & Meijering, E. Leveraging image complexity in macro-level neural network design for medical image segmentation. *Sci Rep* **12**, 22286 (2022). https://doi.org/10.1038/s41598-022-26482-7
