1 Introduction

Smoke detection plays a crucial role in fire alert systems, as smoke often precedes flames. While indoor smoke sensors effectively detect fire heat or particles, they are unsuitable for outdoor use because of the delay before smoke reaches the sensor. Additionally, such discrete sensors are inefficient for wide-area monitoring.

Video-based smoke detection systems offer a promising solution by efficiently capturing information from wider areas [1]. These systems detect smoke via visual features such as its color, movement, texture, and shape. However, challenges remain in visually recognizing smoke, including its limited visual characteristics compared to flames [2], fluctuations in smoke density, and variable smoke shapes [3], which make it harder to separate smoke from backgrounds of similar color. New visual techniques are needed to overcome these challenges.

Forest fires pose a significant threat to the environment, economy, and health [4]. Video cameras sensitive to smoke in the visual spectrum are commonly used to monitor forest fires. Various visual techniques have been developed for forest fire detection [5,6,7].

In contrast, urban fire camera systems have received less attention [8, 9]. One major difference is the presence of numerous moving objects in urban environments, such as vehicles and wind turbines, in addition to trees and clouds. Therefore, advanced methods are required to eliminate these objects for effective smoke detection in urban areas.

This paper focuses on developing a daytime smoke detection technique for fire disaster mitigation in urban cities. The paper is organized as follows. Section 2 introduces related studies and the background and objectives of this study. Section 3 presents our proposed smoke detection method. Section 4 evaluates the method and defines evaluation criteria for comparison with previous studies. We conclude in Sect. 5.

2 Background and Objectives of this Study

2.1 Related Studies

Various methods for capturing the characteristics of smoke have been studied. Piccinini et al. [10] detected smoke by analyzing energy changes in the wavelet domain together with a smoke color model, employing a Bayesian approach to classify the obtained features. Tian et al. [11] proposed a blended image model that combines background and smoke components to accurately identify smoke; by solving for the opacity of smoke, they achieved improved detection results. Labati et al. [12] presented algorithms specifically designed for smoke detection under various wildfire environmental conditions, using computational intelligence techniques to adaptively detect smoke in frame sequences.

Terada et al. [13] introduced a method for detecting fire smoke using optical flow. Their approach was robust against different image acquisition environments and focused on early detection of fire incidents. By first detecting the region of the flame in the images, they then extracted characteristic quantities that specifically represented smoke, ensuring accurate smoke detection. Similarly, Chunyu et al. [14] proposed a video smoke detection method that incorporated both color and motion features. They utilized optical flow to approximate the motion field and estimated candidate smoke regions using background estimation and a color-based decision rule. The optical flow results were further processed to calculate motion features, allowing for differentiation between smoke and other moving objects.

When it comes to fire detection using machine learning, the focus is primarily on flames, with limited examples of application to smoke alone [15]. One challenge of machine-learning-based smoke detection is its significant computational cost: Luo et al. [16] employed an NVIDIA Tesla K40C GPU and achieved processing speeds of 6-7 frames per second at a frame size of \(320 \times 240\) pixels.

Although the aforementioned studies made significant contributions to smoke detection, a common limitation of most of them is that they target large smoke areas, which may not be practical for real-world outdoor camera systems. In real field situations, where cameras monitor rural or urban areas for smoke, the spatial extent of the smoke to be detected is relatively small, as seen in Figure 1. Additionally, it is important to exclude similar objects that may be mistaken for smoke. The motivation of this study is therefore to develop a technique that overcomes these practical issues and contributes to effective disaster mitigation in real-world scenarios.

2.2 Background

The authors have been developing Visual IoT systems, as introduced in the Appendix. Using a video transmission protocol specialized for wireless networks [17], modern IP network cameras with PTZ (pan-tilt-zoom) functions controlled by a remote operation protocol [18], single-board computers on the edge side, and supercomputers on the cloud side, these systems have the potential for highly functional smoke detection. While this paper focuses on the detection algorithm, the ultimate goal is to implement cutting-edge smoke image processing techniques within Visual IoT systems. After smoke detection, the smoke area is locked on and enlarged; by capturing zoomed smoke images, more accurate fire information can be obtained. This PTZ function plays an important role in the discussion of the F-score in Sect. 4.1.

2.3 Objectives of this Study

Figure 1

Examples of original frame images obtained by outdoor IP network cameras: (a) an urban camera at Kitakyushu city at 10:00 AM JST on August 2, 2021, and (b) a rural camera at Chikuma city [19] at 7:10 AM JST on February 16, 2022. The white frame area in (a) corresponds to Figure 3

This study aims to develop a practical real-time daytime smoke detection application integrated with Visual IoT technologies. The geographic location of a smoke outbreak should be quickly detected by live cameras deployed in the field.

To achieve this, the authors propose a smoke detection algorithm that utilizes optical flow. Outdoor cameras with network connectivity continuously capture and provide high-resolution, high-frame-rate footage. High frame rates are important for tracking motion with optical flow, since the properties of smoke (shape, color, and motion) change within a second.

Using multiple frames in optical flow processing makes it possible to emphasize smoke-specific motion. This study employs footage with a resolution of \(1920 \times 1080\) pixels and a frame rate of 25 fps, recorded every 10 min for a duration of 5 s. These parameters align with typical, practical settings for modern IP network cameras on mobile networks. Smoke examples obtained from IP network cameras in Kitakyushu city and Chikuma city, Japan, are shown in Figure 1. We use OpenCV 4.6.0 for computer vision processing, including optical flow and contour drawing.
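As a concrete starting point, the following is a minimal sketch of loading one such 5-s recording with OpenCV and converting it to the grayscale frame sequence used in Sect. 3; the file name is a hypothetical placeholder, not an actual dataset name.

```python
import cv2

# A minimal sketch of loading one 5-s, 25-fps recording; the file name is
# a hypothetical placeholder for footage delivered by an IP network camera.
cap = cv2.VideoCapture("kitakyushu_20210802_1000.mp4")
frames = []
while True:
    ret, frame = cap.read()
    if not ret:
        break
    # Optical flow operates on single-channel images.
    frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
cap.release()
print(f"{len(frames)} frames loaded")  # about 121 frames per footage (Sect. 3.3)
```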

3 Smoke Area Detection Method

Figure 2

A flow chart of the urban smoke detection method proposed in this study; the figures corresponding to each process are indicated. Contour colors in the figures change after every threshold discrimination

Figure 3

(a) An extracted frame image, as indicated by the white frame in Figure 1. (b) The same image with annotations selected by hand. The annotation frames (bounding boxes) in (b) are identical to those in Figures 4, 5, 6, 7, and 8

3.1 Flow Chart to Detect Urban Smokes

This section presents our proposed smoke area detection method, illustrated in Figure 2. We begin by introducing a general optical flow technique in Sect. 3.2. Then, we propose our original smoke detection algorithm in Sects. 3.3, 3.4, 3.5, and 3.6.

Figure 3a is a sample image of smoke in an urban city taken at 10:00 AM JST on August 2, 2021, which will be discussed in this section. It is an \(800 \times 800\) pixel region extracted from Figure 1a, selected because multiple high-optical-flow objects are concentrated there. Figure 3b displays the manually identified positions of moving objects, including smoke from a factory on the right-hand side. Other moving objects, such as wind turbines (in the middle) and cars on the roads (on the left-hand side), are also identified.

3.2 Optical Flow

Optical flow is a method used to detect motion between image frames by analyzing the changing status of each pixel. It provides information about the direction and speed of motion, making it suitable for analyzing motion characteristics.

There are two main assumptions in optical flow calculations. First, the brightness of a moving point is assumed to remain constant over a short period. Second, the pixels around a point are assumed to move in a similar manner. The Lucas-Kanade method [20] and the Farneback method [21] are typical optical flow methods. In this study, we utilize the Farneback method for its higher accuracy, despite its increased computational cost.

The Farneback method approximates the brightness value of each pixel using a quadratic polynomial and estimates the amount of movement by comparing coefficients between frames. When the brightness is \(f_{t} ({\mathbf{x}}) \in [0,1]\) at coordinates \(\mathbf{x}\) at time t, the brightness of neighboring points is expressed by the quadratic polynomial

$$\begin{aligned} f_t(\mathbf{x})=\mathbf{x}^T \mathbf{A}_t \mathbf{x} + \mathbf{b}_t^T \mathbf{x} + c_t \end{aligned}$$
(1)

where \(\mathbf{A}_t\) is a symmetric matrix, \(\mathbf{b}_t\) a column vector, and \(c_t\) a scalar. The coefficients are obtained by a weighted least-squares fit over the neighborhood region. If the displacement of point \(\mathbf{x}\) from time t to \(t+1\) is \(\mathbf{d}_t\), then from \(f_t (\mathbf{x})= f_{t+1} (\mathbf{x}+\mathbf{d}_t)\) it follows that

$$\begin{aligned} \mathbf{d}_t= -\frac{1}{2} {\mathbf{A}_t}^{-1} (\mathbf{b}_{t+1}-\mathbf{b}_t) \end{aligned}$$
(2)

Ideally, \(\mathbf{A}_t = \mathbf{A}_{t+1}\), but in reality, the following approximation is used.

$$\begin{aligned} {\hat{\mathbf{A}}}_t = \frac{{\mathbf{A}}_t + {\mathbf{A}}_{t+1}}{2} \end{aligned}$$
(3)

As a result, the following constraints are obtained.

$$\begin{aligned} {\hat{\mathbf{A}}}_t {\mathbf{d}}_t = \Delta {\mathbf{b}}_t \end{aligned}$$
(4)

Here,

$$\begin{aligned} \Delta \mathbf{b}_t =-\frac{1}{2}(\mathbf{b}_{t+1} -\mathbf{b}_t). \end{aligned}$$
(5)

This equation gives a solution point by point, even though it is noisy. Therefore, assuming that the displacement changes gradually, the information in the neighborhood of each pixel is integrated. The energy in the neighborhood of point \(\mathbf{x}\) is expressed as follows.

$$\begin{aligned} \sum _{\Delta \mathbf{x}\in I}^{} w(\Delta \mathbf{x})\Vert \mathbf{A}(\mathbf{x}+\Delta \mathbf{x})\mathbf{d}(\mathbf{x}) - \Delta \mathbf{b}(\mathbf{x}+\Delta \mathbf{x})\Vert ^2 \end{aligned}$$
(6)

By differentiating this expression with respect to the motion \(\mathbf{d}_t(\mathbf{x})\), the \(\mathbf{d}_t (\mathbf{x})\) that minimizes the energy is determined. In the Farneback method, the gradient can be obtained stably by approximating the local region of the image with a quadric surface. We implement this algorithm using the OpenCV calcOpticalFlowFarneback function.
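As a minimal sketch, the dense flow field between two consecutive grayscale frames (from the loading sketch in Sect. 2.3) can be computed as below; the Farneback parameter values are illustrative assumptions, not settings reported in this study.

```python
import cv2

# Dense Farneback flow between two consecutive grayscale frames.
# The parameter values here are illustrative assumptions.
flow = cv2.calcOpticalFlowFarneback(
    frames[0], frames[1], None,
    pyr_scale=0.5,   # scale between image pyramid levels
    levels=3,        # number of pyramid levels
    winsize=15,      # averaging window for the neighborhood integration (Eq. 6)
    iterations=3,    # iterations at each pyramid level
    poly_n=5,        # neighborhood size for the polynomial expansion (Eq. 1)
    poly_sigma=1.2,  # Gaussian sigma for the polynomial expansion
    flags=0)
# flow[..., 0] and flow[..., 1] hold the per-pixel displacements (X_t, Y_t).
```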

3.3 Optical Flow Summation

Figure 4

Optical flow magnitude summed over (a) two frames and (b) 31 frames. Annotation frames are identical to those in Figure 3b

In this section, we emphasize smoke motion by summing optical flow values over the time series of frames in a footage. Since the source location of smoke remains stationary, the optical flow vectors of pixels in the smoke area exhibit minimal change within one footage. The optical flow at a pixel (i, j) between times t and \(t + 1\) is expressed as a Cartesian vector \((X_t(i, j), Y_t (i, j))\).

By summing the vector values, we obtain the optical flow across multiple frames as follows:

$$\begin{aligned} X_{all}(i, j) = \sum _{t=1}^{T} X_t (i, j), Y_{all}(i, j) = \sum _{t=1}^{T} Y_t (i, j) \end{aligned}$$
(7)

We define \(R_{all} (i, j)\) and \(\Theta _{all} (i, j)\) as the polar coordinate transformation of \((X_{all} (i, j), Y_{all} (i, j))\), as below.

$$\begin{aligned} R_{all}(i, j)&= \sqrt{X_{all} (i, j)^2 + Y_{all}(i, j)^2} \end{aligned}$$
(8)
$$\begin{aligned} \Theta _{all}(i, j)&= \tan ^{-1} \frac{Y_{all}(i, j)}{X_{all} (i, j)} \end{aligned}$$
(9)

\(R_{all} (i, j)\) and \(\Theta _{all} (i, j)\) are the magnitude and angle of the summed optical flow at pixel (i, j), respectively. This summation reduces noise and emphasizes stable optical flow movement.

Figure 4a shows the optical flow magnitude of every pixel computed from two consecutive frames, while Figure 4b shows the summed optical flow magnitude over 31 frames. The color intensity in both panels of Figure 4 indicates the amount of motion at each pixel. Note that the motion of the smoke, turbines, and cars is emphasized by stacking the flows, and that the high-magnitude car areas expand over time because the car positions change. The summation of optical flow is limited to 31 frames, since the direction of smoke may change over time and degrade the summed value. To reduce processing cost, optical flow is computed only every fourth frame, skipping the three frames in between; we thus obtain 31 optical flow frames out of the 121 frames (5 s) in the footage.
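A sketch of this summation, assuming the grayscale frame list from Sect. 2.3 and the every-fourth-frame sampling described above, could read:

```python
import cv2
import numpy as np

def summed_flow(frames, step=4):
    """Sum optical flow over one footage, computing flow every `step` frames
    (31 flow fields from 121 frames, as described in Sect. 3.3)."""
    h, w = frames[0].shape
    x_all = np.zeros((h, w), np.float32)
    y_all = np.zeros((h, w), np.float32)
    for t in range(0, len(frames) - step, step):
        flow = cv2.calcOpticalFlowFarneback(frames[t], frames[t + step], None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        x_all += flow[..., 0]                             # Eq. (7)
        y_all += flow[..., 1]
    r_all, theta_all = cv2.cartToPolar(x_all, y_all)      # Eqs. (8) and (9)
    return r_all, theta_all
```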

Figure 5

A grid block example of (a) optical flow magnitude \(R_{all} (i, j)\), (b) binary values B(i, j) and (c) grids after small area removal

At this point, the magnitude \(R_{all} (i, j)\) is an \(800 \times 800\) array of independent pixel values. To recognize them as continuous areas, pixels above a certain magnitude threshold are extracted, and adjoining pixels are regarded as one area. We measure the smoke areas in the following manner. A threshold is set on the magnitude of the optical flow, and the binary value B(i, j), obtained by classifying \(R_{all} (i, j)\) according to this threshold, is defined by the following equation.

$$\begin{aligned} B (i, j)= {\left\{ \begin{array}{ll} \; 0,\quad R_{all} (i, j) < T_r \\ \; 1,\quad R_{all} (i, j) \ge T_r \end{array}\right. } \end{aligned}$$
(10)

Here, \(T_r\) is a threshold value to be set arbitrarily. Neighboring grids with \(B (i, j)=1\) are then connected to form blocks of high magnitude. This process is schematically depicted in Figure 5: the optical flow magnitude \(R_{all}(i,j)\) in Figure 5a is binarized into Figure 5b, where the contours of high-magnitude areas are drawn as orange lines. This process uses the OpenCV findContours function. On the assumption that a smoke area has a certain minimum size, small areas are excluded from the candidates; if an area is too small, the optical flow method is unreliable there anyway. This step is depicted in Figure 5c with green contours. The area selection procedure is similar to the erode-dilate operation used for image denoising [22].

Figure 6a shows the case where a threshold of \(T_r=10\) is empirically set for the optical flow of Figure 4b. The areas above the threshold magnitude are enclosed by green contours. Let each area be \(C_l\) (\(l \in [1,L]\)) and \(M_l\) be the number of pixels in area \(C_l\); in the schematic example of Figure 5, \(L = 3\). We set a threshold \(T_m\) and exclude areas with \(M_l \le T_m\); Figure 5c shows the result for \(T_m = 1\). Each remaining area is defined as \(C'_l\) (\(l \in [1,L']\)), with \(M'_l\) pixels in \(C'_l\). For the Kitakyushu city footage, \(T_m = 200\) is set empirically. Figure 6b shows the result after removing the small areas: 23 candidate smoke areas are extracted in green, which also include turbines and cars. In the case of cars, the entire trajectory is recognized as one high-motion region by the optical flow summation.
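A sketch of this candidate extraction, using the thresholds above and the summed magnitude \(R_{all}\) from Sect. 3.3 (here `cv2.contourArea` stands in for the pixel count \(M_l\)):

```python
import cv2
import numpy as np

T_R = 10    # optical flow magnitude threshold (empirical, Figure 6a)
T_M = 200   # minimum candidate size in pixels (Kitakyushu footage)

binary = (r_all >= T_R).astype(np.uint8)      # B(i, j) of Eq. (10)
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_SIMPLE)
# Keep only areas C'_l larger than T_M as smoke candidates.
candidates = [c for c in contours if cv2.contourArea(c) > T_M]
```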

Since high-magnitude areas occasionally include objects other than smoke, we must extract only the areas with smoke characteristics. We use optical flow variance and HSV (hue-saturation-value) color characteristics to extract smoke from the candidates, as described in Sects. 3.4 and 3.5, respectively.

Figure 6

Contours surrounding (a) all areas and (b) large-size areas (candidate areas) of high-magnitude optical flow. Annotation frames are identical to those in Figure 3b

Figure 7

Characteristics of optical flow. (a) Magnitude and (b) variance in grey scale drawn with green contour lines (candidate areas defined in Figure 6b). (c) Detected areas after thresholding of (a) and (b). Annotation frames in (a) and (b) are identical to those in Figure 3b

Figure 8

Characteristics of HSV color. (a) Saturation and (b) values in grey scale drawn with green contour lines (candidate areas defined in Figure 6b). (c) Detected areas after thresholding of (a) and (b). Annotation frames in (a) and (b) are identical to those in Figure 3b

3.4 Extraction of Smoke Area by Variance of Optical Flow

To detect moving objects, it is reasonable to find areas with large optical flow magnitudes. Figure 7a shows the optical flow magnitude drawn in grey scale, identical to \(R_{all} (i, j)\) in Sect. 3.3. This alone is not enough to distinguish smoke from other moving objects.

One characteristic of smoke motion is that it generally moves upward from the source location. When wind blows, the whole smoke plume moves accordingly in one direction over a short period. We therefore assume in this study that the direction of motion is mostly the same within a smoke region; in other words, the variance of optical flow vectors in smoke regions is small. Taking advantage of the high-frame-rate footage available in the system described in Sect. 2, we utilize the variance of optical flow vectors across multiple frames, which reflects changes in the extracted areas.

In this method, variances within areas are calculated not only in space but also in time. The variance \({s_l}^2\) of area \(C'_l\) is given below.

$$\begin{aligned} {s_l}^2 = \frac{1}{T \cdot M'_l}\sum _{t=1}^{T}\sum _{m=1}^{M'_l} ((X_t (C'_l(m)) - \bar{X})^2 +(Y_t (C'_l(m)) - \bar{Y} )^2 ) \end{aligned}$$
(11)

Here, \(\bar{X}\) and \(\bar{Y}\) are the mean flow components over the area and the frames. To visualize this, Figure 7b shows the variance derived from the changes over time: the variance of the optical flow vectors at each pixel over the 31 frames is represented in grey scale. The per-pixel variance s(i, j) is derived as follows.

$$\begin{aligned} s(i, j)^2 = \frac{1}{T}\sum _{t=1}^{T} ((X_t (i, j) - \bar{X} (i, j))^2 +(Y_t (i, j) - \bar{Y} (i, j))^2 ) \end{aligned}$$
(12)

In Figure 7b, it can be observed that the variances of the turbines and cars are larger than those of the smoke areas. For cars, the high-magnitude areas correspond to the entire tracks of the cars: the motion within such an area is temporally large only while a moving vehicle is present and small at other times. In contrast, the positions of the turbines are stable in time, but the directions of movement vary within the region due to blade rotation, which makes the variance larger. For these reasons, smoke and other high-motion objects can be distinguished by combining the magnitude and variance of optical flow.

Figure 7a and b represent the average magnitude and the variance along the temporal dimension, respectively. Figure 7c shows the detected areas after thresholding both (a) and (b) in the following manner: for each candidate area, the average magnitude across the temporal and spatial dimensions and the variance over all pixels in the area are calculated. If the pair of magnitude and variance meets the threshold criteria, the area is selected as a detected area, as described by the blue contours. Appropriate thresholds for magnitude and variance must be determined from real smoke footage, as discussed in Sect. 4.2.
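The per-pixel temporal variance and the area-level selection can be sketched as follows, assuming `flows` stacks the 31 flow fields into an array of shape (T, H, W, 2); the magnitude-to-variance criterion anticipates the threshold derived later in Sect. 4.2.

```python
import numpy as np

def temporal_flow_variance(flows):
    """Per-pixel variance of the flow vectors over time, as in Eq. (12).
    `flows` is assumed to have shape (T, H, W, 2), stacking (X_t, Y_t)."""
    mean = flows.mean(axis=0, keepdims=True)               # per-pixel (Xbar, Ybar)
    return ((flows - mean) ** 2).sum(axis=-1).mean(axis=0)

def passes_flow_test(mean_magnitude, variance, ratio=9.0):
    """Area-level criterion: smoke keeps a high magnitude-to-variance ratio,
    unlike turbines and cars (threshold mag/var = 9, Sect. 4.2)."""
    return mean_magnitude / variance >= ratio
```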

3.5 Extraction of Smoke Area by Characteristics of HSV Colors

In parallel with the variance of optical flow, we use color discrimination to exclude non-smoke areas, as shown in Figure 2. In general, foreground image frames in footage are denoted by RGB intensities, and a set of rules is applied to each color for discrimination. However, despite these rules, images often suffer from nonlinear visual perception and illumination dependency [14]. Appana et al. performed color segmentation by identifying the pixels that match the color of smoke in a frame [23]. They utilized HSV color analysis, transforming the RGB color space and thresholding the saturation and value (brightness) components of the HSV (hue-saturation-value) color space.

Figure 8 displays a frame image with the saturation (S) and value (V) components shown in grey scale. The color scale consists of 256 gradations, with higher numbers corresponding to lighter colors. The candidate areas derived in Figure 6b are highlighted in green. Our objective in this section is to narrow down the smoke candidate areas using both the S and V parameters. As indicated in Figure 2, we apply a thresholding method, with thresholds derived in Sect. 4.2, to each smoke candidate area. Equation (13) is used for thresholding the smoke areas in this study,

$$\begin{aligned} F_{color} (l)= {\left\{ \begin{array}{ll} \; 1,&\quad \text{if} \; (\bar{S},\bar{V})_l \in Th(s,v) \\ \; 0, &\quad \text{otherwise} \end{array}\right. } \end{aligned}$$
(13)

where Th(s, v) is the bounded threshold region in the S-V plane and \((\bar{S},\bar{V})_l\) are the S and V values averaged over the pixels of candidate area \(C'_l\). The blue contour areas in Figure 8c represent the detected areas after thresholding.

The bounded region should be defined for each target city. In [23], the thresholds on saturation and value are given independently as constants. We propose a thresholding model that couples saturation and value for more finely tuned discrimination, derived from many smoke images of the target city. The definition of the bounded region in the case of Kitakyushu city is discussed in Sect. 4.2.
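A sketch of this color test for one candidate area, using the Kitakyushu bounded region derived in Sect. 4.2 (`area_mask` is an assumed boolean mask of the pixels in \(C'_l\)):

```python
import cv2

def passes_color_test(frame_bgr, area_mask):
    """Eq. (13) sketch: check whether the mean (S, V) of a candidate area
    falls inside the bounded region derived for Kitakyushu in Sect. 4.2."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    s_mean = hsv[..., 1][area_mask].mean()
    v_mean = hsv[..., 2][area_mask].mean()
    # Region: V > (256-64)/160 * S + 64  and  V < -256/160 * S + 256
    return ((256 - 64) / 160 * s_mean + 64 < v_mean
            < -256 / 160 * s_mean + 256)
```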

3.6 Combined Smoke Detection

The final discrimination of smoke areas is performed by combining the detected areas derived from optical flow (Sect. 3.4) and HSV color (Sect. 3.5). Figure 9a and b correspond to the areas detected by optical flow (Figure 7c) and by HSV color (Figure 8c), respectively. The smoke candidate areas are highlighted in blue.

The smoke area is identified by overlaying the detected areas and extracting the overlapping ones. In Figure 9c, the red and blue contours correspond to the overlapped and the combined areas of the detections in Figure 9a and b, respectively. The final result, shown in Figure 9d, indicates that the red box corresponds to the smoke area manually selected in Figure 3b, demonstrating the successful extraction of the smoke area from the multiple candidate areas in Figure 6b.
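Assuming binary uint8 masks `flow_mask` and `color_mask` built from the areas passing Sects. 3.4 and 3.5, the overlap extraction reduces to a single mask intersection:

```python
import cv2

# Overlap of the two detections (the red areas in Figure 9c);
# non-overlapping detections are discarded from the final result.
smoke_mask = cv2.bitwise_and(flow_mask, color_mask)
```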

Figure 9

(a) The detected areas by the optical flow characteristics (identical to Figure 7c). (b) The detected areas by the HSV color characteristics (identical to Figure 8c). (c) The common area of (a) and (b) in red and the combined areas of both in blue. The candidate areas in green in (a), (b) and (c) are identical to Figure 6b. (d) The final result of smoke detection, indicated in red. Annotation frames in (d) are identical to those in Figure 3b (Color figure online)

4 Evaluations of the Proposed Method

4.1 Evaluation Criteria

We use recall and precision as evaluation indices for the smoke detection method introduced in Sect. 3. Recall is the ratio of correctly predicted positive samples to the total number of actual positive samples, calculated as \(\text{TP} / (\text{TP} + \text{FN})\), where TP represents true positives and FN false negatives; it is commonly used to evaluate how many detections are missed. Precision, on the other hand, measures the ratio of correctly predicted positive samples to the total number of predicted positive samples, calculated as \(\text{TP} / (\text{TP} + \text{FP})\), where FP represents false positives. Precision and recall have a trade-off relationship, and the F-score is commonly used to evaluate both. The balanced F-score (\(F_1\) score) is the harmonic mean of precision and recall, described as follows.

$$\begin{aligned} F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \end{aligned}$$
(14)

In real-time monitoring with a PTZ camera, it is possible to zoom in on smoke candidates and re-verify them on higher-resolution frames. This makes it more important to reduce false negatives than false positives for practical fire prevention, since re-verification allows false positives to be discarded. Therefore, in the evaluation of the F-score, we place greater importance on recall. \(F_\beta\) is a generalized F-score in which recall is weighted \(\beta\) times as heavily as precision. We choose \(\beta = 2\), represented as below.

$$\begin{aligned} F_2 = \frac{5 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{4 \cdot \mathrm{Precision} + \mathrm{Recall}} \end{aligned}$$
(15)
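For reference, a small helper makes the weighting explicit; with `beta=1` it reduces to Eq. (14) and with `beta=2` to Eq. (15).

```python
def f_beta(precision, recall, beta=2.0):
    """Generalized F-score: recall is weighted beta times as much as
    precision (Eqs. 14 and 15 for beta = 1 and 2, respectively)."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```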

4.2 Threshold

Figure 10

Relationship of variance with averaged magnitude of optical flow on November 4, 2021

Figure 11

Relationship of saturation with value of HSV color

To evaluate the effectiveness of the proposed method in Sect. 3, we apply it to a set of 5-s footage datasets observed in Kitakyushu city in 2021, captured on August 1, August 9, and November 4. We first define the threshold values from the footage obtained on these days.

Kitakyushu city is known as a large-scale industrial zone and has many factories with chimneys. As mentioned in Sect. 2.3, we assume the use of IP network cameras to capture wide-view images of urban cities. As seen in Figure 1a, the size of the smoke area is less than \(50 \times 50\) pixels, while the image size is full HD (1080p). Our technique in Sect. 3 is applicable to such low-resolution smoke areas.

The footage from the three days is processed following Figure 2, and the detection rates are evaluated. To obtain thresholds for extracting smoke only, we investigated the characteristics of the various high-magnitude optical flow areas: the areas obtained from the three days of footage were manually classified into four categories (smoke, turbines, cars, and others) and analyzed in terms of their properties.

Figure 10 shows the relationship between the variance and averaged magnitude of optical flow in high-magnitude areas during daytime (6:00 AM to 5:00 PM JST) on November 4, 2021. In the figure, smoke exhibits a higher magnitude-to-variance ratio than cars and turbines, suggesting that smoke can be distinguished from other high-magnitude areas using variance and magnitude. We define the threshold as \(mag/var = 9\), represented by the green line in Figure 10.

Figure 11 shows the relationship between the averaged color saturation and value of the high-magnitude areas on November 4, 2021. The color labels are the same as in Figure 10. Smoke and turbines share the same region (they have the same color characteristics) but are distinguishable from cars and other objects in this scatter plot. As in Figure 10, the threshold is represented by the green lines; it can be written as \(\{V > (256-64)/160 \cdot S + 64\} \, \cap \, \{V < -256/160 \cdot S + 256\}\).

We then verify whether the obtained thresholds are applicable to data from other days. Considering that the color of smoke may be affected by weather conditions or the time of day, we examine the dependence of the HSV saturation and value on weather conditions, represented by solar radiation values. Figure 12 shows the solar radiation on the three days, with values obtained from the AMATERASS system [24]. The AMATERASS dataset [25] provides solar radiation estimates acquired by the Himawari satellite [26, 27]. Figure 12 shows changes in solar radiation due to weather conditions: August 1 and November 4, 2021 are sunny days, and August 9, 2021 is a cloudy day.

Figure 12

Solar radiation on the three days from the AMATERASS dataset [24]: 1-Aug-21 and 4-Nov-21 are sunny days, and 9-Aug-21 is a cloudy day

Figure 13

Relationship of solar radiation with (a) saturation and (b) value of HSV color

Figure 13 shows the relationship of solar radiation with the HSV color saturation and value for the three-day datasets. For smoke, the correlation with solar radiation is 0.01 for saturation and \(-0.08\) for value. We thus conclude that neither the saturation nor the value of smoke depends on the intensity of solar radiation.

Table 1 Recall Values of the Proposed Method (Kitakyushu City)

4.3 Evaluation and Case Studies

We introduced a thresholding method and evaluated a set of threshold values in Sect. 4.2; these thresholds are applicable to any footage obtained at the same location. The \(F_2\) values were obtained using the aforementioned thresholds on the three days. As shown in Table 1, the proposed method achieves \(F_2\) values larger than 90% independently of weather conditions. It should be noted that the smoke areas in the footage from the outdoor cameras in this study are limited in size, varying from 300 to 2000 pixels, smaller than the areas of around 4000 pixels in Figure 15. Considering this limitation, we conclude that the proposed method is practical.

Figure 14

Sample cases of smoke detection on three days

Figure 14 shows other typical cases of smoke detection on the three days of Sect. 4.2. The detected smoke areas are highlighted in red frames, as in Figure 9d. Smoke is successfully detected in both Figure 14a and b, but cloud motion is also partially detected in Figure 14b: a case of false positive. Small fragments of clouds are recognized as smoke when the shape of the cloud changes. However, our Visual IoT system can easily avoid cloud misdetection by using geometric information, such as masking the sky area above the skyline. Figure 14c represents a case of false negative, where smoke on the left-hand side is not detected due to cloudy and dark conditions, making it difficult to distinguish the smoke from the clouds. Foggy conditions also pose challenges for smoke recognition, especially as the distance from the camera increases.

4.4 Comparison with Previous Method

Figure 15

The detection results with the proposed method using Bilkent examples [28] named (a) sBehindtheFence, (b) sBtFence2, (c) sWasteBasket, and (d) sWindow. The red annotation frames represent smoke

To demonstrate the effectiveness of the proposed method in Sect. 3, we compare it with a previous study using a set of reference footage: four videos from the Bilkent sample website [28], with a spatial resolution of \(320 \times 240\) pixels and frame rates of 10 or 15 fps.

The thresholds used here differ from those of the previous section. Based on the smoke characteristics in the sWasteBasket video, the thresholds for optical flow and HSV color were obtained in the same manner as described in Sect. 4.2 and applied to the other videos. The defined thresholds are \(mag/var = 5\) and \(\{S< 90 \} \, \cap \, \{ 60< V < 255 \}\).

The detection results are shown in Figure 15, and a comparison between the proposed method and the previous method [23] is summarized in Table 2. In most cases, the proposed method outperforms the previous one. The lower recall in Figure 15a is attributed to the rapid change in smoke orientation; in real-world monitoring scenarios, the detection rate is expected to be higher, since smoke is typically observed in a more stable manner.

Table 2 Comparison Results of the Two Methods (Figure 15)

4.5 Discussion

Among forest fire detection methods using optical sensors and digital cameras, Alkhatib [5] reviewed four different types. Among them, ForestWatch [30] is a notable video camera system because of its modern equipment: it adopts PTZ cameras and an image sampling engine in association with mobile networks, and a bundled application supports decision-making.

To further enhance the speed and accuracy of fire detection, continued development of such modern techniques is necessary. The Appendix presents Visual IoT [17] as a new IoT technology with numerous achievements reported in various outdoor applications (e.g., [31, 32]). One advantage of the Visual IoT system is its use of IP network cameras with PTZ functions [17]; this type of camera enables autonomous direction changes and zooming in on smoke (Figure 16). We believe future smoke detection applications should be closely tied to modern systems like Visual IoT (Figure 17), where PTZ plays a crucial role.

Figure 16

A zooming process to obtain a higher-resolution image using the PTZ function

Figure 17

Concept of the Visual IoT system

In practical disaster mitigation, recall is more important than precision, as any missed fire detection can lead to a serious disaster. Our study achieved notably high recall values (Tables 1 and 2). To improve the system further, we focus on reducing errors, which can be achieved using PTZ cameras: higher-resolution footage improves both precision and recall. In this study, we visually confirmed that the smoke size ranged from 300 to 2000 pixels, so we excluded areas smaller than 200 pixels. In actual fire monitoring, however, smoke may occupy smaller areas due to distance. To deal with such cases, using PTZ cameras to autonomously sweep the viewing area with a slight zoom and capture high-definition video is crucial.

In future studies, we plan to evaluate the smoke detection rate during nighttime and rainy conditions. At night, detecting smoke in urban areas poses challenges due to low light conditions. Streetlights can assist in detection, and the increasing availability of ultra-high sensitivity CMOS sensors is expected to improve detection rates. During rainfall, the detection rate is expected to decrease due to unclear camera images. Although the likelihood of a fire occurring in such conditions is low, it is important to examine the quantified relationship between rainfall and detection rate.

While we exclusively used smoke footage from factories in this study, future research will incorporate actual fire footage or color-adjusted substitute footage. The primary distinction between factory smoke and fire smoke lies in their color; substitute footage can be created by processing images of factory smoke into black smoke so that it resembles that of a fire.

5 Conclusion

In this paper, we proposed a novel method for detecting daytime smoke using outdoor cameras in urban cities, focusing on the specific properties of smoke. Optical flow is applied to a set of sequential frames in footage to extract smoke areas. However, standard optical flow alone is insufficient, as it also detects other moving objects such as cars and wind turbines. A new approach is thus introduced that applies both the variance of optical flow and HSV color characteristics to the footage.

Our algorithm is developed for high-quality footage with high resolution (e.g., 1080p) and high frame rate (e.g., 25 fps) obtained by modern IP network cameras. The algorithm shows better performance than previously proposed ones: an \(F_2\) score above 90% is achieved using high-quality outdoor camera footage. Accurate setting of threshold values is crucial in this type of algorithm, as detection accuracy often depends on them; we proposed methods for deriving appropriate threshold values, particularly through color analysis. Using solar radiation datasets, we found no significant dependence of smoke detection on weather conditions.

Our technique in this study has the potential for even higher smoke detection capabilities when combined with IP network cameras equipped with PTZ functions. This integration can further enhance the effectiveness and efficiency of smoke detection in urban areas.