Video-based “people counting” uses cameras coupled with video analysis to provide accurate and reliable data for statistical analysis and alerting. A people counting capability can be part of a solution that determines how many people enter a building or room during a specified time interval. Applications that determine how many people are inside an area can also be built by subtracting the number of people leaving from the number entering, starting from a known occupancy estimate. Many people counting applications must be able to uniquely identify each person in the field of view (also referred to as tagging) to prevent overcounting.
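The running-occupancy idea above can be sketched in a few lines. This is a minimal illustration, not production logic; the event stream and starting estimate are hypothetical values, and a real system would consume events from the video analysis pipeline.

```python
def current_occupancy(initial_estimate, events):
    """Occupancy = initial estimate + entries - exits."""
    occupancy = initial_estimate
    for event in events:
        if event == "enter":
            occupancy += 1
        elif event == "exit":
            occupancy -= 1
    return occupancy

# Hypothetical event stream starting from a known estimate of 10 people.
events = ["enter", "enter", "exit", "enter"]
print(current_occupancy(10, events))  # 12
```

Note that any error in the initial estimate persists indefinitely, which is one reason accurate per-person tagging at the thresholds matters.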
A people counting application determines the density of people in an area, often expressed as a “fill percentage.” Counting the number of people crossing a “threshold location” that marks an entrance or exit, and estimating density for a defined space, requires the ability to tag individuals and track their movement in the field of view. Tagging is not the same as recognition. People counting does not require that the software determine personal identity, only that it can distinguish unique person 1 from unique person 2, and so on.
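The role of tagging in threshold counting can be illustrated with a sketch that deduplicates crossing events by tag. The tag identifiers here are hypothetical placeholders for whatever unique IDs the tracker assigns; without them, a person who lingers at the threshold could be counted on every frame.

```python
def count_unique_crossings(crossing_events):
    """Count threshold crossings, using per-person tags so the same
    tracked person is counted only once."""
    seen = set()
    for tag_id in crossing_events:
        seen.add(tag_id)
    return len(seen)

# Person "t1" triggers the threshold twice but is counted once.
print(count_unique_crossings(["t1", "t2", "t1", "t3"]))  # 3
```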
Another application that uses unique person tagging and people counting is “dwell time.” Dwell time is the amount of time a person spends in a designated area of interest covered by camera images. By combining entry counting, unique tagging, and exit counting, a software algorithm can collect statistics such as the maximum, minimum, and average dwell time, along with other descriptive statistics such as the standard deviation.
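The dwell-time statistics described above reduce to pairing each tag's entry timestamp with its exit timestamp. A minimal sketch, assuming timestamps in seconds and dictionaries keyed by hypothetical tag IDs:

```python
import statistics

def dwell_time_stats(entries, exits):
    """entries/exits: dicts mapping tag ID -> timestamp in seconds.
    People still in the area (no exit yet) are excluded."""
    dwell = [exits[tag] - entries[tag] for tag in entries if tag in exits]
    return {
        "min": min(dwell),
        "max": max(dwell),
        "mean": statistics.mean(dwell),
        "stdev": statistics.stdev(dwell) if len(dwell) > 1 else 0.0,
    }

# Hypothetical entry/exit timestamps for three tagged people.
entries = {"p1": 0, "p2": 10, "p3": 20}
exits = {"p1": 30, "p2": 40, "p3": 80}
print(dwell_time_stats(entries, exits))
```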
The placement of cameras is the most critical aspect of the physical configuration. Applications that track movement through changes in position for a tagged object typically require more than one camera with different fields of view. Video analysis software can then combine the positioning information from two perspectives with the unique tag to calculate a time series of positions representing the person's path over time.
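Merging observations from multiple cameras into a per-tag time series can be sketched as below. This assumes, and it is a significant assumption, that each camera's detections have already been projected into a shared coordinate frame (which in practice requires camera calibration); the observation tuples and tag IDs are hypothetical.

```python
def build_paths(observations):
    """observations: list of (timestamp, camera_id, tag_id, (x, y))
    tuples from multiple cameras, in a shared coordinate frame.
    Returns a dict mapping tag_id -> time-ordered [(timestamp, (x, y))]."""
    paths = {}
    for ts, cam, tag, pos in observations:
        paths.setdefault(tag, []).append((ts, pos))
    for tag in paths:
        paths[tag].sort(key=lambda point: point[0])  # order by timestamp
    return paths

# Two cameras observe the same tagged person at different times.
obs = [(2.0, "camA", "p1", (1.0, 2.0)),
       (1.0, "camB", "p1", (0.0, 0.0))]
print(build_paths(obs)["p1"])  # [(1.0, (0.0, 0.0)), (2.0, (1.0, 2.0))]
```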
The technology for detecting people in a video frame is based on the same deep learning methods for object detection used to distinguish images of cats from dogs. “Facial analysis” is another branch of video analysis concerned with images of people. Training deep learning models for facial analysis relies on supervised learning, which requires datasets labeled with the characteristics the model must infer from new, unlabeled observations. Facial analysis predicts individual attributes through statistical inference from the image itself and can be performed on images without any other personally identifiable labels. Models that infer age and gender are among the most widely available facial analysis applications.

The accuracy of age predictions depends on many factors, including how the age variable is collected and labeled. Most studies label age as a range rather than an exact value, which limits inference to the same granularity. For example, age for a model might be labeled as one of three categories: under 30, 30 to 50, or over 50. Inference is then limited to predicting the most likely of these three categories; it is not possible to answer the question “How many people aged 18 through 25 were detected today?”
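The granularity limitation can be made concrete with a sketch of categorical inference. The bin labels and scores below are hypothetical stand-ins for a real model's output; the point is that only the training categories are recoverable.

```python
# Hypothetical training bins; the model can only ever predict one of these.
AGE_BINS = ["under 30", "30 to 50", "over 50"]

def predict_age_bin(scores):
    """Return the most likely training bin given per-bin scores.
    Finer-grained ranges such as 18-25 straddle or fall inside a bin
    and cannot be recovered from this output."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return AGE_BINS[best]

print(predict_age_bin([0.7, 0.2, 0.1]))  # "under 30"
```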
Most facial analysis from video for “gender classification” is based on a binary male/female categorization. Many studies cited in the research community report high recall for inference on this demographic variable, that is, a high proportion of actual positives correctly identified. Many organizations have experienced high rates of nonresponse when collecting optional gender demographics through surveys, and attempts to infer gender from other interactions have been unreliable. The high rate of successful male/female categorization from video analysis is therefore a valuable potential tool, if not complicated by privacy concerns and disclosure issues.
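For clarity on the metric, recall is computed from true positives and false negatives; the counts below are hypothetical illustration values, not figures from any study.

```python
def recall(true_positives, false_negatives):
    """Recall = TP / (TP + FN): the share of actual positives the
    classifier correctly identifies."""
    return true_positives / (true_positives + false_negatives)

# A classifier that finds 90 of 100 actual positives has 0.9 recall.
print(recall(90, 10))  # 0.9
```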