Rösti
Unconfirmed Member
On October 19, 2012, Sony Computer Entertainment Inc. filed a patent application for "Multi-Modal Sensor Based Emotion Recognition and Emotional Interface". It was published today, April 24, via the USPTO. The inventor is Dr. Ozlem Kalinli-Akbacak, Staff Research Engineer at SCEA. As usual, the application is fairly lengthy and technical (and there's no clear image of the sensor itself), but I have bolded the more interesting parts.
Drawings
Some bits about potential specs:
If link isn't working, search for Document Number 20140112556 here.
There's much more at the link, but that's what we have for now. There is an additional patent by the same inventor, but it's not as interesting and deals only with emotion recognition by extracting speech data.
MULTI-MODAL SENSOR BASED EMOTION RECOGNITION AND EMOTIONAL INTERFACE
Abstract
Features, including one or more acoustic features, visual features, linguistic features, and physical features may be extracted from signals obtained by one or more sensors with a processor. The acoustic, visual, linguistic, and physical features may be analyzed with one or more machine learning algorithms and an emotional state of a user may be extracted from analysis of the features. It is emphasized that this abstract is provided to comply with the rules requiring an abstract that will allow a searcher or other reader to quickly ascertain the subject matter of the technical disclosure. It is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims.
FIELD OF THE INVENTION
[0001] Embodiments of the present invention are related to a method for implementing emotion recognition using multi-modal sensory cues.
BACKGROUND OF THE INVENTION
[0002] Emotion recognition or understanding the mood of the user is important and beneficial for many applications; including games, man-machine interface, etc. Emotion recognition is a challenging task due to the nature of the complexity of human emotion; hence automatic emotion recognition accuracy is very low. Some existing emotion recognition techniques use facial features or acoustic cues alone or in combination. Other systems use body gesture recognition alone. Most multi-modal emotion recognition involves facial recognition and some cues from speech. The recognition accuracy depends on the number of emotion categories to be recognized, how distinct they are from each other, and cues employed for emotion recognition. For example, it turns out that happiness and anger are very easily confused when emotion recognition is based on acoustic cues alone. Although recognition tends to improve with additional modalities (e.g., facial cues combined with acoustic cues), even with only about 8 emotional categories to choose from most existing systems are lucky to achieve 40-50% recognition accuracy.
[0003] It is within this context that aspects of the present disclosure arise.
Drawings
BRIEF DESCRIPTION OF THE DRAWINGS
[0004] Embodiments of the present invention can be readily understood by referring to the following detailed description in conjunction with the accompanying drawings.
[0005] FIGS. 1A-1D are flow diagrams illustrating examples of methods for determining an emotional state of a user in accordance with certain aspects of the present disclosure.
[0006] FIG. 2 is a schematic diagram illustrating a map of facial points that may be used in conjunction with certain aspects of the present disclosure.
[0007] FIG. 3 is a schematic diagram illustrating a map of body points that may be used in conjunction with certain aspects of the present disclosure.
[0008] FIG. 4 is a schematic diagram illustrating placement of physiological sensors on a game controller for physiologic sensing in conjunction with certain aspects of the present disclosure.
[0009] FIG. 5 is a schematic diagram illustrating placement of physiological sensors on a wrist band, ring and finger cap for physiologic sensing in conjunction with certain aspects of the present disclosure.
[0010] FIG. 6A is a schematic diagram illustrating placement of physiological sensors on an apparatus held in a user's mouth for physiologic sensing in conjunction with certain aspects of the present disclosure.
[0011] FIG. 6B is a schematic diagram illustrating a physiological sensor on an apparatus for physiologic sensing in conjunction with certain aspects of the present disclosure.
[0012] FIG. 7 is a block diagram illustrating an example of an apparatus for implementing emotion estimation in conjunction with certain aspects of the present disclosure.
[0013] FIG. 8 is a block diagram illustrating an example of a non-transitory computer-readable storage medium with instructions for implementing emotion estimation in conjunction with certain aspects of the present disclosure.
DESCRIPTION OF THE SPECIFIC EMBODIMENTS
[0014] Embodiments of the present invention relate to spoken language processing methods and apparatus that use multi-modal sensors for automatic emotion recognition.
Introduction
[0015] According to aspects of the present disclosure, accurate emotion recognition may be implemented using multi-modal sensory cues. By fusing multi-modal sensory data, more reliable and accurate emotion recognition can be achieved. Emotion recognition and/or understanding the mood of the user is important and beneficial for many applications, including games, man-machine interfaces, and the like. For example, it can be used in a user interface to dynamically adapt the response of a game or other machine based on a player's or user's emotions. The detected mood, emotional state, stress level, pleasantness, etc. of the user may be used as an input to the game or other machine. If the emotional state of the user or game player is known, a game or machine can dynamically adapt accordingly. For example, in a simple case, a game can become easier or harder for the user depending on the detected emotional state of the user. In addition, if the game or machine uses voice recognition, the detected emotional state of the user can be used to adapt the models or to select appropriate models (acoustic and language models) dynamically to improve voice recognition performance.
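To make the "game can become easier or harder" case concrete, here is a minimal sketch of what such an adaptation loop could look like. Nothing in it comes from the application: the `EmotionEstimate` container, the `adapt_difficulty` function, the [0, 1] scales and the thresholds are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class EmotionEstimate:
    """Hypothetical detected state; both fields are assumed to lie in [0, 1]."""
    stress: float
    pleasantness: float


def adapt_difficulty(current: float, estimate: EmotionEstimate,
                     step: float = 0.1, lo: float = 0.0, hi: float = 1.0) -> float:
    """Return a new difficulty in [lo, hi], nudged by the detected state.

    The thresholds (0.7, 0.3, 0.5) are arbitrary illustration values.
    """
    if estimate.stress > 0.7 and estimate.pleasantness < 0.3:
        current -= step  # stressed and not enjoying it: ease off
    elif estimate.stress < 0.3 and estimate.pleasantness > 0.5:
        current += step  # relaxed and enjoying it: ramp up
    return max(lo, min(hi, current))  # clamp to the allowed range


if __name__ == "__main__":
    frustrated = EmotionEstimate(stress=0.9, pleasantness=0.1)
    print(adapt_difficulty(0.5, frustrated))
```

A real game would presumably smooth the estimate over time before acting on it, so that one noisy reading doesn't swing the difficulty.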
[0016] As far as is known, no existing emotion recognition technique has combined physiological (biometric) cues with facial feature cues, linguistic cues (e.g., the meaning of words or sentences), audio cues (e.g., energy and pitch of speech), and cues from body gestures. According to aspects of the present disclosure, a combination of such cues may be used to improve emotional state recognition.
Method for Determining Emotional State
[0017] According to certain aspects of the present disclosure, a new method is proposed for reliable emotion recognition by fusing multi-modal sensory cues. These cues include, but are not limited to, acoustic cues from a person's voice, visual cues (i.e., facial and body features), linguistic features, and physical biometric features measured from the person's body.
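The quoted text doesn't say how the cues are fused, so the following is only one possible reading: decision-level ("late") fusion, where each modality produces its own probability distribution over emotions and the distributions are combined by a weighted average. The label set, the modality names, the weights and the numbers are all hypothetical.

```python
EMOTIONS = ("happy", "angry", "sad", "neutral")  # hypothetical label set


def fuse_scores(per_modality, weights):
    """Weighted average of per-modality probability distributions (late fusion).

    per_modality: {modality_name: {emotion: probability}}
    weights:      {modality_name: non-negative weight}
    """
    total = sum(weights[m] for m in per_modality)
    fused = {e: 0.0 for e in EMOTIONS}
    for modality, probs in per_modality.items():
        w = weights[modality] / total  # normalize so the weights sum to 1
        for e in EMOTIONS:
            fused[e] += w * probs.get(e, 0.0)
    return fused


def most_likely(fused):
    """Return the emotion with the highest fused probability."""
    return max(fused, key=fused.get)


if __name__ == "__main__":
    # Acoustic cues alone cannot separate happy from angry here;
    # the visual and physical cues break the tie.
    scores = {
        "acoustic": {"happy": 0.45, "angry": 0.45, "sad": 0.05, "neutral": 0.05},
        "visual":   {"happy": 0.70, "angry": 0.10, "sad": 0.10, "neutral": 0.10},
        "physical": {"happy": 0.40, "angry": 0.30, "sad": 0.15, "neutral": 0.15},
    }
    weights = {"acoustic": 1.0, "visual": 1.0, "physical": 1.0}
    print(most_likely(fuse_scores(scores, weights)))
```

The toy numbers mirror the background section's point that happiness and anger are easily confused on acoustic cues alone, and show how extra modalities could resolve that.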
[0027] The physical features 113 may include, but are not limited to, vital signs (e.g., heart rate, blood pressure, respiration rate) and other biometric data. The body reacts to emotional state relatively quickly, even before the subject verbally and/or visually expresses his/her emotions/feelings. For example, heart rate, blood pressure, skin moisture, and respiration rate can change very quickly and unconsciously. A user's grip on an object may tighten unconsciously when anxious. In addition to heart rate, blood pressure (BP), and respiratory rate (breathing frequency), depth and pace of breath, serotonin (happiness hormone), epinephrine (adrenaline), skin moisture level (sweating), skin temperature, pressure in hands/fingers/wrist, level of saliva, hormones/enzymes in saliva (cortisol in saliva is an indication of stress), skin conductance (an indication of arousal), and the like are also useful physical features.
[0028] The nature of the sensors 102 depends partly on the nature of the features that are to be analyzed. For example, a microphone or microphone array may be used to extract acoustic features 107. The microphone or microphone array may also be used in conjunction with speech recognition software to extract linguistic features 111 from a user's speech. Linguistic features may also be extracted from text input which is captured by keypad, keyboard, etc.
[0029] Visual features, e.g., facial expressions and body gestures, may be extracted using a combination of image capture (e.g., with a digital camera for still or video images) and image analysis. In particular, facial expressions and body gestures that correspond to particular emotions can be characterized using a combination of feature tracking and modeling. For example, the display of a certain facial expression in video may be represented by a temporal sequence of facial motions. Each expression could be modeled using a hidden Markov model (HMM) trained for that particular type of expression. The number of HMMs to be trained depends on the number of expressions. For example, if there are six facial expressions, e.g., happy, angry, surprise, disgust, fear, sad, there would be six corresponding HMMs to train. An example of a facial map is shown in FIG. 2. In this example, an image of a user's face may be mapped in terms of sets of points that correspond to the user's jawline, eyelids, eyebrows, mouth, and nose.
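The one-HMM-per-expression idea can be sketched as follows: score an observed sequence of facial motions under every expression's HMM and pick the best-scoring one. This is a toy under several assumptions that are not in the application: the facial motions are quantized into discrete symbols, the models are already trained, and the two tiny models below use invented parameters (only two expressions instead of six, to keep it short).

```python
import math


def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | model) for a discrete-output HMM.

    start[i]    : probability of starting in state i
    trans[i][j] : probability of moving from state i to state j
    emit[i][k]  : probability that state i emits symbol k
    """
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs[t]]
                     for j in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
        log_p += math.log(scale)
    return log_p


def classify(obs, models):
    """Pick the expression whose HMM assigns the sequence the highest likelihood."""
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))


# Hypothetical symbols: 0 = "mouth corners up", 1 = "brows lowered".
# Each model is (start, trans, emit) with invented parameters.
MODELS = {
    "happy": ([0.8, 0.2], [[0.7, 0.3], [0.4, 0.6]], [[0.9, 0.1], [0.6, 0.4]]),
    "angry": ([0.5, 0.5], [[0.6, 0.4], [0.3, 0.7]], [[0.2, 0.8], [0.1, 0.9]]),
}

if __name__ == "__main__":
    print(classify([0, 0, 0, 0], MODELS))
    print(classify([1, 1, 1, 1], MODELS))
```

With six expressions, `MODELS` would simply hold six entries; the classification step stays the same.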
[0037] According to some aspects, the sensors may include a mouth ball 600 that has sensors as shown in FIGS. 6A and 6B. By way of example and not limitation, FIG. 6A shows a bottom-up view of the teeth 620 from inside the user's mouth. The sensors in the mouth ball 600 may measure levels of saliva, or hormones or enzymes in saliva that are indicative of emotional state. By way of example, and not by way of limitation, adrenal hormone, AM cortisol in saliva, indicates situational stress. In alternative implementations, sensors can be attached on the chest directly or can be attached using a wearable band for measuring some of the cues such as respiratory rate, depth of breath, etc. According to other alternative implementations, a user may wear a cap or headset (not shown) with sensors for measuring electrical brain activity. A similarly configured apparatus may be used to obtain measurements for estimating hormone levels such as serotonin. For example, through deep brain stimulation, a Wireless Instantaneous Neurotransmitter Concentration System (WINCS) can detect and measure serotonin levels in the brain. WINCS can measure serotonin with a technology called fast-scan cyclic voltammetry, which is an electrochemical method for measuring serotonin in real time in the living brain. Also, a blood lancet, a small medical implement, can be used for capillary blood sampling to measure some hormone levels in the blood. In addition, some types of sensors may be worn around the user's neck, e.g., on a necklace or collar, in order to monitor one or more of the aforementioned features.
Some bits about potential specs:
[0052] By way of illustrated example, and without limitation, FIG. 7 depicts a possible signal processing apparatus 700 configured to perform emotion estimation in accordance with aspects of the present disclosure. The apparatus 700 may include a processor module 701 and a memory 702 (e.g., RAM, DRAM, ROM, and the like). In some implementations, the processor module 701 may include multiple processor cores, e.g., if parallel processing is to be implemented. Examples of suitable multi-core processors include, but are not limited to, dual-core processors, quad-core processors, processor architectures having a main processor and one or more co-processors, cell processor architectures, and the like.
[0053] The memory 702 may store data and code configured to facilitate emotion estimation in any of the implementations described above. Specifically, the memory 702 may contain signal data 706 which may include a digital representation of input signals (e.g., after analog to digital conversion as discussed above), and code for implementing emotion estimation by analyzing information contained in the digital representations of input signals.
[0054] The apparatus 700 may also include well-known support functions 710, such as input/output (I/O) elements 711, power supplies (P/S) 712, a clock (CLK) 713 and cache 714. The apparatus 700 may include a mass storage device 715 such as a disk drive, CD-ROM drive, tape drive, or the like to store programs and/or data. The apparatus 700 may also include a display unit 716 and user interface unit 718 to facilitate interaction between the apparatus 700 and a user. The display unit 716 may be in the form of a cathode ray tube (CRT) or flat panel screen that displays text, numerals, graphical symbols or images. The user interface 718 may include a keyboard, mouse, joystick, light pen or other device. In addition, the user interface 718 may include a microphone, video camera 730 or other signal transducing device to provide for direct capture of a signal to be analyzed. The camera may be a conventional digital camera that produces two-dimensional images. Alternatively, the video camera may also be configured to provide extra information that can be used to extract information regarding the depth of features shown in one or more images. Such a camera is sometimes referred to as a depth camera. A depth camera may operate based on the principle of stereo imaging in which images obtained by two slightly offset cameras are analyzed to determine depth information. Alternatively, a depth camera may use a pattern of structured light, e.g., infrared light, projected onto objects in the camera's field of view. The processor module 701 may be configured to analyze the distortion of the pattern of structured light that strikes objects in the field of view to determine relative depth information for pixels in images obtained by the camera.
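For the stereo-imaging variant, the textbook relation between disparity and depth gives a feel for how "two slightly offset cameras" yield depth. The sketch below assumes rectified pinhole cameras, which the application does not spell out; the function name and the example numbers are invented.

```python
def depth_from_disparity(disparity_px: float, focal_length_px: float,
                         baseline_m: float) -> float:
    """Depth of a point from its stereo disparity: Z = f * B / d.

    disparity_px    : horizontal shift of the point between the two images (pixels)
    focal_length_px : focal length expressed in pixels
    baseline_m      : distance between the two camera centers (meters)
    """
    if disparity_px < 0:
        raise ValueError("disparity must be non-negative for rectified cameras")
    if disparity_px == 0:
        return float("inf")  # no shift between views: point is effectively at infinity
    return focal_length_px * baseline_m / disparity_px


if __name__ == "__main__":
    # Hypothetical rig: 700 px focal length, cameras 10 cm apart.
    print(depth_from_disparity(35.0, 700.0, 0.1))  # roughly 2 meters
```

Nearby features shift more between the two views than distant ones, which is why depth falls as disparity grows.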
Source: http://appft.uspto.gov/netacgi/nph-...tainment"&RS=AN/"sony+computer+entertainment"
[0034] Any number of different sensors may be used to provide signals corresponding to physical features 113. Using some sensory devices, wearable body sensors/devices such as wrist band 500, ring 501, finger cap 502, mouth ball 600, a head band/cap enriched with sensors (e.g., electroencephalogram (EEG) sensors that measure brain activity and stimulation), wearable brain-computer interface (BCI), accelerometer, microphone, etc., the aforementioned cues can be measured and transmitted to a computer system. Usually these physical indicators react faster, even before the subject verbally and/or visually expresses emotions or feelings through speech, facial expression, body language, and the like. Physiologic cues include body temperature, skin moisture, saliva, respiration rate, heart rate, serotonin, etc.
[0035] By placing groups of electrode sensors, for example on a game controller as in FIG. 4, to measure the nerve activation of the fingers and/or of the body, some of the aforementioned physical cues of the finger and human body can be measured. For example, the sensors can measure the stress/pressure level of nerves. Also these sensors can measure the temperature, conductance, and moisture of the human body. Some sensors can also be in the back of the controller as shown in FIG. 4. By way of example and not limitation, sensors can be placed on the controller to take measurements from the thumbs 401 and 402, or from the palms of the hands 403 and 404.