Deep Transfer Learning for Food Recognition
[...] and Merjem Begovic

Food recognition is an essential topic in the area of computer vision. One of its target applications is to remove the need for a cashier at the dining place. In this paper, we investigate the application of deep transfer learning to food recognition. We fine-tune three well-known deep learning models, namely AlexNet, GoogleNet, and Vgg16. The fine-tuning procedure removes the last three layers of each model and adds five new layers. Each model was trained and validated on a food dataset collected from our university's canteen. The dataset contains 39 food types with 20 images per type. The fine-tuned models show similar training and validation performance and achieved 100% accuracy on this small-scale dataset.


INTRODUCTION
The recent advent of deep transfer learning has brought successes in many areas such as classification and recognition [1]-[4]. One of the most promising visual object recognition applications is food recognition, since it helps to estimate food calories and analyze people's eating habits in order to maintain their health [5]. Those applications have opened new challenges for computer vision and object recognition algorithms. Most existing food recognition methods directly extract visual features of the whole image using popular deep networks, without considering the characteristics of food itself [6].
Food recognition is gaining more attention in the multimedia community due to its various applications, e.g., multimodal food-log and personalized healthcare [7]. It is common that one dish can be served in several ways. Therefore, in this paper, we explain the utilization of the deep transfer learning concept for a small-scale food dataset captured directly from the tray. The rest of the paper is organized as follows: Section 2 covers the literature review, and Section 3 contains details of the materials and methodologies used in this paper. Section 4 elaborates on the achieved results, followed by the conclusion.

LITERATURE REVIEW
The use of artificial intelligence has recently been increasing rapidly, and its first applications are appearing in almost every branch of our lives. This is also the case with food recognition applications. More and more research is being done in this field, much of it aimed at improving human health. Recognizing the food type and its features as the content of a meal has enabled automatic or semi-automatic dietary estimations that help people control their eating habits, their diets, and their daily food intake [8]. Those applications were improved using CNN methods, a powerful class of models for various problems, applied to a database taken from 23 restaurants to predict the calories and nutrition of meals from a single image [5,8]. More and more researchers are using CNNs such as ResNet, GoogleNet, MobileNet, and VGG-Net. Some studies report that GoogleNet has the highest validation accuracy with the lowest number of epochs [10]. This led to the use of the DCNN (deep convolutional neural network), which is very suitable for large-scale image data, since it takes only 0.03 seconds to classify one food photo with a GPU. Experiments with it on the ETH Food-101, UEC FOOD 100, and UEC FOOD 256 datasets achieved 88.28%, 81.45%, and 76.17% top-1 accuracy and 96.88%, 97.27%, and 92.58% top-5 accuracy, respectively [11]. Further work concentrated on a buffet-style restaurant. The results showed that using real data can achieve an F score of 0.79 and a 9.4% error in energy, much better than the previous approach [12]. The Supervised Extreme Learning Committee (SELC) takes as many features as possible but retains only the features that are useful for classifying the food; each ELM represents a particular feature type [13]. A classification rate of 55.8% was reached by an approach that recognizes multiple food items by detecting candidate regions and classifying them with various features [12].
A newly proposed approach, visual attention analysis, has shown that the network is able to self-identify the relevant portions of the image that should be considered for classification [6]. Going further, the Ingredient-Guided Cascaded Multi-Attention Network (IG-CMAN) achieves state-of-the-art recognition performance on the new dataset WikiFood-200 [7]. Meal-recognition systems were further improved with a TensorFlow-based machine-learning process in which an Expert.js-based semantic network was constructed. Recognizing Korean, Chinese, and Italian food yielded a food recognition accuracy rate of 55.3% [14]. When recognizing the meal on the plate was no longer enough, a system for retrieving recipes from a picture was developed. A deep architecture exploits the joint relationship between food and ingredient labels through multi-task learning; this is done by learning contextual relationships of ingredients from a large textual corpus of recipes [15]. One of the most important parameters of food quality is its expiration date, and a mobile application was developed for recognizing printed expiration dates [16]. Furthermore, an ear-worn device was proposed for identifying the temporal similarity between different types of food. It aims to record the fluctuations in glucose level during the satiation and satiety periods of the user, in order to keep track of the daily intake of calories and their deficit and to provide a better dietary experience for users [17].

MATERIALS AND METHODOLOGY
3.1 Dataset
The dataset consists of images of 39 different types of food. Each food type contains 20 images. Therefore, the total number of images in the dataset is 780. A smartphone was used to collect the dataset. Table 1 contains a full list of the food names included in this dataset. The dataset includes pictures of soups, main dishes, appetizers, and salads. Figure 1 shows samples of the dataset images. All the images were collected from foods provided at the canteen of the International University of Sarajevo.
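The dataset arithmetic above, and one way such a small dataset might be divided into training and validation sets, can be sketched as follows. The split ratio is not stated in this paper; the 70/30 ratio, the seed, the function name `stratified_split`, and the file names are assumptions made purely for illustration.

```python
import random

NUM_CLASSES = 39        # food types (from the paper)
IMAGES_PER_CLASS = 20   # images per food type (from the paper)


def stratified_split(items_by_class, train_fraction=0.7, seed=0):
    """Split each class separately so every food type appears in both sets.

    train_fraction and seed are illustrative assumptions, not values
    reported in the paper.
    """
    rng = random.Random(seed)
    train, val = [], []
    for label, items in items_by_class.items():
        shuffled = list(items)
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train += [(item, label) for item in shuffled[:cut]]
        val += [(item, label) for item in shuffled[cut:]]
    return train, val


# Hypothetical file names standing in for the real photographs.
dataset = {
    f"food_{c:02d}": [f"food_{c:02d}_img_{i:02d}.jpg" for i in range(IMAGES_PER_CLASS)]
    for c in range(NUM_CLASSES)
}

total_images = sum(len(images) for images in dataset.values())
train_set, val_set = stratified_split(dataset)
print(total_images, len(train_set), len(val_set))
```

Splitting per class rather than over the whole pool matters with only 20 images per food type, because a purely random split could leave some types with very few validation images.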

Methods
The success of [18] in dealing with large-scale datasets was a revolution in the field of machine learning. In contrast with a shallow neural network, a deep neural network comprises a large number of layers. It therefore has a large number of parameters, which require high-capability hardware such as GPUs working in parallel. Hence, transfer learning concepts emerged to assist in training deep neural network models with small-scale datasets. In this paper, we fine-tuned three of the earliest deep learning models, namely AlexNet [18], GoogleNet [19], and Vgg16 [20]. The fine-tuning removes the last three layers of each model and replaces them with five new layers. The fine-tuning process is tabulated in Table 2. Fine-tuning makes each model a hybrid of pretrained and newly added layers. Therefore, to balance the learning speed across all layers, we assign a higher learning rate to the new layers to boost their learning and freeze the weights of the old layers.
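The layer replacement described above can be sketched schematically as follows. This is a framework-independent illustration in plain Python, not the MATLAB code used in this work. The layer names, the types of the five new layers, the base learning rate, and the learning-rate factor of 10 are all assumptions for illustration; the configuration actually used is the one tabulated in Table 2.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    kind: str
    lr_factor: float = 1.0  # multiplier applied to the base learning rate


def fine_tune(pretrained, new_layers, n_removed=3, new_lr_factor=10.0):
    """Drop the last n_removed layers, freeze the rest, append new layers.

    A factor of 0 freezes a layer; new_lr_factor is an assumed value.
    The factor only matters for layers that actually hold weights.
    """
    kept = [Layer(l.name, l.kind, lr_factor=0.0) for l in pretrained[:-n_removed]]
    added = [Layer(l.name, l.kind, lr_factor=new_lr_factor) for l in new_layers]
    return kept + added


# Toy stand-in for a pretrained network: only the tail is spelled out.
pretrained = [
    Layer("conv_block", "convolution"),
    Layer("fc_hidden", "fully_connected"),
    Layer("fc_1000", "fully_connected"),     # original 1000-class output
    Layer("softmax_1000", "softmax"),
    Layer("output_1000", "classification"),
]

# Five hypothetical replacement layers ending in a 39-class output.
new_layers = [
    Layer("new_fc", "fully_connected"),
    Layer("new_relu", "relu"),
    Layer("new_fc_39", "fully_connected"),
    Layer("new_softmax", "softmax"),
    Layer("new_output", "classification"),
]

model = fine_tune(pretrained, new_layers)

BASE_LR = 1e-4  # assumed value, not reported in the paper
for layer in model:
    print(f"{layer.name:12s} effective lr = {BASE_LR * layer.lr_factor:g}")
```

With these assumed values, the retained layers receive an effective learning rate of zero, so their pretrained weights stay fixed, while the new layers learn ten times faster than the base rate.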

RESULTS
The deep transfer neural networks used in this research were trained in MATLAB, and the training results were examined once the training process was complete. The results show how accuracy and loss behaved from the first to the last iteration. Accuracy represents the fraction of images that the network classifies correctly, while the loss measures the inconsistency between the predicted values and the actual labels, so it generally falls as accuracy rises. The presented figures show that the loss curve roughly mirrors the accuracy curve.
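To make the relationship between the two curves concrete, the following minimal sketch computes accuracy and loss for a toy batch. It assumes a softmax output with cross-entropy loss, a common choice for classification networks; the paper does not name the loss function, and the scores below are made-up numbers rather than experimental values.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def accuracy_and_loss(batch_scores, labels):
    """Return (fraction correct, mean cross-entropy) for a batch."""
    correct = 0
    loss = 0.0
    for scores, label in zip(batch_scores, labels):
        probs = softmax(scores)
        if probs.index(max(probs)) == label:
            correct += 1
        loss += -math.log(probs[label])
    n = len(labels)
    return correct / n, loss / n


labels = [0, 1, 2]

# Early in training: the scores barely separate the classes.
early = [[0.2, 0.1, 0.0], [0.3, 0.1, 0.2], [0.0, 0.1, 0.3]]
# Late in training: the correct class clearly dominates.
late = [[6.0, 0.5, 0.1], [0.2, 5.5, 0.3], [0.1, 0.4, 6.2]]

for name, scores in (("early", early), ("late", late)):
    acc, loss = accuracy_and_loss(scores, labels)
    print(f"{name}: accuracy = {acc:.2f}, loss = {loss:.3f}")
```

Under this assumption the loss is not simply one minus the accuracy: it remains slightly above zero even when every toy sample is classified correctly, because it also reflects how confident the predictions are.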

A. AlexNet
Firstly, we notice from Figure 2(a) that the AlexNet training accuracy rose very quickly. This shows how useful transfer learning is in practice. The main reason for the fast rise is that the layers retained from the pretrained network were already trained. After approximately 200 iterations, the accuracy saturates: the network reaches 100% accuracy on the data it was given for training. More details about the performance of AlexNet training during the earlier iterations are shown in Figure 2(b).

B. GoogleNet
Secondly, with the GoogleNet training, we can see in Figure 3(a) that the training accuracy also rose quickly and reached 100% even faster than AlexNet. There are no visible oscillations throughout the training except during the first few iterations, as illustrated in Figure 3(b). We also notice that the loss starts from a lower value rather than from the top of the scale, since the neural network has already been pretrained, which is another benefit of transfer learning.

C. Vgg16
Figure 4(a) represents the Vgg16 training accuracy at different iterations. The performance is slightly different, with more oscillations than in the previous models. This can happen when there is too much similarity in the data. Since the dataset contains pictures of foods that are very hard to tell apart, the accuracy dropped at times.

Loss Evaluation
The loss function is an important part of deep neural network evaluation; it measures the inconsistency between the predicted value and the actual label. For the AlexNet model, we can observe from Figure 5(a) that as accuracy increases, loss decreases. This also reflects the benefits of transfer learning.
For the GoogleNet model, as shown in Figure 5(b), the decrease in loss indicates that the network learned rapidly. After the loss reaches the zero level, it stays at the bottom with no visible oscillations. The training process of GoogleNet went somewhat better than that of AlexNet.
Lastly, Figure 5(c) represents the training loss of Vgg16. We can see that the loss decreases exactly at the points where the accuracy increases. There is slightly more loss than in the previous graphs. Even though some oscillations are visible, the training still went well, and the final accuracy of the learning process is 100%.
It is also important to show the validation performance of each model. As presented in Figure 6(a), the models show different behaviours at the beginning of the training, while the rest of the performance is similar for all models. Figure 6(b) represents the models' performance in terms of validation loss. It provides additional evidence that the models perform differently at the beginning of the training.

CONCLUSION
In this paper, we covered the utilization of three fine-tuned deep transfer learning models. AlexNet, GoogleNet, and Vgg16 performed similarly in terms of training, validation, and loss. The experiments show that GoogleNet achieved higher scores faster than AlexNet and Vgg16. Nevertheless, all fine-tuned models recorded 100% training and validation accuracy. In the future, this work can be combined with currency recognition to automate the cashier position and develop a fully automated self-checkout counter.