Introduction
This is a kind of odd title for a technical post. But yes, it is a technical post. And it may not look like a real problem, but yes, it is a real one.
It turns out that I have a compulsion, an obsession, with watching movies. Or, to put it better, with copying and organizing movies on my personal storage. But some of those movies will never be played.
Recently, I also noticed that I am running out of space. A well-known approach to this situation would be to delete all the movies that I have never played, or all the ones I really don't like.
But I could also try something else and take some advantage of the situation, something more productive for a Sunday afternoon, because at some point I will be in the same position again.
The root of all evil
In Cuba, Internet access is very expensive. You can check the prices for yourself on the official website of the only Internet service provider (a.k.a. ETECSA). Therefore, regular Cubans don't use Netflix, nor do they use the Internet to download large multimedia files (at least not from home).
Such a situation has created a unique business model that probably only works in Cuba: an offline alternative to a media service provider, code name "El paquete" (the package).
I will not give you too many details about this service. All you need to know is that the package distributes a lot of movies every week via USB drives. The media content includes the latest premieres as pirated cinema copies, improved cinema copies, HD copies with Chinese subtitles, Full-HD versions, classic movies, animated movies, a specific actor's retrospective, and so on. The package also includes television programs, series, sports, contests, etc. About 1 TB of media files per week.
But my personal OCD is about movies, and I copy them all. This is not exactly a healthy approach for my very limited personal storage.
Everything got "worse" when I met Emby
Emby is a media server designed to organize, play, and stream audio and video to a variety of devices, as you can read here. Therefore, my movie-copying routine now includes downloading all the movie metadata: the original title, the tag line, poster and backdrop images, the cast, community rating, critic rating, genres, and all the other information available from sites like IMDb or TheMovieDB. It is stored in the server database and also in local nfo files next to each movie file. These metadata enrich the user experience and are displayed when someone browses the media content from a client like Emby for Roku, directly from the TV.
As you can also notice in the picture above, Emby tracks the movies that I have already played. Wait a second. That looks like a perfect ground truth for solving a classification problem.
Deep learning to the rescue
Sundays are good days to spend time with the family and watch movies. But I couldn't find the right one yesterday, and I'm also down to nearly zero space for the next release of the package. So, I just needed to try something deep ;), something that could work as a long-term approach.
Yes, I know. I haven't written much on this blog for a while. But remember, I'm training Alexa every day, and she demands a lot of my time ;). She only left me time to publish Computing Anomaly Score Threshold with Autoencoders Pipeline, and then I completely forgot to comment about it here. But that will be the subject of the next post (or the one after). So, let's go back to the movies.
The Emby server has an SQLite database (library.db). I explored the data and extracted all the information useful for my problem with a simple join of two tables, MediaItems and UserDatas.
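That extraction can be sketched as a query like the following. The two table names come from library.db, but the column names are my assumptions about Emby's schema:

```sql
SELECT m.OfficialRating,
       m.CommunityRating,
       m.CriticRating,
       m.Genres,
       u.played
FROM MediaItems m
JOIN UserDatas u
  ON u.key = m.UserDataKey;
```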
Sample of extracted data from the Emby database
At this point, I thought it was good timing to try the ML.NET Model Builder (Preview), but the extension size is about 150 MB. Too large for a Sunday at home. The .NET solution to this problem will have to wait until I finish writing this post, or maybe until next weekend.
Let's do this straightforward
There is plenty of documentation about DL4J, even a book: Deep Learning: A Practitioner's Approach. So, this will be fast. I will try not to repeat any step that is available online, but you will probably notice some resemblance to Paul Dubs' excellent quick-start tutorial, since this is exactly a classification problem.
Yes, in case you didn't notice yet: this is a classification problem, and quite a simple one. I have to predict whether I will play a movie from the following features: Official Rating, Community Rating, Critic Rating, and Genres, correlated with my own playback actions.
First, I split the existing data: a training data set with 80% and an evaluation data set with 20% of the full data set. I also stored the local analysis of the full data set, so that both sets could be normalized using the same statistics.
Then I transformed the data using DataVec as follows:
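A minimal DataVec sketch of such a transform. The schema mirrors the features listed above, but the exact column names, the category lists, and the normalization choices are my assumptions:

```java
import java.util.Arrays;

import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.analysis.DataAnalysis;
import org.datavec.api.transform.schema.Schema;
import org.datavec.api.transform.transform.normalize.Normalize;

public class MovieTransform {

    // 'analysis' is the stored local analysis of the full data set,
    // so training and evaluation data are normalized with the same statistics
    public static TransformProcess build(DataAnalysis analysis) {
        Schema schema = new Schema.Builder()
                .addColumnCategorical("OfficialRating",
                        Arrays.asList("G", "PG", "PG-13", "R", "NC-17"))
                .addColumnDouble("CommunityRating")
                .addColumnDouble("CriticRating")
                .addColumnCategorical("Genres",
                        Arrays.asList("Action", "Comedy", "Drama", "Horror", "Sci-Fi"))
                .addColumnInteger("Played") // the label: 1 if played, 0 if not
                .build();

        return new TransformProcess.Builder(schema)
                // one-hot encode the categorical features
                .categoricalToOneHot("OfficialRating")
                .categoricalToOneHot("Genres")
                // scale the numeric ratings using the shared analysis
                .normalize("CommunityRating", Normalize.MinMax, analysis)
                .normalize("CriticRating", Normalize.MinMax, analysis)
                .build();
    }
}
```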
Followed by this network configuration:
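A plausible DL4J configuration for such a small classifier; the layer sizes, updater, and seed here are my assumptions, not necessarily the ones used:

```java
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.DenseLayer;
import org.deeplearning4j.nn.conf.layers.OutputLayer;
import org.deeplearning4j.nn.weights.WeightInit;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.learning.config.Adam;
import org.nd4j.linalg.lossfunctions.LossFunctions;

public class MovieNet {

    public static MultiLayerConfiguration build(int numFeatures) {
        return new NeuralNetConfiguration.Builder()
                .seed(42) // reproducible runs
                .weightInit(WeightInit.XAVIER)
                .updater(new Adam(0.001))
                .list()
                .layer(new DenseLayer.Builder()
                        .nIn(numFeatures).nOut(64)
                        .activation(Activation.RELU).build())
                .layer(new DenseLayer.Builder()
                        .nIn(64).nOut(32)
                        .activation(Activation.RELU).build())
                // two outputs: "will play" vs. "will not play"
                .layer(new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                        .activation(Activation.SOFTMAX)
                        .nIn(32).nOut(2).build())
                .build();
    }
}
```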
Finally, I set up the early stopping trainer to save the best model:
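The early stopping setup can be sketched with DL4J's standard classes; the termination limits and the "models" directory are my assumptions:

```java
import java.util.concurrent.TimeUnit;

import org.deeplearning4j.earlystopping.EarlyStoppingConfiguration;
import org.deeplearning4j.earlystopping.EarlyStoppingResult;
import org.deeplearning4j.earlystopping.saver.LocalFileModelSaver;
import org.deeplearning4j.earlystopping.scorecalc.DataSetLossCalculator;
import org.deeplearning4j.earlystopping.termination.MaxEpochsTerminationCondition;
import org.deeplearning4j.earlystopping.termination.MaxTimeIterationTerminationCondition;
import org.deeplearning4j.earlystopping.trainer.EarlyStoppingTrainer;
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

public class Train {

    public static MultiLayerNetwork fit(MultiLayerConfiguration conf,
                                        DataSetIterator trainIter,
                                        DataSetIterator evalIter) {
        EarlyStoppingConfiguration<MultiLayerNetwork> esConf =
                new EarlyStoppingConfiguration.Builder<MultiLayerNetwork>()
                        .epochTerminationConditions(new MaxEpochsTerminationCondition(500))
                        .iterationTerminationConditions(
                                new MaxTimeIterationTerminationCondition(20, TimeUnit.MINUTES))
                        // score on the evaluation set after every epoch
                        .scoreCalculator(new DataSetLossCalculator(evalIter, true))
                        .evaluateEveryNEpochs(1)
                        // keep the best model on disk
                        .modelSaver(new LocalFileModelSaver("models"))
                        .build();

        EarlyStoppingResult<MultiLayerNetwork> result =
                new EarlyStoppingTrainer(esConf, conf, trainIter).fit();
        return result.getBestModel();
    }
}
```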
And done.
The results
Well, the results are quite impressive, and also suspicious. But there is no problem at all: the network perfectly isolates the movies that I already played in the evaluation data set.
Wait a second. I just remembered that I have an isolated copy of last week's package, 58 movies in the inbox, already processed by Emby. After running the prediction program, the assistant neural network (the result of the training process) recommended that I copy only 7 movies. Yes, I can deal with that.
Prediction over last week's package
Not too bad for a Sunday, right? But it probably requires some tuning (or watching more movies). I'm not sure the adversarial network (myself) will allow ignoring Ad Astra. Or will it? ;)
Nice and creative idea to solve this problem! I liked your approach. My only recommendation: for the next problem, when the classification problem is simple, a simple classifier may be better. For instance, Random Forest would be a great choice for this type of problem because it tends to be more accurate than shallow neural networks, and it outputs the variable importance it computed for the classification task. So you would also get a ranking of which of your variables matters most to the final classification. It would be nice to know which variable influences your taste for movies... maybe Genre? Maybe the Official Rating? Well, anyway, it's just a suggestion, but I liked your overall approach and your great idea for solving this problem.