like was told to me: Deeplearning4j

Monday, December 16, 2019

How to avoid copying movies that you will never play?

Introduction

This is a kind of odd title for a technical post. But yes, it is a technical post. Actually, doesn't look like a real problem. But yes, it is a real one.

It turns out that I have a compulsion and obsession to watch movies. It is better to say, to copy and organize movies on my personal storage. But some of those movies will be never played.

Recently, I also noticed that I am running out of space. A well-known approach to solve this situation could be to eliminate all those movies that I never played or all that I really don't like.

But I could also try something else and take some advantage of this situation, something more productive for a Sunday afternoon because at some point I will be in the same position again.

The root of all evil

In Cuba, Internet access is very expensive. You can check the prices by yourself on the official website of only one Internet service provider (a.k.a. ETECSA). Therefore, regular Cubans don't use Netflix, nor use the Internet to download large multimedia files (at least not from home).

Such a situation has created a unique business model, that probably only works in Cuba. An offline alternative of media service provider, code name "El paquete" (the package).

I will not give you too many details about this service. All you need to know is that the package distributes a lot of movies every week via USB drives. The media content includes the latest premiers as pirates cinema copies, improved cinema copies, HD copies with Chinese subtitles, Full-HD versions, classics movies, animated movies, a specific actor's cycle, and so on. The package also includes some television programs, series, sports, contests, etc. About 1 TB per week in media files.

But my personal OCD is about movies, and I copy them all. This is not exactly a healthy approach for my very limited personal storage.

Everything gets "worse" when I meet Emby

Emby is a media server designed to organize, play, and stream audio and video to a variety of devices as you can read here. Therefore, my copy movies routine now includes the download of all movie metadata with the original title, the tag line, poster and backdrop images, the cast, community rating, critical rating, genres, all the information available from sites like IMDb or TheMovieDB that is stored in the server database and also in nfo local files next to each movie file.

These metadata enrich the user experience and are displayed when someone browses the media content from a client like Emby for Roku direct from the TV.

Spider-Man: Into the Spider-Verse (Emby for Roku)

As you can also notice in the picture above, Emby also tracks the movies that I already played. Wait a second. That looks like a perfect ground truth to be used to solve a classification problem.

Deep learning to the rescue

Sundays are good days to spend time with the family and watch movies. But, I couldn't find the right one yesterday. I'm also near to zero space for the next release of the package.

So, I just needed to try something deep ;). Something that could work as a long-term approach.

Yes, I know. I haven't written too much on this blog for a while. But remember I'm training Alexa every day, and she demands a lot of my time ;). She only left me time to publish Computing Anomaly Score Threshold with Autoencoders Pipeline and then I completely forgot to comment about it here. But that will be the subject of the next post (or the next one). So, let's go back to the movies.

The Emby server has an SQLite database (library.db). I explored the data all around and extracted all the useful information to solve my problem with a simple join of two tables MediaItems and UserDatas.

Sample of extracted data from Emby database

At this point, I thought that was good timing to try the ML.NET Model Builder (Preview) but the extension size is about 150 MB. Too large for a Sunday at home. The .NET solution to this problem has to wait until I finish writing this post, or maybe the next weekend.

Deeplearning4j (DL4J) is already cached on my local nexus. So, here we go.

Let's do this straightforward

There is enough documentation about DL4J, even a book Deep Learning: A Practitioner's Approach. So, this will be fast. I will try don't repeat any step available online, but probably you notice some resemblance with the excellent Paul Dubs quick-start tutorial, since this, is exactly a classification problem.

Yes, if you didn't notice yet. This is a classification problem and is a quite simple one. I have to predict if I will play a movie from the following features: Official Rating, Community Rating, Critic Rating, and Genres in correlation with my own playback action.

First, I split the existing data. I created the training data set with 80% and the evaluation data set with 20% from the full data set. I stored the local analysis of the full data set to normalize each one using the same analysis.

Then I transformed the data using DataVect as follow:

Followed by this network configuration:

Finally, I set up the early stopping trainer to save the best model:

And done.

The results

Well, the results are quite impressive and also suspect. But there is no problem at all. The network perfectly isolates the movies that I already played on the evaluation data set.

Played movies from the evaluation data set.

Now, I'm ready for the next release of the package.

Wait a second. I just remember, that I have an isolated copy of the last week's package with 58 movies in the inbox and already processed by Emby. After running the prediction program, the assistant neural network (the result of the training process) recommends that I copy only 7 movies. Yes, I can deal with that.

Prediction over the last week package

Not too bad for a Sunday, right? But probably it requires some tuning (or watching more movies). I'm not sure that the adversary network (myself) allows ignoring Ad Astra. Or yes? ;)

Wednesday, July 11, 2018

Introducing myself into Deep Learning

Overview

It has been some time since my last blog post. Actually, it has been more than two years. The main reason is that something changed in my life when I started to train a neural network. Her name is Alexa.

Just to avoid confusion let me clarify that I am not a member of Amazon Alexa team. Alexa is the female version of my own name and it is the name of my little girl ;)

She is one of the reasons for this deep learning journey.

How did the journey begin?

Every journey has a motivation and this one is not the exception. It started on September 19 of 2016 when I carried her for the very first time. After saw her for a while I asked myself: How is possible that she can learn something in the future?

Months passed, and I saw her learning so fast effortless without too much “computational power” (apparently). She learned to walk, to dance, to almost talk, to solve a puzzle, to play basketball, to solve pyramid-piling rings puzzle, and even to scramble a Rubik’s cube ;)

Solving puzzle

Defending vs. Elizabeth

Scrambling a Rubik's cube

Beyond the obvious answers that she will learn by design, there is a lot of trial and error in her learning process. Several attempts to look for the best fit before she can learn something. I love to see her “computing the distance” from the expected result and her attempts at solving the pyramid-piling rings puzzle by removing a wrong ring and replacing it with the right one.

Dad is trying to build something
with blocks but I'm interested
in the neighbor’s dog ;)

What really happened is that her learning process motivated me to explore something “new”. A discipline that is called to be (if it is not really is) the new toolset for every single software developer. It turns out that the new hobby came with a practitioner approach but it required to be found first.

What happened next?

At this point, I started to watch the machine learning course videos of Andrew Ng using alternatives sources. Cubans (that live in Cuba) are not able to access the certification program at Coursera. In some way, the USA embargo to Cuba – (specifically the USA exportation laws) – also affects the deep learning global democratization process.

However, it doesn't worry me too much, actually never does. I can't get the certification but I can get the knowledge. Andrew’s course is actually motivating and didactically insuperable. It was able to bring back to life some math and algebra that I thought was dead in my brain and made me felt very comfortable implementing a vectorized version of the Stochastic Gradient Descent algorithm with Octave.

"We apologize for the inconvenience" said a @coursera message because also #block us, https://t.co/vsmxsoaMop #Cuba @cubavsbloqueo pic.twitter.com/4JpDtas3PH
— Alexánder Fernández (@alexfdezsauco) August 9, 2017

After understand how this works (almost just like Andrew used to say), and what kind of problems were possible to solve, just I wanted to put this in practice at the production level. Some researchers (friends of mine) sold me Tensorflow as the Holy Grail but I had some doubts about Python performance (still have).

The new hobby comes with the
practitioner approach

After some research, I found exactly what I was looking for. A deep learning library for JVM. Deeplearnig4j (DL4j) is an excellent library and comes with this excellent book Deep Learning: A Practitioner's Approach from Adam Gibson and Josh Patterson. I just needed to read the preface, to identify me as a deep learning practitioner. It wasn't to hard notice that was the right book for me. I'm pretty sure that is also the right book for you.

DL4j also comes with a lot of helpful features and tools to assist the training process including Training UI, datavec, early stopping, even GPU support, and more, but we can talk about this in forthcoming posts.

DL4J Performance - examples/sec on Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz (No GPU required in this case)

Nice normal distribution shape for weights in the Layer Parameters Histogram (So, no regularization issues)

Some network details

Recently, I also started to work in build a proof of concept of an anomaly detection system built on top of DL4j, specifically by using Autoencoder networks with promising results but I can't give you anything in advance yet (just wait for it).

Autoencoders are supported by DL4j

Conclusions

I have a lot to learn about deep learning, but the journey has already begun and I have the intention to share it with you.

Btw, if you are not motivated to go “deep” with machine learning yet, be a father (or mother) first, and then let me know. I'm sure you will find the same "biological" inspiration when you witness how to "train a neural network" looks like ;)

Alexa talking with a large predator cat of stone