Automating the Automation

Automation is a scary thing. While I was in Silicon Valley, I stayed at a house a short distance from Facebook. In that area it is almost impossible to get by without roommates, and I had four; three of them worked at Facebook. I got to know the inner workings a bit, and I met a lot of the really bright people working there. One was an extremely talented software engineer who seemed to be getting a promotion every month. He held a high-level position at one of the biggest and most profitable companies in the world, and he was concerned that automation would replace his job. Specifically, he was worried about Google developing programs that could write programs. The idea reminds me of an M.C. Escher print I hung in my dorm room; it seemed deep at the time but now feels a bit silly and pretentious.

That employee’s concern is legitimate, especially given the current rise of generative technology (see my previous blog post). The question is: can we automate machine learning? Yes, definitely some aspects, but end-to-end automation is still a ways off. Let me clarify. There are three steps in a machine learning project. The first is data preprocessing/cleaning/planning. The second is the actual machine learning. The third (and sometimes, unfortunately, overlooked) is interpretation and sanity checks on the model. Arguably there is a fourth step covering deployment and maintenance, but that is outside the scope of this post.

Of the three steps, the shortest and most fun is the machine learning; the other two require significantly more work. Which steps can be automated? Data preprocessing is usually unique to the problem. Generally you are pulling from several tables, and there is a lot of noise, missing data, and things that just do not make sense. It is almost always a mess, but it is a snowflake in that no two messes are the same. There are some best practices for determining whether the data is suitable for machine learning at all, but that check happens very early on. The third step, where the bulk of the work lies, is the interesting one. There is a tremendous number of metrics for measuring the performance of a model, and a good machine learning scientist reads them roughly like this: terrible performance = unsolvable with this data; fair performance = potential room for improvement; good performance = the sweet spot; great performance = something is wrong. This process requires a human set of eyes.
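To make that rubric concrete, here is a minimal sketch of the sanity check in Python with scikit-learn. The AUC thresholds are my own illustrative assumptions, not hard rules; the point is that suspiciously great scores deserve scrutiny, not celebration.

```python
# Sketch: cross-validate a model and read the score the way a skeptical
# human would. The thresholds below are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
model = RandomForestClassifier(random_state=0)

mean_auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

if mean_auc < 0.55:
    print(f"AUC {mean_auc:.2f}: terrible -- likely unsolvable with this data")
elif mean_auc < 0.75:
    print(f"AUC {mean_auc:.2f}: fair -- potential room for improvement")
elif mean_auc < 0.95:
    print(f"AUC {mean_auc:.2f}: good -- the sweet spot")
else:
    print(f"AUC {mean_auc:.2f}: great -- something is probably wrong")
```

No metric can run this check for you: a 0.99 AUC usually means a leaked feature rather than a brilliant model, and only someone who knows the data can tell the difference.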

Machine learning is like asking a teenager to clean your house: the place can look spotless after a few minutes, but the rug will be extremely lumpy. That is exactly why step three needs human eyes. That leaves step two. We have the data ready to go; now we just need to find the right model with the right parameters. While there has been some work on predicting which model suits which data, the "no free lunch" theorem still applies here: you have to search to find the best. There are several freely available packages that automate this search, sklearn and H2O to name a couple. This is also a growing area on the services and startup side of machine learning, and the big dogs are getting in on it as well with Amazon SageMaker, Azure Machine Learning, and Google AutoML. The take-home is that finding a model is easy, but everything around it still requires traditional human participation.
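For a flavor of what that automated search looks like, here is a small sketch using sklearn's GridSearchCV to sweep across two model families and their parameters in one pass. The candidate models and grids are illustrative assumptions; a real search would be tailored to the problem, and tools like H2O or the cloud AutoML services run a much broader version of the same idea.

```python
# Sketch: one grid search that tries multiple model families by swapping
# the "model" step of a one-step pipeline. Candidates are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

pipe = Pipeline([("model", LogisticRegression())])
param_grid = [
    {"model": [LogisticRegression(max_iter=1000)],
     "model__C": [0.1, 1.0, 10.0]},
    {"model": [RandomForestClassifier(random_state=0)],
     "model__n_estimators": [100, 300],
     "model__max_depth": [None, 10]},
]

search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)
print(search.best_params_)
print(f"best cross-validated AUC: {search.best_score_:.3f}")
```

Note what the search does not do: it does not build the features, and it does not tell you whether that best score is trustworthy. Those remain human jobs.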

Big data is daunting and confusing, yet it is becoming more of a necessity to remain competitive in the modern workplace. We here at Convergent Technologies specialize in simplifying the seemingly opaque. We have helped organizations of all sizes implement sensible data solutions. Let us help get you there!