KDD 2018 was intense. Workshops and keynotes started at 8 AM and the poster sessions were running until 9 PM. There were a dozen tracks running in parallel, so choosing a session to attend was hard – there was always something else potentially interesting going on at the same time. For the first two days I tried to pay close attention and write up comprehensive notes from the tutorials and workshops, but that was not sustainable, so from the third day on I sat back, relaxed, listened more and wrote less. Below are a few random things from that part of the conference that stuck in my memory – not necessarily because they were the most significant or groundbreaking, perhaps just the most relevant or interesting to me at this time.
It has always struck me that calling ReLU nonlinear is a bit of a stretch: there is just one point where the slope changes, but otherwise it is a combination of two linear functions. Beyond its (to me, surprising) effectiveness in deep learning, another consequence of this piecewise linearity is that usage of ReLU, or similar activations, across all layers allows a deep learning model to be decomposed into a large set of linear models: each member of that set corresponds to a “configuration” that describes which linear part of the activation function is in use for each unit in the network. This allows model predictions to be explained in terms of the features that influenced the choice of configuration (a convex polytope within the input space), and then the features that determined the prediction of the chosen linear classifier. More details in the paper.
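To make the decomposition concrete, here is a minimal NumPy sketch (my own illustration, not code from the paper): for a tiny one-hidden-layer ReLU network, the “configuration” is the on/off mask of the hidden units at a given input, and within that configuration’s polytope the network is exactly an affine model whose weights can be read off in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny ReLU network with random weights: 3 inputs, 5 hidden units, 2 outputs.
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)

def forward(x):
    h = np.maximum(W1 @ x + b1, 0.0)  # ReLU hidden layer
    return W2 @ h + b2

def local_linear_model(x):
    # The "configuration": which linear piece of ReLU each hidden unit is on.
    mask = (W1 @ x + b1 > 0).astype(float)
    # Within this configuration's polytope the network is exactly affine:
    W_eff = W2 @ (mask[:, None] * W1)
    b_eff = W2 @ (mask * b1) + b2
    return W_eff, b_eff

x = rng.normal(size=3)
W_eff, b_eff = local_linear_model(x)
# The network's output equals the local affine model's output at x.
assert np.allclose(forward(x), W_eff @ x + b_eff)
```

The interesting part is that `W_eff` is an ordinary linear classifier, so standard linear-model explanations apply locally, while the mask itself identifies the polytope the input fell into.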
Some vendors – Baidu, Amazon – were talking about “democratising machine learning”, i.e. making it accessible to a wide range of companies that do not currently have staff with the experience necessary to develop and deploy machine learning models – presumably while locking them into the vendor’s platform. Either way, Amazon’s effort in this area is called SageMaker. While other vendors – Google, Microsoft – offer similar end-to-end platforms, the striking feature (and constraint) of Amazon’s offering is that all machine learning algorithms running on it must support streaming. Perhaps it is because my day-to-day work only involves batch learning, but I found this restriction, and the architectural benefits it brings – distribution, incremental training – fascinating. It is also a reminder that in practical deployments the quality of a model’s output is just one of many considerations – others being the ease of management, monitoring and evolution of the model over its lifetime. Stuck knee-deep in the multitude of techniques and new results for achieving “state of the art” predictions, it is easy to forget about those other factors.
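For readers (like me) used to batch learning, here is a small sketch of what “the algorithm must support streaming” means in practice – nothing to do with SageMaker’s actual API, just scikit-learn’s generic `partial_fit` interface, where the model is updated one mini-batch at a time and never holds the full dataset:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # must be declared up front for streaming

# Simulate a stream of mini-batches; each batch is seen once and discarded.
for _ in range(50):
    X = rng.normal(size=(32, 4))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)  # a simple linearly separable task
    clf.partial_fit(X, y, classes=classes)

X_test = rng.normal(size=(200, 4))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)
accuracy = clf.score(X_test, y_test)
```

The appeal from an architecture standpoint is that memory stays constant regardless of data volume, and the same code path serves both initial training and later incremental updates to a deployed model.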
Event Registry consumes feeds from 30,000 news sources in 100 languages, cleans them up, identifies common concepts using a cross-lingual ontology extracted from Wikipedia and uses them to construct a timeline of world events. On closer inspection, a number of those events are opinions, and it is also not clear how susceptible the service is to misinformation attacks, or to multiple stories in multiple languages originating from a single, unreliable source. Still, the idea is compelling, and the authors have experimented with predicting future events based on the data collected – reportedly with useful outcomes, though no published results were mentioned.