I’ve made the argument (PDF) that Functional Programming (FP) is the best way to approach data problems. Why? Because our work with data is essentially Mathematics, and FP is inspired by Mathematics too, so it emphasizes the abstractions that are most appropriate for data analysis. Object-oriented Programming, on the other hand, doesn’t promote the same useful abstractions, which is why I consider the use of Java in Big Data applications to be counterproductive. (There are, of course, hybrid programming languages, like Scala, F#, and OCaml, that combine both paradigms. If you want a comprehensive comparison of programming paradigms, consider this chart.) In fact, SQL can be considered a functional programming language, since it is derived from Set Theory, although it has many limitations as a language.
I believe there are two other emerging trends in programming worth watching that will impact the data world.
Logic Programming, like FP, is actually not new at all, but it is seeing a resurgence of interest, especially in the Clojure community. Rules engines, like Drools, are one category of logic programming system that has been in use for a long time.
In logic programming, you write programs using the concepts of Logic, such as first-order logic. Simply stated, you specify conditions or constraints (e.g., rules) that must be satisfied, along with known “facts” about the system you’re modeling, and the runtime finds the values of the system’s variables that satisfy the conditions. One way to think of it is to imagine the runtime searching the space of all possible answers for those that satisfy the conditions and facts.
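To make the “search the space of answers” picture concrete, here is a minimal sketch in Python. It is not how a real logic programming runtime works internally (those use far smarter techniques, such as unification and backtracking). The family-tree facts and the names PARENT, grandparent, and solve are all invented for illustration.

```python
from itertools import product

# Known facts about the system: (parent, child) pairs. Hypothetical data.
PARENT = {("alice", "bob"), ("bob", "carol"), ("bob", "dave")}

# Every individual mentioned in the facts; this is the search space.
PEOPLE = sorted({name for pair in PARENT for name in pair})


def grandparent(x, z):
    """Rule: x is a grandparent of z if some y has parent(x, y) and parent(y, z)."""
    return any((x, y) in PARENT and (y, z) in PARENT for y in PEOPLE)


def solve(rule):
    """Naive stand-in for the runtime: try every binding, keep those the rule accepts."""
    return [(x, z) for x, z in product(PEOPLE, repeat=2) if rule(x, z)]


solutions = solve(grandparent)
print(solutions)
```

Note that the program never says how to find grandparents. It states the facts and the rule, and a generic solver does the searching.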
Why is this interesting for data? Logic programming is a declarative and concise way to express problems that can be framed this way. Hence, if your problem fits the logic programming model, you can work quickly and efficiently, in just the same way that SQL queries are a very concise and expressive way to ask questions of data and to perform analytics.
For example, a classic use of logic programming has been fault diagnosis; given observed events or symptoms and knowledge of the system, what are the possible underlying faults that caused the observations? This approach is applicable for diagnosing malfunctions in cars, chemical plants, medical problems, etc.
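A toy version of fault diagnosis might look like the following Python sketch. The knowledge base is made up, and the sketch assumes a single fault at a time and that each fault’s symptom list is complete. Both are strong simplifications.

```python
# Hypothetical knowledge base: each fault and the symptoms it is assumed to cause.
CAUSES = {
    "dead_battery": {"no_crank", "dim_lights"},
    "bad_starter": {"no_crank"},
    "empty_tank": {"cranks_no_start"},
}


def diagnose(observed):
    """Return every single fault whose known symptoms cover all observed symptoms."""
    return sorted(fault for fault, symptoms in CAUSES.items() if observed <= symptoms)


one_symptom = diagnose({"no_crank"})
two_symptoms = diagnose({"no_crank", "dim_lights"})
print(one_symptom)
print(two_symptoms)
```

Adding an observation narrows the list of candidate faults, which is the behavior we want from a diagnostic system.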
There’s one catch, though. Most logic programming systems assume we have absolute knowledge; facts are yes/no, true/false, or some fixed value, while constraints are absolute and comprehensive. Many, if not most, real-world scenarios aren’t so clear-cut. Probabilistic modeling has proven most fruitful for these scenarios, where knowledge and constraints are imprecise and contain gaps, and where we don’t require absolute answers either. In our example, a list of faults is great, but which one is most likely? What is the probability that an observed event was a false alarm? How do we know we’re monitoring all the relevant data?
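The question “which fault is most likely?” is what Bayes’ rule answers. Here is a small sketch in Python. All of the numbers are invented, and it assumes that exactly one of the listed hypotheses holds, with “false_alarm” standing for the case where the symptom was reported but nothing is actually wrong.

```python
# Assumed prior probability of each hypothesis before any observation.
PRIOR = {"dead_battery": 0.3, "bad_starter": 0.1, "false_alarm": 0.6}

# Assumed probability of observing the symptom "no_crank" under each hypothesis.
LIKELIHOOD = {"dead_battery": 0.9, "bad_starter": 0.8, "false_alarm": 0.05}


def posterior(prior, likelihood):
    """Bayes' rule: P(h | observation) is proportional to P(observation | h) * P(h)."""
    joint = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(joint.values())  # P(observation), the normalizing constant
    return {h: p / total for h, p in joint.items()}


post = posterior(PRIOR, LIKELIHOOD)
for hypothesis, prob in sorted(post.items(), key=lambda kv: -kv[1]):
    print(f"{hypothesis}: {prob:.3f}")
```

With these made-up numbers, a false alarm starts out as the most probable hypothesis, but it becomes the least probable once the symptom is observed.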
I’ll cite a few examples of great interest today. Recommendation engines are widely used in social networks and ecommerce. For example, Netflix might observe that you rent action movies more often than romantic comedies, but does that reflect a hard and fast rule for you? What about romantic comedies with car chases? You’re probably going to rent another action movie the next time, but there’s a nonzero chance a romantic comedy will appeal to you some day.
Self-navigating robots have a model of the world, e.g., a map of the terrain and sensors used to detect where they are. There are sources of error and uncertainty. Real sensors aren’t 100% accurate. The map could have errors and obstacles could be in the way (like people crossing the street!) that are not represented on the map. So, the world is modeled probabilistically and the robot calculates the most likely location, given its measurements and how they correlate to the map.
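One simple way to compute a “most likely location” is a histogram (discrete Bayes) filter. The Python sketch below assumes a tiny circular one-dimensional map, a sensor that is right more often than wrong, and perfectly accurate motion. The map and the numbers are invented, and real robots use much richer models.

```python
WORLD = ["wall", "door", "door", "wall", "wall"]  # hypothetical 1-D map
P_HIT, P_MISS = 0.6, 0.2  # assumed weights for a matching vs. mismatching reading


def sense(belief, reading):
    """Reweight each cell by how well it explains the reading, then normalize."""
    weighted = [b * (P_HIT if cell == reading else P_MISS)
                for b, cell in zip(belief, WORLD)]
    total = sum(weighted)
    return [w / total for w in weighted]


def move(belief, steps):
    """Shift the belief to the right by `steps` cells (circular map, exact motion)."""
    n = len(belief)
    return [belief[(i - steps) % n] for i in range(n)]


belief = [1 / len(WORLD)] * len(WORLD)  # start with no idea where we are
belief = sense(belief, "door")
belief = move(belief, 1)
belief = sense(belief, "wall")

most_likely = max(range(len(belief)), key=belief.__getitem__)
print(most_likely, [round(b, 3) for b in belief])
```

The robot never knows its position for certain. It keeps a probability for every cell and sharpens that distribution as measurements arrive.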
Finally, how do we automate the understanding and processing of human language? If you’ve used a voice-recognition system, like Siri on an iPhone, you’ve used just one example of the amazing progress we’ve made. Fundamentally, we now think of human language as a probabilistic process, where previously we thought of it as the outcome of sophisticated internal models. This argument between Noam Chomsky and Peter Norvig illustrates the sea change in our thinking.
We already have powerful probabilistic modeling techniques and tools, such as Bayesian networks, Markov networks, and their variants, generically called Probabilistic Graphical Models (because they model probabilities about systems using graphs). Implementations are available in many languages. There are excellent textbooks, including Artificial Intelligence, A Modern Approach, by Russell and Norvig, that describe them. However, deep technical expertise is required to understand and use these techniques effectively.
We’re on the verge of moving to the next level: probabilistic programming languages and systems that make it easier to build probabilistic models. In these systems, the modeling concepts are promoted to first-class primitives in new languages, and the underlying runtimes do the hard work of inferring answers, similar to the way that logic programming languages work already. The ultimate goal is to enable end users with limited programming skills, like domain experts, to build effective probabilistic models without requiring the assistance of Ph.D.-level machine learning experts, much the way that SQL is widely used today by people who aren’t professional programmers.
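To give a feel for the idea, here is a rough Python sketch of the division of labor. The user writes only a small generative model, and a generic inference routine answers questions about it. The routine here uses rejection sampling, one of the simplest and least efficient inference methods. This is not any real probabilistic programming language, and the model, the probabilities, and the names model and infer are all invented for illustration.

```python
import random


def model(rng):
    """Generative model: sample a hidden cause, then whether the symptom appears."""
    cause = rng.choices(["dead_battery", "bad_starter", "false_alarm"],
                        weights=[0.3, 0.1, 0.6])[0]
    p_symptom = {"dead_battery": 0.9, "bad_starter": 0.8, "false_alarm": 0.05}[cause]
    no_crank = rng.random() < p_symptom
    return cause, no_crank


def infer(model, observed, samples=20000, seed=0):
    """Generic inference by rejection sampling: run the model many times and
    keep only the runs whose symptom matches the observation."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(samples):
        cause, symptom = model(rng)
        if symptom == observed:
            counts[cause] = counts.get(cause, 0) + 1
    total = sum(counts.values())
    return {cause: n / total for cause, n in counts.items()}


estimate = infer(model, observed=True)
for cause, prob in sorted(estimate.items(), key=lambda kv: -kv[1]):
    print(f"{cause}: {prob:.2f}")
```

The important point is that infer knows nothing about cars. Swap in a different model function and the same routine works unchanged, which is the separation that probabilistic programming systems aim to provide, with far better inference algorithms.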
DARPA, the research arm of the U.S. Department of Defense, considers this trend important enough to start an initiative promoting it, called Probabilistic Programming for Advanced Machine Learning, which is also described in this Wired article.
This is the next logical step in the democratization of data, making the sophisticated analysis of large data sets accessible to a wider audience. It’s amazing how universal SQL knowledge has become. I often meet very nontechnical people who have learned enough basic SQL to get the answers they need for themselves. Achieving the same level of fluency in logic and probabilistic programming will be harder, even if good languages and tools are developed, because the core concepts are harder for people to grasp. Still, it’s an important challenge and the results will benefit us all.