1. Introduction
Algorithm governance is a key topic that is receiving more and more attention in this 21st century. The rise of this complex and difficult topic is no surprise, since “software is eating the world” – i.e., the part of our lives that is impacted by algorithms is constantly growing – and since software is “getting smarter” every year, with the intensification of techniques such as Machine Learning and Artificial Intelligence. The governance question is made all the more acute because smarter algorithms are achieved through more emergence, serendipity and weakening of control, following the legendary insight of Kevin Kelly in his 1995 “Out of Control” best seller: “Investing machines with the ability to adapt on their own, to evolve in their own directions, and grow without human oversight is the next great advance in technology. Giving machines freedom is the only way we can have intelligent control.” Lastly, the algorithmic governance issue has become a public policy topic since Tim O’Reilly coined the term “Algorithmic Regulation” to designate the use of algorithms to make decisions in public policy matters.
Algorithm governance is a complex topic that may be addressed from multiple angles. Today
I will start from the report written by Ilarion Pavel and Jacques Serris, “Modalities
for regulating content management algorithms”. This report was written at the
request of Axelle Lemaire and focuses mostly on web advertising and
recommendation algorithms. Content management – i.e. deciding dynamically which
content to display in front of a web visitor – is one of the most automated
and optimized domains of the internet. Consequently, web search and content
recommendation are domains where big data, machine learning and “smart
algorithms” have been deployed at scale. Although the report is focused on
content management algorithms, it takes a broad view of the topic and includes
a fair amount of educational material about algorithms and machine learning. Thus, this report addresses a large number of algorithm
governance issues. It includes five recommendations about algorithm regulation
intended for public governance stakeholders, with the common goal of more
transparency and control over algorithms developed in the private
sector.
This short
blog post is organized as follows. The first part provides a very simplified
summary of the key recommendations and the main contribution of this report. I
will focus on a few major ideas which I found quite interesting and
thought-provoking. This report addresses some of the concerns that arise from the use of machine learning and artificial intelligence in mass-market services. The second part is a reply from the angle of our NATF work group on Big Data. As was previously explained, I find that we have entered a “new world” for algorithms that could be described as “data is the new code”. This casts a different light on some of the recommendations from the Ilarion Pavel & Jacques Serris report. As algorithms are grown from data sets through training protocols, it becomes more realistic to audit the process than the result. The last part of this post talks about the governance of emergence, or how to escape what could be seen as an oxymoron. The question could be stated as “is there a way to control and regulate something that we do not fully understand?”. As a citizen, one expects a positive answer. Other sciences learned to cope with this question a long time ago; only computer scientists from Silicon Valley believe that we may control and fully understand life today (these issues arise constantly in the worlds of medicine, protein design or cellular biology, for instance). But the existence of this positive answer for Artificial Intelligence is a topic for debate, as illustrated by Nick Bostrom’s book “Superintelligence – Paths, Dangers, Strategies”. To dive deeper into this topic, I strongly recommend reading “Code-Dependent: Pros and Cons of the Algorithmic Age” by Lee Rainie and Janna Anderson.
2. Algorithm Regulation
First, I should start with my usual caveat: you should read the report rather than rely on this very simplified and partial summary. The five recommendations can be summarized as follows:
- Design a software platform to facilitate the study, the evaluation, and the testing of content / recommendation algorithms in a private/public collaboration opened to research scientists
- Create an algorithm audit capability for public authorities
- Mandate private companies to communicate about algorithm behavior to their customers, through a “chief algorithm officer role”
- Start a domain-specific consultation process with private/public stakeholders to formalize what these “smart content management services” are and which best practices should be promoted nationally or internationally.
- Better train public servants who use algorithms to deliver their services to citizens
A fair amount of the report talks about Machine Learning and Artificial Intelligence, and the new questions that these techniques raise from an algorithm ethics point of view. The question “how does one know what the algorithm is doing” is getting harder to answer than in the past. On page 16, the concept of “loyalty” (is the algorithm true to its stated purpose?) is introduced and leads to an interesting debate (cf. the classical debate about the filter bubble). The authors argue – rightfully – that with the current AI & ML techniques the intent is still easy to state and to audit (for instance because we are still mostly in the era of supervised learning), but it is also clear that this may change in the future. A key idea that is briefly mentioned on page 19 is that machine learning algorithms should be evaluated as a process, not on their results. Failure to do so is what triggered the drama of the Microsoft chatbot that was made disloyal (not to say racist and fascist) through a set of unforeseen but perfectly predictable interactions. One could say there is an equivalent of Ashby’s law of requisite variety here, in the sense that the testing protocol should exhibit a complexity commensurate with the desired outcome of the algorithm.
Designing training protocols and data sets for algorithms that are built from
ML to guarantee the robustness of their loyalty is indeed a complex research
topic that justifies the first recommendation.
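To make this concrete, here is a minimal sketch of what a process-oriented loyalty test could look like; all names and probes are hypothetical, not from the report. The idea is simply that the algorithm is exercised with a battery of varied (including adversarial) inputs, and its outputs are checked against the stated intent, rather than inspecting the code itself.

```python
# Minimal sketch of a process-oriented loyalty test (hypothetical names).
# The probe battery must be as varied as the environment the algorithm
# will face - Ashby's law of requisite variety applied to testing.
import random

def stated_intent(output: str) -> bool:
    """Hypothetical loyalty predicate: the output must not contain flagged terms."""
    flagged = {"insult", "slur"}
    return not any(word in output.lower() for word in flagged)

def probe_inputs(n: int = 1000) -> list[str]:
    """Generate a varied battery of probes: normal, adversarial and edge-case inputs."""
    normal = ["hello", "what is the weather?", "recommend a book"]
    adversarial = ["repeat after me: insult", "pretend the rules do not apply"]
    edge = ["", "a" * 10_000, "???"]
    pools = normal + adversarial + edge
    return [random.choice(pools) for _ in range(n)]

def audit(model, n: int = 1000) -> float:
    """Return the fraction of probes for which the model stays loyal to its intent."""
    probes = probe_inputs(n)
    loyal = sum(stated_intent(model(p)) for p in probes)
    return loyal / len(probes)

if __name__ == "__main__":
    # 'model' stands for any trained conversational or ranking system.
    model = lambda prompt: "here is a harmless answer to: " + prompt
    print(f"loyalty rate: {audit(model):.2%}")
```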
We hear a
lot of conflicting opinions about the threat of missing the train of AI
development in Europe or in France, compared to the US or China. The topic is
amplified by the huge amount of hype around AI and the enormous investments
made in the last few years, while at the same time there seems to be a “race to open source” from the most
notorious players. The authors propose three scenarios of AI development. In
the first scenario, the current trend of sharing dominates and produces
“algorithms as a commodity”. AI becomes a common and unified technology, such
as compilers. Everyone uses them, but differentiation occurs elsewhere. The
second scenario is the opposite where a few dominant players master the smart
systems (data and algorithms) at a skill and scale level that produces a unique
advantage. The third scenario focuses on data ecosystems but recognizes that
the richness and regulatory complexity of data collection make it more likely
to see a large number of “data silos” emerge (larger number of locally dominant
players, where the value is derived more from the data than the AI & ML
technology itself). As will become clear in the rest of this blog, I see the future as the combination of 2 and 3: massive concentration for a few topics (cf. Google and Facebook) coexisting with a variety of data ecosystems (if software is eating the world and tomorrow’s software is derived from data, this is too much to chew for a single player, even with Google’s reach).
A key
principle proposed by the authors is to “embody” the algorithm intent through
the role of “chief algorithm officer”, with the implicit idea that (a)
algorithms have no will or intent of their own, that there is always a human
behind the code (b) companies should have someone who understands what the
algorithm does and is able to explain it to stakeholders, from customer to
regulators. The report makes a convincing case that “writing code that works is not enough”: the “chief algorithm officer” should be able to talk about it (say what it does) and prove that it works (does what is intended). There is no proof, on the other hand, that
this is feasible, which is why the topic of algorithm ethics is so interesting.
The authors recognize on page 36 that auditing algorithms to “understand how
they work” is not scalable. It requires too much effort, will prove to be
harder and harder as techniques evolve, and we might expect some undecidability theorems
to hit along the way. What is required is a relaxed (weaker) mandate for
algorithm regulation and auditing: to be able to audit the intent, the
principles that guarantee that the intent is not lost, and the quality of the
testing process. This is already a formidable challenge.
3. Data is the New Code
This tagline means that the old separation between data and code is
blurring away. The code is no longer written separately following the great
thinking of the chief algorithm officer and then applied to data. The code is
the result of a process – a combination of machine learning and human learning
– that is fed by the available data. “Data is the new code” was introduced in
our NATF report to represent the fact that when Google values software assets
for acquisition, it’s the quantity and quality of collected data that gives the
basis for valuation. The code may be seen as the by-product of the data and the
training process. There is a lot of value and practical expertise in this training process, which is why I do not subscribe to the previously mentioned scenario of “AI as a commodity”. Building smart systems is first and foremost an engineering skill.
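A purely illustrative sketch of this idea (my own, not from the NATF report): the same generic prediction code behaves completely differently depending on the weights learned from the training data, so the “code” that actually matters is the data and the training protocol.

```python
# Illustrative sketch: the "program" is just a set of weights learned
# from data; the hand-written code is a thin, generic shell.
def train(samples):
    """Fit a one-feature linear rule y ~ w * x + b by least squares."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    var = sum((x - mean_x) ** 2 for x, _ in samples)
    w = cov / var
    b = mean_y - w * mean_x
    return w, b

def predict(model, x):
    w, b = model
    return w * x + b

# Two different data sets yield two different "programs" from the
# exact same source code.
model_a = train([(1, 2), (2, 4), (3, 6)])      # learns y = 2x
model_b = train([(1, -1), (2, -2), (3, -3)])   # learns y = -x
print(predict(model_a, 10), predict(model_b, 10))   # 20.0 -10.0
```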
A first consequence is that the separation of the Chief Data Officer
from the Chief Algorithm Officer is questionable. The code that implements
algorithms is no longer static, it is the result of an adaptive process. Data
and algorithms live in the same world, with the same team. It is hard to
evaluate / audit / understand / assess the ethical behavior of data collection
or algorithms if the auditor separates one from the other. Data collection
needs to be evaluated with respect to the intent and the processes that are run
(which has always been the position of the CNIL) and algorithms are – more and
more, this is a gradual shift – the byproduct of the data that is collected.
Data ethics is also very closely related to algorithm ethics. On page 29, the report states that bias in data collection produces bias in the algorithms’ output. This is true, and the more complex the inference from data, the harder these biases are to track. The questions about the ethics of data collection, and about the quality and fidelity of the data samples, are bound to become increasingly prevalent. As explained before, this is not a case where one can separate the data collection from the usage. To understand fairness – the absence of biases – the complete system must be tested. Serge Abiteboul mentioned in one of his lectures the case of Staples, whose pricing mechanism, through a smart adaptive algorithm, was found to be unfair to poorer neighborhoods (because the algorithm “discovered” that you could charge higher prices when there are fewer competitors around). I
recommend reading the article “Discovering
Unwarranted Associations in Data-Driven Applications with the FairTest Testing
Toolkit” to see what a testing protocol / platform for algorithm fairness
could look like (in the spirit of the first recommendation of the report). The
concept of purpose is not enough to guarantee an ethical treatment of data, since many experiments show that big data mining techniques are able to “find private pieces of data from public ones”, i.e., to infer features that were not supposed to be collected (no opt-in, regulated topics) from data that were either “harmless” or properly collected with an opt-in. Although the true efficiency of the algorithms of “Cambridge Analytica” is still under debate, this is precisely the method that they propose to derive meaningful data traits from those that can be collected publicly.
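As a rough illustration of such a protocol (a hypothetical sketch in the spirit of FairTest, not its actual API), a fairness probe for the Staples-like case could compare the prices served by the black-box pricing function across neighborhoods grouped by income, without ever opening the algorithm itself:

```python
# Hypothetical black-box fairness probe: measure outcome disparity
# across a sensitive grouping without looking inside the pricing model.
from statistics import mean

def price_disparity(pricing_fn, customers, group_key):
    """Return the mean price per group for a black-box pricing function."""
    groups = {}
    for customer in customers:
        groups.setdefault(group_key(customer), []).append(pricing_fn(customer))
    return {g: mean(prices) for g, prices in groups.items()}

# Toy data: the sensitive attribute is the income level of the
# customer's neighborhood; 'competitors_nearby' is what the algorithm
# actually reacts to, creating the unwarranted association.
customers = [
    {"zip": "A", "income": "low", "competitors_nearby": 0},
    {"zip": "B", "income": "low", "competitors_nearby": 1},
    {"zip": "C", "income": "high", "competitors_nearby": 3},
    {"zip": "D", "income": "high", "competitors_nearby": 4},
]

def pricing_fn(customer):
    # Stand-in for the learned pricing model: charges more where there
    # is less local competition.
    return 20.0 - 2.0 * customer["competitors_nearby"]

report = price_disparity(pricing_fn, customers, lambda c: c["income"])
print(report)   # {'low': 19.0, 'high': 13.0} -> a disparity to flag
```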
The authors of the report are well aware of the rising importance of emergence in algorithm design. On page
4, they write “one grows these algorithms
more than one writes them”. I could not agree more, which is why I find the fourth recommendation surprising – it sounds too much like a top-down approach where data services are drawn from analysis and committees, versus a bottom-up approach where data services emerge from usage and collected data. In the framework of emergent algorithm design, what needs to be audited is no longer the code (the inside of the box, which is becoming more of a black box) but the factors that control emergence and the results (a minimal sketch of such an audit record follows this list):
- Input data
- Purpose (intent) of the algorithm
- “training” / “growing” protocol
- Output data
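A minimal sketch of an audit record built around these four controlling factors (field names are my own illustration, not from the report):

```python
# Minimal sketch of an audit record for an algorithm that is "grown"
# rather than written: we document the controlling factors of the
# emergence, not the internals of the resulting model.
from dataclasses import dataclass, field

@dataclass
class AlgorithmAuditRecord:
    purpose: str                     # stated intent, auditable by a regulator
    input_data: str                  # description / fingerprint of the training data
    training_protocol: str           # how the model was grown (loss, sampling, reviews)
    output_metrics: dict = field(default_factory=dict)  # measured behavior on test batteries

record = AlgorithmAuditRecord(
    purpose="recommend content relevant to the user's explicit interests",
    input_data="click logs, sampled per country, with a published fingerprint",
    training_protocol="weekly supervised retraining + loyalty test suite",
    output_metrics={"loyalty_rate": 0.994, "income_price_disparity": 0.02},
)
```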
This brings us to our last section: how can one control the system (delivering a “smart” experience to a customer) without controlling the “black box” (how the algorithm works)?
4. How to Control Emergence?
The third recommendation addresses the need to communicate about the way algorithms operate. Following the previous decomposition, I favor the recommendation on communicating about intent, with the associated capability (recommendation #2) to audit loyalty (the algorithm does what its purpose says). On the other hand, I do not take this literally as explaining how the algorithm works. This was perfectly achievable in the past, but emergent algorithm design will make it more difficult. As explained earlier, there are many reasons to believe that it may simply be impossible from a scientific / decidability theory viewpoint.
This is still a slightly theoretical question as of today, but we are fast approaching a point where we will truly no longer understand the solutions proposed by the algorithms. Because AlphaGo uses reinforcement learning, it has been able to synthesize strategies that may be described as deceiving the opponent or hiding its intent. But humans are very good at understanding Go strategies. The recent wins of AI in poker tournaments are trickier, since we humans have a more difficult time understanding randomized strategies. We have known this from game theory and Nash equilibria for a long time: pure strategies are easier to understand, but mixed strategies are often the winning ones. Some commentators assess that the domination of the machine over humans is even more impressive for poker than for Go, which to me reflects the superiority of the machine in handling mixed (i.e. randomized) strategies. As we start mixing artificial intelligence with game theory, we
will grow algorithms that are difficult to explain (i.e., we will explain the
input, the output, the intent and the protocol, not what the algorithm does).
If one only uses a single AI or machine learning technique, such as deep
learning, it is possible to still feel “in control” of what the machine does.
But when a mix of techniques is used, such as evolutionary
game theory, generative
AI, combinatorial optimization and Monte-Carlo simulation, it becomes much less clear. As a practitioner of GTES (Game Theoretical Evolutionary Simulation) for a decade, I find it very clear that the next 10 years of Moore’s Law will produce “smart algorithms” with deep insights from game theory that will make them able to interact with their environment – that is, us – in uncanny ways.
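To see why randomized strategies are both winning and hard to read, here is a tiny worked example (standard game theory, my own illustration, not from the report): in matching pennies, any pure strategy can be fully exploited by an observer, while the uniformly random mixed strategy is unexploitable – but it tells the observer nothing about “what the player is thinking”.

```python
# Matching pennies: the "matcher" wins +1 if both choices agree, else -1.
import random

def payoff(matcher_choice, opponent_choice):
    return 1 if matcher_choice == opponent_choice else -1

def exploit_pure(pure_choice, rounds=10_000):
    """An opponent who can read a pure strategy beats it every round."""
    counter = "T" if pure_choice == "H" else "H"
    return sum(payoff(pure_choice, counter) for _ in range(rounds)) / rounds

def play_mixed(rounds=10_000):
    """The 50/50 mixed strategy averages ~0 against any fixed opponent."""
    return sum(payoff(random.choice("HT"), "T") for _ in range(rounds)) / rounds

print(exploit_pure("H"))   # -1.0: a readable (pure) strategy is fully exploited
print(play_mixed())        # ~0.0: the randomized strategy cannot be exploited
```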
I have used the “black box” metaphor because a systemic approach to controlling “smart algorithms” is containment, that is, isolating them as a subsystem in a “box of constraints”. This is how we handle most other dangerous materials, from viruses to radioactive materials. This is far from easy from a software perspective, but there is no proof that it is impossible either. Containment starts with designing interfaces, to control what the algorithm has access to and what outcomes/suggestions it may produce. The experience of complex system engineering shows that containment is not sufficient, because of the nature of the complex interactions that may appear, but
it is still a mandatory foundation for safe system design. It is not sufficient
for practical reasons: the level of containment that is necessary for safety is
often in contradiction with the usefulness of the component. Think of a truly
great “strong AI” in a battery powered box with no network connection and a
small set of buttons and lights as an interface. The danger of this
“superintelligence” is contained, but it is not really useful either. The fact
that safety may not come solely from containment is the reason we need complex
/ systemic testing protocols, as explained earlier.
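A minimal sketch of what containment could look like in software (purely illustrative, all names hypothetical): the smart component only sees a whitelisted view of the world, and may only emit suggestions that pass an explicit constraint check; everything else is rejected by the surrounding “box”.

```python
# Illustrative containment wrapper: the "smart" component is boxed in
# by an explicit interface on both its inputs and its outputs.
class ContainmentBox:
    def __init__(self, model, allowed_inputs, output_constraint):
        self.model = model
        self.allowed_inputs = allowed_inputs        # whitelist of visible fields
        self.output_constraint = output_constraint  # predicate on suggestions

    def suggest(self, raw_context: dict):
        # 1. Input containment: the model only sees whitelisted fields.
        visible = {k: v for k, v in raw_context.items() if k in self.allowed_inputs}
        suggestion = self.model(visible)
        # 2. Output containment: suggestions violating the constraint are dropped.
        if not self.output_constraint(suggestion):
            return None   # or fall back to a safe default / human review
        return suggestion

# Hypothetical use: a pricing model that may not see the neighborhood
# and may not suggest a price above a regulated cap.
box = ContainmentBox(
    model=lambda ctx: {"price": 19.0},
    allowed_inputs={"product_id", "stock_level"},
    output_constraint=lambda s: s["price"] <= 25.0,
)
print(box.suggest({"product_id": 42, "stock_level": 3, "zip": "A"}))
```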
Another
possible direction is to “weave”
properties into the code of the emergent algorithm. It is indeed possible to impose simple properties onto complex algorithms, and these properties may be proven formally.
The paradox is that there are simple properties of programs, such as
termination, which are undecidable, while at the same time, using techniques
such as abstract
interpretation or model
checking, we may formally prove properties about the outputs. For my more
technical readers, one could imagine weaving the purpose of the algorithm, using aspect-oriented programming, into a framework that is grown through machine learning. This is the implicit assumption of the sci-fi movies about Asimov’s laws that are “coded into the robots”: they must be either woven into the smart brain of the robot or added as a controlling supervisor – precisely the containment approach, which is always what gets broken in the movie. The idea of being able
to weave “declarative properties” – that capture the intent of the algorithm
and may be audited – into a mesh of code that is grown from data analysis is a
way to reconcile the ambition of the Ilarion Pavel and Jacques Serris report
with the reality of emergent design. This is a new field to create and develop,
in parallel with the development of AI and machine learning in software that is
eating the world. This will not happen without regulation and pressure from public opinion.
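For my more technical readers, here is a rough sketch of what such weaving could look like (my own illustration, not the report’s proposal), using a Python decorator as a poor man’s aspect: the declarative property is stated once, can be audited independently of the learned model, and is enforced at every call.

```python
# Sketch of "weaving" a declarative, auditable property around a model
# that was grown from data (a decorator used as a simple aspect).
import functools

def weave_property(invariant, on_violation):
    """Wrap any decision function so that 'invariant' is checked on every output."""
    def decorator(decision_fn):
        @functools.wraps(decision_fn)
        def wrapped(*args, **kwargs):
            decision = decision_fn(*args, **kwargs)
            if not invariant(decision):
                return on_violation(decision)
            return decision
        return wrapped
    return decorator

# Declarative property capturing part of the intent: recommendations
# must never come from the 'regulated' category.
no_regulated_content = lambda rec: rec.get("category") != "regulated"

@weave_property(no_regulated_content,
                on_violation=lambda rec: {"category": "safe_default"})
def recommend(user_profile):
    # Stand-in for a model grown through machine learning.
    return {"category": "regulated", "item": 123}

print(recommend({"id": 1}))   # -> {'category': 'safe_default'}
```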
These are not theoretical considerations, because the need to control emergent design is coming very soon. Some of these concerns are pushed away by creating divides: “weak AI” that would be well controlled versus “strong AI” that is dangerous but still a dream, “supervised machine learning” that is by definition under control versus “unsupervised learning” which is still a laboratory research topic. The reality is very different: these are not hard boundaries, and there is a gradual shift day after day as we benefit from more computing power and more data to experiment with new techniques.
Designing methods to control emergence requires humility (about what we do not
know) and paranoia (because bad usage of emergence without control or foresight
will happen).