
Alphabet Shares Undervalue The Cloud And Might Be Worth Closer To $3,000 (NASDAQ:GOOG)


Summary

  • Alphabet's Google Cloud segment continues to grow, recording ~$4B in sales in the most recent quarter; but the business remains unprofitable.
  • Nonetheless, Google Cloud could achieve a ~$20B annual run rate by the end of FY '21 and is closing the gap with key rivals AMZN and MSFT.
  • I argue that Alphabet's current share price may assign little value to the Google Cloud segment, and that shares may be undervalued as a result.
  • Based on my analysis, Class A GOOGL shares, at the moment, may be trading 10%-15% below a price that reflects a fair value of the Google Cloud business.

Still Playing Catch-Up

Alphabet's (GOOGL, GOOG) Google Cloud business has come a long way since its early AppEngine-centric beginnings in 2008.

  • Revenue in Q1 FY '21 stood at $4.047B versus $2.777B in Q1 FY '20, jumping 46%.
  • The total loss for the business segment improved to ($974M) versus ($1,730M) in the prior year period. Using the sales data above, the segment's efficiency improved considerably, from a loss of ($0.62) per dollar of sales in the prior year period to ($0.24) per dollar of sales in the most recent period (see the short calculation after this list).
  • The business could approach ~$20B in FY '21 if the company can maintain its recent average sequential quarter-over-quarter growth rate of ~10%.
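As a quick check on the loss-per-dollar figures above, here is the back-of-the-envelope arithmetic (my own sketch; the inputs are the segment figures quoted in the bullets):

    # Google Cloud loss per dollar of sales, Q1 FY'20 vs. Q1 FY'21 (figures in $M)
    loss_q1_fy20, sales_q1_fy20 = 1_730, 2_777
    loss_q1_fy21, sales_q1_fy21 = 974, 4_047
    print(round(loss_q1_fy20 / sales_q1_fy20, 2))  # ~0.62 loss per $ of sales
    print(round(loss_q1_fy21 / sales_q1_fy21, 2))  # ~0.24 loss per $ of sales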

However, while Google Cloud remains in the red, heavyweights Amazon (AMZN) and Microsoft (MSFT) continue to grow their respective cloud businesses profitably and rather impressively. Amazon recently noted in its Q1 FY '21 earnings release that "[in] just 15 years, [Amazon Web Services ("AWS")] has become a $54 billion annual sales run rate business competing against the world's largest technology companies, and its growth is accelerating – up 32% year over year." By comparison, Microsoft's Intelligent Cloud segment recorded a 23% year-over-year jump in revenue in Q3 FY '21, bringing in $15.1B, which in turn implies an annual run rate in excess of $60B. Of course, the results of these three competing cloud businesses cannot be compared on an apples-to-apples basis due to their varying, and in certain cases questionable, compositions. However, the thrust of the preceding data, as readers already know, is that Alphabet is playing catch-up in the cloud.

Figure 1: Magic Quadrant for Cloud Infrastructure and Platform Services: Google Is a Leader But Still Playing Catch-Up to AMZN and MSFT

Source: Gartner

With this in mind, I suggest in this report that investors may be:

  1. Discounting Google Cloud's growth potential.
  2. Undervaluing Alphabet's stock as a result.

In short, I argue that Alphabet's current share price may assign little value to the Google Cloud business. And, for that reason, shares may be trading at a bit of a discount to some fair value, to be derived later in this report. Please note that Google shares, as other analysts argue, may not correctly reflect the firm's overall growth potential; but I am concerned here with a narrower question – that of the company's cloud segment and what "fully-valued" shares should be worth right now if the segment is assigned a "fair" value.

In the following sections, I lay out my ideas. As is often the case when I write about Alphabet, when discussing the firm itself, I sometimes use "Google" and the stock symbol "GOOGL" as synonyms for Alphabet. It should be clear to readers, based on context, when I am using "Google" and "GOOGL" from an organizational perspective. Additionally, when discussing the stock itself, I make specific reference to the Class A GOOGL shares; although, the general ideas presented herein apply to the other share classes as well (obviously).

Growth Potential

Before jumping into a quantitative analysis of my argument that Google's share price does not reflect an appropriate value for the cloud business, let us first consider the firm's opportunity in the cloud market. A Microsoft executive offered a "jaw-dropping" cloud market TAM forecast of $4T not too long ago. While MSFT's prediction may be a bit "aggressive", especially considering that the figure exceeds all expected information technology spending in 2021, it may not necessarily be wrong. Either way, and as we all already know, the cloud market is big... very big, with Gartner estimating total spend exceeding $300B in 2021, representing growth of 18% over 2020. Accordingly, despite AMZN's and MSFT's dominant positions, an abundance of space exists for GOOGL (and other players) to grow their cloud segments.

Figure 2: Worldwide Public Cloud Services End-User Spending

Source: Gartner

Naturally, Google's opportunity in the cloud should be considered in terms of those drivers catalyzing cloud market growth overall, and how the company's particular capabilities map to those drivers. Google CEO, Sundar Pichai, offered that "...there are three distinct market trends shaping [Google's] growth [in the cloud] and driving... product and go-to-market strategy" during the company's recent Q1 FY '21 earnings call. I summarize these trends and discuss them in the context of Google's offerings as follows:

1. Businesses are thirsting for increasingly sophisticated data analytics. Mr. Pichai notes "[Google's] expertise in real-time data and analytics is winning [data-heavy] companies like Twitter (TWTR) and Ingersoll Rand (IR), who are moving their complex data workloads to Google Cloud." This result is unsurprising. As I discussed in a previous Seeking Alpha article on Google and Snowflake (SNOW) in a data warehousing context, Google pioneered, more than a decade ago, "interactive-speed" analysis of massive-scale datasets via its Dremel technology, which subsequently evolved into the company's current cloud-based BigQuery offering. It is predictable, then, that Google is benefitting from data-driven trends in cloud computing. Further, as I also pointed out in the aforementioned Seeking Alpha article, Google is a leader in artificial intelligence ("AI"), and particularly machine learning ("ML"), technologies which are increasingly utilized by organizations to glean additional insights from data beyond the reach of traditional business intelligence ("BI"), as well as to support the data-driven development of new systems and products. Mr. Pichai accordingly points out that the firm's success in analytics-related cloud projects is due to "[Google's] strength in AI and ML." Of all the trends pushing Google's cloud business forward, data analytics may be the most important, as "[the] game is about data acquisition. The more corporate data that resides in a cloud the more sticky the customer is to the vendor."

2. Growth in infrastructure-as-a-service ("IaaS") projects. Google notes that it is winning important IaaS projects as large customers, in some cases, move entire data centers to the Google Cloud Platform ("GCP"), seeking the cost efficiencies and other benefits of cloud-based deployments. Moreover, in the firm's Q1 FY '21 earnings transcript, Mr. Pichai highlights the company's ability to support customers' hybrid and multi-cloud operations as a "true differentiator", enabled by technologies such as Google Anthos. On this point, GOOGL, like AMZN and MSFT, is pushing to offer organizations the tooling required to support complex applications that may operate across a variety of cloud and on-premise environments. Even Google BigQuery, mentioned earlier, has moved to a model where its engine supports massive-scale analytics – via its Omni offering – across datasets distributed over multiple cloud platforms. In other words, Google BigQuery Omni "...[empowers people] to access and analyze data, regardless of where it is stored." Still, despite these examples of Google Cloud technologies moving in the right direction, "...multicloud deployments are increasingly a two-cloud race between AWS and Microsoft Azure" at the moment. To reiterate the point from the introduction, Google is playing catch-up.

3. Software-as-a-Service ("SaaS"), the "original" cloud use case, remains a driver of cloud growth. As seen in Figure 2, Gartner forecasts the SaaS sub-market at a bit shy of $120B in 2021, with an expected growth rate of 17% in 2022. It is the largest cloud sub-market, and one where Google Workspace, inclusive of the company's popular Gmail, Meet, Sheets, and Docs offerings, appears to be thriving. Likely riding the remote work tailwind provided by the pandemic, Mr. Pichai highlighted "...[growth in Google Workspace's] revenue per seat and the number of seats in the last quarter." From a longer-term perspective, Google Workspace appears well-positioned for growth as the remote work and hybrid work trends continue to gain traction, and as organizations, in some cases, seek an alternative to Microsoft Office. Additionally, "[cloud] providers are [increasingly] going vertical to corner industries". It would be logical for Google to differentiate itself, as it is already doing, through industry-specific Google Workspace solutions. Certainly, a set of full-stack (IaaS, PaaS, SaaS) industry-specific solutions would likely prove very compelling for those organizations contemplating a move into the cloud.

Overall, the three market catalysts described above, alongside the company's technology stack, suggest Google has the ingredients to accelerate growth in its cloud segment as it capitalizes on these long-term trends. Without question, the company faces an uphill battle to match, much less surpass, the sheer size and scale of AWS and Microsoft Azure. But, consider:

  • Google itself already possesses a global computing infrastructure on par with the likes of AMZN and MSFT. Thus, the infrastructure needed to scale the Google Cloud business arguably already exists in large part; it is now a matter of scaling the business itself, which the company is trying to do, as evidenced by its recent, aggressive hiring within the unit.
  • Google is closing the cloud capabilities gap with AMZN and MSFT (albeit perhaps not as fast as investors would like).
  • A growing and profitable Google Cloud business does not, obviously, have to compete with the heavyweights on every front. If, for example, Google can increasingly capture share of cloud analytics workloads, which as I suggested above may rest within a particular "sweet spot" relative to the company's historical capabilities and are especially valuable due to their "stickiness", such a dynamic could help put the firm into a position where "[cloud] operating results...benefit from increased scale", in turn driving profitability.

Indeed, the best is likely yet to come from Google Cloud...

Valuing Google Cloud

So, if we agree that Google is well-positioned in the cloud, even if still trailing AMZN and MSFT, let us come back to the central idea of this analysis. My thought that investors may be discounting Google Cloud's growth potential centers on a simple observation derived from the firm's historical PE. Consider:

  • While there is no precise point in time at which one might say investors should have begun to "assign" a value to GOOGL's cloud business, perhaps we might – for my purposes here – say that the start of FY '12 marked a key turning point for Google Cloud as a commercial enterprise when, for example, AppEngine became an officially supported product and Google launched its Cloud Partner Program, among other developments.
  • When one examines the company's average PE before FY '12, the multiple is ~30. Interestingly, when one examines the company's average PE after FY '12 (ignoring the more recent run in the stock over the last few months), it is also ~30.

This simple observation certainly does not prove that investors are dismissing GOOGL's cloud segment. After all, it might be precisely what one would expect given that the business produces no earnings! However, this is Google, after all; and it would not be unreasonable to anticipate some measure of PE multiple expansion relative to a maturing, even if still unprofitable, Google Cloud segment. Accordingly, from my point of view, it is suggestive that the firm continues to be largely valued on the basis of its legacy advertising and related businesses.

Perhaps the idea finds additional support when one considers the richer forward PE multiples of ~60 and ~34 for AMZN and MSFT respectively, as seen in Figure 3 below.

Figure 3: GOOGL and Selected Competitor Comparison

Data Source: Yahoo Finance

Table Source: Yves Sukhu

Notes:

  • Data as of market close July 16, 2021.

The higher PE multiples for AMZN and MSFT also do not allow us to conclude anything in an absolute sense; but they could hint that the broader investor community currently places greater value on those companies' cloud businesses.

If my idea holds merit, the question is: "How much is the Google Cloud segment worth?"

It is obviously not an easy question to answer without more precise data. However, perhaps we can begin to sketch out a response by considering the segment's quarterly revenue data.

Figure 4: Google Cloud Revenues (Actual) Q4 FY '18 – Q1 FY '21

Data Source: Google Earnings Releases

Table Source: Yves Sukhu

From Figure 4, we can easily determine that the average sequential quarter-over-quarter growth rate of the business has been ~10% (as also mentioned in the introduction). Maintaining this rate over time, we can simply model cloud segment revenues through the end of FY '21 and subsequent fiscal periods.
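For readers who want to reproduce the simple model behind Figure 5, a minimal sketch follows (my own; the only inputs are the actual Q1 FY '21 result and the assumed ~10% sequential growth rate, so the quarterly figures are estimates rather than company guidance):

    # Project Google Cloud quarterly revenue at an assumed ~10% sequential growth rate
    revenue = 4.047   # $B, actual Q1 FY'21
    growth = 0.10     # assumed quarter-over-quarter growth
    for quarter in ["Q2 FY21", "Q3 FY21", "Q4 FY21",
                    "Q1 FY22", "Q2 FY22", "Q3 FY22", "Q4 FY22"]:
        revenue *= 1 + growth
        print(f"{quarter}: ~${revenue:.2f}B  (annual run rate ~${4 * revenue:.1f}B)")
    # Q4 FY'21 lands near $5.4B (> $20B run rate); Q4 FY'22 near $7.9B (> $30B run rate)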

Figure 5: Google Cloud Revenues (Estimated) Q1 FY '21 – Q4 FY '22

Source: Yves Sukhu

Note that the revenue estimates in Figure 5 suggest GOOGL's cloud segment would exceed an annual run rate of $20B by Q4 FY '21, and $30B by Q4 FY '22. Now, using this data, let us assume:

  • Class A GOOGL shares are not grossly overvalued; an assumption which finds validation in MarketBeat's analyst consensus price target of $2,513.20.
  • The Google Cloud segment will reach ~$19B in total sales (not annual run rate) at the end of FY '21, which is calculated by summing the actual Q1 FY '21 sales result with the Q2, Q3, and Q4 FY '21 estimates above.
  • Total shares outstanding at the end of FY '21 will be ~660M assuming the company reduces the total number of shares by ~1.5% via buybacks this year. Obviously, I don't know what the number will be; but this estimate seems possible.

With these assumptions in mind, we can calculate a cloud revenue/share forecast of $28.79/share in FY '21.
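That figure is simply the assumed FY '21 cloud sales divided by the assumed year-end share count (both assumptions from the bullets above):

    cloud_sales_fy21 = 19.0e9      # assumed total FY'21 Google Cloud sales, $
    shares_outstanding = 660e6     # assumed year-end share count
    print(cloud_sales_fy21 / shares_outstanding)   # ~28.79 $/share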

If we come back to my original suggestion that Google's current share price essentially assigns no value to the Google Cloud segment, what additional share value would be appropriate given the revenue/share forecast above? To answer this more refined question, we could apply an appropriate P/S ratio to arrive at a figure. If we were to "merely" apply Google's trailing-twelve-month ("ttm") PS multiple of 9.43, we would determine that the Google Cloud segment is worth ~$271.50/share, suggesting that "Google Cloud-valued" Class A GOOGL shares are worth north of $2,800/share right now. But, is that multiple reasonable? Imagine, in some alternate reality, that the Google Cloud segment were a separate, independent entity at the moment. We see, again utilizing Figure 3, that SNOW – an unprofitable enterprise like Google Cloud – commands a PS multiple of ~105. Now, I have previously written about SNOW's valuation and I think SNOW's stock is overheated – to say the least; so I don't think such a radical multiple should be applied to the Google Cloud segment. But, given that Google Cloud, AWS, and Microsoft Azure all overlap with SNOW's addressable market to – arguably – a large degree, it seems reasonable that some multiple greater than Google's current overall PS ratio would be fairer in the present market when determining the value of the segment. Indeed, if we were to argue that MSFT resembles the Google Cloud business more closely than any other competitor in Figure 3, we could reasonably apply its PS multiple of 13.4 to establish a Google Cloud business value of ~$385.80/share, implying that "Google Cloud-valued" GOOGL shares are worth north of $2,900/share at the moment.
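To make that arithmetic explicit, a short sketch (the P/S multiples come from Figure 3; the "current" share price used for the implied totals is an assumption for illustration, roughly consistent with the consensus target cited earlier, not a quoted market price):

    cloud_rev_per_share = 28.79    # FY'21 cloud revenue/share forecast from above
    assumed_share_price = 2_530    # assumption for illustration only
    for label, ps_multiple in [("GOOGL ttm P/S", 9.43), ("MSFT P/S", 13.4)]:
        cloud_value = cloud_rev_per_share * ps_multiple
        print(f"{label}: cloud worth ~${cloud_value:.2f}/share; "
              f"implied 'Cloud-valued' price ~${assumed_share_price + cloud_value:,.0f}")
    # ~$271.5/share (9.43x) and ~$385.8/share (13.4x), i.e. implied prices north of
    # $2,800 and $2,900 respectively under the assumed starting price.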

As always, we can push the numbers around any which way that we like. But, the general point, coming back to my declaration in the introduction, is that I think there is an argument to be made that GOOGL shares are undervalued right now due to investor dismissal of the cloud business, perhaps somewhere in the range of 10% - 15%.

Conclusion

MarketBeat's analyst data shows that a number of analysts boosted their price targets for the Class A shares into the general range of $2,800 - $3,000 over the last couple of months. This target range obviously coincides well with the range that I derived in the previous section. Typically, I lean a bit to the conservative side when forecasting; but my gut suggests that the Google Cloud segment is likely to power the stock well beyond $3,000 over the long term. I fully expect the "two-horse" cloud race between AMZN and MSFT to become a "three-horse" race in the not-too-distant future. I do think Google has struggled historically - for one reason or another - in terms of enterprise sales; and I think this, as opposed to any technological deficiency, may be its biggest weakness in a comparison against AMZN and MSFT. However, I obviously remain optimistic about its prospects. If an investor is seeking an undervalued cloud play, there may be no better opportunity right now than Google.

This article was written by


Tech, Software, Long Only

Contributor Since 2017

I usually write to explore and articulate ideas that inform my own investment decision-making.  So, typically, I am putting my capital where "my mouth is".  I have a professional background in enterprise software, and I am presently working in a very niche area of computing. From time to time, I stray from the technology sector to write about companies that I think are worthy long-term investments.

Disclosure: I/we have a beneficial long position in the shares of GOOGL either through stock ownership, options, or other derivatives. I wrote this article myself, and it expresses my own opinions. I am not receiving compensation for it (other than from Seeking Alpha). I have no business relationship with any company whose stock is mentioned in this article.

gygy1

20 Jul. 2021, 3:42 AM

Hire that snowflake ceo and it may reach 4k

"Possibly" being 10-15% undervalued is hardly compelling these days. The error bars on any analysis must be larger than that.

@fusion150 - I understand your point, especially as it relates to a broader investor community that is increasingly focused on "large gain, short time-frame" type opportunities. That is my polite way of saying I think the broader investment community (not you personally) is increasingly looking for "get rich quick" opportunities. And the opportunities that I see some investors piling into these days along that line appear incredibly dangerous to me. A 10% - 15% near-term upside in Alphabet shares may not seem all that compelling, but I believe -- as I argued in the report's conclusion -- that Alphabet is a long-term winner, with the cloud business likely serving as a significant contributor to top-line and bottom-line growth as time goes on. I'll take 10% - 15% now (if my idea is right), and happily sit on the stock over the long-term. :-) Thanks for reading and commenting. Regards - Yves

Google is one of the best companies in the world. They would definitely be worth a lot more 10 or 20 years from now. The best performing FAANG stock of the last 5 years, however, is Apple.

FcFr

18 Jul. 2021, 2:56 PM

In order to evaluate the potential revenue growth of GOOG's cloud business, one has to take into consideration what drives a customer to decide btw GOOG Cloud, AWS or Azure. The question is: since the addressable market is the same for these prominent players, what differentiates Google Cloud from the AMZN and MSFT offerings? Is there a comparison btw these three offerings?

@FcFr - agreed (re: in order to evaluate the potential revenue growth of GOOG cloud business, [one] has to take in consideration what drives a customer to decide btw GOOG Cloud, AWS or Azure.) And certainly there are any number of competitive comparisons available. I think areas where Google features an advantage include (1) massive-scale data analytics and (2) machine learning. This other article I wrote about Google and Snowflake makes reference to a Forrester report that also discusses Amazon and Microsoft, although the report is written in a data warehousing context:
seekingalpha.com/...

You guys apparently have a lot more money than I do to invest in a stock that pays no dividend and now "might" rise another 12% on an ante of $2,637 a share. When does the risk outweigh the reward? *10 share exposure talking.

@3carmonte when you survey top science and engineering graduates on where they want to work, google is always at the top. It's a great company. Their work in artificial intelligence cannot be discounted. When they blend this with the described analytics functionality the sky is the limit. Everyone will use it. Security is also a huge growth area for google and their cloud business. As hacks get more sophisticated cloud services will be required to do business and google has a massive loyal customer base that trusts them. That all means a lot of growth even if they stop trying. It's a money machine.

Agree, 10:1 split, IPO of 10-20 percent stakes of YouTube and Maps and GOOGL share price is off to the races. Management sees shareholders with disdain, only slowly improving since founders stepped down.

Coiled spring. I keep slowly adding.

MaritimeTrader

Google's cloud is not profitable at this size. Are there structural issues to profitability, such as low pricing or high costs vs. comps (both larger, like Amazon and Microsoft, and smaller, like Oracle)? Maybe the market should value this business very low.

@dc10023 - I think Google expects to realize economies of scale as time goes on, and the data already points to that (i.e. loss per $ of revenue is decreasing). Also, don't forget -- perhaps more so in the case of Microsoft's Intelligent Cloud -- that companies "lump" other stuff into their cloud businesses to juice the numbers. So, as I mentioned in the report, one cannot always make an "apples-to-apples" comparison between the heavyweight cloud vendors. Best - Yves

One more comment on $GOOGL... Hard to believe that the politicians on both sides of the aisle, most of whom have no idea about big business and the financial markets, are so hell-bent on annoying these fine big tech companies. The politicians don't know what they're doing. These companies (GOOGL, AMZN, FB, MSFT) are the IBMs and GMs of fifty years ago. We are world leaders in tech and these companies have led the economic success of our country over the last couple of decades. Why screw around with them? Just my two cents worth.

@esavela
Agreed, but you used the correct word in that the politicians are "annoying" the big tech companies. But I will bet that is all that will come of this saber rattling. Maybe they'll even find a way to impose a "big" fine, as though it matters, but there will be no significant change if politicians like campaign contributions. And last time I checked, politicians like campaign contributions.

Is Google 10-15% undervalued on their cloud? Probably. I hear GCS come up more and more and they are actually trying to grow the business.

But

Litigation, government fines, broke governments, indebted governments could be understated.

Probably even worse than that is the risk of mass boycotts. I'm not saying it's going to happen, but if they keep silencing voices, deleting videos, and rigging search results, the risk is north of 1%.

Long Google and I still think it's an add here.

@No Guilt - agreed that the potential impacts from government-related litigation should be on the radar of shareholders. In the end, I tend to think Google may be left alone...more or less. But, that's just my gut feeling. Best - Yves

@No Guilt
mass boycotts by whom???????
As though there is or will be for the next 10 years a competitive search engine to Google for the right wing to support. How is the twitter and FB boycott progressing thus far? It's nothing more than right wing hot air by people who are full of hot air. But even the right wingers are as dependent on Google and big tech as everyone else.
Do you realize that the Birmingham bus boycott lasted for 12 months? That is 12 months of blacks in Birmingham walking and carpooling, no matter the weather or time of day. You really think even 1% of the right wingers have anywhere near the intestinal fortitude or moral fiber to stop using big tech?
Riiiiigth.

@ATLAtty Wow. What is Riiiiigth? don't underestimate the power of the silent majority.

$GOOGL is one of my high conviction holdings. I've been in it since $816/share. Always 100 shares or more so I can sell calls against it. The cloud is noted here in the article, but I think the unlocked value is in YouTube and YouTubeTV. I understand that more internet bandwidth is used by traditional YouTube than by any other entity, including streamers such as $NFLX. I personally would like to see a 10-for-one split because, as we all know, despite not adding any true value to the holder it does pump the stock price, invites more options activity, etc. And finally YouTubeTV--a totally different product--is by far the best (IMHO) way to stream television. It is rather expensive but, unlike all others, has unlimited cloud storage. Thanks for the article.

@esavela I agree on a split being desirable for exactly the reason you state--selling calls. I'll always own this as a core and won't risk assignment, but for trading I can stand to lose a few shares if called.

Author, your chart on end user cloud spending is great. Even if Alphabet never really competes with MSFT or AMZN in that arena, I'd think it is such an integral part now of youtube and other services that I'd prefer to keep building. I'll be watching earnings closely and hope for a dip, which is common for this one. Long GOOGL.

@wiredlitigator

Maybe they don't want every Tom, Dick and Harry trading options on Google. Same as Amazon and Berk.💡


Source: https://seekingalpha.com/article/4439733-alphabet-shares-undervalue-the-cloud-and-might-be-worth-closer-to-3000

Reward is enough - ScienceDirect


Abstract

In this article we hypothesise that intelligence, and its associated abilities, can be understood as subserving the maximisation of reward. Accordingly, reward is enough to drive behaviour that exhibits abilities studied in natural and artificial intelligence, including knowledge, learning, perception, social intelligence, language, generalisation and imitation. This is in contrast to the view that specialised problem formulations are needed for each ability, based on other signals or objectives. Furthermore, we suggest that agents that learn through trial and error experience to maximise reward could learn behaviour that exhibits most if not all of these abilities, and therefore that powerful reinforcement learning agents could constitute a solution to artificial general intelligence.

Keywords

Artificial intelligence

Artificial general intelligence

Reinforcement learning

Reward

1. Introduction

Expressions of intelligence in animal and human behaviour are so bountiful and so varied that there is an ontology of associated abilities to name and study them, e.g. social intelligence, language, perception, knowledge representation, planning, imagination, memory, and motor control. What could drive agents (natural or artificial) to behave intelligently in such a diverse variety of ways?

One possible answer is that each ability arises from the pursuit of a goal that is designed specifically to elicit that ability. For example, the ability of social intelligence has often been framed as the Nash equilibrium of a multi-agent system; the ability of language by a combination of goals such as parsing, part-of-speech tagging, lexical analysis, and sentiment analysis; and the ability of perception by object segmentation and recognition. In this paper, we consider an alternative hypothesis: that the generic objective of maximising reward is enough to drive behaviour that exhibits most if not all abilities that are studied in natural and artificial intelligence.

This hypothesis may startle because the sheer diversity of abilities associated with intelligence seems to be at odds with any generic objective. However, the natural world faced by animals and humans, and presumably also the environments faced in the future by artificial agents, are inherently so complex that they require sophisticated abilities in order to succeed (for example, to survive) within those environments. Thus, success, as measured by maximising reward, demands a variety of abilities associated with intelligence. In such environments, any behaviour that maximises reward must necessarily exhibit those abilities. In this sense, the generic objective of reward maximisation contains within it many or possibly even all the goals of intelligence.

Reward thus provides two levels of explanation for the bountiful expressions of intelligence found in nature. First, different forms of intelligence may arise from the maximisation of different reward signals in different environments, resulting for example in abilities as distinct as echolocation in bats, communication by whale-song, or tool use in chimpanzees. Similarly, artificial agents may be required to maximise a variety of reward signals in future environments, resulting in new forms of intelligence with abilities as distinct as laser-based navigation, communication by email, or robotic manipulation.

Second, the intelligence of even a single animal or human is associated with a cornucopia of abilities. According to our hypothesis, all of these abilities subserve a singular goal of maximising that animal or agent's reward within its environment. In other words, the pursuit of one goal may generate complex behaviour that exhibits multiple abilities associated with intelligence. Indeed, such reward-maximising behaviour may often be consistent with specific behaviours derived from the pursuit of separate goals associated with each ability.

For example, a squirrel's brain may be understood as a decision-making system that receives sensations from, and sends motor commands to, the squirrel's body. The behaviour of the squirrel may be understood as maximising a cumulative reward such as satiation (i.e. negative hunger). In order for a squirrel to minimise hunger, the squirrel-brain must presumably have abilities of perception (to identify good nuts), knowledge (to understand nuts), motor control (to collect nuts), planning (to choose where to cache nuts), memory (to recall locations of cached nuts) and social intelligence (to bluff about locations of cached nuts, to ensure they are not stolen). Each of these abilities associated with intelligence may therefore be understood as subserving a singular goal of hunger minimisation (see Fig. 1).

Fig. 1

Fig. 1. The reward-is-enough hypothesis postulates that intelligence, and its associated abilities, can be understood as subserving the maximisation of reward by an agent acting in its environment. For example, a squirrel acts so as to maximise its consumption of food (top, reward depicted by acorn symbol), or a kitchen robot acts to maximise cleanliness (bottom, reward depicted by bubble symbol). To achieve these goals, complex behaviours are required that exhibit a wide variety of abilities associated with intelligence (depicted on the right as a projection from an agent's stream of experience onto a set of abilities expressed within that experience).

As a second example, a kitchen robot may be implemented as a decision-making system that receives sensations from, and sends actuator commands to, the robot's body. The singular goal of the kitchen robot is to maximise a reward signal measuring cleanliness.1 In order for a kitchen robot to maximise cleanliness, it must presumably have abilities of perception (to differentiate clean and dirty utensils), knowledge (to understand utensils), motor control (to manipulate utensils), memory (to recall locations of utensils), language (to predict future mess from dialogue), and social intelligence (to encourage young children to make less mess). A behaviour that maximises cleanliness must therefore yield all these abilities in service of that singular goal (see Fig. 1).

When abilities associated with intelligence arise as solutions to a singular goal of reward maximisation, this may in fact provide a deeper understanding since it explains why such an ability arises (e.g. that classification of crocodiles is important to avoid being eaten). In contrast, when each ability is understood as the solution to its own specialised goal, the why question is side-stepped in order to focus upon what that ability does (e.g. discriminating crocodiles from logs). Furthermore, a singular goal may also provide a broader understanding of each ability that may include characteristics that are otherwise hard to formalise, such as dealing with irrational agents in social intelligence (e.g. pacifying an angry aggressor), grounding language to perceptual experience (e.g. dialogue regarding the best way to peel a fruit), or understanding haptics in perception (e.g. picking a sharp object from a pocket). Finally, implementing abilities in service of a singular goal, rather than for their own specialised goals, also answers the question of how to integrate abilities, which otherwise remains an outstanding issue.

Having established that reward maximisation is a suitable objective for understanding the problem of intelligence, one may consider methods for solving the problem. One might then expect to find such methods in natural intelligence, or choose to implement them in artificial intelligence. Among possible methods for maximising reward, the most general and scalable approach is to learn to do so, by interacting with the environment by trial and error. We conjecture that an agent that can effectively learn to maximise reward in this manner would, when placed in a rich environment, give rise to sophisticated expressions of general intelligence.

A recent salutary example of both problem and solution based on reward maximisation comes from the game of Go. Research initially focused largely upon distinct abilities, such as openings, shape, tactics, and endgames, each formalised using distinct objectives such as sequence memorisation, pattern recognition, local search, and combinatorial game theory [32]. AlphaZero [49] focused instead on a singular goal: maximising a reward signal that is 0 until the final step, and then +1 for winning or −1 for losing. This ultimately resulted in a deeper understanding of each ability – for example, discovering new opening sequences [65], using surprising shapes within a global context [40], understanding global interactions between local battles [64], and playing safe when ahead [40]. It also yielded a broader set of abilities that had not previously been satisfactorily formalised – such as balancing influence and territory, thickness and lightness, and attack and defence. AlphaZero's abilities were innately integrated into a unified whole, whereas integration had proven highly problematic in prior work [32]. Thus, maximising wins proved to be enough, in a simple environment such as Go, to drive behaviour exhibiting a variety of specialised abilities. Furthermore, applying the same method to different environments such as chess or shogi [48] resulted in new abilities such as piece mobility and colour complexes [44]. We argue that maximising rewards in richer environments – more comparable in complexity to the natural world faced by animals and humans – could yield further, and perhaps ultimately all abilities associated with intelligence.

The rest of this paper is organised as follows. In Section 2 we formalise the objective of reward maximisation as the problem of reinforcement learning. In Section 3 we present our main hypothesis. We consider several important abilities associated with intelligence, and discuss how reward maximisation may yield those abilities. In Section 4 we turn to the use of reward maximisation as a solution strategy. We present related work in Section 5 and finally, in Section 6 we discuss possible weaknesses of the hypothesis and consider several alternatives.

2. Background: the reinforcement learning problem

Intelligence may be understood as a flexible ability to achieve goals. For example according to John McCarthy, "intelligence is the computational part of the ability to achieve goals in the world" [29]. Reinforcement learning [56] formalises the problem of goal-seeking intelligence. The general problem may be instantiated with a wide and realistic range of goals and worlds — and hence a wide range of forms of intelligence — corresponding to different reward signals to maximise in different environments.

2.1. Agent and environment

Like many interactive approaches to artificial intelligence [42], reinforcement learning follows a protocol that decouples a problem into two systems that interact sequentially over time: an agent (the solution) that takes decisions and an environment (the problem) that is influenced by those decisions. This is in contrast to other specialised protocols that may for example consider multiple agents, multiple environments, or other modes of interaction.

2.2. Agent

An agent is a system that receives at time $t$ an observation $O_t$ and outputs an action $A_t$. More formally, the agent is a system $A_t = \alpha(H_t)$ that selects an action $A_t$ at time $t$ given its experience history $H_t = O_1, A_1, \ldots, O_{t-1}, A_{t-1}, O_t$, in other words, given the sequence of observations and actions that have occurred in the history of interactions between the agent and the environment.

The agent system α is limited by practical constraints to a bounded set [43]. The agent has limited capacity determined by its machinery (for example, limited memory in a computer or limited neurons in a brain). The agent and environment systems execute in real-time. While the agent spends time computing its next action (e.g. producing no-op actions while deciding whether to run away from a lion), the environment system continues to process (e.g. the lion attacks). Thus, the reinforcement learning problem represents a practical problem, as faced by natural and artificial intelligence, rather than a theoretical abstraction that ignores computational limitations.

This paper does not delve into the nature of the agent, focusing instead on the problem it must solve, and the intelligence that may be induced by any solution to that problem.

2.3. Environment

An environment is a system that receives action $A_t$ at time $t$ and responds with observation $O_{t+1}$ at the next time step. More formally, an environment is a system $O_{t+1} = \varepsilon(H_t, A_t, \eta_t)$ that determines the next observation $O_{t+1}$ that the agent will receive from the environment, given experience history $H_t$, the latest action $A_t$, and potentially a source of randomness $\eta_t$.

The environment specifies within its definition the interface to the agent. The agent consists solely of the decision-making entity; anything outside of that entity (including its body, if it has one) is considered part of the environment. This includes both the sensors and actuators that define the observations for the agent and the actions available to the agent respectively.
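As a minimal illustration of this agent-environment protocol, the interaction loop can be sketched as follows (the agent and environment interfaces here are hypothetical placeholders, not part of the paper):

    # Sketch of the sequential agent-environment protocol (Sections 2.1-2.3).
    # `agent` exposes act(history); `env` exposes reset() and step(history, action).
    def run(agent, env, num_steps):
        history = [env.reset()]                       # H_1 = O_1
        for t in range(num_steps):
            action = agent.act(history)               # A_t = alpha(H_t)
            history.append(action)
            observation = env.step(history, action)   # O_{t+1} = epsilon(H_t, A_t, eta_t)
            history.append(observation)               # observations may include rewards (Section 2.4)
        return history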

Note that this definition of environment is very broad and encompasses many problem dimensions, including those in Table 1.

Table 1. The definition of environment is broad and encompasses many problem dimensions.

Dimension | Alternative A | Alternative B | Notes
Observations | Discrete | Continuous |
Actions | Discrete | Continuous |
Time | Discrete | Continuous | Time-step may be infinitesimal
Dynamics | Deterministic | Stochastic |
Observability | Full | Partial |
Agency | Single agent | Multi-agent | Other agents are part of environment, from perspective of single agent (see Section 3.3)
Uncertainty | Certain | Uncertain | Uncertainty may be represented by stochastic initial states or transitions
Termination | Continuing | Episodic | Environment may terminate and reset to an initial state
Stationarity | Stationary | Non-stationary | Environment depends upon history and hence also upon time
Synchronicity | Asynchronous | Synchronous | Observation may remain unchanged until action is executed
Reality | Simulated | Real-world | May include humans that interact with agent

2.4. Rewards

The reinforcement learning problem represents goals by cumulative rewards. A reward is a special scalar observation $R_t$, emitted at every time-step $t$ by a reward signal in the environment, that provides an instantaneous measurement of progress towards a goal. An instance of the reinforcement learning problem is defined by an environment $\varepsilon$ with a reward signal, and by a cumulative objective to maximise, such as a sum of rewards over a finite number of steps, a discounted sum, or the average reward per time-step.
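For concreteness, the cumulative objectives listed above can be written in standard reinforcement-learning notation (the explicit formulas are an addition for clarity, not from the paper's text):

    \begin{align}
    G &= \sum_{t=1}^{T} R_t && \text{(sum of rewards over a finite horizon } T\text{)} \\
    G &= \sum_{t=1}^{\infty} \gamma^{t-1} R_t, \quad 0 \le \gamma < 1 && \text{(discounted sum)} \\
    \bar{R} &= \lim_{T \to \infty} \tfrac{1}{T} \sum_{t=1}^{T} R_t && \text{(average reward per time-step)}
    \end{align}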

A wide variety of goals can be represented by rewards.2 For example, a scalar reward signal can represent weighted combinations of objectives, different trade-offs over time, and risk-seeking or risk-averse utilities. Reward can also be determined by a human-in-the-loop, for example humans may provide explicit reinforcement of desired behaviour, online feedback via clickthrough or thumbs-up, delayed feedback via questionnaires or surveys, or by a natural language utterance. Including feedback from a human can provide a mechanism to formulate seemingly fuzzy goals such as "I'll know it when I see it".

In addition to their generality, rewards also provide intermediate feedback, potentially at every time-step, on progress towards the goal. This intermediate signal is an essential part of the problem definition when considering long or infinite streams of experience – without intermediate feedback, learning is not possible.

3. Reward is enough

We have seen, in the previous section, that rewards are sufficient to express a wide variety of goals, corresponding to the diverse purposes towards which intelligence may be directed.

We now have all the ingredients to make our main point: that many different forms of intelligence can be understood as subserving the maximisation of reward, and that the many abilities associated with each form of intelligence may arise implicitly from the pursuit of those rewards. Taken to its limit, we hypothesise that all intelligence and associated abilities may be understood in this manner:

Hypothesis Reward-is-Enough

Intelligence, and its associated abilities, can be understood as subserving the maximisation of reward by an agent acting in its environment.

This hypothesis is important because, if it is true, it suggests that a good reward-maximising agent, in the service of achieving its goal, could implicitly yield abilities associated with intelligence. A good agent in this context is one that successfully – perhaps using as-yet-undiscovered algorithms – performs well at maximising cumulative reward in its environment. We return to the question of how such an agent may be constructed in Section 4.

In principle, any behaviour can be encoded by maximisation of a reward signal that is explicitly chosen to induce that behaviour [12] (for example, providing rewards when objects are correctly identified, or when a syntactically correct sentence is produced). The hypothesis here is intended to be much stronger: that intelligence and associated abilities will implicitly arise in the service of maximising one of many possible reward signals, corresponding to the many pragmatic goals towards which natural or artificial intelligence may be directed.

Sophisticated abilities may arise from the maximisation of simple rewards in complex environments. For example, the minimisation of hunger in a squirrel's natural environment demands a skilful ability to manipulate nuts that arises from the interplay between (among other factors) the squirrel's musculoskeletal dynamics; objects such as leaves, branches, or soil that the squirrel or nut may be resting upon, connected to, or obstructed by; variations in the size and shape of nuts; environmental factors such as wind, rain, or snow; and changes due to ageing, disease or injury. Similarly, the pursuit of cleanliness in a kitchen robot demands a sophisticated ability to perceive utensils in an enormous range of states that includes clutter, occlusion, glare, encrustation, damage, and so on.

Furthermore, the maximisation of many other reward signals by a squirrel (such as maximising survival time, minimising pain, or maximising reproductive success) or kitchen robot (such as maximising a healthy eating index, maximising positive feedback from the user, or maximising their gastronomic endorphins), and in many other environments (such as different habitats, other dexterous bodies, or different climates), would also yield abilities of perception, locomotion, manipulation, and so forth. Thus, the path to general intelligence may in fact be quite robust to the choice of reward signal. Indeed the ability to generate intelligence may often be orthogonal to the goal that it is given, in the sense that the maximisation of many different reward signals in many different environments may produce similar abilities associated with intelligence.

In the following sections we explore whether and how this hypothesis could be applied in practice to various important abilities, including several that are seemingly hard to formalise as reinforcement learning problems. We do not provide an exhaustive discussion of all abilities associated with intelligence, but encourage the reader to consider abilities such as memory, imagination, common sense, attention, reasoning, creativity, or emotions, and how those abilities may subserve a generic objective of reward maximisation.

3.1. Reward is enough for knowledge and learning

We define knowledge as information that is internal to the agent; for example, knowledge may be contained within the parameters of an agent's functions for selecting actions, predicting cumulative reward, or predicting features of future observations. Some of this knowledge may be innate (prior knowledge), while some knowledge may be acquired through learning.

An environment may demand innate knowledge. In particular, it may be necessary, in order to maximise total reward, to have knowledge that is immediately accessible in novel situations. For example, a new-born gazelle may need to run away from a lion. In this case, innate understanding of predator evasion may be necessary before there is any opportunity to learn this knowledge. Note, however, that the extent of prior knowledge is limited both in theory (by the capacity of the agent) and in practice (by the difficulty of constructing useful prior knowledge). Furthermore, unlike other abilities that we shall consider, environmental demand for innate knowledge cannot be operationalised; it is the knowledge that comes before experience, and therefore cannot be acquired from experience.

An environment may also demand learned knowledge. This will occur when future experience is uncertain, due to unknown elements, stochasticity, or complexity of the environment. This uncertainty results in a vast array of potential knowledge that the agent may need, depending on its particular realisation of future events. In rich environments, including many of the hardest and most important problems in natural and artificial intelligence, the total space of potential knowledge is far larger than the capacity of the agent. For example, consider an early-human environment. An agent may be born in either the Arctic or Africa, leading to radically different requirements for knowledge to maximise total reward: facing polar bears or lions; traversing glaciers or savannah; building with ice or with mud; etc. Particular events within the agent's life may also lead to radically different needs: choosing to hunt or to farm; facing locusts or war; going blind or going deaf; encountering friend or foe; etc. In each of these possible lives, the agent must acquire detailed and specialised knowledge. If the sum of such potential knowledge outstrips the agent's capacity, knowledge must be a function of the agent's experience and adapt to the agent's particular circumstances – thus demanding learning. In practice this learning may take many computational forms, for example making predictions, models or skills through the adaptation of parameters or the construction, curation and reuse of structure.

In summary, the environment may call for both innate and learned knowledge, and a reward-maximising agent will, whenever required, contain the former (for example, through evolution in natural agents and by design in artificial agents) and acquire the latter (through learning). In richer and longer-lived environments the balance of demand shifts increasingly toward learned knowledge.

3.2. Reward is enough for perception

The human world demands a variety of perceptual abilities to accumulate rewards. Some life-or-death examples include: image segmentation to avoid falling off a cliff; object recognition to classify healthy and poisonous foods; face recognition to differentiate friend from foe; scene parsing while driving; or speech recognition to understand a verbal warning to "duck!" Several modes of perception may be required, including visual, aural, olfactory, somatosensory or proprioceptive perception.

Historically, these perceptual abilities were formulated using separate problem definitions [9]. More recently, there has been a growing movement towards unifying perceptual abilities as solutions to supervised learning problems [24]. The problem is typically formalised as minimising the classification error for examples in a test set, given a training set of correctly labelled examples. The unification of many perceptual abilities as supervised learning problems has led to considerable success in a wide variety of real-world applications where large data-sets are available [23], [15], [8].

As per our hypothesis, we suggest that perception may instead be understood as subserving the maximisation of reward. For example, the perceptual abilities listed above may arise implicitly in service of maximising healthy food, avoiding accidents, or minimising pain. Indeed, perception in some animals has been shown to be consistent with reward maximisation [46], [16]. Considering perception from the perspective of reward maximisation rather than supervised learning may ultimately support a greater range of perceptual behaviours, including challenging and realistic forms of perceptual abilities:

Action and observation are typically intertwined into active forms of perception, such as haptic perception (e.g. identifying the contents of a pocket by moving fingertips), visual saccades (e.g. moving eyes to switch focus between bat and ball), physical experimentation (e.g. hitting a nut with a rock to see if it will break), or echolocation (e.g. emitting sounds at varying frequency and measuring timings and intensity of subsequent echoes).

The utility of perception often depends upon the agent's behaviour – for example, the cost of misclassifying a crocodile depends upon whether the agent is walking or swimming, and whether the agent would subsequently fight or flee.

There may be an explicit or implicit cost to acquiring information (e.g. there are energy, computational, and opportunity costs for turning the head and checking for a predator).

The distribution of data is typically context-dependent. For example, an Arctic agent may be required to classify ice and polar bears, upon which its rewards depend; while an African agent may need to classify savannah and lions. The diversity of potential data may, in rich environments, vastly exceed the agent's capacity or the quantity of pre-existing data (see Section 3.1) – requiring that perception is learned from experience.

Many applications of perception do not have access to labelled data.

3.3. Reward is enough for social intelligence

Social intelligence is the ability to understand and interact effectively with other agents. This ability is often formalised, using game theory, as the equilibrium solution of a multi-agent game. Equilibrium solutions are considered desirable because they are robust to deviation or worst-case scenarios. For example, the Nash equilibrium is a joint strategy for all agents such that no unilateral deviation gives benefit to the deviator [33]. In zero-sum games a Nash equilibrium is also minimax optimal [34]: it attains the best possible value against worst-case opponents.

According to our hypothesis, social intelligence may instead be understood as, and implemented by, maximising cumulative reward from the point of view of one agent in an environment that contains other agents. Following this standard agent-environment protocol, one agent observes the behaviour of other agents, and may affect other agents through its actions, just as it observes and affects any other aspect of its environment. An agent that can anticipate and influence the behaviour of other agents can typically achieve greater cumulative reward. Thus, if an environment needs social intelligence (e.g. because it contains animals or humans), reward maximisation will produce social intelligence.

Robustness may also be demanded by the environment, for example when the environment contains multiple agents following different strategies. If these other agents are aliased (i.e. there is no way to identify in advance which strategies other agents will follow) then a reward-maximising agent must hedge its bets and choose a robust behaviour that will be effective against any of these potential strategies. Furthermore, the other agents' strategies may be adaptive. This means that the behaviour of other agents may depend on the agent's past interactions, just like other aspects of the environment (for example, a jammed door that only opens on the nth attempt). In particular, such adaptivity may arise in environments that contain one or more reinforcement learning agents that learn to maximise their own reward. This kind of environment may call for aspects of social intelligence such as bluffing or information hiding, and a reward-maximising agent may have to be stochastic (i.e. use a mixed strategy) to avoid exploitation.

Reward maximisation may in fact lead to a better solution than an equilibrium [47]. This is because it may capitalise upon suboptimal behaviours of other agents, rather than assuming optimal or worst-case behaviour. Furthermore, reward maximisation has a unique optimal value [38], while the equilibrium value is non-unique in general-sum games [41].

3.4. Reward is enough for language

Language has been a subject of considerable study in both natural [10] and artificial intelligence [28]. Because language plays a dominant role in human culture and interactions, the definition of intelligence itself is often premised upon the ability to understand and use language, especially natural language [60].

Recently, significant success has been achieved by treating language as the optimisation of a singular objective: the predictive modelling of language within a large corpus of data [28], [8]. This approach has facilitated progress towards many subproblems of language that were often previously studied or implemented separately within natural language processing and understanding, including both syntactic subproblems (e.g. formal grammar, part-of-speech tagging, parsing, segmentation) and semantic subproblems (e.g. lexical semantics, entailment, sentiment analysis) as well as some that bring both together (e.g. summarisation, dialogue systems).

Nevertheless, language modelling by itself may not be sufficient to produce a broader set of linguistic abilities associated with intelligence that includes the following:

- Language may be intertwined with other modalities of action and observation. Language is often contextual, depending not just on what was uttered but also on what else is happening in the environment around the agent, as perceived through vision and other sensory modalities (e.g. consider a dialogue between two agents carrying an awkward object or building a shelter). Furthermore, language is often interspersed with other communicative actions, such as gestures, facial expressions, tonal variations, or physical demonstrations.
- Language is consequential and purposeful. Language utterances have consequences in the environment, typically by influencing the mental state and thereby the behaviour of other communicators within the environment. These consequences may be optimised to achieve a variety of ends: a salesperson, for example, learns to tailor their language to maximise sales, while a politician learns to tailor theirs to maximise votes.
- The utility of language varies according to the agent's situation and behaviour. For example, a miner may require language regarding the stability of rocks, while a farmer may need language regarding the fertility of soil. Furthermore, there may be an opportunity cost to language (e.g. discussing farming instead of doing the work of farming).
- In rich environments the potential uses of language to deal with unforeseen events may outstrip the capacity of any corpus. In these cases, it may be necessary to solve linguistic problems dynamically, through experience, for example by interactively developing the most effective language to control a new disease, to build a new technology, or to find a way to address a new grievance of a rival so as to forestall aggression.

According to our hypothesis, the ability of language in its full richness, including all of these broader abilities, arises from the pursuit of reward. It is an instance of an agent's ability to produce complex sequences of actions (e.g. uttering sentences) based on complex sequences of observations (e.g. receiving sentences) in order to influence other agents in the environment (cf. discussion of social intelligence above) and accumulate greater reward [7]. The pressure to comprehend and produce language can come from many reward-increasing benefits. If an agent can comprehend a "danger" warning then it can predict and avoid negative rewards. If an agent can generate a "fetch" command then this may cause the environment (say, containing a dog) to move an object nearer to the agent. Similarly, an agent may only eat if it can comprehend complex descriptions of the location of food, generate complex instructions for growing food, engage in complex dialogue to negotiate for food, or build long-term relationships that enhance those negotiations – ultimately resulting in a variety of complex linguistic skills.

3.5. Reward is enough for generalisation

Generalisation is often defined as the ability to transfer the solution to one problem into the solution to another problem [37], [58], [61]. For example, generalisation in supervised learning [37] may focus upon transferring a solution learned from one data-set, such as photographs, to another data-set, such as paintings. Generalisation in meta-learning [61], [20] has recently focused upon the problem of transferring an agent from one environment to another environment.

As per our hypothesis, generalisation may instead be understood as, and implemented by, maximising cumulative reward in a continuing stream of interaction between an agent and a single complex environment – again following a standard agent-environment protocol (see Section 2.1). Environments such as the human world demand generalisation simply because the agent encounters different aspects of the environment at different times. For example, a fruit-eating animal may encounter a new tree every day; furthermore it may become injured, suffer a drought, or face an invasive species. In each case, the animal must adapt quickly to its new state, by generalising its experience from past states. The differing states faced by the animal are not parcelled neatly into disjoint, sequential tasks with distinct labels. Instead the state depends upon the animal's behaviour; it may combine a variety of elements that overlap and recur at different time-scales; and important aspects of the state may be partially observed. Rich environments demand the ability to generalise from past states to future states – with all these associated complexities – in order to efficiently accumulate rewards.
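For reference, the "cumulative reward" maximised over such a continuing stream is commonly formalised as a discounted return (the notation is ours; the paper does not commit to a particular discounting scheme):

\[
G_t \;=\; \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}, \qquad 0 \le \gamma < 1,
\]

where \(r_{t+k+1}\) is the reward received \(k+1\) steps after time \(t\) and \(\gamma\) trades off near-term against long-term reward.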

3.6. Reward is enough for imitation

Imitation is an important ability associated with human and animal intelligence, which may facilitate the rapid acquisition of other abilities, such as language, knowledge, and motor skills. In artificial intelligence, imitation has often been formulated as a problem of learning from demonstration [45] through behavioural cloning [2], where the goal is to reproduce the actions chosen by a teacher when supplied with explicit data regarding the teacher's actions, observations and rewards, typically under the assumption that the teacher is solving a problem symmetric to the agent's. Behavioural cloning has led to several successful machine learning applications [54], [62], [5], especially those where human teacher data is plentiful but interactive experience is limited or costly. In contrast, the natural ability of observational learning [3] includes any form of learning from the observed behaviour of other humans or animals, and does not assume a symmetric teacher or require direct access to their actions, observations and rewards. This suggests that a much broader and more realistic class of observational-learning abilities, compared to direct imitation through behavioural cloning, may be demanded in complex environments:

- Other agents may be an integral part of the agent's environment (e.g. a baby observing its mother), without assuming the existence of a distinct data-set containing teacher data.
- The agent may need to learn an association between its own state (e.g. the pose of the baby's body) and the state of another agent (e.g. the pose of its mother), or between its own actions (e.g. rotating a robot's manipulator) and observations of another agent (e.g. seeing a human hand), potentially at higher levels of abstraction (e.g. imitating the choice of food by the mother rather than her muscle activations).
- Other agents may be partially observed (e.g. where the human hand is occluded), such that their actions or goals may only be imperfectly inferred, perhaps in hindsight.
- Other agents may demonstrate undesirable behaviours that should be avoided.
- There may be many other agents in the environment, exhibiting different skills or different levels of competence.
- Observational learning may even occur without any explicit agency (e.g. imitating the idea of bridge-building from observations of a log fallen across a stream).

We speculate that these broader abilities of observational learning could be driven by the maximisation of reward, from the perspective of a single agent that simply observes other agents as integral parts of its environment [6], potentially leading to many of the same benefits as behavioural cloning – such as the sample-efficient acquisition of knowledge – but in a much wider and more integrated context.
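By way of contrast with the broader observational-learning setting sketched above, behavioural cloning in its simplest form reduces to supervised learning on logged observation-action pairs. The following is a minimal sketch of that baseline (an illustration of ours using PyTorch; the dimensions and data are placeholders, not anything from the paper):

# Minimal behavioural-cloning sketch: fit a policy to reproduce a teacher's
# actions from a fixed data-set of (observation, action) pairs.
import torch
import torch.nn as nn

obs_dim, n_actions = 16, 4                      # placeholder sizes
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                       nn.Linear(64, n_actions))
optimiser = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Stand-in demonstration data; in practice this would come from a teacher.
demo_obs = torch.randn(1024, obs_dim)
demo_actions = torch.randint(0, n_actions, (1024,))

for step in range(1000):
    logits = policy(demo_obs)
    # Supervised objective: match the teacher's chosen actions.
    loss = nn.functional.cross_entropy(logits, demo_actions)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()

The contrast with the list above is that nothing here requires the teacher's actions, observations or rewards to be directly available; observational learning must instead recover such associations from the agent's own stream of experience.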

3.7. Reward is enough for general intelligence

For our last example, we turn to the ability that simultaneously poses the greatest challenge and for which our hypothesis offers the greatest potential benefit. General intelligence, of the sort possessed by humans and perhaps also other animals, may be defined as the ability to flexibly achieve a variety of goals in different contexts. For example, humans can flexibly address problems (such as locomotion, transportation or communication) with solutions (such as swimming or skiing, carrying or kicking, writing or sign language) appropriate to their circumstances. General intelligence is sometimes formalised by a set of environments that measures the agent's capabilities across a variety of different goals and contexts [25], [14].

According to our hypothesis, general intelligence can instead be understood as, and implemented by, maximising a singular reward in a single, complex environment. For example, natural intelligence faces a contiguous stream of experience throughout its lifetime, generated from interactions with the natural world. An animal's stream of experience is sufficiently rich and varied that it may demand a flexible ability to achieve a vast variety of subgoals (such as foraging, fighting, or fleeing), in order to succeed in maximising its overall reward (such as hunger or reproduction). Similarly, if an artificial agent's stream of experience is sufficiently rich, then singular goals (such as battery-life or survival) may implicitly require the ability to achieve an equally wide variety of subgoals, and the maximisation of reward should therefore be enough to yield an artificial general intelligence.

4. Reinforcement learning agents

Our main hypothesis, that intelligence and its associated abilities may be understood as subserving the maximisation of reward, is agnostic to the nature of the agent. This leaves open the important question of how to construct an agent that maximises reward. In this section, we suggest that this question may also be answered by reward maximisation. Specifically, we consider agents with a general ability to learn how to maximise reward from their ongoing experience of interacting with the environment. Such agents, which we refer to as reinforcement learning agents, provide several advantages.

First, among all possible solution methods for maximising reward, surely the most natural approach is to learn to do so from experience, by interacting with the environment. Over time, that interactive experience provides a wealth of information about cause and effect, about the consequences of actions, and about how to accumulate reward. Rather than predetermining the agent's behaviour (placing faith in the designer's foreknowledge of the environment), it is natural instead to bestow the agent with a general ability to discover its own behaviours (placing faith in experience). More specifically, the design goal of maximising reward is implemented through an ongoing internal process of learning, from experience, a behaviour that maximises future reward.
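As a deliberately tiny, self-contained instance of this idea (our own toy example, not an algorithm from the paper), the following agent learns, purely from its own experience, which of three actions yields the most reward:

# A minimal reward-maximising learner: a three-armed bandit environment and an
# epsilon-greedy agent that estimates action values from its own experience.
import random

class BanditEnv:
    """Toy environment: each action pays +1 reward with a fixed probability."""
    def __init__(self, win_probs):
        self.win_probs = win_probs
    def step(self, action):
        return 1.0 if random.random() < self.win_probs[action] else 0.0

class EpsilonGreedyAgent:
    """Keeps a running estimate of each action's value and mostly picks the best."""
    def __init__(self, n_actions, epsilon=0.1):
        self.values = [0.0] * n_actions
        self.counts = [0] * n_actions
        self.epsilon = epsilon
    def act(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.values))                     # explore
        return max(range(len(self.values)), key=self.values.__getitem__)  # exploit
    def update(self, action, reward):
        self.counts[action] += 1
        # Incremental average of the rewards observed for this action.
        self.values[action] += (reward - self.values[action]) / self.counts[action]

env = BanditEnv(win_probs=[0.2, 0.5, 0.8])
agent = EpsilonGreedyAgent(n_actions=3)
total = 0.0
for _ in range(10_000):
    action = agent.act()
    reward = env.step(action)
    agent.update(action, reward)   # behaviour improves as experience accumulates
    total += reward
print(total)   # typically around 7,500-7,800: the best action is discovered from experience

Nothing about the best behaviour is specified in advance; the agent discovers it from the stream of rewards alone, which is the essential property that scales, with far richer learning machinery, to the complex environments discussed below.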

Second, reinforcement learning agents, by learning from experience, provide a general solution method that may be effective, with minimal or even zero modification, across many different reward signals and environments.

Furthermore, a single environment may be so complex, like the natural world, that it contains a heterogeneous diversity of possible experiences. The potential variations in the stream of observations and rewards faced by a long-lived agent will inevitably outstrip its capacity for preprogrammed behaviours (see Section 3.1). To achieve high reward, the agent must therefore be equipped with a general ability to fully and continually adapt its behaviour to new experiences. Indeed, reinforcement learning agents may be the only feasible solutions in such complex environments.

A sufficiently powerful and general reinforcement learning agent may ultimately give rise to intelligence and its associated abilities. In other words, if an agent can continually adjust its behaviour so as to improve its cumulative reward, then any abilities that are repeatedly demanded by its environment must ultimately be produced in the agent's behaviour. A good reinforcement learning agent could thus acquire behaviours that exhibit perception, language, social intelligence and so forth, in the course of learning to maximise reward in an environment, such as the human world, in which those abilities have ongoing value.

We do not offer any theoretical guarantee on the sample efficiency of reinforcement learning agents. Indeed, the rate at which, and the degree to which, abilities emerge will depend upon the specific environment, learning algorithm, and inductive biases; furthermore, one may construct artificial environments in which learning will fail. Instead, we conjecture that powerful reinforcement learning agents, when placed in complex environments, will in practice give rise to sophisticated expressions of intelligence. If this conjecture is correct, it offers a complete pathway towards the implementation of artificial general intelligence.

Several recent examples of reinforcement learning agents, endowed with an ability to learn to maximise rewards, have given rise to broadly capable behaviours that exceeded expectations, the performance of prior agents, and in several cases, the performance of human experts. For example, when asked to maximise wins in the game of Go, AlphaZero (see Section 1) learned an integrated intelligence across many facets of Go [49]; when the same algorithm was applied to maximise outcomes in the game of chess, AlphaZero learned a different set of abilities encompassing openings, endgames, piece mobility, king safety and so forth [48], [44]. Reinforcement learning agents that maximise score in Atari 2600 [30] have learned a range of abilities, including aspects of object recognition, localisation, navigation, and motor control demanded by each particular Atari game, while agents that maximise successful grasps in vision-based robotic manipulation have learned sensorimotor abilities such as object singulation, regrasping, and dynamic object tracking [22]. While these examples are far narrower in scope than the environments faced by natural intelligence, they provide some practical evidence for the effectiveness of the reward maximisation principle.

One may of course wonder how to learn to maximise reward effectively in a practical agent. For example, the reward could be maximised directly (e.g. by optimising the agent's policy [57]), or indirectly, by decomposing into subgoals such as representation learning, value prediction, model-learning and planning, which may themselves be further decomposed [56]. We do not address this question further in this paper, but note that it is the central question studied throughout the field of reinforcement learning.
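To make the "direct" option concrete, one standard policy-gradient estimator (written in our own notation; we do not claim it is the specific method of the cited work) adjusts the policy parameters θ along

\[
\nabla_\theta J(\theta) \;=\; \mathbb{E}_{\pi_\theta}\!\left[\, \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid h_t)\, G_t \right],
\]

where \(h_t\) is the agent's history at time \(t\) and \(G_t\) the subsequent return; indirect approaches instead learn intermediate quantities such as value functions or environment models and derive behaviour from them.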

5. Related work

Intelligence has long been associated with goal-oriented behaviour [59]. This goal-oriented notion of intelligence is central to the concept of rationality, in which an agent selects actions in a manner that optimally achieves its goals or maximises its utility. Rationality has been widely used to understand human behaviour [63], [4], and as a basis for artificial intelligence [42], formalised with respect to an agent interacting with an environment.

It has frequently been argued that computational constraints should also be taken into account when reasoning about goals. Bounded rationality [50], [43], [36] and computational rationality [27] suggest that agents should select the program that best achieves their goals, given the real-time consequences arising from that program (for example, how long the program takes to execute), and subject to limitations on the set of programs (for example, limiting the maximum size of the program). Our contribution builds upon these viewpoints, but focuses on the question of whether a single, simple reward-based goal could provide a common basis for all abilities associated with intelligence.

The standard protocol for reinforcement learning was defined by Sutton and Barto [56]; in Section 2 we presented a common generalisation with partially observed histories.

Unified cognitive architectures [35], [1] aspire towards general intelligence. They combine a variety of solution methods for separate subproblems (such as perception or motor control), but do not provide a generic objective that justifies and explains the choice of architecture, nor a singular goal towards which the individual components contribute.

The perspective of language as reward-maximisation dates back to behaviourism [52], [53]; however, the reinforcement learning problem differs from behaviourism in allowing an agent to construct and use internal state. In multi-agent environments, the advantages of focusing on a single agent's objective, rather than an equilibrium objective, were discussed by Shoham and Powers [47].

6. Discussion

We have presented the reward-is-enough hypothesis and some of its implications. Next we briefly answer a number of questions that frequently arise in discussions of this hypothesis.

Which environment? One may ask which environment will give rise, through reward maximisation, to the "most intelligent" behaviour or to the "best" specific ability (for example natural language). Inevitably, the specific environmental experiences encountered by an agent (for example, the friends, foes, teachers, toys, tools, or libraries encountered during the lifetime of a human brain) will shape the nature of its subsequent abilities. While this question may be of great interest for any specific application of intelligence, we have focused instead upon the arguably more profound question of which generic objective could give rise to all forms of intelligence. The maximisation of different rewards in different environments may lead to distinct, powerful forms of intelligence, each of which exhibits its own impressive, and yet incomparable, array of abilities. A good reward-maximising agent will harness any elements present in its environment, but the emergence of intelligence in some form is not predicated upon their specifics. For example, the human brain develops differently from birth when exposed to different experiences in its environment, but will acquire sophisticated abilities regardless of its specific culture or education.

Which reward signal? The desire to manipulate the reward signal often arises from the idea that only a carefully constructed reward could possibly induce general intelligence. By contrast, we suggest the emergence of intelligence may be quite robust to the nature of the reward signal. This is because environments such as the natural world are so complex that intelligence and its associated abilities may be demanded by even a seemingly innocuous reward signal. For example, consider a signal that provides +1 reward to the agent each time a round-shaped pebble is collected. In order to maximise this reward signal effectively, an agent may need to classify pebbles, to manipulate pebbles, to navigate to pebble beaches, to store pebbles, to understand waves and tides and their effect on pebble distribution, to persuade people to help collect pebbles, to use tools and vehicles to collect greater quantities, to quarry and shape new pebbles, to discover and build new technologies for collecting pebbles, or to build a corporation that collects pebbles.

What else, other than reward maximisation, could be enough for intelligence? Unsupervised learning [19] (e.g. to identify patterns in observations) and prediction [18], [11] (e.g. of future observations [31]) may provide effective principles for understanding experience, but do not provide a principle for action selection and therefore cannot in isolation be enough for goal-oriented intelligence.

Supervised learning gives a mechanism for mimicking human intelligence; given enough human data one might imagine that all abilities associated with human intelligence could emerge. However, supervised learning from human data cannot be enough for a general-purpose intelligence that is able to optimise for non-human goals in non-human environments. Even where human data is plentiful, imitative intelligence may be limited in scope to the behaviours already known and exhibited by humans within the data, rather than discovering creative new behaviours that solve problems in unexpected ways (see also offline learning, below).

Evolution by natural selection can be understood at an abstract level as maximising fitness, as measured by individual reproductive success, optimised by a population-based mechanism such as mutation and crossover. In our framework, reproductive success can be seen as one possible reward signal that has driven the emergence of natural intelligence. However, artificial intelligence may be designed with goals other than reproductive success, and may maximise the corresponding reward signals using methods other than mutation and crossover, potentially leading to very different forms of intelligence. Furthermore, while fitness maximisation may explain the initial configuration of natural intelligence (e.g. a human baby brain), a process of trial-and-error learning to maximise an intrinsic reward signal [51] (see Section 4) may then explain how natural intelligence adapts through experience to develop sophisticated abilities (e.g. a human adult brain) in the service of fitness maximisation.

Maximisation of free energy or minimisation of surprise [13] may yield several abilities of natural intelligence, but does not provide a general-purpose intelligence that can be directed towards a broad diversity of different goals in different environments. Consequently, it may also miss abilities that are demanded by the optimal achievement of any one of those goals (for example, aspects of social intelligence required to mate with a partner, or tactical intelligence required to checkmate an opponent).

Optimisation is a generic mathematical formalism that may maximise any signal, including cumulative reward, but does not specify how an agent interacts with its environment. By contrast, the reinforcement learning problem includes interaction at its heart: actions are optimised to maximise reward, those actions in turn determine the observations received from the environment, which themselves inform the optimisation process; furthermore, optimisation occurs online, in real time, while the environment continues to tick.

Which reward maximisation problem? Even amongst reinforcement learning researchers, a number of variations of the reward maximisation problem are studied. Rather than following a standard agent-environment protocol, researchers often modify the interaction loop for cases that may include multiple agents, multiple environments, or multiple training lifetimes. Rather than maximising a generic objective defined by cumulative reward, the goal is often formulated separately for different cases: for example, multi-objective learning, risk-sensitive objectives, or objectives that are specified by a human-in-the-loop. Furthermore, rather than addressing the problem of reward maximisation for general environments, special-case problems are often studied for a particular class of environments, such as linear environments, deterministic environments, or stable environments. While this may be appropriate for specific applications, a solution to a specialised problem does not usually generalise; in contrast, a solution to the general problem will also provide a solution for any special cases. The reinforcement learning problem may also be transformed into a probabilistic framework that approximates the goal of reward maximisation [66], [39], [26], [17]. Finally, universal decision-making frameworks [21] provide a theoretical but incomputable formulation of intelligence across all environments, while the reinforcement learning problem provides a practical formulation of intelligence in a given environment.

Can offline learning from a sufficiently large data-set be enough for intelligence? Offline learning may only be enough to solve problems that are already, to a large extent, solved within the available data. For example, a large data-set of squirrels collecting nuts is unlikely to demonstrate all the behaviours required to build a nut harvester. Although it may be possible to generalise to the agent's current problems from solutions demonstrated in, or extracted from, an offline data-set, in complex environments this generalisation will inevitably be imperfect. Furthermore, the data necessary to solve the agent's current problems will often have a negligible probability of occurring in offline data (e.g. under random behaviour or imperfect human behaviour). Online interaction allows an agent to specialise to the problems it is currently facing, to continually verify and correct the most pressing holes in its knowledge, and to find new behaviours that are very different from, and achieve greater reward than, those in the data-set.

Is the reward signal too impoverished? One might wonder whether sample-efficient reinforcement learning agents necessarily exist that can maximise reward in complex, unknown environments. In answering this question, we note first that an effective agent may make use of additional experiential signals to facilitate the maximisation of future reward. Many solution methods, including model-free reinforcement learning, learn to associate future reward with features of observations, through value function approximation, which in turn provide rich secondary signals that drive learning of deeper associations through a recursive bootstrapping process [55]. Other solution methods, including model-based reinforcement learning, construct predictions of observations, or of features of observations, that facilitate the subsequent maximisation of reward through planning. Furthermore, as discussed in Section 3.6, observations of other agents within the environment may also facilitate rapid learning.
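As a concrete instance of such a bootstrapped secondary signal (a standard temporal-difference update, written in our notation rather than quoted from the paper), a state-value estimate V is nudged after every transition by

\[
V(s_t) \;\leftarrow\; V(s_t) + \alpha \left[\, r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right],
\]

so that the learned values themselves become a dense training signal that propagates reward information backwards through experience, even when the rewards are sparse.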

Nevertheless, it is often the challenge of sample-efficient reinforcement learning in complex environments that has led researchers to introduce assumptions or develop simpler abstractions that are more amenable to both theory and practice. However, these assumptions and abstractions may simply side-step the difficulties that must inevitably be faced by any broadly capable intelligence. We choose instead to accept the challenge head-on and focus upon its solution; we hope that other researchers will join us on our quest.

7. Conclusion

In this paper we have presented the hypothesis that the maximisation of total reward may be enough to understand intelligence and its associated abilities. The key idea is that rich environments typically demand a variety of abilities in the service of maximising reward. The bountiful expressions of intelligence found in nature, and presumably in the future for artificial agents, may be understood as instantiations of this same idea with different environments and different rewards. Furthermore, a singular goal of reward maximisation may give rise to deeper, broader and more integrated understanding of abilities than specialised problem formulations for each distinct ability. In particular, we explored in greater depth several abilities that may at first glance seem hard to comprehend through reward maximisation alone, such as knowledge, learning, perception, social intelligence, language, generalisation, imitation, and general intelligence, and found that reward maximisation could provide a basis for understanding each ability. Finally, we have presented a conjecture that intelligence could emerge in practice from sufficiently powerful reinforcement learning agents that learn to maximise future reward. If this conjecture is true, it provides a direct pathway towards understanding and constructing an artificial general intelligence.

Declaration of Competing Interest

All authors are employees of DeepMind.

Acknowledgements

We would like to thank the reviewers and our colleagues at DeepMind for their input into this article.

References

Source: https://www.sciencedirect.com/science/article/pii/S0004370221000862