Web Giants Back New AI Scraping Standard

A new open standard is emerging to help web publishers set the rules for AI companies that scrape their content. Called Really Simple Licensing (RSL), the standard aims to give participating publishers machine-readable, enforceable licensing terms that AI firms are expected to honor. Backed by major players including Reddit, Yahoo, Medium, and People Inc., the initiative seeks to establish a framework for compensating publishers in the age of artificial intelligence.

The RSL standard builds on the existing robots.txt protocol, the simple file that gives web crawlers their instructions, and adds licensing terms that publishers can set. The available options include free use, attribution requirements, subscriptions, pay-per-crawl, and a novel pay-per-inference model. Under that last option, an AI company pays a publisher only when that publisher's content is actually used to generate a response for a user.
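To make the mechanics concrete, here is a minimal sketch of how a crawler might discover a publisher's licensing terms via robots.txt. The `License:` directive name and the terms-file URL are illustrative assumptions for this sketch, not the published RSL syntax; the actual specification defines the real directive names and document format.

```python
# Illustrative only: the "License:" directive and the terms-file URL below
# are assumptions for this sketch, not the official RSL syntax.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Allow: /
# assumed RSL pointer; real directive names may differ
License: https://example.com/rsl-terms.xml
"""

def find_license_urls(robots_txt: str) -> list[str]:
    """Collect URLs from License-style directives in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comment lines/suffixes
        if line.lower().startswith("license:"):
            urls.append(line.split(":", 1)[1].strip())
    return urls

print(find_license_urls(SAMPLE_ROBOTS_TXT))
# ['https://example.com/rsl-terms.xml']
```

A compliant crawler would then fetch the referenced terms file and check which option applies (free use, attribution, subscription, pay-per-crawl, or pay-per-inference) before ingesting the content.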

A new nonprofit organization, the RSL Collective, is launching alongside the standard. It positions itself as a counterpart to music royalty collectives like ASCAP and BMI, with a stated mission of establishing fair market prices and strengthening publishers' negotiating power against large AI corporations. The list of participating brands reads like a who's who of internet old-schoolers, including Reddit, Yahoo, Internet Brands, Ziff Davis, wikiHow, O'Reilly Media, Medium, The Daily Beast, and Ranker. The effort is led by former Ask.com CEO Doug Leeds and RSS co-creator Eckart Walther.

Reddit CEO Steve Huffman voiced support for the initiative, stating that the RSL Standard offers a clear and scalable way for publishers to set licensing terms. He noted that the collective approach is an important step toward protecting the open web. The endorsement carries weight: Reddit has already signed its own multimillion-dollar content licensing deals with AI giants like OpenAI and Google.

A significant question remains whether AI companies will voluntarily honor the new standard; some of these firms have a history of ignoring robots.txt instructions altogether. The RSL Collective, however, believes its terms will be legally enforceable. Leeds pointed to Anthropic's recent massive copyright settlement as evidence that there is real financial risk for AI companies that do not source their training data legitimately. He also suggested that the collective nature of the standard could spread legal costs across members, making it more feasible for publishers to challenge violations.

On the technical side, the standard itself cannot block bots. To address this, the group is partnering with the cloud company Fastly, which can act as a gatekeeper for participating publishers. Leeds likened the arrangement to having a bouncer at the door of the club.
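As a rough illustration of that bouncer metaphor, the sketch below shows what license-gated crawling could look like at the edge. Every name in it, from the crawler registry to the token check, is a hypothetical stand-in; this is not Fastly's actual mechanism or any part of the RSL spec.

```python
# Hypothetical edge "bouncer": the registry, names, and token check are
# stand-ins for illustration, not Fastly's actual mechanism.
LICENSED_CRAWLERS = {
    "example-ai-bot": "pay-per-inference",  # assumed license terms
    "other-ai-bot": "pay-per-crawl",
}

def admit(user_agent: str, has_valid_license_token: bool) -> bool:
    """Let ordinary readers through; admit only crawlers with a license."""
    looks_like_crawler = "bot" in user_agent.lower()
    if not looks_like_crawler:
        return True                          # regular browser traffic
    return user_agent in LICENSED_CRAWLERS and has_valid_license_token

print(admit("Mozilla/5.0", False))           # True: normal reader
print(admit("example-ai-bot", True))         # True: licensed crawler
print(admit("unlicensed-bot", False))        # False: turned away
```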

The RSL Collective argues there are incentives for AI companies to participate. Leeds suggested it could be simpler and more efficient than negotiating individual licensing deals with countless publishers. It could also solve a problem for AI models, which often have to use multiple inferior sources to avoid copying too much from any single one. If content is legally licensed, an AI could simply use the single best source, leading to higher-quality answers and fewer factual errors or hallucinations.

Leeds also addressed complaints from AI companies that there is no effective way to license content from across the entire web. He said the RSL standard is a direct response to that need, providing a scalable licensing protocol. AI firms get access to all the content they want, with the incentive that they pay only for the best content their models actually use. The underlying principle is simple: if they use it, they pay for it; if they don't, they won't.
