Bulletins / 2024.08.01

2024.08.01 22:35 PDT / 05:35 UTC

The fine folks over at deeplearning.ai are openly saying they would prefer that AI companies just ignore robots.txt files, and they hope the courts back them on this.

The Batch Issue 260

What’s new: Researchers at MIT analyzed websites whose contents appear in widely used training datasets. Between 2023 and 2024, many of these websites changed their terms of service to ban web crawlers, restricted the pages they permit web crawlers to access, or both.
…
Yes, but: The instructions in robots.txt files are not considered mandatory, and web crawlers can disregard them. Moreover, most websites have little ability to enforce their terms of use, which opens loopholes. For instance, if a site disallows one company’s crawler, the company may hire an intermediary to scrape the site.
…
We’re thinking: We would prefer that AI developers be allowed to train on data that’s available on the open web. We hope that future court decisions and legislation will affirm this.

(Highlight mine).
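
For context on the "not considered mandatory" part of the quote: robots.txt only does anything if the crawler chooses to consult it. Here is a minimal sketch of what that voluntary check looks like, using Python's standard-library urllib.robotparser. The robots.txt content, the bot names (ExampleAIBot, OtherBot), and the URLs are all made up for illustration and are not taken from any real site or crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one made-up AI crawler entirely,
# and keeps every other crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if the rules above permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleAIBot", "OtherBot"):
        for url in ("https://example.com/articles/1",
                    "https://example.com/private/notes"):
            print(agent, url, is_allowed(agent, url))
```

Nothing on the server side runs this check. A crawler that never calls something like is_allowed() simply fetches the page, which is why the file amounts to a request rather than an access control.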