Authors Take Microsoft to Court Over Alleged AI Copyright Violations

A group of prominent authors has filed a lawsuit against Microsoft, accusing the tech giant of willful copyright infringement in the development of its large language model (LLM) known as Megatron. Filed in the Southern District of New York, the suit claims Microsoft illegally used pirated copies of their books to train its AI, despite knowing such use would require proper licensing.

The plaintiffs include Kai Bird, Jia Tolentino, Eloisa James, Hampton Sides, Victor LaValle, Mary Bly, Jonathan Alter, Eugene Linden, Daniel Okrent, Rachel Vail, and Simon Winchester. They allege that Microsoft knowingly used the controversial Books3 dataset, a trove of nearly 200,000 pirated titles, to improve the performance and capabilities of its LLMs.

In their complaint, the authors highlight Microsoft’s 2023 licensing deal with HarperCollins as evidence that the company was fully aware of its legal obligations around using copyrighted content. They argue that Microsoft’s decision to tap into the Books3 dataset was a shortcut designed to save time and money, giving it an edge in the race to commercialize generative AI tools.

“The end result is a computer model that is not only built on the work of thousands of creators and authors,” the complaint states, “but also built to generate a wide range of expression that mimics the syntax, voice, and themes of the copyrighted works on which it was trained.”

The lawsuit contends that Microsoft’s actions have contributed to a larger problem—legitimizing and sustaining the use of pirated literary databases. “Microsoft’s intentional decision to use pirated libraries allowed it to gain huge advantages in the timing and efficiency of its LLMs,” the filing reads. “Meanwhile, its use of pirated libraries helped sustain and foster rampant copyright violations by keeping these pirated libraries in business and providing them a seal of approval.”

The authors are asking the court to halt Microsoft’s use of their copyrighted works and are seeking damages of up to $150,000 per infringed book under U.S. copyright law.

This case joins a growing wave of litigation targeting how tech companies develop and train AI models. With the stakes mounting across the publishing industry and beyond, the outcome could significantly influence how LLMs are trained—and who gets compensated for the data used.

