Court filings show Meta staffers discussed using copyrighted content for AI training | TechCrunch

For years, Meta employees have internally discussed using copyrighted works obtained through legally questionable means to train the company’s AI models, according to court documents unsealed on Thursday.

The documents were filed by the plaintiffs in Kadrey v. Meta, one of many AI copyright disputes slowly working their way through the US court system. The defendant, Meta, claims that training models on IP-protected works, particularly books, is “fair use.” The plaintiffs, who include authors Sarah Silverman and Ta-Nehisi Coates, disagree.

Previous materials filed in the lawsuit alleged that Meta CEO Mark Zuckerberg gave Meta’s AI team the OK to train on copyrighted content and that Meta halted AI training data licensing talks with book publishers. But the new filings, most of which show portions of internal work chats between Meta employees, paint the clearest picture yet of how Meta may have come to use copyrighted data to train its models, including those in the company’s Llama family.

In one chat, Meta employees, including Melanie Kambadur, a senior manager on Meta’s Llama model research team, discussed training models on works they knew might be legally fraught.

“[M]y opinion would be (in the line of ‘ask forgiveness, not for permission’): we try to acquire the books and escalate it to execs so they make the call,” wrote Xavier Martinet, a Meta research engineer, in a chat dated February 2023, according to the filings. “[T]his is why they set up this gen ai org for [sic]: so we can be less risk averse.”

Martinet floated the idea of buying e-books at retail prices to build a training set instead of cutting licensing deals with individual book publishers. After another employee pointed out that using unauthorized, copyrighted material could be grounds for a legal challenge, Martinet doubled down, arguing that “a gazillion” startups were probably already training on pirated books.

“I mean, worst case: we found out it is finally ok, while a gazillion start up [sic] just pirated tons of books on bittorrent,” Martinet wrote, according to the filings. “[M]y 2 cents again: trying to have deals with publishers directly takes a long time […]”

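In the same chat, Kambadur cautioned that while using publicly available data for model training would still require approvals, Meta’s lawyers were being less conservative with such approvals than they had been in the past.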

“Yeah we definitely need to get licenses or approvals on publicly available data still,” Kambadur said, according to the filings. “[D]ifference now is we have more money, more lawyers, more bizdev help, ability to fast track/escalate for speed, and lawyers are being a bit less conservative on approvals.”

Talks of Libgen

In another work chat relayed in the filings, Kambadur discusses possibly using Libgen, a “links aggregator” that provides access to copyrighted works from publishers, as an alternative to data sources that Meta might license.

Libgen has been sued multiple times, ordered to shut down, and fined tens of millions of dollars for copyright infringement. One of Kambadur’s colleagues responded with a screenshot of a Google search result for Libgen containing the snippet “No, Libgen is not legal.”

Some decision-makers within Meta appear to have been under the impression that failing to use Libgen for model training could seriously hurt Meta’s competitiveness in the AI race, according to the filings.

In an email addressed to Meta AI VP Joelle Pineau, Sony Theakanath, director of product management at Meta, called Libgen “essential to meet SOTA numbers across all categories,” referring to topping the best, state-of-the-art (SOTA) AI models and benchmark categories.

Theakanath also outlined “mitigations” in the email intended to reduce Meta’s legal exposure, including removing data from Libgen “clearly marked as pirated/stolen” and not publicly citing its usage at all. As Theakanath put it, “We would not disclose use of Libgen datasets used to train.”

In practice, these mitigations entailed combing through Libgen files for words like “stolen” or “pirated,” according to the filings.

In a work chat, Kambadur mentioned that Meta’s AI team also tuned models to “avoid IP risky prompts,” i.e., configured the models to refuse to answer questions like “reproduce the first three pages of ‘Harry Potter and the Sorcerer’s Stone’” or “tell me which e-books you were trained on.”

The filings contain other revelations, implying that Meta may have scraped Reddit data for some type of model training, possibly by mimicking the behavior of a third-party app called Pushshift. Notably, Reddit said in April 2023 that it planned to begin charging AI companies for access to its data for model training.

In a chat dated March 2024, Chaya Nayak, director of product management at Meta’s generative AI org, said Meta leadership was considering “overriding” past decisions on training data, including a decision not to use Quora content or licensed books and scientific articles, to ensure the company’s models had sufficient training data.

Nayak implied that Meta’s first-party training datasets (Facebook and Instagram posts, text transcribed from videos on Meta platforms, and certain Meta for Business messages) simply weren’t enough. “[W]e need more data,” she wrote.

Since the case was filed in the U.S. District Court for the Northern District of California in 2023, the plaintiffs have alleged that Meta cross-referenced certain pirated books with copyrighted books available for license to determine whether it made sense to pursue a licensing agreement with a publisher.

In a sign of how high Meta considers the legal stakes to be, the company has added two Supreme Court litigators from the law firm Paul Weiss to its defense team in the case.

Meta did not immediately respond to a request for comment.
