The practice of Big Tech firms like OpenAI and Google scraping data to train AI models has already triggered a wave of copyright infringement lawsuits. Artists and authors with distinguished works to their name were always expected to be more litigious – the theft of their work is easier to spot and to fight.

Separately, cash-rich tech firms are signing deals with platforms to use their proprietary data to train AI systems. The $60 million Google-Reddit licensing deal foreshadows an AI licensing boom for content platforms.

But for most content platforms, the data protection agreements with their users are loosely worded and catch-all, and they offer users very little legal recourse.

What is driving these AI licensing deals?

The data scraping issue has been wearing down content platforms like social media companies and news publishers, which have been struggling to make profits. Internet wormhole Reddit has been trying to tamp down costs, given it doesn’t generate as much ad revenue as other social media platforms.


Reddit CEO Steve Huffman worried that the platform was losing tens of millions of dollars supporting companies that dipped into its data for free, and wanted to charge them a fee for it.

In June last year, for the first time, the company decided to charge for access to its application programming interface, or API – the system that software developers outside the company use to draw content from a platform. (X, owned by Elon Musk, cut off free access to its API in April last year after a similar realisation.)
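To illustrate, the kind of third-party access Reddit began charging for looks roughly like this: a script builds a request against a public listing endpoint and pulls post data out of the JSON response. The endpoint path and response fields below are a simplified sketch of Reddit’s public JSON interface, not its official API contract, and real use now requires registering an app and paying the applicable fees.

```python
from urllib.parse import urlencode

BASE = "https://www.reddit.com"


def listing_url(subreddit: str, limit: int = 10) -> str:
    """Build the public JSON listing URL a third-party scraper would request."""
    query = urlencode({"limit": limit})
    return f"{BASE}/r/{subreddit}/top.json?{query}"


def extract_titles(listing: dict) -> list:
    """Pull post titles out of a Reddit-style listing response."""
    return [child["data"]["title"] for child in listing["data"]["children"]]
```

A developer would fetch `listing_url("python")` with an HTTP client (sending a descriptive User-Agent header) and feed the parsed JSON to `extract_titles` – precisely the free-of-charge pipeline that Reddit decided to start billing for.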

The deal with Google is a blessing for Reddit, with its IPO hanging in the air. Huffman’s company will now pocket $60 million a year and can leverage the AI boom through the revenue brought in via API fees.

Other deals between AI companies and content platforms hungry for a fresh source of money are imminent. A report by 404 Media revealed that Automattic, the parent company of Tumblr and WordPress, was discussing potential deals with OpenAI and AI image tool maker Midjourney to provide training data from users’ posts.

In September last year, Meta admitted that a “vast majority” of the training data used to develop its AI assistants comprised publicly available posts, including those on Facebook and Instagram. The company’s president of global affairs, Nick Clegg, told Reuters that Meta had steered clear of datasets with a heavy amount of personal information, such as posts from LinkedIn.

Can social media content be secure?

But that is hardly a concrete reassurance for social media users. “Anyone who posts publicly on a social media platform needs to be aware that their content can be used to train an AI model. This is true whether you use a large social platform like Facebook or a smaller one like Reddit. Social platforms have an abundance of human-created information in the form of posts, links, photos, videos and more, and AI models need human oversight and verification in order to provide factual, correct information,” said tech analyst Debra Aho Williamson.

These partnerships indicate that AI companies do recognise the value of human-created content in training and improving their models. But that doesn’t make it any less concerning. “Meta, for example, has built an enormous ad business by selling targeted advertising, and its success comes from the fact that it has built a world-class database of information about its users. Many users find that creepy. Using that same information to improve an AI model can feel similarly invasive – even if it’s public information,” she noted.

If a TikTok ban weren’t looming in the US, it would be the social media platform most likely to strike a deal with an AI company, according to Williamson. “TikTok videos contain a wealth of information, and much of its content is publicly available. LinkedIn is another company whose data might be useful to AI firms,” she stated. LinkedIn is notably owned by Microsoft, which is knee-deep in AI partnerships with firms like OpenAI and Mistral.

There’s very little that’s off limits. “Any sort of content creator – whether a social media user, a journalist, a designer, or a filmmaker – needs to be worried about how AI companies will use their content, as well as the potential for the AI company to disintermediate their business entirely,” she explained. 

How to protect against tools misusing data?

The Terms of Service of most content platforms are broad, catch-all and barely decipherable word salad. For example, while Grammarly’s user agreement explicitly states that it does not sell data to third parties, the service does save and archive your data. Its user agreement does not make that apparent in so many words.

“You grant us a license to your User Content for the limited purposes of: Operating, providing, improving, troubleshooting, and debugging our products (for example, your acceptance or rejection of suggestions may help train our suggestion engine); Protecting our products (for example, to analyze patterns in usage to prevent abuse); Customizing our products (for example, to create personalized suggestions for you); Developing new products or features (for example, creating our tone detector); and Using information you upload or provide to us (such as your name) to encourage other people in your organization to join your Grammarly Business or Grammarly for Education team account.” 

Once these platforms gain some functionality, users find it difficult to avoid using them. “The sad truth is, unless you’re a Luddite and refuse to use the internet (or any technology really), there isn’t much you can do about it. It’s like living in a city like Delhi or Beijing and asking what one can do to protect themselves from pollution. Your devices collect diagnostic data that are used to train ‘internal’ systems and make them better. If you’re using a Microsoft device, OpenAI’s products may tomorrow be considered ‘internal systems’ by Microsoft,” said Subha Chugh, a lawyer and AI advisor.

Her advice is to always presume that your data will end up being shared with several different parties. “Use end-to-end encrypted service providers to share confidential information and refrain from uploading data on online third-party tools or language models,” she stated. 

Three weeks ago, electronic signature company Docusign published an FAQ quietly admitting to using customer data, which typically includes sensitive business information and personal details, for training AI. “They state that customers ‘contractually consent’ to such use, although I cannot find anything in their Terms which mentions that customers consent to such activities. But presumably if customer provides ‘consent’, they can also withdraw that consent without detriment, right?” said Alexander Hanff, a legal consultant for data protection and privacy, on LinkedIn.

Hanff explains that while this should have been enough “to sink the company, the problem is that people use these tools out of convenience” and rarely consider these loopholes seriously. 

There are also signs of increasing desensitisation to invasions of privacy, Chugh believes. For most users, privacy-respecting alternatives to such freely available tools come at a hefty cost. “For many companies, data sharing is either a source of revenue directly or indirectly, which allows them to keep their service price low. For many people, getting to use ‘free’ services in exchange for sharing their data is simply worth it,” she reasoned.

Given the extent to which the future will be AI-infused, calls for AI regulation, not just to safeguard data privacy but also to prevent misappropriation, are growing louder.

Can datasets for AI training ever be ethical?

If you asked OpenAI whether it is even possible to train AI models without using copyrighted data, the answer would be an outright no. In a statement responding to an inquiry by the UK House of Lords’ communications and digital select committee, the Sam Altman-led company said it would be “impossible” to train leading AI models without using copyrighted material.

But a couple of very recent announcements reported by Wired this week show that this isn’t, in fact, true. Fairly Trained, a nonprofit founded in January this year by Ed Newton-Rex, awarded its first certification to a large language model called KL3M that was built without any copyright infringement. Newton-Rex had reportedly quit his executive role at image generation startup Stability AI because he was at odds with its data scraping policy.

The model’s dataset, developed by 273 Ventures, a Chicago-based legal tech consultancy startup, used curated legal, financial and regulatory documents.

Separately, a collective of researchers backed by the French government, working in collaboration with startups and groups like Allen AI, Nomic AI and EleutherAI, has released what it claims is the largest AI training dataset made entirely of public domain text. Even though these are small steps, they hint at a future where ethical datasets are possible, even if building them takes more time.
