A complaint about poverty in rural China. A news report about a corrupt Communist Party member. A cry for help about corrupt cops shaking down entrepreneurs.
These are just a few of the 133,000 examples fed into a sophisticated large language model (LLM) that is designed to automatically flag any piece of content considered sensitive by the Chinese government.
A leaked database seen by TechCrunch reveals that China has developed an AI system that supercharges its already formidable censorship machine, extending far beyond traditional taboos like the Tiananmen Square massacre.
The system appears primarily geared toward censoring Chinese citizens online, but it could be used for other purposes, such as improving Chinese AI models’ already extensive censorship.

Xiao Qiang, a researcher at UC Berkeley who studies Chinese censorship and who examined the dataset, told TechCrunch that it was “clear evidence” that the Chinese government or its affiliates want to use LLMs to improve repression.
“Unlike traditional censorship mechanisms, which rely on human labor for keyword-based filtering and manual review, an LLM trained on such instructions would significantly improve the efficiency and granularity of state-led information control,” Qiang told TechCrunch.
This adds to growing evidence that authoritarian regimes are quickly adopting the latest AI technology. In February, for example, OpenAI said it caught multiple Chinese entities using LLMs to track anti-government posts and smear Chinese dissidents.
The Chinese Embassy in Washington, D.C., told TechCrunch in a statement that it opposes “groundless attacks and slanders against China” and that China attaches great importance to developing ethical AI.
Data found in plain sight
The dataset was discovered by security researcher NetAskari, who shared a sample with TechCrunch after finding it stored in an unsecured Elasticsearch database hosted on a Baidu server.
This doesn’t indicate any involvement from either company; all kinds of organizations store their data with such providers.
There is no indication of who, exactly, built the dataset, but records show that the data is recent, with its latest entries dating from December 2024.
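For readers unfamiliar with the term, an “unsecured” Elasticsearch database is one whose REST API is reachable over the internet without authentication, so anyone who finds its address can list and read its indices. The Python sketch below shows roughly what that looks like in practice; the address and index name are placeholders, not the actual server.

```python
import requests

# Placeholder address (TEST-NET range); an exposed Elasticsearch node
# answers on its REST port with no credentials required.
HOST = "http://203.0.113.10:9200"

# List every index on the node, with document counts and on-disk sizes.
print(requests.get(f"{HOST}/_cat/indices?v", timeout=10).text)

# Pull one sample document from an index to see what kind of data it holds.
resp = requests.get(f"{HOST}/example-index/_search", params={"size": 1}, timeout=10)
print(resp.json()["hits"]["hits"])
```

Because the API answers anyone who connects, “discovering” such a database requires no exploit at all, which is why misconfigured instances are a recurring source of leaks.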
An LLM for detecting dissent
In language eerily reminiscent of how people prompt ChatGPT, the system’s creator tasks an unnamed LLM with figuring out whether a piece of content has anything to do with sensitive topics related to politics, social life, and the military. Such content is deemed “highest priority” and needs to be flagged immediately.
Top-priority topics include pollution and food safety scandals, financial fraud, and labor disputes, hot-button issues in China that sometimes lead to public protests, such as the 2012 anti-pollution protests in Shifang.
Any form of “political satire” is explicitly targeted. For example, if someone uses a historical analogy to make a point about “current political figures,” that must be flagged immediately, as must anything related to “Taiwan politics.” Military matters are extensively targeted, including reports of military movements, exercises, and weaponry.
A snippet of the dataset can be seen below. The code inside it references prompt tokens and LLMs, confirming that the system uses an AI model to do its bidding:
[Image: a snippet of the dataset referencing prompt tokens and LLMs]
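The snippet itself is not reproduced here, but the setup it describes, sending a piece of content to an LLM along with category instructions and getting back a verdict, maps onto an ordinary classification call. Below is a minimal sketch of that pattern using the OpenAI Python client as a stand-in; the model name, categories, and prompt wording are illustrative assumptions, not the leaked system’s.

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key in the environment; any compatible endpoint works

# Illustrative instructions only; the leaked prompts and taxonomy are not public.
SYSTEM_PROMPT = (
    "You are a content reviewer. Decide whether the text touches sensitive "
    "political, social, or military topics. Reply with JSON: "
    '{"flag": true or false, "topic": "<topic or none>", "priority": "high" or "normal"}'
)

def review(text: str) -> str:
    """Ask the model to classify a single piece of content."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in model, not what the leaked system used
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content

print(review("A post complaining that local officials ignored a pollution scandal."))
```

Run in a loop over 133,000 examples, a call like this would produce exactly the kind of flagged, categorized records the leaked dataset appears to contain.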
Inside the training data
From the 133,000 examples that the LLM must evaluate for censorship, TechCrunch gathered 10 representative pieces of content.
Topics likely to stir up social unrest are a recurring theme. One snippet, for example, is a post by a business owner complaining about corrupt local police officers shaking down entrepreneurs, a rising problem in China as its economy struggles.
Another piece of content laments rural poverty in China, describing run-down towns that have only elderly people and children left in them. There is also a news report about the Chinese Communist Party (CCP) expelling a local official for severe corruption and for believing in “superstitions” instead of Marxism.
There is extensive material related to Taiwan and military matters, such as commentary about Taiwan’s military capabilities and details about a new Chinese jet fighter. The Chinese word for Taiwan (台湾) alone is mentioned more than 15,000 times in the data, a search by TechCrunch shows.
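A count like that is straightforward to reproduce against an export of the records. Here is a minimal sketch, assuming the data were dumped as JSON Lines with a “content” field; both the filename and the field name are assumptions, since the real schema has not been published.

```python
import json

mentions = 0
with open("dataset.jsonl", encoding="utf-8") as f:  # hypothetical export file
    for line in f:
        record = json.loads(line)
        # "content" is an assumed field name for the flagged text itself.
        mentions += record.get("content", "").count("台湾")

print(f"台湾 mentioned {mentions:,} times")
```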
More subtle forms of dissent also appear to be targeted. One snippet included in the database is an anecdote about the fleeting nature of power that uses the popular Chinese idiom “When the tree falls, the monkeys scatter.”
Transitions of power are an especially sensitive topic in China because of its authoritarian political system.
Built for “public opinion work”
The dataset does not include any information about its creators. But it does say that it is intended for “public opinion work,” which offers a strong clue that it is meant to serve Chinese government goals.
Michael Caster, the Asia program manager at rights organization Article 19, explained that “public opinion work” is overseen by a powerful Chinese government regulator, the Cyberspace Administration of China (CAC), and typically refers to censorship and propaganda efforts.
The end goal is ensuring that Chinese government narratives are protected online, while any alternative views are purged. Chinese President Xi Jinping has himself described the internet as the “frontline” of the CCP’s “public opinion work.”
Repression is getting smarter
The dataset examined by TechCrunch is the latest evidence that authoritarian governments are seeking to leverage AI for repressive purposes.
OpenAI released a report last month revealing that an unidentified actor, likely operating from China, used generative AI to monitor social media conversations, particularly those advocating for human rights protests against China, and forward them to the Chinese government.
Contact Us
If you know more about how AI is used in state oppression, you can securely contact Charles Rollet on Signal at charlesrollet.12. You can also contact TechCrunch via SecureDrop.
OpenAI also found that the technology was being used to generate comments highly critical of a prominent Chinese dissident, Cai Xia.
Traditionally, China’s censorship methods have relied on more basic algorithms that automatically block content mentioning blacklisted terms, such as “Tiananmen massacre” or “Xi Jinping,” as many users experienced when trying DeepSeek for the first time.
However, newer AI technologies like LLMs can make censorship more efficient by finding even subtle criticism at a vast scale. Some AI systems can also keep improving as they are fed more and more data.
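The difference is easy to see in code. A blacklist filter is a literal string match, so misspellings, homophones, and historical allusions slip straight past it, which is exactly the gap an LLM classifier closes. The term list in the sketch below is illustrative, not an actual blacklist.

```python
BLACKLIST = {"tiananmen massacre", "tank man"}  # illustrative terms only

def keyword_censor(text: str) -> bool:
    """Flag text only if it literally contains a blacklisted phrase."""
    lowered = text.lower()
    return any(term in lowered for term in BLACKLIST)

# An allusion carries the same meaning but contains no blacklisted string,
# so the literal filter misses what a semantic classifier could catch.
print(keyword_censor("Remember what happened in the square in June 1989?"))  # False
```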
“I think it’s crucial to highlight how AI-driven censorship is evolving, making state control over public discourse even more sophisticated, especially at a time when Chinese AI models such as DeepSeek are making headlines,” Xiao, the Berkeley researcher, told TechCrunch.