A complaint about poverty in rural China. A news report about a corrupt Communist Party member. A cry for help about corrupt cops shaking down entrepreneurs.
These are just a few of the 133,000 examples fed into a sophisticated large language model (LLM) that is designed to automatically flag any piece of content considered sensitive by the Chinese government.
A leaked database seen by TechCrunch reveals that China has developed an AI system that supercharges its already formidable censorship machine, extending far beyond traditional taboos like the Tiananmen Square massacre.
The system appears primarily geared toward censoring Chinese citizens online, but it could be used for other purposes, such as improving Chinese AI models’ already extensive censorship.

Xiao Qiang, a researcher at UC Berkeley who studies Chinese censorship and who examined the dataset, told TechCrunch that it was “clear evidence” that the Chinese government or its affiliates want to use LLMs to improve repression.
“Unlike traditional censorship mechanisms, which rely on human labor for keyword-based filtering and manual review, an LLM trained on such instructions would significantly improve the efficiency and granularity of state-led information control,” Qiang told TechCrunch.
This adds to growing evidence that authoritarian regimes are quickly adopting the latest AI technology. In February, for example, OpenAI said it caught multiple Chinese entities using LLMs to track anti-government posts and smear Chinese dissidents.
The Chinese Embassy in Washington, D.C., told TechCrunch in a statement that it opposes “groundless attacks and slanders against China” and that China attaches great importance to developing ethical AI.
Data found in plain sight
The dataset was discovered by security researcher NetAskari, who shared a sample with TechCrunch after finding it stored in an unsecured Elasticsearch database hosted on a Baidu server.
This doesn’t indicate any involvement from either company; all kinds of organizations store their data with such providers.
There is no indication of who, exactly, built the dataset, but records show that the data is recent, with its latest entries dating from December 2024.
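For readers unfamiliar with the term, an “unsecured” Elasticsearch database is one whose REST API is reachable over the internet without authentication, so anyone who finds its address can list and read its indices. The Python sketch below shows roughly what that looks like in practice; the address and index name are placeholders, not the actual server.

```python
import requests

# Placeholder address (TEST-NET range); an exposed Elasticsearch node
# answers on its REST port with no credentials required.
HOST = "http://203.0.113.10:9200"

# List every index on the node, with document counts and on-disk sizes.
print(requests.get(f"{HOST}/_cat/indices?v", timeout=10).text)

# Pull one sample document from an index to see what kind of data it holds.
resp = requests.get(f"{HOST}/example-index/_search", params={"size": 1}, timeout=10)
print(resp.json()["hits"]["hits"])
```

Because the API answers anyone who connects, “discovering” such a database requires no exploit at all, which is why misconfigured instances are a recurring source of leaks.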
An LLM for detecting dissent
In language eerily reminiscent of how people prompt ChatGPT, the system’s creator tasks an unnamed LLM with figuring out whether a piece of content has anything to do with sensitive topics related to politics, social life, and the military. Such content is deemed “highest priority” and needs to be flagged immediately.
Top-priority topics include pollution and food safety scandals, financial fraud, and labor disputes, hot-button issues in China that sometimes lead to public protests, such as the 2012 anti-pollution protests in Shifang.
Any form of “political satire” is explicitly targeted. For example, if someone uses a historical analogy to make a point about “current political figures,” that must be flagged immediately, as must anything related to “Taiwan politics.” Military matters are extensively targeted, including reports of military movements, exercises, and weaponry.
A snippet of the dataset can be seen below. The code inside it references prompt tokens and LLMs, confirming that the system uses an AI model to do its bidding:
[Image: a snippet of the dataset referencing prompt tokens and LLMs]
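The snippet itself is not reproduced here, but the setup it describes, sending a piece of content to an LLM along with category instructions and getting back a verdict, maps onto an ordinary classification call. Below is a minimal sketch of that pattern using the OpenAI Python client as a stand-in; the model name, categories, and prompt wording are illustrative assumptions, not the leaked system’s.

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key in the environment; any compatible endpoint works

# Illustrative instructions only; the leaked prompts and taxonomy are not public.
SYSTEM_PROMPT = (
    "You are a content reviewer. Decide whether the text touches sensitive "
    "political, social, or military topics. Reply with JSON: "
    '{"flag": true or false, "topic": "<topic or none>", "priority": "high" or "normal"}'
)

def review(text: str) -> str:
    """Ask the model to classify a single piece of content."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in model, not what the leaked system used
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content

print(review("A post complaining that local officials ignored a pollution scandal."))
```

Run in a loop over 133,000 examples, a call like this would produce exactly the kind of flagged, categorized records the leaked dataset appears to contain.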
Inside the training data
From the 133,000 examples that the LLM must evaluate for censorship, TechCrunch gathered 10 representative pieces of content.
Topics likely to stir up social unrest are a recurring theme. One snippet, for example, is a post by a business owner complaining about corrupt local police officers shaking down entrepreneurs, a rising problem in China as its economy struggles.
Another piece of content laments rural poverty in China, describing run-down towns that have only elderly people and children left in them. There is also a news report about the Chinese Communist Party (CCP) expelling a local official for severe corruption and for believing in “superstitions” instead of Marxism.
There is extensive material related to Taiwan and military matters, such as commentary about Taiwan’s military capabilities and details about a new Chinese jet fighter. The Chinese word for Taiwan (台湾) alone is mentioned more than 15,000 times in the data, a search by TechCrunch shows.
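A count like that is straightforward to reproduce against an export of the records. Here is a minimal sketch, assuming the data were dumped as JSON Lines with a “content” field; both the filename and the field name are assumptions, since the real schema has not been published.

```python
import json

mentions = 0
with open("dataset.jsonl", encoding="utf-8") as f:  # hypothetical export file
    for line in f:
        record = json.loads(line)
        # "content" is an assumed field name for the flagged text itself.
        mentions += record.get("content", "").count("台湾")

print(f"台湾 mentioned {mentions:,} times")
```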
More subtle forms of dissent also appear to be targeted. One snippet included in the database is an anecdote about the fleeting nature of power that uses the popular Chinese idiom “When the tree falls, the monkeys scatter.”
Transitions of power are an especially sensitive topic in China because of its authoritarian political system.
Built for “public opinion work”
The dataset does not include any information about its creators. But it does say that it is intended for “public opinion work,” which offers a strong clue that it is meant to serve Chinese government goals.
Michael Caster, the Asia program manager at rights organization Article 19, explained that “public opinion work” is overseen by a powerful Chinese government regulator, the Cyberspace Administration of China (CAC), and typically refers to censorship and propaganda efforts.
The end goal is ensuring that Chinese government narratives are protected online, while any alternative views are purged. Chinese President Xi Jinping has himself described the internet as the “frontline” of the CCP’s “public opinion work.”
Repression is getting smarter
The dataset examined by TechCrunch is the latest evidence that authoritarian governments are seeking to leverage AI for repressive purposes.
OpenAI released a report last month revealing that an unidentified actor, likely operating from China, used generative AI to monitor social media conversations, particularly those advocating for human rights protests against China, and forward them to the Chinese government.
Contact Us
If you know more about how AI is used in state oppression, you can securely contact Charles Rollet on Signal at charlesrollet.12. You can also contact TechCrunch via SecureDrop.
OpenAI also found that the technology was being used to generate comments highly critical of a prominent Chinese dissident, Cai Xia.
Traditionally, China’s censorship methods have relied on more basic algorithms that automatically block content mentioning blacklisted terms, such as “Tiananmen massacre” or “Xi Jinping,” as many users experienced when trying DeepSeek for the first time.
However, newer AI technologies like LLMs can make censorship more efficient by finding even subtle criticism at a vast scale. Some AI systems can also keep improving as they are fed more and more data.
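The difference is easy to see in code. A blacklist filter is a literal string match, so misspellings, homophones, and historical allusions slip straight past it, which is exactly the gap an LLM classifier closes. The term list in the sketch below is illustrative, not an actual blacklist.

```python
BLACKLIST = {"tiananmen massacre", "tank man"}  # illustrative terms only

def keyword_censor(text: str) -> bool:
    """Flag text only if it literally contains a blacklisted phrase."""
    lowered = text.lower()
    return any(term in lowered for term in BLACKLIST)

# An allusion carries the same meaning but contains no blacklisted string,
# so the literal filter misses what a semantic classifier could catch.
print(keyword_censor("Remember what happened in the square in June 1989?"))  # False
```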
“I think it’s crucial to highlight how AI-driven censorship is evolving, making state control over public discourse even more sophisticated, especially at a time when Chinese AI models such as DeepSeek are making headlines,” Xiao, the Berkeley researcher, told TechCrunch.