Tsinghua team weaves a smart computing power grid for large models
Computing power is often called the "water, electricity and coal" of the digital age, but the reality is lopsided: big tech companies sit on tens of thousands of GPUs they cannot fully use, startups queue endlessly for capacity from cloud vendors, and small and medium-sized teams can barely get a foot in the door. This structural mismatch has become an invisible ceiling on the whole AI industry.
Recently, a startup team with Tsinghua roots did something quite interesting: they built a "smart computing power grid" for large-model training and inference. It is not a simple resource marketplace. Like a power grid dispatching electricity, it intelligently matches idle computing power scattered around the world to the companies that need it. Doesn't sound new? The problem they are actually trying to solve is making computing power behave like electricity: available on demand, allocated on demand, and dynamically scheduled.
Why is this worth attention? Because it may change the underlying rules of large-model training over the next three to five years.
1. Event and technical background
In early 2026, a startup team that grew out of Tsinghua University's Department of Computer Science officially released its computing power scheduling platform. The founding members took part in several large-model projects at the Tsinghua NLP Laboratory and have first-hand experience of the compute bottleneck in large-model training.
According to public information, the founder holds a Ph.D. from Tsinghua's Department of Computer Science and previously led the construction of heterogeneous computing platforms at a leading cloud vendor. After seeing how severe the industry's compute mismatch had become in 2024, he recruited several former classmates who work on distributed systems and compilers and started the company. The product entered internal testing in 2025 and opened to the public in 2026.
The core logic is simple: connect "computing power islands" around the world into one network, let the demand side draw on it as easily as plugging into a socket, and let the supply side monetize idle capacity. It sounds like a replica of cloud computing, but the key differences are scheduling granularity and cost: the goal is to schedule thousand-GPU training jobs within minutes at roughly 60% of the cost of traditional cloud vendors.
Why does this matter? Because the large-model arms race is shifting from competing on models to competing on compute efficiency. Once everyone accepts that computing power is a core resource, whoever uses it more efficiently will survive longer in this competition.
2. Core technical principles
"Smart computing power grid" may sound mysterious, but taken apart it rests on a three-layer technical architecture:
The first layer is heterogeneous resource abstraction. The team has developed a unified resource description protocol that abstracts GPU clusters from different vendors (NVIDIA, AMD, domestic Chinese chips), servers with different architectures (x86, ARM), and data centers in different locations into standardized "computing power units." The protocol plays a role similar to the electrical standard in the power industry: whether the electricity in your home came from a coal plant or a wind farm, it arrives from the grid at 220 V and 50 Hz.
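To make the idea of a "computing power unit" concrete, here is a minimal sketch of what a unified resource record could look like. The team has not published its protocol, so the `ComputeUnit` fields, the `normalize` helper, and the sample records are all hypothetical and only illustrate the principle of mapping provider-specific descriptions onto one schema.

```python
from dataclasses import dataclass


# Hypothetical schema: field names are illustrative and are not taken
# from the team's actual protocol.
@dataclass(frozen=True)
class ComputeUnit:
    provider: str           # organization supplying the hardware
    vendor: str             # e.g. "NVIDIA", "AMD", "Huawei"
    chip: str               # e.g. "H100", "MI300", "Ascend 910B"
    host_arch: str          # normalized to "x86" or "ARM"
    region: str             # data center location
    gpu_count: int
    memory_gb_per_gpu: int


# Providers report the host architecture under different names.
ARCH_ALIASES = {"x86_64": "x86", "amd64": "x86", "aarch64": "ARM", "arm64": "ARM"}


def normalize(raw: dict) -> ComputeUnit:
    """Map one provider-specific record onto the common schema."""
    arch = raw["arch"].lower()
    return ComputeUnit(
        provider=raw["provider"],
        vendor=raw["vendor"].strip(),
        chip=raw["chip"].strip(),
        host_arch=ARCH_ALIASES.get(arch, raw["arch"]),
        region=raw["region"],
        gpu_count=int(raw["gpus"]),
        memory_gb_per_gpu=int(raw["mem_gb"]),
    )


# Sample records with made-up values, for illustration only.
units = [
    normalize({"provider": "dc-east", "vendor": "NVIDIA", "chip": "H100",
               "arch": "x86_64", "region": "east-china", "gpus": 8, "mem_gb": 80}),
    normalize({"provider": "lab-b", "vendor": "Huawei ", "chip": "Ascend 910B",
               "arch": "aarch64", "region": "north-china", "gpus": "8", "mem_gb": 64}),
]
total_gpus = sum(u.gpu_count for u in units)
```

Once every supplier is described this way, a scheduler can compare and filter nodes without knowing which vendor or data center they come from.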
The second layer is the intelligent scheduling engine, which is the core technical moat. The team says in its paper that it uses a hybrid scheme combining reinforcement learning with heuristic algorithms: reinforcement learning predicts short-term fluctuations in compute demand, while the heuristics keep scheduling deterministic and bound its latency. Measured data shows that scheduling a thousand-GPU training job takes under 3 minutes, and the task start success rate exceeds 99.5%.
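The division of labor between a predictor and a deterministic placement rule can be sketched in a few lines. This is not the team's algorithm: the `forecast_demand` function below is a plain exponentially weighted average standing in for the learned predictor (it is not reinforcement learning), and `place` is an ordinary best-fit heuristic. All names and numbers are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    free_gpus: int


def forecast_demand(history: List[int], alpha: float = 0.5) -> float:
    """Exponentially weighted average of past demand (GPUs per interval).

    Stand-in for the learned predictor described in the article.
    """
    estimate = float(history[0])
    for observed in history[1:]:
        estimate = alpha * observed + (1 - alpha) * estimate
    return estimate


def place(job_gpus: int, nodes: List[Node], reserve: float) -> Optional[str]:
    """Deterministic best-fit placement.

    Rejects the job if admitting it would leave fewer than `reserve` GPUs
    free across the pool; otherwise picks the node with the least slack.
    """
    total_free = sum(n.free_gpus for n in nodes)
    if total_free - job_gpus < reserve:
        return None  # would eat into headroom kept for forecast demand
    candidates = [n for n in nodes if n.free_gpus >= job_gpus]
    if not candidates:
        return None
    best = min(candidates, key=lambda n: (n.free_gpus - job_gpus, n.name))
    best.free_gpus -= job_gpus
    return best.name


pool = [Node("a", 16), Node("b", 8), Node("c", 32)]
reserve = forecast_demand([8, 12, 16])   # 13.0 with alpha = 0.5
first = place(8, pool, reserve)          # "b": exact fit
second = place(32, pool, reserve)        # "c": only node large enough
third = place(8, pool, reserve)          # None: only 16 GPUs left, reserve is 13
```

The heuristic part always gives the same answer for the same inputs, which is what makes its latency and behavior easy to bound; the predictor only tunes how much headroom the heuristic keeps.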
The third layer is elastic billing and fault tolerance. Borrowing the time-of-use tariff model from the power industry, the team designed dynamic pricing: prices rise at peak times and are discounted in off-peak periods. Users on the demand side can choose a "cost first" or a "timeliness first" scheduling strategy. The system also automatically configures checkpointing and resume-from-checkpoint for long-running tasks to minimize the impact of hardware failures on training progress.
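A toy version of time-of-use pricing and the two user strategies might look like the following. The pricing formula (base price scaled by the demand-to-supply ratio) is an assumption, not the platform's published model; the ±40% clamp reuses the fluctuation band quoted in this article, and the function names and sample figures are hypothetical.

```python
from typing import List


def dynamic_price(base_price: float, demand: float, supply: float,
                  max_swing: float = 0.40) -> float:
    """Scale the base price by demand/supply, clamped to +/- max_swing."""
    if supply <= 0:
        return base_price * (1 + max_swing)  # no supply: charge the ceiling
    ratio = demand / supply
    multiplier = min(1 + max_swing, max(1 - max_swing, ratio))
    return base_price * multiplier


def choose_start(hourly_prices: List[float], strategy: str) -> int:
    """Return the index of the hour in which the job should start."""
    if strategy == "timeliness":
        return 0  # start immediately, whatever the price
    if strategy == "cost":
        # wait for the cheapest hour in the forecast window
        return min(range(len(hourly_prices)), key=hourly_prices.__getitem__)
    raise ValueError(f"unknown strategy: {strategy}")


# Forecast demand for the next three hours against a supply of 100 GPUs.
prices = [dynamic_price(10.0, demand, 100.0) for demand in (120, 200, 30)]
start_now = choose_start(prices, "timeliness")
start_cheap = choose_start(prices, "cost")
```

With these inputs the middle hour hits the price ceiling and the last hour hits the floor, so a "cost first" job waits for the third hour while a "timeliness first" job starts at once.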
Key technical points:
- Unified resource abstraction protocol: supports mainstream chips such as NVIDIA A100/H100, AMD MI300, and Huawei Ascend 910B; the time from resource registration to being schedulable is under 5 minutes
- Hybrid scheduling algorithm: dual engines of reinforcement learning plus heuristics, with scheduling latency P99 < 200 ms (according to the team's technical blog)
- Dynamic pricing model: a real-time bidding system based on supply and demand, with prices fluctuating within ±40% (in contrast to the fixed discounts of traditional cloud vendors)
- Fault tolerance: the automatic checkpoint interval is configurable, and failure recovery time is < 30 seconds (experimental environment data)
- Security isolation: hard resource isolation based on lightweight virtualization; a single-card failure does not affect other tasks on the same node
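The checkpoint-and-resume mechanism in the fault tolerance item follows a well-known pattern, sketched below with a toy training loop. This is a generic illustration, not the platform's implementation: the `train` function, the JSON checkpoint format, and the simulated failure are all assumptions, and the "model state" is just a running sum.

```python
import json
import os
import tempfile


def train(total_steps, ckpt_path, ckpt_every, fail_at=None):
    """Toy training loop that resumes from ckpt_path if it exists."""
    step, state = 0, 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            saved = json.load(f)
        step, state = saved["step"], saved["state"]
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"simulated node failure at step {step}")
        state += step  # stand-in for one optimizer step
        step += 1
        if step % ckpt_every == 0:
            # Write to a temporary file, then rename, so a crash during
            # the write never leaves a half-written checkpoint behind.
            tmp_path = ckpt_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump({"step": step, "state": state}, f)
            os.replace(tmp_path, ckpt_path)
    return state


ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
try:
    train(100, ckpt, ckpt_every=10, fail_at=57)
except RuntimeError:
    pass  # the node "died"; the last checkpoint was written at step 50
result = train(100, ckpt, ckpt_every=10)  # resumes at step 50, not step 0
```

The checkpoint interval sets the trade-off: here a failure at step 57 loses only the 7 steps since the last save, at the price of writing a checkpoint every 10 steps.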
The appeal of this architecture is that it does not start from scratch or reinvent the wheel. Instead, it connects existing computing infrastructure and redistributes usage rights in a software-defined way. That makes it lighter than building your own data center and more stable than a pure matchmaking platform.
3. Why is this important?
Start with an industry consensus: large-model competition is at bottom a competition for computing power, but in the second half of that competition the question is not who buys more but who uses it better.
Over the past two years, everyone has watched big tech companies hoard GPUs. NVIDIA's H100 was at one point sold out, and A100 prices in China were bid up to outrageous levels. But buying cards does not mean using them well. GPU utilization at many companies has long hovered between 30% and 40%, held down by poorly scheduled training jobs, badly chosen batch granularity, and high cross-node communication overhead. These problems leave a large amount of computing power wasted.
At the same time, many small and medium-sized enterprises and research teams simply cannot get enough compute. The monthly price of an H100 has risen to more than $20,000. Startups cannot afford it, and university laboratories are even less able to. As a result, the Matthew effect in computing resources keeps intensifying: big companies get stronger and small teams find it ever harder.
What this team is doing is essentially a "supply-side reform" of computing power: reviving scattered, fragmented, under-utilized resources and redistributing them through market mechanisms. Whoever has idle cards can earn money; whoever needs them can get resources quickly. It is not a technological revolution, but it may be a revolution in business model.
Seen from another angle, if this "computing power grid" really works, it will do more than fix efficiency: it could reshape AI infrastructure as a whole. Imagine training a large model without building your own machine room or signing an annual framework contract; you simply plug in and use it, like electricity. At that point computing power would become a truly universal basic resource, much as cloud computing made general computing resources widely accessible.
Of course, that is the ideal state. Whether it can be reached depends on what happens next.
4. Industry impact and data support
Data first. How big is the computing power market, and how serious is the demand gap? Only with these figures laid out can the value of this direction be understood.
According to a report IDC released at the end of 2025, the global AI computing power market reached approximately US$78 billion in 2025, up 42% year on year. A considerable share of this, however, is "ineffective supply": hardware that companies have purchased but whose median actual utilization is only about 35%. By this estimate, the AI computing power wasted worldwide each year may be worth more than US$20 billion.
Synergy Research Group's data is more direct: in Q4 2025, global data center GPU capacity was approximately 210 million, but average utilization was only 38%. In other words, more than 60% of that computing power sits idle.
Back to China. According to a 2025 survey by the China Academy of Information and Communications Technology (CAICT), GPU cluster utilization at the top domestic cloud vendors is about 45%, while self-built clusters at small and medium-sized enterprises are generally below 30%. Universities and research institutes fare even worse: many laboratories see GPU server utilization below 20%, with machines gathering dust after being bought and used for a few experiments.
Demand-side data is more interesting. According to QbitAI, the compute gap of domestic large-model startups in 2025 is roughly 150,000 H100-equivalents, and the compute they can actually deploy is about 60% of what they need. The gap is long-standing, and it is widening as more large-model projects launch.
Another easily overlooked figure is the share of compute in large-model training costs. According to estimates from Stanford HAI, training a GPT-4-class model costs approximately US$78 million, with computing power accounting for more than 85%. If that cost could be cut by 30% to 40%, the effect on the whole industry would be huge.
Taken together, these numbers point to one conclusion: the structural contradictions in the computing power market have reached the point where they must be resolved. This is less a technical problem than a problem of efficiency, institutions, and business models. The Tsinghua-affiliated team is going after exactly this pain point.
5. Real-world cases
Case 1: A make-or-break rescue for an AI drug discovery startup
In mid-2025, a Chinese startup working on AI drug discovery (hereinafter "Company A") ran into serious trouble. It was training a molecular generation model for a particular class of targets when, halfway through, it found itself short of compute: the capacity under its annual framework contract with a cloud vendor had been cut in half because of the vendor's internal priority adjustments.
At the original pace, the project would be delayed by three months. For a startup with only half a year left in its financing window, that was not a scheduling problem but a matter of survival.
Company A's technical lead found this team's product. The platform was still in internal testing and had just opened several compute nodes in East China, billed by the hour. With a "give it a try" attitude, they moved two training jobs over.
Scheduling went surprisingly smoothly. The technical lead told me it took less than four minutes from submitting the job to the first GPU starting up. "It's much faster than the cloud vendor we used before. There we had to wait two or three hours just in the queue."
Two weeks later, training of the molecular generation model was complete, one week ahead of the original plan. Company A later reviewed the bill: this round of training consumed about 8,000 GPU-hours in total, and at the platform's dynamic prices at the time, the total cost was about 28% lower than the original cloud vendor's quote.
"What surprised us most was not the price but the stability," the technical lead said. "During the 14 days of training, one node failed partway through. The system automatically switched to a standby node and the job carried on seamlessly. In the past, an incident like that would have meant rolling back at least two to three hours."
Case 2: A "computing power equalization" experiment in a university laboratory
In the second half of 2025, the NLP laboratory of a leading Chinese university (hereinafter "Laboratory B") was running a pre-training project for a multilingual large model. The project was not large, but it required training jobs lasting more than three months.
The laboratory's compute situation was telling: it had a batch of GPU servers purchased centrally by the college, but they were managed in a very traditional way. Whoever applied first used them first, with no dynamic scheduling and no elastic expansion. As a result, there were not enough machines when projects were running and idle machines when they were not, and utilization hovered around 25% for a long time.
Laboratory B's supervisor contacted the platform team, attracted by the "computing power grid" concept. But there was one concern: academic project funding is limited, so would it be affordable?
The platform's answer was "hybrid scheduling": the laboratory's own cluster joins the platform as a supplier and is rented out when idle, and when its own compute falls short, the laboratory draws on other nodes from the platform. This way the laboratory not only avoided extra spending but also earned rental income from its idle capacity.
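The accounting behind this arrangement can be sketched as follows. The functions, the 16-GPU cluster, the demand profile, and both hourly rates are hypothetical values chosen for illustration; they are not figures from Laboratory B or the platform.

```python
from typing import Dict, List


def plan_interval(owned_gpus: int, needed_gpus: int) -> Dict[str, int]:
    """Split one time interval into local use, rented-out and borrowed GPUs."""
    use_local = min(owned_gpus, needed_gpus)
    return {
        "use_local": use_local,
        "rent_out": owned_gpus - use_local,   # idle capacity offered to others
        "borrow": needed_gpus - use_local,    # shortfall covered by the platform
    }


def net_cost(plans: List[Dict[str, int]], rent_out_rate: float,
             borrow_rate: float) -> float:
    """Borrowing cost minus rental income; negative means net income."""
    return sum(p["borrow"] * borrow_rate - p["rent_out"] * rent_out_rate
               for p in plans)


# A 16-GPU cluster facing uneven demand over four intervals.
demand = [4, 16, 24, 0]
plans = [plan_interval(16, needed) for needed in demand]
# Hypothetical rates: borrowing costs twice what renting out earns.
balance = net_cost(plans, rent_out_rate=1.0, borrow_rate=2.0)
```

In this made-up profile the cluster rents out 28 GPU-intervals and borrows 8, so even with borrowing priced at double the rental rate the laboratory ends with a net income of 12 units.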
After three months of training, the laboratory had spent about 15% less than budgeted, and GPU utilization had risen to 52% (according to internal laboratory statistics). The supervisor later presented the case at an academic conference, where it drew attention from many peers.
"The compute problem at universities is not a lack of resources. The resources are too scattered and the management too outdated," the platform's product manager told me. "We hope to create a university 'Computing Power Sharing Alliance' so that laboratories can share compute with one another."
6. Comparison with competitors and alternatives
This team is not the only one working on compute scheduling. Several teams in China and abroad have similar ideas, and traditional cloud vendors are also after a slice of this cake. A side-by-side comparison makes the Tsinghua-affiliated team's position, strengths, and weaknesses clearer.
Comparison of mainstream solutions:
| Solution | Core advantages | Main disadvantages | Pricing model | Applicable scenarios |
|---|---|---|---|---|
| Tsinghua-affiliated "computing power grid" | Fine scheduling granularity (P99 < 200 ms), unified abstraction of heterogeneous resources, flexible dynamic pricing | Ecosystem still under construction, limited node coverage | On-demand billing + dynamic pricing | Small and medium-sized training jobs, elastic expansion |
| Traditional cloud vendors (AWS, Alibaba Cloud, etc.) | Wide node coverage, mature ecosystem, stable service | High prices, inflexible scheduling, locked-in resources | Annual/monthly subscription + pay-as-you-go | Large-scale long-term training, predictable loads |
| Decentralized compute platforms (Render, Livepeer, etc.) | Community-driven, low cost, no centralization risk | Uneven compute quality, high failure rate, limited suitable scenarios | Pure on-demand billing | Edge inference, lightweight tasks |
| Direct access to supercomputing centers | Abundant compute, suited to large-scale scientific computing | Long approval process, inflexible scheduling | Policy-based pricing | National large-scale science projects |
A few interesting points can be seen from the table:
The strength of traditional cloud vendors is stability; the price is that they are expensive and binding. Once you sign an annual contract, your compute is locked in and there is little room for elastic expansion. Want to add machines temporarily? Fine, pay more. Want to scale down? Sorry, the fees are still due for the contract period.
The strength of decentralized compute platforms is that they are cheap; the problem is uncontrollable quality. Most nodes on these platforms are provided by individuals or small teams, and GPU models, network bandwidth, and stability all vary widely. They are fine for lightweight inference, but large-model training? The risk is too great.
The Tsinghua-affiliated team's solution sits in between: more flexible than traditional cloud vendors and more reliable than decentralized platforms. Its scheduling is fine-grained and its fault tolerance is in place, but the ecosystem is not yet large enough. The number of connected nodes is limited, and regional coverage trails that of the leading cloud vendors.
My judgment is that at this stage it works better as an elastic supplement than as a primary platform. Enterprises can keep core training jobs on traditional cloud vendors and use this system for elastic expansion and cost optimization. Only once the ecosystem is up and running and node coverage has widened will it be in a position to challenge for the primary role.
7. Technical challenges and limitations
To be fair, this article cannot only sing praises. Any new system runs into problems in deployment, and this "smart computing power grid" is no exception.
Uneven node quality. Although the platform has an admission process, its compute nodes come from different organizations with different hardware configurations, so real-world performance varies. The platform told me that new nodes currently go through a 48-hour stress test, but that only weeds out obviously faulty nodes and cannot guarantee long-term stability. Some users report that on certain nodes the network bandwidth was lower than expected, which reduced communication efficiency in multi-GPU training. Addressing this requires finer-grained monitoring and dynamic scheduling.
Latency in cross-region scheduling. Large-model training is sensitive to communication bandwidth, especially distributed training that coordinates many GPUs. If the scheduled nodes are spread across regions, cross-region network latency can cancel out the benefit of fast scheduling. The platform currently prefers to schedule nodes within the same region, but the problem cannot be avoided entirely when compute is tight.
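A same-region preference with a cross-region fallback can be expressed as a simple two-pass rule. The sketch below is an illustration under assumed inputs, not the platform's scheduler: the `pick_nodes` function, the node names, and the free-GPU counts are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def _take(pool: List[Tuple[str, int]], need: int) -> Optional[Dict[str, int]]:
    """Greedily allocate from the largest nodes first; None if not enough."""
    alloc: Dict[str, int] = {}
    for name, free in sorted(pool, key=lambda nf: (-nf[1], nf[0])):
        if need == 0:
            break
        got = min(free, need)
        if got > 0:
            alloc[name] = got
            need -= got
    return alloc if need == 0 else None


def pick_nodes(nodes: Dict[str, Tuple[str, int]], gpus_needed: int
               ) -> Optional[Tuple[Dict[str, int], bool]]:
    """nodes maps node name -> (region, free GPUs).

    Returns (allocation, crosses_regions), or None if capacity is short.
    """
    by_region = defaultdict(list)
    for name, (region, free) in nodes.items():
        by_region[region].append((name, free))

    # First pass: regions that can host the whole job, tightest fit first.
    fitting = sorted(
        (r for r, pool in by_region.items()
         if sum(free for _, free in pool) >= gpus_needed),
        key=lambda r: (sum(free for _, free in by_region[r]), r),
    )
    if fitting:
        return _take(by_region[fitting[0]], gpus_needed), False

    # Fallback: span regions and flag the extra communication latency.
    everything = [nf for pool in by_region.values() for nf in pool]
    alloc = _take(everything, gpus_needed)
    return (alloc, True) if alloc is not None else None


cluster = {"sh-1": ("east", 8), "sh-2": ("east", 8),
           "bj-1": ("north", 8), "gz-1": ("south", 4)}
local = pick_nodes(cluster, 16)    # fits inside the "east" region
spread = pick_nodes(cluster, 24)   # no single region is large enough
refused = pick_nodes(cluster, 40)  # exceeds total free capacity
```

The returned flag lets a caller decide whether a cross-region placement is acceptable for a bandwidth-sensitive job or whether it is better to wait for same-region capacity.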
Limits of security isolation. Sharing compute means different users' tasks run on the same hardware, so security isolation is a core requirement. The platform uses a lightweight virtualization solution that in theory provides hard resource isolation. Whether "noisy neighbor" problems (for example, another tenant on the same node hogging bandwidth or memory) appear under large-scale concurrency still needs more verification.
Sustainability of the business model. Dynamic pricing sounds attractive, but uncertain supply and demand can cause excessive price swings. Demand-side users may find that scheduling at peak times costs more than a traditional cloud vendor. This calls for a more refined pricing model and a mechanism for long-term price stability.
Regulatory and compliance risk. Compute scheduling touches sensitive areas such as cross-border data flows and export controls on computing resources. If nodes are in future located in the United States or other countries with export restrictions, the platform will need to handle compliance very carefully.
None of these problems is fatal, but they are real growing pains. The team needs to balance scaling up against polishing the experience.
8. Who should pay attention to this matter
If you work in the AI industry, this matters to you whatever your role, though in different ways.
Developers and technical leads should watch this space because it may change how you consume resources. Imagine training models without haggling with cloud sales reps or signing contracts worth hundreds of thousands of yuan a year, and instead scheduling on demand with per-second billing. This is especially meaningful for independent developers and small teams: you may finally be able to run training jobs of the same scale as a big company at half the price.
Product and project managers should pay attention because it affects project scheduling and cost estimates. If the platform can supply compute reliably, you can iterate faster and more aggressively: retrain the model when results are poor, with less worry about compute cost. The precondition, of course, is that the platform's service quality stays stable.
Founders and C-level executives should pay attention because your competitors may be using similar methods to cut costs and raise efficiency. Competition in the large-model space is intensifying. Whoever obtains compute at lower cost has more pricing flexibility and more room for R&D investment. This is not a shortcut for overtaking rivals, but it may be a key variable in staying competitive.
Investors and strategic planning teams should pay attention because it could reshape the AI infrastructure landscape. If this "computing power grid" is built, it will become the power grid of the AI era, something all large-model training and inference depends on. At that point it would be valued not as a software platform but as an infrastructure provider.
9. Prediction of future trends
I have a fairly clear view: within three years, one or two leading platforms will emerge in compute scheduling. The market structure will resemble today's cloud computing market, with three to five major players dominating and many small and medium-sized platforms serving niches.
It is too early to say whether the Tsinghua-affiliated team will come out on top. But several milestones are worth watching:
Node expansion speed. If they can connect the equivalent of more than 500,000 GPUs by the end of 2026, they will have the standing to compete with traditional cloud vendors. If they stall at the 100,000 level, they may end up as a small but well-crafted niche platform.
Benchmark cases with major customers. Recognition from startups and small teams only proves the platform is usable; recognition from big companies proves it is reliable. Winning compute scheduling orders from one or two leading internet companies would be a powerful endorsement.
Ecosystem building. A compute scheduling platform is essentially a two-sided market that needs supply and demand to grow together. The platform told me its goal this year is to focus on the supply side and establish partnerships with 10 to 15 medium-sized data centers. The strategy is sound: get the "electricity" flowing first, and the "users" will follow.
Taking a wider view, I think compute scheduling is only the first step. Future AI infrastructure will become increasingly layered: hardware resources at the bottom, scheduling platforms in the middle, and models and applications on top. This layered decoupling closely resembles the path taken in the cloud computing era.
Whoever gains a foothold in the middle layer could become the "AWS" of the AI era. The road is not easy, but the upside is huge.
10. Summary and recommendations
The "smart computing power grid" built by the Tsinghua-affiliated team is essentially a software-defined way to redistribute scattered computing resources. What it addresses is not a technical problem but efficiency and cost: it makes idle compute liquid and lets the demand side access it on demand.
In the short term, it is better suited as a supplement to traditional cloud vendors for elastic expansion and cost optimization. In the long run, if the ecosystem takes off, it may become an important part of AI infrastructure.
If you work on large models, consider registering an account on the platform and trying it out; it is still in a promotional period and pricing is favorable. If you are a business decision maker, you can evaluate whether migrating some non-core training jobs is feasible. If you are looking for investment opportunities in AI, this direction deserves close attention.
Computing power is the "water, electricity and coal" of the AI era, and such utilities should not be monopolized. Whether this effort succeeds, time will tell.