Many clock tree structures are widely used in the design industry, each with its own merits and
demerits:
- H-tree (figure 1)
- Balanced tree (figure 2)
- Spine tree (figure 3)
- Distributed driven buffer tree (figure 4)

Figure 1: H-tree. Because of its balanced construction, it is easy to reduce clock skew in the
H-tree clock structure.
A disadvantage of this approach is that the fixed clock plan makes it difficult to adjust register
placement, and it leaves little room for fine-tuning the clock tree.
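
As a rough illustration of why the balanced construction helps, the sketch below builds an idealized H-tree recursively and records the wire length from the root to every leaf. The function name `h_tree_leaves`, the unit dimensions, and the assumption that each level's segments are exactly half the length of the previous level's are ours for illustration; a real H-tree also needs buffers and has to route around blockages.

```python
def h_tree_leaves(x, y, half, levels, dist=0.0):
    """Return (x, y, wire_length_from_root) for every leaf of an idealized H-tree.

    At each level the clock travels `half` horizontally and then `half`
    vertically to reach one of four corners; the next level repeats the
    pattern with segments half as long.
    """
    if levels == 0:
        return [(x, y, dist)]
    leaves = []
    for dx in (-half, half):
        for dy in (-half, half):
            leaves += h_tree_leaves(x + dx, y + dy, half / 2, levels - 1,
                                    dist + 2 * half)
    return leaves


leaves = h_tree_leaves(0.0, 0.0, 4.0, 3)
lengths = {d for _, _, d in leaves}
print(len(leaves), lengths)  # 64 {14.0}
```

All 64 leaves share the same root-to-leaf wire length, which is the geometric reason the nominal skew is zero. The same output also shows the rigidity: the leaf positions are dictated by the geometry, not by where the registers would ideally be placed.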

Figure 2: Balanced tree. A balanced tree makes it easy to adjust the capacitance of the net to meet
the skew requirements, but the dummy cells used to balance the load increase area and power.

Figure 3: Spine tree (fishbone). The spine arrangement makes it easy to reduce skew, but it is
heavily influenced by process parameters and may have problems with phase delay.

Figure 4: Distributed driven buffer tree. Distributed buffers make it easy to reduce skew and
power, and clock routing may not be an issue. However, since the buffering is customized, it may
be a less area-efficient method.

Cluster-based CTS
A better plan in principle would be to individually balance and distribute the clock from the
source to each of the logic elements.
If we were simply to lay out the logic and then route a clock
signal to each logic element, using constant wire width and no buffers, then obviously the register
furthest from the source will receive the clock with the greatest delay.
Delay can then be equalized by adding buffers/repeaters or lengthening the wires in the shorter
nets so that the clock reaches all the registers at the same time.
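
A minimal sketch of the wire-lengthening option follows. It assumes a first-order Elmore delay for an unbuffered wire driving a lumped register load and ignores driver resistance; the per-micron resistance and capacitance, the register names, and the loads are illustrative values we made up, not data from a real process.

```python
import math

R_PER_UM = 0.1       # wire resistance in ohm/um (illustrative assumption)
C_PER_UM = 0.2e-15   # wire capacitance in F/um (illustrative assumption)


def wire_delay(length_um, load_f):
    """Elmore delay (s) of an unbuffered wire driving a lumped load."""
    return R_PER_UM * length_um * (0.5 * C_PER_UM * length_um + load_f)


def length_for_delay(target_s, load_f):
    """Invert wire_delay: the wire length (um) that gives target_s for this load."""
    a = 0.5 * R_PER_UM * C_PER_UM
    b = R_PER_UM * load_f
    return (-b + math.sqrt(b * b + 4 * a * target_s)) / (2 * a)


def equalize(nets):
    """nets: {register: (length_um, load_f)} -> {register: extra wire in um}."""
    target = max(wire_delay(length, load) for length, load in nets.values())
    return {reg: length_for_delay(target, load) - length
            for reg, (length, load) in nets.items()}


# Hypothetical registers at different distances from the clock source.
nets = {
    "reg_a": (500.0, 5e-15),
    "reg_b": (1200.0, 5e-15),
    "reg_c": (2000.0, 8e-15),
}
extra_um = equalize(nets)
for reg, pad in extra_um.items():
    print(f"{reg}: add {pad:.1f} um of wire")
```

The farthest register needs no padding, and every other net is lengthened until its delay matches. In practice the same target delay could be reached with buffers instead of extra wire, which this sketch does not model.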
But this is not always feasible in practice, because routing the clock directly from the source to every register may not be possible in high-speed designs with multiple clock domains. A practical approach for such designs is customized cluster-based clock tree synthesis.
In a customized cluster-based clock tree plan, logic elements that operate in a single clock
domain, or logic that shares the same input timing registers, are combined to form a group.
A reference register for each group is selected to establish a reference arrival time.
Initially these clusters are placed to meet the latency requirements based on the clock insertion
delay from the source.
These clusters can be individually optimized to minimize skew, so that the
skew will be within the allowable range for the desired cycle time.
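
The grouping and the per-cluster skew check described above can be sketched as follows. The register names, arrival times, the 20 ps skew budget, and the choice of the first register in each group as the reference are all hypothetical; a real flow would take arrival times from timing analysis and pick the reference deliberately.

```python
from collections import defaultdict


def build_clusters(registers):
    """Group registers by clock domain.

    registers: iterable of (name, clock_domain, clock_arrival_ps).
    """
    clusters = defaultdict(list)
    for name, domain, arrival_ps in registers:
        clusters[domain].append((name, arrival_ps))
    return dict(clusters)


def cluster_skew(members, reference):
    """Worst clock arrival deviation (ps) from the reference register."""
    arrivals = dict(members)
    ref_arrival = arrivals[reference]
    return max(abs(t - ref_arrival) for t in arrivals.values())


# Hypothetical design data.
registers = [
    ("u_cpu/r0", "clk_core", 410.0),
    ("u_cpu/r1", "clk_core", 425.0),
    ("u_cpu/r2", "clk_core", 398.0),
    ("u_io/r0", "clk_io", 610.0),
    ("u_io/r1", "clk_io", 640.0),
]
ALLOWED_SKEW_PS = 20.0  # hypothetical per-cluster skew budget

report = {}
for domain, members in build_clusters(registers).items():
    skew = cluster_skew(members, reference=members[0][0])
    report[domain] = (skew, skew <= ALLOWED_SKEW_PS)

print(report)
# {'clk_core': (15.0, True), 'clk_io': (30.0, False)}
```

Here only the `clk_io` cluster would need further optimization, which shows how each cluster can be tuned on its own without touching the others.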
The cluster level routing can use any of the routing topologies mentioned above, based on the
priorities of the design.
For example, a cluster sensitive to skew can use H-tree or balanced-tree
designs. The tradeoffs among power, area, and the number of buffers added to the network must be planned before selecting a topology for each group.
It is not advisable to let a single clock tree topology limit the performance of the design.
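
One way to make this preplanning concrete is a simple weighted score per cluster. The sketch below is only an illustration: the 1 to 3 scores loosely encode the qualitative merits and demerits listed for figures 1 to 4, and both the scores and the criterion names (`flexibility`, `process_tolerance`) are our assumptions, not measured data.

```python
# Illustrative 1-3 scores (higher is better); assumptions, not measurements.
TOPOLOGY_SCORES = {
    "h_tree": {
        "skew": 3, "power": 2, "area": 2, "flexibility": 1, "process_tolerance": 2},
    "balanced_tree": {
        "skew": 3, "power": 1, "area": 1, "flexibility": 3, "process_tolerance": 2},
    "spine_tree": {
        "skew": 3, "power": 2, "area": 2, "flexibility": 2, "process_tolerance": 1},
    "distributed_buffer": {
        "skew": 3, "power": 3, "area": 1, "flexibility": 2, "process_tolerance": 2},
}


def pick_topology(weights):
    """Return the topology with the highest weighted score for one cluster."""
    def score(name):
        return sum(weights.get(criterion, 0) * value
                   for criterion, value in TOPOLOGY_SCORES[name].items())
    return max(TOPOLOGY_SCORES, key=score)


# Hypothetical priorities for three clusters on the same chip.
print(pick_topology({"skew": 1, "flexibility": 2}))        # balanced_tree
print(pick_topology({"skew": 1, "power": 3}))              # distributed_buffer
print(pick_topology({"area": 3, "process_tolerance": 1}))  # h_tree
```

The point of the sketch is that different clusters on one chip can end up with different topologies once their priorities are written down explicitly.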

Advantages of cluster-based CTS
In deep submicron processes, designs with millions of gates and very high clock frequencies,
sometimes multiple gigahertz, are becoming normal.
To meet such frequency requirements, the clock tree network needs to be planned carefully and in detail.
Designers should be able to plan the skew needed to achieve the minimum cycle period, instead of trying to force the skew to zero in the presence of perhaps poorly characterized process variations.
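
To show what planning for skew can buy, the sketch below applies the standard setup and hold inequalities to a single register-to-register path. Skew is defined here as capture clock arrival minus launch clock arrival, and all path numbers are hypothetical. The sketch looks at one path in isolation; time borrowed this way has to be paid back by the neighboring stage, which a real analysis must check.

```python
def min_period_ps(t_cq, t_logic_max, t_setup, skew):
    """Smallest cycle time that meets setup on this path."""
    return t_cq + t_logic_max + t_setup - skew


def hold_met(t_cq, t_logic_min, t_hold, skew):
    """Hold check: the fastest data must not beat the (later) capture clock."""
    return t_cq + t_logic_min >= t_hold + skew


def max_useful_skew_ps(t_cq, t_logic_min, t_hold):
    """Largest capture-side skew that still passes the hold check."""
    return t_cq + t_logic_min - t_hold


# Hypothetical path timing, all values in picoseconds.
T_CQ, T_LOGIC_MAX, T_LOGIC_MIN, T_SETUP, T_HOLD = 80, 700, 120, 50, 30

zero_skew_period = min_period_ps(T_CQ, T_LOGIC_MAX, T_SETUP, skew=0)
best_skew = max_useful_skew_ps(T_CQ, T_LOGIC_MIN, T_HOLD)
planned_period = min_period_ps(T_CQ, T_LOGIC_MAX, T_SETUP, skew=best_skew)

print(zero_skew_period, best_skew, planned_period)  # 830 170 660
```

With these numbers, a planned skew of 170 ps shortens the achievable period on this path from 830 ps to 660 ps while still meeting hold, whereas forcing the skew to zero leaves that slack unused.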
One of the biggest advantages of doing cluster-based CTS is that the delay due to voltage drop in
the interconnect can be modeled or incorporated into static timing analysis.
A voltage drop of 10% of Vdd may lead to more than 15% delay variation in nanometer technologies.
Typically the voltage drop is more severe at the center of the chip, so standard cells
characterized only for the worst or best case cannot give accurate delay values: they cannot
account for voltage variations that occur dynamically during operation.
Although simultaneous mixed min/max and on-chip variation (OCV) analyses were developed for such situations, there is unfortunately no static timing engine that models how the delay of standard cells changes with the variation in the voltage.
A cluster near the center of a chip, for example, may not receive 100% of Vdd, so
its delay will also vary, since the supply voltage of those standard cells is less than Vdd.
When we use a cluster-based customized clock tree synthesis method, we treat each cluster
according to its priority within the chip. If the voltage drop at the center of the die is a potential
problem, then the clusters at the center are given timing priority, with setup and hold
margins sufficient to prevent the voltage drop from causing timing problems.
A trace-back analysis with actual voltage drop numbers is always needed to define the setup and hold margins.
In such cases, the clusters can be handled according to the weighting of timing, area, power, and
other factors.
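
A simple way to turn voltage drop numbers into per-cluster margins is sketched below. It assumes that delay grows linearly with supply drop at 1.5% of delay per 1% of Vdd, a sensitivity read off the "10% drop, more than 15% delay" figure quoted earlier; real cells are nonlinear, and the cluster names, path delays, and drop values are hypothetical.

```python
# Assumption: linear sensitivity of 1.5% delay per 1% Vdd drop. Real cell
# delay is nonlinear in voltage, so treat this as a first-order estimate.
DELAY_SENSITIVITY = 1.5


def drop_margin_ps(nominal_delay_ps, drop_fraction):
    """Extra delay (ps) to budget for a path seeing the given Vdd drop."""
    return nominal_delay_ps * DELAY_SENSITIVITY * drop_fraction


# Hypothetical clusters: (nominal path delay in ps, Vdd drop as a fraction).
cluster_drop = {
    "edge": (800.0, 0.03),
    "center": (800.0, 0.10),
    "mid": (800.0, 0.06),
}

margins = {name: drop_margin_ps(delay, drop)
           for name, (delay, drop) in cluster_drop.items()}

# Clusters that need the largest margin get timing priority first.
priority = sorted(margins, key=margins.get, reverse=True)
print(priority)  # ['center', 'mid', 'edge']
```

Under these assumptions the center cluster needs roughly 120 ps of extra margin against about 36 ps at the edge, so it is handled first. The drop values fed into such a calculation should come from an actual voltage drop analysis of the placed design.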

Conclusion
For many designers, timing closure is the major issue in high-speed designs, which leads to any
number of iterations and schedule-consuming tasks.
At the 350 nm and 250 nm technology nodes, EDA companies understood the timing closure issues and updated their tools to meet designers’ needs to a certain extent.
But in the current scenario, timing closure is the major part of the design cycle. Because of
time-to-market pressure, new design methodologies at the system integration level, like system on chip
and network on chip, are using on-chip buses and multiple clock domains.
This kind of design requirement pushes designers to use new design techniques like timing-driven placement, useful-skew concepts, or customized cluster-based CTS.
Since there are plenty of IP cores available in the market, simply designing an SoC may not
consume much time. But after the blocks have been selected, the physical implementation into
hardware consumes more time and effort.
We may have to develop a customized design flow for each and every stage of the design cycle to address the current challenges in chip implementation.
Clock tree synthesis is a case in point. As we have discussed throughout the paper, designers
should not be limited to any single fixed clock routing topology.
Rather, they should be encouraged to mix any of the described clock routing schemes within a single chip. Customized cluster-based clock tree synthesis uses the best topology for each cluster to meet requirements like skew, area, and power at every stage, and it improves the top-level system performance.