How to fix a setup timing violation?

13 ways to fix a setup timing violation

1. Reducing the Clock Frequency (not preferable)
$T_{\text{launch edge}} + T_{\text{launch latency}} + T_{\text{comb}} < T_{\text{capture edge}} - T_{\text{setup}} + T_{\text{capture latency}}$
• The easiest solution is to reduce the clock frequency (increase the period), which moves the capture edge later and leaves more time for the data to arrive.
• Doing this degrades performance (data rate, CPU speed, operations per second, etc.).
• The decision to reduce the clock frequency belongs to the architecture team; RTL or PNR engineers can't change it on their own.
• Sometimes this solution is not acceptable because the product standard requires a specific data rate.
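
The effect of a longer period on the setup inequality can be checked numerically. The following is a minimal sketch, assuming a hypothetical path whose launch edge is at 0 ps and whose capture edge falls one clock period later. The function name `setup_slack` and all delay values are illustrative and not taken from a real design.

```python
def setup_slack(t_launch_edge, t_launch_latency, t_comb,
                t_capture_edge, t_setup, t_capture_latency):
    """Return the setup slack in ps; a negative value means a violation."""
    arrival = t_launch_edge + t_launch_latency + t_comb
    required = t_capture_edge - t_setup + t_capture_latency
    return required - arrival


# Hypothetical numbers: 100 ps clock latency on both flops,
# 1050 ps of combinational delay, 50 ps setup time.
for period_ps in (1000, 1250):
    slack = setup_slack(0, 100, 1050, period_ps, 50, 100)
    status = "passes" if slack >= 0 else "violates"
    print(f"period = {period_ps} ps -> slack = {slack} ps ({status})")
```

With these assumed numbers the path misses setup at a 1000 ps period and meets it at 1250 ps, at the cost of a 20% lower clock frequency.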

2. Pipelining

What is Pipelining?

• The most common way to fix setup in RTL design is to add pipeline registers.
• The idea of pipelining is to split a large $T_{\text{comb}}$ across multiple clock cycles.
• For example, to implement $A + B \times C$, one can do all the operations in one cycle, or do the multiplication in one cycle and the addition in the next, as shown in the diagram.

• The disadvantages of pipelining are:
o More area due to the pipeline registers.
o More latency: instead of finishing the operation in one cycle, we finish it in multiple cycles.
o Synchronization: since the data is delayed by the pipeline registers, the downstream logic that receives the data has to account for this delay. Notice also how we needed to add a pipeline register on A as well to synchronize $A_1$ with $B_1 \times C_1$; otherwise we would have added $A_2$ from the next sample to $B_1 \times C_1$.
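
The synchronization issue can be reproduced with a small cycle-by-cycle model. This is only a behavioural sketch in Python, not RTL. The function `pipelined_mac`, its `delay_a` flag and the sample values are hypothetical names and numbers chosen for illustration.

```python
def pipelined_mac(samples, delay_a=True):
    """Cycle-by-cycle model of a 2-stage A + B*C pipeline.

    samples: list of (a, b, c) tuples, one new sample per clock cycle.
    delay_a: if True, A passes through its own pipeline register so that
             it stays aligned with the registered product B*C.
    """
    prod_reg = None  # pipeline register holding B*C
    a_reg = None     # pipeline register holding A
    outputs = []
    for a, b, c in samples:
        # Stage 2 (adder) works on what was registered at the previous edge.
        if prod_reg is not None:
            a_operand = a_reg if delay_a else a
            outputs.append(a_operand + prod_reg)
        # Stage 1 (multiplier): registers update at the clock edge.
        prod_reg = b * c
        a_reg = a
    # The last sample is still inside the pipeline when the loop ends.
    return outputs


samples = [(1, 2, 3), (10, 4, 5), (100, 6, 7)]
print(pipelined_mac(samples, delay_a=True))   # A1 + B1*C1, A2 + B2*C2
print(pipelined_mac(samples, delay_a=False))  # A2 + B1*C1, A3 + B2*C2 (wrong)
```

Without the extra register on A, each product is added to the A value of the following sample, which is exactly the mismatch described above.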

3. Multicycle Path

What is Multicycle Path?

• This method is similar to pipelining: we again let the combinational path take multiple cycles to finish.
• The difference is that we don't add pipeline registers. Instead, we capture the data at a later capture clock edge.
• This can be done in 2 ways:
o Use a control circuit to mask the 1st capture edge and allow a later one.
o Use a divided clock for the capture FF, as shown in the diagram below.

Multi Cycle Path vs Pipelining
• At first, a multicycle path (MCP) and pipelining might appear to be the same, but a closer look shows a big difference.

• In the case of pipelining:
o In the 1st cycle, A, B, C enter the 1st stage of the pipeline. In the 2nd cycle, A, B, C enter the 2nd stage while a new sample enters the 1st stage.
o We receive an output every clock cycle, and the added latency due to the pipeline registers affects us at the beginning only.

• In the case of MCP:
o In the 1st cycle, A, B, C enter the circuit. In the 2nd cycle, the circuit is still busy and we can't insert a new sample until it finishes.
o We receive an output every 2 clock cycles.
• This shows that pipelining fixes setup while keeping a high processing speed, whereas MCP slows down the processing speed.
• You can think of MCP as reducing the clock frequency selectively, in parts of the circuit rather than the entire circuit.
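
The throughput difference can be put into numbers with a simple cycle count. The sketch below assumes an idealized model in which a pipeline accepts one new sample per cycle and a multicycle path is busy for a fixed number of cycles per sample. The function names are hypothetical.

```python
def cycles_pipelined(num_samples, stages):
    """Cycles to finish num_samples: fill the pipeline once, then one result per cycle."""
    return stages + (num_samples - 1)


def cycles_multicycle(num_samples, cycles_per_sample):
    """Cycles to finish num_samples when the path is busy for every sample."""
    return num_samples * cycles_per_sample


n = 100
print("2-stage pipeline:", cycles_pipelined(n, stages=2), "cycles")
print("2-cycle MCP:     ", cycles_multicycle(n, cycles_per_sample=2), "cycles")
```

For 100 samples the pipeline pays its extra latency once, while the multicycle path pays it on every sample.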

4. Retiming

What is Retiming in VLSI?

• In this method, if $T_{\text{comb}}$ is too large to fit in the clock cycle, we split the logic and move part of it to another cycle.

• Consider the example below:
o The red and green logic combined make $T_{\text{comb}} = 700$ ps, which causes a setup violation.
o We move the green logic to the next clock cycle, to be combined with the blue logic.
o This reduces $T_{\text{comb}}$ between FF1 and FF2 from 700 ps to 500 ps, which passes setup.
o It also increases $T_{\text{comb}}$ between FF2 and FF3 from 100 ps to 300 ps, but this is okay because it still passes setup. If the blue logic were large, this method wouldn't work.
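
The arithmetic of this example can be written out as a quick check. The individual delays (red 500 ps, green 200 ps, blue 100 ps) follow from the totals quoted above. The 600 ps budget per stage is an assumption made only for this sketch, since the text does not state the clock period.

```python
RED_PS, GREEN_PS, BLUE_PS = 500, 200, 100  # derived from the totals in the text
BUDGET_PS = 600  # assumed time available per stage (hypothetical)


def stage_delays(green_in_first_stage):
    """Return (FF1->FF2 delay, FF2->FF3 delay) for a given placement of the green logic."""
    if green_in_first_stage:
        return RED_PS + GREEN_PS, BLUE_PS
    return RED_PS, GREEN_PS + BLUE_PS


def passes_setup(delays, budget_ps=BUDGET_PS):
    """True if every stage fits in the assumed budget."""
    return all(d <= budget_ps for d in delays)


before = stage_delays(green_in_first_stage=True)
after = stage_delays(green_in_first_stage=False)
print("before retiming:", before, passes_setup(before))
print("after retiming: ", after, passes_setup(after))
```

The total logic delay is unchanged by retiming; only its distribution between the two stages changes.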

• Retiming can be done manually by the RTL designer or automatically by the synthesis tools

o In the example below, the purple logic takes A and B as inputs. If we move the green logic to the next cycle, B arrives one cycle later than expected. By then, $A_1$ is gone and a new $A_2$ has arrived, which gets computed with sample $B_1$. This breaks the functionality of the circuit.

o Synthesis tools will avoid any retiming that breaks the functionality, as this example does.
o The RTL designer has full control over the code and can fix this issue by, for example, adding a pipeline register before the purple logic to delay it by one cycle, and then handling any new issues caused by the added register.
o Hence, the RTL designer can do more aggressive retiming than the synthesis tools, but with extra effort.

5. Retiming + Pipelining

• The previous example shows how retiming can be combined with pipelining.
• Let's consider the same example of $A + B \times C$:
o We can move the adder to the next clock cycle if there is margin there.
o However, we get the same issue as in the previous example: A is not synchronized with $B \times C$, so we add a pipeline register to realign them.
o This way we fix the setup violation and save the area of the $B \times C$ pipeline registers.

6. Optimizing Synthesis

• Synthesis tools have many features and switches that the engineer can use to improve timing and control the trade-offs between the PPA metrics.
• This topic is very large and needs a tutorial of its own, so we will demonstrate just a few of the things that can be done.

o Increase the timing effort: Most synthesis tools have switches that control the effort the tool puts into fixing a certain PPA metric or performing a certain optimization. Higher effort leads to better optimization but longer runtime, while lower effort leads to less optimization but shorter runtime.

o Decrease or disable area and power efforts: Area and power optimizations usually degrade the timing of the circuit. Reducing the effort of these optimizations, or disabling them altogether, may improve timing but worsen the area and power of your chip.

o Enable flattening: The RTL code consists of several modules connected to each other. By default, synthesis tools synthesize each module separately and then connect them together in the top module, thus preserving the hierarchy and the boundaries between the modules. Another approach is to remove the module boundaries and put all cells in one hierarchy. This is called flattening and generally produces better timing results.

7. False Path

Applying False Paths in the Constraints

• False paths are timing paths that can never occur because of the logic of the circuit.
• Consider the example below:
• Both muxes have the same select signal. This means we have only 2 possible timing paths: the one going through both red logics (200 + 300 = 500 ps) and the one going through both blue logics (100 + 500 = 600 ps).
• The paths going through red logic then blue logic (200 + 500 = 700 ps) or blue logic then red logic (100 + 300 = 400 ps) can never happen.

• Unless we instruct the tool to ignore these false paths, they will be considered in timing analysis, and the large $T_{\text{comb}}$ of the red-to-blue path will violate setup.
• If we don't apply correct constraints on these paths, not only do we get fake setup violations, but we also hinder the ability of the synthesis and PnR tools to optimize the other, real violating timing paths. This is because the tools apply extreme optimizations only to the most critical paths and won't consider the less critical paths for these optimizations until they have solved the most critical ones.
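
The mux example can be enumerated in a few lines to show what the timing tool sees with and without the false-path constraint. This is an illustrative Python model only; it is not how a timing tool is scripted. The dictionary and function names are hypothetical, and the delays are the ones quoted above.

```python
from itertools import product

# Delay (ps) of the logic feeding each mux input, as quoted in the example.
MUX1 = {"red": 200, "blue": 100}
MUX2 = {"red": 300, "blue": 500}


def enumerate_paths():
    """List every structural path as (branch1, branch2, delay_ps, is_false)."""
    paths = []
    for b1, b2 in product(MUX1, MUX2):
        delay = MUX1[b1] + MUX2[b2]
        # Both muxes share one select, so mixed-colour paths can never be active.
        is_false = b1 != b2
        paths.append((b1, b2, delay, is_false))
    return paths


def worst_delay(paths, ignore_false):
    """Worst path delay, optionally skipping the false paths."""
    return max(d for _, _, d, f in paths if not (ignore_false and f))


paths = enumerate_paths()
print("worst delay, all structural paths:", worst_delay(paths, ignore_false=False))
print("worst delay, false paths ignored: ", worst_delay(paths, ignore_false=True))
```

Without the constraint the tool chases a 700 ps path that can never be exercised, while the real worst case is 600 ps.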

8. Optimizing the Floorplan

• Floorplanning is the 1st step in the PNR flow and involves things like defining the chip size and boundaries, manually placing the major blocks (analog, SRAM, etc.) in the chip, and placing the chip ports.
• Here are some of the things that affect setup in the circuit:
o A small chip area brings the cells closer to each other and closer to the ports, which in turn reduces the wire delays. However, if the area is too small, several issues appear, such as large voltage drop, cell congestion, routing detours, crosstalk, etc.

o The placement of the major blocks in the chip affects timing. The example on the left shows how placing the SRAMs near the IO ports might block the standard cells from being placed near their relevant ports. They also block the routing, so wires have to detour around them, which results in longer wire delays.

o The placement of the ports also affects timing. The example on the right shows how a bad placement of the ports can lead to long wire delays and buffering, which worsens $T_{\text{comb}}$.

9. Optimizing the wire delay

• In part 1 we showed how a signal propagating through an RC circuit will have a delay proportional to the resistance and the capacitance. Hence, to reduce this delay we need to reduce the resistance and capacitance of the wire.

• This will also decrease the load cap of the cell that drives the wire which will speed up the cell too.

Reducing the resistance $R = \rho L / A$

• Reducing the length $L$ of the wire reduces the delay. We showed some examples of how to reduce it using a better floorplan.
• Increasing the width decreases the delay. Higher metal layers have a larger default width and also greater thickness, hence a larger cross-sectional area $A$. PNR tools use these higher layers for long and critical nets to reduce their delay. The PNR engineer can manually move the wires to higher layers during ECO, or apply non-default routing rules (NDR) on these nets to make the router route them in higher layers.

Reducing the capacitance $C = \epsilon A / d$

• Increasing the spacing $d$ by moving the two wires away from each other reduces the capacitance between them. We can apply NDR on specific nets to tell the router not to route other nets very close to them.
• Reducing the common distance. When two wires run alongside each other for a long distance, the common area $A$ is large, leading to a larger capacitance. We can move one of the two wires to another layer to reduce the capacitance and hence the delay.
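
The two formulas can be turned into a small calculator to see how each knob scales the result. This is a first-order sketch. It assumes a rectangular wire cross-section (width times thickness) and a parallel-plate coupling model in which the facing area is the common length times the wire thickness. The function names and the unit-less values are illustrative.

```python
def wire_resistance(rho, length, width, thickness):
    """R = rho * L / A, with cross-section A = width * thickness."""
    return rho * length / (width * thickness)


def coupling_capacitance(eps, common_length, thickness, spacing):
    """C = eps * A / d, with facing area A = common_length * thickness."""
    return eps * common_length * thickness / spacing


base_r = wire_resistance(rho=1.0, length=100.0, width=1.0, thickness=1.0)
wide_r = wire_resistance(rho=1.0, length=100.0, width=2.0, thickness=1.0)
base_c = coupling_capacitance(eps=1.0, common_length=100.0, thickness=1.0, spacing=1.0)
far_c = coupling_capacitance(eps=1.0, common_length=100.0, thickness=1.0, spacing=2.0)

print("doubling the width scales R by  ", wide_r / base_r)
print("doubling the spacing scales C by", far_c / base_c)
```

In this simple model, doubling the width halves the resistance and doubling the spacing halves the coupling capacitance; real extraction also includes fringe and ground capacitance, which this sketch ignores.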

10. Relaxing the Power Grid

• The power grid is the metal network that delivers power from the higher metal layers down to the standard cells.
• We showed how the wire delay is affected by things like spacing and width. A wide and dense power grid leaves few routing resources for the signal nets, leaving no room to increase their spacing or width.
• However, relaxing the power grid increases the resistance of the power network, causing a bigger voltage drop. So the PNR designer has to trade off between improving timing and fixing voltage drop.

11. Upsizing

What is Upsizing in VLSI?

We showed in part 1 how the MOSFET size affects the propagation delay of the cell. So, to fix setup, we can use larger cells that have less propagation delay.

There are several considerations when using this method:

• Bigger cells mean more area and power consumption.
• Bigger cells have larger gate capacitance. This slows down the cell that drives them, because it now sees a larger load capacitance. The gain from upsizing the cell should outweigh the slowdown of the driving cell.
• Since big cells consume more power, they are likely to cause a large voltage drop on the cells around them.
• During the ECO flow there might not be enough area to accommodate the bigger cell, which requires you to move the cells around it and then reroute the nets to their pins. Moving the cells and rerouting could worsen the timing of these cells.
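
The trade-off between a faster cell and a slower driver can be sketched with a first-order RC model. The sketch assumes each stage delay is simply drive resistance times load capacitance, and that upsizing by a factor k divides the cell's drive resistance by k and multiplies its input capacitance by k. All resistance and capacitance values are hypothetical.

```python
def path_delay_ps(k, r_driver_kohm=2.0, c_in_ff=1.0,
                  r_cell_kohm=2.0, c_load_ff=8.0):
    """Delay of driver -> upsized cell -> load, using delay = R * C per stage.

    k scales the cell: its drive resistance drops to r_cell / k while its
    input (gate) capacitance grows to c_in * k. Note that kOhm * fF = ps.
    """
    driver_delay = r_driver_kohm * (c_in_ff * k)  # the driver sees a bigger load
    cell_delay = (r_cell_kohm / k) * c_load_ff    # the upsized cell is faster
    return driver_delay + cell_delay


for k in (1, 2, 4, 8):
    print(f"upsize x{k}: total delay = {path_delay_ps(k)} ps")
```

With these assumed values, moderate upsizing reduces the total delay, but past a certain size the slowdown of the driver cancels the gain.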

12. Increasing the Driving Strength

• When we discussed upsizing, we showed that when a cell drives a large load capacitance, its output transition time gets slower, which in turn slows down the load cells. Increasing the driving strength improves the transition time, which in turn improves the delay of the load cells.

• There are several ways to increase the driving strength:
o Upsizing the driver cell: Bigger cells produce a larger current and hence charge the load capacitance faster. This method combines the benefit of speeding up the driver by upsizing with the benefit of speeding up the load cells, because they see a better input transition time.

o Downsizing the load cells: This decreases the load capacitance seen by the driver, which speeds up its propagation and transition times, which in turn speeds up the load cells. However, smaller cells have larger delay, so for this method to work, the gain from the improved drive should outweigh the increase in delay due to downsizing.

o Fanout splitting: Instead of one cell driving all the fanout, we can duplicate the driver and split the fanout among the copies, as shown in the diagram. But note that the driver of the driver now sees double the load capacitance, which increases its delay. So you have to balance things so that the overall gain outweighs the increase in delay.

o Side load isolation: Add a small buffer that isolates a large load from the driver. In the example shown, the driver now sees the small capacitance of the buffer instead of the large capacitance of the large NAND. This fixes the green paths but worsens the red path, because the small buffer adds delay to it. For this method to work, the red path should be passing the setup check with a good margin to accommodate the increase in delay.

13. Breaking up Long Nets

• When a cell drives a very long wire with a large capacitance, it has poor propagation and transition times. By breaking the long wire with buffers, the overall improvement can outweigh the delay of the added buffers.
• If the wire is very long, we can split it with an inverter pair instead of a buffer. This is better because the delay of an inverter is less than that of a buffer of the same size. This way we get more cuts in the wire (less load capacitance for each cell) with roughly the same delay as the added buffer.
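
Why cutting a wire helps can be seen from a first-order distributed RC estimate, in which the delay of an unbuffered wire grows with the square of its length. The sketch below assumes a delay of 0.5 * r * c * L^2 per segment and a fixed delay per inserted buffer. It ignores the buffers' own drive resistance and input capacitance. The per-micron resistance and capacitance and the buffer delay are hypothetical values.

```python
def buffered_wire_delay_ps(length_um, segments, r_ohm_per_um=1.0,
                           c_ff_per_um=0.2, t_buf_ps=50.0):
    """Estimated delay of a wire split into equal segments by (segments - 1) buffers."""
    seg_len = length_um / segments
    # Ohm * fF = fs, so divide by 1000 to get ps.
    seg_delay_ps = 0.5 * r_ohm_per_um * c_ff_per_um * seg_len ** 2 / 1000.0
    return segments * seg_delay_ps + (segments - 1) * t_buf_ps


for n in (1, 2, 4, 8):
    print(f"{n} segment(s): {buffered_wire_delay_ps(2000.0, n):.1f} ps")
```

Under these assumptions, one or three buffers on a 2000 um wire reduce the total delay, while seven buffers add as much delay as they remove, so the number of cuts has to be balanced against the buffer delay.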