BGP looks complicated because it is complicated. But the complexity isn't arbitrary—it's the accumulated weight of every business relationship, security concern, and scaling challenge the Internet has faced over three decades.
This deep dive explores what's actually happening when BGP chooses a route, why each attribute exists, and the operational realities that make BGP both frustrating and irreplaceable.
How BGP Chooses a Path
When a BGP router receives multiple routes to the same destination, it must pick one. The selection process follows a strict order—but understanding that order reveals something important about how the Internet actually works.
The Logic Behind the Order
BGP's path selection isn't about finding the "best" path in any objective sense. It's about satisfying the most important local policies first, then falling back to increasingly technical tiebreakers.
First: Local Administrative Preferences
The first criteria are entirely local—they never leave your network:
1. Weight (Cisco-specific): A local knob that overrides everything else. Higher wins. If you absolutely must use a specific path, set its weight higher.
2. Local Preference: Shared within your AS via iBGP. This is how you tell all your routers "when leaving our network, prefer this exit." Higher wins, default is 100.
3. Locally Originated: Prefer routes you're announcing yourself. Your own networks should take the most direct path.
These three let operators implement business decisions: prefer the cheaper provider, avoid the congested link, use the customer path over the transit path.
Second: Path Characteristics
4. Shortest AS Path: Fewer autonomous systems in the path wins. This is BGP's crude distance metric. It is crude because AS path length correlates poorly with latency, bandwidth, or reliability. A path through three ASes might be faster than a path through two.
5. Origin Type: Prefer routes originated natively in BGP, for example with a network statement (IGP), over routes learned from the legacy EGP protocol (EGP), over routes redistributed into BGP from another source (Incomplete). This is mostly historical.
6. Lowest MED: The Multi-Exit Discriminator lets your neighbor suggest which of their entry points you should use. Lower wins. By default, MED is only compared between routes received from the same neighboring AS, and many networks ignore it entirely.
Third: The Hot Potato
7. eBGP over iBGP: Prefer external routes over internal ones.
8. Lowest IGP Metric to Next Hop: Among remaining routes, pick the one whose exit point is closest to you.
This is "hot potato" routing: hand the packet off to the next AS as quickly as possible. Your network minimizes its own costs by making the packet someone else's problem at the earliest opportunity. Every network does this. The result is that Internet routing is optimized for each individual network's costs, not for the packet's end-to-end journey.
Finally: Tiebreakers
9. Oldest Route: Stability matters. Don't switch paths unless you have a reason.
10. Lowest Router ID: Deterministic tiebreaker.
11. Shortest Cluster List: For route reflector deployments.
12. Lowest Neighbor IP: Final tiebreaker.
Most path selections are decided in the first few steps. The tiebreakers exist because BGP must always pick exactly one path—it can't say "I don't know."
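The ordering above can be sketched as a sort key. This is a simplified illustration, not any vendor's implementation: the `Route` fields and function names are made up for the example, MED is compared across all routes rather than only within the same neighboring AS, and the oldest-route, cluster-list, and neighbor-IP tiebreakers are left out.

```python
from dataclasses import dataclass
from typing import Tuple

# Lower rank is preferred: IGP over EGP over Incomplete.
ORIGIN_RANK = {"igp": 0, "egp": 1, "incomplete": 2}


@dataclass(frozen=True)
class Route:
    """Hypothetical route record holding only the attributes compared here."""
    name: str
    weight: int = 0
    local_pref: int = 100
    locally_originated: bool = False
    as_path: Tuple[int, ...] = ()
    origin: str = "igp"
    med: int = 0
    is_ebgp: bool = True
    igp_metric: int = 0
    router_id: int = 0


def preference_key(r: Route):
    # Tuples compare field by field, so earlier fields dominate later ones.
    # min() picks the smallest key, so "higher wins" fields are negated.
    return (
        -r.weight,                  # highest weight
        -r.local_pref,              # highest local preference
        not r.locally_originated,   # locally originated first
        len(r.as_path),             # shortest AS path
        ORIGIN_RANK[r.origin],      # best origin type
        r.med,                      # lowest MED (simplified, see lead-in)
        not r.is_ebgp,              # eBGP over iBGP
        r.igp_metric,               # closest exit (hot potato)
        r.router_id,                # lowest router ID
    )


def best_path(routes):
    return min(routes, key=preference_key)
```

With this key, a route with local preference 200 and a three-AS path beats a route with local preference 100 and a one-AS path, which is the point of the ordering: policy is consulted before path length.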
The Attributes That Matter
AS Path: Identity and Loop Prevention
The AS path does three things:
Loop Prevention: If a router sees its own AS number in the path, it rejects the route. This is how BGP prevents routing loops without a hop count limit.
Path Selection: Shorter paths are preferred (step 4 above).
Policy Tool: AS path prepending—adding your own AS number multiple times—makes a route look longer and therefore less attractive. Announce [65001] on your preferred link and [65001, 65001, 65001] on your backup. Neighbors will prefer the shorter path.
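Prepending is a blunt instrument. Every network that hears the announcement over that link sees the longer path, so you cannot target a single remote network, and any network that prefers the path through local preference will ignore the prepends entirely. Excessive prepending (more than 3-4 times) is generally pointless and can trigger filtering.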
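Both uses of the AS path fit in a few lines. The sketch below uses made-up helper names and treats the AS path as a plain list of AS numbers, ignoring AS sets and confederation segments.

```python
def accepts(local_asn, as_path):
    """Loop prevention: reject any route whose path already contains our AS."""
    return local_asn not in as_path


def prepend(local_asn, as_path, times=1):
    """Return the path as announced to an eBGP neighbor, with our AS added `times` times."""
    return [local_asn] * times + list(as_path)


# Announcing our own prefix (empty path so far) on two links:
primary = prepend(65001, [], times=1)   # [65001]
backup = prepend(65001, [], times=3)    # [65001, 65001, 65001]
```

A neighbor comparing only path length would pick `primary`, since it is two entries shorter than `backup`.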
Next Hop: Where to Send Packets
The next hop attribute tells a router where to forward packets. Two critical behaviors:
eBGP behavior: Next hop is the IP address of the neighbor who advertised the route.
iBGP behavior: Next hop is unchanged by default. If router A learns a route from eBGP neighbor 203.0.113.1, then advertises it via iBGP to router B, router B sees next hop 203.0.113.1—an address that might not be directly reachable.
This is why "next-hop-self" is commonly configured on iBGP sessions: it changes the next hop to the advertising router's address, which internal routers know how to reach.
Local Preference: Controlling Your Exit
Local preference controls which exit your AS uses when multiple paths exist. Higher wins. It's shared via iBGP but never sent to external neighbors—it's purely internal policy.
Common pattern: Set local preference 150 for customer routes, 100 for peer routes, 80 for transit routes. This ensures traffic prefers paths through customers (who pay you) over peers (free) over transit (you pay them).
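As a rough sketch of that pattern, the mapping below assigns the values from the text by relationship type and then picks the exit with the highest local preference. The dictionary keys (`learned_from`, `via`) and function names are illustrative assumptions, not part of any router configuration language.

```python
# Values taken from the pattern described above.
LOCAL_PREF_BY_RELATIONSHIP = {"customer": 150, "peer": 100, "transit": 80}


def assign_local_pref(route):
    """Return a copy of the route tagged with a local preference."""
    tagged = dict(route)
    tagged["local_pref"] = LOCAL_PREF_BY_RELATIONSHIP[route["learned_from"]]
    return tagged


def preferred_exit(routes):
    """Pick the route with the highest local preference."""
    return max((assign_local_pref(r) for r in routes), key=lambda r: r["local_pref"])
```

Given one route from each relationship type for the same prefix, `preferred_exit` returns the customer route regardless of AS path length.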
MED: Suggesting an Entrance
MED is the inverse of local preference—it suggests to your neighbors which of your entry points they should use. Lower wins.
If you connect to AS 65002 in both New York and London, you can advertise your US prefixes with MED 10 in New York and MED 100 in London. AS 65002 should prefer the New York path.
But MED's influence is limited:
- It only affects one neighbor at a time
- Many networks ignore MED from peers and transit providers
- It's only compared between routes from the same neighboring AS
MED is a polite suggestion, not a command.
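The "same neighboring AS" restriction is easy to miss, so here is a small sketch of it. Routes are grouped by the AS they were received from, and MED only eliminates candidates inside each group. The field names are assumptions for the example, and real implementations fold this into the full decision process rather than running it as a separate pass.

```python
from collections import defaultdict


def med_survivors(routes):
    """Keep, for each neighboring AS, only the routes with that AS's lowest MED."""
    groups = defaultdict(list)
    for route in routes:
        groups[route["neighbor_as"]].append(route)

    survivors = []
    for group in groups.values():
        lowest = min(r["med"] for r in group)
        survivors.extend(r for r in group if r["med"] == lowest)
    return survivors
```

Seen from AS 65002 in the example above, the London route (MED 100) loses to the New York route (MED 10) because both came from the same neighbor. A route from a different AS with MED 500 is untouched by that comparison.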
Communities: Tags for Policy
Communities are 32-bit tags attached to routes, typically written as ASN:value (like 65001:100). They enable policy without requiring explicit configuration for every prefix.
Examples:
Internal signaling: Tag customer routes with 65001:1000, peer routes with 65001:2000. Filtering policies can match on these tags instead of maintaining prefix lists.
Remote triggering: Many transit providers honor specific communities. Prepend AS path once? Set community 65002:101. Don't announce to a specific peer? Set community 65002:9999. Customers can influence routing without the provider touching their config.
Well-known communities:
- NO_EXPORT: Don't advertise to eBGP neighbors
- NO_ADVERTISE: Don't advertise to any neighbor
- LOCAL_AS: Don't advertise outside the local confederation sub-AS
Communities are BGP's most flexible policy tool.
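The ASN:value notation is just a way of writing one 32-bit number: the high 16 bits hold the AS number and the low 16 bits hold the value. A minimal sketch of the packing, with function names chosen for the example and the well-known values taken from RFC 1997:

```python
def encode_community(asn, value):
    """Pack ASN:value into a single 32-bit standard community."""
    if not (0 <= asn <= 0xFFFF and 0 <= value <= 0xFFFF):
        raise ValueError("both halves of a standard community must fit in 16 bits")
    return (asn << 16) | value


def decode_community(community):
    """Split a 32-bit community back into (asn, value)."""
    return community >> 16, community & 0xFFFF


# Well-known communities defined in RFC 1997.
NO_EXPORT = 0xFFFFFF01
NO_ADVERTISE = 0xFFFFFF02
NO_EXPORT_SUBCONFED = 0xFFFFFF03  # shown as LOCAL_AS on some platforms
```

Because each half is 16 bits, a 4-byte AS number does not fit in the ASN field of a standard community.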
BGP Messages and States
Message Types
OPEN: Establishes the session. Exchanges AS numbers, hold times, router IDs, and capabilities.
UPDATE: The actual routing information. Contains withdrawn routes (no longer valid), path attributes, and NLRI (Network Layer Reachability Information—the prefixes being advertised).
KEEPALIVE: Maintains the session when there's nothing to say. Sent every 60 seconds by default on many implementations (one third of a 180-second hold time).
NOTIFICATION: Something went wrong. Includes an error code, then the session closes.
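All four message types share the same 19-byte header defined in RFC 4271: a 16-byte marker of all ones, a 2-byte length, and a 1-byte type. The parser below is a sketch of that header only. It does not decode message bodies, and the function name is made up for the example.

```python
import struct

MESSAGE_TYPES = {1: "OPEN", 2: "UPDATE", 3: "NOTIFICATION", 4: "KEEPALIVE"}
MARKER = b"\xff" * 16
HEADER_LEN = 19
MAX_MESSAGE_LEN = 4096  # base limit in RFC 4271


def parse_header(data):
    """Return (message type name, total message length) from raw bytes."""
    if len(data) < HEADER_LEN:
        raise ValueError("need at least 19 bytes for a BGP header")
    marker, length, msg_type = struct.unpack("!16sHB", data[:HEADER_LEN])
    if marker != MARKER:
        raise ValueError("bad marker")
    if not HEADER_LEN <= length <= MAX_MESSAGE_LEN:
        raise ValueError("bad length")
    return MESSAGE_TYPES.get(msg_type, "UNKNOWN"), length
```

A KEEPALIVE is the header and nothing else, so its length field is exactly 19.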
Session States
Idle → Connect → Active → OpenSent → OpenConfirm → Established
A session bouncing between Idle, Connect, and Active means TCP can't connect: wrong IP, blocked port (TCP 179), no route to the peer, or mismatched MD5 authentication, which causes the peer's TCP segments to be dropped. A session stuck in OpenSent or OpenConfirm usually means mismatched parameters in the OPEN exchange (AS number, capabilities).
Filtering: The Essential Discipline
Unfiltered BGP is dangerous. Every BGP session needs appropriate filtering.
Prefix Filtering
Bogon filtering: Reject reserved, unallocated, and private address space. These should never appear in the global routing table.
Length filtering: Reject prefixes longer than /24 (IPv4) or /48 (IPv6)—they're often more specific than necessary or hijack attempts. Reject prefixes shorter than /8 (IPv4)—something's wrong.
Customer validation: Accept only the specific prefixes your customers should announce. Maintain explicit prefix lists.
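The bogon and length rules can be sketched with Python's standard `ipaddress` module. The bogon list here is deliberately partial and IPv4-only, and `accept_prefix` is a hypothetical helper. A production filter would use a maintained bogon list and add per-customer prefix lists on top.

```python
import ipaddress

# Partial IPv4 bogon list for illustration only.
BOGONS_V4 = [
    ipaddress.ip_network(p)
    for p in (
        "0.0.0.0/8", "10.0.0.0/8", "127.0.0.0/8", "169.254.0.0/16",
        "172.16.0.0/12", "192.168.0.0/16", "224.0.0.0/4", "240.0.0.0/4",
    )
]


def accept_prefix(prefix):
    """Apply the length and bogon rules described above to one prefix."""
    net = ipaddress.ip_network(prefix)
    if net.version == 4:
        # Reject anything shorter than /8 or longer than /24.
        if not 8 <= net.prefixlen <= 24:
            return False
        # Reject anything that falls inside a bogon range.
        return not any(net.subnet_of(bogon) for bogon in BOGONS_V4)
    # IPv6: length check only in this sketch.
    return net.prefixlen <= 48
```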
AS Path Filtering
Own-AS filtering: Reject routes containing your own AS number (loop prevention).
Private AS filtering: Reject routes containing private ASNs (64512-65534, 4200000000-4294967294) in public routing.
Length filtering: Reject implausibly long AS paths (more than 50 ASes is suspicious).
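The three AS path checks above combine naturally into one predicate. This is a sketch with a made-up function name. The 50-AS limit is the threshold suggested in the text, and the private ranges are the ones listed above.

```python
PRIVATE_ASN_RANGES = ((64512, 65534), (4200000000, 4294967294))


def accept_as_path(as_path, local_asn, max_length=50):
    """Return True if the path passes own-AS, length, and private-ASN checks."""
    if local_asn in as_path:
        return False  # our own AS is already in the path
    if len(as_path) > max_length:
        return False  # implausibly long path
    for asn in as_path:
        if any(low <= asn <= high for low, high in PRIVATE_ASN_RANGES):
            return False  # private ASN leaked into public routing
    return True
```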
RPKI/ROV
Resource Public Key Infrastructure provides cryptographic validation that the AS originating a prefix is authorized to do so. Route Origin Validation (ROV) checks incoming routes against RPKI data:
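- Valid: A covering RPKI record matches the origin AS and permits the prefix length
- Invalid: A covering record exists, but the origin AS doesn't match or the prefix is more specific than the record's maximum length (potential hijack)
- Unknown: No RPKI record covers the prefix (called NotFound in RFC 6811)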
RPKI adoption is growing. Dropping Invalid routes is increasingly standard practice.
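The three outcomes follow from a short procedure described in RFC 6811: find the records that cover the prefix, then look for one that matches both the origin AS and the maximum length. Below is a sketch of that logic over an in-memory list. The tuple layout `(prefix, max_length, asn)` and the function name are assumptions for the example, and a real validator gets its records from an RPKI cache rather than a literal list.

```python
import ipaddress


def validate_origin(prefix, origin_asn, roas):
    """Classify a route as 'valid', 'invalid', or 'unknown' against ROA records."""
    net = ipaddress.ip_network(prefix)
    covered = False
    for roa_prefix, max_length, roa_asn in roas:
        roa_net = ipaddress.ip_network(roa_prefix)
        if net.version != roa_net.version or not net.subnet_of(roa_net):
            continue  # this record does not cover the route
        covered = True
        if roa_asn == origin_asn and net.prefixlen <= max_length:
            return "valid"
    # Covered by at least one record but matched none: invalid.
    return "invalid" if covered else "unknown"
```

Note that a more-specific announcement from the correct AS is still Invalid if it exceeds the record's maximum length, which is what stops a hijacker from winning with a longer prefix.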
Scaling iBGP
iBGP has a scaling problem: full mesh requirements.
Why Full Mesh?
iBGP doesn't modify the AS path—it can't, because all routers are in the same AS. Without AS path changes, there's no loop prevention. So iBGP uses a simple rule: don't re-advertise routes learned via iBGP to other iBGP neighbors.
This means every router must learn routes directly from the router that learned them externally. With N routers, you need N(N-1)/2 sessions. Ten routers need 45 sessions. Fifty routers need 1,225 sessions.
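The session count is the number of distinct router pairs, which is easy to check:

```python
def full_mesh_sessions(routers):
    """Number of iBGP sessions needed for a full mesh of `routers` routers."""
    return routers * (routers - 1) // 2


for n in (10, 50, 100):
    print(n, "routers need", full_mesh_sessions(n), "sessions")
```

The growth is quadratic: doubling the router count roughly quadruples the number of sessions.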
Route Reflectors
Route reflectors break the "no re-advertisement" rule in a controlled way. A route reflector:
- Receives routes from clients
- Reflects them to other clients and to its non-client iBGP peers
- Adds ORIGINATOR_ID and CLUSTER_LIST attributes for loop detection
Instead of full mesh, routers peer with route reflectors. Two route reflectors (for redundancy) can serve an entire AS.
The tradeoff: suboptimal routing is possible if route reflectors don't have full topology visibility. Design carefully.
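The loop detection that replaces the "no re-advertisement" rule can be sketched as follows. Each reflector prepends its cluster ID to the route's cluster list and drops any route that already carries that ID, as specified in RFC 4456. The dictionary layout and function name are illustrative assumptions.

```python
def reflect(route, cluster_id, learned_from_router_id):
    """Return the route as a reflector would re-advertise it, or None on a loop."""
    cluster_list = route.get("cluster_list", [])
    if cluster_id in cluster_list:
        return None  # this route has already passed through our cluster

    reflected = dict(route)
    # The originator is recorded once, by the first reflector to see the route.
    reflected.setdefault("originator_id", learned_from_router_id)
    reflected["cluster_list"] = [cluster_id] + cluster_list
    return reflected
```

The growing cluster list is also what the "Shortest Cluster List" tiebreaker compares: a route that has passed through fewer reflectors is preferred.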
Confederations
Confederations divide an AS into sub-ASes. Internal routing uses eBGP-like semantics between sub-ASes, providing the scaling benefits of hierarchical routing. More complex than route reflectors, but useful for very large networks or organizational boundaries.
Security Realities
BGP was designed when the Internet was small and operators trusted each other. That trust model doesn't scale.
Prefix Hijacking
An attacker announces prefixes they don't own. Traffic flows to them instead of the legitimate owner. Causes range from accidental misconfiguration to intentional interception.
Defenses: RPKI/ROV (validates origin), monitoring (detects unexpected announcements), filtering (prevents propagation).
Route Leaks
A network re-advertises routes in violation of policy—typically advertising transit routes to other transit providers, making themselves a transit point for traffic that shouldn't cross their network.
Defenses: Strict filtering by relationship type, NO_EXPORT communities, monitoring.
Session Security
BGP runs over TCP. Session hijacking, RST attacks, and resource exhaustion are all possible.
Defenses: MD5 authentication (weak but better than nothing), GTSM (rejects packets with TTL suggesting they didn't come from a directly connected neighbor), TCP-AO (stronger authentication, limited deployment).
Timers and Convergence
Hold Time and Keepalive
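Common default: 180-second hold time, 60-second keepalive (RFC 4271 suggests 90 and 30, and vendor defaults vary). If the hold time expires without a KEEPALIVE or UPDATE arriving, which in effect means three missed keepalives, the session is declared dead.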
Aggressive timers (3-second keepalive, 9-second hold) detect failures faster but increase CPU load and risk false positives from momentary congestion.
BFD (Bidirectional Forwarding Detection) provides sub-second failure detection without aggressive BGP timers—the preferred approach.
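The relationship between the two BGP timers can be sketched in a few lines. Per RFC 4271, each side proposes a hold time in its OPEN message, the session uses the smaller of the two, and a value of zero disables keepalives. The one-third ratio used here is the conventional choice, and the function name is made up for the example.

```python
def negotiate_timers(local_hold, peer_hold):
    """Return (hold_time, keepalive_interval) in seconds for a session."""
    hold = min(local_hold, peer_hold)
    if hold == 0:
        return 0, 0  # keepalives disabled, session never times out
    if hold < 3:
        raise ValueError("a non-zero hold time must be at least 3 seconds")
    return hold, hold // 3
```

One consequence is that either side can force aggressive timers on the other: a peer proposing 9 seconds against your 180 yields a 9-second hold time and 3-second keepalives.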
Advertisement Interval
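By default, eBGP typically waits 30 seconds between updates for the same prefix to the same neighbor (MRAI, the Minimum Route Advertisement Interval). The iBGP default is much shorter and varies by implementation: RFC 4271 suggests 5 seconds, and some platforms use 0.
This dampens oscillation but delays convergence. Because the delay can apply at every AS along the way, Internet-wide convergence for a route change can take minutes, not seconds.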
Route Dampening
Route dampening suppresses unstable routes by accumulating penalties for flaps. Exceed a threshold, and the route is suppressed.
In practice, dampening often suppresses legitimate routes experiencing normal convergence events. Most operators disable it.
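The penalty mechanism is simple enough to sketch. Each flap adds a fixed penalty, the penalty decays exponentially with a configured half-life, and the route is suppressed above one threshold and released below another. The parameter values below (1000 per flap, suppress at 2000, reuse at 750, 15-minute half-life) are illustrative assumptions modeled on commonly cited defaults, and the class is not any vendor's implementation.

```python
class Dampener:
    """Tracks the flap penalty for a single route. Times are in seconds."""

    def __init__(self, half_life=900.0, suppress=2000.0, reuse=750.0, per_flap=1000.0):
        self.half_life = half_life
        self.suppress = suppress
        self.reuse = reuse
        self.per_flap = per_flap
        self.penalty = 0.0
        self.last_update = 0.0
        self.suppressed = False

    def _decay(self, now):
        # The penalty halves every `half_life` seconds.
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / self.half_life)
        self.last_update = now

    def flap(self, now):
        self._decay(now)
        self.penalty += self.per_flap
        if self.penalty >= self.suppress:
            self.suppressed = True

    def is_suppressed(self, now):
        self._decay(now)
        if self.suppressed and self.penalty < self.reuse:
            self.suppressed = False
        return self.suppressed
```

With these values, three flaps within 20 seconds suppress the route, and it stays suppressed for roughly half an hour afterwards. That is the behavior the text criticizes: a short burst of updates during ordinary convergence can cost far more reachability than the instability itself.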