19.1 Advanced Referral Costing

Server applications often communicate with other servers via a built-in client (Dclient), because a single server doesn't contain all the necessary eDirectory data for an application to operate. An example is NLDAP, when it is configured to chain requests.

When a server application requests data that the local server does not hold, the server locates another server that contains the requested data, and subsequently retrieves the data for the client. This process is called “tree walking”. It naturally takes longer for a server to fulfill a request through tree walking. Although best practice guidelines for eDirectory tree design minimize the need for tree walking, it is still sometimes necessary.

Figure 19-1 Advanced Referral Costing

Figure 19-1 illustrates an LDAP subtree search to Server A for cn=GHowe, starting at O=MyCorp. However, the cn=GHowe object is located in the ou=MidWest partition, which is not represented on Server A.

To fulfill the client request, Server A must get the data from either Server B or Server C, so it sends the request to one of them. Server A happens to choose Server B. Note that the process of choosing a server is unpredictable. Server B is available on the network and accepts the request, but is unable to complete it quickly, so Server A waits for Server B even though Server C could also provide the required data. Until Server B either fulfills the request or is no longer available on the network, the request from Server A must wait.

The following sections provide information about how you can improve the performance of eDirectory servers:

19.1.1 Improving Server-to-Server Connection

Advanced Referral Costing (ARC) is an improved costing algorithm. The main purpose of ARC is to prevent server outages. Some of the benefits of ARC can include:

Improved server performance and fault tolerance
Better server-to-server communications
Load distribution
Remote server health monitoring
Simplified isolation and identification of communication problems

Who Should Use ARC?

Servers that don't hold a local copy of an object or service, and therefore need to walk the tree for information, benefit from ARC because they frequently communicate with other servers. ARC is very effective in an LDAP environment, especially when the LDAP server is set to prefer chaining.

For example, a server is sometimes overwhelmed by other servers that always make requests to that server, as illustrated in Figure 19-2.

Figure 19-2 One Stop Server Effect

Although there are other available servers with replicas of the needed objects, servers still seem to prefer this server. This is because the servers making requests for a service or replica are already connected to this server, so they tend to send it all the requests that it can handle. Figure 19-2 shows that all requests from S4 are going to S1. This is because S4 was already connected and authenticated to S1, so it continues to send all the requests for the blue partition to S1, even though S2 and S3 could service those requests. ARC helps to eliminate these situations by distributing the load to the servers that respond faster. You should enable ARC on the remote servers (S4) that send requests to this server, or you can enable ARC on all servers.

Figure 19-3 shows another scenario, illustrating the “cascading server” effect. Here, server S1 is often not responding, but it is not down. If S1 were down, the requests would time out and communication would stop. Because the server is still up at the transport level, but the database is slow or busy, it continues to accept and queue new requests from other servers. This can cause the other servers (S2) to eventually run out of threads. Each outstanding request takes a thread on the remote server, and when that server runs out of threads it becomes non-responsive. ARC resolves this issue by distributing requests across the fastest servers, because a server that is slow or sick incurs a higher cost in servicing requests.
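The thread exhaustion described above can be approximated with Little's law (average requests in flight = arrival rate × average response time). The following Python sketch is illustrative only. It assumes that each outstanding request holds one thread on the server that sent it, and the request rate, response times, and thread pool size are made-up figures, not eDirectory defaults.

```python
def outstanding_requests(requests_per_second, response_seconds):
    """Little's law: average number of requests in flight at any moment."""
    return requests_per_second * response_seconds


def pool_exhausted(requests_per_second, response_seconds, pool_size):
    """True if the in-flight requests need more threads than the pool holds."""
    return outstanding_requests(requests_per_second, response_seconds) > pool_size


# Hypothetical figures: S2 sends 20 requests per second to S1 and has 128 threads.
RATE, POOL = 20, 128
for label, seconds in (("S1 healthy (50 ms)", 0.05), ("S1 slow (10 s)", 10.0)):
    in_flight = outstanding_requests(RATE, seconds)
    state = "exhausted" if pool_exhausted(RATE, seconds, POOL) else "ok"
    print(label, in_flight, state)
```

With these assumed numbers, a healthy S1 keeps about one request in flight, while a slow S1 that still accepts requests ties up more threads than the pool contains.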

Figure 19-3 Cascading Server Effect

In addition, ARC is a good choice for improving fault tolerance. It has the ability to easily identify server communication problems.

19.1.2 Advantages of Referral Costing

It times/routes most Resolve Name requests to remote servers as they are made.
It averages the Resolve Name request times in milliseconds on each address. This allows ARC to be more granular and adjust the cost of the referral more aggressively. It is also able to quickly detect a slow server, because timing is tracked in milliseconds instead of seconds.
It tracks outstanding requests so that it can quickly determine if a request is taking too long. It does not have to wait for the request to complete in order to know that the server is taking a long time.
It tracks response time on a per-address basis. It is normal for a server to have numerous connections to the same address. By tracking per address instead of per connection, one connection can benefit from statistics gathered from the other connections.
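The bookkeeping described in this list can be sketched as follows. This is a minimal Python illustration, not ARC's actual data structure: the class name, method names, and request identifiers are hypothetical. It keeps one record per transport address, averages completed request times in milliseconds, and reports how long the oldest outstanding request has been waiting.

```python
from collections import defaultdict


class AddressTimer:
    """Timing statistics shared by every connection to one transport address."""

    def __init__(self):
        self.average_ms = 0.0   # running average of completed request times
        self.completed = 0      # number of completed requests
        self.outstanding = {}   # request id -> start time in milliseconds

    def start(self, request_id, now_ms):
        self.outstanding[request_id] = now_ms

    def finish(self, request_id, now_ms):
        elapsed = now_ms - self.outstanding.pop(request_id)
        self.completed += 1
        # Incremental mean, so individual samples need not be stored.
        self.average_ms += (elapsed - self.average_ms) / self.completed

    def longest_wait_ms(self, now_ms):
        """Age of the oldest request that has not completed yet (0 if none)."""
        if not self.outstanding:
            return 0
        return now_ms - min(self.outstanding.values())


# One entry per address, not per connection.
timers = defaultdict(AddressTimer)

t = timers["tcp:151.155.134.27:524"]
t.start("req-1", now_ms=0)
t.finish("req-1", now_ms=200)
t.start("req-2", now_ms=300)
t.finish("req-2", now_ms=400)
t.start("req-3", now_ms=500)   # still outstanding
print(t.average_ms, t.longest_wait_ms(now_ms=2500))
```

Because the record is keyed by address, a second connection to the same address starts with the statistics already gathered by the first, and `longest_wait_ms` can flag a slow server before the request completes.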

NOTE: To account for LDAP requests, ARC also considers the responsiveness of private connections.

19.1.3 Deploying ARC

ARC is usually deployed on a server-to-server basis. Only servers that are ARC-enabled use the new costing information. You should patch all eDirectory 8.7.3 servers to 8.7.3.9 ftf3 and all eDirectory 8.8 servers to 8.8.2, and then enable ARC on each server in the environment.

Deployment Considerations

It is not always useful to enable ARC on all servers. Figure 19-4 shows a situation that could impact the efficiency of LDAP servers. In the figure, S4 holds a copy of the green partition, but not of the blue partition. Any chaining LDAP request that requires information from the blue partition needs to walk to S1, S2, or S3 to be fulfilled. This works in most cases, and ARC is designed for just such situations.

Figure 19-4 ARC Deployment Considerations

However, performing specific LDAP operations could be difficult. Although it is possible to add a user, for example, Bob.Blue.Novell, the operation might fail when you try to immediately return to modify Bob. The figure shows Bob added on S2, but modifying Bob on S3 has failed because S3 has not yet synchronized with S2, so S3 has not yet received Bob. ARC has the capability to direct you to a different server, because ARC is more dynamic than the original costing method.

This configuration works well in scenarios where the server costs don't vary much and they don't have problems synchronizing. Disabling ARC on S4 resolves this issue.
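The add-then-modify failure described above can be reproduced with a toy model. The Python sketch below is an illustration under stated assumptions: the `Replica` class and `synchronize` function are hypothetical stand-ins and not eDirectory APIs. It shows only why a modify sent to a replica that has not yet synchronized fails.

```python
class Replica:
    """A toy replica holding a set of object names (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.objects = set()

    def add(self, dn):
        self.objects.add(dn)

    def modify(self, dn):
        if dn not in self.objects:
            raise LookupError(f"{dn} not found on {self.name}")
        return f"{dn} modified on {self.name}"


def synchronize(source, target):
    """Copy objects that the target has not received yet."""
    target.objects |= source.objects


s2, s3 = Replica("S2"), Replica("S3")

s2.add("Bob.Blue.Novell")          # the add is chained to S2
try:
    s3.modify("Bob.Blue.Novell")   # the follow-up modify is chained to S3
    outcome = "modified"
except LookupError as err:
    outcome = f"failed: {err}"

synchronize(s2, s3)                # replica synchronization completes
retry = s3.modify("Bob.Blue.Novell")
print(outcome)
print(retry)
```

The first modify fails because S3 has not received Bob yet, and the retry succeeds once synchronization has run.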

19.1.4 Enabling Advanced Referral Costing

ARC is enabled by default for eDirectory 8.8 SP8 and later versions. To configure ARC by using the NDS iMonitor, click Agent Configuration > Background Process Settings. In addition, the Enable, Disable, and Debug options are available.

Figure 19-5 NDS iMonitor Agent Configuration Screen

NDSTrace

Use the NDSTrace tool to enable ARC on all UNIX platforms.

Table 19-1 Enabling ARC on UNIX Platform

Command	Description
set NDSTRACE =!ARC	Displays the gv_ResolveTimesTable for debugging.
set NDSTRACE =!ARC0	Disables Advanced Referral Costing.
set NDSTRACE =!ARC1	Enables Advanced Referral Costing.
set NDSTRACE =!ARC2	Enables Advanced Referral Costing in debug mode and displays the resulting costs of each referral on the Resolve Name DSTrace flag anytime a costing decision is made.

19.1.5 Tuning Advanced Referral Costing

ARC requires no tuning by default. However, ARC has tunable parameters that can be used to change how it functions, or to disable or enable certain features. There are three major components to ARC.

Advanced Costing

When asked to cost a given address, ARC uses the information known about the connection to calculate the cost of the given referral. If ARC is on, Advanced Costing is always used when costing a referral.

Background Monitoring

A background thread periodically checks the timer information to ensure that it is current. When a server is slow, its cost rises and there is a good chance that communication with it will cease. The background thread periodically (once a minute by default) checks whether any server in the table has not been updated. If a server has not been updated in the last three minutes, the local server makes a Resolve Name request to it to check its health. This creates current costing for the server, and also detects whether the server is now less busy or healthy again, so a client doesn't need to suffer adverse effects to check the server's health. There are two permanent configuration parameters that can be changed for the background thread:

ARC_MAX_WAIT: How stale a timer can be before a request is made to the server to check its health (180 seconds by default).
ARC_BG_INTERVAL: How often the background thread runs (60 seconds by default; 0 means disabled and the thread doesn't run).

For additional information, see Section 8.4.24, Setting Permanent Configuration Parameters.
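The background check can be sketched as follows. This Python fragment is a simplified illustration of the behavior described above, not the actual thread: the function names and the `probe` callback (which stands in for the Resolve Name health request) are hypothetical. Only the two parameter names and their defaults come from the text.

```python
ARC_MAX_WAIT = 180      # seconds; default given in the text
ARC_BG_INTERVAL = 60    # seconds; 0 disables the background thread


def stale_addresses(last_update, now, max_wait=ARC_MAX_WAIT):
    """Return the addresses whose timers are older than max_wait seconds."""
    return sorted(addr for addr, updated in last_update.items()
                  if now - updated > max_wait)


def background_pass(last_update, now, probe, interval=ARC_BG_INTERVAL):
    """One pass of the monitor: probe each stale address, refresh on success."""
    if interval == 0:
        return []               # background monitoring is disabled
    refreshed = []
    for addr in stale_addresses(last_update, now):
        if probe(addr):         # hypothetical health request
            last_update[addr] = now
            refreshed.append(addr)
    return refreshed


# Seconds at which each address was last updated (illustrative values).
table = {
    "tcp:151.155.134.11:524": 0,
    "tcp:151.155.134.27:524": 290,
    "tcp:151.155.134.59:524": 10,
}
reachable = lambda addr: not addr.startswith("tcp:151.155.134.59")
print(background_pass(table, now=300, probe=reachable))
```

In this example two timers are stale, but only the reachable address is refreshed. The unreachable one keeps its old timestamp and is tried again on the next pass.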

Remote Health Information

Servers using ARC periodically request health information from a remote server. These are not additional requests on the wire, but additional health information that is returned in standard resolve name requests that servers frequently make. This information is then used in the costing algorithm to further enhance reactions to servers that are under heavy loads. When a resolve name request is being made to a remote server, if it has been more than 15 seconds since the last update, health information is requested from the remote server and is added to the reply of the resolve name request.

There is one tunable parameter for Remote Health Monitoring:

ARC_DS_INFO_INTERVAL: This is how often to request lock (health) information in ARC (15 seconds by default).
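The piggybacking decision can be expressed in a few lines. The sketch below is hypothetical: the function name and the dictionary it returns are illustrative only and do not represent the wire format of a Resolve Name request. The 15-second default is the one stated above.

```python
ARC_DS_INFO_INTERVAL = 15  # seconds; default given in the text


def build_resolve_name_request(name, now, last_health_update,
                               interval=ARC_DS_INFO_INTERVAL):
    """Describe a request, asking for health data only when it is stale."""
    return {
        "name": name,
        # No extra request on the wire: the flag rides on a normal request.
        "want_health_info": (now - last_health_update) > interval,
    }


fresh = build_resolve_name_request("cn=GHowe", now=100, last_health_update=90)
stale = build_resolve_name_request("cn=GHowe", now=100, last_health_update=80)
print(fresh["want_health_info"], stale["want_health_info"])
```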

19.1.6 Monitoring Advanced Referral Costing

You can print the ResolveTimes table to observe Advanced Referral Costing in action.

Use the following commands to print the ResolveTimes table:

set DSTRACE = +DBG
set DSTRACE = !ARC

This prints the Resolve Times table and the current stored information for each server. It shows the transport address, the milliseconds since the address was last used, the last cost that was used in a referral decision, and the number of outstanding requests for that address.

A high number of outstanding requests is not necessarily a problem. It might simply mean that that server is used frequently.

Using ARC for Troubleshooting

One of the most useful features of ARC is the ability to quickly identify communication problems with servers.

The following is an example of a ResolveTimesTable printout:

ARC is currently enabled.

Table 19-2 Resolve Time Costs

Slot	Transport Address	Cost	LastUse	Checked	waiters	LockTime
1	tcp:151.155.134.27:524	214	14	14	0	0
2	tcp:151.155.134.11:524	0	0	0	0	0
3	udp:151.155.134.11:524	0	0	0	0	0
4	tcp:151.155.134.13:524	554759	280	0	27	582
5	tcp:151.155.134.59:524	0	179	179	0	0
6	udp:151.155.134.59:524	0	119	119	0	0
7	tcp:151.155.134.28:524	1543	119	119	0	0
8	tcp:151.155.134.15:524	124	14	14	0	0

The printout shows that from this server's perspective, 151.155.134.13 is having difficulties. You can also see that the problem is most likely the server, not the transport. The remote server has 27 requests waiting for access to the database, and the requests are taking a long time to acquire the database lock. The local server has two requests that have never received replies from the remote server.

You can also see that 151.155.134.11 and 151.155.134.59 are either very fast servers, or are not very busy, or both. Because both servers have UDP entries, you can tell that each has had problems communicating via TCP at one time, although both are healthy now. UDP connections to a server are tried only if there is a problem talking to the server via TCP.

The following is a summary of what each number means:

Transport Address: The address of the remote server.

Cost: The current cost of the remote server.

LastUse: The time in seconds since the last communication with the server.

Checked: The time in seconds since health information was last received from the remote server.

#Req: The number of outstanding requests to the remote server.

Waiters: The number of requests to the remote server waiting for the database lock.

LockTime: Duration that a process has held the database lock on the remote server.

The following printout is another example of quickly identifying a communication problem. You can see that the server currently cannot communicate with 151.155.134.13 via TCP.

ARC is currently enabled.

Table 19-3 Resolve Time Costs

Slot	Transport Address	Cost	LastUse	Checked	#Req	waiters	LockTime
1	tcp:151.155.134.27:524	394	92	14	0	0	0
2	tcp:151.155.134.11:524	0	0	0	0	0	0
3	udp:151.155.134.11:524	0	0	0	0	0	0
4	tcp:151.155.134.13:524	5000000	180	180 is in BAD ADDRESS CACHE

There are a few things to keep in mind when looking at these tables:

Outstanding requests are not necessarily bad, because the server might just be servicing many requests. Outstanding requests on servers where costing is high are a problem.
Your first indicator of a server's health is its current cost, which makes it easy to see which server is causing problems.

NOTE: All requests are timed for round-trip time and for how long they remain outstanding. This means transport times are also a component of the cost. If a server shows up as having problems in this table, but works well from other servers and doesn't otherwise appear to have a problem, this might indicate a transport issue.
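When the table is captured to a file, the guidelines above can be applied mechanically. The Python sketch below parses a printout in the eight-column layout of Table 19-3 and lists the addresses that deserve attention. The function names and the cost threshold of 1000 are assumptions chosen for illustration, not values defined by ARC.

```python
PRINTOUT = """\
Slot Transport Address Cost LastUse Checked #Req waiters LockTime
1 tcp:151.155.134.27:524 394 92 14 0 0 0
2 tcp:151.155.134.11:524 0 0 0 0 0 0
3 udp:151.155.134.11:524 0 0 0 0 0 0
4 tcp:151.155.134.13:524 5000000 180 180 is in BAD ADDRESS CACHE
"""


def parse_resolve_times(text):
    """Turn the printed table into a list of dictionaries, one per slot."""
    rows = []
    for line in text.strip().splitlines():
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # skip the header and any non-data lines
        row = {
            "address": fields[1],
            "cost": int(fields[2]),
            "last_use": int(fields[3]),
            "checked": int(fields[4]),
            "bad_address": "BAD ADDRESS CACHE" in line,
        }
        if not row["bad_address"]:
            row["requests"], row["waiters"], row["lock_time"] = map(int, fields[5:8])
        rows.append(row)
    return rows


def suspects(rows, cost_threshold=1000):
    """Addresses in the bad address cache, or costly with requests outstanding."""
    return [r["address"] for r in rows
            if r["bad_address"]
            or (r["cost"] >= cost_threshold and r.get("requests", 0) > 0)]


rows = parse_resolve_times(PRINTOUT)
print(suspects(rows))
```

Outstanding requests alone do not mark a server as a suspect here. They count only in combination with a high cost, which matches the first guideline.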

Background Thread Traces

The following is a trace showing the ARCBackgroundResolveTimerThread running:

ARCBackGroundResolveTimerThread started Interval = 60 MaxWait = 180000

Updating timer info for tcp:151.155.134.11:524

Updating timer info for udp:151.155.134.11:524

Updating timer info for tcp:151.155.134.13:524

ARCBackGroundResolveTimerThread error -635 in DCConnectToAddress for tcp:151.155.134.59:524

ARCBackGroundResolveTimerThread completed in 0 seconds

8-total timers 4-stale timers 3-timers updated

From the above message you can see the following:

TCP:151.155.134.11 has not been used for more than 3 minutes
UDP:151.155.134.11 has not been used for more than 3 minutes
TCP:151.155.134.13 has not been used for more than 3 minutes

The timer information was updated for these three addresses. The fourth stale timer could not be updated:

TCP:151.155.134.59 is still not reachable from this server.
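The same reading of the trace can be automated. The following Python sketch extracts the updated addresses, the failed connection, and the summary counters from the trace text shown above, with each message on its own line. The regular expressions are assumptions based only on this sample and might need adjusting for other trace output.

```python
import re

TRACE = """\
ARCBackGroundResolveTimerThread started Interval = 60 MaxWait = 180000
Updating timer info for tcp:151.155.134.11:524
Updating timer info for udp:151.155.134.11:524
Updating timer info for tcp:151.155.134.13:524
ARCBackGroundResolveTimerThread error -635 in DCConnectToAddress for tcp:151.155.134.59:524
ARCBackGroundResolveTimerThread completed in 0 seconds
8-total timers 4-stale timers 3-timers updated
"""


def summarize(trace):
    """Collect updated addresses, failures, and the summary counters."""
    updated = re.findall(r"Updating timer info for (\S+)", trace)
    failed = re.findall(r"error (-?\d+) in \w+ for (\S+)", trace)
    totals = re.search(
        r"(\d+)-total timers (\d+)-stale timers (\d+)-timers updated", trace)
    counts = [int(n) for n in totals.groups()] if totals else [None] * 3
    return {
        "updated": updated,
        "failed": [(addr, int(code)) for code, addr in failed],
        "total": counts[0],
        "stale": counts[1],
        "refreshed": counts[2],
    }


summary = summarize(TRACE)
print(summary["failed"], summary["stale"], summary["refreshed"])
```

A difference between the stale and refreshed counts, as in this trace, indicates that at least one address could not be reached.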

The new costing is very dynamic and changes very frequently. In order to watch it work, you can set the Advanced Referral Costing parameter to Debug mode.

NOTE: Ensure that you reset ARC to non-debug mode by running the command set NDSTRACE = !ARC1 when you have finished monitoring. The overhead of printing costs is not desirable when you don't need the output.

In DSTrace or NDSTrace, the individual referral costs are now displayed if Advanced Referral Costing and +RSLV are turned on. The remaining tags are turned off by using the set NDSTrace =nodebug command.

Sorted results from DCAdjustCostAndSort follow:

137.65.10.3 cost of 217

137.65.10.9 cost of 222

137.65.10.10 cost of 400

The numbers change quickly if a remote server is slow or overloaded. The ExRef server's costing adjusts dynamically every second, so to watch costs over time you should send the trace to a log file.
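Once the trace is in a log file, cost lines such as those shown above can be extracted and compared over time. The Python sketch below is a minimal example. The first snapshot uses the costs from the trace above, the second snapshot is invented to illustrate a change, and the function names are hypothetical.

```python
import re

COST_LINE = re.compile(r"^(\S+) cost of (\d+)$")


def parse_costs(trace_text):
    """Map each address to the cost printed for it in the trace."""
    costs = {}
    for line in trace_text.splitlines():
        match = COST_LINE.match(line.strip())
        if match:
            costs[match.group(1)] = int(match.group(2))
    return costs


def cost_changes(before, after):
    """Addresses whose cost differs between two snapshots, with the delta."""
    return {addr: after[addr] - before[addr]
            for addr in after
            if addr in before and after[addr] != before[addr]}


first = parse_costs("""
137.65.10.3 cost of 217
137.65.10.9 cost of 222
137.65.10.10 cost of 400
""")

# Invented later snapshot: 137.65.10.10 has become slower.
second = parse_costs("""
137.65.10.3 cost of 217
137.65.10.9 cost of 222
137.65.10.10 cost of 950
""")

cheapest = min(first, key=first.get)
print(cheapest, cost_changes(first, second))
```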