Tuesday, April 10, 2007

Keeping Multiple Subversion Repositories in Sync

With Subversion 1.4, svnsync was introduced for this purpose. The key problem with using svnsync for multiple Subversion repositories distributed over the WAN is its reliance on a master-slave architecture. While svnsync does provide the advantage of having local read-only repositories at each of the remote development sites, only the master repository is writeable. The master repository is then replicated to the read-only slaves. However, the replication process can place a significant load on the network and servers. Because of this, replication tends to happen infrequently, leaving the read-only slave repositories that remote sites check out from out of sync with the master much of the time. As a result, commit failures due to update conflicts on the master repository can become a problem. To avoid commit failures, developers at the slave repository sites have to update over the WAN against the master Subversion repository before committing. This can negate most of the expected network performance and developer productivity benefits of using svnsync in a distributed development environment.

Other solutions such as svk do allow multiple repositories to be readable as well as writeable, but there are no guarantees of consistency across the repositories. A commit can succeed on a developer’s local repository where there are no conflicts, and fail when it’s copied to other sites’ repositories due to update conflicts. This can make administration extremely difficult.

WANdisco solves these problems by turning distributed Subversion repositories into peers. All of the repositories are writeable, and consistency across the repositories is guaranteed. WANdisco’s active-active replication capabilities allow developers to work at LAN speed over the WAN for both read and write operations, while keeping all of the repositories in sync, in effect in real-time. WANdisco also provides self-healing capabilities that automate disaster recovery after a network outage or server failure.

9 comments:

Unknown said...

thanks man, subversionman

nice summary

we are thinking of implementing / evaluating Subversion on multiple sites probably over the next few months

What are some of the tech reqs for implementing WANdisco? Is it on the same server as Subv

SubversionMan said...

Koo,

WANdisco is installed on the same server as Subversion at each site if you're using Subversion stand-alone. If you're using Subversion with Apache, then it's installed on the same server as Apache at each site.

To get a quick feel for the technical requirements for implementing WANdisco check:

Install Checklist

You can take a look at all of our product docs at:
Products

Pete Alfvin said...

Jim,

Can you explain why the replication process for svnsync involves so much net traffic? Is there significantly more traffic flowing aside from the updated files? And do they send deltas or the full text version?

Pete Alfvin

SubversionMan said...

Peter,


Svnsync is a master-slave solution in which only the master Subversion server is writable, and changes to the master are then replicated out to read-only slave Subversion servers at remote sites. This means that developers at remote sites will be going out over the WAN using standard Subversion protocols to perform commits and any other write operations against the master server.

With Subversion, only the changes to a file are sent from the Subversion client to the Subversion server on a commit, not the entire file. However, with each Subversion commit using the SVN RA protocol (Subversion without Apache) up to six WAN round trips will take place between a remote site developer's Subversion client and the master Subversion server in order to process the commit. These WAN round trips are required to open a connection between the remote client and the master server, authenticate the user on the master server, and write the commit to the master. Then, additional WAN traffic is generated when changes to the master are replicated back out to each of the read-only slaves at the remote sites.

If Subversion is implemented with Apache and the WebDAV HTTP protocol is used, in addition to the WAN round trips required to open the connection, authenticate the user and process the commit on the master Subversion server, there will be at least four WAN round trips required for each file committed. This is because of the HTTP puts required to send each file to the master server over the WAN. For example, when Subversion is implemented with Apache, if a commit consists of a directory with 10 files, this means that there will be at least 40 WAN round trips required to complete the commit, in addition to the WAN round trips required to open the connection, authenticate the user and write the commit to the master server.
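The per-file arithmetic above can be sketched in a few lines of Python. This is only an illustration: the figure of four round trips per file comes from the paragraph above, while the function names and the 250 ms WAN round-trip time are assumptions chosen for the example, not measured values.

```python
def webdav_file_round_trips(num_files, round_trips_per_file=4):
    """Lower bound on per-file WAN round trips for a commit over WebDAV/HTTP.

    Excludes the round trips needed to open the connection, authenticate
    the user, and write the commit on the master server.
    """
    return num_files * round_trips_per_file


def latency_overhead_seconds(round_trips, rtt_ms):
    """Time spent purely waiting on round trips, ignoring data transfer."""
    return round_trips * rtt_ms / 1000.0


# The 10-file commit from the example above.
trips = webdav_file_round_trips(10)
print(trips)

# Assumed (hypothetical) 250 ms round-trip time between sites.
print(latency_overhead_seconds(trips, rtt_ms=250))
```

With these assumed numbers, the 10-file commit spends at least 10 seconds on per-file round trips alone, before any file data is counted.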

While WANdisco works with the native SVN RA and WebDAV HTTP protocols used for communication between Subversion clients and Subversion servers over the LAN at each development site, once commits are sent out over the WAN between development sites WANdisco: (1) uses a persistent connection between the WANdisco replicator instances installed with the Subversion servers at each of the sites, so the overhead associated with the TCP three-way handshake to establish connections goes away, and (2) uses its own protocol on top of TCP to replicate the commit to the other sites over the WAN in only one WAN round trip. This allows WANdisco to deliver significantly better WAN performance and use much less bandwidth than svnsync.

Unknown said...

Can we see some real-world numbers comparing svnsync and WANdisco? If the difference is only a few ms total, who cares?

SubversionMan said...

Brian,

The difference is not going to be just a few milliseconds when writes are done between a Subversion client and the remote master server that is the only writeable server in the master-slave architecture used by svnsync.

Let’s use the example of a commit consisting of 500 MB of data between the US and India. Given that the typical E-1 line used between the US and India operates at approximately 2 megabits per second, it would take 2000 seconds to transfer a 500 MB commit, or a little over 33 minutes. This assumes that everything goes smoothly, without any communication errors between the client and the remote master server. In addition, this doesn’t include the additional overhead I mentioned in my previous post that results from the WAN round trip latencies required to open the connection, authenticate the user, and process the commit once the data is received by the master server. It also doesn’t include the WAN round trip latencies that become a factor when the WebDAV HTTP protocol is used, that can add several minutes to this time, depending on the number of files in the commit.

Given that most LANs now operate at a speed of one gigabit per second, it should only take about four seconds to transfer our 500 MB of data between the client and the server over a LAN. Because WANdisco, unlike svnsync, allows all repositories to be writeable, developers at all sites will experience this LAN-speed level of performance on their commits. At the same time, Subversion repositories in the US and India will be kept in sync. WANdisco’s unique active-active replication approach supports this, and you can learn more from our site. This is why all of our Subversion customers who’ve tried svnsync end up implementing WANdisco.
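The two transfer times quoted above follow from the same back-of-the-envelope formula, sketched here in Python. It assumes decimal units (1 MB = 8 megabits), a link running at its full nominal rate, and no protocol overhead or retransmissions; the function name is made up for the example.

```python
def transfer_seconds(size_mb, link_mbps):
    """Ideal time to move size_mb megabytes over a link of link_mbps megabits/s.

    Assumes 1 MB = 8 megabits and a fully utilised link with no overhead.
    """
    return size_mb * 8 / link_mbps


wan = transfer_seconds(500, 2)      # roughly an E-1 line, as in the example
lan = transfer_seconds(500, 1000)   # gigabit LAN

print(wan, wan / 60)  # seconds, minutes
print(lan)            # seconds
```

Under these assumptions the WAN transfer takes 2000 seconds (a little over 33 minutes) and the LAN transfer takes 4 seconds, matching the figures above. Round-trip latency would come on top of both.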

Unknown said...

Hi SubversionMan,

I hope you are still online.

Would like to know if the WANdisco peer solution for distributed Subversion implementation can support selected synchronization of only some of the Subversion contents. Is there any limitation that the entire repository/ies have to be synced? We are looking to synchronize only the core components of our project.

Kindly explain or forward us to a link describing a solution to this.

Thanks in advance
~Santhosh

SubversionMan said...

Santhosh,

You could still use single servers at each site, but you would need to break your Subversion repository into two separate repositories, each with its own URL: one for the project(s) you want to replicate, and a second repository for the project(s) you don’t want to replicate. If you’re using Apache to front-end Subversion, you could use Apache’s ProxyPass feature to redirect clients to the correct repository. There is an article on our support site, ProxyPass, that outlines using ProxyPass. We’ve had customers use this approach to accomplish the same thing you’re trying to, and it has worked well for them.

monitor said...

You can use inotify-tools as well.

http://planet.admon.org/synchronize-subversion-repositories-with-inotify-tools/