Data Replication Using rsync
Recently we discussed the importance of data replication for situations ranging from mission critical environments to home users. Replication ensures that you have a copy of your current data on a separate storage environment (secondary system) so that if you lose the first system (primary system), you still have access to the data.
In general there are two types of replication: synchronous and asynchronous. Synchronous replication, as the name implies, keeps the primary and secondary storage exactly the same. Any data write (or delete) has to complete on both the primary and secondary storage pools before control returns to the application and it is allowed to continue. As a result, the data in the two pools is always an exact match.
Asynchronous replication allows any writes (or deletes) from an application to finish on the primary storage pool. Then the data is copied from the primary storage pool to the secondary storage pool, typically outside of the application I/O path. This means that there can be some data difference between the two pools at any instant in time. How much difference you can tolerate depends on your requirements, but you can shrink it to something fairly small.
One of the most important things to remember about data replication is that replication and backup are not the same thing. A backup can keep prior versions of data so that you can effectively go back in time over the life of the data and retrieve an earlier version. Replication, on the other hand, keeps a replica (copy) of the current data. You can get the most current state of the data from a backup, but it will only be as current as the moment the backup was made. Replicas are much more recent so they will, in general, capture the data changes since the last backup. The question you need to answer is: when do I need replication?
There is no universal answer to that question. You need to examine your data requirements and determine how important having the latest copy of the data is to your mission and the importance of the accessibility of that data. During your examinations you should also weave in discussions about off-site disaster recovery for your data center (or your home system).
In the previous article about replication, two replication techniques for Linux were discussed - DRBD and rsync. DRBD is a kernel based replication method that automatically replicates all data from the primary storage pool to a secondary storage pool without the user or administrator having to intervene. On the other hand, rsync is a file based approach that lets you selectively replicate directory trees so that you don't have to replicate an entire file system. However, rsync must be invoked manually, so it's not as automatic as DRBD, which increases the possible data differences between the primary and secondary storage pools. But many times the flexibility of replicating only portions of the data to one or more secondary storage pools makes it a popular choice despite the possibility of increased data differences.
Since rsync is so flexible and is file based, in this article I want to show a simple rsync example to illustrate what it can do for data replication.
Simple example using rsync
I want to present a simple rsync example to illustrate the basic steps involved in getting rsync to replicate a directory tree. One of the advantages of rsync is that you don't have to make a copy of the entire file system - you can copy just a specific directory tree, or even specific file types or file names. This makes it very flexible, since you can replicate different directory trees to different secondary storage servers as needed and/or focus on certain types of files.
For this article I will be using rsync to replicate data from my laptop to my main storage box at home. The overall concept is that I come home from being on the road, I fire up the laptop on the home network and my data gets replicated to my home server. In addition, I will only be replicating a specific directory from the laptop to the home server since that is where I keep all the data I modify while I travel.
Let's start by defining terms within the rsync framework. The first two systems or terms that we need to define are the rsync server and the rsync client. One would think that the traditional client/server terms would apply, but a common source of confusion in rsync is that the rsync server does not necessarily have to be the system that has the original copy of the data and the rsync client does not have to be the recipient of the data. To better understand rsync, remember that there is a distinction between roles and processes in rsync. So to make sure we understand all the terms used in this article, there are four terms we'll be using (taken from this link).
A reasonably good tutorial to use to start learning rsync is here. It is a bit old (1999) but it has a very good overview of rsync and explains things fairly well. There are other tutorials that cover useful topics such as how to use ssh with rsync or using stunnel with rsync.
For the rsync command used for my simple scenario let's start with the simple example in the "everythinglinux" tutorial. Here is the script I used to perform the rsync that is taken from the article and adapted to my situation (notice that there are few changes).
rsync --verbose --progress --stats --compress --rsh=/usr/bin/ssh \
      --recursive --times --perms --links --delete \
      --exclude "*bak" \
      /home/laytonjb/Documents/* 192.168.1.8:/data/laytonj/rsync_test
For the particular example in this article I didn't need all of these options but I wanted to leave them in since they help illustrate some of what you can do with rsync.
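As a rough guide to what each long option in the command does, here is a sketch that assembles the same option list one flag at a time, with a comment per flag. The descriptions are my paraphrase of the rsync man page, and the script only prints the resulting command instead of running it, so it is safe to try on any machine. The variable names `SRC`, `DEST` and `rsync_cmd` are just illustrative.

```shell
#!/bin/sh
# Build the option list from the article's command one flag at a time.
set -- --verbose                  # name each file as it is processed
set -- "$@" --progress            # show per-file transfer progress
set -- "$@" --stats               # print the summary statistics at the end
set -- "$@" --compress            # compress file data while it is in transit
set -- "$@" --rsh=/usr/bin/ssh    # use ssh as the remote shell transport
set -- "$@" --recursive           # descend into subdirectories
set -- "$@" --times               # preserve modification times
set -- "$@" --perms               # preserve permissions
set -- "$@" --links               # copy symbolic links as symbolic links
set -- "$@" --delete              # delete destination files missing from the source
set -- "$@" --exclude '*bak'      # skip any name ending in "bak"

# The last two arguments of the article's command (source and destination).
SRC='/home/laytonjb/Documents/*'
DEST='192.168.1.8:/data/laytonj/rsync_test'

# Print the command rather than run it.
rsync_cmd="rsync $* $SRC $DEST"
echo "$rsync_cmd"
```

If you want to see what a real run would do without changing anything on the destination, rsync also accepts `--dry-run`, which is worth using whenever `--delete` is in the option list.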
The final two parts of the rsync command tell rsync what data to synchronize on the rsync sender and where to put it on the rsync receiver. In this particular example, the command is copying everything under the directory /home/laytonjb/Documents (don't forget it is doing this recursively) and copying to my home server (IP address is 192.168.1.8) into the directory /data/laytonj/rsync_test. Sometimes, in the vernacular of rsync, the first part is referred to as the source and the second is referred to as the destination.
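One detail worth checking is that the source is written as `Documents/*`. The shell, not rsync, expands that wildcard, so rsync receives a list of individual names, and top-level hidden files (names starting with a dot) are not in that list. The sketch below shows the expansion using a throwaway directory with made-up file names; it does not call rsync at all.

```shell
#!/bin/sh
# Create a scratch directory with one visible and one hidden file (names are made up).
demo=$(mktemp -d)
touch "$demo/report.pdf" "$demo/.hidden_notes"

# Let the shell expand the wildcard, as it does for /home/laytonjb/Documents/*.
set -- "$demo"/*
expanded_count=$#
expanded_first=${1##*/}
echo "wildcard expands to $expanded_count name(s): $expanded_first"

# Count everything actually in the directory, hidden files included.
actual_count=$(ls -A "$demo" | wc -l | tr -d ' ')
echo "directory really holds $actual_count entries"

# Clean up the scratch directory.
rm -rf "$demo"
```

Writing the source as `/home/laytonjb/Documents/` (trailing slash, no wildcard) hands rsync the directory itself, so hidden files are included. The rsync man page also notes that `--delete` only acts on directories rsync was asked to send, not on names supplied through a shell wildcard, so with the `*` form a file removed from the top level of the source would not be removed on the destination.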
Before running the command I made sure that the directory existed on the destination (192.168.1.8). After that, I just ran the script for the first time.
root@laytonjb-laptop:~# ./runit_rsync
root@192.168.1.8's password:
building file list ...
6241 files to consider
00354107.pdf
     1291267 100%    4.90MB/s    0:00:00 (xfer#1, to-check=6240/6241)
01-IntroToCUDA.ppt
    12302848 100%    5.11MB/s    0:00:02 (xfer#2, to-check=6239/6241)
145213.pdf
     1809831 100%    2.61MB/s    0:00:00 (xfer#3, to-check=6238/6241)
19990064118_1999099363.pdf
      961611 100%    1.04MB/s    0:00:00 (xfer#4, to-check=6237/6241)
AIAA-2007-512-524.pdf
     3015861 100%    1.98MB/s    0:00:01 (xfer#5, to-check=6236/6241)
AIAA-2009-601-842
     4164087 100%    3.33MB/s    0:00:01 (xfer#6, to-check=6235/6241)
CFD_SC07.ppt
...
Number of files: 6241
Number of files transferred: 5938
Total file size: 680013393 bytes
Total transferred file size: 680013393 bytes
Literal data: 680013393 bytes
Matched data: 0 bytes
File list size: 177740
File list generation time: 4.146 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 554389880
Total bytes received: 132474
sent 554389880 bytes  received 132474 bytes  3734157.27 bytes/sec
total size is 680013393  speedup is 1.23
root@laytonjb-laptop:~#
I cut out some of the output since there are so many files (6,241). At the very end is the summary data of the rsync operation. Remember that this is the first time the rsync operation happened so it will transfer all of the data as specified in the script. The next time rsync is invoked it will only copy the parts of the files that have changed.
One other quick observation is that I ran the script as root. Consequently, when I tried to connect to the destination server (192.168.1.8) it did so as root so I needed to use root's password.
To illustrate what happens when rsync is run again after some files have changed, I edited two files in the source directory tree and re-ran the same script. Below is the output from rsync.
root@laytonjb-laptop:~# ./runit_rsync
root@192.168.1.8's password:
building file list ...
6241 files to consider
FEATURES/STORAGE066/
FEATURES/STORAGE066/notes.txt
        1816 100%    0.00kB/s    0:00:00 (xfer#1, to-check=73/6241)
FEATURES/STORAGE066/storage066.html
        4082 100%    3.89MB/s    0:00:00 (xfer#2, to-check=72/6241)
Number of files: 6241
Number of files transferred: 2
Total file size: 680014698 bytes
Total transferred file size: 5898 bytes
Literal data: 2398 bytes
Matched data: 3500 bytes
File list size: 177740
File list generation time: 0.069 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 178691
Total bytes received: 112
sent 178691 bytes  received 112 bytes  32509.64 bytes/sec
total size is 680014698  speedup is 3803.15
Notice that only two files are listed as being transferred (these are the two files that were modified). This is the beauty of rsync - it only transfers the data that has been changed. This can make replication much, much easier.
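The `speedup` figure in the summary can be reproduced from the other numbers: it is the total size of the file set divided by the bytes that actually crossed the network (sent plus received). Here is a small sketch using awk, with the values copied from the two runs above; the function name `speedup` is just for illustration.

```shell
#!/bin/sh
# speedup TOTAL_SIZE BYTES_SENT BYTES_RECEIVED
# Ratio of the data set size to the network traffic rsync needed.
speedup() {
    awk -v total="$1" -v sent="$2" -v recv="$3" \
        'BEGIN { printf "%.2f\n", total / (sent + recv) }'
}

# Values taken from the summary blocks of the first and second runs.
first_run=$(speedup 680013393 554389880 132474)
second_run=$(speedup 680014698 178691 112)

echo "first run:  $first_run"
echo "second run: $second_run"
```

The results come out as 1.23 and 3803.15, matching what rsync reported. The second run's summary also shows the delta transfer at work: of the 5898 bytes in the two changed files, 3500 bytes were matched against data already on the server and only 2398 bytes were sent as literal data.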
At this point I know the specific rsync command works correctly. I could then make a script using this command sending the output to a log file. I could also put this script in a cron job so that it runs every so often (perhaps every hour). But ideally I would like to run it only when the laptop is first plugged into my home network. Without getting crazy about writing a script to check if I'm on my home network, that the home server is up, etc., I could just put the script in rc.local. So when the laptop boots, the script will be run and my data will be replicated onto my home server.
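A minimal sketch of what such a wrapper might look like, under several assumptions of mine: the rsync command lives in a script at `/root/runit_rsync`, the log goes to `/var/log/rsync_replicate.log`, and a single ping is a good enough test that the home server is up. The function names, the `run` argument and both paths are hypothetical, and the `-W` timeout option is the Linux form of ping.

```shell
#!/bin/sh
# Hypothetical wrapper: replicate only when the home server answers, and log the result.
SERVER=${SERVER:-192.168.1.8}               # home server from the article
SCRIPT=${SCRIPT:-/root/runit_rsync}         # assumed location of the rsync script
LOG=${LOG:-/var/log/rsync_replicate.log}    # assumed log file

# Succeed if the server replies to a single ping within 2 seconds (Linux ping options).
server_up() {
    ping -c 1 -W 2 "$SERVER" >/dev/null 2>&1
}

# Run the replication script if the server is reachable; append everything to the log.
replicate() {
    if server_up; then
        echo "$(date) starting replication to $SERVER" >>"$LOG"
        "$SCRIPT" >>"$LOG" 2>&1
        status=$?
        echo "$(date) rsync script exited with status $status" >>"$LOG"
        return "$status"
    else
        echo "$(date) $SERVER not reachable, skipping" >>"$LOG"
        return 0
    fi
}

# Only act when called with the argument "run", e.g. from rc.local or cron.
if [ "${1:-}" = "run" ]; then
    replicate
fi
```

Saved under a name such as `/root/replicate.sh` (again hypothetical), it could be called as `/root/replicate.sh run` from rc.local, or hourly from cron with an entry like `0 * * * * /root/replicate.sh run`. Keep in mind that an unattended run cannot type a password, so this only works once key-based ssh login to the server is set up.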
Data replication can be a very important technology for making sure that you have an up to date copy of your data somewhere in case you lose your primary source of data. In a previous article I mentioned that there are two tools in Linux for data replication - DRBD and rsync. In this article I showed a very simple example of using rsync.
The simple example was just replicating the data from my laptop (primary data storage) to my home server (secondary data storage). More precisely, I was replicating a specific directory from my laptop, recursively, to my home server. I used a simple example from a popular rsync tutorial as a starting point. I modified it slightly but I left in options that I don't normally use so that you can see what options are available for rsync (the man pages for rsync are HUGE so perhaps this simple example will help jump start your understanding of rsync).
Before anyone gets really upset that I'm stating that rsync is a replication tool and replication is different from backups, let me also say that you can use rsync for backups. An example is here where the author talks about how to use rsync to make daily, weekly, and monthly backups. However, there are some important differences between using rsync for backups and using a true backup utility such as Amanda that you should understand before using rsync as your production backup tool. But if you play your cards correctly, you can use the same tool for backups and for data replication.
Rsync is a great and easy to use utility for data replication. It has some really great features such as compression, recursion, and the ability to use different remote shells or sockets, and it looks for differences in files between rsync operations and transmits only the changes. Not a bad utility if you ask me - not bad at all.