Computer: Storage
Distributed File System (Lite)
Storage solution for 2 to 50 networked computers. "Don't keep all your eggs in one basket"

This idea is inspired by the excellent Coda and AFS file systems. Both are distributed file systems: the data has no single central physical location and can be spread across N peers (as read-write replicas in Coda, as read-only replicas in AFS). This offers a huge advantage when it comes to performance, redundancy and backup. The downside is that both AFS and Coda require a high degree of planning and knowledge to deploy. The biggest reason is that these file systems are designed for very heavy-duty use in environments with tens of thousands of users, where security, compatibility, performance, flexibility and scalability all play a huge role (and complexity creeps in whether we like it or not).

My idea is to simplify this excellent idea and make it practical for networks consisting of 2 to about 50 computers residing on a single LAN.

- Each computer has an N GB hard drive or partition available for the cluster
- All computers are on a single high-speed LAN with a predictable 100+ Mbps connection speed
- Each file has a master location
- Each file has N spare locations
- Minimum number of spare locations per file is set by the admin (common sense would say that minimum should be 1 so that if one copy fails the second survives - for more mission critical environments this could be increased)
- Available disk space will be reported as a sum of (all drives on the LAN) - (used space) - (used spare space)
- To the user the drive will appear as one huge drive.
- The files will get replicated transparently by caching, ensuring that at least the minimum number of spares exists somewhere on the network at all times.
- If you are interested in historical spares (what was the file like 1 week ago) you can also specify those settings and have the system keep those copies for you.
- The user would not know a thing. They would just see one huge drive (or not so huge if spare num was set very high and you are keeping a lot of history)
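
A rough sketch of the bookkeeping in the list above, in Python. Everything here is made up for illustration (the `Node` class, `reported_free_gb`, `place_file`, and the "put copies on the machines with the most free space" rule); a real system would more likely keep the master copy on the machine that uses the file most. It only shows the free-space formula and the rule that a file's master and spares sit on different machines.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """One computer contributing disk space to the cluster (hypothetical model)."""
    name: str
    capacity_gb: float
    used_gb: float = 0.0        # space taken by master copies
    used_spare_gb: float = 0.0  # space taken by spare copies

    @property
    def free_gb(self) -> float:
        return self.capacity_gb - self.used_gb - self.used_spare_gb


def reported_free_gb(nodes):
    """(all drives on the LAN) - (used space) - (used spare space)."""
    return sum(n.free_gb for n in nodes)


def place_file(nodes, size_gb, min_spares=1):
    """Pick one master and `min_spares` spare locations on distinct machines.

    Assumed policy: prefer the machines with the most free space.
    """
    candidates = sorted(
        (n for n in nodes if n.free_gb >= size_gb),
        key=lambda n: n.free_gb,
        reverse=True,
    )
    if len(candidates) < min_spares + 1:
        raise RuntimeError("not enough machines with free space for all copies")
    master, spares = candidates[0], candidates[1:min_spares + 1]
    master.used_gb += size_gb
    for spare in spares:
        spare.used_spare_gb += size_gb
    return master.name, [s.name for s in spares]


if __name__ == "__main__":
    lan = [Node("a", 100, 10), Node("b", 100, 10), Node("c", 100, 10)]
    print(reported_free_gb(lan))            # 270
    print(place_file(lan, 5, min_spares=1))
    print(reported_free_gb(lan))            # 260: the file costs 5 GB twice
```

Note that a 5 GB file with one spare costs 10 GB of the reported free space, which is why the drive looks "not so huge" once the spare count goes up.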

And yes ... underneath, this is exactly what Coda and AFS do - that's why I'm giving them the big credit. But what Coda and AFS fail to do is make the process completely transparent to the admin. By scaling the solution down to a small network, the complexity of the technical issues involved is also reduced and the administration overhead disappears, since the decisions can be automated based on a specified policy. But note that the key benefits are not lost ... you still have:

- 100% transparent backup system (where all you specify is how many spares you want to keep)
- Improved performance (frequently used files are kept locally)
- Improved organization (users don't have to remember which of the tens of drives the file sits on ... there is only one drive where everything sits)
- Space efficiency. Most companies I do administration for have 100+ GB hard drives in the workstations but only ~60 GB on the server. The workstations are extremely underutilized: they go to the server for most of their data and store only about 10 GB of OS themselves, leaving 90+ GB idle. For a small company with 10 workstations that means they could sell the file server and have a huge 900 GB of non-redundant storage (backed up manually) or 450 GB of redundant storage with 1 spare (like a giant network RAID1). If you then wanted to add the daily history requirement, that would eat into the total a little, depending on the delta size (it would store only the changes and rebuild them into a full snapshot upon request ... an rsync-like idea). By the time you were sufficiently satisfied with the number of spares and history you might still end up with 60 GB like you had before ... BUT
... you could do all this without having to lift a finger. If machines were stolen, crashed, or were destroyed in a natural disaster you would be 100% fine as long as you had more copies of each file than the number of lost machines. It would also be extremely scalable ... want an extra 400 GB of redundant space with 2 spares? ... add 3 x 400 GB anywhere on the LAN and you are done.
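
The capacity numbers above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: `usable_gb` is a made-up helper, and it assumes every file is stored as one master plus the same number of full-size spares, with no history deltas.

```python
def usable_gb(raw_gb, spares):
    """Usable space when every file exists as 1 master + `spares` full copies."""
    return raw_gb / (1 + spares)


if __name__ == "__main__":
    raw = 10 * 90                 # 10 workstations with 90 GB idle each
    print(usable_gb(raw, 0))      # 900.0 (no redundancy)
    print(usable_gb(raw, 1))      # 450.0 (1 spare, like network RAID1)
    print(usable_gb(3 * 400, 2))  # 400.0 (3 x 400 GB added, 2 spares)
```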
-- ixnaum, Oct 17 2005

OpenAFS (Andrew File System)
My inspiration #1 [ixnaum, Oct 17 2005]

Coda FS
My inspiration #2 [ixnaum, Oct 17 2005]

Perfect. I currently have three networked machines at home, all with different data on them. Some of them have back-ups of some of the others' data - it's a bit haphazard. I am going to purchase a big Network Attached Storage drive to back them all up onto, but this is a really neat solution. One problem - how does it cope with a machine going offline?
-- wagster, Oct 17 2005

Well, if your LAST copy goes offline then the file is offline ... but I guess if it's not an actual crash, the file could be replicated to another machine before the machine goes offline. That's why it would be good to set the number of spares as high as your disk space can handle (higher redundancy and failure tolerance).
-- ixnaum, Oct 17 2005

I have investigated and been baffled by rsync. It sounds like a very clever solution but I don't think I know enough to implement it myself. Once again my techie credentials fail when real networking or programming knowledge is called for.
-- wagster, Oct 17 2005

rsync is nice, but has major drawbacks:

- in practice can't realistically sync more often than every 15 minutes (on any typical size drive) - it eats up quite a lot of CPU going through the filesystem and determining which files have changed
- designed more for static files (e.g. web pages) than for rapidly changing content (e.g. databases, mailboxes)
- if you change the replica then changes will not be propagated back (you'll have 2 separate copies until the next sync overwrites your changes on the replica)

-- ixnaum, Oct 17 2005

Ah - it runs in the background. Can't have that on the audio machine. Besides, once a week should be enough backup for me.

I don't know enough to compare the various systems on offer here, I'll leave that to others.
-- wagster, Oct 17 2005

maybe mine takes too much (10%) CPU because I run it through ssh for security ... that would explain that.

But the other problems stand. You have to run rsync on a schedule ... yes, it's infinitely more efficient than copying when the files haven't changed since the last sync. But it will never be as efficient as AFS or Coda, especially when we talk about large files that change many times a minute ... or even many times per second. In that scenario rsync ends up doing exactly the same thing as copying ... and on top of that there is a good chance that you copy the db file in an inconsistent, corrupted state (just like if you were to copy the file directly).
-- ixnaum, Oct 17 2005

so we MUST be talking about something different then :-) ... your rsync seems way better than mine ... I'm talking about the one at

from man page: "rsync - faster, flexible replacement for rcp" ... this matches my long experience with using rsync exactly ...

So how do you make your rsync run continuously without scheduling the sync? Sure, there is a daemon mode for the server ... but you still have to schedule the client to trigger the sync, don't you?
-- ixnaum, Oct 17 2005

Your wish is my command [ixnaum].

I'm on a research project on this topic. I can't say anything about it, but it's basically looking into the management of a user's personal content over a number of mobile devices... The project is in its infancy, but I'll come back in 3 years and let you know how I get on!
-- Jinbish, Oct 18 2005

Three years! My files are a mess *now* !

The way it looks to me this instant, I think most of us are going to end up with most of our data (files and otherwise) on free or cheap remote public servers, accessed online. It's certainly the way I am working right now, forced to split my computing between my home PC and various PCs I have to share with others.
-- DrCurry, Oct 18 2005

// going to end up with most of our data (files and otherwise) on free or cheap remote public servers//

That's great .. but I don't trust cheap or free remote servers :-) .. I worked with one briefly (they try the best they can) but I wouldn't want to store my valuable data there ... maybe some crappy data that I want to lose eventually.
-- ixnaum, Oct 18 2005

Sorry DrC! But you are right about the remote servers. I think that as fixed-line broadband becomes more available, more people will be running servers at home (server-in-a-box type of stuff) and streaming their data/files to wherever they are.

So the burden falls on the communication methods and the ability of the data to be matched to the end device.
-- Jinbish, Oct 18 2005

//The project is in its infancy, but I'll come back in 3 years and let you know how I get on!//

.... 6 years later (double the time you asked for). I really hope it's good news, Jinbish. I still need this.
-- ixnaum, Jun 01 2011
