h a l f b a k e r y


File system support for auto-breaking hard links

For more efficient file version control
(+1, -1)

I'm hoping this is baked and I'm just using the wrong search terms.

Many commonly used file systems support hard links and/or soft links between files, allowing one copy of some data to be accessed from two different locations. Many also have snapshot features that allow all the files to be virtually copied without making a second physical copy unless an application tries to modify the original. This allows capturing the state of all files at one point in time, often for the purpose of performing a backup of all files at a certain time stamp while allowing the system to continue normal operation during the backup.

I think it would be useful for file systems to support a type of hard link that would automatically create a copy and unlink files as soon as any application tries to modify the file through any of the links.
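
The behaviour can be approximated in user space, as a minimal sketch: before writing, check the link count and, if the data is shared, swap in a private copy under this name only. The helper name `open_breaking_link` is hypothetical, and the check-then-copy step is not atomic, which is one reason the feature would really belong inside the file system.

```python
import os
import shutil
import tempfile


def open_breaking_link(path, mode="r+b"):
    """Open `path` for writing, first giving it private data if it is hard-linked.

    Hypothetical helper: a real implementation would live in the file system
    and would not have a race between the link-count check and the copy.
    """
    if os.stat(path).st_nlink > 1:
        directory = os.path.dirname(os.path.abspath(path))
        # Copy the shared data to a temporary file in the same directory...
        fd, tmp = tempfile.mkstemp(dir=directory)
        os.close(fd)
        shutil.copy2(path, tmp)
        # ...then swap it in under this name only; other names keep the old data.
        os.replace(tmp, path)
    return open(path, mode)
```

Only programs that go through the helper get the protection; anything that opens the file directly still writes to the shared data.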

The times that I really want this feature are maybe somewhat specific to my job, but another application that I think would be useful to many people is in managing digital photos.

When I copy photos onto my computer, I organize all of the original files the way I like to archive them and never modify them. When I want to gather a collection of photos to send to someone (for example, highlights of a vacation), or when I'm sorting through a bunch of photos to select the best (say, from a portrait session with my kids), I generally create a separate folder and put copies of a bunch of photos in that folder. Depending on what I'm doing, sometimes I'll end up making modifications to some of the files, which is why I make copies: to ensure I don't modify the originals. After I use these files I generally leave them in that folder so I can see what I did with them. Of course I never delete the originals. Hard drive space is pretty cheap, so I don't worry too much about that, but I do end up wasting a lot, and worse is the time spent waiting for files to copy when I copy a large number of photos to a folder so I can pick the one I want by process of elimination. If I had the ability to make hard link copies that would automatically split into real copies if either were modified, that would save hard drive space and copying time. Some photo management programs implement some of these things for photos specifically, but I have other uses for this as well, and photo management software could take advantage of this feature if it were built into the file system.

Hard links almost implement this feature. If you create a hard link to a file in a separate folder, you can read it from both. If you overwrite the photo in one folder, the link will be broken and the photo in the other folder will be unchanged, but if you modify the photo in one folder, the photo in both folders will be modified since they share the same data. The technology to share data but make a copy if an attempt is made to modify the file is present in existing snapshot features like Windows Shadow Copy, but that particular feature can only be used on the entire volume at once, and the shadow copy is read-only. For some of the uses I'm interested in, there may be many copies of each file, and many of those need to be writable. But if I ever modify one, I want to unlink it and save it separately so I don't modify the others.
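
The difference between the two cases can be seen in a short sketch (file names and contents are made up for illustration). "Overwrite" here is assumed to mean writing a new file and renaming it over the link; a program that instead truncates and rewrites the existing file in place is modifying the shared data, so both names would change.

```python
import os
import tempfile


def hard_link_demo():
    """Return what each name sees after an in-place edit and after a replace."""
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "original.jpg")
        link = os.path.join(d, "link.jpg")
        with open(original, "wb") as f:
            f.write(b"photo-v1")
        os.link(original, link)

        # In-place modification through the link: both names share the data.
        with open(link, "r+b") as f:
            f.write(b"photo-v2")
        with open(original, "rb") as f:
            after_edit = f.read()

        # Overwrite by writing a new file and renaming it over the link:
        # the link name now points at new data, the original is untouched.
        tmp = os.path.join(d, "link.jpg.tmp")
        with open(tmp, "wb") as f:
            f.write(b"photo-v3")
        os.replace(tmp, link)
        with open(original, "rb") as f:
            after_replace = f.read()
        with open(link, "rb") as f:
            link_now = f.read()
    return after_edit, after_replace, link_now
```

The in-place edit shows up under the original name; the replace does not.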

One other use of this would be in a hard disk optimizer. Someone could write a small program to scan through a hard drive and link duplicate files together. This could be basically transparent to the user because if any of the duplicate files were modified, it would create a separate copy of the modified one.
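
A rough sketch of such a scanner, using ordinary hard links as a stand-in for the proposed auto-breaking kind. The function names are hypothetical. With plain hard links this is only safe for files that are never edited in place, and linked files also end up sharing permissions and timestamps; a careful tool would compare bytes rather than trust a hash alone.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def link_duplicates(root):
    """Replace byte-identical regular files under `root` with hard links.

    Returns the number of files that were relinked.
    """
    by_digest = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                key = (os.path.getsize(path), file_digest(path))
                by_digest[key].append(path)

    relinked = 0
    for paths in by_digest.values():
        keeper = paths[0]
        for dup in paths[1:]:
            if os.path.samefile(keeper, dup):
                continue  # already the same file under two names
            tmp = dup + ".dedup-tmp"
            os.link(keeper, tmp)   # new name for the keeper's data
            os.replace(tmp, dup)   # swap it in over the duplicate
            relinked += 1
    return relinked
```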

Note that this should not be used for making back-up copies, but that's obvious because a backup copy on the same volume is not very useful anyway.

This could make normal back-ups less efficient unless the backup software is aware of this feature and there is a way to track linked files so they can be linked on the backup drive as well. Then again, if this is only used in cases where normal copies would have been used anyway, then the backup would be no less efficient than before.

scad mientist, Sep 25 2014

Maybe baked? http://en.wikipedia...te_in_storage_media
[scad mientist, Sep 25 2014]

HAMMER File System http://www.dragonflybsd.org/hammer/
Part of DragonFly BSD [Spacecoyote, Jan 31 2015]


       So you want a hard link that stops being a hard link as soon as you write to it...the problem with that (besides being a rather opaque solution to the problem where more transparent solutions exist) is that a hard link simply means there are multiple names for the same file; it can't do anything fancy. You could implement this as a special case of variable symbolic link, though IMHO what you should really be after is a version control system and a backup system.
Spacecoyote, Sep 25 2014

       // (besides being a rather opaque solution to the problem where more transparent solutions exist) // -- Please elaborate on these more transparent solutions.   

       By the way, the application where this came to mind today (I've had this idea stirring for a long time) does have a lot to do with version control. I'm working on a software project. It's poorly managed (good thing I'm not in charge or it would be worse). Because of this it's hard to grab just the set of files that are needed. Access to the Perforce server is slow for some people in the organization (overseas and no Perforce proxy installed). Also, training on how best to use Perforce is somewhat spotty in the off-site locations. So when trying to reproduce some issue they are working on, they just send me a huge zip file of their version of the code. Now I could check that into Perforce somewhere, but that seems like I'm rubbing in their faces the fact that they should have used Perforce. Often I just save their copy on my computer and make multiple other copies to test various solutions. I end up with a bunch of copies of every single file, but only a couple of those have even the slightest difference.   

       It just seems wasteful to me to have the same data stored twice on the same drive, and it annoys me waiting for the files to copy when I really don't want another copy. I think the best way to implement this would actually be to make it the default behavior when a file is copied to a different folder or file name on the same physical volume. Or maybe just make it a check box next to the box for "compress drive to save disk space".   

       Oh yeah, I also remember wanting this feature back when I was required to use Subversion for source control. SVN stores two copies of every file. It diffs the main one against the one in the hidden .svn folder to see if you've made local changes. Again, most of these file pairs could be linked, and once that was implemented in the OS, with a few useful hooks, SVN could simply ask the OS if the files are still linked rather than actually having to diff the files. Even if SVN didn't take advantage of the feature explicitly, when the diff tool read one file then the other, the file would probably already be cached by the OS so it would only get read once from the hard drive anyway.
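
       With ordinary hard links, the "are these two still linked?" question can already be answered without reading any file data, by comparing device and inode numbers. A minimal sketch (the function name is hypothetical, and how a copy-on-write file system would expose the same question is an assumption):

```python
import os


def still_linked(path_a, path_b):
    """True if both names refer to the same file (same device and inode)."""
    a, b = os.stat(path_a), os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)
```

       If this returns True the contents are necessarily identical, so no diff is needed; if it returns False the files may still be byte-for-byte equal and a real comparison is required.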
scad mientist, Sep 25 2014

       To my knowledge, Git doesn't make a diff until there is a change; you could have 100 branches where only 1 copy is actually stored. Since it's a distributed version control system, it doesn't need to connect to a remote server unless you are pushing/pulling commits to said server. So there's your problem: you need a proper DVCS, and you need people to actually use it.
Spacecoyote, Sep 25 2014

       I finally found the term I'm looking for. I knew it had to be something others had thought of before. It's called copy-on-write. See link. I say maybe baked because, while the description of its use for copying objects in memory matches how I'd like to use this, the section on copy-on-write in storage media says this is implemented in btrfs and ZFS. I can't tell how it is used in btrfs, but in ZFS it looks like copy-on-write means always making a new copy of a full block of data when modifying any of it, to ensure file integrity, with no mention of allowing duplicate files to share space on disk. Qcow2 looks like what I want, but it's not integrated with a normally running OS (just virtual machines).
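
       A note on the btrfs case: on Linux, btrfs can clone a file so that the two names share data blocks until one of them is written (often called a reflink). The sketch below is Linux-only and rests on assumptions: it uses the `FICLONE` ioctl request number from the Linux headers, and falls back to an ordinary copy when the file system refuses the clone. The function name `clone_or_copy` is made up for illustration.

```python
import fcntl
import shutil

# Linux ioctl request number for cloning a whole file (FICLONE).
FICLONE = 0x40049409


def clone_or_copy(src, dst):
    """Try a copy-on-write clone of `src` to `dst`; fall back to a byte copy.

    Returns "clone" if the file system shared the data, "copy" otherwise.
    """
    with open(src, "rb") as s, open(dst, "wb") as d:
        try:
            fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
            return "clone"
        except OSError:
            pass  # file system does not support cloning; copy instead
    shutil.copyfile(src, dst)
    return "copy"
```

       Either way the result behaves like an independent copy: writing to one file never changes the other.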
scad mientist, Sep 25 2014

       I think you're thinking of a Versioning File System.
Spacecoyote, Sep 25 2014

       Why does this idea remind me of Teamcenter?   

       Oh yeah, because Teamcenter is what I have to deal with on a random day-to-day basis. I hate that program.
RayfordSteele, Sep 25 2014

       No I am not thinking about a versioning file system. I'm not looking for more data redundancy. I'm looking for less.   

       As I continue to think about what I'm looking for here, it really is a file system that optimizes disk space and file copy time by never copying data unless it needs to.   

       Previously I was thinking this would be an optional way of dealing with some files I want to treat in this way, but as I think about this more, I see no reason not to always have the file system do this. The only reason not to is if people think they are making the data more secure by making a second copy. But if it's on the same volume, a second copy doesn't protect against a hard drive crash anyway. If someone is worried about data integrity, they would be much better off using a system that stores data redundantly automatically. If you make a copy of a file to improve data integrity and one bit is corrupted in one of the files, you may not notice that right away. If you do notice, you have to manually decide which file you think isn't corrupted.   

       Therefore, a user should have a backup system and data integrity system appropriate for their situation, but on any one volume, the file system should make at least a nominal effort not to make multiple copies of the same data.
scad mientist, Sep 26 2014

       Windows 8 ReFS sounds pretty interesting. It does mention Copy on Write, but they are using the term in the same sense as ZFS: they never update data in place since a power failure during the write would corrupt the new data and the old data.   

       From the 2nd link: "The NTFS features we have chosen to not support in ReFS are: named streams, object IDs, short names, compression, file level encryption (EFS), user data transactions, sparse, hard-links, extended attributes, and quotas."   

       Based on the list of NTFS features not included in ReFS (they excluded hard links and compression), I'd say they are trending away from this concept, though they still support shadowing, which is really the most essential infrastructure needed to support my proposed feature.
scad mientist, Sep 26 2014

       So, basically you want an edit-log file (there's a name for that and I haven't been able to dredge it up for the last couple days) that can be committed later. The file (for picture editing) might read (may as well do it in plaintext):   

Import c:/proj1/Lookitsunset.gif
Crop 24,500
Redout 300,475

       which is a pretty small file, but it does rely on the original "LookitSunset" being around. At a later time the text file could be opened and a finished .gif written. Then if wanted it could be deleted as well as the original pic.   

       Problem of course is that all that text is proprietary to the software program, barring creation of industry-wide standards. So it would be pointless putting it into the OS.
FlyingToaster, Sep 27 2014

       I think that's a bit more high-level than he's going for, [FT].
Spacecoyote, Sep 27 2014

       yeah, I just reread the post. He wants to automatically write a new file when an opened hard-link file is modified.   

       If it's Windows you can delete the link from SaveAs before SavingAs. Problem is "SaveAs" automatically opens in the original's directory, not where the link is.
FlyingToaster, Sep 27 2014

       Sure, doing it by sectors would be great too. My simple implementation should work well with larger collections of small files, but would be quite annoying with large files (for example Outlook mailbox files).   

       Unfortunately that would make it more difficult to implement.
scad mientist, Sep 29 2014

       DragonFly BSD's HAMMER File System sounds interesting. From the [link]:   

       "HAMMER retains a fine-grained history. The state of the filesystem can be accessed live on 30-60 second boundaries without having to make explicit snapshots, up to a configurable fine-grained retention time. Coarse-grained history is controlled by snapshots. By default the system cron generates one snapshot a day and retains 60 days worth. Snapshots can be accessed live. A convenient undo command is provided for single-file history, diffs, and extractions. Snapshots may be used to access entire directory trees. Data and meta-data is CRC-checked for integrity. Data block deduplication [is used.]"
Spacecoyote, Jan 31 2015

