Speaking of thumb drives.
In the case of copying files to a mounted file system, I’ve sometimes found it faster to use a tar pipeline than cp when copying data to a USB stick or SD/microSD card.
Instead of:
cp -r ~/wherever/somedir/ /media/SOMETHING/
I would do:
cd ~/wherever/
tar cf - somedir/ | ( cd /media/SOMETHING/ && tar xvf - )
And it would be noticeably faster. Not the same use case as the linked article, but I wanted to bring this up since it’s somewhat related.
There's also rsync which works perfectly with local paths and can resume from interruptions (by default), with or without crc checking the material ("-c"). That can be useful for removable storage which can sometimes be a bit unreliable.
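For example, a resumable, checksummed copy might look like this (paths are the thread's placeholders; -a preserves metadata, --partial keeps partially-transferred files so an interrupted run can pick up where it left off, -c compares by checksum instead of size and mtime):

```shell
# Resumable, checksum-verified copy to removable storage
# (paths are example placeholders from this thread).
rsync -a --partial -c ~/wherever/somedir/ /media/SOMETHING/somedir/
```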
Just take care that cp, tar and rsync each have slightly different handling of extended attributes and sparse files.
(By the way, I believe "tar -C /path" is the canonical way of doing "cd /path ; tar" without resorting to subshells.)
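With -C on both ends, the pipeline above (same example paths) becomes:

```shell
# Equivalent of the cd/subshell pipeline, letting tar change
# directory itself on each side of the pipe.
tar -C ~/wherever -cf - somedir | tar -C /media/SOMETHING -xf -
```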
rsync is a strange beast to me
The first thing that makes it so weird is how it assigns different meanings to paths with vs without a trailing slash. Completely different from how most command-line tools I’m used to behave on Linux and FreeBSD.
That alone is enough to remind me every time I try to use rsync why I don’t like and generally don’t use rsync.
While rsync is different from cp and mv, it’s the destination-state-dependent behaviour of cp and mv that I dislike.
With rsync, destination paths are defined by the command itself and the command is idempotent. I don't usually need to know what the destination is like to form a proper command (though oopsies happen, so verify before using --delete).
With cp/mv - the result depends on the presence and type of the destination. E.g. try running cp or mv, canceling then restarting. Do you need to change the arguments? Why?
mkdir s1 s2 d1
touch s1/s1.txt s2/s2.txt
# this seems inconsistent
cp -r s1 d1 # generates d1/s1/s1.txt
cp -r s2 d2 # generates d2/s2.txt
mkdir s1 s2 d1
touch s1/s1.txt s2/s2.txt
# I don't use this form, but it is consistent
rsync -r s1 d1 # generates d1/s1/s1.txt
rsync -r s2 d2 # generates d2/s2/s2.txt
# Same as above but more explicit
rsync -r s1 d1/ # generates d1/s1/s1.txt
rsync -r s2 d2/ # generates d2/s2/s2.txt
# I prefer this form most of the time:
rsync -r s1/ d1/ # generates d1/s1.txt
rsync -r s2/ d2/ # generates d2/s2.txt
I simply try to use trailing slashes wherever permitted and the result is amply clear.

If it didn't do that, it would have to add a switch that you'd still have to look up. It's a tool that's most often doing a merge and update, but looks like a copy command. I think that made it friendlier.
How would you separate "merge this directory into that one and update files with the same name" from "copy this directory into that directory"?
Halt with an error message explaining the ambiguity and the proper switch to use.
Same principle as rm requiring --no-preserve-root if you actually want to nuke / - right now it's way too easy to accidentally and destructively do the wrong thing.
> How would you separate "merge this directory into that one and update files with the same name" from "copy this directory into that directory"?
I would make "copy this directory into that directory" out of scope for the tool.
Let’s imagine a tool similar to rsync, but less confusing, and more in tune with what I want to do, personally. There are for sure a bunch of things that rsync can do, that this imagined tool can’t. That’s fine by me.
Let’s call the tool nsync.
It would work like this:
nsync /some/src/dir /some/dest/dir
And running that would behave exactly the same with or without trailing slashes.
I.e. the above and the following would all be equivalent:
nsync /some/src/dir/ /some/dest/dir/
nsync /some/src/dir/ /some/dest/dir
nsync /some/src/dir /some/dest/dir/
And what would this do? It would inspect source and dest dirs. Then it would copy files from source to dest for which last modified was greater in source dir than in dest, or where files in source dir did not exist in dest dir.
In other words, it would overwrite older files that had older last modified time stamp, and it would copy files that did not exist.
Like rsync it would also work over ssh (scp/sftp). Maybe some other protocols too, but only if those other protocols supported the comparisons we need to make. Prefer fewer protocols and this way of working, over trying to support the subset of what works across a gazillion protocols.
If a file exists in dest but not in source dir, it is kept untouched in dest. Not deleted. Not copied back to source dir.
Then there would be one other mode; destructive mode. The flag for it would be -d.
nsync -d /whatever/a/b/c/ /wherever/x/y/z/
This would work similar to the normal mode. But it would remove any files in dest dir that are not present in source dir. Before actually deleting anything it would list all of the files that will be deleted, and ask for keyboard confirmation. [y/N] so that you have to explicitly hit y and then enter. Enter alone will be interpreted as no.
You would be able to override the confirmation with the -y argument.
nsync -dy /whatever/a/b/c/ /wherever/x/y/z/
And that’s it. That’s what I would want rsync to be for me.
There probably are some programs that behave exactly like this. I’ll eventually write one too. It’ll have a user base of 1. Me.
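A minimal sketch of those semantics in Python (hypothetical nsync, exactly as described above: no ssh support, and a real tool would also worry about symlinks, attributes, and races):

```python
import os
import shutil

def nsync(src, dst, delete=False, assume_yes=False):
    """Copy files that are missing in dst or newer in src; with
    delete=True, also remove dst files absent from src, after a
    [y/N] confirmation (bypassed with assume_yes, i.e. -y)."""
    # Trailing slashes never change the meaning: strip them up front.
    src, dst = src.rstrip("/") or "/", dst.rstrip("/") or "/"
    copied = []
    for root, _dirs, files in os.walk(src):
        rel = os.path.relpath(root, src)
        target = dst if rel == "." else os.path.join(dst, rel)
        os.makedirs(target, exist_ok=True)
        for name in files:
            s = os.path.join(root, name)
            d = os.path.join(target, name)
            # Copy if missing in dest, or if source is newer.
            if not os.path.exists(d) or os.path.getmtime(s) > os.path.getmtime(d):
                shutil.copy2(s, d)  # copy2 preserves mtime -> idempotent
                copied.append(d)
    if delete:
        extra = []
        for root, _dirs, files in os.walk(dst):
            rel = os.path.relpath(root, dst)
            for name in files:
                mirror = src if rel == "." else os.path.join(src, rel)
                if not os.path.exists(os.path.join(mirror, name)):
                    extra.append(os.path.join(root, name))
        if extra and not assume_yes:
            print("\n".join(extra))
            if input("Delete these? [y/N] ").strip().lower() != "y":
                return copied
        for path in extra:
            os.remove(path)
    return copied
```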
rclone move will do this
https://github.com/chapmanjacobd/journal/blob/main/programmi...
Lost too much data that way, trailing slash with delete. I still feel bitter, the UX is terrible considering the effect are so different from copy and delete.
> I am used to behave in Linux and FreeBSD
the only strange thing about rsync is that it follows _BSD_ syntax
https://wiki.archlinux.org/title/rsync#Trailing_slash_caveat
Weird! I never noticed that.
Indeed, it does!
cd "$( mktemp -d )"
mkdir a b c
touch a/f1 a/f2 a/f3
cp -r a/ b/
ls b
Resulting content of directory b on one of my FreeBSD machines: f1 f2 f3
Guess I usually
1. Don’t use cp -r so often, and
2. When I do use cp -r apparently I don’t put a trailing slash
Cause I only ever had problems when trying to use rsync, not when using cp or mv :S
Gotta write your own application-specific wrapper script to put up the missing guardrails, e.g. require/forbid trailing '/', set "--max-delete=0", ...
Yes, a complete redesign of the rsync command line would be great.
I've found that when there are many files but the overall size is small (i.e. many, many small files), archiving, copying, and unpacking works faster. Perhaps due to enumerating and analysing the overall size before copying, vs archiving directly.
Would be happy to learn the real reason.
cp is single-threaded and only blocks on one IOP at a time.
With a tarpipe, you can block on two IOPs at a time, and they're decoupled by the pipe buffer.
This primarily makes a difference because the kernel cannot issue the I/O for small files ahead of time, like it does when you sequentially read a large file, so you do actually end up blocking and waiting.
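A toy model of why the tar pipe helps, with a bounded queue standing in for the pipe buffer (hypothetical helper, not how tar itself is written): the reader and writer block independently instead of one process alternating between input and output.

```python
import queue
import threading

def pipeline_copy(read_chunk, write_chunk, depth=16):
    """Toy tar-pipe model: a bounded queue plays the pipe buffer,
    so reading and writing can block concurrently rather than
    interleaved in a single loop (as with plain cp)."""
    q = queue.Queue(maxsize=depth)

    def reader():
        while True:
            chunk = read_chunk()  # returns None at EOF
            q.put(chunk)
            if chunk is None:
                break

    t = threading.Thread(target=reader)
    t.start()
    while True:
        chunk = q.get()
        if chunk is None:
            break
        write_chunk(chunk)
    t.join()
```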
This is the strategy I always use for that use case, especially with "assets" folders containing thousands of little icons, drawables, and the like.
On a tangential note, I always found unzipping/untarring in Midnight Commander excruciatingly slow, and copying of many files just barely tolerable.
Interesting.
My guess: speed up is due to buffering involved in the latter's case, not sure though.
It's because reading and writing happen in separate processes, i.e. simultaneously instead of interleaved.
In general writing to disk is handled asynchronously by the kernel (`write` just copies to a buffer and returns), but metadata operations like creating files are not, so this should help the most for many small files.
There's probably more than one reason it's faster. For example, tar ignores extended file attributes by default; cp -r would have to check them for every file.
cp queries the preferred block size of the destination file in 'struct stat', and has specific tweaks for certain filesystems. As far as I can tell, dd does not do this as it calls through to 'write' directly.
In any case, the tests in ioblksize.h indicate that bs=4M is far too large and may perform worse than the default for cp/cat (128KiB). There is a script there that should clear things up for more modern systems.
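That preferred block size is the st_blksize field of struct stat; you can peek at it from Python, for instance (cp then adjusts this value per the heuristics in ioblksize.h):

```python
import os

# st_blksize: the filesystem's preferred I/O block size for this
# path, the same value cp consults via stat().
st = os.stat(".")
print(st.st_blksize)
```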
The point about fdatasync is superfluous as you can run 'sync' yourself, or unmount the filesystem.
This example was performed on the block-special device, so there is no filesystem.
dd uses a default block size of 512 bytes, according to the manual page: calling write(2) directly means you need to choose a buffer of some size.
"bs=" sets both input and output block sizes, which probably isn't the best idea in this case.
Block sizes are a tricky subject:
https://utcc.utoronto.ca/~cks/space/blog/unix/StatfsPeculiar...
https://utcc.utoronto.ca/~cks/space/blog/tech/SSDsAnd4KSecto...
I've been re-implementing a bunch of coreutils as an exercise, and got stuck on dd input/output block sizes, AND disk/partition block sizes for a while. (As far as I understand it, for dd I need a ring buffer the size of max(ibs, obs), and then some moderately clever book-keeping to know when to trigger the next read/write, perhaps with code specific to ibs>obs, ibs<obs, etc; partitioning on the other hand is plainly stupid, there's decades of hardware and software just lying to each other and nothing makes sense.)
Thank you and everyone else in this thread for the know-how and references! I would like to eventually write an article (or at least heavily commented source) to hopefully explain all this nonsense for other people like me.
TFA mentions that you can use sync after cp to do the same thing as fdatasync. You can't "unmount the file system" because you're writing directly to the block device, the thumb drive isn't mounted.
Syncing as you go (ideally asynchronously) when you have to sync anyway (like when you're writing an image to a thumb drive, or write to NFS) has the big advantage that you don't end up saying "I'm done writing all data" and then blocking for five minutes waiting for the kernel to flush a couple gigs to disk.
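With GNU dd, that's roughly the trade-off between conv=fdatasync (one big flush at the end) and oflag=dsync (flush after every output block); /dev/sdX below is a placeholder for the target device:

```shell
# One flush at the end: fast, then a possibly long final wait.
dd if=disk.img of=/dev/sdX bs=1M conv=fdatasync status=progress
# Flush each output block: slower per block, no surprise at the end.
dd if=disk.img of=/dev/sdX bs=1M oflag=dsync status=progress
```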
For removable media, I believe that finishing up with "eject" would be the safest way to ensure that all writes were committed to the hardware.
You can't "unmount the file system" because you're writing directly to the block device, the thumb drive isn't mounted.
("Eject" is macOS's term for "unmount filesystem")
Distrowatch.com is a site promoting use of Linux and BSD.
"eject" is a Linux command which attempts to safely disconnect removable media.
https://manpages.ubuntu.com/manpages/noble/en/man1/eject.1.h...
TFA doesn't mention filesystems at all, sort of jumps in where we find the block device. Things could become messy if the device were mounted while the copy is attempted.
I've switched to the following:
<infile pv >outfile
And it seems to work very well plus gives a good progress indicator.