Comments

  • By codetrotter 2024-01-27 10:40 (4 replies)

    Speaking of thumb drives.

    In the case of copying files to a mounted file system, I’ve sometimes found it faster to use a tar pipeline than cp when copying data to a USB stick or SD/microSD card.

    Instead of:

        cp -r ~/wherever/somedir/ /media/SOMETHING/
    
    I would do

        cd ~/wherever/
        tar cf - somedir/ | ( cd /media/SOMETHING/ && tar xvf - )
    
    And it would be noticeably faster.

    Not the same use case as the linked article, but I wanted to bring this up since it’s somewhat related.

    • By xorcist 2024-01-27 11:51 (1 reply)

      There's also rsync, which works perfectly well with local paths and can resume from interruptions (by default), with or without checksumming the material ("-c"). That can be useful for removable storage, which can sometimes be a bit unreliable.

      Just take care that cp, tar and rsync each have slightly different handling of extended attributes and sparse files.

      (By the way, I believe "tar -C /path" is the canonical way of doing "cd /path ; tar" without resorting to subshells.)
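
      With -C on both sides, the earlier pipeline might look like this (a sketch along those lines, untested):

        tar cf - -C ~/wherever somedir/ | tar xvf - -C /media/SOMETHING/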

      • By codetrotter 2024-01-27 14:08 (6 replies)

        rsync is a strange beast to me

        The first thing that makes it so weird is how it assigns different meanings to paths with vs without a trailing slash. Completely different from how most command line tools I am used to behave in Linux and FreeBSD.

        That alone is enough to remind me, every time I try to use rsync, why I don’t like it and generally don’t use it.

        • By labawi 2024-01-27 22:17

          While rsync is different from cp and mv, I dislike cp and mv's destination-state-dependent behaviour.

          With rsync, destination paths are defined by the command itself and the command is idempotent. I don't usually need to know what the destination is like to form a proper command (though oopsies happen, so verify before using --delete).

          With cp/mv, the result depends on the presence and type of the destination. E.g. try running cp or mv, canceling it, then restarting. Do you need to change the arguments? Why?

            mkdir s1 s2 d1
            touch s1/s1.txt s2/s2.txt
            # this seems inconsistent
            cp -r s1 d1  # generates d1/s1/s1.txt
            cp -r s2 d2  # generates d2/s2.txt
          
            mkdir s1 s2 d1  # (fresh run; assumes the dirs from above were removed)
            touch s1/s1.txt s2/s2.txt
            # I don't use this form, but it is consistent
            rsync -r s1 d1  # generates d1/s1/s1.txt
            rsync -r s2 d2  # generates d2/s2/s2.txt
            # Same as above but more explicit
            rsync -r s1 d1/  # generates d1/s1/s1.txt
            rsync -r s2 d2/  # generates d2/s2/s2.txt
            # I prefer this form most of the time:
            rsync -r s1/ d1/  # generates d1/s1.txt
            rsync -r s2/ d2/  # generates d2/s2.txt
          
          I simply try to use trailing slashes wherever permitted and the result is amply clear.

        • By pessimizer 2024-01-27 14:23 (2 replies)

          If it didn't do that, it would have to add a switch that you'd still have to look up. It's a tool that's most often doing a merge and update, but looks like a copy command. I think that made it friendlier.

          How would you separate "merge this directory into that one and update files with the same name" from "copy this directory into that directory"?

          • By Karunamon 2024-01-27 15:01

            Halt with an error message explaining the ambiguity and the proper switch to use.

            Same principle as rm requiring --no-preserve-root if you actually want to nuke /: right now it's way too easy to accidentally and destructively do the wrong thing.

          • By codetrotter 2024-01-27 15:16 (1 reply)

            > How would you separate "merge this directory into that one and update files with the same name" from "copy this directory into that directory"?

            I would make "copy this directory into that directory" out of scope for the tool.

            Let’s imagine a tool similar to rsync, but less confusing and more in tune with what I want to do, personally. There are for sure a bunch of things that rsync can do that this imagined tool can’t. That’s fine by me.

            Let’s call the tool nsync.

            It would work like this:

              nsync /some/src/dir /some/dest/dir

            And running that would behave exactly the same with or without trailing slashes.

            I.e. the above and the following would all be equivalent:

              nsync /some/src/dir/ /some/dest/dir/
              nsync /some/src/dir/ /some/dest/dir
              nsync /some/src/dir /some/dest/dir/

            And what would this do? It would inspect the source and dest dirs, then copy files from source to dest where the last-modified time is newer in the source dir than in dest, or where a file in the source dir does not exist in the dest dir.

            In other words, it would overwrite files that have an older last-modified timestamp, and it would copy files that do not yet exist in dest.

            Like rsync, it would also work with ssh (scp/sftp). Maybe some other protocols too, but only if those protocols support the comparisons we need to make. I'd prefer fewer protocols, and this way of working, over trying to be the subset of what works across a gazillion protocols.

            If a file exists in dest but not in source dir, it is kept untouched in dest. Not deleted. Not copied back to source dir.

            Then there would be one other mode: destructive mode. The flag for it would be -d.

              nsync -d /whatever/a/b/c/ /wherever/x/y/z/

            This would work similarly to the normal mode, but it would also remove any files in the dest dir that are not present in the source dir. Before actually deleting anything, it would list all of the files that will be deleted and ask for keyboard confirmation ([y/N]), so that you have to explicitly hit y and then enter; enter alone will be interpreted as no.

            You would be able to override the confirmation with the -y argument.

              nsync -dy /whatever/a/b/c/ /wherever/x/y/z/

            And that’s it. That’s what I would want rsync to be for me.

            There probably are some programs that behave exactly like this. I’ll eventually write one too. It’ll have a user base of 1. Me.
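
            For the non-destructive mode, a rough shell approximation might look like this (a sketch only; nsync itself is hypothetical, the paths are placeholders, and this ignores symlinks, deletions, and filenames containing newlines):

              cd /some/src/dir &&
              find . -type f | while IFS= read -r f; do
                dest="/some/dest/dir/$f"
                # copy if missing from dest, or if the source file is newer (-nt)
                if [ ! -e "$dest" ] || [ "$f" -nt "$dest" ]; then
                  mkdir -p "$(dirname "$dest")"
                  cp -p "$f" "$dest"
                fi
              done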

        • By pastage 2024-01-27 14:23

          Lost too much data that way: trailing slash with --delete. I still feel bitter; the UX is terrible considering the effects are so different from copy and delete.

        • By xk3 2024-01-28 18:08 (2 replies)

          > I am used to behave in Linux and FreeBSD

          the only strange thing about rsync is that it follows _BSD_ syntax

          https://wiki.archlinux.org/title/rsync#Trailing_slash_caveat

          • By codetrotter 2024-01-28 20:00

            Weird! I never noticed that.

            Indeed, it does!

              cd "$( mktemp -d )"
              mkdir a b c
              touch a/f1 a/f2 a/f3
              cp -r a/ b/
              ls b
            
            Resulting content of directory b on one of my FreeBSD machines:

              f1      f2      f3
            
            Guess I usually

            1. Don’t use cp -r so often, and

            2. When I do use cp -r apparently I don’t put a trailing slash

            Cause I only ever had problems when trying to use rsync, not when using cp or mv :S

        • By everybodyknows 2024-01-28 4:59

          Gotta write your own application-specific wrapper script to put up the missing guardrails, e.g. require/forbid trailing '/', set "--max-delete=0", ...
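
          A minimal sketch of such a wrapper (the script name and policy are made up; assumes rsync >= 3.0.0, where --max-delete=0 warns about extraneous files instead of deleting them):

            #!/bin/sh
            # safe-rsync: require trailing slashes on both paths and forbid deletions
            case $1 in */) ;; *) echo "usage: safe-rsync src/ dest/" >&2; exit 1 ;; esac
            case $2 in */) ;; *) echo "usage: safe-rsync src/ dest/" >&2; exit 1 ;; esac
            exec rsync -a --max-delete=0 "$1" "$2"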

        • By ape4 2024-01-27 14:45

          Yes, a complete redesign of the rsync command line would be great.

    • By legends2k 2024-01-27 12:26 (2 replies)

      I've found that when there are many files but the overall size is small, i.e. many, many small files, archiving, copying, and unpacking works faster. Perhaps due to enumerating and analysing the overall size before copying, vs archiving directly.

      Would be happy to learn the real reason.

      • By formerly_proven 2024-01-27 14:06

        cp is single-threaded and only blocks on one IOP at a time.

        With a tarpipe, you can block on two IOPs at a time, and they're decoupled by the pipe buffer.

        This primarily makes a difference because the kernel cannot issue the I/O for small files ahead of time, like it does when you sequentially read a large file, so you do actually end up blocking and waiting.

      • By naitgacem 2024-01-27 14:57

        This is the strategy I always use for that use case, especially with "assets" folders containing thousands of little icons, drawables, and the like.

    • By inglor_cz 2024-01-27 17:26

      On a tangential note, I've always found unzipping/untarring in Midnight Commander excruciatingly slow, and copying many files just barely tolerable.

    • By legends2k 2024-01-27 12:23 (2 replies)

      Interesting.

      My guess: the speedup is due to the buffering involved in the latter case, though I'm not sure.

      • By mortehu 2024-01-27 13:19

        It's because reading and writing happen in separate processes, i.e. simultaneously instead of interleaved.

        In general writing to disk is handled asynchronously by the kernel (`write` just copies to a buffer and returns), but metadata operations like creating files are not, so this should help the most for many small files.

      • By tyingq 2024-01-27 12:34

        There's probably more than one reason it's faster. For example, tar ignores extended file attributes by default, while cp -r has to check them for every file.
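
        With GNU tar you have to opt in to carry them across, along the lines of (a sketch, untested):

          tar --xattrs -cf - somedir/ | ( cd /media/SOMETHING/ && tar --xattrs -xf - )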

  • By tadfisher 2024-01-27 4:14 (2 replies)

    cp queries the preferred block size of the destination file in 'struct stat', and has specific tweaks for certain filesystems. As far as I can tell, dd does not do this as it calls through to 'write' directly.
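
    For instance, GNU stat exposes the same st_blksize hint (a sketch; the path is a placeholder):

      stat -c %o /media/SOMETHING/file   # "optimal I/O transfer size hint"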

    In any case, the tests in ioblksize.h indicate that bs=4M is far too large and may perform worse than the default for cp/cat (128KiB). There is a script there that should clear things up for more modern systems.

    The point about fdatasync is superfluous as you can run 'sync' yourself, or unmount the filesystem.

    • By NoZebra120vClip 2024-01-27 4:36 (1 reply)

      This example was performed on the block-special device, so there is no filesystem.

      dd uses a default block size of 512 bytes, according to the manual page: calling write(2) directly means you need to choose a buffer of some size.

      "bs=" sets both input and output block sizes, which probably isn't the best idea in this case.

      Block sizes are a tricky subject:

      https://utcc.utoronto.ca/~cks/space/blog/unix/StatfsPeculiar...

      https://utcc.utoronto.ca/~cks/space/blog/tech/SSDsAnd4KSecto...

      • By rollcat 2024-01-27 13:13

        I've been re-implementing a bunch of coreutils as an exercise, and got stuck on dd input/output block sizes, AND disk/partition block sizes, for a while. (As far as I understand it, for dd I need a ring buffer of size max(ibs, obs), and then some moderately clever book-keeping to know when to trigger the next read/write, perhaps with code specific to ibs>obs, ibs<obs, etc.; partitioning, on the other hand, is plainly stupid: there are decades of hardware and software just lying to each other, and nothing makes sense.)

        Thank you and everyone else in this thread for the know-how and references! I would like to eventually write an article (or at least heavily commented source) to hopefully explain all this nonsense for other people like me.

    • By mort96 2024-01-27 10:35 (1 reply)

      TFA mentions that you can use sync after cp to do the same thing as fdatasync. You can't "unmount the file system" because you're writing directly to the block device; the thumb drive isn't mounted.

      • By formerly_proven 2024-01-27 14:10 (1 reply)

        Syncing as you go (ideally asynchronously) when you have to sync anyway (like when you're writing an image to a thumb drive, or writing to NFS) has the big advantage that you don't end up saying "I'm done writing all data" and then blocking for five minutes while the kernel flushes a couple gigs to disk.
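
        With GNU dd, for example, oflag=dsync pushes each block out as you go rather than leaving one big blocking flush for the end (a sketch; the image and device names are made up):

          dd if=image.iso of=/dev/sdX bs=1M oflag=dsync status=progress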

        • By NoZebra120vClip 2024-01-28 7:11 (1 reply)

          For removable media, I believe that finishing up with "eject" would be the safest way to ensure that all writes were committed to the hardware.

          • By mort96 2024-01-29 12:10 (1 reply)

            You can't "unmount the file system" because you're writing directly to the block device, the thumb drive isn't mounted.

            ("Eject" is macOS's term for "unmount filesystem")

            • By NoZebra120vClip 2024-01-29 18:11

              Distrowatch.com is a site promoting the use of Linux and BSD.

              "eject" is a Linux command which attempts to safely disconnect removable media.

              https://manpages.ubuntu.com/manpages/noble/en/man1/eject.1.h...

              TFA doesn't mention filesystems at all; it sort of jumps in where we find the block device. Things could become messy if the device were mounted while the copy is attempted.

  • By BenjiWiebe 2024-01-27 19:22

    I've switched to the following:

      <infile pv >outfile
    
    And it seems to work very well, plus it gives a good progress indicator.
