jasonwcdasjoe, I updated the bug report with my findings. https://github.com/zfsonlinux/zfs/issues/1059
lblumeHello all
lblumeHow much memory is zol using in addition to the ARC? I find top says a lot more memory is in use, which processes are not accounting for.
DeHackEdlblume: the ARC slabs consume a lot due to fragmentation. a single 128k allocation may take up to 4 MB of RAM, and freeing 128k won't free the 4 MB until every 128k chunk from it is freed
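A rough illustration of the point above, as a toy model rather than the actual SPL allocator: it assumes 4 MB slabs divided into 128k objects (32 per slab, ignoring slab metadata) and that a slab is only returned once it holds no live object. The function name and the occupancy list are hypothetical.

```python
SLAB_BYTES = 4 * 1024 * 1024       # slab size quoted in the discussion
OBJECT_BYTES = 128 * 1024          # one 128k ARC buffer
OBJECTS_PER_SLAB = SLAB_BYTES // OBJECT_BYTES  # 32, ignoring slab metadata


def slab_memory(live_objects_per_slab):
    """Return (bytes_in_use, bytes_held) for a list of per-slab live object counts.

    A slab stays resident while it holds at least one live object and is
    only released once its count drops to zero.
    """
    for count in live_objects_per_slab:
        if not 0 <= count <= OBJECTS_PER_SLAB:
            raise ValueError(f"slab occupancy {count} out of range")
    in_use = sum(live_objects_per_slab) * OBJECT_BYTES
    held = sum(SLAB_BYTES for count in live_objects_per_slab if count > 0)
    return in_use, held
```

Calling `slab_memory([1] * 100)` models the worst case, where 100 slabs are each pinned by one surviving 128k buffer, so the memory held is 32 times the memory in use.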
lblumeDeHackEd: Thanks, that'd explain it. Any way to mitigate this?
dasjoelblume: https://github.com/zfsonlinux/zfs/pull/2129 helps, but may break receiving
lblumeThanks for the pointer, I'll keep an eye on it, I have enough memory to live with it
jasonwclblume, arc_summary.py gives detailed stats on ARC and slab usage
jasonwcI would like to do nightly backups from my main pool to a backup pool using zfs send/recv. I've tested manually first with a zfs send -R and then with a zfs send -I old new and it works well
jasonwcThe question is how would I go about automating it to use the latest daily snapshot? zfs-auto-snapshot appends the exact time the snapshot was taken to the name, which changes from day to day.
Lalufujasonwc: you could just create a snapshot explicitly for backup purposes
Lalufuand delete that after sending
jasonwcIn that case, what would I specify as the initial snapshot for the incremental send?
jasonwcWhen I've done it manually I would do something like zfs send -I yesterday-daily today-daily
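One possible way to avoid hard-coding the timestamped names, sketched here under assumptions: the snapshot list comes from `zfs list -H -t snapshot -o name -s creation` (oldest first), and zfs-auto-snapshot names contain a `_daily-` label such as `zfs-auto-snap_daily-2014-12-20-0625`. The function name and the sample names are hypothetical, and the sketch only builds the command without running it.

```python
def incremental_send_command(snapshot_names, dataset, label="daily"):
    """Build a `zfs send -I` argv from the two newest matching snapshots.

    snapshot_names must be ordered oldest to newest, e.g. the output of
    `zfs list -H -t snapshot -o name -s creation`, one name per entry.
    """
    prefix = f"{dataset}@"
    matching = [
        name for name in snapshot_names
        if name.startswith(prefix) and f"_{label}-" in name
    ]
    if len(matching) < 2:
        raise ValueError("need at least two matching snapshots")
    previous, latest = matching[-2], matching[-1]
    # -I sends every intermediate snapshot between the two endpoints.
    return ["zfs", "send", "-I", previous, latest]
```

This assumes the previous night's send succeeded, so the second-newest daily snapshot already exists on the receiving pool. If a night was skipped, the starting snapshot would have to be the newest one present on both sides instead.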
Dagger 928 root 39 19 0 0 0 R 96.2 0.0 563:12.35 spl_kmem_cache/
DaggerACTION sighs
DaggerI tried adding more RAM and now everything has hung, rather than just ZFS :(
DeHackEdlblume: I have a custom build of 2129. It's a mix-and-match of main ZFS patches from HEAD, 2129, and a few from the pull request list. It may be safe for receiving. I've been running it for nearly a month with nightly receives without issue
DeHackEdfrom kernel 3.19's merge window article: The Btrfs filesystem's RAID5 and RAID6 implementation finally has support for disk scrubbing and replacement
DeHackEdWell what did you do before?!
vandemaryou didn't use them :(
jasonwcbtrfs was experimental
jasonwcfor RAID5/6
jasonwcThe wiki stated not to use it because the features for replacing a failing disk weren't even tested
jasonwcFrom what I've seen, you're not supposed to use btrfs for RAID5/6 but RAID0, 1 and 10 are OK
jasonwcLalufu, looks like I can do zfs send -I old new where old never changes and new is created and destroyed after every backup. Thanks.
Lalufujasonwc: that will get you ever growing snapshots, though, wouldn't it?
jasonwcMy understanding is that if you use -I it will simply send each incremental that exists between the two specified snapshots whereas -i creates a new snapshot with the changes from the old to new. I could be wrong on this. Also, since i use -F on the receiving end, it should delete any snapshots removed on the sending side.
jasonwcI suppose it's inefficient because you'll be re-sending some data that already exists but I'm not particularly worried about how long it takes to backup.
DeHackEdwith -R the send stream includes a complete list of snapshots the sender has, even if it's not sending them all. the receiver can prune their local list.
DeHackEddo experiment first. -R has side-effects that may surprise you
jasonwcI did the initial send with -R and incrementals with -I
jasonwcWhat are the side effects of -R?
jasonwcThe more I use ZFS the more I am impressed with it. I recently set up mdadm RAID1 for my server and it seems to require more manual intervention than ZFS. Just to make sure it would boot off a degraded array I removed one of the disks and rebooted. When I re-attached the drive and restarted it was still degraded. I apparently had to manually ask it to add the removed disk at which point it had to scan the entire SSD. With ZFS, it just shows the disk as online and quickly resilvers if there have been any changes. Automatic and much faster.
jasonwcAlso, I appreciate that I never lost any data despite having multiple SAS cards with a defect that randomly corrupts writes and drops disks
jasonwcIn fact, I only discovered the problem because of the checksumming feature
flingjasonwc: mdraid supports bitmap. No need to sync the entire SSD after you readd the drive :P
flingjasonwc: but yes, feels simpler with zfs.
jasonwcso, how do you get it to resynchronize without reading the entire drive?
fling--manage --add
flingbut you should first --grow -b internal
fling^ before you remove the drive.
jasonwcHowever, the process of moving from a single disk to a mdadm RAID1 was far simpler than the process of migrating to a ZFS root pool. It required 2 commands "raider -R1" and "raider --run" - http://raider.sourceforge.net/
flingjasonwc: bitmap is some kind of stupid metadata.
flingdoh never seen raider
jasonwcIt automates the process and is idiot proof
flingoh ok
jasonwcIt recommends that you swap the drives after creating the degraded array just to make sure it works
jasonwcTook about 15 minutes for each step with a 240 GB SSD
jasonwcGotta love the need to synchronize all those 0s on a 25% used disk
flingtry enabling the bitmap?
jasonwcNext time I'll use FransUrbo's Debian install disk
flingit will synchronize in seconds then
jasonwcyeah, i will do so
jasonwcDidn't know about that option previously
jasonwcHonestly, it's not a huge deal with SSDs. It takes 15 minutes to fully synchronize.
jasonwcThanks for the info. I'm off
lblumeDeHackEd: Thanks, I need first to safely receive data from Solaris pools, then I will be able to play with them. I'll ping you, next year, I think :-)
DeHackEdI'm hoping tuxoko figures out what's wrong and updates 2129
DeHackEdmy patch at this point is more of a "it works, don't touch it" sort of thing.
flingDeHackEd: when will it be merged in master? Is not it close to production ready yet?
DeHackEdfling: 2129 is viewed as being only one part of a much bigger project. it may never be merged because it's seen as incomplete.
DeHackEd(but I'll take it any day)
jasonwcfling, Since the mdadm "scrub" can't check for data integrity (at best, it can report mismatches), is there a point to running a mdadm scrub vs just a SMART Extended test?
bekksHow is that related to zfs? :)
jasonwcsorry, he had provided me useful info before. I will take it out of the channel. :)
jzawbekks, is there a problem having a discussion about comparative file/disk systems on an otherwise quiet channel ?
dasjoejasonwc: don't, please. I don't see how this channel should be about ZFS only :)
jzawditto dasjoe :D
bekksjzaw: No, there isn't. I was just hoping that he would clarify his actual issue :)
jasonwcActually, the reason we began discussing mdadm is that I compared it to ZFS, noting that I missed the features in ZFS.
jzawah fair enough
jasonwcI didn't have an issue. Just wondering if I should setup a cron job to do a scrub.
jzawjasonwc, zfs scrub ? why not? :)
DeHackEdwith RAID-6 you have the potential to find corrupted data, as long as only 1 disk in a stripe returns bad data
jasonwcSee above. I was asking whether I should run a mdadm scrub
DeHackEd(only works for fully optimal configurations)
jasonwcI'm using RAID1, so it's basically a coin flip as to which data is good
DeHackEdand whether mdadm supports that, I don't know
DeHackEdI'd still do the periodic scrubs
DeHackEdespecially for RAID-5. maybe RAID-1 has less benefit
jasonwcI currently have ZFS set to scrub weekly on Sundays, and SMART Extended tests on all my disks on Saturdays
jasonwcThankfully, the scrub is very fast since it's just two 240 GB SSDs, so it's done in 15-20 min
jasonwcFor the OS disks that is. My ZFS scrub takes about a day on 26TB
jasonwcThanks for the info DeHackEd
phoxanyone know about import failures with 'out of space' + threatening to lose data with an import -F? can this be imported ro without data loss?
DeHackEdRO import should be safe. ZFS will never write anything so you can't really break anything
dasjoeMake sure to have the pool imported ro, not just the FS set to ro
DeHackEdzpool import -o readonly=on $poolname
jasonwcHow full does a pool need to be to get such an error?
phoxbloody full.
DeHackEdzfs list shows 0
phoxwhat's the cause, no space to write ZIL back to rest of pool basically?
DeHackEdI suppose that's one possibility
jasonwcDoesn't ZFS allocate something like 0.5% of the pool for metadata and pool data?
DeHackEdthere's also the zpool history
DeHackEdwonder if it could be fragmentation related? can't find 128k blocks anymore?
phoxhm maybe
dasjoejasonwc: 1/64th, iirc
phoxyes 1/64th
phoxbut that's allocate, not keep free
jasonwcso, the available attribute in ZFS list excludes overhead, parity, and 1/64th of the pool as noted above?
DeHackEdit also excludes space claimed by reservations in other datasets, and limits it to the quota if any.
DeHackEdso it's not a perfect metric
jasonwcI noticed that my 6 drive raidz2 with AF drives had about 1% less free space than mentioned here: http://4.bp.blogspot.com/-2l9K6WpD1sc/U_pL3m-ReHI/AAAAAAAAUKk/GOTlWGq6Ia0/s1600/Screen%2BShot%2B2014-08-24%2Bat%2B21.27.29.png
jasonwcthat's probably why
jasonwcOdd, I'm trying to do a send/recv and am getting this error - "cannot receive incremental stream: dataset is busy"
jasonwcNot sure why it would be busy though
marlincIs there any way to make sure ZFS honors the ZFS ARC limit?
phoxmarlinc: lolololol.
jasonwcWill this work to automate the process of sending incremental backups via zfs send nightly?
jasonwczfs snapshot data/backups@now && zfs send -I old data/backups@now | zfs recv -Fd Backups && zfs destroy data/backups@now && zfs destroy Backups/backups@now
jasonwcseems to work
dasjoemarlinc: not right now, no. There are experimental patches which break receiving snapshots
marlincI see well okay
marlincHow can I see the current size of the ARC? In '/proc/spl/kstat/zfs/arcstats' I'm sure, but what value?
jasonwcI believe it's c
jasonwcactually, apparently there is a difference
jasonwc Current Size: 4,678 MB (arcsize)
jasonwc Target Size (Adaptive): 4,159 MB (c)
jasonwcmarlinc, the output with arc_summary.py is clearer anyhow
jasonwcmarlinc, https://github.com/tiberiusteng/tools/blob/master/arc_summary.py
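For reading the two values directly, a minimal sketch: it assumes the kstat file uses whitespace-separated `name type data` rows after its header lines, and that the current size is stored under `size` (what arc_summary.py labels arcsize) and the adaptive target under `c`. The function names are hypothetical.

```python
def parse_arcstats(text):
    """Parse kstat-style 'name type data' rows into a dict of integers.

    Header rows and anything that is not a three-column row with a
    numeric value are skipped.
    """
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        name, _kstat_type, value = fields
        if value.isdigit():
            stats[name] = int(value)
    return stats


def arc_size_and_target(text):
    """Return (current ARC size, adaptive target size) in bytes."""
    stats = parse_arcstats(text)
    return stats["size"], stats["c"]
```

On a machine with the module loaded, `arc_size_and_target(open("/proc/spl/kstat/zfs/arcstats").read())` would return both numbers. Neither includes memory held by partially used slabs, which is why `free` can show far more in use.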
marlincMm strange its very low according to that script
marlinc100 mb
marlincBut when I unload the ZFS module my memory usage goes down by about 2 GB
jasonwcwhat does arcsize show?
jasonwcthat's the slab size, so when you check free, you're seeing that, not c
jasonwcI've seen the two diverge by 10GB
marlincarcsize is 72 MB
jasonwcbut unloading the module frees several GB?
marlincIn most cases yes
marlincMy system uses about 4 GB now
marlinc1-2 GB is used by programs
marlincAnd when I unload ZFS it goes to about 500 mb
marlincWhen I close everything, programs, stop my Ubuntu session and simply go in the VTY and unload ZFS
marlincThen it goes down to about 500 MB
dasjoemarlinc: earlier today: <DeHackEd> lblume: the ARC slabs consume a lot due to fragmentation. a single 128k allocation may take up to 4 MB of RAM, and freeing 128k won't free the 4 MB until every 128k chunk from it is freed
marlincI see mm
marlincWell I mainly noticed it when I tried to run a Windows VM
lblumeI wonder if just setting the min ARC as the same size as the max, would just force allocating all the memory once, and never release it
phoxDeHackEd: yep import RO seems to have worked.
phoxparty on, Wayne.
dasjoeHey prakashsurya, I don't know whether you're directly involved, but how's device removal coming along at Delphix?
jasonwcPer the Oracle docs, the use of the -F flag with zfs recv should cause the destruction of snapshots deleted on the sending side but I'm not seeing that behavior. Is my command correct?
jasonwczfs snapshot data/documents@now && zfs send -I old data/documents@now | zfs recv -d -F Backups && zfs destroy data/documents@now && zfs destroy Backups/documents@now
Hamzahhm I just upgraded my NAS from RHEL6 to RHEL7, and installed ZoL on it. I was just about to setup my vdev_id.conf file, and I noticed that for some reason RHEL doesn't seem to populate /dev/disk/by-path/ for GPT devices (or that's what it looks like anyway...)
Hamzahhas anyone else noticed this ?
Hamzahhm actually, it's not just GPT, it's just not populating it for some devices :s
jasonwcHmm, it looks like it works if I use zfs send -R -i
DeHackEdmarlinc: there's a few options that might appeal to you. If you're running mostly virtual machines with small record sizes or zvols then there's an SPL module parameter you can tweak
dasjoeHamzah: by-path links are no longer created for SATA devices
Hamzahoh i see
dasjoeDeliberate decision, as SATA paths are not guaranteed to be stable, iirc
dasjoeBest practice is to use by-id or SAS disks :)
Hamzahhmmm i see.
unomystEzwow, bigger community than i expected =)
dasjoeZFS on Linux may be the largest Open-ZFS based community
unomystEzso i take it zfs is quite stable on linux - i've used freebsd for about 12 years from 1995 to 2007, then linux since
dasjoeIt is deemed to be production-ready
unomystEzi was going to run it on freebsd cuz i didn't think it was stable
unomystEzbut seems archlinux supports it fairly well
dasjoeryao blogged about its state a few months back
unomystEzalthough I think it's a separate branch of zfs-git than ZOL
dasjoeNo, it's using ZoL's github
unomystEzdasjoe, ah ok
unomystEzman im excited now
unomystEzi thought i'd have to splurge $600 at least for separate system to run freebsd
unomystEzit's great I can use my desktop to do it
unomystEzim playing around with file-backed vdevs to test a few scenarios
unomystEzand want to simulate my 2TB ext4 -> 3x2TB raidz1 migration
dasjoeunomystEz: I can't recommend raidz1, though
unomystEzdasjoe, why's that?
dasjoeJust single redundancy doesn't look like a good idea, to me
dasjoefwiw, I'm using a single raidz2 on 4 disks in my N54L
HamzahHP MicroServer :)
HamzahACTION is also using a N54L \o/
unomystEzis it possible to go from raidz1 to raidz2?
chungyonly by making a new pool from scratch and doing a send|recv cycle
dasjoeunomystEz: not directly
dasjoeYou can build a degraded raidz2, though
unomystEzdasjoe: so it's possible to go from raidz1(3) to raidz2(4) without completely backing up my raidz1? I just wouldn't have 3 extra drives sitting around
unomystEznice looking server btw
unomystEzi may build something into a node 304
unomystEzi think it has 6 bays
dasjoeunomystEz: it's possible to go from raidz2 (2 out of 4) to raidz2 (3/4), then 4/4
unomystEzwas going to start off with raidz1 then later on stripe in another raidz1
dasjoeI put an AsRock C2550D4I in my Node 304, it's okay
unomystEzdasjoe, cool, then for now i'll start with raidz1
dasjoeI prefer the N54L
dasjoeunomystEz: I never said raidz1 :)
unomystEzdasjoe, I know, but i have 3x2TB
unomystEzunless i wait and order another
dasjoeunomystEz: I'm suggesting to build a degraded raidz2, then
unomystEzdasjoe, ah, you mean right off the get-go, sorry I didn't follow
unomystEzso 3x2TB in raidz1 is 4TB usable, 4x2TB in raidz2 is again 4TB usable?
dasjoeYes, exactly
dasjoeYou'll lose a bit more to overhead, though
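The arithmetic behind those two figures, as a small sketch: it counts only data disks times disk size and deliberately ignores the overhead mentioned above (padding, metadata, the 1/64th allocation), so real pools will report somewhat less. The function name is hypothetical.

```python
def raidz_usable_tb(disks, parity, disk_tb):
    """Nominal usable capacity of one raidz vdev: data disks times disk size."""
    if parity not in (1, 2, 3):
        raise ValueError("raidz parity must be 1, 2 or 3")
    if disks <= parity:
        raise ValueError("a raidz vdev needs more disks than parity devices")
    return (disks - parity) * disk_tb


# 3x2TB raidz1 and 4x2TB raidz2 both leave two disks' worth of data space.
raidz1_tb = raidz_usable_tb(disks=3, parity=1, disk_tb=2)
raidz2_tb = raidz_usable_tb(disks=4, parity=2, disk_tb=2)
```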
unomystEzthat's to be expected
unomystEzso why don't you recommend raidz1 again?
unomystEzi plan on buying 3x2TB WD reds
unomystEzmy zvol is really only for streaming movies, pics, etc.. the important data is always backed up somewhere else, so if it dies i don't really care
unomystEzim basically running off a single 2TB drive in ext4 now without redundancy for 4 years
dasjoeunomystEz: because you'll no longer have any redundancy when (when, not if) a single disk fails
unomystEzdasjoe, then i'll just order a replacement via 2-day amazon prime?
dasjoeAnd hope for no disk failures until it arrives? Too much stress for me
unomystEzcool, im not arguing with you
unomystEzim just weighing the risk
dasjoeDidn't feel like arguing, no. I'm a big fan of 6-disk raidz2s, they're pretty versatile
dasjoeStill, my Node 304 is using 3x 2-disk mirrors
unomystEzhow about this question, if it takes me another month or so to get the 4th disk for the degraded raidz2 you propose, how much more risky is the degraded raidz2 on 3 disks to a 3 disk raidz1?
unomystEzare they identical?
dasjoePutting my Steam library on the N54L nicely demonstrated how much having IOPS of a single spindle sucks
unomystEz(sorry im not familiar enough with zfs to know yet)
unomystEzdasjoe, yeah I agree =)
dasjoePretty identical, yes
unomystEzso then perhaps I will do a 3-disk raidz2 to start
jasonwcDoes anyone know how to calculate the risk of pool loss given a known AFR? For example, how would 2 x raidz2 of 9 disks each compare to a single 18 disk raidz3. Obviously the performance is worse for the second configuration, but wouldn't the risk of data loss actually be lower?
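One crude way to approach that question, offered as a sketch and not an established method: treat disk failures as independent, convert the AFR into a per-disk failure probability for a fixed window (for example the time to notice, replace and resilver), and call a vdev lost when more disks than its parity level fail inside the same window. Real failures are correlated and UREs are ignored, so the results are only useful for comparing layouts. All function names and the example AFR and window are assumptions.

```python
from math import comb


def window_failure_probability(afr, window_days):
    """Per-disk probability of failing within window_days, given an annual rate."""
    return 1.0 - (1.0 - afr) ** (window_days / 365.0)


def vdev_loss_probability(disks, parity, q):
    """P(more than `parity` of `disks` fail), each failing independently with prob q."""
    return sum(
        comb(disks, k) * q**k * (1.0 - q) ** (disks - k)
        for k in range(parity + 1, disks + 1)
    )


def pool_loss_probability(vdevs, disks_per_vdev, parity, afr, window_days):
    """P(at least one of `vdevs` identical raidz vdevs is lost in one window)."""
    q = window_failure_probability(afr, window_days)
    per_vdev = vdev_loss_probability(disks_per_vdev, parity, q)
    return 1.0 - (1.0 - per_vdev) ** vdevs


# Example inputs (assumed): 5% AFR and a 7-day exposure window.
two_raidz2 = pool_loss_probability(2, 9, 2, afr=0.05, window_days=7)
one_raidz3 = pool_loss_probability(1, 18, 3, afr=0.05, window_days=7)
```

Under these assumptions the single 18-disk raidz3 comes out with the lower loss probability, which matches the intuition in the question. The model says nothing about resilver time or performance.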
unomystEzi'll add a 4th when I can
unomystEzerrr, within the next 6 weeks at least
dasjoeunomystEz: keep in mind you can't modify a vdev after its creation
dasjoeSo you'd have to build it degraded
unomystEzdasjoe, right, so i use a sparse file first?
unomystEzthen take it offline to degrade it
dasjoeunomystEz: exactly
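The sparse-file approach just described could look roughly like this. The sketch only builds the command lines and does not run anything. The pool name, device paths and placeholder path are hypothetical, and the `-f` on `zpool create` is an assumption: zpool tends to object to mixing files and disks in one vdev.

```python
def degraded_raidz2_plan(pool, real_disks, placeholder_path, placeholder_size):
    """Return the commands to create a raidz2 with one sparse-file member.

    placeholder_size should be at least the size of the real disks,
    written in a form `truncate -s` accepts (e.g. "2T").
    """
    return [
        # Sparse file: takes almost no space on a filesystem that supports holes.
        ["truncate", "-s", placeholder_size, placeholder_path],
        # Assumption: -f is needed because the vdev mixes disks and a file.
        ["zpool", "create", "-f", pool, "raidz2", *real_disks, placeholder_path],
        # Take the placeholder offline so the pool runs degraded on real disks only.
        ["zpool", "offline", pool, placeholder_path],
    ]


def replace_placeholder(pool, placeholder_path, new_disk):
    """Return the command to swap the placeholder for the real fourth disk."""
    return ["zpool", "replace", pool, placeholder_path, new_disk]
```

Trying the whole sequence on file-backed vdevs first, as planned in the discussion, is a cheap way to confirm each step before touching real disks.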
unomystEzyup I'm going to simulate that using file-backed vdevs
dasjoeMake sure the FS you're creating the sparse file on supports sparse files, eCryptfs doesn't ;)
unomystEzdasjoe, not even sure what ecryptfs is
unomystEzdasjoe, it's an ext4 non-encrypted
dasjoeTransparent encryption layer
unomystEzi gotta run out for a bit but i'll be back a bit later
dasjoejasonwc: interesting idea, I'll look into that. I wanted to build various pool configurations using sparse files to check their overhead and post about that
unomystEzthanks for all your help you've been really helpful
unomystEzand prob saved me from building a custom separate fbsd installation to run this thing
dasjoeFreeBSD is good, though :)
jasonwcdasjoe, Yeah, I did that myself to verify that http://4.bp.blogspot.com/-2l9K6WpD1sc/U_pL3m-ReHI/AAAAAAAAUKk/GOTlWGq6Ia0/s1600/Screen%2BShot%2B2014-08-24%2Bat%2B21.27.29.png was accurate
jasonwcWith a 6 disk raidz2 with AF disks I see 2% overhead
dasjoejasonwc: very interesting. I plan to blog about some encryption stuff, too ;)
jasonwcLooks like 6 and 9 disks are the most efficient overhead-wise for raidz2 and 11 is a great choice for raidz3
flingjasonwc: is not 9 too much for raidz2?
fearedblissdasjoe: there's a guide I maintain on getting gentoo with zfs/encryption and a few other things as well
fearedblissdasjoe: http://xyinn.org/guides/Gentoo_Linux_on_Encrypted_ZFS_guide.html
unomystEzbtw, how is btrfs's stability compared to zfs today?
sheptardzfs > *
p_lunomystEz: I recall reading recently that people are a bit... dismayed at how long it takes btrfs to mature
unomystEzp_l, I've read the same
p_lotoh, people are using it successfully, so ymmv
unomystEzhow about for corporate use? you think zfs on linux or btrfs would be more welcome?
p_ldepends on the corp
p_lthere are enough buyers for ZoL-based appliances to keep some companies afloat
unomystEzgood to know
unomystEzso i have a 4disk raidz2 in my zpool, if i just got 2 new drives, should i create a mirror vdev and stripe them in?
p_lsounds usable, IMO
p_lthough I tend to avoid just striping in new parts that aren't the same
unomystEzim simulating a few scenarios using file-backed vdevs, i have the node 304 case which has 6 bays, i currently have a single 2TB drive with ext4, and im going to migrate to a 4-drive raidz2 using 3 new 2TB drives, but later on i'll want to add more space, i was thinking 2x8TB or 2x6TB when prices come down a little
unomystEzbut wasn't sure how to best "add space"
unomystEzi was gonna do 3x2TB to start to leave another 3x<n>TB later on but dasjoe talked me in to going right to raidz2
p_lraidz2 is probably better in general when compared to raidz1, especially given the data densities of modern disks vs. their speeds
unomystEzsince this is for home, i don't really have the luxury of backing up and re-creating zpools
unomystEzso im trying to think as much ahead as possible
unomystEzp_l, what would you do in my situation?
unomystEzwith 6 bays avail
unomystEzand prob not really wanting to grow much beyond that for my home
jasonwcfling, I don't think so. RAID6 is used commercially in much larger arrays. I have a cold spare available so I can begin a resilver in 12-24 hours max which should take perhaps 16-24 hours. The risk of a second drive failing is higher than it would be if failures were statistically independent, but it's still rather low. Also, some failure modes that are fatal for RAID6 (two failed drives + a URE on a 3rd drive) would likely only impact a single file on ZFS because of checksumming.
jasonwcIn the RAID6 case, you lose the entire array
jasonwcI actually had a similar situation where my SAS controller was corrupting writes and it resulted in more corruption than could be repaired. ZFS handled it gracefully, told me the file that was corrupt. I was able to delete the file, restore from backup, scrub and everything was fine. In contrast, traditional RAID can't recover from something like that.
jasonwcA user in #plex had a large ZFS pool made of Seagate 3TB consumer disks in raidz2. He lost 2 drives completely and had a 3rd begin to fail and was able to fix the pool with only minor data loss, correctable from backups.
jasonwcSo, I don't think 9 drives is insufficient, at least for my uses. I have a separate backup pool for important data and also a cloud backup.
jasonwcI also tried to purchase drives that have low AFR based on publicly available data
flingjasonwc: but what about availability? It's worse to use raidz2 with 9 drives than raidz3
jasonwcThe array is still useable when degraded. In addition, this is for home use so durability is more of a concern than availability.
jasonwcI have multiple backups of my important data and at least one backup of other data
jasonwcIn any case, the level of risk will depend on your intended use for the array, your willingness to accept restoring from backups, the quality of your drives etc.
jasonwcIf your array is constantly active, a resilver will take much longer, for example
jasonwcI think this is a good article on the subject - http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/
jasonwcIt's all a matter of tradeoffs. A pool of 18 disks made up of two 9-drive raidz2 vdevs provides for a 6 disk raidz2 backup array. So, the size works well for my chassis, it provides good space efficiency w/ low overhead using 4K drives, and gives me the random IOPS of two spindles minimum