From 1b814d7e38affa5cd073842ad1eb044e1ccb2089 Mon Sep 17 00:00:00 2001
From: Fam Zheng
Date: Wed, 22 Mar 2017 23:51:07 +0100
Subject: [PATCH 10/12] file-posix: Consider max_segments for BlockLimits.max_transfer

RH-Author: Fam Zheng
Message-id: <20170322235109.24122-2-famz@redhat.com>
Patchwork-id: 74433
O-Subject: [RHEV-7.3.z qemu-kvm-rhev PATCH v2 1/3] file-posix: Consider max_segments for BlockLimits.max_transfer
Bugzilla: 1431149
RH-Acked-by: Stefan Hajnoczi
RH-Acked-by: Paolo Bonzini
RH-Acked-by: Markus Armbruster

Without this fix, BlockLimits.max_transfer can be too high: the guest
will encounter I/O errors, or even get paused with werror=stop or
rerror=stop. The cause is explained below.

Linux has a separate limit, /sys/block/.../queue/max_segments, which in
the worst case can be more restrictive than the BLKSECTGET limit that we
already consider (note that they are two different things). So, the
failure scenario before this patch is:

1) host device has max_sectors_kb = 4096 and max_segments = 64;
2) guest learns the max_sectors_kb limit from QEMU, but doesn't know
   max_segments;
3) guest issues e.g. a 512KB request thinking it's okay, but actually
   it's not, because it will be passed through to the host device as an
   SG_IO req that has niov > 64;
4) host kernel doesn't like the segmenting of the request, and returns
   -EINVAL.

This patch checks the max_segments sysfs entry for the host device and
calculates a "conservative" bytes limit using the page size, which is
then merged into the existing max_transfer limit. The guest will
discover this from the usual virtual block device interfaces. (In the
case of scsi-generic, it will be done in the INQUIRY reply interception
in the device model.)

The other possibility is to actually propagate it as a separate limit,
but it's not better.
On the one hand, there is a big complication: the limit is per-LUN from
QEMU's point of view (because we can attach LUNs from different host
HBAs to the same virtio-scsi bus), but the channel to communicate it in
a per-LUN manner is missing down the stack; on the other hand, two
limits versus one doesn't change much about the valid size of I/O
(because the guest has no control over host segmenting).

Also, the idea to fall back to bounce buffering in QEMU, upon -EINVAL,
was explored. Unfortunately there is no neat way to ensure the bounce
buffer is less segmented (in terms of DMA addr) than the guest buffer.

In practice, this bug is not very common. It has only been reported on
an Emulex (lpfc), so it's okay to get it fixed in the easier way.

Reviewed-by: Paolo Bonzini
Signed-off-by: Fam Zheng
Signed-off-by: Kevin Wolf
(cherry picked from commit 9103f1ceb46614b150bcbc3c9a4fbc72b47fedcc)

RHEL 7.3 uses sector based block limits, so divide the result by
BDRV_SECTOR_SIZE.

Signed-off-by: Fam Zheng
Signed-off-by: Miroslav Rezanina
---
 block/raw-posix.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 3f55c7d..9d9a6a0 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -742,6 +742,48 @@ static int hdev_get_max_transfer_length(int fd)
 #endif
 }
 
+static int hdev_get_max_segments(const struct stat *st)
+{
+#ifdef CONFIG_LINUX
+    char buf[32];
+    const char *end;
+    char *sysfspath;
+    int ret;
+    int fd = -1;
+    long max_segments;
+
+    sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/max_segments",
+                                major(st->st_rdev), minor(st->st_rdev));
+    fd = open(sysfspath, O_RDONLY);
+    if (fd == -1) {
+        ret = -errno;
+        goto out;
+    }
+    do {
+        ret = read(fd, buf, sizeof(buf) - 1);
+    } while (ret == -1 && errno == EINTR);
+    if (ret < 0) {
+        ret = -errno;
+        goto out;
+    } else if (ret == 0) {
+        ret = -EIO;
+        goto out;
+    }
+    buf[ret] = 0;
+    /* The file is ended with '\n', pass 'end' to accept that. */
+    ret = qemu_strtol(buf, &end, 10, &max_segments);
+    if (ret == 0 && end && *end == '\n') {
+        ret = max_segments;
+    }
+
+out:
+    g_free(sysfspath);
+    return ret;
+#else
+    return -ENOTSUP;
+#endif
+}
+
 static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
 {
     BDRVRawState *s = bs->opaque;
@@ -753,6 +795,12 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
             if (ret >= 0) {
                 bs->bl.max_transfer_length = ret;
             }
+            ret = hdev_get_max_segments(&st);
+            if (ret > 0) {
+                bs->bl.max_transfer_length =
+                    MIN(bs->bl.max_transfer_length,
+                        ret * getpagesize() / BDRV_SECTOR_SIZE);
+            }
         }
     }
 
-- 
1.8.3.1
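
For illustration only (not part of the patch): a minimal C sketch of the
arithmetic behind the new limit, using the values from the failure
scenario in the commit message. The helper names max_segments_to_sectors()
and merge_max_transfer() and the SECTOR_SIZE macro are hypothetical
stand-ins; the patch does the same computation inline in
raw_refresh_limits() with getpagesize() and BDRV_SECTOR_SIZE. The 4 KiB
page size used in the note below is an assumption about the host.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for QEMU's BDRV_SECTOR_SIZE. */
#define SECTOR_SIZE 512u

/*
 * Conservative transfer limit, in sectors, implied by max_segments.
 * Worst case: every segment covers exactly one page, so the largest
 * request that is guaranteed to fit is max_segments pages.
 */
uint64_t max_segments_to_sectors(uint32_t max_segments, uint32_t page_size)
{
    assert(page_size >= SECTOR_SIZE);
    return (uint64_t)max_segments * page_size / SECTOR_SIZE;
}

/*
 * Merge the segment-derived limit into an existing limit by taking the
 * smaller of the two, mirroring the MIN() in the patch.
 */
uint64_t merge_max_transfer(uint64_t cur_sectors, uint64_t seg_sectors)
{
    return cur_sectors < seg_sectors ? cur_sectors : seg_sectors;
}
```

With max_segments = 64 and 4 KiB pages, max_segments_to_sectors(64, 4096)
gives 64 * 4096 / 512 = 512 sectors (256 KiB). Merging that with the
8192 sectors implied by max_sectors_kb = 4096 yields 512 sectors, so the
512KB request from step 3 exceeds the advertised limit and the guest is
expected to split it.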