From 7f9b333861d96c952b740dde27186a9d1558b071 Mon Sep 17 00:00:00 2001
From: devzero
Date: Fri, 28 Jul 2017 13:19:54 +0300
Subject: [PATCH] Ticket #3616: speed up of utf-8 normalization.

When the content of a large directory is sorted by file name, a
significant amount of CPU time is spent in str_utf8_normalize(), which
is called from str_utf8_create_key_gen().

For example, /usr/bin/ contains 5437 files on my Arch Linux box. Running

    mc /usr/bin/ /usr/bin/

takes approx. 75 000 000 CPU instructions to sort file names, or 25% of
total program run time. Of these 75 000 000 instructions, 42 500 000
are spent in str_utf8_normalize().

str_utf8_normalize() uses g_utf8_normalize() to do the work.
g_utf8_normalize() is a heavyweight function that converts UTF-8 into
UCS-4, does the normalization and then converts UCS-4 back into UTF-8.

Since file names are composed of ASCII characters in most cases, we can
speed up str_utf8_normalize() by checking whether the heavyweight
Unicode normalization is actually needed. Normalization of an ASCII
string is a no-op, so such a string is effectively "normalized" by just
strdup().

With this patch, running

    mc /usr/bin/ /usr/bin/

requires just 37 000 000 instructions to sort the file names (down from
75 000 000) and 4 500 000 instructions to do str_utf8_normalize() (down
from 42 500 000).

Signed-off-by: Andrew Borodin
---
 lib/strutil/strutilutf8.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/lib/strutil/strutilutf8.c b/lib/strutil/strutilutf8.c
index f407d0ff3..c7376beb2 100644
--- a/lib/strutil/strutilutf8.c
+++ b/lib/strutil/strutilutf8.c
@@ -1080,6 +1080,25 @@ str_utf8_normalize (const char *text)
     const char *start;
     const char *end;
 
+    /* g_utf8_normalize() is a heavyweight function that converts UTF-8 into UCS-4,
+     * does the normalization and then converts UCS-4 back into UTF-8.
+     * Since file names are composed of ASCII characters in most cases, we can speed up
+     * utf8 normalization by checking if the heavyweight Unicode normalization is actually
+     * needed. Normalization of an ASCII string is a no-op.
+     */
+
+    /* find out whether text is ASCII only */
+    for (end = text; *end != '\0'; end++)
+        if ((*end & 0x80) != 0)
+        {
+            /* found a non-ASCII byte: part of a multibyte utf8-encoded symbol */
+            break;
+        }
+
+    /* if text is ASCII-only, return copy, normalize otherwise */
+    if (*end == '\0')
+        return g_strndup (text, end - text);
+
     fixed = g_string_sized_new (4);
 
     start = text;
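
Note: a minimal standalone sketch of the ASCII fast path in this patch,
written in plain C without GLib so that it can be compiled and tested on
its own. The names is_ascii_only() and normalize_fast_path() are
hypothetical and do not exist in mc. Here malloc()/memcpy() stand in for
g_strndup(), and a NULL return stands in for falling through to the
g_utf8_normalize() path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of mc): scan a NUL-terminated string and
 * report whether every byte has its high bit clear, i.e. the text is pure
 * ASCII. *len_out receives the number of bytes scanned before stopping. */
static bool
is_ascii_only (const char *text, size_t *len_out)
{
    const char *end;

    assert (text != NULL);

    for (end = text; *end != '\0'; end++)
        if (((unsigned char) *end & 0x80) != 0)
            break;              /* non-ASCII byte: stop scanning */

    *len_out = (size_t) (end - text);
    return *end == '\0';
}

/* Hypothetical stand-in for the fast path: return a heap copy of an
 * ASCII-only string, or NULL when full Unicode normalization would be
 * needed (NULL is also returned if malloc() fails). */
static char *
normalize_fast_path (const char *text)
{
    size_t len;
    char *copy;

    if (!is_ascii_only (text, &len))
        return NULL;

    copy = malloc (len + 1);
    if (copy != NULL)
        memcpy (copy, text, len + 1);   /* includes the trailing NUL */
    return copy;
}
```

A caller would try normalize_fast_path() first and run the heavyweight
normalization only when the name contains a non-ASCII byte. Because NULL
is overloaded in this sketch, real code would need to tell allocation
failure apart from the "needs normalization" case.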