sort should support invalid utf8 in the C locale #4049

Open
@Sanmayce

Hi,
I just found this much-needed port. For a while now I have mostly needed 'sort', so I threw three files at it (two succeeded, one failed). The rival is my own sort tool, written in C.

The failed one, first:

C:\Temp>dir
10/12/2022  15:25        10,508,302 coreutils-0.0.16-x86_64-pc-windows-gnu.exe
10/15/2022  04:27       269,664,785 Schmekeriada_GCC_12.1.1_TetraThread.exe
10/15/2022  04:44               241 Rustsort_vs_Strongfool.bat
06/15/2016  00:52            49,152 sha1sum.exe
04/05/2022  07:59             6,144 timer64.exe
10/10/2022  21:50     1,340,416,000 kernel.org_linux-6.0.tar

C:\Temp>type Rustsort_vs_Strongfool.bat
copy "%1" nul
timer64.exe Schmekeriada_GCC_12.1.1_TetraThread.exe "%1"
sha1sum.exe Schmekeriada.txt
@del Schmekeriada.txt
timer64.exe coreutils-0.0.16-x86_64-pc-windows-gnu.exe sort "%1" -o Rustsort
sha1sum.exe Rustsort
@del Rustsort

C:\Temp>Rustsort_vs_Strongfool.bat kernel.org_linux-6.0.tar

C:\Temp>copy "kernel.org_linux-6.0.tar" nul
        1 file(s) copied.

C:\Temp>timer64.exe Schmekeriada_GCC_12.1.1_TetraThread.exe "kernel.org_linux-6.0.tar"
   _________        .__                       __                    __             .___
  /   _____/  ____  |  |__    _____    ____  |  | __  ____ _______ |__|_____     __| _/_____
  \_____  \ _/ ___\ |  |  \  /     \ _/ __ \ |  |/ /_/ __ \\_  __ \|  |\__  \   / __ | \__  \
  /        \\  \___ |   Y  \|  Y Y  \\  ___/ |    < \  ___/ |  | \/|  | / __ \_/ /_/ |  / __ \_
 /_______  / \___  >|___|  /|__|_|  / \___  >|__|_ \ \___  >|__|   |__|(____  /\____ | (____  /
         \/      \/      \/       \/      \/      \/     \/                 \/      \/      \/
This build (2022-Oct-12) features Quicksort-Magnetica, buffered dump of sorted data, bugfix: forgotten renaming of old function.
This tool is 100% FREE and open-source, for improvements: [email protected], enfun!
Current priority class is REALTIME_PRIORITY_CLASS.
Size of input file: 1,340,416,000
Allocating FILE-Buffer 1278MB ...
Counting lines ... Done in 166 clocks, 0.17 seconds.
Number of LF-ending lines: 35,137,185
Postfixing the last "line" with a LF.
Allocating Master-Buffer (Offsets+Lengths) NumberOfLFs*8*2 = 536MB ... Aligned to 64 bytes boundary.
Assigning pairs (of pointers and lengths) to lines ... Done in 1,068 clocks, 1.07 seconds.
ShortestLine = 0
LongestLine = 51,137
Sorting pointers to lines with 'Strongfool' a.k.a. 'Quicksort_Magnetica_v18_BalxchonkaForte_indirect' ...
Thread #1 of 4 sorting partition size=8034221
Thread #2 of 4 sorting partition size=5632802
Thread #3 of 4 sorting partition size=7211358
Thread #4 of 4 sorting partition size=14258808
Done (just sorting) in 9 seconds.
Writing sorted lines to 'Schmekeriada.txt' ... Allocating DUMP-Buffer (for 'fwrite()') 1024MB ...
Done (just writing) in 7 seconds.
Total LPS performance: 1,597,144 Lines-Per-Second
Total BPS performance: 60,928,000 Bytes-Per-Second

Kernel  Time =     1.281 =    6%
User    Time =    27.593 =  130%
Process Time =    28.875 =  136%    Virtual  Memory =   3102 MB
Global  Time =    21.195 =  100%    Physical Memory =   2842 MB

C:\Temp>sha1sum.exe Schmekeriada.txt
25e2479dfec352da57fabe06533ba4dcf76c58c5  Schmekeriada.txt

C:\Temp>timer64.exe coreutils-0.0.16-x86_64-pc-windows-gnu.exe sort "kernel.org_linux-6.0.tar" -o Rustsort
sort: invalid utf-8 sequence of 1 bytes from index 30938122

Exit code: 2

Kernel  Time =     0.046 =   45%
User    Time =     0.062 =   60%
Process Time =     0.109 =  105%    Virtual  Memory =    192 MB
Global  Time =     0.103 =  100%    Physical Memory =    132 MB

C:\Temp>sha1sum.exe Rustsort
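
The wording of the error above matches the `Display` text of `Utf8Error` in Rust's standard library, so the input is presumably being validated as UTF-8 before sorting. That is an assumption about the internals of 'sort', not something checked in its source. A minimal sketch that reproduces the same kind of message (the helper name `utf8_error_text` is hypothetical):

```rust
use std::str;

/// Returns the error text that `str::from_utf8` yields for `bytes`,
/// or `None` when the bytes are well-formed UTF-8.
fn utf8_error_text(bytes: &[u8]) -> Option<String> {
    str::from_utf8(bytes).err().map(|e| e.to_string())
}

fn main() {
    // 0xFF can never occur in well-formed UTF-8, so validation stops at it.
    let data = b"abc\xFFdef\n";
    match utf8_error_text(data) {
        Some(msg) => println!("sort: {msg}"),
        None => println!("input is valid UTF-8"),
    }
}
```

A tar archive contains binary headers and arbitrary file contents, so hitting such a byte somewhere in it is expected.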

Many multi-language Wikipedia titles concatenated:

C:\Temp\Quicksort_dumps.wikimedia.org_Wikipedia-TITLES_benchmark>dir *wiki*

09/27/2022  20:36       237,153,891 arwiki-20220920-all-titles
09/27/2022  05:35        18,302,882 dawiki-20220920-all-titles
09/27/2022  05:35       165,281,029 dewiki-20220920-all-titles
09/27/2022  05:35        24,324,262 elwiki-20220920-all-titles
09/21/2022  16:27     1,320,919,096 enwiki-20220920-all-titles
09/27/2022  05:35       173,224,497 eswiki-20220920-all-titles
09/27/2022  05:35        28,612,610 fiwiki-20220920-all-titles
09/27/2022  05:35       266,375,650 frwiki-20220920-all-titles
09/27/2022  20:36        39,501,061 hewiki-20220920-all-titles
09/27/2022  05:35       155,522,576 itwiki-20220920-all-titles
09/27/2022  05:35        98,362,799 jawiki-20220920-all-titles
09/27/2022  05:35        60,346,596 kowiki-20220920-all-titles
09/27/2022  05:35        93,107,997 nlwiki-20220920-all-titles
09/27/2022  05:35       117,815,247 ptwiki-20220920-all-titles
09/27/2022  20:36        70,390,886 rowiki-20220920-all-titles
09/27/2022  20:36       280,395,192 ruwiki-20220920-all-titles
09/27/2022  05:35       132,911,368 svwiki-20220920-all-titles
09/27/2022  20:36        51,304,688 trwiki-20220920-all-titles
09/27/2022  05:35       147,409,183 zhwiki-20220920-all-titles
              19 File(s)  3,481,261,510 bytes
               0 Dir(s)  18,172,588,032 bytes free

C:\Temp\Quicksort_dumps.wikimedia.org_Wikipedia-TITLES_benchmark>copy *wiki* ar_da_de_el_en_es_fi_fr_he_it_ja_ko_nl_pt_ro_ru_sv_tr_zh_146092425-lines /b
arwiki-20220920-all-titles
dawiki-20220920-all-titles
dewiki-20220920-all-titles
elwiki-20220920-all-titles
enwiki-20220920-all-titles
eswiki-20220920-all-titles
fiwiki-20220920-all-titles
frwiki-20220920-all-titles
hewiki-20220920-all-titles
itwiki-20220920-all-titles
jawiki-20220920-all-titles
kowiki-20220920-all-titles
nlwiki-20220920-all-titles
ptwiki-20220920-all-titles
rowiki-20220920-all-titles
ruwiki-20220920-all-titles
svwiki-20220920-all-titles
trwiki-20220920-all-titles
zhwiki-20220920-all-titles
        1 file(s) copied.

C:\Temp>dir

10/15/2022  03:33     3,481,261,510 ar_da_de_el_en_es_fi_fr_he_it_ja_ko_nl_pt_ro_ru_sv_tr_zh_146092425-lines
10/07/2022  03:46    15,674,562,752 jawiki-20220920-pages-articles.xml

C:\Temp>Rustsort_vs_Strongfool.bat ar_da_de_el_en_es_fi_fr_he_it_ja_ko_nl_pt_ro_ru_sv_tr_zh_146092425-lines

C:\Temp>copy "ar_da_de_el_en_es_fi_fr_he_it_ja_ko_nl_pt_ro_ru_sv_tr_zh_146092425-lines" nul
        1 file(s) copied.

C:\Temp>timer64.exe Schmekeriada_GCC_12.1.1_TetraThread.exe "ar_da_de_el_en_es_fi_fr_he_it_ja_ko_nl_pt_ro_ru_sv_tr_zh_146092425-lines"
   _________        .__                       __                    __             .___
  /   _____/  ____  |  |__    _____    ____  |  | __  ____ _______ |__|_____     __| _/_____
  \_____  \ _/ ___\ |  |  \  /     \ _/ __ \ |  |/ /_/ __ \\_  __ \|  |\__  \   / __ | \__  \
  /        \\  \___ |   Y  \|  Y Y  \\  ___/ |    < \  ___/ |  | \/|  | / __ \_/ /_/ |  / __ \_
 /_______  / \___  >|___|  /|__|_|  / \___  >|__|_ \ \___  >|__|   |__|(____  /\____ | (____  /
         \/      \/      \/       \/      \/      \/     \/                 \/      \/      \/
This build (2022-Oct-12) features Quicksort-Magnetica, buffered dump of sorted data, bugfix: forgotten renaming of old function.
This tool is 100% FREE and open-source, for improvements: [email protected], enfun!
Current priority class is REALTIME_PRIORITY_CLASS.
Size of input file: 3,481,261,510
Allocating FILE-Buffer 3319MB ...
Counting lines ... Done in 436 clocks, 0.44 seconds.
Number of LF-ending lines: 146,092,425
Allocating Master-Buffer (Offsets+Lengths) NumberOfLFs*8*2 = 2229MB ... Aligned to 64 bytes boundary.
Assigning pairs (of pointers and lengths) to lines ... Done in 3,324 clocks, 3.32 seconds.
ShortestLine = 3
LongestLine = 259
Sorting pointers to lines with 'Strongfool' a.k.a. 'Quicksort_Magnetica_v18_BalxchonkaForte_indirect' ...
Thread #1 of 4 sorting partition size=33003985
Thread #2 of 4 sorting partition size=59834494
Thread #3 of 4 sorting partition size=28334105
Thread #4 of 4 sorting partition size=24919844
Done (just sorting) in 21 seconds.
Writing sorted lines to 'Schmekeriada.txt' ... Allocating DUMP-Buffer (for 'fwrite()') 1024MB ...
Done (just writing) in 13 seconds.
Total LPS performance: 3,652,310 Lines-Per-Second
Total BPS performance: 87,031,537 Bytes-Per-Second

Kernel  Time =     2.843 =    7%
User    Time =    59.125 =  151%
Process Time =    61.968 =  158%    Virtual  Memory =   6844 MB
Global  Time =    39.138 =  100%    Physical Memory =   6577 MB

C:\Temp>sha1sum.exe Schmekeriada.txt
441564c51962d89e5e14c9aa854958106785b3ce  Schmekeriada.txt

C:\Temp>timer64.exe coreutils-0.0.16-x86_64-pc-windows-gnu.exe sort "ar_da_de_el_en_es_fi_fr_he_it_ja_ko_nl_pt_ro_ru_sv_tr_zh_146092425-lines" -o Rustsort

Kernel  Time =    15.515 =   18%
User    Time =   173.640 =  210%
Process Time =   189.156 =  229%    Virtual  Memory =    734 MB
Global  Time =    82.348 =  100%    Physical Memory =    497 MB

C:\Temp>sha1sum.exe Rustsort
441564c51962d89e5e14c9aa854958106785b3ce  Rustsort

C:\Temp>

The Japanese Wikipedia XML dump:

C:\Temp>Rustsort_vs_Strongfool.bat jawiki-20220920-pages-articles.xml

C:\Temp>copy "jawiki-20220920-pages-articles.xml" nul
        1 file(s) copied.

C:\Temp>timer64.exe Schmekeriada_GCC_12.1.1_TetraThread.exe "jawiki-20220920-pages-articles.xml"
   _________        .__                       __                    __             .___
  /   _____/  ____  |  |__    _____    ____  |  | __  ____ _______ |__|_____     __| _/_____
  \_____  \ _/ ___\ |  |  \  /     \ _/ __ \ |  |/ /_/ __ \\_  __ \|  |\__  \   / __ | \__  \
  /        \\  \___ |   Y  \|  Y Y  \\  ___/ |    < \  ___/ |  | \/|  | / __ \_/ /_/ |  / __ \_
 /_______  / \___  >|___|  /|__|_|  / \___  >|__|_ \ \___  >|__|   |__|(____  /\____ | (____  /
         \/      \/      \/       \/      \/      \/     \/                 \/      \/      \/
This build (2022-Oct-12) features Quicksort-Magnetica, buffered dump of sorted data, bugfix: forgotten renaming of old function.
This tool is 100% FREE and open-source, for improvements: [email protected], enfun!
Current priority class is REALTIME_PRIORITY_CLASS.
Size of input file: 15,674,562,752
Allocating FILE-Buffer 14948MB ...
Counting lines ... Done in 3,854 clocks, 3.85 seconds.
Number of LF-ending lines: 219,648,620
Postfixing the last "line" with a LF.
Allocating Master-Buffer (Offsets+Lengths) NumberOfLFs*8*2 = 3351MB ... Aligned to 64 bytes boundary.
Assigning pairs (of pointers and lengths) to lines ... Done in 10,943 clocks, 10.94 seconds.
ShortestLine = 0
LongestLine = 93,680
Sorting pointers to lines with 'Strongfool' a.k.a. 'Quicksort_Magnetica_v18_BalxchonkaForte_indirect' ...
Thread #1 of 4 sorting partition size=28973203
Thread #2 of 4 sorting partition size=37169789
Thread #3 of 4 sorting partition size=80704296
Thread #4 of 4 sorting partition size=72801336
Done (just sorting) in 92 seconds.
Writing sorted lines to 'Schmekeriada.txt' ... Allocating DUMP-Buffer (for 'fwrite()') 1024MB ...
Done (just writing) in 69 seconds.
Total LPS performance: 1,193,742 Lines-Per-Second
Total BPS performance: 85,187,841 Bytes-Per-Second

Kernel  Time =    12.312 =    6%
User    Time =   244.359 =  133%
Process Time =   256.671 =  140%    Virtual  Memory =  19620 MB
Global  Time =   183.015 =  100%    Physical Memory =  19328 MB

C:\Temp>sha1sum.exe Schmekeriada.txt
b97bab60515c4bafcaf4fa6b322b199f41e6d6d1  Schmekeriada.txt

C:\Temp>timer64.exe coreutils-0.0.16-x86_64-pc-windows-gnu.exe sort "jawiki-20220920-pages-articles.xml" -o Rustsort

Kernel  Time =    74.093 =   27%
User    Time =   493.703 =  184%
Process Time =   567.796 =  212%    Virtual  Memory =    373 MB
Global  Time =   267.792 =  100%    Physical Memory =    294 MB

C:\Temp>sha1sum.exe Rustsort
b97bab60515c4bafcaf4fa6b322b199f41e6d6d1  Rustsort

C:\Temp>

Note 1: Speed-wise (by Global Time), Schmekeriada is between 267.792/183.015 = 1.46x and 82.348/39.138 = 2.10x faster.
Note 2: The test machine is a laptop (i5-7200U, 2 cores/4 threads, 36GB DDR4, SATA SSD) running Windows 10.

This prompts two questions:

  • Currently it sorts only in UTF-8 mode, right? I didn't see a way to select the LC_ALL=C locale.
  • How well are the cores utilized? I don't see a '--parallel' counterpart; is using all available threads automatic?
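
To make the first question concrete, here is a rough sketch of what I mean by C-locale behaviour: lines are compared as raw bytes and never validated as UTF-8. This is only an illustration, not how either tool is implemented; the function name `sort_lines_bytewise` is made up for the example.

```rust
/// Sorts LF-separated lines by raw byte value, with no UTF-8 validation.
/// Every output line is LF-terminated, even if the last input line was not.
fn sort_lines_bytewise(input: &[u8]) -> Vec<u8> {
    let mut lines: Vec<&[u8]> = input.split(|&b| b == b'\n').collect();
    // `split` yields one trailing empty slice for empty input or for input
    // ending in LF; that slice is not a real line, so drop it.
    if input.is_empty() || input.ends_with(b"\n") {
        lines.pop();
    }
    // Byte slices order lexicographically by byte value, like memcmp with
    // the shorter line first on a common prefix.
    lines.sort_unstable();
    let mut out = Vec::with_capacity(input.len() + 1);
    for line in lines {
        out.extend_from_slice(line);
        out.push(b'\n');
    }
    out
}

fn main() {
    // The 0xFF byte is invalid UTF-8 but sorts fine as a plain byte.
    let sorted = sort_lines_bytewise(b"b\n\xFFz\na\n");
    println!("{:?}", sorted);
}
```

With byte-wise comparison the output order of valid UTF-8 input is the same as code-point order, which would be consistent with the matching SHA-1 sums in the two successful runs above.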

A very nice initiative indeed; I am looking forward to new releases. Regards.
