BOORU CHARS volume 2023 **completes** an attempt to consolidate and arrange the available character-centric,
almost-SFW anime/CG/game art into a local format suited both for batch processing and for visual inspection.
The whole evolved project consists of (in release order):
- [BOORU_CHARS_2021](https://nyaa.si/view/1384820) 1.593.429 images 472 GB topic starter
- [BOORU_CHARS_2015](https://nyaa.si/view/1468367) 463.873 images 148 GB old stuff
- [BOORU_CHARS_2022](https://nyaa.si/view/1547662) 705.467 images 191 GB newcomers
- [BOORU_CHARS_2023](https://nyaa.si/view/0000000) 1.153.513 images 320 GB this one
It is strongly recommended to read the READMEs there and, of course, to download and seed them.
This release covers
- ~98% new images from composite rips
* [volume V2022C](https://nyaa.si/view/1574093) 05.2022 - 08.2022
* [volume V2022D](https://nyaa.si/view/1634287) 08.2022 - 11.2022
* [volume V2023A](https://nyaa.si/view/1720018) 12.2022 - 03.2023
* [volume V2023B](https://nyaa.si/view/1727186) 03.2023 - 06.2023
* [volume V2023C](https://nyaa.si/view/1733499) 06.2023 - 08.2023
- some old imageboard material missed in BC2015
- ~20% "the best of" [Dark Pixiv Collection project 202209](https://nyaa.si/view/1626495), referred to as the "imageboard" pixiv.sfs
As in the whole project:
- files are uniquely identified by the (booru + fid) key, i.e. imageboard name and post ID;
verbose file naming is used: **%booru% - %fid% - %up-to-3-copyrights% ~ %up-to-5-characters% (%up-to-2-artists%)**
- images are clustered by aspect ratio, in priority order from high to low: 7x10 +/-4% ; 3x4 +/-10% ; 1x1 +/-20% ; 3x2 +/-40% ; 2x3 +/-40%
- (as in the composite rips) images are converted to JPG and
* sampled to 1280px on the longest side (1024px for 1x1)
* re-mogrified to 94% where the JPEG quality was 98-100%
- imageboard tags arranged and partially placed inside image EXIF-info
- some general image statistics obtained with [ImageMagick](https://imagemagick.org)
- content analysis is basically the same as for BC2022 but with more advanced software and models
* [CRAFT text detector](https://github.com/fcakyon/craft-text-detector) used to estimate the total size and number of text pieces
* torso components detected with [custom PyTorch models](https://github.com/aperveyev/booru_yolo/tree/main/models)
built on [Ultralytics YOLOv8](https://github.com/ultralytics/ultralytics), where the number of heads was used for folder clustering
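The priority-ordered aspect ratio clustering above can be sketched roughly as follows. This is only an illustration, not the project's actual code: it assumes each tolerance is a relative deviation of width/height from the nominal ratio, and the names `CLUSTERS` and `aspect_cluster` are hypothetical.

```python
# Nominal clusters in the priority order listed above:
# (label, nominal width, nominal height, relative tolerance).
CLUSTERS = [
    ("7x10", 7, 10, 0.04),
    ("3x4", 3, 4, 0.10),
    ("1x1", 1, 1, 0.20),
    ("3x2", 3, 2, 0.40),
    ("2x3", 2, 3, 0.40),
]


def aspect_cluster(width, height):
    """Return the first (highest priority) cluster whose tolerance band
    contains width/height, or None if no cluster matches.

    Assumption: "+/-N%" means |ratio - nominal| <= N% of nominal.
    """
    ratio = width / height
    for label, w, h, tol in CLUSTERS:
        nominal = w / h
        if abs(ratio - nominal) <= tol * nominal:
            return label
    return None


print(aspect_cluster(720, 1000))   # inside both the 7x10 and 3x4 bands
print(aspect_cluster(1500, 1000))  # landscape image
```

Under this reading a 720x1000 image falls inside both the 7x10 and the 3x4 bands, and the priority order is what puts it into 7x10.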
This release contains BC2023 itself:
- 1.153.513 sampled images clustered by aspect ratio and also number of heads detected
(0=letter A, 2=B, 3+=C, 1=letter E in folder name)
ordered and grouped into ~1000/2000-th zip/folders by "attractiveness score function"
- zipped tab-separated text files
* file/image related metadata (BC_2023.tsv)
* tags list with Danbooru enrichment (BC_2023_tags.tsv) 25.250.897 rows
* 3.877.682 detailed results for yolo detection algorithms (BC_2023_yolo.tsv)
- a dedicated README (bc_readme.txt) with a detailed structure description
and also a huge crossBOORU catalog of URLs, tags and other metadata (partitioned by the first letter of the MD5 hash, zipped)
- 17.733.350 items identified by MD5 (BOORU_*.tsv) with 35.033.097 (usually redundant) URLs
- correlated artist / copyright / character tag list (BOORU_*_TG.tsv) 63.900.184 rows
- 1.014.481 tags registry (BOORU_TG.tsv) zipped
- a separate README (booru_readme.txt) with a detailed description
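For navigating this layout, two small lookups may help: the folder letter for a detected head count, and the catalog partition for a file. This is a minimal sketch, not taken from the release. The function names are hypothetical, and it assumes the MD5 is taken over the raw file bytes and written as lowercase hex; the actual partition file names and letter case are described in booru_readme.txt.

```python
import hashlib

# Head count to folder letter, as stated above: 0=A, 1=E, 2=B, 3+=C.
HEAD_LETTERS = {0: "A", 1: "E", 2: "B"}


def head_letter(heads):
    """Folder-name letter for the number of heads detected in an image."""
    if heads < 0:
        raise ValueError("head count cannot be negative")
    return HEAD_LETTERS.get(heads, "C")


def md5_partition(data):
    """First hex character of the MD5 of `data` (bytes), i.e. the
    partition key of the crossBOORU catalog under the assumptions above."""
    return hashlib.md5(data).hexdigest()[0]


print(head_letter(1), head_letter(5))
print(md5_partition(b"hello"))
```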
Simple numerical ranks within clusters of images have been built for each numerical criterion,
so both outlier processing and ranking deal only with relative ranks 1:MaxNum or simple functions over them.
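A rough sketch of such per-cluster ranking for a single criterion is shown here. It is an illustration only: the function name, the input layout, the ascending direction (rank 1 = smallest value) and breaking ties by image ID are all assumptions, not the project's documented behaviour.

```python
from collections import defaultdict


def ranks_within_clusters(items):
    """items: iterable of (image_id, cluster, value) for one criterion.

    Returns {image_id: rank}, where rank runs 1..N inside each cluster.
    Assumptions: ascending order, ties broken by image_id.
    """
    by_cluster = defaultdict(list)
    for image_id, cluster, value in items:
        by_cluster[cluster].append((value, image_id))

    ranks = {}
    for members in by_cluster.values():
        for rank, (_value, image_id) in enumerate(sorted(members), start=1):
            ranks[image_id] = rank
    return ranks


sample = [
    ("a", "7x10", 0.9),
    ("b", "7x10", 0.1),
    ("c", "1x1", 0.5),
    ("d", "7x10", 0.5),
]
print(ranks_within_clusters(sample))
```

Working on ranks rather than raw values keeps criteria with very different scales comparable, which is why simple functions over ranks are enough for both outlier removal and ordering.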
As in BC2015 and BC2022:
- the "attractiveness score function" comes down to the definition "textless and colorful"
- "the worst of" outliers were deleted (rank by rank, ~2% in total)
P.S. almost 4M samples are ready (1.13TB) - time to use them for something valuable