Despite my own promises, here is a hybrid follow-up to both the [BOORU_CHARS datasets](https://nyaa.si/view/1740396)
and the [safebooru-centric composite rips](https://nyaa.si/view/1733499).
This time the main source was **danbooru** (safe+questionable, interval **ID 6640000..8200000 = 31.08.2023..24.09.2024**),
with the best of furry-related e621 and loli-enabled gelbooru for the same interval used as an add-on.
Similar to the rips:
- images initially filtered by Mpixels>=0.48, shorter_side>=600 px, volume>=60000 bytes, no animations
stripes dropped or adjusted to aspect ratio 0.4..2.1
- PNG/WEBP/AVIF converted to JPG using cjpegli at 96% quality (2000000-byte limit)
modest resampling to a typical longer side of 2560px (landscape), 1920px (1x1), 2480px (portrait)
- verbose file naming used: **"%website% - %id% - %up_to_3_copyrights% ~ %up_to_5_characters% (%up_to_2_artists%).%ext%"**
files are uniquely identified by "%website%+%id%"
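
The initial filter listed above can be sketched as a small predicate. This is only an illustration, not the actual pipeline: the function names are hypothetical, Mpixels is assumed to mean width × height / 10^6, and aspect ratio is assumed to mean width / height.

```python
def passes_initial_filter(width, height, size_bytes, animated=False):
    """Return True if an image meets the size thresholds listed above."""
    if animated:
        return False
    # Mpixels >= 0.48, compared in whole pixels to avoid float rounding
    if width * height < 480_000:
        return False
    # shorter side >= 600 px
    if min(width, height) < 600:
        return False
    # file volume >= 60000 bytes
    return size_bytes >= 60_000


def in_aspect_range(width, height, lo=0.4, hi=2.1):
    """True if width/height lies inside the accepted 0.4..2.1 range."""
    return lo <= width / height <= hi
```

Images outside the aspect range ("stripes") were either dropped or adjusted; the sketch only detects them and does not attempt the adjustment.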
Similar to the BOORU CHARS datasets, extensive processing was done and used for content sorting:
- general image statistics gathered with ExifTool and ImageMagick
- content analysis mostly the same as for BC2023, with current software and models
- [CRAFT text detector](https://github.com/fcakyon/craft-text-detector) used to estimate the total size and number of text pieces
- torso components detected with a [custom PyTorch model](https://github.com/aperveyev/booru_yolo/tree/main/models)
built on [Ultralytics YOLOv11](https://github.com/ultralytics/ultralytics)
- imageboard tags arranged and partially embedded in the image EXIF info
- clustering implemented in two ways:
  - by aspect ratio { 7x10 +/-4% ; 3x4 +/-10% ; 1x1 +/-20% ; 3x2 +/-40% ; 2x3 +/-40% }
  - by head count { 0 heads = letter A, 2 = B, 3-5 heads = C, 6+ heads = D, 1 = letter E }
- sorting inside a cluster is based on an "attractiveness score function" == "colorful and textless"
- a balanced folder/zip typically contains ~1000-2600 files
- the lowest-rated images tend to be manga-like and were manually reviewed
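
The two clustering keys can be sketched as follows. The head-count letters come straight from the list above. The aspect-ratio part rests on assumptions: the tolerance is read as relative to the nominal width/height ratio, and an image goes to the first cluster that matches, in the order given. That reading is at least consistent with the overall 0.4..2.1 limit (2/3 × 0.6 = 0.4 and 3/2 × 1.4 = 2.1), but the function names and the first-match rule are hypothetical.

```python
# (name, nominal width/height ratio, relative tolerance), in the listed order
ASPECT_CLUSTERS = [
    ("7x10", 7 / 10, 0.04),
    ("3x4", 3 / 4, 0.10),
    ("1x1", 1.0, 0.20),
    ("3x2", 3 / 2, 0.40),
    ("2x3", 2 / 3, 0.40),
]


def aspect_cluster(width, height):
    """Return the first cluster whose tolerance band contains width/height."""
    ratio = width / height
    for name, center, tol in ASPECT_CLUSTERS:
        if center * (1 - tol) <= ratio <= center * (1 + tol):
            return name
    return None  # outside every band


def head_letter(heads):
    """Map a detected head count to the cluster letter."""
    if heads == 0:
        return "A"
    if heads == 1:
        return "E"
    if heads == 2:
        return "B"
    if heads <= 5:
        return "C"
    return "D"
```

For example, `aspect_cluster(1920, 1080)` falls into the wide `3x2` band and `head_letter(4)` gives `C`.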
Content is a little less processed and a little more NSFW compared to its predecessors; nevertheless:
- real-life photos, no-character landscapes, food and macro shots thrown away
- most comics and N-koma, overtexted images and line art filtered out
- too "questionable" images (uncensored nipples or vulva, obvious adult actions) excluded
a separate BOORU BOOBS release is planned
- some background crops, gamma correction, rotation, denoising and other nontrivial improvements implemented
Images were deduplicated with AntiDupl.NET up to 2% similarity, together with BOORU CHARS 2023 and 2022.
Besides images, the release contains tab-separated text files:
- **BC_2024.tsv**: file/image-related metadata, 1.260.629 rows
- **BC_2024_tags.tsv**: tag list with Danbooru enrichment, 49.041.220 rows
- **BC_2024_yolo.tsv**: detailed results of torso component detection, 4.431.887 rows
and also a dedicated "readme" describing their structure.
**Keep in mind this release is first of all
a dataset of character-centric art in an efficient local format suited for batch processing
and then
a representative catalog of anime/game/cartoon copyrights, characters and artists for visual assessment
but not
a complete and maximum quality rip.**
Some tips on use cases:
```
@REM -- explore torrent
for %%F in ("d:\torr\BOORU_CHARS_2024\2024-3x4*.zip") do 7z x -r -o"C:\TEMP" "%%F" *sousou*frieren*stark*
@REM -- much more effective if unzipped
xcopy /s "A:BCA*sousou*frieren*stark*" C:\TEMP
-- and becomes more sophisticated with a database
select 'xcopy "'||bc.fpath||'\'||bc.fname||'" C:\TEMP' xcpy
from bc
join bc_dt d on d.booru=bc.booru and d.fid=bc.fid
where bc.fname like '%dungeon%meshi%senshi%' and d.tag='pantyshot' -- brutal dwarf fanservice
```
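
Since files are identified by "%website%+%id%" and the names follow the pattern given earlier, a script can recover that key from a file name, for example to match files against the TSV metadata. The regular expression below is an assumption derived from the naming pattern, not the author's tooling: it expects all fields to be present, treats the last parenthesized group as the artists, and leaves the copyright/character/artist lists unsplit because the separators inside them are not specified here. The sample name in the usage note is made up.

```python
import re

# Assumed layout: "website - id - copyrights ~ characters (artists).ext"
NAME_RE = re.compile(
    r"^(?P<website>\S+) - (?P<id>\d+) - (?P<copyrights>.*?) ~ "
    r"(?P<characters>.*) \((?P<artists>[^()]*)\)\.(?P<ext>\w+)$"
)


def parse_name(fname):
    """Split a file name into its fields, or return None if it does not fit."""
    m = NAME_RE.match(fname)
    return m.groupdict() if m else None


def file_key(fname):
    """Return the (website, id) pair that uniquely identifies a file."""
    fields = parse_name(fname)
    return (fields["website"], int(fields["id"])) if fields else None
```

With a hypothetical name such as `danbooru - 7000001 - some copyright ~ some character (some artist).jpg`, `file_key` yields the website and numeric id, which correspond to the `booru` and `fid` columns used in the SQL join above.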