Description: Find duplicated files/directories
Author: Catalin(ux) M. BOIE
Start date: 2012-04-09

Plan:
- compute sha1 on files/dirs lazily (compare sizes first; compute the checksum only after).
- sort the file and dir tables
- check directories first
- check files, hiding all siblings already reported above

DIR
    subdir1
        subsubdir1
        subsubdir2
        file1
DIR
    subdir2

DIR->subdirs = subdir1
subdir1->next = subdir2
subdir1->subdirs = subsubdir1
subsubdir1->next = subsubdir2

== Pseudocode ==
main.c:
    for every directory passed as a parameter:
        call nftw with callback 'callback':
            ignore !files and !dirs
            if we have already seen that inode, skip it
            if it is a dir, call dir_add:
                alloc a dir node q and fill in name, dev, ino, level
                if it is a level 0 dir (passed as parameter), add it to the dir_info array
                else:
                    find the parent dir and set ->parent to it
                    ->next_sibling = parent->subdirs
                    parent->subdirs = q
            else, call file_add:
                alloc a file node q
                set size, name, dev, ino, level and init the SHAs
                find the parent and add q to parent->files; also set the parent
                now add q also to a hash by size (file_info), sorted by size
    call file_find_dups:
        for every bucket of file_info that has at least one item:
            if there is no next item, no dup is possible, so mark it with the no_dup_possible flag
            for every item in the hash:
                group by size and call compare_file_range
                compare_file_range fills item->dups
    call dir_find_dups:
        for every dir passed as a parameter (dir_info):
            call dir_build_hash
        allocate an array that keeps all dirs that may have matches
        for every possible dir, call dir_find_dups_populate_list
        sort dirs by hash
        find same-hash dirs:
            call dir_process_range on first..last with the same hash:
                link all dirs under the lowest-level one
    call dump_duplicates:
        if the no_dup_possible flag is set, skip
        if do_not_dump is set, skip
        if it is alone in the chain, no dup is possible, skip
        for every same-hash dir:
            if left is 1, skip it because it was already dumped
            if do_not_dump is set, skip
            mark the dir as left, so it does not appear in a 'right' position
            mark the main dir as do_not_dump, because we already dumped it
            mark the current dir as do_not_dump, because we already dumped it
            dump

Damn complicated. Let's try a simple approach.

Let's build a single linked list of files, ordered by size (see the sketch below).
The hash was too complicated and saved nothing; maybe it saved some time when adding files.
Build the dirs list.
Keep in mind: mark dirs that contain files that cannot have duplicates (unique size).
Don't forget to sort the files inside a dir before building the hash.
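
A minimal C sketch of the simple approach above: a single linked list of files kept
ordered by size, with unique-size files flagged so they are never checksummed.
The struct layout and function names here are assumptions for illustration only,
not the project's actual code.

#include <stddef.h>
#include <sys/types.h>

struct file_node {
        char                    *name;
        off_t                   size;
        dev_t                   dev;
        ino_t                   ino;
        unsigned char           no_dup_possible;        /* unique size => cannot have a dup */
        struct file_node        *next;                  /* next file, ordered by size */
};

static struct file_node *files_head;

/* Insert q keeping the list sorted by size (ascending). */
static void file_list_insert(struct file_node *q)
{
        struct file_node **p = &files_head;

        while (*p && (*p)->size < q->size)
                p = &(*p)->next;

        q->next = *p;
        *p = q;
}

/*
 * One pass over the sorted list: a file whose size differs from both
 * neighbours cannot have a duplicate, so flag it and never checksum it.
 */
static void file_mark_unique_sizes(void)
{
        struct file_node *q, *prev = NULL;

        for (q = files_head; q; prev = q, q = q->next) {
                int same_prev = prev && prev->size == q->size;
                int same_next = q->next && q->next->size == q->size;

                q->no_dup_possible = !(same_prev || same_next);
        }
}

Keeping the list sorted by size means a single pass is enough to flag files with a
unique size, and checksums only ever need to be computed for runs of equal-size files.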