Signature cache feature in Usenet Explorer ?
Posted: Wed Oct 24, 2007 3:04 pm
by Lordcray
Hi,
I'm evaluating UE, and I found it very powerful.
But it is missing a very useful feature: a signature cache like the one in NewsBin Pro.
In practice, it's a way to check whether a file has already been downloaded.
If I'm not mistaken, NewsBin Pro performs a quick CRC32 check on the first 4 kB of a file to determine whether it has already been downloaded.
Is this feature on the roadmap for a future release of UE?
thanks
lordcray
Posted: Wed Oct 24, 2007 11:45 pm
by alex
try to explain to me how the feature is actually useful. when you hit duplicates, are they mainly the same posts accidentally marked for download again in the same newsgroup, or are they different posts containing exactly the same files?
in UE you can mark headers as deleted (by default they are marked automatically when you save attachments), so once you have downloaded them you won't see them again unless you want to. i think old newsbin downloaded headers every time you started the program, so it didn't keep track of previous downloads at the header level.
in the quick filter (view menu->quick filter to toggle the bar), if you check the "del" checkbox it will show headers marked as deleted. in edit menu->properties->articles->saving attachments, "delete headers" is checked by default. it means headers are marked as deleted (not permanently purged), so the program effectively remembers what you downloaded in the past.
------
now a little technical info. i put it first but then realized it is secondary; i think the feature is related to the inability to mark headers, which makes it too easy to trigger duplicate downloads.
a crc32 checksum cannot be used reliably to check whether a file is a duplicate: crc32 was not designed to provide a uniform distribution, so there is a chance of collisions (different attachments falsely marked as duplicates).
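To put a rough number on the collision risk, here is a minimal sketch using the standard birthday-bound approximation. It assumes the checksum values are uniformly distributed, which is the best case for any 32-bit value, so real CRC32 behaviour is no better than this. The function name and the file counts are illustrative only.

```python
import math

def collision_probability(n_files: int, bits: int = 32) -> float:
    """Approximate P(at least one collision) among n_files uniform values.

    Uses the birthday bound 1 - exp(-n(n-1) / (2 * 2**bits)).
    """
    space = 2.0 ** bits
    exponent = n_files * (n_files - 1) / (2.0 * space)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-exponent)

for n in (1_000, 10_000, 100_000):
    print(n, collision_probability(n, bits=32), collision_probability(n, bits=128))
```

Under that assumption a 32-bit value is more likely than not to produce a false duplicate somewhere in a history of 100,000 files, while a 128-bit value stays negligible at the same size.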
if it is acceptable to use a small file prefix in place of the whole file, then md5 could be used to hash it (md5 was broken in 2005, so collisions can be computed quite fast, but its output is still uniformly distributed and it would work for this purpose).
then it would be 16+4 bytes of overhead per downloaded file in RAM.
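A minimal sketch of what such a 20-byte record could look like. The 16 bytes are the MD5 digest of the file prefix; the post does not say what the extra 4 bytes hold, so the 32-bit length used here is an assumption, as are the 4096-byte prefix (borrowed from the 4 kB figure earlier in the thread) and the name make_signature.

```python
import hashlib
import struct

PREFIX_BYTES = 4096  # assumed prefix length (the 4 kB mentioned earlier in the thread)

def make_signature(data: bytes) -> bytes:
    """Return a 20-byte record: MD5 of the prefix plus a 32-bit length."""
    digest = hashlib.md5(data[:PREFIX_BYTES]).digest()   # 16 bytes
    # Assumption: the "+4" is a 32-bit value such as the file size.
    size = struct.pack("<I", len(data) & 0xFFFFFFFF)     # 4 bytes
    return digest + size

sig = make_signature(b"example attachment body" * 1000)
print(len(sig))  # 16 + 4 bytes per file
```

At this size a history of one million files needs about 20 MB for the raw records, not counting whatever the container adds on top.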
no p2p program keeps a history of previous downloads with a duplicate warning, although there it would be a very straightforward matter: for non-usenet p2p the value to cache is known from the outset, without initiating any download.
also, a set of options would be needed to bypass the duplicate detection when a redownload is intentional, adding to the number of options.
i'll be keeping md5/crc32/size info about downloaded files for the purpose of par2 recovery in v2.0, but i'm not planning to use it for duplicate detection unless someone persuades me that it would be really useful.
Posted: Thu Oct 25, 2007 8:31 am
by Lordcray
Actually, I'll try to explain with an example.
I download a file posted across several messages.
After 10 days someone reposts it in the newsgroup, but with a different header.
UE saves another copy of this file, since I've moved the old one to a different directory.
Newsbin 5 doesn't save the same file again. It downloads the first part, checks the CRC32 against its signature database, and logs it as a duplicate file. Obviously I can override this feature and force it to download the file again if I have deleted it.
It's very useful because you don't have to remember what you have downloaded, and in a lot of newsgroups the same file or image gets reposted after a while.
I've tried it in the latest UE and Newsbin for comparison.
However, when I tried to load the full headers of a newsgroup like alt.binaries.e-book.technical, Newsbin took 25 seconds, while UE took less than one second!
That's why I think UE is so much better than Newsbin.
bye
Lordcray
Posted: Thu Oct 25, 2007 9:05 am
by alex
right now i can't quite say what i will do after finishing unpack (which will probably take a few releases, as users ask for additional refinements).
adding the hash is very easy (just a simple hash table with md5 as the key, not crc32, but that is not an obstacle), plus stacking more options on top of the existing feature set. the question then is whether such an easy feature is implemented anywhere other than in a single program, since if something is easy to add and useful it should be common as well. i'll try to check that out when i finish the current work.
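For illustration, the hash table idea could be sketched roughly as follows. This is not how UE or Newsbin actually implement anything; the class name SignatureCache, the check_and_record method, the force flag and the 4096-byte prefix are all hypothetical, chosen only to show a lookup keyed by MD5 with an override for intentional redownloads.

```python
import hashlib

class SignatureCache:
    """Hypothetical duplicate detector keyed by the MD5 of a file prefix."""

    def __init__(self, prefix_bytes: int = 4096):
        self.prefix_bytes = prefix_bytes  # assumed prefix length
        self._seen = set()                # set of 16-byte MD5 digests

    def check_and_record(self, data: bytes, force: bool = False) -> bool:
        """Return True if data looks like a duplicate and should be skipped.

        With force=True the check is bypassed (intentional redownload),
        and the signature is recorded as usual.
        """
        key = hashlib.md5(data[:self.prefix_bytes]).digest()
        if key in self._seen and not force:
            return True
        self._seen.add(key)
        return False

cache = SignatureCache()
first_part = b"first part of some attachment"
print(cache.check_and_record(first_part))              # new file
print(cache.check_and_record(first_part))              # repost of the same file
print(cache.check_and_record(first_part, force=True))  # intentional redownload
```

The force flag stands in for the extra options mentioned above: some way to bypass the check is needed whenever the earlier copy was deleted on purpose.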