Current work - usenet indexing made easy (pre v1.1)
Posted: Fri Aug 12, 2005 4:55 am
i need to get some documents in a remote place and it is proving to be a real problem (but at least i'm not in the army, as happened in the past, when it was repeatedly proven that the army kills my thinking), so right now most of the time i don't have internet access at all, and only when i come to a city where the bureaucracy machine is located can i access the forum like this.
but let us talk about the current work.
at the end of february, when i was busy with UE in its pre-v1.0 state and it wasn't clear whether the program would be in a releasable state at all, i added newzbin support and was puzzled by the slowness of their newsreader backend. so i contacted them and proposed a better algorithm to fix that partially (they also have an sql side whose slowness is not fixable); on their web page you can see it as 'new faster search implemented'. in tests here the algorithm gave a speed of about 1000 searches per minute on a 2.8GHz computer, but it also had some deficiencies. these may not be problems for newzbin, since they have 15 or so computers running, but for implementing a ue service it was not a perfect solution. then i was preparing the official ue release and after that reimplementing all the essential newspro features until v1.1, so i was almost completely out of it.
about june 11, while working on filters, i made a small diversion and found, in theory, another indexing solution which doesn't have the deficiencies of the algorithm i gave to newzbin. i wrote down several sentences for future reference and forgot about it until after the v1.1 release.
before leaving at the end of july i downloaded some header samples onto a 256MB 1.8GHz laptop, and about august 2 the work on the topic started (not all the time, since I'm trying to get some rest too).
about august 10, two days ago, i had a working engine: not an emulation but code which will make it into the final product (we need the engine first, since otherwise you cannot estimate search speed and memory requirements, and there is no sense in writing the rest if they are not satisfactory). the current specifications are 100-1000 boolean wildmat searches per second on the laptop (newzbin i tested with substring searches), and on a 2GB computer it is possible to index about 7M files (the newzbin index is about 4-5M files). this can be increased further by maybe 20-30%, but i don't have enough material to do that, and it is probably not a priority (e.g. with 2 computers it will be 2 times more). the work went so fast because i used the ue framework code, which has libraries of my own design that are quite reusable and universal.
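to make "boolean wildmat search" concrete, here is a minimal sketch in python. it is not the engine described above (which is not shown in this post): it is a plain linear scan over subject lines, and python's `fnmatch` globbing is only a stand-in for wildmat, whose syntax differs in details such as `[^...]` negation. the names `matches`, `search`, `all_of` and `none_of` are hypothetical, chosen for the illustration.

```python
from fnmatch import fnmatchcase


def matches(subject, all_of=(), none_of=()):
    """Return True if subject matches every pattern in all_of (AND)
    and no pattern in none_of (NOT). Matching is case-insensitive.

    Assumption: fnmatch-style globs (*, ?, [...]) approximate wildmat.
    """
    s = subject.lower()
    return (all(fnmatchcase(s, p.lower()) for p in all_of)
            and not any(fnmatchcase(s, p.lower()) for p in none_of))


def search(subjects, all_of=(), none_of=()):
    """Linear scan: return the subjects that satisfy the boolean query."""
    return [s for s in subjects if matches(s, all_of, none_of)]


if __name__ == "__main__":
    # Made-up subject lines, for illustration only.
    sample = [
        "Ubuntu 5.04 install cd [01/50] - ubuntu.part01.rar",
        "Ubuntu 5.04 install cd [01/50] - ubuntu.par2",
        "holiday photos 2005 (1/3) photo1.jpg",
    ]
    print(search(sample, all_of=["*ubuntu*"], none_of=["*.par2"]))
```

a scan like this touches every subject on every query, so it only shows what the query semantics are; a real index has to avoid that to reach the speeds quoted above.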
most of what is left is to add message-id disk storage (like the compact binary groups in ue; at the least i need to rewrite the code) and, in ue, to add display of the search result list. the engine is economical in bandwidth (since i don't know how it will be hosted), so i cannot reuse the current ue header lists. instead, results will be displayed in temporary panes from which they can be marked to go into an import group. it also wouldn't be a problem to add a direct mode later, since the latter is a kind of slower simplification.
what we'll eventually get is a free program which can do usenet indexing for registered usenet explorer users. then we'll think about where and how to host it, but the idea is that everyone can run it (i don't have any desire to provide a service, so i wanted to replace the current eniac-like looking services with something very small). it is a shame that servers don't provide such a service as an nntp command if it takes only days to get such an engine working.
as to the leave, i was initially planning to go back about august 15, but it may take a bit longer (in one place the status will only be clarified on the 15th); in short, at worst some time around august 20 or later (also, given that i'm busy with indexing and don't need internet access to implement it, the work shouldn't be slowed down; who knows, maybe the opposite). my brother (his site www.netwu.com hosts both newspro and usenet explorer) is handling everything in the meantime.
in short, as always, i don't promise anything, but this is where things stand right now.