Current work - usenet indexing made easy (pre v1.1)

Posted: Fri Aug 12, 2005 4:55 am
by alex
i need to get some documents in some remote place and it is proving to be a real problem (but at least i'm not in the army, as happened in the past, when it was repeatedly proven that the army kills my thinking), so right now most of the time i don't have Internet access at all, and i can only access the forum when i come to a city where the bureaucracy machine is located.

but let us talk about the current work.

at the end of february, when i was busy with UE in its pre v1.0 state and it wasn't clear whether the program would be in a releasable state at all, i added newzbin support and was puzzled by the slowness of their newsreader backend. so i contacted them and proposed a better algorithm to fix that partially (they also have an sql aspect whose slowness is not fixable). on their web page you can see that as 'new faster search implemented'. in tests here the algorithm gave a speed of about 1000 searches per minute on a 2.8GHz computer, but it also had some deficiencies which maybe are not problems for newzbin, since they have 15 or so computers running, but for implementing a ue service it was not a perfect solution. then i was preparing the official ue release and then reimplementing all essential newspro features until v1.1, so i was almost completely out of it.

about june 11, while working on filters, i made a small diversion and found, in theory, another indexing solution which doesn't have the deficiencies of the algorithm i gave to newzbin. i wrote down several sentences for future reference and forgot about it until after the v1.1 release.

before leaving at the end of July i downloaded some header samples onto a 256MB 1.8GHz laptop, and about August 2 the work on the topic started (not all the time, since i'm trying to get some rest too).

about August 10, two days ago, i had a working engine, not an emulation but code which will make it into the final product (we need the engine first, since you cannot estimate search speed and memory requirements without it, and there is no sense in writing the rest if they are not satisfactory). current specifications are 100-1000 boolean wildmat searches per second on the laptop (newzbin i checked with substrings), and on a 2GB computer it is possible to index about 7M files (the newzbin index is about 4-5M files). it can be increased further, maybe by 20-30%, but i don't have enough material to do that, and probably it is not a priority (e.g. with 2 computers it will be 2 times more). the work was so fast because i used ue framework code; it has libraries of my own design which are quite reusable and universal.
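for readers unfamiliar with the term, here is a minimal sketch of what a "boolean wildmat search" over subject lines means. it is not the engine described above (its algorithm is not given in this thread): it is a plain linear scan, Python's fnmatch patterns stand in for wildmat, and the names matches, search, all_of and none_of, as well as the sample subjects, are made up for illustration.

```python
from fnmatch import fnmatchcase


def matches(subject, all_of=(), none_of=()):
    """True if subject matches every pattern in all_of and none in none_of.

    Patterns use fnmatch syntax (*, ?, [...]) as a stand-in for wildmat.
    Matching is made case-insensitive by lowercasing both sides.
    """
    s = subject.lower()
    return (all(fnmatchcase(s, p.lower()) for p in all_of)
            and not any(fnmatchcase(s, p.lower()) for p in none_of))


def search(subjects, all_of=(), none_of=()):
    """Linear scan over all subjects; a real index would avoid this."""
    return [s for s in subjects if matches(s, all_of, none_of)]


# Hypothetical sample data, only to show the query shape.
subjects = [
    "ubuntu 5.04 install cd [01/50] - ubuntu.part01.rar",
    "ubuntu 5.04 install cd [02/50] - ubuntu.part02.rar",
    "holiday photos [1/3] - beach.jpg",
]

# "contains ubuntu AND ends with .rar AND NOT part02"
hits = search(subjects, all_of=["*ubuntu*", "*.rar"], none_of=["*part02*"])
```

a scan like this touches every subject on every query, which is exactly the cost an index is meant to avoid when there are millions of files.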

most of what is left is to add message-id disk storage (like compact binary groups in ue; at least i need to rewrite the code) and to add displaying of the search result list in ue. the engine is economical in bandwidth (since i don't know how it will be hosted), so i cannot reuse the current ue header lists. so we'll have results displayed in temporary panes, from which they can be marked to go into an import group. it also wouldn't be a problem to have a direct mode later, since the latter is a kind of slower simplification.

what we'll eventually get is a free program which can do usenet indexing for registered usenet explorer users. then we'll think about where and how to host it, but the idea is that everyone can run it (i don't have any desire to provide a service, so i wanted to replace the current eniac-like looking services with something very small). and it is a shame that servers don't provide such a service as an nntp command if it takes only days to get such an engine working.

as to the leave, i was initially planning to go back about august 15, but it may take a bit longer (in one place the status will only be clarified on the 15th), in short at worst some time around august 20+ (also, given that i'm busy with indexing and don't need internet access to implement it, the work shouldn't be slowed down; who knows, maybe the opposite). my brother (his site www.netwu.com hosts both newspro and usenet explorer) is handling everything in the meantime.

in short, as always, i don't promise anything, but this is where things stand right now.

Posted: Sat Aug 13, 2005 5:57 pm
by prusiner
Thanks for your good news about development, Alex

Have a nice time, far from computers and connections! Keep the laptop closed as much as possible! (There will be time for work later)

Posted: Sun Aug 28, 2005 2:21 pm
by alex
prusiner wrote: Thanks for your good news about development, Alex

Have a nice time, far from computers and connections! Keep the laptop closed as much as possible! (There will be time for work later)
i didn't overwork, just the results were a bit better than usual; also sometimes there is nothing else to do :)

Posted: Sun Aug 28, 2005 2:30 pm
by alex
i'm back, but it will take one more day (tomorrow) to bring everything in order since something unexpected happened today; afterwards all should normalize.

the productivity eventually degraded, especially during the last week (actually the laptop keyboard failed in the end). what we have now is detailed data structures and working details on the server side, and a sketch of the client side (basically a tabbed view; the client data is transient and of small volume, so the client side is not critical, but the interface is always time consuming).

so as of now, since once i've started the work i want to finish it, and given my detached condition at this very moment, i can only give a very rough time estimate: 3 weeks, say to have something around September 20. i'll adjust the date as the work progresses. and that is the service implementation; running the service is something different (but i'll try to run it at least until the implementation is working, and whatever it will be it won't be commercial; being an admin running a service is something different).

Posted: Mon Aug 29, 2005 1:25 pm
by alex
ok, now all is in order, so tomorrow i'll continue the work.

maybe i'll blog every stage of the development, but if i get bored talking i may delete it all, it only means i don't like to brag in the meantime :)

so there is the server side and the client side; the server side comes first, and the next stage i need to implement is handling the newsgroup list.

a sketch of the overall plan for the server side (preliminary, it may change):

newsgroup list
servers
message-id storage
getting headers

then implementing the client side

then refining the server side (probably mostly getting headers, addressing server inconsistency).

i want to finish it fast before getting bored with the topic itself :)

the most difficult part, the one which will underlie the engine, is already implemented (at least i estimate so right now).

as to the usenet explorer code (i mean the client), the code integrity won't be significantly affected: the tab view is separate, and as to the rest, i added something around the current search interface code so it works either way depending on settings in properties, i mean it is safe to add without compromising the rest.

Posted: Fri Sep 02, 2005 1:54 am
by alex
ok, it seems the work is normalizing (these breaks are pretty distracting even if i manage to work during them).

maybe in about one week it will be clearer what the progress and the estimated completion date are (maybe release is not the right word here, but the work will include a ue release with the needed capabilities on the client side along with the server). again, it is important to finish the work fast since the code is not directly connected to conventional newsreader functionality.

in short, every week (on friday) the info about progress will be in this thread.

Posted: Sun Sep 11, 2005 1:14 pm
by alex
currently i'm working on message-id storage. it is like compact binary, so reimplementing it is a bit boring (i want to keep ue and the indexing code separate, so i'm only reducing it to some common ground when the changes are simple, so as not to compromise the ue code).

i'll try to have working code (so i can at least return to the ue code and process pending requests) by the end of september at the latest.

Posted: Mon Sep 19, 2005 1:45 am
by alex
due to a technical fault with my isp i haven't had a connection since last friday.

i've implemented message-id storage (preliminarily, meaning i may add more fault tolerance later) and am now checking how everything behaves on larger volumes of posts and different search patterns. i observe 2 strange processor usage spikes; i need to be positive here before i continue further.

what will be left afterwards is getting headers (maybe at first i'll just take the code from ue) and the view to show search results and related on the client side.

Posted: Mon Sep 19, 2005 11:29 am
by alex
ok, tested with about 1.6M files (1.5 or 7 million doesn't matter), no problems now. i only found one small bug (i put > instead of >=) which was responsible for a spike in processor usage when "getting headers" in a few cases, and the windows function to show text, SetDlgItemText, takes too much processor time when the text is long, so i cut the text shown in the test window that displays search results. these were the two things which caused concern, especially the latter, since SetDlgItemText took about 4 seconds in some cases, so there was an impression that the search itself was slow. in short, i can continue further development, no problems as of now.
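the workaround of cutting the text before it reaches a slow display call can be sketched like this. it is only an illustration in Python, not the actual test window code: the function name truncate_for_display, the 2000 character limit and the " ..." marker are assumptions, since the post does not say how much text was kept.

```python
def truncate_for_display(text, limit=2000, marker=" ..."):
    """Return text unchanged if it fits, else cut it to exactly `limit` chars.

    The limit and marker are arbitrary illustrative values.
    """
    if limit < len(marker):
        raise ValueError("limit must be at least as long as the marker")
    if len(text) <= limit:  # text of exactly `limit` characters still fits
        return text
    return text[:limit - len(marker)] + marker


# Hypothetical long result list, standing in for real search output.
results = "\n".join(f"result {i}" for i in range(100000))
shown = truncate_for_display(results)  # this is what would go to the UI
```

the search result itself stays complete in memory; only the copy handed to the display is shortened, so a slow text control no longer makes the search look slow.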

Posted: Sat Sep 24, 2005 10:06 am
by alex
well, i resumed work only today :(

it seems the solution is several working days ahead, so i'm trying to maximize the number of working days dedicated to the task, it's just that the mind wants to go here and there :)

i'll try not to allow more diversions before having a working prototype.

i think there is also a solution which could increase the number of files from 7 million to maybe 20-30 million; it will just run slower, but maybe fast enough (maybe not). in any case it won't be included in the initial solution. with a computer with 4GB of memory i think it would be adequate (i think even 2GB would be ok for 14M files with that change).
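to put these capacity figures in perspective, here is a small back-of-the-envelope calculation of the memory budget per indexed file. it assumes the whole of the stated memory is available to the index and that "GB" means 2^30 bytes, neither of which is stated in the posts, so the results are upper bounds on the budget, not measurements of the engine.

```python
GIB = 1024 ** 3  # assumption: "GB" in the posts is read as 2**30 bytes


def bytes_per_file(memory_gib, files):
    """Upper bound on the memory budget per indexed file, in bytes."""
    return memory_gib * GIB / files


# (memory in GB, number of files) pairs taken from the figures in the thread
cases = [(2, 7_000_000), (2, 14_000_000), (4, 20_000_000), (4, 30_000_000)]
for memory_gib, files in cases:
    print(f"{memory_gib} GB / {files // 1_000_000}M files: "
          f"about {bytes_per_file(memory_gib, files):.0f} bytes per file")
```

the 7M files on 2GB figure works out to roughly 300 bytes per file at most, and the larger file counts mean squeezing each file into about half of that or less.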

Posted: Sun Sep 25, 2005 9:45 am
by alex
the work has normalized; i'm just pushing it gradually so as not to kill the proper mood and not to stop completely until all is done.

currently i'm working first on the interaction between the newsreader and the indexing server (the server can supply actual data although not everything on the server side is implemented; it's just that the most time consuming work is now on the client side). probably i'll have the listing of results in the tab view tomorrow, then i need to implement importing results into import groups (it is easier) and then finish with the server.

Posted: Mon Sep 26, 2005 5:22 pm
by alex
the tab view is now bearable (the rest, column sort and more status icons, can be added later), so tomorrow i'll try to add the message-id retrieval part (already started) so there will be a functional client, and then i'll switch to the server. i'll finish the client side refinements during tests or while working on the server, for a change.

Posted: Tue Sep 27, 2005 6:22 pm
by alex
the message-id retrieval part is finished, although it seems the server doesn't always return data properly, but these are small things which should be fixed tomorrow; then i'll probably work on the remaining server issues.

Posted: Wed Sep 28, 2005 5:16 pm
by alex
the small problems mentioned above are resolved, and the client side is satisfactory as well; it can be left as it is at least for the first release, and a few visual feedback aspects can be improved later.

what is left is on the server side: getting headers (now it takes headers from files), related and a few other issues.

Posted: Mon Oct 10, 2005 8:22 pm
by alex
ok, most of the coding has been completed; i have a short to do list left and am already running the server here in the meantime.

it seems what we have is a solution where everyone can run usenet indexing in a very simple way for himself and for other users as well. i'll try to finish and publish the program most probably by the middle of October, which is when it will be clear what we've really got.

since luckily the whole actual work on the topic is comparable with e.g. adding the compact binary newsgroup type in UE and took only about 4-5 weeks (but i took quite a lot of code from UE), most probably i'll publish the server as a free to use companion application (in conjunction with UE itself).