Feature request : multithreaded header decode and filtering

SimonC
Posts: 2
Joined: Tue Nov 30, 2010 8:40 am

Post by SimonC »

Hi Alex!

Firstly, I know that these might be very large requests, so sorry about that! m(_ _)m

On my machine I can download headers at about 20 megabytes (not megabits) per second thanks to compression, but UE can only cope with about 5 MB/s before it stops responding. (It doesn't crash; it just stops reacting to mouse clicks and so on.) As I have some very large newsgroups in my list, and as I can't schedule regular header downloads, I have to wait quite a while with CPU usage locked at about 25% (quad core, so roughly one core fully busy) before I can look at what has been downloaded. Do you have any plans to multithread the header retrieval and kill code? I would imagine that all that would be needed is a buffer/allocator thread and (core-count or core-count - 1) threads for killing/processing headers, but I'm aware that might be quite a big ask. :) (Maybe it needs a reconstructor thread afterwards to pack the processed headers into whatever format UE uses?)
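The layout suggested above (split the incoming headers, filter the pieces on several worker threads, then reassemble them in order) could be sketched roughly as follows. This is only an illustration, not UE's code: the header dictionaries, `kill_words`, `chunk_size`, and every function name are made up for the example, and a plain substring check stands in for real kill filters. In CPython the global interpreter lock would also prevent a real CPU speedup for this kind of work, so the sketch shows the structure only.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def is_killed(header, kill_words):
    """Hypothetical kill rule: drop a header if its subject contains a kill word."""
    subject = header["subject"].lower()
    return any(word in subject for word in kill_words)


def filter_chunk(chunk, kill_words):
    """Work done by one worker: keep only the headers that are not killed."""
    return [h for h in chunk if not is_killed(h, kill_words)]


def process_headers(headers, kill_words, chunk_size=1000, workers=None):
    """Split headers into chunks, filter them on a pool, and reassemble in order."""
    if workers is None:
        # "core-count - 1" workers, but never fewer than one.
        workers = max(1, (os.cpu_count() or 2) - 1)
    chunks = [headers[i:i + chunk_size] for i in range(0, len(headers), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the order of the input chunks, so the
        # "reconstructor" step is just a flattening of the filtered chunks.
        results = pool.map(lambda c: filter_chunk(c, kill_words), chunks)
        return [h for part in results for h in part]
```

Calling `process_headers(headers, ["spam"])` would return the surviving headers in their original order.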

For the second part, I find that filters work quite quickly (just seconds even in large groups, thanks partly to kill features and retention limiting), but multithreading might be good there as well.

Anyway, just wondering!

Simon
alex
Posts: 4514
Joined: Thu Feb 27, 2003 5:57 pm

Re: Feature request : multithreaded header decode and filter

Post by alex »

I had more intensive multithreading in newspro, but because of crosspost tracking and the GUI, most of the work was effectively done in a single thread because of locks. At the same time, multithreading made development more complicated, and when newsgroup volume exploded it was one of the main reasons I had to resort to a full rewrite (very expensive in development time) rather than changing the existing code.

I think I have seen higher speeds. When you download headers, try to keep the newsgroup views closed (the ones being updated); updating newsgroup views takes a significant toll on the CPU (sorting new headers into the views).

You may also try adjusting the update rate and freeing rate in edit menu->properties->tasks, header tasks. A less frequent update rate (for newsgroup views) means less CPU usage. A less frequent freeing rate takes more RAM but saves CPU, because fewer headers are loaded/unloaded (freeing is like purging a cache).

Another serious problem when downloading large newsgroups is spam.

There are systematic spammers (most likely someone pays them, since it has been constant activity for many months or maybe several years). The spam pattern is relatively small single messages with random subjects. The volume is sometimes something like 50K-200K messages a day. If you try to download all headers in a seriously spammed large newsgroup with high retention, the number of header entries becomes huge, it is not possible to keep them in the database in a reasonable amount of RAM, and sorting etc. gets slower. If you scroll through such a newsgroup, you see large clusters of those spam posts.

The trick in all databases is effective compression, which is possible only when the data is well organized. If there is a lot of random junk, it will take up most of the RAM, to say nothing of the CPU it takes to handle a sorted list of several million entries.
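The point about organized data versus random junk can be illustrated with a small experiment. The sketch below uses zlib purely as a stand-in (nothing here says how UE's database actually stores headers), and the subject formats and function names are invented for the example. It compares how well a run of regular, multipart-style subjects compresses against the same number of random subjects.

```python
import random
import string
import zlib


def compressed_ratio(subjects):
    """Return compressed size divided by raw size for a list of subject lines."""
    raw = "\n".join(subjects).encode("utf-8")
    return len(zlib.compress(raw)) / len(raw)


def structured_subjects(n):
    """Hypothetical well-organized subjects: one multipart post, parts 1..n."""
    return [
        f"Example.Post.Name [{i:03d}/{n:03d}] - file.part{i:03d}.rar"
        for i in range(1, n + 1)
    ]


def random_subjects(n, length=40, seed=0):
    """Hypothetical spam-like subjects: random letters and digits."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits
    return ["".join(rng.choice(alphabet) for _ in range(length)) for _ in range(n)]


if __name__ == "__main__":
    print("structured:", compressed_ratio(structured_subjects(500)))
    print("random:    ", compressed_ratio(random_subjects(500)))
```

The structured list should compress to a much smaller fraction of its raw size than the random one, which is the effect described above: random subjects leave a compressor (or any scheme that relies on shared structure) very little to work with.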

A known spam gateway that has been very active recently is the provider http://www.hitnews.eu

To be exact, the UE code is multithreaded (when it downloads headers there is one thread per task); what is single-threaded is inclusion into the database and updating newsgroup views. It is also possible that more preparation work could be done to make integration of new headers faster; I need to check for the exact bottlenecks. Purging the cache and updating newsgroup views are as fast as they can be. For integration into the database hashes there is some room to optimize further by shifting more work to other threads. I'm not sure it is a significant percentage, though; maybe I'll try to double-check that.
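The arrangement described here (one thread per header task, a single thread integrating into the database, with preparation work pushed onto the task threads) might look roughly like the sketch below. It is an assumption-laden illustration rather than UE's design: the queue, the dictionary used as a "database", the normalization step, and all names are hypothetical.

```python
import queue
import threading


def header_task(task_id, raw_headers, out_queue):
    """One thread per task: prepare a batch off the integration thread."""
    # Preparation that needs no shared state: normalize and pre-sort the keys.
    prepared = sorted((h.strip().lower(), task_id) for h in raw_headers)
    out_queue.put(prepared)


def integrate(out_queue, n_tasks, database):
    """Single integrator: the only code that touches the shared database."""
    for _ in range(n_tasks):
        batch = out_queue.get()
        for key, task_id in batch:
            # A header already seen (e.g. a crosspost) is not inserted again.
            database.setdefault(key, task_id)


def run(tasks):
    """Start one thread per task and integrate their batches on the caller's thread."""
    out_queue = queue.Queue()
    database = {}
    threads = [
        threading.Thread(target=header_task, args=(i, headers, out_queue))
        for i, headers in enumerate(tasks)
    ]
    for t in threads:
        t.start()
    integrate(out_queue, len(tasks), database)
    for t in threads:
        t.join()
    return database
```

Because only `integrate` writes to `database`, no locks are needed around it; the trade-off is that integration remains a serial step, so any gain comes from how much preparation can be moved into `header_task`.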
SimonC
Posts: 2
Joined: Tue Nov 30, 2010 8:40 am

Re: Feature request : multithreaded header decode and filter

Post by SimonC »

Hi again Alex, and thanks for the quick reply!
When I read that each header task had a thread of its own, I thought that I might have made a bit of a blunder. :oops: Increasing the number of header tasks to 4 (instead of 1, which is what I had it at) and increasing the freeing/update intervals to their maximum has helped a bit, so it goes up to about 30% CPU now. (I never have the newsgroups open when I'm getting headers.) Anyway, I'm guessing the bottleneck for me is in the kill filters or the database integration code, because I have a lot of kill filters. I'll try merging multiple filters and post my results if they help, but I don't think the difference will be significant.
Thanks,
Simon
Maccara
Posts: 35
Joined: Thu Nov 06, 2008 12:18 pm
Location: Finland

Re: Feature request : multithreaded header decode and filter

Post by Maccara »

Simon, heavy kill filters really bog down the CPU (at least mine; I only have an AMD X2). I know, because I have multiple nested filters (they tend to grow in complexity quite fast).

I solved "the problem" myself by applying the filters manually. It seems to work nicely that way, and given how human perception works, it "seems snappier" overall.

Also, merging filters does help a lot. I've been entertaining the idea of writing a boolean wildmat simplifier for the filters, but haven't gotten around to it.
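As a rough idea of what merging filters can mean, the sketch below folds several glob-style patterns into one compiled regular expression so that each subject is tested once instead of once per filter. It is not a wildmat simplifier and does not reflect UE's filter syntax: Python's `fnmatch` globs are used as an approximation of wildmat, and the pattern list and function names are invented for the example.

```python
import fnmatch
import re


def merge_patterns(patterns):
    """Combine glob-style patterns into a single case-insensitive regex."""
    # fnmatch.translate turns a glob such as "*spam*" into a regex string;
    # each translated pattern is wrapped in a group and joined with "|".
    parts = [f"(?:{fnmatch.translate(p)})" for p in patterns]
    return re.compile("|".join(parts), re.IGNORECASE)


def is_killed(subject, merged):
    """True if the subject matches any of the merged patterns."""
    return merged.match(subject) is not None


if __name__ == "__main__":
    merged = merge_patterns(["*viagra*", "make money*"])
    for subject in ["Cheap VIAGRA here", "Make money fast", "Linux ISO"]:
        print(subject, "->", is_killed(subject, merged))
```

Whether this helps in practice depends on how the filters are evaluated; a real simplifier would also need to handle negation and nesting, which this sketch ignores.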