
Boolean Wildmat problem

Posted: Sat Aug 19, 2006 6:30 am
by Greg_G
I've been using the same boolean wildmat for par detection for a long time now, but it's no longer working. I noticed in the dev forum that Alex made some changes to the syntax in 1.7, but I'm having trouble adjusting to them.

The original working string was this:
(\.[pP][0-9][0-9]|vol*.par2)&^(\.[pP]art[0-9]*\.[^pP])&^(\(0/)&^(\(00/)&^(\(000/)

Essentially, it starts with a broad search, which gets a lot of false positives, and then removes the false positives with all those &^() sections.
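To make the "broad match AND NOT exclusions" structure concrete, here is a minimal Python sketch that approximates the original string with regular expressions. It is not Usenet Explorer's matcher: the regex translations, the assumption of case-insensitive substring matching, and the names BROAD, EXCLUDE and looks_like_par are all illustrative guesses.

```python
import re

# Approximate regex equivalents of the wildmat pieces (an assumption,
# not Usenet Explorer's actual matcher). Substring search, ignoring case.
BROAD = re.compile(r"\.p[0-9][0-9]|vol.*\.par2", re.IGNORECASE)

# Each entry plays the role of one &^( ... ) section.
EXCLUDE = [
    re.compile(r"\.part[0-9].*\.[^p]", re.IGNORECASE),  # .partN ... .<non-p>
    re.compile(r"\(0/"),    # subjects containing "(0/"
    re.compile(r"\(00/"),   # subjects containing "(00/"
    re.compile(r"\(000/"),  # subjects containing "(000/"
]


def looks_like_par(subject: str) -> bool:
    """Return True if the broad pattern hits and no exclusion does."""
    if not BROAD.search(subject):
        return False
    return not any(rx.search(subject) for rx in EXCLUDE)
```

With these assumptions, a subject such as "backup.vol003+04.par2 (1/5)" passes, "archive.p01 (0/3)" is rejected by the "(0/" exclusion, and a plain index file like "backup.par2 (1/1)" never matches the broad part at all.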

I tried replacing [] with {}, as per that post, but that did not help. I know I can turn off case sensitivity and replace [pP] with p, but I never bothered since "if it ain't broke, don't fix it".

It seems like it starts to fail when I have ()|(). For example, this very simple (useless) testing string gives no results:
(vol*.par2)|(par)
while the two individual components work fine by themselves.

This string used to work perfectly (well it didn't include the index .par2, but that's how I wanted it). Any suggestions? I haven't come up with anything that even comes close to working. I suspect there's just something about the new syntax I'm not grasping.

Thanks,
Greg

Edit: I should mention that I use this via the Filter Editor/Filter Droplist.

Re: Boolean Wildmat problem

Posted: Sat Aug 19, 2006 8:08 am
by Josef K
Greg_G wrote:It seems like it starts to fail when I have ()|(). For example, this very simple (useless) testing string gives no results: (vol*.par2)|(par) while the two individual components work fine by themselves.
I've noticed myself that the OR operator appears to be a bit flaky, even with simple queries. I mainly use mine for the search service, but I've been getting hit-and-miss results since Alex changed the format slightly (using brackets or braces also seems to make no difference to my results).

Posted: Sun Aug 20, 2006 4:03 pm
by alex
in this particular example \. should be replaced with . since . is not a special character (see the list of changes http://www.netwu.com/ue/UE.txt).

but i also see now that expressions with parentheses were affected: in addition i tried to distinguish the cases where ) is a special character from those where it is not, and i introduced a bug in one line of code.

i've changed it here and it is in effect for the indexing server, but as to usenet explorer itself until the next release you can use this version which i've just compiled:

http://www.netwu.com/ue/ue1701.zip

so your pattern should be then:

(.{pP}{0-9}{0-9}|vol*.par2)&^(.{pP}art{0-9}*.{^pP})&^(\(0/)&^(\(00/)&^(\(000/)

but additionally, as to this particular pattern {pP} may be replaced with {p} and {^pP} with {^p} since in Usenet Explorer search patterns are case insensitive, so it also may be replaced with:

(.p{0-9}{0-9}|vol*.par2)&^(.part{0-9}*.{^p})&^(\(0/)&^(\(00/)&^(\(000/)

also in principle parentheses may be omitted in most places:

(.p{0-9}{0-9}|vol*.par2)&^.part{0-9}*.{^p}&^\(0/&^\(00/&^\(000/

if you see any problems with the version above let me know (unlikely since the main work was adding the whole word special character and i checked the more important code very thoroughly before the 1.7 release), but before all pay attention to those changes listed in UE.txt:

[] were replaced with {} since square brackets are too common in real world subjects

" special character added to help in matching whole words

\ only turns off the special meaning of the special characters and not every character i.e.

\x - if x is a boolean wildmat special character, i.e. one of ? * { } " ( ) ^ & | \, then \ turns off the special meaning of x and matches it directly; otherwise \ is not interpreted as a special character. \ is also not special inside curly brackets.
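A rough Python sketch of the escape rule as worded above may help when deciding where a backslash is still needed. It is only an illustration: tokenize and SPECIAL are made-up names, this is not Usenet Explorer's code, and the handling inside {...} (only } treated as special there, set syntax such as ^ and - not modelled) is an assumption.

```python
# The special characters listed in UE.txt: ? * { } " ( ) ^ & | \
SPECIAL = set('?*{}"()^&|\\')


def tokenize(pattern: str):
    """Return (char, is_special) pairs following the 1.7 escape rule."""
    tokens = []
    in_braces = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if in_braces:
            # Inside {...} the backslash is an ordinary character.
            # Assumption: only '}' is special here and closes the set.
            if ch == '}':
                in_braces = False
                tokens.append((ch, True))
            else:
                tokens.append((ch, False))
            i += 1
        elif ch == '\\' and i + 1 < len(pattern) and pattern[i + 1] in SPECIAL:
            # \x with x special: x loses its special meaning.
            tokens.append((pattern[i + 1], False))
            i += 2
        elif ch == '\\':
            # \ before a non-special character is just a backslash.
            tokens.append((ch, False))
            i += 1
        else:
            if ch == '{':
                in_braces = True
            tokens.append((ch, ch in SPECIAL))
            i += 1
    return tokens
```

Under this reading, \( yields a literal parenthesis, while \. yields a literal backslash followed by a dot, which is why \. from older patterns has to become a plain dot.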

Posted: Sun Aug 20, 2006 6:57 pm
by Greg_G
As always, Alex, you're the best. Soon as I get time (a day or two...as I'm in the middle of a project for a client) I'll give that build a check and see how things work out. I'll post back here once I have.

And if I run into problems I'll be sure to check the UE.txt first. ;)

Thanks so much,
Greg

Edit: I tested it out and it's working perfectly. Thanks again!

Posted: Wed Sep 06, 2006 1:19 pm
by alex
i found another glitch myself in the implementation of the " whole word special character.

if we look for "red"*"flower" it is natural to have "red flower" matches included as well but in v1.7 it tries literally to match both " " surrounding the star to one or more spaces as in the description instead of the intuitive match.
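The difference between the two readings can be sketched with ordinary regular expressions in Python. This is only an approximation under stated assumptions: just the two " characters around the star are modelled for the v1.7 reading, \b stands in for "whole word", and the names LITERAL_V17 and INTUITIVE are illustrative.

```python
import re

# v1.7 reading as described: each " next to the star wants its own
# run of one or more spaces, so two separate runs are required.
LITERAL_V17 = re.compile(r"red\s+.*\s+flower", re.IGNORECASE)

# Intuitive reading: " only marks a word boundary around red and flower.
INTUITIVE = re.compile(r"\bred\b.*\bflower\b", re.IGNORECASE)


def matches(rx: re.Pattern, subject: str) -> bool:
    """Substring-style match, as a search box would do."""
    return rx.search(subject) is not None
```

With a single space between the words, "red flower" fails the literal reading because there is no second run of spaces left to consume, while the boundary-based reading accepts it. Both readings accept "red rose and flower".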

i changed the implementation on the server, for the client it will be available in the next version.

i was thinking to change \ back to always work as a special character but maybe i'll leave it as it is in the meantime, since it appears that copying subject to search is needed more frequently than the need to use \ as special character and remember which characters are special to follow it.

Posted: Wed Sep 06, 2006 5:32 pm
by Greg_G
alex wrote:if we look for "red"*"flower" it is natural to have "red flower" matches included as well but in v1.7 it tries literally to match both " " surrounding the star to one or more spaces as in the description instead of the intuitive match.
Aye, in my opinion "red flower" should definitely trigger for "red"*"flower".
alex wrote:i was thinking to change \ back to always work as a special character but maybe i'll leave it as it is in the meantime, since it appears that copying subject to search is needed more frequently than the need to use \ as special character and remember which characters are special to follow it.
As long as \\ (double backslash) acts as a normal backslash, I personally consider this a more traditional implementation and would agree with the change back. But since I'm used to things like C/C++ strings and regex it seems natural to me. To someone used to basic wildmats I suppose it's not. Tough call.

Edit: Another thing to think about with the second issue (and I'm sure you've already considered this) is along the same lines as your change from [] to {}. While I tend to like sticking with old standards like [], I have to admit switching to {} was brilliant, since I can now paste file names containing square brackets, which are very common, into the quick finder. But I admit I don't see the backslash very often in subjects and (obviously) never in file names. You have the stats on character usage, though, and perhaps the backslash is used more than I've noticed. Again, it's a tough call, and I find myself ambivalent about it.