View Full Version : Desktop Search Engines
squian
04-09-2005, 04:32 PM
Inspired by vm's thread on Google Desktop Search (http://www.freethought-forum.com/forum/showthread.php?t=2292), I decided to try the competing MSN Desktop Search. People at work said it was much better for indexing Outlook than GDS. Since Outlook is our corporate standard and finding emails is my biggest problem, I thought I'd try the MS one. But I would like to keep up with experiences with any desktop engine, in general.
I think the following comments apply equally to MSNDS and GDS:
Non-Windows Support
And I don't like it. Not because of the privacy issues, google's always been discreet with that, but because for the first time in their history they've made a product that screams "fuck you non-windows world, you need not apply".
Not surprising that MSN's wouldn't work elsewhere. Are there any MacOSX or Linux desktop search engines? With the efficiency of *nix filesystems, are they even necessary?
Caching can be intrusive
It's the right here privacy that gets me. I have some password protected documents (not just personal stuff, also work things) - and GDS caches the entire contents of every document you edit. Not good if someone's working with you and sees the secret personnel review.
Also it caused problems with removable drives like thumb drives - it kept hold of them so the OS always thought a process was accessing them.
Basically to fix these issues you have to restrict it to specific drive letters and move anything you don't want it to fiddle with elsewhere. It's not low-configuration, it's high-interference.
Plus it assured me that it would not use more than 4GB for its index. 4GB! I don't want to sacrifice that much. Plus in several days of use it did not build enough index to make the searches actually useful. I don't use it any more.
Interference with anti-virus
As I kicked off the indexing in MSN, I noticed it would run over a file or two and then say something about going idle because of system activity. Checking Task Manager, I noticed that McAfee was intermittently taking up all the CPU resources. I think what is happening is that MSN touches a zip file and McAfee thinks it needs to scan all the contents, which causes the indexing to pause. Shutting down McAfee allowed MSN to index much faster. Unfortunately, I cannot permanently turn off McAfee or change its settings (damn secure desktops).
Comparison
Anyone try different ones and have a preference?
Corona688
04-10-2005, 01:31 AM
Not surprising that MSN's wouldn't work elsewhere. Are there any MacOSX or Linux desktop search engines? With the efficiency of *nix filesystems, are they even necessary?
There's a Linux desktop search engine available, known as Beagle (http://www.gnome.org/projects/beagle/). I daresay it looks functional enough should I decide I want something like that.
As for the efficiency of filesystems, that's not really an issue. You can have a hyperefficient filesystem and still have a hard time finding what you want, though certain filesystems can be searched faster than others (Reiser4 and BeFS are probably the fastest for searching -- BeFS is practically a full-out relational database).
UNIX systems in general come with plenty of tools to help you look for things, though they're console tools, not generally GUI ones. locate in particular is close to a desktop search, except it caches filenames only. If I wanted to look for the file 'world-takeover-plan.html' I would just type 'locate world-takeover-plan' and it would search its internal database (updated automatically at 24-hour intervals) and print the names and locations of files whose names contain world-takeover-plan. It's fast and pretty darn efficient, but doesn't cache file contents -- only file names.
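For anyone curious what a filename-only cache along these lines might look like, here is a minimal Python sketch. It is not how locate is actually implemented; build_name_index and find_by_name are made-up names for illustration, the "database" is just an in-memory list, and it matches on the file name only, as described above.

```python
import os

def build_name_index(root):
    """Walk the tree once and remember every file path (names only, no contents)."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)

def find_by_name(index, fragment):
    """Return the cached paths whose file name contains the fragment."""
    return [p for p in index if fragment in os.path.basename(p)]

if __name__ == "__main__":
    # Build the cache once (the slow part), then answer lookups from memory.
    index = build_name_index(".")
    for path in find_by_name(index, "world-takeover-plan"):
        print(path)
```

The point of the design is that the slow directory walk happens once, on a schedule, and every lookup afterwards only scans the cached list. The price is that the list can be stale until the next rebuild.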
If I wanted to look through file contents, that would mean checking every single potential file in a brute-force manner, which is certainly possible, like so:
egrep -H 'SODIUM IS THE KEY' *.html
That would search through all HTML files in the current directory and print every line that contains that key phrase, prefixed with the name of its file. egrep can do more complex things than simple text matches as well; it takes regular expressions, so it can do stuff like wildcards, too.
This isn't particularly efficient, though. No matter how fast your drive is and how efficient your filesystem is, that still doesn't get past the fact that it has to examine every single HTML file in its entirety in order to check if it has anything on my sodium-based world-domination plan in it.
That's what all this caching stuff is about. Some sort of index is needed in order to do an efficient search, so it can search only through the index rather than brute-forcing every damn file. There's no standard way to do this under UNIX that I know of... When Reiser4 becomes more commonly used I'm hoping it'll help rectify this.
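To make the index idea concrete, here is a small Python sketch of an inverted index: a map from each word to the documents that contain it. It is only an illustration under simple assumptions (in-memory documents, plain word matching, made-up function names) and says nothing about how GDS, MSNDS or Beagle actually store their indexes.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in tokenize(text):
            index[word].add(name)
    return index

def search(index, query):
    """Return documents containing every word of the query (not necessarily adjacent)."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical sample documents, standing in for files on disk.
docs = {
    "plan.html": "Sodium is the key to the whole plan",
    "shopping.txt": "salt, pepper, a new key for the shed",
}
index = build_index(docs)
print(sorted(search(index, "sodium is the key")))  # prints ['plan.html']
```

A search now touches only the index entries for the query words, not every file. Note that this version only checks that all the words are present; matching an exact phrase would also need word positions in the index.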
I suppose UNIX's more organized file structure helps keep things sorted out without it, since I don't often have problems. It doesn't throw drive letters around randomly; it moves the drives around to where you want them in your file tree instead of vice versa. Your /home directories could be on a completely different drive, for example, and it's the OS's job to connect that directory to that drive instead of you having to remember 'OK, home directories are on q:\windows\long unfriendly directory name\<japanese characters>\downstairs\by_torchlight\behind-locked-door\beware-of-the-leopard'.
Macintosh OSX is another kettle of fish. It's UNIX, yes, but Apple has imposed a file structure on it completely different from the BSD it's derived from. It does come with an application called 'Finder', but I'm not sure if that's really a desktop search or just its normal file-browsing application. I'll ask the macophile I know about it.
Now where'd I put that sodium?
Corona688
04-10-2005, 02:33 AM
OK, I've talked to mr macophile. Says he, Macintosh OSX already had pretty good search functionality through a program called "Sherlock", as well as filename searches in Finder. There's also Spotlight coming up in OSX 10.4, which'll unify and improve these.
viscousmemories
04-10-2005, 02:57 AM
Here's something I just posted in the GDS thread:
I wasn't too concerned about that since I never use password protected documents, but another thing I just noticed is that since everything you do with your browser is cached and indexed by GDS, that means data you have stored on password protected sites (like FF private messages) are cached, indexed and searchable by GDS. Ugh.
Corona688
04-10-2005, 04:16 AM
Here's something I just posted in the GDS thread:
I wasn't too concerned about that since I never use password protected documents, but another thing I just noticed is that since everything you do with your browser is cached and indexed by GDS, that means data you have stored on password protected sites (like FF private messages) are cached, indexed and searchable by GDS. Ugh.
Y'know, I don't really think GDS keeps a copy of everything. It might keep some sort of fingerprint or list of keywords, but the whole thing? That'd kind of defeat the point of a search engine.
squian
04-10-2005, 06:06 PM
Corona688,
I see your point about "brute force" vs indexed searching on Unix. I guess that I was thinking that even brute force searching on Unix works better than on Windows. Maybe even well enough that indexing is not so important.
Moreover, one aspect of these desktop search engines is data unification -- emails, web pages, files, chat logs, and so on. Since the Unix approach is "everything is a file", it is easier, even if using brute force, to find everything with the same mechanism -- grep (and variants). Hence, I was thinking this might diminish the need for desktop search on a Unix box.
I wasn't too concerned about that since I never use password protected documents, but another thing I just noticed is that since everything you do with your browser is cached and indexed by GDS, that means data you have stored on password protected sites (like FF private messages) are cached, indexed and searchable by GDS. Ugh.
As Corona688 says, indexed and searchable by GDS, yes, but I doubt cached. That's your browser's job. Even then, most times the cache is in a location that can only be accessed by the user -- it's not public. I'm not sure about GDS, but the MSNDS is similar -- the index files are stored in my user directory so they are not public either. IMHO, there is no reason to believe security is an issue with desktop search engines unless the file systems are not secure or the browser and search engine are actively misconfigured.
Corona688
04-10-2005, 06:52 PM
Corona688,
I see your point about "brute force" vs indexed searching on Unix. I guess that I was thinking that even brute force searching on Unix works better than on Windows. Maybe even well enough that indexing is not so important.
Well, it depends on the amount of data being searched. Even for data on the order of hundreds of megs it's still possible to do a brute-force search and have it finish in a reasonable amount of time. But search through 100x as much data and it will take at least 100x as long, and really thrash your hard drive while it's doing it.
Moreover, one aspect of these desktop search engines is data unification -- emails, web pages, files, chat logs, and so on. Since the Unix approach is "everything is a file", it is easier, even if using brute force, to find everything with the same mechanism -- grep (and variants). Hence, I was thinking this might diminish the need for desktop search on a Unix box.
Sometimes yes, sometimes no. It is true that UNIX systems don't have a "registry" the way Windows does, so nothing fancy is needed to search through config files. Grep has no knowledge of file formats other than text though, or character sets other than ASCII. Things like HTML tags can also mung up the search -- slap <em> tags around one word in a phrase and suddenly the phrase doesn't match anymore. Data unification is really good to have... I wonder if there's any project to tackle this from the command line.
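The <em> problem is easy to reproduce. The Python sketch below uses the standard library's html.parser to pull out only the text between tags before matching; the sample page, the TextExtractor class and the visible_text helper are invented for illustration, and a real desktop search tool would need a converter like this for every file format it handles.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text between tags, dropping the tags themselves."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def visible_text(html):
    """Return the document text with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())

page = "<p>SODIUM IS <em>THE</em> KEY</p>"
phrase = "SODIUM IS THE KEY"

print(phrase in page)                # False: the tags split the phrase
print(phrase in visible_text(page))  # True: matched against the text only
```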
Here's something I just posted in the GDS thread:
I wasn't too concerned about that since I never use password protected documents, but another thing I just noticed is that since everything you do with your browser is cached and indexed by GDS, that means data you have stored on password protected sites (like FF private messages) are cached, indexed and searchable by GDS. Ugh.
Y'know, I don't really think GDS keeps a copy of everything. It might keep some sort of fingerprint or list of keywords, but the whole thing? That'd kind of defeat the point of a search engine.
It keeps enough of a copy to return the whole of a deleted or inaccessible email or document. Ergo, it effectively caches the whole thing.
Corona688
04-11-2005, 03:26 PM
It keeps enough of a copy to return the whole of a deleted or inaccessible email or document. Ergo, it effectively caches the whole thing.
That is pretty damn insidious. I can't think of any legitimate reason for them to do that.
I think the reasoning is (a) that the original web Google does the same thing, with acres of servers caching the whole WWW several times over, and (b) that you can retrieve documents without having to log into mail, open the application, and so on, even if you've deleted them (I think they call this a benefit).
squian
04-23-2005, 01:41 AM
So I've been using MSN Desktop Search for a while now, and every day it seems to try to index around 50,000 items. That seems way too high for the number of files that have changed in a day's work. I'm going to see if it will settle down after a while, but if it wants to eat that much CPU time every day, I'm going to rip it out.