Page cache??

Postby mick » Thu Mar 04, 2010 4:59 pm

Is there some way I can see the complete content of the "pages" YaCy has crawled/indexed? The pages YaCy sees do not always appear to be the same as the pages a browser would see at a given URL, so I'd like to get to the bottom of why that is...
mick
 
Posts: 100
Joined: Fri Aug 14, 2009 9:40 pm

Re: Page cache??

Postby Orbiter » Fri Mar 05, 2010 12:17 pm

If you do a search, then under every search result there is a line with the 'info' link. Just click that link and you get a view of the parsed content of the page that is linked to the search result. There you can choose to view either the parsed text or a list of the links that the parser found in the page.
Orbiter
 
Posts: 47
Joined: Thu Jun 28, 2007 7:39 am

Re: Page cache??

Postby Orbiter » Fri Mar 05, 2010 3:48 pm

You will like this one: SVN 6713 (use the updater) includes an enhancement of the built-in
http://localhost:8080/ViewFile.html
servlet, which can now be used to enter any URL and analyse it. The servlet also shows you whether the URL is in the internal URL database and/or in the URL content cache.
Orbiter
 
Posts: 47
Joined: Thu Jun 28, 2007 7:39 am

Re: Page cache??

Postby mick » Fri Mar 05, 2010 7:46 pm

When I do VIEW ORIGINAL on the info page, is this the actual page YaCy gets? Or just the page as grabbed by the browser I'm using? Because the links on the page are not the same as the links in the VIEW LINKS info. For example, the link (http://www.xxxx.com/bbs/index.php/board,22.0.html) in the browser is, for YaCy, (http://www.xxxx.com/bbs/index.php?board=22.0)

EDITED: OK, I realized that isn't the original view, but that's the rub... I'd like to see what the page YaCy saw actually looked like.

I'm trying to figure out what the hell is going on, but not getting very far. I think maybe the forum requires a session in order to generate the SEF (search engine friendly) URLs, but it seems very short-sighted to rely on a search engine to have cookies enabled or anything like that. I can't figure out what's going on, and the SMF people aren't flocking to my support topics :(

It's not WAP related, but it might be that YaCy is seeing a printer-friendly styled page. I've no clue at this point...


PS: I guess I will try to set up a fetch of the page with the user agent that Low recommended in another thread.
mick
 
Posts: 100
Joined: Fri Aug 14, 2009 9:40 pm

Re: Page cache??

Postby Orbiter » Fri Mar 05, 2010 9:56 pm

What ViewFile.html shows you is what YaCy actually sees. There may be a difference from what your browser shows because YaCy requests the page in a stateless way: no cookies are submitted. That could make a big difference.
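
To check whether cookies explain the difference, one option is to fetch the page once without cookies and once with a cookie jar, then compare the links found in each response. The following is only a minimal sketch using the Python standard library, not anything YaCy ships: the names `LinkCollector`, `make_opener`, `fetch`, `compare_links` and the user agent string are made up for illustration.

```python
from html.parser import HTMLParser
from http.cookiejar import CookieJar
import urllib.request


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(value)


def extract_links(html):
    """Return the set of link targets found in an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def make_opener(with_cookies):
    """Build a URL opener; cookies are only stored and resent if requested."""
    handlers = []
    if with_cookies:
        handlers.append(urllib.request.HTTPCookieProcessor(CookieJar()))
    return urllib.request.build_opener(*handlers)


def fetch(opener, url, user_agent="link-compare-sketch"):
    """Fetch a URL with the given opener and return the decoded body."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with opener.open(request, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def compare_links(stateless_html, session_html):
    """Return (links only in the stateless page, links only in the session page)."""
    stateless = extract_links(stateless_html)
    session = extract_links(session_html)
    return sorted(stateless - session), sorted(session - stateless)
```

For the session view, call `fetch` twice with the same opener from `make_opener(True)`, so that the second request carries any cookie set by the first, and pass that second response to `compare_links` together with a response fetched through `make_opener(False)`. If the two lists are non-empty, the server is generating different links depending on the session.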
Orbiter
 
Posts: 47
Joined: Thu Jun 28, 2007 7:39 am

Re: Page cache??

Postby mick » Sat Mar 06, 2010 4:34 pm

I realize this, I think I was just unconsciously complaining about the ORIGINAL VIEW specifically, which just shows you the url opened in an iframe, rather than a cache of the page like you can find on Google links... which might be a useful feature or not.
mick
 
Posts: 100
Joined: Fri Aug 14, 2009 9:40 pm

Re: Page cache??

Postby Orbiter » Sat Mar 06, 2010 11:44 pm

You are right, that is confusing. I have therefore added another option that actually loads the content of the iframe from the cache. This is available in SVN 6719; you can get it from the updater (non-Debian), and an updated Debian version will certainly follow in the next few days.
Orbiter
 
Posts: 47
Joined: Thu Jun 28, 2007 7:39 am

Re: Page cache??

Postby mick » Wed Mar 10, 2010 7:57 pm

Off-topic: do you have statistics about how people are using YaCy? And how do you expect people to be using it? Is it designed primarily for servers (probably Linux), big or small? Or is it P2P in the sense that the primary target is all the Windows clients scattered around the world? And if so, is YaCy bothered by client-like behavior, i.e. powering off and operating with a low priority?

PS: I'm really sorry... I hate to bother a (or the?) principal author with such basic questions. I would make a dedicated thread so maybe more people would see it, but there's not really a board dedicated to these sorts of questions regarding YaCy (or YaCy* ?)
mick
 
Posts: 100
Joined: Fri Aug 14, 2009 9:40 pm

Re: Page cache??

Postby Low012 » Sun Jun 20, 2010 10:59 pm

mick wrote:Offtopic: do you have statistics about how people are using YaCy? And how do you expect people to be using it? Is it designed primarily for servers (probably Linux) big/small? Or is it P2P in the sense the primary target is all the Windows clients scattered around the world? And if so is YaCy bothered by client like behavior, ie. powering off and operating with a low priority?

We don't gather any statistics, but from looking at the number of peers which are usually online, I think that about 10% of the peers in the "freeworld" network are (semi-)dedicated machines which run 24/7, and the rest come and go with uptimes from a few minutes to several hours.

YaCy was originally designed to work as a caching proxy which also scrapes data from the documents which are loaded via the proxy. The idea back then was that as many people as possible were to install YaCy on their desktop computers. Then the crawler was added, and recently new use cases like intranet search were made possible in response to repeated requests. Operating with a low priority should be no problem as long as you don't expect low RAM/CPU/disk usage and high crawling performance at the same time. ;) The larger your index grows, the longer it will take to start YaCy, but shutdown is usually pretty fast. Killing YaCy might corrupt the database, even though it has become pretty robust over the years.

edit: No operating system is preferred. Large parts of YaCy were developed on Apple computers, but it should run equally well on Windows and Linux. I also know that there are (or have been?) peers running on Solaris.
Low012
 
Posts: 266
Joined: Thu Jun 28, 2007 8:55 am
Location: Germany

