About filtering unwanted websites

Questions concerning installation and usage of YaCy.

Postby chunbogbog » Wed Dec 19, 2007 2:08 pm

I've built my web database, but I want to filter out some sites that I don't want, e.g. adult sites.

I know that YaCy has Blacklist Administration, but the entries I tried don't work. For example, to ban http://www.sex.com and http://www.sexintro.net I added *.sex.*/.* and *.*+"sex"+*.*/* but neither has any effect. So which entries should I add to the blacklist to ban these two sites?

If you have keywords that broaden the filtering of adult/gross websites, or an imported blacklist, please share it with me; I would like to see an example. Thank you.

Lastly, this is my first search engine built with YaCy, please take a look:
http://www.custatportal.com:8080
I'd be very pleased if you left me some comments. ^^
chunbogbog
 
Posts: 11
Joined: Tue Oct 30, 2007 4:10 am

Postby miTreD » Wed Dec 19, 2007 2:39 pm

The default blacklist engine doesn't support regular expressions. I'm in a hurry, so I'll just give you the two entries for your example:
*.sex.com/.*
*.sexintro.net/.*
I'll give you some more info about lulabad's regexp blacklist engine in the next few days.
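
To see why these two entries can match while the earlier attempts do not, here is a minimal Python sketch of how an entry of the form host/path could be checked. It is not YaCy's code: the function names are made up, and it assumes that the host part only understands a leading `*.` wildcard and that the path part `.*` means "any path" (handled here with Python's `re` module purely for illustration).

```python
import re


def split_entry(entry):
    """Split a blacklist entry such as '*.sex.com/.*' into host and path parts."""
    host_pattern, _, path_pattern = entry.partition("/")
    return host_pattern, path_pattern


def host_matches(host_pattern, host):
    """Assumption: only a leading '*.' wildcard is understood in the host part."""
    if host_pattern.startswith("*."):
        suffix = host_pattern[2:]
        return host == suffix or host.endswith("." + suffix)
    return host == host_pattern


def is_blacklisted(entry, host, path):
    """Return True if host and path are both covered by the entry."""
    host_pattern, path_pattern = split_entry(entry)
    if not host_matches(host_pattern, host):
        return False
    # Illustration only: the path part is matched with Python's re module.
    return re.fullmatch(path_pattern, path) is not None


print(is_blacklisted("*.sex.com/.*", "www.sex.com", "index.html"))
print(is_blacklisted("*.sexintro.net/.*", "www.sexintro.net", ""))
print(is_blacklisted("*.sex.*/.*", "www.sex.com", "index.html"))
```

Under these assumptions the first two calls print True and the third prints False, because `sex.*` is taken literally as a host suffix.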
miTreD
 
Posts: 88
Joined: Sat Sep 01, 2007 11:49 am
Location: /home

Postby chunbogbog » Fri Dec 21, 2007 3:21 pm

Is there any way to automatically block websites that contain certain words?

Postby Orbiter » Sun Dec 23, 2007 1:39 am

Not directly with the blacklist, but there is another blocking function called the 'bluelist': a simple file named 'yacy.blue' in the home directory of YaCy, with one word per line.

Every word in the bluelist has three effects: no text containing that word is indexed, the word is excluded from the search words of a query, and results containing the word are excluded as well.
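
A minimal Python sketch of the word-list behaviour described above, not YaCy's implementation. The function names are hypothetical, and it assumes whitespace tokenization and case-insensitive comparison; "casino" is just a made-up second entry.

```python
def load_word_list(lines):
    """One word per line; blank lines are skipped, case is ignored."""
    return {line.strip().lower() for line in lines if line.strip()}


def contains_listed_word(text, words):
    """True if any whitespace-separated token of text is in the word list."""
    return any(token.lower() in words for token in text.split())


def clean_query(query, words):
    """Drop listed words from a search query."""
    return " ".join(t for t in query.split() if t.lower() not in words)


bluelist = load_word_list(["sex", "", "casino"])
documents = ["Free casino bonus", "YaCy peer setup guide"]

# Texts containing a listed word would not be indexed.
indexable = [d for d in documents if not contains_listed_word(d, bluelist)]
print(indexable)

# Listed words would be removed from the query itself.
print(clean_query("sex yacy tutorial", bluelist))
```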
Orbiter
 
Posts: 51
Joined: Thu Jun 28, 2007 7:39 am

Postby chunbogbog » Sun Dec 23, 2007 10:19 am

I don't see any yacy.blue anywhere in the YaCy folder, but there is a "yacy.yellow" file. Is it the same thing?
If yes, how do I use it? Do I simply put domain names on the list and YaCy will block them?


Here's another question: on the search result page there is text like "YBR-16", "YBR-13"... What is it?
At first I thought it was a ranking score, but the results don't seem to be sorted by this value.

Postby Low012 » Sun Dec 23, 2007 2:20 pm

YBR means YaCy Block Rank. It is one factor that determines how a page is ranked in the results, but there are several more. You can get an overview and change the weight of each factor on http://localhost:8080/Ranking_p.html. The way it is set right now is just a suggestion by Orbiter. If you find settings that you feel give better results, or if you can think of other things that should be considered, don't hesitate to share them with us.
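
YaCy's actual ranking formula is not given in this thread, so the following is only a generic weighted-sum illustration of why results need not be sorted by YBR alone. The factor name `words_in_title`, the weights, and the idea of treating the number after "YBR-" as a plain score are all assumptions.

```python
def score(factors, weights):
    """Weighted sum of ranking factors; factors without a weight count as 0."""
    return sum(weights.get(name, 0) * value for name, value in factors.items())


# Hypothetical weights and factor values, for illustration only.
weights = {"ybr": 1, "words_in_title": 4}
results = {
    "page-a": {"ybr": 16, "words_in_title": 0},
    "page-b": {"ybr": 13, "words_in_title": 2},
}

ranked = sorted(results, key=lambda url: score(results[url], weights), reverse=True)
print(ranked)
```

Here page-b (YBR 13) ends up ahead of page-a (YBR 16) because the second factor outweighs the difference in YBR.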

I have never used the bluelist so far, but I think you just have to create a new file called yacy.blue in the home directory of YaCy. yacy.yellow will block domains that contain a word in the list. If you want to block http://www.sex.com and http://www.sexintro.net just add sex to the list, but it will also block http://www.nosexuntilmarriage.com and http://www.responsibleteenagersdonthavesex.org. Using the blacklist probably leads to less collateral damage but means more work for the person who maintains the list.
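
The collateral-damage point applies to any filter that matches a word as a substring of the host name, whichever list implements it. A small Python sketch (the function name and www.example.org are made up for illustration):

```python
def blocked_by_substring(host, words):
    """True if any listed word occurs anywhere inside the host name."""
    return any(word in host.lower() for word in words)


hosts = [
    "www.sex.com",
    "www.sexintro.net",
    "www.nosexuntilmarriage.com",
    "www.responsibleteenagersdonthavesex.org",
    "www.example.org",
]

for host in hosts:
    print(host, blocked_by_substring(host, ["sex"]))
```

Only www.example.org gets through; the two unintended hosts are blocked together with the two intended ones.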
Low012
 
Posts: 368
Joined: Thu Jun 28, 2007 8:55 am
Location: Germany

Postby chunbogbog » Sun Dec 23, 2007 2:51 pm

Thank you!! ^^
But how does YaCy Block Rank work?
When does a site get YBR-16, and when does it get YBR-13?


I also tried using both yacy.blue and yacy.yellow by adding the word "sex" to both of them, but it has no effect: the results still show 5000 sites that contain the word "sex".
i'm really confused now.

Postby Low012 » Sun Dec 23, 2007 3:41 pm

I'm sorry, I told you something wrong about the yellowlist. I took a look at the German YaCy wiki and it should be like this: the yellowlist does not block anything. If you access pages through the proxy, YaCy will access them sending its own user agent string. Some websites analyze the user agent and send different pages depending on the browser. The yellowlist lets you define sites to which YaCy will not send its own user agent; for those it uses the browser's user agent instead.
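
The yellowlist behaviour described here could be pictured like this in Python. It is a sketch, not YaCy's code: the placeholder user agent string, the function name, and the assumption that yellowlist entries are matched as substrings of the host name are all hypothetical.

```python
# Hypothetical placeholder; the real YaCy user agent string is not given in this thread.
YACY_USER_AGENT = "yacy-proxy (placeholder)"


def choose_user_agent(host, browser_user_agent, yellowlist):
    """Return the browser's own user agent for yellowlisted hosts, YaCy's otherwise."""
    # Assumption: an entry matches if it occurs anywhere in the host name.
    if any(word in host.lower() for word in yellowlist):
        return browser_user_agent
    return YACY_USER_AGENT


print(choose_user_agent("www.example.com", "Mozilla/5.0", ["example"]))
print(choose_user_agent("www.other.org", "Mozilla/5.0", ["example"]))
```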

The bluelist also does not block a page if you load it through the proxy. Instead, in pages loaded through the proxy, it replaces words that are in the bluelist with "XXXXXXX". I just tried it: I created the file, added "sex", restarted YaCy, and it worked.

It did not work on http://www.kuh.at/ though. I'm not sure why.
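
The replacement described above can be sketched in a few lines of Python. This is not YaCy's code; the function name is made up, and case-insensitive substring matching is an assumption (the thread does not say whether whole words or substrings are replaced).

```python
import re


def mask_words(text, words, mask="XXXXXXX"):
    """Replace every occurrence of a listed word with the mask string."""
    if not words:
        return text
    # Escape each word so it is matched literally, ignoring case.
    pattern = re.compile("|".join(re.escape(w) for w in words), re.IGNORECASE)
    return pattern.sub(mask, text)


print(mask_words("Sex and the City", ["sex"]))  # prints: XXXXXXX and the City
```

With substring matching, a word like "Sussex" would also be affected, which is worth keeping in mind when choosing entries.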

Block Rank is a simplified version of Google's PageRank, which Google also uses because calculating it takes considerably less time. A high YBR is rated better than a low YBR.

Postby chunbogbog » Sun Dec 23, 2007 7:11 pm

So I think going back to the blacklist is the better way. ^^

Here's one last (or not ^^;; ) question.
I want to modify the Ranking Configuration,
but I don't understand the algorithm YaCy uses to rank results.

Could you please explain the formula, or just give a few small examples? I'll try to figure out the rest by myself.

Postby Low012 » Sun Dec 23, 2007 10:44 pm

That's Orbiter's domain. I think he should explain that before I tell you something wrong again.

Postby chunbogbog » Tue Dec 25, 2007 8:25 am

OK, I'll wait for Mr. Orbiter. :)

Postby miTreD » Wed Dec 26, 2007 1:16 pm

chunbogbog wrote: So I think going back to the blacklist is the better way. ^^
Yes :-)
If you want to use regular expressions (http://en.wikipedia.org/wiki/Regexp) in the blacklist filter, you have to do the following:

Stop your peer.

Download lulabad's advanced blacklist http://www.yacystats.de/yacydownload/advancedBlacklist-0.3.jar and place it in $YACYROOT/libx/

Now open $YACYROOT/DATA/SETTINGS/httpProxy.conf with your favorite editor and search for BlackLists.class. This property has to be changed to BlackLists.class=de.lulabad.blacklist.regexURLPattern

Start your peer.

You're done :-)
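
If you would rather script the config change than use an editor, the property edit could look like this minimal Python sketch. The helper name is made up, the sample text and `some.default.Engine` are placeholders rather than real YaCy defaults, and it assumes the file uses plain `key=value` lines.

```python
def set_property(config_text, key, value):
    """Return config_text with `key=...` replaced by `key=value` (appended if absent)."""
    lines = config_text.splitlines()
    for i, line in enumerate(lines):
        # Compare only the part before the first '=' with the wanted key.
        if line.split("=", 1)[0].strip() == key:
            lines[i] = f"{key}={value}"
            break
    else:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


# Placeholder content; a real httpProxy.conf has many more properties.
sample = "port=8080\nBlackLists.class=some.default.Engine\n"
print(set_property(sample, "BlackLists.class", "de.lulabad.blacklist.regexURLPattern"))
```

To apply it, read $YACYROOT/DATA/SETTINGS/httpProxy.conf while the peer is stopped, pass its text through the function, and write the result back.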
Now patterns like .*={1}[a-fA-F0-9]{32}.*/.* will work.
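
The core of that example pattern can be tried out in Python. The thread does not describe how the regexp engine splits an entry around the "/", so this sketch only tests the middle part, which looks like it targets URLs carrying 32-digit hexadecimal IDs such as session IDs (that reading, and the example URLs, are assumptions).

```python
import re

# Core of the pattern from the post: '=' followed by 32 hexadecimal digits.
SESSION_ID = re.compile(r"={1}[a-fA-F0-9]{32}")

with_id = "www.example.com/index.php?sid=0123456789abcdef0123456789ABCDEF"
without_id = "www.example.com/index.php?page=42"

print(bool(SESSION_ID.search(with_id)))     # True
print(bool(SESSION_ID.search(without_id)))  # False
```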

Postby Carlos_Pfitzner » Wed Dec 26, 2007 1:19 pm

Look here too (I didn't know how to block unknown words or unknown domains),
so I started using an antivirus :lol:
http://www.yacy-forum.org/viewtopic.php?p=378#378

PS: using YaCy's standard blacklist is easy (just type the pattern).
*I feel a description of the patterns is missing,
but anyway, do this:

Click on the admin screen.
Enter your password to log in.
Then click "filter & black list".

Type the pattern you want to add,
e.g. to block http://cheapa.de
your pattern will be
*.cheapa.de/.*
Then mark what you want to block.

It may be one of them, or you can mark all of the following:
proxy
crawler
dht
search
surftips
news



Then click the Save button.

PS: I believe you will need to restart YaCy for this to take effect.

Happy new year
Carlos_Pfitzner
 
Posts: 175
Joined: Fri Jun 29, 2007 12:45 pm
Location: Rio de Janeiro, Brazil, South America

Postby Carlos_Pfitzner » Wed Dec 26, 2007 2:08 pm

miTreD wrote: I'll give you some more info about lulabad's regexp blacklist engine in the next few days.

I want more info too :wink:
Using a tool that I don't understand is worse than using none.

TIA
Happy new year

Postby chunbogbog » Wed Dec 26, 2007 4:46 pm

miTreD wrote:
chunbogbog wrote: So I think going back to the blacklist is the better way. ^^
Yes :-)
If you want to use regular expressions (http://en.wikipedia.org/wiki/Regexp) in the blacklist filter, you have to do the following:

Stop your peer.

Download lulabad's advanced blacklist http://www.yacystats.de/yacydownload/advancedBlacklist-0.3.jar and place it in $YACYROOT/libx/

Now open $YACYROOT/DATA/SETTINGS/httpProxy.conf with your favorite editor and search for BlackLists.class. This property has to be changed to BlackLists.class=de.lulabad.blacklist.regexURLPattern

Start your peer.

You're done :-)
Now patterns like .*={1}[a-fA-F0-9]{32}.*/.* will work.


THAT'S REAL help!!
Thank you. :oops: :oops:

Now the last thing left for me is the ranking in YaCy.

Oops... I almost forgot.
Merry X'mas and Happy New Year!!
