HighDots Forums  

SEO technology for Copyright Patrol?

Search Engine Optimization Discussion about SEO/Search Engine Optimization (alt.internet.search-engines)


#1
catherine yronwode
 
Posts: n/a

Default SEO technology for Copyright Patrol? - 02-24-2006 , 10:16 PM






Does anyone know of either a software package or a subscription service
that employs search engine technology to roam the net looking for
samples of copyright infringement / plagiarism?

The way i envision it, the search engine bot would be given the client's
domain name, then run around comparing random snips from each of the
client's pages with results at, say, google. When it finds a match, it
reports back with a daily log listing all files that duplicate portions
or the entirety of a client's files.
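
A minimal Python sketch of the "random snips" step described above. The function names, the ten-word snippet length, and the snippet count are assumptions for illustration; no existing tool is being described, and submitting the queries to a search engine is left out.

```python
import random
import re

def sample_snippets(page_text, words_per_snippet=10, count=3, seed=None):
    """Pick random runs of consecutive words from a page's text."""
    words = re.findall(r"\S+", page_text)
    if len(words) <= words_per_snippet:
        # Page too short to sample from: use the whole text, if any.
        return [" ".join(words)] if words else []
    rng = random.Random(seed)
    last_start = len(words) - words_per_snippet
    starts = rng.sample(range(last_start + 1), min(count, last_start + 1))
    return [" ".join(words[s:s + words_per_snippet]) for s in sorted(starts)]

def as_exact_phrase_query(snippet):
    """Wrap a snippet in double quotes for an exact-phrase search."""
    return '"' + snippet.replace('"', "") + '"'
```

Each returned snippet would become one exact-phrase query, so the number of queries per page equals `count`.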

At this point what is sold could be just the search engine bot software
(to tech-oriented clients) or a subscription service (for business
clients without an interest in tech matters).

The SEO bot would run X number of pages per day, and check the site
through once, or it could go back and re-work the same site on a
continual, ongoing subscription basis.

The subscription service would also do a whois lookup and find the email
and street addresses of the domain owner and the domain host. It then
(hand-supervised, probably) would auto-generate legal letters of
complaint to the domain contact(s) and the contacts for the isp hosting
the site. This information would go into a weekly log. As a service, it
could be programmed to auto-generate the exact forms requested by the
major isps, such as yahoo. It would also presumably continually revisit
pages that it found had been infringed until the infringement was
terminated.

Thinking farther ahead, if a partnership with google were made, google
could agree to de-list sites that did not comply with the legal complaint
procedures (e.g. those bootleg Rumanian sites). (I do not want to get off
onto a tangent about google's own copyright infringement issues; i know
about them and i hope and trust that they will be resolved. This is just
an idea, that's all, so please do not turn it into an excuse for
google-bashing. Thanks.)

I would pay a yearly fee for such a service.

Does it exist?

If not, why not? (And can those restrictions be overcome?)

cat yronwode
http://www.luckymojo.com/blues.html
Blues Lyrics and Hoodoo

#2
Roy Schestowitz
 
Posts: n/a

Default Re: SEO technology for Copyright Patrol? - 02-25-2006 , 01:44 AM






__/ [ catherine yronwode ] on Saturday 25 February 2006 03:16 \__

Quote:
Does anyone know of either a software package or a subscription service
that employs search engine technology to roam the net looking for
samples of copyright infringement / plagiarism?

Try:

http://copyscape.com/

I wrote about it last month, among other methods:

http://schestowitz.com/Weblog/archiv...og-plagiarism/


Quote:
The way i envision it, the search engine bot would be given the client's
domain name, then run around comparing random snips from each of the
client's pages with results at, say, google. When it finds a match, it
reports back with a daily log listing all files that duplicate portions
or the entirety of a client's files.

Yes, that is probably how copyscape works. It automates the analogous yet
more laborious process that a human would otherwise undertake. Some
lecturers used to use Google to detect plagiarism. Submission in
electronic form has its merits.


Quote:
At this point what is sold could be just the search engine bot software
(to tech-oriented clients) or a subscription service (for business
clients without an interest in tech matters).

The SEO bot would run X number of pages per day, and check the site
through once, or it could go back and re-work the same site on a
continual, ongoing subscription basis.

The subscription service would also do a whois lookup and find the email
and street addresses of the domain owner and the domain host. It then
(hand-supervised, probably) would auto-generate legal letters of
complaint to the domain contact(s) and the contacts for the isp hosting
the site. This information would go into a weekly log. As a service, it
could be programmed to auto-generate the exact forms requested by the
major isps, such as yahoo. It would also presumably continually revisit
pages that it found had been infringed until the infringement was
terminated.

That's a lot of automated traffic, which can raise many concerns. What, for
example, will you do when the offending site copies in part or attributes
the source using a link? This needs careful attention and judgment by a
human, preferably the victim. Also, imagine the load on abuse@isp . net.


Quote:
Thinking farther ahead, if a partnership with google were made, google
could agree to de-list sites that did not comply with the legal complaint
procedures (e.g. those bootleg Rumanian sites). (I do not want to get off
onto a tangent about google's own copyright infringement issues; i know
about them and i hope and trust that they will be resolved. This is just
an idea, that's all, so please do not turn it into an excuse for
google-bashing. Thanks.)

Why just Google? *smile* It promotes monoculture.


Quote:
I would pay a yearly fee for such a service.

Does it exist?

I doubt it.


Quote:
If not, why not? (And can those restrictions be overcome?)

Such a tool would need to hammer a search engine quite heavily. How would the
search engine feel about it, and what does the search engine stand to gain?


Quote:
cat yronwode
http://www.luckymojo.com/blues.html
Blues Lyrics and Hoodoo

With friendly regards,

Roy

--
Roy S. Schestowitz
http://Schestowitz.com | SuSE Linux | PGP-Key: 0x74572E8E
6:30am up 7 days 18:49, 9 users, load average: 1.15, 1.04, 1.00
http://iuron.com - help build a non-profit search engine


#3
Fritz M
 
Posts: n/a

Default Re: SEO technology for Copyright Patrol? - 02-25-2006 , 09:29 PM



catherine yronwode wrote:

Quote:
Thinking farther ahead, if a partnership with google were made, google
could agree to de-list sites that did not comply with the legal complaint
procedures (e.g. those bootleg Rumanian sites).
Google already removes from its index what it thinks is duplicate
content. Unfortunately, they're about as likely to delist the original
as they are the scraper page.

RFM



#4
canadafred
 
Posts: n/a

Default Re: SEO technology for Copyright Patrol? - 02-25-2006 , 10:10 PM



"Fritz M" <nospam (AT) masoner (DOT) net> wrote

Quote:
catherine yronwode wrote:

Thinking farther ahead, if a partnership with google were made, google
could agree to de-list sites that did not comply with the legal complaint
procedures (e.g. those bootleg Rumanian sites).

Google already removes from its index what it thinks is duplicate
content. Unfortunately, they're about as likely to delist the original
as they are the scraper page.

RFM
I don't think Google implies that when it instructs webmasters here:
http://www.google.com/webmasters/guidelines.html

"Quality Guidelines - Specific recommendations:
....
Don't create multiple pages, subdomains, or domains with substantially
duplicate content.
...." Google

I wouldn't say it "already removes from its index" in most circumstances. As I
see it, Google compares one page with the other and tries to determine which
has the better content and which is more original. The one it considers the
better of the two is delivered to the results pages as if it were any other
web page. Often one page performs normally while the duplicate or very
similar pages sit deeper in the SERPs, though it may take an excavator to
find them.

It is frequently a natural part of a project's development for it to
transition from one domain to another. The whole process is expected to take
some time, and some migrations need to be reconstructed piece by piece on
the other domain.

Sometimes the recreation can require more than one domain feeding it through
its development. Similar or duplicate content gets shuffled into its new
home, where it may take some time to merge it into the new framework. In
natural circumstances, one of those two similar or duplicate pages should be
able to perform as any other web page, without penalty.

We are talking about two pages that are in a naturally duplicated or very
similar state. 2+ is another discussion.

--

Fred canadian_web (AT) hotmail (DOT) com
Ethical SEO Tips, Tools and Resources
www.rezultz-web-site-promotion.com




#5
catherine yronwode
 
Posts: n/a

Default Re: SEO technology for Copyright Patrol? - 02-28-2006 , 12:38 AM



Roy Schestowitz wrote:
Quote:
__/ [ catherine yronwode ] on Saturday 25 February 2006 03:16 \__

Does anyone know of either a software package or a subscription service
that employs search engine technology to roam the net looking for
samples of copyright infringement / plagiarism?

Try:

http://copyscape.com/

I wrote about it last month, among other methods:

http://schestowitz.com/Weblog/archiv...og-plagiarism/

The way i envision it, the search engine bot would be given the client's
domain name, then run around comparing random snips from each of the
client's pages with results at, say, google. When it finds a match, it
reports back with a daily log listing all files that duplicate portions
or the entirety of a client's files.

Yes, that is probably how copyscape works. It automates the analogous yet
more laborious process that a human otherwise undertakes. Some of the
lecturers used to be using Google to detect plagiarism. Submission in
electronic form has its merits.

At this point what is sold could be just the search engine bot software
(to tech-oriented clients) or a subscription service (for business
clients without an interest in tech matters).

The SEO bot would run X number of pages per day, and check the site
through once, or it could go back and re-work the same site on a
continual, ongoing subscription basis.

The subscription service would also do a whois lookup and find the email
and street addresses of the domain owner and the domain host. It then
(hand-supervised, probably) would auto-generate legal letters of
complaint to the domain contact(s) and the contacts for the isp hosting
the site. This information would go into a weekly log. As a service, it
could be programmed to auto-generate the exact forms requested by the
major isps, such as yahoo. It would also presumably continually revisit
pages that it found had been infringed until the infringement was
terminated.

That's a lot of automated traffic, which can raise many concerns. What,
for example, will you do when the offending site copies in part or
attributes the source using a link? This needs careful attention and
judgment by a human, preferably the victim. Also, imagine the load on
abuse@isp . net.
I am not talking about unasked for links to a site, or garbled scraping,
merely direct, unauthorized copying of whole articles / major portions
of articles (expressed as a percentage, e.g. "URL xyz contains 67% text
identical to your URL jkl.")
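
A rough Python sketch of how such a percentage could be computed, using overlapping five-word runs ("shingles"). This is one possible measure among several, chosen here as an assumption; the function names and the run length are hypothetical. It reports the share of the original page that reappears in the suspect page, and it would also count shared quotations as matches, so a human review of each hit is still needed.

```python
import re

def shingles(text, size=5):
    """Return the set of overlapping runs of `size` consecutive words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def percent_copied(original, suspect, size=5):
    """Share of the original's word runs that also occur in the suspect."""
    source = shingles(original, size)
    if not source:
        return 0.0
    return 100.0 * len(source & shingles(suspect, size)) / len(source)
```

Swapping the arguments gives the other reading of the figure: how much of the suspect page consists of text taken from the original.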

As for attention by a human, i would expect a design that offered me the
option to send automated cease and desist letters (customizable) to the
domain owner / tech rep and isp host copyright rep. Why would abuse@isp
. net become involved?

Quote:
Thinking farther ahead, if a partnership with google were made, google
could agree to de-list sites that did not comply with the legal
complaint procedures (e.g. those bootleg Rumanian sites). (I do not want
to get off onto a tangent about google's own copyright infringement
issues; i know about them and i hope and trust that they will be
resolved. This is just an idea, that's all, so please do not turn it
into an excuse for google-bashing. Thanks.)

Why just Google? *smile* It promotes monoculture.
Because they are sharp, good at what they do, and abjure evil.

Quote:
I would pay a yearly fee for such a service.

Does it exist?

I doubt it.
Could you design and market it?

Quote:
If not, why not? (And can those restrictions be overcome?)

Such a tool would need to hammer a search engine quite heavily. How would
the search engine feel about it and what does the search engine have to
earn?
Well, a large (e.g. google) search engine could charge money for
the service.

Or, perhaps you could design a search engine to handle it in a way that
does not hammer google.

For instance, my field is occultism / religion / spirituality folklore.
I supply your bot with keyword terms -- say 250 of them -- from my site.
Your bot goes to google and collects the URLs for all sites ranking in
the top 200 for all those terms.

Then i submit my domain name to your bot. Your bot takes 1 page at a
time from my domain and searches all cached URLs it had retrieved
earlier from google. It then moves to my next page and repeats the
search.

That way it does not hammer google, but builds a database from customer
keywords.

Your bot also updates its cache from google's top 200 once a month and i
can change the keywords i want it to cache as well, on its next update.
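
The URL-collection step above could be sketched in Python roughly as follows. `top_results` is a hypothetical callable standing in for whatever search interface would actually be permitted; it is an assumption, not a real API. The point of the sketch is only that URLs returned for several keywords are stored once.

```python
def build_url_cache(keywords, top_results):
    """Map each distinct result URL to the keywords that surfaced it.

    keywords: iterable of keyword strings supplied by the client.
    top_results: hypothetical function taking a keyword and returning
                 a list of result URLs (e.g. the top 200).
    """
    cache = {}
    for keyword in keywords:
        for url in top_results(keyword):
            # setdefault stores a URL only once, however many keywords return it.
            cache.setdefault(url, []).append(keyword)
    return cache
```

The size of the returned dictionary is the number of distinct pages the service would then have to fetch and store for that client.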

How about that?

cat yronwode
http://www.luckymojo.com/blues.html
Blues Lyrics and Hoodoo


#6
Roy Schestowitz
 
Posts: n/a

Default Re: SEO technology for Copyright Patrol? - 02-28-2006 , 02:08 AM



__/ [ catherine yronwode ] on Tuesday 28 February 2006 05:38 \__

Quote:
Roy Schestowitz wrote:

__/ [ catherine yronwode ] on Saturday 25 February 2006 03:16 \__

Does anyone know of either a software package or a subscription service
that employs search engine technology to roam the net looking for
samples of copyright infringement / plagiarism?

snip /


The subscription service would also do a whois lookup and find the email
and street addresses of the domain owner and the domain host. It then
(hand-supervised, probably) would auto-generate legal letters of
complaint to the domain contact(s) and the contacts for the isp hosting
the site. This information would go into a weekly log. As a service, it
could be programmed to auto-generate the exact forms requested by the
major isps, such as yahoo. It would also presumably continually revisit
pages that it found had been infringed until the infringement was
terminated.

That's a lot of automated traffic, which can raise many concerns. What,
for example, will you do when the offending site copies in part or
attributes the source using a link? This needs careful attention and
judgment by a human, preferably the victim. Also, imagine the load on
abuse@isp . net.

I am not talking about unasked for links to a site, or garbled scraping,
merely direct, unauthorized copying of whole articles / major portions
of articles (expressed as a percentage, e.g. "URL xyz contains 67% text
identical to your URL jkl.")

You cannot quantify such things easily, just as you cannot easily merge two
pieces of similar text. Try, for example, to merge two 'forks' of a text
which have been worked on by different individuals. To use a familiar
example, have you ever edited some older version of a text that you worked
on, only to discover *later* that you had been working on an out-of-date
version? This is not the case with code, which has strict syntax, as it can
often be merged (CVS-like tool), much like isolated paragraphs in text,
which benefit from tools like 'diff'. Been there, (colleagues) done that.

The point of my babbling is that you can never measure such things reliably,
let alone know their meaning. Statistics have their flaws. What if in your
text you cited (and linked to) an article and then provided some long quote?
A second site could do the same and unintentionally come to resemble your
content. The issue of copyrights and intellectual property suffers
tremendously nowadays. Bear in mind that apart from Google Groups, there are
at least half a dozen Web sites that copy the *entire* content of this
newsgroup, making it public.


Quote:
As for attention by a human, i would expect a design that offered me the
option to send automated cease and desist letters (customizable) to the
domain owner / tech rep and isp host copyright rep. Why would abuse@isp
. net become involved?

I was referring to the people employed by ISPs to deal with abuse reports. If
they began to receive automated mail, there would be no limit on their
workload. This would also cast a shadow on abuse reports which are
submitted manually.


Quote:
Thinking farther ahead, if a partnership with google were made, google
could agree to de-list sites that did not comply with the legal
complaint procedures (e.g. those bootleg Rumanian sites). (I do not want
to get off onto a tangent about google's own copyright infringement
issues; i know about them and i hope and trust that they will be
resolved. This is just an idea, that's all, so please do not turn it
into an excuse for google-bashing. Thanks.)

Why just Google? *smile* It promotes monoculture.

Because they are sharp, good at what they do, and abjure evil.

I have got my own thoughts on that last point. George Bush said he invaded
Iraq to save us all from WoMD. Everyone believed him at the time. Google
have done some evil things since they began boasting of that mythical mantra
in the previous decade. Some of their actions were financially motivated.


Quote:
I would pay a yearly fee for such a service.

Does it exist?

I doubt it.

Could you design and market it?

*smile* I am not a businessman.


Quote:
If not, why not? (And can those restrictions be overcome?)

Such a tool would need to hammer a search engine quite heavily. How would
the search engine feel about it and what does the search engine have to
earn?

Well, a large (e.g. google) search engine could charge money for
the service.

This raises further questions. If that was the case:

-Could Google benefit from permitting plagiarism nests to exist?

-Would people truly waste and invest money in fighting evil?

This reminds me of the idea of pay-per-E-mail as a means of preventing spam.


Quote:
Or, perhaps you could design a search engine to handle it in a way that
does not hammer google.

For instance, my field is occultism / religion / spirituality folklore.
I supply your bot with keyword terms -- say 250 of them -- from my site.
Your bot goes to google and collects the URLs for all sites ranking in
the top 200 for all those terms.

Then i submit my domain name to your bot. Your bot takes 1 page at a
time from my domain and searches all cached URLs it had retrieved
earlier from google. It then moves to my next page and repeats the
search.

*smile* You got greedy.

,----[ Quote ]
| Your bot takes 1 page at a time from my
| domain and searches all cached URLs
`----

The practicality of search engines is based on the fact that you index sites
off-line. You can't just go linearly searching for duplicates. The least you
can do is find pointers to potential culprits by using the indices. I guess
I have missed your point though. If you are talking about surveying and
analysing top pages for a given search phrase, how far should you go? There
are infinitely many search phrases.
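
To illustrate the difference between a linear search and "finding pointers by using the indices", here is a small Python sketch of an inverted index over five-word runs. It is a toy under stated assumptions: the names, the run length, and the in-memory dictionary are all hypothetical, and a real search engine index is far more elaborate.

```python
import re
from collections import defaultdict

def word_runs(text, size=5):
    """Return the set of overlapping runs of `size` consecutive words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

def build_index(pages, size=5):
    """pages: dict of URL -> page text. Returns word run -> set of URLs."""
    index = defaultdict(set)
    for url, text in pages.items():
        for run in word_runs(text, size):
            index[run].add(url)
    return index

def candidates(index, my_text, size=5):
    """Count, per indexed URL, how many of my word runs it shares."""
    hits = defaultdict(int)
    for run in word_runs(my_text, size):
        for url in index.get(run, ()):
            hits[url] += 1
    return dict(hits)
```

The index is built once, off-line; each later check costs one lookup per word run of the client's page, regardless of how many pages were indexed. Pages with a high hit count are only pointers to potential culprits and still need inspection.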


Quote:
That way it does not hammer google, but builds a database from customer
keywords.

Your bot also updates its cache from google's top 200 once a month and i
can change the keywords i want it to cache as well, on its next update.

That leaves gaps for misuse. Any control that is given to the user over
indexing, keywords and the like is bound to break. This must be the reason
why search engines ignore meta data and will never have second thoughts.


Quote:
How about that?

cat yronwode
http://www.luckymojo.com/blues.html
Blues Lyrics and Hoodoo

With kind regards,

Roy

--
Roy S. Schestowitz | "Quote when replying in non-real-time dialogues"
http://Schestowitz.com | SuSE Linux | PGP-Key: 0x74572E8E
6:45am up 1 day 2:56, 8 users, load average: 0.42, 0.71, 0.62
http://iuron.com - help build a non-profit search engine


#7
catherine yronwode
 
Posts: n/a

Default Re: SEO technology for Copyright Patrol? - 02-28-2006 , 05:52 PM



Roy Schestowitz wrote:
Quote:
[snip]

Quote:
I am not talking about unasked for links to a site, or garbled scraping,
merely direct, unauthorized copying of whole articles / major portions
of articles (expressed as a percentage, e.g. "URL xyz contains 67% text
identical to your URL jkl.")

You cannot quantify such things easily, just as you cannot merge two
pieces of similar text. Try, for example, to forge together two 'forks' of
text which have been worked on by different individuals. To use a familiar
example, have you ever mistakenly edited some older version of a text that
you worked on, only to *later* reveal that you had worked on an out-of-
date version? This is not the case with syntactic code, for instance, as
it can often be merged (CVS-like tool), much like isolated paragraphs in
text, which benefit from tools like 'diff'. Been there, (colleagues) done
that.
Okay, i see i was not clear enough. I am looking for a service that
provides a semi-automated version of what i have successfully done by hand.

VERSION ONE -- GOOGLE-BASED

1) I submit my top 250 keywords to your web interface. I also submit one
ten-word sentence fragment (a "check phrase") for each URL i am
protecting. You may set me a limit on the number of pages i can protect for
a given fee. Let's say you allow me 100 pages. My 250 keywords,
100 URLs, and the accompanying 100 check phrases are permanently logged
at your site (but can be changed by an "edit" function). The check
phrase for each URL is MY responsibility to choose and must be way
unique. Like, say (real example):

"contingent of spiritually-inclined folks who will not use common"

which is from
http://www.luckymojo.com/candle,agic.html

2) Your service bot goes to google (or -- see below for VERSION TWO, in
which it does not go to google, but rather to a google-generated
"personal cache") and it searches on the 100 check-phrases. In the real
life example above, my check-phrase turns up 4 matches. Two are at my
own domain (one is a weirdly garbled URL that i have no idea what it's
about, but probably some wacky symbolic link thingie that my husband
screwed around with) and thus are eliminated -- and the other 2 are not
at my domain and thus are potential cases of illegal copyright
infringement, and are logged at your web-based interface so i can view them.

3) The bot does a whois lookup on the two infringing domains -- in this
real-life example:

ausetkmt. com

freewill.tzo. com/~callista

and it logs the data in your web-based interface so i can view it.

4) The bot obtains, from a cache, three copies of a customizable
"friendly" (stage one) complaint letter and drafts them to each domain:
one to the owner, one to the owner's tech contact (in my experience
owners who plagiarize often claim inability to delete files as a reason
to avoid action; this works around their excuse-making), and one to the
domain's isp.

5) The bot generates a web-page based alert, displaying all information
about the infringing sites and notifying me that the draft "friendly"
complaints are ready to be sent.

6) At your web site, i can perform a personal check of the pages --
similar to the Wikipedia "diff" function and displayed the same way
(side by side) -- before i commit to sending the "friendly" complaint
letters or abort the send.

7) If i decide to send the "friendly" complaint letters, this action is
logged and dated and displayed at your web interface for my future
reference.

8) There is a "tickler" function that makes a re-check of any site to
which i have sent a complaint at one-week intervals. This informs me at
the web site whether the infringement is still up.

9) Decision fork:

9A) If the infringing page is gone, it is marked (in red) "Page No
Longer Online" but it stays in the system for access anytime i wish to
re-check my "History" with that domain (or my "History" in general).

9B) If the infringing page is still there, I am offered the option to
send a strongly worded "legal" (stage two) complaint and to print two
hard copies to be sent to the contact addresses for domain owner and
isp. (Subsidiary idea: keep the snail-mail copyright department
addresses of major isps -- and any isps ever contacted by the system --
on file, for they are usually difficult to track down and it would save
the client time having to look them up.)

10, 11, 12) Repeat steps 7, 8, and 9 for the "legal" complaint.

13) If the "legal" complaint generates no response, i am given the
option of sending a fully documented letter (with all relevant date
stamps and so forth from your service's history records) to google
informing them of the infringement and requesting them to de-list the
offending URL (or domain) from their SERPs. (Side-note: if the service
is well-publicized, google will probably agree to honour their
complaints. If three such services exist, they can form an Association
and google will definitely have to deal with them.)

14) This ends the service's responsibilities. For any further actions, i
must hire a lawyer.
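
Step 2 of the list above (discarding matches at the client's own domain and keeping the rest as potential infringements) could look roughly like this in Python. The function names are hypothetical, and the domains in the usage note are the ones mentioned in this post; treating every subdomain as the client's own is an assumption.

```python
from urllib.parse import urlparse

def host_of(url):
    """Return the lower-cased host name of a URL, adding a scheme if missing."""
    if "://" not in url:
        url = "http://" + url
    return (urlparse(url).hostname or "").lower()

def off_domain_matches(result_urls, own_domain):
    """Keep only results hosted outside own_domain and its subdomains."""
    own = own_domain.lower()
    kept = []
    for url in result_urls:
        host = host_of(url)
        if host == own or host.endswith("." + own):
            continue  # the client's own page, not an infringement
        kept.append(url)
    return kept
```

For the example in this post, a result list containing a www.luckymojo.com page, an ausetkmt.com page, and freewill.tzo.com/~callista, checked against "luckymojo.com", would keep only the last two for the whois lookup in step 3.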

Quote:
The point of my babbling is that you can never measure such thing
reliably, let alone know their meaning. Statistics have their flaws. What
if in your text you cited (and linked to) an article and then provided
some long quote? A second site could do the same and unintentionally
assimilate to your content. The issue of copyrights and intellectual
property suffers tremendously nowadays. Bear in mind that apart from
Google Groups, there are at least half a dozen Web sites that copy the
*entire* content of this newsgroup, making it public.
This is all true, but not relevant. I am talking about webmasters who
build sites competing with my site's SERPs by deliberate copyright
infringement of my own copyright protected web pages. See above
scenario.

Quote:
As for attention by a human, i would expect a design that offered me the
option to send automated cease and desist letters (customizable) to the
domain owner / tech rep and isp host copyright rep. Why would abuse@isp
. net become involved?

I was referring to the people employed by ISP to deal with abuse reports.
If they began to receive automated mail, there would be no barrier on the
amount of workload. This would also cast a shadow on abuse reports which
are submitted manually.
The letters would be submitted manually. See above.

[Google and Evil discussion tabled for another thread -- an interesting
subject and one worthy of conversation, but off-topic here.]

Quote:
I would pay a yearly fee for such a service.

Does it exist?

I doubt it.

Could you design and market it?

*smile* I am not a businessman.
Could you design it?

Quote:
If not, why not? (And can those restrictions be overcome?)

Such a tool would need to hammer a search engine quite heavily. How
would the search engine feel about it and what does the search engine
have to earn?

Well, a large (e.g. google) search engine could charge money for
the service.

This raises further questions. If that was the case:

-Could Google benefit from permitting plagiarism nests to exist?
No.

Quote:
-Would people truly waste and invest money in fighting evil?
Authors and businesses invest in fighting copyright and trademark
infringement all the time. I spend many hours per year at the task. A
semi-automated web-based system would save me 100-plus hours per year
and a great deal of frustration. I would pay 250 dollars per year to
subscribe, maybe more. A sliding scale of pricing could allow for
different levels of examination based on varying the number of client
keywords / number of client pages handled.

Quote:
This reminds me of the idea of pay-per-E-mail as means of preventing spam.
I don't see the similarity. I am talking about a web-based service to
which i could subscribe that would allow me to patrol the web for
copyright infringements.

Quote:
Or, perhaps you could design a search engine to handle it in a way that
does not hammer google.

For instance, my field is occultism / religion / spirituality folklore.
I supply your bot with keyword terms -- say 250 of them -- from my site.
Your bot goes to google and collects the URLs for all sites ranking in
the top 200 for all those terms.

Then i submit my domain name to your bot. Your bot takes 1 page at a
time from my domain and searches all cached URLs it had retrieved
earlier from google. It then moves to my next page and repeats the
search.

*smile* You got greedy.

,----[ Quote ]
| Your bot takes 1 page at a time from my
| domain and searches all cached URLs
`----

The practicality of search engines is based on the fact that you index
sites off-line. You can't just go linearly searching for duplicates. The
least you can do is find pointers to potential culprits by using the
indices. I guess I have missed your point though. If you are talking about
surveying and analysing top pages for a given search phrase, how far
should you go? There are infinitely many search phrases.
That is true -- and that is why, when you spoke of "hammering google," i
theorized another, less google-intensive way to do the job. Here is how
i envision it working with a non-google-hammering web interface, relying
only peripherally on google to generate the initial batch of
information.

VERSION TWO -- PERSONAL CACHE BASED

1) I submit my top 250 keywords to your web interface. I also submit one
ten-word sentence fragment (a "check phrase") for each URL i am
protecting. My 250 keywords, 100 URLs, and the accompanying 100 check
phrases are permanently logged at your site (but can be changed by an
"edit" function). The check phrase for each URL is MY responsibility
to choose and must be way unique.

2) Your bot goes to google only ONCE for each of those 250 keywords, finds
the top 200 results for each keyword, and caches them offline. 200 x 250
= 50,000 pages -- but there will be duplications of common terms, so,
with duplication eliminated, we might theorize that those 50,000
potential URLs will actually reduce down to 25,000 pages. Whatever the
number, that would be my personal index cache at your service.

3) If a trial proved that the above numbers were unworkable, we could
limit my input to 100 keywords x top 100 results at google per keyword.
This would result in 10,000 pages, which, with duplication eliminated,
might reduce to 5,000 pages.

4) Levels of payment could be arranged for a 100 / 100 search or a 250 /
200 search or whatever other arrangements you deemed feasible. Thus
clients would pay for the amount of breadth and depth of search -- and
the amount of cache space at your end -- that they required.

5) I could, at a specified interval -- say once a month -- rewrite my
250 (or 100) keywords. In any case, the 200 (or 100) top results for
each keyword would be automatically updated at google once a month (or
every three months, if that is easier.)

6) When i submit my ten-word check-phrases, your bot does not return to
google, but rather searches my personal index cache.

7) I believe that this system would be sufficient (and better than
hammering google) because my MAJOR goal is to eliminate successful
competitors for SERPs, and other, less successful plagiarists, are of far
lesser concern. A button at the web site that initiates a once-a-year
sweep of all google cached pages (as opposed to all of my personal
indexed cache pages at your service) would be sufficient to eliminate
the low-level plagiarists.
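
Step 6 of this second version (searching the check phrases in the personal cache rather than at google) could be sketched in Python as below. It assumes the cached pages are already stored as plain text keyed by URL; the function names and data layout are hypothetical. It is a straightforward scan, which is tolerable for a cache of a few tens of thousands of pages but is not how a full search engine would do it.

```python
import re

def normalize(text):
    """Lower-case and collapse whitespace so line breaks do not hide a match."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def find_check_phrases(check_phrases, cached_pages):
    """Look for each check phrase in every cached page.

    check_phrases: dict of protected URL -> its ten-word check phrase.
    cached_pages: dict of cached URL -> page text.
    Returns a list of (protected_url, matching_cached_url) pairs.
    """
    normalized_pages = {url: normalize(text) for url, text in cached_pages.items()}
    matches = []
    for my_url, phrase in check_phrases.items():
        needle = normalize(phrase)
        for cached_url, text in normalized_pages.items():
            if needle and needle in text:
                matches.append((my_url, cached_url))
    return matches
```

With 100 check phrases and roughly 25,000 cached pages, this amounts to about 2.5 million substring tests per run, none of which touch google.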

Quote:
Your bot also updates its cache from google's top 200 once a month and i
can change the keywords i want it to cache as well, on its next update.

That leaves gaps for misuse. Any control that is given to the user over
indexing, keywords and the like is bound to break. This must be the reason
why search engines ignore meta data and will never have second thoughts.

I disagree. This is a service that the user pays for and as long as the
interface is clear, clean, and functional, it is the service's
responsibility to automate certain tasks and the user's responsibility to
authorize the implementation of the semi-automated tasks.

I really do think this is a useful commercial service just waiting to
happen. I look forward to your further comments, as you are one of the
few people i know in the world who can discuss these matters at all, as
well as being kind to those who, like me, are merely logical thinkers
and not actually computer programmers.

cat yronwode


  #8  
Old   
news
 
Posts: n/a

Default Re: SEO technology for Copyright Patrol? - 03-01-2006 , 12:00 AM



"catherine yronwode" <cat (AT) luckymojo (DOT) com> wrote

Quote:
Does anyone know of either a software package or a subscription service
that employs search engine technology to roam the net looking for
samples of copyright infringement / plagiarism?

These guys might: http://www.cyveillance.com/


--------------
Jay
http://www.tequila-stuff.com/cgi-bin/dir - SEO Friendly Directory




  #9  
Old   
Roy Schestowitz
 
Posts: n/a

Default Re: SEO technology for Copyright Patrol? - 03-01-2006 , 08:27 AM



__/ [ catherine yronwode ] on Tuesday 28 February 2006 22:52 \__

Right *pulls sleeves*... here we have a lengthy post with plenty of
information to digest. I read it through quickly, but I had to postpone
an answer due to the dullest chores conceivable (booking for 6, conference in
Washington next month). This kept me away from UseNet and my usual Web
activities <sarcasm type="self-derogatory"> and I can sense the withdrawal
symptoms</sarcasm>. *hand shake*

As a foreword, I think you have an excellent idea, but I doubt its
practicability and the rigour one could invest in it. I will now try to
comment as I go along. Here we go...


Quote:
Roy Schestowitz wrote:

[snip]

I am not talking about unasked for links to a site, or garbled scraping,
merely direct, unauthorized copying of whole articles / major portions
of articles (expressed as a percentage, e.g. "URL xyz contains 67% text
identical to your URL jkl.")

You cannot quantify such things easily, just as you cannot merge two
pieces of similar text. Try, for example, to forge together two 'forks' of
text which have been worked on by different individuals. To use a familiar
example, have you ever mistakenly edited some older version of a text that
you worked on, only to discover *later* that you had worked on an out-of-date
version? This is not the case with syntactic code, for instance, as
it can often be merged (CVS-like tool), much like isolated paragraphs in
text, which benefit from tools like 'diff'. Been there, (colleagues) done
that.
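
For what it is worth, a crude version of the "67% text identical" figure can be computed by counting shared runs of words. The sketch below is only an assumption about how such a number might be produced (the function names and the 5-word window are arbitrary choices, not anything proposed in the thread), and it does not answer the objection above: shared quotations or boilerplate will inflate the figure.

```python
def shingles(text, size=5):
    """Set of all runs of `size` consecutive words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def percent_copied(original, suspect, size=5):
    """Percentage of the original's word runs that also appear in the suspect page."""
    own = shingles(original, size)
    if not own:
        return 0.0
    return 100.0 * len(own & shingles(suspect, size)) / len(own)


original = "one two three four five six seven eight"
suspect = "intro one two three four five six other words"
print(percent_copied(original, suspect))
```

Here two of the original's four 5-word runs survive in the suspect text, so the figure is 50%. The number says nothing about who copied whom, or whether both pages quote a common third source.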

Okay, i see i was not clear enough. I am looking for a service that
provides a semi-automated version of what i have successfully done by hand.

No job should be done by hand (don't interpret this in a sexual context,
please). Ideally, all should be self-sustaining and self-managing, which
brings up some social issues that we are yet to see in the future.


Quote:
VERSION ONE -- GOOGLE-BASED

1) I submit my top 250 keywords to your web interface. I also submit one
ten-word sentence fragment (a "check phrase") for each URL i am
protecting. You may set me a limit on the number of pages i can protect for
a given fee. Let's say you allow me 100 pages. My 250 keywords,
100 URLs, and the accompanying 100 check phrases are permanently logged
at your site (but can be changed by an "edit" function). The check
phrase for each URL is MY responsibility to choose and must be way
unique. Like, say (real example):

"contingent of spiritually-inclined folks who will not use common"

which is from
http://www.luckymojo.com/candle,agic.html

2) Your service bot goes to google (or -- see below for VERSION TWO, in
which it does not go to google, but rather to a google-generated
"personal cache") and it searches on the 100 check-phrases. In the real
life example above, my check-phrase turns up 4 matches. Two are at my
own domain (one is a weirdly garbled URL that i have no idea what it's
about, but probably some wacky symbolic link thingie that my husband
screwed around with) and thus are eliminated -- and the other 2 are not
at my domain and thus are potential cases of illegal copyright
infringement, and are logged at your web-based interface so i can view
them.

Okay, so far so good. However, bear in mind you intend to run a *service*
here. Even 100 queries become a heavy load if you have 1,000 thirsty
customers that make the service affordable (or at the least self-covering).

I recently read that gada.be is refused access by del.icio.us, which shows
that the whole Web 2.0 'spirit' does not work in practice. For Google to
give up a share of their bandwidth and computing power would take convincing
through negotiation. Is removal of duplicates helpful to Google? Yes.
Still, this cannot be done behind their back and at their own expense. The
acceptance of complaints and subsequent review are also expensive. These are
manual. You attempt to automate something you do manually, but in turn, it
can raise the amount of manual workload over at the Googleplex.
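
Setting the load question aside, the filtering part of the quoted step 2 (discard matches at the client's own domain, keep the rest as potential infringements) is simple to sketch. The function name and the result URLs below are invented for illustration; only the domain luckymojo.com comes from the thread.

```python
from urllib.parse import urlsplit


def split_matches(result_urls, own_domain):
    """Separate search hits into the client's own pages and external pages."""
    own, external = [], []
    for url in result_urls:
        host = (urlsplit(url).hostname or "").lower()
        # Treat the bare domain and any subdomain (www., etc.) as the client's own.
        if host == own_domain or host.endswith("." + own_domain):
            own.append(url)
        else:
            external.append(url)
    return own, external


# Four hypothetical hits for one check phrase, mirroring the 2 + 2 split described.
results = [
    "http://www.luckymojo.com/page.html",
    "http://luckymojo.com/odd/link.html",
    "http://copycat.example/stolen.html",
    "http://mirror.example/~user/stolen.html",
]
own, external = split_matches(results, "luckymojo.com")
print(external)
```

Only the two external URLs would be logged for review. Comparing hostnames rather than raw strings avoids counting a page as the client's own merely because the domain name appears somewhere in its path.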


Quote:
3) The bot does a whois lookup on the two infringing domains -- in this
real-life example:

ausetkmt. com

freewill.tzo. com/~callista

and it logs the data in your web-based interface so i can view it.

4) The bot obtains, from a cache, three copies of a customizable
"friendly" (stage one) complaint letter and drafts them to each domain:
one to the owner, one to the owner's tech contact (in my experience
owners who plagiarize often claim inability to delete files as a reason
to avoid action; this works around their excuse-making), and one to the
domain's isp.

...provided that all details are recorded consistently and the appropriate
fields can be pulled after parsing. In theory, you need access to the ICANN
databases rather than relying on Web interfaces that make fetching of such
information rather tricky and prone to breakage (e.g. changes to interface).
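
To illustrate the parsing concern: WHOIS responses are plain text (the protocol itself runs over TCP port 43), and the "Key: Value" layout is a common convention rather than a guaranteed format. The sketch below parses a made-up response; the field names in the sample are assumptions, and real registries label contacts differently or withhold them altogether, which is exactly where a bot of this kind would tend to break.

```python
def parse_whois(raw):
    """Collect 'Key: Value' lines from a WHOIS-style response into {key: [values]}."""
    fields = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blanks, comment lines, and anything without a key/value separator.
        if not line or line.startswith(("%", "#")) or ":" not in line:
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        if value:
            fields.setdefault(key.strip().lower(), []).append(value)
    return fields


# Sample response invented for illustration; not the output of any real registry.
SAMPLE = """\
% Sample response, invented for illustration
Domain Name: EXAMPLE.COM
Registrar: Example Registrar, Inc.
Registrant Email: owner@example.com
Tech Email: tech@example.com
Name Server: NS1.EXAMPLE.NET
Name Server: NS2.EXAMPLE.NET
"""

record = parse_whois(SAMPLE)
contacts = [v for k, vals in record.items() if k.endswith("email") for v in vals]
print(contacts)
```

Keeping every value in a list preserves repeated keys such as name servers, and matching on a key suffix is a deliberately loose heuristic that a human would still need to check before any letter goes out.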


Quote:
5) The bot generates a web-page based alert, displaying all information
about the infringing sites and notifying me that the draft "friendly"
complaints are ready to be sent.

6) At your web site, i can perform a personal check of the pages --
similar to the Wikipedia "diff" function and displayed the same way
(side by side) -- before i commit to sending the "friendly" complaint
letters or abort the send.

7) If i decide to send the "friendly" complaint letters, this action is
logged and dated and displayed at your web interface for my future
reference.

8) There is a "tickler" function that makes a re-check of any site to
which i have sent a complaint at one-week intervals. This informs me at
the web site whether the infringement is still up.

9) Decision fork:

9A) If the infringing page is gone, it is marked (in red) "Page No
Longer Online" but it stays in the system for access anytime i wish to
re-check my "History" with that domain (or my "History" in general).

9B) If the infringing page is still there, I am offered the option to
send a strongly worded "legal" (stage two) complaint and to print two
hard copies to be sent to the contact addresses for domain owner and
isp. (Subsidiary idea: keep the snail-mail copyright department
addresses of major isps -- and any isps ever contacted by the system --
on file, for they are usually difficult to track down and it would save
the client time having to look them up.)

10, 11, 12) Repeat steps 7, 8, and 9 for the "legal" complaint.

13) If the "legal" complaint generates no response, i am given the
option of sending a fully documented letter (with all relevant date
stamps and so forth from your service's history records) to google
informing them of the infringement and requesting them to de-list the
offending URL (or domain) from their SERPs. (Side-note: if the service
is well-publicized, google will probably agree to honour their
complaints. If three such services exist, they can form an Association
and google will definitely have to deal with them.)

14) This ends the service's responsibilities. For any further actions, i
must hire a lawyer.

I am surprised that you are willing to go as far as that. Unless a site
copies your /entire/ content, would it ever be worth the time? (rhetorical)
I guess it depends on how much revenue/self pleasure the site generates.
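
The quoted steps 7 to 14 amount to a small escalation workflow driven by the weekly "tickler" re-check. A minimal sketch follows, assuming for simplicity that the client accepts every escalation that is offered (in the proposal each one is optional) and that a page, once gone, stays gone. The stage names and the `run_case` helper are invented for illustration.

```python
from datetime import date, timedelta
from enum import Enum


class Stage(Enum):
    FRIENDLY_SENT = "friendly (stage one) complaint sent"
    LEGAL_SENT = "legal (stage two) complaint sent"
    GOOGLE_LETTER = "documented letter to google offered"
    RESOLVED = "Page No Longer Online"


def next_stage(stage, still_online):
    """Outcome of one weekly re-check of a page that received a complaint."""
    if stage is Stage.RESOLVED or not still_online:
        return Stage.RESOLVED
    if stage is Stage.FRIENDLY_SENT:
        return Stage.LEGAL_SENT
    return Stage.GOOGLE_LETTER  # last step; after this the client needs a lawyer


def run_case(sent_on, weekly_checks):
    """Build the dated history log for one infringing page."""
    history = [(sent_on, Stage.FRIENDLY_SENT)]
    stage, day = Stage.FRIENDLY_SENT, sent_on
    for still_online in weekly_checks:
        day += timedelta(weeks=1)
        stage = next_stage(stage, still_online)
        history.append((day, stage))
    return history


# Page still up after one week, gone after two.
history = run_case(date(2006, 3, 1), [True, False])
for day, stage in history:
    print(day.isoformat(), stage.value)
```

Keeping the whole dated history, rather than only the current stage, is what would supply the date stamps the quoted step 13 wants to include in the letter to google.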


Quote:
The point of my babbling is that you can never measure such things
reliably, let alone know their meaning. Statistics have their flaws. What
if in your text you cited (and linked to) an article and then provided
some long quote? A second site could do the same and unintentionally
resemble your content. The issue of copyrights and intellectual
property suffers tremendously nowadays. Bear in mind that apart from
Google Groups, there are at least half a dozen Web sites that copy the
*entire* content of this newsgroup, making it public.

This is all true, but not relevant. I am talking about webmasters who
build sites competing with my site's SERPs by deliberate copyright
infringement of my own copyright protected web pages. See above
scenario.

OK, I now understand better.


Quote:
As for attention by a human, i would expect a design that offered me the
option to send automated cease and desist letters (customizable) to the
domain owner / tech rep and isp host copyright rep. Why would abuse@isp
. net become involved?

I was referring to the people employed by ISP to deal with abuse reports.
If they began to receive automated mail, there would be no barrier on the
amount of workload. This would also cast a shadow on abuse reports which
are submitted manually.

The letters would be submitted manually. See above.

[Google and Evil discussion tabled for another thread -- an interesting
subject and one worthy of conversation, but off-topic here.]

I would pay a yearly fee for such a service.

Does it exist?

I doubt it.

Could you design and market it?

*smile* I am not a businessman.

Could you design it?

When I started iuron.com, I believed that I had an idea that would work. I
still believe that. I spoke to a distinguished professor in the field to get
some pointers. However, I can't believe I can ever afford the time. I lack
the desire too. Sometimes I wonder how many aspects of life (personal and
professional) I will neglect. I'm like a toddler choosing a different shiny
object every now and then. It's worrisome. Even my exercise regime has
dropped to 4 times a week. It is the lowest level in 10 years, since I got
started and I have little passion for it. I think it indirectly answers your
question, in the most candid way.


Quote:
If not, why not? (And can those restrictions be overcome?)

Such a tool would need to hammer a search engine quite heavily. How
would the search engine feel about it and what does the search engine
have to earn?

Well, a large (e.g. google) search engine could charge money for
the service.

This raises further questions. If that was the case:

-Could Google benefit from permitting plagiarism nests to exist?

No.

-Would people truly waste and invest money in fighting evil?

Authors and businesses invest in fighting copyright and trademark
infringement all the time. I spend many hours per year at the task. A
semi-automated web-based system would save me 100-plus hours per year
and a great deal of frustration. I would pay 250 dollars per year to
subscribe, maybe more. A sliding scale of pricing could allow for
different levels of examination based on varying the number of client
keywords / number of client pages handled.

This reminds me of the idea of pay-per-E-mail as means of preventing spam.

I don't see the similarity. I am talking about a web-based service to
which i could subscribe that would allow me to patrol the web for
copyright infringements.

True, I see now.


Quote:
Or, perhaps you could design a search engine to handle it in a way that
does not hammer google.

For instance, my field is occultism / religion / spirituality folklore.
I supply your bot with keyword terms -- say 250 of them -- from my site.
Your bot goes to google and collects the URLs for all sites ranking in
the top 200 for all those terms.

Then i submit my domain name to your bot. Your bot takes 1 page at a
time from my domain and searches all cached URLs it had retrieved
earlier from google. It then moves to my next page and repeats the
search.

*smile* You got greedy.

,----[ Quote ]
| Your bot takes 1 page at a time from my
| domain and searches all cached URLs
`----

The practicality of search engines is based on the fact that you index
sites off-line. You can't just go linearly searching for duplicates. The
least you can do is find pointers to potential culprits by using the
indices. I guess I have missed your point though. If you are talking about
surveying and analysing top pages for a given search phrase, how far
should you go? There are infinitely many search phrases.
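
The difference between scanning pages linearly and "finding pointers to potential culprits by using the indices" can be shown with a toy inverted index. This is a sketch under assumptions: the 5-word window, the function names and the sample pages are all invented, and a real search engine's index is far more elaborate.

```python
from collections import defaultdict


def shingles(text, size=5):
    """Set of all runs of `size` consecutive words, joined into strings."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def build_index(pages, size=5):
    """Inverted index built once, offline: word run -> set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for sh in shingles(text, size):
            index[sh].add(url)
    return index


def candidates(index, my_text, size=5):
    """Look up each of my word runs in the index; return (url, shared runs) pairs."""
    counts = defaultdict(int)
    for sh in shingles(my_text, size):
        for url in index.get(sh, ()):
            counts[url] += 1
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


pages = {
    "http://x.example/a": "the quick brown fox jumps over the lazy dog today",
    "http://y.example/b": "completely different words appear in this other page",
}
index = build_index(pages)
mine = "quick brown fox jumps over the lazy dog"
print(candidates(index, mine))
```

Each lookup touches only the pages that share a word run with the client's text, so the cost of a check depends on the length of the client's page rather than on the number of pages indexed. The index only points at candidates; deciding whether a candidate is an infringement is still a separate step.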

That is true -- and that is why, when you spoke of "hammering google," i
theorized another, less google-intensive way to do the job. Here is how
i envision it working with a non-google-hammering web interface, relying
only peripherally on google to generate the initial batch of
information.

VERSION TWO -- PERSONAL CACHE BASED

1) I submit my top 250 keywords to your web interface. I also submit one
ten-word sentence fragment (a "check phrase") for each URL i am
protecting. My 250 keywords, 100 URLs, and the accompanying 100 check
phrases are permanently logged at your site (but can be changed by an
"edit" function). The check phrase for each URL is MY responsibility
to choose and must be way unique.

2) Your bot goes to google only ONCE for each of those 250 keywords, finds
the top 200 results for each keyword, and caches them offline. 200 x 250
= 50,000 pages -- but there will be duplications of common terms, so,
with duplication eliminated, we might theorize that those 50,000
potential URLs will actually reduce down to 25,000 pages. Whatever the
number, that would be my personal index cache at your service.

Cache is a non-real-time element. I think the load would remain higher than
you predict.


Quote:
3) If a trial proved that the above numbers were unworkable, we could
limit my input to 100 keywords x top 100 results at google per keyword.
This would result in 10,000 pages, which, with duplication eliminated,
might reduce to 5,000 pages.

The numbers still appear quite steep. Before taking this seriously, I think
negotiation with Google is worthwhile (as is my submitting my thesis
and getting it out of the way). If you are serious about this, I have
contact with a Google manager, so I could maybe attempt a proposal...


Quote:
4) Levels of payment could be arranged for a 100 / 100 search or a 250 /
200 search or whatever other arrangements you deemed feasible. Thus
clients would pay for the amount of breadth and depth of search -- and
the amount of cache space at your end -- that they required.

Web-based and 'cache' are somewhat conflicting. I thought it was worth
pointing out. The Web browser cannot read from or write to physical media.


Quote:
5) I could, at a specified interval -- say once a month -- rewrite my
250 (or 100) keywords. In any case, the 200 (or 100) top results for
each keyword would be automatically updated at google once a month (or
every three months, if that is easier.)

6) When i submit my ten-word check-phrases, your bot does not return to
google, but rather searches my personal index cache.

7) I believe that this system would be sufficient (and better than
hammering google) because my MAJOR goal is to eliminate successful
competitors for SERPs, and other, less successful plagiarists, are of far
lesser concern. A button at the web site that initiates a once-a-year
sweep of all google cached pages (as opposed to all of my personal
indexed cache pages at your service) would be sufficient to eliminate
the low-level plagiarists.

This leads me to thinking: what about Google's own detection of duplicates?
One could argue that 'interference' from a third party is undesirable.


Quote:
Your bot also updates its cache from google's top 200 once a month and i
can change the keywords i want it to cache as well, on its next update.

That leaves gaps for misuse. Any control that is given to the user over
indexing, keywords and the like is bound to break. This must be the reason
why search engines ignore meta data and will never have second thoughts.

I disagree. This is a service that the user pays for and as long as the
interface is clear, clean, and functional, it is the service's
responsibility to automate certain tasks and the user's responsibility to
authorize the implementation of the semi-automated tasks.

What about misuse of the service? Such as signing up by spammers with the
goal of intercepting competitive sites?


Quote:
I really do think this is a useful commercial service just waiting to
happen. I look forward to your further comments, as you are one of the
few people i know in the world who can discuss these matters at all, as
well as being kind to those who, like me, are merely logical thinkers
and not actually computer programmers.

cat yronwode

I am flattered by your words, Cat.

Kind regards,

Roy

--
Roy S. Schestowitz | "Disk quota exceeded; sig discontinued"
http://Schestowitz.com | SuSE Linux | PGP-Key: 0x74572E8E
12:45pm up 8:23, 4 users, load average: 0.24, 0.32, 0.44
http://iuron.com - help build a non-profit search engine


  #10  
Old   
Big Bill
 
Posts: n/a

Default Re: SEO technology for Copyright Patrol? - 03-01-2006 , 10:18 AM



On Wed, 01 Mar 2006 13:27:51 +0000, Roy Schestowitz
<newsgroups (AT) schestowitz (DOT) com> wrote:

Quote:
__/ [ catherine yronwode ] on Tuesday 28 February 2006 22:52 \__

Right *pulls sleeves*... here we have a lengthy post with plenty of
information to digest. I read it through quickly, but I had to procrastinate
an answer due to the dullest chores conceived (booking for 6, conference at
Washington next month). This kept me away from UseNet and my usual Web
activities <sarcasm type="self-derogatory"> and I can sense the withdrawal
symptoms</sarcasm>. *hand shake*

As a foreword, I think you have an excellent idea, but I doubt its
practicability and the rigour one could invest in it. I will now try to
comment as I go along. Here we go...

Er..I'm going out for a beer just now..you kids talk among yourselves
for a while, ok?

BB


--

http://homepage.ntlworld.com/bill.kr...tapestries.htm
http://www.crystal-liaison.com/artis-orbis/index.html
kruse (AT) crystal-liaison (DOT) com Gifty! Shiny! BB!

