#1
#2
Does anyone know of either a software package or a subscription service that employs search engine technology to roam the net looking for samples of copyright infringement / plagiarism? |
|
The way i envision it, the search engine bot would be given the client's domain name, then run around comparing random snips from each of the client's pages with results at, say, google. When it finds a match, it reports back with a daily log listing all files that duplicate portions or the entirety of a client's files. |
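A rough illustration of the snippet-comparison idea, as a minimal Python sketch. It assumes the candidate pages have already been fetched into a dictionary; the function names, the 10-word snippet length, and the in-memory dictionaries are hypothetical stand-ins for a real search engine query.

```python
import random


def sample_snippets(text, words_per_snippet=10, count=3, seed=0):
    """Pick up to `count` random runs of consecutive words from `text`."""
    words = text.split()
    if len(words) <= words_per_snippet:
        return [" ".join(words)]
    rng = random.Random(seed)
    last_start = len(words) - words_per_snippet
    starts = rng.sample(range(last_start + 1), min(count, last_start + 1))
    return [" ".join(words[s:s + words_per_snippet]) for s in sorted(starts)]


def find_copies(client_pages, other_pages):
    """Return (client_url, other_url, snippet) for each snippet found verbatim elsewhere."""
    # Collapse whitespace once per candidate page so line breaks do not hide a match.
    normalized = {url: " ".join(text.split()) for url, text in other_pages.items()}
    hits = []
    for client_url, text in client_pages.items():
        for snippet in sample_snippets(text):
            for other_url, other_text in normalized.items():
                if snippet in other_text:
                    hits.append((client_url, other_url, snippet))
    return hits
```

The list returned by `find_copies` is the raw material for the daily log; a real service would still need a person to look at each hit.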
|
At this point what is sold could be just the search engine bot software (to tech-oriented clients) or a subscription service (for business clients without an interest in tech matters). The SEO bot would run X number of pages per day, and check the site through once, or it could go back and re-work the same site on a continual, ongoing subscription basis. The subscription service would also do a whois lookup and find the email and street addresses of the domain owner and the domain host. It then (hand-supervised, probably) would auto-generate legal letters of complaint to the domain contact(s) and the contacts for the isp hosting the site. This information would go into a weekly log. As a service, it could be programmed to auto-generate the exact forms requested by the major isps, such as yahoo. It would also presumably continually revisit pages that it found had been infringed until the infringement was terminated.
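The letter-drafting part could look something like the following minimal sketch. The template wording, the `draft_letters` name, and the contact roles are assumptions made for illustration; the drafts are only returned for a human to review, and nothing is sent.

```python
from string import Template

# Hypothetical wording; a real service would let the client customize it.
COMPLAINT_LETTER = Template(
    "To: $recipient\n"
    "Subject: Copied material at $infringing_url\n\n"
    "The page at $infringing_url appears to reproduce text from\n"
    "$original_url, which is protected by copyright. Please remove it.\n"
)


def draft_letters(original_url, infringing_url, contacts):
    """Return one draft per contact role; a human decides whether to send."""
    return {
        role: COMPLAINT_LETTER.substitute(
            recipient=address,
            original_url=original_url,
            infringing_url=infringing_url,
        )
        for role, address in contacts.items()
    }
```

The `contacts` mapping would be filled from the whois lookup, for example one entry for the domain owner and one for the hosting isp.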
|
Thinking farther ahead, if a partnership with google were made, google could agree to de-list sites that did not comply with the legal complaint procedures (e.g. those bootleg Rumanian sites). (I do not want to get off onto a tangent about google's own copyright infringement issues; i know about them and i hope and trust that they will be resolved. This is just an idea, that's all, so please do not turn it into an excuse for google-bashing. Thanks.)
|
I would pay a yearly fee for such a service. Does it exist? |
|
If not, why not? (And can those restrictions be overcome?) |
|
cat yronwode http://www.luckymojo.com/blues.html Blues Lyrics and Hoodoo |
#3
Thinking farther ahead, if a partnership with google were made, google could agree to de-list sites that did not comply with the legal complaint procedures (e.g. those bootleg Rumanian sites).
#4
catherine yronwode wrote:

> Thinking farther ahead, if a partnership with google were made, google could agree to de-list sites that did not comply with the legal complaint procedures (e.g. those bootleg Rumanian sites).

Google already removes from its index what it thinks is duplicate content. Unfortunately, they're about as likely to delist the original as they are the scraper page.

RFM
#5
__/ [ catherine yronwode ] on Saturday 25 February 2006 03:16 \__

> Does anyone know of either a software package or a subscription service that employs search engine technology to roam the net looking for samples of copyright infringement / plagiarism?

Try: http://copyscape.com/

I wrote about it last month, among other methods: http://schestowitz.com/Weblog/archiv...og-plagiarism/

> The way i envision it, the search engine bot would be given the client's domain name, then run around comparing random snips from each of the client's pages with results at, say, google. When it finds a match, it reports back with a daily log listing all files that duplicate portions or the entirety of a client's files.

Yes, that is probably how copyscape works. It automates the analogous yet more laborious process that a human would otherwise undertake. Some lecturers used to use Google to detect plagiarism. Submission in electronic form has its merits.

> At this point what is sold could be just the search engine bot software (to tech-oriented clients) or a subscription service (for business clients without an interest in tech matters). The SEO bot would run X number of pages per day, and check the site through once, or it could go back and re-work the same site on a continual, ongoing subscription basis. The subscription service would also do a whois lookup and find the email and street addresses of the domain owner and the domain host. It then (hand-supervised, probably) would auto-generate legal letters of complaint to the domain contact(s) and the contacts for the isp hosting the site. This information would go into a weekly log. As a service, it could be programmed to auto-generate the exact forms requested by the major isps, such as yahoo. It would also presumably continually revisit pages that it found had been infringed until the infringement was terminated.

That's a lot of automated traffic, which can raise many concerns. What, for example, will you do when the offending site copies only in part, or attributes the source using a link? This needs careful attention and judgment by a human, preferably the victim. Also, imagine the load on abuse@isp . net.

> Thinking farther ahead, if a partnership with google were made, google could agree to de-list sites that did not comply with the legal complaint procedures (e.g. those bootleg Rumanian sites). (I do not want to get off onto a tangent about google's own copyright infringement issues; i know about them and i hope and trust that they will be resolved. This is just an idea, that's all, so please do not turn it into an excuse for google-bashing. Thanks.)

Why just Google? *smile* It promotes monoculture.

> I would pay a yearly fee for such a service. Does it exist?

I doubt it.

> If not, why not? (And can those restrictions be overcome?)

Such a tool would need to hammer a search engine quite heavily. How would the search engine feel about it, and what does the search engine stand to gain?
#6
Roy Schestowitz wrote:

> __/ [ catherine yronwode ] on Saturday 25 February 2006 03:16 \__
>
>> Does anyone know of either a software package or a subscription service that employs search engine technology to roam the net looking for samples of copyright infringement / plagiarism?

[snip]

>> The subscription service would also do a whois lookup and find the email and street addresses of the domain owner and the domain host. It then (hand-supervised, probably) would auto-generate legal letters of complaint to the domain contact(s) and the contacts for the isp hosting the site. This information would go into a weekly log. As a service, it could be programmed to auto-generate the exact forms requested by the major isps, such as yahoo. It would also presumably continually revisit pages that it found had been infringed until the infringement was terminated.
>
> That's a lot of automated traffic, which can raise many concerns. What, for example, will you do when the offending site copies only in part, or attributes the source using a link? This needs careful attention and judgment by a human, preferably the victim. Also, imagine the load on abuse@isp . net.

I am not talking about unasked-for links to a site, or garbled scraping, merely direct, unauthorized copying of whole articles / major portions of articles (expressed as a percentage, e.g. "URL xyz contains 67% text identical to your URL jkl").
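One way such a percentage could be computed is with overlapping word sequences ("shingles"). This is only a sketch of one possible measure, not how any existing service is known to work; the function names and the 5-word shingle size are assumptions.

```python
def shingles(text, size=5):
    """Return the set of all runs of `size` consecutive words in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def percent_copied(original, suspect, size=5):
    """Share (0-100) of the original's shingles that also appear in the suspect page."""
    own = shingles(original, size)
    if not own:
        return 0.0
    return 100.0 * len(own & shingles(suspect, size)) / len(own)
```

A score like this says how much of the original wording reappears; it cannot tell a plagiarist from a site that legitimately quotes the same third-party passage, so the number is a pointer for human review rather than a verdict.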
|
As for attention by a human, i would expect a design that offered me the option to send automated cease and desist letters (customizable) to the domain owner / tech rep and isp host copyright rep. Why would abuse@isp . net become involved?

>> Thinking farther ahead, if a partnership with google were made, google could agree to de-list sites that did not comply with the legal complaint procedures (e.g. those bootleg Rumanian sites). (I do not want to get off onto a tangent about google's own copyright infringement issues; i know about them and i hope and trust that they will be resolved. This is just an idea, that's all, so please do not turn it into an excuse for google-bashing. Thanks.)
>
> Why just Google? *smile* It promotes monoculture.

Because they are sharp, good at what they do, and abjure evil.

>> I would pay a yearly fee for such a service. Does it exist?
>
> I doubt it.

Could you design and market it?

>> If not, why not? (And can those restrictions be overcome?)
>
> Such a tool would need to hammer a search engine quite heavily. How would the search engine feel about it, and what does the search engine stand to gain?

Well, a large (e.g. google) search engine could charge money for the service.

Or, perhaps you could design a search engine to handle it in a way that does not hammer google. For instance, my field is occultism / religion / spirituality folklore. I supply your bot with keyword terms -- say 250 of them -- from my site. Your bot goes to google and collects the URLs for all sites ranking in the top 200 for all those terms. Then i submit my domain name to your bot. Your bot takes 1 page at a time from my domain and searches all cached URLs it had retrieved earlier from google. It then moves to my next page and repeats the search.
|
That way it does not hammer google, but builds a database from customer keywords. Your bot also updates its cache from google's top 200 once a month and i can change the keywords i want it to cache as well, on its next update. |
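A minimal sketch of that keyword-built cache, assuming the top results for each keyword have already been collected. `build_cache`, `search_cache`, and the `fetch` callable are hypothetical names; `fetch` stands in for whatever downloads a page.

```python
def build_cache(results_by_keyword, fetch):
    """Fetch each distinct URL once, however many keywords it ranks for."""
    cache = {}
    for urls in results_by_keyword.values():
        for url in urls:
            if url not in cache:
                cache[url] = fetch(url)
    return cache


def search_cache(cache, phrase):
    """Return the cached URLs whose text contains `phrase` (case-insensitive)."""
    needle = " ".join(phrase.lower().split())
    return sorted(
        url for url, text in cache.items()
        if needle in " ".join(text.lower().split())
    )
```

Once the cache exists, every later phrase check runs against the local copy, so the search engine is only contacted when the cache is rebuilt.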
|
How about that? cat yronwode http://www.luckymojo.com/blues.html Blues Lyrics and Hoodoo |
#7
[snip]

> I am not talking about unasked-for links to a site, or garbled scraping, merely direct, unauthorized copying of whole articles / major portions of articles (expressed as a percentage, e.g. "URL xyz contains 67% text identical to your URL jkl").

You cannot quantify such things easily, just as you cannot easily merge two pieces of similar text. Try, for example, to merge two 'forks' of a text which have been worked on by different individuals. To use a familiar example, have you ever mistakenly edited an older version of a text that you worked on, only to discover *later* that you had worked on an out-of-date version? This is not the case with source code, for instance, as it can often be merged (with a CVS-like tool), much like isolated paragraphs of text, which benefit from tools like 'diff'. Been there, (colleagues) done that.

The point of my babbling is that you can never measure such things reliably, let alone know their meaning. Statistics have their flaws. What if in your text you cited (and linked to) an article and then provided some long quote? A second site could do the same and unintentionally resemble your content. The issue of copyrights and intellectual property suffers tremendously nowadays. Bear in mind that apart from Google Groups, there are at least half a dozen Web sites that copy the *entire* content of this newsgroup, making it public.
|
> As for attention by a human, i would expect a design that offered me the option to send automated cease and desist letters (customizable) to the domain owner / tech rep and isp host copyright rep. Why would abuse@isp . net become involved?

I was referring to the people employed by ISPs to deal with abuse reports. If they began to receive automated mail, there would be no limit on their workload. This would also cast a shadow on abuse reports which are submitted manually.

>>> I would pay a yearly fee for such a service. Does it exist?
>>
>> I doubt it.
>
> Could you design and market it?

*smile* I am not a businessman.

>>> If not, why not? (And can those restrictions be overcome?)
>>
>> Such a tool would need to hammer a search engine quite heavily. How would the search engine feel about it, and what does the search engine stand to gain?
>
> Well, a large (e.g. google) search engine could charge money for the service.

This raises further questions. If that were the case:

-Could Google benefit from permitting plagiarism nests to exist?

-Would people truly waste and invest money in fighting evil?

This reminds me of the idea of pay-per-E-mail as a means of preventing spam.
|
> Or, perhaps you could design a search engine to handle it in a way that does not hammer google. For instance, my field is occultism / religion / spirituality folklore. I supply your bot with keyword terms -- say 250 of them -- from my site. Your bot goes to google and collects the URLs for all sites ranking in the top 200 for all those terms. Then i submit my domain name to your bot. Your bot takes 1 page at a time from my domain and searches all cached URLs it had retrieved earlier from google. It then moves to my next page and repeats the search.

*smile* You got greedy.

,----[ Quote ]
| Your bot takes 1 page at a time from my
| domain and searches all cached URLs
`----

The practicality of search engines is based on the fact that you index sites off-line. You can't just go linearly searching for duplicates. The least you can do is find pointers to potential culprits by using the indices. I guess I have missed your point, though. If you are talking about surveying and analysing top pages for a given search phrase, how far should you go? There are infinitely many search phrases.
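The contrast between scanning every page and using an index can be sketched as follows. This is a toy inverted index under assumed names (`build_index`, `candidates`), not a description of how any real search engine is implemented.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of URLs whose text contains it (built once, off-line)."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in set(text.lower().split()):
            index[word].add(url)
    return index


def candidates(index, phrase):
    """URLs containing every word of `phrase`; word order is not checked here."""
    words = phrase.lower().split()
    if not words:
        return set()
    return set.intersection(*(index.get(word, set()) for word in words))
```

The index only yields pointers to potential culprits: each candidate still has to be opened and compared against the original, since containing the same words is not the same as containing the same sentence.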
|
> Your bot also updates its cache from google's top 200 once a month and i can change the keywords i want it to cache as well, on its next update.

That leaves gaps for misuse. Any control that is given to the user over indexing, keywords and the like is bound to break. This must be the reason why search engines ignore meta data and will never have second thoughts.
#8
Does anyone know of either a software package or a subscription service that employs search engine technology to roam the net looking for samples of copyright infringement / plagiarism? |
#9
Roy Schestowitz wrote:

[snip]

>> I am not talking about unasked-for links to a site, or garbled scraping, merely direct, unauthorized copying of whole articles / major portions of articles (expressed as a percentage, e.g. "URL xyz contains 67% text identical to your URL jkl").
>
> You cannot quantify such things easily, just as you cannot easily merge two pieces of similar text. Try, for example, to merge two 'forks' of a text which have been worked on by different individuals. To use a familiar example, have you ever mistakenly edited an older version of a text that you worked on, only to discover *later* that you had worked on an out-of-date version? This is not the case with source code, for instance, as it can often be merged (with a CVS-like tool), much like isolated paragraphs of text, which benefit from tools like 'diff'. Been there, (colleagues) done that.

Okay, i see i was not clear enough. I am looking for a service that provides a semi-automated version of what i have successfully done by hand.
|
VERSION ONE -- GOOGLE-BASED

1) I submit my top 250 keywords to your web interface. I also submit one ten-word sentence fragment (a "check phrase") for each URL i am protecting. You may set me a limit on the number of pages i can protect for a given fee. Let's say you allow me 100 pages. My 250 keywords, 100 URLs, and the accompanying 100 check phrases are permanently logged at your site (but can be changed by an "edit" function). The check phrase for each URL is MY responsibility to choose and must be way unique. Like, say (real example): "contingent of spiritually-inclined folks who will not use common" which is from http://www.luckymojo.com/candle,agic.html

2) Your service bot goes to google (or -- see below for VERSION TWO, in which it does not go to google, but rather to a google-generated "personal cache") and it searches on the 100 check-phrases. In the real-life example above, my check-phrase turns up 4 matches. Two are at my own domain (one is a weirdly garbled URL that i have no idea what it's about, but probably some wacky symbolic link thingie that my husband screwed around with) and thus are eliminated -- and the other 2 are not at my domain and thus are potential cases of illegal copyright infringement, and are logged at your web-based interface so i can view them.

3) The bot does a whois lookup on the two infringing domains -- in this real-life example:

ausetkmt. com
freewill.tzo. com/~callista

and it logs the data in your web-based interface so i can view it.

4) The bot obtains, from a cache, three copies of a customizable "friendly" (stage one) complaint letter and drafts them for each domain: one to the owner, one to the owner's tech contact (in my experience owners who plagiarize often claim inability to delete files as a reason to avoid action; this works around their excuse-making), and one to the domain's isp.

5) The bot generates a web-page based alert, displaying all information about the infringing sites and notifying me that the draft "friendly" complaints are ready to be sent.

6) At your web site, i can perform a personal check of the pages -- similar to the Wikipedia "diff" function and displayed the same way (side by side) -- before i commit to sending the "friendly" complaint letters or abort the send.

7) If i decide to send the "friendly" complaint letters, this action is logged and dated and displayed at your web interface for my future reference.

8) There is a "tickler" function that makes a re-check of any site to which i have sent a complaint at one-week intervals. This informs me at the web site whether the infringement is still up.

9) Decision fork:

9A) If the infringing page is gone, it is marked (in red) "Page No Longer Online" but it stays in the system for access anytime i wish to re-check my "History" with that domain (or my "History" in general).

9B) If the infringing page is still there, I am offered the option to send a strongly worded "legal" (stage two) complaint and to print two hard copies to be sent to the contact addresses for domain owner and isp. (Subsidiary idea: keep the snail-mail copyright department addresses of major isps -- and any isps ever contacted by the system -- on file, for they are usually difficult to track down and it would save the client time having to look them up.)

10, 11, 12) Repeat steps 7, 8, and 9 for the "legal" complaint.

13) If the "legal" complaint generates no response, i am given the option of sending a fully documented letter (with all relevant date stamps and so forth from your service's history records) to google informing them of the infringement and requesting them to de-list the offending URL (or domain) from their SERPs. (Side-note: if the service is well-publicized, google will probably agree to honour their complaints. If three such services exist, they can form an Association and google will definitely have to deal with them.)

14) This ends the service's responsibilities. For any further actions, i must hire a lawyer.
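Two pieces of the scenario above lend themselves to a short sketch: dropping matches on the client's own domain (step 2) and the weekly decision fork (steps 8 to 14). The function names and the stage labels are hypothetical, and the example URLs are the ones mentioned in this thread.

```python
from urllib.parse import urlparse


def external_hits(hit_urls, own_domain):
    """Step 2: keep only matches that are not hosted on the client's own domain."""
    kept = []
    for url in hit_urls:
        host = (urlparse(url).hostname or "").lower()
        if host != own_domain and not host.endswith("." + own_domain):
            kept.append(url)
    return kept


# What the client is offered if the page is still up after each letter.
ESCALATION = {
    "friendly": "offer legal complaint",               # step 9B
    "legal": "offer de-listing request",               # step 13
    "delisting": "service ends; client hires lawyer",  # step 14
}


def weekly_recheck(last_letter_sent, page_still_online):
    """Steps 8-9: the tickler's decision after re-visiting a reported page."""
    if not page_still_online:
        return "Page No Longer Online"  # step 9A; the history entry is kept
    return ESCALATION[last_letter_sent]
```

For example, `external_hits` applied to a luckymojo.com page plus the two outside addresses would keep only the two outside ones, and `weekly_recheck("friendly", True)` would return the offer to send the stage-two letter. Every escalation remains an offer; the client still decides whether to send.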
|
> The point of my babbling is that you can never measure such things reliably, let alone know their meaning. Statistics have their flaws. What if in your text you cited (and linked to) an article and then provided some long quote? A second site could do the same and unintentionally resemble your content. The issue of copyrights and intellectual property suffers tremendously nowadays. Bear in mind that apart from Google Groups, there are at least half a dozen Web sites that copy the *entire* content of this newsgroup, making it public.

This is all true, but not relevant. I am talking about webmasters who build sites competing with my site's SERPs by deliberate copyright infringement of my own copyright protected web pages. See above scenario.
|
>> As for attention by a human, i would expect a design that offered me the option to send automated cease and desist letters (customizable) to the domain owner / tech rep and isp host copyright rep. Why would abuse@isp . net become involved?
>
> I was referring to the people employed by ISPs to deal with abuse reports. If they began to receive automated mail, there would be no limit on their workload. This would also cast a shadow on abuse reports which are submitted manually.

The letters would be submitted manually. See above.

[Google and Evil discussion tabled for another thread -- an interesting subject and one worthy of conversation, but off-topic here.]

>>>> I would pay a yearly fee for such a service. Does it exist?
>>>
>>> I doubt it.
>>
>> Could you design and market it?
>
> *smile* I am not a businessman.

Could you design it?
|
>>>> If not, why not? (And can those restrictions be overcome?)
>>>
>>> Such a tool would need to hammer a search engine quite heavily. How would the search engine feel about it, and what does the search engine stand to gain?
>>
>> Well, a large (e.g. google) search engine could charge money for the service.
>
> This raises further questions. If that were the case:
>
> -Could Google benefit from permitting plagiarism nests to exist?

No.

> -Would people truly waste and invest money in fighting evil?

Authors and businesses invest in fighting copyright and trademark infringement all the time. I spend many hours per year at the task. A semi-automated web-based system would save me 100-plus hours per year and a great deal of frustration. I would pay 250 dollars per year to subscribe, maybe more. A sliding scale of pricing could allow for different levels of examination based on varying the number of client keywords / number of client pages handled.

> This reminds me of the idea of pay-per-E-mail as a means of preventing spam.

I don't see the similarity. I am talking about a web-based service to which i could subscribe that would allow me to patrol the web for copyright infringements.
|
>> Or, perhaps you could design a search engine to handle it in a way that does not hammer google. For instance, my field is occultism / religion / spirituality folklore. I supply your bot with keyword terms -- say 250 of them -- from my site. Your bot goes to google and collects the URLs for all sites ranking in the top 200 for all those terms. Then i submit my domain name to your bot. Your bot takes 1 page at a time from my domain and searches all cached URLs it had retrieved earlier from google. It then moves to my next page and repeats the search.
>
> *smile* You got greedy.
>
> ,----[ Quote ]
> | Your bot takes 1 page at a time from my
> | domain and searches all cached URLs
> `----
>
> The practicality of search engines is based on the fact that you index sites off-line. You can't just go linearly searching for duplicates. The least you can do is find pointers to potential culprits by using the indices. I guess I have missed your point, though. If you are talking about surveying and analysing top pages for a given search phrase, how far should you go? There are infinitely many search phrases.

That is true -- and that is why, when you spoke of "hammering google," i theorized another, less google-intensive way to do the job. Here is how i envision it working with a non-google-hammering web interface, relying only peripherally on google to generate the initial batch of information.

VERSION TWO -- PERSONAL CACHE BASED

1) I submit my top 250 keywords to your web interface. I also submit one ten-word sentence fragment (a "check phrase") for each URL i am protecting. My 250 keywords, 100 URLs, and the accompanying 100 check phrases are permanently logged at your site (but can be changed by an "edit" function). The check phrase for each URL is MY responsibility to choose and must be way unique.

2) Your bot goes to google only ONCE for each of those 250 keywords, finds the top 200 results for each keyword, and caches them offline. 200 x 250 = 50,000 pages -- but there will be duplications of common terms, so, with duplication eliminated, we might theorize that those 50,000 potential URLs will actually reduce down to 25,000 pages. Whatever the number, that would be my personal index cache at your service.

3) If a trial proved that the above numbers were unworkable, we could limit my input to 100 keywords x top 100 results at google per keyword. This would result in 10,000 pages, which, with duplication eliminated, might reduce to 5,000 pages.

4) Levels of payment could be arranged for a 100 / 100 search or a 250 / 200 search or whatever other arrangements you deemed feasible. Thus clients would pay for the amount of breadth and depth of search -- and the amount of cache space at your end -- that they required.

5) I could, at a specified interval -- say once a month -- rewrite my 250 (or 100) keywords. In any case, the 200 (or 100) top results for each keyword would be automatically updated from google once a month (or every three months, if that is easier).

6) When i submit my ten-word check-phrases, your bot does not return to google, but rather searches my personal index cache.

7) I believe that this system would be sufficient (and better than hammering google) because my MAJOR goal is to eliminate successful competitors for SERPs, and other, less successful plagiarists, are of far lesser concern. A button at the web site that initiates a once-a-year sweep of all google cached pages (as opposed to all of my personal indexed cache pages at your service) would be sufficient to eliminate the low-level plagiarists.
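The cache-size arithmetic in steps 2 and 3 can be written out as a small sketch. The function name is hypothetical, and the 50% duplication figure is only the guess made in the post above, not a measured value.

```python
def cache_size(keywords, results_per_keyword, duplicate_share=0.5):
    """Return (upper bound, estimate after removing duplicates) in pages."""
    upper_bound = keywords * results_per_keyword
    estimated = round(upper_bound * (1 - duplicate_share))
    return upper_bound, estimated


print(cache_size(250, 200))  # (50000, 25000)
print(cache_size(100, 100))  # (10000, 5000)
```

A trial run would replace the assumed `duplicate_share` with the overlap actually observed between keyword result lists, which in turn would set the cache space each pricing tier needs.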
|
>> Your bot also updates its cache from google's top 200 once a month and i can change the keywords i want it to cache as well, on its next update.
>
> That leaves gaps for misuse. Any control that is given to the user over indexing, keywords and the like is bound to break. This must be the reason why search engines ignore meta data and will never have second thoughts.

I disagree. This is a service that the user pays for and as long as the interface is clear, clean, and functional, it is the service's responsibility to automate certain tasks and the user's responsibility to authorize the implementation of the semi-automated tasks.
|
I really do think this is a useful commercial service just waiting to happen. I look forward to your further comments, as you are one of the few people i know in the world who can discuss these matters at all, as well as being kind to those who, like me, are merely logical thinkers and not actually computer programmers.

cat yronwode
#10
__/ [ catherine yronwode ] on Tuesday 28 February 2006 22:52 \__ Right *pulls sleeves*... here we have a lengthy post with plenty of information to digest. I read it through quickly, but I had to procrastinate an answer due to the dullest chores conceived (booking for 6, conference at Washington next month). This kept me away from UseNet and my usual Web activities <sarcasm type="self-derogatory"> and I can sense the withdrawal symptoms</sarcasm>. *hand shake* As a foreword, I think you have an excellent idea, but I doubt its practicability and the rigour one could invest in it. I will now try to comment as I go along. Here we go... |