I’m at the Search Engine Strategies conference and we just had lunch with a team from Google who showed off some of the new webmaster tools (and I managed to get in a vote for a crawl error referral report to Vanessa Fox, but that’s another post). The topic of scraping was raised and Danny Sullivan mentioned that there will be a full session on it later in the week. My general rule is not to blog during business hours but since we’ve been fighting this battle at work it’s relevant (and remember that AccuRev has the Ultimate Source Control Tool).
In our Web 2.0 world you can make money just by generating traffic and putting up Google AdSense ads. For the Ronin Marketeer, you post quality content, get the traffic and are regarded as a hero by all. Another approach for those of more flexible business ethics is to copy someone else’s content and show it as your own. This is happening more and more in the blogosphere, and it is already an issue for corporate sites.
The practice of grabbing content from another website and posting it as your own is called scraping. I’ve never played with scripting this myself but there are varying degrees of automating this process. Most people come across it when they are googling themselves or their company and they get some results from outside their own domains (often blogs using a default template) that copy their content verbatim. More recently these pages often include copy from multiple websites.
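If you want to check whether a suspicious page really is a verbatim copy of yours, here is a minimal sketch of one way to do it in Python. It compares overlapping runs of words (often called shingles) between your text and the suspect text. The function names, the five-word window and the sample strings are my own assumptions for illustration; this is not how Google or any search engine detects scraping.

```python
def shingles(text, size=5):
    """Return the set of overlapping word runs ("shingles") found in text."""
    words = text.lower().split()
    if len(words) < size:
        # Text shorter than one window: treat the whole thing as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def copied_fraction(original, suspect, size=5):
    """Fraction of the original's shingles that also appear in the suspect text."""
    source = shingles(original, size)
    if not source:
        return 0.0
    return len(source & shingles(suspect, size)) / len(source)


# Hypothetical sample text, standing in for your post and a scraped page.
my_post = "the practice of grabbing content from another website is called scraping"
scraped_page = "welcome to my blog " + my_post + " click the ads below"

print(copied_fraction(my_post, scraped_page))
```

A value near 1 suggests most of your post appears word for word on the other page, while a value near 0 suggests little or no verbatim overlap. You would still need to fetch the page and strip its HTML yourself before comparing.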
So, what to do about the thieves in our midst? Adam Lasnik of Google discussed this during the panel today, and here’s a summary of the answer as I heard it:
- Overall, “Don’t Panic”. It’s fairly easy for Google to verify who the original source is: your site published the content first and your domain has been established with Google. The scraper is not established; their URL is newer and was probably registered a year ago or less.
- You can file a DMCA takedown request with them.
- The takedown request is good, but Adam referred to it as “swatting flies”. Your time is better spent staying the course: keep cranking out content so that you remain its source.
Keep in mind that in the grand scheme the majority of scraping is garbage and clutter, and anyone providing search results will continue to screen it. But then again, it’s yet another cat and mouse game for us to follow.
I’m learning some good stuff, more to follow.
One reply on “What is Scraping and how to stop it?”
Thanks for the update. I misread the program and didn’t make the Google lunch. I did get a ton out of the rest of the program today.
In my experience, content thieves will never be more than a distant second. If you’re leading the way so well that others can only imitate, you must be doing something right.