Tutorial or Step-By-Step for Blocking Baidu Spider Bot?


    Could someone (much more advanced than we are) please post a step by step on where to go and what to do in the 3dcart Admin to completely block the Baidu spider? We don't sell in China and I don't need them chewing through 2 gigs of my data plan each month (as I just discovered is happening).

    This is one of those "above my pay grade" type things for us and we know that messing around with stuff like this could tank our site if it's not done right.

    And apologies if something like this is already on the forum someplace (please point me to it) but my searches came up with nothing.

    Thanks in advance for any help.


  • #2
We use Cloudflare.com and simply block whole countries there, including China, Russia, and India.
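
For reference, a country block in Cloudflare is typically set up as a firewall rule. A minimal sketch of the rule expression (assuming Cloudflare's rules-language field `ip.geoip.country` and the ISO codes for the countries mentioned above; the rule's action would be set to Block):

```
(ip.geoip.country in {"CN" "RU" "IN"})
```

This is a sketch of the general approach, not an official Cloudflare recipe - check the field name against Cloudflare's current rules-language documentation.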



    • #3
Here is my complete robots.txt file. You can only get at it through FTP; I can't walk you through every step, but I can tell you this: once you're connected via FTP, editing robots.txt is trivial. FTP itself isn't difficult either - all the connection info you need was in your initial welcome email. Good background reading: https://searchenginewatch.com/sew/ne...king-parasites and here's a feature I did not know about: https://support.google.com/webmaster..._topic=6061961


      # Disallow all crawlers access to certain pages.

      User-agent: *
      Disallow: /checkout.asp
      Disallow: /add_cart.asp
      Disallow: /view_cart.asp
      Disallow: /error.asp
      Disallow: /shipquote.asp
      Disallow: /rssfeed.asp
      Disallow: /mobile/
      Disallow: /admin/

      User-agent: Yandex
      Disallow: /

      User-agent: moget
      User-agent: ichiro
      Disallow: /

      User-agent: NaverBot
      User-agent: Yeti
      Disallow: /

      User-agent: Baiduspider
      User-agent: Baiduspider-video
      User-agent: Baiduspider-image
      Disallow: /

      User-agent: sogou spider
      Disallow: /

      User-agent: YoudaoBot
      Disallow: /

      http://www.metrodarts.com
      [email protected]
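
As a sanity check (not part of the original post), Python's standard-library `urllib.robotparser` can confirm that rules written this way really do block Baiduspider everywhere while leaving other crawlers free to fetch normal pages. `example.com` is a placeholder domain and the file below is a trimmed version of the rules above:

```python
# Sanity-check robots.txt rules with Python's standard library.
import urllib.robotparser

ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout.asp
Disallow: /admin/

User-agent: Baiduspider
User-agent: Baiduspider-image
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Baiduspider is blocked from everything...
print(rp.can_fetch("Baiduspider", "http://example.com/"))            # False
# ...other crawlers can still fetch normal pages...
print(rp.can_fetch("Googlebot", "http://example.com/product.asp"))   # True
# ...but not the cart/admin pages disallowed for everyone.
print(rp.can_fetch("Googlebot", "http://example.com/checkout.asp"))  # False
```

Note that robots.txt is only honored by well-behaved crawlers; a bot that ignores it has to be blocked at the server or CDN level instead.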



      • #4
        I missed this when 301bulls posted, so thank you. Can any of the resellers and/or 3dcart gurus review/comment on this? I don't want to do something that's going to blow up our site. Maybe someone can suggest if/where/how to implement? Maybe someone like DeanP etc :-)

        Also, I actually would be perfectly fine blocking whole countries - Russia, China, etc. - as mentioned by elightbox, but we don't use Cloudflare, so how could we do it?

        Though, an interesting point: although Baiduspider alone burns through nearly 16 MB of our bandwidth per day, China is a distant 8th place in our SmarterStats info - so I'm not sure how that all shakes out.

        Some other examples of bandwidth use on a given day: Unknown Bot, 70 MB; Pingdom Bot, 26 MB; Baiduspider, 16 MB; Unknown Crawler, 16 MB; BLEXBot Crawler, 5 MB.

        How do you nip "Unknown Bot" and "Unknown Crawler"?

        Also, I *TRIED* to upload a 44 KB screen capture of our stats showing the bot activity, but I keep getting an error, so no dice there.

        Thanks for reading/input.



        • #5
          Anyone know of any reason that the robots_ssl.txt file should contain anything but:

          User-agent: *
          Disallow: /



          • #6
            I wouldn't recommend blocking any country wholesale, or attempting to block any spiders.

            Perhaps the shopper is an American on vacation trying to place an order. Perhaps they're using a VPN for better privacy, so they only appear to be in Russia. Perhaps it's an Indian business wanting to ship things to their office in the US. There are numerous legitimate reasons for people in your "blocked" country list to be browsing your site, and blocking whole countries simply isn't something you should do on the internet. Not to mention that country blocking usually relies on IP address registration data, which can be terribly inaccurate. Instead, just don't list those countries as available shipping options - problem solved.

            Blocking spiders also isn't advantageous. If your bandwidth is cut so close that 2 GB is a major issue, you need to upgrade your server. And even if a bot belongs to a Chinese search engine and you don't sell in China, that doesn't mean China is the only place that search engine's data will show up: browser plugins, Chinese Americans, data sold on to other search engines, etc. Besides, if you let the bots do their thing, they should eventually quiet down once they've indexed your entire site; even Google's bots behave this way. You're on the internet - your site will get crawled. It's a fact of life.

            I would also recommend using something like Cloudflare, as it will alleviate your bandwidth issues immediately. Cloudflare also has the advantage of speeding up your website via magic and a global CDN. They block a significant amount of attack attempts too, so if you're concerned about traffic from "shady" countries like Russia, you don't have to worry about it with Cloudflare. Cloudflare has a free service tier, which is perfect for smaller sites. I'd note that 3dcart's "Enterprise" hosting plans include Cloudflare, so even 3dcart sees the benefits as significant.

            Lastly, in your robots_ssl.txt file, you don't want to block the entire site from being indexed! It should look very similar to your regular robots.txt and block the admin URLs (Disallow: /admin/). Google gives a leg up to sites that run entirely over SSL, so blocking every page on the SSL side would work against you: you won't get indexed, and that's bad.
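
To make that concrete, a minimal robots_ssl.txt along those lines might look like the sketch below. This is an illustration only, not an official 3dcart file; the paths are the standard cart/admin ones from the robots.txt posted earlier in this thread:

```
# Allow indexing of the SSL site, but keep crawlers out of cart/admin pages.
User-agent: *
Disallow: /checkout.asp
Disallow: /add_cart.asp
Disallow: /view_cart.asp
Disallow: /admin/
```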
            Last edited by Alupis; 05-04-2016, 11:51 AM.

