Duplicate content hit because of https

  • Duplicate content hit because of https

    Google has indexed a couple thousand of our pages at both http and https, and of course it's flagging them as duplicate content. So, two questions:

    1) How do I redirect all the indexed https pages to http? Is there a way to use a pattern-matching rule to redirect all https pages except those like myaccount and the checkout pages, or will I have to create a redirect for each one?

    2) How do we prevent this going forward? Block indexing of all https pages in robots.txt?
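
    For question 1, here is a minimal sketch of a pattern-matching rule, assuming the site can use Apache .htaccess / mod_rewrite (the hosting platform may handle redirects elsewhere) and using /myaccount and /checkout as placeholder paths for the pages that must stay on https:

    Code:
    # Sketch only: assumes Apache with mod_rewrite enabled.
    RewriteEngine On
    # Act only on requests that arrived over HTTPS.
    RewriteCond %{HTTPS} on
    # Skip the pages that must stay secure (placeholder paths).
    RewriteCond %{REQUEST_URI} !^/(myaccount|checkout)
    # 301-redirect everything else to the same URL on plain HTTP.
    RewriteRule ^(.*)$ http://%{HTTP_HOST}/$1 [R=301,L]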

  • #2
    Look at your Robots_ssl.txt under Marketing > SEO Tools; it should disallow all https pages. If this file was changed, you can click "restore default".


    • #3
      Originally posted by wcsjohn
      Google has indexed a couple thousand of our pages at both http and https, and of course it's flagging them as duplicate content. So, two questions:

      1) How do I redirect all the indexed https pages to http? Is there a way to use a pattern-matching rule to redirect all https pages except those like myaccount and the checkout pages, or will I have to create a redirect for each one?

      2) How do we prevent this going forward? Block indexing of all https pages in robots.txt?
      Also check the canonical when viewing a page via HTTPS; it should be set to the HTTP version of the page so it's clear to Google that it's not duplicate content, and so any SEO weight from the HTTPS page is passed to the HTTP version.
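
      For reference, the canonical on the HTTPS copy of a page would look something like this in the page's <head> (the domain and path are placeholders):

      Code:
      <!-- On the HTTPS page, point search engines at the HTTP version of the same URL. -->
      <link rel="canonical" href="http://www.example.com/some-product-page/" />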


      • #4
        Robots_ssl.txt states:

        Code:
        # Disallow all crawlers access to all pages. SSL
        User-agent: *
        Disallow: /admin
        I never changed it, so I'm not sure how that happened.

        I restored it to:

        Code:
        # Disallow all crawlers access to all pages. SSL
        User-agent: *
        Disallow: /
        That should take care of the indexing problem going forward.

        The canonical on a page served over https does point to http, so that's correct. Why, then, am I getting duplicate content warnings on all of these pages? And how do I deal with the couple thousand that are already indexed when I can't redirect the https pages?


        • #5
          Actually, I just had a chat with my SEO guy and he clarified the problem. Let me try to explain. Google is visiting pages on https. The pages have rel=canonical pointing to the http page, but robots_ssl.txt is blocking googlebot from reaching the content, so it can't see the canonical. He says that in order for this to work, I should remove the following line:

          Code:
          Disallow: /
          This would allow the canonical to be found. BUT it would significantly reduce the crawl budget, which should be going to the http pages. The better solution is to redirect; a 301 redirect is always better than a canonical.


          • #6
            I think you may be putting too much faith in what your SEO guy says. A 301 says "this page is not valid, you should use this page instead." To me that does not apply to HTTPS pages, since they are valid. This situation is exactly what canonical links are for.

            How and When to Use 301 Redirects vs. Canonical - Search Engine Watch (#SEW)


            • #7
              I understand what you're saying, but your site has a limited crawl budget. The data shows googlebot spending far too much of it on the https pages (literally 75% of the pages it's trying to index), which is reducing how often the rest of the site gets crawled. How do you deal with that?

              And even if we decided to use the canonical instead and let Google crawl both http and https, we would have to actually allow googlebot to see the page so it can pick up the canonical. So the default robots_ssl.txt file is wrong.
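
              If the https pages are opened up to crawling, a minimal sketch of what robots_ssl.txt could look like, keeping the admin and secure areas blocked while letting googlebot reach the content pages (and therefore the canonical); the /checkout and /myaccount paths are assumptions based on the pages mentioned in this thread:

              Code:
              # Allow crawling of HTTPS pages so rel=canonical can be seen,
              # but keep private/secure areas blocked. Paths are examples only.
              User-agent: *
              Disallow: /admin
              Disallow: /checkout
              Disallow: /myaccount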
