Crawlability Masterclass: Robots.txt, Sitemap XML, and...


JasaSEO.id Team
25 Jan 2026

TL;DR (Quick Summary)

Crawlability is a search engine's ability to discover, access, and crawl the pages on your website. Crawl budget is the number of pages Googlebot crawls on your website within a given period.

Crawlability Masterclass: Robots.txt, Sitemap XML, and Indexing for SEO 2026

Think of Googlebot as a VIP guest visiting your website. Your job is to be a good host.

If the setup is wrong, this VIP guest can get lost or may not even come back.

This article breaks down:

  • What crawlability is and why it matters

  • How Googlebot works

  • Robots.txt best practices

  • Sitemap XML optimization

  • Crawl budget management

  • Common crawl errors and their solutions

What Is Crawlability?

Crawlability is a search engine's ability to discover, access, and crawl the pages on your website.

The 3 stages of the process:

  • Discovery - Googlebot finds the URL (via sitemap, internal links, external links)

  • Crawling - Googlebot accesses and reads the page content

  • Indexing - Google stores the page in its database so it can appear in search results

How Googlebot Works

Googlebot is the crawler (spider/bot) that Google uses to discover and crawl websites.

Crawl Process:

  • Start from seed URLs (sitemap, homepage, known URLs)

  • Follow links from pages that have already been crawled

  • Download the HTML and parse the content

  • Extract links to crawl next

  • Repeat until the crawl budget runs out
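
The loop above can be sketched as a breadth-first traversal with a page limit. This is only a simplified illustration, not how Googlebot is actually implemented: the `SITE` link graph, the `crawl` function, and the fixed `budget` number are all assumptions made for the example, and an in-memory dictionary stands in for real HTTP fetching and HTML parsing.

```python
from collections import deque

def crawl(link_graph, seeds, budget):
    """Breadth-first crawl over an in-memory link graph, stopping after `budget` pages."""
    frontier = deque(seeds)   # URLs waiting to be fetched
    seen = set(seeds)         # URLs already queued, so nothing is queued twice
    crawled = []              # pages in the order they were fetched
    while frontier and len(crawled) < budget:
        url = frontier.popleft()
        crawled.append(url)   # stands in for "download the HTML and parse it"
        for link in link_graph.get(url, []):   # "extract links to crawl next"
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return crawled

# Hypothetical site structure: page -> pages it links to.
SITE = {
    "/": ["/blog/", "/products/"],
    "/blog/": ["/blog/seo-guide", "/"],
    "/products/": ["/products/a", "/products/b"],
    "/blog/seo-guide": [],
    "/products/a": [],
    "/products/b": [],
}

print(crawl(SITE, ["/"], budget=4))
# ['/', '/blog/', '/products/', '/blog/seo-guide']
```

With a budget of 4, the two product pages are discovered but never fetched, which is the situation crawl budget management tries to avoid for important URLs.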

Crawl Budget

Crawl budget is the number of pages Googlebot crawls on your website within a given period.

Factors that affect crawl budget:

  • Site authority - High-authority sites tend to get a larger crawl budget

  • Server performance - A fast server lets Googlebot crawl more pages

  • Site size - Larger sites need to manage crawl budget more carefully

  • Update frequency - Sites that update often are crawled more often

For small websites (<1,000 pages): Crawl budget is usually not a problem.

For large websites (10,000+ pages): Crawl budget optimization is critical to ensure important pages get crawled.

Robots.txt: The Complete Guide

Robots.txt is a text file that tells search engine crawlers which pages they may and may not crawl.

File Location

https://yoursite.com/robots.txt

The file must be in the website's root directory.

Basic Syntax

User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Sitemap: https://yoursite.com/sitemap.xml

Explanation:

  • User-agent: * - Applies to all bots

  • Disallow: /admin/ - Blocks all URLs that start with /admin/

  • Allow: /public/ - Explicitly allows /public/ (overrides a broader Disallow that would otherwise match)

  • Sitemap: - Tells bots where the sitemap is
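
To sanity-check rules like these before deploying them, one option is Python's standard-library `urllib.robotparser`. The sketch below parses the same rules from a string and asks whether a few example paths may be fetched; the paths are made up for illustration. Note that this parser applies rules in file order and does not implement Google's wildcard matching, so treat it as a rough check rather than a replica of Googlebot's behavior.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Sitemap: https://yoursite.com/sitemap.xml
"""

def build_parser(robots_txt):
    """Parse robots.txt content from a string instead of fetching it over HTTP."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# Example paths (assumed for illustration) checked against the "*" user agent.
for path in ("/admin/settings", "/public/page", "/blog/post"):
    url = "https://yoursite.com" + path
    print(path, parser.can_fetch("*", url))

# Sitemap lines are exposed separately (Python 3.8+).
print(parser.site_maps())
```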

Robots.txt Best Practices

Don't Block Important Pages

⚠️ Bad:
User-agent: *
Disallow: /blog/

This blocks all blog posts from being crawled!

✅ Good:
User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Allow: /

Block Low-Value Pages

Pages that are usually worth blocking:

  • Admin panels (/admin/, /wp-admin/)

  • Login pages (/login/, /signin/)

  • Thank you pages (/thank-you/)

  • Search result pages (/search?q=)

  • Filter/sort pages (/?filter=, /?sort=)

Example:

User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /search?
Disallow: /*?filter=
Disallow: /*?sort=
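
Wildcard rules such as `/*?filter=` are easy to get subtly wrong, so it can help to test them against sample URLs. The sketch below is a small, unofficial matcher written for this article: it assumes the commonly documented semantics (`*` matches any run of characters, a trailing `$` anchors the end, the longest matching pattern wins, and Allow wins a tie). The function names and test paths are hypothetical.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def is_allowed(path, rules):
    """rules: list of (directive, pattern) with directive 'allow' or 'disallow'.

    The longest matching pattern wins; on a tie, 'allow' wins.
    """
    best = None  # (pattern length, is_allow)
    for directive, pattern in rules:
        if pattern and pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), directive == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]

RULES = [
    ("disallow", "/admin/"),
    ("disallow", "/wp-admin/"),
    ("disallow", "/search?"),
    ("disallow", "/*?filter="),
    ("disallow", "/*?sort="),
]

for path in ("/search?q=seo", "/shoes?filter=red", "/shoes?page=2&filter=red", "/blog/post"):
    print(path, is_allowed(path, RULES))
```

Under these assumptions, `/shoes?page=2&filter=red` is still allowed, because `/*?filter=` only matches when `filter` is the first query parameter. If filter URLs can carry other parameters first, an additional pattern such as `/*&filter=` may be needed.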

Allow CSS and JavaScript

⚠️ Bad (Old Practice):
Disallow: /css/
Disallow: /js/

Google needs CSS/JS to render pages correctly.

✅ Good:
Allow: /css/
Allow: /js/

Specify Sitemap Location

Sitemap: https://yoursite.com/sitemap.xml
Sitemap: https://yoursite.com/sitemap-products.xml
Sitemap: https://yoursite.com/sitemap-blog.xml

You can specify multiple sitemaps.

Common Robots.txt Mistakes

⚠️ Mistake #1: Accidentally Blocking Entire Site

User-agent: *
Disallow: /

This blocks ALL pages from being crawled!

When this is OK: During development/staging (but don't forget to remove it at launch).

⚠️ Mistake #2: Blocking Important Resources

Disallow: /images/

Google needs images to understand the content.

⚠️ Mistake #3: Using Noindex in Robots.txt

⚠️ Bad:
User-agent: *
Noindex: /old-page/

Noindex is not a valid directive in robots.txt. Use a meta tag instead, and leave the page crawlable so Googlebot can actually see the tag.

✅ Good:
<!-- In HTML <head> -->
<meta name="robots" content="noindex, follow">
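
When auditing many pages, it can be useful to check programmatically whether a page carries a robots meta tag with `noindex`. Below is a minimal sketch using Python's standard-library `html.parser`; the class and function names are invented for this example, and it only looks at `<meta name="robots">`, not at bot-specific meta tags or the `X-Robots-Tag` HTTP header.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            # "noindex, follow" -> {"noindex", "follow"}
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )

def has_noindex(html):
    """Return True if the HTML contains a robots meta tag with a noindex directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives

page = '<html><head><meta name="robots" content="noindex, follow"></head><body></body></html>'
print(has_noindex(page))  # True
```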

Testing Robots.txt

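Google Search Console:
  • Open the robots.txt report (under Settings) to confirm that Google fetched and parsed the file
  • Enter the URL you want to test in the URL Inspection tool
  • The tool shows whether the URL is blocked or allowed by robots.txt (the standalone Robots.txt Tester has been retired).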

Sitemap XML: Optimization Guide

An XML sitemap is a file that lists the URLs on your website that you want Google to index.

Basic Format

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://yoursite.com/</loc>
    <lastmod>2026-01-25</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://yoursite.com/blog/seo-guide</loc>
    <lastmod>2026-01-20</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>

Elements:

  • <loc> - URL (required)

  • <lastmod> - Last modified date (optional)

  • <changefreq> - Update frequency (optional, mostly ignored by Google)

  • <priority> - Relative priority 0.0-1.0 (optional, mostly ignored by Google)
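
For a custom site, a sitemap in this format can be generated from whatever list of pages the application already has. The sketch below uses Python's standard-library `xml.etree.ElementTree`; the `build_sitemap` function and the hard-coded page list are assumptions for illustration (in practice the list would come from your database). It emits only `<loc>` and `<lastmod>`, since the other two elements are mostly ignored by Google.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """Build sitemap XML from (url, lastmod) tuples; lastmod is 'YYYY-MM-DD' or None."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, lastmod in pages:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url   # special characters are escaped on output
        if lastmod:
            ET.SubElement(entry, "lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body

# Hypothetical page list; a real site would query its database here.
PAGES = [
    ("https://yoursite.com/", "2026-01-25"),
    ("https://yoursite.com/blog/seo-guide", "2026-01-20"),
]

print(build_sitemap(PAGES))
```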

Sitemap Best Practices

Only Include Indexable URLs

Include:

  • ✅ Canonical URLs

  • ✅ Important pages (products, blog posts, categories)

  • ✅ Recently updated pages

Don't include:

  • ⚠️ URLs with a noindex tag

  • ⚠️ Redirected URLs (301/302)

  • ⚠️ URLs blocked by robots.txt

  • ⚠️ Duplicate content

  • ⚠️ Low-value pages (filters, sorts)

Keep Sitemap Under 50MB / 50,000 URLs

Limits:

  • Max 50MB (uncompressed)

  • Max 50,000 URLs per sitemap

If larger: Split into multiple sitemaps and list them in a sitemap index.

Sitemap Index:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://yoursite.com/sitemap-posts.xml</loc>
    <lastmod>2026-01-25</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://yoursite.com/sitemap-products.xml</loc>
    <lastmod>2026-01-25</lastmod>
  </sitemap>
</sitemapindex>
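
Splitting a large URL list and producing the matching index can be automated. The following sketch handles only the 50,000-URL limit (it does not measure file size), and the `sitemap-products-N.xml` naming scheme, the function names, and the generated product URLs are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-sitemap URL limit

def chunk_urls(urls, size=MAX_URLS):
    """Split a URL list into consecutive chunks of at most `size` URLs."""
    urls = list(urls)
    return [urls[i:i + size] for i in range(0, len(urls), size)]

def build_sitemap_index(sitemap_urls, lastmod):
    """Build a sitemap index that points at each child sitemap file."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "lastmod").text = lastmod
    body = ET.tostring(index, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body

# Hypothetical catalogue of 120,000 product URLs.
urls = [f"https://yoursite.com/product/{i}" for i in range(120_000)]
chunks = chunk_urls(urls)

# Hypothetical file naming scheme: one child sitemap per chunk.
names = [f"https://yoursite.com/sitemap-products-{n}.xml" for n in range(1, len(chunks) + 1)]

print([len(chunk) for chunk in chunks])  # [50000, 50000, 20000]
print(build_sitemap_index(names, "2026-01-25"))
```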

Update Sitemap Automatically

For WordPress: Use a plugin such as Yoast SEO or Rank Math (both auto-generate the sitemap).

For custom sites: Generate the sitemap dynamically from the database.

Avoid: Manually updating the sitemap (error-prone).

Submit to Google Search Console

  • Go to Sitemaps section

  • Enter sitemap URL (e.g., sitemap.xml)

  • Click Submit

Google will then crawl the sitemap and discover the URLs in it.

Sitemap Types

Standard Sitemap (Pages)

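<url>
  <loc>https://yoursite.com/page</loc>
</url>

Image Sitemap

<url>
  <loc>https://yoursite.com/page</loc>
  <image:image>
    <image:loc>https://yoursite.com/image.jpg</image:loc>
    <image:caption>Image caption</image:caption>
  </image:image>
</url>

Video Sitemap

<url>
  <loc>https://yoursite.com/page</loc>
  <video:video>
    <video:thumbnail_loc>https://yoursite.com/thumb.jpg</video:thumbnail_loc>
    <video:title>Video Title</video:title>
    <video:description>Video description</video:description>
    <video:content_loc>https://yoursite.com/video.mp4</video:content_loc>
  </video:video>
</url>

Google requires either <video:content_loc> or <video:player_loc> in a video entry; the file URL above is a placeholder. The image: and video: prefixes must also be declared as XML namespaces on the enclosing <urlset> element.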

News Sitemap

For news sites (a special format that includes the publication date).

Internal Linking for Crawlability

Internal links are the main way Googlebot discovers new pages.

Best Practices:

Flat Site Architecture

⚠️ Bad (Deep):
Homepage → Category → Subcategory → Product → Variant (4 clicks, 5 levels deep)
✅ Good (Flat):
Homepage → Product (1-2 clicks)
Rule of thumb: Important pages should be at most 3 clicks from the homepage.
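
Click depth can be measured with a breadth-first search over the internal link graph, starting at the homepage. The sketch below assumes you already have that graph (for example, exported from a crawler) as a dictionary; the function names, the `DEEP_SITE` sample data, and the default threshold parameter are illustrative assumptions.

```python
from collections import deque

def click_depths(link_graph, home="/"):
    """Return the minimum number of clicks from `home` to every reachable page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in link_graph.get(page, []):
            if target not in depths:          # first visit is the shortest path in BFS
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

def too_deep(link_graph, max_clicks=3, home="/"):
    """List reachable pages that sit more than `max_clicks` from the homepage."""
    depths = click_depths(link_graph, home)
    return sorted(page for page, depth in depths.items() if depth > max_clicks)

# Hypothetical deep architecture, mirroring the "Bad" example.
DEEP_SITE = {
    "/": ["/category"],
    "/category": ["/category/sub"],
    "/category/sub": ["/product"],
    "/product": ["/product/variant"],
}

print(click_depths(DEEP_SITE))
print(too_deep(DEEP_SITE))  # ['/product/variant']
```

Adding a direct link from the homepage to `/product` would bring both the product and its variant within the threshold.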

Avoid Orphan Pages

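An orphan page is a page with no internal links pointing to it from other pages.

How to find them:

  • Crawl the site with Screaming Frog, giving it the XML sitemap as an extra URL source (a crawl that only follows links cannot reach orphan pages)
  • Filter for pages with 0 inlinks
  • Add internal links from related pages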
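
The same check can be expressed as a set difference: take every URL you know about (for instance, from the sitemap) and subtract every URL that receives at least one internal link. This is a minimal sketch with made-up data; `find_orphans`, `KNOWN_URLS`, and `LINKS` are hypothetical names, and the homepage is excluded because it is the entry point rather than an orphan.

```python
def find_orphans(known_urls, link_graph, home="/"):
    """Return known URLs that no other page links to.

    known_urls: every URL the site should expose (e.g. from the sitemap).
    link_graph: dict mapping a page to the pages it links to.
    """
    linked = {
        target
        for source, targets in link_graph.items()
        for target in targets
        if target != source          # a page linking to itself does not count
    }
    return sorted(set(known_urls) - linked - {home})

# Hypothetical data: sitemap URLs and the internal links found by a crawl.
KNOWN_URLS = ["/", "/blog/", "/blog/seo-guide", "/landing/old-promo"]
LINKS = {
    "/": ["/blog/"],
    "/blog/": ["/blog/seo-guide", "/"],
}

print(find_orphans(KNOWN_URLS, LINKS))  # ['/landing/old-promo']
```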

Use Descriptive Anchor Text

⚠️ Bad:
<a href="/product">Click here</a>
✅ Good:
<a href="/product">Premium SEO Services</a>

Pages with more internal links tend to get higher crawl priority.

Strategy: Link to money pages from:

  • Homepage

  • Navigation menu

  • Footer

  • Related posts sections

  • Breadcrumbs

Crawl Budget Optimization

For large sites (10,000+ pages), crawl budget optimization is critical.

Strategies:

Fix Crawl Errors

Common errors:

  • 404 Not Found

  • 500 Server Error

  • Redirect chains

  • Slow pages

Check in GSC: Coverage Report → Errors tab

Block Low-Value Pages

Use robots.txt to block:

  • Admin pages

  • Search result pages

  • Filter/sort variations

  • Duplicate content

Improve Server Response Time

Target: < 200ms server response time (TTFB)

How:

  • Use CDN

  • Enable caching

  • Optimize database queries

  • Upgrade hosting

Reduce Redirect Chains

⚠️ Bad:
Page A → 301 → Page B → 301 → Page C
✅ Good:
Page A → 301 → Page C
Page B → 301 → Page C
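
If your redirects live in a single mapping (a CMS table, a server config export, and so on), chains can be collapsed automatically so that every source points straight at its final destination. The sketch below assumes such a mapping is available as a dictionary; `flatten_redirects` and the sample paths are hypothetical.

```python
def flatten_redirects(redirects):
    """Map every redirect source directly to its final destination.

    redirects: dict of source path -> target path.
    Raises ValueError if a redirect loop is found.
    """
    flat = {}
    for source in redirects:
        seen = {source}
        target = redirects[source]
        while target in redirects:        # the target is itself redirected: keep following
            if target in seen:
                raise ValueError(f"redirect loop involving {target}")
            seen.add(target)
            target = redirects[target]
        flat[source] = target
    return flat

# Hypothetical chain: A -> B -> C
CHAIN = {"/page-a": "/page-b", "/page-b": "/page-c"}

print(flatten_redirects(CHAIN))
# {'/page-a': '/page-c', '/page-b': '/page-c'}
```

Raising on loops is a deliberate choice here: a loop has no valid final destination, so it is better surfaced as an error than silently rewritten.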

Use Canonical Tags

For duplicate content, use canonical tags instead of creating multiple crawlable versions.

📖 Learn more: Canonical Tag Guide

Common Crawl Errors & Solutions

Error 1: Server Error (5xx)

Cause: The server was down or overloaded when Googlebot crawled.

Solution:

  • Upgrade hosting

  • Optimize server performance

  • Enable caching

  • Use CDN

Error 2: Soft 404

Cause: The page returns a 200 status, but its content says "not found" or is empty.

Solution:

  • Return a proper 404 status for deleted pages

  • Or 301-redirect to a relevant page

Error 3: Redirect Error

Cause: Redirect chains or redirect loops.

Solution:

  • Fix redirect chains (direct redirect)

  • Check for redirect loops

  • Use 301 (permanent) instead of 302 (temporary) when appropriate

Error 4: Blocked by Robots.txt

Cause: Important pages were accidentally blocked.

Solution:

  • Review robots.txt

  • Remove overly broad disallow rules

  • Test with the robots.txt report and URL Inspection tool in GSC

Error 5: Crawled - Currently Not Indexed

Cause: Google crawled the page but decided not to index it (low quality, duplicate, crawl budget).

Solution:

  • Improve content quality

  • Fix duplicate content

  • Add internal links

  • Improve page speed

📖 Learn more: Discovered Not Indexed: How to Fix It

Monitoring Crawlability

Google Search Console

Coverage Report:

  • Indexed pages - Successfully crawled and indexed

  • Excluded pages - Crawled but not indexed (check why)

  • Error pages - Crawl errors (fix ASAP)

URL Inspection Tool:

  • Check individual URL crawl status

  • See last crawl date

  • Request indexing

Log File Analysis

For advanced users: analyze server logs to see exactly what Googlebot crawls.

Tools:

  • Screaming Frog Log File Analyzer

  • Botify

  • OnCrawl

Insights:

  • Which pages Googlebot crawls most

  • Crawl frequency per page type

  • Wasted crawl budget on low-value pages
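
Without a dedicated tool, a first pass at these insights can be done with a short script. The sketch below assumes access logs in the common "combined" format and counts Googlebot requests per top-level site section; the regular expression, the function name, and the sample log lines are all invented for illustration. It trusts the user-agent string, which can be spoofed, so a serious analysis would also verify that the requests really come from Google.

```python
import re
from collections import Counter

# Matches the request, status, and final quoted field (user agent) of a combined-format line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3}) .*"(?P<agent>[^"]*)"$'
)

def googlebot_hits_by_section(lines):
    """Count Googlebot requests per top-level section, e.g. '/blog' or '/search'."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line.strip())
        if not match or "Googlebot" not in match.group("agent"):
            continue
        path = match.group("path").split("?", 1)[0]       # drop the query string
        first_segment = path.strip("/").split("/", 1)[0]  # '' for the homepage
        counts["/" + first_segment] += 1
    return counts

GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BROWSER = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

# Made-up sample lines, not real traffic.
SAMPLE_LOG = [
    f'66.249.66.1 - - [25/Jan/2026:10:00:00 +0000] "GET /blog/seo-guide HTTP/1.1" 200 5120 "-" "{GOOGLEBOT}"',
    f'66.249.66.1 - - [25/Jan/2026:10:00:05 +0000] "GET /search?q=seo HTTP/1.1" 200 2048 "-" "{GOOGLEBOT}"',
    f'66.249.66.1 - - [25/Jan/2026:10:00:09 +0000] "GET /search?q=links HTTP/1.1" 200 2048 "-" "{GOOGLEBOT}"',
    f'203.0.113.7 - - [25/Jan/2026:10:00:12 +0000] "GET /blog/seo-guide HTTP/1.1" 200 5120 "-" "{BROWSER}"',
    f'66.249.66.1 - - [25/Jan/2026:10:00:15 +0000] "GET / HTTP/1.1" 200 1024 "-" "{GOOGLEBOT}"',
]

for section, hits in googlebot_hits_by_section(SAMPLE_LOG).most_common():
    print(section, hits)
```

In this sample, `/search` receives the most Googlebot requests, which is the kind of wasted crawl budget on low-value pages that log analysis is meant to reveal.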

Conclusion

Crawlability is the foundation of SEO. If Google cannot crawl your pages, even the best content will not rank.

Key Takeaways:

  • Robots.txt: Block low-value pages, allow important pages

  • Sitemap XML: Include only indexable URLs, update automatically

  • Internal linking: Flat architecture, no orphan pages

  • Crawl budget: Optimize for large sites, fix errors

  • Monitor: Use GSC Coverage Report regularly

Action Items:

  • [ ] Audit robots.txt (ensure it does not block important pages)

  • [ ] Submit your sitemap in Google Search Console

  • [ ] Fix crawl errors in the GSC Coverage Report

  • [ ] Check for orphan pages (add internal links)

  • [ ] Optimize server response time (<200ms)

  • [ ] Monitor crawl stats monthly

Good crawlability = the foundation for optimal indexing and ranking.

