The following is an example using https://github.com/GoogleChrome/puppeteer

'use strict';

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        headless: true
    });
    const page = await browser.newPage();

    await page.goto('https://quora.com', {waitUntil: 'networkidle2'});

    // page.type(selector, text) focuses the field and types into it
    await page.type('.form_column input[name="email"]', 'MY_EMAIL');
    await page.type('.form_column input[name="password"]', 'MY PASSWORD');

    // Give the form a moment to settle before submitting
    await new Promise(resolve => setTimeout(resolve, 2 * 1000));

    await page.click('.form_column input[type="submit"]');

    // Wait for the results to show up
    await page.waitForSelector('.question_text .rendered_qtext');

    // Extract the results from the page
    const links = await page.evaluate(() => {
        const anchors = Array.from(document.querySelectorAll('.question_text .rendered_qtext'));
        return anchors.map(anchor => anchor.textContent);
    });
    console.log(links.join('\n'));

    await browser.close();
})();

The script logs into Quora and prints the titles of some answers.

Is my choice of selectors good? What are some good practices?
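On the extraction side, one practice worth illustrating is normalizing the scraped strings before printing them, so whitespace-only nodes don't pollute the output. A minimal sketch (the helper name `cleanTexts` is hypothetical, not part of the script above):

```javascript
// Hypothetical helper: trim scraped strings and drop empty entries.
function cleanTexts(texts) {
    return texts
        .map(text => text.trim())
        .filter(text => text.length > 0);
}

// Inside page.evaluate() this would wrap the textContent mapping:
//   return cleanTexts(anchors.map(anchor => anchor.textContent));

console.log(cleanTexts(['  How do I learn JS?  ', '\n', 'What is Puppeteer?']));
```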

alex
  • The name *web crawler* is generally associated with visiting *arbitrary* websites, so it will be looking for `<a>` elements, and recursing on those. This looks more like a *scraper*, which visits *one specific* site and downloads *some specific* content. For politeness, you should look to obey a site's [`robots.txt`](http://www.robotstxt.org/) if it has one – Caleth Aug 31 '17 at 13:33
  • @Caleth You're right. Scrapper was the word. – alex Aug 31 '17 at 13:40
  • http://www.dictionary.com/browse/scrapper – Robert Harvey Aug 31 '17 at 19:58
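Following up on the robots.txt point in the comments: a hand-rolled check might look like the sketch below. This is only a rough illustration (the function name `isAllowed` is mine; it ignores `Allow:` rules, wildcards, and crawl delays, which a real robots.txt library would handle):

```javascript
// Rough robots.txt check: returns true if `path` is not matched by any
// `Disallow:` prefix in the `User-agent: *` group. Ignores Allow:,
// wildcards, and Crawl-delay, so it is only a sketch.
function isAllowed(robotsTxt, path) {
    const lines = robotsTxt.split('\n').map(line => line.trim());
    let inStarGroup = false;
    const disallowed = [];
    for (const line of lines) {
        const [field, ...rest] = line.split(':');
        const value = rest.join(':').trim();
        if (/^user-agent$/i.test(field.trim())) {
            inStarGroup = value === '*';
        } else if (inStarGroup && /^disallow$/i.test(field.trim()) && value) {
            disallowed.push(value);
        }
    }
    return !disallowed.some(prefix => path.startsWith(prefix));
}

const robots = 'User-agent: *\nDisallow: /private';
console.log(isAllowed(robots, '/private/page'));
console.log(isAllowed(robots, '/questions'));
```

A scraper would fetch `https://example.com/robots.txt` once, then consult a check like this before requesting each page.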

0 Answers