
I need to make some GET/POST requests to a website that I have credentials to log in to. I plan to do this with Ruby and Net::HTTP. As I'm new to this kind of thing, I'm struggling with the fact that the log-in page requires robot verification (the check-box kind) - that means I'm not able to automate the log-in phase. Besides that, the server keeps the session alive only for some time: once it detects that no activity has taken place, it requests the log-in page again. The website is built with PHP and JS (most of it is JS), and it requires the user to enter a "restrict-area" browser mode after the log-in phase.

It would be no problem to manually log in and execute an operation (a few requests) every time I need it. But I don't know how I could pass credential information, such as the session ID, from the browser to my script. I need some conceptual ideas about this.
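
To make the idea concrete, this is roughly what I imagine doing with Net::HTTP once I've logged in manually and copied the session cookie out of the browser's dev tools - the cookie name, value and endpoint below are invented, since I don't know the site's internals:

```ruby
require 'net/http'
require 'json'
require 'uri'

# Placeholder endpoint; the real site exposes its data through Ajax calls.
uri = URI('https://example.com/restricted/some-ajax-endpoint')

# Session cookie copied by hand from the browser after a manual log-in
# (name and value here are made up).
session_cookie = 'PHPSESSID=abc123def456'

http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = (uri.scheme == 'https')

request = Net::HTTP::Get.new(uri)
request['Cookie'] = session_cookie
request['X-Requested-With'] = 'XMLHttpRequest' # some Ajax endpoints check for this

response = http.request(request)
puts response.code
puts JSON.parse(response.body) if response['Content-Type'].to_s.include?('json')
```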

Additional information:

  • There is no public API.
  • The "restrict-area" browser's mode is a browser without some buttons (forward and backward in history pages) and it don't permit to change the URL - that is all I know.
  • I need this for automating some manual tasks that take hours to do.
  • The website uses Ajax.

If additional information is needed I can add it, just ask in the comments.

Thanks in advance!

EDIT

My intention isn't to crawl random websites, but to make specific HTTP requests to a specific website where credentials are necessary to do so.

  • By adding ReCaptcha to their page, they have decided to prevent precisely what you're attempting to do. You might want to check their Terms of Service; my guess is that they forbid this kind of automation. – Robert Harvey Mar 05 '18 at 19:26
  • Unfortunately, it's just not possible without some artificial intelligence to both understand the instructions and interact with the Captcha control. Once you get in, you can check the network traffic for each page request to see which cookies are in use, and if there are any `Authorization: Bearer` tokens. Unfortunately, without a public API, what you build is prone to break every time the target website does a new release. – Berin Loritsch Mar 05 '18 at 19:27
  • @BerinLoritsch: The flip side being that you can be reasonably sure that a public-facing web page will be properly maintained. APIs, not so much. They are often neglected and sometimes don't work at all; getting fixes can be a slow process. – Robert Harvey Mar 05 '18 at 19:32
  • Thank you for the answers, guys. I imagined that the ReCaptcha would be there for that reason. This website still has some improvements to make, but they prefer to monopolize their service - it s*cks. Thanks again anyway! – Pedro Gabriel Lima Mar 05 '18 at 20:23
  • Possible duplicate of [How to be a good citizen when crawling web sites?](https://softwareengineering.stackexchange.com/questions/91760/how-to-be-a-good-citizen-when-crawling-web-sites) – gnat Mar 05 '18 at 20:46
  • I don't think this is a duplicate of your link. My intention isn't to crawl random sites, but to make specific HTTP requests to a specific website where credentials are necessary to do so. – Pedro Gabriel Lima Mar 05 '18 at 21:02
  • In theory, you can log in using a browser, steal a cookie, and then use that cookie in your program while emulating a browser. I'm not sure it's worth the effort, though. – Mael Mar 06 '18 at 07:02
  • @Mael, that was my idea! Hijack the session. I was checking the packet traffic to their website and apparently they only change cookies after a re-log. As for whether it's worth it: they have an established website for now - after building a structure for this problem, changes would only be needed for a few things. I will look further into their security... – Pedro Gabriel Lima Mar 06 '18 at 12:00
  • For those who think badly of this idea: this website is software we pay to use. But, as a lot of other companies pay for it too, the software company is focused on solving the problems that affect most of their clients - not the specific ones our company is dealing with right now. – Pedro Gabriel Lima Mar 06 '18 at 12:05
  • @PedroGabrielLima: If you pay for the software, be aware that your contract with the company may _explicitly_ forbid what you are trying to do. Please get appropriate legal advice and / or a sign off from your manager (as applicable) before you try what you are doing. – sleske Mar 06 '18 at 13:25
  • @sleske - Ok, I will take your advice and have a look at that. I really appreciate your concern! – Pedro Gabriel Lima Mar 06 '18 at 13:41

1 Answer


For JS-intensive websites, it might be much more convenient to use a "headless browser" approach, such as the capybara-webkit gem, which basically allows automation on top of WebKit, the popular browser engine behind Safari (and the basis for the engines in Chrome and Opera). I'm not sure it's good enough to cheat the robot verification (leaving the moral aspect aside), but at least it beats Net::HTTP in cases like getting Google search results.
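
As a rough illustration (not tied to your particular site), a minimal capybara-webkit session driving an external page could look like this - the URL and the form-field locators are placeholders:

```ruby
require 'capybara'
require 'capybara/webkit'  # registers the :webkit driver

Capybara.run_server = false  # we are driving an external site, not a local Rack app

session = Capybara::Session.new(:webkit)
session.visit('https://example.com/login')   # placeholder URL

# Field and button locators are assumptions; inspect the real log-in form to find them.
session.fill_in 'username', with: 'my_user'
session.fill_in 'password', with: 'my_password'
session.click_button 'Log in'

# Because the page is rendered by a real WebKit engine,
# JS-generated content is present in the resulting DOM.
puts session.html
```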

Also, have a look at PhantomJS, which is a JS browser-automation tool (as capybara-webkit is a Ruby one); it gives the additional convenience of working with in-page elements in the same language that controls the browser.

  • Hey, thank you for the insight, +1. I didn't make clear in the question what kind of requests I will be doing, but it's basically requests to specific endpoints and dealing with JSON data. I will probably not need to scrape pages for data. Right now I'm testing Ruby Mechanize. My intention is to copy the cookies from an open session in Firefox and use Mechanize to simulate a parallel browser using the same session (session hijacking), so I won't be dealing with the robot verification - roughly as in the sketch below. – Pedro Gabriel Lima Mar 06 '18 at 13:36
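
A minimal Mechanize sketch of that idea, assuming the session cookie has been copied by hand from the logged-in Firefox session (the cookie name, value, domain and endpoint below are made up):

```ruby
require 'mechanize'

agent = Mechanize.new

# Session cookie copied from the browser's dev tools after a manual log-in
# (name, value and domain here are placeholders).
cookie = Mechanize::Cookie.new('PHPSESSID', 'abc123def456')
cookie.domain = '.example.com'
cookie.path   = '/'
agent.cookie_jar.add(cookie)

# With the cookie in the jar, requests ride on the existing session,
# so no log-in form (and no ReCaptcha) is involved.
page = agent.get('https://example.com/restricted/api/report')  # placeholder endpoint
puts page.body
```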