Thursday, January 01, 2015

Curl

Curl is like a browser that runs from the command line to get content from a web site. I'll explain that in five pieces:

1. View a web page in a browser
2. View the source for that page
3. Use Curl to get source for a page
4. Why is this useful?
5. How can it be detected?

1. View a web page in a browser

To get to a web page, you type its URL into the browser's address bar. For example, typing an address like radicalsoftware.com loads that site's home page.

What you really receive when you request a web page is a file containing code that your browser interprets and transforms into something you understand.

2. View the source for that page

If you want to see that code, right-click anywhere on the page and choose "View Page Source" (most browsers also support the Ctrl+U shortcut).

You'll see something like this: a window full of raw HTML markup (tags such as <html>, <head> and <body>) along with any JavaScript and CSS the page includes.
3. Use Curl to get source for a page

Note: download all open source software at your own risk.

Download curl: http://curl.haxx.se/


Open a command window.

To get the code for a web site (preferably your own - please read the last section first):

curl [web site url] 


Just like with a web browser, it will get the source:
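For example, using my own site:

curl radicalsoftware.com

The raw HTML for the page prints straight to the command window instead of being rendered.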


View help:

curl -h 

To put the source you requested into a file:

curl [url] > [file]

curl radicalsoftware.com > radicalsoftware.html

The command above puts the source for radicalsoftware.com into a file called radicalsoftware.html.

To see if a specific string exists in the code you retrieved, you can use grep on Linux:

curl [url] | grep [string]

For example, if I want to see whether any line of the source contains the string "F5", I can do that as follows:
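curl radicalsoftware.com | grep F5

Any lines of the page source containing "F5" are printed to the terminal.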

I worked on one project at F5 Networks troubleshooting some Java code that was crashing during performance testing, so there's one line of code on my web site with a link to F5, and that line shows up as the output of my command.


4. Why is this useful?

There are many potential uses, good and not so nice. Here are a few:
  • Monitor a web site for a particular value to ensure it is up and running (see the sketch after this list)
  • Monitor a web site to see if a particular value appears that wasn't there originally
  • Monitor your web site's content to ensure it was not altered between deployments (one of multiple ways to do this)
  • Scrape all the content from a web site by spidering through all its URLs so an automated program can evaluate the content offline (hackers, competitors)
  • Scrape all the visible content from a web site when you don't have FTP or other access to the source code. Generally this is probably illegal activity, because anyone who actually owned the site would access it through alternative, more efficient means.
  • Automated submission of web requests for performance, security and automated testing, or for someone trying to use a site in a non-standard, automated way.
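Here's a minimal sketch of the first monitoring idea, assuming a small shell script run on a schedule; the URL and the string are placeholders you'd replace with your own site and a value that should always appear on it:

#!/bin/sh
# Minimal content check: alert if an expected string is missing from a page.
URL="http://radicalsoftware.com"    # placeholder: use your own site
STRING="F5"                         # placeholder: a value that should always be on the page

if curl -s "$URL" | grep -q "$STRING"; then
  echo "OK: '$STRING' found on $URL"
else
  echo "ALERT: '$STRING' not found on $URL"
fi

A real monitor would email or page someone in the ALERT branch, but the curl-plus-grep core is the same idea.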

5. How can it be detected?
  • Automated traffic from unsophisticated users of this and related tools will have obvious request headers indicating it is not human traffic (see the example after this list).
  • Automated traffic typically follows patterns that don't match human traffic patterns.
  • Excessive, repetitive traffic generally is not human, though it could be an entire organization behind a proxy server.
  • The source IP may be spoofed or compromised, but you can see the IP address sending excessive or repetitive traffic and block it.
  • Abnormal paths through web sites may indicate a non-human visitor.
  • Traffic from IPs in parts of the world where you don't do business is an indication of potential mischief.
  • Place honey tokens and pages you don't advertise on your web site, then watch for traffic hitting them; any hit is an indication of a potential bot.
  • In my case back in 2005 I wrote a kind of web application filter that would analyze requests and block traffic like this. It's not running at the time of this writing. You can see the results of traffic I discovered in this blog's history. Now there are commercial web application firewalls that do similar things.
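As a quick illustration of the first point above: by default curl announces itself in the User-Agent request header, which server logs and filters can spot right away. You can see the headers curl sends with its verbose flag (using my site again as the target):

curl -v radicalsoftware.com 2>&1 | grep "User-Agent"

The User-Agent shows up as something like curl/7.x rather than a browser name. Curl's -A option can change that header, which is why the other signals in this list still matter.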