Scrapy will drop support for Python 2.5

After a year of considering it, we have decided to go ahead and drop support for Python 2.5 in Scrapy.

Starting from 0.15, Scrapy will require Python 2.6 or above.

The main reasons for dropping support for Python 2.5 are:

  1. most people have already moved to 2.6 or 2.7
  2. we are no longer testing Scrapy on Python 2.5 ourselves, so it's quite possible that we release changes incompatible with Python 2.5 by mistake. In fact, this has happened in the past and was only discovered after a while, which supports our belief that Python 2.5 is barely used anymore.
  3. having to check compatibility with 2.5 has become a painful, time-consuming and error-prone task. We would rather spend that time on improving the framework.
  4. there are some ugly bugs in the standard library of Python 2.5 (mainly in urlparse, urllib & cookielib modules) that we would like to leave behind
  5. there is some boilerplate code to keep backwards compatibility with 2.5 (scrapy.utils.py26 module, optional dependency on simplejson) that we want to remove to keep a cleaner codebase
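
As an illustration of the kind of boilerplate mentioned in point 5, here is a minimal sketch of the usual fallback import for an optional simplejson dependency. This is a generic pattern, shown as an assumption about the shape of such code, not a copy of what Scrapy ships:

```python
# Prefer the json module from the standard library (available since Python 2.6)
# and fall back to the third-party simplejson package on older interpreters.
try:
    import json
except ImportError:
    import simplejson as json

# Round-trip a small document to show that both modules expose the same API.
encoded = json.dumps({"name": "scrapy", "min_python": "2.6"}, sort_keys=True)
decoded = json.loads(encoded)
```

Once Python 2.6 is the minimum supported version, the try/except collapses to a plain "import json".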

The code cleanup and the changes haven't been done yet, nor will they happen all at once in a single commit, so Scrapy 0.15 still works on Python 2.5. Not for long, however, since we will be making these changes during the remaining development phase of 0.15.

Scrapy 0.14 released

After 10 months of work, and many changes, we are pleased to announce the release of Scrapy 0.14.

For a detailed list of changes see the release notes:
https://github.com/scrapy/scrapy/wiki/Scrapy-0.14-release-notes

We expect this one to be the last release in the Scrapy 0.x series, and we are aiming for Scrapy 0.15 to become 1.0 sometime around the first quarter of 2012, at which point we may consider adopting semantic versioning (semver.org).

You may have also noticed that the source code, wiki and issues have been moved to Github and the documentation to readthedocs. Here are the new official links:

Homepage:
http://scrapy.org
Source code, issues & wiki:
https://github.com/scrapy/scrapy
Documentation (all versions):
http://readthedocs.org/projects/scrapy/

Happy scraping!

New example Scrapy project available

Scrapy users have complained in the past about the lack of a pre-built example project that contains, for example, the dmoz spider described in the tutorial.

Complain no more! We're happy to let you know that there is now a functional Scrapy project available on Github which contains the old Google Directory spider and the Dmoz spider described in the tutorial.

The project is called "dirbot", and it's available at https://github.com/scrapy/dirbot

 

The documentation of Scrapy 0.13 (which will become the next stable release, Scrapy 0.14) has been updated to point to this new example project.

Introducing: w3lib and scrapely

Hi everyone,

In an effort to make Scrapy code smaller and more reusable, we’ve been working on splitting the Scrapy codebase into two different modules:

  1. w3lib
  2. scrapely

w3lib

A library with simple, reusable functions for working with URLs, HTML, forms, and HTTP. Things that aren’t found in the Python standard library. This library doesn’t have any external dependency.

For more info see:

  • w3lib Github repo & issue tracker
  • w3lib on PyPI

scrapely

Scrapely is a library for extracting structured data from HTML pages. What makes it different from other Python web scraping libraries is that it doesn’t depend on lxml or libxml2. Instead, it uses an internal pure-Python parser, which can accept poorly formed HTML. The HTML is converted into an array of token ids, which is used for matching the items to be extracted.
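
To give a rough feel for what "an array of token ids" means, here is a toy sketch using only the standard library. The regular expression, the tokenize_to_ids function and the vocabulary layout are all hypothetical illustrations; this is not Scrapely's actual parser or API:

```python
import re

# Hypothetical tokenizer: a token is either a whole tag or a run of
# non-whitespace text between tags. Scrapely's real parser is more involved.
TOKEN_RE = re.compile(r"<[^>]+>|[^<\s]+")


def tokenize_to_ids(html, vocab=None):
    """Map each token to a small integer id, reusing ids for repeated tokens."""
    vocab = {} if vocab is None else vocab
    ids = []
    for token in TOKEN_RE.findall(html):
        # A token seen for the first time gets the next free id.
        ids.append(vocab.setdefault(token, len(vocab)))
    return ids, vocab


ids, vocab = tokenize_to_ids("<p>Price: <b>10</b></p><p>Price: <b>20</b></p>")
print(ids)  # [0, 1, 2, 3, 4, 5, 0, 1, 2, 6, 4, 5]
```

The two paragraphs produce the same id sequence except for the value inside the b tag, which is the kind of repeated structure a matcher can use to locate the data to extract.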

Scrapely depends on numpy (it uses it to speed up calculations) and w3lib.

You can find more info, or try it out, in the Github page.

Scrapy codebase

After these changes, the Scrapy codebase has been reduced by 4574 lines of Python, including blank and comment lines (according to cloc).

Before:

$ cloc /tmp/scrapy2/scrapy
     333 text files.
     332 unique files.                                       
      18 files ignored.

http://cloc.sourceforge.net v 1.09  T=0.5 s (628.0 files/s, 66050.0 lines/s)
-------------------------------------------------------------------------------
Language          files     blank   comment      code    scale   3rd gen. equiv
-------------------------------------------------------------------------------
Python              301      5819      5341     20663 x   4.20 =       86784.60
HTML                 11       117        93       792 x   1.90 =        1504.80
XML                   2         1         0       199 x   1.90 =         378.10
-------------------------------------------------------------------------------
SUM:                314      5937      5434     21654 x   4.09 =       88667.50
-------------------------------------------------------------------------------

After:

$ cloc /tmp/scrapy/scrapy
     308 text files.
     307 unique files.                                       
      14 files ignored.

http://cloc.sourceforge.net v 1.09  T=0.5 s (586.0 files/s, 55136.0 lines/s)
-------------------------------------------------------------------------------
Language          files     blank   comment      code    scale   3rd gen. equiv
-------------------------------------------------------------------------------
Python              284      5206      3801     18242 x   4.20 =       76616.40
XML                   2         1         0       199 x   1.90 =         378.10
HTML                  7        17         0       102 x   1.90 =         193.80
-------------------------------------------------------------------------------
SUM:                293      5224      3801     18543 x   4.16 =       77188.30
-------------------------------------------------------------------------------
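
The 4574 figure corresponds to the Python rows of the two cloc reports. A short check of the arithmetic, with the numbers copied from the tables above (the variable names are just for illustration):

```python
# Python rows from the cloc reports: blank, comment and code lines.
before = {"blank": 5819, "comment": 5341, "code": 20663}
after = {"blank": 5206, "comment": 3801, "code": 18242}

total_before = sum(before.values())  # 31823
total_after = sum(after.values())    # 27249
reduction = total_before - total_after

print(reduction)  # 4574
```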

Scrapy dependencies

Scrapy 0.14 will depend on w3lib. Scrapy 0.13 (current dev version) already depends on w3lib, but w3lib is already packaged and provided in the official APT repos (package python-w3lib). So, if you’re using Scrapy 0.13 on Ubuntu, you can upgrade safely. Otherwise, you can always install/upgrade with easy_install or pip. Stable version (Scrapy 0.12) is not affected at all by this change.

If you have any comments or questions feel free to post them in the scrapy-users group.

Scrapy 0.12 released

Hello everyone, we’re pleased to announce the release of Scrapy 0.12!

This release is based on the last development branch, a.k.a. Scrapy 0.11.

Starting from this release, we’ll be following the odd-even versioning scheme. That means trunk is now Scrapy 0.13 and will become Scrapy 0.14 on the next release.

Notable changes of this release:

  • many Scrapyd changes, including running one spider per process and a minimal web interface
  • lxml is now supported as an alternative xpath selectors backend to libxml2, which should make installing Scrapy on Mac as easy as: easy_install -U Scrapy
  • added project data storage directory
  • changed parameters of item_passed signal
  • new deploy command to deploy your project in Scrapyd server
  • change in semantics of HTTP cache middleware options – check the documentation
  • many bug fixes

For a more detailed list of changes, check the Scrapy 0.12 Release Notes.

Spoofing your Scrapy bot IP using tsocks

This post was contributed by Pablo Hoffman.

It is well known that many websites show different content depending on the region where they’re accessed. For example, some retailer sites show products available only for the region (US, Europe) of the user accessing the site.

Although this can be quite convenient for the website customers, it can be a pain for developers writing a spider for the site and running it from their local machines.

There is a simple way to proxy all requests as if they came from another server. You only need SSH access to this other server, no need to install any HTTP proxy. For this, you can use a program called tsocks.

Here’s how to do it in Ubuntu, though this recipe should be easy to extend to other Linux distros.

First, install tsocks with:

$ apt-get install tsocks

Then add this content to ~/.tsocksrc (this file may vary across distributions):

server = 127.0.0.1
server_type = 5
server_port = 9999

Next, SSH to the remote server you want to use:

$ ssh -D 9999 some_remote_server

And finally, in another terminal (without closing the SSH console), just run Scrapy by prefixing it with the tsocks command, like this:

$ tsocks scrapy crawl myspider

That’s all. Your spider will run on your local machine while proxying all communication through the remote server. No need to change any settings or configuration.
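
If the spider fails to connect, a quick first check is whether the SSH tunnel is actually listening on the local port. Here is a small sketch using only the standard library; the socks_port_is_open helper is a hypothetical name, and the defaults simply mirror the 127.0.0.1:9999 values from the tsocks configuration above:

```python
import socket


def socks_port_is_open(host="127.0.0.1", port=9999, timeout=2.0):
    """Return True if something accepts TCP connections on host:port.

    This only checks that the port is listening. It does not verify that
    the listener speaks SOCKS or that the remote server is reachable.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling socks_port_is_open() with no arguments while the "ssh -D 9999" session is running should return True. If it returns False, the tunnel is probably down or bound to a different port.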

Scraping AJAX sites with Scrapy

This post was contributed by Ismael Carnales.

A common question in the Scrapy community is how to scrape AJAX sites. As Mark Ellul pointed out in the scrapy-users mailing list, there are two basic types of AJAX requests that web sites make use of. These are "static" requests, whose parameters (URL, post data) don't change, and "dynamic" requests, which use some variables based on properties of the current page.

The general approach when dealing with "static" AJAX requests is to add their URLs to the start_urls attribute as "normal" URLs. To deal with "dynamic" ones, we will try to generate the same requests from Scrapy.

To help us in this task, we'll use a Firefox add-on called Firebug. This add-on comes with a Net panel that lets us monitor the requests being sent to the server and their responses.

We will scrape the NASA Image of the Day Gallery. When loading the site, we can see that the page loads the gallery information from another source. To find out which one, we launch Firebug, go to the Net panel and reload the page.

In the Net panel, we see each request (and its response) made to load the entire page contents. Here we can filter the requests and look for XMLHttpRequests in the XHR tab.

In the XHR tab, we see that two requests are made, one to iotdxml.xml and one to image_feature_NUMBER.xml. If we look at the response of the first one (clicking on it and then going to the Response tab) we see that it holds the gallery slider data.

Now, if we navigate to another photo by clicking on its slider link, we'll see that a new request has been made. This request points to image_feature_NUMBER.xml, which looks suspiciously similar to the second request that we saw when loading the page for the first time (that request fetched the first image in the gallery). And if we look at the iotdxml.xml file, we'll find that the URL holding each image's complete data is stored in an ap attribute.

So, to scrape this site, we add the iotdxml.xml URL to the spider's start_urls, parse it and make requests for each individual image (mimicking the requests the browser makes when clicking on images).
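
The parsing step can be sketched with the standard library alone. The sample feed, its element names and the example.com base URL below are hypothetical stand-ins (only the ap attribute comes from the description above), and image_detail_urls is an illustrative helper, not part of Scrapy:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical stand-in for iotdxml.xml. The real feed's element names and
# layout may differ; only the "ap" attribute is taken from the post.
SAMPLE_FEED = """<gallery>
  <item ap="/image_feature_1.xml"/>
  <item ap="/image_feature_2.xml"/>
</gallery>"""


def image_detail_urls(feed_xml, base_url):
    """Collect the absolute URL of every per-image XML file in the feed."""
    root = ET.fromstring(feed_xml)
    # Any element carrying an "ap" attribute points at one image's data.
    return [urljoin(base_url, el.get("ap")) for el in root.iter() if el.get("ap")]


urls = image_detail_urls(SAMPLE_FEED, "http://www.example.com/gallery/iotdxml.xml")
for url in urls:
    print(url)
```

In a spider, each of these URLs would become a follow-up request whose response is parsed for the image's data.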

Here's a simple spider to illustrate this:

You can run this spider quickly (without creating a project) by saving it into a nasaspider.py file and running:

    scrapy runspider nasaspider.py


Fresh Windows installers for latest Scrapy code

Hi, for those of you working on Windows, you'll be glad to know that we're now publishing Windows installers for continuous builds of Scrapy, courtesy of Insophia, which provides the continuous building infrastructure. This means installers for the latest code committed to the Mercurial repo, both for the stable and development series.

This way you can get the latest bug fixes without waiting for the next stable release.

You can find the Windows installers for latest code available in the following locations:

Also remember that the Ubuntu packages (available in the official APT repo) contain the latest bug fixes too.

Scrapy 0.10.3 released

Hi everyone, we have just released Scrapy 0.10.3, which contains a few bug fixes and a small new feature (formname argument added to FormRequest.from_response() method).

This is a recommended update for all 0.10 users.

As usual, you can get it from the Download page, PyPI (easy_install), or the Ubuntu repos.

 

New bugfix release: 0.10.1

We have just published a bugfix release that fixes a critical bug with the HTTP compression middleware and the response encoding discovery mechanism.

This is a highly recommended update for all 0.10 users.

As usual, you can get it from the Download page, PyPI (easy_install), or the Ubuntu repos.

UPDATE: We have published another bugfix release (0.10.2) to fix a similar encoding bug.

About

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
