Scrapy Blog http://blog.scrapy.org Official blog of Scrapy, a Python web crawling & scraping framework posterous.com

Mon, 27 Feb 2012 06:13:00 -0800 Scrapy will drop support for Python 2.5 http://blog.scrapy.org/scrapy-dropping-support-for-python-25

After a year considering it, we have decided to go ahead and drop support for Python 2.5 in Scrapy.

Starting from 0.15, Scrapy will require Python 2.6 or above.

The main reasons for dropping support for Python 2.5 are:

  1. most people have already moved to 2.6 or 2.7
  2. we no longer test Scrapy on Python 2.5 ourselves, so it's quite possible that we release changes incompatible with Python 2.5 by mistake. This has in fact happened in the past and was only discovered after a while, which supports our belief that Python 2.5 is barely used anymore.
  3. having to check compatibility with 2.5 has become a painful, time-consuming and error-prone task. We would rather spend that time on improving the framework.
  4. there are some ugly bugs in the standard library of Python 2.5 (mainly in the urlparse, urllib & cookielib modules) that we would like to leave behind
  5. there is some boilerplate code to keep backwards compatibility with 2.5 (scrapy.utils.py26 module, optional dependency on simplejson) that we want to remove to keep a cleaner codebase
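
To give an idea of the kind of boilerplate meant in the last item, here is a generic sketch of the usual json/simplejson fallback. It is an illustration of the pattern only, not the actual contents of scrapy.utils.py26 or any other Scrapy module.

```python
# Generic compatibility shim (illustrative, not Scrapy's own code).
try:
    import json  # in the standard library since Python 2.6
except ImportError:
    import simplejson as json  # external package, only needed on Python 2.5

# Callers use the "json" name without caring which module provided it.
print(json.dumps({"min_python": "2.6"}))
```

Once Python 2.5 is no longer supported, the whole try/except collapses into a plain "import json".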

The code cleanup and the changes haven't been done yet, nor will they happen all at once in a single commit, so Scrapy 0.15 still works on Python 2.5 for now. Not for long, however, since we will be making these changes during the remaining development phase of 0.15.

http://files.posterous.com/user_profile_pics/594684/pablo.jpg http://posterous.com/users/5AqhUTboRuXD Pablo Hoffman Pablo Pablo Hoffman
Thu, 17 Nov 2011 10:13:00 -0800 Scrapy 0.14 released http://blog.scrapy.org/scrapy-014-released

After 10 months of work, and many changes, we are pleased to announce the release of Scrapy 0.14.

For a detailed list of changes see the release notes:
https://github.com/scrapy/scrapy/wiki/Scrapy-0.14-release-notes

We expect this one to be the last release in the Scrapy 0.x series, and we are aiming for Scrapy 0.15 to become 1.0 sometime around the first quarter of 2012, where we may consider adopting semantic versioning (semver.org).

You may have also noticed that the source code, wiki and issues have been moved to Github and the documentation to readthedocs. Here are the new official links:

Homepage:
http://scrapy.org
Source code, issues & wiki:
https://github.com/scrapy/scrapy
Documentation (all versions):
http://readthedocs.org/projects/scrapy/

Happy scraping!

Wed, 27 Apr 2011 22:29:00 -0700 New example Scrapy project available http://blog.scrapy.org/new-example-scrapy-project-available

Scrapy users have complained in the past about the lack of a pre-built example project that contains, for example, the dmoz spider described in the tutorial.

Complain no more! We're happy to let you know that there is now a functional Scrapy project available on Github which contains the old Google Directory spider and the Dmoz spider described in the tutorial.

The project is called "dirbot", and it's available at https://github.com/scrapy/dirbot

 

The documentation of Scrapy 0.13 (which will become the next stable release, Scrapy 0.14) has been updated to point to this new example project.

Mon, 18 Apr 2011 22:25:46 -0700 Introducing: w3lib and scrapely http://blog.scrapy.org/introducing-w3lib-and-scrapely

Hi everyone,

In an effort to make Scrapy code smaller and more reusable, we’ve been working on splitting the Scrapy codebase into two different modules:

  1. w3lib
  2. scrapely

w3lib

A library with simple, reusable functions for working with URLs, HTML, forms, and HTTP. Things that aren’t found in the Python standard library. This library doesn’t have any external dependency.

For more info see:

  • w3lib Github repo & issue tracker
  • w3lib on PyPI

scrapely

Scrapely is a library for extracting structured data from HTML pages. What makes it different from other Python web scraping libraries is that it doesn’t depend on lxml or libxml2. Instead, it uses an internal pure-Python parser, which can accept poorly formed HTML. The HTML is converted into an array of token ids, which is used for matching the items to be extracted.

Scrapely depends on numpy (it uses it to speed up calculations) and w3lib.

You can find more info, or try it out, in the Github page.

Scrapy codebase

After these changes, the Scrapy codebase has been reduced by 4574 lines of Python, including blank lines and comments (according to cloc).

Before:

$ cloc /tmp/scrapy2/scrapy
     333 text files.
     332 unique files.                                       
      18 files ignored.

http://cloc.sourceforge.net v 1.09  T=0.5 s (628.0 files/s, 66050.0 lines/s)
-------------------------------------------------------------------------------
Language          files     blank   comment      code    scale   3rd gen. equiv
-------------------------------------------------------------------------------
Python              301      5819      5341     20663 x   4.20 =       86784.60
HTML                 11       117        93       792 x   1.90 =        1504.80
XML                   2         1         0       199 x   1.90 =         378.10
-------------------------------------------------------------------------------
SUM:                314      5937      5434     21654 x   4.09 =       88667.50
-------------------------------------------------------------------------------

After:

$ cloc /tmp/scrapy/scrapy
     308 text files.
     307 unique files.                                       
      14 files ignored.

http://cloc.sourceforge.net v 1.09  T=0.5 s (586.0 files/s, 55136.0 lines/s)
-------------------------------------------------------------------------------
Language          files     blank   comment      code    scale   3rd gen. equiv
-------------------------------------------------------------------------------
Python              284      5206      3801     18242 x   4.20 =       76616.40
XML                   2         1         0       199 x   1.90 =         378.10
HTML                  7        17         0       102 x   1.90 =         193.80
-------------------------------------------------------------------------------
SUM:                293      5224      3801     18543 x   4.16 =       77188.30
-------------------------------------------------------------------------------
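
The 4574 figure can be reproduced from the Python rows of the two cloc reports (blank + comment + code). The short check below also computes the difference between the SUM rows, which is larger because HTML files were removed as well.

```python
# (blank, comment, code) counts copied from the Python rows of the cloc reports
python_before = (5819, 5341, 20663)
python_after = (5206, 3801, 18242)
python_reduction = sum(python_before) - sum(python_after)

# SUM rows, covering Python, HTML and XML together
total_before = (5937, 5434, 21654)
total_after = (5224, 3801, 18543)
total_reduction = sum(total_before) - sum(total_after)

print(python_reduction)  # 4574
print(total_reduction)   # 5457
```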

Scrapy dependencies

Scrapy 0.14 will depend on w3lib. Scrapy 0.13 (current dev version) already depends on w3lib, but w3lib is already packaged and provided in the official APT repos (package python-w3lib). So, if you’re using Scrapy 0.13 on Ubuntu, you can upgrade safely. Otherwise, you can always install/upgrade with easy_install or pip. Stable version (Scrapy 0.12) is not affected at all by this change.

If you have any comments or questions feel free to post them in the scrapy-users group.

Sun, 02 Jan 2011 11:54:16 -0800 Scrapy 0.12 released http://blog.scrapy.org/scrapy-012-released

Hello everyone, we’re pleased to announce the release of Scrapy 0.12!

This release is based on the last development branch, aka. Scrapy 0.11.

Starting from this release, we’ll be following the odd-even versioning scheme. That means trunk is now Scrapy 0.13 and will become Scrapy 0.14 on the next release.

Notable changes of this release:

  • many Scrapyd changes, including running one spider per process and a minimal web interface
  • lxml is now supported as an alternative xpath selectors backend to libxml2, which should make installing Scrapy on Mac as easy as: easy_install -U Scrapy
  • added project data storage directory
  • changed parameters of item_passed signal
  • new deploy command to deploy your project in Scrapyd server
  • change in semantics of HTTP cache middleware options – check the documentation
  • many bug fixes

For a more detailed list of changes, check the Scrapy 0.12 Release Notes.

Fri, 12 Nov 2010 11:17:00 -0800 Spoofing your Scrapy bot IP using tsocks http://blog.scrapy.org/spoofing-your-scrapy-bot-ip-using-tsocks

This post was contributed by Pablo Hoffman.

It is well known that many websites show different content depending on the region where they’re accessed. For example, some retailer sites show products available only for the region (US, Europe) of the user accessing the site.

Although this can be quite convenient for the website customers, it can be a pain for developers writing a spider for the site and running it from their local machines.

There is a simple way to proxy all requests as if they came from another server. You only need SSH access to this other server, no need to install any HTTP proxy. For this, you can use a program called tsocks.

Here’s how to do it in Ubuntu, though this recipe should be easy to extend to other Linux distros.

First, install tsocks with:

$ apt-get install tsocks

Then add this content to ~/.tsocksrc (this file may vary across distributions):

server = 127.0.0.1
server_type = 5
server_port = 9999

Next, SSH to the remote server you want to use:

$ ssh -D 9999 some_remote_server

And finally, in another terminal (without closing the SSH console), just run Scrapy by prefixing it with the tsocks command, like this:

$ tsocks scrapy crawl myspider

That’s all. Your spider will run on your local machine but proxy all communication through the remote server. No need to change any settings or configuration.
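
If the spider cannot connect, a first thing to check is whether the SSH tunnel is actually listening on the local port. The sketch below assumes the 127.0.0.1:9999 address used in the configuration above; the helper name socks_port_open is made up for this example, and it only checks that the port accepts TCP connections, not that it speaks SOCKS.

```python
import socket


def socks_port_open(host="127.0.0.1", port=9999, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        conn = socket.create_connection((host, port), timeout)
    except (socket.error, socket.timeout):
        return False
    conn.close()
    return True


if __name__ == "__main__":
    if socks_port_open():
        print("Tunnel is listening; run: tsocks scrapy crawl myspider")
    else:
        print("Nothing listening on 127.0.0.1:9999; start the ssh -D session first")
```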

Wed, 27 Oct 2010 11:00:00 -0700 Scraping AJAX sites with Scrapy http://blog.scrapy.org/scraping-ajax-sites-with-scrapy

This post was contributed by Ismael Carnales.

A common question in the Scrapy community is how to scrape AJAX sites. As Mark Ellul pointed out in the scrapy-users mailing list, there are two basic types of AJAX requests that web sites make use of: "static" requests, whose parameters (URL, POST data) don't change, and "dynamic" requests, which use some variables based on properties of the current page.

The general approach when dealing with "static" AJAX requests is to add their URLs to the start_urls attribute, like any "normal" URL. To deal with "dynamic" ones, we try to generate the same requests from Scrapy.

To help us in this task, we'll use a Firefox add-on called Firebug. This add-on comes with a Net panel that lets us monitor the requests being sent to the server and their responses.

We will scrape the NASA Image of the Day Gallery. When loading the site we can see that the page loads the gallery information from another source, so to find it we launch Firebug, go to the Net panel and reload the page.

In the Net panel, we see each request (and its response) made to load the entire page contents. Here we can filter the requests and look for XMLHttpRequests in the XHR tab.

In the XHR tab, we see that two requests are made, one to iotdxml.xml and one to image_feature_NUMBER.xml. If we look at the response of the first one (by clicking on it and then going to the Response tab), we see that it holds the gallery slider data.

Now, if we navigate to another photo by clicking on its slider link, we'll see that a new request has been made. This request points to image_feature_NUMBER.xml, which looks suspiciously similar to the second request that we got when loading the page for the first time (that request fetched the first image in the gallery). And if we look at the iotdxml.xml file, we'll find that the URL holding each image's complete data is stored in an ap element.

So, to scrape this site, we add the iotdxml.xml URL to the Spider start_urls, parse it and make requests for each individual image (mimicking the requests the browser makes when clicking on images).

Here's a simple spider to illustrate this:

from urlparse import urljoin

from scrapy.http import Request
from scrapy.selector import XmlXPathSelector
from scrapy.spider import BaseSpider


class NasaImagesSpider(BaseSpider):
    name = "nasa.gov"
    start_urls = (
        'http://www.nasa.gov/multimedia/imagegallery/iotdxml.xml',
    )

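    def parse(self, response):
        # The start URL returns the gallery XML; each <ap> holds the path
        # of one image's data file, without the .xml extension.
        xxs = XmlXPathSelector(response)
        urls = xxs.select('//ig/ap/text()').extract()
        for url in urls:
            # Build the same URL the browser requests when an image is clicked.
            abs_url = urljoin(self.start_urls[0], url) + '.xml'
            yield Request(abs_url, callback=self.parse_image)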

    def parse_image(self, response):
        # parse individual images here
        pass


SPIDER = NasaImagesSpider()

You can run this spider quickly (without creating a project) by saving it into a nasaspider.py file and running:

    scrapy runspider nasaspider.py
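
To see what the parse step of the spider produces without running Scrapy, the same extraction can be sketched with the standard library alone. The XML below is a made-up sample: only the ig/ap nesting is taken from the XPath used in the spider, and the image_feature_0001/0002 paths and the image_urls helper are hypothetical.

```python
import xml.etree.ElementTree as ET

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

FEED_URL = 'http://www.nasa.gov/multimedia/imagegallery/iotdxml.xml'

# Hypothetical sample feed; the real file has more fields per <ig> entry.
SAMPLE_XML = """
<rss>
  <ig><ap>/multimedia/imagegallery/image_feature_0001</ap></ig>
  <ig><ap>/multimedia/imagegallery/image_feature_0002</ap></ig>
</rss>
"""


def image_urls(xml_text, base_url=FEED_URL):
    """Return the absolute .xml URL for every ig/ap entry in the feed."""
    root = ET.fromstring(xml_text.strip())
    urls = []
    for ig in root.iter('ig'):
        for ap in ig.findall('ap'):
            if ap.text:
                # Same construction as in the spider: join, then add '.xml'.
                urls.append(urljoin(base_url, ap.text.strip()) + '.xml')
    return urls


if __name__ == "__main__":
    for url in image_urls(SAMPLE_XML):
        print(url)
```

Each URL returned here corresponds to one Request that the spider yields with parse_image as its callback.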

Wed, 29 Sep 2010 09:11:00 -0700 Fresh Windows installers for latest Scrapy code http://blog.scrapy.org/fresh-windows-installers-for-latest-scrapy-co

Hi, for those of you working on Windows, you'll be glad to know that we're now publishing Windows installers for continuous builds of Scrapy, courtesy of Insophia which provides the continuous building infrastructure. This means installers for the latest code committed to the Mercurial repo, both for the stable and development series.

This way you can get the latest bug fixes without waiting for the next stable release.

You can find the Windows installers for latest code available in the following locations:

Also remember that the Ubuntu packages (available in the official APT repo) contain the latest bug fixes too.

Wed, 29 Sep 2010 09:04:00 -0700 Scrapy 0.10.3 released http://blog.scrapy.org/scrapy-0103-released

Hi everyone, we have just released Scrapy 0.10.3, which contains a few bug fixes and a small new feature (formname argument added to FormRequest.from_response() method).

This is a recommended update for all 0.10 users.

As usual, you can get it from the Download page, PyPI (easy_install), or the Ubuntu repos.

 

Tue, 14 Sep 2010 23:23:00 -0700 New bugfix release: 0.10.1 http://blog.scrapy.org/new-bugfix-release-0101

We have just published a bugfix release that fixes a critical bug with the HTTP compression middleware and the response encoding discovery mechanism.

This is a highly recommended update for all 0.10 users.

As usual, you can get it from the Download page, PyPI (easy_install), or the Ubuntu repos.

UPDATE: We have published another bugfix release (0.10.2) to fix a similar encoding bug.

Fri, 10 Sep 2010 23:23:00 -0700 New Scrapy blog and Scrapy 0.10 release http://blog.scrapy.org/new-scrapy-blog-and-scrapy-010-release

Hi everyone!

After two months of work, we're happy to announce the release of Scrapy 0.10, the fourth major release after our first 0.7 release almost a year ago. We're also using this opportunity to announce the launch of the official Scrapy blog, which you are reading right now. And you may have heard of the new Scrapy snippets site to share code examples.

The 0.10 release includes several bug fixes, and a bunch of new features. Here we summarize the most important ones.

New command-line tool

There is a new command line tool for creating and controlling your Scrapy projects. Instead of using the old scrapy-ctl.py script you'll now use the scrapy command which "auto discovers" the project you're working on. So there's no need for a separate script per project anymore. Check the new command-line tool documentation for a list of available commands.

The new command also works out of the box on Windows, if you're using the Windows installer.

Feed exports

Feed exports provide a flexible way to generate data feeds with the scraped data. They support multiple formats (JSON, XML, CSV) and storage backends (filesystem, FTP, S3). Both formats and storages are pluggable, so you can plug in your own.
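
As a rough illustration, enabling a feed export comes down to a couple of lines in the project settings. The setting names FEED_URI and FEED_FORMAT below are not taken from this post and should be treated as an assumption; check them against the feed exports documentation of the version you are running. The output path is just an example.

```python
# settings.py (fragment)
# Assumed setting names; verify against the feed exports documentation.
FEED_FORMAT = 'json'                 # one of the supported formats: json, xml, csv
FEED_URI = 'file:///tmp/items.json'  # filesystem storage; example path only
```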

Persistent spider queues

Scrapy now comes with a persistent spider queue (sometimes referred to as Execution queues in 0.9) out of the box (implemented using SQLite), which allows you to schedule spiders to a Scrapy process that is already running.

To illustrate, running this:

scrapy crawl myspider

Is the same as running this:

scrapy queue add myspider
scrapy crawl 

You can also start Scrapy in server mode with:

scrapy runserver

And schedule your spiders later with:

scrapy queue add myspider
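
To make the idea of a persistent queue concrete, here is a toy FIFO queue of spider names backed by SQLite. It is only a sketch of the concept: the class name, table layout and methods are invented for this example and are not Scrapy's actual implementation.

```python
import sqlite3


class ToySpiderQueue(object):
    """Toy persistent FIFO queue of spider names (illustration only)."""

    def __init__(self, path):
        # Pass a file path to persist across processes, or ':memory:' to test.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS queue "
            "(id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT NOT NULL)"
        )
        self.conn.commit()

    def add(self, name):
        """Schedule a spider by name."""
        self.conn.execute("INSERT INTO queue (name) VALUES (?)", (name,))
        self.conn.commit()

    def pop(self):
        """Return the oldest scheduled spider name, or None if empty."""
        row = self.conn.execute(
            "SELECT id, name FROM queue ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        self.conn.execute("DELETE FROM queue WHERE id = ?", (row[0],))
        self.conn.commit()
        return row[1]
```

Because the queue lives in a database file rather than in memory, one process can add names while another, already running, pops and crawls them, which is the behaviour the commands above rely on.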

Scrapy service

Scrapy 0.10 introduces Scrapyd, a new application for running Scrapy as a service and deploying Scrapy projects. Insophia provides Ubuntu packages for Scrapyd (the package source is in the /debian dir) so you can install it with apt-get install scrapyd and start using it. Hopefully, with some community help, we'll get packages for other platforms.

Ubuntu packages

New official Ubuntu packages (of Scrapy and Scrapyd) are provided through APT repos which are continuously updated with the latest bug fixes. If you install Scrapy through the APT repos you'll get bash autocompletion out of the box.

Other Scrapy 0.10 improvements

  • Simplified Images pipeline to make it easier to use - see documentation
  • Some Scrapy shell usability improvements (the shell should look and feel nicer now)
  • Support for returning deferreds on most signals (which enables support for implementing asynchronous extensions)
  • Scrapy log refactoring to support hooking up your own log observers (though this is not yet documented)

And much more. For more info see:

Finally, a big thanks to Mydeco (which is already using Scrapy 0.10 in production) for the beta-testing.

So what are you waiting for? Go download it, and please report any issues you find in the scrapy-users group or the bug tracker. You can also fork Scrapy on Github and fix them.

Previous stable release

As usual, you can find the documentation for the previous stable release at http://doc.scrapy.org/0.9/ and download it from the releases archive at http://scrapy.org/releases/. The Mercurial repo for the previous stable release will also remain available at http://hg.scrapy.org/scrapy-0.9/

UPDATE: The release files have been rebuilt to fix the issue reported here by adding this change.

Fri, 25 Sep 2009 13:43:00 -0700 First Scrapy release candidate available http://blog.scrapy.org/first-scrapy-release-candidate-available

Last week we proudly released the first release candidate of Scrapy 0.7, the best web crawling and screen scraping framework for Python.

For more info see the announcement in the scrapy-users list, or head to the Scrapy home page.

 
