arshaw/scrapemark

NOTE: This project is no longer maintained!

Scrapemark

Scrapemark is a super-convenient way to scrape webpages in Python.

It uses an HTML-like markup language to extract the data you need, and returns the results as plain old Python lists and dictionaries. Internally, Scrapemark is built on regular expressions and is super-fast.
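To give a feel for how a markup pattern can be driven by regular expressions, here is a minimal sketch. It is not Scrapemark's actual implementation: the helper names `compile_pattern` and `extract_all` are made up for illustration, and it supports only plain `{{ name }}` placeholders, with none of Scrapemark's lists, filters, or `{* ... *}` repetition.

```python
import re


def compile_pattern(template):
    """Turn a template with {{ name }} placeholders into a compiled regex.

    Hypothetical helper for illustration only; not part of Scrapemark.
    """
    # With one capture group, re.split alternates between literal text
    # (even indexes) and placeholder names (odd indexes).
    parts = re.split(r"\{\{\s*(\w+)\s*\}\}", template)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            # Literal markup must match exactly.
            regex += re.escape(part)
        else:
            # A placeholder becomes a lazy named group.
            regex += "(?P<%s>.*?)" % part
    return re.compile(regex, re.DOTALL)


def extract_all(template, html):
    """Return one dictionary of captured values per match in html."""
    return [match.groupdict() for match in compile_pattern(template).finditer(html)]


html = "<a href='/a'>First</a><a href='/b'>Second</a>"
rows = extract_all("<a href='{{ url }}'>{{ title }}</a>", html)
print(rows)
```

Each match becomes a dictionary keyed by placeholder name, which mirrors the "plain lists and dictionaries" style of result described above.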

As an example, here is a way you could scrape all the links on the Digg homepage in one fell swoop:

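import scrapemark

# {* ... *} marks a block that may repeat; [links] collects one entry per match.
# The |int filter converts the captured digg count to an integer.
print(scrapemark.scrape("""
  {*
    <div class='news-summary'>
      <h3><a href='{{ [links].url }}'>{{ [links].title }}</a></h3>
      <p>{{ [links].description }}</p>
      <li class='digg-count'>
        <strong>{{ [links].diggs|int }}</strong>
      </li>
    </div>
  *}
  """,
  url='http://digg.com/'))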
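Because the results are ordinary lists and dictionaries, they can be processed with plain Python. The sketch below assumes the pattern above yields a dictionary with a `links` list of per-story dictionaries. The `result` value is hand-written placeholder data, not real scraped output, and `popular` is a hypothetical helper.

```python
# Hypothetical result, hand-written for illustration; not real scraped data.
result = {
    "links": [
        {"url": "http://example.com/a", "title": "First story",
         "description": "Placeholder text.", "diggs": 120},
        {"url": "http://example.com/b", "title": "Second story",
         "description": "More placeholder text.", "diggs": 45},
    ]
}


def popular(result, min_diggs):
    """Return (title, url) pairs for links with at least min_diggs diggs."""
    return [(link["title"], link["url"])
            for link in result["links"]
            if link["diggs"] >= min_diggs]


for title, url in popular(result, 100):
    print("%s -> %s" % (title, url))
```

Since `diggs` is already an integer (assuming the `|int` filter was applied), it can be compared numerically without further conversion.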