IJS newsfeed
a clean, continuous, real-time aggregated stream of
semantically enriched news articles from RSS-enabled sites across the world.

What it Does

The pipeline performs the following main steps:

  1. Periodically crawl a list of RSS feeds and obtain links to news articles
  2. Download the articles, taking care not to overload any of the hosting servers
  3. Parse each article to obtain
    1. Potential new RSS sources mentioned in the HTML, to be used in step (1)
    2. Cleartext version of the article body
  4. Process articles with Enrycher (English and Slovene only)
  5. Expose two streams of news articles (cleartext and Enrycher-processed) to end users.

Demo Visualization

Visit http://newsfeed.ijs.si/visual_demo/ for a real-time visualization of the news stream.

Accessing the stream

The stream of articles is XML-formatted and segmented by time into .gz files, each a few MB in size.

Note: Due to the streaming nature of the pipeline, the articles in the gzipped files are only approximately chronological; they are sorted in the order in which they were processed rather than published.

Downloading the Stream - API

The stream is accessible at http://newsfeed.ijs.si/stream/ (internal use only due to copyright issues, sorry). The URL accepts an optional ?after=TIMESTAMP parameter, where TIMESTAMP takes the ISO format yyyy-mm-ddThh:mm:ssZ (Z denotes GMT timezone). The server will return the oldest gzip created later than TIMESTAMP.

The Content-Disposition HTTP header (attachment; filename="...") contains the new gzip's filename, which you can use to generate the next query, and so on. If the after parameter is too recent (no newer gzips are available yet), HTTP 404 is returned. If no after is provided, the oldest available gzip is returned; we aim to maintain a month's worth of backlog.
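A single request against this API could look as follows. This is a sketch using only the Python standard library; HTTP basic authentication and the helper names (parse_disposition, fetch_next) are assumptions for illustration, not part of the documented API.

```python
# Single-fetch sketch: request the oldest gzip created after `after`.
# Basic-auth credentials are a placeholder assumption.
import base64
import urllib.error
import urllib.request

def parse_disposition(header):
    """Extract the filename from 'attachment; filename="..."'."""
    return header.split('filename="')[-1].rstrip('"')

def fetch_next(after, user, password):
    """Return (filename, gzip bytes), or None on HTTP 404 (nothing newer yet).

    `after` is an ISO timestamp in the form yyyy-mm-ddThh:mm:ssZ.
    """
    req = urllib.request.Request("http://newsfeed.ijs.si/stream/?after=" + after)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    try:
        with urllib.request.urlopen(req) as resp:
            filename = parse_disposition(resp.headers.get("Content-Disposition", ""))
            return filename, resp.read()
    except urllib.error.HTTPError as e:
        if e.code == 404:
            return None
        raise
```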

To download the whole stream continuously, you can use the python script (see next section) or re-implement its functionality. The script does, in effect, the following:

timestamp = [when you want to start downloading, e.g. now() minus 1 hour]
while True:
   fetch http://newsfeed.ijs.si/stream/?after=timestamp
   if [404 error]:
      # there is no new data yet
      pause 1 minute
   else:
      save data
      timestamp = [extract it from the Content-Disposition HTTP header]
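A minimal runnable version of this loop, standard library only. Two details are illustrative guesses rather than documented behaviour: that the server accepts unauthenticated requests from inside the network (otherwise add basic auth), and that gzip filenames embed an ISO timestamp that can be reused as the next after value.

```python
# Sketch of the polling loop. The timestamp-in-filename convention is an
# assumption for illustration; the real filename format is not documented here.
import re
import time
import urllib.error
import urllib.request

STREAM_URL = "http://newsfeed.ijs.si/stream/"

def timestamp_from_filename(filename):
    """Extract a yyyy-mm-ddThh:mm:ssZ timestamp from a gzip filename, if any."""
    m = re.search(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z", filename)
    return m.group(0) if m else None

def follow_stream(start_timestamp, save):
    """Poll forever; call save(filename, data) for each newly available gzip."""
    timestamp = start_timestamp            # e.g. "2013-01-15T06:00:00Z"
    while True:
        try:
            with urllib.request.urlopen(STREAM_URL + "?after=" + timestamp) as resp:
                disposition = resp.headers.get("Content-Disposition", "")
                filename = disposition.split('filename="')[-1].rstrip('"')
                save(filename, resp.read())
                timestamp = timestamp_from_filename(filename) or timestamp
        except urllib.error.HTTPError as e:
            if e.code != 404:
                raise
            time.sleep(60)                 # no new data yet; poll again shortly
```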

Downloading the Stream - Using the Script

You can also download our python script, which uses the simple API described above to poll the server at one-minute intervals and copy new .gz files to the local disk as they become available on the server.

A sample call for downloading the public stream (the default) into folder ijs_news:

./http2fs.py username:password -o ijs_news

This will download the whole available history and continue to follow the real-time stream. To follow the X-Like stream (only available to X-Like project partners), we provide an additional stream URL besides the default one:

./http2fs.py username:password -o ijs_news -f http://newsfeed.ijs.si/stream/ -f http://newsfeed.ijs.si/stream/xlike

Stream Contents and Format

To have a quick look at the real stream data, click either of the links in the API description above.

Each .gz file contains a single XML tree. The root element, <article-set>, contains zero or more articles in the following format:

<article id="internal article ID; consistent across streams">
   <hostname> Publisher hostname </hostname>
   <title> Name of the publisher; failing that, title of the RSS feed </title>
   <longitude?> publisher longitude in degrees </longitude>
   <latitude?> publisher latitude in degrees </latitude>
   <city?> publisher city </city>
   <country?> publisher country </country>
   <tag+> a tag for the publisher; the vocabulary is not controlled </tag>
   <uri> URL from which the article was discovered; typically the RSS feed </uri>
   <uri> URL from which the article was downloaded </uri>
   <publish-date?> The publication time and date. </publish-date>
   <retrieve-date> The retrieval time and date. </retrieve-date>
   <lang> 3-letter ISO 639-2 language code </lang>
   <story_id?> story cluster this article was grouped into (at download time) </story_id>
   <location?+>
      <longitude?> story content longitude in degrees </longitude>
      <latitude?> story content latitude in degrees </latitude>
      <city?> story city </city>
      <country?> story country </country>
   </location>
   <tag+> a tag for the article; the vocabulary is not controlled </tag>
   <img?> The URL of a related image, usually a thumbnail. </img>
   <title> Title. Can be empty if we fail to identify it. </title>
   <body-cleartext>
      Clear text body of the article, formatted only with <p> tags
   </body-cleartext>
   <body-rych?> (English and Slovene only)
      Enriched article body; an XML subtree as returned by Enrycher.
   </body-rych>
   <body-xlike?> (English, Spanish and Catalan only; experimental)
      Enriched article body; an XML subtree as returned by iSOCO.
   </body-xlike>
</article>

Elements marked with ? may be omitted if the data is missing.
Elements marked with + may repeat.

All times are in UTC and take the format yyyy-mm-ddThh:mm:ssZ.



The pipeline was developed and is maintained by the Artificial Intelligence Laboratory at the Jozef Stefan Institute, Slovenia. For questions, contact Blaz Novak at firstname.lastname@ijs.si.

Referencing: If you use newsfeed data, please reference it with the following paper:

Trampus, Mitja and Novak, Blaz: The Internals Of An Aggregated Web News Feed. Proceedings of 15th Multiconference on Information Society 2012 (IS-2012).

The development was supported in part by the RENDER, X-Like, PlanetData and MetaNet EU FP7 projects.