The pipeline performs the following main steps:
Visit http://newsfeed.ijs.si/visual_demo/ for a real-time visualization of the news stream.
The stream of articles is XML-formatted and segmented by time into .gz files, each a few MB in size.
Note: Due to the streaming nature of the pipeline, the articles in the gzipped files are only approximately chronological; they are sorted in the order in which they were processed rather than published.
The stream is accessible at http://newsfeed.ijs.si/stream/ (internal use only due to copyright issues, sorry).
The URL accepts an optional ?after=TIMESTAMP parameter, where TIMESTAMP takes the ISO format (Z denotes the GMT timezone). The server will return the oldest gzip created later than TIMESTAMP. The HTTP headers (Content-disposition: attachment; filename="...") will contain the new gzip's filename, which you can use to generate the next query, and so on. If the after parameter is too recent (no newer gzips available), HTTP 404 is returned. If no after parameter is provided, the oldest available gzip is returned; we will attempt to maintain a month's worth of backlog.
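A single request against this API could look roughly like the sketch below. It is only an illustration, not the code of our script: the function names are made up for this example, the timestamp shown in the usage note is a hypothetical value, and authentication (which the server may require) is left out.

```python
from email.message import Message
from typing import Optional, Tuple
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen

STREAM_URL = "http://newsfeed.ijs.si/stream/"


def build_query_url(after: Optional[str] = None, base: str = STREAM_URL) -> str:
    """Return the stream URL, adding ?after=... when a timestamp is given."""
    if after is None:
        return base
    return base + "?" + urlencode({"after": after})


def filename_from_header(content_disposition: str) -> Optional[str]:
    """Extract the filename="..." value from a Content-Disposition header."""
    message = Message()
    message["Content-Disposition"] = content_disposition
    return message.get_filename()


def fetch_next(after: Optional[str] = None,
               base: str = STREAM_URL) -> Optional[Tuple[Optional[str], bytes]]:
    """Fetch the oldest gzip newer than `after`.

    Returns (filename, gzip bytes), or None when the server answers
    HTTP 404, i.e. no newer gzip is available yet.
    Assumption: no authentication is needed; add it if your access requires it.
    """
    try:
        with urlopen(build_query_url(after, base)) as response:
            header = response.headers.get("Content-Disposition", "")
            return filename_from_header(header), response.read()
    except HTTPError as error:
        if error.code == 404:
            return None
        raise
```

For example, build_query_url("2013-01-01T00:00:00Z") yields the stream URL with a percent-encoded after parameter appended, and fetch_next() without arguments asks for the oldest available gzip.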
To download the whole stream continuously, you can use the Python script (see next section) or re-implement its functionality. The script does, in effect, the following:
    timestamp = [when you want to start downloading, e.g. now() minus 1 hour]
    while True:
        fetch http://newsfeed.ijs.si/stream/?after=timestamp
        if [404 error]:
            # there is no new data
            pause 1 minute
        else:
            save data
            timestamp = [extract it from the Content-Disposition HTTP header]
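If you prefer to re-implement the loop, a minimal Python sketch might look as follows. It is not the code of our script. The names follow_stream, fetch and timestamp_from_filename are hypothetical: fetch is any function you supply that returns (filename, data) or None on HTTP 404, and timestamp_from_filename is a function you supply to turn the returned filename into the next after value, since the filename layout is not specified here. The max_polls and sleep parameters exist only to make the loop easy to test.

```python
import time
from typing import Callable, Iterator, Optional, Tuple

FetchResult = Optional[Tuple[str, bytes]]


def follow_stream(fetch: Callable[[str], FetchResult],
                  timestamp_from_filename: Callable[[str], str],
                  start: str,
                  pause_seconds: float = 60.0,
                  sleep: Callable[[float], None] = time.sleep,
                  max_polls: Optional[int] = None) -> Iterator[Tuple[str, bytes]]:
    """Yield (filename, data) for every new gzip, polling until stopped.

    fetch(timestamp) must return None when the server has no newer gzip
    (HTTP 404); in that case we pause and ask again with the same timestamp.
    """
    timestamp = start
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        result = fetch(timestamp)
        if result is None:
            # There is no new data yet: wait before polling again.
            sleep(pause_seconds)
            continue
        filename, data = result
        yield filename, data
        # The returned filename determines the next query.
        timestamp = timestamp_from_filename(filename)
```

Writing the loop as a generator leaves the "save data" step to the caller, for instance by writing each yielded data blob to a file named after filename.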
You can also download our Python script, which uses the simple API described above to poll the server at one-minute intervals and copies new .gz files to the local disk as they become available on the server.
A sample call for downloading the public stream (the default) into folder ijs_news:
./http2fs.py username:password -o ijs_news
This will download the whole available history and then continue to follow the real-time stream. To follow the X-Like stream (only available to X-Like project partners), pass an additional stream URL alongside the default one:
./http2fs.py username:password -o ijs_news -f http://newsfeed.ijs.si/stream/ -f http://newsfeed.ijs.si/stream/xlike
To have a quick look at the real stream data, click either of the links in the API description above.
Each .gz file contains a single XML tree. The root element, <article-set>, contains zero or more articles in the following format:
<article id="internal article ID; consistent across streams">
   <source>
      <hostname> Publisher hostname </hostname>
      <title> Name of the publisher; failing that, title of the RSS feed </title>
      <location?>
         <longitude?> publisher longitude in degrees </longitude>
         <latitude?> publisher latitude in degrees </latitude>
         <city?> publisher city </city>
         <country?> publisher country </country>
      </location>
      <tags?>
         <tag+> a tag for the publisher; the vocabulary is not controlled </tag>
      </tags>
   </source>
   <feed+>
      <uri> URL from which the article was discovered; typically the RSS feed </uri>
   </feed>
   <uri> URL from which the article was downloaded </uri>
   <publish-date?> The publication time and date. </publish-date>
   <retrieve-date> The retrieval time and date. </retrieve-date>
   <lang> 3-letter ISO 639-2 language code </lang>
   <story_id?> story cluster this article was grouped into (at download time) </story_id>
   <location? +>
      <longitude?> story content longitude in degrees </longitude>
      <latitude?> story content latitude in degrees </latitude>
      <city?> story city </city>
      <country?> story country </country>
   </location>
   <tags?>
      <tag+> a tag for the article; the vocabulary is not controlled </tag>
   </tags>
   <img?> The URL of a related image, usually a thumbnail. </img>
   <title> Title. Can be empty if we fail to identify it. </title>
   <body-cleartext> Clear text body of the article, formatted only with <p> tags </body-cleartext>
   <body-rych?; only English, Slovene> Enriched article body; an XML subtree as returned by Enrycher. </body-rych>
   <body-xlike?; only English, Spanish, Catalan> Enriched article body; an XML subtree as returned by iSOCO; experimental. </body-xlike>
</article>
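To illustrate how a downloaded file can be consumed, here is a small Python sketch that decompresses one gzip and pulls a few fields out of each article. The function name parse_article_set and the selection of fields are choices made for this example only. The sketch collects the body text with itertext(), so it should behave sensibly whether the <p> tags appear as child elements or as escaped text, but that is an assumption you may want to verify on real data.

```python
import gzip
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional


def _text(element: ET.Element, path: str) -> Optional[str]:
    """Return the stripped text at `path`, or None if the element is missing."""
    value = element.findtext(path)
    return value.strip() if value is not None else None


def parse_article_set(gzipped: bytes) -> List[Dict[str, Optional[str]]]:
    """Parse one gzipped <article-set> document into a list of dicts."""
    root = ET.fromstring(gzip.decompress(gzipped))
    articles = []
    # Only direct children: enriched bodies are XML subtrees of their own.
    for article in root.findall("article"):
        body = article.find("body-cleartext")
        articles.append({
            "id": article.get("id"),
            "hostname": _text(article, "source/hostname"),
            "uri": _text(article, "uri"),              # download URL, not the feed URL
            "lang": _text(article, "lang"),
            "publish_date": _text(article, "publish-date"),  # optional element
            "title": _text(article, "title"),          # article title, not source/title
            "body": "".join(body.itertext()).strip() if body is not None else None,
        })
    return articles
```

Optional elements (those marked with ? in the format above) come back as None when they are absent, so callers should check for that before using a field.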
Elements marked with ? may be omitted if the data is missing.
Elements marked with + may repeat.
All times are in UTC and take the format
The pipeline has been developed and is being maintained by the Artificial Intelligence Laboratory at Jozef Stefan Institute, Slovenia. In case of questions, contact Blaz Novak at email@example.com.
Referencing: If you use newsfeed data, please reference it with the following paper:
The development was supported in part by the RENDER, X-Like, PlanetData and MetaNet EU FP7 projects.