How can I automatically save and index data from an internet database at a specified interval?

I want to be able to read the source code of a web page from a URL and automatically extract values (such as commodity or stock prices) at a specified interval (hourly or daily). The extracted values would then, ideally, be appended to a matrix of values so I can look at fluctuations over time. Is this possible in MATLAB?
If so, is it elegant?
How would you suggest approaching the problem?
If not, is there another language I should consider? For what benefits?
Thanks.

Accepted Answer

Cedric
Cedric on 10 Mar 2013
Edited: Cedric on 12 Mar 2013
Most languages will allow you to extract data from the internet. Relevant questions might be:
  • Where to get the data? Is it free, historical, real time, reliable, etc.?
  • Is there an API available, or do you need to parse web pages yourself?
  • How much time can you afford to spend writing your own parser?
  • Is it meaningful to build a data logger when historical data are available?
  • If you use MATLAB, can you afford to have it dedicated to data logging?
I would personally use Python, at least for the data logging part, as this language usually minimizes the time to solution (and you might prefer investing your time in data analysis rather than in building a web crawler). There are plenty of Python libraries that will help you do almost everything (I have seen many threads about that, even though I have not worked on it myself). More than that, I could not afford to have MATLAB tied up with data extraction/logging for a significant part of the day, every day.
Now if you just want to play a little in MATLAB to see what you can do, it is not too difficult to write simple code for extracting/logging data. Try the following, for example:
Open http://www.google.com/finance?q=AAPL to get the Apple quote. 'AAPL' is the stock symbol, and you can see that it appears in the URL. Open the source of the web page (CTRL+U in Firefox) and look up the price (431.72 as I write). You'll find it at a place that looks like this (with different numbers):
values:["AAPL","Apple Inc.","431.72","+1.14","chg","0.26"
which is probably a good chunk of string for pattern matching (because it is close to the stock symbol).
Now in MATLAB, do the following:
>> stockSymbol = 'AAPL' ;
>> buffer = urlread(['http://www.google.com/finance?q=', stockSymbol]) ;
If you look at the content of buffer, you'll recognize the source code of the web page. So at this point you want to extract the quote based on pattern matching. You can achieve this with a regexp:
>> pattern = ['values:["', stockSymbol, '",".*?","(?<price>[\d\.]*?)","(?<change>[+-\d\.]*?)".*?"(?<percent>[+-\d\.]*?)"'] ;
>> quote = regexp(buffer, pattern, 'names')
quote =
price: '431.72'
change: '+1.14'
percent: '0.26'
and voilà! Then you can convert to double, store in a file, or anything else. I could describe the pattern in more detail, but for now let's say that it is defined so that it matches a static literal like "values:[", then another literal that is the stock symbol, and then the three numbers framed by double quotes, commas, etc. Each part meant to match a number (including special characters like +-. when relevant) is saved as a named token. These token names are used to define the fields of the struct that regexp returns.
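As a minimal sketch of how the named tokens behave, the pattern can be tried offline on the sample chunk shown above instead of a live page. The pattern below is an assumed variant, not the exact one from this answer: it escapes the opening bracket as \[, uses [^"]* for the fields that are skipped, and requires at least one character per number.

```matlab
% Sample chunk copied from the page source shown above (no network access needed).
buffer = 'values:["AAPL","Apple Inc.","431.72","+1.14","chg","0.26"';
stockSymbol = 'AAPL';

% Assumed variant of the pattern: '\[' is a literal bracket, '[^"]*' skips a
% quoted field, and each (?<name>...) group becomes a field of the output struct.
pattern = ['values:\["', stockSymbol, '","[^"]*",', ...
           '"(?<price>[\d\.]+)","(?<change>[\-+\d\.]+)",', ...
           '"[^"]*","(?<percent>[\-+\d\.]+)"'];
quote = regexp(buffer, pattern, 'names', 'once');

% The tokens are character vectors; convert them before doing arithmetic.
price   = str2double(quote.price);
change  = str2double(quote.change);
percent = str2double(quote.percent);
```

Testing a pattern on a fixed string like this makes it easier to see which part fails when the page layout changes.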
Wrapping the whole into a cute function, you get:
function quote = getQuote_google(stockSymbol)
buffer = urlread(['http://www.google.com/finance?q=', stockSymbol]) ;
pattern = ['values:["', stockSymbol, '",".*?","(?<price>[\d\.]*?)","(?<change>[+-\d\.]*?)".*?"(?<percent>[+-\d\.]*?)"'] ;
quote = regexp(buffer, pattern, 'names') ;
end
that you can then easily use as follows:
>> quote = getQuote_google('AAPL')
quote =
price: '431.72'
change: '+1.14'
percent: '0.26'
>> quote = getQuote_google('GOOG')
quote =
price: '831.52'
change: '-1.08'
percent: '-0.13'
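To come back to the original question (appending values to a matrix over time), here is a rough sketch of what could be done with each quote once it has been extracted. The quote struct is hard-coded so that the example runs without network access; the variable names, the column layout, and the file location are assumptions made for illustration.

```matlab
% Hypothetical quote, shaped like the struct returned by getQuote_google.
quote = struct('price', '431.72', 'change', '+1.14', 'percent', '0.26');

% One row per sample: [timestamp, price, change, percent].
quoteLog = zeros(0, 4);
row = [now, str2double(quote.price), str2double(quote.change), ...
       str2double(quote.percent)];
quoteLog(end+1, :) = row;

% Also append the row to a text file so the history survives a restart.
logFile = fullfile(tempdir, 'AAPL_quotes.csv');   % assumed location
dlmwrite(logFile, row, '-append', 'precision', '%.6f');
```

Storing the timestamp as a serial date number keeps the whole log numeric, so it can be plotted directly, e.g. with plot(quoteLog(:,1), quoteLog(:,2)).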
  5 Comments
Damian
Damian on 8 Jul 2014
Edited: Damian on 8 Jul 2014
WRAPPING THINGS UP AS COMMENTS
function quote = getQuote_google(stockSymbol)
I can easily get the data from a file saved on disk, but I can't reach the website, even though I could reach Google with your example code. What could be wrong? Anyway, reading from a file works fine.
% buffer = fileread(['G:\forex results\eurjpy\300\eur-jpy-technical@period=300.1.html']);
buffer = urlread('http://www.investing.com/currencies/eur-jpy-technical?period=300');
So I need to analyse this structure:
%<span id="fl_header_pair_lst" class="arial_16 midNum">138.76</span>
%<span id="fl_header_pair_chg" class="arial_14 " dir="ltr">0.01</span>
%<span id="fl_header_pair_pch" class="arial_14 " dir="ltr">(0.00%)</span>
pattern_price = ['<span class="arial_26" id="last_last">(?<pricenew>[\d\.]*?)</span>' ,'.*?', '<span id="fl_header_pair_chg" class="arial_14 " dir="ltr">(?<ammountenew>[\d\.]*?)</span>' ,'.*?', ] ;
That's OK, I have the first two in as a test.
Here is another approach, for another bit of HTML from the same web page (overwriting pattern_price, by the way):
<div class="studySummary bold arial_14">Summary:<span
class="studySummaryOval neutral arial_12 bold"
title="NEUTRAL">NEUTRAL</span>
I am trying to get the word 'neutral'. I most likely have to change what comes after Summary, but I can't figure out how to change it so that it stores words.
pattern_price = ['<div class="studySummary bold
arial_14">Summary: <span class="studySummaryOval
neutral arial_12 bold" title="NEUTRAL">(?<summary>[\d\.]*?)
</span>' ,'.*?',] ;
And that's apart from the fact that this string can change every time I load the page, but I shall deal with that on my own; please help me with this bit here.
Finally, I was trying to grab another value, such as RSI(14), in this bit of HTML:
%<td class="first left symbol" id="pair_name_0">RSI(14) /td
%<td class="right" id="open_0">54.427</td>
PS: I know the above looks like a mess; I don't know how to quote it so that it doesn't.
I struggled to get all the way to the proper value. It seems I couldn't even get the 14 out of RSI(14)!
pattern_price = ['>RSI((?<rsiblablala>[\d\.]*?))' ,'.*?', ] ;
quote = regexp(buffer, pattern_price, 'names') ;
end
Please help me out :)
Kind regards Damian
Cedric
Cedric on 9 Jul 2014
Edited: Cedric on 9 Jul 2014
Hi Damian, this is an old thread that I am not tracking anymore. I'll try to remember to check, but send me an email if I disappear.
I don't understand what you need to extract exactly, so here are a few examples tailored to the page that you are trying to process. I am using simple tokens (...) instead of named tokens (?<theName>...), to simplify patterns. Outputs are therefore not struct arrays, but simpler cell arrays.
url = 'http://www.investing.com/currencies/eur-jpy-technical?period=300' ;
html = urlread( url ) ;
To extract the 3 numbers in the top block:
pattern = 'last">([\d\.]+).*?ltr">([\-+\.\d]+).*?ltr">\(([\-+\.\d]+)' ;
tokens = regexp( html, pattern, 'tokens', 'once' ) ;
This gives you the following:
>> tokens
tokens =
'138.31' '+0.04' '+0.03'
which you can then convert to double if needed:
>> str2double( tokens )
ans =
138.3100 0.0400 0.0300
To extract Neutral:
pattern = 'Neutral:<\D*?(\d+)' ;
tokens = regexp( html, pattern, 'tokens', 'once' ) ;
(again, you can convert tokens to double if needed). To get the RSI entry, whatever the number:
pattern = 'RSI\(.*?open_\d+">([\-+\d\.]+)' ;
tokens = regexp( html, pattern, 'tokens', 'once' ) ;
To get it specifying the number:
n = 14 ;
pattern = sprintf( 'RSI\\(%d.*?open_\\d+">([\\-+\\d\\.]+)', n ) ;
tokens = regexp( html, pattern, 'tokens', 'once' ) ;
Etc..
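As a small offline check of the RSI patterns, they can be run against the HTML fragment quoted in the comment above. This is only a sketch: the fragment is pasted as a fixed string, and the closing tag of the first cell is assumed to be </td>. It also shows why the 14 could not be extracted before: unescaped parentheses delimit a token, so a literal parenthesis has to be written \( or \).

```matlab
% Sample HTML taken from the fragment quoted in the comment above, with the
% closing tag of the first cell assumed to be </td>.
html = ['<td class="first left symbol" id="pair_name_0">RSI(14)</td>', ...
        '<td class="right" id="open_0">54.427</td>'];

% '\(' and '\)' match literal parentheses; '(\d+)' is the token to extract.
period = regexp(html, 'RSI\((\d+)\)', 'tokens', 'once');
rsiPeriod = str2double(period{1});

% Build the value pattern for a given period; backslashes are doubled because
% sprintf consumes one level of escaping.
pattern = sprintf('RSI\\(%d.*?open_\\d+">([\\-+\\d\\.]+)', rsiPeriod);
value = regexp(html, pattern, 'tokens', 'once');
rsiValue = str2double(value{1});
```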
Cheers,
Cedric
PS/EDIT : try to decompose the patterns and identify the different parts, e.g. in
Neutral:<\D*?(\d+)
the first part
Neutral:<
is a string of literals (static text to match as it is). The second part
\D*?
matches as few non-numeric characters as possible (meaning match whatever you find until you hit a numeric digit). In the third part
(\d+)
the parentheses delineate the token (the thing to extract), and \d+ matches one or more (as many as possible) numeric digits.
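The same decomposition (literal anchor, filler, token) can be sketched for a word instead of a number by using \w+ inside the token. Both fragments below are hypothetical strings loosely modeled on the page discussed above, not the live HTML, and the anchor text chosen for the second pattern is an assumption.

```matlab
% Hypothetical fragments, loosely modeled on the page discussed above.
countHtml   = 'Neutral:<span class="count">3</span>';
summaryHtml = ['<div class="studySummary bold arial_14">Summary:<span ', ...
               'class="studySummaryOval neutral arial_12 bold" ', ...
               'title="NEUTRAL">NEUTRAL</span>'];

% Literal 'Neutral:<', then as few non-digits as possible, then the digits.
count = regexp(countHtml, 'Neutral:<\D*?(\d+)', 'tokens', 'once');
nNeutral = str2double(count{1});

% Same structure for a word: literal anchor, skip to the title attribute
% without leaving the tag ('[^>]*'), then capture word characters with '\w+'.
word = regexp(summaryHtml, 'studySummaryOval[^>]*title="(\w+)"', 'tokens', 'once');
summary = word{1};
```

Anchoring on the class name rather than on the whole tag keeps the pattern working if the other attributes change between page loads.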


More Answers (1)

Sven
Sven on 10 Mar 2013
MATLAB is reasonably well suited to do all the things you're looking for.
Check out, for example, the Trendy section of the MATLAB community website. It's an entire section specifically dedicated to periodic scraping of web URL data, that is subsequently presented to the user in some form showing changes in this data over time. The Trendy site was originally set up for daily scraping, but there's nothing stopping you from setting your own timing. I'm not suggesting that you should implement your solution as a Trendy entry, but you may benefit from checking out the source code of some of the entries.
A couple of points I would make:
  1. The latest version of MATLAB actually has a Financial Toolbox, which may contain some of the functionality you're planning on writing yourself.
  2. There may already be various data sources set up that do the dirty work (scraping of historic data) for you. It might be more "elegant" to simply load from those data sources whenever you want to display your results.
  3. The periodic part of your implementation is something that might take some thought. One way would be to simply have a computer with MATLAB which runs all day, all night, on an infinite loop with a 1-hour pause() command. Another way might be to have some system event (say, a cron job) occur every hour which silently starts MATLAB, runs the short script which scrapes data, and then closes MATLAB again. I guess this would all depend on what you specifically want/need this application to do.
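As a rough sketch of the periodic part, a third option inside a running MATLAB session is a timer object, which is a different technique from the pause() loop and the cron job described above. In the example below, fetchPrice is a hypothetical placeholder for the real scraping call, and the period is shortened to 0.2 seconds with 3 samples so that the example finishes quickly.

```matlab
% Placeholder for the real scraping call (hypothetical); returns a fake price.
fetchPrice = @() 431.72;

% For a quick demonstration the period is 0.2 s and only 3 samples are taken;
% an hourly logger would use 'Period', 3600 and drop 'TasksToExecute'.
t = timer('ExecutionMode', 'fixedRate', 'Period', 0.2, ...
          'TasksToExecute', 3, 'UserData', zeros(0, 2));

% Each tick appends [timestamp, price] to the matrix kept in UserData.
t.TimerFcn = @(src, ~) set(src, 'UserData', ...
    [get(src, 'UserData'); now, fetchPrice()]);

start(t);
wait(t);                  % block until the 3 samples are collected
samples = t.UserData;     % one row per sample
delete(t);
```

Unlike a pause() loop, a timer leaves the command prompt usable between samples when wait(t) is omitted, but MATLAB still has to stay open for the logging to continue.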
Hope that could help in answering your question.
