How to read multiple huge text files, the fastest way?

Hi All,
I am quite new to MATLAB, so sorry for the naive question. I would appreciate your kind help with my problem, described below.
I have around 10,000 input text files to read and process in MATLAB. The files contain only numerical data, but each file is around 12-15 MB, so the total input size is around 125~150 GB.
First, I tried using fgetl() to read each file line by line, but it took very long. So I changed the input file format to a set of numbers separated by whitespace and used fscanf() to read each file into a matrix of size [1 inf]. It still takes a couple of hours to read all 10,000 files.
I also tried a parfor loop running in a matlabpool of size 8 (the system is a Linux server - 4 processors, each dual core). Even then, it takes more than 2 hours to read all the files.
Could anyone kindly let me know the fastest way to read this much data in MATLAB? My requirement is to read all of it (125~150 GB) in a couple of minutes.
Note: I can change the format of the input text files to achieve the fastest possible read. But I would like to read the inputs as numbers (not strings), since str2double() takes a lot of time during processing.
Thanks a million in advance. Expecting your expert advice.
Warm Regards
Anand Uthaman

7 Comments

What sort of storage technology are you using in your system and where does the data live? Even simple operations like copying 125-150 GB of data on local storage can take minutes to accomplish -- and you aren't even doing anything with the data.
If you are storing the data on a network filesystem, simply accessing the data is going to take a significant amount of time, depending on the capability of your network and file server -- more than local disk.
Of course, you can get around the above problems by using multiple hosts to access the data on a network -- but you would need a pretty decent bit of hardware to be able to serve up 125 GB in a couple of minutes.
In a related question, how are the 10,000 files organized? In my experience, once directory sizes (the number of files in them) grow beyond a few thousand files, the time to do anything in that directory starts to grow significantly. If you can reduce the sheer number of files by concatenating them in some manner, or by using a directory structure that keeps the number of files per directory down, this might help. Also Google around for your filesystem type to see whether it has documented or recommended limits; that might help with your problem.
Thanks a lot for your comment. There is no network delay, as all the files reside on the same local system, possibly in the same folder as the MATLAB code. I am doing some mathematical processing with the data (after reading), but all those commands take much less time than the file reads.
I will try rearranging the file organization. I have experienced the huge-directory problem you mention on Windows, but I am currently using a Linux server and it doesn't seem to have any problem handling 10K files in a directory. I will definitely give it a try. Thanks a lot.
Memory latency is measured in nanoseconds and disk access in milliseconds, so pretty much by definition the disk I/O is going to be more expensive. If you start loading up the machine enough to cause it to swap out, you'll be paying even more of a penalty.
Since you have full control over the file format and data, you might look into varying the file sizes to find the optimal file size for your problem. As Fangjun Jiang suggests below, doing selective reads and working on parts of the data could improve performance.
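As a sketch of what a selective read could look like once the data is in a flat binary format (the file name 'data.bin' and the record range are assumed for illustration):

```matlab
% Sketch: read doubles 1000..1999 from a flat binary file of 8-byte doubles.
fid = fopen('data.bin', 'r');
fseek(fid, 1000 * 8, 'bof');          % skip the first 1000 doubles (8 bytes each)
chunk = fread(fid, 1000, 'double');   % read only the next 1000 doubles
fclose(fid);
```

With fseek you pay only for the bytes you actually need, instead of parsing the whole file.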
You might also consider looking into using a SSD disk in the system versus the conventional drive. The price of a SSD has dropped into the realm of the possible for more people nowadays. Conventional drives still hold the upper hand for storage size versus cost, but the SSD's access time is pretty darn amazing.
In fact, at first the input was a single 150 GB file. I split it into 10,000 pieces to solve the memory problem and to speed up reading.
When I execute 100 loop iterations in parallel using parfor, it takes 'x' seconds, but 1000 iterations in parallel take more than 10*x seconds. I guess the non-linear time increase is because of swapping, right?
Selective reads would be an ideal solution, but they are not possible in my case, as I need the whole data for processing. I hope your suggestion of a binary format will tremendously improve performance.
The way to know if you are swapping is to watch something like "top". It seems that you might want to look somewhere less than 10,000 and more than 100 to see if you can do better :)
I am not using a conventional drive; this Linux server, I believe, already uses an SSD. To give you some statistics, the file read time on the Linux server is 18~20x faster than on the conventional hard disk in my local PC.
So the 2+ hour time for the 150 GB read (mentioned earlier) is already on the SSD.


 Accepted Answer

If you have total control over the file format, storing the data in a binary file format would make reading the data out of the file much faster.
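For example, a minimal round trip in MATLAB using fwrite/fread (the file name is illustrative):

```matlab
% Write a matrix as raw 8-byte doubles, then read it back.
A = rand(3, 4);
fid = fopen('data.bin', 'w');
fwrite(fid, A, 'double');      % column-major stream of doubles
fclose(fid);

fid = fopen('data.bin', 'r');
B = reshape(fread(fid, 'double'), 3, 4);  % fread returns a column vector
fclose(fid);
% B now equals A
```

No text parsing happens at all; the bytes on disk are copied straight into the double array.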

6 Comments

Wow! That's great input. I have total control over the file format, but currently I am not quite sure how to convert these ASCII files to binary and how to read the binary files in MATLAB. Anyway, I will pursue this direction.
By the way, can I ask why reading turns out to be so fast in binary format? Is it because the file size is reduced?
Anyway, thanks a zillion for the great input.
Dear Jeremy,
Can I please ask why binary read/write is so much faster than ASCII read/write? Does it make that much of a difference?
I found that reading binary in MATLAB is easy using fread(). But the program that outputs the 150 GB file is in Java, and I have to look into binary output from Java to make the interface work. By the way, I wonder whether there is any other penalty to using the binary method.
Warm Regards
Anand
+1 vote. Reading binary data is *much* faster than parsing ASCII text files. Parsing a text number requires a lot of arithmetic: read the first character; if it is a '-', the value is negative; if it is a digit, store it; read the next character; if it is a digit, multiply the stored value by 10, add the current digit, and store the result; if it is a dot, start reading the fractional part; if it is 'e', 'E', 'g' or 'G', start reading the exponent... And this has to be performed for every character! In addition, exceptions for malformed numbers, 'Inf' and 'NaN' must also be handled.
In comparison, reading a binary DOUBLE needs one step: read 8 bytes and store them in 8 bytes of memory.
Writing binary files is not that different from writing ASCII. Here is a very simple Java example to get you started:
import java.io.DataOutputStream;
import java.io.FileOutputStream;

int i = 42;
double d = Math.PI;
String strFile = "binary.dat";
DataOutputStream os = new DataOutputStream(new FileOutputStream(strFile));
os.writeInt(i);    // writes 4 bytes, big-endian
os.writeDouble(d); // writes 8 bytes (IEEE 754), big-endian
os.close();
The only real downside to using binary files is that you can't use a text editor to look at the data. If you need to visually inspect the data you need a hex editor, but there are several really good free ones available.
Thank you so much for the code, Jeremy. I tried this code in Java to output a sample file and read it in MATLAB using fread(), but the behaviour in MATLAB is strange: the numbers I write from Java are not what gets read in MATLAB. It works when the data is written from MATLAB itself, but not from Java.
For small numbers like 1 to 1000 written from Java, MATLAB is able to read them, but 3 or 4 zeros get inserted into the matrix between the actual numbers. I guess the problem is the number of bytes written versus read, but I am not sure how to solve it.
My Sample Output Numbers (to be written to file):
3231212 -2312413 54388621
Java Code to write file:
for (i = 0; i < 10; i++)
    os.writeLong(i + 123143);
Matlab Code to read: I have tried these but all are giving strange numbers in the matrix 'a'.
a = fread(fid, 'int');
a = fread(fid, 'int16');
a = fread(fid, 'int64');
a = fread(fid, 'int32');
a = fread(fid, 'long');
The reason for this problem is that Java writes its binary output in big-endian byte order, while MATLAB by default reads binary data in the machine's native order (little-endian on x86). So if you use writeLong in Java and fread in MATLAB with the defaults, the data you write will not match the data you read.
Solution:
Java:
for (i = 0; i < 27; i++)
    os.writeLong(i + 123243);
Matlab:
a = fread(fid, 'int64', 's');
You need to specify the machine format:
's' or 'ieee-be.l64' = big-endian ordering, 64-bit long data type
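An alternative that avoids passing the machine format on every read is to set it once when opening the file; fopen accepts a machineformat argument ('ieee-be', or the 64-bit variant 'ieee-be.l64' / 's'). A sketch (file name assumed from the Java example above):

```matlab
% Open the Java-written file with big-endian byte order set once.
fid = fopen('binary.dat', 'r', 'ieee-be');
a = fread(fid, 'int64');   % all reads on fid now interpret bytes as big-endian
fclose(fid);
```

Every subsequent fread on that file identifier then uses the big-endian interpretation automatically.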


More Answers (2)

If you have to read it as ASCII, your best option is textscan, which reads directly into whatever numeric format you specify (%f for double, %d for integer, etc.).
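A minimal sketch of such a read (the file name and the single-column %f format are assumptions for illustration):

```matlab
% Parse a whitespace-separated text file of numbers directly into doubles.
fid = fopen('data.txt', 'r');
c = textscan(fid, '%f');   % returns a 1x1 cell array holding a double column vector
fclose(fid);
x = c{1};                  % the numeric data, no str2double step needed
```

Because textscan converts during the scan, there is no intermediate string array to post-process.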

1 Comment

I just found that the textscan function is used by many people for file reading, but I was not sure it would be more efficient than the fgetl and fscanf functions in MATLAB. From what you say, I guess it is much faster, so I will definitely give it a try. Thank you so much, Matt.


I have been using importdata for text files, but it is very slow for text unless you rename all the files to '.txt'.
The function below seems to do the job for structured text files.
The structure the function can handle is shown at the bottom of the function: it only works for tables of floating-point numbers.
Let me know whether this works well.
%% READTEXTFILE reads text files without any checks
% READTEXTFILE reads from file and immediately filters out the selected columns.
% READTEXTFILE can read any number of header lines, but
% the header lines must contain both nRows=<a number> and nColumns=<a number> on separate lines;
% the lines containing nRows and nColumns must not contain any spaces.
%
% parameter filename = if the selected file is not a text file, the function will fail;
%   the extension of the filename is ignored and does not need to be present
% parameter selectedcolumns = header names of the columns to be selected from the file,
%   e.g., selectedcolumns = {'time', 'column_3'}
%
% The read data is returned in a struct content:
%   content.data = the actual data as a matrix
%   content.colheaders = the headers of the remaining columns
%   content.colheaders == selectedcolumns
%   content.textdata == content.colheaders
%
function content = readtextfile(varargin) % filename, selectedcolumns
tic
selectedcolumns = {};
if nargin > 2 || nargin == 0
    error('readtextfile: too many or too few arguments');
elseif nargin == 2
    selectedcolumns = varargin{2};
end
filename = varargin{1};
fid = fopen(filename, 'rt');
file.title = fgetl(fid);
file.nrows = string([]);
file.ncolumns = string([]);
line = string(fgetl(fid));
while line ~= "endheader"
    if isempty(file.nrows)
        file.nrows = regexp(line, '^nRows=(?<nrows>\d+)$', 'tokens', 'once');
    end
    if isempty(file.ncolumns)
        file.ncolumns = regexp(line, '^nColumns=(?<ncolumns>\d+)$', 'tokens', 'once');
    end
    line = string(fgetl(fid));
end
file.nrows = str2double(file.nrows);
file.ncolumns = str2double(file.ncolumns);
fsColHeaders = repmat(' %s', 1, file.ncolumns);
colHeaders = textscan(fid, fsColHeaders, 1, 'EndOfLine', '\r\n', 'MultipleDelimsAsOne', 1); % 3rd param (N) == 1 --> read once
fsData = repmat(' %f', 1, file.ncolumns);
fileData = textscan(fid, fsData, 'EndOfLine', '\r\n', 'MultipleDelimsAsOne', 1); % 3rd param (N) missing --> read until end of file
colHeaders = cellfun(@char, colHeaders, 'UniformOutput', false);
[~, copiedColumns] = ismember(selectedcolumns, colHeaders);
if ~isempty(copiedColumns)
    newMatrix = zeros(file.nrows, nnz(copiedColumns));
    newHeaders = cell(1, nnz(copiedColumns)); % preallocated; stays empty if nothing matched
    iNewColumns = 1;
    for iCopiedColumns = copiedColumns
        if iCopiedColumns > 0
            newMatrix(:, iNewColumns) = fileData{iCopiedColumns};
            % newHeaders is not strictly necessary, since it corresponds to selectedcolumns,
            % but it is helpful in checking proper operation of the function
            newHeaders(:, iNewColumns) = colHeaders(iCopiedColumns);
            iNewColumns = iNewColumns + 1;
        end
    end
else
    newMatrix = cell2mat(fileData);
    newHeaders = colHeaders;
end
fclose(fid);
content.data = newMatrix;
content.textdata = newHeaders;
content.colheaders = newHeaders;
toc
end
% example of possible headerlines
%{
the title
nRows=437
nColumns=17
any number of lines
endheader
time column_1 column_2 column_3 ...
0.001 0.1234 0.3456 0.7891
0.002 0.2234 0.4456 0.8891
%}
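Assuming a file with the header layout above, a call might look like this (the file name and column names are illustrative):

```matlab
% Hypothetical usage of readtextfile; 'data.txt' follows the example header layout.
c = readtextfile('data.txt', {'time', 'column_3'});
disp(c.colheaders)     % headers of the selected columns
selected = c.data;     % nRows-by-2 matrix holding only the selected columns
```

Omitting the second argument returns all columns of the table.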

Asked on 18 Mar 2011. Answered by bim on 25 Dec 2022.
