
Preference Settings for Large Data Import

When using the setdbprefs function to set 'FetchInBatches' and 'FetchBatchSize', or the corresponding Cursor Fetch options in the Preferences dialog box, use the following guidelines to determine what batch size to use. These guidelines are based on evaluating:

  • The size of your data (n rows) to import into MATLAB®

  • The JVM™ heap requirements for the imported data

The general logic for making these evaluations is as follows (a code sketch after this list shows one way to combine the checks):

  • If your data (n rows) will fit in a MATLAB variable, then will all your data fit in the JVM heap?

    • If yes, use the following preference setting:

      setdbprefs('FetchInBatches','no')
    • If no, evaluate h such that h < n and data of size h rows fits in the JVM heap. Use the following preference setting:

      setdbprefs('FetchInBatches','yes')
      setdbprefs('FetchBatchSize','<h>')
  • If your data (n rows) will not fit in a MATLAB variable, then:

    • Evaluate m such that m < n and the data of size m rows fits in a MATLAB variable.

    • Evaluate h such that h < m < n and data of size h rows fits in the JVM heap. Use the following preference setting:

      setdbprefs('FetchInBatches','yes')
      setdbprefs('FetchBatchSize','<h>')

      Then import the data using fetch or runsqlscript, using the value m to limit the number of rows in the output:

      curs = fetch(curs,<m>)

      or

      results = runsqlscript(conn,'<filename>.sql','rowInc','<m>')

  • If you are using the native ODBC interface to import large amounts of data, you do not need to change these settings because the native ODBC interface always fetches data in batches of 100,000 rows. You can still override the default batch size by setting 'FetchInBatches' to 'yes' and 'FetchBatchSize' to a number of your choice. Note that JVM heap memory restrictions do not apply in this case since the native ODBC interface is a C++ API.
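
The following is a minimal sketch of how these checks might be combined in a script. The values of nRows, mRows, and hRows are hypothetical estimates; replace them with values measured for your own data and configuration.

nRows = 5e6;   % total rows to import (n)
mRows = 1e6;   % rows that fit in a MATLAB variable (m)
hRows = 2e5;   % rows that fit in the JVM heap (h)

if nRows <= hRows
    setdbprefs('FetchInBatches','no')       % everything fits in the heap
else
    setdbprefs('FetchInBatches','yes')      % fetch in heap-sized batches
    setdbprefs('FetchBatchSize',num2str(hRows))
end

% If nRows also exceeds mRows, additionally limit each import to mRows, for example:
% curs = fetch(curs,mRows)
% results = runsqlscript(conn,'<filename>.sql','rowInc',num2str(mRows))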

Will All Data (Size n) Fit in a MATLAB Variable?

This example shows how to estimate the size of data to import from a database.

It is important to have an idea of the size of the data that you are looking to import from a database. Finding the size of the table(s) in the database can be misleading because the MATLAB representation of the same data is likely to consume more memory. For instance, say your table has one numeric column and one text column, and you are looking to import it into a cell array. Here is how you can estimate the total size.

data = {1001, 'some text here'};
whos data
 Name      Size            Bytes  Class    Attributes

  data      1x2               156  cell               

If you are looking to import a thousand rows of the table, the approximate size in MATLAB would be 156 * 1000 = 156 KB. You can replicate this process for a structure or a dataset, depending on the data type into which you want to import the data. Once you know the size of the data to be imported, you can determine whether it fits in a MATLAB variable by running the memory command in MATLAB.

A conservative approach is recommended here so as to take into account memory consumed by other MATLAB tasks and other processes running on your machine. For example, even if you have 12 GB RAM and the memory command in MATLAB shows 14 GB of longest array possible, it might still be a good idea to limit your imported data to a reasonable 2 or 3 GB to be able to process it without issues. Note that these numbers vary from site to site.
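
For example, a rough check along these lines might look like the following sketch. The bytesPerRow and nRows values and the 0.25 safety factor are hypothetical, and the memory function is available on Windows systems only.

bytesPerRow = 156;                  % bytes per row, from the whos estimate above
nRows       = 1e6;                  % total rows you plan to import (n)
estBytes    = bytesPerRow * nRows;  % rough size of the imported variable

userview = memory;                  % query MATLAB memory limits (Windows only)
% Use only a fraction of the largest possible array, leaving room for
% other MATLAB tasks and other processes on the machine.
fitsInVariable = estBytes < 0.25 * userview.MaxPossibleArrayBytes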

Will All of This Data Fit in the JVM Heap?

This example shows how to determine the size of the JVM heap.

You can determine the size of your JVM heap by selecting Preferences > General > Java Heap Memory in MATLAB. You can increase this value to an allowable size, but keep in mind that increasing the JVM heap reduces the total memory available to MATLAB arrays. Instead, consider fetching data in smaller batches so you can keep a low to medium value for heap memory.
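
You can also query the current maximum JVM heap size programmatically, as in the following sketch. The java.lang.Runtime call is standard; the bytesPerRow value and the 0.5 safety factor are hypothetical.

maxHeapBytes = double(java.lang.Runtime.getRuntime.maxMemory);  % current JVM heap limit in bytes
bytesPerRow  = 156;                                             % from the whos estimate above
hRows        = floor(0.5 * maxHeapBytes / bytesPerRow)          % candidate 'FetchBatchSize' value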

How Do I Perform Batching?

There are three different methods based on your evaluations of the data size and the JVM heap size. Let n be the total number of rows in the data you are looking to import, m be the number of rows that fit in a MATLAB variable, and h be the number of rows that fit in the JVM heap.

Method 1 — Data Does Not Fit in MATLAB Variable or JVM Heap

If your data (n) does not fit in a MATLAB variable or a JVM heap, you need to find h and m such that h < m < n.

To use automated batching to fetch those m rows in MATLAB:

setdbprefs('FetchInBatches','yes')
setdbprefs('FetchBatchSize','<h>')

If using exec, fetch, and connection object conn:

curs = exec(conn,'Select….');
curs = fetch(curs,<m>);

If using runsqlscript to run a query from an SQL file:

results = runsqlscript(conn,'<filename>.sql','rowInc','<m>')

Once you are done processing these m rows, you can import the next m rows using the same commands. Keep in mind, however, that using the same cursor object curs for this results in the first curs being overwritten, including everything in curs.Data.
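
For example, a processing loop might look like the following sketch. The batch size, the value of m, the SQL query, and the processChunk helper are hypothetical, and the {'No Data'} check assumes the default cellarray data return format.

setdbprefs('FetchInBatches','yes')
setdbprefs('FetchBatchSize','100000')          % h: rows per internal batch
m = 1000000;                                   % rows held in the MATLAB variable at once

curs = exec(conn,'SELECT * FROM mytable');     % hypothetical query
curs = fetch(curs,m);
while ~strcmpi(curs.Data{1},'No Data')
    processChunk(curs.Data);                   % hypothetical helper that processes these m rows
    curs = fetch(curs,m);                      % the next m rows overwrite curs.Data
end
close(curs)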

    Note: If 'FetchInBatches' is set to 'yes' and the total number of rows fetched is less than 'FetchBatchSize', MATLAB shows a warning message and then fetches all the rows. The message is: Batch size specified was larger than the number of rows fetched.

Method 2 — Data Fits in MATLAB Variable But Not in JVM Heap

If your data (n) does fit in a MATLAB variable but not in a JVM heap, you need to find h such that h < n.

To use automated batching, where h rows fit in the JVM heap:

setdbprefs('FetchInBatches','yes')
setdbprefs('FetchBatchSize','<h>')

If using exec, fetch, and the connection object conn:

curs = exec(conn,'Select….');
curs = fetch(curs);

If using runsqlscript to run a query from an SQL file:

results = runsqlscript(conn,'<filename>.sql')

Note that when you use automated batching and do not supply the rowLimit argument to fetch or the rowInc parameter to runsqlscript, a count query is executed internally to get the total number of rows to be imported. This is done to preallocate the output variable for better performance. In most cases, the count query overhead is small, but you can avoid it entirely if you know, or have a good estimate of, the value of n:

curs = fetch(curs,<n>)

or

results = runsqlscript(conn,'<filename>.sql','rowInc','<n>')

Method 3 — Data Fits in MATLAB Variable and JVM Heap

If your data (n) fits in a MATLAB variable and also in a JVM heap, then you need not use batching at all.

setdbprefs('FetchInBatches','no')

If using fetch:

curs = fetch(curs);

If using runsqlscript to run a query from an SQL file:

results = runsqlscript(conn,'<filename>.sql')