Efficiently Uploading Data Using CSV and JDBC

Data feeds, or bulk transfers of data from one system to another, are a common feature in many enterprise applications. However, they are often implemented inefficiently. First, the source system runs a nightly batch job that generates a (typically large) update file. Some time later, another job looks for this file and uploads it to the destination system, usually via FTP or something similar. Finally, a third job on the remote system looks for the uploaded file and imports it (assuming that it is available).

This sort of process is slow and error prone. The destination system must wait until the first two processes (generating the export file, then uploading it) are complete before it can even start. Because the second and third steps are time-dependent, they may fail if an unexpected condition caused any preceding task to run late.

A better alternative is to perform these operations in parallel. By streaming the output generated by the source system, the destination system can consume it as it is being produced, eliminating the redundant and time-consuming copy steps.

Example Export

For example, the following code simulates a batch process that exports a CSV document. It uses the WebServiceProxy and CSVEncoder classes from the open-source HTTP-RPC framework to upload the simulated data to a RESTful web service. The string values passed to the encoder’s constructor represent the columns in the output file:

WebServiceProxy webServiceProxy = new WebServiceProxy("POST", new URL(url));

webServiceProxy.setRequestHandler((outputStream) -> {
    CSVEncoder csvEncoder = new CSVEncoder(listOf("text1", "text2", "number1", "number2", "number3"));

    csvEncoder.write(new Rows(count), outputStream);
});

webServiceProxy.invoke();

The data is provided by the following class, which simply generates an arbitrary number of duplicate rows. In a real application, the data would most likely come from a relational database or something similar:

public static class Rows implements Iterable<Map<String, Object>> {
    private int count;

    private int i = 0;

    public Rows(int count) {
        this.count = count;
    }

    private Map<String, Object> row = mapOf(
        entry("text1", "abcdefghijklmnopqrstuvwxyz"),
        entry("text2", "ABCDEFG"),
        entry("number1", 123456),
        entry("number2", 101.05),
        entry("number3", 2002.0125)
    );

    @Override
    public Iterator<Map<String, Object>> iterator() {
        return new Iterator<Map<String,Object>>() {
            @Override
            public boolean hasNext() {
                return i < count;
            }

            @Override
            public Map<String, Object> next() {
                i++;

                return row;
            }
        };
    }
}

The generated document would look something like this:

text1,text2,number1,number2,number3
"abcdefghijklmnopqrstuvwxyz","ABCDEFG",123456,101.05,2002.0125
"abcdefghijklmnopqrstuvwxyz","ABCDEFG",123456,101.05,2002.0125
"abcdefghijklmnopqrstuvwxyz","ABCDEFG",123456,101.05,2002.0125
...

Import Service

A web service for processing the exported data might look something like the following. Rather than reading the entire payload into memory up front, the method uses HTTP-RPC’s CSVDecoder class to obtain a cursor over the rows in the CSV document. As each record is read, it is inserted into the database:

@WebServlet(urlPatterns={"/bulk-upload/*"}, loadOnStartup=1)
public class BulkUploadService extends WebService {
    ...

    private static final String INSERT_SQL = "INSERT INTO bulk_upload_test ("
        + "text1, text2, number1, number2, number3) VALUES ("
        + ":text1, :text2, :number1, :number2, :number3)";

    @RequestMethod("POST")
    @ResourcePath("upload")
    public void upload() throws SQLException, IOException {
        CSVDecoder csvDecoder = new CSVDecoder();

        Iterable<Map<String, String>> cursor = csvDecoder.read(getRequest().getInputStream());

        Parameters parameters = Parameters.parse(INSERT_SQL);

        try (Connection connection = DriverManager.getConnection(DB_URL);
            PreparedStatement statement = connection.prepareStatement(parameters.getSQL())) {
            for (Map<String, String> row : cursor) {
                parameters.apply(statement, row);
                statement.executeUpdate();
            }
        }
    }
}

Batch Updates

Unfortunately, even though this service efficiently consumes the data provided by the client, it is very slow. Importing a mere 500 records takes nearly 30 seconds!

The solution is to insert the records in batches, as shown below:

@WebServlet(urlPatterns={"/bulk-upload/*"}, loadOnStartup=1)
public class BulkUploadService extends WebService {
    ...

    private static final int BATCH_SIZE = 25000;

    @RequestMethod("POST")
    @ResourcePath("upload-batch")
    public void uploadBatch() throws SQLException, IOException {
        CSVDecoder csvDecoder = new CSVDecoder();

        Iterable<Map<String, String>> cursor = csvDecoder.read(getRequest().getInputStream());

        Parameters parameters = Parameters.parse(INSERT_SQL);

        int i = 0;

        try (Connection connection = DriverManager.getConnection(DB_URL);
            PreparedStatement statement = connection.prepareStatement(parameters.getSQL())) {
            for (Map<String, String> row : cursor) {
                parameters.apply(statement, row);
                statement.addBatch();

                if (++i % BATCH_SIZE == 0) {
                    statement.executeBatch();
                }
            }

            statement.executeBatch();
        }
    }
}

This method is nearly identical to the preceding version. However, instead of executing a database update for each row, the updates are batched and executed once every 25,000 rows. The service can now process 500,000 rows in about 30 seconds – 1,000 times faster than the previous version!

Additional Information

Data feeds are a common element in many enterprise systems, and are often implemented inefficiently. However, by streaming imports and using batch updates, performance can be significantly improved.

The complete source code for this example can be found here:

For more information, see the HTTP-RPC README.

You Don’t Need GraphQL

1/19/2019 Updated for HTTP-RPC 6.0

GraphQL is a technology that seems to be getting a lot of attention in the developer community at the moment. Advocates describe it as “a better REST”, claming that it offers several advantages over traditional REST APIs:

  • Single request with nested results vs. multiple separate requests
  • Single endpoint for all requests vs. one endpoint per resource
  • Single evolving version vs. multiple (presumably incompatible) versions

For example, the following GraphQL query might be used to retrieve an employee record from a hypothetical service based on the MySQL “employees” sample database. In addition to the employee number, first name, and last name, the query also requests the employee’s title and salary history:

{
  employee(id: 10004) {
    employeeNumber
    firstName
    lastName
    titles {
      title
      fromDate
      toDate
    }
    salaries {
      salary
      fromDate
      toDate
    }
  }
}

The response might look something like this, with some results omitted for brevity:

{
  "employeeNumber": 10004,
  "firstName": "Chirstian",
  "lastName": "Koblick",
  "titles": [
    {
      "title": "Senior Engineer",
      "fromDate": 817794000000,
      "toDate": 253370782800000
    },
    ...
  ],
  "salaries": [
    {
      "salary": 74057,
      "fromDate": 1006837200000,
      "toDate": 253370782800000
    },
    ...
  ]
}

A RESTful Implementation

The data model for the sample database is shown below:

Employees Sample Database

A typical REST API might provide access to employee, title, and salary resources as follows:

GET /employees/10004
{
  "employeeNumber": 10004,
  "firstName": "Chirstian",
  "lastName": "Koblick"
}
GET /employees/10004/titles
[
  {
    "title": "Senior Engineer",
    "fromDate": 817794000000,
    "toDate": 253370782800000
  },
  ...
]
GET /employees/10004/salaries
[
  {
    "salary": 74057,
    "fromDate": 1006837200000,
    "toDate": 253370782800000
  },
  ...
]

This is indeed more verbose than the GraphQL version. However, there is nothing preventing a REST API from providing a similar interface.

For example, the following service method (implemented using the open-source HTTP-RPC framework) returns the same information as the GraphQL query. As with the GraphQL version, all of the data is obtained with a single request:

@RequestMethod("GET")
@ResourcePath("?:employeeNumber")
public void getEmployee(List<String> details) throws SQLException, IOException {
    String employeeNumber = getKey("employeeNumber");

    Parameters parameters = Parameters.parse("SELECT emp_no AS employeeNumber, "
        + "first_name AS firstName, "
        + "last_name AS lastName "
        + "FROM employees WHERE emp_no = :employeeNumber");

    HashMap<String, Object> arguments = new HashMap<>();

    arguments.put("employeeNumber", employeeNumber);

    try (Connection connection = DriverManager.getConnection(DB_URL);
        PreparedStatement statement = connection.prepareStatement(parameters.getSQL())) {
        parameters.apply(statement, arguments);

        try (ResultSet resultSet = statement.executeQuery()) {
            ResultSetAdapter resultSetAdapter = new ResultSetAdapter(resultSet);

            for (String detail : details) {
                switch (detail) {
                    case "titles": {
                        resultSetAdapter.attach("titles", "SELECT title, "
                            + "from_date AS fromDate, "
                            + "to_date AS toDate "
                            + "FROM titles WHERE emp_no = :employeeNumber");

                        break;
                    }

                    case "salaries": {
                        resultSetAdapter.attach("salaries", "SELECT salary, "
                            + "from_date AS fromDate, "
                            + "to_date AS toDate "
                            + "FROM salaries WHERE emp_no = :employeeNumber");

                        break;
                    }
                }
            }

            getResponse().setContentType("application/json");

            JSONEncoder jsonEncoder = new JSONEncoder();

            jsonEncoder.writeValue(resultSetAdapter.next(), getResponse().getOutputStream());
        }
    } 
}

The initial query retreives the employee’s number, first name, and last name from the “employees” table. Subqueries to return the employee’s salary and title history are optionally attached based on the values provided in the details parameter. Column aliases are used in all of the queries to make the field names more JSON-friendly.

Callers can access the API via a standard HTTP GET request, as shown below:

GET /employees/10004?details=titles&details=salaries
{
  "employeeNumber": 10004,
  "firstName": "Chirstian",
  "lastName": "Koblick",
  "titles": [
    {
      "title": "Senior Engineer",
      "fromDate": 817794000000,
      "toDate": 253370782800000
    },
    ...
  ],
  "salaries": [
    {
      "salary": 74057,
      "fromDate": 1006837200000,
      "toDate": 253370782800000
    },
    ...
  ]
}

Additional Observations

GraphQL advocates tout its single-endpoint model as a major advantage over REST. This capability is not exclusive to GraphQL – it is certainly possible for REST APIs to be implemented using a single endpoint as well. However, such a service would probably become untenable very quickly. A collection of independent endpoints, each of which represent a specific resource or set of resources, will most likely be much more manageable in the long run.

Further, the concept of a single evolving version is not unique to GraphQL. Implementing a successful versioning strategy is difficult, and there are many ways of approaching it. However, there is nothing to preclude a REST service from providing backwards compatibility. It is simply one option among many.

Finally, adopting GraphQL requires services to be completely re-implemented using the GraphQL library. For any non-trivial application, this would most likely be a major undertaking. Additionally, it forces clients to use GraphQL as well, rather than standard HTTP operations such as GET and POST. This means that GraphQL APIs also can’t be tested as easily in a web browser or using command-line utilties such as curl.

So, while there are certainly a number of compelling reasons to consider GraphQL, you don’t actually need to use GraphQL to take advantage of them.

For more information on HTTP-RPC, see the project README.

Dynamically Loading Table View Images in iOS

11/13/2018 Updated for Xcode 10/Swift 4.2

iOS applications often display thumbnail images in table views alongside other text-based content such as contact names or product descriptions. However, these images are not usually delivered with the initial response, but must instead be retrieved separately afterward. They are typically downloaded in the background as needed to avoid blocking the main thread, which would temporarily render the user interface unresponsive.

For example, consider this web service, which returns a list of simulated photo data:

[
  {
    "albumId": 1,
    "id": 1,
    "title": "accusamus beatae ad facilis cum similique qui sunt",
    "url": "http://placehold.it/600/92c952",
    "thumbnailUrl": "http://placehold.it/150/92c952"
  },
  {
    "albumId": 1,
    "id": 2,
    "title": "reprehenderit est deserunt velit ipsam",
    "url": "http://placehold.it/600/771796",
    "thumbnailUrl": "http://placehold.it/150/771796"
  },
  {
    "albumId": 1,
    "id": 3,
    "title": "officia porro iure quia iusto qui ipsa ut modi",
    "url": "http://placehold.it/600/24f355",
    "thumbnailUrl": "http://placehold.it/150/24f355"
  },
  ...
]

Each record contains a photo ID, album ID, and title, as well as URLs for both thumbnail and full-size images; for example:

View Controller

A basic user interface for displaying results returned by this service is shown below:

Row data is stored in an array of Photo instances:

struct Photo: Decodable {
    let id: Int
    let albumId: Int
    let title: String?
    var url: URL?
    var thumbnailUrl: URL?
}

Previously loaded thumbnail images are stored in a dictionary that associates UIImage instances with photo IDs:

class ViewController: UITableViewController {
    // Row data
    var photos: [Photo]?

    // Image cache
    var thumbnailImages: [Int: UIImage] = [:]

    ...    
}

The photo list is loaded the first time the view appears. The WebServiceProxy class provided by the open-source Kilo framework is used to retrieve the data:

override func viewWillAppear(_ animated: Bool) {
    super.viewWillAppear(animated)

    // Load photo data
    if (photos == nil) {
        let serviceProxy = WebServiceProxy(session: URLSession.shared, serverURL: URL(string: "https://jsonplaceholder.typicode.com")!)

        serviceProxy.invoke(.get, path: "/photos") { (result: [Photo]?, error: Error?) in
            self.photos = result ?? []

            self.tableView.reloadData()
        }
    }
}

Table view cells are represented by the following class, implemented using the open-source Lima layout framework:

class PhotoCell: LMTableViewCell {
    var thumbnailImageView: UIImageView!
    var titleLabel: UILabel!

    override init(style: UITableViewCell.CellStyle, reuseIdentifier: String?) {
        super.init(style: style, reuseIdentifier: reuseIdentifier)

        setContent(LMRowView(
            UIImageView(contentMode: .scaleAspectFit, width: 50, height: 50) { self.thumbnailImageView = $0 },
            LMSpacer(width: 0.5, backgroundColor: UIColor.lightGray),
            LMColumnView(spacing: 0,
                UILabel(font: UIFont.preferredFont(forTextStyle: .body), numberOfLines: 2) { self.titleLabel = $0 },
                LMSpacer()
            )
        ), ignoreMargins: false)
    }

    required init?(coder decoder: NSCoder) {
        return nil
    }
}

Cell content is generated as follows. The corresponding Photo instance is retrieved from the photos array and used to configure the cell. If the thumbnail image is already available in the cache, it is used to populate the cell's thumbnail image view. Otherwise, it is loaded from the server and added to the cache. If the cell is still visible when the image request returns, it is updated immediately:

override func tableView(_ tableView: UITableView, numberOfRowsInSection section: Int) -> Int {
    return photos?.count ?? 0
}

override func tableView(_ tableView: UITableView, cellForRowAt indexPath: IndexPath) -> UITableViewCell {
    let photoCell = tableView.dequeueReusableCell(withIdentifier: PhotoCell.description(), for: indexPath) as! PhotoCell

    guard let photo = photos?[indexPath.row] else {
        fatalError()
    }

    // Attempt to load image from cache
    photoCell.thumbnailImageView.image = thumbnailImages[photo.id]

    if photoCell.thumbnailImageView.image == nil,
        let url = photo.thumbnailUrl,
        let scheme = url.scheme,
        let host = url.host,
        let serverURL = URL(string: String(format: "%@://%@", scheme, host)) {
        // Request image
        let serviceProxy = WebServiceProxy(session: URLSession.shared, serverURL: serverURL)

        serviceProxy.invoke(.get, path: url.path, responseHandler: { content, contentType in
            return UIImage(data: content)
        }) { (result: UIImage?, error: Error?) in
            // Add image to cache and update cell, if visible
            if let thumbnailImage = result {
                self.thumbnailImages[photo.id] = thumbnailImage

                if let cell = tableView.cellForRow(at: indexPath) as? PhotoCell {
                    cell.thumbnailImageView.image = thumbnailImage
                }
            }
        }
    }

    photoCell.titleLabel.text = photo.title

    return photoCell
}

Finally, if the system is running low on memory, the image cache is cleared:

override func didReceiveMemoryWarning() {
    super.didReceiveMemoryWarning()

    thumbnailImages.removeAll()
}

Summary

This article provided an overview of how images can be dynamically loaded to populate table view cells in iOS. Complete source code for this example can be found here.

Caching Web Service Response Data in iOS

11/13/2018 Updated for Xcode 10/Swift 4.2

Many iOS applications obtain data via web APIs that return JSON documents. For example, the following table view controller uses the Kilo WebServiceProxy class to invoke a simple web service that returns a simulated list of users as JSON. The controller requests the user list when the view first appears, and reloads the table view once the data has been retrieved:

class ViewController: UITableViewController {
    var users: [User]?

    override func viewWillAppear(_ animated: Bool) {
        super.viewWillAppear(animated)
    
        // Load user data
        if (users == nil) {
            let serviceProxy = WebServiceProxy(session: URLSession.shared, serverURL: URL(string: "https://jsonplaceholder.typicode.com")!)
    
            serviceProxy.invoke(.get, path: "/users") { (result: [User]?, error: Error?) in
                if (error == nil) {
                    self.users = result ?? []
    
                    self.tableView.reloadData()
                }        
            }
        }
    }
    
    ...
}

User records are represented by instances of the following structure:

struct User: Codable {
    struct Address: Codable {
        let street: String
        let suite: String
        let city: String
        let zipcode: String

        struct Geo: Codable {
            let lat: String
            let lng: String
        }

        let geo: Geo
    }

    struct Company: Codable {
        let name: String
        let catchPhrase: String
        let bs: String
    }

    let id: Int
    let name: String
    let username: String
    let email: String
    let address: Address
    let phone: String
    let website: String
    let company: Company
}

The results are shown below:

This works fine when both the device and the service are online, but it fails if either one is not. In some cases this may be acceptable, but other times it might be preferable to show the user the most recent response when more current data is not available.

To facilitate offline support, the response data must be cached. However, since writing to the file system is a potentially time-consuming operation, it should be done in the background to avoid blocking the main (UI) thread. Here, the data is written using an operation queue to ensure that access to it is serialized:

class ViewController: UITableViewController {
    var userCacheURL: URL?
    let userCacheQueue = OperationQueue()

    var users: [User]?

    override func viewDidLoad() {
        super.viewDidLoad()

        title = "Response Data Cache"

        tableView.estimatedRowHeight = 2
        tableView.register(UserCell.self, forCellReuseIdentifier: UserCell.description())

        if let cacheURL = FileManager.default.urls(for: .cachesDirectory, in: .userDomainMask).first {
            userCacheURL = cacheURL.appendingPathComponent("users.json")
        }
    }

    override func viewWillAppear(_ animated: Bool) {
        super.viewWillAppear(animated)

        // Load user data
        if (users == nil) {
            let serviceProxy = WebServiceProxy(session: URLSession.shared, serverURL: URL(string: "https://jsonplaceholder.typicode.com")!)

            serviceProxy.invoke(.get, path: "/users") { (result: [User]?, error: Error?) in
                if (error == nil) {
                    self.users = result ?? []

                    self.tableView.reloadData()

                    // Write the response to the cache
                    if let userCacheURL = self.userCacheURL {
                        self.userCacheQueue.addOperation() {
                            let jsonEncoder = JSONEncoder()

                            if let data = try? jsonEncoder.encode(self.users) {
                                try? data.write(to: userCacheURL)
                            }
                        }
                    }
                } else {
                    ...
                }
            }
        }
    }
    
    ...
}

Finally, the data can be retrieved from the cache if the web service call fails. The data is read from the cache in the background, and the UI is updated by reloading the table view on the main thread:

class ViewController: UITableViewController {
    ...

    override func viewWillAppear(_ animated: Bool) {
        super.viewWillAppear(animated)

        // Load user data
        if (users == nil) {
            let serviceProxy = WebServiceProxy(session: URLSession.shared, serverURL: URL(string: "https://jsonplaceholder.typicode.com")!)

            serviceProxy.invoke(.get, path: "/users") { (result: [User]?, error: Error?) in
                if (error == nil) {
                    ...
                } else {
                    // Read the data from the cache
                    if let userCacheURL = self.userCacheURL {
                        self.userCacheQueue.addOperation() {
                            let jsonDecoder = JSONDecoder()

                            if let data = try? Data(contentsOf: userCacheURL) {
                                self.users = (try? jsonDecoder.decode([User].self, from: data)) ?? []

                                // Update the UI
                                OperationQueue.main.addOperation() {
                                    self.tableView.reloadData()
                                }
                            }
                        }
                    }
                }
            }
        }
    }
    
    ...
}

Now, as long as the application has been able to connect to the server at least once, it can function either online or offline, using the cached response data.

Complete source code for this example can be found here.