How to handle your imports?
Data imports are crucial to many websites, whether it’s an automotive portal or a classifieds site. Even when the volume of data is small, many other factors can make imports slow. I will explain our approach to imports, followed by a case study.
Run imports in parallel.
If your import file contains multiple customers, it’s good practice to run their imports in parallel. Why? Let’s assume your file contains 10 dealers and a single pipeline imports the whole file at once. What if, due to any glitch, software or hardware, it fails at dealer number 7? That dealer and the three after it remain unprocessed, so their ads are not updated. If you run the imports in parallel, only the problematic dealer is affected by the failure. You still have a problem, but it affects 10% of your customers instead of 40%.
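The 10-dealer scenario above can be sketched in Python with a thread pool. This is only an illustration under assumptions: `import_dealer`, `run_imports` and the shape of `feed` are hypothetical names, not part of any real import engine, and the failure at dealer 7 is simulated with a missing title.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def import_dealer(dealer_id, ads):
    """Hypothetical per-dealer import; raises when an ad is unusable."""
    if any(ad.get("title") is None for ad in ads):
        raise ValueError(f"dealer {dealer_id}: ad without title")
    return len(ads)  # number of ads imported


def run_imports(feed, max_workers=4):
    """Import each dealer as an independent task; return (succeeded, failed)."""
    succeeded, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(import_dealer, dealer_id, ads): dealer_id
            for dealer_id, ads in feed.items()
        }
        for future in as_completed(futures):
            dealer_id = futures[future]
            try:
                succeeded[dealer_id] = future.result()
            except Exception as exc:
                # One dealer failing does not stop the others.
                failed[dealer_id] = str(exc)
    return succeeded, failed


# Ten dealers, with dealer 7 carrying broken data.
feed = {n: [{"title": f"Car {n}"}] for n in range(1, 11)}
feed[7] = [{"title": None}]
succeeded, failed = run_imports(feed)
```

Here nine dealers end up in `succeeded` and only dealer 7 ends up in `failed`, which is the 10% case described above.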
Never trust data you’re supposed to receive.
Developers on the other end are only human and they make mistakes, just like everyone else. Try to foresee every potential issue with the data you will receive. For example, the feed provides a vehicle’s engine capacity:
- Is it always in liters or cubic centimeters? Or maybe mixed?
- You don’t know how the 3rd party stores it locally, so do you know whether they convert it properly when exporting to you?
- Are you sure data you’re receiving fits your database’s data range?
- Are you sure it’s numeric?
- It’s easy to catch conversion errors here:
  - If the engine size is smaller than 20, it is almost certainly in liters.
  - If it’s between 400 and 20 000, you are dealing with cubic centimeters.
  - If it’s between 400 000 and 20 000 000, most likely someone on the other end is converting local values to cubic centimeters by multiplying by 1000, without checking the source data.
  - Anything outside those ranges is most likely an error, so it can be ignored.
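The range checks above could look roughly like this in Python. It is a minimal sketch: the function name `normalize_engine_cc` is hypothetical, and treating a comma as a decimal separator is an assumption about the feed, not something the ranges themselves require.

```python
def normalize_engine_cc(raw):
    """Return engine capacity in cubic centimeters, or None if implausible."""
    try:
        # Assumption: some feeds send "1,6" instead of "1.6".
        value = float(str(raw).strip().replace(",", "."))
    except ValueError:
        return None  # not numeric at all
    if 0 < value < 20:
        return round(value * 1000)  # liters
    if 400 <= value <= 20_000:
        return round(value)  # already cubic centimeters
    if 400_000 <= value <= 20_000_000:
        return round(value / 1000)  # cc multiplied by 1000 by mistake
    return None  # outside every plausible range


examples = ["1.6", 1998, 1998000, "abc", 50]
normalized = [normalize_engine_cc(value) for value in examples]
```

Returning `None` for anything suspicious keeps the decision about what to do with a bad value (skip the field, skip the ad, log it) in one place in the calling code.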
Analyze which data is crucial.
With most ads, the most important data is the text, which (happily) is also the fastest to process. Next come images and any 3rd party integrations. Run in parallel whatever you can; it makes the process faster and safer.
Vehicles are again a fine example here. When you receive a stock update, it contains two types of images:
- Images for new vehicles.
- Updated images for existing vehicles.
From a business perspective, it’s important to have images for all ads, yet once the ad text import is done, each new ad is still imageless. That’s why it’s good to prioritize image imports for new ads. You can also include ads that had zero images before but for which the feed now provides some.
Then you can process all other images. The remaining ads already have images, so they look good on the website; it is only a matter of checking whether the images in the current feed are the same or updated.
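One way to express this ordering is a simple sort of the image queue. The sketch below assumes each ad is a dict with hypothetical `stored_image_count` and `feed_image_urls` fields; a real engine would read these from its own database and feed parser.

```python
def order_image_jobs(ads):
    """Return ads that have feed images, imageless ads first."""
    with_feed_images = [ad for ad in ads if ad["feed_image_urls"]]
    # sorted() is stable, so feed order is kept within each priority group.
    return sorted(
        with_feed_images,
        key=lambda ad: 0 if ad["stored_image_count"] == 0 else 1,
    )


ads = [
    {"id": "existing-1", "stored_image_count": 5, "feed_image_urls": ["a.jpg"]},
    {"id": "new-1", "stored_image_count": 0, "feed_image_urls": ["b.jpg"]},
    {"id": "no-images", "stored_image_count": 0, "feed_image_urls": []},
    {"id": "existing-2", "stored_image_count": 2, "feed_image_urls": ["c.jpg"]},
    {"id": "was-empty", "stored_image_count": 0, "feed_image_urls": ["d.jpg"]},
]
queue = [ad["id"] for ad in order_image_jobs(ads)]
```

New ads and previously imageless ads land at the front of `queue`; ads that already look fine on the website are verified afterwards.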
There are various techniques for doing so, starting with checking the URL: if it’s the same as last time, the image is most likely unchanged. The second thing to check is the remote image’s size and its ETag. Both are sent in the HTTP response headers, so you don’t have to download the whole remote file to get them. It’s important to check the above for three reasons:
- The business always has the freshest data on the website.
- It speeds up your imports.
- It saves hardware resources, so the business can pay less for hardware.
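A rough sketch of these checks in Python, using only the standard library: a HEAD request returns the headers without the image body. The names `fetch_image_headers` and `image_changed`, and the shape of the `stored` record, are assumptions made for illustration. Not every server sends an ETag or Content-Length, so the sketch falls back to downloading when there is nothing to compare.

```python
import urllib.request


def fetch_image_headers(url, timeout=10):
    """Send a HEAD request; return response headers with lowercased names."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return {name.lower(): value for name, value in response.headers.items()}


def image_changed(stored, url, headers):
    """Decide whether an image has to be downloaded again.

    `stored` is what was saved during the previous import (hypothetical
    shape): {"url": ..., "etag": ..., "size": ...}, or None if never seen.
    """
    if stored is None or stored["url"] != url:
        return True
    etag = headers.get("etag")
    if etag and stored.get("etag"):
        return etag != stored["etag"]
    size = headers.get("content-length")
    if size and stored.get("size") is not None:
        return int(size) != stored["size"]
    return True  # nothing to compare against: download to be safe


url = "https://example.com/a.jpg"
stored = {"url": url, "etag": '"abc"', "size": 1000}
unchanged = image_changed(stored, url, {"etag": '"abc"', "content-length": "1000"})
```

In a real import the third argument would come from `fetch_image_headers(url)`; keeping the comparison in a separate function makes it testable without any network access.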
That is not everything; there are a lot more quirks when it comes to imports. Remember the main rules:
- Don’t trust incoming data.
- Prioritize most important data.
- Think ahead about all possible errors.
- Log as much as possible, so you can review past errors and analyze performance.
Using the above knowledge, we have had some big successes. One of our customers was using a Drupal plugin for imports, and it took 24 hours to process 100 000 ads and 20 000 000 images. We were asked to rebuild this engine, and the numbers below speak for themselves:
- Text data for new and existing ads is refreshed within 5 minutes.
- All new images are processed within 1 hour.
- Existing images are verified and, if required, updated within 4 hours.