Cost-Effective Next Generation Correlator

Second draft

Jouko Ritakari, Jouko.Ritakari@hut.fi

Metsähovi Radio Observatory

January 25, 2001

The purpose of this document

The purpose of this document is to explore the possibility of building a cost-effective next generation correlator using mostly commercial off-the-shelf components. The only VLBI-specific component needed is the correlator chip or correlator board.

At least two new correlator chips are being developed, one in the ALMA project and one in the EVLA project.

A description of the ALMA correlator can be found at http://alma.nrao.edu/development/correlator/ and a description of the EVLA correlator at http://www.aoc.nrao.edu/doc/vla/EVLA/EVLA_home.shtml

In this document the correlator is considered to be a special-purpose batch-processing computer cluster that is decoupled from the properties of the data storage or data transmission systems. This is an important change from the synchronized-data-stream approach used in the old-style correlators.

This is intentionally a quick-and-dirty design. Only components and technologies that are available now from the nearest computer store are used. In the last chapter I will list possible improvements to this design.

I thank the Evntech people for the suggestions and advice.

Background

Many of the technical limitations that constrained the design of the existing correlators do not exist any more. Buffer memory and processing power are cheap. New correlator chips are at least two times faster and have many times the density of the old chips (if 0.18 micron technology is used in the new chips, they contain approximately twenty times more transistors than the 0.8 micron chips we use now).

Moving to near real-time and real-time VLBI will impose new constraints. In particular, the new correlator design must be able to use IP-based data networks to communicate with the antennas.

Some of this framework has been discussed in the EVN technical document #111, Mark IV memo #281, "Concept for Next Generation VLBI", available at http://kurp.hut.fi/vlbi/instr/nexgen.html.

In several cases VLBI is moving to direct IF sampling and use of digital filters instead of old-style baseband converters. Some examples of this trend are the ALMA project http://alma.nrao.edu/ , the Japanese VERA system http://veraserver.mtk.nao.ac.jp/ and the VLA expansion project http://www.aoc.nrao.edu/doc/vla/EVLA/EVLA_home.shtml .

The main problem with the old-style correlators is that they operate on synchronized data streams. The speed of the data streams must be equal and the delay between the data streams must be carefully adjusted. If the correlator is faster than the data streams (from the tape recorders) the speed advantage is lost.

Specifications

I will outline a design of a 16-station correlator with 1 Gbit/s data rate per station.

Each station will have eight baseband converters. Each baseband converter has two sidebands, and each sideband is sampled at a rate of 32 Msamples/s with two-bit sampling. This configuration is more or less the same as what the VLBI people desire at this moment, and it has two to four times more bandwidth than what is available now.

The 1 Gbit/s data stream is divided into sixteen independent substreams of 64 Mbit/s each. This means that we need sixteen independent correlator engines that can each correlate sixteen stations at 32 Msamples/s with two-bit sampling.
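
The arithmetic behind these figures can be checked with a short Python sketch. The variable names are illustrative, and the sketch assumes that each sideband is sampled at 32 megasamples per second, which is consistent with the 64 Mbit/s per-channel rate used in the bandwidth estimates of this document.

```python
# Per-station data rate for the configuration described above.
BASEBAND_CONVERTERS = 8
SIDEBANDS_PER_CONVERTER = 2
SAMPLE_RATE_MSPS = 32   # megasamples per second per sideband (assumed)
BITS_PER_SAMPLE = 2

substreams = BASEBAND_CONVERTERS * SIDEBANDS_PER_CONVERTER
substream_rate_mbps = SAMPLE_RATE_MSPS * BITS_PER_SAMPLE
station_rate_mbps = substreams * substream_rate_mbps

print(substreams, "substreams of", substream_rate_mbps, "Mbit/s each")
print("total per station:", station_rate_mbps, "Mbit/s")
```

The total of 1024 Mbit/s is what the text rounds to 1 Gbit/s.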

The purpose of this design is to demonstrate how the existing correlators could be replaced with relatively simple machines, if we abandon the tape-based mindset and treat the correlator as a special-purpose computer.

Design goals

The correlator must be scalable. Adding stations or speeding the correlator up must be possible without changing the basic design. Improving correlator speed beyond the capabilities of the correlator chips must be possible using time division multiplexing.
Must be able to use IP-based data networks. Cost-effective communications bandwidth will transport IP packets, not bits or custom frames. We must accept this.
Must be compatible with COTS tape drives.
Must be compatible with short-term storage of data on hard disks on the stations.
Technology must have a clear upgrade path. Replacing tape drives, network technology or memory technology should have only minimal impact on the design.
Performance of the correlator must be equal to or better than the performance of existing correlators, for example the JIVE correlator.
Only standard and easily available components are used, with the exception of the correlator chips.

Non-goals

The following features are not supported. Adding these features may be detrimental to the functionality of the total system.

Support for legacy tape recorders (Mark IV, VLBA etc).
Support for real-time data streams (instead, data will be buffered and correlated in a batch).
Support for VLBI Standard Interface (VSI).

Proposed design

The new correlator will consist of a bank of sixteen correlator engines. The correlator engines are independent and batch-process chunks of data.

The correlator engines will not operate in real time. If real-time operation is required, a sufficient number of correlator engines is used so that some of them are able to collect data while others are batch-processing.

Correlator engine

A correlator engine consists of a controller board and a correlator board. Another configuration could be a single-board correlator that has only a few correlator chips, since the data is correlated in batches and the correlator chips are typically much faster than the data communication lines or recorder channels.

The correlator chip has the following capabilities:

Clock rate 125 MHz
2-bit, 4-level correlation
4096 or 8192 lags per chip

The correlator engine controller has the following capabilities:

An embedded Linux controller to set up the correlation, controlled via 100 Mbit/s Ethernet.
Sixteen 100 Mbit/s Ethernets for incoming data, one for each data stream.
Massive buffer memory, 256 Mbytes of DDR SDRAM for each data stream.
Logic to feed the correlator card or correlator chips from DDR SDRAM at 125 MHz clock rate.
Logic to adjust the fine delay of the data streams and do fringe rotation, if not implemented in the correlator chips.
All the logic in the controller board is implemented in FPGA in VHDL. Commercial VHDL cores for 100 Mbit/s Ethernet controller and DDR SDRAM controller are available.

Operation of the correlator engine

The correlator engine will batch-process the data using the following steps.

These steps will be controlled by the embedded Linux computer that has received the high-level commands from the main control computer.

The buffer memory is cleared or written with pseudorandom data. This step is necessary to ensure that packet loss does not cause erroneous correlation results.
The Linux computer requests a 256 Mbyte block (32 seconds at 32 Msamples/s, 2 bits per sample) of data from each station to be correlated. The data can come from local tape playback units, from hard disk storage at the stations or directly via the Internet from the digital baseband converters. The data request contains the starting time of the first data block.
The stations (or the tape playback units) send the data to the correlator engine in UDP packets. The UDP packets are sorted by a Gigabit Ethernet switch in front of the correlator engine. Each 100 Mbit/s Ethernet port receives only those UDP packets that are addressed to it.
The 100 Mbit/s Ethernet controller in the front-end FPGA receives the UDP packets. A small state machine discards the headers and stores the payload data into the SDRAM buffer. The memory address is determined by the unique packet sequence number in the packet header.
When all the data has arrived and the buffers are full, the data is correlated in the correlator board. The controller board has logic to transfer the data from buffer memory to the correlator board and adjust the fine delay of the data streams. In practice, the data is correlated in small chunks, the results are read out and the results of different batches are summed together.
The Linux computer sends the correlation results to the main control computer using Ethernet and TCP/IP.
The correlator engine is ready for the next batch of data.
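
The buffer handling in the steps above (prefilling the buffer, then placing each payload by its sequence number) can be sketched in Python as follows. This is only an illustration of the idea, not the FPGA logic itself. The packet layout is an assumption made for the example: a 32-bit sequence number followed by a 1024-byte payload. The function names are hypothetical as well.

```python
import os
import struct

PAYLOAD_BYTES = 1024          # assumed payload size per UDP packet
HEADER = struct.Struct("!I")  # assumed header: 32-bit sequence number


def new_buffer(n_packets):
    """Prefill the buffer with random bytes so that lost packets do not correlate."""
    return bytearray(os.urandom(n_packets * PAYLOAD_BYTES))


def store_packet(buffer, packet):
    """Drop the header and write the payload into the slot given by the sequence number."""
    (seq,) = HEADER.unpack_from(packet)
    payload = packet[HEADER.size:HEADER.size + PAYLOAD_BYTES]
    offset = seq * PAYLOAD_BYTES
    if len(payload) != PAYLOAD_BYTES or offset + PAYLOAD_BYTES > len(buffer):
        return False  # malformed or out-of-range packet: ignore it
    buffer[offset:offset + PAYLOAD_BYTES] = payload
    return True


# Packets may arrive out of order; packet 1 is lost in this example.
buf = new_buffer(3)
for seq in (2, 0):
    store_packet(buf, HEADER.pack(seq) + bytes([seq + 1]) * PAYLOAD_BYTES)
```

Because the address depends only on the sequence number, the order of arrival does not matter, and the slot of a lost packet simply keeps its random fill.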

Physical implementation

To minimize the complexity of the correlator engine printed circuit board, each correlator channel is designed to be as independent from other channels as possible.

Basically the correlator channel consists of a DDR SDRAM memory module and one FPGA chip.

All the logic (100 Mbit/s Ethernet Controller, SDRAM controller, fringe rotator etc.) is implemented in the FPGA.

Passing of station parameters to the correlator channels

One of the problems in designing the correlator engine is the need to pass station parameters from the control computer to the correlator channels. The normal solution would be to design a bus structure connecting all the correlator channel FPGAs.

However, in this case we could use the 100 Mbit/s Ethernet ports and send the station parameters in UDP packets to the FPGAs. The wiring of the correlator engine card would then be very simple: only a few reset and clock lines would be needed between the channels.
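
As an illustration of how small such a parameter packet could be, here is a Python sketch that packs and unpacks one. The field list (channel number, coarse delay in samples, fine delay as a fraction of a sample, fringe rate in Hz) and the byte layout are purely hypothetical; this document does not define the actual station parameters.

```python
import struct

# Hypothetical layout, network byte order:
# channel (16-bit), coarse delay in samples (32-bit),
# fine delay as a fraction of a sample (double), fringe rate in Hz (double).
PARAMS = struct.Struct("!HIdd")


def encode_params(channel, delay_samples, fine_delay, fringe_rate_hz):
    """Build the payload of one parameter packet."""
    return PARAMS.pack(channel, delay_samples, fine_delay, fringe_rate_hz)


def decode_params(payload):
    """Recover the parameter tuple from a packet payload."""
    return PARAMS.unpack(payload)


payload = encode_params(3, 1500, 0.25, 12.5)
print(len(payload), decode_params(payload))
```

A fixed-size layout like this can be parsed by a small state machine, in the same way as the data packets.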

Performance compared to the old-style correlators

These are rough estimates of the performance and complexity of the proposed quick-and-dirty correlator compared to the existing correlators, for example the JIVE correlator.

The existing correlators have a 64 MHz clock speed that cannot be fully utilized because of the slow speed of the old-style tape drives. (The tape drives run at 8 Mbit/s per track; 16 Mbit/s may be possible in the future, but it will not be easy.)
The JIVE correlator has 32 boards, each with 32*512 lags, for a total of 524,288 lags. If eight 4096-lag chips are used in every correlator engine, the sixteen engines provide the same total number of lags. If separate correlator boards are used, the number of lags is considerably greater.
Compatible with real-time and near real-time VLBI vs. old-style correlators not compatible.
Compatible with COTS tape recording vs. old-style correlators use legacy open-reel tapes.
Easy to expand or reconfigure vs. hardwired design in the old correlators.
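
The lag comparison in the list above can be verified with a few lines of Python. The figures are taken from the text; only the variable names are new.

```python
# JIVE correlator: 32 boards, each with 32*512 lags.
jive_lags = 32 * 32 * 512

# Proposed design: 16 correlator engines, eight 4096-lag chips in each.
proposed_lags = 16 * 8 * 4096

print(jive_lags, proposed_lags)
```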

Extending the capabilities

Adding lags

If more lags are needed, the correlator engine performs several correlation runs with the same data, shifting the data streams on each run by the number of lags available in the correlator chips.

Doubling the lags doubles the correlation time needed. Adding lags may be especially helpful in fringe searches. The number of lags available is limited only by the size of the buffer memory. Time-consuming tape operations (rewinding etc.) are unnecessary, because the data is in the buffer memory.
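
The multi-pass idea can be sketched in Python with a plain software correlator standing in for the chips. The function names and the tiny test signal are illustrative. Each pass covers `chip_lags` consecutive lags and the passes are concatenated.

```python
def correlate(x, y, first_lag, n_lags):
    """Correlate x against y for lags first_lag .. first_lag + n_lags - 1."""
    n = min(len(x), len(y))
    result = []
    for lag in range(first_lag, first_lag + n_lags):
        total = 0
        for i in range(n - lag):
            total += x[i] * y[i + lag]
        result.append(total)
    return result


def correlate_multipass(x, y, chip_lags, passes):
    """Extend the lag range by rerunning the same buffered data with a shifted stream."""
    result = []
    for p in range(passes):
        result.extend(correlate(x, y, p * chip_lags, chip_lags))
    return result


x = [1, -1, 1, 1, -1, 1, -1, -1, 1, 1, -1, 1]
y = [0] * 5 + x[:7]                       # y is x delayed by five samples
lags = correlate_multipass(x, y, chip_lags=4, passes=2)
```

In this example the five-sample delay lies outside the first four-lag pass and is only covered by the second pass. Two passes over the buffered data give the same result as a single pass with eight lags.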

Adding stations

The most sensible way to add stations is to collect all the data in one correlation engine and perform several correlation runs. Doubling the number of stations quadruples the time needed for correlation.
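
The factor of four follows from the number of station pairs. A small Python check, assuming that correlation time scales with the number of baselines:

```python
def baselines(stations):
    """Number of distinct station pairs (cross-correlations)."""
    return stations * (stations - 1) // 2


for n in (16, 32):
    print(n, "stations:", baselines(n), "baselines")

ratio = baselines(32) / baselines(16)
```

Going from 16 to 32 stations raises the baseline count from 120 to 496, a ratio slightly above four.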

Improving correlation speed

If wider continuous bandwidth is required, correlation engines can be used in time-multiplexed fashion in the same style as in the ALMA correlator. In this case Gigabit Ethernet links for incoming data would be very useful, otherwise the design could remain the same.

Bandwidth considerations

The following estimates are based on technology that is available now. Higher-speed technologies will probably become available before the final design of the system.

Network interface subsystem

The network interface subsystem will use 100 Mbit/s Ethernets, one for each channel.

If 32 megasamples per second speed is used with two-bit sampling, data will be arriving at the speed of 64 Mbit/s (+ overhead), clearly within the capabilities of a 100 Mbit/s Ethernet.
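
To give a feeling for the overhead, the following Python sketch estimates the rate on the wire. The payload size (1024 bytes) and the 4-byte sequence number header are assumptions made for this example. The Ethernet, IPv4 and UDP overheads are the standard per-frame values, assuming no IP options and no VLAN tag.

```python
DATA_RATE_MBPS = 64            # 32 Msamples/s, two bits per sample

PAYLOAD_BYTES = 1024           # assumed data payload per packet
SEQUENCE_HEADER_BYTES = 4      # assumed application header
UDP_HEADER_BYTES = 8
IPV4_HEADER_BYTES = 20         # without options
ETHERNET_OVERHEAD_BYTES = 38   # preamble 8 + MAC header 14 + FCS 4 + interframe gap 12

wire_bytes = (PAYLOAD_BYTES + SEQUENCE_HEADER_BYTES + UDP_HEADER_BYTES
              + IPV4_HEADER_BYTES + ETHERNET_OVERHEAD_BYTES)
wire_rate_mbps = DATA_RATE_MBPS * wire_bytes / PAYLOAD_BYTES

print("rate on the wire: %.3f Mbit/s" % wire_rate_mbps)
```

Under these assumptions the overhead is less than seven percent, so the link is loaded to roughly two thirds of its capacity.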

Memory subsystem

The memory subsystem will use commercial DDR (double data rate) SDRAM modules.

At this moment the clock speed of the modules is 133 MHz and the modules are 64-bit wide, clocking data in on both edges of the clock.

The peak data rate of one module is approximately 2.1 gigabytes per second (133 MHz * 2 transfers per clock * 8 bytes).

At this moment 256 MB DDR SDRAM modules are commercially available. At the speed of 32 megasamples per second and two-bit sampling, one module can store 32 seconds of data.

The memory subsystem clearly will not limit the performance of the correlator engine.
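
The memory figures can be checked with the following Python sketch. The variable names are illustrative, and the read rate assumes that the buffer is emptied at the 125 MHz correlator clock with two bits per sample.

```python
MODULE_CLOCK_MHZ = 133
TRANSFERS_PER_CLOCK = 2        # double data rate
BUS_WIDTH_BYTES = 8            # 64-bit module

peak_mbytes_per_s = MODULE_CLOCK_MHZ * TRANSFERS_PER_CLOCK * BUS_WIDTH_BYTES

# Writing: incoming data at 32 Msamples/s, two bits per sample.
write_mbytes_per_s = 32 * 2 / 8
# Reading: feeding the correlator at 125 MHz, two bits per sample.
read_mbytes_per_s = 125 * 2 / 8

MODULE_MBYTES = 256
buffer_seconds = MODULE_MBYTES / write_mbytes_per_s

print(peak_mbytes_per_s, write_mbytes_per_s, read_mbytes_per_s, buffer_seconds)
```

Both the write rate (8 Mbytes/s) and the read rate (about 31 Mbytes/s) are a small fraction of the peak module bandwidth.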

Correlator card input

The correlator card will accept 16 two-bit signals at a rate of 125 MHz, so the maximum input bandwidth of the correlator card is 2*16*2*125 Mbit/s = 8 Gbit/s. (The same data will be fed to the correlator card from two directions.)

If enough correlator chips are used that we can correlate all the data at one time, correlating the 256 MB input buffers (or 32 seconds of real-time data) takes about eight seconds.
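
Both numbers follow directly from the clock rates, as this Python sketch shows. The variable names are illustrative.

```python
STATIONS = 16
BITS_PER_SAMPLE = 2
CHIP_CLOCK_MHZ = 125
FEED_DIRECTIONS = 2            # the same data is fed in from two directions

input_bandwidth_mbps = FEED_DIRECTIONS * STATIONS * BITS_PER_SAMPLE * CHIP_CLOCK_MHZ

BUFFER_SECONDS = 32
SAMPLE_RATE_MSPS = 32
buffered_msamples = BUFFER_SECONDS * SAMPLE_RATE_MSPS
correlation_seconds = buffered_msamples / CHIP_CLOCK_MHZ
speedup = CHIP_CLOCK_MHZ / SAMPLE_RATE_MSPS

print(input_bandwidth_mbps, correlation_seconds, speedup)
```

The correlator runs at almost four times the recording rate, so a 32-second buffer is processed in a little over eight seconds.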

Correlator card output

The contents of the correlator card can be drained in less than one millisecond.

The correlator card contains 64*4096 accumulators, each 16 bits wide. (512 kilobytes)
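
A short Python check of the accumulator size, and of the readout rate that draining it within one millisecond would require. The one-millisecond target is the figure from the text; the required rate is derived here and is not a measured value.

```python
ACCUMULATORS = 64 * 4096
BITS_PER_ACCUMULATOR = 16

result_bytes = ACCUMULATORS * BITS_PER_ACCUMULATOR // 8
result_kbytes = result_bytes // 1024

DRAIN_SECONDS = 1e-3
required_mbytes_per_s = result_bytes / DRAIN_SECONDS / 1e6

print(result_kbytes, "kilobytes,", required_mbytes_per_s, "Mbytes/s")
```

Draining 512 kilobytes in one millisecond requires a readout path of somewhat more than 500 Mbytes/s.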

Improving the design

This quick-and-dirty design can be improved in several ways, most of them probably not worth the effort.

Replacing the 100 Mbit/s Ethernets with Gigabit Ethernets would reduce the sixteen input lines per correlator engine to two. A more complicated state machine would be needed to sort the incoming UDP packets to memory, and the speed of the controller FPGA would need to be higher.
Retransmission of lost packets. This could be implemented fairly easily. The Ethernet controllers can collect the sequence numbers of received packets, and the control processor can then send a retransmission request containing the numbers of missing packets to the stations. One or two retransmissions would be the optimum; packets still missing after that would be treated as lost.
Use of one gigabyte of memory to buffer each data stream, which would increase the buffer size from 32 seconds of data to 128 seconds.
Divide the buffer memory in two and use the halves as ping-pong buffers. This would facilitate real-time correlation: one half of the buffer memory would accept data while the other half is being correlated. This would be at the cost of more complicated software.
Decrease the number of correlator chips. The bandwidth of the correlator card is four times higher than the bandwidth of the input lines. The number of correlator chips could be decreased and the whole correlator engine could fit on one board. Or the correlator engine could use a more convenient form factor.
The FPGA chips could be programmed by the embedded Linux computer on board via JTAG interface after every power-up. This would facilitate easy upgrading, since no program ROMs would be needed.
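
The retransmission scheme in the list above can be sketched in Python. The `send_request` callback is a hypothetical stand-in for the request sent to a station; it returns the sequence numbers that actually arrived. The simulated lossy link exists only to exercise the logic.

```python
def missing_packets(received, expected_count):
    """Return the sorted sequence numbers that have not arrived yet."""
    return sorted(set(range(expected_count)) - set(received))


def receive_with_retries(send_request, expected_count, max_retries=2):
    """Request a batch, then re-request missing packets up to max_retries times."""
    received = set(send_request(list(range(expected_count))))
    for _ in range(max_retries):
        missing = missing_packets(received, expected_count)
        if not missing:
            break
        received.update(send_request(missing))
    return received, missing_packets(received, expected_count)


# Simulated link: packet 3 never arrives, packet 5 arrives on the second try.
attempts = {}


def lossy_link(requested):
    arrived = []
    for seq in requested:
        attempts[seq] = attempts.get(seq, 0) + 1
        if seq == 3 or (seq == 5 and attempts[seq] == 1):
            continue
        arrived.append(seq)
    return arrived


received, lost = receive_with_retries(lossy_link, expected_count=8)
```

After two retransmissions the remaining packets are given up as lost, and their buffer slots keep the pseudorandom fill.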

Conclusion

Designing the next generation correlator with easily available microcomputer components (with the exception of the custom correlator chip) seems to be feasible and cost-effective.

In this document, I have proposed a simple, cost-effective correlator with significantly better performance than the existing VLBI correlators.

Although the title of this paper is "Cost-Effective Next Generation Correlator", the same method can be used to design the best-in-the-world all-singing-all-dancing supercorrelator.

Just add more modules.