Parquet file format

Parquet is an open source file format available to any project in the Hadoop ecosystem. Apache Parquet is a flat columnar storage format designed to be more efficient and performant than row-based files such as CSV or TSV.

Parquet uses the record shredding and assembly algorithm, which is superior to simple flattening of nested namespaces. It is optimized to work with complex data in bulk and offers several efficient compression and encoding schemes. This approach works especially well for queries that need to read only certain columns from a large table: Parquet can read just the needed columns, which greatly reduces I/O.
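To make the I/O saving concrete, here is a minimal sketch in plain Python. It is a toy model of row-oriented versus column-oriented layout, not Parquet's actual on-disk format, and the table, column names, and helper functions are all made up for illustration.

```python
# Toy model of row vs. column layout (NOT Parquet's real file format).
# The table and its column names are hypothetical.
rows = [
    {"user_id": 1, "country": "US", "amount": 10.0},
    {"user_id": 2, "country": "DE", "amount": 25.5},
    {"user_id": 3, "country": "US", "amount": 7.25},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_from_rows(records, name):
    """Row layout: each whole record is read to reach a single field."""
    values_touched = 0
    total = 0.0
    for record in records:
        values_touched += len(record)  # every field of the row is read
        total += record[name]
    return total, values_touched


def sum_from_columns(columns, name):
    """Column layout: only the requested column is read."""
    column = columns[name]
    return sum(column), len(column)


columns = to_columns(rows)
row_total, row_touched = sum_from_rows(rows, "amount")
col_total, col_touched = sum_from_columns(columns, "amount")

print(row_total, row_touched)  # 42.75 9
print(col_total, col_touched)  # 42.75 3
```

Both layouts give the same answer, but the columnar one touches a third of the values here. On a wide table with many columns the gap grows accordingly.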

Advantages of Storing Data in a Columnar Format:

  • Columnar storage like Apache Parquet is designed to be more efficient than row-based files like CSV. When querying columnar storage, you can skip over non-relevant data very quickly. As a result, aggregation queries take less time than in row-oriented databases. This way of storing data translates into hardware savings and lower latency for accessing data.
  • Apache Parquet is built from the ground up to support advanced nested data structures. The layout of Parquet data files is optimized for queries that process large volumes of data, in the gigabyte range for each individual file.
  • Parquet is built to support flexible compression options and efficient encoding schemes. Because all values in a column share the same data type, each column compresses well, and the smaller files make queries even faster. Data can be compressed with any of several available codecs, so different data files can be compressed differently.
  • Apache Parquet works best with interactive and serverless technologies like Amazon Athena, Amazon Redshift Spectrum, Google BigQuery and Google Dataproc.
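Dictionary encoding and run-length encoding are two of the encoding schemes Parquet supports, and both work well because a column holds values of one type that often repeat. The sketch below shows the idea in plain Python. It is a simplified illustration, not Parquet's actual byte layout, and the function names and sample column are assumptions made for this example.

```python
def dictionary_encode(values):
    """Replace each value with an integer index into a list of distinct values."""
    dictionary = []
    positions = {}
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def run_length_encode(items):
    """Collapse consecutive repeats into [item, run_length] pairs."""
    runs = []
    for item in items:
        if runs and runs[-1][0] == item:
            runs[-1][1] += 1
        else:
            runs.append([item, 1])
    return runs


def decode(dictionary, runs):
    """Invert both steps to recover the original column."""
    return [dictionary[index] for index, count in runs for _ in range(count)]


# Hypothetical low-cardinality column with repeated neighbouring values.
country = ["US", "US", "US", "DE", "DE", "US", "FR", "FR", "FR", "FR"]

dictionary, indices = dictionary_encode(country)
runs = run_length_encode(indices)

print(dictionary)  # ['US', 'DE', 'FR']
print(runs)        # [[0, 3], [1, 2], [0, 1], [2, 4]]
assert decode(dictionary, runs) == country
```

Ten strings shrink to a three-entry dictionary plus four short runs. In a row-based file the same values would be interleaved with other fields, so runs like these would not form.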

Difference Between Parquet and CSV

CSV is a simple and widespread format: many tools, such as Excel and Google Sheets, can generate CSV files. Even though CSV is the default format for data processing pipelines, it has some disadvantages, which show up mainly as cost:

  • Amazon Athena and Spectrum will charge based on the amount of data scanned per query.
  • Google and Amazon will charge you according to the amount of data stored on Google Cloud Storage or S3.
  • Google Dataproc charges are time-based.

Parquet has helped its users reduce storage requirements by at least one-third on large datasets. In addition, it has greatly improved scan and deserialization time, and hence overall costs.
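As a rough worked example of how scan-based billing and column pruning combine, consider the sketch below. Every number in it is a hypothetical assumption chosen for easy arithmetic: the price per terabyte, the table size, the column counts, and the idea that all columns are the same size. Only the one-third storage reduction comes from the figure above, so check your provider's current pricing before relying on anything like this.

```python
def scan_cost(bytes_scanned, price_per_tb):
    """Cost of one query billed by the amount of data scanned."""
    return bytes_scanned / 1e12 * price_per_tb


# All values below are hypothetical, for illustration only.
PRICE_PER_TB = 5.00   # assumed USD per TB scanned
CSV_BYTES = 3e12      # assumed 3 TB table stored as CSV
TOTAL_COLUMNS = 10    # assumed equally sized columns
COLUMNS_NEEDED = 2    # columns the query actually reads

# Storage reduced by one-third, as in the figure quoted above.
parquet_bytes = CSV_BYTES * 2 / 3

# CSV: the whole file is scanned regardless of which columns are needed.
csv_cost = scan_cost(CSV_BYTES, PRICE_PER_TB)

# Parquet: only the needed columns are scanned.
parquet_scanned = parquet_bytes * COLUMNS_NEEDED / TOTAL_COLUMNS
parquet_cost = scan_cost(parquet_scanned, PRICE_PER_TB)

print(f"CSV full scan:        ${csv_cost:.2f}")      # $15.00
print(f"Parquet, 2 of 10 cols: ${parquet_cost:.2f}")  # $2.00
```

Under these assumptions most of the saving comes from skipping unneeded columns, with the smaller file size adding to it.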

Zia Chishti and his AI unicorn – Afiniti

Located in a loft-like office across the street from the White House, Afiniti is one of the DC tech scene’s major success stories. It’s one of just two “unicorns” (privately held startups valued at more than $1 billion) in the District, along with Vox Media. And its artificial-intelligence technology is being used by businesses such as UnitedHealth Group and T-Mobile. If you’ve called a customer-service number lately, you might have used it, too.

So what does Afiniti do? At its core, it’s software that utilizes various kinds of information to match callers with agents in ways that improve the interaction on both ends. Rather than calls being assigned to representatives in the order they come in, the algorithm analyzes traits (prior interactions with the company, purchase history, demographic information, and so forth) to predict which pairings will have the best results, for both the caller and the company. It then assigns the caller to the staffer who’s likely to be the best match.

“We look at the history of the agent—the last 100, 1,000, 10,000 calls the agent has taken,” explains CEO Zia Chishti. “Almost all of them are with different people. If you look at the differing outcomes that associate with different people, you can use that to predict which kinds of individuals that agent would best pair with in the future.” Neither the caller nor the agent has any idea what info is being used to make the connection. It’s all calculated in real time by the software.

Afiniti says the results can be dramatic, adding up to significant savings or additional revenues for clients. If customers feel a connection with the representative they’re assigned to, a successful outcome of the call is more likely. “In terms of why this happens, it’s the name of the company,” says Afiniti global head of data Julian Lopez-Portillo, explaining that some clients think of the concept as a sort of Tinder for call centers. “If you have a greater affinity for a particular agent, you’re more likely to want to buy something from them, or if you’re going to cancel, you’re more likely to want to stay with the company.”

Currently, the technology is used by more than 30 companies in 18 countries, including big players in financial services, insurance, hospitality, and health care. Next time you call your bank, pay attention to the person on the other end of the line. She might have been custom-picked by Afiniti to fit your specific profile.

Chishti is an entrepreneur whose first venture, Align Technology (the company behind Invisalign clear braces), was also a massive success. (It pulled in about $2 billion in revenue last year.) Born in the US, he grew up mostly in Pakistan. He came back to this country to attend college, then got his MBA at Stanford. After cofounding Align Technology in Northern California in 1997, he parted ways with the company in early 2003 and started a new business, TRG, that pursued a broad range of other ventures. One of his partners, Mohammad Khaishgi, relocated to Washington that year after his wife got a job here, and Chishti decided to follow. “I was a little hesitant to move,” he says. “I bought into the California lifestyle.”

But Chishti ended up staying in Washington, and while steering TRG, he started thinking about potential new products. Customer-service technology might not seem like a logical next step from teeth straighteners, but Chishti was looking to innovate in a way that “addresses a large market and creates a lot of value in that market,” he says. “Call centers are huge. Everybody uses a phone.” The business that evolved into Afiniti launched in 2006, initially using software Chishti had programmed himself at home.