About
Principal SDE in Azure AI. Working on infusing AI into product, including big LLM models,…
Activity
-
GPT5 in Azure AI Foundry! So excited to see what people can build with the next generation of model. https://xmrrwallet.com/cmx.paka.ms/GPT-5-blog
Liked by Haiyuan Cao
-
With custom instructions tailored to your repo, agents like GitHub Copilot coding agent can work faster and write higher quality code. But writing…
Liked by Haiyuan Cao
-
I'm more optimistic than ever that we at OpenAI can eliminate hallucinations. There's still more research to be done, but GPT-5 is solid progress. 🚀
Liked by Haiyuan Cao
Experience
Education
-
Columbia University in the City of New York
-
Activities and Societies: Columbia Data Science Society
Focus on Machine Learning, Platforms for Big Data Analysis and Developing Data-Driven Products
-
-
Activities and Societies: Member of American Physical Society, Referee of the journal 'Nanotechnology' (impact factor 3.821)
Majored in computational simulation, mathematical modelling and data analysis of energy transport in nanostructures and magnetic properties of iron superconductors. Proposed an algorithm for calculating the magnetic interaction and co-proposed a global optimization method for searching the structure of complex grain boundaries.
-
-
Licenses & Certifications
Volunteer Experience
-
Team Member of Youth Ambassador Program for Minorities (YAPM)
Technology and Education: Connecting Cultures, Inc. (TECC) - 501c3
- Present 15 years 2 months
Education
China, home to 55 minority groups, enjoys a rich ethnic and cultural diversity. However, minority cultures are at risk of being marginalized by economic modernization and national education. Confronted with the outside influences of Western and Han culture, local youngsters are unaware of the role they should play in preserving their own culture.
In this project, we went to a remote village in Honghe Hani and Yi Autonomous Prefecture in Yunnan Province. Of all the Hani villages in Honghe in north Yunnan Province, only a small fraction still speak the Hani language. While most programs address minority culture preservation through recording and documentation carried out by outside observers, we actively involved local youth by making them ambassadors of their own cultures. Our vision is to awaken a sense of responsibility among local youngsters and empower them to play a positive role in protecting their own culture and constructively impacting their surroundings.
Project objective:
1. Raised awareness about quintessential elements of Buyi culture and stressed the importance of cultural preservation among local youth.
2. Taught local youngsters to use digital cameras, audio recorders and the Internet to record the Hani culture. Encouraged local students to communicate with the outside world and provided a platform for global interaction.
3. Devised an organized and effective ethnic culture course to incorporate into local schools’ daily curriculum.
4. Refined a local culture preservation model which other minority groups can adopt.
-
Referee
Nanoscale
- Present 14 years
Science and Technology
Serve as a referee for Nanoscale, a leading peer-reviewed journal focused on nanoscience.
Publications
-
Thermal conductivity of disordered two-dimensional binary alloys
Nanoscale
Using advanced statistical simulations, we have studied the effect of disorder on the thermal conductivity of two-dimensional alloys. We find that the thermal conductivity not only depends on the substitution concentration of different elements, but also strongly depends on the disorder distribution.
-
Oxygen Vacancy Induced Flat Phonon Mode at FeSe/SrTiO3 interface
Nature Scientific Reports
-
Antiferromagnetic ground state with pair-checkerboard order in FeSe
Physical Review B
-
Measurement of an Enhanced Superconducting Phase and a Pronounced Anisotropy of the Energy Gap of a Strained FeSe Single Layer in FeSe/Nb:SrTiO3/KTaO3 Heterostructures Using Photoemission Spectroscopy
Physical Review Letters (Top journal in physics community)
-
Interfacial effects on the spin density wave in FeSe/SrTiO3 thin films
Physical Review B
-
Unexpected large thermal rectification in asymmetric grain boundary of graphene
Solid State Communications
Courses
-
Algorithms for Data Science
CSOR 4246
-
Bayesian Model in Machine Learning
EECS 6720
-
Computer Systems for Data Science
COMS 4121
-
Data Mining
STAT 4240
-
Exploratory Data Analysis and Visualisation
STAT 4701
-
Foundations of Graphical Models
STAT 6701
-
Introduction to Databases
COMS 4111
-
Natural Language Processing
COMS 4705
-
Statistical Machine Learning
STAT 4400
Projects
-
(Kaggle Like) Rang-Tech Data Analytics Competition
This was a Kaggle-like competition that used transaction data to predict active customers.
i. Understanding, cleaning and subsetting the data.
We were not given any background on the data, so we spent some time examining each variable and looking for patterns. We eventually found correlations between variables, which let us group them and reduce the number of variables.
We did not use the features directly; we engineered each feature carefully. For features with outliers, we removed the outliers. For features with a very wide range of values, we applied a square-root transform. We also found that NAs occurred in all the food-related records, so we trained separate models on the data containing food NAs and the data without them. We put a lot of work into feature transformation and engineering.
ii. Adding new features. Using only the variables from the data, we could not make progress beyond approximately 68% on the public leaderboard. One teammate found a paper describing useful features for customer classification from transaction data. According to that paper, the number of NAs and the number of zeros for each customer are quite important for the final prediction of active customers, so we added these features and the model's result rose above 69%.
iii. Ensemble methods. We first tried a single model but stalled at around 69%. We then combined 13 kinds of models, covering both parametric and non-parametric machine learning methods. On top of these prediction models, we used 2-layer, 5-fold stacking to ensemble the outputs of the first-layer models.
-
Entity Resolution Matching between Foursquare and Locu’s dataset
1. Take two datasets from Foursquare and Locu that describe the same entities, and identify which entity in one dataset is the same as an entity in the other dataset.
2. We construct features from the input datasets. For the location information, we construct the feature haversine_distances from longitude and latitude. The 'name' and 'address' fields are compared using the Jaccard similarity score, for both the whole entry and each character in the entry. The 'phone number' is compared by simple matching. Missing values in 'phone number' and 'address' are flagged with dummy-variable features.
3. In our algorithm, we combine the records in the Locu and Foursquare training datasets, featurize them, and tag each pair according to whether it is in the matched list. We then use the training data to train a random forest classifier. The number of trees is chosen by cross-validation, and the number of features uses the common "sqrt" setting. We finally chose the random forest classifier with 100 trees according to the cross-validation F1 score.
4. We set a threshold of 0.53, which comes from the cross-validation used in the matching method. When the random forest classifier matches several items in the test dataset, we keep the matched item with the highest probability.
5. Our result has precision 100%, recall 98.33% and F1 score 99.16%.
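The pairwise features in step 2 can be sketched roughly as follows. This is a minimal illustration, not the project's code: the record keys (`name`, `phone`, `lat`, `lon`), the function names and the Earth-radius constant are all assumptions made for the example, and only the standard library is used.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def jaccard(a, b):
    """Jaccard similarity of two iterables; defined as 1.0 when both are empty."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def pair_features(rec_a, rec_b):
    """Feature dict for one candidate pair of records (dicts with hypothetical keys)."""
    name_a, name_b = rec_a["name"].lower(), rec_b["name"].lower()
    phone_a, phone_b = rec_a.get("phone"), rec_b.get("phone")
    return {
        "distance_km": haversine_km(rec_a["lat"], rec_a["lon"], rec_b["lat"], rec_b["lon"]),
        "name_token_jaccard": jaccard(name_a.split(), name_b.split()),  # whole words
        "name_char_jaccard": jaccard(name_a, name_b),                   # single characters
        "phone_match": int(phone_a is not None and phone_a == phone_b),
        "phone_missing": int(phone_a is None or phone_b is None),       # dummy variable
    }
```

A classifier such as a random forest would then be trained on these feature dicts, one per candidate pair, with the matched list supplying the labels.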
-
Using AWS Cloud Platform and Spark Machine Learning to Recommend Music with Last.fm’s Audioscrobbler Data Set
1. Used the data set published by Audioscrobbler, with 24.2 million records of users’ plays of artists, to build a music recommender engine.
2. Implemented the alternating least squares recommender algorithm through MLlib on Spark to build the music recommender.
3. Preprocessed the raw data set using Python functional programming to correct misspelled or nonstandard artist IDs.
4. Used cross-validation on Spark to select the hyperparameters for the matrix factorization model.
5. Deployed the final model on the AWS platform to handle the large amount of data.
-
Using Hadoop Hive and MapReduce to Analyze NASA Server Logs
1. Worked with a data set of Apache logs gathered by NASA's server in the months of July-October 1995, around 1 GB in size, stored on HDFS.
2. Created a schema for the data set in Hive, using a regular expression to define a concrete structure covering all the required fields.
3. Plotted the number of requests made per day for every day in the month of October.
4. Wrote a MapReduce job to calculate the total bandwidth by adding up all the response bytes sent by the NASA web server.
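The regular-expression schema and the bandwidth job above can be illustrated with a small single-process sketch. It assumes the logs follow the Apache Common Log Format; the pattern, the group names and the function names are hypothetical, and plain Python stands in for Hive and Hadoop here.

```python
import re
from collections import Counter

# Common Log Format pattern; the group names are hypothetical labels for this sketch.
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<day>\d{2}/\w{3}/\d{4}):[^\]]+\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)$'
)


def map_line(line):
    """Map step: return (day, response_bytes) for a parsable line, else None."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    # A "-" in the size field means no body was sent; count it as zero bytes.
    size = 0 if match.group("bytes") == "-" else int(match.group("bytes"))
    return match.group("day"), size


def reduce_logs(lines):
    """Reduce step: total bytes sent, plus the number of requests per day."""
    total_bytes = 0
    requests_per_day = Counter()
    for record in map(map_line, lines):
        if record is None:
            continue  # skip malformed lines
        day, size = record
        total_bytes += size
        requests_per_day[day] += 1
    return total_bytes, requests_per_day
```

The per-day counter corresponds to the daily request plot, and the byte total corresponds to the bandwidth figure; in a real Hadoop job the map and reduce functions would run on separate workers.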
-
Zynga Game Payer Prediction and User Pattern Analysis
1. Processed real user data and metrics from the Zynga platform, with 1 million user records and 247 features.
2. Implemented Lasso and ridge regression with logistic regression, and the random forest method, to select the important features for predicting whether a user would be a payer.
3. Ensembled the stochastic gradient descent classifier with perceptron, log and hinge loss functions, the kNN method and the decision tree method, using the selected important features, to predict payers. The precision, recall and F1 score all reach up to 95%.
4. Used the k-means++ method with the important features to cluster user patterns on the Zynga platform. The number of clusters was determined by the elbow method. The clustering reveals the different patterns of paying users, risk-preferring users and mature users.
5. Based on the user patterns, we proposed a strategy of running campaigns across different groups of users to improve user engagement.
-
Handwriting Recognition with SVM and AdaBoost Supervised Learning in R
1. Processed JPEG data from the USPS open handwriting datasets into matrices with R.
2. Implemented the non-linear SVM method and AdaBoost with R to recognize handwritten digits.
3. Chose the kernel and margin parameters through cross-validation, improving the recognition rate to 90%.
-
Document Text Classification Using Lasso/Ridge Regression and Naïve Bayes
1. Built an efficient Naïve Bayes classifier to classify papers as written by Hamilton or Madison, with the help of an R natural language processing package.
2. Implemented ridge regression, Lasso and mutual information selection, respectively, to remove irrelevant features from the text documents and improve the efficiency of the Bayes classifier.
-
Mining the NYPD Open Datasets to Predict Dangerous Areas for Car Collisions in NYC on AWS Cloud Platform
1. Cleaned, processed and selected a set of features to find correlations between the rate of vehicle collisions and the location, time and weather of the driving route, using R scripts and the NYC open dataset API.
2. Applied normalization and PCA to the features, then ran the unsupervised K-means++ method on AWS Cloud Platform with Spark to obtain a heat map of high-danger areas in NYC for a given time and driving route.
-
Study the Relation Between Users’ Sentiment and Location Tags in Twitter with SQL and
1. Processed tweets from the Twitter Streaming API to extract tweets with location tags using Python and SQL.
2. Performed sentiment analysis by writing classifiers in Python: a naive Bayes classifier, a maximum entropy classifier and support vector machines. The NLTK package was used to parse and analyze each tweet.
3. Improved the accuracy of the self-written machine learning classifiers by using bigrams, trigrams and word dictionaries. The accuracy is around 80%.
-
Predict the SSE Index by Bouchaud-Sornette option pricing model
Developed C code implementing the Bouchaud-Sornette option pricing model to predict the SSE Index.
-
Computational study of the phase transition in the Hexagonal Ising Model
Implemented the Wolff Monte Carlo method in C to study the phase transition in the hexagonal Ising model.
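For readers unfamiliar with the Wolff method, one cluster update can be sketched as below. The project's code was written in C and is not shown here; this Python sketch is only an illustration under stated assumptions: a ferromagnetic coupling, periodic boundaries, and a triangular lattice with six neighbours per site standing in for "hexagonal" (a honeycomb lattice with three neighbours is the other possible reading). All names are hypothetical.

```python
import math
import random

# Six neighbour offsets of a triangular lattice in axial coordinates (assumed geometry).
NEIGHBOUR_OFFSETS = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, -1), (-1, 1)]


def wolff_step(spins, beta, coupling=1.0, rng=random):
    """Grow one Wolff cluster from a random seed site, flip it, and return its size."""
    size = len(spins)
    # Probability of adding an aligned neighbour to the cluster (ferromagnetic case).
    p_add = 1.0 - math.exp(-2.0 * beta * coupling)
    seed = (rng.randrange(size), rng.randrange(size))
    seed_spin = spins[seed[0]][seed[1]]
    cluster = {seed}
    stack = [seed]
    while stack:
        i, j = stack.pop()
        for di, dj in NEIGHBOUR_OFFSETS:
            site = ((i + di) % size, (j + dj) % size)  # periodic boundaries
            if (site not in cluster
                    and spins[site[0]][site[1]] == seed_spin
                    and rng.random() < p_add):
                cluster.add(site)
                stack.append(site)
    for i, j in cluster:
        spins[i][j] = -seed_spin  # flip the whole cluster at once
    return len(cluster)


def magnetization(spins):
    """Mean spin per site of a square array of +1/-1 values."""
    return sum(map(sum, spins)) / (len(spins) ** 2)
```

Repeating `wolff_step` over a range of `beta` values and recording `magnetization` is the usual way to locate the transition; the cluster move reduces the critical slowing down that single-spin updates suffer near it.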
-
Developing New Global Optimization Algorithm for Material Science with Hadoop
-
1. Proposed a new global optimization algorithm for functional material searching, based on the differential evolution algorithm, using Python.
2. Used the new algorithm on Hadoop to find grain boundary structures with lower formation energy.
3. Designed A/B tests to select the components of the algorithm that make the optimization efficient.
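As background for the differential evolution baseline mentioned above, here is a minimal sketch of the classic DE/rand/1/bin scheme. It is not the new algorithm proposed in the project, whose details are not given; the function name, parameter names and default values are assumptions chosen for illustration.

```python
import random


def differential_evolution(objective, bounds, pop_size=20, weight=0.8,
                           crossover=0.9, generations=200, seed=0):
    """Minimise `objective` over the box `bounds` with classic DE/rand/1/bin."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [objective(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals other than the target i.
            a, b, c = rng.sample([k for k in range(pop_size) if k != i], 3)
            forced = rng.randrange(dim)  # at least one coordinate comes from the mutant
            trial = []
            for d in range(dim):
                if d == forced or rng.random() < crossover:
                    value = pop[a][d] + weight * (pop[b][d] - pop[c][d])
                    lo, hi = bounds[d]
                    value = min(max(value, lo), hi)  # clip to the search box
                else:
                    value = pop[i][d]
                trial.append(value)
            trial_score = objective(trial)
            if trial_score <= scores[i]:  # greedy selection keeps the better vector
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]
```

For a materials search, `objective` would be something like a formation-energy evaluation of a candidate structure; a simple test function such as the sum of squares is enough to check that the loop converges.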
-
Developing Highly Efficient Algorithms in Scientific Computation
-
1. Developed an efficient algorithm in Python to accelerate parallel large-scale data analysis on Hadoop.
2. Sped up the calculation 10 times without significant loss of accuracy compared with the previous approach.
-
Computational Study of the Energy Transport in Nanostructures, Fudan University
-
1. Developed Python code to simulate thermal transport in graphene-based materials.
2. Used multivariate numerical methods with R to analyze the datasets obtained from the experiments.
3. Designed a new kind of 2D thermal rectifier and published it in a peer-reviewed paper (a top-cited paper in the journal).
Honors & Awards
-
Rank 91/2070 (top 5%) in Kaggle Two Sigma Financial Model Challenge
Kaggle
Member of a team in the Kaggle Two Sigma Financial Model Challenge. Implemented time-series feature engineering and linear/tree regressors to build a model that achieved a top 5% score on the private leaderboard test dataset.
-
Bronze Medal in HackerRank Coding Contest (top 15%)
HackerRank
Top 15% in the HackerRank Week of Code 23 contest, with 10000+ participants.
-
Rank No.1 among 279 teams in Rang-Technology Data Analytic Competition (Kaggle like data competition)
Rang-Technology
https://xmrrwallet.com/cmx.prang.shinyapps.io/Competition/
Ranked No.1 among 279 teams of master's students from around 50 universities, including CMU, Columbia, Cornell, USC, UIUC etc., in the Rang-Tech Data Analytics Competition, a Kaggle-like competition that used transaction data to predict active customers.
-
2015 Web of Science Highly Cited Paper Worldwide (First author)
Thomson Reuters
My first-author paper in Physical Review B on the theoretical computation of magnetic materials, "Antiferromagnetic ground state with pair-checkerboard order in FeSe", was selected as a "Highly Cited Paper" for the 2015 period.
The paper was selected as being in the top 1% of about 1,170,000 papers published in physics-related subjects. This is one of the most prestigious measures of research impact in scientific research.
-
National Scholarship
Ministry of Education of People's Republic of China
Top honor for academic achievement among graduate students in China.
-
Fellowship for Graduate Student’s Short-term International Visiting (to Lawrence Berkeley National Lab)
Fudan University
Fellowship for excellent graduate students to visit top-class institutions worldwide.
-
Distinguished award for new graduate student
Fudan University
Awarded to outstanding incoming graduate students.
Languages
-
Mandarin
Native or bilingual proficiency
-
English
Professional working proficiency
Organizations
-
American Physical Society
Member
- Present
Student member of the American Physical Society. Gave two oral talks at the 2013 and 2014 APS Annual March Meetings.
More activity by Haiyuan
-
Meet GPT-5 - our smartest, fastest and most useful model. It is a unified system that automatically switches between providing a quick response and…
Liked by Haiyuan Cao
-
Here at OpenAI we've cracked pretraining, then reasoning, and now we're experimenting with a new set of techniques that maximally leverage their…
Liked by Haiyuan Cao
-
At Zoom we’re thrilled to be among the first to integrate OpenAI’s GPT-5 into our federated AI architecture. Zoom AI Companion is powered by our…
Liked by Haiyuan Cao
-
I’m hiring STRONG Engineers for data pipelines and analysis, cluster infrastructure, deep learning infrastructure, GPUs acceleration and utilisation,…
Liked by Haiyuan Cao
-
Welcome to the era of GPT-5: OpenAI’s most advanced model yet. 🖐️ And it’s rolling out to all paid GitHub Copilot plans, starting today. In our…
Liked by Haiyuan Cao
-
Today marks a major milestone: GPT-5 is now live in Microsoft 365 Copilot and Copilot Studio. This unlocks a new level of capability for our…
Liked by Haiyuan Cao
-
https://xmrrwallet.com/cmx.plnkd.in/gu4Ecdin make sure to tune in at 10am PT!!!
Liked by Haiyuan Cao
-
Career advice: Live modestly, save money Put your family and health first Keep learning - don't live on your laurels No jerks: don't work for 'em…
Liked by Haiyuan Cao
-
The public preview for BigQuery Advanced Runtime is here! I'm excited to share our blog post that details how we're boosting throughput and reducing…
Liked by Haiyuan Cao
-
What is it like to be an AI Product Manager? Here are 3 very important things we did to launch a Data Science Agent at Google 👇 There's 2 types of…
Liked by Haiyuan Cao
-
A new chapter: I am excited to share that I have recently joined Anthropic as a member of technical staff. Anthropic is a unique company with an even…
Liked by Haiyuan Cao
-
Amazon Nova models were recognized among the top performers in a new research from #Aymara which involved testing 20 leading language models for…
Liked by Haiyuan Cao
-
We are looking to hire a research engineer on the Universal Knowledge team in Google DeepMind in Zurich. This is an exciting area of research, in…
Liked by Haiyuan Cao
-
Truly grateful and humbled to receive the award. It's gratifying to see this 13-year old work continues to be useful, and exciting to witness how…
Liked by Haiyuan Cao
-
CoreAI is hiring a Principal Technical Program Manager to help us build the future of coding with AI. https://xmrrwallet.com/cmx.plnkd.in/gjikmvmk…
Liked by Haiyuan Cao
-
As students get ready to start classes around the world, we're making our most advanced AI tools available to college students in the US, Japan…
Liked by Haiyuan Cao